In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we sessionized the data by considering all possible aggregations (window widths) of the data as features. Such naive sessionization can quickly lead to very wide data sets, with potentially more features than you have datums (and collinear features, as well). In this post, we will use the same example, but try to select our features more intelligently.
Recall that you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state when they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).
You want to build a model that predicts whether a customer will abandon the app (“exit”) within seven days. Your training set is a set of 648 customers who were present on a specific reference day (“day 0”); their activity on day 0 and the ten days previous to that (days 1 through 10), and how many days until each customer exited (Inf for customers who never exit), counting from day 0. For each day, you constructed all possible windows within those ten days, and counted the relative rates of A events and B events in each window. This gives you 132 features per row. You also have a hold-out set of 660 customers, with the same structure. You can download the wide data set used for these examples as an .rData file here. The explanation of the variable names is in the previous post in this series.
In the previous installment, we built a regularized (ridge) logistic regression model over all 132 features. This model didn’t perform too badly, but in general there is more danger of overfitting when working with very wide data sets; in addition, it is quite expensive to analyze a large number of variables with standard implementations of logistic regression. In this installment, we will look for potentially more robust and less expensive ways of analyzing this data.
When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this form is a matter of performing the equivalent of a number of SQL joins (for example, Lecture 23 (“The Shape of Data”) from our paid video course Introduction to Data Science discusses this).
One notable exception is log data. Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready for analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business appropriate goal when evaluating predictive models.
For this article we are going to assume that we have sessionized our data by picking a concrete near-term goal (predicting cancellation of account or “exit” within the next 7 days) and that we have already selected variables for analysis (a number of time-lagged windows of recent log events of various types). We will use a simple model without variable selection as our first example. We will use these results to show how you examine and evaluate these types of models. In later articles we will discuss how you sessionize, how you choose examples, variable selection, and other key topics.
It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.
On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be necessary, if not desirable. Fraud detection, anomaly detection, or other situations where positive examples are hard to get, can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.
So let’s run a small experiment to investigate this question.
I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.
Implement safe optimization methods (like conjugate gradient, line-search and majorization) for situations where the standard Iteratively-re-Weighted-Least-Squares/Newton-Raphson fails.
Provide an overall framework to quickly try implementation experiments (as opposed to novel usage experiments).
What we mean by this code being “experimental” is that it has capabilities that many standard implementations do not. In fact most of the items in the above list are not usually made available to the logistic regression user. But our project is also stand-alone and not as well integrated into existing workflows as standard production systems. Before trying our code you may want to try R or Mahout. Continue reading Added worked example to logistic regression project
A recent run of too many articles on the same topic (exhibits: A, B and C) puts me in a position where I feel the need to explain my motivation. Which itself becomes yet another article related to the original topic. The explanation I offer is: this is the way mathematicians think. To us mathematicians the tension is that there are far too many observable patterns in the world to be attributed to mere chance. So our dilemma is: for which patterns/regularities should we derive some underlying law and which ones are not worth worrying about. Or which conjectures should try to work all the way to proof or counter-example? Continue reading The Mathematician’s Dilemma
In our article How robust is logistic regression? we pointed out some basic yet deep limitations of the traditional full-step Newton-Raphson or Iteratively Reweighted Least Squares methods of solving logistic regression problems (such as in R‘s standard glm() implementation). In fact in the comments we exhibit a well posed data fitting problem that can not be fit using the traditional methods starting at the traditional (0,0) start point. And we cited an example where the traditional methods fail to compute the average from a non-zero start. The question remained: can we prove the standard methods always compute the average correctly if started at zero? It turns out they can, and the proof isn’t as messy as I anticipated. Continue reading Newton-Raphson can compute an average
Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)
Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book that if there is a known proof that the estimation step is a contraction (one very strong guarantee of convergence) you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method in for the special case of logistic regression. This can not be the case as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems. Continue reading How robust is logistic regression?