It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.
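As a rough illustration of what "artificially balancing the classes through resampling" can mean, here is a minimal sketch that upsamples the rare class with replacement until it reaches a chosen prevalence. The function name `upsample_rare_class` and the toy labels are hypothetical and are not taken from the article's experiment.

```python
import random

def upsample_rare_class(labels, target_prevalence, seed=0):
    """Return row indices in which the positive class (label 1) has been
    resampled with replacement to reach roughly target_prevalence.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve n_pos / (n_pos + len(neg)) = target_prevalence for n_pos.
    n_pos = round(target_prevalence * len(neg) / (1.0 - target_prevalence))
    sampled_pos = [rng.choice(pos) for _ in range(n_pos)]
    return neg + sampled_pos

# Toy data: 10% native prevalence of the rare class.
labels = [1] * 10 + [0] * 90
idx = upsample_rare_class(labels, 0.5)
balanced_prevalence = sum(labels[i] for i in idx) / len(idx)
```

A model trained on the resampled rows sees a 50% prevalence, while the population it will be applied to still has the native 10%, which is exactly the mismatch the skepticism above is about.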
On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be necessary, if not desirable. Fraud detection, anomaly detection, or other situations where positive examples are hard to get, can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.
So let’s run a small experiment to investigate this question.
Continue reading Does Balancing Classes Improve Classifier Performance?
I often need to build a predictive model that estimates rates. The example of our age is ad click-through rates: how often a viewer clicks on an ad, estimated as a function of the features of the ad and the viewer. Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.
In this note we will work a toy problem and suggest some relevant R analysis libraries. Continue reading Generalized linear models for predicting rates
I know I have already written a lot about technicalities in logistic regression (see for example: How robust is logistic regression? and Newton-Raphson can compute an average). But I just ran into a simple case where R’s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes, the data is a bit pathological, but one would hope for a diagnostic or warning message from the fitter. Continue reading A pathological glm() problem that doesn’t issue a warning
We have added a worked example to the README of our experimental logistic regression code.
The Logistic codebase is designed to support experimentation on variations of logistic regression including:
What we mean by this code being “experimental” is that it has capabilities that many standard implementations do not. In fact most of the items in the above list are not usually made available to the logistic regression user. But our project is also stand-alone and not as well integrated into existing workflows as standard production systems. Before trying our code you may want to try R or Mahout. Continue reading Added worked example to logistic regression project
We have been writing for a while about the convergence of Newton steps applied to a logistic regression (see: What does a generalized linear model do?, How robust is logistic regression? and Newton-Raphson can compute an average). This is all based on our principle of working examples for understanding. This eventually progressed to some writing on the nature of problem solving (a nice complement to our earlier writing on calculation). In the course of research we were directed to a very powerful technique called the MM algorithm (see: “The MM Algorithm”, Kenneth Lange, 2007; “A Tutorial on MM Algorithms”, David R. Hunter, Kenneth Lange, Amer. Statistician 58:30–37, 2004; and “Monotonicity of Quadratic-Approximation Algorithms”, Dankmar Bohning, Bruce G. Lindsay, Ann. Inst. Statist. Math, Vol. 40, No. 4, pp 641-664, 1988). The MM algorithm introduces an essential idea: majorized functions (not to be confused with the majorized order on R^d). Majorization is an interesting way to modify Newton methods to be reliable contractions (and therefore converge in a manner similar to EM algorithms).
Here we will work an example of the MM method. We will not work it in its most general form, but in a form that quickly reveals much of the beauty of the method. We also introduce a “collared Newton step” which guarantees convergence without resorting to line-search (essentially resolving the issues in solving a logistic regression by Newton style methods). Continue reading Rudie can’t fail (if majorized)
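To give a flavor of majorization, here is a minimal sketch for the simplest possible case: an intercept-only logistic regression. It assumes the standard quadratic upper bound in which the curvature p(1-p) of the negative mean log-likelihood is replaced by its maximum value 1/4, so each update minimizes a parabola that sits above the objective. This is an illustration of the general idea only; it is not the article's worked example and not its “collared Newton step”, and the function name `mm_intercept` is hypothetical.

```python
import math

def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))

def mm_intercept(y, start, tol=1e-12, max_iter=1000):
    """Fit an intercept-only logistic model by minimizing a quadratic majorizer.

    The negative mean log-likelihood f(b) has f'(b) = p - ybar and
    f''(b) = p * (1 - p) <= 1/4, so
        g(b | b0) = f(b0) + f'(b0) * (b - b0) + (1/8) * (b - b0)**2
    lies on or above f everywhere and touches it at b0.
    Minimizing g gives the update b = b0 + 4 * (ybar - p).
    """
    ybar = sum(y) / len(y)
    b = start
    for _ in range(max_iter):
        p = sigmoid(b)
        step = 4.0 * (ybar - p)  # no step can exceed 4 in magnitude
        b += step
        if abs(step) < tol:
            break
    return b

# Start well away from zero on purpose.
b_balanced = mm_intercept([1, 0], start=3.0)
b_rare = mm_intercept([1, 0, 0, 0, 0], start=3.0)
```

Because each step minimizes a function that lies above the objective and agrees with it at the current point, the objective can never increase, which is the source of the reliability described above. The price is that the steps are more conservative than full Newton steps.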
A recent run of too many articles on the same topic (exhibits: A, B and C) puts me in a position where I feel the need to explain my motivation. Which itself becomes yet another article related to the original topic. The explanation I offer is: this is the way mathematicians think. To us mathematicians the tension is that there are far too many observable patterns in the world to be attributed to mere chance. So our dilemma is: for which patterns/regularities should we derive some underlying law, and which ones are not worth worrying about? Or: which conjectures should we try to work all the way to proof or counter-example? Continue reading The Mathematician’s Dilemma
In our article How robust is logistic regression? we pointed out some basic yet deep limitations of the traditional full-step Newton-Raphson or Iteratively Reweighted Least Squares methods of solving logistic regression problems (such as in R’s standard glm() implementation). In fact, in the comments we exhibit a well-posed data fitting problem that cannot be fit using the traditional methods starting at the traditional (0,0) start point. And we cited an example where the traditional methods fail to compute the average from a non-zero start. The question remained: can we prove the standard methods always compute the average correctly if started at zero? It turns out we can, and the proof isn’t as messy as I anticipated. Continue reading Newton-Raphson can compute an average
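The phrase “compute an average” can be made concrete with an intercept-only logistic regression, whose optimal fitted probability is simply the mean of the outcomes. The sketch below is my own toy illustration, not the example cited in the article: it runs full Newton-Raphson steps on such a model, once from zero and once from a non-zero start. The function name `newton_intercept_path` and the data are hypothetical.

```python
import math

def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))

def newton_intercept_path(y, start, steps):
    """Run full Newton-Raphson steps for an intercept-only logistic model
    and return the sequence of intercept estimates.

    With p = sigmoid(b), the gradient of the mean log-likelihood is
    ybar - p and the curvature is p * (1 - p), so the full step is
        b <- b + (ybar - p) / (p * (1 - p)).
    """
    ybar = sum(y) / len(y)
    b = start
    path = [b]
    for _ in range(steps):
        p = sigmoid(b)
        b = b + (ybar - p) / (p * (1.0 - p))
        path.append(b)
    return path

# From the traditional zero start the fitted probability converges
# to the sample mean (0.2 here).
path_zero = newton_intercept_path([1, 0, 0, 0, 0], start=0.0, steps=8)
fitted_mean = sigmoid(path_zero[-1])

# From a non-zero start on balanced data each full step overshoots.
# Only two steps are taken: after that the curvature underflows to zero
# in floating point and the next step would divide by zero.
path_far = newton_intercept_path([1, 0], start=3.0, steps=2)
```

In the second run the optimum is at zero, yet the iterates move farther from it at every step, which is the kind of failure that motivates asking what can be proven about the zero start.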
Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)
Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book in which, if there were a known proof that the estimation step is a contraction (one very strong guarantee of convergence), you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method for the special case of logistic regression. This cannot be the case, as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems. Continue reading How robust is logistic regression?
What does a generalized linear model do? R supplies a modeling function called glm() that fits generalized linear models (abbreviated as GLMs). A natural question is what does it do and what problem is it solving for you? We work some examples and place generalized linear models in context with other techniques. Continue reading What does a generalized linear model do?
Dr. Nina Zumel recently published an excellent tutorial on a modeling technique she called impact coding. It is a pragmatic machine learning technique that has helped with more than one client project. Impact coding is a bridge from Naive Bayes (where each variable’s impact is added without regard to the known effects of any other variable) to Logistic Regression (where dependencies between variables and levels are completely accounted for). A natural question is: can we pick up more of the positive features of each model? Continue reading A bit more on impact coding