It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.
On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be necessary, if not desirable. Fraud detection, anomaly detection, or other situations where positive examples are hard to get, can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.
So let’s run a small experiment to investigate this question.
Continue reading Does Balancing Classes Improve Classifier Performance?
This note is a link to an excerpt from my upcoming monster support vector machine article (where I work through a number of sections of [Vapnik, 1998] Vapnik, V. N. (1998), Statistical Learning Theory, Wiley). I try to run down how the original theoretical support vector machine claims are precisely linked to what is said about the common implementations. The write-up is fairly technical and very large (26 pages).
Here we are extracting an appendix: “Soft margin is not as good as hard margin.” In it we build a toy problem that is not large-margin separated and note that if the dimension of the concept space you were working in was not obvious (i.e. you were forced to rely on the margin derived portion of generalization bounds) then generalization improvement for a soft margin SVM is much slower than you would expect given experience from the hard margin theorems. The punch-line is: every time you get eight times as much training data you only halve your expected excess generalization error bound (whereas once you get below a data-set’s hard-margin bound you expect one to one reduction of the bound with respect to training data set size). What this points out is: the soft margin idea can simulate margin, but it comes at a cost. Continue reading Soft margin is not as good as hard margin
As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from
scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.
Continue reading The Geometry of Classifiers