Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”
My concrete advice is:
- Read Nina Zumel’s excellent series on scoring classifiers.
- Keep notes.
- Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
- When working on practical problems, work with your business partners to find out which of precision/recall or sensitivity/specificity best matches their business needs. If you have time, show them and explain the ROC plot, and invite them to price and pick points along the ROC curve that best fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.
That being said, it always seems like there is a bit of gamesmanship, in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choices of measure signal your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background), and some seem chosen in the hope of befuddling others.
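To make the AUC/deviance contrast in the list above concrete, here is a minimal pure-Python sketch. The labels and scores are made up for illustration, and reading “flexible” as “depends only on the ordering of the scores” and “strict” as “penalizes the actual probability values” is our interpretation, not a definition.

```python
import math

def auc(y_true, scores):
    """Chance that a random positive example outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def deviance(y_true, probs, eps=1e-12):
    """-2 times the log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total

# Hypothetical labels and predicted probabilities (illustration only).
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
# A monotone distortion of the same scores: same ordering, different probabilities.
p_squashed = [v ** 3 for v in p]

print("AUC:     ", auc(y, p), auc(y, p_squashed))
print("deviance:", deviance(y, p), deviance(y, p_squashed))
```

The two score vectors rank the examples identically, so AUC cannot tell them apart, while deviance penalizes the distorted probabilities.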
Stanley Wyatt illustration from “Mathmanship,” Nicholas Vanserg, 1958, collected in A Stress Analysis of a Strapless Evening Gown, Robert A. Baker, Prentice-Hall, 1963.
The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.
Continue reading A budget of classifier evaluation measures
At the Strata+Hadoop World “R Day” Tutorial (Tuesday, March 29, 2016, San Jose, California) we spent some time on classifier measures derived from the so-called “confusion matrix.”
We repeated our usual admonition to not use “accuracy itself” as a project quality goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn’t what they really want).
One reason not to use accuracy: an example where a classifier that does nothing is “more accurate” than one that actually has some utility. (Figure credit Nina Zumel, slides here)
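A small Python sketch of this kind of situation follows. The confusion-matrix counts are hypothetical (they are not the numbers from the figure): 1,000 cases with 50 positives, scored by a classifier that always says “negative” and by one that finds most of the positives at the cost of some false alarms.

```python
def summarize(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Precision is undefined when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) > 0 else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) > 0 else float("nan"),
    }

# Hypothetical counts, for illustration only.
do_nothing = summarize(tp=0, fp=0, fn=50, tn=950)   # always predicts "negative"
useful = summarize(tp=40, fp=80, fn=10, tn=870)     # finds 40 of the 50 positives

print("do nothing:", do_nothing)
print("useful:    ", useful)
```

With these counts the do-nothing classifier wins on accuracy while finding none of the positives, which is the trap the admonition above is warning about.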
And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).
Please read on to see what stood out. Continue reading A bit on the F1 score floor
One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).
Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.
Python, being a popular general-purpose language, does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the pain imposed by the system environment. With Anaconda, trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant. Continue reading Thumbs up for Anaconda
As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado et al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce prediction patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.
Continue reading The Geometry of Classifiers
In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or predictor variables) and “y” (denoting an outcome variable, or dependent variable) then:
- Estimate an approximate functional relation y ~ f(x).
- Apply that relation to new instances where x is known and y is not yet known.
An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both x and y are known. Newer shoppers give us examples where only x is currently known and it would presumably be of some value to estimate y or estimate the probability of different y values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are not forecasting or “predicting the future” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.
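The two steps above can be sketched in a few lines of Python. This is a deliberately crude “generalized memorization” model, assuming made-up shopper data in which x is a pair of hypothetical flags (visited last week, has an account) and y is 1 if the shopper purchased. The names `fit_rates` and `predict` are ours, for illustration only.

```python
from collections import defaultdict

def fit_rates(rows):
    """Step 1: estimate f(x) as the observed purchase rate for each x in the training data."""
    counts = defaultdict(lambda: [0, 0])  # x -> [purchases, shoppers]
    for x, y in rows:
        counts[x][0] += y
        counts[x][1] += 1
    overall = sum(c[0] for c in counts.values()) / sum(c[1] for c in counts.values())
    rates = {x: c[0] / c[1] for x, c in counts.items()}
    return rates, overall

def predict(model, x):
    """Step 2: score a new instance; fall back to the overall rate for an unseen x."""
    rates, overall = model
    return rates.get(x, overall)

# Hypothetical training data: x = (visited_last_week, has_account), y = purchased.
train = [
    ((True, True), 1), ((True, True), 1), ((True, True), 0),
    ((True, False), 0), ((True, False), 1),
    ((False, True), 0), ((False, True), 0),
    ((False, False), 0),
]
model = fit_rates(train)

# New shoppers: x is known, y is not yet known.
for x in [(True, True), (False, True)]:
    print(x, predict(model, x))
```

Nothing here forecasts how many shoppers of each kind will arrive. The sketch only assigns a score to whoever does arrive, and those scores are trustworthy only if the new shoppers resemble the ones in `train`.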
The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.
One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables, or the relations between variables, change over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This is one of the hard lessons that statistics tries so hard to quantify and teach.
We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.
Image: Wikipedia: Elgin watchmaker
Continue reading Can we try to make an adjustment?
Two of the most common methods of statistical inference are frequentism and Bayesianism (see Bayesian and Frequentist Approaches: Ask the Right Question for some good discussion). In both cases we are attempting to perform reliable inference of unknown quantities from related observations. And in both cases inference is made possible by introducing and reasoning over well-behaved distributions of values.
As a first example, consider the problem of trying to estimate the speed of light from a series of experiments.
In this situation the frequentist method quietly does some heavy philosophical lifting before you even start work. Under the frequentist interpretation since the speed of light is thought to have a single value it does not make sense to model it as having a prior distribution of possible values over any non-trivial range. To get the ability to infer, frequentist philosophy considers the act of measurement repeatable and introduces very subtle concepts such as confidence intervals. The frequentist statement that a series of experiments places the speed of light in vacuum at 300,000,000 meters a second plus or minus 1,000,000 meters a second with 95% confidence does not mean there is a 95% chance that the actual speed of light is in the interval 299,000,000 to 301,000,000 (the common incorrect recollection of what a confidence interval is). It means if the procedure that generated the interval were repeated on new data, then 95% of the time the speed of light would be in the interval produced: which may not be the interval we are looking at right now. Frequentist procedures are typically easy on the practitioner (all of the heavy philosophic work has already been done) and result in simple procedures and calculations (through years of optimization of practice).
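The “repeat the procedure” reading of a confidence interval can be checked by simulation. The sketch below makes several simplifying assumptions that are ours, not part of the example above: measurements are normally distributed around the true value, the noise standard deviation is known, and each experiment takes ten measurements.

```python
import math
import random

TRUE_SPEED = 299_792_458.0  # meters per second (the defined value)
NOISE_SD = 1_000_000.0      # assumed known measurement noise
N_MEASUREMENTS = 10         # assumed measurements per experiment
Z_95 = 1.96                 # two-sided 95% normal quantile

def one_interval(rng):
    """Run one simulated experiment and return its 95% confidence interval."""
    xs = [rng.gauss(TRUE_SPEED, NOISE_SD) for _ in range(N_MEASUREMENTS)]
    mean = sum(xs) / len(xs)
    half_width = Z_95 * NOISE_SD / math.sqrt(len(xs))
    return mean - half_width, mean + half_width

def coverage(n_experiments=5000, seed=2016):
    """Fraction of repeated experiments whose interval contains the true value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        lo, hi = one_interval(rng)
        if lo <= TRUE_SPEED <= hi:
            hits += 1
    return hits / n_experiments

print("fraction of intervals covering the true speed:", coverage())
```

The printed fraction should land close to 0.95. Any single interval either contains the true speed or it does not. The 95% describes the procedure, not the one interval in hand.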
Bayesian procedures on the other hand are philosophically much simpler, but require much more from the user (production and acceptance of priors). The Bayesian philosophy is: given a generative model, a complete prior distribution (detailed probabilities of the unknown value posited before looking at the current experimental data) of the quantity to be estimated, and observations: then inference is just a matter of calculating the complete posterior distribution of the quantity to be estimated (by correct application of Bayes’ Law). Supply a bad model or bad prior beliefs on possible values of the speed of light and you get bad results (and it is your fault, not the methodology’s fault). The Bayesian method seems to ask more, but you have to remember it is trying to supply more (complete posterior distribution, versus subjunctive confidence intervals).
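For comparison, here is a minimal Python sketch of the Bayesian recipe for the same speed-of-light setting. It assumes a normal prior, normally distributed measurement noise with known standard deviation, and made-up measurements; under those assumptions Bayes’ Law yields a normal posterior in closed form. The function name and all numbers are illustrative.

```python
import math

def normal_posterior(prior_mean, prior_sd, measurements, noise_sd):
    """Posterior mean and sd for a normal prior and normal noise of known sd."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(measurements) / noise_sd ** 2
    post_precision = prior_precision + data_precision
    sample_mean = sum(measurements) / len(measurements)
    # The posterior mean is a precision-weighted average of prior mean and sample mean.
    post_mean = (prior_precision * prior_mean + data_precision * sample_mean) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)

# Hypothetical prior belief and measurements, in meters per second.
prior_mean, prior_sd = 300.0e6, 5.0e6
measurements = [299.1e6, 300.4e6, 299.9e6, 300.2e6]
noise_sd = 1.0e6

post_mean, post_sd = normal_posterior(prior_mean, prior_sd, measurements, noise_sd)
print("posterior mean:", post_mean)
print("posterior sd:  ", post_sd)
print("95% credible interval:", (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd))
```

Unlike the confidence interval, the credible interval is a direct probability statement about the speed of light, but only relative to the prior and noise model supplied. Change those assumptions and the answer changes with them.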
In this article we are going to work a simple (but important) problem where (for once) the Bayesian calculations are in fact easier than the frequentist ones. Continue reading Frequentist inference only seems easy
I was watching my cousins play Unspeakable Words over Christmas break and got interested in the end game. The game starts out as a “spell a word from cards, then bet some points” game, but in the end (when you are down to one marker) it becomes a pure betting game. In this article we analyze an idealized form of the pure betting end game. Continue reading Unspeakable bets: take small steps