In some of my recent public talks (for example: here and here) I have mentioned a desire for “a deeper theory of fitting and testing.” I thought I would expand on what I meant by this.

In this note I am going to cover a lot of different topics to try and suggest some perspective. I won’t have my usual luxury of fully defining my terms or working concrete examples. Hopefully a number of these ideas (which are related, but don’t seem to easily synthesize together) will be subjects of their own later articles.

Introduction

The focus of this article is: the true goal of predictive analytics is always: to build a model that works well in production. Training and testing procedures are designed to simulate this unknown future model performance, but can be expensive and can also fail.

What we want is a good measure of future model performance, and to apply that measure in picking a model without running deep into Goodhart’s law (“When a measure becomes a target, it ceases to be a good measure.”).

Most common training and testing procedures are destructive in the sense they use up data (data used for one step may not be safely used for another step in an unbiased fashion, example: excess generalization error). In this note I thought I would expand on the ideas for extending statistical efficiency or getting more out of your training while avoiding overfitting.

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” article, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four part mini-series “How do you know if your model is going to work?” we develop in-training set measures.

The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.

Run it once procedure

A common way to asses score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.

What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).

There are two more Gedankenexperiment models that any machine data scientist should always have in mind:

The null model (on the graph as “null model”). This is the performance of the best constant model (model that returns the same answer for all datums). In this case it is a model scores each and every row as having an identical 7% chance of churning. This is an important model that you want to better than. It is also a model you are often competing against as a data science as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace).

The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5) and packages like logistic regression routinely report this statistic.

The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to out perform as it represents the “maybe one of the columns is already the answer case” (if so that would be very good for the business as they could get good predictions without modeling infrastructure).

The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model you have not outperformed what an analyst can find with a single pivot table.

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series (to finished in the next 3 Tuesdays). This part (part 1) sets up the specific problem.

I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).

Modern text encoding is a convoluted mess where costs can easily exceed benefits. I admit we are in a world that has moved beyond ASCII (which at best served only English, and even then without full punctuation). But modern text encoding standards (utf-x, Unicode) have metastasized to the point you spend more time working around them than benefiting from them.

What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).

The recent The Atlantic article “The Man Who Broke Atlantic City” tells the story of Don Johnson who won millions of dollars in private room custom rules high stakes blackjack. The method Mr. Johnson reportedly used is, surprisingly, not card counting (as made famous by professor Edward O. Thorp in Beat the Dealer). It is instead likely an amazingly simple process I will call a martingale money pump. Naturally the Atlantic wouldn’t want to go into the math, but we can do that here.

The June 4, 2015 Wikipedia entry on A/B Testing claims Google data scientists were the origin of the term “A/B test”:

Google data scientists ran their first A/B test at the turn of the millennium to determine the optimum number of results to display on a search engine results page.[citation needed] While this was the origin of the term, very similar methods had been used by marketers long before “A/B test” was coined. Common terms used before the internet era were “split test” and “bucket test”.

It is very unlikely Google data scientists were the first to use the informal shorthand “A/B test.” Test groups have been routinely called “A” and “B” at least as early as the 1940s. So it would be natural for any working group to informally call their test comparing abstract groups “A” and “B” an “A/B test” from time to time. Statisticians are famous for using the names of variables (merely chosen by convention) as formal names of procedures (p-values, t-tests, and many more).

Even if other terms were dominant in earlier writing, it is likely A/B test was used in speech. And writings of our time are sufficiently informal (or like speech) that they should be compared to earlier speech, not just earlier formal writing.

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?