Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” article, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

In this article we conclude our four part series on basic model testing.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this concluding Part 4 of our four part mini-series “How do you know if your model is going to work?” we demonstrate cross-validation techniques.

Cross validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test and re-performing model fit and model evaluation.

For example: the variation called k-fold cross-validation splits the original data into k roughly equal sized sets. To score each set we build a model on all data not in the set and then apply the model to our set. This means we build k different models (none which is our final model, which is traditionally trained on all of the data).

Notional 3-fold cross validation (solid arrows are model construction/training, dashed arrows are model evaluation).

This is statistically efficient as each model is trained on a 1-1/k fraction of the data, so for k=20 we are using 95% of the data for training.

Another variation called “leave one out” (which is essentially Jackknife resampling) is very statistically efficient as each datum is scored on a unique model built using all other data. Though this is very computationally inefficient as you construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).

Statisticians tend to prefer cross-validation techniques to test/train split as cross-validation techniques are more statistically efficient and can give sampling distribution style distributional estimates (instead of mere point estimates). However, remember cross validation techniques are measuring facts about the fitting procedure and not about the actual model in hand (so they are answering a different question than test/train split).

Though, there is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods, and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know their team and procedure tends to produce good models) and employees are essentially Bayesian (they want to know the actual model they are turning in is likely good; see here for how it the nature of the question you are trying to answer controls if you are in a Bayesian or Frequentist situation).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four part mini-series “How do you know if your model is going to work?” we develop out of sample procedures.

Let’s try working “out of sample” or with data not seen during training or construction of our model. The attraction of these procedures is they represent a principled attempt at simulating the arrival of new data in the future.

Hold-out tests

Hold out tests are a staple for data scientists. You reserve a fraction of your data (say 10%) for evaluation and don’t use that data in any way during model construction and calibration. There is the issue that the test data is often used to choose between models, but that should not cause a problem of too much data leakage in practice. However, there are procedures to systematically abuse easy access to test performance in contests such as Kaggle (see Blum, Hardt, “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”).

Notional train/test split (first 4 rows are training set, last 2 rows are the test set).

The results of a test/train split produce graphs like the following:

The training panels are the same as we have seen before. We have now added the upper test panels. These are where the models are evaluated on data not used during construction.

Notice on the test graphs random forest is the worst (for this data set, with this set of columns, and this set of random forest parameters) of the non-trivial machine learning algorithms on the test data. Since the test data is the best simulation of future data we have seen so far, we should not select random forest as our one true model in this case- but instead consider GAM logistic regression.

We have definitely learned something about how these models will perform on future data, but why should we settle for a mere point estimate. Let’s get some estimates of the likely distribution of future model behavior.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four part mini-series “How do you know if your model is going to work?” we develop in-training set measures.

The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.

Run it once procedure

A common way to asses score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.

What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).

There are two more Gedankenexperiment models that any machine data scientist should always have in mind:

The null model (on the graph as “null model”). This is the performance of the best constant model (model that returns the same answer for all datums). In this case it is a model scores each and every row as having an identical 7% chance of churning. This is an important model that you want to better than. It is also a model you are often competing against as a data science as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace).

The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5) and packages like logistic regression routinely report this statistic.

The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to out perform as it represents the “maybe one of the columns is already the answer case” (if so that would be very good for the business as they could get good predictions without modeling infrastructure).

The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model you have not outperformed what an analyst can find with a single pivot table.

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series (to finished in the next 3 Tuesdays). This part (part 1) sets up the specific problem.

I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).

Obviously Gelman and Nolan are smart and careful people. And we are discussing a well-regarded peer-reviewed article. So we don’t expect there is a major error. What we say is the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one and I get different results. This is what I would like to discuss. Continue reading I still think you can manufacture an unfair coin

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).

The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990’s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Continue reading Data Science, Machine Learning, and Statistics: what is in a name?

Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this feel a least a small urge to double check your work and presentations to make sure you have not reported correlation where R-squared, likelihood or root mean square error (RMSE) would have been more appropriate.

It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function. Continue reading Don’t use correlation to track prediction performance