Authors: John Mount and Nina Zumel.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four-part mini-series “How do you know if your model is going to work?” we develop out-of-sample procedures.

Previously we worked on:

### Out of sample procedures

Let’s try working “out of sample”, that is, with data not seen during training or construction of our model. The attraction of these procedures is that they represent a principled attempt at simulating the arrival of new data in the future.

#### Hold-out tests

Hold-out tests are a staple for data scientists. You reserve a fraction of your data (say 10%) for evaluation and don’t use that data in any way during model construction and calibration. One issue is that the test data is often also used to choose between models, which is itself a mild form of data leakage, but in practice this should not cause much of a problem. However, there are procedures to systematically abuse easy access to test performance in contests such as Kaggle (see Blum, Hardt, “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”).

Notional train/test split (first 4 rows are training set, last 2 rows are the test set).
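A notional split like this can be sketched in a few lines. This is only a minimal illustration using the Python standard library, not the code behind the graphs in this series; the function name `holdout_split`, the seed, and the six toy rows are all hypothetical.

```python
import random

def holdout_split(rows, test_fraction=0.1, seed=2015):
    """Shuffle a copy of the rows and reserve the trailing fraction as the test set.

    Hypothetical helper for illustration; test_fraction and seed are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(rows)          # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

# Six toy rows: after shuffling, 4 go to training and 2 are held out.
rows = [{"id": i, "x": i * 0.5, "y": i % 2} for i in range(6)]
train, test = holdout_split(rows, test_fraction=1 / 3)
print(len(train), len(test))
```

The important discipline is not in the code: once `test` is set aside, nothing fit or tuned during model construction should look at it.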

The results of a test/train split produce graphs like the following:

The training panels are the same as we have seen before. We have now added the upper test panels. These are where the models are evaluated on data not used during construction.

Notice that on the test graphs random forest is the worst of the non-trivial machine learning algorithms (for this data set, with this set of columns, and this set of random forest parameters). Since the test data is the best simulation of future data we have seen so far, we should not select random forest as our one true model in this case, but should instead consider GAM logistic regression.

We have definitely learned something about how these models will perform on future data, but why should we settle for a mere point estimate? Let’s get some estimates of the likely distribution of future model behavior.

Because we perform the split only once, we have the disadvantage that we have only a point estimate of future performance, but we have the advantage that we are estimating only the performance of the actual model in hand (and not the expected performance of the modeling procedure). However, for any sort of bounded additive measure (such as deviance of Winsorized probability predictions) these point estimates should in fact be very stable.
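To make “bounded additive measure” concrete, here is a rough sketch of deviance computed on Winsorized (clipped) probability predictions. The clipping level `eps` and the function names are assumptions made for illustration; the source does not say which level was used.

```python
import math

def winsorize(p, eps=1e-6):
    """Clip a predicted probability into [eps, 1 - eps] (eps is an assumed value)."""
    return min(max(p, eps), 1.0 - eps)

def deviance(y_true, p_pred, eps=1e-6):
    """Binomial deviance: -2 times the summed log likelihood of the 0/1 outcomes."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = winsorize(p, eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total

# Even a confidently wrong prediction contributes a finite amount,
# at most -2 * log(eps) per row.
worst_row = deviance([1], [0.0])
print(worst_row <= -2.0 * math.log(1e-6))
```

The score is a sum of per-row terms (additive), and clipping caps each term (bounded), so no single row can dominate the total. That is what makes the hold-out estimate stable for a reasonably sized test set.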

We can also cheaply get access to some error bars on these estimates through standard bootstrap techniques. What we do is perform one test/train split, build only one model, but then score it on many bootstrap re-samplings of both the test and train splits (only training data used in the training bootstrap and only test data used in the test bootstrap). This shows us what variation of scoring we can expect just due to our sample size and target prevalence (this can be important for data sets with very imbalanced target classes). Because we only use one model fit, the bootstrap enhancement of the graphs is almost free (and very easy to automate). It produces the following graphs (normal 95% bootstrap confidence intervals shown):

Remember we are holding both the test/train split and the model constant. The error bars are only due to variation in the scoring sets (train and test) simulated by bootstrap sampling (sampling with replacement). What you are seeing is an answer to the question: if your boss had another test set of the same size and distribution as your test set, what scores would they be likely to see on re-running your classifier? Obviously you would like what the boss sees to be very much like what you see, so you want to see the error bars collapsing around your reported measurement.
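The bootstrap scoring described above can be sketched as follows. This is a simplified illustration, not the code used to produce the graphs: accuracy stands in for the AUC and deviance scores, the toy outcomes and predictions are invented, and we assume that a “normal” interval means the bootstrap mean plus or minus 1.96 bootstrap standard deviations.

```python
import random
import statistics

def accuracy(y_true, p_pred, threshold=0.5):
    """Stand-in score: fraction of rows where the thresholded prediction matches y."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

def bootstrap_scores(y_true, p_pred, score_fn, n_boot=200, seed=0):
    """Re-score fixed predictions on n_boot resamples drawn with replacement.

    The model is never refit: only the scoring set varies.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(score_fn([y_true[i] for i in idx],
                               [p_pred[i] for i in idx]))
    return scores

def normal_interval(scores, z=1.96):
    """Normal-approximation 95% interval around the bootstrap mean (an assumption)."""
    center = statistics.mean(scores)
    spread = statistics.stdev(scores)
    return center - z * spread, center + z * spread

# Hypothetical held-out outcomes and the fixed model's predictions on them.
y_test = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
p_test = [0.8, 0.2, 0.4, 0.6, 0.1, 0.7, 0.3, 0.4, 0.2, 0.1,
          0.9, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4, 0.2, 0.3, 0.1]

scores = bootstrap_scores(y_test, p_test, accuracy)
low, high = normal_interval(scores)
print(f"accuracy {accuracy(y_test, p_test):.2f}, interval ({low:.2f}, {high:.2f})")
```

The same loop would be run separately over the training rows to get the training error bars. Note that a score such as AUC is undefined on a resample that happens to contain only one class, which is one practical reason to consider stratified resampling.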

A portion of the bootstrap variation is coming from changes in the y-prevalence in the re-sampling. This is a portion of variance we can assign to the re-sampling plan itself (independent of the modeling procedure), so it makes sense to try to eliminate it in case it is obscuring other sources of variation. We can do this through stratified sampling. In this case we “stratify on y”, which means we want all re-samplings to have the same y-prevalence we saw in the original sets. This produces the following graphs (notice the null model’s error bars collapse, since all variation of the null model is due to prevalence changes):
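One way to sketch “stratify on y” is to resample with replacement inside each outcome class separately, so every resample keeps the original class counts. The helper names and toy data below are hypothetical, and the null model is assumed to be a constant prediction equal to the observed prevalence.

```python
import math
import random

def stratified_bootstrap_indices(y_true, rng):
    """Draw a bootstrap resample within each y class, preserving class counts."""
    by_class = {}
    for i, y in enumerate(y_true):
        by_class.setdefault(y, []).append(i)
    idx = []
    for members in by_class.values():
        idx.extend(rng.choice(members) for _ in members)
    return idx

def constant_model_deviance(y_true, p):
    """Deviance of a null model that predicts the same probability p for every row."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return -2.0 * (n_pos * math.log(p) + n_neg * math.log(1.0 - p))

# Toy imbalanced outcome: 3 positives out of 20 rows.
y = [1] * 3 + [0] * 17
p_null = sum(y) / len(y)

rng = random.Random(0)
prevalences = set()
null_scores = set()
for _ in range(100):
    idx = stratified_bootstrap_indices(y, rng)
    resampled_y = [y[i] for i in idx]
    prevalences.add(sum(resampled_y) / len(resampled_y))
    null_scores.add(constant_model_deviance(resampled_y, p_null))

print(len(prevalences), len(null_scores))
```

Because the null model’s score depends only on how many positives and negatives are in the scoring set, fixing those counts leaves it with a single possible value, which matches the collapsed error bars described above. Non-trivial models still vary, since which rows are drawn within each class changes their scores.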

We can see from the hold-out variation on AUC that the KDD Cup winner’s AUC score of 0.76 does appear significantly better than the performance of any of our models at the 0.05 level, in that 0.76 is outside the range of all our models’ 95% confidence intervals. By the same token, the performances of our three logistic regression variants and of gradient boosting are essentially equivalent, and better than random forest’s performance.

When we look at deviance, however, gradient boosting’s performance is not as good as logistic regression.

## Next

Our series concludes with:

- Part 4: Cross-validation techniques