Authors: John Mount (more articles) and Nina Zumel (more articles).

Our four part article series collected into one piece.

- Part 1: The problem
- Part 2: In-training set measures
- Part 3: Out of sample procedures
- Part 4: Cross-validation techniques

“Essentially, all models are wrong, but some are useful.”

George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Geocentric illustration Bartolomeu Velho, 1568 (BibliothÃ¨que Nationale, Paris)

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” article, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a *negotiable* business goal (one of the many ways data science differs from pure machine learning).

## Our example problem

Let’s use a single example to make things concrete. We have used the 2009 KDD Cup dataset to demonstrate estimating variable significance, so we will use it again here to demonstrate model evaluation. The contest task was supervised machine learning. The goal was to build scores that predict things like churn (account cancellation) from a data set consisting of about 50,000 rows (representing credit card accounts) and 234 variables (both numeric and categorical facts about the accounts). An IBM group won the contest with an AUC (“area under the curve”) of 0.76 in predicting churn on held-out data. Using R we can get an AUC of 0.71 on our own hold-out set (meaning we used less data for training) using automated variable preparation, standard gradient boosting, and essentially no parameter tuning (which itself can be automated as it is in packages such as caret).

Obviously a 0.71 AUC would not win the contest. But remember the difference between 0.76 and 0.71 may or may not be statistically significant (something we will touch on in this article) and may or may not *make a business difference*. Typically a business combines a score with a single threshold to convert it into an operating classifier or decision procedure. The threshold is chosen as a business driven compromise between domain driven precision and recall (or sensitivity and specificity) goals. Businesses do not directly experience AUC which summarizes facts about the classifiers the score would induce at many different threshold levels (including ones that are irrelevant to the business). A scoring system whose ROC curve contains another scoring system’s ROC curve is definitely the better classifier, but small increases in AUC don’t always ensure such containment. AUC is an acceptable proxy score when choosing among classifiers (however, it does not have a non-strained reasonable probabilistic interpretation, despite such claims), and it should not be your final business metric.

For this article, however- we will stick with the score evaluation measures: deviance and AUC. But keep in mind that in an actual data science project you are much more likely to quickly get a reliable 0.05 increase in AUC by working with your business partners to transform, clean, or find more variables- than by tuning your post-data- collection machine learning procedure. So we feel score tuning is already over-emphasized and don’t want to dwell too much more on it here.

## Choice of utility metric

One way a data science project differs from a machine learning contest is that the choice of score or utility metric is an important choice made by the data scientist, and not a choice supplied by a competition framework. The metric or score must map to utility for the business client. The business goal in supervised machine learning project is usually either classification (picking a group of accounts at higher risk of churn) or sorting (ordering accounts by predicted risk).

Choice of experimental design, data preparation, and choice of metric can be a big driver of project success or failure. For example in hazard models (such as predicting churn) the items that are easiest to score are items that have essentially already happened. You may have call-center code that encodes “called to cancel” as one of your predictive signals. Technically it is a great signal, the person certainly hasn’t cancelled prior to the end of the call. But it is useless to the business. The data-scientist has to help re-design the problem definition and data curation to focus in on customers that are going to cancel soon, but to indicate some reasonable time before they cancel (see here for more on the issue). The business goal is to change the problem to a more useful business problem that may induce a harder machine learning problem. The business goal is not to do as well as possible on a single unchanging machine learning problem.

If the business needs a decision procedure: then part of the project is picking a threshold that converts the scoring system into a classifier. To do this you need some sort of business sensitive pricing of true-positives, false-positives, true-negatives, and false-negatives or working out appropriate trade-offs between precision and recall. While tuning scoring procedures we suggest using one of deviance or AUC as a proxy measure until you are ready to try converting your score into a classifier. Deviance has the advantage that it has nice interpretations in terms of log-likelihood and entropy, and AUC has the advantage that is invariant under any one-to-one monotone transformation of your score.

A classifier is best evaluated with precision and recall or sensitivity and specificity. Order evaluation is best done with an AUC-like score such as the Gini coefficient or even a gain curve.

### A note on accuracy

In most applications the cost of false-positives (accounts the classifier thinks will churn, but do not) is usually very different than the cost of false-negatives (accounts the classifier things will not churn, but do). This means a measure that prices these two errors identically is almost never the right final utility score. Accuracy is exactly one such measure. You must understand most business partners ask for “accurate” classifiers only because it may be the only term they are familiar with. Take the time to discuss appropriate utility measures with your business partners.

Here is an example to really drive the point home. The KDD2009 data set had a churn rate of around 7%. Consider the following two classifiers. Classifier A that predicts “churn” on 21% of the data but captures all of the churners in its positive predictions. Classifier B that predicts “no churn” on all data. Classifier A is wrong 14% of the time and thus has an accuracy of 86%. Classifier B is wrong 7% of the time and thus has an accuracy of 93% and is the more accurate classifier. Classifier A is a “home run” in a business sense (it has recall 1.0 and precision 33%!), Classifier B is absolutely useless. See here, for more discussion on this issue.

## The issues

In all cases we are going to pick a utility score or statistic. We want to estimate the utility of our model on future data (as our model will hopefully be used on new data in the future). The performance of our model in the future is usually an unknowable quantity. However, we can try to estimate this unknowable quantity by an appeal to the idea of exchangeability. If we had a set of test data that was exchangeable with the unknown future data, then an estimate of our utility on this test set should be a good estimate of future behavior. A similar way to get at this is if future data were independent and identically distributed with the test data then we again could expect to make an estimate.

The issues we run into in designing an estimate of model utility include at least the following:

- Are we attempting to evaluate an actual score or the procedure for building scores? These are two related, but different questions.
- Are we deriving a single point estimate or a distribution of estimates? Are we estimating sizes of effects, significances, or both?
- Are we using data that was involved in the training procedure (which breaks exchangeability!) or fresh data?

Your answers to these questions determine what procedures you should try.

## Scoring Procedures

We are going to work through a good number of the available testing and validation procedures. There is no “one true” procedure, so you need to get used to having more than one method to choose from. We suggest you go over each of these graphs with a ruler and see what conclusions you can draw about the relative utility of each of the models we are demonstrating.

### Naive methods

#### No measure

The no-measure procedure is the following: pick a good machine learning procedure, use it to fit the data, and turn that in as your solution. In principle nobody is ever so ill mannered to do this.

However, if you only try one modeling technique and don’t base any decision on your measure or score- how does that differ from having made no measurement? Suppose we (as in this R example) only made one try of Random Forest on the KDD2009 problem? We could present our boss with a ROC graph like the following:

Because we only tried one model the only thing our boss can look for is the AUC above 0.5 (uselessness) or not. They have no idea if 0.67 is large or small. Since or AUC measure drove no decision, it essentially was no measurement.

So at the very least we need to set a sense of scale. We should at least try more than one model.

#### Model supplied diagnostics

If we are going to try more than one model, we run into the problem that each model reports different diagnostics. Random forest tends to report error rates, logistic regression reports deviance, GBM reports variable importance. At this point you find you need to standardize on your own quality of score measure and run your own (or library code) on all models.

### In-training set measures

The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.

#### Run it once procedure

A common way to asses score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below (all R code here).

What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).

There are two more Gedankenexperiment models that any machine data scientist should always have in mind:

- The null model (on the graph as “null model”). This is the performance of the best constant model (model that returns the same answer for all datums). In this case it is a model scores each and every row as having an identical 7% chance of churning. This is an important model that you want to better than. It is also a model you are often competing against as a data science as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace).
The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5) and packages like logistic regression routinely report this statistic.

- The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to out perform as it represents the “maybe one of the columns is already the answer case” (if so that would be very
*good*for the business as they could get good predictions without modeling infrastructure).The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model you have not outperformed what an analyst can find with a single pivot table.

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

- The random forest performance is implausibly good, so we should expect it is an effect of overfitting (possibly independent of true model quality).
- Of course the best of five models is going to
*appear*better than any given modeling technique chosen before looking at the training data due to the multiple comparison effect, regardless of the value of picking among the modeling methods in question.

“Score once on training data” has started to show us things. But we can improve our scoring procedures, and it will turn out random forest is not in fact the best choice in this particular case (though random forest is often the best choice in general).

#### Permutation tests

One question we could try to answer using in-sample data (or data seen during training) is: are any of the models significantly different than what you would get fitting on noise? Are our models above what one might see by chance? A permutation test uses training data (so it well suited for situations where you don’t feel you have enough data for test/train split or cross validation) and are also a quick way to measure if you in fact “have a lot of data.”

We’ve already addressed permutation tests in an earlier article, so we will just move on to the appropriate graphs. Below we have re-plotted our in-sample training performance and added a new pain called “xptrain” (“experiment permutation training”). In the xptrain panel we repeated ten times permuting the y or outcome column of our data (so in expectation it has no true relation to the inputs or x’s) and run the modeling procedure. We then scored the quality of the fit models. The error bars drawn are the error bars are the 95% confidence intervals of the normal distribution that has the same mean and variance as we saw on the ten fits. The fit qualities are not normally distributed (for instance AUC is always in the interval zero to one), the error bars are merely a convenient way to get a view of the scale of dispersion of the permutation test.

We would like to see that the models we fit on real data (the top panel) are significantly better than the modes we fit on permuted data. That would be each dot to the right of the corresponding permutation error bar in the AUC graph and to the left of the corresponding error bar in the normalized deviance graph. This would be a clue that the types of fits we saw were in fact unlikely to be entirely due to uncorrelated noise, making the supposition we have actually fit something a bit more plausible.

Notice the random forest model achieves AUCs near 1 for many of the noise permutations (look also at the related deviances)! That doesn’t mean the actual random forest model fit does not have useful score (it in fact does) it just means you don’t know from only looking at the training data whether it represents a useful score.

### Out of sample procedures

Let’s try working “out of sample” or with data not seen during training or construction of our model. The attraction of these procedures is they represent a principled attempt at simulating the arrival of new data in the future.

#### Hold-out tests

Hold out tests are a staple for data scientists. You reserve a fraction of your data (say 10%) for evaluation and don’t use that data in any way during model construction and calibration. There is the issue that the test data is often used to choose between models, but that should not cause a problem of too much data leakage in practice. However, there are procedures to systematically abuse easy access to test performance in contests such as Kaggle (see Blum, Hardt, “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”).

Notional train/test split (first 4 rows are training set, last 2 rows are the test set).

The results of a test/train split produce graphs like the following:

The training panels are the same as we have seen before. We have now added the upper test panels. These are where the models are evaluated on data not used during construction. Because we perform the split only once we have the disadvantage that we have only a point estimate of future performance, but we have the advantage that we estimating only the performance of the actual model in hand (and not the expected performance of the modeling procedure). However for any sort of bounded additive measure (such as deviance of Winsorized probability predictions) these point estimates should in fact be very stable.

Notice on the test graphs random forest is the worst (for this data set, with this set of columns, and this set of random forest parameters) of the non-trivial machine learning algorithms on the test data. Since the test data is the best simulation of future data we have seen so far, we should not select random forest as our one true model in this case- but instead consider GAM logistic regression.

We have definitely learned something about how these models will perform on future data, but why should we settle for a mere point estimate. Let’s get some estimates of the likely distribution of future model behavior.

We can also cheaply get access to some error bars on these estimates through standard bootstrap techniques. What we do is perform one test train split, build only one model, but then score it on many bootstrap re-samplings of both the test and train splits (only training data used in the training bootstrap and only test data used in the test bootstrap). This shows us what variation of scoring we can expect just due to our sample size and target prevalence (this can be important for data sets with very imbalanced target classes). Because we only use one model fit the bootstrap enhancement of the graphs is almost free (and very easy to automate). It produces the following graphs (normal 95% bootstrap confidence intervals shown):

Remember we are holding both the test/train split and the model constant. The error bars are only due to variation in the scoring sets (train and test) simulated by bootstrap sampling (sampling with replacement). What you are seeing is if your boss had another test set the same size and distribution as your test set: what are the likely scores they may see re-running your classifier. Obviously you would like what the boss sees to be very much like what you see, so you want to see the error bars collapsing around your reported measurement.

A portion of the bootstrap variation is coming from changes in the y-prevalence in the re-sampling. This is a portion of variance we can assign to the re-sampling plan itself (independent of the modeling procedure) so it makes sense to try and eliminate it in case it is obscuring other sources of variation. We can do this through stratified sampling. In this case we “stratify on y” which means we want all re-samplings to have the same y-prevalence we saw in the original sets. This produces the following graphs (notice the null-models error bars collapse, all variation of the null model is due to prevalence changes):

We can see from the hold-out variation on AUC that the KDD Cup winner’s AUC score of 0.76 does appear significantly better than the performance of any of our models to the 0.05 level, in that 0.76 is outside the range of all our models’ 95% confidence intervals. By the same token, the performances of our three logistic regression variants and of gradient boosting are essentially equivalent, and better than random forest’s performance.

When we look at deviance, however, gradient boosting’s performance is not as good as logistic regression.

### Cross-validation techniques

Cross validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test and re-performing model fit and model evaluation.

For example: the variation called k-fold cross-validation splits the original data into k roughly equal sized sets. To score each set we build a model on all data not in the set and then apply the model to our set. This means we build k different models (none which is our final model, which is traditionally trained on all of the data).

Notional 3-fold cross validation (solid arrows are model construction/training, dashed arrows are model evaluation).

This is statistically efficient as each model is trained on a 1-1/k fraction of the data, so for k=20 we are using 95% of the data for training.

Another variation called “leave one out” (which is essentially Jackknife resampling) is very statistically efficient as each datum is scored on a unique model built using all other data. Though this is very computationally inefficient as you construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).

Statisticians tend to prefer cross-validation techniques to test/train split as cross-validation techniques are more statistically efficient and can give sampling distribution style distributional estimates (instead of mere point estimates). However, remember cross validation techniques are measuring facts *about the fitting procedure* and *not about the actual model in hand* (so they are answering a different question than test/train split).

Though, there is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods, and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know their team and procedure tends to produce good models) and employees are essentially Bayesian (they want to know if the actual model they are turning in is likely good; see here for how it the nature of the question you are trying to answer controls if you are in a Bayesian or Frequentist situation).

Remember cross validation is only measuring the effects of steps that are re-done during the cross validation. So any by-hand variable transformations or pruning are not measured. This is one reason you want to automate such procedures, so you can include them in the cross validated procedures and measure their effects!

For the cross validation below we used a slightly non-standard construction (code here). We then repeated five times splitting the data into calibration, train, and test; and repeated all variable encodings, pruning, and scoring steps. This differs from many of the named cross-validation routines in that we are not building a single model prediction per row, but instead going directly for the distribution of model fit performance. Due to the test/train split we still have the desirable property that no data row is ever scored using a model it was involved in the construction of.

This gives us the following graphs:

In this case the error bars are just the minimum and maximum of the observed scores (no parametric confidence intervals). Again, the data suggests that one of the variants of logistic regression may be your best choice. Of particular interest is random forest, which shows large error bars. This means that random forest (on this type of data, with the variable treatment and settings that we used) has high variance compared to the other fitting methods that we tried. The random forest model that you fit is much more sensitive to the training data that you used.

For more on cross-validation methods see our free video lecture here.

## Takeaways

Model testing and validation are important parts of statistics and data science. You can only validate what you can repeat, so automated variable processing and selection is a necessity. That is why this series was organized a light outline of typical questions leading to traditional techniques.

You can become very good at testing and validation, if instead of working from a list of tests (and there are hundreds of such tests) you work in the following way:

- Ask: What do I need to measure (a size of effect and/or a confidence)?
- Ask: Do I have enough data to work out of sample?
- Ask: Am I okay with a point estimate, or do I need distributional details?
- Ask: Do I want to measure the model I am turning in or the modeling procedure?
- Ask: Am I concerned about computational efficiency?

The answers to these questions or trade-offs between these issues determines your test procedure.