When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four-part mini-series “How do you know if your model is going to work?” we develop in-training set measures.
Previously we worked on:
- Part 1: Defining the scoring problem
In-training set measures
The most tempting procedure is to score your model on the data used to train it. The attraction is that this avoids the statistical inefficiency of denying some of your data to the training procedure.
Run it once procedure
A common way to assess score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.
What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).
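As a rough illustration of the two scores used above, here is a minimal Python sketch. It assumes "normalized deviance" means the deviance divided by the number of rows (the text only says it is normalized to factor out sample size), and the outcome and prediction vectors are made up for the example.

```python
import math

def auc(y_true, scores):
    """Probability that a random positive row outscores a random negative row.

    Ties count as one half. Quadratic time, which is fine for a small sketch.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def normalized_deviance(y_true, probs, eps=1e-12):
    """Deviance divided by the number of rows: -2 * mean log likelihood.

    Dividing by the row count is an assumed reading of "normalized".
    """
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total / len(y_true)

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [1, 0, 1, 0, 0, 1]
p = [0.9, 0.2, 0.6, 0.4, 0.7, 0.8]
print("AUC:", auc(y, p))                                   # larger is better
print("normalized deviance:", normalized_deviance(y, p))   # smaller is better
```

Scoring a model "on its own training data" simply means that `y` and `p` both come from the rows the model was fit on, which is where the upward bias enters.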
There are two more Gedankenexperiment models that any data scientist should always have in mind:
- The null model (on the graph as “null model”). This is the performance of the best constant model (a model that returns the same answer for every datum). In this case it is a model that scores each and every row as having an identical 7% chance of churning. This is an important model that you want to do better than. It is also a model you are often competing against as a data scientist, as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace). The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5), and packages like logistic regression routinely report this statistic.
- The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to outperform, as it represents the “maybe one of the columns is already the answer” case (if so, that would be very good for the business, as they could get good predictions without modeling infrastructure). The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model you have not outperformed what an analyst can find with a single pivot table.
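The null model baseline is easy to compute by hand. The sketch below uses a made-up outcome column with a 7% positive rate (matching the churn rate mentioned above) and again assumes deviance is normalized by dividing by the row count. It checks that the observed rate is the best constant prediction under deviance.

```python
import math

def normalized_deviance(y_true, probs):
    """-2 * mean log likelihood (assumed normalization: divide by row count)."""
    total = sum(math.log(p) if y == 1 else math.log(1.0 - p)
                for y, p in zip(y_true, probs))
    return -2.0 * total / len(y_true)

# Hypothetical outcome column: 7 churners out of 100 rows.
y = [1] * 7 + [0] * 93

# The best constant prediction under deviance is the observed rate.
rate = sum(y) / len(y)
null_dev = normalized_deviance(y, [rate] * len(y))

# Any other constant prediction scores worse (larger deviance).
other_devs = {c: normalized_deviance(y, [c] * len(y))
              for c in (0.03, 0.05, 0.10, 0.50)}

print("null model rate:", rate)
print("null model normalized deviance:", null_dev)
print("other constants:", other_devs)
```

On this toy column the null deviance comes out at about 0.51, which is the number a fitted model's normalized deviance would need to beat.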
At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:
- The random forest performance is implausibly good, so we should expect it is an effect of overfitting (possibly independent of true model quality).
- The best of five models will, of course, appear better than any single modeling technique chosen before looking at the training data. This is the multiple comparison effect, and it occurs regardless of whether choosing among these modeling methods has any real value.
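The multiple comparison effect can be seen in a small simulation. This is only a sketch: the five "models" here are pure noise scorers (stand-ins, not the five techniques discussed above), and the outcome column, seed, and trial count are arbitrary choices.

```python
import random

def auc(y_true, scores):
    """Probability a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

rng = random.Random(42)
y = [1] * 10 + [0] * 40          # hypothetical outcome column
n_models, n_trials = 5, 200

single, best = [], []
for _ in range(n_trials):
    # Each "model" is pure noise: its scores are unrelated to y.
    aucs = [auc(y, [rng.random() for _ in y]) for _ in range(n_models)]
    single.append(aucs[0])       # a technique fixed before looking at the data
    best.append(max(aucs))       # the winner picked after looking at the scores

mean_single = sum(single) / n_trials
mean_best = sum(best) / n_trials
print("mean AUC of a pre-chosen noise model:", mean_single)
print("mean AUC of the best of five noise models:", mean_best)
```

The pre-chosen scorer averages close to 0.5, while the best of five averages noticeably higher even though none of the scorers carries any signal.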
“Score once on training data” has started to show us things. But we can improve our scoring procedures, and it will turn out random forest is not in fact the best choice in this particular case (though random forest is often the best choice in general).
One question we could try to answer using in-sample data (or data seen during training) is: are any of the models significantly different than what you would get fitting on noise? Are our models above what one might see by chance? A permutation test uses only training data (so it is well suited to situations where you don’t feel you have enough data for a test/train split or cross validation) and is also a quick way to measure whether you in fact “have a lot of data.”
We’ve already addressed permutation tests in an earlier article, so we will just move on to the appropriate graphs. Below we have re-plotted our in-sample training performance and added a new panel called “xptrain” (“experiment permutation training”). In the xptrain panel we repeated the following ten times: permute the y or outcome column of our data (so in expectation it has no true relation to the inputs or x’s) and run the modeling procedure. We then scored the quality of the fit models. The error bars drawn are the 95% confidence intervals of the normal distribution that has the same mean and variance as we saw on the ten fits. The fit qualities are not normally distributed (for instance, AUC is always in the interval zero to one); the error bars are merely a convenient way to get a view of the scale of dispersion of the permutation test.
We would like to see that the models we fit on real data (the top panel) are significantly better than the models we fit on permuted data. That would be each dot to the right of the corresponding permutation error bar in the AUC graph and to the left of the corresponding error bar in the normalized deviance graph. This would be a clue that the types of fits we saw were in fact unlikely to be entirely due to uncorrelated noise, making the supposition we have actually fit something a bit more plausible.
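The permutation procedure described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the data is synthetic, and the "modeling procedure" is a toy one-variable model (`fit_level_means`, a hypothetical helper) that predicts the training mean of the outcome within each level of a single categorical input.

```python
import random
import statistics

def auc(y_true, scores):
    """Probability a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def fit_level_means(x, y):
    """Toy one-variable model: training mean of y within each level of x."""
    sums, counts = {}, {}
    for xi, yi in zip(x, y):
        sums[xi] = sums.get(xi, 0) + yi
        counts[xi] = counts.get(xi, 0) + 1
    return {level: sums[level] / counts[level] for level in sums}

def train_auc(x, y):
    """Fit on (x, y) and score on the same rows, as in the in-sample panels."""
    model = fit_level_means(x, y)
    return auc(y, [model[xi] for xi in x])

rng = random.Random(2015)
n = 400
x = [rng.randrange(4) for _ in range(n)]
# Synthetic outcome whose probability rises with the level of x.
y = [1 if rng.random() < 0.1 + 0.2 * xi else 0 for xi in x]

real_score = train_auc(x, y)

# Permute the outcome ten times, refit, and rescore on the permuted data.
perm_scores = []
for _ in range(10):
    y_perm = y[:]
    rng.shuffle(y_perm)
    perm_scores.append(train_auc(x, y_perm))

# Normal-approximation 95% interval with the same mean and variance.
m = statistics.mean(perm_scores)
s = statistics.stdev(perm_scores)
lo, hi = m - 1.96 * s, m + 1.96 * s
print("training AUC on real data:", real_score)
print("permutation interval:", (lo, hi))
```

Because this toy model is scored on its own training rows, even the permuted fits score at or above 0.5, which is why the real score is compared to the permutation interval rather than to 0.5.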
Notice the random forest model achieves AUCs near 1 for many of the noise permutations (look also at the related deviances)! That doesn’t mean the actual random forest model fit does not have a useful score (it in fact does); it just means you can’t tell from looking only at the training data whether it represents a useful score.
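To see how a sufficiently flexible learner can score perfectly on noise, consider the following sketch. It does not use a random forest; a lookup-table "memorizer" (`fit_memorizer`, a hypothetical helper) stands in for any method with enough capacity to remember individual training rows, and the data is made up.

```python
import random

def auc(y_true, scores):
    """Probability a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def fit_memorizer(x, y):
    """Toy stand-in for a very flexible learner: remember every training row."""
    return dict(zip(x, y))

rng = random.Random(7)
n = 200
x = list(range(n))               # hypothetical inputs, distinct for every row
y = [1] * 14 + [0] * 186         # 7% positives

noise_aucs = []
for _ in range(10):
    y_perm = y[:]
    rng.shuffle(y_perm)          # outcome now has no true relation to x
    model = fit_memorizer(x, y_perm)
    noise_aucs.append(auc(y_perm, [model[xi] for xi in x]))

print("training AUCs on permuted outcomes:", noise_aucs)
```

Every permuted fit reaches a training AUC of 1.0, so for a learner like this a perfect training score says nothing on its own about whether real signal was found.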
Our series continues with:
- Part 3: Out of sample procedures
- Part 4: Cross-validation techniques