Nina Zumel and I recently wrote a few article and series on best practices in testing models and data:
What stands out in these presentations is: the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.
Suggested static cal/train/test experiment design from vtreat data treatment library.
Continue reading Fluid use of data
In some of my recent public talks (for example: here and here) I have mentioned a desire for “a deeper theory of fitting and testing.” I thought I would expand on what I meant by this.
In this note I am going to cover a lot of different topics to try and suggest some perspective. I won’t have my usual luxury of fully defining my terms or working concrete examples. Hopefully a number of these ideas (which are related, but don’t seem to easily synthesize together) will be subjects of their own later articles.
The focus of this article is: the true goal of predictive analytics is always: to build a model that works well in production. Training and testing procedures are designed to simulate this unknown future model performance, but can be expensive and can also fail.
What we want is a good measure of future model performance, and to apply that measure in picking a model without running deep into Goodhart’s law (“When a measure becomes a target, it ceases to be a good measure.”).
Most common training and testing procedures are destructive in the sense they use up data (data used for one step may not be safely used for another step in an unbiased fashion, example: excess generalization error). In this note I thought I would expand on the ideas for extending statistical efficiency or getting more out of your training while avoiding overfitting.
I will outline a few variations of model construction and testing techniques that one should keep in mind.
Continue reading A deeper theory of testing
Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split.
With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that looks good; the trick is finding one that is good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.
Continue reading Random Test/Train Split is not Always Enough