One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book, and if they wanted to follow up on a data set or technique to find the matching worked examples in the project directory of our book support materials git repository.
Some readers want to work much closer to the sequence in the book. To make working along with book easier we extracted all book examples and shared them with our readers (in a Github directory, and a downloadable zip file, press “Raw” to download). The direct extraction from the book guarantees the files are in sync with our revised book. However there are trade-offs, sometimes (for legibility) the book mixed input and output without using R’s comment conventions. So you can’t always just paste everything. Also for a snippet to run you may need some libraries, data and results of previous snippets to be present in your R environment.
To help these readers we have added a new section to the book support materials: knitr markdown sheets that work all the book extracts from each chapter. Each chapter and appendix now has a matching markdown file that sets up the correct context to run each and every snippet extracted from the book. In principle you can now clone the entire zmPDSwR repository to your local machine and run all the from the CodeExamples directory by using the RStudio project in RunExamples. Correct execution also depens on having the right packages installed so we have also added a worksheet showing everything we expect to see installed in one place: InstallAll.Rmd (note some of the packages require external dependencies to work such as a C compiler, curl libraries, and a Java framework to run).
What stands out in these presentations is: the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.
Suggested static cal/train/test experiment design from vtreat data treatment library.
When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().
It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.
Win-Vector LLC‘s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression (essentially reusing test data). This allowed her to reproduce results similar to the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique essentially protects and reuses test data, allowing the series of adaptive decisions driving forward step-wise logistic regression to remain valid with respect to unseen future data. Without the differential privacy precaution these steps are not always sufficiently independent of each other to ensure good model generalization performance. Through differential privacy one gets safe reuse of test data across many adaptive queries, yielding more accurate estimates of out of sample performance, more robust choices, and resulting in a better model.
In this note I will discuss a specific related application: using differential privacy to reuse training data (or equivalently make training procedures more statistically efficient). I will also demonstrate similar effects using more familiar statistical techniques.
In some of my recent public talks (for example: here and here) I have mentioned a desire for “a deeper theory of fitting and testing.” I thought I would expand on what I meant by this.
In this note I am going to cover a lot of different topics to try and suggest some perspective. I won’t have my usual luxury of fully defining my terms or working concrete examples. Hopefully a number of these ideas (which are related, but don’t seem to easily synthesize together) will be subjects of their own later articles.
The focus of this article is: the true goal of predictive analytics is always: to build a model that works well in production. Training and testing procedures are designed to simulate this unknown future model performance, but can be expensive and can also fail.
What we want is a good measure of future model performance, and to apply that measure in picking a model without running deep into Goodhart’s law (“When a measure becomes a target, it ceases to be a good measure.”).
Most common training and testing procedures are destructive in the sense they use up data (data used for one step may not be safely used for another step in an unbiased fashion, example: excess generalization error). In this note I thought I would expand on the ideas for extending statistical efficiency or getting more out of your training while avoiding overfitting.
In this article we conclude our four part series on basic model testing.
When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this concluding Part 4 of our four part mini-series “How do you know if your model is going to work?” we demonstrate cross-validation techniques.
Cross validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test and re-performing model fit and model evaluation.
For example: the variation called k-fold cross-validation splits the original data into k roughly equal sized sets. To score each set we build a model on all data not in the set and then apply the model to our set. This means we build k different models (none which is our final model, which is traditionally trained on all of the data).
Notional 3-fold cross validation (solid arrows are model construction/training, dashed arrows are model evaluation).
This is statistically efficient as each model is trained on a 1-1/k fraction of the data, so for k=20 we are using 95% of the data for training.
Statisticians tend to prefer cross-validation techniques to test/train split as cross-validation techniques are more statistically efficient and can give sampling distribution style distributional estimates (instead of mere point estimates). However, remember cross validation techniques are measuring facts about the fitting procedure and not about the actual model in hand (so they are answering a different question than test/train split).
Though, there is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods, and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know their team and procedure tends to produce good models) and employees are essentially Bayesian (they want to know the actual model they are turning in is likely good; see here for how it the nature of the question you are trying to answer controls if you are in a Bayesian or Frequentist situation).