A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model have changed? This client, like many others who have faced the same problem, simply checked whether the mean and standard deviation of the data had changed more than some amount, where the threshold value they checked against was selected in a more or less ad-hoc manner. But they were curious whether there was some other, perhaps more principled way, to check for a change in distribution.
Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example permutation methods are used to estimate the significance reported in the following ROC plot.
Permutation tests have their own literature and issues (examples: Permutation, Parametric and Bootstrap Tests of Hypotheses, Springer-Verlag, NY, 1994 (3rd edition, 2005), 2, 3, and 4).
In our R packages the permutation tests are estimated by a sampling procedure, and not computed exactly (or deterministically). It turns out this is likely a necessary concession; a complete exact permutation test procedure at scale would be big news. Please read on for my comments on this issue.
Image by Liz Sullivan, Creative Commons. Source: Wikimedia
An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our “Bad Bayes” blog post.
In this latest “Statistics as it should be” article, we will look at a heuristic to help determine which of your input variables have signal. Continue reading How Do You Know if Your Data Has Signal?