We have a new Win Vector data science article to share:
Cross-Methods are a Leak/Variance Trade-Off
John Mount (Win Vector LLC), Nina Zumel (Win Vector LLC)
March 10, 2020
We work through some instructive examples of when cross-methods (cross-validation, and also cross-frames) succeed, and when they fail.
Cross-methods such as cross-validation and cross-prediction are effective tools for many machine learning, statistics, and data science applications. They are useful for parameter selection, model selection, impact/target encoding of high-cardinality variables, stacking models, and super learning. They are more statistically efficient than partitioning training data into calibration/training/holdout sets, but do not satisfy the full exchangeability conditions that full hold-out methods enjoy. This introduces additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.
Specifically, cross-methods can introduce an information leak into the modeling process. This information leak will be the subject of this post.
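To make the idea concrete, here is a minimal standard-library Python sketch of out-of-fold ("cross-frame") impact encoding of a high-cardinality categorical variable. The function and variable names are our own, and this is an illustration of the general technique, not vtreat's actual implementation: each row's encoding is computed only from rows in the other folds, which limits (but, as the article discusses, does not fully eliminate) the information leak.

```python
import random
from collections import defaultdict

def cross_frame_encode(categories, y, n_splits=5, seed=0):
    """Out-of-fold (cross-frame) impact encoding.

    Each row's code is the mean of y for its category, computed
    only from rows in the *other* folds, never from the row itself.
    """
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    global_mean = sum(y) / n
    encoded = [0.0] * n
    for fold in folds:
        in_fold = set(fold)
        sums, counts = defaultdict(float), defaultdict(int)
        # accumulate per-category statistics from out-of-fold rows only
        for i in range(n):
            if i not in in_fold:
                sums[categories[i]] += y[i]
                counts[categories[i]] += 1
        # encode the held-out fold, falling back to the global mean
        # for categories unseen outside the fold
        for i in fold:
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded
```

Because a row's own outcome never enters its own encoding, the encoded column can be used as a model input with much less leak than a naive whole-data category mean.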
The entire article is a JupyterLab notebook, and can be found here. Please check it out, and share it with your favorite statisticians, machine learning researchers, and data scientists.
Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross-validation settings. The default can be a deterministic, even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don’t touch the pseudo-random number generator they are repeatable, deterministic, and side-effect free.
This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.
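As a concrete illustration, here is a standard-library sketch contrasting an ordered split (like the no-shuffle defaults described above) with a shuffled split. This is our own toy code, not any library's implementation; with scikit-learn the analogous fix is to pass `shuffle=True` and a `random_state` to `KFold`.

```python
import random

def ordered_kfold(n, k):
    """Deterministic, ordered k-fold split: fold 1 is the first rows,
    fold 2 the next rows, and so on -- like a no-shuffle default."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def shuffled_kfold(n, k, seed=0):
    """Shuffle the row indices before splitting -- usually what one
    wants statistically, and still repeatable via an explicit seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

If the data arrive sorted (by time, by outcome, by group), the ordered split silently puts structurally different rows in different folds; the shuffled version avoids that while remaining reproducible.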
Continue reading Python Data Science Tip: Don’t use Default Cross Validation Settings
A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model has changed? This client, like many others who have faced the same problem, simply checked whether the mean and standard deviation of the data had changed more than some amount, where the threshold value they checked against was selected in a more or less ad-hoc manner. But they were curious whether there was some other, perhaps more principled way, to check for a change in distribution.
Continue reading Monitoring for Changes in Distribution with Resampling Tests
vtreat version 1.5.2 just became available from CRAN.
We have logged a few improvements in the NEWS file. The changes are small and incremental, as the package is already in a good, stable state for production use.
Continue reading What is New For vtreat 1.5.2?
We have a new data scientist sticker!
If you see Nina or John at a conference/MeetUp, please ask us for a sticker!
For the next version of the R package wrapr we are going to be removing a number of under-used functions/methods and classes. This update will likely happen in March 2020, and is the start of the wrapr 2.* series.
Most of the items being removed are different abstractions for helping with function composition. We ended up moving most of our work to category-theory based composition, so we don’t think these various frameworks are needed any longer. If you have been using these items in your own projects, please reach out, and we will try to find a way to help you out.
Continue reading wrapr Update: Removing Some Under-Used Functions and Classes
In a lot of our R writing we casually say “install from CRAN using
install.packages('PKGNAME')” or “update your packages by using
update.packages(ask = FALSE, checkBuilt = TRUE) (and answering ‘no’ to all questions about compiling).”
We recently became aware that for some users this isn’t complete advice.
Continue reading R Tip: Check What Repos You are Using
Nina Zumel and I have two new tutorials on fluid data wrangling/shaping. They are written in a parallel structure, with the R version of the tutorial being almost identical to the Python version.
This reflects our opinion on the “which is better for data science, R or Python?” question: they both are great. So start with one, and expect to eventually work with both (if you are lucky).
Continue reading Data re-Shaping in R and in Python