Nina Zumel and I are happy to announce that a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as arXiv:1611.09477 [stat.AP].
vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training).
vtreat::prepare should be your first choice for real-world data preparation and cleaning.
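As a minimal sketch of what that can look like in practice: the tiny data frames and the names `dTrain`, `dApp`, and `plan` below are invented for illustration; only `designTreatmentsN()` and `prepare()` come from vtreat. The idea is to design a treatment plan on training data and then apply it to application data that contains both an NA and a level never seen in training.

```r
library(vtreat)

# Toy training data (made up for this sketch): a categorical x and a
# numeric z, both containing NAs, plus a numeric outcome y.
dTrain <- data.frame(
  x = c("a", "a", "a", "a", "b", "b", NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Toy application data: level "c" was never seen during training.
dApp <- data.frame(
  x = c("a", "b", "c", NA),
  z = c(10, 20, 30, NA),
  stringsAsFactors = FALSE
)

# Design the treatment plan on the training data (numeric outcome).
plan <- designTreatmentsN(dTrain, c("x", "z"), "y", verbose = FALSE)

# Apply the same plan to both frames; pruneSig = NULL requests no
# significance-based pruning of the derived variables.
dTrainTreated <- prepare(plan, dTrain, pruneSig = NULL)
dAppTreated <- prepare(plan, dApp, pruneSig = NULL)

# Inspect the treated application data.
print(dAppTreated)
```

The treated application frame should contain only numeric columns with no NA values, even though the raw input had a missing value and a novel level.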
We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications.
Continue reading vtreat data cleaning and preparation article now available on arXiv
It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.
Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory data.frame. Results still do depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.
Continue reading New R package: replyr (get a grip on remote dplyr data services)
I have previously written on using containerized PostgreSQL with R. This post shows the steps for using containerized MySQL with R.
Continue reading MySql in a container
Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of.
I have written before how I think this book stands out and why you should consider studying from it.
Please read on for some additional comments on the intent of different sections of the book.
Continue reading Teaching Practical Data Science with R
Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string values (also called levels or factors) and vary through many such levels. Typical examples include ZIP codes, vendor IDs, and product codes.
In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.
In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
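The effect is easy to reproduce on a small scale. The following base R sketch is not the code behind the graph. It is an illustrative setup of our own (200 made-up levels, an outcome that is pure noise, and a plain `lm()` fit) that shows the same pattern: the model looks useful on training data and is worse than predicting the average on test data.

```r
set.seed(2016)

# Pseudo R-squared: 1 is perfect, below 0 is worse than predicting the mean.
r_squared <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# A high cardinality categorical variable with 200 levels.
levs <- paste0("lev", seq_len(200))

# Training data: every level appears exactly twice; y is pure noise,
# so x carries no real information about y.
train <- data.frame(
  x = factor(rep(levs, each = 2), levels = levs),
  y = rnorm(400)
)

# Test data: the same levels, with fresh noise for y.
test <- data.frame(
  x = factor(sample(levs, 400, replace = TRUE), levels = levs),
  y = rnorm(400)
)

# Throw the raw categorical variable into a model with no pre-processing.
model <- lm(y ~ x, data = train)

train_r2 <- r_squared(train$y, predict(model, newdata = train))
test_r2 <- r_squared(test$y, predict(model, newdata = test))

print(c(train = train_r2, test = test_r2))
```

Because each level's coefficient is fit to only two noisy observations, the model memorizes training noise: the training score is clearly positive while the test score falls below zero.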
Please read on for how to fix this.
Continue reading You should re-encode high cardinality categorical variables
Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method, inspired by differential privacy methods, that Nina and I respect but don’t actually use in production.
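To make the overfit bias concrete, here is a minimal base R sketch. It is neither Laplace noising nor vtreat's actual implementation. It simply assumes a plain 5-fold scheme, and the helper `impact_code()` is a hypothetical name introduced for illustration. It compares an impact code built on the same rows it is applied to with one built only from rows outside each held-out fold.

```r
set.seed(2016)

# Synthetic data: 100 levels, and an outcome y that is pure noise,
# so any apparent relationship between x and y is overfit.
levs <- paste0("lev", seq_len(100))
d <- data.frame(
  x = sample(levs, 500, replace = TRUE),
  y = rnorm(500),
  stringsAsFactors = FALSE
)

# Hypothetical helper: impact code = per-level mean of y minus the grand
# mean, estimated on (x_fit, y_fit) and looked up for x_apply.
impact_code <- function(x_fit, y_fit, x_apply) {
  level_means <- vapply(split(y_fit, x_fit), mean, numeric(1))
  code <- unname(level_means[x_apply]) - mean(y_fit)
  code[is.na(code)] <- 0  # levels unseen in the fitting rows get a neutral code
  code
}

# Naive coding: design and apply the code on the same rows.
code_naive <- impact_code(d$x, d$y, d$x)

# Out-of-sample coding: each row is coded using only rows from other folds.
folds <- sample(rep(seq_len(5), length.out = nrow(d)))
code_cross <- numeric(nrow(d))
for (k in seq_len(5)) {
  held_out <- folds == k
  code_cross[held_out] <- impact_code(d$x[!held_out], d$y[!held_out], d$x[held_out])
}

print(c(naive = cor(code_naive, d$y), cross = cor(code_cross, d$y)))
```

The naive code shows a sizable correlation with the outcome even though `x` carries no signal, which is exactly the bias a next level model would happily exploit. The out-of-sample code shows a correlation near zero.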
Nested dolls, Wikimedia Commons
Please read on for my discussion of some of the limitations of the technique, how we solve the problem for impact coding (also called “effects codes”), and a worked example in R.
Continue reading Laplace noising versus simulated out of sample methods (cross frames)
We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement, among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design.
vtreat is something we really feel you should add to your predictive analytics or data science workflow.
vtreat getting a call-out from Dmitry Larko, photo Erin LeDell
vtreat’s design and implementation follow from a number of reasoned assumptions or principles, a few of which we discuss below.
Continue reading Some vtreat design principles