If you are working with predictive modeling or machine learning in R this is the R tip that is going to save you the most time and deliver the biggest improvement in your results.
R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.
When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:
Missing, invalid, or out of range values.
Categorical variables with large sets of possible levels.
Novel categorical levels discovered during test, cross-validation, or model application/deployment.
Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
Nested model bias poisoning results in non-trivial data processing pipelines.
Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.
vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.
vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.
If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.
[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams” but so many entities are being introduced and manipulated by the processes we are describing we found equation notation to be in fact cleaner than the diagrams we attempted and rejected.]
Please consider either of the following common predictive modeling tasks:
Picking hyper-parameters, fitting a model, and then evaluating the model.
Variable preparation/pruning, fitting a model, and then evaluating the model.
In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.
Split your data into 3 or more disjoint pieces, such as separate variable preparation/pruning, model fitting, and model evaluation.
Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.
The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our R data cleaning and preparation package vtreat.
What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.
Edit: we are going to be writing on a situation of some biases that do leak into the cross-frame “new data simulation.” So think of cross-frames as bias (some small amount is introduced) / variance (reduced be appearing to have a full sized data set at all stages) trade-off.
In this example we are going to show what building a predictive model using vtreat best practices looks like assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but want what small effort is required to add vtreat to your predictive modeling practice.
The Win-Vector LLCvtreat library is a library we supply (under a GPL license) for automating the simple domain independent part of variable cleaning an preparation.
The idea is you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets) and vtreat returns a data structure that can be used to prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the sense:
All result columns are numeric.
No odd type columns (dates, lists, matrices, and so on) are present.
No columns have NA, NaN, +-infinity.
Categorical variables are expanded into multiple indicator columns with all levels present which is a good encoding if you are using any sort of regularization in your modeling technique.
No rare indicators are encoded (limiting the number of indicators on the translated data.frame).
Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
Novel levels (levels not seen during design/train phase) do not cause NA or errors.
The idea is vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain specific steps. vtreat also leaves as much of variable selection to the down-stream modeling software. The goal of vtreat is to reliably (and repeatably) generate a data.frame that is safe to work with.