vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training).
vtreat::prepare should be your first choice for real world data preparation and cleaning.
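As a minimal sketch of the problems listed above, the following designs an outcome-free treatment plan with `designTreatmentsZ()` and applies it with `prepare()` to data containing a missing value and a categorical level not seen during training. The data frames and column names are made up for illustration.

```r
library(vtreat)

# Toy training data (made up for illustration): a numeric column with a
# missing value and a categorical column.
d_train <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  g = c("a", "a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Application data with an NA and a level ("z") never seen in training.
d_app <- data.frame(
  x = c(NA, 3),
  g = c("z", "a"),
  stringsAsFactors = FALSE
)

# Design a treatment plan on the training data (no outcome needed here).
plan <- designTreatmentsZ(d_train, c("x", "g"), verbose = FALSE)

# Apply the plan to the new data; the result is an all-numeric data.frame.
d_app_treated <- prepare(plan, d_app)

print(d_app_treated)
```

The treated frame should contain only numeric columns with no missing values, so a downstream model does not need its own handling for NA or novel levels.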
We hope this article, now available on arXiv, will make getting started with vtreat much easier. We also hope it helps with citing the use of vtreat in scientific publications.
In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.
In this example we show what building a predictive model using vtreat best practices looks like, assuming you are already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we want to show what small effort is required to add vtreat to your predictive modeling practice.
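One way such a schematic might look is sketched below. It is an illustration on synthetic data, not necessarily the exact steps of the original example: it assumes a binary outcome, uses `mkCrossFrameCExperiment()` to build both the treatment plan and a cross-validated training frame, filters variables by their significance estimate in the score frame, and fits a plain `glm()` as a stand-in model. The data, variable names, and the `1 / nrow(sf)` threshold are assumptions made for this sketch.

```r
library(vtreat)

set.seed(2024)

# Synthetic data (made up): one numeric and one categorical input with NAs.
n <- 200
d <- data.frame(
  x = rnorm(n),
  g = sample(c("a", "b", "c", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- (d$x + (!is.na(d$g) & d$g == "a") + rnorm(n)) > 0.5

# Simple random test/train split.
is_test <- runif(n) < 0.25
d_train <- d[!is_test, , drop = FALSE]
d_test <- d[is_test, , drop = FALSE]

# Build the treatment plan and a cross-validated treated training frame.
cfe <- mkCrossFrameCExperiment(d_train, c("x", "g"), "y", TRUE,
                               verbose = FALSE)
plan <- cfe$treatments
train_treated <- cfe$crossFrame

# Keep variables whose estimated significance passes a simple threshold
# (the threshold is an assumption chosen for this sketch).
sf <- plan$scoreFrame
model_vars <- sf$varName[sf$sig < 1 / nrow(sf)]

# Fit a stand-in model on the treated training frame.
model <- glm(reformulate(model_vars, response = "y"),
             data = train_treated, family = binomial)

# Treat the test data with the same plan and score it.
test_treated <- prepare(plan, d_test)
pred <- predict(model, newdata = test_treated, type = "response")
```

The model is fit on the cross-frame rather than on `prepare(plan, d_train)`, so the same rows are not naively reused for both designing the variable treatments and fitting the model.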
- Random Test/Train Split is not Always Enough
- How Do You Know if Your Data Has Signal?
- How do you know if your model is going to work?
- A Simpler Explanation of Differential Privacy (explaining the reusable holdout set)
- Using differential privacy to reuse training data
- Preparing Data for Analysis using R: Basic through Advanced Techniques
What stands out in these presentations is: the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.
Suggested static cal/train/test experiment design from vtreat data treatment library.
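A static calibration/train/test design of this kind could be sketched as follows, under stated assumptions: the data are synthetic, the 40/40/20 proportions are arbitrary, and the outcome is binary. The calibration rows are used only to design the variable treatments, the training rows only to fit a model, and the test rows only for evaluation.

```r
library(vtreat)

set.seed(1)

# Synthetic data (made up for illustration).
n <- 300
d <- data.frame(
  x = rnorm(n),
  g = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- (d$x + rnorm(n)) > 0

# Assign each row to exactly one of three disjoint groups.
# The proportions are an arbitrary choice for this sketch.
grp <- sample(c("cal", "train", "test"), n, replace = TRUE,
              prob = c(0.4, 0.4, 0.2))
d_cal <- d[grp == "cal", , drop = FALSE]
d_train <- d[grp == "train", , drop = FALSE]
d_test <- d[grp == "test", , drop = FALSE]

# Design the treatment plan on the calibration rows only.
plan <- designTreatmentsC(d_cal, c("x", "g"), "y", TRUE, verbose = FALSE)

# Apply the plan to the rows that were not used to design it.
train_treated <- prepare(plan, d_train)
test_treated <- prepare(plan, d_test)
```

A model would then be fit on `train_treated` and evaluated on `test_treated`. The price of this simplicity is that the calibration rows are not available for model fitting, which is the inefficiency that cross-frame methods aim to reduce.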