Win-Vector LLC’s Dr. Nina Zumel has had great success applying y-aware methods to machine learning problems, and working out the detailed cross-validation methods needed to make y-aware procedures safe. I thought I would try my hand at y-aware neural net or deep learning methods here.
vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
Thanks to the rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as Apache Spark, or Google BigQuery. Or, thanks to the rqdatatable package, even fast large in-memory transforms are possible.
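To make the idea of “safe y-aware processing” a little more concrete, here is a minimal sketch of a cross-validated re-encoding of a categorical variable. This is not vtreat's code or API: the function names `impact_code` and `cross_frame`, the fold assignment, and the toy data are all assumptions made for illustration, and the sketch is in plain Python so it has no dependencies. Each level is coded as the mean of y within that level minus the grand mean, and the cross-validated version codes every row using only rows from other folds.

```python
import random
from collections import defaultdict

def impact_code(levels, y):
    """Fit a naive y-aware code: mean of y within each level minus the grand mean."""
    grand = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for lev, yi in zip(levels, y):
        sums[lev] += yi
        counts[lev] += 1
    return {lev: sums[lev] / counts[lev] - grand for lev in sums}

def cross_frame(levels, y, n_folds=3, seed=0):
    """Code each row with a map fit only on rows from the other folds."""
    n = len(y)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    fold_of = {row: k % n_folds for k, row in enumerate(order)}
    coded = [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if fold_of[i] != fold]
        codes = impact_code([levels[i] for i in train], [y[i] for i in train])
        for i in range(n):
            if fold_of[i] == fold:
                # A level never seen in the training folds carries no information.
                coded[i] = codes.get(levels[i], 0.0)
    return coded

# Toy data (hypothetical): a row-id column that has no real relation to y.
y = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
levels = ["id%d" % i for i in range(len(y))]

naive_map = impact_code(levels, y)
naive = [naive_map[lev] for lev in levels]  # fit and applied on the same rows
safe = cross_frame(levels, y)               # out-of-fold codes
```

On this toy data the naive code simply memorizes y (each row's code is its own y minus the grand mean), so a downstream model would see a useless id column as a perfect predictor. The out-of-fold code is zero everywhere, which correctly reports that the column carries no signal.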
‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:
- Decrease the amount of hand-work needed to prepare data for predictive modeling.
- Improve actual model performance on new out of sample or application data.
- Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
- Increase model reliability (by re-coding unexpected situations).
- Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).
‘vtreat’ can be used to prepare data for either regression or classification.
Please read on for what ‘vtreat’ does and what is new.
In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of
We have recently been working on, and presenting on, nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested models, but are easy to get wrong.