We here at Win-Vector LLC have some really big news we would please like the
R-community’s help sharing.
vtreat version 1.2.0 is now available on CRAN, and this version of
vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as
vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
Thanks to the
rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as
Apache Spark, or
Google BigQuery. Or, thanks to the
rqdatatable packages, even fast large in-memory transforms are possible.
We have some basic examples of the new
vtreat capabilities here and here.
Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.
‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
(from the package documentation)
‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:
- Decrease the amount of hand-work needed to prepare data for predictive modeling.
- Improve actual model performance on new out of sample or application data.
- Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
- Increase model reliability (by re-coding unexpected situations).
- Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).
‘vtreat’ can be used to prepare data for either regression or classification.
Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN
In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package
vtreat in the examples we show in this note, but you can easily implement the approach independently of
Continue reading Principal Components Regression, Pt. 2: Y-Aware Methods
We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested, but are easy to get wrong.
Continue reading On Nested Models