Win-Vector LLC’s Dr. Nina Zumel has just released some new vtreat documentation.
vtreat is an all-in-one step data preparation system that helps defend your machine learning algorithms from:
- Missing values
- Large cardinality categorical variables
- Novel levels from categorical variables
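To make the "novel levels" hazard concrete, here is a minimal pure-Python sketch of indicator encoding that tolerates unseen levels. It is not vtreat's implementation; the function names and the `min_count` threshold are assumptions made up for this illustration.

```python
from collections import Counter

def fit_indicator_levels(values, min_count=2):
    """Learn which levels are common enough to get their own indicator column."""
    counts = Counter(v for v in values if v is not None)
    return sorted(level for level, n in counts.items() if n >= min_count)

def indicator_encode(values, levels):
    """One 0/1 column per known level; missing, rare, or novel levels become all zeros."""
    return [[1 if v == level else 0 for level in levels] for v in values]

# Training data: "c" is rare and one value is missing.
train = ["a", "a", "b", "b", "b", "c", None]
levels = fit_indicator_levels(train)

# Application data contains a level never seen in training.
test = ["a", "zzz", None]
encoded = indicator_encode(test, levels)

print(levels)
print(encoded)
```

The point of the design is that the set of columns is fixed at fit time, so a level that first appears during model application cannot change the shape of the data handed to the model.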
I hoped she could get the Python vtreat documentation up to parity with the R vtreat documentation. But I think she really hit the ball out of the park, and went way past that.
The new documentation consists of three “getting started” guides. These guides deliberately overlap, so you don’t have to read them all. Just read the one suited to your problem and go.
The new guides:
Perhaps we can back-port the new guides to the R version at some point.
I am excited to announce vtreat is now available for Python on PyPI, in addition to R on CRAN.
Continue reading vtreat up on PyPi
The following really made my day.
I tell every data scientist I know about vtreat and urge them to read the paper.
Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this).
For those interested, the R version of vtreat can be found here, the paper can be found here, and the in-development Python/Pandas version of vtreat can be found (with examples) here.
Chapter 8 of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019 has a more operational discussion of vtreat (which itself uses concepts developed in chapter 4).
We at Win-Vector LLC have some big news.
We are finally porting a streamlined version of our R vtreat variable preparation package to Python.
vtreat is a great system for preparing messy data for supervised machine learning.
The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the .fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case .fit_transform() != .fit().transform()). There is a bit of difference in how object oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.
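A small pure-Python sketch may help show why a cross-frame differs from fitting and then transforming the same rows. This is an illustration of the idea only, not vtreat's code: the helper names, the impact-code formula (level mean minus grand mean), and the fold assignment by row index are all assumptions chosen for brevity.

```python
def impact_code(levels, ys):
    """Map each level to (mean of y for that level) - (overall mean of y)."""
    grand = sum(ys) / len(ys)
    sums, counts = {}, {}
    for lev, y in zip(levels, ys):
        sums[lev] = sums.get(lev, 0.0) + y
        counts[lev] = counts.get(lev, 0) + 1
    return {lev: sums[lev] / counts[lev] - grand for lev in sums}

def naive_fit_then_transform(levels, ys):
    """Fit codes on all rows, then apply them to those same rows."""
    codes = impact_code(levels, ys)
    return [codes.get(lev, 0.0) for lev in levels]

def cross_frame(levels, ys, n_folds=3):
    """Encode each row using codes fit only on the other folds."""
    n = len(levels)
    out = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        fit_rows = [i for i in range(n) if i % n_folds != fold]
        codes = impact_code([levels[i] for i in fit_rows],
                            [ys[i] for i in fit_rows])
        for i in held_out:
            # Levels unseen in the fitting folds get a neutral code of 0.
            out[i] = codes.get(levels[i], 0.0)
    return out

# Every level is unique, so the variable carries no real signal about y.
levels = [f"id_{i}" for i in range(6)]
ys = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

naive = naive_fit_then_transform(levels, ys)
crossed = cross_frame(levels, ys)
print(naive)
print(crossed)
```

In this toy case the naive encoding memorizes the outcome (each code is exactly y minus the mean), while the cross-frame encoding is all zeros, which correctly reports that the variable is uninformative. A downstream model trained on the naive column would look far better on training data than it deserves to.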
The new repository is here. And we have a non-trivial worked classification example. Next up is multinomial classification. After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.
The first application is going to be a shortening and streamlining of our current 4-day data science in Python course (while allowing more concrete examples!).
This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.
vtreat’s purpose is to produce pure numeric data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded by effects codes or impact codes).
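As a rough illustration of re-encoding missing values with indicators, the sketch below replaces missing numeric entries with a fill value learned from training data and adds a 0/1 column recording where the gaps were. The helper names and the choice of the training mean as the fill value are assumptions for this example, not a description of vtreat internals.

```python
import math

def is_missing(x):
    """Treat None and NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def fit_numeric_treatment(values):
    """Learn the fill value (mean of the non-missing training values)."""
    present = [x for x in values if not is_missing(x)]
    return sum(present) / len(present) if present else 0.0

def apply_numeric_treatment(values, fill):
    """Return a cleaned numeric column plus a 0/1 was-missing indicator column."""
    cleaned = [fill if is_missing(x) else float(x) for x in values]
    was_missing = [1 if is_missing(x) else 0 for x in values]
    return cleaned, was_missing

train = [1.0, None, 3.0, float("nan")]
fill = fit_numeric_treatment(train)

cleaned, was_missing = apply_numeric_treatment([2.0, None], fill)
print(fill, cleaned, was_missing)
```

Keeping the indicator column means the model can still learn from the fact that a value was missing, which is often informative in its own right.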
In this note we will discuss a small aspect of the vtreat package: variable screening.
Continue reading vtreat Variable Importance
Reusable modeling pipelines are a practical idea that gets re-developed many times in many contexts.
wrapr supplies a particularly powerful pipeline notation, and a pipe-stage re-use system (notes here). We will demonstrate this with the vtreat data preparation system.
Continue reading Sharing Modeling Pipelines in R
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
vtreat can now effectively prepare data for multi-class classification or multinomial modeling.
Continue reading Modeling multi-category Outcomes With vtreat
We here at Win-Vector LLC have some really big news that we would like the R community’s help sharing.
vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as
vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
Thanks to the rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as Apache Spark or Google BigQuery. Or, thanks to the rqdatatable package, even fast large in-memory transforms are possible.
We have some basic examples of the new vtreat capabilities here and here.
If you are working with predictive modeling or machine learning in R this is the R tip that is going to save you the most time and deliver the biggest improvement in your results.
R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.
When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:
- Missing, invalid, or out of range values.
- Categorical variables with large sets of possible levels.
- Novel categorical levels discovered during test, cross-validation, or model application/deployment.
- Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
- Nested model bias poisoning results in non-trivial data processing pipelines.
Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, and they are often ignored, leading to degraded performance in production.
vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.
vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.
If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.
Continue reading R Tip: Use the vtreat Package For Data Preparation