Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next-level model. It is a fascinating method, inspired by differential privacy techniques, that Nina and I respect but do not ourselves use in production.
Nested dolls, Wikimedia Commons
Please read on for my discussion of some of the limitations of the technique, how we solve the problem for impact coding (also called “effects coding”), and a worked example in R. Continue reading Laplace noising versus simulated out of sample methods (cross frames)
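To make the idea concrete, here is a base-R sketch of Laplace-noised impact coding. This is an illustration only, not Bilenko's production code nor vtreat's method; the function names and the noise scale `b` are our own illustrative choices.

```r
# Sketch: impact-code a categorical variable, adding Laplace noise to the
# per-level counts and sums in the spirit of differential privacy.
set.seed(2016)

rlaplace <- function(n, b) {
  # draw Laplace(0, b) noise as a difference of two exponentials
  rexp(n, rate = 1 / b) - rexp(n, rate = 1 / b)
}

laplaceImpactCode <- function(x, y, b = 1) {
  grand <- mean(y)
  levels <- unique(x)
  codes <- vapply(levels, function(lv) {
    sel <- x == lv
    nNoisy <- sum(sel) + rlaplace(1, b)     # noised level count
    sNoisy <- sum(y[sel]) + rlaplace(1, b)  # noised level sum of y
    # noised conditional mean minus the grand mean
    sNoisy / pmax(nNoisy, 1) - grand
  }, numeric(1))
  names(codes) <- levels
  codes
}

x <- sample(letters[1:5], 200, replace = TRUE)
y <- as.numeric(x %in% c('a', 'b')) + rnorm(200, sd = 0.1)
print(laplaceImpactCode(x, y))
```

Because the noise is added once per level at design time, the resulting codes are not an exact function of the training outcomes, which weakens the overfit feedback loop when the same rows are re-used in the next-level model.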
We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement, among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design.
vtreat is something we really feel you should add to your predictive analytics or data science workflow.
vtreat getting a call-out from Dmitry Larko, photo Erin LeDell
vtreat’s design and implementation follows from a number of reasoned assumptions or principles, a few of which we discuss below.
Continue reading Some vtreat design principles
Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN.
vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
(from the package documentation)
vtreat accepts an arbitrary “from the wild” data frame (with varied column types, NaNs, and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames: all independent variables numeric and free of NaNs, infinities, and so on, ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations, such as random forest, and also bring a danger of statistical over-fitting), and it leaves the analyst more time for domain-specific data preparation (as vtreat tries to handle as much of the common work as practical). For more of an overall description please see here.
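The "design then prepare" pattern vtreat follows can be sketched in base R. This is not vtreat's implementation, and all function names below are illustrative: learn a treatment plan on training data, then apply it repeatably to new frames, re-coding bad values and novel levels safely.

```r
# Minimal base-R sketch of design/prepare (not vtreat itself).
designNumericTreatment <- function(x) {
  # record the training mean as the fill-in for bad values
  list(fill = mean(x[is.finite(x)]))
}

prepareNumeric <- function(plan, x) {
  isBad <- !is.finite(x)                 # catches NA, NaN, Inf
  x[isBad] <- plan$fill                  # replace with training mean
  data.frame(clean = x, isBad = as.numeric(isBad))
}

designCatTreatment <- function(x, y) {
  grand <- mean(y)
  list(codes = tapply(y, x, mean) - grand)  # per-level impact codes
}

prepareCat <- function(plan, x) {
  v <- plan$codes[x]                     # novel levels index as NA
  v[is.na(v)] <- 0                       # re-code unexpected levels safely
  as.numeric(v)
}

numPlan <- designNumericTreatment(c(1, 2, NA, 3))
print(prepareNumeric(numPlan, c(4, NA, Inf)))
catPlan <- designCatTreatment(c('a', 'a', 'b'), c(1, 1, 0))
print(prepareCat(catPlan, c('a', 'z')))  # 'z' was never seen in training
```

The key design point is that `prepare` uses only statistics recorded at design time, so the same transformation can be applied to training, test, and future application data.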
We suggest all users update (and you will want to re-run any “design” steps, instead of mixing “design” and “prepare” steps from two different versions of vtreat).
For what is new in version 0.5.27 please read on. Continue reading vtreat 0.5.27 released on CRAN
Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.
‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
(from the package documentation)
‘vtreat’ is an R package that incorporates a number of transforms and simulated out-of-sample (cross-frame simulation) procedures that can:
- Decrease the amount of hand-work needed to prepare data for predictive modeling.
- Improve actual model performance on new out of sample or application data.
- Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
- Increase model reliability (by re-coding unexpected situations).
- Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).
‘vtreat’ can be used to prepare data for either regression or classification.
Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN
This article is a demonstration of the use of the R vtreat variable preparation package, followed by caret-controlled training.
In previous writings we have gone to great lengths to document, explain, and motivate vtreat. That necessarily gets long, and can make things feel more complicated than they are.
In this example we are going to show what building a predictive model using vtreat best practices looks like, assuming you were already in the habit of using vtreat for your data preparation step. We deliberately do not explain the steps; we just show the small number of steps we advise using routinely. This is a simple schematic, not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we do want to show how small an effort is required to add vtreat to your predictive modeling practice.
Continue reading A demonstration of vtreat data preparation
In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of vtreat.
Continue reading Principal Components Regression, Pt. 2: Y-Aware Methods
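The y-aware scaling step described in that note can be sketched in base R without vtreat (a simplified illustration, not the package's implementation): rescale each centered x column by its univariate regression coefficient on y, so the columns are in "y units", then run ordinary PCA on the rescaled columns.

```r
# Sketch of y-aware scaling followed by PCA (no vtreat required).
set.seed(2016)
n <- 100
x1 <- rnorm(n)                 # genuinely predictive of y
x2 <- 10 * rnorm(n)            # high-variance noise column
y <- x1 + rnorm(n, sd = 0.1)
X <- cbind(x1 = x1, x2 = x2)

yAwareScale <- function(X, y) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  # univariate OLS slope of y on each centered column
  b <- apply(Xc, 2, function(col) sum(col * (y - mean(y))) / sum(col^2))
  sweep(Xc, 2, b, `*`)         # rescale each column into "y units"
}

Xs <- yAwareScale(X, y)
pca <- prcomp(Xs, center = FALSE, scale. = FALSE)
# with y-aware scaling the first component tracks the predictive x1,
# not the high-variance noise x2
print(pca$rotation)
```

Without the y-aware step, ordinary PCA on `X` would put its first component on `x2` simply because that column has the largest variance.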
vtreat cross frames
John Mount, Nina Zumel
As a follow-on to “On Nested Models”, we work through R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat.
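The cross-frame idea itself is easy to sketch in base R (an illustration only; vtreat's own procedure is more careful, and the names below are our own): each row receives an impact code estimated only from rows in other cross-validation folds, simulating out-of-sample application.

```r
# Base-R sketch of a cross-validated training frame for impact coding.
set.seed(2016)
n <- 30
x <- sample(c('a', 'b', 'c'), n, replace = TRUE)
y <- as.numeric(x == 'a') + rnorm(n, sd = 0.1)

crossFrameImpact <- function(x, y, nFolds = 3) {
  fold <- sample(rep_len(seq_len(nFolds), length(x)))
  code <- numeric(length(x))
  for (f in seq_len(nFolds)) {
    inFold <- fold == f
    # build level codes using only out-of-fold rows
    grand <- mean(y[!inFold])
    means <- tapply(y[!inFold], x[!inFold], mean) - grand
    v <- means[x[inFold]]        # look up codes built without this fold
    v[is.na(v)] <- 0             # level unseen outside the fold
    code[inFold] <- v
  }
  code
}

print(head(crossFrameImpact(x, y)))
```

Because no row's code ever uses that row's own outcome, the coded column behaves like new data when it is fed to the next-level model, which is the bias the cross-frame construction is designed to remove.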
Continue reading vtreat cross frames
The Microsoft Data Science User Group just sponsored Dr. Nina Zumel‘s presentation “Preparing Data for Analysis Using R”. Microsoft saw Win-Vector LLC‘s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.
We feel Nina really hit the ball out of the park with over 400 new live viewers. Read more for links to even more free materials! Continue reading More on preparing data
Nina Zumel and I are proud to announce that our R vtreat variable treatment library has just been accepted by CRAN!
Continue reading vtreat up on CRAN!