Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , 2 Comments on Partial Pooling for Lower Variance Variable Encoding

Partial Pooling for Lower Variance Variable Encoding


Terraces
Banaue rice terraces. Photo: Jon Rawlinson

In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R.

We will use the lme4 package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The lme4 documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.

The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….

The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…

– Gelman and Hill 2007, Chapter 11.4

We will also restrict ourselves to the case that vtreat considers: partially pooled estimates of conditional group expectations, with no other predictors considered.

Continue reading Partial Pooling for Lower Variance Variable Encoding

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Custom Level Coding in vtreat

Custom Level Coding in vtreat

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.

Speed

Level coding is like measurement: it summarizes categories of individuals into useful numbers. Source: USGS

By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.

So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.

Continue reading Custom Level Coding in vtreat

Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , 3 Comments on Upcoming data preparation and modeling article series

Upcoming data preparation and modeling article series

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.


Vtreat

vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. Continue reading Upcoming data preparation and modeling article series

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , , 1 Comment on More documentation for Win-Vector R packages

More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing the pkgdown tool.)

Please check them out (hint: vtreat is our favorite).

NewImage Continue reading More documentation for Win-Vector R packages

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , ,

Encoding categorical variables: one-hot and beyond

(or: how to correctly use xgboost from R)

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

dTrain <-  data.frame(x= c('a','b','b', 'c'),
                      y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
## 
## Call:
## lm(formula = y ~ x, data = dTrain)
## 
## Residuals:
##          1          2          3          4 
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
## 
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:   -0.5 
## F-statistic:   0.5 on 2 and 1 DF,  p-value: 0.7071

Continue reading Encoding categorical variables: one-hot and beyond

Posted on Categories Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , ,

vtreat: prepare data

This article is on preparing data for modeling in R using vtreat.

Vtreat Continue reading vtreat: prepare data

Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, StatisticsTags , , , , , 4 Comments on Data Preparation, Long Form and tl;dr Form

Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.


NewImage

Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form. Continue reading Data Preparation, Long Form and tl;dr Form

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , ,

vtreat data cleaning and preparation article now available on arXiv

Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP].

vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: infinity, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real world data preparation and cleaning.

We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications. Continue reading vtreat data cleaning and preparation article now available on arXiv

Posted on Categories Exciting Techniques, Expository Writing, Opinion, StatisticsTags , , , , , , , 1 Comment on Laplace noising versus simulated out of sample methods (cross frames)

Laplace noising versus simulated out of sample methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, that Nina and I respect but don’t actually use in production.


NewImage

Nested dolls, Wikimedia Commons

Please read on for my discussion of some of the limitations of the technique, and how we solve the problem for impact coding (also called “effects codes”), and a worked example in R. Continue reading Laplace noising versus simulated out of sample methods (cross frames)

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , ,

Some vtreat design principles

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design.

Introduction

vtreat is something we really feel you you should add to your predictive analytics or data science work flow.


NewImage
vtreat getting a call-out from Dmitry Larko, photo Erin LeDell

vtreat’s design and implementation follows from a number of reasoned assumptions or principles, a few of which we discuss below.

Continue reading Some vtreat design principles