
What is new in the vtreat library?

The Win-Vector LLC vtreat library is a library we supply (under a GPL license) for automating the simple, domain-independent part of variable cleaning and preparation.

The idea is that you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets), and vtreat returns a data structure that can be used to prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the following sense:

  • All result columns are numeric.
  • No odd type columns (dates, lists, matrices, and so on) are present.
  • No columns have NA, NaN, +-infinity.
  • Categorical variables are expanded into multiple indicator columns with all levels present, which is a good encoding if you are using any sort of regularization in your modeling technique.
  • No rare indicators are encoded (limiting the number of indicators on the translated data.frame).
  • Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
  • Novel levels (levels not seen during design/train phase) do not cause NA or errors.
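
As a rough illustration of the first three properties, here is a minimal base-R sketch of a check you could run on any prepared frame before modeling. The helper is_model_ready and the column names in the hand-built clean frame are made up for this example; they are not part of vtreat and are not the names vtreat generates.

```r
# Hypothetical helper (not part of vtreat): TRUE when every column is
# numeric and holds only finite values (no NA, NaN, Inf or -Inf).
is_model_ready <- function(d) {
  all(vapply(d,
             function(col) is.numeric(col) && all(is.finite(col)),
             logical(1)))
}

# A raw frame with a character column, a missing value and an infinity.
raw <- data.frame(x = c('a', 'b', NA),
                  z = c(1, NA, Inf),
                  stringsAsFactors = FALSE)

# A hand-built stand-in for a treated frame (illustrative column names).
clean <- data.frame(x_is_a = c(1, 0, 0),
                    z_filled = c(1, 1, 1),
                    z_was_bad = c(0, 1, 1))

print(is_model_ready(raw))
print(is_model_ready(clean))
```

The raw frame fails the check and the hand-built frame passes it. A check like this is cheap enough to run right before every model fit or scoring call.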

The idea is that vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain-specific steps. vtreat also leaves as much of variable selection as possible to the down-stream modeling software. The goal of vtreat is to reliably (and repeatably) generate a data.frame that is safe to work with.

This note explains a few things that are new in the vtreat library.

The typical use of vtreat is to defend down-stream modeling code from all kinds of typical incoming data problems. Such issues include:

  • NA, NaN, +-infinity
  • Categoricals with very large numbers of levels.
  • Odd types (dates, matrix, and more).
  • Novel levels (levels not seen during design/train phase).
  • Outlier values.
  • Variables that don’t move.

These are all things that “shouldn’t happen” but do happen often enough that you want systematic notifications, treatments and defenses against them. Uncaught, these issues can cause your model to error-out or skip examples during scoring (novel levels often cause this), or they can lurk subtly, causing a (large or small) unnoticed loss in model quality.
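
To make the novel-level problem concrete, here is a small base-R sketch that uses no vtreat at all. The data and variable names are invented for illustration. A linear model is fit on a factor with levels 'a' and 'b', and is then asked to score a frame that contains the unseen level 'c'.

```r
# Training data: the categorical variable only takes the levels 'a' and 'b'.
train <- data.frame(x = factor(c('a', 'a', 'b', 'b')),
                    y = c(1, 2, 3, 5))
model <- lm(y ~ x, data = train)

# Application data arrives with a level never seen during training.
test <- data.frame(x = factor(c('a', 'c')))

# predict() stops with an error rather than scoring the rows it could
# score; we capture the error message instead of halting the script.
result <- tryCatch(
  predict(model, newdata = test),
  error = function(e) conditionMessage(e)
)
print(result)
```

Here result ends up holding an error message rather than predictions, even though the first test row is perfectly scoreable. This is the kind of failure that tends to show up only in production, when new data first meets the model.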

A typical use looks like the following:

library('vtreat')
# our design and training data frame
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
print(dTrainC)

# build the treatment plan on the training frame
# (the outcome column 'y' is excluded from the list of input variables)
treatmentsC <- designTreatmentsC(dTrainC,setdiff(colnames(dTrainC),'y'),'y',TRUE)
# treat the training frame and use this treated frame to build models
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneLevel=c())
print(dTrainCTreated)

# later, new test or application data arrives
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
print(dTestC)
# use the treatment plan to prepare this frame
dTestCTreated <- prepare(treatmentsC,dTestC,pruneLevel=c())
print(dTestCTreated)

vtreat was designed to package and automate some of the more common steps from section 4.1 of Practical Data Science with R. This is not a replacement for actually looking at the data. The automation is just to leave the data scientist more time to work on important domain-specific adaptations and transformations. Similarly, vtreat does a little variable scoring, but leaves the bulk of variable selection to the modeling technique the data scientist chooses to use after treatment. We want vtreat to be very light-weight and easy to combine with other libraries.

A few things have been added since we introduced the Win-Vector LLC basic variable preparation library. In particular:

  • You can now install directly from Github using Hadley Wickham’s devtools package! The R-code is as follows:

    install.packages("devtools")
    devtools::install_github("WinVector/vtreat")
    


    Previously you had to download and install the tar file by hand.

  • A bit more documentation. Example:

    library('vtreat')
    help(vtreat)

  • Package now includes tests!
  • vtreat now looks for and warns about unexpected and exotic types in incoming data.frames!
  • Variable pre-scoring is much more efficient in both time and space.
  • Categorical impact coding is now properly Bayesian (previous versions of vtreat used a 0/1 regression encoding).
  • Outlier values can now be collared or Winsorized.
  • Logit transformation removed from package (not justifiable as a sufficiently general package feature).
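
For readers who have not met collaring before, the following base-R sketch shows the general idea: limits are estimated on the training data, and later values are clamped into that range. The helper collar and the choice of the 5% and 95% quantiles are assumptions made for this illustration; this is not vtreat's implementation, and vtreat's own limits may be chosen differently.

```r
# Hypothetical helper (not vtreat's API): clamp x into [lower, upper].
# NA values pass through unchanged.
collar <- function(x, lower, upper) {
  pmin(pmax(x, lower), upper)
}

# Estimate the limits on training data only (here: 5% and 95% quantiles).
train_z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)
limits <- quantile(train_z, probs = c(0.05, 0.95),
                   na.rm = TRUE, names = FALSE)

# Apply the same limits to new data, including values outside the range.
new_z <- c(-50, 5, 2000, NA)
collared <- collar(new_z, limits[1], limits[2])
print(collared)
```

The important design point is that the limits come from the design/training data and are then reused unchanged on test or application data, in the same way a treatment plan is built once and applied many times.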

We strongly encourage all data scientists to incorporate vtreat (or something like it) into their workflow.

One thought on “What is new in the vtreat library?”

  1. And some more:

    Variable scoring is now (optionally) parallelized and tries to work out of sample in more circumstances.
