Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , ,

A demonstration of vtreat data preparation

This article is a demonstration the use of the R vtreat variable preparation package followed by caret controlled training.

In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.

In this example we are going to show what building a predictive model using vtreat best practices looks like assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but want what small effort is required to add vtreat to your predictive modeling practice.

First we set things up: load libraries, initialize parallel processing.

library('vtreat')
library('caret')
library('gbm')
library('doMC')
library('WVPlots') # see https://github.com/WinVector/WVPlots

# parallel for vtreat
ncores <- parallel::detectCores()
parallelCluster <- parallel::makeCluster(ncores)
# parallel for caret
registerDoMC(cores=ncores)

The we load our data for analysis. We are going to build a model predicting an income level from other demographic features. The data is taken from here and you can perform all of the demonstrated steps if you download the contents of the example git directory. Obviously this has a lot of moving parts (R, R Markdown, Github, R packages, devtools)- but is very easy to do a second time (first time can be a bit of learning and preparation).

# load data
# data from: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
colnames <-
  c(
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'class'
  )
dTrain <- read.table(
  'adult.data.txt',
  header = FALSE,
  sep = ',',
  strip.white = TRUE,
  stringsAsFactors = FALSE,
  na.strings = c('NA', '?', '')
)
colnames(dTrain) <- colnames
dTest <- read.table(
  'adult.test.txt',
  skip = 1,
  header = FALSE,
  sep = ',',
  strip.white = TRUE,
  stringsAsFactors = FALSE,
  na.strings = c('NA', '?', '')
)
colnames(dTest) <- colnames

Now we use vtreat to prepare the data for analysis. The goal of vtreat is to ensure a ready-to-dance data frame in a statistically valid manner. We are respecting the test/train split and building our data preparation plan only on the training data (though we do apply it to the test data). This step helps with a huge number of potential problems through automated repairs:

  • re-encoding missing values
  • dealing with large cardinality categorical variables
  • dealing with novel levels
  • fixing variable/column names to be "R safe"
  • looking for strange column types
# define problem
yName <- 'class'
yTarget <- '>50K'
varNames <- setdiff(colnames,yName)

# build variable encoding plan and prepare simulated out of sample
# training fame (cross-frame) 
# http://www.win-vector.com/blog/2016/05/vtreat-cross-frames/
system.time({
  cd <- vtreat::mkCrossFrameCExperiment(dTrain,varNames,yName,yTarget,
                                        parallelCluster=parallelCluster)
  scoreFrame <- cd$treatments$scoreFrame
  dTrainTreated <- cd$crossFrame
  # pick our variables
  newVars <- scoreFrame$varName[scoreFrame$sig<1/nrow(scoreFrame)]
  dTestTreated <- vtreat::prepare(cd$treatments,dTest,
                                  pruneSig=NULL,varRestriction=newVars)
})
##    user  system elapsed 
##  11.340   2.760  30.872
#print(newVars)

Now we train our model. In this case we are using the caret package to tune parameters.

# train our model using caret
system.time({
  yForm <- as.formula(paste(yName,paste(newVars,collapse=' + '),sep=' ~ '))
  # from: http://topepo.github.io/caret/training.html
  fitControl <- trainControl(
    method = "cv",
    number = 3)
  model <- train(yForm,
                 data = dTrainTreated,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE)
  print(model)
  dTest$pred <- predict(model,newdata=dTestTreated,type='prob')[,yTarget]
})
## Stochastic Gradient Boosting 
## 
## 32561 samples
##    64 predictor
##     2 classes: '<=50K', '>50K' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 21707, 21708, 21707 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8476398  0.5083558
##   1                  100      0.8556555  0.5561726
##   1                  150      0.8577746  0.5699958
##   2                   50      0.8560855  0.5606650
##   2                  100      0.8593102  0.5810931
##   2                  150      0.8625042  0.5930111
##   3                   50      0.8593717  0.5789289
##   3                  100      0.8649919  0.6017707
##   3                  150      0.8660975  0.6073645
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

##    user  system elapsed 
##  61.908   2.227  36.850

Finally we take a look at the results on the held-out test data.

WVPlots::ROCPlot(dTest,'pred',yName,'predictions on test')
Score 1
WVPlots::DoubleDensityPlot(dTest,'pred',yName,'predictions on test')
Score 2
confusionMatrix <- table(truth=dTest[[yName]],pred=dTest$pred>=0.5)
print(confusionMatrix)
##         pred
## truth    FALSE  TRUE
##   <=50K. 11684   751
##   >50K.   1406  2440
testAccuarcy <- (confusionMatrix[1,1]+confusionMatrix[2,2])/sum(confusionMatrix)
testAccuarcy
## [1] 0.8675143

Notice the achieved test accuracy is in the ballpark of what was reported for this dataset.

(From [adult.names description](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names) )
Error Accuracy reported as follows, after removal of unknowns from
 |    train/test sets):
 |    C4.5       : 84.46+-0.30
 |    Naive-Bayes: 83.88+-0.30
 |    NBTree     : 85.90+-0.28

We can also compare accuracy on the "complete cases":

dTestComplete <- dTest[complete.cases(dTest[,varNames]),]
confusionMatrixComplete <- table(truth=dTestComplete[[yName]],
                                 pred=dTestComplete$pred>=0.5)
print(confusionMatrixComplete)
##         pred
## truth    FALSE  TRUE
##   <=50K. 10618   742
##   >50K.   1331  2369
testAccuarcyComplete <- (confusionMatrixComplete[1,1]+confusionMatrixComplete[2,2])/
  sum(confusionMatrixComplete)
testAccuarcyComplete
## [1] 0.8623506
# clean up
parallel::stopCluster(parallelCluster)

These two scores are within noise bounds of each other, but it is our experience that missingness is often actually informative, so in addition to imputing missing values you would like to preserve some notation indicating the missingness (which vtreat does in fact do).

And that is all there is to this example. I’d like to emphasize that vtreat steps were only a few lines in one of the blocks of code. vtreat treatment can take some time, but it is usually bearable. By design it is easy to add vtreat to your predictive analytics projects.

The point is: we got competitive results on real world data, in a single try (using vtreat to prepare data and caret to tune parameters). The job of the data scientist is to actually work longer on a problem and do better. But having a good start helps.

The theory behind vtreat is fairly important to the correctness of our implementation, and we would love for you to read through some of it:

But operationally, please think of vtreat as just adding a couple of lines to your analysis scripts. Again, the raw R markdown source can be found here and a rendered copy (with results and graphs) here.

8 thoughts on “A demonstration of vtreat data preparation”

  1. For some semblance of flow and brevity this note intentionally skips over two key points in evaluating vtreat: comparing it to working without vtreat, and teaching the underlying methods. This makes me so uncomfortable I will repeat myself and re-point out where to find this vital backing material in this comment.

    It was important in our earlier writing to show the actual effect of vtreat by comparing it to similar analyses without the vtreat step. However, that meant on real world data we had to show a number of alternative preparations. This eventually becomes long (missing numeric values, missing categorical values, high degree categoricals, and so on). So in this write up we just ran our preferred work pattern and deliberately didn’t run the alternatives. For a demonstration where we do run the alternatives on very simple data I suggest the vtreat cross frame article: http://www.win-vector.com/blog/2016/05/vtreat-cross-frames/ .

    It was also important to teach the concepts, and this is done well in Nina Zumel’s white paper: http://winvector.github.io/DataPrep/EN-CNTNT-Whitepaper-Data-Prep-Using-R.pdf

  2. You guys have consistently delivered since your book; still refer to it this day – and I’ve backed it during its infancy. I’ve been watching vtreat closely. I see the examples have been mainly on binary or labeled outcomes. How do I use vtreat for continuous outcome variables? Or am I stuck on using something like designTreatmentsZ for now?

    1. Thanks, Koma! The support of backers during the MEAP (the Manning semi-public writing period) meant a lot to us and made it easier to finish the book.

      As for vtreat. For regression or continuous outcomes it is designTreatmentsN or (if you want) mkCrossFrameNExperiment. designTreatmentsZ is only for when you don’t have an outcome/dependent variable (a “y“) to work with (and does a lot less than the other methods).

      For multinomial classification we suggest running vtreat once per possible outcome and then (after renaming) cbinding in all the “cat?” variables form the various frames. I’ll try to add a vignette on the idea later.

  3. Thanks again for this new post demonstrating “vtreat” capabilities.

    There is one issue I do not fully understand in your code:

    newVars <- scoreFrame$varName[scoreFrame$sig<1/nrow(scoreFrame)]

    Why do you divide "scoreFrame$sig<1" by the number of rows of scoreFrame ?.
    I reviewed the different results of scoreFrame$sig<1, (scoreFrame$sig/nrow(scoreFrame) <1, and what you coded. I do not see why if you just check for which columns are selected with just "scoreFrame$sig <1" are different to the ones you get with "scoreFrame$sig<1/nrow(scoreFrame)", when most of the columns are already lower than 1 in their "sig" column.

    Also, by using a different approach to vtreat for high cardinality variables. I got a better result when using a minFraction is in the order of 0.0002 rather than 0.02 which is the default you set. You can see that in this example that you can get values of "testAccuracyComplete" of 0.8695 rather than 0.8677. By tweaking this minFraction you can get a significative jump in accuracy when in data competitions you fight for increasing the fourth decimal point.

    Thanks again,
    Carlos.

    1. Picking the significance cutoff (and doing it in a cross-validated way that is not peeking at the test data) is a big and important issue.

      We try to just give a simple “not so bad” rule of thumb. sig < 1 is “take almost every variable” (not always a good practice). The idea of 1/nrow(scoreFrame) is an adaption of some of the ideas found here: http://winvector.github.io/vtreathtml/vtreatSignificance.html . The idea is to leave as much as possible to the downstream machine learning (assuming it is also smart) without overwhelming it with useless variables. By the definition of significance at a significance ≤ epsilon cutoff we get only an epsilon fraction of useless variable leaking through. Our null hypothesis could be all nrow(scoreFrame) variables are useless and setting epsilon = 1/nrow(scoreFrame) means we expect about nrow(scoreFrame) * 1/nrow(scoreFrame) ~ 1 useless variables to pass the filter. We assume a constant number passed through is something the next stage model can then deal with. Obviously the stricter the filter the more actual useful variables also get quashed- so it is a balance.

      In data science competitions you are always going to have to tweak things for that 4th decimal place (please see this excellent paper for some thoughts on that http://arxiv.org/abs/1502.04585 ). Some great ideas do come out of these contests, but our sympathy is more towards production situations where such tuning is note always really in the client’s interest (as such small changes may not affect the business much, they are paying for the tuning time, and could possibly use the data scientist on another project).

      That being said: we would be thrilled if vtreat or ideas attributed to it (or its documentation http://winvector.github.io/vtreathtml/ ) were publicly considered as valuable in such competitions.

      We haven’t written much on the smFactor, rareCount, or rareSig controls yet. But these are all per-level controls and changing these away from the “simple safe” default values in a principled way can have big rewards. The issue is they all control the treatment of levels inside the categorical re-encodings, so this is the last place to deal with these issues as this is the last place all the levels are necessarily visible. We think offering these controls is one of the innovations of vtreat.

  4. Thanks John for the very valuable references.

    Yesterday I tried “mkCrossFrameNExperiment” with very low values of “minFraction” in the dataset I am analyzing and I got a very big number of new columns. And obviously it took a lot to process.

    I thought about the possibility to estimate how big will become the dataset when playing around with minFraction and other parameters, before applying them. Do you think that can be estimated before applying mk* functions?.

    Also, based on how long it takes to apply mk* functions in these cases, eventhough you can parallelize the process, I do not know if when creating many new columns it is possible to optimize the process by using data.table functionalities. Well, consider this just random thoughts…

    Kind Regards and thanks again for your clear indications.
    Carlos.

    1. I wasn’t quite accurate in my earlier comment (now edited). minFraction needs to be kept a fair bit above zero as a categorical variable can expand to as many as 1/minFraction indicator columns in addition to the catX encoded columns.

Comments are closed.