Posted on Categories Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , ,

vtreat: prepare data

This article is on preparing data for modeling in R using vtreat.

Vtreat

Our example

Suppose we wish to work with some data. Our example task is to train a classification model for credit approval using the ranger implementation of the random forests method. We will take our data from John Ross Quinlan's re-processed "credit approval" dataset hosted at Lichman, M. (2013). UCI Machine Learning Repository, http://archive.ics.uci.edu/ml; Irvine, CA: University of California, School of Information and Computer Science.

For convenience we have copied the data to our working directory here. We start by loading the data, identifying the outcome, and splitting the data into training and evaluation portions:

# load data
d <- read.table(
  'crx.data.txt',
  header = FALSE,
  sep = ',',
  stringsAsFactors = FALSE,
  na.strings = '?'
)

# prepare outcome column and level
outcome <- 'V16'
positive <- '+'
d[[outcome]] <- as.factor(d[[outcome]])

# identify variables
vars <- setdiff(colnames(d), outcome)

# split into train and test/evaluation
set.seed(25325)
isTrain <- runif(nrow(d)) <= 0.8
dTrain <- d[isTrain, , drop = FALSE]
dTest <- d[!isTrain, , drop = FALSE]
rm(list = 'd')

Without vtreat

We could try to model directly on the original variables without vtreat as below.

library("ranger")
f <- paste(outcome, 
           paste(vars, collapse = ' + '), 
           sep = ' ~ ')
model <- ranger(as.formula(f),  
                probability = TRUE,
                data = dTrain)
 #  Error: Missing data in columns: V1, V2, V4, V5, V6, V7, V14.

ranger signaled it did not want to work with this data as there are variables missing values in the training set. This is only one of the many potentially analysis ruining issues that can be lurking in real world data (and as always, detected or signaled errors are better news than undetected errors!).

We name just a few of the issues that could be lurking in real world data:

– Missing values.
– Categorical variable levels that occur in the evaluation set, but were not in the training set (bad luck).
– Categorical variables with large sets of levels.

With vtreat

vtreat is designed to prepare your data for analysis in a statistically sound manner. After vtreat processing all columns are numeric, there are no missing values, and the information in the original data is preserved (modulo user discretion in selecting variables and categorical variable levels).

vtreat::prepare() is roughly a powered-up replacement for stats::model.matrix() (which itself is implicit in many R modeling work-flows).

Let's finish the modeling task with vtreat.

Cross frame experiment

First we run a "cross frame experiment." A cross frame experiment is a sophisticated modeling step that:

– Collects statistics on the relations between your original modeling variables and the training outcome.
– Introduces proposed new transformed modeling variables.
– Produces a "simulated out of sample" training frame that is a version of your model training data prepared in such a way as to simulate having been produced without having looked at that same data during the design of the variable transformations. This is an attempt to eliminate any nested model bias introduced by the transform design and application step. This step is performed using cross-validation inspired methods.

In short vtreat thinks very hard on your data (called the design phase). We call vtreat and pull out the promised results as follows.

library("vtreat")

# Run a "cross frame experiment"
cfe <- vtreat::mkCrossFrameCExperiment(dTrain, vars,
                                       outcome, positive)

# get the "treatment plan" or mapping from original variables
# to derived variables.
plan <- cfe$treatments

# get the performance statistics on the derived variables.
sf <- plan$scoreFrame

# get the simulated out of sample transformed training data.frame
treatedTrain <- cfe$crossFrame

Training a model

The scoreFrame collects estimated effects sizes and significances on the new modeling variables. The analyst can use this to evaluate and choose modeling variables. In this example we will use all the derived variables that have a training significance below 1/NumberOfVariables (which is a fun way to try and avoid multiple comparison bias in picking modeling variables).

newVars <- sf$varName[sf$sig<1/nrow(sf)]

We now build our model using our transformed training data (the cross frame) and our chosen variables.

f <- paste(outcome, 
           paste(newVars, collapse = ' + '), 
           sep = ' ~ ')
model <- ranger(as.formula(f), 
                probability = TRUE,
                data = treatedTrain)

Evaluating a model

We now prepare a transformed version of the evaluation/test frame using the treatment plan.
This is done directly using vtreat::prepare. We could have converted our training data the same way, but as we said it is more rigorous to use the supplied cross frame (though the nested model bias we are trying to avoid is strongest on high-cardinality categorical variables, which are not present in this example). In a real world applications you would keep the plan data structure around to treat or prepare any future application data for model application.

treatedTest <- vtreat::prepare(plan, dTest, 
                               pruneSig = NULL, 
                               varRestriction = newVars)

pred <- predict(model, 
                data=treatedTest, 
                type='response')
treatedTest$pred <- pred$predictions[,positive]

And we are now ready to examine the model's out of sample performance. In this case we are going to use ROC/AUC as our evaluation.

library("WVPlots")
WVPlots::ROCPlot(treatedTest, 
                 'pred', outcome, positive,
                 'test performance')

Eval 1

And that's it, we have fit and evaluated a model!

Variable statistics

Let's take a moment and look at the vtreat supplied variable statistics. These are roughly the predictive power of each derived column treated as a single variable model. Obviously this isn't always going to always correlate with the performance of the variable as part of a joint model; but frankly in real world problems this measure is in fact a useful heuristic.

vtreat reports both an effect size (in this case rsq which for categorization is a pseudo-Rsquared, or portion of deviance explained) and a significance estimate. vtreat also reports the new variable name (`varName`), the original column the variable was derived from (`origName`), and the transformation performed (`code`).

Below we add an indicator ("`take`") showing if we used the variable in our model and exhibit a few rows of scoreFrame.

sf$take <- sf$varName %in% newVars
head(sf[, c('varName', 'rsq', 'sig', 'origName', 'code', 'take')])
 #       varName          rsq          sig origName  code  take
 #  1 V1_lev_x.a 7.475340e-04 0.4485378512       V1   lev FALSE
 #  2 V1_lev_x.b 1.694203e-04 0.7182571422       V1   lev FALSE
 #  3    V1_catP 1.878477e-05 0.9043753424       V1  catP FALSE
 #  4    V1_catB 4.921671e-04 0.5385998340       V1  catB FALSE
 #  5   V2_clean 2.191721e-02 0.0000406801       V2 clean  TRUE
 #  6   V2_isBAD 9.136120e-03 0.0080629087       V2 isBAD  TRUE

One thing I would like to call out is that categorical variables (such as "V1") are eligible for a great number of transformations including:

– Retaining non-negligible levels as dummy or indicator variables (code: "lev").
– Re-encoding the entire column as an effect code or impact code (code: "catB"). vtreat can also take a user supplied per-level significance filter to control the formation of this encoding.
– Other statistics, such as per-level prevalence (allows pooling of rare or common events) (code: "catP").

Conclusion

And that is it: vtreat data preparation.

vtreat data preparation is sound, very powerful, and can greatly improve the quality of your predictive models. The package is available from CRAN here and includes a large number of worked vignettes. We have a formal write-up of the technique here, many articles and tutorials on the methodology here, and a good central resource is here.

I strongly advise adding vtreat to your data science or predictive analytic work-flows.

And a thanks to Dmitry Larko of h2o.ai for his generous advocacy: