
vtreat cross frames

John Mount, Nina Zumel

2016-05-05

As a follow-on to “On Nested Models,” we work through R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat.

Consider the following data frame. The outcome only depends on the “good” variables, not on the (high degree of freedom) “bad” variables. Modeling such a data set runs a high risk of overfit.

set.seed(22626)

mkData <- function(n) {
  d <- data.frame(xBad1=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
                  xBad2=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
                  xBad3=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
                  xGood1=rnorm(n),
                  xGood2=rnorm(n))
  
  # outcome only depends on "good" variables
  d$y <- rnorm(nrow(d))+0.2*d$xGood1 + 0.3*d$xGood2>0.5
  # the random group used for splitting the data set, not a variable.
  d$rgroup <- sample(c("cal","train","test"),nrow(d),replace=TRUE)  
  d
}

d <- mkData(2000)

# devtools::install_github("WinVector/WVPlots")
# library('WVPlots')
plotRes <- function(d,predName,yName,title) {
  print(title)
  tab <- table(truth=d[[yName]],pred=d[[predName]]>0.5)
  print(tab)
  diag <- sum(vapply(seq_len(min(dim(tab))),
                     function(i) tab[i,i],numeric(1)))
  acc <- diag/sum(tab)
#  if(requireNamespace("WVPlots",quietly=TRUE)) {
#     print(WVPlots::ROCPlot(d,predName,yName,title))
#  }
  print(paste('accuracy',acc))
}

The Wrong Way

Bad practice: use the same set of data to prepare variable encoding and train a model.

dTrain <- d[d$rgroup!='test',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dTrain,c('xBad1','xBad2','xBad3','xGood1','xGood2'),
                                        'y',TRUE,
  rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Sat May 28 11:03:34 2016"
## [1] "design var xBad1 Sat May 28 11:03:34 2016"
## [1] "design var xBad2 Sat May 28 11:03:34 2016"
## [1] "design var xBad3 Sat May 28 11:03:34 2016"
## [1] "design var xGood1 Sat May 28 11:03:34 2016"
## [1] "design var xGood2 Sat May 28 11:03:34 2016"
## [1] "scoring treatments Sat May 28 11:03:34 2016"
## [1] "have treatment plan Sat May 28 11:03:34 2016"
## [1] "rescoring complex variables Sat May 28 11:03:34 2016"
## [1] "done rescoring complex variables Sat May 28 11:03:35 2016"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
  pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problems
)
m1 <- glm(y~xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + xGood2_clean,
          data=dTrainTreated,family=binomial(link='logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(summary(m1))  # notice low residual deviance
## 
## Call:
## glm(formula = y ~ xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + 
##     xGood2_clean, family = binomial(link = "logit"), data = dTrainTreated)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.70438   0.00000   0.00000   0.03995   2.61063  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.5074     0.3350  -1.515  0.12983    
## xBad1_catB     2.9432     0.5549   5.304 1.13e-07 ***
## xBad2_catB     2.5338     0.5857   4.326 1.52e-05 ***
## xBad3_catB     3.4172     0.6092   5.610 2.03e-08 ***
## xGood1_clean   0.7288     0.2429   3.001  0.00269 ** 
## xGood2_clean   0.7788     0.2585   3.012  0.00259 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1724.55  on 1331  degrees of freedom
## Residual deviance:  132.59  on 1326  degrees of freedom
## AIC: 144.59
## 
## Number of Fisher Scoring iterations: 12
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   848   18
##   TRUE      6  460
## [1] "accuracy 0.981981981981982"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
##        pred
## truth   FALSE TRUE
##   FALSE   360  114
##   TRUE    153   41
## [1] "accuracy 0.600299401197605"

Notice above that we see a training accuracy of 98% and a test accuracy of 60%. Also notice that the downstream model (the glm) erroneously thinks the xBad*_catB variables are significant (due to the large number of degrees of freedom hidden from the downstream model by the impact/effect coding).

The Right Way: A Calibration Set

Now try a proper calibration/train/test split:

dCal <- d[d$rgroup=='cal',,drop=FALSE]
dTrain <- d[d$rgroup=='train',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]

# a nice heuristic, 
# expect only a constant number of noise variables to sneak past
pruneSig <- 1/ncol(dTrain) 
treatments <- vtreat::designTreatmentsC(dCal,
                                        c('xBad1','xBad2','xBad3','xGood1','xGood2'),
                                        'y',TRUE,
  rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Sat May 28 11:03:35 2016"
## [1] "design var xBad1 Sat May 28 11:03:35 2016"
## [1] "design var xBad2 Sat May 28 11:03:35 2016"
## [1] "design var xBad3 Sat May 28 11:03:35 2016"
## [1] "design var xGood1 Sat May 28 11:03:35 2016"
## [1] "design var xGood2 Sat May 28 11:03:35 2016"
## [1] "scoring treatments Sat May 28 11:03:35 2016"
## [1] "have treatment plan Sat May 28 11:03:35 2016"
## [1] "rescoring complex variables Sat May 28 11:03:35 2016"
## [1] "done rescoring complex variables Sat May 28 11:03:35 2016"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
  pruneSig=pruneSig)
newvars <- setdiff(colnames(dTrainTreated),'y')
m1 <- glm(paste('y',paste(newvars,collapse=' + '),sep=' ~ '),
          data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1))  
## 
## Call:
## glm(formula = paste("y", paste(newvars, collapse = " + "), sep = " ~ "), 
##     family = binomial(link = "logit"), data = dTrainTreated)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5267  -0.9204  -0.6939   1.1678   2.3039  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.69455    0.08873  -7.828 4.97e-15 ***
## xGood1_clean  0.39673    0.08591   4.618 3.87e-06 ***
## xGood2_clean  0.55549    0.09621   5.774 7.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 832.55  on 642  degrees of freedom
## Residual deviance: 771.74  on 640  degrees of freedom
## AIC: 777.74
## 
## Number of Fisher Scoring iterations: 4
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   377   41
##   TRUE    160   65
## [1] "accuracy 0.687402799377916"
dTestTreated <- vtreat::prepare(treatments,dTest,
                                pruneSig=pruneSig)
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
##        pred
## truth   FALSE TRUE
##   FALSE   424   50
##   TRUE    150   44
## [1] "accuracy 0.70059880239521"

Notice above that we now see training and test accuracies of about 70%. We have defeated overfit in two ways: training performance is closer to test performance, and test performance is better. Also, the model now properly considers the “bad” variables to be insignificant.

Another Right Way: Cross-Validation

Below is a more statistically efficient practice: building a cross training frame.

The intuition

Consider any trained statistical model (in this case our treatment plan and variable selection plan) as a two-argument function f(A,B). The first argument is the training data and the second argument is the application data. In our case f(A,B) is: designTreatmentsC(A) %>% prepare(B), and it produces a treated data frame.

When we use the same data in both places to build our training frame, as in

TrainTreated = f(TrainData,TrainData),

we are not doing a good job simulating the future application of f(,), which will be f(TrainData,FutureData).

To improve the quality of our simulation we can call

TrainTreated = f(CalibrationData,TrainData)

where CalibrationData and TrainData are disjoint datasets (as we did in the earlier example) and expect this to be a good imitation of future f(CalibrationData,FutureData).

Cross-Validation and vtreat: The Cross-Frame

Another approach is to build a “cross validated” version of f. We split TrainData into a list of 3 disjoint row intervals: Train1,Train2,Train3. Instead of computing f(TrainData,TrainData) compute:

TrainTreated = f(Train2+Train3,Train1) + f(Train1+Train3,Train2) + f(Train1+Train2,Train3)

(where + denotes rbind()).

The idea is this looks a lot like f(TrainData,TrainData) except it has the important property that no row in the right-hand side is ever worked on by a model built using that row (a key characteristic that future data will have) so we have a good imitation of f(TrainData,FutureData).
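To make the three-fold construction concrete, here is a toy sketch of a hand-built cross-frame for a single impact-coded categorical variable. This is illustrative only (not vtreat's actual implementation); the names impactCode, naive, and crossed are ours, with impactCode playing the role of f(A,B): fit per-level means on A, apply them to B.

```r
# Toy illustration only, not vtreat's implementation.
# impactCode(A,B) is our f(A,B): fit per-level means of y on A, apply to B.
set.seed(2016)
toy <- data.frame(x = sample(paste0('lv', 1:50), 300, replace = TRUE),
                  y = rnorm(300),
                  stringsAsFactors = FALSE)
impactCode <- function(A, B) {
  means <- tapply(A$y, A$x, mean)   # "model" fit on A
  enc <- means[B$x] - mean(A$y)     # applied to B
  enc[is.na(enc)] <- 0              # levels unseen in A encode to 0
  as.numeric(enc)
}
# naive frame: f(TrainData, TrainData)
naive <- impactCode(toy, toy)
# cross-frame: 3 disjoint folds, each row encoded by a model that never saw it
fold <- sample(1:3, nrow(toy), replace = TRUE)
crossed <- numeric(nrow(toy))
for (k in 1:3) {
  crossed[fold == k] <- impactCode(toy[fold != k, ], toy[fold == k, ])
}
# x is pure noise, yet the naive encoding looks correlated with y;
# the cross-frame encoding does not show this spurious correlation
print(cor(naive, toy$y))
print(cor(crossed, toy$y))
```

vtreat's mkCrossFrameCExperiment performs this same style of construction, for all the treated variables at once.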

In other words: we use cross validation to simulate future data. The main thing we are doing differently is remembering that we can apply cross validation to any two-argument function f(A,B), and not only to functions of the form f(A,B) = buildModel(A) %>% scoreData(B). We can use this formulation in stacking or super learning with f(A,B) of the form buildSubModels(A) %>% combineModels(B) (to produce a stacked or ensemble model); the idea applies to improving ensemble methods in general.

See:

  • “General oracle inequalities for model selection” Charles Mitchell and Sara van de Geer
  • “On Cross-Validation and Stacking: Building seemingly predictive models on random data” Claudia Perlich and Grzegorz Swirszcz
  • “Super Learner” Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard

In fact (though it was developed independently) you can think of vtreat as a super learner.

In super learning, cross validation techniques are used to simulate having built sub-model predictions on novel data. The simulated out-of-sample applications of these sub-models (and not the sub-models themselves) are then used as input data for the next-stage learner. In future application the actual sub-models are applied and their immediate outputs are used by the super model.
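As a small concrete analogy (our own simplified sketch, not vtreat code), here is stacking done in this style: two single-variable lm sub-models, with the combining model trained only on their out-of-fold (cross-validated) predictions.

```r
# Simplified stacking sketch: f(A,B) = buildSubModels(A) %>% combineModels(B)
set.seed(2016)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(n)
fold <- sample(1:3, n, replace = TRUE)
subPred <- matrix(NA_real_, nrow = n, ncol = 2)
for (k in 1:3) {
  trainK <- dat[fold != k, ]
  mA <- lm(y ~ x1, data = trainK)  # sub-model A
  mB <- lm(y ~ x2, data = trainK)  # sub-model B
  subPred[fold == k, 1] <- predict(mA, newdata = dat[fold == k, ])
  subPred[fold == k, 2] <- predict(mB, newdata = dat[fold == k, ])
}
# the combiner only ever sees simulated out-of-sample sub-model outputs
combiner <- lm(y ~ pA + pB,
               data = data.frame(y = dat$y, pA = subPred[, 1], pB = subPred[, 2]))
print(coef(combiner))
```

In application one would refit the sub-models on all of TrainData and feed their direct predictions to the combiner, mirroring how the treatment plan (not the cross-frame) is used on future data.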

[Figure: super learner schematic]

In vtreat the sub-models are single-variable treatments, and the outer model construction is left to the practitioner (using the cross-frames for simulation and not the treatment plan). In application the treatment plan is used.

[Figure: vtreat cross-frame schematic]

Example

Below is the example cross-run. The function mkCrossFrameCExperiment returns a treatment plan for use in preparing future data, and a cross-frame for use in fitting a model.

dTrain <- d[d$rgroup!='test',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
prep <- vtreat::mkCrossFrameCExperiment(dTrain,
           c('xBad1','xBad2','xBad3','xGood1','xGood2'),
           'y',TRUE,
           rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problems
)
treatments <- prep$treatments
print(treatments$scoreFrame[,c('varName','sig')])
##        varName          sig
## 1   xBad1_catP 8.172186e-01
## 2   xBad1_catB 5.676541e-01
## 3   xBad2_catP 7.441869e-01
## 4   xBad2_catB 5.793585e-01
## 5   xBad3_catP 4.342227e-01
## 6   xBad3_catB 1.770493e-01
## 7 xGood1_clean 6.072599e-12
## 8 xGood2_clean 8.286789e-21
# vtreat::mkCrossFrameCExperiment doesn't take a pruneSig argument, but we can
# prune on our own.
print(pruneSig)
## [1] 0.1428571
newvars <- treatments$scoreFrame$varName[treatments$scoreFrame$sig<=pruneSig]
# force in bad variables, to show we "belt and suspenders" deal with them
# in that things go well in the cross-frame even if they sneak past pruning
newvars <- sort(union(newvars,c("xBad1_catB","xBad2_catB","xBad3_catB")))
print(newvars)
## [1] "xBad1_catB"   "xBad2_catB"   "xBad3_catB"   "xGood1_clean"
## [5] "xGood2_clean"
dTrainTreated <- prep$crossFrame

We forced the undesirable xBad*_catB variables back in to demonstrate that even if they sneak past a loose pruneSig, the cross-frame lets the downstream model deal with them correctly. To ensure more consistent filtering of the complicated variables one can increase the ncross argument in vtreat::mkCrossFrame?Experiment.

Now we fit the model to the cross-frame rather than to prepare(treatments, dTrain) (the treated training data).

m1 <- glm(paste('y',paste(newvars,collapse=' + '),sep=' ~ '),
          data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1))  
## 
## Call:
## glm(formula = paste("y", paste(newvars, collapse = " + "), sep = " ~ "), 
##     family = binomial(link = "logit"), data = dTrainTreated)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6842  -0.9236  -0.6573   1.1824   2.3257  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.690824   0.091313  -7.565 3.87e-14 ***
## xBad1_catB    0.001813   0.017218   0.105    0.916    
## xBad2_catB   -0.023835   0.017128  -1.392    0.164    
## xBad3_catB    0.024460   0.016978   1.441    0.150    
## xGood1_clean  0.404827   0.061885   6.542 6.09e-11 ***
## xGood2_clean  0.570083   0.064988   8.772  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1724.6  on 1331  degrees of freedom
## Residual deviance: 1587.7  on 1326  degrees of freedom
## AIC: 1599.7
## 
## Number of Fisher Scoring iterations: 4
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   776   90
##   TRUE    335  131
## [1] "accuracy 0.680930930930931"
dTestTreated <- vtreat::prepare(treatments,dTest,
                                pruneSig=c(),varRestriction=newvars)
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
##        pred
## truth   FALSE TRUE
##   FALSE   421   53
##   TRUE    145   49
## [1] "accuracy 0.703592814371258"

We again get the better 70% test accuracy. And this is a more statistically efficient technique, as we did not have to reserve a portion of the data for a separate calibration set.

The model fit to the cross-frame behaves similarly to the model produced via the process f(CalibrationData, TrainData). Notice that the xBad*_catB variables fail to achieve significance in the downstream glm model, allowing that model to give them small coefficients and even (if need be) prune them out. This is the point of using a cross frame: as we saw in the first example, the xBad*_catB variables are hard to remove once they make it into a standard (non-cross) frame, because they hide a large number of degrees of freedom from downstream modeling procedures.

2 thoughts on “vtreat cross frames”

  1. Thanks for the clear explanation and the example provided.
    I did not know your package (vtreat) until today’s entry about “ranger” (May 31st, 2016).

    In your example you split your data into three groups (‘cal’, ‘train’, ‘test’) of equal size (1/3 each). Is that your recommendation, or is it something to adjust depending on the dataset?

    Thanks,
    Carlos.

    1. The train/cal/test split is a bit ad hoc. But for discussion see The Elements of Statistical Learning, Second Edition; Trevor Hastie, Robert Tibshirani, Jerome Friedman; Springer, 2009. In principle I would say train should be the biggest (as model quality can improve as it goes up), cal the next biggest if there are large categoricals (as we need lots of data to estimate the level effects well), and test the smallest, as test size controls only the variance of the model performance estimate. The purpose of vtreat::mkCrossFrame?Experiment is to attempt to reliably use the same data for both train and cal (to decrease how much we have to cut up our data).

      So in practice we tend to use most of the data for train/cal. The 1/3rd, 1/3rd, 1/3rd split was just for simplicity of example.
