Bad Bayes: an example of why you need hold-out testing

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.

The example is designed to imitate a common situation in predictive-analytics natural language processing. In this type of application you are often building a model over many rare text features. The rare text features are often nearly unique k-grams, and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias: you can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to fitting mere noise.

Often there is a feeling that if a model is doing really well on training data then there must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false, as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is that you are working through variations of worthless models that only appear to be good on training data due to overfitting. The more “tweaking, tuning, and fixing” you try only appears to improve things because, as you repeatedly peek at your test data (some of which you really should have held out until the very end of the project for final acceptance), the test data becomes less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).

Any researcher who does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models.

Many predictive NLP (natural language processing) applications require the use of very many very rare (almost unique) text features. A simple example would be 4-grams, or sequences of 4 consecutive words from a document. At some point you are tracking phrases that occur in only 1 or 2 documents in your training corpus. A tempting intuition is that each of these rare features is a low-utility but real clue for document classification. The hope is that if we track enough of them, then enough will be available when scoring a given document to make a reliable classification.
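For concreteness, here is a minimal sketch (in base R, with an invented example sentence) of what such 4-gram features look like:

```r
# Sketch: extract the 4-grams (runs of 4 consecutive words) of a document.
# The document text is invented purely for illustration.
doc <- "we demonstrate a data set that causes many good learners to overfit"
words <- strsplit(doc, "\\s+")[[1]]
fourGrams <- vapply(seq_len(length(words) - 3),
                    function(i) paste(words[i:(i + 3)], collapse = " "),
                    character(1))
print(fourGrams)
# a 12-word document yields 9 overlapping 4-grams
```

A real corpus yields a huge, heavy-tailed vocabulary of such phrases, most appearing only once or twice.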

These features may in fact be useful, but you must be careful to have procedures to determine which features are in fact useful and which are mere noise. The issue is that rare features are only seen in a few training examples, so it is hard to reliably estimate their value during training. We will demonstrate (in R) some absolutely useless variables masquerading as actual signal during training. Our example is artificial, but if you don’t have proper hold-out testing procedures you can easily fall into a similar trap.

Our code to create a bad example is as follows:

runExample <- function(rows,features,rareFeature,trainer,predictor) {
   print(sys.call(0)) # print call and arguments
   set.seed(123525)   # make result deterministic
   yValues <- factor(c('A','B'))
   xValues <- factor(c('a','b','z'))
   # y and group are assigned independently of all features
   d <- data.frame(y=sample(yValues,replace=T,size=rows),
                   group=sample(1:100,replace=T,size=rows))
   if(rareFeature) {
      # rare feature: parked at constant 'z' except on two random rows
      mkRandVar <- function() {
         v <- rep(xValues[[3]],rows)
         signalIndices <- sample(1:rows,replace=F,size=2)
         v[signalIndices] <- sample(xValues[1:2],replace=T,size=2)
         v
      }
   } else {
      # common feature: uniformly random 'a'/'b', still independent of y
      mkRandVar <- function() {
         sample(xValues[1:2],replace=T,size=rows)
      }
   }
   varValues <- as.data.frame(replicate(features,mkRandVar()))
   varNames <- colnames(varValues)
   d <- cbind(d,varValues)
   # split ~50/50 by the pre-assigned group, not by peeking at y
   dTrain <- subset(d,group<=50)
   dTest <- subset(d,group>50)
   formula <- as.formula(paste('y',paste(varNames,collapse=' + '),sep=' ~ '))
   model <- trainer(formula,data=dTrain)
   tabTrain <- table(truth=dTrain$y,
      predict=predictor(model,newdata=dTrain,yValues=yValues))
   print('train set results')
   print(tabTrain)
   print(fisher.test(tabTrain))
   tabTest <- table(truth=dTest$y,
      predict=predictor(model,newdata=dTest,yValues=yValues))
   print('hold-out test set results')
   print(tabTest)
   print(fisher.test(tabTest))
}

This block of code builds a universe of examples of size rows. The ground truth we are trying to predict is whether y is "A" or "B". Each row has a number of features (equal to features), and these features are "rare" if we set rareFeature=T (in which case each feature spends almost all of its time parked at the constant "z"). The point is that each and every feature in this example is random and built without looking at the actual truth values or y’s (and is therefore useless). We split the universe of data into a 50/50 test/train split, build a model on the training data, and show the performance of predicting the y-category on both the train and test sets. We use the Fisher contingency-table test to see if we have what looks like a significant model. In all cases we get a deceptively good (very low) p-value on training that does not translate into any real effect on test data. We show the effect for Naive Bayes (a common text classifier), decision trees, logistic regression, and random forests (note: for the non-Naive Bayes classifiers we use non-rare features to trick them into thinking there is a model).
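As a quick illustration of how fisher.test reads these truth-versus-prediction tables, here is a sketch using counts copied from the Naive Bayes run: a near-diagonal training table yields a tiny p-value, while the flat hold-out table does not.

```r
# Sketch: fisher.test on truth-vs-prediction contingency tables.
# Counts copied from the Naive Bayes run: near-diagonal (train) vs flat (test).
tabTrain <- matrix(c(45, 0, 2, 49), nrow = 2,
                   dimnames = list(truth = c('A','B'), predict = c('A','B')))
tabTest  <- matrix(c(17, 14, 41, 32), nrow = 2,
                   dimnames = list(truth = c('A','B'), predict = c('A','B')))
print(fisher.test(tabTrain)$p.value)  # essentially zero: looks "significant"
print(fisher.test(tabTest)$p.value)   # near 1: no detectable effect
```

The test asks whether the truth and prediction columns are associated; on the hold-out table they are not.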

Basically, if you don’t at least look at model diagnostics (such as coefficient p-values in logistic regression) or test-set significance, you can fool yourself into thinking you have a model because it looks good in training. You may even feel that with the right sort of smoothing it should at least be usable on test data. It will not be. The most you can hope for is a training procedure that notices there is no useful signal. You can’t model your way out of having no useful features.
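To make the "look at model diagnostics" advice concrete, here is a minimal sketch (the variables are invented noise, not the article's dataset) of checking per-coefficient Wald p-values from summary(glm). On pure-noise features these should look like uniform draws, with no coefficient convincingly significant:

```r
# Sketch: per-coefficient p-values from a logistic regression on noise features.
set.seed(123525)
d <- data.frame(y = rbinom(200, 1, 0.5),
                x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
m <- glm(y ~ x1 + x2 + x3, data = d, family = binomial(link = 'logit'))
# the 'Pr(>|z|)' column of the coefficient table holds the Wald p-values
print(summary(m)$coefficients[, 'Pr(>|z|)'])
```

A model whose every coefficient looks insignificant, yet which classifies training data well, is a strong hint of overfit.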

The results we get are as follows:

  • Naive Bayes train (looks good when it is not):

    > library(e1071)
    > runExample(rows=200,features=400,rareFeature=T,
        trainer=function(formula,data) { naiveBayes(formula,data) },
        predictor=function(model,newdata,yValues) { 
           predict(model,newdata,type='class')
        }
     )
    runExample(rows = 200, features = 400, rareFeature = T, trainer = function(formula, 
        data) {
        naiveBayes(formula, data)
    }, predictor = function(model, newdata, yValues) {
        predict(model, newdata, type = "class")
    })
    [1] "train set results"
         predict
    truth  A  B
        A 45  2
        B  0 49
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     131.2821      Inf
    sample estimates:
    odds ratio 
           Inf 
    
  • Naive Bayes hold-out test (is bad):

    [1] "hold-out test set results"
         predict
    truth  A  B
        A 17 41
        B 14 32
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTest
    p-value = 1
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.3752898 2.4192687
    sample estimates:
    odds ratio 
     0.9482474 
    
  • Decision tree train (looks good when it is not):

    > library(rpart)
    > runExample(rows=200,features=400,rareFeature=F,
        trainer=function(formula,data) { rpart(formula,data) },
        predictor=function(model,newdata,yValues) { 
           predict(model,newdata,type='class')
        }
     )
    runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula, 
        data) {
        rpart(formula, data)
    }, predictor = function(model, newdata, yValues) {
        predict(model, newdata, type = "class")
    })
    [1] "train set results"
         predict
    truth  A  B
        A 42  5
        B 16 33
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTrain
    p-value = 7.575e-09
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
      5.27323 64.71322
    sample estimates:
    odds ratio 
      16.69703 
    
  • Decision tree hold-out test (is bad):

    [1] "hold-out test set results"
         predict
    truth  A  B
        A 33 25
        B 27 19
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTest
    p-value = 1
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.3932841 2.1838878
    sample estimates:
    odds ratio 
     0.9295556 
    
  • Logistic regression train (looks good when it is not):

    > runExample(rows=200,features=400,rareFeature=F,
        trainer=function(formula,data) { 
           glm(formula,data,family=binomial(link='logit')) 
        },
        predictor=function(model,newdata,yValues) { 
           yValues[ifelse(predict(model,newdata=newdata,type='response')>=0.5,2,1)]
        }
     )
    runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula, 
        data) {
        glm(formula, data, family = binomial(link = "logit"))
    }, predictor = function(model, newdata, yValues) {
        yValues[ifelse(predict(model, newdata = newdata, type = "response") >= 
            0.5, 2, 1)]
    })
    [1] "train set results"
         predict
    truth  A  B
        A 47  0
        B  0 49
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     301.5479      Inf
    sample estimates:
    odds ratio 
           Inf 
    
  • Logistic regression hold-out test (is bad):

    [1] "hold-out test set results"
         predict
    truth  A  B
        A 35 23
        B 25 21
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTest
    p-value = 0.5556
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5425696 3.0069854
    sample estimates:
    odds ratio 
      1.275218 
    
    Warning messages:
    1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
      prediction from a rank-deficient fit may be misleading
    2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
      prediction from a rank-deficient fit may be misleading
    
  • Random Forests train (looks good, but is not):

    > library(randomForest)
    > runExample(rows=200,features=400,rareFeature=F,
        trainer=function(formula,data) { randomForest(formula,data) },
        predictor=function(model,newdata,yValues) { 
           predict(model,newdata,type='response')
        }
     )
    runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula, 
        data) {
        randomForest(formula, data)
    }, predictor = function(model, newdata, yValues) {
        predict(model, newdata, type = "response")
    })
    [1] "train set results"
         predict
    truth  A  B
        A 47  0
        B  0 49
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     301.5479      Inf
    sample estimates:
    odds ratio 
           Inf 
    
  • Random Forests hold-out test (is bad):

    [1] "hold-out test set results"
         predict
    truth  A  B
        A 21 37
        B 13 33
    
    	Fisher's Exact Test for Count Data
    
    data:  tabTest
    p-value = 0.4095
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5793544 3.6528127
    sample estimates:
    odds ratio 
      1.435704 
    

The point is: good training performance means nothing (unless your trainer is in fact reporting cross-validated results). To avoid overfit you must at least examine model diagnostics and per-variable coefficient significances, and you should always report results on truly held-out data. It is not enough to look only at model-fit significance on training data. An additional risk arises when you are likely to encounter a mixture of rare useful features and rare noise features: as we have illustrated above, the model-fitting procedures can’t always tell the difference between useful features and noise, so it is easy for the noise features to drown out the rare useful features in practice. This should remind all of us of the need for good variable curation, selection, and principled dimension reduction (domain-knowledge-sensitive and y-sensitive, not just broad principal components analysis). Lots of features (the so-called “wide data” style of analytics) are not always easy to work with (as opposed to “tall data,” which is always good, as you have more examples with which to falsify bad relations).
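One cheap sanity check in this spirit (a sketch of a standard label-permutation test, not code from the article) is to refit the model after shuffling the y labels. Here the data is pure noise, so the "real" and permuted fits achieve similarly high training accuracy, flagging the apparent fit as model capacity rather than signal:

```r
# Sketch: label-permutation check for overfit (all names invented).
# 50 noise features on 100 rows; training accuracy will be high either way.
set.seed(123525)
d <- data.frame(y = factor(sample(c('A', 'B'), 100, replace = TRUE)),
                matrix(rnorm(100 * 50), nrow = 100))
trainAcc <- function(data) {
  m <- glm(y ~ ., data = data, family = binomial(link = 'logit'))
  mean((predict(m, type = 'response') >= 0.5) == (data$y == 'B'))
}
accReal <- trainAcc(d)
dPerm <- d
dPerm$y <- sample(dPerm$y)  # destroy any possible relation to the features
accPerm <- trainAcc(dPerm)
print(c(real = accReal, permuted = accPerm))  # both suspiciously high
```

If a model cannot beat its own permuted-label baseline on held-out data, the training-set performance is telling you nothing.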

We took the liberty of using the title “Bad Bayes” because this is where we have most often seen the use of many weak variables without enough data to really establish per-variable significance.

For more on feature selection and model testing please see Zumel, Mount, “Practical Data Science with R”.



3 thoughts on “Bad Bayes: an example of why you need hold-out testing”

  1. The point isn’t that overfit is unexpected (it should be something you are always worried about), but that this is a reliable example of extreme overfit (there is no signal to fit). And of course everyone should already know you need hold-out test and calibration data, but there can be an “if the math is sufficiently complicated I can work around that” vibe from some data scientists. Often, instead of looking coldly at the hold-out data, you hear one of the following falsehoods: “there isn’t enough data to waste on hold-out” (due to the need to model many rare features), “random forests don’t overfit” (due to their internal cross-validation), or “it is just a matter of picking the right smoothing or shrinkage parameters to fix this” (when in fact there is no signal to fit).

    I was playing with the gbm package recently, and it seems to “fail safe” on this problem, refusing to build a model on the training data (the correct outcome). See github/BadBayesExample.md for details. And as any data scientist should know, failing to get a model is not the worst thing that can happen (sharing a wrong model is much worse).

    It is funny, this is how we did machine learning in the 1990’s: before we had easy access to data we used synthetic data sets.

  2. And a note on why I pick on NLP applications in this article. One of the luxuries of modern data science is that you can assume you have a lot of data (not considered true in the ’80s and ’90s). In this case many methods provably don’t overfit (for example linear regression: with the feature set held constant, as the amount of training data is taken to infinity, empirical estimates of expectations limit to the correct values and decisions based on them become correct). However, in NLP you often use extra training data not only to better estimate the effects of known features but to recruit new rare features, which is one of the issues the deliberately bad dataset simulates. One reason you recruit such features is that there are so many of them: when you see Zipfian-style distributions on k-grams you feel you are throwing away a big fraction of the signal if you don’t include the many, many rare features in your model. The issue is: the true effect of rare features is hard to estimate.
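That last point can be made quantitative with a small sketch (the counts are invented): a feature observed in only 2 documents, both of class "A", still leaves an exact binomial confidence interval on its class rate that covers most of [0, 1]:

```r
# Sketch: a feature seen in only 2 documents (both class 'A') gives an
# exact (Clopper-Pearson) confidence interval covering most of [0, 1].
ci <- binom.test(x = 2, n = 2, p = 0.5)$conf.int
print(ci)  # roughly [0.16, 1.00]
```

With that little evidence, "always class A so far" is nearly indistinguishable from "a coin flip".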
