We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.
The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.
Often there is a feeling that if a model is doing really well on training data then there must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false, as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is you are working through variations of worthless models that only appear to be good on training data due to overfitting. And more “tweaking, tuning, and fixing” only appears to improve things: as you peek at your test data (some of which you really should have held out until the very end of the project for final acceptance) your test data becomes less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).
Any researcher who does not have proper per-feature significance checks or holdout testing procedures will be fooled into promoting faulty models.
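The peeking effect can be sketched in a few lines of base R. This is an illustration under our own assumptions, not part of the experiment in this article: each hypothetical “model” is just a column of coin-flip predictions, and the names (nModels, predsTest, predsFresh and so on) are made up for the sketch. We pick the model that scores best on the test set and then score that same model on fresh data.

```r
set.seed(3141)   # arbitrary seed, only to make the sketch repeatable
n <- 100         # rows in the peeked-at test set and in the fresh set
nModels <- 500   # number of worthless model variations we try

yTest  <- rbinom(n, 1, 0.5)   # truth on the test set we keep peeking at
yFresh <- rbinom(n, 1, 0.5)   # truth on data we never looked at

# Each column is one "model": coin-flip predictions with no signal at all.
predsTest  <- matrix(rbinom(n * nModels, 1, 0.5), nrow = n)
predsFresh <- matrix(rbinom(n * nModels, 1, 0.5), nrow = n)

# Score every model on the test set and keep the apparent winner.
accTest <- colMeans(predsTest == yTest)
best <- which.max(accTest)

# Score the chosen model on fresh data it was not selected on.
accFresh <- mean(predsFresh[, best] == yFresh)

print(c(selectedOnTest = accTest[best], onFreshData = accFresh))
```

The score used for selection is the maximum over many noisy scores, so it is biased upward; the fresh-data score has no such bias and should hover near chance.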
Many predictive NLP (natural language processing) applications require the use of very many very rare (almost unique) text features. A simple example would be 4-grams, or sequences of 4 consecutive words from a document. At some point you are tracking phrases that occur in only 1 to 2 documents in your training corpus. A tempting intuition is that each of these rare features is in fact a low utility clue for document classification. The hope is that if we track enough of them then enough are available when scoring a given document to make a reliable classification.
These features may in fact be useful, but you must be careful to have procedures to determine which features are in fact useful and which are mere noise. The issue is that rare features are only seen in a few training examples, so it is hard to reliably estimate their value during training. We will demonstrate (in R) some absolutely useless variables masquerading as actual signal during training. Our example is artificial, but if you don’t have proper holdout testing procedures you can easily fall into a similar trap.
Our code to create a bad example is as follows:
    runExample <- function(rows,features,rareFeature,trainer,predictor) {
       print(sys.call(0)) # print call and arguments
       set.seed(123525)   # make result deterministic
       yValues <- factor(c('A','B'))
       xValues <- factor(c('a','b','z'))
       d <- data.frame(y=sample(yValues,replace=T,size=rows),
                       group=sample(1:100,replace=T,size=rows))
       if(rareFeature) {
          # rare feature: parked at 'z' except for two random rows
          mkRandVar <- function() {
             v <- rep(xValues[[3]],rows)
             signalIndices <- sample(1:rows,replace=F,size=2)
             v[signalIndices] <- sample(xValues[1:2],replace=T,size=2)
             v
          }
       } else {
          # non-rare feature: uniform random choice of 'a' or 'b'
          mkRandVar <- function() {
             sample(xValues[1:2],replace=T,size=rows)
          }
       }
       varValues <- as.data.frame(replicate(features,mkRandVar()))
       varNames <- colnames(varValues)
       d <- cbind(d,varValues)
       dTrain <- subset(d,group<=50)
       dTest <- subset(d,group>50)
       formula <- as.formula(paste('y',paste(varNames,collapse=' + '),sep=' ~ '))
       model <- trainer(formula,data=dTrain)
       tabTrain <- table(truth=dTrain$y,
                         predict=predictor(model,newdata=dTrain,yValues=yValues))
       print('train set results')
       print(tabTrain)
       print(fisher.test(tabTrain))
       tabTest <- table(truth=dTest$y,
                        predict=predictor(model,newdata=dTest,yValues=yValues))
       print('holdout test set results')
       print(tabTest)
       print(fisher.test(tabTest))
    }
This block of code builds a universe of examples of size rows. The ground truth we are trying to predict is whether y is "A" or "B". Each row has a number of features (equal to features). These features are considered rare if we have rareFeature=T (if so, the feature spends almost all of its time parked at the constant "z"). The point is each and every feature in this example is random and built without looking at the actual truth values or y’s (and is therefore useless). We split the universe of data into a 50/50 test/train split. We then build a model on the training data and show the performance of predicting the y category on both the test and train sets. We use the Fisher contingency table test to see if we have what looks like a significant model. In all cases we get a deceptively good (very low) p-value on training that does not translate to any real effect on test data. We show the effect for Naive Bayes (a common text classifier), decision trees, logistic regression, and random forests (note that for the non-Naive Bayes classifiers we use non-rare features to trick them into thinking there is a model).
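To see why a rare feature can look like signal, it may help to tabulate a single one against y. The sketch below mirrors the rare branch of mkRandVar for one feature; the seed and the names tab and rareCounts are our own choices for illustration, not part of the original experiment.

```r
set.seed(2024)  # arbitrary seed, used only to make the sketch repeatable
rows <- 200
y <- factor(sample(c('A', 'B'), size = rows, replace = TRUE),
            levels = c('A', 'B'))

# One rare feature: parked at 'z' except for two randomly chosen rows.
x <- rep('z', rows)
signalIndices <- sample(seq_len(rows), size = 2)
x[signalIndices] <- sample(c('a', 'b'), size = 2, replace = TRUE)
x <- factor(x, levels = c('a', 'b', 'z'))

# Cross-tabulate the feature against the outcome.
tab <- table(x = x, y = y)
print(tab)

# Number of rows backing each rare level: never more than two in total.
rareCounts <- rowSums(tab[c('a', 'b'), , drop = FALSE])
print(rareCounts)
```

Each rare level is backed by at most two rows, so whatever class those rows happen to have, the level tends to look strongly associated with it. Our reading is that with 400 such features most training rows carry a handful of these accidental markers, which a model can use to memorize the training labels.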
Basically, if you don’t at least look at model diagnostics (such as coefficient p-values in logistic regression) or at test significance, you fool yourself into thinking you have a model that is good because it is good in training. You may even feel that with the right sort of smoothing it should at least be usable in test. It will not. The most you can hope for is a training procedure that notices there is no useful signal. You can’t model your way out of having no useful features.
The results we get are as follows:

Naive Bayes train (looks good when it is not):
    > library(e1071)
    > runExample(rows=200,features=400,rareFeature=T,
         trainer=function(formula,data) { naiveBayes(formula,data) },
         predictor=function(model,newdata,yValues) {
            predict(model,newdata,type='class') }
      )
    runExample(rows = 200, features = 400, rareFeature = T,
        trainer = function(formula, data) {
            naiveBayes(formula, data)
        }, predictor = function(model, newdata, yValues) {
            predict(model, newdata, type = "class")
        })
    [1] "train set results"
         predict
    truth  A  B
        A 45  2
        B  0 49
        Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     131.2821      Inf
    sample estimates:
    odds ratio
           Inf

Naive Bayes holdout test (is bad):
    [1] "holdout test set results"
         predict
    truth  A  B
        A 17 41
        B 14 32
        Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 1
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.3752898 2.4192687
    sample estimates:
    odds ratio
     0.9482474

Decision tree train (looks good when it is not):
    > library(rpart)
    > runExample(rows=200,features=400,rareFeature=F,
         trainer=function(formula,data) { rpart(formula,data) },
         predictor=function(model,newdata,yValues) {
            predict(model,newdata,type='class') }
      )
    runExample(rows = 200, features = 400, rareFeature = F,
        trainer = function(formula, data) {
            rpart(formula, data)
        }, predictor = function(model, newdata, yValues) {
            predict(model, newdata, type = "class")
        })
    [1] "train set results"
         predict
    truth  A  B
        A 42  5
        B 16 33
        Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value = 7.575e-09
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
      5.27323 64.71322
    sample estimates:
    odds ratio
      16.69703

Decision tree holdout test (is bad):
    [1] "holdout test set results"
         predict
    truth  A  B
        A 33 25
        B 27 19
        Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 1
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.3932841 2.1838878
    sample estimates:
    odds ratio
     0.9295556

Logistic regression train (looks good when it is not):
    > runExample(rows=200,features=400,rareFeature=F,
         trainer=function(formula,data) {
            glm(formula,data,family=binomial(link='logit')) },
         predictor=function(model,newdata,yValues) {
            yValues[ifelse(predict(model,newdata=newdata,type='response')>=0.5,2,1)] }
      )
    runExample(rows = 200, features = 400, rareFeature = F,
        trainer = function(formula, data) {
            glm(formula, data, family = binomial(link = "logit"))
        }, predictor = function(model, newdata, yValues) {
            yValues[ifelse(predict(model, newdata = newdata,
                type = "response") >= 0.5, 2, 1)]
        })
    [1] "train set results"
         predict
    truth  A  B
        A 47  0
        B  0 49
        Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     301.5479      Inf
    sample estimates:
    odds ratio
           Inf

Logistic regression test (is bad):
    [1] "holdout test set results"
         predict
    truth  A  B
        A 35 23
        B 25 21
        Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 0.5556
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5425696 3.0069854
    sample estimates:
    odds ratio
      1.275218
    Warning messages:
    1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
      prediction from a rank-deficient fit may be misleading
    2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
      prediction from a rank-deficient fit may be misleading

Random Forests train (looks good, but is not):
    > library(randomForest)
    > runExample(rows=200,features=400,rareFeature=F,
         trainer=function(formula,data) { randomForest(formula,data) },
         predictor=function(model,newdata,yValues) {
            predict(model,newdata,type='response') }
      )
    runExample(rows = 200, features = 400, rareFeature = F,
        trainer = function(formula, data) {
            randomForest(formula, data)
        }, predictor = function(model, newdata, yValues) {
            predict(model, newdata, type = "response")
        })
    [1] "train set results"
         predict
    truth  A  B
        A 47  0
        B  0 49
        Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     301.5479      Inf
    sample estimates:
    odds ratio
           Inf

Random Forests holdout test (is bad):
    [1] "holdout test set results"
         predict
    truth  A  B
        A 21 37
        B 13 33
        Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 0.4095
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5793544 3.6528127
    sample estimates:
    odds ratio
      1.435704
The point is: good training performance means nothing (unless your trainer is in fact reporting cross-validated results). To avoid overfit you must at least examine model diagnostics and per-variable model coefficient significances, and you should always report results on truly held-out data. It is not enough to look only at model-fit significance on training data. An additional risk arises when you are likely to encounter a mixture of rare useful features and rare noise features. As we have illustrated above, the model fitting procedures can’t always tell the difference between features and noise. So it is easy to expect that the noise features can drown out rare useful features in practice. This should remind all of us of the need for good variable curation, selection, and principled dimension reduction (domain knowledge sensitive and y-sensitive, not just broad principal components analysis). Lots of features (the so-called “wide data” style of analytics) are not always easy to work with (as opposed to “tall data”, which is always good as you have more examples to falsify bad relations).
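As a rough illustration of the difference between training performance and cross-validated performance, here is a minimal base R sketch. It is not the experiment above: we assume plain 5-fold cross-validation, 0/1 indicator features that are independent of the outcome, and helper names (accuracy, fold, cvAcc) invented for the example.

```r
set.seed(99)     # arbitrary seed, only to make the sketch repeatable
rows <- 150
features <- 20

# Pure-noise 0/1 features (columns V1..V20) and an outcome independent of them.
d <- as.data.frame(matrix(rbinom(rows * features, 1, 0.5), nrow = rows))
d$y <- rbinom(rows, 1, 0.5)

# Fraction of rows where the thresholded prediction matches the outcome.
accuracy <- function(model, data) {
  pred <- predict(model, newdata = data, type = 'response') >= 0.5
  mean(pred == (data$y == 1))
}

# Accuracy measured on the same rows the model was fit on.
fullModel <- glm(y ~ ., data = d, family = binomial(link = 'logit'))
trainAcc <- accuracy(fullModel, d)

# 5-fold cross-validation: fit on four folds, score on the fifth.
k <- 5
fold <- sample(rep(seq_len(k), length.out = rows))
cvAcc <- mean(vapply(seq_len(k), function(i) {
  m <- glm(y ~ ., data = d[fold != i, , drop = FALSE],
           family = binomial(link = 'logit'))
  accuracy(m, d[fold == i, , drop = FALSE])
}, numeric(1)))

print(c(train = trainAcc, crossValidated = cvAcc))
```

Since there is no signal in d, any gap between the two numbers is overfit. Cross-validation is only a guard while you do not tune against it repeatedly; truly held-out data is still needed for final acceptance.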
We took the liberty of using the title “Bad Bayes” because this is where we have most often seen the use of many weak variables without enough data to really establish per-variable significance.
For more on feature selection and model testing please see Zumel, Mount, “Practical Data Science with R”.
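One simple form of per-variable significance check can be sketched as follows. This is one possible screen under our own assumptions (Fisher's exact test of each feature against y, followed by a Bonferroni correction), not a procedure taken from the experiment above, and the names pValues and pAdjusted are made up for the example.

```r
set.seed(5123)   # arbitrary seed, only to make the sketch repeatable
rows <- 100
features <- 50

# Outcome and pure-noise categorical features, as in the non-rare case.
y <- factor(sample(c('A', 'B'), size = rows, replace = TRUE),
            levels = c('A', 'B'))
X <- as.data.frame(replicate(features,
                             sample(c('a', 'b'), size = rows, replace = TRUE)))

# Test each feature against y on its own.
pValues <- vapply(X, function(col) {
  fisher.test(table(factor(col, levels = c('a', 'b')), y))$p.value
}, numeric(1))

# Correct for having run many tests at once.
pAdjusted <- p.adjust(pValues, method = 'bonferroni')

print(summary(pValues))
print(sum(pAdjusted < 0.05))   # count of features surviving the screen
```

With noise features we would expect few or none to survive the correction. In real work the screen should be run on training data only, so the holdout set stays untouched.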
The point isn’t that overfit is unexpected (it should be something you are always worried about), but that this is a reliable example of extreme overfit (there is no signal to fit). And of course everyone should already know you need holdout test and calibration data, but there can be an “if the math is sufficiently complicated I can work around that” vibe from some data scientists. Often instead of looking coldly at the holdout data you hear one of the following falsehoods: “there isn’t enough data to waste on holdout” (due to the need to model many rare features), “random forests doesn’t overfit” (due to its internal cross validation), or “it is just a matter of picking the right smoothing or shrinkage parameters to fix this” (when in fact there is no signal to fit).
I was playing with the gbm package recently, and it seems to “fail safe” on this problem, refusing to build a model on the training data (the correct outcome). See github/BadBayesExample.md for details. And as any data scientist should know, failing to get a model is not the worst thing that can happen (sharing a wrong model is much worse).
It is funny, this is how we did machine learning in the 1990s: before we had easy access to data we used synthetic data sets.
Relevant: Pedro Domingos (1998), “A Process-Oriented Heuristic for Model Selection”.
And a note on why I pick on NLP applications in this article. One of the luxuries of modern data science is that you can assume you have a lot of data (not considered true in the 80s and 90s). In this case many methods provably don’t overfit (for example linear regression as the feature set is held constant and the amount of training data is taken to infinity: empirical estimates of expectations converge to correct values and decisions made based on them become correct). However, in NLP you often use extra training data not only to better estimate effects of known features but also to recruit new rare features, which is one of the issues the deliberately bad dataset is simulating. One reason you recruit such features is that there are so many of them: when you see Zipfian style distributions on k-grams you feel you are throwing away a big fraction of signal if you don’t include the many, many rare features in your model. The issue is: the true effect of rare features is hard to estimate.
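A small worked example of how hard that estimate is: suppose a feature appears in n training documents and every one of them is class "A". The exact binomial confidence interval from base R's binom.test shows how little that tells us when n is small. The scenario and the name lowerBounds are our own illustration.

```r
# Feature seen in n documents, all of class "A": what is P(A | feature)?
ns <- c(2, 10, 100)
lowerBounds <- vapply(ns, function(n) {
  ci <- binom.test(x = n, n = n)$conf.int   # exact 95% interval
  cat(sprintf('n = %3d: 95%% interval for P(A | feature) = [%.3f, %.3f]\n',
              n, ci[1], ci[2]))
  ci[1]
}, numeric(1))
# n =   2: 95% interval for P(A | feature) = [0.158, 1.000]
# n =  10: 95% interval for P(A | feature) = [0.692, 1.000]
# n = 100: 95% interval for P(A | feature) = [0.964, 1.000]
```

A feature seen twice, both times with class "A", is consistent with a true rate anywhere from about 0.16 to 1, which includes "no better than the base rate" for a balanced problem.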