Posted on 1 Comment on When Cross-Validation is More Powerful than Regularization

## When Cross-Validation is More Powerful than Regularization

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller coefficients are less sensitive to idiosyncracies in the training data, and hence, less likely to overfit.

Cross-validation is a way to safely reuse training data in nested model situations. This includes both the case of setting hyperparameters before fitting a model, and the case of fitting models (let’s call them base learners) that are then used as variables in downstream models, as shown in Figure 1. In either situation, using the same data twice can lead to models that are overtuned to idiosyncracies in the training data, and more likely to overfit.

In general, if any stage of your modeling pipeline involves looking at the outcome (we’ll call that a y-aware stage), you cannot directly use the same data in the following stage of the pipeline. If you have enough data, you can use separate data in each stage of the modeling process (for example, one set of data to learn hyperparameters, another set of data to train the model that uses those hyperparameters). Otherwise, you should use cross-validation to reduce the nested model bias.

Cross-validation is relatively computationally expensive; regularization is relatively cheap. Can you mitigate nested model bias by using regularization techniques instead of cross-validation?

The short answer: no, you shouldn’t. But as, we’ve written before, demonstrating this is more memorable than simply saying “Don’t do that.”

## A simple example

Suppose you have a system with two categorical variables. The variable x_s has 10 levels, and the variable x_n has 100 levels. The outcome y is a function of x_s, but not of x_n (but you, the analyst building the model, don’t know this). Here’s the head of the data.

##     x_s  x_n           y
## 2  s_10 n_72  0.34228110
## 3  s_01 n_09 -0.03805102
## 4  s_03 n_18 -0.92145960
## 9  s_08 n_43  1.77069352
## 10 s_08 n_17  0.51992928
## 11 s_01 n_78  1.04714355

With most modeling techniques, a categorical variable with K levels is equivalent to K or K-1 numerical (indicator or dummy) variables, so this system actually has around 110 variables. In real life situations where a data scientist is working with high-cardinality categorical variables, or with a lot of categorical variables, the number of actual variables can begin to swamp the size of training data, and/or bog down the machine learning algorithm.

One way to deal with these issues is to represent each categorical variable by a single variable model (or base learner), and then use the predictions of those base learners as the inputs to a bigger model. So instead of fitting a model with 110 indicator variables, you can fit a model with two numerical variables. This is a simple example of nested models.

We refer to this procedure as “impact coding,” and it is one of the data treatments available in the vtreat package, specifically for dealing with high-cardinality categorical variables. But for now, let’s go back to the original problem.

## The naive way

For this simple example, you might try representing each variable as the expected value of y - mean(y) in the training data, conditioned on the variable’s level. So the ith “coefficient” of the one-variable model would be given by:

vi = E[y|x = si] − E[y]

Where si is the ith level. Let’s show this with the variable x_s (the code for all the examples in this article is here):

##     x_s      meany      coeff
## 1  s_01  0.7998263  0.8503282
## 2  s_02 -1.3815640 -1.3310621
## 3  s_03 -0.7928449 -0.7423430
## 4  s_04 -0.8245088 -0.7740069
## 5  s_05  0.7547054  0.8052073
## 6  s_06  0.1564710  0.2069728
## 7  s_07 -1.1747557 -1.1242539
## 8  s_08  1.3520153  1.4025171
## 9  s_09  1.5789785  1.6294804
## 10 s_10 -0.7313895 -0.6808876

In other words, whenever the value of x_s is s_01, the one variable model vs returns the value 0.8503282, and so on. If you do this for both variables, you get a training set that looks like this:

##     x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6808876 0.64754957
## 3  s_01 n_09 -0.03805102  0.8503282 0.54991135
## 4  s_03 n_18 -0.92145960 -0.7423430 0.01923877
## 9  s_08 n_43  1.77069352  1.4025171 1.90394159
## 10 s_08 n_17  0.51992928  1.4025171 0.26448341
## 11 s_01 n_78  1.04714355  0.8503282 0.70342961

Now fit a linear model for y as a function of vs and vn.

model_raw = lm(y ~ vs + vn,
data=dtrain_treated)
summary(model_raw)
##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.33068 -0.57106  0.00342  0.52488  2.25472
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05050    0.05597  -0.902    0.368
## vs           0.77259    0.05940  13.006   <2e-16 ***
## vn           0.61201    0.06906   8.862   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8761 on 242 degrees of freedom
## Multiple R-squared:  0.6382, Adjusted R-squared:  0.6352
## F-statistic: 213.5 on 2 and 242 DF,  p-value: < 2.2e-16

Note that this model gives significant coefficients to both vs and vn, even though y is not a function of x_n (or vn). Because you used the same data to fit the one variable base learners and to fit the larger model, you have overfit.

## The right way: cross-validation

The correct way to impact code (or to nest models in general) is to use cross-validation techniques. Impact coding with cross-validation is already implemented in vtreat; note the similarity between this diagram and Figure 1 above.

The training data is used both to fit the base learners (as we did above) and to also to create a data frame of cross-validated base learner predictions (called a cross-frame in vtreat). This cross-frame is used to train the overall model. Let’s fit the correct nested model, using vtreat.

library(vtreat)
library(wrapr)
xframeResults = mkCrossFrameNExperiment(dtrain,
qc(x_s, x_n), "y",
codeRestriction = qc(catN),
verbose = FALSE)
# the plan uses the one-variable models to treat data
treatmentPlan = xframeResults$treatments # the cross-frame dtrain_treated = xframeResults$crossFrame

head(dtrain_treated)
##     x_s_catN   x_n_catN           y
## 1 -0.6337889 0.91241547  0.34228110
## 2  0.8342227 0.82874089 -0.03805102
## 3 -0.7020597 0.18198634 -0.92145960
## 4  1.3983175 1.99197404  1.77069352
## 5  1.3983175 0.11679580  0.51992928
## 6  0.8342227 0.06421659  1.04714355
variables = setdiff(colnames(dtrain_treated), "y")

model_X = lm(mk_formula("y", variables),
data=dtrain_treated)
summary(model_X)
##
## Call:
## lm(formula = mk_formula("y", variables), data = dtrain_treated)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2157 -0.7343  0.0225  0.7483  2.9639
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04169    0.06745  -0.618    0.537
## x_s_catN     0.92968    0.06344  14.656   <2e-16 ***
## x_n_catN     0.10204    0.06654   1.533    0.126
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 242 degrees of freedom
## Multiple R-squared:  0.4753, Adjusted R-squared:  0.471
## F-statistic: 109.6 on 2 and 242 DF,  p-value: < 2.2e-16

This model correctly determines that x_n (and its one-variable model x_n_catN) do not affect the outcome. We can compare the performance of this model to the naive model on holdout data.

rmse rsquared
ypred_naive 1.303778 0.2311538
ypred_crossval 1.093955 0.4587089

The correct model has a much smaller root-mean-squared error and a much larger R-squared than the naive model when applied to new data.

## An attempted alternative: regularized models.

But cross-validation is so complicated. Can’t we just regularize? As we’ll show in the appendix of this article, for a one-variable model, L2-regularization is simply Laplace smoothing. Again, we’ll represent each “coefficient” of the one-variable model as the Laplace smoothed value minus the grand mean.

vi = ∑xj = si yi/(counti + λ) − E[yi]

Where counti is the frequency of si in the training data, and λ is the smoothing parameter (usually 1). If λ = 1 then the first term on the right is just adding one to the frequency of the level and then taking the “adjusted conditional mean” of y.

Again, let’s show this for the variable x_s.

##     x_s      sum_y count_y   grandmean         vs
## 1  s_01  20.795484      26 -0.05050187  0.8207050
## 2  s_02 -37.302227      27 -0.05050187 -1.2817205
## 3  s_03 -22.199656      28 -0.05050187 -0.7150035
## 4  s_04 -14.016649      17 -0.05050187 -0.7282009
## 5  s_05  19.622340      26 -0.05050187  0.7772552
## 6  s_06   3.129419      20 -0.05050187  0.1995218
## 7  s_07 -35.242672      30 -0.05050187 -1.0863585
## 8  s_08  36.504412      27 -0.05050187  1.3542309
## 9  s_09  33.158549      21 -0.05050187  1.5577086
## 10 s_10 -16.821957      23 -0.05050187 -0.6504130

After applying the one variable models for x_s and x_n to the data, the head of the resulting treated data looks like this:

##     x_s  x_n           y         vs         vn
## 2  s_10 n_72  0.34228110 -0.6504130 0.44853367
## 3  s_01 n_09 -0.03805102  0.8207050 0.42505898
## 4  s_03 n_18 -0.92145960 -0.7150035 0.02370493
## 9  s_08 n_43  1.77069352  1.3542309 1.28612835
## 10 s_08 n_17  0.51992928  1.3542309 0.21098803
## 11 s_01 n_78  1.04714355  0.8207050 0.61015422

Now fit the overall model:

##
## Call:
## lm(formula = y ~ vs + vn, data = dtrain_treated)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30354 -0.57688 -0.02224  0.56799  2.25723
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06665    0.05637  -1.182    0.238
## vs           0.81142    0.06203  13.082  < 2e-16 ***
## vn           0.85393    0.09905   8.621  8.8e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8819 on 242 degrees of freedom
## Multiple R-squared:  0.6334, Adjusted R-squared:  0.6304
## F-statistic: 209.1 on 2 and 242 DF,  p-value: < 2.2e-16

Again, both variables look significant. Even with regularization, the model is still overfit. Comparing the performance of the models on holdout data, you see that the regularized model does a little better than the naive model, but not as well as the correctly cross-validated model.

rmse rsquared
ypred_naive 1.303778 0.2311538
ypred_crossval 1.093955 0.4587089
ypred_reg 1.267648 0.2731756

## The Moral of the Story

Unfortunately, regularization is not enough to overcome nested model bias. Whenever you apply a y-aware process to your data, you have to use cross-validation methods (or a separate data set) at the next stage of your modeling pipeline.

### Appendix: Derivation of Laplace Smoothing as L2-Regularization

Without regularization, the optimal one-variable model for y in terms of a categorical variable with K levels {sj} is a set of K coefficients v such that

$f(\mathbf{v}) := \sum\limits_{i=1}^N (y_i - v_i)^2$

is minimized (N is the number of data points). L2-regularization adds a penalty to the magnitude of v, so that the goal is to minimize

$f(\mathbf{v}) := \sum\limits_{i=1}^N (y_i - v_i)^2 + \lambda \sum\limits_{j=1}^K {v_j}^2$

where λ is a known smoothing hyperparameter, usually set (in this case) to 1.

To minimize the above expression for a single coefficient vj, take the deriviative with respect to vj and set it to zero:

$\sum\nolimits_{x_i = s_j} -2 (y_i - v_j) + 2 \lambda v_j = 0\\ \sum\nolimits_{x_i = s_j }-y_i + \sum\nolimits_{x_i = s_j} v_j + \lambda v_j = 0\\ \sum\nolimits_{x_i = s_j }-y_i + \text{count}_j v_j + \lambda v_j = 0$

Where countj is the number of times the level sj appears in the training data. Now solve for vj:

$v_j (\text{count}_j + \lambda) = \sum\nolimits_{x_i = s_j} y_i\\ v_j = \sum\nolimits_{x_i = s_i} y_i / (\text{count}_j + \lambda)$

This is Laplace smoothing. Note that it is also the one-variable equivalent of ridge regression.

Posted on Categories data science, Exciting Techniques, Programming, Tutorials2 Comments on Sharing Modeling Pipelines in R

## Sharing Modeling Pipelines in R

Reusable modeling pipelines are a practical idea that gets re-developed many times in many contexts. wrapr supplies a particularly powerful pipeline notation, and a pipe-stage re-use system (notes here). We will demonstrate this with the vtreat data preparation system.

Posted on 3 Comments on A Theory of Nested Cross Simulation

## A Theory of Nested Cross Simulation

[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or discuss without the notational framework. We thought about including some “simplifying explanatory diagrams” but so many entities are being introduced and manipulated by the processes we are describing we found equation notation to be in fact cleaner than the diagrams we attempted and rejected.]

• Picking hyper-parameters, fitting a model, and then evaluating the model.
• Variable preparation/pruning, fitting a model, and then evaluating the model.

In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.

Our current standard advice to avoid nested model bias is either:

• Split your data into 3 or more disjoint pieces, such as separate variable preparation/pruning, model fitting, and model evaluation.
• Reserve a test-set for evaluation and use “simulated out of sample data” or “cross-frame”/“cross simulation” techniques to simulate dividing data among the first two model construction stages.

The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready to go procedure in our R data cleaning and preparation package vtreat.

What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.

Hyperbole and a Half copyright Allie Brosh (use allowed in some situations with attribution)

Edit: we are going to be writing on a situation of some biases that do leak into the cross-frame “new data simulation.” So think of cross-frames as bias (some small amount is introduced) / variance (reduced be appearing to have a full sized data set at all stages) trade-off.

Posted on 1 Comment on Laplace noising versus simulated out of sample methods (cross frames)

## Laplace noising versus simulated out of sample methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, that Nina and I respect but don’t actually use in production.

Please read on for my discussion of some of the limitations of the technique, and how we solve the problem for impact coding (also called “effects codes”), and a worked example in R. Continue reading Laplace noising versus simulated out of sample methods (cross frames)

Posted on 1 Comment on On Nested Models

## On Nested Models

We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested, but are easy to get wrong.

Posted on 1 Comment on Random Test/Train Split is not Always Enough

## Random Test/Train Split is not Always Enough

Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split.

With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that looks good; the trick is finding one that is good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.

Posted on 3 Comments on Bad Bayes: an example of why you need hold-out testing

## Bad Bayes: an example of why you need hold-out testing

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.

The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.

Often there is a feeling if a model is doing really well on training data then must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is you are working through variations of worthless models that only appear to be good on training data due to overfitting. And the more “tweaking, tuning, and fixing” you try only appears to improve things because as you peek at your test-data (which you really should have held some out until the entire end of project for final acceptance) your test data is becoming less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).

Any researcher that does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models. Continue reading Bad Bayes: an example of why you need hold-out testing