## Introduction

Suppose we have the task of predicting an outcome `y` given a number of variables `v1,..,vk`. We often want to “prune variables,” or build models using fewer than all the variables. This can speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, reduce over-fit, and improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

- How Do You Know if Your Data Has Signal?
- How do you know if your model is going to work?
- Variable pruning is NP hard

In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.

To be truly effective in applied fields (such as data science) one often has to use (with care) methods that “happen to work” in addition to methods that “are known to always work” (or at least be aware you are always competing against such); hence the interest in mere heuristics.

## The pruning heuristics

Let \(L(m;S)\) denote the estimated loss (or badness of performance, so smaller is better) of a model for \(y\) fit using modeling method \(m\) and the variables \(v_i : i \in S\). Let \(d(m;a)\) denote the portion of \(L(m;\{ \})-L(m;\{ a \} )\) credited to the variable \(v_a\). This could be the change in loss, something like \(\mathrm{effectsize}(v_a)\), or \(-\log(\mathrm{significance}(v_a))\); in all cases *larger* is considered better.
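In code, the single-variable credit \(d(\mathrm{lm};a)\) can be sketched as a loss delta: the drop in squared-error loss when a variable is added to an intercept-only model. This is a minimal illustration (in Python for brevity; the names `score_variable`, `mse_intercept_only`, and `mse_one_var` are our own, not from any package):

```python
# Sketch: score a variable by the drop in mean-squared-error loss when
# it is added to an intercept-only linear model (names are illustrative).

def mse_intercept_only(y):
    m = sum(y) / len(y)
    return sum((yi - m) ** 2 for yi in y) / len(y)

def mse_one_var(x, y):
    # closed-form simple linear regression y ~ 1 + x
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n

def score_variable(x, y):
    # d(lm; a) as a loss delta: larger means the variable helps more
    return mse_intercept_only(y) - mse_one_var(x, y)
```

For example, `score_variable([0,1,2,3], [0,1,2,3])` returns the full variance of `y` (a perfect single-variable fit), while a variable with no marginal linear relation to `y` scores zero.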

For practical variable pruning (during predictive modeling) our intuition often implicitly relies on the following heuristic arguments.

- \(L(m;\cdot)\) is monotone decreasing: we expect \(L(m;S \cup \{ a \} )\) to be no larger than \(L(m;S)\). Note this may be achievable “in sample” (on training data), but it is often false if \(L(m;\cdot)\) accounts for model complexity or is estimated on out-of-sample data (itself a good practice).
- If \(L(m;S \cup \{ a \} )\) is significantly lower than \(L(m;S)\) then we will be lucky enough to have \(d(m;a)\) not too small.
- If \(d(m;a)\) is not too small then we will be lucky enough to have \(d(\mathrm{lm};a)\) non-negligible (where modeling method `lm` is a linear regression or logistic regression).

Intuitively we are *hoping* (for ease of calculation) that variable utility has a roughly diminishing-return structure and that at least some non-vanishing fraction of a variable’s utility can be seen in simple linear or generalized linear models. Obviously this cannot be true in general (interactions in decision trees being a well-known situation where variable utility can increase in the presence of other variables, and there are many non-linear relations that escape detection by linear models). Synergy is a good thing; we just would hate to miss it, and one way to prove we don’t miss it would be to know it isn’t there. We will show there is in fact synergy, so naive methods may indeed miss it.

However, if the above were true (or often nearly true) we could effectively prune variables by keeping only the set of variables \(\left\{ a \; \left| \; d(\mathrm{lm};a) \; \text{is non negligible} \right. \right\}\). This is a (user controllable) heuristic built into our `vtreat` R package, and it proves to be quite useful in practice.
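The pruning rule can be sketched as follows (in Python for illustration; this is not `vtreat`'s actual API, and the names `prune` and `r_squared_single` plus the `0.01` threshold are hypothetical stand-ins for “\(d(\mathrm{lm};a)\) is non negligible”):

```python
# Sketch: keep only variables whose single-variable linear fit explains
# a non-negligible fraction of the variance of y (an R-squared threshold).

def r_squared_single(x, y):
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx <= 0 or syy <= 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

def prune(variables, y, threshold=0.01):
    # variables: dict mapping variable name -> list of values
    return [name for name, x in variables.items()
            if r_squared_single(x, y) >= threshold]
```

For example, `prune({'good': [0,1,2,3], 'noise': [1,-1,-1,1]}, [0,1,2,3])` keeps only `'good'`, since `'noise'` is uncorrelated with `y`.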

I’ll repeat: we feel that on real-world data you can use the above heuristics to usefully prune variables. Complex models do eventually get into a regime of diminishing returns, and real-world engineered useful variables usually (by design) have a hard time hiding. Also, remember data science is an empirical field: methods that happen to work will dominate (even if they do not apply in all cases).

## Counter-examples

For every heuristic you should crisply know whether it is true (and is in fact a theorem) or false (and has counter-examples). We stand behind the above heuristics, and will show their empirical worth in a follow-up article. Let’s take some time and show that they are not in fact laws.

We are going to show that per-variable coefficient significances and effect sizes are not monotone, in the sense that adding more variables can in fact improve them.

### First example

First (using R) we build a data frame where `y` is the indicator of `a == b` (the complement of `a xor b`). This is a classic example of `y` being a function of two variables but not a *linear* function of them (at least over the real numbers; over the field GF(2) it is the affine relation `y = 1 + a + b`).

```
d <- data.frame(a=c(0,0,1,1),b=c(0,1,0,1))
d$y <- as.numeric(d$a == d$b)
```
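As an aside, the GF(2) claim is easy to check directly: since `y` indicates `a == b`, it equals the affine expression `(a + b + 1) mod 2` on every row. A quick check (sketched in Python, for illustration):

```python
# Verify that y = as.numeric(a == b) from the data frame above matches
# the affine GF(2) expression (a + b + 1) mod 2 on all four rows, so the
# relation is linear over GF(2) even though it is not over the reals.
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = 1 if a == b else 0
    assert y == (a + b + 1) % 2
```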

We look at the (real) linear relations between `y` and `a`, `b`.

`summary(lm(y~a+b,data=d))`

```
##
## Call:
## lm(formula = y ~ a + b, data = d)
##
## Residuals:
## 1 2 3 4
## 0.5 -0.5 -0.5 0.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.500 0.866 0.577 0.667
## a 0.000 1.000 0.000 1.000
## b 0.000 1.000 0.000 1.000
##
## Residual standard error: 1 on 1 degrees of freedom
## Multiple R-squared: 3.698e-32, Adjusted R-squared: -2
## F-statistic: 1.849e-32 on 2 and 1 DF, p-value: 1
```

`anova(lm(y~a+b,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## a 1 0 0 0 1
## b 1 0 0 0 1
## Residuals 1 1 1
```

As we expect, linear methods fail to find any evidence of a relation between `y` and `a`, `b`. This clearly violates our hoped-for heuristics.

For details on reading these summaries we strongly recommend *Practical Regression and Anova using R*, Julian J. Faraway, 2002.

In this example the linear model fails to recognize `a` and `b` as useful variables (even though `y` is a function of `a` and `b`). From the linear model’s point of view the variables are not improving each other (so that at least looks monotone), but this is largely because the linear model cannot see the relation unless we add an interaction of `a` and `b` (denoted `a:b`).

### Second example

Let us develop this example a bit more to get a more interesting counterexample.

Introduce new variables `u = a and b`, `v = a or b`. By the rules of logic we have `y == 1 + u - v`, so there is a linear relation.

```
d$u <- as.numeric(d$a & d$b)
d$v <- as.numeric(d$a | d$b)
print(d)
```

```
## a b y u v
## 1 0 0 1 0 0
## 2 0 1 0 0 1
## 3 1 0 0 0 1
## 4 1 1 1 1 1
```

`print(all.equal(d$y,1+d$u-d$v))`

`## [1] TRUE`

We can now see the counter-example effect: together the variables work better than they did alone.

`summary(lm(y~u,data=d))`

```
##
## Call:
## lm(formula = y ~ u, data = d)
##
## Residuals:
## 1 2 3 4
## 6.667e-01 -3.333e-01 -3.333e-01 -1.388e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3333 0.3333 1 0.423
## u 0.6667 0.6667 1 0.423
##
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 5.551e-16
## F-statistic: 1 on 1 and 2 DF, p-value: 0.4226
```

`anova(lm(y~u,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 0.33333 0.33333 1 0.4226
## Residuals 2 0.66667 0.33333
```

`summary(lm(y~v,data=d))`

```
##
## Call:
## lm(formula = y ~ v, data = d)
##
## Residuals:
## 1 2 3 4
## 5.551e-17 -3.333e-01 -3.333e-01 6.667e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.5774 1.732 0.225
## v -0.6667 0.6667 -1.000 0.423
##
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 0
## F-statistic: 1 on 1 and 2 DF, p-value: 0.4226
```

`anova(lm(y~v,data=d))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## v 1 0.33333 0.33333 1 0.4226
## Residuals 2 0.66667 0.33333
```

`summary(lm(y~u+v,data=d))`

```
## Warning in summary.lm(lm(y ~ u + v, data = d)): essentially perfect fit:
## summary may be unreliable
```

```
##
## Call:
## lm(formula = y ~ u + v, data = d)
##
## Residuals:
## 1 2 3 4
## -1.849e-32 7.850e-17 -7.850e-17 1.849e-32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.00e+00 1.11e-16 9.007e+15 <2e-16 ***
## u 1.00e+00 1.36e-16 7.354e+15 <2e-16 ***
## v -1.00e+00 1.36e-16 -7.354e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.11e-16 on 1 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.056e+31 on 2 and 1 DF, p-value: < 2.2e-16
```

`anova(lm(y~u+v,data=d))`

```
## Warning in anova.lm(lm(y ~ u + v, data = d)): ANOVA F-tests on an
## essentially perfect fit are unreliable
```

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 0.33333 0.33333 2.7043e+31 < 2.2e-16 ***
## v 1 0.66667 0.66667 5.4086e+31 < 2.2e-16 ***
## Residuals 1 0.00000 0.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In this example we see synergy instead of diminishing returns. Each variable becomes better in the presence of the other. This is good in itself, but it indicates that variable pruning is harder than one might expect, even for a linear model.

### Third example

We can get around the above warnings by adding some rows to the data frame that don’t follow the designed relation. We then sample rows from this frame (with replacement) to show the effect on a data frame whose rows look more independent.

```
d0 <- d
d0$y <- 0
d1 <- d
d1$y <- 1
dG <- rbind(d,d,d,d,d0,d1)
set.seed(23235)
dR <- dG[sample.int(nrow(dG),100,replace=TRUE),,drop=FALSE]
summary(lm(y~u,data=dR))
```

```
##
## Call:
## lm(formula = y ~ u, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8148 -0.3425 -0.3425 0.3033 0.6575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34247 0.05355 6.396 5.47e-09 ***
## u 0.47235 0.10305 4.584 1.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4575 on 98 degrees of freedom
## Multiple R-squared: 0.1765, Adjusted R-squared: 0.1681
## F-statistic: 21.01 on 1 and 98 DF, p-value: 1.349e-05
```

`anova(lm(y~u,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 4.3976 4.3976 21.01 1.349e-05 ***
## Residuals 98 20.5124 0.2093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(lm(y~v,data=dR))`

```
##
## Call:
## lm(formula = y ~ v, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7619 -0.3924 -0.3924 0.6076 0.6076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7619 0.1049 7.263 9.12e-11 ***
## v -0.3695 0.1180 -3.131 0.0023 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4807 on 98 degrees of freedom
## Multiple R-squared: 0.09093, Adjusted R-squared: 0.08165
## F-statistic: 9.802 on 1 and 98 DF, p-value: 0.002297
```

`anova(lm(y~v,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## v 1 2.265 2.26503 9.8023 0.002297 **
## Residuals 98 22.645 0.23107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`summary(lm(y~u+v,data=dR))`

```
##
## Call:
## lm(formula = y ~ u + v, data = dR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8148 -0.1731 -0.1731 0.1984 0.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.76190 0.08674 8.784 5.65e-14 ***
## u 0.64174 0.09429 6.806 8.34e-10 ***
## v -0.58883 0.10277 -5.729 1.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3975 on 97 degrees of freedom
## Multiple R-squared: 0.3847, Adjusted R-squared: 0.3721
## F-statistic: 30.33 on 2 and 97 DF, p-value: 5.875e-11
```

`anova(lm(y~u+v,data=dR))`

```
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## u 1 4.3976 4.3976 27.833 8.047e-07 ***
## v 1 5.1865 5.1865 32.826 1.133e-07 ***
## Residuals 97 15.3259 0.1580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

### Conclusion

Consider the above counter-examples as *exceptio probat regulam in casibus non exceptis* (“the exception confirms the rule in cases not excepted”): they roughly outline the (hopefully labored and uncommon) structure needed to break the otherwise common and useful heuristics.

In later articles in this series we will show more about the structure of model quality and show the above heuristics actually working very well in practice (and adding a lot of value to projects).



A nice reference on these sorts of issues: “Algorithms for Subset Selection in Linear Regression”, Abhimanyu Das, David Kempe, STOC’08, May 17–20, 2008, Victoria, British Columbia, Canada. Copyright 2008 ACM 978-1-60558-047-0/08/05 http://www-bcf.usc.edu/~dkempe/publications/regression.pdf .

One more strong example. Pick x and y so that all of x,y, and 1 are mutually orthogonal. Then throw in one more vector (z) to complete a basis for R^3.
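That construction can be made concrete (a sketch in Python with our own example numbers): because x and y are orthogonal to each other and to the all-ones vector, the simple regression of y on x finds nothing, yet y is an exact linear combination of the basis {1, x, z}.

```python
# A concrete instance of the orthogonal-basis construction (our own
# numbers, for illustration): x and y are each orthogonal to the
# all-ones vector and to each other; z completes a basis of R^3.
one = [1, 1, 1]
x   = [1, -1, 0]   # orthogonal to one
y   = [1, 1, -2]   # orthogonal to one and to x
z   = [1, 0, 0]    # completes the basis

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
assert dot(x, one) == 0 and dot(y, one) == 0 and dot(y, x) == 0

# Simple regression of y on x (with intercept) finds nothing:
# the slope is dot(x, y) / dot(x, x) = 0 since x is centered.
slope = dot(x, y) / dot(x, x)
assert slope == 0

# Yet y is an exact linear combination of one, x, and z:
coef = (-2, -3, 6)  # solved by hand for these particular vectors
fit = [coef[0] * o + coef[1] * xi + coef[2] * zi
       for o, xi, zi in zip(one, x, z)]
assert fit == y
```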

Nina Zumel pointed out to me some other common sources of such examples.

The first is introducing co-linear variables: they may or may not cause significances to unexpectedly improve, but they are very likely to increase coefficient magnitudes.
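The co-linearity effect can be sketched as follows (in Python, with toy data of our own): adding a nearly identical copy of a predictor blows up coefficient magnitudes even though each single-variable fit is tame.

```python
# Sketch: near-collinear predictors inflate coefficient magnitudes
# (toy data, no-intercept least squares via the 2x2 normal equations).

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def fit2(x1, x2, y):
    # least squares for y ~ b1*x1 + b2*x2 (no intercept)
    a11, a12, a22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    c1, c2 = dot(x1, y), dot(x2, y)
    det = a11 * a22 - a12 * a12
    return ((c1 * a22 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)

x1 = [1, 2, 3, 4]
x2 = [x + d for x, d in zip(x1, [0, 0.01, -0.01, 0])]  # near-copy of x1
y  = [1.1, 1.9, 3.05, 3.95]                            # roughly y = x1

b_single = dot(x1, y) / dot(x1, x1)  # close to 1, as expected
b1, b2 = fit2(x1, x2, y)             # large, opposite-sign coefficients
```

The single-variable coefficient is about 1, but the joint fit splits it into large opposite-sign coefficients on the two near-copies.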

The other is Simpson’s paradox set-ups where coefficient signs change when introducing additional variables.
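A minimal Simpson's-paradox sketch (in Python, with our own toy numbers): the pooled slope of y on x is positive, yet within each group it is negative, so including the group indicator flips the coefficient sign.

```python
# Sketch: Simpson's paradox as a coefficient sign flip (toy data).

def slope(x, y):
    # simple linear regression slope of y on x (with intercept)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

x = [1, 2, 8, 9]
y = [1, 0, 9, 8]
g = [0, 0, 1, 1]   # group indicator

pooled  = slope(x, y)          # positive (1.1): higher x looks better
within0 = slope(x[:2], y[:2])  # negative (-1) within group 0
within1 = slope(x[2:], y[2:])  # negative (-1) within group 1

# With the group variable included the relation is exact: y = 2 - x + 15*g,
# so the coefficient on x becomes -1 once g enters the model.
assert all(yi == 2 - xi + 15 * gi for xi, yi, gi in zip(x, y, g))
```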