Here is an absolutely *horrible* way to confuse yourself and get an inflated reported `R-squared`

on a simple linear regression model in `R`

.

We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here.

First let’s set up our problem with a data set where the quantity to be predicted (`y`

) has no real relation to the independent variable (`x`

). We will first build our example data:

```
library("sigr")
library("broom")
set.seed(23255)
d <- data.frame(y= runif(100),
x= ifelse(runif(100)>=0.5, 'a', 'b'),
stringsAsFactors = FALSE)
```

Now let’s build a model and look at the summary statistics returned as part of the model fitting process:

```
m1 <- lm(y~x, data=d)
t(broom::glance(m1))
```

```
## [,1]
## r.squared 0.002177326
## adj.r.squared -0.008004538
## sigma 0.302851476
## statistic 0.213843593
## p.value 0.644796456
## df 2.000000000
## logLik -21.432440763
## AIC 48.864881526
## BIC 56.680392084
## deviance 8.988463618
## df.residual 98.000000000
```

`d$pred1 <- predict(m1, newdata = d)`

I *strongly* prefer to directly calculate the the model performance statistics off the predictions (it lets us easily compare different modeling methods), so let’s also do that also:

```
cat(render(sigr::wrapFTest(d, 'pred1', 'y'),
format='markdown'))
```

**F Test** summary: (*R ^{2}*=0.0022,

*F*(1,98)=0.21,

*p*=n.s.).

So far so good. Let’s now remove the "intercept term" by adding the "`0+`

" from the fitting command.

```
m2 <- lm(y~0+x, data=d)
t(broom::glance(m2))
```

```
## [,1]
## r.squared 7.524811e-01
## adj.r.squared 7.474297e-01
## sigma 3.028515e-01
## statistic 1.489647e+02
## p.value 1.935559e-30
## df 2.000000e+00
## logLik -2.143244e+01
## AIC 4.886488e+01
## BIC 5.668039e+01
## deviance 8.988464e+00
## df.residual 9.800000e+01
```

`d$pred2 <- predict(m2, newdata = d)`

Uh oh. That *appeared* to vastly improve the reported `R-squared`

and the significance ("`p.value`

")!

That does not make sense, anything `m2`

can do `m1`

can also do. In fact the two models make essentially identical predictions, which we confirm below:

`sum((d$pred1 - d$y)^2)`

`## [1] 8.988464`

`sum((d$pred2 - d$y)^2)`

`## [1] 8.988464`

`max(abs(d$pred1 - d$pred2))`

`## [1] 4.440892e-16`

`head(d)`

```
## y x pred1 pred2
## 1 0.007509118 b 0.5098853 0.5098853
## 2 0.980353615 a 0.5380361 0.5380361
## 3 0.055880927 b 0.5098853 0.5098853
## 4 0.993814410 a 0.5380361 0.5380361
## 5 0.636617385 b 0.5098853 0.5098853
## 6 0.154032277 a 0.5380361 0.5380361
```

Let’s double check the fit quality of the predictions.

```
cat(render(sigr::wrapFTest(d, 'pred2', 'y'),
format='markdown'))
```

**F Test** summary: (*R ^{2}*=0.0022,

*F*(1,98)=0.21,

*p*=n.s.).

Ah. The prediction fit quality is identical to the first time (as one would expect). This is yet another reason to directly calculate model fit quality from the predictions: it isolates you from any foibles of how the modeling software does it.

The answer to our puzzles of "what went wrong" is something we have written about before here.

Roughly what is going on is:

If the fit formula sent `lm()`

has no intercept (triggered by the "`0+`

") notation then `summary.lm()`

changes how it computes `r.squared`

as follows (from `help(summary.lm)`

):

r.squared R^2, the ‘fraction of variance explained by the model’, R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2), where y* is the mean of y[i] if there is an intercept and zero otherwise.

This is pretty bad.

Then to add insult to injury the "`0+`

" notation also changes how `R`

encodes the categorical variable `x`

.

Compare:

`summary(m1)`

```
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53739 -0.23265 -0.02039 0.27247 0.47111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53804 0.04515 11.918 <2e-16 ***
## xb -0.02815 0.06088 -0.462 0.645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3029 on 98 degrees of freedom
## Multiple R-squared: 0.002177, Adjusted R-squared: -0.008005
## F-statistic: 0.2138 on 1 and 98 DF, p-value: 0.6448
```

`summary(m2)`

```
##
## Call:
## lm(formula = y ~ 0 + x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53739 -0.23265 -0.02039 0.27247 0.47111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xa 0.53804 0.04515 11.92 <2e-16 ***
## xb 0.50989 0.04084 12.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3029 on 98 degrees of freedom
## Multiple R-squared: 0.7525, Adjusted R-squared: 0.7474
## F-statistic: 149 on 2 and 98 DF, p-value: < 2.2e-16
```

Notice the second model directly encodes both levels of `x`

. This means that if `m1`

has `pred1 = a + b * (x=='b')`

we can reproduce this model (intercept and all) as `m2`

: `pred2 = a * (x=='a') + (a+b) * (x=='b')`

. I.e., the invariant `(x=='a') + (x=='b') == 1`

means `m2`

can imitate the model with the intercept term.

The presumed (and I think weak) justification of `summary.lm()`

switching the model quality assessment method is something along the lines that `mean(y)`

may not be in the model’s concept space and this might lead to reporting negative `R-squared`

. I don’t have any problem with negative `R-squared`

, it can be taken to mean you did worse than the unconditional average. However, even if you accept the (no-warning) scoring method switch: that argument doesn’t apply here. `m2`

can imitate having an intercept, so it isn’t unfair to check if it is better than using only the intercept.

This is an oldy but a goody, and rather independent of R. I have seen models justified on the basis of high R^2 with a dropped intercept compared with essentially the same model with an intercept and rather indifferent performance. The proponents of the no-intercept models are often quite unaware that this is an issue and are genuinely proud of the fact that R^2 has increased so dramatically.

There are two problems here (well, three, if you count the change in “summary” behaviour):

1. Using R^2 without thinking about whether it is reasonable to compare this metric between two different types of model.

2. Failing to concentrate on a proper performance measure such as AIC (which didn’t change one iota between the two models and which ought to alert one to the issue).

The idea of R^2 is so ingrained from dim distant statistical education that researchers place huge weight on getting a high value, regardless of the formulation of the model. This is made even worse by clients who also place great store on high R^2 models and are rather disappointed when a model fails to reach what they would consider a “high enough” value. In these cases I like to ask what R^2 they would like to achieve, and I can usually achieve that by careful re-formulation of the data. Eventually, the penny drops…