Posted on Categories Opinion, Programming, TutorialsTags ,

# Programming Over lm() in R

Here is simple modeling problem in `R`.

We want to fit a linear model where the names of the data columns carrying the outcome to predict (`y`), the explanatory variables (`x1`, `x2`), and per-example row weights (`wt`) are given to us as string values in variables.

Lets start with our example data and parameters. The point is: we are assuming the data and parameters come to us as arguments and are not known at the time of writing the script or program.

``````# our inputs
d <- data.frame(
x1 = c(1, 2, 3, 4),
x2 = c(0, 0, 1, 1),
y = c(3, 3, 4, -5),
wt = c(1, 2, 1, 1))
knitr::kable(d)``````
x1 x2 y wt
1 0 3 1
2 0 3 2
3 1 4 1
4 1 -5 1
``````outcome_name <- "y"
explanatory_vars <- c("x1", "x2")
name_for_weight_column <- "wt"``````

For everything except the weights this is easy, as the linear regression function `lm()` is willing to take strings in its first argument “`formula`” (and also, there are tools for building up formula objects).

``````# start our generic solution
formula_str <- paste(
outcome_name,
"~",
paste(explanatory_vars, collapse = " + "))
print(formula_str)``````
``##  "y ~ x1 + x2"``
``````model <- lm(formula_str,
data = d)

print(model)``````
``````##
## Call:
## lm(formula = formula_str, data = d)
##
## Coefficients:
## (Intercept)           x1           x2
##        9.75        -4.50         5.50``````
``format(model\$terms)``
``##  "y ~ x1 + x2"``

However, once we try to add weights we have problems.

``````lm(formula_str,
data = d,
weights = name_for_weight_column)``````
``## Error in model.frame.default(formula = formula_str, data = d, weights = name_for_weight_column, : variable lengths differ (found for '(weights)')``

This is a bit disappointing, as much of the point of working in `R` is being able to write parameterized scripts and programs over the `R` functions. So we really want to be able to take names of columns from an external source.

The reason is the following (taken from `help(lm)`):

All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.

This means the weights argument is not treated as a value, but instead the name typed in is captured through “non standard evaluation” (NSE). The data frame environment, and formula environment are searched for a column or value named “`name_for_weight_column`”, and not for one named “`wt`”.

This is a big hindrance to using `lm()` programmatically. The NSE notation style can be convenient when working interactively, but is often a burden when programming.

However, `R` has some ways out of the problem.

The first solution is `bquote()` which is part of `R` itself (in the `base` package).

``````eval(bquote(
lm(
formula_str,
data = d,
weights = .(as.name(name_for_weight_column)))
))``````
``````##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept)           x1           x2
##       9.429       -3.857        3.571``````

In the above, the `.()` notation indicates to replace the `.()`-expression with its evaluated value before evaluating the rest of the expression. This is a substitution principle based on escaping notation (also called quasiquoting).

Another solution is the `let()` function from the `wrapr` package (a user extension package, not part of `R` itself).

``````wrapr::let(
c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column),
lm(
formula_str,
data = d,
weights = NAME_FOR_WEIGHT_COLUMN)
)``````
``````##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept)           x1           x2
##       9.429       -3.857        3.571``````

In the above, the left-hand sides of the named vector are symbols to be replaced and the right had sides refer to values to replace them with. The specification `c(NAME_FOR_WEIGHT_COLUMN = name_for_weight_column)` means to replace `NAME_FOR_WEIGHT_COLUMN` with `wt` (the value referred to by `name_for_weight_column`). This is a substitution principle based on named substitution targets.

Another solution is the `!!` notation from the `rlang` package (a user extension package, not part of `R` itself).

``````rlang::eval_tidy(rlang::quo(
lm(
formula_str,
data = d,
weights = !!as.name(name_for_weight_column))
))``````
``````##
## Call:
## lm(formula = formula_str, data = d, weights = wt)
##
## Coefficients:
## (Intercept)           x1           x2
##       9.429       -3.857        3.571``````

In the above, the `!!` notation indicates to replace the `!!`-expression with its evaluated value before evaluating the rest of the expression. This is a substitution principle based on escaping notation (also called quasiquoting).

Note the argument types and/or the internals of `lm()` do not currently appear to allow the use of the newer `rlang` double curly brace notation (a notation that can replace `!!rlang::enquo()`, and possibly other forms).

``````rlang::eval_tidy(rlang::quo(
lm(
formula_str,
data = d,
weights = {{name_for_weight_column}})
))``````
``## Error in model.frame.default(formula = formula_str, data = d, weights = ~"wt", : invalid type (language) for variable '(weights)'``

We only mention the “`{{}}`” notation as `rlang` familiar readers are likely to wonder about using it.

All of the above solutions are using meta-programming tools. These are tools that let programs treat programs as data. As such they are very powerful. In fact they are big hammers for such a simple problem as specifying a column name that is already stored as data. However, the above tools should not be blamed for the awkwardness of having a need for them.

The need for these meta-programming tools comes from the over-reliance on NSE interfaces. Or more precisely: the lack of a value oriented interface (possibly in addition to the NSE interface). NSE designs emphasize specifying things in program text instead of in data. This violation of “The Rule of Representation: Fold knowledge into data so program logic can be stupid and robust” is what causes the problems.

## 11 thoughts on “Programming Over lm() in R”

1. Constantinos says:

Unless I’m mistaken, you can also use the following, I believe:

```lm(formula_str,
data = d,
weights = d\$name_for_weight_column)
```
1. That is a good point, but it doesn’t actually work. `d\$name_for_weight_column` evaluates to `NULL` as there is no column named “`name_for_weight_column`“.

However that reminded me to check `d[[name_for_weight_column]]` which does work the way you probably intended. It turns out that sometimes works, but isn’t reliable. It turns out if the `data.frame` has a column with the same name as the variable name we are using to refer to the `data.frame` things break. The reason being: in the `lm()` execution environment `d` refers to the column and not the `data.frame` in this case.

The whole issue is: `lm()` is exerting so much control on how things are evaluated that a lot of the normal expectations of working in `R` are violated. Mostly we don’t hit them, but they are lurking there.

So thanks for the comment! You correctly pointed out I needed to check more.

Below is an example showing right and wrong answers.

```d <- data.frame(
x1 = c(1, 2, 3, 4),
x2 = c(0, 0, 1, 1),
y = c(3, 3, 4, -5),
wt = c(1, 2, 1, 1))
outcome_name <- "y"
explanatory_vars <- c("x1", "x2")
name_for_weight_column <- "wt"

formula_str <- paste(
outcome_name,
"~",
paste(explanatory_vars, collapse = " + "))

# unweighted answer (so we can see what it is)
lm(formula_str,
data = d)
#>
#> Call:
#> lm(formula = formula_str, data = d)
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>        9.75        -4.50         5.50

eval(bquote(
lm(
formula_str,
data = d,
weights = .(as.name(name_for_weight_column)))
))
#>
#> Call:
#> lm(formula = formula_str, data = d, weights = wt)
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>       9.429       -3.857        3.571

# this sets weights to NULL, not giving the correct answer
lm(formula_str,
data = d,
weights = d\$name_for_weight_column)
#>
#> Call:
#> lm(formula = formula_str, data = d, weights = d\$name_for_weight_column)
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>        9.75        -4.50         5.50
d\$name_for_weight_column
#> NULL

# this works, but only because there is no column named d!
lm(formula_str,
data = d,
weights = d[[name_for_weight_column]])
#>
#> Call:
#> lm(formula = formula_str, data = d, weights = d[[name_for_weight_column]])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>       9.429       -3.857        3.571
d[[name_for_weight_column]]
#>  1 2 1 1

d\$d <- 2
# now it fails
lm(formula_str,
data = d,
weights = d[[name_for_weight_column]])
#> Error in d[[name_for_weight_column]]: subscript out of bounds
d[[name_for_weight_column]]
#>  1 2 1 1
```
2. Constantinos says:

Yes, you are right. I actualy had tried:

```weights = d[, name_for_weight_column]
```

first, and it worked before “simplifying” it to:

```d\$name_for_weight_column
```

which I did not check Great work, anyway!

1. David Disabato says:

Assuming you are creating functions for your programming, why not just pass name_for_weight_column as an argument whose value is a string, and then do weights = d[[substitute(name_for_weight_column)]] inside the function?

1. I apologize, but I don’t think I understand your proposed solution.

```d <- data.frame(
x1 = c(1, 2, 3, 4),
x2 = c(0, 0, 1, 1),
y = c(3, 3, 4, -5),
wt = c(1, 2, 1, 1))
outcome_name <- "y"
explanatory_vars <- c("x1", "x2")
name_for_weight_column <- "wt"

f <- function(d,
outcome_name,
explanatory_vars,
name_for_weight_column) {
formula_str <- paste(
outcome_name,
"~",
paste(explanatory_vars, collapse = " + "))
lm(formula_str,
data = d,
weights = d[[substitute(name_for_weight_column)]])
}

# fn answer (turns to be unweighted)
f(d,
outcome_name,
explanatory_vars,
name_for_weight_column)
#>
#> Call:
#> lm(formula = formula_str, data = d, weights = d[[substitute(name_for_weight_column)]])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>        9.75        -4.50         5.50

formula_str <- paste(
outcome_name,
"~",
paste(explanatory_vars, collapse = " + "))

# unweighted solution
lm(formula_str,
data = d)
#>
#> Call:
#> lm(formula = formula_str, data = d)
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>        9.75        -4.50         5.50

# desired weighted solution
lm(formula_str,
data = d,
weights = wt)
#>
#> Call:
#> lm(formula = formula_str, data = d, weights = wt)
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>       9.429       -3.857        3.571

# and the column coincidence issue
d\$d <- 2

f(d,
outcome_name,
explanatory_vars,
name_for_weight_column)
#> Error in d[[substitute(name_for_weight_column)]]: subscript out of bounds
```
3. Chaehan So says:

Thank you John for this insightful blog post!
So far, I always wanted to ask “why not simply pass by string?” – but who dares to criticize the holy grail of tidyverse?
I totally agree with your criticism on the over reliance on NSE and the data-over-program text rule substantiates it.

1. To be clear, this isn’t the `tidyverse`‘s fault/problem. The NSE is in `lm()` and we can’t pass a string as it doesn’t seem to accept one.

4. Vaidotas Zemlys-Balevičius says:

There is a much simpler way to achieve this using call.

```#create the data
set.seed(1313)
dt % mutate(y = x1+x2+ rnorm(10))
dt % mutate(wt = runif(10)) %>% mutate(wt = wt/sum(wt))

#create the call

exp <-  call("lm", formula = NULL, data = NULL, weight = NULL)
exp\$formula <- parse(text = "y~x1+x2")[]
exp\$data <- as.name("dt")
exp\$weight <- as.name("wt")

#eval the call

eval(exp)

#Call:
#lm(formula = y ~ x1 + x2, data = dt, weights = wt)

#Coefficients:
#(Intercept)           x1           x2

0.2191       1.2041      -0.1593
```

As you can see you have a total control programmatically, as all the variables start as strings and then become expressions. Combine this with environment in eval and you have another level of control of the data.

If you look inside the code of lm, or any other base R model functions, you will see that this is pretty standard R practice. I used this approach in my own package midasr and found it works quite well.

1. That is a good solution, thank you for sharing it.

5. Dan Murphy says:

The issue with the solution in the prior reply is that “wt” cannot be a name otherwise used in the formula. Here is another base R solution that suffers from the same shortcoming, but which I find to be slightly more straightforward. Starting with the print statement:

```print(formula_str) # Explicitly connect the environment of the formula with the formula e <- new.env() # Place the values of the variables from d into e with the same names for (nm in names(d)) assign(nm, d[[nm]], e) # Using your preferred name for the weights, place their values in e as well assign("name_for_weight_column", d[[name_for_weight_column]], e) # Create the formula fmla <- as.formula(formula_str, env = e) # This call gives your desired results. lm(fmla, weights = name_for_weight_column) ```

6. What about passing matrices in?

```outc <- as.matrix(d[outcome_name]) expv <- as.matrix(d[explanatory_vars]) lm(outc ~ expv, weights = d[[name_for_weight_column]]) ```

There’s probably other complications to this I’m not thinking through, but it seems to work on the surface.