
Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE) and composing non-standard-evaluation interfaces (NSE) in R.

In R the package tidyeval/rlang is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.

To use it you must know some of its structure and notation. Here are some details paraphrased from the documentation of the major tidyeval/rlang client, the package dplyr (see vignette('programming', package = 'dplyr')).

  • ":=" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
  • "!!" substitution requires parentheses to safely bind (so the notation is actually "(!! )", not "!!").
  • Left-hand-sides of expressions are names or strings, while right-hand-sides are quosures/expressions.
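The parenthesis point in the list above can be traced to base R operator precedence, where unary `!` binds more loosely than arithmetic. The following is a minimal base-R sketch (no tidyeval/rlang involved; it only inspects how R parses the two spellings, and `x` is just a placeholder name). Later rlang releases may adjust how `!!` binds, so treat this as an illustration of the underlying parsing rule rather than a statement about any particular rlang version.

```r
# Parse, but do not evaluate, the two spellings.
unparenthesized <- quote(!!x + 1)
parenthesized <- quote((!!x) + 1)

# Without parentheses the outermost call is `!`, and the innermost
# argument is the whole sum: R reads the expression as !(!(x + 1)).
outer_op_unparenthesized <- as.character(unparenthesized[[1]])
inner_arg_unparenthesized <- unparenthesized[[2]][[2]]

# With parentheses the outermost call is `+`, so "!!" applies to x alone.
outer_op_parenthesized <- as.character(parenthesized[[1]])

print(outer_op_unparenthesized)
print(inner_arg_unparenthesized)
print(outer_op_parenthesized)
```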

Example

Let’s apply tidyeval/rlang notation to the task of building re-usable generic functions in R.

# setup
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.0'

vignette('programming', package = 'dplyr') includes the following example:

my_mutate <- function(df, expr) {
  expr <- enquo(expr)
  mean_name <- paste0("mean_", quo_name(expr))
  sum_name <- paste0("sum_", quo_name(expr))

  mutate(df, 
    !!mean_name := mean(!!expr), 
    !!sum_name := sum(!!expr)
  )
}

We can try this:

d <- data.frame(a=1)
my_mutate(d, a)
##   a mean_a sum_a
## 1 1      1     1

SE Example

From this example we can figure out how to use tidyeval/rlang notation to build a standard-evaluation interface version of a function that adds one to a column and lands the value in an arbitrary column:

tidy_add_one_se <- function(df, res_var_name, input_var_name) {
  input_var <- as.name(input_var_name)
  res_var <- res_var_name
  mutate(df,
         !!res_var := (!!input_var) + 1)
}

tidy_add_one_se(d, 'res', 'a')
##   a res
## 1 1   2

And we can re-wrap tidy_add_one_se into an "add one to self" function as we show here:

tidy_increment_se <- function(df, var_name) {
  tidy_add_one_se(df, var_name, var_name)
}

tidy_increment_se(d, 'a')
##   a
## 1 2

NSE Example

We can also use the tidyeval/rlang notation more as it is intended: to wrap or compose a non-standard interface in another non-standard interface.

tidy_add_one_nse <- function(df, res_var, input_var) {
  input_var <- enquo(input_var)
  res_var <- quo_name(enquo(res_var))
  mutate(df,
         !!res_var := (!!input_var) + 1)
}

tidy_add_one_nse(d, res, a)
##   a res
## 1 1   2

And we can even wrap this again as a new "add one to self" function:

tidy_increment_nse <- function(df, var) {
  var <- enquo(var)
  tidy_add_one_nse(df, !!var, !!var)
}

tidy_increment_nse(d, a)
##   a
## 1 2

(The above enquo() then "!!" pattern is pretty much necessary, as the simpler idea of just passing var through doesn’t work.)
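To see why plain pass-through fails, here is a base-R analogy using substitute() and bquote()/.() rather than tidyeval/rlang. The function names (`show_arg`, `naive_wrapper`, `splicing_wrapper`) are made up for this sketch; the point is only that an NSE function sees the expression from its most recent call, so a wrapper has to capture and then re-insert the original expression.

```r
# An NSE-style function: it reports the expression it was called with.
show_arg <- function(x) deparse(substitute(x))

# Naive wrapper: simply passes its argument through.
naive_wrapper <- function(var) show_arg(var)

# Capture-then-release wrapper: capture the caller's expression, splice it
# into a new call with bquote()/.(), then evaluate that call.
splicing_wrapper <- function(var) {
  captured <- substitute(var)
  eval(bquote(show_arg(.(captured))))
}

naive_result <- naive_wrapper(a)        # the inner call only sees `var`
splicing_result <- splicing_wrapper(a)  # the original name `a` survives

print(naive_result)
print(splicing_result)
```

The enquo() then "!!" pair plays a similar capture-and-release role in the tidyeval/rlang notation.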

An Issue

We could try to use base::substitute() instead of quo_name(enquo()) in the non-standard-evaluation wrapper. At first this appears to work, but it runs into trouble when we try to compose non-standard-evaluation functions with each other.

tidy_add_one_nse_subs <- function(df, res_var, input_var) {
  input_var <- enquo(input_var)
  res_var <- substitute(res_var)
  mutate(df,
         !!res_var := (!!input_var) + 1)
}

tidy_add_one_nse_subs(d, res, a)
##   a res
## 1 1   2

However this seemingly similar variation is not re-composable in the same manner.

tidy_increment_nse_subs <- function(df, var) {
  var <- enquo(var)
  tidy_add_one_nse_subs(df, !!var, !!var)
}

tidy_increment_nse_subs(d, a)
## Error: LHS must be a name or string

Likely there is some way to get this to work, but my point is:

  • The obvious way didn’t work.
  • Some NSE functions can’t be re-used in standard NSE composition. You may not know which ones those are ahead of time. Presumably functions from major packages are so-vetted, but you may not be able to trust "one off compositions" to be safe to re-compose.
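One plausible reading of the error, sketched in base R only: substitute() returns whatever expression the caller wrote, so when the outer wrapper passes "!!var" the inner function receives a call rather than a bare name. The helper `capture_lhs` is hypothetical and stands in for the `res_var <- substitute(res_var)` step; this is an illustration, not a trace of what dplyr does internally.

```r
# Stand-in for the substitute() step in the wrapper above.
capture_lhs <- function(res_var) substitute(res_var)

direct <- capture_lhs(res)      # called with a bare name
composed <- capture_lhs(!!var)  # called the way the outer wrapper calls it

direct_is_name <- is.name(direct)
composed_is_name <- is.name(composed)
composed_is_call <- is.call(composed)

print(direct_is_name)    # TRUE: a name, acceptable as a left-hand side
print(composed_is_name)  # FALSE: no longer a name
print(composed_is_call)  # TRUE: an unevaluated call to `!`
```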

wrapr::let

It is easy to specify the function we want with wrapr as follows (both using standard evaluation, and using non-standard evaluation):

SE version

library("wrapr")

wrapr_add_one_se <- function(df, res_var_name, input_var_name) {
  wrapr::let(
    c(RESVAR= res_var_name,
      INPUTVAR= input_var_name),
    df %>%
      mutate(RESVAR = INPUTVAR + 1)
  )
}

wrapr_add_one_se(d, 'res', 'a')
##   a res
## 1 1   2

Standard composition:

wrapr_increment_se <- function(df, var_name) {
  wrapr_add_one_se(df, var_name, var_name)
}

wrapr_increment_se(d, 'a')
##   a
## 1 2

NSE version

Non-standard evaluation interface:

wrapr_add_one_nse <- function(df, res_var, input_var) {
  wrapr::let(
    c(RESVAR= substitute(res_var),
      INPUTVAR= substitute(input_var)),
    df %>%
      mutate(RESVAR = INPUTVAR + 1)
  )
}

wrapr_add_one_nse(d, res, a)
##   a res
## 1 1   2

wrapr::let()’s NSE composition pattern seems to work even when applied to itself:

wrapr_increment_nse <- function(df, var) {
  wrapr::let(
    c(VAR= substitute(var)),
    wrapr_add_one_nse(df, VAR, VAR)
  )
}

wrapr_increment_nse(d, a)
##   a
## 1 2

Abstract Syntax Tree Version

Or, if you are uncomfortable with macros being implemented through string substitution, you can use wrapr::let() in "language mode" (where it works directly on abstract syntax trees).
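The difference between the two approaches can be sketched in base R. The helpers below (`subs_by_string`, `subs_by_language`) are hypothetical and written only for this illustration; they are not wrapr's implementation. Note one simplification: base substitute() rewrites symbols inside an expression but not argument names, so the sketch uses an assignment instead of a `mutate(RESVAR = ...)` call.

```r
# Hypothetical helpers for illustration only; neither is part of wrapr.

# String mode: rewrite whole-word placeholders in source text, then parse.
subs_by_string <- function(code_text, mapping) {
  for (placeholder in names(mapping)) {
    pattern <- paste0("\\b", placeholder, "\\b")
    code_text <- gsub(pattern, mapping[[placeholder]], code_text, perl = TRUE)
  }
  parse(text = code_text)[[1]]
}

# Language mode: replace symbols directly in the already-parsed expression.
subs_by_language <- function(expr, mapping) {
  do.call(substitute, list(expr, lapply(mapping, as.name)))
}

mapping <- c(RESVAR = "res", INPUTVAR = "a")
from_string <- subs_by_string("RESVAR <- INPUTVAR + 1", mapping)
from_language <- subs_by_language(quote(RESVAR <- INPUTVAR + 1), mapping)

# Evaluate the rewritten call against some concrete data.
env <- list2env(list(a = 1))
eval(from_language, env)

print(from_string)
print(from_language)
print(env$res)
```

Working on the parsed expression avoids accidental matches inside string literals or comments, which is the usual worry with text-level substitution.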

SE re-do

wrapr_add_one_se <- function(df, res_var_name, input_var_name) {
  wrapr::let(
    c(RESVAR= res_var_name,
      INPUTVAR= input_var_name),
    df %>%
      mutate(RESVAR = INPUTVAR + 1),
    subsMethod= 'langsubs'
  )
}

wrapr_add_one_se(d, 'res', 'a')
##   a res
## 1 1   2

wrapr_increment_se <- function(df, var_name) {
  wrapr_add_one_se(df, var_name, var_name)
}

wrapr_increment_se(d, 'a')
##   a
## 1 2

NSE re-do

wrapr_add_one_nse <- function(df, res_var, input_var) {
  wrapr::let(
    c(RESVAR= substitute(res_var),
      INPUTVAR= substitute(input_var)),
    df %>%
      mutate(RESVAR = INPUTVAR + 1),
    subsMethod= 'langsubs'
  )
}

wrapr_add_one_nse(d, res, a)
##   a res
## 1 1   2

wrapr_increment_nse <- function(df, var) {
  wrapr::let(
    c(VAR= substitute(var)),
    wrapr_add_one_nse(df, VAR, VAR),
    subsMethod= 'langsubs'
  )
}

wrapr_increment_nse(d, a)
##   a
## 1 2

Conclusion

tidyeval/rlang provides general tools to compose or daisy-chain non-standard-evaluation functions (i.e., write new non-standard-evaluation functions in terms of others). This tries to mitigate the problem that non-standard function interfaces are hard to compose (i.e., one cannot parameterize them or program over them without a tool such as tidyeval/rlang). In contrast, wrapr::let() concentrates on standard evaluation, providing a tool that allows one to re-wrap non-standard-evaluation interfaces as standard-evaluation interfaces.

A lot of the tidyeval/rlang design is centered on treating variable names as lexical closures that capture an environment they should be evaluated in. This does make them more like general R functions (which also have this behavior).
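A base-R analogy may help here: one-sided formulas also carry the environment in which they were created, so the same expression can evaluate differently depending on which environment is used. This is only an analogy to the quosure idea described above, not a description of how rlang implements quosures; `make_formula` is a made-up name.

```r
# A formula remembers the environment in which it was created.
make_formula <- function() {
  y <- 10
  ~ y + 1
}

f <- make_formula()
y <- 0  # a different `y` in the current environment

# Evaluate the formula's right-hand side in its captured environment ...
value_in_captured_env <- eval(f[[2]], environment(f))
# ... and in the current environment.
value_in_current_env <- eval(f[[2]])

print(value_in_captured_env)  # uses y = 10 from make_formula()
print(value_in_current_env)   # uses y = 0 defined above
```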

However, creating so many long-term bindings is actually counter to some common data analyst practice.

The my_mutate(df, expr) example itself from vignette('programming', package = 'dplyr') even shows the pattern I am referring to: the analyst transiently pairs a chosen concrete data set to abstract variable names. One argument is the data and the other is the expression to be applied to that data (and only that data, with clean code not capturing values from environments).

Many methods are written expecting to be re-run on different data (for example predict()). This has the huge advantage that it documents your intent to change out what data is being applied (such as running a procedure twice, once on training data and once on future application data).
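As a small illustration of this pattern, using made-up data: the model formula names columns abstractly, and the concrete data set is supplied separately each time.

```r
# Illustrative data only; the values carry no meaning.
train <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
future <- data.frame(x = c(5, 6))

# The formula y ~ x refers to column names, not to a particular data set.
model <- lm(y ~ x, data = train)

# The same fitted procedure is re-applied to different data.
fitted_on_train <- predict(model, newdata = train)
predictions <- predict(model, newdata = future)

print(predictions)
```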

This is a principle we also strongly apply in our join controller, which has no issue sharing variables out as an external spreadsheet, because it thinks of variable names (here meaning column names) as fundamentally being strings (not as quosures temporarily working "under cover" in string representations).
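A minimal base-R sketch of the "column names are strings" view, assuming nothing beyond base R. The function `add_one_se` and the `plan` table are hypothetical and are not taken from the join controller; the point is only that string names can live in ordinary data (such as a spreadsheet-like table) and be applied to any data frame later.

```r
# Standard-evaluation interface: column names arrive as plain strings.
add_one_se <- function(df, res_var_name, input_var_name) {
  df[[res_var_name]] <- df[[input_var_name]] + 1
  df
}

d <- data.frame(a = 1)

# The names themselves can be stored as data and applied row by row.
plan <- data.frame(result = c("res", "a"),
                   input = c("a", "a"),
                   stringsAsFactors = FALSE)

out <- d
for (i in seq_len(nrow(plan))) {
  out <- add_one_se(out, plan$result[[i]], plan$input[[i]])
}

print(out)
```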

10 thoughts on “Non-Standard Evaluation and Function Composition in R”

  1. I think the (very mathy) crux of it is: function composition is natural for standard-evaluation interfaces and isn’t natural for non-standard-evaluation interfaces (as they try to capture the details of the most recent call). However, there probably is a monadic adapter for which non-standard-evaluation interface composition is natural. I.e., this adapter would automatically interpolate in all the “capture and then release” notation that is getting added to arguments. In that world: NSE interfaces would then just nicely compose (with no additional explicit notational overhead). However, prior to introducing such a composition manager or higher-order combinator, non-standard-evaluation interface composition looks laborious.

    Keep in mind rlang/tidyeval isn’t the first or only attempt at this even in R. There is also at least bquote/.() and lazyeval (which dplyr recently removed its dependency on).

    For fun in base-R:

    x <- 7
    VARNAME <- as.name('x')
    eval(bquote(1 + .(VARNAME)))
    # 8
    
    1. I wrote a couple SO answers which illustrate how one can use quosures and parse_quosure() to pass strings to dplyr, as we once did with the deprecated underscore verbs.

      https://stackoverflow.com/a/44594223/845800
      https://stackoverflow.com/a/44593617/845800

      I do appreciate wrapr::let as an alternative solution, and it’s not yet clear to me if we can easily cover all use cases with my approach. But given Hadley’s track record of producing quality frameworks, I think we should make a serious effort to figure out the idiomatic solutions to our problems within his framework.

      1. Thanks for your comment, and sharing your perspective.

        wrapr::let() is designed only for the use case of dealing with variables by name (which is an important central case). Because of that the first example is quite clear in wrapr::let() notation:

        library("wrapr")
        
        f1w <- function(df, group_var_name, unique_var_name) {
          let(
            c(GROUP_VAR = group_var_name,
              UNIQUE_VAR = unique_var_name),
            df %>%
              group_by(GROUP_VAR) %>%
              summarise(n_uniq = n_distinct(UNIQUE_VAR)) %>%
              filter(n_uniq > 1)
          )
        }
        
        c <- f1w(iris, "Sepal.Length", "Sepal.Width")
        # identical(a,c)
        # # TRUE
        

        The second example (building up an entire expression) is different, and in fact something `wrapr::let()` deliberately forbids (as you probably want to build up full expressions in other ways). So yes `wrapr::let()` doesn’t help with that (legitimate) use case.

        As to your closing comments on rlang / tidyeval, I think there is evidence rlang / tidyeval can experience bad interactions with other tidyverse components (one example here), and package authorship can be a subtle thing (i.e., different packages have different distributions of contributors, even in the `tidyverse`).

        1. Thanks! I agree with all your points and I think your argument for wrapr::let is persuasive. Personally all I want is a safe standard evaluation system that lets me use column names and ideally expressions.

          I am steering clear of enquo because I don’t understand it and I haven’t been able to consistently predict its behavior in simple examples. Sometimes I’d rather type a few extra keystrokes so I don’t have to spend extra minutes or hours pondering the finer points of metaprogramming to make my script work. Your links suggest you are of a similar mind!

          1. Wow, thanks Paul,

            I apologize for coming off a bit “debatey” in my last comment.

            I should have said: I also agree with your premise: we should (at least at first) try hard to use the idiom that comes from a given package to work with that package. That makes sense.

            I also see you have been working at that and even linked your useful SO answers to [the dangling documentation issue on this topic](https://github.com/tidyverse/rlang/issues/116). This does have a benefit, as it will in fact help people with the rlang / tidyverse notation.

            Also I think you hit the nail on the head: it is all about staying in the moment and with the task at hand (doing data science, not pondering the finer points of meta-programming).

            Finally, I have also found rlang / tidyverse behavior unfortunately hard to predict in some (reasonable) cases.

  2. Side note, regarding quosures, I had a thought the other day. I think quosures are basically functions except code using a quosure is allowed to know the symbols used as formal arguments.

    In other words, the only difference between function(x) x*y and quo(x*!!y) is that, in the latter case, code that uses the quosure knows that the free variable in the expression is the symbol x. With that knowledge, functions like dplyr’s select can go looking for the symbol x in an environment holding other symbols, such as a data frame.

    So why don’t we just declare quosures like this: quosure(x) x*y. Wouldn’t that be a lot more natural for users than quo(x*!!y) ?

    1. To capture environments and/or execution intent R already has (at least): functions (also called “closures” in R, and in actuality quite close to Lisp fexprs), R formulas, R expression objects, R language objects, R name/symbol objects, R strings (also called “character”), environments, lists, and source code. So it is a bit of a challenge for something like a quosure to be completely different from all of these.

      R already has a lot of quoting and un-quoting notations: bquote()/.() (try: “a=1; b=3; VAR_NAME=as.name('b'); eval(bquote(a+.(VAR_NAME)))”), substitute()/eval(), and even an extra un-quoting mechanism such as seen in “FNNAME = 'ls'; help((FNNAME))”.

      And dplyr itself (depending on the version) has at least 4 evaluation systems: standard-R, lazyeval, hybrid-eval, and tidy-eval.

      It is a lot to have to keep in mind. And checking the interactions between them has proven prohibitive.

      1. Oh I completely agree. Ideally this would be nothing more than a nicer syntax to create quosures, and would have no other impact on the R ecosystem whatsoever. Perhaps easier said than done :)
