Posted on Categories Opinion, Programming, TutorialsTags ,

A Subtle Flaw in Some Popular R NSE Interfaces

It is no great secret: I like value oriented interfaces that preserve referential transparency. It is the side of the public debate I take in R programming.

"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."

Christopher Strachey, "Fundamental Concepts in Programming Languages", Higher-Order and Symbolic Computation, 13, 1149, 2000, Kluwer Academic Publishers (lecture notes written by Christopher Strachey for the International Summer School in Computer Programming at Copenhagen in August, 1967).

Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.

Most of my work on non-standard-evaluation (NSE or code capturing interfaces, alternatives to value oriented or referentially transparent interfaces) is to help eliminate and contain use of NSE. For instance wrapr::let() is designed to adapt existing NSE interfaces into standard value oriented interfaces, not to create new NSE interfaces. I pretty much feel composing one NSE interface into another NSE interface is cleaning up one mess by adding another mess. If wrapr::let() fails to clean up a mess: it can be evidence of a limitation of wrapr::let(), evidence of user error, or evidence the initial mess was already a problem.

Let’s take a look at a standard interface. For example we can use base-R to select a column from a data.frame by name (technically we are building a new data.frame with the selected column(s)). The clarity of the value oriented notation is most valuable when exposed to potentially unclear code (we most value help when we most need help). For example: suppose the user is selecting the column named x, and their selection is in a poorly-named variable called y (technically values are not "in" variables, but there are environments associating values with names).

y <- "x"
d <- data.frame(x = 1, y = 2)

d[, y, drop= FALSE] # returns a data.frame
##   x
## 1 1
d[[y]]  # returns a column
## [1] 1

And if the column we are looking for ("x") is not present we still get reliable and easily predictable results.

y <- "x"
d <- data.frame(y = 2)

d[, y, drop = FALSE] # exception, signalling no such column
## Error in `[.data.frame`(d, , y, drop = FALSE): undefined columns selected
d[[y]]  # NULL, signalling no such column
## NULL

Beyond the uninformative choice of variable name there was nothing confusing or dangerous about the above code.

Now let’s look at a non-standard or code capturing interface: "$".

y <- "x"
d <- data.frame(x = 1, y = 2)

d$y
## [1] 2
d <- data.frame(x = 1)
d$y
## NULL

Notice both of these operations used the name "y" (captured from user code) to try and retrieve a column named "y", regardless of what the value referred to by the variable "y" is. In these cases the interface is unambiguously taking the name from the user code. This non-standard interface is convenient and reliable, and has an obvious standard analogue in "[[]]", so it does not present problems.

Now let’s look at a more problematic non-standard interface base::subset()

y <- "x"
d <- data.frame(x = 1, y = 2)

subset(d, select = y)
##   y
## 1 2

subset() appears to be using the name found in the code to look-up columns. This is not quite the case: from help(subset) we see the select argument is defined as "expression, indicating columns to select from a data frame." What may not be obvious from a cursory read of the documentation is these expressions are evaluated in a new environment that maps the data.frame column names to integers (a clever way to replace column names with column indices) but in an enclosure by the current execution environment (via a call to parent.frame()). This subtlety means that any names that are not column names of the data.frame are evaluated as expressions in the caller’s environment. This in turn means they are replaced by values from this environment, which leads to very confusing outcomes such as the following.

y <- "x"
d <- data.frame(x = 1)

subset(d, select = y)
##   x
## 1 1

Notice the column "x" was returned, (not a NULL). Essentially each specified column name gets two chances to resolve to a data.frame column name or index: first from an environment mapping data.frame names to indices, and second from the calling environment.

The above issue would be fixed if the expression evaluation occurred in a limited environment where only a few symbols (such as "-" and "c" were defined). Such a function is given below:

subset_strict <- function(x, subset, select, drop = FALSE) {
  r <- if (missing(subset))
    rep_len(TRUE, nrow(x))
  else {
    e <- substitute(subset)
    r <- eval(e, x, parent.frame())
    if (!is.logical(r))
      stop("'subset' must be logical")
    r & !is.na(r)
  }
  vars <- if (missing(select))
    TRUE
  else {
    nl <- as.list(seq_along(x))
    names(nl) <- names(x)
    env <- new.env(parent = emptyenv())
    for(sym in c("-", "c", ":")) {
      assign(sym, get(sym, envir = parent.frame()), envir = env)
    }
    eval(substitute(select), nl, env)
  }
  x[r, vars, drop = drop]
}

This function is stricter than base::subset(), reliably signalling errors when non-columns are requested.

y <- "x"
d <- data.frame(x = 1)

subset_strict(d, select = y)
## Error in eval(substitute(select), nl, env): object 'y' not found
d <- data.frame(x = 1, y = 2, z = 3)
subset_strict(d, select = y)
##   y
## 1 2
subset_strict(d, select = -y)
##   x z
## 1 1 3
subset_strict(d, select = c(x, y))
##   x y
## 1 1 2
subset_strict(d, select = c(x:z))
##   x y z
## 1 1 2 3

Roughly I would say the above issue is a reason to not use base::subset() when you are not around to check if it is correct (i.e. in scripts or packages). And help(subset) has some text about this:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

If you don’t use base::subset() you would think this is a non-issue. Except a similar mal-feature seems to be also in dplyr::select(), and dplyr documentation doesn’t currently seem to advise not using dplyr in non-interactive environments.

help(select, package = "dplyr") tells us that select() takes positional argument from "..." and they are to be "One or more unquoted expressions separated by commas. You can treat variable names like they are positions." What is unclear is: what environment "..." is actually evaluated in.

R.version.string
## [1] "R version 3.5.0 (2018-04-23)"
library("dplyr")
packageVersion("dplyr")
## [1] '0.7.6'
packageVersion("rlang")
## [1] '0.2.2'
packageVersion("tidyselect")
## [1] '0.2.4'

Now consider the following code:

y <- "x"

data.frame(x = 1) %>% 
  select(y)

My theory is the typical dplyr user expectation is that the above should throw an exception (as that is what select(data.frame(x = 1), z) does in our clean environment). My opinion is: the above expresses that we are asking for a column named "y", and there is no such column in the data. The unfortunate coincidence that "y" has a value in our environment should be irrelevant to a select(). "y"’s value might (or might not) be relevant in a complex expression, such as a mutate()– but not to a select().

We the result actually is as before: the value "y" refers to in the executing environment can unfortunately alter the result (depending on the columns present in the data.frame).

y <- "x"

data.frame(x = 1) %>% 
  select(y)
##   x
## 1 1

Notice we got back a data.frame with a column named "x". There was no warning or error indicated. This wrong value could poison later results (especially if we had used pull() to convert the data into an unnamed vector). Maybe your code has never run into this, or maybe it already quietly has (as it is not a signalling error).

Notice how the above result differs from the following:

y <- "x"

data.frame(x = 1, y = 2) %>% 
  select(y)
##   y
## 1 2

The name/string version of this issue possibly goes back to dplyr 0.7.0 (the introduction of rlang). A numeric-index version of the issue can be exhibited for dplyr 0.5.0 and dplyr 0.4.0 (and possibly earlier, dplyr 0.4.0 is the earliest version of dplyr that is convenient to install in R 3.5.0). Notice how results vary with data changes below.

# restart R
install.packages("dplyr_0.4.0.tar.gz", repos = NULL)
# restart R
packageVersion("dplyr")
# [1] ‘0.4.0’
library("dplyr")


y <- "x"

data.frame(x = 1, y = 2) %>% 
   select(y)
#   y
# 1 2

data.frame(x = 1) %>% 
  select(y)
# Error: All select() inputs must resolve to integer column positions.
# The following do not:
# *  y

y <- 1L

data.frame(x = 1, y = 2) %>% 
   select(y)
#   y
# 1 2

data.frame(x = 1) %>% 
  select(y)
#   x
# 1 1
  

# restart R
install.packages("dplyr")
# restart R

It is possible this is in fact designed or intentional behavior. However, if that is the case it still violates some principles of safety through isolation and least astonishment.

In dplyr 0.7.6: select(.data[[y]]) shows the same issue, while select(.data[[!!y]]) does not (and select(.data$y) appears to always use the name "y", which seems right). I suspect ".data[y]", while commonly taught, may not be well-formed dplyr code (i.e. the user should not type that so the results are possibly user error, and not package error).

In conclusion:

  • base::subset() in addition to having known issues with the non-standard evaluation of its subset argument has a dangerous double look-up mal-feature in its select argument.
  • dplyr::select() has variations of a similar mal-feature. In the presence of this issue the only safe way to use name-capturing code with dplyr::select() appears to be select("y") or select(.data$y), which are both inconvenient enough to defeat the purpose of non-standard evaluation (the eliding of quotation marks in this case).
  • It is possible select(.data[[y]]) is not well-formed dplyr code, or is ambiguous (is it meant to operate on names, on values, or both?), or has a similar bug.

I think the above shows a small bit of how NSE interfaces can be confusing and hard to get right both for users and package developers. This is why I advise avoiding introducing NSE interfaces unless they are buying you something much larger than leaving out a few quote marks.

"Those who would give up essential Referential Transparency, to purchase a little temporary Notational Conveneince, deserve neither Referential Transparency nor Notational Conveneince."

(not) Ben Franklin

Edit: we have a scarier example here: dropping an unexpected column.

5 thoughts on “A Subtle Flaw in Some Popular R NSE Interfaces”

  1. I am playing around with a select_nse() that treats the un-evaluated arguments as names here.

    And I don’t really have a strong opinion how the following code should work.

    library("dplyr")
    
    selection <- sym("y")
    y <- "x"
    
    data.frame(x = 1, y = 2) %>% 
      select(!!selection)
    #   y
    # 1 2
    
    data.frame(x = 1) %>% 
      select(!!selection)
    #   x
    # 1 1
    
    1. I like the way select works. Conceptually I see lists or data frames as environments, so select checks if the variable is there locally, and if not, goes up the search path. I’d be frustrated if it worked otherwise.

      But I do agree it’s to be used in interactive code only, I did a quick check and I believe tidyr is in fact the only tidyverse package to import dplyr, and I’d be surprised if it used select. However dplyr is imported by many many other packages :

      https://cran.r-project.org/package=dplyr

      1. Thanks for your comment (and perspective).

        Actually tidyr does use dplyr::select() a couple of places. More importantly it uses tidyselect::vars_select() a few places, and that has the issue. For example from help(separate_rows, package = "tidy") we can get:

        library("tidyr")
        df <- data.frame(
          x = 1:3,
          y = c("a", "d,e,f", "g,h"),
          z = c("1", "2,3,4", "5,6"),
          stringsAsFactors = FALSE
        )
        
        # make sure no q in environment
        rm(list = intersect("q", ls()))
        
        # name a non-existent column, get a signalled error (good)
        separate_rows(df, y, q, convert = TRUE)
        # Error: `q` must evaluate to column positions or names, not a function
        # In addition: Warning message:
        #   'glue::collapse' is deprecated.
        # Use 'glue_collapse' instead.
        # See help("Deprecated") and help("glue-deprecated"). 
        
        q <- "z"
        # same code again, column silently re-maps (bad)
        separate_rows(df, y, q, convert = TRUE)
        #   x y z
        # 1 1 a 1
        # 2 2 d 2
        # 3 2 e 3
        # 4 2 f 4
        # 5 3 g 5
        # 6 3 h 6
        

        Notice y and q are doing completely different things while using the same syntax in a common context (that is hard on the user): y is acting as the name of a column q is acting as a reference to a value that is in turn the name of a column.

        1. It’s indeed more confusing. This intermediate case makes it easier to comprehend (same output still):

          separate_rows(df, "y", q, convert = TRUE)

          So arguments can be column names, if found, or a string describing a column name. So there are two possible types of argument, they might conflict and in this case the most usual/intuitive practice wins, I don’t think it’s so hard to teach, but I’m not a teacher :).

          I think the flexibility is worth the confusion (I do a lot of interactive work), but an additional names.only argument, FALSE by default, could be useful for safer programming.

  2. Honestly there is simple dilemma here in trying to say this is both simple (easy to teach) and intended behavior. If the above is intended behavior is intended then dplyr::select() is not easy to teach (one can not say “un-evaluated symbols are treated as column names”, one has to say “un-evaluated symbols that match columns are treated as column names, non-matching ones go through a different path”). If the above is not intended behavior then it is a bug. And I did say the issue was subtle.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.