
More on safe substitution in R

Let’s worry a bit about substitution in R. Substitution is very powerful, which means it can be both used and mis-used. However, that does not mean every use is unsafe or a mistake.

From Advanced R : substitute:



We can confirm that the following code, run at the top level, performs no substitution:

a <- 1
b <- 2
substitute(a + b + z)
## a + b + z

It appears that substitute() is designed not to take values from the global environment. So, as we see below, it is not so much the environment we are running in that changes substitute()’s behavior as the environment the values are bound in.

(function() {
  a <- 1
  substitute(a + b + z, 
             environment())
})()
## 1 + b + z
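To make the point concrete, here is a small sketch (the environment name e is ours, introduced only for illustration) that hands substitute() an environment built by hand, and then hands it globalenv() explicitly. Per help(substitute), ordinary variables are left unchanged when env is the global environment.

```r
# e is a hypothetical, hand-built environment (not from the original post)
e <- new.env()
assign("a", 1, envir = e)

# a is bound in e, so it is replaced; b and z are not bound in e, so they stay
in_e <- substitute(a + b + z, e)
print(in_e)
## 1 + b + z

# passing the global environment explicitly still performs no substitution
in_global <- substitute(a + b + z, globalenv())
print(in_global)
## a + b + z
```

Only the bindings held directly in the supplied environment matter here, which matches the function example above.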

In fact, several simple variations of substitute() work conveniently, even from the global environment:

substitute(a + b + z, 
           list(a=1, b=2))
## 1 + 2 + z
substitute(a + b + z, 
           as.list(environment()))
## 1 + 2 + z
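The replacement values need not be constants. As a small sketch (the names x and y are ours, for illustration only), a list entry can be a symbol built with as.name() or an unevaluated expression built with quote():

```r
# replace a with the symbol x, and b with the unevaluated expression y * 2
# (x and y are illustrative names; they never need to exist as variables)
res <- substitute(a + b + z,
                  list(a = as.name("x"), b = quote(y * 2)))
print(res)
## x + y * 2 + z
```

Substituting symbols rather than values is the form of re-mapping used in the rest of this note.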

Often R’s documentation is a bit terse (or even incomplete), and functions (confusingly) change behavior based on the type of their arguments and on context. I say: always try a few variations to see if some simple alteration can make "base-R" work for you before giving up and delegating everything to an add-on package.

However, we found we could not use substitute() to implement wrapr::let() effects (that is, re-mapping non-standard interfaces to parametric interfaces). There were some avoidable difficulties regarding quoting and un-quoting of expressions. But the killing issue was: substitute() apparently does not re-map left-hand sides:

# function that prints all of its arguments (including bindings)
f <- function(...) {
  args <- match.call()
  print(paste("f() call is:", capture.output(str(args))))
}

# set up some global variables
X <- 2
B <- 5

# try it
f(X=7, Y=X)
## [1] "f() call is:  language f(X = 7, Y = X)"
# use substitute to capture an expression
captured <- substitute(f(X=7, Y=X))
# print the captured expression
print(captured)
## f(X = 7, Y = X)
# evaluate the captured expression
eval(captured)
## [1] "f() call is:  language f(X = 7, Y = X)"
# notice above that by the time we get into the function
# the function arguments have taken their values first
# from explicit argument assignment (X=7) and then from
# the calling environment (Y=X goes to 2).

# now try to use substitute() to re-map values
xform1 <- substitute(captured, 
                     list(X= as.name('B')))
# doesn't look good in printing
print(xform1)
## captured
# and substitutions did not happen as the variables we
# are trying to alter are not free in the word "captured"
# (they are in the expression the name captured is referring to)
eval(xform1)
## f(X = 7, Y = X)
# can almost fix that by calling substitute on the value
# of captured (not the word "captured") with do.call()
subs <- do.call(substitute, 
                list(captured,  list(X= as.name('B'))))
print(subs)
## f(X = 7, Y = B)
eval(subs)
## [1] "f() call is:  language f(X = 7, Y = B)"
# notice however, only right hand side was re-mapped
# we saw "f(X = 7, Y = B)", not "f(B = 7, Y = B)"
# for some packages (such as dplyr) re-mapping
# left-hand sides is important
# this is why wrapr::let() exists

wrapr::let(
  c(X= 'B'),
  f(X=7, Y=X)
)
## [1] "f() call is:  language f(B = 7, Y = B)"
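One way to see why substitute() skips left-hand sides: in a call object the argument names are stored as names (tags) on the call’s elements, not as symbols inside the expression, so there is nothing for substitute() to find. The sketch below is ours and is not how wrapr::let() is implemented; it renames the tags by hand in base R. It handles only the top level of a single call and performs none of the checks let() performs.

```r
captured <- quote(f(X = 7, Y = X))

# argument names live in names(), separate from the symbols in the call
print(names(captured))
## [1] ""  "X" "Y"

# re-map the right-hand sides with substitute(), as before
subs <- do.call(substitute,
                list(captured, list(X = as.name("B"))))

# then re-map the matching argument name by hand
names(subs)[names(subs) == "X"] <- "B"
print(subs)
## f(B = 7, Y = B)
```

Doing this reliably for nested calls and arbitrary user input is exactly the kind of work one would rather delegate to a specialized, checked function.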

Re-mapping left-hand sides is an important capability when trying to program over dplyr:

suppressPackageStartupMessages(library("dplyr"))

d <- data.frame(x = 1:3)
mapping <- c(OLDCOL= 'x',
             NEWCOL= 'y')
wrapr::let(
  mapping,
  d %>%
    mutate(NEWCOL = OLDCOL*OLDCOL)
)
##   x y
## 1 1 1
## 2 2 4
## 3 3 9

wrapr::let() is based on string substitution, which is considered risky. Consider the following from help(substitute, package='base'):

Note

substitute works on a purely lexical basis. There is no guarantee that the resulting expression makes any sense.

And that is why wrapr::let() takes a large number of precautions and vets user input before performing any substitution.
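The hazard described in that Note is easy to reproduce. In the sketch below (following the example given in help(substitute)), a purely lexical replacement yields an expression that prints fine but cannot be evaluated:

```r
# substituting a value for x on both sides of an assignment
bad <- substitute(x <- x + 1, list(x = 1))
print(bad)
## 1 <- 1 + 1

# the result is syntactically a call, but evaluating it is an error
outcome <- try(eval(bad), silent = TRUE)
print(inherits(outcome, "try-error"))
## [1] TRUE
```

Checking inputs before substituting is one way to rule out this kind of result ahead of time.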

The idea is: wrapr::let() is more specialized than substitute(), so in addition to attempting extra effects (re-mapping left-hand sides) it can introduce a lot of checks to ensure safe invariants.

And that is a bit of my point: when moving to a package, look for specificity and safety in addition to "extra power." That is how wrapr::let() is designed and why wrapr::let() is a safe and effective tool to add to your production work-flows.

2 thoughts on “More on safe substitution in R”

  1. Since people have asked.

    One can always try to be safer. The development version of wrapr does include a parse-tree-only version of the substitution (that is, one working on the parse tree, not on text) as letprep_lang. When we get more experience with that in production (ensuring we are touching all language elements correctly) we will switch that to the default behavior. So trust wrapr::let() as an API that lets you declare intent, and trust that it is safe and that we are working on making it safer.

  2. As “There is usually more than one way in R” a natural question is: is there a function in R already supplying the desired re-writing effects? The answer is: it is always possible, R plus its packages is a big ecosystem.

    In detail:

    • substitute() doesn’t quite express the effect we want.
    • gtools ran into similar issues and ended up implementing strmacro() after already having defmacro(). We note this is where we first looked for ideas. We also specialized to a single application and added a lot of checks and limits to let().
    • base::bquote() supplies “An analogue of the LISP backquote macro” (from help(bquote)). However, let(list(COLUMNAME='x'), data.frame(x=1:3)$COLUMNAME) and eval(bquote(data.frame(x=1:3)$.(COLUMNAME), list(COLUMNAME='x'))) behave (likely by design) very differently.
    • lazyeval (“Provides a full implementation of LISP style ‘quasiquotation’, making it easier to generate code with other code”, from the June-8-2017 package description). However, we have found teaching and using the notation needlessly difficult. The development dplyr itself has removed its dependence on lazyeval, in favor of tidyeval/rlang.
    • tidyeval/rlang (“a general toolkit for non-standard evaluation, principally used to create domain-specific languages of grammars”, from the package description) wasn’t available when I proposed the method. I think wrapr::let() remains the superior solution for user application (the packages have different goals).
    • Re-mapping actual column names in a data.frame is an interesting (and deliberately limited) alternative, also in my original proposal. This ability is available as replyr::replyr_apply_f_mapped() in the development version of replyr.

    In developing, distributing, and teaching the wrapr::let() methodology I have had a full range of feedback (“there is no need”, “that isn’t possible in R”, “I don’t think your solution will work”, “x already solves this”, “wait for y to come out and solve this”, and “that solves my problem!”). We feel wrapr::let() is a very useful and very safe solution to the real-world problem of “easily writing readable, generic, and re-usable code for use in production in the presence of non-standard interfaces.” wrapr::let() is designed for one task only: substituting in variable and column names from parameters (for code that expects to take these from source code). It is not designed to generate new grammars of analysis, but to help you out when you find yourself “painted into a corner.”

    I know being seen to squabble with people much more popular than myself is not prudent. But the R ecosystem is both crowded (you cannot move without touching elbows) and large (there is a lot of room to contribute, but only if you stay in the game). The standard of discourse is: a contribution must fill an unmet need to be useful, which unfortunately means one must spend some time establishing that the need is in fact unmet (i.e., this is what forces one to discuss other packages and approaches, which is something I wish I could avoid, but which is required if I want to give my own package a fair chance in the marketplace of ideas).

    Meta-programming and macro re-writing are always going to be controversial in R. The reason is: R’s “functions” are very much like LISP fexprs. fexprs were deliberately not part of Common Lisp and Scheme (two of the dominant LISP descendants) under the theory that it was better to live in a world of true functions (entities that operate over already evaluated values, and do not have access to unevaluated code) and macros (entities that can re-write code for meta-programming effects).
