
Programming over R

R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.

Meta-Programming

Meta-programming is a powerful tool that allows one to re-shape a programming language or write programs that automate parts of working with a programming language.
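
As a small illustration of what "programs that work on programs" means in R, here is a minimal base-R sketch (the variable names are only for illustration): an expression is captured as data, altered, and then run.

```r
# Capture an expression as data instead of evaluating it.
expr <- quote(x + 1)

# The captured call can be inspected and altered much like a list.
expr[[1]] <- as.name("*")   # swap the function being called: x + 1 becomes x * 1
expr[[3]] <- 10             # swap the second argument: x * 1 becomes x * 10

rewritten <- deparse(expr)  # text form of the rewritten code
x <- 4
result <- eval(expr)        # run the rewritten code with x = 4
```

Here `rewritten` holds the text of the altered expression and `result` is what that altered expression evaluates to.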

Meta-programming itself has the central contradiction that one hopes nobody else is doing meta-programming, but that they are instead dutifully writing referentially transparent code that is safe to perform transformations over, so that one can safely introduce their own clever meta-programming. For example: one would hate to lose the ability to use a powerful package such as future because we already “used up all the referential transparency” for some minor notational effect or convenience.
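
To make the referential transparency concern concrete, here is a small base-R sketch (the helper `label_of` is hypothetical, written only for this example). Once a function inspects the code of its argument rather than its value, replacing a variable with an equal one changes the result, so automated rewrites of the calling code are no longer safe.

```r
# A function that looks at the *code* of its argument, not its value.
label_of <- function(x) deparse(substitute(x))

a <- 1
b <- a                      # b and a hold identical values

same_value <- identical(a, b)
label_a <- label_of(a)      # depends on the name used at the call site
label_b <- label_of(b)
```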

That being said, R is an open system and it is fun to play with the notation. I have been experimenting with different notations for programming over R for a while, and thought I would demonstrate a few of them here.

Let Blocks

We have been using let to code over non-standard evaluation (NSE) packages in R for a while now. This allows code such as the following:

library("dplyr")
library("wrapr")

d <- data.frame(x = c(1, NA))

cname <- 'x'
rname <- paste(cname, 'isNA', sep = '_')

let(list(COL = cname, RES = rname),
    d %>% mutate(RES = is.na(COL))
)

 #    x x_isNA
 # 1  1  FALSE
 # 2 NA   TRUE

let is in fact a quite handy notation that will work in a non-deprecated manner with both dplyr 0.5 and dplyr 0.6. It is how we are future-proofing our current dplyr workflows. There is a need, as all of the “standard evaluation”/“underscore” dplyr verbs are being marked deprecated in the next version of dplyr, meaning there is no parametric dplyr notation that is considered simultaneously current for both dplyr 0.5 and dplyr 0.6.
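
The idea behind this style of name substitution can be sketched in base R alone. The following is not the wrapr implementation, only a minimal illustration under simplifying assumptions: it replaces a stand-in name inside an expression and then evaluates the result with ordinary R semantics. Base `substitute()` does not rewrite argument names, so the result column is named with `[[<-` here.

```r
d <- data.frame(x = c(1, NA))
cname <- "x"
rname <- paste(cname, "isNA", sep = "_")

# Code written against the stand-in name COL.
template <- quote(is.na(COL))

# Replace the stand-in with the real column name ...
expr <- do.call(substitute, list(template, list(COL = as.name(cname))))

# ... then evaluate the rewritten code normally, using d's columns.
d[[rname]] <- eval(expr, d)
```

After this runs, `expr` is the call `is.na(x)` and `d` has gained a column named by `rname`.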

Unquoting

dplyr 0.6 is introducing a new execution system (alternately called rlang or tidyeval, see here) which uses a notation more like the following (but with fewer parentheses, and with the ability to control the left-hand side of an in-argument assignment):

beval(d %>% mutate(x_isNA = is.na((!!cname))))

The inability to re-map the left-hand side of the apparent assignment is because the “(!! )” notation doesn’t successfully masquerade as a lexical token valid on the left-hand side of assignments or function argument bindings.
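
The parsing restriction can be checked directly in base R. The sketch below uses `bquote()`, base R's own unquoting facility, as a stand-in for the notation above; the helper `parses` is hypothetical and only reports whether a string is syntactically valid R. No dplyr is needed, since the strings are parsed but never run.

```r
cname <- "x"
d <- data.frame(x = c(1, NA))

# bquote() quotes its argument but evaluates anything wrapped in .( ).
expr <- bquote(is.na(.(as.name(cname))))
value <- eval(expr, d)

# Hypothetical helper: TRUE if the text parses as R code, FALSE otherwise.
parses <- function(txt) {
  tryCatch({
    parse(text = txt)
    TRUE
  }, error = function(e) FALSE)
}

# An unquote marker is an ordinary expression in a value position ...
ok_value_side <- parses("mutate(d, x_isNA = is.na((!!cname)))")
# ... but an argument name must be a plain name or string, so the same
# marker in the name position is rejected by the parser.
ok_name_side <- parses("mutate(d, (!!rname) = is.na(x))")
```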

And there was an R language proposal for a notation like the following (but without the quotes, and with some care to keep it syntactically distinct from other uses of “@”):

ateval('d %>% mutate(@rname = is.na(@cname))')

beval and ateval are just curiosities implemented to try and get a taste of the new dplyr notation, and we don’t recommend using them in production: their ad-hoc demonstration implementations are just not powerful enough to supply a uniform interface. dplyr itself seems to be replacing a lot of R’s execution framework to achieve stronger effects.

Write Arrow

We are experimenting with “write arrow” (a deliberate homophone of “right arrow”). It allows the convenient storing of a pipe result into a variable chosen by name.

library("dplyr")
library("replyr")

'x' -> whereToStoreResult

7 %>% sin %>% cos %->_% whereToStoreResult

print(x)
 ## [1] 0.7918362

Notice that the pipeline result (cos(sin(7)), about 0.7918362) is stored in the variable “x”, not in a variable named “whereToStoreResult”. “whereToStoreResult” was able to name where to store the value parametrically.

This allows code such as the following:

for(i in 1:3) { 
  i %->_% paste0('x',i)
}

(Please run the above to see the automatic creation of variables named “x1”, “x2”, and “x3”, storing values 1,2, and 3 respectively.)
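
For readers who want to see the mechanism without installing anything, here is a base-R sketch of a parametric assignment operator. The operator `%to_name%` is hypothetical and is not the replyr implementation (in particular, it does nothing special for tbl types); it only shows that such an operator can be built from `assign()` and `parent.frame()`.

```r
# Hypothetical operator: store `value` in the variable named by the string `name`.
`%to_name%` <- function(value, name) {
  # parent.frame() is the environment the operator was called from.
  assign(name, value, envir = parent.frame())
  invisible(value)
}

"x" -> whereToStoreResult
cos(sin(7)) %to_name% whereToStoreResult

for (i in 1:3) {
  i %to_name% paste0("x", i)
}
```

After this runs, `x` holds the pipeline value from the earlier example, and `x1`, `x2`, and `x3` hold 1, 2, and 3.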

We know left to right assignment is heterodox; but the notation is very slick if you are consistent with it, and add in some formatting rules (such as insisting on a line break after each pipe stage).

Conclusion

One wants to use meta-programming with care. In addition to bringing in desired convenience, it can have unexpected effects and interactions deeper in a language or when exposed to other meta-programming systems. This is one reason why a “seemingly harmless” proposal such as “user defined unary functions” or “at unquoting” takes so long to consider. This is also why new language features are best tried in small packages first (so users can easily choose to include them or not in their larger workflow), either to drive public request for comments (RFC) processes or to allow the ideas to evolve rather than be frozen at their first good idea. A great example of community-accepted change is Haskell’s switch from request-chaining IO to monadic IO; the first IO system “seemed inevitable” until it was completely replaced.

11 thoughts on “Programming over R”

  1. I am still trying to wrap my head around the new quo system in dplyr 0.6. Personally, I found the let mechanism intuitive, but according to Hadley Wickham the new notation in dplyr is stronger (hence his decision to go with it). Do you have any plans to write a post comparing the two approaches?

    1. Thanks!

      The dplyr unquoting system has the advantage that you don’t have to pick intermediate names for replacements. Also dplyr already needs to capture expressions to send them to remote back-ends anyway (such as databases and Spark), so dplyr itself managing its own substitution makes a lot of sense. The ‘!!!’ splicing may be useful as a standard-evaluation equivalent to ‘…’ or varargs, but I think that is somewhat different functionality getting mixed into the same conception.

      The let re-writing system has the advantage that the semantics are: your code is substituted as follows and that code is then run with standard R semantics. tidyeval/rlang (which I assume will not stay limited to dplyr) have their own data structures (“quosure”) and execution semantics which will end up having to be taught if one is to properly use them.

      Both systems have precedents in LISP (let/in blocks, back-ticking, and macro replacement). I probably will compare the two some time after dplyr 0.6 is out on CRAN and everybody picks up experience with it. I haven’t seen any writing comparing let versus unquoting yet, so any links would be appreciated.

    1. I assume you mean let (which has moved to the wrapr package). wrapr::let is a nice small and clear function that I do suggest using in production.

      replyr is essentially my notes on using dplyr 0.5 well on Spark and databases. So really dplyr does all the heavy lifting. Also replyr will have to be re-tested after dplyr 0.6 comes out.

  2. Some details on quo(), !!, :=, quos(), !!!, UQ(), quosures, and .data (and possibly other “pronouns”): http://rpubs.com/hadley/dplyr-programming. If one masters/teaches all of these terms (and their syntax and semantics end up getting specified and stable) you have the new tidyeval/package-rlang execution system.

    := is the odd one. It appears to be an unused (obsolete) assignment operator that is still recognized by the parser. So it can be used syntactically to signal assignment to dplyr verbs, unless or until R’s parser is fixed to reflect the current language documentation (which does not mention :=).

    1. I agree, and that should already work (sorry I did not mention it!). With the left to right assignment plus action you don’t need the reversal notation (taking right value instead of left), you start out good to go.

      The replyr assignment operators return their value (after making it eager by calling dplyr::compute if the class includes “tbl”) as an invisible. So the following works:

      library("dplyr")
      library("replyr")
      
      5 %->% x %>% sqrt
       # [1] 2.236068
      print(x)
       # [1] 5
      
      'z' -> whereToStore
      5 %->_% whereToStore %>% sqrt
       # [1] 2.236068
      print(z)
       # [1] 5
      

      The “eagerization” is discussed here and is to make debugging easier (based on the assumption the programmer knows they want the primary result of their pipeline).

  3. I don’t really understand the use case of the write arrow. It already works in dplyr:

    > 'x' -> whereToStoreResult
    > 7 %>% sin %>% cos -> .GlobalEnv[[whereToStoreResult]]
    > x
    [1] 0.7918362

    This also works with %T>% if you need the value for more things.

    So where is the added value from your operator?

    1. It is a matter of personal taste and choice, one certainly doesn’t need these arrows (as one doesn’t need a lot of the popular R notations).

      If you are willing to write “-> .GlobalEnv[[whereToStoreResult]]” and the environment you want to store in is always the global environment then you don’t need the %->_% arrow. Another difference is that %->_% and %->% call compute() on tbl-types to try and “eagerize” computation and make debugging easier.

      1. To your point about always requiring the global environment as the destination, that isn’t true – you could easily name another environment or even allow the caller to pass in whatever environment they wanted as the environment to store the result in. If you wanted them to be able to name an environment in a string that would be trickier but still possible. How does %->_% allow you to specify a specific destination environment?

        Finally, I would argue that %->_% isn’t analogous to most of the other novel R notations such as magrittr pipes and dplyr’s non-standard evaluation. The function doesn’t enable or extend support to a new way of thinking or specifying data or remove any byzantine syntax (such as the floating comma in df[expr,]). Is specifying just once the name of the environment where you want to save your variable really so much more inconvenient than trying to remember %->_%?

        I know I’m skirting near to sounding like those dinosaurs who make similar comments about non-standard evaluation, but the advantage of those kinds of syntaxes is the mindset change, not the simplicity or not of the syntax. Your write arrow doesn’t create any new mindset changes.

        1. If you don’t like the arrow notation, please don’t feel you have to use it. And please do feel free to criticize, that is what they are up here for.

          However, I do feel alternatives are not quite as simple as your first comment implies. I.e.: there may in fact be a need for convenient parametric assignment, independent of whether the arrow meets that need or not. That is what I am responding to, not trying to argue “you need to like arrows.” Your conclusion is, of course, your own, but some of your minor premises I think deserve a bit of tuning, which I am trying to apply here.

          The arrows are part of a larger article showing a few possible notations; I myself prefer the let() notation in practice. The intent is to use them to show context and possibilities relative to other notations (as is also the case with ateval() and beval).

          The arrows deliberately don’t have a way to specify the environment as they are designed to choose the current environment (which is not always the global environment).

          Also assignment in R is not quite as regular as your initial comment might imply. “The first to mind” notation (i.e., easiest to remember, right or wrong) to assign into the current environment with arrow (instead of using assign) doesn’t work. Yes there is a notation that works, but it isn’t easy to remember (same for assign). The issue is .GlobalEnv is syntactically a valid left hand side and some obvious functional notations are not, as I show in the code below.

          # works
          > .GlobalEnv[['x']] <- 12
          # not syntactic
          > (environment())[['x']] <- 12
           # Error in (environment())[["x"]] <- 12 : 
           #  invalid (NULL) left side of assignment
          

          Not every R user has mastered all of the intricacies of what is allowed or not allowed in the language, so uniform notations (even limited ones) can be helpful.
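
For completeness, here is a base-R sketch of two spellings that do assign into the current (not necessarily global) environment, as discussed in the last reply above. The function name `store_locally` is hypothetical and exists only to provide a non-global environment to assign into.

```r
store_locally <- function() {
  # assign() accepts any environment, including the current one.
  assign("x", 12, envir = environment())

  # Binding the environment to a name first gives a valid left-hand side.
  e <- environment()
  e[["y"]] <- 13

  # Both variables now exist in this function's own environment.
  c(x = x, y = y)
}

local_values <- store_locally()
```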

Comments are closed.