
Iteration and closures in R

I recently read an interesting thread on unexpected behavior in R when creating a list of functions in a loop or iteration. The issue is solved, but I am going to take the liberty to try and re-state and slow down the discussion of the problem (and fix) for clarity.

The issue is: are references or values captured during iteration?

Many users expect values to be captured. Most programming language implementations capture variables or references (leading to strange aliasing issues). This is confusing (especially in R, which pushes so far in the direction of value-oriented semantics) and is best demonstrated with concrete examples.



Please read on for some of the history and future of this issue.

for loops

Consider the following code run in R version 3.3.2 (2016-10-31):

 functionsFor <- vector(2, mode='list')
 for(x in 1:2) { 
   functionsFor[[x]] <- function() return(x)
 }

 functionsFor[[1]]()

 # [1] 2

In real applications the functions would take additional arguments and perform calculations involving both the “partially applied” x and these future arguments. Obviously if we just wanted values we would not use functions. However, this trivial example is much simpler (except for the feeling it is silly) than a substantial application. The notation gets confusing even as we stand. But partial application (binding values into functions) is a common functional programming pattern (which happens to not always interact well with iteration).

Notice the answer printed is 2 (not 1).

This is because all the functions created in the loop captured a closure or reference to the same variable x (which is 2 at the end of the loop). The functions did not capture the value x had when the functions were created. We can confirm this by moving x around by hand, as we show below.

 x <- 4
 functionsFor[[1]]()

 # [1] 4
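One way to see the sharing directly is to compare the enclosing environments of the functions. This is a minimal sketch; `fs` is an illustrative name that does not appear in the thread.

```r
fs <- vector(mode = 'list', length = 2)
for(x in 1:2) {
  fs[[x]] <- function() return(x)
}

# Both functions were created in the same environment,
# so both look up the same variable x when called.
identical(environment(fs[[1]]), environment(fs[[2]]))
# [1] TRUE

fs[[1]]()
# [1] 2
```

The loop never creates a new environment per iteration, so there is nowhere for a per-iteration value of x to live.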

This is a well-known language design issue.

Trying to work-around it

The more complicated examples referenced in the thread are variations of the standard work-around: build a function factory so each function has a different closure (the new closures being the execution environments of each factory invocation). That code looks like the following:

 functionsFor2 <- vector(2, mode='list')
 for(x in 1:2) {
   functionsFor2[[x]] <- (function(x) {
     return(function() return(x))
   })(x)
 }

 functionsFor2[[1]]()

 # [1] 2

The outer function (which gets called) is called the factory and is trivial (we are only using it to get new environments). The inner function is our example, which in the real world would take additional arguments and perform calculations involving these arguments in addition to x.

Notice the “fix” did not work. There is more than one problem lurking, and this is why so many experienced functional programmers are surprised by the behavior (despite probably having experience in many other functional languages). R “functions” differ from those of many current languages in that they have semantics closer to what Lisp called an fexpr. In particular arguments are subject to “lazy evaluation” (a feature R implements by a bookkeeping process called “promises”).
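A small sketch of what lazy evaluation of arguments means in practice. The names `ignoresArg`, `delayed`, `v`, and `y` are hypothetical and chosen only for this illustration.

```r
# ignoresArg is a hypothetical function that never uses its argument.
ignoresArg <- function(a) {
  return('argument never evaluated')
}

# The argument expression is held as an unevaluated promise,
# so stop() is never called.
ignoresArg(stop('this error is never raised'))
# [1] "argument never evaluated"

# A promise is evaluated when it is first used,
# not when the function is called.
y <- 1
delayed <- (function(v) { function() return(v) })(y)
y <- 10
delayed()
# [1] 10

# Once forced, the promise keeps its value.
y <- 20
delayed()
# [1] 10
```

The second part is the same mechanism that defeats the function factory: the factory's argument is not looked up until the inner function is first called, by which time the loop variable has moved on.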

So in addition to the (probably expected) unwanted shared closure issue, we have a lazy evaluation issue. The complete fix involves both introducing new closures (by using the function factory’s execution environment) and forcing evaluation in these new environments. We show the code below:

 functionsFor3 <- vector(2, mode='list')
 for(x in 1:2) {
   functionsFor3[[x]] <- (function(x) {
     force(x)
     return(function() return(x))
   })(x)
 }

 functionsFor3[[1]]()
 # [1] 1
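The same fix may read more clearly with a named factory, and base R's local() gives a variation that needs no explicit force(). Both are sketches; `makeGetter`, `functionsFor4`, `functionsLocal`, and `xi` are hypothetical names introduced here.

```r
# makeGetter is a hypothetical helper name.
makeGetter <- function(v) {
  force(v)  # evaluate the promise now, while the loop variable has this value
  return(function() return(v))
}

functionsFor4 <- vector(mode = 'list', length = 2)
for(x in 1:2) {
  functionsFor4[[x]] <- makeGetter(x)
}
functionsFor4[[1]]()
# [1] 1
functionsFor4[[2]]()
# [1] 2

# Variation: local() creates a fresh environment, and the assignment
# xi <- x evaluates x immediately, so no promise is left pending.
functionsLocal <- vector(mode = 'list', length = 2)
for(x in 1:2) {
  functionsLocal[[x]] <- local({
    xi <- x
    function() return(xi)
  })
}
functionsLocal[[1]]()
# [1] 1
```

In both versions each function gets its own environment and the value is copied into that environment before the loop advances.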

Lazy evaluation is a fairly rare language feature (most famously used in Haskell), so it is not always on everybody’s mind. R has lazy evaluation in a number of places (function arguments and dplyr pipelines and data-structures being some of the most prominent uses).

lapply and purrr::map

I’ve taught this issue for years in our advanced R-programming workshops.

One thing I didn’t know is: R fixed this issue for base::lapply(). Consider the following code:

 functionsL <- lapply(1:2, 
   function(x) { function() return(x) })

 functionsL[[1]]()

 # [1] 1

Apparently lapply used to have the problem and was fixed by the time we got to R 3.2.

Coming back to the original thread, the current CRAN release of purrr (0.2.2) also has the reference behavior, as we can see below:

 functionsM <- purrr::map(1:2, 
   function(x) { function() return(x) })

 functionsM[[1]]()

 # [1] 2

Apparently this is scheduled for a fix.
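A defensive sketch that does not depend on whether the mapping function forces arguments: call force() inside the function being mapped. It is shown here with base::lapply() so it runs without extra packages; that the same function body can be passed unchanged to purrr::map() is an assumption, not something verified here. `functionsForced` is an illustrative name.

```r
functionsForced <- lapply(1:2,
  function(x) {
    force(x)  # capture the value at the time of this call
    function() return(x)
  })

functionsForced[[1]]()
# [1] 1
functionsForced[[2]]()
# [1] 2
```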

Though, there is no way purrr::map() can behave the same as both for(){} and lapply() as the two currently have different behavior.

Conclusion

Lazy evaluation can increase complexity as it makes it less obvious to the programmer when something will be executed and increases the number of possible interactions the programmer can experience (as it is not determined when code will run, so one can not always know the state of the world it will run in).

My opinion is: lazy evaluation should be used sparingly in R, and only where it is trading non-determinism for some benefit. I would also point out that lazy evaluation is not the only possible way to capture specifications of calculations for future interpretation even in R. For example, formula-like interfaces also provide this capability.
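As a rough illustration of the formula alternative: a formula stores an unevaluated expression together with the environment it was created in, and nothing is computed until the code explicitly asks. The names `spec` and `z` are illustrative only.

```r
# z does not need to exist when the formula is written down.
spec <- ~ z * 2 + 1

class(spec)
# [1] "formula"

# The variables the specification will need can be inspected up front.
all.vars(spec)
# [1] "z"

# Evaluation happens only when explicitly requested.
z <- 10
eval(spec[[2]], environment(spec))
# [1] 21
```

The difference from a promise is that the point of evaluation is visible in the code (the call to eval()) rather than implied by first use.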

One thought on “Iteration and closures in R”

  1. Basically we are seeing different language features interact. This is kind of an “R is an easy programming language to teach” (due to the value semantics) “until it is not” (due to the heroic machinery that supplies efficient value semantics).

    Some of the feedback on articles like this (thankfully usually only a minority) invariably includes “nobody would do anything that complicated” (i.e., the author is trying too hard) and “nobody would make that mistake” (i.e., the author is not trying hard enough). One can even say the purrr::map example doesn’t count as the user function that was passed to purrr::map currently has the responsibility to correctly capture values.

    As a prepared response I’ll say: I’ve seen this type of bug in the wild, including in marquee packages such as the current CRAN version of dplyr (0.5.0 2016-06-24) (my notes here; issue is reported fixed in the development version). That is not to denigrate the package authors; if you write non-trivial code you are going to write bugs. I write bugs.
