You don’t need to understand pointers to program using R

R is a statistical analysis package based on writing short scripts or programs (versus being based on GUIs like spreadsheets or directed workflow editors). I say “writing short scripts” because R’s programming language (a dialect of S) is a bit of an oddity that you really wouldn’t be using except that it gives you access to superior analytics data structures (R’s data.frame and treatment of missing values) and deep, ready-to-go statistical libraries. For longer pure programming tasks you are better off using something else (be it Python, Ruby, Java, C++, Javascript, Go, ML, Julia, or something else). However, the S language has one feature that makes it pleasant to learn (despite any warts): it can be initially used and taught without having to worry about the semantics of references or pointers.

In our new book (Practical Data Science with R) we didn’t get into the lack of pointers for a purely didactic reason. To tell a general audience (perhaps one new to scripting or programming) that they don’t need to know about pointers, we would have to first explain what pointers are (somewhat losing the cognitive savings). We settled for demonstrating R’s (primarily) call by value semantics for functions (which we already needed to explain) with the following example:

> vec <- c(1,2)
> fun <- function(v) { v[[2]]<-5; print(v)}
> fun(vec)
[1] 1 5
> print(vec)
[1] 1 2

Notice how the mutation (changing an entry to 5) does not escape the function as a side effect. Because R is a bit of a kitchen sink (everything and its opposite is pretty much available), we had to cautiously title this example “R behaves like a call-by-value language” in our book (R in fact has a number of sharable reference structures including environments, ReferenceClasses, lazy evaluation systems like promises/delayedAssign, and more). (The ugly [[]] notation is something we recommend as it catches a few more errors than the more common [] notation. For details please see appendix A of our book.)
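As a minimal sketch of one of those reference structures (this is not an example from the book, and the names counter and bump are made up for illustration): an environment passed to a function is shared rather than copied, so here the mutation does escape.

```r
# An environment is a reference structure: passing it to a function
# does not copy it.
counter <- new.env()
assign('count', 1, envir = counter)

# bump() is a hypothetical helper that modifies the environment it is given
bump <- function(env) {
  assign('count', get('count', envir = env) + 1, envir = env)
  invisible(NULL)
}

bump(counter)
print(get('count', envir = counter))
# [1] 2
```

Contrast this with the vector example above, where the caller's vec was left untouched.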

What we didn’t discuss is that you get this sort of change isolation and safety in R in just about every situation (not just when binding values to function arguments). Here is another example (this time not from the book):

> vec <- c(1,2)
> v2 <- vec
> v2[[2]] <- 5
> print(v2)
[1] 1 5
> print(vec)
[1] 1 2

Unlike many languages the assignment “v2 <- vec” does not end up with vec and v2 as references (or pointers) entangled to the same object. Instead they behave as if they are two different objects. This does prevent using these two symbols to communicate results (a legitimate programming practice) but it also prevents a whole host of errors and confusions that beginning programmers run into in the presence of such shared mutability. R protects the programmer by treating objects directly without exposing the additional ideas of references or pointers. Many ideal functional programming languages more directly expose references but mitigate their danger by insisting on immutable structures; but this requires the user to learn (in addition to data handling, statistics and programming) the fairly alien discipline of composing immutable data structures.
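The same isolation appears to hold for richer structures such as data frames. The following is a small sketch (the names d and d2 are illustrative, not from the book):

```r
# Assigning a data.frame to a second name and then modifying the
# second name leaves the first unchanged.
d <- data.frame(x = c(1, 2))
d2 <- d
d2$x[[2]] <- 5

print(d2$x)
# [1] 1 5
print(d$x)
# [1] 1 2
```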

We encourage beginning programmers to think of programs as organizing sequences of transformations over data. So the simpler (and fewer) the mutations are, the easier it is to reason about programs. When you program in R you are mostly working with values and not variables (which is good, as it leaves you more time to think about data). So, as much as we complain about R, it is in fact a good choice for teaching, analysis, data science and even basic scripting tasks.

However, you do eventually have to deal with the unpleasant details of side-effects and shared mutability. One place where R doesn’t hide the sharp edges from you is in closures (the structure R uses to represent the context of a function). Consider the following code puzzle, where we wonder what gets printed:

# make an array of 3 functions
f <- vector('list',3)
# set the i'th function to return i
for(i in 1:length(f)) {
  f[[i]] <- function() { i }
}
# apply the functions using a different loop variable
for(j in 1:length(f)) {
  print(f[[j]]())
}

Note this is one place where you really do need to use the uglier [[]] notation. In the current version of R (3.0.2) if you try to use [] you get the error message “cannot coerce type ‘closure’ to vector of type ‘list’.” But the puzzle is: what do you expect to be printed? If R were binding the value of i into the i’th function you would expect to see the sequence “1,2,3.” Instead each function gets its value for i by looking up what is current in its captured evaluation environment. So this code in fact prints “3,3,3”, as this is the value i has after the first loop is finished. This is unfortunate, as a lot of productive programming patterns depend on capturing safe isolated values, not capturing entangled references.
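One way to see the entanglement is to check that the functions share a single environment and that a later change to i shows through. This is only a sketch; the names same_env and res are introduced here for illustration.

```r
# Rebuild the three functions from the puzzle
f <- vector('list', 3)
for (i in 1:length(f)) {
  f[[i]] <- function() { i }
}

# All of the functions were created in the same environment,
# so they all look up the same variable i
same_env <- identical(environment(f[[1]]), environment(f[[2]]))
print(same_env)
# [1] TRUE

# Changing i afterwards changes what every function returns
i <- 10
res <- f[[1]]()
print(res)
# [1] 10
```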

This sort of puzzle may seem unpleasant and unnatural, but when pointers (and other sorts of shared references) are involved you are forced to solve this sort of puzzle to understand the meaning or semantics of a code fragment or program. It is because these puzzles are laborious that languages like R emphasize isolation, so there is much less to worry about when you try to compose useful data transformations.

Closures and environments are very powerful tools (many of R’s features are built in terms of them). Their shared mutability is a huge source of confusion in many programming languages (Javascript also has this issue, and Java only allows closures to capture final variables to try and cut down on some of the possible interference). To get the behavior we want (each function capturing the current value of i in its closure and not sharing a common reference) we can write the following code:

f <- vector('list',3)
for(i in 1:length(f)) {
  f[[i]] <- function() { i }
  e <- new.env()
  assign('i',i,envir=e)
  environment(f[[i]]) <- e
}
for(j in 1:length(f)) {
  print(f[[j]]())
}

And this prints 1,2,3 as we would hope. Note we are now in very deep programming ground (closures being at least as confusing to beginners as pointers) and no longer even thinking about data. We have to admit: we really counted to 3 the hard way.
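A somewhat shorter variation, offered as a sketch rather than a recommendation, uses R's local() to create the fresh environment and copy the value in one step. The names k and vals are made up for illustration.

```r
f <- vector('list', 3)
for (i in 1:length(f)) {
  # local() evaluates its body in a new environment on each pass
  f[[i]] <- local({
    k <- i            # copy the current value of i into the new environment
    function() { k }  # the function captures that new environment
  })
}

# Collect the results without reusing any of the names above
vals <- vapply(f, function(fi) fi(), integer(1))
print(vals)
# [1] 1 2 3
```

Each function now has its own environment holding its own k, so later changes to i do not leak in.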



9 thoughts on “You don’t need to understand pointers to program using R”

  1. A small technicality. The line v2 <- vec actually does (almost literally) result in a pointer to the original vector vec. At that point, both names point to the same object. It is only when we attempt to modify v2 that a copy is created and the two diverge.

  2. Regarding the first code block for the puzzle, I tried the design pattern of using a counter, identified here by the variable k:

    f <- vector('list', 3)
    k <- 1
    for (i in 1:length(f)) {
      f[[i]] <- function() {k}
      print(k)
      k <- k + 1
    }
    for (j in 1:length(f)){
      print(f[[j]] ())
    }
    
    Results from printing value of counter k inside the loop:
    
    [1] 1
    [1] 2
    [1] 3
    
    Results of print(f[[j]] ()):
    
    [1] 4
    [1] 4
    [1] 4
    

    Even in the counter design pattern, the value being assigned is the capture of the evaluation environment for k.

    Be careful out there!

  3. Thanks for your post. IMHO, the following:

    # make an array of 3 functions
    f <- vector('list',3)
    # set the i'th function to return i
    for(i in 1:length(f)) {
    #make a copy of i's value in local j
      j=i
      f[[i]] <- function() { j }
    }
    # apply the functions using a different loop variable
    for(j in 1:length(f)) {
      print(f[[j]]())
    }
    

    solves the puzzle much more intuitively.
    Regards

  4. @public4mac
    Unfortunately that doesn’t solve it. The j copy isn’t truly local and the three functions are still aliased together in an undesirable way (you are only getting 1,2,3 because you used the same variable name for your copy as for the loop variable in your later test). Try this code after yours:

    for(fi in f) { 
       print(fi())
    }
    [1] 3
    [1] 3
    [1] 3
    

    The (unstated) goal is to get three functions that bind the values 1,2,3 safely (so you don’t lose the effect when other global variables are manipulated) not to just print 1,2,3. There probably are other ways to solve it, but they are going to involve setting up a new scope while the first loop is active (so creating environments, calling/currying a function, or something like that).

  5. Here is a curried or partially applied function solution (which is better than mucking directly with R’s environment representation):

    f <- vector('list',3)
    for(i in 1:length(f)) {
      p <- function(x) { y=x; function() { y } }
      f[[i]] <- p(i)
    }
    for(fi in f) { 
       print(fi())
    }
    [1] 1
    [1] 2
    [1] 3
    

    It doesn’t matter where the p() function is defined (inside or outside the loop). What matters is that when p() is applied, the new variable y is a new local variable per call.

  6. @jmount
    And now we are into the nightmare world of evaluation details (a lot of these examples depend both on scope rules which say what value a variable name binds to, and R’s evaluation rules which are largely lazy meaning some things are evaluated much later than one might think):

    # doesn't work (not expected to)
    p <- function(x) { function() { x } }
    f <- vector('list',3)
    for(i in 1:length(f)) {
      f[[i]] <- p(i)
    }
    for(fi in f) { 
       print(fi())
    }
    [1] 3
    [1] 3
    [1] 3
    
    # does work, even though we are returning x not y!
    p <- function(x) { y=x; function() { x } }
    f <- vector('list',3)
    for(i in 1:length(f)) {
      f[[i]] <- p(i)
    }
    for(fi in f) { 
       print(fi())
    }
    [1] 1
    [1] 2
    [1] 3
    
    # does work, helps explain previous example
    p <- function(x) { eval(x); function() { x } }
    f <- vector('list',3)
    for(i in 1:length(f)) {
      f[[i]] <- p(i)
    }
    for(fi in f) { 
       print(fi())
    }
    [1] 1
    [1] 2
    [1] 3
    

    And for more horror see R’s “Standard non-standard evaluation rules”: http://developer.r-project.org/nonstandard-eval.pdf .

    So I guess I would say detailed function semantics are very much not a beginner friendly topic in R (so I would not encourage exposing beginners to these details early on).

  7. @jmount
    You’re damn right!
    This is indeed much more tricky for R: in a language like C#, the trick of creating a local variable to capture the value inside an anonymous delegate would have worked.
    This makes your post even more valuable!

  8. Interesting example of “gotcha” semantics of mutable state in Python (from: https://news.ycombinator.com/item?id=8009565 ):

    def append_one(list=[]):
       list.append(1)
       return list
    
    append_one()
    append_one()
    append_one()
    

    And the thankfully inequivalent R version:

    append_one <- function(list={}) { 
      list[[length(list)+1]] <- 1
      list
    }
    
    print(append_one())
    print(append_one())
    print(append_one())
    
