Posted on Categories Coding, Programming, Statistics, TutorialsTags , , ,

How and why to return functions in R

One of the advantages of functional languages (such as R) is the ability to create and return functions “on the fly.” We will discuss one good use of this capability and what to look out for when creating functions in R.

Why wrap/return functions?

One of my favorite uses of “on the fly functions” is regularizing R’s predict() function to actually do the same thing across many implementations. The issue is: different classification methods in R require different arguments for predict() (not needing a type= argument, or needing type='response' versus type='prob') and return different types (some return a vector of probabilities of being in a target class, some return a matrix with probability columns for all possible classes).

It is a great convenience to wrap these differences (and differences in training control, such as table versus function interface, tolerance/intolerance of boolean/factor/character/numeric target class labels, and more). An example of such wrapping is given below:

rfFitter <- function(vars,yTarget,data) {
  model <- randomForest(x=data[,vars,drop=FALSE],
    y=as.factor(as.character(data[,yTarget,drop=TRUE])),
    ntree=100,
    maxnodes=10)
  function(newd) {
    predict(model,newdata=newd,type='prob')[,'TRUE']
  }
}


logisticFitter <- function(vars,yTarget,data) {
  formula <- paste(yTarget,
      paste(vars,collapse=' + '),sep=' ~ ')
  model <- glm(as.formula(formula),data,
               family=binomial(link='logit'))
  function(newd) {
    predict(model,newdata=newd,type='response')
  }
}

Notice in wrapping the fitting functions we have taken different precautions (the as.factor(as.character()) pattern to defend against boolean and numeric targets for random forest, the selection of column 'TRUE' for random forest, and the type='response' for logistic regression). This means downstream code does not have to worry about such things and we can confidently write code like the following:

rfFitter(vars,'y',dTrain)(dTest)
logisticFitter(vars,'y',dTrain)(dTest)

Which (assuming dTrain is a training data frame and dTest is a test data frame) neatly fits and applies a model. The wrapping function pattern is a good way to apply the don’t repeat yourself pattern (which greatly improves the maintainability of code).

We demonstrate a slightly less trivial use of the pattern here.

The problems with wrapping/returning functions in R

There are at least three problems with the above return a function code pattern:

  1. The underlying fitters often contain unused large structures including complete references to training data. This was the subject of the article Trimming the fat from glm models in R.
  2. Running the fitters inside a function (as we are doing here and as a training harness such as caret does) changes both the lexical and runtime environments causing subtle changes in what is captured in the so-called
    function closure (for how closures work see here). This was discussed in the article Using Closures as Objects in R.
  3. R’s “lazy argument evaluation” semantics can sometimes cause the caller’s environment to be inadvertently captured by the returned function (we call the the “unfulfilled promise leak”).

The third issue is even more subtle than the others, but can cause problems. We will discuss that after a quick review of reference leaks.

Fixing Reference Leaks


NewImage
Sailors practice repairing leaks (public domain source).

The strategy in Trimming the fat from glm models in R was to find (by inspection) and stomp-out excessively large referred items to prevent leaks. A number of these items were in fact environments that were attached to functions on the object. Since the functions are already defined the only way to shrink the object is to do brutal surgery on the objects (such as using something like the restrictEnvironmet() transformer advocated in Using Closures as Objects in R).

In a comment on this second article Professor Luke Tierney correctly pointed out that we should not perform environmental surgery if we can avoid it. A more natural way to achieve what we want is to define a wrapping function as follows:

stripGLMModel <- function(model) { ... ; model }

wrapGLMModel <- function(model) {
  force(model)
  function(newd) {
    predict(model,newdata=newd,type='response')
  }
}

logisticFitter <- function(vars,yTarget,data) {
  formula <- paste(yTarget,
    paste(vars,collapse=' + '),sep=' ~ ')
  model <- glm(as.formula(formula),data,
               family=binomial(link='logit'))
  model <- stripGLMModel(model)
  wrapGLMModel(model)
}

We use three functions (to neatly separate concerns).

  • stripGLMModel() is from Trimming the fat from glm models in R and does the ugly work of stomping out fields we are not using and re-writing environments of functions. This is exactly the work we have to do because the glm() function itself wasn’t parsimonious in what it returned, and didn’t take the wrapping precautions we are taking when it did the fit. So this code is “cleaning up after others” and very idiomatic per-fitter.
  • wrapGLMModel() Returns a function that has all the right arguments set to perform predictions on new data. This method is re-unifying the predict() calling interfaces to be the same. There are three important points of this function: the training data is not an argument to this function, this function is defined at the top-level (so its lexical closures is special, more on this later), and force(model) is called to prevent a new unfulfilled promise leak (more on this later).

    For another situation where force() is relevant (though we used eval() see here).

  • logisticFitter() wraps the per-fitter different details for fitting and calls the other two functions to return an adapted predict() function.

Some code roughly in this style for glm, bigglm, gmb, randomForest, and rpart is given here. For each fitter we had to find the fitter leaks by hand and write appropriate stomping code.

The nature of closure driven reference leaks

In R when a function is defined it captures a reference to the current execution environment. This environment is used to bind values to free variables in the function (free variables are variables whose names are not defined in the function or in the function arguments).

An example is the following:

f <- function() { print(x) }
x <- 5
f()
## [1] 5

x was a free variable in our function, and a reference to the current execution environment (in this case <environment: R_GlobalEnv>) was captured to implement the closure. Roughly this is a lexical or static closure as the variable binding environment is chosen when the function is defined and not when the function is executed. Notice that it was irrelevant that x wasn’t actually defined at the time we defined our function.

The problem with R is: R has no way of determining the list of free variables in a function. Instead of binding the free variables it keeps the entire lexical environment around “just in case” it needs variables from this environment in the future.

This has a number of consequences. In fact this scheme would collapse under its own weight except for the following hack in object serialization/de-serialization. In R when objects are serialized they save their lexical environment (and any parent environments) up until the global environment. The global environment is not saved in these situations. When a function is re-loaded it brings in new copies of its saved lexical environment chain and the top of this chain is altered to have a current environment as its parent. This is made clearer by the following two code examples:

Example 1: R closure fails to durably bind items in the global environment (due to serialization hack).

f <- function() { print(x) }
x <- 5
f()
## [1] 5
saveRDS(f,file='f1.rds')
rm(list=ls())
f = readRDS('f1.rds')
f()
## Error in print(x) : object 'x' not found

Example 2: R closure seems to bind items in intermediate lexical environments.

g <- function() {
  x <- 5
  function() {
    print(x)
  }
}
f <- g()
saveRDS(f,file='f2.rds')
rm(list=ls())
f = readRDS('f2.rds')
f()
## [1] 5

So in a sense R lexical closures are both more expensive than those of many other languages (they hold onto all possible variables instead of free variables) and a bit weaker than expected (saved functions fail to durably capture bindings from the global environment).

We worry about these environments driving reference leaks up and down.

up-leaks are when we build a function in an environment we hoped would be transient (such as the execution environment of a function) and the environment lasts longer because a reference to the environment is returned up to callers. The thing to look out for are any uses of the function keyword because functions capture a reference to the current execution environment as their closure (their static or lexical environment). Any such function returned in a value can therefore keep the so-called transient execution environment alive indefinitely. These leaks are the most common and we saw them causing a reference to training data lasting past the time it was used for fitting. The base modeling functions such as lm() and glm() have these leaks (though you may not see them in calculating size if you are executing in the base environment, again due to the serialization hack).

down-leaks are less common, but are when a function that gets passed into another function as an argument carries more references and data than you intended. Usually you would not care (as you are only holding a reference, not causing a data copy) because the leak only lasts the duration of the sub-function call. The problem is this can waste space in serialization and cause problems for systems that use serialization to implement parallelism (common in R).

Lexical or static scope leak

The main reference leak we have been seeing is the leak of our training data.frame (data). In principle the training data can be huge. The whole purpose of the wrapGLMModel() function is to have function where the data is not in the current execution scope and therefore won’t be captured when this execution scope is used to form the closure (when we build a function causing the formation of lexical or static closure).

Global/base/library-level wrapping functions would be insufficient precaution (as the data is in fact in the lexical scope of wrapGLMModel() when we happen to be working in that scope), except the global scope isn’t saved hack saves us.

Unfulfilled promise leak

The unfulfilled promise leak is an insidious leak. The following code demonstrates the problem.

build1 <- function(z) {
  function() { print(z) }
}

build2 <- function(z) {
  force(z)
  function() { print(z) }
}

expmt <- function() {
  d <- data.frame(x=1:100000000)
  f1 <- build1(5)
  print(paste('f1 size',
           length(serialize(f1, NULL))))
  f2 <- build2(5)
  print(paste('f2 size',
          length(serialize(f2, NULL))))
}

expmt()
## [1] "f1 size 400001437"
## [1] "f2 size 824"

Notice the radically different sizes from the nearly identical build1() and build2() (which differ only in the use of force()).

R implements lazy argument evaluation through a mechanism called “promises.” In the build1() example the argument z (which is just the number 5) is not evaluated in build1(), because build1() never actually used it. Instead the promise (or object that can get the value of z if needed) is passed to the returned function. So z ends up getting evaluated only if/when the function returned by build1() actually uses it.

Normally this is good. If z is a very expensive to evaluate, not evaluating it if its value is never actually used can be substantial savings. Not many languages expose this to the user (early Lisps through fexprs and most famously Haskel). However, the promise must be able to evaluate z if it ever is needed. Since z itself could be a function the promise must therefore keep around the environment that was active when z was defined. Without this environment it can’t fulfill the promise. Since nobody used z the promise is unfulfilled, and the environment leaks. This is what I am calling this an “unfulfilled promise leak.”

My advice to avoid environment and variable leaks is: any service functions that are essentially supplying a Currying or a partial evaluation service (binding arguments into functions) should call force() on all of their own arguments to make sure there are no promises holding live references to unexpected environments. The whole reason we write Currying functions at the global or library level is to avoid accidentally capturing variables in the logisticFitter, and the lazy-eval promise mechanism can itself cause an environment capture. These reference leaks can become expensive when serializing objects, or when using parallel systems that transport serialized objects.

Conclusion

A lot of R’s programming power comes from conventions working over a few user exposed structures (such as environments). This means in some case you have undesirable side-effects that you must write explicit code to mitigate.