
Is R base::subset() really that bad?


Notes discussing subset() often quote the following text from help(subset) (see, for example, discussions 1 and 2):

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard sub-setting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
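A small example of the kind of unanticipated consequence the warning has in mind (a common illustration; the names here are made up):

```r
# A function author intends "y" to be the threshold argument,
# but subset()'s non-standard evaluation finds a column named "y" first.
d <- data.frame(x = 1:5, y = 6:10)
f <- function(y) {
  subset(d, x > y)   # "y" resolves to the column d$y, not the argument
}
nrow(f(0))  # 0 rows, though the intent was all rows with x > 0
```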

Is it really obvious the subset() authors or describers are apologizing solely for the implementation, or could they be expressing reservations about wholesale use/implementation of non-standard evaluation in general? It could be a situation similar to the following dialogue from "The Hitchhiker’s Guide to the Galaxy" by Douglas Adams:

Zaphod stared at Arthur.

"Did you think of that, Earthman?" he demanded.

"Well," said Arthur, "all I did was…"

"That’s very good thinking, you know. Turn on the Improbability Drive for a second without first activating the proofing screens. Hey, kid, you just saved our lives, you know that?"

"Oh," said Arthur, "well, it was nothing really…"

"Was it?" said Zaphod. "Oh well, forget it then. Okay computer, take us in to land."

"But…"

"I said forget it."

subset()'s code for data.frames is actually fairly short and clear (and easy to examine via print(subset.data.frame)). Objections could include:

  1. The result of subset() depends on more than just the values it is applied to (it also depends on exact code passed to subset, and the contents of various environments).
  2. There is no way to unambiguously refer to column names of the data.frame in question.
  3. There is no way to unambiguously refer to items in arbitrary environments.

Let us comment on these points in order.

  1. The first point is the nature of non-standard evaluation. It will follow you wherever you implement non-referentially transparent evaluation schemes.
  2. The second point is valid, and some authors advocate adding a prefix to refer to the data.frame.
  3. The third point is perhaps, here, a bit over-stated. The R convention is: environments nest, they are structured (not spaghetti pointers). The R programmer expects this. If we know both the name we want and the environment we want to de-reference it in, we can often just take an extra step and grab the value in question. Adding the ability to refer to arbitrary environments is powerful, but it is the data equivalent to adding a "goto" statement to a structured programming language. From a clean design point of view, all one really needs is another prefix to refer to the expected environment.
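The second point can be seen directly (a short illustration using the built-in airquality data):

```r
# The column "Temp" shadows the variable "Temp", so the comparison is
# column-against-itself and selects nothing; there is no syntax in
# base::subset() to force the variable interpretation.
Temp <- 90
nrow(subset(airquality, Temp > Temp))  # 0
```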

One can quickly create a variation of subset that allows for both data and environment prefixes. Below is a simple implementation based on the above ideas and minor edits to the original subset() code.

#' Pick a subset of rows, evaluating the subset expression 
#' as if the columns of x were in the evaluation environment.
#' References can be forced to the environment with a .e$ prefix
#' and forced to the data frame with a .d$ prefix (a failed
#' lookup returns NULL).
#' 
#' @param x data.frame to work with
#' @param subset logical expression to compute per-row
#' @param env environment to work in
#' @return data.frame that is the specified subset of the rows of x.
#' 
#' @seealso \code{\link[base]{subset}} 
#' 
#' @examples
#' 
#' Temp <- 90
#' subset_rows(airquality, (.d$Temp > .e$Temp) & !is.na(Ozone))
#' 
#' 
subset_rows <- function(x, subset, env = parent.frame()) {
  if(!is.data.frame(x)) {
    stop("subset_rows expected x to be a data.frame")
  }
  if(missing(subset) || (nrow(x)<=0)) {
    return(x)
  }
  e <- substitute(subset) # capture expression
  eval_env <- new.env(parent = env)
  assign(".d", x, eval_env)   # data prefix .d$
  assign(".e", env, eval_env) # environment prefix .e$
  r <- eval(e, envir = x, enclos = eval_env)
  if(!is.logical(r)) {
    stop("subset_rows predicate must evaluate to logical")
  }
  r <- r & !is.na(r)
  if(length(r)!=nrow(x)) {
    stop("subset_rows predicate must have one entry per row")
  }
  x[r, , drop = FALSE]
}

The idea is:

  • By default names are de-referenced as possible data.frame column names, failing that are matched to the evaluation environment.
  • Names prefixed by .d$ can only refer to columns of the data.frame.
  • Names prefixed by .e$ can only refer to the evaluation environment.

The above code deliberately has different semantics than base::subset() (else, why write it): the prefix ability, no "drop" argument, skipping predicate evaluation on zero-row data.frames, checking the predicate length, no use of "...", and executing in a more deeply nested environment. Beyond style differences, the changes from base::subset() are really only the three lines of code directly manipulating environments. We feel we have thrown out a lot of bath water while keeping quite a lot of the baby.

This is enough to work a standard problem: suppose our data.frame has a column named "Temp" and we want to select rows whose "Temp" value exceeds the value stored in a variable that is also named "Temp" (plus some other conditions). The common way out is to re-design a bit and store the threshold in a fresh variable named "Temp_bound" (which is why some of the above issues will be unfamiliar to some R users: they are often easy to avoid). A user can still refer to a variable or column named .d or .e by adding a prefix (such as .e$.e), though this does mean they must specify whether they are looking at the data.frame or the environment; for these two names they do not have the luxury of saying "either".
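The re-naming workaround looks like this (a sketch; Temp_bound is the fresh variable name mentioned above):

```r
# The common workaround: rename the threshold variable so it cannot
# collide with the "Temp" column, then use base::subset() as usual.
Temp_bound <- 90
subset(airquality, (Temp > Temp_bound) & !is.na(Ozone))
```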

The prefix style solution is given below:

Temp <- 90
Ozone_bound <- 100
subset_rows(airquality, 
            (.d$Temp > .e$Temp) & 
              (!is.na(Ozone)) & (Ozone < Ozone_bound))
##     Ozone Solar.R Wind Temp Month Day
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 120    76     203  9.7   97     8  28
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4

We do not actually need the ".d$" prefix on ".d$Temp", as the rules check the data.frame first. However, adding it lowers the ambiguity for the reader. We know .d$Temp and .e$Temp must refer to different values, as for standard numbers (non-NA/NaN) we never have "Temp > Temp". Notice we did not need to write ".e$Ozone_bound" (though that would work if we so chose).

We have only shown the solution in an interactive environment, but with a few conventions (such as building up long names) the code works about as well as any other when writing re-usable functions or packages.
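For example, because env defaults to parent.frame(), the .e$ prefix resolves names in the caller's frame, so subset_rows() can be wrapped in re-usable functions without extra plumbing. A sketch (hot_days is a hypothetical helper; subset_rows is repeated from above so the snippet stands alone):

```r
# subset_rows() as defined earlier in this note.
subset_rows <- function(x, subset, env = parent.frame()) {
  if (!is.data.frame(x)) {
    stop("subset_rows expected x to be a data.frame")
  }
  if (missing(subset) || (nrow(x) <= 0)) {
    return(x)
  }
  e <- substitute(subset)          # capture expression
  eval_env <- new.env(parent = env)
  assign(".d", x, eval_env)        # data prefix .d$
  assign(".e", env, eval_env)      # environment prefix .e$
  r <- eval(e, envir = x, enclos = eval_env)
  if (!is.logical(r)) {
    stop("subset_rows predicate must evaluate to logical")
  }
  r <- r & !is.na(r)
  if (length(r) != nrow(x)) {
    stop("subset_rows predicate must have one entry per row")
  }
  x[r, , drop = FALSE]
}

# A hypothetical re-usable function: .e$threshold is looked up in
# hot_days()'s own frame (the parent.frame() of the subset_rows call),
# so the function's argument is found unambiguously.
hot_days <- function(data, threshold) {
  subset_rows(data, .d$Temp > .e$threshold)
}

hot_days(airquality, 90)
```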

I think the above is fairly teachable, and working through it has the side benefit of giving one a chance to learn about R itself.

6 thoughts on “Is R base::subset() really that bad?”

  1. Another point (going back to the original source): [ , ] is such a powerful operator that one does not have as much unmet need for base::subset() (or even perhaps dplyr::filter()) as some would claim.
