Posted on Categories Coding, RantsTags , , , , ,

sample(): “Monkey’s Paw” style programming in R

The R functions base::sample and base::sample.int are functions that include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested work around. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier to debug code.


NewImage
“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.

Suppose we were given data in the following form:

set.seed(2562)
x <- 10*rnorm(5)
print(x)
# [1] -17.442331   7.361322 -10.537903  -4.208578  -1.560607
goodIndices <- which(x>0)
print(goodIndices)
# [1] 2

Further suppose our goal is to generate a sample of size 5 of the values of x from only the goodIndices positions. That is a sample (with replacement) of the positive values from our vector x. I challenge a working R developer who has used base::sample or base::sample.int regularly to say they have never written at least one of the following errors at some time:

sample(x[goodIndices],size=5,replace=TRUE)
# [1] 5 6 1 3 2
x[sample(goodIndices,size=5,replace=TRUE)]
# [1]   7.361322 -17.442331   7.361322 -17.442331   7.361322

These samples are obviously wrong, but you will notice this only if you check. There is only one positive value in x (7.361322) so the only possible legitimate sample of 5 positive values under replacement is c(7.361322,7.361322,7.361322,7.361322,7.361322). Notice we never got this, and received no diagnostic. A bad sample like this can take a long time to find through its pernicious effects in downstream code.

Notice the following code works (because it reliably prohibits triggering the horrid special case):

as.numeric(sample(as.list(x[goodIndices]),size=5,replace=TRUE))
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322
x[as.numeric(sample(as.list(goodIndices),size=5,replace=TRUE))]
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322
x[goodIndices[sample.int(length(goodIndices),size=5,replace=TRUE)]]
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322

As always: this is a deliberately trivial example so you can see the problem clearly.

So what is going on? The issue given in help('sample'):

If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x).

This little gem is the first paragraph of the “Details” section of help('sample'). The authors rightly understand that more important than knowing the intended purpose of base::sample is to first know there is a sneaky exception hardcoded into its behavior. In some situations base::sample assumes you really meant to call base::sample.int and switches behavior. Remember (as pointed out in the documentation): the population we are sampling from may have come from an external source, so the analyst may not even know they have triggered (or even could trigger) these exceptional cases.

Here is the code confirming this “convenience.”

> print(base::sample)
function (x, size, replace = FALSE, prob = NULL) 
{
    if (length(x) == 1L && is.numeric(x) && x >= 1) {
        if (missing(size)) 
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (missing(size)) 
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}
<bytecode: 0x103102340>
<environment: namespace:base>

If we meant to call base::sample.int we certainly could have. There aren’t even any of the traditional “don’t lose flag”s available (such as “drop=FALSE“, “stringsAsFactors=FALSE“, “character.only=TRUE“, or “type='response'“). This “convenience” makes it impossible to reliably use base::sample without some trick (such as hiding our vector in a list). Our current advice is to use the following two replacement functions:

sampleint <- function(n,size,replace=FALSE,prob=NULL) {
  if((!is.numeric(n)) || (length(n)!=1)) {
    stop("sampleint: n must be a numeric of length exactly 1")
  }
  if(missing(size)) {
    size <- n
  }
  if((!is.numeric(size)) || (length(size)!=1)) {
    stop("sampleint: size must be a numeric of length exactly 1")
  }
  sample.int(n,size,replace,prob)
}

samplex <- function(x,size,replace=FALSE,prob=NULL) {
  if(missing(size)) {
    size <- length(x)
  }
  if((!is.numeric(size)) || (length(size)!=1)) {
    stop("samplex: n must be a numeric of length exactly 1")
  }
  x[sampleint(length(x), size, replace, prob)]
}

With these functions loaded you can write more natural code:

samplex(x[goodIndices],size=5,replace=TRUE)
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322

As a bonus we included sampleint which actually checks its arguments (a very good thing for library code to do) catching if the analyst accidentally writes “sample.int(1:10,size=10,replace=TRUE)” or “sample.int(seq_len(10),size=10,replace=TRUE)” (which return 10 copies of 1!) when they meant to write “sample.int(10,size=10,replace=TRUE)“. The overall principle is that a completed run (one that did not call stop()) should have obvious and expected semantics, this allows the user to treat the successful execution as an empirical proof they didn’t hit too bad a corner-case.

Obviously it is the data scientist’s responsibility to know actual function semantics, to write correct code, and to check their intermediate results. However, it traditionally is the responsibility of library code to help in this direction by having regular behavior (see the principle of least astonishment) and to signal errors in the case of clearly invalid arguments (reporting errors near their occurrence makes debugging much easier). Nobody enjoys working with Monkey’s Paw style libraries (that are technically “correct” but not truly helpful).

9 thoughts on “sample(): “Monkey’s Paw” style programming in R”

  1. Isn’t it easier just write a wrapper around sample, which checks for length 1 argument and return rep(x, size)?

    From statistical point of view drawing sample from one element is nonsense, you need to use rep then, so I see a point of R behaviour. Some sort of warning however should be nice. On the other hand if you sample from one point, your code is not statistically sound, and you bound to have problems anyway.

    1. I do not agree.

      From a software engineering point of view we want simple regular library code that works well on the primary cases and happens to get the corner cases correct. Getting corner cases right through enumeration is risky as there (in general) is always the chance you are missing some. And surely sample(-1,5,replace=TRUE) doesn’t make a different amount of “statistical sense” than sample(2,5,replace=TRUE) (both of which are treated differently).

      If one wanted to prohibit sampling from a population size smaller than 2 would be easier still to check for a length<=1 argument and call stop(). Then at least the researcher has some chance of seeing what went wrong near the problem. I think it makes far more sense to for samplex(7,1) to return a 7 than for sample(7,1) to return something like a 2.

      1. Note that proposed code does enumerate the corner cases. R code does that too, it checks for the argument of length 1, but treats that case differently. When you have corner cases you need to make a choice how to treat it. Since R is used by very diverse crowd of people sometimes it is hard to make a choice which suits everybody.

    2. Well,once I learned how sample() works, I made sure always to write sample(c(x,x),…) . The distribution is the same as for sample(x) when x is a vector, and when x is scalar it ensures it’s not converting to 1:x .

      1. Clever, that works correctly over many different types (assuming replace=TRUE). I had been using as.numeric(sample(as.list(x),…), but this only works correctly if we know x is numeric (different types of disasters with strings or factors).

  2. This behavior of sample function is really bad. I was bitten badly twice and wasted a day of time to finally figure it out.

  3. This must be a bad one as I have just been bitten by it myself, and I’ve been at the game for some time now. The suggestion made is a good one and I would hope it eventually makes its way into R base, and sample(), sample.int() gradually eased out by deprecation and then defunct. But that will take some time, sadly.

    It’s only there, of course, as a result of slavery to backward compatibility. It’s an old function coming from the early S days which originally was “An Interactive Environment for Data Analysis and Graphics”. It was only with the introduction of S3 in the early ’90s that it became “A Programming Environment for Data Analysis and Graphics”, and the change of word is significant. Originally the usual way to do your data analysis and graphics was to sit at a terminal and punch it in. Scripts and an emphasis on planned coding and reproducibility only came later, strange as it may now seem.

Comments are closed.