
## sample(): “Monkey’s Paw” style programming in R

The R functions `base::sample` and `base::sample.int` include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write code that is safer and easier to debug.

“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.
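As a minimal sketch of the kind of hazard meant here: when `sample` is given a single number of at least 1, it samples from `1:x` instead of from the one-element vector. The helper name `safe_sample` below is hypothetical (it is not part of base R), and indexing through `sample.int(length(x), ...)` is one commonly used workaround, not necessarily the one the full post recommends.

```r
# A length-1 numeric argument is treated as a count, not as a population.
x <- c(10)
set.seed(2015)
draws <- sample(x, size = 5, replace = TRUE)
# The draws come from 1:10, even though x holds only the value 10.
print(all(draws %in% 1:10))
## [1] TRUE

# Hypothetical helper: always sample positions, then index into x.
safe_sample <- function(x, size = length(x), replace = FALSE) {
  x[sample.int(length(x), size = size, replace = replace)]
}

print(safe_sample(x, size = 5, replace = TRUE))
## [1] 10 10 10 10 10
```

Sampling positions instead of values keeps the behavior the same whether `x` has one element or many.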



## What can be in an R data.frame column?

As an R programmer, have you ever wondered what can be in a `data.frame` column?
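To make the question concrete, here is a small sketch showing that a column does not have to be a plain atomic vector. The variable names are illustrative only, and this is just two of the possibilities: a list column and a matrix column.

```r
d <- data.frame(x = 1:2)

# A list column: each row holds an arbitrary R object.
d$y <- list(1:3, 'a')

# A matrix column: one data.frame column that is itself two columns wide.
d$m <- matrix(1:4, nrow = 2)

print(ncol(d))
## [1] 3
print(is.list(d$y))
## [1] TRUE
print(is.matrix(d$m))
## [1] TRUE
```

Code that assumes every column is a simple vector (for example, when writing to CSV) can misbehave on frames like this one.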


## Check your return types when modeling in R

Just a warning: double-check your return types in R, especially when using different modeling packages.
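A small illustration of why this matters, using only `lm` from the base `stats` package and the built-in `cars` data set (chosen here for convenience, not taken from the post): the same `predict` call can return a vector, a matrix, or a list depending on its arguments. Other modeling packages have their own conventions, so treat this as a sketch of the checking habit and not a catalogue of cases.

```r
model <- lm(dist ~ speed, data = cars)

p1 <- predict(model, newdata = cars)                            # plain numeric vector
p2 <- predict(model, newdata = cars, interval = "confidence")   # matrix: fit, lwr, upr
p3 <- predict(model, newdata = cars, se.fit = TRUE)             # list with $fit, $se.fit, ...

print(is.numeric(p1) && is.null(dim(p1)))
## [1] TRUE
print(is.matrix(p2))
## [1] TRUE
print(is.list(p3))
## [1] TRUE

# A defensive check before using predictions as a data column.
stopifnot(is.numeric(p1), is.null(dim(p1)), length(p1) == nrow(cars))
```

An explicit `stopifnot` right after the prediction step tends to fail much closer to the real cause than an error several steps downstream.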


## Factors are not first-class citizens in R

The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one-dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number `5` are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not an R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.

For example, consider the following R code.

```
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
print(f)
## [1] c a a <NA> b a
## Levels: a b c
print(class(f))
## [1] "factor"
```

This example encodes a series of 6 observations using a known set of factor levels (`'a'`, `'b'`, and `'c'`). As is the case with real data, some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is that we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

```
fRevised <- ifelse(is.na(f),'a',f)
print(fRevised)
## [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"
```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.
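The gibberish is not random: a factor is stored as integer codes plus a `levels` attribute, and those codes leak through `ifelse`. The sketch below shows the codes and one possible repair, which is to convert to character first and then re-encode. The name `fFixed` is illustrative, and this is only one way to patch the example, not necessarily the approach the full post settles on.

```r
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'), levels = levels)

# The underlying storage: positions into the levels vector.
print(as.integer(f))
## [1]  3  1  1 NA  2  1

# One repair: work on the character values, then rebuild the factor.
fFixed <- factor(ifelse(is.na(f), 'a', as.character(f)), levels = levels)
print(fFixed)
## [1] c a a a b a
## Levels: a b c
print(class(fFixed))
## [1] "factor"
```

Passing `levels = levels` when rebuilding keeps the original level set and ordering, even if some level no longer appears in the data.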

We are going to work through some more examples of this problem.


## R annoyances

Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor of R” or “professionally using R” (both statements are true). Some days we really don’t feel that way.