Posted on 4 Comments on Factors are not first-class citizens in R

## Factors are not first-class citizens in R

The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so-on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number `5` are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not a R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.

For example, consider the following R code.

``` levels <- c('a','b','c') f <- factor(c('c','a','a',NA,'b','a'),levels=levels) print(f) ## [1] c a a <NA> b a ## Levels: a b c print(class(f)) ## [1] "factor" ```

This example encoding a series of 6 observations into a known set of factor-levels (`'a'`, `'b'`, and `'c'`). As is the case with real data some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

``` fRevised <- ifelse(is.na(f),'a',f) print(fRevised) ## [1] "3" "1" "1" "a" "2" "1" print(class(fRevised)) ## [1] "character" ```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.

We are going to work through some more examples of this problem. Continue reading Factors are not first-class citizens in R