The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so-on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number `5`

are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not a R vector. In fact “factor” is *not* a first-class citizen in R, which can lead to some ugly bugs.

For example, consider the following R code.

```
```levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
print(f)
## [1] c a a <NA> b a
## Levels: a b c
print(class(f))
## [1] "factor"

This example encoding a series of 6 observations into a known set of factor-levels (`'a'`

, `'b'`

, and `'c'`

). As is the case with real data some of the positions might be missing/invalid values such as `NA`

. One of the strengths of R is we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'`

was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

```
```fRevised <- ifelse(is.na(f),'a',f)
print(fRevised)
## [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"

Notice the new column `fRevised`

is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f`

had been a vector of characters or even a vector of integers, but for factors we get gibberish.

We are going to work through some more examples of this problem. Continue reading