
R Tip: Use stringsAsFactors = FALSE

R tip: use stringsAsFactors = FALSE.

R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.


[Image: Sigmund Freud, photographed by Max Halberstadt]

It is often claimed Sigmund Freud said “Sometimes a cigar is just a cigar.”

To avoid problems, delay re-encoding of strings by using stringsAsFactors = FALSE when creating data.frames.

Example:

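# Before R 4.0.0, data.frame() defaulted to stringsAsFactors = TRUE,
# so this string column is silently stored as a factor.
d <- data.frame(label = rep("tbd", 5))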

d$label[[2]] <- "north"
#> Warning in `[[<-.factor`(`*tmp*`, 2, value = structure(c(1L, NA, 1L, 1L, :
#> invalid factor level, NA generated

print(d)
#>   label
#> 1   tbd
#> 2  <NA>
#> 3   tbd
#> 4   tbd
#> 5   tbd

Notice our new value was not copied in! Because "north" is not an existing level of the factor, R stored NA instead.
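One way to see what happened is to inspect the column directly. The following is a minimal sketch; it passes stringsAsFactors = TRUE explicitly (an assumption made here so the behavior is the same on R 4.0.0 and later, where the default is FALSE).

```r
# Build the column as a factor on purpose, to reproduce the problem.
d <- data.frame(label = rep("tbd", 5),
                stringsAsFactors = TRUE)

# The column is a factor, not a character vector.
class(d$label)
#> [1] "factor"

# The only permitted value (level) is the one seen at creation time.
levels(d$label)
#> [1] "tbd"

# "north" is not among the levels, so assigning it yields NA with a warning.
"north" %in% levels(d$label)
#> [1] FALSE
```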

The fix is easy: use stringsAsFactors = FALSE.

d <- data.frame(label = rep("tbd", 5),
                stringsAsFactors = FALSE)

d$label[[2]] <- "north"

print(d)
#>   label
#> 1   tbd
#> 2 north
#> 3   tbd
#> 4   tbd
#> 5   tbd
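Delaying the re-encoding does not mean giving up factors. If a factor is wanted later (for modeling, say), the column can be converted explicitly once its values are settled. A small sketch of that idea, reusing the example above:

```r
# Keep strings as strings while the data is still being edited.
d <- data.frame(label = rep("tbd", 5),
                stringsAsFactors = FALSE)
d$label[[2]] <- "north"

# Re-encode only when a factor is actually needed.
d$label <- factor(d$label)

# Both values are now levels (factor() sorts them by default).
levels(d$label)
#> [1] "north" "tbd"
```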

As is often the case: base R works okay in default mode and works very well if you judiciously change a few defaults. There is much less need to whole-hog replace R functionality than some claim.

Note: the above pattern of pre-building a data.frame and filling in values by addressing row/column index sets is a very effective (and underappreciated) way to build up data (often easier and quicker than binding rows or columns).
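A rough sketch of that pre-build-and-fill pattern is below. The column names (id, label, value) and the fill values are made up for illustration; they are not from the example above.

```r
# Pre-allocate a data.frame of the final size with typed placeholder values.
n <- 5
d <- data.frame(id = seq_len(n),
                label = rep(NA_character_, n),
                value = rep(NA_real_, n),
                stringsAsFactors = FALSE)

# Fill a set of rows in one column at once.
d$label[c(1, 3)] <- "north"
d$label[is.na(d$label)] <- "south"

# Fill by row/column index, using a position or a logical condition.
d[2, "value"] <- 3.5
d[d$label == "north", "value"] <- 1.0

print(d)
```

Because label is a plain character column, any string can be written into it without first declaring it as a level.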

4 thoughts on “R Tip: Use stringsAsFactors = FALSE”

  1. Great article. Completely agree.
    This comment is just an aside: the new factor level is automatically added for you in data.table.
    As you know, stringsAsFactors=FALSE has been the default in data.table for 10 years. So to demonstrate this feature of a factor column, we first need to set it to TRUE:

      library(data.table)
      DT = data.table(label = rep("tbd", 5), stringsAsFactors=TRUE)
      DT
          label
         <fctr>
      1:    tbd
      2:    tbd
      3:    tbd
      4:    tbd
      5:    tbd

      DT[2, label:="north"]
      DT
          label
         <fctr>
      1:    tbd
      2:  north
      3:    tbd
      4:    tbd
      5:    tbd
      DT$label
      [1] tbd   north tbd   tbd   tbd
      Levels: tbd north
    

    The point is just that it added the new factor level automatically for you, whereas base R issues a warning and stores NA. I agree most of the time plain character type is probably best, but I’m just adding the minor point that if you do have a factor (sometimes a factor is better when modelling, and ordered factors are also sometimes useful) then := in data.table copes with new factor levels.

    It’s one convenience/ease-of-use feature of data.table that is nothing to do with size or speed.

  2. And for anyone who has ever had to deal with the frustration of factors, a very cathartic way to implement this tip is

    devtools::install_github("nutterb/sillylogic")

    d <- data.frame(label = rep("tbd", 5),
                    stringsAsFactors = HELLNO)
