
For loops in R can lose class information

Did you know R's for() loop control structure drops class annotations from vectors?
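A minimal sketch of the effect, assuming a small made-up vector of Date values (the variable names here are illustrative and not taken from the full post). Iterating directly over the vector hands the loop body bare numbers, while indexing by position keeps the class.

```r
# Assumed example data: a short vector of Date values.
dates <- as.Date(c("2015-01-01", "2015-01-02"))
print(class(dates))
## [1] "Date"

# Iterating over the values directly: the loop variable loses its class.
seen <- character(0)
for (d in dates) {
  seen <- c(seen, class(d))
}
print(seen)
## [1] "numeric" "numeric"

# Iterating over positions and indexing keeps the class.
kept <- character(0)
for (i in seq_along(dates)) {
  kept <- c(kept, class(dates[i]))
}
print(kept)
## [1] "Date" "Date"
```

Looping over seq_along() and indexing is one common way to sidestep the issue; it is offered here as a sketch, not as the only fix.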


sample(): “Monkey’s Paw” style programming in R

The R functions base::sample and base::sample.int include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier-to-debug code.
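One such “convenience” can be sketched as follows: when sample() is given a single number x >= 1, it samples from 1:x rather than from the one-element vector. The helper safe_sample below is a hypothetical name introduced for illustration (it is not a base R function and may differ from the workaround in the full post); it always samples positions with sample.int and then indexes.

```r
# Hypothetical helper (not part of base R): always treat x as the set of items.
safe_sample <- function(x, size = length(x), replace = FALSE) {
  x[sample.int(length(x), size = size, replace = replace)]
}

x <- c(10)   # a one-element vector we intend to sample from

# base::sample re-interprets the single number 10 as "sample from 1:10",
# so the result has 10 entries instead of 1.
surprise <- sample(x)
print(length(surprise))
## [1] 10

# The helper only ever returns elements of x.
safe <- safe_sample(x, size = 5, replace = TRUE)
print(safe)
## [1] 10 10 10 10 10
```

The design choice is to never pass the data itself to sample(): sampling indices and then subsetting behaves the same for vectors of any length.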


“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.



What can be in an R data.frame column?

As an R programmer, have you ever wondered what can be in a data.frame column?
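As one small illustration (an assumption about the kind of answer involved, not a summary of the full post), a data.frame column does not have to be an atomic vector. The sketch below uses made-up column names to store a list in a column, so each row can hold a value of a different type and length.

```r
# Assumed example: a two-row data.frame with an ordinary integer column.
d <- data.frame(id = 1:2)

# Assign a list with one entry per row; this becomes a list column.
d$payload <- list(1:3, "a")

print(class(d$payload))
## [1] "list"
print(nrow(d))
## [1] 2
print(d$payload[[1]])
## [1] 1 2 3
```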


Check your return types when modeling in R

Just a warning: double-check your return types in R, especially when using different modeling packages.
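A minimal sketch of the kind of check meant here, using only base R as a stand-in for the cross-package case (the data and variable names are made up for illustration). Even a single predict() method can return either a named numeric vector or a matrix depending on its arguments, so it is worth inspecting the result before using it.

```r
# Assumed toy data for a simple linear model.
d <- data.frame(x = 1:10,
                y = c(1.1, 2.3, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.8, 10.3))
m <- lm(y ~ x, data = d)

# Default call: a plain (named) numeric vector, one entry per row.
p1 <- predict(m, newdata = d)
print(is.matrix(p1))
## [1] FALSE

# Asking for intervals: the return value becomes a matrix.
p2 <- predict(m, newdata = d, interval = "confidence")
print(is.matrix(p2))
## [1] TRUE
print(colnames(p2))
## [1] "fit" "lwr" "upr"
```

A defensive habit is to assert the expected shape (for example with stopifnot(is.numeric(p1), length(p1) == nrow(d))) right after each predict() call.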


Factors are not first-class citizens in R

The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one-dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number 5 are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not an R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.

For example, consider the following R code.

levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
print(f)
## [1] c    a    a    <NA> b    a   
## Levels: a b c
print(class(f))
## [1] "factor"

This example encodes a series of 6 observations into a known set of factor levels ('a', 'b', and 'c'). As is the case with real data, some of the positions might be missing/invalid values such as NA. One of the strengths of R is that we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level 'a' was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

fRevised <- ifelse(is.na(f),'a',f)
print(fRevised)
##  [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"

Notice the new column fRevised is an absolute mess (and not even of class/type factor). This sort of fix would have worked if f had been a vector of characters or even a vector of integers, but for factors we get gibberish.
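The gibberish arises because ifelse() works on the factor's underlying integer codes (3, 1, 1, NA, 2, 1) rather than on its labels. One possible repair, sketched here under the assumption that 'a' is already one of the declared levels (this is not necessarily the fix the full post recommends), is to assign into the missing positions directly, which keeps the result a factor.

```r
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'), levels=levels)

# Copy the factor and overwrite only the NA positions with an existing level.
fFixed <- f
fFixed[is.na(fFixed)] <- 'a'
print(fFixed)
## [1] c a a a b a
## Levels: a b c
print(class(fFixed))
## [1] "factor"

# Alternative: go through character explicitly, then rebuild the factor.
fFixed2 <- factor(ifelse(is.na(f), 'a', as.character(f)), levels=levels)
print(identical(fFixed, fFixed2))
## [1] TRUE
```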

We are going to work through some more examples of this problem.


R annoyances

Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor of R” or “professionally using R” (both statements are true). Some days we really don’t feel that way.