
Introduction

Beginning `R` users often come to the false impression that the popular packages `dplyr` and `tidyr` are both all of `R` and sui generis inventions (in that they might be unprecedented and there might be no other reasonable way to get the same effects in `R`). These packages and their conventions are high-value, but they are results of evolution and implement a style of programming that has been available in `R` for some time. They evolved in a context, and did not burst on the scene fully armored with spear in hand.

`dplyr` and `tidyr`

We will start with a (very) brief outline of the primary capabilities of `dplyr` and `tidyr`.

`dplyr`

`dplyr` embodies the idea that data manipulation should be broken down into a sequence of transformations.

For example: in `R` if one wishes to add a column to a `data.frame` it is common to perform an "in-place" calculation as shown below:

``````d <- data.frame(x=c(-1,0,1))
print(d)``````
``````##    x
## 1 -1
## 2  0
## 3  1``````
``````d$absx <- abs(d$x)
print(d)``````
``````##    x absx
## 1 -1    1
## 2  0    0
## 3  1    1``````

This has a couple of disadvantages:

• The original `d` has been altered, so re-starting calculations (say after we discover a mistake) can be inconvenient.
• We have to keep repeating the name of the `data.frame`, which is not only verbose (not that important an issue) but also a chance to write the wrong name and introduce an error.

The "`dplyr`-style" is to write the same code as follows:

``````suppressPackageStartupMessages(library("dplyr"))
d <- data.frame(x=c(-1,0,1))

d %>%
mutate(absx = abs(x))``````
``````##    x absx
## 1 -1    1
## 2  0    0
## 3  1    1``````
``````# confirm our original data frame is unaltered
print(d)``````
``````##    x
## 1 -1
## 2  0
## 3  1``````

The idea is to break your task into the sequential application of a small number of "standard verbs" to produce your result. The verbs are "pipelined" or sequenced using the `magrittr` pipe "`%>%`" which can be thought of as if the following four statements were to be taken as equivalent:

• `f(x)`
• `x %>% f(.)`
• `x %>% f()`
• `x %>% f`

This lets one write a sequence of operations as a left to right pipeline (without explicit nesting of functions or use of numerous intermediate variables). Some discussion can be found here.
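To make the equivalence concrete, here is a minimal base-`R` sketch of forward function application. The operator name `%then%` is hypothetical (it is not part of `magrittr` or base `R`), and this is not how `magrittr` implements "`%>%`"; it only covers the `x %>% f` form and is meant to show the core idea.

```r
# A toy forward-application operator (hypothetical name `%then%`).
# It simply applies the function on the right to the value on the left.
`%then%` <- function(x, f) f(x)

x <- c(-1, 0, 1)

# the same calculation written as a nested call and as a pipeline
nested <- sum(abs(x))
piped  <- x %then% abs %then% sum

print(identical(nested, piped))
```

User-defined `%...%` operators associate left to right, which is what lets the pipeline read in the order the steps are applied.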

Primary `dplyr` verbs include the "single table verbs" from the `dplyr 0.5.0` introduction vignette:

• `filter()` (and `slice()`)
• `arrange()`
• `select()` (and `rename()`)
• `distinct()`
• `mutate()` (and `transmute()`)
• `summarise()`
• `sample_n()` (and `sample_frac()`)

These have high-performance implementations (often in `C++` thanks to Rcpp) and often have defaults that are safer and better for programming (not changing types on single column data-frames, not promoting strings to factors, and so on). Not really discussed in the `dplyr 0.5.0` introduction are the `dplyr::*join()` operators which are in fact critical components, but easily explained as standard relational joins (i.e., they are very important implementations, but not novel concepts).

Fairly complex data transforms can be broken down in terms of these verbs (plus some verbs from `tidyr`):

Take for example a slightly extended version of one of the complex work-flows from the `dplyr 0.5.0` introduction vignette.

The goal is: plot the distribution of average flight arrival delays and average flight departure delays (all averages grouped by date) for dates where either of these averages is at least 30 minutes. The first step is writing down the goal (as we did above). With that clear, someone familiar with `dplyr` can write a pipeline or work-flow as below (we have added the `gather` and `arrange` steps to extend the example a bit):

``````library("nycflights13")
suppressPackageStartupMessages(library("dplyr"))
library("tidyr")
library("ggplot2")

summary1 <- flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30) %>%
gather(key = delayType,
value = delayMinutes,
arr, dep) %>%
arrange(year, month, day, delayType)``````
``## Adding missing grouping variables: `year`, `month`, `day```
``dim(summary1)``
``## [1] 98  5``
``head(summary1)``
``````## Source: local data frame [6 x 5]
## Groups: year, month [2]
##
##    year month   day delayType delayMinutes
##   <int> <int> <int>     <chr>        <dbl>
## 1  2013     1    16       arr     34.24736
## 2  2013     1    16       dep     24.61287
## 3  2013     1    31       arr     32.60285
## 4  2013     1    31       dep     28.65836
## 5  2013     2    11       arr     36.29009
## 6  2013     2    11       dep     39.07360``````
``````ggplot(data= summary1, mapping=aes(x=delayMinutes, color=delayType)) +
geom_density() +
ggtitle(paste("distribution of mean arrival and departure delays by date",
"when either mean delay is at least 30 minutes", sep='\n'),
subtitle = "produced by: dplyr/magrittr/tidyr packages")``````

Once you get used to the notation (become familiar with "`%>%`" and the verbs) the above can be read in small pieces and is considered fairly elegant. The warning message indicates it would have been better documentation to write the initial `select()` as "`select(year, month, day, arr_delay, dep_delay)`" (in addition I feel that `group_by()` should always be written as close to `summarise()` as is practical). We have intentionally (beyond minor extension) kept the example as is.

But `dplyr` is not unprecedented. It was preceded by the `plyr` package, and many of these transformational verbs actually have near equivalents in the `R` name-space `base::`:

• `dplyr::filter()` ~ `base::subset()`
• `dplyr::arrange()` ~ `base::order()`
• `dplyr::select()` ~ `base::[]`
• `dplyr::mutate()` ~ `base::transform()`
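As a small illustration of the correspondences listed above, the following sketch applies each base-`R` near-equivalent to a toy `data.frame` (the data and the result variable names are made up for the example). These are near equivalents, not exact ones; for instance the base versions keep the original row names.

```r
d <- data.frame(x = c(3, -1, 2),
                y = c("a", "b", "c"),
                stringsAsFactors = FALSE)

# dplyr::filter(d, x > 0)  ~  base::subset()
pos <- subset(d, x > 0)

# dplyr::arrange(d, x)  ~  row-indexing by base::order()
sorted <- d[order(d$x), , drop = FALSE]

# dplyr::select(d, x)  ~  base::`[`
only_x <- d["x"]

# dplyr::mutate(d, absx = abs(x))  ~  base::transform()
with_abs <- transform(d, absx = abs(x))

print(with_abs)
```

None of these calls alters `d` itself; each returns a new `data.frame`, which is what makes them usable as steps in a sequence of transformations.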

We will get back to these substitutions after we discuss `tidyr`.

`tidyr`

`tidyr` is a smaller package than `dplyr` and it mostly supplies the following verbs:

• `complete()` (a bulk coalesce function)
• `gather()` (an un-pivot operation, related to `stats::reshape()`)
• `spread()` (a pivot operation, related to `stats::reshape()`)
• `nest()` (a hierarchical data operation)
• `unnest()` (opposite of `nest()`, closest analogy might be `base::unlist()`)
• `separate()` (split a column into multiple columns)
• `extract()` (extract regular-expression groups from one column into new columns)
• `expand()` (complete an experimental design)

The most famous `tidyr` verbs are `nest()`, `unnest()`, `gather()`, and `spread()`. We will discuss `gather()` here as it and `spread()` are incremental improvements on `stats::reshape()`.

Note also the `tidyr` package was itself preceded by a package called `reshape2`, which supplied `pivot` capabilities in terms of verbs called `melt()` and `dcast()`.
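For readers who have not used `stats::reshape()`, here is a minimal sketch of the pivot direction (roughly what `spread()` does) on a toy table. The data and column names are invented for illustration; the un-pivot direction appears in the translated flights example later in this article.

```r
# toy data in long form: one row per (id, delayType) pair
long <- data.frame(id = c(1, 1, 2, 2),
                   delayType = c("arr", "dep", "arr", "dep"),
                   delayMinutes = c(10, 30, 20, 40),
                   stringsAsFactors = FALSE)

# pivot to wide form: one row per id, one column per delayType
wide <- reshape(long,
                idvar = "id",
                timevar = "delayType",
                v.names = "delayMinutes",
                direction = "wide")

print(wide)
```

Note that `reshape()` names the new columns by pasting the value column name and the key value together (here `delayMinutes.arr` and `delayMinutes.dep`).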

The flights example again

It may come as a shock to some: but one can roughly "line for line" translate the "nycflights13" example from the `dplyr 0.5.0` introduction into common methods from `base::` and `stats::` that reproduce the sequence of transforms style. I.e., transformational style is already available in "base-`R`".

By "base-`R`" we mean `R` with only its standard name-spaces (`base`, `utils`, `stats` and a few others). Or "`R` out of the box" (before loading many packages). "base-`R`" is not meant as a pejorative term here. We don’t take "base-`R`" to in any way mean "old-`R`", but to denote the core of the language we have decided to use for many analytic tasks.

What we are doing is separating the style of programming taught "as `dplyr`" (itself a significant contribution) from the implementation (also a significant contribution). We will replace the use of the `magrittr` pipe "`%>%`" with the Bizarro Pipe (an effect available in base-`R`) to produce code that works without use of `dplyr`, `tidyr`, or `magrittr`.
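As a warm-up, here is a minimal sketch of the Bizarro Pipe on the small `data.frame` from the introduction. The "`->.;`" at the end of each line is nothing more than base-`R` right-assignment into a variable named "`.`" followed by a statement separator; the variable name `result` is ours.

```r
d <- data.frame(x = c(-1, 0, 1))

# each step reads its input from `.` and assigns its output back to `.`
d ->.;
transform(., absx = abs(x)) ->.;
subset(., absx > 0) -> result

print(result)

# the original data frame is unaltered
print(d)
```

One thing to keep in mind with this convention is that "`.`" is an ordinary variable and is left behind in the working environment after the sequence runs.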

The translated example:

``````library("nycflights13")
library("ggplot2")

flights ->.;
# select columns we are working with
.[c('arr_delay', 'dep_delay', 'year', 'month', 'day')] ->.;
# simulate the group_by/summarize by split/lapply/rbind
transform(., key=paste(year, month, day)) ->.;
split(., .$key) ->.;
lapply(., function(.) {
transform(.,  arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
)[1, , drop=FALSE]
}) ->.;
do.call(rbind, .) ->.;
# filter to either delay at least 30 minutes
subset(., arr > 30 | dep > 30) ->.;
# select only columns we wish to present
.[c('year', 'month', 'day', 'arr', 'dep')] ->.;
# get the data into a long form
# can't easily use stack as (from help(stack)):
#  "stack produces a data frame with two columns"
reshape(.,
idvar = c('year','month','day'),
direction = 'long',
varying = c('arr', 'dep'),
timevar = 'delayType',
v.names = 'delayMinutes') ->.;
# convert reshape ordinals back to original names
transform(., delayType = c('arr', 'dep')[delayType]) ->.;
# make sure the data is in the order we expect
.[order(.$year, .$month, .$day, .$delayType), , drop=FALSE] -> summary2

# clean out the row names for clarity of presentation
rownames(summary2) <- NULL

dim(summary2)``````
``## [1] 98  5``
``head(summary2)``
``````##   year month day delayType delayMinutes
## 1 2013     1  16       arr     34.24736
## 2 2013     1  16       dep     24.61287
## 3 2013     1  31       arr     32.60285
## 4 2013     1  31       dep     28.65836
## 5 2013     2  11       arr     36.29009
## 6 2013     2  11       dep     39.07360``````
``````ggplot(data= summary2, mapping=aes(x=delayMinutes, color=delayType)) +
geom_density() +
ggtitle(paste("distribution of mean arrival and departure delays by date",
"when either mean delay is at least 30 minutes", sep='\n'),
subtitle = "produced by: base/stats packages plus Bizarro Pipe")``````
``print(all.equal(as.data.frame(summary1),summary2))``
``## [1] TRUE``

The above work-flow is a bit rough, but the simple introduction of a few light-weight wrapper functions would clean up the code immensely.

The ugliest bit is the by-hand replacement of the `group_by()`/`summarize()` pair, so that would be a good candidate to wrap in a function (either full split/apply/combine style or some specialization such as grouped ordered apply).
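One possible shape for such a wrapper is sketched below. The function name `grouped_summarize` and its signature are hypothetical (it is not part of base-`R` or `dplyr`), and it is only a plain split/apply/combine sketch with no attention paid to performance.

```r
# Hypothetical helper: split `d` by the columns named in `group_cols`,
# apply `f` to each piece, and bind the per-group results into one
# data.frame. `f` takes a data.frame and returns a one-row data.frame.
grouped_summarize <- function(d, group_cols, f) {
  pieces <- split(d, d[group_cols], drop = TRUE)
  results <- lapply(pieces, function(piece) {
    # the grouping values are constant within a piece, so take row 1
    keys <- piece[1, group_cols, drop = FALSE]
    cbind(keys, f(piece))
  })
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

d <- data.frame(g = c("a", "a", "b"),
                v = c(1, 3, 10),
                stringsAsFactors = FALSE)

res <- grouped_summarize(d, "g", function(piece) {
  data.frame(mean_v = mean(piece$v, na.rm = TRUE))
})

print(res)
```

With a helper like this the grouping columns are stated right next to the summary calculation, which is the same property argued for above when placing `group_by()` close to `summarise()`.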

The `reshape` step is also a bit rough, but I like the explicit specification of `idvar` (without it the person reading the code has little idea what the structure of the intended transform is). This is why even though I prefer the `tidyr::gather()` implementation to `stats::reshape()` I chose to wrap `tidyr::gather()` into a more teachable "coordinatized data" signature (the idea is: explicit grouping columns were a good idea for `summarize()`, and they are also a good idea for `pivot`/`un-pivot`).
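To illustrate the kind of signature meant here, the following is a rough sketch of an un-pivot wrapper with explicit id columns. It wraps `stats::reshape()` rather than `tidyr::gather()` so that it runs in base-`R`; the function name `unpivot_long` and its argument names are hypothetical and are not the signature of any published package.

```r
# Hypothetical wrapper: un-pivot `measure_cols` into key/value pairs,
# keeping the explicitly named `id_cols` on every row.
unpivot_long <- function(d, id_cols, measure_cols, key_name, value_name) {
  d <- as.data.frame(d)[c(id_cols, measure_cols)]
  long <- reshape(d,
                  idvar = id_cols,
                  direction = "long",
                  varying = measure_cols,
                  timevar = key_name,
                  v.names = value_name)
  # reshape() records the source column as an ordinal 1, 2, ...;
  # map those ordinals back to the original column names
  long[[key_name]] <- measure_cols[long[[key_name]]]
  rownames(long) <- NULL
  long
}

wide <- data.frame(id = c(1, 2), arr = c(10, 20), dep = c(30, 40))

long <- unpivot_long(wide,
                     id_cols = "id",
                     measure_cols = c("arr", "dep"),
                     key_name = "delayType",
                     value_name = "delayMinutes")

print(long)
```

The point of the design is that a reader can see from the call alone which columns identify a record and which columns are being stacked.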

Also, the use of expressions such as "`.$year`" is probably not a bad thing; `dplyr` itself is introducing "data pronouns" to try to reduce ambiguity and would write some of these expressions as "`.data$year`". In fact `dplyr` also allows notations such as "`mtcars %>% select(.data["disp"])`"; so such notation does have its place.

Conclusion

`R` itself is very powerful. That is why additional powerful notations and powerful conventions can be built on top of `R`. `R` also, for all its warts, has always been a platform for statistics and analytics. So: for common data manipulation tasks you should expect `R` does in fact have some ready-made tools.

It is often said "`R` is its packages", but I think that is missing how much `R` packages owe back to design decisions found in "base-`R`".

2 thoughts on “dplyr in Context”

1. Note: I have gotten some appropriate and correct criticism for not having traced important influences. I want to apologize on having presented a myopic view of “context.”

My goal was to remind people of the existence of `R` itself and point out with the proper conventions you already can write `R` in transformational style (and you need conventions in any big language).

That being said, I was remiss not to mention `data.table`, a high-performance package that has been around for about 11 years, is very much preferred by some large `R` groups (Google), already has a high-performance work-alike for `data.frame`, and already has a transformational query language (group, update, join, and so on; please see here).

2. Professor Hadley Wickham has taken issue with my writing:

the dplyr authors consider notations such as “mtcars %>% select(.data[“disp”])” as recommended notation

Evidently that is not the case and the notation was to be considered as relevant in a specific context. I have corrected the current copy of the article.

The notation was in fact suggested or recommended to me by him in an issue report that was already closed after a comment of mine that included the text “all my questions are now answered.” This is why I in good faith thought I could describe it as “recommended.”

(Also: I tend to move code fluidly between scripts, functions, and packages. When I indicated “I was not asking about those things”: I meant I already knew how to properly wrap code, not that I was not interested in functions and packages. But as we see, readers do not always take away exactly what writers may have intended.)