Posted on Categories Opinion, StatisticsTags , , , , , ,

Organize your data manipulation in terms of “grouped ordered apply”

Consider the common following problem: compute for a data set (say the infamous iris example data set) per-group ranks. Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.


Iris germanica Purple bearded Iris Wakehurst Place UK DiliffIris, by DiliffOwn work, CC BY-SA 3.0, Link

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.

The example

Let’s start (as always) with our data. We are going to look at the iris data set in R. You can view the data in by typing the following in your R console:

data(iris)
View(iris)

The package dplyr makes the grouped calculation quite easy. We define our “window function” (function we want applied to sub-groups of data in a given order) and then use dplyr (I’ve added some text explaining magrittr/dplyr notation in a comment below; when training this is a topic we spend a lot of time on) to apply the function to grouped and arranged data:

library('dplyr')

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
          mutate(rank=cumsum(constcol)) %>% select(-constcol)

# calculate
res <-
  iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% 
  rank_in_group

# display first few results
res %>% filter(rank<=2) %>% arrange(Species,rank)

 # Source: local data frame [6 x 6]
 # Groups: Species [3]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
 #                                      
 # 1          5.8         4.0          1.2         0.2     setosa     1
 # 2          5.7         4.4          1.5         0.4     setosa     2
 # 3          7.0         3.2          4.7         1.4 versicolor     1
 # 4          6.9         3.1          4.9         1.5 versicolor     2
 # 5          7.9         3.8          6.4         2.0  virginica     1
 # 6          7.7         3.8          6.7         2.2  virginica     2

Some possible confusion

The above works well, because all the operators we used were “grouping aware.” I think all dplyr operations are “grouping aware”, but some of the “in the street” tactics of “working with or around dplyr” may not be. For example slice is part of dplyr and grouping aware:

iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% slice(1)

 # Source: local data frame [3 x 5]
 # Groups: Species [3]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
 #                                     
 # 1          5.8         4.0          1.2         0.2     setosa
 # 2          7.0         3.2          4.7         1.4 versicolor
 # 3          7.9         3.8          6.4         2.0  virginica

But head is not part of dplyr and not grouping aware:

iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% head(n=1)

 # Source: local data frame [1 x 5]
 # Groups: Species [1]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
 #                                    
 # 1          7.9         3.8          6.4           2 virginica

Thus head “is not part of dplyr” even when it is likely a dplyr adapter supplying the actual implementation.

This can be very confusing to new analysts. We are seeing changes in semantics of downstream operators based on a data annotation (the “Groups:”). To the analyst grouping and ordering probably have equal stature. In dplyr grouping comes first, has a visible annotation, is durable, and changes the semantics of downstream operators. In dplyr ordering has no annotation, is not durable (it is quietly lost by many operators such as dplyr::compute and dplyr::collapse, though this is possibly changing), and can’t be stored (as it isn’t a concept in many back-ends such as relational databases).

It is hard for new analysts to trust dplyr the data iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) is both grouped and ordered. As we see below order is in the presentation (and not annotated) and grouping is annotated (but not in the presentation):

iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% print(n=50)

 # Source: local data frame [150 x 5]
 # Groups: Species [3]
 # 
 #    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
 #                                      
 # 1           7.9         3.8          6.4         2.0  virginica
 # ...
 # 12          7.1         3.0          5.9         2.1  virginica
 # 13          7.0         3.2          4.7         1.4 versicolor
 # 14          6.9         3.1          4.9         1.5 versicolor
 # 15          6.9         3.2          5.7         2.3  virginica
 # ...

Notice the apparent mixing of the groups in presentation. That is part of why there is a visible Groups: annotation, the grouping can not be inferred from the data presentation.

My opinion

Frankly it can be hard to document and verify which dplyr pipelines are maintaining the semantics you intended. We have every reason to believe the following is both grouped and ordered:

  iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) 

It is ordered as dplyr::arrange is the last step and we can verify the grouping is present with dplyr::groups().

We have less reason to trust the following is also grouped and ordered (especially in remote databases or Spark):

  iris %>% arrange(desc(Sepal.Length)) %>% group_by(Species)

The above may be simultaneously grouped and ordered (i.e. have not lost the order), but for reasons of “trust, but verify” it would be nice to have a user-visible annotation certifying that. Remember, explicitly verifying the order requires the use of a window-function (such as lag) so verifying order by hand isn’t always a convenient option.

We need to put some of the dplyr machinery in a housing to keep our fingers from getting into the gears.

Essentially this is saying wrap Hadley Wickham’s “The Split-Apply-Combine Strategy for Data Analysis” (link) concept into a single atomic operation with semantics:

    data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()

which we call “Grouped-Ordered-Apply.”

By wrapping the pipeline into a single “Grouped-Ordered-Apply” operation we are deliberately making intermediate results not visible. This is exactly what is needed to get rid of depending on distinctions of how partitioning is enforced (be it by a grouping annotation, or with an actual split) and worrying about the order of the internal operations.

A Suggested Solution

Our new package replyr supplies the “Grouped-Ordered-Apply” operation as replyr::gapply (itself built on top of dplyr). It performs the above grouped/ordered calculation as follows:

library('dplyr')
# install.packages('replyr') # Run this if you don't already have replyr
library('replyr')

data(iris)

# define our operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
          mutate(rank=cumsum(constcol)) %>% select(-constcol)

# apply our operation to data that is simultaneously grouped and ordered
res  <-
   iris %>% gapply(gcolumn='Species',
                   f=rank_in_group,
                   ocolumn='Sepal.Length',decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)

 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species rank
 # 1          5.8         4.0          1.2         0.2     setosa    1
 # 2          5.7         4.4          1.5         0.4     setosa    2
 # 3          7.0         3.2          4.7         1.4 versicolor    1
 # 4          6.9         3.1          4.9         1.5 versicolor    2
 # 5          7.9         3.8          6.4         2.0  virginica    1
 # 6          7.7         3.8          6.7         2.2  virginica    2

replyr::gapply can use either a split based strategy, or a dplyr::group_by_ based strategy for calculation. Notice replyr::gapply‘s preference for “parametric treatment of variables.” replyr::gapply anticipates that the analyst may not know the names of columns or variables when they are writing their code, but may in fact need to take names as values stored in other variables. Essentially we are making dplyr::*_ forms preferred. The rank_in_group is using dplyr preferred non-standard evaluation, which assumes the analyst knows the names of the columns they are manipulating; that is appropriate for transient user code.

Now that we have the rank annotations present we can try to confirm they are in fact correct (i.e. that the implementation maintained grouping and ranking throughout). The calculation is detailed (checking ranks are unique per-group, integers in the range 1 to group-size, and order compatible with the value column Sepal.Length), so we have wrapped the calculation in replyr:

replyr_check_ranks(res,
                   GroupColumnName='Species',
                   ValueColumnName='Sepal.Length',
                   RankColumnName='rank',
                   decreasing=TRUE)

 #   goodRankedGroup    groupID nRows nGroups nBadRanks nUniqueRanks nBadOrders
 # 1            TRUE     setosa    50       1         0           50          0
 # 2            TRUE versicolor    50       1         0           50          0
 # 3            TRUE  virginica    50       1         0           50          0

For simplicity we wrote the primary checking function in terms of operations that happen to be only correct when there is only one group present (i.e. the function needs formal splitting and isolation, not just dplyr::group_by). This isn’t a problem as we can then use replyr::gapply(partitionMetod='split') to correctly apply such code to all groups in turn.
Notice the Split-Apply-Combine steps are all wrapped together and supplied as part of the service; the user only supplies (column1,column2,f()). The transient lifetime and limited visibility of the sub-stages of the wrapped calculation are the appropriate abstractions given the fragility of row-order in modern data stores. The user doesn’t care if the data is actually split and ordered, as long as it is presented to their function as if it were so structured. We are using the Split-Apply-Combine pattern, but abstracting out if it is actually implemented by formal splitting (ameliorating the differences between base::split, tidyr::nest and SQL GROUP BY ... ORDER BY ...). There are benefits in isolating the user-visible semantics from the details of realization.

Much can be written in terms of this pattern including grouped ranking problems, dplyr::summarize, and more. And this is precisely the semantics of gapply (grouped ordered apply) found in replyr.

Additional examples

An advantage of using the general notation as above is that dplyr has implementations that work on large remote data services such as databases and Spark.

For example here is the “rank within group” calculation performed in PostgreSQL (assuming you have such a database up, and using your own user/password). For these additional examples we are going to continue to suppose our goal is to compute the rank of Sepal.Length for irises grouped by Species.

# install.packages('replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
data(iris)

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
          mutate(rank=cumsum(constcol)) %>% select(-constcol)

# get a databse handle and copy the data into the database
my_db <- dplyr::src_postgres(host = 'localhost', port = 5432,
                             user = 'postgres', password = 'pg')
irisD <- replyr_copy_to(my_db,iris)

# run the ranking in the database
res <-
   irisD %>% gapply(gcolumn='Species',
                    f=rank_in_group,
                    ocolumn='Sepal.Length',decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)

 # Source:   query [?? x 6]
 # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
 # 
 #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  rank
 #                                       
 # 1          5.8         4.0          1.2         0.2     setosa     1
 # 2          5.7         3.8          1.7         0.3     setosa     2
 # 3          7.0         3.2          4.7         1.4 versicolor     1
 # 4          6.9         3.1          4.9         1.5 versicolor     2
 # 5          7.9         3.8          6.4         2.0  virginica     1
 # 6          7.7         3.0          6.1         2.3  virginica     2

dplyr::group_by can also perform the grouped ordered calculation in PostgreSQL using the code below:

irisD %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% rank_in_group %>% 
  filter(rank<=2) %>% arrange(Species,rank)

We can even perform the same calculation using Spark and sparklyr.

# get a Spark handle and copy the data in
my_s <- sparklyr::spark_connect(master = "local")
irisS <- replyr_copy_to(my_s,iris)

# re-run the ranking in Spark
irisS %>% gapply(gcolumn='Species',
                 f=rank_in_group,
                 ocolumn='Sepal_Length',decreasing=TRUE) %>%
          filter(rank<=2) %>% arrange(Species,rank)

  # Source:   query [?? x 6]
 # Database: spark connection master=local[4] app=sparklyr local=TRUE
 # 
 #   Sepal_Length Sepal_Width Petal_Length Petal_Width    Species  rank
 #                                       
 # 1          5.8         4.0          1.2         0.2     setosa     1
 # 2          5.7         4.4          1.5         0.4     setosa     2
 # 3          7.0         3.2          4.7         1.4 versicolor     1
 # 4          6.9         3.1          4.9         1.5 versicolor     2
 # 5          7.9         3.8          6.4         2.0  virginica     1
 # 6          7.7         3.8          6.7         2.2  virginica     2

Notice the sparklyr adapter changed column names by replacing “.” with “_“, so we had to change our ordering column specification “ocolumn='Sepal_Length'” to match. This is the only accommodation we had to make to switch to a Spark service. Outside of R (and Lisp) dots in identifiers are considered a bad idea and should be avoided. For instances most SQL databases reserved dots to indicate relations between schemas, tables, and columns (so it is only through sophisticated quoting mechanisms that the PostgreSQL example was able to use dots in column names).

dplyr directly completes the same calculation with:

irisS %>% group_by(Species) %>% arrange(desc(Sepal_Length)) %>% rank_in_group %>% 
  filter(rank<=2) %>% arrange(Species,rank)

And that is the Grouped-Ordered-Apply pattern and our dplyr based reference implementation. (Exercise for the reader: implement SQL‘s useful “UPDATE WHERE” operation in terms of replyr::gapply.)

3 thoughts on “Organize your data manipulation in terms of “grouped ordered apply””

  1. To help explain the terms in the above article:

    A quick explanation of “<-“, “.”, and “%>%” for R users. These three symbols ( “<-” being basic R assignment, “.” being a valid R variable name that now has special uses, and “%>%” being the magrittr “pipe operator”, used extensively by dplyr package users) are the core of “pipe oriented R” which is a newish dialect of R programming where things are written as pipelines.

    The purpose of magrittr is to replace code that looks like this:

      x <- sin(cos(5))
    

    With code that looks like this:

      x <- 5 %>% cos() %>% sin()
    

    We can be a bit bold and call the first "vanilla R" and the second "pipey R" (deliberately being equally mean to both camps).

    Of course this means to read this sort of "pipey R" you have to be able to translate back. I.e. when you see "x <- 5 %>% cos() %>% sin()" you have to be able to understand "x <- sin(cos(5))" is intended.

    Below is our quick and dirty explantation. Our explanations are deliberately simple, so they are going to differ from how magrittr is actually implemented and not explain some corner-case interactions. But I feel they are loyal to the expressed intent.

    • "<-" or "assign"

      As we should know "<-" is assignment in R. "x <- 5" places the value "5" into a variable named "x" in the current execution environment. The thing to remember is "<-" has "low operator precedence" in that it pretty much always happens last. So "x <- 3*5+2" is really shorthand for "x <- ( (3 * 5) + 2 )". So when read, R always splits statements at the assignments ("<-", "=", and "->"). Understand they mean the rest of the expression gets assigned into the name on the assignment end (usually left, "->" being the odd one as it assigning to right as in "5 -> x").

    • "%>%" or "pipe":

      "a %>% b" is roughly equivalent to "b(a)". Good explanation of magrittr pipes here: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html .

      The intent is: "7 %>% sin() %>% cos()" should work as "cos(sin(7))". The rule is we treat the right hand side ("b") as a function and delay its evaluation until we can paste-in the value of the lefthand side of the "%>%" or "pipe." The function on the right can be written as "sin", "sin()", or "sin(.)" (magrittr searches for the dot and replaces it with the incoming value, otherwise it tries for the first argument). A cool feature is other function arguments are "Curryed in" ( https://en.wikipedia.org/wiki/Currying ) automatically (actually, strictly they are "partially applied"). So we can write "3 %>% atan2(2,.)" or "2 %>% atan2(.,3): and they both mean "atan2(2,3)"

    • "." or "dot":

      "." is just a valid variable or function argument name in R such as "x" or "y" or "x.1". magrittr specializes "." two ways. First as we mentioned above an explicit "." in a function argument list is where magrittr decides to land values. The second is pipeline capture which we explain below.

    • ". %>%" or "pipeline capture"

      In magrittr the expression ". %>%" is a special form roughly equivalent to function defintion or "lambda abstraction." Because of it the expression ". %>% sin %>% cos" is similar to writing "function(.) { cos(sin(.)) }". You can try this "f1 <- . %>% sin %>% cos", "f2 <- function(.) { cos(sin(.)) }". Now try "f1(5)" and "f2(5)". So ". %>%" is a special case that "captures the pipeline for future use."

    The above is something we spend significant time on when training, here it is unfortunately a bit telegraphic.

    When writing about what I consider a good way to organize calculations I chose to use what I feel is the current dominant organizing notation in my examples (pipes). I didn't want to lose readers, but I also didn't want to be told that the Grouped-Ordered-Apply convention was only needed because I had failed to use this notation or that notation. The methods demonstrated (and replyr::gapply itself) do not need pipes, you can perform all of the above with standard R notation and replyr::gapply.

  2. Can you comment on the advantages/disadvantages of doing this in data.table?

    Here’s what I did to produce the same output you have in top example.

    library(data.table)
    iris.dt <- as.data.table(iris)
    ## generate rank (descending order) within each species
    ## we use ties.method = "first" to match example
    iris.dt[, rank := rank(-Sepal.Length, ties.method = c("first")),
    by = Species]
    ## order display by species and rank
    setkeyv(iris.dt, c("Species", "rank"))
    ## display top 2 ranks
    iris.dt[rank <= 2,]

    ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species rank
    ## 1: 5.8 4.0 1.2 0.2 setosa 1
    ## 2: 5.7 4.4 1.5 0.4 setosa 2
    ## 3: 7.0 3.2 4.7 1.4 versicolor 1
    ## 4: 6.9 3.1 4.9 1.5 versicolor 2
    ## 5: 7.9 3.8 6.4 2.0 virginica 1
    ## 6: 7.7 3.8 6.7 2.2 virginica 2

    1. Thanks for the conversation and example, here are some of my thoughts on the topic.

      First when working with people it is important (unless there are very good reasons) to use the tools and conventions they have chosen. So, I would never move a data.table based project to dplyr unless that was the help the group was asking for.

      Essentially the code you are demonstrating is performing the same calculation using a built in data.table “grouped apply” primitive (given by the “[,fn,by=]” notation). This is good enough as the rank function is powerful enough to work on un-ordered data (as would any function be if we asked the user to sort as their first step). The Grouped-Ordered-Apply primitive lets a weaker operator (in this case cumsum) complete the same calculation. Other applications I envisioned for Grouped-Orded-Apply include: pick best in each group, summarize each group (doesn’t use the ordered control, which is in fact optional), time-series manipulation (such as differencing, calculating returns, lagging, and so on), and general transforms (again, many of which do not use the ordered control).

      Any important example is going to be already easily achievable in R as if such were not achievable it would represent a major gap in R capabilities, and thus attract attention and a fix.

      That being said there are some great primitives to think about: Hadley Wickham’s “Split-Apply-Combine” (actually a convention as the user has to ask the data to be re-assembled), and data table’s “grouped apply”. I feel there is a gap in how ordering is treated (it isn’t a durable or annotated property in some dplyr service providers and it is a hard thing to teach students). So I feel making order explicit and wrapping the operations into one manager would be helpful. Some cases I think this handles really well are summarization or “best” each of which potentially return a different number of rows per group (i.e. one) than went into the calculation.

      Now the reason I am working with dplyr, is it provides an abstraction to remote data sources (such as PostgreSQL, and Spark). This means (with some care) one can learn a “back-end agnostic” dialect of dplyr (what I am trying to test and document in my own replyr package) that can be moved from platform to platform.

      For what I understand data.table is not an abstraction but a complete implementation or service. I have no problem with that, much of our valuable work fits in-memory (and I am aware of COST); however, some clients need to work directly in PostgreSQL and Spark. As you point out data.table already supplies a complete atomic (or indivisible) “grouped apply” abstraction (which in dplyr is several steps group_by %>% FUNCTION %>% ungroup, and FUNCTION has to be dplyr compatible as dplyr::slice is and utils::head is not). So perhaps the need for Grouped-Ordered-Apply is less with data.table, but I think it is worthwhile to explore which abstractions are teachable yet wrap up enough steps to be useful.

Comments are closed.