Posted on Categories data science, Statistics, TutorialsTags , , ,

Better Grouped Summaries in dplyr

For R dplyr users one of the promises of the new rlang/tidyeval system is an improved ability to program over dplyr itself. In particular to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps.

Let’s take a look at this capability.

First let’s start dplyr.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.1.9000'

A dplyr pattern that I have seen used often is the "group_by() %>% mutate()" pattern. This historically has been shorthand for a "group_by() %>% summarize()" followed by a join(). It is easiest to show by example.

The following code:

mtcars %>% 
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg), 
            group_mean_disp = mean(disp)) %>% 
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
## # A tibble: 6 x 6
## # Groups:   cyl, gear [4]
##     cyl  gear   mpg  disp group_mean_mpg group_mean_disp
##   <dbl> <dbl> <dbl> <dbl>          <dbl>           <dbl>
## 1     6     4  21.0   160         19.750        163.8000
## 2     6     4  21.0   160         19.750        163.8000
## 3     4     4  22.8   108         26.925        102.6250
## 4     6     3  21.4   258         19.750        241.5000
## 5     8     3  18.7   360         15.050        357.6167
## 6     6     3  18.1   225         19.750        241.5000

is taken to be shorthand for:

mtcars %>% 
  group_by(cyl, gear) %>%
  summarize(group_mean_mpg = mean(mpg), 
            group_mean_disp = mean(disp)) %>% 
  left_join(mtcars, ., by = c('cyl', 'gear')) %>%
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
##   cyl gear  mpg disp group_mean_mpg group_mean_disp
## 1   6    4 21.0  160         19.750        163.8000
## 2   6    4 21.0  160         19.750        163.8000
## 3   4    4 22.8  108         26.925        102.6250
## 4   6    3 21.4  258         19.750        241.5000
## 5   8    3 18.7  360         15.050        357.6167
## 6   6    3 18.1  225         19.750        241.5000

The advantages of the shorthand are:

  • The analyst only has to specify the grouping column once.
  • The data (mtcars) enters the pipeline only once.
  • The analyst doesn’t have to start thinking about joins immediately.

Frankly I’ve never liked the shorthand. I feel it is a "magic extra" that a new user would have no way of anticipating from common use of group_by() and summarize(). I very much like the idea of wrapping this important common use case into a single verb. Adjoining "windowed" or group-calculated columns is a common and important step in analysis, and well worth having its own verb.

Below is our attempt at elevating this pattern into a packaged verb.

#' Simulate the group_by/mutate pattern 
#' with an explicit summarize and join.
#' 
#' Group a data frame by the groupingVars
#' and compute user summaries on this data
#' frame (user summaries specified in ...),
#' then join these new columns back into
#' the original data and return to the
#' user.  It is a demonstration of a
#' higher-order dplyr verb.
#' 
#' Author: John Mount, Win-Vector LLC.
#' 
#' @param d data.frame
#' @param groupingVars character vector of column names to group by.
#' @param ... dplyr::summarize commands.
#' @return d with grouped summaries added as extra columns
#' 
#' @examples
#' 
#' add_group_summaries(mtcars, 
#'                     c("cyl", "gear"), 
#'                     group_mean_mpg = mean(mpg), 
#'                     group_mean_disp = mean(disp)) %>%
#'   head()
#' 
#' @export
#' 
add_group_summaries <- function(d, 
                                groupingVars, 
                                ...) {
  # convert char vector into spliceable vector
  groupingSyms <- rlang::syms(groupingVars)
  d <- ungroup(d) # just in case
  dg <- group_by(d, !!!groupingSyms)
  ds <- summarize(dg, ...)
  # work around https://github.com/tidyverse/dplyr/issues/2963
  ds <- ungroup(ds)
  left_join(d, ds, by= groupingVars)
}

This works as follows:

mtcars %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
##   cyl gear  mpg disp group_mean_mpg group_mean_disp
## 1   6    4 21.0  160         19.750        163.8000
## 2   6    4 21.0  160         19.750        163.8000
## 3   4    4 22.8  108         26.925        102.6250
## 4   6    3 21.4  258         19.750        241.5000
## 5   8    3 18.7  360         15.050        357.6167
## 6   6    3 18.1  225         19.750        241.5000

And this also works on SQLite-backed dplyr data (which the shorthand currently does not, please see dplyr 2887 issue and dplyr issue 2960).

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>% 
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg), 
         group_mean_disp = mean(disp))
## Error: Window function `avg()` is not supported by this database
mtcars2 %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.11.1 [:memory:]
##     cyl  gear   mpg  disp group_mean_mpg group_mean_disp
##   <dbl> <dbl> <dbl> <dbl>          <dbl>           <dbl>
## 1     6     4  21.0   160         19.750        163.8000
## 2     6     4  21.0   160         19.750        163.8000
## 3     4     4  22.8   108         26.925        102.6250
## 4     6     3  21.4   258         19.750        241.5000
## 5     8     3  18.7   360         15.050        357.6167
## 6     6     3  18.1   225         19.750        241.5000

The above, and many more useful dplyr standard evaluation adapters are now all part of the new package seplyr (on CRAN and GitHub).

10 thoughts on “Better Grouped Summaries in dplyr”

  1. It took me a while to discover what group_by with mutate would do (no collapsing) and somehow found it unintuitive in an otherwise very intuitive package. By comparison, in Stata the equivalent:
    by cyl gear: egen group_mean_mpg = mean(mpg)
    does what you’d expect without collapsing as summarize in dplyr does.

    1. I come from a SQL background, so I find the non-collapsing case to be the odd one. My gut always says: build a table with the right keying and join it back in.

      However, running dplyr::show_query() on the examples (this time using a PostgreSQL back end) is very illuminating. We see the summarize collapse triggered by the SQL group command, and the windowed calculation co-mingling with other results in the mutate example.

      suppressPackageStartupMessages(library("dplyr"))
      library('RPostgreSQL')
      #> Loading required package: DBI
      
      # Local Postgres.app database; no password by default
      # Of course, you fill in your own database information here.
      con = DBI::dbConnect(DBI::dbDriver("PostgreSQL"),
                           host = 'localhost',
                           port = 5432,
                           user = 'postgres',
                           password = 'pg')
      
      copy_to(con, 
              mtcars %>% mutate(id = row_number()), 
              'mtcars')
      mtcars2 <- tbl(con, "mtcars")
      
      mtcars2 %>%
        select(id, mpg, cyl) %>%
        group_by(cyl) %>%
        mutate(mpg_min = min(mpg)) %>% 
        show_query()
      #> <SQL>
      #> SELECT "id", "mpg", "cyl", min("mpg") OVER (PARTITION BY "cyl") AS "mpg_min"
      #> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
      #> FROM "mtcars") "kkztfjtpuw"
      
      mtcars2 %>%
        select(id, mpg, cyl) %>%
        group_by(cyl) %>%
        summarize(mpg_min = min(mpg)) %>% 
        show_query()
      #> <SQL>
      #> SELECT "cyl", MIN("mpg") AS "mpg_min"
      #> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
      #> FROM "mtcars") "pdtcvcoqju"
      #> GROUP BY "cyl"
      
  2. # group_by() also works with mutate() eliminating the need for summarise() and join()

    mtcars %>%
    group_by(cyl, gear) %>%
    mutate(group_mean_mpg = mean(mpg),
    group_mean_disp = mean(disp)) %>%
    head()

    1. Yes, that is what I hoped to indicate with “The analyst doesn’t have to start thinking about joins immediately”.

      And the shorthand works, until it does not.

      suppressPackageStartupMessages(library("dplyr"))
      con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
      copy_to(con, mtcars)
      mtcars2 <- tbl(con, "mtcars")
      
      mtcars2 %>%
        group_by(cyl, gear) %>%
        mutate(group_mean_mpg = mean(mpg),
               group_mean_disp = mean(disp)) %>%
        head()
      #> Error: Window function `avg()` is not supported by this database
      
  3. A similar pattern that I find myself often using to avoid a join is the group_by() %>% summarize(), where the grouping is over-specified. An example of this is a time series with columns year, month, date, value, (multiple values per day) and the analyst wants to summarize by date.

    A simple group_by(date) %>% summarize(day_value = sum(value)) will drop columns that we want to retain (year, month), but can re-derive (or join). To avoid the hassle, there are multiple options, all sub-optimal:

    “`
    #Over-specify grouping
    #Over-specify grouping
    dta %>%
    group_by(year, month, date) %>%
    summarize(value = sum(value))

    #Summarize extra columns
    dta %>%
    group_by(date) %>%
    summarize(value = sum(value),
    month = first(month),
    year = first(year))

    #Explicit join
    dta %>%
    group_by(date) %>%
    summarize(value = sum(value)) %>%
    left_join(dta[c(“year”, “month”, “date”)], by = “date”)
    “`

    1. Olivia B-R,

      That is a great issue and a great example. Thanks for bringing it up (and your solutions).

      To state it back: suppose are grouping by US zipcode and you have also a US State column. Under typical `group_by(zip_code) %>% summarize(age = mean(age))` type work you lose the (potentially useful) state column.

      In `SQL` if you try to include such columns you usually get an error message of the nature “ERROR: column ‘d.state’ must appear in the GROUP BY clause or be used in an aggregate function.”

      The solutions, as you state are usually one of:

      • Add the column to the “group by” to “overkey.”
      • Add the column to the query with a useless aggregator (like ‘min()’). We teach this one and suggest adding the comment “pseudo aggregator” to the code for readability.
      • Joining the columns back in from a fact table later.

      As I said our favorite solution is the pseudo-aggegator.

      library("DBI")
      library("RPostgreSQL")
      suppressPackageStartupMessages(library("dplyr"))
      
      conn <- dbConnect(dbDriver("PostgreSQL"),
                        host = 'localhost',
                        port = 5432,
                        user = 'postgres',
                        password = 'pg')
      
      dbWriteTable(conn, 'd', 
                   data.frame(state='CA', zipcode=94207, age=40,
                              stringsAsFactors = FALSE),
                   temporary=TRUE, overwite=TRUE)
      #> [1] TRUE
      DBI::dbGetQuery(conn, "select 
                                zipcode, 
                                min(state) state, -- pseudo aggregator, column is a function of keys
                                avg(age) 
                             from 
                                d 
                             group by
                                zipcode")
      #>   zipcode state avg
      #> 1   94207    CA  40
      d <- dplyr::tbl(conn, 'd')
      
      # loses state
      d %>% 
        group_by(zipcode) %>% 
        summarize(meanage = mean(age))
      #> # Source:   lazy query [?? x 2]
      #> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
      #>   zipcode meanage
      #>     <dbl>   <dbl>
      #> 1   94207      40
      
      # over keying
      d %>% 
        group_by(zipcode, state) %>% 
        summarize(meanage = mean(age))
      #> # Source:   lazy query [?? x 3]
      #> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
      #> # Groups:   zipcode
      #>   zipcode state meanage
      #>     <dbl> <chr>   <dbl>
      #> 1   94207    CA      40
      
      # pseudo-aggregator
      d %>% 
        group_by(zipcode) %>% 
        summarize(meanage = mean(age), 
                  state= min(state) #  pseudo aggregator, column is a function of keys
        )
      #> # Source:   lazy query [?? x 3]
      #> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
      #>   zipcode meanage state
      #>     <dbl>   <dbl> <chr>
      #> 1   94207      40    CA
      
      dbExecute(conn, 'drop table d')
      #> [1] 0
      
      1. Hi John,

        Thank you for your response. The zip code example is a more natural one (well, other than for a handful of zips, see https://gis.stackexchange.com/a/167333).

        My prefered method is the over-keying one. Could you explain why you prefer, and teach, the pseudo-aggregator instead?

        I generally think of the group_by dimensions as the keys/coordinates of the data, and the summarize ones as my measurements. In that framework, I think it makes more sense to over-key, although I guess there is an argument that the first key is sufficient and that the state is a “measurement” done at the zipcode level.

        My ideal solution would be to encode that key hierarchy in the data itself. I however cannot think of a nice design to allow the programmer to easily express this notion. I have been playing with the idea of using information in the join plan, we’ll see where that goes. I am very interested in other people’s thoughts on this.

        Best,
        Olivia

        1. I should have figured there would be mixed-state zip codes (argh). It is just non-time series examples are simpler for those who have not worked with time series.

          Over-keying is a great solution. I tend to use pseudo-aggergation because I come from a SQL world, so I am often in the middle of the join and the aggregation notation is often a bit more convenient than naming table-qualified columns (or even worse grouping by derived expressions).

          I agree- more specification and having the code deal with it would be better. In Practical Data Science with R we didn’t talk about pseudo-aggregators, but did talk about column roles (keys, values) and had a concept of a “derived column.” A system like you are talking about could deal with work automatically if we had such annotations.

Comments are closed.