
Tutorial: Using seplyr to Program Over dplyr

seplyr is an R package that makes it easy to program over dplyr 0.7.*.

To illustrate this we will work an example.

Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.2'
cat(colnames(starwars), sep='\n')
## name
## height
## mass
## hair_color
## skin_color
## eye_color
## birth_year
## gender
## homeworld
## species
## films
## vehicles
## starships
starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = 
              mean(height, na.rm = TRUE),
            mean_mass = 
              mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

The above is colloquially called "an interactive script." The name comes from the fact that we use names of variables (such as "homeworld") that would only be known from looking at the data directly in the analysis code. Only somebody interacting with the data could write such a script (hence the name).

Converting such an interactive dplyr pipeline into a re-usable script or function has long been considered a point of discomfort. By re-usable we mean a script or function that takes its column names as parameters: roughly, the names of the data columns are not yet known when we are writing the code (and this is what makes the code re-usable).

This inessential (or conquerable) difficulty is largely due to the preference for non-standard evaluation interfaces (that is, interfaces that capture and inspect un-evaluated expressions from their calling interface) in the design of dplyr.
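To make the distinction concrete, here is a minimal base-R sketch contrasting the two styles of interface. It is not how dplyr or seplyr are implemented, and the helper names nse_column and se_column are hypothetical, chosen only for this illustration.

```r
# Non-standard evaluation: capture the expression the caller typed
# and evaluate it with the data frame's columns in scope.
nse_column <- function(data, col) {
  col_expr <- substitute(col)
  eval(col_expr, data, parent.frame())
}

# Standard evaluation: the column is named by an ordinary value (a string).
se_column <- function(data, col_name) {
  data[[col_name]]
}

d <- data.frame(height = c(172, 167, 96), mass = c(77, 75, 32))

# Convenient when the column name is typed directly into the code.
print(nse_column(d, height))

# When the column is chosen at run time, the standard interface simply
# uses the variable's value.
chosen <- "mass"
print(se_column(d, chosen))

# The capturing interface instead looks for a column called "chosen",
# does not find one, and falls back to the variable, returning the
# string itself rather than the mass column.
print(nse_column(d, chosen))
```

The last call is the heart of the difficulty: with a capturing interface, a variable holding a column name is not automatically treated as that column.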

seplyr is a dplyr adapter layer that prefers "slightly clunkier" standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.

Let’s translate the above example into a re-usable function in small (easy) stages. First, translate the interactive script from dplyr notation into seplyr notation. This step is a pure re-factoring: we are changing the code without changing its observable external behavior.

The translation is mechanical in that it is mostly using seplyr documentation as a lookup table. What you have to do is:

  • Change dplyr verbs to their matching seplyr "*_se()" adapters.
  • Add quote marks around names and expressions.
  • Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the "c()" notation.
  • Replace "=" in expressions with ":=".

Our converted code looks like the following.

library("seplyr")

starwars %>%
  group_by_se("homeworld") %>%
  summarize_se(c("mean_height" := 
                   "mean(height, na.rm = TRUE)",
                 "mean_mass" := 
                   "mean(mass, na.rm = TRUE)",
                 "count" := "n()"))
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

This code works the same as the original dplyr code. Obviously, at this point all we have done is make the code a bit less pleasant looking. We have yet to see any benefit from this conversion (though we can turn this on its head and say all the original dplyr notation is saving us is from having to write a few quote marks).

The benefit is: this new code can very easily be parameterized and wrapped in a re-usable function. In fact it is now simpler to do than to describe.

For example: suppose (as in the original example) we want to create a function that lets us choose the grouping variable. This is now easy: we copy the code into a function and replace the explicit value "homeworld" with a variable:

starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := 
                     "mean(height, na.rm = TRUE)",
                   "mean_mass" := 
                     "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}

starwars_mean("hair_color")
## # A tibble: 13 x 4
##       hair_color mean_height mean_mass count
##            <chr>       <dbl>     <dbl> <int>
##  1        auburn    150.0000       NaN     1
##  2  auburn, grey    180.0000       NaN     1
##  3 auburn, white    182.0000  77.00000     1
##  4         black    174.3333  73.05714    13
##  5         blond    176.6667  80.50000     3
##  6        blonde    168.0000  55.00000     1
##  7         brown    175.2667  79.27273    18
##  8   brown, grey    178.0000 120.00000     1
##  9          grey    170.0000  75.00000     1
## 10          none    180.8889  78.51852    37
## 11       unknown         NaN       NaN     1
## 12         white    156.0000  59.66667     4
## 13          <NA>    141.6000 314.20000     5

In seplyr, programming is easy (just replace values with variables). For example, we can make a completely generic re-usable "grouped mean" function using R’s paste0() function (a variant of paste()) to build up expressions.

grouped_mean <- function(data, 
                         grouping_variables, 
                         value_variables) {
  result_names <- paste0("mean_", 
                         value_variables)
  expressions <- paste0("mean(", 
                        value_variables, 
                        ", na.rm = TRUE)")
  calculation <- result_names := expressions
  print(as.list(calculation)) # print for demo
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(calculation,
                   "count" := "n()"))
}

starwars %>% 
  grouped_mean(grouping_variables = "eye_color",
               value_variables = c("mass", "birth_year"))
## $mean_mass
## [1] "mean(mass, na.rm = TRUE)"
## 
## $mean_birth_year
## [1] "mean(birth_year, na.rm = TRUE)"

## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <int>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11

The only part that requires more study and practice is building the expressions with paste0() (for more details on the string manipulation please try "help(paste)"). Notice also we used the ":=" operator to bind the list of desired result names to the matching calculations (please see "help(named_map_builder)" for more details).
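The string-building step can be tried on its own in base R. The sketch below defines a local stand-in for ":=" so that it runs without seplyr installed; the stand-in is an assumption made for illustration and is not loaded from the package.

```r
# Local stand-in for the name-binding operator (assumption: it only
# attaches names to values), so this sketch runs without seplyr.
`:=` <- function(names, values) {
  names(values) <- names
  values
}

value_variables <- c("mass", "birth_year")

# paste0() is vectorized: the fixed text is recycled across every
# element of value_variables, giving one string per variable.
result_names <- paste0("mean_", value_variables)
expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")

# Bind each result name to its matching expression string.
calculation <- result_names := expressions
print(calculation)

# Sanity check: each generated string parses as exactly one R expression.
parsed_ok <- vapply(calculation,
                    function(txt) length(parse(text = txt)) == 1L,
                    logical(1))
print(parsed_ok)
```

Checking that the generated strings parse is a cheap way to catch a misplaced parenthesis or comma before the strings reach a pipeline.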

The point is: we did not have to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval or lazyeval to complete our programming task. Once you are in seplyr notation, changes are very easy. You can separate translating into seplyr notation from the work of designing your wrapper function (breaking your programming work into smaller easier to understand steps).
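Part of the reason no heavy tooling is needed is that base R can already turn a string back into an expression and evaluate it against a data frame. The following sketch shows that general technique only. It is not seplyr's actual implementation, it ignores grouping, and summarize_strings is a hypothetical helper name.

```r
# Hypothetical helper: evaluate a named vector of expression strings
# against the columns of a data frame, one summary value per string.
summarize_strings <- function(data, calculation) {
  env <- parent.frame()
  results <- lapply(calculation, function(txt) {
    expr <- parse(text = txt)[[1]]  # string -> unevaluated expression
    eval(expr, data, env)           # columns of data are visible by name
  })
  as.data.frame(results)
}

d <- data.frame(height = c(150, 180, NA), mass = c(50, 80, 70))

calculation <- c(mean_height = "mean(height, na.rm = TRUE)",
                 mean_mass   = "mean(mass, na.rm = TRUE)",
                 count       = "length(height)")

res <- summarize_strings(d, calculation)
print(res)
```

Because the calculation is an ordinary named character vector, it can be built, inspected, and passed around like any other value before it is evaluated.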

The seplyr method is simple, easy to teach, and powerful. The package contains a number of worked examples both in help() and vignette(package='seplyr') documentation.

13 thoughts on “Tutorial: Using seplyr to Program Over dplyr”

  1. So what is “:=”? It turns out it is a reserved symbol for an old Pascal-style assignment operator that is no longer bound to an implementation in current R.

    This means it is unique among R operators in that:

    • It doesn’t have a base-R implementation (meaning it is somewhat up for grabs).
    • It doesn’t need %% notation like most user defined operators.
    • It binds late as you would expect an assignment operator to (meaning we don’t need parentheses in many situations).

    Below is some example code playing with the := symbol.

    print(`*`)
    #> function (e1, e2)  .Primitive("*")
    
    tryCatch(
      print(`:=`),
      error = function(e) { e })
    #> <simpleError in print(`:=`): object ':=' not found>
    
    `:=` <- function(a,b) { print(match.call()) }
    
    x := 3 + 7
    #> `:=`(a = x, b = 3 + 7)
    
    `%:=%` <- function(a,b) { print(match.call()) }
    
    x %:=% (3 + 7)
    #> `%:=%`(a = x, b = (3 + 7))
    
    x %:=% 3 + 7
    #> `%:=%`(a = x, b = 3)
    #> Error in x %:=% 3 + 7: non-numeric argument to binary operator
    

    Now some packages do already use the := operator, but it isn’t something anyone can really claim to own or reserve. Here are the packages I am aware of using :=:

    • data.table. data.table does export an implementation of :=. However, I believe it is largely used in a data.table controlled context, so R package semantics should ensure data.table does not get clobbered by seplyr.
    • rlang/tidyeval. rlang/tidyeval assigns := to be ~, which seems like a waste as ~ does in fact exist and few users should be importing rlang/tidyeval directly, as it is a toolbox largely used to build other packages. rlang/tidyeval also uses the fact that expressions that treat := as an assignment are not parse errors, allowing it to capture unevaluated user expressions with the := symbol in an assignment position. Since rlang/tidyeval is essentially running its own interpreter (eval and apply have been called the fundamental equations of functional languages, so overriding them essentially gives you a new interpreter or language), we expect our mere binding of := can’t interfere with rlang/tidyeval operations.
    • dplyr. dplyr does not export an implementation for := but uses it to allow specification of left-hand-sides of assignment in unparsed user expressions (especially in dplyr::mutate and dplyr::summarize).

    Example of data.table correct use of := even after loading seplyr:

    library("data.table")
    DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7)
    suppressPackageStartupMessages(library("seplyr"))
    DT[, c := 8]
    DT
    #>    a b c
    #> 1: C 4 8
    #> 2: A 5 8
    #> 3: B 6 8
    #> 4: C 7 8
    

    Example of dplyr use of :=, notice how to the user the effect seems very similar to the seplyr use.

    suppressPackageStartupMessages(library(dplyr))
    data.frame(a = 1:3) %>%
      mutate(b := a + 1)
    #>   a b
    #> 1 1 2
    #> 2 2 3
    #> 3 3 4
    

    Very similar seplyr code:

    suppressPackageStartupMessages(library("seplyr"))
    data.frame(a = 1:3) %>%
      mutate_se("b" := "a + 1")
    #>   a b
    #> 1 1 2
    #> 2 2 3
    #> 3 3 4
    

    Now, finally what is seplyr::`:=`? It is so short I think it can be considered elegant:

    suppressPackageStartupMessages(library("seplyr"))
    print(`:=`)
    #> function(names, values) {
    #>   names(values) <- names
    #>   values
    #> }
    #> <bytecode: 0x7f9aedced8d0>
    #> <environment: namespace:seplyr>
    

    (More documentation in help(":=", package="seplyr").)

    The above is something I am really trying to strive for in my packages: working with R (instead of overriding or replacing large swaths of it). The wrapr dot-pipe operator is similarly concise (please see help("%.>%", package="wrapr") for details).

  2. Developers were very happy with base R and plyr. No one wanted the Jekyll/Hyde NSE/SE complication. Now with dplyr you have a choice of not using it, but the NSE/SE confusion is going to be added to ggplot and others. Are you going to make seplyr-like variations for those packages as well? I doubt that.

    NSE/SE is going to make ggplot a lot more difficult to use and understand.

    1. Obviously I am not part of the tidyverse team and am not privy to their plans.

      But that being said.

      I think adapting ggplot2 to rlang/tidyeval is not going to look as good as the dplyr adaption from the user point of view. There are often a lot more moving parts in a ggplot2 plot (multiple data.frames, aesthetics, statistics, facets, colors, groups, legends, keys, and compositing by name) than in a typical dplyr pipeline (which tends to be more a linear structure with data entering at one end). I am in fact worried it will break the user facing plotting interface and further worried that the useful aes_string() method will be deprecated to force users to try the new interface (remember, the dplyr::"underscore" methods ended up deprecated).

      I don’t anticipate a very graceful way to re-adapt what I assume is coming back to standard interfaces. I expect in production code I will pretend I have known column names either by wrapping everything in one giant wrapr::let() block or by using transient column re-namings via replyr::replyr_apply_f_mapped() (though this requires separate control of the axes names).

      It is kind of too bad this is the next announced ggplot2 update. ggplot2 is ahead on notation, but bokeh, plotly, and others are getting ahead on a lot of important rendering and interaction features (i.e. plots are not all for the programmer).

  3. I am trying to shrink the visible footprint of seplyr. To that end the development version:

    • No longer loads dplyr into the user namespace (so if you want dplyr names available you must call library("dplyr") directly even if you have already called library("seplyr")).
    • Declares the operator “:=” in the usual S3 manner and now only supplies implementations for a few appropriate types (string vectors, name vectors, and lists). The idea is: if all packages wanting to use := did this then there would be even less of a chance of packages interfering with each other. To see the implementation you now type print(named_map_builder) instead of print(`:=`).
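
    As a rough illustration of what declaring “:=” in the S3 manner can look like, here is a base-R sketch. It is an assumption about the general approach, not the development version's actual code, and it covers only character vectors and lists.

    ```r
    # Generic: dispatch on the class of the left-hand side.
    `:=` <- function(names, values) UseMethod(":=")

    # Method for string vectors: attach the names to the values.
    `:=.character` <- function(names, values) {
      names(values) <- names
      values
    }

    # Method for lists of names: coerce to strings first.
    `:=.list` <- function(names, values) {
      names(values) <- as.character(names)
      values
    }

    m <- c("a", "b") := c("x", "y")
    print(m)

    # No method is supplied for numbers, so this call fails instead of
    # silently doing something unintended.
    res <- tryCatch(1 := 2, error = function(e) "no applicable method")
    print(res)
    ```

    Supplying methods only for selected types leaves other left-hand-side classes free for other packages to define.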
  4. Hi, isn’t the get() function combined with dplyr sufficient to achieve the desired re-usability? Is there another reason why seplyr is necessary?

    1. Thanks for the comment, I appreciate people sharing their views and experience.

      I don’t consider seplyr necessary. What I consider it is: a proof by example that standard (value oriented) interfaces are in fact sufficient for graceful data manipulation. The unstated idea is: it would have been possible to have something like dplyr without a lot of the other issues. I know we don’t, but I wanted to remind people it was possible.

      Or (with a sympathetic eye): suppose instead of translating the pipeline from dplyr notation to seplyr notation (which puts one in the silly “I don’t like the metric system because I have to use math to convert” position that the US is stuck in) one had just written the seplyr pipeline in the first place. My point is: it isn’t too much worse (and dplyr itself needs the := notation if one wants to substitute on left-hand sides of expressions).

      1. My blogging comment system isn’t the best, so it may not be clear if Joe is asking a question of me or a question of Sungmin (I think the admin panel says it is a question to Sungmin). That being said I think we are not all of one opinion on this, which I want to respect.

        1. Hi John,

          Yes, it is a question for Sungmin’s observation but you can also expand on it. I do use `wrapr::let()` quite often but it’d be nice to see an application with `get()` for comparison purposes.

          1. My group’s preference remains dplyr with wrapr::let() doing all of the re-mapping (as shown here).

            I don’t quite see how base::get() works in this context and am beginning to wonder if it might be a typo (for wrapr::let()). If not I’d love to see an example, as it would be a neat new idea. If so then it sounds like all three of us may be closer in opinion than any of us anticipated.

  5. It isn’t so much that dplyr can’t perform all of these tasks with rlang/tidyeval and the dplyr::*_at() methods, it is just that each one works slightly differently (for instance dplyr::rename_at()) and they are not strict (they look for values in more places than I would like, hiding potential errors, which I will indicate below). I feel part of the issue is rlang/tidyeval as described is a different creature than rlang/tidyeval as implemented (so it is easy for the described version to appear better).

    Here is an example closer to my point where it isn’t clear that the seplyr version is really worse than the tidyverse version.

    suppressPackageStartupMessages(library("dplyr"))
    
    const_col_name = "const_column"
    
    # possibly preferred dplyr/rlang/tidyeval notation
    data.frame(x = 1:3) %>%
      mutate(!!rlang::sym(const_col_name) := 1)
    #>   x const_column
    #> 1 1            1
    #> 2 2            1
    #> 3 3            1
    
    # also works
    data.frame(x = 1:3) %>%
      mutate(const_col_name := 1)
    #>   x const_col_name
    #> 1 1              1
    #> 2 2              1
    #> 3 3              1
    
    # but risky, uses variable name instead of erroring-out
    data.frame(x = 1:3) %>%
      mutate(const_col_name_misspelled := 1)
    #>   x const_col_name_misspelled
    #> 1 1                         1
    #> 2 2                         1
    #> 3 3                         1
    
    # help(mutate_at) says:
    #  mutate_at(.tbl, .vars, .funs, ..., .cols = NULL)
    #  which to my mind gives no clear advice on
    #  how to perform the above
    
    library("seplyr")
    
    # preferred seplyr notation
    # use '"1"' to write a string
    data.frame(x = 1:3) %>%
      mutate_se(const_col_name := '1') %>% str()
    #> 'data.frame':    3 obs. of  2 variables:
    #>  $ x           : int  1 2 3
    #>  $ const_column: num  1 1 1
    
    # safe catch of error
    data.frame(x = 1:3) %>%
      mutate_se(const_col_name_misspelled := '1')
    #> Error in `:=`(const_col_name_misspelled, "1"): object 'const_col_name_misspelled' not found
    
