
R Tip: Break up Function Nesting for Legibility

There are a number of easy ways to avoid illegible code nesting problems in R.

In this R tip we will expand upon the above statement with a simple example.

At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.

   head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])

#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25

One popular way to break up nesting is to use magrittr's “%>%” pipe in combination with dplyr transform verbs, as we show below.

library("dplyr")

mtcars                 %>%
  filter(cyl == 8)     %>%
  select(mpg, cyl, wt) %>%
  head

#    mpg cyl   wt
# 1 18.7   8 3.44
# 2 14.3   8 3.57
# 3 16.4   8 4.07
# 4 17.3   8 3.73
# 5 15.2   8 3.78
# 6 10.4   8 5.25

Note: with the dplyr version used here, the above code lost (without warning) the row names that are part of mtcars. We pass over the details of how pipe notation works; it is sufficient to say the notational convention is: each stage is treated approximately as a function call altered to take a new first argument, set to the value of the pipeline up to that point.
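
As a rough illustration of that convention, the sketch below checks that a piped expression and its nested equivalent give the same value. It assumes the magrittr package is installed, and it uses base R's subset() and head() as the stages (an illustrative choice, not the dplyr verbs above). It shows the approximate rewriting rule only, not how magrittr is implemented.

```r
library("magrittr")

# Piped form: each stage receives the value so far as its first argument.
piped <- mtcars %>%
  subset(cyl == 8) %>%
  head(n = 3)

# Nested form: roughly what the pipeline above is treated as.
nested <- head(subset(mtcars, cyl == 8), n = 3)

# Compare the two results.
identical(piped, nested)
```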

Many R users already routinely avoid nested notation problems through a convention I call “name re-use.” Such code looks like the following.

result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)

The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable. I recommend introducing and re-using a result name (in this case “result”), and not re-using the starting data name (in this case “mtcars”). This extra care makes the entire block restartable, which is another benefit when developing and debugging.
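
To make the restartability point concrete, here is a small base R sketch. The name d is a hypothetical working copy that gets overwritten in place, and subset() stands in for the dplyr verbs only to keep the sketch free of package dependencies.

```r
# Re-using a separate result name: mtcars itself is never modified,
# so the block can be re-run from the top at any time.
result <- mtcars
result <- subset(result, cyl == 8)
result <- result[, c("mpg", "cyl", "wt")]
first_run <- result

result <- mtcars                           # restart from the top
result <- subset(result, cyl == 8)
result <- result[, c("mpg", "cyl", "wt")]
identical(first_run, result)

# Re-using the starting name instead (d is a hypothetical working copy).
d <- mtcars
d <- subset(d, cyl == 8)
d <- d[, c("mpg", "wt")]                   # cyl is dropped here
# Re-running the filtering step now needs a column that is gone.
rerun <- try(subset(d, cyl == 8), silent = TRUE)
inherits(rerun, "try-error")
```

Once the starting data has been overwritten, the only way back is to rebuild it, which is exactly the bookkeeping the separate result name avoids.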

I like a variation I call “dot intermediates”, which looks like the code below (notice we are switching back from dplyr verbs to base R operators).

. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)

#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25

The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. The dot intermediates convention is particularly neat when you don’t intend to take the result further into your calculation (such as when you only want to print it) as it does not require you to think up an evocative result name. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar and legible.
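
One detail worth knowing if you adopt this convention: in R, “.” is an ordinary variable name, and names beginning with a dot are left out of ls() by default. The sketch below (base R only) shows the print-only use described above and checks where the intermediate ends up.

```r
# Print-only use: no result name is needed.
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
print(head(.))

# "." is a regular binding in the current environment, but it is hidden
# from the default ls() listing because its name starts with a dot.
"." %in% ls()
"." %in% ls(all.names = TRUE)
```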

Also, contrary to what many repeat, base R is often faster than the dplyr alternative, at least on small data such as this example.

library("dplyr")
library("microbenchmark")
library("ggplot2")

timings <- microbenchmark(
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcars                 %>%
      filter(cyl == 8)     %>%
      select(mpg, cyl, wt) %>%
      nrow
  })

print(timings)

## Unit: microseconds
##   expr      min       lq      mean   median       uq       max neval
##   base  122.948  136.948  167.2253  159.688  179.924   349.328   100
##  dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770   100

autoplot(timings)



Durations for related tasks, smaller is better.

In this case base R is about 15 times faster by mean time (about 11 times by median), possibly due to magrittr overhead and the small size of this example. We also see that, with some care, base R can be quite legible. dplyr is a useful tool and convention; however, it is not the only allowed tool or the only allowed convention.
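
The speed ratio can be read off the printed table. The snippet below simply re-enters the mean and median columns from the microbenchmark output above (the numbers are copied from that run, not re-measured) and divides them.

```r
# Values copied from the microbenchmark output above (microseconds).
timing_summary <- data.frame(
  expr   = c("base", "dplyr"),
  mean   = c(167.2253, 2537.2912),
  median = c(159.688, 1699.744)
)

# Ratio of dplyr time to base R time, by mean and by median.
ratios <- c(
  mean   = timing_summary$mean[2]   / timing_summary$mean[1],
  median = timing_summary$median[2] / timing_summary$median[1]
)
round(ratios, 1)
#   mean median
#   15.2   10.6
```

The mean is pulled up by a few slow dplyr runs (note the max column), so the median is arguably the steadier summary for this comparison.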

13 thoughts on “R Tip: Break up Function Nesting for Legibility”

  1. Thank you for this series, it is highly educational. Proceeding with the trend of base R alternatives to dplyr, could you do joins via merge next? There are several resources online that discuss merge but I couldn’t find any good comparisons to dplyr’s joins. I’m curious to see your take on it.

    1. Thanks!

      I think two places where dplyr shines are the joins and the grouped summaries. We may touch on both in the future from a SQL point of view, but nothing other than data.table is as convenient and respectful of R types/classes.

  2. At 320000 rows dplyr seems to be at least 2 times faster. So if the size of your data is bigger then would it still make sense to use base R functions that might not scale well?

    do.call("rbind", replicate(10000, mtcars, simplify = FALSE)) -> mtcars
    
    library("dplyr")
    library("microbenchmark")
    library("ggplot2")
    
    timings <- microbenchmark(
      base = {
        . <- mtcars
        . <- subset(., cyl == 8)
        . <- .[, c("mpg", "cyl", "wt")]
        nrow(.)
      },
      dplyr = {
        mtcars                 %>%
          filter(cyl == 8)     %>%
          select(mpg, cyl, wt) %>%
          nrow
      }
    )
    
    
    print(timings)
    
    Unit: milliseconds
      expr       min       lq     mean    median       uq       max neval
      base 19.822568 31.43256 40.37011 33.433080 43.39094 107.66600   100
     dplyr  7.613895  7.96304 18.08439  8.683566 21.03271  86.27648   100
    
    1. First thank you for your interesting comment and point. Also, I admit: I am the one who first brought up dplyr and timings. But trust me, if I don’t bring them up: they will be brought up for me.

      If speed of in-memory operations is the primary concern, I would suggest that is the data range where one might switch over to data.table.

      library("dplyr")
      library("microbenchmark")
      library("ggplot2")
      library("data.table")
      
      mtcarsb <- mtcars[rep(seq_len(nrow(mtcars)), 10000), ]
      mtcarsd <- as.data.table(mtcarsb)
      
      timings <- microbenchmark(
        base = {
          . <- mtcarsb
          . <- subset(., cyl == 8)
          . <- .[, c("mpg", "cyl", "wt")]
          nrow(.)
        },
        dplyr = {
          mtcarsb                 %>%
            filter(cyl == 8)     %>%
            select(mpg, cyl, wt) %>%
            nrow
        },
        data.table = {
          nrow(mtcarsd[cyl==8, c("mpg", "cyl", "wt")])
        }
      )
      
      
      
      print(timings)
      ## Unit: milliseconds
      ##        expr       min       lq     mean    median        uq      max neval
      ##        base 36.613135 49.23939 57.30778 52.667116 56.821898 165.6374   100
      ##       dplyr  9.089105 13.33665 19.61850 14.351512 25.811134 105.6308   100
      ##  data.table  3.505766  4.90764 10.43046  5.971567  6.818303 115.2896   100
      
      autoplot(timings)
      

      I have no problem with people using dplyr. And I have no problem with testing and discussing interesting alternatives (as you initiated here). What I do have a problem with is the subset of dplyr aficionados who criticize any non-use of dplyr (be it base R itself or other packages). Understand: once an R package promoter has successfully argued that R is inadequate, it is plausible that they are chasing users to Python/Pandas/scikit (also good tools), instead of towards the R package of their choosing. Again, I am not saying that is what is happening here, just something that unfortunately informs my writing at this point.

      1. Just a minor remark. Since the blog post is about comparing speed in chaining operations, one should probably use data.table chaining in the benchmark above. Not that it makes much of a difference – as expected, data.table still outperforms dplyr by a huge margin.

        library("dplyr")
        library("microbenchmark")
        library("ggplot2")
        library("data.table")
        
        mtcarsb <- mtcars[rep(seq_len(nrow(mtcars)), 10000), ]
        mtcarsd <- as.data.table(mtcarsb)
        
        timings <- microbenchmark(
          base = {
            . <- mtcarsb
            . <- subset(., cyl == 8)
            . <- .[, c("mpg", "cyl", "wt")]
            nrow(.)
          },
          dplyr = {
            mtcarsb                %>%
              filter(cyl == 8)     %>%
              select(mpg, cyl, wt) %>%
              nrow
          },
         data.table = {
            mtcarsd[cyl==8, ][,
            c("mpg", "cyl", "wt") ][, 
            .N]
          }
        )
        
        autoplot(timings)
        
        print(timings)
        
          Unit: milliseconds
          expr                min       lq     mean   median       uq
          base           22.39194 34.02887 46.84050 39.47613 43.55339
          dplyr          18.20177 19.57842 26.37421 20.93748 23.37392
          data.table     10.17765 10.83761 22.33043 12.31881 24.79852
        
        
        1. Thanks for your point, petrovski. One of data.table's strengths is chaining operations (though in this case the example has no real transformations, so it doesn’t show well).

          The relative timings probably depend a bit on architecture and compiler details. On my Mac Mini the multi-stage change seems to completely abrogate data.table's speed advantage for this (very trivial operations and trivial scale) example.

          However, it does give me a chance to point out that one could consider “][” as data.table's own pipe operator and further re-write the data.table block as:

          mtcarsd[cyl==8,                 ][
                  , c("mpg", "cyl", "wt") ][
                  , .N                    ]
          
    2. It can really depend on what’s being done. With your example of large data, I wasn’t able to get base R to outperform dplyr, even with using more standard notation. However, just a couple days ago, I had to rewrite some tests which heavily relied on dplyr for basic subsetting. They took way too long.

      Most cases were just one filter call followed by a pull. By refactoring the filters into is_foo <- mydata[["column"]] == x and such, the tests went from taking 4 minutes to 1 second. The dataset being tested is almost 3 million rows.

      I’ll add that I agree with John: if you want performance, use data.table. I only avoided it in my case because my colleagues understand base R and dplyr, and I want maintainability over performance.

      1. I have also run into issues where a small variation of a dplyr pipeline causes it to take a minute instead of milliseconds. filter() (especially in the presence of an active grouping) is the most common danger step.

        It should be obvious from my stumbling around that I am not a regular data.table user. The reason is most of my current contracting work has been with data living in Spark/Hadoop or PostgreSQL. So I have been working a lot on database first methodologies.

        You made a very good point on the “land a column of boolean decisions to use later as an index” pattern. That is a very powerful and fast R pattern that I feel is still under-used and under-appreciated. For example: I use it here to invert permutations (combining it with the “write some indices on the left of an assignment” trick). Vector notations are incredibly powerful and can do some things that are not convenient to express otherwise.
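
A minimal base R sketch of both tricks mentioned above. The variable names and the example permutation are made up for illustration, and this is not the code from the linked example.

```r
# Land a column of logical decisions once, then re-use it as an index.
is_v8 <- mtcars[["cyl"]] == 8
v8_weights <- mtcars[["wt"]][is_v8]   # plays the role of filter() then pull()
length(v8_weights)

# Invert a permutation by writing indices on the left of an assignment.
p <- c(3L, 1L, 4L, 2L)                # hypothetical example permutation
p_inv <- integer(length(p))
p_inv[p] <- seq_along(p)
p[p_inv]                              # recovers 1 2 3 4
```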

        1. especially in the presence of an active grouping

          Thank you for pointing this out! This actually was the problem. I threw ungroup at the end of certain pipelines, and now the tests are back down to 1.8 seconds.
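
For readers who have not met this issue, the sketch below shows how a grouping set early in a pipeline stays active until it is removed. It assumes dplyr is installed, the column name mean_mpg is made up for the example, and it only demonstrates the grouping state (it makes no timing claim).

```r
library("dplyr")

# The grouping survives the mutate() step.
grouped <- mtcars %>%
  group_by(cyl) %>%
  mutate(mean_mpg = mean(mpg))

is_grouped_df(grouped)

# Drop the grouping before later filter() steps.
ungrouped <- grouped %>%
  ungroup() %>%
  filter(mpg > mean_mpg)

is_grouped_df(ungrouped)
```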

  3. with data.table it should be done like this:

    mtcarsd[cyl==8, .N],

    much faster, the selection of c("mpg", "cyl", "wt") is needless.

    1. I know selecting columns does not affect row counts. The same possible change applies to all of the examples (not just data.table). This is just a sequence of operations to simulate a small workflow without bringing in actual concerns. So please consider it a notional example. A slightly more realistic example can be found here. Often I include a row-count just to force execution on remote systems that have lazy eval.
