
seplyr 0.5.8 Now Available on CRAN

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.

seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.

Tools such as seplyr, wrapr or rlang are needed when you (the data scientist temporarily working on a programming sub-task) do not know the names of the columns you want your code to be working with. These are situations where you expect the column names to be made available later, in additional variables or parameters.

For example: suppose we have the following data, where each of two rows (identified by the “id” column) carries two measurements (held in the columns “measurement1” and “measurement2”).

library("wrapr")

d <- build_frame(
   'id', 'measurement1', 'measurement2' |
   1   , 'a'           , 10             |
   2   , 'b'           , 20             )

print(d)

#   id measurement1 measurement2
# 1  1            a           10
# 2  2            b           20

Further suppose we wished to have each measurement in its own row (which is often required, such as when using the ggplot2 package to produce plots). In this case we need a tool to convert the data format. If we are doing this as part of an ad-hoc analysis (i.e. we can look at the data and find the column names at the time of coding) we can use tidyr to perform the conversion:

library("tidyr")

gather(d,
       key = value_came_from_column,
       value = value_was,
       measurement1, measurement2)

#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20

Notice, however, that all column names are specified in gather() without quotes. The names are taken from unexecuted versions of the actual source code of the arguments to gather(). This is somewhat convenient for the analyst (they can skip writing a few quote marks), but it is a severe limitation for the script writer or programmer (who will have trouble taking the names of columns from other sources, such as variables or parameters).
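The capture mechanism can be seen in miniature with base R alone. The following is a minimal sketch, not tidyr's actual implementation: arg_source() is a hypothetical helper that uses substitute() to report the source text of its argument rather than the argument's value.

```r
# Hypothetical helper (not part of tidyr or seplyr): reports the
# source text of its argument without ever evaluating it.
arg_source <- function(x) {
  deparse(substitute(x))
}

# A bare name typed at the call site is captured as written,
# even though no object called measurement1 exists here.
typed_name <- arg_source(measurement1)

# A variable holding a column name is captured the same way:
# the function sees the variable's name, not the string it holds.
column_name <- "measurement1"
from_variable <- arg_source(column_name)

print(typed_name)
print(from_variable)
```

Because such a function sees “column_name” rather than “measurement1”, a name stored in a variable needs extra notation before a quote-free interface can use it.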

seplyr now supplies a standard value oriented interface for gather(). With seplyr we can write code such as the following:

library("seplyr")

gather_se(d,
  key = "value_came_from_column",
  value = "value_was",
  columns = c("measurement1", "measurement2"))
  
#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20

This sort of interface is handy when the names of the columns are coming from elsewhere, in variables. Here is an example of that situation:

# pretend these assignments are done elsewhere
# by somebody else
key_col_name <- "value_came_from_column"
value_col_name <- "value_was"
value_columns <- c("measurement1", "measurement2")

# we can use the above values with
# code such as this
gather_se(d,
          key = key_col_name,
          value = value_col_name,
          columns = value_columns)

#   id value_came_from_column value_was
# 1  1           measurement1         a
# 2  2           measurement1         b
# 3  1           measurement2        10
# 4  2           measurement2        20

There are ways to use gather() with “to be named later” column names directly, but they are not simple, as they needlessly force the user to master a number of internal implementation details of rlang and dplyr. From the documentation and “help(gather)” we can deduce at least three related “pure tidyeval/rlang” solutions for programming over gather():

# possibly the solution hinted at in help(gather)
gather(d,
       key = !!key_col_name,
       value = !!value_col_name,
       dplyr::one_of(value_columns))

# concise rlang solution
gather(d,
       key = !!key_col_name,
       value = !!value_col_name,
       !!!value_columns)

# fully qualified rlang solution
gather(d,
       key = !!rlang::sym(key_col_name),
       value = !!rlang::sym(value_col_name),
       !!!rlang::syms(value_columns))

In all cases the user must prepare and convert values for use. Really this is showing gather() does not conveniently expect parametric columns (column names supplied by variables or parameters), but will accept a work-around if the user re-codes column names in some way (some combination of quoting and de-quoting). With “gather_se()” the tool expects to take values and the user does not have to make special arrangements (or remember special notation) to supply them.
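To make “expects to take values” concrete, here is a minimal base R sketch of a value oriented interface. It is an illustration only, not how seplyr implements gather_se(): gather_values() is a hypothetical helper, and the example data frame is rebuilt with data.frame() so the block stands alone.

```r
# Hypothetical stand-in for a value oriented gather: every column
# name arrives as an ordinary character value.
gather_values <- function(d, key, value, columns) {
  # columns not being gathered are copied into every output block
  id_cols <- setdiff(names(d), columns)
  pieces <- lapply(columns, function(col) {
    piece <- d[, id_cols, drop = FALSE]
    piece[[key]] <- col                       # which column the value came from
    piece[[value]] <- as.character(d[[col]])  # common type for mixed columns
    piece
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

d <- data.frame(id = c(1, 2),
                measurement1 = c("a", "b"),
                measurement2 = c(10, 20),
                stringsAsFactors = FALSE)

# column names supplied by variables, with no quoting or unquoting
key_col_name <- "value_came_from_column"
value_col_name <- "value_was"
value_columns <- c("measurement1", "measurement2")

res <- gather_values(d,
                     key = key_col_name,
                     value = value_col_name,
                     columns = value_columns)
print(res)
```

The caller passes plain strings or variables holding strings, and the function body uses them through ordinary indexing such as d[[col]], so no special notation is needed on either side.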

Our advice for analysts is:

  • If your goal is to work with data: use a combination of wrapr::let() (a preferred user friendly solution that in fact pre-dates rlang) and seplyr (a data-friendly wrapper over dplyr and tidyr functions).
  • If your goal is to write an article about rlang: then use rlang.
  • If you are interested in more advanced data manipulation, please check out our cdata package (a video introduction is available). The cdata equivalent
    of the above transform is:

    library("cdata")
    
    control_table <- build_unpivot_control(
      nameForNewKeyColumn = key_col_name,
      nameForNewValueColumn = value_col_name,
      columnsToTakeFrom = value_columns)
    
    rowrecs_to_blocks(d, 
                      control_table, 
                      columnsToCopy = "id")
    #   id value_came_from_column value_was
    # 1  1           measurement1         a
    # 2  1           measurement2        10
    # 3  2           measurement1         b
    # 4  2           measurement2        20
    
  • If you need high in-memory performance: try data.table.

In addition to wrapping a number of dplyr functions and tidyr::gather()/tidyr::spread(), seplyr 0.5.8 now also wraps tidyr::complete() (thanks to a contribution from Richard Layton).

We hope you try seplyr out both in your work and in your teaching.

4 thoughts on “seplyr 0.5.8 Now Available on CRAN”

  1. Hi John

    Thanks for this!

    Quick question on seplyr – do you have a recommended way to work around the named map builder conflict that occurs with data.table (ie “:=”)? I usually run into this when loading data.table and seplyr together

    1. It is as simple as: whichever package loads second wins. So if you want to use data.table’s :=, load data.table second.

      Also, the effect is not always as bad as the warning. As far as I can tell data.table’s external definition of := is only used to implement an error message telling the user that they should not call := in a non-data.table context (and to have something defined to avoid := appearing to be an undefined operator). Most data.table uses of := are in the data.table package context, so stay with the intended data.table semantics even if seplyr is loaded by the user.

      I have cut down the number of Win-Vector packages that re-export := (to cut down the possible sources of confusion). But the named map builder (:=) is fairly central to seplyr.

  2. In the current version of tidyr, and I guess by extension rlang, you can pass strings to the key and value arguments of tidyr::gather and tidyr::spread. Pretty sure that’s been possible with tidyr::gather for quite some time, since you’re essentially just telling it what to name a new column that doesn’t yet exist. I have put them in quotes for some time now to be more explicit that I’m not referring to a variable/object/symbol. tidyr::spread would probably be a better example since, in the past anyway, you had to pass the name of an existing column name, hence a valid symbol, but that doesn’t seem to be necessary anymore either.

    library("tidyr")

    d <- read.csv(header = TRUE, stringsAsFactors = FALSE, text = "
    id,measurement1,measurement2
    1,a,10
    2,b, 20
    ")

    gather(d, key = "value_came_from_column", value = "value_was",
    measurement1, measurement2)

    # id value_came_from_column value_was
    # 1 1 measurement1 a
    # 2 2 measurement1 b
    # 3 1 measurement2 10
    # 4 2 measurement2 20

    d2 <- gather(d, key = "value_came_from_column", value = "value_was",
    measurement1, measurement2)

    spread(d2, "value_came_from_column", "value_was")

    # id measurement1 measurement2
    # 1 1 a 10
    # 2 2 b 20

    If the strings you want to use are stored in an object/variable, then you do have to use the !! evaluation operator (which I don’t find too cumbersome), and in the case of tidyr::spread, rlang seems to automatically figure out that you want the value not the object name, presumably because the object name is not valid (?; would have to investigate that further to understand for sure why it works in this case).

    key_name <- "value_came_from_column"
    value_name <- "value_was"

    gather(d, key = !!key_name, value = !!value_name, measurement1, measurement2)

    # id value_came_from_column value_was
    # 1 1 measurement1 a
    # 2 2 measurement1 b
    # 3 1 measurement2 10
    # 4 2 measurement2 20

    spread(d2, key_name, value_name)

    # id measurement1 measurement2
    # 1 1 a 10
    # 2 2 b 20

    1. Pretty much all of the examples you give are covered by the three rlang/dplyr solutions I gave in the original article (and close to what one would see from print(seplyr::gather_se) or print(seplyr::spread_se)). Possibly that is a testament to how opaque the rlang notation is.

      And yes, gather/spread will accept quotes in the arguments as in the following.

      
      library("tidyr")
      
      d <- wrapr::build_frame(
        'id', 'measurement1', 'measurement2' |
         1L , 'a'           , 10L            |
         2L , 'b'           , 20L            |
        NA  , ''            , NA             )
      
      gather(d, 
             key = "value_came_from_column", 
             value = "value_was",
             measurement1, measurement2)
      
      #   id value_came_from_column value_was
      # 1  1           measurement1         a
      # 2  2           measurement1         b
      # 3 NA           measurement1          
      # 4  1           measurement2        10
      # 5  2           measurement2        20
      # 6 NA           measurement2      <NA>
      

      But accepting quote marks is (unfortunately) not the same as accepting values, as we see in the following code (erroneous, because I forgot the magic “!!” marks):

      key_name <- "value_came_from_column"
      value_name <- "value_was"
      
      gather(d, 
             key = key_name, 
             value = value_name,
             measurement1, measurement2)
      
      # id     key_name value_name
      # 1  1 measurement1          a
      # 2  2 measurement1          b
      # 3 NA measurement1           
      # 4  1 measurement2         10
      # 5  2 measurement2         20
      # 6 NA measurement2       <NA>
      

      I know you know the difference (i.e., would never make the above mistake), and I know the difference also. However, many users find the !!-notation cumbersome and confusing. I am pretty much done trying to teach rlang to beginning or part time R users, so I need an alternative.

      Finally, please take a look at help(gather) and notice how much time it spends on notational issues (and how little time it now spends on data shaping).


