Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , ,

Why to use the replyr R package

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() possibly breaks. nrow() used to return something other than NA, so older work may not be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).


Tron
Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.

The explanation is: tibble::truncate uses nrow()” and “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.

A little digging gets us to this:

Nrow

The above might make sense if tibble and dbplyr were the only users of dim(), ncol() or nrow().

Frankly if I call nrow() I expect to learn the number of rows in a table.

The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol() and sdf_nrow() (instead of tibble adapting). Even if practical (there are already a lot of existing sparklyr analyses), this prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious convenient “no please really calculate the number of rows for me” (other than “d %>% tally %>% pull“, but that turns out to not always work).

I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables ofter deliberately carry explicit meta-data to make it possible for nrow() to be a cheap operation.

Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code many places increases the value of the code (without user facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:

library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'

replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2

spark_disconnect(sc)

Note: the above is only working properly in the development version of replyr, as I only found out about the issue and made the fix recently.

replyr_hasrows() was added as I found in many projects the primary use of nrow() was to determine if there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of dealing with different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of replyr is to provide re-usable work arounds of design choices far away from our influence.

Leave a Reply