dplyr work is taking what you consider to be a too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try
For some tasks
data.table is routinely faster than alternatives at pretty much all scales (example timings here).
If your project is large (millions of rows, hundreds of columns) you really should rent an an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
In this note we will show how to speed up work in
R by partitioning data and process-level parallelization. We will show the technique with three different
dplyr. The methods shown will also work with base-
R and other packages.
For each of the above packages we speed up work by using
wrapr::execute_parallel which in turn uses
wrapr::partition_tables to partition un-related
data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.
R Tip: be wary of “
The following code example contains an easy error in using the R function
vec1 <- c("a", "b", "c") vec2 <- c("c", "d") unique(vec1, vec2) #  "a" "b" "c"
Notice none of the novel values from
vec2 are present in the result. Our mistake was: we (improperly) tried to use
unique() with multiple value arguments, as one would use
union(). Also notice no error or warning was signaled. We used
unique() incorrectly and nothing pointed this out to us. What compounded our error was
...” function signature feature.
In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.
R Tip: use
A lot of R functions are type unstable, which means they return different types or classes depending on details of their values.
For example consider
all.equal(), it returns the logical value
TRUE when the items being compared are equal:
all.equal(1:3, c(1, 2, 3)) #  TRUE
However, when the items being compared are not equal
all.equal() instead returns a message:
all.equal(1:3, c(1, 2.5, 3)) #  "Mean relative difference: 0.25"
This can be inconvenient in using functions similar to
all.equal() as tests in
if()-statements and other program control structures.
The saving functions is
TRUE if its argument value is equivalent to
TRUE, and returns
R programming much easier.
R has a lot of under-appreciated super powerful functions. I list a few of our favorites below.
Atlas, carrying the sky. Royal Palace (Paleis op de Dam), Amsterdam.
Photo: Dominik Bartsch, CC some rights reserved.