In this note we will show how to speed up work in R by partitioning data and using process-level parallelization. We will show the technique with three different R packages, including dplyr. The methods shown will also work with base-R and other packages.
For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with
The user specifies the partitioning by preparing a grouping column that tells the system which sets of rows must be kept together for a correct calculation. We will try to demonstrate everything with simple code examples and minimal discussion.
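To make the idea concrete, here is a minimal sketch of partition-then-parallelize using only base R's parallel package. It is not the wrapr or rqdatatable implementation; the data, the grouping column name `grp`, and the helper `add_group_share()` are all hypothetical and chosen only for illustration.

```r
library(parallel)

# Hypothetical example data: `grp` is the user-prepared grouping column.
d <- data.frame(
  grp = rep(c("a", "b", "c", "d"), each = 3),
  x = 1:12
)

# Hypothetical per-partition work: each row's share of its group total.
# This is only correct if all rows of a group are processed together.
add_group_share <- function(part) {
  part$share <- part$x / sum(part$x)
  part
}

# Partition the rows by the grouping column.
parts <- split(d, d$grp)

# Distribute the partitions across two worker processes.
cl <- makeCluster(2)
res_list <- parLapply(cl, parts, add_group_share)
stopCluster(cl)

# Reassemble the per-partition results into one data.frame.
res <- do.call(rbind, res_list)
rownames(res) <- NULL
```

Because no group is ever split between workers, the per-group totals are computed correctly even though each worker sees only part of the data.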
seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.
wrapr includes a lot of tools for writing better R code:
%.>% (dot arrow pipe)
data.frame builders and formatters
:= (named map builder)
%.|% (reduce/expand args) NEW!
DebugFnW() (function debug wrappers)
λ() (anonymous function builder)
I’ll be writing articles on a number of the new capabilities. For now I will just leave you with the nifty coalesce operator notation.
R Tip: use isTRUE()
A lot of R functions are type unstable, which means they return different types or classes depending on details of their values.
For example, consider all.equal(): it returns the logical value TRUE when the items being compared are equal:
all.equal(1:3, c(1, 2, 3)) # [1] TRUE
However, when the items being compared are not equal, all.equal() instead returns a message:
all.equal(1:3, c(1, 2.5, 3)) # [1] "Mean relative difference: 0.25"
This is inconvenient when functions such as all.equal() are used as tests in if() statements and other program control structures.
The saving function is isTRUE(): it returns TRUE if its argument value is equivalent to TRUE, and returns FALSE otherwise. This makes R programming much easier.
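As a minimal sketch of the problem and the fix, the following base R snippet wraps the all.equal() comparison from above in isTRUE() so it can safely drive an if() statement. The variable names `a`, `b`, `cmp`, and `status` are illustrative only.

```r
a <- 1:3
b <- c(1, 2.5, 3)

# all.equal() returns a character message here, not FALSE;
# passing that message directly to if() would signal an error.
cmp <- all.equal(a, b)
is.character(cmp)

# isTRUE() maps anything that is not a single TRUE to FALSE,
# so the test is always a usable logical value.
if (isTRUE(all.equal(a, b))) {
  status <- "equal"
} else {
  status <- "different"
}
```

The same pattern applies to any function that may return either TRUE or a description of a failure.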
Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. We are excited to announce that WVPlots is now at version 1.0.0 on CRAN!
(rquery at BARUG, photo credit: Timothy Liu)
I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark).
rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).
If your group is in the San Francisco Bay Area and using R to work with a SQL-accessible data source, please reach out to me at firstname.lastname@example.org. I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.
R has a lot of under-appreciated super powerful functions. I list a few of our favorites below.
Atlas, carrying the sky. Royal Palace (Paleis op de Dam), Amsterdam.
Photo: Dominik Bartsch, CC some rights reserved.