A big thank you to Databricks for working with us and sharing:
rqdatatable are new
R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (though a good mental model and up-front error checking) for data wrangling tasks.
Win-Vector LLC‘s John Mount will be speaking on the
rqdatatable packages at the The East Bay R Language Beginners Group Tuesday, August 7, 2018 (Oakland, CA).
In this note we will show how to speed up work in
R by partitioning data and process-level parallelization. We will show the technique with three different
dplyr. The methods shown will also work with base-
R and other packages.
For each of the above packages we speed up work by using
wrapr::execute_parallel which in turn uses
wrapr::partition_tables to partition un-related
data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
rquery is an
R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on
rqdatatable is a new package that supplies a screaming fast implementation of the
rquery system in-memory using the
rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now
rquery is also one of the fastest methods to wrangle data in-memory in
R (thanks to
data.table, via a thin adaption supplied by
rquery at BARUG, photo credit: Timothy Liu)
I am now looking for invitations to give a streamlined version of this talk privately to groups using
R who want to work with
SQL (with databases such as PostgreSQL or big data systems such as Apache Spark).
rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).
If your group is in the San Francisco Bay Area and using
R to work with a
SQL accessible data source, please reach out to me at firstname.lastname@example.org, I would be honored to show your team how to speed up their project and lower development costs with
rquery. If you are a big data vendor and some of your clients use
R, I am especially interested in getting in touch: our system can help
R users start working with your installation.
I have a couple of public appearances coming up soon.
- The East Bay R Language Beginners Group: Preparing Datasets – The Ugly Truth & Some Solutions, Tuesday, May 1, 2018 at Robert Half Technologies, 1999 Harrison Street, Oakland, CA, 94612.
- Official May 2018 BARUG Meeting: rquery: a Query Generator for Working With SQL Data, Tuesday, May 8, 2018 at Intuit, Building 20
2600 Marine Way · Mountain View, CA.
I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts.
John Mount discussing
SQLgeneration at LinkedIn.
If you have a group using
R at database or
Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.
Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.