A big thank you to Databricks for working with us and sharing:
rqdatatable are new
R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (though a good mental model and up-front error checking) for data wrangling tasks.
Win-Vector LLC‘s John Mount will be speaking on the
rqdatatable packages at the The East Bay R Language Beginners Group Tuesday, August 7, 2018 (Oakland, CA).
In this note we will show how to speed up work in
R by partitioning data and process-level parallelization. We will show the technique with three different
dplyr. The methods shown will also work with base-
R and other packages.
For each of the above packages we speed up work by using
wrapr::execute_parallel which in turn uses
wrapr::partition_tables to partition un-related
data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.