rquery and rqdatatable are new R packages for data wrangling, either at scale (in databases, or big data systems such as Apache Spark) or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.
Win-Vector LLC’s John Mount will be speaking on the rquery and rqdatatable packages at the East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).
Continue reading John Mount speaking on rquery and rqdatatable
In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base R and other packages.
For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
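The partition-then-distribute idea can be sketched in base R alone. The following is a minimal sketch that uses the parallel package rather than wrapr::execute_parallel; the data frame d, the grouping column g, and the per-group function f are hypothetical names chosen only for illustration.

```r
library(parallel)

# Hypothetical example data: 'g' is the grouping column that tells us
# which rows must be kept together for a correct calculation.
d <- data.frame(
  g = rep(c("a", "b", "c", "d"), each = 3),
  x = 1:12
)

# Per-group work: attach each group's total to its rows
# (a stand-in for a real, more expensive calculation).
f <- function(di) {
  di$group_total <- sum(di$x)
  di
}

# Step 1: partition the rows by the grouping column.
parts <- split(d, d$g)

# Step 2: send each partition to a worker process.
cl <- makeCluster(2)
res_list <- parLapply(cl, parts, f)
stopCluster(cl)

# Step 3: reassemble the partial results into one data.frame.
res <- do.call(rbind, res_list)
rownames(res) <- NULL
print(res)
```

Because each group is processed on its own, the result should match running f on each group serially; the grouping column is what makes that safe.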
Continue reading Speed up your R Work
rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.
rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaptation supplied by rqdatatable).
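As a rough illustration of what piped Codd-style operators look like, here is a minimal sketch that assumes the rquery and rqdatatable packages are installed from CRAN. The data frame d and its columns are hypothetical, and the operators shown (local_td, select_rows, extend, orderby) are only a small part of the rquery vocabulary.

```r
library(rquery)
library(rqdatatable)

# Hypothetical example data.
d <- data.frame(
  id = c(1, 2, 3, 4),
  score = c(0.2, 0.9, 0.5, 0.7)
)

# Describe the transform as an operator pipeline.
# Nothing is computed yet; 'ops' is only a description of the work.
ops <- local_td(d) %.>%
  select_rows(., score > 0.4) %.>%
  extend(., score2 %:=% score * 2) %.>%
  orderby(., "id")

# With rqdatatable attached, piping data into the pipeline
# runs it in memory using data.table.
res <- d %.>% ops
print(res)
```

Keeping the pipeline separate from the data is the design point: the same operator description can in principle be sent to a database as generated SQL or run in memory.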
Continue reading rqdatatable: rquery Powered by data.table
My rquery talk went very well. Thank you very much to the attendees for being an attentive and generous audience.
(rquery at BARUG, photo credit: Timothy Liu)
I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).
If your group is in the San Francisco Bay Area and using R to work with a SQL-accessible data source, please reach out to me at firstname.lastname@example.org. I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.
R tip: use slices.
R has a very powerful array slicing ability that allows for some very slick data processing.
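A few base R slicing idioms, as a minimal sketch. The vector v and data frame d are hypothetical examples, not taken from the full article.

```r
v <- c(10, 20, 30, 40, 50)

v[2:4]       # positions 2 through 4
v[-1]        # everything except the first element
v[v > 25]    # logical slice: elements greater than 25

# Differences between neighbors, written with two offset slices
# instead of an explicit loop.
diffs <- v[-1] - v[-length(v)]

d <- data.frame(x = 1:5, y = letters[1:5])

# Row slice by condition; drop = FALSE keeps the result a data.frame.
odd_rows <- d[d$x %% 2 == 1, , drop = FALSE]

# Column slice; drop = FALSE again prevents collapse to a bare vector.
x_only <- d[, "x", drop = FALSE]
```

Specifying drop = FALSE is a defensive habit: without it, a single-column slice of a data.frame silently becomes a vector.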
Continue reading R Tip: Use Slices
I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts.
John Mount discussing SQL generation at LinkedIn.
If you have a group using R at database or Spark scale, please reach out (jmount at win-vector.com). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.
Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Continue reading Base R can be Fast
Win-Vector LLC recently announced the rquery R package, an operator-based query generator.
In this note I want to share some exciting and favorable initial rquery benchmark timings.
Continue reading rquery: Fast Data Manipulation in R