Posted on Categories Administrativia, data science, Opinion, Practical Data Science, StatisticsTags , , , , Leave a comment on John Mount speaking on rquery and rqdatatable

John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (though a good mental model and up-front error checking) for data wrangling tasks.


Rquery
Rqdatatable

Win-Vector LLC‘s John Mount will be speaking on the rquery and rqdatatable packages at the The East Bay R Language Beginners Group Tuesday, August 7, 2018 (Oakland, CA).

Continue reading John Mount speaking on rquery and rqdatatable

Posted on Categories data science, ProgrammingTags , , , , , , 11 Comments on Speed up your R Work

Speed up your R Work

Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.

Continue reading Speed up your R Work

Posted on Categories data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , Leave a comment on rqdatatable: rquery Powered by data.table

rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Continue reading rqdatatable: rquery Powered by data.table

Posted on Categories Administrativia, Opinion, Programming, StatisticsTags , , , , , 2 Comments on rquery: SQL from R

rquery: SQL from R

My BARUG rquery talk went very well, thank you very much to the attendees for being an attentive and generous audience.


IMG 5152

(John teaching rquery at BARUG, photo credit: Timothy Liu)

I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).

If your group is in the San Francisco Bay Area and using R to work with a SQL accessible data source, please reach out to me at jmount@win-vector.com, I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.

Posted on Categories Administrativia, data science, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , Leave a comment on Upcoming speaking engagments

Upcoming speaking engagments

I have a couple of public appearances coming up soon.

Continue reading Upcoming speaking engagments

Posted on Categories Administrativia, StatisticsTags , , , , ,

Speaking on New Tools for R at Big Data Scale

I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts.


IMG 4606
John Mount discussing rquery SQL generation at LinkedIn.

If you have a group using R at database or Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.

Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.

Posted on Categories Coding, Computer Science, data science, Opinion, Programming, Statistics, TutorialsTags , , , , 14 Comments on Base R can be Fast

Base R can be Fast

“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”

The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.

Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.

Unnamed chunk 2 1 Continue reading Base R can be Fast

Posted on Categories Computer Science, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , , , , , 3 Comments on rquery: Fast Data Manipulation in R

rquery: Fast Data Manipulation in R

Win-Vector LLC recently announced the rquery R package, an operator based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.

Continue reading rquery: Fast Data Manipulation in R