Posted on Categories Opinion, Programming, TutorialsTags , , , Leave a comment on Timing Column Indexing in R

Timing Column Indexing in R

I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.

Please read on for a brief benchmark comparing these methods/solutions.

Continue reading Timing Column Indexing in R

Posted on Categories Opinion, ProgrammingTags , , , , , , , Leave a comment on How to use rquery with Apache Spark on Databricks

How to use rquery with Apache Spark on Databricks

A big thank you to Databricks for working with us and sharing:

rquery: Practical Big Data Transforms for R-Spark Users
How to use rquery with Apache Spark on Databricks

NewImage

rquery on Databricks is a great data science tool.

Posted on Categories Administrativia, data science, Opinion, Practical Data Science, StatisticsTags , , , ,

John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (though a good mental model and up-front error checking) for data wrangling tasks.


Rquery
Rqdatatable

Win-Vector LLC‘s John Mount will be speaking on the rquery and rqdatatable packages at the The East Bay R Language Beginners Group Tuesday, August 7, 2018 (Oakland, CA).

Continue reading John Mount speaking on rquery and rqdatatable

Posted on Categories data science, ProgrammingTags , , , , , , 11 Comments on Speed up your R Work

Speed up your R Work

Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.

Continue reading Speed up your R Work

Posted on Categories data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , ,

rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Continue reading rqdatatable: rquery Powered by data.table