Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, TutorialsTags , , , , 7 Comments on R Tip: Give data.table a Try

R Tip: Give data.table a Try

If your R or dplyr work is taking what you consider to be a too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try data.table.

For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).

If your project is large (millions of rows, hundreds of columns) you really should rent an an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.

Posted on Categories data science, Pragmatic Data Science, TutorialsTags , , 2 Comments on Timings of a Grouped Rank Filter Task

Timings of a Grouped Rank Filter Task

Introduction

This note shares an experiment comparing the performance of a number of data processing systems available in R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.

Continue reading Timings of a Grouped Rank Filter Task

Posted on Categories Opinion, ProgrammingTags , , 2 Comments on data.table is Really Good at Sorting

data.table is Really Good at Sorting

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

Present 2

Continue reading data.table is Really Good at Sorting

Posted on Categories data science, ProgrammingTags , , , , , , 11 Comments on Speed up your R Work

Speed up your R Work

Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.

Continue reading Speed up your R Work

Posted on Categories data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , ,

rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Continue reading rqdatatable: rquery Powered by data.table

Posted on Categories Exciting Techniques, Programming, Statistics, TutorialsTags , , , , , 4 Comments on Supercharge your R code with wrapr

Supercharge your R code with wrapr

I would like to demonstrate some helpful wrapr R notation tools that really neaten up your R code.


1968 AMX blown and tubbed e

Img: Christopher Ziemnowicz.

Continue reading Supercharge your R code with wrapr

Posted on Categories Coding, Computer Science, data science, Opinion, Programming, Statistics, TutorialsTags , , , , 14 Comments on Base R can be Fast

Base R can be Fast

“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”

The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.

Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.

Unnamed chunk 2 1 Continue reading Base R can be Fast

Posted on Categories StatisticsTags , ,

Does replyr::let work with data.table?

I’ve been asked if the adapter “let” from our R package replyr works with data.table.

My answer is: it does work. I am not a data.table user so I am not the one to ask if data.table benefits a from a non-standard evaluation to standard evaluation adapter such as replyr::let. Continue reading Does replyr::let work with data.table?