Please support Shriekback’s proposed 2018 US tour Kickstarter !!!
In this note I exhibit a troublesome example, and a systematic solution.
We also have two really nifty articles on the theory and methods:
Please give it a try!
This is the material I recently presented at the January 2017 BARUG Meetup.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. The idea is: for a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs.
If you are already an Amazon EC2 user with some Unix experience it is very easy to quickly stand up a powerful R environment, which is what I will demonstrate in this note.
In this note I want to share some exciting and favorable initial rquery benchmark timings.
This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery.
A small effort in making a package “wrapr aware” appears to have a fairly large payoff.
cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago.
However, cdata is much more than that.
Kudos to Professor Andrew Gelman for telling a great joke at his own expense:
He brilliantly burlesqued a frustrating common occurrence many people say they “have never seen happen.”
One of the pains of writing about data science is there is a (small but vocal) sub-population of statisticians jump on your first mistake (we all make errors) and then expand it into an essay on how you: known nothing, are stupid, are ignorant, are unqualified, and are evil.
I get it: many people writing about data science do not know enough statistics. However, not every person writing from a data science point of view is statistically ignorant. That is not to say computer science (my original field) doesn’t have similar problems.
Trying to destroy a sweater by pulling on a loose thread in no way establishes that it wasn’t a nice sweater in the first place (or how nice a sweater it would be if the loose thread were fixed).
(BTW: the book in question is in fact excellent. Chapter 12 alone is worth at least ten times the list price of the book.)