R has a very powerful array slicing ability that allows for some very slick data processing.
R has a lot of under-appreciated super powerful functions. I list a few of our favorites below.
Atlas, carrying the sky. Royal Palace (Paleis op de Dam), Amsterdam.
Photo: Dominik Bartsch, CC some rights reserved.
Trick question: is a
10,000 cell numeric
data.frame big or small?
In the era of "big data"
10,000 cells is minuscule. Such data could be fit on fewer than
1,000 punched cards (or less than half a box).
The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.
In this note I exhibit a troublesome example, and a systematic solution.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. The idea is: for a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs.
If you are already an Amazon EC2 user with some Unix experience it is very easy to quickly stand up a powerful R environment, which is what I will demonstrate in this note.