“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Continue reading Base R can be Fast
I have recently been working on projects using Amazon EC2 (elastic compute cloud), and RStudio Server. I thought I would share some of my working notes.
Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. The idea is: for a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs.
If you are already an Amazon EC2 user with some Unix experience it is very easy to quickly stand up a powerful R environment, which is what I will demonstrate in this note.
Continue reading Setting up RStudio Server quickly on Amazon EC2
I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata.
cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago.
However, cdata is much more than that.
Continue reading Big cdata News
Kudos to Professor Andrew Gelman for telling a great joke at his own expense:
Stupid-ass statisticians don’t know what a goddam confidence interval is.
He brilliantly burlesqued a frustrating common occurrence many people say they “have never seen happen.”
One of the pains of writing about data science is there is a (small but vocal) sub-population of statisticians jump on your first mistake (we all make errors) and then expand it into an essay on how you: known nothing, are stupid, are ignorant, are unqualified, and are evil.
I get it: many people writing about data science do not know enough statistics. However, not every person writing from a data science point of view is statistically ignorant. That is not to say computer science (my original field) doesn’t have similar problems.
Trying to destroy a sweater by pulling on a loose thread in no way establishes that it wasn’t a nice sweater in the first place (or how nice a sweater it would be if the loose thread were fixed).
(BTW: the book in question is in fact excellent. Chapter 12 alone is worth at least ten times the list price of the book.)
We are excited to announce the
rquery is Win-Vector LLC‘s currently in development big data query tool for
rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with
dplyr at big data scale in production).
Continue reading Announcing rquery
I am excited to share a new deep learning model performance trajectory graph.
Here is an example produced based on Keras in R using ggplot2:
Continue reading Plotting Deep Learning Model Performance Trajectories
For some time we have been teaching
R users "when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis."
The idea behind the advice is: working with fewer columns makes for quicker queries.
photo: Jacques Henri Lartigue 1912
The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are "denormalized marts" that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.
Continue reading How to Greatly Speed Up Your Spark Queries
Was enjoying Gabriel’s article Pipes in R Tutorial For Beginners and wanted call attention to a few more pipes in R (not all for beginners).
Continue reading More Pipes in R
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working R and big data I think the seplyr package can be a valuable tool.
Continue reading Getting started with seplyr
In our last article we pointed out a dangerous silent result corruption we have seen when using the
dplyr package with databases.
To systematically avoid this result corruption we suggest breaking up your
dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate it is formed). We consider these to be key and critical precautions to take when using
dplyr with a database.
We would also like to point out we are also distributing free tools to do this automatically, and a worked example of this solution.