Running R jobs quickly on many machines

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , 4 Comments on Running R jobs quickly on many machines

As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs multiple CPUs/cores to running jobs multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.

Colossus supercomputer : The Forbin Project

Continue reading Running R jobs quickly on many machines

A gentle introduction to parallel computing in R

Posted on Categories Coding, data science, Exciting Techniques, math programming, Programming, TutorialsTags , 4 Comments on A gentle introduction to parallel computing in R

Let’s talk about the use and benefits of parallel computation in R.


IBM’s Blue Gene/P massively parallel supercomputer (Wikipedia).

Parallel computing is a type of computation in which many calculations are carried out simultaneously.”

Wikipedia quoting: Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing

The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.

Typically to get the computer to work a harder the analyst, programmer, or library designer must themselves work a bit hard to arrange calculations in a parallel friendly manner. In the best circumstances somebody has already done this for you:

  • Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
  • Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from
  • Parallelization abstraction frameworks such as Thrust/Rth (see here).
  • Using R application libraries that dealt with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)

In addition to having a task ready to “parallelize” you need a facility willing to work on it in a parallel manner. Examples include:

  • Your own machine. Even a laptop computer usually now has four our more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
  • Graphics processing units (GPUs). Many machines have a one or more powerful graphics cards already installed. For some numerical task these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
  • Clusters of computers (such as Amazon ec2, Hadoop backends and more).

Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.

In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. Continue reading A gentle introduction to parallel computing in R

Using Excel versus using R

Posted on Categories Opinion, Programming, Statistics, TutorialsTags , , , Leave a comment on Using Excel versus using R

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures.

Win-Vector’s John Mount shows a simple analysis both in Excel and in R.

A save of the “email” linking to all code and data is here.

The theory is the recipient of the email already had R, RStudio and the required packages installed from previous use. The package install step is only needed once and is:


Then all the steps are (in a more cut/paste friendly format):

cars <- read.table('',header=TRUE,sep=',')
model <- rpart(rating ~ buying + maint + doors + persons + lug_boot + safety, data=cars, control=rpart.control(maxdepth=6))

Note, you would only have to install the packages once- not every time you run an analysis (which is why that command was left out).

Some programming language theory in R

Posted on Categories Coding, Computer Science, Expository Writing, Programming, TutorialsTags , , , 1 Comment on Some programming language theory in R

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).

Our reasoning is: if you want to work as a modern data scientist you have to program (this is not optional for reasons of documentation, sharing and scientific repeatability). If you do program you are going to have to eventually think a bit about programming theory (hopefully not too early in your studies, but it will happen). Let’s use R’s powerful programming language (and implementation) to dive into some deep issues in programming language theory:

  • References versus values
  • Function abstraction
  • Equational reasoning
  • Recursion
  • Substitution and evaluation
  • Fixed point theory

To do this we will translate some common ideas from a theory called “the lambda calculus” into R (where we can actually execute them). This translation largely involves changing the word “lambda” to “function” and introducing some parenthesis (which I think greatly improve readability, part of the mystery of the lambda calculus is how unreadable its preferred notation actually is).

Opus hyp
Recursive Opus (on a Hyperbolic disk)
Continue reading Some programming language theory in R

An R function return and assignment puzzle

Posted on Categories Programming, TutorialsTags , , 1 Comment on An R function return and assignment puzzle

Here is an R programming puzzle. What does the following code snippet actually do? And ever harder: what does it mean? (See here for some material on the difference between what code does and what code means.)

f <- function() { x <- 5 }

In R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" the code appears to call the function f() and return nothing (nothing is printed). When teaching I often state that you should explicitly use a non-assignment expression as your return value. You should write code such as the following:

f <- function() { x <- 5; x }
## [1] 5

(We are showing an R output as being prefixed with ##.)

But take a look at the this:

f <- function() { x <- 5 }
## [1] 5

It prints! Read further for what is really going on.

NewImage Continue reading An R function return and assignment puzzle

Don’t use stats::aggregate()

Posted on Categories data science, Pragmatic Data Science, Programming, Statistics, TutorialsTags 7 Comments on Don’t use stats::aggregate()

When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

Read on for our example. Continue reading Don’t use stats::aggregate()

Thumbs up for Anaconda

Posted on Categories Computer Science, Exciting Techniques, Opinion, Programming, RantsTags , 2 Comments on Thumbs up for Anaconda

One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).

Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.

Python, being a popular general purpose language does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the system environment imposed pain. With Anaconda trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant. Continue reading Thumbs up for Anaconda

Wanted: A Perfect Scatterplot (with Marginals)

Posted on Categories Coding, data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , , 9 Comments on Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:


The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

frm = read.csv("tips.csv")

plot_center = ggplot(frm, aes(x=total_bill,y=tip)) + 
  geom_point() +

# default: type="density"
ggMarginal(plot_center, type="histogram")

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.


The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

# our own (very beta) plot package: details later
frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            title="Tips vs. Total Bill")


You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

R in a 64 bit world

Posted on Categories Opinion, ProgrammingTags , , 4 Comments on R in a 64 bit world

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?

IMG 1691

We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is” that we are publishing with cooperation from Revolution Analytics. Continue reading R in a 64 bit world

My favorite R bug

Posted on Categories Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Rants, Statistics, TutorialsTags , 4 Comments on My favorite R bug

In this note am going to recount “my favorite R bug.” It isn’t a bug in R. It is a bug in some code I wrote in R. I call it my favorite bug, as it is easy to commit and (thanks to R’s overly helpful nature) takes longer than it should to find.

H96566k Continue reading My favorite R bug