The Win-Vector parallel computing in R series

Categories: Administrativia, Programming, Statistics, Tutorials

With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.

For your convenience here they are in order:

  1. A gentle introduction to parallel computing in R
  2. Running R jobs quickly on many machines
  3. Can you nest parallel operations in R?

Please check them out, and please do Tweet/share these tutorials.

Can you nest parallel operations in R?

Categories: Programming, Tutorials

Parallel programming is a technique to decrease how long a task takes by performing more parts of it at the same time (using additional resources). When we teach parallel programming in R we start with the basic use of parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are really being provided by their tools and to test edge cases early.
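As a minimal sketch of what "basic use of parallel" can look like (the worker count of 2 and the `square` function are illustrative assumptions, not taken from the course material):

```r
library(parallel)

# Start a small local cluster; 2 workers is an arbitrary choice for illustration.
cl <- makeCluster(2)

# Hypothetical task: square each input on a worker.
square <- function(x) x^2
res <- parLapply(cl, 1:4, square)

# Always release the workers when done.
stopCluster(cl)

unlist(res)
```

The explicit cluster handle `cl` is the point: it makes visible which resources the work is sent to, which is exactly what the wrapped notations hide.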

One question that comes up over and over again is “can you nest parLapply?”

The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for why that is the right answer and how to work around it (simulate a “yes”).

I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest parLapply, but that is a different question from whether one can invent a work-around. For example: a “yes” answer (really meaning there are work-arounds) can be found here. Again, this is a different question from “is there a way to nest foreach loops?” (which is possible through the nesting operator %:%, which presumably handles working around nesting issues in parLapply).
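One generic way to simulate nesting is to flatten the two loops into a single task list and make one parLapply call. The sketch below is only an illustration of that idea, not necessarily the work-around the full article uses; the `grid` of (i, j) pairs and the toy computation are assumptions.

```r
library(parallel)

# Hypothetical nested task: for each outer i and inner j compute i * 10 + j.
grid <- expand.grid(i = 1:3, j = 1:2)

cl <- makeCluster(2)
# One flat parLapply over every (i, j) pair instead of a parLapply inside a parLapply.
# grid is passed explicitly because workers do not see the caller's variables.
flat <- parLapply(cl, seq_len(nrow(grid)),
                  function(k, grid) grid$i[k] * 10 + grid$j[k],
                  grid = grid)
stopCluster(cl)

grid$value <- unlist(flat)
# Regroup by the outer index to recover the nested shape.
nested <- split(grid$value, grid$i)
```

Flattening keeps every worker busy with a single cluster, at the cost of regrouping the results afterwards.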


Free data science video lecture: debugging in R

Categories: Coding, Programming, Tutorials

We are pleased to release a new free data science video lecture: Debugging R code using R, RStudio and wrapper functions. In this 8 minute video we demonstrate the incredible power of R using wrapper functions to catch errors for later reproduction and debugging. If you haven’t tried these techniques this will really improve your debugging game.
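A rough sketch of the wrapper idea, assuming a hypothetical helper `capture_failures` (this is not the code from the video or from WVPlots): the wrapper runs the function and, on error, returns a record of the function and its arguments so the failing call can be replayed later.

```r
# Hypothetical helper: wrap f so that a failure is recorded instead of stopping.
capture_failures <- function(f) {
  function(...) {
    args <- list(...)
    tryCatch(
      do.call(f, args),
      error = function(e) {
        # Keep everything needed to reproduce the failing call later.
        structure(
          list(fn = f, args = args, message = conditionMessage(e)),
          class = "captured_failure"
        )
      }
    )
  }
}

# Illustrative function that fails on negative input.
strict_log <- function(x) {
  if (x < 0) stop("negative input")
  log(x)
}
safe_log <- capture_failures(strict_log)

ok  <- safe_log(1)    # normal result
bad <- safe_log(-1)   # a captured_failure record

# Later, reproduce the error under a debugger with:
# do.call(bad$fn, bad$args)
```

In a longer-running job the record could be written to disk (for example with saveRDS) so the failure survives the session.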

All code and examples can be found here and in WVPlots.

WVPlots: example plots in R using ggplot2

Categories: Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, Tutorials

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions (which we are sharing on Github).

For example, the plot below showing both an observed discrete empirical distribution (as stems) and a matching theoretical distribution (as bars) is a built-in “one liner.”


Please read on for some of the ideas and how to use this package.

Bend or break: strings in R

Categories: Programming

A common complaint from new users of R is: the string processing notation is ugly.

  • Using paste(,,sep='') to concatenate strings seems clumsy.
  • You are never sure which regular expression dialect grep()/gsub() are really using.
  • Remembering the difference between length() and nchar() is initially difficult.
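The three complaints can be seen in a few lines of base R (the sample strings are made up for illustration):

```r
# Concatenation: paste() with sep = '' versus the shorter paste0().
ab1 <- paste('a', 'b', sep = '')
ab2 <- paste0('a', 'b')

# length() counts vector elements; nchar() counts characters in each element.
s <- c('hello', 'hi')
n_elements <- length(s)
n_chars <- nchar(s)

# The regular expression dialect depends on the arguments:
# default is extended regex, fixed = TRUE is a literal match, perl = TRUE is PCRE.
dots_regex <- gsub('.', '-', 'a.b')                      # '.' matches any character
dots_fixed <- gsub('.', '-', 'a.b', fixed = TRUE)        # '.' is a literal dot
dots_perl  <- gsub('(?<=a)\\.', '-', 'a.b', perl = TRUE) # lookbehind needs perl = TRUE
```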

As always, things can be improved by using additional libraries (for example: stringr). But this always evokes Python’s “There should be one-- and preferably only one --obvious way to do it” or what I call the “rule 42” problem: “if it is the right way, why isn’t it the first way?”

From “Alice’s Adventures in Wonderland”:

Alice’s Adventures in Wonderland, drawn by John Tenniel.

At this moment the King, who had been for some time busily writing in his note-book, cackled out `Silence!' and read out from his book, `Rule Forty-two. All persons more than a mile high to leave the court.'

Everybody looked at Alice.

`I'm not a mile high,' said Alice.

`You are,' said the King.

`Nearly two miles high,' added the Queen.

`Well, I shan't go, at any rate,' said Alice: `besides, that's not a regular rule: you invented it just now.'

`It's the oldest rule in the book,' said the King.

`Then it ought to be Number One,' said Alice.

We will write a bit on evil ways that you should never actually use to try and weasel around the string concatenation notation issue in R.

More Shiny user showcase demonstrations

Categories: Administrativia, data science, Programming, Statistics

We at Win-Vector LLC are very proud to announce that RStudio just inducted two more of our demonstration Shiny applications into their Shiny User Showcase gallery.

Running R jobs quickly on many machines

Categories: Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming

As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs on multiple CPUs/cores to running jobs on multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.
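A minimal sketch of the multi-machine idea using the parallel package's socket clusters. The host list is a placeholder assumption: "localhost" twice simply starts two workers on the current machine, and real remote hosts would typically need R installed and ssh access. This is not necessarily the setup the full article describes.

```r
library(parallel)

# Placeholder host list: replace with real machine names or addresses.
# "localhost" twice just starts two workers on this machine.
hosts <- c("localhost", "localhost")
cl <- makePSOCKcluster(hosts)

# The calling code is the same as for a single-machine cluster.
res <- parSapply(cl, 1:4, function(x) x + 1)

stopCluster(cl)
```

Only the cluster construction changes; the code that hands out work does not need to know where the workers live.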

Colossus supercomputer : The Forbin Project


A gentle introduction to parallel computing in R

Categories: Coding, data science, Exciting Techniques, math programming, Programming, Tutorials

Let’s talk about the use and benefits of parallel computation in R.


IBM’s Blue Gene/P massively parallel supercomputer (Wikipedia).

“Parallel computing is a type of computation in which many calculations are carried out simultaneously.”

Wikipedia quoting: Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing

The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.

Typically, to get the computer to work harder, the analyst, programmer, or library designer must themselves work a bit harder to arrange calculations in a parallel-friendly manner. In the best circumstances somebody has already done this for you:

  • Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
  • Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from
  • Parallelization abstraction frameworks such as Thrust/Rth (see here).
  • R application libraries that deal with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)
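As an illustration of the last point, boot only runs in parallel once it is told how. The sketch below assumes the boot package is installed; the data, the `mean_stat` statistic, and the 2-worker cluster are made up for the example.

```r
library(parallel)
library(boot)

set.seed(1)
x <- rnorm(100)

# Statistic in the form boot() expects: data plus a vector of resampled indices.
mean_stat <- function(d, idx) mean(d[idx])

# boot() stays sequential unless a parallel environment is specified.
cl <- makeCluster(2)
b <- boot(data = x, statistic = mean_stat, R = 200,
          parallel = "snow", ncpus = 2, cl = cl)
stopCluster(cl)

dim(b$t)  # one row per bootstrap replicate
```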

In addition to having a task ready to “parallelize” you need a facility willing to work on it in a parallel manner. Examples include:

  • Your own machine. Even a laptop computer usually now has four or more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
  • Graphics processing units (GPUs). Many machines have one or more powerful graphics cards already installed. For some numerical tasks these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
  • Clusters of computers (such as Amazon EC2, Hadoop backends and more).

Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.

In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R.

Using Excel versus using R

Categories: Opinion, Programming, Statistics, Tutorials

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures.

Win-Vector’s John Mount shows a simple analysis both in Excel and in R.

A save of the “email” linking to all code and data is here.

The theory is the recipient of the email already had R, RStudio and the required packages installed from previous use. The package install step is only needed once and is:


Then all the steps are (in a more cut/paste friendly format):

library(rpart)  # provides rpart() and rpart.control()
cars <- read.table('',header=TRUE,sep=',')
model <- rpart(rating ~ buying + maint + doors + persons + lug_boot + safety, data=cars, control=rpart.control(maxdepth=6))

Note, you would only have to install the packages once, not every time you run an analysis (which is why that command was left out).

Some programming language theory in R

Categories: Coding, Computer Science, Expository Writing, Programming, Tutorials

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).

Our reasoning is: if you want to work as a modern data scientist you have to program (this is not optional for reasons of documentation, sharing and scientific repeatability). If you do program you are going to have to eventually think a bit about programming theory (hopefully not too early in your studies, but it will happen). Let’s use R’s powerful programming language (and implementation) to dive into some deep issues in programming language theory:

  • References versus values
  • Function abstraction
  • Equational reasoning
  • Recursion
  • Substitution and evaluation
  • Fixed point theory

To do this we will translate some common ideas from a theory called “the lambda calculus” into R (where we can actually execute them). This translation largely involves changing the word “lambda” to “function” and introducing some parentheses (which I think greatly improve readability; part of the mystery of the lambda calculus is how unreadable its preferred notation actually is).
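To give a flavor of the translation, here is a small sketch. The names `twice`, `add3`, `Y` and `fact` are chosen for this illustration and are not taken from the full article.

```r
# lambda f. lambda x. f (f x)   becomes nested function() definitions.
twice <- function(f) function(x) f(f(x))

add3 <- function(x) x + 3
seven <- twice(add3)(1)

# A fixed-point combinator (the eta-expanded form often called Z):
# it builds recursion without any function referring to itself by name.
Y <- function(f) {
  (function(x) f(function(v) x(x)(v)))(function(x) f(function(v) x(x)(v)))
}

# Factorial written without self-reference; "self" is supplied by Y.
fact <- Y(function(self) function(n) if (n <= 1) 1 else n * self(n - 1))
fact5 <- fact(5)
```

Each application in the lambda calculus turns into an ordinary R call with parentheses, which is most of what the translation amounts to.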

Recursive Opus (on a Hyperbolic disk)