With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.
For your convenience here they are in order:
- A gentle introduction to parallel computing in R
- Running R jobs quickly on many machines
- Can you nest parallel operations in R?
Please check it out, and please do Tweet/share these tutorials.
Parallel programming is a technique to decrease how long a task takes by performing more parts of it at the same time (using additional resources). When we teach parallel programming in R we start with the basic use of
parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are really being provided by their tools and to test edge cases early.
One question that comes up over and over again is “can you nest
The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for what that is the right answer and how to work around that (simulate a “yes”).
I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest
parLapply, but that is a different issue than can one invent a work-around. For example: a “yes” answer (really meaning there are work-arounds) can be found here. Again this is a different question than “is there a way to nest foreach loops” (which is possible through the nesting operator
%.% which presumably handles working around nesting issues in
Continue reading Can you nest parallel operations in R?
As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs multiple CPUs/cores to running jobs multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.
Colossus supercomputer : The Forbin Project
Continue reading Running R jobs quickly on many machines
Let’s talk about the use and benefits of parallel computation in R.
IBM’s Blue Gene/P massively parallel supercomputer (Wikipedia).
Parallel computing is a type of computation in which many calculations are carried out simultaneously.”
Wikipedia quoting: Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing
The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.
Typically to get the computer to work a harder the analyst, programmer, or library designer must themselves work a bit hard to arrange calculations in a parallel friendly manner. In the best circumstances somebody has already done this for you:
- Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
- Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from h2o.ai.
- Parallelization abstraction frameworks such as Thrust/Rth (see here).
- Using R application libraries that dealt with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)
In addition to having a task ready to “parallelize” you need a facility willing to work on it in a parallel manner. Examples include:
- Your own machine. Even a laptop computer usually now has four our more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
- Graphics processing units (GPUs). Many machines have a one or more powerful graphics cards already installed. For some numerical task these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
- Clusters of computers (such as Amazon ec2, Hadoop backends and more).
Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.
In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. Continue reading A gentle introduction to parallel computing in R