Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , 4 Comments on Running R jobs quickly on many machines

Running R jobs quickly on many machines

As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs multiple CPUs/cores to running jobs multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.

Colossus supercomputer : The Forbin Project

Continue reading Running R jobs quickly on many machines

Posted on Categories Administrativia, StatisticsTags 2 Comments on Win-Vector data science mailing list (and a give-away!)

Win-Vector data science mailing list (and a give-away!)

Win-Vector LLC is starting a data science mailing list that we would like you to sign up for. It is going to be a (deliberately infrequent) set of updates including Win-Vector LLC notices, upcoming speaking events, and data science products.

To kick this off we will be awarding 5 free permanent subscriptions to our video course “Introduction to Data Science” to people who join the mailing list in January 2016 (people who have already signed up already eligible!).
(The contest is over, thank you all for entering!).

For more news/announcements please follow us on:

Posted on Categories Statistics, TutorialsTags , ,

Prepping Data for Analysis using R

Nina and I are proud to share our lecture: “Prepping Data for Analysis using R” from ODSC West 2015.

Nina Zumel and John Mount ODSC WEST 2015

It is about 90 minutes, and covers a lot of the theory behind the vtreat data preparation library.

We also have a Github repository including all the lecture materials here. Continue reading Prepping Data for Analysis using R

Posted on Categories Coding, data science, Exciting Techniques, math programming, Programming, TutorialsTags , 4 Comments on A gentle introduction to parallel computing in R

A gentle introduction to parallel computing in R

Let’s talk about the use and benefits of parallel computation in R.


IBM’s Blue Gene/P massively parallel supercomputer (Wikipedia).

Parallel computing is a type of computation in which many calculations are carried out simultaneously.”

Wikipedia quoting: Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing

The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.

Typically to get the computer to work a harder the analyst, programmer, or library designer must themselves work a bit hard to arrange calculations in a parallel friendly manner. In the best circumstances somebody has already done this for you:

  • Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
  • Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from
  • Parallelization abstraction frameworks such as Thrust/Rth (see here).
  • Using R application libraries that dealt with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)

In addition to having a task ready to “parallelize” you need a facility willing to work on it in a parallel manner. Examples include:

  • Your own machine. Even a laptop computer usually now has four our more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
  • Graphics processing units (GPUs). Many machines have a one or more powerful graphics cards already installed. For some numerical task these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
  • Clusters of computers (such as Amazon ec2, Hadoop backends and more).

Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.

In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. Continue reading A gentle introduction to parallel computing in R

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags

Nina Zumel and John Mount part of R Day at Strata + Hadoop World in San Jose 2016

Nina Zumel and I are honored to have been invited to be part of Strata + Hadoop World in San Jose 2016 R Day organized by RStudio and O’Reilly. Continue reading Nina Zumel and John Mount part of R Day at Strata + Hadoop World in San Jose 2016

Posted on Categories Opinion, Programming, Statistics, TutorialsTags , , ,

Using Excel versus using R

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures.

Win-Vector’s John Mount shows a simple analysis both in Excel and in R.

A save of the “email” linking to all code and data is here.

The theory is the recipient of the email already had R, RStudio and the required packages installed from previous use. The package install step is only needed once and is:


Then all the steps are (in a more cut/paste friendly format):

cars <- read.table('',header=TRUE,sep=',')
model <- rpart(rating ~ buying + maint + doors + persons + lug_boot + safety, data=cars, control=rpart.control(maxdepth=6))

Note, you would only have to install the packages once- not every time you run an analysis (which is why that command was left out).

Posted on Categories Coding, Computer Science, Expository Writing, Programming, TutorialsTags , , , 1 Comment on Some programming language theory in R

Some programming language theory in R

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).

Our reasoning is: if you want to work as a modern data scientist you have to program (this is not optional for reasons of documentation, sharing and scientific repeatability). If you do program you are going to have to eventually think a bit about programming theory (hopefully not too early in your studies, but it will happen). Let’s use R’s powerful programming language (and implementation) to dive into some deep issues in programming language theory:

  • References versus values
  • Function abstraction
  • Equational reasoning
  • Recursion
  • Substitution and evaluation
  • Fixed point theory

To do this we will translate some common ideas from a theory called “the lambda calculus” into R (where we can actually execute them). This translation largely involves changing the word “lambda” to “function” and introducing some parenthesis (which I think greatly improve readability, part of the mystery of the lambda calculus is how unreadable its preferred notation actually is).

Opus hyp
Recursive Opus (on a Hyperbolic disk)
Continue reading Some programming language theory in R