Databases in containers

Posted on Categories Coding, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, RantsTags , , , , 4 Comments on Databases in containers

A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).


Containerized DB

NewImage

The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”

(image credit).

Continue reading Databases in containers

Neglected optimization topic: set diversity

Posted on Categories Applications, data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , Leave a comment on Neglected optimization topic: set diversity

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

The problem

Consider the following problem: for a number of items U = {x_1, … x_n} pick a small set of them X = {x_i1, x_i2, ..., x_ik} such that there is a high probability one of the x in X is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

  • Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
  • Online advertising. The user is presented with a number of advertisements in enticements in the hope that one of them matches user taste.
  • Science. A number of molecules are simultaneously presented to biological assay hoping that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
  • Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
  • Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well. It tries to ensure by both per-tree variable selections and re-sampling (some of these issues discussed here).

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).


Rplot01

Minimal spanning trees, the basis of one set diversity metric.

Continue reading Neglected optimization topic: set diversity

Free video course: applied Bayesian A/B testing in R

Posted on Categories Administrativia, Pragmatic Data Science, StatisticsTags , , Leave a comment on Free video course: applied Bayesian A/B testing in R

As a “thank you” to our blog, mailing list, and Twitter followers (@WinVectorLLC) we at Win-Vector LLC have decided to re-release our formerly fee-based A/B testing video course as a free (advertisement supported) video course here on Youtube.


CTvideoCourse

The course emphasizes how to design A/B tests using prior “guestimates” of effect sizes (often you have these from prior campaigns, or somebody claims an effect size and it is merely your job to confirm it). It is fairly technical, and the emphasis is Bayesian- where we are trying to get an actual estimate of the distribution unknown true expected payoff rate of the various campaigns (the so-called posteriors). We show how to design and evaluate a sales campaigns for a product at two different price points.

The solution is coded in R and Nina Zumel has contributed an updated Shiny user interface demonstrating the technique (for more on Shiny, please see here). The code for the calculation methods and older shiny app are shared here. Continue reading Free video course: applied Bayesian A/B testing in R

Running R jobs quickly on many machines

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , 4 Comments on Running R jobs quickly on many machines

As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs multiple CPUs/cores to running jobs multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.


NewImage
Colossus supercomputer : The Forbin Project

Continue reading Running R jobs quickly on many machines

Win-Vector data science mailing list (and a give-away!)

Posted on Categories Administrativia, StatisticsTags 2 Comments on Win-Vector data science mailing list (and a give-away!)

Win-Vector LLC is starting a data science mailing list that we would like you to sign up for. It is going to be a (deliberately infrequent) set of updates including Win-Vector LLC notices, upcoming speaking events, and data science products.


To kick this off we will be awarding 5 free permanent subscriptions to our video course “Introduction to Data Science” to people who join the mailing list in January 2016 (people who have already signed up already eligible!).
(The contest is over, thank you all for entering!).

For more news/announcements please follow us on:

Prepping Data for Analysis using R

Posted on Categories Statistics, TutorialsTags , , Leave a comment on Prepping Data for Analysis using R

Nina and I are proud to share our lecture: “Prepping Data for Analysis using R” from ODSC West 2015.


Nina Zumel and John Mount ODSC WEST 2015

It is about 90 minutes, and covers a lot of the theory behind the vtreat data preparation library.

We also have a Github repository including all the lecture materials here. Continue reading Prepping Data for Analysis using R

A gentle introduction to parallel computing in R

Posted on Categories Coding, data science, Exciting Techniques, math programming, Programming, TutorialsTags , 4 Comments on A gentle introduction to parallel computing in R

Let’s talk about the use and benefits of parallel computation in R.


NewImage

IBM’s Blue Gene/P massively parallel supercomputer (Wikipedia).

Parallel computing is a type of computation in which many calculations are carried out simultaneously.”

Wikipedia quoting: Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing

The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.

Typically to get the computer to work a harder the analyst, programmer, or library designer must themselves work a bit hard to arrange calculations in a parallel friendly manner. In the best circumstances somebody has already done this for you:

  • Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
  • Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from h2o.ai.
  • Parallelization abstraction frameworks such as Thrust/Rth (see here).
  • Using R application libraries that dealt with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)

In addition to having a task ready to “parallelize” you need a facility willing to work on it in a parallel manner. Examples include:

  • Your own machine. Even a laptop computer usually now has four our more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
  • Graphics processing units (GPUs). Many machines have a one or more powerful graphics cards already installed. For some numerical task these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
  • Clusters of computers (such as Amazon ec2, Hadoop backends and more).

Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.

In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. Continue reading A gentle introduction to parallel computing in R

Nina Zumel and John Mount part of R Day at Strata + Hadoop World in San Jose 2016

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags Leave a comment on Nina Zumel and John Mount part of R Day at Strata + Hadoop World in San Jose 2016

Nina Zumel and I are honored to have been invited to be part of Strata + Hadoop World in San Jose 2016 R Day organized by RStudio and O’Reilly. Continue reading Nina Zumel and John Mount part of R Day at Strata + Hadoop World in San Jose 2016

Using Excel versus using R

Posted on Categories Opinion, Programming, Statistics, TutorialsTags , , , Leave a comment on Using Excel versus using R

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures.



Win-Vector’s John Mount shows a simple analysis both in Excel and in R.

A save of the “email” linking to all code and data is here.

The theory is the recipient of the email already had R, RStudio and the required packages installed from previous use. The package install step is only needed once and is:

install.packages(c('rpart','rpart.plot'))

Then all the steps are (in a more cut/paste friendly format):

cars <- read.table('http://www.win-vector.com/dfiles/car.data.csv',header=TRUE,sep=',')
library(rpart)
library(rpart.plot)
model <- rpart(rating ~ buying + maint + doors + persons + lug_boot + safety, data=cars, control=rpart.control(maxdepth=6))
rpart.plot(model,extra=4)
levels(cars$rating)

Note, you would only have to install the packages once- not every time you run an analysis (which is why that command was left out).