More on preparing data

Categories: Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

The Microsoft Data Science User Group just sponsored Dr. Nina Zumel’s presentation “Preparing Data for Analysis Using R”. Microsoft saw Win-Vector LLC’s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.


We feel Nina really hit the ball out of the park with over 400 new live viewers. Read more for links to even more free materials!

Bend or break: strings in R

Categories: Programming

A common complaint from new users of R is: the string processing notation is ugly.

  • Using paste(,,sep='') to concatenate strings seems clumsy.
  • You are never sure which regular expression dialect grep()/gsub() are really using.
  • Remembering the difference between length() and nchar() is initially difficult.
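The three complaints above can be made concrete with a few lines of base R. This is only a minimal sketch; the variable names are made up for illustration.

```r
first <- "data"
second <- "science"

# paste() inserts a space by default; sep = "" or paste0() removes it.
paste(first, second)            # "data science"
paste(first, second, sep = "")  # "datascience"
paste0(first, second)           # "datascience"

# length() counts vector elements; nchar() counts characters per element.
words <- c("bend", "or", "break")
length(words)  # 3
nchar(words)   # 4 2 5

# The regular expression dialect depends on the arguments: the default is
# POSIX extended, perl = TRUE switches to Perl-compatible expressions, and
# fixed = TRUE treats the pattern as a literal string.
gsub("a.c", "X", "a.c abc")                # "X X"
gsub("a.c", "X", "a.c abc", fixed = TRUE)  # "X abc"
gsub("(?<=a)b", "X", "abc", perl = TRUE)   # "aXc" (lookbehind needs perl = TRUE)
```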

As always, things can be improved by using additional libraries (for example: stringr). But this always evokes Python’s “There should be one– and preferably only one –obvious way to do it” or what I call the “rule 42” problem: “if it is the right way, why isn’t it the first way?”

From “Alice’s Adventures in Wonderland”:

Alice’s Adventures in Wonderland, drawn by John Tenniel.

At this moment the King, who had been for some time busily writing in his note-book, cackled out `Silence!' and read out from his book, `Rule Forty-two. All persons more than a mile high to leave the court.'

Everybody looked at Alice.

`I'm not a mile high,' said Alice.

`You are,' said the King.

`Nearly two miles high,' added the Queen.

`Well, I shan't go, at any rate,' said Alice: `besides, that's not a regular rule: you invented it just now.'

`It's the oldest rule in the book,' said the King.

`Then it ought to be Number One,' said Alice.

We will write a bit on evil ways that you should never actually use to try and weasel around the string concatenation notation issue in R.
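To give a flavor of what such a trick might look like, here is a hedged sketch; it is not necessarily the approach taken in the full article. The operator name `%+%` is our own choice, and masking `+` is shown only as an example of what not to do.

```r
# A user-defined infix operator for concatenation (the name %+% is hypothetical).
`%+%` <- function(a, b) paste0(a, b)
"bend" %+% " or " %+% "break"  # "bend or break"

# The "evil" variant: mask + so that it concatenates character arguments.
# Every later use of + in this environment now goes through this function.
`+` <- function(e1, e2) {
  if (missing(e2)) return(base::`+`(e1))  # keep unary plus working
  if (is.character(e1) || is.character(e2)) {
    return(paste0(e1, e2))
  }
  base::`+`(e1, e2)
}
"a" + "b"  # "ab"
1 + 2      # 3

# Remove the override so ordinary arithmetic is untouched afterwards.
rm(list = "+")
```

The infix operator is fairly harmless. Masking `+` changes behavior for all code that runs in the same environment, which is why it belongs in the "never actually use" category.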

Win-Vector video courses: price/status changes

Categories: Administrativia, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Win-Vector LLC has been offering a couple of online video courses on the topics of data science and A/B testing (both using R). These are high quality courses and well worth the money and time needed to work through them closely (with all materials distributed on GitHub).

Our current distributor is Udemy, which has just announced a unilateral change in pricing policy (March 2, 2016). This note is about the current status of these courses.

Reading and writing proofs

Categories: Expository Writing, Mathematics

In my recent article on optimizing set diversity I mentioned that the primary abstraction was “diminishing returns,” which is formalized by the theory of monotone submodular functions (though I did call out some of my own work which used a different abstraction). A proof that appears again and again in the literature shows that, when maximizing a monotone submodular function, the greedy algorithm run for k steps picks a set that scores at least a 1-1/e fraction of the unknown optimal pick (that is, it picks up at least 63% of the possible value). This is significant, because naive optimization may only pick a set of value 1/k of the value of the optimal selection.
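For readers who have not seen the greedy algorithm written down, here is a minimal sketch in R. It uses set coverage (the number of distinct elements covered) as a stand-in monotone submodular function; the function names and the toy data are our own assumptions, not taken from the article.

```r
# Coverage: number of distinct elements covered by the chosen sets.
# This is a standard example of a monotone submodular function.
coverage <- function(sets, chosen) {
  as.numeric(length(unique(unlist(sets[chosen]))))
}

# Greedy selection: at each of k steps add the set with the largest
# marginal gain in f.
greedy_max <- function(sets, k, f = coverage) {
  chosen <- integer(0)
  for (step in seq_len(k)) {
    candidates <- setdiff(seq_along(sets), chosen)
    if (length(candidates) == 0) break
    gains <- vapply(
      candidates,
      function(i) f(sets, c(chosen, i)) - f(sets, chosen),
      numeric(1)
    )
    chosen <- c(chosen, candidates[which.max(gains)])
  }
  chosen
}

# Toy data (made up for illustration).
sets <- list(c(1, 2, 3, 4), c(1, 2, 5), c(3, 4, 6), c(5, 6, 7, 8))
k <- 2
picked <- greedy_max(sets, k)
greedy_value <- coverage(sets, picked)

# Brute-force optimum over all size-k selections, feasible only for tiny inputs.
best_value <- max(apply(combn(length(sets), k), 2,
                        function(ch) coverage(sets, ch)))

# The bound discussed above: greedy achieves at least (1 - 1/e) of the optimum.
greedy_value >= (1 - exp(-1)) * best_value
```

The brute-force comparison is only there to check the bound on a small instance; in realistic problems the optimum is unknown, which is exactly why the guarantee matters.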

The proof that the greedy algorithm does well in maximizing monotone increasing submodular functions is clever and a very good opportunity to teach about reading and writing mathematical proofs. The point is that one needs an active reading style, because most of what is crucial to a proof isn’t written, and that which is written in a proof can’t all be pivotal (else proofs would be a lot more fragile than they actually are).

Uwe Kils “Iceberg”

In this article I am attempting to reproduce some fraction of the insight found in: Polya “How to Solve It” (1945) and Doron Zeilberger “The Method of Undetermined Generalization and Specialization Illustrated with Fred Galvin’s Amazing Proof of the Dinitz Conjecture” (1994).

So I repeat the proof here (with some annotations and commentary).

More Shiny user showcase demonstrations

Categories: Administrativia, data science, Programming, Statistics

We at Win-Vector LLC are very proud to announce that RStudio just inducted two more of our demonstration Shiny applications into their Shiny User Showcase gallery.

What does a “slow news day” look like at Win-Vector LLC?

Categories: Administrativia

We at Win-Vector LLC have been writing this blog for almost 9 years. In that time we have accumulated a lot of what we feel is very good writing on data science topics (277 posts and 687 contributed comments). Below is what a “slow news day” looks like on the WordPress pages viewed statistics summary. That is: the list of what was read yesterday from our site.

Finding the K in K-means by Parametric Bootstrap

Categories: data science, Exciting Techniques, Expository Writing, Mathematics, Statistics

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R.

We also came upon another cool approach, in the mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of Gaussians) to your data, it helps to know how many components are in your mixture. The boot.comp function estimates the number of components (let’s call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.

You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.
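To make the idea concrete, here is a rough sketch of a parametric bootstrap for k-means in base R. It is not the implementation from the full article. It assumes spherical Gaussian clusters, uses the ratio of total within-cluster sum of squares for k+1 versus k clusters as the test statistic, and all function names and the synthetic data are our own.

```r
set.seed(2016)

# Test statistic: total within-cluster sum of squares for k+1 clusters
# divided by that for k clusters. Smaller means the extra cluster helped more.
wss_ratio <- function(x, k) {
  fit_k  <- kmeans(x, centers = k, nstart = 10)
  fit_k1 <- kmeans(x, centers = k + 1, nstart = 10)
  fit_k1$tot.withinss / fit_k$tot.withinss
}

# Simulate from the null model: k spherical Gaussian clusters with the
# fitted centers and sizes, and one standard deviation per cluster.
simulate_null <- function(fit, d) {
  parts <- lapply(seq_len(nrow(fit$centers)), function(j) {
    n_j <- fit$size[j]
    sd_j <- sqrt(fit$withinss[j] / (max(n_j - 1, 1) * d))
    matrix(rnorm(n_j * d, mean = rep(fit$centers[j, ], each = n_j), sd = sd_j),
           nrow = n_j, ncol = d)
  })
  do.call(rbind, parts)
}

# Bootstrap p-value for "k clusters" (null) against "k+1 clusters".
boot_pvalue <- function(x, k, n_boot = 50) {
  x <- as.matrix(x)
  observed <- wss_ratio(x, k)
  fit_k <- kmeans(x, centers = k, nstart = 10)
  null_stats <- replicate(n_boot, wss_ratio(simulate_null(fit_k, ncol(x)), k))
  (1 + sum(null_stats <= observed)) / (n_boot + 1)
}

# Increase k until the null hypothesis of k clusters is no longer rejected.
estimate_k <- function(x, k_max = 6, alpha = 0.05, n_boot = 50) {
  for (k in seq_len(k_max)) {
    if (boot_pvalue(x, k, n_boot) > alpha) return(k)
  }
  k_max
}

# Synthetic data: three well-separated clusters in two dimensions.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(50, centers[j, 1]), rnorm(50, centers[j, 2]))
}))
k_hat <- estimate_k(x)
```

With clusters this well separated one would expect the procedure to reject k = 1 and k = 2. Where it stops after that depends on the random draws, which is part of why this is only a heuristic.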


Databases in containers

Categories: Coding, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Rants

A great number of readers reacted very positively to Nina Zumel’s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. In both cases the implementation happened to be supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available even on Hadoop-scale systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OS X).

Containerized DB


The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”

(image credit).


Neglected optimization topic: set diversity

Categories: Applications, data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

The problem

Consider the following problem: for a number of items U = {x_1, … x_n} pick a small set of them X = {x_i1, x_i2, ..., x_ik} such that there is a high probability one of the x in X is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

  • Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
  • Online advertising. The user is presented with a number of advertisements and enticements in the hope that one of them matches the user’s taste.
  • Science. A number of molecules are simultaneously presented to biological assay hoping that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
  • Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
  • Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well. It tries to ensure this through both per-tree variable selection and re-sampling (some of these issues are discussed here).
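A toy calculation shows why diversity matters for the "at least one success" objective stated above. The model here is purely hypothetical: the user has one hidden intent (a topic), and an item succeeds exactly when its topic matches that intent, so items on the same topic succeed or fail together. The topic names and probabilities are made up.

```r
# Probability that at least one item in the set succeeds, under the
# hypothetical "hidden intent" model described above.
p_any_success <- function(item_topics, topic_probs) {
  # The set succeeds if the user's intent is any topic present in the set,
  # so repeated topics add nothing.
  sum(topic_probs[unique(item_topics)])
}

topic_probs <- c(a = 0.5, b = 0.3, c = 0.2)  # made-up intent distribution

redundant <- c("a", "a")  # two items on the single most likely topic
diverse   <- c("a", "b")  # one item on each of the two most likely topics

p_redundant <- p_any_success(redundant, topic_probs)  # 0.5
p_diverse   <- p_any_success(diverse, topic_probs)    # about 0.8
```

Each item in the redundant set is individually at least as attractive as each item in the diverse set, yet the diverse set has the higher chance of containing a success. Scoring items one at a time misses this; scoring the set as a whole does not.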

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).


Minimal spanning trees, the basis of one set diversity metric.


Free video course: applied Bayesian A/B testing in R

Categories: Administrativia, Pragmatic Data Science, Statistics

As a “thank you” to our blog, mailing list, and Twitter followers (@WinVectorLLC) we at Win-Vector LLC have decided to re-release our formerly fee-based A/B testing video course as a free (advertisement supported) video course here on YouTube.


The course emphasizes how to design A/B tests using prior “guesstimates” of effect sizes (often you have these from prior campaigns, or somebody claims an effect size and it is merely your job to confirm it). It is fairly technical, and the emphasis is Bayesian: we are trying to get an actual estimate of the distribution of the unknown true expected payoff rate of the various campaigns (the so-called posteriors). We show how to design and evaluate sales campaigns for a product at two different price points.
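As a small illustration of what "estimating the posterior of the payoff rate" can mean, here is a sketch in base R. It is not the course's code. It assumes a Beta prior on each campaign's conversion rate (uniform by default), and the campaign counts and prices are invented for the example.

```r
set.seed(42)

# Hypothetical campaign results (made-up numbers, for illustration only).
campaign_a <- list(trials = 1000, successes = 20, price = 10)
campaign_b <- list(trials = 1000, successes = 12, price = 15)

# Draw from the posterior of expected payoff per customer: a Beta posterior
# on the conversion rate (Beta(prior_a, prior_b) prior), scaled by the price.
posterior_payoff <- function(campaign, n_draws = 10000,
                             prior_a = 1, prior_b = 1) {
  rate <- rbeta(n_draws,
                prior_a + campaign$successes,
                prior_b + campaign$trials - campaign$successes)
  rate * campaign$price
}

payoff_a <- posterior_payoff(campaign_a)
payoff_b <- posterior_payoff(campaign_b)

# Posterior probability that campaign A has the higher expected payoff.
prob_a_better <- mean(payoff_a > payoff_b)

# 95% credible interval for each campaign's expected payoff per customer.
interval_a <- quantile(payoff_a, c(0.025, 0.975))
interval_b <- quantile(payoff_b, c(0.025, 0.975))
```

Prior guesstimates of effect size would enter through `prior_a` and `prior_b`: a more informative prior behaves like additional pseudo-observations from earlier campaigns.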

The solution is coded in R and Nina Zumel has contributed an updated Shiny user interface demonstrating the technique (for more on Shiny, please see here). The code for the calculation methods and older Shiny app are shared here.