On Nested Models

Posted on Categories Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 1 Comment on On Nested Models

We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested, but are easy to get wrong.

Continue reading On Nested Models

A bit on the F1 score floor

Posted on Categories Mathematics, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , Leave a comment on A bit on the F1 score floor

At Strata+Hadoop World “R Day” Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called “confusion matrix.”

We repeated our usual admonition to not use “accuracy itself” as a project quality goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn’t what they really want).

One reason not to use accuracy: an example where a classifier that does nothing is “more accurate” than one that actually has some utility. (Figure credit Nina Zumel, slides here)

And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more).

Please read on to see what stood out. Continue reading A bit on the F1 score floor

More on preparing data

Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , Leave a comment on More on preparing data

The Microsoft Data Science User Group just sponsored Dr. Nina Zumel‘s presentation “Preparing Data for Analysis Using R”. Microsoft saw Win-Vector LLC‘s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.


We feel Nina really hit the ball out of the park with over 400 new live viewers. Read more for links to even more free materials! Continue reading More on preparing data

Databases in containers

Posted on Categories Coding, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, RantsTags , , , , 5 Comments on Databases in containers

A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).

Containerized DB


The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”

(image credit).

Continue reading Databases in containers

Neglected optimization topic: set diversity

Posted on Categories Applications, data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 2 Comments on Neglected optimization topic: set diversity

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

The problem

Consider the following problem: for a number of items U = {x_1, … x_n} pick a small set of them X = {x_i1, x_i2, ..., x_ik} such that there is a high probability one of the x in X is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

  • Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
  • Online advertising. The user is presented with a number of advertisements in enticements in the hope that one of them matches user taste.
  • Science. A number of molecules are simultaneously presented to biological assay hoping that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
  • Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
  • Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well. It tries to ensure by both per-tree variable selections and re-sampling (some of these issues discussed here).

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).


Minimal spanning trees, the basis of one set diversity metric.

Continue reading Neglected optimization topic: set diversity

Using Excel versus using R

Posted on Categories Opinion, Programming, Statistics, TutorialsTags , , ,

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures.

Win-Vector’s John Mount shows a simple analysis both in Excel and in R.

A save of the “email” linking to all code and data is here.

The theory is the recipient of the email already had R, RStudio and the required packages installed from previous use. The package install step is only needed once and is:


Then all the steps are (in a more cut/paste friendly format):

cars <- read.table('http://www.win-vector.com/dfiles/car.data.csv',header=TRUE,sep=',')
model <- rpart(rating ~ buying + maint + doors + persons + lug_boot + safety, data=cars, control=rpart.control(maxdepth=6))

Note, you would only have to install the packages once- not every time you run an analysis (which is why that command was left out).

What was data science before it was called data science?

Posted on Categories Opinion, StatisticsTags , , , 3 Comments on What was data science before it was called data science?

“Data Science” is obviously a trendy term making it way through the hype cycle. Either nobody is good enough to be a data scientist (unicorns) or everybody is too good to be a data scientist (or the truth is somewhere in the middle).


Gartner hype cycle (Wikipedia).

And there is a quarter that grumbles that we are merely talking about statistics under a new name (see here and here).

It has always been the case that advances in data engineering (such as punch cards, or data centers) make analysis practical at new scales (though I still suspect Map/Reduce was a plot designed to trick engineers into being excited about ETL and report generation).


Data Science 1832: Semen Korsakov card.

However, in the 1940s and 1950s the field was called “operations research” (even when performed by statisticians). When you read John F. Magee, (2002) “Operations Research at Arthur D. Little, Inc.: The Early Years”, Operations Research 50(1):149-153 http://dx.doi.org/10.1287/opre. you really come away with the impression you are reading about a study of online advertising performed in the 1940s (okay mail advertising, but mail was “the email of its time”).

In this spirit next week we will write about the sequential analysis solution for A/B-testing, invented in the 1940s by one of the greats of statistics and operations research: Abraham Wald (whom we have written about before).


Abraham Wald

Bitcoin’s status isn’t as simple as ruling if it is more a private token or a public ledger

Posted on Categories Finance, History, OpinionTags

There is a lot of current interest in various “crypto currencies” such as Bitcoin, but that does not mean there have not been previous combined ledger and token recording systems. Others have noticed the relevance of Crawfurd v The Royal Bank (the case where money became money), and we are going to write about this yet again.

Very roughly: a Bitcoin is a cryptographic secret that is considered to have some value. Bitcoins are individual data tokens, and duplication is prevented through a distributed shared ledger (called the blockchain). As interesting as this is, we want to point out notional value existing both in ledgers and as possessed tokens has quite a long precedent.

This helps us remember that important questions about Bitcoins (such as: are they a currency or a commodity?) will be determined by regulators, courts, and legislators. It will not be a simple inevitable consequence of some detail of implementation as this has never been the case for other forms of value (gold, coins, bank notes, stocks certificates, or bank account balances).

Value has often been recorded in combinations of ledgers and tokens, so many of these issues have been seen before (though they have never been as simple as one would hope). Historically the rules that apply to such systems are subtle, and not completely driven by whether the system primarily resides in ledgers or primarily resides portable tokens. So we shouldn’t expect determinations involving Bitcoin to be simple either.

What I would like to do with this note is point out some fun examples and end with the interesting case of Crawfurd v The Royal Bank, as brought up by “goonsack” in 2013. Continue reading Bitcoin’s status isn’t as simple as ruling if it is more a private token or a public ledger

Baking priors

Posted on Categories data science, Expository Writing, Opinion, Rants, Statistics, Statistics To English Translation, TutorialsTags , , , , ,

There remains a bit of a two-way snobbery that Frequentist statistics is what we teach (as so-called objective statistics remain the same no matter who works with them) and Bayesian statistics is what we do (as it tends to directly estimate posterior probabilities we are actually interested in). Nina Zumel hit the nail on the head when she wrote an article explaining the appropriateness of the type of statistical theory depends on the type of question you are trying to answer, not on your personal prejudices.

We will discuss a few more examples that have been in our mind, including one I am calling “baking priors.” This final example will demonstrate some of the advantages of allowing researchers to document their priors.

Thumb IMG 0539 1024
Figure 1: two loaves of bread.
Continue reading Baking priors

Thumbs up for Anaconda

Posted on Categories Computer Science, Exciting Techniques, Opinion, Programming, RantsTags , 2 Comments on Thumbs up for Anaconda

One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).

Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.

Python, being a popular general purpose language does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the system environment imposed pain. With Anaconda trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant. Continue reading Thumbs up for Anaconda