Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, StatisticsTags , , , , , 4 Comments on Data Preparation, Long Form and tl;dr Form

Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.


Photo: NY –, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form. Continue reading Data Preparation, Long Form and tl;dr Form

Posted on Categories Opinion, Programming, Statistics, TutorialsTags , , ,

Using Excel versus using R

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures.

Win-Vector’s John Mount shows a simple analysis both in Excel and in R.

A save of the “email” linking to all code and data is here.

The theory is the recipient of the email already had R, RStudio and the required packages installed from previous use. The package install step is only needed once and is:


Then all the steps are (in a more cut/paste friendly format):

cars <- read.table('',header=TRUE,sep=',')
model <- rpart(rating ~ buying + maint + doors + persons + lug_boot + safety, data=cars, control=rpart.control(maxdepth=6))

Note, you would only have to install the packages once- not every time you run an analysis (which is why that command was left out).

Posted on Categories Opinion, StatisticsTags , , , 3 Comments on What was data science before it was called data science?

What was data science before it was called data science?

“Data Science” is obviously a trendy term making it way through the hype cycle. Either nobody is good enough to be a data scientist (unicorns) or everybody is too good to be a data scientist (or the truth is somewhere in the middle).


Gartner hype cycle (Wikipedia).

And there is a quarter that grumbles that we are merely talking about statistics under a new name (see here and here).

It has always been the case that advances in data engineering (such as punch cards, or data centers) make analysis practical at new scales (though I still suspect Map/Reduce was a plot designed to trick engineers into being excited about ETL and report generation).


Data Science 1832: Semen Korsakov card.

However, in the 1940s and 1950s the field was called “operations research” (even when performed by statisticians). When you read John F. Magee, (2002) “Operations Research at Arthur D. Little, Inc.: The Early Years”, Operations Research 50(1):149-153 you really come away with the impression you are reading about a study of online advertising performed in the 1940s (okay mail advertising, but mail was “the email of its time”).

In this spirit next week we will write about the sequential analysis solution for A/B-testing, invented in the 1940s by one of the greats of statistics and operations research: Abraham Wald (whom we have written about before).


Abraham Wald

Posted on Categories Pragmatic Data Science, Quantitative Finance, RantsTags , , 3 Comments on Who is allowed to call themselves a data scientist?

Who is allowed to call themselves a data scientist?

It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.

Gartner hype cycle (Wikipedia).

Given we agree data science exists, who is allowed to call themselves a data scientist? Continue reading Who is allowed to call themselves a data scientist?

Posted on Categories Administrativia, StatisticsTags , , , , , ,

The Statistics behind “Verification by Multiplicity”

There’s a new post up at the blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope.

We normally don’t write about science here at Win-Vector, but we do sometimes examine the statistics and statistical methods behind scientific announcements and issues. NASA’s new technique is a cute and relatively straightforward (statistically speaking) approach.

From what I understand of the introduction to the paper, there are two ways to determine whether or not a planet candidate is really a planet: the first is to confirm the fact with additional measurements of the target star’s gravitational wobble, or by measurements of the transit times of the apparent planets across the face of the star. Getting sufficient measurements can take time. The other way is to “validate” the planet by showing that it’s highly unlikely that the sighting was a false positive. Specifically, the probability that the signal observed was caused by a planet should be at least 100 times larger than the probability that the signal is a false positive. The validation analysis is a Bayesian approach that considers various mechanisms that produce false positives, determines the probability that these various mechanisms could have produced the signal in question, and compares them to the probability that a planet produced the signal.

The basic idea behind verification by multiplicity is that planets are often clustered in multi-planet star systems, while false positive measurements (mistaken identification of potential planets) occur randomly. Putting this another way: if false positives are random, then they won’t tend to occur together near the same star. So if you observe a star with multiple “planet signals,” it’s unlikely that all the signals are false positives. We can use that observation to quantify how much more likely it is that a star with multiple candidates actually hosts a planet. The resulting probability can be used as an improved prior for the planet model when doing the statistical validation described above.

You can read the rest of the article here.