Free gradient boosting lecture

By: , November 21st, 2015.


We have always regretted that we didn't get to cover gradient boosting in Practical Data Science with R (Manning 2014). To try to make up for that, we are sharing (for free) our GBM lecture from our (paid) video course Introduction to Data Science.

(link, all support material here).

Please help us get the word out by sharing/Tweeting!

Fluid use of data

By: , November 19th, 2015.


Nina Zumel and I recently wrote a few articles and series on best practices for testing models and data:

What stands out in these presentations is that the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and the difficulty of teaching. It is in no way optimal: using slightly more complicated procedures can build better models from a given set of data.
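As a small concrete illustration, here is a minimal base-R sketch of the kind of three-way calibration/train/test split pictured below. The data frame, split fractions, and seed are assumptions for the example; the vtreat package provides its own tooling for this.

    # Sketch of a static calibration/train/test split in base R.
    # The data, split fractions, and seed are illustrative assumptions.
    set.seed(2015)
    d <- data.frame(x = rnorm(1000), y = rnorm(1000))

    group <- sample(c("cal", "train", "test"),
                    size = nrow(d), replace = TRUE,
                    prob = c(0.25, 0.5, 0.25))
    dCal   <- d[group == "cal",   , drop = FALSE]  # design variable treatments here
    dTrain <- d[group == "train", , drop = FALSE]  # fit the model here
    dTest  <- d[group == "test",  , drop = FALSE]  # hold out for final evaluation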

Suggested static calibration/train/test experiment design from the vtreat data treatment library.
Continue reading Fluid use of data

Upcoming Win-Vector Appearances

By: , November 9th, 2015.


We have two public appearances coming up in the next few weeks:

Workshop at ODSC, San Francisco – November 14

Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect of data science, what can go wrong, and how to fix it. This is part of the Open Data Science Conference (ODSC) at the Marriott Waterfront in Burlingame, California, November 14-15. If you are attending this conference, we look forward to seeing you there!

You can find an abstract for the workshop, along with links to software and code you can download ahead of time, here.

An Introduction to Differential Privacy as Applied to Machine Learning: Women in ML/DS – December 2

I (Nina) will give a talk to the Bay Area Women in Machine Learning & Data Science Meetup group, on applying differential privacy for reusable hold-out sets in machine learning. The talk will also cover the use of differential privacy in effects coding (what we’ve been calling “impact coding”) to reduce the bias that can arise from the use of nested models. Information about the talk, and the meetup group, can be found here.

We’re looking forward to these upcoming appearances, and we hope you can make one or both of them.

Our Differential Privacy Mini-series

By: , November 1st, 2015.


We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we’ve tried to touch on the highlights of the papers, and to play around with variations of our own.

Image: Blurry snowflakes stock, by cosmicgallifrey.

  • A Simpler Explanation of Differential Privacy: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in Science (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, Science, vol 349, no. 6248, pp. 636-638, August 2015).

    Note that Cynthia Dwork is one of the inventors of differential privacy, originally used in the analysis of sensitive information.

  • Using differential privacy to reuse training data: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
  • A simple differentially private-ish procedure: The bootstrap as an alternative to Laplace noise for introducing privacy (a small code sketch of the Laplace mechanism follows below).
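To make the last item a bit more concrete, here is a hedged R sketch of the classic Laplace mechanism for an epsilon-differentially private count query. The data, epsilon value, and query are assumptions for illustration only; see the articles and the GitHub code for the actual experiments (which also look at the bootstrap variation).

    # Sketch: an epsilon-differentially private count via the Laplace mechanism.
    # A randomized query is epsilon-differentially private if changing a single
    # row of the data changes the probability of any output by at most a factor
    # of exp(epsilon). A count query has sensitivity 1, so Laplace noise with
    # scale 1/epsilon is enough. Data and epsilon below are illustrative.

    rlaplace <- function(n, scale) {
      # Laplace(0, scale) noise as the difference of two exponentials
      rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
    }

    privateCount <- function(x, epsilon) {
      sum(x) + rlaplace(1, scale = 1 / epsilon)
    }

    set.seed(2015)
    x <- rbinom(100, size = 1, prob = 0.3)  # a 0/1 indicator column
    sum(x)                                  # the true count
    privateCount(x, epsilon = 0.5)          # the noised, privacy-preserving count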

Our R code and experiments are available on GitHub here, so you can try the experiments and some variations yourself.

Image Credit

Don’t use stats::aggregate()

By: , October 31st, 2015.


When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().
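For a taste of the kind of design compromise meant here, consider one commonly cited gotcha (offered as an illustrative assumption, not necessarily the example the article works): the formula interface of stats::aggregate() applies na.omit to the whole frame, so rows and even entire groups can silently disappear.

    # Illustrative gotcha (not necessarily the article's example):
    # aggregate()'s formula interface drops rows with any NA, without warning.
    d <- data.frame(g = c("a", "a", "b", "b"),
                    y = c(1, 2, NA, NA))
    aggregate(y ~ g, data = d, FUN = length)
    #   g y
    # 1 a 2
    # group "b" has vanished entirely

    # tapply() keeps the group, so at least you can see what happened
    tapply(d$y, d$g, FUN = length)
    # a b
    # 2 2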

Read on for our example. Continue reading Don’t use stats::aggregate()

Baking priors

By: , October 13th, 2015.


There remains a bit of a two-way snobbery: Frequentist statistics is what we teach (as so-called objective statistics remain the same no matter who works with them) and Bayesian statistics is what we do (as it tends to directly estimate the posterior probabilities we are actually interested in). Nina Zumel hit the nail on the head when she wrote an article explaining that the appropriate type of statistical theory depends on the type of question you are trying to answer, not on your personal prejudices.

We will discuss a few more examples that have been on our minds, including one I am calling “baking priors.” This final example will demonstrate some of the advantages of allowing researchers to document their priors.

Figure 1: two loaves of bread.
Continue reading Baking priors

A Simpler Explanation of Differential Privacy

By: , October 2nd, 2015.


Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news now, with exciting results from Cynthia Dwork et al. (see the references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork’s recent results can be used to improve the model fitting process.

The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner

Continue reading A Simpler Explanation of Differential Privacy

A deeper theory of testing

By: , September 26th, 2015.


In some of my recent public talks (for example: here and here) I have mentioned a desire for “a deeper theory of fitting and testing.” I thought I would expand on what I meant by this.

In this note I am going to cover a lot of different topics to try to suggest some perspective. I won’t have my usual luxury of fully defining my terms or working concrete examples. Hopefully a number of these ideas (which are related, but don’t seem to synthesize easily) will become the subjects of their own later articles.


The focus of this article is this: the true goal of predictive analytics is always to build a model that works well in production. Training and testing procedures are designed to simulate this unknown future model performance, but they can be expensive and can also fail.

What we want is a good measure of future model performance, and to apply that measure in picking a model without running deep into Goodhart’s law (“When a measure becomes a target, it ceases to be a good measure.”).

Most common training and testing procedures are destructive in the sense that they use up data (data used for one step may not be safely used for another step in an unbiased fashion; an example is excess generalization error). In this note I thought I would expand on ideas for extending statistical efficiency, or getting more out of your training data while avoiding overfitting.

Destructive testing.

I will outline a few variations of model construction and testing techniques that one should keep in mind.

Continue reading A deeper theory of testing

How do you know if your model is going to work? Part 4: Cross-validation techniques

By: , September 21st, 2015.


Authors: John Mount and Nina Zumel.

In this article we conclude our four part series on basic model testing.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this concluding Part 4 of our four part mini-series “How do you know if your model is going to work?” we demonstrate cross-validation techniques.

Previously we worked on:

Cross-validation techniques

Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting the data into train and test sets and re-performing model fitting and model evaluation.

For example: the variation called k-fold cross-validation splits the original data into k roughly equal-sized sets. To score each set we build a model on all the data not in that set and then apply the model to it. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).

Notional 3-fold cross validation (solid arrows are model construction/training, dashed arrows are model evaluation).

This is statistically efficient as each model is trained on a 1-1/k fraction of the data, so for k=20 we are using 95% of the data for training.
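Here is a minimal base-R sketch of the k-fold procedure just described; the linear model, synthetic data, and RMSE score are assumptions for illustration.

    # Sketch of k-fold cross-validation for a linear model (illustrative data).
    set.seed(2015)
    d <- data.frame(x = rnorm(200))
    d$y <- 2 * d$x + rnorm(200)

    k <- 5
    fold <- sample(rep(seq_len(k), length.out = nrow(d)))  # random fold labels
    pred <- numeric(nrow(d))
    for (i in seq_len(k)) {
      inFold <- fold == i
      model <- lm(y ~ x, data = d[!inFold, ])        # fit on the other k-1 folds
      pred[inFold] <- predict(model, newdata = d[inFold, ])
    }
    sqrt(mean((d$y - pred)^2))        # cross-validated estimate of out-of-sample RMSE
    finalModel <- lm(y ~ x, data = d) # the delivered model is re-fit on all the data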

Another variation, called “leave one out” (which is essentially jackknife resampling), is very statistically efficient, as each datum is scored on a unique model built using all the other data. However, it is very computationally inefficient, as you must construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).

Statisticians tend to prefer cross-validation techniques to a test/train split, as cross-validation techniques are more statistically efficient and can give sampling-distribution-style estimates (instead of mere point estimates). However, remember that cross-validation techniques measure facts about the fitting procedure and not about the actual model in hand (so they answer a different question than a test/train split).

That said, there is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and a test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know that their team and procedure tend to produce good models) and employees are essentially Bayesian (they want to know that the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer determines whether you are in a Bayesian or frequentist situation).

Continue reading How do you know if your model is going to work? Part 4: Cross-validation techniques

Bootstrap Evaluation of Clusters

By: , September 4th, 2015.


Illustration from Project Gutenberg

The goal of cluster analysis is to group the observations in the data into clusters such that every point in a cluster is more similar to the other points in its cluster than it is to points in other clusters. This is a method of choice when annotated training data is not readily available. In this article, based on chapter 8 of Practical Data Science with R, the authors discuss one approach to evaluating the clusters discovered by a chosen clustering method. Continue reading Bootstrap Evaluation of Clusters
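For flavor, here is a compact sketch of one simple way to bootstrap-check cluster stability: a Jaccard-style stability score for k-means clusters. This is a simplified stand-in on assumed synthetic data, not necessarily the procedure used in the chapter.

    # Sketch: bootstrap stability of k-means clusters via Jaccard similarity.
    # The synthetic data, k, and number of resamples B are illustrative assumptions.
    set.seed(2015)
    d <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 4), ncol = 2))
    k <- 2
    base <- kmeans(d, centers = k, nstart = 10)$cluster

    jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

    B <- 100
    stability <- matrix(NA_real_, nrow = B, ncol = k)
    for (b in seq_len(B)) {
      idx  <- sample(nrow(d), replace = TRUE)                    # bootstrap resample
      boot <- kmeans(d[idx, ], centers = k, nstart = 10)$cluster # re-cluster it
      for (ci in seq_len(k)) {
        orig <- which(base == ci)
        # best Jaccard match of original cluster ci among the bootstrap clusters
        stability[b, ci] <- max(sapply(seq_len(k), function(cj)
          jaccard(orig, idx[boot == cj])))
      }
    }
    colMeans(stability)  # per-cluster stability; values near 1 suggest real structure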