We have two public appearances coming up in the next few weeks:
Workshop at ODSC, San Francisco – November 14
Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect of data science, what can go wrong, and how to fix it. This is part of the Open Data Science Conference (ODSC) at the Marriott Waterfront in Burlingame, California, November 14-15. If you are attending this conference, we look forward to seeing you there!
You can find an abstract for the workshop, along with links to software and code you can download ahead of time, here.
An Introduction to Differential Privacy as Applied to Machine Learning: Women in ML/DS – December 2
I (Nina) will give a talk to the Bay Area Women in Machine Learning & Data Science Meetup group, on applying differential privacy for reusable hold-out sets in machine learning. The talk will also cover the use of differential privacy in effects coding (what we’ve been calling “impact coding”) to reduce the bias that can arise from the use of nested models. Information about the talk, and the meetup group, can be found here.
We’re looking forward to these upcoming appearances, and we hope you can make one or both of them.
We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we’ve tried to touch on the highlights of the papers, and to play around with variations of our own.
A Simpler Explanation of Differential Privacy: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in Science (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, Science, vol. 349, no. 6248, pp. 636-638, August 2015).
Note that Cynthia Dwork is one of the inventors of differential privacy, which was originally developed for the analysis of sensitive information.
Using differential privacy to reuse training data: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
When working with an analysis system (such as R) there are usually good reasons to prefer functions from the “base” system over functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().
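To make the point concrete, here is a small, hedged illustration; the toy data and the particular wart shown (silent NA handling under the formula interface) are our own choices, and not necessarily the exact issues the full article works through.

# Illustrative data (hypothetical, for demonstration only)
d <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, NA)
)

# Base R: the formula method of stats::aggregate() defaults to
# na.action = na.omit, so the row with the NA is silently dropped
# before summing.
aggregate(value ~ group, data = d, FUN = sum)
#   group value
# 1     a     3
# 2     b     3

# One alternative from an extension package (assumes dplyr is installed);
# here the NA policy is stated explicitly rather than hidden in a default.
library(dplyr)
d %>%
  group_by(group) %>%
  summarize(value = sum(value, na.rm = TRUE))

The extension-package version makes the analyst state the NA policy, which is the kind of design difference we have in mind.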
As readers have surely noticed, the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. That is the only way we can properly treat topics of consequence.
What not everybody may have noticed is that a number of these articles are serialized into series for deeper comprehension. The key series include:
Statistics to English translation.
This series tries to find vibrant applications and explanations of standard good statistical practices, to make them more approachable to the non-statistician.
Statistics as it should be.
This series tries to cover cutting edge machine learning techniques, and then adapt and explain them in traditional statistical terms.
R as it is.
This series tries to teach the statistical programming language R “warts and all” so we can see it as the versatile and powerful data science tool that it is.
Win-Vector LLC’s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression (essentially reusing test data). This allowed her to reproduce results similar to the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique essentially protects and reuses test data, allowing the series of adaptive decisions driving forward step-wise logistic regression to remain valid with respect to unseen future data. Without the differential privacy precaution these steps are not always sufficiently independent of each other to ensure good model generalization performance. Through differential privacy one gets safe reuse of test data across many adaptive queries, yielding more accurate estimates of out-of-sample performance, more robust choices, and ultimately a better model.
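For readers who want the flavor of the mechanism, here is a minimal sketch in the spirit of the paper’s “Thresholdout” procedure. The threshold, noise scale, and use of Laplace noise are illustrative choices of ours, not a faithful transcription of the published algorithm.

# Answer one adaptive query against a protected holdout set.
# trainScore and holdoutScore are empirical means of some statistic
# (e.g., accuracy of a candidate model) on the training and holdout sets.
thresholdoutQuery <- function(trainScore, holdoutScore,
                              threshold = 0.04, sigma = 0.01) {
  # Laplace draws via a difference of exponentials
  rlaplace <- function(n, scale) {
    rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
  }
  if (abs(trainScore - holdoutScore) > threshold + rlaplace(1, sigma)) {
    # train and holdout disagree: release a noised holdout estimate
    holdoutScore + rlaplace(1, sigma)
  } else {
    # they agree: answer with the training estimate,
    # revealing nothing new about the holdout set
    trainScore
  }
}

thresholdoutQuery(trainScore = 0.78, holdoutScore = 0.72)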
In this note I will discuss a specific related application: using differential privacy to reuse training data (or, equivalently, to make training procedures more statistically efficient). I will also demonstrate similar effects using more familiar statistical techniques.
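To give a concrete feel for the training-data side of the idea, here is a minimal sketch: build an impact-style code for a many-leveled categorical variable from Laplace-noised sums and counts, so downstream modeling only ever sees the noised statistics. The synthetic data, the epsilon value, and the exact noise placement are our own illustrative assumptions, not the precise procedure from the article.

set.seed(2015)
# Hypothetical training data: a categorical variable with many levels
d <- data.frame(
  level = sample(letters, 1000, replace = TRUE),
  y     = rnorm(1000)
)

# Laplace draws via a difference of exponentials
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

epsilon <- 1
sums   <- tapply(d$y, d$level, sum)
counts <- tapply(d$y, d$level, length)

# Add Laplace noise to the per-level sufficient statistics before forming
# the encoding (for illustration we treat each row's contribution as
# having sensitivity 1).
noisySums   <- sums   + rlaplace(length(sums),   scale = 1 / epsilon)
noisyCounts <- counts + rlaplace(length(counts), scale = 1 / epsilon)

# Impact-style code: noised per-level mean, centered at the grand mean
impactCode <- noisySums / pmax(noisyCounts, 1) - mean(d$y)
head(impactCode)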
Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning.
In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et al.’s recent results can be used to improve the model fitting process.
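For reference, the definition we will be unpacking is the standard one (the notation below is ours): a randomized algorithm A is epsilon-differentially private if, for every pair of datasets D and D' differing in a single record, and every set S of possible outputs,

\Pr[\,A(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,A(D') \in S\,]

In words: adding or removing any single individual's row can change the probability of any outcome only by a factor of at most e^epsilon.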
The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner
Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.
Notice the Sun in the 4th revolution about the Earth. A very pretty, but not entirely reliable, model.
In this latest “Statistics as it should be” article, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. By that we mean we are going to consider a scoring system’s utility in terms of its service to a negotiable business goal (one of the many ways data science differs from pure machine learning).
In this article we conclude our four-part series on basic model testing.
When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this concluding Part 4 of our four-part mini-series “How do you know if your model is going to work?” we demonstrate cross-validation techniques.
Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting the data into train and test sets and re-performing model fitting and model evaluation.
For example: the variation called k-fold cross-validation splits the original data into k roughly equal sized sets. To score each set we build a model on all the data not in that set and then apply the model to it. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).
Notional 3-fold cross validation (solid arrows are model construction/training, dashed arrows are model evaluation).
This is statistically efficient as each model is trained on a (1 - 1/k) fraction of the data, so for k=20 we are using 95% of the data for training.
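As a concrete, hedged illustration, here is a minimal k-fold cross-validation loop in base R; the synthetic data, the lm() model, and RMSE as the score are our own choices for demonstration.

# Synthetic data for illustration only
set.seed(2015)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

k <- 5
# Randomly assign each row to one of k folds
fold <- sample(rep(seq_len(k), length.out = nrow(d)))

foldRMSE <- vapply(seq_len(k), function(i) {
  train <- d[fold != i, , drop = FALSE]   # fit on all data not in fold i
  test  <- d[fold == i, , drop = FALSE]   # evaluate on fold i
  model <- lm(y ~ x, data = train)
  pred  <- predict(model, newdata = test)
  sqrt(mean((test$y - pred)^2))           # per-fold out-of-sample RMSE
}, numeric(1))

mean(foldRMSE)  # cross-validated estimate of out-of-sample RMSE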
Statisticians tend to prefer cross-validation techniques to test/train split as cross-validation techniques are more statistically efficient and can give sampling distribution style distributional estimates (instead of mere point estimates). However, remember that cross-validation techniques measure facts about the fitting procedure and not about the actual model in hand (so they are answering a different question than test/train split does).
Though there is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know that their team and procedure tend to produce good models) and employees are essentially Bayesian (they want to know that the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer controls whether you are in a Bayesian or frequentist situation).