Free gradient boosting lecture

By: , November 21st, 2015.


We have always regretted that we didn’t get to cover gradient boosting in Practical Data Science with R (Manning 2014). To try to make up for that, we are sharing (for free) our GBM lecture from our (paid) video course Introduction to Data Science.

(link, all support material here).
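If you want to experiment alongside the lecture, here is a minimal sketch (our own toy example, with illustrative settings that are not necessarily those used in the lecture) of fitting a gradient boosted model in R with the gbm package:

    # minimal gradient boosting sketch (synthetic data, illustrative settings)
    library(gbm)
    set.seed(2015)
    d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    d$y <- as.numeric(d$x1 + 0.5*d$x2 + rnorm(200) > 0)    # synthetic binary outcome
    model <- gbm(y ~ x1 + x2, data = d,
                 distribution = "bernoulli",
                 n.trees = 500, interaction.depth = 2,
                 shrinkage = 0.05, cv.folds = 5)
    best <- gbm.perf(model, method = "cv")    # pick the tree count by cross-validation
    pred <- predict(model, newdata = d, n.trees = best, type = "response")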

Please help us get the word out by sharing/Tweeting!

Fluid use of data

By: , November 19th, 2015.


Nina Zumel and I recently wrote a few articles and series on best practices in testing models and data.

What stands out in these presentations is that the simple practice of a static test/train split is merely a convenience, adopted to cut down on operational complexity and the difficulty of teaching. It is in no way optimal: slightly more complicated procedures can build better models from a given set of data.

Figure: suggested static cal/train/test experiment design from the vtreat data treatment library.
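As a rough sketch of the static design (our own illustration here, not code from the article or from vtreat itself), one simply partitions the rows at random into three disjoint groups:

    # sketch: a static calibration/train/test partition (proportions are an arbitrary choice)
    set.seed(2015)
    d <- data.frame(x = rnorm(100), y = rnorm(100))   # stand-in for your data
    group <- sample(c("cal", "train", "test"), size = nrow(d),
                    replace = TRUE, prob = c(0.2, 0.5, 0.3))
    dCal   <- d[group == "cal", , drop = FALSE]       # used to design variable treatments
    dTrain <- d[group == "train", , drop = FALSE]     # used to fit the model
    dTest  <- d[group == "test", , drop = FALSE]      # held out for final evaluation

The point of the articles is that such a static split, while convenient, spends data; more careful (for example cross-validated) procedures can do better.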
Continue reading Fluid use of data

Upcoming Win-Vector Appearances

By: , November 9th, 2015.


We have two public appearances coming up in the next few weeks:

Workshop at ODSC, San Francisco – November 14

Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect of data science: what can go wrong, and how to fix it. This is part of the Open Data Science Conference (ODSC) at the Marriott Waterfront in Burlingame, California, November 14-15. If you are attending this conference, we look forward to seeing you there!

You can find an abstract for the workshop, along with links to software and code you can download ahead of time, here.

An Introduction to Differential Privacy as Applied to Machine Learning: Women in ML/DS – December 2

I (Nina) will give a talk to the Bay Area Women in Machine Learning & Data Science Meetup group, on applying differential privacy for reusable hold-out sets in machine learning. The talk will also cover the use of differential privacy in effects coding (what we’ve been calling “impact coding”) to reduce the bias that can arise from the use of nested models. Information about the talk, and the meetup group, can be found here.

We’re looking forward to these upcoming appearances, and we hope you can make one or both of them.

Fast food, fast publication

By: , November 8th, 2015.


The following article is getting quite a lot of press right now: David Just and Brian Wansink (2015), “Fast Food, Soft Drink, and Candy Intake is Unrelated to Body Mass Index for 95% of American Adults”, Obesity Science & Practice, forthcoming (in a new pay-for-placement journal). Obviously it is a sensational contrary position (some coverage: here, here, and here).

I thought I would take a peek to learn about the statistical methodology (see here for some commentary). The kindest thing one can say about the paper is that its problems are not statistical.

At this time the authors don’t seem to have supplied their data preparation or analysis scripts and the paper “isn’t published yet” (though they have had time for a press release), so we have to rely on their pre-print. Read on for excerpts from the work itself (with commentary). Continue reading Fast food, fast publication

Our Differential Privacy Mini-series

By: , November 1st, 2015.


We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we’ve tried to touch on the highlights of the papers, and to play around with variations of our own.

Image: blurry snowflakes (stock image by cosmicgallifrey, d3inho1).

  • A Simpler Explanation of Differential Privacy: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in Science (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, Science, vol 349, no. 6248, pp. 636-638, August 2015).

    Note that Cynthia Dwork is one of the inventors of differential privacy, originally used in the analysis of sensitive information.

  • Using differential privacy to reuse training data: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
  • A simple differentially private-ish procedure: The bootstrap as an alternative to Laplace noise for introducing privacy.

Our R code and experiments are available on GitHub here, so you can try some experiments and variations yourself.
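To give a flavor of the Laplace noise mechanism mentioned above, here is a minimal sketch (ours, not code from the series) of an epsilon-differentially private release of a simple count:

    # sketch: releasing a count with Laplace noise (the sensitivity of a count is 1)
    rlaplace <- function(n, scale) {
      # Laplace(0, scale) noise as the difference of two exponentials
      rexp(n, rate = 1/scale) - rexp(n, rate = 1/scale)
    }
    privateCount <- function(x, epsilon) {
      sum(x) + rlaplace(1, scale = 1/epsilon)
    }
    set.seed(2015)
    x <- rbinom(1000, 1, 0.3)         # synthetic 0/1 data
    sum(x)                            # the true count
    privateCount(x, epsilon = 0.5)    # a noisy, privacy-preserving release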


Don’t use stats::aggregate()

By: , October 31st, 2015.


When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().
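As one quick illustration (a commonly cited pitfall of the formula interface, and not necessarily the example in our article): the default na.action silently drops rows carrying NAs, changing group counts without warning.

    # sketch: aggregate()'s formula interface silently drops rows with NAs
    d <- data.frame(g = c("a", "a", "b", "b"),
                    x = c(1, 2, 3, NA))
    aggregate(x ~ g, data = d, FUN = length)
    #   g x
    # 1 a 2
    # 2 b 1     (the NA row in group "b" vanished without warning)
    tapply(d$x, d$g, length)
    # a b
    # 2 2       (this base alternative keeps the row)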

Read on for our example. Continue reading Don’t use stats::aggregate()

A simple differentially private-ish procedure

By: , October 13th, 2015.


Authors: John Mount and Nina Zumel

Nina and I were noodling with some variations of differentially private machine learning, and think we have found a variation of a standard practice that is fairly efficient at establishing a privacy condition (though, as commenters pointed out, not actual differential privacy).
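For intuition, here is a very rough sketch of the general idea (an illustrative toy of ours, not the exact procedure analyzed below): compute the statistic on a bootstrap resample instead of on the original data, so the resampling itself supplies the noise.

    # sketch: bootstrap resampling as the noise source for a released statistic
    set.seed(2015)
    x <- rnorm(100, mean = 2)
    mean(x)                       # the statistic computed directly on the data
    bootMean <- function(x) {
      idx <- sample.int(length(x), replace = TRUE)   # bootstrap resample
      mean(x[idx])
    }
    replicate(5, bootMean(x))     # noisy releases; the noise comes from resampling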


Read on for the idea and a rough analysis. Continue reading A simple differentially private-ish procedure

Baking priors

By: , October 13th, 2015.


There remains a bit of a two-way snobbery: that Frequentist statistics is what we teach (as so-called objective statistics remain the same no matter who works with them) and Bayesian statistics is what we do (as it tends to directly estimate the posterior probabilities we are actually interested in). Nina Zumel hit the nail on the head when she wrote an article explaining that the appropriate type of statistical theory depends on the type of question you are trying to answer, not on your personal prejudices.

We will discuss a few more examples that have been on our minds, including one I am calling “baking priors.” This final example will demonstrate some of the advantages of allowing researchers to document their priors.
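As a tiny illustration of what a documented prior buys you (a toy beta-binomial example of ours, not taken from the article itself): the prior is written down explicitly, so anyone can see how it moves the estimate.

    # sketch: a documented Beta(a, b) prior on a success rate, updated by data
    priorA <- 3; priorB <- 3         # the documented prior: centered at 0.5, fairly weak
    successes <- 7; trials <- 10     # the observed data
    successes / trials                                  # raw frequentist estimate: 0.7
    (priorA + successes) / (priorA + priorB + trials)   # posterior mean: 0.625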

Figure 1: two loaves of bread.
Continue reading Baking priors

Some key Win-Vector serial data science articles

By: , October 7th, 2015.


As readers have surely noticed, the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. That is the only way we can properly treat topics of consequence.


What not everybody may have noticed is that a number of these articles are organized into series, for deeper comprehension. The key series include:

  • Statistics to English translation.

    This series tries to find vibrant applications and explanations of standard good statistical practices, to make them more approachable to the non-statistician.

  • Statistics as it should be.

    This series tries to cover cutting-edge machine learning techniques, and then adapt and explain them in traditional statistical terms.

  • R as it is.

    This series tries to teach the statistical programming language R “warts and all” so we can see it as the versatile and powerful data science tool that it is.

To get a taste of what we are up to in our writing, please check out our blog highlights and these series. For deeper treatments of more operational topics, also check out our book Practical Data Science with R.

Or if you have something particular you need solved consider engaging us at Win-Vector LLC for data science consulting and/or training.

Using differential privacy to reuse training data

By: , October 5th, 2015.


Win-Vector LLC’s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression (essentially by reusing test data). This allowed her to reproduce results similar to those of the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique protects and reuses test data, allowing the series of adaptive decisions driving forward step-wise logistic regression to remain valid with respect to unseen future data. Without the differential privacy precaution these steps are not always sufficiently independent of each other to ensure good model generalization. Through differential privacy one gets safe reuse of test data across many adaptive queries, yielding more accurate estimates of out-of-sample performance, more robust choices, and a better final model.

In this note I will discuss a specific related application: using differential privacy to reuse training data (or, equivalently, to make training procedures more statistically efficient). I will also demonstrate similar effects using more familiar statistical techniques.
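For orientation, here is what plain forward step-wise logistic regression looks like in R (a minimal sketch on synthetic data; the differential-privacy machinery discussed in the article is deliberately not shown):

    # sketch: plain forward step-wise logistic regression (no privacy protection)
    set.seed(2015)
    d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
    d$y <- as.numeric(d$x1 - d$x2 + rnorm(200) > 0)
    null <- glm(y ~ 1, data = d, family = binomial)
    model <- step(null, scope = ~ x1 + x2 + x3, direction = "forward", trace = 0)
    summary(model)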

Continue reading Using differential privacy to reuse training data