A Simpler Explanation of Differential Privacy

By: , October 2nd, 2015.


Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et al.’s recent results can be used to improve the model fitting process.

The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner


A dynamic programming solution to A/B test design

By: , July 6th, 2015.


Our last article on A/B testing described the range of realistic circumstances for A/B testing in practice and gave links to different standard solutions. In this article we will take an idealized, specific situation that allows us to show a particularly beautiful solution to one very special type of A/B test.

For this article we are assigning two different advertising messages to our potential customers. The first message, called “A”, we have been using for a long time, and we have a very good estimate of the rate at which it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate are rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know: does B convert traffic to sales at a higher rate than A?

We are assuming:

  • We know the exact rate of A events.
  • We know exactly how long we are going to be in this business (how many potential customers we will ever attempt to message, or the total number of events we will ever process).
  • The goal is to maximize expected revenue over the lifetime of the project.

As we wrote in our previous article: in practice you usually do not know the quantities above. There is always uncertainty in the value of the A-group, you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far-future revenue), and often you value things other than revenue (valuing knowing if B is greater than A, or even maximizing risk-adjusted returns instead of gross returns). This represents a severe idealization of the A/B testing problem, one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).

Yang Hui (ca. 1238–1298) (Pascal’s) triangle, as depicted by the Chinese using rod numerals.

For this “statistics as it should be” (in partnership with Revolution Analytics) article let us work the problem (using R) pretending things are this simple.
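The idealized problem described above can be sketched as a backward induction over how many B trials have been seen so far. This is only an illustration under stated assumptions, not the code from the full article: the function name abValue is hypothetical, B's unknown rate is given a uniform Beta(1,1) prior (an assumption not stated in the excerpt), and the policy tries B for some initial stretch and then, if B looks worse, switches to A for good.

```r
# Expected lifetime revenue under the best "try B, then maybe switch to A" policy.
# Assumptions (not from the article): uniform Beta(1,1) prior on B's rate,
# every sale is worth exactly $1, and A's rate pA is known exactly.
abValue <- function(pA, nEvents) {
  stopifnot(nEvents >= 1, pA >= 0, pA <= 1)
  # v[w + 1] is the value of a state with w observed B-successes.
  # Start at depth nEvents: no events remain, so every state is worth 0.
  v <- numeric(nEvents + 1)
  for (k in (nEvents - 1):0) {        # k = number of B trials already observed
    w <- 0:k                          # possible B-success counts after k trials
    pB <- (w + 1) / (k + 2)           # posterior mean of B's rate
    stayA <- pA * (nEvents - k)       # send all remaining traffic to A
    # Try B once more: earn $1 with probability pB, then continue optimally.
    tryB <- pB * (1 + v[w + 2]) + (1 - pB) * v[w + 1]
    v <- pmax(stayA, tryB)
  }
  v[1]                                # value before any B trial is observed
}

# Example call: known A rate of 5%, 1000 events in total.
lifetimeValue <- abValue(0.05, 1000)
```

The table of states shrinks by one entry per step, which is the same triangular layout used in binomial option pricing.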

Wanted: A Perfect Scatterplot (with Marginals)

By: , June 11th, 2015.


We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:


The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

library(ggplot2)
library(ggExtra)

frm = read.csv("tips.csv")

# the central scatterplot
plot_center = ggplot(frm, aes(x=total_bill,y=tip)) + 
  geom_point()

# default: type="density"
ggMarginal(plot_center, type="histogram")

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.


The ggMarginal() function goes to heroic effort to line up the coordinate axes of all the graphs, and is probably the best way to do a scatterplot-plus-marginals in ggplot (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

# our own (very beta) plot package: details later
frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            title="Tips vs. Total Bill")


You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent ggplot2: Cheatsheet for Visualizing Distributions for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

Big News! “Practical Data Science with R” MEAP launched!

By: , May 15th, 2013.


Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.


Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward).

Minimal Version Control Lesson: Use It

By: , July 30th, 2012.


There is no excuse for a digital creative person to not use some sort of version control or source control. In the past disk space was too dear, version control systems were too expensive and software was not powerful enough; this is no longer the case. Unless your work is worthless, both back it up and version control it. We will demonstrate a minimal set of version control commands that will one day save your bacon.

Importance Sampling

By: , January 1st, 2012.


We describe briefly the powerful simulation technique known as “importance sampling.” Importance sampling is a technique that allows you to use numerical simulation to explore events that, at first look, appear too rare to be reliably approximated numerically. The correctness of importance sampling follows almost immediately from the definition of a change of density. Like most mathematical techniques, importance sampling brings in its own concerns and controls that were not obvious in the original problem. To deal with these concerns (like picking the re-weighting to use) we will largely appeal to the ideas from “A Tutorial on the Cross-Entropy Method” Pieter-Tjerk de Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein, Annals of Operations Research, 2005 vol. 134 (1) pp. 19-67.
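As a small illustration of the change-of-density idea, here is a sketch that estimates the probability that a standard normal variable exceeds 4, an event too rare to hit reliably by naive simulation. The function name rareTailEstimate and the choice of proposal (a normal shifted to the threshold) are assumptions made for this example; the proposal is picked by hand here, not by the cross-entropy method.

```r
# Importance-sampling estimate of P(Z > threshold) for Z ~ N(0, 1).
# Assumption for illustration: the proposal is N(threshold, 1), chosen by hand.
rareTailEstimate <- function(n, threshold) {
  # Draw from the proposal, where the rare event happens about half the time.
  x <- rnorm(n, mean = threshold, sd = 1)
  # Re-weight each draw by target density / proposal density.
  w <- dnorm(x, mean = 0, sd = 1) / dnorm(x, mean = threshold, sd = 1)
  # Average the weighted indicator of the event.
  mean((x > threshold) * w)
}

set.seed(2012)
estimate <- rareTailEstimate(100000, 4)
exact <- pnorm(4, lower.tail = FALSE)   # closed form, available in this toy case
relativeError <- abs(estimate - exact) / exact
```

In this toy case the exact answer is known, so the relative error can be checked directly; in real uses the quality of the estimate depends heavily on how the re-weighting is chosen.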

An Appreciation of Locality Sensitive Hashing

By: , November 21st, 2011.


We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques.

Gradients via Reverse Accumulation

By: , July 14th, 2010.


We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients.
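To give a feel for reverse accumulation, here is a minimal tape-based sketch. It is written in R for consistency with the other examples on this page and is not the Scala implementation from the article; all names (newTape, push, backward, and so on) are hypothetical. Each operation records its value and local partial derivatives, and one backward sweep then yields the derivative of the output with respect to every input.

```r
# A tiny tape: each node stores a value, the indices of its parent nodes,
# and the local partial derivatives with respect to those parents.
newTape <- function() {
  tape <- new.env()
  tape$nodes <- list()
  tape
}

push <- function(tape, value, parents = integer(0), partials = numeric(0)) {
  # Building the node first forces the arguments, so parents are recorded
  # on the tape before this node receives its index.
  node <- list(value = value, parents = parents, partials = partials)
  index <- length(tape$nodes) + 1
  tape$nodes[[index]] <- node
  index
}

val <- function(tape, i) tape$nodes[[i]]$value

add <- function(tape, a, b)
  push(tape, val(tape, a) + val(tape, b), c(a, b), c(1, 1))
mul <- function(tape, a, b)
  push(tape, val(tape, a) * val(tape, b), c(a, b), c(val(tape, b), val(tape, a)))
sinNode <- function(tape, a)
  push(tape, sin(val(tape, a)), a, cos(val(tape, a)))

# One backward sweep: adj[i] becomes d(output)/d(node i).
backward <- function(tape, output) {
  adj <- numeric(length(tape$nodes))
  adj[output] <- 1
  for (i in output:1) {
    node <- tape$nodes[[i]]
    for (j in seq_along(node$parents)) {
      p <- node$parents[j]
      adj[p] <- adj[p] + adj[i] * node$partials[j]
    }
  }
  adj
}

# f(x1, x2) = x1 * x2 + sin(x1), evaluated at (2, 3)
tape <- newTape()
x1 <- push(tape, 2)
x2 <- push(tape, 3)
y <- add(tape, mul(tape, x1, x2), sinNode(tape, x1))
grad <- backward(tape, y)[c(x1, x2)]   # (x2 + cos(x1), x1)
```

The cost of the backward sweep is proportional to the size of the tape, no matter how many inputs there are, which is where the speedup for gradients comes from.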

Automatic Differentiation with Scala

By: , June 14th, 2010.


This article is a worked-out exercise in applying the Scala type system to solve a small-scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion.