Category Archives: Exciting Techniques

A dynamic programming solution to A/B test design

Our last article on A/B testing described the range of circumstances A/B testing faces in practice and gave links to several standard solutions. In this article we will take an idealized, specific situation that allows us to show a particularly beautiful solution to one very special type of A/B test.

For this article we are assigning two different advertising messages to our potential customers. The first message, called “A”, we have been using for a long time, and we have a very good estimate of the rate at which it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate are rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know whether B converts traffic to sales at a higher rate than A.

We are assuming:

  • We know the exact rate of A events.
  • We know exactly how long we are going to be in this business (how many potential customers we will ever attempt to message, or the total number of events we will ever process).
  • The goal is to maximize expected revenue over the lifetime of the project.

As we wrote in our previous article, in practice you usually do not have any of the above. There is always uncertainty in the value of the A-group, you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far-future revenue), and you often value things other than revenue (such as knowing whether B is greater than A, or maximizing risk-adjusted returns instead of gross returns). This is a severe idealization of the A/B testing problem, but one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).


[Figure: Yang Hui’s (ca. 1238–1298) (Pascal’s) triangle, as depicted by the Chinese using rod numerals.]

For this “statistics as it should be” article (in partnership with Revolution Analytics), let us work the problem (using R), pretending things are this simple. Continue reading A dynamic programming solution to A/B test design
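To make the idealized setup concrete, here is a minimal sketch (an illustration, not the article’s code) of the dynamic program in R. It assumes A’s conversion rate pA is known, gives B’s unknown rate a uniform Beta(1,1) prior, and, with k events remaining and n B-trials already seen (s of them sales), takes the state value to be the larger of “send A” and “send B”; the helper name makeABValue is our own.

makeABValue <- function(pA) {
  # memoized value function V(k, n, s): maximum expected future revenue with
  # k messages left to send, having already sent B n times and seen s sales
  cache <- new.env()
  V <- function(k, n, s) {
    if (k <= 0) return(0)
    key <- paste(k, n, s, sep = "_")
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache))
    }
    pBhat <- (s + 1) / (n + 2)            # posterior mean of B's unknown rate
    valA <- pA + V(k - 1, n, s)           # send A: earn pA, learn nothing
    valB <- pBhat +                       # send B: earn pBhat in expectation,
      pBhat       * V(k - 1, n + 1, s + 1) +  # then continue after a success
      (1 - pBhat) * V(k - 1, n + 1, s)        # or continue after a failure
    val <- max(valA, valB)
    assign(key, val, envir = cache)
    val
  }
  V
}

# example: 20 messages left to send, A converts at a known 5% rate
V <- makeABValue(pA = 0.05)
V(20, 0, 0)   # expected total revenue under the optimal policy

The recursion has the same shape as binomial option pricing: the value of each state is the maximum over actions of the immediate expected payoff plus the expected value of the successor states.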

Wanted: A Perfect Scatterplot (with Marginals)

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki:

[Figure: scatterplot with marginal densities (seaborn jointplot).]

The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot”; it’s called a “scatterhist” in MATLAB, apparently. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.

I like this plot a lot, but we’re mostly an R shop here at Win-Vector. So we asked: can we make this plot in ggplot2? Natively, ggplot2 can add rugs to a scatterplot, but doesn’t immediately offer marginals, as above.

However, you can use Dean Attali’s ggExtra package. Here’s an example using the same data as the seaborn jointplot above; you can download the dataset here.

library(ggplot2)
library(ggExtra)

# the tips dataset used in the seaborn example
frm = read.csv("tips.csv")

# scatterplot of tip versus total bill, with a linear fit
plot_center = ggplot(frm, aes(x=total_bill,y=tip)) + 
  geom_point() +
  geom_smooth(method="lm")

# add marginal histograms (ggMarginal's default is type="density")
ggMarginal(plot_center, type="histogram")

I didn’t bother to add the internal annotation for the goodness of the linear fit, though I could.

[Figure: ggplot2 scatterplot of tip vs. total bill with marginal histograms, via ggMarginal.]

The ggMarginal() function goes to heroic lengths to line up the coordinate axes of all the graphs, and is probably the best way to produce a scatterplot plus marginals in ggplot2 (you can also do it in base graphics, of course). Still, we were curious how close we could get to the seaborn version: marginal density and histograms together, along with annotations. Below is our version of the graph; we report the linear fit’s R-squared, rather than the Pearson correlation.

# our own (very beta) plot package: details later
library(WVPlots)
frm = read.csv("tips.csv")

ScatterHist(frm, "total_bill", "tip",
            smoothmethod="lm",
            annot_size=3,
            title="Tips vs. Total Bill")

[Figure: WVPlots ScatterHist version: scatterplot with marginal histograms and densities, plus the linear-fit annotation.]

You can see that (at the moment) we’ve resorted to padding the axis labels with underbars to force the x-coordinates of the top marginal plot and the scatterplot to align; white space gets trimmed. This is profoundly unsatisfying, and less robust than the ggMarginal version. If you’re curious, the code is here. It relies on some functions in the file sharedFunctions.R in the same repository. Our more general version will do either a linear or lowess/spline smooth, and you can also adjust the histogram and density plot parameters.

Thanks to Slawa Rokicki’s excellent post “ggplot2: Cheatsheet for Visualizing Distributions” for our basic approach. Check out the graph at the bottom of her post — and while you’re at it, check out the rest of her blog too.

Big News! “Practical Data Science with R” MEAP launched!

Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.


[Figure: “Practical Data Science with R” book cover.]

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward). Continue reading Big News! “Practical Data Science with R” MEAP launched!

Minimal Version Control Lesson: Use It

There is no excuse for a digital creative person not to use some sort of version control or source control. In the past, disk space was too dear, version control systems were too expensive, and software was not powerful enough; this is no longer the case. Unless your work is worthless, both back it up and version control it. We will demonstrate a minimal set of version control commands that will one day save your bacon. Continue reading Minimal Version Control Lesson: Use It

Importance Sampling

We briefly describe the powerful simulation technique known as “importance sampling.” Importance sampling lets you use numerical simulation to explore events that, at first look, appear too rare to be reliably approximated numerically. The correctness of importance sampling follows almost immediately from the definition of a change of density. Like most mathematical techniques, importance sampling brings in its own concerns and controls that were not obvious in the original problem. To deal with these concerns (like picking the re-weighting to use) we will largely appeal to the ideas from “A Tutorial on the Cross-Entropy Method” by Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein, Annals of Operations Research, 2005, vol. 134(1), pp. 19–67. Continue reading Importance Sampling
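As a toy illustration of the change-of-density idea (our own example, not one from the article), consider estimating P(X > 5) for a standard normal X. Naive simulation essentially never sees the event; drawing from a proposal shifted to where the event lives and re-weighting by the density ratio recovers a usable estimate:

set.seed(2015)
n <- 1e5

# naive Monte Carlo: the event is so rare we essentially never observe it
mean(rnorm(n) > 5)

# importance sampling: draw from the shifted proposal N(5, 1),
# then re-weight each draw by the density ratio dnorm(x)/dnorm(x, mean = 5)
x <- rnorm(n, mean = 5)
w <- dnorm(x) / dnorm(x, mean = 5)
mean(w * (x > 5))                  # unbiased estimate of P(X > 5)

# exact answer for comparison
pnorm(5, lower.tail = FALSE)       # about 2.9e-7

Choosing the shifted proposal N(5, 1) here is exactly the kind of re-weighting decision the cross-entropy method is designed to help automate.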

Gradients via Reverse Accumulation

We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up the calculation of gradients. Continue reading Gradients via Reverse Accumulation

Automatic Differentiation with Scala

This article is a worked-out exercise in applying the Scala type system to solve a small-scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Continue reading Automatic Differentiation with Scala

The Local to Global Principle

We describe the “local to global principle.” It is a principle used to break algorithmic problem solving into two distinct phases (local criticism followed by global solution) and is an aid both in the design and in the application of algorithms. Rather than giving a formal definition of the principle, we define it informally and then discuss a few examples and methods. We have produced both a stand-alone PDF (more legible) and an HTML/blog form (more skimmable).
Continue reading The Local to Global Principle

Good Graphs: Graphical Perception and Data Visualization

What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details nor drowns us in confusing clutter? In 1985, William Cleveland published a text called The Elements of Graphing Data, inspired by Strunk and White’s classic writing handbook The Elements of Style. The Elements of Graphing Data puts forward Cleveland’s philosophy about how to produce good, clear graphs — not only for presenting one’s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland’s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes best. Continue reading Good Graphs: Graphical Perception and Data Visualization