Our last article on A/B testing described the scope of the realistic circumstances of A/B testing in practice and gave links to different standard solutions. In this article we will be take an idealized specific situation allowing us to show a particularly beautiful solution to one very special type of A/B test.
For this article we are assigning two different advertising message to our potential customers. The first message, called “A”, we have been using a long time, and we have a very good estimate at what rate it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know does B convert traffic to sales at a higher rate than A?
We are assuming:
- We know exact rate of A events.
- We know exactly how long we are going to be in this business (how many potential customers we will ever attempt to message, or the total number of events we will ever process).
- The goal is to maximize expected revenue over the lifetime of the project.
As we wrote in our previous article: in practice you usually do not know the answers to the above questions. There is always uncertainty in the value of the A-group, you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far future revenue), and often you value things other than revenue (valuing knowing if B is greater than A, or even maximizing risk adjusted returns instead of gross returns). This represents severe idealization of the A/B testing problem, one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).
Yang Hui (ca. 1238–1298) (Pascal’s) triangle, as depicted by the Chinese using rod numerals.
For this “statistics as it should be” (in partnership with Revolution Analytics) article let us work the problem (using R) pretending things are this simple. Continue reading A dynamic programming solution to A/B test design
We have previously written that we like the investment performance summary called the Sharpe ratio (though it does have some limits).
What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).
But what we have noticed is nobody is willing to honestly say what a good value for this number is. We will use the R analysis suite and Yahoo finance data to produce some example real Sharpe ratios here so you can get a qualitative sense of the metric. Continue reading What is a good Sharpe ratio?
I recently wrote a tiny bit about the style of the original published proof of the Erdős-Ko-Rado theorem. In this note I’ll write a bit about the theorem and a bit more about the style of some later proofs. In particular I want to write about two different readings of Katona’s proof. Continue reading Proof style in the Erdős-Ko-Rado theorem
Here is a neat example of famous mathematician Pál Erdős (often rendered in English as Paul Erdős) writing like a programmer in 1961. He goes to some trouble to introduce notation that allows him to index everything from zero. Continue reading Erdős writing like a programmer
How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.
Figure: the standard SVM margin diagram, this time with some un-marked data added.
Continue reading How sure are you that large margin implies low VC dimension?
This note is a link to an excerpt from my upcoming monster support vector machine article where I work through a number of sections of [Vapnik, 1998] Vapnik, V. N. (1998), Statistical Learning Theory, Wiley. I try to run down how the original theoretical claims are precisely linked to what is said about the common implementations. The write-up is fairly technical and very large (26 pages).
Here we are extracting an appendix: “Soft margin is not as good as hard margin.” In it we build a toy problem that is not large-margin separated and note that if the dimension of the concept space you were working in was not obvious (i.e. you were forced to rely on the margin derived portion of generalization bounds) then generalization improvement for a soft margin SVM is much slower than you would expect given experience from the hard margin theorems. The punch-line is: every time you get eight times as much training data you only halve your expected excess generalization error bound (whereas once you get below a data-set’s hard-margin bound you expect one to one reduction of the bound with respect to training data set size). What this points out is: the soft margin idea can simulate margin, but it comes at a cost. Continue reading Soft margin is not as good as hard margin
Recently there has been some controversy over David Mumford’s Nature magazine invited obituary of Alexander Grothendieck being initially rejected on submission (see here and here). At issue was the attempt to explain the mathematical idea of schemes (one of Alexander Grothendieck’s most important contributions) to a non-mathematician audience. Professor Mumford is a mathematician of great stature and his explanation is better than anything I could even attempt. However, in addition to the issues he raises I don’t think he was sensitive enough to what a non-mathematician considers motivation.
I’ll take a quick stab at explaining a very tiny bit of the motivation of schemes. I not sure the kind of chain of analogies argument I am attempting would work in an obituary (or in a short length), so I certainly don’t presume to advise professor Mumford on his obituary of a great mathematician (and person). Continue reading Let’s try to motivate schemes
In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or predictor variables) and “y” (denoting an outcome variable, or dependent variable) then:
- Estimate an approximate functional relation
y ~ f(x).
- Apply that relation to new instances where
x is known and
y is not yet known.
An example of this would be to use measured characteristics of online shoppers to predict if they will purchase in the next month. Data more than a month old gives us a training set where both
y are known. Newer shoppers give us examples where only
x is currently known and it would presumably be of some value to estimate
y or estimate the probability of different
y values. The problem is philosophically “easy” in the sense we are not attempting inference (estimating unknown parameters that are not later exposed to us) and we are not extrapolating (making predictions about situations that are out of the range of our training data). All we are doing is essentially generalizing memorization: if somebody who shares characteristics of recent buyers shows up, predict they are likely to buy. We repeat: we are not forecasting or “predicting the future” as we are not modeling how many high-value prospects will show up, just assigning scores to the prospects that do show up.
The reliability of such a scheme rests on the concept of exchangeability. If the future individuals we are asked to score are exchangeable with those we had access to during model construction then we expect to be able to make useful predictions. How we construct the model (and how to ensure we indeed find a good one) is the core of machine learning. We can bring in any big name machine learning method (deep learning, support vector machines, random forests, decision trees, regression, nearest neighbors, conditional random fields, and so-on) but the legitimacy of the technique pretty much stands on some variation of the idea of exchangeability.
One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables or relations between variables changes over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application then you should not expect to build a useful model. This one of the hard lessons that statistics tries so hard to quantify and teach.
We know that you should always prefer fixing your experimental design over trying a mechanical correction (which can go wrong). And there are no doubt “name brand” procedures for dealing with concept drift. However, data science and machine learning practitioners are at heart tinkerers. We ask: can we (to a limited extent) attempt to directly correct for concept drift? This article demonstrates a simple correction applied to a deliberately simple artificial example.
Image: Wikipedia: Elgin watchmaker
Continue reading Can we try to make an adjustment?
From time to time we are asked “what is the company name Win-Vector LLC referring to?” It is a cryptic pun trying to be an encoding of “we deliver victory.”
The story is an inside joke referring to something really only funny to one of the founders. But a joke that amuses the teller is always enjoyed by at least one person. Win-Vector LLC’s John Mount had the honor of co-authoring a 1997 paper titled “The Polytope of Win Vectors.” The paper title is obviously mathematical terms in an odd combination. However the telegraphic grammar is coincidentally similar to deliberately ungrammatical gamer slang such as “full of win” and “so much win.”
If we treat “win” as a concrete noun (say something you can put in a sack) and “vector” in its non-mathematical sense (as an entity of infectious transmission) we have “Win-Vector LLC is an infectious delivery of victory.” I.e.: we deliver success to our clients. Of course, we have now attempt to explain a weak joke. It is not as grand as “winged victory,” but it does encode a positive company value: Win-Vector LLC delivers successful data science projects and training to clients.
Winged Victory: from Wikipedia
Let’s take this as an opportunity to describe what a win vector is. Continue reading What is a win vector?
Page 94 of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition (which we will call BDA3) provides a great example of what happens when common broad frequentist bias criticisms are over-applied to predictions from ordinary linear regression: the predictions appear to fall apart. BDA3 goes on to exhibit what might be considered the kind of automatic/mechanical fix responding to such criticisms would entail (producing a bias corrected predictor), and rightly shows these adjusted predictions are far worse than the original ordinary linear regression predictions. BDA3 makes a number of interesting points and is worth studying closely. We work their example in a bit more detail for emphasis. Continue reading Automatic bias correction doesn’t fix omitted variable bias