Posted on Categories Finance, Statistics, Tutorials1 Comment on One place not to use the Sharpe ratio

## One place not to use the Sharpe ratio

Having worked in finance I am a public fan of the Sharpe ratio. I have written about this here and here.

One thing I have often forgotten (driving some bad analyses) is: the Sharpe ratio isn’t appropriate for models of repeated events that already have linked mean and variance (such as Poisson or Binomial models) or situations where the variance is very small (with respect to the mean or expectation). These are common situations in a number of large scale online advertising problems (such as modeling the response rate to online advertisements or email campaigns).

In this note we will quickly explain the problem. Continue reading One place not to use the Sharpe ratio

Posted on Categories data science, Statistics, Tutorials2 Comments on A clear picture of power and significance in A/B tests

## A clear picture of power and significance in A/B tests

A/B tests are one of the simplest reliable experimental designs.

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.

“Practical guide to controlled experiments on the web: listen to your customers not to the HIPPO” Ron Kohavi, Randal M Henne, and Dan Sommerfield, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007 pp. 959-967.

The ideas is to test a variation (called “treatment” or “B”) in parallel with continuing to test a baseline (called “control” or “A”) to see if the variation drives a desired effect (increase in revenue, cure of disease, and so on). By running both tests at the same time it is hoped that any confounding or omitted factors are nearly evenly distributed between the two groups and therefore not spoiling results. This is a much safer system of testing than retrospective studies (where we look for features from data already collected).

Interestingly enough the multi-armed bandit alternative to A/B testing (a procedure that introduces online control) is one of the simplest non-trivial Markov decision processes. However, we will limit ourselves to traditional A/B testing for the remainder of this note. Continue reading A clear picture of power and significance in A/B tests

Posted on 2 Comments on Bandit Formulations for A/B Tests: Some Intuition

## Bandit Formulations for A/B Tests: Some Intuition

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.

— Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007)

A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a proposed improvement (a new medicine, compared to an old one; a promotional campaign; a change to a website). To run an A/B test, you split your population into a control group (let’s call them “A”) and a treatment group (“B”). The A group gets the “old” protocol, the B group gets the proposed improvement, and you collect data on the outcome that you are trying to achieve: the rate that patients are cured; the amount of money customers spend; the rate at which people who come to your website actually complete a transaction. In the traditional formulation of A/B tests, you measure the outcomes for the A and B groups, determine which is better (if either), and whether or not the difference observed is statistically significant. This leads to questions of test size: how big a population do you need to get reliably detect a difference to the desired statistical significance? And to answer that question, you need to know how big a difference (effect size) matters to you.

The irony is that to detect small differences accurately you need a larger population size, even though in many cases, if the difference is small, picking the wrong answer matters less. It can be easy to lose sight of that observation in the struggle to determine correct experiment sizes.

There is an alternative formulation for A/B tests that is especially suitable for online situations, and that explicitly takes the above observation into account: the so-called multi-armed bandit problem. Imagine that you are in a casino, faced with K slot machines (which used to be called “one-armed bandits” because they had a lever that you pulled to play (the “arm”) — and they pretty much rob you of all your money). Each of the slot machines pays off at a different (unknown) rate. You want to figure out which of the machines pays off at the highest rate, then switch to that one — but you don’t want to lose too much money to the suboptimal slot machines while doing so. What’s the best strategy?

The “pulling one lever at a time” formulation isn’t a bad way of thinking about online transactions (as opposed to drug trials); you can imagine all your customers arriving at your site sequentially, and being sent to bandit A or bandit B according to some strategy. Note also, that if the best bandit and the second-best bandit have very similar payoff rates, then settling on the second best bandit, while not optimal, isn’t necessarily that bad a strategy. You lose winnings — but not much.

Traditionally, bandit games are infinitely long, so analysis of bandit strategies is asymptotic. The idea is that you test less as the game continues — but the testing stage can go on for a very long time (often interleaved with periods of pure exploitation, or playing the best bandit). This infinite-game assumption isn’t always tenable for A/B tests — for one thing, the world changes; for another, testing is not necessarily without cost. We’ll look at finite games below.

Posted on Categories data science, Opinion, Rants, Statistics5 Comments on Drowning in insignificance

## Drowning in insignificance

Some researchers (in both science and marketing) abuse a slavish view of p-values to try and falsely claim credibility. The incantation is: “we achieved p = x (with x ≤ 0.05) so you should trust our work.” This might be true if the published result had been performed as a single project (and not as the sole shared result in longer series of private experiments) and really points to the fact that even frequentist significance is a subjective and intensional quantity (an accusation usually reserved for Bayesian inference). In this article we will comment briefly on the negative effect of un-reported repeated experiments and what should be done to compensate. Continue reading Drowning in insignificance