Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.
A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a proposed improvement (a new medicine, compared to an old one; a promotional campaign; a change to a website). To run an A/B test, you split your population into a control group (let’s call them “A”) and a treatment group (“B”). The A group gets the “old” protocol, the B group gets the proposed improvement, and you collect data on the outcome that you are trying to achieve: the rate that patients are cured; the amount of money customers spend; the rate at which people who come to your website actually complete a transaction. In the traditional formulation of A/B tests, you measure the outcomes for the A and B groups, determine which is better (if either), and whether or not the difference observed is statistically significant. This leads to questions of test size: how big a population do you need to get reliably detect a difference to the desired statistical significance? And to answer that question, you need to know how big a difference (effect size) matters to you.
The irony is that to detect small differences accurately you need a larger population size, even though in many cases, if the difference is small, picking the wrong answer matters less. It can be easy to lose sight of that observation in the struggle to determine correct experiment sizes.
There is an alternative formulation for A/B tests that is especially suitable for online situations, and that explicitly takes the above observation into account: the so-called multi-armed bandit problem. Imagine that you are in a casino, faced with K slot machines (which used to be called “one-armed bandits” because they had a lever that you pulled to play (the “arm”) — and they pretty much rob you of all your money). Each of the slot machines pays off at a different (unknown) rate. You want to figure out which of the machines pays off at the highest rate, then switch to that one — but you don’t want to lose too much money to the suboptimal slot machines while doing so. What’s the best strategy?
The “pulling one lever at a time” formulation isn’t a bad way of thinking about online transactions (as opposed to drug trials); you can imagine all your customers arriving at your site sequentially, and being sent to bandit A or bandit B according to some strategy. Note also, that if the best bandit and the second-best bandit have very similar payoff rates, then settling on the second best bandit, while not optimal, isn’t necessarily that bad a strategy. You lose winnings — but not much.
Traditionally, bandit games are infinitely long, so analysis of bandit strategies is asymptotic. The idea is that you test less as the game continues — but the testing stage can go on for a very long time (often interleaved with periods of pure exploitation, or playing the best bandit). This infinite-game assumption isn’t always tenable for A/B tests — for one thing, the world changes; for another, testing is not necessarily without cost. We’ll look at finite games below.
Continue reading Bandit Formulations for A/B Tests: Some Intuition