
# A clear picture of power and significance in A/B tests

A/B tests are one of the simplest reliable experimental designs.

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.

“Practical guide to controlled experiments on the web: listen to your customers not to the HiPPO,” Ron Kohavi, Randal M. Henne, and Dan Sommerfield, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 959-967.

The idea is to test a variation (called the “treatment” or “B”) in parallel with continuing to test a baseline (called the “control” or “A”) to see if the variation drives a desired effect (increase in revenue, cure of disease, and so on). By running both tests at the same time it is hoped that any confounding or omitted factors are nearly evenly distributed between the two groups and therefore do not spoil the results. This is a much safer system of testing than retrospective studies (where we look for features in data already collected).

Interestingly enough the multi-armed bandit alternative to A/B testing (a procedure that introduces online control) is one of the simplest non-trivial Markov decision processes. However, we will limit ourselves to traditional A/B testing for the remainder of this note.

The traditional frequentist batch-oriented A/B test is organized around two concepts called power and significance. Once you have your experimental goal (what you are testing and the expected target difference), issues of power and significance determine the final parameters of your experiment: the necessary sample sizes. Power and significance are defined as follows:

Power: The probability of rejecting the null hypothesis when it is false. You want to design your experiment to have a power near 1.

Significance: The probability of rejecting the null hypothesis when it is true. You want to design your experiment so you have significance (or p-values) near zero.

Adapted from “The Cambridge Dictionary of Statistics”, B.S. Everitt, 2nd Edition, Cambridge, 2005 printing.

Each of these represents an error rate based on a different (unknown) state of the world (the null hypothesis being true or being false). If we design an experiment to have low error rates in both unknown states of the world then it should have a low error rate no matter what state the world is in. Often power and significance are confused (or only one is calculated), but this sloppiness rarely causes large problems as the sample sizes suggested by each are usually in the same rough ballpark.

We have in the past fretted over informative formulas for power and significance, but in this note we will delegate sample size calculations to R.

Suppose we have an advertising campaign that historically has a conversion or success rate of 0.5%. Further suppose we have a product manager who has suggested a change in wording that the product manager claims will represent a 20% relative increase in conversion (or a targeted success rate of 0.6%). The question is: how large an A sample and B sample must be run to have a good chance of both detecting a change of this magnitude (if it is present) and not hallucinating a change of this magnitude (if there is in fact no difference). Once the two assumed rates (0.5% and 0.6%) and confidence goals (say a 1% chance of error under either alternative) are set, the test design is just a matter of calculation.
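Before delegating to R, it is worth seeing what the two tail probabilities in question actually are. The sketch below (standard-library Python, paralleling the R functions `pSignificanceError` and `pPowerError` in the code at the end of this note) evaluates the binomial tails for the assumed 0.5% and 0.6% rates; the truncation bound on the upper tail is a numerical convenience, not part of the statistics:

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial pmf, via lgamma to avoid overflow at large n."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def p_significance_error(p, q, n):
    """q > p: probability a p-rate process measures as rate q or better in n trials."""
    lo = math.ceil(q * n)
    # upper tail; terms more than ~20 standard deviations out are negligible
    hi = min(n, lo + 20 * (int(math.sqrt(n * p * (1 - p))) + 1))
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(lo, hi + 1))

def p_power_error(p, q, n):
    """q > p: probability a q-rate process measures as rate p or worse in n trials."""
    hi = math.floor(p * n)
    return sum(math.exp(log_binom_pmf(k, n, q)) for k in range(0, hi + 1))

# the sample sizes derived in the next paragraph hold each error near the 1% target
print(p_significance_error(0.005, 0.006, 28001))
print(p_power_error(0.005, 0.006, 30342))
```

Both probabilities come out near the 1% error goal, which is exactly why the sample sizes quoted in the next paragraph are what they are.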

The standard definitions of power and significance have a hidden implicit assumption that we are only going to run one of the A or B experiments. Solving for the sample size needed to not confuse A with B (given only A is run and B is assumed to be at its ideal rate of 0.6%), or not to confuse B with A (given only B is run and A is assumed to be at its ideal rate of 0.5%), tells us we need a sample size of at least 28001 for a good A significance experiment and a sample size of at least 30342 for a good B power experiment. The result is summarized in the following graph.
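The sizes 28001 and 30342 come from searching for the smallest sample size whose tail probability meets the 1% target (the R code below uses `gtools::binsearch` for this). Here is a self-contained Python sketch of the same search for the A (significance) side; note the bisection treats the tail probability as decreasing in n, which is only approximately true because of the ceiling in the threshold, so it can land a few counts away from R's answer:

```python
import math

def sig_error(p, q, n):
    """q > p: probability a p-rate process measures as rate q or better in n trials."""
    lo = math.ceil(q * n)
    hi = min(n, lo + 20 * (int(math.sqrt(n * p * (1 - p))) + 1))  # truncate negligible tail
    def log_pmf(k):
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))
    return sum(math.exp(log_pmf(k)) for k in range(lo, hi + 1))

def min_sample_size(err, target, lo=100, hi=1000000):
    """Bisect for the smallest n with err(n) <= target (err assumed ~decreasing in n)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if err(mid) <= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

nA = min_sample_size(lambda n: sig_error(0.005, 0.006, n), 0.01)
print(nA)  # close to the 28001 quoted above
```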

The top portion of the graph illustrates the probability density of seeing different counts (here encoded as rates) when running the A-test (the case where the null hypothesis is true). The red area marks results that look as good as the theoretical B-treatment. The claim is that a sample size of at least 28001 A-trials is enough to ensure the probability of the null hypothesis looking as good as the theoretical B is no more than 1%. The bottom portion of the graph illustrates the probability density of different counts (again encoded as rates) when running the B-test (the presumed treatment). The red area marks results that look no better than the theoretical A-treatment. The claim is that a sample size of at least 30342 B-trials is enough to ensure the probability of such a good B looking no better than the baseline A is no more than 1%.

Of course in practice we want to run both tests simultaneously. In this case we suggest the simple expedient of picking a mid-point between the A-rate and the B-rate (say 0.55%) and saying a test design is good if the probability of the observed A-rate crossing the separating line plus the probability of the observed B-rate crossing the separating line is small. If neither observed rate crosses the separating line, then they do not cross each other. With such a design we have a good chance of empirically observing the separation we hypothesized (if it is there) and a low probability of observing such a separation if there is no difference (B=A=0.5%). The modified test design (with shared error boundary) is illustrated in the graph below.

Notice the required sample sizes have gone up a lot (as we have asked for more accuracy): nA ≥ 109455 and nB ≥ 125050. Also notice the summed odds of making a mistake are 2% assuming the A and B rates are as advertised (as we have a 1% chance of an A-driven mistake and a 1% chance of a B-driven mistake). The odds of a mistake are also around 2% in the null-hypothesis situation (B=A=0.5%), as we then have two chances of one of the tests scoring at least 0.55% (A could do it or B could do it, a conservative so-called two-sided test situation). We can get the overall error chance down to 1% by designing each test to have an error probability of no more than 0.5%, which requires the moderately larger sample sizes of nA ≥ 135091 and nB ≥ 154805. Notice: changes in accuracy are expensive and changes in confidence are relatively cheap.
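The shared-boundary design changes only the comparison points: both observed rates are tested against the 0.55% midpoint instead of against the other group's assumed rate. A standard-library Python spot check of the quoted sizes (the same binomial-tail computation as in the R code at the end of this note):

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial pmf, via lgamma to avoid overflow at large n."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def upper_tail(p, n, thresh):
    """Probability a p-rate process shows rate >= thresh in n trials."""
    lo = math.ceil(thresh * n)
    hi = min(n, lo + 20 * (int(math.sqrt(n * p * (1 - p))) + 1))  # truncate negligible tail
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(lo, hi + 1))

def lower_tail(p, n, thresh):
    """Probability a p-rate process shows rate <= thresh in n trials."""
    return sum(math.exp(log_binom_pmf(k, n, p))
               for k in range(0, math.floor(thresh * n) + 1))

# A crossing the 0.55% midpoint upward, B crossing it downward
errA = upper_tail(0.005, 109455, 0.0055)
errB = lower_tail(0.006, 125050, 0.0055)
print(errA, errB, errA + errB)  # each near 1%, summing to about 2%
```

This reproduces the arithmetic behind the "summed odds of making a mistake are 2%" remark: each test contributes roughly its own 1% error budget.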

It is, as always, important to use the correct standard definitions (significance and power, yielding our first experimental plan). It is simultaneously important to adapt the standard calculations to what you are actually trying to measure (simultaneous A and B estimates, yielding our second more conservative experimental plan).

The complete R-code for these examples and plots is given below.

```
library(gtools)
library(ggplot2)

# q>p, compute the probability of a
# p-rate process measuring as q-rate
# or better in n steps
pSignificanceError <- function(p,q,n) {
  pbinom(ceiling(q*n)-1,prob=p,size=n,lower.tail=FALSE)
}

# q>p, compute the probability of a
# q-rate process measuring as p-rate
# or lower in n steps
pPowerError <- function(p,q,n) {
  pbinom(floor(p*n),prob=q,size=n,lower.tail=TRUE)
}

designExperiment <- function(pA,pB,pError,pAUpper=pB,pBLower=pA) {
  aSoln <- binsearch(
    function(k) { pSignificanceError(pA,pAUpper,k) - pError },
    range=c(100,1000000))
  nA <- max(aSoln$where)
  print(paste('nA',nA))

  bSoln <- binsearch(
    function(k) { pPowerError(pBLower,pB,k) - pError },
    range=c(100,1000000))
  nB <- max(bSoln$where)
  print(paste('nB',nB))

  low <- floor(min(pA*nA,pB*nB))
  high <- ceiling(max(pA*nA,pB*nB))
  width <- high-low
  countRange <- (low-width):(high+width)

  dA <- data.frame(count=countRange)
  dA$group <- paste('A: sample size=',nA,sep='')
  dA$density <- dbinom(dA$count,prob=pA,size=nA)
  dA$rate <- dA$count/nA
  dA$error <- dA$rate>=pAUpper

  dB <- data.frame(count=countRange)
  dB$group <- paste('B: sample size=',nB,sep='')
  dB$density <- dbinom(dB$count,prob=pB,size=nB)
  dB$rate <- dB$count/nB
  dB$error <- dB$rate<=pBLower

  d <- rbind(dA,dB)
  plot <- ggplot(data=d,aes(x=rate,y=density)) +
    geom_line() +
    geom_ribbon(data=subset(d,error),
                aes(ymin=0,ymax=density),fill='red') +
    facet_wrap(~group,ncol=1,scales='free_y') +
    geom_vline(xintercept=pAUpper,linetype=2) +
    geom_vline(xintercept=pBLower,linetype=2)
  list(nA=nA,nB=nB,plot=plot)
}

r1 <- designExperiment(pA=0.005,pB=0.006,pError=0.01)
print(r1$plot)

r2 <- designExperiment(pA=0.005,pB=0.006,pError=0.01,
                       pAUpper=0.0055,pBLower=0.0055)
print(r2$plot)

r3 <- designExperiment(pA=0.005,pB=0.006,pError=0.005,
                       pAUpper=0.0055,pBLower=0.0055)
```

## 2 thoughts on “A clear picture of power and significance in A/B tests”

1. Mo says:

I am not sure why I get different results when I use the bsamsize function in Hmisc library.

Using your example assuming type 1 error of 1% and a 50/50 split:

bsamsize(0.005, 0.006, fraction=.50,alpha=.1)
n1 n2
67633.04 67633.04

designExperiment(pA=0.005,pB=0.006,pError=0.1)
[1] “nA 8334”
[1] “nB 9870”

1. The likely explanation is that each experiment-size calculator is answering a slightly different question (different designs and different null hypotheses). The calculation in our article that is closest to the experiment design you seem to be asking about is:

```
designExperiment(pA=0.005,pB=0.006,pError=0.1,
                 pAUpper=0.0055,pBLower=0.0055)
[1] "nA 33091"
[1] "nB 39685"
```

Notice this sums up to 72776 but is not a 50/50 split (it suggests running more experiments on B). From the Hmisc manual: bsamsize “Uses method of Fleiss, Tytun, and Ury (but without the continuity correction) to estimate the power (or the sample size to achieve a given power) of a two-sided test for the difference in two proportions.” So possibly they are estimating only power, where we are worrying a bit about both power and significance (and using direct binomial tail estimates, which can be justified but may or may not be the same method as Fleiss, Tytun, and Ury). To keep things simple we are assuming very strong competing hypotheses (the exact values of the unknown rates are assumed either identical or to be the two values we want to test; obviously other priors are possible) and we are taking the confusion point to be the mid-point (not optimizing that choice or trying to integrate it out).