It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, then the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a bayesian one, it could be both — even while solving the same problem.

Let’s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form “Does treatment X have a discernible effect on condition Y, on average?” To be specific, let’s use the question “Does drugX reduce hypertension, on average?” Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, Worry about correctness and repeatability, not p-values: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?

We can argue about whether or not the question we are answering is the *correct* question — but given that it *is* the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.

So what is the correct question? From your family doctor’s viewpoint, a clinical trial answers the question “If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?” That isn’t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking “If I prescribe drugX to *this patient*, the one sitting in my examination room, will the patient’s blood pressure improve?” There is only one patient, so there is no such thing as “on average.”

If your doctor has a masters degree in statistics, the question might be phrased as “If I prescribe drugX to this patient, what is the posterior probability that the patient’s blood pressure will improve?” And that’s a bayesian question.Let’s run through a small toy example. We will run a 500 patient clinical trial on drugX. All the patients have “moderately high” blood pressure, and are of similar age, health and family history, and so on. We will measure whether or not drugX reduces their blood pressure to “normal” — somewhere in the region of 120/80. The control group will be on the sort of diet recommended for hypertensive patients(say a low-sodium, low-cholesterol, high-fiber diet) and will take a placebo. The treatment group will be on the same diet, plus drugX.

Now suppose that the diet alone will normalize blood pressure in about 10% of the population. And also suppose that (unknown to the researchers) there is a hidden factor HF (a genetic factor, perhaps) that moderates whether or not drugX actually works. There are two types of people. 90% of the population are HFA, and drugX has no effect on them. 10% of the population are HFB, and drugX completely normalizes blood pressure for 95% of the HFB population.

So you, the omnipotent readers of this article, now know that drugX is only effective on about 9.5% of the general population, although an overlapping 10% of the general population will show improvement from diet alone. This gives you the luxury of comparing the “right answer” with what could happen in an actual experiment. Now let’s see what might be observed.

Here’s some R code to simulate the trial.

# # 2 populations: HFA, HFB. A not affected by the drug # n = 500; spontaneous = 0.1 # effectiveness of diet alone effectiveness = c(0, 0.95) names(effectiveness) = c("HFA", "HFB") # set the HF for the population hfcoin = runif(n) hf = ifelse(hfcoin < 0.9, "HFA", "HFB") # assign control and treatment groups group = runif(n) group = ifelse(group<0.5, "control", "drug") # assign outcomes spontcoin = runif(n) drugcoin = runif(n) outcome = ( (spontcoin < spontaneous) | (drugcoin < effectiveness[hf]*(group=="drug")) ) expframe=data.frame(group=group,hf=hf, improved=outcome)

Here are the summaries I got when I ran the code:

> summary(expframe) group hf improved control:255 HFA:449 Mode :logical drug :245 HFB: 51 FALSE:437 TRUE :63 NA's :0 > with(expframe[group=="control",], table(hf, improved)) improved hf FALSE TRUE HFA 209 16 HFB 26 4 > with(expframe[group=="drug",], table(hf, improved)) improved hf FALSE TRUE HFA 201 23 HFB 1 20 > tab = with(expframe, table(group, improved)) > tab improved group FALSE TRUE control 235 20 drug 202 43

The last contingency table, `tab`

, is the only of the above summaries known to the researchers. From it, you can see that the drug group had a 100*43/(202+43) = 17.5% improvement rate, and the control group had a 7.8% improvement rate. So, empirically, drugX more than doubled the probability of improvement (17.5/7.8 = 2.25 — this is called the *risk ratio*). If you think in odds like a gambler does (odds of improvement are 20 to 235 for the control group), then we have also more than doubled the odds of improvement ( (43/202)/(20/235) = 2.5 — this is called the *odds ratio*). Now we want to test if these results are real (and not a fluke).

**Frequentist Approach**

One way to check the significance of the results (from a frequentist viewpoint) is check whether the contingency table `tab`

is independent. Under the null hypothesis that improvement is independent of whether or not the patient took the drug, the odds of improvement should be the same for both the control and the drug groups. We can test this using Fisher’s Exact Test for Count Data (or we can use the chi-squared test, which is an approximation of Fisher’s exact test). In Fisher’s test, the null hypothesis is that the odds ratio is 1.

> fisher.test(tab) Fisher's Exact Test for Count Data data: tab p-value = 0.001158 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 1.384802 4.635614 sample estimates: odds ratio 2.496745

So now we know that our results are significant to the 0.05 level (in fact, to the 0.01 level: if the drug had no effect, we would see a result this good or better no more than 1% of the time). We also know that if our estimate of the odds ratio is correct, then when other researchers repeat our experiment, 95% of the time they will capture the true value in a confidence interval similar to ours– which is bounded away from one. So we can reject the null hypothesis and assume that drugX will increase the improvement rate in the population, relative to diet alone.

**Bayesian Approach**

But what about the poor patient sitting in the doctor’s examination room? What are the chances that *his* blood pressure will improve if he takes drugX? Roughly 17%, which is better than the 10% chance from dietary changes alone, but still isn’t very high. Let’s verify this statement using the bayesian approach.

The bayesian approach assumes that the quantity that you are interested in, in this case the rate of improvement *p*, is distributed according to some distribution Prior(*p*). Once you have a set of observations, *x*, you update your estimate of the distribution to

Posterior(*p* | *x*) = C * Prior(*p*) * f(*x* | *p*),

where f is the probability of the data conditioned on the parameter, and C is the total probability of the data over all possible settings of the parameter. Usually, calculating C is hard. Fortunately, for some common scenarios, like coin-flipping, calculating the posterior is quite easy.

Estimating the improvement rate *p* of drugX is a coin-flipping problem, where *p* is the (unknown) probability of the coin coming up heads. If you model the coin as a binomial distribution, and the distribution of *p* as a Beta distribution:

then the posterior is also a Beta distribution, with α’ = α + nheads and β’ = β + ntails.

The mode of the distribution (which is what is usually used as a point estimate for *p*) is

The mean of the distribution (which is close to the mode when α and β are large) is

Now back to our problem. Suppose that we already knew (never mind how) that the hypertension improvement rate from diet alone was about 10%. We can set the prior to have a mean value of 0.1 by setting α = 0.1 and β = 0.9. That looks like this:

It’s an nasty prior — notice it goes to infinity at both 0 and 1 — but it spreads the probability mass all along the unit interval, which is what we want, since we don’t want to start with a very strong bias about the improvement rate. Another common prior is the Jeffrey’s prior: α = β = 0.5. The Jeffrey’s prior is maximally uninformative (or minimally biased) and has a mean of 0.5.

Now let’s calculate the posterior, its mean and its mode, in R:

# The mean of the Beta distribution beta_mean = function(alpha, beta) alpha/(alpha+beta) } # The mode of the Beta Distribution beta_mode = function(alpha, beta) (alpha+1)/(alpha+beta-2) } # prior, mean 0.1, mode not defined alpha = 0.1 beta = 0.9 # The values from the contingency table for the experiment improved.control = tab[1,2] # 20 notimproved.control = tab[1,1] # 235 improved.drug = tab[2,2] # 43 notimproved.drug = tab[2,1] # 202 # update the distribution for the treatment group alpha.drug = alpha + improved.drug beta.drug = beta + notimproved.drug # calculate the mean and the mode for the treatment group beta_mean(alpha.drug, beta.drug) # 0.1752033 beta_mode(alpha.drug, beta.drug) # 0.1807377 # update the distribution for the control group alpha.control = alpha + improved.control beta.control = beta + notimproved.control # calculate the mean and the mode for the control group beta_mean(alpha.control, beta.control) # 0.07851563 beta_mode(alpha.control, beta.control) # 0.08307087 # plot both distributions to compare # the function dbeta() returns the value of the distribution # at point x, for a given alpha and beta x=seq(from=0.0, to=0.3,by=0.005) frame=melt(data.frame(x=x, control=dbeta(x,alpha.control,beta.control), drug=dbeta(x,alpha.drug,beta.drug)), measure.vars=c("control", "drug"), variable.name="treatment", value.name="y") ggplot(frame, aes(x=x,y=y,color=treatment)) + geom_line()

The means and modes of both distributions are about where we estimated them from the naive calculations directly on the contingency table; if we use the mode as our point estimate of the improvement rates for both groups, then the spreads of the distributions give us the uncertainty around that estimate, based on the size of our data sample. The distributions don’t overlap much (the result we expected, based on our frequentist analysis); the two populations do in fact have different improvement rates. The difference (mostly a philosophical one, perhaps) is that this analysis gives us *directly* what our family doctor wants: an estimate of the posterior probability of a patient’s blood pressure improving when prescribed drugX. We can calculate what is called the *95% credible interval* for each distribution: the interval that with 95% probability contains the true improvement rate:

credible_interval= function(conf, alpha, beta){ p = (1-conf) lower = p/2 upper = 1-lower c(qbeta(lower, alpha, beta), qbeta(upper, alpha, beta)) } credible_interval(0.95, alpha.drug, beta.drug) # 0.1303853 0.2250201 credible_interval(0.95, alpha.control, beta.control) # 0.04887709 0.11437549

Based on this data, if you take drugX for your hypertension, the probability of normalizing your blood pressure is likely somewhere in the range of 13 to 22 percent, compared to 4.8 to 11 percent from diet alone. So you will improve your chance of normalizing your blood pressure — but it’s more likely that your blood pressure will remain high.

The credible interval, by the way, is what most people *think* the confidence interval is. With 95% probability (based on the available evidence), the true improvement rate is in the 95% credible interval. The *95% confidence interval* is the interval that is produced by a construction procedure such that, if you repeated the experiment again and again, the constructed confidence interval contains the true improvement rate 95% of the time. This still makes it likely that you’ve bracketed the true improvement rate, and in practice, the confidence interval is probably a good stand-in for the credible interval. It’s just not really answering the question you actually asked, philosophically speaking.

**The Hidden Factor**

Suppose the researchers had suspected that the hidden factor HF might be implicated in the drug’s performance, and had been able to measure it in the experiment.

tabfull = aggregate(numeric(dim(expframe)[1])+1, by=list(expframe$group, expframe$hf, expframe$improved), FUN=sum) > tabfull Group.1 Group.2 Group.3 x 1 control HFA FALSE 209 2 drug HFA FALSE 201 3 control HFB FALSE 26 4 drug HFB FALSE 1 5 control HFA TRUE 16 6 drug HFA TRUE 23 7 control HFB TRUE 4 8 drug HFB TRUE 20

In this case, we can also estimate the posterior probabilities of improvement for each group, using the bayesian approach. I’ll just give you the graph.

From this evidence, HFB people taking drugX have better than 75% probability of improving their blood pressure; everyone else has probability less than 25%. Just looking at the modes of the distributions, you might naively think that HFA people also have a higher improvement rate when they are taking drugX, or that HFB people have a higher improvement rate than HFA people even in the control group. But the distributions overlap substantially; there is no real evidence that the three groups on the left of the graph have different improvement rates. In other words, if your family doctor knows that you are type HFB, it would make sense to prescribe drugX for your high blood pressure; if you are type HFA, then it doesn’t.

This is the kind of reasoning promoted by the personalized medicine movement. In fact it is what your family doctor already tries to do, by taking into account your family and previous health history, and so on. So far, your doctor can only do this in a negative way — if you have a family history of colon cancer, then start your annual colonoscopies sooner, otherwise, don’t bother — and as far as I know (though I’m not a doctor or a medical researcher) most published medical research isn’t designed to help doctors make “bayesian type” assessments in a more positive way.

**But Don’t Throw Out Frequentism**

So we’ve established that determining individual patient outcomes is a bayesian question. You might then wonder why anyone would use the frequentist approach at all. But some problems really are frequentist. A medical practitioner who is in public health rather than in a direct patient care practice is interested in the effects of treatments over entire populations, rather than on individuals. Similarly, an insurance company that is deciding whether or not to approve coverage for drugX is interested in whether the drug helps anyone, at all, or if the drug is no better than diet alone. In those situations, a frequentist analysis of drugX does in fact answer the question that is being asked.

An improper prior is one which doesn’t integrate (i.e. can’t be normalized), not one that goes off to infinity. A Beta(0.1, 0.9) prior is perfectly proper! And you really ought to replace the factorials with Gamma functions in the pdf; it was invented to extend the factorial to non-integer arguments, after all…

Nice work! Here is a thought on the Null Hypothesis. http://www.statisticsblog.com/2013/04/sudden-clarity-about-the-null-hypothesis/

@StatsLover

Edit: the more I think about it, the more I like your point!

I think frequentist inability to reject a null hypothesis is meant to try and represent even more uncertainty than an uninformative prior (though at the expense of being able to represent much less structure!). For example the standard Jeffrey’s prior on the binomial has mean 1/2, so if somebody told you they would give you two dollars every time a coin came up heads and you pay a dollar each time it came up tails your “uninformative prior” would still want you to take the bet. The frequentist would possibly not play.

@StatsLover Thanks, you are correct on both counts. Definitely needs to be Gammas (I’ll ask the author to fix that ASAP). As far as improper, some Jeffreys priors are improper in the missing moments sense (like the uniform prior for a location of a Gaussian mean). I think the author had assumed (without checking) that Beta(1/2,1/2) was missing moments- but it seems to have a complete moment generating function (so isn’t really improper).

@Jared You are right. Thanks for catching that. Fixed.

I get errors running this part:

> frame=melt(data.frame(x=x,

control=dbeta(x,alpha.control,beta.control),

drug=dbeta(x,alpha.drug,beta.drug)),

measure.vars=c(“control”, “drug”),

variable.name=”treatment”,

value.name=”y”)

Error: object ‘y’ not found

> ggplot(frame, aes(x=x,y=y,color=treatment)) + geom_line()

Error in eval(expr, envir, enclos) : object ‘y’ not found

I’m a noob, so it’s probably something that I am doing wrong.

@Dammit

I tried the commands and it seems to work. I am guessing you need to make sure you have current R (3.0.0), ggplot2 and reshape2 installed; plus start a few code-blocks back (to make sure tab is there). Each command requires effects from previous commands to be around and the error message might be misleading. You will see slightly different results as the random number generator was not set at a fixed seed.

Very nice summary of this article from @statisticsblog on Twitter: “Bayesian vs Frequentist: Ask the right question & the approach is obvious.”