## Worry about correctness and repeatability, not p-values

In data science work you often run into cryptic sentences like the following:

Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend).

(From “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et. al. BMJ 2008;337:a439.)

The accepted procedure is to recognize “p” or “p-value” as shorthand for “significance,” keep your mouth shut and hope the paper explains what is actually claimed somewhere later on. We know the writer is claiming significance, but despite the technical terminology they have not actually said which test they actually ran (lm(), glm(), contingency table, normal test, t-test, f-test, g-test, chi-sq, permutation test, exact test and so on). I am going to go out on a limb here and say these type of sentences are gibberish and nobody actually understands them. From experience we know generally what to expect, but it isn’t until we read further we can precisely pin down what is actually being claimed. This isn’t the authors’ fault, they are likely good scientists, good statisticians, and good writers; but this incantation is required by publishing tradition and reviewers.

We argue you should worry about the correctness of your results (how likely a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance is in your estimation procedure, as measured by procedures like the bootstrap). p-values and significance are important in how they help structure the above questions.

The legitimate purpose of technical jargon is to make conversations quicker and more precise. However, saying “p” is not much shorter than saying “significance” and there are many different procedures that return p-values (so saying “p” does not limit you down to exactly one procedure like a good acronym might). At best the savings in time would be from having to spend 10 minutes thinking which interpretation of significance is most approbate to the actual problem at hand versus needing a mere 30 seconds to read about the “p.” However, if you don’t have 10 minutes to consider if the entire result a paper is likely an observation artifact due to chance or noise (the subject of significance) then you really don’t care much about the paper.

In our opinion “p-values” have degenerated from a useful jargon into a secretive argot. We are going to discuss thinking about significance as “worrying about correctness” (a fundamental concern) instead of as a cut and dried statistical procedure you should automate out of view (uncritically copying reported p’s from fitters). Yes “p”s are significances, but there is no reason to not just say what sort of error you are claiming is unlikely.

We started with bad writing on significance, let’s share an example of good writing:

Suppose that in an agricultural experiment four different chemical treatments of soil produced mean wheat yields of 28, 22, 18, and 24 bushels per acre, respectively. Is there a significant difference in these means, or is the observed spread due simply to chance?

(From Schaum’s Outlines, “Statistics” Fourth Edition, Murray R. Spiegel and Larry J. Stephens, McGraw-Hill, 2008.)

From the above paragraph you have some idea of what is going on and why you should care. Imagine you were asked to choose one of these soil treatments for your farm. You would want the one that actually is the best, not one that managed to fool you once. You care about actual correctness. The mathematician Gian-Carlo Rota called out an earlier version of this text as being the one that finally explained to him what the purpose of analysis of variance was. Things like this are why this book has been in continuous print since 1961.

From this same book:

When the first edition was introduced in 1961, the p-value was not as widely used as it is today, because it is often difficult to determine without the aid of computer software. Today p-values are routinely provided by statistical software packages since the computer software computation of p-values is often a trivial matter.

People knew how to do statistics properly before 1961, so they probably had interesting methods that work around the need for explicit p-values. In this article we will demonstrate R code to take all of the small steps to organize our data and produce summaries demonstrating the nature, correctness and repeatability of experimental results. There are things you should look for (small counts, large error bars and overlapping distributions) that give diagnostic clues long before you make it to the p-values.

There is nothing intrinsically wrong with p-values, but I hold that the slavish copying of them from computer results into reports has distracted us away from thinking about important issues of correctness, repeatability and significance.

What is significance? Significance is usually an estimate of the probability of some event you don’t want to happen looking like a favorable event you saw in your experimental data. It is a frequentist idea and you introduce a straw-man explanation of the data (that you hope to falsify or knock down) called the “null hypothesis.” For the soil treatment example it could be the probability that what we are identifying as the best soil treatment is actually an inferior soil treatment that “got lucky” during the measurements. We wish to show that the odds of this kind of error are low, and such low odds of error are called “high significance.” High significance does not guarantee you are not making a mistake (for one your modeling assumptions could be wrong). However, low significance is usually very bad. It says even assuming everything is the way you hoped, there is still a significant chance you are wrong. That should matter to you.

Even if you think significance doesn’t matter to you, it will matter to your clients, managers and peers. If you promote work that you has low significance (or worse yet, you haven’t checked some form of significance) you are promoting work that not only may fail, you are promoting work that may have an obvious large chance of failure. That can go over poorly in a project post-mortem. You should always work with the feeling that someday “the truth will out.” Not only will more data be collected in the future, but it will be obvious if you had enough evidence to justify the decisions you made earlier in a project. For example in Bob Shaw’s short story “Burden of Proof” a crime is committed in front of a device that will play the crime back years in the future. In the story there is no way to get the playback sooner, but knowing that someday the truth will out intensifies the detective dilemma: they can’t just make a convincing case they must make the right case. In the real world you can’t always be right, so you should measure and share your level of uncertainty. This why you calculate significance and why you want to effectively communicate what significance means to your possibly non-technical partners.

Let’s work through the significance claim from the paper relating muscular strength to mortality rates that we started with. The claim of the paper is that in a study of 10,000 people over an average follow-up period of 19 years that a statistically significant predictor of mortality rate was weakness in certain muscular strength tests. The paper goes on to claim that this relation is significant even when accounting for age and disease conditions. We can check what the paper claims on the relation between strength and mortality (as we can see the raw numbers in their reported tables), but we can’t check if the effect remains when controlling for other conditions as we don’t have enough of their data to reproduce that second analysis. So let’s end this article on a concrete note by exploring the relation between strength and mortality using the statistical package R.

From the paper’s tables 1 and 2 we can find the number of people in each of the muscular strength groups (called “lower”, “middle”, and “upper) and the number of deaths in each of these groups. The raw numbers are (typed by hand into R):

> # from tables 1 and 2 of http://www.bmj.com/content/337/bmj.a439 > d = data.frame(MuscularStrength=c('lower','middle','upper'), count=c(2920,2919,2923),deaths=c(214,143,146)) > # make upper strength the reference level > d$MuscularStrength = relevel(as.factor(d$MuscularStrength), ref='upper') > print(d) MuscularStrength count deaths 1 lower 2920 214 2 middle 2919 143 3 upper 2923 146

The obvious thing to look at is the death rates:

> # quickly look at rates and typical deviations (sqrt of variance) > # http://en.wikipedia.org/wiki/Binomial_distribution > d$MortalityRate = d$deaths/d$count > d$MortalityRateDevEst = sqrt(d$MortalityRate*(1-d$MortalityRate)/d$count) > d$MortalityRateDevBound = sqrt(0.25/d$count) > print(d) MuscularStrength count deaths MortalityRate MortalityRateDevEst MortalityRateDevBound 1 lower 2920 214 0.07328767 0.004822769 0.009252915 2 middle 2919 143 0.04898938 0.003995090 0.009254500 3 upper 2923 146 0.04994868 0.004029222 0.009248166

It looks like the lower strength group has a mortality rate of about 7% over the 19 year interval which is above the 5% rate we see for the other two groups. This result is likely significant because we see each of these estimates has a standard error of around 0.5%, much lower than the 2% difference between groups we are seeing. In relative terms it looks like you can cut your mortality rate by 40% by not being in the lower muscle performance group. Notice also that these rates are radically different from the 3.89%, 2.59% and 2.66% reported in the summary. This is because we are using different units (our case deaths per study individual and theirs deaths per year) and the actual study is breaking deaths up into different causes.

From a business perspective at this point, using the rule of thumb that at least 3 standard errors is significant, we are done. The table above is exactly the right thing to show the client. They can see the important things: the size of the study, the general mortality rates, the size of the effect, and the likely error in measurement. The likely errors in measurement we have added to the table are the likely errors we would see in re-running a study such as this one. That is: it describes the distribution of results a second researcher trying to reproduce our result might see even when our estimates were exactly right. If there estimated deviations are large then we know even if we were right our work is unlikely to be reproduced with sample sizes similar to what we used (which is bad). If the estimated deviations are small then *assuming we are right* others should be able to reproduce our work (which is good).

We can produce a graphical version of this table as follows.

# plot likely observed distributions of deaths, assuming # rates are exactly as measured > library('ggplot2') # plotting > plotFrame = c() > deaths=0:sum(d$deaths) > for(i in 1:(dim(d)[[1]])) { row = as.list(d[i,]) plotData = data.frame(deaths=deaths, DeathRate=deaths/row$count, MuscularStrength=row$MuscularStrength, density=dbinom(deaths,size=row$count,prob=row$deaths/row$count)) plotFrame = rbind(plotFrame,plotData) } > ggplot() + geom_line(data=plotFrame, aes(x=DeathRate,y=density,color=MuscularStrength, linetype=MuscularStrength)) + geom_vline(data=d,aes(xintercept=deaths/count,color=MuscularStrength, linetype=MuscularStrength))

This produces the following plot which shows, assuming we have measure the mortality rates for the three groups correctly, how follow-up studies (using the same data-set size we used) would likely look. The important thing to notice is how little the 3 distribution groups overlap with each other (and how little of each of them crosses the center line of the others).

At this point we have addressed the variance of our estimation procedure (an issue the bootstrap method also works on) which speaks to the repeatability of our work. We have not yet touched on the correctness (such as measuring the variance of a null hypothesis would help with) or specific-ness of our result (such as forming Bayesian estimates of the posterior distributions of the three mortality rates would help with).

While we haven’t quite gotten to the traditional frequentist significance interpretation yet, we are very close. In the frequentist notion of statistics instead of assuming our measurements are correct (which is a dangerous habit to get into) we pick something we don’t want to happen and try to show our data is very unlikely under such an assumption. For example we could assume that lower physical strength group has the same mortality rate (around 5%) as the other groups (which is bad as implies our 7% measurement is then wrong). The frequentist significance then calculates how often a 5% mortality rate group would return a sample with a 7% mortality rate. This fact is already represented on our graph has how much of the middle and lower distributions cross the lower-groups center line (in this case almost none of the density does this). So we do in fact have strong evidence of both a reproducible and significant result already in our graph and table.

In this case it is okay to leave significance un-calculated and informal. The important follow up questions are ones of practicality and causation: can we change people’s strength group and would changing their strength group change their mortality rate? These are the important questions, but they would have to be answered by new experiments as they can’t be addressed with just the data at hand. And that is why I don’t suggest spending too long on significance with this data, observational error (the topic of significance) in this particular case it is only one worry among many.

Of course not all studies work out this well. Many experiments generate results that are near the boundary of significance and non-significance. So we very much need to know how to precisely estimate significance. The easiest way to do this is to apply the appropriate model (in this case logistic regression) and copy the significances from the model’s supplied summary report. Before we can fit a model we must re-shape our data into an appropriate format. We could use the melt and cast operators from Hadley Wickham‘s reshape2 package (as illustrated in Your Data is Never the Right Shape), but we will instead use the join operator in his plyr package. We feel in this case the notion of joining more accurately expresses the fluid transformations you need to be able to perform on data. The commands to create the new data format are as follows:

# convert data into a longer format and get at same facts as in a model > library('plyr') # joining data > outcomes = data.frame(outcome=c('survived','died')) > outcomes$dummy = 'a' > d$dummy='a' > joined = join(d,outcomes,by=c('dummy'),type='full',match='all') > joined$count = ifelse(joined$outcome=='survived', joined$count-joined$deaths, joined$deaths) > data = subset(joined,select=c(MuscularStrength,count,outcome)) > print(data) MuscularStrength count outcome 1 lower 2706 survived 2 lower 214 died 3 middle 2776 survived 4 middle 143 died 5 upper 2777 survived 6 upper 146 died

And now that we have prepared, the step of interest is essentially a one-liner:

> model = glm(outcome=='died'~MuscularStrength, weights=data$count,family=binomial(link='logit'),data=data) > summary(model) Call: glm(formula = outcome == "died" ~ MuscularStrength, family = binomial(link = "logit"), data = data, weights = data$count) Deviance Residuals: 1 2 3 4 5 6 -20.30 33.44 -16.70 29.37 -16.87 29.58 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.94552 0.08491 -34.691 < 2e-16 *** MuscularStrengthlower 0.40827 0.11069 3.688 0.000226 *** MuscularStrengthmiddle -0.02040 0.12068 -0.169 0.865746 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 3851.3 on 5 degrees of freedom Residual deviance: 3831.6 on 3 degrees of freedom AIC: 3837.6 Number of Fisher Scoring iterations: 6

What we are looking for is the value in the column named "Pr(>|z|)" for the row MuscularStrengthlower. This is the so-called p-value (notice not even the fitting software is so rude as to use just "p") and in this case it is 0.000226 which says there is about a one in four thousand chance of seeing a relation this strong between muscular strength and mortality if there were in fact no correlation (the so-called null hypothesis). The result is in fact statistically significant.

As a word of warning look at the deviance and null deviance reported in this model (3831.6 and 3851.3) respectively. Treating deviance as an analogy for variance we see our model explains only 0.5% of the variation outcomes (who dies and who lives). This is why to not rely too much on global variance style measures when evaluating models: they behave too much like accuracy and miss a lot of what is going on. Notice instead this model identifies a group of around 1/3rd of the subjects that if muscle weakness was in fact causing mortality (it might not be) perhaps exercises could be used to reduce the mortality rate of this group by 40%. The hope is this would be an opportunity to cut down the mortality rate of the overall population by as much as 13%, which would be huge. The original authors know this, this is why they made it the title of their study. And this is the kind of thing you miss if you just look at modeling summaries instead of getting your hands in the data.

There are a few additional points to share here. In principal a statistician would see the estimates and standard errors as mere details of parameterization to be integrated out along the path to computing significance. To a business the actual values of the estimates, relative rates in different groups and sizes of expected errors are all of vital interest. Significance is a important check if these values are right. But we also want the actual values available for discussion and possible use. We feel working more with the data in small fluid steps is of more benefit to the data scientist and client that submitting data hopefully (in the right format) to a monolithic tool that quickly returns a single answer without exposing the intermediate tables, graphs and calculations. You lose a lot of opportunities to notice an anomaly when you don't look at the intermediate results.

In conclusion: you don't directly care about p-values; you care about correctness, repeatability and significance.

I’m curious about how the graph was generated. I don’t actually understand R, but looking up dbinom, it seems like you might be generating random data from the summary statistics. Perhaps that’s more convenient than downloading the raw data and transforming it, but in real life, wouldn’t it make more sense to plot the raw data whenever possible, so you can see (informally) how well it fits the expected distribution?

Brian, the graph is in some sense “random data,” it is the expected distribution around what we know. The problem is that all we have from the paper are total counts, but it doesn’t affect much as counts are what are called a “sufficient statistic” in that they contain all information we should care about.

But back to your point. Fitting yes/no data is really weird. For numeric data you can see how it fits the distribution or not (as you imply). For binomial yes/no data if 7 out of 100 are yes that is your plot. You could (if you had the raw data) look for serial correlation and streaks, but there isn’t any binomial equivalent of “has the same mean as and standard deviation as the normal, but way too heavy in the tails.” Or to state it better, come up with something like that and you have had a good idea.

From Rubin, D. B. (1981) “The Bayesian Bootstrap” The Annals of Statistics, 9(1), 130–134 we can get a good Bayes posterior distribution is proportional to a Dirichlet distribution with parameters (deaths-1,survivals-1) (corresponds to having a uninformative improper prior proportional to a Dirichlet density with parameters (-1,-1). This is essentialy identical to the binomial distributions we plotted, so you can consider the produced graph as a Bayesian posterior estimate of the distribution of the mortality rates.

Thank you for this article. It made much clearer a topic that has always been confusing to me. I have tried to simplified the source code used in the example here: https://gist.github.com/nachocab/5327857 Feel free to incorporate whatever parts you find useful to the original article.

@Nacho Caballero Thanks, your code is definitely cool and I love that you are sharing it.

@Nacho Caballero Also, I would love a similar example for continuous data, when you don’t know the density function.

Apologies; I’ve been skimming. So if I understand correctly, this is really a graphical representation of three actual numbers (the totals) and three estimates of how they would vary if we re-ran the study, based on certain assumptions. Six scalars total. I might make a rough analogy to a poll where we have the percentages for each choice and the margins of error.

I’m sure this is a statistically naive question, but it’s unclear to me why we are throwing away data so quickly or why the totals are “all we should care about”. I’m a curious person so why shouldn’t I look at all the data? For example, I might be interested in plotting deaths at each age. I would be interested in how to visualize the entire dataset so that it isn’t misleading. For two scalar dimensions, a scatter plot would show all the data and give us an idea of density. The places where the dots are far apart are the places where we have little data (outliers). A few dots seems like a good visual representation of “very little data here; beware”. For places with many dots it just turns black, but perhaps if it were colored by density like a weather map that would be useful?

Plotting the deaths in three buckets against age would result in three lines of dots. If we bucket the data again by age then that gives us a bar chart, or instead it might be colored in a similar way to show density.

However, I forgot that people entered the study at different ages and there’s no representation of the people who were still alive at the end of the study, and that should be represented as well. Perhaps we could plot each person as a horizontal line, beginning at age of entry and ending with age of death or the end of the study.

So this is getting complicated, and there’s still no representation of a margin of error. Is there somewhere I could go with this or is it just statistical nonsense? Am I wrong in thinking that it would be nice to plot all the data in a visually rich way?

@Brian Slesinsky Bunch of issues (all good).

First I would love to have the other data (ages and so on). They would be very important and I would deal with them by setting up a more detailed model. But, I couldn’t find a copy of the data where the researches showed them aligned (so I now the age and muscle category of the same individual). The paper only shared population totals for various features (which I could use to make some adjustments, but would be less convincing than having the actual data).

So when I say “only counts matter” that is for a population we are assuming (right or wrong) that are identical and exchangeable. And that is what is meant by a sufficient statistic, any other fact about the data (assuming identical independently distributed data) can be re-derived from the sufficient statistic.

For binomial outcomes we do get uncertainty bars- they are the widths of the plotted distributions (or more accurately an interval that picks up a lot of mass under the curve of the distribution). Binomial outcomes (yes/no) are always weird. If I were to plot the data we get a funny graph where 95% of the dots are at zero (didn’t die) and 5% of the dots are at 1 (did die) and I then claim the mean value is at 0.05 (moved 5 percent of the way from zero to one). But this 5% isn’t a typical value for the population (nobody was 5% dead). This is different for something like weight where if I say an average weight of a adult males is something like 140 pounds then there are also a lot of adult males near this weight. So for binomial except for finding correlations with more variables (like age) you don’t get the diagnostics you get for numeric statistics (where seeing a skew or bimodality is actually a negative diagnostic). yes/no data caries so little information you pretty much compute the rate and then don’t get any free diagnostics as to if you are right or wrong (without introducing new variables like date and looking for streaks and so on).

So you are right: deaths in three buckets is three piles of dots, the density plots represent (depending on which method I used to produced them) either the expected error in repeating the experiment or the location uncertainty in the best estimate of the error rate. The margin of error is the sqrt(p*(1-p)/n) term (also the visible widths) which is then compared to p (p being the measured mortality rate). You are definitely right to want rich visual presentations. Nina has a few new presentations in our upcoming book.

Just a note: we did not make it all the way to confidence intervals or credible intervals in this article. The issue is confidence intervals are not quite what the non statistician thinks they are and credible intervals require a discussion of priors. Definitely things we will get to in later articles.