Some researchers (in both science and marketing) abuse a slavish view of p-values to falsely claim credibility. The incantation is: “we achieved p = x (with x ≤ 0.05) so you should trust our work.” This might be true if the published result came from a single project (and was not the sole shared result from a longer series of private experiments). It really points to the fact that even frequentist significance is a subjective and intensional quantity (an accusation usually reserved for Bayesian inference). In this article we will comment briefly on the negative effect of unreported repeated experiments and what should be done to compensate.
The most commonly reported significance is the frequentist p-value. Formally the p-value is the probability that a repeat of the current experiment would show an effect at least as large as the current one, assuming the null hypothesis that there is in fact no effect present. This is frequentist because we are assuming an unknown fixed state of the world and placing the variation in the possibility of alternative or repeated experiments. The issue is: significance tests are neither as simple as one would like nor as powerful as one would hope. Usually significance is misstated (through sloppiness, ignorance, or malice) as being the chance the given result is false. Wrongly rejecting the null hypothesis is only one possible source of error, so a low p-value is a necessary but in no way sufficient condition for having a good result. False positives of this sort are not reproducible and show what is called reversion to mediocrity.
The Bayesian version of such a test would assume a prior distribution on the unknown quantity and hope to infer a low posterior probability for the “no effect” alternative. This leads to a calculation similar to the frequentist one, but with the ability to interpret a low probability of mistake as a high probability of success. An issue with the Bayesian analysis is that you must supply priors, so your conclusion is dependent on and sensitive to your choice of priors (another possible avenue of abuse).
At best what a p-value represents is the degree of filtering the experiment has (under ideal conditions) against non-results. Run 100 experiments at p=0.05 and you expect to see about 5 results that appear to be good, even if there was in fact no improvement to be measured. This is unfortunately standard practice for many. It is not enough to work hard on many projects and report your good results; see: “Why Most Published Research Findings Are False” John P A Ioannidis, PLoS Med, 2005 vol. 2 (8) p. e124; and “Does your model weigh the same as a Duck?” Ajay N Jain and Ann E Cleves, J Comput Aided Mol Des, 2011 vol. 26 (1) pp. 57-67. Shotgun-style A/B testing of pointless variations is also particularly problematic (see “Most winning A/B test results are illusory” Martin Goodson, qubitproducts.com, 2014). Projects like 41 blues are not only bad design, they are likely bad science.
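The arithmetic behind the "100 experiments at p=0.05" remark can be sketched in a few lines of Python. This is an illustration only: the function names are made up for this example, and the simulation assumes that under the null hypothesis each experiment's p-value is an independent uniform draw on [0, 1] (true for idealized continuous tests, only approximately true in practice).

```python
import random


def expected_false_positives(n_experiments=100, alpha=0.05):
    """Average number of no-effect experiments that still pass the threshold."""
    return n_experiments * alpha


def prob_at_least_one(n_experiments=100, alpha=0.05):
    """Chance that at least one no-effect experiment passes (independence assumed)."""
    return 1.0 - (1.0 - alpha) ** n_experiments


def simulate_false_positives(n_experiments=100, alpha=0.05, seed=0):
    """Simulate one batch of null experiments and count the apparent successes.

    Assumption: a null p-value is modeled as a single uniform draw on [0, 1].
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_experiments))


if __name__ == "__main__":
    print(expected_false_positives())   # about 5 apparent successes on average
    print(prob_at_least_one())          # better than a 99% chance of at least one
    print(simulate_false_positives())   # count from one simulated batch
```

The point of the simulation is that a run of 100 worthless experiments almost never comes back empty-handed, so "we found something at p=0.05" carries little weight if the other 99 attempts go unreported.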
Combine a large number of bad hypotheses with the impossibility of “accepting the null hypothesis” and you have no reason to believe any result on mere first report. The issue being: while only 5% of the tests run falsely appear to succeed, if mostly useless experiments are run it can easily be that nearly 100% of what gets published and acted on is false. A stream of nonsense can drown out and hide the more expensive and rarer actual good work, if your filter is sloppy enough. Add in the inability to reproduce results and you have a large problem.
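To make the "nearly 100% of what gets published" claim concrete, here is a minimal sketch of the underlying arithmetic. The names `prior_true` (the fraction of tested ideas that really work) and `power` (the chance a real effect passes the test) are assumptions introduced for this illustration, as are the example values; none of them are figures from the cited papers.

```python
def false_share_of_positives(prior_true, power, alpha=0.05):
    """Fraction of apparently successful tests that are actually false positives.

    prior_true: assumed fraction of tested ideas that have a real effect
    power:      assumed chance a real effect passes the significance test
    alpha:      significance threshold (chance a no-effect idea passes)
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    total = true_positives + false_positives
    if total == 0.0:
        raise ValueError("no test can ever pass with these inputs")
    return false_positives / total


if __name__ == "__main__":
    # Hypothetical mixes of good and useless ideas, all tested at alpha = 0.05.
    for prior in (0.5, 0.01, 0.001):
        share = false_share_of_positives(prior_true=prior, power=0.8)
        print(f"prior_true={prior}: {share:.1%} of apparent successes are false")
```

Under these assumed inputs, when half the tested ideas are good only a small share of the apparent successes are false, but when one idea in a thousand is good almost all of them are, even though the threshold never changed.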
We want to comment on two questions: why would a researcher submit a bunch of bad work to testing, and surely there is an easy fix?
Why are unsubstantiated work and ideas submitted for testing? Ideally testing is a means of scientific confirmation: you submit an idea that has good reasons to work in principle and then confirm the improvement in performance with a test. In fact, to correctly design an A/B test you must propose the smallest difference you expect to detect. The reasons you get so many meaningless changes submitted as meaningful experiments are varied. First, A/B testing has been sold as a way to avoid bike shedding (avoiding the debate of meaningless differences by attempting to test meaningless differences). Also, you get what you reward: if there is a benefit (getting a publication or bonus) for having the appearance of a good result, then you will eventually only get results that merely appear to be good. Once people figure out that the appearance of success is rewarded, your field becomes dominated by shotgun studies (proposing many useless variations is easier than inventing a plausible improvement) using a fixed p-threshold (p=0.05, because you are not traditionally allowed to get away with p any higher, and p any lower just makes it take longer to appear to succeed).
There is an easy fix: apply the Bonferroni correction. This is just a fancy way of saying: if we allow somebody to submit 10 ideas to test and report success if any of them look good, then we need to tighten the test criterion. If we are convinced that p=0.05 is a valid threshold for a single test (which should not be automatic; just because everybody uses p=0.05 doesn’t mean you should) then we should force somebody submitting 10 tests to run each test at p=0.005 to try and compensate for their venue shopping. A possible Bayesian adjustment would be to force the prior estimate of the probability of success to fall linearly in the number of experiments run.
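The Bonferroni arithmetic for the 10-test example can be sketched as follows. The function names are made up for this illustration, and the family-wise error formula assumes the tests are independent (the Bonferroni threshold itself does not need that assumption).

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def familywise_error_rate(per_test_alpha, n_tests):
    """Chance at least one of n_tests no-effect tests passes.

    Assumption: the tests are independent of each other.
    """
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


if __name__ == "__main__":
    n = 10
    corrected = bonferroni_threshold(0.05, n)
    print(corrected)                            # per-test threshold, 0.05 / 10
    print(familywise_error_rate(0.05, n))       # roughly 0.40 without correction
    print(familywise_error_rate(corrected, n))  # just under 0.05 with correction
```

Without the correction, ten shots at p=0.05 give roughly a 40% chance of reporting at least one spurious success; with each test held to p=0.005 the overall chance drops back to just under 5%.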
Tests are filters. What p-value you should use is not set in stone at p=0.05. It depends on your prior model of the distribution of items you are going to test (are we confirming experiments thought to work, or are we running through a haystack looking for a rumored needle?) and your estimates of the relative costs of type-1 versus type-2 errors (is this an early screen where false negatives are to be avoided, or a final decision where false positives are to be avoided?). With a good loss model and prior estimates it is mere arithmetic to pick an optimal p-value.
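One way to see the "mere arithmetic" is a small grid search over candidate thresholds. Everything specific here is an assumption made for illustration: the test is modeled as a one-sided z-test, `effect_z` is a hypothetical expected z-score when the effect is real, and the costs and priors are invented numbers. It is a sketch of the idea, not a recommended procedure.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sigma 1


def power_one_sided_z(alpha, effect_z):
    """Chance a real effect passes a one-sided z-test run at threshold alpha.

    effect_z is the assumed expected z-score when the effect is real.
    """
    critical = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - _STD_NORMAL.cdf(critical - effect_z)


def expected_loss(alpha, prior_true, effect_z, cost_false_pos, cost_false_neg):
    """Average cost per tested idea under the assumed priors and costs."""
    miss_rate = 1.0 - power_one_sided_z(alpha, effect_z)
    return (prior_true * miss_rate * cost_false_neg
            + (1.0 - prior_true) * alpha * cost_false_pos)


def best_threshold(prior_true, effect_z, cost_false_pos, cost_false_neg):
    """Pick the lowest-loss threshold from a coarse grid of 0.001 .. 0.499."""
    grid = [i / 1000.0 for i in range(1, 500)]
    return min(grid, key=lambda a: expected_loss(
        a, prior_true, effect_z, cost_false_pos, cost_false_neg))


if __name__ == "__main__":
    # Confirming ideas thought to work versus searching a haystack.
    print(best_threshold(prior_true=0.5, effect_z=2.5,
                         cost_false_pos=1.0, cost_false_neg=1.0))
    print(best_threshold(prior_true=0.01, effect_z=2.5,
                         cost_false_pos=1.0, cost_false_neg=1.0))
```

With these made-up inputs, the confirmation setting tolerates a much looser threshold than the haystack setting, which is the qualitative point: the right threshold follows from the priors and the costs, not from convention.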
Experimental design and significance encompass the whole experimental process. To calculate correct significances you must include facts about many experiments, not just a given single experiment. You must think in terms of actual probability of correctness, not mere procedures.