The myth of objective statistics
In my last post, “To p or not to p”, I promised to write a follow-up post in which I would tell you all about Bayesian hypothesis testing. Although I can do that (and I will), I recognize that there is something dry about listing a bunch of properties about some statsy technique. You can learn about that during study-time already; that’s not why you’re on Mindwise, right?
As it happens, I have a much more pertinent issue to discuss with you: the idea that statistics provides some kind of ground truth and sits apart from the rest of science, in the sense that “we all agree on how it should be done” and “there are no more developments of note”. This is a notion I often encounter among my students: statistics is a tool, a technique we learn only to facilitate the real thing (science), much like patching a hole in a bike tyre has no purpose other than to facilitate the actual riding of the bike.
I propose to you, however, that statistics is not like that at all; it is a dynamic and fascinating field. Just words from a stuffy guy working in the stats department? I hope to convince you otherwise. And I shall actually use Bayesian hypothesis testing to make my case.
As you may recall from “To p or not to p”, traditional Null Hypothesis Significance Testing – as it is routinely taught in undergraduate courses – suffers from four problems:
- You cannot quantify evidence in favor of the null hypothesis
- You over-reject the null hypothesis in cases where both the null and the alternative are unlikely (i.e., the data is just unlikely, regardless of the true state of the world)
- P-values are hard to interpret
- P-values do not allow for sequential testing
However, Bayesian statistics do not fall prey to any of these problems. To illustrate the idea behind Bayes factors (again, briefly, this is not a statistics class), consider the following example, adapted from Dan Navarro’s wonderful textbook “Learning statistics with R”.
You wish to know whether or not it’s raining outside. Specifically, you have two hypotheses about the world:
- H0: it is dry
- H1: it is raining
You have one datum (singular of data), which is that you have seen me going out with an umbrella.
How does a Bayesian statistician approach this problem? They combine the prior (what we believe about the world before seeing data) with the likelihood (what the data tell us we should believe about the world) to create a posterior (what we now believe about the world, after having seen the data). Applied to our rain-problem, it might go something like this:
- Prior: there is a 45% chance it is raining (and therefore a 55% chance it is dry). You might base this on data from the KNMI (Dutch weather institute), who have catalogued that it typically rains on 45% of the days in May.
- Likelihood: what is the chance of me walking out with an umbrella, given that it rains? Let’s say that probability is 80% (I sometimes forget). We also need to know the chance of me walking out with an umbrella, given that it is dry. Let’s say that probability is 10% (time to dust off that absent-minded-professor stereotype).
- Posterior: we can calculate the probability that it is raining and I go out with an umbrella (.45*.8=.36), as well as the probability that it is dry and I go out with an umbrella (.55*.1=.055). These two scenarios encompass all states of the world (we already know that I did in fact go out with an umbrella), so our posterior belief about the hypotheses is:
- H0 (it is dry): .055/(.055+.36) = .13
- H1 (it is raining): .36/(.055+.36) = .87
So, based only on your observation of my umbrella, coupled with the information we have from the KNMI (as well as the likelihood of my remembering to use it), you conclude there is an 87% chance that it is raining (or that it will rain). In other words, your single observation of my behaviour greatly increases your estimate of the risk of rain, from the 45% suggested by the KNMI data alone to 87%. And there is no p-value involved.
If the numbers dazzle you, do not worry. The point of this is to provide you with an intuition of how this works: we combine our prior belief about the world with what we learn from the data to end up with a new (and hopefully more refined) belief about the world.
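For those who like to see the bookkeeping spelled out, here is a minimal sketch in Python that reproduces the arithmetic of the umbrella example. The numbers are the ones used above; the function itself is just an illustrative helper, not code from any particular package.

```python
# Minimal sketch of the Bayesian update in the umbrella example.
# All numbers come from the example above.

def posterior_rain(prior_rain, p_umbrella_given_rain, p_umbrella_given_dry):
    """Return P(rain | umbrella) and P(dry | umbrella)."""
    prior_dry = 1 - prior_rain
    # Joint probability of each hypothesis together with the datum (umbrella seen)
    joint_rain = prior_rain * p_umbrella_given_rain  # .45 * .80 = .36
    joint_dry = prior_dry * p_umbrella_given_dry     # .55 * .10 = .055
    evidence = joint_rain + joint_dry                # total probability of the datum
    return joint_rain / evidence, joint_dry / evidence

p_rain, p_dry = posterior_rain(0.45, 0.80, 0.10)
print(f"P(raining | umbrella) = {p_rain:.2f}")  # ~ .87
print(f"P(dry     | umbrella) = {p_dry:.2f}")   # ~ .13
```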
Bayesian hypothesis testing is not plagued by any of the four problems that make the p-value-wielding researcher’s life so difficult:
- We explicitly quantified evidence in favor of the null hypothesis (a 13% chance that it is dry).
- We explicitly specify both our null and our alternative hypothesis, so we do not bias against the null hypothesis in the face of unlikely data. We saw in the previous post how conventional NHST struggles with really unlikely data (Sally Clark’s two infant children dying) and, as a result, biases against the null hypothesis.
- Bayesian posteriors are easy to interpret: the probability that it is dry, given your a-priori information and the data you have observed, is 13%. The p-value does not allow one to calculate the probability that any hypothesis is true.
- We can continue collecting new data, calculate our new posterior each step along the way, and stop whenever we want without issues (see the sketch below). The p-value requires one to specify in advance exactly how many data points one will collect.
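To make that last point concrete, here is a small sketch of sequential updating in the spirit of the umbrella example (my own illustration: the extra observations and their likelihoods are made up, and I assume they are independent given each hypothesis). Each new posterior simply becomes the prior for the next observation, and you may stop at any point.

```python
# Sketch of sequential Bayesian updating: yesterday's posterior serves as
# today's prior, and data collection can stop at any point without penalty.

def update(prior_h1, likelihood_h1, likelihood_h0):
    """One Bayesian update step: return the posterior probability of H1."""
    joint_h1 = prior_h1 * likelihood_h1
    joint_h0 = (1 - prior_h1) * likelihood_h0
    return joint_h1 / (joint_h1 + joint_h0)

# Hypothetical stream of observations; each pair gives the probability of the
# observation under H1 ("it is raining") and under H0 ("it is dry").
observations = [
    (0.80, 0.10),  # I leave the building with an umbrella
    (0.70, 0.20),  # you hear dripping sounds outside
    (0.60, 0.30),  # a colleague walks in with a wet coat
]

belief_h1 = 0.45  # prior probability of rain (the KNMI base rate for May)
for lik_h1, lik_h0 in observations:
    belief_h1 = update(belief_h1, lik_h1, lik_h0)
    print(f"P(H1 | data so far) = {belief_h1:.2f}")  # inspect, and stop whenever you like
```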
Case closed then? Perhaps not. The use of p-values is ubiquitous. One context in which they are routinely used is the endorsement of new medications and medical treatments. It should come as no surprise that I believe better tools exist. Last summer, I went on a bit of a mission, visiting Prof. John Ioannidis’ lab at Stanford University. My purpose was to join forces with him and write a paper flagging some of the issues that come with using p-values as a yardstick for the efficacy of new medicines.
Why go to Stanford for this? In 2005, Prof. Ioannidis rocked the scientific world with what has since become his most iconic publication: Why most published research findings are false. As its title suggests, the paper shows, using simulations, “that for most study designs and settings, it is more likely for a research claim to be false than true.” Partly as a result, he is presently regarded as one of the world’s leading authorities on proper research methodology. I knew from his previous work that he has similar objections to the use of p-values, so an unholy alliance was forged.
During my stay at Stanford, Prof. Ioannidis and I wrote a paper on the US Food and Drug Administration’s (FDA) policy for endorsing new medication. Our paper, which can be found here, provides a simulation-based critique of the errors that can occur when medications are endorsed on the basis of this policy for combining p-value results. In one sentence, we conclude that strict adoption of the FDA’s policy of endorsing new medications after two statistically significant trials may lead to a large proportion of ineffective medications on the market.
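To give a feel for the style of argument (to be clear: this is not the simulation from our paper, and every number in it is a made-up assumption), here is a toy sketch. Imagine many candidate drugs, only some of which truly work; each is tested in a series of possibly underpowered trials, and a drug is endorsed once at least two of its trials come out significant, regardless of how many trials were run in total.

```python
# Toy sketch (not the simulation from the paper): drugs are "approved" once
# at least two of their trials are statistically significant, no matter how
# many trials were run in total. All numbers below are made-up assumptions.

import numpy as np

rng = np.random.default_rng(1)

n_drugs = 100_000
p_truly_effective = 0.10  # assumed fraction of candidate drugs that really work
n_trials_per_drug = 10    # trials run per drug (non-significant ones may never be published)
power = 0.20              # assumed chance of p < .05 per trial for an effective drug (underpowered trials)
alpha = 0.05              # chance of a spuriously significant trial for an ineffective drug (ignoring direction, for simplicity)

effective = rng.random(n_drugs) < p_truly_effective
p_significant = np.where(effective, power, alpha)

# Number of significant trials per drug; approve when at least two are significant.
n_significant = rng.binomial(n_trials_per_drug, p_significant)
approved = n_significant >= 2

share_ineffective = np.mean(~effective[approved])
print(f"approved: {approved.sum()} of {n_drugs} candidate drugs")
print(f"share of approved drugs that are actually ineffective: {share_ineffective:.2f}")
```

Under these particular (made-up) assumptions, more than half of the approved drugs do nothing; change the assumptions and the number changes with it. The toy example only shows the mechanism; the paper itself works through the consequences of the policy with far more care.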
This conclusion led to a wave of controversy over whether things are really as bad as we claimed. Published only two months ago, the paper has been viewed over 8,000 times and shared on social media over 200 times. It also led to two blog posts that were quite critical of the main message. One of them, available here, was written by Stephen Senn, one of the most respected statisticians in the field of medicine. He questions the value of our conclusions, stating that we use Bayesian statistics to quantify evidence in simulations that are based on a traditional frequentist premise, that we greatly exaggerate the number of non-significant medical trials seen in practice (or left unpublished in a file drawer), and that we use an unrealistic prior distribution for the problem at hand.
Responses to the article (and to the blog posts) were quite heated, as can be seen from the comment sections on both websites. And for good reason! FDA policies have an enormous impact on people’s lives, and we showed that their process needs improvement: it was inevitable that this would evoke some emotion!
In my previous blog post, I talked about how uncritical thinking about statistics cost Sally Clark her life. My suggestion here is that such uncritical thinking is widespread and rampant: given the influence of the FDA, this is clearly not an individual problem. The FDA policy feeds the idea that statistics is finished; that it is a tool you can apply unthinkingly. That, ultimately, is the point I wanted to make.
Statistics is not ‘done-and-dusted’. It is a healthy and vibrant area of science that includes a number of cutting-edge topics about which some very smart people intensely disagree. That does not mean that ‘everything you learned is wrong’ or that ‘we may as well not do any statistics at all’. Rather, it means that we should not turn off our brains: we should keep thinking about how best to quantify evidence, how best to generalize from a small sample to an entire population, and how to carve the hidden gem of information out of the raw data so painstakingly obtained. In my opinion, statistics and methodology is – hands-down – the most exciting area in science right now. Consider doing your post-graduate work in this area; you can make a difference that affects all of science!