Should the notion of “statistical significance” be abolished?

Meaningful work?
Image: Reuters/Stefan Wermuth

Every year, millions of students across the world take a statistics class. As the world is flooded with data, it is an increasingly popular subject. And if there is one thing most students remember from this class, it is probably the notion of “statistical significance.”

It would probably be better if they didn’t, according to the academics Valentin Amrhein, Sander Greenland, and Blake McShane. They want to “retire statistical significance,” and they have a lot of support. Their article arguing for ridding statistics of the term, published in Nature in early 2019, has over 800 signatories—many of whom are leaders in quantitative fields.

Today, the term statistical significance is generally used to describe whether the result of a study could plausibly be due to chance. For example, take a company that wants to compare the impact of two different advertisements on Facebook. It finds that one ad gets 10% of people to click on it, and another gets 8%. To figure out whether that difference is meaningful or just happenstance, the company might run a statistical test to see if the result is “significant.” Historically, if researchers found that a difference this large in click rates would turn up 5% of the time or less through pure chance, they would likely dub it significant. Often, decisions in business and medicine are made based on the 5% rule, though many researchers choose more stringent thresholds, like 1% or even lower.
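For readers who want to see the arithmetic, here is a minimal Python sketch of one common way to run such a test, a two-proportion z-test. The article does not say how many people saw each ad, so the view counts below (1,000 and 5,000 per ad) are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
import math


def two_proportion_p_value(clicks_a, views_a, clicks_b, views_b):
    """Two-sided p-value for the difference between two click rates.

    Uses the pooled two-proportion z-test with a normal approximation,
    which is reasonable when both samples are fairly large.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Click rate we would expect if both ads performed identically.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Assumed sample sizes (not from the article): 1,000 views per ad.
p_small = two_proportion_p_value(100, 1000, 80, 1000)
# Same 10% vs. 8% click rates, but with an assumed 5,000 views per ad.
p_large = two_proportion_p_value(500, 5000, 400, 5000)

print(f"1,000 views per ad: p = {p_small:.3f}")
print(f"5,000 views per ad: p = {p_large:.5f}")
```

Under these assumed numbers, the same 10% versus 8% gap gives a p-value of roughly 0.12 with 1,000 views per ad, which misses the 5% threshold, and a value well below 0.001 with 5,000 views per ad, which clears it easily. Whether a difference counts as “significant” depends as much on how much data was collected as on how big the difference is.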

In the 1880s, British economist and statistician Francis Edgeworth became the first person to use the term “significant” in relation to statistical testing. He used it differently than the way it is employed today, according to statistician Glenn Shafer. Edgeworth discussed significance in terms of how likely a result was to “signify” a meaningful difference. He would say a finding was “likely significant” or “certainly significant.” By the mid-20th century, researchers began routinely saying that results were “highly significant” or “barely significant.” The terminology changed to one more suggestive of rules and less about judgment.

Rule-based thinking is the biggest problem with statistical significance, according to Amrhein, Greenland, and McShane. “The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different,” they write. It is silly, they argue, to consider a result that had a 4% chance of occurring by chance to be real while considering a result with a 6% chance to be unreal. (Statistician Andrew Gelman goes into more detail on the technical issues with significance on his excellent blog.)

The reliance on a binary measure of significance has likely contributed to a loss of confidence in findings in medicine and social science. For example, one research initiative found that about 40% of findings published in major psychology journals failed to replicate when the studies were repeated. A major reason for the lack of replication is that journals are biased towards publishing statistically significant results. This pushes researchers to search for results that meet the 5% standard, and to run experiments until they get one. Yet if you run 50 experiments, you are very likely to stumble into at least one “significant” result that you can submit to a journal, even if none of the effects you are testing is real. That study would have higher odds of being the result of chance than research that takes a more methodical approach.
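The arithmetic behind the 50-experiment scenario can be sketched in a few lines of Python. The sketch assumes the experiments are independent, that none of the effects being tested is real, and that each one is judged against the 5% threshold; the function name is hypothetical.

```python
def chance_of_false_positive(num_experiments, threshold=0.05):
    """Probability that at least one experiment looks "significant"
    purely by chance.

    Assumes independent experiments with no real effect, so each one
    has a `threshold` chance of crossing the line on its own.
    """
    chance_all_stay_quiet = (1 - threshold) ** num_experiments
    return 1 - chance_all_stay_quiet


for n in (1, 10, 50):
    print(f"{n} experiments: {chance_of_false_positive(n):.1%}")
```

Under those assumptions, the chance of at least one false positive across 50 experiments comes to about 92%.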

Rather than relying on whether a result is “significant” or “insignificant,” the authors want researchers to think about the context of their experiments. They should use a cost-benefit analysis of their results, because a result that is insignificant might still be useful. For example, if the difference between an experimental cancer medicine and a placebo was positive but not significant, it might still be worth giving that medicine to some patients, particularly if there is a strong theoretical reason to believe it could work. Results should be discussed in terms of how likely they are to be useful, not whether they meet some relatively arbitrary statistical threshold.

Not everyone agrees that talk of statistical significance should be abolished. Statistician John Ioannidis wrote in response to Amrhein, Greenland, and McShane’s letter that thresholds can often be helpful. Without significance thresholds, he writes, just about any result might be published and “[i]rrefutable nonsense would rule.” Those who disagree with him would say that such nonsense already rules, and a more nuanced approach could only help.

Such entrenched ideas don’t die off quickly. Statistical significance is so central to quantitative analysis that the official magazine of the American Statistical Association and Britain’s Royal Statistical Society is named Significance.