Many scientific studies you read about hinge on the wrong metric

Prove it’s not chance.
Image: Reuters/Paulo Whitaker

Scientists love a good p-value.

If you’ve ever read an article about science, chances are the author had to decide whether or not a p-value was trustworthy. As FiveThirtyEight shows with this handy interactive graphic, a good way of understanding a p-value is: “how surprising would these results be if you assumed your hypothesis was false.” A p-value of less than .05, which is typically regarded as the gold standard of statistical significance, says that the probability of seeing a trend at least as strong as the one you observed is low (under 5%) if you assume that the two variables being tested are unrelated. Most of the time, with a few notable exceptions, journals require a p-value of less than .05 in order for scientific research to be published.
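
One way to make that definition concrete is a permutation test, sketched below in Python. This is an illustration only, not a method taken from any study mentioned here: the function name `permutation_p_value` and the sample numbers are made up for the example. The idea is to shuffle the group labels many times, as if the two variables really were unrelated, and count how often chance alone produces a gap at least as large as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_gap(values):
        # Absolute difference between the mean of the first n_a values
        # and the mean of the remaining values.
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = mean_gap(pooled)  # gap in the real, unshuffled data
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the group labels carry no information
        if mean_gap(pooled) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical measurements for a treated group and a control group.
treated = [10, 11, 12, 13, 14, 15, 16, 17]
control = [1, 2, 3, 4, 5, 6, 7, 8]
p = permutation_p_value(treated, control)
print(f"estimated p-value: {p:.4f}")
```

A small estimate means that shuffled labels rarely reproduce the observed gap. It does not say how likely the hypothesis itself is to be true.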

Yet more scientific research is citing p-values without further context, possibly as a quick way to get readers’ attention. Today (March 15), a study from researchers at Stanford University published (paywall) in JAMA found that from 1990 to 2014, the share of abstracts reporting p-values, across more than 1 million biomedical research papers, more than doubled, from about 7% to 15%. In the identified “core” biomedical research journals, p-values were reported at rates around 33%, and studies that reported the results of clinical trials cited p-values at an even greater rate of 55%.

“The journals want novel research, and novelty is often associated with significant findings,” Joshua Wallach, an epidemiology PhD candidate at Stanford and co-author of the study, told Quartz. Wallach thinks that, by putting small p-values in their abstracts, more and more papers are trying to hook readers who may not fully appreciate the limitations of what the statistic actually says.

While the p-value can be a useful tool, it’s less reliable without more in-depth statistics or methodology accompanying it because it can be manipulated and misinterpreted.

“I’ve heard this quote, ‘If you torture your data long enough, it will confess,’” Wallach said. In other words, it is possible to look at most data in a way that yields some significant finding (a process known as “p-hacking”). And as FiveThirtyEight reports, p-values don’t actually measure how likely it is for a given hypothesis to be true. “The p-value only tells you something about the probability of seeing your results given a particular hypothetical explanation—it cannot tell you the probability that the results are true or whether they’re due to random chance,” Christie Aschwanden writes.
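
The arithmetic behind p-hacking can be sketched in a few lines of Python. The example rests on simplifying assumptions that are not from the article: every test is independent, there is no real effect anywhere, and each p-value is therefore equally likely to land anywhere between 0 and 1. The function names are invented for illustration.

```python
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests falls below alpha."""
    return 1 - (1 - alpha) ** n_tests

def simulate_false_positive_rate(n_tests, alpha=0.05, n_studies=10_000, seed=0):
    """Fraction of simulated studies with at least one 'significant' result."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Assumption: with no real effect, each p-value is uniform on [0, 1).
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies

for n_tests in (1, 5, 20):
    exact = chance_of_false_positive(n_tests)
    simulated = simulate_false_positive_rate(n_tests)
    print(f"{n_tests:>2} tests: formula {exact:.3f}, simulation {simulated:.3f}")
```

Under these assumptions, a researcher who tries 20 different comparisons on data with no real effect has roughly a 64% chance of finding at least one result below .05.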

Wallach and his team aren’t alone in their thinking: On March 7, the American Statistical Association (ASA) issued a statement (pdf) about the role of p-values in scientific research reporting. “The p-value was never intended to be a substitute for scientific reasoning,” Ron Wasserstein, the ASA’s executive director, said in a press release (pdf). “Well-reasoned statistical arguments contain much more than the value of a single number.”

And the consequences of these p-values go far beyond buying into science that may not be replicated. “Patients with serious diseases have been harmed. Researchers have chased wild geese, finding too often that statistically significant conclusions could not be reproduced,” Donald Berry, a statistician at the University of Texas, wrote in a supplementary commentary to the ASA’s statement.

To evaluate the credibility of a research paper, Wallach recommends that we look for other measures of reproducibility: The most reliable studies are those that have a large sample size and that also report statistics like confidence intervals, which give a range of plausible values for a result (a 95% interval is built so that it would capture the true value in about 95% of repeated studies).
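
As a rough illustration of why sample size and confidence intervals go together, the Python sketch below computes a 95% interval for an average using the normal approximation. That approximation is an assumption made for brevity (small samples would normally call for a t-distribution), and the function name and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    center = mean(sample)
    margin = z * stdev(sample) / len(sample) ** 0.5  # z times the standard error
    return center - margin, center + margin

# Hypothetical data: the same pattern of values, measured 5 times and 100 times.
small_study = [1, 2, 3, 4, 5]
large_study = small_study * 20

print("small study:", mean_confidence_interval(small_study))
print("large study:", mean_confidence_interval(large_study))
```

Both intervals are centered on the same average, but the larger study pins it down far more tightly, which is information a lone p-value does not convey.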

“Everyone wants p-values that are statistically significant…and in most studies, they really have no meaning on their own,” Wallach said. In other words, without an adequate description of other statistical measures, like the size of the effect, or other factors like reproducibility, clinical relevance, and study size, p-values don’t tell us much about the actual importance of a finding.
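
The gap between statistical significance and importance can be shown with a small Python sketch. It assumes a simple two-group comparison with equal group sizes and a known, equal spread in both groups (a z-test), which is a simplification chosen for illustration; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(effect_size, n_per_group):
    """p-value of a two-sample z-test for a standardized mean difference.

    effect_size is the difference in means divided by the standard deviation.
    """
    # Standard error of the difference is sqrt(2 / n), so z = d * sqrt(n / 2).
    z = abs(effect_size) * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(z))

# A negligible effect measured in a huge study...
tiny_effect_huge_study = two_sided_p_value(effect_size=0.02, n_per_group=100_000)
# ...versus a sizeable effect measured in a small study.
large_effect_small_study = two_sided_p_value(effect_size=0.5, n_per_group=20)

print(f"tiny effect, huge study:   p = {tiny_effect_huge_study:.6f}")
print(f"large effect, small study: p = {large_effect_small_study:.3f}")
```

Under these assumptions the negligible effect clears the .05 bar easily while the much larger one does not, which is why effect size and study size need to be reported alongside the p-value.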