The Problem with p < 0.05

Abstract

Having had several scientific articles published in peer-reviewed periodicals, I was aghast during a decision analysis class when the professor said, “you can set a p-value to whatever you want it to be.” The scientific zealot in me, who has built my entire academic significance on how many times I discovered p < 0.05, demanded clarification. Was the professor annulling the rigor of p < 0.05, the threshold I use to show a difference between my study sample and the rest of the population and thereby reject the null hypothesis?

The counterargument went something like this. Imagine your child needs an eye procedure that has a <5% chance of not working (less than 1 in 20). Are you satisfied with that? Would you rather have 4% or 2%? For the rest of the medical world, 5% is just fine, but for my child, I want it to be far below 1%. In contrast, there are areas in the social sciences or business marketing in which p ≤ 0.10 is perfectly acceptable. Accepting being wrong 1 in 10 times may be more palatable when the cost of being wrong is lower than my child's eye.

For the past decade, statisticians and scientific journals have cautioned against blindly using a p-value to judge significance. Indeed, our medical journals are filled predominantly with findings reporting p < 0.05, which means either that investigators go to great lengths to manipulate (not a bad word) their analyses to achieve such significance, or that there are mounds of unpublished research data with p > 0.05 hidden in file cabinets and vinyl three-holed binders.

We have weaponized p < 0.05. It is too common to brand well-done research as “not statistically significant” or “no difference” because a p-value was >0.05. This binary, all-or-nothing judgment of research can falsely promote a claim while overlooking important findings from studies that did not meet the arbitrary threshold. Two studies examining the same phenomenon, each landing on opposite sides of 0.05, may be telling the same story once one scrutinizes the methods and the statistical assumptions. In fact, Wasserstein and colleagues recommend abandoning the term “statistically significant” and the use of symbols such as asterisks in publications to denote any kind of statistical inference. In their words, “as ‘statistical significance’ is used less, statistical thinking will be used more.”¹ In summary, the p-value is not an absolute statement of truth and should never be used as the sole determinant of treatment decisions.
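To make the point about two studies on opposite sides of 0.05 concrete, here is a minimal Python sketch. The two studies, their numbers (a mean difference of 5 units, a standard deviation of 15, arms of 75 and 65 patients), and the function name are hypothetical. The calculation assumes a simple two-sided z-test with a known common standard deviation, which is only one of many reasonable analyses.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(mean_diff, sd, n_per_arm, conf_level=0.95):
    """Return (p_value, ci_low, ci_high) for a difference in means.

    Assumes two equal-sized arms, a known common standard deviation,
    and a normal (z) approximation. Illustrative only.
    """
    std_normal = NormalDist()
    # Standard error of the difference between two independent means
    se = sd * sqrt(2.0 / n_per_arm)
    z = mean_diff / se
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    # Critical value for the requested confidence level (about 1.96 for 95%)
    z_crit = std_normal.inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    return p_value, mean_diff - z_crit * se, mean_diff + z_crit * se


# Hypothetical studies: identical observed effect, slightly different sizes
for label, n in (("Study A", 75), ("Study B", 65)):
    p, low, high = two_sided_z_test(mean_diff=5.0, sd=15.0, n_per_arm=n)
    print(f"{label}: n per arm = {n}, p = {p:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Under these assumptions, the larger study falls just under 0.05 and the smaller one just over, yet both report the same observed difference and their confidence intervals largely overlap. Labeling one “significant” and the other “no difference” would hide that agreement.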

All investigators need to be able to communicate all their findings, especially the pertinent negatives. That said, the journal will not publish anything without statistical merit. In many surgical journals, with samples not in the thousands but rather in the tens and hundreds, the statistics are usually not complex and, therefore, whether the data meet statistical reasonableness is not hard to discern. As long as the statistical methodology is clear and reasonable, showing a trend should warrant consideration. Attempts at post hoc analysis to generate an attractive p-value should be avoided. “Putting it out there” not only gives investigators credibility as researchers who are willing to show good work that may not have met a preset threshold, but also allows the scientific community to see the data in aggregate. Let the scientific community judge all quality data (even if a threshold is not met) to determine whether a treatment is effective. This collaborative evaluation considers all data, not just those deemed the “s-word.” It also saves time and money, because no one has to prove or disprove what has already been done.

What useful conclusions can we draw for using statistics in this journal? Much of the impetus comes from the New England Journal of Medicine guidelines for statistical reporting.² More helpful is the American Statistical Association's statement on p-values.³ For the purposes of this journal, these are the broad recommendations we are considering for reporting statistical thinking:

1. Explain the rationale for choosing a statistical method and a p-value threshold. Adherence to a reasonable thought process is just as valuable as finding data that meet an arbitrary threshold.

2. Use multiple methods to test your hypothesis. The outcomes may converge. Any divergence just makes the study more interesting and calls for “statistical thinking.”

3. Do not use symbols to show that your results met some type of threshold. Report p-values to two decimal places, and to no more than three. Let the data speak for themselves and allow the scientific community to be the final arbiter.

4. If your work seemingly differs from that of others, describe where the potential differences are, but also look at where the synergies are.

5. Discuss your power analysis and sample size calculations.

6. When appropriate, report confidence intervals that will give readers a sense of the precision of your estimates. For now, it is still reasonable to report 95% confidence intervals.
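As an illustration of the recommendations on sample size and p-value reporting, the following Python sketch shows one way such a calculation might look. It is not the journal's required method. The function names and the example numbers are hypothetical, the sample size formula is the standard normal-approximation formula for comparing two means with equal arms, and the formatting rule (two decimals, three for values below 0.01, and a floor of 0.001) is an assumed convention consistent with, but not dictated by, the recommendation above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Patients per arm needed to detect a mean difference `delta`.

    Normal-approximation formula for a two-sided test with equal arms
    and a common standard deviation `sd`:
        n = 2 * ((z_(1 - alpha/2) + z_power) * sd / delta) ** 2
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) * sd / delta) ** 2)


def format_p(p):
    """Format a p-value without threshold symbols (assumed convention)."""
    if p < 0.001:
        return "p < 0.001"
    if p < 0.01:
        return f"p = {p:.3f}"  # three decimals for small values
    return f"p = {p:.2f}"      # two decimals otherwise


# Hypothetical planning scenario: detect a 5-unit difference, SD of 15
print(sample_size_per_arm(delta=5.0, sd=15.0))              # alpha 0.05, 80% power
print(sample_size_per_arm(delta=5.0, sd=15.0, alpha=0.01))  # stricter threshold
print(format_p(0.0412), format_p(0.0042), format_p(0.0001))
```

In this hypothetical scenario, the default settings call for 142 patients per arm, and tightening the threshold to 0.01 raises that number. Stating these inputs in the manuscript lets readers judge whether the chosen threshold and power were reasonable for the question.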

References

1. Wasserstein, Schirm, Lazar. Moving to a world beyond “p < 0.05.” Am Stat, 2019; 73(Suppl. 1):1–19.

2. Harrington, D'Agostino, Gatsonis, Hogan, Hunter, Normand S-LT, et al. New guidelines for statistical reporting in the Journal. N Engl J Med, 2019; 381:285–286.

3. Wasserstein, Lazar. The ASA's statement on p-values: context, process, and purpose. Am Stat, 2016; 70:129–133.
