Abstract

An example is a recent article in the journal Nature Neuroscience [1]. The authors had the impression that a particular statistical error was appearing frequently in the neuroscience literature, so they reviewed every relevant article in Nature, Nature Neuroscience, Science, and Neuron for 2009 and 2010 and, in addition, every fourth issue of the Journal of Neuroscience for the same two years. They found 157 articles in which the error could have been committed, and in half of them it was. As Nieuwenhuis et al. point out, not all of these errors were critical, and, in some cases, a correct test might have given the same answer. But it is nevertheless unnerving that five of the most prestigious scientific journals in the United States published articles with one particular statistical error half the time.
The mistake investigated by Nieuwenhuis and his colleagues goes back to the dawn of statistics, and it runs as follows: A scientist computes the average of something under one set of conditions, then another average of the same thing under a different set of conditions. The scientist finds that the first mean is positive and the second is negative, and concludes that there is a difference between the two conditions, because they produce averages with different signs. We now know that most measurements (and virtually all measurements in biomedicine) contain errors, and so we think that the scientist should carry out a statistical hypothesis test. In this fictional case, the most frequently used method would be the two-sample t-test, which directly compares the two means with each other.
Now suppose that the scientist pretends to follow our advice. He/she measures the means and tests two hypotheses: that the first differs from zero and that the second differs from zero. The positive mean is significantly different from zero, but the negative mean is not, so the scientist declares that the first condition produces positive values but the second condition produces zero. The scientist then claims that the effects of the two conditions are different, but has found this difference without ever comparing the two means directly. This scientist has simply found a statistical argument for repeating the original mistake.
In the cases inspected by Nieuwenhuis et al., the situation was only slightly more complicated. Again speaking hypothetically, the scientist finds two means (under two different conditions) that both trend in the same direction. He/she tests the first against the null hypothesis that it is zero and gets a p-value of 0.03. Testing the second in the same way gives p = 0.07. The scientist claims that the first is significantly different from zero while the second is not, and so concludes that the conditions have different effects. But he/she never tests the two means against each other directly, which is the correct statistical procedure.
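To see how little a pair of p-values such as 0.03 and 0.07 says about the difference between the two conditions, the direct comparison can be sketched numerically. The sketch below is illustrative only and rests on assumptions that are not part of the example above: the tests are treated as two-sided z-tests (a normal approximation to the t-test), the two samples are independent, and both means have the same standard error. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def z_from_two_sided_p(p):
    """Magnitude of the z statistic that yields a two-sided p-value of p."""
    return STD_NORMAL.inv_cdf(1 - p / 2)

def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

# z statistics implied by the two single-mean tests (same sign, by assumption).
z_first = z_from_two_sided_p(0.03)
z_second = z_from_two_sided_p(0.07)

# Direct comparison of the two means. ASSUMPTION: independent samples with
# equal standard errors, so the difference has standard error sqrt(2) times
# the standard error of each mean.
z_difference = (z_first - z_second) / sqrt(2)
p_difference = two_sided_p(z_difference)

print(f"z for first mean:  {z_first:.2f}")
print(f"z for second mean: {z_second:.2f}")
print(f"z for difference:  {z_difference:.2f}")
print(f"p for difference:  {p_difference:.2f}")
```

Under these assumptions the p-value for the direct comparison comes out far above 0.05, so the data give no support for the claim that the two conditions differ, even though one single-mean test is "significant" and the other is not.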
As Nieuwenhuis et al. point out, the scientist's argument sounds as though it is right. After all, if one condition produces non-zero means and the other does not, then surely they differ, and the advice to statistically test has been followed. However, the advice to perform a two-sample t-test seems right as well. How can we tell who is correct?
One way is to perform a simple set of computations. We consider situations in which two averages are computed in two samples that come from two populations with exactly the same mean. We allow both the common mean value and the sample size to vary, and for each such set of values we compute the probability that one of the means will differ statistically from zero and the other will not. The results are shown in Figure 1. It is important to keep in mind while viewing this figure that, in all the situations considered, the means in the two populations are identical, so that the two conditions do not have different effects. Each probability is the chance of declaring that the conditions differ, for samples of size determined by the individual curve, as shown.

Figure 1. Probability of committing the error reported by Nieuwenhuis et al. [1] as a function of sample size per group (1 through 80) and the value of the mean common to the two populations.
The conclusion is clear: The probability of falsely rejecting the null hypothesis of no difference is not bounded by 0.05, as claimed, but can rise as high as 0.50. Moreover, the graph shows that this error occurs most frequently when the common mean is in the neighborhood of the cutoff value that would determine statistical significance (for the test of the null hypothesis of 0 for a single mean). Thus, the situation in which the error is most frequent depends both on the common effect that the two conditions share and on the sample size that was used.
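A computation of this kind can be sketched in a few lines. The version below is not the calculation used to draw the figure; it is a simplified sketch that assumes two-sided z-tests at the 0.05 level with a known standard deviation of 1 and independent samples of equal size, and its function names are hypothetical. If each single-mean test rejects with probability w, then exactly one of the two tests rejects with probability 2w(1 - w), which cannot exceed 0.50 and reaches that value when w = 0.5.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def power_single_mean(mu, n, sigma=1.0, alpha=0.05):
    """Probability that a two-sided z-test of H0: mean = 0 rejects,
    when the true mean is mu (ASSUMPTION: known sigma, sample size n)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = mu * sqrt(n) / sigma  # true mean in standard-error units
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail

def prob_one_significant(mu, n, sigma=1.0, alpha=0.05):
    """Probability that exactly one of two independent tests rejects
    when both populations share the same mean mu."""
    w = power_single_mean(mu, n, sigma, alpha)
    return 2 * w * (1 - w)

# Tabulate the error probability over a few sample sizes and common means.
for n in (5, 20, 80):
    for mu in (0.0, 0.2, 0.5, 1.0):
        print(f"n={n:3d}  mu={mu:.1f}  P(error)={prob_one_significant(mu, n):.3f}")
```

In this sketch the error probability is already about 0.095 when the common mean is zero, and it approaches 0.50 when the common mean sits near the significance cutoff for a single mean, that is, near 1.96 standard errors from zero.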
Given that it is beyond question that the procedure studied by Nieuwenhuis et al. violates the usual rules of statistical tests [2] and is likely to produce false statements of effect with unacceptably high probabilities, the Editors of the Journal have decided that commission of this error will be grounds for immediately returning a submission without review.
