Abstract

This second part of the special issue on “Perspectives on the Use of Null Hypothesis Statistical Testing (NHST)” is dedicated to the classic null hypothesis statistical testing framework. This framework slowly evolved from the work of Bernoulli (1713/2005) and Gauss (1809/1864), among others. Fisher (1925, 1935/1951) formalized the idea of hypothesis testing, building on the core idea of a null hypothesis and the evaluation of the probability of the data under this hypothesis. At this stage, there were few sources of confusion: A model of the null hypothesis (the null model) is set down, often based on the famous normality assumption. In parallel, a model of the observed data is laid down. The two models are compared with a likelihood ratio. The model of the observed data always fits at least as well as the null model, so what matters is the loss in fit incurred by the null model. An F statistic (or the square of a t statistic) is obtained by computing twice the decrement in fit when fits are expressed using log likelihoods. The importance of the decrement is then assessed by finding the probability of a decrement at least as large as the one observed, given that there is no difference between the two models (see Cousineau & Allan, 2015, for an overview). Thus, in Fisher’s view, a small p value indicates that it is not desirable to describe the data using the null model.
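The comparison of the two models can be illustrated with a minimal sketch. It assumes a one-sample normal model in which the null model fixes the mean at a hypothetical value `null_mean` and the model of the observed data uses the sample mean, with the variance estimated by maximum likelihood in both. The function names, the example data, and the use of simulation to obtain the p value are illustrative choices, not taken from the articles in this issue. In this particular case, twice the decrement in log likelihood equals n·log(1 + t²/(n − 1)), so it ranks samples exactly as the squared t statistic does.

```python
import math
import random
import statistics


def log_likelihood(data, mean):
    """Normal log likelihood with the mean fixed and the variance at its ML estimate."""
    n = len(data)
    var = sum((x - mean) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def fit_decrement(data, null_mean=0.0):
    """Twice the loss in log likelihood when the null model replaces the data model."""
    ll_data = log_likelihood(data, statistics.fmean(data))
    ll_null = log_likelihood(data, null_mean)
    return 2 * (ll_data - ll_null)


def t_squared(data, null_mean=0.0):
    """Square of the usual one-sample t statistic."""
    n = len(data)
    se = statistics.stdev(data) / math.sqrt(n)
    return ((statistics.fmean(data) - null_mean) / se) ** 2


def simulated_p_value(data, null_mean=0.0, n_sim=10000, seed=1):
    """Proportion of samples drawn under the null model whose decrement is
    at least as large as the observed one (a Monte Carlo approximation)."""
    rng = random.Random(seed)
    n = len(data)
    observed = fit_decrement(data, null_mean)
    hits = 0
    for _ in range(n_sim):
        # Unit variance is an arbitrary choice: the decrement does not depend on scale.
        sample = [rng.gauss(null_mean, 1.0) for _ in range(n)]
        if fit_decrement(sample, null_mean) >= observed:
            hits += 1
    return hits / n_sim


# Hypothetical data, for illustration only.
scores = [0.8, 1.9, -0.3, 1.2, 2.4, 0.5, 1.1, 1.7]
decrement = fit_decrement(scores)
p_value = simulated_p_value(scores)
```

A small `p_value` here means that a loss in fit as large as the observed one would rarely arise if the null model were adequate, which is the reading of the p value described above.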
In our opinion, things became confusing after the work of Neyman and Pearson (1933). These two researchers wanted to identify optimal statistical tests. The Neyman–Pearson lemma established that likelihood ratio tests are the most powerful tests for a given significance level α. To that end, they needed to fix the true population difference (the true effect size). They called this true effect size the alternative hypothesis, a formulation that Fisher strongly opposed (Cohen, 1990, p. 1307). This wording is very unfortunate because the true population effect size is not a hypothesis, and in any event, likelihood ratio tests are not meant to evaluate dual hypotheses. In subsequent textbooks, the alternative hypothesis crept in, but with a modified meaning: The alternative hypothesis became anything but the null hypothesis (e.g., Snedecor, 1946).
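Why the power of a test cannot be stated without fixing a true effect size can be seen in a small sketch. It assumes the simplest possible setting, a one-sided, one-sample z test with known unit variance; the function name `z_test_power` and the numerical values are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z test with known unit variance.

    effect_size is the assumed true standardized mean difference;
    without a value for it, power cannot be computed at all.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)  # rejection threshold under the null
    shift = effect_size * math.sqrt(n)  # mean of the test statistic under the true effect
    return 1 - z.cdf(critical - shift)


# With no true effect, the rejection rate is the significance level itself.
power_under_null = z_test_power(0.0, n=25)

# With an assumed true effect of 0.5 standard deviations and 25 observations.
power_under_effect = z_test_power(0.5, n=25)
```

The significance level is set using the null model alone, whereas the power changes with every assumed value of `effect_size`.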
The alternative hypothesis is not part of NHST; its inclusion leads to paradoxical situations (e.g., Wagenmakers, 2007, found that given a data set, rejection and nonrejection decisions are obtained from NHST and from Bayesian computations, respectively). This is one example of how a logical construction can be altered through the years to the point that it loses its foundations.
In the second part of this special issue, three articles defend the soundness of NHST. Häggström presents arguments for embracing this approach and indicates ways to use it appropriately. García-Pérez argues that current alternatives to null hypothesis statistical tests do not solve the issues that this type of testing is said to have. Miller reinforces the idea that null hypothesis testing is an appropriate method for doing science.
We also begin to examine alternatives to NHST (more will follow in a subsequent issue). One alternative to NHST that has received much attention is inference based on confidence intervals. Wiens and Nilsson illustrate this approach with contrast analyses in factorial designs, where confidence intervals are rarely given. Another growing field of research in statistics is how to reconcile parameter estimation and inference with the presence of outliers and non-Gaussian distributions. Wilcox and Serang demonstrate how robust statistics can address problems that null hypothesis testing, confidence intervals, and Bayesian approaches have.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
