A reconceptualization of significance testing

Abstract

Significance testing has been controversial since Neyman and Pearson published their procedure for testing statistical hypotheses. Fisher, who popularized tests of significance, first noticed the emerging confusion between that procedure and his own, yet he could not stop their hybridization into what is nowadays known as Null Hypothesis Significance Testing (NHST). Here I hypothesize why similar attempts to clarify matters have also failed; namely because both procedures are designed to be confused: their names may not match purpose, both use null hypotheses and levels of significance yet for different goals, and p-values, errors, alternative hypotheses, and significance only apply to one procedure yet are commonly used with both. I also propose a reconceptualization of the procedures to prevent further confusion.

Keywords

Fisher, Neyman–Pearson, Null Hypothesis Significance Testing, NHST, significance testing, statistical hypotheses

Hager’s (2013) article on the statistical theories of Fisher and of Neyman and Pearson is the latest in a string of exhortations trying to sort out the confusion between those two main theories for testing research data. Hager starts his abstract with “Most of the debates around statistical testing suffer from a failure to identify clearly the features specific to the theories invented by Fisher and by Neyman and Pearson [emphasis added]” (p. 251). However, the literature shows that those specific features have already been covered, most clearly by Hubbard (2004), Gigerenzer (2004), Louçã (2008), and, of course, Fisher (1925, 1955, 1959, 1960) and Neyman and Pearson (1928, 1933). A narrower part of the problem, the confusion between p-values and Type I error, has also been covered by Hubbard (2004, 2011), Goodman (1999), and Christensen (2005), among others. To Hager’s credit, his article is the best to date in presenting an exhaustive identification of such features.

Furthermore, historical analyses of the hybridization of Fisher’s and Neyman–Pearson’s approaches have been done by Halpin and Stam (2006), Hubbard (2004), Huberty (1993), and Johnstone (1986). And, among proposed solutions, there are mathematical bridges between both approaches, such as Berger’s (2003) proposal of conditioning Neyman–Pearson’s error probabilities on Fisher’s strength of evidence, and Schweder’s (1988) proposal of a Fisher-like significance version for Neyman–Pearson’s tests; philosophical bridges, such as Lehmann’s (1993) attempt at a unified theory, and Jones and Tukey’s (2000) proposal of testing three alternative decisions rather than null hypotheses; exhortations to give up testing in favor of confidence intervals (Gigerenzer, 2004), meta-analysis (Schmidt, 1992), or Bayes’ hypothesis testing (Hubbard & Bayarri, 2003); and exhortations for better statistical education provided by statisticians (Hubbard, 2011). So far, most solutions have been unable to resolve the confusion inherent to significance testing.

An interesting observation when studying this topic is that the confusion of both approaches is as old as Neyman–Pearson’s theory. Fisher was first to criticize such confusion (more clearly in 1955), even before Lindquist’s textbook hybridizing both theories into the Null Hypothesis Significance Testing (NHST) procedure was printed (1940; see also Halpin & Stam, 2006); yet he could not stop the hybridization wave.

I see an insidious factor that may explain why the confusion persists despite the best efforts to straighten it out: the theories are unwittingly designed to be confused, insofar as they use the same tests, similar procedures, and similar concepts. Little can be done regarding tests and procedures; resolving the confusion thus rests on re-designing the conceptual framework of both theories (as Pearson put it in 1955, “I would agree that some of our wording may have been chosen inadequately [emphasis added],” p. 206). No matter how much ink is spent on trying to correct the wave of hybridization, the basic design of both theories will by default confuse the unwary and the expert alike. Among the experts, the confusion has reached the American Psychological Association (2010), Wilkinson and the Task Force on Statistical Inference (1999), and Krueger (2001), who are rather Fisherian, as well as Kline (2004), Nickerson (2000), Wainer (1999), Cortina and Dunlap (1997), Frick (1996), Cohen (1988, 1994), Schmidt (1992), and Rosnow and Rosenthal (1989), who are rather Neyman–Pearsonian.

The goal of this paper is to propose such a conceptual re-design. It covers the names of the procedures, the role of the null and alternative hypotheses, levels of significance, p-values and errors, the statistical significance of results, and their interpretation. The article ends with examples that put theory into practice, describing the same results as interpreted under Fisher’s approach and under Neyman–Pearson’s approach.

The name of the procedures

When doing significance testing, Fisher was interested in finding noteworthy results and in assessing the strength of such evidence. Using his approach, the researcher is prepared to pay attention to statistically significant results and ignore the rest (Fisher, 1960). Therefore, Fisher’s tests of (statistical) significance is an appropriate name for this procedure.

In contrast, Neyman and Pearson’s interest was in deciding which hypothesis, among competing ones, to accept as most likely. They called their procedure “tests of statistical hypotheses” (e.g., Neyman & Pearson, 1933), a denomination which certainly differs from that of Fisher’s but which does not create a clear conceptual separation between both. After all, Fisher also uses statistical hypotheses, yet neither procedure tests hypotheses, only research data against statistical hypotheses assumed to be true (Gigerenzer, 2004). The reference to testing hypotheses is, thus, misleading, and the procedure would benefit from being renamed Neyman–Pearson’s tests of acceptance (Fisher, 1955).

Null hypotheses

The core of Fisher’s procedure is a null hypothesis, which represents a theoretical random distribution generated ad hoc from information provided by the data (e.g., variance) as well as by theory (e.g., normal curve; Fisher, 1959). The procedure locates the research results within this distribution and assesses their theoretical probability. Research results with a low probability of occurrence are taken as evidence against the hypothesis, nullifying it as being explanatory of those results (Hubbard & Bayarri, 2003). Calling it the null hypothesis (H₀) is, thus, appropriate.

Neyman–Pearson’s procedure works in a somewhat similar manner, although it uses at least two hypotheses representing distributions to be tested in the long run, using repeated sampling procedures on the same population (Fisher, 1955). The procedure selects the hypothesis of greatest interest as the one on which to carry out the test, and this hypothesis is also called the null hypothesis. Yet this approach is not about nullifying a hypothesis but about deciding among competing hypotheses (Neyman & Pearson, 1928). Thus, the concept of null hypothesis is inaccurate here. Neyman–Pearson’s procedure would benefit from using the concept of main hypothesis (H_M) instead.

Alternative hypotheses

Fisher’s procedure does not contemplate an alternative hypothesis at all (Fisher, 1960). There is, however, some emptiness in this procedure on two counts. On the one hand, the interest is in a research (substantive) hypothesis even when the test is done on its complementary statistical null hypothesis (Hager, 2013). On the other hand, there is very little that can be said about significant results other than that the null hypothesis has been rejected. The need to fill this emptiness is strong enough that a concept such as the alternative hypothesis fits well, even when inappropriate. To avoid confusion, Fisher’s procedure could include a reference to the research hypothesis to fill this gap, and to make researchers aware that the null is proposed as a way of making inferences about the research hypothesis in the first place; however, we should avoid calling it the alternative hypothesis.

Neyman–Pearson’s procedure requires the use of one or more alternative hypotheses, which are specific and represent long-run distributions (Neyman & Pearson, 1928). Nowadays, this procedure is used in such a manner that only the power of the test against the alternative hypothesis (1 − β) is ever considered. Thus, although the alternative hypothesis is, in principle, specific, it is often portrayed as an unspecified hypothesis (as being “unequal” to the main hypothesis); as a consequence, accepting the alternative hypothesis tells us little beyond that fact. Neyman–Pearson’s alternative hypothesis (H_A) can certainly retain its name under this procedure, although it should be made explicit, if only by providing information about its power. (Put otherwise, if the alternative hypothesis is not so specified, the test is carried out as a de facto Fisher’s test.)

Levels of significance

Fisher’s procedure uses levels of significance for ascertaining the noteworthiness of a result (Fisher, 1925). Fisher did not put forward any particular notation for them, but researchers tend to write the level of significance as alpha (α). Most researchers also find it convenient to work with fixed levels of significance (e.g., 5%, 1%). These features have parallels in Neyman–Pearson’s approach. To avoid confusion, Fisher’s levels of significance, whose name is fit for purpose, should be called just that, and sig could be used as shorthand notation when required (thus avoiding alpha). Convenient levels of significance may be used (Fisher, 1960), although these should be treated as flexible rather than as strict cut-off points (i.e., the difference between 5% and 6% is not too critical under Fisher’s approach). Furthermore, a gradation in levels of significance is appropriate (such as significant and highly significant), as it reflects the relative strength of the evidence against the null hypothesis.

Neyman–Pearson’s procedure does not work with evidence against the null but with long-run error probabilities. Neyman and Pearson sought to keep small the probability of making a Type I error (the error committed when wrongly accepting the alternative hypothesis¹), a probability known as alpha (α). Alpha needs to be set in conjunction with beta (β, the probability of making a Type II error, that of wrongly accepting the main hypothesis). For convenience, Neyman and Pearson also worked with fixed levels of alpha (e.g., 5%, 1%). They also called alpha the significance level (Neyman & Pearson, 1933). To avoid confusion, acceptance level can replace the latter term, as it is a more appropriate name for its function as the cut-off point for deciding between hypotheses. This procedure also retains exclusive use of the concepts of alpha (α), beta (β), and power (1 − β).² Convenient levels of acceptance may be used; however, whatever level of acceptance is chosen, it is fixed throughout the research (it represents a fixed error risk), it is strict rather than flexible, and it cannot accept any gradation (i.e., a result cannot be highly accepted, or accepted with a particular proportion of error, nor can alpha become a “roving alpha,” as Goodman, 1993, called it).

P-values

To Fisher, p-values represent the probability of the observed and more extreme results, always assuming that the null hypothesis is true. They also represent the strength of the evidence against the null (Fisher, 1960), so that knowing the exact p-value is very informative, indeed. Actually, p-values only make sense under Fisher’s approach and should be used exclusively with this approach. The preferred method is to report the exact p-value (e.g., p = .012; Gigerenzer, 2004), although reporting p-values bound to levels of significance (e.g., p < .05, or ‘***’ for p < .001) may be convenient in some circumstances (e.g., in tables).
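As a minimal sketch of this definition, the exact one-tailed p-value of a t statistic can be obtained by integrating the null distribution numerically. The function names and the use of Simpson's rule are illustrative assumptions, not part of Fisher's exposition; the input values are those of the worked example in the Examples section.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail_p(t_obs: float, df: int, steps: int = 2000) -> float:
    """P(T >= t_obs) assuming the null hypothesis is true.

    Integrates the density over [0, |t_obs|] with Simpson's rule
    (steps must be even) and subtracts the area from one half.
    """
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3      # P(0 <= T <= |t_obs|)
    upper = 0.5 - central        # P(T >= |t_obs|)
    return upper if t_obs >= 0 else 1.0 - upper


p = t_upper_tail_p(2.34, 50)
print(f"p = {p:.3f}")  # about .012, the exact value to be reported
```

The exact value, not merely its position relative to a cut-off, is what carries the strength of evidence under this approach.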

Under Neyman–Pearson’s approach, p-values are neither necessary nor routinely calculated, although they may be used as proxies for deciding when a result falls in the alpha region, so as to accept the alternative hypothesis. The p-value does not provide any strength of evidence (i.e., the alternative hypothesis can only be accepted, it cannot be accepted slightly or strongly; Gigerenzer, 2004). Therefore, when p-values are used, they should be stripped of their numerical value and simply be described bound to the selected alpha level (e.g., p < α, or p < α_.05).

Errors

Fisher’s procedure is aimed at novel, single research projects. Although Type I errors are plausible, they are of little practical relevance given the ad hoc nature of the theoretical distributions on which the tests are carried out. Any interpretation of p-values as the probability of making a Type I error is, thus, inaccurate. Because Fisher’s procedure does not contemplate alternative hypotheses, Type II errors are not possible under this approach (Fisher, 1955).

Type I and Type II errors are concepts first introduced by Neyman and Pearson (1928), although they were rather interested in minimizing the probability of their occurrence (α and β respectively) in the long run under repeated testing (i.e., it is not possible to ascertain whether an error has been made in any particular trial). Thus, the concepts of Type I and Type II errors, and their associated probabilities alpha (α) and beta (β), are exclusive to Neyman–Pearson’s approach.
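In symbols, and using the main hypothesis (H_M) and alternative hypothesis (H_A) as named earlier, these long-run quantities can be sketched as follows. The notation is an illustrative shorthand rather than Neyman and Pearson's own; the last line restates Fisher's one-tailed p-value for contrast, with T the test statistic and t_obs its observed value.

```latex
\begin{align*}
\alpha &= P(\text{accept } H_A \mid H_M \text{ is true}) && \text{(Type I error rate)} \\
\beta &= P(\text{accept } H_M \mid H_A \text{ is true}) && \text{(Type II error rate)} \\
1 - \beta &= P(\text{accept } H_A \mid H_A \text{ is true}) && \text{(power)} \\
p &= P(T \geq t_{\mathrm{obs}} \mid H_0 \text{ is true}) && \text{(Fisher's } p\text{-value)}
\end{align*}
```

The first three quantities are fixed before any data are collected and describe the procedure over repeated sampling, whereas p is computed from one observed result.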

Significant results

As discussed above, a research result can only be statistically significant under Fisher’s procedure. Results can also be more or less significant according to the strength of the evidence they represent against the null hypothesis (Fisher, 1960).

When using Neyman–Pearson’s procedure, a research result cannot be significant, properly speaking. Instead, a result that falls within the alpha region is a result that will be accepted under the alternative hypothesis.

Interpretation

Fisher’s results can only be interpreted according to an unavoidable duality: either a rare chance event occurred or the null hypothesis does not explain the research results (Fisher, 1959). From here an inductive inference may be made regarding whether the results support the substantive research hypothesis or whether further research is needed.

Neyman–Pearson’s results can only be interpreted as a decision: the research results support either the alternative hypothesis, or, if power is adequate, the main hypothesis (if power is not adequate, nothing can be concluded about the latter; Neyman, 1953).
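The decision rule just described can be sketched in a few lines. The function name, the returned labels, and the default threshold for adequate power (80%) are assumptions made for illustration; the text does not fix a value for adequate power.

```python
def neyman_pearson_decision(p_value: float, alpha: float,
                            power: float, min_power: float = 0.80) -> str:
    """Return the decision for one test under a fixed acceptance level.

    min_power is an assumed adequacy threshold, not a value from the text.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if p_value < alpha:
        # The result falls in the alpha region: accept H_A, with no gradation.
        return "accept H_A"
    if power >= min_power:
        # Outside the alpha region with adequate power: accept H_M.
        return "accept H_M"
    # Outside the alpha region but underpowered: nothing follows about H_M.
    return "no decision"


# The p-value only acts as a proxy for the alpha region, so p = .049 and
# p = .001 lead to exactly the same decision.
print(neyman_pearson_decision(0.049, alpha=0.05, power=0.80))  # accept H_A
print(neyman_pearson_decision(0.001, alpha=0.05, power=0.80))  # accept H_A
```

The function deliberately returns a label rather than a number, which mirrors the point that a hypothesis cannot be accepted slightly or strongly.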

Examples

Below are two interpretations of how the same results may appear in a research article, depending on whether one is following Fisher’s procedure or Neyman–Pearson’s:

This was a novel, single research study, and statistical analyses were conducted according to Fisher’s procedure, using a conventional level of significance of 5% (sig = .05). Results showed a highly significant difference in performance between control and experimental groups in favor of the latter, t(50) = 2.34, p = .012, 1-tailed. As the probability of getting these or greater results is low under the null hypothesis, we rejected the latter and inferred that the results reflected a true difference between groups.

This study forms part of a repeated sampling program, and statistical analyses were conducted according to Neyman–Pearson’s procedure. The acceptance criterion was set at 5% (α = .05), and a priori research power was set at 80% (1 − β = .80) for an estimated large effect size (d = 0.8), requiring a minimum total sample size of 42 participants. The observed results fell in the critical alpha region, t(50) = 2.34, p < α, 1-tailed; thus we accepted the alternative hypothesis and concluded that the experimental treatment had a definite positive impact on performance.
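The a priori sample size in the Neyman–Pearson report above can be roughly checked with a large-sample normal approximation. This is a sketch under assumptions not stated in the report (two independent groups of equal size, a one-tailed test, and the formula n = 2((z_(1 − α) + z_(1 − β)) / d)² per group); it is not the calculation behind the reported figure, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_n_per_group(d: float, alpha: float, power: float) -> int:
    """Approximate participants per group for a one-tailed two-group test.

    Uses the normal approximation, which slightly understates the
    sample size that a t-based calculation would require.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha)   # cut-off for the alpha region
    z_power = z.inv_cdf(power)         # quantile matching the desired power
    return math.ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


n = approx_n_per_group(d=0.8, alpha=0.05, power=0.80)
print(n, 2 * n)  # 20 per group, 40 in total
```

The approximation gives 40 participants in total, slightly below the 42 reported; a difference in that direction is expected, because calculations based on the t distribution call for a few more participants when samples are small.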

Final note

Null Hypothesis Significance Testing (NHST) is, by and large, a mismatch of two incompatible statistical philosophies (e.g., Halpin & Stam, 2006). A plausible factor in such confusion is the conceptual framework under which both philosophies were developed and published. The negative impact of that shared conceptual framework is pervasive enough that even expert users of statistics confuse the theory of Fisher with the theory of Neyman and Pearson (e.g., Cohen, 1994). Although NHST may be waning in psychology, its role in future publications is necessarily open to editorial policy (APA, 2010), while its historical footprint in the publication record is here to stay, thus perpetuating the confusion despite the best efforts to straighten it out (e.g., Hager, 2013). The proposed solution is to re-engineer such conceptual frameworks in a way that not only clarifies the confusion at present but also offers a tool for re-interpreting past and future NHST-based articles in light of their most probable underlying philosophy: either Fisher’s tests of significance or Neyman–Pearson’s tests of acceptance. Hopefully, it may also serve as a tool for researchers to follow their desired philosophy coherently throughout their works.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Notes

Author biography

Jose Perezgonzalez is a lecturer at Massey University’s School of Aviation in New Zealand. His research interests encompass Human Factors and Psychology, including sense-making, research methods and statistics.

References

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.

Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1), 1–32.

Christensen, R. (2005). Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician, 59(2), 121–126.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49(12), 997–1003.

Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161–172.

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, UK: Oliver and Boyd.

Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B (Methodological), 17(1), 69–78.

Fisher, R. A. (1959). Statistical methods and scientific inference (2nd ed.). Edinburgh, UK: Oliver and Boyd.

Fisher, R. A. (1960). The design of experiments (7th ed.). Edinburgh, UK: Oliver and Boyd.

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587–606.

Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137(5), 485–496.

Goodman, S. N. (1999). Toward evidence-based medical statistics: 1. The p-value fallacy. Annals of Internal Medicine, 130(12), 995–1004.

Hager, W. (2013). The statistical theories of Fisher and of Neyman and Pearson: A methodological perspective. Theory & Psychology, 23, 251–270. doi:10.1177/0959354312465483

Halpin, P. F., & Stam, H. J. (2006). Inductive inference or inductive behavior: Fisher and Neyman–Pearson approaches to statistical testing in psychological research (1940–1960). American Journal of Psychology, 119(4), 625–653.

Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory & Psychology, 14, 295–327. doi:10.1177/0959354304043638

Hubbard, R. (2011). The widespread misinterpretation of p-values as error probabilities. Journal of Applied Statistics, 38(11), 2617–2626.

Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing. The American Statistician, 57(3), 171–178. doi:10.1198/0003130031856

Huberty, C. J. (1993). Historical origins of statistical testing practices: The treatment of Fisher versus Neyman–Pearson views in textbooks. Journal of Experimental Education, 61(4), 317–333.

Johnstone, D. J. (1986). Tests of significance in theory and practice. Journal of the Royal Statistical Society, Series D (The Statistician), 35(5), 491–504.

Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5(4), 411–414. doi:10.1037//1082-989X.5.4.411

Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: APA.

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16–26. doi:10.1037//0003-066X.56.1.16

Lehmann, E. L. (1993). The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88(424), 1242–1249.

Lindquist, E. F. (1940). Statistical analysis in educational research. Boston, MA: Houghton Mifflin.

Louçã, F. (2008). The widest cleft in statistics: How and why Fisher opposed Neyman and Pearson (Working Papers WP/02/2008/DE/UECE). Lisbon, Portugal: School of Economics and Management, Technical University of Lisbon. Retrieved from https://www.repository.utl.pt/bitstream/10400.5/2327/1/wp022008.pdf

Neyman, J. (1953). First course in probability and statistics. New York, NY: Henry Holt.

Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20(A), 175–263.

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2), 241–301. doi:10.1037//1082-989X.5.2.241

Pearson, E. S. (1955). Statistical concepts in their relation to reality. Journal of the Royal Statistical Society, Series B (Methodological), 17(2), 204–207.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276–1284.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173–1181.

Schweder, T. (1988). A significant version of the basic Neyman–Pearson theory for scientific hypothesis testing. Scandinavian Journal of Statistics, 15(4), 225–242.

Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4(2), 212–213.

Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.