Abstract
Most of the debates around statistical testing suffer from a failure to identify clearly the features specific to the theories invented by Fisher and by Neyman and Pearson. These features are outlined. The hybrids of Fisher’s and Neyman–Pearson’s theory are briefly addressed. The lack of random sampling and its consequences for statistical inference are also highlighted, leading to the recommendation to dispense with inferences and perform approximate randomization tests instead. A possible scheme for the appraisal of substantive hypotheses is offered, the corroboration of which is a necessary prerequisite for scientific explanations and predictions. The scheme is partly based on the Neyman–Pearson theory. This theory, though not perfect, is superior to its competitors, especially when examining substantive hypotheses. The many statistical and extra-statistical decisions prior to experimentation and the inevitable subjectivity of our research endeavors are emphasized. If feasible, statistical problems should be discussed from an extra-statistical methodological/epistemological viewpoint.
This article will deal mainly with three questions: What did Fisher, on the one hand, and Neyman and Pearson, on the other, express in their theories? Can the lack of random samples be incorporated into the theories? Of what help can statistical tests be in examining substantive (non-statistical) hypotheses?
Statistical testing
Fisher (1956) formulated as the goal of any statistical test: “The statistical tests … are … in use to distinguish real effects of importance to a research programme from such … effects as … appeared in consequence of errors of random sampling, or of uncontrolled variability, of any sort” (p. 75). Statistical tests refer to statistical hypotheses, which are assertions about (theoretical) values of a random variable and/or about its distribution (cf. Hays, 1994, p. 269). They inevitably test their particular null hypothesis (H0), stating that the variation in the data is solely due to chance, and this may mean more specifically: H0: µ1 − µ2 = 0 with µ1 and µ2 being two (theoretical) means. Any statistical test is about p(Data/H0): that is, the probability of the data under the H0 assumed valid.
Statistical tests do not refer to different modes of interpreting their results. To fill this lacuna, R. A. Fisher developed his Theory of Significance Testing (FST) in the early 1920s, and in the late 1920s Neyman and Pearson began to develop their Theory of Testing Statistical Hypotheses (NPT)—two very different theories, as it turned out.
The two statistical testing theories
To understand the differences between the two testing theories, the original sources should be consulted. Amongst others, Halpin and Stam (2006), Hubbard (2004), and Huberty (1987, 1993) give many original quotations, compiled and supplemented in Table 1. Only some points will be addressed subsequently.
Table 1. An overview of the main features of the statistical theories of R.A. Fisher (FST) and of J. Neyman and E.S. Pearson (NPT).
Note. Chow (1996, p. 21) gives a similar, but partly erroneous table [e.g., NPT is about the Bayesian probability p(H0/Data)]. One-sided testing is assumed.
Fisher’s opinion regarding his p value is ambiguous (1950, pp. 80, 93; 1966, pp. 13, 25, 187).
Fisher (1956) claimed: “Inductive inference is the only process known to us by which essentially new knowledge comes into the world” (p. 6). And: “[Inductive] inferences we recognize to be uncertain inferences, but it does not follow from this that they are not mathematically rigorous inferences” (Fisher, 1935, p. 39). (The necessary populations “have no objective reality” and are mere inventions of the statistician; Fisher, 1956, p. 77.) In contrast, Neyman (1942) stated that “there is a considerable amount of reasoning involved [in the theory of testing hypotheses]. As usual, however, the reasoning is deductive” (p. 301). Neyman and Pearson (1933a) thought that “no test based on the theory of probability can by itself provide any valuable evidence of the truth or falsehood of … [a statistical] hypothesis” (pp. 290–291). For Neyman and Pearson (1933a), a Neyman–Pearson test is a “rule of (inductive) behaviour,” as opposed to Fisher’s “inductive inferences”: “If a rule … unambiguously prescribes the selection of action for each possible [experimental] outcome …, then it is a rule of inductive behavior” (Neyman, 1950, p. 10; 1957, 1976). Following Popper (1934/1992), this is a “methodological rule” (pp. 53–56) which may serve as a convention. Whereas Neyman and Pearson emphasized the decisions necessary with their theory (Neyman, 1942, p. 303), Fisher (1956) rejected decisions as unscientific, because “[they] are final” (pp. 98–103), while the results of significance tests are “provisional, and capable, not only of confirmation, but of revision” (p. 99). Pearson (1955) replied that “the tests themselves give no final verdict, but as tools help the worker … to form his final or provisional decision” (p. 206). Fisher (1966, pp. 44–45) emphasized that randomization is a necessary pre-requisite for a valid interpretation of the outcome of a statistical test; Neyman acknowledged the importance of this concept in 1950 (pp. 292–294; 1967; see also Cook & Campbell, 1979; Wilkinson & The Task Force on Statistical Inference, 1999, p. 595, for strategies with non-random assignment).
In the NPT two kinds of error can occur with certain probabilities (Neyman, 1950, pp. 261–264): α = p(AH1/H0 valid; Type I error) and β = p(AH0/H1 valid; Type II error; A: acceptance). The complement of β is called power (1 − β), the probability of correctly accepting the H1. Although ideally both hypotheses should be considered as equally important, fixing unequal error probabilities in advance means treating the two hypotheses as unequally important (Neyman, 1942, p. 304). Together with the sample size and the noncentrality parameters (NCPs), which assess the magnitude of a relationship, α and β form four related determinants, which can be subjected to power analysis to control α and β by determining the sample size. For its execution an exact H1,crit is necessary, which is selected by choosing NCPcrit. Power analysis then rests on special graphs or tables; the concept of power is NPT-specific. Despite the caveat of Wilkinson and the Task Force on Statistical Inference (1999, p. 596), other forms of power analysis are possible: it makes no sense to use the sample size as a “dependent variable” when it is fixed. Fisher’s (1966) counterpart to power is “precision”: “The value of the experiment is increased whenever it permits the [H0] to be more readily disproved” (p. 23). “Precision” can be associated with the NPT without causing inconsistencies, like all methods of Experimental Design, which are not FST-theoretical, but specific to Fisher’s Theory of Experimental Design. For Fisher (1966), “the [H0] is never proved or established, but is possibly disproved” (p. 16), and it is not “accepted” (p. 17). Wilkinson et al. (1999, p. 599) and many others follow this verdict. Neyman (1942) states: “If [the test statistic falls into the region of rejection], the … H0 will be rejected, if it does not, it will be accepted” (p. 303), an “act of will or a decision” (Neyman, 1957, p. 10; cf. Frick, 1995). This decision, however, must rest on a comparatively small β value: If H1 is only accepted if α is small, then accepting H0 is only justified with a small a priori β; often α < β ≤ .20 (.30). Neyman and Pearson did not mention this important point.
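The interplay of α, β, the critical effect, and the sample size can be illustrated with a minimal sketch. It is not the graph- or table-based procedure referred to above: it assumes a one-sided two-sample test on means with equal group sizes and replaces the noncentral t distribution by a normal approximation. The function names `n_per_group` and `approx_power` and the argument `d_crit` (a standardized critical effect) are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d_crit: float, alpha: float = 0.05, beta: float = 0.20) -> int:
    """Approximate sample size per group for a one-sided two-sample test.

    Assumption of this sketch (normal approximation):
        n ~ 2 * ((z_{1-alpha} + z_{1-beta}) / d_crit) ** 2
    """
    if d_crit <= 0:
        raise ValueError("d_crit must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1.0 - alpha)
    z_beta = z.inv_cdf(1.0 - beta)
    return math.ceil(2.0 * ((z_alpha + z_beta) / d_crit) ** 2)


def approx_power(d_crit: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power (1 - beta) for n participants per group."""
    z = NormalDist()
    return z.cdf(d_crit * math.sqrt(n / 2.0) - z.inv_cdf(1.0 - alpha))


if __name__ == "__main__":
    n = n_per_group(d_crit=0.5, alpha=0.05, beta=0.20)
    print(n, round(approx_power(0.5, n), 3))
```

Fixing α, β, and the critical effect determines n; fixing n instead and solving for the power is the other direction of the same relation. Exact values from noncentral distributions may differ slightly from this approximation.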
Some authors take issue with the frequentist definitions of the two error probabilities α and β, being “ill-defined long-run error rates,” according to Oakes (1986, p. 145). The NPT is based on an axiomatic theory of probability (Neyman, 1952, pp. 24–27) in which “probability” is defined as an “abstract concept,” and the usual empirical counterparts are relative frequencies of events (Neyman, 1952, pp. 26–27; 1977). The shortcoming of the “ill-defined error rates” can be overcome by invoking an auxiliary hypothesis without empirical content, stating that the error probabilities are nearly equal to the ones fixed prior to experimentation. Researchers usually, but tacitly, accept this assumption (but see Hacking, 1965).
A Neyman–Pearson test leads to a categorical decision concerning the acceptance of the H0 or the opposed H1: “The outcome of the test reduces to either accepting or rejecting the hypothesis tested” (Neyman, 1950, p. 260). However, this decision only tells us whether a relationship exists in the data or (approximately) not. If it does, its magnitude has to be assessed, “because science is inevitably about magnitudes” (Cohen, 1990, p. 1309). This magnitude is expressed by effect sizes (ESs), which are functionally related to the NCPs (Winer, Brown, & Michels, 1991, p. 127), so that they can be associated with the NPT without inconsistencies; they are—as opposed to the NCPs—free from the sample size: d = (M1 − M2)/se (M: sample mean; s: standard deviation; Cohen, 1988, pp. 21–22). Power analysis based on (theoretical) ESs can be performed for many tests (e.g., Cohen, 1988; Kraemer & Thiemann, 1987). But in research practice they are very rare and this leads to poor empirical investigations with an average power of about .50 (Cohen, 1994, p. 1000).
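A small sketch may make the contrast between an ES and an NCP concrete. It assumes that the denominator of d is the pooled within-group standard deviation and that both groups have the same size n; under these assumptions the noncentrality parameter of the two-sample t test is d × (n/2)^(1/2). The function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_1, group_2):
    """Standardized mean difference d = (M1 - M2) / s_pooled.

    Assumption of this sketch: the standardizer is the pooled
    within-group standard deviation.
    """
    n1, n2 = len(group_1), len(group_2)
    pooled_var = ((n1 - 1) * variance(group_1)
                  + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / math.sqrt(pooled_var)


def noncentrality(d, n):
    """NCP of the two-sample t test for equal group sizes n."""
    return d * math.sqrt(n / 2.0)


if __name__ == "__main__":
    d = cohens_d([4, 5, 6], [1, 2, 3])
    # The same d corresponds to different NCPs as n changes.
    print(d, noncentrality(d, 3), noncentrality(d, 12))
```

The effect size stays the same when only n changes, whereas the NCP grows with the square root of n.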
Table 1 shows many differences between Fisher’s significance testing and the testing of statistical hypotheses of Neyman and Pearson, based on tables of Halpin and Stam (2006), of Huberty (1987, 1993), of Hubbard (2004), and on Fisher (1935, 1947, 1950, 1955, 1956, 1966), on Neyman (1942, 1950, 1952, 1957, 1971, 1976), on Neyman and Pearson (1928, 1933a, 1933b), and on Pearson (1955, 1962), supplemented by some features partly taken from Chow (1996), Mayo (1996), and Wilkinson et al. (1999).
The hybrids
Table 1 shows how the tests should be prepared and how the results should be interpreted in the ideal case. It is well known, however, that in research practice mixtures of the distinct theories prevail (Hubbard, 2004), called the “hybrid models” by Halpin and Stam (2006; see Spielman, 1973). The FST constitutes the essential part of the analysis, supplemented by one or more concept/s of the NPT and partly accompanied by Bayes thinking, giving p(H0/Data) (Gigerenzer & Murray, 1987, pp. 24–25; Pruzek, 1997). Scanning several statistical articles, partly of well-known authors, revealed hybrid terminology in more than 40% of them: for example, “significance test as accept–reject rule.” Although the authors differ in the details, especially Gigerenzer and Murray (1987) and Halpin and Stam (2006), they agree that the hybrids constitute an “incoherent mishmash” of both statistical theories (Gigerenzer, 1993, p. 314), which are “incommensurate” (Hubbard, 2004, p. 319) and “incompatible” (Gigerenzer & Murray, 1987, p. 28). (The latter authors speak of the “hybrid theory” [pp. 21, 23], but a system of conjectures must be non-contradictory and logically consistent to qualify as a “theory.” Any theory has its own basic constructs and ideas, and melding incompatible constructs from different theories constitutes one of the greatest sins scientists can commit; see Neyman, 1976, for an example.) Gigerenzer and Murray (1987, pp. 22–24) see the origins of this hybridization in the early textbooks, in which “the hybrid was … presented as the only truth, … as the single solution to inductive inference” (p. 21; Halpin & Stam, 2006; Huberty, 1993). The “institutionalization of the hybrid/s” was attained, in part, by a “neglect of … controversial issues,” and by an “anonymous presentation of the [constituents of the] hybrid/s” (Gigerenzer & Murray, 1987, p. 23; Halpin & Stam, 2006, p. 640). 
A further reason for this hybridization may be that both parties (necessarily) referred to the same tests. Overall, the hybrids should be banned (e.g., Cortina & Dunlap, 1997; Gigerenzer, 1993, p. 332; Huberty & Pike, 1999, p. 17; for further insightful sources see Cowles, 1989; Halpin & Stam, 2006; Mayo, 1996; Oakes, 1986).
On the problem of the exact H0
This problem occurs under both theories, but is discussed here in NPT terms. The H0 is necessarily exact in order to derive the sampling distributions analytically. This H0, though, is never correct, since the density of an exact value is zero (e.g., Nickerson, 2000, pp. 275–276; Rindskopf, 1997, pp. 320–322, give an overview of the literature; see also Harlow, Mulaik, & Steiger, 1997). But the consequence of the a priori choice of α is that a set of small non-null hypotheses is inevitably declared compatible with the exact H0, so that the H0 always is composite, “composite” here meaning a set of H0,r. Thus, retaining “H0” means accepting the whole set H0,r, including H0 and the small non-null hypotheses, the correct interpretation being “an approximate null difference/correlation” (Chow, 1996). Each non-null hypothesis is associated with its own non-null effect, ES0,r; the maximum non-null effect ES0,max varies with α and n. Consider an example with H0: µ1 − µ2 ≤ 0; H1: µ1 − µ2 > 0; n = 21 per group, α = .05, and tcrit(.05;40) = tcrit,1,min = 1.68385. The ES used above was: d = (M1 − M2)/se = (2/n)^(1/2) × t. The maximum non-null effect is: d0,max = (2/n)^(1/2) × (1.68385 − 0.00001) = 0.51964; t0,max = 1.68384; the minimum ES under the H1,s is: dcrit,1,min = (2/n)^(1/2) × tcrit,1,min = 0.51965. Values d0,r ≤ 0.51964 lie in the region of acceptance and usually are more likely under the H0,r than under the H1,s; they are interpreted as associated with the H0,r (Rindskopf, 1997, pp. 320–322). Values d1,s ≥ 0.51965 and ts ≥ 1.68385 lead to accepting the set H1,s. (See also the informative example of Rosnow & Rosenthal, 1989, pp. 1277–1278, in which an effect of d = .50 is associated with an H0 retained using a comparatively low sample size, that is, N = 20, t = 1.118 and p > .10, and with an H1 for a larger sample size, i.e., N = 80, t = 2.236 and p < .05 [t and p values slightly modified]; see also Cohen’s values of the ESc critical for “statistical significance” [Cohen, 1988, pp. 67, 28–39]). 
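The arithmetic of this example can be retraced with a few lines of code. The sketch takes the critical t value from the text rather than computing it, and it assumes equal group sizes, so that the N = 20 and N = 80 of the Rosnow and Rosenthal example correspond to n = 10 and n = 40 per group. The function names are hypothetical.

```python
import math

T_CRIT = 1.68385   # t_crit(.05; 40), taken from the text
N_PER_GROUP = 21   # n per group, so that df = 2n - 2 = 40


def d_from_t(t, n):
    """d = (2/n)^(1/2) * t for two groups of equal size n."""
    return math.sqrt(2.0 / n) * t


def t_from_d(d, n):
    """Inverse relation: t = d * (n/2)^(1/2)."""
    return d * math.sqrt(n / 2.0)


d_crit_1_min = d_from_t(T_CRIT, N_PER_GROUP)
d_0_max = d_from_t(T_CRIT - 0.00001, N_PER_GROUP)
print(round(d_crit_1_min, 5), round(d_0_max, 5))   # 0.51965 0.51964

# Same effect d = .50, different sample sizes, different t values.
print(round(t_from_d(0.50, 10), 3), round(t_from_d(0.50, 40), 3))   # 1.118 2.236
```

With n held fixed, every d below the critical value falls into the region of acceptance; with d held fixed, the decision changes with n alone.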
A variant of the (computationally laborious) “good-enough principle” of Serlin and Lapsley (1993) is already part of any statistical test, and this belt seems unnecessary (Chow, 1996, pp. 56–57). “The … [statistical] test acts as a (mediocre) filter for separating effects of different magnitude” (Abelson, 1997, p. 14). There are other points worth discussing, but space is limited. Let’s attack a further problem.
The lack of random samples and some of its consequences
In statistical testing, another problem needs to be addressed: “All our conclusions … rest on a process of random sampling, without it our tests of significance would be worthless” (Fisher, 1947, pp. 435–436). But Edgington (1995) states: “Few experiments in … psychology … use randomly selected subjects, and those [experimenters] who do usually concern populations so specific as to be of little interest” (pp. 7, 335–336). Instead, “convenience samples” with participants at hand are used (Hunter & May, 2003; Reichardt & Gollob, 1999), which—though not necessarily explicitly—are interpreted as random samples from the target populations—an auxiliary hypothesis not testable, which serves as a convention (Westermann, 2000, p. 351); the target populations are never infinite (Reichardt & Gollob, 1999, p. 117). But Hays (1994) warns: “Unless the assumption of random sampling is at least reasonable, the probability results of inferential methods mean very little, and these methods might as well be omitted” (p. 227). Frick (1998) reasons: “The assumption of random sampling from a population is unjustified, unnecessary, inaccurate, and does not serve psychology well” (p. 527).
If random samples are lacking, “it is important to see whether there is any justification for … hypothesis testing procedures” (Edgington, 1966, p. 485). Fisher (1966, pp. 44–48) was the first to consider this problem, and later it was shown “that … [randomization] tests could be based on random assignment [or permutation] alone, stressing their freedom from the assumption of random sampling” (Edgington, 1995, p. 18). (For the procedure of generating the randomization sampling distributions by permuting the data at hand, see Edgington, 1995; the sampling distributions for the rank tests are generated in the same way.) Generally, the randomization critical values are close to the ones obtained from the respective parametric sampling distributions (e.g., Anderson & Ter Braak, 2003; Edgington, 1995, pp. 39–44; Hunter & May, 2003; Scheffé, 1959, pp. 291–330; Still & White, 1981), so that the tabled distributions can be used instead of the randomization ones (Edgington, 1995, pp. 338–340): “The [parametric] test is used as an approximation to a randomization test,” and it is called an “approximate randomization test” (Edgington, 1969). (In contrast to “randomization tests,” “permutation tests” rest on random sampling from unspecified populations; Edgington, 1995, pp. 337–338).
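How a randomization sampling distribution arises from the data at hand can be sketched for the simplest case. The code below is an illustration, not Edgington's own procedure: it assumes two groups, takes the difference of means as the test statistic, tests one-sided, and enumerates every possible assignment, which is feasible only for small samples. The function name is hypothetical.

```python
from itertools import combinations
from statistics import mean


def randomization_p_value(group_a, group_b):
    """One-sided exact randomization test of mean(group_b) - mean(group_a) > 0.

    Every split of the pooled data into groups of the original sizes is
    treated as equally likely under the H0 of identical treatment effects.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = mean(group_b) - mean(group_a)

    n_assignments = 0
    n_at_least_as_large = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = (total - sum_a) / n_b - sum_a / n_a
        n_assignments += 1
        # Small tolerance guards against floating-point ties.
        if difference >= observed - 1e-12:
            n_at_least_as_large += 1
    return n_at_least_as_large / n_assignments


if __name__ == "__main__":
    # Hypothetical scores for two randomly assigned groups.
    print(randomization_p_value([1, 2, 3], [4, 5, 6]))
```

With three scores per group there are 20 possible assignments, and only the observed one yields a difference as large as the observed one, so the p value is 1/20. No reference to a population is needed at any step.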
Randomization tests, free from assumptions concerning populations, test a “wider hypothesis” (Fisher, 1966, pp. 44–49) than, for example, the t test on means from normal distributions: that is, the H0 that the underlying distributions are equal or the “identity of treatment effects” (Edgington, 1995, p. 339). Rejecting this H0 means that the distributions are unequal. To test hypotheses about, say, expected means, the randomization tests need to be associated with the assumption that the underlying distributions are at least approximately equal in form, which seems widely accepted in empirical research. Based on this assumption, the importance of findings such as those of Boik (1987) is de-emphasized. (Boik refers to the non-robustness of a randomization test in the case of heterogeneous variances, a finding that also holds for normal theory tests.) With this new justification, statistical tests are mere binary decision rules for accepting or rejecting a statistical hypothesis, apparently conforming to the NPT, but not to the FST (see Reichardt & Gollob, 1999, for another approach concerning randomization tests).
Without random sampling, “statistical inferences about populations are … irrelevant” (Edgington, 1995, p. ix; see also Hays, 1994, p. 227), but generalizability is only seldom the research goal (Frick, 1998; Hunter & May, 2003, p. 180). With randomization tests, generalizations are impossible and the results are valid only for the one experiment (Edgington, 1995, p. 339; Westermann, 2000, pp. 431–433). Oakes (1986) stated: “It is not unreasonable to conclude, as Popper did with induction, that statistical inference is a myth” (p. 145), but a myth statisticians and also psychological researchers adhere to invariably. Parameter estimations need populations, which are absent, and without parameters confidence intervals are no longer needed (cf. Cortina & Dunlap, 1997, p. 170). Additionally, “it is rare for psychologists to need estimates of parameters; we are more typically interested whether a causal relationship exists between independent and dependent variables” (Killeen, 2005, p. 345; see also Cortina & Dunlap, 1997, p. 171; see for alternative approaches, e.g., Davison & Hinkley, 2006; Efron & Tibshirani, 1993; Thompson, 1993).
Amongst others, Anderson and Ter Braak (2003), Still and White (1981), and Westermann (2000, pp. 374–375) dealt with the power of randomization tests and reasoned that it is about equal to that of the corresponding parametric tests. Therefore power analysis for approximate randomization tests can be performed in the same manner as for parametric tests.
Substantive and statistical hypotheses
In this part, a possible scheme for examining substantive hypotheses (SuH) by statistical ones will be presented; a subject which is discussed only rarely in the statistical literature (e.g., Fisher, 1966; Neyman, 1950, 1957, 1976; Pearson, 1962). Neyman (1950) stated that when planning an experiment, one usually has in mind the appraisal of an SuH, “which is not a statistical hypothesis H. H is never identical with the [nonstatistical SuH]; since… the [SuH] does not mention any random variables” (p. 290). What about psychologists? “We must carefully distinguish substantive theory [and SuH] from statistical hypothesis” (Meehl, 1978, p. 824). “[An] H1 is neither the substantive hypothesis itself nor a paraphrase of the substantive hypothesis” (Chow, 1996, p. 70; cf. Cortina & Dunlap, 1997). Also Fisher (1956, pp. 46–50) addresses “scientific hypotheses,” which “differ from the simple (statistical) hypotheses such as ‘the random distribution of the stars’, in that they allow of one or more parameters,” so that Fisher’s “scientific hypotheses” are better described as complex statistical hypotheses. Moreover, he gave no hint of how to connect his “scientific” hypotheses with substantive ones, but he explained in which way his scientific hypotheses can be subjected to a test of significance.
What is the difference between the two kinds of hypotheses?
Although any competent psychologist knows the difference between an SuH and a statistical hypothesis (Meehl, 1967, p. 107), in practice this fact is often neglected, as may be seen in the many cases where an SuH is equated to an H1 (and its artificial negation to an H0; an SuH can only be negated by another SuH some researchers adhere to; Neyman, 1950, p. 290; Popper, 1934/1992, p. 87). Statistical hypotheses refer solely to random variables (see above), whereas substantive hypotheses typically are conjectures about unobservable hypothetical entities, states, processes, and the relationship/s between the concepts (Popper, 1934/1992). These differences mean that substantive theory/SuH appraisal is completely different from statistical testing.
It is not the aim of science to formulate SuH or to generalize (Popper, 1934/1992). Instead, one of the most important aims of science is to explain the domain-specific phenomena, and to control and to predict behavior (e.g., Popper, 1934/1992, pp. 59–62). These aims can only be reached by corroborated SuH and theories, and corroboration is as important as discorroboration. Theories are tools for generating SuH, and any theory leads to more than one SuH (Chow, 1996, p. 49). Some SuH not inspired by a theory are mere inventions, possibly based on some data (Popper, 1934/1992, p. 32). In this case, they must be examined with new experiments (Mayo, 1996). Research consists in attempts to apply the SuH to one experimental situation after the other, the SuH being evaluated separately for each experiment. This process is mostly deductive, as is the NPT; also Fisher (1950, p. 8) uses deductive arguments in the present context. Most importantly, the SuH determines the adequate analysis and not the Experimental Design: “In research it is better to ask, ‘What is the best way to test my [substantive …] hypothesis’ rather than ‘Which statistical test is appropriate for these data?’” (Gonzalez, 1994, p. 326). What about the linkage between the hypotheses?
“A statistical hypothesis H [concerning certain …] random variables, is … formulated so as to be intimately related to the … [SuH]” (Neyman, 1950, p. 290), with the linkage being “frequently loose.” Chow (1996, pp. 46–51) prefers an “implicative relationship,” whereas Meehl (1967, p. 104) proposed “to derive” statistical hypotheses “in a rather loose sense” (“deductive derivability”; Meehl, 1997, p. 398); Popper (1934/1992, pp. 76–77) and others (e.g., Chow, 1996, pp. 71–72) discuss or use the logical “modus tollens” (Cohen, 1994, p. 998; Falk & Greenbaum, 1995, p. 75). Meehl (1997, pp. 398–400) shows why this syllogism is inappropriate in appraising SuH, as it is with statistical testing (Cohen, 1994, p. 998; Cortina & Dunlap, 1997, p. 170). Therefore, this syllogism is replaced by decisions, so that the burden of the “proper” decisions is put on the researcher’s competent shoulders.
The derivation of the substantive prediction must consider many auxiliary hypotheses (Meehl, 1978, 1997, p. 398), the “ceteris paribus clause” (CPC, i.e., “all other things being about equal,” Meehl, 1997, p. 398), the applicability of the theory/SuH, the experimental design, the a priori criteria for “corroborating” or “discorroborating” the SuH (Lakatos, 1978; Popper, 1934/1992), most of the points of Wilkinson et al. (1999), and so on. In short: The concrete experimental situation with all details must be specified, forming the set CSES (Completely Specified Experimental Situation), filling the subsequent lacuna: “We left in our mathematical model a gap for the exercise of a more intuitive process of personal judgment in such matters as … the appropriate … (error probability α …), the magnitude of worthwhile effects and the balance of utilities” (Pearson, 1962, p. 396). Most decisions are extra-statistical, inevitably subjective, and sometimes arbitrary; science is a wo/man-made endeavor. Thompson (1993) states: “Like it or not, empirical science is inescapably a subjective business” (p. 365; cf. Frick, 1999, p. 187), so that the “fight against subjectivity” (Gigerenzer, 1987, p. 11) can never be won, despite statistical tests. We must learn to accept that subjective and arbitrary factors play a decisive role in appraisals of SuH, and that our data are impregnated by theories and hypotheses, and they are fallible (Popper, 1934/1992, pp. 59, 107; see also Hacking, 1965, p. 84; Lakatos, 1978, pp. 29–30; Polkinghorne, 1983, p. 125). And: Different research strategies and different types of analysis will possibly lead to divergent results.
Direct derivations from the SuH are impossible because of the theoretical variables involved, and to appraise an abstract SuH, the abstract concepts must be converted into observable or empirical variables, which can be chosen according to former theory appraisals. After enriching the SuH with the set CSES, a substantive prediction (SuP) referring to observable variables solely is derived from the SuH. From the SuP, enriched by random variables (RVs), a statistical testing theory, and statistical auxiliary hypotheses (SAH), a conforming H1 is derived. In rare cases a conforming H0 can be derived from an SuP. “The primary product of a research inquiry is one or more measures of effect size, not p values” (Cohen, 1990, p. 1310). Although some authors claim the contrary, any theory/SuH claiming a relationship only contains information concerning the area or interval the ES should manifest in (Frick, 1999, p. 186); the area depends on the SuP and the CSES, and it is restricted by choosing the EScrit for power analysis. The EScrit can be based on former applications of the theory/SuH and modified with respect to differences in the CSES. Although the call for ESs is ubiquitous, nobody tells us what to do with them. So, I propose to interpret the ESs as a crude possibility of operationalizing the “degree of corroboration” (DC; Popper, 1934/1992, p. 281): DC = ES/EScrit. It is crude because it is unknown whether there is a monotonic relationship between the ES and the (real) “degree of corroboration” (see, for a more sophisticated alternative, Meehl, 1997). After testing, the concept of “inductive behaviour” is again applied “to assume a particular attitude towards the … [SuH]” (Neyman, 1957, p. 10). The following examples are simplified (see Table 2 for a summary of steps).
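The proposed index is simple enough to be written down directly. The following lines are only an illustration with hypothetical numbers; the function name and the requirement of a positive EScrit are assumptions of the sketch.

```python
def degree_of_corroboration(es_observed: float, es_crit: float) -> float:
    """Crude index DC = ES / ES_crit.

    es_crit is the critical effect size fixed a priori for power analysis;
    it is assumed to be positive here.
    """
    if es_crit <= 0:
        raise ValueError("es_crit must be positive")
    return es_observed / es_crit


if __name__ == "__main__":
    # Hypothetical values: observed d = 0.60, a priori critical d = 0.50.
    print(round(degree_of_corroboration(0.60, 0.50), 2))
```

By construction, DC reaches 1 exactly when the empirical ES equals the EScrit chosen before the experiment.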
Table 2. Appraisal of a substantive hypothesis and the NPT: Some necessary complements.
Note. This table contains the most important complements for appraisal of a substantive hypothesis using the NPT.
Example 1
Paivio’s (1971, 1990) Dual Coding Theory tries to explain different amounts of learning by the extent to which the two stores or codes are used. High imaginal content, as in “chair” (as contrasted with “idea”), is postulated to activate the verbal plus the imaginal store, thus leading to a superior performance of high imagery words because of more features encoded. This theory is applicable to verbal learning. An SuH-P may be: “If high and low imaginal content words are learned, then the amount of learning is CPC-P higher for words with high imaginal content than for words with low imaginal content.” One of the learning paradigms is chosen and becomes part of the CSES-P. The empirical independent variable is the scaled imagery of the words in the lists, assessed prior to experimentation; the empirical dependent variable is the number of correctly reproduced words; it is at least interval scaled. The Ss are randomized. These aspects become part of the CSES-P. The SuH-P is enriched by the CSES-P, and one of several possible SuP-P is then derived (∧: logical “and”; ≈>: loose derivation; >–: is enriched by; AH: auxiliary hypotheses): [SuH-P >– (AH-P ∧ CPC-P ∧
Example 2
The StP is introduced since often statistical hypotheses are derived which cannot exhaustively be tested with one test, for example the StP-Q: “The relationship between the independent and the means of the dependent variable, an RV, is exclusively positive linear.” The shortened derivation leads to: (SuH-Q) >–/
Example 3
If, on the other hand, a qualitative and strictly monotonic or strictly ordered trend (i.e., without equalities and without rank inversions) is predicted, StP-MT: µ1 < µ2 < … < µK, the simplest procedure consists in performing a series of K − 1 t tests (Winer et al., 1991, p. 146) on the partial hypotheses H1,t; these tests also are applicable for more complex patterns, for example a bitonic trend with at least one equality and so on. H1,t: Δt = µk+1 − µk > 0 suffices, since there can be only one rank order in the data. The retention of at least one H0,t: Δt = µk+1 − µk ≤ 0 leads to rejecting the StP-MT. Despite the cumulation of one of the error probabilities, power analysis can easily be done here (Hager, 2004). This StP-MT very often results for qualitative independent variables: for example, if the SuH-P is examined with K ≥ 3 experimental treatments. Tests for an ordered alternative should not be used, since they do not test against the StP-MT as H1 (Berenson, 1982). The H1 of these tests is “weaker,” since it is accepted in spite of partial equalities and/or partial rank inversions, as long as at least one pair difference (not necessarily between adjacent means) conforms to the prediction. Thus, these tests are roughly similar to K(K − 1)/2 t tests, but without cumulating error probabilities.
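The decision rule for the StP-MT can be sketched as follows. The code assumes that each of the K − 1 tests is a pooled-variance t test on two adjacent groups only and that the one-sided critical value is looked up in a table and passed in; whether the error variance should instead be pooled over all K groups is left open here. Function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(lower, upper):
    """Two-sample t statistic for H1,t: mu_upper - mu_lower > 0."""
    n1, n2 = len(lower), len(upper)
    pooled_var = ((n1 - 1) * variance(lower)
                  + (n2 - 1) * variance(upper)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("zero within-group variance; t is undefined")
    return (mean(upper) - mean(lower)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def appraise_monotonic_trend(groups, t_crit):
    """Accept the StP-MT only if all K - 1 adjacent tests accept their H1,t.

    groups: samples ordered as predicted (group 1, ..., group K).
    t_crit: one-sided critical t value for the chosen alpha and df.
    """
    t_values = [pooled_t(groups[k], groups[k + 1])
                for k in range(len(groups) - 1)]
    accepted = all(t >= t_crit for t in t_values)
    return accepted, t_values


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]       # hypothetical scores
    print(appraise_monotonic_trend(data, t_crit=2.132))  # t_crit(.05; 4) from a t table
```

A single retained H0,t is enough to reject the predicted strict order, which is what distinguishes this conjunctive rule from tests for an ordered alternative.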
This proposal for examining SuH contains an asymmetry: Whenever an H0 is derived, the corroborating result is accepting this H0 which implies ES ≈ 0. So, only a dichotomous decision is possible, and a qualified appraisal of the SuH cannot be achieved.
The proposed deductive scheme can, by the way, also be used for SuH in the technological domain (e.g., psychotherapy research).
But the commonly encountered alternative to the procedure just outlined is to choose an Experimental Design, collect data, and analyze them with the analysis of variance (ANOVA) appropriate for this design, thus following Fisher (1966): “Statistical procedure and experimental design are only two different aspects of the same whole” (p. 3). But the Experimental Design is only one determinant of the proper statistical tests. And: ANOVAs test bidirectional hypotheses on means or on squared multiple correlations, whereas most SuH are directional. (Any psychotherapy claims to help the clients to lead a better life.)
Conclusions
What about the questions this article began with?
What did Fisher, on the one hand, and Neyman and Pearson, on the other, express in their respective theories? The main features of the theories of these authors have been compiled in Table 1 and will not be repeated here.
Can the lack of random samples be incorporated into the theories? Of what help can statistical tests be in examining substantive hypotheses?
The case for an SuH preceding statistical testing
Progress in scientific knowledge, be it cumulative or not (Kuhn, 1970; Mayo, 1996), can only be achieved by examining SuH, not by testing isolated statistical hypotheses (Neyman, 1950, p. 290). Therefore, research should begin with an SuH, often inspired by a theory. As Popper (1981) formulated it: “[P]ure observational knowledge, unadulterated by theory, would, if at all possible, be utterly barren and futile” (p. 23).
The case for statistical testing (ST) I
According to Abelson (1997), “[W]e are awash in a sea of uncertainty, caused by a flood tide of sampling and measurement errors” (p. 13). Statistical tests enable researchers to divide the total variation into a systematic and an unsystematic part (Abelson, 1995, p. 9; Fisher, 1956, p. 75). The criterion for this separation is chosen by the researcher(s) and varies over different experiments; it is probabilistic, thus taking into account the fallibility of our data.
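One common reading of this division, assumed here for illustration, is the ANOVA partition of the total sum of squares into a between-groups (systematic) and a within-groups (unsystematic) part. The following sketch shows only the arithmetic of the partition; the function name and the data are hypothetical, and the probabilistic criterion that decides whether the systematic part is taken to be real is not modeled.

```python
def partition_sums_of_squares(groups):
    """Split total variation into between-groups and within-groups parts.

    Returns (ss_total, ss_between, ss_within), where
    ss_total = ss_between + ss_within up to floating-point rounding.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Total variation: squared deviations of every score from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Systematic part (assumed reading): group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Unsystematic part (assumed reading): scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return ss_total, ss_between, ss_within


# Hypothetical data for three treatments.
ss_total, ss_between, ss_within = partition_sums_of_squares(
    [[10, 12, 11, 13, 9], [15, 17, 16, 18, 14], [21, 23, 22, 24, 20]]
)
print(ss_total, ss_between, ss_within)
```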
The case for ST II
Statistical testing is the most widespread strategy for data analysis and constitutes a conventional procedure. According to Westermann (2000, p. 351), procedures like this are necessary ingredients of domain-specific paradigms in the sense of Kuhn (1970), and without them no systematic research is possible in normal science; therefore any ban on statistical tests would be counterproductive and useless. But: Statistical testing should always follow an SuH.
The case for the NPT I
If there is no random sampling, approximate randomization tests are applied, whose results are exclusively valid for the sample under study. This approach precludes generalizations and is consistent with the NPT.
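A minimal sketch of an approximate randomization test for two groups may make the idea concrete. It assumes a directional prediction (treatment mean greater than control mean) and uses the difference in means as the test statistic; the function name, the number of reshuffles, the seed, and the data are hypothetical choices. In line with the text, the resulting p value refers only to the sample under study.

```python
import random


def approx_randomization_test(control, treatment, n_resamples=9999, seed=0):
    """One-sided approximate randomization test for mean(treatment) > mean(control).

    The observed scores are repeatedly reassigned to the two groups at random;
    the p value is the proportion of reassignments whose mean difference is at
    least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    n_control = len(control)
    observed = sum(treatment) / len(treatment) - sum(control) / n_control
    pooled = list(control) + list(treatment)

    at_least_as_large = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        new_control, new_treatment = pooled[:n_control], pooled[n_control:]
        diff = sum(new_treatment) / len(new_treatment) - sum(new_control) / n_control
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_large += 1

    # Adding 1 to numerator and denominator counts the observed assignment itself.
    return (at_least_as_large + 1) / (n_resamples + 1)


# Hypothetical scores from a non-random sample.
p_value = approx_randomization_test([10, 12, 11, 13, 9], [21, 23, 22, 24, 20])
print(p_value)
```

The test is "approximate" because only a random subset of all possible reassignments is examined rather than the complete set.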
The case for the NPT II
The examination of SuH is error-prone, and there are two probabilities of wrong assertions, which the NPT allows one to control, thus additionally conforming to the central aim of Experimental Design: to minimize the frequency of wrong assertions. This fact is often overlooked by authors arguing for a ban on statistical testing (e.g., Schmidt & Hunter, 1997; see also Harlow et al., 1997).
The case for the NPT III
According to Neyman and Pearson, the truth or falsehood of statistical and substantive hypotheses always remains unknown. We only act as if the hypothesis accepted/corroborated were true. This opinion is shared by modern philosophers like Popper and Lakatos.
The case for the NPT IV
After deriving a substantive prediction from the SuH, the conforming statistical hypothesis is derived prior to experimentation (Meehl, 1967, 1997) and its complement is stated. Only the NPT allows derivation of an H0 or an H1 on an a priori basis, both of which can be accepted. This is necessary when SuH are appraised.
The case for the NPT V
The criteria for deciding on an SuH must be defined prior to experimentation (Lakatos, Popper), as is the case with the NPT.
The case for the NPT VI
The ESs can be associated with the NPT (but not with the FST). They are computed after the NP test with its dichotomous decision to get a qualified judgment concerning the SuH.
The case for the NPT VII
The NPT explicitly addresses the many and mainly extra-statistical decisions the researcher has to make before data gathering.
The case for the NPT VIII
Meehl (1978) commented on the FST: “[T]he almost universal reliance on merely refuting the null hypothesis … is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology” (p. 817). Oakes (1986) reasons: “[T]he statements emanating from a Neyman–Pearson analysis are deductive truths (conditional upon a variety of assumptions). The charge against the Neyman–Pearson doctrine is one of irrelevance. … We are offered rules for decisions with ill-defined long-run error rates” (p. 145). Of course, the NPT should not be considered perfect and is therefore open to justified criticism (e.g., Spielman, 1973). The same holds for any other theory (Kuhn, 1970). Cortina and Dunlap (1997) correctly state: “The abuses of [statistical tests] have come about largely because of a lack of judgment or education with respect to those using the procedure. The cure lies in improving education and, consequently, judgment, not in abolishing the method” (p. 171; see also Falk & Greenbaum, 1995, p. 94; Huberty, 1993, pp. 331–332). Neyman and Pearson (1928) add: “The [statistical] tests should only be regarded as tools which must be used with discretion and understanding, and not as instruments which in themselves give the final verdict” (p. 232). “There are no objective procedures that avoid human judgment and guarantee correct interpretations of results” (Abelson, 1997, p. 13). Hays (1988) correctly claims:
It is sad but true: the specificity of our final conclusion is more or less bought in terms of what we already know or can at least assume to be true [as is the case, for example, with the auxiliary hypotheses]. If we do not know or assume anything, we cannot conclude very much. (p. 815)
At the end of this article it seems appropriate to return to some general non-statistical aspects in order not to conclude with statistics (recency effect!).
There is no algorithm for the appropriate examination of a substantive hypothesis via statistical hypotheses; at best there are some heuristics, as presented above. Despite these heuristics, the researchers’ decisions depend to a high degree on their knowledge concerning many aspects partly addressed in this article. But whatever decisions are made: “A principle … to remember is, keep it simple! Concentrate on how well and carefully you can carry out a few meaningful manipulations” (Hays, 1994, p. 585; see also Cohen, 1990, pp. 1304–1305). “Other things being equal, the simpler the experiment, the better will its execution, and the more likely will one be able to decide what actually happened and what the results really mean” (Hays, 1994, p. 518). Without the choice of an appropriate experimental design and without specifying the CSES, no derivations are possible. Therefore, it is necessary to pay more attention to designing and planning an experiment carefully and to invoke statistical techniques in a later phase of designing (Gigerenzer, 1987, p. 23; Preece, 1982, p. 204).
Overall, there are important reasons to continue with statistical testing, but judiciously and with sound judgment; statistical tests, though important, are merely convenient tools for appraising SuH. The basis should be the NPT, which is the best testing theory available, although it is not perfect. I recommend routinely choosing an extra-statistical viewpoint, such as substantive hypotheses, a methodology, or some philosophy or epistemology, as a general background for discussing statistical problems (Meehl, 1997; Serlin & Lapsley, 1993); the isolated discussion of statistical problems has rarely been beneficial.
In view of the long duration of the discussion about statistical testing, one question arises: What are the reasons for not inventing a superior theory to replace the NPT?
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
