Abstract
Klein (2014) argues that the replication crisis in social psychology is due—at least in large part—to the tendency of psychological theories to be ill-specified. We disagree. First, we use both historical and contemporary examples to show that high-quality replication is possible even in the absence of a well-specified theory; and, second, we argue that it is typically auxiliary assumptions, rather than theories themselves, that need to be more clearly specified in order to understand the implications of a given replication effort.
Followers of contemporary debates in psychology will have noticed that questionable research practices (QRPs; see John, Loewenstein, & Prelec, 2012) have taken the lion’s share of the blame for the so-called replication crisis affecting the field (see Earp & Trafimow, 2015, for an overview; see also Earp, Everett, Madva, & Hamlin, 2014; Everett & Earp, 2015; Pashler & Wagenmakers, 2012; Simmons, Nelson, & Simonsohn, 2011). In a recent paper in this journal, Klein (2014) expresses sympathy with those who offer such a diagnosis, but nevertheless insists they have missed the main culprit. Specifically, Klein suggests that many theories in psychology are not sufficiently well specified, and that this lack of specification is primarily responsible for the fact that many studies do not appear to replicate (e.g., Open Science Collaboration, 2015; but see Earp, 2015, 2016; Earp & Everett, 2015; Feldman Barrett, 2015).
Klein (2014) is certainly right about the lack of specification. As Karl Popper once noted: “too many theories, particularly in the social sciences, [are] constructed so loosely that they could be stretched to fit any conceivable set of experimental results, making them … devoid of testable content” (as summarized by Folger, 1989, p. 156). We do not disagree that this is a problem for psychology (and perhaps especially for social psychology). But we differ from Klein in that we do not think the replication crisis can be blamed on this issue. To show why this is the case, we will emphasize two main factors: first, the history of science, which is replete with important findings that could be replicated based on no theory or bad theory; and second, the argument that—even from a purely logical perspective—predictions come from auxiliary assumptions in combination with theories, not from theories alone. Taken together, we believe that these considerations pose a serious challenge for Klein’s primary argument.
Replication in the absence of a well-specified theory: Some examples
To show that replication is possible even in the absence of a well-specified theory, we turn first to the history of science. We begin with an example from chemistry. Consider phlogiston theory, a blatantly wrong and ill-specified theory, at least from the perspective of hindsight (but see Chang, 2012), which nevertheless dominated the field from approximately the late 17th century to the late 18th century (Conant, 1964). Roughly, this theory held that the fire-like element of phlogiston was responsible for combustion, although the specific nature of this relationship was never precisely articulated. Despite this lack of specification, researchers were able to demonstrate, and replicate, the existence of oxygen (wrongly considered to be “dephlogisticated” air), nitrogen (“phlogisticated” air), and other major elements. Eventually, Lavoisier disconfirmed phlogiston theory on the basis of replicable findings he obtained using increasingly precise measurements (e.g., some objects had increased weight after allegedly losing phlogiston) and suggested a better theory (see Trafimow & Rice, 2009, for further discussion).
There are two points worth emphasizing here. First: there were important and replicable findings even with as badly specified a theory as phlogiston theory. And second: these replicable findings provided the fodder for Lavoisier’s important theoretical advances. This suggests that replicable findings, produced under the aegis of even a bad or ill-specified theory, can nevertheless be important for theory generation and development.
Let us take another example. Consider Galileo and his famous experiments with rolling balls down inclined planes (for an introduction, see Asimov, 1966). When Galileo began this empirical work, he was not guided by any formal theory; he also had the disadvantage of having to use only the very imprecise time-measuring devices that were available in his era. Nevertheless, he produced highly replicable findings relating the degrees of incline of the planes to the velocity of the balls rolling down them. Based on this mathematical relationship, he extrapolated to the case in which the incline was 90 degrees, to draw conclusions about falling objects. Although Galileo eventually performed important theoretical work that included concepts such as inertia and Galilean relativity, the original empirical findings, which he himself replicated on many occasions and which still hold up today, were not based on a formal theory.
Finally, consider a more contemporary example from psychology. One of us has published work showing replicable findings in a situation in which the relevant theory was patently indeterminate. Specifically, Trafimow, Triandis, and Goto (1991) showed that it was possible to prime either the “private self” or the “collective self”—in order to elicit either more private or more collective self-cognitions, respectively—using an invented target person from ancient Mesopotamia. The hypothesis for this experiment was that describing the target person’s commitment to his family would increase the accessibility of the collective self, whereas describing the target person’s personal traits would increase the accessibility of the private self. Responses on the Twenty Statements Test (TST; Kuhn & McPartland, 1954; for a more recent application see Bargh & Earp, 2009) were taken to provide a plausible index of the relative accessibility of these differing self-cognitions.
Importantly, Trafimow et al. (1991) were able to show the predicted effect, and many others were able to replicate it (see the meta-analysis by Oyserman & Lee, 2008): namely, that priming the private self increased the proportion of private self-cognitions listed by participants on their TST protocols, whereas priming the collective self increased the proportion of collective self-cognitions. Yet this was despite the fact that the theory that generated this prediction was not at all precisely defined.
These examples spell trouble for Klein’s (2014) argument. Specifically, they show that having a well-specified theory is not a prerequisite for obtaining replicable findings. Therefore, although we do not dispute that many theories in psychology are of poor quality, one can hardly blame the apparent difficulty that many psychologists have in replicating findings on this fact.
Theory specification: The importance of auxiliary assumptions
There is a second problem with Klein’s (2014) argument. Even in the case where there is a clear theory to draw upon, it is important to remember that empirical predictions come from the combination of a theory and auxiliary assumptions rather than from a theory alone (Earp & Trafimow, 2015). If an empirical prediction is made, therefore, its success or failure can be attributed either to the theory itself, or to at least one auxiliary assumption (i.e., a logical assumption that is required to link the theory to an actual observation). If a finding does not (apparently) replicate, Klein (2014) suggests that we should place blame on the theory itself for being inadequately specified. But we have argued elsewhere that just as good a case can be made for blaming one of the auxiliary assumptions (see Earp & Trafimow, 2015; see also Trafimow, 2003, 2009).
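The logical structure of this point can be sketched schematically. The symbols below are our own shorthand, introduced purely for illustration and not taken from Klein (2014): T stands for the theory, A_1 through A_n for the auxiliary assumptions, and O for the predicted observation.

```latex
\begin{align*}
(T \land A_1 \land \dots \land A_n) &\rightarrow O
  && \text{(the prediction requires theory plus auxiliaries)} \\
\neg O &\rightarrow \neg (T \land A_1 \land \dots \land A_n)
  && \text{(contraposition)} \\
\neg (T \land A_1 \land \dots \land A_n) &\equiv \neg T \lor \neg A_1 \lor \dots \lor \neg A_n
  && \text{(De Morgan)}
\end{align*}
```

On this reading, a failed prediction shows only that at least one conjunct is false. Logic alone does not say whether the fault lies with the theory or with one of the auxiliary assumptions, so a failed replication is compatible with blaming an auxiliary assumption rather than the theory.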
Now, Klein himself comes close to endorsing this view when he writes that “the conditions of replication can derive from pragmatic as well as theoretical considerations … The former can be reasonably addressed by (a) simple logic (e.g., visually impaired individuals should not take part in visual perception studies, participants should not be told the experimental hypothesis in advance of the study, etc.), [or] (b) random assignment” (2014, p. 329).
But we suggest that this bland articulation does not do justice to the work of auxiliary assumptions in linking theories to observations. Specifically, as we will discuss in a later section, auxiliary assumptions are often much more sophisticated than would be implied by the “obvious” need to, for example, bar visually impaired individuals from participating in visual perception studies.
In fact, a serious consideration of auxiliary assumptions in the context of the history of science suggests that Klein’s argument can be turned on its head. Consider, for example, the complexity of the universe that proved challenging to explain theoretically, even for such luminaries as Newton. As a matter of fact, Newton’s Laws of Motion and Universal Gravitation do a poor job of making predictions about moving bodies on our planet, because they do not take friction with the atmosphere into account (Asimov, 1971). If a Newtonian test were to be performed at different atmospheric levels, we would not expect much success in terms of replication across the different experiments. However, if we used appropriate auxiliary assumptions to take friction into account, our results would be much more replicable.
It is interesting to speculate about whether Newton should be criticized for failing to include friction in his theorizing. Based on Klein’s argument, one might wish to claim that Newton’s theory was not sufficiently well-specified because it failed to account for an important factor that influences experimental findings. But by not including friction in his theorizing—and thus by invoking an over-simplified universe—Newton was able to generate a parsimonious explanatory model. Indeed, all scientific theories invoke simplified universes. Rather than being a weakness, however, this is one of the main reasons for having a theory in the first place: it helps us to step back from sheer re-statements of the data and look for explanatory pathways. In other words, it is a mistake to include all of the less basic factors, such as friction in Newton’s case, so as to have a “well specified” theory. Rather, such issues should be relegated to the category of auxiliary assumptions.
To see why this is the case, consider that theoretical terms, by their very nature, are general and non-observational (Trafimow, 2012). Thus it is not reasonable to expect that they should be particularly well specified. For instance, Newton never even defined mass—a core theoretical concept in physics—as discussed at length by the Nobel Laureate Leon Lederman (1993). Rather, auxiliary assumptions bring about the specification that Klein desires by providing the means to traverse the gap between the non-observational terms in theories and the observational terms in empirical hypotheses (Trafimow, 2012).
Some examples from social psychology
An episode from the history of psychology should drive this point home. Consider a classic example: according to the theory of reasoned action (e.g., Fishbein, 1980), attitudes determine behavioral intentions. One implication of this theoretical assumption is that researchers should be able to obtain strong correlations between these two constructs. When it comes to actually carrying out a study, however, this assumes, among other things, that a check mark on an attitude scale really indicates a person’s attitude, and that a check mark on an intention scale really indicates a person’s intention. The theory of reasoned action has nothing to say about whether check marks on scales indicate attitudes or intentions; these assumptions are peripheral to the basic theory. They are auxiliary assumptions that researchers use to connect non-observational terms such as “attitude” and “intention” to observable phenomena such as check marks.
Fishbein and Ajzen (1975) recognized this and took great pains to spell out, as well as possible, the auxiliary assumptions that best aid in measuring theoretically relevant variables (see also Ajzen & Fishbein, 1980). In fact, much of the original impetus for the theory of reasoned action (e.g., Fishbein, 1980) was Fishbein’s realization that there was a problem with attitude measurement at the time: when this problem was corrected, strong attitude-behavior (or at least attitude-intention) correlations became the rule rather than the exception. This story provides a good illustration of a case in which attention to the auxiliary assumptions that bore on actual measurement (in a way that was not quite as “obvious” as using simple common sense) played a larger role in resolving a crisis in psychology than debates over the theory itself (however, see Sniehotta, Presseau, & Araújo-Soares, 2014, for a recent critique, and the reply by Ajzen, 2015).
This lesson applies equally to contemporary debates. Take the famous “walking-time” study by Bargh, Chen, and Burrows (1996), emphasized by Klein in his discussion, which seemed to provide evidence that participants who were primed with the elderly stereotype walked more slowly down the hallway compared to controls. In a replication study using infrared sensors (as opposed to students with stopwatches) to time the participants, Doyen, Klein, Pichon, and Cleeremans (2012) failed to find evidence of this priming effect in a sample of 120 undergraduates at the University of Brussels. As Klein (2014) sees it, failed replications such as this are too often “met with accusations that seem to imply that almost any deviation from the original protocol can be responsible for causing the anticipated effect to wither and die, rather than simply alter in a theoretically predictable manner” (p. 329).
But what, exactly, is predicted by the theory—and which “deviations” from the original protocol are relevant to altering the expected effect? Here, we suggest, a thoughtful consideration of auxiliary assumptions becomes necessary for a meaningful answer.
Consider the fact that the replication effort by Doyen et al. (2012) involved “at least one significant change to Bargh et al.’s methods that has thus far gone un-noticed in the burgeoning literature devoted to this debate,” as noted by Ramscar, Shaoul, and Baayen (2015, p. 16). Specifically, Doyen et al. tested French-speaking participants rather than English-speaking participants, and their study materials used French-language rather than English-language primes. As Ramscar and colleagues go on to explain: “Doyen et al. appear to [have assumed] that the language a priming study is conducted in is not an important factor [in] replication … [But] there is reason to believe that this assumption is mistaken: Bargh et al.’s prime set (like the materials used in many priming studies) utilizes a high proportion of English adjectives [and yet] the frequencies and distribution of adjectives varies considerably across languages [and] adjectives appear to play very different functional roles [emphasis added] in discourse depending on the language in question” (pp. 16–17).
In this case, one relevant difference is that adjectives in English tend to precede nouns, and therefore prime them, much more frequently than do adjectives in French. Importantly, this pattern holds true for the specific adjectives used in the Bargh et al. (1996) study compared to those in the Doyen et al. (2012) replication attempt. According to Ramscar et al. (2015), the average frequency of the English items used in the Bargh et al. study in a large English corpus was “far higher” (p. 19) than that of the French items used by Doyen et al. in a similar French corpus. Moreover, the English adjectives occurred specifically before nouns on average 18 times per million words, compared to just 3 times per million words for the French adjectives. In other words, “when it comes to their experience of encountering the [adjectives] in the prime sets in contexts where they actually served as primes to nouns … we can expect that the subjects in Bargh et al.’s study will have had something [on] the order of six times more experience” (Ramscar et al., 2015, p. 20) than the participants in Doyen et al.’s study. Accordingly, the adjectives used in the Bargh et al. study could be expected to serve as much stronger primes of the targeted elderly stereotype.
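The “six times” figure follows directly from the two rates quoted above. As a quick check, using r_EN and r_FR as our own labels (not notation from Ramscar et al., 2015) for the average rates at which the English and French adjectives precede nouns:

```latex
\[
\frac{r_{\mathrm{EN}}}{r_{\mathrm{FR}}}
  = \frac{18 \text{ per million words}}{3 \text{ per million words}}
  = 6
\]
```

This ratio describes exposure to the adjectives in a priming position only; it is a rough index of relative experience, not a measured difference in priming strength.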
In short, the theory advanced by Bargh et al. (1996) is that the unconscious activation of stereotypes should increase the likelihood that participants will, themselves, behave in a stereotype-consistent manner (see, e.g., Dijksterhuis & Bargh, 2001). An empirical hypothesis that follows from this theory is that priming the stereotype of the elderly in particular (which includes the trait of slowness) will cause participants themselves to walk more slowly (on average) down a hallway after the prime. A relevant auxiliary assumption, which is not part of the theory itself, is that exposure to certain adjectives will, in fact, activate the elderly stereotype in a way that is strong enough to elicit the predicted behavior. If the analysis by Ramscar et al. (2015) is correct, we should actually expect that translating those adjectives into a language where their association with the targeted stereotype is much weaker would violate this auxiliary assumption. But this says nothing about the overarching theory, which remains the same as it was originally specified.
We hasten to add that we are not arguing in favor of or against the validity of the original finding by Bargh et al. (1996), nor the replication attempt by Doyen et al. (2012). Rather, we are simply trying to show that a careful consideration of auxiliary assumptions is highly relevant to our understanding of the (likely) implications of a replication effort.
Conclusion
Our conclusion is twofold. First, as we have shown, having a well-specified theory is not a prerequisite for having replicable findings; hence the blame for apparent replication failures should not be placed upon ill-specified theories. And second, when there is a relevant theory, experimental predictions depend on auxiliary assumptions, as opposed to the theory proper, much more strongly than Klein (2014) seems to appreciate. Recognizing this distinction, we argue, will be crucial for making progress in resolving the replication crisis in psychology (Earp & Trafimow, 2015).
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
