Badly specified theories are not responsible for the replication crisis in social psychology: Comment on Klein

Abstract

Klein (2014) argues that the replication crisis in social psychology is due—at least in large part—to the tendency of psychological theories to be ill-specified. We disagree. First, we use both historical and contemporary examples to show that high-quality replication is possible even in the absence of a well-specified theory; and, second, we argue that it is typically auxiliary assumptions, rather than theories themselves, that need to be more clearly specified in order to understand the implications of a given replication effort.

Keywords

auxiliary assumptions, historical examples, replication crisis, theory

Followers of contemporary debates in psychology will have noticed that questionable research practices (QRPs; see John, Loewenstein, & Prelec, 2012) have taken the lion’s share of the blame for the so-called replication crisis affecting the field (see Earp & Trafimow, 2015, for an overview; see also Earp, Everett, Madva, & Hamlin, 2014; Everett & Earp, 2015; Pashler & Wagenmakers, 2012; Simmons, Nelson, & Simonsohn, 2011). In a recent paper in this journal, Klein (2014) expresses sympathy with those who offer such a diagnosis, but nevertheless insists they have missed the main culprit. Specifically, Klein suggests that many theories in psychology are not sufficiently well specified, and that this lack of specification is primarily responsible for the fact that many studies do not appear to replicate (e.g., Open Science Collaboration, 2015; but see Earp, 2015, 2016; Earp & Everett, 2015; Feldman Barrett, 2015).

Klein (2014) is certainly right about the lack of specification. As Karl Popper once noted: “too many theories, particularly in the social sciences, [are] constructed so loosely that they could be stretched to fit any conceivable set of experimental results, making them … devoid of testable content” (as summarized by Folger, 1989, p. 156). We do not disagree that this is a problem for psychology (and perhaps especially for social psychology). But we differ from Klein in that we do not think the replication crisis can be blamed on this issue. To show why this is the case, we will emphasize two main factors: first, the history of science, which is replete with important findings that could be replicated based on no theory or bad theory; and second, the argument that—even from a purely logical perspective—predictions come from auxiliary assumptions in combination with theories, not from theories alone. Taken together, we believe that these considerations pose a serious challenge for Klein’s primary argument.

Replication in the absence of a well-specified theory: Some examples

To show that replication is possible even in the absence of a well-specified theory, we turn first to the history of science. We begin with an example from chemistry. Consider phlogiston theory, a blatantly wrong and ill-specified theory—at least from the perspective of hindsight (but see Chang, 2012)—which nevertheless dominated the field from approximately the late 17th century to the late 18th century (Conant, 1964). Roughly, this theory held that the fire-like element of phlogiston was responsible for combustion, although the specific nature of this relationship was never precisely articulated. Nevertheless, despite this lack of specification, researchers were able to demonstrate—and replicate—the existence of oxygen (wrongly considered to be “dephlogisticated” air), nitrogen (“phlogisticated” air), and other major elements. Eventually, Lavoisier disconfirmed¹ phlogiston theory on the basis of replicable findings he obtained using increasingly precise measurements (e.g., some objects had increased weight after allegedly losing phlogiston) and suggested a better theory (see Trafimow & Rice, 2009 for further discussion).

There are two points worth emphasizing here. First: there were important and replicable findings even with as badly specified a theory as phlogiston theory. And second: these replicable findings provided the fodder for Lavoisier’s important theoretical advances. This suggests that replicable findings, produced under the aegis of even a bad or ill-specified theory, can nevertheless be important for theory generation and development.

Let us take another example. Consider Galileo and his famous experiments with rolling balls down inclined planes (for an introduction, see Asimov, 1966). When Galileo began this empirical work he was not guided by any formal theory; he also had the disadvantage of having to use only the very imprecise time measuring devices that were available in his era. Nevertheless, he produced highly replicable findings relating the degrees of incline of the planes to the velocity of the balls rolling down them. Based on this mathematical relationship, he extrapolated to the case in which the incline was 90 degrees, to draw conclusions about falling objects. Although Galileo eventually performed important theoretical work that included concepts such as inertia and Galilean relativity, the original empirical findings—which he himself replicated on many occasions, and which still hold up today—were not based on a formal theory.²

Finally, consider a more contemporary example from psychology. One of us has published work showing replicable findings in a situation in which the relevant theory was patently indeterminate. Specifically, Trafimow, Triandis, and Goto (1991) showed that it was possible to prime either the “private self” or the “collective self”—in order to elicit either more private or more collective self-cognitions, respectively—using an invented target person from ancient Mesopotamia. The hypothesis for this experiment was that describing the target person’s commitment to his family would increase the accessibility of the collective self, whereas describing the target person’s personal traits would increase the accessibility of the private self. Responses on the Twenty Statements Test (TST; Kuhn & McPartland, 1954; for a more recent application see Bargh & Earp, 2009) were taken to provide a plausible index of the relative accessibility of these differing self-cognitions.

Importantly, Trafimow et al. (1991) were able to show (and many others were able to replicate, see the meta-analysis by Oyserman & Lee, 2008) the predicted effect: namely, that priming the private vs. collective self increased the proportion of private vs. collective self-cognitions listed by participants on their TST protocols, as well as vice versa. Yet this was despite the fact that the theory that generated this prediction was not at all precisely defined.³

These examples spell trouble for Klein’s (2014) argument. Specifically, they show that having a well-specified theory is not a prerequisite for obtaining replicable findings. Therefore, although we do not dispute that many theories in psychology are of poor quality, one can hardly blame the apparent difficulty that many psychologists have in replicating findings on this fact.

Theory specification: The importance of auxiliary assumptions

There is a second problem with Klein’s (2014) argument. Even in the case where there is a clear theory to draw upon, it is important to remember that empirical predictions come from the combination of a theory and auxiliary assumptions rather than from a theory alone (Earp & Trafimow, 2015). If an empirical prediction is made, therefore, its success or failure can be attributed either to the theory itself, or to at least one auxiliary assumption (i.e., a logical assumption that is required to link the theory to an actual observation). If a finding does not (apparently) replicate, Klein (2014) suggests that we should place blame on the theory itself for being inadequately specified. But we have argued elsewhere that just as good a case can be made for blaming one of the auxiliary assumptions (see Earp & Trafimow, 2015; see also Trafimow, 2003, 2009).
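The logical structure of this point can be sketched schematically. The symbols below are our own shorthand, introduced only for illustration: T stands for the theory, A_1 through A_n for the auxiliary assumptions, and O for the predicted observation.

```latex
\[
\bigl[(T \land A_1 \land \cdots \land A_n) \rightarrow O\bigr] \land \neg O
\;\vdash\;
\neg T \lor \neg A_1 \lor \cdots \lor \neg A_n
\]
```

On this reading, a failed prediction shows only that at least one member of the conjunction is false; logic alone does not say whether the culprit is the theory or one of the auxiliary assumptions.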

Now, Klein himself comes close to endorsing this view when he writes that:

the conditions of replication can derive from pragmatic as well as theoretical considerations … The former can be reasonably addressed by (a) simple logic (e.g., visually impaired individuals should not take part in visual perception studies, participants should not be told the experimental hypothesis in advance of the study, etc.), [or] (b) random assignment. (2014, p. 329)

But we suggest that this bland articulation does not do justice to the work of auxiliary assumptions in linking theories to observations. Specifically, as we will discuss in a later section, auxiliary assumptions are often much more sophisticated than would be implied by the “obvious” need to, for example, bar visually impaired individuals from participating in visual perception studies.

In fact, a serious consideration of auxiliary assumptions in the context of the history of science suggests that Klein’s argument can be turned on its head. Consider, for example, the complexity of the universe that proved challenging to explain theoretically, even for such luminaries as Newton. As a matter of fact, Newton’s Laws of Motion and Universal Gravitation do a poor job of making predictions about moving bodies on our planet, because they do not take friction with the atmosphere into account (Asimov, 1971). If a Newtonian test were to be performed at different atmospheric levels, we would not expect much success in terms of replication across the different experiments. However, if we used appropriate auxiliary assumptions to take friction into account, our results would be much more replicable.

It is interesting to speculate about whether Newton should be criticized for failing to include friction in his theorizing. Based on Klein’s argument, one might wish to claim that Newton’s theory was not sufficiently well-specified because it failed to account for an important factor that influences experimental findings. But by not including friction in his theorizing—and thus by invoking an over-simplified universe—Newton was able to generate a parsimonious explanatory model. Indeed, all scientific theories invoke simplified universes. Rather than being a weakness, however, this is one of the main reasons for having a theory in the first place: it helps us to step back from sheer re-statements of the data and look for explanatory pathways. In other words, it is a mistake to include all of the less basic factors, such as friction in Newton’s case, so as to have a “well specified” theory. Rather, such issues should be relegated to the category of auxiliary assumptions.

To see why this is the case, consider that theoretical terms, by their very nature, are general and non-observational (Trafimow, 2012). Thus it is not reasonable to expect that they should be particularly well specified. For instance, Newton never even defined mass—a core theoretical concept in physics—as discussed at length by the Nobel Laureate Leon Lederman (1993). Rather, auxiliary assumptions bring about the specification that Klein desires by providing the means to traverse the gap between the non-observational terms in theories and the observational terms in empirical hypotheses (Trafimow, 2012).

Some examples from social psychology

An episode from the history of psychology should drive this point home.⁴ Consider a classic example: according to the theory of reasoned action (e.g., Fishbein, 1980), attitudes determine behavioral intentions. One implication of this theoretical assumption is that researchers should be able to obtain strong correlations between these two constructs. When it comes to actually carrying out a study, however, this assumes, among other things, that a check mark on an attitude scale really indicates a person’s attitude, and that a check mark on an intention scale really indicates a person’s intention. The theory of reasoned action has nothing to say about whether check marks on scales indicate attitudes or intentions; these assumptions are peripheral to the basic theory. They are auxiliary assumptions that researchers use to connect non-observational terms such as “attitude” and “intention” to observable phenomena such as check marks.

Fishbein and Ajzen (1975) recognized this and took great pains to spell out, as well as possible, the auxiliary assumptions that best aid in measuring theoretically relevant variables (see also Ajzen & Fishbein, 1980). In fact, much of the original impetus for the theory of reasoned action (e.g., Fishbein, 1980) was Fishbein’s realization that there was a problem with attitude measurement at the time: when this problem was corrected, strong attitude-behavior (or at least attitude-intention) correlations became the rule rather than the exception. This story provides a good illustration of a case in which attention to the auxiliary assumptions that bore on actual measurement, in a way that was not quite as “obvious” as using simple common sense, played a larger role in resolving a crisis in psychology than debates over the theory itself (however, see Sniehotta, Presseau, & Araújo-Soares, 2014 for a recent critique, and the reply by Ajzen, 2015).

This lesson applies equally to contemporary debates.⁵ Take the famous “walking-time” study by Bargh, Chen, and Burrows (1996), emphasized by Klein in his discussion, which seemed to provide evidence that participants who were primed with the elderly stereotype walked more slowly down the hallway compared to controls. In a replication study using infrared sensors (as opposed to students with stopwatches) to time the participants, Doyen, Klein, Pichon, and Cleeremans (2012) failed to find evidence of this priming effect in a sample of 120 undergraduates at the University of Brussels. As Klein (2014) sees it, failed replications such as this are too often “met with accusations that seem to imply that almost any deviation from the original protocol can be responsible for causing the anticipated effect to wither and die, rather than simply alter in a theoretically predictable manner” (p. 329).

But what, exactly, is predicted by the theory—and which “deviations” from the original protocol are relevant to altering the expected effect? Here, we suggest, a thoughtful consideration of auxiliary assumptions becomes necessary for a meaningful answer.

Consider the fact that the replication effort by Doyen et al. (2012) involved “at least one significant change to Bargh et al.’s methods that has thus far gone un-noticed in the burgeoning literature devoted to this debate,” as noted by Ramscar, Shaoul, and Baayen (2015, p. 16). Specifically, Doyen et al. tested French-speaking participants rather than English-speaking participants, and their study materials used French-language rather than English-language primes. As Ramscar and colleagues go on to explain:

Doyen et al. appear to [have assumed] that the language a priming study is conducted in is not an important factor [in] replication … [But] there is reason to believe that this assumption is mistaken: Bargh et al.’s prime set (like the materials used in many priming studies) utilizes a high proportion of English adjectives [and yet] the frequencies and distribution of adjectives varies considerably across languages [and] adjectives appear to play very different functional roles [emphasis added] in discourse depending on the language in question. (pp. 16–17)

In this case, one relevant difference is that adjectives in English tend to precede nouns, and therefore prime them, much more frequently than do adjectives in French. Importantly, this pattern holds true for the specific adjectives used in the Bargh et al. (1996) study compared to those in the Doyen et al. (2012) replication attempt. According to Ramscar et al. (2015), the average frequency of the English items used in the Bargh et al. study in a large English corpus was “far higher” (p. 19) than that of the French items used by Doyen et al. in a similar French corpus. Moreover, the English adjectives occurred prior to nouns, specifically, on average 18 times per million words compared to just 3 times per million words for the French adjectives. In other words, “when it comes to their experience of encountering the [adjectives] in the prime sets in contexts where they actually served as primes to nouns … we can expect that the subjects in Bargh et al.’s study will have had something [on] the order of six times more experience” (Ramscar et al., 2015, p. 20) than the participants in Doyen et al.’s study. Accordingly, the adjectives used in the Bargh et al. study could be expected to serve as much stronger primes of the targeted elderly stereotype.

In short, the theory advanced by Bargh et al. (1996) is that the unconscious activation of stereotypes should increase the likelihood that participants will, themselves, behave in a stereotype-consistent manner (see, e.g., Dijksterhuis & Bargh, 2001). An empirical hypothesis that follows from this theory is that priming the stereotype of the elderly in particular (which includes the trait of slowness) will cause participants themselves to walk more slowly (on average) down a hallway after the prime. A relevant auxiliary assumption—which is not part of the theory itself—is that exposure to certain adjectives will, in fact, activate the elderly stereotype in a way that is strong enough to elicit the predicted behavior. If the analysis by Ramscar et al. (2015) is correct, we should actually expect that translating those adjectives into a language where their association with the targeted stereotype is much weaker would violate this auxiliary assumption. But this says nothing about the overriding theory, which remains the same as it was originally specified.⁶

We hasten to add that we are not arguing in favor of or against the validity of the original finding by Bargh et al. (1996), nor the replication attempt by Doyen et al. (2012). Rather, we are simply trying to show that a careful consideration of auxiliary assumptions is highly relevant to our understanding of the (likely) implications of a replication effort.

Conclusion

Our conclusion is twofold. First, as we have shown, having a well-specified theory is not a prerequisite for having replicable findings; hence the blame for apparent replication failures should not be placed upon ill-specified theories. And second, when there is a relevant theory, experimental predictions depend much more strongly than Klein (2014) seems to appreciate on auxiliary assumptions, as opposed to on the theory proper. Recognizing this distinction, we argue, will be crucial for making progress in resolving the replication crisis in psychology (Earp & Trafimow, 2015).

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

Author biographies

David Trafimow is a Distinguished Achievement Professor of psychology at New Mexico State University, a Fellow of the Association for Psychological Science, and Executive Editor of both The Journal of General Psychology and Basic and Applied Social Psychology. He received his PhD in psychology from the University of Illinois at Urbana-Champaign in 1993. His current research interests include attribution, attitudes, cross-cultural research, ethics, morality, methodology, and potential performance theory.

Brian D. Earp is a Resident Visiting Scholar at The Hastings Center Bioethics Research Institute in Garrison, New York as well as a Research Associate with both the Oxford Uehiro Centre for Practical Ethics and the Oxford Centre for Neuroethics at the University of Oxford. He holds degrees from Yale, Oxford, and Cambridge Universities in cognitive science, experimental psychology, and philosophy of science, respectively, and has published more than 40 articles in peer-reviewed journals in those and related areas. Brian serves on the Board of Editorial Advisors for the journal Public Affairs Quarterly and is also an Associate Editor of the Journal of Medical Ethics.

References

Ajzen, I. (2015). The theory of planned behaviour is alive and well, and not ready to retire: A commentary on Sniehotta, Presseau, and Araújo-Soares. Health Psychology Review, 9(2), 131–137.

Ajzen, I., & Fishbein, M. (1980). Understanding attitudes and predicting social behavior. Englewood Cliffs, NJ: Prentice-Hall.

Asimov, I. (1966). Understanding physics: Motion, sound, and heat. New York, NY: Mentor.

Asimov, I. (1971). The stars in their courses. New York, NY: Mercury Press.

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71(2), 230–244.

Bargh, J. A., & Earp, B. D. (2009). The will is caused, not “free.” Dialogue: Newsletter of the Society for Personality and Social Psychology, 24, 13–15.

Butterfield, H. (1957). The origins of modern science 1300–1800. New York, NY: The Free Press.

Chang, H. (2012). Is water H₂O? Evidence, realism, and pluralism (Boston Studies in the Philosophy of Science: Vol. 293). Dordrecht, the Netherlands: Springer Science & Business Media.

Conant, J. B. (1964). The overthrow of the phlogiston theory: The chemical revolution of 1775–1789. Cambridge, MA: Harvard University Press.

Dijksterhuis, A., & Bargh, J. A. (2001). The perception-behavior expressway: Automatic effects of social perception on social behavior. Advances in Experimental Social Psychology, 33, 1–40.

Doyen, S., Klein, O., Pichon, C.-L., & Cleeremans, A. (2012). Behavioral priming: It’s all in the mind, but whose mind? PLoS ONE, 7(1), 1–7.

Earp, B. D. (2015, September 2). Psychology is not in crisis? Depends on what you mean by “crisis.” The Huffington Post. Retrieved from http://www.huffingtonpost.com/brian-earp/psychology-is-not-in-crisis_b_8077522.html

Earp, B. D. (2016). [Open review of the draft paper, “Replication initiatives will not salvage the trustworthiness of psychology” by James C. Coyne]. BMC Psychology. Retrieved from https://www.academia.edu/21711738/Open_review_of_the_draft_paper_entitled_Replication_initiatives_will_not_salvage_the_trustworthiness_of_psychology_by_James_C._Coyne

Earp, B. D., & Everett, J. A. C. (2015, October 25). How to fix psychology’s replication crisis. The Chronicle of Higher Education. Retrieved from http://chronicle.com/article/How-to-Fix-Psychology-s/233857

Earp, B. D., Everett, J. A. C., Madva, E. N., & Hamlin, J. K. (2014). Out, damned spot: Can the “Macbeth Effect” be replicated? Basic and Applied Social Psychology, 36(1), 91–98.

Earp, B. D., & Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in social psychology. Frontiers in Psychology, 6, 621.

Everett, J. A. C., & Earp, B. D. (2015). A tragedy of the (academic) commons: Interpreting the replication crisis in psychology as a social dilemma for early-career researchers. Frontiers in Psychology, 6, 1152.

Feldman Barrett, L. (2015, September 1). Psychology is not in crisis. The New York Times. Retrieved from http://www.nytimes.com/2015/09/01/opinion/psychology-is-not-in-crisis.html

Fishbein, M. (1980). Theory of reasoned action: Some applications and implications. In H. Howe & M. Page (Eds.), Nebraska Symposium on Motivation, 1979 (pp. 477–492). Lincoln: University of Nebraska Press.

Fishbein, M., & Ajzen, I. (1975). Belief, attitude, intention and behavior: An introduction to theory and research. Reading, MA: Addison-Wesley.

Folger, R. (1989). Significance tests and the duplicity of binary decisions. Psychological Bulletin, 106, 155–160.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532.

Klein, S. B. (2014). What can recent replication failures tell us about the theoretical commitments of psychology? Theory & Psychology, 24, 326–338.

Kuhn, M. H., & McPartland, T. S. (1954). An empirical investigation of self-attitudes. American Sociological Review, 19(1), 68–76.

Lederman, L. (1993). The God particle: If the universe is the answer, what is the question? New York, NY: Houghton Mifflin.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Oyserman, D., & Lee, S. W. S. (2008). Does culture influence what and how we think? Effects of priming individualism and collectivism. Psychological Bulletin, 134(2), 311–342.

Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.

Ramscar, M., Shaoul, C., & Baayen, R. H. (2015). Why many priming results don’t (and won’t) replicate: A quantitative analysis. Retrieved from http://psych.stanford.edu/~michael/papers/Ramscar-Shaoul-Baayen_replication.pdf

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Sniehotta, F. F., Presseau, J., & Araújo-Soares, V. (2014). Time to retire the theory of planned behaviour. Health Psychology Review, 8(1), 1–7.

Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes’s theorem. Psychological Review, 110(3), 526–535.

Trafimow, D. (2009). The theory of reasoned action: A case study of falsification in psychology. Theory & Psychology, 19, 501–518.

Trafimow, D. (2012). The role of auxiliary assumptions for the validity of manipulations and measures. Theory & Psychology, 22, 486–498.

Trafimow, D., & Rice, S. (2009). What if social scientists had reviewed great scientific works of the past? Perspectives on Psychological Science, 4(1), 65–78.

Trafimow, D., Triandis, H. C., & Goto, S. G. (1991). Some tests of the distinction between the private self and the collective self. Journal of Personality and Social Psychology, 60(5), 649–655.