Abstract
Science progresses by finding and correcting problems in theories. Good theories are those that help facilitate this process by being hard to vary: They explain what they are supposed to explain, they are consistent with other good theories, and they are not easily adaptable to explain anything. Here we argue that, rather than a lack of distinction between exploratory and confirmatory research, an abundance of flexible theories is a better explanation for the current replicability problems of psychology. We also explain why popular methods-oriented solutions fail to address the real problem of flexibility. Instead, we propose that a greater emphasis on theory criticism by argument might improve replicability.
Initiatives to identify and eliminate the causes of the apparent nonreplicability of research findings in the behavioral sciences have received a great deal of attention in recent years. Discussion around the topic has become increasingly dominated by methods-oriented solutions, such as preregistration and direct replication, that aim to eliminate this nonreplicability. The popularity of such practices is evidenced not only by their quick uptake by researchers (Nosek, 2019) but also by the actions journals and grant agencies take to reward them (e.g., with badges; Eich, 2014; or with the allocation of research funding; Research Excellence Framework, 2019, p. 36). The alternative view—that replicability problems emerge from bad theorizing and resolving them might come from the improvement of theories—has been considerably less influential, although there have been some exceptions (e.g., Fiedler, 2017, 2018; Gray, 2017; Oberauer & Lewandowsky, 2019; Shiffrin et al., 2018), including more informal discussions (such as conference discussions, blog posts, and commentaries; e.g., Carsel et al., 2018; van Rooij, 2019).
In this article, we take the view that replicability problems are the symptoms of bad theorizing. Our aim is to move the discussion of how to improve theory development forward. We argue specifically that the flexibility inherent to many existing psychological theories is the main cause of replicability problems, and thus we should seek to improve our ability to evaluate the quality of theories, particularly their flexibility. We also argue that such evaluations should take place more prominently before experimental testing than they currently do.
Controversies Around Social Priming: A Case Study
The problems with replication of social-priming studies are well known. The theories that motivated these studies suggest that stimuli with a certain conceptual meaning can affect people’s behavior in a way that is related to that meaning (Bargh, 2006). For example, priming people with the concept of a professor was found to make them perform better on a general-knowledge quiz compared with when they were primed with the concept of a football hooligan (Dijksterhuis & van Knippenberg, 1998). Controversies surrounding this and similar findings have accumulated over the years, perhaps most prominently illustrated by a set of failures in replicating such effects (e.g., Doyen et al., 2012; Harris et al., 2013; Pashler et al., 2012; Shanks et al., 2013).
Social-priming studies quickly became the testing ground for newly proposed methods-oriented solutions to such replicability problems. These solutions are often based on the distinction between exploratory and confirmatory research, which focuses on the supposedly important dichotomy between hypothesis-testing (confirmatory/prediction) and hypothesis-generating (exploratory/postdiction) modes of research. For example, Nosek et al. (2018) explain that “failing to appreciate the difference can lead to overconfidence in post hoc explanations (postdictions) and inflate the likelihood of believing that there is evidence for a finding when there is not. Presenting postdictions as predictions can increase the attractiveness and publishability of findings by falsely reducing uncertainty. Ultimately, this decreases reproducibility” (p. 2600).
In other words, because exploratory research is presumably more prone to error than confirmatory research, researchers’ disregard for this distinction is supposed to be blamed for the problems with replicability. As a result, many researchers promote methods that explicitly differentiate between these two modes of research, such as the preregistration of studies (e.g., Nosek et al., 2018; Wagenmakers et al., 2012) and the direct replication of controversial findings (e.g., Pashler & Harris, 2012; Simons, 2014; Zwaan et al., 2018). A preregistration is a time-stamped public document in which researchers define their method for data analysis before the outcome of the research is known. A direct replication is an attempt to recreate the parameters of the original experiment, including the method for data analysis.
To illustrate our arguments, we revisit the controversies around social-priming studies and their attempted solutions. We focus specifically on a recent large-scale replication study of the professor-priming study (Dijksterhuis & van Knippenberg, 1998) that failed to find the predicted effect (O’Donnell et al., 2018). This study used a combination of preregistration and direct replication (it was a so-called registered replication report; Simons et al., 2014). The joint presence of a (in our view) flexible theory, replicability problems, and solution attempts based on the exploratory–confirmatory distinction make this an ideal case study for our current purposes.
We use this example to explain and advance our main argument that experimental tests are often superfluous and that we should instead focus more on nonempirical evaluations of the quality of our theories. We unpack these arguments by outlining a general framework of how we should evaluate the quality of theories and illustrate it with examples from the professor-priming study. We then explain that there is no room for the exploratory–confirmatory distinction in theory evaluation because it focuses on unimportant types of theory flexibility. Last, we argue in favor of a greater focus on nonempirical evaluations of theories and provide some suggestions on how to move toward this goal.
Development of Good Theories
Although the methodology and conduct of science constantly evolve, making a definitive answer about its aim impossible, it can still be useful to propose a tentative answer. Here, we base our argument on (and advocate for) the conventions proposed by Popper (1959) and more recently by Deutsch (2011, 2016), which consider science as primarily a problem-solving endeavor. In our opinion, a version of this perspective is already widely held in the behavioral sciences. Our aim here, therefore, is to address some of the ambiguities around the specifics of this view, focusing on the aspects of this philosophy pertaining to the current problems of psychological science and the proposed methodological reforms. In particular, we focus on the role of replicability, on the properties of good theories, and on the ways in which theories should change and improve.
Under this problem-solving philosophy of science, the aim is to develop good explanations (Deutsch, 2011, 2016). We can regard scientific theories as a set of explanatory conjectures about how things appear in the world and why. That is, scientific theories are a collection of statements that usually rely on other theories as background knowledge and provide answers to how and why questions. Together, these statements also designate what their explicanda are (i.e., imply what regularity or regularities they are explaining).
The way to bring about good explanations is by attempting to detect and correct apparent flaws in our existing theories (Popper, 1959). Flaws can be detected by different types of criticism—either by argument or by experimental test. Correcting the flaws works by conjecture: We either guess what modification solves a flaw in our existing theory or (if we cannot think of any such modification) guess a new theory.
The key property of good explanations is that they are hard to vary (Deutsch, 2011). More specifically, good theories (a) explain what they are supposed to explain (in the sense that they give a tentatively satisfying answer to the relevant why and how questions), (b) are consistent with other good theories, and (c) cannot easily be adapted to explain anything (Deutsch, 2016). These criteria aim to ensure that a theory is constrained by all of our existing knowledge (existing observations and other good theories) without the benefit of flexibility to tailor the explanation to any possible pattern of observation. In other words, the conjectures that make up a theory must be inflexible while still allowing the theory to account for its explicanda.
Most relevant to our current argument are potential ways of criticizing a theory. The aim of criticism is to find problems in theories so that they can be subsequently improved, and it can come in two forms. First, theories can be criticized by argument, which means the assessment of whether they (or similar variations of them) are bad explanations according to any of the above criteria. Often this takes the form of an argument that a theory cannot account for some existing observation. Equally valid, but largely absent in psychological science, is criticizing a theory according to how easily it can be adapted to account for a large range of unobserved data patterns.
Second, a theory can be criticized by experimental testing, which can make a theory problematic by increasing the set of observations that a theory is supposed to explain and showing that it cannot. Although the common view is that experimental tests are the primary way in which science progresses, such tests are useful only when experiments are capable of posing problems for that theory. However, this can occur only once theories have been sufficiently improved on the basis of argumentative criticism. A bad theory will be immune to criticism by experimental testing because it either does not account for its supposed explicanda to begin with or it can always be easily adapted to explain anything.
When problems are detected through any method of criticism, theories need to change. Although there is no prescribed way in which new (variations of) theories are to be conjectured, the criteria for being hard to vary constrain the way in which that theory should be changed. A good theory will resist most changes because the explanation for any change must be consistent with the retained inflexible conjectures of that theory without making the theory inconsistent with existing theories and/or observations. Thus, we want the changes to our theories to create multiple new implications: The theory should not simply expand to incorporate new explicanda; expectations for existing observations should also be affected. As a result, only changes that are themselves hard to vary are desirable.
For a concrete example of why theory flexibility matters, consider the implication of these arguments for theories of social priming. Let us first assess the flexibility of the theory that was being tested in the large-scale preregistered replication of the professor-priming study (O’Donnell et al., 2018). Recall that the central idea was that priming the concept of a university professor would lead to better performance on general-knowledge questions than priming the concept of a football hooligan. Before the data for the registered report were collected, one change was already made to the theory: On the basis of a pilot study, people’s sex was proposed to moderate the effect of the prime. This is a good illustration of an easy-to-vary change because it specifies only vaguely why we should expect such moderation—presumably because males can relate to stereotypes associated with university professors and football hooligans more readily than females; they are more influenced by the primed concept. But here is another version of this change with the opposite prediction: Females relate less to these stereotypes, but they want to relate more, and so they are particularly susceptible to their priming. The predictions can easily change within the theory because what it means to relate to a stereotype or how relating to a prime exerts its influence on behavior is not specified. Indeed, when the effect of sex was not found in the preregistered replication, the initial change was abandoned and a new moderator was proposed for the theory, namely that awareness of the aim of the experiment suppresses the priming effect (Dijksterhuis, 2018)—although such suppression by awareness was not observed in similar settings (Newell & Shaw, 2017). More importantly, even within the subset of the sample consisting of “unaware” participants, there was no effect of sex—an observation that apparently creates no problems for the theory.
Experiments are not necessary to see the problematic flexibility of such a theory. Even if the results of the original study were observed in the registered replication project, the theory would have remained problematic because it could be easily adapted to explain anything. Although the effect was not replicated, the theory was easily adapted to explain the results. Because there is no possible set of observations that can make such a theory problematic, it should be regarded as a bad explanation and not tested experimentally.
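The flexibility described above can be made concrete with a rough sketch. It enumerates a small set of qualitative result patterns and asks which of them each theory could absorb by switching between its variants. The outcome categories, the variant labels, the "rigid" comparison theory, and all function names are assumptions introduced purely for illustration; they are not an analysis taken from any of the cited studies.

```python
# Illustrative sketch only: outcome categories and variant labels are
# assumptions made for this example, not an analysis from any cited study.

# Qualitative result patterns a professor-priming experiment could produce.
OUTCOMES = frozenset({
    "effect, equal across sexes",
    "effect, larger in males",
    "effect, larger in females",
    "no effect",
})


def covered_outcomes(variants):
    """Return every outcome that at least one variant of the theory explains."""
    covered = set()
    for explained in variants.values():
        covered |= explained
    return covered & OUTCOMES


def accommodated_fraction(variants):
    """Share of the possible outcomes the theory can absorb by switching variants."""
    return len(covered_outcomes(variants)) / len(OUTCOMES)


def problematic_outcomes(variants):
    """Outcomes that no variant explains, i.e. results that would pose a problem."""
    return OUTCOMES - covered_outcomes(variants)


# Easy-to-vary theory: each auxiliary story is cheap to add or drop.
flexible_theory = {
    "prime improves performance": {"effect, equal across sexes"},
    "males relate more to the stereotype": {"effect, larger in males"},
    "females want to relate more": {"effect, larger in females"},
    "awareness suppresses priming": {"no effect"},
}

# Hypothetical hard-to-vary theory: its mechanism implies a single pattern.
rigid_theory = {
    "single specified mechanism": {"effect, equal across sexes"},
}

if __name__ == "__main__":
    for name, theory in [("flexible", flexible_theory), ("rigid", rigid_theory)]:
        print(name, accommodated_fraction(theory), sorted(problematic_outcomes(theory)))
```

Under these assumptions the flexible theory accommodates every listed pattern, so no result can pose a problem for it, whereas the rigid theory is contradicted by three of the four patterns. Note that the answer is obtained by inspecting the theory alone, without collecting any data.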
To summarize, the aim of science is to create explanations that are inflexible in a way that maximally allows for criticism and thus for improvement. Unfortunately, current psychological research places disproportionate attention on empirical testing without considering whether such tests are useful. In this section, we argued that empirical tests are useful only if they are capable of creating problems for existing theories but that easy-to-vary theories (such as the theory that motivated the professor priming studies) are impervious to being made problematic. As we explain in the following section, these limits of experimental tests cannot be avoided by separating exploratory and confirmatory research because easy-to-vary theories are not made inflexible by ensuring that the research is confirmatory. Instead, a large part of theory evaluation can be done before experiments are conducted.
The Misguided Exploratory–Confirmatory Distinction
On the face of it, the use of preregistration and direct replication on the basis of the exploratory–confirmatory distinction aims to reduce some version of theory flexibility. With preregistration, researchers must make clear when their analysis is being tailored to the observed data (usually in a subsequent “exploratory” analysis or in deviations from the preregistered analysis plan). In a direct replication, it should be apparent when an experiment differs from the original and when analysis methods are adjusted. Such restrictions on flexibility will make clear when hypotheses, and their motivating analyses, are generated after the results are known. Thus, making the distinction between exploratory and confirmatory research is aimed at reducing the flexibility of the predictions of theories. Specifically, the reader will know whether a theory has been adjusted to explain data or when data have been tailored to fit a theory. Presumably, we are to be more skeptical of theories that have been adjusted and more confident in theories whose predictions were borne out in an experiment.
Such evaluation of theories is misguided because the requirement for the researcher to reduce theoretical flexibility can be satisfied by a temporary reduction in flexibility. Methods based on the exploratory–confirmatory distinction allow researchers to temporarily fix the predictions of their theories, which can easily be done even when theories are flexible. This usually takes the form of choosing a set of predictions out of the many possible ones consistent with the theory and stating that these are what the researcher expects. However, unless the particular choice of prediction is implied by a theory, such that observations to the contrary create a problem, then that theory has become no more testable. In other words, if the theory is capable of easily adapting to any observation, then the choice to preregister a particular prediction does not make that prediction an implication of the theory—and thus does not make its empirical test more useful.
Turning back to our example of the professor-priming replication (O’Donnell et al., 2018), an example of temporary flexibility reduction can be readily seen. In this replication, the proposed effect of sex was held in place only by preregistration and not by any theoretical consideration. As a result, when the effect was not observed in the experiment, the theory was not compromised by this new observation. Instead, the inherent flexibility in the theory that allowed the moderating effect of sex to be proposed in the first place now allowed a new, equally flexible moderator to be conjectured and a new empirical test to be proposed (Dijksterhuis, 2018). One may argue that this preregistered replication made this flexibility clear. However, a critique by argument would have led to the same conclusion before the replication study was conducted (and would have saved a lot of time, money, and effort).
In reality, the distinction between prediction and postdiction is irrelevant for theory development because both predictions and postdictions are supposed to be implications of a theory. A good theory designates both what we should have observed in the past and what we should observe in the future—there is no difference. Thus, theories should be judged not on the basis of whether they had to be changed but on how easy they are to change. If the theory was changed but both the theory and the change are hard to vary, we have no additional reason to be skeptical of the theory simply because the theory was proposed after an experiment. Likewise, if the theory is easy to change, then even if its temporarily fixed predictions are borne out in an experiment, we have little reason to entertain that theory.
To summarize, viewing replicability issues through the exploratory–confirmatory distinction is not helpful because it suggests that replicability problems result from flexible predictions. In contrast, we argue that it is the inherent flexibilities in theories, and not flexible predictions, that are the underlying causes of replication problems under consideration. Psychology has issues with replication because it expects results to be replicable on the basis of theories whose explanations are undermined by their ability to easily explain anything. Thus, instead of resulting from the lack of distinction between exploratory and confirmatory research, replication problems arise as the result of adopting bad theories. Unfortunately, flexible theories will not become hard to vary because their predictions are temporarily fixed—reducing such flexibilities can be done only by arguments that take into account all aspects of good theorizing. Thus, methodological approaches leave the real flexibility problem unaddressed: Easy-to-vary theories can be retained, or worse, can be considered good when observations turn out to be consistent with the predictions.
Clarifying the Scientific Role of Preregistration and Direct Replication
If we reject the exploratory–confirmatory distinction as irrelevant for the scientific goal of developing good explanations, what role remains for preregistration and direct replication? In this section, we attempt to clarify the scientific usefulness of these methods in two potential scenarios: in tests of bad theories and in tests of good theories. We also consider the argument that these practices can be helpful in determining an empirical basis (explicanda) that theories need to explain.
Preregistration is not scientifically useful either in tests of bad or in tests of good theories; that is, it does not necessitate the improvement of those theories (Szollosi et al., 2020). This is because it focuses on the reduction of superficial flexibilities in theories. Yet one might still be tempted to argue that preregistration is useful because it makes it apparent when a bad theory is changed. Two points are worth reiterating in opposition to this argument. First, although preregistration does indeed reveal when a theory is changed, it does not consider whether the resultant theory is good or bad. That is, because theory change is a necessary part of science, it makes no sense to be skeptical of change per se—trusting theories less because they were changed makes no sense if the change made the theory better. Second, we should be dissatisfied with any method for evaluating the flexibility of a theory that depends on explicitly observing theory change (e.g., the flexibility in social priming theories that was revealed in O’Donnell et al., 2018). The flexibility of a theory is its emergent property and thus is independent of the outcome of any experiment. Instead of superfluous experiments whose best outcome is to highlight the flexibility of a theory, we recommend more focus on how we can use argumentation to assess and critique the flexibility inherent to theories.
This perspective also forces us to reconsider what we mean by replicability. Replicability refers to the extent to which the invariant predictions of a theory can be observed in repeated testing. Thus, the conditions under which we should expect replicability are implied by that theory: These are the aspects of an experimental protocol that differ across repeats of the experiment but are deemed unimportant by the theory (e.g., the experimental location or cohort of participants). Thus, replication is simply a special case of an experimental test of the theory. However, as we have explained, experimental testing of any prediction (invariant as well as noninvariant predictions) matters only for explanations that are good. If the theory is flexible, it matters little if the experiment is repeated; repeating the test will remain inconsequential because the predictions of invariance can easily change regardless of the observations (e.g., from sex to awareness in Dijksterhuis, 2018).
In cases of good theories, the usefulness of replication will vary. Most good theories will imply data patterns that are highly diagnostic, so coincidence-based alternative explanations—that the observation was due to unexpected causes (i.e., the explanations that replication studies are supposed to rule out)—are unlikely to be worthy competitors (Fiedler, 2017; Roberts & Pashler, 2000). In some cases, however, such explanations can be worthy competitors (e.g., when attempting to distinguish between similar variants of a good theory). Our main point here is that blindly using direct replication in the hope that it will improve psychological science is mistaken: It is useful only in particular tests of good theories.
It is worth separately considering the related argument that preregistration and direct replication can help establish an empirical basis—that defining an analysis in advance can help specify whether some explicanda will be reliably observable (e.g., Wagenmakers et al., 2012). Recall from our earlier discussion of replicability that our expectations regarding future observations (e.g., whether some observations will repeat in the future) are always implied by a theory. Thus, they cannot be implied by the outcomes of statistical tests that (often only very approximately) represent that theory—the confirmatory status of the test notwithstanding. The implications of any such analysis will ultimately depend on the quality of the explanation of what new observation will occur (and on the accompanying explanation for why the statistical analysis is sensible). Such statistical models can be helpful, but they are always subordinate to scientific arguments and methodology: Their interpretation should be determined exclusively by the theory they aim to represent (Fiedler, 2017; Navarro, 2018; Szollosi & Donkin, 2019; for an extended discussion of this issue, see Donkin & Szollosi, 2020). Observations resulting from bad theories, even with “strong” statistics, can at best be used as heuristics to motivate further research, but they do not form an empirical base.
To summarize, experimental tests are useful for testing inflexible theories. Experimental testing of flexible theories—even if preregistered or direct replications—cannot contribute much to theory evaluation because the experimental test is going to be nonconsequential to the theory. Such theories should instead be advanced via nonexperimental criticism to develop them sufficiently before they are subjected to empirical tests.
Unarresting Theory Development
In this article we have given an alternative account for the causes of current reproducibility problems in psychology. We argued that replicability issues are symptoms of the underlying flexibility problem with theories: They result from the shortsighted focus on how well predictions of theories fit with observations while neglecting other aspects of good theorizing, such as the flexibility of the theory. We also argued that currently proposed methods-oriented solutions restrict flexibility mostly superficially, and therefore we should be placing a greater emphasis on exploring other avenues, particularly nonempirical ways of reducing theory flexibility. In this section, we provide some suggestions that could be taken as first steps toward getting better at evaluating and reducing the inherent flexibility of theories.
An important step we can take to improve our ability to evaluate theories through arguments is increasing the extent to which we hold theories accountable. This means that they should be evaluated on the basis of the criteria of being hard to vary, and if they are found to be inadequate, changes should be made to improve them. Just to reiterate, assessments should be made about whether the theory actually explains its explicanda, whether it is compatible with other good theories, and whether it cannot be changed to account for anything.
As an example, consider the importance of accountability in theory change from the social-priming case study. For instance, if a factor such as sex is proposed to moderate the effect of a social prime, then there must be an accompanying explanation for why such a change is reasonable and why the change is not easy to vary. Assuming such an explanation is possible, the change must also have an effect on the existing implications of the theory. For example, if males are now expected to show a larger priming effect than females, then this observation should be present in all existing and future data sets. Thus, to be convincing, the authors should demonstrate that all existing observations are consistent with the change in the theory. The moderating effect of sex must also now be expected in all future studies. If the new expected effect is not observed, the theory must again be updated and held to the same standards of accountability—namely, why does sex only sometimes have an effect? When it is no longer possible to reconcile the observations because no apparent change is possible for the theory (without creating new problems), then the theory is rendered problematic and a new theory should be sought. For science to progress, accountability in theory change is crucial: We should expect that each adaptation of a theory makes it more inflexible and therefore increases its potential to be made problematic.
Focusing on accountability also helps clarify which types of transparency are important. Transparency is useful when it increases the accountability of theories. Transparent practices such as openly sharing methods, data, and analyses (e.g., some aspects of Aczel et al., 2020; Miguel et al., 2014) are important because they can contribute to accountability. For example, they allow researchers to use existing data sets to test the new implications introduced by changes to their theories. But increasing openness in unimportant features, such as the documentation of the specific steps taken to arrive at the current version of a theory, or of the specific time when the analyses were conducted, is not needed for the assessment of the theory.
Another important step toward nonexperimental theory assessment is to consider the reliance of the focal theory on other related theories. When we attempt to make a theory hard to vary, any of its conjectures (existing or new) must have implications that can be evaluated against other existing theories (and observations). For example, Dijksterhuis and van Knippenberg (1998) explain that the professor prime may cause individuals (unconsciously and automatically) to think harder about the answer, to use varied strategies, or to be more confident. However, we lack good answers to the questions of how and why thinking harder, choosing better strategies, or being more confident can produce better answers to general-knowledge tasks. This lack of good related theories decreases the constraints of Dijksterhuis and van Knippenberg’s conjectures about why activating the construct of professors improves performance. Therefore, their study provides relatively little opportunity to create problems for their overall theory. On the other hand, better (supplementary) theories for confidence or strategy selection in general-knowledge tasks could have introduced more constraints and allowed for more meaningful empirical tests.
Conclusion
Methods-oriented solutions focus on inflexibility where it does not matter but not where it does: Scientists can get a badge as long as the predictions of their theory were temporarily fixed, but hardly anyone cares if the theory could have easily accommodated the opposite predictions. A perspective in which theories are judged on the basis of how hard they are to vary resolves this problem: The predictions of a good theory will never need temporary fixing because they cannot be easily changed. From this point of view, replicability may emerge from good theories and is not an aim that needs to be independently pursued.
Finding good explanations of complex high-level phenomena is difficult but possible. We should not be pessimistic and accept that our theories will inevitably be worse than those of other sciences (e.g., Sanbonmatsu & Johnston, 2019). It is true that the lack of thinking about flexibility (in this particular way) has resulted in a lot of flexible theories. Although we focused on social-priming theories as an example, we suspect that we can all identify similar flexibilities in our theories. But existing good explanations for complex phenomena, such as evolution, show us that there is no reason to believe that we cannot develop good explanations if we want to and know what we need to do. The real question is not whether we can but whether we will rise to the challenge of developing good explanations in the behavioral sciences.
Acknowledgements
We thank Balazs Aczel, Alex Holcombe, Jared Hotaling, Arthur Kary, Ben Newell, Iris van Rooij, and two anonymous referees for helpful discussions and comments on a previous version of the manuscript.
Transparency
Action Editors: Travis Proulx and Richard Morey
Advisory Editor: Richard Lucas
Editor: Laura A. King
