Abstract
Multisite (multilab/many-lab) replications have emerged as a popular way of verifying prior research findings, but their record in social psychology has prompted distrust of the field and a sense of crisis. We review all 36 multisite social-psychology replications (plus three articles reporting multiple ministudies). We start by assuming that both the original and the multisite replications were conducted in honest and diligent fashion, despite often yielding different conclusions. Four of the 36 (11%) were clearly successful in terms of providing significant support for the original hypothesis, and five others (14%) had mixed results. The remaining 27 (75%) were failures. Multiple explanations for the generally poor record of replications are considered, including the possibility that the original hypothesis was wrong; operational failure; low engagement of participants; and bias toward failure. The relevant evidence is assessed as well. There was evidence for each of the possibilities listed above, with low engagement emerging as a widespread problem (reflected in high rates of discarded data and weak manipulation checks). The few procedures with actual interpersonal interaction fared much better than others. We discuss implications in relation to manipulation checks, effect sizes, and impact on the field and offer recommendations for improving future multisite projects.
Concern over the replicability of psychological science has risen in recent years, especially in social psychology. A combination of events led to a general sense of a “replication crisis.” Anecdotal reports of failure stimulated a large-scale attempt to replicate 100 previously published findings (Open Science Collaboration, 2015). Fewer than half of the replications yielded significant results, with social psychology faring worse than cognitive psychology. Even among successes, effect sizes were considerably smaller than the original effects.
In an unrelated development, several high-profile cases of researcher fraud were found in social psychology (e.g., Diederik Stapel), and the fact that fraud happens at all makes one suspicious of the field’s knowledge base. Fraudulent findings are of course highly unlikely to replicate. Yet in another unrelated development, Bem’s (2011) findings of precognition, published in social psychology’s leading journal (Journal of Personality and Social Psychology), seemed a priori implausible to many readers, raising further doubt about the validity of the knowledge base.
The importance of replication has led many researchers to embrace multisite replications as a presumably rigorous method for verifying the validity of scientific findings (e.g., Nosek et al., 2022). The large samples that come with conducting essentially the same experiment simultaneously in multiple labs should increase the statistical power to detect even small effects. Yet the results have continued to be quite disappointing. The present article reviews the published results of these multisite replication attempts in social psychology in the hope of understanding why these attempts fail so frequently.
The idea for our article arose during the 2020 Society for Personality and Social Psychology conference. Researchers there presented the results of a multisite attempt to replicate the impact of mortality salience on worldview defense (Klein et al., 2021), specifically the idea that being reminded of one’s mortality makes one assert one’s cultural worldview more strongly and resist criticisms of it. This has been a major line of work derived from terror-management theory (Greenberg et al., 1986), and there are already many published replications of mortality-salience effects. Yet the multisite attempt failed to find significant support for it, despite the large sample. One of us has been engaged in friendly feuding with the terror-management theorists for over three decades (e.g., Baumeister & Tice, 1990; Greenberg et al., 1990). Despite not being a supporter of their theory, his own laboratory has replicated the mortality-salience effects over a dozen times, and he has published the results in multiple publications. Because of both his own data and the many published studies finding similar effects, he has confidence in the reality of that causal pattern (though he disputes the broader implications). Why, then, would a multilaboratory replication attempt of what appears to be a true effect fail, even with input by the original investigators?
Our assumption is that the amount of fraud in scientific work, although obviously not zero, is quite small. We proceed on the assumption that both the original researchers and the replicators are honest, competent scientists sincerely trying to conduct good research. Therefore, we are inclined to accept both the original results and the replications, even though the modal pattern is that the original results are significant, whereas the replications are nonsignificant. Indeed, some replication articles report nonsignificant trends in the opposite direction from the original result. In that case, it is not merely that the replication finding is too weak to be significant even in a very large sample; rather, it is that there is no sign whatsoever of the effect. Thus, we have two sets of researchers testing the same hypothesis and coming to opposite conclusions. Again, this is what usually happens, at least in social psychology. What can one make of this discrepancy? What are its implications for building a broad theoretical understanding of the phenomena, and for future empirical projects?
Features of High-Quality Replications
We begin with Fiedler’s (2017) definition of strong research: “Strong research must rely on sufficiently large samples that allow for powerful statistical tests of precisely predicted relationships, minimizing false positives and failures to replicate” (p. 46). Although emphasizing the value of replication, Fiedler notes that this does not mean something will work every time. He points out that “a negative outcome may reflect the overshadowing impact of an uncontrolled cause” (p. 57). In particular, concern over false-positive findings may distract from a potentially larger problem, namely false negatives. False-negative findings may impair scientific progress more than false-positive findings, insofar as further research will expose false-positive findings whereas false-negative findings may discourage further work (Fiedler et al., 2012). Indeed, false-positive findings may be exposed by subsequent work, and further, scientists appear to be remarkably accurate at predicting in advance which findings will replicate (see Camerer et al., 2018). Small samples as traditionally used in social psychology will cause both false positives and false negatives. The larger aggregated samples in multisite replications may reduce false negatives caused by small samples, but it is possible that other features of the multisite procedure will offset that advantage. The damaging effects of false-negative findings from multisite replications are likely to be considerably worse than those of false-negative findings from single studies because the research community may regard the null result as firm evidence that the phenomenon does not exist, thereby stifling further research and preventing self-correction. Importantly, most efforts to reduce false-positive findings will increase false-negative findings (Fiedler et al., 2012).
What would be best practices for conducting a multisite replication? None of the present authors has conducted such a project, but on the basis of our extensive reading of the literature we can list several important features—as well as several debatable ones. It deserves mention that having many laboratories and many participants is an apparent strength, but if the replication is conducted in a flawed manner, those features quickly lose their value. They may even be damaging insofar as the apparent strength of so much data lends credibility to misleading findings.
The clear strengths would include preregistration, including specifying the primary analyses, but also a willingness to conduct and report further exploratory analyses, as long as the distinction is explicit. A large aggregate sample is desirable. Given that replications nearly always find reductions in effect size from the original, the sample should be extra large.
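To illustrate why a shrunken effect size calls for an extra-large sample, the following minimal sketch applies the standard normal-approximation formula for a two-group comparison, n per group ≈ 2(z_α/2 + z_power)² / d². The effect sizes (d = 0.5 and d = 0.2), the .05 alpha level, and the 80% power target are illustrative assumptions and are not values taken from any study reviewed here.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per condition for a two-sided,
    two-group comparison, using the normal approximation (an assumption;
    an exact t-based calculation gives slightly larger numbers)."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Hypothetical scenario: the original report suggests d = 0.5,
# but the true effect is closer to d = 0.2.
for d in (0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} participants per condition")
```

Under these assumptions, an effect that shrinks from d = 0.5 to d = 0.2 requires roughly six times as many participants per condition to retain the same power.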
Excluding large amounts of data can seriously compromise the quality of replication. Excluded data are especially problematic if they are not evenly distributed across conditions and for the same reasons (e.g., Cook & Campbell, 1979). It is important to report whether the excluded participants were evenly distributed across conditions, because otherwise the different conditions sample somewhat different populations (a problem that increases with more exclusions). If exclusion rates are high, analyses should be reported both ways (i.e., both full sample and after exclusions). As a commendable example, McCarthy et al. (2021) reported their preregistered analyses but also, after noting the high proportion of discarded data, reported multiple analyses with different exclusion criteria (as well as no exclusion criteria).
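One simple way to check whether exclusions are evenly distributed across two conditions is a Pearson chi-square test on the 2 × 2 table of condition by excluded versus retained. The sketch below is offered only as an illustration: the function name and the counts are hypothetical, and the test assumes that every cell has a positive expected count and that no continuity correction is applied.

```python
from math import erfc, sqrt


def exclusion_balance(excluded_a, total_a, excluded_b, total_b):
    """Pearson chi-square (df = 1, no continuity correction) comparing
    exclusion rates between two conditions. Returns (chi2, p).
    Assumes every expected cell count is positive."""
    kept_a = total_a - excluded_a
    kept_b = total_b - excluded_b
    n = total_a + total_b
    observed = [[excluded_a, kept_a], [excluded_b, kept_b]]
    row_totals = [total_a, total_b]
    col_totals = [excluded_a + excluded_b, kept_a + kept_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square upper-tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Hypothetical counts: 120 of 500 excluded in one condition, 60 of 500 in the other.
chi2, p = exclusion_balance(120, 500, 60, 500)
print(f"chi2 = {chi2:.2f}, p = {p:.6f}")
```

A small p value here would indicate that the conditions lost participants at different rates, which is the situation in which the retained samples may no longer be comparable.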
Whether procedures should be precisely the same for all participants is debatable. Here we appreciate the practice of using multiple different procedures to test the same hypothesis (e.g., Vohs et al., 2021). To be sure, when different procedures yield different results, interpretation is compromised—but it is nevertheless scientifically important to know this.
Manipulation checks are essential, especially considering the high rate of failure in multisite projects. A manipulation check should do more than verify that the instructions were correctly perceived: It should confirm that the requisite psychological states were successfully manipulated (Fiedler et al., 2021). Ideally, and especially in cases of failed replications, some evidence that the dependent variable was sufficiently sensitive to detect group differences would also be provided. Attention checks could also be useful, especially for computer-administered studies. In case of failure accompanied by weak manipulation checks, further analyses would be appropriate to see whether any support for the hypothesis can be found among participants whose manipulation-check data were strongest—with the caveat that such analyses introduce selective sampling and hence potential confounds.
It may seem that we are holding a double standard, given that we fault replications for not having manipulation checks when the original studies might also have omitted them. We have three responses to that objection. First, establishing the validity of a manipulation is a property of good research in general, and that applies equally to original studies and replications. Second, the multisite replications are intended to raise the quality of the field’s knowledge base, so improvements such as adding manipulation checks are consistent with that. Third, within the purview of our efforts to understand this literature, some double standard is perhaps, unfortunately, necessary. The modal case we shall report involves a significant original finding but a nonsignificant multisite replication. The need for the original authors to validate their manipulation is thus less—because, after all, it worked. With the failed replication, in contrast, it is vital to know whether the manipulation was effective. A significant manipulation check combined with a null result on the dependent variable is evidence that the original hypothesis is wrong. But as many authors of these replications have acknowledged (e.g., Buttrick et al., 2020), if the procedures fail to manipulate the independent variable, then no conclusions about the hypothesis can be drawn.
To be sure, manipulation checks do not solve all problems. Gruijters (2021) has pointed out that many social-psychology procedures manipulate more than just the focal variable, thus enabling alternative explanations. In the past, it has largely been the job of editors and reviewers to raise these issues, and authors may have been asked to collect additional data to rule out specific alternative explanations. Fully validating a manipulation would ideally involve both establishing that the focal independent variable was successfully and substantially manipulated and that other relevant variables were not. For example, affect is often a source of alternative explanations, and a great many studies include affect measures to show that their manipulation did not create emotional side effects—hence the remarkably low rate of significant findings for emotion mediation across half a century of social-psychology research (DeWall et al., 2016). Gruijters also makes the valuable point that even a significant manipulation check does not necessarily indicate that the manipulation was strong enough to produce a significant effect on the dependent variable. Manipulation checks should presumably have large effect sizes for the treatment groups to be meaningfully different. Indeed, the effect size of a manipulation check may be more meaningful than the effect size on the dependent variable because the former is a helpful indication of how valid the study was.
Fidelity to the original study seems central to the very idea of replication. Ellefson and Oppenheimer (2022) have argued for the importance of such fidelity. However, given that no replication can be truly exact, some departures from precisely duplicating original procedures may be desirable, such as updating materials to reflect current social and historical factors. In case of failed replication, a clear discussion of departures from the original procedure is needed. However, successful replications gain credibility with increasing discrepancies from the original procedure because these indicate greater generality and robustness.
Given that the purpose of experimentation is generalizable knowledge, exact fidelity is a mixed blessing depending on outcome. A successful replication is more informative and persuasive to the extent that it was a purely conceptual replication with different operational definitions of key variables from the original study. In contrast, replication failure becomes increasingly ambiguous with each departure from the original study, and so it is most informative when fidelity is maximal.
A particular fidelity problem arises when the original study did not include a manipulation check, so that adding one reduces the replication’s fidelity. Manipulation checks are generally a positive feature indicating good quality research, but sometimes they are impractical. Nevertheless, given the high failure rate of multisite replications, manipulation checks are crucial for establishing whether the experiment provided a fair test of the hypothesis. If the original study worked as predicted, and if reviewers did not find potential confounds that suggested alternative explanations, it seems reasonable to assume, at least provisionally, that the manipulation did what the original authors proposed. When a replication fails, however, it is immensely valuable to have clear evidence as to whether the manipulation was effective.
It is important to realize that many (original) social-psychology publications have fallen short of best practices (especially as we now understand best practices). Many classic social-psychology studies had no manipulation checks, used small samples, discarded participants without reporting full-sample analyses, and the like. More broadly, all social-science research is imperfect. Replication failures may indicate problems with original studies, replication studies, or both.
Possible explanations for failed replications
Experiments typically test causal hypotheses of the form “X causes Y,” with X being the independent variable and Y being the dependent variable. As we shall show, the modal pattern in social psychology has been for original research reports to indicate significant support for the causal relationship but for the multisite replication to fail to find any such support. Because the two findings contradict each other, and a causal relationship cannot simultaneously exist and not exist, the challenge is to understand what could account for the difference. Fiedler (2017) has also pointed out that such so-called “minimal hypotheses,” such as “X causes an increase or decrease in Y,” are insufficient for scientific theory building. It is therefore useful to explore why X may cause Y sometimes but not other times.
A first possible explanation would be that there are boundary conditions: The hypothesized causation occurs under some conditions and not others. Relatively few hypotheses are put forward with a theoretical elucidation of when this effect will be observed versus when it will not. In this case, both the original and the replication finding may be correct evidence about reality. The two projects merely lie on opposite sides of some crucial boundary condition. The challenge for advancing theory is to delineate what the boundary conditions may be and to show in subsequent work that in the same study the finding can be replicated and not replicated by manipulating that boundary condition. We concur that future theorizing in psychology would benefit from articulating likely boundary conditions (Fiedler, 2017). Without theoretical specification, it is generally not possible to evaluate whether boundary conditions account for replication failures in the present sample of multilab replications. At most, they can be post hoc speculations.
A second possibility would be that the original finding was wrong, and the replication correctly finds that X does not cause Y. In the present sample of studies, the best evidence for this would be that the replication provides significant and substantial evidence of a successful manipulation check, but the predicted differences on the dependent variable were not significant. (They could be significant in the opposite direction, which would also speak very strongly against the original hypothesis.) Further evidence that social psychology’s failed replications show that original findings were spurious would be that findings that have been obtained only once or a few times would fare worse (in our sample) than those that have been replicated many times.
A third possibility would be that the replication failed to provide a valid test of the hypothesis. We call this operational failure. One sign of this would be that the manipulation check is not significant. Indeed, for a strong test of a hypothesis, there should be a large effect size as well as a significant manipulation check, indicating that the treatment groups were substantially different on the independent variable. Manipulation checks should ideally verify not simply that the manipulation was perceived but that it produced the intended difference in psychological states (Fiedler et al., 2021). A priming manipulation, for example, could be checked simply by verifying that the participant unscrambled the assigned words to make coherent phrases using the target word—but could also be checked by showing that it made the concept mentally accessible.
Other signs of operational failure would include high rates of discarded data. Social psychologists have long accepted that it is fair to discard some data points occasionally for valid reasons, such as when a participant misunderstood the instructions or was suspicious. But journals traditionally would not publish studies from which substantial numbers were excluded. Differential rates of discarding in different conditions produce selective attrition (Cook & Campbell, 1979), making it difficult to know whether observed differences were due to the manipulated independent variable (as hypothesized) or were confounded by differential sampling (e.g., if only one condition deletes inattentive participants at a high rate). Differential attrition may be particularly common with online procedures (Zhou & Fishbach, 2016). High rates of attrition may also signal that participants were not fully engaged with the experiment and instead were simply trying to get the study over with as quickly as possible. Older social psychologists recall training in how to ensure each participant was motivationally and emotionally engaged in the procedure (called experimental realism; Aronson & Carlsmith, 1968).
Exactly why engagement would be lower among participants in a multisite replication than in the original study is not entirely clear. Research on social loafing has suggested that people are generally less engaged when performing anonymously as part of a large group than when they are individually identified (e.g., Karau & Williams, 1993)—but participants were often told explicitly that their identities would be submerged in a large group. Most consent forms explicitly state that data will be analyzed at the group level (i.e., the data from all individual participants within a group will be combined). It is also plausible that the streamlining of procedures for mass administration reduces the engagement, by (for example) minimizing social contact between the participant and the experimenter.
Low participant engagement may be a serious problem for many studies, including replications, particularly in this era of online data collection. The global pandemic lockdown alerted many instructors to the fact that teaching is more effective in person than via computer-mediated communication such as Zoom (e.g., Alpert et al., 2016; Halloran et al., 2021; Kofoed et al., 2021; Cellini & Grueso, 2021). Overall, online learning has been shown to be less engaging (e.g., Kop et al., 2011) and to lead to less learning (e.g., Heppen et al., 2017) than traditional in-person learning. Even in a best-case scenario such as the Netherlands (described by the study’s authors as best case because it had a short lockdown, equitable school funding, and excellent rates of high-quality Internet access), a study of around 350,000 students found that computer-mediated distance learning led to lower academic outcomes than in-person learning (Engzell et al., 2021). Although He et al. (2021) found that online learning was comparable to in-person learning for health-science students, this was only the case for synchronous learning, whereas most (if not all) online psychology experiments are asynchronous. A more suitable analogy may be the switch to asynchronous, computer-mediated learning during the COVID pandemic. A recent Brookings study found that students’ learning and achievement levels dropped between a tenth and a quarter of a standard deviation between 2019 and 2021 (Kuhfeld et al., 2022). Lower levels of engagement and performance in asynchronous online learning may be analogous to less engagement and lower-quality data in asynchronous online psychology experiments. Hence, we coded studies for whether live personal interaction was involved.
Possible sources of bias in present authors
Because discussions of sensitive issues may be clouded by bias, we wish to lay out the broad assumptions and goals for our approach. Each of the present authors has long cultivated a positive attitude about social psychology in general. Our positive outlook—tempered, to be sure, with judicious skepticism about specific issues—is clear in a textbook coauthored by two of us (Baumeister & Bushman, 2021). It is therefore fair to suspect our analysis of bias in favor of the existing literature, and indeed we think our admiration for the general body of work by social psychologists over the past half century fueled our curiosity about why the multisite replications fail so often. We think readers seeking a perspective with the opposite bias (i.e., preferring to think that most social-psychology research has yielded no useful knowledge) will have no difficulty finding sources to make that case.
As one highly relevant example, the term “p-hacking” has come into use recently to disparage conducting multiple analyses and then reporting the best results. The term encompasses some practices that were never acceptable, such as analyzing frequently during the data-collection process and stopping the collection whenever the .05 significance line was successfully crossed. It also encompasses practices that were once considered acceptable, such as using gender as a covariate or transforming the data in an attempt to nudge the finding across the .05 line if the initial analysis yielded a p level of .06. Skeptics have proposed that the literature is rife with false-positive findings, and that view deserves to be respected. A more charitable view, however, is that with small samples, many true effects would reach only marginal levels of significance because of low statistical power, and such formerly acceptable p-hacking would sometimes enable the finding to reach significance and publishability, thus essentially rescuing a true effect from a false-negative result (Type II error). The two perspectives differ as to how easily or frequently a bogus significant finding (Type I error) could be manufactured with a truly nonexistent effect. Our view is that such occurrences would be rare and would hardly ever replicate, so when findings have been obtained repeatedly, we suspect that there is a genuine phenomenon there, even though the evidence for it was inflated.
We agree with the skeptics, and with the great bulk of the replication literature, which indicates that effect sizes in original reports of lab findings are inflated, and even the formerly acceptable forms of p-hacking would likely contribute to this. Nevertheless, in contrast to the doom-and-gloom tone of much writing about replication problems, we seek to explore the perspective that many of the phenomena are genuine, even if the original reports included inflated effect sizes.
Two of us were also heavy contributors to the ego-depletion literature, which has a mixed record in multisite replications, thereby rendering our own possible bias as something potentially substantial yet hard to specify. We do have confidence in our own work, which we have always sought to conduct carefully and consistent with prevailing best practices (which, admittedly, have changed substantially over the decades). An early multisite replication reported a failure to find evidence of ego depletion, though subsequent reanalyses did indicate support. Crucially, however, a recent multisite replication found strong and unequivocal support for ego depletion, thus vindicating our work (Dang et al., 2021). Indeed, if one accepts all the social-psychological multisite findings uncritically, ego depletion is one of only four clear successes we uncovered. The rarity of success in social-psychology multisite replications enabled us to make the case elsewhere that ego depletion is at present social psychology’s best replicated finding (Baumeister & Tice, 2022). Whether we would be biased in favor of or against the multisite approach is therefore unclear. At minimum, we would be likely to notice factors that have contributed to weak replication results in general. Our conscious goal is to advance understanding of the field, and, by definition, we cannot know what our unconscious biases are.
Our attempt to understand the pattern of widespread failure to replicate is also hampered by the literature itself. No one would suggest that the multisite replications are a representative sample of all the work done in social psychology. Indeed, around a third of the multisite replications have focused on priming, which is one important phenomenon in social psychology—but just one of many. Meanwhile, many hypotheses and studies have not been tested in multisite replications.
Method
Literature search
We searched the database PsycINFO through 2022 (July 28) using the terms “multi-site” or “multisite” or “multi-lab” or “multilab” or “replicat*” or “many labs.” The wildcard character (*) retrieves terms with letters after * (e.g., replicate, replicates, replicated, replication). The search retrieved 7,361 articles. These articles were then searched to identify multisite replications of social-psychology experiments (broadly defined, e.g., including applied areas such as law, environment, health, consumer behavior, and industrial-organizational behavior). The search excluded dissertations, articles not written in English, and articles that were not peer reviewed. Methodology was restricted to “experimental replication.” Because of a reviewer suggestion, we used the same terms in the journal Collabra: Psychology and found an additional 54 articles. Of these, one was relevant. Our search resulted in 36 relevant replications, which are marked with an asterisk (*) in the Reference section. Some articles seemed at best borderline social psychology, as we shall note. All multisite replication studies included in our analysis were registered in advance.
Three additional articles are described in a narrative fashion (Klein et al., 2014, 2018; Schweinsberg et al., 2016). These reported multisite replications of multiple different studies rather than a single study. These indeed qualified as multisite replications, but the experiments are perhaps best described as ministudies, given that they were chosen for brevity, rapidity, and convenience. Importantly, these ministudies are not statistically independent because the same participants completed several ministudies. The assumption of independence is a foundation for many statistical tests, including the ones we report in this article.
Coded variables
For each multisite replication, we coded (a) publication year of multisite replication study, (b) publication year of original study, (c) whether an author on the original study was also an author on the multisite replication study, (d) the number of authors on the multisite replication study, (e) the number of labs contributing data in the multisite replication study, (f) the total sample size in the multisite replication study, (g) the sample size analyzed in the multisite replication study, (h) the total sample size in the original study, (i) the sample size analyzed in the original study, (j) whether the multisite replication study used exactly the same procedure as the original study, (k) social interaction in the multisite replication study, (l) social interaction in the original study (coded the same as social interaction in the multisite replication study), (m) manipulation check in the multisite replication study (coded as significant, nonsignificant, mixed = significant for some measures but nonsignificant for other measures, or none), and (n) manipulation check in original study (significant, nonsignificant, mixed, or none). For social interaction (k), live ongoing interactions were coded as 2; computer-mediated, computer-mediated simulated, and observing people interacting were coded as 1; and solitary responding with or without computer, completing tasks alone in classrooms with other students, and computer-administered hypothetical vignettes were coded 0.
We were going to code how many times the original study had been replicated before and by whom, but that was difficult to determine in a precise manner. We were also going to code whether a journal was a flagship journal as a rough indicator of journal quality, but we decided not to code this variable because some journals were difficult to classify.
We considered a multisite replication to be successful if it yielded significant results on the main dependent measures in the same direction as the original study, thereby upholding the original finding (coded 2). We coded the results as mixed or partial success if multiple analyses were reported, some of which indicated significant confirmation of the original (primary) finding while other analyses (e.g., using different protocols) on the same dependent variable were not significant (coded 1). (Note that these analyses had to be across the multiple labs; a significant result in one lab was not sufficient to code as a partial success.) There were some borderline cases in which most analyses failed but a lone result emerged from an analysis that lacked credibility. We considered a multisite replication to be unsuccessful if it yielded nonsignificant results on all measures, results in the opposite direction from the original, or both (coded 0).
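The three-level outcome code can be summarized as a simple decision rule. The sketch below is a simplification offered for illustration only: the function name and input format are hypothetical, it considers only analyses pooled across labs, and it does not capture the judgment calls needed for borderline cases (e.g., a lone significant result from an analysis that lacked credibility).

```python
def code_outcome(pooled_analyses):
    """Assign the outcome code for one multisite replication.

    pooled_analyses: list of (significant, same_direction) boolean pairs,
    one pair per analysis of the main dependent measure pooled across labs.
    Returns 2 (success), 1 (mixed or partial success), or 0 (failure).
    """
    if not pooled_analyses:
        raise ValueError("at least one pooled analysis is required")

    # An analysis confirms the original finding only if it is significant
    # AND in the same direction as the original result.
    confirmations = [sig and same for sig, same in pooled_analyses]

    if all(confirmations):
        return 2  # every pooled analysis upholds the original finding
    if any(confirmations):
        return 1  # some analyses confirm, others do not
    return 0      # nonsignificant on all measures and/or opposite direction


# Hypothetical examples
print(code_outcome([(True, True)]))                  # 2
print(code_outcome([(True, True), (False, True)]))   # 1
print(code_outcome([(False, True), (True, False)]))  # 0
```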
Each variable was coded by two independent coders. Percent agreement ranged from 94% to 100% (M = 98.5%; Mdn = 100%). Disagreements were resolved via discussion among the three authors. Our codings are presented in Table S1 in the Supplemental Material available online, and anyone who wishes to recode and reanalyze is welcome to do so.
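As a rough illustration of the agreement statistic reported here, percent agreement is simply the share of articles on which the two coders assigned the same code. The sketch below uses invented code vectors for 36 articles; they are not the codings in Table S1.

```python
from typing import Sequence


def percent_agreement(coder_a: Sequence[int], coder_b: Sequence[int]) -> float:
    """Percentage of items on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must rate the same nonempty set of items")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Invented example: 36 items, with the coders disagreeing on two of them.
coder_a = [0] * 27 + [1] * 5 + [2] * 4
coder_b = list(coder_a)
coder_b[0] = 1
coder_b[30] = 2

print(round(percent_agreement(coder_a, coder_b), 1))  # 34 of 36 agree -> 94.4
```

Percent agreement does not correct for chance agreement, so it is best read as a descriptive summary rather than a formal reliability coefficient.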
Results
We found 36 articles reporting multisite replications in social psychology. We considered four of them (11%) to be largely successful (despite smaller effect sizes than the original studies) and five additional ones (14%) to be partial or mixed successes. Counting the mixed successes with the successes (a stretch) still leaves 75% as failures. We emphasize that generalizing to all social psychology is unwarranted because the sample of effects chosen for replication is not at all a randomly created, representative sample of the field’s knowledge as a whole. (In particular, field studies, labor-intensive studies, and studies with special populations seem underrepresented among replications.) Nevertheless, the 75% failure rate is the most generous description of the entire population of published multisite social-psychology replications we could locate as of June 2022.
As already noted, we also report separately on three other articles (Klein et al., 2014, 2018; Schweinsberg et al., 2016), each of which reported multiple different replications of different studies. The studies they sought to replicate were not all in social psychology and involved brief, one-shot measures, typically of social judgment or hypothetical reactions to imaginary situations, with large multilab samples. Thus, Klein et al. (2014) had each participant at each session perform 13 different brief experiments, some of which involved just a few seconds to answer a single question. Klein et al. (2018) replicated 28 different brief experiments; each protocol was administered to approximately half of 125 samples that were composed of 15,305 participants from 36 different countries and territories. Schweinsberg et al. (2016) had 25 research groups replicate between three and ten unpublished findings regarding moral judgments of hypothetical or imaginary events. We decided, for multiple reasons, not to cover these on the same level as the studies devoted to replicating a single finding; these reasons included the necessarily briefer presentation of results and resultant lack of information about manipulation checks, number of participants discarded from analyses, statistical nonindependence of results, and the like. Nevertheless, they do qualify as multisite replications, and we describe them separately.
Effect sizes
Almost all multisite replications of social-psychology effects find results with considerably smaller effect sizes than the original study. This is obviously true for studies reporting nonsignificant results, but it is true even for the few that report positive, significant results in line with the original finding (see the section below on successful replications). It is highly likely that effect sizes reported throughout the social-psychology literature are frequently inflated because of publication bias and p-hacking. The shrinking of effect sizes has also been noted in medical research (Hunniford et al., in press), where it was attributed to superior methodological rigor, including larger samples, in multisite replications than in single-site original studies. Fiedler and Prager (2018) proposed that regression toward the mean (i.e., the statistical tendency for extreme observations or results to be followed by others closer to the mean) almost guarantees that replications (whether multilab or single lab) will produce smaller effects than original studies. We have also proposed that smaller effects would be consistent with lesser engagement of participants.
The very notion of a true effect size for a social-psychology laboratory manipulation of a situational variable, such as interpersonal rejection, may be elusive if not meaningless. In some fields, it may be possible to establish true effect sizes. Insofar as social psychologists wish to accept true effect sizes, the pervasive shrinkage during replications poses a challenge. Original researchers may often have been both good and lucky (because unlucky ones failed to publish), so they reported effect sizes that will inevitably shrink in replications (Fiedler & Prager, 2018). Fortunately, numerical effect sizes are almost never stipulated in psychological theorizing, so the inability to establish correct estimates may not be a major problem for advancing theory.
Nevertheless, it is also entirely possible that multisite replications may be biased toward underestimating effects. Presumably the reasons for that would overlap heavily with the reasons for the broad tendency for multisite replications to fail, such as streamlining procedures for efficiency, low engagement of participants, and procedures that were better suited to the original than the replication sample.
Manipulation checks and operation failure
Still, the findings may not be as dismal as they first appeared. Manipulation checks were missing from half (50%) of replications, as well as nearly all of the ministudies. In these cases, it is impossible to interpret whether the null findings indicate falsification of the hypothesis or mere operational failure. The hypothesis was not tested if the experimental treatments failed to create the intended difference on the independent variable. It was also not tested if the dependent measure was insensitive.
It is also important to note that there are two kinds of manipulation checks. One verifies that the manipulation was correctly perceived, enacted, and administered. The other verifies that the resulting psychological states differed as intended. As a commendable example, Ghelfi et al. (2020) administered beverages of different tastes to manipulate disgust. Their manipulation checks verified both that the bitter beverage tasted more bitter than the others and that it caused more disgust. But very few replications have been that thorough.
True failures to replicate
This and the next few sections will summarize the different types of findings. Readers who wish to read the narrative descriptions of all 36 multisite replications can access them in the Supplemental Material.
We start with the bad news. Here we summarize the nine true failures to replicate, which indicate a falsification of the hypothesis (see Table S2 in the Supplemental Material for the full list). This requires a significant and presumably sizeable difference on the manipulation check but a null (nonsignificant) finding on the dependent variable. (Ideally, one should also show that the dependent variable was sensitive; e.g., providing a significant difference as a result of another manipulation, even of a different variable.) As one example, Williams and Bargh (2008) found that briefly holding a hot pack in one’s hand caused people to select a reward for a friend rather than for themselves, compared with briefly holding a cold pack. Lynott et al. (2014) attempted a replication in three sites. The manipulation check was large and significant in that the hot pack was rated as much warmer than the cold pack. Verifying that the notion of hot (or warm) or cold was unconsciously activated in participants’ minds is difficult, but the fact that participants rated it correctly from memory suggests it was. None of the three sites replicated the effect, and indeed one site found a significant effect in the opposite direction. The combined results approached significance, but in the direction opposite to the original finding.
In this category, we also included several multisite replications that did not report a manipulation check as such, but for which it seemed fair to assume that the independent variable was successfully manipulated. As an example, Dijksterhuis and van Knippenberg (1998) reported that priming participants with the concept of “professor” caused them to perform better on a trivia test than priming participants with the concept of “soccer hooligan.” O’Donnell et al. (2018) reported 23 direct replications in a multisite project. They found no evidence of improved performance when they primed with “professor,” nor of moderation by gender. There was no manipulation check, but given that participants had to write a paragraph imagining their life as either a professor or a soccer hooligan, it seems reasonable to assume that those different roles were activated in their thinking.
Operational failures
Operational failures refer to replication attempts in which the independent variable was not effectively manipulated. They therefore do not constitute falsifications of the hypothesis, because they were unable to provide a test of it. Nevertheless, they do raise other concerns, such as the generalizability of the methods. Table S3 in the Supplemental Material lists the six multisite replications in this category.
For example, inducing people to disbelieve in free will caused them to become more prone to antisocial behavior (cheating on a test to claim extra reward money), as shown by Vohs and Schooler (2008). An early attempt by Embley et al. (2015) at replication failed, and one possibility was that the manipulation (reading a difficult passage from a science book) was ineffective because it was hard to understand. (The original study may have sampled a highly intelligent and conscientious population.) Buttrick et al. (2020) sought to replicate the original effect in five sites, using both the original manipulation and a revised one that supposedly would be easier to understand. The effect on cheating was in the predicted direction but weak and nonsignificant. However, it appeared that neither the original nor the revised manipulation was able to alter people’s beliefs in free will; thus, the study was unable to test the hypothesis. A follow-up analysis found that the correlation between the manipulation check and the dependent variable was not significant, unlike in several other cases; thus, no support for the original hypothesis could be salvaged. Nevertheless, as Buttrick et al. themselves point out, this investigation permits no conclusions about the hypothesis that disbelief in free will causes cheating, because the manipulation failed.
Ambiguous failures to replicate
We designate as ambiguous those failures for which it is unclear whether operational failure or hypothesis falsification occurred. Half of the replications did not collect or report manipulation-check data, so it is impossible to ascertain whether the failures among them indicate falsification of the hypothesis, failure to test it properly because of operational failure, or both. Table S4 in the Supplemental Material lists these seven replications.
As an example, Payne et al. (2008) found that priming people with the importance of responding accurately and honestly led to higher correlations between implicit and explicit attitudes, compared with priming them with the awareness that everyone is biased and it is necessary to be on guard. A replication in Italy found a significant effect also, though much reduced in effect size (Vianello, 2015). Ebersole et al. (2020) did a direct replication in four U.S. and four Italian laboratories. The American labs found a significant difference in the opposite direction, whereas the Italian labs found no difference. There were no manipulation checks. It could have been an operational failure, as the authors note, but the failure could possibly reflect historical change, given that race relations in America deteriorated between the time the original study was conducted (2008) and the time the replications were conducted (2016–2017).
Mixed successes
We turn now to the minority of multisite replications that provide at least some support for the original finding. Table S5 in the Supplemental Material lists the five partly successful replications. Establishing precise boundaries for this category was more difficult than for any others because many studies reported many analyses. We defined this category as studies providing significant support for the original finding with some analyses, but falling short of significance with other analyses. These analyses had to be across the full data set (thus, a significant finding from one of the many laboratories was not enough), and the successful one had to be sufficient to enable some disinterested scholars to regard it as a successful replication.
Classification of the mixed successes can be explained as follows. For Ghelfi et al. (2020), the most important comparison was between the bitter- and neutral-beverage conditions, and that comparison was significant, although the sweet-beverage condition’s results were radically different from the original and complicated the theoretical point. Findings by Moran et al. (2021) and Vohs et al. (2021) were significant or not depending on how many participants were excluded from analyses. Skorb et al. (2020) tried three different protocols, of which only one provided a significant result, and that was after excluding over half the sample. Baranski et al. (2020) replicated a finding about imaginary perpetrators but not the complementary finding for imaginary victims—thus contradicting the overarching theoretical point despite replicating one of the findings.
In classifying these as mixed or partial successes, emphasis must be on “mixed or partial.” In general, the preregistered analyses yielded the poorest results. Authors of original articles may take some comfort and justification from the fact that significant support was found at all. Critics and skeptics may claim that that is akin to grasping at straws and that the results mainly point toward failure.
True and full successes
Only four multisite replications have provided unqualified support for the original hypothesis. These are listed in Table S6 in the Supplemental Material. The biggest success we found was a study on eyewitness identification. Garry et al. (2008) had two participants think they were watching the same videotaped crime. In fact, they saw different versions with different details. They discussed the film after watching it, and their different recollections influenced each other, so that participants ended up claiming to recall details they never actually observed. Earlier studies had often found similar effects, with both similar and different methods. Ito et al. (2019) conducted a replication in 10 different countries and replicated the effect fully. Most unusually, their effect size was comparable to (even slightly larger than) the original effect size. Although this study was framed and published as applied cognitive psychology, it does have a social interaction as a key independent variable and does qualify as applied social psychology.
Three multilab articles about multiple mini-studies
Two articles by Klein et al. (2014, 2018) each reported multilab replications of several different effects. These were typically very brief studies, administered to participants one after another, almost exclusively asking how the participant thought about something, with neither social interaction nor emotion. These fared better than the other studies. Many are more properly characterized as cognitive-psychology studies, marketing studies, or judgment and decision-making studies than as social-psychology studies, and replications in those other fields generally fare better than in social psychology. Manipulation checks were generally not reported, so all failures qualify as ambiguous; it is uncertain whether hypothesis falsification or operational failure was involved.
The successful replications from Klein et al. (2018) are as follows. An (imaginary) man who accidentally hurts a baby is blamed more than a baby who accidentally hurts a grown man. A gift giver is rated as more generous despite giving a cheaper gift if the gift was at the high end of the range of possible prices for that particular product (compared with a gift that was more expensive but at the low end of the range for that product). Students think someone who has not done the assigned reading is more likely to be called on by the instructor compared with someone who has done the reading. When people make a binary (hypothetical) choice, they overestimate how many others would make the same choice (i.e., the false-consensus effect, replicated twice). In the trolley problem, people approve of someone who steers the train to kill one person instead of five more than they approve of someone who pushes a fat man off a bridge to block the train (again killing one but saving five). An imagined executive who expresses indifference as to how his policy will affect the environment is rated as intentionally harming the environment but not as intentionally benefiting it. Correspondence bias (also termed fundamental attribution error) was replicated: People tended to think an essay writer believed what was advocated in the essay, even if they were told the writer had been assigned to write on that side of an issue. The last may be partly confounded, given that the essay argued its point rather than signaling detachment, and there was some evidence that the more strongly the essay argued its point, the more participants assumed the writer believed what was written.
As to the failures: Klein et al. (2018) failed to show a sequence effect, in which rating relationship satisfaction first and then life satisfaction altered the second rating based on the first, whereas rating life satisfaction first had no effect on subsequent rating of relationship satisfaction. (This was significant in the opposite direction from the original.) The finding that imaginary jurors awarded custody more to the parent with extreme good and bad traits, rather than to the one with neutral traits, was significant in the opposite direction. The finding that hand-copying a description of an immoral action made people rate hand cleansers more favorably was not replicated. Likewise, the finding that people formed higher opinions of a leader on the basis of an organizational chart that had a longer line between the leader and his team (compared with a shorter line) was not replicated. Another finding that was significant in the opposite direction was that the choice between a kiss from a favorite movie star and a $50 cash payment depended on whether the outcome was guaranteed or merely a slight chance (i.e., the kiss was most preferred when the outcome was uncertain rather than definite). The attempt to replicate an anchoring effect (from the number on a product designation to an estimate of what percentage of sales would occur within the United States) found a null result.
Three priming studies were included in Klein et al. (2018). Two were not significant. Priming orderliness had originally made people express more eagerness to pursue personal goals, but this failed to replicate. Priming heat had originally led to greater belief in and concern about global warming, but this also failed to replicate. On a more positive note, priming the consumer mindset—by referring to people as consumers rather than as individuals—caused participants to predict that such people would generally fail to conserve water. This finding is notable given that it is the only one of many attempted multilab replications of priming that yielded a significant result. Unfortunately, there is possibly a confound here, given that the word “consumer” suggests the person may consume water (rather than conserve it), an implication that the word “individual” does not evoke.
An earlier project likewise selected studies for brevity, simple (two-cell) design, and ease of administration, so that multiple studies could be administered in the same session (Klein et al., 2014). Most of these were judgment studies and decision-making studies and wording or framing studies, but some could be regarded as social-psychology studies (and that distinction is fuzzy). In particular, whereas anchoring and adjustment failed to replicate in Klein et al. (2018), the 2014 article reported four successful replications. This effect is often covered in social-psychology textbooks, including the present authors’ textbook (Baumeister & Bushman, 2021)—but it is essentially a common mistake people make when estimating a number, so there is nothing very social about it. In terms of more purely social psychology, imagining contact with Muslims reduced anti-Muslim prejudice. Another was that an ambiguous quotation was rated more favorably when attributed to a liked than a disliked historical figure. Two priming studies (flag and currency) failed to replicate.
A methodologically innovative project by Schweinsberg et al. (2016) obtained multisite replications of 10 findings that one laboratory had accumulated but not yet published. These were again ministudies, such as asking whether a corporate executive or a vandal was more likely to burn in hell, or whether a manager who is mean to all subordinates is worse than a manager who is mean only to subordinates who belong to racial minorities. This project differed from most other multisite replications not only in replicating unpublished findings, using quick ministudies, including the original authors, and including replicating original studies that had failed to yield significant support, but also in selecting the replicating labs by invitation only. Indeed, the labs were selected or invited by the original authors in the expectation that the replicating labs would have similar subject populations and would be likely to get similar effects without modifying the procedures (other than translating into the local language). The authors note that this last feature removed the adversarial element that other replication projects have. All the studies involved moral judgments associated with the same general theory (a person-centered account of moral judgment).
Schweinsberg et al.’s (2016) findings stand out as highly successful replications, and so the methods they adopted may offer valuable guides to future attempts. Eight of the 10 findings were successfully replicated, and a ninth confirmed the original hypothesis much better than the original (nonsignificant) finding. (Note that a significant replication combined with a nonsignificant original finding is precisely the opposite of the modal pattern we have found.) The tenth was the main failure: It tested the hypothesis that executives of charitable organizations are held to higher moral standards than for-profit corporate executives.
Discarding data
Disturbingly high rates of data exclusion were found in multiple replications. The proportion of data discarded from the main analyses in the replication studies ranged from 0 to 39% (M = 10.5%, Mdn = 5.6%). In terms of raw numbers, the studies discarded an average of 318 participants (N range = 0–2,484). Supplementary analyses sometimes excluded even more participants. In comparison, the proportion of data discarded from original studies ranged from 0 to 13% (M = 1.1%, Mdn = 0%).
Moreover, in the multisite replications, the excluded data were often quite unequally distributed across conditions, which raises questions about whether nonexcluded data are unbiased. Usually this was the result of preregistered exclusion criteria—but ones that were presumably specified with the expectation that only a few outliers or inattentive participants would be discarded, not a sizeable fraction of the sample. We review these below, noting also whether results were different between the full and truncated samples.
Baranski et al. (2020) noted that nearly half (45%) of their sample failed the attention check. They reported that the focal interaction was significant with the full sample but not the truncated sample, although the patterns were similar and both departed from the original finding.
The Bouwmeester et al. (2017) project required participants to respond either before or after 10 s, and many failed to comply. Two thirds of participants in the fast-response condition failed to respond by the deadline (compared with only 7.5% in the slow-response condition who responded early). Differential attrition is of course a serious threat to internal validity (e.g., Cook & Campbell, 1979). Excluding them yielded a significant result supporting the original finding, whereas retaining them yielded a nonsignificant result in the opposite direction. Other exclusions (e.g., for failing to understand, or for having previously participated in similar studies) produced yet different results. In particular, excluding noncomprehending participants yielded a significant result. Noncomprehension itself suggests inadequate engagement: A participant who does not understand what is going on cannot usually provide data relevant to a fair test of the hypothesis. The analysis using all three of their preregistered exclusion criteria reduced their sample from 3,596 to 792 participants—and yielded their most significant result. But excluding 78% of the sample is highly worrisome.
Dang et al. (2021) presented analyses based on both the full sample and the sample remaining after excluding a sizable number (19%) of participants who appeared to respond randomly. Ego depletion was significantly supported in both analyses. Random responding certainly indicates low engagement. Excluding all those participants increased the effect size by 60%. In this context, the exclusion criteria did appear to have had some valid basis (and strengthened the result) but did not change the positive conclusion.
The O’Donnell et al. (2018) project excluded at multiple levels. Forty laboratories participated in the project, but nearly half of these (17 labs) were discarded entirely, mainly for not recruiting enough male participants. Additional participants were excluded laboratory by laboratory. In some analyses, over half the remaining participants were excluded for apparent suspicion. Such analyses, retaining only a small proportion of the total, did find a significant effect consistent with the initial hypothesis. Including data from all 40 laboratories yielded a nonsignificant effect. Thus, in this study, the exclusions increased successful replication.
McCarthy, Hartnett, et al. (2018) recruited 1,246 participants but reported analyses on 1,069, thus excluding 14.2%. They did not report analyses on the full sample, possibly because most of the excluded data were not usable.
McCarthy, Skowronski, et al. (2018) excluded four of 26 laboratories entirely for failing to recruit at least 100 participants in each condition (not a sign of low engagement among participants). They also excluded additional participants from each laboratory, dropping their initial full sample from 7,343 to 5,610 participants, thus excluding 23.6% of their data. Restoring the four additional laboratories yielded almost identical results.
McCarthy et al. (2021) ran both a close replication and a conceptual replication protocol. Their preregistered criteria dictated deleting 34% and 35.6% of participants, respectively, and they excluded significantly more in the hostile priming condition than in the neutral control condition, which threatens the internal validity of their findings. They reported multiple analyses restoring some or all of the excluded data, and the main finding was nonsignificant in all analyses.
The study of unconscious conditioning by Moran et al. (2021) had to exclude participants who were aware of the manipulation, which was assessed in various ways for different analyses. The most rigorous exclusion eliminated 49% of the sample. The analysis with the largest sample eliminated only 9%, and it was the only one that yielded a significant result. Thus, again, more exclusions failed to improve the findings and even seem to have weakened them.
Panero et al. (2016) recruited 1,302 participants and excluded 510, or 39% of the sample, for various reasons. They reported only analyses based on the reduced final sample.
Sanchez et al. (2017) deleted 7.9% of their sample for various reasons. They reported that results remained identical in alternate analyses retaining them all.
Skorb et al. (2020) reported both high and uneven rates of excluding participants. In their three protocols, 57%, 64%, and 0% were excluded. Their sample thus dropped from the initial 1,018 to 663 participants. They reported that results maintained the same patterns in alternate analyses of the full sample. Inspection of their online tables indicates, however, that the effects dropped out of the significant range. Thus, in this case, excluding suspicious and noncompliant participants improved the results slightly—including making one significant result possible.
Verschuere et al. (2018) tested 7,158 participants and for their primary analyses included only 4,674 participants, thus discarding 35% of the sample. They did not report a full-sample analysis, but restoring some participants yielded no improvement.
Vohs et al. (2021) discarded 1,068 participants (30% of their sample). The replication was significant with the full sample but not after deleting the 1,068 participants. This high rate contrasts with a preregistered single-lab replication of the same phenomenon (ego depletion), which discarded 9% of the sample (Garrison et al., 2019). Schmeichel was an author on both studies, and it is reasonable to assume that the criteria were similar. Discarding 9% is worrisome, but hardly as bad as 30%. The high rate is consistent with the low-engagement hypothesis. And, again, discarding participants weakened the results.
To summarize: excluding a substantial number of participants sometimes improved results (Bouwmeester et al., 2017; Dang et al., 2021; O’Donnell et al., 2018), sometimes made them substantially worse (Baranski et al., 2020; Moran et al., 2021; Vohs et al., 2021), and sometimes made no difference (McCarthy et al., 2021; McCarthy, Hartnett, et al., 2018; Sanchez et al., 2017; Skorb et al., 2020; Verschuere et al., 2018). Apart from Dang et al. (2021), whose results were significant in both analyses, and McCarthy et al. (2021), whose results were nonsignificant either way, many of the exclusions were decisive in terms of whether the replication’s omnibus finding was significant or not. The purpose of excluding participants is presumably to improve the test of the hypothesis and, if the hypothesis is correct, to increase the chances of a significant result. In these studies, this usually fails or even backfires.
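The exclusion percentages quoted in this section follow directly from the recruited and analyzed sample sizes. As a minimal check, the sketch below recomputes several of them from the figures given above; the helper name `exclusion_rate` and the dictionary labels are ours, introduced only for illustration.

```python
def exclusion_rate(recruited: int, analyzed: int) -> float:
    """Percentage of recruited participants dropped before analysis."""
    if recruited <= 0 or not 0 <= analyzed <= recruited:
        raise ValueError("need 0 <= analyzed <= recruited and recruited > 0")
    return 100.0 * (recruited - analyzed) / recruited


# (recruited, analyzed) pairs as quoted in the text of this section.
reported = {
    "Bouwmeester et al. (2017), all three criteria": (3596, 792),
    "McCarthy, Hartnett, et al. (2018)": (1246, 1069),
    "McCarthy, Skowronski, et al. (2018)": (7343, 5610),
    "Panero et al. (2016)": (1302, 1302 - 510),
    "Verschuere et al. (2018)": (7158, 4674),
}

for label, (recruited, analyzed) in reported.items():
    print(f"{label}: {exclusion_rate(recruited, analyzed):.1f}% excluded")
```

The printed values (78.0%, 14.2%, 23.6%, 39.2%, and 34.7%) agree with the rounded percentages reported above.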
Analysis of coded variables
The sample of 36 articles is small and not representative, so any statistical analyses with them should be considered exploratory and descriptive (of the full population of published multisite replications in social psychology). Nevertheless, we conducted a series of such analyses. These analyses and predictions were preregistered (https://osf.io/t7u3x/).
Significance tests for manipulation checks and replication effects were treated as rank-order data (0 = nonsignificant, 1 = mixed, 2 = significant) and were therefore analyzed using nonparametric procedures. For social interaction, live ongoing interactions were coded as 2, and computer-mediated interactions, computer-mediated-simulated interactions, and observations of people interacting were coded 1. All others (e.g., solitary responding with or without computer, completing tasks independently in classrooms with other students, computer-administered hypothetical vignettes) were coded 0.
A meta-analysis by Richard et al. (2003) found an average correlation of .21 from over 25,000 social-psychology studies involving over 8 million participants conducted over the past 100 years. Therefore, we used an rs of .21 as a benchmark for a correlation to be considered meaningful in our analysis, regardless of whether it was statistically significant at the .05 significance level.
As predicted, multilab studies with successful manipulation checks were more successful at replicating original results than ones with failed checks. Although the Spearman rank-order correlation was nonsignificant (given low statistical power), it was quite large, rs (N = 18) = .377, p = .123. Similar results were obtained when manipulation checks coded 0 and 1 were combined and compared with manipulation checks coded 2, rs (N = 18) = .381, p = .118. Replications without manipulation checks (N = 18), a full 50% of replications, were excluded from these analyses. We also coded whether original studies included manipulation checks. One third of original studies did, and all these studies found significant effects for their manipulation checks.
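For readers who wish to recompute such coefficients from the codings, the following is a minimal sketch of a Spearman rank-order correlation for tied ordinal codes (0, 1, 2), written in plain Python. It also converts a coefficient to the usual t statistic with N − 2 degrees of freedom. We assume, but the text does not state, that the reported p values rest on this standard approximation; the function names are ours.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rs(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sx * sy)


def t_from_rs(rs: float, n: int) -> float:
    """Approximate t statistic (df = n - 2) for a Spearman coefficient."""
    return rs * math.sqrt((n - 2) / (1 - rs ** 2))


print(round(t_from_rs(0.377, 18), 2))  # 1.63
```

With rs = .377 and N = 18, the t statistic is about 1.63 on 16 degrees of freedom, which falls short of conventional significance even though the coefficient exceeds the .21 benchmark.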
Discarding more data did not increase the success rate. This is surprising, because the purpose of discarding data is presumably to strengthen the test of the hypothesis. If anything, the trends indicated that higher numbers of discarded participants and higher rates of discarding both correlated with lower success at replication (though not significantly). This is consistent with what we reported in the previous section on discarding data, namely that discarding has not generally improved replication outcomes and often seems to backfire.
We also conducted some exploratory analyses. The most remarkable finding concerned whether the procedure included actual social interaction. This correlated positively with replication success at rs (N = 36) = .701, p < .001. Similar results were obtained when social interactions coded 1 and 2 were combined and compared with no social interaction at all, rs (N = 36) = .695, p < .001. Thus, the important point appears to be whether the study included any social interaction. Overall, 80.6% of replications contained no social interaction. None of the replication failures featured live social interaction.
In addition, we coded whether original studies featured social interaction, which was also highly correlated with whether a replication was successful, rs (N = 36) = .557, p < .001. Similar results were obtained when social interactions coded 1 and 2 were combined and compared with no social interaction at all, rs (N = 36) = .562, p < .001. Perhaps one reason so many social-psychology multisite replications fail is that the authors attempt to replicate original studies that featured no social interaction. Overall, 75% of original studies that authors tried to replicate contained no social interaction at all.
Year of publication was positively related to successful replications, rs (N = 36) = .403, p = .015, which could signal that the field is improving at conducting these projects over time. Alternatively, this could mean that at the beginning of the replication efforts, about 10 years ago, effects were chosen because they were thought to be unreliable.
Replications that included the original authors fared better than those that did not, φ (N = 36) = .327, p = .147. Exact replications also were more successful than modified replications, φ (N = 36) = .279, p = .246. It did not matter how many authors or labs were included on the multilab replication.
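For readers less familiar with the φ coefficient, it measures the association between two dichotomous variables in a 2 × 2 table. The sketch below shows the standard formula. The cell counts in the example are hypothetical and are not the counts from this review.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]].

    Rows could be, for example, original authors included (yes/no), and
    columns replication succeeded (yes/no).
    """
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


# Hypothetical cell counts, for illustration only.
phi = phi_coefficient(a=4, b=8, c=2, d=22)
print(f"phi = {phi:.3f}")
```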
Discussion
We have reported on all the multisite social-psychology replication attempts that have been published (as of July 28, 2022). Their record is admittedly dispiriting. Effect sizes are uniformly much smaller than in the original studies. The majority are nonsignificant, with only four yielding clear significant support for the original hypothesis and a few others presenting mixed results. The few successful ones often followed previous attempts that had failed to replicate, which suggests it often may take some learning to conduct an effective and successful replication.
Whom should one believe? When an original, significant finding encounters a multisite replication that is nonsignificant, scientists must decide whether to continue believing the hypothesis. Cautious skepticism should be maintained in both directions. The multisite replications have several clear advantages in terms of credibility, including large samples, preregistration, and often greater transparency. Our review has also noted some issues with them, however, including high rates of discarded data, weak or absent manipulation checks, and a general impression of low engagement among participants.
The failed replications certainly cast some doubt on the published research literature. Because of publication bias, early null results are often not reported. In social psychology especially, the effect sizes in the published literature appear to be generally inflated. (To be sure, that does not guarantee that the multisite replication’s effect size is true or correct.) Some p-hacking practices are likely to have inflated effect sizes, indeed helping weak real effects attain significance with underpowered samples. The more damning view, that p-hacking and publication bias have created a surfeit of false-positive findings (ostensibly significant evidence for nonexistent effects), seems less plausible to us, especially with findings that have been replicated several times. Nevertheless, we acknowledge that other social psychologists believe that the research literature is full of bogus evidence purporting to support hypotheses that are objectively false (e.g., Schimmack, 2020). Such a dismal outlook cannot be refuted from this review.
More precisely, the pessimistic view that the social-psychology literature is full of false findings can neither be validated nor refuted on the basis of the currently available multisite replications. It is true that most of the multisite replications have yielded nonsignificant findings and only four of 36 (11.1%) yielded clear, significant support (and three of those with reduced effect sizes). Crucially, however, these are hardly a representative sample of hypotheses and findings in social psychology. The selection of what to replicate and which procedures to use requires some attention from the field as a whole.
Moreover, there are indeed many successful replications apart from the limited set of multisite projects. It is worth noting that several classic social-psychology studies have been replicated multiple times in individual labs, despite the lack of multisite testing. For example, the original Solomon Asch (1955) conformity study was replicated by Asch himself (1956) as well as by others (e.g., Allen & Crutchfield, 1963), and there were replications in several other countries besides the United States, such as Bosnia and Herzegovina (Ušto et al., 2019), Japan (Takano & Sogon, 2008), Kuwait (Amir, 1984), Portugal (Neto, 1995), and The Netherlands (Vlaander & Van Rooijen, 1985). Similarly, the Stanley Milgram (1963) obedience study has been replicated in many different contexts by Milgram himself (1974) as well as by others (Burger, 2009), with replications in Poland (Doliński et al., 2017), in a French “real” TV game show (Beauvois et al., 2012; Bègue et al., 2015), and in a virtual-reality environment (Dambrun & Valentiné, 2010). All of these successful replications involved social interaction. It is, however, impossible to know how many failed replications there have been.
Even the multisite findings are not as uniformly discouraging as they initially seem. As we noted, many of the nonsignificant results are accompanied by null results on a manipulation check. We have called these operational failures and emphasize that insofar as the manipulation fails, the study does not constitute a test of the hypothesis. The original theoretical point can be disconfirmed only when a study provides a significant and presumably substantial difference on the manipulation check while also providing a null result on the dependent measure. Ideally, the sensitivity of the dependent variable should also be verified by showing that it can detect some (real) differences under the same conditions. Although there are some true replication failures that meet these criteria, most of those covered here do not. Indeed, there were several cases in which an ostensibly failed multisite replication yielded significant support for the original hypothesis, if one corrects for the weak manipulation check and hence bases the analysis on the minority of participants for whom the manipulation was successful (e.g., Cheung et al., 2016; Dang, 2016).
Our review suggests it would be a mistake to regard the multisite method as the best, most objective test of a hypothesis. For now, the multisite method appears to be a fairly weak way of verifying hypotheses, possibly biased toward false-negative findings. Perhaps for the present it would be appropriate to treat multisite replications similarly to original findings: Significant positive results are precious and informative, whereas nonsignificant results are often ambiguous (especially without evidence that the replication was an effective test, such as by strong effects on manipulation checks).
Explaining failures
Here we revisit the multiple possible explanations for failure outlined in the introduction. There was some evidence for each type. The formidable assortment of failed multisite replications does not all fit neatly into a single explanation.
The hypothesis was wrong
If the original finding was spurious, the theoretical point can be contradicted by a multisite replication. Several studies met the criteria of reporting both a significant and large difference on the manipulation check and a null result on the dependent variable. Only such a combination can justify the conclusion that the hypothesis was effectively tested and falsified by the multisite replication. The following were the purely falsified hypotheses: that priming warm or cold with a handheld pack would cause more prosocial behavior (Lynott et al., 2014); that thinking about comfort foods reduces feelings of loneliness (Ong et al., 2015); that holding a red item makes a woman seem sexier (Pollet et al., 2019); and that trigger warnings increase feelings of vulnerability (Bellet et al., 2020). Wagenmakers et al. (2016) also provided a seemingly true failure to replicate the finding that holding a pen in the lips (vs. the mouth), thereby evoking facial feedback of a smile or a frown, would alter ratings of how funny some cartoons are—but subsequent work has reaffirmed the hypothesis by showing that the Wagenmakers video manipulation check counteracted the manipulation (Noah et al., 2018).
The possibility that manipulation checks could affect the dependent-variable measures (for another example, see Kühnen, 2010) presents a challenge for all psychological research. The challenge is exacerbated with multisite replications. Given the high failure rate of such projects, it is vital to know whether the manipulation itself failed. Adding a manipulation check when the original procedure lacked one is desirable but may alter the manipulation decisively, as in the Wagenmakers case. We recommend that multisite projects administer manipulation checks to only half or two thirds of the sample, thereby making it possible to test whether the manipulation check altered responses on the main dependent measure.
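One way such a split could be implemented is sketched below. The function name, the participant identifiers, and the seed are hypothetical. In practice the assignment would presumably be made separately within each experimental condition so that the checked and unchecked subsamples stay balanced.

```python
import random


def assign_manipulation_checks(participant_ids, fraction=0.5, seed=None):
    """Randomly choose which participants receive the manipulation check.

    Returns a dict mapping each participant id to True (gets the check)
    or False (proceeds directly to the dependent measure).
    """
    ids = list(participant_ids)
    rng = random.Random(seed)  # a seeded generator keeps the assignment reproducible
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_checked = round(len(ids) * fraction)
    checked = set(shuffled[:n_checked])
    return {pid: pid in checked for pid in ids}


# Hypothetical example: 90 participants in one condition, two thirds checked.
assignment = assign_manipulation_checks(range(90), fraction=2 / 3, seed=2022)
print(sum(assignment.values()), "of", len(assignment), "receive the check")
```

Comparing the dependent measure between the checked and unchecked subsamples then indicates whether the check itself altered responses.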
There was another set of studies that we judged did not need manipulation checks because the manipulation was unmissable. These hypotheses therefore also count as having been falsified: that writing about one’s imaginary life as a professor (vs. a soccer hooligan) would improve mental performance on a trivia test (O’Donnell et al., 2018); that telling people to respond fast (vs. slow) on a social-goods dilemma would make them behave more prosocially (though note that the minority who did respond within the time did replicate the original finding; Bouwmeester et al., 2017); that priming people with action words would cause improved cognitive performance on Scholastic Aptitude Test items (Chartier et al., 2020); and that people believe more in “climate change” than in “global warming” (Soutter & Mõttus, 2020).
The elaboration likelihood model’s finding that persuasion would reflect a three-way interaction among personal involvement, endorser identity, and argument strength was not replicated by Kerr et al. (2015), and one sample did have significant manipulation-check differences on all three independent variables, but the others did not show that the crucial manipulation of argument strength was successful, and their manipulation checks for the other two variables were weak. Thus, it was at best a feeble test. Our view is that the positive results for the elaboration likelihood model by Ebersole et al. (2017) should take precedence.
Our theory proposed another sign that would indicate that replication failures falsify the original hypothesis. Specifically, hypotheses that had many previous replications would presumably be more successful than hypotheses that had been found only once or twice. Later, we discuss the implications of multisite replications of hypotheses that had already garnered extensive previous support. For now, the point is that multisite replication failures do not uniformly indicate that the original hypothesis was wrong, but in some cases this conclusion is justified.
Operational failures: manipulation failure
A failed manipulation check means the hypothesis was not tested. These do raise concerns about why the manipulation did not work as intended, but the findings are not sufficient to overturn the original findings and reject the hypothesis.
There were multiple clear instances of manipulation failure: Cheung et al. (2016), Buttrick et al. (2020), Ijzerman et al. (2020), Hagger et al. (2016), Corker et al. (2020), and DeJong et al. (2009). These six replications constitute 17% of the total sample. Moreover, the true number of operational failures is almost certainly higher, given that many failed replications did not report manipulation checks. If the ratio is the same as with studies that did report manipulation checks, then about half those ambiguous ones will also be operational failures.
All of this suggests another dimension to the so-called replication crisis. The procedures that manipulated independent variables successfully in the original study failed in the replication to produce the intended difference between the experimental and control groups. To illustrate: If a high-anxiety treatment condition reports the same anxiety level as the control group, the experiment cannot demonstrate anything about anxiety. But it raises the important question of why that manipulation did make the original sample, yet not the replication sample, differentially anxious.
Operational failure: low engagement among participants
Another form of operational failure involves low engagement by participants, so that they are not emotionally or motivationally engaged in the situation. We proposed two indicators that participants in replication studies are not heavily engaged and therefore fail to exhibit the hypothesized responses. One was high rates of excluded data based on some quality deficit, such as not following instructions. The other was much weaker effect sizes on the manipulation check (which might be a sign of other problems but would nevertheless follow from low engagement). There was abundant evidence of both problems.
As covered in the Results section, many of the multisite replications discarded large amounts of data, sometimes over half in some analyses. Some also discarded more from one condition than another (and others failed to report such a breakdown). Typically, these followed from preregistered criteria. In two cases excluding large amounts of data yielded the only significant positive finding, but more commonly the exclusion of large amounts of data yielded no benefit or even made the findings weaker.
Insofar as low engagement contributes to the replication failures, it would be desirable to acknowledge this and take steps to correct it. Manipulation checks are typically used to evaluate the sample and procedure, but perhaps one could use them to select subsamples that responded best to the manipulation. For example, one could select half the participants in the experimental condition whose manipulation checks indicated the strongest reaction and analyze them specifically. To be sure, such selection introduces possible confounds, such as individual difference dimensions. Nevertheless, a replication can claim some support for the original finding if the effect on the dependent variable is notably stronger among participants whose manipulation checks indicated that they responded as intended (as opposed to participants whose manipulation-check data indicated little or no effect).
We have already noted that effect sizes (including for manipulation checks) are weaker, often much weaker, in multisite replications (indeed in other replications, too) than in original social-psychology findings. There may be multiple reasons for this, including (we assume) inflation of effect sizes in original studies through publication bias and p-hacking, and regression to the mean. But weak effect sizes are also consistent with low engagement.
The low-engagement findings point to a common and destructive misconception in the field. It is tempting to assume that a multisite replication is a strong test of the hypothesis because of the large sample, which should bolster statistical significance. But quite possibly the putative strength stemming from the large sample is nullified and counteracted by the low engagement. Multisite investigations seem to be quite weak rather than strong tests, and the low engagement of participants seems a prominent reason for that. Moreover, engagement is not a dichotomous variable, and it may be misleading to assume that the discarded participants (25%) were not engaged whereas the remaining 75% were highly engaged. Rather, high rates of discarding may indicate that the whole sample was infected with low engagement, even if some participants did take it seriously and respond earnestly.
The low engagement may be partly due to social psychology’s shift away from the highly involving personal experiences cultivated by early researchers and toward collecting data from participants sitting alone at computers (see the section on live interaction below). Oppenheimer et al. (2009) highlighted the need for attention checks in online studies, reporting that over a third (46% and 35%) of participants in their two studies failed such checks. This could be interpreted as indicating that the rest are fully involved, so that the 35% or 46% merely supplied large amounts of error variance to dilute the results. Alternatively, if one assumes a continuum of motivational engagement, the 35% or 46% were not engaged while many of the others (who managed to pass the attention check) might still have been only barely complying, so that most of the sample was compromised by low engagement. Oppenheimer et al. reported another (pilot) study in which they reduced the failure rate to 14% among highly motivated participants, which again confirms the importance of subjective motivation and engagement. In contrast, many online samples may lack such high motivation. Webb and Tangney (in press) reported multiple checks for quality responding with an MTurk sample, and each check eliminated substantially more data—fitting the view that the full sample was infected with low engagement, which can only be found in different ways.
Low engagement will affect only some social-psychology findings. As noteworthy examples, anchoring adjustment and false consensus fared well among the ministudy replications (Klein et al., 2014, 2018). Simple mental mistakes may replicate well (or even better) among people who are not highly engaged. Likewise, Schweinsberg et al. (2016) had very good success replicating moral judgments of hypothetical situations, which obviously did not require personal involvement in any fashion (and indeed they discarded no data at all). In contrast, the early forms of social psychology often depended on creating highly involving experiences, and participant disengagement could be fatal to attempts to replicate those.
Lack of social interaction
Over the years, social-psychology experiments have shifted from elaborately staged, highly involving live interactions to reliance on solitary individuals sitting at computers making ratings (see Baumeister et al., 2007). Hauser et al. (2018) noted that online materials “are typically relatively uninvolving” (p. 998). We went back and examined the procedures for actual social interaction and found a strong relationship to replication success.
The single most successful multisite replication we found (Ito et al., 2019) was the only one to feature live, unscripted social interaction. Perhaps not coincidentally, it was the only one outside the ministudies to replicate the original effect size. Three others featured live, ongoing interaction with the experimenter, and they also were either complete or mixed successes. One of the other full successes had participants observe an exchange (a supposedly unscripted social interaction) and tracked their eye movements during it. Two studies used computers but simulated live, ongoing interactions, and these had some findings, albeit weak ones, consistent with the original (Bouwmeester et al., 2017; Skorb et al., 2020). Thus, the few studies featuring genuine interpersonal interaction replicated reasonably well.
In sharp contrast, the failures contained all the studies conducted by having solitary participants make ratings on a computer, or in a few cases on paper, as well as the studies conducted with participants in large rooms responding in solitary fashion rather than interacting. None of the failed replications featured live social interaction.
Thus, there is a substantial difference in replication success as a function of whether the procedure included human-to-human social interaction or was conducted as a mostly solitary procedure. This likely overlaps substantially with the point about low engagement, insofar as participants become more involved when dealing with another human being than when they are merely sitting at a computer making solitary ratings and at best imagining a social event (e.g., Hauser et al., 2018). The preference for the latter sort of solitary procedure may reflect the constraints under which many multilab replications operate. The need for large samples makes labor-intensive interpersonal interactions costly, but the costs of relying on computers as the essential medium for social-psychology research may include impairments in replicability.
The opposite result would have been plausible. After all, it is easier to standardize a computer-administered protocol than a study containing live, semiunscripted interaction. Method-based error variance should be higher in the live-interaction studies, yet they replicated better, not worse.
We had initially suspected that replicability suffered when researchers switched from original studies containing live interaction to studies featuring computer-administered procedures. Although there were a few such cases, in general the original and the multisite replication were the same in terms of whether there was live interaction. Any difference due to live social interaction mainly involved the selection of studies to replicate rather than the selection of different procedures for testing the same hypothesis. We recommend that in selecting future topics for multisite replication, social psychologists give some priority to findings involving live interpersonal interaction.
To be sure, the ministudies in the Klein et al. (2014, 2018; also Schweinsberg et al., 2016) articles were administered by computer and had a notably higher success rate than the full studies. Again, however, these focused on phenomena that do not depend on personal engagement, such as making moral judgments about hypothetical vignettes, or estimating numbers. If social psychology desires replicable successes without having to include live interactions, it may profit by focusing mainly on studying how people think about things they have no personal reason to care about.
A speculative explanation is that live social interaction engages the individual more fully than computer-administered questionnaires. That would explain the benefits of live interaction for social-psychology studies, as well as the higher replicability of judgment and decision-making and other findings that do not depend on high engagement, such as the estimation error in anchoring adjustment.
Editorial bias favoring failure
It is uncomfortable to discuss possible systemic bias in the editorial system, though critics have long (and quite plausibly) asserted that the published literature can be misleading because of editorial bias in favor of significant findings. The contribution of any such bias to the replication debate is difficult to gauge, especially with objective data, so any discussion here is impressionistic. Anecdotal evidence is consistent with this interpretation (Schmeichel, personal communication). Vohs et al. (2021) found both significant and nonsignificant results depending on the exclusion of over 1,000 participants. The editor directed them to feature the nonsignificant results in the published article, on the basis of the preregistered analyses, while consigning the significant results to the supplemental online materials. This contrasts with the usual and best practice, which is to report results both ways when there are many exclusions. It is impossible to know whether this reflects a general pattern, but it is suggestive that studies with mixed results are reported mainly as failed replications (thus emphasizing the nonsignificant rather than the significant findings).
No strong conclusions can be drawn regarding possible editorial bias, but we think it worth mentioning, in part because publication bias is widely assumed to contribute to the inflated effect sizes among original findings. Galiani et al. (2017) surveyed journal editors in economics and found that they preferred to publish failed rather than successful replications. Such a preference would be understandable, insofar as journal editors seek to preserve their much-sought-after journal space for new information that advances the field. A successful replication provides no new knowledge, in an important sense, because it merely confirms what has already been found. In contrast, a failed replication suggests that currently held beliefs should be questioned, revised, or even discarded. Hence an editor might plausibly believe that a failed replication is a more important and newsworthy contribution than a successful one.
We note that a multisite replication offers a rare opportunity to conduct a meta-analysis with zero publication bias. That assumes that the journal has agreed in advance (i.e., before data collection) to publish the paper, whatever the outcome. All studies in the project can then be included in the meta-analysis. To be sure, the typically high rate of data exclusion does compromise the validity to some degree. But the meta-analysis of the multisite replications includes all studies that were part of it, regardless of outcome.
Explaining success
Although our focus has been on the failures of multisite replications, it is worth considering the successes together. As already reported, the big multisite replications focusing on a single theory or hypothesis yielded four successes, as follows: Eyewitnesses who confer among themselves before testifying (in simulated trial situations) influence each other’s testimony; prior exercise of self-regulation leads to impaired performance on a subsequent, different test of self-regulation (i.e., ego depletion); the personality trait of need for cognition interacts with quality of persuasive argument to influence attitude change (i.e., elaboration likelihood model); and people look at someone whom they expect to speak to.
These are a motley group, but one thing they have in common is being heavily cognitive. (Ego depletion is not necessarily a cognitive phenomenon, but the successful multisite replication by Dang et al., 2021, used highly cognitive procedures.) Cognitive effects have tended to replicate better than more purely social ones (Open Science Collaboration, 2015; Wilson & Wixted, 2018), and so relying more on cognitive procedures may increase replicability, as these results suggest.
The idea that focusing on cognitive processes will improve replicability gains credence from the ministudies, which had a higher success rate than the experiments devoted to a single hypothesis or theory. These assessed quick thought reactions and revealed common biases and mental mistakes. Crucially, they do not rely on emotional or motivational engagement by participants. For example, the finding that people blame a (hypothetical) man who accidentally hurts a baby more than they blame a baby who accidentally hurts a grown man (replicated by Klein et al., 2018) probably does not require deep emotional involvement or careful thought. The same goes for a false-consensus effect, in which people estimate that many others would share their opinions. This line of thought suggests an alternative way forward, which is for social psychology to dispense with studying phenomena that engage people’s motivations and limit research to quick thought-reaction procedures. Such an approach (which does appear to be the trend in the field at present) may have the benefit of improving replicability in multisite online procedures, though some would object that there are hidden costs in neglecting to study more highly involving behavioral phenomena.
Focusing more on socially cognitive processes would be a departure from the roots of social psychology, which had a strong behavioral focus despite the cognitive thrust of some early dominant theories (e.g., cognitive dissonance, attribution). Nevertheless, it seems well suited to the current preference for online methods, easily administered procedures, and large samples. Focusing mainly on how people think about the social world and how they think they would react to hypothetical situations offers fertile ground for further research, and if this approach is combined with the promise of improvements in replicability, it could be a good way to go.
To be sure, the dismal record of social priming in multisite replications sends a cautionary message about shifting toward an ever-more-cognitive social psychology. Priming may seem at first blush to be a cognitive phenomenon, but accumulating evidence suggests that it relies heavily on motivation (Weingarten et al., 2016). There have been over a dozen multisite attempts to replicate priming effects, all of which failed. (One ministudy did find a significant result, but that unfortunately seems confounded.) Priming strikes us as the primary upcoming battleground for the theoretical implications of multisite replications. There is a large volume of published findings in support of priming, but the multisite record provides no encouragement for the idea that the phenomenon is real.
More broadly, these multisite replications have created a paradoxical pattern in the research and publication process. Authors continue to write articles and to cite previously published studies. It is apparently fine to cite findings that have been obtained only once or twice. Meanwhile, however, findings that have been obtained dozens of times lose credibility once they are tested and dismissed in multisite replications. New authors must therefore build their theoretical case on shaky grounds. Most scientists would presumably agree that findings that have been obtained dozens of times and by multiple laboratories have more credibility than those that have been found only once or twice. Social psychology may be moving toward the opposite assumption.
Concluding Remarks
The multisite replication once held appeal as a definitive method for verifying the truth of social psychology’s findings and theories. Thus far, it has failed to confirm multiple well-replicated findings and brought the field’s credibility into serious doubt. We freely concede that it is possible to take this record as a sign that social psychology, as practiced for the past half century, has been an exercise in futility marked by dubious theories built around false-positive findings. Nevertheless, we think it is also possible to maintain a more positive attitude about social psychology’s research literature. The multisite replication, perhaps especially as administered by impersonal procedures and characterized by low participant engagement, could be a weak and flawed method for verifying social-psychology findings. Its drawbacks may be especially problematic for social psychology, and multisite replications do seem to succeed better in other fields, though such a comparison is beyond the scope of our review. The disappearance of live interpersonal interaction from social psychology’s methods may be a particularly costly loss: We found that studies with actual human interaction replicated much better than ones relying on solitary responses. The most effective social-psychology multisite replications involve actual social interaction. Thus, as an alternative to making social psychology more cognitive, there may be utility in revisiting the more social side of social psychology.
Moreover, the purpose and value of the laboratory experiment, as developed in the early years of social psychology, can be reconsidered. Our impression is that those early researchers did not look upon their findings as establishing infinitely replicable laws of nature but as showing that certain causal effects could be obtained under optimal conditions. Even cognitive dissonance, which dominated the field’s thinking during its formative years, was not always found. The appropriate conclusion based on significant social-psychology experimental findings could perhaps be characterized as “sometimes this happens.” Indeed, some areas of research have already begun acknowledging this explicitly. Moral-licensing patterns, for example, have been found repeatedly, but so have significant findings in the opposite direction (Mullen & Monin, 2016).
Although “sometimes this happens” may be disappointing compared with establishing universal laws, perhaps the field should accept this with both humility and pride. It is valuable to demonstrate regular patterns in behavior, even if they will not be found all the time across diverse circumstances and populations. Historians, for example, seek to establish what happened but do not expect to end up with a robust grand theory of history that will explain all the past and predict all the future. Social psychologists might likewise be content with showing that independent variable X sometimes causes dependent variable Y without expecting that to occur under all circumstances. This would map the field’s future agenda as containing at least two steps: (a) showing that causal patterns occur sometimes, and (b) identifying boundary conditions to indicate when they do and do not occur. Such an approach may enable social psychologists to build on the positive achievements and contributions of earlier generations while also weeding out the false-positive findings and shelving those that occur only under rare circumstances. Multisite replications may have an important and even constructive role in that sort of future. That could at least be a more productive and constructive way of viewing social psychology’s research activities than the current one of producing provocative, important findings and then discarding them because of failed multisite replications.
Supplemental Material
sj-docx-1-pps-10.1177_17456916221121815 – Supplemental material for A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base?
Supplemental material, sj-docx-1-pps-10.1177_17456916221121815 for A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base? by Roy F. Baumeister, Dianne M. Tice and Brad J. Bushman in Perspectives on Psychological Science
sj-docx-2-pps-10.1177_17456916221121815 – Supplemental material for A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base?
sj-docx-3-pps-10.1177_17456916221121815 – Supplemental material for A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base?
sj-docx-4-pps-10.1177_17456916221121815 – Supplemental material for A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base?
sj-docx-5-pps-10.1177_17456916221121815 – Supplemental material for A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base?
sj-docx-6-pps-10.1177_17456916221121815 – Supplemental material for A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base?
