Abstract

Psychological Science, the journal, and psychological science, the field, continue to struggle with the challenge of establishing interesting and important and replicable phenomena. As I often tell my students, “If scientific psychology was easy, everyone would do it.” We can take some comfort in knowing that other sciences, too, face similar challenges (e.g., Begley & Ellis, 2012). But our business is with psychology.
In August of this year, Science published a fascinating article by Brian Nosek and 269 coauthors (Open Science Collaboration, 2015). They reported direct replication attempts of 100 experiments published in prestigious psychology journals in 2008, including experiments reported in 39 articles in Psychological Science. Although I expect there is room to critique some of the replications, the article strikes me as a terrific piece of work, and I recommend reading it (and giving it to students). For each experiment, researchers prespecified a benchmark finding. On average, the replications had statistical power of .90+ to detect effects of the sizes obtained in the original studies, but fewer than half of them yielded a statistically significant effect. As Nosek and his coauthors made clear, even ideal replications of ideal studies are expected to fail some of the time (Francis, 2012), and failure to replicate a previously observed effect can arise from differences between the original and replication studies and hence does not necessarily indicate flaws in the original study (Maxwell, Lau, & Howard, 2015; Stroebe & Strack, 2014). Still, it seems likely that psychology journals have too often reported spurious effects arising from Type I errors (e.g., Francis, 2014).
Awareness of threats to replicability has greatly increased among psychologists (including yours truly) over the past 4 years. This is thanks in no small part to a 2011 Psychological Science article by Simmons, Nelson, and Simonsohn, although many other psychologists and statisticians have contributed to this awareness as well (for a wide-ranging treatment, see Bollen, Cacioppo, Kaplan, Krosnick, & Olds, 2015). The Association for Psychological Science (APS) has been in the lead of efforts to enhance replicability while maintaining theoretical impact. One exciting initiative is the addition of Registered Replication Reports to Perspectives on Psychological Science (APS, n.d.). Also, APS is a signatory to the Center for Open Science’s Transparency and Openness Promotion guidelines (Nosek et al., 2015; see Center for Open Science, 2011–2015, and APS, 2015).
Former Editor of Psychological Science Eric Eich (2014) instituted a number of important changes to the journal to enhance replicability. Eich called for increased statistical power and urged reporting of confidence intervals and effect sizes, use of meta-analysis, and other approaches to avoiding problems with null-hypothesis significance testing (NHST). He relaxed word limits on Method sections so that authors can provide full details of stimuli and procedures. He instituted requirements to disclose details, such as how sample size was determined, whether any observations were excluded, and whether any conditions or measures were dropped. He also took steps to encourage open science, such as introducing “badges” recognizing articles reporting studies that were preregistered or for which stimuli and data have been made publicly available. These are all good steps, and because of them I am optimistic that recent experiments reported in Psychological Science are more replicable than those published in 2008 (but see Replication-Index, 2015, for an analysis suggesting that the statistical power of experiments reported in Psychological Science has declined, rather than increased, in recent years).
I am committed to continuing and extending Eich’s effort to increase the percentage of results published in Psychological Science that accurately inform readers of replicable phenomena. Replicability is not the only criterion of a first-rate science journal, but it had better be a fundamental one. The journal’s team of Senior and Associate Editors are all on board with this effort. In this Editorial, I highlight four issues that the editors and I believe are particularly important. My emphasis here is on experiments and NHST, but many of these points also apply to nonexperimental research, and some pertain to Bayesian and precision-estimation approaches (Cumming, 2012). (By the way, I am enthusiastically open to submissions that make appropriate use of alternatives to NHST.)
The Troubling Trio
Editors at Psychological Science are on the lookout for this troubling trio: (a) low statistical power, (b) a surprising result, and (c) a p value only slightly less than .05. In my view, Psychological Science should not publish any single-experiment report with these three features because the results are of questionable replicability. When an editor is in doubt about the replicability of an interesting effect in a submitted manuscript, he or she may invite the authors to conduct a high-powered replication preregistered on the Open Science Framework or a comparable service. Preregistration entails specifying in advance the subjects, materials, procedures, measures, exclusions, and analyses. (Personally, I aim never again to submit for publication a report of a study that was not preregistered. Exploratory work has great and essential value but is typically not appropriate for publication.) An invited resubmission that includes a preregistered replication will typically be handled by the same editor who invited the resubmission and will typically be sent out for rereview.
Regarding statistical power, my impression is that many authors who submit manuscripts to Psychological Science have but a shaky grasp on that concept. And I suspect that many are unfamiliar with the perils of “optional stopping,” in which the researcher tests some subjects, analyzes the data, and then either stops and writes up the results (if the desired effect is obtained) or collects additional data and repeats this process until a significant result is obtained or fatigue sets in. This practice inflates Type I error rates (e.g., Sanborn & Hills, 2014).
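The inflation caused by optional stopping can be illustrated with a small simulation. The sketch below is a minimal illustration, not a reconstruction of any analysis cited here: the function name, the peeking schedule (start at 20 per group, add 10 at a time, stop at 100), and the use of a z test with a known standard deviation of 1 are all assumptions chosen to keep the example free of external dependencies.

```python
import random
from statistics import NormalDist, fmean


def optional_stopping_false_positive_rate(
    n_start=20, n_step=10, n_max=100, alpha=0.05, n_sims=2000, seed=1
):
    """Estimate the Type I error rate when data are analyzed repeatedly.

    Both groups come from the same N(0, 1) population, so the null
    hypothesis is true and every "significant" result is a false positive.
    A two-sided z test with known sigma = 1 is assumed for simplicity.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n_start)]
        b = [rng.gauss(0, 1) for _ in range(n_start)]
        while True:
            n = len(a)
            # Standard error of a difference between two means, sigma = 1.
            z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
            if abs(z) > z_crit:
                false_positives += 1  # stop and "write up the results"
                break
            if n >= n_max:
                break  # give up ("fatigue sets in")
            a.extend(rng.gauss(0, 1) for _ in range(n_step))
            b.extend(rng.gauss(0, 1) for _ in range(n_step))
    return false_positives / n_sims


# One planned analysis at n = 100 per group versus peeking along the way.
fixed = optional_stopping_false_positive_rate(n_start=100, n_max=100)
peeking = optional_stopping_false_positive_rate(n_start=20, n_max=100)
print(f"fixed N: {fixed:.3f}   optional stopping: {peeking:.3f}")
```

With a single planned analysis the false-positive rate should sit near the nominal alpha, whereas the peeking schedule gives the null hypothesis several chances to be rejected and so pushes the rate well above it.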
One of Eich’s replicability-enhancing interventions was to require that, as part of the submission process, authors explain how the sample size for each study reported in the manuscript was determined. I have not conducted a formal analysis of responses to this question, but my impression is that the modal response is to say that the Ns were selected to be comparable to Ns of prior research. However, there is ample reason to believe that many past psychology experiments were grossly underpowered (Cohen, 1962; Vankov, Bowers, & Munafò, 2014), so basing the Ns of new research on past practice is generally not appropriate. Other times, authors say that they assumed a medium-sized effect (e.g., Cohen’s d of 0.50) but do not cite evidence backing up that estimated effect size. Sometimes authors report a power analysis that is based on only one effect size even though the research they report tested several effects. Yet other times, authors claim to have calculated estimates of power that are clearly incorrect.
I hope to raise standards for statistical power in Psychological Science, so it is important for authors to understand this concept (see also Vazire, 2015). Manuscripts reporting studies that do not appear to have adequate power may be rejected on that basis without external review. Note that the number of subjects tested is only one determinant of power. In addition to the raw size of the effect, the amount of error variance is important. Error variance can be reduced by improving control, by using better measures, or both. As noted by Associate Editor Leaf Van Boven (personal communication, September 8, 2015), error variance can be further reduced via statistical approaches such as generalized linear mixed-effects models that treat both subjects and items as random factors (e.g., Hoffman & Rovine, 2007; Westfall, Kenny, & Judd, 2014).
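The joint role of sample size, effect size, and error variance can be made concrete with a normal-approximation power calculation. This is a rough sketch under stated assumptions: two independent groups of equal size, a known common standard deviation, and a two-sided test. The function name and the example values are illustrative, and a t-based calculation would give slightly lower power in small samples.

```python
from statistics import NormalDist


def power_two_group(mean_diff, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a raw mean difference between two groups.

    Assumes equal group sizes, a known common standard deviation, and a
    two-sided z test (a normal approximation to the usual t test).
    """
    nd = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5      # standard error of the difference
    shift = abs(mean_diff) / se             # expected z under the alternative
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


# Same raw effect (10 points); vary the error SD and the sample size.
for sd, n in [(20, 24), (20, 96), (10, 24)]:
    print(f"sd={sd:>2}  n/group={n:>2}  power={power_two_group(10, sd, n):.2f}")
```

Because the standard error depends on the ratio of the standard deviation to the square root of N, halving the error standard deviation buys the same power as quadrupling the number of subjects, which is why better control and better measures can be as valuable as larger samples.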
p-Hacking
Editors at Psychological Science are also on the lookout for evidence of p-hacking. This refers to practices that inflate the Type I error rate, such as (a) dropping subjects, observations, measures, or conditions that yielded inconvenient data; (b) applying poorly motivated and post hoc data transformations; (c) using questionable covariates; (d) suppressing mention of experiments that were conducted but “didn’t work”; and (e) using the optional-stopping strategy (mentioned in the previous section) during data collection. Whether these sorts of things are done innocently or nefariously, readers need to know about them to assess the replicability of the research. Senior authors must ensure that their supervisees understand the risks of p-hacking and understand that hiding p-hacking is unethical. Preregistering a study likely reduces p-hacking.
Interpreting Correlations
Researchers must be aware of the noisiness of correlations in small samples, even when the correlations are statistically significant. Except when a correlation in the population is nearly perfect, a quite large sample size is required to ensure that the obtained correlation is highly likely to approximate the correlation in the population. Senior Editor Ralph Adolphs referred me to a fascinating article on this by Schönbrodt and Perugini (2013), who argued that “in typical scenarios the sample size should approach 250 for stable estimates” (p. 609). These authors gave the example of a sample of 24 in which two dependent variables correlate at an r of .40; that correlation is statistically significant at the alpha .05 level, but the 90% confidence interval runs from .07 (trivially weak in most contexts) to .67 (strong in most contexts). I am not suggesting that all correlations require an N of 250, but assertions about the meaning of observed correlations must be informed by awareness of this issue. Reports of values of r must, like reports of means, be accompanied by appropriate confidence intervals.
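A confidence interval for a correlation can be approximated with the Fisher z transformation. The sketch below assumes bivariate normality and uses that large-sample approximation, so its endpoints may differ slightly from values obtained by other methods; the function name and the sample sizes in the loop are illustrative choices.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z transformation: atanh(r) is treated as normally
    distributed with standard error 1 / sqrt(n - 3).
    """
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# How the interval around an observed r of .40 narrows as N grows.
for n in (24, 50, 100, 250):
    lo, hi = correlation_ci(0.40, n)
    print(f"N={n:>3}  95% CI: [{lo:.2f}, {hi:.2f}]  width={hi - lo:.2f}")
```

Reporting the interval alongside r makes the instability of small-sample correlations visible in a way that a p value alone does not.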
A second point regarding correlations, emphasized by Senior Editor John Jonides (personal communication, September 6, 2015), is that it is very wise when considering correlations to examine scatterplots, because r can be greatly exaggerated by a very few outliers (or underestimated because of range restrictions). On a related point, E. J. Wagenmakers (personal communication, September 30, 2015) referred me to Anscombe’s quartet (Wikipedia, 2015). The four data sets of the quartet have identical summary statistics, but, as shown in Figure 1, when they are graphed, the relationships between the two variables are revealed to be qualitatively different. The importance of graphing is not limited to correlations, but pertains to all quantitative data: Researchers must pay attention to the shapes of distributions, not just to their means. I encourage authors to include in their submissions figures such as scatterplots, frequency histograms, or stacked dot plots that show individual subjects’ data.

Figure 1. The four data sets of Anscombe’s quartet. The data sets all have the same summary statistics (i.e., mean of variable x, mean of variable y, sample variance for x, sample variance for y, correlation between x and y, and best-fitting regression line), yet the nature of the association between x and y differs qualitatively from one data set to the next. Adapted from Wikipedia (2015).
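The near-identical summary statistics of the quartet are easy to check directly. The sketch below hard-codes the commonly reproduced values of Anscombe's four data sets (an assumption worth verifying against a trusted copy) and computes each summary from first principles; the dictionary and function names are illustrative.

```python
from statistics import fmean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, Pearson r, and the least-squares line."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (x, y) in ANSCOMBE.items():
    s = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The printed rows agree to about two decimal places, which is exactly why the numbers alone cannot distinguish a linear trend from a curve or from a single influential point; only a plot does that.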
Misinterpretation of Nonsignificant Results
Researchers sometimes imply that a nonsignificant difference shows that the null hypothesis is true. A null result could mean that the null hypothesis is true, but it could instead mean that the experiment failed to detect a real effect (i.e., a Type II error). If the same experiment is conducted repeatedly, p values will range widely across experiments, especially if the statistical power of the experiment is low. I recommend Geoff Cumming’s tutorial videos on the “new statistics” (Cumming, 2014), especially the chapter titled “Dance of the p Values,” and his free Exploratory Software for Confidence Intervals (ESCI; Cumming, 2012) for investigating relationships among sample size, effect size, error variance, and statistical significance. Figure 2 illustrates a case in which the standard deviation of a dependent variable in the population is 20, the true size of the effect is 10 (i.e., half a standard deviation, which is pretty large by psychology standards), and two randomly assigned groups of 24 subjects are compared in each of 25 experiments. About 60% of those experiments yield a Type II error and fail to reject the (false) null hypothesis, often with large p values (e.g., .9). A big p value does not show that the null hypothesis is true. Under the conditions of the typical psychology experiment, p values are noisy.

Figure 2. Results of 25 simulated experiments (illustration created using Exploratory Software for Confidence Intervals, or ESCI; Cumming, 2012). In each experiment, 24 cases were randomly sampled from the control distribution (µ = 50, σ = 20), and 24 cases were randomly sampled from the experimental distribution (µ = 60, σ = 20). For each experiment, the circle indicates the difference between the means of the two conditions, and the green line represents the 95% confidence interval around that difference. The p values for the comparisons are shown along the left margin. For this particular run, 15 of the 25 experiments failed to reject the false null hypothesis (H0) at alpha = .05 (with p values as high as .991). Thus, large p values are not good evidence for the null hypothesis.
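A simulation along the same lines can be sketched in a few lines of code. This is not ESCI's procedure: for simplicity it assumes a two-sided z test with the population standard deviation treated as known, and the function name and seed are arbitrary. The population values (means of 50 and 60, standard deviation of 20, 24 cases per group) follow the scenario described above.

```python
import random
from statistics import NormalDist, fmean


def dance_of_p_values(
    n_experiments=25, n_per_group=24,
    mu_control=50, mu_experimental=60, sigma=20, seed=2015,
):
    """Simulate repeated runs of one experiment; return the p values.

    The null hypothesis is false by construction (the true difference is
    mu_experimental - mu_control), so every p >= alpha is a Type II error.
    """
    rng = random.Random(seed)
    nd = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5  # known-sigma standard error
    p_values = []
    for _ in range(n_experiments):
        control = [rng.gauss(mu_control, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(mu_experimental, sigma) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        p_values.append(2 * (1 - nd.cdf(abs(z))))
    return p_values


ps = dance_of_p_values()
misses = sum(p >= 0.05 for p in ps)
print(" ".join(f"{p:.3f}" for p in ps))
print(f"{misses} of {len(ps)} experiments failed to reject the false null")
```

Changing the seed produces a different set of p values from the same population, and increasing `n_experiments` gives a steadier estimate of the long-run miss rate.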
I most often see misinterpretations of nonsignificant effects in reports of factorial designs. For example, a researcher might find that monetary incentives had a statistically significant effect on boys’ performance but not on girls’ performance and summarize the results as showing that boys were affected by incentives but girls were not. It would be appropriate to say that the effect was significant for boys and nonsignificant for girls. And if the two-way interaction was significant, then it would be appropriate to say that the effect was larger for boys than for girls. But it would not be appropriate to conclude that there was no effect for girls. The null result for girls could just be a Type II error, a failure to detect a real effect.
If authors want to argue that their results support the null hypothesis, there are at least two ways that they can do that. One is to acknowledge that null results are inherently ambiguous in NHST but to argue that the samples were large and the error variance was small (demonstrated with small 95% confidence intervals), so the experiment had ample power to detect a nontiny effect; given that the difference in the means was small (perhaps even in the wrong direction), the null result can be taken as evidence that there probably is not a nontiny effect. This argument is categorically different from (and better than) saying or implying that the result shows that there is no effect of the independent variable on the dependent variable under the conditions of the experiment. The other approach is to use Bayesian analyses to estimate the extent to which the results support the null hypothesis. For highly accessible information about the latter approach, see Masson’s (2011) tutorial article. E. J. Wagenmakers’s JASP team at the University of Amsterdam has developed free and easy-to-use software for Bayesian hypothesis testing (see JASP, 2014).
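One simple way to quantify support for the null hypothesis is the BIC approximation to the Bayes factor. The sketch below assumes the approximation takes the form ΔBIC = n ln(SSE_alt / SSE_null) + (k_alt − k_null) ln(n), with the Bayes factor in favor of the null estimated as exp(ΔBIC / 2); readers should check this against the tutorial literature before relying on it. It is applied here to a two-group comparison, where the alternative model has one extra parameter. The function name and the toy data are hypothetical, and dedicated Bayesian software may use different priors, so its results need not match.

```python
import math
from statistics import fmean


def bic_bayes_factor_null(group_a, group_b):
    """Approximate Bayes factor favoring the null for a two-group comparison.

    Null model: one grand mean. Alternative model: a separate mean per
    group (one extra parameter). Returns (BF01, approximate posterior
    probability of the null assuming equal prior odds).
    Raises ValueError if the within-group variation is exactly zero.
    """
    all_obs = list(group_a) + list(group_b)
    n = len(all_obs)
    grand_mean = fmean(all_obs)
    mean_a, mean_b = fmean(group_a), fmean(group_b)
    sse_null = sum((v - grand_mean) ** 2 for v in all_obs)
    sse_alt = (sum((v - mean_a) ** 2 for v in group_a)
               + sum((v - mean_b) ** 2 for v in group_b))
    if sse_alt <= 0:
        raise ValueError("no within-group variation; approximation undefined")
    delta_bic = n * math.log(sse_alt / sse_null) + 1 * math.log(n)
    bf01 = math.exp(delta_bic / 2)
    return bf01, bf01 / (1 + bf01)


# Toy data (hypothetical): identical groups versus clearly separated groups.
same = bic_bayes_factor_null([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
apart = bic_bayes_factor_null([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(f"identical groups: BF01={same[0]:.2f}, p(H0|D)={same[1]:.2f}")
print(f"separated groups: BF01={apart[0]:.2g}, p(H0|D)={apart[1]:.2g}")
```

A BF01 well above 1 indicates that the data are better predicted by the null than by the alternative, which is a graded statement of evidence rather than the all-or-none verdict that a nonsignificant p value tempts researchers to draw.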
Closing Words
It is not trivial to estimate what percentage of direct replication attempts of psychology experiments should succeed (Maxwell et al., 2015; Stroebe & Strack, 2014). Numerous factors conspire to inflate the estimated effect sizes that are published. Moreover, it is impossible to conduct a replication attempt under exactly the same conditions as the original study, and often researchers do not know which differences matter. Replication rates would be high in psychology if all of the effects studied were huge and robust, but if psychologists studied only huge and robust effects, then progress toward understanding subtleties of psychology would surely be thwarted. The editors of Psychological Science are confident that we can reduce the rate at which Type I errors are published without compromising other values (e.g., interestingness, relevance, elegance), and that is what we intend to do. If you have other ideas as to how to enhance Psychological Science, please e-mail me about them (
Acknowledgements
I thank Brian P. Ackerman, Ralph Adolphs, Hal Arkes, Gretchen Chapman, Geoff Cumming, K. Andrew DeSoto, Steven Gangestad, Jamin Halberstadt, Eddie Harmon-Jones, Scott M. Hofer, John Jonides, Michael E. J. Masson, Wendy Berry Mendes, Brian Nosek, Henry L. Roediger, III, Leaf Van Boven, Bill von Hippel, and E. J. Wagenmakers for insightful and constructive comments. I also thank Geoff Cumming for permission to use an example from Exploratory Software for Confidence Intervals (ESCI; Cumming, 2012).
