Abstract
We assessed whether the most highly cited studies in emotion research reported larger effect sizes compared with meta-analyses and the largest studies on the same question. We screened all reports with at least 1,000 citations and identified matching meta-analyses for 40 highly cited observational studies and 25 highly cited experimental studies. Highly cited observational studies had effects greater on average by 1.42-fold (95% confidence interval [CI] = [1.09, 1.87]) compared with meta-analyses and 1.99-fold (95% CI = [1.33, 2.99]) compared with largest studies on the same questions. Highly cited experimental studies had increases of 1.29-fold (95% CI = [1.01, 1.63]) compared with meta-analyses and 2.02-fold (95% CI = [1.60, 2.57]) compared with the largest studies. There was substantial between-topics heterogeneity, more prominently for observational studies. Highly cited studies often did not have the largest weight in meta-analyses (12 of 65 topics, 18%) but were frequently the earliest ones published on the topic (31 of 65 topics, 48%). Highly cited studies may offer, on average, exaggerated estimates of effects in both observational and experimental designs.
Highly cited (HC) studies are often considered to be the most valued and influential scholarship, which leads to an expectation that they report the most accurate findings. However, meta-epidemiological investigations in some scientific fields have found that HC studies may report overestimated effects relative to larger or better designed studies (Ioannidis, 2005; Tajika et al., 2015) or to meta-analyses on the same topic (Ioannidis & Panagiotou, 2011). In addition, influential studies often produce substantially larger or contradictory effects relative to subsequent preregistered replication attempts (Camerer et al., 2018; Klein et al., 2018; Open Science Collaboration, 2015; Wagenmakers et al., 2016).
Multiple sources of bias may contribute to effect size inflation (Fanelli et al., 2017; Ioannidis et al., 2008). A major concern is that when research findings are incentivized to pass a prescribed threshold of statistical significance to be published (publication bias) and research designs have suboptimal statistical power, published effect sizes are inflated on average (Bakker et al., 2012; Button et al., 2013; Gelman, 2018; Ioannidis, 2008). In addition, flexibility in analytical choices (Simmons et al., 2011) can lead to a large “vibration of effects” (i.e., the range of possible effects obtained for different analysis specifications estimating the same association) that, when combined with selective reporting, can lead to an upward bias for published effects (Patel et al., 2015; Steegen et al., 2016). Finally, influential stakeholders within the scientific ecosystem, such as funders and journals, exert a preference for aesthetically appealing (“positive,” “clean,” or “novel”) results (Nosek et al., 2012), which could lead to preferential citation (citation bias) of studies that report larger effects compared with those reporting smaller or null effects (Cristea & Naudet, 2018; Gøtzsche, 1987; Greenberg, 2009).
Although systematic investigations of effect-size inflation in HC articles in the social and behavioral sciences are lacking, indirect evidence from replication studies suggests that effects reported by HC studies may be exaggerated. For example, three large-scale studies have found that effects reported in multilaboratory preregistered replication attempts are on average 49% to 66% smaller than corresponding effects reported in previously published research (Camerer et al., 2016, 2018; Open Science Collaboration, 2015). Many, but not all, of the included original studies were HC, and they were all published in high-profile journals. Other multilaboratory replication efforts specifically targeting influential psychology studies often report smaller (and sometimes null) effects relative to original studies (Klein et al., 2014, 2018; Wagenmakers et al., 2016). These replication efforts did not systematically target the most HC articles, even though some of the assessed work was HC. Moreover, they have focused predominantly on randomized experiments. However, there are many other studies that attract a lot of attention and citations, including diverse observational associations, biomarkers or predictive markers, and more. It would be important to assess whether HC studies covering such a broad spectrum of designs have inflated effect sizes and, if so, the size of the inflation compared with other studies on the same questions that do not get so many citations.
Hence, the goal of the present study was to investigate whether effect sizes reported in HC emotion research are greater relative to larger studies and meta-analyses addressing the same questions. We focused on emotion research because it is a major topic domain in psychology with a breadth of content and research designs and covers both highly exploratory basic research and applied research with clinical implications. Our goal was to gauge the extent to which effects differed between HC studies and summary effects from meta-analyses and the larger studies on the same topic. We also wanted to map the timing of publication of HC studies, largest studies, and other studies on the same topic.
Method
We adopted the approach of previous similar investigations in clinical research (Ioannidis, 2005; Ioannidis & Panagiotou, 2011) and psychiatry (Tajika et al., 2015). Changes to the preregistered study protocol are detailed in the Supplemental Material available online.
Identification and selection of target HC articles
The database Scopus was searched through October 8, 2019, using keywords generically related to “emotion,” “mood,” “anxiety,” or “depression” present in the title, abstract, or keywords.
Eligible records reported on primary data that could be used for generating effect sizes in human participants, mentioned findings related to emotions in the abstract (even if these were peripheral to the goals of the study), had an experimental (i.e., randomized) or observational design, and had been cited at least 1,000 times in Scopus as of the date of the search. Articles in which the abstract made no mention of emotion or focused exclusively on biomedical, molecular, or other aspects not related to emotional disorders or conditions were excluded. However, articles that were found to mention emotion during abstract inspection, even if in a peripheral role (e.g., as one of many secondary outcomes, a component in a model), were included.
We also excluded (a) meta-analyses and other articles using secondary data, (b) observational studies focused on prevalence, (c) studies describing the development or subsequent validation of scales, and (d) estimations of disease burden, such as the Global Burden of Disease.
One researcher (I. A. Cristea) screened all records with at least 1,000 citations by title and abstract and selected those that mentioned emotion and described observational (including pre/post designs and nonrandomized studies of various associations) and experimental (including all studies in which participants were randomized to an intervention or to different modalities of an independent variable) designs.
Identification and selection of meta-analyses
For each eligible observational or experimental HC record, we searched for the most recent meta-analysis including effect-size data from any finding in the article, provided it was related to emotion. In cases in which the HC article mentioned emotion in a peripheral role, an eligible meta-analysis had to report an effect size related to the emotion finding and not to the article’s other findings.
Meta-analyses for each target article were identified by downloading the most recent 2,000 records citing the target study in the form of a searchable .csv file. We then used the “find” command in a document processor to search for the text string “meta-analy*” in the title, author, or index keywords. Citing records were screened starting with the most recent ones and moving downward on the list. Whenever a potentially eligible meta-analysis was identified, the full text was retrieved and manually searched to identify whether (a) the HC study was included and (b) an effect size of interest from the HC study was reported. If these criteria were not satisfied, we moved down the list of citing articles chronologically until identifying another eligible meta-analysis. Meta-analyses that substituted the HC study with a larger study that encompassed it or with another publication on the same sample were eligible. In these cases, we planned to recalculate the effect size from the original report, if possible.
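The screening step described above can be sketched in code. The following is a minimal illustration only, not the procedure actually used (which relied on the "find" command of a document processor). The column names `Title`, `Year`, `Author Keywords`, and `Index Keywords` are assumptions about the exported .csv file, and the example records are invented.

```python
import csv
import io
import re

# Matches "meta-analysis", "meta-analyses", "meta-analytic", etc. (the "meta-analy*" string).
META_PATTERN = re.compile(r"meta-analy", re.IGNORECASE)

# Hypothetical column names; a real export may label these fields differently.
SEARCH_FIELDS = ("Title", "Author Keywords", "Index Keywords")


def candidate_meta_analyses(csv_text):
    """Return citing records that mention meta-analy*, most recent first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    hits = [
        row for row in rows
        if any(META_PATTERN.search(row.get(field) or "") for field in SEARCH_FIELDS)
    ]
    # Screening starts with the most recent records and moves down the list.
    return sorted(hits, key=lambda row: int(row["Year"]), reverse=True)


# Invented citing records, for illustration only.
example = """Title,Year,Author Keywords,Index Keywords
A randomized trial of X,2018,anxiety,Humans
Depression and Y: a meta-analysis,2017,depression,Humans
Emotion regulation review,2019,emotion; meta-analytic review,Humans
"""

for record in candidate_meta_analyses(example):
    print(record["Year"], record["Title"])
```

Each hit would still require retrieval of the full text to check whether the HC study and an effect size of interest are actually reported.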
For eligible records that described more than one meta-analysis (i.e., reported more than one forest plot) including the target HC study, we chose the one with the highest number of studies or, if there were ties in this regard, the one that appeared first in the text, provided it reported a finding related to emotion.
One researcher (I. A. Cristea) searched citing records and identified meta-analyses.
Data extraction
For each matching meta-analysis, we coded information about publication year, meta-analysis model used (fixed or random), total number of included effect sizes in the selected forest plot, effect-size measure (e.g., mean difference, standardized mean difference [SMD], correlation, odds ratio [OR], risk ratio [RR], hazard ratio [HR]), earliest study (by publication year) in the forest plot, effect sizes and 95% confidence intervals (CIs) for the HC and largest study, and summary effect sizes and 95% CIs in the meta-analysis. When the meta-analysis reported different models of estimating effect size, we preferred random effects. The largest study was defined as the study with the lowest standard error in the matching meta-analysis. To select the largest study, we relied on the following succession of information, if reported: (a) weights in the forest plots, followed by (b) standard errors/variance associated with individual effect sizes, followed by (c) recalculation of the 95% CI width for those individual effect sizes in which the CI appeared visually smaller in the forest plot, and finally, (d) study sample size. If several studies with the same weight or standard error were included, sample size was used to break the tie.
If several studies, including the HC study, were the earliest in the forest plot (published in the same year), the HC study was considered the earliest.
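The two selection rules above (largest study and earliest study) can be expressed compactly. The sketch below is illustrative: the `Study` record and its field names are hypothetical, the data are invented, and it assumes that a standard error is available for every study, whereas in practice weights or CI widths sometimes had to stand in for it.

```python
from dataclasses import dataclass


@dataclass
class Study:
    label: str
    year: int
    se: float                 # standard error of the effect size
    n: int                    # sample size
    highly_cited: bool = False


def largest_study(studies):
    """Lowest standard error wins; ties are broken by the larger sample size."""
    return min(studies, key=lambda s: (s.se, -s.n))


def earliest_study(studies):
    """Earliest year wins; if the HC study shares that year, the HC study is chosen."""
    return min(studies, key=lambda s: (s.year, not s.highly_cited))


# Invented forest-plot entries, for illustration only.
studies = [
    Study("A", 1995, se=0.20, n=120, highly_cited=True),
    Study("B", 1995, se=0.10, n=800),
    Study("C", 2010, se=0.10, n=1500),
]

print(largest_study(studies).label)   # C: same SE as B, larger sample
print(earliest_study(studies).label)  # A: same year as B, but highly cited
```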
When the forest plot included only graphic information, we attempted to contact the authors or used tools such as WebPlotDigitizer (https://automeris.io/WebPlotDigitizer/) to reconstruct the data from the plots.
Outcomes
All outcomes were assessed separately for observational and experimental designs.
The primary outcome was the degree of agreement between (a) the effect size of the HC study and the summary effect size of the matching meta-analysis and (b) the effect size of the HC study and the effect size in the largest study in the matching meta-analysis. For this purpose, we calculated the ratios of odds ratios (RORs), as detailed in the Data Analysis section.
This outcome is reported both nominally, as the percentage of topics in which the 95% CI of ROR included 1, and statistically, as the meta-analytical aggregate across topics, separately for experimental and observational studies.
Secondary outcomes were the percentages of HC studies with effect sizes that differed by 2-fold (ROR ≥ 2 or ≤ 0.5) or 4-fold (ROR ≥ 4 or ≤ 0.25) from the effect size in the matching meta-analysis and respective largest study in the meta-analysis.
Data analysis
Analyses were performed in Microsoft Excel for Mac (Version 16.43) and STATA/SE for Mac (Version 16.1; programs admetan and metaeff). Scatterplots were constructed in the R software environment (R Core Team, 2020) using RStudio (Version 1.2.5033; RStudio Team, 2019) and the lessR package (Version 3.9.8; Gerbing, 2020).
For each identified meta-analysis, we extracted the effect size and standard error or 95% CI reported for the HC study, the summary effect size, and the effect size of the largest study in the meta-analysis. The preferred meta-analytic estimate was the OR. Effect sizes were extracted as reported in the meta-analysis without retrieving the primary studies. When meta-analyses reported estimates other than the OR, we employed standard procedures for converting estimates into ORs. SMDs, including Hedges’s g, were transformed to natural-log ORs using the Chinn transformation (Chinn, 2000). Correlation coefficients were first converted into SMDs (Polanin & Snilstveit, 2016) and then into ORs. For RRs that could not be converted into ORs without estimates of baseline risk, which were often not reported, we first checked whether study-level event data (e.g., a 2 × 2 table) were reported. If so, we extracted them and reran the meta-analysis with effect sizes expressed as ORs using the authors’ specified meta-analytic model. If neither baseline risk nor event data were provided, remaining RRs were treated as ORs in the main analyses and excluded in sensitivity analyses. Likewise, HRs were treated as ORs. For continuous outcomes expressed as mean differences and standard errors or CIs, we also reran the meta-analysis to produce SMDs.
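The conversions named above can be illustrated with a short sketch. It assumes the usual form of the Chinn transformation, ln(OR) = SMD × π/√3 (about 1.81), and the common formula d = 2r/√(1 − r²) for turning a correlation into an SMD. The latter is an assumption about which formula was applied, and the function names are illustrative.

```python
import math

CHINN_FACTOR = math.pi / math.sqrt(3)  # approximately 1.81


def smd_to_log_or(smd):
    """Chinn (2000): ln(OR) = SMD * pi / sqrt(3)."""
    return smd * CHINN_FACTOR


def se_smd_to_se_log_or(se_smd):
    """The standard error is rescaled by the same constant."""
    return se_smd * CHINN_FACTOR


def r_to_smd(r):
    """Assumed conversion of a correlation to an SMD: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / math.sqrt(1 - r ** 2)


def r_to_or(r):
    """Correlation -> SMD -> OR, in the two steps described in the text."""
    return math.exp(smd_to_log_or(r_to_smd(r)))


# Invented inputs, for illustration only.
print(round(math.exp(smd_to_log_or(0.5)), 2))  # an SMD of 0.5 as an OR
print(round(r_to_or(0.3), 2))                  # a correlation of .30 as an OR
```

Converting everything to the log-OR scale lets a single metric, the ROR, be used across topics, at the cost of assuming that these approximate conversions hold for each outcome.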
For meta-analyses that reported data by subgroups, we took the pooled estimate (i.e., across subgroups) if available and the estimate in the largest subgroup including the HC study if the pooled estimate across all subgroups was not reported. For forest plots that included separate effect sizes from the HC or largest study (e.g., different subgroups or outcomes), we first pooled these distinct estimates under a fixed-effects model and used that estimate for further analyses, whereas the summary estimate remained the one reported in the meta-analysis.
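When one study contributes several estimates to a forest plot, fixed-effects pooling amounts to an inverse-variance weighted average. The sketch below shows that calculation on the log scale with invented numbers. It illustrates the general technique and does not reproduce the software routine used for the analyses.

```python
import math


def pool_fixed_effect(log_effects, standard_errors):
    """Inverse-variance (fixed-effects) pooling of estimates on the log scale."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Two invented subgroup estimates (log ORs) from the same study.
pooled, pooled_se = pool_fixed_effect([0.4, 0.6], [0.2, 0.2])
print(round(pooled, 3), round(pooled_se, 3))
```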
To assess the magnitude of the differences for each pair (HC vs. summary estimate; HC vs. largest study), we computed the RORs using the Altman-Bland approach (Altman & Bland, 2003). In brief, RORs were obtained by dividing the OR of the HC article by the (a) summary effect size of the meta-analysis and (b) the effect size in the largest study.
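As an illustration of this calculation, the sketch below computes an ROR and its 95% CI on the log scale, recovering each standard error from the reported CI. Following Altman and Bland (2003), the two estimates are treated as independent, and the variance of the difference is the sum of the two variances. The input values are invented, and the function names are illustrative.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% CI


def se_from_ci(lower, upper):
    """Standard error of ln(OR) recovered from a reported 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


def ratio_of_odds_ratios(or_hc, se_log_hc, or_ref, se_log_ref):
    """ROR = OR_HC / OR_ref with a 95% CI, treating the estimates as independent."""
    log_ror = math.log(or_hc) - math.log(or_ref)
    se = math.sqrt(se_log_hc ** 2 + se_log_ref ** 2)
    return (
        math.exp(log_ror),
        math.exp(log_ror - Z_95 * se),
        math.exp(log_ror + Z_95 * se),
    )


# Invented example: HC study OR = 3.0 [1.5, 6.0]; meta-analysis OR = 1.5 [1.2, 1.875].
ror, lower, upper = ratio_of_odds_ratios(
    3.0, se_from_ci(1.5, 6.0), 1.5, se_from_ci(1.2, 1.875)
)
print(round(ror, 2), round(lower, 2), round(upper, 2))
```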
To ensure coherence across studies, effect sizes were coined (i.e., the sign was inverted) when necessary. For experiments, coining was performed so that an ROR greater than 1 implied that the intervention or experimental manipulation had more favorable results than control. For observational studies, exposures were coined to represent values over 1 for the HC study so that an ROR greater than 1 meant that the effect size in the HC study was larger than the one in the meta-analysis or larger study. For each comparison of the HC study and meta-analysis and HC study and largest study, we noted whether RORs were statistically significant (i.e., the 95% CI did not include 1) and whether estimates from the HC study differed by at least 2-fold (ROR ≥ 2 or ≤ 0.5), at least 4-fold (ROR ≥ 4 or ≤ 0.25), or more.
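The coining and classification rules above can be summarized in a few lines. The sketch below uses illustrative function names and invented RORs. Coining an OR corresponds to inverting the sign on the log scale, which also swaps and inverts the CI limits.

```python
def coin(odds_ratio, ci_lower, ci_upper):
    """Reverse the direction of an OR (sign inversion on the log scale)."""
    return 1 / odds_ratio, 1 / ci_upper, 1 / ci_lower


def classify_ror(ror, ci_lower, ci_upper):
    """Flag statistical significance and 2-fold / 4-fold differences."""
    return {
        "significant": ci_lower > 1 or ci_upper < 1,   # 95% CI excludes 1
        "at_least_2_fold": ror >= 2 or ror <= 0.5,
        "at_least_4_fold": ror >= 4 or ror <= 0.25,
    }


def percent_ci_including_one(rors):
    """Percentage of topics whose ROR 95% CI includes 1."""
    included = sum(
        1 for ror, lo, hi in rors if not classify_ror(ror, lo, hi)["significant"]
    )
    return 100 * included / len(rors)


# Invented (ROR, lower, upper) triples, for illustration only.
topics = [(2.5, 1.2, 5.2), (1.1, 0.8, 1.5), (0.2, 0.1, 0.4), (1.3, 0.9, 1.9)]
print(classify_ror(*topics[0]))
print(percent_ci_including_one(topics))
```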
We also conducted meta-analyses of RORs separately for experimental and observational designs. Although in the protocol we planned both fixed- and random-effects models for meta-analyses of RORs, given the substantial clinical heterogeneity, we reported only a random-effects model. We used a random-effects model with the Paule and Mandel estimator (Paule & Mandel, 1989), recommended for dichotomous outcomes in the presence of high heterogeneity (Veroniki et al., 2016). Although not specified in the protocol, for comparisons with the largest study, cases in which the largest study coincided with the HC study were excluded. Heterogeneity was assessed with the between-topics variance τ², I², and its 95% CI estimated using the Q-profile method (Viechtbauer, 2007). Because some clinical psychologists may be accustomed to SMD rather than OR metrics for expressing effects, we also transformed summary RORs from the main analysis into differences of SMDs (dSMDs) by applying the conversion formula described by Chinn (2000) using the natural logarithm (ln) of the ROR. For the ROR of the HC study versus the summary estimate of the meta-analysis (MA in the equation), we have:
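$$\ln(\mathrm{ROR}_{\mathrm{HC\,MA}}) = \ln(\mathrm{OR}_{\mathrm{HC}}) - \ln(\mathrm{OR}_{\mathrm{MA}}) = 1.81 \times \mathrm{SMD}_{\mathrm{HC}} - 1.81 \times \mathrm{SMD}_{\mathrm{MA}} = 1.81 \times \mathrm{dSMD}$$
Therefore, $\mathrm{dSMD} = \ln(\mathrm{ROR}_{\mathrm{HC\,MA}}) \div 1.81$.
The standard errors can be computed by the same formula, applied to the standard error of the log ROR: $\mathrm{SE}(\mathrm{dSMD}) = \mathrm{SE}(\ln \mathrm{ROR}_{\mathrm{HC\,MA}}) \div 1.81$. We also reported estimates as dSMDs for the topics in which the SMD was the effect measure used in the selected meta-analysis.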
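To make the pooling and the dSMD conversion concrete, the sketch below implements a random-effects meta-analysis of log RORs in which the between-topics variance is found by the Paule-Mandel criterion (the generalized Q statistic is set equal to its expected value, k − 1). It is a minimal illustration with invented inputs. It uses simple bisection, omits the Q-profile CI, and is not the Stata routine used for the reported analyses.

```python
import math

CHINN_FACTOR = 1.81  # pi / sqrt(3), rounded as in the text
Z_95 = 1.959964


def _weighted_mean(y, v, tau2):
    w = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return mu, w


def _generalized_q(y, v, tau2):
    mu, w = _weighted_mean(y, v, tau2)
    return sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))


def paule_mandel_tau2(y, v, tol=1e-10):
    """Solve Q(tau2) = k - 1 for tau2 >= 0 by bisection (Q decreases in tau2)."""
    k = len(y)
    if _generalized_q(y, v, 0.0) <= k - 1:
        return 0.0
    lo, hi = 0.0, 1.0
    while _generalized_q(y, v, hi) > k - 1:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if _generalized_q(y, v, mid) > k - 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def pool_log_rors(log_rors, variances):
    """Random-effects summary ROR, its 95% CI, tau2, and the equivalent dSMD."""
    tau2 = paule_mandel_tau2(log_rors, variances)
    mu, w = _weighted_mean(log_rors, variances, tau2)
    se = math.sqrt(1.0 / sum(w))
    return {
        "ror": math.exp(mu),
        "ci": (math.exp(mu - Z_95 * se), math.exp(mu + Z_95 * se)),
        "tau2": tau2,
        "dsmd": mu / CHINN_FACTOR,
        "se_dsmd": se / CHINN_FACTOR,
    }


# Invented log RORs and variances for three topics.
result = pool_log_rors([-1.0, 0.0, 1.0], [0.01, 0.01, 0.01])
print(round(result["tau2"], 3), round(result["ror"], 3))

# Conversion of a summary ROR of 1.42 into a dSMD, as in the formula above.
print(round(math.log(1.42) / CHINN_FACTOR, 2))
```

With equal variances, the invented example has a closed-form solution (τ² = 0.99), which makes the iteration easy to check by hand.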
Sensitivity analyses were performed by repeating the main analyses (a) excluding studies for which HRs and RRs were considered to be ORs, (b) limited to the topics in which the HC study was the earliest published on the topic, (c) limited to the topics in which the largest study was published later than the HC study, and (d) restricted to HC studies in which the abstract mentioned the outcome and exposure/intervention extracted from the matching meta-analysis. This last analysis was not preregistered and was added post hoc to verify whether the finding selected for evaluation was considered central in the HC study. Finally, because we observed extremely large heterogeneity for observational designs, we added a series of nonpreregistered exploratory sensitivity analyses for this cohort: (e) limited to topics in which both the exposure and outcome are clinical manifestations (e.g., depression, anxiety, insomnia), demographic variables (e.g., gender), or major life events (e.g., adverse events, childhood abuse); (f) limited to topics in which either exposure or outcome are nonclinical or surrogate measurements (e.g., genes, neuroimaging, cognitive tasks); and (g) limited to topics in which the matching meta-analysis had lower or higher variance compared with the median of the entire sample. For this last analysis, we calculated the median standard error of the log OR for the entire cohort of meta-analyses and used the median to dichotomize the sample into topics in which the standard error of the log OR in the matching meta-analysis was below and above the median.
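The median split used in analysis (g) can be sketched as follows. Topic labels and standard errors are invented, and the treatment of a topic falling exactly on the median (left unassigned here) is an assumption, because the text does not specify it.

```python
from statistics import median


def split_by_median_se(se_by_topic):
    """Dichotomize topics by the SE of the summary log OR of the matching meta-analysis."""
    cutoff = median(se_by_topic.values())
    lower = {topic for topic, se in se_by_topic.items() if se < cutoff}
    higher = {topic for topic, se in se_by_topic.items() if se > cutoff}
    return cutoff, lower, higher


# Invented standard errors for four topics.
cutoff, lower, higher = split_by_median_se(
    {"topic_1": 0.1, "topic_2": 0.2, "topic_3": 0.3, "topic_4": 0.4}
)
print(cutoff, sorted(lower), sorted(higher))
```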
Results
Selection of target HC articles and matching meta-analyses
The search produced 1,686,834 records, of which 1,183 had at least 1,000 citations. Of these, 187 studies were selected (114 observational and 73 experimental; for the Preferred Reporting Items for Systematic Reviews and Meta-Analyses [PRISMA] flow diagram, see Fig. 1). Twenty-seven studies (14%) had more than 2,000 citations, and as per protocol and owing to Scopus limitations on the download of citing records, we screened only the 2,000 most recent citations until identifying a matching meta-analysis. This procedure failed to identify a matching meta-analysis for 19 of these 27.

Fig. 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the study-selection process.
We contacted authors of three meta-analyses in which forest plots did not contain effect-size data or were incomplete (i.e., presented only a subgroup) and retrieved data for two meta-analyses. The remaining meta-analysis (McKinnon et al., 2009) was excluded. Therefore, we identified matching meta-analyses with study-level effect-size data for 41 of 114 (36%) observational studies (37 unique meta-analyses, four of which contained more than one HC study) and 25 of 73 (34%) experimental studies (22 unique meta-analyses, three of which contained two different HC studies).
Characteristics of the sample
The 41 observational studies were published between 1972 and 2013, and citation counts ranged from 1,001 to 5,497 (Mdn = 1,357, interquartile range [IQR] = 1,087–1,769). The 25 experimental studies spanned 1989 to 2006, and citation counts ranged from 1,126 to 2,374 (Mdn = 1,426, IQR = 1,290–1,723). Matching meta-analyses were published between 1998 and 2019 for observational studies and between 2014 and 2019 for experimental studies. Twenty-eight of 37 meta-analyses for observational studies (76%) and 21 of 22 (95%) for experimental studies were published after 2015 (see Tables S1 and S2 in the Supplemental Material).
For observational studies, 17 meta-analyses used ORs, 14 used SMDs (five used Hedges’s g), seven used RRs, one used HR, one used mean difference, and one used standardized coefficients from linear regression. In this last case, we did not have enough information to convert or recalculate the regression coefficient, and the meta-analysis (Martinez-Calderon et al., 2019) was excluded, which left a total of 40 meta-analyses for quantitative synthesis.
There were a few special cases. In the case of one meta-analysis involving individual patient data (Culverhouse et al., 2018), the HC study was eligible for inclusion but did not provide primary data. As per protocol, we used the estimates from the meta-analysis for the summary and the largest study estimates and recalculated the effect for the HC study from the primary report, using coining so it would represent the same contrast as the meta-analysis. For another HC study (Regier et al., 1990), the meta-analysis (Lai et al., 2015) included data from a pooled analysis (Swendsen et al., 1998) combining the cohort reported in the HC study with other cohorts. The HC study did not report sufficient data for effect-size calculation, but the meta-analysis included separate estimates for the HC study cohort (Epidemiologic Catchment Area), which we extracted. Another meta-analysis (Reising et al., 2019) included a larger study that contained the HC study (Odgers et al., 2008). The HC study did not report sufficient data for effect-size calculation. We substituted its estimate with the one from the overlapping larger study (Odgers et al., 2008) reported in the meta-analysis. Because the original HC study reported only on males and the overlapping study included separate estimates for males and females, we used only the former. We conducted additional sensitivity analyses excluding all these special cases.
For experimental studies, 13 meta-analyses reported SMDs (six reported Hedges’s g), four reported ORs, four reported RRs, three reported Pearson correlation coefficients (r), and one reported HR. For one HC study describing the Enhancing Recovery in Coronary Heart Disease Patients randomized trial (Writing Committee for the ENRICHD Investigators, 2003), the corresponding meta-analysis (Richards et al., 2017) combined data from all trial publications. As per protocol, we used summary and largest study estimates from the meta-analysis and recalculated the HC study effects using the primary report.
Primary outcomes and meta-analysis
Observational studies
Effect estimates were recalculated (n = 2) or converted (n = 15) for 16 meta-analyses. For six meta-analyses, RRs and HRs were assimilated to ORs in computing RORs. Effect estimates were coined for eight meta-analyses. For 27 of 40 (67.5%) HC studies, we rated the abstract as describing the finding extracted from the matching meta-analysis. Twenty-five of 40 (62.5%) HC studies were the earliest or conducted within 3 years of the earliest study in the meta-analysis (see Tables 1 and 2 and Fig. 2).
Table 1. Meta-Analytic Estimates and Sensitivity Analyses of Ratio of Odds Ratios for Observational Designs
Note: ROR = ratio of odds ratios; CI = confidence interval; HC = highly cited; HR = hazard ratio; RR = risk ratio; OR = odds ratio.
a Excluded HC studies: Caspi (2003), Regier (1990), and Moffitt (2002). b The comparison excludes cases in which RRs and HRs could not be converted and were considered equivalent to ORs.
Table 2. Meta-Analytic Primary Analyses Estimates and Sensitivity Analyses Expressed as Differences in Standardized Mean Differences
Note: dSMD = difference in standardized mean differences; CI = confidence interval; HC = highly cited; SMD = standardized mean difference.

Fig. 2. Scatterplots showing the relation between odds ratios in the highly cited studies and (a) the corresponding summary estimates in the meta-analyses and (b) the corresponding largest-study estimates in observational designs. The diagonal lines show where the points would fall if the effects were equal. Not shown are two very large outliers (odds ratios of 883 and 49 in highly cited studies). For (b), five topics in which the highly cited study was the largest study are also not shown. LS = largest study.
In 27 of 40 HC studies (67.5%), estimates were nominally larger (i.e., ROR > 1) than the summary effect in the corresponding meta-analysis (Fig. 2). In 12 of 40 HC studies (30%), effects were statistically significantly different from the summary estimate, and in 10 of these cases, RORs were greater than 1. The difference was at least 2-fold for 15 (37.5%) pairs and at least 4-fold for six (15%) pairs. The summary ROR (see Fig. S1 in the Supplemental Material) across all topics was 1.42 (95% CI = [1.09, 1.87]) with extremely high heterogeneity (τ² = 0.55, I² = 98%, 95% CI = [95%, 99%]). This was equivalent to a dSMD of 0.19 (95% CI = [0.04, 0.34]; Table 2). RORs were somewhat larger for topics in which the HC study was the earliest study (n = 21; ROR = 1.77, 95% CI = [1.07, 2.94], τ² = 1.07) and those in which the HC study abstract mentioned the exposure and outcome used in the meta-analysis (n = 27; ROR = 1.60, 95% CI = [1.08, 2.36], τ² = 0.77), but heterogeneity remained very large. For topics in which the matched meta-analysis reported effects as SMDs (n = 15), estimates were higher (dSMD = 0.45, 95% CI = [0.04, 0.86]) and had extremely high heterogeneity (τ² = 1.66).
Heterogeneity was considerably reduced in exploratory sensitivity analyses of topics in which both exposure and outcome were clinical manifestations, demographic variables, or major life events (n = 25; ROR = 1.16, 95% CI = [0.94, 1.42], τ² = 0.17). Heterogeneity was also contained in analyses (n = 20) circumscribed to the meta-analyses with lower variance (i.e., under the median variance of the entire cohort of meta-analyses; ROR = 1.18, 95% CI = [0.92, 1.52], τ² = 0.26).
Five HC studies (12.5%) were also the largest, which left 35 for further ROR analyses. In 29 of 35 (83%) cases, HC study estimates were greater than those in the largest study in the meta-analysis (Fig. 2). In 17 cases, RORs comparing estimates were statistically significantly different from 1 (49%); in 13 of these cases, ROR was greater than 1. RORs were at least 2-fold for 17 of 35 cases (49%) and at least 4-fold in 10 of 35 (29%) cases. The summary ROR (see Fig. S2 in the Supplemental Material) was 1.99 (95% CI = [1.33, 2.99]) and had extremely high heterogeneity (τ² = 1.17, I² = 98%, 95% CI = [97%, 99%]). This corresponded to a dSMD of 0.38 (95% CI = [0.15, 0.61]; Table 2). The summary ROR was similar in sensitivity analyses restricted to topics in which the HC study predated the larger study (n = 33; ROR = 1.99, 95% CI = [1.29, 3.07]) and had similarly high heterogeneity (τ² = 1.25). In exploratory sensitivity analyses (Table 1) on topics in which both exposure and outcome were clinical, summary RORs were reduced, and heterogeneity was more contained (ROR = 1.49, 95% CI = [0.97, 2.3], τ² = 0.81). Heterogeneity was substantially reduced in analyses limited to meta-analyses with lower variance (ROR = 1.71, 95% CI = [1.22, 2.41], τ² = 0.40).
Experimental studies
Effect estimates were recalculated (n = 4) or converted (n = 16) for 20 meta-analyses, and for one meta-analysis, the HR was considered equivalent to the OR. Effect estimates were coined for eight meta-analyses. Fifteen of 25 (60%) HC studies were the earliest or conducted within 3 years of the earliest study in the meta-analysis. The abstract of 21 of 25 (84%) HC studies described the intervention and outcome used in the meta-analysis (see Tables 2 and 3 and Fig. 3).
Table 3. Meta-Analytic Estimates and Sensitivity Analyses of Ratio of Odds Ratios for Experimental Designs
Note: ROR = ratio of odds ratios; CI = confidence interval; HC = highly cited; HR = hazard ratio; RR = risk ratio; OR = odds ratio.
a Excluded HC studies: Writing Committee for the ENRICHD Investigators (2003). b The comparison excludes cases in which RRs and HRs could not be converted and were considered equivalent to ORs.

Fig. 3. Scatterplots showing the relation between odds ratios in the highly cited studies and (a) the corresponding summary estimates in the meta-analyses and (b) the corresponding largest-study estimates in experimental designs. The diagonal lines show where the points would fall if the effects were equal. For (b), seven topics in which the highly cited study was the largest study are not shown. LS = largest study.
For 17 of 25 (68%) HC studies, estimates were nominally larger than summary estimates of matching meta-analyses (Fig. 3). The ROR of the HC study compared with the summary estimate was statistically significantly different from 1 in six of 25 cases (24%), three of which had RORs greater than 1. The estimates from the HC study differed by at least 2-fold (i.e., ROR ≥ 2 or ≤ 0.5) in five cases (20%) and by at least 4-fold in one case (4%). The summary ROR (see Fig. S3 in the Supplemental Material) was 1.25 (95% CI = [0.97, 1.61]) and had substantial heterogeneity (τ² = 0.25, I² = 73%, 95% CI = [53%, 87%]). The ROR corresponded to a dSMD of 0.14 (95% CI = [0.007, 0.27]). For topics in which the matched meta-analyses reported effects as SMDs, summary estimates were higher (dSMD = 0.24, 95% CI = [0.05, 0.44]). Sensitivity analyses limited to topics in which the HC study was the earliest study (n = 10) resulted in a similar summary ROR of 1.33 (95% CI = [1.08, 1.64]) and had no between-topics heterogeneity (τ² = 0). Analyses of topics in which the HC study abstract mentioned the intervention and outcome extracted from the matching meta-analysis (n = 21) led to a similar summary ROR of 1.30 (95% CI = [0.97, 1.73], τ² = 0.33).
Seven HC studies (28%) were also the largest in the matching meta-analysis, which left 18 studies for ROR analyses. Estimates from the HC study were nominally higher than those from the largest study for all 18 studies (Fig. 3), and for six of 18 (33%) studies, RORs were statistically significantly different from 1. RORs were at least 2-fold in seven of 18 (39%) cases and at least 4-fold in one of 18 (6%) cases. The summary ROR (see Fig. S4 in the Supplemental Material) of the HC study compared with the largest study was 1.81 (95% CI = [1.39, 2.36]) and had moderate heterogeneity (τ² = 0.09, I² = 29%, 95% CI = [0%, 68%]). One HC study postdated the largest study. Analyses restricted to the cases (n = 17) in which the HC study predated the largest study yielded a larger summary ROR of 1.85 (95% CI = [1.39, 2.46], τ² = 0.10).
Discussion
Reports that collect extreme numbers of citations can be very influential in shaping the scientific literature and often also inform crucial decisions about which research to conduct, publish, or finance. Hence, the validity of their claims is paramount. Although there is no perfect means to evaluate validity, placing the results of HC studies against those of meta-analyses and of the largest studies on the same topic can offer valuable comparative insights. In a large, field-wide survey of emotion research, we showed that HC studies report larger effects compared with meta-analyses and larger studies on the same topic. For observational designs, HC studies produced effects about 1.4-fold higher on average than those from meta-analyses and almost 2-fold higher than those from the largest studies. For experimental designs, the average difference was around 1.3-fold for summary estimates and 2-fold in comparisons with the largest study. Translated into dSMDs, an estimate more habitually used by clinical psychologists, HC observational studies produced effects higher by 0.19 compared with summary meta-analytic estimates and by 0.38 compared with the largest study. The differences were similar for experimental designs (0.14 compared with the summary estimate and 0.39 compared with the largest study).
These average differences need to be viewed with great caution because there was extremely prominent heterogeneity across topics. Heterogeneity was extremely high for observational designs and more moderate for experimental studies. Heterogeneity was considerably reduced (a) in exploratory sensitivity analyses that were limited to topics in which both the exposure and outcome were clinical or demographic (or involved major life events) and (b) when considering the meta-analyses with the lower variance (i.e., below the median variance of the entire cohort of matching meta-analyses). In these analyses, differences between estimates were reduced to around 1.2-fold compared with meta-analyses and to 1.5- to 1.7-fold compared with the largest studies and were, in most cases, nonsignificant (the CI around RORs included 1, albeit narrowly). Therefore, although HC studies may be expected to report larger effects on average, it is not possible to predict in advance for which topics this will be most pronounced and for which topics HC studies may not have larger effects at all. It is impossible to “correct” the effect estimates of an HC study by using some standard inflation factor.
We also examined the timing of publication of the HC reports compared with the other studies and with the largest studies on the same topic. HC studies are sometimes the first ones on the topic, and thus they would be the earliest published among the studies included in a meta-analysis. This pattern occurred in almost half (31 of 65, 48%) of the topics that we examined. However, in approximately 40% of cases, HC studies were published later or even substantially later (i.e., > 3 years after). Their high citation profile may reflect early publication (“being the first”), some citation bias favoring extreme results, or a combination thereof. Relatedly, the HC study predated the largest study in about two thirds of the pairs for observational designs and in all but one for experimental ones. Sensitivity analyses limited to topics in which the HC studies were the first ones mirrored the main analyses. If anything, the summary ROR estimates became slightly larger when only these topics were considered, a pattern compatible with some influence of “being first.” However, the available data are too limited to exclude that this observation may reflect chance.
Finally, our approach of selecting a recent meta-analysis that used emotion-related estimates from the HC studies could have failed to capture the main outcomes of these studies that led to their high citation impact. To account for this possibility, we added a post hoc sensitivity analysis restricted to instances in which the selected meta-analytic comparison included the outcome and exposure/intervention also mentioned in the abstract of the HC study. For most observational (65%) and experimental (85%) HC studies, this was indeed the case, and this sensitivity analysis resulted in very similar results to the main analysis. Of course, the approach cannot fully guarantee we examined the principal finding of the HC study, and it is often impossible to single out only one particular finding from a complex study. However, given that abstracts describe what are considered by the authors to be the most noteworthy results, this approach could represent a useful proxy to identifying the principal findings. We did not assess the quality of the HC studies because this would have posed significant challenges given the diversity of topics, designs, and scientific standards at the time of publication. Although study size is not a surrogate for quality, larger studies are more precise in estimating effects. In general, it was uncommon for HC studies to be also the largest ones.
Some of the HC studies had extremely large effects that also differed tremendously from the respective meta-analyses and largest studies. In the most conspicuous case (RORs of 287 and 297, respectively), Hariri et al. (2002) examined neuroimaging differences in amygdala activation in carriers of the short serotonin-transporter-linked promoter region (5-HTTLPR) allele (one or two copies) compared with those of the long 5-HTTLPR allele. The article collected 1,769 citations, and its standardized effect size in the matching meta-analysis (Munafò et al., 2008) was an incredible SMD of 3.74 (95% CI = [2.51, 4.97]). In contrast, the summary effect in the meta-analysis was considerably smaller (SMD = 0.62, 95% CI = [0.42, 0.82]), similar to that of the largest study (Hariri et al., 2005; SMD = 0.6, 95% CI = [0.14, 1.06]). The true effect may actually be entirely null. The reason is that this HC study, as well as the other studies in the meta-analysis, depends on a candidate-gene approach, a design that has since been shown to be notoriously unreliable (Ioannidis et al., 2011), even more so in neuropsychiatric genetics (Duncan et al., 2019). Moreover, neuroimaging studies are a classic example of a literature replete with small, underpowered studies with high analytical flexibility and often spurious results (Botvinik-Nezer et al., 2020; David et al., 2013; Szucs & Ioannidis, 2020). The proposed association with amygdala activation (Hariri et al., 2002, 2005) would suggest a role of this genetic polymorphism in depression. However, a very large, rigorous genome-wide association study found absolutely no effect for this polymorphism (Border et al., 2019).
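The RORs of 287 and 297 can be reproduced from the SMDs quoted above if one assumes the standard logistic conversion ln(OR) = SMD × π/√3. This section does not state that conversion explicitly, so the following Python sketch is an illustration under that assumption, with illustrative function names:

```python
import math

def smd_to_log_or(smd: float) -> float:
    """Assumed conversion: ln(OR) = SMD * pi / sqrt(3)."""
    return smd * math.pi / math.sqrt(3)

def ror_from_smds(smd_hc: float, smd_comparator: float) -> float:
    """Ratio of odds ratios: HC study relative to a comparator estimate."""
    return math.exp(smd_to_log_or(smd_hc) - smd_to_log_or(smd_comparator))

# SMDs as quoted in the text.
smd_hc = 3.74        # Hariri et al. (2002), the highly cited study
smd_meta = 0.62      # summary effect in Munafo et al. (2008)
smd_largest = 0.60   # largest study, Hariri et al. (2005)

print(round(ror_from_smds(smd_hc, smd_meta)))     # HC vs. meta-analysis
print(round(ror_from_smds(smd_hc, smd_largest)))  # HC vs. largest study
```

With these inputs the sketch prints 287 and 297. An SMD gap of about 3.1 is multiplied by roughly 1.81 on the log-odds scale and then exponentiated, which is why a difference that looks moderate in SMD units becomes a ratio in the hundreds.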
RORs were also very large (22 and 29) for a study (Klin et al., 2002) examining differences in visual fixation patterns between autistic males and control subjects while viewing social situations (1,150 citations). In this case, the effect in the index study (SMD = −1.47, 95% CI = [−2.27, −0.66]) differed from those in the corresponding meta-analysis (SMD = 0.24, 95% CI = [0.1, 0.39]) and in the largest study in it (SMD = 0.39; 95% CI = [0.19, 0.58]) not just in magnitude but also in direction. There were no such large outliers in the analyses on experimental studies. Overall, for observational designs, topics in which the original meta-analyses used the SMD as the metric of choice tended to have greater differences in effect size between the HC study and the respective meta-analysis or largest study. Several of the large outliers identified in neuroimaging or genetics belong to this category.
In selecting meta-analyses that included the index study, we focused on the most recent one that contained effect-size data. Around 80% of the identified meta-analyses for observational studies and all but one for experimental studies were published after 2015. The recency of selected meta-analyses makes it more likely that they included a larger number of publications. In addition, the quality of reporting and analysis might also have improved with time (Page et al., 2016; Wen et al., 2008). However, we should caution that the “true” effects for the topic examined are unknown, and effects may genuinely differ across studies on the same topic rather than diverge because of bias. Moreover, meta-analyses and even single large studies may also be biased. Random-effects models for obtaining summary results are appropriate in situations in which there is substantial heterogeneity, as is often the case in emotion research, but random-effects estimates are also susceptible to biases such as small-study effects that might underlie publication bias (Sterne et al., 2011). On average, meta-analyses may be more biased than the largest studies. This would be entirely consistent with our observation that HC results seemed to be less inflated when the comparison was made against the summary effect of a meta-analysis than when it was made against the largest study.
Kvarven et al. (2020) employed a somewhat similar methodological approach to compare results from registered replications with meta-analyses testing the same hypotheses. The starting point was a set of multilaboratory registered replication studies in psychology, for which matching meta-analyses on the same hypothesis, as identified by the study authors, were searched. The authors retrieved meta-analyses with effect-size data for 15 of 62 replication studies selected and used a Z test to compare replication effects with summary meta-analysis estimates, either by a random-effects model or with bias adjustment. Results indicated an increase in summary meta-analysis estimates of almost 3-fold compared with replication studies even when using methods to adjust for publication bias, which suggests that better designed studies in which publication bias is avoided (as in the case of preregistered replications) may provide the most accurate effect estimates. If one were to extrapolate from their findings to ours, it is possible that HC studies provide highly inflated results, more inflated than what a comparison against meta-analyses would suggest. Even the comparison against the largest available study may not fully capture the inflation of results because the largest studies that we used were not preregistered. Therefore, they could also suffer from some selective reporting of analyses.
In a study that has direct relevance to the present work, Kvarven et al. (2020) also compared estimates from the original studies (defined as the study that was the object of the replication project) with those in the selected meta-analyses and reported a statistically nonsignificant mean difference of 0.10 for 14 pairs of original studies and meta-analyses. However, the pairs of replication studies and meta-analyses included mostly meta-analyses of small studies, and such meta-analyses may also be unreliable and biased. Likewise, we showed that for meta-analyses with reduced variance, and hence lower uncertainty around the summary effects, differences between estimates from HC studies and summary ones were reduced and no longer significant (RORs close to 1). Conversely, for meta-analyses with higher variance and highly uncertain estimates, differences with HC studies were augmented. In an analysis of 200 meta-analyses published in Psychological Bulletin, an eminent journal in psychological science, Stanley et al. (2018) found that only a tiny percentage (< 1%) of experimental studies are adequately powered, compared with about a third of observational studies. Meta-analyses that include only underpowered studies may not be a good “gold standard.”
Our findings need to be qualified by important limitations. We were able to identify a matching meta-analysis containing effect-size data for only a third of our sample of target articles. We considered meta-analyses as eligible if they included any emotion-related finding from the target article to avoid ranking findings in the original article in terms of importance. Previous research has dealt with this problem by choosing a finding for which effect-size data are reported in the abstract (Ioannidis & Panagiotou, 2011). However, we were concerned that most of the articles in social and behavioral sciences might simply present findings narratively, with absent or incomplete data, especially in abstracts. Moreover, there is evidence that abstracts are frequently inconsistent with full reports (Li et al., 2017). Nonetheless, ancillary analyses restricted to findings that were mentioned in the abstract supported our main findings. In addition, in the interest of consistency, when a matching meta-analysis included multiple forest plots, we chose the largest one, although it might not have used the most important finding from the HC study. We were able to screen a maximum of 2,000 citations for each target article because of the limitations of exporting data from Scopus. Because we were mostly interested in research on human participants, more general terms such as “fear” or “stress” were not used because they would have rendered the search overly nonspecific. Finally, we cannot exclude the possibility that in some cases in which effect size was larger in the HC study, the HC study may have been more “correct” than the respective meta-analysis and the largest study on the topic. For instance, the HC study might have had some particularly high-quality features and protection from bias that other studies did not, and biases might have eroded an otherwise genuinely large effect in the other studies. However, this does not seem to be the case in other fields in which HC studies have been assessed against other evidence.
Investigations of HC articles in the social and behavioral sciences have been limited and mostly restricted to surveying content and design (Price et al., 2011) or the availability and sharing of the data underlying their findings (Hardwicke & Ioannidis, 2018). We add to this metaresearch literature by demonstrating a pervasive systematic citation bias toward exaggerated effects across empirical studies in emotion research.
Supplemental Material
Supplemental material, sj-pdf-1-cpx-10.1177_21677026211049366 for Effect Sizes Reported in Highly Cited Emotion Research Compared With Larger Studies and Meta-Analyses Addressing the Same Questions by Ioana A. Cristea, Raluca Georgescu and John P. A. Ioannidis in Clinical Psychological Science
Footnotes
Acknowledgements
We thank Tom E. Hardwicke for contributions to revising the study protocol.
Transparency
Action Editor: Aidan G. C. Wright
Editor: Jennifer L. Tackett
Author Contributions
I. A. Cristea and J. P. A. Ioannidis were responsible for study concept and design. I. A. Cristea, R. Georgescu, and J. P. A. Ioannidis were responsible for acquisition, analysis, and interpretation of data. I. A. Cristea was responsible for statistical analysis. J. P. A. Ioannidis was responsible for study supervision. I. A. Cristea was responsible for drafting the manuscript. R. Georgescu and J. P. A. Ioannidis were responsible for critical revision of the manuscript for important intellectual content. All of the authors reviewed and approved the final manuscript for submission.
