Abstract
Researchers’ subjective judgments may affect the statistical results they obtain. This possibility is particularly stark in Bayesian hypothesis testing: To use this increasingly popular approach, researchers specify the effect size they are expecting (the “prior mean”), which is then incorporated into the final statistical results. Because the prior mean represents an expression of confidence that one is studying a large effect, we reasoned that scientists who are more confident in their research skills may be inclined to select larger prior means. Across two preregistered studies with more than 900 active researchers in psychology, we showed that more self-confident researchers selected larger prior means. We also found suggestive but somewhat inconsistent evidence that men may choose larger prior means than women, due in part to gender differences in researcher self-confidence. Our findings provide the first evidence that researchers’ personal characteristics might shape the statistical results they obtain with Bayesian hypothesis testing.
Scientists typically strive to conduct studies that provide objective reflections of reality (Slaney, 2001). In the wake of the replication crisis, however, we now recognize that scientists who are motivated to obtain a particular finding may employ “researcher degrees of freedom” to bring their statistical results in line with their expectations (Simmons et al., 2011). Specifically, such “p-hacking” can make nonexistent effects appear significant and inflate the size of effects reported in scientific articles (Simonsohn et al., 2014a, 2014b). If we assume that p-hacking was widespread in our field, then it follows that many published effect sizes are overestimates (e.g., Open Science Collaboration, 2015). As a result, researchers themselves may have been left with inflated expectations about the effect sizes they are likely to obtain. These expectations, in turn, can potentially shape the outcomes of their research. On one hand, if researchers expect overly large effect sizes, they may underpower their studies, leading them to obtain nonsignificant results, even if their hypotheses are correct. On the other, if researchers predict rather diminutive effect sizes, their proposed work may be less likely to get funded by grant panels (to the extent that smaller effects are perceived as less important or exciting). 1 While these consequences may be difficult to document, researchers’ expectations are both explicit and transparent in one increasingly popular statistical approach: Bayesian hypothesis testing. A central feature of Bayesian hypothesis testing is that expected effect sizes are incorporated into the analyses and affect the resulting statistical conclusions drawn from the study. In the present research, we focused on Bayesian hypothesis testing and examined whether researchers’ personal characteristics are linked to the effect sizes they predict.
Although researchers should rely on theory and on the results of relevant past research in predicting effect sizes, it is not always straightforward to determine which past effect sizes to incorporate or how to combine them. Indeed, a registered replication report (RRR) showed that studies testing the same conceptual hypothesis can produce vastly different effect sizes depending on the specific paradigm used (Aknin et al., 2020). Therefore, in making predictions about effect sizes, scientists may rely in part on their own subjective confidence in their research. Because no measure of researcher self-confidence exists, we created a scale asking scientists to rate their agreement with items such as, “I make good choices about which research ideas to pursue” and “When I test a new hypothesis of mine, I feel confident that it will turn out to be right.” Researcher self-confidence levels are almost certainly influenced by factors that are directly relevant to predicting effect sizes (including their past success in obtaining large effect sizes), but it is also possible that confidence is shaped by factors that should be irrelevant, including scientists’ gender or academic status.
A wealth of research suggests that confidence is related to gender, power, and level of expertise in the relevant domain (see Dunning, 2005 for review). Women are generally more risk-averse than men (Byrnes et al., 1999; Charness & Gneezy, 2012). Women also routinely express less confidence in their abilities and the products of their work than their male peers (Beyer & Bowden, 1997; Huang, 2013; Instone et al., 1983; Stankov & Lee, 2014), particularly in the domains of math and science (Ehrlinger et al., 2017; Ellis et al., 2016; Else-Quest et al., 2010; Micari et al., 2007). Furthermore, Ehrlinger and Dunning (2003) showed that women expressed less confidence in their scientific abilities than men and, in turn, were less confident in the quality of their answers on a science exam than their male peers (despite performing equally well on the exam). Examining titles and abstracts in PubMed, Lerchenmueller and colleagues (2019) found that female scientists were less likely to frame their research in a positive light by using words such as “novel,” “excellent,” and “robust” compared with their male counterparts. To the extent that women are less confident (i.e., humbler) than men, they may be more cautious about making bold predictions. If this theorizing is correct, then female researchers may predict lower effect sizes than men. Similarly, people with more power tend to be more confident (Min & Kim, 2013; See et al., 2011). As researchers climb the academic ladder, progressing from being graduate students to professors with tenure, they acquire higher status and greater power. As a result, people with higher academic ranks (e.g., professors vs. graduate students) may expect to obtain larger effect sizes, all else being equal.
Bayesian Hypothesis Testing
In Bayesian hypothesis testing, researchers must specify the effect size they are expecting, which is then incorporated into the final statistical results. Specifically, the researcher specifies a prior distribution, which is a mathematical representation of the researcher’s degree of belief in the likelihood of various population effect sizes. This prior distribution serves as the research hypothesis. Once the data are collected, the researcher calculates a statistic called the Bayes Factor. The larger the Bayes Factor, the stronger the evidence for the research hypothesis (as captured by the prior distribution) compared with the null hypothesis of no effect. 2
The most familiar prior distribution is a normal curve centered on the effect size the researcher considers to be most likely (see Figure 1). For example, a researcher who believes they are studying a large effect (say, Cohen’s d = .70) would express their confidence in the substantial magnitude of their effect by setting the prior mean to 0.70. Thus, setting the prior mean is equivalent to guessing an effect size for power analysis, and this step is relatively intuitive. A second step is also required, however: The researcher must select the prior standard deviation, reflecting the degree of uncertainty around the prior mean. For example, if the researcher believes that there is a 95% chance that the true population effect size falls between .50 and .90, they would set their prior standard deviation to .10 (see Figure 1). Choosing a large prior mean is an expression of confidence that one is studying a large effect, while choosing a narrow prior standard deviation is an expression of confidence that the most likely effect size (whether big or small) was guessed correctly. In the present research, we asked researchers to select both prior means and prior standard deviations, but to contain Type I error, we only preregistered hypotheses about prior means, which we felt would be more intuitive for researchers previously unfamiliar with Bayesian statistics to understand and apply.
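The step from a 95% interval to a prior standard deviation can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the example above, not the tool participants used; the function name `prior_from_interval` is hypothetical, and the sketch assumes a symmetric normal prior.

```python
from statistics import NormalDist


def prior_from_interval(lower, upper, coverage=0.95):
    """Return (mean, sd) of the normal prior whose central interval is [lower, upper].

    Hypothetical helper: assumes a symmetric, bell-shaped prior.
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.96 for 95% coverage
    mean = (lower + upper) / 2                    # midpoint of the interval
    sd = (upper - lower) / (2 * z)                # half-width divided by z
    return mean, sd


# The example from the text: a 95% chance that the effect lies between .50 and .90
mean, sd = prior_from_interval(0.50, 0.90)
print(round(mean, 2), round(sd, 2))
```

For the interval in the text, this gives a prior mean of .70 and a prior standard deviation of about .10.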

Illustration of the null
How does the choice of the prior distribution affect the conclusions that researchers draw from their data? First, researchers will obtain a larger Bayes Factor if their prior distribution matches the observed data; in other words, accuracy matters. Second, the size of the prior mean matters: on average, researchers will obtain bigger Bayes Factors if they choose a prior distribution with a larger prior mean (holding accuracy constant). Finally, a smaller prior standard deviation will result in larger Bayes Factors given that the prior mean matches the observed data. 3 In sum, an accurate prior distribution with a high mean and a narrow standard deviation will yield the largest Bayes Factor.
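These three patterns can be illustrated with a simplified calculation. The sketch below assumes that the observed effect size is approximately normally distributed around the true effect with a known standard error; under that assumption, the Bayes Factor for a normal prior against a point null has a closed form. The function name and the numerical values (including the standard error of 0.15) are hypothetical and chosen only for illustration; this is not necessarily the computation used in the tutorial.

```python
from statistics import NormalDist


def bayes_factor_10(d_obs, se, prior_mean, prior_sd):
    """BF10 for H1: delta ~ N(prior_mean, prior_sd^2) versus H0: delta = 0.

    Assumes d_obs ~ N(delta, se^2), a normal approximation to the
    sampling distribution of the observed effect size.
    """
    # Under H1, the prior and sampling uncertainty add in variance
    marginal_h1 = NormalDist(prior_mean, (se**2 + prior_sd**2) ** 0.5).pdf(d_obs)
    # Under H0, the observed effect is centered on zero
    likelihood_h0 = NormalDist(0.0, se).pdf(d_obs)
    return marginal_h1 / likelihood_h0


se = 0.15  # hypothetical standard error of the observed effect

# 1. Accuracy: a prior centered on the observed effect beats a miscentered one
accurate = bayes_factor_10(0.50, se, prior_mean=0.50, prior_sd=0.10)
inaccurate = bayes_factor_10(0.50, se, prior_mean=0.20, prior_sd=0.10)

# 2. Size: with accuracy held constant, a larger prior mean gives a larger BF
small_effect = bayes_factor_10(0.30, se, prior_mean=0.30, prior_sd=0.10)
large_effect = bayes_factor_10(0.60, se, prior_mean=0.60, prior_sd=0.10)

# 3. Precision: a narrower prior helps when the prior mean matches the data
narrow = bayes_factor_10(0.50, se, prior_mean=0.50, prior_sd=0.05)

print(accurate > inaccurate, large_effect > small_effect, narrow > accurate)
```

Under these assumptions, all three comparisons come out in the direction described above.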
The Present Research
In the present studies, we examined whether researchers’ personal characteristics are related to the prior means they select. We expected that more self-confident (i.e., less humble) researchers would set higher prior means. Given that gender and academic rank may be related to researcher self-confidence, we hypothesized that men (vs. women) and academics with higher (vs. lower) ranks would select higher means for their priors. One challenge of testing these hypotheses is that most psychologists have not been trained in the use of Bayesian statistics. Thus, we began by creating an online tutorial to provide active researchers with an introduction to Bayesian hypothesis testing. We conducted a pilot study that included a subset of our measures, enabling us to refine our tutorial and questionnaire; the materials and data are available on Open Science Framework (OSF). 4 We then conducted Study 1, recruiting more than 450 graduate students and faculty in psychology across North America. We conducted a partial replication in Study 2. Study materials, data, and analysis of both studies are available on the OSF project page. 5
Study 1
Method
Participants
As preregistered on the OSF, 6 we aimed to recruit graduate students and faculty members actively conducting research in psychology and other related fields (e.g., marketing) in universities located across the United States and Canada. We used email and relevant listservs to recruit participants. We wanted to collect as large a sample as possible, within the constraints imposed by our funding and by the limited size of the population of interest. Thus, we preregistered that we would obtain usable data from at least 200 graduate students and 200 faculty members. We were able to attract enough participants to exceed these targets, thereby ensuring that we would have sufficient usable data. We did not conduct any of our analyses until the study was closed.
The final sample consisted of 485 eligible participants (54.8% graduate students, 26.6% nontenured faculty, 18.6% tenured faculty; 58.6% female, 41.4% male; see Table 1 for a summary of sample demographics), who completed the study online in exchange for US$25 Amazon gift cards. As preregistered, we excluded 42 participants who scored 60% or lower on the comprehension quiz. In addition, we excluded 18 participants who had completed the earlier version of the study; this exclusion criterion was not in the preregistration because we failed to anticipate that some individuals would participate multiple times. Three participants who did not identify as either female or male were not included because our analysis plan entailed treating gender as a binary variable (given our limited population of interest, we did not expect to obtain a sufficient number of participants to treat gender in a more nuanced manner). Finally, we removed the scenario responses of three participants who selected prior means greater than Cohen’s d = 2, as preregistered.
Table 1. Summary of Sample Demographics (Study 1).
Note. Experience with statistics: 1 = equivalent to undergraduate courses only, 2 = equivalent to 1–2 graduate courses, 3 = equivalent to 3 or more graduate courses, 4 = even better (e.g., you have a degree in quantitative psychology, you have spent years teaching yourself advanced statistics for use in your research, or you teach grad statistics courses, etc.). Familiarity with Bayesian Statistics: 1 = “not at all (may have heard whispers in the hallway)”, 2 = “a bit (equivalent to a few lectures/preconference workshop/1–2 articles/or you have used it in your research with the help of a consultant)”, 3 = “a fair amount (equivalent to a course or so)”, 4 = “very (more than one course and/or you use it in your research)”, and 5 = “very very (You teach it! You write about it!).” Frequency distributions of these variables are available in Supplemental Material Tables S1 and S2.
Researcher confidence (RC) scale
To test whether the effects of gender and/or status on the prior distribution were mediated by researcher self-confidence, we created an RC scale. We created an initial pool of 19 items with high content and face validity for assessing confidence regarding the strength of one’s research ideas, skills, and findings. Pilot testing was conducted with 72 researchers (51 students, 14 faculty, four postdocs, and three who selected “other”; 52.8% female), who completed the scale at a social and personality psychology conference (SPSP). Based on the comments from the pilot participants and item-total correlations, we reworded two items and dropped three items from the scale. The final 16-item RC scale is provided in the Supplemental Material (Table S3). Respondents were asked to indicate how often they experienced each thought or behavior on a 5-point Likert-type scale from 0 (“almost never”) to 4 (“almost always”). Because the newly developed RC scale had gone through limited psychometric validation, we preregistered a plan to also use a trimmed version of the scale, as well as a single key item, as alternative mediator variables in our analyses. We created the trimmed version by fitting a one-factor model and removing items with loadings below .3. The single key item that we reasoned would most directly tap our construct of interest was “My research projects will yield effects that are at least medium in size.”
Procedure
Participants completed the study online. Upon providing consent, participants first completed a brief questionnaire to confirm their eligibility. Eligible participants then completed a demographic survey (including questions about their gender 7 and age) and then indicated the area of psychology they most identified with, their experience with statistics, their familiarity with Bayesian statistics, and the highest degree they had obtained in psychology (see Table 1). Next, participants went through a nontechnical tutorial on Bayes Factors, lasting approximately 30 min, that was developed by the research team. Two experts in Bayesian statistics reviewed the tutorial for accuracy prior to the commencement of the study. The tutorial covered the basic concepts of Bayesian inference and gave information and examples on how to set bell-shaped priors on effect sizes for the purpose of testing research hypotheses via Bayes Factors. Participants were free to move through the tutorial at their own pace and could review or skip through any concepts. Upon finishing the tutorial, participants were asked to complete a 10-item comprehension quiz to test their understanding of the key concepts covered in the tutorial; participants were unable to return to the tutorial upon starting the quiz. Participants who passed the quiz (with more than 60% correct answers) were then asked to select priors for four scenarios that involved hypothetical research studies and for two scenarios involving their own research. Finally, participants were asked to complete the 16-item RC scale.
Hypothetical research scenarios
Participants were asked to imagine that they were the lead researcher on an upcoming study and that they believed the hypothesis being tested to be true. In each scenario, some information about effect sizes was given (e.g., average effect sizes in this hypothetical research area, effects sizes from an earlier study). Participants then used an interactive graph to select the mean and/or standard deviation for the bell-shaped prior representing their expectations for each scenario, in standardized effect size units (Cohen’s d). In one scenario, the standard deviation of the prior was provided, and participants were asked to decide on a mean; in another scenario, the mean of the prior was provided, and participants had to select the standard deviation. In the remaining two scenarios, participants were asked to choose both the mean and the standard deviation. Our preregistered hypotheses concern the means and not the standard deviations (which were included for exploratory purposes); as such, only three hypothetical research scenarios are available to test our primary hypotheses.
Real-research scenarios
Next, participants were asked to construct priors relating to their own research. In the first real-research scenario, participants were asked to think of a published or unpublished research study they had previously conducted with just two conditions (either a between- or within-subjects design). Participants were told that it should be a study they really cared about—one which provided at least some support for their hypothesis. Participants were then asked to describe the conceptual independent variable (IV) and the dependent variable (DV) for this study, as well as to indicate what area of psychology the study fell into. Finally, they were asked to imagine conducting a conceptual replication of the earlier study and to set the mean and standard deviation for the prior, in standardized effect size units. In the second real-research scenario, participants were asked to think of a research study they would like to conduct, which would test a specific hypothesis that they believe in. Participants were told that it could be a study they were about to conduct, a study proposed in a recent grant application, or a study that they had always dreamed of conducting. Again, participants were told that it could have a within- or between-subjects design and that it should focus on comparing two conditions. Participants then reported the IV, DV, and area of study, and set the mean and standard deviation of their prior.
Results
RC scale analysis
Of the 485 participants, 481 completed the RC scale. Of these, 34 participants had some missing data on the scale. To investigate whether any items had loadings smaller than .3 on the general factor, a one-factor model was fit to the data using the full-information maximum likelihood (FIML) estimator in the R package lavaan (Rosseel, 2012; version 0.6-3). While the one-factor model had poor fit (comparative fit index [CFI] = .767, root mean square error of approximation [RMSEA] = .089, standardized root mean square residual [SRMR] = .066), only two items had standardized loadings smaller than .3. These loadings were .172 and .164 for items 10 and 11, respectively (see Supplemental Material Table S3 for wording). Thus, as preregistered, we computed a “trimmed” scale composite consisting of the 14 best performing items (Cronbach’s α = .83; CFI = .775, RMSEA = .101, SRMR = .069), as well as the full 16-item RC scale composite (α = .82). When computing composite scores, missing data were dealt with by computing the mean score from all available items for a given participant.
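The two scale-level computations described above (a composite based on all available items, and Cronbach’s α) can be sketched as follows. This is a minimal illustration, not the analysis code for the study: the function names and the toy responses are hypothetical, and α is computed here from complete cases only, which is an assumption because the handling of missing data for α is not specified.

```python
from statistics import mean, variance


def composite_score(responses):
    """Mean of the items a participant answered; None if every item is missing."""
    answered = [r for r in responses if r is not None]
    return mean(answered) if answered else None


def cronbach_alpha(rows):
    """Cronbach's alpha (rows = participants, columns = items).

    Assumption: only participants with no missing items are used.
    """
    complete = [row for row in rows if None not in row]
    k = len(complete[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*complete)]  # per-item variances
    total_var = variance([sum(row) for row in complete])   # variance of sum scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data on the 0-4 response scale (None = skipped item)
toy = [
    [4, 3, 4],
    [2, 2, 1],
    [3, 3, 3],
    [1, 0, 1],
    [3, None, 4],
]
print([composite_score(row) for row in toy])
print(round(cronbach_alpha(toy), 2))
```

The last toy participant still receives a composite score (the mean of the two answered items) but does not contribute to α in this sketch.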
Preregistered analyses
As preregistered, we examined the prior means specified by the participants for each of the five scenarios that asked the participants to specify prior means: three hypothetical scenarios and two real-research scenarios. Following our preregistered analysis plan, we first examined each scenario separately. We also preregistered that we would create a composite across all five scenarios if there was at least moderate reliability (α > .5), but the actual reliability (α = .45) failed to meet this threshold because the prior means of the three hypothetical scenarios were not sufficiently correlated with each other (rs = .14, .18, .34) and largely uncorrelated with the prior means for the real-research scenarios (rs = .02–.16, see Supplemental Material Table S4 for the correlation matrix). 8 We also preregistered that we would average across the two real-research scenarios if participants’ chosen prior means for these scenarios were at least moderately correlated (r > .30), and the actual correlation (r = .37) met this criterion, so we created a composite for the two real-research scenarios. 9 All regression and mediation analyses were conducted as path analyses in the R package lavaan 0.6-3, and remaining missing data was handled using FIML with robust standard errors (i.e., MLR estimation in lavaan).
Individual scenarios
We entered gender (female = 0, male = 1) and academic status (grad student = −1, untenured faculty = 0, tenured faculty = 1) into regressions predicting prior means for each of the five scenarios individually. As shown in Table 2, there was no significant effect of gender or academic status on any individual scenario. However, the strongest effects appear to be those of gender on the prior means set for the real-research scenarios. Although there were no significant effects of gender or status for the hypothetical scenarios, we repeated the mediation models for these scenarios as preregistered. Doing so yielded largely nonsignificant effects (see Supplemental Material Table S8).
Table 2. Standardized Effects of Gender and Academic Status on Prior Means for Each HS, RRS, and the Real Research Composite (Study 1).
Note. The reported values are z-tests because the regressions were fit as path analyses in lavaan. Given a large sample, this distinction does not have any practical consequences. The independent variables (gender and academic status) were essentially uncorrelated, point-biserial correlation rpb = .11,
Real-research composite
As preregistered, we entered gender (female = 0, male = 1) and academic status (grad student = −1, untenured faculty = 0, tenured faculty = 1) into regressions predicting the average prior mean for the two real-research scenarios. There was no significant effect of academic status (β = −.001, p = .984, b = –.00, CI95 = [−0.02, 0.02]), but there was a significant effect of gender (β = .10, p = .038, b = .04, CI95 = [0.00, 0.07]). Consistent with our hypothesis, men selected slightly larger prior means (M = 0.44, SD = 0.18) than did women (M = 0.40, SD = 0.15; illustration in Figure 2). We did not have any hypotheses regarding possible interactions between gender and status and thus we did not include an interaction term, consistent with our preregistered analysis plan.

Figure 2. The density distributions of prior means chosen by women and men in Study 1.
Mediator: Self-confidence
Next, we examined whether the effects of gender or status on prior means were mediated by researcher self-confidence. As gender and academic status were mostly independent,
Gender and self-confidence
Using the lavaan package (0.6-3) in R, we performed mediation analyses to test whether the relationship between gender and prior means was mediated by researchers’ self-confidence (see Figure 3 and the top section of Table 3). Consistent with our regression analyses, we found a weak but significant total effect of gender on prior means (total effect in Table 3 and Figure 3), β = .11, p = .029, b = .04, CI95 = [0.00, 0.07]. Examining the trimmed self-confidence scale, we found a significant effect of gender on self-confidence (path a in Table 3 and Figure 3), whereby men reported higher self-confidence than women, β = .12, p = .006, b = .11, CI95 = [0.03, 0.19]. Self-confidence, in turn, was strongly related to selecting larger prior means (path b), β = .17, p < .001, b = .06, CI95 = [0.03, 0.10]. Thus, there was a significant indirect effect of gender on prior means via confidence (path a*b), β = .02, p = .031, b = .01, CI95 = [0.00, 0.01]. After accounting for confidence, the direct path (c) from gender to prior means remained positive but did not reach statistical significance, β = .09, p = .069, b = .03, CI95 = [–0.00, 0.06]. This pattern of results is consistent with our hypothesis that men select larger prior means than women at least partially due to greater self-confidence. When we replaced the trimmed scale with the full scale, the results were substantively the same (see Supplemental Material Table S7). Using the single-item scale, the results were similar but weaker, with only the effect of self-confidence on prior means reaching significance (see Supplemental Material Table S7); this pattern is as expected given that the reliability of a single item is much lower than the reliability of a longer scale.

Figure 3. Path diagram of gender on the Real Research Composite prior mean mediated by self-confidence in Study 1.
Table 3. Mediation Models of Gender Effects (βs) on the Prior Means With Confidence (Trimmed Scale) as a Mediator for Each RRS and the Real Research Composite.
Note. Table cells give the standardized regression coefficients and the corresponding p-value in parentheses. p-values smaller than .05 are bolded. RRS = real-research scenario.
Academic status and self-confidence
The overall relationship between academic status and prior means was not significant (the total effect in Table 4 and Figure 4), β = .01, p = .859, b = .00, CI95 = [–0.02, 0.02]. We next examined whether there was an indirect effect of status on prior means through self-confidence. Using the trimmed self-confidence scale, we found that there was a strong effect of status, whereby individuals with higher academic status reported greater self-confidence, β = .30, p < .001, b = .18, CI95 = [0.12, 0.23]. Self-confidence was related to selecting larger prior means, β = .20, p < .001, b = .07, CI95 = [0.03, 0.11]. Thus, there was a significant indirect effect of status on prior means via self-confidence, β = .06, p = .001, b = .01, CI95 = [0.00, 0.02]. Given this substantial indirect effect through the hypothesized pathway, one might expect a significant total effect of status on prior means; this overall effect may have failed to emerge in part because there was a small, unreliable, negative direct effect of status on prior means, β = –.05, p = .329, b = –.01, CI95 = [–0.03, 0.01], after taking the indirect effect of confidence into account. Taken together, this suggests that higher academic status is linked to greater self-confidence, which in turn promotes the selection of larger prior means, but that confidence is not the only variable that matters in shaping the relationship between academic status and the selection of prior means. We observed substantively identical effects using the full scale, but there was no effect of status on the single-item scale (see Supplemental Material Table S8).
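The way the reported paths combine can be checked with the standard product-of-coefficients decomposition for a single-mediator model, in which the indirect effect is a × b and the total effect is the direct effect plus the indirect effect. The sketch below applies this arithmetic to the standardized coefficients reported for the gender and status models; the function name `decompose` is hypothetical, and the sketch covers point estimates only, not the standard errors or p-values, which were estimated in lavaan.

```python
def decompose(a, b, direct):
    """Product-of-coefficients decomposition for a single-mediator model.

    a      : predictor -> mediator (here, self-confidence)
    b      : mediator -> outcome (here, prior mean)
    direct : predictor -> outcome, controlling for the mediator
    """
    indirect = a * b
    return {"indirect": indirect, "total": direct + indirect}


# Standardized coefficients as reported in the text (trimmed RC scale)
gender = decompose(a=0.12, b=0.17, direct=0.09)
status = decompose(a=0.30, b=0.20, direct=-0.05)

for label, effects in (("gender", gender), ("status", status)):
    print(label, {name: round(value, 2) for name, value in effects.items()})
```

To two decimals, this reproduces the reported indirect effects (.02 for gender, .06 for status) and total effects (.11 and .01). For status, the positive indirect path and the negative direct path nearly cancel, which is why the total effect is close to zero.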
Table 4. Mediation Models of Status Effects (βs) on the Prior Means With Confidence (Trimmed Scale) as a Mediator for Each RRS and the Real Research Composite (Study 1).
Note. Table cells give the standardized regression coefficients and the corresponding p-value in parentheses. p-values smaller than .05 are bolded. RRS = real-research scenario.

Figure 4. Path diagram of academic status on the Real Research Composite prior mean mediated by self-confidence in Study 1.
Exploratory analyses
In addition to our preregistered analyses, we conducted a series of exploratory analyses. Here, we briefly describe the results of selected exploratory analyses that are likely to be of interest. See Supplemental Material for all performed analyses.
Hypothetical scenarios
Although there were no effects of gender or status on prior means for the hypothetical scenarios, we repeated for these scenarios, on an exploratory basis and for completeness, the mediation analyses conducted for the real-research scenarios. These results are presented in Supplemental Material Tables S5 and S6.
Subgroup analysis
To examine the effect of gender within each academic group, we repeated the gender mediation analysis separately for graduate students, untenured faculty, and tenured faculty. The total effect of gender on prior means was positive for all three groups, but appeared stronger for graduate students (β = .14, p = .051, b = .05, CI95 = [–0.00, 0.10]) and tenured faculty (β = .17, p = .145, b = .06, CI95 = [–0.02, 0.14]) than untenured faculty (β = .01, p = .890, b = .00, CI95 = [–0.05, 0.05]).
To examine whether our effects would hold within a more homogeneous sample of psychologists, we also repeated our primary analyses looking only within our largest subgroup: social/personality psychologists (N = 203). Consistent with our findings for the whole sample, gender significantly predicted prior means (β = .21, p = .004, b = .05, CI95 = [0.02, 0.09]) and status did not (β = .09, p = .312, b = .01, CI95 = [–0.01, 0.04]). Similarly, researcher self-confidence was linked to choosing larger prior means (β = .14, p = .054, b = .04, CI95 = [–0.00, 0.08]). However, gender did not have a detectable effect on RC (β = –.02, p = .796, b = –.02, CI95 = [–0.14, 0.10]), and there was no mediation effect overall (β = –.003, p = .799, b = –.00, CI95 = [–0.01, 0.00]).
Missing data handling
We did not anticipate considerable amounts of missing data in participants’ responses to the research scenarios and did not preregister how missing data would be handled when creating the composite score for the two real-research scenarios. In the main analyses, we computed the composite score only for those participants who provided answers to both real-research scenarios, treating the composite score for the rest of the participants as missing data. Missingness was then handled using FIML in all analyses. However, an alternative is to use the average of the two scenarios when both are available, and to use the single available response as the “composite” score when only one is available. This approach retains data from more participants, potentially at the expense of quality of measurement. When we repeated our main analyses using this alternative approach, men appeared to choose higher prior means than women (β = .08, p = .070, b = .03, CI95 = [–0.00, 0.06]), an effect that was weak, but consistent overall with our primary analyses; once again, academic status did not significantly predict prior means (β = .03, p = .487, b = .01, CI95 = [–0.01, 0.03]). Coefficient estimates and p-values for mediation analyses were similar under the two treatments of missing data (see Supplemental Material Table S10).
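The difference between the two composite rules can be made concrete with a short sketch. The function names and the example values are hypothetical; `None` stands for a real-research scenario that a participant left blank.

```python
def strict_composite(past, planned):
    """Main-analysis rule: average only when both prior means were provided."""
    if past is None or planned is None:
        return None  # treated as missing and later handled by FIML
    return (past + planned) / 2


def lenient_composite(past, planned):
    """Alternative rule: average when both exist, else use the one available."""
    available = [value for value in (past, planned) if value is not None]
    return sum(available) / len(available) if available else None


# Hypothetical participant who answered only the first real-research scenario
print(strict_composite(0.40, None))   # None: composite is missing
print(lenient_composite(0.40, None))  # 0.4: single response stands in
```

The two rules agree whenever both scenarios were answered; they differ only for participants with exactly one response, where the lenient rule trades a less reliable score for a larger usable sample.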
Prior standard deviations
We only preregistered hypotheses regarding prior means, but prior standard deviations can be viewed as another way in which researchers express confidence (in the accuracy of their chosen effect size). When we repeated our primary analysis with this alternative dependent variable—entering gender and status into a regression predicting the average prior standard deviation set by participants in the two real-research scenarios—we found no clear effects of either gender (β = .07, p = .152, b = .01, CI95 = [–0.00, 0.02]) or status (β = –.02, p = .618, b = –.00, CI95 = [–0.01, 0.01]). However, researchers who reported higher confidence selected narrower prior distributions (path b: β = –.13, p = .046, b = –.02, CI95 = [–0.03, –0.00]).
Discussion
Study 1 provides initial evidence that male scientists may select higher prior means than female scientists due to gender differences in researcher self-confidence. The effect emerged only when participants thought about their own research, not about hypothetical scenarios. Individuals with higher academic status also reported greater researcher self-confidence (which in turn was associated with setting higher prior means), but the overall relationship between status and prior means was not reliable.
Although we preregistered the study, it is worth noting that we preregistered multiple analyses with multiple IVs and DVs. In particular, the total effect of gender on prior means for the Real Research Composite was near the border of statistical significance (p = .038), and the total effect of status was not statistically significant. Also, many participants left one or the other of the two real-research scenarios blank, and we did not preregister a strategy for dealing with missing data in computing the composite score.
In light of these limitations, we conducted a replication in Study 2, focusing more narrowly on the effect of gender on the real-research scenarios. Specifically, because participants’ responses to each hypothetical scenario in Study 1 lacked meaningful variability (see Footnote 8), we eliminated these scenarios in Study 2. We preregistered the same analysis plan as in Study 1, but we predicted only an effect of gender (not status) and specified that we would handle missing data for the real-research composite by only computing the composite score for those participants who answered both real-research scenarios (i.e., same as in the main analyses in Study 1). Because we were less interested in status, we broadened our inclusion criteria to include other active researchers within academia (e.g., post-docs) and we did not cap the number of graduate students who could participate, leaving us with a predominantly student sample.
Study 2
Method
Participants
As preregistered, we recruited active academic researchers in psychology or related fields to participate in Study 2. 10 To maximize sample size, we included post-doctoral scholars as eligible participants, and did not impose a goal for the proportion of faculty members in our sample. We preregistered a target sample size of 550, which would allow us to detect a gender difference of Cohen’s d = .25 at above 80% power. We also preregistered a plan to analyze Study 2 as an independent sample as long as a minimum sample size of 400 was met, which would correspond to approximately 70% power at d = .25.
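The power figures above can be roughly checked with a normal approximation. The sketch below assumes a two-sided test at α = .05 and two equal-sized groups; neither assumption is stated in this section, and the function name is hypothetical, so the result is an approximation rather than the preregistered calculation.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_groups(d: float, n_total: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison (normal approximation).

    Assumes the total sample is split evenly between the two groups.
    """
    standard_normal = NormalDist()
    n_per_group = n_total / 2
    z_crit = standard_normal.inv_cdf(1 - alpha / 2)
    # Expected z statistic when the true standardized difference is d.
    noncentrality = d * sqrt(n_per_group / 2)
    upper_tail = 1 - standard_normal.cdf(z_crit - noncentrality)
    lower_tail = standard_normal.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail


print(approx_power_two_groups(0.25, 550))  # about 0.83
print(approx_power_two_groups(0.25, 400))  # about 0.71
```

With unequal group sizes, power at the same total N would be somewhat lower than these values.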
Four hundred and fifty-nine eligible participants completed the study. Five participants did not identify as either male or female and were thus not included in our analyses, as in Study 1. Consistent with our preregistration, we excluded an additional 43 participants who scored 60% or lower on our comprehension quiz and four participants who selected prior means greater than 2. 11 As shown in Table 5, our sample was 61% female and 39% male. The sample consisted almost entirely of graduate students (88%), with only 24 nontenured faculty and 33 tenured faculty. Approximately half of the participants came from the social and personality area.
Table 5. Summary of Sample Demographics (Study 2).
Note. Experience with statistics: 1 = equivalent to undergraduate courses only, 2 = equivalent to 1–2 graduate courses, 3 = equivalent to 3 or more graduate courses, and 4 = even better (e.g., you have a degree in quantitative psychology; you have spent years teaching yourself advanced statistics for use in your research; you teach grad statistics courses; etc.). Familiarity with Bayesian Statistics: 1 = “not at all (may have heard whispers in the hallway),” 2 = “a bit (equivalent to a few lectures/preconference workshop/1–2 articles/or you have used it in your research with the help of a consultant),” 3 = “a fair amount (equivalent to a course or so),” 4 = “very (more than one course and/or you use it in your research),” and 5 = “very (You teach it! You write about it!).” Frequency distributions of these variables are available in Supplemental Material Tables S11 and S12.
Procedure
The procedure of Study 2 was largely identical to that of Study 1, with two exceptions. First, we only presented the two real-research scenarios (and no hypothetical scenarios) to the participants. In addition, if participants responded that they were unable to think of a study for the second real-research scenario, they were presented with a prompt urging them to take some extra time to do so. This prompt was added in an attempt to reduce the amount of missing data in responses to the real-research scenarios.
RC scale
Based on the results of Study 1, we preregistered using the trimmed 14-item scale to measure researcher self-confidence (although we included all 16 original items in the survey to facilitate ongoing scale validation work). Of the 459 participants who met our inclusion criteria, 29 participants had some missing data on the RC scale. One participant skipped all 14 items. As before, a one-factor model was fit to all available data (N = 458) using lavaan 0.6-3 (Rosseel, 2012). Although the model again had a mediocre fit,
Results
Preregistered analyses
Predicting prior means from gender and academic status
As preregistered, we computed the composite prior mean for each participant by taking the average of their prior means on both scenarios using the same approach as Study 1. 12 Academic status was coded as in Study 1; post-doctoral students were grouped with graduate students. We entered gender and academic status into a regression predicting prior means. Men selected slightly higher prior means than women (for men, M = 0.45, SD = 0.22; for women, M = 0.43, SD = 0.20; illustrated in Figure 5), although this effect was not significant (β = .06, p = .191, b = .03, CI95 = [–0.01, 0.07]), and there was no significant effect of academic status (β = .01, p = .829, b = .00, CI95 = [–0.02, 0.03]).

Figure 5. The density distributions of prior means chosen by women and men in Study 2.
RC as a mediator
As preregistered, we examined whether the relationship between gender and prior means was mediated by RC. As shown in Figure 6 and the Study 2 section of Table 3, men reported trivially greater confidence than women (path a: β = .03, p = .541, b = .03, CI95 = [–0.06, 0.11]). More confident participants chose marginally higher prior means (path b: β = .10, p = .071, b = .05, CI95 = [–0.00, 0.10]). The combined indirect path was not significant (β = .00, p = .575, b = .00, CI95 = [–0.00, 0.01]), failing to provide evidence of mediation.

Figure 6. Path diagram of gender on the Real Research Composite prior mean mediated by self-confidence in Study 2.
Exploratory analyses
Subgroup analysis: Social/personality psychologists
To reduce the heterogeneity of our overall sample, which included researchers across six areas of psychology, we repeated our primary regression analysis for social/personality psychologists, who comprised our largest sub-sample (N = 229). In this subgroup, men selected higher prior means than women (β = .18, p = .016, b = .06, CI95 = [0.01, 0.10]), and again there was no effect of status (β = .05, p = .413, b = .01, CI95 = [–0.02, 0.05]). However, the effect of gender on prior means was not mediated by RC (see Supplemental Material Table S9).
Prior standard deviations
As in Study 1, we treated the average prior standard deviation as an alternative dependent variable, entering gender and status into a regression predicting this variable. Again, we found no strong effects of gender (β = .08, p = .115, b = .01, CI95 = [–0.00, 0.02]) or status (β = –.02, p = .589, b = –.00, CI95 = [–0.01, 0.01]). Consistent with Study 1, higher self-confidence was linked to choosing slightly narrower priors, although the effect did not reach significance (β = –.09, p = .071, b = –.01, CI95 = [–0.03, 0.001]).
Combined Results Across Studies 1 and 2 (Exploratory)
Although we preregistered a target sample size of 550 participants for Study 2, we fell short of this goal, primarily due to budget constraints. We preregistered a plan to analyze the data for Study 2 separately as long as we obtained a minimum sample size of at least 400, which we met (N = 459). Because Study 2 was somewhat underpowered, we also present our key analyses combining across Studies 1 and 2 below (N = 944), although these analyses should be treated as exploratory.
Predicting Prior Means From Gender and Academic Status
As before, we entered gender and status into a regression predicting the composite prior means. Men selected significantly higher prior means (M = 0.44, SD = 0.20) than women (M = 0.41, SD = 0.18), β = .08, p = .022, b = .03, CI95 = [0.00, 0.06]; illustrated in Figure 7. There was no effect of status (β = –.02, p = .591, b = –.00, CI95 = [–0.02, 0.01]).

Figure 7. The density distributions of prior means chosen by women and men in the combined sample (Study 1 and 2).
RC as a Mediator
Repeating our mediation analysis (see Figure 8), we found that men reported higher levels of confidence than women (β = .08, p = .014, b = .08, CI95 = [0.02, 0.14]), and more confident researchers chose higher prior means (β = .12, p = .002, b = .05, CI95 = [0.02, 0.08]). The indirect effect of gender on prior means via confidence was marginally significant (β = .01, p = .061, b = .00, CI95 = [–0.00, 0.01]). After taking this indirect effect into account, there was still a significant direct effect of gender on prior means (β = .07, p = .037, b = .03, CI95 = [0.00, 0.05]), suggesting that gender differences in RC could only partially explain why men chose higher prior means than women.

Figure 8. Path diagram of gender on the Real Research Composite prior mean mediated by self-confidence in the combined sample (Study 1 and 2).
Previous Experience in Statistics
Our most puzzling finding was that status had no effect on prior means, even though status was strongly related to increased confidence. One possibility is that while participants with higher academic status would have selected higher prior means due to their high confidence, they also had more statistics training, which might be linked to more realistic expectations of effect sizes. If this were the case, the competing effects of confidence and statistical experience would cancel each other out, resulting in an overall null effect of status on prior means. To explore this idea, we ran a full mediation model predicting prior means from status, using two correlated mediators: confidence and experience in statistics (see Figure 9). Applying the same model specifications as our other mediation analyses, we found that status predicted confidence (path a: β = .26, p < .001, b = .17, CI95 = [0.13, 0.21]), and confidence in turn predicted prior means (path b: β = .14, p < .001, b = .06, CI95 = [0.02, 0.09]). Based on this combined pathway alone (path a*b: β = .04, p = .002, b = .01, CI95 = [0.00, 0.02]), we would infer that participants with higher academic status chose higher prior means due to their higher researcher self-confidence. However, participants with higher academic status also reported more statistical experience (path c: β = .22, p < .001, b = .22, CI95 = [0.15, 0.28]), and those with more statistical experience chose marginally lower prior means (path d: β = –.07, p = .055, b = –.02, CI95 = [–0.04, 0.00]) when the other pathways were controlled for (path c*d: β = –.02, p = .068, b = –.00, CI95 = [–0.01, 0.00]). These exploratory findings tentatively suggest that researchers with higher status tend to be more self-confident, which is linked to selecting larger prior means, but they also tend to have more statistical expertise, which is linked to selecting smaller prior means, and these two effects appear to counteract one another.
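As a rough arithmetic check, each indirect effect is the product of its two constituent paths. The sketch below recomputes the standardized indirect effects from the rounded coefficients reported above; because the inputs are rounded, the products only approximate the model-based estimates, and the function name is hypothetical.

```python
def indirect_effect(path_to_mediator: float, path_from_mediator: float) -> float:
    """Product-of-coefficients estimate of an indirect effect."""
    return path_to_mediator * path_from_mediator


# Standardized coefficients as reported in the text (rounded to two decimals).
via_confidence = indirect_effect(0.26, 0.14)    # status -> confidence -> prior mean
via_statistics = indirect_effect(0.22, -0.07)   # status -> stats experience -> prior mean

print(round(via_confidence, 2))   # 0.04, matching the reported path a*b
print(round(via_statistics, 2))   # -0.02, matching the reported path c*d
print(via_confidence + via_statistics)  # small net indirect effect, opposite-signed paths partly offset
```

The two pathways have opposite signs, which is the offsetting pattern described in the paragraph above.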

Figure 9. Path diagram of status on the Real Research Composite prior mean mediated by researcher confidence and previous statistical experience in the combined sample (Study 1 and 2).
General Discussion
In the present research, we conducted the first empirical investigation of how psychological scientists engage in Bayesian reasoning. Across studies, our most robust finding was that more self-confident researchers selected higher prior means. We also observed mixed evidence that men may choose larger prior means than women; this effect was significant in Study 1, but not Study 2. Combining across both studies with more than 900 active researchers in psychology, we found that men selected higher prior means than women when asked to think about their own research studies; this effect (approximately d = .16) was smaller than the modal effect size across social psychology (d = .36; Richard et al., 2003). Across studies, we found suggestive evidence that the effect of gender on prior means was mediated by researcher self-confidence; however, our mediation analyses did not reach significance in Study 2, and thus should be interpreted with considerable caution. In addition, we saw that those with higher academic status reported greater self-confidence—but not surprisingly, people with higher status also had more statistical experience. While higher self-confidence was linked to choosing larger prior means, statistical experience was linked to choosing smaller prior means, which might help to explain why we did not observe an overall effect of status on prior means. Taken together, our findings suggest that more self-confident researchers are inclined to select larger prior means, and demographic variables such as gender and status may shape researcher self-confidence—but demographic variables may also influence prior means in other ways.
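The approximate d = .16 can be recovered from the combined-sample means and standard deviations reported earlier (men: M = 0.44, SD = 0.20; women: M = 0.41, SD = 0.18). The sketch below assumes an equal-weight pooled standard deviation, because group sizes for the combined sample are not given in this section; the function name is hypothetical, and the inputs are rounded, so this is an approximation rather than the authors' exact computation.

```python
from math import sqrt


def cohens_d_equal_weight(mean_1: float, sd_1: float, mean_2: float, sd_2: float) -> float:
    """Cohen's d using the root mean square of the two SDs (equal group weights assumed)."""
    pooled_sd = sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


d_combined = cohens_d_equal_weight(0.44, 0.20, 0.41, 0.18)
print(round(d_combined, 2))  # 0.16
```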
Although we only preregistered hypotheses regarding prior means, prior standard deviations may provide an alternative expression of confidence. While we did not find clear effects of gender or status on prior standard deviations in either study, we found that more self-confident researchers selected significantly narrower standard deviations in Study 1. This effect was very similar in size (though only marginally significant) in Study 2. Thus, the current findings point to the possibility that researcher self-confidence may be linked to the choice of both components of the prior distribution, though we would speculate that these effects may be easier to detect in samples with higher levels of statistical expertise. Indeed, a critical difference between Study 1 and Study 2 was that participants in Study 2 (vs. Study 1) reported a lower level of statistical expertise and included a much lower proportion of faculty (12% vs. 45%). This difference might help to explain why we observed clearer effects—almost across the board—in Study 1 relative to Study 2.
Limitations and Future Research
A central limitation of the present work is that gender, confidence, and status were measured rather than manipulated. In particular, although we theorized that gender differences in confidence should lead male (vs. female) researchers to set higher prior means, this pattern could also have emerged if men simply choose to study effects that are genuinely larger than the effects women choose to study. Indeed, it is possible that obtaining large effects increases researcher self-confidence, helping to account for the gender difference in self-confidence we observed. This problem is compounded by the fact that we drew participants from across diverse areas of psychology; it is possible that men might be overrepresented in areas of psychology in which larger effects are typically observed. However, we found a consistent effect of gender on prior means within our largest subgroup (social/personality psychology), suggesting that our observed effects cannot be easily explained by different gender representations across different areas of psychology. Of course, actual effect sizes are likely to differ within sub-areas of psychology, and men might simply be more drawn to studying larger effect sizes. To circumvent this seemingly intractable issue, it would be interesting to examine registered reports in psychology; if our theoretical perspective is correct, then all-male author teams may predict larger effect sizes than all-female author teams. Because the authors’ predicted effect sizes could be compared with the actual effect sizes that emerged, providing a benchmark for accuracy, this approach would make it possible to assess whether men are overconfident on average.
Constraints on generalizability
Because our sample consisted entirely of psychologists in North America, special caution is warranted in generalizing these findings to other cultural contexts. Indeed, most previous work on gender differences in confidence regarding math and science has been conducted in North America. Moreover, only a handful of participants with a gender identity other than male or female completed the study, precluding any meaningful conclusions about the relationship between confidence and setting Bayesian prior means for people across the gender continuum (see Connell, 2012). Future researchers should strive to utilize well-powered samples of people with diverse gender identities and heed recommendations to operationalize gender using multidimensional, continuous measures (see Ansara & Hegarty, 2014; Hyde et al., 2019; Lindqvist et al., 2020). It is also important to note that researchers voluntarily completed our study, and individuals with some interest in Bayesian statistics may have been particularly likely to participate. From our theoretical perspective, the observed gender differences should also emerge among researchers in other areas of science, but it would be important to test this extension empirically.
Our RC scale could potentially be used to test the generalizability of gender differences in confidence among researchers in diverse fields of science and regions of the world. Although the scale will require further development and validation for broader use, the present research provides initial evidence for construct validity, in that our scale captures expected group differences in researcher self-confidence (e.g., individuals with higher academic status report greater self-confidence 13). As part of the present research, we were able to collect pilot data (N = 72) on the initial version of the scale, which led to improvements in item wording, but this initial sample size was insufficient for full psychometric validation. While the developed scale had good internal consistency and all 14 items had reasonably high standardized loadings in both studies, the one-factor model had poor fit to the data. We are currently conducting additional scale validation work to establish whether the scale encompasses more than one factor, to facilitate future research using this scale.
Implications
Keeping in mind the important limitations highlighted above, our studies point to the conclusion that more self-confident researchers may select larger prior means and narrower prior standard deviations than their humbler colleagues. As a result, more self-confident researchers are likely to obtain larger Bayes Factors, all else being equal. Although we focused on gender and status as predictors of self-confidence, we would theorize that any variable that increases confidence—such as being at a prestigious university—might also have downstream consequences for the priors that researchers select. Specifically, these extraneous variables may increase researchers’ priors without altering the effect sizes they are studying, producing overconfidence.
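To illustrate how the prior feeds into the Bayes Factor, the sketch below assumes a normal prior on the effect size under the alternative, a point null at zero, and a normal likelihood for the observed effect. These modeling choices, the function name, and all numeric inputs are illustrative assumptions, not values or procedures taken from the studies. The ordering shown depends on the data: here the observed effect is held fixed at a value at least as large as the prior means being compared.

```python
from math import sqrt
from statistics import NormalDist


def bayes_factor_normal_prior(observed_effect: float, standard_error: float,
                              prior_mean: float, prior_sd: float) -> float:
    """BF10 for H1: effect ~ Normal(prior_mean, prior_sd) versus H0: effect = 0.

    With a normal likelihood, the marginal distribution of the observed effect
    under H1 is Normal(prior_mean, sqrt(prior_sd**2 + standard_error**2)).
    """
    marginal_sd_h1 = sqrt(prior_sd ** 2 + standard_error ** 2)
    density_h1 = NormalDist(prior_mean, marginal_sd_h1).pdf(observed_effect)
    density_h0 = NormalDist(0.0, standard_error).pdf(observed_effect)
    return density_h1 / density_h0


# Hypothetical data: observed d = 0.5 with standard error 0.15.
bf_smaller_mean = bayes_factor_normal_prior(0.5, 0.15, prior_mean=0.3, prior_sd=0.2)
bf_larger_mean = bayes_factor_normal_prior(0.5, 0.15, prior_mean=0.5, prior_sd=0.2)
bf_narrower_sd = bayes_factor_normal_prior(0.5, 0.15, prior_mean=0.5, prior_sd=0.1)

print(bf_smaller_mean < bf_larger_mean < bf_narrower_sd)  # True for these inputs
```

With the same data, the researcher who set the larger prior mean, or the narrower prior around it, would report stronger evidence for the alternative in this example.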
Although researchers routinely make many subjective decisions even when using traditional statistical methods, a unique feature of Bayesian hypothesis testing is that these subjective judgments are explicitly incorporated into the computations and directly impact the size of the resulting statistic (i.e., the Bayes Factor). This feature of Bayesian analysis makes it relatively easy to study the impact of subjective judgments on statistical conclusions. That said, the present research carries implications for statistical inference more broadly, whether Bayesian or frequentist. For example, if men anticipate higher effect sizes than women, men may be more likely to make bold predictions in grant applications (which often require effect size estimates) or to underpower their studies. In fact, the implications for users of frequentist statistics may be more insidious because researchers’ subjective judgments are less transparent, but may still shape research decisions.
Supplemental Material
Supplemental material for Can Researchers’ Personal Characteristics Shape Their Statistical Inferences? by Elizabeth W. Dunn, Lihan Chen, Jason D. E. Proulx, Joyce Ehrlinger and Victoria Savalei in Personality and Social Psychology Bulletin:
Proulx_Online_Appendix_1
Proulx_Online_Appendix_2
SOM_Personal_Characteristics_and_Statistical_Inferences_PSPB_2020.doc
Footnotes
Acknowledgements
We would like to acknowledge the work of Winnie Tse for her efforts in data collection on Study 2.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was supported by the Social Sciences and Humanities Research Council of Canada (SSHRC) Insight Development Grant and a SSHRC Explore Grant to Victoria Savalei. The authors declare no other conflicts of interest with respect to the authorship or the publication of this article.
Supplemental Material
Supplemental material is available online with this article.