Abstract
P values and confidence intervals (CIs) are the most widely used
statistical indices in scientific literature. Several surveys have revealed that
these two indices are generally misunderstood. However, existing surveys on this
subject fall under psychology and biomedical research, and data from other
disciplines are rare. Moreover, the confidence of researchers when constructing
judgments remains unclear. To fill this research gap, we surveyed 1,479
researchers and students from different fields in China. Results reveal that for
significant (i.e., p < .05, CI does not include zero) and
non-significant (i.e., p > .05, CI includes zero)
conditions, most respondents, regardless of academic degrees, research fields
and stages of career, could not interpret p values and CIs
accurately. Moreover, the majority were confident about their (inaccurate)
judgements (see
Statistical inference has played a crucial role in scientific research since the latter half of the 20th century by bridging data and hypothesis testing (Gigerenzer, Swijtink, Porter, & Daston, 1990). Currently, the most common statistical index in scientific literature is the p value, despite repeated criticism of its thoughtless use (Benjamin et al., 2018; Cumming, 2013; Cumming et al., 2007; McCloskey & Ziliak, 2008). In the last 20 years, items (e.g., figures and tables) displayed in the top three multidisciplinary journals (Nature, Science, and PNAS) progressively relied on p values (Cristea & Ioannidis, 2018).
However, the widely used p value is also generally misunderstood. Several surveys in psychology show that most researchers and students misinterpret p values (Badenes-Ribera, Frias-Navarro, Iotti, Bonilla-Campos, & Longobardi, 2016; Badenes-Ribera, Frías-Navarro, Monterde-i-Bort, & Pascual-Soler, 2015; Haller & Krauss, 2002; Lyu, Peng, & Hu, 2018; Oakes, 1986). This misinterpretation may result in the misuse and abuse of p values, such as the cult of statistical significance (McCloskey & Ziliak, 2008) and p-hacking (Head, Holman, Lanfear, Kahn, & Jennions, 2015; Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016), which might be the main reason behind the replication crisis in psychology (Hu et al., 2016; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011).
An alternative to p values is effect sizes and their confidence intervals (CIs). In particular, CIs convey the variability of the effect size estimate and help researchers draw better statistical inferences (Coulson, Healey, Fidler, & Cumming, 2010). However, CIs are also difficult to understand. For example, Hoekstra, Morey, Rouder, and Wagenmakers (2014) surveyed researchers’ understanding of CIs with an approach similar to that of the surveys on the p value and found that most researchers misunderstood CIs. This phenomenon is confirmed by surveys from multiple countries (Greenland et al., 2016; Lyu et al., 2018; Morey, Hoekstra, Rouder, & Wagenmakers, 2016).
Even with the availability of multiple surveys, several questions remain unanswered. First, almost all available data come from researchers in psychology or biomedical science. Only a few studies have surveyed researchers in other disciplines. Given that p values and CIs are used in other fields as frequently as in psychology (Colquhoun, 2014; Vidgen & Yasseri, 2016), how well researchers and students in those fields understand these statistical indices is an open question. Second, the majority of previous surveys did not assess how confident the respondents were in their own judgments. Third, most previous surveys focused only on statistically significant results, although non-significant results are equally important and often miscomprehended (Aczel et al., 2018). To address these issues, we conducted a survey to investigate the following aspects of the misinterpretation of p values and CIs: (1) whether the misinterpretation prevails across different fields of science; (2) whether researchers interpret significant and non-significant results differently; and (3) whether researchers are aware of their own misinterpretations, that is, how confident they are when they endorse a statement about p values or CIs.
In this survey, we adopted four questions each for p values and CIs from previous studies (Gigerenzer, 2004; Haller & Krauss, 2002; Hoekstra et al., 2014). These questions have been used in Germany (Haller & Krauss, 2002), the UK (Oakes, 1986), Spain (Badenes-Ribera et al., 2015), Italy (Badenes-Ribera et al., 2015), Chile (Badenes-Ribera et al., 2015) and China (Hu et al., 2016; Lyu et al., 2018). We selected four items to minimize the length of the questionnaire. We opted for these particular items because they are widely used and they enable a comparison between the results of the present and previous surveys. These items have several limitations. For example, certain items (e.g., “The probability that the true mean is greater than 0 is at least 95%”; “The probability that the true mean equals 0 is smaller than 5%.”) in the study of Hoekstra et al. (2014) could not be considered “incorrect” because of differing understandings of the concept of “probability” (Miller & Ulrich, 2015).
Materials and methods
Participants
All participants were recruited through online advertisements on WeChat-Public-Accounts; these subscription accounts enable users to obtain information and interact with the account owners (Montag, Becker, & Gan, 2018). Specifically, our advertisements were spread via The Intellectuals (知识份子), Guoke Scientists (果壳科学人), Capital for Statistics (统计之都), Research Circle (科研圈), 52brain (我爱脑科学网), and Quantitative Sociology (定量群学). The advertisements posted on these WeChat-Public-Accounts were identical: they emphasized the importance of statistics and encouraged readers to contribute their time to science by clicking the Qualtrics link at the end of the post and participating in our survey. A total of 4,206 respondents from different backgrounds (respondents’ academic background was based on the degree they were awarded in China) voluntarily participated in the survey. However, 2,727 of them withdrew before completing the survey, leaving a sample size of 1,479. All participants read and signed the informed consent form prior to their participation. Data were collected from September 2017 to November 2018. The response rate (35%) was higher than in previous studies in psychology; specifically, it was 10% and 7% higher than the response rates in Badenes-Ribera et al. (2015) and Badenes-Ribera et al. (2016), respectively.
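The sample size and response rate above follow from simple arithmetic. A minimal check in Python, using only the counts reported in this paragraph (the variable names are ours, for illustration):

```python
# Counts reported in the Participants paragraph.
started = 4206    # respondents who began the survey
withdrew = 2727   # respondents who left before completing it

completed = started - withdrew
response_rate = completed / started

print(completed)               # number of complete responses
print(f"{response_rate:.1%}")  # share of starters who finished
```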
Materials
The questions on the interpretation of p values and CIs were adopted from Lyu et al. (2018). These questions were first translated by C-P Hu and then reviewed by other bilingual psychological researchers (X-K Lyu and Dr Fei Wang at Tsinghua University) to ensure accuracy. Our survey included scenarios on p values and CIs. To investigate the understanding of non-significant results, we created two versions of the survey: one used a significant scenario (i.e., p < .05 and CIs did not include zero) and the other used a non-significant scenario (i.e., p > .05 and CIs included zero). Participants were randomly assigned to the significant or the non-significant version by Qualtrics.
Questions for p values
This scenario was adopted from previous studies (Gigerenzer, 2004; Haller & Krauss, 2002; Lyu et al., 2018). Respondents first read a research context and were then asked to judge whether each of four statements could be logically inferred from the p value of the result. To explore the effect of significant and non-significant results, the p value was either smaller than .05 or greater than .05. The scenario read as follows: Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
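The two p values in the scenario can be recomputed from the t statistics and the degrees of freedom. The sketch below is not the authors' code; it assumes a two-sided test and integrates the Student's t density numerically with Simpson's rule, using only the Python standard library. The function names `t_pdf` and `two_sided_p` are ours.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=20000):
    """Two-sided p value, P(|T| > |t|), via Simpson's rule on [0, |t|]."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < |t|)
    return 1.0 - 2.0 * central_half

# The two versions of the scenario: significant and non-significant.
for t_value in (2.7, 1.26):
    print(t_value, round(two_sided_p(t_value, 98), 3))
```

In practice a statistics package would be used instead of hand-rolled integration; the point here is only that the reported p values are determined by t and df.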
Participants were asked to judge the following statements (note that the italicized phrases differ between the two versions of our survey; the non-significant version is in brackets): (a) You have absolutely disproved (proved) the null hypothesis; (b) You have found the probability of the null (alternative) hypothesis being true; (c) You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision; (d) You have a reliable (unreliable) experimental finding in the sense that you would obtain a significant result on 99% (21%) of occasions if, hypothetically, the experiment was repeated multiple times.
Questions for CIs
This scenario was also adopted from previous studies (Hoekstra et al., 2014; Lyu et al., 2018). As in the p-value situation, respondents first read one of the two versions of the context, in which the CI did not (significant) or did (non-significant) include zero: A researcher conducts an experiment, analyzes the data, and reports: “The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (or from –.1 to .4 in the non-significant version).”
They were then required to make a judgment about the accuracy of each
statement (note that the italicized phrases are different in the two
versions of our survey; the non-significant version is in brackets): (a) A
95% probability exists that the true mean lies between .1 (–.1) and .4 (.4);
(b) If we were to repeat the experiment over and over, then 95% of the time
the true mean falls between .1 (–.1) and .4 (.4); (c) If the null hypothesis
is that no difference exists between the mean of experimental group and
control group, then the experiment has disproved (proved) the null
hypothesis; (d) The null hypothesis is that no difference exists between the
means of the experimental and the control groups. If you decide to (not to)
reject the null hypothesis, then the probability that you are making the
wrong decision is 5%. The English-translated questionnaires are available
at:
After making a judgment on each statement, respondents were immediately asked to rate their confidence in that judgment from 1 (not confident at all) to 5 (very confident). None of the statements can be logically inferred from the results. Hence, any statement for which the “True” option was chosen was coded as a misinterpretation of the p value or CI.
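One reason CI statements (a) and (b) cannot be inferred is that the 95% describes the long-run behavior of the interval procedure, not one computed interval. The simulation below is a rough sketch under assumptions that are ours, not the survey's: normally distributed scores, a known standard deviation, and a hypothetical true difference of 0.25. It counts how often freshly computed intervals contain the fixed true difference.

```python
import random
import statistics

def coverage(true_diff=0.25, sd=1.0, n=50, reps=2000, seed=1):
    """Share of simulated 95% CIs (known-sd z intervals) that contain true_diff."""
    rng = random.Random(seed)
    half_width = 1.96 * sd * (2 / n) ** 0.5  # 1.96 * SE of a difference of two means
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        treated = [rng.gauss(true_diff, sd) for _ in range(n)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        # The interval moves from sample to sample; the true difference does not.
        if diff - half_width <= true_diff <= diff + half_width:
            hits += 1
    return hits / reps

print(coverage())
```

The proportion should land near 0.95, yet any single interval, such as the one from .1 to .4, either contains the true difference or does not.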
Data analysis
R 3.5.3 was used to analyze the data. The error rates of different groups of
participants were compared with chi-square tests under the NHST framework.
In addition, we reported Bayes factors (BFs) as a complementary index for
statistical inference. Bayes factors were calculated using JASP 0.8.6.0, with
the default prior (Hu, Kong, Wagenmakers, Ly, & Peng, 2018; Love et al., 2019). The following criteria for Bayesian
inference were used: 1 < BF10 < 3 indicates anecdotal evidence for H1, 3
< BF10 < 6 indicates weak evidence for H1, 6 < BF10 < 10 indicates
moderate evidence for H1, 10 < BF10 < 100 indicates strong evidence for
H1, and 100 < BF10 indicates overwhelming evidence for H1 (Jeffreys, 1961). All analysis code is available
at
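The Bayes factor criteria listed in this section amount to a lookup on BF10. A small sketch follows; the function name `describe_bf10` is hypothetical. The text gives only strict inequalities, so two choices here are our assumptions: a value exactly on a boundary goes to the higher category, and values at or below 1 get a separate label.

```python
def describe_bf10(bf10):
    """Map a Bayes factor BF10 to the evidence label used in this section."""
    if bf10 <= 1:
        # Not covered by the listed criteria: the data do not favor H1.
        return "no evidence for H1"
    if bf10 < 3:
        return "anecdotal"
    if bf10 < 6:
        return "weak"
    if bf10 < 10:
        return "moderate"
    if bf10 < 100:
        return "strong"
    return "overwhelming"

for value in (0.5, 2.0, 5.0, 8.0, 50.0, 500.0):
    print(value, describe_bf10(value))
```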
Results
A total of 1,479 participants provided valid data on the p value or CI items. Sample sizes for the significant and the non-significant versions were n = 759 and n = 720, respectively. None of the statements about p values and CIs can be logically inferred from the given context. Hence, we calculated the error rate on each item; the supplementary materials explain why these statements are wrong.
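Because every statement is incorrect, error rates reduce to counting "True" responses. A minimal sketch with made-up responses (the four respondents below are hypothetical, not survey data): it computes the per-item error rate and the share of respondents with at least one wrong answer.

```python
# Hypothetical responses: one dict per respondent, True = endorsed the statement.
responses = [
    {"a": False, "b": True,  "c": True,  "d": False},
    {"a": False, "b": False, "c": False, "d": False},
    {"a": True,  "b": True,  "c": False, "d": True},
    {"a": False, "b": False, "c": True,  "d": False},
]

def item_error_rates(rows):
    """Share of respondents endorsing each (incorrect) statement."""
    items = sorted(rows[0])
    return {item: sum(r[item] for r in rows) / len(rows) for item in items}

def any_error_rate(rows):
    """Share of respondents who endorsed at least one statement."""
    return sum(any(r.values()) for r in rows) / len(rows)

print(item_error_rates(responses))
print(any_error_rate(responses))
```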
In general, the results (all the raw data are available at
Percentage of misinterpretation of p values and CIs. (a)
Percentage of misinterpretation by educational attainment: Bachelor degree
= undergraduates or respondents whose highest degree was a bachelor’s, Master’s degree
= master’s students or respondents whose highest degree was a master’s; (b)
Percentage of misinterpretation by discipline: Discipline division was
based on the degree awarded to the respondents in China. Science =
disciplines awarding a degree in natural science, excluding mathematics and
statistics, Engr/Agr. = engineering/agronomy, Social Science = sociology
or other social sciences; (c) Percentage of misinterpretation by the
location where the respondents received their highest degree.
Percentage of misinterpretation of p values and CIs for each statement
Note: Discipline division was based on the degree awarded to the respondents in China. Science = disciplines awarding a degree in natural science, excluding mathematics and statistics, Eng/Agr. = engineering/agronomy, Social Science = sociology or other social sciences.
For the difference between the significant and the non-significant versions, the error rate for p values was lower in the latter (86%) than in the former (92%), χ2(1) = 16.841, p < .001, BF10 = 543.871. This study failed to find strong evidence for a difference between significant CIs (94%) and non-significant CIs (91%), χ2(1) = 2.892, p = .089, BF10 = 0.580. For detailed analysis and figures, see the supplementary materials.
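The p value attached to a chi-square statistic with one degree of freedom can be recomputed from the statistic alone, which gives a quick consistency check on reported results. The sketch below is ours, not the authors' analysis code; it relies on the fact that a chi-square variable with df = 1 is a squared standard normal variable.

```python
import math

def chi2_p_df1(chi2):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    # For df = 1, P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

# Statistics reported for the significant vs. non-significant comparison.
for stat in (16.841, 2.892):
    print(stat, round(chi2_p_df1(stat), 3))
```

The familiar critical value of about 3.84 corresponds to p = .05, so a statistic below it cannot be significant at that level.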
Most respondents were confident about their judgments. For all
four statements on p values and on CIs, the average confidence was
over 3.8 out of 5 (see Figure 2a and 2b). We also compared the
confidence levels of correct and incorrect answers with t
tests and found that confidence was higher for correct answers on certain
items (see Supplementary results 3, Table
Percentage of misinterpretation of p values (a) and CIs
(b) as compared with the average confidence level (error bars represent ±1
standard error) for each statement. Horizontal labels (A-D) represent
the four incorrect statements about p values or CIs.
Detailed statements can be found in the Materials section; in brief, the four
statements on p values are about (A) disproving/proving the
H0, (B) obtaining the probability of a true
H0/H1, (C) obtaining the probability of a type I
error, (D) replication delusion; the four statements on CIs are about (A) obtaining
the probability of containing the true mean, (B) replication delusion, (C)
disproving/proving the H0, (D) obtaining the probability of an
error.
Our exploratory analysis suggested that respondents who received their highest degree
overseas or in Hong Kong, Macao and Taiwan might have a lower error rate on the
interpretation of p values than those who obtained their highest
degree in Mainland China (see Figure 1c). For
p values, 90% of respondents who acquired their highest degree in
Mainland China (n = 1231) had at least one wrong answer, whereas
84% of respondents who attained their highest degree overseas (n =
248) had at least one wrong answer, χ2(1) = 6.38, p =
.012, BF10 = 1.654. For CIs, 93% of respondents who obtained their highest degree in
Mainland China had at least one wrong answer, whereas 89% of respondents who secured
their highest degree overseas had at least one wrong answer, χ2(1) =
4.57, p = .033, BF10 = 0.602. For further analysis of the
difference between Mainland China and Overseas, see Supplementary materials Figure
Discussion
The current survey found that the misinterpretation of p values and CIs was prevalent in the Chinese scientific community, even in certain methodological fields. The rates of misinterpretation were high for both significant and non-significant p values, and for CIs that did or did not include zero. Moreover, researchers and students were generally confident about their (incorrect) judgements. These results suggest that researchers generally do not have a good understanding of these common statistical indices.
The possible reasons for these misconceptions have been discussed in the literature. For example, Gigerenzer (2004, 2018) suggested that researchers used p values as a “null ritual”, which has the following steps (Gigerenzer, 2004): (1) Set up a null hypothesis of “no mean difference” or “zero correlation”. Do not specify the predictions of your own research hypothesis. (2) Use 5% as a convention for rejecting the null hypothesis. If the test is significant, then accept your research hypothesis. Report the test result as p < .05, p < .01, or p < .001, whichever level is met by the obtained p value. (3) Always perform this procedure.
This “ritual” was “inherited” in psychology by generations of researchers, as demonstrated by the inaccurate interpretation of statistical significance in introductory psychology textbooks (Cassidy, Dimova, Giguère, Spence, & Stanley, 2019). Our results confirmed and extended this view. First, similar to many previous surveys in psychology (Haller & Krauss, 2002), our results showed that respondents who were teaching statistics had a high error rate (>80%). Thus, students may acquire a wrong understanding of the p value from the very beginning. Second, our results extended the scope of previous surveys and suggest that the “ritual” is not limited to psychology or social science but extends to the entire scientific community. In our survey, the four items represent different “illusions” that are necessary for justifying the null ritual (Gigerenzer, 2004, 2018).
First, over half of respondents considered p values as evidence to disprove or prove a null hypothesis (statement A in the p-value section and statement C in the CI section). This “illusion of certainty” (Gigerenzer, 2004, 2018) justifies the use of the null ritual. It may even motivate researchers to interrogate data to obtain a p value smaller than .05 as evidence for the existence of effects. This motivation is further reinforced by the current publishing system, in which p < .05 is a premise of publication.
Our results also revealed that respondents across different fields share the “replication delusion” and false Bayesian thinking. Over 50% of respondents believed that 1-p or 1-α can represent the probability of successful replication (statement D in the p-value section and statement B in the CI section). However, p values convey nothing about the replication rate. As for statement C in the p-value section, respondents thought the p value was equal to the type I error rate or type II error, which confuses the probability of the data given the hypothesis, namely P(D|M), with the probability of the hypothesis given the data, namely P(M|D). This confusion represents Bayesian wishful thinking.
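The gap between the probability of the data given a hypothesis and the probability of a hypothesis given the data can be made concrete with Bayes' rule. The numbers below are hypothetical and chosen only for illustration; the function name `posterior_h0` is ours. The sketch assumes exactly two competing hypotheses, H0 and H1.

```python
def posterior_h0(p_d_given_h0, p_d_given_h1, prior_h0):
    """P(H0 | D) from Bayes' rule for two exhaustive hypotheses H0 and H1."""
    joint_h0 = p_d_given_h0 * prior_h0
    joint_h1 = p_d_given_h1 * (1.0 - prior_h0)
    return joint_h0 / (joint_h0 + joint_h1)

# Hypothetical inputs: the observed result occurs 5% of the time under H0
# and 50% of the time under H1.
for prior in (0.5, 0.9):
    print(prior, round(posterior_h0(0.05, 0.50, prior), 3))
```

With the same P(D|H0) = .05, the probability that H0 is true changes with the prior and with P(D|H1), so a p value on its own cannot supply the probability of a hypothesis or of a wrong decision.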
Methodologists have long discussed the lack of statistical thinking and its potential consequences (Cohen, 1962, 1994; Gigerenzer, 2004; Goodman, 2008; Meehl, 1978), but these warnings went largely unheeded. Only recently, after the “replication crisis”, did researchers rediscover these problems with p values. The “p-war” became one of the highlights in the field (Amrhein & Greenland, 2017; Amrhein, Greenland, & McShane, 2019; Benjamin et al., 2018; Lakens et al., 2018). The rationale behind this debate is straightforward: the p value is the most widely used statistical index, and many problems that have plagued psychology and social science are related to the misunderstanding of p values and statistics in general. For example, statistical power (Bakker, Hartgerink, Wicherts, & van der Maas, 2016) was promoted by Cohen in the 1960s (1962, 1994). However, the low power problem persisted in psychology (Button et al., 2013; Maxwell, 2004), probably because statistical power is not part of the “null ritual” (Gigerenzer, 2018). Other similar issues are questionable research practices (John et al., 2012) and publication bias (Franco, Malhotra, & Simonovits, 2014), which are probably due to the “illusion of certainty” among researchers. By revealing that researchers outside psychology share the same inaccurate understanding of p values and CIs, our results suggest that other fields might also be threatened by those problems.
Another important contribution to what is known about the misunderstanding of p values and CIs is the confidence ratings from respondents. Most respondents were relatively confident about their own responses. This fact provides additional evidence that people have a false certainty about their own understanding, and this inaccurate certainty justifies their use of p values. Similar to researchers’ understanding of power (Bakker et al., 2016), this result suggests that researchers across different fields may rely on intuition more than statistical thinking when making research decisions.
In our survey, respondents who received their highest degree abroad performed better on the p-value items than their peers who acquired their highest degree in Mainland China. However, this finding did not apply to CI-related items. One possible explanation is that the replication crisis was discussed more in the English media than in the Chinese media. Therefore, students who had studied overseas may have been more familiar with this topic than their local counterparts.
Limitations
Several limitations of this survey should be pointed out. First, although we used a multidisciplinary and relatively large sample, the data were from a convenience sample, which might not be representative of the entire population. However, our results may underestimate the rate of misunderstanding of p values and CIs: because our survey did not provide any compensation, most respondents were probably interested in p values and related issues, and people who are interested in statistical issues typically perform better than those who are not. Second, as mentioned before, we used four items for p values and four items for CIs, and the validity of certain items remains controversial. Third, we found that respondents have great confidence in their interpretation of p values and CIs, but we did not examine why they are confident and how they make their decisions.
Conclusion
The current survey showed that researchers from various fields of science may not be able to correctly interpret p values and CIs, and that they are largely unaware of their own misinterpretations. These results call for deep and accurate statistical training in all scientific fields.
Supplemental Material
Lyu et al. supplementary material
To view supplementary material for this article, please visit
Acknowledgments
We thank the following online media outlets and websites for circulating our recruitment information: The Intellectuals (知识份子); Guoke Scientists (果壳科学人); Capital for Statistics (统计之都); Research Circle (科研圈); 52brain (我爱脑科学网); Quantitative Sociology (定量群学).
Financial Support
None.
Funding
This study was supported by Social Sciences and Humanities Youth Foundation of Ministry of Education of China (19YJC840030) and Philosophy and Social Science Foundation of Tianjin (TJJX18-001).
References
