Abstract
With the rapid spread of information via social media, individuals are prone to misinformation exposure that they may utilize when forming beliefs. Over five experiments (total N = 815 adults, recruited through Amazon Mechanical Turk in the United States), we investigated whether people could ignore quantitative information when they judged for themselves that it was misreported. Participants viewed sets of values sampled from Gaussian distributions to estimate the underlying means. They attempted to ignore invalid information, which took the form of outlier values inserted into the value sequences. Results indicated that participants were able to detect outliers. Nevertheless, participants’ estimates were still biased in the direction of the outlier, even when they were most certain that they detected invalid information. The addition of visual warning cues and different task scenarios did not fully eliminate systematic over- and underestimation. These findings suggest that individuals may incorporate invalid information they meant to ignore when forming beliefs.
Introduction
With the rapid spread of information via social media, people are susceptible to seeing misinformation that could impact their beliefs. This is particularly true of “infodemics” involving massive amounts of information about a particular topic (e.g., COVID-19), including false and misleading information (Eysenbach, 2002; Greenspan & Loftus, 2021). It is concerning that false news often spreads more quickly and broadly online than true news does (Vosoughi et al., 2018). The ease with which misinformation is disseminated online produces an environment in which a few insistent voices sharing false information can sway much of the populace (Cook & Lewandowsky, 2016). Even those who do not intend to misinform may do so by sharing false news stories simply because they have seen the headline before (Effron & Raj, 2020).
It can be easy to assume that identifying misinformation allows one to ignore it, but research into the continued influence effect (CIE; Johnson & Seifert, 1994) has shown that people will utilize information that has been retracted even if they remember that the information is not legitimate. This effect is difficult to eliminate completely. Invalidated information continues to influence beliefs even when the retraction was much stronger than the false information (Ecker et al., 2011) and even when people are forewarned about misinformation (Ecker et al., 2010).
Interventions to reduce the influence of retracted information include shifting one’s focus toward evaluating accuracy when encoding false information (Pennycook et al., 2021). Similarly, reading a debunking message shortly after encountering false information diminishes, but does not eliminate, the CIE (Brashier et al., 2021; Wilkes & Leatherbarrow, 1988). The most effective approach appears to be absorbing truthful debunking messages, so the truthful information replaces the retracted information in one’s mental model (Chan et al., 2017); this method has been recommended for news sources, social media sites, and educators (Lewandowsky et al., 2012). However, these interventions apply to situations in which an outside source has invalidated a piece of information. It is also important to consider whether people can ignore information that they have deemed invalid for themselves.
In the illusory-truth paradigm, participants indicate the degree to which they believe different statements, some of which are false. Later, participants are presented with another list of statements, some of which they had previously encountered. Results show that people more strongly believe previously encountered statements, including false statements (Begg et al., 1992; Hasher et al., 1977). Additional repetitions of a statement increase belief in that statement even further (Hassan & Barber, 2021).
Illusory truth occurs even when one possesses knowledge that contradicts false information (Fazio et al., 2015). The increase in believability occurs for both plausible and implausible statements (e.g., “the Earth is a perfect square”; Fazio et al., 2019). Like the CIE, the illusory-truth effect can be diminished, but rarely eliminated, by having participants focus on the accuracy of statements as they initially encounter them (Brashier et al., 2020).
Other research has also observed that ignoring information is not an easy task. Children 4 to 6 years of age found it difficult to ignore false information about a previous playdate (Schaaf et al., 2015), adult jurors have difficulty ignoring inadmissible evidence when deliberating a verdict (London & Nunez, 2000), and even experienced judges who have ruled evidence to be inadmissible struggle to ignore that evidence (Wistrich et al., 2004).
The previous literature on misinformation has mainly investigated beliefs regarding verbal information, focusing (a) on whether individuals continue to rely on retracted information (the CIE) or believe false factoids (illusory truth), (b) on how misinformation spreads online (Effron & Raj, 2020; Lewandowsky et al., 2012; Vosoughi et al., 2018), and (c) on how to combat it (Brashier et al., 2021; Pennycook et al., 2021). With the growth of data journalism (Stalph & Borges-Rey, 2018), people are increasingly asked to form and update beliefs regarding numerical information. There is reason to believe that people process verbal and numerical information differently. For example, Liu et al. (2021) found that people rely more on context when making decisions using verbal quantifiers compared to numerical quantifiers. Thus, it is unknown how misinformation influences beliefs regarding numerical information. The current paper extends the existing literature on misinformation by investigating how individuals handle misinformation in the form of invalid reports of numerical quantities.
Statement of Relevance
Data journalism has become a part of daily life. Take, for example, coverage of the COVID-19 pandemic, during which one could hear any number of arguments over “the numbers.” To form beliefs grounded in truth, it is important to accurately judge which data are legitimate and avoid any that are invalid. In these experiments, we examined whether people are able to ignore invalid numerical information once they have detected it. We found that even when individuals were certain that a piece of information was invalid, they were not able to fully ignore it. This was true even when we provided warnings or told people which data were invalid. These findings are important because they show how harmful simply encountering bad data can be. This highlights the importance of news and social-media sites making concentrated efforts to fight misinformation before it can make it to the public—before it is too late.
News sources are increasingly reporting data to consumers (Westlund & Hermida, 2021), with data journalists often presenting their findings as fact. However, they less often acknowledge limitations such as data-collection practices or conflicts of interest in their work. This requires consumers to be vigilant and sample many disparate findings of varying quality in order to estimate the truth of a matter (e.g., COVID-19 vaccine-efficacy rates, global temperature change). Therefore, it is important to understand how people process noisy numerical information as they form beliefs of the underlying truth (Stubenvoll & Matthes, 2021), especially information they deem to be false.
In this paper we describe five experiments investigating whether people could ignore invalid numerical information when they detected it. We also examine manipulations aimed at helping people identify false reports (i.e., visual warning cues). Across our experiments, we show that it is difficult for people to disregard invalid numerical information even when this information is very easy to detect.
Open Practices
Data, analyses, and example experimental code (for Experiment 5) have been made publicly available via the Open Science Framework and can be accessed at https://osf.io/9ybnx/. The design and analysis plans for Experiments 3, 4, and 5 were preregistered at AsPredicted and can be accessed at https://aspredicted.org/NNB_QGM, https://aspredicted.org/OAB_ATF, and https://aspredicted.org/MTN_FZN.
Method
In a series of five experiments, participants read imaginary scenarios in which they viewed sequences of reports that included numerical information sampled from underlying Gaussian distributions. Participants were asked to ignore invalid reports while estimating the true mean of the underlying distribution. These five experiments were conducted sequentially, and the specific manipulations used in later experiments were inspired by findings of the previous ones.
Participants
For each experiment, we intended to have approximately 50 participants per condition after exclusions. We recruited 106 adult participants for Experiment 1 and 107 adult participants for Experiment 2. Both experiments had two between-subjects conditions. The sample size for Experiments 1 and 2 was determined prior to starting recruitment by using previous studies of the CIE and illusory-truth effect (Ecker et al., 2011; Hassan & Barber, 2021) to estimate the number of participants needed for each condition. We note that a formal power analysis was not carried out for these experiments. The results of Experiments 1 and 2 showed that our selected sample size was sufficient for our design, and so we used it again for Experiments 3, 4, and 5, again targeting about 50 participants per condition. As described in the preregistrations for these experiments, we oversampled in each condition because we anticipated exclusions. In the end, we had 382 adult participants for Experiment 3, 134 adult participants for Experiment 4, and 274 adult participants for Experiment 5. This research received approval from the Institutional Review Board at Vanderbilt University (ID 192171).
We recruited participants from Amazon Mechanical Turk (MTurk) using the CloudResearch platform, and we targeted those who typically submitted good work, as indicated by an approval rating of over 95% for all human-intelligence tasks they had completed. Each participant received $0.50 for every 15 min of participation. This payment rate was determined by the median earning rate of MTurk workers (Hara et al., 2018). In Experiments 1 and 2, participants received $1.00 upon completion of the approximately 30-min task. In Experiments 3 through 5, participants received $1.50 upon completion of the approximately 45-min task. Data were analyzed only after all data had been collected.
Because of the nature of the experimental manipulation, participants needed to view at least three reports (i.e., stimuli) in a trial to possibly encounter the invalid report of interest to our hypotheses (see the Procedure section). Therefore, participants with a median number of stimuli viewed per trial of less than three were excluded from the analyses. This criterion also served to protect against participants who merely clicked through the experiment as fast as possible without dutifully completing the task. We excluded 21 participants from Experiment 1, 24 participants from Experiment 2, 68 participants from Experiment 3, 31 participants from Experiment 4, and 44 participants from Experiment 5.
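The exclusion rule above can be expressed in a few lines. The following is a minimal sketch, assuming each participant's data are available as a list of the number of reports viewed on each trial; the function name and the example participants are hypothetical and are not taken from the study's materials.

```python
from statistics import median

MIN_MEDIAN_REPORTS = 3  # threshold stated in the text


def is_excluded(reports_viewed_per_trial):
    """Return True if the median number of reports viewed per trial is below 3."""
    return median(reports_viewed_per_trial) < MIN_MEDIAN_REPORTS


# Hypothetical participants: number of reports viewed on each trial.
participants = {
    "p01": [1, 2, 2, 1, 3],   # median 2, excluded
    "p02": [5, 9, 12, 3, 7],  # median 7, retained
    "p03": [3, 3, 2, 4, 3],   # median 3, retained (not below the threshold)
}

retained = [pid for pid, counts in participants.items() if not is_excluded(counts)]
print(retained)
```

Note that a median of exactly three is retained, because the criterion excludes only medians of less than three.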
After excluding participants because of low median stimuli viewed per trial, we retained 85 participants (45 women, 39 men, 1 nonbinary; age: M = 39.88 years, SD = 13.31 years) in Experiment 1, 83 participants (49 women, 31 men, 1 nonbinary, and 2 unknown; age: M = 41.37 years, SD = 13.70 years) in Experiment 2, 314 participants (197 women, 112 men, 1 nonbinary, 1 genderqueer, 1 transmasculine, and 2 unknown; age: M = 43.75 years, SD = 14.25 years) in Experiment 3, 103 participants (76 women, 27 men; age: M = 42.51 years, SD = 13.47 years) in Experiment 4, and 230 participants (156 women, 72 men, 1 nonbinary, and 1 unknown; age: M = 40.94 years, SD = 14.42 years) in Experiment 5 for data analyses.
Materials
Each trial consisted of a sequence of screens, each showing the results from a hypothetical medical study (e.g., Fig. 1a) involving a certain number of patients (e.g., Fig. 1b). Participants sampled information (i.e., results of medical tests, such as 7 out of 20 patients having negative side effects) until they felt ready to make a judgment about the true underlying mean in that trial. In Experiments 1 through 3, the number of patients involved in each medical test was 20. In Experiments 4 and 5, we used 35 patients to investigate whether the pattern of effects would be observed with different stimulus values. To construct these sequences, a list of 51 fictional reports (i.e., stimuli) was created for each trial. The figure of 51 was somewhat arbitrary but large enough to ensure that participants would most likely cease their information search before exhausting the reports (which is critical to investigating information-sampling behavior). Participants could sample up to 51 fictional reports in each trial before they made an estimate (Fig. 1c). This self-paced task allowed participants to decide how many stimuli they viewed on each trial.

Fig. 1. An example of the different screens participants viewed on each trial in Experiment 1. In (a), we show an example of the orienting story participants read at the start of each trial. The medical condition and fictional drug names were determined randomly for each trial. An example of the stimuli participants viewed on each trial is illustrated in (b). Each stimulus represents one fictional researcher’s report, and participants sampled reports until they felt comfortable estimating the underlying true prevalence rate. The response screen where participants typed their estimates after terminating their information search is shown in (c); the outlier detection-response screen where participants indicated the likelihood that they saw a fabricated (i.e., invalid) report is shown in (d). At the end of each trial, one group in Experiment 1 and all participants in the following experiments indicated how likely it was that they encountered invalid information.
The key within-subject manipulation was the presence of invalid reports (i.e., outlier test results) on some trials. In each experiment, there were 52 trials consisting of three main types: 16 control trials without outliers, 16 test trials with outliers, and 20 catch trials.
In Experiments 1 through 3, the 16 control trials were constructed by sampling from a Gaussian distribution with a mean of 8 and a standard deviation of 2, that is, $\mathcal{N}(\mu = 8, \sigma = 2)$.
The test trials were designed to investigate the effect of invalid reports on information seeking and one’s final estimates. There were two sets of eight test trials (one set with low outliers and one with high outliers) created by following the same procedure for generating the control trials and then randomly inserting an outlier as report three, four, or five in the sequence. Placing the outlier early in the sequence of reports ensured that participants were likely to see it when it appeared. Low outliers were 1, 2, and 3 in Experiments 1 through 3 and 13, 14, and 15 in Experiments 4 and 5; high outliers were 13, 14, and 15 in Experiments 1 through 3 and 25, 26, and 27 in Experiments 4 and 5. The presence of these outliers should not affect people’s estimation of the underlying true mean of the distributions if ignored. In other words, the mean value of the test trials was equivalent to that of the control trials, after ignoring the invalid information present in outliers.
The catch trials were designed to ensure that there was sufficient variety in the stimuli. In each experiment, there were 20 catch trials made up of four sets of five catch trials. In Experiments 1 through 3, the four sets were sampled from normal distributions with means of either 5 or 11 and standard deviations of either 1 or 2. In Experiments 4 and 5, the four sets were from normal distributions with means of either 17 or 23 and standard deviations of either 1 or 2. These additional trials served as a distraction to help prevent participants from learning the structure underlying the control and test stimuli.
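The construction of control, test, and catch sequences described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' experimental code: the function names are hypothetical, and rounding draws to whole numbers, clipping them to the range from 0 to the number of patients, and inserting (rather than replacing) a report are assumptions that the text does not specify.

```python
import random

N_REPORTS = 51                 # reports available per trial (from the text)
OUTLIER_POSITIONS = (3, 4, 5)  # 1-indexed report positions (from the text)


def make_control_trial(rng, mean=8.0, sd=2.0, n_patients=20):
    """Draw 51 report values from a Gaussian distribution.

    Assumption: draws are rounded to integers and clipped to [0, n_patients].
    """
    values = []
    for _ in range(N_REPORTS):
        draw = round(rng.gauss(mean, sd))
        values.append(min(max(draw, 0), n_patients))
    return values


def make_test_trial(rng, outlier_values, **kwargs):
    """Build a control sequence, then place one outlier as report 3, 4, or 5."""
    values = make_control_trial(rng, **kwargs)
    position = rng.choice(OUTLIER_POSITIONS)
    outlier = rng.choice(outlier_values)
    # Assumption: the outlier is inserted and the list trimmed back to 51 reports.
    values.insert(position - 1, outlier)
    return values[:N_REPORTS], position


rng = random.Random(0)
control = make_control_trial(rng)                                    # Experiments 1-3
low_trial, low_pos = make_test_trial(rng, outlier_values=(1, 2, 3))
high_trial, high_pos = make_test_trial(rng, outlier_values=(13, 14, 15))
catch = make_control_trial(rng, mean=5.0, sd=1.0)                    # one catch-trial setting
```

Because test sequences start from the same generator as control sequences, the valid reports in both trial types share the same underlying mean, which is the property the design relies on.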
In addition to the within-subject manipulation of trial types, each experiment employed a set of between-subject manipulations (Table 1). In Experiment 1, we investigated how response types interacted with the effect of trial types. One group of participants provided only their estimates of medical cases (estimate-only group; Fig. 1c). The other group provided these estimates but also indicated the likelihood of observing an outlier in their sampled reports using a 5-level Likert scale on each trial (estimate-detect group; Fig. 1d).
Table 1. Summary of Experimental Design
Note: All experiments also involved catch trials that are not listed in the table. Boldface type denotes the key manipulations in each experiment.
In Experiments 2 and 3, we used visual warning cues to alert participants to upcoming invalid information on trials. The warning cue was a red-and-white triangular sign with an exclamation point inside it (see Fig. 2a). This design and color scheme were chosen to ensure participants did not miss the cue. Participants were informed about the nature of the cues at the start of the experiment. The warning cue appeared in the space above the story that started each trial (Fig. 2a) and disappeared when the story was no longer on the screen. There were two levels of reliability for these trial-level warning cues: 70% and 100% reliable. In the 100%-reliable cue group, the warning cue showed up on every test trial (i.e., the trials with outliers). In the 70%-reliable group, the cue had a 70% chance of appearing on outlier trials and a 30% chance of appearing on nonoutlier (i.e., control and catch) trials. In Experiment 2, we assigned participants randomly to 70% and 100% cue groups. In Experiment 3, we included a third group, which did not see any warning cues, to serve as control (no-cue group), in addition to the two cue groups.
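The trial-level cue schedule can be illustrated with a short simulation. This is a sketch only: the function name is hypothetical, and it assumes that the cue was drawn independently on each trial and that the 100%-reliable cue never appeared on nonoutlier trials, neither of which is stated explicitly in the text.

```python
import random


def show_trial_cue(has_outlier, reliability, rng):
    """Return True if the trial-level warning cue is shown.

    On outlier trials the cue appears with probability `reliability`;
    on nonoutlier trials it appears with probability 1 - reliability.
    """
    p_cue = reliability if has_outlier else 1.0 - reliability
    return rng.random() < p_cue


rng = random.Random(1)
n = 10_000

# 70%-reliable group: cue rate on outlier and nonoutlier trials.
hit_rate = sum(show_trial_cue(True, 0.7, rng) for _ in range(n)) / n
false_alarm_rate = sum(show_trial_cue(False, 0.7, rng) for _ in range(n)) / n
print(f"outlier trials cued: {hit_rate:.2f}, nonoutlier trials cued: {false_alarm_rate:.2f}")
```

With `reliability=1.0` the same function shows the cue on every outlier trial and on no other trial, matching the 100%-reliable group.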

Fig. 2. An example of the orienting story (a) that participants read in Experiments 2 and 3. The cue at the top of the screen warned that at least one outlier would appear in the reports for that trial. For some participants, this cue was 100% reliable; for other participants, the cue indicated that there was a 70% chance that they would see an outlier in the upcoming reports. Participants were randomly assigned to a cue type before beginning the first trial. In (b), we show the outlier cue employed in Experiment 5. In this experiment, one group of participants was shown the cue above any report that had been fabricated. The other group was not shown any cue during the task.
In Experiment 3, we also examined the effect of response order on participants’ ability to ignore invalid test results. Half of participants were required to report an estimate before reporting the likelihood of outlier detection (i.e., the estimate-detect group), whereas the other participants were asked to first report the likelihood of detecting an outlier and then submit their numerical estimates (i.e., the detect-estimate group).
In Experiment 4, we examined whether the task scenario affected participants’ ability to ignore outliers. Half of participants were required to estimate numerical evidence in the context of negative side effects (i.e., the side-effects group), as in Experiments 1 through 3. The other half of participants estimated the numerical evidence in the context of positive effects—that is, they were required to estimate the proportion of patients that would show improved health for each medication (i.e., the health-improvements group). For the health-improvements group, any language referencing side effects was modified to describe health improvements. This included the instructions, the short story at the start of each trial, and the response screens. Only the language was manipulated; the program that governed the presentation of numerical stimuli was identical for both groups.
In Experiment 5, we investigated the interaction between the task scenario and warning cues on the effect of invalid test results on numerical estimates. We employed a stronger report-level cuing manipulation than the trial-level cues employed in Experiments 2 and 3. In this experiment, the report-level cue appeared above the invalid report (Fig. 2b) and disappeared when the report left the screen. In other words, rather than warning participants about the possible appearance of invalid information in upcoming reports, as in Experiments 2 and 3, the warning cues in Experiment 5 were 100% accurate and flagged each invalid report for participants.
Procedure
In each experiment, participants were randomly assigned to one of the between-subject groups before the experimental session began. There were two between-subject groups in Experiments 1, 2, and 4, six between-subject groups in Experiment 3, and four between-subject groups in Experiment 5 (see Table 1).
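The group counts above follow from crossing the between-subject factors described in the Materials section. The sketch below enumerates them, assuming the factors were fully crossed within each experiment; the dictionary layout is illustrative, and the level labels follow the group names used in the text.

```python
from itertools import product

# Between-subject factors per experiment, as described in the text.
designs = {
    1: {"response type": ("estimate-only", "estimate-detect")},
    2: {"cue reliability": ("70%-cue", "100%-cue")},
    3: {
        "cue reliability": ("no-cue", "70%-cue", "100%-cue"),
        "response order": ("estimate-detect", "detect-estimate"),
    },
    4: {"task scenario": ("side-effects", "health-improvements")},
    5: {
        "task scenario": ("side-effects", "health-improvements"),
        "report-level cue": ("no-cue", "cue-present"),
    },
}

# Fully crossing the factors gives the between-subject groups.
groups = {exp: list(product(*factors.values())) for exp, factors in designs.items()}
group_counts = {exp: len(cells) for exp, cells in groups.items()}
print(group_counts)
```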
Participants were instructed that they would be viewing a number of fictional scenarios in which a medical lab had developed a new drug to treat an ailment. In Experiments 1 through 3, the lab’s researchers were each said to have administered the drug to 20 different patients to see how many of them developed negative side effects. In Experiments 4 and 5, the lab’s researchers were each said to have administered the drug to 35 different patients to see how many developed negative side effects (side-effects group) or how many showed positive health improvements (health-improvements group). Participants were then shown an example of the reports they would be viewing. Participants were informed that their task was to view the researchers’ reports to determine the true underlying prevalence rate of the effect for each medication.
While reading the instructions for the task, participants were also made aware of two key points. First, participants were reminded that it is normal for lab members’ results to differ because they are different people testing different samples of patients. Second, participants were informed that they might encounter reports from lab members who misreported their results. Participants were told that most researchers report their results honestly and accurately, but that some researchers may have fabricated data. They were told that fabricated results would be “much higher or much lower than the other reports for a given medication.” Participants were specifically instructed to ignore any report that they believed was misreported in this way.
Participants assigned to no-cue groups also responded to a comprehension question: After reading the instructions, they were asked whether they had understood the instructions and whether they understood how to determine which reports were invalid. Participants assigned to warning-cue groups were also informed about the nature of the cues, including the cue’s reliability, and completed a comprehension check to ensure that they understood the meaning of the particular warning cue. Any participants who did not pass these checks were removed from the experiment and were not permitted to begin the task.
At the beginning of each trial, participants read a brief story to orient them to the scenario. For example, “Imagine that a lab has developed a pill named Corfenib for treatment of sinus infections. While testing the prevalence of negative side effects, different lab members reported the following results.” Imaginary medication names were created for this experiment to ensure participants could not utilize existing information about real medications to inform their estimates. The medication and condition names were randomly selected for each trial, as the specific names were not of interest to our hypotheses.
After reading the imaginary scenario, the participant was shown the first fictional researcher’s report. Each report was presented as a number, such as “8 out of 20.” Each report remained on the screen for 2 s. Participants were then shown a screen on which they could press the right arrow key to see another report or press the space bar to move to the response screen, provide their estimate, and proceed to the next trial. In this way, participants could sample as many reports per trial as they wanted until they felt comfortable estimating the true side-effect prevalence rate of the medication (or health-improvement prevalence rate, in Experiments 4 and 5).
When participants were ready to indicate their estimate, they pressed the space bar and typed in their answer on a response screen. Participants in estimate-detect groups were also asked to rate how likely it was that they encountered an invalid report on that trial on a Likert scale (1 = not very likely to 5 = very likely). Participants repeated this sequential sampling of the reports, choosing their stopping point, and estimating the true prevalence rate of negative side effects for each trial. In total, each participant completed 52 trials, and in each of the trials they could view up to 51 reports. Participants were given the option of taking a short break after completing half the trials.
Results
All analyses described in this paper were conducted using Jamovi computer software (Jamovi Project, 2021; R Core Team, 2020). All mixed-effects models were fitted using the GAMLj Jamovi module (Gallucci, 2019; Ripley et al., 2018). In the main text, we report the omnibus likelihood ratio test results for each main effect and interaction term in each regression analysis; the estimated coefficients are reported in the Supplemental Material available online. Note that the catch trials were not included in these analyses because they served only as distractors and were not of interest to our hypotheses. Additionally, the results of paired t tests (Table S1 in the Supplemental Material) suggest that the accuracy estimated from the catch trials was similar to that estimated from the control trials.
All data and analyses have been made publicly available via the Open Science Framework and can be accessed at https://osf.io/9ybnx/.
Can people detect invalid information?
Number of reports viewed
As shown in Table 2, the observed (i.e., true) mean of the valid reports viewed by participants was similar across different trial types and consistent with experimental manipulations. In addition, we observed that the number of reports viewed by participants was slightly higher in test trials (i.e., low-outlier and high-outlier trials) than in control trials (i.e., trials that included only valid reports) in all five experiments.
Table 2. Descriptive Statistics for Experiments 1 Through 5
Note: Control trials only contained valid reports. Low-outlier (high-outlier) trials included one low (high) invalid report.
To examine whether the presence of outliers affected the number of reports participants sampled, we fitted a linear mixed-effects regression model to predict the number of reports participants viewed on the basis of trial type (control, low outlier, high outlier) and corresponding between-subject manipulations (Table 1) in each experiment, as well as the interaction terms. We also included by-subject random intercepts (by-subject random slopes were not included because of model-convergence issues). The results of the omnibus likelihood ratio test for main effects and interaction terms are summarized in Table 3. The estimated coefficients are reported in the Supplemental Material (Table S2).
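One way to write out a model of this form is given below. It is a sketch under assumptions rather than the authors' exact specification: it assumes dummy coding of trial type with control trials as the reference level and a single two-level between-subject factor $G_j$, and all symbols are introduced here for illustration. Here $y_{ij}$ is the number of reports viewed by participant $j$ on trial $i$, and $u_j$ is the by-subject random intercept.

```latex
\[
\begin{aligned}
y_{ij} ={}& \beta_0 + \beta_1\,\mathrm{Low}_{ij} + \beta_2\,\mathrm{High}_{ij} + \beta_3\,G_j \\
          & + \beta_4\,(\mathrm{Low}_{ij} \times G_j) + \beta_5\,(\mathrm{High}_{ij} \times G_j)
            + u_j + \varepsilon_{ij}, \\
u_j \sim{}& \mathcal{N}(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2).
\end{aligned}
\]
```

Under this coding, $\beta_1$ and $\beta_2$ capture the change in reports viewed on low-outlier and high-outlier trials relative to control trials, and $\beta_4$ and $\beta_5$ capture how that change differs between groups. Experiments with more than one between-subject factor would add the corresponding main-effect and interaction terms.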
Table 3. Number of Reports Viewed: Results of Omnibus Tests for Regression Models Predicting the Number of Reports Participants Viewed by Key Manipulations in Each Experiment
Consistent with our observations, the model revealed an increase in the number of reports participants viewed when they encountered either low outliers or high outliers on test trials compared to when they encountered no invalid reports on control trials. There was no effect of warning cues on the number of reports participants sampled in Experiments 2 and 3. In Experiment 5, when warning cues flagged the exact reports containing invalid information, participants viewed fewer reports in test trials (M = 10.4) than in control trials (M = 10.73). Other factors, including response type (estimate only vs. estimate-detect) and task scenario (side effects vs. health improvements), did not impact the number of reports participants viewed.
Outlier likelihood ratings
We also observed that the reported likelihood of seeing an outlier was higher in test trials than in control trials at the mean level (Table 2), suggesting that participants might be able to detect outliers when they were present. An ordinal logistic regression of outlier likelihood ratings was conducted for each experiment to examine this hypothesis. The model predicted the outlier likelihood ratings in the estimate-detect condition using trial type and the key between-subject manipulations (Table 1) in each experiment as predictors. The model also included interaction terms and by-subject random intercepts. The results of omnibus tests are summarized in Table 4, and the respective estimated coefficients are summarized in the Supplemental Material (Table S3).
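For readers unfamiliar with ordinal regression, a common formulation is the cumulative-logit (proportional-odds) model sketched below. The text does not state which link or parameterization was used, so this form, the dummy coding, and the symbols are assumptions made for illustration. Here $R_{ij}$ is the rating (1 to 5) given by participant $j$ on trial $i$, $\theta_k$ are the ordered thresholds, $G_j$ is a between-subject factor, and $u_j$ is the by-subject random intercept.

```latex
\[
\mathrm{logit}\, P(R_{ij} \le k) = \theta_k - \bigl( \beta_1\,\mathrm{Low}_{ij} + \beta_2\,\mathrm{High}_{ij}
  + \beta_3\,G_j + \beta_4\,(\mathrm{Low}_{ij} \times G_j) + \beta_5\,(\mathrm{High}_{ij} \times G_j) + u_j \bigr),
\qquad k = 1, \dots, 4.
\]
```

In this parameterization, positive values of $\beta_1$ or $\beta_2$ mean that higher likelihood ratings are more probable on outlier trials than on control trials.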
Table 4. Outlier Likelihood Ratings: Results of Omnibus Likelihood Ratio Tests for Regression Models Predicting Outlier Likelihood Ratings as a Function of Key Manipulations in Each Experiment
In all five experiments, the model supported the idea that likelihood ratings were predicted by trial type (Table 4). When participants encountered an outlier higher than the underlying mean (i.e., in high-outlier test trials), they reported higher likelihoods of detecting the outlier than when they did not encounter any outlier (i.e., in control trials). Similarly, when they encountered an outlier lower than the underlying mean (i.e., in low-outlier test trials), participants also reported higher likelihood ratings compared to control trials (see Table S3 in the Supplemental Material). This indicates that participants were able to detect the presence of outliers in test trials.
The interaction between the presence of warning cues and trial type also predicted participants’ outlier likelihood ratings (Table 4). In Experiment 2, the increase in likelihood ratings between test trials (i.e., low-outlier and high-outlier trials) and control trials was larger in the 100%-cue group than in the 70%-cue group, reflecting that participants were more certain about encountering an outlier in trials when the cues were more reliable. Regression results for Experiment 3 also suggested an increase in likelihood ratings when participants were provided with cues of higher reliability, replicating the findings of Experiment 2. In Experiment 5, when warning cues flagged the exact report containing invalid information, the increase in likelihood ratings between test and control trials was even larger (see Table S3).
No consistent evidence was found to support the effects of other factors, including response order (Experiment 3) and task scenario (Experiments 4 and 5) on changes in likelihood ratings across different trial types.
Can people ignore invalid information?
To assess participants’ performance at estimating prevalence rates of side effects (or health improvements, in Experiments 4 and 5), we converted each estimate to an error score by subtracting the true mean of all valid reports the participant viewed on that trial (i.e., excluding outlier values) from the participant’s reported estimate (Figs. 3, 4, and 5). This error score reflects how far a reported estimate was from the ground truth. A positive error value indicates an overestimation of ground truth, and a negative error value indicates an underestimation of ground truth.
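The error score defined above amounts to a one-line computation. The sketch below uses a hypothetical function name and made-up report values; it assumes the position of the outlier among the viewed reports is known from the trial's construction.

```python
def error_score(estimate, viewed_reports, outlier_index=None):
    """Estimate minus the mean of the valid reports viewed on a trial.

    `outlier_index` is the 0-based position of the invalid report among the
    viewed reports, or None on trials without an outlier.
    """
    valid = [v for i, v in enumerate(viewed_reports) if i != outlier_index]
    return estimate - sum(valid) / len(valid)


# Hypothetical high-outlier trial: the third report (14) is invalid.
# Valid reports 8, 7, 9, 8 have a mean of 8.0, so an estimate of 9 is an overestimate.
high_error = error_score(9, [8, 7, 14, 9, 8], outlier_index=2)

# Hypothetical control trial with no outlier.
control_error = error_score(8, [8, 7, 9, 8])

print(high_error, control_error)
```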

Fig. 3. Mean error in each of the experimental conditions of Experiment 1. For each trial, error was computed by first determining the mean of all the report values a participant saw in that trial, excluding any outliers. This mean was then subtracted from the estimate the participant provided. Thus, error is the difference between the participants’ estimates and the mean of all valid reports they viewed on a trial, with positive values indicating an overestimation and negative values indicating underestimation. Overall, estimates were biased in the direction of outliers when they were present. Error bars represent standard errors of the means.

Fig. 4. Mean errors for Experiments 2 and 3. Participants’ estimates in Experiment 2 were biased in the direction of an outlier when it was present (a). This bias was present in both the 70%-cue group and the 100%-cue group. Results from a preregistered replication of Experiment 2 (i.e., Experiment 3) are shown in (b), including a group that was shown no warning cues. Results from Experiment 2 were replicated, and the no-cue group’s estimates were also biased in the direction of outliers. Note that the presence of warning cues in Experiments 2 and 3 was at the trial level. Error bars represent standard errors of the means.

Fig. 5. Mean errors for Experiments 4 and 5. Estimates in Experiment 4 (a) were biased toward outliers, indicating participants were not able to fully ignore them. This was true regardless of whether participants were estimating the underlying rates of health improvements or negative side effects. Mean errors for Experiment 5 (for both cue groups combined) are shown in (b), and results for the no-cue group in Experiment 5 are shown in (c). Results for the cue-present group in Experiment 5 are displayed in (d). Note that the presence of warning cues in Experiment 5 was at the report level. Error bars represent standard errors of the means.
To assess the effect of outlier presence on participants’ estimates, we examined how error scores changed between test trials and control trials. A deviation in error scores between test trials (i.e., trials including either a low or a high outlier) and control trials reflects the biasing effect of outliers on numerical estimation. Note that with this approach, we calculate the estimation bias by comparing outlier trials with control trials rather than with the ground truth.
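The comparison against control trials, rather than against zero error, can be illustrated with a short sketch. The function name `outlier_bias`, the dictionary layout, and the error values below are hypothetical and are not taken from the study.

```python
from statistics import mean

def outlier_bias(errors_by_trial_type):
    """Mean error in each outlier condition minus mean error in control trials."""
    control = mean(errors_by_trial_type["control"])
    return {
        trial_type: mean(errors) - control
        for trial_type, errors in errors_by_trial_type.items()
        if trial_type != "control"
    }

# Hypothetical error scores for one participant (not data from the study).
errors = {
    "control": [-0.5, 0.0, -1.0],
    "low_outlier": [-1.5, -1.0, -2.0],
    "high_outlier": [0.5, 1.0, 0.0],
}

bias = outlier_bias(errors)
# A negative value for low outliers and a positive value for high outliers
# indicates estimates pulled toward the outlier, even when the control-trial
# mean itself sits below zero.
print(bias)
```

Using control trials as the baseline separates the effect of the outlier from any general tendency to over- or underestimate.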
Biasing effects in each experiment
As demonstrated in Table 2, the observed average error scores differed across trial types in all five experiments. Compared to the error scores observed in control trials, the error scores in high-outlier trials were higher, and the error scores in low-outlier trials were lower. A linear mixed-effects regression model was constructed to statistically test this observation. For each experiment, the model predicted error scores as a linear function of predictors including trial type (no outlier, low outlier, high outlier) and the corresponding between-subject groups in that experiment (Table 1). The model also included interaction terms and by-subject random intercepts. The results of omnibus tests are summarized in Table 5, and the estimated coefficients are summarized in the Supplemental Materials (Table S4).
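One way to write a model of this form is given below. This is a sketch under stated assumptions: the dummy coding with control trials as the reference level, the use of a single between-subject factor $G_j$, and all symbol names are illustrative choices. The text specifies only the predictors, the interaction terms, and the by-subject random intercepts.

```latex
\begin{aligned}
e_{ij} &= \beta_0 + \beta_1 L_{ij} + \beta_2 H_{ij} + \beta_3 G_j
        + \beta_4 L_{ij} G_j + \beta_5 H_{ij} G_j + u_j + \varepsilon_{ij},\\
u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Here $e_{ij}$ is the error score of participant $j$ on trial $i$, $L_{ij}$ and $H_{ij}$ indicate low-outlier and high-outlier trials, $G_j$ codes the participant's between-subject group, and $u_j$ is the by-subject random intercept. Under this coding, a biasing effect of outliers would appear as $\beta_1 < 0$ and $\beta_2 > 0$.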
Table 5. Biasing Effects in Each Experiment: Results of Omnibus Tests for Regression Models Predicting Error Scores by Key Manipulations in Each Experiment
In all five experiments, the model indicated an effect of trial type on error scores (Table 5). Moreover, the direction of change in error scores between test and control trials was positively associated with the type of outliers present in the test trials (Table S4). When participants encountered an outlier lower than the underlying mean in low-outlier trials, their error scores were lower than in control trials. In contrast, when participants encountered an outlier higher than the underlying mean in high-outlier trials, their error scores were higher than in control trials. These results support the biasing effects of outliers on participants’ numerical estimates.
The model in Experiment 1 also revealed a main effect of response type on error scores. When participants were explicitly required to report outlier likelihood ratings along with their estimates (i.e., estimate-detection group), their error scores were lower (M = −0.061) compared to when they only reported estimates (i.e., estimate-only group, M = 0.147). However, the interaction between response type and trial type was not significant, suggesting that response type might not markedly affect the degree to which outliers influence participants’ estimates (Fig. 3).
For Experiments 2 and 3, the model revealed no main effect of the trial-level warning cue or its interaction with trial type on error scores (Fig. 4; see also Table 5). In Experiment 2, the results suggested that the reliability of warning cues did not affect participants’ error scores, irrespective of the type of trial. In Experiment 3, the model results additionally suggested that the presence or absence of these cues did not affect participants’ error scores. These results indicate that cues of upcoming invalid reports did not effectively reduce the influence of outliers on participants’ numerical estimates. Additionally, no evidence was found to support the effect of response order (i.e., estimate-detect vs. detect-estimate) in Experiment 3.
On the other hand, in Experiment 5, when 100% reliable cues for misinformation were presented simultaneously with invalid reports (i.e., report-level cues), error scores on test trials were closer to those in control trials (Fig. 5). The model results confirmed an interaction between warning cue and trial type in Experiment 5 (Table 5). Note that in the control conditions, participants’ estimates deviated negatively from the ground truth. This is similar to what we observed in Experiment 4 (as shown in Fig. 5a), suggesting that participants might be biased toward underestimation in these experiments.
In the no-cue group of Experiment 5 (see Fig. 5c), participants’ error scores in outlier conditions deviated from those in the control condition in the direction of the outliers they encountered.
We note that for the high-outlier trials in the cue-present group, the estimation error was negative for the health-improvements group (see Fig. 5d). Thus, it might appear that the presence of the cue caused participants to overly discount the outlier, leading to underestimation. However, the critical comparison is between the error scores of control and outlier trials. Because the error scores for control and high-outlier trials were similar to each other, we concluded that the warning cue simply reduced the biasing effects of the high outliers for both the side-effect and health-improvement framing conditions. We also note that there appeared to be an asymmetry regarding the impact of the cue on high- and low-outlier trials. As shown in Figure 5d, the warning cue had a larger impact in the high-outlier trials compared to the low-outlier trials. This could be mere chance, or it could be related to how people think about different forms of invalid data.
The model results for Experiment 5 also revealed a main effect of task scenario on error scores, but the interaction between task scenario and trial type was not significant (Table 5). On average, error scores were higher in the side-effects group (M = −0.125 ± 1.30) than in the health-improvements group (M = −0.380
We also report the error scores after excluding trials with fewer than five viewed reports in the Supplemental Material (see Tables S7 and S8) because in these trials participants might not encounter any invalid reports. These error-score patterns were similar to those estimated from trials following the preregistered exclusion criteria.
Biasing effects across experiments
The results of within-experiment analyses consistently revealed the biasing effects of outliers on numerical estimation in all experiments, despite different between-subject manipulations across these experiments. However, variations in certain between-subject factors, such as warning cues and task scenarios, appeared to affect estimation accuracy and the biasing effects of outliers in some experiments.
Because the designs of these experiments were highly similar, we conducted a mixed-effects linear regression analysis on data combined from all five experiments to assess the influence of these between-subject factors on the biasing effects of outliers. The mixed linear regression model was constructed to include trial type, warning cue, task scenario, as well as the interaction terms, as predictors for error scores. We note that only the estimate-detect group from Experiment 1 was included, as this response type was used in all of the other experiments (Table 1). The results (n = 764) of omnibus tests are summarized in Table 6, and the estimated coefficients are summarized in the Supplemental Material (Table S5).
In line with our findings from the within-experiment analyses, the results using the combined data confirmed the biasing effects of outliers on participants’ estimates, where error scores were positively associated with the type of outliers participants encountered. Moreover, the model confirmed the interaction between warning cue and trial type (Table 6), showing that 100% reliable warning cues at the report level reduced these biasing effects (see Table S5). Additionally, the model confirmed an effect of task scenario on estimates, suggesting that participants underestimated prevalence rates to a greater extent in the positive scenario (i.e., health improvements) than in the negative scenario (i.e., side effects).
Table 6. Biasing Effects Across Experiments: Results of an Omnibus Test for a Regression Model Predicting Error Scores Combined Across Experiments by Trial Type, Warning Cue, and Task Scenario
Biasing effects when outliers were detected
Finally, we examined whether participants were able to ignore invalid information when they were highly certain about its presence in a trial. To this end, we focused on the subset of trials in each experiment in which an outlier was shown (i.e., test trials) and participants were maximally certain that they had encountered the outlier (i.e., likelihood rating = 5). Table 7 shows that error scores in low-outlier trials still deviated negatively from the ground truth, whereas error scores in high-outlier trials deviated positively.
Table 7. Mean and Standard Deviation of Error Scores Estimated From the Subset of Trials Where Participants Were Certain About Encountering Outliers in Each Experiment
To validate these observations, we fitted a linear mixed-effects regression model to predict error scores for each experiment separately when participants gave likelihood ratings of 5. The predictors included trial type (i.e., high- and low-outlier trials) and the key between-subject manipulations (Table 1), as well as interaction terms and by-subject random intercepts. For these analyses, we only included high- and low-outlier trials and not control trials, because our objective was to see if estimates differed when outliers were present and detected. The results of omnibus tests are summarized in Table 8, and the estimated coefficients are summarized in the Supplemental Material (see Table S6).
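The trial selection for this analysis can be sketched as a simple filter. The record layout (dictionaries with `trial_type`, `rating`, and `error` keys), the function name, and the example values are assumptions made for illustration, and the sketch computes only descriptive means, not the mixed-effects model itself.

```python
from statistics import mean

def certain_outlier_errors(trials, max_rating=5):
    """Mean error by trial type, restricted to outlier trials rated max_rating."""
    kept = {}
    for t in trials:
        if t["trial_type"] == "control":
            continue  # control trials are excluded from this analysis
        if t["rating"] != max_rating:
            continue  # keep only maximally certain responses
        kept.setdefault(t["trial_type"], []).append(t["error"])
    return {trial_type: mean(errs) for trial_type, errs in kept.items()}

# Hypothetical trials (not data from the study).
trials = [
    {"trial_type": "low_outlier", "rating": 5, "error": -1.0},
    {"trial_type": "low_outlier", "rating": 5, "error": -2.0},
    {"trial_type": "low_outlier", "rating": 3, "error": 0.0},   # dropped: rating < 5
    {"trial_type": "high_outlier", "rating": 5, "error": 1.0},
    {"trial_type": "high_outlier", "rating": 5, "error": 0.5},
    {"trial_type": "control", "rating": 5, "error": 4.0},       # dropped: control
]

subset_means = certain_outlier_errors(trials)
print(subset_means)
```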
Table 8. Biasing Effects When Outliers Were Detected: Results of Omnibus Tests for Regression Models Predicting Error Scores Estimated From the Outlier Trials Where Participants Were Maximally Certain of the Presence of Outliers
Note: Control trials were excluded from this analysis.
The models revealed an effect of trial type on error scores in each experiment (Table 8). The presence of high outliers generally resulted in higher estimates than the presence of low outliers (see Table S6). This indicates that participants were still biased by outliers, despite being aware of their presence. The presence of report-level warning cues reduced the influence of high outliers (Experiment 5). In addition, the model revealed a main effect of task scenario on estimates (see Experiments 4 and 5 in Table S6 in the Supplemental Material), suggesting that error scores were lower in the health-improvements scenario than in the side-effects scenario. No consistent evidence was found to support interactions between trial type and the between-subjects factors.
General Discussion
Across five experiments, participants’ numerical estimates were biased in the direction of outliers. Moreover, the biasing effects persisted across two different ranges of stimulus values. Overall, participants’ information-seeking slightly increased when they saw an outlier, but this did not affect the bias. These findings suggest that people may not be able to fully ignore invalid numerical information once they have encountered it.
This is consistent with the CIE and the illusory-truth literature (Hasher et al., 1977; Johnson & Seifert, 1994; Wistrich et al., 2004), which have mainly examined beliefs using verbal information. Our results show that it is difficult for individuals to ignore or disbelieve invalid numerical information that they have encountered. Thus, misinformation can have important consequences for how people form and update beliefs about both verbal and numerical information. Further, this work expanded the traditional CIE paradigm by having participants decide for themselves which pieces of information (numerical reports) were invalid. Even when participants actively detected invalid information and made their final response shortly thereafter (≈10–25 s), they were still influenced by it.
In addition, consistent with previous research on misinformation, it appears that numerical misinformation is difficult to ignore even with interventions. In the current study, receiving warnings about upcoming invalid information increased participants’ ability to detect outliers but did not eliminate the influence of invalid reports on their estimates. Furthermore, when outliers were clearly flagged, the biasing effects were reduced but not eliminated. These findings are also consistent with existing evidence in Brashier et al. (2021), suggesting that the timing of belief revision is critical.
One explanation of these findings could relate to memory-retrieval errors. According to exemplar theories of memory, individuals in our task store instances in memory representing each stimulus they encounter. When participants are asked to make judgments, they sample values from memory and combine them to generate an estimate (André et al., 2022). In this account, invalid information would be encoded, but would have a potentially lower (but nonzero) activation weight than other values encountered on a trial. When participants generate their estimates, the outlier values could be sampled and incorporated into the final judgment, producing estimates biased toward outliers.
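The arithmetic behind this account can be illustrated with a toy calculation. This is not a model fitted or proposed in the study: it replaces stochastic sampling from memory with its expected value (a weighted mean), and the function name, the weights, and the example values are hypothetical.

```python
def weighted_estimate(valid_values, outlier, outlier_weight):
    """Weighted mean in which each valid value has weight 1 and the outlier
    has weight outlier_weight (0 = fully ignored, 1 = treated as valid)."""
    total = sum(valid_values) + outlier_weight * outlier
    return total / (len(valid_values) + outlier_weight)

valid = [18, 19, 20, 21, 22]   # hypothetical valid reports with mean 20
outlier = 35                   # hypothetical high outlier

# Error relative to the valid mean for several outlier weights.
for w in (0.0, 0.25, 1.0):
    print(w, weighted_estimate(valid, outlier, w) - 20)
```

In this sketch the error is zero only when the outlier's weight is exactly zero. Any reduced but nonzero weight still shifts the estimate toward the outlier, which is the pattern the exemplar account would predict.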
It should be noted that the current experiments did not ascertain the exact moment participants detected an outlier. Participants only indicated the likelihood rating of outlier detection at the end of the trial. Thus, we could not determine in which trials participants detected the outlier upon seeing it, which might determine whether they encoded the value. Similarly, we cannot say in which trials participants encoded the outlier as valid and later determined it to be invalid after seeing more values in the set. Future work could probe participants’ detection after each stimulus to investigate this issue.
Another possible explanation of our findings is that our outlier values were not surprising enough. Filipowicz et al. (2018) found that participants who would usually show increased belief updating to surprising stimuli instead showed no updating when new information was extremely surprising. They suggested that their participants may have judged information to be too surprising to be legitimate and ignored it. Our experiments used outliers that were 2.5 to 3.5 SD from the mean in order to mimic deceptive statistics that people may encounter in their daily lives. Future work could employ more extreme outliers to investigate the boundary conditions for incorporating outliers into beliefs.
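To make the outlier magnitude concrete, the sketch below generates one value sequence with an inserted outlier 2.5 to 3.5 SD from the mean. The distribution parameters, the number of reports, the uniform draw of the outlier distance, and the random insertion position are assumptions for illustration; the text states only that values were sampled from Gaussian distributions and that outliers lay 2.5 to 3.5 SD from the mean.

```python
import random

def make_trial(mu, sd, n_valid, direction, rng):
    """Return (values, outlier_index) for one hypothetical test trial.

    direction -- +1 for a high outlier, -1 for a low outlier
    rng       -- a random.Random instance, so results are reproducible
    """
    values = [rng.gauss(mu, sd) for _ in range(n_valid)]
    z = rng.uniform(2.5, 3.5)             # assumed: distance drawn uniformly
    outlier = mu + direction * z * sd
    position = rng.randrange(n_valid + 1) # assumed: random serial position
    values.insert(position, outlier)
    return values, position

# Hypothetical parameters: mean 50, SD 5, nine valid reports, high outlier.
values, position = make_trial(50, 5, 9, +1, random.Random(0))
print(len(values), round(values[position], 2))
```

Raising the bounds passed to `rng.uniform` would be one way to probe the more extreme outliers suggested for future work.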
Constraint of generality
It should also be noted that our experiments utilized a narrow range of values with a small variance, presented in a format that might not reflect data seen in the real world. For example, people often encounter data in the form of percentages, as in demographic information cited by politicians (Prévost & Beaud, 2015), or they encounter large numbers, such as mortality predictions for the COVID-19 pandemic (Allyn, 2020). They may also encounter numerical data that are not normally distributed or data with large volatility, such as stock prices. Future work could employ different numerical formats, magnitudes, and variances to investigate whether the biasing effects of invalid information would generalize to these numerical variations.
We also note that individuals may encounter different types of invalid data. In our experiments, participants were led to believe that fabricated data were generated by bad actors (i.e., researchers trying to manipulate the outcome of medical trials). This is only one type of invalid data people might experience. Invalid data might also arise because of unintentional error. We cannot draw any conclusions about whether the nature of invalid data (i.e., whether it was introduced intentionally or unintentionally) impacts people’s ability to ignore it. Future research is needed to fully understand how different types of invalid data bias people’s beliefs.
Conclusion
The current experiments add to the literature concerning how we deal with misinformation. Much of the current CIE and illusory-truth work examines how we share and spread misinformation and how we may curtail that spread (Chan et al., 2017; Lewandowsky et al., 2012; Pennycook et al., 2021). Few studies have examined how people handle numerical misinformation (but see Stubenvoll & Matthes, 2021). Our work contributes to this literature by showing that people have difficulty ignoring misinformation in the form of invalid numerical reports, even when they do not believe the reports are legitimate. The influence of bad numerical information on beliefs suggests that individuals’ voting, consumer, and health decisions are vulnerable to sloppy data reporting or outright false statistical information.
Supplemental Material
Supplemental material, sj-docx-1-pss-10.1177_09567976241231571 for Can Invalid Information Be Ignored When It Is Detected? by Adam T. Ramsey, Yanjun Liu and Jennifer S. Trueblood in Psychological Science
Footnotes
Transparency
Action Editor: Karen Rodrigue
Editor: Patricia J. Bauer
Author Contributions
References