Abstract

Galton (1907) first demonstrated the “wisdom of crowds” phenomenon by averaging independent estimates of unknown quantities given by many individuals. Herzog and Hertwig (2009; hereafter, H&H) showed that individuals’ own estimation accuracy can be improved by asking them to make two estimates at separate times and averaging those estimates.
Participants in that study estimated the dates of 40 historical events. In the control condition, they made their second estimates without special instructions. In the dialectical-bootstrapping condition, they were instructed to think about why their first estimate might have been wrong before giving their second estimate. H&H claimed that the improvement in accuracy when responses were averaged was far greater among participants in the dialectical-bootstrapping condition than among those in the control condition. We reanalyzed H&H’s data using measures of accuracy that are unrelated to the frequency of identical first and second responses and found that the improvement in accuracy was equal in the two conditions.
Results
For each participant and item, i, H&H subtracted the absolute difference between the average of the participant’s two responses,
The mean value of Adiff was significantly higher in the dialectical-bootstrapping condition (M = 0.046, SE = 0.008) than in the control condition (M = 0.010, SE = 0.008), t(99) = 3.12, p = .002.
H&H did not report a further difference between the conditions that was confounded with Adiff. In the dialectical-bootstrapping condition, the second response (R2) to an item matched the participant’s first response (R1) to that item only 0.7% of the time (SE = 1.4%). In the control condition, participants’ first and second responses matched 20.2% (SE = 1.3%) of the time, significantly more often than in the dialectical-bootstrapping condition, t(99) = 10.3, p < .001. When R1 equals R2, Adiff is 0, and in fact, the median Adiff in the control condition was 0 for 29 of the 51 participants; thus, the mean Adiff across all participants in that condition was close to 0. In the dialectical-bootstrapping condition, only 2 of the 50 participants had a median Adiff of 0. The confounding of these two measures is reflected in the significant correlation between the proportion of identical first and second responses (i.e., p(R1 = R2)) and Adiff, r(99) = −.307, p = .002.
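The link between identical responses and a difference score of 0 can be checked with a few lines of code. The sketch below is an illustration only, not H&H's analysis code: the helper name `item_gain` and the toy values are invented here, and the normalization for item difficulty is left out for brevity.

```python
from statistics import median


def item_gain(r1, r2, truth):
    """Unnormalized accuracy gain from averaging, for one item.

    Hypothetical helper: |R1 - truth| - |mean(R1, R2) - truth|.
    Positive values mean the averaged response is closer to the truth.
    """
    avg = (r1 + r2) / 2
    return abs(r1 - truth) - abs(avg - truth)


# Toy data invented for illustration: (R1, R2, true year)
responses = [
    (1800, 1800, 1789),  # identical responses
    (1500, 1500, 1492),  # identical responses
    (1900, 1900, 1914),  # identical responses
    (1700, 1760, 1776),  # second response moved toward the truth
    (1650, 1600, 1666),  # second response moved away from the truth
]

gains = [item_gain(r1, r2, t) for r1, r2, t in responses]
median_gain = median(gains)
```

Every item with R1 = R2 contributes a gain of exactly 0, whatever its error. In this toy set, three of the five items are repeats, so the median of the difference scores is 0 even though two items changed.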
We prefer to measure accuracy change independently of the proportion of identical first and second responses. Instead of using a median value of the differences in accuracy, such as Adiff, we analyzed a pair of median accuracy values for each participant. In addition, we used absolute measures of accuracy instead of relative (normalized) measures because in hindsight-bias research, relative measures that have similarities to Adiff have “awkward statistical properties” (Pohl, 2007, p. 22).
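To make the contrast between the two kinds of measure concrete, the sketch below computes both for a single hypothetical participant: the median of item-level difference scores, and the difference between a pair of median absolute errors. The function name `abs_errors` and all values are assumptions made for illustration, and no normalization is applied.

```python
from statistics import median


def abs_errors(estimates, truth):
    """Absolute error of each estimate against the true value."""
    return [abs(e - t) for e, t in zip(estimates, truth)]


# Toy data invented for illustration (five items instead of 40)
truth = [1789, 1492, 1914, 1776, 1666]
r1 = [1800, 1500, 1900, 1700, 1650]
r2 = [1800, 1500, 1920, 1800, 1630]  # first two responses are repeats

averaged = [(a + b) / 2 for a, b in zip(r1, r2)]
err_first = abs_errors(r1, truth)
err_avg = abs_errors(averaged, truth)

# Median of the item-level difference scores (the style of measure avoided here)
median_of_differences = median([f - a for f, a in zip(err_first, err_avg)])

# Difference between a pair of medians (the style of measure preferred here)
difference_of_medians = median(err_first) - median(err_avg)
```

With these invented numbers, the median of the difference scores is 0 because the repeated responses occupy the middle of the sorted differences, whereas the pair of medians shows a small gain (14 versus 11). The example only shows that the two summaries can diverge; it makes no claim about how large the divergence is in H&H's data.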
For each participant, we took a pair of values: the median absolute error of R1 across the 40 items, A1, and the median absolute error of the average of R1 and R2 across the same items, Aavg.
We analyzed these data using a mixed-design analysis of variance including the independent variables of response (R1 vs.
These paired accuracy measures (A1 and Aavg) provide no support for the effectiveness of the dialectical-bootstrapping instructions beyond that of the control instructions. Tellingly, the accuracy gain shown by these paired accuracy measures (i.e., A1 − Aavg) was unrelated to p(R1 = R2), r(99) = −.003, p = .97. A robust Wald test showed that this correlation was significantly lower than that between p(R1 = R2) and Adiff, χ²(1, N = 101) = 4.85, p = .03. Means for all of the measures we have discussed are shown in Table 1.
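The confound check described above comes down to two steps: compute each participant's proportion of identical responses, then correlate it with that participant's gain score. A minimal sketch follows, assuming plain Pearson correlation. The helper names and all numbers are invented for illustration, and the robust Wald test comparing two correlations is not reproduced.

```python
from math import sqrt


def proportion_identical(r1, r2):
    """Share of items on which the second response repeats the first."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy participants invented for illustration: (R1 list, R2 list)
participants = [
    ([1800, 1500, 1900, 1700], [1800, 1500, 1900, 1700]),  # all repeats
    ([1800, 1500, 1900, 1700], [1800, 1500, 1920, 1760]),  # half repeats
    ([1800, 1500, 1900, 1700], [1790, 1495, 1920, 1760]),  # no repeats
]
p_same = [proportion_identical(r1, r2) for r1, r2 in participants]

# Invented gain scores, one per participant, chosen to fall as repeats rise
gain_scores = [0.0, 2.0, 4.0]

r = pearson_r(p_same, gain_scores)
```

These toy scores are negatively correlated with the proportion of repeats by construction, which is the pattern a confounded measure would show. A measure free of the confound would give a correlation near 0.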
Results for an additional control condition and several alternative accuracy measures are included in the Supplemental Material available online. Every accuracy measure that, like Adiff, used the median of a set of difference scores (whether or not the value was normalized for item difficulty) was significantly correlated with p(R1 = R2). Every variation that, like A1 and Aavg, used a set of paired accuracy scores for each person (again, whether or not the values were normalized) was not significantly correlated with p(R1 = R2). Only the accuracy measures that were correlated with p(R1 = R2) showed significant differences between the conditions.
Table 1. Mean Values for Various Dependent Measures
Note: R1 = first response; R2 = second response;
Discussion
H&H concluded that the accuracy gained by making a second response and then averaging the pair of responses was significantly greater for participants in the dialectical-bootstrapping condition than for those in the control condition. This conclusion was based on an accuracy change measure that was confounded with the proportion of identical first and second responses. Participants in the dialectical-bootstrapping condition were instructed to “assume that your first estimate is off the mark” and to “make a second, alternative estimate” (H&H, p. 234). Observing a difference between the conditions when using a measure that was confounded with the difference in the proportion of identical responses therefore served only as a manipulation check.
Using measures that are independent of each other is important in many fields of research. For example, in hindsight-bias research, measures of the percentage of perfect recall must be separated from measures of retrieval bias (Pohl, 2007).
When the data were analyzed using measures of accuracy that were uncorrelated with the proportion of identical first and second responses, the difference between the conditions disappeared. People may have some awareness of when they cannot improve upon their first response, and in these cases, they will change their response only if explicitly instructed to do so. There is no evidence in H&H’s data that encouraging people to alter their responses more often than they would if not given special instructions yields more accurate average responses. Dialectical instructions are not needed to achieve the wisdom of many in one mind.
