Abstract
Although researchers often assume their participants are naive to experimental materials, this is not always the case. We investigated how prior exposure to a task affects subsequent experimental results. Participants in this study completed the same set of 12 experimental tasks at two points in time, first as a part of the Many Labs replication project and again a few days, a week, or a month later. Effect sizes were markedly lower in the second wave than in the first. The reduction was most pronounced when participants were assigned to a different condition in the second wave. We discuss the methodological implications of these findings.
When researchers conduct a study, they often assume that participants are naive to the research materials, either because the pool of participants is large (e.g., in the case of Internet samples) or because prior exposure to research is limited (e.g., in the case of first-year college students). However, people may belong to a participant pool for several years, and small numbers of “professional” survey takers tend to dominate responses (Chandler, Mueller, & Paolacci, 2014; Goyder, 1986; Hillygus, Jackson, & Young, 2014). Undergraduate subject pools experience faster turnover than others (1–4 years) but are also likely to be used by researchers with a greater overlap of interests (e.g., researchers in the same lab). People also may belong to multiple participant pools, which increases their exposure to experimental materials (Hillygus et al., 2014), or may gain knowledge of research materials through college courses, other pool members, or media coverage.
Prior exposure to experimental materials can change how participants behave in subsequent related studies through various pathways. For example, performance can be improved through practice (e.g., Chandler et al., 2014), beliefs can change through additional cognitive elaboration (Sherman, 1980; Sturgis, Allum, & Brunton-Smith, 2009), and motivation to perform well can be deterred by boredom or increased by a desire to please the experimenter. Statistically, this can affect both group means (e.g., uniform improvements can lead to a ceiling effect that compresses scores) and standard deviations (e.g., inattention caused by boredom can increase within-condition variance), which can result in changes in observed effect sizes. In practice, the effects of prior participation on experimental data have been identified only in a narrow range of circumstances and typically only in domains in which a hypothesis about the purpose of the experiment can be used to respond in a manner that makes participants look more favorable (for a review, see Weber & Cook, 1972).
The effect of repeated participation may be particularly apparent when participants are assigned to different experimental conditions, as illustrated through comparisons of experiments using within- and between-subjects designs. Exposure to earlier conditions informs responses to subsequent conditions and leads to different results than logically equivalent between-subjects designs (Charness, Gneezy, & Kuhn, 2012; Greenwald, 1976). Depending on the available information, observed effects can be inflated (Fox & Tversky, 1995), attenuated (Hershey & Schoemaker, 1980), or reversed (Birnbaum, 1999).
Information contained within psychological experiments is of little relevance to research participants and should be forgotten quickly. Recently, however, researchers have noted a correlation between responses to psychological measures and indirect measures of prior participation in similar experiments, such as memory of prior participation (Greenwald & Nosek, 2001), the chronological order of studies themselves (Rand et al., 2014), estimates of the total number of completed experiments (Chandler et al., 2014), and naturally varying levels of experience with a task (Mason, Suri, & Watts, 2014). These findings suggest that exposure to research materials can influence effect sizes more generally, but this possibility has not been directly tested. To address this gap, we examined how prior exposure to study materials affects responses.
Method
Design and procedure
To test the effects of nonnaïveté on commonly used psychological measures and to examine potential moderators of these effects, we conducted a two-stage study on Amazon Mechanical Turk (MTurk), an online crowdsourcing service frequently used for experimental research (for a review, see Paolacci & Chandler, 2014). All participants completed a set of experimental tasks in Wave 1 (previously reported in Klein et al., 2014); each task had two conditions, of which participants were assigned to one. Participants were then invited to participate in a study in Wave 2 that included the same tasks. Sample size for Wave 1 was predetermined for an existing study. Sample size for Wave 2 consisted of all participants from Wave 1 who responded to the survey invitation. For each task, assignment to experimental condition was randomized at the individual task level, and thus participants completed each task in either the same condition as in Wave 1 or in the alternative condition.
Further, we explored whether the effect of prior exposure is more pronounced under conditions in which it should be particularly easy to remember previous materials. The visual similarity of the tasks was manipulated by randomly assigning participants to complete the tasks on either the same online platform or a different, visually distinct platform. The duration between the completion of Wave 1 and Wave 2 was also manipulated by randomly assigning participants to be recontacted a few days, about a week, or about a month later. This produced a 3 (time delay: a few days, about a week, about a month; between participants) × 2 (platform: same, different; between participants) × 2 (condition: same, different; between participants) design. This study was approved by the University of Michigan Institutional Review Board.
Wave 1
One thousand adults were recruited from MTurk to participate in a “decision making and attitudes survey” as a part of a larger research study (Klein et al., 2014). Participants were restricted to those classified as U.S. residents who had completed at least 50 human-intelligence tasks (HITs) with an approval ratio of at least 95%. Participants were paid $0.70 and were told that the experiment would take approximately 13 min to complete. Participants completed 14 between-subjects experimental tasks and an Implicit Association Test (Greenwald, McGhee, & Schwartz, 1998). After finishing all tasks, participants completed additional measures and demographic information (for full details, see Klein et al., 2014). Participants were thanked for their time but were not debriefed.
Wave 2
A unique qualification was created on MTurk for each time-delay period, and qualifications were randomly assigned to all Wave 1 participants who responded from U.S. Internet protocol addresses at Wave 1 (N = 950; see Chandler et al., 2014, for technical details). This qualification was required before MTurk workers could participate in Wave 2, and it ensured that (a) only Wave 1 participants could complete the Wave 2 measures and (b) Wave 1 participants completed the Wave 2 HIT after the time delay to which they were randomly assigned. Workers who completed the Wave 2 HIT were paid $1 for their time. Participants completed the experimental tasks a second time, and then they provided demographic information and indicated whether they remembered completing each task in the past.
Once a Wave 2 HIT was created, the eligible participants received an e-mail sent via the MTurk application programming interface stating that “as a result of your participation in a prior study, you have qualified to complete another set of tasks.” The e-mail went on to inform them that this study was shorter than the first one, the pay rate was higher, and it would likely take about 10 min.
Participants
Six hundred eighty-seven individuals participated in both Wave 1 and Wave 2. Response rates were approximately 72% across all time-delay conditions. To eliminate the possibility that differences in effect size between waves were a result of attrition, we included participants in the analysis only if they completed all experimental tasks in both Wave 1 and Wave 2 from a U.S. Internet protocol address and submitted a payment request (N = 638; 55% women, 45% men; mean age = 36 years, SD = 12.8, range 18–75; 83% White). In cases in which the participant was recorded as attempting the study more than once, only the first attempt was used unless the participant saw none of the tasks in this attempt (i.e., read the consent form only).
Experimental tasks
Participants completed the following tasks in both Wave 1 and Wave 2. For brevity, an Implicit Association Test conducted in Wave 1 was not included in Wave 2. Two additional tasks (not reported here) did not reveal significant between-conditions differences at Wave 1 and were dropped from subsequent analyses (because our purpose was to examine changes in bona fide effects; see Klein et al., 2014, for more details about the experimental tasks).
Allow/forbid
In the allow/forbid task (Rugg, 1941), participants were asked to indicate whether (Condition 1) the United States should allow speeches against democracy or (Condition 2) the United States should forbid speeches against democracy. Participants answered “yes” or “no.”
Anchoring and adjustment
The anchoring task partially replicated a task designed by Jacowitz and Kahneman (1995). In this task, participants made four quantitative estimates (the distance from New York to San Francisco, the population of Chicago, the height of Mount Everest, and how many babies are born per day in the United States). Participants made their estimates after being told that the target is either greater than or less than a specified value, depending on condition.
Gain versus loss framing for combating disease
In the gain-versus-loss-framing task (Tversky & Kahneman, 1981), participants read a disease scenario framed in terms of either gains or losses. In the gain condition, they were asked to imagine that the United States is preparing for the outbreak of a disease and to select from two courses of action: Program A, under which 200 people will be saved, or Program B, under which there is a one-third probability that 600 people will be saved (and no people will die) and a two-thirds probability that no people will be saved (600 people will die). In the loss condition, the scenario was the same except that Program A was framed in terms of loss: Participants read that 400 people will die instead of that 200 will be saved.
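The framing manipulation is informative because the two programs have the same expected outcome and differ only in certainty and wording. A minimal sketch of that arithmetic, using the figures from the scenario (the function and variable names are illustrative, not taken from the study materials):

```python
from fractions import Fraction

TOTAL_AT_RISK = 600  # people at risk in the scenario


def expected_saved(outcomes):
    """Expected number saved, given (probability, number_saved) pairs."""
    return sum(p * saved for p, saved in outcomes)


# Program A: 200 people are saved for certain (equivalently, 400 die).
program_a = expected_saved([(Fraction(1), 200)])

# Program B: one-third chance all 600 are saved, two-thirds chance none are.
program_b = expected_saved([(Fraction(1, 3), 600), (Fraction(2, 3), 0)])

print(program_a, program_b)       # both programs: 200 expected survivors
print(TOTAL_AT_RISK - program_a)  # expected deaths under Program A: 400
```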
Imagined contact
In the imagined-contact task (closely replicated from a task designed by Husnu & Crisp, 2010), participants in the experimental condition were asked to imagine and describe (for 1 min) meeting a Muslim stranger for the first time, whereas those in the control condition were asked to imagine and describe walking in the outdoors (for 1 min). Participants then responded to four measures indicating their willingness to interact with Muslims.
Low versus high category scales
In the category-scales task (Schwarz, Hippler, Deutsch, & Strack, 1985), participants in the first condition were asked to estimate how much TV they watch daily on a low-frequency Likert-type scale (up to 0.5 hr, 0.5–1 hr, 1–1.5 hr, 1.5–2 hr, 2–2.5 hr, more than 2.5 hr), whereas in the second condition, they rated their estimate on a high-frequency Likert-type scale (up to 2.5 hr, 2.5–3 hr, 3–3.5 hr, 3.5–4 hr, 4–4.5 hr, more than 4.5 hr).
Norm of reciprocity
Conceptually replicating the norm-of-reciprocity task designed by Hyman and Sheatsley (1950), we asked participants whether (a) North Korea should allow American reporters in and allow them to report the news back to American papers and (b) America should allow North Korean reporters into the United States and allow them to report back to their papers. Questions were presented in two different orders, depending on condition, and participants answered “yes” or “no” to both questions.
Quote attribution
In a conceptual replication of a quote-attribution task designed by Lorge and Curtis (1936), we showed participants a quote (“I have sworn to only live free, even if I find bitter the taste of death”) attributed to either George Washington or Osama bin Laden, depending on condition. Participants indicated the extent to which they agreed with the quote (1 = strongly agree, 9 = strongly disagree).
Retrospective gambler’s fallacy
Participants performing the retrospective-gambler’s-fallacy task (Oppenheimer & Monin, 2009) were asked to imagine a man rolling dice. In one condition, the man rolls 3 sixes, whereas in the other condition, he rolls 2 sixes and a three. Participants then estimated how many times the man had rolled the dice before they had entered the room to watch him.
Sunk costs
In the sunk-costs task (Oppenheimer, Meyvis, & Davidenko, 2009), participants in the cost condition were asked to “imagine that your favorite football team is playing an important game. You have a ticket to the game that you have paid handsomely for. However, on the day of the game, it happens to be freezing cold. What do you do?” Participants in the no-cost condition read the same scenario, except that they read that they had received the ticket for free from a friend. All participants rated their likelihood of attending the game on a 9-point scale (1 = definitely stay at home, 9 = definitely go to the game).
Results
Nonnaïveté reduced observed effect sizes
Table 1 shows effect sizes in Wave 1 and Wave 2. For the anchoring effect, results outside of the specified upper and lower anchors were dropped from analysis and treated as missing values. The rightmost column of the table contains the interaction between condition and wave from a within-subjects generalized-linear-modeling analysis with condition, wave, and their interaction as factors; this analysis functioned as a within-subjects significance test of the overall decline in effect sizes. (Correlations across waves for each task and within Wave 2 across tasks can be found in Tables S1 and S2, respectively, in the Supplemental Material available online.)
Table 1. Experimental Effects of Condition and Wave
Note: Positive t values indicate that the effect was in the theoretically predicted direction. Reported t values and degrees of freedom assume equality of variance unless Levene’s test indicated that this assumption was violated. Fractional degrees of freedom were rounded down. The chi-square tests of the interaction between condition and wave used a generalized estimating equation treating Wave 1 and Wave 2 data as nonindependent.
†p < .10. *p < .05. **p < .01. ***p < .001.
As can be seen in Table 1, only one effect size increased from Wave 1 to Wave 2 (for the low-vs.-high-scales task), while 11 of the 12 observed effects declined, p < .01 for a sign test. Declines were statistically significant in 5 of 12 measurements using traditional omnibus tests of the interaction between condition and wave. The largest effects in Wave 1 tended to show correspondingly larger declines, which suggests that a floor effect might have limited our ability to observe significant results for phenomena with smaller effect sizes. Although not statistically significant, declines in smaller effects may have greater practical implications for researchers who evaluate the truth of observed effects through standard frequentist tests of statistical significance.
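The sign test considers only the direction of each change, not its size. The sketch below computes an exact two-sided p value for 11 declines out of 12 effects under the null hypothesis that declines and increases are equally likely. Whether the reported test was one- or two-sided is not stated in the text, so the two-sided form here is an assumption; both versions fall below .01.

```python
from math import comb


def sign_test_two_sided(n_declines, n_effects):
    """Exact two-sided sign test p value under a null probability of 0.5."""
    k = max(n_declines, n_effects - n_declines)
    upper_tail = sum(comb(n_effects, i) for i in range(k, n_effects + 1))
    return min(1.0, 2 * upper_tail / 2 ** n_effects)


p_value = sign_test_two_sided(11, 12)
print(f"{p_value:.4f}")  # 0.0063
```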
Meta-regression (Lipsey & Wilson, 2001) was used to illustrate the average effect of nonnaïveté on participant responses. A meta-analytic approach was selected because the dependent measures of interest consisted of a mixture of categorical and continuous variables and thus could not be integrated into a single analysis until transformed into a common metric. This approach was preferred over simple averaging because it assigns greater weight to effects with smaller standard errors.
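The weighting idea can be sketched as follows. This is only the inverse-variance weighted mean that underlies fixed-effect meta-analysis, not the full meta-regression with predictors used in the study, and the effect sizes and standard errors are hypothetical values chosen for illustration.

```python
def inverse_variance_mean(effects, standard_errors):
    """Fixed-effect weighted mean: each effect is weighted by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    weighted_sum = sum(w * d for w, d in zip(weights, effects))
    return weighted_sum / sum(weights)


# Hypothetical effect sizes (Cohen's d) and standard errors, for illustration only.
hypothetical_d = [0.20, 0.80]
hypothetical_se = [0.05, 0.10]

simple = sum(hypothetical_d) / len(hypothetical_d)
weighted = inverse_variance_mean(hypothetical_d, hypothetical_se)
print(simple, weighted)
```

In this toy case the more precisely estimated effect (the one with the smaller standard error) pulls the weighted mean toward itself, so the weighted mean falls below the simple average.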
To prevent the four measures of anchoring from exerting an undue influence, we estimated a single effect size for them by taking the simple average of effect sizes and standard errors across the four anchoring tasks. Another concern was that increased variance in Wave 2 responses may have resulted from the varied treatment effects of the different conditions in Wave 2. To address this possibility, we estimated effect sizes for Wave 1 and Wave 2 for each of the nine effects of interest under each of the 12 possible combinations of Wave 2 treatment level. This yielded 216 effect-size estimates that were regressed on dummy codes for the different experimental paradigms and wave. Overall, effect sizes declined from Wave 1 (weighted d = 0.82) to Wave 2 (weighted d = 0.63) by d = 0.19, a drop of about 25%.
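The count of 216 estimates follows directly from the design. A short sketch that enumerates the Wave 2 treatment combinations (the labels are paraphrased from the Method section; the variable names are illustrative):

```python
from itertools import product

delays = ["a few days", "about a week", "about a month"]
platforms = ["same platform", "different platform"]
conditions = ["same condition", "different condition"]
waves = ["Wave 1", "Wave 2"]
n_effects = 9  # eight single-measure tasks plus one averaged anchoring effect

# Every combination of delay, platform, and condition assignment at Wave 2.
treatment_cells = list(product(delays, platforms, conditions))
n_estimates = n_effects * len(treatment_cells) * len(waves)
print(len(treatment_cells), n_estimates)  # 12 216
```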
Moderators of the decline of effect sizes
Additional analyses were conducted to examine the effect of the variables of theoretical interest and their higher-order interactions on individual tasks. Generalized linear modeling was used for all effects except the anchoring measures, which were examined together in a single linear mixed model to account for their dependence. Wave 2 condition, whether the participant was in the same condition across both waves (dummy coded), platform, delay, and their higher-order interactions were included in all models. Because of issues with model fit, the analysis of the allow/forbid task had to be simplified by eliminating delay. As can be seen in Table 2, the effect of being in different conditions in each wave (represented by the interaction between condition and same condition) was significant for anchoring, quote attribution, and sunk costs, which suggests that attenuation of these effects was driven by information gained from exposure to both experimental conditions.
Table 2. Effects of Condition, Platform, and Delay in Each Task
Note: All values were obtained from Wald chi-square tests of significance, except for the values for anchoring, which are the results of an analysis of variance.
†p < .10. *p < .05. **p < .01. ***p < .001.
To illustrate how various treatment conditions affected the decline of effect sizes in aggregate, we regressed dummy variables for task (accounting for differences in attenuation across experimental paradigms) and the theoretically relevant factors of same condition, platform, delay, and their higher-order interactions on Wave 2 effect-size estimates using meta-regression (Lipsey & Wilson, 2001). As can be seen in Table 3, the effect of condition was pronounced relative to the effects of other variables and their interactions. The decline in effect sizes from Wave 1 to Wave 2 was examined separately for experiments in which participants were assigned to the same and to different conditions in order to determine whether attenuation was driven solely by being assigned to different conditions. The effect of wave was substantial for participants assigned to different conditions, d = 0.23, but was observed even among participants assigned to the same condition, d = 0.14. In addition to condition, the delay between Wave 1 and Wave 2 played a role, with mean effect sizes returning toward the magnitude observed in Wave 1 as time progressed. Survey platform and the higher-order interactions between theoretically relevant variables had minimal influence on effect size (Table 3). Mean Wave 2 effect sizes across conditions and the grand mean of Wave 1 effect sizes are displayed in Figure 1.
Table 3. Results From the Meta-Regression of Wave 2 Responses
Note: Experimental paradigms are represented as dummy variables with allow/forbid serving as the reference group.

Fig. 1. Average effect size (Cohen’s d) at Wave 1 and Wave 2. Wave 2 results are shown for participants in the same and different conditions as in Wave 1 and for those who used the same and different platforms as in Wave 1, separately for the three delay periods.
Memory for prior participation
A related question is whether people report participating in a prior task and whether this self-reported memory mediates the observed reduction of effect sizes. If the reduction of effect sizes is related to the probability that a participant reports his or her prior participation, researchers could simply exclude participants who report having participated in similar experiments. Memory for participation in prior studies was high but far from unanimous, ranging between 35% and 80% of participants for each task. In addition to reporting whether they had previously participated in the relevant experimental tasks, participants were also asked if they had ever participated in a task in which they sorted images of dinosaur fossils (this hypothetical task was used because it sounded plausible but has likely never been posted on MTurk). It is unlikely that memory for participating in previous studies reflects acquiescence bias, as the proportion of participants who reported sorting dinosaur fossils was close to zero.
To examine whether participants were more likely to recognize an identical (as opposed to a similar) experimental task, we calculated the ratio of tasks that participants remembered versus did not remember participating in previously, both for tasks in which they were assigned to the same condition at Waves 1 and 2 and for tasks in which they were assigned to different conditions at Waves 1 and 2. A 2 (same condition; within participants) × 2 (platform; between participants) × 3 (delay; between participants) mixed-model analysis of variance revealed that participants tended to remember fewer tasks as time passed, F(2, 626) = 16.57, p < .001, ηp² = .05. There was no main effect of condition or platform and no higher-order interactions among variables, all Fs < 1. The amount of time between Wave 1 and Wave 2 was thus the only reliable predictor of self-reported memory for prior participation.
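As a consistency check, partial eta squared can be recovered from an F statistic and its degrees of freedom with the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the delay effect reported above (the function name is illustrative):

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


eta_p2 = partial_eta_squared(16.57, 2, 626)
print(round(eta_p2, 2))  # 0.05
```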
Reporting prior participation does not predict effect-size attenuation
To test whether memory for prior participation was related to the attenuation of effect sizes, we used generalized linear modeling for all effects except the anchoring measures, which were examined together in a single linear mixed model to account for their dependence. Dummy variables for task (accounting for differences in attenuation across experimental paradigms) and the theoretically relevant factors of same condition, memory for prior participation (memory), and their interaction were regressed on Wave 2 effect-size estimates. Because of issues with model fit, the analysis of the allow/forbid task had to be simplified by eliminating the three-way interaction. The effect of remembering prior participation (represented by the interaction between same condition and memory) was significant only for anchoring, F(2, 2181.6) = 5.79, p < .01. For all other tasks, these two- and three-way interactions were not significant, ps > .22.
To examine the overall effect of self-reported memory, we calculated Wave 2 effect sizes for participants who did and did not remember each task, which yielded a total of 18 effect-size measures. Using meta-regression, we regressed effect sizes on (a) dummy variables for the different experimental tasks, (b) a dummy variable indicating whether the effect size represented participants who did or did not report remembering the prior task, (c) a dummy variable indicating whether participants were assigned to the same or a different experimental condition, and (d) the interaction of the latter two variables (Lipsey & Wilson, 2001). Neither the effect of memory, β = 0.12, nor the interaction between memory and same condition, β = 0.11, was significant at the p < .05 level, even under the (possibly liberal) assumption that the responses provided by participants were statistically independent from each other. Moreover, both were substantially smaller than the main effect of being assigned to the same condition, β = 0.23. These findings suggest that self-reported memory for prior participation is at best a poor indicator of which participants will display attenuated effect sizes because of prior participation.
Discussion
Prior exposure to research materials can reduce the effect size of true research findings. When participants in the present study performed the same tasks on two different occasions, effect sizes decreased by about 25% at the second time point. Declines of this magnitude can have a surprisingly large effect on experimental power, even when nonnaive participants are a fraction of the total sample. To illustrate, research has suggested that 10% of all respondents on MTurk are responsible for 40% of all experimental responses (Chandler et al., 2014) and could thus be considered nonnaive. In a subsequent study, these highly productive workers composed 25% of a large sample of workers (Chandler et al., 2014). A 25% reduction among a quarter of the sample would reduce an average-size behavioral-science effect of d = 0.43 (Richard, Bond, & Stokes-Zoota, 2003) to d = 0.40 (ignoring further declines due to increased within-group variance created by a mix of naive and nonnaive participants). For a two-condition experiment with 80% power, this would require increasing the sample size from 172 to 200 to compensate (~15%; Faul, Erdfelder, Buchner, & Lang, 2009).
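The arithmetic behind this illustration can be sketched as follows. The sample-size function uses the common normal approximation for a two-sided, two-sample comparison, which is an assumption made here for simplicity; G*Power's calculation based on the noncentral t distribution gives slightly larger totals, which is why the text reports 172 and 200 rather than the values printed below.

```python
from math import ceil
from statistics import NormalDist


def diluted_effect(d_naive, share_nonnaive, reduction):
    """Average effect when a share of the sample shows a proportionally reduced effect."""
    return d_naive * (1 - share_nonnaive * reduction)


def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided, two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# A quarter of the sample is nonnaive and shows a 25% smaller effect.
d_mixed = diluted_effect(0.43, share_nonnaive=0.25, reduction=0.25)
print(round(d_mixed, 2))                   # 0.4
print(2 * n_per_group(0.43))               # 170 in total
print(2 * n_per_group(round(d_mixed, 2)))  # 198 in total
```

Under either calculation, the required sample grows by roughly 15% to 16%, even though only a quarter of the participants are assumed to be nonnaive.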
Effect sizes observed at Wave 2 were generally smaller than those observed at Wave 1, but this difference was not uniformly statistically significant. Larger effect sizes were more likely to demonstrate a significant reduction, perhaps because observable between-conditions variance is constrained for true effects that are closer to zero. Additionally, some effects may be more susceptible than others to attenuation. For example, it is easy to imagine that for the anchoring questions, memory of prior anchoring values may have informed numerical estimates.
Effect sizes were particularly attenuated when participants were exposed to alternative conditions of a task, which suggests that this additional contextual information undermines effect sizes, analogous to participating once in a within-participant experiment (Greenwald, 1976). However, effects were also attenuated among participants exposed to the same condition twice. While there is no direct evidence of a mechanism underlying this decline, one possible explanation is that asking questions multiple times may lead to elaboration (Sherman, 1980; Sturgis et al., 2009). Elaboration could reduce observed effect sizes if it undermines intuitive responses to one or both conditions or if it brings to mind idiosyncratic information that increases within-condition variance. If so, changes in effect sizes should be especially pronounced toward hypothetical or unfamiliar situations (Hirt & Sherman, 1985).
Self-reported participation is an imperfect measure of prior participation. It does not identify all prior participants or even those who demonstrated a particularly large behavioral effect of prior participation. While this may be surprising, it is not inconsistent with the hypothesis that the additional information gained from exposure to both conditions explains the attenuation effect: The source from which information is learned can be forgotten more quickly than or separately from the information itself (Johnson, Hashtroudi, & Lindsay, 1993). This finding suggests that researchers cannot simply ask participants to identify whether they have participated before, and it underscores the importance of directly monitoring prior participation. When this is not possible, researchers should strive to design procedures and stimuli that differ from those known to the tested population (Chandler et al., 2014; Rand et al., 2014) or to increase their sample size to offset the anticipated decrease in power.
Footnotes
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
Funding
Wave 1 of this research was supported by Project Implicit.
Open Practices
All data and materials have been made publicly available via Open Science Framework and can be accessed at https://osf.io/4vx3z/. The complete Open Practices Disclosure for this article can be found at http://pss.sagepub.com/content/by/supplemental-data. This article has received badges for Open Data and Open Materials. More information about the Open Practices badges can be found at https://osf.io/tvyxz/wiki/1.%20View%20the%20Badges/ and
.
