Abstract
When data analyses produce encouraging but nonsignificant results, researchers often respond by collecting more data. This may transform a disappointing dataset into a publishable study, but it does so at the cost of increasing the Type I error rate. How big of a problem is this, and what can we do about it? To answer the first question, we estimate the Type I error inflation based on the initial sample size, the number of participants used to augment the dataset, the critical value for determining significance (typically .05), and the maximum p value within the initial sample such that the dataset would be augmented. With one round of augmentation, Type I error inflation maximizes at .0975 with typical values from .0564 to .0883. To answer the second question, we review methods of adjusting the critical value to allow augmentation while maintaining p < .05, but we note that such methods must be applied a priori. For the common occurrence of post-hoc dataset augmentation, we develop a new statistic, paugmented, that represents the magnitude of the resulting Type I error inflation. We argue that the disclosure of post-hoc dataset augmentation via paugmented elevates such augmentation from a questionable research practice to an ethical research decision.
“Surely, God loves the .06 nearly as much as the .05.”
It’s a familiar scene. A researcher runs a study, analyzes her data, crosses her fingers and holds her breath… p = .06. Damn.
The study represents many hours of effort, and p = .06 strongly suggests there’s an effect to be found. So, now what? Declare the results nonsignificant and relegate the study to the already bulging file drawer? Perhaps. But a study by John, Loewenstein, and Prelec (2012) suggests that, when faced with this situation, many researchers would not stop at .06. Rather, they would augment the dataset in the hopes of achieving p < .05.
John et al. (2012) label such dataset augmentation a “questionable research practice”—and for good reason. Augmenting a dataset in this manner inflates the Type I error rate (from 5% to 7.7% in a simulation conducted by Simmons, Nelson, & Simonsohn, 2011, and potentially much more severely with multiple rounds of augmentation; Armitage, McPherson, & Rowe, 1969). Nevertheless, the practice remains popular.
The popularity of this practice likely stems from an accurate perception of its benefits and an underestimate of its costs. The benefits are clear: The transformation of a dataset destined for the file drawer into a publishable study. The costs, in contrast, are less obvious: Although augmenting a dataset increases the Type I error rate (Armitage et al., 1969; Simmons et al., 2011), sometimes drastically (Armitage et al.), researchers may fail to realize the magnitude of the Type I error inflation.
A search of the statistical literature reveals a wealth of techniques for augmenting datasets while maintaining a Type I error rate of .05 (e.g., Fitts, 2010; Frick, 1998; Pocock, 1977), but all of these options require an a priori adjustment to the critical value for determining significance. None can be fully applied post hoc.
Unfortunately, this leaves researchers with the unpalatable choice between an ethically questionable research practice and a potentially unpublishable study. In this article, we offer a solution to this dilemma: a new statistic, paugmented, that quantifies the Type I error inflation resulting from a post-hoc decision to augment a dataset. paugmented thus provides researchers with an ethical method of augmenting a dataset, and it provides editors, reviewers, and readers with the information they need to evaluate the ramifications of the dataset augmentation.
Given the counterintuitive nature of the Type I error inflation resulting from augmenting a dataset, we begin with an example designed to demonstrate mathematically why dataset augmentation increases the Type I error rate. We then calculate the precise degree of Type I error inflation caused by dataset augmentation in the specific case simulated by Simmons et al. (2011) and across a range of more general cases. We then review methods developed in other fields to control Type I error rates while allowing dataset augmentation and extend these methods using our inflation calculations. Thereafter, we derive paugmented and demonstrate how it represents the magnitude of Type I error inflation resulting from a post-hoc decision to augment a dataset. Finally, we offer some recommendations to researchers considering the risks and benefits of dataset augmentation.
Augmenting Datasets and Type I Error Inflation
Informally, we have encountered the argument that augmenting a dataset does not inflate the Type I error rate because the augmented dataset is still evaluated at p < .05. Similarly, Simmons et al. (2011) report that, “In conversations with colleagues, we have learned that many believe this practice exerts no more than a trivial influence on false-positive rates,” (p. 1361). To get a sense of why augmenting a dataset inflates the Type I error rate, consider a study to determine if a coin is biased towards heads. A researcher flips the coin 100 times and compares the results against a binomial distribution to determine the probability of the observed (or more extreme) results occurring purely by chance. With 100 flips, the ratio becomes significant (p < .05) at 41 tails/59 heads (p = .0443, one-tailed; see top panel of Fig. 1). The ratio is marginally significant (.05 ≤ p < .10) at 42 tails/58 heads (p = .0666) and 43 tails/57 heads (p = .0967). What are the ramifications of the researcher making a conditional decision of the following form, based on the first 100 flips?

If p < .05, stop, declaring the results significant.
If .05 ≤ p < .10, add 100 flips, basing the conclusion on the 200 flips.
If .10 ≤ p, stop, declaring the results nonsignificant.

Figure 1. In the top panel, the probability of a Type I error (i.e., the Type I error rate) is the probability that a fair coin would produce 59 or more heads on 100 flips, which is .0443 according to the binomial distribution. In the bottom panel, the Type I error rate is calculated as follows: Assuming the null hypothesis is true (i.e., the coin has an equal probability of coming up heads or tails), the probability of a Type I error equals the probability of the first 100 flips being significant plus the conditional probability of the full 200 flips being significant times the probability of the first 100 flips being marginally significant: P(Type I error) = P(first 100 flips are significant) + P(full 200 flips are significant | first 100 flips are marginally significant)*P(first 100 flips are marginally significant). With 200 flips, the ratio becomes significant at 87 tails/113 heads. Thus, with 42 tails/58 heads on the first 100 flips, at least 45 tails/55 heads would be needed on the second 100 flips to produce a significant ratio of at least 87 tails/113 heads on the full 200 flips. Similarly, with 43 tails/57 heads on the first 100 flips, at least 44 tails/56 heads would be needed on the second 100 flips to produce a significant ratio of at least 87 tails/113 heads on the full 200 flips. Plugging in values from the binomial distribution yields: P(Type I error) = P(at least 41 tails/59 heads) + P(at least 45 tails/55 heads)*P(exactly 42 tails/58 heads) + P(at least 44 tails/56 heads)*P(exactly 43 tails/57 heads) = .0443 + .1841*.0223 + .1356*.0301 = .0525.
The bottom panel of Figure 1 displays the effects of this conditional decision on the Type I error rate. As can be seen in Figure 1, a conditional decision to flip the coin an extra 100 times if the initial 100 flips are marginally significant increases the Type I error rate from 4.43% to 5.25%—an increase of 18.5%.
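The coin-flip arithmetic above can be checked with exact binomial probabilities. The following is a minimal sketch in Python, not the authors' code; the function names are hypothetical, and it assumes a one-tailed test of a fair coin with the stop / add flips / stop rule described in the text.

```python
from math import comb


def upper_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), computed exactly."""
    k = max(k, 0)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def pmf(k, n):
    """P(X == k) for X ~ Binomial(n, 0.5)."""
    return comb(n, k) / 2 ** n


def critical_heads(n, alpha=0.05):
    """Smallest number of heads whose one-tailed p value falls below alpha."""
    return next(k for k in range(n + 2) if upper_tail(k, n) < alpha)


def type1_rate(n1=100, n2=100, p_crit=0.05, p_max=0.10):
    """Type I error rate of the conditional rule when the coin is fair."""
    k1 = critical_heads(n1, p_crit)           # heads needed on the first n1 flips
    k_all = critical_heads(n1 + n2, p_crit)   # heads needed on all n1 + n2 flips
    rate = upper_tail(k1, n1)                 # significant on the first n1 flips
    for heads in range(k1):
        if p_crit <= upper_tail(heads, n1) < p_max:   # marginal: add n2 flips
            # remaining heads needed on the added flips to reach k_all overall
            rate += pmf(heads, n1) * upper_tail(k_all - heads, n2)
    return rate


print(round(upper_tail(59, 100), 4))   # unconditional rate, about .0443
print(round(type1_rate(), 4))          # conditional rule, about .0525
```

Computing the critical head counts rather than hard-coding 59 and 113 lets the same sketch be rerun with other flip counts or other values of the marginal cutoff.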
To calculate estimates of Type I error inflation applicable across a range of inferential tests, we need to define four values: (a) N1, the initial sample size, (b) N2, the number of participants used to augment the dataset, (c) pcrit, the critical value for determination of significance (typically .05), and (d) pmax, the maximum p value for the first N1 participants such that the sample would be supplemented with an additional N2 participants. With these values, the conditional decision rule is structured as follows:
If, in the initial N1 participants, p < pcrit, stop and declare the results significant.
If, in the initial N1 participants, pcrit ≤ p < pmax, add N2 participants and base the conclusion on the full N1+N2 participants.
If, in the initial N1 participants, pmax ≤ p, stop and declare the results nonsignificant.
(See Fig. 2 for a pictorial representation of the conditional decision rule and the supplemental online materials for information on calculating the Type I error inflation.)

Figure 2. The conditional decision rule. N1 is the initial sample size, N2 is the number of participants used to augment the dataset, pcrit is the critical value for determination of significance, pmax is the maximum p value for the first N1 participants such that the sample would be supplemented with an additional N2 participants, and pactual is the actual Type I error rate resulting from this conditional decision rule.
To provide a sense of how dataset augmentation affects Type I error rates, Figure 3 displays the actual Type I error rates for two-tailed tests¹ across values of pmax ranging from .051 to 1.00 for five curves representing five relationships between N1 and N2 (N2 = N1*1000000, N2 = N1*2, N2 = N1, N2 = N1/2, N2 = N1/1000000). In all cases, pcrit = .05. As can be seen in the N2 = N1*1000000 curve, when N2 overwhelmingly outweighs N1, the overall probability of a Type I error is a linear function of the probability of supplementing the sample with the additional N2 participants. In other words, if N2 is much larger than N1, then augmenting the dataset increases the Type I error rate by an additional 5% times the probability of actually augmenting the dataset (pmax – pcrit). As can be seen in the N2 = N1/1000000 curve, when N1 overwhelmingly outweighs N2, the overall probability of a Type I error remains at 5% (or, more precisely, rises very slightly above 5%). In other words, if N2 is much smaller than N1, then augmenting the dataset will have very little impact on the Type I error rate because there is very little chance that such a small addition to the dataset will change a nonsignificant result into a significant result. Of course, these curves represent boundary conditions that help to illustrate the relationships but are unlikely to arise in actual research situations.
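The first boundary case can be written out as a worked equation. This is a sketch under an idealization: when N2 is much larger than N1, the test on the augmented dataset is treated as independent of the test on the initial sample, and pactual denotes the resulting Type I error rate.

```latex
\[
p_{\mathrm{actual}} \;\approx\; p_{\mathrm{crit}}
  + \left(p_{\mathrm{max}} - p_{\mathrm{crit}}\right) p_{\mathrm{crit}}
  \qquad (N_2 \gg N_1)
\]
\[
p_{\mathrm{crit}} = .05,\; p_{\mathrm{max}} = 1:
\qquad .05 + (1 - .05)(.05) = .0975
\]
```

The second line gives the ceiling for a single round of augmentation, and it equals 1 − .95², the chance that at least one of two independent tests at .05 comes out significant.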

Figure 3. Type I error inflation for a two-tailed test based on N1 (the initial sample size), N2 (the number of participants used to augment the dataset), pcrit (the critical value for determination of significance), and pmax (the maximum p value for the first N1 participants such that the sample would be supplemented with an additional N2 participants). pcrit = .05 for all calculations in this figure.
The three other curves represent more typical research situations. If pmax = .1 (which would arise if a researcher decided to supplement a dataset with marginally significant initial results), the two-tailed Type I error rate is .0564 when N2 = N1*2, .0582 when N2 = N1, and .0597 when N2 = N1/2. Thus, if a researcher will augment a dataset only once and only under limited circumstances (e.g., when p < .10), the Type I error inflation is relatively modest. If pmax = 1 (which would arise if a researcher decided to supplement a dataset if the initial results were marginal or nonsignificant), the two-tailed Type I error rate is .0883 when N2 = N1*2, .0831 when N2 = N1, and .0770 when N2 = N1/2. This last situation (pmax = 1, N2 = N1/2, two-tailed) is analogous to the situation simulated by Simmons et al. (2011). With pcrit = .05, Simmons et al.’s simulation produced a Type I error rate of .077—the same value as our calculated estimate (see Fig. 4). In sum, the degree of Type I error inflation depends on both the circumstances under which the researcher would augment the dataset (with higher values of pmax associated with greater degrees of Type I error inflation) and the number of participants added relative to the number initially collected.
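The rates quoted above can be approximated numerically. The sketch below is not the authors' implementation (their calculation is described in the supplemental materials); it assumes a z-test model in which the initial sample yields a standard normal statistic Z1 under the null, the added participants yield an independent Z2, and the augmented dataset is tested with the sample-size-weighted combination (√N1·Z1 + √N2·Z2)/√(N1 + N2). The function name and the midpoint integration are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def actual_type1(n1, n2, p_crit=0.05, p_max=1.0, steps=4000):
    """Two-tailed Type I error rate of the stop / augment / stop rule.

    Assumption: z statistics combined with square-root-of-n weights.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - p_crit / 2)   # |Z1| above this: significant
    z_max = STD_NORMAL.inv_cdf(1 - p_max / 2)     # |Z1| below this: stop, n.s.
    width = (z_crit - z_max) / steps
    augmented = 0.0
    for i in range(steps):                        # midpoint rule on z_max..z_crit
        z1 = z_max + (i + 0.5) * width
        # values of Z2 beyond which the combined statistic is significant
        upper = (z_crit * sqrt(n1 + n2) - sqrt(n1) * z1) / sqrt(n2)
        lower = (-z_crit * sqrt(n1 + n2) - sqrt(n1) * z1) / sqrt(n2)
        p_reject = (1 - STD_NORMAL.cdf(upper)) + STD_NORMAL.cdf(lower)
        augmented += STD_NORMAL.pdf(z1) * p_reject * width
    return p_crit + 2 * augmented                 # factor 2: negative z1 by symmetry


for n1, n2 in [(100, 200), (100, 100), (100, 50)]:
    print(n1, n2,
          round(actual_type1(n1, n2, p_max=0.10), 4),
          round(actual_type1(n1, n2, p_max=1.0), 4))
```

Under these assumptions the printed values should land close to the rates reported in the paragraph above, and only the ratio of N2 to N1 matters, not the absolute sample sizes.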

Figure 4. Type I error inflation for the situation simulated by Simmons, Nelson, and Simonsohn (2011): N2 = N1/2, pcrit = .05, pmax = 1, two-tailed.
Allowing a second round of dataset augmentation further inflates the Type I error rate. This inflation maximizes at .1426 when pmax = 1, N2 >> N1, and N3 >> N2. When N1 = N2 = N3 and pmax = .1, the two-tailed Type I error rate is .0595. Likewise, when N1 = N2 = N3 and pmax = 1, the two-tailed Type I error rate is .1073 (R functions and Excel spreadsheets for calculating these values are available at http://www.paugmented.com). As can be seen, additional rounds of dataset augmentation can dramatically inflate the Type I error rate, particularly with higher values of pmax.
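A second round of augmentation can be sketched in the same way with a nested integral. The code below is an illustration, not the authors' R or Excel implementation. It assumes the same z-test model with square-root-of-n weights, and it assumes that the same pcrit and pmax are applied to the cumulative data at each look; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def midpoint(f, a, b, steps=200):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def p_reject(s, n_seen, n_new, z_crit):
    """P(|(s + sqrt(n_new)*Z) / sqrt(n_seen + n_new)| > z_crit), Z standard normal."""
    bound = z_crit * sqrt(n_seen + n_new)
    return (1 - STD_NORMAL.cdf((bound - s) / sqrt(n_new))
            + STD_NORMAL.cdf((-bound - s) / sqrt(n_new)))


def type1_two_rounds(n1, n2, n3, p_crit=0.05, p_max=1.0):
    """Two-tailed Type I error rate with up to two rounds of augmentation."""
    z_crit = STD_NORMAL.inv_cdf(1 - p_crit / 2)
    z_max = STD_NORMAL.inv_cdf(1 - p_max / 2)
    n12 = n1 + n2

    def after_first_look(t1):
        s1 = sqrt(n1) * t1                        # unscaled sum after stage 1

        def after_second_look(t2):
            # density of the stage-2 statistic given s1, times P(reject at stage 3)
            z2 = (sqrt(n12) * t2 - s1) / sqrt(n2)
            density = STD_NORMAL.pdf(z2) * sqrt(n12) / sqrt(n2)
            return density * p_reject(sqrt(n12) * t2, n12, n3, z_crit)

        # stage-2 continuation region: z_max < |t2| <= z_crit (both signs)
        third = (midpoint(after_second_look, z_max, z_crit)
                 + midpoint(after_second_look, -z_crit, -z_max))
        return STD_NORMAL.pdf(t1) * (p_reject(s1, n1, n2, z_crit) + third)

    # stage-1 continuation region is z_max < |t1| <= z_crit; double by symmetry
    return p_crit + 2 * midpoint(after_first_look, z_max, z_crit)


print(round(type1_two_rounds(1, 1, 1, p_max=1.0), 4))
print(round(type1_two_rounds(1, 1, 1, p_max=0.10), 4))
```

Under these assumptions the two printed values should fall near the .1073 and .0595 reported above for N1 = N2 = N3.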
To determine which statistical techniques these estimates apply to, we ran a series of Monte Carlo simulations and compared our calculated estimates to the simulation results. We simulated a t test, a correlation, an omnibus one-way ANOVA with five levels, and a planned contrast on a one-way ANOVA with five levels that compared two groups against the other three groups (see Table 1 for simulation results). As can be seen in Table 1, the results of the simulations adhered closely to the calculated estimates, with the largest discrepancies (of .0019) occurring for the omnibus ANOVA. These results suggest that our calculations apply to a variety of statistical analyses including t tests, correlations, ANOVAs, and regressions.
Table 1. Calculated and Simulated Two-Tailed Type I Error Inflation Estimates
Note: Monte Carlo simulations each represent 1,000,000 simulated datasets in which the data were generated using the R function rnorm. For unaugmented datasets, N = 50. For datasets in which N2 = N1*2, N1 = 50 and N2 = 100. For datasets in which N2 = N1, N1 = 100 and N2 = 100. For datasets in which N2 = N1/2, N1 = 100 and N2 = 50. The omnibus ANOVA represents an omnibus one-way ANOVA with five levels. The planned contrast represents a contrast on an ANOVA with five levels that compares three conditions against the other two conditions: contrast weights {−2, −2, −2, 3, 3}.
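A simulation of this kind can be sketched in a few lines. The version below is not the authors' R code: it swaps in a one-sample z test with known standard deviation for the t test, correlation, and ANOVA described above (so that only the Python standard library is needed), and it uses far fewer simulated datasets than the 1,000,000 reported in the note. The function names and default values are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def two_tailed_p(sample):
    """Two-tailed p value of a one-sample z test of mean 0 with known SD 1."""
    z = fmean(sample) * sqrt(len(sample))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate_type1(n1=100, n2=50, p_crit=0.05, p_max=1.0, n_sims=20000, seed=1):
    """Share of null datasets declared significant under the augmentation rule."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        data = [rng.gauss(0, 1) for _ in range(n1)]   # null is true: mean 0
        p = two_tailed_p(data)
        if p_crit <= p < p_max:                       # augment the dataset once
            data += [rng.gauss(0, 1) for _ in range(n2)]
            p = two_tailed_p(data)
        false_positives += p < p_crit
    return false_positives / n_sims
```

Calling `simulate_type1()` with the defaults mirrors the N2 = N1/2, pmax = 1 condition and should give a proportion near .077, within Monte Carlo error; setting `p_max=0.10` mirrors the marginal-results-only condition.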
In the next section, we review techniques that allow for dataset augmentation while holding the Type I error rate at 5%, and we extend these techniques using the calculations represented in Figure 3.
Maintaining p < .05 While Augmenting the Dataset
Although dataset augmentation and its impact on Type I error rates have only recently become salient within the field of psychology, other fields have a long history of wrestling with this issue. Armitage et al. (1969), for example, examined Type I error inflation in the case of pmax = 1, N1 = N2, two-tailed. Armitage et al.’s approach also extended to cases with additional dataset augmentation of the same number of participants (i.e., N1 = N2 = N3), and as Armitage et al. demonstrated, additional rounds of dataset augmentation dramatically increase the Type I error rate. Thereafter, in a classic paper published in Biometrika, Pocock (1977) developed the group sequential design, which divides “patient entry into a number of equal-sized groups so that the decision to stop the trial or continue is based on repeated significance tests of the accumulated data after each group is evaluated” (p. 191). Pocock determined the critical values necessary to maintain a desired Type I error rate for the cases identified by Armitage et al. and identified situations in which group sequential designs have statistical benefits over other types of designs (see also Cui, Hung, & Wang, 1999; Lakens & Evers, 2014, this issue; Lehmacher & Wassmer, 1999; O’Brien & Fleming, 1979; Pocock, 1982).
Expanding on the calculations represented in Figure 3, we determined the values of pcrit necessary to maintain .05 for the same five relationships between N1 and N2 based on pmax for two-tailed tests (see Fig. 5). For example, with pmax = .1 (if a researcher plans to supplement the dataset only if the initial results are marginally significant), the two-tailed value for pcrit is .0433 when N2 = N1*2, .0415 when N2 = N1, and .0399 when N2 = N1/2. With pmax = 1 (if a researcher plans to supplement the dataset if the initial results are marginal or nonsignificant), the two-tailed value for pcrit is .0277 when N2 = N1*2, .0294 when N2 = N1, and .0318 when N2 = N1/2 (R functions and Excel spreadsheets for calculating these values are available at http://www.paugmented.com). Thus, the greater the Type I error inflation (stemming, for example, from a choice of 1 for pmax), the greater the adjustment necessary to the critical value to maintain a Type I error rate of .05.
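The adjusted critical values can be approximated by searching for the pcrit at which the actual Type I error rate equals .05. The sketch below is not the authors' implementation: it assumes the z-test model with square-root-of-n weights, holds pmax fixed while pcrit is lowered, and uses plain bisection. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def actual_type1(n1, n2, p_crit, p_max, steps=2000):
    """Two-tailed Type I error rate of the stop / augment / stop rule."""
    z_crit = STD_NORMAL.inv_cdf(1 - p_crit / 2)
    z_max = STD_NORMAL.inv_cdf(1 - p_max / 2)
    width = (z_crit - z_max) / steps
    augmented = 0.0
    for i in range(steps):                        # midpoint rule on z_max..z_crit
        z1 = z_max + (i + 0.5) * width
        upper = (z_crit * sqrt(n1 + n2) - sqrt(n1) * z1) / sqrt(n2)
        lower = (-z_crit * sqrt(n1 + n2) - sqrt(n1) * z1) / sqrt(n2)
        p_reject = (1 - STD_NORMAL.cdf(upper)) + STD_NORMAL.cdf(lower)
        augmented += STD_NORMAL.pdf(z1) * p_reject * width
    return p_crit + 2 * augmented


def adjusted_p_crit(n1, n2, p_max, target=0.05, tol=1e-6):
    """Bisection for the p_crit at which the actual Type I error rate equals target."""
    low, high = 0.0, min(target, p_max)
    while high - low > tol:
        mid = (low + high) / 2
        if actual_type1(n1, n2, mid, p_max) > target:
            high = mid                            # still inflated: tighten p_crit
        else:
            low = mid
    return (low + high) / 2


print(round(adjusted_p_crit(1, 1, p_max=1.0), 4))
print(round(adjusted_p_crit(1, 1, p_max=0.10), 4))
```

For N2 = N1 the two printed values should come out close to the .0294 and .0415 given above. Bisection is adequate here because, under this model, the actual error rate rises steadily as pcrit increases.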

Figure 5. Values of pcrit necessary to maintain a Type I error rate of .05 for a two-tailed test.
Prior research has also offered some alternative approaches for controlling the Type I error rate while allowing the researcher to augment the dataset (see Baguley, 2012; Lakens & Evers, 2014). Frick (1998) developed an approach applicable when a researcher is willing to accept a more stringent pcrit in exchange for the opportunity to supplement a dataset with multiple small increments. Frick’s technique requires defining N1 (the initial sample size), αlower (the p value below which data collection stops with significant results, equivalent to pcrit), and αupper (the p value above which data collection stops with nonsignificant results, equivalent to pmax). As long as αlower ≤ p ≤ αupper, the researcher can repeatedly supplement the dataset with additional data. Frick’s simulations demonstrate that a Type I error rate of .05 can be maintained with αlower = .01 and αupper = .36 for a two-tailed t test. In comparison, when N1 = N2, pmax = .36, and the test is two-tailed, our technique maintains a Type I error rate of .05 with pcrit = .0318. Thus, the present technique provides a greater probability of the initial sample producing significant results (p < .0318 vs. p < .01), but it allows only one supplemental sample. Botella, Ximénez, Revuelta, and Suero (2006) refined Frick’s technique by adding a maximum sample size beyond which data collection would stop regardless of whether p still lies between αlower and αupper, and Fitts (2010) provides precise values for αlower and αupper for studies with small samples and large effects. Frick (1998), Botella et al. (2006), or Fitts (2010) might be recommended for research situations in which an accurate power analysis is infeasible and when additional participants can be readily run individually or in small sets. Our technique might be recommended for situations in which an accurate power analysis can inform the size of N1 or when multiple rounds of small dataset augmentation are less feasible.
It should be noted that all the techniques discussed above (e.g., Frick, 1998; Pocock, 1977; the adjustments to pcrit shown in Fig. 5) were designed to be applied a priori (see Table 2). If a researcher peeks at his data and would have stopped data collection if p < .05, there is no adjustment possible to pcrit that can restore the Type I error rate to .05 while allowing data collection to continue. Furthermore, if a researcher makes an a priori adjustment to pcrit to allow himself the option of dataset augmentation, the researcher must be disciplined enough to continue data collection even if pcrit ≤ p < .05. In other words, the right to augment the dataset obligates the researcher to continue data collection if pcrit ≤ p, even if the results on the initial N1 participants would have been statistically significant by the usual p < .05 criterion.²,³ Systems that enable researchers to preregister study designs provide an important tool to ensure adherence to such a priori decisions, and we support efforts such as Psychological Science’s “Preregistered” badge (Eich, 2014) that encourage preregistration.
Table 2. Techniques Available at Each Stage of Dataset Augmentation
But as discussed earlier, many researchers find themselves facing unanticipated marginally significant results, and the a priori requirement of these techniques renders them inapplicable in such situations. In the next section, we offer a solution to this dilemma: a method of quantifying and reporting the Type I error inflation that results from a post-hoc decision to augment a dataset.
Ethical Post-Hoc Dataset Augmentation via paugmented
Ideally, researchers would always precede data collection with a careful power analysis, an a priori target sample size, and an a priori decision regarding whether to allow dataset augmentation. In such circumstances, researchers can use the full range of statistical techniques available to maintain a desired Type I error rate.
In practice, a careful power analysis might be infeasible if, for example, little prior research is available to inform the anticipated effect size. Alternatively, an a priori target sample size might not be achieved due to unanticipated difficulty recruiting participants. Or the data might produce a marginally significant effect despite adequate statistical power.
In such situations, many researchers (at least 55.9%, according to the results of John et al., 2012) might make a post-hoc decision to augment the dataset—thus engaging in a “questionable research practice” according to John et al. We believe, however, that it is not the dataset augmentation itself that is questionable. It is the failure to report the dataset augmentation.
To facilitate the reporting of post-hoc dataset augmentation, we offer a new statistic, paugmented, that quantifies the Type I error inflation that results from such augmentation and thus allows researchers to augment a dataset in a fully ethical manner. paugmented is calculated based on the number of participants in the original sample (N1), the number of participants added to the sample (N2), the critical value used to determine statistical significance (pcrit; typically set to .05), and the p value in the final, combined dataset (pcombined). As can be seen in Figure 3, calculating the actual Type I error rate requires an additional piece of information: pmax, the maximum p-value from the original sample such that the researcher would have augmented the dataset. Because of the difficulty of determining a precise value for pmax after the fact, we define paugmented as a range rather than a single value. The low end of the range represents the best-case scenario in which pmax = the p value obtained in the original sample (p1). The high end of the range represents the worst-case scenario in which pmax = 1. The benefit of such an approach is that it does not require the researcher to define her intent. That is, it avoids the need for the researcher to determine post-hoc what she would have done had the results come out differently than they did. (See the Supplemental Materials for information on the calculation of paugmented. R functions and Excel spreadsheets for calculating these values are available at http://www.paugmented.com.)
The following paragraph illustrates how paugmented can be used to ethically and accurately describe post-hoc dataset augmentation:
After running the first 100 participants, the primary comparison between the treatment and control groups was nonsignificant, t(98) = 1.45, p = .15, Cohen’s d = 0.293. We then ran another 50 participants. For the full sample of 150 participants, the comparison between the treatment and control groups was significant, t(148) = 2.35, p = .02, Cohen’s d = 0.386, paugmented = [.055, .057].
This paragraph demonstrates how paugmented enables the researcher to disclose the dataset augmentation and to report the impact of the augmentation. This provides reviewers, editors, and readers with an accurate picture of the data and elevates dataset augmentation from a questionable research practice to an ethical research decision. This paragraph also reveals, however, an inevitable ramification of post hoc dataset augmentation: paugmented will always exceed .05 (or, more generally, paugmented will always exceed pcrit). Given this, we recommend that reviewers, editors, and readers offer some flexibility toward researchers voluntarily disclosing post-hoc dataset augmentation, accepting, for example, the above final p value of .02 and paugmented range of .055 to .057 as providing sufficient evidence for a confident interpretation. Indeed, without the voluntary disclosure of the dataset augmentation, reviewers, editors, and readers would know nothing more than the significant results for the full sample of 150 participants, t(148) = 2.35, p = .02. We believe the slight increase in the Type I error rate (or, perhaps more accurately, the open recognition of the increase in the Type I error rate that is already occurring due to undisclosed post hoc dataset augmentation) is a price worth paying to encourage disclosure.⁴,⁵
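The exact calculation of paugmented is given in the supplemental materials and is not reproduced here. The sketch below is one reading that appears consistent with the worked example, and it should be treated as an assumption rather than the authors' formula: each end of the range is taken to be pcrit plus the probability, under the null, that the initial p value falls between pcrit and pmax and the augmented dataset yields a result at least as extreme as pcombined. It uses the z-test model with square-root-of-n weights, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inflated_rate(n1, n2, p_crit, p_max, p_combined, steps=4000):
    """p_crit + P(p_crit <= p1 < p_max and combined p <= p_combined), null true."""
    z_crit = STD_NORMAL.inv_cdf(1 - p_crit / 2)
    z_max = STD_NORMAL.inv_cdf(1 - p_max / 2)
    z_comb = STD_NORMAL.inv_cdf(1 - p_combined / 2)
    width = (z_crit - z_max) / steps
    total = 0.0
    for i in range(steps):                        # midpoint rule on z_max..z_crit
        z1 = z_max + (i + 0.5) * width
        upper = (z_comb * sqrt(n1 + n2) - sqrt(n1) * z1) / sqrt(n2)
        lower = (-z_comb * sqrt(n1 + n2) - sqrt(n1) * z1) / sqrt(n2)
        tail = (1 - STD_NORMAL.cdf(upper)) + STD_NORMAL.cdf(lower)
        total += STD_NORMAL.pdf(z1) * tail * width
    return p_crit + 2 * total                     # factor 2: negative z1 by symmetry


def p_augmented(n1, n2, p1, p_combined, p_crit=0.05):
    """(low, high) range: best case p_max = p1, worst case p_max = 1."""
    low = inflated_rate(n1, n2, p_crit, p1, p_combined)
    high = inflated_rate(n1, n2, p_crit, 1.0, p_combined)
    return low, high


low, high = p_augmented(n1=100, n2=50, p1=0.15, p_combined=0.02)
print(round(low, 3), round(high, 3))
```

With the values from the example paragraph (N1 = 100, N2 = 50, p1 = .15, pcombined = .02), this reading gives a range close to the reported [.055, .057]. Both ends exceed pcrit by construction, in line with the observation that paugmented always exceeds pcrit.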
Should We Care?
“A billion here, a billion there, pretty soon, you’re talking real money”
Figure 3 shows that the upper limit for the Type I error rate with a single round of dataset augmentation (when pcrit = .05, pmax = 1, and N2 >> N1) is .0975. In most research situations, the Type I error rate would be lower than this. In one sense, a Type I error rate ≤ .0975 doesn’t seem that bad. Top journals often allow a cautious interpretation of “marginally significant” results when p < .10. In discussing techniques for controlling familywise Type I error rates with methods such as the Bonferroni Procedure, Keppel and Wickens (2004) point out that researchers are not obligated to restrict αFW to 5%: “The larger value is a compromise that gives the individual statistical tests more power, achieved by tolerating more overall error” (p. 115). If our field were willing to accept a Type I error rate of ≤ .0975, we would enjoy the benefit of many fewer Type II errors. Furthermore, the decision to not augment a dataset sometimes carries ethical costs. For example, if an experiment inflicted substantial risk, pain, or discomfort on participants, or if an experiment required the sacrifice of animal subjects, the researcher has an ethical obligation to maximize the value of these costly data, which could include augmenting (rather than discarding) the dataset if p = .06 (see also Lakens, 2013, for a cogent argument regarding the obligation of researchers to analyze data as they accumulate).
We are, however, wary of a wholesale indulgence for researchers peeking at their data. As Simmons et al.’s (2011) simulation vividly revealed, a combination of four Type I error-inflating tactics (selecting whichever variable(s) produced better results, adding new data if the initial results were nonsignificant, including gender as a covariate or moderator if it improves the results, and dropping one of three conditions if it improves the results) yielded a remarkably high (and clearly unacceptable) overall Type I error rate of 60.7%. Instead, we recommend that researchers be given the flexibility to examine their data and, if they deem it appropriate, to augment their dataset. But researchers then bear an obligation to report sufficient information so that reviewers, editors, and readers can determine the resulting inflation of the Type I error rate. We believe that paugmented offers a simple mechanism for researchers to meet this obligation.
Two additional issues merit some discussion: meta-analysis and replication. First, it could be correctly noted that our method of combining Z scores (Whitlock, 2005) is, essentially, meta-analytic. Given this, is there a reason to treat the synthesis of an initial sample (N1) with a supplemental sample (N2) any differently than we would treat the synthesis of multiple studies using meta-analysis (a synthesis that is seldom if ever accompanied by a correction for the Type I error rate)? We believe so. In a standard meta-analysis, the researcher synthesizes the results of existing studies. In the research situation explored in this article, the existence of the supplemental sample (N2) is conditional on the statistical results observed in the initial sample (N1). This conditional decision to collect supplemental data based on the results of the initial data causes the inflation of the Type I error rate regardless of whether the resulting augmented dataset is analyzed as a single dataset or as two meta-analytically combined datasets. The use of meta-analysis to analyze N1 and N2 does not solve the problem.
Replication, in contrast, does help. A Type I error in an augmented dataset still has only a 5% chance of being replicated. This 5% is predicated, however, on the replication study not including any Type I error-inflating practices. If such practices are employed in the replication study, the protection that replication normally provides is weakened. If, for example, a researcher conducts an experiment consistent with Simmons et al.’s (2011) simulation and employs all four questionable practices identified by Simmons et al. in attempting to replicate a Type I error, the probability of replication would be 60.7%. Indeed, with all four practices employed (which might, of course, require that the replication use a different dependent variable, an augmented dataset, a different role for gender, and different experimental conditions), the probability of producing and replicating a Type I error would be 60.7% × 60.7% = 36.8%. Full disclosure, of the type advocated by Simmons et al. and by us, would, of course, reveal the weakness of such a replication.
A Slippery Slope?
We have heard the argument that the methods described in this article could be counterproductive if they provide a veneer of integrity that hides other questionable research practices. This argument is predicated on the assumption that a researcher willing to augment a dataset would also be willing to engage in other questionable research practices to achieve significant results. To this, we offer two counterarguments. First, not all questionable research practices are created equal. John et al. (2012) examined 10 questionable research practices, and the self-reported prevalence of these practices differs drastically from the most common (“In a paper, failing to report all of a study’s dependent measures,” 63.4%; “Deciding whether to collect more data after looking to see whether the results were significant,” 55.9%), to less common (“In a paper, failing to report all of a study’s conditions,” 27.7%; “In a paper, reporting an unexpected finding as having been predicted from the start,” 27.0%), to the least common (“In a paper, claiming that results are unaffected by demographic variables [e.g., gender] when one is actually unsure [or knows that they do],” 3.0%; “Falsifying data,” 0.6%). Thus, a researcher willing to augment a dataset will not necessarily be willing to engage in other questionable research practices. Furthermore, the large range of these percentages (from 0.6% to 63.4%) implies that respondents saw some of these practices as acceptable but others as problematic. This suggests that many researchers want to do the right thing but may not be aware of how problematic these questionable research practices are.
Second, all 10 of the questionable research practices examined by John et al. (2012) represent sins of disclosure (or, more precisely, a failure to disclose) either explicitly or implicitly. We believe that providing researchers with a mechanism for disclosing dataset augmentation will help to increase the norms of disclosure in our field—norms that should help to reduce other questionable research practices.
Future Directions
Table 2 lists the stages of dataset augmentation and which of the techniques discussed in this article can be applied in each stage. As can be seen in the table, a number of techniques are available for adjusting the critical value at Stage 1 before data analysis begins to allow dataset augmentation while maintaining a desired Type I error rate. At present, one technique, conditional power analysis, is available for researchers considering dataset augmentation after discovering nonsignificant results at Stage 2. Likewise, only one technique, paugmented, is available for calculating the Type I error inflation resulting from unplanned dataset augmentation at Stage 3. We believe that each of these stages offers potentially fruitful directions for future work.
Researchers at Stage 1 could benefit from models that help guide the optimal allocation of resources under a specified set of opportunities and constraints. Such models could determine, for example, whether a researcher would be better off running a large initial sample with an option of a single round of augmentation or running a potentially large number of small samples. Relevant parameters could include the relative costs of Type I versus Type II errors, opportunities for incremental participant recruitment, the precision of the initial effect size estimate, and so on. Lakens and Evers (2014) provide a useful introduction to the techniques and software available for researchers considering these issues in the context of sequential analyses.
Researchers at Stage 2 could benefit from a power analysis that takes into account (a) the effect size observed in the initial sample, and (b) the fact that the initial sample will be part of the final sample. The latter would require a modification to the traditional methods of power analysis, because the target dataset is not drawn entirely from the population. Rather, it is a mixture of an already collected and analyzed dataset and additional data drawn from the population. Lakens and Evers (2014) discuss the use of such conditional power analyses within adaptive designs in which the initial round of data collection is used to determine the final sample size.
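To make the idea of a conditional power analysis concrete, the following is a minimal sketch, not the procedure described by Lakens and Evers (2014). It assumes a one-sided, one-sample z test with known unit variance, in which the initial n1 observations (with test statistic z1) are pooled with n2 new observations. The function name `conditional_power` and all parameter names are hypothetical. The sketch uses the nominal critical value, so it does not correct for the Type I error inflation that augmentation produces.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def conditional_power(z1, n1, n2, delta, alpha=0.05):
    """Probability that the pooled z test is significant, given the initial z1.

    Assumptions (illustrative only): one-sided, one-sample z test with known
    unit variance; `delta` is the assumed standardized effect size in the
    population from which the n2 additional observations are drawn.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)
    # Pooled statistic: z = (sqrt(n1) * z1 + sqrt(n2) * z2) / sqrt(n1 + n2).
    # Solve z > z_crit for the value z2 must exceed in the new data alone.
    needed_z2 = (z_crit * sqrt(n1 + n2) - sqrt(n1) * z1) / sqrt(n2)
    # Under the assumed effect, z2 is normal with mean delta * sqrt(n2), sd 1.
    return 1 - _STD_NORMAL.cdf(needed_z2 - delta * sqrt(n2))


if __name__ == "__main__":
    z1, n1, n2 = 1.5, 50, 50
    observed_delta = z1 / sqrt(n1)  # effect size estimated from the initial sample
    print(conditional_power(z1, n1, n2, observed_delta))
    print(conditional_power(z1, n1, n2, 0.0))  # chance of significance if the null is true
```

Because the initial sample is fixed, only the n2 new observations contribute sampling variability, which is why the calculation differs from a traditional power analysis. Plugging in the observed effect size, as in the first call, is itself an assumption: an initial sample selected for being nearly significant may overstate the true effect.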
Researchers at Stage 3 (and editors, reviewers, and readers evaluating the work of researchers at Stage 3) could benefit from further statistical work aimed at guiding the interpretation of paugmented. In particular, we believe paugmented could be profitably used in conjunction with what Geoff Cumming has dubbed “the new statistics” (effect sizes, confidence intervals, and meta-analysis; Cumming, 2014) to produce a more nuanced understanding of a given study’s results.
Conclusion
Type I errors matter. Given our field’s current bias toward publishing only statistically significant results, spurious findings that make their way into the published literature can influence the field for years. And, as Simmons et al. (2011) demonstrated through simulation and as we demonstrated mathematically, the common practice of dataset augmentation unquestionably inflates the Type I error rate. We see three possible responses: (a) We can ignore the problem (which has historically been our field’s primary response), (b) we can prohibit dataset augmentation (but such a prohibition would itself carry costs, most notably an increase in the Type II error rate), or (c) we can empower researchers to make informed decisions regarding their data. The techniques described in this article provide three critical pieces of information: the degree of Type I error inflation caused by dataset augmentation, the adjustment to pcrit necessary to restore the Type I error rate to .05, and a mechanism to report the Type I error inflation that results from post-hoc dataset augmentation. These techniques require full disclosure of the a priori and post-hoc decisions made by the researcher, and in this, we fully concur with Simmons et al.’s recommendations that such disclosure become the norm.
Acknowledgements
We thank Jeremy Biesanz for suggesting the Z-method for combining the initial and supplemental samples. We also thank Thom Baguley, Amanda Durik, Greg Francis, Rosanna Guadagno, Alecia Santuzzi, John Skowronski, Michael Wagner, the Northern Illinois University Social-I/O Psychology Area, and the University of Alabama, Tuscaloosa, Department of Psychology for comments and feedback during the development of these ideas. Some of the findings reported here were initially presented at the May 2013 meeting of the Midwestern Psychological Association in Chicago, Illinois.
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
