Abstract
Typically, in education and psychology research, the investigator collects data and subsequently performs descriptive and inferential statistics. For example, a researcher might compute group means and use the null hypothesis significance testing procedure to draw conclusions about the populations from which the groups were drawn. We propose an alternative inferential statistical procedure that is performed prior to data collection rather than afterwards. To use this procedure, the researcher specifies how close she or he desires the group means to be to their corresponding population means and how confident she or he wishes to be that this actually is so. We derive an equation that provides researchers with a way to determine the sample size needed to meet the specifications concerning closeness and confidence, regardless of the number of groups.
In the context of structural equation modeling, Wolf, Harrington, Clark, and Miller (2013; also see Marcoulides & Saunders, 2006) have distinguished between inferential statistical approaches featuring operations performed prior to data collection or performed after data collection. Although the present focus is on sample means rather than on structural equation modeling, the distinction nevertheless is useful to keep in mind and it will feature strongly in the present article.
It is possible to identify different inferential statistics camps. Two of these camps fall within the “frequentist” way of thinking, where the basic assumption is that hypotheses are correct or incorrect but cannot take on probabilities other than 0 or 1. One camp, comprising those who wish to use data to reject or fail to reject null hypotheses, favors the null hypothesis significance testing procedure. A second camp, comprising those who wish to use frequentist thinking for parameter estimation, favors confidence intervals. A third camp, comprising those who believe that hypotheses can take on probabilities other than 0 or 1, favors the use of the famous theorem by Bayes. It also is possible, perhaps, to identify a fourth camp that features equivalence testing, where the idea is to control the size of misspecification at a prespecified value so as to assess the goodness of the model at a desired level of confidence (Wellek, 2010). Equivalence testing has been advocated, for example, when the goal is to show that a treatment is equivalent to some other treatment (Walker & Nowacki, 2011), though there are many other uses as well (see Yuan, Chan, Marcoulides, & Bentler, 2016, for a list). Arguably, especially in Yuan et al. (2016), the procedure used to determine the desired sample size for equivalence testing occurs prior to data collection, which distinguishes it from the other three camps.
There has been much disagreement across the camps, with equivalence testing perhaps being the exception (e.g., Bakan, 1966; Berkson, 1938; Cohen, 1994; Fidler & Loftus, 2009; Fisher, 1973; Gigerenzer, 1993; Hoekstra, Morey, Rouder, & Wagenmakers, 2014; Hogben, 1957; Loftus, 1996; Lykken, 1968; Mayo, 1996; Meehl, 1967, 1978; Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2016; Popper, 1983; Rozeboom, 1960; Schmidt, 1996; Schmidt & Hunter, 1997; Suppes, 1994; Thompson, 1992; Trafimow, 2003, 2006; Trafimow & Rice, 2009). For present purposes, it is not necessary to discuss these controversies nor to advocate for any one camp or set of camps over any others. Instead, our goal is to follow up on a particular type of inferential procedure recently suggested by Trafimow (2016) that is not equivalent to any of the foregoing camps, and to address what we consider to be a serious limitation of that procedure. We emphasize that the procedure we propose can be used in conjunction with other procedures and so there is no necessary implication of a competition between procedures.
Based, in part, on a recent article by Trafimow (2016), our proposal differs from those of the other camps, with the possible exception of equivalence testing, in the following ways. First, our a priori procedure (APP) pertains to a different question, as we explain in the subsequent section. Second, as the label suggests, there is no need to collect any data whatsoever to answer that question. The APP is compatible with the other approaches but also works as a stand-alone procedure. Again, we emphasize that there is no necessity here for a competition between procedures.
Being Confident of Being Close
Suppose we pointed out to researchers that it would be much easier to obtain a single participant than to obtain a larger sample of participants. Based on this, we might ask, “Why collect a sample of participants?” After asking questions to get beyond issues such as the importance of getting statistical significance to publish, and publishing to get tenure, and so on, we believe that eventually researchers would converge on the notion that a sample of participants aids researchers in feeling confident that the sample statistic is close to the population parameter, or at a minimum, the larger the sample, the more likely it is to resemble the population. Few researchers would be interested in sample statistics if they felt that these were completely unrepresentative of population parameters. Because the sample statistic that researchers use most widely is the sample mean, as an estimate of the population mean, our focus will be on means.
Well, then, if researchers wish to collect samples of participants so that they can be confident that their sample means are close to the corresponding population means, this implies two important issues: defining close and defining confident. In the case where only a single mean is at issue, there is a formula presented as Equation 1 that gives the number of participants needed (n) as a function of the fraction of a standard deviation defined as close (f) and the z-score that corresponds to the desired level of confidence ($z_c$):
$$n = \left(\frac{z_c}{f}\right)^2 \tag{1}$$
To use the formula, the researcher decides, prior to data collection, how close he or she wishes the sample mean to be to the population mean and the confidence desired of actually being within that distance. For example, the researcher might desire to be within three tenths of a standard deviation of the population mean and to have a probability of .95 of actually being within that distance. The z-score that corresponds to .95 is 1.96, and so the number of participants needed is $n = (1.96/.3)^2 = 42.68$, which rounds to 43 participants.
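As a rough illustration of the arithmetic, the following Python sketch computes the single-mean sample size. It assumes that Equation 1 has the form n = (z/f)^2, which reproduces the worked example above; the function name `participants_for_one_mean` is hypothetical, and the z-score is taken from the standard library rather than from a table.

```python
from math import ceil
from statistics import NormalDist


def participants_for_one_mean(f: float, confidence: float) -> float:
    """Assumed form of Equation 1: n = (z / f) ** 2, returned unrounded."""
    # Two-sided z-score: a confidence of .95 leaves .025 in each tail (z = 1.96).
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return (z / f) ** 2


n = participants_for_one_mean(f=0.3, confidence=0.95)
print(round(n, 2), ceil(n))  # 42.68 43
```

Returning the unrounded value leaves the rounding rule to the caller, because the text rounds some values to the nearest whole number.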
Equation 1 assumes a normally distributed population. To test the importance of adhering to this assumption, Trafimow (2016) performed computer simulations with decidedly nonnormal distributions such as a rectangular distribution, a right triangular distribution, and even an exponential distribution. He found that using distributions that differed substantially from the assumed normal distribution made very little difference in the results. We accept these simulations and the implication that the normality assumption is not an important problem for Equation 1. Nevertheless, we believe that there is an important limitation with respect to Equation 1, specifically, and the Trafimow (2016) article, generally. That is, Equation 1 is fine if there is only a single mean to be considered, but few researchers are concerned with only a single mean. Much more often, there is an experimental group and a control group, so that there are two means and neither Equation 1 nor anything in the Trafimow (2016) article accommodates two or more means. Fairly often, researchers use complex factorial designs with any number of means. Thus, our main goal is to expand the APP to include multiple means.
We also investigate closely related issues. For example, how does the APP differ from power analysis, which is sometimes conducted a priori? Finally, we address additional issues suggested by the foregoing ones, including the implications the APP has for complex experimental designs.
Multiple Means
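It is possible to derive Equation 2, which gives the probability of obtaining one sample mean within a specified fraction of a standard deviation of the population mean, or $p(|\bar{X} - \mu| \le f\sigma)$, where $\Phi$ denotes the standard normal cumulative distribution function:
$$p(|\bar{X} - \mu| \le f\sigma) = 2\Phi(f\sqrt{n}) - 1 \tag{2}$$
Suppose that we now imagine that there are two samples, with sample sizes $n_1$ and $n_2$. Treating the two samples as independent, Equation 3 gives the probability that both sample means are within the specified distance of their corresponding population means:
$$p = \left[2\Phi(f\sqrt{n_1}) - 1\right]\left[2\Phi(f\sqrt{n_2}) - 1\right] \tag{3}$$
More generally, Equation 4 renders the probability of k means all being within their desired distances of the corresponding population means, where $n_i$ is the sample size in condition i:
$$p = \prod_{i=1}^{k}\left[2\Phi(f\sqrt{n_i}) - 1\right] \tag{4}$$
In those cases where there is an equal sample size in each condition ($n = N/k$, where N is the total sample size), Equation 4 reduces to Equation 5:
$$p = \left[2\Phi(f\sqrt{n}) - 1\right]^{k} \tag{5}$$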
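The multiple-means probability can be sketched in a few lines of Python. The sketch assumes that each condition contributes a factor of 2Φ(f√n_i) − 1 and that the samples are independent, so the factors multiply; the function name `prob_all_within` and the example sample sizes are hypothetical.

```python
from math import prod, sqrt
from statistics import NormalDist
from typing import Sequence


def prob_all_within(f: float, sample_sizes: Sequence[int]) -> float:
    """Probability that every sample mean lies within f standard deviations
    of its population mean (assumed product form for independent samples)."""
    phi = NormalDist().cdf  # standard normal cumulative distribution function
    return prod(2 * phi(f * sqrt(n)) - 1 for n in sample_sizes)


# One mean, two equal groups, and two unequal groups, all with f = .2.
print(prob_all_within(0.2, [100]))
print(prob_all_within(0.2, [100, 100]))
print(prob_all_within(0.2, [100, 50]))
```

Passing a list of sample sizes covers both the unequal-n case and the equal-n case, where the list simply repeats the same value k times.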
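Figure 1 explores the implications of Equation 5, where the probability that all of the means will be within f (that is, within the specified fraction of a standard deviation of their corresponding population means) is expressed along the vertical axis as a function of f along the horizontal axis, the number of means (1, 2, 3, 4, or 5), and the total sample size N, which differs across the four panels.

Figure 1. The probability that all of the sample means will be within f as a function of f, the number of means, and N.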
However, a look at the other three panels qualifies these conclusions. As N increases, asymptote is reached at lower levels of f along the horizontal axis. That is, increasing N allows the researcher to specify more stringent intervals in which there is a reasonable probability that all of the means are likely to be within them. More generally, the four panels of Figure 1 allow the researcher to see precisely how the desired precision, the number of means, and the total sample size interact to influence the probability that all of the means are within the specified interval. However, a limitation of Figure 1 is that it is difficult to discern the number of participants that actually are needed for the researcher to be able to be confident that the sample means will be close to their corresponding population means.
Reversing Equation 5 and Its Implications
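With the aid of Figure 1, we have seen how f, k, and number of participants interact to influence the probability that all of the sample means of interest will be within the desired distance of the corresponding population means. However, we have not yet explored how f, k, and the desired probability that the sample means are within the specified distance (p) combine to determine the number of participants needed in each condition (n). Solving Equation 5 for n renders Equation 6, where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function:
$$n = \left[\frac{\Phi^{-1}\left(\frac{p^{1/k} + 1}{2}\right)}{f}\right]^{2} \tag{6}$$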
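A minimal sketch of the reversed calculation follows. It assumes that Equation 6 takes the form n = [Φ⁻¹((p^(1/k) + 1)/2) / f]², obtained by solving the equal-n probability for n; this form reproduces the per-condition values quoted in the text for f = .1 and f = .5. The function name `participants_per_condition` is hypothetical.

```python
from statistics import NormalDist


def participants_per_condition(f: float, k: int, p: float) -> float:
    """Unrounded n per condition so that all k sample means are within
    f standard deviations of their population means with probability p."""
    # Each mean must individually be within f with probability p ** (1 / k).
    z = NormalDist().inv_cdf((p ** (1 / k) + 1) / 2)
    return (z / f) ** 2


for k in (1, 5):
    # Confidence .65 and f = .1, rounded to the nearest whole number.
    print(k, round(participants_per_condition(f=0.1, k=k, p=0.65)))
# 1 87
# 5 301
```

With k = 1 the expression reduces to the single-mean case, because (p + 1)/2 is the cumulative probability that corresponds to the two-sided z-score.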
Figure 2 provides a visual rendition of Equation 6, where n is expressed along the vertical axis as a function of f along the horizontal axis and the number of means (1, 2, 3, 4, or 5, as in Figure 1). In Figure 1, f ranged from .01 to .5, but using this range caused a problem with respect to Figure 2. To understand the problem, consider the case where one wishes to have a .95 probability that five sample means are within .01 standard deviations of their respective population means. The value of n, in this case, according to Equation 6, is 65,985. Not only is this practically unrealistic but it also stretches the vertical axis to the point of making it difficult to see the effects to be described subsequently. To avoid the problem, we used a range of .1 to .5 for f along the horizontal axis. Finally, as in Figure 1, there were four panels, where the probability that all means are within f was set at .65, .75, .85, and .95. Put another way, each panel specifies a different level of confidence that the researcher can have that all of the sample means will be within the specified distance of the population means.

Figure 2. n as a function of f, the number of means, and the desired confidence level.
Let us start with the first panel, where the confidence was set at .65, and with the lowest curve denoting the case where there is only a single mean to be considered. When f is at the minimum value of .1, thereby indicating impressive precision, the number of participants needed to have confidence at .65 is 87 (rounded to the nearest whole number). As one moves toward increasingly less precision, there is a steep drop in the number of participants needed at first but the drop becomes increasingly less steep as one becomes increasingly less precise. Perhaps another way to look at it is to go from right to left, where it can be seen that, at low levels of precision, substantial improvements in precision can be made at very little cost in n. In contrast, as one continues to move leftward along the horizontal axis, if one moves sufficiently far in that direction, even small improvements in precision have a substantial cost in n. This fact suggests that researchers might consider the concept of a “best buy,” that is, how much improvement in f is worth how much cost in n?
Two other effects are worth mentioning. Most obviously, as there are more means to be considered, n must increase accordingly. But this effect of the number of means is qualified by f. Consider, for example, the difference between one mean and five means when f is .1. In this case, when there is one mean, the needed n is 87, but when there are five means, the needed n is 301, for a difference of 214. In contrast, as the level of imprecision increases, differences in the level of n needed to reach f attenuate dramatically. For example, when f = .5, the levels of n needed for one or five means are 3 and 12, respectively, for a difference of 9.
All of the foregoing effects for when confidence is set at .65 become increasingly more dramatic as the level of confidence increases to .75, .85, and .95. This is because, although the panels do not look very different from each other in terms of the shapes of the curves, the extent of the values along the vertical axis increases as the level of confidence increases. For example, we saw earlier that when there are five means, f is .1, and confidence is set at .65, n = 301. But when confidence is set at .95, this value becomes 660. More generally, as confidence increases, so does (a) the effect of the number of means, (b) the effect of precision, and (c) the interaction between the number of means and precision. Figure 3 provides a “blown-up” version of Figure 2 where the vertical axis is restricted to a maximum of 100 participants so as to allow the reader to better view the implications of Equation 6 at sample sizes typically used in research.

Figure 3. n as a function of f, the number of means, and the desired confidence level, with n capped at 100 along the vertical axis.
As dramatic as the foregoing effects might seem to be, they arguably provide underestimates because Figures 2 and 3 represent n along the vertical axis rather than N. To see why this might matter, consider that increasing the number of means necessitates that more participants are needed in two ways. First, as Figures 2 and 3 illustrate, increasing the number of means necessitates an increase in the number of participants in each condition. But, as Figures 2 and 3 fail to illustrate, increasing the number of means also indicates an increase in the number of conditions. Thus, although Figures 2 and 3 are fine for understanding how the number of conditions increases the number of participants needed in each of them, it is necessary to multiply n by the number of conditions (k) to understand the total effects. Figure 4 includes this change: N (rather than n) is represented along the vertical axis as a function of f, k, and the four levels of confidence used in Figure 2 (.65, .75, .85, and .95). As can be seen by attending to the actual values along the vertical axis, all of the effects described with respect to n are much more dramatic with respect to N.

Figure 4. N as a function of f, the number of means (k), and the desired confidence level.
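The step from n to N can be sketched as follows. The sketch assumes the same form of Equation 6 as used for the per-condition values in the text, and it rounds n to the nearest whole number before multiplying by k; that rounding order is an assumption, as is the function name `total_participants`.

```python
from statistics import NormalDist


def total_participants(f: float, k: int, p: float) -> int:
    """Total N = k * n, with n per condition from the assumed Equation 6."""
    z = NormalDist().inv_cdf((p ** (1 / k) + 1) / 2)
    n = round((z / f) ** 2)  # nearest whole number, as in the text's examples
    return k * n


# Total participants for one to five means at f = .1 and confidence .95.
for k in range(1, 6):
    print(k, total_participants(f=0.1, k=k, p=0.95))
```

Because both n and k grow with the number of means, N rises faster than either figure of n alone would suggest.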
The APP and Power Analysis
The APP is similar to a priori power analysis in the sense that both are performed prior to data collection. However, there is a difference. The purpose of power analysis is to determine the N needed to have a reasonable chance of obtaining a statistically significant finding, given that there is an effect to be found. For those who favor confidence intervals over the null hypothesis significance testing procedure, the purpose of power analysis is to determine the N needed to obtain a confidence interval of a desired width. Either way, the purpose of the power analysis is to aid the researcher in what eventually will be an a posteriori analysis. In contrast, the APP can be used in isolation or in conjunction with a posteriori procedures. One starts by specifying how close one wishes the sample means to be to their corresponding population means, and the desired confidence that this actually will be so. From there, the APP provides the number of participants needed. Another way to think about the APP is that the closeness and confidence specifications indicate the conditions needed for the researcher to trust the data, so that as long as the researcher collects the required sample sizes to fulfill these conditions, the researcher can simply trust the resulting means without further inferential analyses, at least to the extent of the specified degree of confidence. The APP recognizes that the extent to which sample means can be trusted as estimates of population means has nothing to do with what the findings actually are but rather depends solely on the size of the samples. We hasten to add, however, that because the APP pertains to drawing conclusions about samples rather than about populations, the researcher is not justified in concluding that the population mean is within the specified distance (f) of the obtained sample mean, with the specified degree of confidence.
There is an additional difference that can be illustrated with an example. Suppose that Experimenter A and Experimenter B perform experiments with an experimental group and a control group. Based on previous research, Experimenter A can count on a very large effect size and Experimenter B anticipates a very small effect size. Because p is, to an important degree, influenced by the effect size, Experimenter A needs only a small sample size whereas Experimenter B needs a large sample size to have a reasonable chance of being able to reject the null hypothesis. In contrast, using the APP, the issue is not about obtaining a particular value for p but about having sample means that can be trusted to accurately estimate population means. From this point of view, it does not follow that Experimenter B needs to have a larger sample size than Experimenter A. From the point of view of trusting whether the sample means accurately estimate the population means, both researchers could use the same sample size and be able to trust their sample means equally.
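The contrast between the two experimenters can be made concrete with a hedged sketch. The power calculation below uses a common textbook normal approximation for a two-sided, two-group comparison, n per group ≈ 2[(z_{1−α/2} + z_{power})/d]²; it is not taken from this article. The effect sizes (d = .8 and d = .2), α = .05, power = .80, and the APP specifications (f = .2, confidence .95) are all hypothetical values chosen for illustration, and the APP function assumes the same form of Equation 6 used elsewhere in the text.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Textbook normal approximation (an assumption, not from the article)
    for a two-sided, two-group test with standardized effect size d."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return 2 * ((z_alpha + z_power) / d) ** 2


def app_n_per_group(f: float, p: float, k: int = 2) -> float:
    """APP sample size per group (assumed Equation 6); no effect size is needed."""
    z = STD_NORMAL.inv_cdf((p ** (1 / k) + 1) / 2)
    return (z / f) ** 2


# Hypothetical effect sizes: large for Experimenter A, small for Experimenter B.
for label, d in (("Experimenter A", 0.8), ("Experimenter B", 0.2)):
    print(label,
          "| power analysis n per group:", ceil(power_n_per_group(d)),
          "| APP n per group:", ceil(app_n_per_group(f=0.2, p=0.95)))
```

The power-analysis column changes with d, whereas the APP column is identical for both experimenters because the expected effect size never enters the calculation.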
Complexity
Many researchers employ complex factorial designs. However, the APP suggests that there is an important problem with this approach that few researchers appreciate. Specifically, as more conditions are added, there are more means, and the probability that all of the means are close to the corresponding population means decreases. Put simply, as the design becomes increasingly complex, the researcher can place less trust in the cell means. But there is more than one reason for this.
Most obviously, keeping the overall sample size (N) constant, increasing the number of conditions necessitates that each cell mean will be based on fewer participants. In turn, fewer participants indicates that less trust can be placed in the resulting cell means. But beyond that, even if more participants are added, so that the sample size per condition remains the same, the foregoing equations imply that the probability that all of the sample means are within the specified distance of the corresponding population means decreases as the number of conditions increases. It is interesting to consider this mathematical fact in the context of the complex designs that are common in psychology. For example, psychologists often use 2 × 2 × 2 designs, and even 2 × 2 × 2 × 2 designs are not particularly uncommon.
A few quick calculations are illuminating. When there is a single condition (k = 1) with 100 participants, there is a better than 95% chance that the sample mean will be within .2 standard deviations of the population mean. But suppose that the design is a 2 × 2 × 2 design, so that there are eight conditions (k = 8). Let us even make an extremely favorable stipulation that we use N = 800 participants so that the number of participants in each condition remains constant at n = 100. Nevertheless, the probability that all of the means will be within .2 standard deviations of the corresponding population means is only .69. And matters become even worse if we consider a 2 × 2 × 2 × 2 design, so that k = 16. Even keeping n at 100 (so that N = 1,600) implies that the probability that all of the means will be within .2 standard deviations of the corresponding population means reduces to .47. Thus, the APP makes salient that there is an important problem with using complex designs, the nature of which otherwise would not be apparent. To reiterate, as the number of conditions increases, less trust can be placed in the obtained sample means.
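These quick calculations can be reproduced with a short sketch. It assumes that the probability for k equal-sized cells is [2Φ(f√n) − 1]^k, which matches the three values quoted above; the function name `prob_all_cells_close` and the tuple encoding of a factorial design are hypothetical conveniences.

```python
from math import prod, sqrt
from statistics import NormalDist
from typing import Sequence


def prob_all_cells_close(levels: Sequence[int], n_per_cell: int, f: float) -> float:
    """Probability that every cell mean of a factorial design is within
    f standard deviations of its population mean (equal n per cell)."""
    k = prod(levels)  # number of cells, e.g. (2, 2, 2) gives k = 8
    return (2 * NormalDist().cdf(f * sqrt(n_per_cell)) - 1) ** k


for design in ((1,), (2, 2, 2), (2, 2, 2, 2)):
    print(design, round(prob_all_cells_close(design, n_per_cell=100, f=0.2), 2))
# (1,) 0.95
# (2, 2, 2) 0.69
# (2, 2, 2, 2) 0.47
```

Holding n per cell at 100 is the favorable case described above; passing a smaller `n_per_cell` shows how much further the probability falls when N is held constant instead.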
Contrasting the Proposed Procedure With Other Procedures
Different procedures entail different issues. Our goal in the present section is to make clear how the APP asks a different question, and comes to a different sort of conclusion, than other procedures. Let us commence by considering the most common statistical procedure, namely, that concerned with testing null hypotheses. The idea of this procedure is to compute p and then reject or fail to reject the null hypothesis. As we pointed out earlier, many researchers have criticized this procedure, with much of the criticism based on the logical fact that one cannot validly draw conclusions about the probabilities of hypotheses given findings from the probabilities of findings given hypotheses. It would take us too far afield to discuss inverse inferences in detail and it is sufficient merely to note that there is an increasing trend for researchers to be concerned with this issue (see Trafimow, 2016, for such a discussion). Confidence intervals are closely related. In fact, it is possible to use confidence intervals as an alternative way to test null hypotheses. In addition, some researchers have suggested that confidence intervals can be used for parameter estimation. Critics of confidence intervals have argued that confidence intervals cannot validly be used for parameter estimation because there is no way to know the probability that the population mean (or difference between population means) is within the constructed interval. Both null hypothesis significance tests and confidence intervals have in common that the goal is to make inverse inferences about probabilities concerning hypotheses or population parameters given sample data. Critics have argued that this is not logically valid.
Bayesians are among the most vociferous claimants that null hypothesis significance tests and confidence intervals cannot validly be used to draw conclusions about populations from sample data. In contrast, the famous theorem by Bayes can be used to draw logically valid inferences about populations from sample data. However, frequentists are critical of Bayesian procedures on the grounds of unsound premises. They ask how one can know the prior probability of a hypothesis, what a prior distribution looks like, and how to handle the catch-all hypothesis of “not the null” (e.g., Mayo, 1996). Thus, critics of null hypothesis significance tests and confidence intervals tend to be dissatisfied based on the logical validity problems involved with inverse inferences, and critics of Bayesian methods tend to be dissatisfied based on what they consider to be questionable premises.
In contrast to null hypothesis significance tests, confidence intervals, and Bayesian procedures, which involve inverse inferences, the APP does not involve inverse inferences. Thus, the question of concern is not, “Can I reject the null hypothesis?” or “Can I conclude that the population mean is likely to be within a constructed interval?” Rather, the question is, “How can I be confident that the sample means are likely to be close to their corresponding population means?” As we suggested earlier, researchers can use the APP to address this last question and still use one of the available a posteriori procedures to address questions about hypotheses or about population means. Alternatively, those researchers who believe that available a posteriori procedures are problematic can use the APP to answer the question about confidence and closeness and leave questions about hypotheses or population means unaddressed.
Conclusion
Although Trafimow (2016), with his emphasis on a priori inferential statistics, provided an advance, we noted an important limitation: his procedure is applicable only when there is a single mean at issue. Our expansion of the procedure to work with multiple means provides researchers with the opportunity to perform a priori inferential statistical analyses with a variety of designs, such as when there are experimental and control conditions, complex factorial designs, and so on.
In turn, however, our expansion suggests additional implications, such as the fact that complex research designs can be quite problematic from the point of view of precision. We also showed how the APP differs from other a priori procedures, such as power analysis: the latter depends importantly on the effect size one expects to obtain, whereas our APP is uninfluenced by the expected effect size. There also is a necessity to follow up power analyses with traditional a posteriori analyses, whereas this is not true of the APP. More generally, and in agreement with Wolf et al. (2013), we believe that researchers have much to gain by performing a priori inferential statistics, and we hope and expect that our expansion of the interaction of the concepts of closeness and confidence to apply to any number of conditions will aid in the future development of the areas of education and psychology.
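Appendix
Assume: $X_1, \ldots, X_n$ are independent observations from a normally distributed population with mean $\mu$ and standard deviation $\sigma$, and $\bar{X}$ is the sample mean.
Then:
$$p(|\bar{X} - \mu| \le f\sigma) = p(|Z| \le f\sqrt{n})$$
where $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a standard normal distribution.
Continuing,
$$p(|Z| \le f\sqrt{n}) = \Phi(f\sqrt{n}) - \Phi(-f\sqrt{n}) = 2\Phi(f\sqrt{n}) - 1,$$
where $\Phi$ is the standard normal cumulative distribution function.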
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
