Abstract
In cluster randomized evaluations, a treatment or intervention is randomly assigned to a set of clusters, each with constituent individual units of observation (e.g., student units that attend schools, which are assigned to treatment). One consideration in these designs is how many units are needed per cluster to achieve adequate statistical power. Typically, researchers state that “about 30 units per cluster” is the most that will yield benefit towards statistical precision. To avoid rules of thumb not grounded in statistical theory and practical considerations, and instead provide guidance for this question, the ratio of the minimum detectable effect size (MDES) to the larger MDES with one fewer unit per cluster is related to the key parameters of the cluster randomized design. Formulas for this subsequent difference effect size ratio (SDESR) at a given number of units are provided, as are formulas for finding the number of units for an assumed SDESR. In general, the point of diminishing returns occurs with smaller numbers of units for larger values of the intraclass correlation.
In cluster randomized evaluations, an intervention is randomly assigned to a subset of clusters, within which there are individual units of observations. For example, n student units each attend
For a hypothetical example of the types of decisions in planning evaluations that this article can help rationalize, suppose an evaluation of a nursing home staff anxiety prevention program is being planned. Treatment is to be assigned by nursing home, and the dependent variable, an anxiety scale, will be measured for each individual licensed practical nurse (LPN). The number of nursing homes available to be randomized is fixed. It is expected that about 30 percent of the variation in the outcome occurs between nursing homes, and a covariate is available that explains about 25 percent of the individual-level variation and 50 percent of the between-cluster variation. Please note these values are entirely hypothetical. A concern is that each of the participating nursing homes has on average ten LPNs, but only about eight LPNs per nursing home are expected to participate in the study. Increasing the effort by evaluation team members and the survey incentives to add one nurse per home would add a considerable amount to the evaluation budget. Is this expenditure worth the resources? The results below allow for a computation that indicates the sensitivity of the study would change by 2 percent if the number of LPNs per nursing home were to change from eight to nine. Whether 2 percent is meaningful and worth the additional resources depends on many contextual factors that must also be considered, many of which are discussed below. This article provides computational tools to better understand these decisions from the point of view of cases per cluster, and in turn offers additional structure to evaluation planning.
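The 2 percent figure in this example can be checked with a short calculation. The sketch below is illustrative only: it assumes the standard two-level MDES form in which, with the number of clusters and the error rates held fixed, the MDES is proportional to the square root of ICC × (1 − cluster-level R²) + (1 − ICC) × (1 − unit-level R²) / n. The function names `design_term` and `sdesr` are hypothetical and are not taken from the article's supplemental R code.

```python
from math import sqrt


def design_term(n, icc, r2_unit=0.0, r2_cluster=0.0):
    """Part of the MDES variance that depends on n (assumed standard form)."""
    between = icc * (1.0 - r2_cluster)       # cluster-level variance left unexplained
    within = (1.0 - icc) * (1.0 - r2_unit)   # unit-level variance left unexplained
    return between + within / n


def sdesr(n, icc, r2_unit=0.0, r2_cluster=0.0):
    """MDES with n units per cluster divided by the MDES with n - 1 units."""
    if n < 2:
        raise ValueError("n must be at least 2 so that n - 1 units is defined")
    numerator = design_term(n, icc, r2_unit, r2_cluster)
    denominator = design_term(n - 1, icc, r2_unit, r2_cluster)
    return sqrt(numerator / denominator)


# Hypothetical nursing home inputs from the text: ICC = .30, a covariate
# explaining 25 percent of unit-level and 50 percent of cluster-level variance.
ratio = sdesr(9, icc=0.30, r2_unit=0.25, r2_cluster=0.50)
print(f"SDESR moving from 8 to 9 LPNs per home: {ratio:.3f}")
print(f"Relative reduction in the MDES: {100 * (1 - ratio):.1f} percent")
```

Under these assumptions the ratio comes out near .98, which is in line with the roughly 2 percent change in sensitivity described above.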
I begin with reasons why consideration of units per cluster is important. I follow this with a brief overview of statistical power and minimum detectable effect sizes (MDES) for two-level cluster-randomized evaluations, and then present a formula for the subsequent difference effect size ratio (SDESR), which summarizes the practical benefit to the MDES of adding one more unit to a given value of n. Using derivations which relate the SDESR to the number of units per cluster (
Why Consider Units per Cluster?
Before considering these formulas, it is important to explain why or when researchers should also consider how many units per cluster are utilized in an evaluation, rather than universally promoting the need to add clusters. Readers will correctly note that adding clusters will typically produce more benefit than adding units per cluster, which is true when there is any variation associated with the cluster level (this is discussed at the end of the Online Appendix). However, the discussion of how many units per cluster is still important for several reasons. First, in many situations, the number of clusters available, or recruitable, is fixed or practically constrained. For example, the number of available units per cluster may have an impact on which clusters could be used in an evaluation, such as in a U.S. state with many rural schools. Assuming only large schools could be included in the evaluation would unnecessarily limit generalizability (because there would be no coverage of smaller schools from which to generalize). Understanding the point at which larger school sizes are no longer practically meaningful may expand school eligibility (in the minds of researchers) for evaluations. Beyond schools, cluster-randomized studies focusing on health outcomes (such as those found at the Prevention Services Clearinghouse, preventionservices.acf.hhs.gov; see Wilson et al., 2019) or policing outcomes (White et al., 2021) may involve much smaller clusters such as shifts, clinics, or even therapy groups. The inclination of many evaluators may be to find large clusters, and so the work here provides a rational mechanism to evaluate the plausible benefit of these designs and reduce concern that small clusters lead to sub-optimal power.
Second, not all studies are able to rely on administrative data for dependent variable measures. Consider writing sample scores as an example. Oftentimes, the cost of measuring this type of outcome is high, and using resources to pay for scoring the writing samples for all students in a school or even an entire classroom is prohibitive. However, this work shows that for many small effects the rational number of units per cluster is less than ten, requiring only a small set of data to be collected. For additional reasons to consider the sample size within clusters, I refer readers to Raudenbush's seminal paper on optimal design (1997), which provides other examples, culminating in the statement “Choosing the optimal within-cluster sample size is a prelude to deciding on the total number of clusters” (p. 174). For Raudenbush, the optimal within-cluster sample size, or number of units per cluster, was related to cost functions and minimizing the sampling variance of the impact estimate. A footnote in Raudenbush (1997) foreshadowed the complexities of optimizing effect sizes, which the present article attempts to unravel.
Through this work I hope to add to the design considerations introduced by Raudenbush, further expanding the set of plausible clusters and dependent measures to support allocating resources for a broader range of studies. To be clear, if there are few constraints on the number of units per cluster for an evaluation and the availability of large clusters is adequate, there is little in this paper that will offer utility for refining evaluation design decisions. If, on the other hand, constraints exist and cluster sizes are naturally small, then this paper will offer additional procedures that will improve discussions while planning cluster randomized studies.
Statistical Power and the Minimum Detectable Effect Size
Power is the chance of obtaining a statistically significant result from an evaluation based on the size of the expected estimate, population parameters, sample design, and analysis (see, e.g., Cohen, 1992). The concept of statistical power is based on the two types of statistical error. Assuming that an intervention is not efficacious, an evaluation that concludes that there is indeed an impact is making a Type I error (incorrectly stating that the means of treatment groups are different when they are, in fact, the same). This error is often denoted α.
If, in fact, an intervention is efficacious, but the evaluation concludes otherwise, then a Type II error has occurred (incorrectly stating that the means of treatment groups are the same when they are, in fact, different). This error is often denoted by the next Greek letter, β.
Power analyses result from computing the chance that a statistical test will exceed the critical value under several assumptions that culminate in the expected test statistic, which is based on the ratio of the mean difference to the standard error of that difference. A brief overview is provided in the Online Appendix, and more detailed summaries can be found in several texts (e.g., Hedberg, 2017b; Liu, 2013; Ryan, 2013). For the general reader, the important points are that the standard error of the mean difference typically includes functions associated with the sample size and the (assumed to be) normally distributed residual variance of the continuous dependent variable. Reorganization of these formulas often converts the mean difference into a standardized mean difference effect size, such as Cohen's d (1992) or Hedges' g (1981), representing the difference between treatment groups in units of standard deviations, and converts other non-sample-size parameters into “scale-free” parameters such as correlations and portions of variance associated with various factors. These parameters are combined to form an expected test statistic, which is used with non-central statistical distributions to find a probability of Type II error, and its complement, statistical power.
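The logic of this paragraph can be sketched numerically. The example below is a simplification and an assumption on my part: it replaces the non-central t distribution with a normal approximation and uses the standard two-level standard error for a standardized mean difference without covariates. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(n, clusters_total, icc, prop_treated=0.5):
    """SE of the standardized mean difference, no covariates (assumed form)."""
    denom = prop_treated * (1 - prop_treated) * clusters_total
    return sqrt(icc / denom + (1 - icc) / (denom * n))


def power_two_tailed(effect_size, se, alpha=0.05):
    """Normal-approximation power for a two-tailed test of the mean difference."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)   # critical value of the test
    expected = effect_size / se           # expected test statistic
    # Chance of landing beyond the critical value in either tail.
    return z.cdf(expected - critical) + z.cdf(-expected - critical)


# 20 treatment and 20 control clusters, 3 units each, ICC = .2.
se = standard_error(n=3, clusters_total=40, icc=0.2)
power = power_two_tailed(0.605, se)
print(f"standard error = {se:.3f}, power = {power:.3f}")
```

For this design an effect size of .605 yields power close to .80 under the approximation, which is why .605 is treated later in the article as the MDES for three units per cluster.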
The expected test statistic can be algebraically equated with a quantity, noted as Q below, which combines values of the standard normal or Student's t distribution (with given degrees of freedom) based on assumed values of the Type I and Type II error. Given Q, algebraic rules can be used to isolate the key parameters of the expected test statistic (effect size, sample size, and other scale-free parameters) to form expressions of the required sample size to achieve a specified level of power for a specified effect size, or the effect size that satisfies a specified level of power and sample size, which is the focus of this article. This effect size was introduced by Bloom (1995) as the minimum detectable effect size (MDES) and is used to understand the sensitivity of a study, given a sample design and level of power. The MDES works much like letters on an eye exam chart: eyes with better (more sensitive) vision can see smaller letters. It is the MDES for cluster-randomized evaluations that is the focus of this analysis.
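A minimal sketch of this idea follows. It assumes Q is built from standard normal quantiles (rather than Student's t) and that the MDES takes the standard two-level form of Q multiplied by the standard error of the standardized mean difference; the names `q_multiplier` and `mdes` are hypothetical and the article's own expressions may differ in detail.

```python
from math import sqrt
from statistics import NormalDist


def q_multiplier(alpha=0.05, power=0.80, two_tailed=True):
    """Sum of the critical value and the power quantile (normal approximation)."""
    z = NormalDist()
    tail = alpha / 2 if two_tailed else alpha
    return z.inv_cdf(1 - tail) + z.inv_cdf(power)


def mdes(n, clusters_total, icc, r2_unit=0.0, r2_cluster=0.0,
         prop_treated=0.5, alpha=0.05, power=0.80):
    """Minimum detectable effect size for a two-level cluster randomized design."""
    denom = prop_treated * (1 - prop_treated) * clusters_total
    between = icc * (1 - r2_cluster)         # unexplained cluster-level share
    within = (1 - icc) * (1 - r2_unit) / n   # unexplained unit-level share, per unit
    return q_multiplier(alpha, power) * sqrt((between + within) / denom)


# 20 treatment and 20 control clusters with an ICC of .2 and no covariates.
for units in (2, 3, 10, 30):
    print(units, round(mdes(units, clusters_total=40, icc=0.2), 3))
```

With two and three units per cluster, this sketch gives values close to the .685 and .605 quoted for Figure 1, which suggests the normal approximation is adequate for illustrating the pattern.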
Properties of the Minimum Detectable Effect Size for Two-Level Cluster Randomized Evaluations
The literature on statistical power for cluster randomized studies has a long history in health (Murray, 1998) and education (Raudenbush et al., 2007). Bloom and colleagues further detailed the MDES for cluster randomized evaluations (Bloom et al., 1999) and detailed the importance of uncorrelated covariates to improve the sensitivity of studies (Bloom et al., 2007). The MDES estimate
Another parameter in the MDES is the intraclass correlation, ICC or
Relationship Between the Number of Units Per Cluster (n) and the Minimum Detectable Effect Size for Two-Level Cluster Randomized Evaluations
As has been noted, larger values of n lead to smaller values of the MDES. However, this relationship is somewhat complex. Figure 1 presents the MDES (A) and the subsequent difference effect size ratio (SDESR, B) for a cluster randomized design with 20 clusters in treatment and 20 in control with an ICC of .2 for a variety of values for n, the number of units per cluster. If we hold the parameters Q, M, f,

Figure 1. The values of the minimum detectable effect size (MDES, A) and subsequent difference effect size ratios (SDESR, B) by number of units per cluster for 20 clusters in treatment, 20 in control, an ICC of .2, and no covariates, for a two-tailed test (
For small numbers of units per cluster, the SDESR is noticeably less than 1, and so the benefits to adding units are meaningful. In Figure 1 A, the MDES for this design with two units per cluster is about .685, and the MDES for three units is .605, and the SDESR of .605 to .685 is
The point of the preceding paragraph is to illustrate the diminishing returns in sensitivity for adding units. For example, in many cases the practical difference between an MDES of .445 (
The SDESR metric, however, is still quite abstract and lacks a basic intuition to allow researchers working in the field to find it useful. To that end, suppose that a minimum absolute change (
Relating Subsequent Difference in Effect Size Ratios to the Intraclass Correlation
The goal of this work is to find an answer to this question: at what number of units per cluster is adding additional units no longer practically beneficial, where “no longer practically beneficial” is based on a high SDESR with a value close to 1? To do this, we must formulate a relation between change in the MDES (operationalized as the SDESR) and the units per cluster (
SDESR as a Function of n and the ICC
In the previous section, I established the SDESR (
For example, suppose an ICC of
In the case of covariates, these expressions have two additional parameters. The first is the proportion reduction in variance at the unit level, noted as
Units per Cluster (n) as a Function of the SDESR and the ICC (or, Finding the Point of Diminishing Returns)
Expressions (7) and (8) can be solved for n to find the function of the ICC and
In the case of covariates, these expressions also include
Illustrations and Intuitions
Tables 1 through 4 offer further illustrations of these results. These tables were produced in R (R Core Team, 2021) using functions detailed in the Online Appendix. Table 1 presents values for the SDESR,
Table 1. Values of the SDESR (
Table 2. Values of the SDESR (
Table 3 presents rounded integer values of PDRn for the same set of ICCs, but for various values of the Benchmarked SDESR (
Table 3. Values of n by Benchmarked SDESR Values (
Table 4. Values of n by Benchmarked SDESR Values (
Tables 5 and 6, employing typical covariate values such as
Table 5. Values of
Table 6. Values of
Across Tables 3–6 is a pattern where larger benchmark effect sizes have higher PDRn values for the same absolute change (a percent of a standard deviation), as these ratios represent smaller and smaller differences from the benchmark. This is congruent with patterns found in the power analysis of other ratios—such as odds ratios in logistic regression—where additional data is required for similar changes to extreme base rates (i.e., base rates near 0 or 1) relative to base rates near .5 (see, e.g., Demidenko, 2007).
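The general direction of this pattern, that SDESR targets closer to 1 imply more units per cluster, can be illustrated by solving the SDESR relation for n. The sketch below is my own derivation under the assumed standard two-level form, not a transcription of the article's numbered expressions: writing a = ICC × (1 − cluster-level R²) and b = (1 − ICC) × (1 − unit-level R²), the squared SDESR equals 1 − b / (n(a(n − 1) + b)), which is a quadratic in n. The function names are hypothetical, and the result is left unrounded because the rounding rule used for the tables is not restated here.

```python
from math import sqrt


def sdesr(n, icc, r2_unit=0.0, r2_cluster=0.0):
    """MDES at n units divided by the MDES at n - 1 units (n may be fractional)."""
    a = icc * (1 - r2_cluster)
    b = (1 - icc) * (1 - r2_unit)
    return sqrt((a + b / n) / (a + b / (n - 1)))


def units_for_sdesr(target, icc, r2_unit=0.0, r2_cluster=0.0):
    """Units per cluster at which the SDESR equals the target (unrounded)."""
    if not 0 < target < 1:
        raise ValueError("target SDESR must lie strictly between 0 and 1")
    if not 0 <= icc < 1:
        raise ValueError("icc must satisfy 0 <= icc < 1")
    a = icc * (1 - r2_cluster)
    b = (1 - icc) * (1 - r2_unit)
    c = 1 - target ** 2                 # shortfall of the squared ratio from 1
    if a == 0:
        return 1 / c                    # no cluster-level variance: SDESR^2 = 1 - 1/n
    # Positive root of a*n^2 + (b - a)*n - b/c = 0.
    return (-(b - a) + sqrt((b - a) ** 2 + 4 * a * b / c)) / (2 * a)


for icc in (0.05, 0.10, 0.20):
    for target in (0.99, 0.999):
        print(icc, target, round(units_for_sdesr(target, icc), 1))
```

In this sketch the solved n grows as the target approaches 1 and shrinks as the ICC rises, which matches the article's statement that the point of diminishing returns arrives sooner for larger ICCs.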
Returning to the hypothetical nursing home example that started the article, the 30 percent variation in the outcome between nursing homes represents an ICC of .3, and the covariate effects of 25 percent explained at the LPN level is
Proposed Procedure
When deciding the number of units that indicate a reasonable point of diminishing returns, I offer the following suggestion for a sequence of computations during the planning of an evaluation. First, prior to any power analyses, use experience, literature, and benchmarks to select a reasonable expected effect size,
Next, use expression (10) to compute the number of units which represents the point of diminishing returns for the benchmark effect size. Note that with this procedure, as with any power analysis, be sure to increase this value based on expected attrition because this represents the final sample size, not the initially sampled set. At this point researchers also have all the necessary information to compute the number of clusters required for the benchmark effect size, and the supplemental material includes R code for these functions as they can be tedious. Different values of the benchmarked SDESR, as a function of the amount of change to the MDES,
Examples Based on Empirical Work and Other Assumptions
I offer the assumption in this article that the use of “typical” values in place of informed assumptions in planning studies is a counterproductive practice, whether it be for required units per cluster, effect sizes to expect, or even ICC values. Taking this at face value, I use this section to showcase how the formulas presented here can inform the planning of studies under various empirically informed scenarios. Suppose researchers are planning an early childhood education evaluation to evaluate an intervention that seeks to increase math scores for third grade students. Data from Hill and colleagues (2008, see Table 5) indicate that the typical impact from the academic intervention studies they reviewed was a quarter of a standard deviation (.25). The range of ICCs across geographies and locales in the United States varies widely. For example, in small districts with 3 to 5 schools serving elementary grades, the school-level ICCs tend to be .05 or less, with ICCs of .1 only appearing in larger school districts with 10 schools serving each grade (see Tables 2 and 3 in Hedberg & Hedges, 2014). Across states, the school-level ICCs (without considering district effects) also vary widely with subject and grade, with third grade ICCs for mathematics scores as high as .24 in Massachusetts and .23 in Colorado, and as low as .05 in West Virginia (see Table 2 in Hedges & Hedberg, 2013). Given this empirical evidence, I offer the results of the following exercise.
Suppose four scenarios for planning research, comprising either expected impacts of .25 or .5 standard deviations in populations with ICCs of either .1 or .2. Next, suppose five power analysis strategies are employed to first find the optimal value of n and then compute the total number of schools with equal allocation to treatment and control: (A) finding the PDRn with
The results of these exercises appear in Table 7, which presents sample sizes that all meet power of .8 for a two-tailed test with the same covariate effectiveness for the respective effect size and ICC values. The first scenario, PDRn with
Table 7. Sample Sizes All Meeting Power of .8 for a Two-Tailed Test (
For example, for an effect size of .25 with an ICC of .1, the PDRn is 7 using the first strategy (A), which requires 74 total schools to achieve power of .8. The second strategy (B) increased the PDRn to 10 and lowered the required clusters to 64. The third strategy (C) produced a much higher PDRn (52) and reduced the number of clusters even more. However, each successive increase in the SDESR increases n and lowers M, but ultimately produces a larger total sample. The required numbers of clusters are similar for scenarios (D) and (E), which assumed round values of n.
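The trade-off described here, fewer clusters but a larger total sample as n grows, can be illustrated with a short calculation. The sketch below assumes the standard two-level MDES form with normal quantiles, solved for the total number of clusters, and it omits covariates. Because the covariate assumptions behind Table 7 are not restated in this section, the printed numbers are not expected to reproduce the table; only the direction of the pattern is the point. The name `clusters_required` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def clusters_required(effect_size, n, icc, r2_unit=0.0, r2_cluster=0.0,
                      prop_treated=0.5, alpha=0.05, power=0.80):
    """Total clusters needed to detect effect_size (normal approximation)."""
    z = NormalDist()
    q = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    term = icc * (1 - r2_cluster) + (1 - icc) * (1 - r2_unit) / n
    allocation = prop_treated * (1 - prop_treated)
    return ceil(q ** 2 * term / (allocation * effect_size ** 2))


# Units per cluster taken from strategies (A), (B), and (C) in the text,
# for an effect size of .25 and an ICC of .1, with no covariates assumed.
results = []
for units in (7, 10, 52):
    clusters = clusters_required(0.25, units, icc=0.1)
    results.append((units, clusters, units * clusters))
    print(f"n = {units:2d}  clusters = {clusters:3d}  total units = {units * clusters}")
```

Each step up in n lowers the required number of clusters while raising the total number of units, so the preferred design depends on the relative cost of recruiting clusters versus measuring units.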
From this exercise, a major takeaway is that there is a wide variety of situations and scenarios, even with a small selection of empirical settings. As a consequence, rules of thumb about sample sizes within clusters are inadequate. Rather than present exact answers, this article provides tools and an operationalization of the key considerations that can lead researchers to answers that apply to their own studies. The antidote to rules of thumb is a set of tools, which are presented here.
Conclusion
In teaching power analyses for cluster randomized designs, most instructors (including this author) will often note in passing that many different combinations of n and M will yield the same chance of detecting an effect size. Table 7 provides a clear example of this phenomenon through a careful consideration of a researcher-controlled parameter of what it means to have diminishing returns, the SDESR. The SDESR can itself be tuned either with a broad threshold (such as .999) or based on changes to a benchmark effect size.
I provide a method to assess how many units are practically beneficial by providing researchers a metric of “beneficial” and employing this metric in a formula to estimate the number of units per cluster. In general, the point of diminishing returns occurs with smaller numbers of units for larger values of the ICC. This is intuitive, as the very design effect that reduces precision in cluster randomized evaluations includes the multiplication of units per cluster (
These results will hopefully help researchers avoid broad rules of thumb about one of the important choices in designing cluster randomized evaluations: the number of units per cluster. In my own experience, including being guilty of advising this, the general advice offered is that after 30 units, it does not make sense to continue to add units. As I stated in the first sections of this article, if clusters available to a given evaluation are large and plentiful and the cost of each unit observation is negligible, there is little here that will greatly impact evaluation designs. However, if clusters are smaller, then ICCs are higher, and this work can shed light on answering questions about “how many units do we really need?”
Finally, these results provide evidence to reject rules of thumb for sample sizes. As shown above, the required sample configuration is entirely dependent on various design parameters and on researcher-defined goals. This is at the core of most statistical analysis. The ordinary least squares (OLS) regression equations are best for a given criterion, minimizing the total sum of squared deviations between the observations and the prediction. Should regressions need to meet other criteria, such as predicting the best median or those used by more recent algorithms employed by data science researchers, then OLS regression is no longer “best.” In this article, I provide expressions for finding values of units per cluster based on the concept of diminishing returns.
Supplemental Material
sj-docx-1-aje-10.1177_10982140221134618 - Supplemental material for How Many Cases per Cluster? Operationalizing the Number of Units per Cluster Relative to Minimum Detectable Effects in Two-Level Cluster Randomized Evaluations with Linear Outcomes
Supplemental material, sj-docx-1-aje-10.1177_10982140221134618 for How Many Cases per Cluster? Operationalizing the Number of Units per Cluster Relative to Minimum Detectable Effects in Two-Level Cluster Randomized Evaluations with Linear Outcomes by E. C. Hedberg in American Journal of Evaluation
Supplemental Material
sj-r-2-aje-10.1177_10982140221134618 - Supplemental material for How Many Cases per Cluster? Operationalizing the Number of Units per Cluster Relative to Minimum Detectable Effects in Two-Level Cluster Randomized Evaluations with Linear Outcomes
Supplemental material, sj-r-2-aje-10.1177_10982140221134618 for How Many Cases per Cluster? Operationalizing the Number of Units per Cluster Relative to Minimum Detectable Effects in Two-Level Cluster Randomized Evaluations with Linear Outcomes by E. C. Hedberg in American Journal of Evaluation
Footnotes
Acknowledgments
The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D200045, Abt Associates. The opinions expressed are those of the author and do not necessarily represent the views of the Institute or the U.S. Department of Education. The author thanks Cris Price and Kristen Neishi for helpful comments on earlier drafts, and the anonymous reviewers for constructive feedback, which markedly improved the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Department of Education, (grant number R305D200045).
Supplemental Material
Supplemental material for this article is available online.
