Abstract
Background:
There is an increased focus on randomized trials for proximal behavioral outcomes in early childhood research. However, planning sample sizes for such designs requires extant information on the size of effect, variance decomposition, and effectiveness of covariates.
Objectives:
The purpose of this article is to employ a recent large representative sample of early childhood longitudinal study kindergartners to estimate design parameters for use in planning cluster randomized trials. A secondary objective is to compare the results of math and reading with the previous kindergartner cohort of 1999.
Research Design:
For each measure, fall–spring gains in effect size units are calculated. In addition, multilevel models are fit to estimate variance components that are used to calculate intraclass correlations (ICCs) and R² statistics. The implications of the reported parameters are summarized in tables of required school sample sizes to detect small effects.
Measures:
The outcomes include information about student scores regarding learning behaviors, general behaviors, and academic abilities.
Results:
Aside from math and reading, there were small gains in these measures from fall to spring, leading to effect sizes between about .1 and .2. In addition, the nonacademic ICCs are smaller than the academic ICCs but are still nontrivial. Use of a pretest covariate is generally effective in reducing the required sample size in power analyses. The ICCs for math and reading are smaller for the current sample compared with the 1999 sample.
Introduction
With the increased focus in education research on evidence derived from randomized trials, the cluster randomized trial is now the standard for most evaluations. In a cluster randomized trial, entire clusters (e.g., schools or classrooms) are assigned to treatment conditions, while the outcomes are measured at the unit level (e.g., students). The chance of rejecting the null hypothesis is quantified as statistical power (Cohen, 1992), which measures the probability of detecting a true effect (1 − Type II error) given a sampling design (e.g., number of clusters and units per cluster) and desired Type I error (α). Planning cluster randomized trials requires extant information to estimate the appropriate sample size and power, including design parameters of the type detailed in this article. In the past 10 years, there has been an increase in the amount of work outlining design parameters.
In early childhood, research has clearly demonstrated that nonacademic outcomes such as internalizing/externalizing behaviors and early cognition lead to later academic success (see, e.g., Claessens, Duncan, & Engel, 2009; Denham, 2006; Denham & Brown, 2010; DiPerna, Lei, & Reid, 2007). As a result, several interventions and programs are targeting not only the distal outcome of academic achievement but also these proximal behavioral outcomes. For example, the Corporation for National and Community Service has several national and state-specific programs that request proposals targeting, in part, social and behavioral outcomes (e.g., http://www.in.gov/serveindiana/files/IN_-_2016_ASN_Notice.pdf). The Institute of Education Sciences also has a long-term grant program that supports research on special education and behavioral outcomes (see https://ies.ed.gov/funding/ncser_progs.asp).
These interventions target nonacademic outcomes in early childhood as a method to produce higher academic gains later in life. However, interim evaluations of these interventions need to test whether differences in the nonacademic outcomes have been achieved, and so these studies need to be powered to detect these effects. Thus, the focus of this article is on kindergartners, because even though nonacademic outcomes are more proximal, the act of randomizing entire clusters (schools) to treatment or control still means that the analysis is bound by the parameters of a multilevel design. As such, planning studies requires the same set of expectations about variance decomposition, effectiveness of covariates, and effect sizes that studies targeting academic outcomes require.
Collection of evidence about nonacademic outcomes in early childhood is ongoing. Some evaluations report parameters for social emotional outcomes (e.g., Rhoades, Greenberg, & Domitrovich, 2009), and systematic collections of parameters for these outcomes, such as Jacob, Zhu, and Bloom (2010), also offer important guides. Unfortunately, the estimates that have been published are the result of limited samples from experiments, and so analysis of a large representative sample, analogous to Hedges and Hedberg (2007), is warranted.
Many researchers suppose that the clustering of proximal outcomes is similar to the clustering behaviors of typical distal outcomes such as math and reading scores. The What Works Clearinghouse (WWC, 2014), for example, points reviewers to an estimate of 10% for the total variation in nonacademic outcomes at the school level when making cluster-based adjustments to statistical tests. These adjustments are required for evaluating research that improperly tests hypotheses without taking the clustered nature of the data into account (Hedges, 2007a). Typically, the WWC proposes 20% for academic outcomes and 10% for nonacademic outcomes. The guidance for academic outcomes is well grounded in empirical evidence (e.g., Hedges & Hedberg, 2007), but the guidance for nonacademic outcomes is less grounded. The 10% guidance for nonacademic outcomes may be too high.
While clustering of academic measures such as math and reading is well documented (e.g., Hedges & Hedberg, 2007), it is plausible that behavioral outcomes are less dependent on organizational efforts and thus will exhibit smaller school effects. In other words, since schools’ main goal is to work toward improving academic outcomes, behavioral outcomes of young children may be less impacted by school activities. Thus, the between-school variance of outcomes such as internalizing or externalizing behaviors may be small, and so the guidance suggesting that 10% of the variation is between schools may be too high. On the other hand, the practical experience of early childhood education involves many nonacademic activities specifically designed to improve nonacademic outcomes, and some activities may impact more proximal, behavioral outcomes. What is needed then are estimates of these crucial design parameters to plan cluster randomized trials seeking to evaluate proximal nonacademic outcomes.
In addition to parameters about variance decomposition, researchers require knowledge about typical growth in the scale of standardized difference in means effect sizes. Bloom, Hill, Black, and Lipsey (2008) have provided empirical benchmarks for a year’s worth of growth in effect size units for math and reading. Their results indicate that reading ability grows by 1.5 standard deviations (SDs), while math ability grows by slightly more than 1 SD. Again, little is systematically known about nonacademic outcome gains from year to year.
The Present Study
The purpose of this article is to employ the Early Childhood Longitudinal Study, Kindergarten Class of 2010–2011 (ECLS-K: 2011; see Mulligan, Hastedt, & McCarroll, 2012, for an overview), to estimate design parameters useful in planning cluster randomized trials. The outcomes include information about student scores regarding learning behaviors (approaches to learning and attentional focus), general behaviors (externalizing problem behaviors, inhibitory control, internalizing problem behaviors, interpersonal skills, and self-control), cognitive abilities (card sort postswitch score, card sort border score, and numbers reversed score), and academic abilities (math and reading scores). The parameters are presented for the spring scores (after a year’s worth of instruction by the school).
The present article, then, has several objectives. First, it will outline the gains in the outcome measures in effect size units for use as upper bound reference points. Second, it will present intraclass correlations (ICCs) and R² statistics for each outcome’s spring score. Third, to contextualize these results, estimated school-level sample size requirements are calculated for each outcome and set of covariates, using effect sizes that reflect the analysis of the nonacademic outcomes. The goal of the sample size table is not to provide guidance on exact sample sizes per se but instead to offer guidance on the feasibility of conducting studies for the given effect sizes and design parameters. Actual planned sample sizes will differ depending on the number of students and the expected effect size. In addition, this article updates the Hedges and Hedberg (2007) estimates for math and reading using the new 2011 ECLS-K cohort and explores the changes to these parameters between cohorts.
This article is organized as follows. First, the sample and outcomes are reviewed. Next, the statistical methodology and analysis choices are covered. This article then presents the results and then ends with discussion. A review of the parameters necessary for planning the appropriate analysis of a cluster randomized trial is included in Appendix A.
Study Sample and Outcomes
The ECLS is a series of studies carried out by the U.S. Department of Education to better understand the dynamics of early childhood. The original kindergarten cohort was examined during the 1998–1999 school year. The present cohort was examined during the 2010–2011 school year. These data provide a rich pool of information from students, teachers, parents, and caregivers on a variety of topics. Of interest to this study are the teacher reports of several classroom behaviors (and corresponding parent reports when available) and the standardized tests (Tourangeau et al., 2012). The data set released contains 18,174 student records and 1,328 schools recorded for the spring scores. For this study, all students with spring test scores were selected for each outcome (i.e., the N is different for different test scores); 10,653 (59%) student records contained all 14 outcomes and 25% were missing one or two scores; and 523 students (3%) were missing all 14 spring outcomes. Appendix Table B1 gives the proportion of students missing test scores for each outcome and by the sample characteristics described below. Broadly, minority students were more likely to have missing data on all outcomes, as were students in the northeast, west, and urban areas.
Table 1 presents an overview of each measure, the developers, and the spring score reliability measure. Each measure is organized so that larger scores indicate positive progress, and thus positive effect sizes indicate gains in the desired direction. The outcomes in this study are organized around four types: learning behaviors, general behaviors, cognitive abilities, and academic abilities. Learning behaviors include approaches to learning, developed by the ECLS survey team, and Rothbart, Ahadi, Hershey, and Fisher’s (2001) attentional focus measure. Next, general behaviors is a set of measures regarding both internalizing and externalizing problem behaviors (Gresham & Elliott, 1990), Rothbart, Ahadi, Hershey, and Fisher’s (2001) inhibitory control, interpersonal skills, and self-control (Gresham & Elliott, 1990). Cognitive ability measures include Zelazo’s card sort games (2006) and the numbers reversed measure developed by Woodcock in 1990. Finally, academic outcomes comprise math and reading item response theory (IRT) scores.
Table 1. Description of Analyzed Outcomes.
Note. ECLS-K = early childhood longitudinal study kindergartners; IRT = item response theory; NA = not applicable; PR = parent report; TR = teacher report.
Many of these outcomes are the focus of interventions. For example, Stormont (2002) outlines several avenues for interventions to improve externalizing problem behaviors at the teacher and classroom levels. Barrera et al. (2002) implemented a randomized trial of an intervention designed to reduce both internalizing and externalizing problem behaviors. As a cognitive outcome example, Smith et al. (2013) used the numbers reversed metric, in part, to assess whether physical activity alleviated issues related to attention deficit disorders.
Statistical Method
In this section, the methods and formulas that are employed to estimate the design parameters are discussed. For each outcome, the same methods are used in the estimation process, which employed Stata 14’s “mixed” restricted maximum likelihood (REML) estimation procedures. Each estimate is associated with an estimate of the sampling variance. For all the estimates, weights were not employed. This is analogous to the procedures in Hedges and Hedberg (2007) and the reasons are discussed at the end of this section. Other topics in this section include issues of assumptions and reproducibility.
Effect Size Benchmarks
Estimates of the effect size for fall–spring gains employed the difference between the spring and fall means, expressed in standard deviation units.
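As an illustration of a gain expressed in effect size units, the following minimal sketch divides the spring minus fall mean difference by a standard deviation. The choice of a pooled fall and spring SD is an assumption made only for this example, since the standardizer is not specified in the text above; the function name and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_gain(fall, spring):
    """Fall-to-spring gain in SD units.

    Assumption: the standardizer is the pooled SD of the fall and
    spring scores; a different SD could be substituted.
    """
    # Pooled SD: square root of the average of the two sample variances.
    pooled_sd = sqrt((stdev(fall) ** 2 + stdev(spring) ** 2) / 2)
    return (mean(spring) - mean(fall)) / pooled_sd


# Hypothetical scores for illustration only (not ECLS-K data).
fall_scores = [2.0, 2.5, 3.0, 3.5, 4.0]
spring_scores = [2.2, 2.7, 3.2, 3.7, 4.2]
gain = standardized_gain(fall_scores, spring_scores)
print(round(gain, 3))
```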
ICCs
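ICCs are estimated for the spring scores. To estimate the ICC for a given outcome, y, a multilevel model is fit for the ith observation in the jth school,

$$y_{ij} = \gamma_{00} + u_{0j} + e_{ij},$$

where the software computes the REML estimates of the variance of $u_{0j}$, which is $\tau^2$, and the variance of $e_{ij}$, which is $\sigma^2$. The estimate of the ICC, ρ, is then

$$\rho = \frac{\tau^2}{\tau^2 + \sigma^2}.$$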
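The ICC calculation can be sketched in a few lines: it is the school-level variance (the variance of the school random effect) divided by the sum of the school-level and student-level variances. The function name and the variance components below are hypothetical and are not estimates from the ECLS-K data.

```python
def intraclass_correlation(between_var, within_var):
    """Share of the total variance that lies between schools."""
    if between_var < 0 or within_var < 0:
        raise ValueError("variance components must be non-negative")
    total = between_var + within_var
    if total == 0:
        raise ValueError("total variance must be positive")
    return between_var / total


# Hypothetical variance components for illustration only.
rho = intraclass_correlation(between_var=20.0, within_var=80.0)
print(rho)
```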
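R² Estimates
R² statistics are estimated for three sets of covariates. The first set is the fall score and its school mean. The second set is the demographic indicators for gender, race, and socioeconomic status, together with their school means. The final set combines the fall score, demographic indicators, and the associated school means.
To estimate the R² statistics for each set of covariates, a model is fit to the data where the spring score for the ith student in school j is

$$y_{ij} = \gamma_{00} + \mathbf{x}_{ij}'\boldsymbol{\beta} + \bar{\mathbf{x}}_{j}'\boldsymbol{\gamma} + u^{*}_{0j} + e^{*}_{ij},$$

where $\mathbf{x}_{ij}$ holds the student-level covariates and $\bar{\mathbf{x}}_{j}$ their school means, and where the software computes the REML estimates of the variance of $u^{*}_{0j}$, which is $\tau^{*2}$, and the variance of $e^{*}_{ij}$, which is $\sigma^{*2}$. With $\tau^2$ and $\sigma^2$ denoting the school- and student-level variances from the model without covariates, the R² statistics are

$$R^2_{\text{school}} = 1 - \frac{\tau^{*2}}{\tau^2}$$

at the school level and

$$R^2_{\text{student}} = 1 - \frac{\sigma^{*2}}{\sigma^2}$$

at the student level.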
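A minimal sketch of an R² statistic at each level, assuming it is defined as the proportional reduction in a variance component when covariates are added (one minus the ratio of the conditional to the unconditional variance). The function name and all variance components are hypothetical.

```python
def variance_explained(unconditional_var, conditional_var):
    """Proportional reduction in a variance component after adding covariates."""
    if unconditional_var <= 0:
        raise ValueError("unconditional variance must be positive")
    return 1.0 - conditional_var / unconditional_var


# Hypothetical variance components for illustration only.
# Unconditional model: school = 20.0, student = 80.0.
# Model with covariates: school = 8.0, student = 48.0.
r2_school = variance_explained(20.0, 8.0)
r2_student = variance_explained(80.0, 48.0)
print(round(r2_school, 2), round(r2_student, 2))
```

Under this definition a covariate set that leaves a variance component unchanged yields an R² of zero at that level, and an estimate can be negative if the conditional variance happens to exceed the unconditional one.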
Estimates of Required Schools
To illustrate the implications of the estimated design parameters, the number of schools required to detect an effect size of .1 or .2 is estimated. This is the result of an iterative process, where for each outcome the appropriate design parameters were entered into the noncentrality formula reviewed in the appendix, the number of students was set to 15 and 30, and the number of schools was incremented by 2 (one for treatment and one for control) until the exact power equaled or exceeded .8 with a Type I error rate of .05 for a two-tailed test.
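The iterative search can be sketched as follows. This is not the exact procedure used for the tables: the sketch replaces the exact (noncentral t) power calculation with a normal approximation so that it needs only the Python standard library, and it assumes a balanced two-arm design with the usual variance expression for a standardized treatment effect in a two-level cluster randomized trial. All function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approximate_power(effect_size, n_schools, n_students, icc,
                      r2_school=0.0, r2_student=0.0, alpha=0.05):
    """Two-tailed power under a normal approximation (an assumption)."""
    # Variance of the standardized effect estimate for a balanced design
    # with n_schools total clusters split evenly between two arms.
    variance = 4.0 * (icc * (1.0 - r2_school)
                      + (1.0 - icc) * (1.0 - r2_student) / n_students) / n_schools
    noncentrality = effect_size / sqrt(variance)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (STD_NORMAL.cdf(noncentrality - z_crit)
            + STD_NORMAL.cdf(-noncentrality - z_crit))


def required_schools(effect_size, n_students, icc, r2_school=0.0,
                     r2_student=0.0, alpha=0.05, target_power=0.8):
    """Add one treatment and one control school until power is reached."""
    n_schools = 2
    while approximate_power(effect_size, n_schools, n_students, icc,
                            r2_school, r2_student, alpha) < target_power:
        n_schools += 2
    return n_schools


# Illustrative inputs: ICC of .07, 15 students per school, no covariates.
print(required_schools(0.1, 15, icc=0.07))
```

Because the approximation ignores the degrees of freedom of the test, it will tend to return slightly fewer schools than an exact calculation, with the gap shrinking as the number of schools grows.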
Mixed Model Assumptions
Mixed models assume that the random effects, $u_{0j}$ and $e_{ij}$, are normally distributed. Thus, to validate the results of the ICC analysis, the distributions of the student residuals and school means were examined visually. The student residuals, which are analogous to $e_{ij}$, were calculated by subtracting the school mean of each score from the student-level score. The distribution of $u_{0j}$ was checked by examining the distributions of the school means. These distributions were checked using kernel density plots for each outcome and are presented in Figure 1. While all outcomes, student residuals, and school means failed a formal Shapiro–Wilk test of normality (Shapiro & Wilk, 1965; z scores of the tests are reported in Figure 1), most outcomes show approximately normal distributions for student residuals and school means. The exception is the card sort postswitch score, for which the school means follow the overall distribution, in which most students score highly.

Figure 1. Distribution of spring scores, student residuals, and school means for each outcome.
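The two quantities examined in these checks can be computed as in the sketch below: school means stand in for the school random effects, and each student's score minus the school mean stands in for the student residual. The function name, the record layout, and the scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def school_means_and_residuals(records):
    """records: iterable of (school_id, score) pairs."""
    records = list(records)  # allow a second pass over the data
    scores_by_school = defaultdict(list)
    for school_id, score in records:
        scores_by_school[school_id].append(score)
    # School means are used as stand-ins for the school-level effects.
    school_means = {s: mean(v) for s, v in scores_by_school.items()}
    # Student residual: the score minus the mean of the student's school.
    residuals = [score - school_means[school_id]
                 for school_id, score in records]
    return school_means, residuals


# Hypothetical records for illustration only.
data = [("A", 1.0), ("A", 3.0), ("B", 4.0), ("B", 6.0)]
means, resids = school_means_and_residuals(data)
print(means, resids)
```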
Reproducibility
As with all research, others may wish to replicate the results of the current article. This is encouraged. However, it should be noted that the variance components will be different with the longitudinal data sets (compared to the base-year only files) because the earlier observations (such as kindergarten) are rescaled to vertically equate with the later grades (such as first grade). This is broadly described on page 5-1 of the eighth grade psychometric report for the first ECLS-K cohort (Najarian, Pollack, Sorongon, & Hausken, 2009). Given this issue of reproducibility, it is noted that the results of this article are based on the restricted-use base-year file and not the publicly released K-1 file, which will produce a slightly different set of answers.
The Decision to Not Use Sampling Weights to Estimate Design Parameters
The ECLS-K 2011 sample of schools is not a simple random sample of clusters. The data collection team first selected counties (or county groups) as the primary sampling units (PSUs) using a combination of sampling proportionate to size (to ensure representation of the largest PSUs) and stratified sampling to ensure representation of the variety of school settings (Tourangeau et al., 2012). As a result, several types of settings are overrepresented or underrepresented based on the sampling parameters (detailed in the data documentation). Thus, these data do not represent a simple random sample of schools. To estimate design parameters, the issue is not as simple as just using design weights, however. The scaling decisions about how to handle the Level 1 (Student) weights (which are not designed for multilevel modeling because they include the combined sampling probabilities of the school selection and student selection given the school selection) can have large impacts on the design parameters. Furthermore, the use of school-level weights can bias the between-school variance components (Rabe-Hesketh & Skrondal, 2006). In the end, the literature offers no absolute advice as yet.
Regardless of the technical issues surrounding survey data weights and design parameters, the practical reality is that while the weights may be appropriate for the estimation of fixed effects (regression slopes, etc.), the supplied weights may or may not be appropriate for estimating variance components of random effects. Moreover, the only school-level weight available at this time is the school-base weight for the school administrator survey that is adjusted for nonresponse. Several schools that collected data on the outcomes of interest did not have corresponding administrator surveys. The result is that 462 of the 1,328 schools (35% of schools and 12% of students) in the study have a school-level weight of zero. An analysis of these schools indicated that while the average reading score, for example, was slightly higher than for schools with nonzero weights, the between-school variance for the schools with a zero weight is much larger than for those with nonzero weights. Thus, using the nonresponse-adjusted school-level administrator weight reduces the ICC estimate while using less data. Moreover, since none of the provided weights are designed for use in estimating variance components and since the misuse of design weights has as-yet unknown impacts on estimated design parameters, the choice of this article is to simply not employ the weights in the analysis.
Results
Sample and Descriptive Measures
Table 2 presents the basic demographic characteristics of the employed sample (2011) and its comparison to the 1999 ECLS-K cohort. All characteristics are presented as the percentage of a categorical variable. The 2011 sample is 49% female, as was the 1999 sample. The poverty level of the 2011 sample is higher than that of the 1999 sample, increasing from 19% to 26% below the poverty threshold. Age is consistent between samples, with most of the spring ages falling between 74 months and 78 months. The 2011 sample is more diverse: Whites are no longer the majority, and there is a marked increase in Hispanic students (from 18% to 25%). The sampling of the 2011 cohort has also increased the percentage of the sample from the South and the West and has increased the number of rural students at the expense of the urban students. Finally, the most dramatic change has been the increase in full day programs, with 84% of the 2011 cohort participating in full day programs compared to 56% in 1999.
Table 2. Unweighted Percentages of Demographic Characteristics of ECLS-K 1999 and ECLS-K 2011 Student Samples.
Note. ECLS-K = early childhood longitudinal study kindergartners.
Table 3 presents the descriptive statistics, including the means and SDs, of the fall and spring outcomes that are the focus of the analysis. The mean number of student observations for the spring measures was about 16,000, and the average number of schools was about 1,130. Across all outcomes, the SDs were similar between the fall and spring scores except for the cognitive measures.
Table 3. Descriptive Statistics of Fall and Spring Outcome Scores and Effect Size of Fall to Spring Change.
Note. Effect size standard errors in parentheses. IRT = item response theory; SD = standard deviation; SE = standard error.
a. Range is for both fall and spring scores. b. Fall statistics are only for those with spring data. c. See text for calculation details. d. Values in boldface indicate mean of outcome type. e. Counts based on spring scores.
Figure 1 presents the kernel density plots of the spring scores, student-level residuals, and school means for each of the 14 outcomes. While most of the outcomes follow an approximately normal distribution, several outcomes have skewed distributions. The most extreme example is the postswitch card score, where virtually all of the students scored six cards. The other skewed scales have means weighted toward the normative; for example, the modal approaches to learning score (teacher rating) is the maximum, and the modal externalizing behavior score is the minimum. However, in most cases where the spring score was not normally distributed, the student residuals and school means did follow an approximately normal distribution. The one exception is again the card sort postswitch score, for which the school means are distributed like the spring outcome, with most cases approaching the top score. However, it should be noted that the formal test of normality, the Shapiro–Wilk test, indicates that all outcomes, their student residuals, and school means are not strictly normally distributed, as all the z scores are larger than 2.
Outcome Gains in Effect Size Units
Table 3 also presents the differences between the spring and fall scores in effect size units (14). For each outcome type, the simple average of effect size is presented in boldface to offer a rough guide. Across the nonacademic outcomes, the outcome gains in effect size units range approximately between .1 and .2. The average for learning behaviors is lower because of the negative gains in the parent report of approaches to learning. General behaviors effect sizes have an average of about .12 and cognitive abilities increase by about 0.21 SDs.
The nonacademic effect sizes are much smaller than the academic effect sizes reported in this article and others (e.g., Bloom, Hill, Black, & Lipsey, 2008; Hill, Bloom, Black, & Lipsey, 2008). For the ECLS-K 2011 cohort, the Math and Reading IRT scales show large gains between fall and spring, as expected. The average effect size is greater than 1 but is slightly lower than other reported measures.
ICC and R² Estimates
Table 4 presents the estimates of the spring ICCs for the unconditional models and R² statistics for three different models (pretest only, demographics, and pretest plus demographics). The ICCs for math and reading are the largest (about .19), while the ICCs for most nonacademic outcomes averaged between about .06 and .08. While small, there was some variability in the parameters. Although the teacher-reported self-control had an ICC close to .11, the parent report was far lower (.04). A similar pattern was evident in the approaches to learning outcome, where the teacher report has a larger ICC than the parent report (.08 vs. .04).
Table 4. Intraclass Correlations and R² Statistics for the ECLS-K 2011 Sample Spring Scores.
Note. Standard errors in parentheses; boldface values indicate mean of outcome type. ECLS-K = early childhood longitudinal study kindergartners; ICC = intraclass correlation; IRT = item response theory.
a. Pretest is the fall score. b. Demographics included gender, socioeconomic status, and race.
The student-level effectiveness of pretests (labeled in the table as Student R²) also varied across outcomes. The academic pretests were most effective (with values averaging .65). For learning and general behaviors, the R² averaged about .4. Using student demographics offered little effectiveness but marginally increased the total effectiveness when combined with pretests.
The effectiveness of a school average of the pretest (labeled in the table as School R²) indicates that using the school-average pretest in the evaluation analysis is a key strategy for reducing the impact of clustering on significance tests (Hedges & Hedberg, 2007; Hedges & Rhoads, 2010). For most outcomes, this parameter was high across the academic, cognitive, and behavioral outcomes, with averages of approximately .6. For cognitive abilities, school means of demographic covariates were more effective at the school level than the demographic covariates were at the student level. As with the student-level results, the most variation is explained in models that combine the demographic and pretest variables.
Implications of Design Parameters
The implications of these results are presented in Table 5, which contains the number of schools necessary to detect effect sizes of .1 and .2 for within-school samples of 15 or 30 students. These are plausible ranges, as the average effect size (excluding the parent report of approaches to learning, math, and reading) is .15. The table is organized in much the same way as Table 4, showcasing the different models (without covariates, pretest, demographic, and pretest plus demographic covariates).
Table 5. Number of Schools Required to Detect 0.1 and 0.2 Effect Sizes Based on Estimated Design Parameters.
Note. Boldface numbers represent mean values by outcome types. Estimates are for 0.8 power for a two-tailed test with α = .05. ES = effect size; IRT = item response theory.
a. Pretest is the fall score. b. Demographics included gender, socioeconomic status, and race.
Without covariates, the number of schools necessary to detect an effect size of .1 in nonacademic outcomes is about 400 for within-school samples of 15 and about 300 for within-school samples of 30. The sample size requirements for general behaviors are larger, about 450 for within-school samples of 15 and 350 for within-school samples of 30. The number of schools required to detect an effect size of .2 is smaller: about 100 schools for within-school samples of 15 and about 80 schools for within-school samples of 30.
Including pretest and demographic covariates reduces the number of schools. The number of schools necessary to detect an effect size of .1 with covariates in nonacademic outcomes is about 200 for within-school samples of 15 and about 75 for within-school samples of 30. The number of schools required to detect an effect size of .2 with covariates is smaller: about 50 schools for within-school samples of 15 and fewer than 50 schools for within-school samples of 30.
Changes to Math and Reading Parameters
The ICCs for both math and reading have decreased since the first ECLS-K cohort in 1999. In that survey, as reported in Hedges and Hedberg (2007), the ICC for math was .243 (standard error [SE] = .010) and the ICC for reading was .233 (SE = .010). In the current cohort, the ICC for math is .197 (SE = .009) and the ICC for reading is .186 (SE = .009). As reported in Table 6, these differences are larger than would be expected by chance, with approximate z statistics of 3.48 for math and 3.58 for reading (both statistically significant under a two-tailed test).
Table 6. Analysis of ICC Change for Academic Abilities from 1999 to 2011 by Select Covariates.
Note. Standard errors in parentheses. ICC = intraclass correlation.
a. School means also included in the model. b. z test computed as the difference divided by the square root of the sum of the squared standard errors. c. p values based on normal distribution, two-tailed test.
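The z test described in the table note can be sketched as below: the difference between two independent ICC estimates is divided by the square root of the sum of their squared standard errors, and a two-tailed p value is taken from the normal distribution. The function name is hypothetical. Because the inputs here are the rounded math values quoted in the text, the result need not match the tabled statistic exactly.

```python
from math import sqrt
from statistics import NormalDist


def icc_difference_test(icc_a, se_a, icc_b, se_b):
    """Approximate z test for the difference between two independent ICCs."""
    z = (icc_a - icc_b) / sqrt(se_a ** 2 + se_b ** 2)
    # Two-tailed p value from the standard normal distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Rounded math ICCs and SEs quoted in the text (1999 vs. 2011).
z_math, p_math = icc_difference_test(0.243, 0.010, 0.197, 0.009)
print(round(z_math, 2), p_math < 0.05)
```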
As reported in Table 2, the 1999 and 2011 samples are not equivalent. To explore whether the difference in the ICCs is due to some historical change in the school effects or due to the change in samples, conditional ICCs were calculated from six models each for math and reading. These models included one with the poverty indicator and its school mean, one with the race indicators and their school means, and separate models with census region indicators, urbanicity indicators, and type of program. These are the variables that changed the most from 1999 to 2011. Finally, a model that included all of the selected sample characteristics was estimated. For each model, the conditional ICC was calculated along with its SE. For math, both the poverty indicator model and the full model produced equivalent conditional ICCs for each cohort. For reading, the difference in the conditional ICCs from the full model was marginally significant; that is, the ICCs were not statistically distinguishable at conventional Type I error rates, but the difference is practically important. Thus, the tentative conclusion is that the change in sample may explain the differences for math scores, but the answer is less conclusive for reading scores.
Discussion
The purpose of this article was to estimate the design parameters for behavioral and academic outcomes in early childhood, to explore the differences between academic and nonacademic outcomes, and finally to compare math and reading parameters between the 1999 and 2011 samples. This was done using the ECLS-K. The design of this study mimicked that of Hedges and Hedberg (2007) in that for each outcome, mixed models were fit to the outcomes to estimate ICCs and R² statistics. Effect size benchmarks were also estimated for fall–spring gains to provide upper bound estimates of sample sizes required for plausible effect sizes. These effect size benchmarks constitute a year’s worth of growth and thus any intervention is likely to produce a smaller effect.
One expectation of the study was that school effects, and thus ICCs, would be minimal for behavioral outcomes. The ICCs were smaller for nonacademic outcomes than for math and reading, leading to the conclusion that schools correlate with these behaviors less than with academic outcomes. This may be good news for planning research on behavioral outcomes, except that the yearly gains in these outcomes are rather limited as measured by effect size. This means that the possible changes to behaviors (in the general population) are small, which can lead to larger required sample sizes despite the smaller ICCs.
This article provides the most systematic and representative catalogue of nonacademic design parameters to date. However, several nuances must be considered when utilizing these results. The first nuance is uncovered when we compare the teacher and parent reports of the same latent construct. The ICC for the teacher-reported approaches to learning is .077, whereas the ICC for the parent report is nearly half that at .044. The difference in self-control is even more striking: the teacher-report ICC is .107 and the parent-report ICC is .036, nearly two thirds lower. It is difficult with the available data to know why these differences exist. One conservative possibility is that these ICCs simply reflect a teacher rater bias, that is, teachers tend to rate all their children as high or low on some scale. A more liberal interpretation is that children behave differently at home and school and that their school behaviors are influenced by peers, and thus these ICCs are reflective of school conditions.
The math and reading parameters did change between cohorts, and this may be due to the differences in samples that resulted from changes to the demography and early childhood institutional organization between 1998 and 2011. A brief analysis to equate conditional ICCs was successful for math, but less so for reading. Thus, school effects and historical change warrant further research. Finally, these parameters may not reflect the local parameters for a given study, but if our previous experience with math and reading is any indication, these values provide an upper bound for what researchers working in local contexts can expect. For example, Hedberg and Hedges (2014) found that within-district ICCs were lower than national estimates. As such, this article provides useful estimates for studying behavioral outcomes of young children using cluster randomized designs.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D140019, NORC at the University of Chicago. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education.
