Abstract
This paper provides practical guidance for researchers who are designing and analyzing studies that randomize schools—which comprise three levels of clustering (students in classrooms in schools)—to measure intervention effects on student academic outcomes when information on the middle level (classrooms) is missing. This situation arises frequently in practice because many available data sets identify the schools that students attend but not the classrooms in which they are taught. Do studies conducted under these circumstances yield results that are substantially different from what they would have been if this information had been available? The paper first considers this problem in the context of planning a school randomized study based on preexisting two-level information about how academic outcomes for students vary across schools and across students within schools (but not across classrooms in schools). The paper next considers this issue in the context of estimating intervention effects from school-randomized studies. Findings are based on empirical analyses of four multisite data sets using academic outcomes for students within classrooms within schools. The results indicate that in almost all situations one will obtain nearly identical results whether or not the classroom or middle level is omitted when designing or analyzing studies.
This problem often occurs in the planning stages of studies that randomize schools, because little is known about the three-level variance structure of outcome measures for students clustered in classrooms in schools. Most of the published empirical basis for planning such studies instead comprises information for the two-level variance structure of students clustered in schools (see, e.g., Bloom, Richburg-Hayes, & Black, 2007; Hedges & Hedberg, 2007). Thus, research designs based on this information do not account explicitly for the clustering of students in classrooms.
The problem also occurs in the analysis stage of studies that randomize schools because researchers often use administrative records to measure student outcomes. Because these records typically do not identify which students are in which classrooms—and adding such identifiers is difficult or costly, if not impossible, to do—the resulting studies are analyzed using two-level models that do not account explicitly for the clustering of students within classrooms.
Previous researchers have considered the implications of ignoring a level of variance when analyzing data with a multilevel structure. Specifically, they have shown that if a middle level of a multilevel variance structure is ignored, part of it will shift up one level, and the rest will shift down one level, thereby increasing estimates of the variances at these adjacent levels. In this way, the middle-level variance is to some extent accounted for implicitly (Opdenakker & Van Damme, 2000; Moerbeek, 2004; Tranmer & Steele, 2001; Van den Noortgate, Opdenakker, & Onghena, 2005).
Researchers also have tested (using simulated and actual data) the implications that such omissions can have for the interpretation of multiple regression analyses. They have demonstrated, for example, that in many situations, ignoring a level of variance will result in standard errors that are misspecified and thereby produce incorrect statistical inferences. For example, omitting the classroom level in a sample that has students clustered in classrooms within schools will produce incorrect estimates of standard errors for student-level independent variables (Opdenakker & Van Damme, 2000; Moerbeek, 2004; Van den Noortgate et al., 2005).
These studies provide a general overview of what happens to both the standard errors and point estimates of predictors included at all levels of a hierarchical model when various levels are ignored. However, as Van Landeghem, De Fraine, and Van Damme (2005) noted, the findings from these studies often do not apply to situations that researchers commonly face in practice. For example, many of the results are based on the assumption that the size and internal structure of every randomized cluster are the same and that no covariates are included in the analysis. This is rarely the case in practice. Similarly, these results are usually quite general and depend on factors such as the particular level of a multilevel variance structure that is ignored, the level of the predictor variable of interest, and the relative magnitudes of the variance components involved. The overarching conclusion of these studies is that omitting a level from a multilevel analysis can be problematic, but it is difficult to determine the practical implications of doing so for any given potential research application. Furthermore, these studies focus primarily on the implications for analyzing data when a level of variance is not explicitly acknowledged; they pay little attention to the implications of a missing level of variance for the minimum detectable effects obtained from power analyses when planning studies. Consequently, there is little practical guidance for researchers who are interested in the design and analysis of school-randomized studies when information about the classroom is not available.
This article fills that gap by exploring the consequences of ignoring classroom-level information when designing or analyzing a school-randomized trial. It extends previous findings by investigating the implications of omitting the middle level not only for analyzing data but also for planning studies with a three-level data structure using only the top and bottom levels of information. It also provides concrete guidance to education researchers who are designing and analyzing school-level random assignment studies in which cluster size and structure vary and covariates are used for analysis. Finally, the article extends the findings to cases in which the sample used to plan an impact study has a different cluster structure (i.e., different numbers of students per classroom and classrooms per school) than the analytical sample of the impact study itself. These extensions are based on empirical analyses of four multisite data sets that use academic outcomes for students within classrooms within schools.
The findings indicate that no substantial problem is likely to arise from using two-level models (for students within schools) to design or analyze studies that randomize schools. This conclusion holds for both elementary school data (for which the middle-level variance component tends to be small) and secondary school data (for which the middle-level variance component tends to be large), for data sets with varying numbers of students per classroom and classrooms per school, in situations in which covariates are included at either the student or the school level, and in situations in which the cluster structure of the study being planned differs substantially from the one used for planning purposes.
The rest of the article is structured as follows. We begin by presenting a theoretical framework for comparing three- and two-level models of a three-level situation. We then present estimates of three-level and two-level variance components and examine how an ignored classroom-level variance component is shifted up to the school level and down to the student level. We compare the shifting in our data with what is predicted theoretically and find the size of the actual and predicted shifts to be consistent with each other. These findings are then used to consider the implications of a two-level analysis for minimum detectable effect sizes (MDES) in a study that randomizes schools. We explore these implications for models that do and do not use covariates to estimate MDES. We also explore the implications of planning a study that has a different underlying data structure than the one used for planning purposes. We next explore the implications of ignoring the middle level when a two-level model is used to analyze data with a three-level structure, and we close with conclusions and recommendations.
Theoretical Framework
Consider the following two alternative research designs for estimating the impacts of an educational intervention on student outcomes from a study that randomizes schools in a large urban district. Both designs will estimate impacts by the observed differences in mean student outcomes for the randomized treatment group and control group, and the true variance structure for the study’s sample will comprise three levels: students, classrooms, and schools. Note that in this discussion, classroom-level influences are considered to be random. This is consistent with the convention of treating class-level differences as random in the literature and recognizes the random factors affecting variation at the classroom level (e.g., class dynamics can shift randomly across days depending on various factors, including particular students who are present or absent that day). We do recognize that there also might be fixed components at work here. For example, students are not always randomly assigned to teachers.
Design A uses a statistical model that specifies all three levels of the true variance structure: a school-level variance, a classroom-level variance, and a student-level variance.
Design B uses a two-level statistical model that specifies two variance components, one for mean values of the outcome measure across schools and one for individual student outcomes within schools.
The following expressions can be used to compute an MDES for student outcomes given Designs A and B, without covariates or blocking. Note that throughout this article, MDES are defined for a two-tailed hypothesis test at the .05 level of statistical significance with 80% statistical power (for a discussion of how this is done, see Bloom, 2005).
In the expression for Design A (Equation 1), $\mathrm{MDES}_A$ is the MDES for Design A, $M_{J-2}$ is a multiplier for $J - 2$ degrees of freedom that equals approximately 2.8 for studies that randomize 20 or more schools, $P$ is the proportion of schools randomized to treatment, $J$ is the total number of schools randomized to treatment or control status, $K$ is the harmonic mean number of classrooms per school, and $N_A$ is the harmonic mean number of students per classroom.
In the expression for Design B (Equation 2), in addition, $N_B$ is the harmonic mean number of students per school.
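As a point of reference, the two expressions can be sketched in the usual form for designs that randomize clusters. The variance symbols are assumed notation introduced for this sketch and are not necessarily the authors' own: $\tau_A^2$, $\omega_A^2$, and $\sigma_A^2$ stand for the school-, classroom-, and student-level variances under Design A, and $\tau_B^2$ and $\sigma_B^2$ stand for the school-level and within-school student-level variances under Design B. Each is assumed to be expressed as a proportion of total student variance, so that the result is in effect-size units.

```latex
\begin{align}
\mathrm{MDES}_A &= M_{J-2}\sqrt{\frac{\tau_A^2}{P(1-P)J}
  + \frac{\omega_A^2}{P(1-P)JK}
  + \frac{\sigma_A^2}{P(1-P)JKN_A}} \tag{1}\\[4pt]
\mathrm{MDES}_B &= M_{J-2}\sqrt{\frac{\tau_B^2}{P(1-P)J}
  + \frac{\sigma_B^2}{P(1-P)JN_B}} \tag{2}
\end{align}
```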
These two expressions are the same with respect to the multiplier ($M_{J-2}$), which converts standard errors of estimates to minimum detectable effects (for a discussion, see Bloom, 2005). The two expressions are also the same with respect to the proportional allocation of randomized groups to treatment status ($P$) and control status ($1 - P$). However, they differ with respect to the square root of the sum of variance contributions from the different levels of each statistical model.
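The comparison can be made concrete with a minimal Python sketch. It assumes the standard cluster-randomized form of the two expressions, with variance components given as proportions of total student variance; the function and argument names (`tau2`, `omega2`, `sigma2`) and all numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def mdes_three_level(tau2, omega2, sigma2, J, K, N_A, P=0.5, multiplier=2.8):
    """MDES for a three-level model (students in classrooms in schools).

    tau2, omega2, sigma2: assumed names for the school, classroom, and
    student variance components, as proportions of total variance.
    multiplier: approximately 2.8 when 20 or more schools are randomized.
    """
    denom = P * (1 - P) * J
    return multiplier * math.sqrt(
        tau2 / denom + omega2 / (denom * K) + sigma2 / (denom * K * N_A)
    )


def mdes_two_level(tau2, sigma2, J, N_B, P=0.5, multiplier=2.8):
    """MDES for a two-level model (students in schools)."""
    denom = P * (1 - P) * J
    return multiplier * math.sqrt(tau2 / denom + sigma2 / (denom * N_B))


# Hypothetical planning values: 40 schools, 3 classrooms of 20 students each.
print(mdes_three_level(0.10, 0.05, 0.85, J=40, K=3, N_A=20))
print(mdes_two_level(0.12, 0.88, J=40, N_B=60))
```

Both functions share the multiplier and the $P(1 - P)J$ term; only the sum under the square root differs, which is the point of the comparison above.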
The central question to address when comparing these two expressions is, How do their estimated values compare when the total student variance is decomposed into all three components (for schools, classrooms, and students), as in Equation 1, to when the total student variance is decomposed only into components for schools and for students within schools, as in Equation 2?
To address this question, first recall that both models start with the same total variance in the outcome measure across all students from all classrooms in all schools. Hence, the sum of the three variances under Model A equals the sum of the two variances under Model B (Equation 3).
Variance estimates for Model B must thus shift the true middle-level variance to the bottom level, the top level, or both levels.
Moerbeek (2004, Equation 14) derived expressions (Equations 4 and 5) that represent this shifting when there is a constant number of classrooms per school ($K$) and students per classroom ($N_A$).
Equations 4 and 5 indicate that a predictable portion of the true classroom-level variance is shifted to the estimated school-level variance, and the remainder is shifted to the estimated student-level variance. The sum of these two increments equals the total classroom variance.
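The variance identity and the two shifts can be sketched as follows for a balanced design. The notation is assumed for illustration and may differ from the authors': $\tau_A^2$, $\omega_A^2$, and $\sigma_A^2$ are the school-, classroom-, and student-level variances under Model A, and $\tau_B^2$ and $\sigma_B^2$ are the school- and student-level variances under Model B. The coefficients below follow from the expected mean squares of a balanced three-level design; they are a reconstruction, not a transcription of Moerbeek (2004).

```latex
\begin{align}
\tau_A^2 + \omega_A^2 + \sigma_A^2 &= \tau_B^2 + \sigma_B^2 \tag{3}\\[4pt]
\tau_B^2 &= \tau_A^2 + \frac{N_A - 1}{N_A K - 1}\,\omega_A^2 \tag{4}\\[4pt]
\sigma_B^2 &= \sigma_A^2 + \frac{N_A (K - 1)}{N_A K - 1}\,\omega_A^2 \tag{5}
\end{align}
```

The two coefficients on $\omega_A^2$ sum to 1, so the increments in Equations 4 and 5 together account for the whole classroom variance, and Equation 3 is satisfied.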
Intuitively, it is easy to see how part of the true classroom-level variance shifts down to the estimated student-level variance. This occurs because part of the observed variance in outcomes across students within schools reflects classroom differences. Thus, when the variation across students within schools is measured and when cross-classroom differences are ignored, a part of these differences is included in the measure of student-level variance within schools.
It is less readily apparent how the two-level estimation Model B attributes some of the cross-classroom variance to the estimated variance across schools. This occurs because the total observed variance in school sample means can be decomposed into two parts: one that is due to true variation across schools and one that is due to estimation error produced by within-school student variation. Model B assumes that outcomes vary independently across students within schools when in fact they are clustered by classroom. By ignoring the clustering of students within classrooms, the two-level Model B understates the contribution of student-level variation to the total observed variance of school sample means. Hence, it overstates the amount of true variation that exists across schools.
Equation 4 indicates that more of the classroom-level variance is shifted to the estimated school-level variance as students per school ($N_A K$) are clustered into fewer classrooms ($K$). This shift reflects how the clustering of students within classrooms inflates the true variability of within-school outcomes. Ignoring this clustering thus causes a larger understatement of within-school variability of outcomes when there are fewer classroom clusters, which in turn causes one to overstate the between-school variance accordingly. Consequently, the two-level model overestimates the true school-level variance. This is why the estimated school-level variance for the two-level model exceeds that for the three-level model.
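The dependence on the number of classrooms can be illustrated with a short sketch. It assumes the balanced-design shift shares $(N_A - 1)/(N_A K - 1)$ for the school level and $N_A (K - 1)/(N_A K - 1)$ for the student level; the function names and the 24 students per school are hypothetical.

```python
def school_share(n_a, k):
    """Share of the classroom variance shifted up to the school level
    (assumed balanced design: k classrooms of n_a students per school)."""
    return (n_a - 1) / (n_a * k - 1)


def student_share(n_a, k):
    """Share of the classroom variance shifted down to the student level."""
    return n_a * (k - 1) / (n_a * k - 1)


# Hold students per school fixed at 24 and vary how they are grouped.
students_per_school = 24
for k in (2, 4, 8):
    n_a = students_per_school // k
    print(f"K={k}, N_A={n_a}: "
          f"up={school_share(n_a, k):.3f}, down={student_share(n_a, k):.3f}")
```

With 24 students per school, roughly 48% of the classroom variance moves up to the school level when there are 2 classrooms, against roughly 9% when there are 8.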
Because the classroom variance that is ignored by a two-level model is reflected in estimates of school and student variances, the classroom variance is not missing from a two-level analysis. Indeed, others have shown theoretically (Moerbeek, 2004; Van den Noortgate et al., 2005) that using a two-level model to estimate the variance components used to calculate the minimum detectable effect for a group-randomized research design can, under certain circumstances, produce the same results as a three-level model. It is easy to show that substituting Equations 4 and 5 into Equation 2 makes Equation 2 equivalent to Equation 1. As noted, however, these theoretical conclusions assume that every school has the same number of classrooms and the same number of students per classroom, that data used for planning a study reflect the number of classrooms per school and students per classroom that will be included in the actual study sample, and that no covariates will be used for the study’s analysis. To extend these theoretical findings to situations that occur in practice, in the remainder of this article, we explore empirically what happens when the middle level of a three-level model is excluded from analyses, using three-level student outcome data from four major sources.
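The substitution can be sketched as follows for the balanced case, in which $N_B = N_A K$. The notation is assumed for illustration ($\tau^2$, $\omega^2$, and $\sigma^2$ for the school, classroom, and student variances, with subscripts $A$ and $B$ for the three- and two-level models), as are the balanced-design shift coefficients used in the first line.

```latex
\begin{align*}
\tau_B^2 + \frac{\sigma_B^2}{N_A K}
&= \tau_A^2 + \frac{N_A - 1}{N_A K - 1}\,\omega_A^2
   + \frac{1}{N_A K}\left(\sigma_A^2
   + \frac{N_A (K - 1)}{N_A K - 1}\,\omega_A^2\right) \\[4pt]
&= \tau_A^2 + \frac{K (N_A - 1) + (K - 1)}{K (N_A K - 1)}\,\omega_A^2
   + \frac{\sigma_A^2}{N_A K} \\[4pt]
&= \tau_A^2 + \frac{\omega_A^2}{K} + \frac{\sigma_A^2}{N_A K}
\end{align*}
```

Dividing each term by $P(1 - P)J$, taking the square root, and multiplying by $M_{J-2}$ then gives the same MDES under both designs.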
The Data
Data from four different sources are used for the present empirical analysis. They are the School Breakfast Pilot Project (SBPP; Abt Associates Inc., & Promar International, 2005), the federal Reading First Impact Study (RFIS; Gamse, Bloom, Kemple, & Jacob, 2008), and statewide administrative records data on standardized test scores for individual students in multiple subjects from Florida and from North Carolina. Tables 1 and 2 describe the size, structure, and variability of the analysis samples for each data source. As can be seen, these samples provide an unusually large, diverse, and comprehensive empirical basis of analysis.
Table 1. Data Structure for Each Study and Outcome
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table.
Table 2. Variation in Data Structure for Each Study and Outcome
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table.
Table 1 reports the numbers of districts represented by these data plus the harmonic mean numbers of schools per district, classrooms per school, and students per classroom in the sample. Of particular importance is the fact that the internal cluster structure of schools in the sample (i.e., their number of classrooms and students per classroom) varies widely across the four data sources. Because, as demonstrated by Equations 4 and 5, this internal cluster determines how the classroom-level variance is shifted upward (to the school level) and downward (to the student level) when the middle level is ignored, it is important to represent a wide range of cluster structures in the analysis.
Table 2 describes the variability within the sample from each data source in its number of schools per district, number of classrooms per school, and number of students per classroom. This variability is measured by the standard deviation of each parameter. Of particular importance is the substantial variability that exists in the internal cluster structure of schools (their numbers of classrooms and students per classroom). This variability is what enables us to extend past theoretical work in ways that provide practical guidance for designing and analyzing educational evaluations (recall that existing theoretical findings assume no such variability).
The SBPP was a 3-year demonstration (2000–2003) that used a matched-pair, random-assignment design to randomly assign schools within six districts to a treatment condition in which schools implemented a universal free school breakfast program or to a control condition in which schools continued to operate their regular subsidized breakfast programs for eligible students from low-income families. The goal of the project was to measure the added value of universal free school breakfasts. The two outcome measures used from the SBPP for the analysis presented in this article are the Stanford 9 total math scaled score (math achievement test scores in scaled-score points) and the Stanford 9 total reading scaled score (reading achievement test scores in scaled-score points).
The present SBPP analysis sample contains 1,151 third graders from 233 classrooms in 111 schools from six districts. On average, there are approximately 3.7 students per classroom and 2 classrooms per school. The number of schools per district varies from around 6 to 8, depending on the outcome measure (see Table 1). The cluster structure of the SBPP sample is relatively constant by design because the original study sampled a fixed number of classrooms per school and students per classroom. Hence, this sample has the smallest standard deviations for these parameters (see Table 2).
The RFIS was a 3-year (2004–2007) congressionally mandated evaluation of the federal government’s Reading First initiative to help all children read at or above grade level by the end of third grade (Gamse et al., 2008). The study used a regression discontinuity design that capitalized on the systematic process used by some districts to allocate their Reading First funds to schools. The study was designed to measure the effects of the program on teacher practices and student achievement. Seventeen districts plus one state program were chosen for the study, and its original sample included 248 schools. Data for the present analysis are limited to 15 sites (14 districts plus one state) and 225 schools for which it was possible to estimate student, classroom, and school variance components. Reading First outcome measures used for the present analysis are SAT 10 reading scaled scores for all first, second, and third graders in the study’s schools during the spring of 2005.
Even though the RFIS used a regression discontinuity design, its data can be used to explore the implications for a research design that would have randomized schools. This was accomplished by ignoring the rating variable used to allocate Reading First funds (which was the basis for the study’s regression discontinuity analysis) and estimating the natural variation in academic outcomes that exists across schools, classrooms within schools, and students within classrooms. For similar reasons, the rating variable was ignored in the impact estimation models for the RFIS data. Therefore, the impact estimates for RFIS reported in this article differ from those reported in the original RFIS reports.
The RFIS sample for the analysis presented in this article includes approximately 10 schools per district, three classrooms per school, and nine students per classroom. Unlike the SBPP, the RFIS was not designed to have a constant cluster size and structure. Instead all first through third grade students in regular education classrooms in the study’s schools were included in its original sample. Hence, there is more variability across RFIS schools in the number of students per classroom and classrooms per school than is the case for SBPP schools.
Statewide data on test scores for individual students from Florida were obtained from the Florida Department of Education’s K-20 Education Data Warehouse, a longitudinal data system that includes records on all students, teachers, and schools in the state. Each year, Florida students in the 3rd through 11th grades take the Florida Comprehensive Assessment Test (FCAT) in reading and math. The analysis presented in this article uses data on these test scores for 5th grade (representing elementary school) in math and in reading for school year 2005–2006. All scores are normalized by subject. Samples are limited to students with valid test scores in both the current year and the previous year. The analytic samples are further restricted to “self-contained” classrooms only. On average, this elementary school sample comprises approximately 17 students per class, four classrooms per school, and six schools per district from a total of 43 districts.
Statewide data on test scores for individual students from North Carolina were obtained from the North Carolina Education Research Data Center for end-of-course assessments in reading and mathematics given to students in third through eighth grade in school year 2005–2006. The present analysis uses fifth grade scores to represent scores for elementary schools. Similar to what has been done with the Florida data, the analysis keeps students with valid test scores in both the current and the previous years, and it keeps self-contained classrooms only. On average, the elementary school sample has about 16 students per classroom, three classrooms per school, and five schools per district. A total of 86 districts are included in the elementary school sample.
Scores are also available for North Carolina secondary school students’ end-of-course assessments in algebra II, biology, chemistry, and geometry courses in school year 2005–2006. We use these scores to represent scores for secondary school. These end-of-course tests allow straightforward assignment of students to classrooms. The disadvantage of having end-of-course tests, on the other hand, is that students take these tests only once, and therefore no repeated measures of student performance in a particular subject are available. Thus, to control for pretest scores in the models that will be presented, students’ test scores on algebra I are used to approximate their starting levels. On average, the secondary school sample has approximately 15 to 20 students per classroom, three classrooms per school, and three schools per district. These data represent between 6 and 48 districts depending on the test subject.
Because the Florida and North Carolina data are for entire states, they reflect substantial variation in the number of students per classroom and classrooms per school. Hence, as can be seen from the standard deviations of the number of classrooms per school and the number of students per classroom reported in Table 2, the data exemplify schools with varying internal cluster structures. To investigate the impact of ignoring the middle level (the classroom level) in the context of estimating intervention effects, half of the schools in Florida and North Carolina were randomly assigned to a “treatment” group and the other half to a “control” group, such that the true “intervention effects” should be zero. This “quasi” treatment status is used in the impact estimations discussed later. It is not, however, used in the variance component estimation, because this randomly generated treatment status is not expected to affect the variance estimation.
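The placebo assignment described above can be sketched in a few lines. This is a minimal illustration that assumes simple randomization of half of all schools, without blocking by district; the function name, the seed, and the school identifiers are hypothetical.

```python
import random


def assign_placebo_treatment(school_ids, seed=12345):
    """Label a random half of the schools 'treatment' (1) and the rest
    'control' (0). No intervention exists, so the true effect is zero."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    ids = sorted(school_ids)           # stable order before sampling
    treated = set(rng.sample(ids, len(ids) // 2))
    return {sid: int(sid in treated) for sid in ids}


schools = [f"school_{i:03d}" for i in range(10)]
status = assign_placebo_treatment(schools)
print(sum(status.values()), "of", len(status), "schools labeled treatment")
```

Because the labels are generated independently of outcomes, any estimated "impact" in such data reflects sampling error only.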
Estimating Variance Components From Each Data Source
In this section, we report estimated variance components from each of the preceding data sources. Three-level variance components for Design A were estimated using a three-level hierarchical linear model (student-class-school); two-level variance components for Design B were estimated using a two-level hierarchical linear model (student-school). To reflect the typical range of common practices in education evaluation research, for each design, these variance components were estimated separately for models without covariates and for models with school-level or student-level baseline test scores as a covariate. Models for the SBPP and RFIS samples include a zero-or-one indicator variable to distinguish between treatment schools and control schools. This was not necessary for the purpose of estimating variance components for the Florida and North Carolina samples, because they do not actually constitute a specific set of treatment and control schools: The “treatment” and “control” status of the schools in these two samples is randomly generated to be used in the impact estimation and is not expected to have any effect on the variance component estimation. To ensure that all analyses are based solely on variation within school district, zero-or-one indicator variables for each district are included in the model. This is equivalent to centering all variables on the mean values for their districts (see Wooldridge, 2002).
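The within-district centering to which the district indicators are equivalent can be sketched as follows. The function name and the toy scores are hypothetical; the sketch only shows the centering step, not the full model.

```python
from collections import defaultdict


def center_within_district(values, districts):
    """Subtract each district's mean from the values observed in it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, district in zip(values, districts):
        totals[district] += value
        counts[district] += 1
    means = {d: totals[d] / counts[d] for d in totals}
    return [value - means[d] for value, d in zip(values, districts)]


# Toy example: three students in district A, two in district B.
scores = [10.0, 12.0, 14.0, 20.0, 24.0]
districts = ["A", "A", "A", "B", "B"]
centered = center_within_district(scores, districts)
print(centered)
```

After centering, differences in district means no longer contribute to the variation that the variance components describe.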
Estimated Variance Components
Table 3 presents estimated variance components for all the outcomes in the data sets used in this article. The first three columns of Table 3 report estimated variance components for the three levels (school-class-student) of Model A, and the last two columns report estimated variance components for the two levels (school-student) of Model B. Each estimated variance component is standardized and reported as a proportion of the total student-level variance for the sample that it represents. Values for the three variance components in Model A sum to 1, and values for the two variance components in Model B sum to 1. Hence, the standardized variance components for schools and classrooms in these models represent intraclass correlations (ICCs; i.e., the proportion of total student variation that is at the school and at the classroom level, respectively). In addition, what is not shown in the table but was documented empirically is that, in all cases, the sum of the estimated nonstandardized three-level variance components equals the sum of the estimated nonstandardized two-level variance components (as noted by Equation 3).
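The standardization described above amounts to dividing each variance component by their sum. A minimal sketch, with a hypothetical function name and hypothetical raw (unstandardized) variance components:

```python
def standardize(components):
    """Express each variance component as a proportion of the total."""
    total = sum(components.values())
    return {level: value / total for level, value in components.items()}


# Hypothetical raw variance components in squared scaled-score points.
raw_three_level = {"school": 12.0, "classroom": 6.0, "student": 82.0}
shares = standardize(raw_three_level)
print(shares)
```

The school and classroom shares are the school-level and classroom-level ICCs, and the shares sum to 1 by construction.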
Table 3. Three-Level Versus Two-Level Model Comparisons: Variance Components at Various Levels as a Proportion of Total Variation
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table. Estimated values for the variance components were obtained from a three-level model and a two-level model of the outcome measure without covariates. All analyses for SBPP and RFIS include an indicator variable distinguishing treatment and control groups as well as indicator variables for school districts in the study sample. Models used for the Florida and North Carolina data include only indicator variables for school districts in the study sample.
The variance component at school level as a proportion of total student variation is also known as the school-level intraclass correlation.
The variance component at the class level as a proportion of total student variation is also known as the class-level intraclass correlation.
Note that the estimated classroom variance for elementary schools is consistently a much smaller proportion of total student variation than it is for secondary schools. For elementary schools, this proportion is always below 0.140 and in most cases is well below this value. In contrast, for secondary schools the proportion ranges from 0.293 to 0.376. This striking difference probably reflects more extensive student tracking in secondary schools than in elementary schools.
For SBPP Stanford 9 math scores (represented by the first row in Table 3), the standardized variance for schools in the three-level analysis equals 0.085. This means that 0.085 (or 8.5%) of the total variation across students in the analysis sample (within district blocks) is estimated to reflect differences in mean outcomes across schools. In other words, the school-level ICC equals 0.085. The standardized variance for classrooms in the three-level analysis equals 0.029. This means that 0.029 (or 2.9%) of the total variation across students in the analysis sample (within district blocks) is estimated to reflect differences in mean outcomes across classrooms within schools. In other words, the classroom-level ICC equals 0.029. The remaining proportion of total student variation (0.886) is due to differences in outcomes for students within classrooms. If instead of using a three-level model, variance components for the same data are estimated ignoring the classroom level, the estimated school-level variance is 0.097, and thus 0.903 of the total student variance is within schools (see columns 4 and 5 in Table 3).
The important point to note about these findings is that the classroom-level variance in the three-level model is shifted both to the school-level variance and to the student-level variance in the two-level model. Specifically, the estimated school-level variance for the two-level model (0.097) is larger than that for the three-level model (0.085), and the estimated variance for students within schools in the two-level model (0.903) is larger than that for students within classrooms in the three-level model (0.886). These differences are quite small, however, because the estimated classroom-level variance (0.029) is only a small proportion of total student variation. These differences, and the degree of “level shifting” they represent, are more pronounced for other samples in the table that have a greater proportion of their variation at the classroom level.
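The arithmetic behind this example can be checked directly. The sketch below uses the standardized SBPP math components quoted in the text; the predicted share is an assumption that applies the balanced-design coefficient $(N_A - 1)/(N_A K - 1)$ to the approximate SBPP cluster sizes (3.7 students per classroom, 2 classrooms per school), so it is illustrative only.

```python
# Standardized variance components for SBPP Stanford 9 math (from the text).
three_level = {"school": 0.085, "classroom": 0.029, "student": 0.886}
two_level = {"school": 0.097, "student": 0.903}

# How much each two-level estimate exceeds its three-level counterpart.
shift_up = two_level["school"] - three_level["school"]
shift_down = two_level["student"] - three_level["student"]

# Share of the classroom variance that moved up to the school level.
share_up = shift_up / three_level["classroom"]

# Assumed balanced-design prediction with approximate SBPP cluster sizes.
n_a, k = 3.7, 2
predicted_share_up = (n_a - 1) / (n_a * k - 1)

print(f"shift up = {shift_up:.3f}, shift down = {shift_down:.3f}")
print(f"observed share up = {share_up:.2f}, predicted = {predicted_share_up:.2f}")
```

The two shifts (0.012 and 0.017) add up to the classroom variance of 0.029. Because the published components are rounded to three decimals, the observed share is only approximate.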
These variance component estimates also show that the school-level variances, and by extension the school-level ICCs, estimated from a three-level model are generally smaller, and sometimes much smaller, than the ones estimated from a two-level model (compare columns 1 and 4 in Table 3). Therefore, when conducting power calculations for two-level designs, one must not use school-level ICCs estimated from a three-level model as substitutes for school-level ICCs estimated from a two-level model, because doing so would lead to underestimation of the MDES of the design.
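To illustrate the size of this understatement, the sketch below plugs both school-level ICCs for SBPP Stanford 9 math into a two-level MDES calculation. It assumes that Equation 2 has the standard two-level form for a school-randomized design and that the multiplier is about 2.85 (two-tailed alpha of 0.05, power of 0.80, 58 degrees of freedom); both are assumptions, and the function name is hypothetical.

```python
from math import sqrt


def mdes_two_level(icc_school, n_schools, students_per_school,
                   p_treated=0.5, multiplier=2.85):
    """Two-level MDES in standardized units (assumed standard form)."""
    within = 1.0 - icc_school  # remaining variance is within schools
    denom = p_treated * (1.0 - p_treated) * n_schools
    return multiplier * sqrt(icc_school / denom
                             + within / (denom * students_per_school))


students_per_school = 2.06 * 3.72  # harmonic means quoted for SBPP

# School-level ICC from the two-level model (appropriate for a two-level design).
correct = mdes_two_level(0.097, 60, students_per_school)
# School-level ICC borrowed from the three-level model (inappropriate here).
misused = mdes_two_level(0.085, 60, students_per_school)

print(f"two-level ICC:   MDES = {correct:.3f}")
print(f"three-level ICC: MDES = {misused:.3f}")
```

Under these assumptions the borrowed three-level ICC yields a smaller MDES, so the planned study would appear more precise than it is.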
Comparing Predicted Versus Actual Shifting of the Middle-Level Variance Component
The empirical findings presented in Table 3 are consistent in direction with Equations 4 and 5, which predict the upward and downward shifting of an ignored classroom variance component. However, as already noted, Equations 4 and 5 assume a constant number of classrooms per school and students per classroom, whereas the samples used to estimate variance components for the analysis presented in this article (and for almost all others in education research) comprise schools that vary in these regards. Table 4 thus assesses the extent to which this variation in the internal structure of schools (clusters) causes the actual shifting in the classroom-level variance to differ from the amount of shifting predicted by Equations 4 and 5.
Table 4. Three-Level Versus Two-Level Model Comparisons: Percentage of Variance Shifted to School and Student Levels When Class Level Is Not Explicitly Accounted For
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table.
The first two columns in the table report the actual percentage of the classroom-level variance that is shifted to the school level and student level, respectively, and the last two columns present the corresponding percentages that are predicted by Equations 4 and 5 on the basis of the harmonic mean values for the number of classrooms per school and students per classroom in the analysis sample for each data source. Even though there was variability in the underlying cluster structure of the various data sets that were explored by this analysis, the distribution of the classroom-level variance to the school and student levels is fairly consistent with what the formula predicts.
It is also worth noting that even in situations where the percentage of variation shifted to each level differs from the theoretical prediction, the difference between the predicted and actual amount of variance that is shifted to each level may still be small if the middle-level variance component was small to begin with, as is the case with most of the elementary school data used in this analysis.
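A comparison of this kind can be sketched for the SBPP Stanford 9 math score. The shares below come from a balanced-design method-of-moments derivation in which a fraction (N − 1)/(KN − 1) of the classroom variance moves to the school level and the remainder moves to the student level, with K classrooms per school and N students per classroom. This form is assumed here to correspond to Equations 4 and 5 and is not quoted from them.

```python
def predicted_shift(classes_per_school, students_per_class):
    """Predicted shares of classroom variance shifted up and down.

    Assumes a balanced design; K and N may be harmonic means.
    """
    K, N = classes_per_school, students_per_class
    to_school = (N - 1) / (K * N - 1)
    to_student = N * (K - 1) / (K * N - 1)
    return to_school, to_student


# Harmonic means quoted in the text for SBPP (row 1 of Table 1).
pred_school, pred_student = predicted_shift(2.06, 3.72)

# Observed shares implied by the rounded Table 3 values quoted in the text.
obs_school = (0.097 - 0.085) / 0.029
obs_student = (0.903 - 0.886) / 0.029

print(f"to school:  predicted {pred_school:.3f}, observed {obs_school:.3f}")
print(f"to student: predicted {pred_student:.3f}, observed {obs_student:.3f}")
```

For this outcome the predicted and observed shares differ by less than one percentage point, although the observed shares are computed from rounded inputs.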
Planning a Study: Estimated Minimum Detectable Effect Size
Given the estimated variance components based on the three-level and two-level models, the next step in this analysis was to explore how not explicitly acknowledging the middle level affects the estimated MDES for each of the outcomes in the data, which indicates the level of precision one could expect to obtain for a study with a given sample size. This issue is of interest because the power of a study might be incorrectly calculated, or the standard error of an impact estimate might be incorrectly estimated, when the model used for the power calculation and the model used for the ultimate impact estimation do not match (i.e., one is a two-level model and the other is a three-level model). The implications of such a mismatch are first explored by estimating the precision of three-level and two-level analyses for a planned study with a cluster structure that is identical to the one from which the multilevel variances were estimated. For example, for the SBPP, it is assumed that the study being planned would include approximately two classrooms per school and approximately four students per classroom (see Table 1). These findings do not necessarily extrapolate to the typical situation in practice, in which multilevel variances are computed from data for an existing study and then used to design a future study with a different sample size and structure. This situation will be explored later.
The findings from this analysis are presented in Table 5. The analysis uses the standardized variance estimates from Table 3 plus the harmonic mean number of students per classroom and classrooms per school (as shown in Table 1) for each outcome measure to compute the MDES for that measure given its original sample structure (within a school). The total number of schools was assumed to be 60, and there were assumed to be 30 treatment schools and 30 control schools for all outcomes. Equation 1 was used to compute MDES for three-level analyses, and Equation 2 was used for two-level analyses.
Table 5. Comparison Between Three-Level and Two-Level Estimates: Minimum Detectable Effect Sizes, Original Sample Structure
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table. Estimated values for the intraclass correlations were obtained from a three-level model and a two-level model of the outcome measure without covariates. A school-level pretest and a student-level pretest measure were used in the model to obtain the R2 values used in the MDES calculation for models with covariates. In addition, all analyses for SBPP and RFIS include an indicator variable distinguishing treatment and control groups as well as indicator variables for school districts in the study sample. Models used for the Florida and North Carolina data only include indicator variables for school districts in the study sample.
Number of schools = 60, treatment/control = 1:1.
The first set of columns in the table shows findings from models that do not include any covariates (other than treatment indicators and district indicator variables where applicable). The first column presents the MDES for the three-level model for each measure, and the second column shows the difference between the MDES for the three-level model and the corresponding two-level model (the three-level estimate minus the two-level estimate). Consider yet again the findings for the SBPP Stanford 9 math score. Assuming 60 schools with 2.06 classrooms per school and 3.72 students per class (from row 1, Table 1) plus the three-level standardized unconditional variance estimates of 0.085, 0.029, and 0.886 for schools, classrooms, and students, respectively (from row 1, Table 3), an unconditional MDES of 0.341 was computed using Equation 1. Similarly, assuming 60 schools and an average of 7.67 students per school (2.06 classrooms × 3.72 students) and the two-level standardized unconditional variance estimates of 0.097 and 0.903 (from row 1, Table 3), an unconditional MDES of 0.342 was computed using Equation 2. The difference between these two MDES (0.000) is shown in the second column of Table 5.
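The SBPP calculation just described can be approximated with a short script. This is a sketch under stated assumptions: Equations 1 and 2 are taken to have the standard forms for school-randomized designs, and the multiplier is set to about 2.85 (the sum of the 0.975 and 0.80 quantiles of a t distribution with 58 degrees of freedom). The article's exact degrees-of-freedom convention may differ, and the function names are illustrative.

```python
from math import sqrt

# Assumed multiplier: two-tailed alpha = 0.05, power = 0.80, 58 degrees of freedom.
MULTIPLIER = 2.85


def mdes_three_level(tau2, gamma2, sigma2, J, K, N, P=0.5, M=MULTIPLIER):
    """MDES with school (tau2), classroom (gamma2), and student (sigma2) variances."""
    d = P * (1 - P) * J
    return M * sqrt(tau2 / d + gamma2 / (d * K) + sigma2 / (d * K * N))


def mdes_two_level(tau2, sigma2, J, n, P=0.5, M=MULTIPLIER):
    """MDES with school (tau2) and within-school (sigma2) variances; n students per school."""
    d = P * (1 - P) * J
    return M * sqrt(tau2 / d + sigma2 / (d * n))


J, K, N = 60, 2.06, 3.72  # schools, classrooms per school, students per classroom
mdes3 = mdes_three_level(0.085, 0.029, 0.886, J, K, N)
mdes2 = mdes_two_level(0.097, 0.903, J, K * N)

print(f"three-level MDES: {mdes3:.3f}")
print(f"two-level MDES:   {mdes2:.3f}")
print(f"difference:       {mdes3 - mdes2:+.3f}")
```

Under these assumptions both values come out near 0.34 and differ by well under 0.001. Small departures from the reported 0.341 and 0.342 would be expected because the inputs used here are rounded.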
The results show that for the elementary school data, for which the classroom-level variance is relatively small, the predicted level of precision is essentially the same whether the study was planned using a two-level analysis or a three-level analysis. For the elementary school data, estimates of MDES from the two- and three-level models differ by less than 0.005 in all cases. Although the differences in estimates of MDES are slightly larger among the secondary school data, for which the classroom-level variance components were substantially larger (ranging from 0.293 to 0.376), from a substantive perspective they remain quite small in absolute terms. So in the data sets explored in this analysis, if one did not explicitly acknowledge the middle level of clustering in designing a study with a data structure that was identical to the one used for planning purposes, one would, at worst, overstate the MDES by 0.021 for the North Carolina physics exam, which is about a 5% difference in precision. From a substantive perspective, this is a small difference.
Including Covariates
The findings shown in Table 5 also move comparisons of two- and three-level analyses one step further by taking the inclusion of covariates into account. In practice, baseline characteristics such as students’ prior test scores and demographics are often used as covariates to improve the precision of impact estimates; yet theoretical explorations of the implications of not explicitly acknowledging the middle level assume that no covariates are included. Therefore, to see how the inclusion of covariates would influence the results shown in the first column of Table 5, a corresponding set of analyses was conducted in which either a school-level pretest variable (second set of columns) or a student-level pretest variable (third set of columns) was included.
To the extent that covariates predict the variation in outcomes across individuals, classrooms, or schools, they reduce the “unexplained” variance at each of these levels. This in turn reduces the standard error of the impact estimate. Therefore, with covariates, the formula for computing the MDES for a three-level model (Model A) becomes
and for a two-level model (Model B), the formula becomes
where
The R2 values are calculated as the proportion of each unconditional variance that is explained by the covariates, that is, for Level L, where L is school, classroom, or student,
where
On the basis of these estimated R2 values (presented in Appendix Table A1) and the original unconditional variances presented in Table 3, it is possible to use Equations 6 and 7 to estimate the MDES for the original sample given available covariates. To do so for the school-level pretest, the R2 value obtained after including a school-level pretest in Model A was computed and substituted into Equation 6 above. For the student-level pretest, the R2 values obtained for school, classroom, and student levels after including a student-level pretest in Model A were computed. In all cases, the unconditional variances and total number of students, classrooms, and schools remained the same as in previous models.
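As a sketch of the structure such covariate-adjusted formulas usually take, the expressions below scale each unconditional variance by one minus its R2. The notation is an assumption rather than the article's own: τ², γ², and σ² denote the school-, classroom-, and student-level variances, subscripts A and B denote the three-level and two-level models, J is the number of schools, K the number of classrooms per school, N the number of students per classroom, P the proportion of schools assigned to treatment, and M the multiplier for the applicable degrees of freedom.

```latex
% Three-level model (Model A), assumed form
\mathrm{MDES}_A = M \sqrt{
  \frac{\tau_A^2 \left(1 - R^2_{\mathrm{school},A}\right)}{P(1-P)J}
  + \frac{\gamma_A^2 \left(1 - R^2_{\mathrm{class},A}\right)}{P(1-P)JK}
  + \frac{\sigma_A^2 \left(1 - R^2_{\mathrm{student},A}\right)}{P(1-P)JKN}}

% Two-level model (Model B), assumed form
\mathrm{MDES}_B = M \sqrt{
  \frac{\tau_B^2 \left(1 - R^2_{\mathrm{school},B}\right)}{P(1-P)J}
  + \frac{\sigma_B^2 \left(1 - R^2_{\mathrm{student},B}\right)}{P(1-P)JKN}}

% Proportion of the unconditional variance at level L explained by the covariates
R^2_L = \frac{\mathrm{Var}_L^{\mathrm{unconditional}} - \mathrm{Var}_L^{\mathrm{conditional}}}
             {\mathrm{Var}_L^{\mathrm{unconditional}}}
```

Under this definition, an estimated R2 is negative whenever the estimated conditional variance at a level exceeds the estimated unconditional variance, which is how negative entries such as those in Appendix Table A1 can arise.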
The findings from these analyses are presented in the second and third sets of columns in Table 5. The first point to notice about these results is that including a pretest as a covariate at either the school or student level causes an overall reduction in the MDES (a finding that is consistent with prior research). Take again the SBPP math score. Without covariates, the MDES for the three-level analysis is 0.341. With a school-level pretest variable, the MDES from the three-level analysis is reduced to 0.256, and with a student-level pretest, the MDES from the three-level analysis is reduced to 0.316. A similar reduction is seen in the two-level models.
The second point to notice about the results presented in the second and third sets of columns in Table 5 is that including a school-level covariate in the models used to estimate the MDES tends to exacerbate the difference between the predicted MDES obtained from a three-level analysis and the comparable two-level analysis relative to models that included no covariates. In all instances, the difference between the three-level and two-level estimates of MDES is larger than in the unconditional model. Furthermore, in these instances, the MDES would be underestimated if a two-level model were used. This is because including the pretest at the school level reduces the variance at the school level, thereby increasing the relative amount of variance that is accounted for at the classroom level, which is the source of the potential problem being examined. However, the differences between the two- and three-level analyses remain quite small, especially for elementary school data. The largest difference is 0.032, for the North Carolina high school geometry test. Thus, although the inclusion of a school-level pretest makes the difference between the two- and three-level analyses larger, in no case does one observe a “distortion” that is substantively important.
On the other hand, as can be seen in the third set of columns, the inclusion of a student-level pretest variable reduces the difference between the estimated MDES obtained from the two- and three-level analyses. In all instances, the differences between the predicted MDES from the three- and two-level analyses are smaller when the student-level pretest variable is included than is the case for the unconditional analyses. Additionally, including a student-level pretest seems to eliminate some of the largest differences that were observed in the unconditional analyses. With the inclusion of the student-level pretest variable, for example, the difference between the two- and three-level analyses for the North Carolina high school chemistry test is reduced from 0.017 to 0.000. The largest difference between the predicted MDES from the three- and two-level analyses is 0.026 (North Carolina physics) when a student-level pretest is included. Note that algebra I scores were used as pretest measures for North Carolina secondary school subjects. This large difference may reflect the fact that algebra I is not a very good proxy for students’ previous knowledge of physics.
It is not hard to see why including the student-level pretest helps reduce the problem. As shown earlier, the problem in the secondary school data is being driven by the large classroom-level variance component. If a student-level pretest is included, this classroom-level variance component is reduced substantially, because much of it is accounted for by the student-level covariate. More important, the student-level covariate absorbs as much or more of the classroom-level variance than it does of the school-level or student-level variances, especially at the secondary school level. This pattern is demonstrated by the school-, classroom-, and student-level R2 values for student covariates reported in Appendix Table A1.
On the other hand, including a school-level pretest tends to exacerbate the problem, because the school-level pretest reduces variance only at the school level, making the relative size of classroom-level variance even larger.
In summary, these findings illustrate that MDES computed from a two-level analysis, even when school-level or student-level covariates are included, are quite similar to those computed from a three-level analysis with the same data and covariates.
Varying the Sample Structure
The findings presented in Table 5 assume that the planned study has a cluster structure that is identical to the one from which the multilevel variances were estimated and used in the MDES calculation. However, these findings do not necessarily extrapolate to the typical situation in practice, in which multilevel variances are computed from data for an existing study and then used to design a future study with a different sample structure. One way to emulate this common situation is to vary the assumed target or planning sample structure and recompute minimum detectable effects for two-level and three-level analyses on the basis of current estimates for their ICCs and R2 values. Tables 6 and 7 show what the implications would be for planning a study when the underlying cluster structure has twice as many classrooms per school as the study being used to estimate relevant ICCs and R2 values (Table 6) and what the implications would be if the study being planned had half as many classrooms as the study being used to estimate relevant ICCs and R2 values (Table 7). Note in all cases that the total number of schools, as well as the total number of students per school, remains constant, and thus the two-level estimates used to create Tables 6 and 7 are the same as those used to create Table 5.
Table 6. Minimum Detectable Effect Sizes for Alternative Sample Structures, Double the Number of Classes per School
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table. Estimated values for the intraclass correlations were obtained from a three-level model and a two-level model of the outcome measure without covariates. A school-level pretest and a student-level pretest measure were used in the model to obtain the R2 values used in the MDES calculation for models with covariates. In addition, all analyses for SBPP and RFIS include an indicator variable distinguishing treatment and control groups as well as indicator variables for school districts in the study sample. Models used for the Florida and North Carolina data only include indicator variables for school districts in the study sample.
Double the number of classes per school, keep number of students per school constant, K = 2Ko, N = No/2, number of schools = 60, treatment/control = 1:1.
Table 7. Minimum Detectable Effect Sizes for Alternative Sample Structures, Half of the Number of Classes per School
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table. Estimated values for the intraclass correlations were obtained from a three-level model and a two-level model of the outcome measure without covariates. A school-level pretest and a student-level pretest measure were used in the model to obtain the R2 values used in the MDES calculation for models with covariates. In addition, all analyses for SBPP and RFIS include an indicator variable distinguishing treatment and control groups as well as indicator variables for school districts in the study sample. Models used for the Florida and North Carolina data only include indicator variables for school districts in the study sample.
Half of the number of classes per school, keep number of students per school constant, K = Ko/2, N = 2No, number of schools = 60, treatment/control = 1:1
Recall that the original SBPP Stanford math data had approximately four students per classroom and two classrooms per school (see Table 1). Table 6 explores the implications of planning a study in which, instead of having two classrooms per school and four students per classroom, there are four classrooms per school with two students per classroom. As before, the first set of columns in Table 6 shows results for analyses without covariates. The second set of columns shows analyses in which a school-level pretest is included, and the third shows the results of analyses in which a student-level pretest is included.
The findings in Table 6 illustrate that when the number of classrooms per school is doubled and the number of students per school is held constant, the MDES computed from a two-level analysis with or without covariates are almost identical to those computed from a three-level analysis with the same data and covariates. Thus, if one is planning a study in which the number of classrooms per school is greater than the number in the study used to compute the MDES, using a two-level model for analysis purposes will provide good estimates of the MDES, even though the middle level is not being accounted for explicitly.
Table 7 shows corresponding findings after halving the number of classrooms per school but holding constant the number of students per school. The results shown in Table 7 also indicate that, with the exception of the North Carolina secondary school data, the two- and three-level analyses yield quite comparable MDESs, even though the sample structure has changed substantially. For the elementary school data, the difference between the estimates of MDES derived from the two- and three-level models is never more than 0.031. However, for the North Carolina secondary school data the differences between the estimates obtained from the two- and three-level analyses are much more sizable, ranging from 0.073 to 0.099. When a school-level pretest is added to the model (second set of columns), a step that, as seen earlier, tends to magnify the difference between the two- and three-level models, the differences in MDES between the two models range from 0.093 to 0.126 for the various North Carolina secondary school outcomes. In this instance, using a two-level model to estimate the MDES in a study in which the underlying data structure is actually composed of three levels could be misleading. Yet, as also demonstrated earlier, including a student-level pretest (third set of columns) can reduce the difference between the estimates obtained from the two- and three-level models and help mitigate problems. In this case, the inclusion of the student-level pretest does reduce the differences substantially.
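The effect of changing the planned structure can be sketched for the unconditional SBPP Stanford 9 math case, the only outcome for which the text quotes all of the required inputs. The script assumes the standard three-level and two-level MDES forms and a multiplier of about 2.85 (two-tailed alpha of 0.05, power of 0.80, 58 degrees of freedom); these are assumptions, the function names are illustrative, and the printed values are not taken from Tables 6 and 7.

```python
from math import sqrt

MULTIPLIER = 2.85  # assumed: t(0.975) + t(0.80) with 58 degrees of freedom


def mdes_three_level(tau2, gamma2, sigma2, J, K, N, P=0.5, M=MULTIPLIER):
    d = P * (1 - P) * J
    return M * sqrt(tau2 / d + gamma2 / (d * K) + sigma2 / (d * K * N))


def mdes_two_level(tau2, sigma2, J, n, P=0.5, M=MULTIPLIER):
    d = P * (1 - P) * J
    return M * sqrt(tau2 / d + sigma2 / (d * n))


J, K0, N0 = 60, 2.06, 3.72  # original SBPP structure (harmonic means)

# The two-level MDES depends only on students per school, which is held constant.
m2 = mdes_two_level(0.097, 0.903, J, K0 * N0)

structures = {
    "original": (K0, N0),
    "double classes": (2 * K0, N0 / 2),
    "half classes": (K0 / 2, 2 * N0),
}
results = {}
for label, (K, N) in structures.items():
    m3 = mdes_three_level(0.085, 0.029, 0.886, J, K, N)
    results[label] = m3
    print(f"{label:15s} three-level {m3:.3f}  two-level {m2:.3f}  diff {m3 - m2:+.3f}")
```

Because only the classroom term of the three-level formula depends on K separately from the product KN, spreading the same students over more classrooms lowers the three-level MDES, while concentrating them in fewer classrooms raises it. The size of either change is governed by the classroom-level variance, which is small for this outcome.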
Planning a Study: Summary
Given these findings, what are the implications of planning a study that randomizes groups composed of three levels of variation, without explicitly accounting for the middle level? The preceding discussion shows that in almost all instances, the MDES obtained using two levels of data (e.g., students clustered within schools) is very similar to what would have been obtained with data at three levels (e.g., students clustered within classrooms within schools). This is true even when the data being used for planning purposes have a variable cluster structure, include covariates at the student level or school level, or do not reflect the same underlying structure as the sample used in the actual study (i.e., same number of students per classroom and classrooms per school). The similarity of MDES is especially true for data in which the variance component at the classroom-level is relatively small, which is usually the case in elementary schools. When the classroom-level variance component is large, the difference between the estimates derived from the two- and three-level analyses can in rare cases be meaningful, and the addition of a school-level pretest variable can make this problem worse. But including a pretest variable at the student level can help eliminate this problem under most circumstances.
Analyzing Data With a Three-Level Structure Using a Two-Level Model
Until now, the discussion has focused on planning future studies using three-level data when the extant data lack information at the “middle level,” that is, the classroom level. We now consider the analysis of the data from the impact study itself: Specifically, do the point estimate and estimated standard error for an impact at the school level remain the same whether or not the middle level of a three-level situation is considered explicitly? This question is particularly important, because in many instances, researchers are not able to explicitly link students to classes within schools and have no choice but to estimate a two-level model that does not explicitly consider the middle level of the data structure.
It has been shown that estimating a three-level model using feasible generalized least squares that fully accounts for the clustering in one’s data will provide consistent and asymptotically efficient estimates (Cheong, Fotiu, & Raudenbush, 2001). The questions here are whether researchers can obtain consistent estimates of program impact if they misspecify the model by not explicitly accounting for the middle level of clustering and whether the resulting estimates will be asymptotically efficient.
It can be shown that for samples with a constant number of students per classroom and classrooms per school and no covariates at the student, classroom, or school level other than the treatment indicator at school level, one will obtain identical estimates of program impacts and identical estimates of standard errors whether or not the middle level of a three-level situation is explicitly acknowledged. 3 However, as was the case when MDES obtained from two- and three-level models were considered, these proofs hold only for data that have a cluster structure that remains constant across clusters (i.e., schools), which is rarely the case in practice. In addition, the proofs do not take into account situations in which covariates are included at the student or school level, a situation that also frequently occurs. Furthermore, the proofs are for expected values of the estimators being considered, not for specific estimates from a given sample. To explore how well conclusions from the proofs hold for a broader and more realistic range of data structures, we return to our empirical analyses.
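The balanced-design result for standard errors can be checked numerically. In a balanced design without covariates, the variance of the impact estimate is proportional to the variance of a school mean, so it suffices to show that the two-level components implied by level shifting reproduce the three-level variance of a school mean. The shifting formulas below are a balanced-design method-of-moments result assumed for this sketch; they are not quoted from the article's equations or proofs.

```python
from math import isclose


def shifted_two_level(tau2, gamma2, sigma2, K, N):
    """Two-level variances implied by ignoring the classroom level (balanced case)."""
    tau2_b = tau2 + gamma2 * (N - 1) / (K * N - 1)
    sigma2_b = sigma2 + gamma2 * N * (K - 1) / (K * N - 1)
    return tau2_b, sigma2_b


def var_school_mean_three(tau2, gamma2, sigma2, K, N):
    """Variance of a school mean under the three-level model."""
    return tau2 + gamma2 / K + sigma2 / (K * N)


def var_school_mean_two(tau2_b, sigma2_b, K, N):
    """Variance of a school mean under the two-level model."""
    return tau2_b + sigma2_b / (K * N)


# Small (elementary-like) and large (secondary-like) classroom variances,
# combined with several balanced structures. The second triple is hypothetical.
cases = [(0.085, 0.029, 0.886), (0.10, 0.35, 0.55)]
structures = [(2, 4), (4, 2), (3, 25), (10, 20)]

checks = []
for tau2, gamma2, sigma2 in cases:
    for K, N in structures:
        v3 = var_school_mean_three(tau2, gamma2, sigma2, K, N)
        tau2_b, sigma2_b = shifted_two_level(tau2, gamma2, sigma2, K, N)
        v2 = var_school_mean_two(tau2_b, sigma2_b, K, N)
        checks.append(isclose(v3, v2, rel_tol=1e-12))

print("variance of school mean identical in every case:", all(checks))
```

The equality holds regardless of how large the classroom variance is, which is consistent with the statement that differences between the two analyses arise from unbalanced structures, covariates, and sampling variation rather than from the omission itself.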
Table 8 presents coefficient estimates and estimated standard errors for a school-level treatment indicator using both a two- and three-level model, for the four data sources. All results in Table 8 are reported as standardized mean difference effect sizes so that these results can be compared across different data sources and outcome measures. 4 Recall that throughout the article, the RFIS data are used to explore the implications of these data for a research design that would have randomized the schools. Therefore, the RFIS impact estimates reported here (which are purely for methodological purposes) are not at all the same as those for the original study (which was based on a different research design and a different sample). Also recall that no specific interventions are being tested with the data for Florida and North Carolina, for which a zero-or-one treatment indicator was simply randomly assigned to each school. In this case, estimates of the coefficient for this indicator should be approximately zero. The first four columns include no covariates other than a treatment status indicator and indicators for districts. Columns 5 through 8 show models with a school-level pretest included, and columns 9 through 12 include a student-level covariate.
Table 8. Three-Level Versus Two-Level Model Comparisons: Impact Estimates and Standard Errors in Effect Size Units
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table. Estimated impacts were obtained from a three-level model and a two-level model of the outcome measure with or without school- or student-level pretests as covariates. All analyses include an indicator variable distinguishing treatment and control groups as well as indicator variables for school districts in the study sample. All results are presented in effect size units. Effect sizes are calculated using the student-level outcome test score standard deviations for control group students.
Although the point estimates and standard errors shown in Table 8 are not exactly the same for the two types of analyses, they are in most instances quite comparable. Even in instances in which point estimates and standard errors differ somewhat, the same inferences would be drawn from a two-level or a three-level model. For example, for the RFIS data, the impact estimate for the second grade test was −0.078 standard deviations with a standard error of 0.04 when the impact was estimated using a three-level model. The corresponding two-level model yielded an impact estimate of −0.070 standard deviations with a standard error of 0.04. Both point estimates are similar in magnitude, and neither is statistically significant, so in both instances, the evidence indicates that Reading First had little or no impact on second grade SAT 10 reading scores. These findings hold across data sets of widely varying sizes and structures.
As was the case for planning a study, these findings suggest that a two-level model can be used to estimate program impacts, even when it does not explicitly acknowledge a middle level of clustering. This is particularly true when the middle-level variance component is small, as is the case for most elementary school outcomes. However, the finding also holds for secondary school data, for which the classroom-level variance component is relatively larger, and for situations in which the cluster structure varies across the schools in the sample and when covariates are included in the model.
Conclusions
As noted, this article is intended to provide practical guidance to researchers who are designing and analyzing studies that randomize schools to measure intervention effects on student academic outcomes when no information is available about the middle (classroom) level of clustering. Using four multisite data sets based on academic outcomes for students within classrooms within schools, we have explored in detail the implications of not explicitly acknowledging the middle level when planning or analyzing data in which the coefficient of interest is at the third (school) level. The analysis shows that in almost all situations, one will obtain nearly identical results, whether or not the classroom or middle level is acknowledged explicitly. With one exception, this conclusion holds for both elementary school data (for which the classroom variance component is typically quite small) and for secondary school data (for which the classroom variance component is somewhat larger), for data sets with varying numbers of students per classroom and classrooms per school, in situations in which covariates are included at either the student or school level, and in situations in which the cluster structure of the study being planned differs substantially from the one used for planning purposes. The only potential problem arises when the middle-level variance component is large (which is usually only the case for secondary school data) and when the study being planned has a markedly different cluster structure than the study that was used for planning purposes. Even in this kind of situation, if a student-level pretest variable is included in the models, any potential problems that may arise can be virtually eliminated. Thus, in most situations, researchers can proceed with two-level analyses of three-level data without too much cause for concern.
Footnotes
Appendix
Table A1. Three-Level Versus Two-Level Model Comparisons: Estimated R2 Values for School- and Student-Level Covariates
| Data source and outcome | School-level covariate, three-level model: school level | School-level covariate, three-level model: class level | School-level covariate, three-level model: student level | School-level covariate, two-level model: school level | School-level covariate, two-level model: student level | Student-level covariate, three-level model: school level | Student-level covariate, three-level model: class level | Student-level covariate, three-level model: student level | Student-level covariate, two-level model: school level | Student-level covariate, two-level model: student level |
|---|---|---|---|---|---|---|---|---|---|---|
| SBPP | | | | | | | | | | |
| Stanford 9 total math scaled score | 0.377 | −0.107 | 0.005 | 0.311 | 0.004 | 0.514 | 0.393 | 0.477 | 0.510 | 0.475 |
| Stanford 9 total reading scaled score | 0.092 | 0.057 | −0.003 | 0.091 | −0.001 | 0.765 | 0.890 | 0.503 | 0.795 | 0.520 |
| RFIS | | | | | | | | | | |
| SAT 10 reading comprehension test, Grade 1 | 0.506 | −0.021 | 0.000 | 0.379 | 0.000 | 0.504 | 0.351 | 0.139 | 0.470 | 0.150 |
| SAT 10 reading comprehension test, Grade 2 | 0.752 | −0.022 | 0.000 | 0.503 | 0.000 | 0.659 | 0.711 | 0.464 | 0.661 | 0.477 |
| SAT 10 reading comprehension test, Grade 3 | 0.674 | −0.006 | 0.000 | 0.420 | 0.000 | 0.846 | 0.798 | 0.522 | 0.830 | 0.536 |
| Florida elementary school data | | | | | | | | | | |
| FCAT math test for Grade 5 | 0.824 | 0.009 | 0.000 | 0.649 | 0.000 | 0.663 | 0.805 | 0.605 | 0.696 | 0.628 |
| FCAT reading test for Grade 5 | 0.874 | 0.014 | 0.000 | 0.687 | 0.000 | 0.697 | 0.868 | 0.505 | 0.738 | 0.543 |
| North Carolina elementary school data | | | | | | | | | | |
| Math test for Grade 5 | 0.592 | 0.012 | 0.000 | 0.460 | 0.000 | 0.406 | 0.718 | 0.654 | 0.475 | 0.659 |
| Reading test for Grade 5 | 0.615 | 0.015 | 0.000 | 0.499 | 0.000 | 0.515 | 0.827 | 0.566 | 0.580 | 0.579 |
| North Carolina secondary school data | | | | | | | | | | |
| High school algebra II | 0.493 | 0.014 | 0.000 | 0.322 | 0.000 | 0.461 | 0.661 | 0.329 | 0.576 | 0.467 |
| High school biology | 0.675 | 0.003 | 0.000 | 0.312 | 0.000 | 0.229 | 0.693 | 0.310 | 0.442 | 0.422 |
| High school chemistry | 0.484 | 0.023 | 0.000 | 0.305 | 0.000 | 0.278 | 0.636 | 0.308 | 0.419 | 0.403 |
| High school geometry | 0.846 | 0.057 | 0.000 | 0.633 | 0.000 | 0.563 | 0.755 | 0.391 | 0.674 | 0.526 |
| High school physics | 0.526 | 0.008 | 0.000 | 0.340 | 0.000 | 0.504 | 0.630 | 0.329 | 0.603 | 0.438 |
Sources: School Breakfast Pilot Project (SBPP) first follow-up year database, Reading First Impact Study (RFIS) first follow-up year database, the Florida Department of Education’s K-20 Education Data Warehouse for 2005, and the North Carolina Education Research Data Center for 2005.
Note: FCAT = Florida Comprehensive Assessment Test. Classes with only one student in the sample and schools with only one class in the sample are excluded from the analysis presented in this table. Reported R² values were calculated from variance components at each level, estimated from three-level and two-level models of the outcome measure with and without covariates. All models for the SBPP and RFIS data include an indicator variable distinguishing treatment and control groups as well as indicator variables for school districts in the study sample. Models for the Florida and North Carolina data include only indicator variables for school districts in the study sample.
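The table note says that the reported values were computed from variance components estimated with and without covariates. A minimal sketch of such a calculation follows, assuming the conventional proportional-reduction formula applied separately at each level: one minus the ratio of the component with the covariate to the component without it. The appendix does not state the formula, so this is an assumption, and the function names and numbers are hypothetical. Under this formula a negative value results whenever an estimated component is larger with the covariate than without it, which is one way negative entries like those in the class-level columns could arise.

```python
def r_squared(var_without, var_with):
    """Proportion of a variance component explained by adding a covariate."""
    if var_without <= 0:
        raise ValueError("unconditional variance component must be positive")
    return 1.0 - var_with / var_without


def r_squared_by_level(unconditional, conditional):
    """Apply r_squared level by level; both arguments map level name -> variance."""
    return {level: r_squared(unconditional[level], conditional[level])
            for level in unconditional}


# Hypothetical three-level variance components, not estimates from the article.
unconditional = {"school": 0.20, "class": 0.05, "student": 0.75}
conditional = {"school": 0.08, "class": 0.055, "student": 0.75}
print(r_squared_by_level(unconditional, conditional))
```

In this hypothetical example the class-level component rises slightly once the covariate is added, so the class-level value is negative even though the school-level value is large.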
Acknowledgements
We thank Steve Raudenbush, Andres Martinez, and Fen Lin for their significant contributions to this article.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We wish to acknowledge financial support for this study provided by the William T. Grant Foundation, a staff development grant from Abt Associates Inc., and the Judith Gueron Fund for Methodological Innovation in Social Policy Research at MDRC. We also thank the North Carolina Education Research Data Center and the Florida Department of Education’s K-20 Education Data Warehouse (through the National Center for the Analysis of Longitudinal Data in Education Research, supported by Grant R305A060018 to the Urban Institute from the Institute of Education Sciences, U.S. Department of Education) for providing access to some of the data used in this study.
