Estimators for Clustered Education RCTs Using the Neyman Model for Causal Inference

Abstract

This article examines the estimation of two-stage clustered designs for education randomized control trials (RCTs) using the nonparametric Neyman causal inference framework that underlies experiments. The key distinction between the considered causal models is whether potential treatment and control group outcomes are considered to be fixed for the study population (the finite-population model) or randomly selected from a vaguely defined universe (the super-population model). Both approaches allow for heterogeneity of treatment effects. Appropriate estimation methods and asymptotic moments are discussed for each model using simple differences-in-means estimators and those that include baseline covariates. An empirical application using a large-scale education RCT shows that the choice of the finite- or super-population approach can matter. Thus, the choice of framework and sensitivity analyses should be specified and justified in the analysis protocols.

Keywords

Statistics, experimental design, program evaluation, research methodology

In randomized control trials (RCTs) of education interventions, random assignment is often performed at the group level (such as a school or classroom) rather than at the student level. These group-based designs are common, because education RCTs often test interventions that are targeted to the group (e.g., a school restructuring initiative or professional development services for all teachers in a school). Thus, for these types of interventions, it is infeasible to randomly assign the treatment directly to students.

Under these group-based designs, data are typically collected on students. Thus, using student-level data, the statistical procedures that are used to estimate average treatment effects (ATEs) and their standard errors must account for the potential correlation of the outcomes of students within the same groups. In particular, the standard errors of the ATE estimators must be inflated to account for design effects due to clustering.

Over the past 40 years, a large statistical literature spanning multiple disciplines has discussed the estimation of treatment effects under two-stage clustered designs (see, e.g., Baltagi & Chang, 1994; De Leeuw & Meijer, 2008; Harville, 1977; Hsiao, 1986; Laird & Ware, 1982; Liang & Zeger, 1986; Murray, 1998; Rao, 1972; Raudenbush & Bryk, 2002; Wooldridge, 2002). This article contributes to this literature by discussing the estimation of the ATE parameter for clustered designs using the building blocks of the nonparametric model of causal inference that underlies experimental designs. This model was introduced for nonclustered designs by Neyman (1923/1990) and later developed in Rubin (1974, 1977) and Holland (1986) using a potential outcomes framework.

The analysis focuses on continuous outcome data (such as student test scores) that are assumed to be either (1) fixed for the study population (a finite-population [FP] model) or (2) random draws from population outcome distributions (the more common super-population [SP] model). Appropriate estimation methods and new asymptotic variance formulas that are consistent with the Neyman approach are discussed for each model using simple differences-in-means estimators and those that include baseline covariates.

The considered Neyman approach yields estimation equations that have a different error structure than the model-based approaches that are typically used in practice and has several advantages. First, the Neyman approach does not require assumptions on the distributions of potential outcomes (only moment assumptions), whereas the model-based approaches often assume multilevel normality (which may not hold for some educational outcomes such as student absences or teacher salaries). Second, the variance formulas for the FP approach make it explicit that impact findings can be generalized only to those schools and students that are included in the study—which may be realistic in many settings—rather than to a vaguely defined super-population of study units that is often assumed using standard approaches. Finally, unlike commonly used model-based approaches, the Neyman framework allows for heterogeneity of treatment effects, which leads to variance expressions that differ for the treatment and control groups and that differ for the FP and SP models.

An empirical application using a large-scale RCT in the education area shows that the choice of the Neyman FP, the Neyman SP, or the standard model-based approach can matter. These results suggest that education researchers—who currently most often report impact findings using hierarchical linear model (HLM) methods (Raudenbush & Bryk, 2002)—should consider testing the robustness of study findings, by obtaining additional consistent impact estimates using methods that rely on alternative, nonparametric assumptions.

The Neyman Causal Inference Model for Clustered Designs

The FP Model

Consider an experimental design where n groups—hereafter referred to as schools—are randomly assigned to either a single treatment or control condition. The study contains np treatment and n(1 - p) control group schools, where p is the sampling rate to the treatment group (0 < p < 1) (and where np and n(1 - p) are rounded to integers). It is assumed that the study contains m_i students from school i and that there are $M = \sum_{i = 1}^{n} m_{i}$ total students. We invoke the stable unit treatment value assumption (SUTVA; Rubin, 1980) where the outcomes of a student depend only on the treatment assignment of that student’s school and not on the treatment assignments of other study schools.

It is assumed for now that the n schools and M students define the population universe—the FP model considered by Neyman for nonclustered designs. Under this scenario, the schools and students participating in the study are not considered to have been sampled from some larger population.

Let Y_Tij be the “potential” outcome for student j in school i in the treatment condition and Y_Cij be the potential outcome for the same student in the control condition. These potential outcomes are assumed to be fixed (true values) for the study. The difference between the two fixed potential outcomes (Y_Tij - Y_Cij) is the student-level treatment effect, and the ATE parameter β_1 is the average of these effects over all students:

β_{1} = \bar{Y}_{T} - \bar{Y}_{C} = \frac{1}{M} \sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} (Y_{T i j} - Y_{C i j}) .

This ATE parameter cannot be calculated directly because potential outcomes for each student cannot be observed in both the treatment and control conditions. Formally, let T_i be the random assignment variable that equals 1 if a school is assigned to the treatment condition and 0 if the school is assigned to the control condition. The data generating process for the observed outcome for a student, y_ij , can then be expressed as follows:

y_{i j} = T_{i} Y_{T i j} + (1 - T_{i}) Y_{C i j} .

In Equation 2, y_ij is a random variable because T_i is a random variable, but the potential outcomes Y_Tij and Y_Cij are fixed for the study. Thus, under the Neyman FP model, the ATE parameter pertains only to those students and schools at the time the study was conducted. Stated differently, the impact findings have internal validity but do not necessarily generalize beyond the study participants. This approach can be justified on the grounds that schools are usually purposively selected for RCTs and, thus, may be a self-selected group of schools that are willing to participate and that are deemed to be suitable for the study based on their populations and contexts. Similarly, students participating in the study may not be representative of all students in the study schools, because they could be a nonrandom subset of those who consented to participate in the study and provided follow-up data.
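As a minimal sketch of Equations 1 and 2, the following Python code computes the FP ATE parameter from a small set of fixed potential outcomes and then generates observed outcomes from one school-level random assignment. The outcome values, school sizes, and function names are illustrative assumptions and are not taken from the study data.

```python
import random

def fp_ate(Y_T, Y_C):
    """Equation 1: mean of Y_Tij - Y_Cij over all M students."""
    effects = [yt - yc for st, sc in zip(Y_T, Y_C) for yt, yc in zip(st, sc)]
    return sum(effects) / len(effects)

def assign_schools(n, n_treat, rng):
    """Assign exactly n_treat of the n schools to treatment; returns T_i in {0, 1}."""
    treated = set(rng.sample(range(n), n_treat))
    return [1 if i in treated else 0 for i in range(n)]

def observed_outcomes(Y_T, Y_C, T):
    """Equation 2: y_ij = T_i * Y_Tij + (1 - T_i) * Y_Cij."""
    return [
        [T[i] * yt + (1 - T[i]) * yc for yt, yc in zip(Y_T[i], Y_C[i])]
        for i in range(len(T))
    ]

# Hypothetical fixed potential outcomes for n = 4 schools (illustrative values).
Y_C = [[50.0, 55.0], [60.0], [45.0, 47.0, 52.0], [58.0, 62.0]]
Y_T = [[54.0, 57.0], [66.0], [46.0, 50.0, 55.0], [63.0, 64.0]]

beta_1 = fp_ate(Y_T, Y_C)                      # fixed parameter, not random
T = assign_schools(n=4, n_treat=2, rng=random.Random(0))
y = observed_outcomes(Y_T, Y_C, T)             # random only through T
```

Only one of the two potential outcomes is revealed for each student, so `beta_1` can be computed here only because the sketch stipulates both sets of outcomes.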

Under this fixed population scenario, researchers are to be agnostic about whether the study results have external validity. Policymakers and other users of the study results can decide whether the impact evidence is sufficient to adopt the intervention on a broader scale, perhaps by examining the similarity of the observable characteristics of schools and students included in the study to their own contexts, and using results from subgroup impact analyses and analyses measuring the quality and fidelity of intervention implementation across the study sites.

Following the approach for nonclustered designs used by Freedman (2008) and Schochet (2010), a regression model for Equation 2 can be constructed by rewriting Equation 2 as follows:

y_{i j} = β_{0} + β_{1} (T_{i} - p) + η_{i j},

where

$β_{0} = p \bar{Y}_{T} + (1 - p) \bar{Y}_{C}$ and $β_{1} = \bar{Y}_{T} - \bar{Y}_{C}$ are parameters to be estimated and

$η_{i j} = α_{i j} + τ_{i j} (T_{i} - p)$ is an “error” term, where $α_{i j} = p (Y_{T i j} - \bar{Y}_{T}) + (1 - p) (Y_{C i j} - \bar{Y}_{C})$ and $τ_{i j} = (Y_{T i j} - \bar{Y}_{T}) - (Y_{C i j} - \bar{Y}_{C})$.

In Equation 3, the error term η_ij is random because it is a function of the random T_i. The error term is also a function of two nonrandom components: (1) $α_{i j} = E (y_{i j} - \bar{\bar{y}}_{. .})$, the expected observed outcome for the student relative to the expected mean observed outcome, and (2) τ_ij, the student-level treatment effect relative to the ATE, which could differ across students if there are heterogeneous treatment effects. Note that α_ij and τ_ij each sum to zero over all students. This model is nonparametric because it does not depend on the distributions of the potential outcomes. Note in Equation 3 that the term (T_i - p) is used rather than T_i because it simplifies the proofs presented in Appendix A (online supplement available at http://jeb.sagepub.com/), but this centering has no effect on the findings.
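The algebra behind Equation 3 can be checked numerically. The sketch below, which uses hypothetical potential outcomes and an arbitrary assignment vector chosen for illustration, computes β_0, β_1, α_ij, and τ_ij from their definitions and confirms that β_0 + β_1(T_i - p) + η_ij reproduces the observed outcome implied by Equation 2.

```python
def fp_regression_terms(Y_T, Y_C, p):
    """Return beta_0, beta_1, alpha_ij, tau_ij as defined under Equation 3."""
    flat_T = [v for s in Y_T for v in s]
    flat_C = [v for s in Y_C for v in s]
    M = len(flat_T)
    Ybar_T, Ybar_C = sum(flat_T) / M, sum(flat_C) / M
    beta_0 = p * Ybar_T + (1 - p) * Ybar_C
    beta_1 = Ybar_T - Ybar_C
    alpha = [[p * (yt - Ybar_T) + (1 - p) * (yc - Ybar_C)
              for yt, yc in zip(st, sc)] for st, sc in zip(Y_T, Y_C)]
    tau = [[(yt - Ybar_T) - (yc - Ybar_C)
            for yt, yc in zip(st, sc)] for st, sc in zip(Y_T, Y_C)]
    return beta_0, beta_1, alpha, tau

def rebuild_observed(beta_0, beta_1, alpha, tau, T, p):
    """y_ij = beta_0 + beta_1 (T_i - p) + alpha_ij + tau_ij (T_i - p)."""
    return [[beta_0 + beta_1 * (T[i] - p) + a + t * (T[i] - p)
             for a, t in zip(alpha[i], tau[i])] for i in range(len(T))]

# Hypothetical potential outcomes and one assignment (illustrative values).
Y_C = [[50.0, 55.0], [60.0], [45.0, 47.0, 52.0], [58.0, 62.0]]
Y_T = [[54.0, 57.0], [66.0], [46.0, 50.0, 55.0], [63.0, 64.0]]
T, p = [1, 0, 0, 1], 0.5

beta_0, beta_1, alpha, tau = fp_regression_terms(Y_T, Y_C, p)
y_rebuilt = rebuild_observed(beta_0, beta_1, alpha, tau, T, p)
sum_alpha = sum(a for s in alpha for a in s)   # should be 0 up to rounding
sum_tau = sum(t for s in tau for t in s)       # should be 0 up to rounding
```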

The model in Equation 3 does not satisfy key assumptions of the usual regression model, because the random error η_ij does not have mean zero (over all possible treatment assignment configurations), and, to the extent that τ_ij varies across students, η_ij is heteroscedastic, $\mathrm{Cov}(η_{i j}, η_{i j'})$ is not constant for students in the same schools, $\mathrm{Cov}(η_{i j}, η_{i' j'})$ is nonzero for students in different schools (for $i \neq i', j \neq j'$), and η_ij is correlated with the regressor (T_i - p):

\begin{aligned} E (η_{i j}) = α_{i j}, \quad \mathrm{Var}(η_{i j}) = τ_{i j}^{2} p (1 - p), \quad \mathrm{Cov}(η_{i j}, η_{i j'}) = τ_{i j} τ_{i j'} p (1 - p), \\ \mathrm{Cov}(η_{i j}, η_{i' j'}) = - τ_{i j} τ_{i' j'} p (1 - p) / (n - 1), \quad E [(T_{i} - p) η_{i j}] = τ_{i j} p (1 - p) . \end{aligned}

Note that in this model, the error terms for students in the same schools are correlated only because they have the same treatment status, not because they face similar environments.
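Because η_ij - α_ij = τ_ij (T_i - p), the moments above follow from the moments of T_i - p under complete randomization: mean 0, variance p(1 - p), and covariance -p(1 - p)/(n - 1) across schools. The sketch below verifies these three facts exactly by enumerating every possible assignment for a small hypothetical design (6 schools, 2 treated); the function name and design sizes are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations

def assignment_moments(n, n_treat):
    """Exact moments of (T_i - p) over all ways to treat n_treat of n schools."""
    p = Fraction(n_treat, n)
    assignments = list(combinations(range(n), n_treat))
    K = len(assignments)
    # Centered indicators for two distinct schools (0 and 1) in every assignment.
    d0 = [(1 if 0 in a else 0) - p for a in assignments]
    d1 = [(1 if 1 in a else 0) - p for a in assignments]
    mean_d = sum(d0) / K
    var_d = sum(x * x for x in d0) / K
    cov_d = sum(x * y for x, y in zip(d0, d1)) / K
    return p, mean_d, var_d, cov_d

n = 6
p, mean_d, var_d, cov_d = assignment_moments(n, 2)
```

Multiplying `var_d` by τ_ij² and `cov_d` by τ_ij τ_i'j' gives the variance and between-school covariance of the error terms.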

Equation 3 implicitly assumes that schools are weighted by their student sample sizes. An alternative specification is to weight schools equally. In this case, the ATE parameter is $β_{1} = \bar{\bar{Y}}_{T} - \bar{\bar{Y}}_{C}$, where $\bar{\bar{Y}}_{T} = (1 / n) \sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} (Y_{T i j} / m_{i})$ and $\bar{\bar{Y}}_{C} = (1 / n) \sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} (Y_{C i j} / m_{i})$ are averages of school-level means. This ATE parameter pertains to the average school effect rather than to the average student effect. The school- and student-level weighting schemes will result in different impact estimates if student sample sizes vary by school and the ATEs vary by school sample size.
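A small numerical example shows how the two weighting schemes can diverge. In the hypothetical data below (two schools, chosen only for illustration), the one-student school has a treatment effect of 10 and the three-student school has an effect of 2, so the student-weighted and school-weighted ATE parameters differ.

```python
def ate_student_weighted(Y_T, Y_C):
    """Average effect over students: schools count in proportion to m_i."""
    effects = [yt - yc for st, sc in zip(Y_T, Y_C) for yt, yc in zip(st, sc)]
    return sum(effects) / len(effects)

def ate_school_weighted(Y_T, Y_C):
    """Average of school-level mean effects: every school counts equally."""
    school_effects = [sum(st) / len(st) - sum(sc) / len(sc)
                      for st, sc in zip(Y_T, Y_C)]
    return sum(school_effects) / len(school_effects)

# Hypothetical potential outcomes: effects vary with school size.
Y_C = [[0.0], [0.0, 0.0, 0.0]]
Y_T = [[10.0], [2.0, 2.0, 2.0]]

student_ate = ate_student_weighted(Y_T, Y_C)   # (10 + 2 + 2 + 2) / 4
school_ate = ate_school_weighted(Y_T, Y_C)     # (10 + 2) / 2
```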

Importantly, the FP model is very different than a fixed effects model where school effects are treated as fixed and student-level variation within the study schools is assumed to be the only source of variation in the impact estimates. This fixed effects framework does not conform to the Neyman model, because it ignores the randomness in T_i (and thus the clustered nature of the design). Stated differently, both the Neyman FP and fixed effects models assume that the study schools were not sampled, but the FP model treats the assignment of these schools to the treatment and control conditions as random (and adjusts the variance expressions accordingly), whereas the fixed effects approach ignores the randomness of the treatment assignment process and understates the true variance of the impact estimates. Thus, the fixed effects approach is not considered further in this article.

The SP Model

Under the SP version of the Neyman causal inference model, study schools and students are assumed to be random samples from broader populations. Under this framework, students are nested within schools. Let Z_Ti be the potential outcome for school i in the treatment condition and Z_Ci be the potential outcome for school i in the control condition. Potential outcomes for the n study schools are assumed to be random draws from potential treatment and control outcome distributions in the study super-population. It is assumed that means and variances of these distributions are finite and denoted by μ_T and $σ_{T u}^{2}$ for potential treatment outcomes and μ_C and $σ_{C u}^{2}$ for potential control outcomes.

Suppose next that m_i students are sampled from the super-population of students in study school i. The potential student-level outcomes Y_Tij and Y_Cij are now assumed to be random draws from student-level potential outcome distributions (which are conditional on school-level potential outcomes) with respective means Z_Ti and Z_Ci and finite variances $σ_{T e}^{2} > 0$ and $σ_{C e}^{2} > 0$.

Under the SP model, the ATE parameter is $μ_{τ} = E (Z_{T i} - Z_{C i}) = μ_{T} - μ_{C} .$ Thus, the impact findings are now assumed to generalize to the super-population of schools that are “similar” to the study schools. The interpretation of this super-population will likely depend on the context (and may not exist), but researchers should be aware that the estimation of treatment effects using the SP approach makes the implicit assumption of external validity to a universe that is likely to be vaguely defined. Nonetheless, this approach can be justified on the grounds that policymakers may generalize the findings anyway, especially if the study provides a primary basis for deciding whether to implement the tested treatments more broadly. Furthermore, this approach is more consistent with the Bayesian view that assessing treatment effects is a dynamic process that takes place in a context of continuously increasing knowledge.

In the SP framework, T_i , Z_Ti , Z_Ci , Y_Tij , and Y_Cij are all random variables. As before, we can use Equation 2 to express observed student outcomes in terms of potential outcomes and can rearrange terms to yield the following regression model:

y_{i j} = α_{0} + α_{1} (T_{i} - p) + (u_{i} + e_{i j}),

where

$α_{0} = p μ_{T} + (1 - p) μ_{C}$ and $α_{1} = μ_{T} - μ_{C}$ (the ATE parameter) are coefficients to be estimated;

$u_{i} = T_{i} (Z_{T i} - μ_{T}) + (1 - T_{i}) (Z_{C i} - μ_{C})$ is a random school-level error term, where E(u_i) = 0, $E ([T_{i} - p] u_{i}) = 0$, $\mathrm{Var}(u_{i} | T_{i} = 1) = σ_{T u}^{2}$, and $\mathrm{Var}(u_{i} | T_{i} = 0) = σ_{C u}^{2}$; and

$e_{i j} = T_{i} (Y_{T i j} - Z_{T i}) + (1 - T_{i}) (Y_{C i j} - Z_{C i})$ is a random student-level error, where E(e_ij) = 0, $E ([T_{i} - p] e_{i j}) = E (u_{i} e_{i j}) = 0$, $\mathrm{Var}(e_{i j} | T_{i} = 1) = σ_{T e}^{2}$, and $\mathrm{Var}(e_{i j} | T_{i} = 0) = σ_{C e}^{2}$.

Furthermore, if we define $δ_{i j} = u_{i} + e_{i j}$ as the total error term, we find that

\mathrm{Var}(δ_{i j}) = p \, \mathrm{Var}(δ_{i j} | T_{i} = 1) + (1 - p) \, \mathrm{Var}(δ_{i j} | T_{i} = 0),

where

\begin{aligned} \mathrm{Var}(δ_{i j} | T_{i} = 1) = σ_{T u}^{2} + σ_{T e}^{2}, \quad \mathrm{Var}(δ_{i j} | T_{i} = 0) = σ_{C u}^{2} + σ_{C e}^{2}, \quad \mathrm{Cov}(δ_{i j}, δ_{i' j'}) = 0, \\ \mathrm{Cov}(δ_{i j}, δ_{i j'} | T_{i} = 1) = σ_{T u}^{2}, \quad \mathrm{Cov}(δ_{i j}, δ_{i j'} | T_{i} = 0) = σ_{C u}^{2} . \end{aligned}

Thus, this model is the usual random effects model with an exchangeable m_i × m_i positive definite variance–covariance matrix for subjects within each school (labeled as $Ω_{i}$), except that the matrices can differ for treatments and controls. These variances could differ, for example, if there are heterogeneous treatment effects that are uncorrelated with the potential outcomes under the control condition, which would lead to a larger variance for treatments than controls. The stacked block diagonal matrix for the full sample is labeled as $Ω$.
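The exchangeable structure of $Ω_{i}$ can be made concrete with a short sketch. The code below builds the within-school matrix for one research condition from assumed variance components (the numerical values are illustrative) and confirms that the implied variance of a school mean is σ_u² + σ_e²/m_i.

```python
def omega(m, sigma2_u, sigma2_e):
    """Exchangeable m x m matrix: sigma2_u + sigma2_e on the diagonal, sigma2_u elsewhere."""
    return [[sigma2_u + (sigma2_e if j == k else 0.0) for k in range(m)]
            for j in range(m)]

def variance_of_school_mean(omega_i):
    """Var of the school mean = (1/m^2) * sum of all entries of Omega_i."""
    m = len(omega_i)
    return sum(sum(row) for row in omega_i) / (m * m)

# Hypothetical variance components for one research condition.
m_i, sigma2_u, sigma2_e = 3, 2.0, 5.0
Omega_i = omega(m_i, sigma2_u, sigma2_e)
var_mean = variance_of_school_mean(Omega_i)
```

Under the SP model, a separate matrix of this form would be built for treatment schools and for control schools, using the variance components for each condition.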

Finally, note that Equation 4 can also be derived using the following two-level HLM model (Bryk & Raudenbush, 1992):

\begin{aligned} \text{Level 1:} \quad y_{i j} = z_{i} + e_{i j} \\ \text{Level 2:} \quad z_{i} = α_{0} + α_{1} T_{i} + u_{i}, \end{aligned}

where $z_{i} = T_{i} Z_{T i} + (1 - T_{i}) Z_{C i}$ is the observed school-level outcome, Level 1 corresponds to students, and Level 2 corresponds to schools. Inserting the Level 2 equation into the Level 1 equation yields Equation 4. Thus, the HLM approach is consistent with the SP causal inference theory if the Level 1 and 2 error variances are allowed to differ for treatments and controls (which, as discussed, could occur if treatment effects are heterogeneous). The HLM approach, however, is not consistent with the Neyman model if error variances are assumed to be homogeneous, which is often assumed in practice. Furthermore, the HLM approach typically assumes normality of the error terms, which is not required under the nonparametric Neyman approach.

ATE Parameter Estimation for the FP Model

This section discusses ATE parameter and variance estimation for the FP model with and without baseline covariates. Proofs of asymptotic results are provided in online supplement Appendix A. We rely on asymptotic results because the regression estimators are complex functions of the random variable T_i , which makes it difficult to obtain finite-sample moments. To obtain asymptotic properties for the estimators, we consider an increasing sequence of finite populations with the number of schools n increasing to infinity.

The FP Model Without Covariates

Ordinary least squares (OLS) methods are appropriate for estimating β ₁ in Equation 3, because the ATE parameter for the FP model pertains to the study sample only. The following lemma provides the asymptotic moments of the OLS estimator.

Lemma 1. The simple OLS estimator for β_1 under the FP model in Equation 3 is $\hat{β}_{1, S R} = (\bar{y}_{T} - \bar{y}_{C})$, where $\bar{y}_{T}$ and $\bar{y}_{C}$ are (unweighted) sample means for the treatment and control groups, respectively. As n increases to infinity for an increasing sequence of finite populations, $\hat{β}_{1, S R}$ is asymptotically unbiased. Furthermore, assume that

\begin{aligned} \bar{m} = \sum_{i = 1}^{n} m_{i} / n \to \bar{\bar{m}}, \quad \frac{1}{n \bar{m}} \sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} (Y_{T i j} - \bar{Y}_{T}) (Y_{T i k} - \bar{Y}_{T}) \to S_{T}^{2}, \\ \frac{1}{n \bar{m}} \sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} (Y_{C i j} - \bar{Y}_{C}) (Y_{C i k} - \bar{Y}_{C}) \to S_{C}^{2}, \quad \text{and} \quad \frac{1}{n \bar{m}} \sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} τ_{i j} τ_{i k} \to S_{τ}^{2}, \end{aligned}

where $\bar{\bar{m}}$, $S_{T}^{2}$, $S_{C}^{2}$, and $S_{τ}^{2}$ are fixed, nonnegative, real numbers. Then, $\hat{β}_{1, S R}$ is asymptotically normal with variance:

\mathrm{AsyVar}(\hat{β}_{1, S R}) = \frac{S_{T}^{2}}{n \bar{\bar{m}} p} + \frac{S_{C}^{2}}{n \bar{\bar{m}} (1 - p)} - \frac{S_{τ}^{2}}{n \bar{\bar{m}}} .

The S_T ² and S_C ² terms pertain to the extent to which potential outcomes vary and covary across students within the same schools. The S_τ ² term pertains to the extent to which treatment effects vary and covary across students within schools. Note that if student-level treatment effects are constant, S_τ ² = 0 and S_T ² = S_C ².
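To illustrate the variance components in Equations 5 and 6, the sketch below evaluates their finite-n analogues for a small set of hypothetical potential outcomes in which the treatment effect is a constant 3 points. It uses the fact that the double sum over students j and k within a school equals the squared school total of deviations, and it replaces n times the average school size with the total student count M. All data values and function names are assumptions for illustration.

```python
def fp_variance_components(Y_T, Y_C):
    """Finite-n analogues of S_T^2, S_C^2, and S_tau^2 from Equation 5."""
    M = sum(len(s) for s in Y_T)
    Ybar_T = sum(v for s in Y_T for v in s) / M
    Ybar_C = sum(v for s in Y_C for v in s) / M
    # sum_j sum_k a_j a_k = (sum_j a_j)^2 within each school.
    S2_T = sum(sum(v - Ybar_T for v in s) ** 2 for s in Y_T) / M
    S2_C = sum(sum(v - Ybar_C for v in s) ** 2 for s in Y_C) / M
    S2_tau = sum(
        sum((yt - Ybar_T) - (yc - Ybar_C) for yt, yc in zip(st, sc)) ** 2
        for st, sc in zip(Y_T, Y_C)
    ) / M
    return S2_T, S2_C, S2_tau, M

def fp_asy_var(S2_T, S2_C, S2_tau, M, p):
    """Equation 6, with n * m-bar written as the total student count M."""
    return S2_T / (M * p) + S2_C / (M * (1 - p)) - S2_tau / M

# Hypothetical potential outcomes with a constant treatment effect of 3.
Y_C = [[1.0, 2.0], [4.0, 5.0, 6.0], [3.0]]
Y_T = [[v + 3.0 for v in s] for s in Y_C]

S2_T, S2_C, S2_tau, M = fp_variance_components(Y_T, Y_C)
asy_var = fp_asy_var(S2_T, S2_C, S2_tau, M, p=0.5)
```

With constant effects the sketch returns `S2_tau` equal to zero and `S2_T` equal to `S2_C`, matching the special case noted above.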

Consistent estimators for S_T ² and S_C ² in Equation 6 can be obtained using sample variances and covariances for treatments and controls, s_T ² and s_C ², respectively:

\begin{aligned} s_{T}^{2} = \frac{1}{n \bar{m}_{T} p} \sum_{i : T_{i} = 1}^{n p} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} (y_{i j} - \bar{y}_{T}) (y_{i k} - \bar{y}_{T}), \\ s_{C}^{2} = \frac{1}{n \bar{m}_{C} (1 - p)} \sum_{i : T_{i} = 0}^{n (1 - p)} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} (y_{i j} - \bar{y}_{C}) (y_{i k} - \bar{y}_{C}), \end{aligned}

where $\bar{m}_{T}$ and $\bar{m}_{C}$ are average cluster sizes for treatment and control study schools. Similar estimators can be obtained using the generalized estimating equation (GEE) approach developed by Liang and Zeger (1986) for clustered data assuming an independent working correlation structure, an identity link function, and the empirical sandwich variance estimator. As discussed in Murray (1998), the Type I error rate for the GEE estimator using standard inferential methods may be larger than the nominal level if the number of clusters in each research condition is less than 15 or 20. However, Small, Ten Have, and Rosenbaum (2008) and Imai, King, and Nall (2009) provide inferential methods with good statistical properties for designs with a small number of clusters.
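The differences-in-means estimator and the sample analogues s_T² and s_C² can be sketched as follows. The observed outcomes and assignment vector are hypothetical, and the variance function deliberately omits the S_τ² term; because S_τ² is nonnegative, that omission can only overstate the variance in Equation 6. This is an illustrative sketch under those assumptions, not the GEE implementation mentioned above.

```python
def fp_diff_in_means(y, T):
    """Unweighted treatment-control difference in student-level means."""
    treat = [v for s, t in zip(y, T) if t == 1 for v in s]
    ctrl = [v for s, t in zip(y, T) if t == 0 for v in s]
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

def arm_variance_term(y, T, arm):
    """s_T^2 (arm=1) or s_C^2 (arm=0): squared school totals of deviations / arm size."""
    schools = [s for s, t in zip(y, T) if t == arm]
    n_students = sum(len(s) for s in schools)   # equals n * m-bar_T * p for arm=1
    ybar = sum(v for s in schools for v in s) / n_students
    return sum(sum(v - ybar for v in s) ** 2 for s in schools) / n_students

def fp_variance_without_tau(y, T):
    """Plug-in version of Equation 6 with the S_tau^2 term dropped."""
    M = sum(len(s) for s in y)
    p = sum(T) / len(T)
    return (arm_variance_term(y, T, 1) / (M * p)
            + arm_variance_term(y, T, 0) / (M * (1 - p)))

# Hypothetical observed data: four schools, two treated.
y = [[1.0, 3.0], [5.0], [2.0], [4.0, 6.0]]
T = [1, 1, 0, 0]

beta_hat = fp_diff_in_means(y, T)
s2_T = arm_variance_term(y, T, 1)
var_hat = fp_variance_without_tau(y, T)
```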

Consistent estimators for $S_{τ}^{2}$ in Equation 6 take the form $s_{τ}^{2} = \sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} \hat{τ}_{i j} \hat{τ}_{i k} / (n \bar{m})$, where $\hat{τ}_{i j}$ is an estimate of the treatment effect for student j in school i. Two possible estimation approaches for obtaining $\hat{τ}_{i j}$ using baseline covariates are as follows.

Subgroup method

Under this approach, ATEs can be estimated for a large number of student- and school-level subgroups defined at baseline by regressing y_ij on treatment-by-subgroup interaction terms. Estimates for τ_ij can then be obtained using predicted ATEs from these subgroup interaction models. This approach may underestimate $S_{τ}^{2}$ because the estimates for τ_ij are constant within subgroup cells, and the subgroup covariates may not fully explain the variation in τ_ij. However, this approach is fully based on the experimental design.

Propensity score matching method

This method uses propensity score matching (Rosenbaum & Rubin, 1983) to match treatment and control subjects using baseline data. This can be done in two stages by first matching treatment and control schools and then matching subjects within those schools, or in one stage by directly matching treatment and control subjects where school-level variables are included as matching covariates. Nearest neighbor, caliper, kernel, or similar matching methods with replacement could be used for the analysis (see Smith & Todd, 2005).

Note that there may be instances where covariates are not available to estimate S_τ ². In this case, because $S_{τ}^{2} \geq (S_{T} - S_{C})^{2}$ , an upper bound for the variance expression in Equation 6 is as follows:

\mathrm{AsyVar}(\hat{β}_{1, S R}) \leq \frac{S_{T}^{2}}{n \bar{\bar{m}} p} + \frac{S_{C}^{2}}{n \bar{\bar{m}} (1 - p)} - \frac{(S_{T} - S_{C})^{2}}{n \bar{\bar{m}}},

which is a conservative FP variance estimator.
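The conservative bound is simple to compute once values for S_T² and S_C² are in hand. The sketch below evaluates it for hypothetical inputs (the numbers and the function name are assumptions for illustration) and compares it with the cruder bound that drops the last term altogether.

```python
import math

def fp_variance_upper_bound(S2_T, S2_C, n, m_bar, p):
    """Upper bound on the FP variance using S_tau^2 >= (S_T - S_C)^2."""
    M = n * m_bar
    S_T, S_C = math.sqrt(S2_T), math.sqrt(S2_C)
    return S2_T / (M * p) + S2_C / (M * (1 - p)) - (S_T - S_C) ** 2 / M

# Hypothetical inputs: 5 schools averaging 2 students, half treated.
bound = fp_variance_upper_bound(S2_T=4.0, S2_C=1.0, n=5, m_bar=2.0, p=0.5)
crude = 4.0 / (10 * 0.5) + 1.0 / (10 * 0.5)    # same expression without the last term
```

When S_T² and S_C² are equal, the last term vanishes and the two expressions coincide.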

Finally, if schools are to be weighted equally under unbalanced designs, the estimators from above can be applied by first premultiplying the outcome and explanatory variables (including the intercept) by the weights $\sqrt{w_{i j}}$ , where $w_{i j} \propto 1 / m_{i}$ (Pfeffermann, Skinner, Holmes, Goldstein, & Rasbash, 1998). For the equal-school weighting scheme, model-free permutation (randomization) tests can also be used to test the strong null hypothesis that all student-level treatment effects are zero (Gail, Mark, Carroll, Green, & Pee, 1996) or to also test nonzero treatment effects (Small et al., 2008).
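Premultiplying the outcome, intercept, and treatment indicator by $\sqrt{w_{i j}}$ and running OLS is equivalent to weighted least squares with weights w_ij. The sketch below uses the closed-form weighted slope for a single regressor, with w_ij = 1/m_i and hypothetical data, and checks that the resulting coefficient equals the difference between the average treatment-school mean and the average control-school mean.

```python
def wls_slope(x, y, w):
    """Weighted least squares slope of y on x (with an intercept)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Hypothetical observed data with unequal school sizes.
y_schools = [[1.0, 3.0], [5.0], [2.0], [4.0, 8.0]]
T = [1, 1, 0, 0]

# Student-level vectors; each student gets weight 1 / m_i.
x_flat = [float(T[i]) for i, s in enumerate(y_schools) for _ in s]
y_flat = [v for s in y_schools for v in s]
w_flat = [1.0 / len(s) for s in y_schools for _ in s]
slope = wls_slope(x_flat, y_flat, w_flat)

# Direct calculation: difference in averages of school means.
means = [sum(s) / len(s) for s in y_schools]
treat_means = [m for m, t in zip(means, T) if t == 1]
ctrl_means = [m for m, t in zip(means, T) if t == 0]
diff_school_means = sum(treat_means) / len(treat_means) - sum(ctrl_means) / len(ctrl_means)
```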

The FP Model With Covariates

We now examine ATE estimators when the FP models include a 1 × v vector of fixed covariates, $x_{i j}$, pertaining to the prerandomization period. Baseline covariates (such as student pretest scores) are often used in the analysis of RCT data to improve the precision of the ATE estimates and to adjust for potential nonresponse biases due to missing data. The covariates are not indexed by T or C because their values are independent of treatment status due to randomization. The covariates could include both school- and student-level variables; all covariates are assumed to be centered at their grand means.

Importantly, in the Neyman model with fixed covariates, Equation 3 is still the true model. Thus, the ATE parameters considered above in the models without covariates pertain also to the models with covariates. To the extent that the covariates have explanatory power, they will be correlated with the error terms in Equation 3 (which violates a key assumption of the usual regression model).

To examine asymptotic moments of the OLS estimator under the FP model with fixed covariates, we assume in addition to Equation 5 that as n approaches infinity:

\frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} x'_{i j} (Y_{T i k} - \bar{Y}_{T})}{n \bar{m}} \to S_{x' Y_{T}}, \quad \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} x'_{i j} (Y_{C i k} - \bar{Y}_{C})}{n \bar{m}} \to S_{x' Y_{C}}, \quad \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} x'_{i j} x_{i k}}{n \bar{m}} \to S_{x' x},

\frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} x'_{i j} x_{i j}}{n} \to S_{X' X}, \quad \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} x'_{i j} (Y_{T i j} - \bar{Y}_{T})}{n} \to S_{X' Y_{T}}, \quad \text{and} \quad \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} x'_{i j} (Y_{C i j} - \bar{Y}_{C})}{n} \to S_{X' Y_{C}},

where the S matrices contain cross products of fixed, nonnegative real numbers. The following lemma generalizes results in Schochet (2010) to two-stage clustered designs.

Lemma 2. Let $\hat{β}_{1, M R}$ be the multiple regression estimator for β_1 under the model in Equation 3 and assume Equations 5 and 8. Then, $\hat{β}_{1, M R}$ is asymptotically normal with mean β_1 and variance:

\begin{aligned} \mathrm{AsyVar}(\hat{β}_{1, M R}) = \left( \frac{S_{T}^{2}}{n \bar{\bar{m}} p} + \frac{S_{C}^{2}}{n \bar{\bar{m}} (1 - p)} - \frac{S_{τ}^{2}}{n \bar{\bar{m}}} \right) \\ - \frac{(2 S_{α}' S_{x' Y_{T}} - S_{α}' S_{x' x} S_{α})}{n \bar{\bar{m}} p} - \frac{(2 S_{α}' S_{x' Y_{C}} - S_{α}' S_{x' x} S_{α})}{n \bar{\bar{m}} (1 - p)} \end{aligned}

where

S_{α} = S_{X' X}^{- 1} [p S_{X' Y_{T}} + (1 - p) S_{X' Y_{C}}]

The first bracketed term in Equation 9 is the variance of the OLS estimator under the FP model without covariates. The remaining terms account for precision gains (or losses) from covariate adjustment. $S_{α}$ can be viewed as a vector of population regression parameters when $α$ is regressed on $X$, and the other $S$ matrices are within-school variance–covariance matrices pertaining to the covariates and potential outcomes. If the covariance structure is the same for treatments and controls (which would occur, e.g., if treatment effects are constant), so that $S_{x' Y_{T}} = S_{x' Y_{C}} = S_{x' Y}$ and $S_{X' Y_{T}} = S_{X' Y_{C}} = S_{X' Y}$, the covariate adjustment in Equation 9 simplifies to $(2 S_{α}' S_{x' Y} - S_{α}' S_{x' x} S_{α}) / [n \bar{\bar{m}} p (1 - p)]$ and $S_{α} = S_{X' X}^{- 1} S_{X' Y}$.

Consistent estimators for the $S$ matrices can be obtained using sample moments. For example, $S_{x' Y_{T}}$ can be estimated using $\sum_{i : T_{i} = 1}^{n p} \sum_{j = 1}^{m_{i}} \sum_{k = 1}^{m_{i}} x'_{i j} (y_{i k} - \bar{y}_{T}) / (n p \bar{m}_{T})$, and $S_{X' X}$ can be estimated using $\sum_{i = 1}^{n} \sum_{j = 1}^{m_{i}} x'_{i j} x_{i j} / n$.

ATE Parameter Estimation for the SP Model

This section examines ATE parameter estimation for the SP model with and without baseline covariates using generalized least squares (GLS) methods that are often used to estimate random effects models. It is assumed that the cluster sample sizes m_i are randomly distributed across schools and thus are uncorrelated with T_i , $x_{i j}$ , u_i , and e_ij in Equation 4. The moment results discussed below can be obtained by first conditioning on m_i and then taking expectations with respect to the distribution of m_i ; for notational simplicity, this conditioning process is not shown.

The SP Model Without Covariates

Let $Q_{i} = (K \; \tilde{T}_{i})$ be the covariate matrix for school i, where $K$ is an m_i × 1 column of 1s for the intercept and $\tilde{T}_{i}$ is an m_i × 1 vector containing T_i - p terms. The efficient GLS estimator for α_1 in Equation 4 is $[(\sum_{i = 1}^{n} Q_{i}' Ω_{i}^{- 1} Q_{i})^{- 1} (\sum_{i = 1}^{n} Q_{i}' Ω_{i}^{- 1} y_{i})]_{2, 2}$, which in our case reduces to

\hat{α}_{1, S R} = \bar{y}_{T W} - \bar{y}_{C W} = \frac{\sum_{i : T_{i} = 1}^{n p} w_{T i}^{*} \bar{y}_{i}}{\sum_{i : T_{i} = 1}^{n p} w_{T i}^{*}} - \frac{\sum_{i : T_{i} = 0}^{n (1 - p)} w_{C i}^{*} \bar{y}_{i}}{\sum_{i : T_{i} = 0}^{n (1 - p)} w_{C i}^{*}},

where $\bar{y}_{i}$ is the mean outcome in school i; $w_{T i}^{*} = m_{i} w_{T i}$ and $w_{C i}^{*} = m_{i} w_{C i}$ are school-level weights for treatment and control schools, respectively; and $w_{T i} = [m_{i} σ_{T u}^{2} + σ_{T e}^{2}]^{- 1}$ and $w_{C i} = [m_{i} σ_{C u}^{2} + σ_{C e}^{2}]^{- 1}$ are student-level weights.

The GLS estimator in Equation 10 is a weighted differences-in-means estimator, where the weights are inverses of the variances of school-level means. Schools with more sampled students receive more weight than smaller schools, because the larger schools provide more information on the super-population parameters μ_T and μ_C . The SP weights will lie between the subject-level FP weights where schools are weighted by their sample sizes and the school-level FP weights, where schools are weighted equally. The SP weights will converge to the subject-level weights as the intraclass correlations (ICCs), $σ_{. u}^{2} / (σ_{. u}^{2} + σ_{. e}^{2})$ approach zero and will converge to the school-level FP weights as the ICCs approach one.
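The behavior of the SP weights at the two ICC extremes can be seen in a short sketch. The code below computes the weighted mean of school means with weights m_i / (m_i σ_u² + σ_e²), treating the variance components as known; the school means, sizes, and variance values are hypothetical and chosen only to show the limiting cases.

```python
def sp_weighted_mean(school_means, sizes, sigma2_u, sigma2_e):
    """Weighted mean of school means with weights m_i / (m_i*sigma2_u + sigma2_e)."""
    weights = [m / (m * sigma2_u + sigma2_e) for m in sizes]
    return sum(w * yb for w, yb in zip(weights, school_means)) / sum(weights)

def sp_gls_ate(means_T, sizes_T, vc_T, means_C, sizes_C, vc_C):
    """Weighted treatment mean minus weighted control mean.

    vc_T and vc_C are (sigma2_u, sigma2_e) pairs, allowed to differ by condition.
    """
    return (sp_weighted_mean(means_T, sizes_T, *vc_T)
            - sp_weighted_mean(means_C, sizes_C, *vc_C))

# Hypothetical treatment-school means and sizes.
means_T, sizes_T = [2.0, 5.0], [2, 1]

icc_zero = sp_weighted_mean(means_T, sizes_T, sigma2_u=0.0, sigma2_e=1.0)  # weights ∝ m_i
icc_one = sp_weighted_mean(means_T, sizes_T, sigma2_u=1.0, sigma2_e=0.0)   # equal weights
ate_hat = sp_gls_ate(means_T, sizes_T, (1.0, 4.0), [2.0, 6.0], [1, 2], (1.0, 4.0))
```

With an ICC of zero the result is the student-weighted mean, and with an ICC of one it is the simple average of school means; intermediate ICCs give values between the two.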

Assume that as n approaches infinity, $\sum_{i : T_{i} = 1}^{n p} m_{i} w_{T i} / \sum_{i : T_{i} = 1}^{n p} m_{i}$ converges to $\bar{w}_{T}$ for treatments and $\sum_{i : T_{i} = 0}^{n (1 - p)} m_{i} w_{C i} / \sum_{i : T_{i} = 0}^{n (1 - p)} m_{i}$ converges to $\bar{w}_{C}$ for controls. Standard methods (see, e.g., Wooldridge, 2002) can then be used to show that $\hat{α}_{1, S R}$ is asymptotically normal with mean α_1 and the following variance:

\mathrm{AsyVar}(\hat{α}_{1, S R}) = \left[ \frac{1}{n p \bar{\bar{m}} \bar{w}_{T}} + \frac{1}{n (1 - p) \bar{\bar{m}} \bar{w}_{C}} \right] .

The terms in Equation 11 are comparable to the $S_{T}^{2}$ and $S_{C}^{2}$ terms in Equation 6 for the FP model. An important difference between the two models is that the FP variance also contains the $S_{τ}^{2}$ term, which reduces variance. The variance may therefore be somewhat smaller under the FP model, which is expected because the SP model assumes external validity, with an associated loss in statistical precision.

The asymptotic variance in Equation 11 can be estimated as follows:

\widehat{\mathrm{AsyVar}}(\hat{α}_{1, S R}) = s_{T, S P}^{2} + s_{C, S P}^{2} = \left[ \sum_{i : T_{i} = 1}^{n p} \frac{m_{i}}{m_{i} \hat{σ}_{T u}^{2} + \hat{σ}_{T e}^{2}} \right]^{- 1} + \left[ \sum_{i : T_{i} = 0}^{n (1 - p)} \frac{m_{i}}{m_{i} \hat{σ}_{C u}^{2} + \hat{σ}_{C e}^{2}} \right]^{- 1},

where estimated values for the variance components can be obtained using standard full information maximum likelihood or restricted maximum likelihood (FIML or REML) methods (that assume normality of the errors), ANOVA, MINQUE, or similar approaches (see, e.g., Baltagi & Chang, 1994, and De Leeuw & Meijer, 2008).
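Given variance component estimates from any of these methods, the variance estimator above is a direct calculation. The sketch below takes the estimates as inputs (it does not estimate them) and uses hypothetical school sizes and variance components; with equal school sizes m, each bracketed term reduces to (σ̂_u² + σ̂_e²/m) divided by the number of schools in that condition.

```python
def sp_arm_variance(sizes, sigma2_u_hat, sigma2_e_hat):
    """Inverse of the summed school-level weights m_i / (m_i*s2_u + s2_e) for one condition."""
    return 1.0 / sum(m / (m * sigma2_u_hat + sigma2_e_hat) for m in sizes)

def sp_variance_estimate(sizes_T, vc_T, sizes_C, vc_C):
    """Sum of the treatment and control terms; vc_* are (s2_u, s2_e) pairs."""
    return sp_arm_variance(sizes_T, *vc_T) + sp_arm_variance(sizes_C, *vc_C)

# Hypothetical balanced design: 2 schools of 4 students per condition.
var_hat = sp_variance_estimate([4, 4], (1.0, 4.0), [4, 4], (1.0, 4.0))
# Each condition contributes (1.0 + 4.0 / 4) / 2 = 1.0 in this balanced case.
```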

The SP Model With Covariates

Under the SP model with covariates, the covariates $x_{ijl}$ and potential outcomes are considered to be random draws from joint school- and student-level super-population distributions. Let $\tilde{x}_{ijl} = x_{ijl} - E(x_{ijl}) = u_{il} + e_{ijl}$, $\tilde{Y}_{Tij} = Y_{Tij} - \mu_T$, and $\tilde{Y}_{Cij} = Y_{Cij} - \mu_C$ be population-centered variables, where $u_{il}$ and $e_{ijl}$ are random variables with zero means and respective finite variances $\sigma_{ull} > 0$ and $\sigma_{ell} > 0$. Under the SP covariance structure, $E(\tilde{x}_{ijl} \tilde{x}_{ijl'}) = \sigma_{ull'} + \sigma_{ell'}$ for $l, l' \in \{1, \ldots, v\}$, $E(\tilde{x}_{ijl} \tilde{x}_{ij'l'}) = \sigma_{ull'}$ for $j \neq j'$, $E(\tilde{x}_{ijl} \tilde{Y}_{Tij}) = \sigma_{ulT} + \sigma_{elT}$, $E(\tilde{x}_{ijl} \tilde{Y}_{Tij'}) = \sigma_{ulT}$, $E(\tilde{x}_{ijl} \tilde{Y}_{Cij}) = \sigma_{ulC} + \sigma_{elC}$, $E(\tilde{x}_{ijl} \tilde{Y}_{Cij'}) = \sigma_{ulC}$, and the covariates and potential outcomes are uncorrelated across schools.

The SP model can include “school-level” covariates (e.g., indicators of the school’s urban or rural status) that are modeled as $\tilde{x}_{il} = u_{il}$. The model can also include student-level covariates, which, consistent with the SP approach, are assumed to enter the model as two orthogonal categories of variables: (1) “between-school” covariates that are averaged to the school level, $\tilde{x}_{i \cdot l} = u_{il} + \bar{e}_{i \cdot l}$, and (2) “within-school” covariates that are measured relative to their school-level means, $(\tilde{x}_{ijl} - \tilde{x}_{i \cdot l}) = e_{ijl} - \bar{e}_{i \cdot l}$.

In the SP model with covariates, Equation 4 remains the true model. Similarly, $\Omega$ from Equation 4 still applies, and thus, the GLS weights are not likely to have optimal properties. This approach differs from the usual random effects model, which assumes that the true model includes the baseline covariates and that $\Omega$ is conditional on the covariates. The design matrix used for estimation is now $Q_i = [K \;\; \tilde{T}_i \;\; X_i]$, where $X_i$ is a matrix of centered covariates for school $i$, and the GLS estimator is $[(\sum_{i=1}^{n} Q_i' \Omega_i^{-1} Q_i)^{-1} (\sum_{i=1}^{n} Q_i' \Omega_i^{-1} y_i)]_{2,2}$.
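The school-by-school accumulation in this GLS estimator can be sketched as follows. This is not the author's program. It assumes that $K$ is a column of ones, that $\tilde{T}_i$ is the treatment indicator centered at the assignment rate $p$, and that $\Omega_i$ has the random-intercept form $\sigma_u^2 \mathbf{1}\mathbf{1}' + \sigma_e^2 I$ with components that differ by research group. The simulated data and the name `gls_ate` are purely illustrative.

```python
import numpy as np

def gls_ate(y_list, x_list, t, var_comp, p):
    """GLS coefficient on the centered treatment indicator.

    y_list, x_list: per-school outcome vectors and centered covariate matrices
    t:              per-school treatment indicators (1 or 0)
    var_comp:       {1: (sigma2_u, sigma2_e), 0: (sigma2_u, sigma2_e)}
    p:              treatment assignment rate used to center the indicator
    """
    k = x_list[0].shape[1] + 2
    A = np.zeros((k, k))
    b = np.zeros(k)
    for y_i, x_i, t_i in zip(y_list, x_list, t):
        m_i = len(y_i)
        s2u, s2e = var_comp[int(t_i)]
        omega = s2u * np.ones((m_i, m_i)) + s2e * np.eye(m_i)
        q = np.column_stack([np.ones(m_i), np.full(m_i, t_i - p), x_i])
        omega_inv_q = np.linalg.solve(omega, q)  # Omega^{-1} Q
        A += q.T @ omega_inv_q                   # Q' Omega^{-1} Q
        b += omega_inv_q.T @ y_i                 # Q' Omega^{-1} y
    beta = np.linalg.solve(A, b)
    return beta[1]  # second element: coefficient on the treatment indicator

# Simulated illustrative data (all values hypothetical).
rng = np.random.default_rng(0)
n, p = 40, 0.5
t = rng.permutation(np.repeat([1, 0], n // 2))
sizes = rng.integers(16, 52, size=n)
var_comp = {1: (12.0, 61.0), 0: (15.0, 59.0)}
x_list, y_list = [], []
for t_i, m_i in zip(t, sizes):
    s2u, s2e = var_comp[int(t_i)]
    x_i = rng.normal(size=(m_i, 1))
    u_i = rng.normal(scale=np.sqrt(s2u))
    e_i = rng.normal(scale=np.sqrt(s2e), size=m_i)
    y_list.append(45.0 + 2.0 * t_i + 4.0 * x_i[:, 0] + u_i + e_i)
    x_list.append(x_i)
x_bar = np.concatenate(x_list).mean(axis=0)
x_list = [x_i - x_bar for x_i in x_list]  # center covariates at the pooled mean

alpha_hat = gls_ate(y_list, x_list, t, var_comp, p)
print(f"GLS impact estimate: {alpha_hat:.2f}")
```

In practice the variance components would be replaced by estimates, which is why the sketch takes them as an argument rather than estimating them internally.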

To examine asymptotic moments of the GLS estimator under the SP model with covariates, we assume the following population moment analogs to Equation 8 for the FP model:

$$
\begin{aligned}
E\Big[\sum_{j=1}^{m_i}\sum_{k=1}^{m_i} w_{Ti}^{2}\, \tilde{x}_{ij}' \tilde{Y}_{Tik}\Big] &= \Lambda_{x'Y_T}, &
E\Big[\sum_{j=1}^{m_i}\sum_{k=1}^{m_i} w_{Ci}^{2}\, \tilde{x}_{ij}' \tilde{Y}_{Cik}\Big] &= \Lambda_{x'Y_C}, \\
E\Big[\sum_{j=1}^{m_i}\sum_{k=1}^{m_i} w_{Ti}^{2}\, \tilde{x}_{ij}' \tilde{x}_{ik}\Big] &= \Lambda_{x'x}^{T}, &
E\Big[\sum_{j=1}^{m_i}\sum_{k=1}^{m_i} w_{Ci}^{2}\, \tilde{x}_{ij}' \tilde{x}_{ik}\Big] &= \Lambda_{x'x}^{C}, \\
E\Big[\sum_{j=1}^{m_i} w_{Ti}\, \tilde{x}_{ij}' \tilde{x}_{ij}\Big] &= \Lambda_{X'X}^{T}, &
E\Big[\sum_{j=1}^{m_i} w_{Ci}\, \tilde{x}_{ij}' \tilde{x}_{ij}\Big] &= \Lambda_{X'X}^{C}, \\
E\Big[\sum_{j=1}^{m_i} w_{Ti}\, \tilde{x}_{ij}' \tilde{Y}_{Tij}\Big] &= \Lambda_{X'Y_T}, \text{ and} &
E\Big[\sum_{j=1}^{m_i} w_{Ci}\, \tilde{x}_{ij}' \tilde{Y}_{Cij}\Big] &= \Lambda_{X'Y_C},
\end{aligned}
\tag{13}
$$

where $\tilde{x}_{ij}$ is a $1 \times v$ covariate vector for a subject with elements $\tilde{x}_{ijl}$.

The (finite) elements of the $\Lambda$ matrices can be expressed in terms of the SP moments defined earlier. For example, the elements of $\Lambda_{x'x}^{T}$ are of the form $E[m_i^{2} w_{Ti}^{2} (\sigma_{ull'} + \sigma_{ell'} / m_i)]$ if $l$ and $l'$ each refer to between-school covariates, $E[m_i^{2} w_{Ti}^{2} (\sigma_{ell'} - \sigma_{ell'} / m_i)]$ for within-school covariates, and $E[m_i^{2} w_{Ti}^{2} \sigma_{ull'}]$ for school-level covariates (where expectations are with respect to the distribution of $m_i$). Similarly, the elements of $\Lambda_{X'Y_T}$ are $E[w_{Ti} (m_i \sigma_{ulT} + \sigma_{elT})]$ for between-school covariates, $E[w_{Ti} (m_i \sigma_{elT} - \sigma_{elT})]$ for within-school covariates, and $E[m_i w_{Ti} \sigma_{ulT}]$ for school-level covariates.
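To make the between-school case concrete, the short derivation below sketches where the first expression comes from. It assumes, as the SP covariance structure above implies, that $u_{il}$ and $e_{ijl}$ are uncorrelated and that the $e_{ijl}$ are uncorrelated across students in the same school. It also assumes that $w_{Ti}$ varies across schools only through $m_i$; the conditioning on $m_i$ is added here for exposition.

```latex
\begin{aligned}
E\Big[\sum_{j=1}^{m_i}\sum_{k=1}^{m_i} w_{Ti}^{2}\,
      \tilde{x}_{i \cdot l}\, \tilde{x}_{i \cdot l'} \,\Big|\, m_i\Big]
  &= m_i^{2} w_{Ti}^{2}\,
     E\big[(u_{il} + \bar{e}_{i \cdot l})(u_{il'} + \bar{e}_{i \cdot l'})\big] \\
  &= m_i^{2} w_{Ti}^{2}\,\big(\sigma_{ull'} + \sigma_{ell'} / m_i\big).
\end{aligned}
```

The second line uses $E(u_{il} u_{il'}) = \sigma_{ull'}$ and $E(\bar{e}_{i \cdot l}\, \bar{e}_{i \cdot l'}) = \sigma_{ell'} / m_i$. Taking expectations over the distribution of $m_i$ then gives the stated form.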

The following lemma provides the asymptotic distribution of the GLS SP regression estimator (Yang & Tsiatis, 2001, and Schochet, 2010, provide OLS results for nonclustered SP designs). The proof is provided in Appendix A.

Lemma 3. Let $\hat{\alpha}_{1,MR}$ be the multiple GLS regression estimator for $\alpha_1$ under the SP model in Equation 4. Then, $\hat{\alpha}_{1,MR}$ is asymptotically normal with mean $\alpha_1$ and variance:

$$
\begin{aligned}
\mathrm{AsyVar}(\hat{\alpha}_{1,MR}) ={}& \left[\frac{1}{n p\, \bar{\bar{m}}\, \bar{w}_T} + \frac{1}{n (1-p)\, \bar{\bar{m}}\, \bar{w}_C}\right]
 - \frac{1}{n p\, \bar{\bar{m}}\, \bar{w}_T} \left(2 \Gamma_{\alpha}' \Lambda_{x'Y_T} - \Gamma_{\alpha}' \Lambda_{x'x}^{T} \Gamma_{\alpha}\right) \\
 &- \frac{1}{n (1-p)\, \bar{\bar{m}}\, \bar{w}_C} \left(2 \Gamma_{\alpha}' \Lambda_{x'Y_C} - \Gamma_{\alpha}' \Lambda_{x'x}^{C} \Gamma_{\alpha}\right),
\end{aligned}
\tag{14}
$$

where

$$
\Gamma_{\alpha} = \left[p \Lambda_{X'X}^{T} + (1-p) \Lambda_{X'X}^{C}\right]^{-1} \left[p \Lambda_{X'Y_T} + (1-p) \Lambda_{X'Y_C}\right].
$$

The form of Equation 14 is similar to the form of Equation 8 for the FP model and has a similar interpretation. The first bracketed term is the variance of the GLS estimator under the SP model without covariates. The remaining terms account for precision effects due to covariate adjustment.

The $\Lambda$ matrices in Equation 14 can be estimated using weighted sample moments, or by first obtaining estimates of the variance–covariance components $\sigma_{u \cdot}$ and $\sigma_{e \cdot}$. For example, $\Lambda_{x'Y_T}$ can be estimated using $\sum_{i:T_i=1}^{np} \sum_{j=1}^{m_i} \sum_{k=1}^{m_i} w_{Ti}^{2}\, x_{ij}' (y_{Tik} - \bar{y}_T) / \sum_{i:T_i=1}^{np} m_i w_{Ti}$ or $\sum_{i:T_i=1}^{np} m_i^{2} w_{Ti}^{2} [\hat{\sigma}_{ulT} + (\hat{\sigma}_{elT} / m_i)] / \sum_{i:T_i=1}^{np} m_i w_{Ti}$ (for the between-school covariates). Values for $\hat{\sigma}_{u \cdot}$ and $\hat{\sigma}_{e \cdot}$ can be obtained using ANOVA methods or maximum likelihood methods for estimating random effects models that allow for multivariate outcomes (see, e.g., Fieuws, Verbeke, & Molenberghs, 2007; Tate & Pituch, 2007; Thum, 1997).
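The weighted sample-moment version can be sketched in a few lines. The example assumes weights of the form $w_{Ti} = 1 / (m_i \hat{\sigma}_{Tu}^2 + \hat{\sigma}_{Te}^2)$, an unweighted treatment group mean for $\bar{y}_T$, and covariates centered at their pooled mean; the data are simulated and the name `lambda_xy_sample` is hypothetical.

```python
import numpy as np

def lambda_xy_sample(x_list, y_list, w):
    """Weighted sample-moment estimate of Lambda_{x'Y_T} for one research group.

    x_list: per-school (m_i, v) arrays of centered covariates
    y_list: per-school (m_i,) outcome arrays
    w:      per-school weights w_Ti
    """
    y_bar = np.concatenate(y_list).mean()  # assumed: unweighted group mean
    num = 0.0
    den = 0.0
    for x_i, y_i, w_i in zip(x_list, y_list, w):
        # sum_j sum_k x_ij' (y_ik - y_bar) factors into a product of two sums
        num = num + w_i**2 * x_i.sum(axis=0) * (y_i - y_bar).sum()
        den += len(y_i) * w_i
    return num / den

# Simulated illustrative data for three hypothetical treatment schools.
rng = np.random.default_rng(1)
sizes = [16, 31, 51]
x_list = [rng.normal(size=(m, 2)) for m in sizes]
x_bar = np.concatenate(x_list).mean(axis=0)
x_list = [x_i - x_bar for x_i in x_list]
y_list = [rng.normal(loc=45.0, scale=8.6, size=m) for m in sizes]
w = [1.0 / (m * 12.0 + 61.0) for m in sizes]

lam_hat = lambda_xy_sample(x_list, y_list, w)
print(lam_hat)
```

Factoring the double sum avoids an explicit loop over all student pairs within a school, which matters little here but helps with larger schools.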

Empirical Application

This section compares impact findings from the FP and SP estimators discussed above, using data from a large-scale school-based RCT of the achievement effects of four early elementary school math curricula (Agodini, Harris, Atkins-Burnett, Heavside, & Novak, 2009) that was funded by the Institute of Education Sciences (IES) at the U.S. Department of Education (ED). Table 1 describes the evaluation design, data, samples, key outcome measures, and baseline covariates. The impact results presented in the evaluation report were obtained using the standard random effects REML estimator, which assumes the same error structure for treatments and controls and that the errors are conditional on the covariates.

Table 1

Summary of Data Source Used for the Empirical Analysis

Study (Authors; Sponsor)	Achievement Effects of Four Early Elementary School Math Curricula: Findings From First Graders in 39 Schools (Agodini, Harris, Atkins-Burnett, Heavside, & Novak, 2009; IES)
Description of study	Study examined the relative impacts of four math curricula on first-grade mathematics achievement. The curricula were selected by a national panel of content experts to represent diverse approaches to teaching elementary school math. The four curricula are Investigations in Number, Data, and Space; Math Expressions; Saxon Math; and Scott Foresman-Addison Wesley Mathematics.
Original and current study populations	First graders in 39 Title I schools in four districts in four states for both the original and current study. For the current study, the treatment group was defined as those in schools receiving the Saxon or Math Expressions curricula, and the control group was defined as those receiving the remaining two curricula.
Outcome for current study (mean, standard deviation)	Early Childhood Longitudinal Study (ECLS) total math assessment scale score in five math content areas (45.0, 8.6)
Baseline covariates and variables used for the subgroup method (mean, standard deviation for continuous variables)	Student-Level: ECLS kindergarten (ECLS-K) pretest score (30.9,8.6); quadratic pretest score; indicators of whether student is female (0.49), Hispanic (0.21)^a, Black (0.20)^a, has an IEP plan (0.06)^a; and is an English language learner (0.13)^a;
	School-Level: Percentage of teachers who have a master’s degree (0.67, 0.32)^a; average years of teacher experience (12.0, 7.1)^a; percentage of students who are eligible for a free or reduced-price lunch (0.38, 0.31)^a; seven strata (block) indicator variables.

Note: IES = Institute of Education Sciences.

a. Variables were used to estimate potential outcomes using the subgroup method described in the text but not as baseline covariates.

For this article, the data were reanalyzed using the various FP and SP estimators considered above with and without covariates. Standard errors for the FP models were obtained using Equations 6 and 9, where sample moments were used to estimate S_T , S_C , and the S matrices. Estimates of S_τ were obtained using the subgroup method discussed above where the baseline covariates shown in Table 1 were used to estimate subgroup impacts; separate OLS models were estimated by gender and race/ethnicity. The FP models were estimated using two weighting schemes where (1) subjects were weighted equally and (2) schools were weighted equally.

Standard errors for the SP model were obtained using Equations 11 and 14. The Swamy and Arora (SA, 1972) ANOVA method was used to estimate the variance components $\sigma_{\cdot u}^{2}$ and $\sigma_{\cdot e}^{2}$ in Equation 4 that were needed to calculate the weights $w_{Ti}$ and $w_{Ci}$. The SA ANOVA method, which is based on residuals from within- and between-school OLS regressions, was used rather than REML or FIML methods because it does not rely on distributional assumptions (and thus is more in line with the nonparametric Neyman framework) and was shown by Baltagi and Chang (1994) to perform well in simulations. Elements of the $\Lambda$ matrices in Equation 14 were estimated in two ways: (1) using weighted sample moments and (2) using the SA ANOVA method.
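To give a sense of how a distribution-free ANOVA approach recovers variance components, the sketch below implements the textbook method-of-moments estimator for an unbalanced one-way random effects model without covariates. It is a simpler stand-in chosen for illustration and is not the Swamy and Arora estimator used in the article; the function name and the toy data are hypothetical.

```python
import numpy as np

def anova_variance_components(groups):
    """Method-of-moments estimates of (sigma2_u, sigma2_e).

    groups: list of per-school outcome arrays for one research group.
    Uses the within- and between-school mean squares of a one-way layout.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    n = len(groups)
    m = np.array([len(g) for g in groups])
    N = m.sum()
    means = np.array([g.mean() for g in groups])
    grand = np.concatenate(groups).mean()

    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)  # within-school SS
    ssb = (m * (means - grand) ** 2).sum()                   # between-school SS
    msw = ssw / (N - n)
    msb = ssb / (n - 1)

    # Effective school size for unbalanced data.
    m0 = (N - (m ** 2).sum() / N) / (n - 1)
    sigma2_e = msw
    sigma2_u = max((msb - msw) / m0, 0.0)  # truncate negative estimates at zero
    return sigma2_u, sigma2_e

# Toy data: two hypothetical schools with two students each.
sigma2_u, sigma2_e = anova_variance_components([[1.0, 3.0], [5.0, 7.0]])
icc = sigma2_u / (sigma2_u + sigma2_e)
print(sigma2_u, sigma2_e, round(icc, 3))
```

Applying the function separately to treatment and control schools would give group-specific components of the kind that enter the weights $w_{Ti}$ and $w_{Ci}$.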

Before presenting the impact findings, it is important to first present several key features of the data that can be used to help interpret the impact estimates. First, sample sizes vary across the study schools (the range is 16–51 students and the median is 31 students per school), and the ICC for the SP model without covariates is about 0.19 for both treatments and controls. Thus, the way in which school-level means are weighted to produce overall impacts will differ for the FP and SP models, which could lead to different impact findings across the estimators.

Second, the error structure is similar for treatments and controls. For example, ${\hat{σ}}_{T u}^{2} = 12$ , ${\hat{σ}}_{C u}^{2} = 15$ , ${\hat{σ}}_{T e}^{2} = 61$ , and ${\hat{σ}}_{C e}^{2} = 59$ for the SP model in Equation 4. Furthermore, pretest–posttest correlations are similarly large for treatments (ρ = .77) and controls (ρ = .78), suggesting that the models with covariates are likely to yield substantially more precise impact estimates than the models without covariates.

Finally, the value of $(s_{\tau}^{2} / n \bar{m})$ is only about 7% of the value of the remaining variance terms in Equation 6 for the FP model without covariates, but the corresponding figure is 67% for the FP model with covariates in Equation 9. Thus, although the variation in treatment effects across subjects within units is modest, this variation will substantially increase the precision of the FP estimators relative to the SP estimators for the models that include baseline covariates.

Table 2 displays impact findings for the considered FP and SP estimators, the standard REML estimator, and the GEE empirical sandwich estimator. All estimators yield similar overall impact findings. The models with covariates all show that the Saxon or Math Expressions math curriculum produced significantly higher first-grade student math test scores than the other tested math curricula. The estimated impacts are all about 2.30 scale points, which translates into an impact of about 0.27 standard deviation (effect size) units, or about 3 months of average learning growth in math per year for these students (see Schochet, 2008). The impact estimates for the models without covariates are not statistically significant for any estimator due to the omission of the pretests that considerably improve the precision of the impact estimates.

Table 2

Impact Findings, by Estimator

Model and Estimator	Model With Covariates	Model Without Covariates
Finite-population (FP) OLS model
Schools weighted by sample sizes	2.24 (.331) (.000)*	1.38 (1.21) (.261)
Schools weighted equally	2.33 (.343) (.000)*	1.49 (1.23) (.232)
Super-population (SP) GLS model
Λ in Equation (14) estimated using sample moments	2.31 (.571) (.000)*	1.49 (1.26) (.242)
Λ in Equation (14) estimated using ANOVA	2.31 (.521) (.000)*	1.49 (1.26) (.242)
Standard REML GLS Estimator	2.27 (.570) (.000)*	1.48 (1.27) (.243)
GEE empirical sandwich estimator	2.24 (.521) (.000)*	1.38 (1.25) (.271)
Sample sizes
Schools (P = % treatment)	39 Schools (0.46)	39 Schools (0.46)
Students	1,309	1,309

Source: Data from the math curricula evaluation (see Agodini et al., 2009 and Table 1).

Note: From left to right, the figures in cells are the ATE impact estimates, estimated standard errors, and p values. Findings were obtained using computer programs written by the author, except (1) the standard REML estimator was estimated using SAS Proc Mixed where the error structure was assumed to be the same for treatments and controls, and (2) the GEE estimator was estimated using SAS Proc GENMOD assuming an independent working correlation structure, an identity link function, and the empirical sandwich estimator. The covariates included in the models are described in Table 1. See the text for methods and formulas. REML = restricted maximum likelihood; GEE = generalized estimating equation; GLS = generalized least squares; ANOVA = analysis of variance.

*The impact estimate is statistically significant at the 5% significance level, two-tailed test.

Importantly, however, the estimated standard errors for the models with covariates are about 40% smaller using the FP estimators than the SP estimators. As discussed, this primarily occurs because, unlike the SP estimators, the variances of the FP estimators are reduced by $(s_{\tau}^{2} / n \bar{m})$, which measures the extent to which treatment effects vary and covary across subjects within schools. Although the smaller FP variance estimate did not result in changes in the statistical significance of the impact estimates in the present example, the FP and SP estimators could yield different impact findings in other RCT applications.
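The size of this difference can be checked directly from the standard errors reported in Table 2 for the models with covariates. The short calculation below is only an arithmetic illustration; the dictionary labels are shorthand introduced here.

```python
# Standard errors from Table 2, models with covariates.
se_fp = {"sample-size weights": 0.331, "equal weights": 0.343}
se_sp = {"sample moments": 0.571, "ANOVA": 0.521}

reductions = {}
for fp_name, fp in se_fp.items():
    for sp_name, sp in se_sp.items():
        # Proportional reduction in the standard error, FP relative to SP.
        reductions[(fp_name, sp_name)] = 1 - fp / sp
        pct = 100 * reductions[(fp_name, sp_name)]
        print(f"FP ({fp_name}) vs. SP ({sp_name}): {pct:.0f}% smaller")
```

Across the four pairings, the reductions range from roughly 34% to 42%, in line with the figure of about 40% cited above.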

Summary and Conclusion

This article has examined the estimation of two-stage clustered RCT designs in the education area using the Neyman causal inference framework that underlies experiments. The key distinction between the considered causal models is whether potential treatment and control group outcomes are considered to be fixed for the study population (the FP model) or randomly selected from a vaguely defined super-population (the SP model).

In the FP model, the only source of randomness is treatment status, and a clustered design results only because students in the same cluster share the same treatment status. The relevant impact parameter for this model is the ATE for those in the study sample; thus, the impact results are internally valid only. In the SP model, cluster- and student-level potential outcomes are considered to be randomly sampled from respective super-population distributions. In this framework, the relevant ATE parameter is the intervention effect for the average cluster in the super-population. Thus, impact findings are assumed to generalize outside the study sample, although it is often difficult to precisely define the study universe.

This article derived asymptotic variance formulas for models with and without baseline covariates using OLS methods for the FP model and GLS methods for the SP model. The key difference between the FP and SP variance formulas is that the FP variance is reduced by an $S_\tau$ term that pertains to the extent to which treatment effects vary and covary across students within schools. Another important difference between the FP and SP variance estimators is the way in which schools are weighted for the analysis.

Importantly, the considered estimators differ from the standard model-based HLM estimator that is typically used in practice for clustered education RCTs, mainly due to differences in the assumed model error structures. This largely occurs because the Neyman framework allows for heterogeneity of treatment effects, which leads to variance expressions that differ for the treatment and control groups (which the HLM approach can accommodate but which is often ignored in practice). In addition, the HLM model is an SP approach, and thus, cannot accommodate the FP approach, and in particular, the reduction in variance due to treatment effect heterogeneity under the FP model. Furthermore, the Neyman approach does not require assumptions on the distributions of potential outcomes, whereas the model-based approaches typically assume multilevel normality, which may not always hold. Finally, in models with covariates, the variance–covariance matrix of the error terms in the HLM model is assumed to be conditional on the covariates, whereas the covariates do not enter the true model under the Neyman approach; this leads to differences in how clusters (schools) are weighted in the analysis to obtain overall impact estimates.

Using data from a recent influential clustered RCT in the education area, the empirical analysis estimated ATEs and their standard errors using the FP, SP, and standard HLM and GEE estimators. All estimators yield similar ATE point estimates and findings concerning statistical significance. However, standard errors of the FP estimators are considerably smaller than for the other estimators due to the S_τ term. This suggests that in particular studies, policy conclusions could differ using the various approaches.

As shown in this article, the decision to adopt the FP or SP framework in clustered RCTs can matter and has implications for the way in which the impact findings are generalized and interpreted. The choice of the benchmark estimation method should best fit evaluation research questions and objectives and should be specified and justified in the analysis protocols. The choice of framework, however, is often a difficult philosophical issue, and there might not always be a scientific basis to help guide this decision. Thus, education researchers may want to consider specifying in their analysis protocols sensitivity analyses using alternative estimation approaches and attempt to explain any discrepancies between sensitivity and benchmark analysis findings.

Footnotes

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author disclosed receipt of the following financial support for the research and/or authorship of this article: This research was funded under Contract ED-04-CO-0112/0006 with the U.S. Department of Education.

References

Agodini, R. B., Harris, Atkins-Burnett, Heavside, & Novak, R. M. (2009). Achievement effects of four early elementary school math curricula: Findings from first graders in 39 schools. Washington, DC: U.S. Department of Education, Institute of Education Sciences.

Baltagi & Chang (1994). A comparative study of alternative estimators for the unbalanced one-way error component regression model. Journal of Econometrics, 62, 67–89.

Bryk & Raudenbush (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.

De Leeuw & Meijer (2008). Handbook of multilevel analysis. New York, NY: Springer.

Fieuws, Verbeke, & Molenberghs (2007). Random-effects models for multivariate repeated measures. Statistical Methods in Medical Research, 16, 387–397.

Freedman (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40, 180–193.

Gail, M. H., Mark, S. D., Carroll, R. J., Green, S. B., & Pee (1996). On design considerations and randomization-based inference for community intervention trials. Statistics in Medicine, 15, 1069–1092.

Hájek (1960). Limiting distributions in simple random sampling from a finite population. Publications of the Mathematics Institute of Hungarian Academy of Science, 5, 361–375.

Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320–340.

Hoglund (1978). Sampling from a finite population: A remainder term estimate. Scandinavian Journal of Statistics, 5, 69–71.

Holland (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–960.

Hsiao (1986). Analysis of panel data. Cambridge, England: Cambridge University Press.

Imai, King, & Nall (2009). The essential role of pair matching in cluster-randomized experiments, with application to the Mexican universal health insurance evaluation. Statistical Science, 24, 29–53.

Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974.

Liang & Zeger (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

Murray (1998). Design and analysis of group-randomized trials. New York, NY: Oxford University Press.

Neyman (1990). On the application of probability theory to agricultural experiments: Essay on principles (Dabrowska, D. M., & Speed, T. P., Trans. and Ed.). Statistical Science, 5, 465–480. (Reprinted from Roczniki Nauk Rolniczych Tom 10, 1–51, 1923).

Pfeffermann, Skinner, Holmes, Goldstein, & Rasbash (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society Series B, 60, 23–40.

Rao, C. R. (1972). Estimation of variance and covariance components in linear models. Journal of the American Statistical Association, 67, 112–115.

Raudenbush & Bryk (2002). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.

Rubin (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.

Rubin (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2, 1–26.

Rubin, D. B. (1980). Comment on “Randomization analysis of experimental data in the Fisher randomization test” by Basu. Journal of the American Statistical Association, 75, 591–593.

Schochet (2008). Statistical power for random assignment evaluations of education programs. Journal of Educational and Behavioral Statistics, 33, 62–87.

Schochet (2010). Is regression adjustment supported by the Neyman model for causal inference? Journal of Statistical Planning and Inference, 140, 246–259.

Small, S. S., Ten Have, & Rosenbaum (2008). Randomization inference in a group-randomized trial of treatments for depression: Covariate adjustment, noncompliance, and quantile effects. Journal of the American Statistical Association, 103, 271–279.

Swamy & Arora (1972). The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica, 40, 261–275.

Tate & Pituch (2007). Multivariate hierarchical linear modeling in randomized field experiments. Journal of Experimental Education, 75, 317–337.

Thum, Y. M. (1997). Hierarchical linear models for multivariate outcomes. Journal of Educational and Behavioral Statistics, 22, 77–108.

Wooldridge (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.

Yang & Tsiatis (2001). Efficiency study of estimators for a treatment effect in a pretest-posttest trial. American Statistician, 55, 314–321.
