Design-Based Estimators for Average Treatment Effects for Multi-Armed RCTs

Abstract

Design-based methods have recently been developed as a way to analyze randomized controlled trial (RCT) data for designs with a single treatment and control group. This article builds on this framework to develop design-based estimators for evaluations with multiple research groups. Results are provided for a wide range of designs used in education research, including clustered and blocked designs. Because analysis in the multi-armed setting involves pairwise contrasts across the research groups, the key methodological question addressed is: How do the estimators for the two-group design need to be adjusted for multi-armed trials? The critical insight is that in multi-armed trials where the goal is to identify the most effective treatments, the samples for each pairwise contrast are representative of the full set of randomized units, not just of themselves. The implication is that, under the finite-population framework, variance terms need a slight adjustment that can reduce precision, and blocks need to be weighted to reflect the full randomized sample in the block; otherwise, biases can result. An empirical example using data from a multi-armed education RCT demonstrates the issues.

Keywords

multi-armed RCTs; average treatment effects; design-based estimators; impact and variance estimation

Design-based methods have recently been developed as a way to analyze data from impact evaluations of interventions, programs, and policies (Freedman, 2008; Imbens & Rubin, 2015; Lin, 2013; Schochet, 2013, 2015/2016; Yang & Tsiatis, 2001). The nonparametric estimators are derived using the building blocks of experimental designs with minimal assumptions and are unbiased and normally distributed in large samples with simple variance estimators. The methods apply to randomized controlled trials (RCTs) and quasi-experimental designs (QEDs) with comparison groups for a wide range of designs used in social policy research. The methods have important advantages over traditional model-based impact estimation methods, such as hierarchical linear model (HLM) and cluster robust standard error methods, and perform well in simulations (Schochet, 2015/2016; Schochet & Kautz, 2018). Design-based estimators are acceptable for What Works Clearinghouse evidence reviews (Scher & Cole, 2017).

The literature on design-based methods has focused on RCTs with a single treatment and a single control group. This theory, however, has not been formally extended to designs with multiple research groups, apart from Dasgupta, Pillai, and Rubin (2015), who focus on design-based methods for $2^{k}$ factorial designs. This is an important gap in the literature because multi-armed RCTs can simultaneously examine the effects of multiple interventions in a single study, thereby increasing the amount that researchers and policy makers can learn from impact evaluations. In social policy research, these designs are particularly relevant for interventions that are relatively easy to implement. Multi-armed designs are also useful for rapid-cycle experiments involving multiple cycles of small changes aimed at continuous program improvement, for example, using behavioral-based interventions.

Multi-armed RCT designs have been used in education research in a variety of contexts. For instance, they have been used to test the effects of different forms of teacher-to-parent communication on student outcomes (Kraft & Rogers, 2014) and the effects of text messaging and peer mentoring on college enrollment rates among high school graduates (Castleman & Page, 2015). Multi-armed RCTs have also been used in larger studies to test the effects of competing math curricula (Agodini et al., 2009) and reading curricula (James-Burdumy et al., 2009). They have also been used internationally, for example, in Honduras, to examine the effects of various data-driven assessment tools to improve teaching practices and student outcomes (Toledo, Humpage-Liuzzi, Murray, & Glazerman, 2015).

This article provides new results on the estimation of average treatment effects (ATEs) for multi-armed designs, building on and referencing the design-based literature for the two-group design, rather than developing estimators from scratch (although we provide proofs of some key results for the two-group design in an online supplement to help clarify the theory). The approach is based on the Neyman–Rubin–Holland potential outcomes framework that underlies experiments (Holland, 1986; Neyman, 1923/1990; Rubin, 1974, 1977). The article also builds on Dasgupta et al. (2015) who focus on factorial effects for nonclustered designs only, whereas we consider a broader range of multi-armed RCT designs with blocking, clustering, and the inclusion of weights and baseline covariates to improve precision. Our focus is on RCTs, although key concepts apply also to multi-armed QEDs.

The article is in four sections. The first section discusses the considered estimators. The second section discusses how design-based ATE estimators for the two-group design need to be modified for the multi-armed design when comparing pairs of research groups to each other. The third section provides an empirical example using data from a multi-armed RCT testing the effects of various supplemental reading interventions. The final section presents conclusions.

Considered ATE Estimators

An important consideration for multi-armed RCTs is the pairwise contrasts of interest to best address the study research questions. For instance, researchers may be interested in all pairwise comparisons across the research groups, pairwise comparisons with the control group, or pairwise comparisons with the best of the other treatments (Hsu, 1996; Westfall, Tobias, Rom, Wolfinger, & Hochberg, 1999). This section considers design-based ATE estimators for each pairwise contrast in isolation. For example, for a design with three treatment groups (T1, T2, and T3) and a control group (C), the methods apply to each possible pairwise contrast (e.g., T1–T2, T1–T3, T1–C, T2–C) as well as to contrasts formed by combining groups (e.g., comparing the combined T1 and T2 groups to the C group). Thus, the methods apply to the multi-armed context regardless of the full set of contrasts of interest.
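As a small illustration of the contrast sets described above, the following sketch enumerates all pairwise contrasts and the treatment-versus-control contrasts for the four-group example. The arm labels are taken from the example; the variable names are illustrative only.

```python
from itertools import combinations

groups = ["T1", "T2", "T3", "C"]  # arm labels from the example in the text

# All pairwise contrasts across the research groups.
all_pairs = list(combinations(groups, 2))

# Only the contrasts of each treatment group with the control group.
vs_control = [(g, "C") for g in groups if g != "C"]

print(len(all_pairs), all_pairs)
print(vs_control)
```

With K research groups there are K(K - 1)/2 pairwise contrasts, so the number of hypothesis tests grows quickly as arms are added.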

We focus on settings where interest lies in comparing impact findings across the tested treatments to identify the most effective ones (i.e., where the various contrasts are interpreted in unison). Clearly, if the focus is only on a particular contrast without regard to the others (e.g., if a specific contrast is included in a meta-analysis to compare impact findings from past evaluations of similar interventions), estimators for the two-group design apply without any modifications. Because we consider multi-armed RCTs that involve multiple hypothesis testing across contrasts, the inflation of Type 1 errors for each individual test is of concern. We refer readers to Hsu (1996), Schochet (2009, 2017), and Westfall, Tobias, Rom, Wolfinger, and Hochberg (1999) for a discussion of multiple comparison adjustments for multi-armed designs.

We focus on the designs presented in Schochet (2015/2016) defined by two key features: clustering and blocking. Nonclustered designs are those where the unit of analysis aligns with the unit of randomization (such as analyzing student-level data with student-level random assignment), whereas clustered designs are those where the unit of analysis is nested within the unit of random assignment (such as analyzing student-level data with school- or teacher-level random assignment). For nonblocked designs, randomization is conducted within a single population, whereas for blocked designs, randomization is conducted separately within distinct subpopulations (such as school districts or grades). In combination, these two design features cover most RCTs in the education field.

We consider design-based estimators for both the finite-population (FP) model where impacts are assumed to pertain to the study sample only and the super-population (SP) model where impacts are assumed to generalize to an infinite population (which may be vaguely defined). We also consider estimators for models with and without baseline covariates to improve precision and models with weights to adjust for data nonresponse and to determine how to aggregate blocks and clusters to obtain pooled ATE estimators. We focus on unbiased (consistent) estimation and do not consider other types of estimators, such as those that improve mean squared error by allowing bias to improve precision.

As formalized mathematically in this article, we find that key components of the design-based theory for the two-group design apply also to multi-armed RCTs. However, two modifications are required if interest lies in comparing treatment effects across contrasts:

Under the FP model, ATE estimators for each pairwise contrast pertain to the entire randomized sample, not just to the two groups being compared. Thus, variance estimators for the FP model for the two-group design need to be adjusted slightly to reflect the broader inference population.

For similar reasons, for blocked designs, the weights assigned to each block to obtain pooled ATE estimators need to be scaled to reflect the size of the full randomized sample in the block. Ignoring this rescaling can lead to biased impact estimates.
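The second modification can be illustrated with a hypothetical two-block example. The sketch below assumes, purely for illustration, that the pooled estimator is a weighted average of block-specific estimates with weights proportional to block sample sizes; all numbers and function names are made up and are not taken from the article.

```python
def pooled_ate(block_ates, block_weights):
    """Weighted average of block-specific ATE estimates."""
    total = sum(block_weights)
    return sum(w * b for w, b in zip(block_weights, block_ates)) / total

# Hypothetical two-block, three-arm study; contrast of arms 1 and 2.
block_ates = [0.10, 0.30]   # block-specific impact estimates (made up)
n_full = [100, 100]         # all randomized units in each block
n_contrast = [80, 40]       # units assigned to arms 1 and 2 only

ate_full = pooled_ate(block_ates, n_full)          # weights: full block size
ate_contrast = pooled_ate(block_ates, n_contrast)  # weights: estimation sample
print(round(ate_full, 4), round(ate_contrast, 4))
```

When assignment rates to the contrasted arms differ across blocks and impacts differ across blocks, the two weighting schemes give different pooled values, which is the source of the potential bias noted above.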

In this article, we do not consider statistical power considerations for multi-armed designs, but note here that for several reasons, these designs could require larger samples to produce precise impact estimates than for the two-group design. First, for multi-armed designs, the sample is split across more research groups. Second, we might expect impacts to be smaller when contrasting variants of a treatment than when comparing a treatment to a control (status quo) condition. Finally, larger sample sizes might be required to compensate for multiple comparison adjustments when conducting hypothesis tests across the pairwise contrasts. These factors could be mitigated somewhat for rapid-cycle, multi-armed RCTs that focus on mediating or proximal outcomes (such as teacher knowledge) where we might expect intervention effects to be larger than for more distal outcomes (such as student test scores).

ATE Estimators for Nonclustered Designs

Consider the simplest multi-armed RCT design where n individuals from a single population are randomly assigned to one of K distinct research groups $(K \geq 3)$ . The research groups could include a control (business-as-usual) group but do not have to. Each research group is offered a different intervention or combination of interventions (e.g., for two interventions T1 and T2, the four research groups for a full factorial design could be defined by the receipt of both T1 and T2, T1 only, T2 only, or neither).

We do not consider orthogonal fractional factorial designs where the interventions consist of components that could each be turned on or off to form different service packages and where the research groups include only a subset of all possible treatment combinations (see Box, Hunter, & Hunter, 2005; Wu & Hamada, 2009). Our focus is on examining pairwise contrasts between distinct research groups, whereas factorial designs are structured to estimate main and interaction effects by comparing combinations of research groups to each other. Under fractional factorial designs, main and two-way interaction effects become confounded with higher order interaction effects (and even with each other in some designs with small numbers of tested combinations), and the nature of the confounding depends on the adopted parameterization of the factorial design. This confounding complicates the assumptions required to develop design-based impact estimators, and these assumptions are likely to vary based on the structure of the factorial design. We do not address this topic here, but it is an interesting area for future research (note that Dasgupta et al., 2015, consider $2^{k}$ factorial designs only).

We assume the sample contains $n_{k} = n p_{k}$ individuals in research group $k$ $(k = 1, 2, ..., K)$, where $p_{k}$ is the research group assignment rate $(0 < p_{k} < 1; \sum_{k = 1}^{K} p_{k} = 1)$ and $K$ is finite. Let $Y_{i}(k)$ be the potential outcome for individual $i$ in group $k$, which could be binary or continuous, and let $T_{i}(k)$ be the research group indicator variable that equals 1 if an individual is assigned to group $k$ and 0 otherwise $(\sum_{k = 1}^{K} T_{i}(k) = 1)$. More succinctly, let $Q_{i} = k$ for individuals assigned to group $k$.

Design-based estimators in the multi-armed context rely on several assumptions that we generalize using the corresponding assumptions for the two-group design:

The stable unit treatment value assumption (SUTVA; Rubin, 1986), which has two parts. First, for any two random assignment vectors $Q$ and $Q'$, if $Q_{i} = Q'_{i}$ for individual $i$, then $Y_{i}(Q) = Y_{i}(Q')$. This means that the potential outcomes of an individual depend only on that person’s research assignment and not on the assignments of other individuals. It also implies that the potential outcomes for a particular treatment are independent of the number and nature of the other treatments. The second SUTVA condition is that an individual offered a particular treatment does not receive different forms of the treatment either in isolation or in combination with other treatments. For example, if an RCT has three research groups defined by the receipt of T1 only, T2 only, or both T1 and T2, then the same version of T1 must be delivered to the group receiving T1 only and the group receiving both T1 and T2, and similarly for T2; otherwise, the combined T1–T2 intervention should be considered a different intervention (T3) rather than one that supplements one treatment with the other.

Independence between research group assignment status and potential outcomes, $Q_{i} \perp\!\!\!\perp (Y_{i}(1), Y_{i}(2), ..., Y_{i}(K))$, which is ensured by randomization for RCTs and is assumed to hold conditional on baseline covariates for QEDs with comparison groups.

A positive probability of assignment to each research group for each individual.

Finite first and second moments for potential outcomes.

FP Model

Under the FP model for the multi-armed design, the $n$ individuals participating in the study are assumed to define the population universe, and potential outcomes are assumed to be fixed for the study. In this setting, the ATE parameter of interest for comparing interventions (research groups) $k$ and $k'$ is

$$\beta(k, k') = \frac{1}{n} \sum_{i = 1}^{n} (Y_{i}(k) - Y_{i}(k')) = \bar{Y}(k) - \bar{Y}(k'). \tag{1}$$

Importantly, this parameter pertains to the full sample of n individuals, not just to the $(n_{k} + n_{k^{'}})$ individuals randomized to the contrasted groups (hereafter referred to as the “estimation sample”). Thus, for multi-armed designs that aim to identify the most effective treatments among the set tested, ATE estimators for each pairwise contrast generalize beyond the estimation sample to the full randomized sample. Accordingly, variances need to be adjusted to reflect the larger inference population.

To demonstrate these adjustments, consider first the data-generating process for the observed outcome y_i :

$$y_{i} = \sum_{k = 1}^{K} T_{i}(k) Y_{i}(k). \tag{2}$$

This relation states we can observe $Y_{i} (k)$ if an individual is randomly assigned to research group k, but not the person’s potential outcomes in other research conditions. In this expression, y_i is random because the $T_{i} (k)$ indicators are random (the potential outcomes are assumed to be fixed for the study).
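The finite-population setup can be sketched in a few lines of code. The example below is only an illustration under assumed values: the sample size, group sizes, and outcome distribution are hypothetical, arms are indexed from 0 rather than 1, and the function names are not from the article.

```python
import random

random.seed(1)
n, K = 12, 3
sizes = [4, 4, 4]  # n_k = n * p_k with p_k = 1/3 (hypothetical)

# Fixed potential outcomes Y_i(k): one row per individual, one column per arm.
Y = [[random.gauss(0.2 * k, 1.0) for k in range(K)] for _ in range(n)]

def fp_ate(Y, k, kp):
    """Finite-population ATE: average of Y_i(k) - Y_i(k') over all n individuals."""
    return sum(row[k] - row[kp] for row in Y) / len(Y)

def randomize(sizes, rng=random):
    """Complete randomization: shuffle arm labels with fixed group sizes."""
    labels = [k for k, n_k in enumerate(sizes) for _ in range(n_k)]
    rng.shuffle(labels)
    return labels  # Q_i for each individual

Q = randomize(sizes)
# Observed outcome: only the potential outcome for the assigned arm is seen.
y = [Y[i][Q[i]] for i in range(n)]
print(fp_ate(Y, 0, 1))
```

The potential outcomes in `Y` stay fixed across calls to `randomize`; only the labels `Q`, and hence the observed `y`, change from one randomization to the next.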

Consider the simple differences-in-means estimator for $\beta(k, k')$ calculated using the sample randomized to conditions $k$ and $k'$:

$$\hat{\beta}(k, k') = \bar{y}(k) - \bar{y}(k') = \frac{1}{n_{k}} \sum_{i : T_{i}(k) = 1}^{n_{k}} y_{i} - \frac{1}{n_{k'}} \sum_{i : T_{i}(k') = 1}^{n_{k'}} y_{i}. \tag{3}$$
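A minimal sketch of the differences-in-means calculation follows. The data values and the function name `diff_in_means` are hypothetical, and arms are labeled 0, 1, 2 for convenience.

```python
def diff_in_means(y, Q, k, kp):
    """Difference in mean observed outcomes between arms k and k'."""
    y_k = [yi for yi, q in zip(y, Q) if q == k]
    y_kp = [yi for yi, q in zip(y, Q) if q == kp]
    return sum(y_k) / len(y_k) - sum(y_kp) / len(y_kp)

# Toy data: observed outcomes and arm labels (hypothetical values).
y = [3.0, 5.0, 2.0, 4.0, 9.0, 7.0]
Q = [0, 0, 1, 1, 2, 2]
print(diff_in_means(y, Q, 0, 1))  # (3 + 5)/2 - (2 + 4)/2 = 1.0
```

Only the individuals in the two contrasted arms enter the calculation; outcomes from the remaining arm are ignored for this contrast.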

To show that this estimator is unbiased, we use the law of iterated expectations to demonstrate key conditioning arguments that are needed for more complex derivations later (although the result can be established more directly). First, we calculate the expectation of $\hat{\beta}(k, k')$ with respect to the distribution, $R$, of all possible randomizations to groups $k$ or $k'$, conditional on the $(n_{k} + n_{k'})$ individuals assigned to the two groups and their fixed potential outcomes. Second, we average over random draws of $(n_{k} + n_{k'})$ individuals from the population, $I$, of $n$ individuals in the study. Mathematically, this approach can be expressed as

$$E_{RI}(\hat{\beta}(k, k')) = E_{I}\left(E_{R}\left(\hat{\beta}(k, k') \mid \{Y_{i}(k), Y_{i}(k') : T_{i}(k) + T_{i}(k') = 1\}\right)\right). \tag{4}$$

We know from Imbens and Rubin (2015) and Schochet (2010, 2015/2016) that the simple differences-in-means estimator for the two-group design is unbiased for the FP model (see Supplemental Material in the online version of the journal). Thus, the interior conditional expectation in Equation 4 equals $({\bar{Y}}^{*} (k) - {\bar{Y}}^{*} (k^{'}))$ , where ${\bar{Y}}^{*} (k)$ and ${\bar{Y}}^{*} (k^{'})$ are mean potential outcomes for those in the estimation sample. Thus, Equation 4 reduces to

$$E_{I}(\bar{Y}^{*}(k) - \bar{Y}^{*}(k')) = \frac{1}{n_{k} + n_{k'}} \sum_{i = 1}^{n} E_{I}[Z_{i}(k, k') (\bar{Y}^{*}(k) - \bar{Y}^{*}(k'))] = \bar{Y}(k) - \bar{Y}(k'), \tag{5}$$

where $Z_{i}(k, k')$ equals 1 for individuals randomized to group $k$ or $k'$ and 0 otherwise. The last equality holds because (a) $Z_{i}(k, k')$ is independent of $\bar{Y}^{*}(k)$ and $\bar{Y}^{*}(k')$ due to randomization, (b) $E_{I}(Z_{i}(k, k')) = (n_{k} + n_{k'}) / n$, and (c) $E_{I}(\bar{Y}^{*}(k) - \bar{Y}^{*}(k')) = \bar{Y}(k) - \bar{Y}(k')$. This proves that $\hat{\beta}(k, k')$ is unbiased.
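One way to see the final equality directly, offered here as a sketch rather than as the article's own derivation, is to write the estimation-sample mean difference in terms of the individual-level potential outcomes, which are fixed under the FP model:

$$\bar{Y}^{*}(k) - \bar{Y}^{*}(k') = \frac{1}{n_{k} + n_{k'}} \sum_{i = 1}^{n} Z_{i}(k, k') \, [Y_{i}(k) - Y_{i}(k')].$$

Taking expectations over $I$ and using $E_{I}(Z_{i}(k, k')) = (n_{k} + n_{k'}) / n$ then gives

$$E_{I}(\bar{Y}^{*}(k) - \bar{Y}^{*}(k')) = \frac{1}{n_{k} + n_{k'}} \sum_{i = 1}^{n} [Y_{i}(k) - Y_{i}(k')] \, \frac{n_{k} + n_{k'}}{n} = \frac{1}{n} \sum_{i = 1}^{n} [Y_{i}(k) - Y_{i}(k')] = \bar{Y}(k) - \bar{Y}(k').$$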

We can use a similar conditioning approach to calculate the variance of $\hat{β} (k, k^{'})$ using the law of total variance where, to simplify notation, we define the conditioning set in Equation 4 as G:

$$\mathrm{Var}_{RI}(\hat{\beta}(k, k')) = E_{I}(\mathrm{Var}_{R}(\hat{\beta}(k, k') \mid G)) + \mathrm{Var}_{I}(E_{R}(\hat{\beta}(k, k') \mid G)). \tag{6}$$

Using variance results for the FP model for the two-group design (see, e.g. Schochet, 2010, 2015/2016, and Supplemental Material in the online version of the journal), we have that

$$E_{I}(\mathrm{Var}_{R}(\hat{\beta}(k, k') \mid G)) = \frac{\sigma_{I}^{2}(k)}{n_{k}} + \frac{\sigma_{I}^{2}(k')}{n_{k'}} - \frac{\sigma_{\tau I}^{2}(k, k')}{n_{k} + n_{k'}}, \tag{7}$$

where

$$\sigma_{I}^{2}(k) = \frac{1}{n - 1} \sum_{i = 1}^{n} (Y_{i}(k) - \bar{Y}(k))^{2} \quad \text{and} \quad \sigma_{I}^{2}(k') = \frac{1}{n - 1} \sum_{i = 1}^{n} (Y_{i}(k') - \bar{Y}(k'))^{2}$$

are variances of $Y_{i} (k)$ and $Y_{i} (k^{'})$ across the entire randomized sample, and

$$\sigma_{\tau I}^{2}(k, k') = \frac{1}{n - 1} \sum_{i = 1}^{n} ([Y_{i}(k) - \bar{Y}(k)] - [Y_{i}(k') - \bar{Y}(k')])^{2}$$

is the variance of individual-level treatment effects for the contrasted groups. We hereafter refer to the final term in Equation 7 as the “FP heterogeneity term.”

Similarly, because the differences-in-means estimator is unbiased for the FP model, it follows that

$$\mathrm{Var}_{I}(E_{R}(\hat{\beta}(k, k') \mid G)) = \mathrm{Var}_{I}(\bar{Y}^{*}(k) - \bar{Y}^{*}(k')) = \frac{\mathrm{Var}_{I}(Y_{i}^{*}(k) - Y_{i}^{*}(k'))}{n_{k} + n_{k'}} = (1 - f) \frac{\sigma_{\tau I}^{2}(k, k')}{n_{k} + n_{k'}}, \tag{8}$$

where $(1 - f)$ is the finite population correction (FPC) with $f = (n_{k} + n_{k^{'}}) / n .$ Intuitively, the FPC accounts for the sampling of individuals (and their associated treatment effects) from the full randomized sample. Finally, collecting terms in Equations 7 and 8, we find that the variance in Equation 6 is

$$\mathrm{Var}_{RI}(\hat{\beta}(k, k')) = \frac{\sigma_{I}^{2}(k)}{n_{k}} + \frac{\sigma_{I}^{2}(k')}{n_{k'}} - \frac{\sigma_{\tau I}^{2}(k, k')}{n}. \tag{9}$$

The critical difference then between the variance expression for the multi-armed RCT and the two-group RCT is that the FP heterogeneity term contains the divisor $n$ rather than $(n_{k} + n_{k'})$ (Dasgupta, Pillai, and Rubin, 2015, find a similar result for factorial effects for $2^{k}$ factorial designs). Thus, assuming treatment effect heterogeneity, the variance increases as we add more research groups due to the FP heterogeneity term, but the increases will typically be small. Stated differently, the variance increases as the size of the inference population outside the estimation sample becomes larger; in the extreme case, the FP heterogeneity term disappears under the SP model where the inference population is assumed to be infinite. It is interesting that ignoring the FP heterogeneity term for the FP model will yield conservative variance estimators that are the same for the two-group and multi-armed designs.
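The role of the divisor can be checked numerically on a tiny example. The sketch below uses made-up potential outcomes for six individuals and three arms, enumerates every ordering of the arm labels to obtain the exact randomization variance of the differences-in-means estimator, and compares it with the variance expression using divisor $n$ and with the same expression using divisor $(n_{k} + n_{k'})$. All values and names are illustrative assumptions.

```python
from itertools import permutations

# Fixed potential outcomes for n = 6 individuals and K = 3 arms (made-up numbers).
Y = [[1.0, 2.0, 0.0], [2.0, 2.5, 1.0], [0.0, 3.0, 2.0],
     [4.0, 3.5, 1.0], [3.0, 6.0, 0.0], [5.0, 4.0, 3.0]]
n, k, kp = len(Y), 0, 1
labels = [0, 0, 1, 1, 2, 2]  # two individuals per arm

def s2(values):
    """Variance with divisor (len - 1), matching the sigma^2 definitions."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def estimate(Q):
    """Differences-in-means estimate for arms k and k' under assignment Q."""
    a = [Y[i][k] for i in range(n) if Q[i] == k]
    b = [Y[i][kp] for i in range(n) if Q[i] == kp]
    return sum(a) / len(a) - sum(b) / len(b)

# Exact randomization variance: every ordering of the labels is equally likely.
ests = [estimate(Q) for Q in permutations(labels)]
mean_est = sum(ests) / len(ests)
exact_var = sum((e - mean_est) ** 2 for e in ests) / len(ests)

s2_k = s2([row[k] for row in Y])
s2_kp = s2([row[kp] for row in Y])
s2_tau = s2([row[k] - row[kp] for row in Y])  # variance of individual effects
n_k = n_kp = 2
multi_arm = s2_k / n_k + s2_kp / n_kp - s2_tau / n             # divisor n
two_group = s2_k / n_k + s2_kp / n_kp - s2_tau / (n_k + n_kp)  # divisor n_k + n_k'

print(exact_var, multi_arm, two_group)
```

In this example the exact variance agrees with the divisor-$n$ expression, while the two-group divisor understates it whenever treatment effects vary across individuals.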

Unbiased estimates for $\sigma_{I}^{2}(k)$ and $\sigma_{I}^{2}(k')$ can be obtained using the sample variances, $s_{I}^{2}(k) = \frac{1}{(n_{k} - 1)} \sum_{i : T_{i}(k) = 1}^{n_{k}} (y_{i} - \bar{y}(k))^{2}$ and $s_{I}^{2}(k') = \frac{1}{(n_{k'} - 1)} \sum_{i : T_{i}(k') = 1}^{n_{k'}} (y_{i} - \bar{y}(k'))^{2}$.

The FP heterogeneity term, $\sigma_{\tau I}^{2}(k, k')$, is not identifiable because it is not possible to observe an individual in both research conditions. However, we can bound this term using the Cauchy–Schwarz inequality by noting that $\sigma_{\tau I}^{2}(k, k') \geq (\sigma_{I}(k) - \sigma_{I}(k'))^{2}$, which yields the following conservative variance estimator:

$$\widehat{\mathrm{Var}}_{RI}(\hat{\beta}(k, k')) = \frac{s_{I}^{2}(k)}{n_{k}} + \frac{s_{I}^{2}(k')}{n_{k'}} - \frac{(s_{I}(k) - s_{I}(k'))^{2}}{n}. \tag{10}$$
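As a brief sketch of why the bound holds, let $\sigma_{I}(k, k')$ (notation introduced here only for illustration) denote the covariance of $Y_{i}(k)$ and $Y_{i}(k')$ across the randomized sample, with the same $(n - 1)$ divisor. Expanding the square in the definition of the heterogeneity term and applying $|\sigma_{I}(k, k')| \leq \sigma_{I}(k) \sigma_{I}(k')$ gives

$$\sigma_{\tau I}^{2}(k, k') = \sigma_{I}^{2}(k) + \sigma_{I}^{2}(k') - 2 \sigma_{I}(k, k') \geq \sigma_{I}^{2}(k) + \sigma_{I}^{2}(k') - 2 \sigma_{I}(k) \sigma_{I}(k') = (\sigma_{I}(k) - \sigma_{I}(k'))^{2}.$$

Replacing the unidentified term with this lower bound subtracts no more than the true heterogeneity term would, which is why the resulting variance estimator is conservative.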

Aronow, Green, and Lee (2014) discuss methods to obtain sharper bounds on the FP heterogeneity term by approximating the marginal distributions of potential outcomes.¹
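The conservative variance estimator above is straightforward to compute from the observed outcomes in the two contrasted arms and the full randomized sample size. The following is a minimal sketch with hypothetical data; the function name `conservative_var` is illustrative.

```python
import math
from statistics import mean, variance  # variance uses the (n - 1) divisor

def conservative_var(y_k, y_kp, n_total):
    """Conservative FP variance estimate; n_total is the full randomized sample size n."""
    s2_k, s2_kp = variance(y_k), variance(y_kp)
    # Cauchy-Schwarz lower bound on the FP heterogeneity term.
    bound = (math.sqrt(s2_k) - math.sqrt(s2_kp)) ** 2
    return s2_k / len(y_k) + s2_kp / len(y_kp) - bound / n_total

y_k = [1.0, 3.0, 5.0, 7.0]   # hypothetical outcomes in arm k
y_kp = [2.0, 2.0, 4.0, 4.0]  # hypothetical outcomes in arm k'
v = conservative_var(y_k, y_kp, n_total=12)
print(mean(y_k) - mean(y_kp), v)
```

Passing `n_total` equal to the size of the two contrasted arms combined would reproduce the two-group divisor instead.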

The ATE estimator, $\hat{β} (k, k^{'})$ , can be shown to be asymptotically normal using results in Imbens and Rubin (2015), Schochet (2015/2016; lemmas 5.1 and 5.2), and Li and Ding (2017); see Supplemental Material in the online version of the journal. Thus, hypothesis testing can be conducted using z tests (multiple comparisons adjustments could be applied). Alternatively, results in Hansen (2007) and Bell and McCaffrey (2002) as well as our simulation evidence below suggest that using t tests with $(n_{k} + n_{k^{'}} - 2)$ degrees of freedom has better small sample properties than the z tests (Satterthwaite corrections could also be applied).
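A z test based on the normal approximation can be sketched with the Python standard library alone. The inputs below are hypothetical. A t test would instead compare the same statistic with a t distribution with $(n_{k} + n_{k'} - 2)$ degrees of freedom, which requires a t quantile or tail probability from a statistics package (for example, `scipy.stats.t`); that step is not shown here.

```python
import math
from statistics import NormalDist

def z_test(estimate, var_estimate):
    """Two-sided z test of a zero ATE using the large-sample normal approximation."""
    z = estimate / math.sqrt(var_estimate)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = z_test(estimate=1.0, var_estimate=0.25)  # hypothetical inputs
print(z, p)
```

Any multiple comparison adjustment would be applied to the resulting p values across the contrasts of interest.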

We conducted simulations to provide practical guidance on the choice of z tests or t tests for hypothesis testing. For the simulations, we generated potential outcomes for a small three-armed RCT with 40 individuals, where 12 to 28 individuals were assigned to the two groups being contrasted (see Table 1). We generated potential outcomes using the normal and lognormal distributions (to allow for some skewness) assuming zero ATEs but allowing for some treatment effect heterogeneity. We assessed Type 1 errors across the 5,000 replications and compared estimated variances using Equation 10 to the true ones (see the footnote to Table 1 for simulation details).

Table 1.

Simulation Results for t Tests and z Tests for Designs With Small Sample Sizes

Sample Sizes for Contrast: Group 1, Group 2	Type 1 Error for t Test	Type 1 Error for z Test	Average of Estimated Variances	True Variance
Normal distribution for potential outcomes
8, 4	.060	.089	.366	.366
10, 6	.056	.076	.256	.257
12, 8	.047	.062	.206	.188
14, 10	.052	.064	.160	.157
16, 12	.042	.052	.140	.132
Lognormal distribution for potential outcomes
8, 4	.065	.102	.359	.350
10, 6	.065	.085	.253	.252
12, 8	.059	.074	.195	.189
14, 10	.056	.068	.161	.152
16, 12	.053	.066	.137	.133

Note. The simulations were conducted by creating a single data set with 40 individuals and randomly assigning individuals to Groups 1, 2, or 3 in each of the 5,000 replications. The figures focus on the Group 1–2 contrast. For the first panel, potential outcomes for Group 1, $Y_{1}$, were generated as independent and identically distributed (iid) Normal(0,1) random variables, and potential outcomes for Group 2 were generated using $Y_{2} = \alpha Y_{1} + e$, where $\alpha = .88$ and $e$ were iid Normal(0,.432) random variables so that $\mathrm{Var}(Y_{2}) = 1.2$ and $\mathrm{Corr}(Y_{1}, Y_{2}) = .8$, which allows for some heterogeneity of effects. For the second panel, lognormal deviates were generated using exponentials of normal deviates to yield $Y_{1}$ and $Y_{2}$ values with the same variances and correlations as for the first panel. Type 1 errors are the proportions of test statistics that are statistically significant at the 5% level. The true variance is the variance of the estimated impacts across the 5,000 replications.

The simulation results indicate that with very small sample sizes, the t tests are preferred because they yield Type 1 errors that are closer to the 5% nominal level than the z tests that yield inflated values (Table 1). We also find that the estimated variances match actual ones, even using the skewed lognormal distribution. We find very similar results assuming different levels of treatment effect heterogeneity across the sample (not shown).
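The structure of such a simulation can be sketched as follows. This is not the authors' code: it covers only the normal-outcome panel and the z test, uses fewer replications for speed, reads .432 in the Table 1 note as a variance, and shifts the Group 2 potential outcomes so that the finite-population ATE is exactly zero. The seed and variable names are arbitrary.

```python
import math
import random
from statistics import NormalDist, mean, pvariance, variance

random.seed(2018)
n, n1, n2, reps = 40, 8, 4, 2000  # reps reduced from the article's 5,000
alpha, var_e = 0.88, 0.432        # values from the Table 1 note (.432 read as a variance)

# One fixed data set of potential outcomes.
Y1 = [random.gauss(0.0, 1.0) for _ in range(n)]
Y2_raw = [alpha * a + random.gauss(0.0, math.sqrt(var_e)) for a in Y1]
shift = mean(Y1) - mean(Y2_raw)   # assumption: force a zero finite-population ATE
Y2 = [b + shift for b in Y2_raw]

crit = NormalDist().inv_cdf(0.975)
ests, var_ests, rejections = [], [], 0
for _ in range(reps):
    idx = random.sample(range(n), n)     # random permutation of individuals
    g1, g2 = idx[:n1], idx[n1:n1 + n2]   # remaining individuals form Group 3
    y1, y2 = [Y1[i] for i in g1], [Y2[i] for i in g2]
    est = mean(y1) - mean(y2)
    s1, s2 = variance(y1), variance(y2)
    v = s1 / n1 + s2 / n2 - (math.sqrt(s1) - math.sqrt(s2)) ** 2 / n
    ests.append(est)
    var_ests.append(v)
    if abs(est) / math.sqrt(v) > crit:   # z test at the 5% level
        rejections += 1

# Average estimated variance, empirical variance of estimates, rejection rate.
print(mean(var_ests), pvariance(ests), rejections / reps)
```

The three printed quantities correspond to the "Average of Estimated Variances," "True Variance," and z test Type 1 error columns, although the values will differ from Table 1 because the data set and seed differ.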

For models that include weights, $w_{i}$ , to adjust for data nonresponse, the results in Schochet (2015/2016; lemma 5.6) for the two-group design and the conditioning arguments from above establish that a consistent variance estimator for the weighted differences-in-means estimator, ${\hat{β}}_{W} (k, k^{'})$ , is as follows:

$$\widehat{\mathrm{AsyVar}}_{RI}(\hat{\beta}_{W}(k, k')) = \frac{s_{IW}^{2}(k)}{n_{k}} + \frac{s_{IW}^{2}(k')}{n_{k'}} - \frac{[s_{IW}(k) - s_{IW}(k')]^{2}}{n}, \tag{11}$$

where

$s_{I W}^{2} (k) = \frac{1}{{\bar{w}}_{k}^{2} (n_{k} - 1)} \sum_{i : T_{i} (k) = 1}^{n_{k}} w_{i}^{2} {(y_{i} - {\bar{y}}_{W} (k))}^{2}$ and $s_{I W}^{2} (k^{'}) = \frac{1}{{\bar{w}}_{k^{'}}^{2} (n_{k^{'}} - 1)} \sum_{i : T_{i} (k^{'}) = 1}^{n_{k^{'}}} w_{i}^{2} {(y_{i} - {\bar{y}}_{W} (k^{'}))}^{2}$

are weighted sample variances,

${\bar{w}}_{k} = \frac{1}{n_{k}} \sum_{i : T_{i} (k) = 1}^{n_{k}} w_{i}$ and ${\bar{w}}_{k^{'}} = \frac{1}{n_{k^{'}}} \sum_{i : T_{i} (k^{'}) = 1}^{n_{k^{'}}} w_{i}$

are average weights for the two groups, and the FPC term uses $f = (n_{k} + n_{k^{'}}) / n$ .² Note that because nonresponse weights are measured with error (e.g., from propensity score modeling used to generate the weights), Equation 11 ignores added variance terms due to the estimation error in the weights. Note also that in the FP model, the nonresponse weights are applied so that data respondents are representative of the full randomized sample of respondents and nonrespondents and not of a broader population.
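The weighted variance formulas can be sketched in code as follows. The sketch assumes that the weighted mean $\bar{y}_{W}(k)$ is the usual weight-normalized mean of outcomes in arm $k$; the function names are illustrative, and the added variance from estimating the weights is ignored, as in the text.

```python
import math

def weighted_mean(y, w):
    """Weight-normalized mean of outcomes (assumed form of ybar_W)."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def weighted_s2(y, w):
    """Weighted sample variance: sum of w_i^2 (y_i - ybar_W)^2 over wbar^2 (n_k - 1)."""
    n_k = len(y)
    w_bar = sum(w) / n_k
    y_bar_w = weighted_mean(y, w)
    ss = sum(wi ** 2 * (yi - y_bar_w) ** 2 for wi, yi in zip(w, y))
    return ss / (w_bar ** 2 * (n_k - 1))

def weighted_var(y_k, w_k, y_kp, w_kp, n_total):
    """Weighted analogue of the conservative variance estimator; n_total is n."""
    s2_k, s2_kp = weighted_s2(y_k, w_k), weighted_s2(y_kp, w_kp)
    bound = (math.sqrt(s2_k) - math.sqrt(s2_kp)) ** 2
    return s2_k / len(y_k) + s2_kp / len(y_kp) - bound / n_total

# With equal weights, the weighted variance reduces to the usual sample variance.
print(weighted_s2([1.0, 3.0, 5.0, 7.0], [2.0, 2.0, 2.0, 2.0]))
```

Because the weights enter only through ratios to the average weight, rescaling all weights in a group by a constant leaves the result unchanged.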

SP Model

Model-based ATE estimation methods, such as ordinary least squares (OLS) and HLM, typically assume an SP framework with an infinite SP. For this framework, the parameter of interest for each pairwise contrast is the mean effect in the infinite SP, $β_{SP} (k, k^{'}) = E_{I} (Y_{i} (k) - Y_{i} (k^{'}))$ .

In this SP setting, design-based ATE estimators for the multi-armed design are identical to those for the two-group design (see, e.g. Schochet, 2010, 2015/2016). The simple differences-in-means estimator for each contrast is unbiased and asymptotically normal, with the same variance estimator as in Equation 10 except that the FP heterogeneity term disappears because n is infinite. These results hold regardless of the number of research groups. Intuitively, each research group is a random sample from the infinite SP, so the mean outcomes across research groups are uncorrelated (which differs from the FP framework).

The finding that ATE estimators under the SP model are the same for the two-group and multi-armed RCT extends to blocked and clustered designs. This occurs because the SP framework for these designs assumes random sampling of study blocks, clusters, and/or individuals from infinite populations so the FP heterogeneity terms do not apply (see Pashley & Miratrix, 2017; Schochet, 2015/2016, for a discussion of different types of SP models based on the various stages of sampling that are assumed to be random or fixed). This same finding regarding the absence of the FP heterogeneity terms for the SP model also applies to models that include baseline covariates. Thus, we primarily focus on the FP model for the rest of this article.

Note that a framework in between the considered FP and SP models is a design where the study sample is randomly selected from a broader universe that is finite. In this case, the variance estimators for the FP model apply except that the FP heterogeneity term is divided by the number of individuals in the broader universe rather than the size of the study sample.
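The three inference frameworks differ only in the divisor applied to the heterogeneity term, which the following sketch makes explicit. The function name, the `population_size` argument, and the input values are hypothetical; `None` is used here to stand for an infinite super-population.

```python
import math

def heterogeneity_adjusted_var(s2_k, s2_kp, n_k, n_kp, population_size=None):
    """Variance estimate with the heterogeneity bound divided by population_size.

    population_size=None stands for an infinite super-population, where the
    heterogeneity term drops out.
    """
    base = s2_k / n_k + s2_kp / n_kp
    if population_size is None:
        return base
    bound = (math.sqrt(s2_k) - math.sqrt(s2_kp)) ** 2
    return base - bound / population_size

args = dict(s2_k=4.0, s2_kp=1.0, n_k=20, n_kp=20)  # hypothetical inputs
v_fp = heterogeneity_adjusted_var(**args, population_size=60)     # FP: n = 60 randomized
v_mid = heterogeneity_adjusted_var(**args, population_size=6000)  # finite broader universe
v_sp = heterogeneity_adjusted_var(**args)                         # SP: infinite population
print(v_fp, v_mid, v_sp)
```

As the inference population grows, the estimate rises toward the SP value, consistent with the discussion above.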

Models With Baseline Covariates

Researchers analyzing RCT data often include baseline covariates in the estimation models to increase precision and to adjust for random imbalances between the research groups. For multi-armed designs, separate regression models can be estimated for each pairwise contrast. In this case, the statistical properties of the multiple regression estimator for the two-group design apply to each pairwise model. The only difference is that the FP heterogeneity term for the variance estimator contains the divisor n rather than $(n_{k} + n_{k^{'}})$ , which will typically have only a very small effect on the variance estimators and test statistics.

To demonstrate these results more formally for the FP model, we can, following Freedman (2008), first rearrange Equation 2 assuming a two-group design using the estimation sample assigned to conditions $k$ and $k'$ only to obtain the following regression model:

$$y_{i} = \alpha^{*} + \beta^{*}(T_{i} - p_{k}^{*}) + u_{i}, \tag{12}$$

where

$$\begin{aligned} \alpha^{*} &= p_{k}^{*} \bar{Y}(k) + (1 - p_{k}^{*}) \bar{Y}(k'), \\ \beta^{*} &= \bar{Y}^{*}(k) - \bar{Y}^{*}(k'), \\ u_{i} &= \lambda_{i} + \tau_{i}(T_{i} - p_{k}^{*}), \\ \lambda_{i} &= p_{k}^{*}(Y_{i}(k) - \bar{Y}(k)) + (1 - p_{k}^{*})(Y_{i}(k') - \bar{Y}(k')), \\ \tau_{i} &= (Y_{i}(k) - \bar{Y}(k)) - (Y_{i}(k') - \bar{Y}(k')), \end{aligned}$$

where $T_{i}$ equals 1 if the individual is assigned to condition $k$ and 0 if assigned to condition $k'$, where we drop the $(k, k')$ subscripts on the model parameters for simplicity, and $p_{k}^{*} = n_{k} / (n_{k} + n_{k'})$ is the assignment rate to group $k$. In this model, the “error” term, $u_{i}$, is random solely because of $T_{i}$. The framework is nonparametric because it makes no assumptions about the distribution of potential outcomes and allows treatment effects to differ across individuals. The model does not satisfy key assumptions of the usual regression model because $u_{i}$ does not have mean zero over the randomization distribution, $R$, and, to the extent that the $\tau_{i}$ impacts vary across individuals, $u_{i}$ is heteroscedastic and is correlated across subjects and with the regressor $T_{i} - p_{k}^{*}$:

$$\begin{aligned} & E_{R}(u_{i}) = \lambda_{i}, \quad \mathrm{Var}_{R}(u_{i}) = \tau_{i}^{2} p_{k}^{*}(1 - p_{k}^{*}), \quad \mathrm{Cov}_{R}(u_{i}, u_{i'}) = - \tau_{i} \tau_{i'} p_{k}^{*}(1 - p_{k}^{*}) / (n^{*} - 1), \\ & E_{R}[(T_{i} - p_{k}^{*}) u_{i}] = \tau_{i} p_{k}^{*}(1 - p_{k}^{*}). \end{aligned}$$

Nonetheless, as shown in Freedman (2008) and Schochet (2010) for the two-group design, OLS estimation of Equation 12 yields a differences-in-means estimator, $\hat{β}$ , with the exact same statistical properties as those discussed earlier (see Supplemental Material in the online version of the journal).
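The numerical equivalence between the OLS slope on the centered treatment indicator and the simple difference in means can be verified on a toy example. The data and the function name `ols_slope_centered` below are hypothetical.

```python
def ols_slope_centered(y, T):
    """OLS slope from regressing y on an intercept and the centered indicator T - p."""
    p = sum(T) / len(T)  # assignment rate to arm k within the estimation sample
    t_c = [t - p for t in T]
    y_bar = sum(y) / len(y)
    sxy = sum(tc * (yi - y_bar) for tc, yi in zip(t_c, y))
    sxx = sum(tc ** 2 for tc in t_c)
    return sxy / sxx

y = [3.0, 5.0, 4.0, 2.0, 1.0]  # hypothetical outcomes in the estimation sample
T = [1, 1, 1, 0, 0]            # 1 = arm k, 0 = arm k'
slope = ols_slope_centered(y, T)
diff = (3.0 + 5.0 + 4.0) / 3 - (2.0 + 1.0) / 2
print(slope, diff)
```

Centering the indicator changes the intercept but not the slope, so the point estimate matches the differences-in-means estimator exactly; only the variance estimator needs the design-based treatment discussed in the text.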

Consider next an OLS regression of $y_{i}$ on $z_{ci}$, where $z_{ci} = (1, T_{ci}, x_{ci})$ is a row vector and $\delta' = (\alpha_{MR}^{*}, \beta_{MR}^{*}, \gamma)$ are associated parameters, where $T_{ci} = T_{i} - p_{k}^{*}$ is the centered treatment status indicator and $x_{ci} = x_{i} - \bar{x}$ are fixed, centered covariates (the centering does not change the impact estimates but facilitates the asymptotic results below). In the design-based framework, the covariates are “irrelevant” variables in the sense that they do not enter the true model in Equation 12. We do not need to assume that the conditional expectation of the outcomes is linear in the covariates.

Freedman (2008), Schochet (2010, 2015/2016), and Lin (2013) show that the standard variance estimator from the OLS regression needs to be adjusted to fully align with the structure underlying the two-group RCT design. As shown in Theorem 1 of Supplemental Material in the online version of the journal, the multiple regression estimator for the two-group design, ${\hat{β}}_{MR}^{*}$ , is asymptotically normal with asymptotic mean $β (k, k^{'})$ and the following asymptotic variance:

\mathrm{AsyVar}_{R} ({\hat{β}}_{MR}^{*}) = \frac{E_{FP} [{(Y_{i} (k) - \bar{Y} (k) - x_{c i} γ)}^{2}]}{n^{*} p_{k}^{*}} + \frac{E_{FP} [{(Y_{i} (k^{'}) - \bar{Y} (k^{'}) - x_{c i} γ)}^{2}]}{n^{*} (1 - p_{k}^{*})} - \frac{E_{FP} [τ_{i}^{2}]}{n^{*}},

where $n^{*} = (n_{k} + n_{k^{'}})$ , $γ = E_{F P} {({x^{'}}_{c i} x_{c i})}^{- 1} E_{F P} ({x^{'}}_{c i} λ_{i})$ , and $E_{FP} (.)$ are limits of the following moment vectors and positive definite matrices that are assumed to contain fixed numbers:

\begin{array}{l} \frac{\sum_{i = 1}^{n^{*}} {(Y_{i} (k) - \bar{Y} (k))}^{2}}{(n^{*} - 1)}, \frac{\sum_{i = 1}^{n^{*}} {(Y_{i} (k^{'}) - \bar{Y} (k^{'}))}^{2}}{(n^{*} - 1)}, \frac{\sum_{i = 1}^{n^{*}} τ_{i}^{2}}{(n^{*} - 1)}, \\ \frac{\sum_{i = 1}^{n^{*}} {x^{'}}_{c i} x_{c i}}{(n^{*} - 1)}, \frac{\sum_{i = 1}^{n^{*}} {x^{'}}_{c i} (Y_{i} (k) - \bar{Y} (k))}{(n^{*} - 1)}, \frac{\sum_{i = 1}^{n^{*}} {x^{'}}_{c i} (Y_{i} (k^{'}) - \bar{Y} (k^{'}))}{(n^{*} - 1)} . \end{array}

We can now apply the same conditioning arguments as above to Equation 13 to move from the two-group design to the multi-armed design to obtain the following conservative variance estimator for a pairwise contrast based on estimated model residuals:

\widehat{\mathrm{AsyVar}}_{RI} ({\hat{β}}_{MR}^{*}) = \frac{\widehat{\mathrm{MSE}} (k)}{n_{k}} + \frac{\widehat{\mathrm{MSE}} (k^{'})}{n_{k^{'}}} - \frac{{[\sqrt{\widehat{\mathrm{MSE}} (k)} - \sqrt{\widehat{\mathrm{MSE}} (k^{'})}]}^{2}}{n},

where

\widehat{\mathrm{MSE}} (k) = \frac{1}{n_{k} - v p_{k}^{*} - 1} \sum_{i : T_{i} (k) = 1}^{n_{k}} {(y_{i} - {\hat{α}}_{MR}^{*} - {\hat{β}}_{MR}^{*} (1 - p_{k}^{*}) - x_{c i} \hat{γ})}^{2} \quad \text{and}

\widehat{\mathrm{MSE}} (k^{'}) = \frac{1}{n_{k^{'}} - v (1 - p_{k}^{*}) - 1} \sum_{i : T_{i} (k^{'}) = 1}^{n_{k^{'}}} {(y_{i} - {\hat{α}}_{MR}^{*} + {\hat{β}}_{MR}^{*} p_{k}^{*} - x_{c i} \hat{γ})}^{2} .

In this expression, the $\widehat{\mathrm{MSE}} (.)$ terms are regression mean square errors for the two research groups, respectively; $v$ is the number of baseline covariates (assumed to be the same across pairwise contrasts), whose degrees of freedom are split proportionately across the contrasted research groups; and ${\hat{α}}_{MR}^{*}$, ${\hat{β}}_{MR}^{*}$, and $\hat{γ}$ are parameter estimates. The upper bound estimator in Equation 14 is based on the Cauchy–Schwarz inequality for the FP heterogeneity term:

E_{FP} (τ_{i}^{2}) \geq {\left( \sqrt{E_{FP} [{(Y_{i} (k) - \bar{Y} (k))}^{2}]} - \sqrt{E_{FP} [{(Y_{i} (k^{'}) - \bar{Y} (k^{'}))}^{2}]} \right)}^{2} .

Note that the same variance estimator results (i.e., the $\widehat{\mathrm{MSE}} (.)$ terms are identical) if the regression model includes the noncentered treatment status indicator and covariates instead of the centered ones. Note also that Equation 14 reduces to Equation 10 for the FP model without covariates. Hypothesis tests can be conducted using z tests or t tests with $(n_{k} + n_{k^{'}} - v - 2)$ degrees of freedom.
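The regression-adjusted variance estimator in Equation 14 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the article or in RCT-YES: the function name `regression_variance`, the simulated data, and the arm sizes are hypothetical, and the covariates are assumed to be centered on the means of the pairwise sample. The argument `n_full` plays the role of $n$ (all randomized individuals); setting it to $n_{k} + n_{k^{'}}$ gives the two-group version.

```python
import numpy as np

def regression_variance(y, t, x, n_full):
    """Impact estimate and upper-bound variance for one pairwise contrast.

    y: outcomes for the two contrasted groups; t: 1 for group k, 0 for k';
    x: baseline covariates (n x v); n_full: divisor of the heterogeneity term.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    n_k, n_kp = int(t.sum()), int((1 - t).sum())
    p_star = n_k / (n_k + n_kp)
    v = x.shape[1]

    # OLS on an intercept, the centered indicator, and centered covariates.
    z = np.column_stack([np.ones_like(y), t - p_star, x - x.mean(axis=0)])
    coef, *_ = np.linalg.lstsq(z, y, rcond=None)
    resid = y - z @ coef

    # Group-specific mean square errors; covariate degrees of freedom are
    # split in proportion to the group shares.
    mse_k = (resid[t == 1] ** 2).sum() / (n_k - v * p_star - 1)
    mse_kp = (resid[t == 0] ** 2).sum() / (n_kp - v * (1 - p_star) - 1)

    var = (mse_k / n_k + mse_kp / n_kp
           - (np.sqrt(mse_k) - np.sqrt(mse_kp)) ** 2 / n_full)
    return coef[1], var

# Hypothetical three-arm trial: 30 and 20 units in the contrasted arms, 25 in a third.
rng = np.random.default_rng(1)
n_k, n_kp, n_other = 30, 20, 25
t = np.r_[np.ones(n_k), np.zeros(n_kp)]
x = rng.normal(size=n_k + n_kp)
y = 0.3 * t + 0.8 * x + rng.normal(scale=np.where(t == 1, 1.5, 0.7))

beta_multi, var_multi = regression_variance(y, t, x, n_full=n_k + n_kp + n_other)
beta_two, var_two = regression_variance(y, t, x, n_full=n_k + n_kp)
print(beta_multi == beta_two, var_multi > var_two)
```

The point estimate is unchanged across the two calls; only the heterogeneity term shrinks when the divisor grows from $n_{k} + n_{k^{'}}$ to $n$, so the multi-armed variance is weakly larger.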

Extensions to Blocked Designs

Blocked designs occur when random assignment is conducted separately within distinct study subpopulations (such as school districts, grades, or cohorts over time). Consider a blocked design with K research groups in each block. Let the subscript “b” indicate blocks $(b = 1, 2, ..., h),$ and let $S_{i b}$ be a block indicator variable that equals 1 if individual i is in block b and 0 for individuals in different blocks. Assignment rates to the research groups (i.e., the $p_{k b}$ probabilities) could differ both within and across blocks. For example, for a design with two blocks and three research groups, the proportions assigned to the research groups could be 1/3, 1/3, and 1/3 in Block 1 and 1/2, 1/4, and 1/4 in Block 2.

In the multi-armed FP context where interest lies in identifying the most effective treatments among those tested, the ATE parameter for block b is

β_{b} (k, k^{'}) = {\bar{Y}}_{b} (k) - {\bar{Y}}_{b} (k^{'}) = \frac{1}{n_{b}} \sum_{i : S_{i b} = 1}^{n_{b}} (Y_{i b} (k) - Y_{i b} (k^{'})),

which is calculated over the full randomized sample of $n_{b}$ individuals in the block. The ATE parameter across all blocks can then be expressed as

β_{blocked} (k, k^{'}) = \frac{\sum_{b = 1}^{h} w_{b} β_{b} (k, k^{'})}{\sum_{b = 1}^{h} w_{b}},

which is a weighted average of the block-specific ATEs with weights $w_{b}$ . The FP weighting scheme should fit with the study’s research questions. For example, blocks could be weighted proportional to their sample sizes if interest lies in the ATE parameter for the average individual, or equally if interest lies in the ATE parameter for the average block (i.e., if blocks are sites). For SP models (not considered here), precision weighting (based on inverses of variances) could be used to estimate SP ATE parameters.

Importantly, $w_{b}$ in Equation 16 pertains to all randomized individuals in the block, not just to those assigned to the contrasted groups, so that a particular block is given the same weight for all pairwise analyses. In our setting, the full randomized group is the relevant universe under the Neyman–Rubin–Holland FP framework because any resampling strategy for variance estimation would involve re-randomization of all study individuals in the block. Thus, for instance, if blocks are weighted proportional to their total sample size, we would set $w_{b} = n_{b}$ rather than $w_{b} = (n_{k b} + n_{k^{'} b})$ , as would be the case for the two-group design. Ignoring this scaling of the weights in the multi-armed FP context can lead to biased ATE estimators for a pairwise contrast if (a) impacts vary across blocks, (b) weights vary across blocks, and (c) the following holds for any block:

\frac{(w_{k b} + w_{k^{'} b})}{\sum_{b = 1}^{h} (w_{k b} + w_{k^{'} b})} \neq \frac{w_{b}}{\sum_{b = 1}^{h} w_{b}},

where $w_{k b}$ and $w_{k^{'} b}$ are block-level weights for the contrasted groups. This third condition means that biases can result if a block’s share of the total sample of individuals in conditions $k$ and $k^{'}$ differs from the block’s share of the total sample across all research groups. A necessary condition for Equation 17 is that assignment probabilities to the research groups differ across at least two blocks.

As an example, consider a design with two blocks and three research groups, where Block 1 has 20, 20, and 10 individuals in the three research groups, and the corresponding figures for Block 2 are 20, 10, and 20. If we weight blocks proportional to their sample sizes, the correct weight for Block 1 under the multi-armed trial for any pairwise comparison is $w_{1} = .5$ (50/100), and the correct weight for Block 2 is $w_{2} = .5$. If we instead assume a two-group design, the weights for comparing Groups 1 and 2 would change to $w_{1} = 40 / 70$ and $w_{2} = 30 / 70$, yielding a biased estimator for the parameter in Equation 16 if impacts are heterogeneous across blocks. Note, however, that the two-group design would yield unbiased estimators for comparing Groups 2 and 3 because $w_{1}$ and $w_{2}$ would remain .5 in this case (30/60; i.e., Equation 17 does not hold).
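The arithmetic in this example can be checked with a short Python sketch. The block sizes are those given above; the function names and the block-level ATEs of 0.2 and 0 are hypothetical values chosen only to show how the two weighting schemes can lead to different targets.

```python
from fractions import Fraction

# Block sizes by research group (Groups 1, 2, 3), as in the example above.
block_sizes = {1: (20, 20, 10), 2: (20, 10, 20)}

def multi_armed_weights(sizes):
    """Weights proportional to the full randomized sample n_b in each block."""
    total = sum(sum(groups) for groups in sizes.values())
    return {b: Fraction(sum(groups), total) for b, groups in sizes.items()}

def two_group_weights(sizes, k, k_prime):
    """Weights proportional to the sizes of the two contrasted groups only."""
    pair = {b: groups[k - 1] + groups[k_prime - 1] for b, groups in sizes.items()}
    total = sum(pair.values())
    return {b: Fraction(n, total) for b, n in pair.items()}

def weighted_ate(block_ates, weights):
    """Weighted average of block-level ATEs (weights already sum to one)."""
    return sum(weights[b] * block_ates[b] for b in weights)

w_multi = multi_armed_weights(block_sizes)    # 1/2 and 1/2
w_12 = two_group_weights(block_sizes, 1, 2)   # 4/7 and 3/7
w_23 = two_group_weights(block_sizes, 2, 3)   # 1/2 and 1/2

# Hypothetical block-level ATEs (not from the article).
block_ates = {1: Fraction(1, 5), 2: Fraction(0)}
target = weighted_ate(block_ates, w_multi)        # 1/10
two_group_value = weighted_ate(block_ates, w_12)  # 4/35
print(w_multi, w_12, w_23, target, two_group_value)
```

With these hypothetical impacts, two-group weighting for the Group 1 versus 2 contrast targets 4/35 rather than 1/10, whereas the Group 2 versus 3 contrast is unaffected because both schemes give each block a weight of one half.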

ATE estimators for the parameter in Equation 16 for the multi-armed design are the same as for the two-group design. The only differences are that the block-specific weights for each pairwise contrast pertain to the full randomized sample, and the FP heterogeneity terms in the variance estimators have divisors $n_{b}$ rather than $(n_{k b} + n_{k^{'} b})$. For example, the simple differences-in-means estimator for a particular pairwise contrast is

{\hat{β}}_{blocked} (k, k^{'}) = \frac{\sum_{b = 1}^{h} w_{b} ({\bar{y}}_{b} (k) - {\bar{y}}_{b} (k^{'}))}{\sum_{b = 1}^{h} w_{b}},

where ${\bar{y}}_{b} (k)$ and ${\bar{y}}_{b} (k^{'})$ are block-specific mean outcomes. This estimator is unbiased and asymptotically normal with the following variance:

\mathrm{Var}_{RI} ({\hat{β}}_{blocked} (k, k^{'})) = \frac{\sum_{b = 1}^{h} w_{b}^{2} \mathrm{Var}_{RI} ({\hat{β}}_{b} (k, k^{'}))}{(\sum_{b = 1}^{h} w_{b})^{2}},

which can be estimated by applying Equation 10 or Equation 11 separately for each block. Note that in the FP framework, the block-level weights in Equation 19 are assumed to be fixed for the study, but randomness could result if the weights incorporate adjustments for data nonresponse, in which case, the variance results are conditional on the weights.
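A minimal Python sketch of the blocked differences-in-means estimator and its variance follows. It is an illustration under stated assumptions rather than the article's implementation: the function names and the toy data are hypothetical, blocks are weighted by $w_{b} = n_{b}$, and the within-block variance is assumed to take the no-covariate form to which Equation 14 reduces (sample variances in place of the mean square errors), with the FP heterogeneity term divided by $n_{b}$.

```python
import numpy as np

def block_contrast(y_k, y_kp, n_b):
    """Difference in means and assumed upper-bound variance within one block."""
    y_k = np.asarray(y_k, dtype=float)
    y_kp = np.asarray(y_kp, dtype=float)
    s_k, s_kp = y_k.std(ddof=1), y_kp.std(ddof=1)
    impact = y_k.mean() - y_kp.mean()
    # Heterogeneity term uses the full block size n_b, not n_kb + n_k'b.
    var = s_k**2 / len(y_k) + s_kp**2 / len(y_kp) - (s_k - s_kp) ** 2 / n_b
    return impact, var

def blocked_estimate(blocks):
    """Pool block contrasts with weights w_b = n_b (the full block sample)."""
    w = np.array([b["n_b"] for b in blocks], dtype=float)
    results = [block_contrast(b["y_k"], b["y_kp"], b["n_b"]) for b in blocks]
    impacts = np.array([r[0] for r in results])
    variances = np.array([r[1] for r in results])
    est = (w * impacts).sum() / w.sum()
    var = (w**2 * variances).sum() / w.sum() ** 2
    return est, var

# Toy data (hypothetical): each block also contains a third arm, so n_b
# exceeds the number of observations in the two contrasted groups.
blocks = [
    {"y_k": [1, 2, 3], "y_kp": [0, 1, 2], "n_b": 9},
    {"y_k": [2, 4, 6], "y_kp": [1, 2, 3], "n_b": 12},
]
est, var = blocked_estimate(blocks)
print(est, var)
```

For these toy data the block impacts are 1 and 2, so the pooled estimate is (9 × 1 + 12 × 2)/21 = 11/7; weighting by the contrasted-group sizes (6 and 6) would instead give 1.5.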

Similarly, regression estimators for blocked designs for the simple treatment-control design apply to the multi-armed context with the same modifications to the weights and FP heterogeneity term as above (see Schochet, 2015/2016). The same weighting issues also apply to SP estimators for random blocked designs, where the blocks are considered to be randomly sampled from a broader block population.

ATE Estimators for Clustered Designs

The above theory extends directly to clustered designs where groups (such as schools or classrooms) rather than individuals (such as students) are randomly assigned to the research groups and where outcome data are collected on individuals (such as students). In this section, we show that the same issues apply to clustered designs as nonclustered designs as we move from the two-group to multi-armed context. Our focus is on the FP model because ATE estimators for clustered designs under the SP model do not change in the multi-armed setting. Because parallel issues exist for clustered and nonclustered designs, we provide less detail in this section than before.

For the analysis, we use similar notation as for the nonclustered design with the addition of the subscript “j” to indicate clusters. For instance, for the nonblocked design, $Y_{i j} (k)$ is the potential outcome for individual i in cluster j in research condition $k$; $y_{i j}$ is the observed outcome; $T_{j} (k)$ is the research status indicator variable that equals 1 if cluster j is randomly assigned to group $k$ and 0 otherwise; and $Q_{j} = k$ for clusters assigned to group $k$. We assume that the sample contains m clusters with $m_{k} = m p_{k}$ clusters assigned to group $k$, where $p_{k}$ is the research group sampling rate. It is assumed that cluster j has $n_{j}$ individuals.

Similar to the nonclustered design, we rely on several assumptions for the clustered design: (a) SUTVA, which states that for any two random assignment vectors $Q$ and $Q^{'}$, if $Q_{j} = {Q^{'}}_{j}$ for cluster j, then $Y_{i j} (Q) = Y_{i j} (Q^{'})$ for all individuals in cluster j, that there are not different forms of the same treatment, and that potential outcomes for a given treatment do not depend on the number or nature of the other treatments; (b) randomization, defined as the independence between cluster-level research assignment status and potential outcomes, $Q_{j} ⫫ (Y_{i j} (1), Y_{i j} (2), ..., Y_{i j} (K))$; (c) positive assignment probabilities for each cluster to each research group; and (d) finite first and second moments for the potential outcomes over an increasing sequence of finite populations. Note that under clustered designs, SUTVA allows the outcomes of individuals within the same cluster to be correlated, which is reflected in the cluster-level means.

In the multi-armed setting for clustered designs where the goal of the study is to identify the most promising treatments among those tested, the ATE parameter for comparing interventions $k$ and $k^{'}$ is as follows:

β_{clus} (k, k^{'}) = \frac{\sum_{j = 1}^{m} w_{j} ({\bar{Y}}_{j} (k) - {\bar{Y}}_{j} (k^{'}))}{\sum_{j = 1}^{m} w_{j}},

where $w_{j}$ is the cluster weight (e.g., $w_{j} = 1$ to answer questions about the average cluster in the sample or $w_{j} = n_{j}$ to answer questions about the average individual in the sample);

{\bar{Y}}_{j} (k) = \frac{1}{n_{j}} \sum_{i = 1}^{n_{j}} Y_{i j} (k) and {\bar{Y}}_{j} (k^{'}) = \frac{1}{n_{j}} \sum_{i = 1}^{n_{j}} Y_{i j} (k^{'})

are mean, cluster-level potential outcomes in conditions $k$ and $k^{'}$; and the clus subscript signifies clustered designs. This ATE parameter is a weighted average of treatment contrasts across all m clusters in the sample, not just the $(m_{k} + m_{k^{'}})$ clusters randomized to the two research groups. Note that Equation 20 can also be expressed as a weighted average of individual-level treatment effects.

To develop estimators for the ATE parameter in Equation 20, note first that for clustered designs, a simple version of the design-based approach is to average the individual data to the cluster level (although as discussed later, versions exist that instead use individual-level data). In this setting, the data-generating process for the observed mean outcome for cluster j can be expressed as

{\bar{y}}_{j} = \sum_{k = 1}^{K} T_{j} (k) {\bar{Y}}_{j} (k),

where

{\bar{y}}_{j} = \frac{1}{n_{j}} \sum_{i = 1}^{n_{j}} y_{i j} .

Consider the weighted simple differences-in-means estimator using the aggregated data:

\begin{array}{l} {\hat{β}}_{clus} (k, k^{'}) = {\bar{\bar{y}}}_{W} (k) - {\bar{\bar{y}}}_{W} (k^{'}) \\ = \frac{\sum_{j : T_{j} (k) = 1}^{m_{k}} w_{j} {\bar{y}}_{j}}{\sum_{j : T_{j} (k) = 1}^{m_{k}} w_{j}} - \frac{\sum_{j : T_{j} (k^{'}) = 1}^{m_{k^{'}}} w_{j} {\bar{y}}_{j}}{\sum_{j : T_{j} (k^{'}) = 1}^{m_{k^{'}}} w_{j}} = \frac{\sum_{j = 1}^{m} w_{j} T_{j} (k) {\bar{Y}}_{j} (k)}{\sum_{j = 1}^{m} w_{j} T_{j} (k)} - \frac{\sum_{j = 1}^{m} w_{j} T_{j} (k^{'}) {\bar{Y}}_{j} (k^{'})}{\sum_{j = 1}^{m} w_{j} T_{j} (k^{'})}, \end{array}

where the third equality holds using Equation 21. To examine the statistical properties of this estimator in the multi-armed context, we build on the properties of this estimator for the two-group design.

For the two-group design, the estimator in Equation 22 is biased for the parameter in Equation 20 in finite samples if the weights differ across clusters and cluster-level ATEs are heterogeneous (the general case we consider). This is because the denominators in Equation 22 depend on the particular allocation of clusters to the two research groups (i.e., the weights become random variables). However, as shown in the Supplemental Material in the online version of the journal (which builds on Schochet, 2013, 2015/2016), as m gets large, ${\hat{β}}_{clus} (k, k^{'})$ is a consistent estimator of the following asymptotic ATE parameter:

\frac{E_{FP} [w_{j} ({\bar{Y}}_{j} (k) - {\bar{Y}}_{j} (k^{'}))]}{E_{FP} (w_{j})},

where $E_{FP} (.)$ signifies expectations (assumed to be fixed, nonnegative real numbers) over an increasing sequence of finite populations. Furthermore, conditional on the weights, ${\hat{β}}_{clus} (k, k^{'})$ is asymptotically normal with variance:

\mathrm{AsyVar}_{RI} ({\hat{β}}_{clus} (k, k^{'})) = \frac{1}{{[E_{FP} (w_{j})]}^{2}} \left[ \frac{{\bar{S}}_{W}^{2} (k)}{m_{k}} + \frac{{\bar{S}}_{W}^{2} (k^{'})}{m_{k^{'}}} - \frac{{\bar{S}}_{τ W}^{2} (k, k^{'})}{m_{k} + m_{k^{'}}} \right],

where

{\bar{S}}_{W}^{2} (k) = \lim_{m \to \infty} \frac{1}{(m - 1)} \sum_{j = 1}^{m} w_{j}^{2} {({\bar{Y}}_{j} (k) - {\bar{\bar{Y}}}_{W} (k))}^{2}, \quad {\bar{S}}_{W}^{2} (k^{'}) = \lim_{m \to \infty} \frac{1}{(m - 1)} \sum_{j = 1}^{m} w_{j}^{2} {({\bar{Y}}_{j} (k^{'}) - {\bar{\bar{Y}}}_{W} (k^{'}))}^{2}, \quad \text{and}

{\bar{S}}_{τ W}^{2} (k, k^{'}) = \lim_{m \to \infty} \frac{1}{(m - 1)} \sum_{j = 1}^{m} w_{j}^{2} {\left[ ({\bar{Y}}_{j} (k) - {\bar{\bar{Y}}}_{W} (k)) - ({\bar{Y}}_{j} (k^{'}) - {\bar{\bar{Y}}}_{W} (k^{'})) \right]}^{2}

is the FP heterogeneity term (see Supplemental Material in the online version of the journal). Accordingly, using the Cauchy–Schwarz inequality as for the nonclustered design, a consistent (upper-bound) variance estimator for Equation 24 is

\widehat{\mathrm{AsyVar}}_{RI} ({\hat{β}}_{clus} (k, k^{'})) = \frac{s_{W}^{2} (k)}{m_{k}} + \frac{s_{W}^{2} (k^{'})}{m_{k^{'}}} - \frac{{(s_{W} (k) - s_{W} (k^{'}))}^{2}}{(m_{k} + m_{k^{'}})},

where

s_{W}^{2} (k) = \frac{1}{(m_{k} - 1) {\bar{w}}_{k}^{2}} \sum_{j : T_{j} (k) = 1}^{m_{k}} w_{j}^{2} {({\bar{y}}_{j} - {\bar{\bar{y}}}_{W} (k))}^{2}, s_{W}^{2} (k^{'}) = \frac{1}{(m_{k^{'}} - 1) {\bar{w}}_{k^{'}}^{2}} \sum_{j : T_{j} (k^{'}) = 1}^{m_{k^{'}}} w_{j}^{2} {({\bar{y}}_{j} - {\bar{\bar{y}}}_{W} (k^{'}))}^{2},

{\bar{w}}_{k} = \frac{1}{m_{k}} \sum_{j : T_{j} (k) = 1}^{m_{k}} w_{j}, and {\bar{w}}_{k^{'}} = \frac{1}{m_{k^{'}}} \sum_{j : T_{j} (k^{'}) = 1}^{m_{k^{'}}} w_{j} .

We can now extend these results to the multi-armed design using the same conditioning arguments as discussed in the previous section for the nonclustered design. First, we can show that the estimator in Equation 22 is a consistent estimator for the ATE parameter in Equation 23 using the law of iterated expectations, where, for arbitrarily large m, we first calculate expectations with respect to the randomization distribution, R, conditional on the $(m_{k} + m_{k^{'}})$ clusters assigned to the two research groups and their fixed potential outcomes, and then average over random draws of $(m_{k} + m_{k^{'}})$ clusters from the population of m clusters. Thus, the only small difference from the two-group design is that the increasing sequence of finite populations pertains to the entire randomized sample, not just to the clusters randomized to the two groups being compared.

Similarly, for large m, we can use the law of total variance to examine the asymptotic variance of ${\hat{β}}_{clus} (k, k^{'})$ in the multi-armed context. Following the same arguments as for the nonclustered design, we find similar variance expressions as in Equations 24 and 25 above for the two-group design. The only difference is that the denominators in the FP heterogeneity terms contain m rather than $(m_{k} + m_{k^{'}})$ .
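The cluster-level estimator in Equation 22 and the variance estimator in Equation 25, with the FP heterogeneity term divided by m as described above, can be sketched in Python as follows. The function names, cluster means, and cluster sizes are hypothetical and serve only as an illustration; passing $m_{k} + m_{k^{'}}$ as the last argument gives the two-group version.

```python
import numpy as np

def weighted_mean(ybar, w):
    """Weighted mean of cluster-level mean outcomes."""
    return float((w * ybar).sum() / w.sum())

def s2_w(ybar, w):
    """Weighted between-cluster variance s_W^2 for one research group."""
    m_g = len(ybar)
    dev = ybar - weighted_mean(ybar, w)
    return float((w**2 * dev**2).sum() / ((m_g - 1) * w.mean() ** 2))

def clustered_contrast(ybar_k, w_k, ybar_kp, w_kp, m):
    """Weighted difference in means and its upper-bound variance.

    m is the divisor of the heterogeneity term: all randomized clusters in
    the multi-armed case, or m_k + m_k' for the two-group design.
    """
    ybar_k, w_k = np.asarray(ybar_k, float), np.asarray(w_k, float)
    ybar_kp, w_kp = np.asarray(ybar_kp, float), np.asarray(w_kp, float)
    est = weighted_mean(ybar_k, w_k) - weighted_mean(ybar_kp, w_kp)
    s2_k, s2_kp = s2_w(ybar_k, w_k), s2_w(ybar_kp, w_kp)
    var = (s2_k / len(ybar_k) + s2_kp / len(ybar_kp)
           - (np.sqrt(s2_k) - np.sqrt(s2_kp)) ** 2 / m)
    return est, float(var)

# Hypothetical cluster means and sizes (weights w_j = n_j); a third arm
# with 5 clusters brings the total to m = 12.
ybar_k, w_k = [0.1, 0.3, 0.5], [20, 30, 50]
ybar_kp, w_kp = [0.0, 0.2, 0.1, 0.3], [25, 25, 25, 25]

est, var_multi = clustered_contrast(ybar_k, w_k, ybar_kp, w_kp, m=12)
_, var_two = clustered_contrast(ybar_k, w_k, ybar_kp, w_kp, m=7)
print(est, var_multi, var_two)
```

When all weights in a group are equal, `s2_w` reduces to the ordinary sample variance of the cluster means, which is a convenient check on the scaling by ${\bar{w}}_{k}^{2}$.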

Similar adjustments apply to variance expressions for regression-adjusted estimators, where separate weighted regression models with baseline covariates can be estimated for each pairwise contrast. The Supplemental Material in the online version of the journal (Theorem 2) examines the statistical properties of regression estimators that use individual-level data (for both outcomes and covariates) rather than cluster-level averages, because this approach improves precision by allowing covariates to explain both within- and between-cluster variation in the outcome. The variance formulas for these regression estimators for the two-group design still apply in the multi-armed context except that the FP heterogeneity terms are divided by m rather than $(m_{k} + m_{k^{'}})$ . Specifically, a consistent variance estimator for a pairwise contrast is as follows, where we use parallel notation as for the earlier discussion on regression adjustment for the nonclustered design:

\widehat{\mathrm{AsyVar}}_{R} ({\hat{β}}_{clus, MR}^{*}) = \frac{\widehat{\mathrm{MSE}} (k)}{m_{k}} + \frac{\widehat{\mathrm{MSE}} (k^{'})}{m_{k^{'}}} - \frac{{[\sqrt{\widehat{\mathrm{MSE}} (k)} - \sqrt{\widehat{\mathrm{MSE}} (k^{'})}]}^{2}}{m},

where

\widehat{\mathrm{MSE}} (k) = \frac{1}{(m_{k} - v {\hat{p}}_{k}^{*} - 1) {\bar{w}}_{k}^{2}} \sum_{j : T_{j} (k) = 1}^{m_{k}} w_{j}^{2} {({\bar{y}}_{j} - {\hat{α}}_{clus, MR}^{*} - {\hat{β}}_{clus, MR}^{*} (1 - {\hat{p}}_{k}^{*}) - {\bar{x}}_{c j} {\hat{γ}}_{clus})}^{2} \quad \text{and}

\widehat{\mathrm{MSE}} (k^{'}) = \frac{1}{(m_{k^{'}} - v (1 - {\hat{p}}_{k}^{*}) - 1) {\bar{w}}_{k^{'}}^{2}} \sum_{j : T_{j} (k^{'}) = 1}^{m_{k^{'}}} w_{j}^{2} {({\bar{y}}_{j} - {\hat{α}}_{clus, MR}^{*} + {\hat{β}}_{clus, MR}^{*} {\hat{p}}_{k}^{*} - {\bar{x}}_{c j} {\hat{γ}}_{clus})}^{2}

are mean square errors from the individual-level regression model averaged to the cluster level; ${\hat{p}}_{k}^{*} = \sum_{j = 1}^{m^{*}} T_{j} w_{j} / \sum_{j = 1}^{m^{*}} w_{j}$ is the weighted proportion of clusters assigned to group $k$; $m^{*} = m_{k} + m_{k^{'}}$; and other terms are defined as above. Note that the covariates will only affect the ATE estimates and increase precision if mean covariate values vary across clusters.

Finally, no new issues arise for blocked, clustered designs. The formulas in Schochet (2015/2016; chapter 8) still apply in the multi-armed context, except that the block-specific weights for each pairwise contrast now pertain to the full randomized sample of clusters, and the FP heterogeneity terms in the variance estimators are now divided by $m_{b}$ (the total number of clusters in the block) rather than $(m_{k b} + m_{k^{'} b})$ (the number of clusters in the block randomized to the two contrasted groups).

Empirical Application

To demonstrate the ATE estimators discussed above, we use data from a multi-armed RCT that tested the effectiveness of several supplemental reading interventions for fifth graders (James-Burdumy et al., 2009). The study used a blocked, clustered design where schools were randomized within school districts. Table 2 summarizes the data for our analysis, including the study samples, research groups, and the composite test score outcome used by the evaluation. Our analysis uses the evaluation’s control group (Group 1) and two treatment groups created from the original four: (a) Schools offered the Reading for Knowledge curriculum (Group 2) and (b) schools offered any of the other three reading curricula (Group 3) that we combine because the original evaluation found similar impacts for them (and to minimize the reporting of results). Our goal is not to mimic the original study findings or to provide policy conclusions but to demonstrate several key features of ATE estimation in the multi-armed setting.

Table 2.

Summary of Randomized Controlled Trial Data for the Empirical Analysis

Description of Study	Sample and Research Groups for the Current Analysis	Outcome
Study, funded by the Institute of Education Sciences, examined the impacts of four reading comprehension curricula for a first cohort of fifth graders. The tested curricula were Project CRISS, ReadAbout, Read for Real, and Reading for Knowledge and were selected based on public submissions and ratings by an expert review panel. Schools were randomly assigned to one of the four intervention groups or to a control group using the status quo curriculum	Fifth-grade students in the 2006–2007 school year in 89 schools in 10 districts. The first treatment group for our analysis includes schools offering the Reading for Knowledge curriculum, the second treatment group includes schools offering the other three reading curricula, and the control group includes the original control group for the evaluation. Districts with fewer than two schools in each research group were combined into a single block (to facilitate variance estimation), yielding seven blocks for the analysis	Composite Z score from the passage comprehension subtest of the Group Reading Assessment and Diagnostic Evaluation and the Science and Social Studies Reading Comprehension Assessments

Table 3 displays impact results under the FP model for the three possible pairwise contrasts comparing Groups 1, 2, and 3. The results are presented for the design-based estimators discussed in this article and for those assuming the two-group design that ignore the adjustments for the multi-armed trial. The impact estimates and standard errors could differ for the two scenarios because of differences in how the block-level weights and FP heterogeneity terms are calculated. The estimates were obtained using the free RCT-YES software version 1.2 (www.rct-yes.com), with clusters weighted by their student sample sizes. To increase precision, all models included the following baseline covariates: school urban/rural status, teacher and student race/ethnicity indicators, student pretest scores on the Group Reading Assessment and Diagnostic Evaluation and Science and Social Studies tests, and an indicator of limited English proficiency. For illustrative purposes, we focus on unadjusted p values, although in practice, a more rigorous approach would be to adjust the p values for multiple testing across the three pairwise contrasts.

Table 3.

Impact Findings on Composite Z Scores for the Empirical Analysis

Pairwise Contrast	First Group Mean	Second Group Mean	Difference (Impact Estimate)	Effect Size	Standard Error of Difference	p Value of Difference
Scenario 1: Design-based estimator for the multi-armed design
Read for Knowledge versus Control (Group 2 vs. 1)	−.053	.032	−.084	−.09	.044	.078
Other Tested Curricula versus Control (Group 3 vs. 1)	−.018	.024	−.041	−.05	.027	.135
Read for Knowledge versus Other Tested Curricula (Group 2 vs. 3)	−.064	−.024	−.040	−.05	.041	.334
Scenario 2: Design-based estimator assuming the two-group design
Read for Knowledge versus Control (Group 2 vs. 1)	−.049	.021	−.070	−.08	.042	.125
Other Tested Curricula versus Control (Group 3 vs. 1)	−.023	.021	−.043	−.05	.027	.113
Read for Knowledge versus Other Tested Curricula (Group 2 vs. 3)	−.058	−.016	−.042	−.04	.041	.315

Note. The sample includes 18 schools in the Read for Knowledge group (1,073 students), 21 control schools (1,183 students), and 50 schools (3,348 students) in the Other Tested Curricula group. The impact estimates are calculated using regression models that control for baseline covariates. The means for the second research group are sample means, and the means for the first research group are calculated by summing the means for the second group and the impact estimates. The effect size is the impact estimate divided by the standard deviation of the outcome for individuals in the control group.

* Difference is statistically significant at the .05 level, two-tailed test.

The results suggest that the tested reading curricula lowered test scores relative to the status quo curricula, although none of the impacts are statistically significant at the 5% level. However, there is some evidence using the multi-armed estimator that the Read for Knowledge curriculum performed worse than the control condition if we adopt a 10% significance standard (p value of .078). This evidence, however, is weaker using the two-group estimator (p value of .125).

Table 4 provides key reasons for the differences in p values for the Group 2 versus 1 contrast for the two designs. First, we find some differences in block-level weights across the designs because assignment rates to the research groups differed across some blocks (Blocks 4, 6, and 7) and also because of random differences in average school sizes across the research groups. This means that the estimators under the two-group and multi-armed design differ because the estimated impacts vary considerably across blocks (they range from −.534 to .277 as shown in Table 4). Second, the pooled standard error is only slightly larger for the multi-armed design (.044 vs. .042 from Table 3) because even though the FP heterogeneity terms are about half as large for the multi-armed trial, they typically comprise less than 5% of the total variance (see final column of Table 4). Putting these findings together, the main driver of the p value differences for the Group 2 versus 1 contrast is that Block 3, which has the largest negative impact across the sites, is weighted more heavily under the multi-armed design than the two-group design. This weighting difference leads to a more negative overall impact estimate under the multi-armed trial (−.084) than under the two-group design (−.070), with only a small difference in standard errors.

Table 4.

Block Impacts, Weights, and Standard Errors for the Read for Knowledge and Control Group (Group 2 vs. 1) Contrast

Block	Impact Estimate on Composite Z Scores	Block Weight, Multi-Armed Design	Block Weight, Two-Group Design	Number of Schools in Groups 1, 2, 3^a	Ratio of Standard Errors for the Two Designs^b
1	−.095	.131	.145	2, 2, 6	1.05
2	−.400	.075	.081	2, 2, 6	1.09
3	−.534	.155	.129	2, 2, 6	0.94
4	−.003	.329	.360	7, 4, 12	1.02
5	.056	.083	.075	2, 2, 6	1.29
6	.090	.136	.131	4, 3, 9	1.08
7	.277	.090	.090	2, 3, 5	1.05

^a Schools are weighted by their sample sizes, not equally. Thus, the block weights differ for the two designs because of both differences in the number of schools across the three research groups and different school sizes (not shown).

^b Differences in standard errors for the two designs are due to the combined differences in the FP heterogeneity terms and scaling of the weights in Equation 25.
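The role of the block weights in the Group 2 versus 1 contrast can be illustrated by pooling the rounded Table 4 entries directly. This is a back-of-the-envelope Python sketch, not the RCT-YES calculation: the variable names are illustrative, and because the inputs are rounded to three decimals (and the printed two-group weights do not sum exactly to one), the weighted averages are normalized by the sum of the weights and are not expected to reproduce every reported estimate to the third decimal.

```python
# Block impact estimates and block weights, copied from Table 4.
impacts = [-.095, -.400, -.534, -.003, .056, .090, .277]
w_multi = [.131, .075, .155, .329, .083, .136, .090]
w_two = [.145, .081, .129, .360, .075, .131, .090]

def pooled(block_impacts, weights):
    """Weighted average of block impacts, normalized by the sum of weights."""
    total = sum(w * b for w, b in zip(weights, block_impacts))
    return total / sum(weights)

pooled_multi = pooled(impacts, w_multi)
pooled_two = pooled(impacts, w_two)

# Contribution of the Block 3 weight difference alone (Block 3 is index 2).
block3_shift = (w_multi[2] - w_two[2]) * impacts[2]

print(round(pooled_multi, 3), round(pooled_two, 3), round(block3_shift, 3))
```

The multi-armed weights give a pooled impact close to the −.084 reported in Table 3, and the pooled impact under the two-group weights is less negative. The heavier weight on Block 3 under the multi-armed design accounts for a shift of roughly −.014, which is about the size of the gap between the two reported estimates.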

This analysis suggests that the adjustments needed to produce design-based estimators in the considered multi-armed context can matter in real-world RCT applications, although in this example, they did not produce different overall conclusions regarding the effects of the tested reading curricula.

Conclusions

This article developed design-based estimators for multi-armed impact evaluations for a wide range of designs used in education research, where interest lies in comparing impacts across the tested treatments to identify the most effective ones. Because the analysis in the multi-armed setting typically involves pairwise contrasts across the research groups, the key methodological question addressed in the article is: How do the estimators for the two-group design need to be adjusted for multi-armed trials? The critical insight is that in multi-armed trials using the FP framework, the samples for each pairwise contrast are representative of the full set of randomized units, not just of themselves. In essence, a “pure” FP model does not exist in the multi-armed setting. The implications are that (a) the FP heterogeneity terms need to be slightly adjusted in the variance estimators (that can slightly reduce precision) and (b) the weights need to be adjusted for blocked designs, so that blocks are weighted by the entire size of the block, not just the size of the two contrasted groups under investigation. Ignoring these adjustments could lead to biased estimators, especially the weighting corrections for blocked designs if impacts and assignment rates to the research groups differ across blocks.

As demonstrated by our empirical example using data from a clustered, blocked RCT, these adjustments can affect the estimated impacts and standard errors. Thus, researchers analyzing data from multi-armed trials should be cautious about applying methods for the simple treatment-control design for each pairwise comparison in turn. Instead, the adjusted estimators presented in this article are more grounded in the mechanisms underlying multi-armed experiments and are required to generate consistent estimators. In addition, although not the focus of this article, adjusting for multiple hypothesis testing across pairwise contrasts should be considered for multi-armed trials to control Type 1 error rates. The free RCT-YES software (www.rct-yes.com) can be used to estimate and report impacts using the methods discussed in this article, including multiple testing adjustments.

Supplemental Material

Supplemental Material, DS_10.3102_1076998618786968 for Design-Based Estimators for Average Treatment Effects for Multi-Armed RCTs by Peter Z. Schochet in Journal of Educational and Behavioral Statistics

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.
