Abstract
The present article provides a synthesis of the conceptual and statistical issues involved in using multisite randomized trials to learn about and from a distribution of heterogeneous program impacts across individuals and/or program sites. Learning about such a distribution involves estimating its mean value, detecting and quantifying its variation, and estimating site-specific impacts. Learning from such a distribution involves studying the factors that predict or explain impact variation. Part I of the article introduces the concepts and issues involved. Part II focuses on estimating the mean and variation of impacts of program assignment. Part III extends the discussion to variation in the impacts of program participation. Part IV considers how to use multisite trials to study moderators of program impacts (individual-level or site-level factors that influence these impacts) and mediators of program impacts (individual-level or site-level “mechanisms” that produce these impacts).
Using Multisite Randomized Trials to Learn About and From a Distribution of Program Impacts
To make a valid causal statement about the impacts of a new reading program, drop-out prevention program or job training initiative, measuring the gains made by program participants is not enough. Estimating how participants would have fared without the program is also necessary. This requires a valid comparison group of nonprogram participants who were similar in all predictors of the outcome (measured or unmeasured) to participants at the outset of the study. The aim is to compare future outcomes for program participants and comparison group members under the assumption that comparison group outcomes reflect what “would have happened” to program participants without the program.
To create a valid comparison group for testing new medical treatments, scientists embraced the randomized controlled trial (RCT) soon after World War II. This strategy, which had its origins in agricultural research during the early 20th century (Fisher, 1925), was to randomly assign persons to a treatment group or a nontreatment “control group” in order to create two statistically equivalent groups. Mean future control group outcomes then provide unbiased estimates of what mean future treatment group outcomes would have been without the treatment. Exploiting this singular strength, government agencies and other funders have recently sponsored many RCTs to evaluate social and educational programs, policies, and practices (Greenberg & Shroder, 2004; Spybrook, 2013). As a result, we are now learning much about the effectiveness of preschool education, charter schools, remedial math and reading interventions, after-school services, teacher professional development, career academies, job training programs, social service programs, criminal justice programs, and more.
Researchers and research funders met in two recent conferences to review the design and analysis of RCTs for social and educational program evaluation (Learning from Variation in Program Effects sponsored by the William T. Grant Foundation and the conference that inspired this forum). Participants at these conferences noted that past RCTs focused mainly on estimating the average impacts of new programs. Although an average impact is an essential inferential target of any RCT, participants reasoned that the average is not sufficient by itself for developing public policy, professional practice, or program theory when program impacts are heterogeneous. Participants thus agreed that better understanding of how and why program impacts vary is needed; in other words, we need to learn more about and from the distribution of program impacts.
Learning About a Distribution of Program Impacts
Learning about a distribution of program impacts involves estimating the mean value of this impact, quantifying its variation around the mean, assessing the equity of this variation, and studying site-specific impacts.
Estimating mean impacts
Although it is common practice to estimate mean program impacts, the conceptual and statistical issues involved in doing so for multisite trials with heterogeneous impacts are more subtle than is typically acknowledged. Specifically, if impacts can vary across individuals and sites, multiple possible definitions of the overall mean exist. For example, a researcher might want to know the mean impact for the population of program-eligible persons represented by the study sample or the researcher might want to know the mean impact for the population of program sites represented by the sample.
Quantifying impact variation
If persons or sites vary widely in their responses to a program, the overall average program impact is not useful for policy makers who might contemplate adopting the program or for practitioners who want to know how to improve it. Hence, knowing about the extent to which program impacts vary is essential for informing appropriate use of the RCT results.
Assessing impact equity
In multisite trials, the cross-site correlation between program impacts and control group mean outcomes can be studied. If sites that serve individuals who would do especially poorly without the new program produce above-average impacts, this suggests that the program will tend to reduce inequality. If program sites that serve individuals who would do especially well without the program produce above-average impacts, this suggests that the program will tend to increase inequality.
Studying site-specific impacts
We can also use multisite RCTs to produce site-specific estimates of program impacts. We can thus quantify the effectiveness of the most and least effective sites. Knowing how effective a program can be is as important as knowing how effective it is on average, especially if one learns from best practice at effective sites how to improve performance at ineffective sites.
Learning From a Distribution of Program Impacts
Having learned about a distribution of program impacts, we can learn much from it. Our point is that impact heterogeneity creates opportunities for testing theories about impact moderation and mediation.
Moderation of impacts
Program impacts vary because some types of persons are more likely than others to participate, because staff at some sites are more skilled than staff at other sites, or because existing services from outside of a program are more available and/or effective at some sites than at others. These factors are potential moderators of program impacts. More specifically, we define impact moderators as characteristics of clients or sites that (1) cannot be influenced by the program being tested and (2) facilitate or inhibit a program’s effectiveness.
To explore potential moderators, evaluators often conduct impact analyses for sample subgroups defined in terms of factors such as gender, ethnicity, social background, and risk of failure. It is less common to find an evaluation that is founded on an explicit moderation theory about who is likely to benefit the most or the least from the program being studied and what organizational conditions are most important for its success. Testing such theories may significantly increase the utility of evaluations for future program design and practice.
Mediation of impacts
Mediators (or mechanisms) of program impacts are those aspects of program implementation, staff practice, and short-term changes in participants’ knowledge, skills, attitudes, or behavior that are (1) outcomes of random assignment and (2) predictors of long-term success.
In theory, sites with larger-than-average effects on program mediators will produce greater-than-average impacts on participant outcomes. Thus, heterogeneity of a program’s effects on its mediators can explain heterogeneity of impacts on participants’ outcomes. Although most programs are founded on some theory about how program operations influence key mediators and produce long-term outcomes, few rigorous evaluations explicitly test these theories, and impact heterogeneity is largely unexplained.
We allow for the possibility that treatment assignment can moderate the effect of treatment mediators. For example, assignment to a new job training program might increase participants’ motivation to work, thereby mediating the program’s impact on employment. In addition, program assignment might change the effect of motivation on employment.
The Importance of Multisite Trials for Studying a Distribution of Program Impacts
We focus here on multisite trials in which sample members are randomly assigned to a program or a control group within each of a number of sites. Sometimes sites are comparatively few in number, like the Moving to Opportunity (MTO) experiment conducted in five major U.S. cities (Katz, Kling, & Liebman, 2000). Other times, RCTs have many sites, like the national Head Start Impact Study, which was conducted in 350 Head Start centers from across the U.S. (Bloom & Weiland, 2015).
Although the research questions addressed and the statistical methods used depend on the number of sites and participants per site, all multisite trials represent “a fleet of randomized experiments.” Hence, they are well suited for studying mean program impact and impact heterogeneity. Moreover, multisite trials are prevalent, if not ubiquitous. For example, Spybrook (2013) found that more than two thirds of the 175 RCTs conducted by The Institute of Education Sciences since 1994 are multisite trials.
The Present Article
This article summarizes issues that arise and available options to consider when using multisite trials to study a distribution of program impacts. We recommend analytic approaches for addressing the issues, and we also identify new methodological frontiers as targets for future research. We now turn to focus on using multisite trials to study a distribution of impacts of program assignment (impacts of “intent to treat” [ITT]). Then we extend this discussion to impacts of program participation (complier average causal effects [CACE]). Finally, we consider moderators and mediators of program effects.
Learning About a Distribution of ITT Impacts
To lay a conceptual and methodological foundation, we begin with an individual-level distribution of ITT impacts in a single site. We then discuss how to use multisite RCTs to study a distribution of ITT impacts across multiple sites.
Studying the Distribution of ITT Impacts Across Individuals in a Single-Site RCT
The present discussion adopts the “potential outcomes” framework for causal inference, which is used widely in applied statistics. 1 We set T = 1 if a sample member is randomized to a new program (or treatment) and T = 0 if a sample member is randomized to a control group. Each individual has two potential outcomes: Y(1) if the participant is assigned to the program and Y(0) if the participant is assigned to the control group. 2 The causal effect of program assignment for an individual (B) is the difference between his or her two potential outcomes:
$$B = Y(1) - Y(0) \tag{1}$$
It is not possible to calculate an ITT impact for an individual because we can observe only one of his or her two potential outcomes. We can observe Y(1) if the participant is assigned to the program group or Y(0) if the participant is assigned to the control group. Although we cannot estimate person-specific impacts, we can estimate the average ITT impact for the site population of individuals under a key assumption that a person’s potential outcomes do not predict treatment group assignment. Random assignment enables us to meet this assumption, so that, in an RCT, we can readily estimate a population average causal effect of program assignment or ITT (βITT):
$$\beta_{ITT} = E(B) = E[Y(1)] - E[Y(0)] \tag{2}$$
where E denotes an “expectation” or population average. In other words, βITT equals the difference between the average outcome if the entire population were assigned to the program (E[Y(1)]) and the average outcome if the entire population were assigned to the control group (E[Y(0)]). We can use data from persons assigned to the treatment group (T = 1) to estimate how the entire population would fare, on average, if it were assigned to the program, that is, E[Y(1)], because, in an RCT, persons assigned to the treatment group are statistically representative of the entire population of interest. Similarly, the RCT enables us to use the data from persons assigned to the control group (T = 0) to estimate how the entire population would fare, on average, if assigned to the control condition. To do so we require that assigning the entire population to one of the two groups would not change the potential outcomes of individuals. 3
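Although we can estimate a population average impact from an RCT under mild assumptions, we cannot readily estimate the variance of ITT effects across individuals. To see this, note that, based on Equation 1:
$$Y(1) = Y(0) + B \tag{3}$$
Hence, the variance of Y(1) is:
$$\mathrm{Var}[Y(1)] = \mathrm{Var}[Y(0)] + \mathrm{Var}(B) + 2\,\mathrm{Cov}[Y(0), B] \tag{4}$$
which implies that:
$$\mathrm{Var}[Y(1)] - \mathrm{Var}[Y(0)] = \mathrm{Var}(B) + 2\,\mathrm{Cov}[Y(0), B] \tag{5}$$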
where Cov[Y(0), B] is the individual-level covariance between control group outcomes and program impacts. Although we can estimate the two variances Var[Y(1)] and Var[Y(0)] from sample data, we cannot estimate Cov[Y(0), B] or Var(B) because we cannot observe both potential outcomes for individuals.
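The variance relationship above can be checked numerically. The following is a minimal sketch, assuming simulated potential outcomes in which impacts are larger for persons with low untreated outcomes; all numerical values are hypothetical and chosen only for illustration. In a real RCT only Var[Y(1)] and Var[Y(0)] could be estimated, because Y(0) and B are never observed together for the same person.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical potential outcomes (illustrative values, not from any study).
# Impacts B are negatively related to Y(0), i.e., the program is compensatory.
y0 = rng.normal(50.0, 10.0, size=n)
b = 5.0 - 0.3 * (y0 - 50.0) + rng.normal(0.0, 2.0, size=n)
y1 = y0 + b

var_y0 = y0.var()                          # Var[Y(0)]
var_y1 = y1.var()                          # Var[Y(1)]
var_b = b.var()                            # Var(B), unobservable in practice
cov_y0_b = np.cov(y0, b, ddof=0)[0, 1]     # Cov[Y(0), B], unobservable in practice

# Var[Y(1)] - Var[Y(0)] = Var(B) + 2 Cov[Y(0), B]
lhs = var_y1 - var_y0
rhs = var_b + 2.0 * cov_y0_b

print(f"Var[Y(1)] - Var[Y(0)]  = {lhs:.3f}")
print(f"Var(B) + 2 Cov[Y(0),B] = {rhs:.3f}")
```

In this simulated case the program group variance is smaller than the control group variance even though impacts vary, because the negative covariance outweighs Var(B).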
Further investigation (Bloom, Raudenbush, Weiss, & Porter, 2014; Bryk & Raudenbush, 1988) reveals that:
If a program group and a control group have different individual-level outcome variances, we can conclude that ITT impacts vary across individuals. 4
If a program group and a control group do not have different individual-level outcome variances, we cannot conclude that ITT impacts do not vary across individuals. 5
If the program group variance is smaller than the control group variance, we can conclude that the program produces larger-than-average ITT impacts for persons who would fare worse than average without the program (i.e., program effects are compensatory). 6
If the program group variance is larger than the control group variance, however, we cannot conclude that the program produces larger-than-average impacts for persons who would fare better than average without the program. 7
In summary, a single-site RCT provides full information about mean impact of program assignment at a single site and limited information about heterogeneity of this impact across individuals at that site.
Studying a Distribution of ITT Impacts Across Sites in a Multisite RCT
We now consider multisite analyses of the distribution of program impacts. We are interested in the mean of this distribution, the cross-site variation around the mean, and the cross-site correlation between program impacts and control group mean outcomes. In addition, we want to estimate site-specific impacts.
Consider first the case of a population mean ITT impact. Estimating mean impact is simple—if we assume impacts to be constant across sites. However, given the heterogeneity of organizational conditions and populations served across sites in many RCTs, the assumption of a constant impact seems implausible. In this case, defining and estimating a mean program impact can be tricky.
Defining a population mean impact
When program impacts vary across persons and/or sites, different ways to define a population mean impact exist. On one hand, we might want to generalize findings to a population of sites (e.g., we might want to know the mean of the mean Head Start impacts for all Head Start centers in the United States). Or, we might want to generalize findings to a population of persons (e.g., the mean Head Start impact for the national population of program-eligible children).
Statisticians often define a parameter of interest as a “target of inference” or “estimand.” Ideally, researchers should be explicit about their estimands before designing a study. For example, suppose that prior to designing a study, we have information about the number of sites (J*) in our population of interest, and we also have information about the number of eligible persons (Nj) in each site j, there being
$$N = \sum_{j=1}^{J^*} N_j \tag{6}$$
eligible persons in the population as a whole.
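The subpopulation mean ITT impact (Bj) for persons in site j is:
$$B_j = E[Y(1) - Y(0) \mid \text{site } j] \tag{7}$$
If we wish to generalize to a population of sites, we define our estimand as the simple mean of the site mean impacts (βsites), that is:
$$\beta_{sites} = \frac{1}{J^*} \sum_{j=1}^{J^*} B_j \tag{8}$$
If we wish to generalize to a population of persons, we define our estimand as the following person-weighted mean of the site mean program impacts (βpersons):
$$\beta_{persons} = \frac{\sum_{j=1}^{J^*} N_j B_j}{\sum_{j=1}^{J^*} N_j} \tag{9}$$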
If site-specific impacts are homogeneous, the site-average mean impact in Equation 8 will equal the person-average mean impact in Equation 9. Similarly, if all the sites have the same population size and the same fraction of persons assigned to treatment, the two estimates will also be equal. Otherwise, the estimands may differ from each other. For example, if programs in sites with large client populations are more effective than programs in sites with small client populations, the two population mean impacts will differ.
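A small numerical illustration may help fix the distinction between the two estimands. The sketch below assumes three hypothetical sites whose population sizes and true mean impacts are invented for illustration; the larger site is given the larger impact, so the person-weighted mean exceeds the site-weighted mean.

```python
import numpy as np

# Hypothetical site populations and true site mean impacts (illustrative only).
N = np.array([100, 200, 1000])        # eligible persons per site
B = np.array([0.10, 0.20, 0.50])      # true mean ITT impact per site

# Site-weighted estimand: every site counts equally.
beta_sites = B.mean()

# Person-weighted estimand: every eligible person counts equally.
beta_persons = np.sum(N * B) / np.sum(N)

# With equal site sizes the two definitions coincide.
N_equal = np.array([300, 300, 300])
beta_persons_equal = np.sum(N_equal * B) / np.sum(N_equal)

print(f"beta_sites   = {beta_sites:.4f}")
print(f"beta_persons = {beta_persons:.4f}")
print(f"beta_persons with equal site sizes = {beta_persons_equal:.4f}")
```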
Designing a multisite trial
The choice of an estimand can influence the optimal design of a study. To see how, assume for simplicity that the cost of sampling children within sites and collecting data on program members and control group members is constant and that the individual outcome variance in the treatment and control groups is the same.
If the estimand of interest is βsites, it is optimal to (1) draw a simple random sample of sites from the population of sites, (2) draw a simple random sample of n persons from each site, and (3) assign persons from each site with equal probability to the program group or control group. These conditions produce a perfectly balanced design with the same sample size n and the same fraction of persons assigned to the program in every site.
Unfortunately, evaluators can rarely implement a probability sample of sites or persons and usually must select a sample of convenience. However, they can conceive of their study sites as representing a larger population of similar sites that might use the program, and they typically want their findings to apply to persons who might benefit if the program is found to be effective. Even in this setting, one must take care when choosing an estimand. For example, if one wanted to generalize findings to a population of sites, βsites should be used. If, instead, one wanted to generalize findings to a population of persons, βpersons should be used. 9
Estimating a mean ITT impact
Having carefully defined an estimand of interest and designed a study accordingly, we determine how to estimate the desired mean impact. In so doing, we must confront the fact that site sample size (nj) and the fraction of sample members randomized to the treatment can vary across sites.
To see how these challenges play out in practice, we need additional notation. Paralleling our discussion of potential outcomes for an individual, we define U1j as the average outcome that would occur if the entire population of eligible persons in site j were assigned to the new program, and we define U0j as the average outcome that would occur if the entire population at site j were assigned to the program’s control group. The average impact of the new program at site j is thus Bj = U1j − U0j.
To keep this discussion simple, we confine our attention to the case where the unweighted “mean of site means” defined by βsites in Equation 8 is our estimand of interest. However, the logic of our inquiry would remain the same if we had focused on the estimand βpersons in Equation 9.
The “site fixed-effects” estimator
Perhaps the most common analytic strategy for estimating an average ITT effect for a multisite trial is the site fixed-effects estimator. This estimator is obtained from the following regression model, where Yij is the outcome, Tij is treatment assignment, αj is a site fixed effect, and eij is a random error with zero mean and, for simplicity, a constant variance (σ2): 10
$$Y_{ij} = \alpha_j + \beta T_{ij} + e_{ij} \tag{10}$$
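The resulting estimator is equivalent (Raudenbush, 2014) to the following weighted average of site-specific impact estimates:
$$\hat{\beta}_{FE} = \frac{\sum_{j} w_j \hat{B}_j}{\sum_{j} w_j} \tag{11}$$
where $\hat{B}_j$ is the difference between the mean program group outcome and the mean control group outcome at site j, and the weight $w_j = n_j \bar{T}_j (1 - \bar{T}_j)$ depends on the site sample size $n_j$ and the fraction $\bar{T}_j$ of the site's sample assigned to the program.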
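When site impacts vary, things change. Now the weights favor sites with large, evenly split samples, so the estimator is biased for βsites whenever those weights are correlated with the site-specific impacts Bj.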
A simple average
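Naturally, one may think that we can greatly simplify the preceding problem using a straightforward average of site-specific impact estimates. For this case, in which we are generalizing to a population of sites, consider the simple unweighted average estimator:
$$\hat{\beta}_{simple} = \frac{1}{J} \sum_{j=1}^{J} \hat{B}_j \tag{12}$$
where $\hat{B}_j$ is the impact estimate for site j and J is the number of sites in the sample.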
This simple average is unbiased when we want to count all sites equally. However, it becomes imprecise when we give sites with very small samples equal importance to sites with very large samples.
A fixed-intercept random-coefficient estimator
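Selecting between a site fixed-effects estimator, which can be biased, and a simple average, which can be imprecise, poses a dilemma. Is there a solution to this dilemma? The answer is a qualified yes.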
Consider a hierarchical linear model (HLM; Dempster, Rubin, & Tsutakawa, 1981; Lindley & Smith, 1972; Raudenbush & Bryk, 2002), which specifies site impacts that vary randomly around a population grand mean (β) with a variance τ2 and removes cross-site differences in mean untreated (control group) outcomes by including a series of site-specific intercepts or fixed effects (αj) as in Equation 10. 12 If τ2 were known for the population of sites of interest and Vj were known for each site, we would have an estimator with site weights equal to the reciprocal of the total variance τ2 + Vj of their site impact estimate. 1 This estimator has the same form as the fixed-effects estimator but with site-specific weights:
$$\hat{\beta}_{HLM} = \frac{\sum_{j} w_j \hat{B}_j}{\sum_{j} w_j}, \qquad w_j = \frac{1}{\tau^2 + V_j} \tag{13}$$
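When site-specific impacts are homogeneous (τ2 = 0), these weights are the same as those for the site fixed-effects estimator. When τ2 is large relative to every Vj, the weights are nearly equal and the estimator approaches the simple unweighted average.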
Recall that our answer to the question about a solution to the preceding dilemma was a “qualified” yes, as an important qualification exists. The previous paragraph’s reasoning was based on the assumption that τ2 for the population of sites and Vj for each site are known. The unknown part of Vj is the within-site variance σ2, which can be estimated with considerable precision based on pooled data for even a moderately large RCT. However, precise estimation of τ2 depends on the number of sites in the RCT. If τ2 is estimated imprecisely, we will not likely land on the optimal place on the continuum between the fixed-effects estimator and the unweighted estimator. However, we will not land outside this continuum, and the estimator in Equation 13 can be computed using now-standard software for HLMs.
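The continuum between the fixed-effects estimator and the unweighted average can be sketched numerically. The example below is a minimal illustration, not the procedure implemented in HLM software: it treats τ2 and σ2 as known, assumes the sampling variance of each site estimate is σ2 / [nj × (fraction treated) × (1 − fraction treated)], and uses invented site-level values. The function names `weighted_mean` and `beta_hlm` are hypothetical.

```python
import numpy as np

def weighted_mean(b_hat, w):
    """Weighted average of site-specific impact estimates."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * b_hat) / np.sum(w))

# Hypothetical site-level inputs (illustrative only).
b_hat = np.array([0.40, 0.10, 0.25, -0.05])   # site impact estimates
n = np.array([40, 400, 120, 800])             # site sample sizes
t_bar = np.array([0.5, 0.5, 0.3, 0.5])        # fraction assigned to program
sigma2 = 1.0                                  # within-site outcome variance (assumed known)

# Sampling variance of a difference in means with a common sigma2.
v = sigma2 / (n * t_bar * (1.0 - t_bar))

def beta_hlm(tau2):
    """Precision-weighted mean with weights 1 / (tau2 + V_j), tau2 assumed known."""
    return weighted_mean(b_hat, 1.0 / (tau2 + v))

beta_fe = weighted_mean(b_hat, 1.0 / v)       # fixed-effects weights
beta_simple = float(b_hat.mean())             # equal weights

for tau2 in (0.0, 0.01, 0.1, 1.0, 1e9):
    print(f"tau2 = {tau2:>12g}  estimate = {beta_hlm(tau2):.4f}")
print(f"fixed effects = {beta_fe:.4f}, simple average = {beta_simple:.4f}")
```

Setting τ2 to zero reproduces the fixed-effects weights, and a very large τ2 makes the weights nearly equal, which is why an imprecise estimate of τ2 moves the result along the continuum rather than off it.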
The Cross-Site Variance of ITT Impacts and the Cross-Site Covariance or Correlation Between ITT Impacts and Control Group Mean Outcomes
Defining a cross-site ITT impact variance
We have argued that, for multisite trials, the cross-site variance (or standard deviation) of mean program impacts should be estimated along with the cross-site mean. But how should we do this?
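First, we need to be careful in defining our estimand. An intuitively appealing definition of this cross-site variance is the mean squared discrepancy between site-specific impacts Bj and the unweighted cross-site mean impact β. This variance may be written as:
$$\tau^2_{sites} = \frac{1}{J^*} \sum_{j=1}^{J^*} (B_j - \beta_{sites})^2 \tag{14}$$
However, we may also be interested in a person-weighted average like:
$$\tau^2_{persons} = \frac{\sum_{j=1}^{J^*} N_j (B_j - \beta_{persons})^2}{\sum_{j=1}^{J^*} N_j} \tag{15}$$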
It may seem counterintuitive to define a variance as a person-weighted average. However, Equation 15 can be useful in characterizing the extent to which site differences explain person-specific variation in response to an intervention.
Estimating a cross-site ITT impact variance
Few studies have attempted to estimate a cross-site variance of ITT impacts, and we have not found literature providing guidance for doing so. Clearly, the optimal method depends on the estimand of interest and the study design. However, a broad class of weighted estimators (
where wj
is a weight for each site’s contribution to the variance estimate. The idea here is that
Suppose we have a convenience sample, but regard our sites as representing an interesting if undefined universe of similar sites and our estimand is
This estimate will be “consistent,” that is, it will converge to the correct value as the number of sites in the sample becomes ever larger. However, it might be very imprecise, particularly if small sites produce outlying estimates
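One simple way to see why a cross-site impact variance must be estimated with care is to note that the observed variance of site impact estimates mixes true impact variation with sampling error. The sketch below illustrates a generic moment-based correction (observed variance of the estimates minus the average sampling variance, truncated at zero). This is an assumption made for illustration and is not necessarily the estimator the text has in mind; the function name `tau2_moment_estimate` and all simulated values are hypothetical.

```python
import numpy as np

def tau2_moment_estimate(b_hat, v_hat):
    """Moment-based estimate of cross-site impact variance.

    Subtracts the mean sampling variance from the observed variance of the
    site impact estimates and truncates negative results at zero.
    """
    b_hat = np.asarray(b_hat, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    raw = b_hat.var(ddof=1) - v_hat.mean()
    return max(float(raw), 0.0)

# Simulated multisite trial summary (illustrative values only).
rng = np.random.default_rng(3)
J = 2000
tau2_true = 0.04
v = rng.uniform(0.01, 0.05, size=J)                    # sampling variances V_j
b_true = rng.normal(0.2, np.sqrt(tau2_true), size=J)   # true site impacts B_j
b_hat = b_true + rng.normal(0.0, np.sqrt(v))           # noisy site estimates

naive_var = float(b_hat.var(ddof=1))    # overstates tau2 because it includes sampling error
tau2_hat = tau2_moment_estimate(b_hat, v)

print(f"true tau2 = {tau2_true:.4f}, naive = {naive_var:.4f}, corrected = {tau2_hat:.4f}")
```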
An alternative to the preceding simple average estimator is an HLM analysis based on maximum likelihood. Such an approach uses iteratively reestimated least squares to obtain, at iteration m + 1:
where
Estimating a cross-site covariance or correlation between ITT impacts and control group mean outcomes
What is the cross-site covariance or correlation between program impacts and control group mean outcomes? This question is rarely addressed empirically, but the answer could be potentially informative. If the cross-site correlation between program impacts and control group mean outcomes is positive (i.e., sites with higher-than-average program impacts tend to have higher-than-average control group mean outcomes), this suggests that the program being tested will increase cross-site outcome inequality. However, if this correlation is negative, the program will tend to reduce cross-site outcome inequality. The influence of this correlation on the overall distribution of outcomes across all population members can be estimated without imposing strong theory or assumptions. Yet conventional methods do not provide a consistent estimate of this correlation.
Again, selecting an estimand is important. To estimate the covariance, suppose that we knew the true mean impact (βsites) for our population of sites and each site-specific impact Bj. Suppose we also knew the true mean untreated counterfactual outcome (µ0) for the population of sites and for each site U0j. Then, for each site, we could compute the product
$$(B_j - \beta_{sites})(U_{0j} - \mu_0)$$
and the mean of these products across sites would be the cross-site covariance between program impacts and control group mean outcomes.
Studying site-specific ITT impacts
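If we could observe the impact Bj for each site, we could display the cross-site impact distribution and determine, for example, the 10th, 25th, 75th, or 90th percentile values of this distribution. The problem is that we cannot observe the true values of Bj. To address this problem, we might use our estimate of Bj for each site. Because these estimates contain sampling error, however, their distribution is more dispersed than the distribution of true site impacts.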
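Perhaps the most popular way to address this problem is to compute, for each site, an “empirical Bayes” estimate:
$$B_j^{*} = \lambda_j \hat{B}_j + (1 - \lambda_j)\hat{\beta}, \qquad \lambda_j = \frac{\tau^2}{\tau^2 + V_j}$$
This estimate is a weighted average of the site-specific impact estimate $\hat{B}_j$ and the estimated overall mean impact $\hat{\beta}$. Sites with large samples will tend to produce precise estimates (small Vj), which receive weights near 1 and are shrunk little; estimates from sites with small samples are shrunk more strongly toward the overall mean.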
Site-specific empirical Bayes estimators
Finally, shrinkage toward the overall mean is problematic when specific groups of sites vary markedly in their program effectiveness. In these cases, it might be more appropriate to shrink site-specific impact estimates toward a predicted value based on a theory of which kinds of sites have the largest effects (see Raudenbush & Bryk, 2002, Chapters 3 and 5). We consider such predictors (site-level moderators) in the next section.
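Shrinkage toward an overall mean can be illustrated with a short computation. The sketch below assumes the standard empirical Bayes form, in which each site estimate is weighted by its reliability τ2 / (τ2 + Vj) and the overall mean receives the remaining weight. The function name `empirical_bayes` and the input values are hypothetical, and τ2 and the overall mean are treated as known rather than estimated.

```python
import numpy as np

def empirical_bayes(b_hat, v, beta_hat, tau2):
    """Shrink site impact estimates toward the overall mean.

    lam is the reliability of each site estimate: tau2 / (tau2 + V_j).
    """
    b_hat = np.asarray(b_hat, dtype=float)
    v = np.asarray(v, dtype=float)
    lam = tau2 / (tau2 + v)
    return lam * b_hat + (1.0 - lam) * beta_hat

# Two hypothetical sites with the same raw estimate but different precision.
b_hat = np.array([0.60, 0.60])
v = np.array([0.01, 0.09])       # small V_j = large sample; large V_j = small sample
eb = empirical_bayes(b_hat, v, beta_hat=0.20, tau2=0.03)

print(eb)   # the imprecise site is pulled much closer to 0.20
```

Replacing the single value `beta_hat` with an array of site-specific predicted values would shrink each estimate toward a prediction based on site characteristics instead of toward the overall mean.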
Learning About the Distribution of Impacts of Program Participation
We have discussed the impact of random assignment of individuals to a program, known as an ITT effect. If everyone participates as assigned, whether as program or control members, we have an ideal situation called “perfect compliance” with random assignment. Unfortunately, perfect compliance rarely occurs. Instead, partial compliance results from two forms of behavior. First, some individuals assigned to a program will fail to participate. For example, in the MTO experiment (Kling, Liebman, & Katz, 2007), families living in public housing were randomly assigned to receive a voucher to pay rent in a low-poverty neighborhood. However, only 47% of the families assigned to receive the voucher actually used it. Second, individuals assigned to a control group can end up in the program being tested. For example, in lottery-based studies of charter schools, winners of the charter school lottery are invited to attend it and lottery losers are not. However, lottery losers may end up attending another charter school or even attend the charter school whose lottery they lost.
In studies where some persons assigned to the new program do not participate but no controls participate, the ITT impacts can be policy relevant. In these studies, the ITT effect represents the impact of a program on the persons for whom it was intended—that is, those who were assigned to it. However, in studies where some controls access the program, the ITT impact is of questionable relevance. In all cases of noncompliance, knowing the impact of actually participating in a program is important. For this purpose, a problem of selection bias arises, even in an RCT. This is because study participants shape the decision about whether to comply with random assignment. To cope with selection bias when estimating the impact of program participation, methodologists have widely adopted the method of instrumental variables (IVs) (Angrist, Imbens, & Rubin, 1996; Heckman & Vytlacil, 1998). For this approach, random assignment is conceived as an IV, which induces a subset of sample members to participate, and we can estimate the average impact of participation on those so induced (“compliers”) under comparatively weak assumptions.
We first examine how the IV method works for a single site with homogeneous impacts. We then illustrate how the analysis becomes more complex—and more interesting—with program impacts that vary across participants. 17 Finally, we consider how to use a multisite trial to estimate the cross-site mean and variance (or standard deviation) of the impacts of program participation.
The IV Method for a Single-Site Trial With Homogeneous Impacts
To understand the conventional IV method, consider the simple causal model in Figure 1. It begins with randomization of sample members to a program (T = 1) or control group (T = 0). This influences program participation, defined as M = 1 for participation and M = 0 for nonparticipation. 18 The impact of random assignment on program participation is denoted as γ, which is the difference between the probability of participating in the program if assigned to it and the probability of participating in the program if assigned to the control group. The impact of participating in the program on the outcome is denoted as δ.

Figure 1. Single-site homogeneous impacts.
Note that Figure 1 has no arrow between T and Y and thereby excludes a direct causal relationship between T and Y. This “exclusion restriction” is a key IV assumption and implies that the impact of program assignment on the outcome is produced entirely through the effect of program assignment on participation. In the language of path analysis, participation M “fully mediates” the ITT effect of T on Y, which we call β. This implies that the ITT effect is produced solely by the “indirect” effect of T on Y which operates through M, or:
$$\beta = \gamma\delta \tag{21}$$
The beauty of Equation 21 (when it holds for a situation) is that we can estimate δ without using M to predict Y. This eliminates potential selection bias noted earlier that occurs when trying to model Y as a function of M.
Instead, IV uses a two-stage approach. We estimate γ (the impact of T on M) and β (the impact of T on Y) without bias because T is randomly assigned. We then divide our estimate of β by our estimate of γ to obtain an approximately unbiased, or consistent, estimate of δ:
$$\hat{\delta} = \frac{\hat{\beta}}{\hat{\gamma}} \tag{22}$$
Another key assumption of Equation 22 is that assignment to the program increases the probability of participation, that is, γ > 0. This is easily checked, and it would be rare to find an experiment where this condition does not hold.
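The two-stage logic can be illustrated with simulated data. The following is a minimal sketch under stated assumptions: a single site, a homogeneous participation effect δ, an unobserved trait (here called `ability`, a hypothetical name) that drives both participation and the outcome, and no direct effect of assignment on the outcome. All parameter values are invented. The comparison of participants with nonparticipants is included only to show the selection bias that the ratio estimator avoids.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
delta = 2.0                                   # true effect of participation (assumed homogeneous)

t = rng.integers(0, 2, size=n)                # random assignment T
ability = rng.normal(size=n)                  # unobserved trait affecting M and Y

# Participation depends on assignment and on the unobserved trait (self-selection).
p = 1.0 / (1.0 + np.exp(-(-1.5 + 2.0 * t + 1.0 * ability)))
m = rng.random(n) < p                         # participation M (boolean)

# Outcome depends on participation and the trait, but not directly on T
# (the exclusion restriction holds by construction).
y = delta * m + 1.5 * ability + rng.normal(size=n)

beta_hat = y[t == 1].mean() - y[t == 0].mean()     # ITT effect of T on Y
gamma_hat = m[t == 1].mean() - m[t == 0].mean()    # effect of T on M
delta_iv = beta_hat / gamma_hat                    # ratio (IV) estimate of delta

naive = y[m].mean() - y[~m].mean()                 # biased participant/nonparticipant contrast

print(f"beta_hat = {beta_hat:.3f}, gamma_hat = {gamma_hat:.3f}")
print(f"IV estimate = {delta_iv:.3f}, naive contrast = {naive:.3f}, truth = {delta}")
```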
The IV Method for a Single-Site Trial With Heterogeneous Impacts
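If we think that persons respond heterogeneously to a given program, constructing a person-specific path model of its impacts makes sense, as in Figure 2 (Raudenbush, Reardon, & Nomi, 2012). Here, Γ is the unique effect of assignment T on individual participation M. Γ is “compliance” with treatment assignment. The population-average compliance is γ = E(Γ). Similarly, Δ is the person-specific effect of participation M on the outcome Y, and its population average is δ = E(Δ). Because each person’s ITT effect is the product ΓΔ, the population-average ITT effect is
$$\beta = E(\Gamma\Delta) = \gamma\delta + \mathrm{Cov}(\Gamma, \Delta) \tag{23}$$
as described by Raudenbush, Reardon, and Nomi (2012), based on Angrist, Imbens, and Rubin (1996).

Figure 2. Single-site heterogeneous impacts: Person-specific causal model.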
As Equation 23 shows, the average effect of ITT (β) depends on the product γδ of the two average causal effects and on the covariance across individuals between compliance (Γ) and the impact of participation (Δ).
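How can we then estimate the average impact of program participation δ when the treatment effect is heterogeneous? We might assume as an approximation that Cov(Γ, Δ) = 0, in which case β = γδ and δ can again be estimated by dividing our estimate of β by our estimate of γ.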
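Rather than assuming no covariance between Γ and Δ, Angrist et al. (1996) developed an alternative approach when T and M are binary variables. They reasoned that four kinds of people exist: compliers, never takers, always takers, and defiers. Compliers are persons who would participate (M = 1) if offered a new program (T = 1) and not participate (M = 0) if assigned to a control group (T = 0). For compliers, the impact on M of being assigned to the program is Γ = 1. Never takers (M = 0 under either assignment) and always takers (M = 1 under either assignment) have Γ = 0, and defiers, who would participate only if assigned to the control group, have Γ = −1. Under the monotonicity assumption that there are no defiers, γ equals the proportion of compliers in the population, and the exclusion restriction then implies that β = γδCACE.
Here δCACE is the causal effect of program participation for persons with Γ = 1, that is, the complier average causal effect (CACE).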
In sum, if the gain from program participation varies across participants (as in Figure 2), the population mean effect of program assignment (β) is no longer the simple product γδ, unless we invoke the assumption of no covariance between compliance and impacts. However, for a binary mediator, we can invoke the assumption of monotonicity (which is weaker).
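A small numerical check may make this summary concrete. The sketch below assumes a hypothetical population of 10 persons, each with a person-specific compliance value (called `Gamma` in the comments, 1 for compliers and 0 for never takers, so monotonicity holds) and a person-specific participation effect (`Delta`). The participation effects for never takers are unobservable in practice and are included only to show the algebra.

```python
from statistics import fmean

# Hypothetical population: (Gamma, Delta) pairs.
# Gamma = 1 for compliers, 0 for never takers (no defiers, so monotonicity holds).
population = [(1, 4), (1, 6), (1, 8), (1, 6), (1, 4), (1, 8),
              (0, 1), (0, 2), (0, 3), (0, 2)]

gamma = fmean(g for g, _ in population)       # average compliance
delta = fmean(d for _, d in population)       # average effect of participation
beta = fmean(g * d for g, d in population)    # average ITT effect, E(Gamma * Delta)

# Population covariance between compliance and the effect of participation.
cov = fmean((g - gamma) * (d - delta) for g, d in population)

# Average effect of participation among compliers only.
cace = fmean(d for g, d in population if g == 1)

print("beta:", beta, "gamma*delta + cov:", gamma * delta + cov)
print("beta/gamma:", beta / gamma, "CACE:", cace, "delta:", delta)
```

Because compliers in this invented population gain more than never takers would, the covariance is positive and β exceeds γδ. The ratio β/γ recovers the complier average (6 here) rather than the population average δ (4.4 here).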
Using Multisite Trials to Learn About Variation in CACE
Our aim now is to characterize the cross-site distribution of CACE effects. For this purpose, Raudenbush et al. (2012) introduce statistical methods for estimating the mean and the variance of CACEs across sites. We emphasize the important role in this process played by whether we focus on site-level or person-level estimands, as was the case for ITT impacts.
Defining and Estimating a Population Average CACE
Suppose we want to generalize to a population of sites and regard each site as equally representative of that population. We want to estimate the unweighted true average CACE (δsites), where:
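Here δj is the CACE for site j and δsites is the population average CACE. To estimate δsites, an intuitive approach is to first estimate each site-specific CACE δj as δ̂j = β̂j/γ̂j (the ratio in Equation 22 applied within site j) and then to average these site-specific estimates across sites.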
As an alternative, we might begin with our unbiased estimate of the unweighted average ITT as:
Can we then divide this quantity by the estimated average compliance
Suppose instead that we want to generalize to a population of persons so the CACE of interest weights each site’s estimate by its population size (Equation 9), and we regard each person in our study as equally representative of that population, so that each site’s sample size nj is proportional to its population size Nj. In this case, our estimand (δpersons) is:
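Note there are Njγj compliers in site j and therefore Σj Njγj compliers in the population as a whole.
We can therefore define the person-level population CACE as the complier-weighted average δpersons = Σj Njγjδj / Σj Njγj, which equals the person-weighted average ITT effect divided by the person-weighted average compliance.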
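The difference between the two estimands can be illustrated with a short calculation. The sketch below makes two assumptions that are ours rather than the source's: the site-level estimand is taken as the simple mean of δj across three hypothetical sites, and the person-level estimand weights each site by its number of compliers, Njγj. All site values are invented.

```python
from statistics import fmean

# Hypothetical sites: population size N_j, compliance gamma_j, site CACE delta_j.
sites = [
    {"N": 100, "gamma": 0.50, "delta": 2.0},
    {"N": 300, "gamma": 0.80, "delta": 4.0},
    {"N": 600, "gamma": 0.25, "delta": 6.0},
]

# Site-level estimand: every site counts equally.
delta_sites = fmean(s["delta"] for s in sites)

# Person-level estimand: every complier counts equally,
# so site j receives weight N_j * gamma_j (its number of compliers).
compliers = [s["N"] * s["gamma"] for s in sites]
delta_persons = sum(c * s["delta"] for c, s in zip(compliers, sites)) / sum(compliers)

# Equivalent ratio form, using beta_j = gamma_j * delta_j within each site:
# person-weighted ITT effect divided by person-weighted compliance.
total_n = sum(s["N"] for s in sites)
beta_persons = sum(s["N"] * s["gamma"] * s["delta"] for s in sites) / total_n
gamma_persons = sum(compliers) / total_n

print(delta_sites, delta_persons, beta_persons / gamma_persons)
```

With these numbers the site average is 4.0, whereas the person-level average is about 4.45, because the second site contributes the most compliers. The last two printed values agree, which shows that the complier-weighted average and the ratio of person-weighted averages are the same quantity.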
Studying a distribution of CACEs across sites
Raudenbush et al. (2012) describe several methods for estimating cross-site variance of CACEs, which are beyond the scope of this article. However, the key principles follow from the logic of the previous paragraphs: How we define our estimands is critical to shaping our approach for estimating a distribution of CACEs. Developing accessible methods for doing so is a focus of current methodological research.
Learning From a Distribution of Program Impacts
We have discussed ways to study a cross-site distribution of program impacts. The idea now is to propose and test theories about when and why a program works, that is, to learn from a cross-site distribution of program effects in order to deepen our understanding of causal forces at work and how to manipulate them to improve program design and practice.
Moderation
Which types of persons benefit most from a program, and in what kinds of sites does the program work best? These important questions are about moderation of program impacts. We want to know whether a program works better for some types of persons than for others in order to target it efficiently or in order to investigate why the program does not work for certain types of persons. We would like to know which program sites are most effective, possibly to spur further investigation of practice in those sites or to frame general questions about why the program works when it does.
Questions about person-level and site-level moderators are almost always interdependent. Sites vary in the organizational conditions and practices that may be key to program success and in the composition of their client populations. Hence, claims about best practice at the site level might be misguided because especially effective sites might overrepresent persons who are most likely to benefit from the program being evaluated. As noted earlier, we define moderators of a program’s impacts to be any characteristics of its clients or sites that influence the program’s effectiveness but cannot be influenced by the program.
Person-level moderators
Evaluators commonly ask whether a program works better for boys than for girls, or for youth from high- versus low-income families, or for high- versus low-achieving students, or for persons of varying ethnicities. Such questions are often addressed through exploratory analyses conducted after average program impacts have been estimated. While such auxiliary analyses can enhance understanding, problems with this ad hoc, post hoc approach exist.
First, some subgroup findings may have limited relevance for policy or practice. For example, knowing that boys or ethnic minorities benefit most from a program might motivate further inquiry into why the program works for some clients but not for others—and this is a good thing. However, this knowledge does not necessarily imply that the program should make special effort to target particular subgroups.
Second, a search for subgroup impact variation can be stymied by the sheer number of subgroups to be examined. For example, the potentially large number of statistical tests of subgroup impact differences increases the likelihood of capitalizing on sample-specific differences that arise by chance and are therefore not replicable. Moreover, many subgroups are confounded with each other (i.e., they overlap). For example, ethnic minorities are disproportionately low-income, and boys have higher risk than girls for certain behavioral problems. Making theoretical sense of a large number of findings for such overlapping subgroups can be quite difficult.
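To give a rough sense of how quickly chance findings accumulate, the sketch below computes the probability of at least one spurious "significant" subgroup difference when no true differences exist. It assumes independent tests at a 5% significance level, which is a simplification: overlapping subgroups produce correlated tests, so the figures are indicative only. The function name is hypothetical.

```python
def familywise_error(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when there are no true subgroup differences."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 10, 20):
    print(k, round(familywise_error(k), 3))
```

Under these assumptions, the chance of at least one false positive is about 40% with 10 subgroup tests and about 64% with 20.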
Thus, we face a multiplicity of possible subgroups. No purely methodological fix to this problem exists, as the number of possible person-level moderators is too large to be sorted out by statistical hypothesis testing. What is needed is theory about who stands to benefit and why. Consider a program for increasing high school graduation rates. By construction, this program cannot appreciably increase graduation rates for students who would likely graduate without the program. At the opposite extreme, students with skills or prior grades that are so low that the program’s resources are insufficient to appreciably improve their graduation prospects will tend not to benefit from the program. We have plenty of theory and evidence about which kids are most likely to drop out of school (Rumberger, 1995), so one can envision developing a theoretically informed model that predicts this probability in the absence of treatment. The evaluator might then stratify his or her sample based on this predicted probability or “prognostic score.”
Stratifying on a prognostic score has several advantages. First, the prognostic score summarizes the predictive information in many different baseline characteristics, thereby greatly reducing the number of subgroup tests. Second, if program impacts depend strongly on a prognostic score, we confront interesting questions for policy and practice. One might envision, for example, targeting resources to persons with the greatest probability of benefiting. Third, stratifying on a prognostic score might provide a more realistic assessment of the impact of the program than that provided by an estimate of its overall average effect. For example, a school drop-out prevention program can reduce dropouts only for students who are at some risk of dropping out. Suppose that at-risk students comprise 50% of one’s sample. In that case, the average program effect on dropping out would be no more than half the size of the effect of the program on persons who could benefit from it.
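The dilution arithmetic in the drop-out example can be sketched as follows. The data are invented, the prognostic scores are taken as given rather than estimated, and the 0.5 cutoff that defines "at risk" is an assumption made purely for illustration.

```python
from statistics import fmean

def impact(records):
    """Treatment-minus-control difference in mean outcome; records are (t, y) pairs."""
    treated = [y for t, y in records if t == 1]
    control = [y for t, y in records if t == 0]
    return fmean(treated) - fmean(control)

# Hypothetical sample: (prognostic_score, t, dropped_out). Half the sample is at risk.
sample = [
    (0.8, 1, 1), (0.7, 1, 0), (0.9, 1, 0), (0.6, 1, 0),  # at risk, program group
    (0.8, 0, 1), (0.7, 0, 1), (0.9, 0, 0), (0.6, 0, 0),  # at risk, control group
    (0.1, 1, 0), (0.2, 1, 0), (0.1, 1, 0), (0.3, 1, 0),  # not at risk, program group
    (0.1, 0, 0), (0.2, 0, 0), (0.1, 0, 0), (0.3, 0, 0),  # not at risk, control group
]

CUTOFF = 0.5  # assumed threshold separating at-risk from not-at-risk students

impact_at_risk = impact([(t, y) for score, t, y in sample if score >= CUTOFF])
impact_not_at_risk = impact([(t, y) for score, t, y in sample if score < CUTOFF])
impact_overall = impact([(t, y) for _, t, y in sample])

print(impact_at_risk, impact_not_at_risk, impact_overall)
```

Here the program lowers the drop-out rate by 25 percentage points among at-risk students and by nothing among the rest, so the overall average effect is 12.5 percentage points, half the effect for the students who could benefit.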
We can augment a prognostic score analysis in ways that further understanding of impact variation. For example, with treatment group data, we could estimate a model that predicts post-program outcomes using individual baseline characteristics suggested by prior theory. Given randomization, the coefficients of this model for the treatment group should apply equally well to the control group, had they been assigned to the program. Thus, we can apply estimates of those coefficients to the baseline characteristics of control group members to predict how they might fare with access to the program. We could then use the same logic to obtain a prognostic score for how each program group member might fare without access to the program. In this way, we can estimate a pair of prognostic scores for each sample member and stratify them based on their pair of prognostic scores. By examining how program impacts vary across these strata within sites, we can efficiently summarize evidence about person-level moderators.
Site-level moderators
Knowledge about site-level moderators is potentially of great importance for developing program theory, policy, and practice. We need to understand what organizational conditions are necessary if a new program will succeed. These conditions might include the availability of resources like staff skills and knowledge, the prevailing organizational climate in sites, or local ecological conditions such as neighborhood safety and unemployment rates.
Hence, just as we might wish to estimate program impacts for subgroups of persons, we might want to estimate program impacts for subgroups of sites. Once again, problems arise from the fact that many ways to define subgroups exist, and thus, there are many moderators to consider. Now the problem of “many moderators” is even more acute because there will always be far fewer sites per site-level subgroup than there are persons per person-level subgroup. Hence, there is much less precision for estimating impact differences across site-level subgroups than for estimating impact differences across person-level subgroups. Consequently, the need for a priori theory to reduce the number of site-level moderators is even stronger than it is for person-level moderators.
Double stratification
As noted earlier, a major problem arises when studying moderators of program impacts; that is, site-level and person-level moderators are often mutually confounding. For example, sites with favorable organizational conditions might serve comparatively advantaged clients. Thus, what appears to be the influence of a site-level moderator on program impacts might actually be the influence of a person-level moderator or vice versa. One way to address this problem is “double stratification.” For example, individual prognostic scores could be used to stratify sample members into two person-level subgroups: those at high risk of a negative outcome versus all others. In addition, program sites could be categorized according to a site-level moderator or set of moderators (e.g., sites with high unemployment rates vs. all others and/or sites with high resource levels vs. all others). One could then split each site’s sample into four groups: a high-risk treatment group and a high-risk control group plus a low-risk treatment group and a low-risk control group. In this case, some sites may have empty cells. For example, some sites might have no “low-risk” treatment group members or no low-risk control group members or both. However, for all sites that have high- and low-risk treatment and control group members, we can compare program impacts on high- and low-risk students controlling for a site-level moderator or set of moderators. Likewise, we can compare program impacts across values of site-level moderators controlling for participant risk.
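A minimal sketch of the double-stratification bookkeeping follows. Everything in it is hypothetical: the four sites, their "high_resource" versus "low_resource" labels, the risk labels, and the outcomes. It computes a treatment-control difference within each site-by-risk cell, drops sites that lack any of the four cells, and then averages cell impacts within each combination of site-level and person-level category. A real analysis would also need standard errors and an explicit weighting choice, both omitted here.

```python
from collections import defaultdict
from statistics import fmean

def cell_impacts(records):
    """Map (site, risk) to the treatment-minus-control mean difference,
    keeping only cells that contain both treatment and control members."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for site, risk, t, y in records:
        cells[(site, risk)][t].append(y)
    return {key: fmean(arms[1]) - fmean(arms[0])
            for key, arms in cells.items() if arms[0] and arms[1]}

def double_stratified(records, site_group):
    """Average cell impacts by (site-level category, person-level risk),
    using only sites with complete high- and low-risk cells."""
    impacts = cell_impacts(records)
    complete = {site for site, _ in impacts
                if (site, "high") in impacts and (site, "low") in impacts}
    table = defaultdict(list)
    for (site, risk), value in impacts.items():
        if site in complete:
            table[(site_group[site], risk)].append(value)
    return {key: fmean(values) for key, values in table.items()}

# Hypothetical records: (site, risk, t, y).
records = [
    ("A", "high", 1, 12), ("A", "high", 1, 14), ("A", "high", 0, 8), ("A", "high", 0, 10),
    ("A", "low", 1, 20), ("A", "low", 1, 22), ("A", "low", 0, 19), ("A", "low", 0, 21),
    ("B", "high", 1, 11), ("B", "high", 1, 13), ("B", "high", 0, 9), ("B", "high", 0, 9),
    ("B", "low", 1, 18), ("B", "low", 1, 20), ("B", "low", 0, 18), ("B", "low", 0, 18),
    ("C", "high", 1, 10), ("C", "high", 1, 10), ("C", "high", 0, 9), ("C", "high", 0, 9),
    ("C", "low", 1, 17), ("C", "low", 1, 19), ("C", "low", 0, 18), ("C", "low", 0, 18),
    ("D", "high", 1, 9), ("D", "high", 1, 11), ("D", "high", 0, 8), ("D", "high", 0, 10),
    ("D", "low", 1, 16),  # site D has no low-risk control members
]
site_group = {"A": "high_resource", "B": "high_resource",
              "C": "low_resource", "D": "low_resource"}

result = double_stratified(records, site_group)
for key in sorted(result):
    print(key, result[key])
```

In this invented example, site D is set aside because it has an empty cell. Among the remaining sites, high-risk participants gain more than low-risk participants within both site categories, and high-resource sites show larger impacts than the low-resource site at both risk levels, so neither moderator's apparent influence is an artifact of the other.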
Mediation
Why does a new program work—or not? Innovative programs are based on theories about how program operations generate short-term changes that produce long-term benefits. Such short-term changes are called mediators. We define mediators of program impacts to be those aspects of program implementation, staff practice, and short-term changes in participants’ knowledge, skills, attitudes, or behavior that are outcomes of random assignment and predictors of long-term success. Mediators include shifts in organizational processes such as improved instruction or increased staff collaboration. These are often regarded as the mechanisms through which programs produce long-term benefits.
Methodological challenges
Analysis of mediational processes is popular in social science and program evaluation. However, drawing valid causal inferences about mediation is very challenging (for a detailed discussion of these challenges and alternative approaches to them, see the Keele article in this volume). For example, consider a study in which teachers are assigned at random to a professional development program with the aim of increasing instructional quality, which in turn is expected to improve student outcomes. Suppose that the program is successful in boosting student achievement. To what extent are the program-induced gains in student achievement explained—or “mediated”—by program-induced improvement in instruction? This mediational analysis would assess the impact of the program on instructional quality. If teachers are assigned at random to the program or a control group, the difference between mean instructional quality for the treatment and control groups is an unbiased estimate of the causal effect of the program on instructional quality. Next, one seeks to assess the impact of instructional quality on student achievement. Establishing this causal link is especially challenging because teachers are not assigned at random to instructional quality. For example, teachers’ pretreatment characteristics (experience, prior education, commitment, etc.) frequently predict their instructional quality. Such confounding can produce bias when studying the impact of instructional quality on youth outcomes.
A second problem arises when the mediator has a different effect on the outcome for treatment group members than for control group members. If this is the case, membership in the treatment group or control group moderates the causal effect of the mediator on the outcome. Conventional methods of path analysis thus do not work well (Holland, 1988; Pearl, 2001; Robins & Greenland, 1992). Presenters at the two national conferences referenced in the introduction of this article described three evolving statistical strategies for coping with these methodological challenges.
Multisite multimediator IV analysis
At the two conferences, Sean Reardon presented an approach that exploits site-to-site variation in the impact of a program on mediators. The rationale for this approach is intuitive. If M is a mediator and Y is an outcome of interest, we expect to see a large impact of a program on Y in sites where the program strongly affects M. If we fail to see such effects, we have evidence against the mediation theory. If we see effects, we have evidence of possible mediation. This idea extends nicely to the case of two mediators, call them M1 and M2. Suppose we see large effects of random assignment to the program on Y in sites where large effects of random assignment to the program on M1 exist but not in sites where large effects of random assignment to the program on M2 exist. Then, we would infer that M1 is a more important mediator than is M2. This intuition is the basis for Bloom, Hill, and Riccio’s (2003) study of mediators in a series of large-scale multisite welfare-to-work experiments. Kling, Liebman, and Katz (2007) applied this approach in their study of MTO.
Reardon and Raudenbush (2013) derived the assumptions that must be met in order to infer that a specified mediator has a causal effect on a specified outcome. These assumptions are closely related to the assumptions we described earlier when the aim was to identify the impact of participating in a new program (CACE). Indeed, program participation can be regarded as a mediator of the effect of program assignment, as described in Figure 2. The multisite, multimediator model extends this basic idea to the case of two or more mediators, as shown in Figure 3. Now, our IV T induces a shift in two mediators, M1 and M2, and each of these, by hypothesis, influences the outcome Y. Readers familiar with IV methods will immediately raise a question. We now have one instrument and two causal variables, meaning that we will end up with one equation and two unknowns. How can this possibly work? Here the beauty of the multisite design comes into play. We can regard the treatment assignment indicator in each site as a separate IV. Thus, if there are J sites with a treatment group and control group for each site, we have J instruments, enabling us to identify the impact of our two or more mediators on the outcome under several important assumptions.

Figure 3. Multiple site, two mediators: Person-specific causal model.
We can clearly recognize these assumptions when we represent Figure 3 as a regression model. Let’s call Bj the ITT effect in site j. Suppose that this effect works entirely through two mediators, M1 and M2. The impact of T on M1 in site j is γ1j, and the impact of T on M2 in site j is γ2j. In terms of path analysis as shown in Figure 3, Bj is the total effect of T on Y in site j, and it works strictly through indirect effects on the two mediators. Hence, we can express the path model as
Here δ1 is the overall average impact of M1 on Y controlling for M2, and δ2 is the overall average impact of M2 on Y controlling for M1. Equation 30 is a simple regression model where the outcome Bj is the ITT effect on Y, and the predictors are γ1j (the ITT effect on M1) and γ2j (the ITT effect on M2).
The logic of this setup is that we can estimate Bj (the dependent variable in the regression model) as well as γ1j and γ2j (the two independent variables in the regression model) without bias based simply on the random assignment of participants to T. However, this gift comes at the price of several assumptions:
We can readily check Assumptions 3 and 4 against observed data, so they do not pose a strong challenge. Assumption 5 is based on program theory. The other assumptions, however, cannot be checked against the data.
Reardon, Unlu, Zhu, and Bloom (2014) discuss conditions under which failures of these assumptions are most likely to cause bias for analyses of a single mediator. They also provide a bias correction that is applicable when Assumption 2 fails and the goal is to estimate a single mediator effect. We anticipate that future work will extend these innovations to the case of multiple mediators. This is important because Assumption 2 is potentially a strong assumption.
We conclude that the multisite, multimediator IV method opens up interesting new ways to exploit cross-site heterogeneity in order to study the impact of program mediators on participants’ outcomes. However, this new and evolving method merits study to learn more about how failure of its assumptions influences its results.
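The core regression idea can be sketched in code. This is a deliberately simplified illustration, not the estimator of Reardon and Raudenbush (2013): it assumes the site-level model has no intercept (consistent with the exclusion restriction) and no error term, treats the site-specific ITT effects as known rather than estimated, and uses invented values. The function name `fit_two_mediators` is hypothetical.

```python
def fit_two_mediators(itt_y, itt_m1, itt_m2):
    """No-intercept least squares of site ITT effects on Y (B_j) on
    site ITT effects on M1 (gamma_1j) and M2 (gamma_2j)."""
    if not (len(itt_y) == len(itt_m1) == len(itt_m2)) or len(itt_y) < 2:
        raise ValueError("need matching inputs from at least two sites")
    s11 = sum(a * a for a in itt_m1)
    s22 = sum(b * b for b in itt_m2)
    s12 = sum(a * b for a, b in zip(itt_m1, itt_m2))
    s1y = sum(a * y for a, y in zip(itt_m1, itt_y))
    s2y = sum(b * y for b, y in zip(itt_m2, itt_y))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        # The two sets of site effects move in lockstep, so the data cannot
        # separate the contribution of M1 from that of M2.
        raise ValueError("delta1 and delta2 are not separately identified")
    delta1 = (s22 * s1y - s12 * s2y) / det
    delta2 = (s11 * s2y - s12 * s1y) / det
    return delta1, delta2

# Hypothetical ITT effects in four sites, generated with delta1 = 2.0 and delta2 = 0.5.
gamma1 = [0.1, 0.4, 0.2, 0.5]
gamma2 = [0.3, 0.1, 0.4, 0.2]
b = [2.0 * g1 + 0.5 * g2 for g1, g2 in zip(gamma1, gamma2)]

delta1_hat, delta2_hat = fit_two_mediators(b, gamma1, gamma2)
print(delta1_hat, delta2_hat)
```

The guard on `det` reflects the identification point made above: the two mediator effects can be separated only if the program's impacts on M1 and M2 vary across sites in ways that are not proportional to each other. In practice the site-specific ITT effects are estimated with error, which this sketch ignores.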
Other strategies for mediation analysis in multisite trials
Finding flexible new strategies for mediation analysis is currently a topic of great interest in social science and public health (see recent books by Hong, 2015, and VanderWeele, 2015). Presenters at the aforementioned conferences reviewed two of the most potentially useful approaches: principal stratification and sequential randomization. A key feature of these approaches is that they do not require the exclusion restriction we relied on when describing the multisite, multimediator IV approach. A key limitation for our current discussion is that the application of these approaches to multisite trials is not yet well developed, although it is a topic of intense current methodological research. Given the multisite theme of this article, we describe these approaches very briefly.
Principal stratification
One goal of principal stratification applied to the analysis of mediation is to estimate program impacts on persons whose mediator values are not affected by program assignment. These are “direct effects” of the program because they operate independently of the mediator or mediators of interest. The existence of a program impact on an outcome for persons who do not experience a program impact on a hypothesized mediator refutes the claim that the program’s impact is generated entirely through that mediator. The idea is to stratify one’s sample based on “potential mediator values” and to compare estimated program impacts for selected strata. Frangakis and Rubin (2002) label these strata as “principal strata.” Two sample members belong to the same principal stratum if their pair of potential values for a given mediator is the same. In other words, they belong to the same principal stratum if the value of their mediator under assignment to a program is the same and if the value of their mediator under assignment to control status is the same.
The problem, of course, is that we cannot observe both potential mediator values for any sample member, so principal stratum membership is unknown. However, as presented by Lindsay Page in this volume, it is possible in some important cases to use baseline and follow-up data for sample members to estimate a model that predicts their two potential mediator values and thereby predicts their principal stratum membership.
Sequential randomization
An innovative strategy for mediation analysis, described by Guanglei Hong at the William T. Grant Foundation conference, conceives of the mediation process as a sequence of randomized experiments (Pearl, 2001; Robins & Greenland, 1992). Consider how this works in the case of a single binary mediator, where M = 1 if the mediator value is favorable and M = 0 if it is not favorable. The first experiment is directly observable: We assign participants at random to a new program (T = 1) or to its control group (T = 0). The second experiment is hypothetical: Program group members are assigned at random to the favorable value of the mediator with some probability. Control group members are also randomly assigned to the favorable mediator value but with a different probability. If we knew these two probabilities, we could make the needed causal inferences (Imai, Keele, & Yamamoto, 2010). The empirical challenge is to estimate these probabilities, which may depend on baseline characteristics of sample members and the study setting.
Let’s call the mediator value to which a program group member is assigned M(1) and the mediator value to which a control group member is assigned M(0). In principal stratification, these two potential mediator values are treated as fixed characteristics of each sample member that depend on his or her background and the study setting. For analyses based on sequential randomization, the values of M(1) and M(0) are treated as stochastic. The probability that M = 1 depends on a participant’s past and whether he or she is randomly assigned to the program group or control group. Under sequential randomization, an effective program is seen as increasing the chance of receiving a favorable mediator value.
Recall that principal stratification groups sample members in terms of their predicted pair of potential mediator values, M(1) and M(0), based on their background characteristics and future outcomes. In contrast, under the assumption of sequential randomization, we seek to group sample members based on their pair of probabilities of experiencing a favorable mediator value under assignment to the program group and under assignment to the control group. This approach enables the analyst to estimate (a) the indirect effect, which is the causal effect on the outcome of changing the mediator value without changing program assignment and (b) the direct effect, which is the causal effect of changing program assignment without changing the mediator value. The relative magnitudes of these two component effects indicate the degree to which the program effect was transmitted by the hypothesized mediator.
Perhaps the key challenge is that, while participants are randomly assigned to the treatment T, they are not randomly assigned to the mediator M. However, if a rich set of pretreatment characteristics (call them X) is measured, we may be willing to assume that, within a stratum of persons with similar values of X, assignment of the mediator is effectively “as if” random. This means that, within such strata, there are program participants whose mediator values vary by chance and control participants whose mediator values vary by chance. Methodologists have devised a number of clever strategies for estimating direct and indirect effects in this context (see Hong, 2015; VanderWeele, 2015).
One difficulty with this approach is stratifying sample members on a potentially long list of baseline characteristics (X). To deal with this issue, one can use a propensity score (Rosenbaum & Rubin, 1983) because stratifying sample members on a propensity score can balance stratum members on all variables used to predict the propensity score, at least in large samples.
Comparing alternative approaches to mediational analysis
The preceding approaches for studying mediation of program impacts (multisite IVs, principal stratification, and the approximation of sequential randomization) have different strengths and limitations. The first approach directly exploits multisite RCTs to create a series of valid instruments and does not rely on pretreatment covariates to produce unbiased (or consistent) estimates. However, this approach requires that all relevant mediators be observed and accounted for and is potentially subject to “omitted mediator bias.” Moreover, one must assume that site-specific impacts of the treatment on the mediator are not associated with site-specific impacts of the mediators on the outcome. In contrast, the approximation of sequential randomization does not assume that all mediators are measured and modeled. Rather, like standard path analysis, this approach decomposes the effect of treatment assignment into indirect effects that work through specified mediators and a direct effect that works through additional mediators that are unobserved. In doing so, the approach relaxes parametric assumptions that are commonly used for path analysis. However, like path analysis, sequential randomization requires a rich set of pretreatment covariates to support the assumption that, conditional on these covariates, mediator values in the treatment group and in the control group are effectively assigned randomly but with different probabilities. The principal stratification approach does not require measuring and modeling all pretreatment confounders (as does sequential randomization) or the exclusion restriction (as does the IV approach). Instead, it requires covariates and follow-up outcomes that adequately predict the potential values of sample members’ mediators. In addition, principal stratification is more useful for identifying a direct effect of program assignment and thereby falsifying a mediation theory than it is for estimating the parameters of a mediational process.
None of these approaches is perfect for all mediational analyses, and all mediational analyses (short of randomizing specified mediator values to treatment and control group members) require strong assumptions in order to estimate mediator effects. However, the assumptions required by these new strategies are less stringent than those required by conventional path analysis. Furthermore, despite the substantial difficulties of mediational analysis, we believe that it is essential for building a science of program design and development. However, selecting a method of mediational analysis for multisite trials is craft knowledge that is not yet fully understood or widely available.
Final Remarks
The presence of variation in program impacts upends conventional ways of analyzing and interpreting data from program evaluations, especially in multisite trials, which are very common in program evaluation. Among other things, impact variation makes it possible to define and estimate different types of average impacts. For example, we can define an average impact for a population of sites or an average impact for a population of persons, and with heterogeneity of program impacts, these parameters can differ.
However, any average becomes less informative as impact variation increases. Understanding this variation thus becomes more important, and new questions arise such as (a) By how much do impacts vary across individuals, subgroups of individuals and program sites? (b) What is the cross-site correlation between program impacts and control group mean outcomes? (c) What are the maximum and minimum site-specific program impacts? Searching for answers to these questions is learning about a distribution of program effects.
The existence of a distribution of program impacts also provides opportunities for testing theories about for whom, under what conditions, and why programs work. Toward this end, we can pose theories to guide future data collection for explaining impact variation within and across program sites. Theory building is learning from a distribution of program effects.
Statistical methods for discovering and explaining impact variation are developing rapidly, and we have provided a broad overview of new approaches. However, a great deal remains to be done, and we anticipate many new methodological breakthroughs during the next decade.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This article was supported by the Spencer Foundation (for Bloom) and the William T. Grant Foundation (for Raudenbush and Bloom). We thank the following individuals for their insights into the issues explored by the article: Adam Gamoran, Guanglei Hong, Lindsay Page, Sean Reardon, Michael Weiss and Kim Dumont.
