Abstract
In randomized controlled trials, the complier average causal effect (CACE) parameter is often of policy interest because it pertains to intervention effects for study units that comply with their research assignments and receive a meaningful dose of treatment services. Causal inference methods for identifying and estimating the CACE parameter using an instrumental variables (IV) framework are well established for designs with a single treatment and control group. This article uses a parallel IV framework to discuss and build on the much smaller literature on estimation of CACE parameters for designs with multiple treatment groups. The key finding is that the conditions to identify and estimate CACE parameters are much more complex for multiarmed designs and may not be tractable in some cases. Practical steps are provided on how to proceed, and a case study demonstrates key issues. The results suggest that ensuring compliance is particularly important in multiarmed trials so that intention-to-treat estimates on the offer of intervention services (which can be identified) can provide meaningful information on the CACE parameters.
In randomized controlled trials (RCTs) of interventions and programs, random assignment is typically conducted before treatment services are received by study participants. As a result, some study participants may not receive treatment services offered to their research group, and some may receive barred services offered to other research groups. Thus, comparing the mean outcomes of study participants across research groups provides information on the intention-to-treat (ITT) parameter for those offered specific intervention services, but may not fully capture program effects on the actual receipt of offered services.
Accordingly, there is often policy interest in impact evaluations to estimate the complier average causal effect (CACE) parameter that pertains to average treatment effects (ATEs) for “compliers” in the study population—those who would receive intervention services offered to their research group but not interventions slated for other groups. Using an instrumental variables (IV) causal inference framework, Angrist et al. (1996) build on Bloom (1984) to derive the statistical assumptions required to identify (isolate) the CACE parameter for designs with a single treatment and control group, where compliance decisions are treated as dichotomous. Schochet and Chiang (2011) generalize these findings to RCTs where groups rather than individuals are randomized (clustered designs), where compliance decisions can be made by both clusters (such as school or hospital staff) and individuals within clusters (such as students or patients).
There is a much smaller literature on how to identify and estimate the CACE parameter for designs with multiple treatment groups (multiarmed designs). This gap is important because multiarmed RCTs are becoming increasingly popular in social policy research, as they can simultaneously examine the effects of multiple interventions in a single study, thereby increasing the amount that researchers and policy makers can learn from impact evaluations. In social policy research, these designs are particularly relevant for interventions that are relatively easy to implement—for example, an education RCT testing several texting initiatives to improve student engagement and achievement (Castleman & Page, 2015; Kraft & Rogers, 2015). Relatedly, multiarmed designs are useful for rapid cycle experiments aimed at continuous program improvement, for example, using behavioral-based interventions and encouragement designs.
This article addresses the question: What are the assumptions required to identify and estimate CACE parameters for multiarmed designs? Our goal is to motivate why the statistical assumptions for the single treatment–control group design do not easily generalize to the multiarmed context, where additional, often implausible assumptions are required for parameter identification. Thus, CACE estimation may not always be possible for multiarmed RCTs. We apply a causal inference framework to synthesize and build on the relevant literature (Blackwell, 2017; Cheng & Small, 2006; Long et al., 2010; Miratrix et al., 2018) and provide suggestions on additional model assumptions that could be invoked to estimate or at least bound CACE parameters in the multiarmed context. We provide practical steps on how to proceed in real-world applications to estimate CACE parameters. Finally, to demonstrate key concepts, we present a case study using data from a large-scale, multiarmed RCT of the Adult and Dislocated Worker Programs funded by the U.S. Department of Labor. Our focus is on statistical aspects of CACE estimation and making them accessible to researchers conducting evaluations and not on reasons for noncompliance or strategies to reduce it.
We adopt an IV framework because it is the most common approach used to estimate the CACE parameter in evaluation research. The IV approach shares features with principal stratification methods (Frangakis & Rubin, 2002; Long et al., 2010) by assigning sample members to distinct groups based on compliance decisions in each research condition. We do not consider structural nested mean models (Robins, 1994; Tan, 2010) that rely on alternative, parametric assumptions to identify CACE parameters. To simplify the presentation, we focus on CACE identification for nonclustered designs where individuals are randomized; parallel, but more complex issues arise for clustered designs where groups are randomized. Finally, while our focus is on RCTs, the discussion also applies to quasi-experimental designs (QEDs) with comparison groups (i.e., those obtained by matching or inverse probability weighting).
The remainder of this article is in five sections. First, we discuss identification and estimation of the ITT parameter in multiarmed RCTs, and second, we discuss parallel issues for the CACE parameter. Third, we provide practical guidance on CACE estimation in the multiarmed context. Fourth, we present case study findings. Finally, we present conclusions.
The ITT Parameter in Multiarmed RCTs
In multiarmed RCTs, impact estimation typically involves comparing mean outcomes across distinct pairs of research groups. Schochet (2017) discusses design-based ITT estimators in this context using a Neyman–Rubin–Holland potential outcomes framework that underlies experiments (Holland, 1986; Neyman, 1923; Rubin, 1974, 1977). Schochet (2009, 2017) also discusses the use of multiple comparison adjustments to account for the potential inflation of Type I errors across the pairwise contrasts.
To motivate this approach, consider an RCT where study participants are randomly assigned to one of K distinct research groups and follow-up data are collected on the sample. The research groups could include a control (business-as-usual) group but do not have to, and assignment rates could differ across the research groups. We assume that the study contains K* treatment groups (so that K* = K if the study has no control group, and K* = K − 1 otherwise). To focus the presentation, we assume each treatment group is offered a different intervention (e.g., Drug A or Drug B). Our framework, however, can be extended to allow for a treatment group that is offered several interventions (e.g., Drug A and B), which becomes more complex as some individuals might receive only part of the intervention. We discuss such an extension in our case study.
In our setup, we consider designs where randomization to each research group occurs at the same time and not designs where treatments are randomized sequentially (Blackwell, 2016). We also do not consider fractional factorial designs where the research groups include an orthogonal subset of all possible treatment combinations (see, e.g., Box et al., 2005; Wu & Hamada, 2009).
Potential outcomes measure what would result in each research condition. Let $Y_i(k)$ signify the potential outcome for individual i in research group k, and let $T_i(k)$ be the research group indicator variable that equals 1 if an individual is assigned to group k, and 0 otherwise.
In this setting, the ITT parameter of interest for contrasting groups k and k′ is the mean difference in potential outcomes, $ITT_{k,k'} = E[Y_i(k) - Y_i(k')]$.
This approach relies critically on the following two key identification assumptions that generalize the corresponding assumptions for the two-group design (Imbens & Rubin, 2015):
The stable unit treatment value assumption (SUTVA) (Rubin, 1986) has two parts. First, for any two random assignment vectors
Independence between research status and potential outcomes: $Z_i$ is independent of the potential outcomes $Y_i(k)$ for all research groups k.
Under these assumptions and additional mild regularity conditions on population moments, Schochet (2017) and Schochet et al. (2020) prove that weighted least squares estimators in models with or without baseline covariates are consistent for the ITT parameters.
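As a rough illustration of the pairwise contrasts described in this section, the following sketch computes simple difference-in-means ITT estimates for every pair of research groups. It is a simplification and not the weighted least squares estimator of Schochet (2017): it ignores baseline covariates and weights, and the function name, input layout, and the Bonferroni threshold are assumptions introduced only for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pairwise_itt(outcomes_by_group):
    """Difference-in-means ITT estimates for all pairs of research groups.

    outcomes_by_group: dict mapping a group label k to the list of observed
    outcomes for units assigned to group k (hypothetical input layout).
    Returns {(k, k_prime): (estimate, standard_error)} for mean(k) - mean(k_prime).
    """
    results = {}
    for k_prime, k in combinations(sorted(outcomes_by_group), 2):
        y_k, y_kp = outcomes_by_group[k], outcomes_by_group[k_prime]
        estimate = mean(y_k) - mean(y_kp)
        # Conservative standard error for a difference in two independent means
        se = sqrt(variance(y_k) / len(y_k) + variance(y_kp) / len(y_kp))
        results[(k, k_prime)] = (estimate, se)
    return results


# Made-up outcomes for a control group (0) and two treatment groups (1, 2)
toy = {0: [10, 12, 11, 9], 1: [13, 15, 14, 12], 2: [14, 16, 15, 17]}
estimates = pairwise_itt(toy)

# One simple multiple comparison adjustment (Bonferroni) across the contrasts
alpha_per_contrast = 0.05 / len(estimates)
```

With three research groups, the sketch returns the contrasts (1, 0), (2, 0), and (2, 1), matching the ordering of the ITT subscripts used in the text.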
The CACE Parameter in Multiarmed RCTs
Because the CACE parameter is a component of the ITT parameter, additional assumptions are required to identify the CACE parameter. To examine these assumptions in the multiarmed context, we build on the well-known CACE framework in Angrist et al. (1996) for the single treatment–control design by specifying potential compliance decisions in each research condition. Let
We now allow potential outcomes to depend not only on the research conditions as before but also on the compliance decisions. Formally, let
Using this framework, we can calculate the number of possible compliance subgroups in the study population based on their compliance decisions across the K research conditions. To do this, note that for each value of $Z_i$, there are
Angrist et al. (1996) specify three key assumptions for identifying the CACE parameter for the single treatment–control RCT that we generalize to the multiarmed setting (we also invoke the ITT assumptions specified in the previous section):
SUTVA for CACE analyses: Both
Monotonicity:
Exclusion restriction:
For the single treatment–control group design, where $Z_i$ equals 1 for the treatment group and 0 for the control group, the three assumptions from above identify the CACE parameter,
Monotonicity implies that there are no defiers and the exclusion restriction implies that ATEs are zero for never- and always-takers. This means that the overall ITT parameter, $ITT_{1,0}$, can be expressed as a function of the ATE for compliers only. More formally, the CACE parameter,
As shown next, identification becomes more complex in the multiarmed context, even if there are only two treatment groups and no control group. Stated differently, in the multiarmed setting, the three assumptions from above are not enough to identify the CACE parameter.
Designs With Only Treatment Groups but No Control Group
Consider first a simple two-group design with two treatment groups (Research Groups 1 and 2) but no control group (K = K* = 2). In this case, the overall ITT parameter comparing the second treatment relative to the first, $ITT_{2,1}$, can be expressed as a weighted average of ATEs for 16 possible compliance groups. This occurs because in each research group, individuals can receive Treatment 1 only (labeled “T1”), Treatment 2 only (“T2”), both treatments (“T12”), or neither treatment (“T0”). To simplify notation, define the couplet (a, b), where “a” and “b” represent a person’s compliance decisions if assigned to Research Groups 1 and 2, respectively. For example, (T0,T2) represents those who would receive no intervention services if assigned to Research Group 1 and would receive Treatment 2 if assigned to Research Group 2.
Consider a scenario where there are no crossovers so that those in Research Group 1 cannot receive Treatment 2 and vice versa. In this case, as shown in Figure 1, the overall ITT parameter, $ITT_{2,1}$, can be expressed as a weighted average of ATEs for four compliance subgroups: (T1,T2), (T1,T0), (T0,T2), and (T0,T0), which are often referred to in the literature as principal strata. Note that the condition of no crossovers is more stringent than the monotonicity condition (which leaves nine complier subgroups rather than four), because the absence of crossovers removes compliance subgroups such as (T1,T1) and (T1,T12) that are not excluded by monotonicity.

Figure 1. Depiction of compliance decisions for a randomized controlled trial with two treatment groups, no control group, and no crossovers.
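The subgroup counts quoted above (16 couplets in total, nine under monotonicity, and four with no crossovers) can be checked by brute-force enumeration. The sketch below is only illustrative. In particular, the coding of monotonicity is an assumption: it takes monotonicity to mean that assignment to a research group never makes a person less likely to receive that group's own treatment. All names are hypothetical.

```python
from itertools import product

# Services actually received -> (receives Treatment 1?, receives Treatment 2?)
RECEIPT = {
    "T0": (0, 0),
    "T1": (1, 0),
    "T2": (0, 1),
    "T12": (1, 1),
}


def all_couplets():
    """All (a, b) compliance couplets for Research Groups 1 and 2."""
    return list(product(RECEIPT, repeat=2))


def satisfies_monotonicity(a, b):
    """Assumed monotonicity: own-group assignment never lowers own-treatment receipt."""
    t1_if_g1, t2_if_g1 = RECEIPT[a]
    t1_if_g2, t2_if_g2 = RECEIPT[b]
    return t1_if_g1 >= t1_if_g2 and t2_if_g2 >= t2_if_g1


def no_crossovers(a, b):
    """Group 1 members never receive Treatment 2, and vice versa."""
    return RECEIPT[a][1] == 0 and RECEIPT[b][0] == 0


couplets = all_couplets()
monotone = [c for c in couplets if satisfies_monotonicity(*c)]
no_cross = [c for c in couplets if no_crossovers(*c)]
print(len(couplets), len(monotone), len(no_cross))  # 16 9 4
```

Under this coding, (T1,T1) and (T1,T12) survive the monotonicity filter but are removed by the no-crossover filter, which is the distinction drawn in the text.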
The first question to address is: What is the CACE parameter of interest? We focus on the parameter for the (T1,T2) subgroup because it represents the relative effect of receiving Treatment 2 compared to Treatment 1 for compliers in both research groups (always-compliers). The (T1,T0) and (T0,T2) groups also have compliers, but only in one research condition and not the other, and thus are likely to be of less policy interest. However, many of the statistical issues for the (T1,T2) parameter also apply to the (T1,T0) and (T0,T2) parameters.
The (T1,T2) parameter cannot be identified without additional assumptions. While the exclusion restriction implies that the ATE is zero for the (T0,T0) subgroup, this restriction does not apply to the three remaining compliance subgroups (principal strata). Further complicating identification, we cannot even identify the proportions of the study population in each compliance subgroup.
Next, we discuss identification strategies for designs without crossovers. With crossovers, the identification problem becomes intractable due to the large number of compliance cells.
Method 1: Obtain bounds without additional assumptions
The literature on constructing bounds for impact estimation originated with the seminal works of Manski (1990, 2003). In the multiarmed context, Cheng and Small (2006) build on this work to use linear programming methods to provide optimal bounds on both the subgroup proportions and ATE compliance parameters in Figure 1 for binary outcomes (where bounds are likely to be more informative than for continuous outcomes with a larger range). The approach solves a series of equations based on parameter restrictions between the interior and marginal parameters in Figure 1. For example, by definition, the marginal p 1 value is a weighted average of population proportions in the (T1,T2) and (T1,T0) groups and similarly for the other proportions and ATE parameters. We provide more intuition on this approach when discussing Method 2 below under a simplifying assumption, where we also discuss approaches for obtaining confidence intervals for the bounds.
Importantly, with more than two treatment groups, calculating bounds for the CACE parameter and their SEs becomes very difficult. Further, the bounds are likely to be noninformative (wide).
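To give some intuition for why the subgroup proportions are only partially identified, the sketch below bounds the share of (T1,T2) always-compliers using nothing but the observed take-up rates in the two research groups. This is not the linear programming method of Cheng and Small (2006). It is a closed-form bound that uses only the marginal take-up rates under no crossovers, and information from outcomes might tighten it. The names `q1` and `q2` are assumptions introduced here: `q1` is the share of Research Group 1 receiving Treatment 1 and `q2` is the share of Research Group 2 receiving Treatment 2.

```python
def always_complier_share_bounds(q1, q2):
    """Bounds on the population share of the (T1,T2) subgroup with no crossovers.

    With no crossovers, q1 = share(T1,T2) + share(T1,T0) and
    q2 = share(T1,T2) + share(T0,T2), while all four shares must be
    nonnegative and sum to one. These restrictions imply the bounds below.
    """
    if not (0.0 <= q1 <= 1.0 and 0.0 <= q2 <= 1.0):
        raise ValueError("take-up rates must lie in [0, 1]")
    lower = max(0.0, q1 + q2 - 1.0)  # keeps the (T0,T0) share nonnegative
    upper = min(q1, q2)              # cannot exceed either take-up rate
    return lower, upper


# Made-up take-up rates of 70% and 80%
lower, upper = always_complier_share_bounds(0.7, 0.8)
```

The bounds collapse to a point only when one of the take-up rates equals 0 or 1, which is one way to see why high compliance makes the identification problem easier.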
Method 2: Impose an additional monotonicity condition to obtain tighter bounds
It is possible to make more headway under the two-treatment design in settings where it is realistic to assume an additional monotonicity condition where one treatment is always preferred to the other. This could occur, for example, if one of the treatment groups receives more services deemed beneficial or fewer services deemed onerous. Under this condition, a person who takes up the offer of the less preferred treatment will always take up the offer of the more preferred treatment. For instance, if Treatment 2 is the preferred intervention, the monotonicity condition implies that there are no individuals in the (T1,T0) subgroup, that is, no one who would take up Treatment 1 if assigned to Research Group 1 but would not take up Treatment 2 if assigned to Research Group 2.
With this additional monotonicity condition, it is possible to identify the proportions of the population in each compliance subgroup: $p_1$ for the (T1,T2) subgroup,
In this expression,
Suppose we assume that
One approach for obtaining an upper bound on the CACE parameter is to use available impact results in the literature from studies related to the current evaluation to invoke the assumption,
This bounding approach generalizes to settings with additional treatment groups if there is an assumed monotone ordering of treatment preferences, there are no crossovers, and we invoke additional exclusion restrictions described below. For example, suppose we added a third treatment group (Research Group 3), which is preferred to Treatment 2, which in turn is preferred to Treatment 1, and extend the notation from Figure 1 to the three-group design. We then have four possible compliance subgroups: (T1,T2,T3), (T0,T2,T3), (T0,T0,T3), and (T0,T0,T0) with respective subpopulation proportions, $p_1$,
which states that the Treatment 3–2 contrast for Research Groups 2 and 3 compliers is the same for those who would take up the offer of Treatment 1 if given the chance and those who would not. Second, we must assume that
The approach by Cheng and Small (2006) can also be used to create bounds with the added monotonicity condition on the ordering of treatment preferences, where we extend their analysis for binary outcomes to general outcomes. This approach works directly with mean potential outcomes in each research condition rather than the treatment effects per se. To motivate this approach for the two-treatment design, let
In this setting, we can estimate
This approach extends to designs with more than two research groups. For example, consider a design with three treatment groups where Treatment 3 is preferred to Treatment 2, which is preferred to Treatment 1. In this case, respective lower and upper bounds for
Note that the bounds calculated using the treatment effects directly or using the mean potential outcomes can be combined (if the underlying assumptions hold). In this case, the sharper (narrower) bounds can be selected.
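The following sketch illustrates one way to work with mean potential outcomes under the added monotonicity condition for the two-treatment design with Treatment 2 preferred and no crossovers. It is a trimming-style bound written for illustration, not a reproduction of the article's expressions or of Cheng and Small's (2006) algorithm. The reasoning it encodes is an assumption spelled out here: Treatment 1 recipients in Research Group 1 are all (T1,T2) members, while Treatment 2 recipients in Research Group 2 mix the (T1,T2) and (T0,T2) subgroups in known proportions. All function and argument names are hypothetical.

```python
from statistics import mean


def subgroup_shares(q1, q2):
    """Subgroup shares when Treatment 2 is preferred and there are no crossovers.

    q1: take-up rate of Treatment 1 in Research Group 1.
    q2: take-up rate of Treatment 2 in Research Group 2.
    """
    if not 0.0 < q1 <= q2 <= 1.0:
        raise ValueError("requires 0 < q1 <= q2 <= 1 under the assumed ordering")
    return {"(T1,T2)": q1, "(T0,T2)": q2 - q1, "(T0,T0)": 1.0 - q2}


def trimmed_mean_bounds(values, share):
    """Bounds on the mean of a latent subgroup making up `share` of `values`."""
    if not 0.0 < share <= 1.0:
        raise ValueError("share must lie in (0, 1]")
    ordered = sorted(values)
    n_keep = max(1, round(share * len(ordered)))
    # Worst cases: the subgroup holds the lowest, or the highest, outcomes
    return mean(ordered[:n_keep]), mean(ordered[-n_keep:])


def cace_bounds(y_t1_recipients_g1, y_t2_recipients_g2, q1, q2):
    """Bounds on the Treatment 2 versus Treatment 1 effect for the (T1,T2) subgroup."""
    shares = subgroup_shares(q1, q2)
    share_among_t2_recipients = shares["(T1,T2)"] / q2
    low_y2, high_y2 = trimmed_mean_bounds(y_t2_recipients_g2, share_among_t2_recipients)
    mean_y1 = mean(y_t1_recipients_g1)  # point identified under the assumptions above
    return low_y2 - mean_y1, high_y2 - mean_y1


# Made-up data: take-up rates of 40% and 80%
bounds = cace_bounds([10, 12, 14], [8, 11, 13, 15, 18, 20], q1=0.4, q2=0.8)
```

The width of the resulting interval shrinks as `q1` approaches `q2`, because the Treatment 2 recipients then consist almost entirely of (T1,T2) members.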
Miratrix et al. (2018) discuss a similar approach as Cheng and Small (2006) to estimate ATEs (for binary outcomes) for principal strata, although they do not focus on the CACE parameter. Instead, they focus on principal strata defined by students’ postrandomization high school choices, using data from a single treatment–control RCT of Early College High Schools. A key finding is that they show bounds can be tightened by first creating separate bounds for subgroups defined by baseline covariates predictive of the outcomes or compliance decisions and then averaging these bounds to obtain overall ones. This subgroup approach can also be applied in our CACE context to potentially sharpen estimated bounds.
Finally, a further complication with Cheng and Small’s (2006) approach is that it is technically challenging to obtain confidence intervals around the estimated bounds (see Canay & Shaikh, 2016, for a recent review and Yang & Small, 2016). Miratrix et al. (2018) provide simulation evidence that a bootstrap method works well in practice where (i) the data are repeatedly resampled with replacement, (ii) the bounds are recalculated for each bootstrap sample, and (iii) the 95% confidence interval is obtained as the fifth percentile of the lower bound and the 95th percentile of the upper bound across the bootstrap samples. However, more research is needed to assess this approach and others for obtaining confidence intervals for estimated bounds.
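The three bootstrap steps described above can be sketched as follows. This is a minimal illustration and not the code used by Miratrix et al. (2018). The function names, the simple index-based percentile rule, and the toy bounds function (worst-case bounds on the mean of a binary outcome with missing values) are all assumptions made so that the example is self-contained. In a multiarmed trial, resampling would presumably be done separately within each research group.

```python
import random


def percentile(sorted_values, pct):
    """Simple index-based percentile of an already sorted list (pct in [0, 100])."""
    idx = round(pct / 100 * (len(sorted_values) - 1))
    return sorted_values[min(len(sorted_values) - 1, max(0, idx))]


def bootstrap_bounds_ci(data, bounds_fn, n_boot=1000, seed=0):
    """Percentile bootstrap interval around estimated (lower, upper) bounds."""
    rng = random.Random(seed)
    lowers, uppers = [], []
    for _ in range(n_boot):
        sample = rng.choices(data, k=len(data))  # (i) resample with replacement
        low, high = bounds_fn(sample)            # (ii) recompute the bounds
        lowers.append(low)
        uppers.append(high)
    lowers.sort()
    uppers.sort()
    # (iii) 5th percentile of the lower bound, 95th percentile of the upper bound
    return percentile(lowers, 5), percentile(uppers, 95)


def worst_case_mean_bounds(sample):
    """Toy bounds: mean of a binary outcome when some values are missing (None)."""
    observed = [y for y in sample if y is not None]
    n_missing = len(sample) - len(observed)
    n = len(sample)
    return sum(observed) / n, (sum(observed) + n_missing) / n


toy_data = [1] * 40 + [0] * 40 + [None] * 20
ci = bootstrap_bounds_ci(toy_data, worst_case_mean_bounds, n_boot=500, seed=1)
```

Any function that maps a sample to a (lower, upper) pair can be passed as `bounds_fn`, so the same wrapper could be reused with other bounding approaches.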
Method 3: Assume homogeneity of mean potential outcomes
The CACE parameter can be identified without the added monotonicity assumption on treatment preferences under the strong assumption that mean potential outcomes for those who receive a particular treatment are the same across compliance cells. For example, in Figure 1, this implies that
Designs With a Control Group
Multiarmed RCTs with a control group allow for additional possibilities to identify CACE parameters without the added monotonicity condition on treatment preferences, assuming no crossovers. To fix concepts, consider a design with a control group (Research Group 0) and two treatment groups (Research Groups 1 and 2). In this setting, K = 3 and K* = 2, so the overall ITT parameter for each pairwise contrast (e.g., the first treatment group relative to the control group) can be expressed as a weighted average of ATEs for 64 possible compliance subgroups. Define the triplet (a,b,c), where “a,” “b,” and “c” represent a sample member’s compliance decisions if assigned to Research Groups 0, 1, and 2, respectively. Assuming no crossovers, the structure of Figure 1 still holds except that the four compliance subgroups now include the triplets (T0,T1,T2), (T0,T1,T0), (T0,T0,T2), and (T0,T0,T0) rather than the couplets. We are interested in identifying impact parameters for the (T0,T1,T2) group.
With three research groups, there are three possible pairwise contrasts of interest comparing each treatment group to each other and to the control group. Thus, multiple CACE parameters emerge across the contrasts. When contrasting the two treatment groups, the same issues with CACE identification arise as for the two-treatment design discussed above. However, the situation becomes more tractable when contrasting each treatment group to the control group.
Consider comparing the first treatment group (Research Group 1) to the control group (Research Group 0). Using Figure 1, the full-sample ITT for this contrast,
where
which states that the ATE for those receiving Treatment 1 is the same for those who would take up the offer of Treatment 2 if given the chance and those who would not. In this case, the CACE parameter for Research Group 1 compliers can be identified as
The above analysis suggests that in multiarmed RCTs with control groups, we can identify well-defined CACE parameters by simply comparing each treatment group to the control group using the IV estimators found in the literature for the single treatment–control design. This approach, however, relies on the assumptions of no crossovers in the treatment and control groups (or a small number of crossovers that can safely be ignored in the analysis) and additional exclusion restrictions of the form shown in Equation 4. The multiple comparisons methods discussed in Schochet (2009, 2017) can be employed for significance testing across the multiple CACE estimators.
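A minimal sketch of this treatment-versus-control strategy is shown below, using the standard IV (Wald) ratio of the ITT effect on the outcome to the ITT effect on service receipt. It assumes no crossovers and the exclusion restrictions discussed above, omits standard errors and covariates, and uses made-up data. The function name and inputs are hypothetical.

```python
from statistics import mean


def wald_cace(y_treat, d_treat, y_ctrl, d_ctrl):
    """IV (Wald) estimate for one treatment group versus the control group.

    y_*: observed outcomes; d_*: 0/1 indicators of receiving that treatment.
    With no crossovers, every entry of d_ctrl is 0.
    """
    itt_outcome = mean(y_treat) - mean(y_ctrl)
    itt_receipt = mean(d_treat) - mean(d_ctrl)
    if itt_receipt <= 0:
        raise ValueError("assignment must raise receipt of the treatment")
    return itt_outcome / itt_receipt


# Made-up data for the control group and two treatment groups
y0, d0 = [10, 10, 12, 12], [0, 0, 0, 0]
y1, d1 = [12, 14, 12, 14], [1, 1, 0, 0]
y2, d2 = [15, 17, 13, 15], [1, 1, 1, 0]

cace_1 = wald_cace(y1, d1, y0, d0)  # Research Group 1 versus control
cace_2 = wald_cace(y2, d2, y0, d0)  # Research Group 2 versus control
```

Each call uses only one treatment group and the control group, which is why the single treatment–control estimators carry over directly.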
An important potential limitation of the above approach, however, is that the CACE parameters comparing each treatment group to the control group refer to different complier subpopulations (Long et al., 2010). For example, in the above scenario, the
Under certain assumptions, the
Practical Guidance
The analysis above demonstrates the complexities of CACE identification for multiarmed evaluations using IV methods. In this section, we synthesize the multidimensional issues to provide guidance to researchers on how they can assess whether CACE estimation is feasible in their context and, if so, how to proceed. We structure our guidance around four questions:
Question 1: What Are Crossover Rates?
In multiarmed trials, study participants are considered to be crossovers if they receive services slated for other research groups. Unlike the single treatment–control group design, crossovers refer to both treatment group members who receive interventions offered to other treatment groups and to controls who receive any of the interventions. If crossover rates are high, the number of possible compliance groups explodes, making identification impossible unless ad hoc assumptions are made to exclude some compliance groups that are assumed not to exist in the population. These assumptions may be difficult to justify. Thus, the next three questions are relevant only if crossover rates are low (e.g., less than 10% for each research group, but what is considered to be “low” will likely vary by context).
Question 2: Does the Study Include a Control Group?
If the answer is “yes,” the simplest approach is to estimate CACE parameters by comparing each treatment group to the control group using standard IV methods for the single treatment–control design. The CACE estimates can then be compared to each other to examine the relative effects of each treatment, under the assumption that the CACE estimates pertain to the same study population. This homogeneity assumption means that impacts on the receipt of a particular treatment are the same across compliance groups. To help assess the credibility of this assumption, researchers could examine the extent to which ITT estimates for each pairwise contrast vary across subgroups defined by baseline characteristics. A finding that the ITTs do not vary much across baseline subgroups could provide support for the homogeneity assumption for the CACE parameters. High compliance rates could also provide support for comparing the CACE estimates.
If the homogeneity assumption does not appear to be tenable, the methods of Cheng and Small (2006) can be used to bound the CACE parameters for designs with or without a control group, but only if there are two treatment groups; the calculations become much more complicated with three or more treatment groups and are likely to yield wide bounds. A potentially effective approach for tightening the bounds is to first create separate bounds for subgroups defined by baseline covariates that are predictive of the outcomes or compliance decisions and then to average these bounds to obtain overall ones (Miratrix et al., 2018). While obtaining confidence intervals around estimated bounds is complex, there is some evidence that a bootstrap approach works well in practice, although more research is needed on this topic.
Question 3: Is There Evidence That the Interventions Can Be Ranked in Terms of Their Preference?
In some studies, take-up rates could be higher for some interventions than others. If there is clear evidence of a monotone ordering of intervention preferences, the number of compliance groups reduces sharply because there are no individuals who would take up the offer of a less preferred intervention but not the offer of a more preferred one. In this case, sharper bounds can be obtained using Method 2 discussed in the previous section under plausible exclusion restrictions, even for designs with more than two treatment groups (with or without a control group). To assess the credibility of the monotonicity assumption on intervention preferences, it is important to examine available information on the nature of the interventions (e.g., whether some have more onerous requirements or extra benefits), the quality of intervention implementation, the take-up of intervention services, and survey data on satisfaction with the interventions.
Question 4: Is There Evidence of Homogeneity of Mean Outcomes Across Compliance Groups That Take Up the Offer of a Particular Intervention?
Stated differently: Are mean outcomes for those who receive a specific treatment independent of their compliance decisions in other research conditions? If this condition is tenable, CACE parameters for the treatment group compliers can be easily estimated by comparing the mean outcomes of recipients across research groups. This strong assumption is difficult to test. One approach could be to examine the means and variances of outcomes across subgroups defined by their baseline characteristics for both treatment recipients and nonrecipients. A finding that the distribution of outcomes is similar across baseline subgroups could provide support to the assumption.
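The two calculations mentioned here can be sketched in a few lines: the recipient contrast that would estimate the CACE parameter if the homogeneity condition held, and a descriptive summary of outcome means and variances by baseline subgroup. This is only an illustration under that strong assumption. The function names and the record layout are hypothetical, and the summary is a descriptive check rather than a formal test.

```python
from collections import defaultdict
from statistics import mean, variance


def recipient_contrast(y_recipients_k, y_recipients_k_prime):
    """Difference in mean outcomes between recipients of two treatments."""
    return mean(y_recipients_k) - mean(y_recipients_k_prime)


def outcome_summary_by_subgroup(records):
    """Mean and variance of outcomes within each baseline subgroup.

    records: iterable of (baseline_subgroup, outcome) pairs for one set of
    sample members, such as the recipients of a particular treatment.
    """
    grouped = defaultdict(list)
    for subgroup, outcome in records:
        grouped[subgroup].append(outcome)
    return {
        subgroup: (mean(ys), variance(ys) if len(ys) > 1 else 0.0)
        for subgroup, ys in grouped.items()
    }


# Made-up records for recipients of one treatment
summary = outcome_summary_by_subgroup(
    [("hs", 10), ("hs", 12), ("college", 20), ("college", 24)]
)
contrast = recipient_contrast([20, 22], [15, 17])
```

Large differences in the subgroup means or variances, as in the made-up records above, would cast doubt on the homogeneity condition.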
Case Study
To demonstrate key issues for CACE identification and estimation for multiarmed RCTs, we use data from a large-scale RCT of the Adult and Dislocated Worker programs, among the largest employment and training initiatives in the United States (Fortson et al., 2017). Funded by the U.S. Department of Labor, the programs aim to help job seekers find meaningful employment by providing labor market information and resources on job search (core services), assistance from employment counselors for those needing additional help in finding employment (intensive services), and funding for training for those interested and deemed suitable. Job seekers access program services at American Job Centers located throughout the nation. The study evaluated the programs as implemented from 2011 to 2014 when the programs were authorized by the Workforce Investment Act (WIA) of 1998. The programs were subsequently reauthorized under the Workforce Innovation and Opportunity Act of 2014.
The study, which we hereafter refer to as the “WIA evaluation,” was designed to yield impact estimates with both internal and external validity. The study randomly selected 30 local workforce investment areas for study inclusion, 26 of which agreed to participate along with two replacement sites. In each of the 28 study areas, local staff randomly assigned nearly all individuals eligible to receive WIA-funded services (35,665 job seekers) to one of three study groups:
The “core group” (Research Group 0) could receive only basic core services that the programs are required to offer to all job seekers. Core services consist mainly of labor market information and online tools to help workers plan their careers and find employment. The core group could not receive WIA-funded intensive or training services described next.
The “core-and-intensive” group (Research Group 1) could receive any core or intensive service, but not WIA-funded training. Intensive services typically require more extensive or personalized assistance from American Job Center staff, and include assessments, workshops, career counseling, and referrals to other services.
The “full-service” group (Research Group 2) could receive services in the same way they would have in the absence of the evaluation, including WIA-funded training. For those eligible, training services are provided mostly through vouchers that can be used to pay for tuition and fees at approved training programs.
Selection rates to Research Groups 0 and 1 were set low (6% each) to make the study more acceptable to local area staff (to help site recruitment) and to minimize the effects of the study on program operations. All individuals could seek other services in their communities.
To estimate program effects in the 3-year follow-up period, the WIA evaluation used outcome data from two rounds of surveys (15 and 30 months after random assignment) and administrative earnings records from the National Directory of New Hires (NDNH) containing unemployment insurance wage records reported by employers. Fortson et al. (2017) provided details on the study design, data collection, and methods for ITT estimation that compared mean outcomes across the three research groups over time. These ITT estimates pertain to impacts on the offer of specific services but not on the actual receipt of services.
Table 1 displays key ITT impact estimates on earnings in U.S. dollars (that include zero earnings), as reported in Fortson et al. (2017), that we use for our CACE analysis. The first key finding is that there is clear evidence that the offer of WIA-funded intensive services improved earnings. According to the survey data, the core-and-intensive group earned $8,087 more on average than the core group during the 30-month follow-up period, a statistically significant impact at the 5% level. Further, the core-and-intensive group earned more in each quarter (Fortson et al., 2017). A similar pattern emerges using the NDNH data, although the impacts are somewhat muted.
Table 1. Selected ITT Impact Estimates on Earnings for the WIA Evaluation.
Note. ITT = intention-to-treat; WIA = Workforce Investment Act; NDNH = National Directory of New Hires; SE = standard error. Earnings are in 2020 U.S. dollars. Sample sizes using the survey data are 1,620, 1,578, and 1,575 for the full-service, core-and-intensive, and core groups, respectively, and the corresponding sample sizes using the NDNH data are 29,710, 2,034, and 2,029. Source: WIA evaluation 15- and 30-month follow-up surveys and NDNH data (Fortson et al., 2017).
* Statistically significant at the 5% level, two-tailed test.
The second key finding pertains to the effects of the offer of WIA-funded training relative to the offer of WIA-funded intensive services only, where we examine the difference in mean earnings between the full-service and core-and-intensive groups. For this analysis, we focus on Quarter 12 earnings using the NDNH data (the most recent period) so that trainees had enough time to complete their training and reenter the labor market. As shown in Table 1, the ITT impact estimate is $185, which is not statistically significant. However, this estimate is diluted by the relatively small difference in the training enrollment rate between the full-service and core-and-intensive groups (see below). This issue motivates our CACE analysis discussed next.
Here, our focus is on CACE estimation to address the following policy-relevant question (not addressed previously): What are the added effects of the receipt of training on top of the receipt of intensive services? To help isolate these effects, or at least bound them, we examine study data on service receipt and Quarter 12 NDNH earnings for the full-service and core-and-intensive groups (Research Groups 2 and 1; see Table 1 for sample sizes). Our analysis is not intended to be a comprehensive CACE analysis (which would be a full article in itself) but to demonstrate key issues and associated complexities.
We begin by addressing the four questions posed in the previous Practical Guidance section to help structure the CACE analysis. First, crossover rates were low: very few members of the core-and-intensive (and core) group received barred WIA-funded training services (Fortson et al., 2017), although many did receive training from other sources (see below). Accordingly, we assume that no crossovers received WIA-funded training, which greatly facilitates CACE estimation. Second, while the study included a control group (the core group), our focus here is on comparing the two treatment groups to each other rather than to the control group. Third, as discussed further below, we can rank interventions in terms of their preference, which further reduces the number of compliance groups. Finally, there is little evidence that mean outcomes for those who received a specific intervention were independent of their compliance decisions in other research conditions. For instance, those who received training were more likely than nontrainees to be female, to have college degrees, and to have higher wages in their most recent job prior to random assignment, characteristics correlated with subsequent labor market outcomes. Thus, simply comparing mean earnings across the service receipt groups would likely yield biased estimates.
Next, to construct compliance cells for the CACE analysis, we examine information on the receipt of intensive and training services during the 30 months after random assignment. We find that nearly all members of the full-service and core-and-intensive groups received WIA-funded intensive services, as they typically received counseling right after the employment counselors conducted random assignment at the American Job Centers (D’Amico et al., 2015). However, the full-service group received more services: they spent an average of 16 more minutes with an employment counselor than the core-and-intensive group and received more assessments and supportive services (Fortson et al., 2017).
Another key finding is that the full-service group received more training than the core-and-intensive group (Figure 2). About 50% of the full-service group reported having ever enrolled in training (from any source), and about one third received WIA-funded training. In comparison, about 41% of the core-and-intensive group enrolled in training, which they paid for themselves or by using sources of funding other than WIA. Thus, members of the full-service group were 9 percentage points more likely to enroll in training, a difference that is statistically significant at the 5% level. Further, those in the full-service group spent an average of 89 more hours in training than those in the core-and-intensive group and were significantly more likely to have completed a training program and attained a training credential (Fortson et al., 2017).

Figure 2. Participation in training in the 30 months after random assignment. Source. Workforce Investment Act (WIA) evaluation 15- and 30-month follow-up surveys (Fortson et al., 2017). Note. The figures pertain to training funded from any source, including WIA, other programs, and the participants themselves. The standard error is 2.74 percentage points for the difference in the training rate between the full-service and core-and-intensive groups. *Difference between the full-service and core-and-intensive groups is significant at the 5% level.
Nonetheless, the difference in the training rate between the full-service and core-and-intensive groups is smaller than anticipated at the time the study was designed (Fortson et al., 2017). This occurred because, as it turned out, WIA funds for training during the study intake period were the lowest they had been in more than a decade, and access to training was limited. Thus, the ITT earnings impact estimate of $185 on the added benefit of offering WIA-funded training services on top of WIA-funded intensive services is diluted by many in the full-service group who did not receive training.
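To make the dilution concrete, the following back-of-the-envelope Python sketch uses only figures quoted above: the 9 percentage point difference in training enrollment, its standard error of 2.74 (taken here to be in percentage points), and the $185 ITT estimate. It computes the z-statistic for the enrollment difference and a naive Wald-type rescaling of the ITT estimate. The rescaling is an illustrative assumption and not the CACE estimator used in this article: it mixes survey-based enrollment rates with NDNH earnings and relies on standard single-treatment IV exclusion restrictions that are not assumed to hold in this case study.

```python
# Figures quoted in the text (survey enrollment rates, NDNH Quarter 12 earnings).
enroll_diff_pp = 9.0       # full-service minus core-and-intensive, percentage points
enroll_diff_se_pp = 2.74   # standard error of that difference (assumed percentage points)
itt_earnings = 185.0       # ITT impact on Quarter 12 earnings, dollars

# z-statistic for the difference in training enrollment rates.
z_stat = enroll_diff_pp / enroll_diff_se_pp

# Naive Wald-type rescaling: ITT divided by the enrollment-rate difference.
# ASSUMPTION: illustrative only; it requires exclusion restrictions that the
# article does not impose, so it is not the article's CACE estimate.
naive_rescaled = itt_earnings / (enroll_diff_pp / 100.0)

print(f"z-statistic for enrollment difference: {z_stat:.2f}")
print(f"naive rescaled ITT: ${naive_rescaled:,.0f}")
```

The z-statistic exceeds 1.96, consistent with significance at the 5% level. Because the enrollment difference in the denominator is small, even a modest ITT estimate scales up by roughly a factor of 11, and any sampling error in the ITT estimate scales up with it.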
These results guide our strategy for creating compliance groupings, which has the following four key features:
Training and intensive services received from sources not funded by WIA are treated as interventions themselves. We adopt this strategy because our goal is to isolate the value added of the receipt of training beyond the receipt of intensive services. This goal could not be achieved if we were to instead treat non-WIA-funded services as part of the study context and service counterfactual, as was done in Fortson et al. (2017) for ITT estimation.
Training services funded by WIA and elsewhere are combined for the analysis and similarly for intensive services. We combine training services—regardless of their source and funding—to reduce the number of compliance cells, and similarly for intensive services. This means that our CACE analysis does not address the added effects of WIA-funded training services per se but training services more generally.
Services are assumed to differ across the research groups. We adopt this feature because, as discussed, the amount and nature of the intensive and training services received differed across the research groups (so standard IV exclusion restrictions would not hold if we were to instead treat interventions as identical in each research condition).
Services received in the full-service condition are treated as preferred to those received in the core-and-intensive condition. This preference ordering is plausible because the full-service group received more services and spent much less of their own money on training (Fortson et al., 2017). This assumption further reduces the number of compliance cells.
Figure 3 displays the four possible compliance groups under this setup, which consist of couplets based on the services sample members could have received in either research condition (from any source). The two options for the core-and-intensive group were to receive intensive and training services (T1) or intensive services only (T2), and similarly for the full-service group (Treatments T3 and T4). Our interest is in the CACE parameter for the (T2,T3) group to isolate, or at least bound, the effects of training receipt on top of intensive service receipt.

Figure 3. Compliance cells for the Workforce Investment Act evaluation.
To make progress on identifying the
Examining Equation 5, it is clear we cannot easily point identify
First, to obtain an upper bound on
Next, to obtain a lower bound on
We can now estimate bounds for
To try to tighten the bounds, we estimated separate bounds for subgroups defined by whether the sample member had more than a high school degree at baseline (25% of individuals) or a high school degree or less (75% of individuals) and then aggregated these bounds to obtain overall ones (Miratrix et al., 2018). We selected these education subgroups (using data from study registration forms collected at study intake) because those with higher education levels were more likely to receive training and to have higher earnings than those with lower education levels. However, this approach had little effect on narrowing the bounds.
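The aggregation step can be sketched as a weighted average of subgroup bounds. This is a minimal illustration, not the code used for the case study: the function name `aggregate_bounds` and the numeric bounds below are hypothetical placeholders, and the weights are assumed to be each subgroup's share of the target compliance group, which need not equal the 25% and 75% sample shares used here for illustration.

```python
def aggregate_bounds(subgroup_bounds, weights):
    """Combine (lower, upper) bounds across subgroups as a weighted average.

    If the overall effect is a weighted average of subgroup effects and each
    subgroup effect lies within its own bounds, the overall effect lies within
    the weighted average of the lower bounds and of the upper bounds.
    """
    if len(subgroup_bounds) != len(weights):
        raise ValueError("one weight per subgroup is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    lower = sum(w * lo for w, (lo, _) in zip(weights, subgroup_bounds))
    upper = sum(w * hi for w, (_, hi) in zip(weights, subgroup_bounds))
    return lower, upper


# HYPOTHETICAL placeholder bounds in dollars; these are not study estimates.
bounds = [(-4000.0, 6000.0), (-2000.0, 3000.0)]
# ASSUMPTION: subgroup shares of the target compliance group (illustrative).
weights = [0.25, 0.75]

overall_lower, overall_upper = aggregate_bounds(bounds, weights)
print(overall_lower, overall_upper)
```

The aggregated interval is narrower than the pooled one only to the extent that the covariate separates sample members with different outcomes or compliance behavior, which is consistent with the limited gains reported above.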
In sum, our case study demonstrates the difficulties with point identifying CACE parameters—and even obtaining informative bounds—in multiarmed trials. In our example, these complexities arose despite the small number of considered treatment arms.
Conclusions
This article has shown that estimating CACE parameters in the multiarmed context is substantially more complex than for the two-group, treatment–control design. Even in a relatively simple RCT with only two treatment groups, the CACE parameter for the always-compliers in both treatment conditions—the group typically of most interest—is not point identified, even in the absence of crossovers. These effects can be bounded using linear programming methods, but these methods are complex and do not easily generalize to designs with more than two treatment groups. Further, obtaining confidence intervals on the estimated bounds is technically challenging.
If it is reasonable to assume a monotone ordering of treatment preferences, we can obtain sharper bounds using linear programming and related approaches—that work with either the ATEs directly or the mean potential outcomes—even for designs with more than two treatment groups. However, even in these cases, the bounds can be wide (noninformative) as demonstrated by our case study. Obtaining informative bounds is especially challenging for continuous outcomes with large variances and ranges, so bounding methods may be best suited to binary or discrete outcomes.
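The difficulty with continuous outcomes can be illustrated with a generic trimming bound, sketched below in Python. This is an illustration under stated assumptions rather than the linear programming procedure discussed in this article: it supposes that the target compliance group makes up a known share of an observed mixture of sample members and bounds that group's mean outcome by assigning it the lowest or the highest outcomes in the mixture. The function name `trimming_bounds` and the earnings values are hypothetical.

```python
def trimming_bounds(outcomes, share):
    """Bound the mean outcome of a latent group that is `share` of the sample.

    Lower bound: mean of the lowest `share` fraction of observed outcomes.
    Upper bound: mean of the highest `share` fraction of observed outcomes.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    ordered = sorted(outcomes)
    k = max(1, round(share * len(ordered)))  # number of units in the latent group
    lower = sum(ordered[:k]) / k
    upper = sum(ordered[-k:]) / k
    return lower, upper


# HYPOTHETICAL quarterly earnings in dollars, including zeros and one high earner.
earnings = [0, 0, 1000, 2000, 3000, 4000, 5000, 6000, 8000, 20000]
# ASSUMPTION: the target group is 30% of this mixture (illustrative value).
lower, upper = trimming_bounds(earnings, share=0.3)
print(f"bounds on the latent group mean: [{lower:.2f}, {upper:.2f}]")
```

With these hypothetical values, a single earner at $20,000 pulls the upper bound well above the sample mean of $4,900, which is the sense in which continuous outcomes with large variances and ranges tend to yield wide bounds.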
If the study includes a control group, the simplest approach is to first estimate CACE parameters by contrasting each treatment group to the control group and then to compare these CACE estimates to each other. However, this approach is feasible only in the absence of crossovers and requires additional exclusion restrictions. Further, because each of these CACE estimators pertains to different subpopulations, it may not be possible to compare them unless it is realistic to assume homogeneity of treatment–control contrasts across subpopulations. The availability of detailed baseline data is important to help assess the credibility of these assumptions.
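The treatment-versus-control approach can be sketched with the Bloom (1984) adjustment, which divides each ITT estimate by the corresponding take-up rate when control group members cannot receive the intervention. The Python example below is a sketch under that no-crossover assumption; the function name `bloom_cace` and all numbers are hypothetical and are not drawn from the WIA evaluation.

```python
def bloom_cace(itt, takeup_rate):
    """Bloom-type adjustment: ITT estimate divided by the treatment take-up rate.

    ASSUMPTION: no control group member receives the intervention (no
    crossovers) and the offer affects outcomes only through service receipt.
    """
    if not 0 < takeup_rate <= 1:
        raise ValueError("takeup_rate must be in (0, 1]")
    return itt / takeup_rate


# HYPOTHETICAL ITT estimates (dollars) and take-up rates for two treatment arms,
# each contrasted with the same control group.
cace_arm_a = bloom_cace(itt=600.0, takeup_rate=0.60)
cace_arm_b = bloom_cace(itt=400.0, takeup_rate=0.80)

# Difference between the two treatment-versus-control CACE estimates.
cace_difference = cace_arm_a - cace_arm_b
print(round(cace_arm_a, 2), round(cace_arm_b, 2), round(cace_difference, 2))
```

The difference is interpretable as a contrast between the two interventions only if the complier subpopulations for the two arms are comparable, which is the homogeneity condition noted above.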
There are no easy solutions for dealing with noncompliance in multiarmed impact evaluations. Clearly, the credibility of imposing specific and sometimes untestable assumptions to identify CACE parameters will depend on the study context. Researchers should be aware that there may be instances where it is not possible to estimate CACE effects using IV methods in evaluations with multiple treatment groups.
For complex multiarmed trials, one could try using Bayesian principal stratification methods to model the relationships between principal stratum membership and potential outcomes (see, e.g., Frangakis & Rubin, 2002). However, this approach relies on parametric assumptions that are often difficult to test and the availability of detailed baseline data to improve the modeling. Further, under this approach, it is often necessary to invoke ad hoc assumptions to reduce the number of compliance cells to facilitate parameter estimation (model convergence).
In conclusion, the results from this article suggest that ensuring compliance is particularly important in multiarmed trials so that ITT estimators on the offer of intervention services (which can be identified) can provide meaningful information on the CACE parameters. In this context, study resources should be devoted to maximizing the percentage of treatment group members who take up offered intervention services and to minimizing the percentage who receive crossover services slated for other research groups. For instance, compliance in RCTs could be increased if it is feasible to select the point of random assignment close to the point of service receipt and if study implementers (program staff) are fully trained in the random assignment procedures and fully understand the consequences of no-shows and crossovers for the evaluation findings.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
