Abstract
Background:
Random design experiments are a powerful device for estimating average treatment effects, but evaluators sometimes seek to estimate the distribution of treatment effects. For example, an evaluator might seek to learn the proportion of treated units who benefit from treatment, the proportion who receive no benefit, and the proportion who are harmed by treatment.
Method:
Imbens and Rubin (I&R) recommend a Bayesian approach to drawing inferences about the distribution of treatment effects. Drawing on the I&R recommendations, this article explains the approach; provides computing algorithms for continuous, binary, ordered and countable outcomes; and offers simulated and real-world illustrations.
Results:
This article shows how the I&R approach leads to bounded uncertainty intervals for summary measures of the distribution of treatment effects. It clarifies the nature of those bounds and shows that they are typically informative.
Conclusions:
Despite identification issues, bounded solutions provide useful insight into the distribution of treatment effects. We recommend that evaluators incorporate analyses of the distribution of treatment effects into new studies and that evaluators revisit completed studies to estimate the distribution of treatment effects.
Randomized control trials (RCTs) are the foundation for evaluation research because RCTs provide unambiguous causal inferences if the randomization is not compromised. Yet RCTs have critics (Berk, 2005; Burtless, 1995; Deaton & Cartwright, 2016; Sampson, 2010). One criticism is that classical methods for analyzing RCTs only estimate the average or median treatment effect (Heckman & Vytlacil, 2007). Even when the RCT shows a positive average treatment effect, different proportions of the population may be advantaged, harmed, or unaffected by treatment.
The average treatment effect tells us very little about this distribution of treatment effects. Yet without knowing this distribution, analysts are unable to apply a loss function, where positive and negative outcomes have different weights. Furthermore, evaluators and policy makers have scant ability to critically judge program effectiveness and redesign interventions because they lack guidance about what works best for whom. Despite the importance of this topic, there is limited scholarship on the subject.
Of course, evaluators have not been indifferent to the heterogeneity of treatment effects. Heterogeneity has been a topic of study at least since Johnson and Neyman (1936) and Cronbach and Snow (1977). More recently, evaluators sometimes stratify or use a regression (Bloom, 2005) to estimate how treatment effects vary systematically with exogenous factors. Sometimes evaluators use principal stratification, also called endogenous stratification, to estimate how treatment effects vary with postsampling outcomes (Frangakis & Rubin, 2002; Imbens & Angrist, 1994). Xie et al. (2012) discuss heterogeneity in the context of observational data and propose three methods using propensity scores to evaluate treatment effects that may vary by the probability of treatment selection. Raudenbush and Bloom (2015a, 2015b) look for heterogeneity using multisite random design experiments, and they describe innovative work sponsored by the William T. Grant Foundation, the Spencer Foundation, and the U.S. Institute of Educational Research.
All these efforts have a foundational premise: Treatment effects are mediated or moderated by individual and contextual factors. A specific drug treatment may benefit women more than men, African American women more than White women, African American women who exercise more than the other cross-classified groups. But these efforts are ultimately limited by either sample sizes within subgroups or by unmeasured covariates that may account for treatment heterogeneity or by controversial assumptions.
Imbens and Rubin (2015)—henceforth, I&R—suggest a Bayesian method for estimating the distribution of treatment effects. Consistent with the literature summarized above, the I&R method explains treatment heterogeneity using observed moderating variables. But their method goes beyond that level of explanation to include both explained and unexplained heterogeneity, thereby characterizing the entire distribution of treatment effects. Our article discusses and illustrates the I&R approach.
To start the discussion, consider an illustration from an influential criminal justice field experiment that evaluated alternative methods to deter domestic violence. Domestic violence is one of the most difficult and dangerous circumstances police face during their daily activities. The initial field experiment to study alternative interventions was conducted in Minneapolis (Sherman & Berk, 1984a, 1984b) and replicated elsewhere (Garner et al., 1995). In two of the three randomly assigned interventions, when police were called to a domestic violence incident, the suspect was either arrested or sent away from home for several hours. One outcome was whether there was a repeat victimization within 6 months.
Sherman and Berk (1984a) reported that 10% of the arrest group and 24% of the send-suspect-away group had a repeat victimization. The average treatment effect is therefore 14 percentage points; that is, repeat victimization was 14 percentage points less frequent following an arrest than following being sent away. Knowing the average treatment effect is certainly important. It informs public policy choices and adds to the body of evidence encompassing domestic violence interventions. However, there is the potential to learn a great deal more. To provide intuition, we identify four possible latent groups implicit in the design of this field experiment: batterers who desist from domestic assault regardless of their treatment; batterers who repeat their crimes regardless of treatment; batterers who will desist if arrested but not otherwise; and batterers who will desist if sent away but not otherwise. There is no treatment effect for the first two groups. There is a salutary effect from an arrest for the third group and a detrimental effect from an arrest for the last group. Estimating the distribution (proportions) of people in these four latent groups would provide a deeper understanding of the effects of domestic violence interventions than simply knowing that the average treatment effect is a 14 percentage point reduction in repeat assault.
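To make the identification issue concrete, the following Python sketch computes the range of latent-group proportions that is consistent with the two reported repeat rates. It is an illustration only: the function name and group labels are ours, and the calculation treats the reported 10% and 24% as exact population rates, ignoring sampling error.

```python
def latent_group_bounds(p_repeat_arrest, p_repeat_away):
    """Ranges of latent-group shares consistent with two marginal repeat rates.

    The shares must satisfy
        always_repeat + harmed_by_arrest = p_repeat_arrest
        always_repeat + helped_by_arrest = p_repeat_away
    and all four shares sum to 1, so one share (always_repeat) is free.
    """
    lo_always = max(0.0, p_repeat_arrest + p_repeat_away - 1.0)
    hi_always = min(p_repeat_arrest, p_repeat_away)
    base = 1.0 - p_repeat_arrest - p_repeat_away  # always_desist minus always_repeat
    return {
        "always_repeat": (lo_always, hi_always),
        # desist if arrested, repeat if sent away
        "helped_by_arrest": (p_repeat_away - hi_always, p_repeat_away - lo_always),
        # desist if sent away, repeat if arrested
        "harmed_by_arrest": (p_repeat_arrest - hi_always, p_repeat_arrest - lo_always),
        "always_desist": (base + lo_always, base + hi_always),
    }


bounds = latent_group_bounds(0.10, 0.24)
for group, (low, high) in bounds.items():
    print(f"{group}: {low:.2f} to {high:.2f}")
```

Under these assumptions, the share helped by arrest lies between 14% and 24% and the share harmed by arrest lies between 0% and 10%. The two reported rates fix only the difference between those shares, which is the 14 percentage point average treatment effect.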
The goal of this article is to provide a framework based on the method proposed by Rubin (1978) and I&R (2015) to gain this deeper understanding. I&R propose a Bayesian approach to drawing inferences about the distribution of treatment effects. The Bayesian approach is especially useful because it leads to estimates of the distribution of treatment effects that are unavailable from a frequentist perspective; also, it leads to probability-based statements that are unavailable or difficult to derive from a frequentist perspective. Our contribution is to provide an accessible discussion of the I&R approach and to provide computing algorithms for commonly experienced outcome types. Evaluators can use the I&R approach to augment traditional experimental analysis in ongoing RCTs; and they can use it to reanalyze data from extant RCTs, thereby adding value to previously reported study findings.
To foster understanding of the value of studying the distribution of treatment effects, we discuss the I&R approach and illustrate its application by using simulated data and by reanalyzing data from a selected RCT study. This article begins with an intuitive perspective and then formalizes the argument. First, we discuss the modern counterfactual logic for evaluation. Readers familiar with this perspective will find the discussion rudimentary, but others should gain important background. We next provide a primer on Bayesian inference. Readers experienced with Bayesian analysis will find this elementary; those inexperienced with Bayesian analysis should gain sufficient background to understand the I&R approach. Next, we integrate the introductions to the counterfactual logic and Bayesian inference to formalize the I&R argument. In an Online Appendix, we cover computing algorithms for four types of outcomes—those measured on a continuous scale (least squares regressions), binary outcomes (probit), ordered outcomes (ordered probit), and countable outcomes (negative binomial with a normal mixture).
The I&R approach requires understanding two issues that may be unfamiliar to evaluators. The first is a purely Bayesian issue, which this article will cast as an imputation problem solved using Bayesian reasoning. The second is an identification problem, meaning that an important parameter cannot be estimated from the data. To simplify the exposition, this article initially ignores the identification problem to focus on the inherently Bayesian aspects of the I&R approach. Then, the article returns to define and treat the identification problem.
A Counterfactual Primer
Many evaluators find the counterfactual perspective useful for thinking about evaluation problems (I&R, 2015; Morgan & Winship, 2015). Table 1 illustrates the counterfactual perspective, and we use this table to explain what we mean by the distribution of treatment effects. The columns are labeled 1–7, and we reference these column numbers throughout the following discussion.
Table 1. Illustration of the Counterfactual Perspective to Evaluation.
In the stylized illustration represented by Table 1, we investigate treatment outcomes in a population of N units. In Table 1, N = 10, but usually N is much larger, and the reader may want to mentally add rows. Each row identifies a unit—Units 1–10 in this table. The second column identifies the outcome if the unit is treated, and the third column identifies the outcome if the unit is not treated. The counterfactual perspective applies to both observational and experimental studies, but here we assume random assignment, so those who are not treated are members of the control group. From the counterfactual perspective, the two potential outcomes are fixed but not necessarily known. The fourth column is the treatment effect, equal to the outcome in the treatment state minus the outcome in the control state. The treatment effects, too, are fixed and potentially heterogeneous.
If we knew the potential outcomes in Columns 2 and 3, then we would know the treatment effect for every unit, and hence the distribution of treatment effects. From that distribution, we could compute summary measures characterizing the distribution: the mean, the median, the quartiles, and so on. As suggested earlier, an especially useful summary is the proportion of units who benefit from treatment, the proportion who are harmed by treatment, and the proportion who are unchanged by treatment, although in Table 1 all units benefit from treatment.
Throughout this article, we use H(…) to denote a function that computes treatment effects from the outcomes in the treatment and control states and converts those outcomes into a summary measure. If δ1, δ2,…, δ10 are the treatment effects in Column 4, then the function H(…) maps from the outcomes under the treatment and control states into the treatment effects and then into the summary measure of the treatment effect distribution, which we call τ to be consistent with I&R’s notation.
Here,
$$\tau = H(Y_{treated}, Y_{control}), \qquad \delta_i = Y_{treated,i} - Y_{control,i}.$$
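As a minimal sketch of what H(…) does when both potential outcomes are known, the Python function below maps outcomes to treatment effects and then to several summary measures. The data are hypothetical values for five units (not the values in Table 1), the function names are ours, and the sketch assumes that a larger outcome is better.

```python
from statistics import mean, median


def treatment_effects(y_treated, y_control):
    """Unit-level effects: outcome if treated minus outcome if not treated."""
    return [yt - yc for yt, yc in zip(y_treated, y_control)]


def summarize(y_treated, y_control):
    """H(...): potential outcomes -> treatment effects -> summary measures."""
    delta = treatment_effects(y_treated, y_control)
    n = len(delta)
    return {
        "mean": mean(delta),
        "median": median(delta),
        "share_benefit": sum(d > 0 for d in delta) / n,
        "share_harmed": sum(d < 0 for d in delta) / n,
        "share_unchanged": sum(d == 0 for d in delta) / n,
    }


# Hypothetical potential outcomes for five units.
y_treated = [12, 9, 15, 7, 10]
y_control = [10, 9, 11, 8, 6]
print(summarize(y_treated, y_control))
```

Any one of the returned entries could play the role of τ. In practice only one of the two outcome lists is observed for each unit, which is why the rest of the discussion turns to imputation.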
Because we cannot observe outcomes in both treatment states, we cannot directly compute a summary measure τ. What if we had a method for imputing outcomes when those outcomes were in fact missing? If that were true, and if we found the imputations credible, then intuition suggests that we would have a basis for estimating the distribution of treatment effects and a basis for estimating the summary measure τ.
To foreshadow, suppose we had a method for estimating the distribution of outcomes under the treatment state and the distribution of outcomes under the control state for every unit in Table 1 as they appear in Columns 2 and 3. Intuition suggests that this is possible. After all, given random assignment, the empirical distribution of observed outcomes in Column 6 of Table 1 should approximate the distribution of unobserved outcomes in Column 6. Likewise, the empirical distribution of observed outcomes in Column 7 of Table 1 should approximate the distribution of unobserved outcomes in Column 7. We might impute unobserved outcomes using a random draw from these empirical distributions. Caution is required. The empirical distribution only approximates the actual distribution, and the outcome of a random draw is only one of many possible outcomes, but nevertheless, intuition points toward a solution.
Temporarily ignoring this caution about the empirical distribution, assume that the empirical distributions for outcomes under the treatment and control conditions are the same as the underlying true distributions, select a random draw from those distributions, and use it to impute missing outcomes. Using the observed and imputed outcomes, compute δ1, δ2,…, δ10. Then, compute the resulting τ. Although this provides an estimate for a single τ, the estimate depends on the specific values from the random draw. Call this first estimate τ1. Repeat the exercise many times, leading to new random draws of the missing outcomes and a series of estimates of τ: τ1, τ2,…, τK, where K is a large number. Intuition suggests that we can use the distribution of τ1, τ2,…, τK to draw inferences about the summary measure τ. This is the essence of the I&R approach, which leads to a probability-based range for the estimate of τ. But to derive a probability-based range, we must formalize the argument to account for the observed empirical distributions of Y treated and Y control being only approximations of the underlying true distributions and for the likelihood that the distributions of Y treated and Y control are not independent. The formality appears in A Bayesian Primer and The I&R Estimator sections. Returning to the intuition from Table 1, we introduce a simple model that leads to a formalized argument.
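The repeated-draw intuition can be sketched in a few lines of Python. This is not the formal estimator: the observed outcomes are hypothetical, the function name is ours, and drawing each missing outcome independently from the other arm's empirical distribution implicitly treats the two potential outcomes as unrelated, which is exactly the simplification the formal argument must repair.

```python
import random
from statistics import median


def impute_once(observed_treated, observed_control, rng):
    """Fill in each unit's missing outcome with a random draw from the other
    arm's empirical distribution and return the unit-level treatment effects."""
    effects = []
    for y_t in observed_treated:   # treated units: control outcome is missing
        effects.append(y_t - rng.choice(observed_control))
    for y_c in observed_control:   # control units: treated outcome is missing
        effects.append(rng.choice(observed_treated) - y_c)
    return effects


rng = random.Random(0)
observed_treated = [12.1, 9.8, 11.4, 10.9, 13.0]   # hypothetical data
observed_control = [7.9, 9.1, 8.4, 10.2, 8.8]      # hypothetical data

K = 1000
tau_draws = [median(impute_once(observed_treated, observed_control, rng))
             for _ in range(K)]
print(f"smallest and largest tau across {K} draws: "
      f"{min(tau_draws):.2f}, {max(tau_draws):.2f}")
```

Here τ is the median treatment effect, and the spread of `tau_draws` is the raw material for a probability-based range.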
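The model is:
$$
\begin{aligned}
Y_{treated} &= X\beta_{treated} + e_{treated} \\
Y_{control} &= X\beta_{control} + e_{control} \\
\delta = Y_{treated} - Y_{control} &= X(\beta_{treated} - \beta_{control}) + (e_{treated} - e_{control}) \qquad (2)
\end{aligned}
$$
Corresponding to Table 1, $Y_{treated}$ is the outcome in the treatment state. For this simple illustration, $Y_{treated}$ is written as a linear function of covariates X and a random error term $e_{treated}$. The $\beta_{treated}$ parameters represent the effect that each element of X has on the outcome in the treatment state. $Y_{control}$ is the outcome in the control state. For this simple illustration, $Y_{control}$ is written as a linear function of the same X covariates and a random error term $e_{control}$. The $\beta_{control}$ parameters represent the effect that each element of X has on the outcome in the control state. The covariates X are always observed. The error terms $e_{treated}$ and $e_{control}$ have a bivariate normal distribution with variances $\sigma^2_{treated}$ and $\sigma^2_{control}$ and correlation ρ.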
The last line represents the treatment effect as the difference between the outcome in the treatment state and the outcome in the control state. That last line shows that treatment effect heterogeneity comes from two possible sources. First, treatment may cause one or more of the β parameters to switch from βcontrol to βtreated. Often evaluators assume that the administration of treatment only shifts a constant, but that is unnecessarily restrictive, because the treatment effect may be moderated by covariates. Second, treatment may cause the error terms to be drawn from different distributions. For exposition, it is convenient to distinguish these two sources as explained treatment heterogeneity and unexplained treatment heterogeneity, but this distinction is artificial in the sense that it depends on which variables X are introduced into the statistical model. Furthermore, “explained” treatment heterogeneity is still subject to sampling error because the β are estimated. Nevertheless, preserving this distinction will be useful. Equation (2) shows how treatment heterogeneity arises; it is the sum of explained and unexplained heterogeneity.
From a frequentist perspective, the parameters identified above are fixed. From this point forward, we abandon this familiar frequentist perspective. Instead, using a Bayesian perspective, the β and σ parameters are random and are drawn from a posterior distribution. The posterior distribution is a Bayesian concept that plays a crucial role in inference. It has no counterpart for a frequentist.
Using techniques described in A Bayesian Primer section, we can estimate the posterior distribution for $\beta_{treated}$, $\beta_{control}$, $\sigma_{treated}$, and $\sigma_{control}$. As noted, from a Bayesian perspective, these parameters are random.
The approach is Bayesian in two regards. In the first regard, this is a multiple imputation problem; we make repeated draws from the posterior distribution of the βs and σs to impute values for Y treated and Y control. In the second regard, τ can be a complex statistic, such as the ratio of positive treatment effects to negative treatment effects, that has no frequentist counterpart.
A Bayesian Primer
The previous section explained the basics of the I&R approach, but we require more formality to derive a credible interval. We use a probability-based method to impute values for the missing outcomes, and we estimate a credible interval for τ using the observed and imputed outcomes. Bayes’s theorem is an often-used probability-based method for performing data imputation (Hoff, 2009, p. 115) and imputation is used in applied statistics (Enders, 2010; Little & Rubin, 2002; Schaefer, 1997). Imputations are sometimes challenged by those unwilling to accept the missing at random assumption, but for the present problem, both missing at random and missing completely at random are met because we assume a random design experiment.
To formalize the argument, a Bayesian primer is helpful. Bertsekas and Tsitsiklis (2008) and Kruschke (2015) provide simple introductions, and other sources provide expanded discussions (Hoff, 2009; Lancaster, 2004; Thompson, 2014). Without attempting to explain Bayes’s theorem, this primer summarizes the role that Bayesian estimation plays in the data imputation necessary to implement the I&R approach. To develop this summary, we ask the reader to consider a familiar linear model:
$$Y = X\beta + e \qquad (6)$$
This model identifies a column vector Y as the realizations of a random variable, a matrix X of explanatory variables, a vector β of parameters, and e as a column vector of random errors that are distributed as normal with variance σ2. The X are seen to explain this outcome, but not completely; the error terms account for how Y departs from the expectation Xβ. We imagine a situation where the Y are sometimes missing but the X are always known. We wish to impute values for the missing Y.
From the perspective of a Bayesian, the parameters are random and come from the posterior distribution, which we write as:
$$P(\beta, \sigma^2 \mid Y, X) = P(\beta, \sigma^2 \mid Y, X, \theta)$$
The expression on the left is an abbreviated version of the expression on the right, which includes an auxiliary parameter(s) θ, discussed later. For now, assume that we know this posterior distribution. We can use the posterior distribution to impute values for Y conditional on X. We only impute outcomes when they are unknown.
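To explain, take a random draw from the posterior distribution to get a specific realization of β and σ2, denoted $\beta_1$ and $\sigma^2_1$. Given that draw, impute the missing outcomes as
$$Y'_1 = G(X, \beta_1, \sigma^2_1) = X\beta_1 + e_1, \qquad e_1 \sim N(0, \sigma^2_1).$$
The Y′ denotes an imputation. The subscript on $Y'_1$ identifies the draw from the posterior distribution on which the imputation is based.
In the simplest case, the posterior distribution is proportional to the likelihood:
$$P(\beta, \sigma^2 \mid Y, X) \propto L(Y \mid X, \beta, \sigma^2) \qquad (7)$$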
Y and X are taken as fixed. There is no maximization problem here; the posterior distribution and the likelihood are the same. A Bayesian might adopt this approach when he or she has no prior information about the distribution of the parameters β and σ2.
The likelihood is familiar to frequentists. It is sometimes called the data model because it comes from assumptions made about the way that Y is generated. We stated those assumptions with Equation (6) but clearly alternative assumptions are possible.
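Sometimes the Bayesian has some prior information, which is expressed as the distribution of β and σ2 conditional on auxiliary parameters θ. When that is the case, the posterior is written as:
$$P(\beta, \sigma^2 \mid Y, X, \theta) \propto L(Y \mid X, \beta, \sigma^2)\, P(\beta, \sigma^2 \mid \theta) \qquad (8)$$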
The prior distribution represents a previously held belief about the distribution of the parameters. For present purposes, information about the prior is probably unavailable, so we will use a flat (that is, uninformative) prior. In practice, there will then be very little difference between Equations (7) and (8). Because Equation (8) is more general, we will use it as a template for deriving the posterior, summarized as $P(\beta, \sigma^2 \mid Y, X)$.
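Returning to the imputation problem, given the posterior distribution, we randomly draw $\beta_1$ and $\sigma^2_1$ and use them to impute the missing outcomes, which yields a completed vector $Y'_1$ of observed and imputed outcomes. Repeating the draw K times yields $Y'_1, Y'_2, \ldots, Y'_K$.
We can derive any statistic of interest τ from a vector of Y. For illustration, suppose this statistic is the median. For each vector $Y'_k$, compute the median $\tau_k$. The distribution of $\tau_1, \tau_2, \ldots, \tau_K$ then provides the basis for a credible interval for τ.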
This section adopted a simple linear model. This model may be unrealistic for a specific problem, but other models are possible. They require changes to the likelihood and to the mechanics of the function G(…), but they do not require changes to the logic.
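The one-state imputation logic of this section can be sketched in Python for the simplest case in which X is only a constant, so that Y = μ + e. The sketch is ours, not the article's algorithm. It assumes the standard noninformative prior proportional to 1/σ², under which σ² given the data is (n − 1)s² divided by a chi-square draw with n − 1 degrees of freedom, and μ given σ² is normal around the sample mean. The data are simulated and all names are hypothetical.

```python
import random
from statistics import mean, median, variance


def draw_posterior(y_obs, rng):
    """One posterior draw of (mu, sigma2) for Y = mu + e, e ~ N(0, sigma2),
    under the prior p(mu, sigma2) proportional to 1/sigma2 (an assumption)."""
    n = len(y_obs)
    s2 = variance(y_obs)                        # sample variance, n - 1 divisor
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)   # chi-square with n - 1 df
    sigma2 = (n - 1) * s2 / chi2
    mu = rng.gauss(mean(y_obs), (sigma2 / n) ** 0.5)
    return mu, sigma2


def median_draws(y_obs, n_missing, k, rng):
    """For each posterior draw, impute the missing Y and compute the median
    of the completed vector (observed plus imputed outcomes)."""
    taus = []
    for _ in range(k):
        mu, sigma2 = draw_posterior(y_obs, rng)
        imputed = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_missing)]
        taus.append(median(y_obs + imputed))
    return sorted(taus)


rng = random.Random(1)
y_obs = [rng.gauss(10.0, 2.0) for _ in range(50)]   # simulated observed outcomes
taus = median_draws(y_obs, n_missing=50, k=2000, rng=rng)
lower, upper = taus[int(0.025 * len(taus))], taus[int(0.975 * len(taus))]
print(f"95% credible interval for the median: {lower:.2f} to {upper:.2f}")
```

A model with covariates would replace the two lines that draw μ and the imputed values; the loop over posterior draws would stay the same.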
The I&R Estimator
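The discussion above provides direction for imputing outcomes under a single state, but when we know the outcome in the treatment state, we must impute the outcome for the control state; when we know the outcome for the control state, we must impute the outcome for the treatment state. Equation (2) already introduced a two-state model. For the reader’s convenience, we repeat that model here:
$$
\begin{aligned}
Y_{treated} &= X\beta_{treated} + e_{treated} \\
Y_{control} &= X\beta_{control} + e_{control} \\
\delta = Y_{treated} - Y_{control} &= X(\beta_{treated} - \beta_{control}) + (e_{treated} - e_{control}) \qquad (9)
\end{aligned}
$$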
Note again that this specification allows the treatment effect (the difference between the outcomes in the treated and control states) to differ for all units and with a little ingenuity we could adopt even more elaborate model specifications. Models are testable using conventional testing procedures, so from a modeling standpoint, we have justification for adopting a specific model if it passes diagnostic testing.
Except for one difficulty, moving from a one-equation model to a two-equation model does not affect our thinking about imputation. So that we could focus on the Bayesian aspects of estimating the distribution of treatment effects, we have ignored this difficulty, but now we deal with its implications. The difficulty is that since we never observe both $Y_{treated}$ and $Y_{control}$ for any single unit, we have no way to estimate ρ. This is problematic because the value of ρ affects the distribution of $e_{treated}$ when $Y_{control}$ is observed and the distribution of $e_{control}$ when $Y_{treated}$ is observed. Normal theory (Bertsekas & Tsitsiklis, 2008, for example) says that, conditional on $e_{treated}$, $e_{control}$ will be distributed as normal with mean equal to $\rho(\sigma_{control}/\sigma_{treated})e_{treated}$ and variance equal to $(1-\rho^2)\sigma^2_{control}$.
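To show how an assumed ρ enters the imputation, the following Python sketch imputes the control-state outcome for a treated unit using the conditional normal result from bivariate normal theory. For clarity it treats the parameters as known, whereas the full procedure would draw them from the posterior; the function and argument names are ours.

```python
import random


def impute_control(y_treated, xb_treated, xb_control,
                   sigma_t, sigma_c, rho, rng):
    """Impute the control-state outcome for a unit observed under treatment.

    Conditional on e_treated, e_control is normal with
    mean rho * (sigma_c / sigma_t) * e_treated and
    variance (1 - rho**2) * sigma_c**2.
    """
    e_treated = y_treated - xb_treated            # residual in the treated state
    cond_mean = rho * (sigma_c / sigma_t) * e_treated
    cond_sd = ((1.0 - rho ** 2) ** 0.5) * sigma_c
    return xb_control + rng.gauss(cond_mean, cond_sd)


rng = random.Random(3)
# Hypothetical unit: observed treated outcome 12, with Xb_treated = 10 and Xb_control = 0.
for rho in (0.0, 0.5, 1.0):
    y_c = impute_control(12.0, 10.0, 0.0, 1.0, 1.0, rho, rng)
    print(f"rho = {rho}: imputed control outcome {y_c:.2f}")
```

When ρ = 1 the conditional variance is zero, so the imputed residual is fully determined by the observed one. When ρ = 0 the observed residual carries no information about the missing one.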
I&R’s solution is to bound estimates when parameters are not identified (Manski, 2007). In the present case, ρ must be between −1 and 1, and we feel comfortable that it is between 0 and 1. Specifically, e treated and e control are the effects of variables that are excluded from the set of explanatory variables X. Likely, many of these omitted variables have effects that are unaltered by the receipt of treatment. If none were altered by the receipt of treatment, ρ = 1. Some may be altered by the receipt of treatment, but likely they have the same qualitative impact on Y, in which case 0 ≤ ρ ≤ 1. Possibly treatment reverses the effect of some omitted variables, but on balance this reversal is likely to be dominated by non-reversals, so 0 ≤ ρ ≤ 1. However, readers need not agree with us that 0 ≤ ρ ≤ 1 because the I&R procedure does not require such an assumption.
We also assume that the distribution of e treated and e control is bivariate normal. Justification comes from the usual logic that the error terms are the sums of a large number of independent omitted variables that come from no specific distributions. From the central limit theorem, their sums will be normal. Inducing normality may require making a transformation, such as modeling the log of Y instead of Y. Modest departures from normality appear to make no important difference in imputation models (Enders, 2010; Schaefer, 1997), so it seems likely that departures from normality would be unimportant for the present problem as well.
From the bounding perspective, we can impute missing values for the outcomes assuming first that ρ = 0 and assuming second that ρ = 1, or assuming any intermediate value for ρ. Starting with ρ = 0, we estimate a credible interval for τρ. Then, assuming that ρ = 1, we estimate another credible interval for τρ. Possibly, one credible interval will overlap the other, but regardless, we select the lowest lower bound and the highest upper bound of the two as our best bounded range of estimates for τ. The result is no longer a credible interval, but rather, a bounded estimate of a credible interval. We can hope that the bounded estimate is sufficiently narrow to be informative, but that remains to be seen.
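The bounding step itself is simple arithmetic, as the short Python sketch below shows. The intervals are hypothetical numbers chosen for illustration, and the function name is ours.

```python
def bounded_interval(intervals_by_rho):
    """Envelope of credible intervals computed under different assumed rho:
    the lowest lower bound and the highest upper bound."""
    lowers = [low for low, _ in intervals_by_rho.values()]
    uppers = [high for _, high in intervals_by_rho.values()]
    return min(lowers), max(uppers)


# Hypothetical 95% credible intervals for tau under two assumed values of rho.
intervals = {0.0: (6.4, 8.6), 1.0: (6.1, 8.9)}
print(bounded_interval(intervals))  # (6.1, 8.9)
```

The same function accepts intervals for any number of intermediate values of ρ, which matters when the widest interval does not occur at an endpoint.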
Another way to look at this identification issue is that we are using an uninformative prior for the β and σ2 and a fully informative prior for ρ. Nothing in the estimation will cause that prior assumption about ρ to change. Thus, we adopt different prior assumptions about ρ and ask: How do alternative assumptions affect conclusions about the distribution of τ?
Using a Neyman estimator, I&R prove (section 6.4, page 87, and chapter appendix) a peculiarity about these different measures: As ρ increases from 0 to 1 and as ρ decreases from 0 to −1, the credible interval for the mean treatment effect estimate widens. This will also be true for other summary measures. The next section provides a less rigorous discussion and intuition for why this observation matters.
The Role of ρ and the Construction of Bounds
The previous section explained that knowledge of ρ is required to impute Y treated when Y control is known and to impute Y control when Y treated is known. In this section, we discuss how treatment heterogeneity decreases monotonically as the true value of ρ goes from −1 to 1. This monotonic decrease is due to the effect of ρ on the variance of unmeasured heterogeneity, e treated − e control. We establish that the variance of τ follows a more complicated pattern, but it tends to increase as ρ goes from 0 to 1, and hence, the credible interval for τ tends to widen as ρ increases. The implication may be unintuitive: As treatment heterogeneity increases, the credible interval tends to decrease. This pattern will appear in the next two sections where we work with artificial and real-world data. One purpose of this section is to provide intuition for why these patterns occur.
Given this intuition, we discuss the effect of bounding ρ because, in practice, ρ is unknown and unknowable. Recognizing that the largest range for ρ is −1 ≤ ρ ≤ 1, we show how bounds on ρ lead to bounded credible intervals for τ.
For discussion, this section distinguishes three key concepts: treatment effect heterogeneity, credible intervals for treatment effect heterogeneity, and most importantly, bounded credible intervals.
Understanding these three concepts is crucial for interpreting estimates of the distribution of treatment effects.
Treatment Effect Heterogeneity
Consider treatment effect heterogeneity first. For simplicity, assume that X comprises a constant, and assume that the β parameters (μtreated and μcontrol) are known with certainty. There is no loss of generality here because the following discussion concerns unmeasured heterogeneity holding measured heterogeneity fixed. Also, for simplicity, assume that
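Unmeasured heterogeneity can be summarized as the variance of $e_{treated} - e_{control}$. This is the variance for the difference between two random variables, which is written using a standard formula as:
$$VAR(e_{treated} - e_{control}) = \sigma^2_{treated} + \sigma^2_{control} - 2\,COV(e_{treated}, e_{control})$$
where COV is the covariance of the model’s residuals. Of course, $COV(e_{treated}, e_{control}) = \rho\,\sigma_{treated}\,\sigma_{control}$.
Simple calculus shows that
$$\frac{\partial\, VAR(e_{treated} - e_{control})}{\partial \rho} = -2\,\sigma_{treated}\,\sigma_{control} < 0,$$
so unmeasured heterogeneity decreases monotonically as ρ goes from −1 to 1.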
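A quick numerical check of this relationship, written as a Python sketch with both standard deviations set to 1 purely for illustration (the function name is ours):

```python
def var_unmeasured(sigma_t, sigma_c, rho):
    """Variance of e_treated - e_control for a given correlation rho."""
    return sigma_t ** 2 + sigma_c ** 2 - 2.0 * rho * sigma_t * sigma_c


for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"rho = {rho:+.1f}: variance = {var_unmeasured(1.0, 1.0, rho):.1f}")
```

With equal unit variances, the variance of unmeasured heterogeneity falls linearly from 4 at ρ = −1 to 0 at ρ = 1.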
Credible Intervals for Treatment Effect Heterogeneity
Regarding the second issue of the variance of τ, and hence, the width of the credible interval, intuition and simple explanations fail. From I&R (2015, p. 161), “Note that the missing outcomes are no longer independent. Conditional on the parameters…they were independent, but the fact that they depend on common parameters introduces some dependence.” This observation leads to a complicated variance formula, and except for special circumstances there are no analytical solutions (p. 171). Thus, except for a simple case, we cannot derive a formula demonstrating the relationship between ρ and the size of the credible interval.
Lacking an analytical formula to establish the point, we simulate data where there are no covariates: Xβtreated = μtreated = 10, Xβcontrol = μcontrol = 0,
Table 2. Illustration of How Increasing ρ Increases the Credibility Interval.
Estimating the mean treatment effect is a simple problem that could be accomplished using a frequentist approach that does not depend on ρ because ρ does not affect the average outcome under the treatment and control states. The bottom row reports the estimated mean treatment effect and its 95% confidence interval. Notably, the Bayesian approach and the frequentist approach lead to similar estimates, but of course, the frequentist approach is not available for many other choices for τ.
In this simulation, setting ρ = 0 does not lead to the lowest variance estimator. That honor belongs to ρ = −0.50. Finer gradations of ρ might show a different minimum. In this illustration, at least, the variance of τ declines from a local maximum to a global minimum and then climbs to a global maximum as ρ increases from −1 to +1.
Thus, following I&R, we anticipate that credible intervals for τ will grow wider as choices of ρ go from 0 to 1, but analytical variance estimators are lacking for more complicated models. Moreover, the findings pertain to a specific τ statistic: the mean for the distribution of δ. We have no reason to suppose that it pertains to other summary measures such as the proportion for whom treatment has no effect, the proportion for whom treatment is beneficial, and the proportion for whom treatment is harmful. Thus, we remain agnostic.
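For readers who want to experiment, the following Python sketch runs a small simulation of how the credible interval for the mean treatment effect changes with the assumed ρ. It is not the code behind the reported results. It assumes no covariates, means of 10 and 0, unit error variances, 100 units per arm, and the noninformative prior proportional to 1/σ² for each arm; the sample size, variances, prior, and all names are our assumptions.

```python
import random
from statistics import mean, variance


def draw_arm(y, rng):
    """Posterior draw of (mu, sigma) for one arm, prior proportional to 1/sigma^2."""
    n = len(y)
    sigma2 = (n - 1) * variance(y) / rng.gammavariate((n - 1) / 2, 2.0)
    mu = rng.gauss(mean(y), (sigma2 / n) ** 0.5)
    return mu, sigma2 ** 0.5


def mean_effect_draw(y_t, y_c, rho, rng):
    """One draw of tau, the mean treatment effect over all units, for a fixed rho."""
    mu_t, s_t = draw_arm(y_t, rng)
    mu_c, s_c = draw_arm(y_c, rng)
    cond = (1.0 - rho ** 2) ** 0.5
    total = 0.0
    for y in y_t:   # treated units: impute the missing control outcome
        y_c_imp = mu_c + rho * (s_c / s_t) * (y - mu_t) + rng.gauss(0.0, cond * s_c)
        total += y - y_c_imp
    for y in y_c:   # control units: impute the missing treated outcome
        y_t_imp = mu_t + rho * (s_t / s_c) * (y - mu_c) + rng.gauss(0.0, cond * s_t)
        total += y_t_imp - y
    return total / (len(y_t) + len(y_c))


def interval_width(y_t, y_c, rho, k, rng):
    """Width of the central 95% interval of k draws of tau."""
    taus = sorted(mean_effect_draw(y_t, y_c, rho, rng) for _ in range(k))
    return taus[int(0.975 * k)] - taus[int(0.025 * k)]


rng = random.Random(2)
y_t = [rng.gauss(10.0, 1.0) for _ in range(100)]   # simulated treated outcomes
y_c = [rng.gauss(0.0, 1.0) for _ in range(100)]    # simulated control outcomes
widths = {rho: interval_width(y_t, y_c, rho, 1000, rng) for rho in (0.0, 0.5, 1.0)}
for rho, width in widths.items():
    print(f"rho = {rho}: 95% interval width = {width:.3f}")
```

Under these assumptions we would expect the interval to widen as the assumed ρ moves from 0 to 1. Replacing the mean with another summary of the unit-level effects is a one-line change, which makes the sketch a convenient way to explore whether the pattern carries over to other choices of τ.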
Bounded Credible Intervals
Although knowing the relationship between ρ and both unmeasured heterogeneity and the credible interval is useful, the practical problem is that to estimate τ and its credible interval, we must know ρ. We do not know ρ, so the above discussion may seem esoteric. Our solution is to apply Manski’s (2007) approach to this identification problem: We adopt assumptions about the possible bounds on ρ, estimate the credible interval for different values of ρ within those bounds, and identify the lowest and highest value for the credible interval over the entire range of ρ. I&R assess the situation (p. 169): “The main point to take from this section is that the correlation coefficient (ρ) between the two potential outcomes is somewhat different from other parameters of the model because the data generally do not contain empirical information about it (ρ)…. This leaves us with the question of how they should be modeled. Sometimes we ‘choose’ to be conservative about this dependence and therefore assume the worst case. In terms of the posterior variance (e.g. the variance of τ), the worst case is often the situation of perfect correlation between the two potential outcomes.” (Parenthetical inserted, and emphasis added.)
Consider an illustration: Suppose that τ is defined as the largest decile of treatment effects, that is, the lower limit on the largest 10% of treatment effects. Then, τ will be largest when ρ = −1 and smallest when ρ = 1. This follows from the first observation that true heterogeneity decreases as ρ goes from −1 to 1. Whatever the relationship between the variance of τ and ρ, it is likely that the credible interval for τ will be A to B for ρ = −1 and C to D for ρ = 1, and hence the bounded credible interval may be A to D. Again, there is nothing “conservative” about assuming ρ = 1.
To this point, this section has focused on the role that unmeasured heterogeneity plays in explaining total heterogeneity. For some purposes, an evaluator might concentrate on explained heterogeneity, defined by Equation (9) as the distribution of $X(\beta_{treated} - \beta_{control})$.
Thus, the form of conservatism recognized by I&R is not, in general, how an evaluator thinks about the problem. The next two sections show how to apply the I&R approach to a typical experimental evaluation. The following discussion will borrow on the points made in this section.
A Monte Carlo Illustration
To illustrate the approach, we simulate data from a known data generation process. This model is familiar from earlier presentations. In this simulation, we adopt the data model:
Notice that the treatment effect is heterogeneous. On average, treatment improves outcomes, but the average effect is greater for units with larger values of X, and the effect will depend on the error terms. The sample size is 1,000; the number of repetitions K is 10,000. The number of repetitions is large for two reasons. First, we have emphasized that estimation works by taking random draws of the βs and σs from the posterior distribution. To improve efficiency, Stata uses a Markov chain Monte Carlo algorithm, but this generates a lumpy distribution when K is small (Thompson, 2014), so a large K is necessary to smooth the distribution. Second, we are especially interested in the tail of the τ distribution, and a large K is necessary to assure a reasonable approximation for that tail.
X and the observed Y are the same across all simulations. Half the sample is assigned to treatment. We first seek to estimate the median treatment effect, which should be about 7.5 = (5 − 0) + (10 − 5) × 0.5. (Because of sampling variance, it is actually 7.309.) Looking for boundary solutions, first we constrain ρ equal to 0, then equal to 0.5, and finally equal to 1 even though ρ is known to equal 0.7 in the simulated data.
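A data generation process of this kind can be sketched as follows. The data model itself is not reproduced in this text, so the sketch assumes a form consistent with the arithmetic 7.5 = (5 − 0) + (10 − 5) × 0.5: X uniform on (0, 1), Y(0) = 0 + 5X + e0, and Y(1) = 5 + 10X + e1. The error standard deviation of 10, the seed, and all names are assumptions made for illustration; only n = 1,000, the 50% assignment, and ρ = 0.7 come from the text.

```python
import random
from math import sqrt
from statistics import median

def simulate(n=1000, rho=0.7, sigma=10.0, seed=12345):
    """Simulate both potential outcomes and a randomized observed outcome.

    Assumed model (not taken verbatim from the article):
      X ~ Uniform(0, 1), Y(0) = 0 + 5X + e0, Y(1) = 5 + 10X + e1,
      with sd(e0) = sd(e1) = sigma and corr(e0, e1) = rho.
    """
    rng = random.Random(seed)
    treated_ids = set(rng.sample(range(n), n // 2))  # half assigned to treatment
    units = []
    for i in range(n):
        x = rng.random()
        z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e0 = sigma * z0
        e1 = sigma * (rho * z0 + sqrt(1.0 - rho**2) * z1)  # induces corr(e0, e1) = rho
        y0 = 0.0 + 5.0 * x + e0
        y1 = 5.0 + 10.0 * x + e1
        treated = i in treated_ids
        units.append({"x": x, "y0": y0, "y1": y1, "treated": treated,
                      "y_obs": y1 if treated else y0})
    return units

units = simulate()
# Both potential outcomes are known only because the data are simulated.
true_median_effect = median(u["y1"] - u["y0"] for u in units)
```

In a real evaluation only `y_obs` would be available; the simulated `y0` and `y1` serve as a benchmark against which the imputed distribution can be checked.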
When we set ρ = 0, the mean for the distribution of the median treatment effect (i.e., the mean of the distribution of τ) is 7.57 with a standard deviation of 0.559. The 95% credible interval is 6.45–8.64 (see Table 3).
Table 3. Simulation Study of How ρ Affects the Credible Interval.
Next, assume ρ = 0.5. (Its true value remains 0.7.) A 95% credible interval runs from 6.35 to 8.67. The mean of this distribution is 7.53 with a standard deviation of 0.595. As we have increased the assumed value of ρ from 0 to 0.5, the standard deviation has increased as has the credible interval.
Finally, still estimating the median effect subject to bounding, we set ρ = 1. As before, the mean of the distribution for the median is close to expectations and its standard deviation is 0.614. The credible interval is 6.28–8.71. Note that the standard deviation and credible interval have increased again. Given the discussion in the previous section, this increase is expected. From Manski’s perspective of bounding, we might prefer the estimates where ρ is forced to equal 1 since this is the most conservative yet clearly informative assumption about ρ. That is, it is the bounded interval given no knowledge of ρ except the assumption that ρ > 0.
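The bounding step can be sketched in a few lines. The intervals below are the three 95% credible intervals reported in the text for the median effect; the function name is hypothetical, and evaluating only three values of ρ is a simplification of searching the whole assumed range.

```python
def bounded_interval(intervals):
    """Lowest lower limit and highest upper limit across the assumed values of rho."""
    lows = [low for low, high in intervals.values()]
    highs = [high for low, high in intervals.values()]
    return min(lows), max(highs)

# 95% credible intervals for the median effect, keyed by the assumed rho.
ci_by_rho = {0.0: (6.45, 8.64), 0.5: (6.35, 8.67), 1.0: (6.28, 8.71)}
bounds = bounded_interval(ci_by_rho)
```

For the median, the bounded interval coincides with the interval at ρ = 1 because that interval contains the other two. This need not hold for other summary measures.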
The above is intended to illustrate the assertions made in the previous section. Traditional parametric and nonparametric methods exist for estimating the mean or median effect, so we would not struggle with the above steps if estimating a mean or median were the objective. Nevertheless, applying these steps is instructive and we can consider other summary measures that are uniquely Bayesian.
Extending this illustration, suppose we want to know the maximum size of the treatment effect for the 10% of units for whom the treatment effect is smallest and the minimum size of the treatment effect for the 10% of units for whom the treatment effect is largest. These are inherently Bayesian statistics that lack a frequentist counterpart. Operationally, for every simulation, we identify the first and ninth deciles for the distribution of δ. We then take the mean of the first and ninth deciles over the 10,000 repetitions. This is τ.
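The operational steps can be sketched as below. This is a simplified illustration rather than the article's implementation: it assumes bivariate normal errors, so that a missing potential outcome is drawn from its normal distribution conditional on the observed outcome and the assumed ρ, and it reuses one fixed parameter draw in place of fresh posterior draws from an MCMC sampler. The toy data, parameter values, and names are hypothetical.

```python
import random
from math import sqrt
from statistics import quantiles

def impute_effects(units, draw, rho, rng):
    """Impute each unit's missing potential outcome and return the effects.

    units: list of (x, y_obs, treated); draw: one parameter draw holding
    intercept/slope pairs b0, b1 and error SDs s0, s1 (hypothetical names).
    """
    (a0, c0), (a1, c1) = draw["b0"], draw["b1"]
    s0, s1 = draw["s0"], draw["s1"]
    resid_scale = sqrt(max(1.0 - rho**2, 0.0))
    effects = []
    for x, y_obs, treated in units:
        m0, m1 = a0 + c0 * x, a1 + c1 * x
        if treated:   # Y(1) observed, Y(0) missing
            y1 = y_obs
            y0 = rng.gauss(m0 + rho * (s0 / s1) * (y1 - m1), s0 * resid_scale)
        else:         # Y(0) observed, Y(1) missing
            y0 = y_obs
            y1 = rng.gauss(m1 + rho * (s1 / s0) * (y0 - m0), s1 * resid_scale)
        effects.append(y1 - y0)
    return effects

def decile_summary(units, draws, rho, rng):
    """Mean of the first and ninth deciles of the effects across draws."""
    firsts, ninths = [], []
    for draw in draws:
        cuts = quantiles(impute_effects(units, draw, rho, rng), n=10)
        firsts.append(cuts[0])
        ninths.append(cuts[8])
    return sum(firsts) / len(firsts), sum(ninths) / len(ninths)

# Toy data and a single fixed parameter draw, both hypothetical.
rng = random.Random(2024)
draw = {"b0": (0.0, 5.0), "b1": (5.0, 10.0), "s0": 10.0, "s1": 10.0}
units = []
for i in range(200):
    x = rng.random()
    treated = i % 2 == 0
    mean = (5.0 + 10.0 * x) if treated else (0.0 + 5.0 * x)
    units.append((x, rng.gauss(mean, 10.0), treated))

tau_low, tau_high = decile_summary(units, [draw] * 50, rho=0.0, rng=rng)
```

In a full analysis each element of `draws` would be a separate draw of the βs and σs from the posterior distribution, and the credible interval for each decile would come from the percentiles of `firsts` and `ninths` rather than only from their means.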
Assuming that ρ = 1, the mean for the distribution of the lowest 10% of treatment effects is 6.09 with a standard deviation of 0.843. This is sensible because it is close to (5 − 0) + (10 − 5) × 0.10 = 5.5. The credible interval is 4.32–7.65. The mean for the distribution of the highest 10% of treatment effects is 8.92 with a standard deviation of 0.916. This is sensible because it is close to (5 − 0) + (10 − 5) × 0.90 = 9.5. The credible interval is 7.34–10.61.
This bounded estimate is helpful. It indicates that even those who benefit the least from the intervention still benefit. However, as evaluators, we are mindful that the value of ρ is unknown, and while we might suspect it is near 1, we cannot be sure. What happens if we set ρ = 0? Now the 10% of units who do the worst under treatment appear to be harmed by treatment. The mean for τ is −11.13 and a credible interval is between −12.63 and −9.57. For the 10% of the units who do the best under treatment, the mean for τ is 26.19 and a credible interval is between 24.74 and 27.69. The 10% of units who do the best under treatment do very well. We anticipated this result with the discussion in the previous section. When ρ = 1, the upper and lower deciles are closer to each other, and their credible intervals are larger, compared with the assumption of ρ = 0.
As evaluators, we might choose to augment knowledge by just estimating the credible interval for a summary measure of the treatment effects conditional on x:
There are other ways to summarize the distribution of treatment effects. One other approach is to put subjects into one of the three categories: The benefit was substantial, the harm was substantial, or there was neither substantial benefit nor harm. For purposes of illustration, in this simulation, we assume that an effect of ±10 is substantial. Our principal interest is with balance between the winners and losers in this hypothetical treatment lottery, so we consider a bounding scenario of ρ = 0.
The simulation shows that between 0.095 and 0.134 of the units are substantially harmed by treatment, that between 0.423 and 0.483 of the units experience neither substantial harm nor substantial benefits, and that between 0.278 and 0.358 more of the population benefit than are harmed. (This latter range comes from subtracting the proportion harmed from the proportion benefitting in each iteration.) Even a pessimistic assumption that ρ = 0 suggests that a substantial proportion of the units benefit from treatment and that for every substantial loser there are about four substantial gainers. This bounded solution is more informative than a simple conclusion that the average treatment effect is positive.
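The three-category summary can be sketched as follows. The ±10 threshold comes from the text; treating an effect of exactly ±10 as substantial, the function names, and the stand-in effects generated from a normal distribution are assumptions made for illustration and do not reproduce the article's figures.

```python
import random
from statistics import quantiles

def category_shares(effects, threshold=10.0):
    """Shares substantially harmed, neither, substantially benefited, and the net share."""
    n = len(effects)
    harmed = sum(1 for d in effects if d <= -threshold) / n
    benefited = sum(1 for d in effects if d >= threshold) / n
    return {"harmed": harmed, "neither": 1.0 - harmed - benefited,
            "benefited": benefited, "net": benefited - harmed}

def credible_interval(values):
    """2.5th and 97.5th percentiles of a list of posterior draws."""
    cuts = quantiles(values, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]

# Hypothetical stand-in for imputed effects: K iterations of n units each.
rng = random.Random(7)
iterations = [[rng.gauss(7.5, 14.0) for _ in range(500)] for _ in range(200)]

# The net share is computed within each iteration, then summarized across iterations.
net_draws = [category_shares(effects)["net"] for effects in iterations]
net_low, net_high = credible_interval(net_draws)
```

Computing the net share inside each iteration, rather than subtracting one interval from another, preserves the dependence between the share harmed and the share benefited.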
Using Real-World Data
We apply the I&R procedure to a study by Gaes and Camp (2009) who reanalyzed experimental data collected by Berk et al. (2003). The purpose of the Gaes/Camp paper was to evaluate the effect of prison security level assignment on postrelease behavior. Gaes and Camp and other scholars (Bench & Allen, 2003; Chen & Shapiro, 2007) have argued that the prison environment has criminogenic properties. Higher security prisons are more criminogenic than lower security prisons implying prisoners assigned and released from higher security prisons will have a higher probability of recidivating. Plausibly, however, some offender outcomes are improved by higher security levels, which may be necessary for those offenders to adjust to prison rehabilitation regimes.
Berk et al. (2003) randomly assigned inmates to California prison security levels to test modifications made to the scoring of California’s inmate classification system. A total of 561 inmates had scores consistent with placing them in a Level III (high security) prison. Of the 561, 264 were randomly assigned to a Level I (low security) prison. Considering low security as the control condition and high security as the treatment condition, Gaes and Camp evaluated prison assignment using both nonparametric and semi-parametric survival analyses. A Cox regression showed that on average treatment increased the hazard of recidivating by 31.1%.
We reanalyzed these data using the I&R procedure based on a binary outcome. (An Online Appendix discusses the estimator for a binary outcome.) To create a binary variable, we chose the minimum time at risk for the 561 people in the sample. This was 192 days. If someone was returned to prison in that time frame, they were assigned a value of 1 and 0 otherwise. The analysis included the following covariates: age at release, race, Hispanic origin, conviction crimes (person, drug, property, and other where property was the excluded category), and a dummy variable recording whether someone had an arrest prior to age 17. The raw Level III (treatment) and Level I (control) recidivism rates were 37.5% and 25.2%, so the ATE was 12.3 percentage points.
We applied the I&R technique setting ρ to 0, 0.5, and 1. We sought to learn how frequently treatment increased recidivism, how frequently it had no effect, and how frequently treatment decreased recidivism. First, we imputed the latent variable Xβ for each of the K iterations. Second, we perturbed the latent variable by randomly drawing ∊ from the conditional standard normal distribution to get Xβ + ∊. Third, if Xβ + ∊ was greater than zero, we assigned recidivism as the outcome and otherwise we assigned no recidivism. Thereby, we determined whether the outcome was worse under treatment, the same under treatment, or improved under treatment. The results are shown in Tables 4 and 5.
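The three steps can be sketched for a single unit as below. The estimator itself is described in the Online Appendix, which is not reproduced here, so this sketch rests on an assumed reading of “conditional standard normal”: the latent error in the observed state is drawn from a standard normal truncated to agree with the observed outcome, and the error in the unobserved state is then drawn conditional on it with correlation ρ. All names and input values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal

def draw_observed_error(xb, y, rng):
    """Latent error consistent with the observed outcome: xb + e > 0 iff y == 1."""
    cut = STD.cdf(-xb)
    low, high = (cut, 1.0) if y == 1 else (0.0, cut)
    u = low + (high - low) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside its open domain
    return STD.inv_cdf(u)

def classify_unit(xb_obs, xb_cf, y_obs, treated, rho, rng):
    """Impute the counterfactual binary outcome and compare the two states."""
    e_obs = draw_observed_error(xb_obs, y_obs, rng)
    e_cf = rng.gauss(rho * e_obs, sqrt(max(1.0 - rho**2, 0.0)))
    y_cf = 1 if xb_cf + e_cf > 0 else 0
    y_treat, y_ctrl = (y_obs, y_cf) if treated else (y_cf, y_obs)
    if y_treat > y_ctrl:
        return "worse"    # recidivism under treatment only
    if y_treat < y_ctrl:
        return "better"   # recidivism under control only
    return "same"

rng = random.Random(11)
# Hypothetical treated unit that recidivated, with latent indexes 0.3 and -0.2.
label = classify_unit(xb_obs=0.3, xb_cf=-0.2, y_obs=1, treated=True, rho=0.5, rng=rng)
```

Repeating the classification for every unit within each of the K iterations, and tabulating the three labels per iteration, would yield the distributions of the percentages.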
Table 4. Mean Coefficients, Standard Deviations, and 95% Credible Intervals for Parameters in the Level III and Level I Probit Bayesian Regression Models.
Table 5. Imbens/Rubin Procedure Applied to the Gaes–Camp Data: The Percentage of Offenders Who Recidivate Within 192 Days After Release From Prison Comparing Those Initially Assigned to a Level III as Opposed to a Level I Prison (Credible Interval = 2.5–97.5).
Table 4 summarizes Bayesian regression results. The first column identifies variables. The second and third columns report the mean βs and standard deviations from the posterior distribution of the Bayesian regression on the treatment group. The fourth and fifth columns report the low and high values for the 95% credible interval. The sixth and seventh columns show the mean βs and standard deviations from the posterior distribution of the Bayesian regression on the control group. The eighth and ninth columns report the low and high values for the 95% credible interval. The last column indicates whether the mean β parameters from the second and sixth columns are statistically different. Only the coefficient for race was different between the two models.
Table 5 summarizes estimates for τ, defined alternatively as percent with less recidivism given treatment, percent with the same level of recidivism given treatment or control, and percent with more recidivism given treatment. Under all three assumptions about ρ, most offenders had equivalent outcomes under treatment and control. The 2.5%–97.5% credible intervals for the scenarios also appear in the table. Examining those credible intervals shows that they are informatively narrow. In all scenarios, a higher proportion of offenders are harmed (high recidivism) than helped by treatment. However, for a substantial number, there is no difference in the outcome, presumably because most offenders avoid recidivism during the follow-up period.
Still, the credible intervals are large conditional on ρ. An alternative definition of τ is informative: Define τ as the difference between the percentage who are harmed and helped by treatment, that is, as the difference between the τs defined above. This is a direct measure of benefit and harm from assigning prisoners to higher level security, and it is relatively insensitive to assumptions about ρ. Assuming that ρ can be between 0 and 1, the bounded credible interval is 6.2–19.6. This is not much larger than the actual credible interval conditional on ρ = 1.
What is a reasonable assumption about ρ? Consistent with the earlier discussion on the meaning of ρ, we justify our selection of ρ between 0 and 1. If we think of the recidivism outcomes as an ordered set of probabilities from low to high in the control group, then whether we theorize higher security levels are criminogenic or rehabilitative, it seems unlikely that the security-level assignment would reverse the order. That is, higher risk offenders likely remain at a higher risk in both states.
Put another way, observed and unobserved covariates explain some of the systematic variation in recidivism rates in both the higher risk and lower risk prisoners. The ρ pertains to unobserved variables that affect outcomes in those two states. It seems very likely that those unobserved variables have about the same influence in the treated and control states, suggesting that ρ is close to 1. It seems very unlikely that those unobserved variables have strong reversed influence in the treated and control states, suggesting that ρ is unlikely to be less than 0. Bounding is informative in this real-world illustration.
Discussion
Treatment effects are not necessarily heterogeneous; however, we suspect that heterogeneous treatment effects are common. The heterogeneity may be small, but when it is large, it becomes an important outcome worthy of scientific study.
Some evaluators have attempted to explain treatment heterogeneity by examining how the size of treatment effects varies with covariates or across strata. In the term used in this article, these evaluators have looked for explained treatment heterogeneity. An alternative approach, developed by I&R (2015), uses Bayesian imputation procedures to estimate the entire distribution of treatment effects, both explained and unexplained treatment heterogeneity. The two approaches are complementary but not equivalent.
A skeptical reader may still ask why a Bayesian analysis is necessary, especially since our estimation procedures used noninformative priors. For the reader unfamiliar with noninformative priors, think of a situation where the evaluator uses a statistical distribution with explicitly defined parameters that represent the treatment effects, such as those we used in the domestic violence hypotheticals. In the absence of any prior information about the treatment effects, the prior distribution of parameters tends to be wider and “flatter” and the influence on the posterior distribution is small relative to the likelihood function. This is a kind of statistical admission that prior to a study, we have little or no knowledge about the effect of treatment, but we do know that it has a likely distributional form, bivariate normal after conditioning on covariates. In our case, we need a procedure to draw parameters from the posterior distribution allowing us to impute values for our unobserved treatment and control counterfactual outcomes.
Of course, one of the strengths of Bayesian inference comes from using prior knowledge about the distribution of parameters. Gelman (2002) distinguishes between highly, moderately and noninformative priors. Knowledge can come from many sources including: meta-analyses of a research domain, an a priori understanding of the mean and shape of the parameter space, or precise empirical information on the parameter space. There is no reason why the I&R procedure cannot be adapted to incorporate an informative prior. We have not taken that step in this article, but we have shown how informative priors could be introduced into the analysis. Furthermore, in a technical sense, our assumptions about ρ might be considered to use alternative informative priors at least about ρ.
If we used maximum likelihood to estimate the missing counterfactuals (the frequentist approach), we would underestimate the imputation variability unless the sample size was quite large. We would only get similar results from a maximum likelihood and Bayesian procedure if the sample sizes were extremely large. I&R (2015) state this more formally. While a frequentist can avoid the choice of the prior distribution, it comes with a cost: “Nearly always one has to rely on large sample approximations to justify the derived frequentist confidence intervals” (p. 174).
Furthermore, from this method, we get Bayesian-estimated quantities that allow us to characterize the distribution of treatment effects beyond a mean and median. As we have noted, these include more informative yet simple summary measures such as the proportion of people who benefit from treatment, the proportion who are harmed by treatment, and the proportion for whom treatment has no effect. Using a bit more ingenuity, the evaluator could calculate alternative statistics such as the proportion who benefit substantially and the proportion who benefit marginally. The choice of statistics is limited only by the needs of the research.
For example, in the introduction, we suggested that estimates of the distribution of treatment effects are important for benefit/cost analysis. Returning to the domestic violence illustration, policy makers might put a value of 10 units on domestic violence that is deterred by an arrest, they might put a value of −5 units on a domestic violence incident that occurs despite an arrest, and they might put a value of −15 units on a domestic violence incident that occurs because of an arrest. Given this loss equation, even a policy of arresting batterers that has no average effect on the rate of domestic violence can have a profound effect on social welfare, but an assessment requires some estimate of the distribution of treatment effects.
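The loss equation can be made concrete with a small calculation. The three valuations (10, −5, −15) come from the text. The value of zero for households with no incident under either policy, the two sets of shares, and the names are hypothetical choices made for illustration. In both scenarios the share deterred equals the share harmed, so arrest has no average effect on the rate of domestic violence.

```python
# Valuations from the text; "no_incident" = 0 is an added assumption.
VALUES = {"deterred": 10.0, "despite": -5.0, "caused": -15.0, "no_incident": 0.0}

def expected_welfare(shares, values=VALUES):
    """Weighted sum of valuations, given shares of units in each category."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(values[name] * share for name, share in shares.items())

# Hypothetical scenario A: arrest changes no one's outcome.
scenario_a = {"deterred": 0.0, "despite": 0.3, "caused": 0.0, "no_incident": 0.7}
# Hypothetical scenario B: arrest deters 20% and provokes 20%.
scenario_b = {"deterred": 0.2, "despite": 0.3, "caused": 0.2, "no_incident": 0.3}

welfare_a = expected_welfare(scenario_a)
welfare_b = expected_welfare(scenario_b)
```

The two scenarios share the same average treatment effect yet produce different welfare totals, which is why the calculation needs the distribution of treatment effects and not only its mean.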
The I&R approach has a limitation. Given the fundamental problem that potential outcomes are observed in the treatment state, or in the control state, but never in both states simultaneously, an evaluator cannot estimate the correlation between the outcomes in the control and treatment states. Consequently, the evaluator cannot estimate a credible interval without assuming a value, or a range of values, for that correlation. This is disconcerting, but we have demonstrated that bounding solutions can provide useful insight. This finding may not be true of every evaluation.
The I&R solution has two components. There is explained heterogeneity attributable to the effects that covariates have on the outcomes in the treated and control states. Given random assignment, the explained heterogeneity is identified, and its estimation does not require bounding. Additionally, there is unexplained heterogeneity, arising from the error terms in the data model. The variances for those error terms are identified, but their correlation is not. Bounding comes from having to guess the size of that correlation.
Although it is unsettling to lack the means to identify the correlation, there is comfort that bounding plays a lessened role as the explained heterogeneity increases. The introduction of covariates is directly important because knowledge about the explained heterogeneity is useful and is indirectly important because an increase in the size of the explained heterogeneity reduces the importance of the unidentified correlation.
Some evaluators object to introducing covariates into an RCT (Freedman, 2008); others are attracted by the prospects of reducing standard errors while being cautious about over-fitting (Lin, 2012). Again, consider what models that introduce covariates tell us. The systematic Xβ part of the model corresponds to a conditional mean. It is not necessarily a statement of causality. Thus, the introduction of covariates allows the evaluator to estimate systematic correlation between covariates and treatment effects and, also, to reduce the importance that the unidentified ρ plays in the analysis.
Thinking about the unidentified correlation as the correlation between residuals rather than the correlation between outcomes in the treatment and control states, an evaluator might see the residuals as resulting from common unmeasured factors. It seems likely, then, that these residual correlations are close to 1. This need not be true. Evaluators should give careful thought to the mechanisms accounting for outcomes in the treated and control states. Certainly, if theory identifies variables that are moderators or mediators, and if those variables are measured, the moderators/mediators should be included in the statistical model.
Several other topics deserve discussion, but the length of a short paper precludes all but a summary. Evaluators should distinguish between finite populations and super populations. Throughout this article, we have discussed inferences about finite populations. The finite population comprises the n units that enter the study’s sample. For most evaluations, finite population estimates are probably appropriate.
Suppose, however, that an evaluator sees the n units as a random sample from a much larger population equal to N. There are two approaches. First, consider the situation where there are no covariates. Then the super-population estimate comes from imputing all outcomes, that is, from substituting imputations for all observations. The observed outcomes are only used for deriving the posterior distribution. Second, consider the situation where there are covariates. Now the computations are more complicated because in theory the covariates come from a distribution for which the n observations are a random sample. That additional uncertainty must be factored into the super-population estimates, although an evaluator might treat the variance of X as of secondary importance (and ignore it) in a large sample.
Note that whether one is inferring to a finite population or to a super-population, the same analysis leads to the posterior distribution of the parameters. The only computational differences occur from the application of the G(…) function. It either assigns imputed values to all outcomes (super-population) or to just the missing outcomes (finite population).
Many experiments require simple random assignment, but many others involve more complicated designs. Cluster randomization (Donner & Klar, 2000, 2004; Hayes & Moulton, 2009) is popular especially in settings where simple random assignment would violate the Stable Unit Treatment Value Assumption. The I&R approach applies to cluster randomized designs. The trick is to see the clusters as the unit of analysis. Cluster randomization is often designed to infer to a super-population. Again, an adequate discussion would go beyond the scope of this article and beyond the authors’ deliberations but see I&R (Chapter 9).
A special version of cluster randomization, pairwise matching (Imbens, 2011; Rhodes, 2014), raises an interesting possibility. When applying this approach, an evaluator pairs units (typically clusters) based on similarity of covariates that explain posttreatment outcomes (pretreatment baseline rates might be suitable). One member of the pair is randomly assigned to treatment and the other to control. The pairing leads to a direct measure of the finite sample distribution of treatment effects. Again, an adequate discussion goes beyond the scope of this article but see I&R (Chapter 10).
We acknowledge that this article is concerned with simple random design experiments. Many evaluations have quasi-experimental designs. Techniques for estimating the distribution of treatment effects extend to quasi-experimental designs, but in that case, the evaluator faces an additional challenge: The evaluator must ensure that the treatment effect is identified. This is a daunting challenge, but when it is met, the I&R approach is applicable.
Finally, we note that the Bayesian approach forces the evaluator to make some parametric assumptions that go beyond the assumptions that are required for estimating the mean or median effect. We have argued that assumptions about the prior are innocuous because uninformative priors are sufficient to drive the models. Assumptions about the likelihood are more substantive, but we note again that, other than the parameter ρ, all parameters are identified. Evaluators worried about unwarranted assumptions can perform standard diagnostic tests. Gross and misleading errors are avoidable.
We conclude that I&R’s recommended procedures are widely applicable to social science evaluation research. We believe that understanding the distribution of treatment effects is important for scientific inquiry. We recommend that evaluators performing RCTs incorporate the use of I&R’s approach into their analysis and that other evaluators revisit evaluation results to augment findings regarding average treatment effects.
Supplemental Material
Supplemental Material, Appendix_2_23_2018 for Estimating the Distribution of Treatment Effects From Random Design Experiments by William Rhodes and Gerald Gaes in Evaluation Review
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
