Estimating the Distribution of Treatment Effects From Random Design Experiments

Abstract

Background:

Random design experiments are a powerful device for estimating average treatment effects, but evaluators sometimes seek to estimate the distribution of treatment effects. For example, an evaluator might seek to learn the proportion of treated units who benefit from treatment, the proportion who receive no benefit, and the proportion who are harmed by treatment.

Method:

Imbens and Rubin (I&R) recommend a Bayesian approach to drawing inferences about the distribution of treatment effects. Drawing on the I&R recommendations, this article explains the approach; provides computing algorithms for continuous, binary, ordered and countable outcomes; and offers simulated and real-world illustrations.

Results:

This article shows how the I&R approach leads to bounded uncertainty intervals for summary measures of the distribution of treatment effects. It clarifies the nature of those bounds and shows that they are typically informative.

Conclusions:

Despite identification issues, bounded solutions provide useful insight into the distribution of treatment effects. We recommend that evaluators incorporate analyses of the distribution of treatment effects into new studies and that evaluators revisit completed studies to estimate the distribution of treatment effects.

Keywords

treatment effect heterogeneity, Bayesian inference, random design experiments

Randomized control trials (RCTs) are the foundation for evaluation research because RCTs provide unambiguous causal inferences if the randomization is not compromised. Yet RCTs have critics (Berk, 2005; Burtless, 1995; Deaton & Cartwright, 2016; Sampson, 2010). One criticism is that classical methods for analyzing RCTs only estimate the average or median treatment effect (Heckman & Vytlacil, 2007). Even when the RCT shows a positive average treatment effect, different proportions of the population may be advantaged, harmed, or unaffected by treatment.

The average treatment effect tells us very little about this distribution of treatment effects. Yet without knowing this distribution, analysts are unable to apply a loss function, where positive and negative outcomes have different weights. Furthermore, evaluators and policy makers have scant ability to critically judge program effectiveness and redesign interventions because they lack guidance about what works best for whom. Despite the importance of this topic, there is limited scholarship on the subject.

Of course, evaluators have not been indifferent to the heterogeneity of treatment effects. Heterogeneity has been a topic of study at least since Johnson and Neyman (1936) and Cronbach and Snow (1977). More recently, evaluators sometimes stratify or use regression (Bloom, 2005) to estimate how treatment effects vary systematically with exogenous factors. Sometimes evaluators use principal stratification, also called endogenous stratification, to estimate how treatment effects vary with postsampling outcomes (Frangakis & Rubin, 2002; Imbens & Angrist, 1994).¹ Xie et al. (2012) discuss heterogeneity in the context of observational data and propose three methods using propensity scores to evaluate treatment effects that may vary by the probability of treatment selection.² Raudenbush and Bloom (2015a, 2015b) look for heterogeneity using multisite random design experiments, and they describe innovative work sponsored by the William T. Grant Foundation, the Spencer Foundation, and the U.S. Institute of Educational Research.

All these efforts share a foundational premise: Treatment effects are mediated or moderated by individual and contextual factors. A specific drug treatment may benefit women more than men, African American women more than White women, and African American women who exercise more than the other cross-classified groups. But these efforts are ultimately limited by sample sizes within subgroups, by unmeasured covariates that may account for treatment heterogeneity, or by controversial assumptions.

Imbens and Rubin (2015)—henceforth, I&R—suggest a Bayesian method for estimating the distribution of treatment effects. Consistent with the literature summarized above, the I&R method explains treatment heterogeneity using observed moderating variables. But their method goes beyond that level of explanation to include both explained and unexplained heterogeneity, thereby characterizing the entire distribution of treatment effects. Our article discusses and illustrates the I&R approach.

To start the discussion, consider an illustration from an influential criminal justice field experiment that evaluated alternative methods to deter domestic violence. Domestic violence is one of the most difficult and dangerous circumstances police face during their daily activities. The initial field experiment to study alternative interventions was conducted in Minneapolis (Sherman & Berk, 1984a, 1984b) and replicated elsewhere (Garner et al., 1995). In two of the three randomly assigned interventions, police called to a domestic violence incident either arrested the suspect or sent the suspect away from home for several hours. One outcome was whether there was a repeat victimization within 6 months.

Sherman and Berk (1984a) reported that 10% of the arrest group and 24% of the send-suspect-away group had a repeat victimization. The average treatment effect is therefore 14 percentage points: repeat victimization was 14 percentage points less common following an arrest than following being sent away. Knowing the average treatment effect is certainly important. It informs public policy choices and adds to the body of evidence encompassing domestic violence interventions. However, there is the potential to learn a great deal more. To provide intuition, we identify four possible latent groups implicit in the design of this field experiment: batterers who desist from domestic assault regardless of their treatment, batterers who repeat their crimes regardless of treatment, batterers who will desist if arrested but not otherwise, and batterers who will desist if sent away but not otherwise. There is no treatment effect for the first two groups. There is a salutary effect from an arrest for the third group and a detrimental effect from an arrest for the last group. Estimating the distribution (proportions) of people in these four latent groups would provide a deeper understanding of the effects of domestic violence interventions than simply knowing that the average treatment effect is a 14-percentage-point reduction in repeat assault.

The goal of this article is to provide a framework based on the method proposed by Rubin (1978) and I&R (2015) to gain this deeper understanding. I&R propose a Bayesian approach to drawing inferences about the distribution of treatment effects. The Bayesian approach is especially useful because it leads to estimates of the distribution of treatment effects that are unavailable from a frequentist perspective; also, it leads to probability-based statements that are unavailable or difficult to derive from a frequentist perspective. Our contribution is to provide an accessible discussion of the I&R approach and to provide computing algorithms for commonly experienced outcome types. Evaluators can use the I&R approach to augment traditional experimental analysis in ongoing RCTs; and they can use it to reanalyze data from extant RCTs, thereby adding value to previously reported study findings.

To foster understanding of the value of studying the distribution of treatment effects, we discuss the I&R approach and illustrate its application by using simulated data and by reanalyzing data from a selected RCT study. This article begins with an intuitive perspective and then formalizes the argument. First, we discuss the modern counterfactual logic for evaluation. Readers familiar with this perspective will find the discussion rudimentary, but others should gain important background. We next provide a primer on Bayesian inference. Readers experienced with Bayesian analysis will find this elementary; those inexperienced with Bayesian analysis should gain sufficient background to understand the I&R approach. Next, we integrate the introductions to the counterfactual logic and Bayesian inference to formalize the I&R argument. In an Online Appendix, we cover computing algorithms for four types of outcomes—those measured on a continuous scale (least squares regressions), binary outcomes (probit), ordered outcomes (ordered probit), and countable outcomes (negative binomial with a normal mixture).

The I&R approach requires understanding two issues that may be unfamiliar to evaluators. The first is a purely Bayesian issue, which this article will cast as an imputation problem solved using Bayesian reasoning. The second is an identification problem, meaning that an important parameter cannot be estimated from the data.³ To simplify the exposition, this article initially ignores the identification problem to focus on the inherently Bayesian aspects of the I&R approach. Then, the article returns to define and treat the identification problem.

A Counterfactual Primer

Many evaluators find the counterfactual perspective useful for thinking about evaluation problems (I&R, 2015; Morgan & Winship, 2015). Table 1 illustrates the counterfactual perspective, and we use this table to explain what we mean by the distribution of treatment effects. The columns are labeled 1–7, and we reference these column numbers throughout the following discussion.

Table 1.

Illustration of the Counterfactual Perspective to Evaluation.

Unit	Outcome if Treated	Outcome if Control	Treatment Effect	Treatment Status	Observed Outcome if Treated	Observed Outcome if Control
(1)	(2)	(3)	(4)	(5)	(6)	(7)
1	17	10	7	Treat	17	Miss
2	19	15	4	Treat	19	Miss
3	13	10	3	Treat	13	Miss
4	25	20	5	Treat	25	Miss
5	17	10	7	Treat	17	Miss
6	13	10	3	Control	Miss	10
7	19	15	4	Control	Miss	15
8	15	10	5	Control	Miss	10
9	23	20	3	Control	Miss	20
10	13	10	3	Control	Miss	10

In the stylized illustration represented by Table 1, we investigate treatment outcomes in a population of N units. In Table 1, N = 10, but usually N is much larger, and the reader may want to mentally add rows. Each row identifies a unit—Units 1–10 in this table. The second column identifies the outcome if the unit is treated, and the third column identifies the outcome if the unit is not treated. The counterfactual perspective applies to both observational and experimental studies, but here we assume random assignment, so those who are not treated are members of the control group. From the counterfactual perspective, the two potential outcomes are fixed but not necessarily known. The fourth column is the treatment effect, equal to the outcome in the treatment state minus the outcome in the control state. The treatment effects, too, are fixed and potentially heterogeneous.

If we knew the potential outcomes in Columns 2 and 3, then we would know the treatment effect for every unit, and hence the distribution of treatment effects. From that distribution, we could compute summary measures characterizing the distribution: the mean, the median, the quartiles, and so on. As suggested earlier, an especially useful summary is the proportion of units who benefit from treatment, the proportion who are harmed by treatment, and the proportion who are unchanged by treatment, although in Table 1 all units benefit from treatment.

Throughout this article, we use H(…) to denote a function that computes treatment effects from the outcomes in the treatment and control states and converts those outcomes into a summary measure. If δ₁, δ₂,…, δ₁₀ are the treatment effects in Column 4, then the function H(…) maps from the outcomes under the treatment and control states into the treatment effects and then into the summary measure of the treatment effect distribution, which we call τ to be consistent with I&R’s notation.

H (Y_{treated}, Y_{control}) \to δ_{1}, δ_{2}, \dots, δ_{N} \to τ . \qquad (1)

Here, $Y_{treated}$ is a vector of outcomes shown in Column 2, $Y_{control}$ is a vector of outcomes shown in Column 3, the δ are treatment effects shown in Column 4, and τ is the summary measure such as a mean, median, or proportion benefiting or harmed by the treatment. However, to produce τ this mapping function presumes that the outcomes in Columns 2 and 3 are completely known. Therefore, the fundamental problem preventing application of Equation (1) is that we cannot compute the treatment effect for even a single unit. When a unit is randomly assigned to treatment (Column 5), we know the outcome under the treatment state (Column 6). When a unit is randomly assigned to control, we know the outcome under the control state (Column 7). Otherwise, the outcome is missing and so denoted in Columns 6 and 7 where half the outcomes are missing.
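As a minimal sketch of the mapping H(…), the following Python code applies it to the stylized values in Columns 2 and 3 of Table 1, which are fully known only because the table is an illustration. The function name `H`, the dictionary keys, and the convention that a larger outcome counts as a benefit are assumptions made for this example.

```python
import numpy as np

# Potential outcomes from Columns 2 and 3 of Table 1.
# In a real experiment only half of these values would be observed.
y_treated = np.array([17, 19, 13, 25, 17, 13, 19, 15, 23, 13])
y_control = np.array([10, 15, 10, 20, 10, 10, 15, 10, 20, 10])


def H(y_treated, y_control):
    """Map potential outcomes to unit-level effects, then to summary measures."""
    delta = y_treated - y_control  # Column 4: unit-level treatment effects
    return {
        "mean": float(delta.mean()),
        "median": float(np.median(delta)),
        # Assumption for this sketch: a larger outcome is a benefit.
        "share_benefit": float((delta > 0).mean()),
        "share_unchanged": float((delta == 0).mean()),
        "share_harmed": float((delta < 0).mean()),
    }


tau = H(y_treated, y_control)
print(tau)
```

Each entry of `tau` is one candidate summary measure τ; which one matters depends on the evaluation question.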

Because we cannot observe outcomes in both treatment states, we cannot directly compute a summary measure τ. What if we had a method for imputing outcomes when those outcomes were in fact missing? If that were true, and if we found the imputations credible, then intuition suggests that we would have a basis for estimating the distribution of treatment effects and a basis for estimating the summary measure τ.

To foreshadow, suppose we had a method for estimating the distribution of outcomes under the treatment state and the distribution of outcomes under the control state for every unit in Table 1 as they appear in Columns 2 and 3. Intuition suggests that this is possible. After all, given random assignment, the empirical distribution of observed outcomes in Column 6 of Table 1 should approximate the distribution of unobserved outcomes in Column 6. Likewise, the empirical distribution of observed outcomes in Column 7 of Table 1 should approximate the distribution of unobserved outcomes in Column 7. We might impute unobserved outcomes using a random draw from these empirical distributions. Caution is required. The empirical distribution only approximates the actual distribution, and the result of any one random draw is only one of many possible results, but nevertheless, intuition points toward a solution.

Temporarily ignoring this caution about the empirical distribution, assume that the empirical distributions of outcomes under the treatment and control conditions are the same as the underlying true distributions, select a random draw from those distributions, and use it to impute missing outcomes. Using the observed and imputed outcomes, compute δ₁, δ₂,…, δ₁₀. Then, compute the resulting τ. Although this provides an estimate for a single τ, the estimate depends on the specific values from the random draw. Call this first estimate τ₁. Repeat the exercise many times, leading to new random draws of the missing outcomes and a series of estimates of τ (τ₁, τ₂,…, τ_K), where K is a large number. Intuition suggests that we can use the distribution of τ₁, τ₂,…, τ_K to draw inferences about the summary measure τ. This is the essence of the I&R approach, which leads to a probability-based range for the estimate of τ. But to derive a probability-based range, we must formalize the argument to account for the observed empirical distributions of Y_treated and Y_control being only approximations of the underlying true distributions and for the likelihood that the distributions of Y_treated and Y_control are not independent. The formality appears in A Bayesian Primer and The I&R Estimator sections. Returning to the intuition from Table 1, we introduce a simple model that leads to a formalized argument.
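The informal resampling exercise just described can be sketched in Python as follows. This is only an illustration of the intuition, not the I&R estimator: it treats the empirical distributions as if they were the true distributions and draws the two potential outcomes independently, which are exactly the simplifications the formal argument must relax. The function name `naive_tau_draws`, the value K = 2000, and the choice of τ as the share of units with a positive effect are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed outcomes from Columns 6 and 7 of Table 1.
obs_treated = np.array([17, 19, 13, 25, 17])  # units assigned to treatment
obs_control = np.array([10, 15, 10, 20, 10])  # units assigned to control


def naive_tau_draws(obs_treated, obs_control, summary, K=2000, rng=rng):
    """Repeat the impute-then-summarize exercise K times."""
    taus = np.empty(K)
    for k in range(K):
        # Impute each missing outcome with a random draw from the
        # empirical distribution observed in the other arm.
        imp_control = rng.choice(obs_control, size=obs_treated.size)
        imp_treated = rng.choice(obs_treated, size=obs_control.size)
        delta = np.concatenate([obs_treated - imp_control,
                                imp_treated - obs_control])
        taus[k] = summary(delta)
    return taus


# Hypothetical summary measure: share of units with a positive effect.
taus = naive_tau_draws(obs_treated, obs_control, lambda d: (d > 0).mean())
lo, hi = np.percentile(taus, [5, 95])
print(f"central 90% range of tau draws: [{lo:.2f}, {hi:.2f}]")
```

The spread of `taus` reflects only the randomness of the imputations; it does not yet account for uncertainty about the distributions themselves or for dependence between the two potential outcomes.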

\begin{array}{l} Y_{treated} = X β_{treated} + e_{treated}, \\ Y_{control} = X β_{control} + e_{control}, \\ e \sim N_{B} (0, 0, σ_{treated}^{2}, σ_{control}^{2}, ρ), \\ δ = Y_{treated} - Y_{control} = X (β_{treated} - β_{control}) + e_{treated} - e_{control} . \end{array} \qquad (2)

Corresponding to Table 1, Y_treated is the outcome in the treatment state. For this simple illustration, Y_treated is written as a linear function of covariates X and a random error term e_treated. The β_treated parameters represent the effect that each element of X has on the outcome in the treatment state. Y_control is the outcome in the control state. For this simple illustration, Y_control is written as a linear function of the same X covariates and a random error term e_control. The β_control parameters represent the effect that each element of X has on the outcome in the control state. The covariates X are always observed. The error terms e_treated and e_control have a bivariate normal distribution with variances $σ_{treated}^{2}$ and $σ_{control}^{2}$ and a correlation of ρ. An evaluator might entertain an alternative to the bivariate normal, but in practice, the bivariate normal (perhaps after transformations) will likely be the most practical choice.

The last line represents the treatment effect as the difference between the outcome in the treatment state and the outcome in the control state. That last line shows that treatment effect heterogeneity comes from two possible sources. First, treatment may cause one or more of the β parameters to switch from β_control to β_treated.⁴ Often evaluators assume that the administration of treatment only shifts a constant, but that is unnecessarily restrictive, because the treatment effect may be moderated by covariates. Second, treatment may cause the error terms to be drawn from different distributions. For exposition, it is convenient to distinguish these two sources as explained treatment heterogeneity and unexplained treatment heterogeneity, but this distinction is artificial in the sense that it depends on which variables X are introduced into the statistical model. Furthermore, “explained” treatment heterogeneity is still subject to sampling error because the β are estimated. Nevertheless, preserving this distinction will be useful. Equation (2) shows how treatment heterogeneity arises; it is the sum of explained and unexplained heterogeneity.

From a frequentist perspective, the parameters identified above are fixed. From this point forward, we abandon this familiar frequentist perspective. Instead, using a Bayesian perspective, the β and σ parameters are random and are drawn from a posterior distribution. The posterior distribution is a Bayesian concept that plays a crucial role in inference. It has no counterpart for a frequentist.

Using techniques described in A Bayesian Primer section, we can estimate the posterior distribution. The posterior distribution for β_treated and $σ_{treated}^{2}$ comes from a Bayesian regression on data where Y_treated is observed. The posterior distribution of β_control and $σ_{control}^{2}$ comes from a Bayesian regression on data where Y_control is observed. However, we cannot identify the posterior distribution for ρ because Y_treated and Y_control are never observed at the same time. This identification problem will concern us, but for now, to simplify the exposition, assume that ρ is known. Using the above model specification, we can write the treatment effect as:

\begin{matrix} δ = Y_{treated} - Y_{control}, \\ = Y_{treated} - X β_{control} - e_{control} \quad \text{given } Y_{treated} \text{ observed}, \\ = X β_{treated} + e_{treated} - Y_{control} \quad \text{given } Y_{control} \text{ observed} . \end{matrix} \qquad (3)

As noted, from a Bayesian perspective, β_treated, β_control, $σ_{treated}^{2}$, and $σ_{control}^{2}$ are random variables drawn from a posterior distribution. Consequently, Xβ_treated, Xβ_control, e_treated, and e_control are derived random variables that depend on the posterior distribution of the β and σ² parameters. Presuming we know the posterior distribution for the β and σ² parameters, we can conceptualize the estimation process as requiring five steps. First, from the posterior distribution, we randomly sample β_treated, β_control, $σ_{treated}^{2}$, and $σ_{control}^{2}$. Second, conditional on β_treated, β_control, $σ_{treated}^{2}$, and $σ_{control}^{2}$, we compute Xβ_treated and Xβ_control and we randomly sample e_treated and e_control. Third, given Xβ_treated, Xβ_control, e_treated, and e_control, we impute Y_treated and Y_control when they are missing. Fourth, we derive an empirical distribution for δ = Y_treated − Y_control. Fifth, we compute the summary measure τ based on the empirical distribution of δ. These five steps are repeated until we have assembled a large number of τ values. From that large number of τ values, we estimate the distribution of τ that leads to a credible interval. For a Bayesian, a credible interval is the upper and lower limits of a truncated distribution of τ such that τ has a given probability of falling within that range. This differs from a frequentist confidence interval, for which the confidence interval has a given probability of covering a fixed parameter. Like a confidence interval, a credible interval represents how certain we are about an estimate such as τ.

The approach is Bayesian in two regards. In the first regard, this is a multiple imputation problem; we make repeated draws from the posterior distribution of the βs and σs to impute values for Y_treated and Y_control. In the second regard, τ can be a complex statistic, such as the ratio of positive treatment effects to negative treatment effects, that has no frequentist counterpart.

A Bayesian Primer

The previous section explained the basics of the I&R approach, but we require more formality to derive a credible interval. We use a probability-based method to impute values for the missing outcomes, and we estimate a credible interval for τ using the observed and imputed outcomes. Bayes’s theorem is an often-used probability-based method for performing data imputation (Hoff, 2009, p. 115), and imputation is widely used in applied statistics (Enders, 2010; Little & Rubin, 2002; Schafer, 1997). Imputations are sometimes challenged by those unwilling to accept the missing at random assumption, but for the present problem, both the missing at random and the missing completely at random assumptions are met because we assume a random design experiment.

To formalize the argument, a Bayesian primer is helpful. Bertsekas and Tsitsiklis (2008) and Kruschke (2015) provide simple introductions, and other sources provide expanded discussions (Hoff, 2009; Lancaster, 2004; Thompson, 2014). Without attempting to explain Bayes’s theorem, this primer summarizes the role that Bayesian estimation plays in the data imputation necessary to implement the I&R approach. To develop this summary, we ask the reader to consider a familiar linear model:

\begin{matrix} Y = X β + e \\ e \sim N (0, σ^{2}) \end{matrix} . \qquad (4)

This model identifies a column vector Y as the realizations of a random variable, a matrix X of explanatory variables, a vector β of parameters, and e as a column vector of random errors that are distributed as normal with variance σ². The X are seen to explain this outcome, but not completely; the error terms account for how Y departs from the expectation Xβ. We imagine a situation where the Y are sometimes missing but the X are always known. We wish to impute values for the missing Y.

From the perspective of a Bayesian, the parameters are random and come from the posterior distribution, which we write as:

f_{posterior} (β, σ^{2}) = f_{posterior} (β, σ^{2} | Y, X, θ) . \qquad (5)

The expression on the left is an abbreviated version of the expression on the right, which includes an auxiliary parameter(s) θ, discussed later. For now, assume that we know this posterior distribution. We can use the posterior distribution to impute values for Y conditional on X. We only impute outcomes when they are unknown.

To explain, take a random draw from the posterior distribution to get a specific realization of β and σ², denoted β₁ and $σ_{1}^{2}$ . From Equation (4), this selection allows us to impute values for Y as:

\begin{matrix} {Y'}_{1} = X β_{1} + e_{1} \\ e_{1} \sim N (0, σ_{1}^{2}) \end{matrix} . \qquad (6)

The Y′ denotes an imputation. The subscript on ${Y'}_{1}$ denotes that these imputations depend on the specific draw of β and σ² from the posterior distribution. The e₁ is a random draw (the e are independent) from the normal distribution with mean 0 and variance $σ_{1}^{2}$. In fact, the estimation procedure will require multiple rounds of imputations to derive $β_{1}, β_{2}, β_{3}, \dots$ and $σ_{1}^{2}, σ_{2}^{2}, σ_{3}^{2}, \dots$, and eventually ${Y'}_{1}, {Y'}_{2}, {Y'}_{3}, \dots$. These multiple rounds account for the fact that knowledge of the distribution of Y is uncertain. Before investigating the need for multiple passes through the imputation procedure, let’s investigate how we get the posterior distribution. In some circumstances, Bayesians set the posterior proportional to the likelihood function. The posterior might be written as:

f_{posterior} (β, σ^{2}) \propto f_{likelihood} (Y, X | β, σ^{2}) . \qquad (7)

Y and X are taken as fixed. There is no maximization problem here; the posterior distribution and the likelihood are the same up to a normalizing constant. A Bayesian might adopt this approach when he or she has no prior information about the distribution of the parameters β and σ².

The likelihood is familiar to frequentists. It is sometimes called the data model because it comes from assumptions made about the way that Y is generated. We stated those assumptions with Equation (6) but clearly alternative assumptions are possible.

Sometimes the Bayesian has some prior information, which is expressed as the distribution of β and σ² conditional on auxiliary parameters θ. When that is the case, the posterior is written as:

f_{posterior} (β, σ^{2}) \propto f_{likelihood} (Y, X | β, σ^{2}) f_{prior} (β, σ^{2} | θ) . \qquad (8)

The prior distribution represents a previously held belief about the distribution of the parameters. For present purposes, information about the prior is probably unavailable. In fact, for our purposes, we will use a flat—that is uninformative—prior. Practically, for our purposes, there will be very little difference between Equations (7) and (8). Because Equation (8) is more general, we will use it as a template for deriving the posterior, summarized as $f_{prior} (\dots) f_{likelihood} (\dots) \to f_{posterior} (\dots)$ . What this means is that we have a principled way to derive the posterior distribution for the parameters entering our model. Resting on Bayesian logic, the principled method requires that we specify the data model and the prior.

Returning to the imputation problem, given the posterior distribution, we randomly draw β₁ and $σ_{1}^{2}$. Using Equation (6), this leads us to a vector of outcomes $Y_{1}^{'}$, some of whose members are observed (and hence not imputed) and some of whose members are imputed (because they are missing). We repeat this sampling process again and again until we have $Y_{1}^{'}, Y_{2}^{'}, Y_{3}^{'}, \dots, Y_{K}^{'}$, where K is a large number. As shorthand, we express the process of deriving these imputations as $f_{prior} (\dots) f_{likelihood} (\dots) \to f_{posterior} (\dots) \to G (\dots)$. Adding the $G (\dots)$ function denotes that we have mapped from the prior and likelihood into the posterior and from the posterior we have mapped into K vectors of imputed values of Y.

We can derive any statistic of interest τ from a vector of Y. For illustration, suppose this statistic is the median. For each vector $Y_{k}^{'}$ , which comprises observed or imputed values, we compute the median $τ_{k}$ . Over the K imputations, we have an empirical distribution for τ, which will approximate the true distribution for τ. Using the mapping introduced earlier, we expand the sequence of steps to $f_{prior} (\dots) f_{likelihood} (\dots) \to f_{posterior} (\dots) \to G (\dots) \to H (\dots)$ . The function H(…), which we introduced in Equation (1), maps the values of ${Y'}_{1}, {Y'}_{2}, {Y'}_{3}, \dots$ into the distribution of τ. From this distribution, we can construct a Bayesian credible interval defined (perhaps) as excluding the lowest and highest 5% of τ values to derive a 90% credible interval.
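The chain from prior and likelihood to posterior, then G(…), then H(…) can be sketched in Python for the single-equation linear model. The sketch assumes the standard noninformative prior proportional to 1/σ², which is one common way to make the flat prior concrete and is an assumption of this example rather than a statement of the authors' exact choice. Under that prior, σ² has a scaled inverse chi-square posterior and β is normal given σ². The function names, the simulated data, and K = 2000 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)


def posterior_draws(X, y, K, rng):
    """Draw (beta_k, sigma2_k), k = 1..K, assuming a prior proportional to 1/sigma^2."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    # sigma^2 | y is scaled inverse chi-square with n - p degrees of freedom.
    sigma2 = (n - p) * s2 / rng.chisquare(n - p, size=K)
    # beta | sigma^2, y is normal with mean beta_hat and covariance sigma^2 (X'X)^-1.
    chol = np.linalg.cholesky(XtX_inv)
    z = rng.standard_normal((K, p))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ chol.T)
    return beta, sigma2


def impute_and_summarize(X, y, missing, K=2000, rng=rng):
    """G(...): impute missing Y for each posterior draw. H(...): take the median."""
    obs = ~missing
    beta, sigma2 = posterior_draws(X[obs], y[obs], K, rng)
    taus = np.empty(K)
    for k in range(K):
        y_k = y.copy()
        e_k = rng.normal(0.0, np.sqrt(sigma2[k]), size=missing.sum())
        y_k[missing] = X[missing] @ beta[k] + e_k  # only missing values are imputed
        taus[k] = np.median(y_k)
    return taus


# Simulated data for illustration only: about 30% of outcomes missing at random.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_full = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, size=n)
missing = rng.random(n) < 0.3
y = np.where(missing, np.nan, y_full)

taus = impute_and_summarize(X, y, missing)
lo, hi = np.percentile(taus, [5, 95])  # 90% credible interval for the median
print(f"90% credible interval: [{lo:.2f}, {hi:.2f}]")
```

Swapping `np.median` for another statistic changes H(…) without touching the posterior or imputation steps.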

This section adopted a simple linear model. This model may be unrealistic for a specific problem, but other models are possible. They require changes to the likelihood and to the mechanics of the function G(…), but they do not require changes to the logic.

The I&R Estimator

The discussion above provides direction for imputing outcomes under a single state, but when we know the outcome in the treatment state, we must impute the outcome for the control state; when we know the outcome for the control state, we must impute the outcome for the treatment state. Equation (2) already introduced a two-state model. For the reader’s convenience, we repeat that model here:

\begin{matrix} Y_{treated} = X β_{treated} + e_{treated}, \\ Y_{control} = X β_{control} + e_{control}, \\ e \sim N_{B} (0, 0, σ_{treated}^{2}, σ_{control}^{2}, ρ), \\ δ = Y_{treated} - Y_{control} = X (β_{treated} - β_{control}) + e_{treated} - e_{control} . \end{matrix}

Note again that this specification allows the treatment effect (the difference between the outcomes in the treated and control states) to differ for all units and with a little ingenuity we could adopt even more elaborate model specifications. Models are testable using conventional testing procedures, so from a modeling standpoint, we have justification for adopting a specific model if it passes diagnostic testing.

Except for one difficulty, moving from a one-equation model to a two-equation model does not affect our thinking about imputation. So that we could focus on the Bayesian aspects of estimating the distribution of treatment effects, we have ignored this difficulty until now, but we now deal with its implications. The difficulty is that since we never observe both Y_treated and Y_control for any single unit, we have no way to estimate ρ. This is problematic because the value of ρ affects the distribution of e_treated when Y_control is observed and the distribution of e_control when Y_treated is observed. Normal theory (Bertsekas & Tsitsiklis, 2008, for example) says that, conditional on e_treated, e_control will be distributed as normal with mean equal to $(σ_{control} / σ_{treated}) ρ (y_{treated} - x β_{treated}) = (σ_{control} / σ_{treated}) ρ e_{treated}$ and variance equal to $(1 - ρ^{2}) σ_{control}^{2}$. Likewise, normal theory says that, conditional on e_control, e_treated will be distributed as normal with mean equal to $(σ_{treated} / σ_{control}) ρ (y_{control} - x β_{control}) = (σ_{treated} / σ_{control}) ρ e_{control}$ and variance equal to $(1 - ρ^{2}) σ_{treated}^{2}$. Thus, unless ρ = 0, knowledge of Y_treated tells us something about the distribution of e_control, and knowledge of Y_control tells us something about the distribution of e_treated; how much it tells us depends on ρ. The computing algorithms discussed in the Online Appendix therefore require a value for the unknown ρ.
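The conditional normal result can be checked numerically with a short Python sketch. The function name `draw_missing_error` and the parameter values (standard deviations of 2.0 and 1.5, ρ = 0.6) are hypothetical choices for illustration; the conditional mean and variance are those given by normal theory for a bivariate normal pair.

```python
import numpy as np


def draw_missing_error(e_obs, sigma_obs, sigma_mis, rho, rng):
    """Draw the unobserved arm's error given the observed arm's error."""
    cond_mean = (sigma_mis / sigma_obs) * rho * e_obs
    cond_sd = sigma_mis * np.sqrt(1.0 - rho**2)
    return rng.normal(cond_mean, cond_sd)


rng = np.random.default_rng(2)

# Stand-in for residuals y_treated - x * beta_treated, with sigma_treated = 2.0.
e_treated = rng.normal(0.0, 2.0, size=100_000)

# Impute the control-state errors, assuming sigma_control = 1.5 and rho = 0.6.
e_control = draw_missing_error(e_treated, sigma_obs=2.0, sigma_mis=1.5,
                               rho=0.6, rng=rng)

corr = np.corrcoef(e_treated, e_control)[0, 1]
print(f"sample correlation: {corr:.3f}, sample sd of e_control: {e_control.std():.3f}")
```

With a large sample, the imputed errors reproduce the assumed correlation and marginal standard deviation, which is why the imputations cannot proceed without a value for ρ.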

I&R’s solution is to bound estimates when parameters are not identified (Manski, 2007). In the present case, ρ must be between −1 and 1, and we feel comfortable that it is between 0 and 1. Specifically, e_treated and e_control are the effects of variables that are excluded from the set of explanatory variables X. Likely, many of these omitted variables have effects that are unaltered by the receipt of treatment. If none were altered by the receipt of treatment, ρ = 1. Some may be altered by the receipt of treatment, but likely they have the same qualitative impact on Y, in which case 0 ≤ ρ ≤ 1. Possibly treatment reverses the effect of some omitted variables, but on balance this reversal is likely to be dominated by non-reversals, so 0 ≤ ρ ≤ 1. However, readers need not agree with us that 0 ≤ ρ ≤ 1 because the I&R procedure does not require such an assumption.

We also assume that the distribution of e_treated and e_control is bivariate normal. Justification comes from the usual logic that the error terms are the sums of a large number of independent omitted variables that come from no specific distributions. From the central limit theorem, their sums will be approximately normal. Inducing normality may require a transformation, such as modeling the log of Y instead of Y. Modest departures from normality appear to make no important difference in imputation models (Enders, 2010; Schafer, 1997), so it seems likely that departures from normality would be unimportant for the present problem as well.

From the bounding perspective, we can impute missing values for the outcomes assuming first that ρ = 0 and assuming second that ρ = 1, or assuming any intermediate value for ρ. Starting with ρ = 0, we estimate a credible interval for τ_ρ. Then, assuming that ρ = 1, we estimate another credible interval for τ_ρ. Possibly, one credible interval will overlap the other, but regardless, we select the lowest lower bound and the highest upper bound of the two as our best bounded range of estimates for τ. The result is no longer a credible interval, but rather, a bounded estimate of a credible interval. We can hope that the bounded estimate is sufficiently narrow to be informative, but that remains to be seen.

Another way to look at this identification issue is that we are using an uninformative prior for the β and σ² and a fully informative prior for ρ. Nothing in the estimation will cause that prior assumption about ρ to change. Thus, we adopt different prior assumptions about ρ and ask: How do alternative assumptions affect conclusions about the distribution of τ?

Using a Neyman estimator, I&R prove (section 6.4, page 87, and chapter appendix) a peculiarity about these different measures. As ρ increases from 0 to 1 and as ρ decreases from 0 to −1, the credible interval for the mean treatment effect estimate increases. This will also be true for other summary measures. A formal proof uses the Neyman estimator (see I&R). The next section provides a less rigorous discussion and intuition for why this observation matters.

The Role of ρ and the Construction of Bounds

The previous section explained that knowledge of ρ is required to impute Y _treated when Y _control is known and to impute Y _control when Y _treated is known. In this section, we discuss how treatment heterogeneity decreases monotonically as the true value of ρ goes from −1 to 1. This monotonic decrease is due to the effect of ρ on the variance of unmeasured heterogeneity, e _treated − e _control. We establish that the variance of τ follows a more complicated pattern, but it tends to increase as ρ goes from 0 to 1, and hence, the credible interval for τ tends to widen as ρ increases. The implication may be unintuitive: As treatment heterogeneity increases, the credible interval tends to decrease. This pattern will appear in the next two sections where we work with artificial and real-world data. One purpose of this section is to provide intuition for why these patterns occur.

Given this intuition, we discuss the effect of bounding ρ because, in practice, ρ is unknown and unknowable. Recognizing that the largest range for ρ is $- 1 \leq ρ \leq 1,$ we have argued that it is more likely that the range is $0 \leq ρ \leq 1$ , and we use this range to bound the credible interval of the Bayesian summary statistic for τ.

For discussion, this section distinguishes three key concepts:

treatment effect heterogeneity,

credible intervals for treatment effect heterogeneity, and

most importantly, bounded credible intervals.

Understanding these three concepts is crucial for interpreting estimates of the distribution of treatment effects.

Treatment Effect Heterogeneity

Consider treatment effect heterogeneity first. For simplicity, assume that X comprises a constant, and assume that the β parameters (μ_treated and μ_control) are known with certainty. There is no loss of generality here because the following discussion concerns unmeasured heterogeneity holding measured heterogeneity fixed. Also, for simplicity, assume that $σ_{treated}^{2} = σ_{control}^{2} = σ^{2}$ . This simplification is for notational convenience and has no important effect on conclusions.

Unmeasured heterogeneity can be summarized as the variance of e _treated − e _control. This is the variance for the difference between two random variables, which is written using a standard formula as:

σ_{unexplained}^{2} = σ_{treated}^{2} + σ_{control}^{2} - 2 COV = 2 (σ^{2} - COV),

where COV is the covariance of the model’s residuals. Of course, $ρ = COV / σ^{2}$ , so substitute and rewrite Equation (10) as:

σ_{unexplained}^{2} = 2 (1 - ρ) σ^{2} .

Simple calculus shows that $σ_{unexplained}^{2}$ decreases as ρ increases. Consider three special cases. If ρ = 1, then unexplained heterogeneity disappears. If ρ = 0, then unexplained heterogeneity equals $2 σ^{2}$ . If ρ = −1, then unexplained heterogeneity equals $4 σ^{2}$ . Unmeasured heterogeneity is largest when ρ = −1 and smallest when ρ = 1. Indeed, unmeasured heterogeneity is zero when ρ = 1, but this is a consequence of assuming that $σ_{treated}^{2} = σ_{control}^{2}$ . Otherwise, the limit for unmeasured heterogeneity will not be zero, but nevertheless, the monotonic relationship between unexplained heterogeneity and ρ will persist. Returning explained heterogeneity to the model will not alter conclusions about unexplained heterogeneity because ρ does not affect explained heterogeneity. This establishes the first point: Treatment heterogeneity falls as ρ increases.
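The unequal-variance case mentioned above can be worked out from the same variance-of-a-difference formula. The following is a short derivation in this article's notation, not a result quoted from I&R. Writing $COV = ρ σ_{treated} σ_{control}$ gives:

σ_{unexplained}^{2} = σ_{treated}^{2} + σ_{control}^{2} - 2 ρ σ_{treated} σ_{control} .

The derivative with respect to ρ is $- 2 σ_{treated} σ_{control} < 0$ , so unexplained heterogeneity still falls monotonically as ρ increases. At ρ = 1 the expression equals $(σ_{treated} - σ_{control})^{2}$ , which is zero only when the two standard deviations are equal, and at ρ = −1 it equals $(σ_{treated} + σ_{control})^{2}$ .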

Credible Intervals for Treatment Effect Heterogeneity

Regarding the second issue of the variance of τ, and hence, the width of the credible interval, intuition and simple explanations fail. From I&R (2015, p. 161), “Note that the missing outcomes are no longer independent. Conditional on the parameters…they were independent, but the fact that they depend on common parameters introduces some dependence.” This observation leads to a complicated variance formula, and except for special circumstances there are no analytical solutions (p. 171). Thus, except for a simple case, we cannot derive a formula demonstrating the relationship between ρ and the size of the credible interval.

Lacking an analytical formula to establish the point, we simulate data where there are no covariates: Xβ_treated = μ_treated = 10, Xβ_control = μ_control = 0, $σ_{treated}^{2} = σ_{control}^{2} = 100$ , and ρ varies between −1 and 1 in increments of 0.5. We randomly generate 1,000 cases, half treated and half controls. Using Bayesian estimation procedures, we generate 10,000 replications for estimating the distribution of τ, the mean treatment effect. Estimation presumes that ρ is known. Table 2 shows results. The first column represents the value of ρ used in the simulation. The next three columns report the lower limit for the 95% credible interval, the estimated mean τ, and the upper limit for the 95% credible interval. The last column, the most important column for this discussion, is the width of the credible interval. At least for this illustration, the estimated mean for τ is about the same regardless of ρ, but the credible interval widens as ρ goes from 0 to 1.

Table 2.

Illustration of How Increasing ρ Increases the Credibility Interval.

ρ	95% Credible Interval
ρ	Lower Limit	Mean	Upper Limit	Width
ρ = 1.0	19.30	20.56	21.87	2.57
ρ = 0.5	18.76	19.83	20.93	2.17
ρ = 0.0	18.65	19.54	20.42	1.77
ρ = −0.5	18.59	19.35	20.13	1.54
ρ = −1.0	19.50	20.41	21.32	1.82
	95% Confidence interval
Frequentist	18.32	19.57	20.82	2.50

Estimating the mean treatment effect is a simple problem that could be accomplished using a frequentist approach that does not depend on ρ, because ρ does not affect the average outcome under the treatment and control states. The bottom row of Table 2 reports the frequentist estimate of the mean treatment effect and its 95% confidence interval. Notably, the Bayesian approach and the frequentist approach lead to similar estimates, but of course, the frequentist approach is not available for many other choices for τ.

In this simulation, setting ρ = 0 does not lead to the lowest variance estimator. That honor belongs to ρ = −0.50. Finer gradations of ρ might show a different minimum. In this illustration, at least, the variance of τ declines from a local maximum to a global minimum and then climbs to a global maximum as ρ increases from −1 to +1.

Thus, following I&R, we anticipate that credible intervals for τ will grow wider as choices of ρ go from 0 to 1, but analytical variance estimators are lacking for more complicated models. Moreover, the findings pertain to a specific τ statistic: the mean for the distribution of δ. We have no reason to suppose that it pertains to other summary measures such as the proportion for whom treatment has no effect, the proportion for whom treatment is beneficial, and the proportion for whom treatment is harmful. Thus, we remain agnostic.
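The simulation described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the Stata procedure used for Table 2: it assumes the standard noninformative prior for a normal mean and variance, treats ρ as known, uses only the Python standard library, and runs far fewer replications than the 10,000 reported above. All function names are hypothetical.

```python
import math
import random


def posterior_draw(y, rng):
    """Draw (mu, sigma) from the posterior of a normal sample under the
    noninformative prior p(mu, sigma^2) proportional to 1 / sigma^2."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((v - ybar) ** 2 for v in y)
    # sigma^2 | y is distributed as ss / chi-square(n - 1)
    sigma2 = ss / rng.gammavariate((n - 1) / 2.0, 2.0)
    mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
    return mu, math.sqrt(sigma2)


def impute(y_obs, mu_obs, s_obs, mu_mis, s_mis, rho, rng):
    """Impute the missing potential outcome from the conditional normal."""
    mean = mu_mis + (s_mis / s_obs) * rho * (y_obs - mu_obs)
    sd = s_mis * math.sqrt(max(0.0, 1.0 - rho ** 2))
    return rng.gauss(mean, sd)


def tau_draws(y_t, y_c, rho, k, rng):
    """Return k posterior draws of tau, the mean unit-level effect."""
    n = len(y_t) + len(y_c)
    draws = []
    for _ in range(k):
        mu_t, s_t = posterior_draw(y_t, rng)
        mu_c, s_c = posterior_draw(y_c, rng)
        total = 0.0
        for y in y_t:  # treated units: impute Y_control
            total += y - impute(y, mu_t, s_t, mu_c, s_c, rho, rng)
        for y in y_c:  # control units: impute Y_treated
            total += impute(y, mu_c, s_c, mu_t, s_t, rho, rng) - y
        draws.append(total / n)
    return draws


def posterior_sd(draws):
    """Standard deviation of the posterior draws."""
    m = sum(draws) / len(draws)
    return math.sqrt(sum((d - m) ** 2 for d in draws) / len(draws))


def interval_95(draws):
    """Approximate equal-tailed 95% interval from the posterior draws."""
    s = sorted(draws)
    return s[int(0.025 * len(s))], s[int(0.975 * len(s)) - 1]


# Small demonstration with the parameter values stated in the text.
rng = random.Random(12345)  # arbitrary seed, chosen for this sketch
y_t = [rng.gauss(10.0, 10.0) for _ in range(500)]
y_c = [rng.gauss(0.0, 10.0) for _ in range(500)]
for rho in (0.0, 0.5, 1.0):
    lo, hi = interval_95(tau_draws(y_t, y_c, rho, 200, rng))
    print(f"rho={rho:.1f}  interval=({lo:.2f}, {hi:.2f})  width={hi - lo:.2f}")
```

Because the seed, the prior, and the number of replications are choices made for this sketch, the printed intervals will not reproduce Table 2; the point is only that the interval for the mean effect tends to widen as the assumed ρ moves from 0 to 1.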

Bounded Credible Intervals

Although knowing the relationship between ρ and both unmeasured heterogeneity and the credible interval is useful, the practical problem is that to estimate τ and its credible interval, we must know ρ. We do not know ρ, so the above discussion may seem esoteric. Our solution is to apply Manski’s (2007) approach to this identification problem: We adopt assumptions about the possible bounds on ρ, estimate the credible interval for different values of ρ within those bounds, and identify the lowest and highest value for the credible interval over the entire range of ρ. I&R assess the situation (p. 169):

The main point to take from this section is that the correlation coefficient (ρ) between the two potential outcomes is somewhat different from other parameters of the model because the data generally do not contain empirical information about it (ρ)…. This leaves us with the question of how they should be modeled. Sometimes we “choose” to be conservative about this dependence and therefore assume the worst case. In terms of the posterior variance (e.g. the variance of τ), the worst case is often the situation of perfect correlation between the two potential outcomes. (Parenthetical inserted, and emphasis added.)

Note that I&R do not explicitly recommend a bounded solution, but bounding is implicit in their discussion. Returning to I&R’s generalization in the above quotation, the conclusion that setting ρ = 1 leads to a conservative estimate is misleading. Their generalization is true about the posterior variance of τ, but it may be false when considering bounded intervals when τ is more complicated than a mean. For example, suppose only two values of ρ are tenable. One choice for ρ may lead to a lower credible interval of A to B, and another choice may lead to a higher credible interval of C to D. Unless A > C or D < B, the bounded credible interval will be A to D. There is nothing conservative about assuming that ρ = 1.

Consider an illustration: Suppose that τ is defined as the largest decile of treatment effects, that is, the lower limit on the largest 10% of treatment effects. Then, τ will be largest when ρ = −1 and smallest when ρ = 1. This follows from the first observation that true heterogeneity decreases as ρ goes from −1 to 1. Whatever the relationship between the variance of τ and ρ, it is likely that the credible interval for τ will be A to B for ρ = 1 and C to D for ρ = −1, and hence the bounded credible interval may be A to D. Again, there is nothing “conservative” about assuming ρ = 1.
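A small calculation makes the decile illustration concrete. Under the simplifying assumptions of this section (no covariates, known means, equal variances, normal errors), the treatment effect δ is normal with mean μ_treated − μ_control and variance 2(1 − ρ)σ², so its ninth decile has a closed form. The function name and the numerical inputs below (a mean effect of 10 and σ = 10) are assumptions made for this sketch.

```python
import math
from statistics import NormalDist


def ninth_decile(mu_delta, sigma, rho):
    """Lower limit of the largest 10% of treatment effects when
    delta ~ Normal(mu_delta, 2 * (1 - rho) * sigma**2)."""
    sd_delta = math.sqrt(2.0 * (1.0 - rho)) * sigma
    if sd_delta == 0.0:
        # rho = 1 with equal variances: every unit has the same effect
        return mu_delta
    return NormalDist(mu_delta, sd_delta).inv_cdf(0.9)


for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"rho={rho:+.1f}  ninth decile={ninth_decile(10.0, 10.0, rho):.2f}")
```

The decile falls steadily as ρ rises, which is why the credible intervals at the two ends of the assumed range for ρ can sit far apart and why the bounded interval must span both.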

To this point, this section has focused on the role that unmeasured heterogeneity plays in explaining total heterogeneity. For some purposes, an evaluator might concentrate on explained heterogeneity, defined by Equation (9) as the distribution of $δ_{explained} = X (β_{treated} - β_{control})$ . The unknown parameter ρ plays no role; there is no need to impose bounds because a true credible interval is available. The utility of this concept of measured heterogeneity is that variation in treatment effectiveness is attributed to characteristics of the units, thereby helping evaluators to target effective treatment.

Thus, the form of conservatism recognized by I&R is not, in general, how an evaluator thinks about the problem. The next two sections show how to apply the I&R approach to a typical experimental evaluation. The following discussion will draw on the points made in this section.

A Monte Carlo Illustration

To illustrate the approach, we simulate data from a known data generation process. This model is familiar from earlier presentations. In this simulation, we adopt the data model:

\begin{aligned} Y_{treated} &= β_{0 t} + β_{1 t} X + e_{treated} = 5 + 10 X + e_{treated}, \\ Y_{control} &= β_{0 c} + β_{1 c} X + e_{control} = 0 + 5 X + e_{control}, \\ (e_{treated}, e_{control}) &\sim N_{B} (0, 0, 100, 100, 0.7), \\ X &\sim \mathrm{uniform} (0, 1) . \end{aligned}

Notice that the treatment effect is heterogeneous. On average, treatment improves outcomes, but the average effect is greater for units with larger values of X, and the effect will depend on the error terms. The sample size is 1,000; the number of repetitions K is 10,000. The number of repetitions is large for two reasons. First, we have emphasized that estimation works by taking random draws of the βs and σs from the posterior distribution. To improve efficiency, Stata uses a Markov Chain Monte Carlo algorithm, but this generates a lumpy distribution when K is small (Thompson, 2014), so a large K is necessary to smooth the distribution. Second, we are especially interested in the tail of the τ distribution, and a large K is necessary to ensure a reasonable approximation of that tail.

X and the observed Y are the same across all simulations. Half the sample is assigned to treatment. We first seek to estimate the median treatment effect, which should be about 7.5 = (5 − 0) + (10 − 5) × 0.5. (Because of sampling variance, it is actually 7.309.) Looking for boundary solutions, first we constrain ρ equal to 0, then equal to 0.5, and finally equal to 1 even though ρ is known to equal 0.7 in the simulated data.
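The data generation process above can be sketched as follows, assuming the standard library only. The seed is arbitrary, so the simulated median will differ from the 7.309 reported here; the function name `simulate` and the dictionary keys are hypothetical. Because the simulation generates both potential outcomes, the true unit-level effect δ is available for checking, even though an evaluator would see only `y_obs`.

```python
import math
import random
import statistics


def simulate(n=1000, rho=0.7, sigma=10.0, seed=2024):
    """Generate one sample from the article's data model."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        x = rng.random()  # X ~ uniform(0, 1)
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e_t = sigma * z1
        # Correlate the control residual with the treated residual
        e_c = sigma * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        y_t = 5.0 + 10.0 * x + e_t
        y_c = 0.0 + 5.0 * x + e_c
        # Units are i.i.d. draws, so assigning the first half is a random half
        treated = i < n // 2
        rows.append({
            "x": x,
            "treated": treated,
            "y_obs": y_t if treated else y_c,  # what the evaluator observes
            "delta": y_t - y_c,                # true effect, unobservable in practice
        })
    return rows


rows = simulate()
true_median = statistics.median(r["delta"] for r in rows)
print(f"median of the true treatment effects: {true_median:.3f}")
```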

When we set ρ = 0, the mean for the distribution of the median treatment effect (i.e., the mean of the distribution of τ) is 7.57 with a standard deviation of 0.559. The 95% credible interval is 6.45–8.64 (see Table 3).

Table 3.

Simulation Study of How ρ Affects the Credibility Interval.

ρ	95% Credibility Interval
ρ	Lower Limit	Mean	Upper Limit	Width
ρ = 1.0	6.28	7.50	8.71	2.43
ρ = 0.5	6.35	7.53	8.67	2.32
ρ = 0.0	6.45	7.57	8.64	2.19

Next, assume ρ = 0.5. (Its true value remains 0.7.) A 95% credible interval runs from 6.35 to 8.67. The mean of this distribution is 7.53 with a standard deviation of 0.595. As we have increased the assumed value of ρ from 0 to 0.5, the standard deviation has increased as has the credible interval.

Finally, still estimating the median effect subject to bounding, we set ρ = 1. As before, the mean of the distribution for the median is close to expectations and its standard deviation is 0.614. The credible interval is 6.28–8.71. Note that the standard deviation and credible interval have increased again. Given the discussion in the previous section, this increase is expected. From Manski’s perspective of bounding, we might prefer the estimates where ρ is forced to equal 1 since this is the most conservative yet clearly informative assumption about ρ. That is, it is the bounded interval given no knowledge of ρ except the assumption that ρ > 0.

The above is intended to illustrate the assertions made in the previous section. Traditional parametric and nonparametric methods exist for estimating the mean or median effect, so we would not struggle with the above steps if estimating a mean or median were the objective. Nevertheless, applying these steps is instructive, and we can consider other summary measures that are uniquely Bayesian.

Extending this illustration, suppose we want to know the maximum size of the treatment effect for the 10% of units for whom the treatment effect is smallest and the minimum size of the treatment effect for the 10% of units for whom the treatment effect is largest. These are inherently Bayesian statistics that lack a frequentist counterpart. Operationally, for every simulation, we identify the first and ninth deciles for the distribution of δ. We then take the mean of the first and ninth deciles over the 10,000 repetitions. This is τ.

Assuming that ρ = 1, the mean for the distribution of the lowest 10% of treatment effects is 6.09 with a standard deviation of 0.843. This is sensible because it is close to (5 − 0) + (10 − 5) × 0.10 = 5.5. The credible interval is 4.32–7.65. The mean for the distribution of the highest 10% of treatment effects is 8.92 with a standard deviation of 0.916. This is sensible because it is close to (5 − 0) + (10 − 5) × 0.90 = 9.5. The credible interval is 7.34–10.61.

This bounded estimate is helpful. It indicates that even those who benefit the least from the intervention still benefit. However, as evaluators, we are mindful that the value of ρ is unknown, and while we might suspect it is near 1, we cannot be sure. What happens if we set ρ = 0? Now the 10% of units who do the worst under treatment appear to be harmed by treatment. The mean for τ is −11.13, and the credible interval is between −12.63 and −9.57. For the 10% of the units who do the best under treatment, the mean for τ is 26.19, and the credible interval is between 24.74 and 27.69. The 10% of units who do the best under treatment do very well. We anticipated this result with the discussion in the previous section. When ρ = 1, the upper and lower deciles are closer to each other, and their credible intervals are larger, compared with the assumption of ρ = 0.

As evaluators, we might choose to augment knowledge by just estimating the credible interval for a summary measure of the treatment effects conditional on x: $(β_{0 t} - β_{0 c}) + (β_{1 t} - β_{1 c}) x$ . That credible interval depends only on the posterior distribution for the βs and the distribution of X. We have called this explained heterogeneity. We know that about 10% of units should have an effect of about 5 + 5 × 0.10 = 5.5 or less and about 10% of units should have an effect of about 5 + 5 × 0.9 = 9.5 or more. Using the simulation described above, we estimate a lower decile as 6.5 with a standard error of 0.92, and we estimate an upper decile as 8.5 with a standard error of 0.92. As anticipated, the lower and upper deciles for the explained treatment effects are higher and lower, respectively, than the lower and upper deciles for the treatment effects that include explained and unexplained effects.

There are other ways to summarize the distribution of treatment effects. One other approach is to put subjects into one of the three categories: The benefit was substantial, the harm was substantial, or there was neither substantial benefit nor harm. For purposes of illustration, in this simulation, we assume that an effect of ±10 is substantial. Our principal interest is with balance between the winners and losers in this hypothetical treatment lottery, so we consider a bounding scenario of ρ = 0.

The simulation shows that between 0.095 and 0.134 of the units are substantially harmed by treatment, that between 0.423 and 0.483 of the units experience neither substantial harm nor substantial benefits, and that between 0.278 and 0.358 more of the population benefit than are harmed. (This latter range comes from subtracting the proportion harmed from the proportion benefitting in each iteration.) Even a pessimistic assumption that ρ = 0 suggests that a substantial proportion of the units benefit from treatment and that for every substantial loser there are about four substantial gainers. This bounded solution is more informative than a simple conclusion that the average treatment effect is positive.
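The three-category summary can be computed within each iteration from the imputed unit-level effects. The sketch below shows that single step; the function name `classify_effects`, the choice to count effects exactly equal to ±10 as substantial, and the eight example values are all assumptions made for illustration. In practice the function would be applied to the δ values from every posterior iteration, and the credible interval for each proportion would be read from the percentiles across iterations.

```python
def classify_effects(deltas, threshold=10.0):
    """Share of units substantially harmed, substantially benefited, or
    neither, plus the net share (benefited minus harmed)."""
    n = len(deltas)
    harmed = sum(d <= -threshold for d in deltas) / n
    benefited = sum(d >= threshold for d in deltas) / n
    return {
        "harmed": harmed,
        "neither": 1.0 - harmed - benefited,
        "benefited": benefited,
        "net": benefited - harmed,
    }


# Hypothetical imputed effects for eight units in one iteration
example = [-12.0, -3.0, 0.5, 4.0, 11.0, 15.0, 22.0, 9.9]
print(classify_effects(example))
# {'harmed': 0.125, 'neither': 0.5, 'benefited': 0.375, 'net': 0.25}
```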

Using Real-World Data

We apply the I&R procedure to a study by Gaes and Camp (2009), who reanalyzed experimental data collected by Berk et al. (2003). The purpose of the Gaes/Camp paper was to evaluate the effect of prison security level assignment on postrelease behavior. Gaes and Camp and other scholars (Bench & Allen, 2003; Chen & Shapiro, 2007) have argued that the prison environment has criminogenic properties. Higher security prisons are more criminogenic than lower security prisons, implying that prisoners assigned to and released from higher security prisons will have a higher probability of recidivating. Plausibly, however, some offender outcomes are improved by higher security levels, which may be necessary for those offenders to adjust to prison rehabilitation regimes.

Berk et al. (2003) randomly assigned inmates to California prison security levels to test modifications made to the scoring of California’s inmate classification system. A total of 561 inmates had scores consistent with placing them in a Level III (high security) prison. Of the 561, 264 were randomly assigned to a Level I (low security) prison. Considering low security as the control condition and high security as the treatment condition, Gaes and Camp evaluated prison assignment using both nonparametric and semi-parametric survival analyses. A Cox regression showed that on average treatment increased the hazard of recidivating by 31.1%.

We reanalyzed these data using the I&R procedure based on a binary outcome. (An Online Appendix discusses the estimator for a binary outcome.) To create a binary variable, we chose the minimum time at risk for the 561 people in the sample. This was 192 days. If someone was returned to prison in that time frame, they were assigned a value of 1, and 0 otherwise. The analysis included the following covariates: age at release, race, Hispanic origin, conviction crimes (person, drug, property, and other where property was the excluded category), and a dummy variable recording whether someone had an arrest prior to age 17. The raw Level III (treatment) and Level I (control) recidivism rates were 37.5% and 25.2%, so the ATE was 12.3%.

We applied the I&R technique setting ρ to 0, 0.5, and 1. We sought to learn how frequently treatment increased recidivism, how frequently it had no effect, and how frequently treatment decreased recidivism. First, we imputed the latent variable Xβ for each of the K iterations. Second, we perturbed the latent variable by randomly drawing ∊ from the conditional standard normal distribution to get Xβ + ∊.⁵ Third, if Xβ + ∊ was greater than zero, we assigned recidivism as the outcome and otherwise we assigned no recidivism. Thereby, we determined whether the outcome was worse under treatment, the same under treatment, or improved under treatment. The results are shown in Tables 4 and 5.
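The three steps above can be sketched for a single iteration as follows. This is a deliberately simplified illustration, not the estimator from the Online Appendix: it draws both error terms from an unconditional bivariate standard normal with correlation ρ and ignores the conditioning on each unit's observed outcome that the full procedure uses. The function names and the latent index values (−0.3 under treatment, −0.7 under control) are hypothetical.

```python
import math
import random


def classify_unit(xb_t, xb_c, rho, rng):
    """Compare one unit's imputed binary outcomes under the two states."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    eps_t = z1
    eps_c = rho * z1 + math.sqrt(max(0.0, 1.0 - rho ** 2)) * z2
    recid_t = xb_t + eps_t > 0.0  # recidivism under higher security
    recid_c = xb_c + eps_c > 0.0  # recidivism under lower security
    if recid_t and not recid_c:
        return "higher"  # outcome worse under treatment
    if recid_c and not recid_t:
        return "lower"   # outcome better under treatment
    return "same"


def shares(xb_pairs, rho, rng):
    """Proportion of units in each category for one iteration."""
    counts = {"lower": 0, "same": 0, "higher": 0}
    for xb_t, xb_c in xb_pairs:
        counts[classify_unit(xb_t, xb_c, rho, rng)] += 1
    return {k: v / len(xb_pairs) for k, v in counts.items()}


# Hypothetical latent indices, identical across 500 units for simplicity
pairs = [(-0.3, -0.7)] * 500
rng = random.Random(0)  # arbitrary seed for this sketch
for rho in (0.0, 0.5, 1.0):
    print(rho, shares(pairs, rho, rng))
```

Repeating this over the K posterior draws of the coefficients, and taking percentiles of each share across draws, gives intervals of the kind reported in Table 5.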

Table 4.

Mean Coefficients, Standard Deviations, and 95% Credible Intervals for Parameters in the Level III and Level I Probit Bayesian Regression Models.

Outcome: Binary Recidivism	Level III: Treatment				Level I: Control				Test Level III Versus Level I Coefficient, p <
			Credible Interval				Credible Interval
	Mean	SD	Low	High	Mean	SD	Low	High
Age release	.02	.01	−.01	.05	.03	.01	.00003	.05	.98
White	.73	.27	.17	1.25	−.08	.22	−.52	.35	.03
Hispanic	.06	.20	−.32	.43	.56	.19	.16	.93	.18
Person crime	−.02	.15	−.32	.28	−.23	.20	−.63	.13	.77
Other crime	−.46	.27	−.99	.08	−.59	.39	−1.4	.11	.98
Drug crime	−.47	.15	−.75	−.18	−.15	.23	−.59	.27	.55
Serious misconduct	.59	.17	.25	.92	.43	.17	.13	.78	.88
Administrative misc.	.11	.17	−.23	.44	.23	.17	−.11	.57	.92
Age first arrest < 17	.47	.21	.07	.88	.22	.20	−.17	.64	.76
Constant	−1.45	.37	−2.24	−.79	−1.73	.47	−2.71	−.84	.93

Table 5.

Imbens/Rubin Procedure Applied to the Gaes–Camp Data: The Percentage of Offenders Who Recidivate Within 192 Days After Release From Prison Comparing Those Initially Assigned to a Level III as Opposed to a Level I Prison (Credible Interval = 2.5–97.5).

ρ	Estimate	Recidivism Lower Given Higher Security Level	Recidivism the Same Regardless of Security Level	Recidivism Higher Given Higher Security Level	Difference Between Higher and Lower Security Level
ρ = 1	Percentage	2.4	83.2	14.4	12.0
ρ = 1	Credible interval	0.7–4.6	78.4–87.5	9.8–19.6	6.2–18.2
ρ = 0.5	Percentage	8.8	68.8	22.4	13.7
ρ = 0.5	Credible interval	6.2–11.6	64.9–72.5	18.4–26.7	7.8–19.6
ρ = 0	Percentage	14.2	57.9	27.9	13.8
ρ = 0	Credible interval	11.4–17.1	54.0–61.8	24.3–31.7	8.4–18.9

Table 4 summarizes Bayesian regression results. The first column identifies variables. The second and third columns report the mean βs and standard deviations from the posterior distribution of the Bayesian regression on the treatment group. The fourth and fifth columns report the low and high values for the 95% credible interval. The sixth and seventh columns show the mean βs and standard deviations from the posterior distribution of the Bayesian regression on the control group. The eighth and ninth columns report the low and high values for the 95% credible interval. The last column indicates whether the mean β parameters from the second and sixth columns are statistically different. Only the coefficient for race was different between the two models.

Table 5 summarizes estimates for τ, defined alternatively as percent with less recidivism given treatment, percent with the same level of recidivism given treatment or control, and percent with more recidivism given treatment. Under all three assumptions about ρ, most offenders had equivalent outcomes under treatment and control. The 2.5%–97.5% credible intervals for the scenarios also appear in the table. Examining those credible intervals shows that they are informatively narrow. In all scenarios, a higher proportion of offenders are harmed (high recidivism) than helped by treatment. However, for a substantial number, there is no difference in the outcome, presumably because most offenders avoid recidivism during the follow-up period.

Still, the credible intervals are large conditional on ρ. An alternative definition of τ is informative: Define τ as the difference between the percentage who are harmed and the percentage who are helped by treatment, that is, as the difference between the τs defined above. This direct measure of benefit and harm from assigning prisoners to a higher security level is relatively insensitive to assumptions about ρ. Assuming that ρ can be between 0 and 1, the bounded credible interval is 6.2–19.6. This is not much larger than the actual credible interval conditional on ρ = 1.
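The bounded interval quoted above is the union of the conditional intervals reported in the last column of Table 5. A minimal sketch of that step, with a hypothetical function name, is:

```python
def bounded_interval(intervals):
    """Lowest lower limit and highest upper limit across assumed rho values."""
    lows = [low for low, _ in intervals.values()]
    highs = [high for _, high in intervals.values()]
    return min(lows), max(highs)


# Credible intervals for the difference column of Table 5, keyed by rho
table5_difference = {1.0: (6.2, 18.2), 0.5: (7.8, 19.6), 0.0: (8.4, 18.9)}
print(bounded_interval(table5_difference))  # (6.2, 19.6)
```

Note that the two limits come from different assumptions about ρ (the lower limit from ρ = 1 and the upper limit from ρ = 0.5), which is why the result is a bounded estimate and not itself a credible interval.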

What is a reasonable assumption about ρ? Consistent with the earlier discussion on the meaning of ρ, we justify our selection of ρ between 0 and 1. If we think of the recidivism outcomes as an ordered set of probabilities from low to high in the control group, then whether we theorize higher security levels are criminogenic or rehabilitative, it seems unlikely that the security-level assignment would reverse the order. That is, higher risk offenders likely remain at a higher risk in both states.

Put another way, observed and unobserved covariates explain some of the systematic variation in recidivism rates in both the higher risk and lower risk prisoners. The ρ pertains to unobserved variables that affect outcomes in those two states. It seems very likely that those unobserved variables have about the same influence in the treated and control states, suggesting that ρ is close to 1. It seems very unlikely that those unobserved variables have a strongly reversed influence in the treated and control states, suggesting that ρ is unlikely to be less than 0. Bounding is informative in this real-world illustration.

Discussion

Treatment effects are not necessarily heterogeneous; however, we suspect that heterogeneous treatment effects are common. The heterogeneity may be small, but when it is large, it becomes an important outcome worthy of scientific study.

Some evaluators have attempted to explain treatment heterogeneity by examining how the size of treatment effects varies with covariates or across strata. In the term used in this article, these evaluators have looked for explained treatment heterogeneity. An alternative approach, developed by I&R (2015), uses Bayesian imputation procedures to estimate the entire distribution of treatment effects, both explained and unexplained treatment heterogeneity. The two approaches are complementary but not equivalent.

A skeptical reader may still ask why a Bayesian analysis is necessary, especially since our estimation procedures used noninformative priors. For the reader unfamiliar with noninformative priors, think of a situation where the evaluator uses a statistical distribution with explicitly defined parameters that represent the treatment effects, such as those we used in the domestic violence hypotheticals. In the absence of any prior information about the treatment effects, the prior distribution of parameters tends to be wider and “flatter” and the influence on the posterior distribution is small relative to the likelihood function. This is a kind of statistical admission that prior to a study, we have little or no knowledge about the effect of treatment, but we do know that it has a likely distributional form, bivariate normal after conditioning on covariates. In our case, we need a procedure to draw parameters from the posterior distribution allowing us to impute values for our unobserved treatment and control counterfactual outcomes.

Of course, one of the strengths of Bayesian inference comes from using prior knowledge about the distribution of parameters. Gelman (2002) distinguishes between highly, moderately and noninformative priors. Knowledge can come from many sources including: meta-analyses of a research domain, an a priori understanding of the mean and shape of the parameter space, or precise empirical information on the parameter space. There is no reason why the I&R procedure cannot be adapted to incorporate an informative prior. We have not taken that step in this article, but we have shown how informative priors could be introduced into the analysis. Furthermore, in a technical sense, our assumptions about ρ might be considered to use alternative informative priors at least about ρ.

If we used maximum likelihood to estimate the missing counterfactuals (the frequentist approach), we would underestimate the imputation variability unless the sample size were quite large. Maximum likelihood and Bayesian procedures would give similar results only if the sample sizes were extremely large. I&R (2015) state this more formally. While a frequentist can avoid the choice of the prior distribution, doing so comes with a cost: “Nearly always one has to rely on large sample approximations to justify the derived frequentist confidence intervals” (p. 174).

Furthermore, from this method, we get Bayesian-estimated quantities that allow us to characterize the distribution of treatment effects beyond a mean and median. As we have noted, these include more informative yet simple summary measures such as the proportion of people who benefit from treatment, the proportion who are harmed by treatment, and the proportion for whom treatment has no effect. Using a bit more ingenuity, the evaluator could calculate alternative statistics such as the proportion who benefit substantially and the proportion who benefit marginally. The choice of statistics is limited only by the needs of the research.

For example, in the introduction, we suggested that estimates of the distribution of treatment effects are important for benefit/cost analysis. Returning to the domestic violence illustration, policy makers might put a value of 10 units on domestic violence that is deterred by an arrest, they might put a value of −5 units on a domestic violence incident that occurs despite an arrest, and they might put a value of −15 units on a domestic violence incident that occurs because of an arrest. Given this loss equation, even a policy of arresting batterers that has no average effect on the rate of domestic violence can have a profound effect on social welfare, but an assessment requires some estimate of the distribution of treatment effects.
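The loss equation in this example is a weighted sum over the three groups. The sketch below uses the unit values stated above; the proportions in the two scenarios are hypothetical and were chosen only so that the share deterred equals the share harmed, which makes the average effect on the rate of domestic violence zero in both.

```python
def expected_welfare(p_deterred, p_despite, p_caused,
                     v_deterred=10.0, v_despite=-5.0, v_caused=-15.0):
    """Expected welfare per case: each proportion times its unit value."""
    return (p_deterred * v_deterred
            + p_despite * v_despite
            + p_caused * v_caused)


# Two hypothetical policies, each with no average effect on violence
# (share deterred equals share caused) but different heterogeneity.
wide = expected_welfare(p_deterred=0.20, p_despite=0.10, p_caused=0.20)
narrow = expected_welfare(p_deterred=0.05, p_despite=0.10, p_caused=0.05)
print(wide, narrow)
```

With these illustrative numbers the more heterogeneous policy yields lower welfare even though both have a zero average effect, which is why the assessment needs the distribution of treatment effects and not only its mean.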

The I&R approach has a limitation. Given the fundamental problem that potential outcomes are observed in the treatment state, or in the control state, but never in both states simultaneously, an evaluator cannot estimate the correlation between the outcomes in the control and treatment states. Consequently, the evaluator cannot estimate a credible interval. This is disconcerting, but we have demonstrated that bounding solutions can provide useful insight. This finding may not be true of every evaluation.

The I&R solution has two components. There is explained heterogeneity attributable to the effects that covariates have on the outcomes in the treated and control states. Given random assignment, the explained heterogeneity is identified, and its estimation does not require bounding. Additionally, there is unexplained heterogeneity, arising from the error terms in the data model. The variances for those error terms are identified, but their correlation is not. Bounding comes from having to guess the size of that correlation.
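
To illustrate how the unidentified correlation produces bounds, the following sketch assumes a simplified case: no covariates, bivariate normal errors, and an outcome for which larger values are better. Under those assumptions the individual treatment effect is normal with mean μ1 − μ0 and variance σ1² + σ0² − 2ρσ1σ0, so the share who benefit depends on ρ even though the means and variances are identified. The parameter values are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def share_who_benefit(mean_effect: float, sd1: float, sd0: float, rho: float) -> float:
    """Pr(Y1 - Y0 > 0) when the potential outcomes are bivariate normal."""
    effect_var = sd1 ** 2 + sd0 ** 2 - 2.0 * rho * sd1 * sd0
    if effect_var <= 1e-12:
        # Degenerate case: every unit has the same treatment effect.
        return 1.0 if mean_effect > 0 else 0.0
    return normal_cdf(mean_effect / math.sqrt(effect_var))


# Sweep the unidentified correlation over an assumed range of 0 to 1.
for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(rho, round(share_who_benefit(mean_effect=0.5, sd1=1.0, sd0=1.2, rho=rho), 3))
```

The smallest and largest values across the sweep form the bound on the share who benefit for this one set of parameter values; a full analysis would repeat the sweep for each posterior draw of the identified parameters.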

Although it is unsettling to lack the means to identify the correlation, there is comfort that bounding plays a lessened role as the explained heterogeneity increases. The introduction of covariates is directly important because knowledge about the explained heterogeneity is useful and is indirectly important because an increase in the size of the explained heterogeneity reduces the importance of the unidentified correlation.
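
A short sketch shows why. It assumes a linear model in which the variance of the treatment effect equals an identified explained part plus σ1² + σ0² − 2ρσ1σ0, where σ1 and σ0 are the residual standard deviations. It also assumes, for illustration, that ρ lies between 0 and 1 and that covariates explain the same share (R²) of outcome variance in both states. All names and values are hypothetical.

```python
import math


def effect_variance_bounds(explained_var, resid_sd1, resid_sd0, rho_low=0.0, rho_high=1.0):
    """Lower and upper bounds on Var(treatment effect) over the assumed range of rho."""
    def unexplained(rho):
        return resid_sd1 ** 2 + resid_sd0 ** 2 - 2.0 * rho * resid_sd1 * resid_sd0

    # A larger rho gives a smaller variance, so rho_high yields the lower bound.
    return explained_var + unexplained(rho_high), explained_var + unexplained(rho_low)


def bound_width(total_sd, r_squared, explained_var=0.0):
    """Width of the bound when covariates explain r_squared of each outcome's variance."""
    resid_sd = total_sd * math.sqrt(1.0 - r_squared)
    low, high = effect_variance_bounds(explained_var, resid_sd, resid_sd)
    return high - low


for r2 in (0.0, 0.5, 0.9):
    print(r2, round(bound_width(total_sd=1.0, r_squared=r2), 3))
```

Under these assumptions the width of the bound is proportional to 1 − R², so the unidentified correlation matters less as the covariates explain more of the outcome variance.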

Some evaluators object to introducing covariates into an RCT (Freedman, 2008); others are attracted by the prospect of reducing standard errors while being cautious about overfitting (Lin, 2012). Again, consider what models that introduce covariates tell us. The systematic Xβ part of the model corresponds to a conditional mean. It is not necessarily a statement of causality. Thus, the introduction of covariates allows the evaluator to estimate systematic correlation between covariates and treatment effects and, also, to reduce the importance that the unidentified ρ plays in the analysis.

Thinking about the unidentified correlation as the correlation between residuals rather than the correlation between outcomes in the treatment and control states, an evaluator might see the residuals as resulting from common unmeasured factors. It seems likely, then, that these residual correlations are close to 1. This need not be true. Evaluators should give careful thought to the mechanisms accounting for outcomes in the treated and control states. Certainly, if theory identifies variables that are moderators or mediators, and if those variables are measured, the moderators/mediators should be included in the statistical model.

Several other topics deserve discussion, but the length of a short paper precludes all but a summary. Evaluators should distinguish between finite populations and super-populations. Throughout this article, we have discussed inferences about finite populations. The finite population comprises the n units that enter the study’s sample. For most evaluations, finite population estimates are probably appropriate.

Suppose, however, that an evaluator sees the n units as a random sample from a much larger population of size N. There are two cases. First, consider the situation where there are no covariates. Then the super-population estimate comes from imputing all outcomes, that is, from substituting imputations for all observations. The observed outcomes are only used for deriving the posterior distribution. Second, consider the situation where there are covariates. Now the computations are more complicated because in theory the covariates come from a distribution for which the n observations are a random sample. That additional uncertainty must be factored into the super-population estimates, although an evaluator might consider the variability of X as of secondary importance (and ignore it) in a large sample.

Note that whether one is inferring to a finite population or to a super-population, the same analysis leads to the posterior distribution of the parameters. The only computational differences occur from the application of the G(…) function. It either assigns imputed values to all outcomes (super-population) or to just the missing outcomes (finite population).
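
The difference between the two imputation schemes can be sketched for the no-covariate case. The code assumes bivariate normal potential outcomes, a single hypothetical posterior draw of the means and standard deviations, and an assumed value of ρ. The summary share_with_positive_effect is a hypothetical example statistic, not the article's G(…) function, and the data are invented for illustration.

```python
import math
import random


def draw_given(y_known, mean_known, sd_known, mean_target, sd_target, rho, rng):
    """Draw one potential outcome from its conditional normal given the other."""
    cond_mean = mean_target + rho * (sd_target / sd_known) * (y_known - mean_known)
    cond_sd = sd_target * math.sqrt(max(0.0, 1.0 - rho ** 2))
    return rng.gauss(cond_mean, cond_sd)


def impute_finite_population(y_obs, treated, mu1, mu0, sd1, sd0, rho, rng):
    """Keep every observed outcome and impute only the missing counterfactual."""
    pairs = []  # each entry is (Y1, Y0)
    for y, is_treated in zip(y_obs, treated):
        if is_treated:
            pairs.append((y, draw_given(y, mu1, sd1, mu0, sd0, rho, rng)))
        else:
            pairs.append((draw_given(y, mu0, sd0, mu1, sd1, rho, rng), y))
    return pairs


def impute_super_population(n_draws, mu1, mu0, sd1, sd0, rho, rng):
    """Impute both outcomes; the data enter only through the parameter draw."""
    pairs = []
    for _ in range(n_draws):
        y0 = rng.gauss(mu0, sd0)
        pairs.append((draw_given(y0, mu0, sd0, mu1, sd1, rho, rng), y0))
    return pairs


def share_with_positive_effect(pairs):
    """Hypothetical summary: share of units with Y1 greater than Y0."""
    return sum(1 for y1, y0 in pairs if y1 > y0) / len(pairs)


rng = random.Random(0)
y_obs = [2.1, 0.4, 1.7, 0.9, 1.2, 0.3]             # hypothetical observed outcomes
treated = [True, False, True, False, True, False]  # hypothetical assignment
params = dict(mu1=1.6, mu0=0.6, sd1=0.5, sd0=0.4, rho=0.5)  # one draw plus assumed rho

finite = impute_finite_population(y_obs, treated, rng=rng, **params)
superpop = impute_super_population(1000, rng=rng, **params)
print(share_with_positive_effect(finite), share_with_positive_effect(superpop))
```

In a full analysis both functions would be called once per posterior draw, and the collection of summaries would form the posterior distribution of the statistic.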

Many experiments use simple random assignment, but many others involve more complicated designs. Cluster randomization (Donner & Klar, 2000, 2004; Hayes & Moulton, 2009) is popular, especially in settings where simple random assignment would violate the Stable Unit Treatment Value Assumption. The I&R approach applies to cluster randomized designs. The trick is to treat the clusters as the units of analysis. Cluster randomization is often designed to support inference to a super-population. Again, an adequate discussion would go beyond the scope of this article and beyond the authors’ deliberations, but see I&R (2015, Chapter 9).

A special version of cluster randomization, pairwise matching (Imbens, 2011; Rhodes, 2014), raises an interesting possibility. When applying this approach, an evaluator pairs units (typically clusters) based on similarity of covariates that explain posttreatment outcomes (pretreatment baseline rates might be suitable). One member of the pair is randomly assigned to treatment and the other to control. The pairing leads to a direct measure of the finite sample distribution of treatment effects. Again, an adequate discussion goes beyond the scope of this article, but see I&R (2015, Chapter 10).
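
As a rough sketch of what such a direct measure might look like, the following code treats each within-pair difference as an estimate of that pair's treatment effect and tabulates the shares that are positive, zero, and negative. This reading is our simplification: it assumes the pairs are well matched and that larger outcomes are better, and the data are hypothetical.

```python
def effect_shares(pairs, tolerance=0.0):
    """Share of pairs whose treated-minus-control difference is positive, zero, or negative.

    pairs: list of (treated_outcome, control_outcome), one entry per matched pair.
    tolerance: differences within +/- tolerance are counted as no effect.
    """
    diffs = [treated - control for treated, control in pairs]
    n = len(diffs)
    return {
        "benefit": sum(1 for d in diffs if d > tolerance) / n,
        "no_effect": sum(1 for d in diffs if abs(d) <= tolerance) / n,
        "harm": sum(1 for d in diffs if d < -tolerance) / n,
    }


# Hypothetical matched pairs: (treated outcome, control outcome).
matched_pairs = [(12, 10), (8, 8), (7, 9), (15, 11)]
print(effect_shares(matched_pairs))
```

Because the two members of a pair are similar but not identical, the within-pair differences mix true effect heterogeneity with residual differences between the paired units, so the shares should be read with that caveat.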

We acknowledge that this article is concerned with simple random design experiments. Many evaluations have quasi-experimental designs. Techniques for estimating the distribution of treatment effects extend to quasi-experimental designs, but in that case, the evaluator faces an additional challenge: He or she must ensure that the treatment effect is identified. This is a daunting challenge, but when it is met, the I&R approach is applicable.

Finally, we note that the Bayesian approach forces the evaluator to make some parametric assumptions that go beyond the assumptions that are required for estimating the mean or median effect. We have argued that assumptions about the prior are innocuous because uninformative priors are sufficient to drive the models. Assumptions about the likelihood are more substantive, but we note again that, other than the parameter ρ, all parameters are identified. Evaluators worried about unwarranted assumptions can perform standard diagnostic tests. Gross and misleading errors are avoidable.

We conclude that I&R’s recommended procedures are widely applicable to social science evaluation research. We believe that understanding the distribution of treatment effects is important for scientific inquiry. We recommend that evaluators performing RCTs incorporate the use of I&R’s approach into their analysis and that other evaluators revisit evaluation results to augment findings regarding average treatment effects.

Supplemental Material

Supplemental Material, Appendix_2_23_2018 for Estimating the Distribution of Treatment Effects From Random Design Experiments by William Rhodes and Gerald Gaes in Evaluation Review

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

William Rhodes

Supplemental Material

Supplemental material for this article is available online.

References

Bench & Allen (2003). Investigating the stigma of prison classification: An experimental design. The Prison Journal, 83(4), 367–382.

Berk (2005). Randomized experiments as the bronze standard. Journal of Experimental Criminology, 1(4), 417–433.

Berk, Ladd, Graziano, & Baek (2003). Randomized experiment testing inmate classification systems. Criminology and Public Policy, 2(2), 215–242.

Bertsekas & Tsitsiklis (2008). Introduction to probability (2nd ed.). Athena Scientific Books.

Bloom (1984). Accounting for no-shows in experimental evaluation designs. Evaluation Review, 8(2), 225–246.

Bloom (2005). Learning more from social experiments: Evolving analytical approaches. Russell Sage Foundation.

Burtless (1995). The case for randomized field trials in econometrics and policy research. Journal of Economic Perspectives, 9(2), 63–84.

Chen & Shapiro (2007). Do harsher prison conditions reduce recidivism? A discontinuity based approach. American Law and Economics Review, Advanced Access 9(1), 1–29.

Cronbach & Snow (1977). Aptitudes and instructional methods. Irvington.

Deaton & Cartwright (2016, September). Understanding and misunderstanding randomized controlled trials. [NBER Working Paper 22595]. http://www.nber.org/papers/w22595.pdf

Donner & Klar (2004). Pitfalls and controversies in cluster randomization trials. American Journal of Public Health, 93(3), 416–422.

Enders (2010). Applied missing data analysis. Guilford Press.

Frangakis & Rubin (2002). Principal stratification in causal inference. Biometrics, 58, 21–29.

Freedman (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 30(6), 180–193.

Gaes & Camp (2009). Unintended consequences: Experimental evidence from the criminogenic effect of prison security level placement on post-release recidivism. Journal of Experimental Criminology, 5(2), 139–162.

Garner, Fagan, & Maxwell (1995). Published findings from the spouse assault replication program: A critical review. Journal of Quantitative Criminology, 11(1), 3–27.

Gelman (2002). Prior distribution. In A. H. El-Shaarawi & W. W. Piegorsch (Eds.), Encyclopedia of Environmetrics, 3, 1634–1637.

Guo & Fraser (2010). Propensity score analysis: Statistical methods and applications. Sage Publications.

Hayes & Moulton (2009). Cluster randomization designs. Chapman & Hall/CRC Interdisciplinary Series.

Heckman (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161.

Heckman & Vytlacil (2007a). Econometric evaluation of social programs, Part I: Causal models, structural models and econometric policy evaluations. In Heckman & Leamer (Eds.), Handbook of econometrics (Vol. 6b, pp. 740). North Holland Press.

Hoff (2009). A first course in Bayesian statistical methods. Springer.

Imbens & Angrist (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467–475.

Imbens & Rubin (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge University Press.

Johnson & Neyman (1936). Tests of certain linear hypotheses and their applications to some educational problems. Statistical Research Memoirs, 1, 57–93.

Kruschke (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Elsevier.

Lancaster (2004). An introduction to modern Bayesian econometrics. Blackwell Publishing.

Lin (2012). Agnostic notes on regression adjustments in experimental data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7(1), 295–318.

Little & Rubin (2002). Statistical analysis with missing data (2nd ed.). Wiley.

Maddala (1983). Limited-dependent and qualitative variables in econometrics. Cambridge University Press.

Manski (2007). Identification for prediction and decision. Harvard University Press.

Morgan & Winship (2015). Counterfactuals and causal inference: Methods and principles for social research (2nd ed.). Cambridge University Press.

Page (2012). Principal stratification as a framework for investigating mediational processes in experimental settings. Journal of Research on Educational Effectiveness, 5, 215–244.

Page, Feller, Grindal, Miratrix, & Somers (2015). Principal stratification: A tool for understanding variation in program effects across endogenous subgroups. American Journal of Evaluation, 36(4), 514–531.

Peck (2003). Subgroup analysis in social experiments: Measuring program impacts based on post-treatment choice. American Journal of Evaluation, 24, 157–187.

Raudenbush & Bloom (2015a, May 1). It’s no longer all about the mean. William T. Grant Foundation. http://wtgrantfoundation.org/resource/its-no-longer-all-about-the-mean-using-multi-site-trials-to-learn-about-and-from-impact-variation

Raudenbush & Bloom (2015b). Learning about and from variation in program impacts using multi-site trials. http://wtgrantfoundation.org/library/uploads/2015/10/Learning-About-and-From-Variation-in-Program-Impacts-Using-Multi-site-Trials1.pdf

Rhodes (2014). Pairwise cluster randomization: An exposition. Evaluation Review, 38(3), 217–250.

Rubin (1978). Bayesian inferences for causal effects: The role of randomization. Annals of Statistics, 6, 34–58.

Sampson (2010). Gold standard myths: Observations on the experimental turn in quantitative criminology. Journal of Quantitative Criminology, 26(4), 489–500.

Schaefer (1997). Analysis of incomplete multivariate data. Chapman & Hall/CRC.

Sherman & Berk (1984a). The Minneapolis domestic violence experiment. Police Foundation.

Sherman & Berk (1984b). The specific deterrent effects of arrest for domestic assault. American Sociological Review, 49, 261–272.

Thompson (2014). Bayesian analysis with Stata. Stata Press.

Xie, Brand, & Jann (2012). Estimating heterogeneous treatment effects with observational data. Sociological Methodology, 42, 314–347.
