On Overfitting in Analysis of Symmetrically Predicted Endogenous Subgroups From Randomized Experimental Samples

Abstract

Using exogenous characteristics to identify endogenous subgroups, the approach discussed in this method note creates symmetric subsets within treatment and control groups, allowing the analysis to take advantage of an experimental design. To maintain treatment–control symmetry, however, prior work has posited that it is necessary to use a prediction subsample, separate from the subsample used for impact estimation, to prevent overfitting from affecting impact estimates. Doing so diminishes sample size, both for prediction and analysis, and so has costs. This article considers the conditions under which overfitting occurs and characterizes the effects of overfitting in terms of bias and variance. It suggests a strategy for preserving the full sample size in all phases of the analysis. The research uses Monte Carlo simulation to directly measure overfitting, to identify the circumstances that should concern us, and to explore possible recommended practices and future research implications.

Keywords

social experiments, program evaluation, simulation, overfitting, subgroup analysis

Not only are funders of evaluation research interested in learning about the overall causal effects of policy interventions, they increasingly want to know what it is about the intervention that is responsible for any observed effects. This article is part of a larger effort to innovate in impact analysis methods to better isolate the effects of components of treatments. It discusses an approach that seeks evidence on the effects of treatment components by comparing symmetrically selected subgroups of treatment and control group members from experimental impact evaluations. The first part of this three-part method note (Peck, 2013) discusses in more detail the motivation and methods associated with the approach in general, first formally articulated as such in Peck (2003), though with its roots in earlier unpublished evaluation research.

In brief, using exogenous baseline characteristics to predict membership in subgroups revealed after random assignment, the approach identifies symmetric subsets of treatment and control groups for impact comparisons. This capitalizes on the experimental design’s internal validity while “teasing out” the effects of various treatment elements. We refer to this approach, in both this and the other parts of this series, as the Analysis of Symmetrically Predicted Endogenous Subgroups (or ASPES).

ASPES offers a tractable, experimentally based analytic technique when the subgroup membership of interest is observed in only one of the two randomized “arms” of the experiment, either the treatment group or the control group. Such is the case when postrandom assignment events determine membership, events potentially (and sometimes demonstrably) affected by the intervention. The selective sorting of the treatment and control groups on the focal event cannot be assumed to be the same, so subsetting the two samples based on postrandom assignment patterns will not produce equivalent groups for analysis, thus destroying the internal validity of the subgroup impact estimates.

This problem has long been recognized in the literature; so has Peck’s (2003) suggestion to predict subgroup membership as a function of background characteristics where membership can be observed (e.g., in the treatment group) and then apply the predictive model to the subgroup splitting of both the treatment and control group. This technique emphasizes symmetric sorting that preserves internal validity through treatment and control group subsample comparability.¹ One potential problem with this approach is that the predictive model of subgroup membership is expected to do a better job of predicting membership in the sample on which it was estimated than for the remainder of the data set. Purely random differences between the modeling sample and the remaining sample may create this problem, sometimes referred to as overfitting.

We focus on the topic of overfitting to determine the extent of threat it carries for findings derived from application of the ASPES approach and to examine ways to prevent any damage to the findings. Past applications of the method suggest overfitting might be a problem (e.g., Kemple & Snipes, 2000). Here we address the following questions: To what extent, and under what conditions, does overfitting exist and potentially threaten the validity of results from analysis of endogenously identified subgroups? To what extent might overfitting bias results and by how much? What are the implications for variance of using a split-sample, two-stage approach (defined below) to avoid overfitting?

The article proceeds as follows: We begin with some further motivation for the problem; then we discuss the general ASPES methodology as well as our methods for exploring issues of overfitting in that analysis; we present our findings, which consider subgroups defined both by events in the treatment group and by events in the control group and explore possible recommended practices; and finally we conclude with implications for future evaluation research.

Motivation

According to Peck (2013), the origins of the ASPES approach seem to lie within the team of researchers who conducted the experimental evaluation of the Job Training Partnership Act program (the National JTPA Study), including Larry Orr, Howard Bloom, Stephen Bell, Fred Doolittle, Winston Lin, George Cave, and Hans Bos. In their work, the endogenous subgroup of interest was training program applicants who would have particularly low future earnings if not served by the JTPA intervention. This group is endogenous to the treatment randomly assigned in that study, since only postrandom assignment behaviors and outcomes in the control group will reveal its members. Although the National JTPA Study was conducted in the 1980s and 1990s, with its final report issued in 1994, only recently have discussions explored the project’s use of a symmetric prediction strategy to form and analyze impacts on the very low earners subgroup in a way that capitalizes on the study’s experimental design. Members of the evaluation team report—though the work is not published anywhere—that they attempted to identify for separate analysis the members of the randomized treatment and control groups who would have had the lowest future earnings in the absence of JTPA (S. Bell, personal communication, September 12, 2011; J. Bos, personal communication, November 4, 2011; W. Lin, personal communication, September 12 and October 14, 2011). In doing so, they observed that their predictive model performed differentially on the control and treatment groups such that, they concluded, overfitting was responsible for the appearance of differential impacts by subgroup when there truly were none. Once the researchers separated the control group data observations used to estimate the predictive model from the data used to calculate subgroup impacts, these apparent impacts disappeared (hence the lack of subsequent reporting of findings; S. Bell, personal communication, September 12, 2011, and June 14, 2012). 
We revisit this situation later in this article as a simulated illustration of the extent and nature of the overfitting problem.

At the time, members of the team suggested “clean” and “dirty” approaches to the overfitting problem (W. Lin, personal communications, September 12 and October 14, 2011). The clean approach involved taking a random sample of the cases for which the subgroup-defining behavior could be observed and using it to estimate the predictive model, leaving the remainder of the data from that experimental arm (and all the data from the other experimental arm) to estimate impacts on predicted subgroups (W. Lin, personal communication, August 16, 2012). This is the approach that Peck (2003) later independently developed. In contrast, the dirty approach involved using the whole sample from the arm in which subgroup membership was observed both for predictive model estimation and for impact analysis.

Most published applications of analyses of endogenous subgroups do not randomly subdivide data from a given experimental arm into separate prediction and impact estimation samples. Instead, they use the whole sample from that arm for both predictive modeling and impact estimation. Most applications, therefore, offer a caveated interpretation of their “dirty” findings (e.g., Gibson, 2003; Harknett, 2006; Schochet & Burghardt, 2007). Some applications of the technique acknowledge that a separate prediction sample would be preferable, but they cite small sample size as the rationale for not taking this approach (e.g., Gibson, 2003; Kemple & Snipes, 2000; Morris & Hendra, 2009). The properties of random assignment assure that as sample size increases the treatment and control groups are increasingly alike even in the distribution of the error terms that generate the overfitting problem, thereby making potential overfitting less of a problem in larger samples than in smaller ones. That is, when there is more noise associated with sampling (as in smaller samples), any model fitted on one arm will more accurately identify subgroups within that arm than within the other arm, resulting in more overfitting bias.

A point of clarification is in order: by “overfitting” we do not mean to imply anything about whether the correct model is used to predict and identify endogenous subgroups of interest. In fact, we contend that a misspecified model could still result in unbiased impact estimates within treatment and control subgroups if overfitting in the sense meant here is avoided. The point at which model specification matters is more relevantly the analysis’s final stage, where estimated impacts on predicted subgroups need to be considered in relation to the desired impact on actual subgroups, as we elaborate in the next section.

How ASPES Works

Endogenous responses to policy interventions often determine the magnitude of that intervention’s impact and dictate the policy issue of greatest interest. A common example is when an individual randomized into the treatment group chooses not to participate at all in the assigned intervention. Impact magnitude for such a person presumably is diminished considerably (potentially to zero) by this behavior, and for the endogenous subgroup of nonparticipants as a whole. The converse endogenous subgroup—individuals who do participate—at least have a chance of experiencing an impact and hence tend to be of greater policy interest; but that group too is endogenous, and its counterparts are not identifiable in the control group.

Moreover, to determine the intervention’s impact on the subset of treatment group members who participate, that subset cannot be compared to the entire control group because participants in the treatment group likely differ systematically from treatment group members who choose not to participate, giving the control group as a whole a different composition than the participant subset of the treatment group. The Bloom (1984) correction is one approach to estimating the impact of the intervention on the participant subpopulation. The method assumes that the intervention has no effect on individuals who do not participate, a group Bloom calls “no-shows.” ASPES differs from the Bloom correction by allowing the possibility that nonzero impacts occur on both sides of the endogenous population split of interest. Thus, ASPES applies even to the case in which nonparticipants in the treatment group still experience an impact of having been offered the opportunity to participate through some consequent behavioral reaction, or where nonparticipation consists of skipping only a selected portion of a multifaceted intervention. For example, nonparticipants in the context of a mandatory welfare program are, by policy, subject to sanctions and therefore must experience an effect of being part of the policy regime.

The case of varying dosage levels of the treatment also fits within the ASPES framework. For example, two endogenous subgroups of interest may be defined by the quantity of the intervention experienced by each: high dosage $(D_{i} = 1)$ and low dosage treatment group members $(D_{i} = 0)$. If dosage were a continuous measure, more than two endogenous subgroups would be possible,² but we use this simple case for illustration. The same sort of discrete categorization (dichotomous or otherwise) can occur on other dimensions for subgroups defined in other ways: varying intervention qualities, varying subsets of intervention services, or various pathways through an intervention. Part 1 of this method note (Peck, 2013) elaborates on many possible examples and applications of this sort, both from the existing research literature and from current public policy discourse. Here we simplify to explore two groups, defined hypothetically by high and low dosage.

The ASPES approach recovers the impact parameters for endogenously determined subgroups through the following steps. First, individual baseline characteristics are used to predict subgroup membership, that is, to identify sample members most likely, if assigned to treatment, to receive a high dosage of the intervention rather than a low dosage. Second, both treatment and control group cases are sorted into subsamples based on the endogenous subgroup, high or low dosage, to which an individual sample member is most likely to belong, given his or her profile of baseline characteristics. Third, we use conventional subgroup impact estimation techniques to obtain measures of mean impact on each of the predicted subgroups, call them $I_{L}$ and $I_{H}$ for the predicted low and predicted high dosage subgroups, respectively. Because prediction into subgroups is not perfect, a final step converts the impact estimates for predicted subgroups into impacts on actual subgroups under certain assumptions. Before proceeding to the simulation exercise, we say a bit more here about the conversion process, which is more fully elaborated in Part 2 of this method note (Bell & Peck, 2013).

Two equations express impacts on predicted low dosage individuals and predicted high dosage individuals as a weighted sum of impacts on two actual dosage subgroups, as follows:

\begin{aligned} I_{L} = w_{L} L_{L} + g_{L} H_{L} \\ I_{H} = w_{H} L_{H} + g_{H} H_{H}, \end{aligned}

where the following notation applies:

$I_{L}$ is the impact on predicted low dosage individuals;

$I_{H}$ is the impact on predicted high dosage individuals;

$L_{L}$ is the impact on predicted low dosage individuals who are actually low dosage individuals;

$L_{H}$ is the impact on predicted high dosage individuals who are actually low dosage individuals;

$H_{L}$ is the impact on predicted low dosage individuals who are actually high dosage individuals;

$H_{H}$ is the impact on predicted high dosage individuals who are actually high dosage individuals;

$w_{L}$ is the proportion of predicted low dosage individuals who are actually in the low dosage subgroup;

$w_{H}$ is the proportion of predicted high dosage individuals who are actually in the low dosage subgroup;

$g_{L}$ is the proportion of predicted low dosage individuals who are actually in the high dosage subgroup;

$g_{H}$ is the proportion of predicted high dosage individuals who are actually in the high dosage subgroup.

This set of two equations contains four unknowns ($L_{L}$, $L_{H}$, $H_{L}$, and $H_{H}$), so two assumptions are necessary to solve it. Here, the ASPES analysis as proposed by Peck (2003) makes the following assumptions:

$L_{H} = L_{L}$ : impacts on low dosage individuals are the same on average for cases predicted to be high dosage individuals and cases predicted to be low dosage individuals;

$H_{H} = H_{L}$ : impacts on high dosage individuals are the same on average for cases predicted to be high dosage individuals and cases predicted to be low dosage individuals.

Rearranging terms under these assumptions yields expressions for impacts on actual subgroups as a function of the elements that are known—impacts on predicted subgroups—and the relative proportions of those predicted to be in each subgroup who are actually in each subgroup. This yields the following expressions for overall low (L) and high (H) dosage impacts³:

\begin{aligned} L = \frac{g_{H} I_{L} - g_{L} I_{H}}{g_{H} - g_{L}} \\ H = \frac{w_{L} I_{H} - w_{H} I_{L}}{w_{L} - w_{H}} . \end{aligned}
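
The conversion can be checked numerically. The following Python sketch is an illustration, not the authors' code: the function name `convert_to_actual` and the example shares are assumptions. It relies on $w_{L} = 1 - g_{L}$ and $w_{H} = 1 - g_{H}$, which follows from the definitions above because every individual is either low or high dosage.

```python
def convert_to_actual(impact_pred_low, impact_pred_high, g_low, g_high):
    """Convert impacts on predicted subgroups (I_L, I_H) into impacts on
    actual subgroups (L, H) under the assumptions L_H = L_L and H_H = H_L.

    g_low  : share of predicted low dosage cases that are actually high (g_L)
    g_high : share of predicted high dosage cases that are actually high (g_H)
    """
    # With only two subgroups, the "actually low" shares are the complements.
    w_low = 1.0 - g_low
    w_high = 1.0 - g_high
    if g_high == g_low:
        raise ValueError("prediction carries no information: g_H equals g_L")
    actual_low = (g_high * impact_pred_low - g_low * impact_pred_high) / (g_high - g_low)
    actual_high = (w_low * impact_pred_high - w_high * impact_pred_low) / (w_low - w_high)
    return actual_low, actual_high


# Hypothetical example: true impacts L = 0.1 and H = 0.3, with g_L = 0.2, g_H = 0.7.
# The two weighted-sum equations give the impacts on predicted subgroups.
i_low = 0.8 * 0.1 + 0.2 * 0.3   # I_L = w_L * L + g_L * H
i_high = 0.3 * 0.1 + 0.7 * 0.3  # I_H = w_H * L + g_H * H
recovered_low, recovered_high = convert_to_actual(i_low, i_high, g_low=0.2, g_high=0.7)
```

In this constructed example the conversion returns the assumed true impacts up to floating-point error, because the inputs satisfy both conversion assumptions exactly. The denominators shrink as the prediction model discriminates less between subgroups, which is why weak prediction inflates the variance of the converted estimates.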

In brief summary, the ASPES approach involves three steps: (1) prediction of subgroups defined by an endogenous trait, (2) analysis of impacts on those predicted subgroups; and (3)—assuming imperfect prediction—conversion of results through the process just described to produce estimated impacts for actual subgroups.⁴ Elements of the first, prediction step—specifically drawing a random subsample for prediction, and the appropriate size of that subsample—are what we turn to next, using simulations to better understand the performance and implications of this step for the overall analysis, including Steps (2) and (3).

Monte Carlo Analysis

We use Monte Carlo simulations to investigate the presence and consequences of overfitting when analyzing endogenous subgroups within experimental evaluation research. Monte Carlo analyses repeatedly simulate and analyze synthetic data constructed to have properties that illuminate the questions of interest. Each set of simulations establishes what the “true” impact is in the simulated data and uses that as a starting point for exploring issues of bias and variance of the impact estimates that are generated through applying the ASPES approach. Because the “true” impacts are known (by definition/construction), it is possible to directly assess the bias versus variance trade-off in any estimates of impact, for example, when excluding the prediction sample used in Step (1) from the impact analysis sample used in Step (2) versus not excluding it.

We create the synthetic data used in the simulations through a series of steps that implement the following modeling assumptions. First, we assume that the probability of high dosage is related to individual baseline characteristics according to

\begin{aligned} \Pr (D_{i} = 1) = \frac{\exp \{v_{i}^{\prime} α_{v} + x_{i}^{\prime} α_{x} + z_{i}^{\prime} α_{z}\}}{1 + \exp \{v_{i}^{\prime} α_{v} + x_{i}^{\prime} α_{x} + z_{i}^{\prime} α_{z}\}} . \end{aligned}

Given a synthetic individual’s baseline characteristics $(v_{i}, x_{i}, z_{i})$ and a vector of parameter values $(α_{v}, α_{x}, α_{z})$ , this relationship allows us to simulate a dosage decision for that individual where $(D_{i} = 1)$ indicates a choice of high dosage and $(D_{i} = 0)$ indicates low dosage.

From $D_{i}$ , we then generate the outcome variable of interest as

\begin{aligned} y_{i} = L T_{i} (1 - D_{i}) + H T_{i} D_{i} + x_{i}^{\prime} β_{x} + z_{i}^{\prime} β_{z} + ϵ_{i} , \end{aligned}

where $T_{i}$ is determined at random and indicates assignment to the treatment group $(T_{i} = 1)$ or the control group $(T_{i} = 0)$. The vector $v_{i}$ contains observable baseline measures that affect an individual’s dosage decision and do not directly affect the outcome of interest; that is, they are instrumental variables (IVs) for dosage.⁵ The vector $x_{i}$ contains other observable baseline measures that affect both the dosage decision and the outcome of interest. Finally, the vector $z_{i}$ contains unobservable baseline measures that affect both the dosage decision and the outcome of interest. There are two issues at play in this construction that affect the estimation of parameters and their standard errors: multicollinearity and endogeneity. The correlation between observable baseline characteristics $x_{i}$ and observed dosage $D_{i}$ induces multicollinearity. The vector of unobserved characteristics $z_{i}$ enters the dosage equation and the error term of the outcome equation, inducing endogeneity.

We investigate overfitting in a variety of scenarios to determine the sensitivity of results to the presence of IVs and unobservable characteristics. In particular, we consider four special cases of the model described above and shown in Figure 1: Case 1 includes a single instrument in the dosage equation and does not include unobserved covariates in either equation, Case 2 includes neither instruments nor unobserved covariates, Case 3 includes a single instrument in the dosage equation and unobserved covariates in both equations, and Case 4 includes unobservable characteristics in both equations and does not include an instrument in the dosage equation.

Figure 1.

Variations in assumptions used in the simulated data-generating process, for scenarios originating in the treatment group.

These four cases simulate possible scenarios that would arise when exploring some mediating factor that is treatment induced, creating endogenous subgroups related to program-related choices, experiences, or events.
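
A minimal Python sketch of this treatment pathway data-generating process may help make the four cases concrete. It is not the authors' simulation code (their details are in Appendix A). The function name, the scalar covariates, and all coefficient values are assumptions chosen for illustration, and the covariates are drawn independently for simplicity. The default impacts mirror the true impacts listed in Table 1.

```python
import numpy as np


def simulate_treatment_pathway(n, rng, use_instrument, use_unobserved,
                               impact_low=0.1, impact_high=0.3):
    """Simulate one data set for Cases 1-4 (all coefficients are hypothetical)."""
    treat = rng.integers(0, 2, size=n)                           # T_i, assigned at random
    v = rng.normal(size=n) if use_instrument else np.zeros(n)    # instrument
    x = rng.normal(size=n)                                       # observed covariate
    z = rng.normal(size=n) if use_unobserved else np.zeros(n)    # unobserved covariate

    # Dosage equation: logistic in v, x, z with illustrative alphas of 0.5.
    index = 0.5 * v + 0.5 * x + 0.5 * z
    prob_high = 1.0 / (1.0 + np.exp(-index))
    # D_i is the dosage each person would choose if offered treatment, so it is
    # generated for both arms (a counterfactual for the control group).
    dose = (rng.uniform(size=n) < prob_high).astype(int)

    # Outcome equation: impact L for low dosage, H for high dosage, betas of 1.
    eps = rng.normal(size=n)
    y = (impact_low * treat * (1 - dose) + impact_high * treat * dose
         + 1.0 * x + 1.0 * z + eps)
    return {"treat": treat, "v": v, "x": x, "z": z, "dose": dose, "y": y}


# Flags for the four cases described above: (instrument, unobserved covariates).
CASES = {1: (True, False), 2: (False, False), 3: (True, True), 4: (False, True)}
rng = np.random.default_rng(2013)
case2 = simulate_treatment_pathway(1000, rng, *CASES[2])
```

Switching a covariate off by setting it to zero keeps a single code path for all four cases. In an analysis step, only `treat`, `v`, `x`, `y`, and the treatment group's `dose` would be treated as observable.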

As prior work has elaborated, the postrandomization experiences of interest could occur within the control group or the treatment group. Of particular interest is the case where subgroup membership rests on the outcome for which impacts are to be estimated, such as postrandomization earnings in a job training evaluation. This case, in the JTPA study context discussed above, may be the original motivation for advancing a symmetric subgroups prediction methodology. Suppose we want to know whether those who fare especially well (or especially badly) in the absence of the intervention, in terms of the postrandomization outcome of earnings, for example, would have been differentially affected had they gained access to the intervention. Can we pick endogenous subgroup members based on their predicted postrandomization earnings in the control group and then successfully estimate impacts on those earnings without overfitting? To test whether this is the case, we replicate the Case 2 and Case 4 modeling assumptions from above (no instrument, both with and without unobservable variables) for the control group earnings application as Case 5 and Case 6. Anecdotally, we know that the JTPA study’s application of this approach resulted in substantial overfitting bias; hence, we expect this simulation exercise will shed light on the extent to which that might be the case more generally when the endogenous selection variable is also the outcome for which subgroup impacts will be calculated.

In the scenarios originating in the control group with a common subgroup selection and impact variable (such as postrandom assignment earnings), the data-generating process used establishes an individual’s outcome in the absence of treatment as:

\begin{aligned} y_{i | T = 0} = x_{i}^{\prime} β_{x} + z_{i}^{\prime} β_{z} + ϵ_{i} , \\ ϵ_{i} \sim \text{iid } N (0, σ^{2}) . \end{aligned}

Although this counterfactual outcome is not realized and would not be observed in a real data set for individuals in the treatment group, we observe this unrealized outcome in the simulated data. We classify individuals into a low counterfactual outcome endogenous subgroup $(Q_{i} = 1)$ if $y_{i | T = 0}$ is in the lowest quartile of the control outcome’s distribution. Allowing for the impact of the program to differ for those with low counterfactual outcomes and those without $(Q_{i} = 0)$, the individual’s outcome will be given by

\begin{aligned} y_{i} = L T_{i} Q_{i} + H T_{i} (1 - Q_{i}) + x_{i}^{\prime} β_{x} + z_{i}^{\prime} β_{z} + ϵ_{i} , \\ ϵ_{i} \sim \text{iid } N (0, σ^{2}) . \end{aligned}

As in the treatment pathway model, the correlation between subgroup membership and baseline covariates introduces multicollinearity. The unobserved covariates both determine subgroup membership and enter the error term, inducing endogeneity. In addition, this model presents an added statistical challenge: Because the subgroup selection and impact analysis variables are one and the same ($y_{i}$), $Q_{i}$ is directly correlated with the error term in the impact equation in every possible scenario.
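
The control pathway construction can be sketched in Python as follows. This is an illustrative reading of the equations above, not the authors' code. The function name, scalar covariates, unit coefficients, and default impacts are assumptions, as is the choice to take the quartile cutoff over the whole simulated sample.

```python
import numpy as np


def simulate_control_pathway(n, rng, use_unobserved, impact_low=0.1,
                             impact_high=0.3, sigma=1.0):
    """Simulate one data set for Cases 5-6 (coefficients are hypothetical)."""
    treat = rng.integers(0, 2, size=n)
    x = rng.normal(size=n)
    z = rng.normal(size=n) if use_unobserved else np.zeros(n)
    eps = rng.normal(scale=sigma, size=n)

    # Outcome in the absence of treatment, known for everyone in simulated data.
    y_untreated = 1.0 * x + 1.0 * z + eps

    # Q_i = 1 for the lowest quartile of the untreated outcome. Assumption: the
    # cutoff is taken over the whole simulated sample.
    cutoff = np.quantile(y_untreated, 0.25)
    low_group = (y_untreated <= cutoff).astype(int)

    # Realized outcome: impact L if Q_i = 1, impact H otherwise.
    y = y_untreated + treat * (impact_low * low_group + impact_high * (1 - low_group))
    return {"treat": treat, "x": x, "z": z, "low_group": low_group,
            "y_untreated": y_untreated, "y": y}


rng = np.random.default_rng(2013)
case5 = simulate_control_pathway(2000, rng, use_unobserved=False)
```

Because `low_group` is built directly from `y_untreated`, it depends on the same error draw `eps` that enters `y`, which is the correlation with the error term described above.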

Before beginning a Monte Carlo analysis, we fix the model parameters $L$, $H$, $(α_{v}, α_{x}, α_{z})$, $(β_{x}, β_{z})$, and $σ$; specify the case; select the sample size $n$ to simulate; and decide what proportion of the sample to devote exclusively to the prediction stage. Each iteration of the Monte Carlo analysis involves the following steps: (0) simulate data, (1) predict subgroup membership, (2) estimate mean impacts on predicted subgroups, and (3) convert impacts on predicted subgroups to represent impacts on actual subgroups. Steps (1–3) correspond to the general ASPES approach articulated earlier. The added, prior, step is to create the simulated data, details of which appear in Appendix A.
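
To make Steps (0) through (3) concrete, the sketch below runs a small Monte Carlo loop in Python for a Case 2 style scenario with full reuse of the treatment group. It is a simplified illustration under stated assumptions, not the authors' implementation. The single covariate, the coefficient values, the 0.5 classification threshold, the difference-in-means impact estimator, the hand-written Newton logistic fit, and all function names are assumptions.

```python
import numpy as np


def fit_logit(features, labels, n_iter=25):
    """Logistic regression by Newton's method; an intercept is added here."""
    design = np.column_stack([np.ones(len(features)), features])
    coef = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ coef))
        gradient = design.T @ (labels - prob)
        hessian = design.T @ (design * (prob * (1.0 - prob))[:, None])
        coef = coef + np.linalg.solve(hessian, gradient)
    return coef


def predicted_high(coef, features):
    """Classify as predicted high dosage when the fitted probability is >= 0.5."""
    design = np.column_stack([np.ones(len(features)), features])
    return (design @ coef >= 0.0).astype(int)


def one_iteration(n, rng, true_low=0.1, true_high=0.3):
    # Step 0: simulate a Case 2 style data set (hypothetical coefficients).
    treat = rng.integers(0, 2, size=n)
    x = rng.normal(size=n)
    dose = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-1.5 * x))).astype(int)
    y = true_low * treat * (1 - dose) + true_high * treat * dose + x + rng.normal(size=n)

    # Step 1: fit on the treatment group (where dosage is observed), then
    # apply the same rule to both arms. This is the full-reuse variant.
    coef = fit_logit(x[treat == 1], dose[treat == 1])
    pred = predicted_high(coef, x)

    # Step 2: difference in mean outcomes within each predicted subgroup.
    def impact(mask):
        return y[mask & (treat == 1)].mean() - y[mask & (treat == 0)].mean()

    i_high, i_low = impact(pred == 1), impact(pred == 0)

    # Step 3: convert using the treatment group's actual high dosage shares.
    g_high = dose[(pred == 1) & (treat == 1)].mean()
    g_low = dose[(pred == 0) & (treat == 1)].mean()
    actual_low = (g_high * i_low - g_low * i_high) / (g_high - g_low)
    actual_high = ((1 - g_low) * i_high - (1 - g_high) * i_low) / (g_high - g_low)
    return actual_low, actual_high


rng = np.random.default_rng(0)
estimates = np.array([one_iteration(2000, rng) for _ in range(200)])
mean_low, mean_high = estimates.mean(axis=0)
```

The column means of `estimates` approximate the expected value of each estimator, and the column standard deviations approximate its standard error. A split-sample variant would fit the model on a random subset of the treatment group and exclude that subset from Steps (2) and (3).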

Because we simulate the data, we know the true data-generating process and designate which variables are “realized” and “observable.” These variables are available to be used during the analysis Steps (1–3). Once we have obtained estimates, we draw on all available information including unobserved variables and counterfactual outcomes to investigate the estimates. Each of the simulated endogenous variables $D_{i}$ , $Q_{i}$ , and $y_{i}$ is determined by baseline characteristics, parameter values, assignment to treatment and a random draw from an error distribution. In the simulated data, we observe each of these separately and can therefore construct counterfactual observations: The dosage an individual in the control group would have selected had he or she been offered treatment and the outcome an individual would have obtained under the alternative treatment assignment.

Results

This section presents findings on our three related research questions: To what extent, and under what conditions, does overfitting exist and potentially threaten the validity of results from analysis of endogenously identified subgroups? To what extent might overfitting bias results and by how much? What are the implications for variance of using a split-sample, two-stage approach to ASPES to avoid overfitting?

Research Question 1: When does the overfitting problem arise?

To address the first research question—to what extent, and under what conditions, does overfitting exist and potentially threaten the validity of results from analysis of endogenously identified subgroups?—we structure the overfitting issue as follows. The issue concerns use of sample observations from the arm of the experiment in which the endogenous subgroup of interest is observed in two roles: estimating the predictive model of subgroup membership and calculating impacts on the predicted subgroups. Our Monte Carlo simulations varied the approach to forming the prediction and impact analysis samples to include dual use of all observations and also separation into disjoint subsets of observations for these two purposes. One scenario used the full treatment group (the arm in which the endogenous subgroup of interest is posited to reveal itself in this analysis except for Cases 5 and 6) in the prediction step and again in the analysis step, with every observation being reused. In other scenarios, we split the treatment group at random into nonoverlapping prediction and impact analysis samples, varying the percentage of the sample used for prediction between 25% and 50% (and hence the share of the sample used for impact estimation between 75% and 50%).
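
The split-sample scenarios can be sketched in Python as a random partition of treatment group rows. The function name and the toy sample below are assumptions for illustration. The full-reuse scenario corresponds to skipping the split and using every treatment group row in both roles.

```python
import numpy as np


def split_treatment_group(treat, prediction_share, rng):
    """Randomly split treatment group row indices into disjoint prediction
    and impact analysis samples; control rows are all kept for analysis."""
    treat_rows = rng.permutation(np.flatnonzero(treat == 1))
    n_prediction = int(round(prediction_share * len(treat_rows)))
    prediction_rows = treat_rows[:n_prediction]
    analysis_rows = np.concatenate([treat_rows[n_prediction:],
                                    np.flatnonzero(treat == 0)])
    return prediction_rows, analysis_rows


rng = np.random.default_rng(0)
treat = np.repeat([1, 0], 250)          # hypothetical sample: 250 per arm
prediction_rows, analysis_rows = split_treatment_group(treat, 0.25, rng)
```

Under this split, neither arm's analysis rows were used to fit the prediction model, so classification accuracy is out of sample for both treatment and control cases.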

To determine for what sample sizes overfitting is observed and to characterize the magnitude of overfitting, we perform a statistical test to compare the accuracy of the prediction model between the treatment and the control groups in the analysis sample. Because we simulate counterfactual choices and outcomes, we know what dosage each individual would choose if he or she were offered treatment or what an individual’s outcome would be in the absence of treatment, regardless of treatment status. We define the binary variable $d_{i}$ to indicate whether the prediction model placed the individual in the correct subgroup and test the null hypothesis that the prediction model performs equally well in the treatment and control group analysis samples:

\begin{aligned} H_{0} \text{ (No Overfitting)} : E (d_{i} | T_{i} = 1, \text{Analysis}_{i} = 1) = E (d_{i} | T_{i} = 0, \text{Analysis}_{i} = 1); \\ H_{1} \text{ (Overfitting)} : E (d_{i} | T_{i} = 1, \text{Analysis}_{i} = 1) \neq E (d_{i} | T_{i} = 0, \text{Analysis}_{i} = 1) . \end{aligned}

The null hypothesis characterizes the case where there is no overfitting. The alternative hypothesis is that there is overfitting: The prediction model performs differentially in the treatment and control groups in the analysis sample. If we use the entire treatment group for prediction and analysis, then the rate of correct prediction in the treatment group reflects the accuracy of in-sample prediction, while that of the control group is based on out-of-sample prediction. Under a split-sample approach, the rate of correct prediction for both treatment and control groups is based on out-of-sample prediction.
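
One way to carry out such a comparison is a pooled two-proportion z-test on the rates of correct prediction. The article does not name the specific test used, so the choice of test, the function name, and the example counts in this Python sketch are assumptions.

```python
import math


def overfitting_z_test(correct_treat, n_treat, correct_control, n_control):
    """Two-sided pooled z-test of equal correct-prediction rates in the
    treatment and control analysis samples. Returns (z, p_value)."""
    rate_treat = correct_treat / n_treat
    rate_control = correct_control / n_control
    pooled = (correct_treat + correct_control) / (n_treat + n_control)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_treat + 1.0 / n_control))
    if std_err == 0.0:
        return 0.0, 1.0  # both rates are 0 or 1, so there is no difference to test
    z = (rate_treat - rate_control) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts of correctly classified cases (d_i = 1) in each arm.
z, p_value = overfitting_z_test(correct_treat=200, n_treat=250,
                                correct_control=175, n_control=250)
reject_at_10_percent = p_value < 0.10
```

Repeating this test across simulated data sets and recording the share of rejections gives a rejection rate that can be compared with the nominal 10% level.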

Under the null hypothesis of no overfitting, we would find statistically significant differences at the 10% level in 10% of the repeated simulations. The simulations involving no reuse of the sample—which sets the benchmark for situations where overfitting cannot be a problem—show such a pattern. In particular, the flat dashed line in Figure 2 shows that the proportion of simulations for which we reject the null hypothesis for α = .10 when no overfitting occurs hovers at about 10% for all cases and sample sizes considered.⁶

Figure 2.

To what extent, and under what conditions, does overfitting exist? The presence of overfitting by use of separate prediction sample, by case and sample size.

We do not show separate results for Cases 1 and 3 because they are nearly identical to the results for Cases 2 and 4 (with the added instrument) but instead focus on those treatment- and control side–originated cases (2, 4–6) where both observables and unobservables are considered.

When we reuse observations for prediction and impact estimation, we find clear evidence of overfitting for the smaller sample sizes considered (see the top curved line in each panel of the figure). Focusing on Case 2, the treatment pathway case that includes neither an instrument nor unobserved covariates, we reject the null hypothesis of no overfitting at the 10% level in 40% of simulations when the total sample size is 500 and reuse is complete, in 27% of simulations with a sample size of 1,000, and in 13% of simulations with a sample size of 5,000. The magnitude of overfitting decreases as sample sizes increase, but is still detectable for a sample size of 10,000.⁷,⁸

The control pathway case, where the endogenous subgroup selection variable and the impact variable are the same, demonstrates the same pattern, with a very slightly higher incidence of overfitting. For Case 5, the control pathway case that includes no unobserved covariates, we reject the null hypothesis of no overfitting at the 10% level in 43% of simulations with a total sample size of 500 when the prediction sample is reused for analysis, in 27% of simulations with a sample size of 1,000, and in 13% of simulations with a sample size of 5,000. The striking similarity of the control pathway case to the treatment pathway case is due to the fact that this analysis involves only the prediction of subgroup membership. The key distinction between the treatment pathway and the control pathway models is the relationship between subgroup formation and the outcome equation: In the treatment pathway model, subgroups are determined by a separate choice, while it is the outcome itself that determines subgroup membership in the control pathway. This distinction does not come into play in the prediction step, but will in the analysis step.

The extent of overfitting is very slightly higher for the cases that include unobserved covariates than for the cases with no unobserved covariates. For Case 4—which adds unobserved covariates to the treatment pathway framework—we reject the null hypothesis of no overfitting at the 10% level in 44% of simulations when the total sample size is 500 and reuse is complete, which is 4 percentage points higher than the corresponding proportion for Case 2.⁹ Case 6—which adds unobserved covariates to the control pathway framework—shows a similar pattern. Because the unobserved covariates are correlated with the observed covariates, the parameter estimates of the observed covariates from the logistic regression predicting dosage are affected by omitted variable bias—that is, they capture the influence of the unobserved covariates as well as their own direct influence on the probability of choosing high dosage. Chance differences in unobserved covariates between the treatment and control groups can therefore induce overfitting.

Research Question 2: How much bias arises when impact estimates suffer from overfitting?

To examine the second research question (to what extent might overfitting bias distort findings on impacts on endogenous subgroups?), we ran Monte Carlo simulations assuming nonzero impacts on both high- and low-dose individuals. The performance of ASPES is of particular interest under this assumption, since the case of nonzero, unequal impacts on the two endogenous subgroups (high and low dosage individuals) is not adequately covered by either Bloom (1984) or the standard experimental approach.

In the absence of overfitting, each of the simulations in this set demonstrates that ASPES recovers true parameter values, if the method is fully and properly implemented: that is, estimates of impact on predicted subgroups at Step (2) must be converted to estimates of impact on actual subgroups at Step (3). Tables 1 and 2 present the results of these Monte Carlo simulations for varying sample sizes, ranging from 500 to 10,000.¹⁰ The means of the estimates produced through 10,000 repetitions are given in the table, with their standard deviations shown in parentheses below. The means are interpreted as the expected value of the estimator, and the standard deviations are interpreted as the standard error of a particular estimate. The impacts on predicted subgroups reflect the composition of the groups, which include individuals for whom the prediction was accurate and those for whom it was not. Because the data have been built to meet the assumptions of the conversion process (discussed earlier), impacts on actual subgroups are unbiased estimates of the true parameter values.
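The Step (3) conversion can be illustrated with a minimal sketch. It assumes, as a simplification of the conversion described in Part 2 of this note, that the impact on each predicted subgroup is a weighted average of the impacts on the actual high and low subgroups, with weights given by the composition of the predicted subgroup. The function name `convert_to_actual` and the composition shares (0.70 and 0.15) are hypothetical and chosen only for illustration; they are not taken from the simulations reported here.

```python
import numpy as np

def convert_to_actual(pred_high, pred_low, share_high_in_pred_high, share_high_in_pred_low):
    """Solve the 2x2 system linking predicted-subgroup impacts to actual-subgroup impacts.

    Assumption (illustrative): each predicted-subgroup impact is a weighted
    average of the actual high and low impacts, weighted by the share of
    actual high members in that predicted subgroup.
    """
    mix = np.array([
        [share_high_in_pred_high, 1.0 - share_high_in_pred_high],
        [share_high_in_pred_low, 1.0 - share_high_in_pred_low],
    ])
    actual_high, actual_low = np.linalg.solve(mix, np.array([pred_high, pred_low]))
    return actual_high, actual_low

# Hypothetical composition shares, not values from the article
true_high, true_low = 0.300, 0.100
p_hh, p_hl = 0.70, 0.15

# Impacts on predicted subgroups implied by the mixing assumption
pred_high = p_hh * true_high + (1 - p_hh) * true_low
pred_low = p_hl * true_high + (1 - p_hl) * true_low

# Conversion recovers the impacts on actual subgroups
actual_high, actual_low = convert_to_actual(pred_high, pred_low, p_hh, p_hl)
```

Under these assumed shares, the predicted-subgroup impacts are pulled toward each other (0.24 and 0.13), much like the unconverted columns of Tables 1 and 2, while the converted values return to 0.30 and 0.10.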

Table 1.

Performance of the Analysis of Symmetrically Predicted Endogenous Subgroups in the Absence of Overfitting, by Size of Prediction Subsample.

	Impact on predicted subgroups		Converted impact on actual subgroups
Sample size	High	Low	High	Low
True impact	0.300	0.100	0.300	0.100
Case 2: Treatment pathway; no unobserved covariates
500	0.208 (0.274)	0.135 (0.133)	0.301 (0.815)	0.099 (0.290)
1,000	0.223 (0.189)	0.136 (0.089)	0.302 (0.383)	0.101 (0.151)
5,000	0.236 (0.083)	0.132 (0.039)	0.301 (0.137)	0.099 (0.058)
10,000	0.238 (0.057)	0.133 (0.027)	0.300 (0.092)	0.100 (0.040)
Case 4: Treatment pathway; unobserved covariates
500	0.219 (0.247)	0.134 (0.139)	0.291 (0.536)	0.102 (0.233)
1,000	0.236 (0.167)	0.134 (0.094)	0.301 (0.287)	0.101 (0.138)
5,000	0.243 (0.073)	0.131 (0.041)	0.298 (0.111)	0.100 (0.056)
10,000	0.245 (0.052)	0.130 (0.029)	0.300 (0.077)	0.100 (0.039)

Note. 10,000 simulations; 50% of treatment arm reserved for prediction; 50% of treatment arm used in analysis.

Table 2.

Performance of the Analysis of Symmetrically Predicted Endogenous Subgroups in the Absence of Overfitting.

	Impact on predicted subgroups		Converted impact on actual subgroups
Sample size	High	Low	High	Low
True impact	0.100	0.300	0.100	0.300
Case 5: Control pathway; no unobserved covariates
500	0.118 (0.135)	0.233 (0.243)	0.103 (0.167)	0.282 (0.403)
1,000	0.119 (0.091)	0.252 (0.167)	0.103 (0.104)	0.292 (0.224)
5,000	0.116 (0.039)	0.262 (0.072)	0.100 (0.044)	0.299 (0.089)
10,000	0.116 (0.028)	0.264 (0.050)	0.100 (0.031)	0.301 (0.062)
Case 6: Control pathway; unobserved covariates
500	0.117 (0.136)	0.232 (0.244)	0.102 (0.167)	0.282 (0.405)
1,000	0.118 (0.092)	0.252 (0.168)	0.103 (0.105)	0.292 (0.225)
5,000	0.115 (0.040)	0.264 (0.073)	0.100 (0.045)	0.299 (0.089)
10,000	0.115 (0.028)	0.265 (0.051)	0.100 (0.031)	0.301 (0.062)

Note. 10,000 simulations; 50% of control arm reserved for prediction; 50% of control arm used in analysis.

To investigate the influence of overfitting on the bias of the estimates, consider simulations with sample sizes less than 5,000. Tables 3 and 4 present these results and also include a few larger sample sizes for comparison. For the cases affected by endogeneity—the treatment pathway case with unobservables and both control pathway cases—we find evidence that overfitting biases results.

Table 3.

Performance of the Analysis of Symmetrically Predicted Endogenous Subgroups in the Presence of Overfitting, by Sample Size.

	Impact on predicted subgroups		Converted impact on actual subgroups
Sample size	High	Low	High	Low
True impact	0.300	0.100	0.300	0.100
Case 2: Treatment pathway; no unobserved covariates
500	0.242 (0.230)	0.129 (0.105)	0.295 (0.351)	0.101 (0.146)
1,000	0.243 (0.152)	0.131 (0.071)	0.302 (0.238)	0.099 (0.102)
5,000	0.239 (0.067)	0.132 (0.031)	0.299 (0.105)	0.100 (0.045)
10,000	0.240 (0.047)	0.132 (0.022)	0.300 (0.075)	0.100 (0.032)
Case 4: Treatment pathway; unobserved covariates
500	0.267 (0.200)	0.121 (0.110)	0.323 (0.282)	0.090 (0.141)
1,000	0.260 (0.136)	0.126 (0.076)	0.314 (0.195)	0.094 (0.100)
5,000	0.248 (0.059)	0.129 (0.033)	0.302 (0.088)	0.099 (0.045)
10,000	0.248 (0.042)	0.130 (0.023)	0.301 (0.062)	0.100 (0.031)

Note. 10,000 simulations; high dosage impact is 0.30 SD; low dosage impact is 0.10 standard deviations; entire treatment group used for prediction and reused for analysis.

Table 4.

Performance of the Analysis of Symmetrically Predicted Endogenous Subgroups in the Presence of Overfitting.

	Impact on predicted subgroups		Converted impact on actual subgroups
Sample size	High	Low	High	Low
True impact	0.100	0.300	0.100	0.300
Case 5: Control pathway; no unobserved covariates
500	0.074 (0.105)	0.395 (0.204)	0.052 (0.113)	0.442 (0.233)
1,000	0.095 (0.072)	0.329 (0.134)	0.075 (0.079)	0.373 (0.158)
5,000	0.111 (0.032)	0.278 (0.058)	0.095 (0.036)	0.316 (0.071)
10,000	0.113 (0.023)	0.271 (0.041)	0.097 (0.025)	0.309 (0.050)
Case 6: Control pathway; unobserved covariates
500	0.072 (0.107)	0.399 (0.207)	0.049 (0.115)	0.447 (0.236)
1,000	0.095 (0.074)	0.330 (0.137)	0.075 (0.081)	0.373 (0.161)
5,000	0.111 (0.033)	0.279 (0.059)	0.095 (0.037)	0.316 (0.072)
10,000	0.113 (0.023)	0.272 (0.041)	0.097 (0.026)	0.308 (0.050)

Note. 10,000 simulations; high counterfactual impact is 0.10 standard deviations; low counterfactual impact is 0.30 standard deviations; entire control arm used for prediction and reused for analysis.

For Case 2, the treatment pathway case in which all covariates are observed, the ASPES method recovers the true parameter values at every sample size even when the entire group is reused for both prediction and impact analysis. The Case 2 Monte Carlo simulations for sample sizes of 500 and 1,000 clearly exhibit overfitting when the entire treatment sample is used for both prediction and analysis. Even so, once the Step (3) conversion is undertaken, their mean parameter estimates recover the true impact magnitudes built into the synthetic data about as accurately as the split-sample simulations at the same sample size.

However, for Case 4, the treatment pathway case that includes unobserved covariates, ASPES produces biased results at the sample sizes that exhibit high levels of overfitting. When the entire treatment group is used for both prediction and analysis and the sample size is 500, the Case 4 simulation average estimate of the high dosage impact is 0.32 standard deviations and the average estimate of the low dosage impact is 0.09 standard deviations, which corresponds to a bias of 0.02 standard deviations and 0.01 standard deviations, respectively. A more conventional methodology for examining the role of endogenous factors in producing impact variation, ordinary least squares (OLS), makes no attempt to adjust for differences in endogenously determined subgroups that could lead to selection bias. We report the results of this OLS analysis in Appendix B. This naive approach produces estimates of impact on the subgroups of interest with a bias of 0.10 standard deviations for high dosage impact and 0.04 for low dosage impact.

For both of the control pathway cases, we find clear evidence of bias for the sample sizes that exhibit overfitting. When the entire control group is used for both prediction and analysis and the sample size is 500, the low counterfactual impact is estimated to be 0.44 and the high counterfactual impact to be 0.05 for Case 5, which corresponds to a bias of 0.14 and 0.05 standard deviations in magnitude, respectively. For context, OLS yields estimates with a bias of 0.64 and 0.21 standard deviations for Case 5 with a sample size of 500, as shown in Appendix B. These OLS results provide a sense of the magnitude of the endogeneity problem that the ASPES approach is addressing. For the control pathway cases, the endogeneity problem is much more severe than in the treatment pathway cases because the outcome itself determines subgroup membership.
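The bias figures quoted for Cases 4 and 5 follow directly from the converted estimates in Tables 3 and 4, taking bias as the mean estimate minus the true impact. The short calculation below reproduces them; the dictionary labels are our own shorthand, and the signs (which the text reports as magnitudes) show the direction of the distortion.

```python
# Mean converted estimates at a total sample size of 500 with full reuse of the
# prediction sample (Tables 3 and 4), paired with the true impacts.
estimates = {
    "Case 4 high": (0.323, 0.300),
    "Case 4 low": (0.090, 0.100),
    "Case 5 high": (0.052, 0.100),
    "Case 5 low": (0.442, 0.300),
}

# Bias = mean estimate minus true impact, in standard deviation units
bias = {label: round(est - true, 3) for label, (est, true) in estimates.items()}

for label, value in bias.items():
    print(f"{label}: {value:+.3f}")
```

In both cases the larger true impact is overstated and the smaller one understated, so overfitting exaggerates the difference between the two subgroups.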

In the presence of endogeneity, the ASPES approach that reuses the prediction sample reduces the bias in the estimates compared to the naive application of OLS, provided that Step (3) conversion is undertaken. However, the ASPES approach that uses a split sample for prediction and analysis produces unbiased estimates even in the presence of unobserved covariates and severe endogeneity.

Research Question 3: How sensitive is the variance of impact estimates to the different endogenous subgroup modeling approaches?

With regard to the third research question—what are the implications for variance of using a split sample, two-stage approach?—the preceding results suggest that we are in the familiar territory of a bias–variance trade-off. A split-sample approach increases the variance of the estimates in multiple ways. Because a split-sample approach uses a smaller sample size in the Step (1) prediction, the parameter estimates used in prediction are less precise. The split-sample approach also uses a smaller sample size in the Step (2) analysis, which further increases the variance.

To avoid overfitting while retaining the entire sample in the analysis step, we tested a cross-validation approach—akin to, though simpler than, the jackknifing approach that Sanbonmatsu, Kling, Duncan, and Brooks-Gunn (2006) used—in the prediction process step within the ASPES method. The cross-validation approach involves the following steps: (1) randomly partition the sample for the randomization arm in which endogenous subgroups of interest are observed into 10 cross-validation groups, (2) estimate the prediction model 10 times on these 10 subsamples, each time leaving out one of the cross-validation groups, and (3) construct predicted subgroup membership using the parameters obtained from the model estimation that excluded their group. Figure 3 illustrates the cross-validation approach. This process ensures that the predicted subgroup membership for every individual in the sample is constructed through out-of-sample prediction, allowing us to use the full sample for analysis without inducing overfitting.

Figure 3.

The cross-validation approach to predicting subgroup membership.

This approach differs from the standard application of cross-validation in two ways. First, cross-validation is typically used in model development, while we use the tools of cross-validation to construct a predicted value that feeds into the next step in the analysis. Second, cross-validation is typically applied in samples where outcomes and covariates are observed for all members, not just those in a particular treatment arm. Because the cross-validation approach produces 10 sets of estimates for the prediction model, the researcher must choose among many options when determining how to predict subgroup membership in the random assignment arm for which it is not observed. We resolved this by randomly partitioning both randomization arms into cross-validation groups at the outset in equal shares. While just one treatment arm is included in the estimation sample for the prediction model, both treatment and control individuals for a given cross-validation group obtain predicted values using the same estimates.
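The cross-validation procedure can be sketched as follows. This is a minimal illustration rather than the code used for the simulations: the function names `fit_logit` and `cross_validated_scores`, the Newton-Raphson logistic fit written in NumPy, the synthetic data, and the 0.5 cutoff for assigning predicted subgroups are all assumptions made for the example.

```python
import numpy as np

def fit_logit(X, y, n_iter=25, ridge=1e-6):
    """Fit a logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Tiny ridge term only keeps the Hessian invertible
        hessian = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

def cross_validated_scores(X, y, observed_arm, n_folds=10, seed=0):
    """Out-of-sample predicted probabilities of subgroup membership for both arms.

    y is used only where observed_arm is True, i.e., in the randomization arm
    in which the endogenous subgroup is observed.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.empty(n, dtype=int)
    # Partition each randomization arm separately into equal-share groups
    for arm in (True, False):
        idx = np.flatnonzero(observed_arm == arm)
        folds[rng.permutation(idx)] = np.arange(len(idx)) % n_folds
    scores = np.empty(n)
    for k in range(n_folds):
        train = observed_arm & (folds != k)   # observed arm, group k left out
        beta = fit_logit(X[train], y[train])
        held_out = folds == k                 # members of group k in both arms
        scores[held_out] = 1.0 / (1.0 + np.exp(-X[held_out] @ beta))
    return scores, folds

# Synthetic illustration (hypothetical data-generating process)
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
treated = rng.random(n) < 0.5
prob_high = 1.0 / (1.0 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))
y = (rng.random(n) < prob_high).astype(float)  # only the treated values are used

scores, folds = cross_validated_scores(X, y, treated)
predicted_high = scores >= 0.5  # assumed cutoff for predicted subgroup membership
```

Every individual's score comes from a model that never saw that individual's own cross-validation group, and treatment and control members of a given group are scored with the same coefficients, which preserves treatment-control symmetry.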

Our results indicate that the cross-validation approach virtually eliminates overfitting, at the same time permitting retention of the entire sample. Table 5 presents the results of the cross-validation approach for cases and sample sizes where we observe overfitting bias.¹¹ If there is no overfitting in the sample, then we would expect to reject the null hypothesis of no overfitting at the 10% level of significance in 10% of simulations. That is very close to what we observe when we use the cross-validation approach: The proportion of simulations for which we reject the null hypothesis of no overfitting at the 10% level rounds to 11% when using the cross-validation approach for all cases and sample sizes investigated. For comparison, when the entire treatment arm is used for both prediction and analysis and the total sample size is 500, the proportion of simulations that reject the null hypothesis of no overfitting at the 10% level of significance ranges from 43% to 44% in the cases included in the cross-validation exercise.

Table 5.

Performance of the Cross-Validation Approach to the Analysis of Symmetrically Predicted Endogenous Subgroups, by Case and Sample Size.

	Presence of overfitting	Impact on predicted subgroups		Converted impact on actual subgroups
	Proportion of simulations rejecting null	High	Low	High	Low
Case 4: Treatment pathway; unobserved covariates
True impact		0.300	0.100	0.300	0.100
500	0.109	0.227 (0.199)	0.134 (0.111)	0.292 (0.361)	0.103 (0.167)
1,000	0.107	0.239 (0.135)	0.133 (0.076)	0.300 (0.219)	0.100 (0.109)
Case 5: Control pathway; no unobserved covariates
True impact		0.100	0.300	0.100	0.300
500	0.111	0.118 (0.105)	0.251 (0.202)	0.103 (0.121)	0.292 (0.274)
1,000	0.107	0.117 (0.072)	0.257 (0.133)	0.101 (0.082)	0.294 (0.171)
Case 6: Control pathway; unobserved covariates
True impact		0.100	0.300	0.100	0.300
500	0.110	0.117 (0.107)	0.251 (0.203)	0.102 (0.123)	0.293 (0.276)
1,000	0.105	0.116 (0.074)	0.258 (0.137)	0.102 (0.083)	0.294 (0.174)

Note. 10,000 simulations; data partitioned into 10 cross-validation groups; hypothesis testing for presence of overfitting performed at the 10% level of significance.

Our results further indicate that the cross-validation approach obtains unbiased estimates of the true impact parameters, without the loss of power associated with the split-sample approach. The impact parameters recovered through the cross-validation approach are nearly identical to those recovered using a split-sample approach. The standard deviations of the estimates, which appear in parentheses adjacent to the mean impact estimates, are much smaller than those obtained through a split-sample approach. For a sample size of 500, the standard deviation of the Case 5 low counterfactual impact estimates is 0.27 when a cross-validation approach is used, slightly higher than the standard deviation of 0.23 obtained when the entire control group is used for prediction and analysis, but much lower than the standard deviation of 0.40 obtained using a split-sample approach. The cross-validation approach reduces variance relative to the split-sample approach for two reasons: it increases the effective sample size used in estimating the prediction model (each prediction model is estimated using 90% of the relevant treatment arm), and it uses the full sample for analysis.
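To put these standard deviations on a common footing, the following arithmetic converts them to variance ratios for the Case 5 low counterfactual impact at a sample size of 500. Reading a variance ratio as a sample size multiple assumes that variance falls in proportion to sample size, which is an approximation introduced here for illustration and is not a result of the simulations.

```python
# Standard deviations of the Case 5 low counterfactual impact estimates, n = 500
sd_split = 0.403   # Table 2, split sample
sd_cv = 0.274      # Table 5, cross-validation
sd_reuse = 0.233   # Table 4, full reuse of the prediction sample (biased)

# Variance ratios (squared ratios of standard deviations)
var_ratio_split_vs_cv = (sd_split / sd_cv) ** 2
var_ratio_cv_vs_reuse = (sd_cv / sd_reuse) ** 2

print(round(var_ratio_split_vs_cv, 2), round(var_ratio_cv_vs_reuse, 2))
```

The split-sample variance is roughly 2.2 times the cross-validation variance, while cross-validation costs about 1.4 times the variance of the biased full-reuse estimator.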

Discussion

Results from the simulations described in this article show the bias of point estimates from overfitting in the context of the experimental analysis of endogenous subgroups in the presence of unobserved covariates. This bias is of relatively small magnitude, though larger in analyses where the subgroup selection variable and the outcome for which impacts are to be computed are one and the same (modeled here as control group pathways). In the presence of unobserved covariates, the ASPES approach that reuses the prediction sample reduces the bias in the estimates compared to the naive application of OLS, provided that Step (3) conversion is undertaken. However, the ASPES approach that uses a split sample for prediction and analysis produces unbiased estimates even in the presence of unobserved covariates.

The split-sample approach produces unbiased estimates at the cost of the power of hypothesis tests. In response to concerns about the loss of sample size that comes from designating a prediction sample, which is essential to avoiding overfitting-induced bias, we explored a cross-validation approach that proved highly successful: It avoids overfitting while maintaining the full sample for analysis, which decreases the variance of the estimates while still yielding unbiased impact estimates.

An important finding from this research is that the ASPES approach’s Step (3), converting impacts from predicted to actual, is essential to recovering unbiased results. That is, comparing the mean treatment and control outcomes for predicted subgroups does not produce, or even approximate very well, the true impact on actual subgroups given the degree of predictive accuracy built into our simulated data.¹² This might seem an obvious point. We make it here because prior applications of the approach have tended to stop at the point of estimating impacts on predicted subgroups, regardless of whether those subgroups are identified using a designated prediction subsample or the full sample. This limitation justifies the caveats that have accompanied these prior, related studies (e.g., Schochet & Burghardt, 2007; Wood, Moore, & Clarkwest, 2011).

The simulations that we undertake in this article involve both treatment- and control-side origination and also covariates of the observed and unobserved types. These are scenarios that cover the likely range of applications of analyses of mediational pathways within the social policy arena. On the treatment side, we consider the hypothetical scenario of variation in treatment dosage. This is a relatively straightforward two-group case; but, as we have noted, three or more groups might be of interest to explore.¹³ On the control side, we consider the hypothetical scenario of the counterfactual outcome in which the endogenous subgroup of interest is selected on the very measure used to estimate impacts, such as post-random-assignment earnings in the absence of the treatment (mirroring the analysis’s motivating source, the National JTPA Study). The research summarized here concludes that, when appropriate to the research question, the ASPES approach can provide unbiased estimates of the impacts of some mediating factor (endogenous subgroup).

Conclusion

Agencies within the federal government that evaluate policies and programs using randomized experimental designs are increasingly interested in decomposing how intervention features produce impacts. In the absence of randomization to specific intervention components, various analytic approaches might help estimate component impacts. Indeed, the past decade has been active in scholars’ development and testing of these various approaches. This body of research—with its uneven ability to support causal claims—motivates more deliberate thinking and vetting of research methods, to improve the field’s ability to answer important policy questions.¹⁴ This Method Note in Three Parts seeks to advance experimental analytic methods aimed at informing these “what works?” questions. In Part 3, we have examined specifically the issue of overfitting and its implications in the ASPES from randomized experimental samples.

The overall note on ASPES ends here with Part 3. We believe that, taken in total, the note sets a foundation for stretching what scientists can learn—and policy makers gain—from experimental evaluations to assess the impact of government and foundation social programs. Most fundamentally, ASPES drills deeper into experimental data in search of the pathways by which successful social policy interventions achieve their results. Any pathway that produces larger impacts on the target population can become the focus of greater emphasis in future refinements of the intervention—and the pathways that fail to demonstrate contributions can be retooled or discarded.

This process of uncovering the causal elements of successful interventions should not be oversold, however. We have argued that many important policy questions can be addressed through ASPES but that its applications over the last 10 years could be improved (Part 1 of the note). We have also sought ways to improve its use and reliability, as explained in Parts 2 and 3 of the note. In concluding this phase of the effort, we acknowledge that work remains, including:

making the connection to other similar analytic methods supported by randomized experiments, such as IV estimation and principal stratification;

adapting the methodology to continuous mediational variables, not just discrete subgroup classification indicators; and

accounting for multiple mediational factors operating simultaneously, each making its own contribution to program impacts.

Taken together, the elements of this Method Note in Three Parts share important insights that advance how the ASPES approach might be better understood and thereby find greater application in policy evaluations. Our primary hope in offering this series is that further efforts of methodologists and applied evaluation specialists will—over the next 10 years—expand our ability to determine what social policy intervention components contribute most to larger impacts when analyzing data from experiments that, historically, have said too little about this question. As we press in this direction, we are mindful that other routes for learning about what intervention elements matter may get us further faster than ASPES. We hope that these other routes can become a greater part of the scholarly and policy conversations as well. In sum, the parts of this Method Note urge methodological innovation and aim to improve the evaluation practice of endogenous subgroup analyses to inform important “what works?” questions across a wide array of policy topics.

Appendix A

Appendix B

Acknowledgments

We are grateful for the useful input from participants in Abt’s Journal Author Support Group, especially Jacob Klerman, Bill Rhodes, Steve Kennedy, David Judkins, and Rob Olsen; attendees at the Annual Research Conference of the Association for Public Policy Analysis and Management’s (APPAM, Baltimore, MD) presentation of this work, including Lindsay Page and Howard Bloom; and the research partnership of Shawn Moulton.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

Support for this research was provided in part by Abt Associates' Daniel B. McGillis Professional Development and Dissemination Grant program.

References

Bell, S. H., & Peck, L. R. (2013). Using symmetric predication of endogenous subgroups for causal inferences about program effects under robust assumptions: Part two of a method note in three parts. American Journal of Evaluation, 34, 413–426. doi:10.1177/1098214013490820

Bloom, H. S. (1984). Accounting for no-shows in experimental evaluation designs. Evaluation Review, 8, 225–246. doi:10.1177/0193841X8400800205

Gibson, C. M. (2003). Privileging the participant: The importance of subgroup analysis in social welfare evaluations. American Journal of Evaluation, 24, 443–469. doi:10.1177/109821400302400403

Harknett (2006). Estimating effects for program participants using propensity score: Does receiving an earnings supplement affect union formation? Evaluation Review, 30, 741–778. doi:10.1177/0193841X06293411

Kemple, J. J., & Snipes, J. C. (2000). Career academies: Impacts on students’ engagement and performance in high school. New York, NY: MDRC. Retrieved July 31, 2011, from http://www.mdrc.org/sites/default/files/full_45.pdf

Morris, P. A., & Hendra (2009). Losing the safety net: How a time-limited welfare policy affects families at risk of reaching time limits. Developmental Psychology, 45, 383–400. doi:10.1037/a0014960

Moulton, Peck, L. R., & Dillman, K.-N. (2013). The Moving to Opportunity Demonstration’s impact on health and well-being among high dosage participants (Abt Thought Leadership Paper). Cambridge, MA: Abt Associates.

Peck, L. R. (2003). Subgroup analysis in social experiments: Measuring program impacts based on post treatment choice. American Journal of Evaluation, 24, 157–187. doi:10.1016/S1098-2140(03)00031-6

Peck, L. R. (2013). On analysis of symmetrically predicted endogenous subgroups: Part one of a method note in three parts. American Journal of Evaluation, 34, 225–236. doi:10.1177/1098214013481666

Peck, L. R., & Bell, S. H. (2012, November 10). Estimating the influence of Head Start quality on child development. Paper presented at the Annual Fall Research Conference of the Association for Public Policy Analysis and Management, Baltimore, MD.

Sanbonmatsu, Kling, J. R., Duncan, G. J., & Brooks-Gunn (2006). Neighborhoods and academic achievement: Results from the moving to opportunity experiment. Journal of Human Resources, 41, 649–691.

Schochet, P. Z., & Burghardt (2007). Using propensity scoring to estimate program-related subgroup impacts in experimental program evaluations. Evaluation Review, 31, 95–120.

Wood, R. G., Moore, & Clarkwest (2011). BSF’s effects on couples who attended group relationship skills sessions: A special analysis of 15–month data (OPRE Report #2011-17). Washington, DC: U.S. Department of Health and Human Services, Administration for Children and Families.