Abstract
Given the widespread use of nonexperimental (NE) methods for assessing program impacts, there is a strong need to know whether NE approaches yield causally valid results in field settings. In within-study comparison (WSC) designs, the researcher compares treatment effects from an NE with those obtained from a randomized experiment that shares the same target population. The goal is to assess whether the stringent assumptions required for NE methods are likely to be met in practice. This essay provides an overview of recent efforts to empirically evaluate NE method performance in field settings. We discuss a brief history of the design, highlighting methodological innovations along the way. We also describe papers that are included in this two-volume special issue on WSC approaches and suggest future areas for consideration in the design, implementation, and analysis of WSCs.
Over the last 50 years, two advances have improved methodological rigor for making causal inferences. The first advance was acknowledging the primacy of research design, such as the randomized experiment or the regression-discontinuity design (RDD), over statistical adjustment procedures for establishing causal inference (J. Angrist & Pischke, 2009; Morgan & Winship, 2007; Shadish, Cook, & Campbell, 2002). The second advance was using potential outcomes to define causal quantities of interest and to formulate identification assumptions for various research designs (Rubin, 1974, 2005). Together, these developments have provided researchers with a formal understanding of the assumptions required for research designs to produce valid causal results. These two advances have also helped researchers develop empirical diagnostics to partially probe whether these assumptions are likely to be met.
However, it is rarely possible for a researcher to test whether the stringent assumptions needed to identify and estimate a causal quantity for a given research design are actually met in field settings. In an RDD, we never know whether parametric and nonparametric estimation methods correctly model the relationship between the assignment and outcome variables. In a nonequivalent comparison group design, we rarely know whether all confounding covariates that are simultaneously related to treatment assignment and the outcome have been reliably measured. In comparative interrupted time-series designs, we never know whether units in the treatment and comparison group share “common trends” over time in the absence of treatment.
The within-study comparison (WSC) design has emerged as a method for assessing whether the stringent assumptions needed to identify and estimate causal quantities are met in practice. In a traditional WSC design, treatment effects from a randomized controlled trial (RCT) are compared to those produced by a nonexperiment (NE) that shares the same target population, outcomes, and intervention. The NE may be an RDD, a matching design, or a difference-in-differences or interrupted time-series approach. The goal of a WSC is to determine whether, and under which conditions, the NE method succeeds in reproducing results from a high-quality RCT with the same target population. Table 1 provides a summary of more than 70 WSCs from 1986 to 2017.
Table 1. All Known WSCs.
Note. This list includes all known WSCs, including working papers and paper presentations (where a published version of the study is unavailable). We do not include simulation studies or four-arm designs where the study is intended to estimate the effect of randomization or preference rather than the performance of the NE method. The WSC design column notes whether the researchers used an independent or dependent-arm design (or both if more than one study was conducted). The NE design column refers to the primary research design. Note that we group all time-series designs (including comparative interrupted time series and difference-in-differences) under the ITS label. Where authors combine NE designs, we note the primary design that is tested. NECG = nonequivalent comparison group; RDD = regression discontinuity design; ITS = interrupted time series; IV = instrumental variables; WSC = within-study comparison; NE = nonexperimental.
Results from early WSCs had a profound influence on research practice and priorities in program and policy evaluation (see WSC studies under the heading “Job training” in Table 1). These studies reified a clear preference in methodology choice for government funding agencies and evaluation policy: RCTs whenever possible, RDDs when RCTs are not feasible, and finally, if at all, observational approaches such as matching or regression adjustment (see What Works Clearinghouse Evidence Standards, 2008a, 2008b, 2011). The Office of Management and Budget (2005) cited results from early WSCs in its 2004 recommendation that federal agencies should use RCTs for evaluating program impacts, cautioning against the use of “comparison group studies” that “often lead to erroneous conclusions” (p. 5). The U.S. Department of Education also identified random assignment as the preferred method for “scientifically based research” in the Federal Register (2005). In responding to critiques that random assignment was “not the only method capable of generating causal effects,” Rod Paige (Federal Register, 2005), the Education Secretary under the George W. Bush Administration, cited WSC results, stating that “conclusions about causality based on other methods, including the quasi-experimental designs included in this priority, have been shown to be misleading compared with experimental evidence” (p. 3588).
Despite the importance of WSCs in providing researchers, funders, and decision makers with guidance about NE methods’ performance in practice and designing valid program evaluations, a number of questions about the best ways to implement and analyze the WSC itself remain. For example, what are the requirements for a WSC design to yield interpretable results, and how can researchers design a valid and reliable WSC? What criteria should researchers use to determine whether results from the NE replicate results from the RCT benchmark? And perhaps most importantly, how should we interpret results from one WSC to understand NE method performance in other contexts and settings?
In this essay, we provide a brief historical overview of WSC designs. To this end, we describe the special contributions of WSCs to the program evaluation literature and common methodological challenges that arise in the design, implementation, analysis, and interpretation of the approach. We then highlight papers that appear in this two-volume special issue of Evaluation Review. These papers add to our knowledge of NE method performance; they also address important methodological considerations in the design and analysis of WSCs. The essay concludes by considering future directions for how WSCs may be used to improve NE theory and practice.
History of WSCs
Statistical theory formulates the assumptions needed for a causal method to work. That is, theory shows when a method can yield unbiased causal effects. Simulation studies help researchers understand the statistical properties of the method under specific, well-defined conditions. Simulation studies, however, rarely capture the full complexity of real-world data and have little to say about whether a research design’s assumptions are actually met in field settings. Addressing these methodological questions requires empirical evaluations of NE methods in real-world evaluations.
Introduced by LaLonde (1986) and Fraker and Maynard (1987), the earliest WSC designs used data from job training evaluations to compare results from an NE with those from an RCT benchmark. To construct the WSCs, LaLonde and Fraker and Maynard used RCT data from the National Supported Work (NSW) Demonstration program (Manpower Demonstration Research Corporation, 1980). The NE was created by deleting RCT control cases from the NSW and replacing them with no-treatment comparisons from the Current Population Survey (CPS) or the Panel Study of Income Dynamics (PSID). The interest was methodological—to see whether econometric techniques could be used with nationally representative data sets to reproduce RCT results. But the goal was policy-driven—to discover whether there were more cost-efficient methods than RCTs for estimating program impacts.
The early WSCs examined the performance of regression, difference-in-differences, matching, and instrumental variable models. Researchers estimated NE bias by comparing NE results with those obtained from the RCT benchmark. Because the treatment group was shared across the RCT and NE arms, researchers also assessed bias by directly comparing conditional outcomes from NE comparisons and RCT controls (Bloom, Michalopoulos, & Hill, 2005; Fraker & Maynard, 1987). The general conclusion from these studies was that NE methods fail to reproduce RCT benchmark results (Fraker & Maynard, 1987; Friedlander & Robins, 1995). Fraker and Maynard (1987) summarized their findings by writing that “the results of our study indicate that NE design evaluations cannot be relied on to estimate the effectiveness of programs like Supported Work with sufficient precision (and in some cases unbiasedness) to provide policymakers with adequate information to guide decisions” (p. 196).
Heckman and colleagues (Heckman & Hotz, 1989; Heckman, Ichimura, Smith, & Todd, 1998; Heckman, Ichimura, & Todd, 1997) reanalyzed the NSW data and conducted new WSCs with RCT data from the Job Training Partnership Act (JTPA) evaluation. For the JTPA data, they constructed the NE comparison group from observational data of individuals who qualified for JTPA but chose not to participate in the intervention. Using results from WSCs, Heckman and colleagues highlighted conditions under which NE bias can be successfully addressed, at least in job training settings. NE estimates were less biased when rich covariate information was available for matching units, when comparisons were drawn from the same local labor markets, and when dependent variables were measured in the same way for all participants. They also observed that difference-in-differences estimators address selection bias better than cross-sectional estimators and that specification tests using pretreatment outcomes often succeeded in eliminating the most biased estimators. However, Heckman et al. also concluded that while these approaches often succeeded in reducing bias, there was no assurance that they reliably eliminated it.
Two studies provided further surveys of WSC results, with similar conclusions. Glazerman, Levy, and Myers (2003) meta-analyzed 12 WSCs that used data from a series of job training experiments. Bloom, Michalopoulos, and Hill (2005) provided a qualitative summary of WSC results from early job training studies. Both reviews found that although NE approaches sometimes replicated RCT benchmark results, they often produced effects that were “dramatically different from the experimental benchmark” (p. 86). Although Glazerman et al. (2003) wrote that results from the meta-analysis did not resolve “longstanding debates about nonexperimental methods,” for many readers, the take-home message was clear: NE methods could not be trusted to produce credible causal estimates in field settings (p. 86).
Methodological Challenges With WSCs
Results from early WSCs prioritized RCTs as the main research design for program evaluation. This was especially true in fields such as education which, prior to 2001, did not have a tradition of using experiments (J. D. Angrist, 2004; Cook & Foray, 2007). However, despite the sound theoretical reasons to prefer RCTs and some types of quasi-experimental designs, results from early WSCs were also suspect in a number of ways. Incorrect conclusions about the empirical performance of NE methods could have occurred due to invalid WSC designs or the choice of an inappropriate metric for assessing NE performance. Below, we highlight five common methodological challenges (issues) that arose in the design and analysis of early WSCs.
1. Study differences between the RCT and NE: In many early WSCs, the RCT and NE differed in ways beyond the mode of treatment assignment (i.e., random assignment vs. self-selection). For example, comparison units in the CPS or PSID may have been drawn from remote locations (instead of within the same locale as treatment cases), measured at different time points, and, in some cases, may not have shared the same outcome measures. Comparison units in the NE may also have had different job training options from those available to control cases in the RCT. When the RCT and NE arms have extraneous study differences, it is difficult for the researcher to draw conclusions about how well the NE actually performed. Lack of correspondence in NE and RCT results could have occurred because of bias in the NE estimate or because the outcome measure was not assessed in the same way across the two study arms. It would be impossible for the researcher to tell.
2. Differences in causal estimands: WSC results were sometimes confounded by comparisons of different causal quantities from each study condition. For example, the experimental average treatment effect (ATE) may have been compared to an RD ATE at the cutoff. If treatment effects are heterogeneous among subpopulations of units, then comparing two different causal quantities may produce different effect estimates for reasons not related to bias in the NE.
3. Weak causal benchmark for evaluating NE: The RCT benchmark may have suffered from its own implementation problems in the field. Differential attrition, treatment noncompliance, or individuals trying to subvert the randomization process in the RCT may invalidate the RCT’s benchmark status; that is, the RCT was not implemented well enough to serve as the standard for evaluating NE performance.
4. Inappropriate metrics for assessing NE method performance: Early WSCs lacked consensus on how close RCT and NE results needed to be for the researcher to judge that the NE method succeeded in reproducing the RCT effects. Some studies compared the direction and magnitude of effects (Aiken, West, Schwalm, Carroll, & Hsiung, 1998), while others examined patterns of statistical significance (Agodini & Dynarski, 2004; Diaz & Handa, 2006), and still others observed whether estimates differed by more than some policy-relevant threshold (Glazerman et al., 2003). One challenge with these measures is that they may conclude that the NE fails to reproduce RCT results, even when the effect estimates are identical or very similar. For example, if the RCT estimate is slightly greater than 0 and the NE estimate is slightly less than 0, then comparing direction of effects may suggest lack of correspondence in results, even though the point estimates themselves may be considered equivalent. In another example, the RCT and NE point estimates may be identical, but the benchmark result is statistically insignificant while the NE result is significant. Although comparing significance patterns informs researchers about whether a policy maker would arrive at the same decision from an RCT and NE design, these measures may be less useful for assessing the performance of the NE method itself.
5. Limited generalization about NE method performance: Although results from early WSCs provided information about NE performance in job training contexts, there were questions about the extent to which these findings could be generalized to NEs with different target populations, treatments, outcomes, selection mechanisms, baseline information, and research designs.
Glazerman and colleagues (2003) acknowledged the limitations of early WSCs by writing that their “summary of findings gives only part of the picture, and it does so for a specific area of program evaluation research: the impacts of job training and welfare programs on participant earnings” (p. 87). Taken together, these concerns suggested not only that more WSCs were needed in different field settings but also that WSCs of higher methodological quality were needed for drawing valid conclusions about NE methods’ ability to estimate causal effects in practice.
WSC Methodological Innovations
Since the Glazerman et al. (2003) review, researchers have introduced WSC design innovations to address the five methodological limitations in the earlier numbered list. To reduce study differences in the RCT and NE (Issue 1 from above), researchers drew NE comparison units from the same target population as in the RCT. Bloom et al. (2005) used RCT data from the multistate, multisite National Evaluation of Welfare-to-Work Strategies (NEWWS) to construct a WSC. In the RCT arm, welfare recipients were randomly assigned to job training services within sites; in the NE arm, RCT controls from other NEWWS sites (often within the same city) were used to form the comparison group. Because all participants were involved in the same study protocol, they met the same eligibility criteria, provided the same baseline and outcome information, and experienced the same macroeconomic and labor market conditions at the same time. The consistency in research protocols across both study arms reduced the threat of confounders that might otherwise explain differences in RCT and NE results.
Shadish, Clark, and Steiner (2008) introduced another WSC design variant that bolstered the interpretation of results. They ensured that the RCT and NE compared equivalent causal estimands (Issue 2) for the same target population by randomly assigning study participants into the RCT or NE arm of the WSC. Once assigned into study arms, participants in the RCT were randomly assigned again into the vocabulary or math intervention while those in the NE were allowed to select an intervention of their preference. NE bias was computed by comparing effect estimates of the ATE across both study arms. The researchers were also able to ensure that the RCT was well implemented by analyzing baseline and fidelity measures (Issue 3). And, because the WSC was prospectively planned and took place within a controlled laboratory-like setting, the researchers were able to implement the same study procedures across the RCT and NE arms (Issue 1). This meant delivering identical, scripted treatment and control interventions in the RCT and NE studies and using the same outcome measures for assessing impacts of the interventions. Subsequent analyses found no evidence of differential attrition within the RCT or across the RCT and NE arms.
Later WSCs introduced new approaches for assessing comparability between RCT and NE results (Issue 4). These studies acknowledged that, because of sampling error, even close replications of the same RCT would not result in identical treatment effects. And although most studies assessed comparability by examining statistical significance patterns between the RCT and NE, some began using direct statistical tests of difference between RCT and NE results. Other new methods for assessing correspondence included looking at the percentage of bias reduced from the initial naive comparison (Shadish et al., 2008), percent difference in the RCT and NE estimate (Wilde & Hollister, 2007), the mean squared error (Wing & Cook, 2013), the effect size differences between RCT and NE results (Hallberg, Wong, & Cook, 2016), or the relative performance of different NE approaches across multiple bootstrap replications (Hallberg, Wong, & Cook, 2016). Bell and Orr used a Bayesian framework to compute the probability of an incorrect policy decision for different magnitudes of true effect sizes (Solari, Nisar, Bell, & Orr, 2017). All of these approaches have their advantages and limitations. However, the lack of consensus in the WSC literature on how correspondence should be assessed has led to ambiguity and challenges in synthesizing the literature.
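To make the differences among some of these correspondence measures concrete, the following is a minimal sketch that computes three of them from point estimates. The function names, argument names, and formulas are illustrative assumptions: they represent one plausible reading of each measure, not the exact definitions used in the cited studies.

```python
def percent_bias_reduction(rct_est, ne_naive, ne_adjusted):
    """Percentage of the initial (naive) NE bias removed by adjustment."""
    naive_bias = ne_naive - rct_est
    adjusted_bias = ne_adjusted - rct_est
    if naive_bias == 0:
        raise ValueError("the naive NE estimate already equals the benchmark")
    return 100.0 * (1.0 - abs(adjusted_bias) / abs(naive_bias))


def percent_difference(rct_est, ne_est):
    """Difference between NE and RCT estimates as a percentage of the RCT estimate."""
    if rct_est == 0:
        raise ValueError("percent difference is undefined for a zero benchmark")
    return 100.0 * (ne_est - rct_est) / abs(rct_est)


def effect_size_difference(rct_est, ne_est, outcome_sd):
    """Difference between NE and RCT estimates in standard deviation units."""
    if outcome_sd <= 0:
        raise ValueError("outcome_sd must be positive")
    return (ne_est - rct_est) / outcome_sd
```

With a hypothetical benchmark of 4.0, a naive NE estimate of 9.0, and an adjusted NE estimate of 5.0, the adjustment removes most of the initial bias, yet the adjusted estimate still differs from the benchmark by a quarter of its size. The same pair of estimates can therefore look like a success or a failure depending on the measure chosen.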
Finally, a common critique of WSC evaluations concerns their generalizability. Researchers want to know how well results from one study setting apply to NE method performance in other contexts, with different outcomes and treatment selection mechanisms (Issue 5). Although this issue is not unique to WSCs—the same concern arises in RCT evaluations—results from a single WSC study have little to say about general method performance. But results from multiple WSCs may provide insights as to how well these methods perform for similar outcomes and settings of particular interest.
Over the years, researchers have conducted qualitative and quantitative summaries of WSC results with the goal of providing advice for better NE practice. Some summaries have focused on observational method performance in particular disciplines or fields, with a narrowly defined set of outcomes. Glazerman et al. (2003) and Bloom et al. (2005) reviewed WSC results in the job training literature, where the outcome of interest was participants’ annual earnings. Both reviews confirmed Heckman et al.’s findings that NE methods produced less biased estimates when comparison groups were local, when covariate sets were rich and included pretest measures, and when researchers combined multiple design features (e.g., difference-in-differences with matching) for estimating effects.
Wong, Valentine, and Miller-Bains (2017) examined results from 12 WSCs in education settings with standardized reading or math outcomes. Their goal was to assess performance of common covariate types used in observational studies in education. As in the job training literature, Wong et al. found that the pretest often reduced a major portion of the bias but did not always eliminate it. However, matching units from similar geographic locales did not provide the same benefit within education contexts as it did in job training settings. This was likely because the selection process into education interventions varied across settings, as did the definition of “local” comparisons in these evaluations. Wong et al. also noted that when rich covariate sets were available, NE methods replicated RCT benchmark estimates more closely in educational contexts, but they cautioned that further replications are needed in this area.
Other summaries have reviewed WSC results from multiple disciplines to assess method performance more generally. Cook, Shadish, and Wong (2008) looked at 12 WSCs from 2002 to 2007 that spanned the fields of education, international development, and public health. The authors observed three conditions under which the NE method appeared to remove all or at least a major part of the bias. The first condition was when treatment and comparison units were assigned to treatment conditions based on an assignment variable and a cutoff, as in the RDD. In a more recent review, Chaplin et al. (2018) meta-analyzed results from 15 WSCs looking at RD performance across various fields. They found that the average NE bias was small, less than 0.01 SDs, providing further evidence for Cook et al.’s hypothesis.
Cook, Shadish, and Wong’s second and third conditions describe contexts under which NE methods appeared to remove most if not all the bias. Those contexts include when the selection process was known and observed by the researcher, as in students’ selection into a math or vocabulary intervention in the Shadish et al. WSC described above, or when “intact groups” (e.g., schools, villages) were matched using rich covariate information, or within the same geographic area. However, these results have yet to be confirmed by more recent WSCs, so more research is needed in this area.
This Special Issue
This two-volume special issue of Evaluation Review contributes to the WSC literature in two distinct ways. First, the February issue presents four additional case-study evaluations of NE method performance in educational contexts. Gleason, Resch, and Berk (2018) examine parametric and nonparametric method performance in an RDD. The authors use RCT data from evaluations of Ed Tech and Teach for America to construct RD designs synthetically. They create the RD by selecting a hypothetical cutoff on a baseline covariate and systematically deleting RCT treatment or comparison observations above and below the designated cutoff. A useful innovation of this article is that the authors replicate their RCT results across multiple data sets, as well as multiple cutoffs within each data set, and pool their results through a systematic meta-analysis. Dong and Lipsey (2018) assess covariate performance in an observational study within the context of early childhood education (ECE). This is one of the few studies in the WSC literature that examines covariate performance in an ECE setting with outcomes of students’ emerging literacy and math skills. They also look at the performance of different matching estimators when comparisons are drawn from within and across states. Kisbu-Sakarya, Cook, Tang, and Clark (2018) also examine NE method performance in the context of ECE, but their WSC evaluates the performance of a comparative RD (CRD) design against an RCT benchmark from the Head Start Impact Study. Finally, Tang and Cook (2018) show the benefits of the CRD design by comparing the statistical precision of CRD results with RD and RCT results from the Head Start Impact Study.
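The synthetic RD construction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by Gleason, Resch, and Berk: the record layout (the keys `baseline`, `treated`, and `outcome`), the function name, and the rule that units at or above the cutoff are the ones that keep treatment are all hypothetical.

```python
def make_synthetic_rd(rct_records, cutoff, treat_above=True):
    """Build a synthetic RD sample from RCT records.

    Keeps treated units on one side of the cutoff and control units on the
    other, deleting everyone else, so that treatment status becomes a
    deterministic function of the baseline covariate.
    """
    kept = []
    for record in rct_records:
        above = record["baseline"] >= cutoff
        # Treatment status this unit would have under the synthetic RD rule.
        rd_status = above if treat_above else not above
        if record["treated"] == rd_status:
            kept.append(record)
    return kept
```

Because the full RCT sample remains available, the RD estimate from the retained units can be compared with the experimental estimate, and the exercise can be repeated at several cutoffs. Note that an RD identifies an effect at the cutoff, so a careful comparison would use an experimental estimate for units near the same cutoff.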
The April issue includes a series of methodological papers that seek to improve the design and analysis of the WSC approach itself. To this end, Wong and Steiner (2018) formalize the WSC design using a potential outcomes framework. They explicate the required design components and assumptions needed for the approach to yield a valid interpretation of NE method performance. This article also describes three different design variants for evaluating NE methods, and the benefits and limitations of each approach. Steiner and Wong (2018) next address the issue of how one should assess correspondence between RCT and NE results. That is, they address the question first posed by Wilde and Hollister (2007) of “how close is close enough” for the NE to have successfully replicated benchmark results? Through a series of simulation studies, the authors demonstrate the benefits and limitations of common criteria for assessing correspondence in RCT benchmark and NE results, and propose a new framework for assessing NE method performance: the correspondence test, which incorporates both frequentist tests of difference and equivalence in the same framework. Rindskopf, Shadish, and Clark (2018) propose an alternative criterion for assessing correspondence between RCT and NE results using a Bayesian approach. Their method involves calculating the probability that the absolute value of the difference between the RCT and NE result is less than some threshold determined to be close enough to 0. They argue that the Bayesian criteria improve the power of WSCs by allowing for the incorporation of prior information into the analysis and provide more varied, nuanced, and informative answers to questions of correspondence.
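The two kinds of correspondence criteria described above can be illustrated with a short sketch. It makes several simplifying assumptions that are ours rather than the cited authors': the RCT and NE estimates are independent and approximately normal, the equivalence test is implemented as two one-sided z-tests, the four outcome labels are shorthand, and the Bayesian calculation uses a flat prior (so it does not show the benefit of incorporating prior information). All function and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def correspondence_test(rct_est, rct_se, ne_est, ne_se, tolerance, alpha=0.05):
    """Combine a test of difference with a test of equivalence."""
    diff = ne_est - rct_est
    se = sqrt(rct_se ** 2 + ne_se ** 2)  # assumes independent study arms
    # Difference test: two-sided z-test of H0: diff = 0.
    p_difference = 2.0 * (1.0 - STD_NORMAL.cdf(abs(diff) / se))
    # Equivalence test: two one-sided tests of H0: |diff| >= tolerance.
    p_lower = 1.0 - STD_NORMAL.cdf((diff + tolerance) / se)
    p_upper = STD_NORMAL.cdf((diff - tolerance) / se)
    p_equivalence = max(p_lower, p_upper)
    different = p_difference < alpha
    equivalent = p_equivalence < alpha
    if equivalent and not different:
        return "equivalence"
    if different and not equivalent:
        return "difference"
    if different and equivalent:
        return "trivial difference"
    return "indeterminacy"


def prob_within_threshold(rct_est, rct_se, ne_est, ne_se, threshold):
    """Posterior probability that |NE - RCT| < threshold under a flat prior."""
    diff = ne_est - rct_est
    se = sqrt(rct_se ** 2 + ne_se ** 2)  # assumes independent study arms
    posterior = NormalDist(mu=diff, sigma=se)
    return posterior.cdf(threshold) - posterior.cdf(-threshold)
```

In both functions the analyst must choose in advance how large a difference still counts as close enough (`tolerance` or `threshold`), which is the substantive judgment behind the “how close is close enough” question.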
New Frontiers for WSC Approaches
Although the WSC literature has made strong advances since the early job training studies, our reading of the literature suggests four emerging areas for improving the design, analysis, and practice of NE evaluations:
Issue 1: Establish Research Protocols for the Design and Analysis of WSC Results
One issue with the implementation of WSCs is that knowledge of the benchmark result may inadvertently skew the many decisions researchers must make in the analysis of the NE. For example, in observational studies, the researcher has choices about covariate selection for estimating the propensity score (Smith & Todd, 2005) and about the type of estimator used to produce treatment effects (e.g., matching, stratification, or doubly robust estimators). Cook et al. (2008) recommend that two independent research teams should analyze the benchmark and NE separately and that the analysts of the NE should be blinded to the benchmark results. This is generally good practice, but it may not be specific enough to be feasible. Research teams may wish to coordinate which causal estimands they will compare, and the analytic models they will use to estimate treatment effects (e.g., should the RCT and NE treatment effects be estimated using regression-adjusted [doubly robust] models or not?).
In future implementations of WSCs, research teams should establish and describe a protocol in advance of data collection or analysis. Developing a WSC research protocol is similar to preregistration of research plans for RCTs or meta-analyses. One benefit of a WSC protocol is that it would provide prespecified guidance to researchers on questions that naturally arise in the design and analysis of WSCs. In cases where the NE and RCT are analyzed by independent teams of researchers, developing a research protocol can provide opportunities for investigators to come to a common understanding of the study plan. The research protocol could also allow for WSC researchers to obtain feedback and advice on their data collection and analysis plans, prior to revealing any results.
Generally, the WSC protocol should address the following topics: (1) confirmatory versus exploratory research questions in the WSC context, (2) diagnostics for assessing assumptions of the WSC design, (3) potential deviations from the intended research protocol, and (4) criteria for determining correspondence in results. The protocol should recommend that analysts of the RCT and NE document all analysis procedures; it should also provide a place for the researchers to document any problems or questions that arise, and how these questions were resolved. Finally, the protocol should provide guidance on when it is appropriate for RCT and NE analysts to consult with each other, and when their analysis should be conducted independently.
Issue 2: Consider Statistical Power for WSC Designs
Another critical issue in the planning of WSCs is ensuring that the design has sufficient statistical power for detecting comparability in treatment effects between the RCT and NE. In fact, WSCs usually have much greater power requirements than do the RCT or NE for detecting impacts. To understand why WSCs usually require larger samples, consider a scenario where the criterion for assessing correspondence in RCT and NE effects is whether the two study conditions produce the same result in a null hypothesis test of the treatment effect. In other words, do the RCT and NE result in the same conclusion about the presence of a treatment effect? In an independent WSC design (i.e., units were randomly assigned into RCT and NE conditions), the probability of rejecting the null in both study conditions depends on the statistical power in the RCT and NE. Here, a well-powered RCT and NE, each with a statistical power of .8 to detect the true but unknown effect, produce the same pattern of statistical significance with a probability of only .68 (= .8 × .8 + .2 × .2, i.e., the probability of obtaining a significant effect estimate in both studies plus the probability of obtaining an insignificant result in both studies). When, as is not uncommon, the RCT and NE are underpowered for detecting significant effects (e.g., each with a power of .2), the probability of obtaining corresponding significance patterns is again .68. But now correspondence is most likely due to obtaining insignificant (.8 × .8) rather than significant (.2 × .2) effect estimates in both studies. Thus, when neither the NE nor the RCT shows a significant treatment effect, the researcher may incorrectly conclude that the NE lacks bias when in fact both study conditions are simply underpowered for detecting effects.
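The arithmetic in this scenario can be reproduced with a short sketch. The function name is hypothetical, and the calculation assumes, as in the text, an independent WSC design in which both arms target the same true effect.

```python
def significance_agreement(power_rct, power_ne):
    """Probability that an independent RCT and NE agree on significance.

    Returns the probability that both reject the null, the probability that
    neither rejects it, and their sum (the total probability of agreement).
    """
    both_significant = power_rct * power_ne
    both_insignificant = (1.0 - power_rct) * (1.0 - power_ne)
    return both_significant, both_insignificant, both_significant + both_insignificant


# Well-powered arms: agreement is driven by two significant results.
high = significance_agreement(0.8, 0.8)
# Underpowered arms: same total, but driven by two insignificant results.
low = significance_agreement(0.2, 0.2)
```

The total probability of agreement is the same in both cases, but its composition is reversed, which is why matching significance patterns alone say little about NE bias.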
Future WSCs should consider statistical power for assessing comparability of results in the design phase of the evaluation. Three papers in the April issue provide guidance on statistical power. Wong and Steiner show that WSC design variants (e.g., WSCs with independent vs. dependent data structures in the RCT and NE arms) have different statistical power for assessing correspondence in results, and Steiner and Wong suggest a method for assessing statistical power in the design phase through the correspondence framework. Rindskopf, Shadish, and Clark suggest that Bayesian approaches for assessing correspondence of RCT and NE results have improved statistical power over frequentist approaches.
Issue 3: Continue to Explore the External Validity of WSC Results
The existing WSCs represent a heterogeneous mix of studies from different disciplines, research designs, and outcomes. Currently, the authors have identified more than 70 WSC studies (see Table 1). These studies include substantial variation in contexts, NE methods examined, as well as outcomes and treatment selection mechanisms. As more studies continue to be added to the literature, ongoing quantitative synthesis of results can provide important descriptive information about NE method performance in field settings, and the contexts and conditions under which these methods may perform well. Meta-analysis of WSC results may also address an important challenge that many stand-alone studies face—lack of statistical power for assessing correspondence in results.
However, we note that a rigorous synthesis of WSC results also requires more systematic reporting of study procedures and outcomes as well as consistent criteria for assessing correspondence in results. For example, it would be useful for WSC analysts to report estimates of NE bias and the standard error of their bias estimates. Moreover, in WSC designs where units are shared between the RCT and NE arm, the standard errors should account for dependencies in the data structure (see discussion by Steiner and Wong). In addition, because the direction and strength of the selection processes in the NE vary across WSC studies, analysts should always report the initial, unadjusted selection bias (i.e., the difference between the unadjusted NE estimate and the RCT estimate). This allows for an assessment of the sign and magnitude of the selection bias before making any statistical adjustments.
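The reporting suggested above can be sketched as a small calculation. The Python example below is a minimal illustration, not a procedure from the papers in this issue: the function name, the covariance argument, and all numeric inputs are hypothetical. It relies only on the standard identity that the variance of a difference equals the sum of the two variances minus twice their covariance, so the covariance is zero in an independent WSC design and must be estimated by the analyst when units are shared between arms.

```python
import math


def ne_bias(ne_estimate: float, ne_se: float,
            rct_estimate: float, rct_se: float,
            covariance: float = 0.0) -> tuple[float, float]:
    """Return the estimated NE bias (NE minus RCT estimate) and its
    standard error.

    covariance is the sampling covariance between the two estimates:
    0.0 for independent arms (an assumption), nonzero when units are
    shared between the RCT and NE arms.
    """
    bias = ne_estimate - rct_estimate
    variance = ne_se ** 2 + rct_se ** 2 - 2.0 * covariance
    if variance < 0.0:
        raise ValueError("covariance is too large for the given standard errors")
    return bias, math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    rct = (0.25, 0.03)          # RCT benchmark estimate and standard error
    unadjusted_ne = (0.45, 0.04)  # NE estimate before any adjustment
    adjusted_ne = (0.30, 0.04)    # NE estimate after covariate adjustment

    for label, (estimate, se) in (("unadjusted", unadjusted_ne),
                                  ("adjusted", adjusted_ne)):
        bias, bias_se = ne_bias(estimate, se, *rct)
        print(f"{label} NE bias = {bias:.3f} (SE = {bias_se:.3f})")
```

Reporting both the unadjusted and adjusted rows lets readers see the sign and size of the initial selection bias alongside the bias that remains after adjustment.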
Meta-analysis of WSC results has tremendous promise in revealing new insights about good NE practice. However, given the heterogeneity of WSCs in terms of study designs, samples, outcomes, and selection processes, a rigorous meta-analysis should synthesize or pool results only when substantively or theoretically appropriate. To this end, WSC analysts should document and report study procedures and contextual factors that may be related to NE bias.
Issue 4: Using RCT Benchmark Results for Examining Treatment Effect Variation and Generalization
Recently, researchers have applied WSC designs to address research questions of programmatic and policy relevance. For example, an RCT benchmark may be used to validate an NE model that is then used to estimate treatment effects for a more general target population of interest. This method has been applied to generalize treatment effects across different units (J. D. Angrist & Rokkanen, 2015; Wing & Clark, 2016), treatments (Bell, Harvill, Moulton, & Peck, 2017; Hotz, Imbens, & Mortimer, 2005), and settings (Abdulkadiroğlu, Angrist, Dynarski, Kane, & Pathak, 2011).
For example, Abdulkadiroğlu, Angrist, Dynarski, Kane, and Pathak (2011) used a WSC design to assess the external validity of treatment effects from Boston charter and pilot schools with admission lotteries to schools without such lotteries. The RCT consisted of lottery students in oversubscribed charter/pilot schools; the NE consisted of lottery winners as well as noncharter/pilot students in Boston public schools. To estimate NE treatment effects, the authors used regression models that controlled for student demographic characteristics and baseline scores.
The authors constructed a series of WSCs for subsamples of charter and pilot schools and for elementary and secondary grades. In cases where the WSC NE and RCT produced corresponding effects, the researchers concluded that the NE model was sufficient for addressing selection bias in an observational study of nonlottery charter/pilot schools and Boston public schools. The assumption here was that the selection process into charter/pilot schools with lotteries could be generalized to schools without lotteries. However, when the WSC NE failed to reproduce RCT benchmark results, the authors concluded that the NE model could not be used to estimate observational treatment effects. Overall, Abdulkadiroğlu et al. observed close correspondence in RCT and NE results for charter school students, and for middle school students with pilot programs. In assessing the external validity of the charter school lottery results, they found that although charter schools without lotteries produced positive and significant effects, they were smaller than effects observed from oversubscribed charter schools. The authors also found that the WSC NE model did not perform well for a subsample of high schools with pilot programs. As a result, they did not use the NE model to assess the external validity of treatment effects for this subsample of schools.
In a second example, Hotz, Imbens, and Klerman (2006) used a WSC design to examine treatment effect variation due to differences in program components. The researchers used RCT data from the Greater Avenues to Independence Program evaluation, where participants in six California counties were randomly assigned to receive job training services or to be in a control group that was denied services. Because of the local nature of treatment implementation, some county programs provided participants with general education and skills development, while other sites encouraged participants to secure immediate employment.
A goal of the evaluation was to assess treatment effect variation due to differential program components. However, because participants were not randomly assigned to sites, researchers were concerned that observed treatment effect variation may have been confounded with participants’ characteristics. To address this issue, the authors constructed a WSC using RCT control group members’ outcomes. Their goal was to examine whether NE methods and observed participant characteristics could address units’ selection into sites. In places where the NE method succeeded in producing conditionally equivalent control groups, the researchers felt assured that the NE approach could be used to produce valid effect estimates of program components.
These examples illustrate how WSCs may be used to probe NE assumptions empirically. They also show how WSCs may be used to signal when NE assumptions are not well warranted in field settings. As researchers continue to use RCT and NE data to “learn more” from program and policy evaluations, WSCs provide an important method for validating NE assumptions, and for generalizing and uncovering differential treatment effects.
Conclusion
Because of increased availability of RCT data, there are now empirical evaluations of NE methods in job training, education, early childhood development, political science, international development, and public health. WSCs have also been used to evaluate more types of quasi-experimental approaches including the RDD (see Cook and Wong, 2008, for review) and, most recently, the interrupted time-series design (St. Clair, Cook, & Hallberg, 2014; St. Clair, Hallberg, & Cook, 2016). As the number of WSCs in varying contexts increases, so does the opportunity for synthesizing the literature for greater insight and external validity.
Results from WSC evaluations have had important impacts on both research practice and funding priorities in program evaluation. In most areas of the social sciences, an RCT is the preferred method for establishing causal inferences. However, WSCs have shown specific contexts and conditions where NE methods succeed in removing most, if not all, of the bias. Methodological advances in WSC designs, like those presented in this special issue, will continue to improve our understanding of NE practice. As the program evaluation field turns to important policy-relevant questions such as “When, where, for whom, and why does it work?” WSCs may again be instrumental in improving methodology and validating research design assumptions in field settings.
Footnotes
Authors’ Note
The opinions expressed are those of the authors and do not represent views of NSF.
Acknowledgment
We dedicate this two-volume special issue on within-study comparison designs to our mentor and friend, William R. Shadish. We had the honor to work with and learn from Will on the design, implementation, and analysis of several within-study comparisons. The April issue of Evaluation Review includes one of Will’s last papers, coauthored with David Rindskopf. Will, we miss you and think of you often.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant #R305B140026 and a collaborative research grant from the National Science Foundation, through Grant #2015-0285-00 to the Rectors and Visitors of the University of Virginia. The opinions expressed are those of the authors and do not represent views of the Institute, the U.S. Department of Education, or the National Science Foundation.
