What Can Be Learned From Empirical Evaluations of Nonexperimental Methods?

Abstract

Given the widespread use of nonexperimental (NE) methods for assessing program impacts, there is a strong need to know whether NE approaches yield causally valid results in field settings. In within-study comparison (WSC) designs, the researcher compares treatment effects from an NE with those obtained from a randomized experiment that shares the same target population. The goal is to assess whether the stringent assumptions required for NE methods are likely to be met in practice. This essay provides an overview of recent efforts to empirically evaluate NE method performance in field settings. We discuss a brief history of the design, highlighting methodological innovations along the way. We also describe papers that are included in this two-volume special issue on WSC approaches and suggest future areas for consideration in the design, implementation, and analysis of WSCs.

Keywords

within-study comparison, causal inference, program evaluation, nonexperiments

Over the last 50 years, two advances have improved methodological rigor for making causal inferences. The first advance was acknowledging the primacy of research design, such as the randomized experiment or the regression-discontinuity design (RDD), over statistical adjustment procedures for establishing causal inference (J. Angrist & Pischke, 2009; Morgan & Winship, 2007; Shadish, Cook, & Campbell, 2002). The second advance was using potential outcomes to define causal quantities of interest and to formulate identification assumptions for various research designs (Rubin, 1974, 2005). Together, these developments have provided researchers with a formal understanding of the assumptions required for research designs to produce valid causal results. These two advances have also helped researchers develop empirical diagnostics to partially probe whether these assumptions are likely to be met.

However, it is rarely possible for a researcher to test whether the stringent assumptions needed to identify and estimate a causal quantity for a given research design are actually met in field settings. In an RDD, we never know whether parametric and nonparametric estimation methods correctly model the relationship between the assignment and outcome variables. In a nonequivalent comparison group design, we rarely know whether all confounding covariates that are simultaneously related to treatment assignment and the outcome have been reliably measured. In comparative interrupted time-series designs, we never know whether units in the treatment and comparison group share “common trends” over time in the absence of treatment.

The within-study comparison (WSC) design has emerged as a method for assessing whether the stringent assumptions needed to identify and estimate causal quantities are met in practice. In a traditional WSC design, treatment effects from a randomized controlled trial (RCT) are compared to those produced by a nonexperiment (NE) that shares the same target population, outcomes, and intervention. The NE may be an RDD, a matching design, or a difference-in-differences or interrupted time-series approach. The goal of a WSC is to determine whether, and under which conditions, the NE method succeeds in reproducing results from a high-quality RCT with the same target population. Table 1 provides a summary of more than 70 WSCs from 1986 to 2017.

Table 1.

All Known WSCs.

Field	Study	WSC Design	NE Design
Consumer science	Mueller and Gaus (2015)	Independent	Other
Development	Buddelmeyer and Skoufias (2004)	Dependent	RDD
	Diaz and Handa (2006)	Dependent	NECG
	Handa and Maluccio (2010)	Dependent	NECG
Education	Agodini and Dynarski (2004)	Dependent	NECG
	Aiken, West, Schwalm, Carroll, and Hsuing (1998)	Dependent	NECG/RDD
	Anderson and Wolf (2017)	Dependent	NECG
	J. Angrist, Autor, Hudson, and Pallais (2015)	Dependent	RDD
	Ashworth and Pullen (2015)	Dependent	RDD
	Barrera-Osorio, Filmer, and McIntyre (2014)	Dependent	RDD
	Bifulco (2012)	Dependent	NECG
	Dong and Lipsey (2018)	Dependent	NECG
	Fortson, Gleason, Kopa, and Verbitsky-Savitz (2015)	Dependent	NECG
	Fortson, Verbitsky-Savitz, Kopa, and Gleason (2012)	Dependent	NECG/ITS
	Gill et al. (2016)	Dependent	NECG
	Gleason, Resch, and Berk (2012)	Dependent	RDD
	Hallberg, Cook, Steiner, and Clark (2016)	Both	NECG
	Hallberg, Wong, and Cook (2016)	Dependent	NECG
	Jaciw (2016a)	Dependent	NECG
	Jaciw (2016b)	Dependent	NECG
	Jacob, Somers, Zhu, and Bloom (2016)	Dependent	ITS
	Leow, Wen, and Korfmacher (2015)	Dependent	NECG
	Lottridge, Nicewander, and Mitzel (2011)	Dependent	NECG
	Luellen, Shadish, and Clark (2005)	Independent	NECG
	Moss, Yeaton, and Lloyd (2014)	Dependent	RDD
	Padgett, Salisbury, An, and Pascarella (2010)	Dependent	NECG
	Pohl, Steiner, Eisermann, Soellner, and Cook (2009)	Independent	NECG
	Shadish, Clark, and Steiner (2008)	Independent	NECG
	Shadish, Galindo, Wong, Steiner, and Cook (2011)	Independent	RDD
	Somers, Zhu, Jacob, and Bloom (2013)	Dependent	ITS
	St. Clair, Cook, and Hallberg (2014)	Dependent	ITS
	St. Clair, Hallberg, and Cook (2016)	Dependent	ITS
	Steiner, Cook, Li, and Clark (2015)	Independent	NECG
	Steiner, Cook, Shadish, and Clark (2010)	Independent	NECG
	Steiner, Cook, and Shadish (2011)	Independent	NECG
	Tang, Cook, Kisbu-Sakarya, Hock, and Chiang (2017)	Dependent	RDD
	Wilde and Hollister (2007)	Dependent	NECG
	Zhou and Xie (2016)	Dependent	NECG
Environment	Ferraro and Miranda (2014)	Dependent	NECG/ITS
	Wichman and Ferraro (2017)	Dependent	ITS
Job training	Bell, Orr, Blomquist, and Cain (1994)	Dependent	NECG
	Black, Galdo, and Smith (2007)	Dependent	RDD
	Bloom, Michalopoulos, and Hill (2005)	Dependent	NECG/ITS
	Bloom, Michalopoulos, Hill, and Lei (2002)	Dependent	NECG/ITS
	Dehejia and Wahba (2002)	Dependent	NECG
	Fraker and Maynard (1987)	Dependent	NECG/ITS
	Friedlander and Robins (1995)	Dependent	ITS
	Heckman and Hotz (1989)	Dependent	NECG/ITS
	Heckman, Ichimura, Smith, and Todd (1998)	Dependent	NECG/ITS
	Heckman, Ichimura, and Todd (1997)	Dependent	NECG/ITS
	LaLonde (1986)	Dependent	NECG/ITS
	Lee (2006)	Dependent	NECG
	Michalopoulos, Bloom, and Hill (2004)	Dependent	NECG/ITS
	Olsen and Decker (2001)	Dependent	NECG
	Peikes, Moreno, and Orzol (2008)	Dependent	NECG
	Smith and Todd (2005)	Dependent	NECG
Health	Anglin, Miller-Bains, Wong, and Wing (2018)	Dependent	ITS
	Bratberg, Grasdal, and Risa (2002)	Dependent	NECG/ITS
	Fretheim et al. (2015)	Dependent	ITS
	Hill, Reiter, and Zanutto (2005)	Dependent	NECG
	Schneeweiss, Maclure, Carleton, Glynn, and Avorn (2004)	Dependent	ITS
	Steventon, Grieve, and Sekhon (2015)	Dependent	NECG
Immigration	McKenzie, Stillman, and Gibson (2010)	Dependent	NECG/IV/ITS
Political science	Arceneaux, Gerber, and Green (2010)	Dependent	NECG
	Green, Leong, Kern, Gerber, and Larimer (2009)	Dependent	RDD
	Keele and Titiunik (2015)	Dependent	RDD

Note. This list includes all known WSCs including working papers and paper presentations (where an unpublished version of the study is unavailable). We do not include simulation studies or four-arm designs where the study is intended to estimate the effect of randomization or preference rather than the performance of the NE method. The WSC design column notes whether the researchers used an independent or dependent-arm design (or both if more than one study was conducted). The NE design refers to the primary research design. Note that we group all time-series designs (including comparative interrupted time series and difference-in-differences) under the ITS label. Where authors combine NE designs, we note the primary design that is tested. NECG = nonequivalent comparison group; RDD = regression-discontinuity design; ITS = interrupted time series; IV = instrumental variables; WSC = within-study comparison; NE = nonexperimental.

Results from early WSCs had a profound influence on research practice and priorities in program and policy evaluation (see WSC studies under the heading “Job training” in Table 1). These studies reified a clear preference in methodology choice for government funding agencies and evaluation policy: RCTs whenever possible, RDDs when RCTs are not feasible, and finally, if at all, observational approaches such as matching or regression adjustment (see What Works Clearinghouse Evidence Standards, 2008a, 2008b, 2011). The Office of Management and Budget (2005) cited results from early WSCs in its 2004 recommendation that federal agencies should use RCTs for evaluating program impacts, cautioning against the use of “comparison group studies” that “often lead to erroneous conclusions” (p. 5). The U.S. Department of Education also identified random assignment as the preferred method for “scientifically based research” in the Federal Register (2005). In responding to critiques that random assignment was “not the only method capable of generating causal effects,” Rod Paige (Federal Register, 2005), the Education Secretary under the George W. Bush administration, cited WSC results, stating that “conclusions about causality based on other methods, including the quasi-experimental designs included in this priority, have been shown to be misleading compared with experimental evidence” (p. 3588).

Despite the importance of WSCs in providing researchers, funders, and decision makers with guidance about NE methods’ performance in practice and designing valid program evaluations, a number of questions about the best ways to implement and analyze the WSC itself remain. For example, what are the requirements for a WSC design to yield interpretable results, and how can researchers design a valid and reliable WSC? What criteria should researchers use to determine whether results from the NE replicate results from the RCT benchmark? And perhaps most importantly, how should we interpret results from one WSC to understand NE method performance in other contexts and settings?

In this essay, we provide a brief historical overview of WSC designs. To this end, we describe the special contributions of WSCs to the program evaluation literature and common methodological challenges that arise in the design, implementation, analysis, and interpretation of the approach. We then highlight papers that appear in this two-volume special issue of Evaluation Review. These papers add to our knowledge of NE method performance; they also address important methodological considerations in the design and analysis of WSCs. The essay concludes by considering future directions for how WSCs may be used to improve NE theory and practice.

History of WSCs

Statistical theory formulates the assumptions needed for a causal method to work. That is, theory shows when a method can yield unbiased causal effects. Simulation studies help researchers understand the statistical properties of the method under specific, well-defined conditions. Simulation studies, however, rarely capture the full complexity of real-world data and have little to say about whether a research design’s assumptions are actually met in field settings. Addressing these methodological questions requires empirical evaluations of NE methods in real-world evaluations.

Introduced by LaLonde (1986) and Fraker and Maynard (1987), the earliest WSC designs used data from job training evaluations to compare results from an NE with those from an RCT benchmark. To construct the WSCs, LaLonde and Fraker and Maynard used RCT data from the National Supported Work (NSW) Demonstration program (Manpower Demonstration Research Corporation, 1980). The NE was created by deleting RCT control cases from the NSW and replacing them with no-treatment comparisons from the Current Population Survey (CPS) or the Panel Study of Income Dynamics (PSID). The interest was methodological—to see whether econometric techniques could be used with nationally representative data sets to reproduce RCT results. But the goal was policy-driven—to discover whether there were more cost-efficient methods than RCTs for estimating program impacts.

The early WSCs examined the performance of regression, difference-in-differences, matching, and instrumental variable models. Researchers estimated NE bias by comparing NE results with those obtained from the RCT benchmark. Because the treatment group was shared across the RCT and NE arms, researchers also assessed bias by directly comparing conditional outcomes from NE comparisons and RCT controls (Bloom, Michalopoulos, & Hill, 2005; Fraker & Maynard, 1987). The general conclusion from these studies was that NE methods fail to reproduce RCT benchmark results (Fraker & Maynard, 1987; Friedlander & Robins, 1995). Fraker and Maynard (1987) summarized their findings by writing,

the results of our study indicate that NE design evaluations cannot be relied on to estimate the effectiveness of programs like Supported Work with sufficient precision (and in some cases unbiasedness) to provide policymakers with adequate information to guide decisions. (p. 196)

A decade later, Dehejia and Wahba (1999) claimed to overturn that conclusion. They reanalyzed the NSW data and concluded that propensity score matching methods did succeed in reproducing RCT benchmark results. However, Smith and Todd (2005) showed that these estimates were highly sensitive to the choice of covariates used for estimating the propensity score and the analysis sample used. Subsequent WSC results also demonstrate the importance of covariate selection in matching procedures (Steiner, Cook, Shadish, & Clark, 2010).

Heckman and colleagues (Heckman & Hotz, 1989; Heckman, Ichimura, Smith, & Todd, 1998; Heckman, Ichimura, & Todd, 1997) reanalyzed the NSW data and conducted new WSCs with RCT data from the Job Training Partnership Act (JTPA) evaluation. For the JTPA data, they constructed the NE comparison group from observational data on individuals who qualified for JTPA but chose not to participate in the intervention. Using results from WSCs, Heckman and colleagues highlighted conditions under which NE bias can be successfully addressed, at least in job training settings. NE estimates were less biased when rich covariate information was available for matching units, when comparisons were drawn from the same local labor markets, and when dependent variables were measured in the same way for all participants. They also observed that difference-in-differences estimators address selection bias better than cross-sectional estimators and that specification tests using pretreatment outcomes often succeeded in eliminating the most biased estimators. However, Heckman et al. also concluded that while these approaches often succeeded in reducing bias, there was no assurance that they reliably eliminated bias.

Two studies provided further surveys of WSC results, with similar conclusions. Glazerman, Levy, and Myers (2003) meta-analyzed 12 WSCs that used data from a series of job training experiments. Bloom, Michalopoulos, and Hill (2005) provided a qualitative summary of WSC results from early job training studies. Both reviews found that although NE approaches sometimes replicated RCT benchmark results, they often produced effects that were “dramatically different from the experimental benchmark” (p. 86). Although Glazerman et al. (2003) wrote that results from the meta-analysis did not resolve “longstanding debates about nonexperimental methods,” for many readers, the take-home message was clear: NE methods could not be trusted to produce credible causal estimates in field settings (p. 86).

Methodological Challenges With WSCs

Results from early WSCs prioritized RCTs as the main research design for program evaluation. This was especially true in fields such as education which, prior to 2001, did not have a tradition of using experiments (J. D. Angrist, 2004; Cook & Foray, 2007). However, despite the sound theoretical reasons to prefer RCTs and some types of quasi-experimental designs, results from early WSCs were also suspect in a number of ways. Incorrect conclusions about the empirical performance of NE methods could have occurred due to invalid WSC designs or the choice of an inappropriate metric for assessing NE performance. Below, we highlight five common methodological challenges (issues) that arose in the design and analysis of early WSCs.

1. Study differences between the RCT and NE: In many early WSCs, the RCT and NE differed in ways beyond the mode of treatment assignment (i.e., random assignment vs. self-selection). For example, comparison units in the CPS or PSID may have been drawn from remote locations (instead of within the same locale as treatment cases), measured at different time points, and, in some cases, may not have shared the same outcome measures. Comparison units in the NE may also have had access to different job training options from those available to control cases in the RCT. When the RCT and NE arms have extraneous study differences, it is difficult for the researcher to draw conclusions about how well the NE actually performed. Lack of correspondence in NE and RCT results could have occurred because of bias in the NE estimate or because the outcome measure was not assessed in the same way across the two study arms. It would be impossible for the researcher to tell.

2. Differences in causal estimands: WSC results were sometimes confounded by comparisons of different causal quantities from each study condition. For example, the experimental average treatment effect (ATE) may have been compared to an RD ATE at the cutoff. If treatment effects are heterogeneous among subpopulations of units, then comparing two different causal quantities may produce different effect estimates for reasons not related to bias in the NE.

3. Weak causal benchmark for evaluating NE: The RCT benchmark may have suffered from its own implementation problems in the field. Differential attrition, treatment noncompliance, or individuals trying to subvert the randomization process in the RCT may invalidate the RCT’s benchmark status; that is, the RCT may not have been implemented well enough to serve as the standard for evaluating NE performance.

4. Inappropriate metrics for assessing NE method performance: Early WSCs lacked consensus on how close RCT and NE results needed to be for the researcher to judge that the NE method succeeded in reproducing the RCT effects. Some studies compared the direction and magnitude of effects (Aiken, West, Schwalm, Carroll, & Hsuing, 1998), while others examined patterns of statistical significance (Agodini & Dynarski, 2004; Diaz & Handa, 2006), and still others observed whether estimates differed by more than some policy-relevant threshold (Glazerman et al., 2003). One challenge with these measures is that they may conclude that the NE fails to reproduce RCT results, even when the effect estimates are identical or very similar. For example, if the RCT estimate is slightly greater than 0 and the NE estimate is slightly less than 0, then comparing direction of effects may suggest lack of correspondence in results, even though the point estimates themselves may be considered equivalent. In another example, the RCT and NE point estimates may be identical, but the benchmark result is statistically insignificant while the NE result is significant. Although comparing significance patterns informs researchers about whether a policy maker would arrive at the same decision from an RCT and NE design, these measures may be less useful for assessing the performance of the NE method itself.

5. Limited generalization about NE method performance: Although results from early WSCs provided information about NE performance in job training contexts, there were questions about the extent to which these findings could be generalized to NEs with different target populations, treatments, outcomes, selection mechanisms, baseline information, and research designs.
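
The concern about differing causal estimands can be made concrete in potential outcomes notation. The symbols below are assumptions introduced here for illustration and are not taken from the studies discussed: Y_i(1) and Y_i(0) are the potential outcomes of unit i, A_i is the RD assignment variable with density f, and c is the cutoff.

```latex
\begin{align}
\tau_{\mathrm{ATE}} &= E\left[\,Y_i(1) - Y_i(0)\,\right], \\
\tau_{\mathrm{RD}}(c) &= E\left[\,Y_i(1) - Y_i(0) \mid A_i = c\,\right], \\
\tau_{\mathrm{ATE}} &= \int \tau_{\mathrm{RD}}(a)\, f(a)\, da .
\end{align}
```

The last line shows that the ATE averages the conditional effects over the whole distribution of the assignment variable, so it equals the effect at the cutoff only in special cases, such as a constant treatment effect. Under heterogeneous effects, an unbiased RD estimate and an unbiased RCT estimate of the ATE can therefore differ.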

Glazerman and colleagues (2003) acknowledged the limitations of early WSCs by writing that their “summary of findings gives only part of the picture, and it does so for a specific area of program evaluation research: the impacts of job training and welfare programs on participant earnings” (p. 87). Taken together, these concerns suggested that more WSCs were needed in different field settings, and that these WSCs needed to be of higher methodological quality to support valid conclusions about NE methods’ ability to estimate causal effects in practice.

WSC Methodological Innovations

Since the Glazerman et al. (2003) review, researchers have introduced WSC design innovations to address the five methodological limitations in the earlier numbered list. To reduce study differences in the RCT and NE (Issue 1 from above), researchers drew NE comparison units from the same target population as in the RCT. Bloom et al. (2005) used RCT data from the multistate, multisite National Evaluation of Welfare-to-Work Strategies (NEWWS) to construct a WSC. In the RCT arm, welfare recipients were randomly assigned to job training services within sites; in the NE arm, RCT controls from other NEWWS sites (often within the same city) were used to form the comparison group. Because all participants were involved in the same study protocol, they met the same eligibility criteria, provided the same baseline and outcome information, and experienced the same macroeconomic and labor market conditions at the same time. The consistency in research protocols across both study arms reduced the threat of confounders that might otherwise explain differences in RCT and NE results.

Shadish, Clark, and Steiner (2008) introduced another WSC design variant that bolstered the interpretation of results. They ensured that the RCT and NE compared equivalent causal estimands (Issue 2) for the same target population by randomly assigning study participants into the RCT or NE arm of the WSC. Once assigned into study arms, participants in the RCT were randomly assigned again into the vocabulary or math intervention, while those in the NE were allowed to select the intervention they preferred. NE bias was computed by comparing effect estimates of the ATE across both study arms. The researchers were also able to ensure that the RCT was well implemented by analyzing baseline and fidelity measures (Issue 3). And, because the WSC was prospectively planned and took place within a controlled laboratory-like setting, the researchers were able to implement the same study procedures across the RCT and NE arms (Issue 1). This meant delivering identical, scripted treatment and control interventions in the RCT and NE studies and using the same outcome measures for assessing impacts of the interventions. Subsequent analyses found no evidence of differential attrition within the RCT or across the RCT and NE arms.

Later WSCs introduced new approaches for assessing comparability between RCT and NE results (Issue 4). These studies acknowledged that, because of sampling error, even close replications of the same RCT would not result in identical treatment effects. And although most studies assessed comparability by examining statistical significance patterns between the RCT and NE, some began using direct statistical tests of difference between RCT and NE results. Other new methods for assessing correspondence included looking at the percentage of bias reduced from the initial naive comparison (Shadish et al., 2008), percent difference in the RCT and NE estimate (Wilde & Hollister, 2007), the mean squared error (Wing & Cook, 2013), the effect size differences between RCT and NE results (Hallberg, Wong, & Cook, 2016), or the relative performance of different NE approaches across multiple bootstrap replications (Hallberg, Wong, & Cook, 2016). Bell and Orr used a Bayesian framework to compute the probability of an incorrect policy decision for different magnitudes of true effect sizes (Solari, Nisar, Bell, & Orr, 2017). All of these approaches have their advantages and limitations. However, the lack of consensus in the WSC literature on how correspondence should be assessed has led to ambiguity and challenges in synthesizing the literature.
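
Several of the descriptive measures listed above reduce to simple arithmetic on the point estimates. The following is a minimal sketch, not a reconstruction of any cited study's code: the function name, argument names, and the choice to express bias as the NE estimate minus the RCT estimate are all assumptions made here for illustration.

```python
def descriptive_correspondence(rct_est, ne_est, naive_est, outcome_sd):
    """Descriptive measures of how closely an NE estimate matches an RCT benchmark.

    rct_est    -- benchmark effect estimate from the RCT
    ne_est     -- adjusted effect estimate from the NE
    naive_est  -- unadjusted NE estimate (the initial naive comparison)
    outcome_sd -- standard deviation of the outcome, used to standardize
    All names are hypothetical and chosen for this illustration.
    """
    if rct_est == 0 or outcome_sd == 0:
        raise ValueError("rct_est and outcome_sd must be nonzero")
    bias = ne_est - rct_est              # remaining bias after adjustment
    naive_bias = naive_est - rct_est     # bias before any adjustment
    if naive_bias == 0:
        raise ValueError("naive estimate already equals the benchmark")
    return {
        "bias": bias,
        # share of the initial bias removed by the NE adjustment
        "percent_bias_reduced": 100 * (1 - abs(bias) / abs(naive_bias)),
        # remaining bias relative to the size of the benchmark effect
        "percent_difference": 100 * bias / rct_est,
        # remaining bias in standard deviation units
        "effect_size_difference": bias / outcome_sd,
    }


summary = descriptive_correspondence(
    rct_est=4.0, ne_est=5.0, naive_est=9.0, outcome_sd=10.0
)
print(summary)
```

None of these quantities accounts for sampling error in either estimate, which is one reason they can point to different conclusions than a formal statistical test.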

Finally, a common critique of WSC evaluations concerns their generalizability. Researchers want to know how well results from one study setting apply to NE method performance in other contexts, with different outcomes and treatment selection mechanisms (Issue 5). Although this issue is not unique to WSCs—the same concern arises in RCT evaluations—results from a single WSC study have little to say about general method performance. But results from multiple WSCs may provide insights as to how well these methods perform for similar outcomes and settings of particular interest.

Over the years, researchers have conducted qualitative and quantitative summaries of WSC results with the goal of providing advice for better NE practice. Some summaries have focused on observational method performance in particular disciplines or fields, with a narrowly defined set of outcomes. Glazerman et al. (2003) and Bloom et al. (2005) reviewed WSC results in the job training literature, where the outcome of interest was participants’ annual earnings. Both reviews confirmed Heckman et al.’s findings that NE methods produced less biased estimates when comparison groups were local, when covariate sets were rich and included pretest measures, and when researchers combined multiple design features (e.g., difference-in-differences with matching) for estimating effects.

Wong, Valentine, and Miller-Bains (2017) examined results from 12 WSCs in education settings with standardized reading or math outcomes. Their goal was to assess performance of common covariate types used in observational studies in education. As in the job training literature, Wong et al. found that the pretest often removed a major portion of the bias but did not always eliminate it. However, matching units from similar geographic locales did not provide the same benefit within education contexts as it did in job training settings. This was likely because the selection process into education interventions varied across settings, as did the definition of “local” comparisons in these evaluations. Wong et al. also noted that when rich covariate sets were available, NE methods replicated RCT benchmark estimates more closely in educational contexts, but they cautioned that further replications are needed in this area.

Other summaries have reviewed WSC results from multiple disciplines to assess method performance more generally. Cook, Shadish, and Wong (2008) looked at 12 WSCs from 2002 to 2007 that spanned the fields of education, international development, and public health. The authors observed three conditions under which the NE method appeared to remove all or at least a major part of the bias. The first condition was when treatment and comparison units were assigned to treatment conditions based on an assignment variable and a cutoff, as in the RDD. In a more recent review, Chaplin et al. (2018) meta-analyzed results from 15 WSCs looking at RD performance across various fields. They found that the average NE bias was small, less than 0.01 SDs, providing further evidence for Cook et al.’s hypothesis.

Cook, Shadish, and Wong’s second and third conditions describe contexts under which NE methods appeared to remove most if not all the bias. Those contexts include when the selection process was known and observed by the researcher, as in students’ selection into a math or vocabulary intervention in the Shadish et al. WSC described above, or when “intact groups” (e.g., schools, villages) were matched using rich covariate information, or within the same geographic area. However, these results have yet to be confirmed by more recent WSCs, so more research is needed in this area.

This Special Issue

This two-volume special issue of Evaluation Review contributes to the WSC literature in two distinct ways. First, the February issue presents four additional case-study evaluations of NE method performance in educational contexts. Gleason, Resch, and Berk (2018) examine parametric and nonparametric method performance in an RDD. The authors use RCT data from evaluations of Ed Tech and Teach for America to construct RD designs synthetically. They created the RD by selecting a hypothetical cutoff on a baseline covariate and systematically deleting RCT treatment or comparison observations above and below the designated cutoff. A useful innovation of this article is that the authors replicated their RCT results across multiple data sets, as well as multiple cutoffs within each data set, and pooled their results through a systematic meta-analysis. Dong and Lipsey (2018) assess covariate performance in an observational study within the context of early childhood education (ECE). This is one of the few studies in the WSC literature that examines covariate performance in an ECE setting with outcomes of students’ emerging literacy and math skills. They also looked at the performance of different matching estimators when comparisons were drawn from within and across states. Kisbu-Sakarya, Cook, Tang, and Clark (2018) also examined NE method performance in the context of ECE, but their WSC evaluates the performance of a comparative RD (CRD) design against an RCT benchmark from the Head Start Impact study. Finally, Tang and Cook (2018) show the benefits of the CRD design by comparing the statistical precision of CRD results with RD and RCT results from the Head Start Impact study.
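
The synthetic RD construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used by Gleason, Resch, and Berk: the data are simulated, the field names `score` and `treated` are hypothetical, and the choice to keep treated units at or above the cutoff (rather than below it) is an assumption controlled by the `treat_above` flag.

```python
import random


def make_synthetic_rd(units, cutoff, treat_above=True):
    """Turn RCT data into a sharp RD sample by deleting observations.

    Keeps treated units on the "treatment" side of the cutoff and control
    units on the other side, so that treatment status in the retained
    sample is fully determined by the baseline score.
    """
    kept = []
    for unit in units:
        above = unit["score"] >= cutoff
        assigned_by_cutoff = above if treat_above else not above
        if unit["treated"] == assigned_by_cutoff:
            kept.append(unit)
    return kept


# Simulated RCT: a baseline score and a coin-flip treatment assignment.
random.seed(0)
rct = [
    {"score": random.gauss(50, 10), "treated": random.random() < 0.5}
    for _ in range(200)
]

rd_sample = make_synthetic_rd(rct, cutoff=50.0)
print(len(rct), "RCT units;", len(rd_sample), "retained in the synthetic RD")
```

Because the deletion rule depends only on a baseline covariate and the randomized assignment, the full RCT remains available as the benchmark, and the exercise can be repeated at other cutoffs.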

The April issue includes a series of methodological papers that seek to improve the design and analysis of the WSC approach itself. To this end, Wong and Steiner (2018) formalize the WSC design using a potential outcomes framework. They explicate the required design components and assumptions needed for the approach to yield a valid interpretation of NE method performance. This article also describes three different design variants for evaluating NE methods, and the benefits and limitations of each approach. Steiner and Wong (2018) next address the issue of how one should assess correspondence between RCT and NE results. That is, they address the question first posed by Wilde and Hollister (2007) of “how close is close enough” for the NE to have successfully replicated benchmark results? Through a series of simulation studies, the authors demonstrate the benefits and limitations of common criteria for assessing correspondence in RCT benchmark and NE results, and propose a new framework for assessing NE method performance: the correspondence test, which incorporates both frequentist tests of difference and equivalence in the same framework. Rindskopf, Shadish, and Clark (2018) propose an alternative criterion for assessing correspondence between RCT and NE results using a Bayesian approach. Their method involves calculating the probability that the absolute value of the difference between the RCT and NE result is less than some threshold determined to be close enough to 0. They argue that the Bayesian criteria improve the power of WSCs by allowing for the incorporation of prior information into the analysis and provide more varied, nuanced, and informative answers to questions of correspondence.
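
To make the idea of combining difference and equivalence tests concrete, the sketch below pairs a two-sided test of difference with two one-sided tests of equivalence, and also computes a simple probability that the difference lies within a tolerance. It is not the implementation from either article. It assumes independent RCT and NE estimates, normal sampling distributions, a flat prior for the Bayesian-style quantity, and a user-chosen tolerance `threshold`; the function name and the four verdict labels are illustrative choices made here.

```python
import math
from statistics import NormalDist


def correspondence_sketch(rct_est, rct_se, ne_est, ne_se, threshold, alpha=0.05):
    """Difference test, equivalence test, and P(|difference| < threshold)."""
    diff = ne_est - rct_est
    # Assumes the two estimates are independent (no shared units).
    se = math.sqrt(rct_se ** 2 + ne_se ** 2)
    z = NormalDist()  # standard normal

    # Test of difference. H0: true difference = 0.
    p_diff = 2 * (1 - z.cdf(abs(diff) / se))

    # Test of equivalence (two one-sided tests). H0: |true difference| >= threshold.
    p_lower = 1 - z.cdf((diff + threshold) / se)  # H0: difference <= -threshold
    p_upper = z.cdf((diff - threshold) / se)      # H0: difference >= +threshold
    p_equiv = max(p_lower, p_upper)

    # Normal approximation with a flat prior: difference ~ N(diff, se^2).
    posterior = NormalDist(mu=diff, sigma=se)
    prob_close = posterior.cdf(threshold) - posterior.cdf(-threshold)

    different = p_diff < alpha
    equivalent = p_equiv < alpha
    if equivalent and not different:
        verdict = "equivalence"
    elif different and not equivalent:
        verdict = "difference"
    elif different and equivalent:
        verdict = "trivial difference"
    else:
        verdict = "indeterminate"
    return {"p_diff": p_diff, "p_equiv": p_equiv,
            "prob_close": prob_close, "verdict": verdict}


precise = correspondence_sketch(0.20, 0.02, 0.21, 0.02, threshold=0.10)
noisy = correspondence_sketch(0.20, 0.10, 0.25, 0.10, threshold=0.10)
print(precise["verdict"], round(precise["prob_close"], 3))
print(noisy["verdict"], round(noisy["prob_close"], 3))
```

The second call illustrates why the two tests are useful together: with large standard errors, neither difference nor equivalence can be established, which is a different finding from agreement between the two estimates.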

New Frontiers for WSC Approaches

Although the WSC literature has made strong advances since the early job training studies, our reading of the literature suggests four emerging areas for improving the design, analysis, and practice of NE evaluations:

Issue 1: Establish Research Protocols for the Design and Analysis of WSC Results

One issue with the implementation of WSCs is that knowledge of the benchmark result may inadvertently skew the many decisions researchers must make in the analysis of the NE. For example, in observational studies, the researcher has choices about covariate selection for estimating the propensity score (Smith & Todd, 2005) and about the type of estimator used to produce treatment effects (e.g., matching, stratification, or doubly robust estimators). Cook et al. (2008) recommend that two independent research teams should analyze the benchmark and NE separately and that the analysts of the NE should be blinded to the benchmark results. This is generally good practice, but it may not be specific enough to be feasible. Research teams may wish to coordinate which causal estimands they will compare, and the analytic models they will use to estimate treatment effects (e.g., should the RCT and NE treatment effects be estimated using regression-adjusted [doubly robust] models or not?).

In future implementations of WSCs, research teams should establish and describe a protocol in advance of data collection or analysis. Developing a WSC research protocol is similar to preregistration of research plans for RCTs or meta-analyses. One benefit of a WSC protocol is that it would provide prespecified guidance to researchers on questions that naturally arise in the design and analysis of WSCs. In cases where the NE and RCT are analyzed by independent teams of researchers, developing a research protocol can provide opportunities for investigators to come to a common understanding of the study plan. The research protocol could also allow for WSC researchers to obtain feedback and advice on their data collection and analysis plans, prior to revealing any results.

Generally, the WSC protocol should address the following topics: (1) confirmatory versus exploratory research questions in the WSC context, (2) diagnostics for assessing assumptions of the WSC design, (3) potential deviations from the intended research protocol, and (4) criteria for determining correspondence in results. The protocol should recommend that analysts of the RCT and NE document all analysis procedures; it should also provide a place for the researchers to document any problems or questions that arise, and how these questions were resolved. Finally, the protocol should provide guidance on when it is appropriate for RCT and NE analysts to consult with each other, and when their analysis should be conducted independently.

Issue 2: Consider Statistical Power for WSC Designs

Another critical issue in the planning of WSCs is ensuring that the design has sufficient statistical power for detecting comparability in treatment effects between the RCT and NE. In fact, WSCs usually have much greater power requirements than do the RCT or NE for detecting impacts. To understand why WSCs usually require larger samples, consider a scenario where the criterion for assessing correspondence in RCT and NE effects is whether the two study conditions produce the same result in a null hypothesis test of the treatment effect. In other words, do the RCT and NE lead to the same conclusion about the presence of a treatment effect? In an independent WSC design (i.e., units were randomly assigned into RCT and NE conditions), the probability of obtaining the same test result in both study conditions depends on the statistical power in the RCT and NE. Here, a well-powered RCT and NE, each with a statistical power of .8 to detect the true but unknown effect, produce the same pattern of statistical significance with a probability of only .68 (= .8 × .8 + .2 × .2, i.e., the probability of obtaining a significant effect estimate in both studies plus the probability of obtaining a nonsignificant result in both studies). When, as is not uncommon, the RCT and NE are both underpowered for detecting significant effects (e.g., both having a power of .2), the probability of obtaining corresponding significance patterns is again .68 (= .2 × .2 + .8 × .8). But now correspondence is most likely due to obtaining nonsignificant (.8 × .8) rather than significant (.2 × .2) effect estimates in both studies. Thus, when there is no significant treatment effect for the NE and RCT, the researcher may incorrectly conclude that the NE lacks bias, when in fact both study conditions may simply be underpowered for detecting effects.
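The arithmetic in this example can be reproduced with a short Python sketch. It assumes, as in the independent WSC design described above, that the significance tests in the RCT and NE are statistically independent and that a true nonzero effect exists; the function name is hypothetical.

```python
def agreement_probability(power_rct, power_ne):
    """Probability that two independent studies reach the same significance verdict.

    Returns (both significant, both nonsignificant, total agreement).
    """
    both_significant = power_rct * power_ne
    both_nonsignificant = (1 - power_rct) * (1 - power_ne)
    return both_significant, both_nonsignificant, both_significant + both_nonsignificant


for power in (0.8, 0.2):
    sig, nonsig, total = agreement_probability(power, power)
    print(f"power={power:.1f}: both significant={sig:.2f}, "
          f"both nonsignificant={nonsig:.2f}, agreement={total:.2f}")
```

The total agreement probability is the same in both scenarios, but its composition differs: with well-powered studies, agreement arises mostly from two significant results, whereas with underpowered studies it arises mostly from two nonsignificant results.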

Future WSCs should consider statistical power for assessing comparability of results in the design phase of the evaluation. Three papers in the April issue provide guidance on statistical power. Wong and Steiner (2018) show that WSC design variants (e.g., WSCs with independent vs. dependent data structures in the RCT and NE arms) have different statistical power for assessing correspondence in results, and Steiner and Wong (2018) suggest a method for assessing statistical power in the design phase through the correspondence framework. Rindskopf, Shadish, and Clark (2018) suggest that Bayesian approaches for assessing correspondence of RCT and NE results have improved statistical power over frequentist approaches.

Issue 3: Continue to Explore the External Validity of WSC Results

The existing WSCs represent a heterogeneous mix of studies from different disciplines, research designs, and outcomes. Currently, the authors have identified more than 70 WSC studies (see Table 1). These studies include substantial variation in contexts, NE methods examined, as well as outcomes and treatment selection mechanisms. As more studies continue to be added to the literature, ongoing quantitative synthesis of results can provide important descriptive information about NE method performance in field settings, and the contexts and conditions under which these methods may perform well. Meta-analysis of WSC results may also address an important challenge that many stand-alone studies face—lack of statistical power for assessing correspondence in results.

However, we note that a rigorous synthesis of WSC results also requires more systematic reporting of study procedures and outcomes as well as consistent criteria for assessing correspondence in results. For example, it would be useful for WSC analysts to report estimates of NE bias and the standard error of their bias estimates. Moreover, in WSC designs where units are shared between the RCT and NE arm, the standard errors should account for dependencies in the data structure (see discussion by Steiner and Wong). In addition, because the direction and strength of the selection processes in the NE vary across WSC studies, analysts should always report the initial, unadjusted selection bias (i.e., the difference between the unadjusted NE estimate and the RCT estimate). This allows for an assessment of the sign and magnitude of the selection bias before making any statistical adjustments.
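As a minimal illustration of this reporting recommendation, the Python sketch below computes an NE bias estimate and its standard error. It assumes that bias is defined as the NE estimate minus the RCT estimate, and that the covariance between the two estimates is zero for independent arms and is supplied by the analyst (e.g., from a bootstrap) when units are shared. The function name, arguments, and numbers are hypothetical and are not taken from Steiner and Wong.

```python
import math


def ne_bias(est_ne, se_ne, est_rct, se_rct, cov=0.0):
    """Return (bias, standard error) for an NE estimate against an RCT benchmark.

    `cov` is the covariance between the two estimates: 0 for independent arms,
    nonzero when the RCT and NE arms share units (assumed known or estimated).
    """
    bias = est_ne - est_rct
    variance = se_ne ** 2 + se_rct ** 2 - 2 * cov
    if variance < 0:
        raise ValueError("Implied variance of the bias estimate is negative.")
    return bias, math.sqrt(variance)


# Hypothetical values: unadjusted and adjusted NE estimates versus the benchmark.
print(ne_bias(0.40, 0.04, 0.20, 0.03))              # initial, unadjusted selection bias
print(ne_bias(0.25, 0.04, 0.20, 0.03))              # bias after statistical adjustment
print(ne_bias(0.25, 0.04, 0.20, 0.03, cov=0.0006))  # shared units, positive covariance
```

Reporting both the unadjusted and adjusted bias, each with its standard error, gives later syntheses the quantities they need to compare bias reduction across studies.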

Meta-analysis of WSC results has tremendous promise in revealing new insights about good NE practice. However, given the heterogeneity of WSCs in terms of study designs, samples, outcomes, and selection processes, a rigorous meta-analysis should synthesize or pool results only when substantively or theoretically appropriate. To this end, WSC analysts should document and report study procedures and contextual factors that may be related to NE bias.

Issue 4: Using RCT Benchmark Results for Examining Treatment Effect Variation and Generalization

Recently, researchers have applied WSC designs to address research questions of programmatic and policy relevance. For example, an RCT benchmark may be used to validate an NE model that is then used to estimate treatment effects for a more general target population of interest. This method has been applied to generalize treatment effects across different units (J. D. Angrist & Rokkanen, 2015; Wing & Clark, 2016), treatments (Bell, Harvill, Moulton, & Peck, 2017; Hotz, Imbens, & Mortimer, 2005), and settings (Abdulkadiroğlu, Angrist, Dynarski, Kane, & Pathak, 2011).

For example, Abdulkadiroğlu, Angrist, Dynarski, Kane, and Pathak (2011) used a WSC design to assess the external validity of treatment effects from Boston charter and pilot schools with admission lotteries to schools without such lotteries. The RCT consisted of lottery students in oversubscribed charter/pilot schools; the NE consisted of lottery winners as well as noncharter/pilot students in Boston public schools. To estimate NE treatment effects, the authors used regression models that controlled for student demographic characteristics and baseline scores.

The authors constructed a series of WSCs for subsamples of charter and pilot schools and for elementary and secondary grades. In cases where the WSC NE and RCT produced corresponding effects, the researchers concluded that the NE model was sufficient for addressing selection bias in an observational study of nonlottery charter/pilot schools and Boston public schools. The assumption here was that the selection process into charter/pilot schools with lotteries could be generalized to schools without lotteries. However, when the WSC NE failed to reproduce RCT benchmark results, the authors concluded that the NE model could not be used to estimate observational treatment effects. Overall, Abdulkadiroğlu et al. observed close correspondence in RCT and NE results for charter school students, and for middle school students with pilot programs. In assessing the external validity of the charter school lottery results, they found that although charter schools without lotteries produced positive and significant effects, they were smaller than effects observed from oversubscribed charter schools. The authors also found that the WSC NE model did not perform well for a subsample of high schools with pilot programs. As a result, they did not use the NE model to assess the external validity of treatment effects for this subsample of schools.

In a second example, Hotz, Imbens, and Klerman (2006) used a WSC design to examine treatment effect variation due to differences in program components. The researchers used RCT data from the Greater Avenues to Independence Program evaluation, where participants in six California counties were randomly assigned to receive job training services or to be in a control group that was denied services. Because of the local nature of treatment implementation, some county programs provided participants with general education and skills development, while other sites encouraged participants to secure immediate employment.

A goal of the evaluation was to assess treatment effect variation due to differential program components. However, because participants were not randomly assigned to sites, researchers were concerned that observed treatment effect variation may have been confounded with participants’ characteristics. To address this issue, the authors constructed a WSC using RCT control group members’ outcomes. Their goal was to examine whether NE methods and observed participant characteristics could address units’ selection into sites. In places where the NE method succeeded in producing conditionally equivalent control groups, the researchers felt assured that the NE approach could be used to produce valid effect estimates of program components.

These examples illustrate how WSCs may be used to probe NE assumptions empirically. They also show how WSCs may be used to signal when NE assumptions are not well warranted in field settings. As researchers continue to use RCT and NE data to “learn more” from program and policy evaluations, WSCs provide an important method for validating NE assumptions, and for generalizing and uncovering differential treatment effects.

Conclusion

Because of increased availability of RCT data, there are now empirical evaluations of NE methods in job training, education, early childhood development, political science, international development, and public health. WSCs have also been used to evaluate more types of quasi-experimental approaches including the RDD (see Cook and Wong, 2008, for review) and, most recently, the interrupted time-series design (St. Clair, Cook, & Hallberg, 2014; St. Clair, Hallberg, & Cook, 2016). As the number of WSCs in varying contexts increases, so does the opportunity for synthesizing the literature for greater insight and external validity.

Results from WSC evaluations have had important impacts on both research practice and funding priorities in program evaluation. In most areas of the social sciences, an RCT is the preferred method for establishing causal inferences. However, WSCs have shown specific contexts and conditions where NE methods succeed in removing most, if not all, of the bias. Methodological advances in WSC designs, like those presented in this special issue, will continue to improve our understanding of NE practice. As the program evaluation field turns to important policy-relevant questions such as “When, where, for whom, and why does it work?” WSCs may again be instrumental in improving methodology and validating research design assumptions in field settings.

Footnotes

Authors’ Note

The opinions expressed are those of the authors and do not represent views of NSF.

Acknowledgment

We dedicate this two-volume special issue on within-study comparison designs to our mentor and friend, William R. Shadish. We had the honor to work with and learn from Will on the design, implementation, and analysis of several within-study comparisons. The April issue of Evaluation Review includes one of Will’s last papers, coauthored with David Rindskopf. Will, we miss you and think of you often.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant #R305B140026 and a collaborative research grant from the National Science Foundation, through Grant #2015-0285-00 to the Rectors and Visitors of the University of Virginia. The opinions expressed are those of the authors and do not represent views of the Institute, the U.S. Department of Education, or the National Science Foundation.

References

Abdulkadiroğlu, Angrist, J. D., Dynarski, S. M., Kane, T. J., & Pathak, P. A. (2011). Accountability and flexibility in public schools: Evidence from Boston’s charters and pilots. Quarterly Journal of Economics, 126, 699–748. doi:10.1093/qje/qjr017

Agodini, & Dynarski (2004). Are experiments the only option? A look at dropout prevention programs. Review of Economics and Statistics, 86, 180–194. doi:10.1162/003465304323023741

Aiken, L. S., West, S. G., Schwalm, D. E., Carroll, J. L., & Hsiung (1998). Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation. Evaluation Review, 22, 207–244. doi:10.1177/0193841X9802200203

Anderson, K. P., & Wolf, P. J. (2017). Evaluating school vouchers: Evidence from a within-study comparison (EDRE No. 2017–10). SSRN Electronic Journal. doi:10.2139/ssrn.2952967

Anglin, Miller-Bains, Wong, V. C., & Wing (2018). Methods of reducing bias in time series designs: A within study comparison. In Society for Research on Educational Effectiveness. Washington, DC. Retrieved from https://www.sree.org/conferences/2018s/program/

Angrist, Autor, Hudson, & Pallais (2015). Evaluating econometric evaluations of post-secondary aid. American Economic Review, 105, 502–507. doi:10.1257/aer.p20151025

Angrist, & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton, NJ: Princeton University Press.

Angrist, J. D. (2004). American education research changes tack. Oxford Review of Economic Policy, 20, 198–212. doi:10.1093/oxrep/grh011

Angrist, J. D., & Rokkanen (2015). Wanna get away? Regression discontinuity estimation of exam school effects away from the cutoff. Journal of the American Statistical Association, 110, 1331–1344. doi:10.1080/01621459.2015.1012259

Arceneaux, Gerber, A. S., & Green, D. P. (2010). A cautionary note on the use of matching to estimate causal effects: An empirical example comparing matching estimates to an experimental benchmark. Sociological Methods & Research, 39, 256–282. doi:10.1177/0049124110378098

Ashworth, K. E., & Pullen, P. C. (2015). Comparing regression discontinuity and multivariate analyses of variance: Examining the effects of a vocabulary intervention for students at risk for reading disability. Learning Disability Quarterly, 38, 131–144. doi:10.1177/0731948714555020

Barrera-Osorio, Filmer, & McIntyre (2014). Randomized controlled trials and regression discontinuity estimations: An empirical comparison. In Society for Research on Educational Effectiveness. Retrieved from https://www.sree.org/conferences/2014s/program/

Bell, Harvill, Moulton, & Peck (2017). Using within-site experimental evidence to reduce cross-site attributional bias in connecting program components to program impacts using within-site experimental evidence. OPRE Report #2017-13. Washington, DC: Office of Planning, Research and Evaluation, Administration for Children and Families, U.S. Department of Health and Human Services.

Bell, Orr, Blomquist, & Cain, G. G. (1994). Program applicants as a comparison group in evaluating training programs: Theory and a test. Kalamazoo, MI: Upjohn Institute for Employment Research. doi:10.17848/9780585284545

Bifulco (2012). Can nonexperimental estimates replicate estimates based on random assignment in evaluations of school choice? A within-study comparison. Journal of Policy Analysis and Management, 31, 729–751. doi:10.1002/pam.20637

Black, Galdo, & Smith (2007). Evaluating the bias of the regression discontinuity design using experimental data. Unpublished working paper.

Bloom, Michalopoulos, & Hill (2005). Using experiments to assess nonexperimental comparison-group methods for measuring program effects. In Learning more from social experiments: Evolving analytic approaches (pp. 173–235).

Bloom, Michalopoulos, Hill, & Lei (2002). Can nonexperimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? (MDRC Working Papers on Research Methodology). Retrieved from https://www.mdrc.org/sites/default/files/full_43.pdf

Bratberg, Grasdal, & Risa, A. E. (2002). Evaluating social policy by experimental and nonexperimental methods. Scandinavian Journal of Economics, 104, 147–171. doi:10.1111/1467-9442.00276

Buddelmeyer, & Skoufias (2004). Evaluation of the performance of regression discontinuity design on PROGRESA (World Bank policy research working paper). Washington, DC: World Bank. Retrieved from https://search.lib.virginia.edu/catalog/u6398806

Chaplin, D. D., Cook, Zurovac, Coopersmith, Finucane, M. M., Vollmer, L. N., & Morris (2018). The internal and external validity of the regression discontinuity design: A meta-analysis of 15 within-study-comparisons. Journal of Policy Analysis and Management, 37. doi:10.1002/pam.22051

Cook, T. D., & Foray (2007). Building the capacity to experiment in schools: A case study of the Institute of Educational Sciences in the US Department of Education. Economics of Innovation and New Technology, 16, 385–402. doi:10.1080/10438590600982475

Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27, 724–750. doi:10.1002/pam

Cook, T. D., & Wong, V. C. (2008). Empirical tests of the validity of the regression discontinuity design. Annales d’Economie et de Statistique, 127–150. doi:10.2307/27917242

Dehejia, R. H., & Wahba (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94, 1053–1062.

Dehejia, R. H., & Wahba (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics, 84, 151–161. doi:10.1162/003465302317331982

Diaz, J. J., & Handa (2006). An assessment of propensity score matching as a nonexperimental impact estimator. Journal of Human Resources, XLI, 319–345. doi:10.3368/jhr.XLI.2.319

Dong, & Lipsey, M. W. (2018). Can propensity score analysis approximate randomized experiments using pretest and demographic information in pre-K intervention research? Evaluation Review, 42(1), 34–70.

Federal Register. (2005). Scientifically based evaluation methods. Federal Register, 70, 3586–3589.

Ferraro, P. J., & Miranda, J. J. (2014). The performance of non-experimental designs in the evaluation of environmental programs: A design-replication study using a large-scale randomized experiment as a benchmark. Journal of Economic Behavior & Organization, 107, 344–365. doi:10.1016/j.jebo.2014.03.008

Fortson, Gleason, Kopa, & Verbitsky-Savitz (2015). Horseshoes, hand grenades, and treatment effects? Reassessing whether nonexperimental estimators are biased. Economics of Education Review, 44, 100–113. doi:10.1016/j.econedurev.2014.11.001

Fortson, Verbitsky-Savitz, Kopa, & Gleason (2012). Using an experimental evaluation of charter schools to test whether nonexperimental comparison group methods can replicate experimental impact estimates. National Center for Education Evaluation and Regional Assistance. Retrieved from http://www.eric.ed.gov/ERICWebPortal/recordDetail?accno=ED531481

Fraker, & Maynard (1987). The adequacy of comparison group designs for evaluations of employment-related programs. The Journal of Human Resources, 22. doi:10.2307/145902

Fretheim, Zhang, Ross-Degnan, Oxman, A. D., Cheyne, Foy, R., … Soumerai, S. B. (2015). A reanalysis of cluster randomized trials showed interrupted time-series studies were valuable in health system evaluation. Journal of Clinical Epidemiology, 68, 324–333. doi:10.1016/j.jclinepi.2014.10.003

Friedlander, & Robins, P. K. (1995). Evaluating program evaluations: New evidence on commonly used nonexperimental methods. American Economic Review, 85, 923–937.

Gill, Furgeson, Chiang, Teh, Haimson, & Savitz, N. V. (2016). Replicating experimental impact estimates with nonexperimental methods in the context of control-group noncompliance. Statistics and Public Policy, 3, 1–11. doi:10.1080/2330443X.2015.1084252

Glazerman, Levy, D. M., & Myers (2003). Nonexperimental versus experimental estimates of earnings impacts. Annals of the American Academy of Political & Social Science, 589, 63–93. doi:10.1177/0002716203254879

Gleason, Resch, & Berk (2012). Replicating experimental impact estimates using a regression discontinuity approach. Retrieved from https://ies.ed.gov/ncee/pubs/20124025/

Gleason, Resch, & Berk (2018). RD or not RD: Using experimental studies to assess the performance of the regression discontinuity approach. Evaluation Review, 42(1), 3–33.

Green, D. P., Leong, T. Y., Kern, H. L., Gerber, A. S., & Larimer, C. W. (2009). Testing the accuracy of regression discontinuity analysis using experimental benchmarks. Political Analysis, 17, 400–417. doi:10.1093/pan/mpp018

Hallberg, Cook, T. D., Steiner, P. M., & Clark, M. H. (2016). Pretest measures of the study outcome and the elimination of selection bias: Evidence from three within study comparisons. Prevention Science, 1–10. doi:10.1007/s11121-016-0732-6

Hallberg, Wong, V. C., & Cook, T. D. (2016). Evaluating methods for selecting school-level comparisons in quasi-experimental designs: Results from a within-study comparison (EdPolicy Works Working Paper Series No. 47). Retrieved from https://curry.virginia.edu/uploads/resourceLibrary/47_School_Comparisons_in_Observational_Designs.pdf

Handa, & Maluccio, J. A. (2010). Matching the gold standard: Comparing experimental and nonexperimental evaluation techniques for a geographically targeted program. Economic Development and Cultural Change, 58, 415–447. doi:10.1086/650421

Heckman, J. J., & Hotz, V. J. (1989). Choosing among alternative nonexperimental methods for estimating the impact of social programs: The case of manpower training. Journal of the American Statistical Association, 84, 862. doi:10.2307/2290059

Heckman, J. J., Ichimura, Smith, & Todd (1998). Characterizing selection bias using experimental data. Econometrica, 66, 1017–1098. doi:10.2307/2999630

Heckman, J. J., Ichimura, & Todd (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. The Review of Economic Studies, 64, 605–654. doi:10.2307/2971733

Hill, J. L., Reiter, J. P., & Zanutto, E. L. (2005). A comparison of experimental and observational data analyses. In Gelman & Meng, X.-L. (Eds.), Applied Bayesian modeling and causal inference from incomplete-data perspectives: An essential journey with Donald Rubin’s statistical family (pp. 49–60). doi:10.1002/0470090456.ch5

Hotz, V. J., Imbens, G. W., & Klerman, J. A. (2006). Evaluating the differential effects of alternative welfare-to-work training components: A reanalysis of the California GAIN program. Journal of Labor Economics, 24, 521–566. doi:10.1086/505050

Hotz, V. J., Imbens, G. W., & Mortimer, J. H. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125, 241–270. doi:10.1016/j.jeconom.2004.04.009

Jaciw, A. P. (2016a). Applications of a within-study comparison approach for evaluating bias in generalized causal inferences from comparison groups studies. Evaluation Review, 40, 241–276. doi:10.1177/0193841X16664457

Jaciw, A. P. (2016b). Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach. Evaluation Review, 40, 199–240. doi:10.1177/0193841X16664456

Jacob, Somers, M.-A., Zhu, & Bloom (2016). The validity of the comparative interrupted time series design for evaluating the effect of school-level interventions. Evaluation Review, 40, 167–198. doi:10.1177/0193841X16663414

Keele, L. J., & Titiunik (2015). Geographic boundaries as regression discontinuities. Political Analysis, 23, 127–155. doi:10.1093/pan/mpu014

Kisbu-Sakarya, Cook, T. D., Tang, & Clark, M. H. (2018). Comparative regression discontinuity: A stress test with small samples. Evaluation Review, 42(1), 111–143.

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76, 604–620.

Lee, W.-S. (2006). Evaluating the effects of a mandatory government program using matched groups within a similar geographic location. SSRN Electronic Journal, 1–74. doi:10.2139/ssrn.936783

Leow, Wen, & Korfmacher (2015). Two-year versus one-year Head Start program impact: Addressing selection bias by comparing regression modeling with propensity score analysis. Applied Developmental Science, 19, 31–46. doi:10.1080/10888691.2014.977995

Lottridge, S. M., Nicewander, W. A., & Mitzel, H. C. (2011). A comparison of paper and online tests using a within-subjects design and propensity score matching study. Multivariate Behavioral Research, 46, 544–566. doi:10.1080/00273171.2011.569408

Luellen, J. K., Shadish, W. R., & Clark, M. H. (2005). Propensity scores: An introduction and experimental test. Evaluation Review, 29, 530–558. doi:10.1177/0193841X05275596

Manpower Demonstration Research Corporation. (1980). Summary and findings of the national support work demonstration. Cambridge, MA: Ballinger.

McKenzie, Stillman, & Gibson (2010). How important is selection? Experimental vs. non-experimental measures of the income gains from migration. Journal of the European Economic Association, 8, 913–945. doi:10.1111/j.1542-4774.2010.tb00544.x

Michalopoulos, Bloom, & Hill (2004). Can propensity-score methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? Review of Economics and Statistics, 86, 156–179. doi:10.1162/003465304323023732

Morgan, S. L., & Winship (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511804564

Moss, B. G., Yeaton, W. H., & Lloyd, J. E. (2014). Evaluating the effectiveness of developmental mathematics by embedding a randomized experiment within a regression discontinuity design. Educational Evaluation and Policy Analysis, 36, 170–185. doi:10.3102/0162373713504988

Mueller, C. E., & Gaus (2015). Assessing the performance of the “Counterfactual as Self-Estimated by Program Participants.” American Journal of Evaluation, 36, 7–24. doi:10.1177/1098214014538487

Office of Management and Budget. (2005). What constitutes strong evidence of a program’s effectiveness? Retrieved from https://obamawhitehouse.archives.gov/sites/default/files/omb/part/2004_program_eval.pdf

Olsen, R. B., & Decker, P. T. (2001). Testing different methods of estimating the impacts of worker profiling and reemployment services systems. Washington, DC: U.S. Department of Labor, Employment and Training Administration. Retrieved from https://search.lib.virginia.edu/catalog/u3860217

Padgett, R. D., Salisbury, M. H., An, B. P., & Pascarella, E. T. (2010). Required, practical, or unnecessary? An examination and demonstration of propensity score matching using longitudinal secondary data. New Directions for Institutional Research, 2010, 29–42. doi:10.1002/ir.370

Peikes, D. N., Moreno, & Orzol, S. M. (2008). Propensity score matching: A note of caution for evaluators of social programs. The American Statistician, 62, 222–231. doi:10.1198/000313008X332016

Pohl, Steiner, P. M., Eisermann, Soellner, & Cook, T. D. (2009). Unbiased causal inference from an observational study: Results of a within-study comparison. Educational Evaluation and Policy Analysis, 31, 463–479. doi:10.3102/0162373709343964

Rindskopf, Shadish, W. R., & Clark, M. H. (2018). Using Bayesian correspondence criteria to compare results from a randomized experiment and a quasi-experiment allowing self-selection. Evaluation Review, 42(2), 248–280.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. American Psychological Association. doi:10.1037/h0037350

Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100, 322–331. doi:10.1198/016214504000001880

Schneeweiss, Maclure, Carleton, Glynn, R. J., & Avorn (2004). Clinical and economic consequences of a reimbursement restriction of nebulised respiratory therapy in adults: Direct comparison of randomised and observational evaluations. British Medical Journal, 328, 560. doi:10.1136/bmj.38020.698194.F6

Shadish, W. R., Clark, M. H., & Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103, 1334–1344. doi:10.1198/016214508000000733

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Shadish, W. R., Galindo, Wong, V. C., Steiner, P. M., & Cook, T. D. (2011). A randomized experiment comparing random and cutoff-based assignment. Psychological Methods, 16, 179–191. doi:10.1037/a0023345

Smith, & Todd (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125, 305–353. doi:10.1016/j.jeconom.2004.04.011

Solari, Nisar, Bell, & Orr (2017). Quantifying the policy reliability of competing non-experimental methods for measuring the impacts of social programs. Association for Public Policy Analysis and Management. Retrieved from http://www.appam.org/events/fall-research-conference/events/2017fall-research-conference/

Somers, M.-A., Zhu, Jacob, & Bloom (2013). The validity and precision of the comparative interrupted time series design and the difference-in-difference design in educational evaluation (MDRC Working Paper on Research Methodology). Retrieved from http://appam.confex.com/data/extendedabstract/appam/2012/Paper_1758_extendedabstract_156_0.pdf

St. Clair, Cook, T. D., & Hallberg (2014). Examining the internal validity and statistical precision of the comparative interrupted time series design by comparison with a randomized experiment. American Journal of Evaluation, 35, 311–327. doi:10.1177/1098214014527337

St. Clair, Hallberg, & Cook, T. D. (2016). The validity and precision of the comparative interrupted time-series design. Journal of Educational and Behavioral Statistics, 41, 269–299. doi:10.3102/1076998616636854

83. Steiner, P. M., Cook, T. D., & Clark, M. H. (2015). Bias reduction in quasi-experiments with little selection theory but many covariates. Journal of Research on Educational Effectiveness, 8, 552–576. doi:10.1080/19345747.2014.978058

84. Steiner, P. M., Cook, T. D., & Shadish, W. R. (2011). On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. Journal of Educational and Behavioral Statistics, 36, 213–236. doi:10.3102/1076998610375835

85. Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15, 250–267. doi:10.1037/a0018719

86. Steiner, P. M., & Wong, V. C. (2018). Assessing correspondence between experimental and non-experimental estimates in within-study-comparisons. Evaluation Review, 42(2), 214–247.

87. Steventon, Grieve, & Sekhon, J. S. (2015). A comparison of alternative strategies for choosing control populations in observational studies. Health Services and Outcomes Research Methodology, 15, 157–181. doi:10.1007/s10742-014-0135-8

88. Tang, & Cook, T. D. (2018). Statistical power for the comparative regression discontinuity design with a pretest no-treatment control function: Theory and evidence from the National Head Start Impact Study. Evaluation Review, 42(1), 71–110.

89. Tang, Cook, T. D., Kisbu-Sakarya, Hock, & Chiang (2017). The comparative regression discontinuity (CRD) design: An overview and demonstration of its performance relative to basic RD and the randomized experiment. In Regression discontinuity designs (Vol. 38, pp. 237–279, SE–6). Emerald. doi:10.1108/S0731-905320170000038011

90. What Works Clearinghouse. (2008a). What works clearinghouse evidence standards for reviewing studies (Version 1.0) [Computer software]. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_version1_standards.pdf

91. What Works Clearinghouse. (2008b). What works clearinghouse procedures and standards handbook (Version 2.0) [Computer software]. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_procedures_v2_standards_handbook.pdf

92. What Works Clearinghouse. (2011). What Works Clearinghouse™ procedures and standards handbook (Version 3.0) [Computer software]. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_procedures_v3_0_draft_standards_handbook.pdf

93. Wichman, C. J., & Ferraro, P. J. (2017). A cautionary tale on using panel data estimators to measure program impacts. Economics Letters, 151, 82–90. doi:10.1016/j.econlet.2016.11.029

94. Wilde, E. T., & Hollister (2007). How close is close enough? Evaluating propensity score matching using data from a class size reduction experiment. Journal of Policy Analysis and Management, 26, 455–477. doi:10.1002/pam.20262

95. Wing, & Clark, M. H. (2016). What can we learn from a doubly randomized preference trial? An instrumental variables perspective. Journal of Policy Analysis and Management, 36, 418–437. doi:10.1002/pam.21965

96. Wing, & Cook, T. D. (2013). Strengthening the regression discontinuity design using additional design elements: A within-study comparison. Journal of Policy Analysis and Management, 32, 853–877. doi:10.1002/pam.21721

97. Wong, V. C., Hallberg, & Cook, T. D. (2013). Intact school matching in education: Exploring the relative importance of focal and local matching. In Society for Research on Educational Effectiveness. Retrieved from https://www.sree.org/conferences/2013s/program/

98. Wong, V. C., & Steiner, P. M. (2018). Designs of empirical evaluations of non-experimental methods in field settings. Evaluation Review, 42(1), 176–213.

99. Wong, V. C., Valentine, & Miller-Bains (2017). Empirical performance of covariates in education observational studies. Journal of Research on Educational Effectiveness, 10, 207–236. doi:10.1080/19345747.2016.1164781

100. Zhou, & Xie (2016). Propensity score–based methods versus MTE-based methods in causal inference. Sociological Methods & Research, 45, 3–40. doi:10.1177/0049124114555199