Applications of a Within-Study Comparison Approach for Evaluating Bias in Generalized Causal Inferences From Comparison Groups Studies

Abstract

Background:

Past studies have examined factors associated with reductions in bias in comparison group studies (CGSs). The companion work to this article extends the framework to investigate the accuracy of generalized inferences from CGSs.

Objectives:

This article empirically examines levels of bias in CGS-based impact estimates when used for generalization, and reductions in bias resulting from covariate adjustment. It assesses potential for bias reduction against criteria from past studies.

Research design:

Multisite trials are used to generate impact estimates based on cross-site comparisons that are evaluated against site-specific experimental benchmarks. Strategies for reducing bias are evaluated. Results from two experiments are considered.

Subjects:

Students in Grades K–3 in 79 schools in Tennessee and students in Grades 4–8 in 82 schools in Alabama.

Measures:

Grades K–3 Stanford Achievement Test reading and math scores; Grades 4–8 Stanford Achievement Test (SAT) 10 reading scores.

Results:

Generalizing impacts to sites through estimates based on between-site nonexperimental comparisons leads to bias from differences between sites in average performance, and in impact, and covariation between these quantities. The first of these biases is larger. Covariate adjustments reduce bias but not completely. Criteria for bias reduction from past studies appear to extend to generalized inferences based on CGSs.

Conclusion:

When generalizing from a CGS, results may be affected by bias from differences between the study and inference sites in both average performance and average impact. The same factors may underlie both forms of bias. Researchers and practitioners can assess the validity of generalized inferences from CGSs by applying criteria for bias reduction from past studies.

Keywords

bias in quasi-experiments, within-study comparison approaches, comparison group studies

This article is the second of two that empirically investigate the accuracy of generalized inferences from comparison group studies (CGSs). In the first paper (Jaciw, 2016), we developed the methodology and briefly considered two empirical examples. In the current work, we apply the methods and present the two examples in full.

Specifically, in this current article, we (1) briefly review results of the background studies discussed in Jaciw (2016), focusing on criteria under which bias in CGSs has been successfully reduced in previous works, (2) review the main findings from the companion work, particularly the expressions for bias in CGS-based impact estimates that we apply in the current article, (3) specify the estimation models used to obtain the results, and (4) interpret the results from the Tennessee class size reduction experiment (Project Student-Teacher Achievement Ratio [STAR]) and the Alabama Math, Science and Technology Initiative (AMSTI) multisite trials, which are our main illustrative examples. In the analysis, we summarize the levels of average absolute bias in generalized inferences from CGSs and examine the potential for covariate adjustments to reduce the levels of bias. We also consider the applicability of criteria for bias reduction identified in past studies to the current results. We address four technical issues and consider next steps in the development of the method.

Background

In this and the companion article, we build on within-study comparison (WSC) approaches to investigating bias in CGS-based estimates of program impact. WSC methods, pioneered by Lalonde (1986) and Fraker and Maynard (1987), attempt to replicate unbiased experimental benchmark results using nonexperiments. In the usual application, outcomes for an experimental control group are substituted with those from a nonexperimental comparison group that is strategically selected to be like the experimental control. The resulting estimate is one that plausibly would be arrived at if randomization to conditions had not been possible and, instead, a nonexperimental comparison group had to be used to establish the counterfactual to individuals randomly assigned to treatment. The difference between the nonexperimental and experimental estimates is a measure of bias in the former. After measuring the level of bias, researchers typically apply various analytic strategies, including statistical adjustments, to assess if the nonexperimental estimate can be brought into correspondence with the benchmark experimental result; that is, whether they can produce a valid nonexperimental counterfactual to individuals randomly assigned to treatment.

Summary of Findings From Past WSC Studies

Several main findings have emerged from past WSC studies concerning the conditions that lead to bias or that help mitigate it. We review them here. The companion paper to this one provides a more detailed review and cites systematic reviews of studies addressing the question.

The selection of the comparison group is critical for avoiding bias, with local matching tending to yield less biased results

Comparison groups that are distant from the inference population tend to perform poorly, in part because differences in geographic location, or in when outcomes are assessed, can lead to third-variable confounding; that is, to differences on factors whose effects cannot be adjusted for. For example, in a WSC study by Bloom, Michalopoulos, and Hill (2005) that used results from several randomized trials in the National Evaluation of Welfare-to-Work Strategies, the researchers showed that (1) for matched comparison groups identified at different sites in the same city, it is easier to produce a valid counterfactual in the short run than in the longer term and (2) in-state comparison groups produce less bias than out-of-state comparisons.

The quality of the matching criteria and the covariates used matter

Ideally, it is important to account for the effects of theory-based factors that reflect program participation and outcomes for specific contexts (Smith & Todd, 2005). Generally, bias will be reduced when adjustments are made for the effects of covariates that are related to the outcome and influence selection. Preprogram measures of the outcome variable, such as earnings in WSC studies of job training programs, in most cases will satisfy the first condition. If the baseline measure of the outcome variable is also related to selection into the program, it will operate as a confounder, and adjusting for its effect is expected to reduce bias. In education, Shadish, Clark, and Steiner (2008) show that bias from self-selection can be almost completely reduced by modeling covariates that reflect both theory and a common sense determination of why individuals self-select into programs. Further, variables related to individual interest and motivation—often not measured—may be important to model, especially when participation in a program is voluntary (Agodini & Dynarski, 2004). In contrast to variables that capture the selection process or that are predictive of outcomes, “off the shelf” covariates, including basic demographics, without a pretest, typically are not helpful for reducing bias.

Greater complexity of methods matters less and does not compensate for a poor quasi-experimental design or a lack of strong covariates

Applying sophisticated analysis to data from a design where the comparison group is remotely drawn, or where the covariates do not capture the selection process, is unlikely to reduce bias. In the study by Bloom et al. (2005) discussed above, Ordinary Least Squares regression worked as well at reducing bias as more complex approaches provided that the matched comparison group was local. In this case, a design involving local matching mitigated the influence of unobservable confounders, while sophisticated analysis did not compensate for a suboptimal match. In the WSC study of voluntary high school dropout prevention programs by Agodini and Dynarski (2004) noted above, even with extensive covariates “not typically available to researchers,” (p. 192) propensity score matching methods were not successful at replicating experimental impacts. The authors conclude “impacts based on regression methods, which are easier to implement, are not any more capable of replicating experimental impacts in this setting than are propensity-score methods” (p. 192). They point out that the voluntary nature of the program likely led to selection on unobserved motivational factors, the effects of which could not be accounted for. More broadly, in their meta-analysis of WSC studies in job training, Glazerman, Levy, and Myers (2003) concluded that while statistical adjustment generally reduced bias, regression, propensity score matching, or other forms of matching produced similar results. The implication is that with the covariates that were commonly available in the studies they reviewed, there was a limit to how successfully the experimental benchmark result could be replicated, which more sophisticated analyses could not improve on.

Additional criteria

Further criteria for evaluating the quality of a WSC study and its potential to replicate experimental results include that the randomized experiment is well executed and low in attrition, that the same estimator is used for the experiment and CGS-based result, and that the analyst of the CGS is blind to the results of the yoked experimental study (Shadish, Steiner, & Cook, 2012). Also, closer replication has been observed with larger experiments, and when experimental results are indeterminate or show no impact (Glazerman, Levy, & Myers, 2003). Later in this article, we will apply several of the criteria discussed above to the empirical findings.

Using the WSC Approach to Investigate the Incremental Generalizability of Inferences From CGSs

In Jaciw (2016), we use a specific variant of the WSC methodology to investigate the validity of generalizations from CGSs. When several randomized trials of the same or similar interventions are conducted at multiple sites, a nonexperimental counterfactual for any given site can be constructed from the experimental controls at one or more of the other sites. (Bloom, Michalopoulos, & Hill, 2005, provide an example in jobs training and Wilde & Hollister, 2007, in education.) With this variant of WSC, the goal is to account for factors influencing selection into sites. For two reasons, this “multisite trial version” of WSC is well-suited to the approach to investigating the accuracy of generalized inferences from CGSs that we develop in this work. First, a multisite trial provides us with a plausible inference population; specifically, a given site in the trial. All sites in the experiment are similar in that they have agreed to be part of a randomized trial of the same or a similar intervention, thereby meeting eligibility criteria and, implicitly, interest or need to have the program implemented. A small step in a claim of external validity would be to show that we can accurately infer program impact for a given site using information from the other sites in the same trial. Therefore, we describe this process as making an incremental generalization about program impact and then evaluating the accuracy of the inference. A second reason for the suitability of the multisite trial variant of WSC, for studying the accuracy of generalized inferences, is that it furnishes an experimental impact estimate for each site, thereby allowing a direct comparison between the benchmark experimental and CGS-based results.

Identifying Bias in the CGS-Based Impact Estimate

In the companion paper to this one (Jaciw, 2016), our starting point is an experimental benchmark result for the jth site in a trial with N sites. Applying the WSC approach, we define a comparison group-based nonexperimental counterfactual at j. Specifically, we construct the nonexperimental counterfactual to individuals assigned to control at j using the average performance of individuals assigned to treatment, in expectation across all the sites except the inference site j. The difference between this quantity and average control performance at j is the CGS-based generalized measure of impact for that site, which we express as follows:¹

E_{D_{i} \neq j} [E (Y_{i} (q (X_{i, D_{i}})) | D_{i})] - E (Y_{i} (0) | D_{i} = j) .

This quantity can be decomposed into three terms:

\begin{array}{l} E (Δ_{i} (q (X_{i, D_{i} = j})) | D_{i} = j) \\ + \{E_{D_{i} \neq j} [E (Y_{i} (0) | D_{i})] - E (Y_{i} (0) | D_{i} = j)\} \\ + \{E_{D_{i} \neq j} [E (Δ_{i} (q (X_{i, D_{i}})) | D_{i})] - E (Δ_{i} (q (X_{i, D_{i} = j})) | D_{i} = j)\} . \end{array}

The first term is the estimand: the average impact at inference site D_i = j. A successful generalization means estimating this quantity without bias.² The quantity in the first set of braces is bias attributable to the difference between the comparison and inference sites in average performance of the controls. This is bias due to confounding, and it is the usual concern of WSC studies. The quantity in the second set of braces is bias due to the difference between the comparison and inference sites in program impact, potentially from imbalance on factors that moderate the treatment effect.
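The decomposition can be checked numerically. The following Python sketch is an illustration only and is not part of the original analysis: the site means are made-up values rather than data from either trial, the function name `decompose` is hypothetical, and the comparison sites are weighted equally as a simplifying assumption.

```python
from statistics import mean

def decompose(control_means, impacts, j):
    """Split the CGS-based estimate for site j into the estimand and two biases.

    control_means[k]: mean control outcome at site k (hypothetical values)
    impacts[k]:       mean impact at site k (hypothetical values)
    Comparison sites are weighted equally, which is an assumption.
    """
    others = [k for k in range(len(control_means)) if k != j]
    # Mean treated outcome across comparison sites = control mean + impact.
    treated_elsewhere = mean(control_means[k] + impacts[k] for k in others)
    cgs_estimate = treated_elsewhere - control_means[j]

    estimand = impacts[j]  # first term: average impact at the inference site
    # Bias from differences in average control performance (first braces).
    bias_performance = mean(control_means[k] for k in others) - control_means[j]
    # Bias from differences in impact (second braces).
    bias_impact = mean(impacts[k] for k in others) - impacts[j]
    return cgs_estimate, estimand, bias_performance, bias_impact

control_means = [0.10, -0.30, 0.25, 0.05]  # made-up, in SD units
impacts = [0.20, 0.15, 0.30, 0.10]         # made-up, in SD units

cgs, estimand, b_perf, b_imp = decompose(control_means, impacts, j=0)
# The three terms add up to the CGS-based estimate.
assert abs(cgs - (estimand + b_perf + b_imp)) < 1e-12
```

In this made-up case the CGS-based estimate understates the impact at the inference site, mostly because controls at the comparison sites perform worse on average than controls at the inference site.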

An important take-away from the companion paper is that applying the WSC approach to generalized inferences from CGSs leads to consideration of both types of bias: from confounding and from effect moderation. A related question is whether confounders that lead to the first bias also act as moderators that produce the second form of bias. In the companion paper, we develop this point further and argue that it is important to study both types of bias together. Briefly, we argue that doing so (1) comes with little additional cost when one is already conducting the multisite variant of a WSC study, whereas studying the two forms of bias separately can be misleading because, while both forms contribute to total absolute bias, the total is not necessarily reducible to the components; (2) provides the opportunity to apply the lessons learned from the long-standing record of WSC studies to the problem of generalizability: if we find that confounders also function as moderators, then the criteria associated with bias reduction in traditional WSCs may also be the factors that affect the success of generalizing results from CGSs; and (3) is likely to align with the interests of decision makers at a given site, who may be more interested in understanding the accuracy of inferences as a whole than in disentangling total bias into its components.

Summarizing bias across multiple sites

For convenience, we label the three terms in expression (2) δ_j, β_j, and γ_j, respectively. In a multisite trial version of WSC, bias in the CGS-based quantity can be identified for each site: $Bias (D_{i} = 1) = β_{1} + γ_{1}$, $Bias (D_{i} = 2) = β_{2} + γ_{2}$, …, $Bias (D_{i} = N) = β_{N} + γ_{N}$. Summarizing bias across sites by taking the grand mean underestimates bias because the terms can be positive or negative, resulting in a mean value close to 0, even when bias may be appreciable on either side of 0 (as has been empirically shown by, for example, Bloom et al., 2005). It is more informative to consider average absolute bias. In the companion paper we propose, as an alternative measure of the average magnitude of bias, the square root of the average of the terms for squared bias; that is, we propose the square root of the following quantity:

\bar{{BIAS}^{2}} = \frac{1}{N} \sum_{j = 1}^{N} {(β_{j} + γ_{j})}^{2} = \frac{1}{N} \sum_{j = 1}^{N} (β_{j}^{2} + 2 γ_{j} β_{j} + γ_{j}^{2}) .

In the companion paper we show that the quantities in Equation 3 express two variances, representing deviations in site average performance, and in site-average impact, from their respective grand means, as well as the covariance across sites in these deviations: $τ_{0} = \frac{1}{N} \sum_{j = 1}^{N} β_{j}^{2}$ , $τ_{1} = \frac{1}{N} \sum_{j = 1}^{N} γ_{j}^{2}$ , and $τ_{01} = \frac{1}{N} \sum_{j = 1}^{N} γ_{j} β_{j}$ . The average magnitude of bias can be written in terms of these quantities and expressed in units of the standard deviation (SD) of the outcome variable:

\frac{1}{S D} \sqrt{\bar{{BIAS}^{2}}} = \frac{1}{S D} \sqrt{τ_{0} + 2 τ_{01} + τ_{1}} .

In the empirical examples that we consider below, the goal is to estimate this quantity as well as the following two components:

\frac{1}{S D} \sqrt{τ_{0}} = \frac{1}{S D} \sqrt{\frac{1}{N} \sum_{j = 1}^{N} β_{j}^{2}},

\frac{1}{S D} \sqrt{τ_{1}} = \frac{1}{S D} \sqrt{\frac{1}{N} \sum_{j = 1}^{N} γ_{j}^{2}} .
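As an illustration of why the grand mean can understate bias while the root-mean-square summary does not, the following Python sketch computes the quantities above from site-level bias terms. The values of β_j and γ_j are made up for the example, and the function name `bias_summaries` is hypothetical.

```python
from math import sqrt

def bias_summaries(beta, gamma, sd):
    """Summarize site-level biases in SD units.

    beta[j]:  bias at site j from differences in average performance
    gamma[j]: bias at site j from differences in impact
    sd:       standard deviation of the outcome
    """
    n = len(beta)
    tau0 = sum(b * b for b in beta) / n
    tau1 = sum(g * g for g in gamma) / n
    tau01 = sum(b * g for b, g in zip(beta, gamma)) / n
    grand_mean = sum(b + g for b, g in zip(beta, gamma)) / n
    return {
        "grand_mean": grand_mean / sd,
        "total": sqrt(tau0 + 2 * tau01 + tau1) / sd,
        "performance": sqrt(tau0) / sd,
        "impact": sqrt(tau1) / sd,
    }

beta = [0.4, -0.4, 0.3, -0.3]    # made-up values
gamma = [0.1, -0.1, -0.1, 0.1]   # made-up values
summary = bias_summaries(beta, gamma, sd=1.0)

# Site-level biases cancel in the grand mean but not in the RMS summary.
assert abs(summary["grand_mean"]) < 1e-12
assert summary["total"] > 0.3
```

Here the site-level biases sum to zero, so the grand mean suggests no bias, while the average magnitude of total bias is well above zero.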

The use of covariate adjustments

In the usual WSC studies, a principal goal is to examine whether bias can be reduced by accounting for effects of confounders. Similarly, we evaluate whether bias from effect heterogeneity can be accounted for by modeling the interactions between potential moderators and treatment. If we can account for treatment effect heterogeneity through modeling moderator effects, we say this establishes the incremental generalizability of the program effect, because doing so means that we have successfully leveraged information from the comparison sites to account for the differences between the experimental benchmark and CGS-based estimates of program impact for the inference site. There is a long-standing discussion in the literature about the connection between moderator effects and generalizability. In the companion article, we consider several traditions. Not all approaches are the same—some consider moderator effects as demonstrating a lack of generalizability (Shadish, Cook and Campbell, 2002), while others adjust for the effects of program moderators when generalizing experimental findings to inference populations beyond the experiment (Tipton, 2013).

Method

We use multilevel regression to estimate levels of average total bias, and its components, prior to and after adjusting for effects of covariates, including moderators.³ To obtain estimates for the AMSTI example below, we analyze outcomes using hierarchical linear (HL) models (Raudenbush & Bryk, 2002; Singer, 1998). Four main models are used to estimate average absolute bias prior to and after applying statistical adjustments.⁴ Model 1, the starting model, includes no covariates at the student or site levels:

y_{i j} = α + β T_{i j} + u_{0 j} + u_{1 j} T_{i j} + ∊_{i j} .

Here, y_ij is the outcome for participant i at site j, T_ij is the treatment assignment status for that person (coded 1 for treatment and 0 for control), u_0j is the site-specific deviation in average performance from the grand average of performance, u_1j is the site-specific deviation in impact from the grand average impact, and ∊_ij is the within-site person-specific random error.

Model 2 is used to obtain estimates of average absolute bias before adjustment for the effects of site-level covariates. It includes covariates at the student level, $Z_{pij}$, and their interactions with the treatment assignment variable. Each student-level covariate is group mean centered on the site-level mean of the variable:

y_{i j} = α^{*} + β^{*} T_{i j} + \sum_{p} π_{p}^{*} (Z_{p i j} - \bar{Z_{p j}}) + \sum_{p} λ_{p}^{*} (Z_{p i j} - \bar{Z_{p j}}) T_{i j} + u_{0 j}^{*} + u_{1 j}^{*} T_{i j} + ∊_{i j}^{*} .

We are interested in estimates of the following quantities:

Var (u_{0 j}^{*}) = τ_{0}^{*},

Var (u_{1 j}^{*}) = τ_{1}^{*},

Cov (u_{0 j}^{*}, u_{1 j}^{*}) = τ_{01}^{*} .

Average absolute bias is estimated as: $\frac{1}{\hat{S D}} \sqrt{\hat{τ_{0}^{*}} + 2 \hat{τ_{01}^{*}} + \hat{τ_{1}^{*}}}$ .⁵

Model 3 is like Model 2 but also includes the main effects of site-level covariates, $\bar{Z_{p j}}$ :

y_{i j} = α^{* *} + β^{* *} T_{i j} + \sum_{p} π_{p}^{* *} (Z_{p i j} - \bar{Z_{p j}}) + \sum_{p} π_{p}^{' * *} \bar{Z_{p j}} + \sum_{p} λ_{p}^{* *} (Z_{p i j} - \bar{Z_{p j}}) T_{i j} + u_{0 j}^{* *} + u_{1 j}^{* *} T_{i j} + ε_{i j}^{* *} .

This model allows us to assess potential reductions in bias that result from modeling the site-level covariates. Also, we compare the results of this model to those from Model 4, below, to assess the reduction in bias from modeling the interactions between site-level covariates and the indicator of treatment assignment status.

Model 4 is like Model 3 but also includes the interactions between the site-level covariates and treatment:

y_{i j} = α^{* * *} + β^{* * *} T_{i j} + \sum_{p} π_{p}^{* * *} (Z_{p i j} - \bar{Z_{p j}}) + \sum_{p} π_{p}^{' * * *} \bar{Z_{p j}} + \sum_{p} λ_{p}^{* * *} (Z_{p i j} - \bar{Z_{p j}}) T_{i j} + \sum_{p} λ_{p}^{' * * *} \bar{Z_{p j}} T_{i j} + u_{0 j}^{* * *} + u_{1 j}^{* * *} T_{i j} + ε_{i j}^{* * *} .

For the fourth model, we are interested in estimates of the following quantities:

Var (u_{0 j}^{* * *}) = τ_{0}^{* * *},

Var (u_{1 j}^{* * *}) = τ_{1}^{* * *},

Cov (u_{0 j}^{* * *}, u_{1 j}^{* * *}) = τ_{01}^{* * *} .

Average absolute bias conditional on the main effects of site covariates and their interactions with treatment is estimated as $\frac{1}{\hat{S D}} \sqrt{\hat{τ_{0}^{* * *}} + 2 \hat{τ_{01}^{* * *}} + \hat{τ_{1}^{* * *}}}$ . The quantities of main interest are estimates of variance components and the average magnitude of bias before conditioning results on the effects of site-level covariates (Model 2) and after conditioning results on effects of site-level covariates and their interactions with treatment (Model 4).
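The comparison between Model 2 and Model 4 can be sketched as follows. This Python example is illustrative only: the variance component values and the outcome SD are hypothetical and are not results from either trial, the function name `average_absolute_bias` is an assumption, and the proportional reduction shown is one possible way to summarize the change rather than a statistic reported in this article.

```python
from math import sqrt

def average_absolute_bias(tau0, tau1, tau01, sd):
    """Average magnitude of bias, in SD units, from variance components.

    tau0:  between-site variance in average performance
    tau1:  between-site variance in impact
    tau01: covariance between the two sets of site deviations
    sd:    standard deviation of the outcome
    """
    total_variance = tau0 + 2 * tau01 + tau1
    if total_variance < 0:
        raise ValueError("variance components imply a negative squared bias")
    return sqrt(total_variance) / sd

# Hypothetical estimates in squared scale-score units; not study results.
sd = 40.0
before = average_absolute_bias(tau0=250.0, tau1=16.0, tau01=5.0, sd=sd)  # as in Model 2
after = average_absolute_bias(tau0=90.0, tau1=9.0, tau01=-2.0, sd=sd)    # as in Model 4

# Share of the average magnitude of bias removed by site-level adjustment.
reduction = 1 - after / before
print(f"before: {before:.3f} SD, after: {after:.3f} SD, reduction: {reduction:.1%}")
```

The guard against a negative sum reflects the fact that a large negative covariance estimate could, in principle, make the estimated squared bias negative even though the population quantity cannot be.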

Results

Example 1: The Tennessee STAR Multisite Trial

The data

Our first example uses results from the Tennessee class size reduction experiment (Finn & Achilles, 1990; Mosteller, 1995). The multisite trial started in 1985 and lasted 4 years. In the first year, 6,400 kindergarten students in 79 schools were randomized to one of the three conditions: (1) small classes (13–17 students), (2) regular classes (22–25 students), or (3) regular classes with an aide. The numbers of classes in the three conditions described above were 108, 101, and 99, respectively. Teachers were also randomized to classes. The outcome measures were scale scores in reading and math. Small classes were found to have a positive impact on achievement. For example, by the end of the second year, students in small classes had more than .20 SD advantage in achievement over students in regular-sized classes (Finn & Achilles, 1990).

We expand on specific results from Nye, Hedges, and Konstantopoulos (2000). Average performance across schools and the treatment effect were modeled as randomly varying across schools after modeling the interactions of treatment with seven school-level covariates: three indicators of urbanicity, the percentage of teachers in small classes having an advanced education degree, teachers’ total experience in small classes, the percentage of students in small classes receiving free or reduced-price lunch, and the percentage of Black students in small classes. This analysis was carried out for each grade, K–3, for both reading and math. The square roots of the school-level variance component estimates are interpretable as summaries of average levels of bias when inferring impact for a given site using cross-site comparisons as discussed in the Method section above.

As emphasized by Cook, Shadish, and Wong (2008), it is important to assess the validity of the benchmark experimental findings before interpreting the results relevant to the WSC effort. In the experiment, attrition of students was high (approximately 20–30% of students left the study each year over the course of the experiment) and the difference in attrition rates between small and regular-sized classes was statistically significant for all grades. However, Nye et al. (2000) demonstrated that for every year and in both subject areas the impacts for leavers and stayers the year before were almost identical. They conclude that it is implausible that attrition biased the treatment effects the following year. Students also moved between conditions, which can induce bias. Nye et al. (2000) found that the direction of the biases from students switching classes varied across grades, with impact being overestimated in first and third grades but underestimated in second grade. To address the potential for bias from students switching conditions, the researchers conducted analyses on students as assigned through randomization as well as in the conditions that they ended up being in.

Results from the WSC analysis

In Table 1, we show (1) the average small class effect for math and reading, (2) the estimated between-school variation in average performance (“residual variance in the intercept”), and (3) the estimated between-school variation in the impact (“residual variance in the impact”) after conditioning on the effects of the school-level covariates and their interactions with the treatment variable. This is done for each grade level. We focus on the results based on initial assignment.

Table 1.

Average Impacts and Variances Across Sites in Intercepts and Impacts from the Tennessee Class Size Reduction Experiment (Based on Initial Assignment).

Grade Levels	Math				Reading
Grade Levels	K	1	2	3	K	1	2	3
Average impact	.215*	.199*	.121*	.141*	.202*	.123*	.175*	.154*
Residual variance for the intercept	.131*	.099*	.075*	.061*	.125*	.078*	.037*	.031*
Residual variance in the impact	.002	.031*	.040*	.009	.019	.008	.007	.003
Square root of the residual variance for the intercept	.362*	.315*	.274*	.247*	.354*	.279*	.192*	.176*
Square root of the residual variance for the impact	.045	.176*	.200*	.095	.138	.089	.084	.055

Note. These quantities are obtained from Nye, Hedges, and Konstantopoulos’s (2000) Table 10; the square roots of the residual variances are not in the original table. The researchers z-transformed the scale scores; therefore, the square roots of the residual variances are in units of the standard deviation of the posttest.

*p < .05.

In the last two rows, we display the square roots of the variances. The “square root of the residual variance for the intercept” and “square root of the residual variance for the impact” can be considered estimates of $\frac{1}{S D} \sqrt{τ_{0}^{* * *}}$ and $\frac{1}{S D} \sqrt{τ_{1}^{* * *}}$ , respectively.⁶ We observe that, in the case of this multisite trial, estimates based on nonexperimental comparisons do not replicate experimental benchmark findings even with regression adjustments. Looking across grades and subject areas, the average magnitude of bias from variation across sites in average performance ranges from .176 to .362 SD units, with a median value of .277. All eight estimates are significantly different from 0 at the α = .05 level and are larger than the grand mean impact, which ranges between .121 and .215 SD units, with a median of .165. The average magnitude of bias from variation across sites in the impact after conditioning on site-level moderator effects is between .045 and .200 SD units, with a median value of .092. Two of the eight estimates of average absolute bias from variation in impact are significant, and only two (.176 and .200) are larger than the median average impact.
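The last two rows of Table 1 and the medians cited above can be recomputed from the variance rows. The Python sketch below uses only the values printed in Table 1; small discrepancies from the reported figures are expected because the table's variances are themselves rounded to three decimals.

```python
from math import sqrt
from statistics import median

# Values from Table 1, ordered math K-3, then reading K-3.
intercept_var = [.131, .099, .075, .061, .125, .078, .037, .031]
impact_var = [.002, .031, .040, .009, .019, .008, .007, .003]
average_impact = [.215, .199, .121, .141, .202, .123, .175, .154]

# Square roots are in SD units because the scores were z-transformed.
intercept_sd = [sqrt(v) for v in intercept_var]
impact_sd = [sqrt(v) for v in impact_var]

print([round(s, 3) for s in intercept_sd])
print([round(s, 3) for s in impact_sd])
print("median bias from average performance:", median(intercept_sd))
print("median bias from impact:", median(impact_sd))
print("median average impact:", median(average_impact))
```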

Levels of average bias from differences in average performance are robust to which analysis is conducted, whether based on initial assignment or on the condition in which students ended up.⁷ The median (across 4 grade levels × 2 subject areas = 8 outcomes) of average absolute bias is .277 SD of the outcome distribution with initial assignment and .287 with actual assignment, and all eight measures are statistically significant for each type of analysis. This is not the case for average absolute bias from variation in impact: The median levels of absolute bias are .092 SD of the outcome distribution with initial assignment and .148 with actual assignment. Two of the eight estimates are statistically significant in the analysis based on initial assignment and four are significant in the analysis based on actual assignment.

In a small number of the analyses, the school average of teachers’ experience in small classes and categories of urbanicity interact with treatment, giving some insight into moderators of impact across schools. However, as Nye et al. (2000) point out, the field of education lacks understanding of the mechanisms behind the small class effect (p. 150).⁸ Insufficient theory potentially limits our capacity to identify and measure the factors that moderate the impact and thereby account for the sources of bias from effect heterogeneity in cases where we observe it.

Should we expect success in reducing bias in this WSC analysis?

We consider whether the study satisfies criteria reviewed earlier that are associated with bias reduction in past WSC studies. The covariates do not include a pretest, and the comparisons are made within the state, but are not more local than that. Also, the experimental impacts based on “actual assignment” may be biased from student noncompliance with random assignment. Further, the covariates may be considered all-purpose demographics and do not necessarily reflect theory of selection into schools. However, the study also shares several characteristics that have been associated with bias reduction in past WSC studies: (1) adjustments were made for effects of covariates at both the individual and intact group (i.e., site) level, (2) comparisons were made with sites that were similar in an important respect—all schools volunteered to be part of the multisite trial, (3) the experimental samples were relatively large, and (4) all variables, including the outcomes, were measured in exactly the same way across the study sample. Given that several important criteria are not met, especially the lack of a pretest, it is not surprising that bias is present on average in CGS-based estimates formed through between-site comparisons. We consider the Tennessee STAR example to be an initial demonstration of the approach. Next, we turn to a more detailed example where we will also investigate the covariation between terms representing the two types of bias—from confounding and effect moderation.

Example 2: The Randomized Trial of AMSTI

The data

We use results from a randomized trial of AMSTI (Newman et al., 2012). AMSTI is a reform-based intervention that is intended to increase teachers’ use of instructional strategies that promote higher-order thinking skills and active learning in math, science, and technology. This, in turn, is intended to have a positive impact on student achievement in mathematics and science.

In the trial, 82 schools were randomized in matched pairs to AMSTI or business as usual. Pairs were selected purposively to be representative of the five regions of Alabama from which the schools were drawn. Criteria for selecting pairs included mathematics achievement, the percentage of minority students, and the percentage of students from low-income households.⁹ Impacts on SAT 10 tests of mathematics and reading in Grades 4–8 combined, and science in Grades 5 and 7 combined, were assessed after 1 year.

In this article, we focus on 1-year impacts on reading. Our overall sample consists of 40 matched pairs of schools and N = 17,922 students. For this example, the blocks (matched pairs) serve as the sites. In the analysis of the combined sample (minorities and nonminorities), all schools that were randomized were retained in the analysis with the exception of two that constituted one matched pair.¹⁰ Between the time of randomization and analysis, total attrition at the school level was 2% with no differential attrition. Total attrition at the student level was 4.2%, with differential attrition of 1.6%. At these levels, bias from attrition is expected to be low according to standards of evidence developed by the What Works Clearinghouse. Baseline equivalence on pretest was established for the analysis sample (p = .70).

In the original trial, positive statistically significant impacts were found in math and reading. An important exploratory result was a significant differential impact on reading depending on minority status, with a significant impact for nonminorities and no impact for minorities. A similar trend was seen in mathematics and science. As a result, in this article, we analyze results for the full sample and separately by minority status.

Results from the WSC analysis

Table 2 shows the estimates of variance components and the magnitude of bias before adjustment for the effects of site-level covariates for the reading outcome, for both the combined sample, and by minority status (i.e., based on HL Model 2).¹¹ We observe that, before covariate adjustments, on average, the CGS-based results formed through cross-site comparisons do not lead to accurate inferences concerning program impact for individual sites. Average absolute bias from between-block differences in average performance is .42, .45, and .39 SD units for the combined sample, for minorities, and for nonminorities, respectively. Average absolute bias from differences among blocks in the impact is .10, .14, and .06 SD for the groups, respectively. Each of these biases is statistically significant at the α = .05 level. Total absolute bias is .44, .45, and .39 for the three samples, respectively. Before modeling effects of site-level covariates, we observe little covariation between the terms for the two biases. The covariance terms are not statistically significant for the combined sample and for the subsamples (at the α = .05 level).

Table 2.

Estimates of Variance Components and Average Absolute Bias for Models Without Site-Level Effects for Combined Sample and by Minority Status.

Quantities Estimated
Analysis Sample	Between-Block Variation in Average Performance in Scale Score Units, Average Magnitude of Bias from Differences in Average Performance in SD Units	Between-Block Variation in the Impact in Scale Score Units, Average Magnitude of Bias from Differences in Impact in SD Units	Covariance Between Deviations in Performance and Impact, in Scale Score Units	Average Squared Bias, Average Total Absolute Bias Expressed in SD Units
	$τ_{0}^{*}$, $\sqrt{τ_{0}^{*}} / SD$	$τ_{1}^{*}$, $\sqrt{τ_{1}^{*}} / SD$	$τ_{01}^{*}$	$τ_{0}^{*} + 2 τ_{01}^{*} + τ_{1}^{*}$, $\sqrt{τ_{0}^{*} + 2 τ_{01}^{*} + τ_{1}^{*}} / SD$
Whole sample (N = 17,922)	240.80, .42 (p < .01)	13.30, .10 (p < .01)	0.34 (p = .97)	254.78, .44
Minorities (N = 5,881)	235.16, .45 (p < .01)	23.42, .14 (p < .01)	−13.61 (p = .43)	231.36, .45
Nonminorities (N = 10,198)	201.31, .39 (p < .01)	5.18, .06 (p = .04)	2.26 (p = .81)	211.01, .39

Note. The results are obtained from estimation Model 2 described in the section on Methods. Because the covariance can take on negative values, we do not take the square root of the quantity and divide it by the SD of the outcome variable in order to express in terms of effect size units, as we do with the other quantities. The p values correspond to the estimates of the variance components. When obtaining results separately by minority status, we eliminated matched pairs of schools where both schools in a matched pair did not include both minorities and nonminorities. This explains why the total number of minority and nonminority students in the last two rows does not sum to the number for the whole sample. When we reran the analysis combining minorities and nonminorities for this reduced sample (5,881 + 10,198 = 16,079), the results were very similar to those for the whole sample (N = 17,922). SD = standard deviation.
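The last column of Table 2 can be recomputed from the three variance components in each row. The following is a minimal Python sketch of that arithmetic; the function name `total_squared_bias` and the dictionary layout are illustrative choices, not part of the original analysis. The conversion to SD units is omitted because the outcome SD is not reported in this section.

```python
import math


def total_squared_bias(tau0, tau1, tau01):
    """Average squared bias: tau0 + 2*tau01 + tau1 (scale score units squared)."""
    return tau0 + 2.0 * tau01 + tau1


# Variance components as reported in Table 2 (Model 2):
# (variation in average performance, variation in impact, covariance).
table2 = {
    "whole": (240.80, 13.30, 0.34),
    "minorities": (235.16, 23.42, -13.61),
    "nonminorities": (201.31, 5.18, 2.26),
}

results = {name: round(total_squared_bias(*comps), 2) for name, comps in table2.items()}

for name, value in results.items():
    # The square root gives the average total absolute bias in scale score units.
    print(name, value, round(math.sqrt(value), 2))
```

The recomputed values agree with the reported average squared bias in each row, which confirms that the covariance enters the total with a factor of two.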

Table 3 shows the corresponding results after modeling site-level covariates and their interactions with the indicator of treatment assignment status (i.e., based on HL Model 4). The block-level covariates included average pretest and proportion of students who are male, who are of low socioeconomic status (SES), who are minorities, and who are English speakers. Remaining absolute bias from between-block differences in average performance is .07, .07, and .04 SD units for the combined sample, for minorities, and for nonminorities, respectively (each statistically significant at α = .05). Remaining bias from effect heterogeneity is .08, .07, and .03 SD units for the three groups, respectively (only the variance component for the first of these is statistically significant at α = .05). The covariances between the two forms of bias are negative for the three samples, with the component for the combined sample being statistically significant at the α = .05 level.

Table 3.

Estimates of Variance Components and Average Absolute Bias for Models With Site-Level Effects for Combined Sample and by Minority Status.

Quantities Estimated
Analysis Sample	Between-Block Variation in Average Performance in Scale Score Units, Average Magnitude of Bias from Differences in Average Performance in SD Units	Between-Block Variation in the Impact in Scale Score Units, Average Magnitude of Bias from Differences in Impact in SD Units	Covariance Between Deviations in Performance and Impact, in Scale Score Units	Average Squared Bias, Average Total Absolute Bias Expressed in SD Units
	$τ_{0}^{***}$, $\sqrt{τ_{0}^{***}} / SD$	$τ_{1}^{***}$, $\sqrt{τ_{1}^{***}} / SD$	$τ_{01}^{***}$	$τ_{0}^{***} + 2 τ_{01}^{***} + τ_{1}^{***}$, $\sqrt{τ_{0}^{***} + 2 τ_{01}^{***} + τ_{1}^{***}} / SD$
Whole sample (N = 17,922)	7.41, .07 (p < .01)	8.44, .08 (p < .01)	−6.61 (p = .01)	2.63, .04
Minorities (N = 5,881)	6.20, .07 (p = .04)	6.21, .07 (p = .10)	−5.97 (p = .12)	0.47, .02
Nonminorities (N = 10,198)	2.74, .04 (p = .03)	1.87, .03 (p = .17)	−1.90 (p = .21)	0.81, .02

Note. The results are obtained from estimation Model 4 described in the section on Methods. Because the covariance can take on negative values, we do not take the square root of the quantity and divide it by the SD of the outcome variable in order to express it in terms of effect size units, as we do with the other quantities. The p values correspond to the estimates of the variance components. When obtaining results separately by minority status, we eliminated matched pairs of schools where both schools in a matched pair did not include both minorities and nonminorities. This explains why the total number of minority and nonminority students in the last two rows does not sum to the number for the whole sample. When we reran the analysis combining minorities and nonminorities for the reduced sample (5,881 + 10,198 = 16,079), the results were very similar to those for the whole sample (N = 17,922). The exception was the interaction between treatment and pair-level pretest, for which the point estimate changed from 0.16 (p = .08) to 0.22 (p = .03), and the main effect of proportion English speaker, which changed from −36.37 (p = .02) to −37.19 (p = .13). SD = standard deviation.

This example shows that when generalizing results from CGSs, it is important to consider the effects of bias from confounding and from effect moderation simultaneously. The covariance between them influences the average magnitude of bias in the CGS-based estimate. Studying the two biases separately is misleading because they are not necessarily additive. This holds for the combined sample in the example, where modeling site-level effects reduces each of the variance components and brings into sharper resolution the remaining covariance between site deviations in average performance and site deviations in impact. Because the covariance is negative, the net effect is to reduce the average total absolute bias to .04 SD. Ignoring the covariance and adding together the two biases (.07 and .08 SD) leads to an incorrect value of .15 SD units for average absolute bias. The significant covariance suggests the presence of unadjusted-for factors that lead to bias from both confounding and effect moderation.
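As a worked check of this non-additivity, the combined-sample variance components from Table 3 can be combined directly. The symbols $u_{0j}$ and $u_{1j}$ for the site deviations in average performance and in impact are notation assumed here for illustration; all quantities are in scale score units.

```latex
\begin{align*}
\operatorname{Var}(u_{0j} + u_{1j}) &= \tau_{0} + 2\tau_{01} + \tau_{1}
  = 7.41 + 2(-6.61) + 8.44 = 2.63, \\
\sqrt{2.63} &\approx 1.62 \quad \text{(total, covariance included)}, \\
\sqrt{7.41} + \sqrt{8.44} &\approx 2.72 + 2.91 = 5.63 \quad \text{(component magnitudes simply added)}, \\
\sqrt{7.41 + 8.44} &= \sqrt{15.85} \approx 3.98 \quad \text{(covariance set to zero)}.
\end{align*}
```

The gap between the first and last of these root values is due entirely to the negative covariance term, which removes 13.22 of the 15.85 units of squared bias contributed by the two variance components.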

While bias is successfully reduced for the combined sample, the remaining variance components are statistically significant, and total remaining bias is comparable in magnitude to the average impact of AMSTI, which was .06 SD and statistically significant. In other words, in this case, the generalized inference concerning impact that is based on a covariate-adjusted CGS-based result is, on average, inaccurate by an amount approximately equal to the average impact itself. A standardized effect size of .06 is small by some standards; however, in the AMSTI experiment, it amounted to about 28 days’ worth of additional schooling, a substantial advantage.

It is noteworthy that among the site-level covariates modeled, some are correlated with outcomes and interact with treatment. If we use p = .10 as a threshold, for the combined sample, the site average pretest and proportion of students eligible for free or reduced-price lunch (SES) are associated with the outcome and are moderators of impact. Gender and proportion minority moderate the treatment effect. In the companion article (Jaciw, 2016), we demonstrate that bias from effect moderation in estimates of impact based on cross-site comparisons depends on both effect moderation being present and the moderating variable being imbalanced between the inference and comparison sites.
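The dependence on both conditions can be illustrated with a simple linear case. This is a sketch under assumptions made here for illustration, not a result taken from the companion article: the impact at site $j$ is assumed linear in a single site-level moderator $W_j$, and the impact for site $j$ is inferred from comparison sites whose average moderator value is $\bar{W}_c$. The symbols $\gamma_0$, $\gamma_1$, $W_j$, and $\bar{W}_c$ are hypothetical notation.

```latex
\begin{align*}
\Delta_j &= \gamma_0 + \gamma_1 W_j, \qquad
\widehat{\Delta}_j^{\,\mathrm{CGS}} = \gamma_0 + \gamma_1 \bar{W}_c, \\
\widehat{\Delta}_j^{\,\mathrm{CGS}} - \Delta_j &= \gamma_1 \left( \bar{W}_c - W_j \right).
\end{align*}
```

Under these assumptions the bias is a product of two factors, so it vanishes if the variable does not moderate the impact ($\gamma_1 = 0$) or if the moderator is balanced between the inference and comparison sites ($\bar{W}_c = W_j$).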

Sensitivity analyses

We conducted sensitivity analyses to examine the robustness of the variance component estimates that are used to summarize bias (results reported in Tables A1–A4 in the online Supplementary Material). We examined models that included (1) squares of Level 1 or Level 2 covariates, (2) cross-level interactions between Level 1 covariates and corresponding Level 2 covariates, and (3) three-way interactions between Level 1 covariates, corresponding Level 2 covariates and the dummy variable indicating treatment assignment status. The results were robust within each of the three main types of models: (1) models without Level 2 covariates, (2) models that include main effects of Level 2 covariates but not interactions of Level 2 covariates with treatment, and (3) models that include Level 2 covariates and their interactions with treatment.

In the analyses, random-effects models were used to obtain the variance estimates representing deviations in site-specific performance, and impact, from the grand means of the estimates. As a further check of the robustness of the results, we examined the possibility that variance estimates reflect variation across blocks in the proportion of sample members assigned to treatment. To do so, we investigated whether results changed once samples were balanced between conditions within blocks. We achieved balance within blocks by eliminating a random sample of students from the larger group (treatment or control) within each block. We then reestimated the variances. We repeated this procedure 100 times, removing a different random sample of students each time. We did this for the full sample (i.e., including minorities and nonminorities). The summary statistics are provided in Appendix Table C1. The first column of results shows point estimates and standard errors for the benchmark analyses. The second shows the mean and median values of estimates from the analyses using balanced samples. The third shows the mean and median values for the standard errors for those estimates. We found that across all four models, the mean and median values of the variance component estimates and standard errors are consistent with those for the benchmark analysis using the full sample. (The standard errors are consistently slightly larger, as would be expected given the exclusion of a random sample of students in each block.) The results indicate that the variance components, which are used to summarize bias, are similar to those obtained under conditions where samples are balanced between conditions within blocks.
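The balancing-and-replication procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the data layout (tuples of block, arm, outcome), the toy data, and all function names are hypothetical, and `between_block_impact_variance` is a simple stand-in for the hierarchical-model variance components that the actual analysis reestimates.

```python
import random
import statistics
from collections import defaultdict


def balance_within_blocks(students, rng):
    """Randomly drop students from the larger arm so both arms match in each block.

    `students` is a list of (block_id, treated, outcome) tuples, treated in {0, 1}.
    """
    by_block = defaultdict(lambda: {0: [], 1: []})
    for block_id, treated, outcome in students:
        by_block[block_id][treated].append((block_id, treated, outcome))
    balanced = []
    for arms in by_block.values():
        size = min(len(arms[0]), len(arms[1]))
        for arm in (0, 1):
            # Sampling `size` students keeps the smaller arm whole and thins the larger.
            balanced.extend(rng.sample(arms[arm], size))
    return balanced


def between_block_impact_variance(students):
    """Stand-in estimator: variance across blocks of treatment-control mean differences."""
    outcomes = defaultdict(lambda: {0: [], 1: []})
    for block_id, treated, outcome in students:
        outcomes[block_id][treated].append(outcome)
    diffs = [statistics.mean(a[1]) - statistics.mean(a[0]) for a in outcomes.values()]
    return statistics.variance(diffs)


def replicate(students, estimator, n_reps=100, seed=0):
    """Reestimate on n_reps independently balanced subsamples."""
    rng = random.Random(seed)
    return [estimator(balance_within_blocks(students, rng)) for _ in range(n_reps)]


# Hypothetical toy data: 4 blocks, 30 controls and 40 treated students per block.
toy_rng = random.Random(1)
toy = [
    (block, arm, toy_rng.gauss(50 + 2 * block + arm, 10))
    for block in range(4)
    for arm in (0, 1)
    for _ in range(30 if arm == 0 else 40)
]
estimates = replicate(toy, between_block_impact_variance)
print(statistics.mean(estimates), statistics.median(estimates))
```

Comparing the mean and median of `estimates` with the estimate from the unbalanced data mirrors the comparison summarized in Appendix Table C1.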

Should we expect success in reducing bias in this WSC analysis?

Should we expect the observed levels of bias reduction from modeling covariates, given what we know about results from prior WSC studies? The study shares several characteristics that have been associated with bias reduction in past WSC studies: (1) covariates included the pretest; (2) adjustments were made for effects of covariates at both the individual and intact group (i.e., site) level; (3) the same causal quantity, the effect of intent to treat (ITT), was being estimated through the experiment and nonexperiment; (4) comparisons were made with sites that were similar in an important respect, in that all schools volunteered to be part of the multisite trial; (5) the experimental samples were relatively large (80 schools and 17,922 students for the combined sample); (6) all variables, including the outcomes, were measured in exactly the same way across the study sample; (7) the AMSTI experiment was well executed based on low attrition of participants; and (8) the same estimator was used to assess impact for the AMSTI experiment and the CGS. However, the study also has some limitations. First, it takes advantage of general demographics for covariates, not ones intended to reflect the selection of individuals into sites. While several of the covariates were related to the outcome or moderated the impact, and therefore adjustment for effects of those covariates would be expected to reduce bias, the question is whether the variances and covariance could have been reduced further if we had stronger covariates more directly associated with the selection of individuals into sites. A second potential weakness is that we conducted the WSC study with knowledge of the results of the experiment, which can potentially lead to bias from choosing methods to confirm expectations (Shadish et al., 2012). However, we do not think this is a major threat to the validity of the results because our goal in conducting the WSC study was not to champion a specific methodology, or confirm an expectation, for successfully replicating experimental results using CGSs, but rather to present a novel application of WSC methods to investigate the potential for successfully generalizing results from CGSs.

Four Technical Points to Consider

Four technical points are relevant to interpreting the results. First, there is a limitation to using matched pairs in the application of the method, as was the case with AMSTI. With matched pairs, we are unable to deconfound the differences across blocks in the impact from the sampling variation of units randomized within the blocks. With AMSTI, some fraction of the variance we are observing may be due to random sampling error at the school level.¹² The presence of statistically significant interactions of block-level covariates with treatment, however, indicates that at least some of the heterogeneity in impact is attributable to systematic differences across blocks on attributes that interact with the program.

Second, there is the issue of deciding whether the reduction in bias from modeling effects of covariates is statistically significant. We conducted a joint significance test using the deviance statistic (Singer & Willett, 2003) to assess if modeling main effects of block-level covariates reduced bias, and separately, to determine whether modeling interactions between block covariates and treatment reduced bias from effect moderation. The joint test of main effects was significant at the α = .01 level for the combined sample and for the minority and nonminority subsamples. The test of interactions was not significant for nonminorities (p = .51), for minorities (p = .08), and for the combined sample (p = .20). The lack of significance for the joint test of the interaction effects must be reconciled with the observation that, after adjusting for the effects of moderators, between-site variation in the treatment effect was reduced and ceased to be significant for the minority and nonminority subsamples. Use of the deviance statistic to determine statistical significance is dependent on the difference between the models compared in the number of effects estimated. Including more effects makes the test less powerful. Without theory as a basis for narrowing the list of covariates, we included all available variables. This may explain why the change in the deviance statistic was not significant. Generally, it is important to identify enough variables (ideally, prior to the experiment) to account for heterogeneity in effects but not so many that it becomes virtually impossible for the resulting reduction in the deviance statistic to be statistically significant.
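The loss of power from adding parameters can be seen in a small Python sketch of the deviance (likelihood ratio) comparison. The deviance values below are hypothetical, chosen only for illustration, and are not taken from the AMSTI analysis; `chi2_sf` is a helper written here that evaluates the chi-square upper-tail probability through a standard series for the regularized incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with `df` degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function
    P(a, z) with a = df / 2 and z = x / 2.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def deviance_test(deviance_reduced, deviance_full, extra_params):
    """Return the drop in deviance and its p value for nested models."""
    drop = deviance_reduced - deviance_full
    return drop, chi2_sf(drop, extra_params)


# Hypothetical deviances: the same drop of 7.0 tested with 1 versus 5 added effects.
drop_1, p_1 = deviance_test(1007.0, 1000.0, extra_params=1)
drop_5, p_5 = deviance_test(1007.0, 1000.0, extra_params=5)
print(round(p_1, 3), round(p_5, 3))
```

With these illustrative numbers, an identical reduction in deviance is significant at conventional levels when one effect is added but not when five are added, which is the pattern described in the paragraph above.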

Third, in this work, we have focused on expressions for the average magnitude of bias when generalizing impact to a given site by inferring the treatment counterfactual using outcomes from comparison sites. In addition to bias, random sampling variation contributes uncertainty to the generalized inference. To address this, we can consider levels of uncertainty from both bias and random sampling variation, for a CGS-based estimate, when inferring impact for a given site, j. Bias in this quantity is given in expression 1 (repeated here for easy reference): $E_{D_{i} \neq j} [E (Y_{i} (q (X_{i, D_{i}})) | D_{i})] - E (Y_{i} (0) | D_{i} = j)$. The first term is an average of performance under assignment to treatment across all sites except the inference site j. The second term is average performance for controls at j. In the sample-based analogue of this quantity, the first term, which involves averaging outcomes across individuals and sites, has associated with it sampling variance at both of these levels: $\frac{τ_{0}}{N - 1} + \frac{σ^{2}}{(N - 1) n}$ (τ₀ is the variation in outcomes at the school level, σ² is the variation in outcomes among students within schools, and n is the average number of students per school). The second term in the difference estimate is the outcome Y_i(0) averaged across individuals at the inference site, with the sampling variance at that level being $\frac{σ^{2}}{n}$.¹³
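The two sampling-variance expressions above can be evaluated with a short Python sketch. Several assumptions are made here and are not stated in the text: N is taken to be the number of sites, the two terms are treated as independent so that their variances add, and the input values for σ² and n are hypothetical. Only τ₀ = 240.80 is borrowed from Table 2, and 40 is the number of blocks in the AMSTI example.

```python
def sampling_variance_of_difference(tau0, sigma2, n_sites, n_per_site):
    """Sampling variance of the comparison-site mean, the inference-site mean,
    and (assuming independence) their difference."""
    # Average across the N - 1 comparison sites: site-level plus student-level variance.
    comparison = tau0 / (n_sites - 1) + sigma2 / ((n_sites - 1) * n_per_site)
    # Average across students at the single inference site.
    inference = sigma2 / n_per_site
    return comparison, inference, comparison + inference


# tau0 from Table 2; sigma2 and n_per_site are hypothetical illustration values.
comparison_var, inference_var, total_var = sampling_variance_of_difference(
    tau0=240.80, sigma2=1000.0, n_sites=40, n_per_site=400
)
print(comparison_var, inference_var, total_var)
```

In this illustration the site-level term τ₀ / (N − 1) dominates the variance of the comparison-site average, because averaging over many students shrinks the within-site contribution.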

The fourth issue, discussed more extensively in the companion paper, and mentioned briefly here, applies to WSC studies generally and concerns the nature of the estimand. The goal of WSC studies is to replicate estimates from experiments using results from CGSs. For replication to be possible, they must be measuring the same causal quantity. Experiments can produce different causal quantities including effect of intent to treat (ITT), the average treatment effect (ATE), the effect of treatment on the treated (TOT), and the local average treatment effect (LATE) (Gennetian, Morris, Bos, & Bloom, 2005). CGS-based estimates typically are concerned with TOT. Therefore, the WSC study conducted here must be qualified as depending on the restrictive condition that the causal quantity from the CGS, which typically is TOT, happens to align with ITT. This is a limitation of the current work; however, we argue in the companion paper, that replication of an ITT estimate through a CGS represents a successful accounting for processes that influence selection into sites, including factors that end up making a difference to program implementation and uptake within sites, which is reflected in ITT estimates. It does not, however, address further selection, for example, processes differentiating program adherents from no-shows within an experiment.

Discussion and Conclusions

In this article, we considered an application of the WSC approach for empirically investigating the incremental generalizability of impact findings from CGSs. We reviewed the main findings from past WSC studies and the steps taken in the companion article to expand the WSC methodology to address external validity of CGS-based results. In this article, we focused on the criteria from WSC studies for evaluating the potential for bias reduction, which we applied to the two empirical examples.

What are the take-aways from the two empirical studies? Results from the Tennessee STAR and AMSTI experiments give a preliminary demonstration of the WSC approach as extended to investigate conditions for successfully drawing incremental generalizations across sites within a study. The STAR data met fewer of the criteria associated with successful reductions of bias in past WSC studies, and this possibly explains why the average magnitude of bias from between-site variation in performance persisted even with covariate adjustments. The assessment of remaining bias from between-site variation in impact was less clear because of the effects of student crossover between conditions, which was consequential to the estimates. Remaining average absolute bias from treatment effect heterogeneity (median values of .09 and .15 SD with samples analyzed according to initial and final assignment, respectively) was smaller than from differences in average performance (approximately .28 SD with either sample).

For the AMSTI experiment, the residual average magnitudes of the two component biases for the combined sample were similar: .07 and .08 SD units attributable to differences in average performance and in impact, respectively. However, the covariance term made a difference: In scale score units, the variance components for average performance and impact were 7.41 and 8.44, while the covariance was −6.61, with all three being statistically significant. The negative covariance led to a reduction in average absolute bias by cancelling the contributions of the component terms, resulting in a value of .04 SD units for the total average absolute bias, which is small by some standards. Therefore, the AMSTI study supports past WSC findings by showing that when many of the conditions associated with successful bias reduction are met, bias can be brought to low levels. The study also adds a new consideration by showing that when generalizing from CGS results, potential bias is captured through three terms. In this case, the negative covariance lowered the total bias, but under a different scenario, a positive covariance would have further compounded the average magnitude of bias, making the average squared bias larger than the sum of the two variance components.

The main question for practitioners is whether CGS-based results should be accepted as valid indicators of what to expect in terms of impact at their site. The current WSC study is a first step in answering this question. More studies of this kind are necessary to establish the kind of track record of results that has emerged through traditional WSC approaches. First, and importantly, this study demonstrates that bias, including from effect heterogeneity, is amenable to being reduced when a study meets the criteria associated with bias reduction in traditional WSC studies. Second, it shows that covariate adjustments reduce but do not eliminate bias. Third, it shows that bias from effect heterogeneity cannot simply be ignored: it was substantial, and in several instances remained significant even with covariate adjustments. Fourth, it shows that biases may be correlated, because confounders may also be moderators. In other words, comparisons between sites may be rendered inaccurate by differences between them on attributes that (1) affect average performance, (2) interact with the program, or (3) do both.

This study, which is based on traditional WSC studies, equips researchers with a methodology for investigating the incremental generalizability of CGS-based results. Further empirical work on this approach may yield practical guidelines concerning strategies for avoiding bias, to help both researchers in designing and analyzing results from CGSs, and practitioners in evaluating the generalizability of the results from previous efficacy studies, for their individual contexts.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Agodini & Dynarski (2004). Are experiments the only option? A look at dropout prevention programs. The Review of Economics and Statistics, 86, 180–194.

Bloom, H. S., Michalopoulos, & Hill, C. J. (2005). Using experiments to assess nonexperimental comparison-group methods for measuring program effect. In Bloom, H. S. (Ed.), Learning more from social experiments (pp. 173–235). New York, NY: Russell Sage Foundation.

Cohen, D. K., Raudenbush, S. W., & Ball, D. L. (2002). Resources, instruction and research. In Mosteller & Boruch (Eds.), Evidence matters: Randomized trials in educational research (pp. 80–119). Washington, DC: Brookings Institution Press.

Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27, 724–750.

Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27, 557–577.

Fraker & Maynard (1987). The adequacy of comparison group designs for evaluations of employment-related programs. The Journal of Human Resources, 22, 194–227.

Gennetian, L. A., Morris, P. A., Bos, J. M., & Bloom, H. S. (2005). Constructing instrumental variables from experimental data to explore how treatments produce effects. In Bloom, H. S. (Ed.), Learning more from social experiments (pp. 75–114). New York, NY: Russell Sage Foundation.

Glazerman, Levy, & Myers (2003). Nonexperimental versus experimental estimates of earnings impacts. American Academy of Political and Social Science, 589, 63–93.

Jaciw, A. P. (2016). Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology. Evaluation Review, 40, 199–240.

Lalonde (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76, 604–620.

Mosteller (1995). The Tennessee study of class size in the early school grades. The Future of Children, 5, 113–127.

Newman, Finney, P. B., Bell, Turner, Jaciw, A. P., Zacamy, J. L., & Feagans Gould (2012). Evaluation of the effectiveness of the Alabama Math, Science, and Technology Initiative (AMSTI) (NCEE 2012–4008). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Nye, Hedges, L. V., & Konstantopoulos (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37, 123–151.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Thousand Oaks, CA: Sage.

Rhodes (2014). Pairwise cluster randomization: An exposition. Evaluation Review, 38, 217–250.

Shadish, W. R., Clark, M. H., & Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment. Journal of the American Statistical Association, 103, 1334–1343.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Shadish, W. R., Steiner, P. M., & Cook, T. D. (2012). A case study about why it can be difficult to test whether propensity score analysis works in field experiments. Journal of Methods and Measurement in the Social Sciences, 3, 1–12.

Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 23, 323–355.

Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis. New York, NY: Oxford University Press.

Smith, J. A., & Todd, P. E. (2005). Does matching overcome Lalonde’s critique of nonexperimental estimators? Journal of Econometrics, 125, 305–353.

Tipton (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38, 239–266.

Wilde, E. T., & Hollister (2007). How close is close enough? Evaluating propensity score matching using data from a class size reduction experiment. Journal of Policy Analysis and Management, 26, 455–477.