Abstract
In the current socio-political climate, there is added urgency to evaluate whether program impacts are distributed fairly across important student groups in education. Both experimental and quasi-experimental designs (QEDs) can contribute to answering this question. This work demonstrates that QEDs that compare outcomes across higher-level implementation units, such as schools, are especially well-suited to contributing evidence on differential program effects across student groups. Such designs, by differencing away site-level (macro) effects, on average produce estimates of the differential impact that are closer to experimental benchmark results than are estimates of average impact based on the same design. This work argues for the importance of routine evaluation of moderated impacts, describes the differencing procedure, and empirically tests the methodology with seven impact evaluations in education. The hope is to encourage broader use of this design type to develop the evidence base for differential program effects more efficiently, particularly for underserved students.
Introduction
Experiments and quasi-experiments are used primarily to estimate average program impacts for both full study samples and subgroups of interest. Such evaluations occasionally also assess whether the effects of programs vary across subgroups of individuals. Both kinds of results are important: the first tests whether a program has a positive impact on the study sample at large, and the second evaluates whether benefits are distributed uniformly.
Even though both types of results have value, much more is known about conditions for unbiased estimation of average impact than of differential impact. More specifically, a great deal of work has been done to empirically evaluate conditions under which nonexperimental comparison group designs (CGDs) replicate average impact findings from experiments. By one account, 66 such Within Study Comparison (WSC) evaluations had been conducted through 2018 (Wong, Steiner, & Anglin, 2018), with more recent ones (e.g., Unlu, Lauen, Fuller, Berglund, & Estera, 2021) adding to the tally. To my knowledge, no comparable studies have empirically evaluated the conditions under which CGD-based estimates of differential impact replicate experimental benchmark results. This limits our understanding of the range of evaluation designs that provide credible evidence about differential impacts.
In this work, I begin to address the above-noted deficit in the literature. In doing so, I empirically demonstrate a principle understood by methodologists concerning reduced relative bias; that is, with CGDs that involve the comparison of outcomes across sites, estimates of differential impact potentially have less bias (are more accurate) than estimates of average impact based on the same design. This has important implications for the acceptability of results from CGDs, especially in their potential to inform us about whether given programs achieve more benefit for some groups than others.
In this work, I do the following: (a) discuss why it is important to evaluate differential impacts of programs, (b) describe how WSC methods can be extended to evaluate the accuracy of differential impact results from CGDs, (c) provide a graphical and algebraic argument for why CGDs that make between-site comparisons can yield less-biased results for impact differences across person-level subgroups than for average impacts, (d) apply a WSC study that compares average and differential impact results from CGDs against experimental benchmarks, (e) review possible alternative explanations for the empirical findings, and (f) discuss the implications of the results and possible extensions of the work.
Background
Reasons to Evaluate Differential Program Impacts
I first consider several reasons why it is important to evaluate the differential impacts of programs.
Analysis of differential effects is the first step to understanding the conditions under which impacts are observed; that is, for addressing what works, for whom, and why. In particular, the analysis of effects of moderators—characteristics of individuals or settings measured at the start of a study that are associated with differences in program impact—can lead to a deeper analysis of why a program benefits some subgroups of individuals more than others.
In sum, there are several fundamental reasons to evaluate differential program impacts. These reasons provide the impetus to better understand the potential of different study designs, including CGDs, to yield accurate estimates of these effects.
The Problem of Internal Validity Bias in CGDs
In other CGDs, the comparison may be between individuals belonging to different higher-level units, such as between classes in a school that opt into a program and are led by one group of teachers, and other classes in the same school that do not join the program and are led by a different group of teachers. Alternatively, the comparison may be between students in one or more treatment schools in which everyone receives treatment and one or more comparison schools where no one does. (For convenience, I will refer to such level-2 units as “sites,” “clusters,” or “locations”.) In this situation, where all students in one set of sites experience treatment and all students in a different set of sites do not, avoiding bias requires satisfying the assumption of conditionally unconfounded assignment of individuals into locations (Hotz et al., 2005). That is, it is necessary to adjust for the effects of person variables that influence selection into treatment by way of selection into locations. In these cases, bias may also arise from site-level factors that influence outcomes, so-called “macro factors” (Hotz et al., 2005), which are “variables whose values are constant within a location, or at least within sub-locations” (p. 248). Macro factors may act as confounders in CGDs that make comparisons across sites. This work focuses on CGDs in which the comparison is of this sort; that is, between outcomes for individuals who belong to treatment clusters, and outcomes for individuals who belong to control clusters.
The Empirical Study of Conditions for Bias in Average and Differential Treatment Effects from CGDs
As was noted in the Introduction, through 2018, 66 WSC studies had been carried out in various fields (Wong et al., 2018), including economics and education, with additional ones since then (e.g., Unlu et al., 2021). These works have provided detailed results and guidance for implementing CGDs to yield internally valid impact findings. Among the results, WSCs emphasize the need to match treatment and comparison cases locally in terms of geographic distance, market conditions, and time, to ensure that the groups being compared are constant, or at least similar, on macro variables (Bloom, Michalopoulos, & Hill, 2005). As an analytic solution to the problem of macro variables, Hotz et al. (2005) propose using multiple locations to support model-based adjustments for the effects of location-level characteristics. This idea is important to the current work.
It is easy to see this as an extension of WSC methods once it is recognized that the differential impact is a type of average impact; that is, a differential impact between subgroups can be expressed as the average impact of the program on the performance gradient between the subgroups. For example, the difference in the average impact of a program between two strata of socioeconomic status (SES) can be equivalently expressed as the average impact of treatment on the performance differential between the strata of SES. This fact leads to a straightforward extension of the WSC approach. This idea is further developed below.
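The equivalence can be written out in one line of algebra. The notation below is an assumption introduced here for illustration (potential outcomes Y(1) and Y(0) under treatment and control, conditioned on membership in subgroup A or B) and is not reproduced from the article's own equations.

```latex
\begin{align*}
\Delta_{A-B}
  &= \bigl(E[Y(1)\mid A] - E[Y(0)\mid A]\bigr)
   - \bigl(E[Y(1)\mid B] - E[Y(0)\mid B]\bigr) \\
  &= \bigl(E[Y(1)\mid A] - E[Y(1)\mid B]\bigr)
   - \bigl(E[Y(0)\mid A] - E[Y(0)\mid B]\bigr)
\end{align*}
```

The first line is the difference between the two subgroup average impacts. The second line regroups the same four terms as the treated-condition gradient minus the control-condition gradient, that is, the average impact of treatment on the A-minus-B performance gradient.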
An Assertion
An assertion of this work is that CGDs that use cross-site comparisons yield estimates of differential impact across subgroups of persons that are less prone to selection bias than are average impact quantities based on the same design. Specifically, when cross-site comparisons are used to estimate both average impact and differential impact between subgroups of persons, the latter estimate is less prone to selection bias attributable to macro factors identified at the site level.
In the next section, I do the following: (1) develop the argument above using algebraic and graphical models, (2) conduct a WSC study to empirically test the assertion using outcomes from seven experiments in education, and (3) review limitations and implications of the result. In the process, I will also demonstrate how certain statistics commonly used with experimental evaluations may also be interpreted as metrics for summarizing levels of bias in the context of specific types of WSC studies.
Evaluating Bias in Effect Estimates Involving Cross-Site Comparisons
Preliminary Expressions
The goal is to develop expressions for average bias in a CGD that compares outcomes across sites. The first scenario involves just two locales (L = i and L = j). The true average impact on outcome Y at i is represented as follows:
The true differential impact across person-level subgroups A and B (assumed to be mutually exclusive and exhaustive of individuals at the site) is as follows:
In this scenario, the comparison group-based average impact for site i is biased by the following amount:
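As a reading aid, the three quantities just described can be written in one possible notation. All symbols here are assumptions made for illustration (Y(1) and Y(0) for potential outcomes under treatment and control, L for location, and the labels ATE, DATE, and Bias); they are not reproduced from the article. The sketch further assumes that site i is treated and that the untreated site j supplies the comparison outcomes, so the comparison-group estimate is E[Y(1) | L = i] − E[Y(0) | L = j].

```latex
\begin{align*}
\text{ATE}_i  &= E[\,Y(1) - Y(0) \mid L = i\,] \\
\text{DATE}_i &= E[\,Y(1) - Y(0) \mid A,\, L = i\,]
               - E[\,Y(1) - Y(0) \mid B,\, L = i\,] \\
\text{Bias}_{\text{avg}} &= E[\,Y(0) \mid L = i\,] - E[\,Y(0) \mid L = j\,]
\end{align*}
```

Under these assumptions, the bias in the average impact is simply the cross-site difference in untreated mean outcomes, which is why control-condition outcomes carry the information needed to assess it.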
Model-Based Expressions for Bias
Expressions for bias using rudimentary models that figure in performance differences between locations are derived in Supplement 1. There are two scenarios. The first assumes that average achievement, in the absence of treatment at the inference site, i, is greater than at the comparison site, j, by a constant value, Q, and that this difference is the same across subgroups A and B:
The derivation in Supplement 1 shows that Q is “differenced away” in calculating the difference in performance gradients between sites (i.e., it does not appear in the expression for
Based on these derivations, in the empirical part of this work, I evaluate the average magnitude of each type of bias across multiple sites (Research Question [RQ] 1).
It is noteworthy that, because Q and K each assumes a constant value at each site, when estimation involves a comparison between just two sites, neither effect can be eliminated through site-level covariate adjustments. However, with multiple sites, model-based adjustments may be used to reduce each form of bias (Hotz et al., 2005). RQ2 addresses whether adjustments for the effects of macro variables reduce either form of bias.
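The first scenario can be illustrated numerically. The sketch below is a minimal example with hypothetical subgroup means (not values from the seven studies); the function name and arguments are invented for illustration. It assumes that untreated outcomes at the inference site i exceed those at the comparison site j by a constant Q in both subgroups, and that the subgroup share is the same at both sites.

```python
# Hypothetical untreated (control-condition) subgroup means; all numbers are
# illustrative assumptions, not values from the study.

def cross_site_bias(y0_a_j, y0_b_j, share_a, q):
    """Return (bias in average impact, bias in differential impact) for a
    two-site comparison in which untreated outcomes at inference site i
    exceed those at comparison site j by a constant q in both subgroups."""
    # Untreated subgroup means at the inference site i: shifted up by q.
    y0_a_i = y0_a_j + q
    y0_b_i = y0_b_j + q

    # Site-level untreated means (same subgroup share at both sites, by assumption).
    mean_i = share_a * y0_a_i + (1 - share_a) * y0_b_i
    mean_j = share_a * y0_a_j + (1 - share_a) * y0_b_j

    # Bias in average impact: difference in untreated means across sites.
    bias_average = mean_i - mean_j
    # Bias in differential impact: difference in untreated A-minus-B gradients.
    bias_differential = (y0_a_i - y0_b_i) - (y0_a_j - y0_b_j)
    return bias_average, bias_differential


bias_avg, bias_diff = cross_site_bias(y0_a_j=50.0, y0_b_j=44.0, share_a=0.5, q=3.0)
print(bias_avg, bias_diff)
```

In this setup the bias in the average impact equals Q, whereas the gradient comparison subtracts Q from itself and leaves no bias. The sketch covers only the constant-offset scenario and says nothing about site differences that vary by subgroup.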
Methods: Evaluating Bias in Cross-Site Comparisons of Performance Gradients
A WSC approach is used with seven experiments in education to evaluate bias in estimates of average and differential (between person-level subgroups) impact, in the case where comparisons of outcomes are made between level-2 units (i.e., between clusters or sites). Appendix A describes in more detail how the method used fits the WSC framework. In short, consistent with methods of WSC studies, biases in average and differential impacts from CGDs are summarized by comparing control outcomes (average performance or the performance gradient between subgroups) at the inference site to corresponding outcomes at the comparison sites. The success of strategies to reduce bias is then evaluated.
For the set of experiments used in this work, level-2 units can be schools (e.g., Study 1), teachers (e.g., Study 3), grade-level teams (e.g., Study 2), or classes (e.g., Study 4). Therefore, the analysis covers a range of examples with definable level-2 units across which comparisons are made. (Future studies may focus more specifically on empirical examples in which comparisons are between the level-2 units of only a certain kind, such as between schools.) Next, we describe the steps of the approach.
The standardized RMSB in the comparison of means is as follows:
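A minimal computational sketch of such a quantity follows. It assumes, as an illustration rather than as the article's formula, that RMSB is taken as the square root of a between-site variance component (for site means or for site-specific gradients) and is standardized by dividing by the outcome standard deviation. The function name and the numbers are hypothetical.

```python
import math


def standardized_rmsb(tau, sd):
    """Square root of a between-site variance component (tau), expressed in
    units of the outcome standard deviation (sd). Assumed form, for
    illustration only."""
    if tau < 0 or sd <= 0:
        raise ValueError("tau must be >= 0 and sd must be > 0")
    return math.sqrt(tau) / sd


# Hypothetical variance components on an outcome with SD = 1 (not study values).
tau_means, tau_gradient, outcome_sd = 0.09, 0.01, 1.0
rmsb_means = standardized_rmsb(tau_means, outcome_sd)
rmsb_gradient = standardized_rmsb(tau_gradient, outcome_sd)
print(round(rmsb_means, 3), round(rmsb_gradient, 3))
```

Under this assumed form, if the outcome variance is decomposed into between-site and within-site parts, the standardized RMSB for site means equals the square root of the intraclass correlation coefficient.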
Research Questions
I address the following questions:
RQ1. Based on up to 11 outcomes across seven studies, what are the estimated values of standardized RMSB, when comparing performance means and gradients across sites?
RQ2. To what extent are levels of bias decreased by adjusting for the main effects of macro variables? (The exploration of this question is intended as a proof of concept because it can be applied only to Study 1, and is limited to just gender and minority status among the moderators.)
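The logic of the adjustment examined under RQ2 can be sketched with a deliberately simple example: regress site-level control means on a single site-level (macro) covariate and compare the spread of site means before and after. The data, the function name, and the use of one covariate with ordinary least squares are all assumptions made for illustration. The sketch ignores sampling error in the site means and is not the hierarchical model used in the study.

```python
from statistics import mean, pvariance


def residualize(site_means, macro):
    """Residuals from a simple least-squares regression of site means on one
    site-level (macro) covariate."""
    x_bar, y_bar = mean(macro), mean(site_means)
    sxx = sum((x - x_bar) ** 2 for x in macro)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(macro, site_means))
    slope = sxy / sxx
    return [y - (y_bar + slope * (x - x_bar)) for x, y in zip(macro, site_means)]


# Hypothetical site-level control means and a hypothetical macro covariate.
site_means = [48.0, 50.0, 53.0, 55.0, 59.0]
macro = [1.0, 2.0, 3.0, 4.0, 5.0]

var_before = pvariance(site_means)
var_after = pvariance(residualize(site_means, macro))
print(var_before, var_after)
```

With real data, the number of site-level covariates that can be adjusted for is limited by the number of sites, so gains of this kind should be read cautiously when the model is heavily parameterized.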
Data
The data consist of achievement outcomes from seven cluster randomized experiments in education. Details of the studies are included in Tables B1 and B2 in Appendix B. They include randomized controlled trials (RCTs) of programs designed to promote inquiry-based Science, Technology, Engineering, and Mathematics practices, second-language development, English language development, math skills with a focus on algebra, math skills generally, and language development of lower-performing readers. Outcomes were assessed using established instruments including state tests and performance measures developed by testing agencies. For several of the programs, outcomes were available for more than one subject area, yielding up to 11 data points. I focus on four individual-level characteristics to assess differential impacts: gender, SES (defined by eligibility for free or reduced-price lunch), minority status (defined as non-White in three studies, and as Black students in one study), and English Speaker status (defined as whether a student is designated English Proficient). The sample sizes of clusters and persons, and the proportions of individuals in each subgroup, are displayed in Tables B1 and B2 in Appendix B. These subgroups were selected for the empirical demonstration of this work because school districts regularly collect information about student membership in the categories, making them commonly available, and because they represent categories across which it is generally important to establish whether program impacts and benefits are apportioned evenly.
Estimation: Hierarchical Linear Models
Addressing RQ2 involved estimation of variance components and corresponding values of RMSB using science outcomes data from Study 1 only. This was done separately for gender and minority status as the moderators, both prior to, and after adjusting for the effects of site-level macro variables. For this school-randomized trial, seven models were evaluated for each of the two moderator variables: (a) with no adjustments for site-level variables, (b) with main effects of site-level averages of teacher covariates only, (c) with main effects of locale characteristics only, (d) with main effects of site-level averages of student baseline achievement only, (e) with main effects of site-level averages of a broader set of student covariates (including the pretest) only, (f) with main effects of site-level averages of teacher and student covariates only (i.e., all variables under [b] and [e]), and (g) with main effects of all site-level covariates. (The specific covariates are listed in Table B3 in Appendix B.)
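To make the idea of a variance component concrete, the sketch below estimates between-site and within-site variance for site means with a one-way ANOVA (method-of-moments) estimator for a balanced design. This is a simplified stand-in chosen for illustration, not the hierarchical linear models or the SAS PROC MIXED routine used in the study, and it does not cover the random-slope (gradient) components. The function name and data are hypothetical.

```python
from statistics import mean


def variance_components(sites):
    """One-way ANOVA (method-of-moments) estimates for a balanced design.

    sites: list of equal-length lists of outcomes, one list per site.
    Returns (tau, sigma2): between-site and within-site variance estimates.
    """
    n = len(sites[0])
    j = len(sites)
    if j < 2 or n < 2 or any(len(s) != n for s in sites):
        raise ValueError("need at least 2 sites, each with the same size n >= 2")

    site_means = [mean(s) for s in sites]
    grand_mean = mean(site_means)

    ms_between = n * sum((m - grand_mean) ** 2 for m in site_means) / (j - 1)
    ms_within = sum(
        (y - m) ** 2 for s, m in zip(sites, site_means) for y in s
    ) / (j * (n - 1))

    # A negative moment estimate is truncated at zero, the boundary of the
    # parameter space for a variance.
    tau = max(0.0, (ms_between - ms_within) / n)
    return tau, ms_within


# Hypothetical outcomes for two sites with two students each.
tau_hat, sigma2_hat = variance_components([[1, 3], [5, 7]])
print(tau_hat, sigma2_hat)
```

When sites differ no more than within-site noise would predict, the moment estimate is negative and is set to zero, which is loosely analogous to a boundary estimate of zero for a variance component.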
Results
Research Question 1
Estimates of

Estimates of standardized root mean squared bias (RMSB) for average impact and differential impact before adjusting for effects of student-level covariates.
Note. For estimates that lie on the x-axis, SAS PROC MIXED returned an estimate of zero with no p-value, which indicates that the estimation routine reached a boundary value for the parameter and that a model with fewer random effects is recommended (Singer & Willett, 2003). One interpretation is that the random effect is negligibly different from zero; an alternative interpretation is that the estimate is indeterminate.
Summary of Estimates of RMSB/SD by Moderator.
Note. RMSB = root mean squared bias; SES = socioeconomic status; ELL = English Language Learner; SD = standard deviation; ND = not determinate.
Summary of Estimates of RMSB/SD by Study.
Note. RMSB = root mean squared bias; SD = standard deviation.

Estimates of standardized root mean squared bias (RMSB) for average impact and differential impact after adjusting for effects of student-level covariates.
Note. For estimates that lie on the x-axis, SAS PROC MIXED returned an estimate of zero with no p-value, which indicates that the estimation routine reached a boundary value for the parameter and that a model with fewer random effects is recommended (Singer & Willett, 2003). One interpretation is that the random effect is negligibly different from zero; an alternative interpretation is that the estimate is indeterminate.
With gender as the moderator, the median values of estimates of
Overall, two main trends are observed in Tables 1 and 2: (a) on average
Research Question 2
RQ2 addresses whether adjusting for differences across sites in values of macro variables lowers the estimated values of either

Levels of estimated root mean squared bias (RMSB) for Study 1 expressed in standardized effect size units with and without adjustment for effects of macro variables.
Estimates of Standardized Root Mean Squared Bias (RMSB) for Average and Differential Impact by Gender with Adjustments for Macro Variables.
Note. The models adjust for the following site-level covariates:
Model 1: no adjustments for site-level variables.
Model 2: with main effects of site-level teacher-based covariates only.
Model 3: with main effects of locale characteristics only.
Model 4: with main effects of site-level averages of student baseline achievement only.
Model 5: with main effects of site-level student-based covariates (including the pretest) only.
Model 6: with main effects of site-level teacher-based and student-based covariates only (i.e., all variables under [2] and [5]).
Model 7: with main effects of all site-level covariates.
Estimates of RMSB for Average and Differential Impact by Minority Status with Adjustments for Macro Variables.
Note. RMSB = standardized root mean squared bias; ICC = intraclass correlation coefficient. The models adjust for the following site-level covariates:
Model 1: no adjustments for site-level variables.
Model 2: with main effects of site-level teacher-based covariates only.
Model 3: with main effects of locale characteristics only.
Model 4: with main effects of site-level averages of student baseline achievement only.
Model 5: with main effects of site-level student-based covariates (including the pretest) only.
Model 6: with main effects of site-level teacher-based and student-based covariates only (i.e., all variables under [2] and [5]).
Model 7: with main effects of all site-level covariates.
Adjusting for effects of site-level (macro) variables lowers estimates of
Consistent with results addressing RQ1, to start, the estimate of
A point of caution is needed about the results of highly saturated models. Specifically, the sample includes 55 sites, and Models (f) and (g) include combinations of covariates from previous models, making them highly parameterized for the available sample of sites and therefore susceptible to bias from overspecification. If conclusions are limited to Models (a) to (e), adjusting for effects of site-level macro variables lowers estimates of
Reflection on Empirical Results
Earlier in this work, basic algebraic models were applied to demonstrate how, in principle, measures of differential impact between person-level subgroups can be less biased than measures of average impact for CGDs that involve comparisons between clusters. The argument hinged on the effects of macro variables being differenced away in the comparison of differential performance, but not in the comparison of average performance. The empirical analysis summarized estimates of biases in average and differential impacts for CGDs that involve between-cluster comparisons. The results are consistent with expectations based on the theoretical argument. In most cases, differential effects have lower estimated bias than average impact, both prior to and after covariate adjustments; however, even following covariate adjustment, the magnitudes of standardized bias depart from zero. There are several possible reasons for this, which we consider next.
First, estimates of bias reflect random sampling error. In the empirical examples, HL models separated random sampling variation in outcomes among individuals within clusters (the sigma-squared) from variation in outcomes across clusters used to summarize RMSB (the tau values). However, the estimates of the tau terms also reflect sampling error. That is, for a given study, the sample of clusters may be considered a single draw from a hypothetical population of clusters, and any estimate based on that draw reflects site-level random sampling variance, which is reflected in the standard errors of the estimates of the tau terms.
Second, there may be a residual bias that is attributable to the effects of macro variables that are not accounted for in the analysis (i.e., portions of the K and Q terms in Equations 11 and 12 that are not eliminated through differencing methods or regression adjustments using available variables).
Third, the methods assume that everyone in each cluster is either in treatment or control. In reality, not everyone may participate in a program at a given site. In other words, bias may arise from the selection of individuals into sites, as well as into the program (or not) within sites. Additionally, bias may arise from the mobility of individuals, including through attrition. Some of the variations in estimates of bias may reflect these factors.
Fourth, variation in results across studies may reflect that the results include findings from several studies where clusters were identified at different levels. For example, in one study, the comparisons were across schools, and in another, among classes within schools. The amount of bias of each type (and its amenability to being reduced) may depend on the level of the comparison. Given the close relationship between the ICC and the quantity used to summarize
A fifth factor that may add variability to the results, and is a limitation of this work, is that for all studies except Study 1, we could not limit estimates of average effects and achievement gradients to control cases only. The results are based on outcomes for treatment and control members combined. It is possible that variances assessed across both conditions are larger than if they had been assessed in the control condition only. This may be expected if, for instance, treatment program implementation varies across sites. However, if control performance also varies depending on differences in what the counterfactual programs are, or from differences in the implementation of a dominant counterfactual program, then similar cluster-level variances across treatment and control may be expected:
Importantly, despite the heterogeneity of the studies (e.g., with clusters defined at different levels), and the specific limitations described above, a similar trend of estimates of
Replication of the methods of this work, including for a larger number of contexts, outcomes, and moderators, is needed to obtain more certainty about the trends observed in this work. This could be accomplished through regular reporting of both the ICC and the moderator gap variance ratio in experimental impact evaluations. In most cases this would require little additional effort because the relevant variance components are routinely estimated. (In the history of WSC studies, each study provides a small number of data points for evaluating average absolute bias from CGDs. The compilation of results from many WSC studies provides a basis for drawing more-firmly established rules about levels of, and conditions for, bias in CGDs. This work should be considered as preliminary in a line of similar potential studies of the question, but as applied to the evaluation of differential impacts across subgroups of persons).
As a final note, the role of macro effects in reducing bias was explored through only a single study as proof of concept, and therefore, is even more in need of replication before drawing firm conclusions.
Conclusions
This study provides additional justification for routinely evaluating differential program impacts across person-level subgroups. I show that for CGDs like those considered in this work, bias in estimates of differential impact may be relatively low, suggesting that such designs can increase the pool of reliable evidence concerning differential program effectiveness. Estimates of differential impact should be of interest to program developers and stakeholders, given their importance for the reasons described at the beginning of this work.
The current study may be considered one of several that address emergent questions about conditions for and levels of impact heterogeneity and differential impacts of programs. The findings include: (a) differential impacts across subgroups of individuals are often in the same range of magnitude or larger than average impacts (Jaciw, Lin, & Ma, 2016), (b) impact heterogeneity across study sites is common in impact evaluations in education and jobs training (Weiss et al., 2017), and (c) adequate statistical power for evaluating differential impacts across subgroups of persons may be achievable with cluster designs (Bloom et al., 2005; Jaciw, Lin, et al., 2016). These efforts, and the current work, hopefully, will encourage the routine evaluation of differential impacts for important subgroups, whether through CGDs or randomized experiments designed to test full-sample impacts.
This work raises additional questions that are outside the scope of the current effort, and are earmarked for future investigation. For example, it is important to know whether the reduced bias in estimates of differential impact is robust to differences in levels of balance on subgroup categories. That is, does bias increase when the distribution of categories is more imbalanced, for instance with a low- to high-SES ratio of 20:80 compared to a 50:50 ratio for gender? Furthermore, does bias increase as the variance across clusters in the ratio for the two categories of a moderator increases? I explored the first question preliminarily by plotting estimates of
The results, prior to and after adjusting for baseline covariates, are displayed by moderator in Figure 4. Prior to covariate adjustments, the correlation between

Plot of estimates of standardized bias against the average balance on four moderators before covariate adjustments (left) and after (right).
This work also prompts general questions about directions for research about differential impacts—questions that are relevant for RCTs as well as CGDs. For example, given many possible options, how should we prioritize subgroups? One possibility is to select subgroups that are expected to experience different impacts based on theory. Another option is to routinely measure differential effects using standard categories. For example, it is very common in educational research to report achievement outcomes for students according to English Learner status, gender, and ethnicity. Routine reporting of subgroups and differential impacts for these categories would greatly expand what is known about the potential impacts of different kinds of programs for specific subgroups of individuals. A third option is to assess differential impacts routinely for traditionally underserved groups to monitor for an equitable distribution of benefits and to inform program improvement to reduce deficits (Nguyen & Jaciw, 2021). However, this requires sensitivity to, and recognition of, the fact that subgroup categories “are neither ‘natural’ nor given” and should be critically evaluated in their uses (Garcia, López, & Vélez, 2018). Based on recently developed methods, a fourth option is to identify subgroups using data-driven approaches that stratify subpopulations, depending on the magnitude of their treatment effects (e.g., Athey & Imbens, 2016).
I mention one more possible area for future work. WSC studies often extend findings about average absolute bias to consideration of whether departures of nonexperiment-based results, including CGDs, from experimental benchmarks, are large enough to matter for policy (Krueger, 1999; Krueger & Hanushek, 2000; Wilde & Hollister, 2007). Such questions are also relevant to differential impact findings. For example, we note that almost all of the averages of RMSB/SD in Tables 1 and 2 exceed .05 SD, a magnitude that represents a substantial proportion of annual growth in achievement (Bloom, Hill, Black, & Lipsey, 2008). Other substantive criteria to support decision-making are available. For instance, where equity of impacts across subgroups is the goal, the question should be about how much bias can be tolerated, if any, that would result in a wrong conclusion about differential benefits for underserved groups.
In closing, I emphasize that given the sociocultural challenges currently facing the US, there is added impetus to focus impact evaluations on not only the question of what works, but also the question of whether programs apportion benefits equally across important subgroups, and if not, why. The current work provides a demonstration in favor of expanding tests of differential impact for a wider range of evaluation designs and is consistent with the societal priority to achieve parity in program effectiveness for beneficiaries.
Supplemental Material
Supplemental material, sj-pdf-1-aje-10.1177_10982140231160561 for Do Social Programs Help Some Beneficiaries More Than Others? Evaluating the Potential for Comparison Group Designs to Yield Low-Bias Estimates of Differential Impact by Andrew P. Jaciw in American Journal of Evaluation
Footnotes
Appendix A. Application of a WSC Approach to Evaluate Bias in Cross-Site Comparisons of Performance Gradients
Appendix B. Details of the Seven Empirical Studies
Estimates of Site-Level Variance Components (Before and After Covariate Adjustments).
| Study | Outcome | Average performance | Gender | SES | Minority status | ELL |
|---|---|---|---|---|---|---|
| Before adjustment | | | | | | |
| 1 | Science | 0.091 | 0.014 | — | 0.051 | — |
| 2 | Reading | 0.057 | 0.004 | 0.026 | 0.197 | 0.080 |
| 2 | Reading | 850.859 | 19.275 | 8.261 | 699.412 | 919.699 |
| 3 | Reading | 173.640 | 123.376 | 249.798 | — | — |
| 3 | Listening/speaking | 60.121 | 61.107 | 42.380 | — | — |
| 3 | Writing | 142.250 | 29.524 | 76.046 | — | — |
| 3 | All academic | 155.040 | 98.040 | 133.380 | — | — |
| 4 | Math | 1163.100 | 361.392 | — | — | 390.469 |
| 5 | Reading | 92.560 | 6.566 | 26.052 | — | 66.508 |
| 6 | Reading | 0.442 | 0.000ᵃ | 0.000ᵃ | 0.132 | 0.200 |
| 7 | Math | 412.160 | 40.071 | 110.673 | 11.449 | — |
| After adjustment | | | | | | |
| 1 | Science | 0.033 | 0.011 | — | 0.040 | — |
| 2 | Reading | 0.001 | 0.001 | 0.025 | 0.063 | 0.006 |
| 2 | Reading | 112.897 | 1.101 | 0.000ᵃ | 0.000ᵃ | 52.318 |
| 3 | Reading | 48.741 | 9.139 | 56.357 | — | — |
| 3 | Listening/speaking | 5.914 | 0.000ᵃ | 0.000ᵃ | — | — |
| 3 | Writing | 93.044 | 16.998 | 0.000ᵃ | — | — |
| 3 | All academic | 55.860 | 37.620 | 0.000ᵃ | — | — |
| 4 | Math | 116.310 | 186.927 | — | — | 207.696 |
| 5 | Reading | 24.358 | 4.024 | 0.000ᵃ | — | 0.847 |
| 6 | Reading | 0.170 | 0.000ᵃ | 0.032 | 0.096 | 0.009 |
| 7 | Math | 106.856 | 20.990 | 11.449 | 0.000ᵃ | — |

Note. SES = socioeconomic status; ELL = English Language Learner.
ᵃ SAS PROC MIXED gives an estimate of zero with no p-value. (This indicates that the estimation routine has reached a boundary value for estimating the parameter [i.e., zero], and a model with fewer random effects is recommended [Singer & Willett, 2003].) The interpretation of this result is that the random effect is negligibly different from zero.
Appendix C. HL Models Used in Analysis
Acknowledgement
I would like to thank Denis Newman for the many years of thoughtful discussions about the importance of understanding and responding to subgroup and differential impacts in educational research and evaluation with a view to equity and social justice, and for inspiring me to pursue work of this kind.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: The research associated with Study 1 was supported by the U.S. Department of Education’s Investing in Innovation program, through Award Number U411B140026. The relevant results were obtained during the course of the grant.