Abstract
In this chapter, we describe and compare the standards for evidence used by three entities that review studies of education interventions: Blueprints for Healthy Youth Development, Social Programs that Work, and the What Works Clearinghouse. Based on direct comparisons of the evidence frameworks, we identify key differences in the level at which effectiveness ratings are granted (i.e., intervention vs. outcome domain), as well as in how each entity prioritizes intervention documentation, researcher independence, and sustained versus immediate effects. Because such differences in priorities may result in contradictory intervention ratings between entities, we offer a number of recommendations for a common set of standards that would harmonize effectiveness ratings across the three entities while preserving differences that allow for variation in user priorities. These include disentangling study rigor from intervention effectiveness, ceasing vote counting procedures, adding replication criteria, adding fidelity criteria, assessing baseline equivalence for randomized studies, making quasi-experiments eligible for review, adding criteria for researcher independence, and providing effectiveness ratings at the level of the outcome domain rather than the intervention.
Despite a history that can be tracked to 1867, the U.S. Department of Education is a relative newcomer to applying evidence standards to education research. In fact, it was not until criticisms about the failure of the education research community to accumulate knowledge became overwhelming, including those aired in the U.S. House of Representatives, that federal action was taken (Fuhrman, 2001; National Research Council, 1992). These actions included the establishment of the Institute of Education Sciences (IES) via the Education Sciences Reform Act (U.S. Congress, 2002) and shortly afterward its large investment in the What Works Clearinghouse (WWC; IES, 2020a).
Toward the accumulation of knowledge about effective interventions, the WWC seeks to review intervention research in education and provide information that will support evidence-based decision making. This entails locating and reviewing impact studies conducted for a given intervention and rating those studies by applying standards of research quality. In addition, for interventions with impact studies that meet standards, an effectiveness rating is also given.
Philanthropic entities in the United States are also involved in vetting impact studies toward more informed programmatic decision making in education. For example, the Laura and John Arnold Foundation has invested in Blueprints for Healthy Youth Development (Blueprints; Mihalic & Elliott, 2015; University of Colorado Boulder Institute of Behavior Science, 2019). Blueprints was founded in 1996 as a list of effective delinquency and drug abuse prevention programs with funding from several entities, including the Centers for Disease Control and Prevention, the Pennsylvania Commission on Crime and Delinquency, the Colorado Division of Criminal Justice, and the Office of Juvenile Justice and Delinquency Prevention, which remained a long-term funder. In 2011, Blueprints expanded its scope to include mental and physical health and academic success outcomes, with funding from the Annie E. Casey Foundation. Finally, since 2016, Arnold Ventures has supported three major expansions to the initiative, including certification of evidence-based practices and policies, a focus on adult crime offenders, and a registry of noncertified interventions. The Arnold Ventures Group has also invested in Social Programs that Work (SPW; Arnold Ventures’ Evidence-Based Policy Team, 2019). SPW was founded as the Coalition for Evidence-Based Policy in 2001, sponsored by the Council for Excellence in Government. The coalition joined the Arnold Foundation as SPW in April of 2015 (Sparks, 2015).
In a similar vein to the WWC, these philanthropic entities apply unique sets of standards that result in a tiered evidence designation. However, unlike the WWC, neither Blueprints nor SPW distinguishes between design quality and favorable impacts in its tier designations. Common to Blueprints, SPW, and the WWC is the intended role each plays in helping decision makers. This role includes being a trusted source for identifying relevant interventions, assessing the rigor of the designs used to study those interventions, synthesizing the effectiveness evidence available for those interventions, and providing practitioner-friendly guidance for programmatic decision making.
There is much overlap in the interventions of interest to Blueprints, SPW, and the WWC, as well as several other research-vetting entities in education. These common interests introduce the possibility of each entity reviewing an intervention and providing ratings of effectiveness that differ significantly. Inconsistencies in ratings across entities undermine the trusted role that each intends to play in bridging research to end-user decisions. As such, comparing the different conclusions that might be drawn about an intervention, based solely on which evidence frameworks are applied, can be insightful to ongoing refinement of those frameworks.
In the last half decade, largely outside of education, efforts have begun to compare evidence frameworks and to identify key differences (e.g., Fagan & Buchanan, 2016; Means et al., 2015; Neuhoff et al., 2015; Pew-MacArthur, 2020). For example, Neuhoff et al. (2015) conducted what is likely the most comprehensive review of entities that review intervention research, examining 35 unique entities across health care, crime, social, and education outcome domains. Their work provided helpful information to practitioners by mapping the landscape of study registries through comparing standards across entities and aligning the entities to their respective domains of interest. Furthermore, the work of both Neuhoff et al. and Pew-MacArthur has raised awareness of the need for common standards for harmonizing intervention ratings. Others have made direct comparisons of intervention ratings across entities in specific fields, noting inconsistent ratings for crime prevention interventions (Fagan & Buchanan, 2016) and for behavioral health interventions (Means et al., 2015). However, many prominent registries’ standards have changed since the publication of these comparisons. While these prior efforts provide a model approach for comparing frameworks and the conclusions that result from their application, evidence frameworks continue to evolve through methodological advances, necessitating continued monitoring of whether they are converging or diverging in the information they provide to decision makers.
In this chapter, we draw on these prior efforts to propose an overarching evidence framework and use that framework to more comprehensively compare the standards and procedures of three existing evidence frameworks (Blueprints, SPW, and the WWC) and any resulting assessments of interventions for education outcomes. These three frameworks by no means represent an exhaustive list of entities that address education programming. Others include, for example, Evidence for ESSA (Johns Hopkins University, 2020), CrimeSolutions (U.S. Department of Justice, 2020), and the National Technical Assistance Center on Transition’s (2020) registry of effective practices. However, because Blueprints, SPW, and the WWC are well established and documented, they create an opportunity to provide an in-depth comparison across frameworks. Unlike some of the other comparative work cited above, we sought to compare fewer frameworks, and do so in more depth.
To further illustrate key differences across Blueprints, SPW, and the WWC, we present a set of hypothetical scenarios describing potentially significant differences in intervention ratings across the three evidence frameworks. This essay is intended to identify differences in framework-specific assessments of education interventions, illustrate how those differences could confound or inhibit programmatic decision making for practitioners, and make recommendations that could better inform these decisions by facilitating more accurate and consistent assessments of education interventions.
Common Elements of Evidence Frameworks
The evidence standards that undergird the Blueprints, SPW, and WWC frameworks have existed in various forms for decades but have been updated as necessary based on advances in methodology and shifts in emphases within respective fields. The WWC, established in 2002, has updated its standards approximately every 3 to 5 years. The Standards for Evidence in Efficacy, Effectiveness, and Dissemination, which guide the frameworks upon which both Blueprints and SPW are based, were established in 2005 (Flay et al., 2005) and updated a decade later (Gottfredson et al., 2015). From our broad review of these and other standards for assessing intervention research, we observe several components that are generally considered, falling into three discrete categories: (1) the intervention and its implementation, (2) the design of the research with which the intervention is studied, and (3) the effectiveness of the intervention on outcomes of interest (see Figure 1). Moving forward in this essay, we use this three-category typology to facilitate direct comparisons across the three frameworks.

Figure 1. The Relationship Between an Intervention, the Research Design, and the Intervention’s Effectiveness
The Intervention
At the heart of education research are the interventions themselves: those strategies that are designed to improve student, parent, teacher, and school outcomes. To build a strong case that an intervention is ready for rigorous study, intervention documentation must clearly define the target population with whom the intervention is meant to be implemented, include a theory of change for the intended improvements, and specify a logic model to measure fidelity of implementation. Furthermore, intervention documentation should include elements that facilitate an assessment of implementation feasibility, such as resources required to implement (e.g., staff/faculty buy-in, cost per student, staff training requirements, adequate fit with existing initiatives) and replicability of the intervention in various settings and contexts.
Research Design
While the intervention design may inform what type of research design is the most feasible, the rigor with which an intervention is studied is independent of the intervention itself. Collectively, established evidence frameworks assess an extensive list of study features, including the suitability of the research design for causal inference and for examining the effectiveness of a specific intervention, the reliability and validity of the measures used in the study, the power for detecting effects, the generalizability of the study sample to the target population, the replicability or replication history for the intervention, and the degree to which the researchers conducting the study are independent of the intervention.
Effectiveness
Typically, the design of the intervention, the rigor of the research designs, and the observed treatment effects come together to answer questions of effectiveness. Independent of whether an intervention yet has empirical evidence of effectiveness, the intent of an intervention’s design is to improve outcomes. The causal effects research that may ensue strives to determine and demonstrate the extent to which an intervention affects an outcome or set of outcomes. Across established evidence frameworks, the effectiveness of an intervention is based on whether the intervention has been studied multiple times with rigorous designs, whether the effects are consistent within and across studies with no overriding negative effects, whether the effects differ between subgroups or under different circumstances, and whether the effects are immediate or sustained. It is important to keep in mind that this question of effectiveness is answered differently across frameworks. Some establish effectiveness at the intervention level using the best-case combination of study rigor and effectiveness, while others indicate effectiveness separately for each outcome domain, subsample, and time period within a study. Figure 1 illustrates the relationships between the intervention design, research design, and effectiveness criteria as three separate but interconnected components of evidence standards.
In the following section, we describe the Blueprints, SPW, and WWC evidence frameworks. Using a systematic document review approach, we first describe the evidence criteria considered in each framework, identifying and documenting each element included in the respective frameworks. We then align each of these elements with the components and subcomponents of our organizing framework (see Figure 1), mapping each element with mutual exclusivity to the intervention, research design, or effectiveness components. Based on this alignment, we provide a matrix to highlight which evidence components are considered in each framework and with what emphasis. Finally, we discuss several illustrative scenarios that highlight how ratings for a given intervention might differ across these three frameworks.
A Focus on Three Evidence Frameworks
Blueprints for Healthy Youth Development
Blueprints for Healthy Youth Development is a nonprofit registry of interventions. Interventions in the Blueprints registry span the social policy spectrum, and their effectiveness has been rigorously tested, mostly through randomized controlled trials. Blueprints focuses on interventions designed to reduce antisocial behavior and promote positive youth development and adult maturity. The registry covers outcomes related to behavior, education, emotional well-being, physical health, and positive relationships. Blueprints uses four criteria to examine the evidence of a program’s effectiveness: (1) intervention specificity, (2) evaluation quality, (3) intervention impact, and (4) dissemination readiness. Each is discussed briefly below.
Intervention Specificity
This first criterion focuses on the theoretical basis and logic model for the program (i.e., construct validity). All programs must clearly identify the theory(ies) guiding the intervention, their targeted risk and protective factors and outcomes, the populations to be served, and their specific content and methods of delivery. This standard serves as a “filter,” meaning that interventions that do not meet this criterion are ineligible for further review. Programs meeting the intervention specificity criterion are then assessed on the next two criteria, evaluation quality and intervention impact.
Evaluation Quality
The evaluation quality standards are based on the study’s approach to evaluating the intervention, including treatment group assignment, instrument alignment and psychometrics, overall and differential attrition, independence between data collection and implementation of the intervention, the use of valid, reliable outcome measures, intent-to-treat analysis, use of appropriate statistical analyses (including proper treatment of clustered data), baseline equivalence both for the randomized/matched sample and the analysis sample (even for low-attrition randomized controlled trials [RCTs]), and baseline equivalence between those with complete data compared to subjects with missing data (irrespective of condition). Blueprints does not have a primary standard regarding implementation fidelity, but information on implementation quality is listed as a “secondary criterion,” meaning it is recommended to increase confidence in the study findings (see Mihalic & Elliott, 2015, p. 129).
Intervention Impact
The intervention impact criterion is based on whether the intervention demonstrates statistically significant favorable effects with no iatrogenic effects in a preponderance of studies that meet Blueprints evaluation quality criteria. The intervention specificity, evaluation quality, and intervention impact standards are initially assessed by internal review teams. If studies of interventions meet all three criteria, the intervention and its studies are then reviewed by an external advisory board that ultimately determines whether to endorse the intervention for Blueprints certification. Interventions endorsed by the board are then assessed for dissemination readiness.
Dissemination Readiness
For programs to meet this criterion, there must be explicit processes for ensuring the intervention gets to the right persons and that the program that was evaluated is still available. The website defines dissemination readiness as “the intervention is currently available for dissemination and has the necessary organizational capability, manuals, training, technical assistance and other support required for implementation with fidelity in communities and public service systems” (https://www.blueprintsprograms.org/blueprints-certification/). If a well-evaluated program cannot provide any such assistance, it will not be listed as a certified program on the Blueprints website. Rather, it is given a “not dissemination ready” rating—meaning it has met criteria for evaluation quality but has not yet met the dissemination readiness criteria (as explained on the Blueprints website: https://www.blueprintsprograms.org/non-certified-programs/). This criterion signifies Blueprints’ interest in the practical use and widespread dissemination of effective interventions that are ready for scale. Having a formal dissemination readiness standard reduces the number of programs listed on the Blueprints website, since not all programs demonstrated to be effective have the capacity to be disseminated.
Interventions meeting Blueprints evaluation quality and intervention impact criteria may receive one of three possible levels of certification: Promising, Model, or Model+. Blueprints’ standards for a Promising rating include the following: (1) at least one high-quality RCT or two well-conducted quasi-experimental design (QED) evaluations (evaluation quality) and (2) positive findings with no evidence of iatrogenic effects (intervention impact). In addition to these two criteria, a Model rating requires (1) a high-quality RCT (evaluation quality) and (2) sustainability of effects for at least 1-year post intervention (intervention impact). Blueprints also has a Model
Blueprints also assigns ratings for noncertified programs that do not meet the evaluation quality and intervention impact criteria. Ineffective programs meet all Blueprints evaluation quality standards but find nonsignificant effects on relevant outcomes, and Harmful programs meet all Blueprints evaluation criteria but find significant harmful effects on a Blueprints behavioral outcome. The alignment between Blueprints criteria for Promising, Model, and Model+ programs and our overarching framework is shown in Table 1.
Table 1. Blueprints Criteria Mapped to Evidence Rating Components and Subcomponents
Note. Footnotes in the table mark the criteria for Model designation and the criteria for Model+ designation.
Social Programs That Work
Social Programs that Work is an evidence-based clearinghouse funded by the Laura and John Arnold Foundation and the Coalition for Evidence-Based Policy. The mission of this site is to help policymakers and the general public identify rigorous evidence in all areas of social policy and aid in data-driven decision making. The clearinghouse examines the evidence related to several education policy areas, including early childhood, K–12 education, and postsecondary education, in addition to other areas such as unplanned pregnancy prevention, housing/homelessness, and chronic disease prevention.
SPW provides a detailed checklist for reviewing experimental studies of social programs (Coalition for Evidence-Based Policy, 2010), and details four general areas of consideration: (1) the overall study design, (2) statistical balance between the treatment and control conditions, (3) outcome measures, and (4) the effects observed in the study. We discuss each briefly below.
Study Design
SPW considers two components for overall study design. First, random assignment should be done at the “appropriate level” for the intervention. For example, charter school programs that are oversubscribed (e.g., KIPP Charter Schools, Promise Neighborhoods) and employ a lottery system for enrollment may be appropriate for individual-level random assignment, while other programs implemented at the school level (e.g., Success for All) may need to be assigned at the school level. Second, the sample should be large enough to detect meaningful effects, and if it was not, authors should include an analysis to determine whether the sample in their study had adequate power for detecting effects.
Statistical Balance
SPW examines several components related to statistical balance of the treatment and control conditions before, during, and after the implementation of the intervention. Factors related to this balance before the intervention include evidence of baseline equivalence on key factors and evidence that participants consented to participate before random assignment (to guard against biased participation). During the intervention, reviewers consider crossover of group membership and consistency of outcome data collection between the treatment and control conditions. Finally, SPW reviewers consider attrition and the integrity of group assignment after the intervention is completed. SPW uses a general guideline of no more than 20% attrition and requires that study authors establish baseline equivalence for the final analytic sample if attrition occurs. SPW further requires an intent-to-treat analysis to estimate treatment effects, keeping participants in their originally assigned group for the analysis, regardless of participation.
Outcome Measures
There are four criteria examined in relation to outcome measures. First, criteria related to the validity of outcome measures focus on whether the measures are psychometrically well-established, or highly correlated with such a measure. These criteria also address ensuring the measures are not overaligned with either the treatment or control group. Second, the outcomes must have practical or policy implications. Namely, the outcomes must represent broader goals of the intervention rather than an intermediate output that may be a mechanism of the intended outcome. Third, the researchers must be blinded to participants’ condition when collecting data, where applicable, in order to reduce the risk of bias through expectation. Finally, the effects of the intervention must be measured at least 1 year after the intervention is completed to ensure that the intervention produces meaningful, lasting effects.
Intervention Effects
SPW considers two general criteria for examining the effects of an intervention. Studies must report both effect sizes and p values for statistically significant effects, as well as whether the size of the effect has practical or policy implications. These criteria include accounting for the randomization strategy used in group formation and whether the effects are fixed (generalizing only to the study participants) or random (generalizing to the larger population). Studies must also report all outcomes measured, regardless of their statistical significance or magnitude of effect, which guards against the risk of selective reporting of significant findings.
SPW’s ratings, designated at the intervention level, are Top Tier, Near Top Tier, and Suggestive Tier. For an intervention to be designated as Top Tier, it must have been evaluated by more than one well-designed, well-implemented experimental study (or one large, multisite study) in a replicable setting with large, favorable, sustained effects. The studies also must present no overriding negative effects and have been conducted in more than one setting. Interventions designated as Near Top Tier meet all the criteria for Top Tier evidence but lack at least one characteristic to qualify for Top Tier (in most cases, replication of effects). For an intervention to be rated as Suggestive by SPW, it must have been evaluated by one or more well-designed, well-implemented experimental studies, or studies that “closely approximate the randomization process” with favorable effects, but may be limited by no sustained or statistically significant positive findings, or have only been conducted in one setting. An intervention is listed on the SPW website only if it attains a minimum rating of Suggestive Tier. SPW’s checklist for reviewing RCTs includes the elements outlined in Table 2.
Table 2. Social Programs That Work Criteria Mapped to Research Design, Intervention, and Effectiveness Categories
Note. SPW = Social Programs that Work; RCT = randomized controlled trial.
The What Works Clearinghouse
The WWC is an evidence-based clearinghouse funded by the U.S. Department of Education’s Institute of Education Sciences. The WWC reviews education research studies designed to support causal inference, including group designs (RCTs and QEDs), regression discontinuity designs, and single case designs. Reviews are conducted primarily by contracted review teams and are guided by the WWC standards and procedures, which are updated every few years to incorporate new methodological advancements and to retire outdated criteria. Reviews are also guided within specific topic areas by review protocols, which provide specific guidance on how the standards and procedures are applied within a specific topic area. This guidance specifies the types of interventions, populations, and outcome domains that are eligible to be reviewed; levels of acceptable attrition (reviews can be held to a cautious or optimistic attrition boundary); and methods for calculating baseline equivalence. In the section that follows, we provide an overview of the WWC’s criteria for rating studies under WWC Standards version 4.1, and the criteria for the WWC’s effectiveness ratings, which are applied only to those studies that meet minimum study rating criteria.
Study Ratings
The WWC Standards version 4.1 has three possible study ratings and these ratings are the same for all eligible designs: group designs, single case designs, and regression discontinuity designs. In this chapter, we will focus on study ratings for group designs. Note that WWC Standards and the resulting study ratings pertain only to the rigor of the study design, not the intervention effectiveness. The latter is done using a separate effectiveness rating scheme. For example, a study that observed a null effect (i.e., effect size = 0.0 standard deviations) could still attain the highest study rating. The WWC study ratings, in order of highest to lowest level of rigor, are provided in Table 3, and WWC effectiveness ratings are in Table 4. In Table 5, both study and effectiveness ratings are mapped to our overarching framework (see Figure 1). The WWC Group Design Standards address five major categories of considerations.
Table 3. What Works Clearinghouse Group Design Study Ratings and Descriptions
Table 4. What Works Clearinghouse (WWC) Characterization of Findings in Intervention Reports
Source. What Works Clearinghouse Procedures Handbook version 4.1 (IES, 2020b).
Table 5. WWC Criteria Mapped to the Research Design and Effectiveness Categories
Note. WWC = What Works Clearinghouse; RCT = randomized controlled trial; QED = quasi-experimental design. These criteria apply to WWC’s Group Design Standards, which are by far the most frequently applied standards. WWC also specifies standards for regression discontinuity designs and single case designs.
Study design
Both randomized trials and quasi-experiments are eligible for review. The standards for study design address assignment procedures and, if assignment was random, whether those procedures were well executed or compromised. The assessment of random assignment procedures includes whether the study analyzes units according to the original assigned condition (i.e., intent-to-treat analysis) and whether any units were excluded for reasons related to the intervention.
Confounds
Identification of a confound results in a study rating of Does Not Meet Standards. Confounding factors are components of a study that make it difficult or impossible to isolate the effect of the intervention. The first type of confound assessed by the WWC is referred to as the “n = 1” confound. This occurs when the intervention or comparison group contains a single study unit (e.g., one school assigned to each treatment condition). The second type assessed by the WWC arises when characteristics that could plausibly affect outcomes differ systematically between groups (e.g., all teachers in the intervention group have more experience or a more advanced degree than those in the comparison group).
Outcome characteristics
The WWC examines several outcome measure characteristics in determining if a study has an eligible outcome. These characteristics are face validity, reliability, degree of overalignment (outcomes to interventions), and consistency of testing conditions. Failure to pass any of the outcome measure criteria will result in a study rating of Does Not Meet Standards. We provide brief descriptions of each criterion below.
Face validity
To pass this criterion, a measure must have a clear definition and measure what it claims to measure.
Reliability
To pass this criterion, an outcome measure must demonstrate adequate internal consistency, test–retest reliability, or interrater reliability, depending on the nature of the measure.
Overalignment
To pass this criterion, the measure must not be tailored to the intervention or repeat some aspect of the intervention, such that the intervention group participants are advantaged on the outcome measure.
Consistency of testing conditions
To pass this criterion, the testing conditions in treatment and control/comparison groups must have been sufficiently similar. That is, there must be sufficient similarity in the mode of data collection (e.g., online vs. paper/pencil), timing of data collection, personnel administering the measure, and how the scores are constructed.
Attrition
The WWC assesses overall attrition and differential attrition (i.e., the difference in attrition rates across treatment conditions) for randomized control trials. Attrition designations (high or low) are based jointly on the overall and differential attrition rates and coincide with a bias threshold where high attrition studies are presumed to have at least 0.05 standard deviations of bias in the impact estimate due to attrition. Two thresholds for high/low attrition are used, optimistic or cautious, and these are based on an assessment of the relationship between attrition and outcomes. For example, if attrition is likely random and not related to outcomes, an optimistic threshold is used.
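As a minimal sketch of the two quantities described above, the following computes overall attrition and differential attrition from the numbers of units randomized and analyzed in each condition. The function name and the example counts are hypothetical, and the WWC's optimistic and cautious attrition boundaries are not reproduced here, so the sketch stops short of assigning a high or low designation.

```python
def attrition_rates(randomized_t, analyzed_t, randomized_c, analyzed_c):
    """Return (overall, differential) attrition as proportions.

    Hypothetical helper for illustration only; it does not apply
    the WWC attrition boundaries.
    """
    if randomized_t <= 0 or randomized_c <= 0:
        raise ValueError("each condition needs at least one randomized unit")

    # Attrition rate within each condition
    rate_t = (randomized_t - analyzed_t) / randomized_t
    rate_c = (randomized_c - analyzed_c) / randomized_c

    # Overall attrition pools both conditions
    total_randomized = randomized_t + randomized_c
    total_analyzed = analyzed_t + analyzed_c
    overall = (total_randomized - total_analyzed) / total_randomized

    # Differential attrition is the gap between the two condition rates
    differential = abs(rate_t - rate_c)
    return overall, differential


# Made-up counts: 200 units randomized per condition,
# 170 treatment and 180 control units retained for analysis.
overall, differential = attrition_rates(200, 170, 200, 180)
print(f"overall = {overall:.3f}, differential = {differential:.3f}")
```

A reviewer would then compare this pair of rates against the boundary (optimistic or cautious) named in the relevant review protocol.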
Baseline equivalence
For quasi-experiments or randomized control trials with high attrition or compromised random assignment, the WWC assesses equivalence of treatment and control/comparison groups on a baseline version of the outcome measure (or other group characteristic when no baseline measure exists). If the absolute value of the difference across groups on the selected baseline measure is larger than 0.25 standard deviations, the study fails the baseline equivalence criterion and the study receives a rating of Does Not Meet Standards. If the absolute value of the baseline difference between treatment conditions is between 0.05 and 0.25 standard deviations, the study can receive a rating of Meets Standards with Reservations, provided that the impact analyses controlled for differences on the baseline measure. If the absolute value of the baseline difference is less than or equal to 0.05 standard deviations, the groups are deemed baseline equivalent and the study passes this criterion regardless of whether scores on the baseline measure were used to statistically adjust the impact estimate.
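The three-way rule in this paragraph can be sketched as a small decision function. This is an illustrative reading of the thresholds as stated above (0.05 and 0.25 standard deviations), not WWC software. The function name and return labels are hypothetical, and treating an unadjusted difference in the middle band as failing the criterion is our interpretation of the phrase "provided that."

```python
def baseline_equivalence(baseline_diff_sd, adjusted_for_baseline):
    """Classify a baseline difference expressed in standard deviation units.

    baseline_diff_sd: treatment-minus-comparison difference on the
        baseline measure, in standard deviations.
    adjusted_for_baseline: True if the impact analysis controlled for
        the baseline measure.
    Labels are hypothetical shorthand for the outcomes described in the text.
    """
    d = abs(baseline_diff_sd)
    if d > 0.25:
        # Too large to be remedied by statistical adjustment
        return "not satisfied"
    if d > 0.05:
        # Middle band: acceptable only with statistical adjustment
        if adjusted_for_baseline:
            return "satisfied with adjustment"
        return "not satisfied"
    # At or below 0.05: equivalent whether or not adjustment was used
    return "satisfied"


# Made-up baseline differences paired with whether the analysis adjusted
for diff, adjusted in [(0.30, True), (-0.12, True), (0.12, False), (0.03, False)]:
    print(diff, adjusted, baseline_equivalence(diff, adjusted))
```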
Effectiveness Ratings
Using only studies that meet standards, the WWC synthesizes the effects of multiple studies of an intervention on outcomes in a given domain. In the effectiveness scheme used in the WWC Procedures Handbook version 4.1, intervention effectiveness ratings are based on three characteristics of the effects being synthesized: the number and precision (weight) of the studies that contribute effects, the study ratings of the contributing studies, and the sign and statistical significance of the weighted mean effect size from a fixed effects meta-analysis of those effects.
The criteria for a Positive Effects rating, the WWC’s highest effectiveness rating, are cross-referenced with evidence rating components and subcomponents in Table 5.
Comparing the Three Evidence Frameworks
Toward our goal of investigating why differences in intervention ratings might exist across these three entities and the evidence frameworks they use, we conducted a comprehensive comparison of the frameworks, finding similarities and key differences in how characteristics of the intervention, research design, and effectiveness components are considered.
Intervention
In assessing programs, Blueprints scrutinizes the quality and documentation of the intervention, including the theory of change and dissemination readiness. To a lesser extent, SPW also examines the quality of interventions through assessing the extent to which outcomes are of policy and practical importance. Conversely, the WWC does not consider any intervention components in either its design standards or effectiveness ratings. As a result, a program studied using two well-designed RCTs, each reporting significant, positive, and sustained effects, but without implementation information, may receive the highest rating from the WWC (Positive Effects) and SPW (Top Tier), but not receive a rating at all from Blueprints for failing to meet the feasibility of implementation (dissemination readiness) criterion.
Research Design
While all three frameworks put more emphasis on the rigor of the research design than they do on criteria related to the intervention or its effectiveness, some notable differences surfaced. For example, researcher independence is considered only by Blueprints. An intervention studied and replicated by two research teams that find significant immediate and sustained effects may not receive the highest Blueprints rating if both teams have financial ties to the program developers. The same intervention may receive the highest rating from the WWC (Positive Effects) and SPW (Top Tier).
Effectiveness
Although all three frameworks focus on effectiveness, the WWC does not include the subcomponent of sustained effects (i.e., long-term or follow-up effects) in its criteria, whereas Blueprints and SPW do. Therefore, an intervention evaluated by two well-designed and implemented RCTs that found significant positive effects on immediate outcomes may qualify for the highest WWC effectiveness rating of Positive Effects, while only qualifying for Promising (minimum standard of evidence) from Blueprints and Suggestive (lowest rating) from SPW. Additionally, WWC effectiveness ratings apply to specific outcome domains within an intervention. For example, if an intervention shows favorable effects on mathematics achievement and null effects on literacy, it will receive two different domain-specific effectiveness ratings. While Blueprints and SPW specify relevant outcomes from the interventions they review, effectiveness ratings are granted at the intervention level rather than at the outcome domain level.
Table 6 provides a comparison of evidence rating components and subcomponents across all three evidence frameworks. The numbers in the table represent the number of criteria each evidence framework assigns to each subcomponent.
Matrix of Evidence Rating Components Across Blueprints, SPW, and WWC Standards
Note. WWC = What Works Clearinghouse; SPW = Social Programs that Work.
Studies reviewed by the WWC can meet standards without any indication of effectiveness.
Differing Conclusions Across Frameworks: Illustrative Scenarios
In this section, we take the cross-referencing of evidence criteria presented in Table 6 a step further to illustrate how important differences in the three evidence frameworks might produce disparities in intervention ratings. To illustrate these differences, we provide several hypothetical scenarios of potential programmatic decisions when studies are reviewed by Blueprints, SPW, and the WWC. For each scenario below, we assume that the studies use a group design and report outcomes related to educational achievement and learning. Additionally, we use the term studied twice to refer to situations in which the same intervention was tested in two separate samples.
Scenario 1: Entangled Rigor and Effectiveness
The impact of Intervention A was studied twice using randomized designs for both immediate and follow-up outcomes. The WWC did not give Intervention A an effectiveness rating, and the intervention was not certified by Blueprints or SPW. The decision maker can infer from this result that neither study met WWC Standards so effectiveness was not considered. For Blueprints, the decision maker could infer that the studies did not meet evaluation quality or dissemination criteria. The reasons for noncertification would be available to decision makers on the Blueprints website and would distinguish between minor issues relating to measurement or analysis, more serious issues relating to the evaluation design, and other issues relating to effectiveness (such as null or harmful effects). For SPW, the decision maker does not know whether (1) the effects were significant and consistent but the design was not rigorous, (2) the design was rigorous but the effects were nonsignificant or inconsistent, or (3) the studies were both insufficiently rigorous and observed nonsignificant/inconsistent effects.
Scenario 2: Differences in Synthesis Approach
The impact of Intervention B was studied twice using well-implemented randomized designs for immediate and follow-up outcomes. Study 1 had an observed effect of +0.30 standard deviations (p = .01, n = 200) and Study 2 had an observed effect of −0.15 standard deviations (p = .04, n = 125). In both Study 1 and Study 2, the effects on immediate and follow-up outcomes were similar. For the immediate outcomes, the precision-weighted mean (from a fixed effects meta-analysis) of the effect sizes is +0.13 standard deviations (p = .03). The likely WWC effectiveness rating for Intervention B is Positive Effects. Blueprints would likely have to decide between a rating of Promising based on Study 1 (one well-conducted RCT with long-term positive results) and a rating of Ineffective based on Study 2 (one well-conducted RCT with long-term negative results). SPW would likely not rate the intervention at all due to the presence of strong countervailing evidence (one effect was negative and statistically significant).
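The pooled estimate in this scenario can be approximated with a short calculation. The sketch below is an illustration only: because the scenario does not report standard errors, it weights each study by its sample size as a rough stand-in for the inverse-variance weights used in a fixed effects meta-analysis. That substitution is an assumption, and the sketch does not attempt to reproduce the p-value.

```python
# Hypothetical values taken from Scenario 2.
effects = [0.30, -0.15]   # standardized effects for Study 1 and Study 2
sizes = [200, 125]        # sample sizes, used here as approximate precision weights


def weighted_mean(values, weights):
    """Return the weighted average of values using the given weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


pooled = weighted_mean(effects, sizes)
print(round(pooled, 2))  # 0.13
```

The larger, positive Study 1 dominates the average, which is why the synthesized effect is positive even though one of the two studies found a statistically significant negative effect.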
Scenario 3: Differences in QED Eligibility
The impact of Intervention C was studied twice using well-implemented QEDs for immediate and follow-up outcomes. Intervention C could get the Promising rating from Blueprints if the two QEDs both found favorable and statistically significant effects but would need at least one RCT to qualify for a higher level of certification. Similarly, Intervention C could get no better than the Potentially Positive Effects rating from the WWC, as the WWC requires at least one of the two studies be a randomized design to achieve its highest effectiveness rating of Positive Effects. Intervention C may receive no rating at all from SPW unless the process for assigning study participants closely approximated random assignment.
Scenario 4: Different Values Placed on QED Evidence
The impact of Intervention D was studied once using a well-implemented QED for immediate and follow-up outcomes. Differences in ratings for Intervention D spring from how the frameworks differentially value evidence from QEDs. Intervention D could get the second highest effectiveness rating, Potentially Positive, from the WWC, but would receive an Inconclusive rating from Blueprints, since their minimum standard of evidence, Promising, requires at least two high-quality QEDs. This intervention might not be rated by SPW at all as they only review RCTs or studies that assign participants in a way that closely approximates random assignment.
Scenario 5: Assessing Baseline Equivalence for RCTs
The impact of Intervention E was studied using two RCTs for immediate and follow-up outcomes. Study 1 had 10% overall attrition and 1% differential attrition with a baseline outcome difference of 0.30 standard deviations for the assigned sample (p = .03) and 0.35 standard deviations for the analytic sample (p = .02). Study 2 had 5% overall attrition and 1% differential attrition with a baseline difference of 0.33 standard deviations for the assigned sample (p = .02) and 0.36 standard deviations for the analytic sample (p = .01). Intervention E would be rated as Inconclusive and not certified by Blueprints, and would not receive a rating from SPW, because both RCTs fail the baseline equivalence statistical significance criterion. The WWC does not review baseline equivalence for low attrition RCTs, so given the low attrition rates, these studies could have been rated as Meets Standards without Reservations. Had the fixed effects meta-analytic average of the two RCT effects been statistically significant and positive, Intervention E could receive the highest effectiveness rating of Positive Effects from the WWC.
Scenario 6: Differences in Attrition Criteria
The impact of Intervention F was studied using two large RCTs for immediate and follow-up outcomes, and the second RCT was a close replication of the first. Study 1 (n = 2,000) had 9% overall attrition and 6% differential attrition. Study 2 (n = 3,000) had 10% overall attrition and 6% differential attrition. Had both RCT effects been statistically significant and positive, and by extension, the fixed effects meta-analytic average of the two RCT effects been statistically significant and positive, Intervention F could receive the highest rating of Top Tier from SPW and the highest effectiveness rating of Positive Effects from the WWC, with the studies treated as low attrition RCTs at the cautious boundary. The intervention would receive the highest ratings from Blueprints of Model (or Model+ if the replication was independent) only if the studies also ruled out potential differential attrition bias by (1) demonstrating baseline equivalence for the assigned sample (the full sample before attrition) as well as the analysis sample (the reduced sample after attrition) and (2) demonstrating that attrition was unrelated to socio-demographics and baseline outcomes, that is, that those with complete data (“completers”) resembled those with incomplete data (“attritors”) on these characteristics.
Scenario 7: Differences/Ambiguity in Replication Requirements
The impact of Intervention G was studied using two well-implemented RCTs for immediate and follow-up outcomes, but the studies were not exact replications of one another because the second study modified a core component of the intervention. Had both RCT effects been statistically significant and positive, and by extension, the fixed effects meta-analytic average of the two RCT effects been statistically significant and positive, Intervention G could receive the highest effectiveness rating of Positive Effects from the WWC if the second study was considered at least an approximate replication. Depending on how similar the two studies need to be to pass replication standards, Intervention G may only get the second highest rating of Near Top Tier from SPW and the second highest rating of Model from Blueprints due to failing their respective replication criteria.
Scenario 8: Ratings at the Intervention Level Versus Outcome Domain Level
The effects of Intervention H on outcomes from two different outcome domains, reading and mathematics, were studied using one low attrition RCT. The effects in the reading domain were statistically significant and positive, and the effects in the mathematics domain were negative but not statistically significant. For the WWC, the effectiveness rating would be Potentially Positive Effects for the reading domain and Uncertain Effects for the mathematics domain. For the reading domain only, the intervention would receive a holistic rating of Suggestive from SPW and a Promising rating from Blueprints.
Considerations Moving Forward
In this section, we pose broad future considerations for the improvement of evidence frameworks for education research. Our commentary in this section is not aimed at any one entity, and we acknowledge that one or more of the three evidence frameworks reviewed in this chapter already account for or otherwise address the concerns we raise. Furthermore, although not officially incorporated into the WWC, IES’ emerging Standards for Excellence in Education Research (SEER) principles (IES, 2020c) promote standards for research that mirror our recommendations in this section. The SEER standards encourage researchers to (1) preregister studies; (2) make findings, methods, and data open; (3) identify interventions’ core components; (4) document treatment implementation and contrast; (5) analyze interventions’ costs; (6) focus on meaningful outcomes; (7) facilitate generalization of study findings; and (8) support scaling of promising results. When applicable, we draw parallels between our recommendations and those of the SEER principles.
Disentangle Study Rigor and Intervention Effectiveness Ratings
Inconsistent effectiveness schemes across frameworks can confuse and mislead decision makers (e.g., Fagan & Buchanan, 2016). One observation based on our assessment is that decision makers will be better served by separate ratings of study rigor and effectiveness. A primary source of ambiguity arises when an intervention has been reviewed by an entity but not given a rating. For example, decision makers might want to know whether an intervention demonstrated positive effects, even if those studies were disqualified for being quasi-experiments or one or more of the studies was underpowered (e.g., nonsignificant positive effects). A separate effectiveness rating that does not overemphasize the statistical significance of individual effects and allows for otherwise rigorous nonrandomized studies would facilitate consideration of interventions that have only undergone early-stage testing, which is often done at small scale with QEDs. Perhaps such effectiveness rating schemes could attach a confidence indicator to the effectiveness rating to signal when effects from less rigorous designs were synthesized.
Cease Vote Counting Synthesis Procedures
The “box score” approach to vote counting (see Borenstein et al., 2011; Hedges & Olkin, 1985) involves characterizing the effects of studies categorically (e.g., positive, negative, positive and statistically significant, negative and statistically significant). Such tallies typically indicate the direction and/or significance of effects while ignoring their precision. An application of vote counting is to penalize an intervention for having countervailing evidence, regardless of the precision of the study/studies producing the countervailing evidence or the precision of the study/studies showing evidence in the opposite direction. In its place, we suggest an effectiveness scheme that weights studies by their precision before synthesizing them and uses the weighted average effect, and perhaps its statistical significance, as the primary indicator of effectiveness. Common approaches to computing this weighted mean include fixed- and random-effects meta-analysis (Hedges & Vevea, 1998). Use of meta-analytic techniques was also suggested by Means et al. (2015) in their analysis and comparison of evidence frameworks in behavioral health.
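To make the contrast concrete, the sketch below tallies votes by direction and then computes an inverse-variance (fixed effects) weighted mean for the same set of studies. The three studies, their effect sizes, and their standard errors are invented for illustration and do not come from any review discussed in this chapter.

```python
import math
from collections import Counter

# Hypothetical studies: (effect size, standard error). One precise study
# with a positive effect and two imprecise studies with negative effects.
studies = [(0.25, 0.05), (-0.10, 0.20), (-0.05, 0.25)]

# Vote counting: tally the direction of each effect, ignoring precision.
votes = Counter("positive" if effect > 0 else "negative" for effect, _ in studies)

# Fixed effects synthesis: weight each effect by the inverse of its variance.
weights = [1.0 / se ** 2 for _, se in studies]
pooled = sum(w * effect for w, (effect, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(dict(votes))
print(round(pooled, 3), round(pooled_se, 3))
```

In this invented example the tally favors the negative direction two to one, while the precision-weighted mean is clearly positive because the one positive study carries most of the statistical information.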
Continue to Refine Effectiveness Ratings Using Replication Criteria
In education, the lack of replication is now well documented (Hedges, 2018; Makel & Plucker, 2014). The replication crisis in education manifests in the Neuhoff et al. (2015) analysis of the “market” for effectiveness evidence, where they state that the supply of evidence to support decision making must be strengthened. Mihalic and Elliott (2015) raise the bar even higher by suggesting that replications be conducted by independent researchers.
Each of the evidence frameworks discussed in this chapter either directly or indirectly encourages replication by virtue of its highest effectiveness ratings being reserved for interventions whose effects were observed in at least two rigorous studies. While Blueprints and SPW directly mention that the two studies must be replications of one another, the WWC does not, although if the WWC synthesized the two (or more) effects in a fixed effects meta-analysis, the presumption is that the source studies are similar in important ways; at a minimum, both studies test comparable enactments of the intervention and estimate effects in at least one common outcome domain. Our review of all three frameworks suggests that better definitions of what counts as replication (i.e., in what ways do the studies need to be similar?) would be helpful.
Conversely, there is benefit to assigning effectiveness ratings on the basis of a single contributing study. Given the time required to get large-scale trials funded, conducted, disseminated, and reviewed by entities such as Blueprints, SPW, or the WWC, waiting to establish effectiveness ratings for all interventions until each has at least two studies could deprive decision makers of evidence for an excessively long period of time. An overemphasis on replication at the undue expense of timely decision making is not consistent with the goals of these entities, and has the potential to prioritize methodological rigor at the expense of practical utility by limiting whether and how these ratings are used by decision makers. Even cautioned ratings of effectiveness are likely more useful than no rating at all. Perhaps for interventions that have only been studied once, the best compromise is for evidence standards to include effectiveness ratings that adequately convey the additional uncertainty in conclusions drawn from a single study (e.g., the Blueprints Promising rating).
The potential for any evidence framework to effectively contribute to knowledge accumulation by valuing and encouraging replication depends on whether intervention researchers will follow suit. Researchers must propose and conduct more replication studies, disseminate study findings with more comprehensive and open reporting of implementation conditions, and adopt more widespread sharing of raw data and methods. If the SEER principles gain momentum and influence in education, these useful research reporting practices are likely to increase. Perhaps increased replication rates will follow and evidence frameworks will have less need for effectiveness ratings schemes designed for interventions tested with just one impact study.
Consider Adding Implementation Fidelity Standards
None of the evidence frameworks compared in this chapter consider fidelity of implementation as a primary evidence criterion. Decision makers considering new interventions would benefit from knowing both the observed effects and whether the intervention was implemented as intended. If interventions under consideration both show evidence of effectiveness and meet an established standard for implementation fidelity, that increases confidence not only that the observed effect is that of the intended intervention but also that the intervention is usable by implementers and feasible to implement on a large scale. Furthermore, fidelity information can be correlated with outcomes to examine the robustness of treatment effects to varying implementation conditions, which provides one indicator of the external validity of those effects.
Consistent with the recommendations of Fagan and Buchanan (2016) and with the SEER principles regarding implementation reporting, adopting fidelity standards into evidence frameworks would further encourage study authors to report this oft-ignored category of data. The benefits described here could be realized if authors, at a minimum, reported the percentage of treatment units whose implementation exhibited acceptable adherence to the intervention design. Associated fidelity standards would likely need to establish numeric cutoffs for acceptable overall fidelity. A greater ask, although likely worth the effort, is fidelity reporting at the treatment unit level. Future fidelity standards might include accompanying benchmarks that would facilitate a review of the robustness of effects to variation in fidelity levels.
Assess Baseline Equivalence for Randomized Studies
In expectation, randomization balances group characteristics across treatment conditions (Murnane & Willett, 2011; Shadish et al., 2002). However, in randomized studies with small sample sizes, either at the individual or cluster level, the likelihood is lower that balance will be achieved on individual or cluster-level characteristics that are correlated with outcomes (e.g., Kang et al., 2008; Lachin et al., 1988). Some of the randomized studies reviewed by Blueprints, SPW, and the WWC have sample sizes small enough to potentially question the expectation of balance through randomization. As such, we propose that baseline equivalence be assessed for RCTs as well, or at least for those below a specified minimum sample size at which balance on observed or unobserved characteristics can be reasonably expected. Alternatively, or in addition to this step, estimation models for small randomized studies could be assessed to determine if those models adjust for possible imbalance on covariates with strong correlations with outcomes.
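A rough calculation illustrates how quickly chance imbalance becomes likely as samples shrink. The sketch below assumes simple individual-level randomization into two equal groups, a baseline covariate with unit variance, and a normal approximation for the difference in group means. It borrows the 0.25 standard deviation threshold from the WWC baseline equivalence criterion purely as a reference point; the function name and the sample sizes are hypothetical.

```python
import math


def prob_chance_imbalance(n_per_group: int, threshold_sd: float = 0.25) -> float:
    """Approximate probability that randomization alone yields an absolute
    baseline difference larger than threshold_sd standard deviations.

    Assumes two equal groups, a unit-variance covariate, and a normal
    approximation, so the difference in means has standard error sqrt(2 / n).
    """
    standard_error = math.sqrt(2.0 / n_per_group)
    z = threshold_sd / standard_error
    # Two-sided tail probability of a standard normal: P(|Z| > z).
    return math.erfc(z / math.sqrt(2.0))


for n in (20, 50, 200, 500):
    print(n, round(prob_chance_imbalance(n), 4))
```

Under these assumptions, a trial with 20 participants per group exceeds the threshold by chance in roughly four of every ten randomizations, whereas a trial with 500 per group almost never does. Cluster-randomized designs would need a different standard error, which this sketch does not attempt to model.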
Make QEDs Eligible for Review
In education research, there are often circumstances in which it is not ethical or practical to randomize study participants into treatment and comparison conditions. Indeed, to avoid denial of services for students who might otherwise be eligible, some federally funded educational interventions such as the Federal TRIO program limit evaluation methods by including a federal ban on overrecruitment for the purposes of populating a study sample since this might result in unintended iatrogenic effects (Section 402H(b), Higher Education Act, 2019). These conditions severely limit opportunities for random assignment, thus necessitating QEDs to assess intervention impacts. Making QEDs eligible for review increases the breadth of reviewable interventions while staying faithful to the mission of providing practitioners with information on interventions that are proven to improve outcomes. Furthermore, recent methodological developments in QEDs have yielded approaches (e.g., regression discontinuity designs, propensity score matching) that can produce effect size estimates with minimal selection bias and support causal inferences similar to those possible with randomized experiments (e.g., Chen & Kaplan, 2015; Gunter & Daly, 2012; Maas et al., 2017; Monahan et al., 2011; Tang et al., 2017).
Place Greater Value on Researcher Independence
Conflicts of interest in any endeavor aimed at reporting results can compromise the integrity of that endeavor. Justifiably, a strong emphasis is placed on disclosing financial ties in education research and other fields such as medical research (see Romain, 2015) and journalism (DellaVigna & Hermle, 2014), and these disclosures allow the reader to make their own judgements regarding the trustworthiness of those sources of information. While this practice is necessary and should continue, we argue that evidence frameworks should account for researcher independence. Studies replicated without researcher independence are more likely than independent replications to report favorable findings, suggesting that conflict of interest can have an effect on the results of a study (Makel & Plucker, 2014). Preregistration of studies (and the treatment contrasts of interest) is promoted in the SEER principles and can discourage selective reporting of findings, especially when the registered study is being conducted independently of the developer. Education research should place greater value on research conducted in settings where the researchers are free from conflicting ties, financial or otherwise, to the intervention being studied.
Provide Effectiveness Ratings at the Outcome Domain Level
Sometimes, educational interventions aim to improve outcomes in several different domains and may have different effects across those domains. Rating an intervention at the intervention level without taking into consideration the differential impact across domains could lead to misinformation about the intervention’s effectiveness if not all domains reach the same level of effectiveness. Thus, we recommend providing effectiveness ratings for interventions at the outcome domain level rather than at the intervention level. This recommendation aligns with statutory language in the Elementary and Secondary Education Act (Elementary and Secondary Education Act, 2015) as well as the non-regulatory ESSA guidance (U.S. Department of Education, 2016).
Concluding Remarks
Entities such as Blueprints, SPW, and the WWC remain three of the most trusted sources of evidence for programmatic decision making in education. Cross-entity inconsistency in the study designs that are reviewed or in the ratings they assign to interventions undermines the decision-making process and can erode practitioner trust in the information they provide, unless the reasons for the discrepancies are clear. In this chapter, we reviewed the evidence frameworks of these three entities, illustrating key differences in standards and the emphases placed on those standards. We also highlighted key features of selected frameworks and endorsed those in our forward-looking considerations.
Ultimately, we hope to promote convergence across the three frameworks by establishing a common core of evidence standards that draws on the strongest characteristics of each framework and aligns study ratings/eligibility requirements toward maximum end-user trust. As such, we envision a future where practitioners in education can use information provided by any of these entities to make programmatic decisions and do so without concern that the evidence or effectiveness conclusions being used in that decision are entity-specific. Furthermore, greater convergence of these frameworks, even if it is partial convergence to a common core, could optimize their usefulness. For example, convergence sufficient to harmonize effectiveness ratings will better meet the end-user’s need for consistency while also allowing them to consider additional framework-specific standards that align with their unique decision-making criteria. As such, our charge toward convergence does not necessarily imply that the three unique frameworks cease to exist. While harmonization of ratings from these entities is ambitious in the near term, what can be done by all relevant entities, immediately, is to improve documentation and rationale for standards and study eligibility requirements. At a minimum, this will better assist decision makers in navigating the reasons why ratings may differ.
