Abstract

The systematic review movement—for example, the What Works Clearinghouse, the Teen Pregnancy Prevention (TPP) Evidence Review—attempts to synthesize the available evaluation evidence in order to aid sites (e.g., school districts) in choosing a program model (e.g., a curriculum) to address social issues (Klerman, 2017). That goal is laudable. Appropriate methods to achieve that goal are still evolving.
Suppose the decision problem is as follows. A site wants to implement a program to address a social problem in its local area. More specifically, the site wants to implement the program that will be most effective in this local area. This is an important and vividly policy-relevant question.
To aid its choice, the site refers to the results of a systematic review. That review has identified multiple program models that are—according to its review criteria—“effective.” The decision problem can be thought of as the relative weight to give to:
broad evidence, from national studies or studies on contexts not (in some appropriate sense) similar to the site seeking guidance,
versus
narrow evidence, on a context (in some appropriate sense) similar to the site seeking guidance.
Concerns about the limited external validity of available evaluation evidence suggest putting more weight on the narrow evidence. This essay, however, will argue that in the absence of strong evidence of large variation in program impacts by context and population served, giving greater weight to narrow evidence will lead to choosing and implementing programs less effective for this site. Furthermore, if there truly were such large variation in program impacts by context, the whole evidence-based policy movement and the systematic review effort that supports it would be infeasible. 1
To make these arguments, this essay proceeds in four sections. The first section notes that many of the other evidence reviews (one in particular) and several articles in this special issue (Avellar et al., 2017; Leviton & Trujillo, 2017; Paulsell, Thomas, Monahan, & Seftor, 2017) implicitly and sometimes explicitly urge sites to “pattern match”—that is, to adopt what we will call the “stylized decision rule” (SDR):
Such SDR/pattern matching assumes weak external validity and is therefore reluctant to extend evidence beyond the domain (population, geography, time period) on which that evidence was generated.
The second section of the essay argues informally that this approach is problematic. Given the limited depth of the available evidence, this approach seems likely to choose the wrong program model.
The third section of the essay uses empirical Bayes (EB) ideas to consider the problem slightly more formally. The essay concludes with a discussion of implications for practice (i.e., guidance to sites on program model selection) and future research.
Throughout, this essay’s approach is theoretical. The discussion implies that EB is the preferred approach for inferring what will work best at a given site. EB may be ideal, but it is usually not feasible (but see Valentine et al., 2017). Implementing EB would require that a large number of studies exist, that those studies report results for consistently defined subgroups, and that resources be available to conduct the EB analyses. In few policy domains are we anywhere near that point. 2
The final section argues that, even in the absence of sufficient studies to do an EB analysis, an EB perspective has constructive implications for how we conduct conventional non-EB evidence reviews and for how we help a site to choose a program model.
The Systematic Review Perspective
Consider the TPP Effectiveness Review. For the Office of Adolescent Health (OAH) and TPP, helping sites choose from among effective program models is implicit in the authorizing statute, addressed directly on OAH’s website, and sometimes an explicit request to TPP staff from sites.
On its website, OAH provides an “online learning module” entitled “How to Select an Evidence Based Teen Pregnancy Prevention Program” (http://www.hhs.gov/ash/oah/resources-and-publications/learning/tpp-evidence-based/index.html, “Step 4: Assess Fit”), which states:
At this stage in the program selection process, it is important to verify that your program of interest is actually applicable to the population with which you are working. For example, implementing an EBP [evidence based program] that was determined to be effective among low-income African American students in urban environments may not yield the same results for tribal youth in rural settings. How does your population compare to that in the study of that EBP? If there are differences, are they likely to compromise your results? (Remember, you can get more detailed information about the adolescents involved in the evaluations by referring to the implementation reports.) Assessing population fit includes consideration of the following: age, race/ethnicity, sex, socioeconomic status, language, immigration status, sexual orientation, culture, other considerations (e.g., juvenile justice, parenting teens). (“Step 4: Assess Fit/Population Fit”)
Are there local laws, policies, or other norms that would be violated by certain components of this program? For example, are there laws prohibiting condom demonstrations in schools? Is it administratively feasible, given the policies and procedures of the implementing organization? Does the program align well with local norms and customs? Are there community cultural considerations you should take into account? (“Step 4: Assess Fit/Environment Fit/Context”)
The first paragraph calls for broad pattern matching, as in the SDR/pattern matching stated earlier, that is, to choose the program with demonstrated effectiveness in the context and population closest to the target site’s context. Indeed, staff of federal systematic reviews report that sites attempt broad pattern matching (e.g., urban/rural, White/Black/Hispanic) and that staff of the systematic evidence reviews assist sites with that effort.
Consistent with this guidance to sites, several papers in this special issue (Avellar et al., 2017; Leviton & Trujillo, 2017; Paulsell et al., 2017) urge research reports and systematic reviews to include much more information about context, including population served and community characteristics.
This guidance also seems unassailable. Even high-quality impact analysis evidence (e.g., a well-implemented randomized trial) only formally applies to the population randomized. Following statutory guidance, the evidence reviews have focused on studies with strong internal validity. Not extending the results—any more than necessary—to other populations and contexts is consistent with that focus on internal validity. Indeed, several of the papers in this special issue consider ways to formally extend existing results (Stuart & Rhodes, 2017; Tipton, Hallberg, Hedges, & Chan, 2017; Tipton & Peck, 2017; see also Kern, Stuart, Hill, & Green, 2016; Olsen, Orr, Bell, & Stuart, 2013; Stuart, Bradshaw, & Leaf, 2014; Stuart, Cole, Bradshaw, & Leaf, 2011; Tipton, 2013, 2014).
If concerns about external validity are primary, then such program selection by pattern matching seems reasonable. But, if we are unwilling to extend evaluation evidence beyond the original evaluation population, evidence-based policy making will be—for several reasons—impossible.
First, all evaluation evidence relates to the past; all program selection decisions are about the future. Some generalizing over time is unavoidable.
Second, there will never be sufficient evaluation evidence to perfectly pattern match for all possible sites. If the decision problem is to choose a single program model over some large population, we could imagine random assignment on a randomly 3 chosen subset of that population, for example, randomly choosing communities from across the nation (Olsen & Orr, 2016; Tipton et al., 2017). This is a leading formal approach to issues of external validity.
However, this formal approach breaks down when a local site is choosing a program model. Only in the rarest of cases will we have research evidence for this site (i.e., the site choosing a program model) or even for this context/demography (e.g., rural and Hispanic). Even if we give up on evidence for this site, a broad (e.g., national) evaluation will rarely have enough observations to yield precise estimates for sites “like” this one (i.e., rural and Hispanic). Given the lack of precision, in most cases, there will not be clear evidence of impact for “sites like this one.” Where there is clear evidence of impact for sites like this one, the estimated impact is likely to be huge. This is because given the small samples, only huge estimates of impact will be detectable. For a variety of technical reasons (see the next section), those estimated impacts are likely to be considerably larger than the true impacts for sites like this one.
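The overestimation described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of any real evaluation: the true impact (0.10), the standard error (0.10), the one-sided significance filter, and the function name `simulate_significance_filter` are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_significance_filter(true_impact=0.10, std_error=0.10,
                                 n_studies=20000, seed=0):
    """Simulate many small-sample subgroup estimates of one true impact.

    Each estimate is drawn from Normal(true_impact, std_error). Only
    estimates that are positive and statistically significant
    (z > 1.96) count as "clear evidence of impact".
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_impact, std_error) for _ in range(n_studies)]
    significant = [e for e in estimates if e / std_error > 1.96]
    return {
        "share_significant": len(significant) / n_studies,
        "mean_all": statistics.fmean(estimates),
        "mean_significant": statistics.fmean(significant),
    }


result = simulate_significance_filter()
print(result)
```

Across all simulated studies the average estimate sits close to the assumed true impact, but every estimate that clears the significance threshold must exceed 1.96 standard errors, so the average among "significant" results lies well above the true impact.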
The Problems With Narrow Evidence
Presumably, systematic reviews are trying to help each site to choose the program model that will have the largest impact for that site. 4 Several features of the available evidence suggest that the SDR/pattern matching will lead to selection of a suboptimal program model—for this context and population.
First, our studies are usually not well powered. We often deliberately specify sample sizes just sufficient to detect whether a program has any impact. Samples of this size are far too small to determine which of two program models is better. Even when differences in effectiveness across program models are large, determining which of two program models is better will require pooling across multiple studies. As we narrow the set of studies we consider, we lose the power needed to identify the better program.
Second, because our studies are not well powered, subgroup analyses are also woefully underpowered. Sometimes we can detect an impact in a (larger) subgroup. Only rarely can we detect a differential impact. Focusing on program models demonstrated effective for a subgroup will often imply not choosing better models that have not been demonstrated effective in this population.
Third, sometimes a program is tested only in a subgroup (e.g., a Black neighborhood); often other programs have not been tested in that subgroup at all. Focusing on program models demonstrated effective in this subgroup implies ignoring program models that have not been tested in the subgroup. Perhaps the best program has not been tested in the subgroup.
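The power argument in the first point can be made concrete with a standard normal-approximation sample size formula. The sketch below is illustrative only: the effect sizes (0.20 and 0.05 standard deviations), the head-to-head two-arm design, and the function name `n_per_arm` are assumptions introduced here, not figures from any of the evaluations discussed.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-arm comparison of means.

    effect_size is the standardized difference to detect (in SD units).
    Uses the normal approximation n = 2 * (z_alpha + z_power)^2 / d^2
    for a two-sided test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Hypothetical: detecting whether a program has any impact (0.20 SD)...
n_any_impact = n_per_arm(0.20)
# ...versus detecting a 0.05 SD difference between two program models.
n_which_is_better = n_per_arm(0.05)
print(n_any_impact, n_which_is_better)
```

Because required sample size scales with the inverse square of the effect to be detected, a difference between programs one quarter the size of a program's own impact needs roughly sixteen times the sample.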
An EB Approach
This section uses EB ideas to develop a model consistent with the discussion in the previous section. 5 To fix ideas, suppose that “sites like this one” means sites serving the subgroup that this site would serve. Then, a meta-analytic perspective (e.g., Hedges & Olkin, 2014) would suggest modeling the impact of some program model, p, on subgroup, g, as follows:
where
In terms of this model, the difference in impacts between Programs A and B for subgroup g will be:
If
In principle, each of these terms is estimable. An EB perspective (Casella, 1985; Morris, 1983; Valentine et al., 2017) suggests, however, that we should not use the estimated parameters directly. The (almost always) small number of narrow studies of program p for subgroup g and the small samples for those studies (or small subsamples in broad studies) imply that estimates of γ have considerable sampling variability. Therefore, instead of using the estimated γs, an EB perspective would suggest using an adjusted version of estimated parameters.
where
Equation 3 implies the following: As the sampling variability, As the assumed (but not unknowable) true variance,
In almost all cases of policy analysis, the sampling variability,
The crucial issue is thus how big the interactions are. If impacts vary radically across subgroups, then one should give more weight to narrow evidence. Evidence for large interactions is weak. Few studies have sufficient power to precisely estimate interactions, but those that do often fail to detect any subgroup effects (e.g., Collins et al., 2016; Jaciw, Lin, & Ma, 2016; Michalopoulos, Schwartz, & Adams-Ciardullo, 2000). 6 Furthermore, what estimates we have of subgroup differences are likely due in part to common differential impacts by subgroup, not to true interactions. It follows that estimates of differential subgroup impacts are likely an upper bound on the interactions of interest for this analysis. 7
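One specification consistent with the description in this section can be sketched as follows. This is a minimal sketch under stated assumptions: only γ is named in the text, and the symbols Δ, α, β, τ², σ², and λ are introduced here purely for illustration.

```latex
% Assumed notation (illustrative only):
%   \Delta_{pg}   impact of program model p on subgroup g
%   \alpha_p      program main effect
%   \beta_g       subgroup main effect (common differential impact)
%   \gamma_{pg}   Program x Subgroup interaction
%   \tau^2        true variance of the interactions
%   \sigma_{pg}^2 sampling variance of the estimate \hat{\gamma}_{pg}
\begin{align}
\Delta_{pg} &= \alpha_p + \beta_g + \gamma_{pg} \\
\Delta_{Ag} - \Delta_{Bg} &= (\alpha_A - \alpha_B) + (\gamma_{Ag} - \gamma_{Bg}) \\
\tilde{\gamma}_{pg} &= \lambda_{pg}\,\hat{\gamma}_{pg},
\qquad \lambda_{pg} = \frac{\tau^2}{\tau^2 + \sigma_{pg}^2}
\end{align}
```

Under this sketch, the subgroup main effect β cancels from the comparison of programs, the weight λ falls toward 0 as the sampling variance grows, and it rises toward 1 as the true variance of the interactions grows.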
This line of argument leads to a decision rule very different from the SDR/pattern matching with its focus on narrow evidence. The implied, if slightly imprecise, alternative decision rule (ADR) posits:
Discussion
The previous section argued that concerns about external validity can be viewed as concerns about interactions between main effects (i.e., Program × Population/Context). If there were no such interactions, the decision problem would be to identify the single best program model, common across all populations/contexts. Given the frequent lack of any evidence in the subgroup of interest for many plausible programs and the small sample sizes (and therefore large sampling variability) of what evidence does exist, those giving advice to sites about which program to choose should give considerably more weight to broad evidence than to narrow evidence. When broad evidence suggests much larger impacts for one program model, only the strongest narrow evidence should overturn that result. The absence of narrow evidence for the program that the broad evidence suggests has the largest impact will usually not be a reason to choose the program for which there is narrow evidence of effectiveness. Furthermore, if Program A is clearly better than Program B overall, then narrow evidence that Program B is better than Program A in this subgroup may not be enough to choose Program B over Program A.
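A numerical illustration of this weighting may help. The sketch below is hypothetical throughout: the impacts, the standard error, the assumed standard deviation of the true interactions, and the function names are invented for illustration, and the broad impact is treated as known without error for simplicity.

```python
def shrinkage_weight(interaction_variance, sampling_variance):
    """Weight on narrow evidence: true variance over total variance."""
    return interaction_variance / (interaction_variance + sampling_variance)


def eb_subgroup_impact(broad_impact, narrow_impact=None, narrow_se=None,
                       interaction_sd=0.05):
    """Shrink a narrow (subgroup) estimate toward the broad estimate.

    With no narrow evidence, the broad estimate is used as is.
    interaction_sd is an assumed SD of true Program x Subgroup
    interactions (a hypothetical value, not an estimate).
    """
    if narrow_impact is None:
        return broad_impact
    weight = shrinkage_weight(interaction_sd ** 2, narrow_se ** 2)
    return broad_impact + weight * (narrow_impact - broad_impact)


# Program A: strong broad evidence, never tested in this subgroup.
impact_a = eb_subgroup_impact(broad_impact=0.20)
# Program B: weaker broad evidence, one imprecise but large narrow estimate.
impact_b = eb_subgroup_impact(broad_impact=0.10, narrow_impact=0.30,
                              narrow_se=0.15)
print(round(impact_a, 3), round(impact_b, 3))
```

With these assumed inputs the narrow estimate for Program B receives a weight of about one tenth, so its adjusted impact stays below the broad estimate for Program A even though its raw narrow estimate is larger.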
This analysis implies guidance very different from the SDR/pattern matching. Ideally, we would simultaneously consider all programs and the interactions that drive external validity concerns. The EB approach of the previous section presents a specification that does so. In most domains, estimating that specification will require many more and much larger evaluations than we currently have.
In practice, it seems likely both that the program models vary in the impact for a given population/context and that the impact varies with population/context. Focusing on external validity leads to considering a very narrow set of programs. But identifying the best program for a given population is itself challenging—near and often past what is feasible even if we pool all studies. Thus, a focus on external validity is not costless. Given how thin the available evidence usually is, focusing on the narrow(est) evidence will often lead to choosing the wrong program—for this site.
This analysis implies several policy recommendations.
First, fund more and larger evaluations. This recommendation is hardly surprising. Several other perspectives also lead to this recommendation. There is already considerable movement in this direction. OAH built evaluation into the second round of TPP grant funding, even for sites implementing program models with demonstrated effectiveness. The issue, of course, is cost.
Second, evaluations should be strongly encouraged to estimate and report subgroup impacts (and their standard errors) or, more broadly, estimates of impact moderation/response surface modeling (Kern et al., 2016), even if the subgroup estimates are not statistically significant. Those subgroup results, even when statistically insignificant, are the data for a meta-analysis addressing external validity issues (in particular, differential impacts by population characteristics). In addition, the underlying data should be made available for reanalysis. Again, this recommendation is hardly surprising. Several other perspectives also lead to this recommendation.
Third, fund meta-analyses, and in particular, meta-analyses that attempt to estimate the interactions that are crucial for addressing external validity concerns. Issues of sampling variability suggest that those meta-analyses should probably consider an EB perspective. Again, there is already movement in this direction. In 2016, U.S. Department of Health and Human Services/Administration for Children and Families awarded a contract for meta-analysis of the teen pregnancy evaluations (“Quantitative Synthesis of Federal Funded Teen Pregnancy Prevention Programs”; 15-233-SOL-00579).
Finally, put more weight on broad evidence. Until enough studies are completed, and the results of EB modeling of those data become available, we will need to use the EB insights developed in this essay in a less formal way. Specifically, this discussion suggests that focusing on—often second order—external validity issues and following the SDR/pattern matching will usually lead to choosing programs with a smaller impact in this context/population than would a strategy that considers both broad and narrow evidence, giving considerably more weight to the broad evidence.
Author’s Note
The ideas expressed here are only the position of the author. They are not necessarily the position of Abt Associates, its sponsors, or those who provided comments. Bry Pollack provided editorial assistance.
Acknowledgments
Earlier versions of these ideas were presented at the Research and Evaluation Conference on Self-Sufficiency, Washington, DC, June 2016, sponsored by the Office of Planning, Research and Evaluation in the Administration for Children and Families, U.S. Department of Health and Human Services. Many thanks for comments received there. This article benefited from comments on earlier drafts by Austin Nichols, Andrew Jaciw, Dave Judkins, Randall Juras, Laura Peck, Rob Olsen, and T’Pring Westbrook.
