Abstract
The success of federal agencies in creating and using evidence-based policies hinges on (1) their commitment to include routine use of evidence—including research and program evaluations—in program design and funding decisions and (2) their capacity to adapt their operating practices accordingly. The recent push toward using evidence more deliberately in government meant that federal agencies needed to quickly improve the accessibility of existing evidence. They also had to foster internal capacity to fairly judge its quality and applicability; build capacity and support for routinely using evidence within program and policy offices to support policy development and monitoring; and create a consensus within agencies around sensible ways to categorize, rate, and apply evidence. Common evidence standards, open access to evidence review platforms, and mandates for embedding rigorous evaluations into funded programs are among the most influential tools agencies have used in this new era of evidence-based policymaking.
In the late 1960s, Congress began periodically mandating rigorous evaluations of new or reauthorized programs funded by agencies such as the U.S. Departments of Education (ED), Health and Human Services (HHS), and Labor (DOL) (U.S. Government Accountability Office 2005; Orszag 2009; Zients 2012). By the early 2000s, social policy fields had developed enough evidence that scholars and some government agencies began systematically collecting and synthesizing findings on the effectiveness of particular programs, policies, or practices and making the syntheses publicly available on web-based platforms such as the nonprofit Campbell Collaboration (C-2) and the What Works Clearinghouse (WWC) at ED.
Congressional actions such as the 2001 No Child Left Behind Act, which called for the use of scientifically based evidence, sowed the seeds for federal policies that more routinely integrate evidence into their development and funding decisions. Then, in President Obama’s first term, massive public and private investments in education, health and human services, and employment programs—a response to the 2008 recession—catalyzed the evidence-based policy movement by requiring that the funds be used in ways supported by scientific evidence.
Between 2010 and 2016, nearly $7 billion of federal stimulus funding flowed to evidence-based programs (Feldman and Haskins 2016). That funding helped to further expand the evidence base, increase the use of evidence to help design new programs, and support continuous improvement of ongoing programs. This article discusses the infrastructure behind evidence-based policymaking in federal agencies, especially ED, and highlights strategies for creating the capacity and culture that agencies need to generate and productively use evidence for developing policy, making funding decisions, and managing programs.
Infrastructure to Support the Evidence-Based Policy Agenda
Four pillars of support have helped to advance the federal evidence-based policy agenda. One is the core infrastructure to produce and effectively use evidence. A second is the sizable and growing body of evidence on the effectiveness of programs, policies, and practices, stimulated in large part by demand from federal funding agencies. Third are systems for archiving evaluation evidence and making it publicly accessible (Petrosino et al. 2001; Whitehurst 2003). Fourth are standards and systems, including professional standards for the ethical conduct of evaluations, for reporting findings, and for synthesizing findings across studies.
Infrastructure to support evidence production and use
From the War on Poverty until the end of the twentieth century, there was a modest but growing band of professionals engaged in the production and use of evidence intended to inform public policy. Professionals within government tended to oversee funding of evaluations of federally funded programs and use of evidence to support policy development and monitoring, while social scientists in research firms and academic research centers designed and carried out the program evaluations. In the early years, these professionals were developing their craft on the job. However, by the turn of the century, this workforce had grown and matured, in no small part due to expansion in number and size of both schools of public policy and administration and research firms specializing in program and policy evaluation (e.g., Abt Associates, American Institutes for Research, Mathematica Policy Research, Rand Corporation, and RTI International).
Demand for quality evidence
There have been ebbs and flows in funding for research needed to support evidence-based policy. Moreover, until recently, there has been considerable variability and often ambiguity regarding the expected level of rigor in federally funded program evaluations. Even today, many federally funded evaluations are more accurately characterized as basic program audits than as evaluations designed to generate credible evidence to guide policy development and program improvement. Furthermore, control over the type and quality of evidence produced often is fragmented across and within agencies.
Since the turn of the century, the government has stepped up not only its commitment to rigorously evaluating funded programs but also its reliance on evidence to guide program and policy development and funding decisions. It is much more deliberate about what is evaluated, what types of evidence are expected to be produced, and what types of evidence will count in federal policy deliberations, including deliberations over what evaluations to fund.
As efforts to use evidence in the policy process have increased, the limitations of the existing evidence base have become more visible. For example, a 1996 meta-analysis of Title I evaluations identified and reviewed 150 studies published between 1976 and 1995, of which only 17 met specified standards for credible evidence (Borman and D’Agostino 1996). A series of such experiences triggered efforts to promulgate standards that, in turn, influenced the number and nature of studies commissioned. The gap between evidence sought and found stimulated focused attention on the potential of and limits to using the existing evidence to improve policy and practice, for example, by using descriptive and correlational evidence to construct grounded theories of change but not to support causal warrants.
Today, the hard work of mapping various types of evidence to funding decisions falls largely to the staff of the federal agencies with program oversight and, in some cases, to congressional staff responsible for integrating evidence guidelines into legislation. These staff must carefully consider the implications of their decisions for stakeholder groups that include federal program and evaluation offices, in addition to researchers and practitioners.
Accessibility of evidence to guide public policymaking and management
At the turn of this century, only a patchwork of evidence supported policymaking and administration. For example, many welfare and workforce policies had been rigorously evaluated, but such evidence was sparse for program and policy initiatives in education, early childhood, and juvenile justice. Moreover, the research was scattered across federal program and evaluation offices, academic journals, and unpublished manuscripts.
The Smith Richardson Foundation (Cottingham, Maynard, and Stagner 2004) was instrumental in laying the groundwork for systematic efforts to collect and synthesize evidence on the effectiveness of various approaches to pressing social and economic problems, and to make that evidence publicly accessible. Today, branches of federal agencies—such as the Institute of Education Sciences (IES) in ED, the National Commission on Families in HHS, and the Employment and Training Administration in DOL—support many evidence clearinghouses. Several others operate outside of government, such as the Campbell Collaboration and Results First (Maynard, Goldstein, and Smith Nightingale 2016).
As an example, the WWC in ED, launched in 2002 (Whitehurst 2003; this volume), has reviewed more than 9,900 studies of education programs, policies, and practices against well-documented, carefully considered standards. All the reviews are available for free on the Internet. When studies provide evidence on program effectiveness, WWC rates the quality of that evidence; if studies meet WWC’s evidence standards, reviews include a description of the intervention, the setting, the study sample and methods, and the study findings. When multiple studies report on the same program, policy, or practice, WWC also reports the average estimated effect, as well as information on the extent and quality of evidence (e.g., the number of studies, the quality of the evidence reported in each study, and the range of the estimates).
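As a rough illustration of the kind of multi-study summary described above, the following Python sketch reports the number of studies that meet standards, their average estimated effect, and the range of the estimates. The study records, the field names, and the use of a simple unweighted mean are assumptions made for illustration only; they are not drawn from the WWC's procedures, which may weight or combine studies differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Study:
    """Hypothetical study record; field names are illustrative only."""
    name: str
    effect_size: float      # estimated effect, in standard deviation units
    meets_standards: bool   # whether the study passed the evidence review


def summarize(studies):
    """Summarize studies that met standards: count, mean effect, and range."""
    eligible = [s.effect_size for s in studies if s.meets_standards]
    if not eligible:
        return {"n_studies": 0, "mean_effect": None, "effect_range": None}
    return {
        "n_studies": len(eligible),
        "mean_effect": mean(eligible),  # unweighted mean (an assumption)
        "effect_range": (min(eligible), max(eligible)),
    }


# Made-up numbers, used only to show the shape of the summary.
example = [
    Study("Study A", 0.25, True),
    Study("Study B", 0.5, True),
    Study("Study C", 0.75, False),  # excluded: did not meet standards
]
summary = summarize(example)
```

A summary of this form keeps the extent of the evidence (how many studies, how widely the estimates vary) visible alongside the average, so that an average resting on one or two studies is not read as settled.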
Professional standards and best practices
There is a robust industry in program evaluation and policy analysis today. However, most of the literature in the field still lacks rigor, and there are major gaps in how the information is being used by policy-makers and practitioners. Beginning with the War on Poverty, the past 40 years of social program evaluations led to a core set of principles that govern the design and implementation of rigorous program evaluations and to commonly accepted practices for estimating costs and comparing cost-effectiveness (Gueron and Rolston 2013). However, the advent of evidence-based policy initiatives created a need to expand government agencies’ capacity to rigorously evaluate programs, systematically review evidence, and map that evidence to standards for making policy and for funding and monitoring programs. Moreover, federal evaluation and program offices have come to need staff who have working knowledge of evidence standards, who understand how those standards affect their oversight responsibilities for federally mandated and/or funded evaluations, and who are able to judge whether evidence is adequate to support specific policy decisions. As a result of these advances, the production of high-quality studies that provide valuable insights regarding the likely impact of various types of program and policy changes has accelerated.
Building Agencies’ Capacity and Culture to Create and Use Evidence
By the early 2000s, the Office of Management and Budget (OMB) had gained buy-in for the evidence-based movement from senior agency officials through well-crafted directives calling on them to prioritize evidence-based approaches to policymaking and monitoring and to build evidence capacity in their agencies (Orszag 2009). The biggest challenge for these officials was changing the culture and practices within federal agencies (and among states, localities, and grantees that implement federally funded policies), and this challenge fell largely to career civil servants and strategically positioned political appointees.
ED was somewhat better prepared than other agencies to develop and administer evidence-based policies. First, it had already established standards for evidence on program effectiveness in the WWC. Second, it included the IES, which operates as an independent research and evaluation arm of ED. Third, IES had staff who were familiar with the WWC and its evidence standards. Like peer agencies, though, it lacked strong working relationships between the program office staff responsible for funding decisions and those responsible for program oversight, which typically included requirements for independent evaluations. Moreover, program office staff and their grantees often lacked extensive expertise in the conduct of rigorous evaluations and/or the use of evidence that such evaluations produce.
In 2010, ED established a department-wide working group to advance its capacity to better use evidence in all aspects of its work, from annual planning and performance monitoring to policy development and administration. The group of roughly a dozen people included political appointees and career staff from the agency’s program and evaluation arms.
The OMB also ran a cross-agency working group, with senior representatives from key federal agencies. These working groups focused on five primary areas: defining what constitutes evidence for various purposes; defining “acceptable” evidence to support particular funding or performance criteria; creating efficient ways to review evidence submitted in support of grant applications; building trusting, supportive relationships between program and evaluation offices; and collaborating with prospective and current grantees to support their use of evidence.
What counts as evidence
One of the biggest challenges for all agencies was agreeing on what types and levels of evidence should count for various purposes. Reaching consensus was complicated by three factors. First, not all stakeholders understood the scientific principles for evaluation equally well. This was particularly true of practitioners and policy-makers whose academic training and professional experiences often paid only lip service to the topic. Second, the amount, relevance, and quality of evidence varied across and within policy areas. Third, the likelihood of political backlash varied depending on the policy area.
All stakeholders agreed that it is important to prioritize existing evidence-supported programs and practices and to encourage new and innovative approaches. However, recognizing that not all evidence is created equal, there was a push for “tiered evidence,” which essentially rates evidence by the credibility of its causal impact estimates and the estimated effectiveness of the focal intervention. The jury is still out on how well the tiered-evidence strategies can work to balance these objectives.
Moreover, putting concepts such as “alignment” and “strength” of evidence into practice proved to be complicated and sometimes controversial. In some areas, such as education, much of the funding for basic research comes from the National Science Foundation (NSF). Thus, to create a seamless system of evidence production and use with minimal threat of political backlash, ED partnered with NSF to develop the Common Evidence Guidelines for Education Research. These guidelines aimed to (1) help organize and guide NSF’s and ED’s respective decisions about investments in education research and (2) clarify for potential grantees and peer reviewers the justifications for and evidence expected from each type of study … [and] speed the pace of research and development. (Earle, Maynard, and Curran Neild 2013)
The guidelines created a common understanding of the nature, strengths, and limitations of various genres of research and fostered appreciation and respect for context when judging whether and, if so, how evidence should count.
What constitutes evidence to support funding
Defining what constitutes evidence for a particular federal program generally falls to Congress, with substantial input from the relevant federal agencies. Under tiered-evidence standards, such as those used by ED’s Investing in Innovation (i3) fund—today called Education Innovation and Research—the most generous funding is typically reserved for applicants who present strong evidence that their proposed strategy is likely to produce the intended outcomes. For example, applicants might present one large or multiple smaller studies that used an experimental or other well-matched comparison group design and showed statistically significant and/or meaningfully sized estimated impacts for populations and settings similar to their own. At the other end of the continuum, an applicant who presents only a correlational study showing a statistically significant relationship between the proposed program and the desired outcome might still be eligible for support, but with less funding and a requirement to conduct a rigorous evaluation that will generate evidence to guide future policy decisions.
Efficient ways to review evidence
Early on, the evidence criteria used in federal grant competitions were ad hoc variations of (but neither fully aligned with nor anchored to) the evidence standards and study ratings that had been codified in review protocols used for agency-sponsored evidence review initiatives like the WWC. In short order, however, staff who administered the evidence-based grant competitions began aligning their evidence criteria with those being used in their agency’s evidence review platforms (though not necessarily mimicking those criteria). Some funding streams, such as those supported under the Every Student Succeeds Act (ESSA), link their standards to those used by the WWC (U.S. Department of Education 2016; Gross 2016). Most use tiering, so that funding levels vary by the extent and strength of the supporting evidence. And some, such as the Office of Adolescent Health’s Teen Pregnancy Prevention initiatives and the Nurse Home Visiting Programs, rely on lists of programs already designated as evidence-based (U.S. Department of Health and Human Services n.d.; Sama-Miller et al. 2016).
ED has begun mapping points of consistency and difference between the evidence standards being used for various ED programs and the WWC evidence standards. The most common differences involve greater specificity in the context for the study (e.g., reference populations, date and location of the study, outcomes reported). But the evidence guidelines for particular funding streams may also include criteria such as minimum sample size and minimum number of studies providing supporting evidence.
Agencies have also had to develop internal protocols governing such things as who can review evidence submitted to support applications or monitor performance on the funded projects, and who can specify processes for documenting and communicating review results. ED greatly simplified these decisions by building on the WWC study screening and review protocols, including the practice of using only certified WWC reviewers and tailoring review protocols to specific program solicitations.
Cooperation and collaboration
For ED’s evidence-based policy initiatives to succeed, program and evaluation staff had to cooperate and collaborate. One important area of collaboration was developing recommendations for evidence standards specific to particular grant programs. A second was establishing timely, accurate, and transparent systems to manage reviews of evidence, and a third was developing capacity to help grantees meet the requirements for evaluating funded programs.
In all cases, cooperation and collaboration sprang from the relevant agency and/or interagency working groups, which showed the potential for mutual benefits. Those of us on the evaluation side of ED learned more about how evidence was defined and used in funding decisions. On the flip side, staff in the program offices learned more about how to embed rigorous evaluations into ongoing programs without jeopardizing their success. For example, by giving program offices the evaluation support they needed during the evidence review process, IES increased the use and usefulness of its evaluation resources. Further, by working closely with IES to get strong evaluation support for programs funded under evidence-based policy initiatives, the program offices strengthened their grantees’ ability to meet their obligations for rigorous monitoring and evaluation and the evaluation offices gained more and better studies to inform future policy development efforts.
Building support and trust among grantees
Leaders of the evidence-based movement knew it was crucial that these policies neither discouraged innovation among applicants nor deterred innovative organizations from applying for funding. In addition to using tiered-evidence strategies, federal agencies typically reserved some program resources for outreach and support to prospective applicants for funding. Some support strategies were modest, like hosting webinars on the evidence requirements for funding streams. More significant efforts included contracting with research firms to give grantees technical assistance with the required evaluations. For example, i3 grantees were required to contract for independent evaluations of their programs (the required qualities of which varied across evidence tiers). But grantees also had access to tailored evaluation tools, could participate in professional training sessions, received professional feedback on study plans and products, and had access to technical support for their evaluations.
Access to technical assistance contractors had three benefits. First, it substantially increased the likelihood that grantees would complete and release the mandated evaluation reports. Second, it expanded the pool of well-credentialed evaluation experts who offered training, were available to answer questions, and provided constructive feedback on evaluation plans. Third, it created tools and other guidance documents to improve the quality and efficiency of the research (e.g., see IES n.d.; Price et al. 2016a, 2016b; Corporation for National and Community Service n.d.).
Looking to the Future
From the perspective of someone who has been intimately involved in agency adoption of evidence-based approaches for over three decades, it is clear that evidence-based policymaking has come a long way. However, the lessons of the last decade have also shown us that there remains much room for improvement. For example, there is still a need for
a system for real-time tracking of existing evidence and in-process evaluations;
greater transparency in the underlying research to support published findings;
systems for linking and sharing data, while still protecting privacy;
greater breadth, depth, and quality of the evidence to support decision-making;
improvements in the culture and capacity to embed rigorous experimentation and monitoring evaluations into routine program operations; and
closing the gaps among improvement science, design-based implementation research, traditional program evaluation, and benefit-cost analysis.
If all federally sponsored evaluations were registered in a publicly accessible database, we would have ready access to information about in-process and completed studies and better information about whether the results from completed evaluations have been fully and objectively reported (Pigott et al. 2013). Then we could better judge whether existing evidence is applicable to other times and settings (Tipton, Yeager, and Iachan 2016). Evidence clearinghouses such as the WWC and CrimeSolutions.gov have made it easier to access and use data to guide decision-making. But such clearinghouses could improve the transparency and consistency of their review criteria, the content and format of their information, and their capacity for data sharing. For example, underlying each of the evidence review platforms is a coded database that, if it were accessible, would make it easier to conduct tailor-made research syntheses.
The current evidence base is spotty in coverage and includes many studies that are outdated and/or methodologically flawed. As an example, a recent WWC review of four widely touted strategies to improve the teaching force identified forty-nine potentially relevant studies. Only seven of twenty-four studies of Teach for America met the WWC’s evidence standards and, for each of the other three programs, only one study met standards (author’s tabulations). In total, only about 10 percent of the studies identified by the WWC as relevant to the topics reviewed to date met its standards for credibility.
The long-run promise of evidence-based policymaking and administration depends on creating a culture that rewards program administrators for engineering strategies to improve outcomes and on routinely embedding rigorous evaluations into major program improvement efforts. The embedded A-B experiments (i.e., comparisons of alternative conditions or strategies) now being promoted as part of evidence-based policy initiatives inside and outside of government are compatible with improvement science principles, and conducting such studies should increase the success rate of the tested programs. One promising tactic would be to conduct fewer large-scale, place-based evaluations and rely more on networked embedded experiments (i.e., many small-scale replications of identical experiments), which are increasingly common in health care (Sharples 2013; Marshall, Pronovost, and Dixon-Woods 2013). Doing so would improve the generalizability of study findings, reduce the burden on programs participating in the studies, decrease the time needed to complete studies, and lower costs.
Notes
Rebecca A. Maynard is University Trustee Chair Professor of Education and Social Policy at the University of Pennsylvania and former commissioner for the National Center for Education Evaluation and Regional Assistance at the Institute of Education Sciences. Her current research focuses on methods for integrating program evaluation and improvement science and on improving the quality and utility of research syntheses.
