Abstract
In a 1987 article, Peter H. Rossi promulgated “The Iron Law of Evaluation and Other Metallic Rules.” The Metallic Laws were meant as an informal (and humorous) overstatement of the weakness of contemporary evaluations of social programs. Rossi’s underlying worry was not so much about the state of evaluation technology in the abstract but, rather, about its inability to advance our broad understanding of social problems and what to do about them; in other words, to make evaluation policy-relevant. Rossi attributed the continuing failure to develop successful “large-scale social programs” to the failure to build a strong knowledge base for this kind of “social engineering.” The qualities of studies that enable such accumulated learning are variously labeled “external validity,” “generalizability,” “applicability,” or “transferability.” This Special Issue includes five papers that seek to explore and apply this understanding.
During his distinguished career, Peter H. Rossi made many important contributions to the field of program evaluation. Perhaps the most famous (or infamous) were his Metallic Laws of Evaluation, first mentioned in a 1978 paper and then formally published in a 1987 article, “The Iron Law of Evaluation and Other Metallic Rules” (see Box 1). The Metallic Laws were meant as an informal (and humorous) overstatement of an exceedingly serious point. As Rossi (2003) explained twenty-five years later at a panel in his honor at the Association for Public Policy Analysis and Management (APPAM):
“The Iron law states that the typical impact assessment of a public social program finds that the program is either ineffective or only marginally effective. The Stainless Steel Law is that better designed evaluations are more likely to yield such findings” (p. 2).
At that 2003 APPAM panel, Rossi lamented that the attention his Laws received was the source of considerable embarrassment, as the Laws were easily misunderstood and frequently misused (Rossi, 2003, p. 3). Nevertheless, their underlying message struck a deep nerve in the social policy and evaluation community. His essential message was that the results of program evaluations were not helping guide the development or operations of large social programs, a point echoed in many places and in many ways. Rebecca Maynard’s 2005 APPAM presidential address, for example, asked: “Evidence-Based Decision Making: What Will It Take for the Decision Makers to Care?”
Box 1. The Iron Law of Evaluation and Other Metallic Rules
• The Iron Law: “The expected value of any net impact assessment of any large-scale social program is zero.”
• Stainless Steel Law: “The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.”
• Brass Law: “The more social programs are designed to change individuals, the more likely the net impact of the program will be zero.”
• Zinc Law: “Only those programs that are likely to fail are evaluated.”
Source: Rossi (1987).
In the roughly forty-five years since Rossi first promulgated the Metallic Laws, program evaluations have grown vastly in number, size, and sophistication. Major advances have been made in the ability to estimate the outputs, outcomes, and impacts of individual projects. Nevertheless, as Randall Brown said in accepting the 2020 Rossi Award for Contributions to the Theory and Practice of Program Evaluation: “While evaluations have shown some programs to be successful over the years, the experience of legions of policy researchers suggests that the essence of Rossi’s Iron Law that relatively few public social programs are found to be effective continues to hold true” (Brown, 2023).
A careful reading of Rossi’s two papers on the Metallic Laws reveals that his underlying worry was not so much about the state of evaluation technology in the abstract but, rather, about its inability to advance our broad understanding of social problems and what to do about them; in other words, to make evaluation policy-relevant. Rossi attributed the continuing failure to develop successful “large-scale social programs” to the failure to build a strong knowledge base for this kind of “social engineering.” As he explained:
“The major reason why public social programs fail is that effective programs are difficult to design. Those who typically dominate in designing programs often do not have the social science skills and knowledge needed. Basic social science furthermore is not advanced enough to provide strong guides to designing effective programs. The consequence is that the designing of social programs has been a kind of trial and error strategy of try-this-and-try-that with little accumulation of knowledge that might be the basis of social engineering” (Rossi, 2003, p. 2).
In 2022, we asked a small group of experts in the field of program evaluation to identify the important next steps to fulfill Rossi’s hope for more informed research strategies through the “accumulation of knowledge.” Their uniform answer? Learn more from past (and future) evaluations. They emphasized that, although the quality of individual evaluations was essential, the bigger challenge now was to see how the findings of individual studies could be validly combined to enrich understanding. As Maynard explains:
“It is important to accumulate evidence. Rarely will any one study provide all the information needed to guide policy or practice in particular areas. Most often, the information needed to make well-informed decisions comes from multiple (often many) studies that examine issues in varied contexts and with different target populations” (Maynard, 2006, pp. 250–251).
The qualities of studies that enable such accumulated learning are variously labeled “external validity,” “generalizability,” “applicability,” or “transferability.” For present purposes, we forgo defining these terms because their definitions and applications differ by discipline (and, indeed, within disciplines by different authors) and also between the US and other countries.
Based on this understanding, the five papers in this Special Issue explore the need to accumulate knowledge from separate but related perspectives.
• “Contexts of Convenience: Generalizing from Published Evaluations of School Finance Policies” (Danielle Handel and Eric Hanushek, Stanford University): Handel and Hanushek consider the interpretation and generalizability of recent, credibly identified studies of how educational funding affects student outcomes. They describe how the available quasi-experimental estimates come from a wide range of contexts where additional resources can be assumed to be exogenous. After harmonizing the results to reflect the impacts of a ten percent increase in funding, they find a wide variation in impact estimates. Over half of the observed variation reflects differences in the true impact parameters across contexts, with the remainder reflecting sampling error. They argue that simple comparisons of central tendency or other distributional measures from the existing set of studies are inappropriate because the underlying contexts for the funding changes differ widely and cannot be readily described as estimates coming from a well-defined population. Instead, Handel and Hanushek first consider but reject the possibility that the variations of estimates reflect differences in the estimation approaches. They then proceed to consider whether the observed variation in estimates arises because they vary by court-ordered funding versus other funding, by targeted versus untargeted funding, or by state-level policies versus within-state funding. None of these underlying differences in study context explains the variation in estimated impacts. They find some evidence that funding has a greater impact on disadvantaged students, but the same variation in impacts depending on the context remains.
These attempts to reconcile differences in impact estimates lead Handel and Hanushek to underscore the basic observation that how funds are used (i.e., the context) is as important as, if not more important than, how much funding is available. Handel and Hanushek conclude that an obvious way to generalize from these results would be to obtain better descriptions of the relevant contexts for funding increases and then to replicate the studies of impacts within contexts. Unfortunately, they conclude, this is unlikely to occur because the incentives for researchers and for journals operate to limit such extensions.
• “How Mixed-Methods Research Can Improve the Policy Relevance of Impact Evaluations” (Burt Barnow, Sanjay Pandey, and Qian Luo, George Washington University): Barnow, Pandey, and Luo describe how mixed methods can improve the value and policy relevance of impact evaluations, paying particular attention to how mixed methods can be used to address external validity and generalization issues. They briefly review the literature on the rationales for using mixed methods; document the extent to which mixed methods have been used in impact evaluations in recent years; describe how they developed a list of recent impact evaluations using mixed methods and then conducted full-text reviews of these articles; and then summarize the findings from those articles. They also discuss three exemplars, and then describe how mixed methods have been used for studying and improving external validity and what improvements could be made in this area. Barnow, Pandey, and Luo find that mixed methods are rarely used in impact evaluations, and explain how increasing their use could be useful by reinforcing findings from the quantitative analysis (triangulation) and could also help us understand the mechanisms by which programs succeed (or fail) in having impacts.
Evaluations, they conclude, can and should go beyond treating interventions as a black box and, instead, should explore how outcomes vary by participant characteristics, program features, and environmental context. By exploring these issues, mixed methods can play an important role in identifying the extent to which evaluation results have external validity and generalizability.
• “The Logic of Generalization from Systematic Reviews and Meta-analyses of Impact Evaluations” (Julia Littell, Bryn Mawr College): Littell starts by describing systematic reviews and meta-analyses of impact evaluations as potent tools for generalized causal inference. Although results of these reviews are often used to inform decision makers about expected effects of interventions in diverse policy and practice contexts, the logic of generalization from research reviews is not well developed. Littell shows that systematic reviews are based on nonprobability samples of studies, programs, and participants; thus, the evaluations included in a systematic review are not necessarily representative of populations and treatments of interest. Moreover, the application of principles of generalized causal inference is hampered by high risks of bias, uncertain estimates, and insufficient descriptive data from impact evaluations. Littell presents a pragmatic approach to the assessment of the generalizability of systematic reviews and meta-analyses, which builds on sampling theory, concerns about epistemic uncertainty, and principles of generalized causal inference. This approach is applied to two systematic reviews and meta-analyses of “evidence-based” psychosocial interventions for youth and families.
Littell concludes that, while systematic reviews and meta-analyses can test generalizability claims and shed light on heterogeneity and potential moderators of effects, further work is needed to develop practical approaches to generalizability assessment that will guide better applications of interventions in policy and practice contexts.
• “Improving the Usefulness and Use of Meta-analysis to Inform Policy and Practice” (Rebecca Maynard, University of Pennsylvania): Maynard describes how meta-analysis is a powerful tool for synthesizing what we know about the likely impacts of particular policy decisions. Yet, she counsels, too often the available evidence falls short of ideal for judging the applicability and generalizability of findings to current and future populations and places. Maynard notes that, while we have a robust infrastructure to support the conduct of rigorous meta-analysis, there is much room for improvement in applying that infrastructure to support important decisions faced by policy makers and practitioners. She calls for the continued expansion and strengthening of the base of primary impact evaluations and for making key study details more accessible for use in meta-analyses. When designing primary studies and meta-analyses, she continues, we should take stock of the range of varied interests of alternative users as well as the volume and richness of the evidence base. Maynard concludes that, when designing and reporting meta-analyses, it is important to address issues of applicability and generalizability that are paramount for responsible use of the findings by policymakers and practitioners.
• “Transferability of lessons from program evaluations; iron laws, hiding hands and the evidence ecosystem” (Tom Ling, Cambridge University and RAND Europe): Ling begins by discussing the differences between two approaches to transferring lessons from one evaluation to other settings.
The first draws upon statistics and causal inference, and the second involves constructing a reasoned case based on weighing up different data collected along the causal chain from designing a program through to delivering results. Echoing the other contributions to this Special Issue, he argues that both approaches benefit from designing the research based upon existing evidence and ensuring that the descriptions of the program, context, and intended beneficiaries are sufficiently rich to allow lessons to be transferred. In more complex interventions, and where human choice and action play a greater role in implementation and engagement, Ling argues, a deeper understanding of human agency is needed. He emphasizes how humans as agents of social change bring their professional and social identities, their prior assumptions, and their varying capabilities to bear when they engage with delivering or receiving a program. Thus, his paper is a plea for drawing creatively and appropriately on both approaches, with a focus on learning lessons and spreading solutions rather than admiring problems.
The presence of two papers on systematic reviews is no accident, for it appears that this process of accumulating knowledge could be comfortably accommodated within their analytic processes. That seems to be the area of greatest promise.
This Special Issue seeks to increase interest in and activity on this important, but modestly studied, area of program evaluation, especially as compared to “internal validity”/“causal validity.” It is part of a parallel effort by European researchers that will be published in a subsequent Special Issue of Evaluation Review.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
