Abstract
This paper describes how mixed methods can improve the value and policy relevance of impact evaluations, paying particular attention to how mixed methods can be used to address external validity and generalization issues. We briefly review the literature on the rationales for using mixed methods; provide documentation of the extent to which mixed methods have been used in impact evaluations in recent years; describe how we developed a list of recent impact evaluations using mixed methods and the process used to conduct full-text reviews of these articles; summarize the findings from our analysis of the articles; discuss three exemplars of using mixed methods in impact evaluations; and discuss how mixed methods have been used for studying and improving external validity and potential improvements that could be made in this area. We find that mixed methods are rarely used in impact evaluations, and we believe that increased use of mixed methods would be useful because they can reinforce findings from the quantitative analysis (triangulation), and they can also help us understand the mechanism by which programs have their impacts and the reasons why programs fail.
Introduction
The purpose of this paper is to describe how mixed methods can improve the value and policy relevance of impact evaluations, paying particular attention to how mixed methods can be used to address external validity and generalization issues. We briefly review the literature on the rationales for using mixed methods; provide documentation of the extent to which mixed methods have been used in impact evaluations in recent years; describe how we developed a list of recent impact evaluations using mixed methods and the process used to conduct full-text reviews of these articles; summarize the findings from our analysis of the articles; discuss three exemplars of using mixed methods in impact evaluations; and discuss how mixed methods have been used for studying and improving external validity and potential improvements that could be made in this area.
To begin, we provide the definitions for some of the key terms used in this paper. Many of these definitions are from the academic literature on program evaluation, but the authors of the articles in this special issue have adopted some specific definitions for the special issue so that some key terms have the same meaning among all the papers.
• Evaluations are “periodic, objective assessments of a planned, ongoing, or completed project, program, or policy. Evaluations are used to answer specific questions, often related to design, implementation, and results” (Gertler et al., 2016, p. 327).
• Impact evaluation …. “is an evaluation that makes a causal link between a program or intervention and a set of outcomes. An impact evaluation answers the question: What is the impact (or causal effect) of a program on an outcome of interest.” (Gertler et al., 2016, p. 328).
• Mixed methods research “covers a diverse set of practices for combining qualitative and quantitative methods, in the interests of exploiting the strengths of both types of research and offsetting each others’ weaknesses.” (Starr, 2014, p. 242).
Why Use Mixed Methods in Impact Evaluations?
Our review of recent published impact evaluations, presented below, shows that use of mixed methods is relatively rare—only about 10 percent of recent impact evaluations included mixed methods. In this section, we discuss why mixed methods are a valuable component of impact evaluations and provide some hypotheses on why they have not been more widely used. Advocates for the use of mixed methods in impact evaluations make two types of arguments on why quantitative research alone is insufficient for evaluating interventions. First, the argument is often made that quantitative methods alone, even when rigorously carried out, may produce incorrect or misleading estimates of program impacts; as discussed below, this could be due to failure to adequately control for all variables that affect the outcomes of interest (internal validity issues) or that the impact estimates produced might not apply to the population of interest (external validity or generalization issues). Second, proponents of mixed methods argue that quantitative methods fail to consider many important questions, which can limit the utility of an impact evaluation or lead to implementation of an incorrect policy. Proponents of mixed methods generally do not claim that quantitative impact evaluations should be abandoned, but rather that the best approach to conducting an evaluation of an intervention is to include both quantitative and qualitative research, that is, mixed methods.
Limitations of Quantitative Methods in Estimating Program Impacts
A basic concern with the use of quantitative methods for impact evaluations is that they may fail to produce unbiased estimates of program impacts. This is particularly the case for evaluations that do not make use of random assignment, sometimes referred to as quasi-experimental or non-experimental designs. In these types of impact evaluations, the outcomes for a group that received the treatment are compared to those of a comparison group of individuals who did not receive the treatment. A variety of statistical techniques have been used in these evaluations either to make the groups equivalent, for example, by matching treatment and comparison group members, or to control for differences in characteristics, for example, through regression analysis or analysis of covariance. In recent years, increasingly sophisticated approaches have been used in non-experimental impact evaluations, including difference-in-differences models, regression discontinuity designs, synthetic control groups, and instrumental variables.
This article is not the place to assess the merits of these various approaches, but they all rely on strong assumptions regarding matters such as lack of omitted variables and measurement error, homogeneous treatment effects, and correct specification of the functional form. Given the role played by these strong assumptions, competent observers present divergent assessments of progress made in recent years in improving social science empirical work. Angrist and Pischke (2010, p. 26) are optimistic: Improvement [in empirical economics] has come mostly from better research designs, either by virtue of outright experimentation or through the well-founded and careful implementation of quasi-experimental methods. Empirical work in this spirit has produced a credibility revolution in the fields of labor, public finance, and development economics over the past 20 years. Design-based revolutionaries have notched many successes, putting hard numbers on key parameters of interest to both policymakers and economic theorists.
Leamer (2010, p. 44), while recognizing the improvements in empirical strategies, is much more cautious: We should be celebrating the small genuine victories of the economists who use their tools most effectively, and we should dial back our adoration of those who can carry the biggest and brightest and least-understood weapons. We would benefit from some serious humility, and from burning our “Mission Accomplished” banners. It’s never gonna happen.
Both Angrist and Pischke (2010) and Leamer (2010) acknowledge that randomized controlled trials (RCTs) overcome the limitations of non-experimental designs, but there is a substantial literature on limitations of RCTs as well. In this section, we note some of the statistical concerns with RCTs, and in the following section, we describe other concerns.
Deaton and Cartwright (2018, p. 2), in a well-known paper, express strong concern that people have too much confidence in the ability of RCTs to provide answers to relevant evaluation questions: Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates. Finding out whether an estimate was generated by chance is more difficult than commonly believed. At best, an RCT yields an unbiased estimate, but this property is of limited practical value. Even then, estimates apply only to the sample selected for the trial, often no more than a convenience sample, and justification is required to extend the results to other groups, including any population to which the trial sample belongs, or to any individual, including an individual in the trial.
Cook (2018) expanded on the points raised by Deaton and Cartwright (2018) and listed 26 assumptions that must be met for a single-site RCT to warrant “gold standard” consideration. Of course, others have raised concerns about the limitations of RCTs in program evaluations, and Barnow (2010, p. 103) noted that “an experiment is not a substitute for thinking.” It should be noted that some researchers believe Deaton and Cartwright are too pessimistic about the value of RCTs and some of the non-experimental methods developed in recent years; for example, Imbens (2018, p. 50) states “I am more sanguine about the recent developments in empirical practice in economics and other social sciences, and am optimistic about the ongoing research in this area, both empirical and theoretical.”
Barnow and Greenberg (2020) illustrate how assumptions about the homogeneity of treatment impacts can limit the external validity of an impact evaluation. Barnow and Greenberg (2020) use a simple model, similar to the model presented in Cronbach (1982), where the outcomes are a function of the characteristics of the target populations (X), the economic and other environmental conditions at the sites and the time in which they took place (E), and components and inputs incorporated into the evaluated programs (P). The resulting heterogeneity may cause program impacts to vary across the trials, as suggested by the simple model in the text box below.
The interaction terms allow for the possibility that program impacts may vary as X, P, and E vary. For example, a program may result in a larger increase in earnings at sites where the unemployment rate is low rather than high (or vice versa). Similarly, the success of a program may depend on the age distribution or racial composition of the target population, as appeared to be the case with GAIN (Greenberg et al., 2005). Finally, as Greenberg et al. conclude, observers often attribute the fact that Riverside appeared more successful than the GAIN program in the other five sites to that site’s emphasis on a work-first approach in its program design. If there are no interactions between Z and X, P, and E, then λ will capture the entire program impact.
Text Box: Impact Model with Interactions
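The contents of the text box are not reproduced in this text. As a minimal sketch only, one linear form consistent with the description above might look like the following. It assumes that Z is a treatment indicator, that Y is the outcome, and that the coefficient names other than λ (α, β, γ, δ, θ, ε) are illustrative placeholders rather than the notation of Barnow and Greenberg (2020).

```latex
% Illustrative sketch; the symbols other than X, E, P, Z, and \lambda are assumptions.
\begin{equation*}
Y = \alpha + \beta X + \gamma E + \delta P + \lambda Z
    + \theta_X (Z \cdot X) + \theta_E (Z \cdot E) + \theta_P (Z \cdot P) + \varepsilon
\end{equation*}
% Implied program impact for given X, E, and P:
\begin{equation*}
\frac{\partial Y}{\partial Z} = \lambda + \theta_X X + \theta_E E + \theta_P P
\end{equation*}
```

Under this sketch, setting the three θ coefficients to zero leaves λ as the entire program impact, which matches the statement in the text; nonzero θ values make the impact depend on the population, environment, and program components.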
An individual evaluation may lack variation in the X, P, and E variables, thus limiting the external validity of the findings; alternatively, there could be variation in these variables, but if the evaluator fails to include the interaction terms in the evaluation, the evaluation will fail to reveal how the impact might vary in other situations.
As illustrated below, the presence of impact interactions can be detected through both quantitative and qualitative strands of the evaluation. Some of the impact studies reviewed use subgroup analyses in their quantitative approach to determine if the intervention is more effective for some groups than others. Although similar analyses can be conducted for interactions in the program and environment variables, we did not observe them in our review.
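The subgroup analyses mentioned above can be sketched in their simplest form as a comparison of treatment and comparison group means within each subgroup. The example below is illustrative only: the records, subgroup labels, and the function name `subgroup_impacts` are hypothetical and are not drawn from any of the studies reviewed, and a real analysis would also report standard errors and adjust for multiple comparisons.

```python
from statistics import mean

# Hypothetical records: (subgroup label, treated flag, outcome).
# These values are made up purely for illustration.
records = [
    ("younger", 1, 12.0), ("younger", 1, 14.0),
    ("younger", 0, 10.0), ("younger", 0, 10.0),
    ("older", 1, 11.0), ("older", 1, 11.0),
    ("older", 0, 10.0), ("older", 0, 10.0),
]


def subgroup_impacts(rows):
    """Return the treated-minus-comparison mean outcome for each subgroup."""
    impacts = {}
    for group in sorted({g for g, _, _ in rows}):
        treated = [y for g, z, y in rows if g == group and z == 1]
        comparison = [y for g, z, y in rows if g == group and z == 0]
        impacts[group] = mean(treated) - mean(comparison)
    return impacts


impacts = subgroup_impacts(records)
for group, impact in impacts.items():
    print(f"{group}: estimated impact = {impact:.1f}")
```

A difference between the subgroup estimates is the quantitative counterpart of an interaction between the treatment and a population characteristic; qualitative findings can suggest which subgroups are worth examining.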
Other Reasons for Using Mixed Methods in Impact Evaluations
Concerns about the failure of quantitative methods to deliver accurate parameter estimates are not the only rationale for combining qualitative methods with quantitative methods in research. Bamberger et al. (2016) argue that quantitative methods tend to focus on a limited number of outcomes that are pre-identified. They observe that interventions often have unintended consequences, which can be positive or negative. They further note that while quantitative research methods cannot identify such unintended outcomes, qualitative methods, such as in-depth interviews and focus groups, can be used to discover such outcomes.
Mixed methods researchers have suggested rationales for conducting mixed methods research. Perhaps the earliest and most influential set of rationales for using mixed methods research designs was offered by Jennifer Greene and colleagues (1989), who identify five rationales for mixed methods research. We reproduce below the description of these five rationales in Greene et al.’s (1989, p. 259) words:
• “Triangulation seeks convergence, corroboration, correspondence of results from different methods.
• Complementarity seeks elaboration, enhancement, illustration, clarification of the results from one method with the results from the other method.
• Development seeks to use the results from one method to help develop or inform the other method, where development is broadly construed to include sampling and implementation, as well as measurement decisions.
• Initiation seeks the discovery of paradox and contradiction, new perspectives of frameworks, the recasting of questions or results from one method with questions or results from the other method.
• Expansion seeks to expand the breadth and range of inquiry by using different methods for different inquiry components.”
Others have offered other lists of rationales for using mixed methods. Bryman (2006) suggests a more detailed list of 16 rationales for using mixed methods.
Fetters (2020) lists the advantages of using mixed methods rather than listing the rationales:
• “Uses the strengths of qualitative and quantitative research to offset respective weaknesses.
• Enhances the breadth and depth of the research.
• Compares data from both types of research to examine different, similar, or seemingly discordant findings about a phenomenon.
• Uses results of one type of data for collection to build procedures for the collection of the other type of data.
• Develops a model qualitatively and tests it quantitatively.
• Constructs a theoretical model quantitatively and validates it qualitatively.”
Below, we review articles identified in our search of recent literature that use mixed methods for impact evaluations and describe the rationales for the use of mixed methods.
External Validity and Generalization as Rationales for Mixed Methods
External validity and generalization are not specifically mentioned as rationales for using mixed methods in Greene et al. (1989), Fetters (2020), or Bryman (2006), but they are implicit in some of the rationales offered. Fetters (2020) does not appear to link mixed methods with external validity or generalization, but these concepts are consistent with the concept of “Enhances the breadth and depth of the research.” For Greene et al. (1989), these concepts could be considered under their expansion rationale. The article uses the term “expansion” to mean “to expand the breadth and range of inquiry by using different methods for different inquiry components” (p. 259); although the first part of the sentence appears to support the use of mixed methods for these purposes, the phrase “for different inquiry components” does not. The use of mixed methods for external validity and generalization corresponds directly to Bryman’s (2006, p. 107) concept of context, which “refers to cases in which the combination is rationalized in terms of qualitative research providing contextual understanding coupled with either generalizable, externally valid findings or broad relationships among variables uncovered through a survey.” Thus, although use of mixed methods to address external validity and generalizability issues has not been highlighted in the literature on rationales for mixed methods, it is certainly consistent with this purpose; as described below, qualitative research in impact evaluations sometimes is used to identify subgroups and environments where an intervention is likely to be more or less successful. As discussed more below, we believe that in conducting impact evaluations, researchers should consider using mixed methods to explore external validity issues.
Current Use of Mixed Methods in Impact Evaluation Research
In the foregoing, this paper has sought to show how mixed methods can be used to improve impact evaluations, and in this section, we explore the extent to which mixed methods have been used in impact evaluations and what can be learned from such efforts. To identify examples of articles that use mixed methods in impact evaluations, we build on the approach used in Richwine et al. (2022) to identify and review research in recent years (2000–2022) in leading public administration, public policy, economics, and evaluation journals by conducting a search using Web of Science. We first present data on the prevalence of mixed methods in impact evaluations for selected journals in the fields of economics, public administration, and evaluation.
Approach Used to Identify Recent Articles Using Mixed Methods in Impact Evaluation Research
This paper expands on Hendren et al. (2018) and Richwine et al. (2022) by broadening the search beyond the public administration and public policy journals to include economics and program evaluation journals where impact evaluation studies are likely to be published. To identify articles for our review, in March 2023, we conducted a literature search using Web of Science for 29 public administration and public policy journals from Richwine et al. (2022), 6 purposively selected program evaluation journals, and the top 71 economics journals based on the Social Science Citation Index. The search terms we used include mixed methods terms used in prior studies (Hendren et al., 2018; Richwine et al., 2022) and two additional terms to focus on impact evaluations. Our search strategy, search terms, and full list of included journals are presented in Appendix A and Appendix B.
We searched for articles published between January 2010 and December 2022, as the majority of mixed methods studies have been published since 2010 (Hendren et al., 2018). We limited our results to articles with full-text available in English and included only empirical mixed methods impact evaluation articles. We did not restrict the topic area of impact evaluations or type of methods used in either strand. Figure 1 presents an overview of our search and selection process.
Figure 1. Overview of search selection process for relevant articles.
One of the authors and a research assistant reviewed all abstracts independently and resolved discrepancies by comparing notes at weekly meetings. Any remaining discrepancies were adjudicated by the other two authors. After abstract and title review, two authors conducted full-text review of 38 selected articles and determined that only 15 were impact evaluations that used mixed methods. Two authors independently extracted data and compared notes to resolve discrepancies during data extraction.
Prevalence of Mixed Methods in Impact Evaluations
The review of recent journal articles in selected journals, conducted May 23, 2023, yielded 3468 potential mixed methods articles and 2371 potential impact evaluations. As shown in Figure 2, there was not a great deal of overlap in the articles identified in the two categories; only 215 of the articles listed keywords for both impact evaluations and mixed methods studies. Less than 10 percent of the impact evaluation studies identified made a reference to mixed methods.
Figure 2. Venn diagram of mixed methods and impact evaluation articles.
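The overlap shares implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the three counts reported above; the variable names are illustrative, and it assumes the 215 overlapping articles are counted within both of the larger totals, as in a standard Venn diagram.

```python
# Counts reported in the text for the keyword-based search.
mixed_methods = 3468   # potential mixed methods articles
impact_evals = 2371    # potential impact evaluations
both = 215             # articles with keywords for both categories

# Share of potential impact evaluations that also reference mixed methods.
share_of_impact = both / impact_evals
# Share of potential mixed methods articles that are also impact evaluations.
share_of_mixed = both / mixed_methods
# Distinct articles across the two categories (assumes the overlap is
# included in both totals above).
either = mixed_methods + impact_evals - both

print(f"Impact evaluations referencing mixed methods: {share_of_impact:.1%}")
print(f"Mixed methods articles that are impact evaluations: {share_of_mixed:.1%}")
print(f"Distinct articles in either category: {either}")
```

The first share, roughly 9 percent, is the figure behind the statement that less than 10 percent of the impact evaluation studies referred to mixed methods.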
Thus, it is clear that use of mixed methods in impact evaluations is relatively rare, at least in the relatively wide assortment of public policy, public administration, economics, and evaluation journals included in our review. However, it would be unwise to attach too much importance to the specific numbers in Figure 2. First, the numbers are based on an automated review of key words and abstracts, so there are likely to be many “false positives” that do not qualify as mixed methods or impact evaluation studies. The full-text review of 38 articles in the following section confirms this point. Second, we may have missed some studies, particularly if the article titles and abstracts did not contain our search terms. Third, to keep our task manageable, we limited the journals reviewed. Specifically, we included 70 economics journals, but over 300 economics journals are available, and we did not review articles in specific fields such as education, public health, business, and psychology. Finally, the findings depend on the terms used in the screening process. Nonetheless, it is likely that our key finding in this section would remain with alternative screening categories: only a minority of impact evaluations use mixed methods.
Findings From Full-Text Reviews of 38 Articles
This section provides highlights from the full-text reviews of the 38 articles identified as potential impact evaluations using mixed methods. Most of the articles (20) were published in program evaluation journals, but some articles also appeared in economics journals (14) and public policy and public administration journals (4).
The first issue addressed in the screening process was to verify whether the articles met the criteria for medium or high-quality impact evaluations using the criteria established by the U.S. Department of Labor in its Clearinghouse for Labor Evaluation and Research (CLEAR), with the minor modification of including articles that used a regression discontinuity design. We found that only 16 of the 38 articles met these criteria, and only 15 articles met the modified CLEAR criteria and used mixed methods. Some of the articles failed to qualify as high or medium quality impact evaluations because they did not use the methods specified in the CLEAR criteria; these evaluations used methods such as post-program measures alone, pre-post changes without use of a comparison group, or opinions of the treatment group about effectiveness alone. In a majority of the articles that did not meet the modified CLEAR standards, however, the point of the articles was not to estimate program impact; instead, the studies were proof of concept studies or efforts to gather views of participants in focus groups.
We fared better in screening studies for their use of mixed methods research designs. In our review, we determined that a majority of the articles provided an adequate rationale for using mixed methods and integrated the findings from the qualitative and quantitative strands. Notably, almost all the articles that met the modified CLEAR criteria for high or moderate quality impact evaluations had an adequate qualitative strand (15 of 16).
As noted earlier, Greene et al. (1989) offered five rationales for mixed methods and Bryman (2006) offered 16 rationales. In our review, triangulation was a frequent rationale, but the other common rationales were mechanisms and reason for failure. Not surprisingly, 13 of the articles meeting modified CLEAR guidelines and mixed methods guidelines mentioned triangulation—the extent to which the findings from the qualitative strand supported the findings in the quantitative strand. In virtually all the studies where triangulation was used, the quantitative and qualitative strands were in agreement about the direction of program impacts. The use of mixed methods for studying mechanisms and reasons for failure is, in our view, more interesting. A common criticism of RCTs is that even if an evaluation finds that an intervention is successful, in many cases the impact evaluation does not provide evidence on why the intervention was successful—the intervention is treated as a “black box,” with no evidence on why the intervention is successful or if all the components are useful (see Fetters & Molina-Azorin, 2020). Black box studies are adequate for testing simple hypotheses, such as whether vitamin C will prevent scurvy, but can be woefully inadequate for assessing complex programs in fields such as education, job training, nutrition programs, and development programs.
Nearly as many articles used mixed methods to study mechanisms by which the interventions achieve outcomes (12) as used mixed methods for triangulation (13). As noted above, mixed methods studies can overcome the tendency for impact evaluations to treat interventions as black boxes by using techniques such as interviews, focus groups, and ethnographies to ask participants, program officials, and leaders why and how interventions succeed and fail. For example, Rao et al. (2017, p. 481), in a study that combined an RCT with ethnography to evaluate a demonstration intended to deepen democracy in rural India, stated “This paper demonstrates that an in-depth ethnography conducted alongside a survey-based RCT can provide important insights into the processes of change, the mechanisms that led to the observed outcome, and thus make a null effect meaningful and interesting.” In a study of agroforestry in Kenya, Hughes et al. (2020, p. 5) used qualitative methods to explore differences between men and women in adoption and intensity of agroforestry and to “explore the mechanisms through which different components of [the] Agroforestry’s program may have contributed to livelihood improvements.” Understanding the mechanisms by which interventions are effective, and the reasons for failures as discussed below, can lead to refinements in treatments, which can then be subjected to rigorous impact evaluations and further qualitative analysis.
Understanding why interventions fail could be considered a subset of studying mechanisms, but five of the studies reviewed specifically mention understanding failure as a rationale. We concur with those authors that understanding the reasons for failure can provide valuable lessons not available from quantitative impact analysis alone. Indeed, a successful program can be replicated without understanding the mechanisms that made it successful, but an understanding of why a program failed can be valuable in modifying the program to remove the barriers to success. In addition to describing how their qualitative analysis led to understanding the mechanisms of the program, Rao et al. (2017, p. 481) noted that “The detailed qualitative data from a 10% subsample allow us to unpack the reasons why the intervention ‘failed,’ highlighting the role of variations in the quality of facilitation, lack of top-down support, and difficulties with confronting the stubborn challenge of persistent inequality.” Nichols-Barrer et al. (2018) provide an interesting example of how the qualitative findings can suggest why an intervention failed. In their evaluation of a civic participation demonstration in Rwanda, Nichols-Barrer et al. (2018, p. 25) found in interviews that “respondents did express disappointment that program activities lasted for only 10 months and that several RTP activities were not fully implemented. However, this disappointment was directed almost exclusively at program administrators, and in many cases respondents voiced support for extending the program because they thought it was well designed and effective.” Of course, another impact evaluation would be required to validate the views of the respondents, but the qualitative strand suggests a path for a more successful project.
External Validity and Generalizability in Articles Reviewed
Most of the articles reviewed did not address issues of external validity and generalizability, and some of those that did address the issues did not mention those terms. Six of the 16 articles that met the modified CLEAR requirements for medium or high-quality impact evaluations raised external validity issues. The most common external validity and generalization issue was whether the impact evaluation results applied to various subgroups. For example, in a study of the impact of an unconditional cash transfer program in rural Ghana (LEAP), de Milliano et al. (2021, p. 9) stated: As an extension to the quantitative analysis, we explored heterogeneous effects to assess whether the effect on social support differs for various subgroups in the population using variables arising from the qualitative analysis and previous literature…. We examined the effects of LEAP 1000 on social support by parity (one child vs. multiple children), type of marriage (monogamous vs. polygamous), level of education (no or less than primary vs. primary school and higher) and feeling of empowerment (having power to decide over one’s life-course vs. no power to decide).
Bonilla et al. (2017, p. 58), in an evaluation of a child grant program in Zambia, noted how differential attrition limits the ability to generalize from the study’s findings to the overall population: “…. it is important to note that overall, older, more educated women, living in the Kaputa district and with higher composite sole or joint decision making are more likely to be lost to follow-up—thus limiting the generalizability of our results.” Bonilla et al. (2017, p. 67) also note other limits in applying their findings to the overall population: “Finally, our sample is a unique population of women with young children, living in three poor rural districts in Zambia, and thus findings around empowerment dynamics are likely to vary significantly from populations in other geographic regions, particularly those with markedly different gender norms.”
Only two of the articles discussed using mixed methods for external validity and generalization issues, but they were both enthusiastic about the potential. Green et al. (2015, p. 393) are strong advocates for using mixed methods for enhancing external validity: One approach to both including the range of outcomes likely to be of interest to different constituencies and avoiding trade-offs between external and internal validity is to mix designs and data collection methods in an iterative way….This article builds on these pragmatic approaches. We outline how integrating qualitative analytic induction within a quasi-experimental design can help defend against threats to validity whilst ensuring that the evaluation results have purchase with a wide range of constituencies.
Green et al. (2015, p. 402) conclude “Qualitative evidence was used to enhance causal credibility and judgements about external validity.”
Roelen and Saha (2021, p. 10) also make a strong case for using mixed methods for learning about external validity as well as discovering the mechanisms for program impacts: This study also provides further evidence of the use of mixed methods approaches in evaluation and its value-added in making findings useful and credible for practitioners and policy makers…The use of qualitative methods allowed for exploring mechanisms that both enhanced and prevented change, which we would not have obtained by using quantitative methods only. As such, mixed methods approaches can help to address external validity concerns as it provides insight into mechanisms underpinning observed outcomes in a given context – much in line with the tradition of ‘realist evaluation’ (Pawson & Tilley, 1997) – allowing for greater understandings of whether findings may be generalizable across or applicable to other contexts.
In sum, external validity and generalization were not commonly discussed in the impact evaluations reviewed, and when these concepts were discussed, it was more often in the quantitative strand than in the qualitative strand; however, two of the studies strongly advocated mixed methods for inferences about external validity, as well as for other purposes such as understanding mechanisms.
Summary of Findings from Exemplary Articles
To illustrate the value of mixed methods in impact evaluations, this section describes how three articles made use of mixed methods in their analysis. These are not the only articles that made good use of mixed methods, but they were selected to illustrate how mixed methods have been used in impact evaluations.
Rao et al. (2017) provide a mixed methods evaluation of a demonstration intended to deepen democracy in rural India. The evaluation includes methodologically strong components in both the quantitative and qualitative strands. The authors state that “this paper examines the impact of a two-year effort to improve citizen engagement in a poor and arid region of India - northern Karnataka.” The demonstration was an effort to replicate a successful democracy intervention in a neighboring but wealthier and more educated state. The intervention was targeted at the poorest areas in the state, with 50 areas assigned to receive the treatment, and 50 assigned to the control group. Impacts were estimated at the village and individual levels, depending on the outcome, using a standard difference in differences approach. The qualitative analysis was conducted in 10 percent of the treatment and control districts, that is, 5 of each. The quantitative data was collected over two years, and the ethnographic qualitative research was conducted over four years. Field investigators were stationed in each site included in the qualitative sample, and they submitted reports on a monthly basis. The quantitative analysis did not indicate that the demonstration was successful (p. 487): “However, the data do not indicate that the People’s Planning intervention had a significant impact across a wide spectrum of possible outcomes. The results, at best, show very weak evidence of both positive and negative impacts of the intervention on a very small number of outcomes.” The qualitative analysis points to four reasons for the demonstration’s failure: (1) challenging context, (2) failure to integrate the service providers into provision of government services, (3) variation in the quality of facilitation, and (4) poor application of the program design.
The authors stress the importance of the qualitative analysis in revealing why the intervention failed to have the expected impact: “This paper demonstrates that an in-depth ethnography conducted alongside a survey-based RCT can provide important insights into the processes of change, the mechanisms that led to the observed outcome, and thus make a null effect meaningful and interesting.”
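For readers less familiar with the quantitative strand, the logic of a standard difference in differences estimate can be sketched as follows. This is a minimal illustration, not the estimation code used by Rao et al. (2017): the function name, the outcome, and all numbers are hypothetical, and a real analysis would use a regression framework with standard errors clustered at the level of assignment.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Return the difference in differences: the change in the treated
    group's mean outcome minus the change in the control group's mean."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical village-level outcome (for example, the share of
# households attending village meetings). These values are invented.
treat_pre = [0.20, 0.25, 0.30]
treat_post = [0.30, 0.35, 0.40]
control_pre = [0.22, 0.24, 0.26]
control_post = [0.30, 0.32, 0.34]

effect = did_estimate(treat_pre, treat_post, control_pre, control_post)
print(f"Estimated impact: {effect:.3f}")
```

In this invented example the treated villages improve by about 0.10 and the control villages by about 0.08, so the estimated impact is only about 0.02. A small or null estimate of this kind says nothing about why the intervention fell short, which is the gap the qualitative strand is meant to fill.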
Roelen and Saha (2021) offer a second example of excellent use of mixed methods in an impact evaluation. This study used a non-experimental design to assess the impact of a “graduation program” on risk factors facing low-income children in Haiti. The intervention, delivered over a period of 18 months, offers “intensive and tailored support, including: (i) weekly stipends of … approximately US $13 PPP during the first six months of implementation; (ii) asset transfer (approximately US $155), (iii) support to join Village Savings and Lending Association (VSLA), (iv) weekly home visits by case managers, including health and nutrition messaging, and (v) in-kind support such as materials for home repair and installation of latrine (approximately US $250) and access to the local hospital” (p. 107). The evaluators were unable to implement an RCT but used a strong non-experimental design. The quantitative analysis used propensity score matching and difference in differences analyses, which are generally considered two of the strongest non-experimental techniques. In addition, the study included discussion of attrition, robustness checks, and sensitivity analysis. The qualitative strand is not described in detail, but the article indicates how the qualitative analysis was used for triangulation and for understanding how impacts can vary by context, an important external validity and mixed methods issue:
“This study also provides further evidence of the use of mixed methods approaches in evaluation and its value-added in making findings useful and credible for practitioners and policy makers…. The use of qualitative methods allowed for exploring mechanisms that both enhanced and prevented change, which we would not have obtained by using quantitative methods only. As such, mixed methods approaches can help to address external validity concerns as it provides insight into mechanisms underpinning observed outcomes in a given context – much in line with the tradition of ‘realist evaluation’ (Pawson & Tilley, 1997) – allowing for greater understandings of whether findings may be generalizable across or applicable to other contexts” (p. 116).
The impact evaluation found mixed results, with the intervention improving mental health, yielding a modest improvement in children’s exposure to corporal punishment, and having no impact on exposure to violence outside the home or on attitudes toward child disciplining practices. The study strongly endorsed the use of mixed methods for understanding mechanisms and learning how to improve programs.
The third example of an article that includes mixed methods in an impact evaluation is de Milliano et al. (2021), which evaluates an unconditional cash transfer program (LEAP 1000) for pregnant women and women with children under one year old in rural Ghana, analyzing the impact of the program on social, emotional, and instrumental support. The impact analysis uses a non-experimental design that combines regression discontinuity and difference in differences to estimate program impact. The quantitative analysis pays a great deal of attention to how the impact varies by subgroup: “As an extension to the quantitative analysis, we explored heterogeneous effects to assess whether the effect on social support differs for various subgroups in the population using variables arising from the qualitative analysis and previous literature … We examined the effects of LEAP 1000 on social support by parity (one child vs. multiple children), type of marriage (monogamous vs. polygamous), level of education (no or less than primary vs. primary school and higher) and feeling of empowerment (having power to decide over one’s life-course vs. no power to decide)” (p. 9).
In addition, the study included in-depth interviews that were useful in understanding the mechanisms by which the program resulted in impacts on the participants: “The in-depth qualitative interviews confirmed these findings with women experiencing a growth in the access to financial markets and increased opportunities to mingle with peers in the markets, at social gathering and in community groups. The program even had an enabling role in stimulating changes that led to women creating new relationships and strengthening existing ones” (p. 11).
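The subgroup analysis described above, in which qualitative findings suggest which subgroups to examine, can be illustrated with a simplified sketch. It computes a plain difference in differences estimate separately within each level of a subgroup variable. It is not the authors' specification, which also relies on a regression discontinuity design; the function name, field names, and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def did_by_subgroup(records, subgroup_key):
    """Compute a difference in differences estimate within each level
    of the subgroup variable named by subgroup_key."""
    # Group outcomes into cells keyed by (subgroup level, treated, post).
    cells = defaultdict(list)
    for r in records:
        cells[(r[subgroup_key], r["treated"], r["post"])].append(r["outcome"])

    estimates = {}
    for level in sorted({r[subgroup_key] for r in records}):
        treated_change = mean(cells[(level, 1, 1)]) - mean(cells[(level, 1, 0)])
        control_change = mean(cells[(level, 0, 1)]) - mean(cells[(level, 0, 0)])
        estimates[level] = treated_change - control_change
    return estimates


# Invented social support scores, one observation per cell for brevity.
records = [
    {"parity": "one child", "treated": 1, "post": 0, "outcome": 2.0},
    {"parity": "one child", "treated": 1, "post": 1, "outcome": 3.0},
    {"parity": "one child", "treated": 0, "post": 0, "outcome": 2.0},
    {"parity": "one child", "treated": 0, "post": 1, "outcome": 2.5},
    {"parity": "multiple children", "treated": 1, "post": 0, "outcome": 2.0},
    {"parity": "multiple children", "treated": 1, "post": 1, "outcome": 2.5},
    {"parity": "multiple children", "treated": 0, "post": 0, "outcome": 2.0},
    {"parity": "multiple children", "treated": 0, "post": 1, "outcome": 2.5},
]

estimates = did_by_subgroup(records, "parity")
for level, effect in estimates.items():
    print(f"{level}: {effect:+.2f}")
```

In these invented data the estimated effect is concentrated among women with one child. A pattern of this kind is what a qualitative strand can help both anticipate, by suggesting the subgroup variable, and explain, by describing the mechanism behind the difference.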
Conclusions and Areas for Further Research
Several conclusions emerge from this review of the use of mixed methods in impact evaluations. First, we have verified that mixed methods are rarely used in impact evaluations. We think this is unfortunate, as mixed methods can greatly increase the knowledge gained and policy relevance of impact evaluations. When mixed methods are used, they can reinforce findings from the quantitative analysis (triangulation), but they can also help us understand the mechanisms by which programs have their impacts and the reasons why programs fail. Such findings help researchers and policymakers improve program designs by focusing on certain subgroups and modifying programs to address the reasons they failed.
Evaluations can and should go beyond treating interventions as a black box and explore how outcomes vary by participant characteristics, program features, and environmental context. By exploring these issues, mixed methods can play an important role in identifying the extent to which evaluation results have external validity and generalizability. Evaluations of ongoing programs and demonstrations should rarely be a “one and done” effort. Mixed methods results can often lead to refinements in program design and targeting that can be studied in future evaluations. We discussed how program impacts can and often do vary by characteristics of participants, program context, and the environment. These issues can sometimes be explored through quantitative methods, but mixed methods studies offer a way to suggest which of these areas are worth exploring, and, indeed, several of the studies reviewed made use of mixed methods to suggest how variations in participants, context, and programs can affect the outcomes. Several other papers in this special issue also make the point that variation in impact by participant characteristics, program features, and environmental context can play an important role in the external validity and generalizability of evaluations, notably Maynard (2024) and Littell (2024). In the United States, many government agencies support the use of mixed methods in their evaluations. For example, the U.S. Department of Labor and the U.S. Department of Health and Human Services fund implementation studies along with impact evaluations for evaluations of demonstrations and ongoing programs. Although every impact evaluation need not be accompanied by a qualitative assessment, we believe it should be more the norm than the exception.
Supplemental Material
Supplemental Material for How Mixed-Methods Research Can Improve the Policy Relevance of Impact Evaluations by Burt S. Barnow, Sanjay K. Pandey, and Qian “Eric” Luo in Evaluation Review.
Footnotes
Authors’ Note
We are grateful to Quan Nha Hong, Sergi Fàbregues, and Douglas Besharov for comments and Varnika Birla for excellent research assistance. Any errors are the responsibility of the authors.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.