A Systematic Review of Meta-Evaluations: Lessons for Evaluation and Impact Analysis

Abstract

Impact evaluation and measurement are highly complex and can pose challenges for both social impact providers and funders. Measuring the impact of social interventions requires the continuous exploration and improvement of evaluation approaches and tools. This article explores the available evidence on meta-evaluation—the “evaluation of evaluations”—as an analytical tool for improving impact evaluation and analysis in practice. It presents a systematic review of 15 meta-evaluations with an impact evaluation/analysis component. These studies, taken from both the scholarly and gray literature, were analyzed thematically, yielding insights about the potential contribution of meta-evaluation in improving the methodological rigor of impact evaluation and organizational learning among practitioners. To conclude, we suggest that meta-evaluation is a viable way of examining impact evaluations used in the broader social sector, particularly market-based social interventions.

Keywords

meta-evaluation, impact evaluation, impact analysis, systematic review

The need for impact evaluation or impact measurement is becoming increasingly commonplace for all types of organizations that use public and private funding to drive social change. More recently, the rise of new actors designing and delivering social interventions, such as international nongovernmental organizations (NGOs), philanthropic organizations, and market-oriented players, has introduced new solutions and instruments to tackle social issues (Picciotto, 2015). Organizations in the social sector are expected to demonstrate their effectiveness, mainly as proof of legitimacy or accountability for both donors and beneficiaries (Barraket & Yousefpour, 2013; European Union & Organisation for Economic Co-operation and Development [OECD], 2015; Nguyen et al., 2015). Existing literature suggests that social impact providers, funders, and investors continue to find impact measurement a challenge despite significant growth in the frameworks and tools available to support them (OECD, 2019; S. D. Phillips & Johnson, 2019). Not-for-profit organizations face particular challenges in evaluating social goals, and their evaluations often succumb to common flaws (Liket et al., 2014). Organizations that deliver social impact—including not-for-profit organizations, social enterprises, purpose-driven corporations, governments at all levels, and funders and investors (such as philanthropic organizations, corporate investors, governments, and international development agencies)—have joined forces in the quest to develop pragmatic and rigorous impact measurement methods.

The current evaluation literature and social impact measurement literature present mixed views about the link between evaluation and social impact measurement. On one side of the debate, evaluation and social impact measurement are depicted as separate fields. For example, Vo and Christie (2018) recognized that evaluation and social impact measurement have each developed in response to different stakeholders and social demands and that private entities in the pursuit of social change often prioritize financial return over social and environmental goals when conducting impact measurement. On the other side of the debate, the traction of social impact is seen as opening a new chapter in evaluation theory and practice development. As Picciotto (2015) puts it, social impact measurement is “the fifth wave of evaluation diffusion” (p. 2). Additionally, practitioner communities from each field have formed separate, yet overlapping, camps, with evaluation practitioners working primarily in public, philanthropic, and not-for-profit sectors and social impact measurement practitioners with market-based entities (Reisman et al., 2015). To foster practical and viable approaches, it is necessary to engage with both evaluation practice and social impact measurement practice and draw from shared knowledge and learnings across sectors and fields.

Asking hard questions around “what works, for whom, and how” has long been a trademark of evaluative thinking and practice. We have seen a growth in the resources available to evaluation practitioners, including evaluation frameworks, methods, standards, and tools to assess the progress and outcomes of social interventions. Measuring the impacts of social interventions is imbued with complexity, and we believe that there is a strong need to investigate current evaluation and impact measurement approaches. Meta-evaluation is a method that allows us to explore the evidence on an evaluation approach or product and to critically analyze the merit, quality, and effectiveness of the evaluation.

The literature espouses meta-evaluation as a means for assessing the methodological rigor of evaluations and for aggregating data from multiple evaluations (Cooksy & Caracelli, 2009; Henry, 2016). In this study, we drew upon Michael Scriven’s original definition of meta-evaluation as the “evaluation of evaluations” (Scriven, 1969, as cited in Scriven, 2009, p. iii). While meta-evaluation is commonly conflated with meta-analysis, the two concepts are different. The key element that differentiates meta-evaluation is that “it evaluates the evaluations to which it refers; it does not merely summarise them” (Scriven, 2005, p. 250). In particular, a meta-evaluation assesses the merit and worth of an evaluation, while a meta-analysis performs a quantitative synthesis of empirical studies (which might include evaluative studies; Stufflebeam, 2001).

Meta-evaluation is a specific type of evaluation that can be used to systematically identify factors that must be addressed to improve the evaluation and support decisions about the quality of the evaluation (Yarbrough et al., 2010). Typically, a meta-evaluation looks at multiple evaluations to understand broader patterns in the processes and outcomes of evaluations. The method can be tailored for various purposes and possibilities. For example, internal meta-evaluations use self-administered checklists to assess accountability, formal external meta-evaluations can be commissioned to assess the quality of evaluation, and targeted meta-evaluations can be used to explore which kinds of evaluation approaches are most efficient and effective for specific purposes (Cooksy & Caracelli, 2009; Yarbrough et al., 2010). The method can thus be used to examine the utility, strengths, and limitations of approaches employed in published evaluations (Henry, 2016; Stufflebeam, 2001; Tingle et al., 2003).

We conducted a systematic literature review to examine how impact evaluations and analyses have been assessed through the analytical lens of meta-evaluation. Systematic literature reviews are typically used to process large volumes of information to map out areas of uncertainty in research and identify gaps in the literature (Petticrew & Roberts, 2006). Systematic reviews also provide a reproducible method for synthesizing empirical research and in some cases allow for generalizations to be drawn from the findings of different studies (Cooper et al., 2009). Aside from their common application in health and medicine, systematic reviews are increasingly being adopted in other areas such as management, conservation, international development, and economics (Gough et al., 2013).

As such, this study set out to undertake a systematic literature review that assesses multiple meta-evaluations. The review initially targeted both impact evaluation and social impact measurement literature in order to draw lessons from meta-evaluations in both fields. Following a systematic search of the scholarly and gray databases, we included texts that both (a) were meta-evaluations, defined as the “evaluation of evaluations” (Scriven, 2009), and (b) involved an impact evaluation or assessment component. We initially intended to include social impact measurement or impact analysis practices employed by social impact providers and investors. However, we could not trace any explicit application of meta-evaluation in the impact measurement of social enterprises or other market-oriented solutions. As a result, the meta-evaluations reviewed in this study examined impact evaluation practices primarily in a program and development evaluation context. Accordingly, our research questions are as follows:

What does the available evidence tell us about the extent to which meta-evaluations assess impact evaluations?

What can meta-evaluation as an instrument contribute to improving evaluation and impact analysis practices?

The rest of this article is organized as follows. After presenting our methodology for the search strategy, quality assessment of the publications, and approach to thematic analysis and synthesis, we arrange our findings according to our research questions. We describe the scope of impact evaluations and assessments sampled and discuss the key methodological features meta-evaluated, the purposes and aims of meta-evaluations, and meta-evaluation criteria used, before summarizing the recommendations from the studies for improving methodological practices. The article proceeds to discuss applications of meta-evaluation in broader contexts such as social impact analysis, leading to calls for greater dissemination and evidence synthesis of impact evaluations and social impact measurements.

Method

Search Strategy

This study follows a systematic literature review approach. Between January and June 2019 (later updated to April 2020), we conducted a comprehensive search of English-language publications available from January 1999 to March 2019 using four scholarly databases (Web of Science, EbscoHost, ProQuest, and Scopus) alongside a customized NGO search engine,¹ allowing the search to cover both peer-reviewed academic literature and gray literature. To ensure that any additional materials not indexed in these databases/search engines were identified, we undertook a targeted search of 37 evaluation-related journals² and examined the reference lists of articles screened. The search strategy required the search terms to appear in either the abstract or the keywords of the publication. A combination of the following search terms was used: “meta-evaluation” and “social impact” or “impact assessment” or “impact evaluation” or “impact measurement” or “impact review”. See Table 1 for the search strings used in each round of the search. Truncation symbols (*) were employed to include grammatical and spelling variants of the search terms (e.g., “metaevaluation”, “meta evaluation”, and “meta-evaluation”); additionally, Boolean operators (AND/OR) were used to combine the terms and refine the results.

Table 1.

Databases or Journals Searched and the Corresponding Search String Used.

Database or Journal	Search String
Web of Science, EbscoHost, ProQuest, and Scopus	1. “meta eval*” AND “social impact”
Web of Science, EbscoHost, ProQuest, and Scopus	2. “meta eval*” AND “impact” AND (“assess*” OR “evaluat*” OR “measur*” OR “review*”)
Google custom search: nongovernmental organizations	1. “meta evaluation”, “impact assessment”
Google custom search: nongovernmental organizations	2. “meta evaluation”, “impact”
Evaluation-related journals	“meta-evaluation” OR “metaevaluation” OR “meta evaluation” AND “impact”
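
As a rough illustration of how truncation and the Boolean AND interact in the first search string, the sketch below emulates “meta eval*” AND “social impact” with regular expressions. The function name `matches_search_string_1` and the sample abstracts are hypothetical, and the optional space or hyphen after “meta” is an assumption. Each database applies its own tokenization and truncation rules, so this does not reproduce the behavior of any particular database.

```python
import re

# Emulates the truncated term "meta eval*": "meta", an optional space or
# hyphen (an assumption), then any word beginning with "eval".
META_EVAL = re.compile(r"\bmeta[\s-]?eval\w*", re.IGNORECASE)
# Emulates the exact phrase "social impact".
SOCIAL_IMPACT = re.compile(r"\bsocial\s+impact\b", re.IGNORECASE)


def matches_search_string_1(text: str) -> bool:
    """Return True only when both terms occur in the text (Boolean AND)."""
    return bool(META_EVAL.search(text)) and bool(SOCIAL_IMPACT.search(text))


# Hypothetical abstracts, for illustration only.
abstracts = [
    "A meta-evaluation of social impact reporting.",
    "Metaevaluations and the social impact of aid.",
    "A meta-analysis of social impact bonds.",
    "Meta evaluation standards in education.",
]
hits = [a for a in abstracts if matches_search_string_1(a)]
print(len(hits), "of", len(abstracts), "abstracts match")
```

In this sketch the first two abstracts match, the third fails the truncated term (a meta-analysis is not a meta-evaluation), and the fourth fails the phrase term.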

The initial search in 2019 yielded 71 unique articles from the four scholarly databases. Although three of the databases (i.e., EbscoHost, Web of Science, and ProQuest) collectively index gray literature (including working papers, market research reports, industry reports, and case studies), all results were either scholarly journal articles or book chapters. In a subsequent search for gray literature, we first used Google’s web search engine. However, this yielded a surfeit of irrelevant sources. To find public reports from governments and industries, we conducted a separate, customized NGO search based on the assumption that NGOs undertake a considerable volume of program and development evaluations. Google does not recognize Boolean operators or truncations, so we modified the search terms to (1) “meta evaluation”, “impact assessment” and (2) “meta evaluation”, “impact” (including the quotation marks). The customized NGO search refined the Google search results to only those published by NGOs. This filter yielded 23 unique hits, most of which were industry reports from international charities and humanitarian organizations. Our targeted search (of evaluation-related journals and reference lists of screened articles) yielded eight new studies including scholarly articles and public reports from governments and NGOs. Dissertations, commentaries, and books were excluded from these initial search results (see our discussion on potential bias created by this decision in the Limitations subsection), while conference proceedings and book chapters were included.

The search protocol was repeated in April 2020 to capture records that were published or indexed between June 2019 and April 2020, after the original search was run. In total, the updated search incorporated 11 new results for screening—four of which were from the original four databases, seven from the targeted search of evaluation journals, and none from the custom Google search of NGO publications. A total of 113 publications were identified for screening in this study.
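
The record counts reported above can be tallied as a quick consistency check. The sketch below uses only the figures stated in the text; the dictionary and variable names are illustrative.

```python
# Records identified in each search round, as reported in the text.
initial_search = {
    "scholarly databases": 71,
    "customized NGO search": 23,
    "targeted search": 8,
}
updated_search = {
    "scholarly databases": 4,
    "targeted search": 7,
    "customized NGO search": 0,
}

initial_total = sum(initial_search.values())
updated_total = sum(updated_search.values())
total_identified = initial_total + updated_total

# The initial round gives 102 records and the update adds 11,
# matching the 113 publications identified for screening.
print(initial_total, updated_total, total_identified)
```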

Screening Process

The inclusion criteria developed for screening stipulate that (a) meta-evaluation is defined as an “evaluation of evaluation(s),” which excludes meta-analyses or meta-reviews that may be labeled as “meta-evaluations”; and (b) the meta-evaluation should include some form of impact analysis that has evaluated, measured, or reported the impacts of the intervention(s) under evaluation. Based on these inclusion criteria, screening was carried out in two steps. First, researchers independently screened (1) the 71 abstracts from the four databases, (2) the 23 abstracts from the customized Google NGO search, and (3) the eight abstracts from the targeted search against the inclusion criteria. The same screening process was repeated in April 2020 for the 11 records that were published or indexed since the original search was run. No additional publications from the updated search were added to the final sample. Second, two researchers independently conducted full-text reviews of the 40 articles identified from the first step against the same criteria. Disagreements between researchers were resolved through consensus-based discussion. The screening process is summarized in Figure 1. A total of 15 publications (meta-evaluations that evaluate or assess impact evaluations or impact analysis in some form) were assessed as eligible for inclusion. The meta-evaluations reviewed in this study are summarized in Table 2.

Figure 1.

Screening and selection process.

Table 2.

Meta-Evaluations With Impact Evaluation or Impact Analysis Components by Publication Type.

No.	Author and Year	Meta-Evaluation Type	Number of Evaluations Assessed	Program Focus Areas
Peer-reviewed scholarly article
1	Lam et al. (2019)	External	70	Multiple
2	Philips & de Wet (2017)	External	5	Multiple
3	Jacob & Desautels (2014)	External	20	Aboriginal programs
4	Chapman (2012)	External	62	Health
5	Leveille & Chamberland (2010)	External	50	Child welfare and protection
6	Scott-Little et al. (2002)	External	23	Education
Public program report
7	Green & Hargadine (2016)	Mixed	92	Multiple
8	Hageboeck et al. (2013)	External	340	Multiple
9	Weigel (2012)	Not specified	>60	Multiple
10	Robert & Engelhardt (2009)	External	45	Humanitarian aid
11	Concern (2005)	Internal	>10	Emergency response
12	Treasury Board of Canada Secretariat (2004)	Internal	115	Multiple
13	Universalia (2003)	External	42	Multiple
14	Goldenberg (2001)	External	104	Multiple
15	Houghton & Robertson (2001)	External	49	Humanitarian aid

Quality Assessment

To assess the quality of the meta-evaluations under review, a set of criteria was developed based on guidelines under the meta-evaluation standard in the Program Evaluation Standards (second edition) developed by the Joint Committee on Standards for Educational Evaluation (1994), the checkpoints for the meta-evaluation standard in Stufflebeam’s (1999) meta-evaluation checklist, and recommendations for internal and external meta-evaluation standards in the latest edition of the Program Evaluation Standards (Yarbrough et al., 2010). The quality assessment tool contains 10 questions (Table 3). Two reviewers independently assessed the sampled meta-evaluations using the quality assessment questions and marked each question as being either “addressed,” “partially addressed,” or “not addressed”/“not clear (unable to judge).”

Table 3.

Quality Assessment Questions and Results for Meta-Evaluations Reviewed.

Assessment Questions	A	%	P	%	N	%
Q1: Does the meta-evaluation (ME) articulate the goal and objectives?	14	93	1	7	0	0
Q2: Does the ME specify the evaluation criteria or standards?	12	80	0	0	3	20
Q3: Does the ME clarify who the meta-evaluator is?	13	87	1	7	1	7
Q4: Does the ME identify resources allocated to the ME (budget, time, personnel, and expertise)?	0	0	6	40	9	60
Q5: Does the ME evaluate the instrument, data collection and handling, and coding and analysis?	10	67	2	13	3	20
Q6: Does the ME evaluate the involvement of stakeholders?	7	47	4	27	4	27
Q7: Does the ME assess whether the information used to judge the evaluation against standards is sufficient?	7	47	1	7	7	47
Q8: Does the ME identify intended users of the ME?	4	27	3	20	8	53
Q9: Does the ME clarify all steps and procedures involved?	13	87	1	7	1	7
Q10: Does the ME provide sufficient information to tell it is internal, external, or mixed ME?	9	60	3	20	3	20

Note. n = 15. A = addressed; P = partially addressed; N = not addressed or not clear (unable to judge).
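
The percentages in Table 3 follow from the counts and the sample size of 15. The sketch below recomputes them, assuming rounding to the nearest whole percent; the names `counts` and `to_percentages` are illustrative.

```python
N_STUDIES = 15

# (addressed, partially addressed, not addressed/not clear) per question,
# copied from Table 3.
counts = {
    "Q1": (14, 1, 0),
    "Q2": (12, 0, 3),
    "Q3": (13, 1, 1),
    "Q4": (0, 6, 9),
    "Q5": (10, 2, 3),
    "Q6": (7, 4, 4),
    "Q7": (7, 1, 7),
    "Q8": (4, 3, 8),
    "Q9": (13, 1, 1),
    "Q10": (9, 3, 3),
}


def to_percentages(row, n=N_STUDIES):
    """Convert a row of counts into whole-number percentages of n."""
    return tuple(round(100 * count / n) for count in row)


for question, row in counts.items():
    # Every study receives exactly one rating per question.
    assert sum(row) == N_STUDIES
    print(question, row, to_percentages(row))
```

Because each cell is rounded separately, a row of percentages can sum to 99 or 101 rather than 100 (for example, Q3 gives 87, 7, and 7).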

Table 3 reports the questions assessed and the results. Most of the 15 studies we analyzed clearly articulated their aims and objectives, indicated the evaluation criteria or standards applied, identified the meta-evaluator (or team), and described the process and steps undertaken to conduct the meta-evaluation. The majority also examined evaluation tools and methodological practices such as instruments, data collection, and data analysis. No study fully addressed the resources allocated to the meta-evaluation (six addressed them partially, and nine did not address them or were unclear), making this the weakest area identified by the quality assessment. Appropriately costing the resources necessary for a meta-evaluation contributes not only to the transparency but also to the accountability of the whole meta-evaluation exercise.

Data Analysis and Synthesis

Descriptive data were first collected across the 15 meta-evaluations, specifically the type and year of publication, type of meta-evaluation, type of evaluations evaluated, focus areas of the programs evaluated, and the number of evaluations included in the meta-evaluations. Data were then summarized including the frequency counts for each of the data categories above. It is worth noting that this study does not attempt to draw inferences from the sampled 15 studies to a larger population of potential meta-evaluations. Rather, our focus was to identify patterns and themes that emerged in the sample to shed light on the application and improvement of impact evaluation and analysis practices.

We adopted a “theory-driven” approach to conduct a thematic analysis of the data (Braun & Clarke, 2006). The method involves an initial step to develop a coding frame centering on the methodology or methodological features reviewed, followed by steps to identify and define emergent themes (for further details, see Braun & Clarke, 2006). First, a draft coding frame was independently tested by two researchers using a small subset of the sample, and revisions were incorporated to finalize the coding frame. Table 4 contains the final coding frame used. This process also enabled researchers to become familiar with the coding frame and to reach consensus in applying it. Next, two researchers independently coded the 15 meta-evaluations using QSR’s NVivo software (Version 12). Disagreements between the two researchers were resolved through consensus-based discussion. Subsequently, based on the initial codes, one of the two researchers who coded the studies synthesized the qualitative data and developed analytical themes that address the research questions (Thomas & Harden, 2008).

Table 4.

Coding Frame Used in the Systematic Review.

1. Context of meta-evaluation (ME)

2. ME purpose/aim/research questions

3. ME method

3.1. ME definition

3.2. ME evaluation criteria or standards

3.3. Sample, scope, and coverage

3.4. Meta-evaluator

3.5. Stakeholder participation in ME

3.6. What was evaluated or analyzed in ME

3.7. ME process

4. Impact evaluation evaluated

4.1. Measurement of impact

4.2. Evaluation of impact evaluation/analysis

5. ME key findings (methodological)

6. Implications of ME

7. Limitations

7.1. Suggestions for future ME or impact evaluation research or practice

8. Other

8.1. Notes for ME methodology

Findings

Descriptive data from our sample of 15 meta-evaluations are reported in Tables 5–7. Meta-evaluations that were published in peer-reviewed scholarly journals accounted for 40% of the sample (Table 5). As shown in Table 2, the meta-evaluations varied in size, ranging from five to over 300 evaluations. On average, program reports meta-evaluated larger samples of evaluations than scholarly studies (see the estimated average size of meta-evaluations of the two groups in Table 5 and a further breakdown of the number of evaluations by publication type in Table 6). Program reports likely had access to either internal or unpublished evaluations and designated funding to conduct meta-evaluations at a larger scale. Although close to half of the meta-evaluations were published before 2010 (n = 7), with the majority being program reports (n = 6), there is a comparatively large number of recently published scholarly articles in our sample (Table 7). Most meta-evaluations were conducted by evaluators/researchers external to the program evaluated (73%, n = 11), which accounts for all six scholarly articles and five of the nine program reports (Table 2). Just over half of the studies evaluated programs or interventions that spanned multiple program focus areas (n = 8), and the remainder (n = 7) evaluated programs that focused on one or two specific program areas (Table 2). As such, this review sees meta-evaluation as a specific type of evaluation being applied to a diverse range of evaluands.

Table 5.

Type of Publication.

Publication Type	n	%	Total Number of Evaluations Assessed	Average Number of Evaluations Assessed
Peer-reviewed scholarly articles	6	40	230	38
Program reports	9	60	857^a	95^a
Total	15		1,087

Note. n = 15.

^a The total and average numbers of evaluations assessed in the subsample of program reports are estimates as two meta-evaluation reports provided estimates instead of accurate numbers of evaluations assessed.
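
The totals and averages can be recomputed from the evaluation counts listed in Table 2. The sketch below is illustrative only: it assumes that the two estimated entries (“>60” and “>10”) are entered at their lower bounds of 60 and 10, which is not stated in the source, and that averages are rounded to the nearest whole number.

```python
from statistics import mean

# Evaluation counts from Table 2. The ">60" and ">10" entries are entered as
# 60 and 10, an assumption about how the lower bounds were treated.
scholarly = [70, 5, 20, 62, 50, 23]
program_reports = [92, 340, 60, 45, 10, 115, 42, 104, 49]

for label, sizes in (("scholarly articles", scholarly),
                     ("program reports", program_reports)):
    print(label, "total:", sum(sizes), "average:", round(mean(sizes)))

grand_total = sum(scholarly) + sum(program_reports)
print("all publications total:", grand_total)
```

Under these assumptions the subtotals are 230 and 857, the averages are 38 and 95, and the grand total is 1,087.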

Table 6.

Number of Evaluations Assessed.

Number of Evaluations Assessed	Publication Type		n	%
Number of Evaluations Assessed	Scholarly Article	Program Report	n	%
20 or fewer	2	1	3	20
21–50	2	3	5	33
51–100	2	2	4	27
More than 100	0	3	3	20
Total	6	9

Note. n = 15.
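
The size bands in Table 6 can be reproduced by binning the evaluation counts from Table 2. As a sketch, the code below again assumes that “>60” and “>10” are entered as 60 and 10; the function name `size_band` is illustrative.

```python
from collections import Counter

BANDS = ["20 or fewer", "21–50", "51–100", "More than 100"]


def size_band(n_evaluations: int) -> str:
    """Assign a meta-evaluation to a Table 6 size band."""
    if n_evaluations <= 20:
        return BANDS[0]
    if n_evaluations <= 50:
        return BANDS[1]
    if n_evaluations <= 100:
        return BANDS[2]
    return BANDS[3]


# Counts from Table 2; ">60" and ">10" entered as 60 and 10 (assumption).
scholarly = [70, 5, 20, 62, 50, 23]
program_reports = [92, 340, 60, 45, 10, 115, 42, 104, 49]

scholarly_bands = Counter(size_band(n) for n in scholarly)
program_bands = Counter(size_band(n) for n in program_reports)

for band in BANDS:
    # A Counter returns 0 for bands with no meta-evaluations.
    print(band, scholarly_bands[band], program_bands[band])
```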

Table 7.

Year of Publication.

Publication Year	Publication Type		n	%
Publication Year	Scholarly Article	Program Report	n	%
2015–2019	2	1	3	20
2010–2014	3	2	5	33
Before 2010	1	6	7	47
Total (all years)	6	9

Note. n = 15.

This review sets out to examine impact assessment and analysis practices in the context of program evaluation through the analytical lens of meta-evaluations. The rest of this section presents our key findings under each of the research questions.

Research Question 1: What does the available evidence tell us about the extent to which meta-evaluations assess impact evaluations?

First, we looked at the methodological scope of impact evaluations sampled in the meta-evaluations. While only two papers had a single focus—one on outcome/impact evaluations (Philips & de Wet, 2017) and another publication on economic return studies of health promotion programs (Chapman, 2012)—the remaining 13 had a multifocal approach, evaluating mixed types of evaluations. These encompassed formative and summative evaluations as well as implementation, performance, and impact evaluation reports. Of these 13 studies that meta-evaluated different evaluations, four identified impact evaluation as a type of evaluation and reported a percentage of impact evaluations in the sample, ranging from as low as 3% to 9% (e.g., Hageboeck et al., 2013). The exception here was Lam et al. (2019), in which 24% of the sampled evaluations were impact evaluations. Another paper noted that among the 50 evaluations meta-evaluated, 40% had an impact assessment component (Leveille & Chamberland, 2010).

Some studies observed performance or formative evaluations that reported outcome data, asked questions on causality or attribution, or applied impact evaluation techniques to some extent. Close to half (40%) of the meta-evaluations investigated whether the evaluations under assessment had raised questions about attribution issues. Hence, on the one hand, impact evaluations and assessments might constitute a small yet growing pool of evidence as a newer type of evaluation, and on the other hand, impact assessment or measurement inquiries and practices are not exclusive to impact evaluations alone.

Most of the meta-evaluations we reviewed focused on the quality dimensions of evaluations with a systematic approach. This review consequently examined the methodological features of impact evaluations or impact assessments captured through meta-evaluations. Our findings reveal a focus on research design and impact measures. Given the attention to attribution, it is not a surprise to see preference given to the more experimental research or evaluation designs, particularly those that use control groups and random assignment in response to the need to verify impact. In their meta-evaluation of the U.S. Agency for International Development (USAID), Hageboeck et al. (2013) indicated that USAID’s definition of impact evaluation articulated criteria such as whether a comparison group was included and whether it had at least two data points. While quantitative evaluations are important for outcome measures, not all impacts are quantifiable. One study recognized that qualitative evaluations offer useful information on barriers and opportunities for improving program processes and outcomes (Lam et al., 2019). Only one publication from our sample focused solely on outcome/impact evaluations that use naturalistic/qualitative methods (Philips & de Wet, 2017). They concluded that lack of methodological clarity is a key contributor to perceptions that naturalistic/qualitative evaluations lack rigor (Philips & de Wet, 2017). Another common design feature examined is the presence (and role) of theory or logic models in evaluations, also known as impact hypotheses. Although the results from the meta-evaluations that assessed this feature are mixed, the underlying understanding is that a theory of change or logic model should be used or referred to as a guide for evaluations (e.g., Houghton & Robertson, 2001; Weigel, 2012).

Measurement quality is not typically a meta-evaluation criterion. However, the outcome and impact measures vary significantly in our sample of meta-evaluations. The measurement challenge resides not only in how to measure but also in what to measure (Scott-Little et al., 2002). Many have called for better and/or more appropriate impact indicators. One meta-evaluation highlighted the need to develop appropriate impact indicators together with affected populations rather than letting them be driven by the priorities of the funding agency and/or government (Houghton & Robertson, 2001). This resonates with the recommendation made by Lam et al. (2019) that engaging beneficiaries and stakeholders through participatory evaluation approaches when developing indicators could encourage stakeholder ownership of the evaluative process. Philips and de Wet (2017) also flagged the importance of understanding the context of an indicator, noting that the availability of such information could clarify the extent to which an indicator is measurable and therefore its suitability in specific contexts.

Research Question 2: What can meta-evaluation as an instrument contribute to improving evaluation and impact analysis practices?

Quality assessment and improvement is at the heart of a meta-evaluation because it helps stakeholders know the extent to which they can trust the evaluation (Yarbrough et al., 2010). Table 8 reports the purpose of the meta-evaluation as explicated by each study. Fourteen of the 15 studies focused on either quality assessment (pertaining to the assessment of the quality of evaluations) or quality/performance improvement (pertaining to the improvement of evaluation quality or practice or services delivered) or both. This reflects the importance attached to the quality of practices in our sample. In addition to quality, the sample encompasses a range of other evaluation purposes including organizational learning, tracking progress over time, and project-specific objectives (e.g., assessing gender incorporation in evaluations, contributing to strategic framework development, and assessing the empirical value of a particular model). This type of quality checking lays the foundation for the versatile application of meta-evaluation in the organizational learning and capacity building space. As such, meta-evaluation can be used explicitly as a quality assessment instrument, and the meta-evaluation technique can also serve as an analytical tool to explore specific topics or situations of interest.

Table 8.

Purposes and Objectives of the Meta-Evaluations.

Meta-Evaluation Purpose/Objective	n	%
Quality assessment	9	60
Quality/performance improvement	8	53
Organizational learning	4	27
Tracking progress over time	3	20
Other	5	33
Gender incorporation into evaluation	(1)
Summarizing effect size	(1)
Development of strategic framework	(1)
Assessing project impact	(1)
Assessing empirical value of application of child welfare services model	(1)

Note. n = 15.

Assessing the methodological rigor of evaluations is a primary feature of meta-evaluations that are used for quality-checking purposes. Ten of the 15 publications in this review meta-evaluated either the design or methodology of evaluations or both. This reflects the underlying perception that evaluation findings are only as good as the methodologies used (Scott-Little et al., 2002) and that good quality evaluations can stimulate further use (Jacob & Desautels, 2014). Although the elements used to assess quality varied greatly, there was some commonality in the meta-evaluation criteria/standards used. In comparison to scholarly meta-evaluation studies, which preferred the use of existing criteria/standards, the program reports mainly relied on study-specific criteria (see Table 9). The most common existing criteria/standards used or referred to in the sample included the OECD’s Development Assistance Committee (DAC) evaluation criteria (OECD DAC, 2019) and the Program Evaluation Standards (second edition and third edition).

Table 9.

Meta-Evaluation Criteria/Standards.

Criteria/Standard Type	Publication Type		n	%
Criteria/Standard Type	Scholarly Article	Program Report	n	%
Study-specific	1	6	7	47
Existing	3	1	4	27
Mix of study-specific and existing	2	2	4	27
Total (all types)	6	9

Note. n = 15.

Additionally, the sampled studies also drew on existing assessment tools for particular issues. For example, Lam et al. (2019) adapted questions from gender-related assessment tools, and Leveille and Chamberland (2010) applied assessment tools for qualitative and quantitative research to evaluate the quality and rigor of the studies they meta-evaluated. The meta-evaluations that used study/program-specific criteria had greater variation in their assessment tools. For example, Philips and de Wet (2017) developed four trustworthiness criteria; Chapman (2012) used a set of seven criteria including research design, sample size, quality of baseline delineations, quality of measurements used, and the appropriateness and replicability of interventions; and Universalia’s (2003) report developed a 21-section quality checklist. It is likely that program reports took into account the diverse perceptions of evaluation stakeholders and the different ways that meta-evaluation results can be used and reflected these in the development of meta-evaluation criteria and standards. The prevalence of program-specific criteria in the sample highlights the diversity in interpretations and the demand for quality in practice. These criteria/standards, regardless of existing or program-specific frameworks, were used in various ways by the sampled publications, including creating a rating or scoring system (Chapman, 2012; Scott-Little et al., 2002), building a checklist (Hageboeck et al., 2013), guiding analysis and synthesis (Philips & de Wet, 2017), and formatting sections or headers in reports to present findings (Houghton & Robertson, 2001; Universalia, 2003).

A majority of the sample also examined the methods used to collect and analyze data. Several publications noted that data analysis, particularly the relationship between data and analysis, lacked clarity and was not guided by appropriate methods, even though more data were being collected in the field. The demand for explicit and detailed data collection and analysis methods applied to both quantitative and qualitative evaluations (Hageboeck et al., 2013; Jacob & Desautels, 2014; Philips & de Wet, 2017). In two studies that cited the lack of data as a factor that weakened the evidentiary foundations of the project findings (Goldenberg, 2001; Weigel, 2012), the missing data were baseline data. This suggests that data collection (particularly for baseline studies) must be embedded early in the design of an evaluation.

Most of the meta-evaluations we reviewed made recommendations about ways to improve methodological or quality practices. Table 10 groups these recommendations into five areas and outlines the key suggestions made in each area. It is not surprising that most recommendations were directed at the quality dimensions of evaluations or meta-evaluations, given the central role of quality assessment under a meta-evaluative lens. Five key quality improvement recommendations emerged in the following areas: evaluation design elements, stakeholder participation, the need to link evaluations to the owner organization’s strategies and goals, impact assessment design, and the importance of enhancing methodological clarity. The operationalization of quality-focused evaluation practice is another thematic area that drew a number of recommendations. These included encouraging organizations to take up recommended approaches and tactics to build organizational capacity, developing resources to support learning, and utilizing evaluation results to inform decision making. Many of these suggestions could benefit evaluation practices regardless of the type of evaluation involved. Others are more relevant to impact evaluation and analysis practices. These include, for example, embedding impact assessment in project cycle management; using a theory of change, logic model, or impact hypothesis as the underlying theoretical foundation; offering technical support for impact assessment tasks; and sharing impact assessment results with the wider expert and practitioner community by publishing in peer-reviewed journals.

Table 10.

Quality Improvement Recommendations From Meta-Evaluations.

Thematic Area	Key Recommendations
Applying proven evaluation designs to improve evaluation quality	Design
	Use of experimental designs to show causality
	Using a mixed-methods approach to evaluations
	Applying a gender equity lens to the design and implementation of evaluations
	Strengthening the needs assessment and stakeholder participation
	Consulting the beneficiaries or key players in evaluations to complement documentary analysis in a meta-evaluation
	Organizational strategy and goals
	Considering the performance of evaluated projects against the owner organization’s strategic goals and objectives
	Impact assessment
	Exploring how impact assessment can be better embedded in project cycle management
	Providing the program logic model or impact hypothesis as a reference for the evaluation and improving clarity
	Providing a detailed description of methodologies, data collection, and data analysis
	Discussing the limitations and constraints of the evaluation
Enhancing organizational evaluative capacity to operationalize quality-focused evaluation practices	Building evaluative capacity
	Developing organizational standards for evaluation design and the selection of methodologies
	Institutionalizing evaluation standards across the organization
	Implementing a rigorous approach to monitor and report on the quality of evaluations
	Expanding the capacity of the organization to consume, critique, and utilize evaluation reports of high quality
	Organizational resource development
	Creating a peer learning or assistance space on methodologically demanding tasks such as impact assessment
	Providing support and guidance on evaluation and learning from the head office to field offices
	Use of evaluation results
	Examining how evaluation reports are used in the decision-making process
	Regularly reviewing the organization’s use of evaluation results
Better communication and use of evaluation findings and recommendations	Using templates or standard formats for evaluation reports
	Feeding impact results to expert circles by publishing in journals after completing an impact assessment
	Discussing recommendations with the commissioning organization at the draft report stage
Applying a gender lens in evaluations	Going beyond the collection of gender-disaggregated data to employ various approaches for a better understanding of how gender may differently shape evaluation outcomes and experiences
Applying a gender lens in evaluations	Examining the sociopolitical parameters that determine or influence gender inequity and prompt gender-responsive actions
Conducting comparative studies between interventions	Using consistent or standard research methods and measures to enable comparative studies or comparisons between interventions

Discussion

Just as evaluation addresses evaluative questions around the quality and value of an evaluand (Davidson, 2014), meta-evaluation assesses the quality of evaluations. This makes meta-evaluation a useful analytical tool for organizations looking to monitor and improve their evaluative practices. The fact that we could not locate meta-evaluations of impact measurement for market-based solutions, such as social enterprises or impact investment, suggests that the meta-evaluation technique has yet to be taken up in contexts beyond program and development evaluations. Social sector organizations are facing methodological challenges and resource constraints that limit their technical expertise in social impact measurement and reporting (see Barraket & Yousefpour, 2013; Haski-Leventhal & Mehra, 2016). Adopting a meta-evaluative approach as part of social impact analysis can help impact providers and funders improve the quality and rigor of impact measurement.

A relatively small number of impact evaluations were identified in the meta-evaluation sample. As discussed in the previous section, the complexities of addressing social issues and of measuring and reporting impacts demand further exploration of current approaches and the evidence base. The benefits of synthesizing knowledge and evidence have long been demonstrated in the natural and medical sciences (e.g., the Cochrane Collaboration), but much less so in the evaluation and social impact measurement field. Disseminating and synthesizing the findings of impact evaluations, social impact measurements, and other methodologically demanding evaluative exercises is crucial as the demand for social impact continues to grow. Meta-evaluation can serve as an effective instrument for systematically reviewing and aggregating evaluation studies.

The meta-evaluative lens adopted in our study highlights some emergent issues in impact measurement practices. All meta-evaluations we reviewed had evaluated impact evaluations or assessments, evaluations that had impact analysis elements, or evaluations that reported on impact. However, very few provided clear definitions of outcomes, impact, and impact evaluations. Many used impact interchangeably with outcomes, and some referred to the impact element in OECD’s DAC standards or discussed impact in the broader context of program results, achievements, and success. The lack of a clear definition of “impact” could be a key contributor to the challenges experienced in impact measurement, particularly around what to measure and how to measure it. Indicator development is one process that can be used to define results (Houghton & Robertson, 2001). Similarly, impact measurement plays a role in defining impact. For example, the USAID definition of an impact evaluation stipulates the presence of control groups and at least two data points. Clarity in definitions of impact is required to improve impact measurement practices, just as methodological clarity is required to improve evaluation design and method, data collection, and analysis. Furthermore, only one meta-evaluation (Goldenberg, 2001) in our sample noted the effect of time on achieving short- and long-term goals, arguing that evaluators rarely made the distinction between short-term achievements and long-term prospects. Although evaluation reports typically included information on timing according to project cycles (such as baseline, midterm, end line, or impact), temporality also needs to be incorporated in impact measures and reporting, particularly when long-term benefits and sustainability-related goals are involved.

There are some limitations in this study that should be noted. The scarcity of meta-evaluations in the public domain is a key drawback. Systematically sourcing relevant gray literature also proved to be a challenge: Our search of the academic databases yielded fewer practice-based reports than expected, and using search engines like Google left us inundated with irrelevant results. To ameliorate this limitation, we used a customized Google search with an NGO filter to source most of our gray literature. However, limiting the scope of the search to materials published by NGOs may have created a bias in the gray literature sample. Another potential source of bias is the exclusion of dissertations from our search. The decision to exclude dissertations was made to keep our selection systematic and consistent and to reduce the publication bias that may arise from dissertations, on the assumption that significant research findings would likely have been published. However, we acknowledge that this might have excluded dissertations with unpublished empirical studies on meta-evaluations.

We also note that there is variability in how meta-evaluations are named and discussed in both the scholarly studies and the gray literature. During our screening process, we found that many studies that used the term “meta-evaluation” were better described as meta-analyses. Thus, our search criteria may not have captured some relevant studies because of the way those studies were labeled. It is also worth noting that the studies included in this systematic review all self-identified as meta-evaluations.

Conclusion

Meta-evaluations are systematic reviews of evaluations. This study examined meta-evaluations of impact evaluations and impact analysis practices. Our findings suggest that meta-evaluation is a viable approach for developing evaluative capacity in organizations, including quality assessment and improvement of evaluative practices. Given the growing evidence base of impact evaluation and the increasing demand for better impact analysis and reporting from the social sector, meta-evaluative techniques can play a crucial role in improving the quality of impact measures. The limited application of meta-evaluation in our sample, and the fact that all were in the program or development evaluation field, highlight the need for wider uptake of this approach in other contexts. The use of meta-evaluation techniques among new actors—such as social enterprises and other market-based actors—may lead to new developments in social impact measurement practice.

Acknowledgments

The authors wish to acknowledge the contribution of Dr. Erin Wilson and members of the Social Enterprise Impact Lab (SEIL) Project at the Centre for Social Impact (CSI), Swinburne University of Technology, for their input. They are thankful to the editor and reviewers, whose valuable comments helped them to sharpen and strengthen the article. They also wish to thank the Lord Mayor’s Charitable Foundation and Family Life for providing the funding for the SEIL project, of which this study was a part.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the Lord Mayor’s Charitable Foundation and Family Life in Melbourne, Australia.

ORCID iD

Joanne Xiaolei Qian-Khoo

References

Barraket & Yousefpour (2013). Evaluation and social impact measurement amongst small to medium social enterprises: Process, purpose and value. Australian Journal of Public Administration, 72(4), 447–458. https://doi.org/10.1111/1467-8500.12042

Braun & Clarke (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101. https://doi.org/10.1191/1478088706qp063oa

*Chapman, L. S. (2012). Meta-evaluation of worksite health promotion economic return studies: 2012 update. American Journal of Health Promotion, 26(4), 1–12. https://doi.org/10.4278/ajhp.26.4.tahp

*Concern. (2005). Analysis of emergency evaluations: An updated discussion paper. https://reliefweb.int/sites/reliefweb.int/files/resources/emergency_response_evaluations_-_meta_evaluation_2000_to_2004.pdf

Cooksy, L. J., & Caracelli, V. J. (2009). Meta-evaluation in practice. Journal of Multidisciplinary Evaluation, 6(11), 1–15. https://eric.ed.gov/?id=EJ829478

Cooper, Hedges, L. V., & Valentine, J. C. (2009). The handbook of research synthesis and meta-analysis (2nd ed.). Russell Sage Foundation.

Davidson, E. J. (2014). Evaluative reasoning, methodological briefs: Impact evaluation 4 (Methodological Briefs No. 4). https://www.unicef-irc.org/publications/749-evaluative-reasoning-methodological-briefs-impact-evaluation-no-4.html

European Union & Organisation for Economic Co-operation and Development. (2015). Policy brief on social impact measurement for social enterprises: Policies for social entrepreneurship. Publications Office of the European Union. https://www.oecd.org/social/PB-SIM-Web_FINAL.pdf

*Goldenberg, D. A. (2001). Meta-evaluation of goal achievement in CARE projects: A review of findings and methodological lessons from CARE final evaluations, 1994–2000. CARE MEGA Evaluations: CARE USA. https://www.careevaluations.org/evaluation/care-mega-2000-synthesis-report/

Gough, Oliver, & Thomas (2013). Learning from Research: Systematic Reviews for Informing Policy Decisions: A Quick Guide. A paper for the Alliance for Useful Evidence. London: Nesta. https://www.alliance4usefulevidence.org/assets/Alliance-FUE-reviews-booklet-3.pdf

*Green & Hargadine (2016). Sectoral synthesis of FY2015 evaluation findings: Bureau for economic growth, education, and environment. U.S. Agency for International Development (USAID). https://pdf.usaid.gov/pdf_docs/PA00MP17.pdf

*Hageboeck, Frumkin, & Monschein (2013). Meta-evaluation of quality and coverage of USAID evaluations: 2009–2012. U.S. Agency for International Development (USAID). https://www.usaid.gov/sites/default/files/documents/1870/Meta-Evaluation%20of%20Quality%20and%20Coverage%20of%20USAID%20Evaluations%202009-2012.pdf

Haski-Leventhal & Mehra (2016). Impact measurement in social enterprises: Australia and India. Social Enterprise Journal, 12(1), 78–103. https://doi.org/10.1108/SEJ-05-2015-0012

Henry (2016). The meta-evaluation of the sports participation impact and legacy of the London 2012 Games: Methodological implications. Journal of Global Sport Management, 1(1–2), 19–33. https://doi.org/10.1080/24704067.2016.1177356

*Houghton & Robertson (2001). Humanitarian action: Learning from evaluation (ALNAP Annual Review Series 2001). Overseas Development Institute (ODI). https://reliefweb.int/sites/reliefweb.int/files/resources/98A220E8FF3F4EEAC1256C24005D5378-ar2001_all.pdf

*Jacob & Desautels (2014). Assessing the quality of Aboriginal program evaluations. Canadian Journal of Program Evaluation, 29(1), 62–86. https://journalhosting.ucalgary.ca/index.php/cjpe/article/view/30831

Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards (2nd ed.). SAGE.

*Lam, Dodd, Whynot, & Skinner (2019). How is gender being addressed in the international development evaluation literature? A meta-evaluation. Research Evaluation, 28(2), 158–168. https://doi.org/10.1093/reseval/rvy042

*Léveillé & Chamberland (2010). Toward a general model for child welfare and protection services: A meta-evaluation of international experiences regarding the adoption of the Framework for the Assessment of Children in Need and Their Families (FACNF). Children and Youth Services Review, 32(7), 929–944. https://doi.org/10.1016/j.childyouth.2010.03.009

Liket, K. C., Rey-Garcia, & Maas, K. E. H. (2014). Why aren’t evaluations working and what to do about it: A framework for negotiating meaningful evaluation in nonprofits. American Journal of Evaluation, 35(2), 171–188. https://doi.org/10.1177/1098214013517736

Nguyen, Szkudlarek, & Seymour, R. G. (2015). Social impact measurement in social enterprises: An interdependence perspective. Canadian Journal of Administrative Sciences, 32(4), 224–237. http://doi.org/10.1002/cjas.1359

Organisation for Economic Co-operation and Development. (2019). Social impact investment 2019: The impact imperative for sustainable development. https://doi.org/10.1787/9789264311299-en

Organisation for Economic Co-operation and Development – Development Assistance Committee. (2019). Evaluation criteria. http://www.oecd.org/dac/evaluation/daccriteriaforevaluatingdevelopmentassistance.htm

Petticrew & Roberts (2006). Systematic Reviews in the Social Sciences: A Practical Guide. John Wiley & Sons (UK). https://doi.org/10.1002/9780470754887

Phillips, S. D., & Johnson (2019). Inching to impact: The demand side of social impact investing. Journal of Business Ethics. Advance online publication. https://doi.org/10.1007/s10551-019-04241-5

*Phillips & de Wet, J. P. (2017). Towards rigorous practice: A framework for assessing naturalistic evaluations in the development sector. Evaluation, 23(1), 102–120. https://doi.org/10.1177/1356389016682777

Picciotto (2015). The 5th wave: Social impact evaluation. The Rockefeller Foundation. https://www.rockefellerfoundation.org/wp-content/uploads/The-5th-Wave-Social-Impact-Evaluation.pdf

Reisman, Orians, Picciotto, Jackson, Harji, MacPherson, & Olazabal (2015). Streams of social impact work: Building bridges in a new evaluation era with market-oriented players at the table (Working paper). The Rockefeller Foundation. https://www.rockefellerfoundation.org/wp-content/uploads/Streams-of-social-impact-work.pdf

*Robert & Engelhardt (2009). OCHA meta-evaluation: Final report. Lotus M&E. https://reliefweb.int/report/world/ocha-meta-evaluation-final-report

*Scott-Little, Hamann, M. S., & Jurs, S. G. (2002). Evaluations of after-school programs: A meta-evaluation of methodologies and narrative synthesis of findings. American Journal of Evaluation, 23(4), 387–419. https://doi.org/10.1177/109821400202300403

Scriven (2005). Metaevaluation. In Mathison (Ed.), Encyclopedia of evaluation (pp. 250–251). SAGE. https://dx-doi-org.web.bisu.edu.cn/10.4135/9781412950558.n340

Scriven (2009). Meta-evaluation revisited. Journal of MultiDisciplinary Evaluation, 6(11), iii–viii. https://journals.sfu.ca/jmde/index.php/jmde_1/article/view/220/215

Stufflebeam, D. L. (1999). Program evaluation models metaevaluation checklist: Based on the program evaluation standards. https://wmich.edu/sites/default/files/attachments/u350/2014/program_metaeval_short.pdf

Stufflebeam, D. L. (2001). The metaevaluation imperative. American Journal of Evaluation, 22(2), 183–209. https://doi.org/10.1016/S1098-2140(01)00127-8

Thomas & Harden (2008). Methods for the thematic synthesis of qualitative research in systematic reviews. BMC Medical Research Methodology, 8(45). https://doi.org/10.1186/1471-2288-8-45

Tingle, L. R., DeSimone, & Covington (2003). A meta-evaluation of 11 school-based smoking prevention programs. Journal of School Health, 73(2), 64–67. https://doi.org/10.1111/j.1746-1561.2003.tb03574.x

*Treasury Board of Canada Secretariat. (2004). Review of the quality of evaluations across departments and agencies. https://www.tbs-sct.gc.ca/cee/pubs/review-examen2004-eng.pdf

*Universalia. (2003). IUCN meta-evaluation: An analysis of IUCN evaluations 2000–2002. Universalia with IUCN – The World Conservation Union. https://www.iucn.org/downloads/meta_evaluation_03_1.pdf

Vo, A. T., & Christie, C. A. (2018). Where impact measurement meets evaluation: Tensions, challenges, and opportunities. American Journal of Evaluation, 39(3), 383–388. https://doi.org/10.1177/1098214018778813

*Weigel (2012). Meta evaluation: Aggregating learning from evaluations and reflecting on opportunities to feed into organisational change. HELVETAS Swiss Intercooperation. https://www.helvetas.org/Publications-PDFs/metaevaluationexecutivesummary2012.pdf

Yarbrough, D. B., Shula, L. M., Hopson, R. K., & Caruthers, F. A. (2010). The program evaluation standards: A guide for evaluators and evaluation users (3rd ed.). SAGE.