Abstract
Evaluative criteria define a “high quality” or “successful” evaluand and provide the basis for judgment of merit and worth, yet they are often assumed and implicit in the evaluation process. This article presents an empirically supported model that describes and integrates two aspects of criteria: domain and source. Domain identifies the focus or substance of a criterion, while source describes the individual, group, or document from which it is drawn. Developed from a synthesis of evaluation literature and empirical analysis of evaluation reports, the model defines 11 criteria domains and 10 sources and reveals the relationships among them. In this integrated model, the two dimensions can be used together as a thinking tool to guide evaluators in specifying criteria, in empirical research on the valuing process, and as a conceptual framework and language for theorists prescribing criteria selection.
Evaluation is a practice of making judgments about the merit or worth of a program or other evaluand (Scriven, 1991). This assessment is based on explicit or implicit criteria that distinguish a “high quality” or “successful” intervention from one that is “low quality” or “unsuccessful” (Julnes, 2012a; Schwandt, 2015; Scriven as cited in Fournier, 1995). Evaluative criteria structure the foundational logic that distinguishes the practice of evaluation from an expression of personal opinion or preference (Schwandt, 2015; Scriven, 2007).
While specification and application of criteria are central to evaluation (Henry, 2002), the literature on evaluative criteria—and the valuing process more generally—is limited (Alkin et al., 2012). Perhaps as a result, many evaluators lack a fundamental understanding of evaluative criteria and report difficulty specifying criteria (Ozeki et al., 2019). Scholars have called for an empirical, descriptive theory of valuing, emphasizing the need to understand how evaluators reach evaluative conclusions, including how they identify the values they use and how they enact the valuing process (Coryn et al., 2017; Shadish et al., 1991). A key area of focus for a descriptive theory of valuing would be the nature and source of criteria (Mathison, 2005).
As a preliminary step in theory building, this article advances a model that describes and integrates two aspects of evaluative criteria: domain and source. Domain is defined as the focus or substance of a criterion, and source as the person, group, or document from which it is drawn. Together, these two dimensions can be used as a thinking tool to guide evaluators in selecting criteria, as a resource for evaluator training, and in empirical research on evaluation to illuminate which values are represented by the selected criteria and who holds or advances those values. In addition, evaluation theorists can use the framework to prescribe and critique criteria specification. Recognizing the limitations of systemization in the practice and study of valuing (Julnes, 2012a), the model is intended to be dynamic and incorporate new understandings that emerge from research and theory building.
The article is organized into three sections. The first part addresses criteria domains. I begin by synthesizing the available literature to identify nine domains. This list is refined by examining the criteria evident in a sample of evaluation reports; one domain is revised, and two additional domains are identified. The second section considers criteria sources. A review of the literature reveals nine sources, and empirical examination identifies one additional source. In the third part of the article, I combine the two dimensions to create an integrated model of domains and sources, which illuminates the range of values and perspectives that can inform evaluative judgments. I conclude with implications for evaluation theory, practice, and research.
Key Terminology
The concept of values remains poorly specified, confused, and fuzzy in the field of evaluation (House, 2015, p. xiii), perhaps in part because the term is used to refer to a range of related ideas (House & Howe, 1999; Schwandt, 2015). I define values as beliefs about what is important, worthwhile, desirable, or preferable that transcend specific situations (House, 2015; Rokeach, 1973; Schwartz, 1992). Values underpin the criteria people use to evaluate behaviors and events in daily life (Rokeach, 1973; Schwartz, 1992). In the formal practice of evaluation, values inform each component of a study, including criteria specification, as well as the social relational aspects of the evaluator’s role (Greene, 2012; Shadish et al., 1991).
Evaluative criteria define or exemplify the desired state for an intervention by describing the aspects, dimensions, or characteristics on which it will be judged (Davidson, 2005; Sadler, 1985; Scriven, 2013). Values shape the vision of the desired state and definition of success or quality for an intervention (Mark et al., 2000; Rogers, 2016), and stakeholders may hold different values that inform how social problems are defined and which interventions and outcomes are viewed as important or desirable (Archibald, 2020; Davidson, 2010; Greene et al., 2011; Madison, 1992). Evaluators are tasked with identifying relevant values about program success or quality and representing those as evaluative criteria (Mark et al., 2000).
Finally, valuing refers to the process of determining the merit, worth, or value of an intervention (Scriven, 1991). Dewey (1939) drew a distinction between two forms of valuing: prizing, in the sense of appreciating or “holding precious, dear” (p. 195); and appraising, which involves reasoning and critical judgment (de Munck & Zimmermann, 2015; Dewey, 1939). Criteria provide the basis for the reasoning and critical judgment that constitute the valuing process in formal evaluation practice. This article has relevance for the logic-based valuing process outlined by Scriven (2012; Fournier, 1995), which requires evaluators to establish criteria of merit and then construct and apply performance standards associated with each criterion. The model presented here provides a framework for evaluators to use in selecting and articulating those criteria. This article also has relevance for more holistic valuing processes, such as that described by Stake and colleagues (1997), that rest on evaluators’ perceptions and recognition of quality. In this type of valuing, criteria are understood as scaffolding or categories that guide evaluative interpretation. The model of criteria domains and sources can be used to direct attention to different dimensions of and perspectives on quality and communicate the basis for evaluative judgments.
Part 1: Criteria Domains
To date, there is limited discussion about the types of criteria evaluators do or should select, and those conversations are typically confined to the literature associated with specific intervention types or evaluation approaches. Schwandt (2015) provides a synthesis by describing seven criteria domains (along with seven sources of criteria, discussed below). Several of these domains are drawn from international development evaluation, where the Development Assistance Committee (DAC) of the Organization for Economic Cooperation and Development (OECD) has advanced an influential set of evaluative criteria that major international development agencies have adopted for framing development evaluations (Armytage, 2011; Chianca, 2008; Schwandt, 2015). The OECD/DAC criteria have also been adapted for complex emergencies (DAC, 1999).
Certain national governments have advanced sets of criteria for public sector evaluations. The U.S. Government Accountability Office (GAO) has identified 10 criteria domains, grouped in three categories, that are to be addressed in evaluations of federal programs (Shipman, 2012; U.S. General Accounting Office, 1988). In a similar fashion, the Government of Canada identified five issues or criteria to be addressed in evaluations within federal government departments (Centre for Excellence for Evaluation [CEE], 2015; Dumaine, 2012; Treasury Board of Canada Secretariat [TBCS], 2012). These criteria were reviewed as part of an assessment of the Government of Canada’s policy on evaluation (CEE, 2015) and rescinded in 2016. Subsequently, the Policy on Results described the criteria typically employed in evaluations (TBCS, 2016).
Turning to the evaluation theory literature, a handful of theorists outline specific criteria that can or should guide the valuing process: Scriven (2000), Davidson (2005), and Greene and colleagues (2011). While other evaluation approaches advance particular values, they provide little guidance on specific criteria domains to consider. The empirical literature is similarly silent on criteria domains. For example, in one of the rare empirical studies of valuing, Hurteau and colleagues (2009) examined evaluation reports to analyze how evaluators justified their judgments of program quality. The authors found that half of the sample did include explicit criteria, but those criteria were not described in their article.
Synthesis
To synthesize the literature described above, I used a data display matrix to represent and reorganize each criteria framework (Miles et al., 2014; see Table 1). Matrix entries were grouped and regrouped to bring together entries that reflected similar underlying concepts while ensuring each grouping included only one primary idea. This, at times, required dividing an entry that addressed multiple ideas into its component parts. The process resulted in nine criteria domains with applicability for evaluations of programs and other interventions. I grouped the domains into two categories: those that address the conceptualization and implementation of an intervention and domains that address intervention results, considered alone or with implementation. I also developed a brief definition for each domain. These descriptions are intended to communicate the focus of each domain while allowing adaptation to reflect variation in evaluands and contexts.
Table 1. Criteria Domains Synthesized From the Literature.
Note. A superscript “a” marks a domain that has been divided and appears in multiple rows.
Three domains related to intervention conceptualization and implementation. These were (a) Relevance: aims and activities are consistent with the needs, requirements, culture, interests, or circumstances of the intended beneficiaries; (b) Design: activities and implementation are consistent with relevant theoretical principles, best practices, standards, and/or laws and/or implementation is timely; and (c) Alignment: intervention is consistent and coordinated with larger initiatives, related interventions, funder aims, and/or interconnected problems. Six domains addressed the results of the intervention, either alone or in combination with its implementation. These were (d) Effectiveness: intervention achieves desired results, outcomes, or objectives; (e) Unintended Effects: intervention is associated with unintended positive consequences and the absence of negative consequences; (f) Consequence: intervention yields significant benefits to intended beneficiaries and other relevant populations that could benefit from the program and/or reaches a significant number of people or locations; (g) Equity: opportunities, experiences, benefits, and results are fair and just, with particular consideration to prioritizing marginalized populations; (h) Resource Use: funding, personnel, and materials are used economically and/or intervention yields an appropriate level of benefit in relation to the funds, personnel, and materials required; and (i) Sustainability: intervention has long-term benefits and/or program activities can continue beyond the initial start-up period.
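As an illustration only, the nine literature-derived domains and their two categories could be held in a simple lookup structure, for example when coding criteria in a document review. The Python sketch below is not part of the original study; the variable names and the short category labels are assumptions introduced here.

```python
from collections import Counter

# Short category labels are shorthand chosen for this sketch (an assumption),
# standing in for the two groupings described in the text.
IMPLEMENTATION = "conceptualization and implementation"
RESULTS = "results"

# The nine literature-derived domains, mapped to the category named in the text.
LITERATURE_DOMAINS = {
    "Relevance": IMPLEMENTATION,
    "Design": IMPLEMENTATION,
    "Alignment": IMPLEMENTATION,
    "Effectiveness": RESULTS,
    "Unintended Effects": RESULTS,
    "Consequence": RESULTS,
    "Equity": RESULTS,
    "Resource Use": RESULTS,
    "Sustainability": RESULTS,
}


def domains_in(category):
    """Return the domain names in one category, in the order listed above."""
    return [name for name, cat in LITERATURE_DOMAINS.items() if cat == category]


# Number of domains per category.
category_counts = Counter(LITERATURE_DOMAINS.values())
```

A structure like this makes it easy to confirm that every coded criterion maps to exactly one domain and that the category totals (three and six) match the synthesis.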
Methods: Sample of Evaluation Reports
To refine the emerging list of domains, I examined the criteria evident in a sample of evaluation reports. The reports were drawn from the field of informal science, technology, engineering, and mathematics (STEM) education, defined as lifelong STEM learning that “takes place across a multitude of designed settings and experiences outside of the formal classroom” (Center for the Advancement of Informal Science Education [CAISE], 2019, para. 1), such as science centers and museums; zoos and aquariums; botanical gardens; community and out-of-school time programs; public science events; and film, television, and online media (CAISE, 2019; Dierking et al., 2003). I focused on informal STEM education (ISE) evaluation reports for two reasons. First, I hoped to surface additional criteria domains and, therefore, selected an area of evaluation practice not directly addressed in the synthesized literature. Further, ISE interventions are voluntary educational experiences in which engagement is driven by social agendas in addition to learning aims (Allen et al., 2007; Falk, 2009; Perry, 2012). I hoped this distinctive feature might be associated with domains not reflected in the synthesized literature.
Second, the field of ISE evaluation offers a unique resource for research on evaluation: the InformalScience.org website, which includes a repository of publicly available ISE evaluation reports. All projects funded through the National Science Foundation’s (NSF) Advancing Informal STEM Learning program must post a final evaluation report to the site (NSF, 2017), and other reports are posted voluntarily. Although the repository is likely not representative of all ISE evaluation reports—reports of internal evaluations, formative evaluations, and non-federally funded evaluations are likely underrepresented—it is regarded as a key resource for studying ISE evaluation (see, e.g., Fu et al., 2016; Grack Nelson & Cohn, 2015; Morrissey et al., 2014; Serrell, 2015). As such, it can provide a window into the criteria used in ISE evaluation.
I drew a purposive sample of 37 InformalScience.org records that (a) reported on a summative or formative evaluation of an ISE program, exhibition, media project (film, television, or online), or other intervention and (b) were uploaded to the site in 2017 (summative and formative) or 2016 (formative). Twelve reports were removed from the sample: five reported on formal educational programs, five focused exclusively on interventions for educators or scientists rather than ISE participants, one reported on prototype testing rather than evaluation of a developed project, and one did not include sufficient detail for analysis.
The final sample included 18 summative and seven formative reports, which described evaluations of nine programs, nine exhibitions, eight media projects, one performance, and one curriculum. Two reports discussed evaluands that included more than one intervention type. Fourteen of the evaluands and studies were funded by NSF, five were funded by other U.S. federal agencies, and one was funded by a private corporation. Five documents did not report how the evaluand or study was funded. Authors of the reports were affiliated with 13 evaluation firms/centers and five ISE institutions. Two firms/centers and one ISE institution contributed two reports each to the sample, and one firm/center contributed three reports.
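The sample accounting reported above can be checked with a few lines of arithmetic. The following Python sketch uses only counts stated in the text; the dictionary keys and variable names are illustrative labels, not terms from the study.

```python
# All counts come from the text; key and variable names are illustrative.
initial_records = 37

# Reasons reports were removed from the purposive sample.
exclusions = {
    "formal educational program": 5,
    "intervention for educators or scientists only": 5,
    "prototype testing": 1,
    "insufficient detail": 1,
}

# Composition of the final sample by evaluation type and by funder.
final_by_type = {"summative": 18, "formative": 7}
funding = {
    "NSF": 14,
    "other U.S. federal agency": 5,
    "private corporation": 1,
    "not reported": 5,
}

removed = sum(exclusions.values())
final_sample = initial_records - removed
```

Both breakdowns (by evaluation type and by funder) sum to the same final sample size, which is the figure used in the remainder of the analysis.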
Methods: Identifying Criteria and Domains
A fundamental challenge in the empirical study of evaluative criteria—and the values they represent—is that they are often assumed and implicit in the evaluation process (Greene et al., 2011). However, like values, criteria “show up” in key evaluation commonplaces—that is, the key evaluation questions, intended outcomes and their associated indicators, instruments and measures, and evaluative judgments and conclusions (Greene, 2012; Greene et al., 2011; Hall et al., 2012). Evaluative criteria can, therefore, be surfaced through interpretive analysis of those commonplaces. Table 2 summarizes how criteria are embedded—and can be recognized—in five evaluation commonplaces. First, key evaluation questions focus a study on specific spheres of implementation or results about which to gather evidence of quality or success. Particular variables or constructs are then identified within those spheres of focus and serve as the (explicit or implicit) basis for specifying target outcomes, as well as the indicators used to operationalize those outcomes. Measures and instruments are used to collect data related to the variables or constructs of interest, target outcomes, and indicators as evidence of quality or success. Finally, evaluative conclusions are grounded in explicit or implicit criteria. Conclusions may or may not directly follow from the study’s questions, variables/constructs of interest, target outcomes and indicators, and measures and instruments.
Table 2. Evaluation Commonplaces in Which Evaluative Criteria Are Embedded.
For the current empirical analysis, I used a structured document review guide to analyze these five evaluation commonplaces in each report through an iterative process of skimming, reading, and interpreting (Bowen, 2009). After skimming each document to familiarize myself with the intervention and evaluation, I read each report to (a) search for explicitly stated criteria and (b) note the key evaluation questions, intended outcomes and indicators, variables or constructs that were the focus of evaluation instruments and measures, and evaluative conclusions. I then identified implicit evaluative criteria by interpreting how the information embedded in each commonplace could be applied to complete this sentence: A good or successful evaluand is one for which ________________. Numerous criteria were identified for each report in the sample, and each criterion was recorded in the form of a sentence.
For example, one report described an evaluation of an engineering-focused afterschool program for elementary school children. A key evaluation question focused on the extent to which participants demonstrated understanding of the engineering design process, and data collection and evaluative conclusions focused on participants’ understanding of this topic. I interpreted this to mean that a good or successful program was one in which: Children demonstrate understanding of the engineering design process. Another report described an evaluation of an exhibition, with a key evaluation question focused on the extent to which visiting Latino families saw the exhibit content as relevant and meaningful to their daily lives. This theme was explored through interviews with family groups and reflected in the evaluative conclusions. The following criterion was recorded: Latino families report the exhibit content is relevant and meaningful to their daily lives.
I coded each criterion using the nine literature-derived domains. Criteria that did not correspond to one of the domains were coded inductively. I reviewed the new codes to identify those that were conceptually similar or different and areas of overlap with the literature-derived domains. This resulted in two new domain codes (Experience and Replicability) and a revised definition of one domain (Resource Use), as described below. I employed principles of peer debriefing and an audit trail to enhance the credibility of the research at three stages in the analytic process (Lincoln & Guba, 1985). First, as I designed my analytic strategy and document review guide, I obtained feedback from a researcher who studies values in evaluation. The same researcher later reviewed a sample of the coded criteria. Finally, four other researchers reviewed my methods and findings once my initial analysis was complete. At each stage, I incorporated feedback before proceeding with the analysis.
Findings: Empirical Investigation of Criteria Domains
Table 3 summarizes the results of the evaluation report analysis and presents a sample of coded criteria. Numerical counts are reported, although the purpose of this analysis was to refine the emerging list of domains rather than investigate the prevalence of each domain. (Teasdale, 2020a, reports a follow-up study that investigated patterns in the criteria domains evident in a larger sample of ISE evaluation reports and discusses implications for ISE evaluation practice.)
Table 3. Criteria Domains Evident in Sample of Evaluation Reports.
All of the evaluative criteria identified in this analysis were embedded in the evaluation commonplaces; that is, no report explicitly stated the criteria that provided the basis for evaluative judgments. Analysis of the evaluation commonplaces revealed criteria drawn from the Effectiveness domain in each report. In addition, most reports included criteria drawn from the Design (n = 14) and Relevance (n = 13) domains. A few reports included criteria drawn from the Equity (n = 5), Consequence (n = 4), Sustainability (n = 4), or Unintended Effects (n = 2) domains. Two reports included criteria focused on whether resources were sufficient to support program activities. I coded these as Resource Use and revised the definition to include the sufficiency of resources. Finally, no reports in the sample included criteria drawn from the Alignment domain.
Nearly all of the reports in the sample included at least one criterion that did not fall within one of the nine literature-derived domains. Most of these (n = 16) focused on the manner in which the intervention was delivered. I coded these with a new domain code: Experience—activities are delivered in a way that is respectful, rewarding, and/or enjoyable. Two criteria fell within a second emergent domain that focused on whether the intervention could be of benefit in other contexts. I labeled these with a new domain code as well: Replicability—components, activities, or the underlying model or principles can be duplicated or adapted to another context.
Discussion: Empirical Investigation of Criteria Domains
The empirical analysis resulted in a refined list of 11 criteria domains (including a revised definition for one domain), which are presented in Table 4. The findings have several implications for developing a model of evaluative criteria. First, the empirical analysis focused on an area of evaluation practice not directly addressed in the synthesized literature, yet eight of the nine literature-derived domains were evident in the sample of evaluation reports. This suggests the identified domains may be broadly applicable beyond the areas of practice covered by that literature. Second, the empirical component of this study produced a refined list of domains, including a revised definition of one domain and identification of two new domains. This suggests that empirical examination of criteria domains can enrich and expand the conceptual discussion of evaluative criteria in the literature.
Table 4. Criteria Domains.
Third, the analysis surfaced Experience as a new criteria domain. I conceptualize the Experience domain to be applicable in two ways, both of which are consistent with the voluntary nature of ISE activities described above: (a) a positive, rewarding experience could be considered a prerequisite for engagement in ISE and/or a pathway through which to achieve desired effects and/or (b) a rewarding experience could be a desired outcome of ISE activities in and of itself, as participants may take away a range of social- and leisure-related benefits in addition to (or instead of) learning-related benefits. Both conceptualizations suggest overlap with the Effectiveness domain—a positive, rewarding experience could be a desired result of an intervention—or with the Design domain, as a rewarding experience could be an element of a well-designed intervention. However, I advance Experience as a distinct domain to highlight the possibility of (a) assessing the manner in which program activities are delivered, distinct from the content or design of those activities and (b) defining quality or success in terms of participants’ lived experience. Although the Experience domain emerged from evaluation of voluntary educational activities, it may be applicable to other intervention types and contexts. For example, it might be relevant for evaluators seeking to privilege the lived experiences of program participants, including evaluators in culturally responsive and democratic traditions, as this domain expands the focus of study beyond the substance of program activities to consider how those activities are experienced.
Finally, the empirical analysis surfaced the Replicability domain. This is consistent with a critique that the OECD/DAC criteria fail to consider the extent to which an intervention’s design, approach, or products can be applied in other contexts (Chianca, 2008) and, therefore, fills a gap in the initial list of criteria domains.
Part 2: Sources of Criteria
While domain describes the substance of evaluative criteria, source reflects the person, group, or document from which criteria are drawn. Source is an essential consideration because criteria drawn from different sources may reflect different values about what constitutes a high-quality or successful program. For example, an evaluator might include criteria from the Design domain in a particular evaluation. There may be varying perspectives on what a high-quality design means for a given program, depending on whether the definition is derived from program staff, the scholarly literature, or the evaluator’s prior experience. Explicating the source of criteria helps illuminate what quality or success looks like from varying value perspectives and enables evaluators to act with intentionality in choosing which values to privilege.
In a survey study of evaluation practice, Shadish and Epstein (1987) explored the sources from which evaluators drew criteria (operationalized as the “dependent variables used to judge program effectiveness,” p. 562). They found program goals were the most common source reported, followed by criteria that had been used in previous evaluations. Despite a lack of additional research, it appears that program goals remain a common source of criteria in practice (Davidson, 2005; Rossi et al., 2004; Schwandt, 2015). This is not surprising, given that goals are often formulated through a design process that develops an intervention to address specific needs (Henry & Julnes, 1998). However, as Scriven (1993) emphasizes, “program evaluation is not a determination of goal attainment” (p. 2). Evaluators may need to consider other sources of criteria because the goals of an intervention may not be sufficiently specific or measurable to guide an evaluation (Rossi et al., 2004; Weiss, 1973), there may be discrepancies between formal goals and the actual goals of the intervention as implemented (Deutscher, 1977; Weiss, 1973), and key goals may be assumed but unstated. An intervention may also have multiple or conflicting goals (Davidson, 2005; Mark et al., 2000; Shipman, 2012). In addition, an exclusive focus on program goals can result in evaluators overlooking unintended outcomes (positive or negative), as well as the extent to which intervention aims are aligned with the needs of intended beneficiaries (Davidson, 2005; Deutscher, 1977; Scriven, 1972).
Evaluators can draw criteria from a number of other sources. Evaluation commissioners are a common source of criteria (Rossi et al., 2004), and Schwandt’s (2015) discussion of criteria described above identifies additional sources: established requirements, needs assessment, and expert opinions, including the evaluator’s opinions. Other sources include prior evaluations; research or scholarly literature; legislation, regulations, or policies; and professional standards (Shadish & Epstein, 1987; Shipman, 2012). Criteria may also be drawn from program stakeholders (Shadish et al., 1991). Input can be obtained from staff and/or actual or intended beneficiaries through interviews or discussions (MacNeil & Mead, 2005; Moro et al., 2007; Stake et al., 1997), a “systematic canvass” of stakeholders (Weiss, 1973, p. 7), by surveying stakeholders and the general public (Henry, 2002; Henry & Julnes, 1998), or through a dialogic process focused on values and criteria (Greene et al., 2011; House & Howe, 1999). Evaluators might also use crowdsourcing techniques (Harman & Azzam, 2018).
Synthesis
In all, the literature offered nine possible sources of criteria. Consistent with my approach to the domains, I grouped these sources into three categories: sources grounded in the intervention itself, sources related to the evaluation, and sources that are external to the intervention and evaluation. Three sources were intervention-related: (a) Objectives—aims, goals, and/or intended outcomes of an intervention; (b) Staff or Leaders—individuals who design, direct, or implement the intervention; and (c) Beneficiaries (Intended or Actual)—individuals the intervention aims to serve or assist or those engaged as participants, clients, and so on. Next, there were three evaluation-related sources: (d) Commissioner—evaluation sponsor who requires, requests, and/or funds the study; (e) Previous Studies—prior assessments of the intervention or similar interventions; and (f) Evaluators or Evaluation Literature—individuals who conduct the assessment of the intervention or other interventions and/or research, scholarly, or practitioner publications about assessing interventions. Finally, there were three external sources: (g) Substantive Literature or Experts—research, scholarly, or practitioner publications that are relevant to the intervention type or topic or individuals with relevant, specialized knowledge or experience; (h) Requirements or Standards—legislation, policies, and procedures that govern an intervention and/or professional norms or best practices that are relevant to the intervention; and (i) General Public—individuals who are members of the community (neighborhood, city, state, country, etc.) where an intervention takes place but are not the intended or actual beneficiaries.
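For readers who want to work with this list programmatically, the nine literature-derived sources and their three categories could be represented as follows. This Python sketch is a hypothetical encoding, not something used in the study; the `Source` class, the `by_category` helper, and the short category labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One source of criteria; `category` is a short label assumed for this sketch."""
    name: str
    category: str  # "intervention", "evaluation", or "external"


# The nine literature-derived sources, in the order given in the text.
LITERATURE_SOURCES = (
    Source("Objectives", "intervention"),
    Source("Staff or Leaders", "intervention"),
    Source("Beneficiaries (Intended or Actual)", "intervention"),
    Source("Commissioner", "evaluation"),
    Source("Previous Studies", "evaluation"),
    Source("Evaluators or Evaluation Literature", "evaluation"),
    Source("Substantive Literature or Experts", "external"),
    Source("Requirements or Standards", "external"),
    Source("General Public", "external"),
)


def by_category(sources):
    """Group source names under their category label, preserving order."""
    grouped = {}
    for source in sources:
        grouped.setdefault(source.category, []).append(source.name)
    return grouped


grouped_sources = by_category(LITERATURE_SOURCES)
```

Grouping by category in this way gives three sources under each of the three headings, matching the synthesis above.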
Methods: Identifying Sources of Criteria
To refine the list of sources, I returned to the sample of ISE evaluation reports. The criteria identified above were embedded in the evaluation commonplaces rather than explicitly stated; therefore, I sought to identify the source of those commonplaces. Specifically, I investigated the sources of the evaluation questions and variables or constructs reflected in the target outcomes, indicators, instruments, and/or measures. I also sought to identify the sources of criteria for evaluative conclusions that did not directly follow from other commonplaces.
I began by reviewing each report and locating instances in which the sources of key evaluation questions or variables/constructs of interest were described or suggested. I coded those instances using the nine literature-derived sources. Through this process, one new source was identified—Partners (defined below)—and added to the list. A data display matrix was created to represent the sources that were described or suggested in each report.
I then developed a brief online survey to gather data from report authors about the sources of the key evaluation questions and variables or constructs of interest. The instrument began by obtaining informed consent and then introduced the topic of the survey, identified the specific evaluation report under consideration, and defined key terms. Next, respondents considered the following open-ended item: For this evaluation, what sources did you (or your team) draw on to identify the key evaluation questions? For example, who, if anyone, did you talk to? What, if anything, did you read? A similar item focused on sources for the variables or constructs of interest. Next, respondents considered a series of multiple-choice items drawn from the 10 sources described above. For example, respondents were asked: Did you review the following types of documents or literature to identify the key evaluation questions or variables/constructs of interest for this evaluation? Respondents indicated Yes, No, or Not sure for each item that followed: Previous evaluation reports for this project, Evaluation reports for other relevant projects, Research or scholarly literature about evaluation, and Research or scholarly literature about ISE or a related field. Finally, a set of open-ended items invited respondents to describe additional sources of criteria not included in the multiple-choice items.
I gathered feedback on the survey items and instructions from a researcher who studies values in evaluation and three ISE evaluators with varying educational and professional backgrounds, and I revised the instrument based on their feedback. After obtaining institutional review board approval, I identified the first author of each report in the sample or, when no author was named, the principal of the evaluation firm that produced the report. This resulted in a list of 20 evaluators associated with the 25 reports; three individuals were each associated with two reports in the sample, and one individual was associated with three reports. To minimize response burden for those four evaluators, I purposively selected one document for each author, seeking a balance of intervention types and formative/summative reports in the survey sample. Survey invitations were sent by email, with a reminder sent after 2 weeks. Eighteen authors responded (90% response rate), with 15 completing the survey in its entirety. Two responses were removed because respondents indicated they were unaware of the source of the key evaluation questions and/or variables/constructs of interest. This resulted in a final sample size of 13 (72% completion rate) and survey data that corresponded to 52% of the reports in the sample.
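The sample figures reported above can be checked with a short calculation. The following is a minimal sketch in Python, not part of the study's procedures. The variable names are illustrative, the count of single-report authors is derived rather than stated in the text, and the sketch assumes that the 72% completion rate is the final sample divided by the 18 respondents.

```python
# Figures as reported in the text; all names are illustrative assumptions.
reports_in_sample = 25
evaluators_invited = 20
authors_with_two_reports = 3
authors_with_three_reports = 1
responded = 18
completed = 15
removed_unaware_of_sources = 2

# Derived (not stated in the text): authors linked to exactly one report.
authors_with_one_report = (
    evaluators_invited - authors_with_two_reports - authors_with_three_reports
)
reports_accounted_for = (
    authors_with_one_report
    + 2 * authors_with_two_reports
    + 3 * authors_with_three_reports
)

final_sample = completed - removed_unaware_of_sources


def pct(part, whole):
    """Return part/whole as a percentage rounded to a whole number."""
    return round(100 * part / whole)


response_rate = pct(responded, evaluators_invited)
# Assumption: the completion rate uses the 18 respondents as its base.
completion_rate = pct(final_sample, responded)
report_coverage = pct(final_sample, reports_in_sample)

print(reports_accounted_for, final_sample)
print(response_rate, completion_rate, report_coverage)
```

Under these assumptions, the derived values agree with the figures given in the text: 25 reports, a final sample of 13, and rates of 90%, 72%, and 52%.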
Survey data were analyzed in three stages. First, I coded the responses to the open-ended items using the 10 sources of criteria outlined above. No additional sources were identified during this process. Then, I compared each respondent’s responses to the multiple-choice items with their open-ended responses. Finally, I added the results to the data display matrix that included the sources described or suggested in each report.
Findings: Empirical Investigation of Criteria Sources
Table 5 summarizes the results of the report analysis and survey. Numerical counts are reported, although, as with the investigation of criteria domains, the purpose of this analysis was to refine the emerging list of sources rather than investigate the prevalence of each source.
Table 5. Criteria Sources Evident in Sample of Evaluation Reports and Identified Through a Survey of Report Authors.
Nearly all of the evaluations included criteria drawn from Objectives. Two reports explicitly stated the aim of the evaluation was to assess the extent to which the project achieved its desired outcomes, and this source was identified by all survey respondents. All survey respondents identified Staff and Leaders in response to both open-ended and closed-ended items. Nearly all respondents selected Commissioners in the multiple-choice questions (n = 10), and several described it in responses to open-ended items (n = 5). Previous Studies was identified as a source by many respondents (n = 7, open-ended item; n = 10, closed-ended item) and described in one evaluation report. Respondents also frequently identified Substantive Literature or Experts as a source of criteria (n = 6, open-ended item; n = 13, closed-ended item). Evaluators or Evaluation Literature was selected by all survey respondents (n = 13) in closed-ended items but rarely described in open-ended responses (n = 4) or discussed in the evaluation reports (n = 1). Requirements or Standards was selected by nearly all survey respondents (n = 11) and described by two. Beneficiaries (Intended or Actual) was described as a source in one report and by one survey respondent. Finally, General Public was rarely evident (n = 1, open-ended item; n = 2, closed-ended item).
One report in the sample suggested that staff in partnering organizations were consulted when identifying key evaluation questions. I represented this source with a new code that was included in the survey instrument: Partners, defined as staff or leaders who direct or operate entities that contribute to, collaborate on, or otherwise provide support for an intervention. This source was selected by many survey respondents (n = 9) and described by two.
Discussion: Empirical Investigation of Criteria Sources
The empirical analysis resulted in a refined list of 10 sources, presented in Table 6. As noted above, the analysis focused on an area of practice not directly addressed in the synthesized literature, yet there was evidence that each of the literature-derived sources was associated with the sample of reports. Consistent with the discussion of criteria domains, this suggests the identified sources may be applicable to areas of practice beyond those reflected in the synthesized literature. In addition, the analysis identified a source of criteria that did not emerge from the literature synthesis, underscoring the importance of empirical investigation.
Table 6. Sources of Criteria.
It is important to note that the document analysis and survey methods used in this analysis captured a limited amount of detail and context. As a result, there are several possibilities to consider when the evidence suggests criteria were not drawn from a particular source. First, the source may have, in fact, been used but was not captured by the methods employed in this study. Second, the source may not have been used because it was not relevant or possible for the particular evaluation (e.g., there were no previous studies available or no partners to consult). Third, the source may not have been used, although it was relevant or possible. In addition, the methods used in this study may have produced “false positive” findings by incorrectly identifying sources that were not, in fact, used. In future empirical research, criteria sources may best be examined through dialogue with evaluators and/or observation of the evaluation process. This would allow researchers to capture the process of criteria selection as that process unfolds—avoiding reliance on evaluators’ memories (as with survey methods) or written summaries of the evaluation (such as the evaluation report)—and investigate the relevance of various sources of criteria.
Part 3: Integrated Model of Criteria Domains and Sources
Combining the 11 domains and 10 sources identified in this study yields an integrated model of criteria domains and sources (see Figure 1). In representing the model, the domains comprise the outer ring: those that address conceptualization and implementation are on the right side and those that address results (considered alone or with implementation) are on the left. Criteria domains are positioned as the outer layer of the diagram because they form the structure or scaffolding for evaluative judgments. Sources of criteria are positioned in the interior, divided into three categories: intervention-related, evaluation-related, and external. These are located in the core of the model to represent the often implicit and assumed nature of criteria, which frequently requires evaluators to “dig” to uncover their source(s).

Figure 1. Integrated model of criteria domains and sources.
The criteria domains and sources are integrated into a single model to highlight the interaction between domains and sources during the criteria specification process. If evaluators solely consider criteria domains, they might identify aspects or characteristics of an intervention that exemplify quality or success but fail to uncover whose values have shaped that vision of quality or success. If they solely consider sources of criteria, they might identify whose values should guide an evaluation but miss the specificity about the desired state those stakeholders value. Adopting an integrated approach to criteria specification enables closer, more careful consideration of two aspects of criteria (domains and sources) and their interaction.
As an example, we can consider an evaluator who consults multiple stakeholders for a summative evaluation of an after-school robotics program for high school youth. The evaluator might discover that leaders consider the program a success if it accomplishes its goal of sparking interest in engineering careers (Domain: Effectiveness; Sources: Leaders and Staff, Objectives) while ensuring that youth from communities that have been historically marginalized in STEM are well-served (Domain: Equity; Source: Leaders and Staff). The evaluation commissioner might define quality in terms of the program achieving its goal of boosting academic performance in science and math (Domain: Effectiveness; Sources: Commissioner, Objectives), as well as its cost-effectiveness (Domain: Resource Use; Source: Commissioner) and potential to expand to new communities (Domain: Replicability; Source: Commissioner). Participating youth might consider the program a success if it addresses their interests (Domain: Relevance; Source: Beneficiaries) and complements the school curriculum (Domain: Alignment; Source: Beneficiaries). In addition, the evaluator might review the literature and surface design principles associated with high-quality robotics activities (Domain: Design; Source: Substantive Literature or Experts).
In this example, the model of criteria domains and sources helps to illuminate the multiple domains and sources from which the evaluator might draw criteria. Even within the single domain of Effectiveness, the evaluator uncovered variation in values held by different sources. Using the model of criteria domains and sources to adopt an integrated perspective on criteria specification (as the evaluation is unfolding or retrospectively) can illuminate this evaluator’s options for specifying evaluative criteria and shaping the evaluation commonplaces and clarify the basis for evaluative judgments, including how quality or success was defined and whose values those definitions reflect.
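The robotics example can also be written as a small data structure in which each criterion carries one domain and one or more sources. The following Python sketch is illustrative only and is not a tool proposed in this article. The `Criterion` class and its field names are hypothetical, and the criterion descriptions paraphrase the example above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical record pairing a criterion with its domain and sources."""
    description: str
    domain: str
    sources: tuple


# Criteria paraphrased from the after-school robotics example.
criteria = [
    Criterion("Sparks interest in engineering careers",
              "Effectiveness", ("Leaders and Staff", "Objectives")),
    Criterion("Serves youth from historically marginalized communities well",
              "Equity", ("Leaders and Staff",)),
    Criterion("Boosts academic performance in science and math",
              "Effectiveness", ("Commissioner", "Objectives")),
    Criterion("Is cost-effective",
              "Resource Use", ("Commissioner",)),
    Criterion("Has potential to expand to new communities",
              "Replicability", ("Commissioner",)),
    Criterion("Addresses youth interests",
              "Relevance", ("Beneficiaries",)),
    Criterion("Complements the school curriculum",
              "Alignment", ("Beneficiaries",)),
    Criterion("Reflects design principles for high-quality robotics activities",
              "Design", ("Substantive Literature or Experts",)),
]

# Index the same criteria along both dimensions of the model.
by_domain = defaultdict(list)
domains_by_source = defaultdict(set)
for criterion in criteria:
    by_domain[criterion.domain].append(criterion)
    for source in criterion.sources:
        domains_by_source[source].add(criterion.domain)

# Within one domain, different sources can hold different criteria.
for criterion in by_domain["Effectiveness"]:
    print(criterion.description, "<-", ", ".join(criterion.sources))

# Looking across sources shows whose values cover which domains.
for source in sorted(domains_by_source):
    print(source, "->", sorted(domains_by_source[source]))
```

Grouping by domain surfaces the two distinct Effectiveness criteria held by different sources. Grouping by source shows which stakeholders' values are represented in each domain and which are absent.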
As an emerging model, the framework presented here is likely to reflect a partial perspective on the criteria to be found across the broad field of evaluation. Further research is needed to identify criteria domains and sources not captured in this literature review and empirical analysis and to explore the applicability of the framework to areas of practice beyond ISE. The model has, therefore, been designed to be dynamic to allow incorporation of additional domains and sources (and other aspects of evaluative criteria) that emerge from further investigation. This responds to Julnes’s (2012a) caution that in seeking to make valuing more systematic, the limitations of that systematization will be revealed. It is also important to note that the model focuses on the performance of discrete interventions, rather than addressing complex, large-scale social transformation. As Ofir (2018) argues, evaluation of social change may require different criteria domains. The current model describes criteria that are relevant for evaluating programs and other interventions.
A second limitation arises from the focus on describing criteria domains and sources rather than modeling how evaluators might prioritize and select criteria. Further research is needed to investigate these topics and could inform the development of an interactive tool for criteria selection. In addition, Ozeki and colleagues (2019) have outlined the need for evaluators to better understand and implement Scriven’s logic-based valuing process. The framework addresses one step of that process—criteria specification—but does not address the other steps, which involve specifying standards, comparing data to standards, and formulating a synthesis judgment.
Implications
The integrated model of criteria domains and sources responds to calls for an empirical, descriptive theory of valuing that addresses the values and processes evaluators use to reach evaluative conclusions. The framework is not intended to prescribe which criteria evaluators should select or the sources from which criteria should be drawn. Instead, I seek to lay theoretical groundwork by identifying, describing, and integrating two key aspects of criteria specification. This builds on prior work that calls for greater attention to values in evaluation and outlines how values “show up” in evaluation practice (e.g., Greene, 2012; Hall et al., 2012; House & Howe, 1999) by clarifying the evaluative criteria specified to represent those values. The resulting model can inform theory, research, and practice in several ways.
First, evaluation practitioners can leverage the framework as a thinking tool to guide selection of evaluative criteria. The model can be used to map the possibilities available to evaluators and illuminate their choices as they specify and apply criteria. For example, some evaluation approaches focus attention primarily on the desired outcomes for an intervention—that is, Effectiveness criteria. This focus is evident in objectives-oriented evaluation (Madaus, 2012; Tyler, 1949) and the expectations of funders who explicitly define evaluation in terms of measuring achievement of program objectives. For example, the federal agencies that funded the majority of the ISE projects in this study frame evaluation as a process of measuring the extent to which projects fulfill the aims for which they were funded and produce desired changes in individuals or communities (see Friedman, 2008; Institute for Museum and Library Services, n.d.; NSF, 2017). A similar emphasis on Effectiveness criteria is evident in theory-driven approaches that encourage evaluators to examine the rationale and processes for why and how an intervention is expected to achieve its desired outcomes, often including the use of logic modeling (Chen, 2012; Donaldson, 2007; Funnell & Rogers, 2011; Kellogg Foundation, 2004).
Evaluators can use the model to expand criteria specification beyond the Effectiveness domain to foster more nuanced understandings of the intervention and its results. For example, Picciotto (2013) describes the “outcome trilogy” in development evaluation that focuses on determining whether an intervention achieves relevant objectives, efficiently, and with good results (p. 162). Expressed in the domains described here, this reflects criteria drawn from the Relevance, Resource Use, and Effectiveness domains. As this example illustrates, evaluators can draw criteria from multiple domains to reveal the relationships among different aspects of the intervention and provide a more comprehensive assessment of quality or success.
The model can also be used to achieve greater inclusion in the valuing process. Certain evaluation approaches, such as democratic and collaborative approaches, draw attention to the values of stakeholders who are often overlooked or excluded in the evaluation process (Greene, 2006), and evaluation standards and ethical guidelines call on evaluators to balance varying stakeholder interests (American Evaluation Association, 2018; Yarbrough et al., 2011). Evaluators can use the framework to identify the stakeholder groups who are (and are not) represented in the criteria specification process and inform efforts toward inclusion.
In these ways, evaluators can use the framework to help align their selection of criteria with the conditions and needs of specific evaluations (Schwandt, 2015). Ultimately, this may support greater intentionality in the valuing process, which has been described as largely unreflective (Julnes, 2012b). Yet adopting more expansive and inclusive approaches to criteria specification will likely raise tensions for evaluators to negotiate. For example, different stakeholder groups may prioritize different criteria domains and have varying levels of power and influence to advance their values (Greene et al., 2011; Teasdale, 2020b). This raises important questions about the relative importance of those perspectives and value stances and how they align with evaluation purposes and the decision making an evaluation aims to inform.
Second, evaluators can use the framework to articulate and describe the criteria they choose and the source of those criteria, making their selection of criteria more explicit and democratic. This can support clearer, more defensible evaluative conclusions. It can also help evaluation practitioners communicate with stakeholders about the criteria and values that underpin evaluative inquiry and foster stakeholders’ understanding of the complexity and multiplicity inherent in defining success for a given intervention. Third, the model can be used to strengthen evaluator training. As the field adopts evaluator competencies and moves toward professionalization, the valuing process is a component of the foundational knowledge base required for all evaluators (Ozeki et al., 2019). This framework can be used to help evaluators develop their understanding of evaluative criteria, which underpin the valuing process.
Fourth, the framework can inform research on evaluation that examines the valuing process, seeks to illuminate the values that underpin an evaluation, and/or aims to identify the stakeholders who hold those values. Researchers can also leverage the model to support investigations of the perceived legitimacy of various criteria among stakeholders, the fit between certain criteria and evaluation contexts, the criteria specification process, and the consequences of various criteria and processes for evaluation use. In addition, the model provides a conceptual framework and language for theorists seeking to prescribe and/or critique criteria selection. This may help clarify criteria prescription and make prescriptions more actionable, enabling practitioners to more easily apply theoretical principles and researchers to examine the application. As such, the model could help strengthen the link between valuing theory and practice.
In sum, the integrated model of criteria domains and sources contributes to building an empirical, descriptive theory of valuing and provides an emerging framework to strengthen evaluation practice, enhance evaluator training, support empirical research, and contribute to ongoing theory development.
Acknowledgments
I thank Jennifer Greene, George Julnes, H. Chad Lane, Rebekah Willett, and Eboni Zamani-Gallaher for their thoughtful feedback on an earlier draft of this article and Thomas Schwandt for his insightful comments on a related conference paper. I would also like to thank the editor and three anonymous reviewers for their valuable comments that substantially improved this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
