Abstract
A budding area of research is devoted to studying evaluator curriculum, yet to date, it has focused exclusively on describing the content and emphasis of topics or competencies in university-based programs. This study aims to expand the foci of research efforts and investigates the extent to which evaluators agree on what competencies should guide the development and implementation of evaluator education. This study used the Delphi method with evaluators (n = 11) and included three rounds of online surveys and follow-up interviews between rounds. This article discusses the competencies on which evaluators were able to reach consensus. Where consensus was not found, possible reasons are offered. Where consensus was found, the necessity of each competency at both the master’s and doctoral levels is described. Findings are situated in ongoing debates about what is unique in what novice evaluators need to know and be able to do and about the purpose of evaluator education.
There is no answer to the question what constitutes an adequate evaluation training programme….
The topic of competency is an important factor in the discussion of evaluation as a profession. Historically, professions have progressed from an initial identification of the characteristics of a professional to what it means to be well qualified (Wilcox & King, 2014; Worthen, 1999). Discussion of quality ties in with professional considerations of how to define what it means to be competent and how to show that competence has been attained. To that end, evaluation scholars have long debated competencies. Discussions of competency issues can be mapped against a continuum of why competencies are important, from “credentialing” (i.e., that an evaluator knows the subject) to certification (i.e., that an evaluator knows the subject and has demonstrated the skills) to licensure (the acknowledgment of an agency that an evaluator is competent to practice the profession; Altschuld, 1999; Altschuld & Engle, 2015).
Competency includes having particular knowledge (knowing how to do something), skill (being able to do something), or attitude (thinking that one has the knowledge and skills to do something; Wilcox & King, 2014). Specifically in evaluation, competencies have been defined as “…the skills, knowledge, abilities, and attributes required to conduct an evaluation” (McGuire & Zorzi, 2005, p. 74). Researchers in evaluation have suggested there are many uses for competencies, such as for university education (e.g., designing of academic programs and courses), advocacy (e.g., ensuring the quality of evaluators and evaluations), and self-reflection (e.g., the professional development of the practicing evaluator; McGuire & Zorzi, 2005; Perrin, 2005; Podems, 2014). The topic of competencies also influences academic theoretical research and ongoing professional development (Altschuld, 2005; Altschuld & Engle, 2015). Despite many years of debate about professional competencies in evaluation and the role they should play, McDavid and Huse (2015) assert, and the study authors concur, that “agreement on the necessary competencies for a field is one of the stumbling blocks for establishing credentialing, certification, or accreditation systems” (p. 54).
Some progress has been made with respect to evaluator competencies. For example, the American Evaluation Association (AEA), the world’s largest evaluation association, recently endorsed a set of competencies that includes five domains comprised of 49 competencies: (i) professional practice domain, with nine competencies; (ii) methodology domain, with 14 competencies; (iii) context domain, with eight competencies; (iv) planning and management domain, with 10 competencies; and (v) interpersonal domain, with eight competencies (AEA, 2018). AEA’s evaluator competencies built on the work of others. This includes the competency work by the Canadian Evaluation Society (CES; Barrington et al., 2015; CES, 2018), the Capabilities Framework and associated voluntary evaluator peer review system by the European Evaluation Society (EES; EES, n.d.; Piccotto, 2011), and the Competencies for Development Evaluation Evaluators, Managers, and Commissioners by the International Development Evaluation Association (2012), to name a few. The work of several researchers was also instrumental in informing AEA’s efforts (see King & Ayoo, 2020, for a review). While the creation of AEA’s competencies is an important milestone in evaluation, much work remains to be done.
Among evaluation practitioners, conversations on growing the profession, capacity building, and the competencies one ought to bring to practice are happening frequently (Galport & Azzam, 2016; Leviton, 2014; Naidoo, 2013). These conversations surface several important considerations concerning evaluation practice. These include ideas about when a novice is safe to practice, considerations of what benchmarks are used and the extent to which they are context-dependent, and the role of institutions of higher education in training future evaluators. This study is located in the framework of what might make for a more competent novice evaluator. The overarching research question explored in this study was: What AEA evaluator competencies ought to guide the development and implementation of evaluator education curricula?
Review of Literature
Since the birth of modern evaluation in the United States in the 1960s, the evaluation field has been debating the question of how to teach people to become evaluators. In 1980, in a New Directions for Evaluation issue devoted to training program evaluators edited by the late Lee B. Sechrest, Elizabeth Brown wrote, “There is no answer to the question what constitutes an adequate evaluation training programme….” (p. 86). Almost 40 years later, this question is still being debated, and surprisingly, few peer-reviewed articles have been published (Altschuld & Engle, 1994; Gross Davis, 1986; King & Ayoo, 2020).
There is, however, reason to be hopeful. The creation of the International Society for Evaluation Education marks a concerted global effort to build the knowledge base about evaluator and evaluation education. Progress on the professionalization front is also occurring. Examples include the EvalPartners Professionalization Taskforce and its work, the development of evaluator competencies, credentialing in Canada, and other quality control mechanisms such as voluntary evaluation peer review. A recent special issue of Evaluation and Program Planning is devoted specifically to evaluator education. One of these articles, written by King and Ayoo (2020), lays out what is known about evaluator education, focusing on several interrelated components. One component they call attention to is evaluator education curricula.
What is currently known about evaluator education curricula comes from published studies that used surveys of students, faculty, AEA members, and job seekers, or analysis of course titles, course catalogs and bulletins, or course syllabi. Collectively, these studies focus exclusively on describing the content and emphasis of topics or competencies in university-based programs (Christie et al., 2014; Davies & McKay, 2014; J. D. Dewey et al., 2008; Fierro & Christie, 2011; Galport & Azzam, 2016; LaVelle, 2020; LaVelle et al., 2020).
While these contributions are important, evaluation researchers have recognized that curriculum is an understudied area and offered several examples of empirical questions that need to be answered to build a program of scholarship on evaluator education curricula. For example, LaVelle and Donaldson (2015) note that the field knows very little about the percentage of evaluator competencies covered in programs offering at least two courses. Others (Davies & McKay, 2014; LaVelle & Donaldson, 2015) have highlighted the prevalence of technical knowledge in evaluation curriculum, but little is known about how this technical knowledge is balanced with other competencies and how this is related to differences in master’s and doctoral curriculum. The field also knows very little about the frameworks guiding the development and implementation of evaluator education curricula (Darabi, 2002; Lee et al., 2007; Preskill, 1992). Moreover, studies are mostly still focused on what we are doing.
No studies have focused on what we should be doing. As noted in a recent article published by Gullickson and colleagues, there is a need to attend to both what we are doing and “to what we should be doing to educate evaluators” (Gullickson et al., 2019, p. 20, emphasis in original). This study is focused on the should side. More specifically, the overarching research question we explore is: What AEA evaluator competencies ought to guide the development and implementation of evaluator education curricula? This study is a necessary first step to help the field engage in critical debate and reflection on what should be included in the evaluator education curriculum at both the master’s and doctoral levels. Moreover, it provides a novel area of exploration in ongoing efforts to build a program of scholarship on evaluator education curricula.
Conceptual Framework
This study is located in calls for more empirical scholarship on evaluator education and focuses specifically on the issue of the curriculum. Several important concepts are central to our work. The first is the idea of curriculum.
“Prescriptive [curriculum] definitions provide us with what ‘ought’ to happen, and they more often than not take the form of a plan, an intended program, or some kind of expert opinion about what needs to take place in the course of study.” (Ellis, 2004, p. 4)
Drawing from empirical evidence and writings from educational experts, a coherent curriculum design framework has been developed (National Academy of Sciences, 1999). This process-focused framework has been adapted to the evaluation education context (Figure 1). One of the first tasks is to establish goals and standards, with several important elements feeding into this task. One element is research, and it is here that this study and other studies on or related to curriculum will inform these conversations. This study contributes to this conversation by focusing specifically on evaluator competencies.

Figure 1. Coherent curriculum design framework.
Method
Research Design
Recall that the overarching research question investigated in this study was: What AEA evaluator competencies ought to guide the development and implementation of evaluator education curricula? This study used the Delphi method to obtain consensus from a panel of evaluators on what competencies should guide the development and implementation of evaluator education curriculum (Garavalia & Gredler, 2004; Okoli & Pawlowski, 2004). Three rounds of online surveys were used, with follow-up interviews after Rounds 1 and 2 to probe reasons for participants’ ratings. The first round of our Delphi study began with a predetermined set of items, specifically the competencies endorsed by the AEA as of August 2017.
Sample
There is no master population list of evaluators from which to recruit in the United States. However, AEA is the world’s largest evaluation association and it provides a publicly available list of officers serving as Topical Interest Group (TIG) leaders. In general, TIG leaders are elected and have the responsibility of coordinating and overseeing the annual proposal review process, putting together a slate of TIG-sponsored sessions for the annual conference, and helping to coordinate and implement AEA-related efforts, among other duties (https://www.eval.org). TIG leaders are often individuals who have many years of experience in evaluation with subject matter expertise that qualifies them to serve as representatives of their peers. For these reasons, we invited TIG leaders to participate in our Delphi study. It should be noted, however, that our invitation excluded graduate students and recent graduates due to their novice professional status. We also excluded individuals based outside of the United States because we wanted to maintain focus on evaluation competencies applicable to the American context.¹ One leader from each TIG (n = 50) was invited to participate in the study. In total, 11 evaluators agreed to participate as raters in the first round. By the second round, this number fell to 10 evaluators, and by the third round, this number fell to nine evaluators.
Most of these raters have doctoral degrees (88.9%). They have been in evaluation, on average, about 18 years (SD = 9) and members of AEA for 13.5 of those years (SD = 6.7). More than half are involved in evaluator education, either as faculty members (50%) or trainers (12.5%), while 37.5% are evaluation practitioners. Over half serve as supervisors or directors in the evaluations they conduct (55.6%), with the rest serving as a specialist, consultant, manager, or coordinator (44.4%). All have conducted a significant number of evaluations (median = 20).
These evaluators’ disciplinary backgrounds and the contexts in which they practice vary. In terms of disciplinary background, four reported education (40%), two reported evaluation (20%), one reported educational psychology (10%), one reported health/public health (10%), one reported management (10%), and one reported sociology (10%). As for the areas in which their evaluation work takes place, participants reported working across several contexts simultaneously: eight reported higher education (80%), six reported education (PK–12; 60%), four reported healthcare/public health (40%), four reported mental health (40%), three reported substance abuse prevention and treatment (30%), one reported community development (10%), one reported employment and welfare (10%), and one reported other and did not provide a description (10%).
Participants were asked to describe their evaluation practice using a scale with evidence of content validity (Christie, 2003; Christie & Masyn, 2008). Results are included in Table 1. Most reported using a mixed method approach. In terms of values, most also appeared to endorse a view that evaluation is interpreted and judged by primary users and is partly subjective and includes values. In terms of use, however, there was no clear pattern; the evaluators had varied views on use.
Table 1. Description of Participants’ Evaluation Practice.
Note. The following 6-point scale was used: 1 = very dissimilar to how I conduct evaluation, 2 = somewhat dissimilar to how I conduct evaluation, 3 = slightly dissimilar to how I conduct evaluation, 4 = slightly similar to how I conduct evaluation, 5 = somewhat similar to how I conduct evaluation, and 6 = very similar to how I conduct evaluation.
Procedures
An overview of study procedures is included in Figure 2. Prior to beginning the Delphi process, the online survey was piloted via Qualtrics with seven AEA members representative of our intended sample. These raters were asked to complete the online survey as the sample participants would in the Delphi study and to rate the survey on how clear its instructions were and how easy it was to complete.

Figure 2. Visual of Delphi procedures. Note. Both the survey pilot and the three rounds of Delphi procedures appear in the figure.
In Round 1, participants were asked to complete a series of demographic questions and the Round 1 survey. This survey required participants to assess each of the 58 evaluation competencies on two dimensions: (1) necessity to be included in master’s curriculum and (2) necessity to be included in doctoral curriculum. Thus, there were 116 survey items, each rated on a 6-point Likert-type response scale (1 = least necessary, 2 = minimally necessary, 3 = moderately necessary, 4 = necessary, 5 = very necessary, and 6 = highly necessary). Upon completion of the Round 1 survey, the research team evaluated the consensus on each item and produced a list of items to be rerated in the next round.
Next, each participant was interviewed and offered a rationale for their ratings over the telephone. These interviews focused exclusively on items that would be rerated in the next round. Due to the number of items and ratings, it was not feasible to ask participants to provide a rationale for all of their ratings. Using information from the ratings and demographic information, the team identified select items for further follow-up with Delphi participants. The aim was to gather rationales for ratings from two participants for each item that would be rerated in the next round at both the master’s and doctoral levels. Interviews were not recorded. One research team member conducted the interview while a second team member took notes.
The process for Rounds 2 and 3 was similar to Round 1 in that participants were asked to rate evaluation competencies on the same two dimensions, but only for competencies where consensus had not yet been reached. In Round 2, 80 items were rerated, 48 of which were for master’s curricula and 32 for doctoral curricula. In Round 3, 50 items were rerated (36 items for master’s curricula and 14 items for doctoral curricula). For Rounds 2 and 3, each participant received a summary of the results from the prior round to potentially inform their responses in the current round. These summaries included the median, interquartile range (IQR), and response distribution of each item for which consensus had not been reached, the ratings they provided in the prior round, and a summary of comments from the phone interviews. No follow-up qualitative interviews were conducted after the Round 3 survey.
Consensus Decision Process
A decision process developed by the research team was used to systematically decide whether a rating distribution indicated that there was consensus among the raters for each item in each round (see Figure 3). A range of standardized and heuristic measures was used in this decision process. The standardized measures were the IQR,² the within-group correlation (r_wg) index,³ and the average deviation index.⁴
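As a rough illustration of the three standardized measures named above, the following Python sketch computes each one for a single item's ratings. It is not the research team's procedure: the function names are hypothetical, and it assumes an inclusive quartile method for the IQR, the common single-item r_wg formulation with a uniform null distribution over the six response options, and an average deviation taken around the median. The cutoffs applied to each index are not stated in this section, so none are encoded here.

```python
from statistics import mean, median, quantiles, variance


def iqr(ratings):
    """Interquartile range (Q3 - Q1); the inclusive quartile method is an assumption."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return q3 - q1


def rwg(ratings, n_options=6):
    """Single-item r_wg: 1 - observed variance / variance expected by chance.

    Assumes a uniform null distribution over n_options response categories,
    whose variance is (n_options**2 - 1) / 12.
    """
    expected_var = (n_options ** 2 - 1) / 12
    return 1 - variance(ratings) / expected_var


def average_deviation(ratings, center=median):
    """Mean absolute deviation of ratings around a center (median is an assumption)."""
    c = center(ratings)
    return mean(abs(r - c) for r in ratings)


# Hypothetical ratings from 11 raters on the 6-point necessity scale.
example = [4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6]
print(iqr(example), round(rwg(example), 3), round(average_deviation(example), 3))
```

Higher r_wg values and lower IQR and average deviation values both point toward tighter agreement, which is why the three can be read side by side for the same item.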
Two heuristic measures were used in evaluating consensus. The first was a decision rule based on the spread in the rating distribution, similar to the rule suggested by Ulschak (1983): consensus on an item was indicated by at least 80% of the ratings falling within two adjacent response categories. The second was a decision rule stating that the existence of outliers in the rating distribution indicated a lack of consensus. The research team defined an outlier rating as a rating in a response category whose adjacent response categories have no ratings (e.g., if a rater chose 2 while all other raters chose 4 or 5, the 2 was considered an outlier rating because no rater chose 1 or 3).
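The two heuristic rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the function names are hypothetical, and the sketch adds one assumption the text does not spell out, namely that an isolated response category counts as an outlier only when it is not the most frequently chosen category (otherwise a unanimous rating would flag itself). How the two rules are combined with the standardized measures is specified in Figure 3 and is not reproduced here.

```python
from collections import Counter


def meets_adjacent_rule(ratings, threshold=0.80, n_options=6):
    """True if at least `threshold` of ratings fall within two adjacent categories."""
    counts = Counter(ratings)
    total = len(ratings)
    return any(
        (counts[c] + counts[c + 1]) / total >= threshold
        for c in range(1, n_options)  # adjacent pairs (1,2), (2,3), ..., (5,6)
    )


def outlier_ratings(ratings):
    """Return response categories that hold ratings but whose neighbors hold none.

    Assumption (not in the source): the most frequently chosen category is
    never treated as an outlier.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    return sorted(
        c for c, k in counts.items()
        if counts[c - 1] == 0 and counts[c + 1] == 0 and k < top
    )


# The example from the text: one rater chose 2, all others chose 4 or 5.
example = [2, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5]
print(meets_adjacent_rule(example))  # 10 of 11 ratings fall in categories 4 and 5
print(outlier_ratings(example))      # the lone 2 is isolated from its neighbors
```

In this example the two rules disagree: the spread rule is satisfied, yet the isolated rating of 2 signals a lack of consensus, which shows why a decision process that weighs several indicators together is needed.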

Figure 3. Overview of indicators and methods used to make consensus decisions.
Qualitative Analysis
Interview notes were entered into a data matrix (Miles et al., 2020), containing each competency that was rated in a particular round and rationales from interviews. Rationales were separated by master’s and doctoral levels. Once this matrix was populated, the research team conducted a content analysis to examine the presence of concepts and identify any patterns and themes (Creswell, 2008, 2013). The first round of coding was inductive to allow codes to emerge from the data and used a combination of holistic coding to get a general sense of the overall content of participants’ rationales for ratings and descriptive coding to assign labels to initial codes. The second round of coding was also inductive and used pattern coding to identify themes visible across codes. Throughout both rounds of coding, differences of interpretation were settled through discussion among the research team until consensus was reached. Importantly, the researchers did not force meaning on the data. Rather, researchers only identified codes, themes, and patterns that were visible across multiple interviews and that were related to our substantive interest.
Consistent with guidance on drawing and verifying conclusions from qualitative data (Miles et al., 2020), the research team’s tactics for generating meaning included pattern recognition, looking for evidence of plausibility and lack of plausibility in codes and themes, and counting of codes and themes. Tactics for confirming findings included checking for representativeness of codes and themes and looking for negative evidence of our themes and patterns. To ensure the quality of conclusions, the research team used several categories of tactics. These include confirmability (fully describing study methods, examining plausible rival conclusions), dependability (clear research questions and methods that are congruent with them, collecting data over the full range of study participants, data quality checks, and peer review), credibility (areas of certainty and uncertainty are identified, negative evidence sought and rival explanations considered, use of measures tied to relevant constructs), and transferability (sampling procedures are fully described, characteristics of the sample are presented, inclusion of relevant quotes to illuminate key findings, and findings are able to be situated in the larger set of scholarship from which they draw).
Results
This section presents where consensus was found and why. First, competencies on which participants reached consensus are discussed, along with insight into their ratings from the qualitative interviews. This information highlights which overall necessity ratings are consistent enough to be interpreted while elucidating why the participants were not able to reach consensus on certain competencies. Next, the competencies rated as most necessary, based on how high median ratings were and how quickly participants came to consensus, are presented. This information highlights what the participants believe should be most prioritized in master’s and doctoral programs.
Where Was Consensus Found and Not Found? Why?
At the end of Round 3, there were 32 items for which consensus was not reached. Of the initial 116 items measuring the necessity of each evaluator competency’s inclusion within master’s and doctoral evaluation programs, 84 items (72.4%) had rating distributions indicative of consensus among the raters by the final Delphi round. Of the 32 items for which consensus had not yet been reached, 25 were for master’s curricula and seven were for doctoral curricula.
In what follows, we describe where consensus was and was not found for each competency. We start with competency domains where consensus was most easily found and then work toward domains where it was more difficult to find consensus.
Methodology domain
As shown in Table 2, Delphi participants readily came to consensus on which competencies in this domain were necessary in master’s and doctoral level curricula. When comparing results for master’s versus doctoral programs in this domain, there were very few instances in which participants did not agree. Where they did not agree, it was almost always for master’s curricula.
Table 2. Results for the Competencies Appearing Under the Methodology Domain.
Note. Response options: 1 = least necessary, 2 = minimally necessary, 3 = moderately necessary, 4 = necessary, 5 = very necessary, and 6 = highly necessary. Shading indicates the priority of the competency for that type of program according to ratings: black shading = highest priority, dark gray shading = second highest priority, light gray shading = third highest priority, and no shading = unclear priority.
Context domain
Table 3 shows that Delphi participants were easily able to agree on which competencies in this domain were necessary in doctoral curricula. However, there was less agreement at the master’s level. Recall that participants were asked to provide a rationale for a sample of their ratings between rounds. Based on these interviews, no discernable pattern was detected for why participants were unable to come to consensus at the master’s level for Competencies 3.5, 3.6, and 3.9.
For Competencies 3.3 and 3.8, differences in ratings appeared to stem from ideas about what practice looks like for recent master’s program graduates. For 3.3, one Delphi participant said, “my assumption is that master’s level graduates are likely [to] be supervised by more senior staff,” implying that the supervisors were the ones who needed to have this competency. Other participants argued that this competency was less necessary than others because the current master’s training landscape does not support it. As one participant articulated,
“I would like to say this is very necessary because I believe in the concepts of systems thinking and complexity. At the same time, we graduate lots of students who never dig into and really get these concepts, and they are successful in their careers.”
For Competency 3.8, Delphi participants disagreed about the extent to which it was necessary for master’s students to develop and facilitate a shared understanding of the program. For example, the researchers heard the same refrain about “master’s level graduates [being more] likely supervised by senior staff,” and thus, senior personnel needed to have this competency. At the same time, other participants saw this competency as important to include, noting that “You cannot do good evaluation work without understanding the program. It is not possible to do it completely. There are limitations to understanding, but it is critical to try and understand as much as you can.”
Table 3. Results for the Competencies Appearing Under the Context Domain.
Note. Response options: 1 = least necessary, 2 = minimally necessary, 3 = moderately necessary, 4 = necessary, 5 = very necessary, and 6 = highly necessary. Shading indicates the priority of the competency for that type of program according to ratings: black shading = highest priority, dark gray shading = second highest priority, light gray shading = third highest priority, and no shading = unclear priority.
Interpersonal domain
As shown in Table 4, Delphi participants easily came to consensus on which competencies (save one) in this domain were necessary to cover in doctoral curricula. Here too, the pattern of where consensus was found suggests that it was harder for participants to agree on master’s level curricula. Based on participant interviews, no discernable pattern was detected for why participants were not able to come to consensus at the master’s level for Competency 5.4. When asked why they rated this competency the way they did, participants expressed ideas such as the following:
“Within our context, our MA-level graduates are the ones out and interacting with clients on the day-to-day basis. So, being able to listen for understanding and engaging perspectives is critical for ensuring the evaluation is accurately done.”
“Going back to the example of the low-income apartment complex. I would not go in there by myself and try to facilitate a discussion. I would bring somebody from the community who was knowledgeable and with that group could convey messages to the group.”
“That’s a competency that should be important to all doctoral programs regardless of content or discipline. In evaluation, it’s especially important in the sense that the profession is moving toward an increased focus on cultural competence.”
Collectively, these suggest that evaluators believe this competency is important at the master’s and doctoral levels. However, across three rounds of surveys, their responses did not reflect this position.
For Competency 5.2, the lack of consensus at the master’s level seems to stem from the tension between whether evaluation is a “team sport” and whether this means that everyone needs to have the same set of competencies. One Delphi participant, who was adamant that this competency ought to be included at the master’s level, argued that this competency
“…goes back to the professionalism of the field. Someone getting a master’s should be able to lead a stakeholder discussion, and not get angry when someone doesn’t agree with your ideas. People can get paranoid over evaluation discussions. There are a lot of opportunities for people to become reactive, defensive, paranoid, accuse you of a hidden agenda, etcetera. I think if you are going into this field, you have to understand how you can overcome this process.…It doesn’t come naturally to everyone, and if it doesn’t come naturally to you, then you just have to work on it.”
Yet, another Delphi participant who did not rate it as necessary at the master’s level noted that they were “thinking about the case of someone wanting a master’s because they are interested in statistics and design, and intending to be on an evaluation team where they don’t have to interact with people.”
For Competency 5.5, disagreement centered on what should be privileged during a 2-year master’s program and what could take a back seat. As one Delphi participant articulated,
“Is it necessary compared to other competencies when coming out of a program? I don’t think I see it as high up. Some of these soft skills, such as communication are addressed less than the hard skills. I believe they should be included more, but it’s not worth it if other things have to be bumped out.”
Table 4. Results for the Competencies Appearing Under the Interpersonal Domain.
Note. Response options: 1 = least necessary, 2 = minimally necessary, 3 = moderately necessary, 4 = necessary, 5 = very necessary, and 6 = highly necessary. Shading indicates the priority of the competency for that type of program according to ratings: black shading = highest priority, dark gray shading = second highest priority, light gray shading = third highest priority, and no shading = unclear priority.
Professional practice domain
As seen in Table 5, this was the first domain in which Delphi participants were not easily able to come to consensus on which competencies were necessary to cover in both master’s and doctoral curricula. For all but one competency, it was very difficult to discern a pattern for why participants did not reach consensus.
Examining heat map patterns for Competencies 1.6 and 1.7 together provides an interesting perspective because they are interrelated. Competency 1.6 is centered on the concept of articulating and recognizing areas for growth, while Competency 1.7 is about engaging in ongoing professional development, presumably based on what one notices as part of engaging with Competency 1.6. Our qualitative interviews provided data on why coming to consensus on Competency 1.7 at the doctoral level might have been difficult. Some Delphi participants expressed sentiments such as “on-going professional development is a critical part of the profession and it seems to make sense to emphasize that.” Others agreed that Competency 1.7 was important at the doctoral level but noted,
“This rating comes from my sense of frustration to get the doctoral students in our program to do this. Can we convince them that ongoing learning is important? We talk about it all of the time. I am not sure we always convince them.…The field of evaluation is constantly evolving…You can’t cover everything, even in a PhD program. There is a personal responsibility to have a learning and growth plan.”
Table 5. Results for the Competencies Appearing Under the Professional Practice Domain.
Note. Response options: 1 = least necessary, 2 = minimally necessary, 3 = moderately necessary, 4 = necessary, 5 = very necessary, and 6 = highly necessary. Shading indicates the priority of the competency for that type of program according to ratings: black shading = highest priority, dark gray shading = second highest priority, light gray shading = third highest priority, and no shading = unclear priority.
Planning and management domain
As shown in Table 6, it was the most difficult for Delphi participants to come to consensus on competencies appearing under this domain. Interviews revealed no pattern for understanding why participants could not agree.
Table 6. Results for the Competencies Appearing Under the Planning and Management Domain.
Note. Response options: 1 = least necessary, 2 = minimally necessary, 3 = moderately necessary, 4 = necessary, 5 = very necessary, and 6 = highly necessary. Shading indicates the priority of the competency for that type of program according to ratings: black shading = highest priority, dark gray shading = second highest priority, light gray shading = third highest priority, and no shading = unclear priority.
Where Consensus Was Found and What Do Median Ratings Suggest About Which Competencies Ought to Be Included in Master’s and Doctoral Programs?
In what follows, the researchers use median ratings to present findings on how necessary raters felt each competency was once consensus had been reached. It is notable that no median necessity rating was below 4. This implies that, overall, the raters felt that all of the competencies on which they reached consensus were necessary for master’s and doctoral programs, but to varying degrees.
Highest priority competencies
Tables 2–6 display the highest priority competencies for master’s programs as black shaded cells under the “Necessity for Master’s Programs” heading. The research team deemed a competency a “high priority” if consensus was reached in the first round and the median rating was a 6 (highly necessary). Examining these competencies reveals a focus on methodology, with five of the nine competencies (56%) being from that domain. Of the remaining four, two were from the professional practice domain (22%), one was from the context domain (11%), and one was from the interpersonal domain (11%). Clear themes among these competencies are a focus on ethical behavior and expertise in evaluation methodology. Absent from these nine are competencies in the planning and management domain.
Tables 2–6 display the highest priority competencies for doctoral programs as black shaded cells under the “Necessity for Doctoral Programs” heading using the same rationale for what counts as “high priority.” There were many more competencies on which raters reached consensus in the first round for doctoral programs (n = 26) than for master’s programs (n = 10). Moreover, a similar focus on the methodology domain is evident, with 13 of the 20 highest priority competencies (65%) being from that domain. Of the remaining seven, three were from the professional practice domain (15%), two were from the context domain (10%), one was from the planning and management domain (5%), and one was from the interpersonal domain (5%). As with master’s programs, clear themes among these competencies include a focus on ethical behavior and expertise in evaluation methodology.
Second highest priority competencies
The research team deemed a competency a “secondary priority” if it met one of two conditions: (1) consensus was reached in Round 2 or 3 with a median rating of 6 (highly necessary) or (2) consensus was reached in any round with a median rating of 5 (very necessary). Tables 2–6 display the second highest priority evaluator competencies for master’s programs as cells shaded dark gray under the “Necessity for Master’s Programs” heading. Here again, there is a similar focus on the methodology domain, but to a lesser degree, with seven of the 19 competencies (37%) being from that domain. Of the remaining 12, three were from the professional practice domain (16%), four were from the context domain (21%), one was from the planning and management domain (5%), and four were from the interpersonal domain (21%). Potential themes among these competencies are a focus on utilization, interpretation, and communication of evaluations.
Tables 2–6 display the second highest priority evaluator competencies for doctoral programs as cells shaded dark gray under the “Necessity for Doctoral Programs” heading using the same rationale for what counts as “secondary priority.” Similar to the “high priority” competencies, there were more “secondary priority” competencies on which raters reached consensus for doctoral programs (n = 29) than for master’s programs (n = 19). However, the “secondary priority” competencies for doctoral programs include almost none from the methodology domain (mostly because only three were left that could have been included in this category), with only two of the 29 competencies being from that domain (7%). The majority of these competencies were from the interpersonal (n = 9, 31%) and context (n = 8, 28%) domains. Of the remaining 10, six were from the planning and management domain (21%) and four were from the professional practice domain (14%).
Third highest priority competencies
The research team deemed a competency a “tertiary priority” if consensus was reached in any round with a median rating of 4 (necessary). Tables 2–6 display the third highest priority evaluator competencies for master’s programs as cells shaded light gray with white font under the “Necessity for Master’s Programs” heading. There were only five competencies in this category: one from the professional practice domain (20%), three from the planning and management domain (60%), and one from the interpersonal domain (20%). All seemed related to managing evaluations.
Finally, Tables 2–6 display the third highest priority evaluator competencies for doctoral programs as cells shaded light gray with white font under the “Necessity for Doctoral Programs” heading using the same rationale for what counts as “tertiary priority.” This list consists of the two remaining competencies on which consensus was reached for their necessity in doctoral programs. One was from the professional practice domain and the other was from the planning and management domain.
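The three priority tiers defined in this section amount to a short decision rule based on the round in which consensus was reached and the median necessity rating. The following is a minimal illustrative sketch of that rule, not the research team's analysis code. The function name `priority_tier`, its parameter names, and the choice to return "unclear" for any case the text does not cover (no consensus, a non-integer median, or a median below 4) are assumptions made for illustration.

```python
from typing import Optional


def priority_tier(consensus_round: Optional[int], median_rating: Optional[float]) -> str:
    """Map a competency's consensus round and median rating to a priority tier.

    consensus_round: 1, 2, or 3 if consensus was reached, else None (hypothetical encoding).
    median_rating: median necessity rating on the 1-6 scale, or None.
    """
    # No consensus means no shading in the tables ("unclear priority").
    if consensus_round is None or median_rating is None:
        return "unclear"
    if median_rating == 6:
        # Median of 6: highest priority only when consensus came in Round 1.
        return "highest" if consensus_round == 1 else "secondary"
    if median_rating == 5:
        # Median of 5 in any round: secondary priority.
        return "secondary"
    if median_rating == 4:
        # Median of 4 in any round: tertiary priority.
        return "tertiary"
    # Assumption: the article does not define tiers for other medians.
    return "unclear"


# Example: consensus in Round 2 with a median of 6 falls in the secondary tier.
example = priority_tier(2, 6)
```

Because the article reports no consensus medians below 4, the final fallback branch would not be reached for the data described here.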
Discussion
This study sought to explore to what extent consensus exists among a panel of evaluators about which competencies ought to guide the development and implementation of evaluator education curricula. Study findings raise several important ideas about evaluation education (Table 7). These include what is unique about what novice evaluators need to know, including the role of technical methods in evaluation education and the relationship between research and evaluation. Study findings also surface ideas about the purpose of evaluator education, including what constitutes an evaluation program and what differentiates those seeking master’s and doctoral education.
Table 7. Summary Table of Highest Priority Competencies From the Delphi Study.
Note. There were no competencies that were only highest priority for master’s programs.
The only area in which Delphi participants were able both to come to agreement quickly and to rate competencies as very high priority was the methodology domain. To be clear, results do not suggest, nor is it being claimed, that only technical evaluation methods matter in evaluation education. Rather, what this finding suggests is that, according to the evaluators in this study, the technical side of evaluation work ought to be included in graduate evaluation education. Even the fiercest advocates of evaluation education going beyond technical methods acknowledge that these skills play a role in evaluation and evaluation education. For example, House (2015), in his list of 25 ideas about evaluation, sought to appropriately situate the role of technical methods by noting that “evaluators evaluate with their whole person, not only with methods” (p. 127) and that “although traditional methods correct for some biases…not all biases are connected to methods” (p. 128). Consistent with this view, Scriven (2016) noted that “the social sciences have contributed a whole toolbox of useful devices for dealing with the nonevaluative tasks that are essential in many evaluations” (p. 32). Schwandt (2015) also noted that methods are necessary but insufficient to do good work. Taken as a whole, the finding that technical methods ought to be included in evaluator education is consistent with scholarly arguments on their need and extant research on what is being taught (Christie et al., 2014; Davies & MacKay, 2014; J. D. Dewey et al., 2008; Fierro & Christie, 2011; Galport & Azzam, 2016).
What is problematic is that this study was not able to find consensus among Delphi participants in many areas beyond methodology, even after three rounds. The participants in this study know evaluation: most have doctoral degrees in disciplines that reflect the field (e.g., education, evaluation, public health, sociology), have been members of AEA for over 13 years, and have been doing evaluations even longer; more than half are university educators teaching future evaluators; and all have published in major American evaluation journals. Thus, a lack of consensus on the other four domains is noteworthy, especially keeping in mind that the first domain of the 2018 AEA Evaluator Competencies, Professional Practice, defines what is unique about what professional evaluators need to know and be able to do in practice. What might help interpret this key finding?
One plausible reason has to do with varied perceptions on the relationship between research and evaluation promulgated by evaluation scholars and held by evaluation practitioners. Levin-Rozalis (2003) noted that in her professional practice, she frequently encounters a lack of understanding of the difference between research and evaluation, and it “is at the expense of evaluation” (p. 2). Wanzer (2021) found evidence of five different conceptualizations of the relationship between research and evaluation held by practicing evaluators (e.g., evaluation and research as a Venn diagram, evaluation as a type of applied research), which are also present in scholarly writing defining evaluation (Mathison, 2008; Mertens, 2014; Rallis, 2014; Scriven, 2008; Vedung, 2004). In Wanzer’s study, while most AEA members were more likely than Division H (Research, Evaluation, and Assessment in Schools) American Educational Research Association members to consider themselves evaluators, only 62% conceived of the evaluation–research relationship as being like a Venn diagram. The rest framed the relationship as evaluation being a type of applied research (22%), evaluation and research being continuum anchors (10%), research being a type of evaluation (5%), or there being no difference between the two (1%). Moreover, Wanzer found that regardless of how one thinks about the relationship between evaluation and research, when it comes to technical design and methodology, few saw differences between research and evaluation. Thus, it is possible that one of the reasons Delphi participants were able to come to agreement quickly and believe that competencies appearing under the methodology domain were a high priority is that they subscribe to this view: when it comes to technical methods, few differences exist.
In the same vein, how one thinks about the relationship between evaluation and research is also likely to influence what else besides technical methods is seen as important and to contribute to variability in the responses offered.
A second plausible reason is related to ongoing discussions about what is unique about what evaluators need to know and be able to do. This question has been debated for over four decades and is tied to ongoing conversations about the status of evaluation as a profession (Morell & Flaherty, 1978; Schwandt, 2017; Smith, 1999). At the heart of these debates are ontological (e.g., What is evaluation? What boundaries separate it from related professions?) and axiological (e.g., What is evaluation professionalism? What ethical dilemmas arise from framing evaluation as a social good in light of our work being a market-based activity?) concerns. After more than 40 years, the answer to these questions and, by extension, to what falls within evaluation’s unique purview is that “it depends” (Gullickson, 2020, p. 2). In practice, these ideas have implications for several aspects of our work, including how competencies are used to guide education and practice. For example, if one’s definition of evaluation encompasses description and judgment, then the foundational competencies include evaluation-specific logic, methods, and ethics (Gullickson, 2020). Thus, evaluator education would emphasize these aspects, meaning that the content covered in evaluation courses would be things like the logic of evaluation, evaluative reasoning, and evaluation ethos, to name a few.
Consistent with this view, results from our study suggest that ethics is a plausible candidate for prioritization in the curriculum beyond technical methods. More specifically, looking across our results for the highest priority competencies, it is clear that ethics is another common theme at the master’s and doctoral levels. For example, Competencies 1.1, 2.1, and 5.1 were rated as high priority across graduate curricula. Other competencies with ethics at their core were also assigned high priority ratings at the master’s and doctoral levels (Competencies 3.1 and 4.1, respectively). This is consistent with other research, which found that ethics was an area that evaluation students were exposed to in their classes and that evaluator educators privileged (Davies & MacKay, 2014).
However, study results also support an alternative view. Gullickson (2020) argued that if one’s definition of evaluation simply equated it with a type of applied social science, then the foundational competencies would place a heavy emphasis on research methods. In this circumstance, evaluator education courses would be centered on things like quantitative, qualitative, and/or mixed method research design, sampling theory, and ethical principles for research involving human subjects. It stands to reason, then, that differences in how our Delphi participants define evaluation, which are consistent with the variability found in our published literature, could be a contributing factor in why overall consensus within the other four domains was elusive. This seems especially plausible, given that consensus was not reached for the first competency, which by definition is supposed to identify unique aspects of evaluator knowledge.
A third possible reason that consensus was not reached in more domains among evaluators in this study may have to do with larger discussions of what constitutes an evaluation program, the purpose of formal evaluator education, and what differentiates those seeking master’s and doctoral training in evaluation. In terms of how to define evaluation programs, two strategies have been used. The first is based on a qualitative analysis of course titles with the word evaluation in them (LaVelle, 2018). Based on this line of thought, evaluation programs have been defined as having two or more courses that meet this criterion. A second strategy has been to ask university-affiliated educators to self-report the number of evaluation courses they offer and the breadth and depth of evaluation topics covered within these courses (Davies & MacKay, 2014). Regardless of the strategy used, results suggest that there is variability in the number of courses offered. Moreover, the topics covered vary, and this likely has to do with academic freedom principles that operate within higher education. These principles create an environment whereby university professors have autonomy over course content, learning objectives, what educational materials they use, what pedagogies are employed, and so on. These professors bring with them their own ideas about what the training of evaluators ought to look like, their own orientation to or preferences for evaluation practice that frame their instruction, their value orientations, and so on. Moreover, they do so in varying contexts, meaning the evaluation course they teach could be the only course offered or one of several. Even if more than one course is offered, there is variability in whether these courses are required or elective and how often they are offered. Depending on all of these factors, university faculty have a major role in deciding which competency domains are prioritized and to what depth (Boyce & McGowan, 2019).
The latter is particularly important because, even if competency domains are identified as priorities, extant research has documented that they are not taught at equal depths (Davies & MacKay, 2014).
The result of the academic freedom principle in action is that evaluation education is heterogeneous, and thus graduates of programs enter the field with vastly different abilities, orientations to practice, values, and so on. This raises the question of whether there should be a voluntary consortium of evaluation programs in the United States, perhaps modeled after the CES Consortium of Universities for Evaluation Education. If undertaken, a networked collaboration of this sort would serve to reduce some of the variability observed across graduates of evaluation programs. However, care would need to be taken in deciding who should be involved and how to move toward common curricular elements while at the same time respecting the academic freedom of faculty members, the interdisciplinary nature of the field, and the like. Done poorly, this could further erode the unique aspects of our field and further reduce evaluation to no more than a series of technical skills to be mastered. Done well, this could modernize evaluation education and practice.
On the matter of the purpose of formal evaluator education, qualitative comments from our Delphi participants suggest that, for some, the purpose of graduate education in evaluation is securing a job. The perception of knowledge as a commodity to be purchased for the purposes of employment can be traced back to neoliberal influences on higher education, including graduate education (Harland, 2009). This is in stark contrast to the liberal education philosophy that positions the purpose of graduate education as preparing graduates to contribute to social change through specialized knowledge situated within context, intellectual creativity, critical thinking, autonomy, and resilience and with an understanding and tolerance for diversity of thought and experiences (Axelrod, 2002). These ideas are far from settled and remain contested spaces within higher education. Where one falls within this debate is also likely to influence how one thinks about what ought to be included in graduate education curriculum, and this remains an area ripe for further study.
Following from this line of thought is the question of what the purpose of a master’s degree and a doctoral degree is. Traditionally, master’s programs have been primarily situated as preparation for doctoral study; thus, students are situated as consumers of knowledge (Conrad et al., 1993). Once they enter doctoral study, students make the shift to producers of knowledge. However, this remains an open question in evaluation. It is posited that a lack of serious debate on this subject is one of the reasons why it was easier for Delphi participants to reach consensus on competencies necessary for doctoral study, compared to competencies for master’s study.
Taken as a whole, several plausible reasons for a lack of consensus on what AEA evaluator competencies ought to guide the development and implementation of evaluator education curricula at the master’s and doctoral levels are offered. Whatever the reasons for a lack of consensus across all five AEA evaluator competency domains, more work is needed. A discussion of future work is taken up in the final section, after discussing study limitations.
Limitations
As is true for all studies, this study has limitations that should be considered. First, all consensus median ratings were between four and six. There are two possible interpretations of this. One is that the AEA competencies task force has identified knowledge, skills, and dispositions that are important for evaluators to train on. A second interpretation is that Delphi participants were disinclined to give lower ratings, which is commonly referred to as social desirability bias. While the pattern of responses on items where consensus could not be reached suggests that there was variability in terms of the spread of scores across all six response options or who was an outlier and on which item, social desirability bias cannot be ruled out.
Second, much of the guidance and studies on using Delphi methods were developed prior to the creation of the internet, so limited research and guidance exist for online Delphi methods (Donohoe et al., 2012). To the extent possible, the researchers attempted to mitigate known possible limitations identified in the online Delphi literature by using established survey software, being available for technical support troubleshooting, avoiding known time threats (e.g., working with and around Delphi participants’ vacation times, conferences, and work travel), and using membership organizations to ensure that survey participants were legitimately evaluators. However, it is possible that risks were not mitigated completely. Finally, recruitment and attrition are known limitations of Delphi methods, regardless of the medium (Donohoe et al., 2012; Garavalia & Gredler, 2004; Okoli & Pawlowski, 2004). The best available guidance for addressing recruitment and attrition issues was followed (e.g., be transparent about the process and goals at the outset of the study, recruit from populations who are likely to have a high interest in the overarching research question or results). Yet, the researchers were only able to recruit 11 evaluators to participate in the study, losing one in each round, ending with nine in the final round.
Beyond methodological limitations, there are two practical limitations. The first is that the competencies rated by Delphi participants were those recently endorsed by the AEA. While there are commonalities across competencies developed by voluntary organizations for professional evaluation and others (e.g., United Nations Evaluation Group), there are important contextual and cultural differences among them. This is why the recently endorsed AEA competencies make it clear that they serve as a “way to make clear to everyone the important characteristics of professional evaluation practice” by providing “a common language and set of criteria to clarify what it means to be included in the definition of evaluator” in the United States context. For this reason, care should be taken in using results from this study in other contexts. The second, equally important, limitation concerns the debate around how these competencies ought to be used, if at all, in guiding evaluator education. Whether competency-based evaluation education ought to be adopted is still uncertain. Drawing inspiration from competency-based education across multiple professions (Frank et al., 2010; Menefee & Thompson, 1994; Spady, 1994), this study assumed that the competencies do have a role to play in what ought to be taught in evaluation education at the graduate level. It is possible that not all educators share this view.
Future Work
Results from this study offer several possible directions for future research. One possible avenue is research that continues to investigate both what we are doing and what we should be doing in terms of evaluator and evaluation curricula. For example, future research could replicate this study but with a different group of evaluators. Researchers could also extend the current study to investigate whether there are differences in evaluator consensus across fields of application (e.g., education, public health, philanthropy). Furthermore, an examination of how conceptions of the relationship between research and evaluation and definitions of evaluation relate to evaluator ratings should also be explored.
A second possible line of research could begin to take up the question of what the purpose of formal evaluator education ought to be and what ought to differentiate those seeking master’s and doctoral training in evaluation. A part of this work would need to be conceptual, laying out the arguments for possible configurations, including those that align with neoliberal or liberal education philosophies. This work could be supported by empirical examinations of the extent to which these configurations map on to what is already happening at the graduate level and what knowledgeable evaluators believe the configuration ought to be. It could also examine how programs might change curricula to align with suggested priorities. This work would also need to be cognizant of the academic freedom guaranteed to faculty and the implications of designing a common core evaluation curriculum.
While there is more work to be done, this article answers calls for research to inform evaluation teaching and learning (Gullickson et al., 2019; King & Ayoo, 2020) and contributes to a budding area of research devoted to studying evaluator and evaluation curricula. This study extends extant research in two important ways. It is the first to move beyond describing the content and emphasis of topics or competencies in university-based programs. Moreover, the focus of this research is novel in that instead of describing what is currently happening in terms of evaluator curriculum, it provides empirical evidence regarding what evaluators believe we ought to be doing. In doing so, this study serves as inspiration for other investigators to take up the call for more empirical and conceptual work in this neglected area.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
