Abstract
This article proposes a research program with two goals: (a) to support nonprofit leaders to productively engage evaluation and (b) to advance a meso-level theory of nonprofit evaluation that recognizes the diverse ways nonprofits contribute to social change. Such a research program is timely, as evaluation becomes increasingly institutionalized in the sector in ways that constrain nonprofit leaders from engaging productively with evaluation to advance their social impact. This research program brings existing nonprofit scholarship into conversation with evaluation scholarship and puts forward a research agenda organized around the practical dilemmas facing nonprofit leaders as they answer four key evaluation questions: what to evaluate, for what purpose, using which criteria, and with what evidence and methods. By anchoring a research program around these four questions, we seek to reopen the possibilities for how scholars can support nonprofit leaders in engaging evaluation to enhance their social impact.
Introduction
Over the last several decades, nonprofit leaders have struggled to answer a persistent and pressing question: How can they best evaluate whether their organization is making a difference? As we show in this article, a substantial body of nonprofit scholarship has emerged on this topic, offering a powerful set of explanations for why nonprofits struggle to evaluate their social impact. What is less clear from this scholarship is how we, as nonprofit researchers, can respond to such challenges: How might our research support nonprofit leaders to productively engage evaluation in ways that enhance their social impact and contribute to social equity? Such a question seems particularly salient now as we see a narrow approach to evaluation becoming institutionalized in the sector, one that focuses on quantifiable results from the implementation of program or project interventions, favors methods such as cost–benefit analysis and randomized control trials (RCTs) for assessing impact, and often prioritizes ends over means in valuations of social change. Moreover, as the demand for evidence continues to grow, new efforts at standardization are being put in place within certain subfields, raising additional concerns about the ability of nonprofits to be responsive to communities.
In this article, we set out a research program that aims to motivate studies whose findings have the potential not only to support nonprofit leaders, their staff, and communities to fruitfully engage in evaluation but also to move toward a meso-level theory of nonprofit evaluation, one that better aligns with the diverse ways nonprofits seek to make a difference in communities. To accomplish this task, we bring nonprofit scholarship into conversation with an equally expansive scholarship on evaluation to organize a research program around the practical dilemmas facing nonprofit leaders as they address four key questions: What to evaluate? For what purpose? Using which criteria? And with what evidence and methods? The evaluation scholarship has considered each of these questions in depth, showing alternative ways of answering each, but it has not taken an organizational lens to these questions and certainly not a nonprofit lens. By combining these two bodies of scholarship, we seek to motivate empirical studies integrated around a common set of questions and grounded in nonprofit practice, with the potential to advance a meso-level theory of nonprofit evaluation that opens up diverse possibilities for how nonprofit leaders can engage evaluation to support their social impact.
Our article is organized as follows. First, we provide an overview of the primary streams of nonprofit scholarship on the topic of social impact evaluation: organizational effectiveness, organizational accountability, and institutional environments. We then turn to our proposed research program, which is organized around the four core evaluation questions noted above. For each question, we start by identifying the dominant or more institutionalized answer evident in nonprofit practice today. We point to studies by nonprofit researchers that reveal why nonprofits struggle to evaluate their social impact in these ways. We then introduce the reader to the evaluation scholarship, summarizing examples of key insights that have emerged from this literature around each question. We build on these two bodies of scholarship to conclude each section with suggestions for future research.
Before proceeding, we define the key terms in our analysis and how we employ them. By evaluation we mean the systematic gathering of information about an entity to determine its merit or worth, inform decision-making, and improve social impact (Patton, 1997; Weiss, 1998). Evaluation is more than a one-off assessment; it involves a set of practices that can include the ongoing collection of performance data by organizations to inform decision-making, and less frequent in-depth assessment of a program’s implementation or an organization’s impact. We use the term social impact to refer broadly to the “difference made” by nonprofits as mission-driven organizations seeking to make a change in the world. Central to our inquiry is the idea of “practice dilemmas.” A practice dilemma is an adaptive challenge that has no easy, fixed answers and instead requires thoughtful engagement and critical inquiry on a regular basis (Heifetz, 1998; Schwandt, 2000, 2015). Dilemmas are distinct from other types of challenges that can be addressed through technocratic solutions or with an infusion of resources such as capital or expertise; by definition, dilemmas do not have clear solutions (Heifetz & Laurie, 2001). By highlighting the evaluation practice dilemmas facing nonprofit leaders and organizing a research agenda around them, we hope to build a stronger bridge to practice, one that recognizes the diversity in the sector and supports greater pluralism in approach. Finally, we use the terms participants and communities interchangeably to refer to individuals, families, and communities who are the intended beneficiaries of nonprofit initiatives, recognizing that any term inadequately conveys respect for these individuals, their agency, and their central role in social change. For readers who may be less familiar with evaluation terms, we have included an appendix of terms and definitions.
Nonprofit Scholarship: Central Lines of Inquiry
Scholarship on nonprofit organizations approaches questions about the evaluation of social impact from several vantage points. We organize this literature into three overlapping but distinct streams of research: organizational effectiveness, organizational accountability, and institutional environments. Our goal here is not to provide a comprehensive review but rather to bring together what are sometimes viewed as unrelated lines of inquiry—to show how they constitute a larger historical arc of research relevant to evaluating nonprofit social impact. Our discussion is summarized in Table 1 below.
Table 1. Nonprofit Research Streams on the Evaluation of Social Impact.
Note. NPOs = nonprofit organizations.
Organizational Effectiveness
In the 1970s and 1980s, as the number of nonprofits grew, scholars started paying attention to nonprofits as a distinct organizational form, separate from firms and government, with unique challenges in measuring organizational effectiveness. Early research on effectiveness was heavily shaped by scholars of organizational sociology, industrial psychology, and administrative sciences. For example, an influential edited volume on New Perspectives on Organizational Effectiveness (Goodman & Pennings, 1977) tapped a number of pioneering organizational scholars to offer their perspectives on the challenges of assessing effectiveness across various types of organizations. A foundational textbook in organizational sociology (Scott, 1977, 1992) drew on this prior work to identify three basic types of indicators for judging organizational effectiveness—outcomes, processes, and structures. But early on Scott (1992) pointed to challenges with outcome indicators: “outcomes present serious problems of interpretation” such as inadequate knowledge of cause and effect, time required to observe results, and environmental characteristics beyond the control of the organization (p. 354).
These ideas were picked up in The Nonprofit Sector: A Research Handbook (Powell, 1987). This edited volume was one of the first research handbooks devoted to nonprofit organizations, and it included a chapter focused specifically on the distinctive challenges of measuring nonprofit performance (Kanter & Summers, 1987). Later scholars further probed whether the challenges of effectiveness differed in nonprofit versus for-profit organizations, pointing to two key differences. The first is that for-profit actors tend to focus on financial measures of performance, which are generally easier to assess in both the short and long term, while nonprofits typically see financial performance as an input rather than an outcome (Kaplan, 2001; Speckbacher, 2003). The second key difference is that nonprofits face multiple constituencies (such as funders, beneficiaries, communities, and government) in assessing their effectiveness, with each often viewing effectiveness using different criteria (Kanter & Brinkerhoff, 1981; Kanter & Summers, 1987).
A review of empirical studies on the topic between 1977 and 1997 (Forbes, 1998) discussed three principal measures of effectiveness: (a) goal attainment; (b) resource attainment, in terms of enabling organizational survival; and (c) multidimensional approaches arising from diverse criteria and values that organizational stakeholders use to assess the merits of the organization. This review further noted that the way in which effectiveness is assessed (the actual process of assessment) often introduces competing values and implicit criteria, or what other scholars have since framed as the political aspects of evaluation (Tassie et al., 1996) and the social construction of nonprofit effectiveness (Herman & Renz, 1999, 2008). Subsequent research has developed multidimensional constructs for effectiveness that include both goal attainment and organizational survival (Sowa et al., 2004), examined the determinants of effectiveness in both research and practice (Liket & Maas, 2015), and proposed a multiple constituency theory of program performance (Campbell & Lambright, 2016).
Research on organizational effectiveness has also been influenced by business scholarship, especially from the fields of accounting and strategy (Anthony & Young, 2004; Merchant & Otley, 2006; Oster, 1995). The balanced scorecard, in particular, has attracted attention among nonprofits for its multidimensional approach to measuring performance along four dimensions or “perspectives”: financial/donors, customers/stakeholders, internal processes/quality, and learning/improvement processes (Kaplan, 2001). Nonprofit scholars note this approach has been helpful in offering nonfinancial measures of performance and in linking internal processes to goals but does little to address how to assess goal attainment given the nonlinear nature of cause and effect, and how to resolve competing performance criteria used by multiple constituencies (Speckbacher, 2003).
Finally, a related challenge in this literature lies in establishing measures of effectiveness that can be compared across organizations, given the complexity and diversity of nonprofit work (Ebrahim & Rangan, 2014; Stone & Cutcher-Gershenfeld, 2001). The lack of reliable and comparable measures, based on shared agreements about what criteria and measures are important, may be one reason why the nonprofit sector does not have a robust infrastructure to provide systematic data on organizational performance akin to private sector ratings agencies and industry analysts (Prakash & Gugerty, 2010). Even nonprofit managers themselves appear to hold very different views of organizational effectiveness, with some focused on outcome accountability and others on measures of organizational efficiency (Mitchell, 2013). The question of what constitutes nonprofit effectiveness is central to a core concern of evaluation: what to evaluate? We return to this issue in the next section.
Organizational Accountability
Nonprofit scholarship in the late 1990s and early 2000s responded to the growing demands for accountability in the sector, both in international development and in the United States. The former was perhaps best marked by the publication of a seminal article on accountability and performance in the flagship journal World Development (Edwards & Hulme, 1996) and further developed into a widely read edited book in the same year. At about the same time in the United States, the journal Nonprofit Management and Leadership devoted a special issue to the topic of accountability in 1995. Two trends in practice further heightened the interest of scholars in accountability. The first was growth in public sector contracting to nonprofits resulting from state retrenchment and the emergence of a “new public management” discourse on performance (Krauskopf & Chen, 2010; D. H. Smith, 1999; S. R. Smith & Lipsky, 1993). The second trend was a series of highly visible scandals that contributed to an erosion of public confidence in nonprofit organizations (Bebbington & Riddell, 1997; Fisher, 1998; Gibelman & Gelman, 2001; Young et al., 1996).
The diverse scholarship on accountability converged on three key questions, both empirical and normative: for what is an organization accountable, to whom is an organization primarily accountable (given competing demands), and how is accountability operationalized in practice? Scholars began to differentiate among “upward” accountability demands of funders (such as foundations, private investors, government agencies, and individual donors), “downward” accountability to clients, beneficiaries, and communities, and “internal” accountability to their own staff and boards (Edwards & Hulme, 1996; Kearns, 1996; Lindenberg & Bryant, 2001; Najam, 1996; Oster, 1995). The emerging research further sought to grapple with the distinctive features of nonprofit accountability given that nonprofits have no owners akin to shareholders (Hansmann, 1996) and face public scrutiny in exchange for tax exemption (Fremont-Smith, 2004).
Not surprisingly, researchers found that the most powerful accountability claims came from funders rather than beneficiaries or clients because funders could threaten to withhold funding whereas clients often did not have such an exit option or sanctioning mechanism (Ebrahim, 2003a; Hirschman, 1970a). Principal-agent perspectives suggested that mechanisms of upward accountability would be better developed than downward accountability, on the grounds that nonprofits essentially act as agents for their funder-principals (Prakash & Gugerty, 2010). Other research showed this principal-agent explanation to be too limited (Benjamin, 2010) and explored “mutual” and “plurilateral” accountability mechanisms among organizations working together in a network or under conditions of interdependence (Brown, 2007; Macdonald, 2007).
A flurry of empirical work looked at the reporting relationships between nonprofits and their funders, showing not only the negotiated and political nature of accounting for results (Benjamin, 2008a; Cutt & Murray, 2000; Tassie et al., 1996) but also how demands for accountability transferred the risk of delivering impact from funders to nonprofits, sometimes leading nonprofits to overstate their results or compromise relationship goals, such as community building (Benjamin, 2008b; Campbell, 2002). This work revealed how accountability demands are shaped by relationships of power among actors (Dubnick & Justice, 2004; Ebrahim & Weisband, 2007). Other research uncovered the consequences of such relationships, documenting how nonprofits sought to separate external reporting from internal learning, creating an “accountability myopia” focused on short-term results at the expense of long-term learning (Ebrahim, 2003b, 2005). The rise in self-regulation and accreditation regimes worldwide underlined the pressure on nonprofits and nongovernmental organizations (NGOs) to demonstrate accountability to external stakeholders while also attempting to stave off regulatory action by governments (Breen et al., 2019; Gugerty, 2009; Gugerty & Prakash, 2010).
Recent work on international NGOs has highlighted the irreconcilability of these competing accountability demands. As NGOs scale and build their capabilities for influencing global policy agendas, they can lose their abilities to stay connected and accountable to local actors (Balboa, 2018). Conversely, those that stay focused on accountability in grassroots relationships have difficulty building global capabilities and influence. More troubling, NGOs that succeed in building substantial authority in global politics can fall into an “authority trap” whereby they soften their activism to focus on incremental rather than radical change (Stroup & Wong, 2017). These studies highlight the relational rather than absolute nature of accountability, suggesting that the traditional bases of legitimacy and effectiveness that have enabled the global power of NGOs are now being eroded to the point of making them irrelevant (Mitchell et al., 2020).
This brief discussion reveals two orientations to accountability questions: a positivist or rationalist approach and a social constructivist approach. Rationalist perspectives suggest that evaluation efforts, guided by expert evaluators, can be used to find objective measures of performance, create a basis for learning, and hold organizations to account. Social constructivist perspectives, however, suggest that measures are rarely objective, as they are the result of relationships of (unequal) power among stakeholders. Both perspectives thus highlight different practice dilemmas that arise from accountability claims—with rationalist perspectives emphasizing the challenge of measurability and standardization, while social constructivists point to the dilemmas of negotiating among competing, or even incommensurable, demands for accountability. We revisit these perspectives when we consider the central dilemmas in the second half of this article.
Institutional Environments
The institutional environments literature, which has grown in prominence over the past two decades, documents a growing isomorphism in the nonprofit sector, such as the widespread adoption of evaluative tools and business management practices. The conceptual foundations of this work can be found in institutional theory, informed especially by DiMaggio and Powell’s (1983) seminal article on institutional isomorphism, alongside scholarship on how organizational environments shape the diffusion and adoption of managerial practices and the symbolic uses of information (Feldman & March, 1988; Meyer & Rowan, 1977).
Despite its long roots in organizational theory, research on institutional environments did not become a dominant stream in nonprofit scholarship until the early 2000s. The resulting research has shown that nonprofit organizations measure social impact not necessarily for purposes of assessing their own performance but for establishing social legitimacy within their organizational environments—often adopting short-term and easily quantifiable metrics over more ambiguous or complex measures of social change (Hwang & Powell, 2009) and decoupling measurement and evaluation policy from practice (Bromley & Powell, 2012). Measurement systems thus serve not simply as rational instruments of assessment but as political and contested means of social and cultural legitimation, especially in resource-dependent contexts (Pfeffer & Salancik, 1978).
Researchers have further argued that such use of measurement is part of a deeper structural transformation of the nonprofit sector characterized by marketization and managerialism, given the ascendance of business practices across society (Eikenberry & Kluver, 2004; Maier et al., 2016; Mair & Hehenberger, 2014; Powell et al., 2005). For example, scholars have shown a growing shift in nonprofits toward the hiring of professional managers, adoption of formalized managerial practices such as strategic planning, independent financial auditing, and quantitative evaluation and performance measurement (Bromley & Meyer, 2017; Tuckman & Chang, 2006).
Many scholars have sought to better understand nonprofits’ responses to these institutional pressures by examining variations in the adoption of evaluation practices (e.g., Barman & MacIndoe, 2012; Campos et al., 2011; Carman, 2007; Carman & Fredricks, 2010; Carman et al., 2008; Hoefer, 2000; Kang et al., 2012; Marshall & Suárez, 2014). A central finding of this research is that the growing adoption of the instruments and tools of social impact measurement and evaluation—such as theories of change, logic models and frameworks, and experimental methods of evaluation—may be a result of externally generated pressures for legitimacy and are less related to efforts to improve practice (please see the appendix for definitions of key terms). Yet the nonprofit scholarship also suggests that adoption and use depend on having adequate organizational capacity, and that managers may have some agency in how and why they adopt evaluative practices, as we explore in the section below (Benjamin & Campbell, 2020).
Together, these three lines of nonprofit scholarship highlight how challenges in defining effectiveness, negotiating demands for accountability, and responding to external institutional pressures shape how nonprofits engage with the assessment of social impact. At the heart of each stream of nonprofit literature lies a common concern with how to define and evaluate social impact, but with somewhat different emphases (effectiveness, accountability, institutionalization) that reflect the concerns of the times. Rather than trying to reconcile the differences among these literatures, our approach has been to illustrate their common concern with evaluation and social impact. We now turn to developing a research agenda that might open up space for greater agency among nonprofits seeking to improve the social impacts of their organizations.
A Future Research Program
The nonprofit scholarship summarized above offers a set of powerful explanations for why many nonprofits face challenges in evaluating the social impact of their work. What is less clear from reviewing this literature is what we, as nonprofit scholars, might do: How might our research support nonprofits to engage in evaluation fruitfully in light of these challenges? We propose a research program that aims to develop a more applied and meso-level theory of evaluation for nonprofits, one that better aligns with how nonprofits work and the diverse ways they can contribute to social change and equity.
We believe such a research program is urgently needed because a narrow approach to evaluation is becoming increasingly institutionalized in the sector. This narrow approach focuses almost exclusively on interventions at the program or project level, prioritizes intended or predefined outcomes above other types of criteria, and holds evidence that can be readily quantified or even monetized as more credible. We see this approach enacted in the spread of evidence-based policies, value-for-money criteria, quantification of benefits, a predilection toward randomized control trials, and other trends that are becoming institutionalized in the nonprofit field. Although these approaches can have important contributions in terms of comparability and scale, they reflect narrow approaches to assessing value.
We believe that the complex nature of social change requires greater pluralism in approaches to social impact, a point echoed by practitioners and scholars alike. Our proposed research program thus seeks to support alternative ways of approaching evaluation in the sector, centering on the key practice dilemmas faced by nonprofit leaders. To help accomplish our goal, we turn to the scholarship on evaluation. We focus primarily on the scholarly field of program evaluation, which informs the professional field of evaluators working in the nonprofit sector (e.g., Alkin, 2013; Dahler-Larsen, 2012; Schwandt, 2015; Shadish et al., 1991; Thomas & Campbell, 2020; Weiss, 1998). Program evaluation scholarship has developed a more pluralistic set of approaches to evaluation practice that increasingly recognizes how standard evaluation approaches, seen as objective or neutral, can in fact represent dominant interests, specifically those that are White, Western, colonial, or from the global north (e.g., Caldwell & Bledsoe, 2019; Cavino, 2013; Chilisa et al., 2016; Chouinard, 2016; Hood, 2004; Hopson, 2009; House, 2017; Kirkhart, 2010; LaFrance & Nichols, 2008; Madison, 2007; Stanfield, 1999; Thomas et al., 2018). We also draw inspiration from the sociology of valuation, which considers the assumptions that inform evaluative processes endemic in social life (Barman, 2015; Beljean, n.d.; Boltanski & Thévenot, 2006; Lamont, 2012).
This scholarship has theorized and debated four core evaluation questions (Shadish et al., 1991):
What is being evaluated? Identifying what to evaluate requires clarifying the unit of analysis for evaluation and drawing boundaries around what is included and excluded. For nonprofits, this requires not only considering the agent(s) of change—such as an anti-poverty program or project, an organization or network, the community or participants—but also what requires changing and the relationship between the two.
What is the purpose of evaluation? Evaluations are intended to be used to support some decision or action. In the nonprofit sector, this could involve making a final judgment that results in cutting a program or renewing a grant, or improving practices, such as providing training to staff or changing the way a program is designed. Evaluation can also be used to encourage deliberation among stakeholders, reaffirm a community’s self-determination, or elevate community voice.
What criteria should be used in an evaluation to judge merit or worth? The central point of evaluation is to make some judgment about the entity being evaluated. This requires identifying and selecting evaluative criteria, and the values that inform them, and then applying these criteria to the relevant evidence. For nonprofits, standard criteria often include some measure of intended program outcomes, but other criteria could include greater community leadership or enhanced dignity among participants.
What evidence is credible and what methods are needed to gather that evidence? An evaluation typically assesses the performance of the entity against these criteria, using evidence and methods that are viewed as legitimate by key stakeholder groups. In the nonprofit sector, this evidence may be gathered formally with recognized social science methods, including community-engaged research methods, or culled from existing information systems.
Organizing our research agenda around these four evaluation questions brings us closer to the practical dilemmas facing nonprofit leaders as they evaluate their social impact—because these questions must be answered in any evaluation, whether nonprofits explicitly address them or not. Making these questions explicit in an evaluation can provide nonprofit leaders with better traction and agency in their work while also giving nonprofit scholars a starting point for a more practice-oriented research agenda that supports pluralism and equity in the sector.
For each evaluation question, we first identify the central dilemma nonprofit leaders face as they attempt to answer this question. We point to how the institutionalization of a narrower response to the question hinders nonprofit leaders’ abilities to address the associated dilemma thoughtfully. We return to the nonprofit literature here to elaborate on the challenges nonprofits face in answering this question. We then turn to the evaluation scholarship. This scholarship is evolving but historically has been organized into four domains, reflected in the four core questions above and evident in introductory texts to the program evaluation field (e.g., Schwandt, 2015; Shadish et al., 1991; Weiss, 1998). Again, our purpose is to draw on key insights, rather than offer a systematic review. Together, these two bodies of scholarship lay the groundwork for a research program that seeks to meet two intimately connected goals: advancing a meso-level theory of nonprofit evaluation and supporting nonprofit leaders to productively engage evaluation in diverse and more equitable ways. We summarize this discussion in Table 2 below.
Table 2. Toward an Integrative Nonprofit-Evaluation Research Agenda.
What to Evaluate?
The first evaluation question—what is to be evaluated?—appears deceptively simple. In fact, this question may not even be explicitly considered by nonprofit leaders because the answer is predetermined: evaluate specific programs or projects for funders. This focus on the program and project as the dominant unit of analysis is reinforced through evaluation tools and handbooks on logic models and theories of change intended to help nonprofits specify the central components of a program and its expected results (e.g., Knowlton & Phillips, 2012; USAID, 2022; W.K. Kellogg Foundation, 2004).
Nonprofit scholarship, however, has historically been concerned with the organization (rather than the project or program) as the primary unit of analysis. As discussed earlier, nonprofit scholars recognize both the multidimensional nature of organizational effectiveness and the need to consider more than the program when assessing the social impact of nonprofits, including organizational goals, organizational mission and strategy, financial measures, the assessments of multiple constituencies, and the contribution to larger coalition and system goals (e.g., Bryan, 2019; Ebrahim, 2019; Lecy et al., 2012; Sowa et al., 2004; Speckbacher, 2003). Other research focuses specifically on the relationship between nonprofit organizations and those individuals, families, and communities who are the intended direct beneficiaries of the organization. This body of scholarship calls attention to the diverse ways these participants engage with nonprofits as organizations, not simply as recipients of programs, and how this engagement affects participant experience and social impact (e.g., Benjamin, 2012, 2021a; Benjamin & Campbell, 2015; Knowlton & Phillips, 2012). These literatures show that the project or program is not always the most appropriate unit of analysis for evaluation, as it restricts our understanding of how nonprofits as organizations contribute to social impact. A core dilemma facing nonprofit leaders when considering what to evaluate is thus how to define the unit of analysis.
Although evaluation scholarship has principally been concerned with the program as the primary unit of analysis, it offers some nuance and depth for informing nonprofit evaluation. We discuss three developments in this scholarship relevant for our purposes. First, evaluators developed a more nuanced and complex understanding of programs. Early evaluation scholars focused on developing evaluation designs that could isolate the causal relationship between a program and measured results, viewing programs as simple instrumental interventions. But by the late 1970s, the challenges of implementation and the larger social and political context in which programs unfold spurred evaluators to open up the “black box” of programs, to better understand their internal structure, external constraints, as well as the recursive relationship between programs and social change (Shadish et al., 1991, p. 38). This included attention to how a program was implemented and whether fidelity to the model was maintained, what is sometimes referred to as process evaluation (Harachi et al., 1999; Mowbray et al., 2003; Stufflebeam, 1983). Relatedly, theory-based evaluations sought to specify the “theory of change” or the causal logic underlying program interventions, something familiar to many nonprofits today (Chen, 1990, 2005a, 2005b; Chen & Rossi, 1983; Meyer et al., 2021; Rogers, 2007, 2008; Weiss, 1998). Here scholars have drawn on realist philosophy to consider not simply whether something works but, as Pawson and Tilley (2005) explain, “What works for whom, in what circumstances, in what respects and how?” (p. 363; see also Pawson & Tilley, 1997).
Second, evaluation theorists recognized that desired outcomes are emergent in many settings: they are defined in collaboration with participants and communities rather than determined a priori. For example, neighborhood revitalization efforts involve working with residents to identify core concerns. When programs require partnering with participants and communities to define outcomes, evaluating fidelity to a predetermined program model is misplaced (Patton, 2011, 2016; Rogers, 2008). Third, evaluation approaches started to recognize that program outcomes depend not just on the technical aspects of program implementation but on the quality of the relationships in the organizational setting, including those among staff, between staff and communities, and among community members themselves (Abma, 2006; Visse & Abma, 2018). Although some of these interactions between staff and participants may be specified in an intervention protocol and thus studied in an evaluation, scholars have suggested that interactions extend beyond the intervention, as discussed below.
What are the implications of these two bodies of scholarship for future nonprofit research on the question of “what to evaluate”? The evaluation scholarship offers a more expansive and complex understanding of what goes into specifying a program and its effects, compared with a typical logic model familiar to many nonprofit leaders. Yet the characteristics of nonprofit organizations, including their diverse structures, their leadership, and organizing challenges, are typically not the focus of this scholarship. We believe research attentive to organizations can support nonprofit leaders and avoid the trap of only evaluating isolated programs and projects or of using simplistic assumptions about how they contribute to social change and equity. We offer two principal lines of future research centered at the organizational level.
First, to develop a meso-level theory of nonprofit evaluation, one that takes organizations seriously, we need to shift our focus from the program to organizational strategy. Nonprofit organizations are not simply a blank canvas on which programs and projects are executed but have overarching theories and assumptions about how programs fit together, get implemented, and respond to their environments. These are central concerns of strategy (Bryson, 2016; Oster, 1995). 4 Indeed strategy is a familiar term to many nonprofit leaders, whether they are engaged in advocacy, social movements, or human services. Recent research identifies several distinct types of social change strategies—niche, integrated, emergent, and ecosystem—that are contingent on the organization’s knowledge about cause and effect and its degree of control over desired outcomes (Ebrahim, 2019). This work suggests that the appropriate unit of analysis, and the organization’s evaluation approach more broadly, depends on the organization’s strategy, thereby offering considerable agency to managers in determining “what to evaluate.” Other research suggests that choices about strategy shape operational capacities for measurement (Moore, 2013) and collaboration with other actors (Balboa, 2018) that managers need to consider in evaluation. New research is needed that can help us better theorize the relationships between strategy, evaluation (particularly units of analysis), and capacity-building.
Second, nonprofit scholars are uniquely positioned to conduct research that also considers how answering the unit of analysis question may vary depending on the nonprofit–community relationship. For example, we know that communities and participants are central actors in achieving social change, taking steps inside and outside the organization to achieve their desired outcomes. How might our evaluation approaches support equity, for example, by expanding the unit of analysis beyond the organization to consider how desired outcomes are co-defined and co-produced by communities (Benjamin, 2021a; Benjamin & Campbell, 2015; Bovaird & Loeffler, 2012; Chilisa et al., 2016; Ostrom, 1996)? This includes a critical examination of how nonprofits might contribute to or stymie community-desired outcomes. Relatedly, defining the unit of analysis at the organizational level requires documenting how the organization—and not simply its programs—shapes participants’ experiences in ways that matter for social impact and ultimately for social equity (Benjamin, 2021b; Kushner, 2000). This question is even more salient because the nonprofit form includes diverse organizational structures, from highly bureaucratic to collectivist. These diverse structures allow for different forms of engagement and authority on the part of participants, which in turn can affect the norms and values of the organization in ways that matter for participants’ experience and thus social impact (Benjamin, 2021b; Chen et al., 2013). Such experiences can include direct involvement on the board or an advisory group (which is often not well captured in program-focused evaluation), but they can also include less tangible experiences of organizational culture such as service interactions and ongoing relationships with nonprofit staff (Benjamin, 2022).
We need research that examines how the organization, its governance, culture, and so on affect participants’ experiences in ways that matter for social equity and impact, and how this might vary depending on their engagement with the organization.
For What Purpose?
How evaluation results will be used is a central question in evaluation given that evaluations are typically undertaken to generate knowledge that informs decisions. But using evaluative data to inform decisions requires that data are matched to the types of questions and decisions that need to be made. Because nonprofits have numerous external and internal stakeholders who require different types of evaluative data, deciding whose decisions are to be informed requires articulating and mediating among uses for multiple audiences. This core dilemma—of addressing competing demands for use—is particularly challenging because the accountability demands of funders often take priority, further institutionalizing dominant perspectives and making it difficult for nonprofits to consider the full spectrum of information from which other constituents might benefit.
Several nonprofit studies document the consequences of using evaluation to meet funder accountability requirements and also point to other reasons nonprofits may not use the data they collect. These reasons include limited capacity, inability to control the data they collect, and inadequate technology (Benjamin et al., 2017; Hoefer, 2000). For example, the nonprofit literature on accountability discussed earlier showed how goal displacement or overclaiming of results might be a natural consequence of using evaluation to meet funder demands. The organizational effectiveness literature shows how the ambiguity inherent in defining nonprofit effectiveness is often resolved in favor of meeting funder requirements, resulting in evidence that is neither useful for organizational-level decision-making nor for learning (Bryan et al., 2020; Carman & Fredericks, 2010). Consequently, evaluative data needed by internal audiences for learning and improvement are often not available (Gugerty & Karlan, 2014). As a result of these forces, managers and staff may ultimately see evaluation as symbolic and separate from their “real” work (Buckmaster, 1999; Mitchell & Berlan, 2016; Riddell, 1999).
Use has been a central concern in evaluation scholarship, in part because evaluation results seem to be used so little, at least directly (Dahler-Larsen, 2012). Early evaluation theorists assumed that evaluation results would inform decisions about program continuation or expansion (Shadish et al., 1991). But these naive assumptions about instrumental use confronted the stark reality that evaluation results were not being used as intended. Disappointment with this limited role helped to generate a theory that described (a) the various audiences and types of evaluation use; (b) the time frames in which use occurs; and (c) how use can be explicitly facilitated (Shadish et al., 1991, p. 53).
On this first point, evaluation scholars set out to identify the potential users of evaluations and their specific information needs, identifying how evaluation might influence a wide range of audiences that included policymakers, funders, managers, the policy-shaping community as well as program beneficiaries and communities (Greene, 2013; Kirkhart, 2000). The idea of evaluation “influence” helped to expand conceptions of use beyond immediate instrumental decision-making. For example, evaluation could shift the ways in which stakeholders conceptualized or thought about an issue (Weiss, 1973). Evaluation findings could also be used by stakeholders to enhance the legitimacy of a particular organization, program, or practice (Schwandt, 2015). Moreover, a number of evaluation approaches have been developed to advance social justice and equity, recognizing that all evaluations advance certain perspectives and interests and so the priority should be on those perspectives and interests with the least power (Greene, 1997). Here instrumental, conceptual, and legitimacy use could be critically redefined in light of larger equity goals. On the second point above, scholars also realized that different types of use might unfold over time. Examining use over longer time periods illuminated the ways in which evaluation created unintended as well as intended consequences that would not be visible in the short term. Longer time horizons also called attention to how the very act of participating in evaluation changed the understanding of program participants (Kirkhart, 2000).
Finally, evaluation scholars recognized that use required active facilitation (Schwandt, 2015). One stream of evaluation scholarship focused on policy influence and uptake of ideas, studying how and when generally available research evidence and evidence-based practices were incorporated into organizational practice and intervention design (Carswell et al., 2021; Hardwick et al., 2015). Another stream turned the lens to the needs of program managers for data that could be used for program improvement (Patton, 1997; Wholey, 1981). This included parallel trends in international development focused on “management by objectives” and “managing for results” to increase the use of performance data by managers with the hope of improving the effectiveness of international aid (Martinez & Cooper, 2020; Rossi et al., 1982). Facilitating use by managers also requires specific knowledge of and attention to incentives and rewards embedded in organizations (Behn, 2014; Wholey, 1981).
How can nonprofit and evaluation scholarship help to better theorize about evaluation use in a way that might guide nonprofit practice and vice versa? While several studies have examined how nonprofits use evaluation data, we focus on three possible lines of inquiry.
First, we need to better understand what “use” means across the wide range of organizations in the nonprofit sector. How, and in what ways, do evaluative data and their use get discussed in service delivery nonprofits, advocacy organizations, or community organizations? How do these discussions influence actual use by staff, managers, leaders, and beneficiaries? How do discussions and actions vary across types of organizations and stakeholders? Many studies report that some nonprofit leaders do make consistent use of evaluation information (Innonet, 2016; LeRoux & Wright, 2010), while others report that nonprofits are “drowning in data” and are either minimally using the data they do have (Benjamin et al., 2017; Snibbe, 2006) or are using them largely for symbolic purposes of compliance. Studies of the uptake of evidence-based policy and practices also suggest that, when evidence or evaluation results do not reflect the expertise and knowledge of clients and staff or are not co-produced by them, they are less likely to be used (Carswell et al., 2021; Hardwick et al., 2015). This suggests that attention to equity and diverse perspectives is a core component of facilitating use.
We also need a more systematic way of investigating the unintended consequences of evaluation use. The goal displacement resulting from trying to achieve certain narrow targets is well documented, but the evaluation process also has consequences. The very act of participating in evaluation can shift cognitive understandings of programs, affect stakeholders’ views of merit and worth, and alter dynamics and perceptions of power and privilege (Kirkhart, 2000; Schwandt, 2015; VanderPlaat, 1995). Here we might ask: How are beneficiaries affected by the evaluation process? How does it affect staff, not only in terms of their workload, which is well documented (Benjamin et al., 2017; Kim et al., 2019; Snibbe, 2006), but also in how they engage with or think about communities? Who owns the data that are produced, who gets to use these data, and who gets to tell the story about the data (see Cavino, 2013; Chambers, 1999; Stanfield, 1999)? Critical approaches to evaluation research have highlighted the potentially extractive nature of evaluation (e.g., Cavino, 2013; Center for Evaluation Innovation, Institute for Foundation and Donor Learning, Dorothy A Johnson Center for Philanthropy, & Luminare Group, 2017; Chilisa et al., 2016; Chouinard, 2016; Tuck & Yang, 2014) such that communities of color, of indigenous peoples, and of people with disabilities have long called for “nothing about us without us” (Charlton, 1998).
Research also needs to further explore the conflicting purposes between accountability and organizational learning. Empirical work on constraints to learning and how they might be overcome remains limited, with many nonprofit leaders reporting they feel underprepared to take on this responsibility (Mitchell & Berlan, 2016). Extant scholarship often misses the fact that organizational capacity for evaluation is distinct from the capacity to use the results of these efforts. Creating an organizational culture that values meaningful evidence is critical to data use and to better understanding and avoiding negative unintended consequences (Bryan et al., 2020; Cousins et al., 2014; Taylor-Ritzler et al., 2013). This is the type of critical thinking that evaluative practice can support when narrow conceptions of instrumental use are set aside in favor of reflection, learning, and inclusion of diverse perspectives.
Using Which Criteria?
Identifying criteria for judging merit or worth is one of the central tasks in evaluation. Because criteria are informed by values about what is good or worthy, judgment requires being attentive to those underlying values. One central dilemma facing nonprofit leaders is how to identify and prioritize those values and the related criteria used by stakeholders to judge their organizations. Doing so is even more challenging because the outcomes movement, driven largely by funders, has institutionalized the idea that achieving intended program outcomes is the most legitimate criterion for judging effectiveness (Brest, 2020). And because funders have an outsized influence in defining these outcomes, as noted above, their criteria and values are often dominant in determining nonprofit worth or merit (e.g., Benjamin, 2008a; Cutt & Murray, 2000; Ebrahim, 2005; Mitchell et al., 2020). This emphasis on intended outcomes makes it harder to consider other criteria, including those that emerge in partnership with communities. 5
Nonprofit research further elaborates on the challenges and limitations of using goal attainment measures, such as intended program outcomes, as the primary criterion for judging nonprofits (e.g., Campbell, 2002; Ebrahim, 2019). For example, humanitarian organizations operating in crisis contexts must focus on the delivery of short-term outputs such as food, water, temporary shelter, and medical services, without necessarily aiming to achieve longer term outcomes directly. Moreover, some nonprofits cocreate their outcomes with partner organizations as well as with participants and communities themselves, making it difficult to define these outcomes at the outset because they emerge while working in partnership (e.g., Benjamin & Campbell, 2015). Relatedly, measurable program outcomes may miss expressive roles of the sector (e.g., Knutsen & Brower, 2010; S. R. Smith, 2010) and using standardized criteria risks undermining nonprofit innovation and experimentation (Hwang & Powell, 2009; Phillips & Carlan, 2018). Finally, using program outcomes as the main criterion can miss more subtle social processes within these organizations that redress or reinforce inequity (Benjamin, 2022).
Informed by diverse disciplinary training, evaluation scholars take an expansive view of judging merit or worth, recognizing that values “are omnipresent in social programming” (Shadish et al., 1991, p. 47). Values are embedded in the organizational and political system in which the program is created and administered, as well as in stakeholders’ commitment to the importance of the problem and its solution and in the evaluation itself, including its methodologies and particular purposes (e.g., Chambers, 1994; Greene, 2013; House, 1980; Schwandt, 2015). For example, evaluation scholars call attention to how White supremacy, systemic racism, and colonialism have shaped evaluation and suggest ways to address this (e.g., Bowman, 2020; Caldwell & Bledsoe, 2019; Cavino, 2013; Chouinard, 2016; Dean-Coffey, 2018; LaFrance & Nichols, 2008; Stanfield, 1999; Thomas & Campbell, 2020; Thomas et al., 2018). Given that values implicitly or explicitly inform the criteria used to judge programs, this body of scholarship suggests ways in which evaluation can make those values, and their implications, explicit and thereby open to analysis.
To start, evaluation scholarship outlines the decisions required to reach a final assessment or judgment about a program: (a) generating a set of possible criteria and deciding which criteria are relevant (e.g., outputs, outcomes, equity, efficiency, cultural relevance, responsiveness to participants); (b) deciding what benchmark needs to be met on each criterion (e.g., is 60% of participants agreeing that the nonprofit is responsive considered acceptable or is 85%?); (c) deciding how to synthesize the evidence related to multiple criteria (e.g., weighting or ranking system, deliberation and consensus, holistic); and (d) deciding who should make these decisions and how (e.g., outside evaluator, nonprofit managers, funders, other stakeholders; Schwandt, 2015, p. 49). 6 As we note below, in nonprofit evaluation, this valuing process is often implicit with the result that these distinct decisions are never discussed.
Evaluation scholars give consideration to each of these steps. For example, scholars point to the problems that result from using intended program outcomes as a criterion. Intended outcomes tend to reflect the perspectives and interests of decision makers rather than participants and often fail to account for unintended consequences of programs (Abma et al., 2020; Kushner, 2000; Madison, 1992; Mathison, 2005; Scriven, 1991). Some scholars suggest the need to focus on the experience of program stakeholders to understand quality, arguing that looking at intended outcomes tells us little about what is actually going on in a program (Stake, 2004, p. 89). Other evaluation scholars give attention to processes for generating and weighing criteria (e.g., House & Howe, 2000). For example, a prescriptive approach elevates one ethical value, such as equity or social justice, while a more descriptive approach describes and considers all stakeholders’ values without elevating one over another (Shadish et al., 1991, pp. 47–49). 7
Bringing together the nonprofit and evaluation scholarship thus points to the need for explicit attention to values: values of stakeholders, values implicit in social change efforts, and values in organizations themselves. We see several possible lines of inquiry that could guide research on the “valuing of nonprofits.” Given space constraints, we focus here only on three possibilities.
First, nonprofit scholars could theorize more deeply on a non-instrumental understanding of social impact. An instrumental view treats nonprofit organizations as a means to other ends; for example, nonprofits are valuable if they produce a certain number of affordable housing units (Frumkin, 2002; Kramer, 1987). A non-instrumental view recognizes that the process or approach nonprofits take to working with communities can create different kinds of desired outcomes, including those where community leadership is recognized and supported to demand actions by government and other institutions that reflect their concerns (e.g., Dodge & Ospina, 2016; Mosley, 2011; D. D. H. Smith, 1999; S. R. Smith, 2010). Relatedly, nonprofit researchers could further probe not only how values—such as dignity, market logic, or White normativity—infuse organizations (Chen et al., 2013) but also how they shape strategy and social impact (Barman, 2015; Doan & Knight, 2020; Feit, 2019). With a non-instrumental view of nonprofit social impact, researchers might explore conflicts between expressive and instrumental purposes. We already know that too much emphasis on short-term instrumental results can undermine expressive work such as building grassroots community leadership (Benjamin, 2008b), but we could start to examine how other expressive work, such as engaging volunteers, affects the experience of those individuals, families, and communities that are intended to benefit directly from nonprofit initiatives (Horvath, 2020).
Second, nonprofit research can offer a clearer picture of how stakeholders approach questions of valuing, again building on previous work (e.g., Herman & Renz, 1999). Such research requires not only an understanding of how stakeholder values inform criteria but also how stakeholders rank these criteria and how they synthesize evidence to come to a final assessment. Making this process more explicit would enable nonprofit scholars to understand how different stakeholder groups, such as funders and participants or communities, approach this process. For example, prior research has shown that nonprofit leaders can have different priorities for neighborhood development than residents (Kissane & Gingerich, 2004). Other research suggests that nonprofit leaders strategically use the priorities of communities to negotiate criteria with funders (e.g., Ospina et al., 2002). More recent studies find that, as relationships between nonprofits and their funders evolve, funders sometimes relax their criteria in favor of those preferred by the nonprofit (Lall, 2019).
Third, nonprofit scholars could examine the valuing process embedded in sector-level standards. A wide spectrum of standards currently exists, including voluntary codes of conduct (Gugerty, 2009; Kunugi & Schweitz, 1999); “club” standards required for membership in a selective group or association (Gugerty & Prakash, 2010); auditable standards, common in financial accounting but increasingly being adopted for assessing environmental, social, and governance (ESG) behavior (Barman, 2015; Lall, 2017); and standards for shared output and outcome metrics by industry or sector (McCreless et al., 2014). How do these standards condition the criteria viewed as valid? Could standards be used to expand the range of evidence considered valid? If standards are understood as establishing a shared basis for judging value or worth, scholarship might examine the process of standard creation, the content the standards embody, whose values they represent, and how conflicting views and values get surfaced and addressed.
With What Evidence and Methods?
Finally, we turn to the fourth key question of evaluation: What evidence is needed to ascertain that nonprofits are “making a difference”? What are the appropriate methods for gathering and analyzing evidence? At the heart of these questions lies the enduring dilemma of how to establish credible evidence through evaluation. The institutionalized response—that the most credible evidence is that which proves nonprofit initiatives caused measurable results—rewards the use of practices that have been evaluated through experimental methods such as RCTs (Mosley et al., 2019). The rise of evidence registries (such as the Cochrane Library, the Campbell Collaboration, and 3ie) further institutionalizes these methods (Prewitt et al., 2012). Although many nonprofits do not have the scale or resources to undertake experimental studies, the institutionalization of RCTs as a so-called “gold standard” shapes expectations about what constitutes credible evidence (Center for Global Development, 2006; Eyben et al., 2015). 8
This institutionalized view, however, is increasingly being challenged by nonprofit practitioners and scholars who seek to identify methods that are better suited to diverse organizational realities and who argue that experimental methods of impact evaluation are expensive to conduct, take too long to yield timely information, are not suitable for all types of social interventions, are not very helpful for mid-course correction, and ignore racialized power dynamics (Chambers et al., 2009; Dichter et al., 2016; Khagram et al., 2009; Mosley et al., 2019; Trelstad, 2008; Whittle, 2013). Research has also shown that nonprofit leaders lack the capacity to undertake such “rigorous” evaluation of program outcomes. Many organizations lack expertise and resources to invest in evaluating impact (Carman, 2007; Carman & Fredericks, 2010), are not attempting any kind of causal attribution (Hoefer, 2000), and may lack the evaluation “culture” to support evaluation efforts (Mitchell & Berlan, 2016). In short, managers need support in finding the “right fit” between their goals and needs for evidence (Gugerty & Karlan, 2018).
The evaluation literature has arrived at a more pluralistic view on the generation of credible evidence, noting that “all methods are not equally good for all tasks, so the task is to sort out the strengths and weaknesses of methods for different purposes” and further cautioning that “no method is routinely feasible and unbiased, so no study is ever free of flaws” (Shadish et al., 1991, p. 42). 9 Although early evaluation scholars developed quasi-experimental alternatives to randomized control trials (such as matched comparison and regression discontinuity), these methods retained a focus on attribution. Approaches that emphasize causal attribution have been critiqued on the grounds that their positivist underpinnings lead them to oversimplify causality and fail to account adequately for political, social, and institutional context (Chambers et al., 2009; Khagram et al., 2009; Pawson, 2013; Pawson & Tilley, 1997; Virtanen & Uusikylä, 2004). Qualitative approaches to evaluation emerged throughout the 1980s and 1990s, drawing from theoretical traditions including interpretivism, hermeneutics and social constructivism (Schwandt, 2000). These approaches took human action as inherently meaningful, something that must be understood from the actor’s point of view and interpreted in the context in which it is undertaken. This led to a range of more inductive approaches to understanding complex social change processes (e.g., Glaser, 1998; Guba & Lincoln, 1989).
More recently, the evaluation field has devoted increasing attention to methodologies for assessing “contribution” rather than “attribution” (Ebrahim, 2019, pp. 229–236; Kane et al., 2021; Raynor et al., 2021). Contribution-based methodologies are more appropriate for examining social change in complex systems where it is not feasible to create an experiment with a control group, to sufficiently isolate causal mechanisms, or to establish an observable counterfactual (Lemire et al., 2012; Mayne, 2001, 2011, 2012; Rogers, 2007, 2009). A range of such methods have emerged over the years, including contribution analysis, process tracing, outcome harvesting, and outcome mapping (Beach & Pedersen, 2013; Befani & Mayne, 2014; Bennett, 2010; Bennett & Checkel, 2015; Davies & Dart, 2005; Earl et al., 2001; Wilson-Grau & Britt, 2012). This growth in attention to multiple methodologies is part of a larger trend in the evaluation literature toward recognition and respect for diverse approaches, what Cook presciently called “multiplism,” to strengthen findings (Cook, 1985; Greene et al., 2001).
While the field of evaluation has generated relatively pluralistic approaches to the credibility of evidence, some evaluation scholars note that the field still tends to privilege western, colonial, and White-dominant perspectives to the exclusion of the knowledge and perspectives of Black scholars, indigenous scholars, and scholars of color from around the globe (Chouinard, 2016; Shanker, 2019). Evaluation scholarship cautions that the credibility of evidence “is subject to interpretation by the relevant reference group charged with making that judgment” (Schwandt, 2015, p. 73). But who is the relevant reference group—is it professional evaluators, funders, organizational leaders, frontline staff, participants, communities, or some combination? Or, as the renowned development studies scholar Robert Chambers (1999) often asked, “Whose reality counts?” This question in the evaluation field has spawned a movement to reconsider and reclaim the Afro-centric and African American roots of evaluation (Chilisa et al., 2016; Hood & Hopson, 2008) and to develop alternative approaches, including multicultural and culturally relevant evaluation (Bledsoe & Donaldson, 2014; Kirkhart, 2010) and approaches that employ indigenous-centered methodologies (Bowman, 2020; Cavino, 2013; LaFrance & Nichols, 2008; L. T. Smith, 2012).
What do these developments in evidence and methodology suggest for nonprofit research? We identify at least three broad directions for nonprofit scholars. First, scholars can take seriously Shadish et al.’s (1991, p. 42) suggestion “to sort out the strengths and weaknesses of methods for different purposes” with particular attention to the nonprofit organizational context. What methods of evaluation do nonprofits currently use, and why? How can methods be chosen and developed to better align with organizational strategy and recognize the diverse ways nonprofits partner with communities? The evaluation literature suggests that methods aligned with cultural and organizational norms are more likely to feed into decision-making. Recent scholarship provides frameworks and principles to guide nonprofit managers in identifying the best fit among measurement methods (Gugerty & Karlan, 2018) and sorting methods and tools by different stages of funder decision-making (Ebrahim, 2019, pp. 208–240). These efforts just scratch the surface of the growing range of evaluative methods and innovations, indicating a need for nonprofit scholars to examine diverse methods and purposes.
Second, there remains a dearth of quality research on the nature of evidence and methods that reflect the knowledge and perspectives of beneficiaries and communities—particularly marginalized or underrepresented groups including Black, indigenous, and communities of color as well as persons with disabilities (Mertens et al., 1994). What does “credibility” in evidence look like when the lived experiences of beneficiaries and communities tell a different story than the causal inferences drawn from expert-driven evaluations? What constitutes culturally relevant rigor and validity? What can be learned from participatory methodologies in creating credible evidence based on the experiences of frontline staff and clients? Recent research using in-depth qualitative methods uncovers the “hidden work” of social change, substantiating a belief by nonprofit staff that dominant methods do not adequately capture what they do or the difference they make (Benjamin, 2012, 2022; Benjamin & Campbell, 2015; Rayner & Bonnici, 2021). This work provides examples of how participants’ definition of, and pathway to, significant change does not always align with the stated outcomes of the organization. In exploring this line of inquiry, nonprofit researchers might examine related evaluation scholarship that centers on the experience of participants (Abma et al., 2020; Center for Evaluation Innovation, Institute for Foundation and Donor Learning, Dorothy A Johnson Center for Philanthropy, & Luminare Group, 2017; Cousins & Whitmore, 1998; Fetterman, 2005; VanderPlaat, 1995), widens the understanding of validity, and provides more open and responsive methodologies that recognize diverse ways of knowing and alternative forms of evidence (e.g., Cavino, 2013; Griffith & Montrosse-Moorhead, 2014; House, 1980; Thomas & Campbell, 2020).
Recent efforts around beneficiary feedback aim to better understand the experiences of beneficiaries, using methodologies such as “lean data” and “constituent voice” (Dichter et al., 2016; Twersky et al., 2013), but there has been little scholarly study of the effects of such methods, and the basis upon which they establish credible evidence.
Third, and perhaps the least studied, is the range of new methodologies, based on big data and artificial intelligence, for making sense of large, complex data. These approaches do not rely on traditional scientific methods of hypothesis testing for drawing causal inferences. Instead, their strength lies in uncovering relationships through the detection of patterns in large data sets. Their potential value rests in helping us to both describe and understand systems, such as those involving climate change and racial justice, that cannot be broken down into a simple set of causal relationships. Their practical value for social change resides in developing more accurate classifications of change agent populations such as nonprofit organizations (Santamarina et al., 2021) and in identifying levers for intervening in complex systems, despite only a poor understanding of cause and effect (see Burns & Worsley, 2015; Meadows, 2008; Miller & Page, 2007; Siegenfeld & Bar-Yam, 2020). We know little about how to deploy such methods in ways that are feasible for nonprofit leaders, or about the risks involved in using them, particularly their potential to reify past patterns of bias.
Conclusion: Toward a Meso-Level Theory of Nonprofit Evaluation
In this article, we set out to bring together the scholarship on evaluation with the literature on nonprofit organizations. Our aim was to propose a research program that we believe will lay the groundwork for a meso-level theory of nonprofit evaluation—to galvanize research that supports nonprofit leaders in productively engaging evaluation to advance their abilities to contribute to social change. This is an applied agenda for research and theory-building, responsive to the practical dilemmas facing nonprofit leaders and their organizations as they seek to generate social impact. Our research agenda is explicitly and normatively motivated by a desire for enabling greater agency—of nonprofit leaders, staff, communities—to pursue evaluation that better meets their needs and ultimately results in greater social impact for communities.
We have only just begun mapping the contours of such a research program. The four key questions that form the center of evaluative practice—what to evaluate, for what purpose, using which criteria, and with what evidence and methods—provide the basic scaffolding for the research program necessary to develop this body of theory. Each component of this scaffold (our research agenda) is grounded in the practice dilemmas that nonprofit leaders must address in deploying evaluation: defining the unit of analysis, addressing competing uses and demands, working with multiple criteria, and establishing the credibility of evidence. We hope that our effort to explicate these dilemmas, and the questions for future research that we pose, will open up new ways of seeing, valuing, and using evaluation as a plural set of approaches and mindsets in the service of social change and equity.
In building on this scaffolding, our research program offers several analytical supports toward constructing a meso-level theory of nonprofit evaluation. First, our research program is centered on the organization, rather than the program or project. Instead of one overarching approach to organizational evaluation, we envision approaches that are attentive to the specific needs of different types of nonprofit organizations. Second, a nonprofit theory of evaluation takes seriously the values-based core of nonprofits, inquiring not just about the instrumental results that organizations might produce, but also about the expressive values embodied in how they engage constituents and thus shape the experience of dignity and status of individuals and communities. Third, such a theory recognizes and values knowledge that goes beyond evaluation experts to center the knowledges of participants, as well as those of managers and staff. If nonprofit organizations can be understood as sites and vehicles of community-building, identity creation, and agency in society, then we need an epistemology of evaluation that goes beyond the expertise of the evaluator. And finally, a theory of nonprofit evaluation requires explicit attention to issues of use: how data and information are produced, how and by whom they are used, and who ultimately owns, controls, and benefits from evaluative knowledge.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
