Abstract
Scholars have used both quantitative and qualitative approaches to empirically study nonprofit roles. Mission statements and program descriptions often reflect such roles; however, until recently, collecting and classifying a large sample has been labor-intensive. This research note uses data on United Ways that e-filed their 990 forms and supervised machine learning to illustrate an approach for classifying a large set of mission descriptions by roles. Temporal and geographic variation in roles detected in mission statements suggests that such an approach may be fruitful in future research.
Scholars have ascribed a variety of roles to nonprofits—roles associated with organizational identity, purpose, practice, and outcomes. Roles attributed to organizations across the nonprofit sector include service provider, civic intermediary, and innovator (Eikenberry & Kluver, 2004; Frumkin, 2009; LeRoux, 2009; Levine Daniel & Fyall, 2019; Moulton & Eckerd, 2012; Salamon, 1987). Research also identifies roles particular to nonprofit fields and subsectors (Brilliant & Young, 2004; Crutchfield & Grant, 2012; Young, 2000). A variety of environmental and organizational factors shape nonprofit roles at the sector, subsector, and field levels of analysis (Eikenberry & Kluver, 2004; LeRoux, 2009; Maier et al., 2016; Moulton & Eckerd, 2012; Salamon, 2003; Young, 2001). Such internal and external pressures can lead to role-conflict at the organizational level and/or undermine the nonprofit sector’s capacity to fulfill roles critical to democratic governance (Eikenberry & Kluver, 2004; Levine Daniel & Fyall, 2019). Enhancing our understanding of variation in nonprofit roles is therefore consequential for academic research and nonprofit practice.
Researchers have used a variety of methods ranging from quantitative analysis of survey data (LeRoux, 2009; Moulton & Eckerd, 2012) to qualitative coding of interviews and archival documents (Barman, 2002, 2006; Brilliant & Young, 2001, 2004) to classify nonprofits by roles. Such data collection and classification approaches are labor-intensive, which often limits the geographic and/or temporal range of the resultant analysis. This research note leverages newly available digital data from the Internal Revenue Service (IRS) and supervised machine learning (ML) to examine the viability of an alternative approach. We pose the following methods-focused research questions: (a) Can 990 mission descriptions be used to study nonprofit roles? (b) Can supervised ML be used to classify mission descriptions by roles? and (c) Does geographic and/or temporal variation in roles suggest that using mission descriptions for role classification may be a fruitful approach in future research?
To address these questions, we use publicly available 990 data released by the IRS through Amazon Web Services (AWS) and draw on previous research describing roles across the United Way (UW) system. The AWS data include mission descriptions and other organizational variables for nonprofits that e-filed their 990 form after 2009 (IRS 990 Filings, 2020). To classify mission descriptions by roles, we use supervised ML, drawing on the guidelines and adapting the R code developed by Kobayashi et al. (2018). While descriptions of ML for text classification have been published in political science, management, and public administration journals (Anastasopoulos & Whitford, 2019; Grimmer & Stewart, 2013; Kobayashi et al., 2018), and while automated approaches have been used to classify nonprofits by service area (Fyall et al., 2018; Litofcenko et al., 2020; Ma, 2021), no nonprofit research examines ML for nonprofit role classification.
This research note makes several contributions. First, it illustrates that mission descriptions can be used to study nonprofit roles. Second, it demonstrates how ML can be used to classify mission descriptions by roles. By describing the methodological process and some of the major pitfalls and limitations of using supervised ML to classify mission descriptions by roles, it provides a procedural roadmap for future research. Finally, it shows that roles detected in mission descriptions show geographic and temporal variation and suggests avenues for future research. It begins with an overview of research on nonprofit roles and ML for text classification before describing the methodological approach and findings of this study.
Nonprofit Roles
The nonprofit roles described in existing research fall into two broad categories: societal roles ascribed to organizations across the sector, and subsector or field-level roles associated with a specific group of nonprofits. Societal roles are associated with the identity and function of organizations across the sector and often serve to differentiate nonprofits from public and for-profit organizations. Examples of societal roles include nonprofits as service providers; innovators; political advocates for stakeholders and potential adversaries of government and private firms; civic intermediaries working to educate, mobilize, and assimilate citizens to democratic norms; venues for the expression of individual values; and hotbeds of social capital creation (Anheier, 2009; Eikenberry & Kluver, 2004; Frumkin, 2009; Gordon & Babchuk, 1959; LeRoux, 2009; Levine Daniel & Fyall, 2019; Light, 1998; Putnam, 2000; Putnam et al., 1994; Salamon, 1987; Weisbrod, 1977; Young, 2000). Research suggests that the societal roles played by nonprofits vary temporally (Eikenberry & Kluver, 2004; Hall, 2006), geographically (Putnam et al., 1994; Young, 2000), and based on organizational characteristics (LeRoux, 2009). Societal role variation has implications for social cohesion, democratic participation, and the realization of important public values (Eikenberry & Kluver, 2004; LeRoux, 2009; Levine Daniel & Fyall, 2019; Putnam, 2000; Putnam et al., 1994).
Research on the Environmental Defense Fund (EDF) by Crutchfield and Grant (2012) offers an example of the second category of nonprofit roles: those associated with organizations in a particular field or subsector. In the 1970s, the EDF and other environmental nonprofits understood their primary roles to be those of legal challengers and aggressive advocates. However, in the mid-1980s, the EDF began to pioneer the new roles of market-based problem-solver and cross-sector collaborator by developing corporate partnerships and the first cap-and-trade approach to clean air protection. At first, other environmental organizations criticized the EDF for “selling out.” Over time, however, the roles pioneered by the EDF gained legitimacy. This evolution is evidenced by the fact that the Sierra Club, which initially criticized the EDF for embracing new roles, now includes the following language in its mission statement:
We partner with individual and institutional donors to align financial resources with strategic outcomes, provide flexible funding for innovation, build capacity in the environmental movement, and create partnerships with a broad spectrum of allied organizations around shared values and goals (Mission, 2020).
The work of Crutchfield and Grant (2012) illustrates that the study of role variation across a subsector or field over time can provide important insight into changes in nonprofit identity, practice, and impact.
This discussion highlights the importance of research on nonprofit roles for understanding nonprofit distinctiveness, the functions of the sector and change at the field and organizational levels of analysis. Much of the existing research on nonprofit roles identifies and broadly describes societal and field-level roles. Fewer studies categorize nonprofits by roles and those that do so often have limited temporal or geographic scope due to the labor-intensive data collection and analysis methods employed. For example, LeRoux (2009) and Moulton and Eckerd (2012) categorize nonprofits by roles using cross-sectional survey data collected from a subset of nonprofits in a single state. Barman’s (2002, 2006) qualitative examination of United Way (UW) roles relies on historical archives to highlight field-level role variation over time, but limits its organizational-level analysis to two communities. This research note examines the viability of using supervised ML to classify organizations by roles across a large dataset.
To do so, we look to previous research on roles ascribed to organizations in the UW system. The UW Worldwide is the focal organization of the UW system, which comprises approximately 1,800 local UW organizations around the globe (United Ways Worldwide, 2017). Each local UW is a federated fund, an umbrella organization that collects community contributions, and re-allocates them as grants to human service nonprofits. At its zenith, the system distributed close to US$4 billion each year, making it the most significant private funder of human services in many local communities across the United States (Gronbjerg et al., 1996; Hall, 2006). Over the past 20 years, however, environmental challenges have led the UW Worldwide to push for change across the system, shaping the roles espoused by local UWs (Barman, 2002; Brilliant & Young, 2004; Young, 2001).
Existing research on local UW roles primarily uses archival and interview data (Barman, 2002, 2006; Brilliant & Young, 2004; Paarlberg & Ghosh Moulick, 2017; Paarlberg & Meinhold, 2012; Pfeffer & Leong, 1977; Provan, 1982; Young, 2001). Here, we illustrate the potential of ML to classify a large set of mission descriptions by three identity-based roles attributed to federated funds, including local UWs (Brilliant & Young, 2004). It is important to note that these roles are not mutually exclusive.
Fiscal Intermediary
In this role, local UWs eliminate duplication and competition in community-wide fundraising campaigns by efficiently collecting and re-allocating community resources to human service nonprofits.
Regulator
In this role, local UWs serve as a kind of “good housekeeping seal of approval” for human service nonprofits. As regulators, local UWs are watchdogs that evaluate the organizational performance of human service nonprofits.
Community Problem-Solver
In this role, local UWs identify and prioritize needs at the community level. Traditionally, UW managers conducted a formal needs assessment. Over time, practice has evolved, and many local UWs now work collaboratively with community stakeholders.
Brilliant and Young’s (2004) framework was selected for this research because of the clarity of the role descriptions it provides, its relevance to the large UW system, and the fact that UW system changes may lead to variation in roles articulated in mission descriptions. From this description of the roles we use to classify mission descriptions, we now turn to an overview of ML for text classification.
Supervised ML for Text Classification
ML encompasses a set of approaches used to extract information from large datasets using statistical and mathematical techniques and pattern recognition technologies (Larose & Larose, 2014). An important application of ML is text classification, where texts may be full-length documents, paragraphs, sentences, and so on. Both rule-based and ML approaches to text classification involve computer algorithms—structured procedures executed by a machine. Rule-based approaches, such as automated content analysis, use classification rules developed by humans and translated into computer code. For example, Fyall et al. (2018) used a dictionary of keywords to classify mission statements and identify housing and shelter nonprofits. With ML, an algorithm inductively generates classification rules based on patterns in the data. In this sense, ML involves computers “learning” rules for classification (Kobayashi et al., 2018).
The ML approaches most widely used in social science research can be categorized as unsupervised or supervised (Grimmer & Stewart, 2013). Supervised approaches, like the ones examined here, require a subset of labeled (pre-classified) data on which to train an ML algorithm and examine its performance (James et al., 2013). Once an algorithm can effectively classify labeled data, it is then used to predict the classes of unlabeled data. To avoid biasing the algorithm during training, it is important that the labeled subset be representative and that labels be consistently applied during the labeling process. Supervised ML thus requires a substantive understanding of sampling and qualitative coding (Grimmer & Stewart, 2013).
While supervised ML for text classification has been effectively employed by researchers from other disciplines (Anastasopoulos & Whitford, 2019; Grimmer & Stewart, 2013; Kobayashi et al., 2018), it has not been widely used by nonprofit scholars. We now illustrate an application of supervised ML to facilitate future nonprofit research: the classification of mission descriptions by nonprofit roles.
Data and Method
The data used for this study come from IRS 990 forms electronically filed by local UWs between 2010 and 2016 (IRS 990 Filings, 2020; Perry, 2015). These data include the filing year and each organization’s name, unique employer identification number (EIN), address, and response to the prompt, “Briefly describe the organization’s mission or most significant activities.” Local UW organizations were identified using EINs scraped from the website of the United Ways Worldwide (2017) in March. Machine-readable data on local UWs that e-filed their 990 form between 2010 and 2016 were extracted from AWS using the Open-Data-for-Nonprofit-Research codebook (Lecy & Grasse, 2019). Although it is not a focus of this research note, an important consideration when using these data is sampling bias stemming from the fact that not all nonprofits file electronically. Organization addresses were geocoded (Texas A&M Geoservices Services, 2020) and their proximity to the 100 most populous U.S. cities was calculated using the law of cosines (great-circle distance). This process resulted in a database containing 4,891 observations of the name, filing year, mission description, and large-city proximity of local UWs that e-filed with the IRS between 2010 and 2016. Figure 1 visualizes the phases of text classification described below, which were implemented by adapting code in the programming language R made publicly available by Kobayashi et al. (2018).

Figure 1. Diagram of text classification process adapted from Kobayashi et al. (2018).
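The large-city proximity calculation mentioned in this section can be illustrated with a minimal Python sketch of the spherical law of cosines. It is not the code used in the study: the function names, the Earth-radius constant, and the example coordinates are assumptions made for illustration, and the 25-mile threshold is taken from the geographic analysis reported later in this note.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, not taken from the study


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles via the spherical law of cosines."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lambda = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lambda))
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_MILES * math.acos(cos_angle)


def near_large_city(lat, lon, cities, threshold_miles=25.0):
    """True if the point lies within threshold_miles of any (lat, lon) in cities."""
    return any(great_circle_miles(lat, lon, c_lat, c_lon) <= threshold_miles
               for c_lat, c_lon in cities)


# Hypothetical coordinates, for illustration only.
example_cities = [(40.1, -75.0)]
is_near = near_large_city(40.0, -75.0, example_cities)
```

In practice, `cities` would hold the geocoded coordinates of the 100 most populous U.S. cities, and each local UW would receive a binary proximity indicator.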
Phase I: Hand-Labeling Random Sample of Mission Descriptions
To create labeled data for training purposes, a random sample of 800 observations was drawn across data pooled from 2010 to 2016. The question of how large a labeled sample is needed should be informed by practical considerations (e.g., data acquisition costs, the opportunity cost of hand-coding) and empirical assessment of ML model performance (Kobayashi et al., 2018). The decision to label a sample across all years rather than within each year was made to maintain label consistency across the period of study.
Each mission description was labeled (hand-coded) for the presence or absence of each of the roles described earlier (i.e., fiscal intermediary, regulator, and community problem-solver); an observation was labeled “1” if its mission description reflected a role, and “0” otherwise. Each role was coded separately since roles are not mutually exclusive. I took a directed approach to labeling by using roles identified a priori based on existing research; however, future work could take a grounded approach (Hsieh & Shannon, 2005). Table 1 presents exemplary statements and incidence of the fiscal intermediary and community problem-solver roles. The regulator role was present in less than 8% of the mission descriptions, making the role classes (0/1) highly unbalanced and unsuitable for classifier training. We therefore do not attempt to use supervised ML to classify by the regulator role.
Table 1. Exemplary Statements and Incidence of Roles.
Phase II: Text Preprocessing, Transformation, Reduction, and Weighting
Before textual data can be used to train an ML algorithm, it should be preprocessed to remove irrelevant information that could make it more difficult to detect meaningful patterns (see Kobayashi et al., 2018). Choices about what constitutes “irrelevant information,” however, must be informed by the data and research questions at hand. For example, the first step in preprocessing—tokenization—is usually implemented at the word level by splitting text into component words, punctuation, numbers, and so on. When text is tokenized at the word level, information about the meaning of words-in-context is lost. Alternative approaches that retain more information about words-in-context include tokenization based on linguistic information (labeled parts of speech, including nouns, verbs, etc.) or n-grams (combinations of words that frequently co-occur such as the bi-gram “united, way”). Researchers should consider tokenizing at the parts of speech or n-gram level if it better addresses the research question or if classification performance after word-level tokenization is poor (Kobayashi et al., 2018).
For this study, mission descriptions were tokenized at the word level and further preprocessed to make all text lowercase and to remove punctuation, extra spaces, and an extended list of stop words. Conventional stop words such as “the” and “to,” and others specific to the data (“united” and “way”), were removed since they occur frequently and do not help to distinguish role classes from one another. Numbers were also removed in preprocessing after substituting “nonprofit” for “501(c)3” across the dataset. Preprocessing may also involve stemming, that is, reducing words to their root to normalize text (e.g., “calculate” and “calculating” would both become “calcul”). The effect of stemming English words, however, may be negligible or even detrimental for classification purposes (see Kobayashi et al., 2018). I therefore trained separate models to compare the effect of using word-stems versus unstemmed words.
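The preprocessing steps described above can be sketched in a few lines of Python. This is an illustrative approximation rather than the R code adapted for the study: the stop-word list is a short hypothetical stand-in for the extended list actually used, the regular expression for the tax-status code is an assumption, and stemming is omitted.

```python
import re
import string

# Hypothetical short stop-word list; the study used an extended list not reproduced here.
STOP_WORDS = {"the", "to", "and", "of", "in", "a", "for", "united", "way"}

# Maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, substitute the tax code, drop numbers and punctuation, remove stop words."""
    text = text.lower()
    # Substitute before numbers are stripped so the tax-status information survives.
    text = re.sub(r"501\s*\(c\)\s*\(?3\)?", "nonprofit", text)
    text = re.sub(r"\d+", " ", text)
    text = text.translate(_PUNCT_TO_SPACE)
    tokens = text.split()  # word-level tokenization; split() also collapses extra spaces
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The United Way raises funds for 501(c)3 agencies in 12 counties.")
```

The order of operations matters: replacing “501(c)3” after number removal would leave only the fragment “c” behind.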
The full corpus (collection) of mission descriptions was then transformed so that it could be used to train multiple classification algorithms. Transformation involves creating a document-term matrix (DTM). In a DTM, each row corresponds to a document (mission description), each column corresponds to a word/stem, and each cell contains the number of times a word/stem appears in a mission description. DTMs are typically very large and sparse. It is often helpful to reduce sparsity by removing columns with mostly empty cells. This step is justified since words/stems that only appear once or twice across a corpus are unlikely to be helpful in classification and may increase computation time (see Kobayashi et al., 2018). DTMs were reduced by removing words/stems that appeared in less than 1.5% of mission descriptions resulting in matrices containing either 159 word-stems or 171 unstemmed words. The relatively small number of words/stems remaining after sparsity reduction has to do with the notable level of language consistency across the UW system.
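A minimal Python sketch of this transformation and sparsity reduction follows. The function name and the toy corpus are hypothetical; only the 1.5% document-share threshold comes from the text above.

```python
from collections import Counter


def build_dtm(token_lists, min_doc_share=0.015):
    """Return (vocabulary, rows), keeping terms found in at least min_doc_share of documents."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for doc_tokens in token_lists:
        doc_freq.update(set(doc_tokens))  # count each term once per document
    vocab = sorted(term for term, df in doc_freq.items()
                   if df / n_docs >= min_doc_share)
    column = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc_tokens in token_lists:
        row = [0] * len(vocab)
        for term in doc_tokens:
            if term in column:
                row[column[term]] += 1  # raw count of the term in this document
        rows.append(row)
    return vocab, rows


# Toy corpus of three already-tokenized mission descriptions (hypothetical).
docs = [["funds", "community", "funds"], ["community", "health"], ["community"]]
vocab, rows = build_dtm(docs, min_doc_share=0.3)
```

Raising `min_doc_share` removes more rare terms and so yields a smaller, denser matrix.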
Finally, the DTM was weighted in two distinct ways. In binary format, each cell is weighted “1,” if a word was present in a mission description and “0,” if absent. In term frequency-inverse document frequency (TF-IDF) format, each cell is weighted by the word’s frequency in the document (mission description) multiplied by the inverse of the word’s document frequency, the number of documents across the corpus containing the word. TF-IDF weighting helps identify keywords in each document and can sometimes increase classification accuracy (Kobayashi et al., 2018).
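Both weighting schemes can be sketched in Python as follows. The TF-IDF variant shown, raw term count multiplied by the natural log of the number of documents divided by document frequency, is an assumption: it is one common formulation, and the exact variant used in the study is not specified in the text.

```python
import math


def binary_weight(rows):
    """Replace each count with 1 if the term is present in the document, else 0."""
    return [[1 if count > 0 else 0 for count in row] for row in rows]


def tfidf_weight(rows):
    """Weight each count by log(N / document frequency); an assumed TF-IDF variant."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    return [[count * idf[j] for j, count in enumerate(row)] for row in rows]


# Toy document-term matrix: three documents, three terms (hypothetical counts).
counts = [[1, 2, 0], [1, 0, 1], [1, 0, 0]]
binary = binary_weight(counts)
tfidf = tfidf_weight(counts)
```

Under this variant, a term that appears in every document receives a weight of zero, which reflects the intuition that ubiquitous words do not help distinguish one mission description from another.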
Phase III: Classifier Training
The weighted matrices were then used to train a series of algorithms to classify mission descriptions by the presence or absence of each role. Since the roles are not mutually exclusive, I trained separate classifiers for each role. I also followed best practice by training several different types of classifiers to determine those with the best performance (Kobayashi et al., 2018). Naïve Bayes, support-vector machine, and random forest classification algorithms were trained on both binary and TF-IDF-weighted matrices constructed using either word-stems or unstemmed words. This allowed me to compare performance across 12 different classification models for each role. It is beyond the scope of this research to provide a detailed description of how each type of classification algorithm works (see Hastie et al., 2009; James et al., 2013; Kobayashi et al., 2018). Broadly, however, naïve Bayes classifiers are probabilistic, support-vector machines are geometric, and random forests are enhanced logical classifiers. These classifier types were selected because they work well on textual data (Kobayashi et al., 2018).
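The 12 models per role follow from crossing three algorithms with two token forms and two weighting schemes. The short Python sketch below simply enumerates that design; the labels are hypothetical identifiers, not names drawn from the study's code.

```python
from itertools import product

# Hypothetical labels for the design factors described in the text.
ALGORITHMS = ["naive_bayes", "support_vector_machine", "random_forest"]
TOKEN_FORMS = ["stemmed", "unstemmed"]
WEIGHTINGS = ["binary", "tfidf"]


def model_grid():
    """Enumerate every algorithm x token form x weighting combination."""
    return [{"algorithm": a, "tokens": t, "weighting": w}
            for a, t, w in product(ALGORITHMS, TOKEN_FORMS, WEIGHTINGS)]


grid = model_grid()  # 3 x 2 x 2 = 12 configurations per role
```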
In supervised ML, k-fold cross-validation is considered a systematic strategy to assess the performance of a classifier model (Hastie et al., 2009; James et al., 2013; Kobayashi et al., 2018). Labeled data are partitioned into k parts; a model is then trained on k-1 parts and used to predict the class of observations in the remaining part, or “test” set. This process is repeated until each of the partitions has been used as the test set. The extent to which the model accurately predicts the known class of observations in each of the k test sets can then be assessed and averaged to provide a measure of model performance. What is more, the k-fold cross-validation process can be repeated, and performance indicators averaged over all distinct partitionings, to provide an even more robust measure of model performance. We follow Kobayashi et al. (2018) and use the results of 20 repetitions of fivefold cross-validation to assess model performance based on F-score and balanced accuracy performance indicators.
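The partitioning logic of repeated k-fold cross-validation can be sketched in Python as follows. This is a simplified illustration, not the procedure implemented in the adapted R code: the function name and random seed are assumptions, and the folds are not stratified by class.

```python
import random


def repeated_kfold(n_items, k=5, repeats=20, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_items))
        rng.shuffle(order)  # a fresh random partitioning for each repetition
        folds = [order[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


# 800 labeled observations, fivefold cross-validation repeated 20 times.
splits = list(repeated_kfold(800, k=5, repeats=20))
```

A classifier would be fit on each `train` set and scored on the matching `test` set, with the resulting performance indicators averaged over all splits.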
Phase IV: Classifier Evaluation and Application
Both F-score and balanced accuracy are global measures of model performance that range from 0 to 1 (Kobayashi et al., 2018). They are not the only measures of model performance, but are used here because they provide insight into the extent to which a model correctly classifies and misclassifies roles; the F-score considers false positives and false negatives, while balanced accuracy averages the model’s true positive and true negative rates. Figures 2 and 3 display the F-score and balanced accuracy for support-vector machine, naïve Bayes, and random forest classifiers trained on DTMs containing either word-stems or unstemmed words with either binary or TF-IDF weighting.
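For readers unfamiliar with these indicators, the Python sketch below computes both from binary labels. It assumes that “F-score” refers to the balanced F1 variant and that balanced accuracy is the mean of sensitivity and specificity; the labels and predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f_score(y_true, y_pred):
    """F1: harmonic mean of precision and recall (assumed variant)."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical hand labels and model predictions for eight mission descriptions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
f1 = f_score(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

Because balanced accuracy weights the two classes equally, it is less flattering than raw accuracy when one class dominates, as with the role classes examined here.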

Figure 2. Classifier performance for fiscal intermediary role.

Figure 3. Classifier performance for community problem-solver role.
Support-vector and random forest models trained on all types of matrices are consistently strong, with no F-score or balanced accuracy indicator lower than .75 and some above .9. Based on an evaluation of performance measures, we choose to classify unlabeled mission statements using random forest models trained on matrices of unstemmed words with TF-IDF weighting.
Using the fully classified dataset, we assess temporal and geographic variation in roles using contingency tables and chi-square tests of independence. Table 2 presents the distribution of roles espoused in mission descriptions from 2010 to 2016.
Table 2. Temporal Role Variation.
Separate tests of independence at the 95% confidence level were conducted for each role over this period. The results suggest that both the fiscal intermediary and community problem-solver roles are independent of year.
To examine geographic variation, we considered roles and organizational proximity to a large city. Table 3 presents the distribution of roles espoused in mission descriptions by local UWs located within 25 miles of, or at a greater distance from, the 100 most populous U.S. cities. The results of the tests of independence at the 95% confidence level suggest that the fiscal intermediary role appears less, and the community problem-solver role appears more, in mission descriptions of urban UWs than in those of rural UWs.
Table 3. Geographic Role Variation in 2015.
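The chi-square test of independence used for the temporal and geographic comparisons can be sketched in Python as follows. The counts in the example are hypothetical and are not the values reported in Tables 2 or 3. The p-value helper applies only to tables with one degree of freedom, such as a 2 × 2 comparison of role presence by large-city proximity.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail p-value for a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: rows = near / far from a large city, columns = role present / absent.
example_table = [[10, 20], [20, 10]]
stat, dof = chi_square_independence(example_table)
p_value = p_value_one_dof(stat)
```

A p-value below .05 would lead to rejecting independence at the 95% confidence level. For the temporal comparison across seven filing years, the table has more than one degree of freedom and would require the general chi-square distribution rather than this one-degree-of-freedom shortcut.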
Findings and Discussion
The results of the hand-coding process suggest that two of the three roles ascribed to local UWs are reflected in 990 mission descriptions. Mission description language did not consistently reflect the role of regulator, so no ML models were trained to classify by this role. These findings provide a qualified affirmative response to our first methodological research question: Can 990 mission descriptions be used to study nonprofit roles? The relative absence of the regulator role in UW mission descriptions does not mean that organizations are not playing this role in practice, simply that they are not espousing this role in 990 mission descriptions. This suggests that future research should be cognizant that some roles are less likely to be articulated in mission descriptions than others. In addition, this highlights the importance of using clearly defined roles to label training data. Future research that draws on roles that are not as concretely defined as those in the Brilliant and Young (2004) framework, or where roles are clearly defined but difficult to disambiguate in labeling, may have trouble implementing a supervised ML approach to role classification.
Our second methodological research question asks: Can supervised ML be used to classify mission descriptions by roles? The performance of our random forest and support-vector models trained using various input matrices is more than acceptable; we can expect these models to perform well on data similar to, but distinct from, the labeled subset (Hastie et al., 2009; James et al., 2013). These findings suggest that supervised ML can be used to classify mission descriptions based on the presence or absence of nonprofit roles. This finding comes with caveats, however.
First, while the performance of our random forest and support-vector models is strong, performance indicators for our naïve Bayes classifiers are variable. This highlights the importance of training several different types of classifiers and comparing performance metrics when using ML for text classification. Second, Figures 2 and 3 suggest that the terms and weights of input matrices used for training can affect classifier performance as well. Therefore, researchers should train classifiers using a variety of matrices containing both word-stems and unstemmed words and weighted in binary and TF-IDF format to have a better chance of identifying a high-performing model. Finally, models classifying based on the fiscal intermediary role were consistently stronger than those for the community problem-solver role. This may be related to the fact that the fiscal intermediary role was detected in 77.6% of labeled mission descriptions while the community problem-solver role was detected in only 33.3%. This suggests that the viability of using supervised ML for role classification may decrease if the role is reflected in fewer mission descriptions. In such a case, rule-based algorithms or hand-coding may prove more reliable.
While it is beyond the scope of this research note to develop and test hypotheses related to espoused roles across the UW system, our final methodological research question asks: Does geographic and/or temporal variation in roles suggest that using mission descriptions for classification may be a fruitful approach in future research? Previous work suggests that the UW is undergoing important changes (Barman, 2002, 2006; Brilliant & Young, 2004; Paarlberg & Ghosh Moulick, 2017; Paarlberg & Meinhold, 2012; Young, 2001), which could lead to variation in roles espoused by local UWs. Our exploratory analysis suggests that roles espoused by local UWs may vary geographically but not temporally. The local contextual factors shaping this variation (and lack thereof) will be interesting to examine in more detail in future research. This work underscores that future research can be facilitated by using supervised ML to classify mission descriptions by roles.
Conclusion
Like any research, this methodological study has several limitations. First, the methods described in this study cannot currently be implemented in many software packages, such as SPSS and Stata; however, they can be implemented in the open-source programming languages R and Python. Second, as we have tried to highlight throughout this manuscript, choices regarding selection and labeling of training data, text preprocessing and DTM weighting, and classifier and model selection are consequential. These “technical” choices informed by performance indicators are also shaped by human priorities and understandings of causation and data generation processes.
This study makes several contributions. First, it provides empirical evidence that 990 mission descriptions can be used to study nonprofit role variation. Second, it illustrates how supervised ML can be used to streamline mission description classification by roles and points to important technical considerations. The methods employed in this study, which are suitable for large datasets containing textual data, can facilitate future empirical research and increase our understanding of nonprofit role variation and its relationship to organizational identity, purpose, practice, and impact.
Acknowledgements
The author would like to acknowledge her dissertation committee for their support in developing an early version of this research note. She would also like to thank Laurie Paarlberg for her feedback as she prepared chapters of her dissertation for submission to academic journals.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
