Abstract
Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.
Text mining is increasingly being used in organizational research and practice because up to 80% of organizational data are stored as unstructured, natural language text (Grimes, 2008). Researchers have used closed vocabulary text mining over the past three decades to summarize text data by counting conceptually related words and phrases to score constructs (e.g., entrepreneurial orientation; Short et al., 2010). Recently, new methods and open source software have emerged that can use any words and/or phrases in text as the unit of analysis, known as open vocabulary text mining (Schwartz, Eichstaedt, Kern, et al., 2013) and, more broadly, natural language processing (NLP; Oswald et al., 2020). Open vocabulary text mining has garnered interest in organizational research and practice (e.g., Li, 2019) because it can be used to improve employee selection and assessment (Campion et al., 2016; Sajjadiani et al., 2019; Speer, 2018) and advance organizational theory by helping researchers uncover emergent processes, develop new classification systems, and analyze social and cultural dynamics (Hannigan et al., 2019). Both closed and open vocabulary text mining represent methods of analyzing natural language text data that save considerable time and expense compared to traditional, manual coding of text.
The emergence of open vocabulary text mining has increased the need for and prevalence of preprocessing natural language texts before analysis. Preprocessing involves transforming text prior to analysis by identifying which units (e.g., words and phrases) to use (i.e., tokenize), removing content that is irrelevant for some tasks (i.e., remove nonalphabetic characters and stop words), agglomerating semantically related terms to decrease data sparsity and increase predictive power (i.e., lowercase conversion, correct misspellings, expand contractions/abbreviations, and stem/lemmatize), and increasing the amount of semantic information that is captured (i.e., handling negation). However, this means that preprocessing can also remove useful information (e.g., removing stop words when they are relevant to one’s research question), introduce errors into analysis (e.g., when stemming conflates semantically distinct words), and drastically alter subsequent results (Boyd, 2016).
Despite its importance, text preprocessing is often glossed over and poorly reported in text mining research (Fokkens et al., 2013). For example, the vast majority of closed vocabulary text mining studies do not report conducting any preprocessing (see Table 4), although some closed vocabulary software packages (e.g., Linguistic Inquiry and Word Count) recommend that researchers preprocess texts to enhance measurement precision (Pennebaker, Booth, et al., 2015). This is particularly concerning because preprocessing may affect the reliability and validity of results in the same way that researcher degrees of freedom (Simmons et al., 2011) can alter research findings during meta-analytic coding (Wanous et al., 1989), qualitative research (Jonsen et al., 2018), and the screening of quantitative responses (Meade & Craig, 2012). To further complicate things, past research has provided conflicting preprocessing recommendations. For instance, organizational researchers have offered four distinct recommendations for stemming during open vocabulary text mining: Always stem except on short documents (Kobayashi et al., 2018a), only stem in small corpora (Kern et al., 2016), only stem if it does not reduce classification/prediction accuracy (Kobayashi et al., 2018b), and cautiously consider stemming only after conducting topic modeling (Banks et al., 2018).
This article seeks to improve current text preprocessing practices, resolve these conflicting preprocessing recommendations, and provide reporting guidelines for text mining research. First, we advance our understanding of the conceptual and empirical considerations that should drive preprocessing decisions. We do so by considering whether one’s research question is best addressed by closed or open vocabulary text mining, if it necessitates capturing what people communicate and/or how they communicate (i.e., the content and style of speech; Pennebaker et al., 2003), and the characteristics of one’s corpus (i.e., corpus size and average document length). Then, when describing the preprocessing techniques, we discuss how each technique affects statistical power, measurement validity, and one’s ability to extract various textual features.
Second, we review text preprocessing comparison research in computational linguistics (a subfield of linguistics and computer science that is responsible for the development of open vocabulary text mining) to derive preprocessing best practices. We then use those findings to critically review the preprocessing practices in organizational research. Conducting these two reviews together advances our understanding of whether organizational scholars follow preprocessing best practices.
Third, we use the information gleaned from these reviews to provide recommendations for conducting text mining preprocessing, as summarized in Figure 1, and reporting text mining research, as summarized in Table 5. Such recommendations have been provided for established research methods (e.g., surveys, meta-analysis; APA Publications and Communications Board, 2008) but not yet for text mining. Our preprocessing recommendations aim to be comprehensive, but we recognize there are situations where specific research questions or unique data sets may necessitate deviating from these recommendations. As the Linguistic Inquiry and Word Count (LIWC) operator manual notes, during preprocessing, one should “keep in mind what your goals are in analyzing the data” (Pennebaker, Booth, et al., 2015, p. 15). Notably, our recommendations are informed by consideration of later stages of the text mining process, including which features will be extracted (i.e., open or closed vocabulary features) and the nature of one’s research question, because the optimal preprocessing techniques may differ for different textual features (e.g., n-grams, topic models, closed vocabulary dictionaries) and goals (e.g., investigating trait-like or state-like variables). Overall, our paper aims to improve the validity of text mining results, researchers’ understanding of the effects of preprocessing, and the transparency and replicability of text mining research.

Figure 1. Preprocessing recommendations for open and closed vocabulary text mining.
Conceptual and Empirical Considerations for Text Preprocessing
Before describing the various preprocessing techniques, it is essential to understand why and when preprocessing may affect text mining results. Specifically, preprocessing should first be considered in the context of one’s research question: Is the research question best addressed by using closed and/or open vocabulary text mining, and is the research question relevant to the content and/or style of speech? Second, preprocessing should be considered in the context of one’s corpus: Is the corpus large or small (i.e., how many documents are in the corpus), and how long, on average, are the documents in the corpus? Text mining terms relevant to this discussion are defined in Table 1.
Table 1. Definitions of Terms Relevant to Text Mining and Preprocessing.
Research Question
Closed and/or Open Vocabulary Text Mining
When one has a corpus of natural language text (i.e., a collection of documents), it is possible to extract features using either closed vocabulary text mining, open vocabulary text mining, or both (hybrid methods also exist; see DICTION’s Insistence function). Preprocessing decisions can alter the validity of text mining by affecting whether words and/or phrases are accurately captured. Importantly, closed and open vocabulary text mining have different preprocessing needs, but to understand why, it is necessary to explain how closed and open vocabulary text mining work.
Closed vocabulary text mining counts words and phrases in a priori dictionaries, making it a more deductive approach to extracting textual information. These dictionaries are generally developed to capture specific constructs, which may be linguistic or psychological in nature. As an extension of manual coding, researchers can use existing dictionaries or create new ones to automatically count the number of times words and/or phrases appear (e.g., counting the number of positive words in the text to measure positivity), treating the number of occurrences of words in a dictionary as the construct score. For example, LIWC has linguistic dictionaries that count the use of pronouns, verbs, and punctuation as well as psychological dictionaries including drives (i.e., affiliation, achievement, power, reward, and risk), positive emotions, and negative emotions (Pennebaker, Boyd, et al., 2015). Researchers have also developed dictionaries for organizational constructs, including entrepreneurial orientation (Short et al., 2010) and the great eight performance dimensions (Speer et al., 2018). The dictionary scores can then be used in traditional analyses (e.g., regression-based hypothesis testing) or in machine learning, whether supervised or unsupervised.
In closed vocabulary text mining, the choice of which dictionaries to use is the sole determinant of what textual features are computed. With this in mind, preprocessing can adversely affect measurement precision in closed vocabulary text mining by inadvertently altering words in the corpus or creating a mismatch between how a word appears in the corpus and how it appears in the dictionary. However, preprocessing can also be used to enhance measurement precision by transforming all the variations, colloquialisms, or misspellings of a word within a corpus to its dictionary entry. For instance, in the popular LIWC, all text is automatically converted to lowercase (otherwise, capitalized words will not match their lowercase dictionary entries), and the technical manual encourages users to correct spelling errors and expand abbreviations (Pennebaker, Booth, et al., 2015).
The effects of preprocessing on one’s results are especially pronounced in open vocabulary text mining. Open vocabulary text mining has no dictionaries to specify which words are of interest and, therefore, no preformed notions about which words and/or phrases are valuable, making it a more inductive approach to extracting textual information. The variables extracted from open vocabulary text mining are frequently used in machine learning, whether supervised or unsupervised (Kobayashi et al., 2018b). In open vocabulary text mining, any n-gram (i.e., a phrase of length n) can be counted and used as a variable, making it a useful approach in contexts with nonstandard spellings, abbreviations, and emoticons, which may not be present in closed vocabulary dictionaries (Park et al., 2015). Open vocabulary text mining is often conducted in R or Python, leaving the user to specify all preprocessing decisions. Preprocessing decisions determine what is captured in open vocabulary text mining because preprocessing techniques alter and/or remove words from text.
Content and/or Style of Speech
Text mining can be used to address a wide variety of research questions. For example, topic modeling (an open vocabulary technique that involves identifying the latent topics in texts) has been used to address myriad organizational research questions (Hannigan et al., 2019). One increasingly common way that text mining is used in social psychology is to infer traits and trait-like individual differences from text, such as social media posts (e.g., Park et al., 2015), which has been used to distinguish the average personality traits of workers in a given occupation (Kern et al., 2019) and to help with identifying and recruiting prospective applicants with job-relevant traits (Faliagka et al., 2012). Researchers have also applied text mining for a variety of other tasks, including understanding the similarity of texts (Banks et al., 2018), understanding context-specific “state” variables like attitudes (e.g., sentiment analysis), and predicting behavioral outcomes from text data (Piezunka & Dahlander, 2015). In other words, some research questions are focused on extracting trait-like individual differences from text, whereas others are interested in extracting state-like variables.
To understand how preprocessing may affect one’s ability to address these two types of research questions, we draw on Pennebaker’s distinction between the content of speech (i.e., what is being said; content words) and the style of speech (i.e., how it is being said; function words; Pennebaker et al., 2003). Some preprocessing techniques alter the validity of content captured from text, and several remove elements of the style of speech. Content words include nouns (e.g., coworker, friend), regular verbs (e.g., grow, learn), and many adjectives and adverbs (e.g., happy, very) and convey what is said. Therefore, content words tend to differ from situation to situation given that language reflects situation-dependent topics. In contrast, the style of speech is comprised of function words (i.e., particles), including articles (e.g., a, the), auxiliary verbs (e.g., am, will), conjunctions (e.g., and, but), prepositions (e.g., to, with), and pronouns (e.g., I, he, she). Function words represent how language is delivered and tend to be used more consistently across situations within individuals (Pennebaker et al., 2003).
Because content words vary more across situations than function words do, content words tend to be more important for understanding state-like variables. For instance, Speer (2018) used the content of performance appraisals in Year 1 to predict Year 2 performance ratings while controlling for Year 1 performance ratings. Similarly, Campion et al. (2016) used content words to extract the topics in texts and used those topics to automatically score achievement records by modeling the scores given by human raters.
On the other hand, function words tend to be more important for understanding trait-like variables. Function words make up only 0.05% of the English vocabulary, yet they comprise over half of all words spoken and written in everyday communication (Tausczik & Pennebaker, 2010). Differences in the use of function words can indicate individual differences including age, sex, honesty, status, personality traits, and more (Pennebaker et al., 2003; Tausczik & Pennebaker, 2010). This is known as the human stylome, which distinguishes individuals based on their writing the same way that their genome distinguishes them based on their DNA (van Halteren et al., 2005). For example, Park et al.’s (2015) study of U.S. Facebook users found that many n-grams (i.e., phrases of length n, where n = 1 is isolated words, n = 2 is two-word phrases, etc.) most predictive of extraversion contained pronouns (e.g., I can’t, it is), prepositions (e.g., from, of, into), and articles (e.g., the, as). Furthermore, a study of online communities found that groups can be distinguished based on their linguistic style, which included multiple types of function words, including prepositions (e.g., to, with, on), adverbs (e.g., very, really), auxiliary verbs (e.g., am, will, have), and negations (e.g., no, never; Khalid & Srinivasan, 2020), which could be applied in organizational settings to help understand team and/or group differences.
As we describe in the following, preprocessing may affect whether function words, which represent linguistic style, are still present in the text. Furthermore, preprocessing affects whether other elements of style that may be related to trait-like differences, such as the use of abbreviations or spelling errors, are present in text. Therefore, one’s research question is an important consideration when making preprocessing decisions. For example, if one’s research question relates to trait-like individual differences, idiosyncrasies such as typos and misspellings may be informative, but when one’s research question does not relate to trait-like individual differences, standardizing such idiosyncrasies can improve measurement precision.
Corpus Size
The size of one’s corpus also matters for preprocessing (Kern et al., 2016; Kobayashi et al., 2018a). Large corpora (i.e., those with more documents) have more power to utilize fine distinctions in text (e.g., between singular and plural versions of nouns or past and present tense verbs). On the other hand, small corpora have less power to do so. Therefore, statistical power can be increased in small corpora by, for example, combining semantically related words. Yet such increases in power come at the expense of losing nuance in language, potentially introducing errors and, thereby, reducing validity. Corpora may be considered large when a thousand or more observations are available (e.g., Kern et al., 2016).
Average Document Length
In a related vein, the length of documents in one’s corpus should also be considered during preprocessing (Kobayashi et al., 2018a). Shorter documents (e.g., tweets, open-ended survey responses) contain minimal content. Therefore, even if the research question does not relate to individual differences, more predictive power is available if the style of speech is retained. On the other hand, because longer documents (e.g., comprehensive performance reviews) have more content, the style of speech is less needed when irrelevant to one’s research question. Generally, documents can be considered short if they contain fewer than 500 words (Kern et al., 2016), although researchers in other fields (e.g., computational linguistics) suggest that documents are short when, on average, they contain fewer than two dozen words (e.g., tweets; Faguo et al., 2010).
Preprocessing and Text Mining Overview
We now turn to explain preprocessing techniques and their effects on text mining, relating them to the aforementioned conceptual and empirical considerations. Table 2 defines and describes each preprocessing technique, details their effects on text mining analyses, and summarizes our recommendations for whether and when to apply them in closed and open vocabulary text mining (we preview the recommendations here and describe them in more detail in the Discussion). Preprocessing involves techniques, for example, that account for more semantic information to increase the validity of results (e.g., handling negation) or reduce dimensionality by treating distinct forms of the same word as a single unit to increase power (e.g., stemming/lemmatizing). After reviewing preprocessing techniques, we explain how text is quantified and analyzed to consider preprocessing decisions in the context of the features one plans to extract and to couch preprocessing in the broader text mining process.
Table 2. Preprocessing and Related Terms, Defined.
Lowercase conversion involves converting all letters in a corpus to lowercase. Researchers have recommended always applying lowercase conversion during open vocabulary text mining (Banks et al., 2018; Kobayashi et al., 2018b). Computers represent capital and lowercase letters differently, so the same word capitalized versus lowercase may be counted separately if not unified. Capitalization is mainly used to identify proper nouns and acronyms; otherwise, it tends not to carry semantic information. Lowercase conversion tends to be beneficial because it decreases data dimensionality, thereby increasing statistical power, and usually does not reduce validity. Indeed, closed vocabulary tools generally perform lowercase conversion without user intervention (e.g., LIWC; Pennebaker, Booth, et al., 2015), meaning that lowercase conversion is generally conducted during closed vocabulary text mining but is unlikely to be reported by researchers.
A variety of methods exist for handling negation. Negations have an important role in language because they usually alter the meaning of subsequent words. For instance, good should never be counted the same as not good, yet if not properly handled in text mining, each occurrence of not good will be counted as an instance of both not and good, thereby conflating not good and good. One rudimentary method of handling negation involves prefixing not_ to each word that a negation precedes, thereby creating a new, distinct unigram (e.g., Speer, 2018). This can increase data dimensionality proportionally with the number of words preceded by negations and, therefore, likely decreases statistical power. Importantly, however, handling negation captures more semantic information, reduces error, and increases validity. This applies to both open and closed vocabulary text mining: If negation is not addressed, both may conflate negated and nonnegated forms of words. For example, unaddressed negation is one of the most prevalent sources of error in closed vocabulary text mining (Schwartz, Eichstaedt, Blanco, et al., 2013). Prefixing not_ creates a new, distinct unigram in open vocabulary text mining and prevents the negated form from being counted in closed vocabulary text mining (i.e., because this modified form will not be present in the closed vocabulary dictionaries).
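As an illustration of the rudimentary method described above, the following Python sketch fuses each negation with the word that follows it. The function name mark_negation and the three-word negation list are assumptions made for this example and are not taken from any cited study; a real application would need a justified list of negation cues and a decision about how far a negation's scope extends.

```python
import re

# Hypothetical, deliberately short list of negation cues (an assumption for this sketch).
NEGATIONS = {"not", "no", "never"}

def mark_negation(text):
    """Replace each negation cue and the word after it with a single not_<word> token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    marked = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            # Fuse the cue with the next word so "not good" is no longer counted as "good".
            marked.append("not_" + tokens[i + 1])
            i += 2
        else:
            marked.append(tokens[i])
            i += 1
    return marked

print(mark_negation("The training was not good, but the mentoring was good."))
# ['the', 'training', 'was', 'not_good', 'but', 'the', 'mentoring', 'was', 'good']
```

In this output, not_good and good are distinct unigrams, so a positivity dictionary containing good would match only the second occurrence.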
Correcting spelling errors and expanding contractions and abbreviations involves identifying instances of each, then correcting/expanding them to standardize the text. Typos, contractions, and abbreviations that appear only a few times in a corpus will likely provide little value for subsequent analysis in open vocabulary text mining, so correcting them has the potential to reduce dimensionality and increase power. On the other hand, spelling errors may be informative. For example, researchers building an automated system for screening resumes may not want to correct spelling errors because the rate of such errors may relate to workplace outcomes, and other types of idiosyncrasies represent one’s style of speech and may also relate to trait-like individual differences (Kern et al., 2016). Indeed, previous recommendations have cautioned against making such corrections if the research question is focused on individual differences (Kern et al., 2016), yet such corrections can be useful in nonformal texts where errors are more likely (Kobayashi et al., 2018a, 2018b). Importantly, it may be easier to identify and expand acronyms (a form of abbreviation) if done prior to lowercase conversion because capitalization is one way of identifying acronyms. Using these techniques may be especially useful in closed vocabulary text mining because nonstandard spellings, abbreviations, and contractions will not be counted unless they are included in the focal dictionaries (another option is to add them to the focal dictionaries; Pennebaker, Booth, et al., 2015).
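The ordering issue noted above (expanding acronyms before lowercase conversion) can be sketched in Python as follows. The lookup tables and the function name standardize are hypothetical and serve only to illustrate the sequence of steps; they are not drawn from any published resource.

```python
import re

# Hypothetical lookup tables, for illustration only.
ACRONYMS = {"HR": "human resources", "IT": "information technology"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "don't": "do not"}

def standardize(text):
    """Expand acronyms, convert to lowercase, then expand contractions."""
    # Step 1: expand acronyms while capitalization still separates "IT" from "it".
    for acronym, expansion in ACRONYMS.items():
        text = re.sub(r"\b" + re.escape(acronym) + r"\b", expansion, text)
    # Step 2: lowercase conversion.
    text = text.lower()
    # Step 3: expand contractions (the keys are lowercase, so this follows step 2).
    for contraction, expansion in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(contraction) + r"\b", expansion, text)
    return text

print(standardize("IT said it can't fix the HR portal."))
# information technology said it cannot fix the human resources portal.
```

Had lowercase conversion been applied first, the acronym IT would have been indistinguishable from the pronoun it.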
Removing nonalphabetic characters involves removing punctuation, symbols, and/or numbers from text. If nonalphabetic characters are retained in open vocabulary text mining, each instance of an n-gram adjacent to punctuation (e.g., at the end of a sentence) may be treated distinctly from other instances if not properly handled during tokenization (i.e., identifying the units of analysis for open vocabulary text mining). Therefore, this technique is commonly conducted to focus on the words and phrases in text, with some researchers recommending that punctuation always be removed (Banks et al., 2018) and others recommending that nonalphabetic characters be retained when in doubt (Kobayashi et al., 2018b). Researchers may not want to remove nonalphabetic characters when their research question relates to trait-like individual differences because punctuation is one element of style, as are language variables that require punctuation, such as the number of words in a sentence (Pennebaker, Boyd, et al., 2015). For example, the use of exclamation points is related to personality traits (Golbeck et al., 2011), and punctuation is needed to quantify indices of linguistic complexity that use the number of words per sentence, such as Flesch-Kincaid Grade Level. Further complicating matters, periods may also occur in times (e.g., 6:30 p.m.), email addresses (e.g., username@website.com), and elsewhere; they will need to be removed from these words and phrases but not from the ends of sentences to measure words per sentence. Furthermore, researchers are increasingly using social media text data that contain emoticons (e.g., :D). Emoticons carry positive and negative sentiment, and their use also likely reflects individual differences (Kern et al., 2016) as well as cross-cultural differences in emotional expression (e.g., Li et al., 2019).
Where emoticons are present, emoticon recognition should be used (although we do not include emoticon recognition in Figure 1, when emoticons are present, they should be counted regardless of whether one is using open or closed vocabulary text mining). Furthermore, the end of sentences will sometimes need to be handled in a special way—for instance, it may not be meaningful to count n-grams that span sentence boundaries, and to avoid counting such phrases, punctuation must be retained until later in the process. In closed vocabulary text mining, punctuation should generally be retained because the choice of dictionaries determines whether they will be counted.
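To make these points concrete, the following Python sketch shows one way to keep emoticons as tokens while dropping other nonalphabetic characters, and to compute words per sentence before punctuation is removed. The emoticon pattern is a minimal assumption covering only a handful of forms, and the function names are hypothetical; the sentence splitter is also naive and would mis-split on abbreviations such as p.m., which is exactly the complication described above.

```python
import re

# Hypothetical, minimal emoticon pattern: eyes, optional nose, mouth (e.g., :D  :-)  ;P  :( ).
EMOTICON = r"[:;=][-']?[)(DPp]"
WORD = r"[A-Za-z]+(?:'[A-Za-z]+)?"
TOKEN = re.compile(EMOTICON + "|" + WORD)

def tokenize_keep_emoticons(text):
    """Return words and emoticons; all other punctuation, symbols, and digits are dropped."""
    return TOKEN.findall(text)

def words_per_sentence(text):
    """Naively split on ., !, or ? followed by whitespace, then count the words in each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(re.findall(WORD, s)) for s in sentences]

print(tokenize_keep_emoticons("Great meeting today :D See you at 6:30!"))
# ['Great', 'meeting', 'today', ':D', 'See', 'you', 'at']
print(words_per_sentence("The review went well. I learned a lot!"))
# [4, 4]
```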
Stop words are words so common in a corpus that they may become uninformative, and stop words are often removed during open vocabulary text mining preprocessing. For instance, in information retrieval (i.e., online and offline search), stop words are rarely useful, and removing them decreases computation time. Researchers have variously recommended always removing stop words (Banks et al., 2018), removing them except in short documents (Kobayashi et al., 2018a, 2018b), and removing them in small corpora unless predicting individual differences (Kern et al., 2016). Such exceptions are necessary because many stop words are also function words that make up one’s style of speech. In closed vocabulary text mining, stop words should always be retained because the choice of dictionaries determines whether they will be counted. In open vocabulary text mining, stop words should be retained when one’s research question relates to individual differences. Domain-specific lists of frequently occurring words in a corpus can be generated for removal, or generic stop word lists that include common terms such as the, a, and and can be used.
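The two kinds of stop word lists mentioned above can be sketched in Python as follows. The generic list contains only the three example words from the text, and the rule used to build the domain-specific list (words appearing in at least 90% of documents) is an assumption made for this sketch, not a threshold recommended by the cited sources.

```python
from collections import Counter

# Only the three examples named in the text; real generic lists are much longer.
GENERIC_STOP_WORDS = {"the", "a", "and"}

def corpus_stop_words(tokenized_docs, min_doc_share=0.9):
    """Domain-specific list: words found in at least min_doc_share of documents (assumed rule)."""
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    cutoff = min_doc_share * len(tokenized_docs)
    return {word for word, n_docs in doc_freq.items() if n_docs >= cutoff}

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in stop_words]

docs = [
    ["the", "team", "met", "the", "deadline"],
    ["the", "team", "missed", "a", "deadline"],
    ["a", "manager", "praised", "the", "team"],
]
print(corpus_stop_words(docs))                         # the and team appear in every document
print(remove_stop_words(docs[0], GENERIC_STOP_WORDS))  # ['team', 'met', 'deadline']
```

Note that the domain-specific rule flags team, a content word, which illustrates why such lists should be inspected before removal.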
Stemming and lemmatization involve reducing words to their root form (a stem and lemma, respectively), typically by removing suffixes. These techniques were originally used in open vocabulary text mining for the same reason as removing stop words: to reduce computation time and improve recall in information retrieval. However, they can also increase power by reducing dimensionality. Prior recommendations have focused only on stemming, suggesting that it is useful on small corpora (Kern et al., 2016) or short documents (Kobayashi et al., 2018a) but also cautioning that it can be considered only after conducting topic modeling (Banks et al., 2018) and that stemming should only be conducted if it does not reduce predictive accuracy in the resulting analyses (Kobayashi et al., 2018b). Caution is warranted because collapsing word variants into a single root can remove differences in the style of speech and/or misclassify words if applied carelessly. Furthermore, stemming and lemmatization are very different processes and create different outcomes. Stemmers use rule-based heuristics to remove word suffixes, leaving only the “stem,” without considering homographs (i.e., words spelled the same but with different meanings). On the other hand, lemmatizers consider the part of speech of words (e.g., verb, noun) and use a lexicon of words and their variants to transform them into their lemma. Stemming tends to reduce dimensionality more than lemmatizing, but stemming can also collapse distinct words (or different forms of a word that have distinct meanings) to the same stem (e.g., the Porter stemmer would collapse organ, organs, organic, organism, organize, and organization to organ). Furthermore, the stems left after stemming are often not actual words, making them difficult to interpret.
Alternatively, lemmatizers use a lexicon to change words to their lemma, meaning that words with a common stem tend to be reduced to their respective, distinct forms (e.g., organ and organs would collapse to organ, whereas organic, organism, organize, and organization would be unchanged because they have distinct meanings and/or different parts of speech). Therefore, stemming can increase power more than lemmatizing does, but it may also reduce validity. In closed vocabulary text mining, lemmatizing may be useful when applied to the corpus to increase the coverage of dictionaries, but stemming would need to be applied to both the corpus and the dictionaries because it does not always return actual words. Using stemming/lemmatizing can help increase the coverage of closed vocabulary dictionaries, yet if this introduces errors or conflates distinct words, doing so will reduce the precision and validity of closed vocabulary text mining.
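The contrast between rule-based stemming and lexicon-based lemmatizing can be illustrated with a toy Python sketch. The suffix list below is a crude stand-in chosen for this example and is not the Porter algorithm, and the mini-lexicon is hypothetical; unlike a real lemmatizer, the lookup also ignores part of speech. The sketch reproduces only the general behavior described above.

```python
# Crude suffix stripper for illustration; this is NOT the Porter algorithm.
SUFFIXES = ("ization", "ize", "ism", "ic", "s")  # checked in this order, longest first

def crude_stem(word):
    """Strip the first matching suffix, provided at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon mapping inflected forms to their lemmas.
LEMMA_LEXICON = {"organs": "organ", "organizations": "organization", "organized": "organize"}

def lookup_lemma(word):
    """Return the lexicon's lemma if the word is listed; otherwise leave the word unchanged."""
    return LEMMA_LEXICON.get(word, word)

words = ["organ", "organs", "organic", "organism", "organize", "organization"]
print([crude_stem(w) for w in words])
# ['organ', 'organ', 'organ', 'organ', 'organ', 'organ']
print([lookup_lemma(w) for w in words])
# ['organ', 'organ', 'organic', 'organism', 'organize', 'organization']
```

The stemmer yields a single feature for all six words, whereas the lookup yields five, which mirrors the trade-off between dimensionality reduction and validity.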
Text Mining
Preprocessing is the first stage of text mining. Once preprocessing decisions are made and implemented, features (i.e., variables) are extracted from the text using closed vocabulary text mining, open vocabulary text mining, or both. Then, those features are used as the units of analysis.
Feature Extraction
As described previously, two primary approaches exist for extracting features from natural language text: closed and open vocabulary text mining. In closed vocabulary text mining, dictionaries are used to count words relevant to linguistic and/or psychological constructs. After deciding which dictionaries to include, closed vocabulary text mining software counts the number of times each entry in a dictionary occurs. Either the raw counts or the proportion of text matching each dictionary can be used for subsequent analyses. At the end of this process, each text in the corpus is represented by a vector of weights, where entries represent scores for the focal dictionaries. Closed vocabulary text mining has been used extensively in organizational research (see our second review).
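A minimal Python sketch of closed vocabulary scoring follows. The two mini-dictionaries and the function name score_document are hypothetical; published dictionaries are far larger, are validated, and may include phrases or wildcard entries that this sketch does not handle. The sketch returns both the raw count and the proportion of tokens matching each dictionary.

```python
import re

# Hypothetical mini-dictionaries, for illustration only.
DICTIONARIES = {
    "positive": {"good", "great", "happy"},
    "negative": {"bad", "poor", "unhappy"},
}

def score_document(text):
    """Return the raw count and the proportion of tokens matching each dictionary."""
    # Lowercase first so that "Great" matches the dictionary entry "great".
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for name, entries in DICTIONARIES.items():
        count = sum(token in entries for token in tokens)
        proportion = count / len(tokens) if tokens else 0.0
        scores[name] = {"count": count, "proportion": proportion}
    return scores

print(score_document("Great team, good pay, but poor hours."))
```

Applying this function to every document yields the vector of dictionary scores described above, one entry per focal dictionary.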
In open vocabulary text mining, after the corpus has been preprocessed, n-grams are extracted from text using a tokenizer. Tokenization is the process of identifying the units of analysis (i.e., n-grams), counting them in each document, and creating the document-term matrix (where each row is a document, each column is an n-gram, and each entry is the number of times the n-gram occurs in a given document). Tokenizers process a corpus by extracting all phrases of n consecutive words and counting their frequency in each document. Special tokenizers have been developed specifically to capture emoticons because more general tokenizers do not have rules in place for handling consecutive punctuation/symbols (Kern et al., 2016). Some researchers only extract unigrams from the text, yet unigrams are devoid of context and, therefore, could lead to spurious associations that do not hold up under scrutiny. Therefore, researchers should use tokenizers to count unigrams, bigrams, and often trigrams (i.e., n = 1, 2, and 3), but the added value of including longer phrases is diminished due to their sparseness. Extracting phrases helps improve validity by capturing some semantic information in the text. For example, when analyzing free-form employee survey responses, the unigram challenging is relatively uninformative, yet comparing the frequency of challenging relationship and challenging project could help one identify whether employees are struggling with interpersonal or task-related issues and whether these challenges differ by functional area.
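The tokenization step described above can be sketched in Python using only the standard library. The function names are hypothetical, and the sketch makes two simplifying assumptions: tokens are runs of letters and apostrophes, and n-grams are allowed to span sentence boundaries.

```python
import re
from collections import Counter

def extract_ngrams(text, max_n=3):
    """Count every phrase of 1 to max_n consecutive words in one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return counts

def document_term_matrix(docs, max_n=3):
    """Rows are documents, columns are n-grams, and entries are within-document counts."""
    per_doc = [extract_ngrams(doc, max_n) for doc in docs]
    vocabulary = sorted(set().union(*per_doc))
    matrix = [[counts[term] for term in vocabulary] for counts in per_doc]
    return vocabulary, matrix

vocabulary, matrix = document_term_matrix(
    ["challenging project", "challenging relationship"], max_n=2
)
print(vocabulary)
# ['challenging', 'challenging project', 'challenging relationship', 'project', 'relationship']
print(matrix)
# [[1, 1, 0, 1, 0], [1, 0, 1, 0, 1]]
```

The unigram column for challenging is identical across the two documents, whereas the bigram columns separate them, which is the point made in the example above.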
The resulting document-term matrix can be used as is, or further transformations can be applied to it, such as the term frequency-inverse document frequency (tf-idf) transformation, which gives greater weight to n-grams that occur in fewer documents (Spärck Jones, 1972), or topic modeling, for example, with the popular Latent Dirichlet Allocation (Blei et al., 2003). Topic models use techniques similar to factor analysis to identify the latent topics in a corpus and assign each document a “loading” that indicates the extent to which each extracted topic is present in that document. Importantly, one may not want to (a) remove stop words when extracting n-grams with n > 1 or (b) stem or lemmatize when generating topic models (Banks et al., 2018). Removing stop words in bi- and trigrams can create nonsensical phrases and reduce validity, and topic modeling can account for different forms of the same word, so there is less of a need to collapse different forms of a word together via stemming/lemmatizing. Although extracting n-grams is a necessary precursor to topic modeling, as Figure 1 shows, the optimal preprocessing techniques differ depending on which features (i.e., n-grams vs. topic models) one ultimately plans to use in subsequent analyses.
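As an illustration of the tf-idf transformation, the sketch below applies one common, unsmoothed variant (raw count multiplied by the natural log of the number of documents divided by the number of documents containing the term). Software packages differ in smoothing and normalization, so this formula is an assumption for illustration and not necessarily what a given package computes.

```python
import math

def tf_idf(matrix):
    """Weight raw counts by log(N / df), where df is the document frequency."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
         for j in range(n_terms)]
        for row in matrix
    ]

# Toy document-term matrix: 2 documents (rows) by 3 n-grams (columns).
counts = [[2, 1, 0],
          [1, 0, 3]]
weighted = tf_idf(counts)
```

Under this variant, the first n-gram appears in every document and therefore receives a weight of zero, whereas n-grams appearing in only one document retain positive weights.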
Analysis
After extracting features from text, the features are generally used downstream in one of three broad classes of analysis: visualization, clustering, and prediction. Visualization is a form of data analysis that can provide a succinct summary of high-dimensional data (Tay et al., 2018). Traditionally, researchers used visualization to highlight findings that support hypotheses, yet increasingly, modern visualization tools allow for interactive data visualization that can help researchers uncover new insights in the data. In text mining, word clouds are frequently used to summarize the frequency and/or predictive value of n-grams and closed vocabulary dictionaries (e.g., Schwartz, Eichstaedt, Kern, et al., 2013).
Clustering (i.e., unsupervised machine learning) is a class of inductive methods that group documents by their textual features, with no outcome variable driving the agglomeration. Examples include hierarchical clustering methods (e.g., McLaughlin et al., 1991; Woo et al., 2018) and partitioning clustering, such as k-means clustering (MacQueen, 1967). For example, Kobayashi et al. (2018b) showed how hierarchical clustering could be used on textual features to inspect the similarities among job descriptions in online job boards, generating information that could be valuable for job analysis.
Prediction (i.e., classification, or supervised machine learning; MacQueen, 1967) is a deductive method used to predict outcomes of interest from a given text. Indeed, prediction is perhaps the prototypical application of open vocabulary text mining, wherein predictive model parameters are estimated on a training data set and validated on a testing data set. For instance, Campion et al. (2016) applied text mining to automatically score achievement record essays by modeling the scores given by human raters, potentially saving the organization time and money by reducing the human labor required to screen job applicants. Traditional, regression-based hypothesis tests (commonly used during closed vocabulary text mining) are also a form of prediction, but the models are generally not cross-validated.
Review of Computational Linguistics Studies Comparing Preprocessing Techniques
Methods
Research in computational linguistics (CL) has directly compared how different preprocessing techniques affect text mining performance. To do so, such studies examine how the accuracy of machine learning predictions for some task (e.g., text classification, sentiment analysis) is affected by manipulating preprocessing techniques. We began with a broad search seeking to identify research focused on comparing the effects of different preprocessing decisions. Later, we expanded our search to top CL conferences to identify research that compared the effects of preprocessing decisions even if that was not the primary research focus. We sought research in the past 5 years using “preprocess*” and “stemm* AND lemmat*” as search terms on IEEE Xplore and ACM Digital Library. These search terms were chosen because stemming and lemmatization are often discussed in studies focused on preprocessing even if they do not compare stemming to lemmatization. This initial search returned 144 articles. Then, we uncovered 13 additional potentially relevant articles by inspecting the last 5 years of proceedings of top NLP conferences: the Association for Computational Linguistics, the Association for Information Science and Technology, Conference on Empirical Methods in Natural Language Processing, and the North American Chapter of the Association for Computational Linguistics. We further collected 11 additional articles through ancestral search. The first author made relevance decisions by examining abstracts and, if a determination could not be made from the abstract, full texts. The second author coded a random sample of 20% (N = 34) of the potentially relevant articles to determine coding reliability for article inclusion. Agreement between coders was 97.1%, and the one disagreement was resolved via discussion.
We found 50 studies that compared the effects of preprocessing decisions on open vocabulary text mining results (and found no articles comparing the effects of preprocessing on closed vocabulary text mining), and the first two authors coded them on four criteria: research context (e.g., information retrieval), text language(s), preprocessing techniques compared, and which preprocessing techniques gave the best system performance (e.g., highest classification accuracy). We dropped one of the studies because it reported contradictory results about the same analyses. When coding the beneficial preprocessing techniques, disagreements were resolved via discussion.
Results
Full details about each study included in our review are provided in Appendix A in the Supplemental Material available online. The aggregated results of their preprocessing investigations are summarized in Table 3 and this section. Table 3 lists each preprocessing technique for which we found comparative studies, the number of studies investigating each technique, and the languages of the texts used in the studies and breaks down the results by studies in the English language and studies in all languages.
Table 3. Summary of Computational Linguistics Preprocessing Comparison Review.
Note: k = number of studies uncovered in literature search that tested the effects of each preprocessing technique. English Results reports the proportion of English-language studies that found the preprocessing technique to be beneficial, and All Results reports the proportion of studies in all languages that found the preprocessing technique to be beneficial.
Some preprocessing techniques appeared very beneficial for machine learning system accuracy, whereas the effects of others were equivocal. Two of three studies of lowercase conversion found it improved system accuracy, both of which were studies of English text. For example, Uysal and Gunal (2014) found that lowercase conversion increased the accuracy of machine learning classifiers for classifying emails as spam and identifying the category of news stories in both English and Turkish. Although some of the CL results may not appear directly relevant to organizational researchers, the effects of lowercase conversion on machine learning classifiers and predictive models for other tasks (e.g., Kobayashi et al., 2018b) are likely to be consistent. All five studies of handling negation found it was beneficial. For example, Smith et al. (2015) found that handling negation increased the accuracy of a computerized question answering system. Five of eight studies of nonalphabetic character removal found it was beneficial, although just three of six studies of English text found the same. Three of five studies of emoticon recognition found it was beneficial, including two of the three studies of English text. All three studies of spelling corrections found it improved system accuracy. For example, Mhatre et al. (2017) found that spelling corrections improved the accuracy of machine learning models classifying product reviews as positive or negative. Five of six studies of expanding contractions and abbreviations found it improved system accuracy, including three of the four studies of English text. For example, Jiangqiang and Xiaolin (2017) found that expanding acronyms improved the accuracy of Twitter sentiment analysis (i.e., positive or negative) machine learning models. Eight of 23 studies of stop word removal found it was beneficial, including five of the 11 studies of English text.
Moving to stemming and lemmatizing, 14 of 24 studies of stemming found it was beneficial, including six of the 12 studies of English text. Eight of 10 studies of lemmatizing found it was beneficial, including six of the eight studies of English text. When stemming and lemmatizing were directly compared, five of eight studies found lemmatizing was beneficial, two of the studies found stemming was beneficial, and the remaining one found equivocal results. In studies of English text, lemmatizing was beneficial in three of five studies, one study found stemming was beneficial, and the remaining one found equivocal results. For example, Mulki et al. (2018) found that lemmatizing improved the accuracy of emotion classifiers that labeled tweets with the discrete emotion most present in them.
Discussion of Computational Linguistics Review
Our review of CL research found that several preprocessing techniques are generally beneficial in terms of improving the validity of the subsequent analyses. Specifically, lowercase conversion, handling negation, spelling corrections, and expanding contractions and abbreviations tend to improve system accuracy. Notably, however, the studies of spelling corrections and expanding contractions and abbreviations did not focus on predicting individual differences. On the other hand, some techniques, including nonalphabetic character removal, emoticon recognition, stop word removal, and stemming, are less consistently beneficial. When directly compared to stemming, lemmatizing tended to provide better results (five of eight studies).
Traditionally, some exploration of different preprocessing techniques is expected due to the inductive nature of open vocabulary text mining (Banks et al., 2019), and different combinations of preprocessing techniques can improve the accuracy of later machine learning predictions by several percentage points. When text mining is used, for example, to predict job performance, such seemingly minor increases in accuracy can provide considerable long-term utility gains. For example, Schmidt and Hunter (1998) called incremental validity of .01 beyond general mental ability a “moderate” improvement (p. 269). Therefore, better text preprocessing can potentially enhance the utility of organizational text mining applications. Note, however, that it is possible organizational researchers have explored different preprocessing techniques without reporting it—for example, Speer (2018) trained a machine learning model to predict job performance ratings from narrative performance reviews and may have experimented with multiple preprocessing techniques to identify the combination that provided the highest predictive accuracy. Alternately, if Speer did not, then it may be possible to improve the predictive accuracy of similar systems in the future by experimenting with preprocessing techniques. We now turn to compare these findings to existing text preprocessing practices in organizational research.
Review of Preprocessing in Organizational Research
Method for Review of Organizational Text Mining Research
Given our interest in understanding the preprocessing practices of existing organizational research that uses text mining, we sought to uncover published, substantive organizational research that utilized open or closed vocabulary methods. We searched 10 prominent substantive journals in our field, as is common in organizational research reviews (e.g., Aguinis et al., 2009): Academy of Management Journal, Administrative Science Quarterly, Journal of Business & Psychology, Journal of Applied Psychology, Journal of Organizational Behavior, Journal of Management, Journal of Vocational Behavior, The Leadership Quarterly, Organizational Behavior and Human Decision Processes, and Personnel Psychology for “text mining” OR “natural language processing” OR LIWC OR DICTION OR “CAT Scanner” OR “General Inquirer” OR “Premium modeler” OR “RIOT Scan” OR WordStat. These search terms returned 264 articles. The first author determined whether text mining was used in each article by reading abstracts and, if determination could not be made from the abstract, full texts. The second author coded a random sample of 20% (N = 53) of the potentially relevant articles to determine coding reliability for article inclusion. Agreement between coders was 98.1%, and the one disagreement was resolved via discussion. Eighty-six articles, ranging from 1994 to 2019, used text mining and were included in our review. Articles were coded by the first and second authors by summarizing their text mining methods, data sources, preprocessing techniques, the rationale for the preprocessing decisions, analyses conducted, features extracted, and why text variables were extracted.
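The reported agreement figure follows from simple arithmetic, sketched below in Python. The function name is hypothetical, and the calculation assumes that percent agreement was computed as the share of double-coded articles on which the two coders reached the same inclusion decision. Percent agreement does not correct for chance agreement.

```python
def percent_agreement(n_coded, n_disagreements):
    """Share of double-coded items with matching decisions, as a percentage."""
    return 100 * (n_coded - n_disagreements) / n_coded

# 53 double-coded articles with one disagreement, as reported in the text.
agreement = percent_agreement(n_coded=53, n_disagreements=1)
print(f"{agreement:.1f}%")
```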
Results
Appendix B in the Supplemental Material available online provides full details about each study included in our review, and Table 4 summarizes the descriptive statistics and aggregate reported preprocessing practices. Table 4 excludes all preprocessing techniques that were never used by any organizational researchers (i.e., emoticon recognition, lemmatizing). Table 4 lists the number of open and closed vocabulary text mining studies our search uncovered, how many of each reported using at least one preprocessing technique, the number of studies that provided the rationale for their preprocessing decisions, and the number of studies that used each preprocessing technique.
Table 4. Summary of Organizational Research Preprocessing Review.
Note: LC = lowercase conversion; HN = handle negation; NAR = nonalphabetic character removal; PR = punctuation removal; SC = spelling corrections; ECA = expand contractions and abbreviations; SW = stop word removal; St = stemming. Entries indicate number of studies uncovered in the review that reported conducting each preprocessing technique.
Nine (10%) of the 86 articles used open vocabulary text mining, first appearing in 2015. Four of these nine also used closed vocabulary text mining, so 81 (94%) of the articles used closed vocabulary text mining. Only three of the 81 (4%) articles using closed vocabulary text mining reported preprocessing text. One accounted for negation by removing words preceded by no (Bligh & Hess, 2007), one removed nonalphabetical characters and used lowercase conversion (Piezunka et al., 2019), and one corrected spelling errors and expanded contractions (Lengelle et al., 2013). Only one of these studies provided a rationale for its preprocessing: Bligh and Hess (2007) stated they removed words preceded by no to account for the change in meaning, but they did not explain why this particular method (i.e., deletion) was chosen. In contrast, seven of the nine (78%) open vocabulary studies reported using multiple preprocessing techniques.
For open vocabulary text mining research, Appendix C in the Supplemental Material available online summarizes the preprocessing techniques used, the rationale provided for the preprocessing decisions, and the software used. Table 4 summarizes this information by listing how many studies reported conducting each preprocessing technique. None of the open vocabulary text mining studies corrected spelling errors, expanded contractions and/or abbreviations, or used lemmatizing. At least two thirds of the studies used lowercase conversion, stop word removal, and stemming. Less than half of the nine studies reported engaging in any other preprocessing technique.
Not all studies reported engaging in tokenization. Tokenization identifies the units of analysis (i.e., phrases of length n, or n-grams) for open vocabulary text mining and, therefore, is a necessary part of preprocessing (Schmiedel et al., 2019). Thus, these studies most likely tokenized their text but did not report doing so. Additionally, proprietary software (e.g., IBM SPSS Premium Modeler; Campion et al., 2016) likely tokenizes without user knowledge.
Only Speer (2018) and Speer et al. (2018) reported accounting for negation. To do so, both studies joined negation terms (e.g., not) to the subsequent word as “not_word.” This approach improves validity by treating negated and nonnegated words distinctly.
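A minimal Python sketch of this style of negation handling follows. It is not the code used in the cited studies; the list of negation terms and the rule of joining a negation term only to the immediately following word are assumptions made for illustration.

```python
NEGATIONS = {"not", "no", "never"}  # assumed list, for illustration only

def mark_negation(tokens, negations=NEGATIONS):
    """Join each negation term to the word that follows it (e.g., not_honest)."""
    marked = []
    i = 0
    while i < len(tokens):
        if tokens[i] in negations and i + 1 < len(tokens):
            marked.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2  # skip the word that was just joined
        else:
            marked.append(tokens[i])
            i += 1
    return marked

tokens = mark_negation("the manager was not honest".split())
```

After this step, not_honest and honest are counted as separate features in the document-term matrix.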
Regarding reporting, seven of the nine studies provided some information about the software used. Two studies (22%), however, did not mention how they conducted open vocabulary text mining (Giorgi & Weber, 2015; Piezunka et al., 2019). Without knowing what software was used, researchers are unable to replicate the methodology of a research study. Eighty of the 81 (99%) closed vocabulary text mining studies reported which software was used.
We also inspected the rationale researchers provided for their text preprocessing decisions. Three studies did not provide any rationale (Campion et al., 2016; Giorgi & Weber, 2015; Speer et al., 2018). Three studies cited prior research as justification for some (but not all) preprocessing decisions (Antons et al., 2019; Banks et al., 2019; Speer, 2018). Three studies stated they stemmed their corpus because stemming is “commonly” conducted (Croidieu & Kim, 2018; Piezunka & Dahlander, 2015; Piezunka et al., 2019). Two studies claimed to conduct the “standard” text preprocessing techniques, yet these studies used different preprocessing techniques (Banks et al., 2019; Croidieu & Kim, 2018).
General Discussion
By conducting two complementary reviews, this article uncovered the best practices in text preprocessing as determined by computational linguistics as well as the current preprocessing reporting practices in organizational text mining research. We now integrate the findings of these two reviews with the conceptual considerations reviewed previously to provide a conceptually informed and empirically grounded set of preprocessing recommendations.
Preprocessing Recommendations
Overall, the emerging organizational research using open vocabulary text mining has relied on convenient and common preprocessing techniques. In other words, the choice of preprocessing techniques has rarely been supported by empirical research or theory. Indeed, multiple studies claimed to conduct stemming because it is “common,” and two claimed to conduct the “standard” preprocessing techniques yet used different preprocessing techniques. Furthermore, it is very rare for studies to report preprocessing text for closed vocabulary text mining. We provide research-based preprocessing recommendations for closed and open vocabulary text mining in Figure 1 and Table 2 and detail them here. However, as noted previously, we recognize that specific situations may require deviating from these recommendations and that different proprietary software (e.g., LIWC) may automatically conduct some of the preprocessing techniques.
All text should be converted to lowercase. Lowercase conversion decreases data dimensionality while maintaining semantic information. Most organizational studies reported using lowercase conversion with open vocabulary text mining. Likely, studies do not report conducting lowercase conversion with closed vocabulary text mining because the software packages, like LIWC, conduct lowercase conversion without user input.
Our first review showed handling negation should be conducted in both closed and open vocabulary text mining. Handling negation better accounts for semantic information in text, increasing data dimensionality and validity. For instance, to accurately assess perceptions of politicians’ Machiavellianism, distinguishing honest from not honest is necessary (e.g., Bhattacharya et al., 2015). However, only two organizational open vocabulary studies and one closed vocabulary text mining study reported handling negation.
Then, researchers must consider whether they will be extracting features using open or closed vocabulary text mining. If only using closed vocabulary text mining, then spelling corrections should be made and contractions and abbreviations should be expanded (if the dictionary only contains the expanded version of the contractions and abbreviations). Doing so will ensure that the relevant words are accurately counted in the focal closed vocabulary dictionaries.
If using open vocabulary text mining, researchers must consider whether their research question requires them to capture the content and/or style of speech. If only content is important, then spelling corrections should be made, contractions and abbreviations should be expanded, and nonalphabetic characters should be removed to reduce data dimensionality and increase statistical power. Unless idiosyncratic language differences, such as spelling errors, abbreviations, and use of punctuation, are useful for the research question at hand, researchers should make these corrections to standardize the language. Doing so can improve the accuracy of open vocabulary text mining. Although not included in Figure 1, if emoticons are present in one’s corpus, emoticon recognition should be used regardless of whether one is using closed or open vocabulary text mining. Capitalization and nonalphabetic characters are both informative for emoticon recognition in open vocabulary text mining, and in closed vocabulary text mining, one merely needs to include an emoticon dictionary.
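When only content matters, the standardization steps above might look like the following Python sketch. The contraction lookup table is a hypothetical, deliberately short example, and spelling correction is omitted because it requires a reference word list. Because this sketch lowercases the text and strips all nonalphabetic characters, it would destroy emoticons and stylistic cues and is therefore unsuitable when style is of interest.

```python
import re

# Hypothetical lookup table; a real project would need a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

def standardize(text, contractions=CONTRACTIONS):
    """Lowercase, expand listed contractions, and drop nonalphabetic characters."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in contractions.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    text = re.sub(r"[^a-z\s]", " ", text)  # remove digits and punctuation
    return " ".join(text.split())          # collapse extra white space

clean = standardize("It's GREAT, but I can't attend 2 meetings!!")
```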
When conducting open vocabulary text mining, tokenization removes extra white space because it is superfluous. Researchers must next consider the average document length in their corpus. If documents are long, then stop words should be removed from unigrams. Removing stop words can be detrimental in short documents (e.g., tweets) because stop words make up a large proportion of the total words, and many stop words are function words that comprise one’s communication style, which tends to be a stable individual difference (Pennebaker et al., 2003). For example, Meinecke and Kauffeld (2018) used LIWC categories that contain stop words to quantify supervisor and subordinate language style, which affected performance appraisal effectiveness. Therefore, consider retaining stop words when relevant to one’s research question or when extracting n-grams with n > 1 because stop words alter the meaning of subsequent words (Schofield et al., 2017). Park et al.’s (2015) study of Facebook users found stop words were present in several phrases predictive of personality traits. Stop word removal is unnecessary for closed vocabulary text mining because the choice of dictionaries determines whether stop words are counted.
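The distinction between removing stop words from unigrams and retaining them in longer n-grams can be sketched as follows. The stop word list is an assumed, abbreviated example, and the function names are hypothetical.

```python
STOP_WORDS = {"the", "a", "of", "is", "not", "to", "and"}  # assumed short list

def unigrams_without_stop_words(tokens, stop_words=STOP_WORDS):
    """Unigram features with stop words removed."""
    return [t for t in tokens if t not in stop_words]

def bigrams_with_stop_words(tokens):
    """Bigram features built from the full token sequence, stop words retained."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = "the project is not challenging".split()
unigrams = unigrams_without_stop_words(tokens)
bigrams = bigrams_with_stop_words(tokens)
```

Building bigrams from the unfiltered tokens preserves phrases such as not challenging, whereas building them after stop word removal would yield project challenging, which reverses the meaning.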
Finally, when using open vocabulary text mining, researchers must consider which features will be extracted from text: n-grams and/or topic models. If only topic models will be used in later analyses, then preprocessing is complete. However, if n-grams will be directly used, then researchers must consider their corpus size. If the corpus is small, then researchers should stem or lemmatize the text. Lemmatizing is superior to stemming because it is less likely to conflate distinct words, thereby providing greater validity. However, most organizational open vocabulary text mining studies stemmed their corpus. Topic models can determine the relationships between different forms of the same word, so there is less of a need to stem or lemmatize when using them (Schofield & Mimno, 2016). Stemming and lemmatizing are most useful in small corpora (Kern et al., 2016) because there is less power in subsequent analyses to utilize fine distinctions in text. Ideally, lemmatization should be used on smaller corpora because it has a lower error rate compared to stemming and provides more interpretable results (Manning et al., 2008). Generally, stemming and lemmatizing are not used in closed vocabulary text mining, but if they are, they should be applied to both the corpus and the dictionaries.
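The decision sequence described in this section can be summarized as a small Python function. This is a paraphrase of the recommendations in the text, not a reproduction of Figure 1; the function name, argument names, and step labels are hypothetical, and the judgments of what counts as a long document or a small corpus are left to the researcher as Boolean inputs.

```python
def recommended_steps(vocabulary, content_only=True, long_documents=True,
                      use_ngrams=True, small_corpus=False):
    """Return preprocessing steps in the order the text discusses them."""
    if vocabulary not in ("open", "closed"):
        raise ValueError("vocabulary must be 'open' or 'closed'")
    steps = ["lowercase conversion", "handle negation"]
    if vocabulary == "closed":
        # Expansion applies if the dictionary contains only expanded forms.
        steps += ["spelling corrections",
                  "expand contractions and abbreviations"]
        return steps
    if content_only:
        steps += ["spelling corrections",
                  "expand contractions and abbreviations",
                  "remove nonalphabetic characters"]
    steps.append("tokenize")
    if long_documents:
        steps.append("remove stop words from unigrams")
    if use_ngrams and small_corpus:
        # Only relevant when n-grams are used directly as features.
        steps.append("lemmatize (or stem)")
    return steps

steps = recommended_steps("open", content_only=True, long_documents=False,
                          use_ngrams=True, small_corpus=True)
```

As noted above, deviations from these defaults will sometimes be appropriate given the characteristics of one's text data.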
Reporting Text Mining for Transparency and Replication
Our review of the CL literature shows that the choices made in the preprocessing stage meaningfully alter text mining results. Yet, as shown in Table 4, the text mining studies in organizational science often provide arbitrary and uneven reporting of preprocessing techniques. This gives readers an incomplete picture of the study methodology and, consequently, of the results obtained. Lack of careful reporting standards for preprocessing can hinder the transparency and replicability of research methods, especially when the methodology is novel to the area. Special attention needs to be given to both reproducibility of the original results given the same data and code and, from that, replication of original conclusions with reanalyzed or new data (Epskamp, 2019). In this section, we highlight potential pitfalls in conducting and reporting text mining research and summarize recommendations to improve reproducibility and replicability. We summarize our recommendations in Table 5 and detail them below.
Table 5. Best Practices for Reporting Text Mining.
It is essential for researchers to familiarize themselves with the key aspects necessary for the reproduction of their text mining research and report them. The first step, therefore, is to report information about the data set (i.e., the location and retrieval process). Next, researchers should report exactly which preprocessing techniques were used, the order in which they were conducted, which were excluded, and why. Because each preprocessing technique has implications for subsequent analyses, researchers should understand these ramifications and use them to justify their choices. For instance, Banks et al. (2018) recommended conducting multiple iterations of preprocessing to identify the optimal set of techniques—if doing so, researchers should report these procedures (either in the article or supplemental materials) and the benefits of this approach (if any) for their theory development and empirical investigation. They should also detail what algorithm and tools were used to do the analyses. Then the researchers should describe the features they extracted and any further analyses done. One of the best ways to ensure replicability would be to post the source code used to generate all analytic results (Sandve et al., 2013). An increasingly popular open science practice is to include supplementary materials that include the data (or the data source), source code, and a document detailing the analysis plan. This practice allows the analyses to be exactly replicated, supporting scientific transparency and integrity.
Given that text mining research is software dependent, special attention needs to be given to the software used to conduct text mining. For each preprocessing technique and analysis, the tools and software packages used and their version number should be reported. For closed vocabulary text mining, our review suggests that some preprocessing techniques may be automatically conducted by the proprietary software; authors should report both such automatic preprocessing and any preprocessing they applied intentionally. Given the extended timeline of research publications, researchers should consider a version control system that allows for tracking changes in code that may affect reproduction attempts. For instance, Git is such a version control system, and GitHub (github.com) hosts Git repositories online. Researchers should generally use crystallized software: software that is not in beta version, has been validated in a methodological study, is well documented, is actively maintained, and ideally, is open-source code (for further description of crystallized software, see Epskamp, 2019).
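For researchers working in Python, software versions can be recorded programmatically, as in the sketch below. The package names passed to the function (nltk and scikit-learn) are examples chosen for illustration and are not drawn from the studies reviewed here; the function name is hypothetical.

```python
import json
import platform
import sys
from importlib import metadata

def software_report(packages):
    """Collect interpreter, platform, and package versions for reporting."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

report = software_report(["nltk", "scikit-learn"])
print(json.dumps(report, indent=2))
```

The resulting record could be saved alongside the analysis code in supplemental materials so that others can recreate the same software environment.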
Subjective decisions about the analytical choices and interpretations, known as researcher degrees of freedom (Simmons et al., 2011), are particularly common when the methodology is so new that detailed aspects of it have yet to be thoroughly investigated. Preprocessing choices alter the results of text mining analysis, and therefore, researchers should explain why they chose a given set of preprocessing techniques. Other subjective decisions, such as dropping extracted features, can also affect the findings and therefore require justification. For example, Kobayashi et al. (2018b) first conducted their analysis with all stop words retained, then removed all stop words that were not predictive in their model. If theory suggests stop words are uninformative for a given task, then retaining predictive stop words may introduce spurious relationships or suppress the relationships other n-grams have to outcomes. Although seemingly minute, unreported and unjustified subjective decisions can create unreproducible and biased research findings. Relatedly, researchers should also strive to report results for all sets of preprocessed data. Although analysis may focus on one set of theoretically grounded results, researchers should mention how the other preprocessing combinations altered the results. Altered results present an opportunity: Researchers can generate theory to explain why the results changed between these preprocessing techniques (or why they did not change). Such practices hold potential for increasing the theoretical insights that can be derived from text mining.
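One way to keep track of every set of preprocessed data is to enumerate the combinations of preprocessing options explicitly before running any analysis. The sketch below does so in Python; the option names and levels are assumptions chosen for illustration, and the step of running one's own pipeline on each combination is not shown.

```python
from itertools import product

# Hypothetical preprocessing options to compare; adjust to the study at hand.
OPTIONS = {
    "stop_words": ["retain", "remove"],
    "normalization": ["none", "stem", "lemmatize"],
    "negation": ["ignore", "handle"],
}

def preprocessing_grid(options=OPTIONS):
    """Return one dictionary per combination of preprocessing options."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]

grid = preprocessing_grid()
for settings in grid:
    print(settings)  # each combination would be analyzed and its results logged
```

Reporting results for each of these combinations, even briefly in supplemental materials, makes clear how sensitive the findings are to preprocessing.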
As a new method in organizational science, there are a number of potential threats to the reproducibility of text mining research, including researcher degrees of freedom and software dependencies. Both inductive and deductive research in this area require careful consideration and transparent reporting of each analytical process used so that the findings can be independently verified.
Limitations and Future Work
A largely untapped opportunity exists for organizational researchers to analyze unstructured data in theory-driven, computationally advanced ways. Whereas computer scientists apply open vocabulary text mining to organizational problems in a data-driven way, organizational researchers can pair these approaches with strong theoretical reasoning, as in much of the closed vocabulary text mining research and our preprocessing recommendations (Figure 1). Doing so will aid researchers and practitioners because the insights derived will be grounded in theory, based on thousands of data points, and cross-validated. We provided guidance to help researchers and practitioners conduct empirically justified and conceptually driven preprocessing for text mining.
To our knowledge, an area of open vocabulary preprocessing with no evidence-based consensus is removing sparse or frequent terms. Multiple recommendations have been provided, including removing terms that occur in fewer than 1% of documents (Schwartz, Eichstaedt, Kern, et al., 2013) or in fewer than 5% or more than 99% of documents (Kobayashi et al., 2018a). Other practices can be found as well, such as removing terms that occur in fewer than 0.1% of documents (Antons et al., 2019) or removing corpus-specific stop words (Schmiedel et al., 2019). The appropriate answer likely depends on document length: A lower threshold (e.g., 0.1%) could be used for shorter documents to avoid increasing data sparsity, whereas higher thresholds (e.g., 5%) could be used for longer documents. Additionally, the evidence suggests that removing corpus-specific stop words is unnecessary (e.g., Schofield et al., 2017). In both cases, the effect of such practices should be empirically assessed.
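Threshold-based removal of sparse and frequent terms can be sketched as a filter on document frequency. In the Python sketch below, the function name and default thresholds are illustrative assumptions, and the thresholds used in the example call are toy values chosen for a four-document corpus rather than recommendations.

```python
def filter_by_document_frequency(vocabulary, matrix,
                                 min_share=0.01, max_share=0.99):
    """Keep terms whose share of documents lies within [min_share, max_share]."""
    n_docs = len(matrix)
    if n_docs == 0:
        return [], []
    keep = []
    for j, _term in enumerate(vocabulary):
        share = sum(1 for row in matrix if row[j] > 0) / n_docs
        if min_share <= share <= max_share:
            keep.append(j)
    kept_vocabulary = [vocabulary[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_vocabulary, kept_matrix

# Toy document-term matrix: 4 documents (rows) by 3 terms (columns).
vocabulary = ["the", "project", "deadline"]
matrix = [[3, 1, 0], [2, 1, 0], [4, 0, 1], [1, 1, 0]]
kept_vocabulary, kept_matrix = filter_by_document_frequency(
    vocabulary, matrix, min_share=0.30, max_share=0.99)
```

Rerunning the analysis under several thresholds is one way to assess empirically how sensitive the results are to this choice.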
Our recommendations are limited to existing open vocabulary text mining approaches adopted by organizational researchers. Emerging CL approaches that use neural network language models with transfer learning (e.g., BERT; Devlin et al., 2018) use both forward and backward context to represent word meaning. Such approaches can represent multiple meanings of the same word (i.e., polysemy), including negation (however, cf. Niven & Kao, 2019). Therefore, such approaches may require less preprocessing than recommended here.
Additionally, other textual features can be extracted that require different preprocessing techniques, but these features have received less attention in organizational research to date. Specifically, features related to NLP (Pandey & Pandey, 2019) that account for the syntactic structure of the text require the use of syntactic parsers early in preprocessing. Syntactic parsers identify words’ part of speech as well as the dependencies among words, enabling researchers to utilize nonadjacent, semantically connected words as features (e.g., subject-verb-object triples; Franzosi, 2004). More generally, certain parts of speech are considered vital for understanding language use (Pennebaker et al., 2003), and counting various parts of speech can provide insight into how language patterns shift depending on contextual and relational dynamics (e.g., Danescu-Niculescu-Mizil et al., 2012; Yang & Srinivasan, 2016). Future research should explore these textual features and their implications for organizational behavior.
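As a rough illustration of how parser output can be turned into such features, the Python sketch below extracts subject-verb-object triples and part-of-speech counts from a hand-written parse. Everything in it is an assumption made for illustration: the token records, the part-of-speech tags, and the dependency labels (`nsubj`, `dobj`) follow one common labeling convention, and the output of a real syntactic parser may be structured and labeled differently.

```python
from collections import Counter

# Hypothetical parser output for "The manager quickly praised the new team".
# Each record is (text, part_of_speech, head_index, dependency_label);
# head_index is the 0-based position of the word this token depends on.
parsed = [
    ("The", "DET", 1, "det"),
    ("manager", "NOUN", 3, "nsubj"),
    ("quickly", "ADV", 3, "advmod"),
    ("praised", "VERB", 3, "ROOT"),
    ("the", "DET", 6, "det"),
    ("new", "ADJ", 6, "amod"),
    ("team", "NOUN", 3, "dobj"),
]


def svo_triples(tokens):
    """Pair each subject and direct object that share the same head verb."""
    subjects, objects = {}, {}
    for text, _pos, head, dep in tokens:
        if dep == "nsubj":
            subjects.setdefault(head, []).append(text)
        elif dep == "dobj":
            objects.setdefault(head, []).append(text)
    triples = []
    for head, subject_words in subjects.items():
        verb = tokens[head][0]
        for subject in subject_words:
            for obj in objects.get(head, []):
                triples.append((subject, verb, obj))
    return triples


triples = svo_triples(parsed)
pos_counts = Counter(pos for _text, pos, _head, _dep in parsed)
```

Here the subject and object are not adjacent to the verb in the sentence, yet they are linked through the dependency structure, which is the kind of nonadjacent relationship that n-gram features cannot capture.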
Existing work utilizing text mining tends to eschew theory in favor of raw statistical power, whether to improve our understanding of the field (Antons et al., 2019) or to automate knowledge work (Campion et al., 2016). Conversely, Kleinbaum (2012) and Maslach et al. (2018) each provided a detailed theoretical rationale as the basis for analyzing and interpreting their textual features. Theoretical rationale is key for future work in this area because a theoretical understanding of the research question should guide preprocessing, feature extraction, and analysis. Our recommendations aim to provide a solid foundation for such research by ensuring that preprocessing decisions are empirically and theoretically justified, transparent, and replicable, resulting in data and findings that are psychologically meaningful and defensible.
Supplemental Material
Supplemental Material, appendices_final for Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations by Louis Hickman, Stuti Thapa, Louis Tay, Mengyang Cao and Padmini Srinivasan in Organizational Research Methods
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
