A Quest for Transparent and Reproducible Text-Mining Methodologies in Computational Social Science

1. Introduction

We thank the editorial board for the opportunity to discuss our methodological contribution in a symposium dialogue, and we thank the two commentators for their inspiring and challenging comments. We are especially delighted that the commentators agree on the relevance of analyzing the dynamics of manifest and latent meanings in big data using different text-mining tools in general and for map analysis in particular. According to our reading, the commentators focused on quality criteria, namely, two different but highly relevant aspects of transparency in research processes. Laura K. Nelson (this volume, pp. 139–143) focused on transparency in the context of research foci and analytical steps in a text-analysis project to ensure the reproducibility of results, whereas Burt L. Monroe (this volume, pp. 132–139) focused on transparency regarding data inspection and thus the credibility of results. We structure our rejoinder as follows: First, we draw on selected aspects raised by Nelson and Monroe that we believe they consider most important for reflecting on transparency in the context of big data and text-mining tools. Second, because quality criteria such as transparency do not exist in isolation, we complement the discussion on quality by adding general issues regarding overarching text-mining methodology. Finally, we conclude with an outlook on the further establishment of big data analysis in the social sciences.

2. Transparency and Reproducibility

In Nelson’s comment, we find the essence of two basic forms of subjectivity involved in social science research projects, namely, subjectivity regarding coding and interpretations on the one hand and subjectivity regarding research foci and analytic steps on the other. We thank Nelson for putting these two forms of subjectivity at the center of her thoughts; this helps highlight which form of subjectivity is addressable with text-mining tools and which is not.

Subjectivity regarding coding and interpretations can be addressed using text-mining tools. In line with Lee and Martin (2015), counting specific textual characteristics with text-mining tools does not force any particular meaning onto texts. The algorithms on which text-mining tools are based offer reproducible results. These machine-produced results do not rely on carefully considered, but still subjective (and for the observer, often invisible), ex ante interpretations to the same extent as coding procedures would.¹ In other words, coding procedures require social scientists to use their barely expressible preunderstandings about the social world for the coding of data (cf. Biernacki 2012), whereas text-mining tools require knowledge of how the algorithms operate and for which purposes their application appears reasonable. Using text-mining tools, ex ante interpretations with reference to coding are minimized; for instance, map analyses enable social scientists to be transparent about the purposes a map should fulfill and which tool they should apply to meet the given purpose’s requirements. In this sense, statistics and the results of text-mining tools support ex post interpretations, which can openly be discussed among social scientists because the foundation of such results is driven by the transparently stated purpose of the given map and the algorithm of the text-mining tool. However, the character of maps is not influenced by the hardly verbalizable preunderstandings of researchers (Lee and Martin 2015).

With respect to subjectivity regarding research foci and analytic steps, Nelson (p. 139) argues that “sociologists using text as data must make a dizzying number of decisions about what information to extract and how to answer their research question. It is simply impossible to represent text absent any subjective decisions and have those representations be analytically useful, particularly when measuring meaning in text.” We would like to emphasize Nelson’s statement and point out that text-mining tools do not relieve social scientists from making various subjective ex ante decisions for their research design, which they should make transparent and comprehensible. Therefore, we fully agree with the relevance of Nelson’s five guidelines to evaluate any text-analysis project even if we would like to suggest a slight reconsideration of guidelines four and five.

In guideline four, Nelson argues that social scientists should consider whether altered textual characteristics, text-mining tools, or other ex ante decisions might change results and conclusions in a substantial way. From our perspective, it seems natural for results and conclusions to change if the steps of analysis or the tools used are modified. It is exactly this fact that makes approaches such as map analysis meaningful because maps based on different articulated purposes highlight multiple perspectives on the same text corpus. In other words, focusing on a specific purpose and thus on specific textual characteristics is meaningful as “text conveys a vast amount of information, much of it ambiguous and only some of which is relevant for a research question or purpose” (Nelson, p. 139).

In guideline five, Nelson calls on social scientists to consider whether interpretations are reproducible by other researchers analyzing the same data. We would be careful in speaking of reproducible interpretations because interpretations belong to the step that we propose is the most subjective and therefore the one that should be made as transparent as possible. In our methodological approach, we suggest providing results that are generated without ex ante interpretations to enable ex post interpretations. In line with Lee and Martin (2015), such an approach is valuable because it enables researchers to mutually reflect on their ex post interpretations to arrive at conclusions. We argue that divergence in ex post interpretations is not a general issue of concern but instead facilitates open discussions that anyone can participate in because results provided by text-mining tools and displayed in maps are largely independent of ex ante interpretations. In other words, our approach implies that social scientists’ preunderstandings and interpretations can be explicitly and transparently applied to data after the construction of maps and thus remain comprehensible for others.

In his comment, Monroe points to another crucial quality criterion for the utilization of text-mining tools: the importance of being clear about the nature of the data. We believe the aspects Monroe highlights can be subsumed under an overarching and relevant issue, namely, the careful inspection of data before any kind of text-mining analysis. Overall, we agree with Monroe that no work can report all descriptive statistics of a given text corpus, but the merit of this comment lies in his call for social scientists to inspect their data carefully and provide readers an intuitive sense of the appropriateness of data for a given research question.

In more detail, as text-mining tools uncover and investigate patterns within texts, the preparation of the text corpus is highly relevant (see, e.g., Händschke et al. 2018). For instance, duplicates must be removed from text corpora to avoid biasing results. In addition, even if the exclusion of semantic surroundings (“topics”) from further analysis is a proper procedure, it is still necessary to verify whether social scientists unknowingly exclude the most relevant semantic surroundings and instead build their analyses on residual ones. For our analysis, we considered this issue and found that the surroundings we excluded exhibited a small degree of importance across all documents. A t test showed that the distribution values of the used surroundings were significantly greater than the distribution values of the excluded surroundings (t = 4.76, df = 3,345.78, p < .001).
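As an illustration only, a comparison of this kind could be sketched as follows. The sketch assumes an unequal-variance (Welch) t test, which would be consistent with the fractional degrees of freedom reported above, although the test variant is not stated in the text. The function name `welch_t_test` and the two small samples are hypothetical and are not the data used in the analysis.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_test(a, b):
    """Return (t, df) for Welch's unequal-variance t test of mean(a) vs. mean(b)."""
    na, nb = len(a), len(b)
    # Squared standard errors of the two sample means (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical distribution values of used and excluded surroundings (illustration only).
used = [0.30, 0.25, 0.35, 0.28, 0.32]
excluded = [0.10, 0.12, 0.08, 0.15, 0.11]

t, df = welch_t_test(used, excluded)
print(f"t = {t:.2f}, df = {df:.2f}")
```

A positive t value indicates that the used surroundings have the larger mean. Turning t and df into a p value requires the t distribution, which is available in statistical packages but not in the Python standard library used here.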

Another source of biased results may lie in the sparseness of data. For example, to ensure that the movement of semantic triplets across eras (see Figure 5 of our article) is not influenced by a generally low probability of occurrence due to smaller sample sizes, we verified whether the frequency of triplet occurrence changed systematically over time. In our case, we did not find strong variations (the average occurrence of triplets varies around the mean at a level of 7.19 percent).² This shows that smaller data sets do not necessarily imply biased results of grammatical parsing because systematic grammatical patterns may also exist in rather small text corpora. A sparseness of results may also arise if the length of the texts processed by text-mining tools, such as topic modeling to analyze semantic surroundings, differs to a large degree. One option to test for sparseness regarding text length would be, as Monroe suggests, to inspect the richness of the document-term matrix produced for applying topic modeling. Researchers could, for example, inspect the documents and check whether the text lengths differ significantly. Of course, every corpus contains some documents with only short texts; what is decisive is whether the average length of the texts from which the semantic surroundings are constructed varies dramatically. Before our analysis, we found that variation did not appear to be an issue. The length of texts we used to construct the semantic surroundings of triplets in era one was only 27.9 percent smaller, on average, than in era two, and the length of texts in era two was 7.0 percent smaller, on average, than in era three.
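The two sparseness checks described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: "variation around the mean" is read here as a coefficient of variation (the exact statistic is not specified in the text), and the function names and all numbers are hypothetical rather than taken from the analysis.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation relative to the mean, in percent."""
    return 100 * pstdev(values) / mean(values)


def percent_smaller(earlier, later):
    """By how many percent mean(earlier) falls below mean(later)."""
    return 100 * (mean(later) - mean(earlier)) / mean(later)


# Hypothetical average triplet occurrences per era (illustration only).
triplet_occurrence = [95, 100, 105]
print(f"variation around the mean: {coefficient_of_variation(triplet_occurrence):.2f} percent")

# Hypothetical text lengths (in tokens) per era (illustration only).
era_one = [800, 1000, 900]
era_two = [1200, 1300, 1100]
print(f"era one texts are {percent_smaller(era_one, era_two):.1f} percent shorter than era two texts")
```

Small values on both checks would suggest that neither the frequency of triplets nor the length of the underlying texts changes systematically across eras.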

So far, the discussion in this symposium shows that social scientists have a lot of options at hand to verify the transparency of data analysis and ensure the reproducibility of results. However, the discussion also reveals that we have not yet reached a common standard. For example, this fact is reflected in the widely diverging level of detail with which social scientists, when using text-mining tools, describe their methodology and the amount of information they provide readers to substantiate the appropriateness of their data and results. In this sense, the symposium illustrates the relevance of such debates for the social sciences.

3. Text-Mining Methodology

We believe debates about quality criteria regarding the utilization of text-mining tools are central for the social sciences; therefore, we are grateful to both commentators for raising this important topic. However, the application of text-mining tools requires concrete research designs, and we have the impression that discussions about such research designs are rather unbalanced. Thus, to complement the discussion in the symposium on transparency and reproducibility, we aim to provide an overview of the current state of text-mining methodology in computational social sciences.

We believe that most methodological frameworks aim to use text-mining tools to facilitate close reading and thus to discover new social theory. Accordingly, scholars have suggested using text-mining tools to scale up well-established inductive research paradigms, such as grounded theory (Muller et al. 2016; Nelson 2017) or hermeneutics (Mohr et al. 2013; Mohr, Wagner-Pacifici, and Breiger 2015). These methodological frameworks suggest applying text-mining tools to uncover coarse-grained patterns of meaning that can enable close reading in the context of big data. One central contribution of using text-mining tools in this way is that scholars, instead of collecting the best small sample of texts to represent the textual whole, can consider large amounts of texts in their full complexity and nuance (Edelmann and Mohr 2018; Wagner-Pacifici, Mohr, and Breiger 2015). Such an approach appears well suited for studies that cannot reasonably reduce their text material to a small sample and studies that, due to a lack of social theory, cannot determine from the outset whether a small text sample represents a given social phenomenon accurately. In this sense, text-mining tools support inductive research paradigms in discovering unnoticed, surprising, and often latent regularities in massive samples of texts, which may boost sociological theorization (Evans and Aceves 2016). Our work complements this stream of research by using various text-mining tools to address different layers of meaning that enable inductive but still formal in-depth analyses of manifest and latent meanings in large text corpora.

However, to the best of our knowledge, only a few studies utilize text-mining tools for the confirmation or testing of social theory. Many studies apply text-mining tools to count textual characteristics for descriptive analyses (DiMaggio, Nag, and Blei 2013; Mohr et al. 2013; Rule, Cointet, and Bearman 2015; Sudhahar et al. 2013), but we propose expanding the utilization of text-mining tools for the operationalization of theoretical constructs in deductive research. The few existing studies use text-mining tools to produce measures based on counts of single entities. For example, Bail (2016) used topic modeling to construct an independent variable that measures the diversity of discursive content in organizations’ social media accounts. Fligstein, Brundage, and Schultz (2017) applied topic modeling to meeting transcripts of the Federal Open Market Committee to construct a dependent variable that measures the prevalence of finance and banking frames used by actors with and without private banking experience. Van Atteveldt, Kleinnijenhuis, and Ruigrok (2008) used grammatical parsing to count the extent to which actors are mentioned as grammatical subjects in newspaper articles. The authors used these counts as dependent variables that they argued were predicted, for example, by the perceived powerfulness of the given actor.

Despite the value of count measures for quantitative analysis, we would like to add that computational social science may also apply text-mining tools to construct relational measures. Relational measures are especially useful because culture and meanings are built from structures of similarity and distance, and social scientists may wish to study the patterns of relations between entities (Mohr 1998). We suggest the utilization of well-established similarity and distance measures (Manning and Schütze 2000) to trace these relational patterns. For example, Goldenstein and colleagues (2019) used words as dimensions of vector spaces to measure the distance between self-representations of large corporations from Germany, the United Kingdom, and the United States. Instead of relying on words, our work demonstrates that vector-space dimensions can also refer to communication structures and semantic surroundings. Finally, our work shows that the utilization of text-mining tools may also enable social scientists to construct traditional network analytic measures (e.g., degree, density, or centrality) based on a large amount of textual data (for an application, see Sudhahar et al. 2013).
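To make the idea of a relational measure concrete, the following sketch computes the cosine distance between two texts that are represented as word-count vectors, with each distinct word serving as one dimension. Cosine distance is only one of several established measures and is chosen here as an assumption; it is not necessarily the measure used in the studies cited above. The function name and the example sentences are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_distance(text_a, text_b):
    """Cosine distance (1 - cosine similarity) between word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words that occur in both texts.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # Convention chosen here: an empty text is maximally distant.
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical self-representations of two corporations (illustration only).
report_one = "we serve global markets and global customers"
report_two = "we serve local communities and local customers"
print(round(cosine_distance(report_one, report_two), 3))
```

The same function applies unchanged if the dimensions are not words but other countable features, such as communication structures or semantic surroundings, provided each text is first converted into a sequence of feature labels.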

4. Conclusion

This symposium illustrates how scholars, in a common endeavor, are moving toward professionalizing the application of text-mining tools in the social sciences. We believe that a continuous discussion and reflection on quality criteria and text-mining methodologies will yield important benefits. As an example of such a discussion, Nelson’s and Monroe’s highly relevant but, at the same time, diverging comments reveal two of these benefits. First, discussions enable social scientists to uncover how methods from other disciplines, such as computer science and linguistics, complement or even advance established methodological traditions and tool kits. Second, open communication may yield insights into when and how these methods are reasonably applied to advance social scientific evidence. Finally, because no single method can solve all issues that text analyses may face and because even the advent of sophisticated text-mining tools is just a small step forward, social scientists’ ongoing discussion is the most important source for identifying the best (even if preliminary) solution for the problems at hand.

Author Biographies

Author biographies can be found on page 131 of this volume.

References

Bail, Christopher A. 2016. “Cultural Carrying Capacity: Organ Donation Advocacy, Discursive Framing, and Social Media Engagement.” Social Science and Medicine 165(1):280–88.

Biernacki, Richard. 2012. Reinventing Evidence in Social Inquiry. New York: Palgrave Macmillan.

DiMaggio, Paul J., Manish Nag, and David Blei. 2013. “Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding.” Poetics 41(6):570–606.

Edelmann, Achim, and John W. Mohr. 2018. “Formal Studies of Culture: Issues, Challenges, and Current Trends.” Poetics 68(1):1–9.

Evans, James A., and Pedro Aceves. 2016. “Machine Translation: Mining Text for Social Theory.” Annual Review of Sociology 42(1):21–50.

Fligstein, Neil, Jonah S. Brundage, and Michael Schultz. 2017. “Seeing like the Fed: Culture, Cognition, and Framing in the Failure to Anticipate the Financial Crisis of 2008.” American Sociological Review 82(5):879–909.

Goldenstein, Jan, Philipp Poschmann, Sebastian G. M. Händschke, and Peter Walgenbach. 2019. “Global and Local Orientation in Organizational Actorhood: A Comparative Study of Large Corporations from Germany, the United Kingdom, and the United States.” European Journal of Cultural and Political Sociology 6(2):201–36.

Händschke, Sebastian G. M., Sven Büchel, Jan Goldenstein, Philipp Poschmann, Udo Hahn, and Peter Walgenbach. 2018. “A Corpus of Corporate Annual and Social Responsibility Reports: 280 Million Tokens of Balanced Organizational Writing.” Pp. 20–31 in ECONLP 2018: Proceedings of the First Workshop on Economics and Natural Language Processing. Melbourne, Australia: Association for Computational Linguistics.

Lee, Monica, and John Levi Martin. 2015. “Coding, Counting and Cultural Cartography.” American Journal of Cultural Sociology 3(1):1–33.

Manning, Christopher D., and Hinrich Schütze. 2000. Foundations of Statistical Natural Language Processing. 4th ed. Cambridge, MA: MIT Press.

Mohr, John W. 1998. “Measuring Meaning Structures.” Annual Review of Sociology 24(1):345–70.

Mohr, John W., Robin Wagner-Pacifici, and Ronald L. Breiger. 2015. “Toward a Computational Hermeneutics.” Big Data & Society 2(2):1–8.

Mohr, John W., Robin Wagner-Pacifici, Ronald L. Breiger, and Petko Bogdanov. 2013. “Graphing the Grammar of Motives in National Security Strategies: Cultural Interpretation, Automated Text Analysis and the Drama of Global Politics.” Poetics 41(6):670–700.

Muller, Michael, Shion Guha, Eric P. S. Baumer, David Mimno, and Sadat Shami. 2016. “Machine Learning and Grounded Theory Method.” Pp. 3–8 in Proceedings of the 19th International Conference on Supporting Group Work - GROUP ’16. New York, NY: Association for Computing Machinery.

Nelson, Laura K. 2017. “Computational Grounded Theory: A Methodological Framework.” Sociological Methods & Research. doi:10.1177/0049124117729703

O’Neil, Cathy. 2017. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. London: Penguin Books.

Rule, Alix, Jean-Philippe Cointet, and Peter S. Bearman. 2015. “Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014.” Proceedings of the National Academy of Sciences 112(35):10837–44.

Sudhahar, Saatviga, Gianluca de Fazio, Roberto Franzosi, and Nello Cristianini. 2013. “Network Analysis of Narrative Content in Large Corpora.” Natural Language Engineering 21(1):81–112.

van Atteveldt, Wouter, Jan Kleinnijenhuis, and Nel Ruigrok. 2008. “Parsing, Semantic Networks, and Political Authority: Using Syntactic Analysis to Extract Semantic Relations from Dutch Newspaper Articles.” Political Analysis 16(4):428–46.

Wagner-Pacifici, Robin, John W. Mohr, and Ronald L. Breiger. 2015. “Ontologies, Methodologies, and New Uses of Big Data in the Social and Cultural Sciences.” Big Data & Society 2(2):1–11.