Introduction
The analysis of written text is not new: scholars have long devoted considerable effort to analysing monographs, ideological pamphlets, speeches and news reports to gain political and sociological insights. The field of content analysis is nonetheless undergoing a massive methodological transformation. On the one hand, the digitization of written records across the world (books, government documents, religious preaching and political party materials) is producing massive amounts of machine-readable text. On the other hand, the advent of digital social media platforms has democratized political communication. Earlier, only a select few could use the written word to engage with the mass public; now, most of us are part of public platforms, such as Facebook and Twitter, where we express our political opinions.
This note aims to highlight the availability of new methodological tools to analyse written texts more effectively. We begin by briefly summarizing four examples that are used throughout this note to illustrate technical points and help the reader grasp the full scope of the methodological innovations in the field of automated content analysis. We then describe the logistics of converting political texts into formats that can be quantitatively analysed. We also discuss the methodology behind modelling a corpus of text to create political variables, and the different statistical techniques researchers use to make sense of the political and social world. We then shift our focus to the importance of validating statistical models to ensure that the findings capture the underlying political processes. We end by discussing the advantages as well as the limitations of statistical techniques used to analyse text-based data.
Four Examples from India
We use four examples from diverse domains and different sources of digital texts to underline the broad applicability of new tools for automated content analysis.
Example 1: This research project uses newspaper archives, now available in digital format, to understand the incidence of violence between religious communities in India. It borrows its design from the Varshney-Wilkinson data set on Hindu–Muslim riots that sought to understand how incidence of violence varied over time and space and extends back to the precolonial period. This method of analysing Hindu–Muslim violence assumes that violence, on a medium to large scale, would be reported in the press and can be identified by analysing the text of newspaper articles.
Example 2: This research project focusses on how politicians use the social media space. Specifically, we look at interaction between Members of Parliament (MPs) of the 16th Lok Sabha and their online followers on Twitter to understand the nature of partisan polarization in India. We assume that the nature of retweets and hashtags would differ across partisan lines. Thus, analysis of the text of the tweets would help us in understanding convergence and divergence in political opinions expressed on online social media platforms.
A growing body of research uses new methodological tools to analyse the contents of discussions within institutions, such as debates in the Constituent Assembly, questions raised in the parliament and state assemblies, manifestos of political parties, and speeches by political leaders including those who hold executive office, to understand how the political interests of different groups within society are represented.
Example 3: We borrow from the work of Bhogale (2018), who analyses 276,000 parliamentary questions tabled in the Lok Sabha from 1999 to 2017 to understand the representation of Muslim interests.
Example 4: We borrow from Phadnis and Kashyap’s (2019) analysis of Mann ki Baat speeches delivered by Prime Minister Narendra Modi to explore whether the ideological leaning of a historical personality has a bearing on the amount of mention they receive in these speeches.
Sources of Digital Text
Digital texts can be accessed from three main sources: (a) traditional documents which have been digitized and made machine-readable, (b) online political content and (c) user-generated content on social networks. Along with historical documents, many governmental and non-governmental organizations are making their decisions and deliberations public, from Supreme Court rulings to parliamentary debates (refer to Examples 3 and 4), and these are now available in digital format. The digital format not only ensures the longevity of these records but also gives researchers easy access to them. This change is, in part, due to improvements in digitization techniques and cheap digital storage.
While previous research on many topics was limited in its coverage of time and space because of its dependence on manual coding, digitization greatly enhances the scale at which researchers can retest theories on larger samples. For instance, the research on ethnic violence using newspaper archives derives its feasibility and scale from a recent move by several newspapers to digitize their archives (refer to Example 1). Finally, social media platforms represent a completely new form of political communication that did not exist before. These platforms have become an important avenue of research for social scientists due to their increasing presence in society and their possible impact on everyday politics. Facebook and Twitter are now a sine qua non for politicians across the globe for reaching out to their supporters. More importantly, individuals are engaged in fierce partisan and non-partisan political battles in the online space and in the process are generating large digital footprints that can be used to gain insights on political processes (refer to Example 2). It would have been impossible to manually read and analyse the tweets of Indian parliamentarians and their followers (records run into millions of tweets) without the advancements in information technology that allow for the download and storage of large digital footprints.
Logistical Challenges in Analysing Digital Text
The possibility of analysing new sources of digital text is also accompanied by logistical challenges. In this section, we seek to highlight resources that can help researchers as they navigate the process of gathering and storing large-scale text data for their projects. Most historical documents are in formats (PDFs or images) that cannot be easily processed by statistical software packages. To make scanned documents ready for analysis, researchers use optical character recognition (OCR) technology to convert the digitized records, such as books and newspapers, into a readable text format. The OCR process might not successfully reproduce the original text if the document image is of poor quality. In such cases, researchers can transcribe the documents manually or hire help through online platforms such as Amazon Mechanical Turk to transcribe the documents efficiently into a digital format (Berinsky, Huber, & Lenz, 2012).
In contrast to historical documents, texts from web pages, such as Mann ki Baat speeches or parliamentary questions, can be more readily processed for analysis. In some cases, organizations provide developers access to their data through application programming interfaces (APIs). For example, Twitter has an API that can be accessed programmatically. Thus, for exploring differences in the nature of tweets along partisan lines, we use the Twitter API to download relevant tweets. Unfortunately, APIs remain the exception rather than the rule, and researchers have to download data from websites stored in a wide range of formats, from XML to Excel files. Downloading data from websites often requires writing extensive programmes to create web scrapers or crawlers that can iterate through multiple webpages and produce an output that can be easily analysed.
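The extraction step of a web scraper can be illustrated with a minimal sketch. It uses only Python's standard library and parses a hypothetical, hard-coded page rather than a live website; the class name `ParagraphExtractor`, the function `extract_paragraphs` and the sample page are all assumptions made for illustration, and a real project would add downloading, error handling and respect for a site's terms of use.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside every <p> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs, in page order
        self._buffer = None    # pieces of the <p> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside a paragraph.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def extract_paragraphs(html):
    """Return the paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page content standing in for a downloaded speech transcript.
SAMPLE_PAGE = """
<html><body>
  <h1>Speech transcript</h1>
  <p>My dear countrymen,   namaskar.</p>
  <p>Today I want to talk about <b>water conservation</b>.</p>
</body></html>
"""

paragraphs = extract_paragraphs(SAMPLE_PAGE)
print(paragraphs)
```

A crawler would apply the same extraction to each page it visits and store the resulting paragraphs, one document per page or per paragraph depending on the chosen unit of analysis.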
Stages of Analysing Digital Texts
The use of a text document to understand political processes is based on an assumption that words capture underlying political processes. Be it assessing the policy preferences of a politician using her speeches or the ideological positions of political parties through their manifestos, we expect that political variables are reflected in the written or spoken word. The process of coding text by humans involves understanding the context in which the text was generated and then creating guidelines for placing the text into categories or on a scale based on certain patterns. For instance, the same person may be making speeches in varied contexts, such as at the parliament, at an investor meet, at a global forum or at an election rally, and thus, for any meaningful interpretation, each set of such texts should be analysed with respect to the particular context in which it was generated.
Statistical analysis of text is similar in its approach but differs from human coding in fundamental ways. It begins with simplifying the text corpus by ignoring word order, dropping stop words and stemming the words, that is, removing everything but the root of the word. These pre-processing steps are common across different types of analyses and are well documented in a wide range of literature on this topic (Grimmer & Stewart, 2013; Lucas et al., 2015). The pre-processing of raw texts removes words that would not help the statistical model distinguish between documents and reduces the number of unique words that would be part of the analysis.
Document-term matrix: In most textual analysis projects, the first step is to represent the corpus of text as a document-term matrix, which captures the frequency of occurrence of unique words, also called terms or features, across different documents. The process of creating the document-term matrix also allows the researcher to give the text corpus numeric frequency values, making it amenable to quantitative analysis. Here, documents represent the basic unit of analysis and could be a post on Facebook, a tweet, a speech, a page, or a book, depending on how the researcher hopes to present her empirical analysis. For example, we consider every tweet as a document and seek to understand whether the nature of the text differs based on some ideological slant in the tweet.
Classification of documents: Most research analysing texts to understand political processes aims to classify documents into different categories. As noted above, a document could be a sentence in a book, a tweet, or a speech, depending on the unit of analysis. A classification of documents into different categories allows researchers to answer questions such as whether speeches are positive or negative in tone, whether press releases by politicians favour government policy or not, and whether manifestos of political parties differ on an ideological scale, from ‘conservative’ to ‘liberal’ or ‘right’ to ‘left’. Classification of documents based on textual analysis allows researchers to explore the relationship between the content of a document and a political outcome of interest. The process of classifying documents requires familiarity with statistical models and machine learning techniques (Grimmer, 2015; Lucas et al., 2015; Monroe & Schrodt, 2008).
Analytical models: Supervised and unsupervised: While going into the technical details is beyond the scope of this research note, we seek to outline the broad differences in statistical models used for textual analysis and point readers towards resources that can help familiarize them with the techniques. The statistical models for text-based analysis can be broadly divided into two groups, namely supervised learning and unsupervised learning. The choice between supervised and unsupervised models is inherently linked to the nature of the underlying question at hand.
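As a rough illustration of these pre-processing steps, the following sketch lower-cases a text, discards word order by returning a plain list of tokens, drops stop words and applies a deliberately crude suffix-stripping rule. The stop-word list, the suffix list and the function names are assumptions made for this example; applied work would normally rely on an established stop-word list and a proper stemmer such as the Porter algorithm.

```python
import re

# A tiny, hypothetical stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "were", "on"}

# Suffixes removed by the naive stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The riots were reported in the newspapers")
print(tokens)  # ['riot', 'report', 'newspaper']
```

The sketch handles English text only; texts in Indian languages would need language-specific tokenization and stemming rules.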
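A document-term matrix can be built in a few lines once the documents have been tokenized. The sketch below assumes that each document is already a list of pre-processed tokens; the function name `build_dtm` and the three toy documents are hypothetical.

```python
from collections import Counter


def build_dtm(tokenized_docs):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of times
    term vocabulary[j] occurs in document i."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Three toy documents (for example, three tweets after pre-processing).
docs = [
    ["vote", "party", "vote"],
    ["party", "rally"],
    ["rally", "rally", "vote"],
]

vocabulary, dtm = build_dtm(docs)
print(vocabulary)  # ['party', 'rally', 'vote']
for row in dtm:
    print(row)
```

Each row is one document and each column one term, so the first row reads: 'party' once, 'rally' never, 'vote' twice. For a large corpus a sparse representation would be preferable, since most entries are zero.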
Supervised models are very useful when researchers are dealing with questions in which the relationship between the text and the theoretical concept is well established (Evans & Aceves, 2016). In such cases, researchers start by coding a smaller sample of documents to create a training data set, that is, a small subset of a large corpus of text data with well-defined classifications. They then use a statistical model that establishes the relationship between the text in the training data set and its classification into different categories. Such models vary greatly in complexity, ranging from dictionary methods that categorize documents based on certain word frequencies to more complex models such as Random Forests. For example, we can manually classify a set of newspaper articles as mentioning ethnic violence or not and then use a statistical model to establish the relationship between the text and the classification. When a statistical model categorizes the training data into different categories with a reasonable degree of accuracy, it is deemed fit for the full corpus of data. Then, by scaling their trained model, the researchers extend their classification technique to the larger text corpus at minimal additional cost in terms of time and resources.
However, there are several instances where researchers attempt to discover patterns and look for new theoretical insights. Unsupervised models can be used in these cases to find underlying patterns in the data without imposing a predefined structure on it. Most topic models and clustering algorithms look at the underlying statistical variation in the document-term matrix to come up with natural classes within the data. While the statistical model can categorize text into different topics, the researcher has to assign theoretical meaning to these categories. Thus, the focus shifts from scaling models based on known categories to interpreting the meaning of theoretically unknown categories.
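The supervised workflow can be sketched with a small multinomial naive Bayes classifier written from scratch. Naive Bayes is used here only as a simple stand-in for the family of supervised models; it is not the model used in any of the four examples. The four hand-labelled training articles, the function names and the labels (1 for articles mentioning ethnic violence, 0 otherwise) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "class_docs": class_docs,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
        "n_docs": len(docs),
    }


def predict(model, tokens):
    """Return the class with the highest log posterior for a token list."""
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_class in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_class / model["n_docs"])  # log prior
        for token in tokens:
            if token in model["vocabulary"]:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((counts[token] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-coded training set: 1 = mentions ethnic violence, 0 = does not.
train_docs = [
    ["riot", "mob", "curfew"],
    ["clash", "mob", "injured"],
    ["cricket", "match", "win"],
    ["budget", "tax", "market"],
]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, ["mob", "curfew", "police"]))  # 1
print(predict(model, ["cricket", "win"]))           # 0
```

Once trained on the hand-coded sample, `predict` can be applied to every remaining article in the corpus, which is the scaling step described above.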
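To make the idea of unsupervised grouping concrete, the following sketch runs k-means clustering on the rows of a small document-term matrix. K-means is one simple clustering algorithm chosen for illustration; topic models work differently and are not shown here. The vocabulary, the counts and the choice of starting centroids are hypothetical, and the starting centroids are fixed so that the result is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, initial_centroids, n_iter=10):
    """Cluster rows of a document-term matrix around the given starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each document joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(row, centroids[k]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [row for row, a in zip(rows, assignments) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical counts for the terms ['growth', 'jobs', 'nation', 'border'].
dtm = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]

assignments, centroids = k_means(dtm, initial_centroids=[dtm[0], dtm[2]])
print(assignments)  # [0, 0, 1, 1]
```

The algorithm returns only group numbers. Deciding that group 0 is about the economy and group 1 about national security is the interpretive work that remains with the researcher.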
Our stylized examples can help make this distinction clearer. In the case of research on ethnic violence, we can use prior research to come up with clear guidelines to identify whether a newspaper article contains a mention of ethnic violence. We can use dictionary methods to create a set of words that are associated with violence or use human coding to classify articles as either reporting on ethnic violence (1) or not (0). We can then apply these strategies to a subsample of newspaper articles and choose a statistical model linking individual articles to incidents of ethnic violence. Finally, if the prediction of ethnic violence is reasonably accurate on the training data set, we can scale the approach to all newspaper articles. This way we can extend the empirical analysis of ethnic violence using newspaper records to wider time periods, at minimal cost compared to the time- and resource-intensive process of human coding of all articles. However, our research on how tweets differ across partisan lines may have to adopt a different approach because there is no well-established theory for classifying tweets (or texts) that could predict a partisan slant in the Indian context. We simply do not have an existing framework, and thus are inclined to use unsupervised models to discover whether there are any underlying differences and to automatically place tweets into different groups. The groupings would be based on the distribution of features or words across the tweets as captured by the document-term matrix. We can then choose a sample of tweets from different groups to give a theoretically substantive meaning to the unsupervised categories and finally explore whether these categories fall along partisan lines.
Bhogale (2018), in her study of 276,000 parliamentary questions, also relied on unsupervised learning methods. After narrowing down her sample to extract questions having a ‘Muslim theme’, she wanted to classify the questions into different topics. Her topic model categorized the questions into 20 different topics, but to interpret these topics she had to read through a sample of questions from each category. Based on contextual understanding, she was able to interpret the main themes across different categories. She found that a large share (32%) of all questions about Muslims are related to the ‘development’ topic, including education, literacy, poverty, women’s issues, health and transfer of benefits. The second category includes questions on the Haj (24%), ranging from questions concerning the Haj subsidies and annual expenditure on the subsidy to details of the arrangements for transportation and accommodation of pilgrims. The third category includes questions on culture and language (10%), such as questions on the teaching of the Urdu language in schools.
Validation of statistical analysis of text data: The seminal article by Grimmer and Stewart sets out four principles for text-based analysis, the first of which states that ‘All Quantitative Models of Language Are Wrong—But Some Are Useful’ (2013, p. 269). In essence, this statement highlights the importance of validating the models used to assign theoretical meaning to text-based documents. The conceptualization of political variables used in the analysis must align with how these variables are operationalized by the text-based statistical models. In supervised models, it is important for researchers to check whether their scaled-up model is doing a good job of classifying all the documents. For the ethnic violence example, we can sample a set of articles outside the training data set (test data) and check whether the quality of classification is high. It is possible that models that perform well on the training data perform worse on documents outside the training data. For unsupervised methods, the challenge of validation is more complex since there is no clear benchmark of the kind that the training data set provides in supervised learning. The onus is on the researcher to clearly establish what the different categories mean and then demonstrate by cross-validation that the way the variable is constructed matches the theoretical concept. For example, if unsupervised analysis of speeches is used to create groupings of speeches that favour and oppose war, then one way to cross-validate this measure would be to see if the extent of opposition changes between peacetime and wartime. Similarly, researchers can demonstrate construct validity for their measure by comparing outcomes from unsupervised methods with those from supervised methods (Grimmer & Stewart, 2013).
As an illustration, if using unsupervised models we find that individuals from one party are more jingoistic, as measured by some clearly defined coding process, than those from the other, then the measure could be cross-validated by showing that the extent of jingoism measured through tweets goes up during a war but not during friendly cross-national talks with a rival nation. Researchers can also select a subsample of tweets and ask human coders to place them on a scale of jingoism (Lowe & Benoit, 2013). The broader point is that the range of statistical models available and the complexity of the algorithms used make text-based analysis prone to incorrect interpretation of text and to selective choice of models that highlight positive results. Without proper validation, it would be difficult for researchers to make strong theoretical claims based on their text-based analysis.
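The dictionary approach mentioned for the ethnic violence example can be sketched as follows. The list of violence-related terms, the 10 per cent threshold and the function names are assumptions made purely for illustration; in practice both the dictionary and the threshold would be derived from prior research and checked against human-coded articles.

```python
import re

# Hypothetical dictionary of terms associated with violence.
VIOLENCE_TERMS = {"riot", "riots", "mob", "curfew", "clash", "clashes", "arson"}


def dictionary_score(text):
    """Share of tokens in the text that appear in the violence dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in VIOLENCE_TERMS)
    return hits / len(tokens)


def mentions_violence(text, threshold=0.1):
    """Code an article as 1 (reports violence) or 0 (does not)."""
    return 1 if dictionary_score(text) >= threshold else 0


print(mentions_violence("Curfew imposed after mob clashes in the old city"))    # 1
print(mentions_violence("The finance minister presented the annual budget"))    # 0
```

A dictionary of this kind cannot tell ethnic violence apart from other violence, which is one reason why human coding of a subsample remains important.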
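For supervised models, the held-out check described above amounts to comparing model predictions with human codes on documents that were not used for training. The sketch below computes accuracy, precision and recall for such a comparison; the eight human codes and model predictions are invented for illustration, and the function name `evaluate` is an assumption.

```python
def evaluate(predicted, actual, positive=1):
    """Compare model predictions with human codes on held-out documents."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    false_neg = sum(1 for p, a in pairs if p != positive and a == positive)
    correct = sum(1 for p, a in pairs if p == a)
    return {
        "accuracy": correct / len(pairs),
        # Precision: share of predicted positives that humans also coded positive.
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Recall: share of human-coded positives that the model recovered.
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Hypothetical test set of eight articles: 1 = reports ethnic violence.
human_codes = [1, 0, 1, 1, 0, 0, 1, 0]
model_codes = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = evaluate(model_codes, human_codes)
print(metrics)  # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```

Reporting precision and recall alongside accuracy matters when one category is rare, as articles on riots are likely to be: a model that never predicts violence could still achieve high accuracy.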
Conclusion
We have sought to provide a brief overview of the methodological innovations in the field of automated content analysis. Any review of such a vast field is bound to lack comprehensiveness. However, the idea behind this note is to point towards an exciting avenue of research that is likely to generate new insights about our political and social world. As the digital sphere expands further, it will become incumbent upon social science researchers to be well versed in the possibilities that automated content analysis will generate. These new tools have their own limitations, as Grimmer and Stewart (2013) point out in their article, and thus require much more careful validation of findings. Nevertheless, automated content analysis has the potential to support inferences that were previously impossible.
Acknowledgements
The authors would like to thank Ajit Phadnis, Divya Vaid, Samarth Bansal, Saloni Bhogale and Suhas Palshikar for their comments.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
