Abstract
In this article, we argue that the advent of data mining techniques and big data in media and communication studies presents problems that involve fundamental methodological questions, requiring us to revisit existing ways in which the links between theory, operationalization and data are explained and justified. We note that the discourse of instrumental optimization that surrounds big data clouds epistemic debates about their appropriate integration in scholarly explanations, and argue that a discussion of these problems can usefully take as its point of departure the distinction between the two main types of data mining models (supervised and unsupervised). We argue that both types pose specific challenges and give examples of ways they have been productively overcome. In particular, we argue that while big data approaches have introduced novel opportunities for research, they have fundamentally been incorporated into media and communication studies in ways that comply with existing, prototypical explanatory schemes. Our examples link specific empirical studies to general strategies of scientific explanation, focusing on neo-positivist, critical realist and interpretivist explanations.
Big data – Paradigm or method?
The advent of big data as a topic in media and communication studies more than a decade ago has given rise to many different developments. It has provided new analytical tools and new data sources, and it has sparked a new interest in quantitative methods. The new types of data have helped shed light on topics relating to new and legacy media alike: The digital revolution has led to new and more fine-grained measurements of both radio and television audiences and has also extended the tracing of audiences across new and old platforms. The development has been closely linked to the simultaneous expansion of the field of study. The new methods and data sources arrived as part and parcel of the digital expansion of the media landscape, and in many cases represented the only available source of insight into the new parts of it. This was the case with several social media platforms, such as Twitter, which for years made it possible for researchers to harvest profile information and message data. The development has also been helped by the business-driven interest in exploiting the new data sources, which led to the blossoming of a wide array of tools aimed at simplifying the process of collecting, cleaning, preparing and analysing large and often heterogeneous data sets.
In the following, we will argue that an important, but rarely discussed, aspect of the turn to big data has been an increased flexibility in the way quantitative data are used in media and communication studies. Not only have tools such as predictive machine learning led to new possibilities in existing, quantitatively oriented schools of thought in the field, they have also led to the application of explanatory strategies not normally associated with quantitative analysis (e.g. interpretivism) to quantitative data, based on a wider set of tools and techniques. In particular, we argue that big data should not be viewed as a methodological paradigm in its own right in media and communication research, closely associated with machine learning techniques. Instead, we should see it as a new set of analytical techniques which have been productively integrated into many of the existing ways of doing research in the field of media and communication. Below, we will exemplify this using expositions of the key explanatory frameworks of neo-positivism, critical realism and interpretivism, all of which are regularly used in empirical analysis in our field, including analyses using big data – whether explicitly stated or not (Blaikie, 2000, 2007).
Before we exemplify this, we want to point to two key debates which have complicated the discussion so far. First, the methodological debates in our field have been influenced by an internal debate in statistics about the ways in which inferences can be drawn from data, and the status of the conceptual tools used to underpin them (Efron and Hastie, 2016; Hastie et al., 2009; James et al., 2013). In short, the mathematics underpinning traditional, inferential statistics was developed to deal with data scarcity (when data were expensive to come by), whereas computational-era statistics has developed to deal with very different problems associated with data abundance (e.g. high-dimensionality). This debate has sometimes created a dichotomous discussion, suggesting that big data was a terra nova, where existing methodological frameworks did not apply or had limited relevance. Looking at actual, successful applications of big data and machine learning in our field, as we do below, this hardly appears to be the case – on the contrary, it appears that the new methods and statistics have been tailored to fit existing explanatory frameworks.
A second and related issue concerns the industrial origins of big data technologies and analytical best practices. This has led to powerful analytical tools and a rapid escalation of practical specialization (e.g. in the form of so-called data science competencies). Yet, it has also clouded key differences between scholarly and corporate uses of big data: Scholarly uses are focused on explanations, that is, on deriving insights into reality through empirical observations scaffolded by theoretical concepts. Corporate uses ultimately gravitate towards optimization of information gain in order to drive profits up and/or costs down. Common to both is a concern for the quality of the statistical model, but the ultimate purpose and measure of analytical success differ substantially (boyd and Crawford, 2012; McKinsey Global Institute, 2011; Mayer-Schönberger and Cukier, 2013). In business uses of data, the predictive power of a model is often pursued as the ultimate goal: if the purpose of an analysis is to predict the future behaviour of a customer in a webshop, then any available data that can help lift the predictive accuracy are useful. In a scholarly analysis, predictive accuracy is rarely an end in itself: these analyses strive to understand how different factors interact to produce an outcome. The field has used traditional linear regression (foundational in predictive statistics) as a core tool in quantitative analysis since long before the advent of big data. Yet this was rarely used to predict unknown states or outcomes – rather, it was used to understand how different factors (e.g. gender and age) influenced a dependent variable (e.g. consumption of television news programming). In his programmatic Wired article, Anderson (2008) suggests that science can do away with the type of ‘why’ questions that arise from efforts to understand the interplay between variables in a model. Instead, science should simply make do with ever-improved prediction.
Today, few probably take Anderson’s point as valid, but many also refrain from answering the ‘then what?’ question that follows from rejecting it. In the following, we will argue two related points: First, that big data have made an undeniable contribution to the field of media and communication studies, and, second, that researchers have accomplished this by integrating them into existing explanatory schemes from the social sciences. We pursue this by empirical exemplification, showing how successful applications of machine learning and big data have followed existing models of social scientific explanation. We do not expect this to exhaust the range of successful applications, nor do we want to argue that ours are the only appropriate examples, as there are many others to choose from. What we hope to do is to exemplify the diverse range of ways that big data are used in media and communication research, and to argue that the main benefits to the field from integrating big data in its empirical repertoire do not stem from the predictive accuracy of machine learning, but rather from the application and extension of existing explanatory paradigms.
Data mining – Supervised and unsupervised approaches
Big data approaches to analysis are often grouped into supervised and unsupervised models (Provost and Fawcett, 2013). The former, comprising what is often labelled machine learning, covers modelling approaches where the algorithm is trained on data sets that include both the property to be predicted (e.g. data on whether a viewer chooses to view a given video on a streaming platform or not) and the available predictor variables (e.g. viewing history, time of day, demographics). The model is subsequently tested on a new data set where the outcome variable is known to the analyst but is not included in the data set fed to the model, which will calculate the value for the outcome variable. Comparing the calculated values to the actual values of the outcome variable in the test set allows the analyst to assess model accuracy. In recent years, the advent of the class of so-called deep neural networks has further increased the performance of supervised models, and it is the predictive strength of supervised models which accounts for much of the commercial interest in and cultural fascination with big data. For predictive analysis of unstructured data, such as facial recognition based on digital images (Krizhevsky et al., 2012), deep neural networks have contributed to vastly increased performance, whereas analysis of structured data (e.g. tabular data of the sort traditionally used in quantitative studies in our field) is currently led by the XGBoost algorithm (Chen and Guestrin, 2016).
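The train-and-test logic described above can be illustrated with a minimal sketch. It rests on several assumptions that are ours rather than drawn from any cited study: the viewing data are invented, the two predictors (hours viewed last week and an evening flag) are hypothetical, and a one-nearest-neighbour rule stands in for the far more capable algorithms mentioned in the text.

```python
# Hypothetical toy data: (hours viewed last week, evening session flag) -> watched (1) or not (0).
train = [((1.0, 0), 0), ((2.0, 0), 0), ((1.5, 1), 0),
         ((6.0, 1), 1), ((7.5, 0), 1), ((8.0, 1), 1)]

# Held-out test set: the analyst knows the outcomes, but the model only sees the predictors.
test = [((0.5, 1), 0), ((7.0, 1), 1), ((6.5, 0), 1), ((2.5, 0), 0), ((3.5, 0), 1)]


def predict(features, training_set):
    """One-nearest-neighbour rule: return the label of the closest training case."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training_set, key=lambda case: squared_distance(case[0], features))
    return nearest[1]


def accuracy(test_set, training_set):
    """Share of test cases where the calculated outcome matches the actual outcome."""
    hits = sum(1 for features, label in test_set
               if predict(features, training_set) == label)
    return hits / len(test_set)


model_accuracy = accuracy(test, train)
print(f"Accuracy on held-out test cases: {model_accuracy:.2f}")
```

The single figure printed at the end is the kind of measure that commercial applications tend to optimize; in a scholarly explanation, it would be one input among several.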
The class of unsupervised models does not rely on the logic of training and testing data. An example of this class is the clustering of cases in a data set into smaller, more homogeneous groups, for example, by grouping users of a streaming service into segments according to their viewing histories. Since there is no way of defining a right number of clusters a priori, the decision on how many clusters to settle on for the final model will to a large extent be a matter of choice for the analyst; although some measures of homogeneity of the resulting clusters can provide guidance for the choice, there is usually no exact statistical answer, but rather a range of equally relevant solutions. Examples of unsupervised approaches in communication research include topic modelling (a text analysis approach that groups texts based on the similarity of words and phrases used) and social network analysis (a technique that determines the relationship of entities based on their position in a network). The allure of unsupervised models is the (relative) ease of application: compared to supervised models, the researcher does not have to train and test models but only needs to prepare the data and feed it to a computer algorithm, which then produces a result.
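The point about the number of clusters can be made concrete with a small sketch. It assumes invented weekly viewing hours for eight hypothetical users and uses a simple one-dimensional k-means procedure with a deterministic starting point; both choices are ours, for illustration only. The within-cluster sum of squares serves as the homogeneity measure.

```python
def assign(data, centres):
    """Place each value in the cluster of its nearest centre."""
    clusters = [[] for _ in centres]
    for value in data:
        nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
        clusters[nearest].append(value)
    return clusters


def kmeans_1d(values, k, iterations=20):
    """Simple 1-D k-means; returns the clusters and the within-cluster sum of squares."""
    data = sorted(values)
    # Deterministic start: centres spread evenly across the sorted data.
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(data, centres)
        # Keep the previous centre if a cluster happens to be empty.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    clusters = assign(data, centres)
    wss = sum((v - centres[i]) ** 2 for i, c in enumerate(clusters) for v in c)
    return clusters, wss


# Hypothetical weekly viewing hours for eight users of a streaming service.
viewing_hours = [1, 1.5, 2, 8, 8.5, 9, 15, 16]

wss_by_k = {k: kmeans_1d(viewing_hours, k)[1] for k in range(1, 5)}
for k, wss in wss_by_k.items():
    print(f"k={k}: within-cluster sum of squares = {wss:.2f}")
```

In this toy case the measure keeps falling as clusters are added, so it cannot by itself name a correct solution. It drops sharply up to three clusters and only marginally after that, which offers guidance but leaves the final choice to the analyst.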
Models and methods
As discussed above, data mining techniques have found several uses in media and communication research. Blaikie and Priest (2016) develop a vocabulary of core strategies which are characteristic of the wider field of social research. These comprise explanatory strategies and (specifically) the particular kinds of justification that each paradigm provides for deeming a given way of producing knowledge to be scientific, for example, the meaning and mode of applying theory and of linking observations and concepts. Blaikie and Priest identify a number of central paradigms and their defining characteristics in terms of the role of theory, ontological assumptions, the preferred mode of logic and the role of empirical materials in reaching explanations. In the following section, we will concentrate on prototypical ways big data have been used in neo-positivist, critical realist and interpretivist explanatory frameworks in media and communication studies. These paradigms represent the most commonly used explanatory frameworks in the field and have been chosen to illustrate the different ways in which big data and machine learning are employed in scholarly explanations. We stress that the attribution of the articles to the different paradigms of explanation for the most part rests on our reading of them rather than on direct claims by the authors themselves and is therefore debatable.
Neo-positivism
Neo-positivism is probably the paradigm that is by default assumed to underlie big data analysis in the field, and the term (or at least ‘positivism’) is often used pejoratively in our field and in the social sciences more broadly to indicate an a-theoretical application of quantitative methods. It is, however, important to note that Blaikie and Priest (2016) do not follow this assumption, but explicitly see empirical, neo-positivist research as based on theoretical propositions about the phenomenon under study. The difference from other paradigms lies rather in the nature of what counts as theory, which in neo-positivism is less a theoretical system, and more a collection of clearly defined terms native to the sub-field of study (e.g. what constitutes different types of news, news framing, etc.).
Neo-positivism is also characterized by clear formulations of testable hypotheses and a deductive approach to explanation, which involves hypothesis testing. This has meant that the logic of supervised modelling is an obvious match for this paradigm, but the lineage of the paradigm is long in media and communication studies: The dominant approach in media and communication research for many years has been to test well-known theories of the field in new contexts (e.g. using different empirical material or novel methods). Since the foundational article by McCombs and Shaw (1972), agenda-setting has proven to be a highly resilient and influential theory in media and communication research. The theory predicts that media organizations, that is, the press, shape what people care about by setting the public agenda in their publications. The theory has received much praise for its simplicity, but also faced criticism, in part for the linearity in the causal statement: agendas flow from the media to the public, not the other way around. A 2014 article (Neuman et al., 2014), among others, demonstrated how large-scale digital data sets could shed new light on this directionality issue. Whereas agenda-setting research typically relies on surveys to identify what people at large find important, this study utilized data from Twitter as well as blogs and web fora as a proxy for the public agenda. The authors then compared the popularity of a range of researcher-defined issues (such as economy, social issues and environment) over time to coverage of the same issues in online news articles. To trace issues across media, the study used a simple supervised model: a classification of text based on Boolean search strings. For example, if a text matches the search string ‘unemployment OR employment’, it is treated as dealing with the issue of ‘jobs’.
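The kind of Boolean classification described above can be sketched in a few lines. Only the ‘jobs’ search terms come from the example in the text; the ‘environment’ terms, the function name and the sample sentences are hypothetical additions of ours, and the sketch does not reproduce the study's actual search strings.

```python
import re

# Each issue is defined by a list of search terms joined by OR.
# Only the 'jobs' terms are taken from the text; 'environment' is a hypothetical addition.
ISSUE_TERMS = {
    "jobs": ["unemployment", "employment"],
    "environment": ["climate", "pollution"],
}


def classify(text):
    """Return every issue for which at least one search term occurs in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(issue for issue, terms in ISSUE_TERMS.items() if words & set(terms))


print(classify("Unemployment rises as climate talks stall"))
print(classify("Local football results"))
```

Counting the texts assigned to each issue per day, separately for social media posts and news articles, would yield the kind of issue timelines that the study goes on to compare.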
Using a statistical method for comparing different timelines (Granger causality), the study found agenda-setting not to be unidirectional, but instead highly dynamic – originating as often in the public agenda as in the news media and with a tendency to flow both ways; a ‘two-way-agenda-setting’ (Neuman et al., 2014: 210). While such an approach introduces new challenges (e.g. operationalizing ‘public agenda’ as a product of tweeting and blogging), it does offer an important corrective to one of the field’s most-used theoretical frameworks. In this way, big data analyses can update and challenge established beliefs and theories in the field of communication.
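The logic of a Granger test can be sketched as a comparison of two regressions: one predicting today's news coverage from yesterday's news coverage alone, and one that adds yesterday's social media attention. If the second fits markedly better, the social media series is said to Granger-cause the news series. The sketch below is a minimal single-lag version written for illustration; the daily counts are invented, the variable names are ours, and it reports only an F statistic, not the full procedure or lag structure used in the study.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(design, target):
    """Residual sum of squares of an ordinary least squares fit (normal equations)."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, y in zip(design, target))


def granger_f(x, y):
    """F statistic for 'x Granger-causes y' using a single lag."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]              # own past only
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]  # plus past of x
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - 3  # observations minus parameters in the unrestricted model
    return (rss_r - rss_u) / (rss_u / dof)


# Invented daily counts for one issue: here, news coverage roughly follows
# the previous day's social media attention.
social = [5, 7, 6, 9, 12, 8, 7, 11, 14, 10, 9, 13]
news = [6, 6, 6, 6, 10, 11, 8, 8, 10, 14, 11, 8]

f_social_to_news = granger_f(social, news)
f_news_to_social = granger_f(news, social)
print(f"social -> news: F = {f_social_to_news:.2f}")
print(f"news -> social: F = {f_news_to_social:.2f}")
```

Running the test in both directions is what allows a finding of two-way influence: a large statistic in both directions would suggest that each agenda helps predict the other.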
Interpretivism
This explanatory framework is widely used in media and communication research but is not commonly associated with big data modelling or even traditional quantitative analysis. The emphasis in interpretivist explanations is more often associated with qualitative studies, as it relies on iterative formulations of research questions and conceptualizations, using existing theories as sensitizing concepts rather than as structured scaffolding of a study. The ontological stance of interpretivist research emphasizes the mutually constitutive nature of the interplay between social agents and structures, and in relation to empirical work, this often translates as a strong focus on an iterative process of data collection and the development of explanations.
We argue that interpretivism takes up a central position in the use of big data in our field, namely in the wide range of studies that rely on data visualizations as central elements in the generation of insights. A central example is the visualization of social network data, which often allows (preliminary) hypotheses to be generated purely on visual inspection of the network, which can then be refined through statistical analysis of network properties. The preferred mode of big data modelling in this context is unsupervised models. In addition to visualization, they include analysis of language use and cluster analysis.
As previously mentioned, network analyses of the Internet have been instrumental in mapping the web (the network of websites on the Internet). In particular, scholars in the Digital Methods Initiative (DMI), based at the University of Amsterdam and headed by Richard Rogers (2013, 2019), have pioneered an interpretative and critical approach to the study of the web. The central tenet of DMI’s approach is the application of unsupervised models like network analyses and automated content analysis to understand how the infrastructure of the web (hyperlinks, hashtags, social media connections, etc.) shapes social interactions on the web, as well as what it means to be social at all (Marres and Gerlitz, 2018). Digital tools like IssueCrawler (https://issuecrawler.net/) or Netvizz (now defunct), which create a network between, respectively, websites or Facebook connections based on their interlinking patterns (who links to whom), have epitomized this approach. In this way, researchers have been able to inductively map how the interactions around social issues unfold over time (Rogers and Marres, 2000) and across platforms (see, for example, Rogers, in this issue). Based on these analyses, new sensitizing concepts, such as ‘transglocalization’ to describe the mix of local and transnational connections in the online networks of migrants (Kok and Rogers, 2016), have been introduced. In contrast to the neo-positivist case, the goal here is not to test hypotheses with predefined issues, but instead to discover how issues emerge, change and dissolve on the web. The merit of interpretative approaches to big data therefore more often lies in the theoretical innovations that emerge from the mapping exercises. An additional accomplishment has arguably been to facilitate how the wider field of media and communication studies has come to know and make sense of many of the new types of data that digitization has introduced.
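The idea of building a network from interlinking patterns can be illustrated with a short sketch. The site names and links below are invented, and the two functions are our own illustrative helpers; they do not reproduce how IssueCrawler or Netvizz work internally.

```python
from collections import Counter

# Hypothetical crawl result: (source site, target site) hyperlink pairs.
links = [
    ("ngo-a.example", "agency.example"),
    ("ngo-b.example", "agency.example"),
    ("blog.example", "ngo-a.example"),
    ("agency.example", "ngo-a.example"),
    ("blog.example", "agency.example"),
]


def in_degree(edges):
    """Count how many distinct sites link to each target (who is linked to)."""
    return Counter(target for _, target in set(edges))


def reciprocal_pairs(edges):
    """List pairs of sites that link to each other in both directions."""
    edge_set = set(edges)
    return sorted({tuple(sorted(edge)) for edge in edge_set
                   if (edge[1], edge[0]) in edge_set})


for site, count in in_degree(links).most_common():
    print(f"{site}: linked to by {count} site(s)")
print("Mutual links:", reciprocal_pairs(links))
```

Measures of this kind do not test a hypothesis; they give the researcher a first map of which actors are central to an issue, which can then guide closer qualitative reading.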
The use of an interpretivist approach can also be seen as a new way of using quantitative methods, inspired by the easier availability of appropriate data. The statistical vocabulary of traditional quantitative media and communication research often relied on heavy protocols of hypothesis formulation, model assessment and testing. While a change of paradigm obviously does not do away with basic requirements of data understanding and assessment of validity and reliability, changing the explanatory paradigm will highlight other aspects of the analysis as critical. Interpretivist explanations of big data can make the use and interpretation of data visualizations more relevant (Healy, 2018), and as such, they can facilitate a re-introduction of quantitative methods to parts of the field that have previously rejected them.
Critical realism
The explanatory model behind critical realist research reflects its characteristic elements. Investigations often begin with the observation of a phenomenon to be explained (e.g. an anomaly) and seek to develop a possible theoretical explanation for it. The empirical component of an explanation then consists in compiling a data source which may serve as evidence for the theory. While this explanatory model is often criticized (especially by falsificationists) for being an obvious example of retrofitting facts to theory, the explanatory logic claimed for the theory is retroductive, that is, it aims to arrive at the most plausible explanation possible for a given phenomenon. While critical realism has traditionally been the mode of explanation preferred by research on the political economy of media (Hardy, 2014: 12), we argue below that the paradigm has potential for many other branches of our field in relation to big data.
The central element of a critical realist account is the attention to mechanisms. A typical pattern is to identify surprising findings inductively from the empirical material and then come up with most-likely explanations. One area where this strategy has been applied is in comparative media research. A notable case is a 2019 paper (Bolsover and Howard, 2019) comparing bot activity around Chinese propaganda on a domestic (Weibo) and an international (Twitter) social network service. The authors retrieve posting activity around specific accounts (Weibo) and hashtags (Twitter) and then apply a mixture of supervised and manual methods to detect bot activity (drawing on, among other things, posting frequency and account information). Surprisingly, bot activity is much more frequent in the debate on Twitter than on Weibo (where it is hardly present), and even on Twitter, it tends to be anti-government accounts that are automated. The authors then explain the surprising finding in terms of needs rather than capabilities. In short, as the Chinese state can employ an army of human propagandists, it has less need for automation (which is easily detectable, as the paper itself testifies) than social movements and political opponents do. At the same time, the authors critically reflect on Twitter’s value as a free public space for the Chinese diaspora, when a large share of the debate, albeit critical of Beijing, is carried by bots. Thus, the paper illustrates how a retroductive explanation can be derived from simple comparisons (bot or no bot) across large data sets. It is important to note that this anomaly would have been difficult to detect accurately with other types of data.
The explanation here, which includes the identification of an anomaly (indeed, a falsification of a prior belief) and the application of an alternative, theoretical framing, is a concrete example of how existing assumptions about a phenomenon can be brought to bear on large data sets on social media communication.
It is clear that the retroductive form of explanation is at odds with the ideal held in the neo-positivist paradigm, which is centred on the formulation and testing of hypotheses. However, it also points to a notable strength in retroductive reasoning, which has relevance for the use of big data in our field, since retroduction is open to the reflexive analysis and explanation of a given, theoretically defined phenomenon. This reflexivity, and the iterative character of the research process it entails, is perhaps even more strongly at odds with beliefs about appropriate analytical conduct traditionally held in parts of the social sciences: to many neo-positivists, retroductive reasoning will resemble the practice of data dredging or p-hacking, which denotes the relentless probing of a data set in order to derive a viable (meaning: statistically significant) result. However, this does not accurately capture the requirements of retroductive reasoning, which depends heavily on the integration of existing knowledge in the form of theories and findings about a subject and/or closely related phenomena. A marked difference between the two approaches is the above-mentioned nature of theory, which in critical realism is more likely to take the form of a comprehensive conceptual system, involving a larger range of phenomena and time scales. In contrast to hypothesis testing, the plausible formulation of an explanation that fits facts and existing theory is what serves as the limiting factor in retroductive reasoning.
This is relevant to our discussion of big data in media and communication research: In current debates about digital communication, there is a clear gap between the most notable theories about central topics (e.g. social media platforms) and many studies based on big data about them, as noted by Schroeder and Cowls (2020: 969). There are many reasons for this state of affairs, one being that the empirical study of digital communication is not the sole province of media and communication research. Indeed, a large part of big data–based empirical work stems from other fields and is often carried out by researchers with a background in the natural sciences. This presents a difficult case, where a reconciliation between empirical and theoretical knowledge is highly desirable, and where retroductive reasoning may provide a way forward (Murthy, 2017).
Conclusion
Our argument above is twofold: First, we want to argue that the discussion about the methodological status of big data methods in media and communication research is not really about whether they constitute a new explanatory paradigm or not. Big data methods are frequently integrated into existing explanatory frameworks, and their unique capabilities for developing insights into mediated communication phenomena are already used in a wide variety of ways. Second, the very variety of explanations that we have sketched above should lead us to reassess the potentialities of big data for our field: rather than resorting to arguments about the revolutionary potentialities of these methods, the field should take advantage of the plurality of voices and approaches, and capitalize on it. Media and communication studies have shown remarkable agility for methodological innovation in the past, not least because the inflexibility that comes from institutionalization into competing camps has been less pronounced here than elsewhere in the social sciences. Recognizing the potentialities of big data for supporting many different kinds of explanations may well let us repeat that success. The benefits from doing so would not just be the methodological dividend to the field: Critical understanding of big data as a wider, social phenomenon can only benefit from an informed understanding of the potentialities and possible alternatives for explanations and uses that it can imply (Fuchs, 2017). And that hinges on methodological sophistication and experimentation of the sort we hope to encourage here. An additional perspective from the above discussion is the identification of a clear line of distinction between the differing requirements of scientific explanations based on big data in the social sciences, and successful applications of machine learning in the data business.
The inclusion of big data methods in our field invariably involves the structured use of theory, regardless of the explanatory paradigm in use, whereas business applications are centred on quantitative optimization alone. The development of tools and techniques for their practical application is very rapid, and for the most part happens outside of the social sciences. Thus, when we pick them up, they come to us with an epistemic surplus as well as an epistemic debt: They are heavy on the jargon of optimization, and light on the ways they fit different explanatory paradigms. It is up to us as a field to begin to formulate the different varieties of standards that we want to apply to these models, and to decide how we want to critically appropriate tools that are currently used most comprehensively and efficiently by the new barons of surveillance capitalism (Zuboff, 2018).
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
