Abstract
As an interdisciplinary field, data-driven journalism integrates the intellectual origins of investigative journalism, computer-assisted reporting, and the emerging paradigm of computational social science. Studies of news production have revealed, however, that news professionals are reinforcing existing power structures via an interpretive community, where homophily-evoked social interactions—even in the social media context—create echo chambers and discussion fragmentation. Is the representation of data-driven journalism in the electronic public sphere breaking boundaries among people from different domains or does it resemble the existing power structure? This study adopts a network analytics approach and constructs a representational network among actors who joined the public discussion of data-driven journalism in the Twittersphere—the co-retweeted network—such that two accounts are connected if their tweets are retweeted by the same user. Public tweets containing search queries related to data-driven journalism published from February 2017 to February 2018 were collected with Twitter real-time streaming application programming interface (API). A co-retweeted network with 1,148 accounts was derived from verified accounts’ retweeting posts. Results found that several communities emerged, and news organizations, nongovernmental and nonprofit professional organizations, and academic institutions were in the crucial positions of the network. The exponential random graph models (ERGMs) based on this network revealed the extent to which gender, geographical location, and institutional type of the users were associated with the tie-formation. This study documents the major actors who are discussing the subject of data-driven journalism and raises critical reflections toward the interdisciplinary collaboration in the production of public knowledge.
Data-driven journalism is gaining an increasing presence in the newsrooms. Rooted in a century’s worth of traditional reporting, investigative journalism, and computer-assisted reporting, several large legacy news organizations have established data news teams or data desks whereas digital-born media are implementing new ways of news investigation and storytelling. Data-driven journalism integrates the intellectual origins of investigative journalism and the emerging paradigm of computational social science (Lazer et al., 2009). Journalistic professionals now collaborate with design experts, developers, and data scientists in the news production, and such collaboration articulates open sources, transparency, and interdisciplinarity (Lewis & Usher, 2014, 2016; Lindgren & Lundström, 2011).
While data science has added the impetus of new investigative and storytelling methods, observers have expressed doubts. Although news media aim to provide a deliberative public sphere, some studies have found that media professionals are reinforcing existing gender relations and the power structure by maintaining an interpretive community (Usher et al., 2018; Zelizer, 1993). Homophily-evoked social interactions among news media professionals lead to gendered echo chambers, opinion polarization, and fragmentation (Usher et al., 2018). Social media, especially Twitter, have been widely used by journalists. Twitter offers a direct and public channel for citizens and professional actors to connect and interact with each other (Bode et al., 2015) and is a crucial online platform for data-driven journalism professionals to engage and interact with their general audience (Boyles & Meyer, 2016). Previous studies have found that social media interactions among journalists are gendered (Usher et al., 2018) and polarized, and that they worsen gender and power inequality. Commentators have begun to ask whether data-driven journalism, as an emerging form of journalistic practice, is truly breaking boundaries among users from different domains or is simply a palimpsest of the existing power structure.
This study adopts a network analytics approach to document the fragmentation of online behaviors via people’s social interactions. Informed by the framework of communication networks (Shumate et al., 2013), this study constructs a representational network among the retweeted users (that is, the co-retweeted network) such that two users are connected if their tweets were retweeted by the same third user. Such a method can document the hidden structure connecting those whose voices are being heard and circulated in the electronic public sphere (those whose tweets are being retweeted) and can reveal the tie-formation mechanism among these users. Empirically, this study extracted public tweets collected with Twitter’s open real-time streaming API and constructed the co-retweeted network based on users’ retweeting behaviors. This study attempts to identify the communities emerging from the co-retweeted network. The exponential random graph models (ERGMs) further revealed the tie-generating mechanisms, confirming that the social interactions are fragmented. By using this method, this study documents the major actors who are discussing the subject of data-driven journalism and how different groups of actors are implicitly related. It contributes critical reflections toward the interdisciplinary collaboration in the production of public knowledge.
Data-Driven Journalism and Interdisciplinarity
Data-Driven Journalism as the Trading Zone
Data-driven journalism, regarded as an interdisciplinary field, refers to the journalistic practice of obtaining, reporting on, organizing, editing, and publishing data in the public interest via collaboration related to the techniques of statistics, computer science, visualization, design, and news reporting (Coddington, 2015). As an emerging form of journalistic practice, it integrates several intellectual origins: investigative journalism, precision journalism, computer-assisted reporting, and the emerging paradigms of computational social science and digital humanities (Coddington, 2015; Lazer et al., 2009; Meyer, 2002).
Previous studies have used the “trading zones framework” to capture such collaboration in data-driven journalism (Galison, 1997; Lewis & Usher, 2016). A trading zone is “a set of physical places and technical arrangements as well as the processes of social interaction” and “the productive possibilities at the intersection of heterogeneous actors” (Lewis & Usher, 2014, pp. 546–547). Several studies and empirical cases have implied that the production process, at least in the public discussion of data-driven journalism, has brought together experts from different domains, such as social science scholars, traditional reporters, designers, data scientists, and programmers (Lewis & Usher, 2014, 2016; Lindgren & Lundström, 2011; Q. Wang et al., 2018). As an example of the intellectual turf shared by journalism and quantitative research methods, the National Institute for Computer-Assisted Reporting hosts the annual Philip Meyer Journalism Award, which “recognizes excellent journalism done using social science research methods.” As part of the interests shared by journalism and computer science, various techniques from data science, data management, and data analytics, such as programming, web scraping (Bradshaw, 2013), and database management, have become part of the skill set that journalists are expected to have, alongside emerging professional activities and meetups. The “hacks” (journalists) and “hackers” (technologists) are working together to create physical and digital spaces for exploring new ways to tell stories (Lewis & Usher, 2016).
However, the trading zone framework and the above cases only explain the cross-disciplinary aspect of data-driven journalism in the off-line newsroom setting. Related work has yet to examine whether cross-disciplinary collaboration changes the long-standing, implicit power structures behind the news production process. Previous research did not fully explore the social interactions among professionals from different knowledge domains in the online setting. As the practice and information exchange of data-driven journalism move away from traditional newsroom settings, several online platforms (GitHub, Facebook, Twitter, and Slack among them) are hosting the discourse related to data-driven journalism (Ausserhofer et al., 2017). Furthermore, literature from the sociology of news has revealed that the production process and social interactions among journalists are highly polarized and resemble existing power relationships, a situation that is observable in the social media setting as well. The “interpretive community” framework attempts to address this issue, and it is reviewed below.
Data-Driven Journalism and the Interpretive Community
The extent to which news professionals make sense of and interpret social realities in a collective manner has been studied via the framework of the “interpretive community” (Zelizer, 1993). This point of view argues that news professionals are reinforcing existing social relations and power structures by communicating their shared discourse and collective interpretations of key public events (Zelizer, 1993). In such a community, they have developed shared meanings through ongoing social interaction and mutual endorsement (Berkowitz & TerKeurst, 1999). As a result, such long-term social interaction and endorsement will lead to a fragmentation of the understanding of social issues.
The idea of the interpretive community has been studied in the news production process and the social interactions among news workers (Brüggemann & Engesser, 2013; Carlson, 2012; Saez-Trumper et al., 2013; Zelizer, 2007). In the digital age, a number of studies have concluded that people’s social interactions via new media reflect the existing power structure. For example, Chang et al. (2009) discovered that although the internet is open, the social interactions on it are closed: The hyperlinks of the news media point to the core countries, whereas developing countries are in peripheral positions of the network. On social media, Zhang (2018) found that the discussion of data-driven journalism on Twitter appears to be fragmented and clustered in different communities. Usher and her colleagues (2018) adopted the framework of the interpretive community to explicate the interactions on Twitter of a group of political reporters in Washington, DC. They found that social interactions on Twitter are gendered, such that male journalists dominated the Twittersphere. The interactions were also homophily-evoked, eventually leading to a “gendered echo chamber” (Usher et al., 2018, p. 1). Hence, the social interactions among news professionals who practice data-driven journalism are likely to have boundaries, be homophily-evoked, and be polarized and fragmented.
Social media have become viable channels for journalists to disseminate information, and they provide rich information for researchers to look into the social interactions among different users on a particular public topic. Informed by the study of Usher et al. (2018), this study also focuses on users’ interactions on Twitter, a popular instance of social media. Building on the work of Usher et al. (2018), this study makes two extensions. First, it moves from a direct communication practice (such as commenting and retweeting) to an indirect or implicit communication practice, which can reveal the power structure among the users. Second, it adopts a network analytics approach. The next section introduces how social network analysis based on social media data can provide a lens through which to view this inquiry.
A Network Analytics Approach
Social Media and Social Networks
The social network analytics approach offers a feasible perspective for studying people’s online interactions. Social media are viable venues for people to interact. This study focuses on Twitter, one of the most popular and active social networking sites, which is regarded as a “transnational electronic public sphere” (Barnett et al., 2017, p. 38). It offers a direct and public channel for citizens and professional actors to connect and interact with each other (Bode et al., 2015) and is a crucial online platform for data-driven journalism professionals to engage and interact with their general audience (Boyles & Meyer, 2016). It also offers a feasible channel to examine the world system (Barnett et al., 2017). In the field of digital journalism, several journalists have established their own social media channels to express opinions, build up a personal identity, and promote their media organizations (Hanusch, 2017). Twitter is an empirical example of digital footprints, and it includes social information to study human behavior. Related works have conducted social network analysis based on these digital footprints, such as the following relationship (Peng et al., 2016), the interactions based on commenting (Himelboim et al., 2014), the network formed by hashtags (Zhang, 2018), and the user–user networks derived from retweeting and mentioning (Del Valle & Bravo, 2018; X. Wang et al., 2019).
Establishing the Co-Retweeted Network
As seen from the above examples of social network analysis based on Twitter, there are numerous ways to construct the networks. To conceptualize these communication networks, Shumate et al. (2013) proposed four types: flow networks (sending and receiving messages), affinity networks (such as the network of family members or the collaboration network within an organization), semantic networks (based on the shared meanings of words and other symbols), and representational networks (an association of actors communicated to a third party; describing messages about one node’s affiliation with other nodes). While existing social media data analytics are based on manifested ties, this study extends previous studies by focusing on hidden ties, where the interactions are based on shared social, political, and cultural meanings. Representational networks, as described by Shumate et al. (2013), are of this type. Examples of a representational network, as stated by Shumate et al. (2013) in their review, range from hyperlink networks and bibliometric networks to a network of name mentions on websites. Informed by the literature on bibliometric studies, citation analysis, and recent studies of social media interactions based on digital traces, this study constructs the “co-retweeted network” (Finn et al., 2014, p. 5; Metaxas et al., 2015; Wong et al., 2016) as an instance of a representational network, where two users are connected if their tweets were retweeted by a third user. As explicated by Shumate et al. (2013, p. 99), “representational relations involve messages about an association among actors communicated to a third party or the public…describing messages about one node’s affiliation with other nodes.” Compared to other existing studies of Twitter-based or social media-based networks, a representational network is an instance of a hidden network, wherein “the actors do not send or receive any messages to each other, or do not even have any relationships” and the relationships between the nodes are “representational” (Shumate et al., 2013, p. 99). Being co-retweeted means that the two actors, whose pieces the same user retweets to initiate a conversation or to speak to them, have similar cultural or social characteristics and are perceived to share a common niche.
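The construction rule described above can be illustrated with a minimal Python sketch. It assumes the retweets have already been reduced to (retweeter, retweeted author) pairs; the function name, the toy account names, and the decision to ignore self-retweets are illustrative assumptions, not details reported in this study.

```python
from collections import defaultdict
from itertools import combinations


def build_co_retweeted_edges(retweets):
    """Return the undirected, unweighted co-retweeted edge set.

    `retweets` is an iterable of (retweeter, retweeted_author) pairs.
    Two authors are linked when at least one user retweeted both of them.
    """
    authors_by_retweeter = defaultdict(set)
    for retweeter, author in retweets:
        if retweeter != author:  # assumption: self-retweets are ignored
            authors_by_retweeter[retweeter].add(author)

    edges = set()
    for authors in authors_by_retweeter.values():
        # Every pair of authors retweeted by the same user becomes one edge.
        for a, b in combinations(sorted(authors), 2):
            edges.add((a, b))
    return edges


# Toy data: u1 and u3 both retweet alice and bob; u2 retweets bob and carol.
sample = [
    ("u1", "alice"), ("u1", "bob"),
    ("u2", "bob"), ("u2", "carol"),
    ("u3", "alice"), ("u3", "bob"),
    ("u4", "dave"),
]
edges = build_co_retweeted_edges(sample)
print(sorted(edges))
```

In the toy data, alice and bob are linked once even though two users retweeted both of them, because the edge set is unweighted; dave stays isolated because no retweeter pairs him with anyone else.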
Existing studies have documented the importance of studying communication networks derived from social media retweeting. As explicated by C. J. Wang et al. (2013), who focused on the Twitter interactions of the Occupy Wall Street movement, those who initiate the conversation (the ones who retweet others’ tweets) are like soldiers, who come and go and are ephemeral. By contrast, those who receive a conversation (the ones being retweeted) are like generals and are more stable because it is their voices that are heard and disseminated to a wider audience in the digital public sphere. In the Twittersphere, retweeting diffuses, enlarges, and makes manifest the information and opinions contributed by those who are retweeted. Simply put, the co-retweeted network, which focuses on these sources of messages, reflects the power structure among the actors who are speaking to other users. The network structure built by those who are the “source of the messages” reflects the discussion fragmentation and also reveals the key social actors who are influential. The network construction method shares the principle of co-citation analysis in bibliometric studies, which can demonstrate “the occurrences of two authors being cited together, the network shows the…intellectual structure and influential topics and scholar groups” (Gao et al., 2018, p. 202).
Exploring the Tie-Formation Mechanisms Among the Actors
Besides describing the network structure and key players, this study also explores the tie-formation of the co-retweeted network to determine why two or more users are retweeted together. As Shumate et al. (2013, p. 108) explained, the formation of a representational network may be due to “external characteristics outside the network…[or] properties of the nodes themselves.” This study explores the impact of the characteristics of the users, as indicated by their sociodemographic variables (for human accounts) or by their professional characteristics (for institutional accounts). The formation of a representational network largely follows the mechanism of preferential attachment, with popular nodes tending to attract more links, and clusters are “prevalent based upon socially constructed groupings of actors” (p. 108). Thus, based on the theoretical machinery of the network formation process, we can anticipate that those users (or the social actors represented by the account, to be precise) who are on a higher level of the power structure will be in clusters and will be more likely to evoke more connections in the co-retweeting process because they are perceived to be influential. Given the relatively sparse literature in this regard, the present project, as an exploratory study, focuses on three factors that the previous literature has documented to be correlated with the network structure:
Gender. As identified in the empirical findings of Usher et al. (2018), interactions on social media are gendered; hence, it is anticipated that males will have more connections than their female counterparts.
Geographical location. Chang et al. (2009) found that although the internet is open, online interaction is local. It is proposed that the geographical location of the users will be associated with the tie-formation process.
Institutional type of the accounts. The functions of different types of institutions (academies, government agencies, or news organizations) vary widely. In the network formation process, homophily also explains tie-formation, such that accounts of a similar type are more likely to be retweeted by the same group of users.
Research Questions
Based on the above review, this study asks two research questions:
Method
Data Collection and Network Construction
To collect relevant tweets, we used a Twitter streaming API to harvest public tweets, collecting a sampling of real-time tweets via a series of user-defined search queries (Twitter Developer Documentation, 2017). The search queries (keywords and hashtags) followed terminologies suggested by Coddington (2015) and implemented by Zhang (2018) in his study of tweets related to data-driven journalism: “data journalism,” “data-driven journalism,” “computational journalism,” and “computer-assisted reporting.”
The search, which started on February 19, 2017, and ran for 12 consecutive months to February 18, 2018, generated 126,610 tweets. The tweets, which were in the public domain, were provided by Twitter in JavaScript Object Notation format. Each of the 126,610 tweets contained information including the user’s screen name, the user’s ID number, the text, the user’s location, the time zone, the language, and whether the account was verified. We removed 69 tweets that contained missing values in any of these fields. Of the remaining 126,541 tweets, only tweets published by verified accounts (i.e., accounts belonging to a verified user or an institution, such as a news organization or an academic department) were included in the study, and each post is regarded as public information. In late 2009, Twitter launched its Verified Account Program to establish the authenticity of accounts so that other users can trust that a legitimate source is authoring the tweets (LaMarre & Suzuki-Lambrecht, 2013). Focusing on verified accounts also excludes spam, fake, or satirical accounts (Hanusch & Bruns, 2017). After the above data cleaning, the final sample in our case study consisted of 11,247 tweets (roughly 8.89% of the 126,541 tweets with complete fields). Based on these tweets, the co-retweeted network was constructed as advised by Finn et al. (2014). This study did not consider the weight of the ties.
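The cleaning steps above can be sketched in Python as follows. This is a minimal illustration, not the authors’ pipeline: it assumes one JSON tweet object per line, with field names following the Twitter v1.1 tweet object (`text`, `lang`, and `user.screen_name`, `user.id`, `user.location`, `user.time_zone`, `user.verified`), and the helper `make_line` exists only to fabricate toy records for the demonstration.

```python
import json

REQUIRED_USER_FIELDS = ("screen_name", "id", "location", "time_zone", "verified")


def is_complete(tweet):
    """True when none of the fields used in the study is missing."""
    user = tweet.get("user") or {}
    if tweet.get("text") is None or tweet.get("lang") is None:
        return False
    return all(user.get(field) is not None for field in REQUIRED_USER_FIELDS)


def clean(lines):
    """Parse JSON lines, drop incomplete tweets, keep verified accounts only."""
    kept = []
    for line in lines:
        tweet = json.loads(line)
        if is_complete(tweet) and tweet["user"]["verified"]:
            kept.append(tweet)
    return kept


def make_line(name, verified, location="London"):
    # Hypothetical helper that fabricates a toy record for the demo below.
    return json.dumps({
        "text": "data journalism example", "lang": "en",
        "user": {"screen_name": name, "id": 1, "location": location,
                 "time_zone": "London", "verified": verified},
    })


toy_stream = [
    make_line("a", True),                 # kept
    make_line("b", False),                # dropped: not verified
    make_line("c", True, location=None),  # dropped: missing location
]
kept = clean(toy_stream)
print([t["user"]["screen_name"] for t in kept])

# Share of the final sample among the 126,541 tweets with complete fields.
share = round(100 * 11247 / 126541, 2)
print(share)
```

The last two lines reproduce the reported percentage from the counts given in the text; the denominator is the number of tweets left after the 69 incomplete records were removed.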
Accounts’ Profiles
The accounts’ profiles (gender, country, and organization type) were manually coded by two coders, both of whom were research assistants of the first author. The coders were instructed to visit the accounts’ profile pages and document the related information according to the stipulations of the codebook in Table 1. The coders first coded a small proportion of the data independently, and the average intercoder reliability was .91, which was satisfactory. Discrepancies were resolved through discussion before the coders continued to code the rest of the accounts. Table 1 also reports the descriptive statistics of each variable.
Coding Scheme for Verified Twitter Accounts and Descriptive Statistics of the Accounts’ Attributes.a
Note. n = 1,007.
a During the coding process, the profile pages of 141 accounts were not reachable because the account had been removed or suspended or was private. These accounts were excluded from the exponential random graph modeling.
b The numbers of followers, following, and tweets are as of December 2019, when this article was revised. We understand that these metrics change over time. However, they can reflect the levels of the accounts’ activities in the Twittersphere.
c Means and standard deviations (in parentheses) are reported, and the minimum and maximum values are reported in square brackets.
Results
Network Structure
The co-retweeted network included 1,148 nodes and 14,472 edges. The overall network had a very low density (.022), which resulted from the filtering process that kept only the verified users. The network’s clustering coefficient (also called transitivity) was 0.44. The average path length was 2.84. For degree centrality, the mean was 25.21 and the median was 11. Note that the co-retweeted network is undirected and unweighted. The layout of the network is visualized in Figure 1.
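As a quick consistency check, the reported density and mean degree follow directly from the node and edge counts of an undirected, unweighted network. The short Python sketch below uses the standard formulas (density = 2E / (N(N − 1)), mean degree = 2E / N); the variable names are illustrative.

```python
nodes = 1148
edges = 14472

# Undirected simple graph: each edge contributes to the degree of two nodes.
density = 2 * edges / (nodes * (nodes - 1))
mean_degree = 2 * edges / nodes

print(round(density, 3))
print(round(mean_degree, 2))
```

Both values round to the figures reported in the text (.022 and 25.21).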

Visualization of the co-retweeted network (n = 1,148). Note. The figure is plotted with Gephi (version 0.9.2), with the Force Atlas 2 layout. The colors of the nodes indicate the communities assigned by Gephi’s community detection algorithm. The size of each node is proportional to its betweenness centrality (the names of the top-20 ranked accounts are presented).
As shown in Figure 1, there are several clusters, which are the “subsets of the accounts who interact and communicate” frequently among each other; in the clusters, there are several “most connected users [as hubs]” who have relationships with many other actors in the network, as indicated by their high degree centrality (Himelboim et al., 2014, p. 360).
To further demonstrate the community structure of the network, we plotted the four largest subcommunities in Figure 2. The communities are derived from the community detection algorithm of Gephi.

Visualization of the four largest communities (containing the top-4 highest numbers of accounts). Note. The figure is plotted with Gephi (version 0.9.2), with the Force Atlas 2 layout. The community detection algorithm was based on Gephi’s “Modularity” function.
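For readers unfamiliar with the quantity behind Gephi’s “Modularity” function, which implements the Louvain method, the score being maximized can be written as follows. The notation is introduced here for illustration: A is the adjacency matrix, k_i is the degree of node i, m is the number of edges, and c_i is the community of node i. The formula is shown for Gephi’s default resolution of 1.0; the resolution actually used is not reported in the text.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),
\qquad
\delta(c_i, c_j) =
\begin{cases}
1 & \text{if } c_i = c_j, \\
0 & \text{otherwise.}
\end{cases}
```

A partition with a high Q places more edges inside communities than would be expected if the same degrees were wired at random.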
As argued by Himelboim et al. (2014), the betweenness centrality and the degree centrality are two crucial aspects of the importance of a node in the network; the former indicates “a bridge or a brokerage,” whereas the latter indicates “a hub” (pp. 364–365). We identified the 10 accounts with the highest degree centrality and the 10 accounts with the highest betweenness centrality. The nodes’ information and their degree and betweenness centrality are reported in Tables 2A and 2B.
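The difference between a hub and a bridge can be made concrete with a small Python sketch. It computes degree and unnormalized betweenness centrality (using Brandes’ algorithm for unweighted, undirected graphs) on a toy graph of two triangles joined through a single node; the graph and node names are invented for illustration, and the study itself obtained these measures from Gephi.

```python
from collections import deque


def to_adjacency(edges):
    """Build an undirected adjacency mapping from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree_centrality(adj):
    """Raw degree: the number of neighbours of each node."""
    return {v: len(neighbours) for v, neighbours in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness via Brandes' algorithm (BFS per source)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                     # back-propagate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: score / 2 for v, score in bc.items()}


toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),   # first triangle
             ("c", "x"), ("x", "d"),               # x bridges the triangles
             ("d", "e"), ("d", "f"), ("e", "f")]   # second triangle
adj = to_adjacency(toy_edges)
deg = degree_centrality(adj)
bc = betweenness_centrality(adj)
print(deg["x"], bc["x"], deg["c"], bc["c"])
```

In this toy graph, node x has only two neighbours but lies on every shortest path between the two triangles, so it ranks highest on betweenness while ranking low on degree.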
Accounts With the Highest Levels (Ranked Top 10) of Degree Centrality.
Accounts With the Highest Levels (Ranked Top 10) of Betweenness Centrality.
As can be seen from Tables 2A and 2B, several accounts served both as a brokerage among different clusters (having a high betweenness centrality) and as a hub (having a high degree). These accounts are professional organizations on data-driven investigative reporting (such as GIJN, the Global Investigative Journalism Network), news professionals (such as kleinmatic, whose holder is Scott Klein, Deputy Managing Editor, ProPublica), and academic staff whose area of expertise covers data journalism and data analysis (such as albertocairo, whose holder is Prof. Alberto Cairo, Knight Chair in Visual Journalism at the School of Communication, University of Miami). The correlation coefficient of degree centrality and betweenness centrality for all the accounts is high (r = .73, p < .001).
Exploring the Tie-Formation Mechanism
To explain the tie-formation mechanism among the accounts in the co-retweeted network, an ERGM was estimated with the Statnet suite of packages available on the Comprehensive R Archive Network. The ERGM compares the network statistics from an empirically observed network to the distribution of network statistics generated from randomly simulated networks. The model was estimated with the Monte Carlo maximum likelihood estimation method, and the variables were gender, country, and the type of organization. This study models two different dimensions of tie-formation mechanisms: (a) the main effects of the variables in generating more connections, such as the extent to which an account having a particular attribute (e.g., male) will have more connections than its counterparts not having that attribute; and (b) the homophily effects, such as the extent to which an account having a particular attribute (e.g., being male) will have more connections with other nodes having that same attribute (i.e., will male users interact more with other male users?).
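The general form of an ERGM, together with the two kinds of statistics that correspond to the main effects and the homophily effects described above, can be written as follows. The notation is introduced here for illustration: y is the observed undirected network, x_i is the attribute of node i, a is one attribute level, and the indicator 1[·] equals 1 when its condition holds. The homophily statistic is written in its level-specific (differential) form, which is an assumption based on the separate coefficients reported per attribute level.

```latex
P_{\theta}(Y = y) = \frac{\exp\{\theta^{\top} g(y)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^{\top} g(y')\}

% Main effect of attribute level a (node factor):
g_{\text{factor},a}(y) = \sum_{i<j} y_{ij} \left( \mathbf{1}[x_i = a] + \mathbf{1}[x_j = a] \right)

% Homophily on attribute level a (node match):
g_{\text{match},a}(y) = \sum_{i<j} y_{ij} \, \mathbf{1}[x_i = x_j = a]
```

Each coefficient in θ is the change in the conditional log-odds of a tie associated with a one-unit change in the corresponding statistic, with the rest of the network held fixed.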
The degree distribution of a representational network is highly skewed, which follows the ubiquitous power law (Barabási & Albert, 1999) and previous studies on the network formed in social media (Barnett et al., 2017). In other words, a small group of accounts had a large number of connections.
The main effects of the nodes’ attributes are presented in Table 3. The ERGM estimates the extent to which a given attribute is associated with an account having more connections (performed by the “node factor” function in the Statnet package). Results indicated that gender did not play a role in creating more connections: compared to institutional accounts, neither males nor females had a higher probability of being connected with other accounts (for males, coefficient = −0.016, SE = .018, p = .379; for females, coefficient = 0.010, SE = .020, p = .619). For location, compared with countries located in other regions, accounts from France had more connections with other accounts (coefficient = 2.796, SE = .037, p < .001), and a similar pattern applied to accounts from other countries such as Germany (coefficient = 2.608, SE = .031, p < .001), India (coefficient = 2.037, SE = .033, p < .001), Ireland (coefficient = 2.117, SE = .042, p < .001), Spain (coefficient = 1.653, SE = .042, p < .001), the United Kingdom (coefficient = 1.555, SE = .030, p < .01), and the United States (coefficient = 1.075, SE = .028, p < .001), but not Canada (coefficient = 0.073, SE = .072, p = .31). In terms of the type of organization, when nongovernmental and nonprofit professional organizations (NGOs/NPOs) were treated as the reference group, academic institutions had more connections (coefficient = 0.136, SE = .022, p < .001), and the same pattern held true for news organizations (coefficient = 0.114, SE = .019, p < .01) and technology companies (coefficient = 0.178, SE = .024, p < .001). Accounts with a larger number of followers were less likely to form a co-retweeted relationship with other accounts (coefficient = −0.058, SE = .005, p < .001). By contrast, accounts that were following more accounts (coefficient = 0.044, SE = .006, p < .001) and had posted a larger number of tweets (coefficient = 0.044, SE = .006, p < .001) were more likely to be co-retweeted with other accounts.
Exponential Random Graph Model Results: Main Effects for the Node Connection.
Note. n = 1,007. Model AIC = 109,769; BIC = 109,958; residual deviance = 109,735 (df = 506,504); null deviance = 702,187 (df = 506,521).
*p < .05. **p < .01. ***p < .001.
To further examine the homophily effects in the tie-formation process among accounts with a similar attribute, another model (performed by the “node match” function of Statnet) was estimated, and the results are presented in Table 4. The node match function of Statnet counts the number of edges (i, j) for which attribute (i) equals attribute (j) and can be regarded as the tendency to form a “within-variable” edge, that is, whether a node with a certain attribute will be more likely to establish a link with another node with the same attribute (Goodreau et al., 2008; Morris et al., 2008).
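The counting rule behind the node match statistic can be sketched in a few lines of Python. This is an illustration of the statistic only, not of Statnet’s estimation procedure; the function name, the toy edge list, and the attribute values are hypothetical.

```python
def nodematch_counts(edges, attr):
    """Count edges whose two endpoints share an attribute value, per value."""
    counts = {}
    for i, j in edges:
        if attr[i] == attr[j]:
            counts[attr[i]] = counts.get(attr[i], 0) + 1
    return counts


# Toy network with a hypothetical gender coding for five accounts.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
gender = {"a": "male", "b": "male", "c": "female", "d": "female", "e": "institution"}

matches = nodematch_counts(toy_edges, gender)
print(matches)
```

Here one edge joins two male accounts and one edge joins two female accounts, while the three remaining edges connect accounts with different values and are not counted. The ERGM then asks whether such matching edges occur more or less often than expected given the other terms in the model.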
Exponential Random Graph Model Results: Homophily Effects for the Node With Similar Attributes.
Note. n = 1,007. Model AIC = 120,893; BIC = 121,115; residual deviance = 120,853 (df = 506,501); null deviance = 702,187 (df = 506,521). NGO/NPO = nongovernmental and nonprofit professional organizations.
*p < .05. **p < .01. ***p < .001.
Males are more likely to be co-retweeted with other males (coefficient = 0.173, SE = .022, p < .001), whereas females are less likely to be co-retweeted with other females (coefficient = −0.141, SE = .036, p < .001). Institutional accounts are less likely to be co-retweeted with other institutional accounts (coefficient = −0.087, SE = .037, p < .05). Country-wise, for most of the countries in the current analysis, the accounts located in a given country were more likely to be connected with other accounts from the same country, such as France (coefficient = 2.256, SE = .181, p < .001), Germany (coefficient = 2.261, SE = .067, p < .001), India (coefficient = 1.366, SE = .102, p < .001), Ireland (coefficient = 1.301, SE = .238, p < .001), Spain (coefficient = 0.700, SE = .242, p < .001), and the United Kingdom (coefficient = 0.319, SE = .044, p < .001); however, accounts from the United States and Canada were less likely to be connected with other accounts from the same country (the United States: coefficient = −0.655, SE = .028, p < .001; Canada: coefficient = −1.730, SE = .714, p < .05). Academic accounts are more likely to be co-retweeted with other academic accounts (coefficient = 0.134, SE = .050, p < .01), but the opposite pattern holds true for news organizations (coefficient = −0.045, SE = .022, p < .05) and NGOs/NPOs (coefficient = −0.487, SE = .059, p < .001). Accounts with a large number of followers are less likely to form a co-retweeted relationship with other accounts (coefficient = −0.049, SE = .005, p < .001). By contrast, accounts that are following more accounts (coefficient = 0.069, SE = .006, p < .001) and have posted a larger number of tweets (coefficient = 0.013, SE = .006, p < .05) are more likely to be co-retweeted with other accounts.
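Because ERGM coefficients are on the log-odds scale, exponentiating them gives a more intuitive reading. The Python sketch below converts a few of the homophily coefficients reported above into odds ratios; the dictionary keys are labels chosen for illustration, and the interpretation assumes the usual conditional reading of an ERGM coefficient (the change in the odds of a tie with the rest of the network held fixed).

```python
import math

# Homophily coefficients copied from the paragraph above.
reported = {
    "male-male": 0.173,
    "female-female": -0.141,
    "academic-academic": 0.134,
    "US-US": -0.655,
}

odds_ratios = {label: math.exp(coef) for label, coef in reported.items()}

for label, ratio in odds_ratios.items():
    direction = "higher" if ratio > 1 else "lower"
    print(f"{label}: odds ratio {ratio:.2f} ({direction} odds of a tie)")
```

An odds ratio above 1 corresponds to a positive coefficient (ties between matching accounts are more likely), and a ratio below 1 corresponds to a negative coefficient.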
Discussion
This study started with two contradictory understandings of data-driven journalism as an emerging form of journalistic practice. First, there is the hope of interdisciplinary collaboration, articulated by the trading zone framework and supported by empirical evidence of cross-disciplinary collaboration (Galison, 1997; Lewis & Usher, 2016). Second, there is the hype articulated by the interpretive community framework and the gendered echo chambers that form via homophily-evoked social media interactions among journalists (Usher et al., 2018; Zelizer, 1993). Informed by the theoretical framework of the communication network (Shumate et al., 2013) and previous related work using the network analytics approach to analyze digital trace data (Guerrero-Solé, 2018; Peng et al., 2016; Zhang, 2018), this article extracted a year of related Twitter discussions and focused on retweeting behaviors as a digital communication practice. It also examined the connection pattern of the “retweeted users” (those whose tweets are retweeted by other users) by constructing the co-retweeted network from these digital traces (Finn et al., 2014). The premise of such an analytical strategy is that these “retweeted users” are the sources of the information and opinions that are being heard, disseminated, and propagated. As explicated by C. J. Wang et al. (2013) in two metaphors, the retweeted users are like generals whose voices are being amplified and enlarged, whereas the retweeting users are like soldiers, the ephemeral voices. Hence, documenting the communication network of these retweeted users from the retweeting process can better reveal the discussion fragmentation of a particular topic. The results show that the co-retweeted network is fragmented, with several clusters emerging.
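The construction rule described above (two accounts are tied if their tweets are retweeted by the same user) can be sketched as follows. This is a minimal illustration on hypothetical toy records, not the authors' replication code; the record layout, the function name co_retweeted_edges, and the decision to ignore self-retweets are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (retweeting_user, retweeted_account).
retweets = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "B"), ("u2", "C"),
    ("u3", "A"), ("u3", "B"),
    ("u4", "D"),
]


def co_retweeted_edges(records):
    """Return undirected ties between accounts retweeted by the same user."""
    retweeted_by_user = defaultdict(set)
    for retweeter, retweeted in records:
        if retweeter != retweeted:  # assumption: self-retweets are ignored
            retweeted_by_user[retweeter].add(retweeted)

    edges = set()
    for accounts in retweeted_by_user.values():
        # Every pair of accounts retweeted by this user becomes one tie.
        for pair in combinations(sorted(accounts), 2):
            edges.add(pair)
    return edges


edges = co_retweeted_edges(retweets)
print(sorted(edges))
```

In this toy example, account D is retweeted only by a user who retweeted nobody else, so it receives no ties and would appear as an isolate if it were kept in the network.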
ERGMs revealed that the tie-formation of the co-retweeted network is associated with the accounts’ characteristics, especially gender, location, and domain area. This article is devoted to the crossover of computational communication research and the quest to articulate power and gender—the existing social hierarchies—in the field of data-driven journalism. Several findings are highlighted below.
First, focusing on the public profiles of the accounts that were retweeted by other accounts, the present project finds imbalances in gender and geographical location. The largest share of these accounts is owned by male users (nearly 40%); females and institutions each account for roughly one third of the sample. This echoes the idea that male users dominate the social media discussion. It offers additional evidence of the gender asymmetry in the digital public sphere, not only in the field of digital journalism (Usher et al., 2018) but also in the social media discussion related to other public issues (Poell & Rajagopalan, 2015) and in other online collaboration communities such as Wikipedia (Collier & Bear, 2012; Fichman & Hara, 2014). Location-wise, a large proportion of the accounts identify their primary country as the United States or the United Kingdom. These results appear to reaffirm the geographic and language landscape of the field of data-driven journalism, at least in the Twittersphere. The result is similar to the country–country networks derived from the international communication flow over the internet (Chang et al., 2009): Even though the internet promotes equality and globalization, communication via the internet is largely dominated by the developed countries. This study also echoes previous findings that English-speaking users dominate the social media platform (Poell & Rajagopalan, 2015). However, our findings might also result from the English keywords used in the search process. Further studies can focus on the discussion of data-driven journalism in different languages.
Second, in terms of the domain area of the accounts, news organizations and nonprofit professional organizations take the lead, followed by academia and technology companies. This suggests an interdisciplinary nature of the discussion of data-driven journalism. It is similar to previous findings on the public discussion of data-driven journalism in the Twittersphere, in which, besides news professionals, nonjournalistic actors such as professional organizations, academic institutions, and technology companies are all part of the discussion of this field (Zhang, 2018). This pattern may be explained in two ways. On the one hand, it suggests that the Twitter-based discussion of data-driven journalism is not centered only on journalism. News-related professional organizations and academic faculty members whose research areas are digital news and investigative journalism are also part of the discourse. The development of data-driven journalism is also backed by technological actors such as data scientists, technologists, programmers, designers, and technology start-up companies (Lewis & Usher, 2014; Usher, 2017). These companies, which are mostly profit-driven, are “funded either by venture capitalists, wealthy entrepreneurs offering seed funding, or venture arms of large media companies, and represent a burgeoning expansion of digital-native journalism” (Usher, 2017, p. 1116). The present findings reflect different actors’ visibility in the field of data-driven journalism. Another explanation is that Twitter is a very common channel for academic faculty members and journalism practitioners to communicate, share related articles, or promote their new work (Broersma & Graham, 2012), and it is also a well-established platform for self-promotion.
Third, the results suggest that the communication network derived from retweeting is fragmented, as several communities emerged, but is also connected by several actors who are situated in crucial positions of this network and act as brokers. On the one hand, the network demonstrates a situation of discussion fragmentation. It suggests that the “interpretive community” also exists in the discussion of data-driven journalism. As indicated by the ERGM results testing the main effects of the accounts’ attributes, accounts in several dominant countries (i.e., France, Germany, the United Kingdom, and the United States) have more connections than accounts from other countries, and accounts belonging to news organizations, academic institutions, and technology companies also have more connections than the reference group. In the ERGM results testing the homophily effects (i.e., whether nodes with a certain attribute are more likely to be connected with other nodes having the same attribute), we found that several attributes have a homophily effect: Males are more likely to be connected with males, and nodes are more likely to be connected with other nodes from the same country (except those from the United States and Canada). We also found that academic-related accounts tended to cluster together, whereas news organizations and professional organizations are less likely to be co-retweeted with other organizations of the same nature. One explanation may be that academic institutions tend to theorize the practice, and scholarly activities are common ground among different universities; however, news or professional organizations either compete with each other or have different areas for their regular news beats.
We also found that accounts with a larger number of followers are less likely to establish co-retweeted relationships with other accounts, indicating that these “key opinion leaders” are competing with each other in the Twittersphere. Accounts that are more active (e.g., following more accounts and posting more tweets) are more likely to be co-retweeted with other accounts. This pattern seems to confirm the idea of preferential attachment, in that highly connected nodes are likelier to gain more connections. On the other hand, although the network contains different communities, it also contains several important figures that serve both as hubs (having many connections) and as brokers (bridging different subcommunities). This can be partly seen from the results in Tables 2A and 2B, where a number of accounts have both a high degree centrality and a high betweenness centrality. These accounts can be regarded as “a handful of passionate individuals” who actively promote data-driven journalism, data visualization for news, and interdisciplinary collaboration for digital news (De Maeyer et al., 2015, p. 432).
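The distinction between hubs (high degree centrality) and brokers (high betweenness centrality) can be illustrated with a small sketch. The toy network below is hypothetical, and the functions are a plain standard-library implementation of degree and of Brandes' betweenness algorithm for unweighted, undirected graphs; they are offered as an assumption-laden illustration and do not reproduce the procedure behind Tables 2A and 2B.

```python
from collections import deque

# Hypothetical toy network: two pairs of accounts (a-b, c-d) bridged by h.
adjacency = {
    "h": ["a", "b", "c", "d"],
    "a": ["h", "b"],
    "b": ["h", "a"],
    "c": ["h", "d"],
    "d": ["h", "c"],
}


def degree(adj):
    """Number of ties per node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness(adj):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    score = {node: 0.0 for node in adj}
    for source in adj:
        order = []                                # nodes in order of discovery
        predecessors = {node: [] for node in adj}
        paths = {node: 0 for node in adj}         # number of shortest paths
        distance = {node: -1 for node in adj}
        paths[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:                              # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        dependency = {node: 0.0 for node in adj}
        for w in reversed(order):                 # accumulate from the farthest nodes
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


deg = degree(adjacency)
btw = betweenness(adjacency)
hubs_and_brokers = [
    node for node in adjacency
    if deg[node] == max(deg.values()) and btw[node] == max(btw.values())
]
print(deg, btw, hubs_and_brokers)
```

Here node h has the most ties and also lies on every shortest path between the two pairs, so it is both a hub and a broker in the sense used above.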
Fourth, in addition to the above findings, this study contributes methodologically to the application of the network analytics approach to the study of data-driven journalism. Whereas most existing studies focus on manifested ties that can be directly observed or documented, such as the following network (Peng et al., 2016) or the retweeting (reposting) network (X. Wang et al., 2019), this study extends this line of research on the communication network of social media by focusing on hidden ties. Within Shumate et al.’s (2013) classification of the communication network, a representational network can be regarded as an instance of a hidden network. This study constructs the “co-retweeted network” (Finn et al., 2014, p. 5; Metaxas et al., 2015; Wong et al., 2016). The network construction method resembles that of the media–media network in the study of audience fragmentation (Ksiazek, 2011; Webster & Ksiazek, 2012). This study documents the major players in the discussion process and reflects the power structure.
Limitations and Directions for Future Research
This project has several limitations that need to be acknowledged, which also pave the way for future research. First, retweeting behaviors do not necessarily mean that the users approve or endorse the retweeted content. The social meanings of retweeting behaviors are therefore complicated, because retweeting can signal either support or opposition. However, regardless of the motivation and consequences of retweeting, the retweeting behavior itself has at least two implications. On the microlevel, the retweeted users’ tweets are diffused and their voices are disseminated from their own followers to the retweeting users’ followers, helping the original information reach a wider audience. On the macrolevel, as retweeting behaviors accumulate, the retweeted users’ tweets may exhibit several patterns, and these patterns can be demonstrated and explained using the co-retweeted network as an instance of the representational network. Also, a retweeting account may quote the text of other accounts without using the “retweeting” function of Twitter; such quotations were not included in the present analysis. Second, the co-retweeted network is unweighted. Although a dichotomous unweighted network can demonstrate the connection pattern and adding weights would not change the structure of the network, further studies can explore the weights (i.e., exactly how many times two accounts are co-retweeted). Third, even though the co-retweeted network indicates that there are several communities in the discussion of data-driven journalism, this study could not address the idea of the interpretive community in the actual news production process.
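As an illustration of the weighting idea mentioned above, the following sketch counts how often each pair of accounts is co-retweeted and then dichotomizes the counts. The toy records are hypothetical, and defining a pair's weight as the number of distinct users who retweeted both accounts is an assumption made here for illustration; other definitions (e.g., counting individual retweets) are possible.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records: (retweeting_user, retweeted_account).
retweets = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "A"), ("u3", "B"), ("u3", "C"),
]


def co_retweet_weights(records):
    """Count, for each pair of accounts, the users who retweeted both."""
    retweeted_by_user = defaultdict(set)
    for retweeter, retweeted in records:
        retweeted_by_user[retweeter].add(retweeted)

    weights = Counter()
    for accounts in retweeted_by_user.values():
        # Each user contributes one count to every pair they retweeted.
        weights.update(combinations(sorted(accounts), 2))
    return weights


def dichotomize(weights, minimum=1):
    """Keep a tie when its weight reaches the (assumed) minimum threshold."""
    return {pair for pair, count in weights.items() if count >= minimum}


weights = co_retweet_weights(retweets)
binary_edges = dichotomize(weights)
print(weights, sorted(binary_edges))
```

With the default threshold of 1, the dichotomized network corresponds to the unweighted version, in which any shared retweeter is enough to create a tie.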
Despite these limitations, this study paves the way for future studies. First, because this study is a cross-sectional snapshot of the network structure, it cannot fully reveal whether the social media discussion of data-driven journalism will become more fragmented or more convergent. Fully revealing the trend requires longitudinal studies tracing the change in the network structure or the discussion content over a longer period. For example, Ho and Zhang (2020) traced the text descriptions of virtual reality video games in an online gaming market for 29 months (2016–2018) and found a shift in the producers’ marketing strategies. The shift of the communication network and discourse of data-driven journalism is also likely to be a long-term phenomenon. Second, this study only documents the communication on Twitter. It cannot fully reveal the actual development of data-driven journalism by the news professionals in the newsrooms. Further studies may consider adding more data sources by using a mixed-methods approach to document the production of data-driven journalism. Third, although this study follows the line of research on the communication network formed on social media and focuses on retweeting behaviors, future studies can pay attention to the content of the tweets, such as by retrieving the major topics of the tweets (Zhang, 2018; Zheng & Shahin, 2018) or by performing a semantic network analysis on the tweets to determine what topics are commented upon (Xiong et al., 2019). Such studies may further inspect the extent to which the communication network formed by the discussion topics is similar to or different from the network formed by commenting or retweeting behaviors. Focusing on the relationships among the contents of the tweets may also address a limitation of this study, namely, that it focuses on the manifested retweeting behavior but fails to include quotation behaviors.
Fourth, this study is based on only one social media platform, Twitter. Although Twitter is a widely used social media platform for digital news practitioners (Hanusch & Bruns, 2017) and the current study’s findings are empirically meaningful, the results can be extended and compared with results obtained from other platforms. For example, more and more open data movement activists and journalists are uploading their code to GitHub, a publicly accessible code repository (Ausserhofer et al., 2017; Radchenko & Sakoyan, 2014). As clarified by Ausserhofer et al. (2017), several platforms besides Twitter and Facebook may serve as feasible conduits for data journalism research, such as GitHub, Slack, and Meetup.
To summarize, this article contributes to the existing literature in three respects. First, it reveals the landscape of data-driven journalism discussion in the Twittersphere in particular and in the electronic public sphere in general. Second, it contributes to the scholarly research into data-driven journalism by identifying the major communities and key actors whose voices are disseminated by others. It reminds us that the online discussion may resemble the existing power structure in the off-line, real world. Third, the present project adds empirical evidence to the audience fragmentation and homophily literature. It serves as an application of the network analytics approach to the crossover of social media data analytics and digital journalism studies. It paves the way for further studies of the production and representation of public knowledge.
Supplemental Material
Zhang_Ho_Online_Supplement_Replication_Codes_Public for Exploring the Fragmentation of the Representation of Data-Driven Journalism in the Twittersphere: A Network Analytics Approach by Xinzhi Zhang and Jeffrey C. F. Ho in Social Science Computer Review
Footnotes
Acknowledgment
The authors would like to thank the three anonymous reviewers for their insightful comments and Can He, Chen Xu, Ryan Ng, Minyi Chen, and Xiaohang Deng for their help. The second author would like to thank the School of Design, Hong Kong Polytechnic University for their continuous support.
Data Availability
The data used for this article were extracted from Twitter via the Twitter public API. The code for data extraction (in Python), data processing (in Python), and data analysis (in R) will be made available and shared with the public via a GitHub repository. The URL of the GitHub repository, with all the replication code:
.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Software Information
The data extraction (API-based data collection) and data processing (the network construction) were conducted in Python. The ERGM modeling was conducted with the Statnet package in R. The network data visualization was conducted with Gephi.
Supplemental Material
The supplemental material for this article is available online.
References
