Abstract
In the last decade, social networks have emerged as a significant media platform for dissemination of political information. In this article, we characterize linguistic factors that affect the dissemination of a political message on Twitter. This is important because characterization of messages that were posted by users from both political camps can provide a unique insight into how rival audiences can be reached. We analyze about 20,000 Twitter users who expressed explicit support for Barack Obama or Mitt Romney during the 2012 U.S. presidential election. We extracted approximately 344,000 tweets by these users that contained links to political webpages and appeared in the 3 months prior to the election. We show that the language used in a linked page is correlated with the page's popularity in terms of the total number of users that post a link to this page. Pages that are written in a more Republican language are posted by a larger number of users than Democrat-leaning pages. We also observe that pages that use highly polarized language are generally posted by users from a single political camp. Pages that are reposted by both Romney and Obama supporters are usually written in a relatively neutral language. In addition, when the user is a Republican, the text accompanying a retweeted page is likely to be modified, usually to refute information that conflicts with the user's viewpoint.
Introduction
The current political discourse in the United States is known to be fragmented and highly polarized (Garrett & Resnick, 2011). One of the possible causes for such polarization is selective exposure to information, namely, people's tendencies to seek information from news outlets that reinforce their viewpoints and to avoid information that contradicts them (e.g., Mutz & Martin, 2001). This selective exposure causes people to embrace more polarized views (Stinchcombe, 2010) and to decrease tolerance for other opinions (Garrett & Resnick, 2011).
One possible step towards reducing problems of selective exposure is to develop messaging that can reach across different viewpoints. In order to do so, it is important to understand the factors that determine whether a certain political message will be consumed by people with differing viewpoints.
Online social networks provide us with a convenient platform for such analysis. The data from many social networks is publicly available. Furthermore, social media has emerged as a significant platform for discussion and dissemination of political information. For example, Pew surveys (Smith, 2011) found that 22% of adult Internet users participated in political campaigns through at least one of the major social media platforms (Twitter, Facebook, and Myspace) during the 2010 U.S. elections.
In this article, we focus on political discourse in Twitter, a micro-blogging service that allows users to post short “tweets.” Users can also broadcast the tweets of their neighbors in the social network, a phenomenon known as “retweeting,” thus further propagating a message (boyd, Golder & Lotan, 2010). Our main interest is in linguistic factors associated with a user's decision to post or retweet a link to a page related to U.S. politics. For the reasons mentioned above, we are especially interested in characterizing linguistic properties of webpages that were posted or retweeted by both Democrats and Republicans.
We distinguish between two mechanisms by which a user decides to expose his or her peers to a political webpage. The first mechanism works when a user becomes aware of the page through a post in one of the blogs he or she follows and retweets the post and the corresponding page. In the second mechanism, the user becomes aware of a page through means other than Twitter (e.g., news aggregators) and posts it on his or her blog.
We analyze both of these mechanisms and show that the language used in a page is highly correlated with the political affiliation of the users who post a link to that page. For instance, pages that use highly polarized language are likely to be both posted and retweeted only by users from one political camp. Conversely, pages that are written in a more neutral (or bi-partisan) language are likely to be posted and retweeted by users from both camps. However, the text accompanying the retweeted page is likely to be modified when the retweeting user is a Republican, probably in order to frame it in accordance with the user’s own interpretation of the event. We also investigate the correlation between political affiliations of the users who post a certain page and the political leanings of the news outlet to which this page belongs (e.g., CNN or The New York Times). Our results show that the more liberal the news outlet, the smaller the fraction of Republicans that post a link to this page, and vice versa. In addition, we analyzed the relationship between the language used in a page and its popularity in terms of the total number of users that posted a link to this page. Similar to Adamic and Glance (2005), we observe a higher posting activity among Republican users. In particular, pages that are written in a more conservative language are posted by a larger number of users than liberal-leaning pages.
Literature Survey
In the last decade, we have witnessed an exponential growth in the amount of politics-related research on online data, in general, and on Twitter, in particular. Among the questions that have been investigated are prediction of future election results (e.g., Metaxas, Mustafaraj, & Gayo-Avello, 2011), credibility of political information (Castillo, Mendoza, & Poblete, 2011; Morris, Counts, Roseway, Hoff, & Schwarz, 2012; Ratkiewicz et al., 2011), finding users whose opinion on a certain subject is influential (Barbieri, Bonchi, & Manco, 2012; Weng, Lim, Jiang, & He, 2010), and leveraging anonymized web search queries to analyze and visualize political issues (Weber, Garimella, & Borra, 2012).
Methods for identification of political leanings of Twitter users were introduced by Pennacchiotti and Popescu (2011) and Conover, Gonçalves, Ratkiewicz, Flammini, and Menczer (2011). These two methods rely on machine learning techniques to build a classifier that maps a user, characterized by multiple behavioral and linguistic features, onto his or her political affiliation. The authors report high prediction accuracy (above 84%) for both methods. In our research, we leverage the high level of political activity of users during the presidential election season to obtain similar results in a more straightforward way. We identify a user's political affiliation from his or her usage of manually selected, highly partisan hashtags (see Data Set Description section for complete details). This simple method allows us to identify political affiliations of approximately 400,000 users with an accuracy of above 95%. The simplicity and the higher accuracy of our method come at the expense of a smaller recall than that of methods introduced by Pennacchiotti and Popescu (2011) and Conover et al. (2011). However, the obtained user population was large enough for meaningful analysis.
Several works (e.g., Kupavskii et al., 2012; Suh, Hong, Pirolli, & Chi, 2010; Yang & Counts, 2010) considered the question of predicting whether a certain tweet will be retweeted. These works considered all tweets, regardless of their topic or whether they contain a link to a webpage. This is in contrast with our analysis that is constrained to tweets that contain links to political webpages. This focus allows us to build a prediction mechanism based on scope-specific features, such as the language used in the webpage and the political affiliations of the posting user and the potential retweeter.
Finally, in our analysis we rely on the Slant Quotient scores (SQ scores), introduced by Groseclose and Milyo (2005), to quantify the political leaning of news outlets. The SQ score ranges from 0 to 100, where a higher value corresponds to a more liberal news outlet.
Data Set Description
We identified a large set U of Twitter users that expressed explicit support for Barack Obama (n = 372,769) or Mitt Romney (n = 22,902) during the 2012 U.S. presidential election. In what follows, we alternatively refer to these users as Democrats and Republicans, respectively.
We found these users by looking for specific hashtags, for example, #votedobama or #voteromney (see Table 1 for the complete list), among the tweets made in the 10 days following the election day. The political affiliations of these users were determined by these hashtags; for example, #votedobama corresponds to a Democrat and #voteromney corresponds to a Republican. We validated this user classification approach by manually labeling the political affiliation of 300 randomly selected users and found that the accuracy of our method is greater than 95%.
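The hashtag-based labeling described above can be illustrated with a minimal sketch. It assumes that each user's tweets are available as plain strings; the two hashtag sets contain only the examples named in the text (the complete lists belong to Table 1), and the function name `classify_user` as well as the rule of leaving users with conflicting hashtags unlabeled are assumptions made for illustration.

```python
# Hypothetical hashtag sets: only the two examples quoted in the text are
# included here; the complete lists are the ones given in Table 1.
OBAMA_TAGS = {"#votedobama"}
ROMNEY_TAGS = {"#voteromney"}


def classify_user(tweets):
    """Return 'D', 'R', or None for a user, given an iterable of tweet texts."""
    used_obama_tag = False
    used_romney_tag = False
    for text in tweets:
        # Lowercase each token and strip trailing punctuation so that
        # "#VotedObama!" matches "#votedobama".
        tokens = {tok.lower().rstrip(".,!?:;") for tok in text.split()}
        used_obama_tag = used_obama_tag or bool(tokens & OBAMA_TAGS)
        used_romney_tag = used_romney_tag or bool(tokens & ROMNEY_TAGS)
    if used_obama_tag and not used_romney_tag:
        return "D"
    if used_romney_tag and not used_obama_tag:
        return "R"
    # Assumption: users with no partisan hashtag, or with hashtags from
    # both camps, are left unlabeled.
    return None
```

A rule of this kind trades recall for precision: users who never use one of the selected hashtags are simply not labeled.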
Table 1. List of Hashtags Used for Identification of the Political Affiliation of Users.
We then extracted all tweets published by users in U during the period of 3 months between August 1 and November 15, 2012, that contain a valid web link. Overall, there were 893,502 of these tweets, out of which 89,392 were retweeted by another user in the set U. We used hashtags such as “#election2012” (see Table 2 for the complete list) to extract a subset of 344,623 tweets on political issues that were retweeted 32,691 times. We let
Table 2. List of Hashtags Used to Identify Political Tweets.
We downloaded all pages that correspond to the set
We approximate the underlying social network of Twitter users using a directed retweet network. For this purpose, we consider retweets of any post, not just those that contain web links. The nodes in this network are the users in U. There is an edge from node A to node B if user A retweeted at least two posts by user B. We say that user A is a neighbor of user B if there is an edge from user A to user B and note that a post by user B can be retweeted only by user B's neighbors. The retweet network contains 201,362 edges for 392,995 nodes, which implies that the network is highly sparse.
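The edge rule above can be sketched as follows. The input format (one `(retweeter, author, post_id)` triple per retweet event), the function names, and the exclusion of self-retweets are assumptions made for illustration; only the threshold of two retweeted posts comes from the text.

```python
from collections import defaultdict


def build_retweet_network(retweets, min_posts=2):
    """Return the set of directed edges (A, B) of the retweet network.

    `retweets` is an iterable of (retweeter, author, post_id) triples.
    An edge A -> B is created when A retweeted at least `min_posts`
    distinct posts by B.
    """
    retweeted_posts = defaultdict(set)
    for retweeter, author, post_id in retweets:
        if retweeter != author:  # assumption: self-retweets are ignored
            retweeted_posts[(retweeter, author)].add(post_id)
    return {pair for pair, posts in retweeted_posts.items()
            if len(posts) >= min_posts}


def neighbors_of(edges, user):
    """Users A with an edge A -> user, i.e. the potential retweeters of `user`."""
    return {a for a, b in edges if b == user}
```

Counting distinct post identifiers rather than raw retweet events keeps a user who retweets the same post twice from creating an edge.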
Table 3 shows the number of edges between users for the four possible pairs of political affiliations. Table 3 also lists the density of these edges, that is, the total number of edges from a set V of users to a set W of users divided by the maximal possible number of edges between these sets.
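A minimal sketch of this density computation is given below. The text does not say how the maximal number of edges is counted when V and W overlap; the sketch assumes that self-loops are impossible, so the maximum is |V|·|W| minus the number of users that belong to both sets. The function name `edge_density` is hypothetical.

```python
def edge_density(edges, v_users, w_users):
    """Density of directed edges from the set V to the set W.

    `edges` is a collection of (source, target) pairs.
    """
    v_users, w_users = set(v_users), set(w_users)
    n_edges = sum(1 for a, b in edges
                  if a in v_users and b in w_users and a != b)
    # Assumption: self-loops are excluded, so ordered pairs (u, u) are
    # removed from the |V| * |W| possible pairs.
    max_edges = len(v_users) * len(w_users) - len(v_users & w_users)
    return n_edges / max_edges if max_edges else 0.0
```

With V = W this reduces to the usual directed-graph density, the number of edges divided by n(n − 1).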
Table 3. Connectivity Statistics in Retweet Network.
It can be seen that the intraparty connections are much denser among Romney supporters than among Obama supporters. This observation is in line with Adamic and Glance (2005). Surprisingly, and in contrast to Adamic and Glance (2005), the density of interparty connections is relatively high; for example, it is comparable to the density of links among Democratic users. This difference in interparty connectivity may indicate different connection behavior in a different social medium (Twitter vs. the blogosphere), or it may represent a larger change in the structure of political polarization. An explanation of this phenomenon is an important question for future research.
Propagation Mechanisms
In this section, we consider linguistic factors associated with the dissemination of political pages in Twitter. We focus on two mechanisms of propagation: In the first mechanism, a user becomes aware of an interesting page through a source other than Twitter and decides to post this page. In the second mechanism, a page posted by a user sparks the interest of one of the user’s neighbors, who decides to retweet it. In the considered data set, the first mechanism is the dominant one, as retweets account for only 4.6% of the total 164,000 unique pairs of a posting user and a page.
We begin by analyzing linguistic properties of globally popular pages, that is, pages published by a large number of users in U. This approach allows us to capture results of both mechanisms together. We then focus on the second mechanism only by analyzing how linguistic factors affect a page’s probability of being retweeted.
Globally Popular Pages
In this section, we consider political pages with a proper language model. Overall, we found 20,842 such pages, and these pages were published by a total of 10,225 unique users, creating 217,945 unique pairs of a user and a published link. We randomly choose approximately 75% of these pages to be a training set and the rest of the pages to be a testing set.
The training set contains 15,630 pages that were published by 8,687 users (7,690 Democrats and 997 Republicans), creating 162,329 pairs of a user and a published link.
We define the average Democratic language model by averaging the language models of pages in the training set, where the language model of each page is weighted by the number of Democratic users that published it. We define the average Republican language model in a similar way.
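The weighted averaging can be sketched as follows. The representation of a page's language model is an assumption: the sketch treats it as a dictionary mapping each word to its probability, and the function name `average_language_model` is hypothetical. Only the weighting by the number of publishing users from the given camp is taken from the text.

```python
from collections import defaultdict


def average_language_model(pages):
    """Weighted average of per-page language models.

    `pages` is an iterable of (model, weight) pairs, where `model` maps
    word -> probability (an assumed representation) and `weight` is the
    number of users from the given camp that published the page.
    """
    pages = list(pages)
    total_weight = sum(weight for _, weight in pages)
    if total_weight == 0:
        raise ValueError("no page was published by users from this camp")
    average = defaultdict(float)
    for model, weight in pages:
        for word, prob in model.items():
            average[word] += weight * prob / total_weight
    return dict(average)
```

If every page model sums to one, the average also sums to one, so it can be compared with individual page models on the same scale.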
The testing set contains 5,212 pages that were published by 6,244 users (5,378 Democrats and 866 Republicans), creating 55,616 pairs of a user and a published link. For each page in the testing set, we calculated the total number of users that published it and the fraction of Republicans among these users. Given the language model
where
Our first observation is that the fraction of Republicans among publishing users has strong negative correlation with the relative democratism of the page (Spearman’s
We also observe that the relative democratism of a page has a negative correlation with the total number of users that tweet this page (Spearman’s
Finally, we considered the correlation between the total number of users that posted a page and the SQ score (Groseclose & Milyo, 2005) of the news outlet that published this page. Similarly to the previous observation, the more conservative the news outlet, the larger the number of publications (Spearman’s
Retweeted Pages
In this section, we consider all political pages that were published and retweeted by users with a proper political language model. Overall, we found 3,168 such users, and these users published 19,634 political pages in total, creating 164,363 unique pairs of a user and a published link. Every link published by a certain user can potentially be retweeted by this user’s neighbors in the retweet network. We refer to the triple of a posting user, a page, and a neighbor of the posting user as a potential retweet.
For every retweet, we measured the textual similarity between the text of the original tweet and its retweet in terms of the Jaccard index, which is defined in the following manner: Let S and T be sets of distinct words used in the original tweet and the retweet, respectively. The Jaccard index for sets S and T is given by $J(S, T) = |S \cap T| / |S \cup T|$, that is, the number of words shared by the two texts divided by the number of distinct words that appear in either of them.
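As an illustration of the Jaccard index on tweet text, the sketch below compares two strings word by word. The tokenization (lowercasing and splitting on whitespace) and the convention of returning 1.0 for two empty texts are assumptions; the text only specifies that S and T are sets of distinct words.

```python
def jaccard_index(original, retweet):
    """Jaccard index of the sets of distinct words in two texts."""
    # Assumed tokenization: lowercase, split on whitespace.
    s = set(original.lower().split())
    t = set(retweet.lower().split())
    if not s and not t:
        return 1.0  # assumption: two empty texts count as identical
    return len(s & t) / len(s | t)
```

A value of 1 means that the retweet uses exactly the same set of words as the original tweet, which is the criterion used for retweets without textual changes.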
There were 3,036,561 potential retweets in our data set, out of which 7,657 became actual retweets, resulting in a retweet rate of 0.25%. No textual changes (Jaccard index of 1) were introduced in 3,993 retweets, which corresponds to a retweet rate of 0.13%.
Our first step was to calculate the retweet rate for potential retweets grouped by affiliation of users. We considered each one of the four possible combinations: both the posting user and his or her neighbor are Democrats (D, D), the posting user is a Democrat and his or her neighbor is a Republican (D, R), the posting user is a Republican and his or her neighbor is a Democrat (R, D), and both the posting user and his or her neighbor are Republicans (R, R). We also considered separately retweets without textual changes (J = 1). The corresponding results are presented in Table 4.
Table 4. Retweet Statistics.
When no restrictions are imposed on the textual similarity of a tweet and retweet, links posted by a Democrat have a more than 30% greater chance to be retweeted. Comparing retweet rates for pairs (D, D) and (D, R), we observe that intraparty and interparty retweet rates are very similar. The same holds for pairs (R, D) and (R, R). These two observations are in contradiction to Adamic and Glance (2005), who reported the intraparty reposting rate to be much higher than the interparty reposting rate.
The situation changes once we restrict our attention to retweets with Jaccard index of one. We see a drop in retweet rate of 44%, 40%, and 45% for pairs (D, D), (R, D), and (R, R), respectively. However, the drop in retweet rate for the pair (D, R) is much higher (64%). Thus, Republicans introduce more changes into tweet text than Democrats, and this is independent of the affiliation of the user that originally posted the page. The largest amount of changes (see mean Jaccard index in Table 4) to the tweet text are made when a Republican user retweets a page posted by a Democratic user.
Finally, we built a predictive model that ranks potential retweets by their probability to become an actual retweet. We hypothesize that decision of a user B to retweet a page posted by user A depends on three cosine similarities of language models: the similarity between users A and B
We began by training a model for prediction of all retweets, regardless of the value of the Jaccard index. Prior to the training, we normalize each of the similarity-based features to have a mean value of zero and a variance of one. This normalization allows us to measure the relative importance of each feature in terms of the fraction of energy contained in the corresponding coefficient, that is, the squared value of this coefficient divided by the sum of the squares of all four coefficients. We use two thirds of potential retweets as the training set for linear regression and the remaining third of potential retweets for performance testing. The AUC and the fraction of energy for each coefficient are presented in Table 5.
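The normalization, the fraction of energy, and the AUC mentioned above can be sketched with the standard library alone. The function names are hypothetical, and the use of the population variance in `standardize` is an assumption, since the text does not say which variance estimator was used. The `auc` helper uses the pairwise interpretation of the area under the ROC curve: the probability that a randomly chosen actual retweet is scored above a randomly chosen potential retweet that did not happen.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale a feature to zero mean and unit (population) variance."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # constant feature carries no signal
    return [(v - mu) / sigma for v in values]


def energy_fractions(coefficients):
    """Squared coefficient divided by the sum of all squared coefficients."""
    total = sum(c * c for c in coefficients)
    return [c * c / total for c in coefficients]


def auc(scores, labels):
    """Pairwise AUC: labels are 1 (retweeted) or 0 (not retweeted)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the features share a common scale after standardization, the squared coefficients of a linear model can be compared directly, which is what makes the energy fractions interpretable as relative importance. The quadratic-time `auc` is meant for small illustrations only; with millions of potential retweets a rank-based computation would be needed.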
Table 5. Prediction Accuracy of Decision to Retweet (J = 1).
It can be seen that the quality of prediction is relatively high for all affiliation pairs. The similarity
Dominant Feature Selection for Prediction of Decision to Retweet (J = 1).
Very similar results are obtained for the retweets without textual changes. We refer the reader to Appendix Tables A1 and A2 for the details.
Summary and Future Work
In this article, we considered the linguistic factors associated with a Twitter user's decision to post or retweet a link to a political webpage. We show that the language used in a page is correlated both with the total number of users that post a link to this page and with the affiliation of these users. In general, users post and retweet pages that use language similar to theirs; therefore, pages written in a neutral language reach audiences in both political camps. In contrast, pages that use highly polarized language are usually posted and retweeted by users from a single political camp. We observe higher posting activity of Republican users and, as a result, pages written in Republican-leaning language are posted more than pages written in Democratic-leaning language.
In our future work, we plan to consider the effect of factors other than linguistic that are associated with the decision to retweet a political page. Such factors include socioeconomic statuses of the poster and the possible retweeter, their geographic proximity, and their ethnicity. Another interesting research question is to explain our observation of a high number of interparty edges in the retweet network.
Appendix
Dominant Feature Selection for Prediction of Decision to Retweet (All Tweets).
| Features | AUC (D, D) | AUC (D, R) | AUC (R, D) | AUC (R, R) |
|---|---|---|---|---|
| | 0.694 | 0.715 | 0.721 | 0.747 |
| | 0.553 | 0.491 | 0.551 | 0.527 |
| | 0.683 | 0.742 | 0.725 | 0.744 |
| , | 0.515 | 0.535 | 0.498 | 0.499 |
| , | 0.689 | 0.687 | 0.718 | 0.744 |
| , | 0.563 | 0.523 | 0.550 | 0.526 |
| , , | 0.666 | 0.724 | 0.718 | 0.748 |
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
