Abstract
This article empirically revisits the idea of ideological segregation and homogeneity in social networks with an exhaustive analysis of the website Reddit.com. Using a computer-assisted analysis of a corpus of several billion comments, it studies the relation between the tone of comments and three political topics: immigration, macroeconomics, and defense. Looking at the standard deviation of the average tone of users and communities over time on these specific topics, results show an overall trend toward more ideological heterogeneity as a product of the influx of new users, while several of the communities studied and long-term users show a trend toward homogeneity. Results also show that digital heuristics, such as upvotes and downvotes, reveal a far greater diversity of opinion than a study based solely on published comments would suggest.
Keywords
Researchers and pundits have portrayed online communication as ideologically segregated by its very nature, notably through an avoidance strategy developed by users seeking entertainment and ideological gratification (Muhlberger, 2005; Papacharissi, 2002; Prior, 2005), evoking the theory of uses and gratifications originally applied to mass media (see Swanson, 1987). This theory suggests that consumers of traditional media and, as a corollary, users of information and communication technologies (ICTs) develop information consumption habits meant to satisfy specific needs, including entertainment and ideological gratification, by actively and voluntarily avoiding conflict and dissension. The resulting ideologically segregated environment would not allow users to be exposed to the disagreements and alternative viewpoints presumed to be required for deliberation and possible persuasion. Prior research systematically comparing online communities and offline communication concedes that online communication is ideologically segregated but finds no evidence that it is more so than offline communication or that the segregation has increased over time (Gentzkow & Shapiro, 2011). These results are largely based on survey data (Klofstad et al., 2013; Mukherji et al., 1998; Wojcieszak & Mutz, 2009), data from navigation histories (Flaxman et al., 2016; Gentzkow & Shapiro, 2011), or experimental designs demonstrating the impact of disagreements (Arceneaux et al., 2012; Esterling et al., 2015).
Computer-assisted quantitative text analysis allows us to tackle the issue from another angle, providing new insights into the question of whether online communities engaging in political discussion evolve into more homogenized or polarized venues over time. The answer has profound implications for our understanding of contemporary political socialization in a world increasingly characterized by mediated and virtual discussion networks and at a time when troll farms, allegedly backed by Moscow (Collins & Russell, 2018), actively try to change the discourse and sway electoral results via comments on social networks.
Drawing on previous work using social media content, mostly from Twitter, many scholars have tried to derive opinions and attitudes from both texts and comments and even predict electoral results (see Ceron et al., 2014; Tumasjan et al., 2010). We propose that by using a sentiment analysis monitoring the evolution of tone directed at certain political objects we can effectively monitor and quantify the average attitude within online communities and determine exhaustively and empirically whether they become more homogeneous or heterogeneous over time. Of course, these changes within communities are expected to be the corollary of self-selection into these communities and therefore we would be looking at the aggregate data on the community level as well as the individual level to monitor these changes. We specifically consider the evolution of inferred attitudes toward immigration, macroeconomic issues as well as defense and military issues on the social news aggregator Reddit.com using data from an exhaustive database of public comments ranging from October 2007 to December 2016.
Interpersonal Communication and Political Attitudes
Interpersonal interactions allow for easier access to political information due to the development of trust between interlocutors, and act as a buffer between individuals and the political discourse of the elite, as notably established by the multi/two-step flow of communication theory (Katz & Lazarsfeld, 1955; Lazarsfeld et al., 1948; see also McClurg, 2003). Individuals also infer a level of political competence to interlocutors and identify potential ideological bias. This modulates the frequency and effect of political discussions due to the need for ideological gratification. Indeed, perceived expertise and ideological similarities have positive effects on the frequency of these discussions and the persuasiveness of arguments resulting in the potential for attitude changes and eventually, in network homophily (Huckfeldt, 2001; Huckfeldt & Sprague, 1991; McClurg, 2006).
Nevertheless, interpersonal interactions online are not entirely identical to face-to-face interactions. One of the features of online discussions that distinguishes them from other forms of interaction is that they often remain anonymous, as is the case on Reddit.com. The consequences of anonymity are unclear, though. Some research suggests that it tends to bring out marginal ideologues and partisans in a succession of hostile monologues (Davis, 2005). Other research suggests that, for the most part, online political discussions attract moderates and individuals with no strong partisan affiliation (Klofstad et al., 2013). Anonymity leads to another key feature of interpersonal interactions and online political discussions: the blurring of distinctions between information providers and consumers as well as between experts and novices (Delli Carpini, 2000). Some research reports a consequential increase in the perception of empowerment from users (Johnson & Kaye, 2003), making online discussions an important vehicle for political information and an alternative to elite and mass media discourse (see Druckman & Nelson, 2003).
Online interactions are characterized by political homogeneity and ideological segregation (see Bennett & Iyengar, 2008; Boutyline & Willer, 2017; Gentzkow & Shapiro, 2011; Klofstad et al., 2013). Easy access to a multitude of varied sources of information and venues for political discussion leads individuals to self-selection toward information that better reflects their own bias, reducing exposure to disagreements and alternative perspectives (Arceneaux et al., 2012; Muhlberger, 2005; Prior, 2005). While several studies find evidence of this ideological segregation, Gentzkow and Shapiro (2011) suggest that this segregation differs little from face-to-face interactions.
However, online interactions are sometimes influenced by algorithms and scripts, adding a level of complexity to the homophily of online networks (see Eslami et al., 2015). Some fear that the algorithms on Facebook or Twitter may create a filter bubble by their ability to show users content mostly similar to what they have previously seen, followed, or shared. However, empirical evidence suggests that social media use is rather associated with a modest increase in exposure to alternative perspectives (Bakshy et al., 2015; Barberá et al., 2015; Flaxman et al., 2016). Indeed, this ideological segregation of digital networks is moderated by the facilitated consumption of traditional media (Flaxman et al., 2016). Although there is a good deal of evidence suggesting a trend toward homogeneity of online political spaces, Wojcieszak and Mutz (2009) rather demonstrate that incidental exposure to political conversations in environments that are a priori organized around nonpolitical ideas or issues tends to expose participants to more disagreements and varied perspectives.
The Study of Homogenization and Online Polarization
This article investigates the homogenization or polarization, over time, of online communities discussing politics and examines the proposition that users self-select into communities to avoid conflict and dissension. Unlike most studies on the subject, we use a rarely studied corpus—Reddit—and a content analysis approach rather than a survey or experimental approach that has been widely used in the past. In doing so, we propose an analysis from a new perspective based on field observations of the digital communities where these discussions actually take place.
Whether online or off-line, interpersonal interactions of a political nature tend to occur in homogeneous environments. This homogeneity comes, however, from two different dynamics: self-selection in the environment and socialization. Self-selection is the result of users who choose to submit content in environments that correspond to their own political biases or who choose to avoid or leave political environments and communities altogether. This implies that individuals generally avoid being exposed to different perspectives than their own, in line with the theory of uses and gratifications. However, homogeneity can also be the result of socialization. For this to happen, the self-selection effect must be limited, as exposure to different perspectives is necessary. We assume that some political discussions occur online in settings that are not necessarily political in nature, but that within these communities, there is nevertheless an unknown distribution of political attitudes in favor of or against political objects among the people who frequent them. In other words, a subreddit talking about cooking may, due to exogenous factors, attract more individuals who identify themselves on the left (or right). Because the topic of discussion of choice is politically neutral, individuals are not inclined to avoid the community to avoid political conflicts or disagreements. Over time and over conversations, however, the underlying distribution of political attitudes expressed in a fortuitous way can have an effect on users, socializing them toward the expressed general trend of the group. Stated more formally, political comments in online discussions not a priori related to politics should allow for incidental exposure to a more diverse political discourse which in turn should lead to greater conformity of political attitudes over time (see, e.g., Mutz, 2002, 2006). 
Reddit.com provides a platform that is, in general, not necessarily geared toward political discussion and that contains communities that are not explicitly partisan, allowing for incidental political conversation to take place and, over time, shaping the politics of users into more homogeneous subgroups.
As part of our analysis, we chose to observe the evolution of sentiments expressed about immigration, macroeconomics, and defense for two reasons. First, these three political objects are not, for the most part, too dependent on a specific geographical or political context. Secondly, the inference of political attitudes from the feelings expressed toward these subjects is relatively simple and logical. For example, positive feelings expressed with respect to immigration suggest tolerance toward immigration, feelings expressed with respect to the subject of “macroeconomics” can be interpreted as a measure of appreciation of economic development, and those expressed with respect to the subject of “defense” imply support or lack of support for troops and military operations. The extent to which these measurements can provide coherent attitudes requires validation.
Consequently, while the literature would suggest greater political homogeneity in our results, we distinguish two possibilities:
The corollary of this hypothesis:
In order to further study the potential socialization dynamic at play, we will also study our corpus at a microlevel, looking at individual posters, more precisely at users that have been active for a long time, and monitoring their inferred attitudinal changes.
Data
Reddit.com is a social news site (a news aggregator) where content and moderation are the products of participatory work by its users, as opposed to other sites where content is managed by an editorial team or an algorithm. In practice, Reddit users (or redditors) choose a community (commonly called a subreddit) where they submit links to other sites or share their own perspective in a post. The other redditors subscribing to the subreddit in question will then individually determine whether the post deserves an upvote or a downvote, and these votes are aggregated to give the post a score. The Reddiquette, an informal set of written rules and values redditors are encouraged to abide by, emphasizes the need to vote based on whether a post or comment contributes to the conversation. These upvotes and downvotes act as numerical heuristics of social acceptability (Duguay, 2020). Unlike simple likes on Facebook or Twitter, an algorithm will then sort and order the posts and other content according to the score resulting from the votes and the time since publication. Under each post and shared link, redditors are encouraged to discuss through comments that are also evaluated and moderated with upvotes and downvotes and ordered accordingly.
From a practical point of view, Reddit’s application programming interface (API) allows access to a complete and exhaustive database of public comments, while Twitter requests often only provide a targeted sample of tweets. In addition, the Reddiquette encourages the use of standard English and the writing of constructive and substantial comments. The resulting text units are longer, richer, and more appropriate for content analysis than tweets, which are limited in number of characters and where shortcuts, acronyms, and emojis are structurally encouraged.
Reddit also provides us with an ideal platform to study two theoretically important phenomena in ordinary discussions online: anonymity and fortuitous political discussions. First, most redditors use anonymous accounts to publish their content and comments, although some notables in the off-line world are known as such on the platform. Secondly, Reddit’s division into a myriad of subreddits dedicated to specific topics allows the study of the occurrence of political topics in contexts where most conversations concern topics or hobbies unrelated to politics.
The database used for this analysis is the result of several compiled API calls. It contains more than 3 billion comments, almost all of Reddit’s public comments published from October 2007 to December 2016.
In conjunction with an analysis of the website as a whole, we use a sample of subreddits that allows us to track the evolution of sentiments expressed about immigration, macroeconomics, and defense from different perspectives, with recent and older communities and by comparing between and within regions. Considering their importance as default subreddits as well as subreddits that deal with topics that frequently concern us, our list must include r/politics, r/worldnews, r/news, and r/AskReddit. To this list, we add three popular regional subreddits, namely r/canada, r/Europe, and r/unitedkingdom, as well as two subreddits that are less popular, but which have a reputation for being more sophisticated versions of their generic counterparts, r/canadapolitics and r/ukpolitics for comparison purposes. For the same reason, we also add r/politicaldiscussion, as a more sophisticated version of r/politics, as well as r/The_Donald, the relatively new but immensely popular subreddit dedicated to Donald Trump supporters.
Method
Following the work of Gordon W. Allport (1929, 1935), in research based on survey data, political attitudes are often defined as feelings or evaluations about political objects with two dimensions measured on a Likert-type scale: (1) the direction of the evaluation (negative or positive) about the object and (2) the intensity of the feeling or evaluation. Similarly, we propose that, in the context of textual data, we can infer these two dimensions of a political attitude and measure the direction and intensity of an expressed sentiment, whether negative or positive, by quantifying the words associated with or directed at a political object.
Many studies have already used different methods of content analysis and sentiment analysis on tweets and social media to produce data comparable to opinion polls or even to predict election results (Bermingham & Smeaton, 2011; Ceron et al., 2014; Cody et al., 2016; O’Connor et al., 2010). Based on the assumption that Twitter is widely used for political discussion, Tumasjan et al. (2010) demonstrate that, in the context of the German federal elections, mere references to political parties and politicians directly reflected the election results and that analysis of the sentiments expressed by parties and politicians reproduced their political orientations. Ceron et al. (2014) show that while internet users, and Twitter users more specifically, are not necessarily representative of the general population, sentiment analyses tend to reproduce the results of mass surveys.
While questions remain about representativeness and the measures used in sentiment analysis, this is also the case with surveys that are increasingly confronted with issues related to low response rates. The use of Reddit as a corpus helps us avoid the usual limitations of Twitter-based sentiment analysis such as the use of abbreviations and poor language quality as well as incomplete or partial sentences, although other issues remain. The two main limitations of such an approach are the way sarcasm is measured and the textual units measured as neutral because their positive and negative word counts, although high, are roughly equal. This neutrality may be the result of a textual unit referring simultaneously to several objects or of charged words used to amplify each other’s effect, without necessarily conferring a neutral tone (as would be the case with the expression “beautiful bastard”).
The variable from which we infer homogeneity is the tone, as measured by the Lexicoder Sentiment Dictionary (LSD; see Young & Soroka, 2012b), which counts the number of positive or negative words (or group of words) from a dictionary containing more than 4,500 words or expressions. It should also be noted that we use a “negation” dictionary that allows us to take into account the use of negation combined with the words or expressions in question that can change the inferred tone.
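To make the counting logic concrete, the following is a minimal sketch of dictionary-based tone counting with negation handling. It is not the Lexicoder implementation: the three word lists are tiny hypothetical stand-ins for the actual dictionaries, and the negation rule (a negator immediately preceding a sentiment word flips its polarity) is a simplifying assumption made for illustration only.

```python
import re

# Toy lexicons: hypothetical stand-ins, NOT the actual Lexicoder entries.
POSITIVE = {"good", "great", "welcome", "safe"}
NEGATIVE = {"bad", "terrible", "threat", "unsafe"}
NEGATORS = {"not", "no", "never"}


def count_tone_words(text):
    """Return (positive_count, negative_count, total_words) for one text.

    Assumption: a sentiment word directly preceded by a negator is counted
    with the opposite polarity ("not bad" counts as positive).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = neg = 0
    for i, tok in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in NEGATORS
        if tok in POSITIVE:
            if negated:
                neg += 1
            else:
                pos += 1
        elif tok in NEGATIVE:
            if negated:
                pos += 1
            else:
                neg += 1
    return pos, neg, len(tokens)
```

Under these assumptions, a phrase such as "not bad" contributes one positive word rather than one negative word, which is the kind of correction the negation dictionary is meant to provide.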
In a comparison of several types of content analysis methods used to infer public opinion from a corpus of social media, González-Bailón and Paltoglou (2015) show that although machine learning produces better results in highly specialized contexts and consequently less diversified corpuses, dictionary analysis (including Lexicoder) of diverse data sets such as comments on YouTube, Twitter, or Digg tends to produce results comparable to machine learning. Considering that Reddit is a constellation of several million subreddits with varying norms and rules regarding form and content, a dictionary analysis such as ours seems appropriate.
We choose to study the evolution of the tone in its relationship to political objects mentioned in the conversation in order to clarify the inference of attitudinal changes and homogeneity. Consequently, the comments were analyzed with the English Lexicoder Topic Dictionary, which counts the occurrence of words (or groups of words) associated with 28 major political topics (Albaugh et al., 2013). This special dictionary has been designed for use in the analysis of news media and political platforms. Although Soroka et al. (2015) emphasize the importance of using dictionaries specifically designed for each corpus, we believe that the Lexicoder Topic Dictionary and the Lexicoder Sentiment Dictionary are appropriate dictionaries for our approach. Indeed, we believe that the specialized and precise nature of dictionaries, adapted to the analysis of news media and political speeches, allows us to reduce the incidence of false positives that would lead us to confuse trivial conversations with political conversations. The specific, even technical nature of these dictionaries provides us with a conservative assessment of the political nature of comments and conversations.
In order to determine the subject of each conversation, we combined comments belonging to the same conversation, determined via the identification number of the conversation, and counted the words belonging to each of these 28 political subjects using this subject dictionary. This process allows us to contextualize very short comments as well as emojis, American Standard Code for Information Interchange (ASCII) art, memes, and similar content, which would fail to provide an accurate picture of the topic at hand if looked at individually. While we considered establishing the main topic for each conversation and thus analyzing only the comments included in the conversations whose main topic is one of our topics of interest, we opted instead to consider all the conversations where our topics of interest (immigration, macroeconomics, and defense) are discussed in order to reflect the often-unstructured nature of political conversations and thus avoid data loss. It also implies that incidental exposure to political statements in a conversation that is not a priori political in nature is not removed from the default database.
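The conversation-level topic assignment described above can be sketched as follows. This is an illustrative sketch only: the field names `conversation_id` and `body`, the three short term lists, and both function names are assumptions introduced here, not the structure of the actual database or the contents of the Lexicoder Topic Dictionary.

```python
import re
from collections import Counter, defaultdict

# Hypothetical miniature topic dictionary (the real one covers 28 topics).
TOPIC_TERMS = {
    "immigration": {"immigration", "immigrant", "refugee", "visa"},
    "macroeconomics": {"inflation", "unemployment", "recession", "gdp"},
    "defense": {"army", "military", "troops", "war"},
}


def topics_by_conversation(comments):
    """Merge comments per conversation and count topic terms in the merged text."""
    merged = defaultdict(list)
    for c in comments:
        merged[c["conversation_id"]].append(c["body"])
    counts = {}
    for conv_id, bodies in merged.items():
        tokens = re.findall(r"[a-z]+", " ".join(bodies).lower())
        tally = Counter()
        for tok in tokens:
            for topic, terms in TOPIC_TERMS.items():
                if tok in terms:
                    tally[topic] += 1
        counts[conv_id] = tally
    return counts


def comments_on_topic(comments, topic):
    """Keep every comment from any conversation that mentions the topic at all."""
    counts = topics_by_conversation(comments)
    return [c for c in comments if counts[c["conversation_id"]][topic] > 0]
```

Because the topic is established at the conversation level, a one-word reply such as "lol" is retained when the conversation it belongs to mentions the topic, even though the reply itself contains no topic term.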
In order to account for either the homogenization or polarization of communities and inferred attitudinal changes in individuals, we have structured our data to obtain two levels of analysis. First, using comment time stamps, we have restructured our data so that each day/subreddit pair is a single observation. This restructuring allows us to observe the daily evolution of the tone and its standard deviation in different communities. Similarly, we have restructured our data so that each observation is composed of a month/author pair, again using the time stamps and the usernames of the redditors. To facilitate the manipulation of the second database, we eliminated month/author pairs with only one comment, given the lack of commitment, as well as users whose status as bots, that is, applications performing automated tasks, is explicitly known or otherwise obvious, such as “AutoModerator.”
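A minimal sketch of these two restructurings is given below, using only the Python standard library. The field names `created_utc`, `subreddit`, and `author`, the one-entry bot list, and the function names are assumptions made for illustration; the actual pipeline is not described at this level of detail.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Illustrative bot list; the full list used in the analysis is not given.
KNOWN_BOTS = {"AutoModerator"}


def group_by_day_subreddit(comments):
    """One observation per (day, subreddit) pair, keyed by UTC calendar day."""
    groups = defaultdict(list)
    for c in comments:
        day = datetime.fromtimestamp(c["created_utc"], tz=timezone.utc).date()
        groups[(day.isoformat(), c["subreddit"])].append(c)
    return dict(groups)


def group_by_month_author(comments, min_comments=2):
    """One observation per (month, author) pair.

    Known bots are skipped, and pairs with fewer than `min_comments`
    comments are dropped (single-comment pairs by default).
    """
    groups = defaultdict(list)
    for c in comments:
        if c["author"] in KNOWN_BOTS:
            continue
        ts = datetime.fromtimestamp(c["created_utc"], tz=timezone.utc)
        groups[(ts.strftime("%Y-%m"), c["author"])].append(c)
    return {k: v for k, v in groups.items() if len(v) >= min_comments}
```

Tone statistics can then be computed within each group, by day for communities and by month for individual redditors.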
Comments are our basic textual unit from which we derive our tone measure, which can theoretically range from −100 to 100 and consists, as suggested by Young and Soroka (2012a), of the proportion of positive words minus the proportion of negative words. For example, a comment with an equal number of positive and negative words would have a net tone of 0; a comment with 3% of its words considered positive and 16% of its words considered negative would have a net tone of −13. A very negative comment could have a net tone of about −13. We then look at the average tone and standard deviation for a specific time unit on a specific community for a specific subject and use this measure to determine the evolution of the level of homogenization, our main dependent variable. Simply put, when the standard deviation of tone with respect to a political object increases over time, it is interpreted as increased diversity of opinion. Conversely, when the standard deviation decreases over time, it is interpreted as increased homogenization on that topic. To account for the specific reality of Reddit.com’s comment system, where the comments with the highest scores appear at the top of a conversation and the ones with the lowest scores at the very bottom, each comment was also weighted according to its score when analyzing the tone of a community. In doing so, we ensure that comments that are considered inappropriate or unwelcome by a subreddit do not bias the average tone or standard deviation, while the tone of highly appreciated and positive comments is given importance that reflects their support by those redditors who would probably not otherwise have participated in the conversation.
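The tone measure and its score-weighted aggregation can be illustrated with a short sketch. The net tone formula follows the definition given above; the weighting scheme, however, is an assumption: the treatment of zero or negative scores is not specified in the text, so this sketch simply clips weights at zero, which is one possible choice among several. Function names are hypothetical.

```python
import math


def net_tone(pos, neg, total):
    """Percentage of positive words minus percentage of negative words."""
    if total == 0:
        return 0.0
    return 100.0 * (pos - neg) / total


def weighted_mean_sd(tones, scores):
    """Score-weighted mean and (population) standard deviation of tone.

    Assumption: comments with a zero or negative score receive weight 0.
    """
    weights = [max(s, 0) for s in scores]
    w_sum = sum(weights)
    if w_sum == 0:
        raise ValueError("no comment with a positive score")
    mean = sum(w * t for w, t in zip(weights, tones)) / w_sum
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, tones)) / w_sum
    return mean, math.sqrt(var)
```

With this definition, the example above (3% positive words, 16% negative words) yields `net_tone(3, 16, 100) == -13.0`. A rising weighted standard deviation within a community over successive time units would be read as increasing heterogeneity, and a falling one as homogenization.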
Results
Our analysis begins with a validation of our tone measurements. In order to test the validity of our measures, we looked at the results of the dictionary analysis of the 384 subreddits that individually have over 1 million comments from October 2007 to December 2016 to see whether the results met our expectations. Looking only at the average tone, most subreddits with a very positive average tone are found among four main categories of communities that sometimes overlap: Hobbyist communities sharing content or providing a platform for exchanges; communities about family-friendly video games such as Pokémon and Animal Crossing; support and advice seeking communities; and amateur pornographic content or niche fetishes. At the other end of the spectrum, most communities with a very negative tone fall into three broad categories: controversial political communities often associated with masculinism or the alt-right, communities sharing morbid or violent content, and communities about violent entertainment and cultural products. To a lesser extent, news-related subreddits are also quite negative.
Looking instead at the validity of our political subject measures with subreddits discussing immigration the most, r/worldnews leads with 1,067,622 mentions, followed by r/AskReddit and r/politics. In terms of macroeconomics, r/politics has almost twice the number of mentions of the next subreddit with 6,159,725 mentions. The r/worldnews and r/personalfinance subreddits follow r/politics with other political or regional subreddits, which is also the case with immigration. However, for the subject of defense, r/AskReddit has the most mentions with 5,539,600, followed by r/worldnews and r/politics and, to a lesser extent, historical communities as well as subreddits dedicated to strategy or shooting video games. Default subreddits, such as r/AskReddit, are immensely popular communities and, although not necessarily focused on political discussions, often involve military or immigration-related issues, whether consciously as part of a dedicated post or fortuitously as part of a conversation otherwise focused on another subject. Similarly, while war video game communities have many occurrences of the subject of defense, potentially an indicator of false positives, it would not necessarily be atypical for such communities to discuss real and contemporary politics.
In order to observe the validity of our attitudinal inference, we looked at the average sentiment expressed on our topics of interest in specific communities whose ideological biases are known. Looking only at the comments of conversations about immigration, weighted by score, r/The_Donald and r/conservative have a tone of −1.014 and −0.688, respectively, while the more explicitly pro-immigration community r/immigration has an average tone of 1.515. In general, communities dedicated to travel, language learning, or geographical entities have a more positive tone toward immigration, while explicitly racist communities have a negative average tone.
As for the subject of defense, while there is no lasting subreddit specifically on pacifism or anti-war sentiment for comparison purposes, r/army, which is dedicated to conversations about the U.S. military, has a relatively positive tone of 0.3445. However, as mentioned above, subreddits dedicated to violent cultural or sports products, which regularly use the key words included in the “defense” category of our dictionary, tend to have a more negative tone and conversations specifically related to defense are no exception. Meanwhile, the usually negative conservative and hateful groups retain their negativity when discussing “defense” but are joined in their negativity by anticonservative or anti-racist communities such as r/ShitRedditSays, on this specific subject.
With regard to conversations on macroeconomics, subreddits such as r/finance, r/stocks, and r/personalfinance have a very positive weighted average tone with 2.075, 1.917, and 1.803, respectively. In addition, r/socialism has a positive tone, although a lower one at 0.479, which may seem high from an ideological perspective but is explained by serious discussions on Marxist economic theories. However, both r/occupywallstreet and r/environment have negative tones on average with −0.285 and −0.334, respectively. We also find that communities dedicated to marketing or specific brands tend to have a high average tone when considering conversations about macroeconomics.
This preliminary analysis allows us to make three observations regarding the validity of our measures: (1) on the negativity/positivity dimension, our analysis is consistent; (2) concerning topics, while there might be some measurement errors concerning “defense,” our technical topic dictionary appears to succeed in its goal of avoiding false positives, with subreddits known for discussing political and social issues showing more mentions; (3) when combining both measures and comparing communities with known ideological biases, our results are coherent, logical, and consistent.
Overall, the distribution of the standard deviation of the tone over the mean tone of our observations is approximately in the form of a “U.” In other words, conversations with a more negative or positive average tone are also those that seem to be the most heterogeneous. This is particularly true when weighting the results according to the score. We explain this distribution by the fact that most comments gravitate toward a mean of about 0 and that charged or polarizing comments encourage responses that are just as charged or polarizing. However, when we look more specifically at the dynamics within subreddits, the distribution often becomes linear. For example, if we consider only textual data, positive conversations about immigration in r/worldnews or r/politics tend to be more homogeneous, while more negative conversations see their standard deviation increase. It is interesting to note that when these same results are weighted, most linear distributions take a “U” shape. This suggests that although some subreddits have clearly expressed ideological tendencies at the textual level, in accordance with the assumption of ideological segregation, upvotes and downvotes paint a different picture. Indeed, while textual data can demonstrate, for example, that some communities are homogeneous when discussing immigration positively, adding the weighting related to the score shows that these communities are more heterogeneous and divided than they appear to be. From this perspective, the addition of weighting by scores as a sign of social acceptability is essential.
We turn next to analyzing the evolution of tone over time, and in particular within the selected 11 subreddits. For example, Figures 1–6 show that, when smoothed with a 3-month running average and when combining measures of sentiment and topic, r/canada and r/canadapolitics as well as r/unitedkingdom and r/ukpolitics evolve together over time. Excluding the noisier first weeks of each subreddit’s history, each pair has numerous instances of troughs and peaks that occur either at the same time or in opposition to one another, suggesting a reaction to the same political events, with almost always a more positive tone for the subreddit known to be more sophisticated and curated.
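For readers unfamiliar with this kind of smoothing, the sketch below shows a simple trailing running average over a daily series. It is an illustration under stated assumptions rather than the procedure used for the figures: whether the window was trailing or centered, and how the first incomplete window was handled, is not specified in the text. Here the window is trailing, and early points are averaged over the values available so far; a 3-month window on daily data would correspond to a `window` of roughly 90.

```python
def running_average(values, window):
    """Trailing running average of a numeric series.

    Assumption: for the first `window - 1` points, the average is taken
    over the values seen so far rather than being left undefined.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            # Drop the value that just left the trailing window.
            total -= values[i - window]
        smoothed.append(total / min(i + 1, window))
    return smoothed
```

Applied to the average daily tone of two related subreddits, such smoothing makes shared peaks and troughs easier to compare visually.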
Figures 1–6 show that the tone and subject dictionaries capture the expected variation in the nature of conversations across the different subreddits and that the average tone works as we expected. We then turn to the core of our analysis and the evolution of the standard deviation of the tone in order to test our hypotheses about homogeneity.
Figures 7–9 show the average tone and standard deviation of tone for each day across all comments on Reddit.com concerning, respectively, immigration, defense, and macroeconomics. These figures illustrate the general trend when looking at the entire platform, a trend that is identical for our three topics of interest: the average tone revolves around 0 and the standard deviation of tone increases by about 2 points over the 10-year period studied, reaching a standard deviation of about 13. Clearly, the average comment is relatively neutral with respect to the three political objects of interest, but the distribution is quite wide, with a standard deviation of 11 to 13 points, that is, the equivalent of 11%–13% more words designated as either positive or negative. Thus, when we consider all the public comments toward our three political objects and inferred attitudes, Reddit becomes more and more diversified and heterogeneous.

Figure 1. Changes in the average daily tone of immigration on r/CanadaPolitics and r/Canada between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 2. Evolution of the average daily tone of defense on r/CanadaPolitics and r/Canada between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 3. Changes in the average daily tone of macroeconomics on r/CanadaPolitics and r/Canada between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 4. Changes in the average daily tone of immigration on r/ukpolitics and r/unitedkingdom between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 5. Changes in the average daily tone of defense on r/ukpolitics and r/unitedkingdom between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 6. Changes in the average daily tone of macroeconomics on r/ukpolitics and r/unitedkingdom between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 7. Changes in the average tone and standard deviation of immigration on the Reddit.com website as a whole between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 8. Evolution of the average tone and its standard deviation with regard to defense on the entire Reddit.com website between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).

Figure 9. Evolution of the average tone and its standard deviation with respect to macroeconomics on the Reddit.com website as a whole between October 2007 and December 2016. Source: All public comments on Reddit.com from October 2007 to December 2016 (Baumgartner, 2015–).
We also analyzed the slope and intercept of a series of linear regressions in which time, in 24-hr increments, is used to predict the standard deviation of tone for the three subjects across the subreddits of interest mentioned above; the daily slopes are multiplied by 365 to extrapolate over a 1-year period. This allows us to capture and compare general trends in the standard deviation of tone with respect to our political objects. For the site as a whole, the coefficients for immigration, defense, and macroeconomics are 0.0004, 0.0003, and 0.0003, respectively, which equates to about 0.15, 0.11, and 0.11 points each year. The results show three important characteristics of our data: (1) Most of the results in Figure 3 for specific subreddits show higher coefficients than for Reddit as a whole. (2) Almost all subreddits have comparable coefficients for each subject. (3) With the notable exception of r/CanadaPolitics and r/politics as well as defense-related discussions on r/news, all coefficients are positive. In other words, with the exception of r/CanadaPolitics and r/politics and occasionally r/news, the subreddits studied all become more diversified over time.
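The regression and the annual extrapolation can be sketched as below. This is a minimal ordinary least squares fit written for illustration, not the Stata models used in the article. The function names `fit_line` and `annual_change` and the synthetic series are hypothetical; only the three platform-wide coefficients come from the text.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def annual_change(daily_slope, days_per_year=365):
    """Extrapolate a per-day slope to a per-year change."""
    return daily_slope * days_per_year

# Coefficients reported in the text for Reddit as a whole.
reported = {"immigration": 0.0004, "defense": 0.0003, "macroeconomics": 0.0003}
per_year = {topic: annual_change(coef) for topic, coef in reported.items()}

# Synthetic (hypothetical) daily series: day index vs. standard deviation.
days = list(range(3650))
sd_of_tone = [11.0 + 0.0004 * d for d in days]
slope, intercept = fit_line(days, sd_of_tone)
```

With the reported coefficients, 0.0004 × 365 = 0.146 and 0.0003 × 365 = 0.1095, which round to the 0.15 and 0.11 points per year given above.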
This tendency of most communities to show signs of ideological diversification is surprising. Only r/CanadaPolitics and r/politics show an increase in homogeneity over time, as measured by a decreasing standard deviation, for all subjects. In the case of r/CanadaPolitics, this trend is twice as fast as Reddit’s overall diversification rate. Interestingly, the differentiation between subreddits is less about political versus nonpolitical or less partisan subreddits than about default versus nondefault subreddits. Indeed, r/politics, r/worldnews, r/news, and r/AskReddit, as default subreddits, all have a relatively high standard deviation from the start and, with the exception of r/politics, show an increasing standard deviation. Both r/politics and r/CanadaPolitics are known to be rather left-wing communities and are often accused of leaving little room for more conservative viewpoints. While identifying the political leanings of subreddits is outside the purview of this article, it does appear that those communities are becoming more homogeneous.
In any case, the dichotomy between generalist geographical subreddits and their “political” equivalent does not seem to show a particular trend on the issue of homogeneity. The data also do not support Hypothesis 1a or Hypothesis 1b since most communities are becoming increasingly heterogeneous according to our measures, with little distinction between political and nonpolitical subreddits.
To test whether the arrival of new users can explain the significant increase in polarization and attitudinal diversification toward our political objects, we used the same models but considered only users who were active for at least 12 of the 120 months covered by our study.
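The restriction to long-term users can be sketched as a simple filter. This is an illustration under stated assumptions, not the original code: the function name `long_term_users` and the `(author, date)` input layout are hypothetical, and "active" is assumed here to mean at least one comment in a given calendar month, which the text does not define precisely.

```python
from collections import defaultdict

def long_term_users(comments, min_months=12):
    """Return the authors who commented in at least `min_months` months.

    `comments` is an iterable of (author, iso_date) pairs.
    ASSUMPTION: a user counts as active in a month if they posted at
    least one comment during that calendar month.
    """
    months_by_author = defaultdict(set)
    for author, iso_date in comments:
        months_by_author[author].add(iso_date[:7])  # keep "YYYY-MM"
    return {
        author
        for author, months in months_by_author.items()
        if len(months) >= min_months
    }

# Hypothetical data: one user active every month of 2015, another who
# posted often but only during two months.
sample = [("alice", f"2015-{month:02d}-15") for month in range(1, 13)]
sample += [("bob", f"2015-01-{day:02d}") for day in range(1, 13)]
sample += [("bob", f"2015-02-{day:02d}") for day in range(1, 13)]
kept = long_term_users(sample)
```

Counting distinct months rather than comments means that a short burst of heavy posting, as with the second invented user, does not qualify an account as long-term.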
The data show that, with the exception of r/Europe, r/ukpolitics, and r/unitedkingdom, which retain their positive coefficients, and r/news, which does not give significant results, all subreddits now have a decreasing standard deviation over time. However, while most subreddits, barring the initial noise at their creation, show a relatively linear evolution of the standard deviation of their tone over time, two subreddits stand out. First, the evolution of our inferred homogeneity measure for r/AskReddit is very noisy, to the point of seeming random, until the second half of 2015, when we see a significant decline in the standard deviation of the tone. Second, r/politics shows a clear trend toward homogeneity from its inception to the second half of 2015, after which its standard deviation of tone increases rapidly. These trends are consistent with the creation of r/The_Donald on the eve of the 2016 U.S. elections. Since we are considering here only a sample of established and active users within these communities, this increase in homogeneity on r/AskReddit and this increase in ideological heterogeneity on r/politics, on all topics, cannot be explained by the self-selection of old or new users. These results suggest a socialization that could take the form of self-censorship on r/AskReddit and polarization on r/politics, which is otherwise recognized as rather progressive. There does not seem to be a clear trend in the average tone. Indeed, for most subreddits, the average tone is fairly stable around zero over the period, with more conservative subreddits moving slightly toward a more negative tone.
There are exceptions: (1) compared to other subreddits, r/news is stable at a much more negative point than most other communities, (2) r/CanadaPolitics is much more negative about discussions on macroeconomics than on other topics, (3) the evolution of the average tone of r/politics and r/worldnews seems to be random, fluctuating month after month, and (4) r/AskReddit shows a significant decrease in the average tone over all topics through the 120-month period.
Discussion
This article presents results from a comprehensive content analysis of the effect of online political discussions on inferred political attitudes on Reddit.com. Despite a wealth of literature on the effects of being informed or participating in online discussions on political engagement or participation, little research has explored the link between online interactions and inferred political attitudes. In addition, the literature on both online and offline interpersonal interactions suggests a trend toward network homophily and political homogeneity, but this trend has not been verified by empirical studies of a magnitude similar to ours. This article examines the effects of discussions about political objects over time and the variation in the standard deviation of the average tone with respect to these objects, designed as a measure of political homogeneity applied to three political objects, namely immigration, macroeconomics, and defense.
This study had three distinct goals: an ontological, a methodological, and a theoretical one. First, we aimed to provide arguments for including Reddit.com among the platforms studied by academia, alongside the likes of Twitter and Facebook, if not for its considerable traffic, then for the particularities that make it a promising laboratory for the study of ordinary discussion online. Second, by using content analysis to infer political attitudes and study homogeneity, this study aims to add to the conversation about opinion mining in public opinion research. Finally, this study provides another step toward a better understanding of a potentially powerful agent of political socialization, namely the internet and specifically Reddit, and shows that the proposition of uses and gratifications theory according to which users self-select online only partially explains trends toward homogeneity.
The increasing heterogeneity expressed toward immigration, defense, and macroeconomics over the period studied seems to be mainly the product of the sustained influx of new users. However, when considering long-term users in some communities, particularly on the eve of the 2016 elections, the heterogeneity expressed runs counter to the assumptions of much of the literature on the ideologically segregated nature of online political discussions. In doing so, we reject Hypothesis 1a and Hypothesis 1b while noting that the trend toward homogeneity is the norm when considering users who have been active over a longer period, and we therefore accept Hypothesis 2. We find no evidence that subreddits considered less partisan or political (such as r/AskReddit, r/Canada, r/unitedkingdom, and r/Europe) behave differently from their more explicitly political counterparts.
Although our measures do not translate well into absolute values, our validity tests show relative accuracy and consistency in the collective attitudes inferred toward the political objects studied. Our results also show the need to complement content analyses by considering the effects of digital heuristics on social acceptability, in this case in the form of upvotes and downvotes. While textual analyses show that some communities tend to be more homogeneous on some topics, when weighted by score, the analysis highlights the heterogeneity of the community. This observation allows us to take a second look at the “echo chamber” narrative and the ways in which interaction online may change the importance of certain opinions. Future research must more fully address the importance of these appreciation cues online because, while users might refrain from posting comments displaying unpopular attitudes within a subreddit, they are less likely to refrain from upvoting or downvoting according to their preference. More importantly, these downvotes clearly illustrate that users are exposed to alternative points of view.
Overall, and in the case of most of the subreddits studied, the data suggest that rather than becoming more homogeneous, Reddit.com is becoming more and more diversified over time in the areas of immigration, macroeconomics, and defense. Our homogeneity measure implies that as Reddit grows, the tone distribution is stretched and comments much more negative or much more positive than the average are published, both on Reddit.com as a whole and on most subreddits, except r/politics and r/CanadaPolitics. However, replicating these results while excluding comments from recent, short-lived, and barely active accounts, keeping only regular and long-term users, shows that this heterogeneity is the result of an influx of new users corresponding to the increasing popularity of Reddit.com. The analysis of the aggregate tone of these users, however, illustrates changes in the distribution of the tone expressed with respect to the political objects of interest to us in the months preceding the 2016 U.S. presidential election. The importance of these changes suggests attitudinal transformations as products of socialization, as expressed both through comments and the use of digital heuristics of social acceptability.
Although this database is comprehensive and contains more than 10 years of comments, in some respects, the results remain partial. With regard to incidental political conversations, many more subreddits designed to discuss and share things far from politics would need to be studied to understand their scope and potential impact on attitude homogeneity. Similarly, this article focuses on the homogeneity of different communities with respect to policy discussions, but more research is needed on the inferred attitudes and biases of online communities. While our data set contains a lot of information on the act of posting comments and on the comments themselves, it contains no information as to who upvotes or downvotes each comment, information that could improve our analysis.
Footnotes
Data Availability
The data used for this analysis are available publicly through Reddit’s API. This analysis used a data set compiled and readily available online via torrent (see reference list).
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Software Information
The Lexicoder 3.0 application (see reference list) was used in conjunction with R and Python for data processing and content analysis. Due to the size of the data set and limited computational means, the analysis was done in increments, and as such the syntax is fragmented across multiple documents containing several thousand lines each. The custom code may be obtained from the authors but contains few comments and explanations. The statistical analysis of these results was done in Stata.
