Abstract
In recent years, Reddit has attracted the interest of many researchers due to its popularity all over the world. In this article, we aim at providing a contribution to the knowledge of this social network by investigating three of its aspects, interesting from the scientific viewpoint, and, at the same time, by analysing a large number of applications. In particular, we first propose a definition and an analysis of several stereotypes of both subreddits and authors. This analysis is coupled with the definition of three possible orthogonal taxonomies that help us to classify stereotypes in an appropriate way. Then, we investigate the possible existence of author assortativity in this social medium; specifically, we focus on co-posters, that is, authors who submitted posts on the same subreddit.
1. Introduction
Reddit is a heterogeneous crowd-sourced news aggregator and online social platform, originally self-declared as ‘the front page of the Internet’. It was founded in 2005 and, in a few years, has become an ecosystem of 430M+ average monthly active users. At the time of writing, it ranks 19th and 5th in Alexa’s top 500 global and US websites, respectively. Reddit is built on the concept of subreddit, which is an interest-based community where users can post and comment on content. A subreddit is identified by a name and is referred to within Reddit using the /r/ prefix, as in /r/science and /r/cats. Currently, there are more than 1.9M subreddits. They are mainly topical, although more general cases exist.
In Reddit, users can submit contents in the form of texts, images and links to external resources. Submitted contents (also simply called posts) can be read by other users and discussed via comments. Users can subscribe to multiple subreddits in order to receive the latest posted contents on their front pages. An important feature of Reddit is voting, which represents the mechanism affecting the visibility and the ranking of both posts and comments. In fact, users are allowed to upvote or downvote posts of other users, so that each submission has a score. This is a metric based on the difference between the number of upvotes and the number of downvotes, and it significantly affects the order through which posts and comments are shown to users. However, the exact numbers of upvotes and downvotes are not shown publicly.
Due to the great expansion of Reddit in recent years, many researchers all over the world have been attracted by this social platform. An overview of the studies on Reddit can be found in Medvedev et al. [1], whereas an interesting longitudinal analysis on the evolution of this social medium is presented in Singer et al. [2]. Authors have analysed, and are continuously analysing, many aspects of Reddit, ranging from community structures and interactions [3–5] to user behaviour [6,7], from the analysis of the structure and content of subreddits, posts and comments [8] to the analysis of the structural properties of Reddit when it is seen as a social network [5]. Other specific topics, such as text classification [9], user migration [10], and political and ideological aspects [11], have also been studied.
In this article, we aim at contributing to the knowledge of Reddit by investigating subreddit and author stereotypes and by evaluating author assortativity in this social platform.
The term ‘stereotype’ comes from the combination of two Greek words, namely, ‘stereos’ (i.e. solid) and ‘typos’ (i.e. impression). It is adopted to indicate a popular belief about specific groups of individuals. This term first appeared in the press at the end of the 18th century. Later, it was introduced into modern psychology at the beginning of the 20th century by Lippmann [12]. The tendency to classify people into groups and to associate each group with a ‘general idea’, a ‘label’ (and, ultimately, a stereotype) is intrinsic to the human mind. As a result, many (both positive and negative) stereotypes have been defined in the history of humanity, in the most disparate areas. Think, for instance, of the stereotypes coined in sport, art, literature, and so on. With the widespread diffusion of the Web, the practice of coining and using stereotypes has extended from real life to Cyberspace [13,14]. As the Web became increasingly interactive, with the transition to the Web 2.0 and, above all, with the appearance of social networks, the adoption of stereotypes in Cyberspace has become more and more evident [15–20]. For example, in Facebook, one can encounter stereotypes like ‘Lime-Lighters’, ‘Emo’s’, ‘Philosophy Majors’, ‘Hopeless Romantics’, ‘Ghosts’, ‘Stalkers’, ‘Addicts’, and so forth [21]. Similarly, Instagram also presents a wide range of stereotypes [22]. We argue that stereotypes do not necessarily have a negative meaning, as often happens in real life. On the contrary, they can be extremely useful in everyday communications and interactions in social networks. In this article, we want to go one step further; in fact, we claim that it is possible to define ‘scientific’ stereotypes that could be used in scientific applications. We also believe that Reddit is well suited to our goal and that, in this context, besides defining stereotypes for the authors of Reddit, it is also possible to introduce stereotypes for subreddits.
The concept of ‘assortativity’ or ‘assortative mixing’ in a social network was introduced in a famous paper of Newman [23]. It is strictly related to the concept of homophily [24] and indicates a network node’s predilection to relate to other nodes that are somewhat similar. Several possible similarities could be considered in assortativity, but the most investigated one is node degree. Newman focused on degree assortativity and defined a network as assortative if its nodes having many connections tend to be connected to other nodes with many connections. He showed that social networks are often assortatively mixed, whereas technological and biological networks tend to be disassortative. After Newman, other authors investigated assortativity in several social networks such as Facebook [25], Twitter [26], Cyworld, Orkut and MySpace [27]. They found that (a) Cyworld is disassortative with respect to friendship and very assortative with respect to the ‘testimonial’ relationship; (b) Orkut is assortative with respect to friendship; (c) MySpace is neutral with respect to the same relationship; (d) Twitter is strongly assortative with respect to shared interests of users; (e) Facebook is assortative with respect to the tendency of a bridge (i.e. a user joining multiple social networks) to communicate with other bridges. In this article, we extend the assortativity analysis to Reddit, which was only marginally considered in past studies on this topic. We first consider degree assortativity because it has been the most studied form in the past. Then, we also analyse eigenvector assortativity. We show that Reddit is assortative with respect to both these centralities, which confirms that this social platform also follows the hypotheses of Newman concerning the existence of assortative mixing in social networks.
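As an illustration of degree assortativity, the following minimal sketch computes it as the Pearson correlation between the degrees found at the two ends of each edge, which is one common way of expressing Newman's coefficient. The function name and the toy graphs are hypothetical and serve only as an example; they are not taken from our experiments. NetworkX offers a ready-made counterpart, `degree_assortativity_coefficient`.

```python
from collections import Counter
from math import sqrt


def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Each undirected edge is counted in both directions, so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when all edge ends have the same degree
    return cov / sqrt(var_x * var_y)


# Toy graph 1: a star, where the hub is linked only to degree-1 leaves.
star = [("hub", leaf) for leaf in ("a", "b", "c", "d")]
r_star = degree_assortativity(star)

# Toy graph 2: a triangle plus a separate edge, where nodes only touch
# nodes having their same degree.
mixed = [("a", "b"), ("b", "c"), ("a", "c"), ("x", "y")]
r_mixed = degree_assortativity(mixed)
```

The star yields a coefficient of -1 (fully disassortative), whereas the second graph yields +1 (fully assortative); real networks fall between these two extremes.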
The significance and value of this article concern both the theoretical and the application viewpoints. From the theoretical point of view, this is the first article that studies the concept of stereotype in Reddit, although approaches for the characterization and identification of specific traits of users have been independently presented in different scientific works: users showing multi-community engagement [3], anti-social behaviours [4], community opposers [28], ‘answer-persons’ [6] and ‘explorers’ [29] are some examples. It is also the first article that proposes a study focused on the concept of assortativity in Reddit. In fact, this concept had been investigated for a wide variety of social platforms in the past, such as Facebook [25], Twitter [26], Cyworld, MySpace and Orkut [27], but had been only marginally considered for Reddit. As far as the application point of view is concerned, we highlight that the knowledge patterns on stereotypes and author assortativity extracted in this article can be employed in a large variety of contexts. Just to cite a few of them, we mention (a) the definition of some guidelines to follow in order to make a subreddit successful; (b) the definition and realisation of different categories of recommender systems for Reddit; (c) the definition of an algorithm that finds subreddits to merge or, at least, to integrate; (d) the detection of possible targets for an advertising campaign; and (e) the definition of an algorithm that builds blacklists of users based on author stereotypes.
The outline of this article is as follows. In Section 2, we describe related literature. In Section 3, we present an overview of our investigation activity and describe the data set adopted in our experiments. In Section 4, we present several preliminary analyses concerning posts, comments and users in Reddit. In Section 5, we illustrate the activities performed to detect subreddit stereotypes and to determine their features. In Section 6, we describe the same tasks but performed to detect author stereotypes. In Section 7, we analyse author assortativity in Reddit. In Section 8, we present a discussion of obtained results. In Section 9, we describe some possible applications of the knowledge we extracted in the previous sections. Finally, in Section 10, we draw our conclusions and have a look at future developments concerning our research.
2. Related work
The study of social networks has rapidly become a core research field, thanks to its interdisciplinary aspects [30–35]. Indeed, many researchers of different disciplines, such as computer scientists, sociologists and anthropologists, have shown great interest in social network analysis [36–38]. In this context, Reddit is an invaluable source of information, insights and research possibilities. Indeed, it is a prosperous environment, where users share contents and interact with each other. The heterogeneous nature of Reddit, together with the openness and the richness of its data, has encouraged the scientific community to explore the twists and turns of this platform.
The swift increase in scientific literature related to Reddit has produced a considerable number of papers with several goals and methodologies. In Medvedev et al. [1], the authors present an overall survey on Reddit, which illustrates several studies on this social network, spanning the period from 2005 to 2018. An interesting longitudinal analysis on the evolution of Reddit is presented in Singer et al. [2].
As pointed out in the Introduction, one of the main theoretical contributions of this article is the study of the concept of author stereotype in Reddit and the definition and characterization of several stereotypes of interest. As a matter of fact, in past literature, approaches for the characterization and identification of specific traits of users have been presented in different papers. Some of the considered traits are users presenting multi-community engagement [3], anti-social behaviours [4], community opposers [28], ‘answer-persons’ [6] and ‘explorers’ [29]. The main contribution of our work with respect to these proposals is a systematic study of several traits of users, which are summarised in a wide spectrum of stereotypes and in a suitable classification of them.
In more detail, the ‘multi-community interaction’ trait is studied in Tan and Lee [3], where the authors analyse the evolution of the communities in which users post during their Reddit ‘life’. They find that Reddit users continually post in new communities; moreover, those who leave a community appear destined to do so from the very beginning of their history. Social and anti-social behaviours are analysed in Datta and Adar [4], where the authors apply a definition that extends Brunton’s construct of spam in order to separate norm-compliant behaviours from norm-violating ones. This approach also investigates inter-community conflicts by associating social and anti-social homes to users. Conflicts between users are also studied in Kumar et al. [28], but from a different point of view. Here, the authors analyse inter-community interactions across 36,000 communities and focus on cases where users of one community, driven by a negative sentiment, submit comments in another community. They highlight how such conflicts actually emerge from a very small number of communities and discuss strategies for predicting conflicts and mitigating their negative impacts. The presence of users showing the trait of ‘answer-person’ in Reddit is explored in Buntain and Golbeck [6], where the authors define an automated method based on user interactions for identifying this role, while avoiding expensive content analysis. Finally, in Hessel et al. [29], the authors present a study regarding highly related communities; in this analysis, they define the characteristics of explorers and non-explorers by adopting a specific taxonomy.
The studies and approaches outlined above have been developed considering several communities and subreddits. In Kou et al. [7], a specific subreddit about online User Experience (/r/userexperience) is studied. Here, members socialise and learn together. The authors of this study identify five distinct social roles, namely, the ‘knowledge broker’ (i.e. a member that introduces knowledge to the community by sharing links), the ‘translator’ (i.e. a member that brings academic knowledge into the community), the ‘conversation facilitator’, the ‘experienced practitioner’ and the ‘learner’. Even if the contribution of Kou et al. [7] is particularly interesting because it considers several facets of users’ characterization (and, in this respect, it is similar to our work), these classes are specific to, and valid for, the analysed community only. On the contrary, the author stereotypes introduced in our approach cover a wide range of possible facets of users’ behaviour, with no limitation on the kind and number of subreddits the users interact with.
As a final remark about stereotyping in the literature, it is worth observing that our proposal introduces both author and subreddit stereotypes. To the best of our knowledge, the definition of subreddit stereotypes received no attention in the literature and, consequently, it represents a step forward in the research on Reddit.
As far as this last aspect is concerned, we pointed out in the Introduction that one of the main potential applications of subreddit stereotyping is the definition of guidelines in order to make a subreddit successful. With respect to this topic, some papers studied how to predict the success of a subreddit or, more generally, of a community from different perspectives. In particular, Cunha et al. [39] investigate the success and group dynamics of online communities, focusing on Reddit ones. In detail, they identify four success measures desirable for most communities, spanning from the growth of the number of members to the volume of activities within the community and capturing different kinds of success. They also investigate the prediction of the final success of a new community. Furthermore, Weninger [40] presents a broad exploration of posts, with a particular interest in comments. Here, the author aims at fulfilling three different tasks. The first is analysing a comment thread by looking at its topical structure and evolution; the second consists of exploiting comment threads to enhance web search; and the third aims at distilling useful features to predict the final score of a comment. Finally, in Shen and Carolyn [8], the authors investigate both the behavioural context of user posting and the polarisation of user responses.
The main difference between the above-mentioned approaches and the stereotyping activity proposed in this article is that the former observe the evolution of communities and, possibly, predict their success, whereas the latter could be used to provide guidelines for promoting specific actions to obtain the desired success. From a data analytics point of view, the former focus on descriptive and predictive analytics, whereas the latter also performs diagnostic and prescriptive analytics.
As pointed out in the Introduction, another contribution of this article is the study of assortativity in Reddit. While this topic has been analysed with reference to other social platforms [25–27], only a few works marginally analysed it on Reddit. In particular, in Hamilton et al. [41], the authors focus on studying loyal communities, finding that they tend to be less assortative as their interaction level increases. In this case, assortativity is studied on monthly interaction networks, where users are considered connected if they submit a comment in the same comment chain with a gap of at most two comments. The authors also carry out a comparison with a null model and find that the difference between loyal communities and their random counterparts disappears. This result implies that users in loyal communities tend to interact with dissimilar users as a consequence of the community’s activity. Actually, in Hamilton et al. [41], assortativity is used as a tool for characterising loyal communities, studying single chains of comments. On the contrary, we study assortativity from a more general point of view, in order to provide an overall characterization of Reddit users across several subreddits and comments. Furthermore, we study both degree assortativity and eigenvector assortativity.
Another work marginally related to our study on assortativity in Reddit is presented in Fire and Guestrin [5]. Here, the authors discuss the rise of new trends in complex networks by looking at vertices that ‘shine’ (i.e. high-degree vertices), also called network stars. They study the evolution of some complex networks, with Reddit among them. They analyse the temporal dynamics of the networks by looking at how different features, such as density and average clustering coefficient, change over time. Clearly, Fire and Guestrin [5] and our article are quite different. Indeed, our assortativity definition does not allow the analysis of temporal dynamics, which is the main goal of Fire and Guestrin [5]. On the other hand, it helps to characterise the tendency of users to associate with each other.
Other works, marginally related to our proposal, focus on the study of specific aspects of subreddits or user behaviours. For instance, in LaViolette and Hogan [9], the authors use text classification and computational critical discourse analysis to distinguish and interpret ideological differences between subreddits. In Zhang et al. [42], the authors present a study regarding a quantitative, language-based typology of communities’ identity, revealing how several social phenomena manifest across communities. The introduced taxonomy is based on two aspects of community identity, that is, distinctiveness and dynamicity. User migration is studied in Newell et al. [10]. Here, Reddit is examined during a period of community unrest in order to identify the motivations for this kind of behaviour. Political and ideological aspects emerging in Reddit are discussed in the literature [11,43–45]. Finally, in Fiesler et al. [46], the authors present a mixed-method study of 100,000 subreddits and their rules in order to define effective mechanisms for community governance.
3. Overview of our investigation activity
After having defined the motivations and objectives of the analyses described in this article, in this section we present an overview of our investigation activity. We start by depicting the overall structure of Reddit in Figure 1. In the left part of this figure, each rounded box represents a subreddit. The central part shows a list of posts in the example subreddit /r/subreddit, where each colour identifies a different type of post (text, image or link to an external resource). Finally, the right part illustrates the structure of a post, including its title and its comments, which are presented as a tree having the post as root.

Figure 1. A graphical overview of Reddit structure.
The workflow of the tasks that make up our investigation activity is shown in Figure 2. For layout reasons, in this figure the data set appears as an input to the descriptive analysis module only. Actually, it should be considered an input to all the modules of the workflow. Similarly, descriptive knowledge patterns should also be considered an input to the assortativity analysis module. Finally, both descriptive knowledge patterns and stereotype knowledge patterns (analogously to what is explicitly shown in Figure 2 for the assortativity knowledge patterns) should also be considered outputs of our investigation activity.

Figure 2. The workflow representing the tasks of our investigation.
As shown in Figure 2, the first phase of our investigation consists of a descriptive analysis of all those features of Reddit that can affect the investigation of both stereotypes and assortativity. We start with some preliminary investigations on Reddit data. They focus on three aspects, namely, posts submitted to subreddits, comments under these posts and, finally, users who created a subreddit, posted or commented. The aim of this preliminary descriptive analysis was not to discover new specific knowledge about Reddit. Instead, it allowed us to better understand the data set and to check if some theoretical trends, which should have characterised these aspects on Reddit, were verified on it. Furthermore, the results found, which were partially expected, represented the starting point of the next knowledge detection activities. They were also useful to explain the knowledge patterns extracted.
These knowledge patterns, together with the data set, are given as input to the second module of our workflow, which carries out the extraction and analysis of stereotypes. In order to extract and analyse subreddit stereotypes, we first investigate the lifespan of a subreddit, depicting its typical characteristics. Then, starting from this, we identify several subreddit stereotypes and, finally, we define and apply three orthogonal taxonomies in order to characterise them. After the analysis of subreddit stereotypes, we proceed similarly for author stereotypes. In particular, we extract several author stereotypes and, then, we classify them according to some orthogonal taxonomies that we define for this purpose.
The information returned by this module, especially the one extracted from the analysis of author stereotypes, is given as input to the third module of the workflow, which deals with assortativity analysis. We performed this analysis for Reddit authors, considering both degree and eigenvector assortativity, to verify whether authors who are very active in Reddit tend to form a backbone. This module first builds an appropriate social network whose nodes represent the authors and whose arcs denote the co-posting activity. Afterwards, it performs the analysis of the assortativity of Reddit authors against degree and eigenvector centrality. The patterns derived during this phase represent the last type of knowledge pattern returned as output by our approach.
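The construction of the co-posting network can be sketched as follows. This is only an illustrative outline under simple assumptions: posts are reduced to hypothetical (author, subreddit) pairs, and two authors are linked whenever they submitted at least one post to the same subreddit. The function name and the sample records are not part of our data set.

```python
from collections import defaultdict
from itertools import combinations


def build_coposter_edges(posts):
    """Return the set of undirected author pairs that posted in a common subreddit."""
    authors_by_subreddit = defaultdict(set)
    for author, subreddit in posts:
        authors_by_subreddit[subreddit].add(author)

    edges = set()
    for authors in authors_by_subreddit.values():
        # sorted() gives each pair a canonical order, so (a, b) and (b, a) coincide.
        for pair in combinations(sorted(authors), 2):
            edges.add(pair)
    return edges


# Hypothetical (author, subreddit) records, for illustration only.
posts = [
    ("alice", "/r/science"),
    ("bob", "/r/science"),
    ("carol", "/r/cats"),
    ("alice", "/r/cats"),
    ("dave", "/r/aww"),
]
edges = build_coposter_edges(posts)
```

The resulting pairs could then be loaded into a NetworkX graph through `Graph.add_edges_from`. Note that the number of pairs grows quadratically with the number of authors of a subreddit, so very large subreddits would require sampling or a more compact representation.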
After having described the workflow representing our investigation activity, we now illustrate in more detail the characteristics of our data set. All the data required for our activity was downloaded from the
In order to carry out our experiments, we used a server equipped with 16 Intel Xeon E5520 CPUs and 96 GB of RAM with the Ubuntu 18.04.3 operating system. We adopted Python 3.6 as programming language, its library Pandas to perform extract, transform and load (ETL) operations on data and its library NetworkX to perform operations on networks.
During the ETL phase, we observed that some of the available posts referred to authors that had left Reddit. We decided to remove these posts from our data set. At the end of this last activity, the number of posts at our disposal was 122,568,630.
We computed the number of authors who submitted these posts; it was equal to 12,464,188. Then, we found the number of the subreddits which they referred to; it was equal to 1,356,069.
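The cleaning and counting steps just described can be outlined as follows. This is a sketch under stated assumptions: the field names and the `[deleted]` marker for authors who left Reddit are hypothetical, since the way such authors appear in the data set is not specified above.

```python
# Marker assumed to identify authors who left Reddit; this value is hypothetical.
DELETED_AUTHOR = "[deleted]"


def clean_and_count(posts):
    """Drop posts whose author left Reddit, then count distinct authors and subreddits."""
    kept = [p for p in posts if p["author"] != DELETED_AUTHOR]
    n_authors = len({p["author"] for p in kept})
    n_subreddits = len({p["subreddit"] for p in kept})
    return kept, n_authors, n_subreddits


# Toy records with hypothetical field names.
raw_posts = [
    {"author": "alice", "subreddit": "/r/science"},
    {"author": "[deleted]", "subreddit": "/r/science"},
    {"author": "bob", "subreddit": "/r/cats"},
    {"author": "alice", "subreddit": "/r/cats"},
]
kept, n_authors, n_subreddits = clean_and_count(raw_posts)
```

On the toy records, three posts survive the filter, submitted by two authors to two subreddits.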
4. Preliminary investigations on Reddit data
In this section, we describe some preliminary investigations that we performed on Reddit. As pointed out in Section 3, these are not the core of our article, but they confirmed the suitability of our data set. Furthermore, some knowledge extracted here was extremely useful in the analyses described in the next sections. We group the following analyses into three subsets, concerning posts, comments and authors, respectively. We describe each subset in a separate subsection.
4.1. Investigation on posts
We started this investigation by performing the following analyses on posts:
Distribution of subreddits against posts (Figure 3); it follows a power law with
Distribution of authors against posts (Figure 4); it follows a power law with
Distribution of posts against scores (Figure 5); it follows a power law with

Figure 3. Distribution of subreddits against posts (log–log scale).

Figure 4. Distribution of authors against posts (log–log scale).

Figure 5. Distribution of posts against scores (log–log scale).
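A distribution such as the one of subreddits against posts can be obtained by counting twice: first the number of posts of each subreddit, then the number of subreddits having each post count. The following sketch illustrates this idea on toy data; the function name and the field names are hypothetical.

```python
from collections import Counter


def distribution_against_posts(posts, key):
    """Map k to the number of entities (e.g. subreddits or authors) with exactly k posts."""
    posts_per_entity = Counter(p[key] for p in posts)
    return Counter(posts_per_entity.values())


# Toy records: /r/a has three posts, /r/b and /r/c have one post each.
posts = [
    {"subreddit": "/r/a", "author": "alice"},
    {"subreddit": "/r/a", "author": "bob"},
    {"subreddit": "/r/a", "author": "alice"},
    {"subreddit": "/r/b", "author": "carol"},
    {"subreddit": "/r/c", "author": "alice"},
]
subreddits_vs_posts = distribution_against_posts(posts, "subreddit")
authors_vs_posts = distribution_against_posts(posts, "author")
```

Plotting the resulting pairs (k, count) on log–log axes gives a chart of the kind shown in the figures above; passing `"author"` as the key yields the analogous distribution of authors against posts.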
The maximum number of posts with the same score is 51,721,824. Interestingly, these posts have a score equal to 1. Instead, the number of posts with a score equal to 0 or 2 is much smaller. This trend can be explained considering that a post submitted on Reddit starts with a score of 1. As a consequence, when no other author upvotes or downvotes it, the final score of the post is 1.
We also observe that no post has a negative score. This is because Reddit shows and returns a score equal to 0 for a post whenever the number of downvotes is higher than the number of upvotes, that is, also when the real score of the post is negative. So, posts with a score equal to 0 are to all intents and purposes regarded as ‘negative’ posts.
At this point, we also computed:
The distribution of authors against negative posts (Figure 6); it follows a power law with
The distribution of authors against positive posts (Figure 7); it follows a power law with

Figure 6. Distribution of authors against negative posts (log–log scale).

Figure 7. Distribution of authors against positive posts (log–log scale).
As for these two distributions, we found that the number of positive posts is about 16 times the number of negative ones.
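The separation between positive and negative posts follows directly from the convention described above, where a score equal to 0 denotes a ‘negative’ post. A minimal sketch, with hypothetical field names and toy records, is the following.

```python
from collections import Counter


def split_by_sign(posts):
    """Split posts into positive (score > 0) and negative (score == 0) ones."""
    positive = [p for p in posts if p["score"] > 0]
    negative = [p for p in posts if p["score"] == 0]
    return positive, negative


# Toy records, for illustration only.
posts = [
    {"author": "alice", "score": 1},
    {"author": "alice", "score": 0},
    {"author": "bob", "score": 12},
    {"author": "bob", "score": 3},
    {"author": "carol", "score": 0},
    {"author": "carol", "score": 1},
]
positive, negative = split_by_sign(posts)

# Number of negative posts per author, the basis of the author distribution.
negative_per_author = Counter(p["author"] for p in negative)

# Ratio between positive and negative posts.
ratio = len(positive) / len(negative)
```

On these toy records the ratio is 2; the value of about 16 reported above refers to the whole data set.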
4.2. Investigation on comments
As for this investigation, we computed:
The distribution of subreddits against comments (Figure 8); it follows a power law with
The distribution of the average number of comments against the scores of the posts they refer to (Figure 9). Interestingly, in this case, we have a roughly Gaussian distribution, whose mean is at a score close to 50,000. The distribution presents several outliers. For instance, for a score equal to 79,470, we have a post with 71,225 comments.
The distribution of posts against comments (Figure 10); it follows a power law with

Figure 8. Distribution of subreddits against comments (log–log scale).

Figure 9. Distribution of the average number of comments against the scores of the posts they refer to.

Figure 10. Distribution of posts against comments (log–log scale).
Finally, we considered the 150 posts with the highest number of comments and the subreddits they were submitted to. We obtained only 31 subreddits. Then, we computed the average number of comments for all the posts submitted in each of these subreddits. The results obtained are reported in Figure 11. From the analysis of this figure, we can observe that the distribution is very irregular. It decreases quickly for the first three subreddits, very slowly for the next 13 subreddits, quickly for the next 9 subreddits and, finally, it suddenly drops and becomes almost zero.

Figure 11. Distribution of the average number of comments submitted to the subreddits receiving the 150 most commented posts.
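The procedure behind this last analysis (selecting the most commented posts, collecting their subreddits and averaging the comments over all the posts of each of them) can be sketched as follows. The function name, the field names and the toy records are hypothetical; the parameter `k` plays the role of the value 150 used above.

```python
from collections import defaultdict


def avg_comments_of_top_subreddits(posts, k=150):
    """Average comments per post for the subreddits hosting the k most commented posts."""
    top = sorted(posts, key=lambda p: p["num_comments"], reverse=True)[:k]
    top_subreddits = {p["subreddit"] for p in top}

    # The average is computed over ALL the posts of each selected subreddit,
    # not only over the posts that entered the top k.
    totals = defaultdict(lambda: [0, 0])  # subreddit -> [sum of comments, number of posts]
    for p in posts:
        if p["subreddit"] in top_subreddits:
            totals[p["subreddit"]][0] += p["num_comments"]
            totals[p["subreddit"]][1] += 1

    averages = {s: total / count for s, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Toy records, for illustration only.
posts = [
    {"subreddit": "/r/a", "num_comments": 100},
    {"subreddit": "/r/a", "num_comments": 20},
    {"subreddit": "/r/b", "num_comments": 50},
    {"subreddit": "/r/b", "num_comments": 30},
    {"subreddit": "/r/c", "num_comments": 10},
]
ranking = avg_comments_of_top_subreddits(posts, k=2)
```

Replacing the subreddit field with the author field gives the analogous ranking for the authors of the most commented posts.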
4.3. Investigation on authors
First, we determined the distribution of authors against subreddits (Figure 12). It follows a power law with

Figure 12. Distribution of authors against subreddits (log–log scale).
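For readers who wish to check a power-law trend on their own data, the following sketch estimates the exponent through the standard continuous maximum-likelihood formula, alpha = 1 + n / sum(ln(x_i / x_min)). It is not necessarily the fitting procedure adopted for the figures of this section; moreover, the counts considered here are discrete, so the continuous estimator is only an approximation. The function name and the synthetic sample are hypothetical.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min=1.0):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


# Synthetic check: draw samples from a known power law (x_min = 1) by
# inverse-transform sampling, then verify that the exponent is recovered.
random.seed(0)
true_alpha = 2.5
samples = [
    (1.0 - random.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(20000)
]
alpha_hat = powerlaw_alpha_mle(samples)
```

With 20,000 synthetic samples, the estimate is expected to fall very close to the true exponent; on real data, the choice of `x_min` strongly affects the result and should be made with care.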
Afterwards, we selected the 150 posts with the highest number of comments and the corresponding authors. Interestingly, we had only 26 authors for all the 150 posts. These can be considered as the most commented authors in Reddit and, maybe, they are influencers. Then, we computed the average number of comments for all the posts each author submitted. The results obtained are reported in Figure 13. From the analysis of this figure, we can observe that the decrease in the distribution is roughly stepwise.

Figure 13. Distribution of the average number of comments received against the authors submitting the 150 most commented posts.
5. Stereotyping subreddits
In order to determine some possible stereotypes of subreddits, we start by investigating the subreddit lifespan. As a first step, we considered the subreddits created in January 2019 and then verified the month when they performed their last activity (and, therefore, presumably died). The results obtained are reported in Figure 14. Here, an activity level of 1 implies that the subreddit died in the same month it was born, an activity level of 2 suggests that it died 1 month after it was born, and so on. An activity level of 8 indicates that it is still alive (we recall that our data set comprises data from 1 January 2019 to 1 September 2019). We proceeded in the same way for the subreddits created in February, March, and so forth. For instance, in Figure 15, we report the trends of the subreddits created in February 2019 and in March 2019.

Figure 14. Lifespan of the subreddits created in January 2019.

Figure 15. Lifespan of the subreddits created in February 2019 (at left) and March 2019 (at right).
After this, we focused on those subreddits that died in the same month they were born. We analysed their corresponding lifespan and we observed that almost all of them died on the same day they were born. For instance, in Figure 16, we report the trends of the subreddits born and died in February 2019 and in March 2019.
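The activity level used in these figures can be computed from the dates of the first and last activity of a subreddit. The following sketch assumes that both dates are available as `date` objects; the function name and the sample dates are hypothetical.

```python
from datetime import date


def activity_level(created, last_activity):
    """1 if the subreddit died in its birth month, 2 if it died one month later, and so on."""
    months_apart = (
        (last_activity.year - created.year) * 12
        + (last_activity.month - created.month)
    )
    return months_apart + 1


born = date(2019, 1, 5)
level_same_month = activity_level(born, date(2019, 1, 20))
level_march = activity_level(born, date(2019, 3, 2))
level_august = activity_level(born, date(2019, 8, 30))
```

For a subreddit born in January 2019, a last activity in August 2019 yields a level of 8, which corresponds to the subreddits still alive at the end of the observation period.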

Figure 16. Lifespan of the subreddits born and died in February 2019 (at left) and March 2019 (at right).
Then, we decided to investigate in depth those subreddits that died on the same day they were born. We computed their distribution against the number of their posts. Figure 17 shows what happens for January 2019; the same trend can be observed for similar subreddits born in the other months of this year. Clearly, this distribution follows a power law. From its analysis, we observe that most of the subreddits that died on the same day they were born have only one post. At this point, we computed the distribution of these subreddits against the number of comments. In Figure 18, we show the subreddits of January 2019, even if the same trend can be observed for the other months of this year. From the analysis of this figure, we can note that this distribution follows a power law. Furthermore, most of these subreddits have no comments.

Figure 17. Distribution of the subreddits of January 2019 that died on the same day they were born against the number of their posts.

Figure 18. Distribution of the subreddits of January 2019 that died on the same day they were born against the number of their comments.
Next, we examined a second class of subreddits, similar to the previous one. In fact, we selected all those subreddits that died 1 day after they were born. Again, we first computed their distribution against the number of posts. In Figure 19, we show what happens for the subreddits of January 2019; again, the same trend was found for all the other months. This distribution follows a power law, which was expected. The unexpected thing was that the minimum number of posts was 2 and not 1. Even more unexpectedly, this trend is also confirmed for the subreddits with the same features born in the other months. After that, we computed the distribution of these subreddits against the number of comments. In Figure 20, we show it for the subreddits of January 2019; the same trend can be observed for all the other months. From the analysis of this figure, we note that this distribution follows a power law. Furthermore, most of these subreddits have no comments.

Distribution of the subreddits of January 2019 that died 1 day after they were born against the number of their posts.

Distribution of the subreddits of January 2019 that died 1 day after they were born against the number of their comments.
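The distributions discussed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used for the figures: it assumes that the birth and the death of a subreddit are the dates of its first and last post, and the function name `lifespan_post_distribution` and the toy data are hypothetical. Under this assumption, a lifespan of 1 day requires at least two posts, which would be consistent with the minimum of 2 noted above.

```python
from collections import Counter, defaultdict
from datetime import date


def lifespan_post_distribution(posts, lifespan_days):
    """Count subreddits by number of posts, for a given lifespan.

    posts: iterable of (subreddit, posting_date) pairs.
    Assumption: birth = date of first post, death = date of last post.
    Returns a Counter mapping number_of_posts -> number_of_subreddits.
    """
    dates = defaultdict(list)
    for subreddit, day in posts:
        dates[subreddit].append(day)

    distribution = Counter()
    for days in dates.values():
        if (max(days) - min(days)).days == lifespan_days:
            distribution[len(days)] += 1
    return distribution


# Hypothetical toy data, not taken from the data set of the article.
posts = [
    ("a", date(2019, 1, 1)),
    ("b", date(2019, 1, 1)), ("b", date(2019, 1, 1)),
    ("c", date(2019, 1, 1)), ("c", date(2019, 1, 2)),
    ("d", date(2019, 1, 3)),
]
same_day = lifespan_post_distribution(posts, 0)
next_day = lifespan_post_distribution(posts, 1)
```

Plotting such a distribution on log-log axes is a common first check for a power-law trend.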
Note that the two classes of subreddits above have a characterization that differentiates them from all the other classes of subreddits (for instance, the ones that survived for some months). They also have a few features distinguishing them from each other. However, their similarities far outnumber their differences. As a consequence, these two classes can be considered together as a ‘macro-category’ of stereotypes that we call ‘dead in crib’. At this point, building on what we found previously, we determined the following stereotypes characterising the ‘dead in crib’ subreddits (i.e. those subreddits that died at most 1 day after they were born):
User profile: it is associated with a user profile.
Unsuccessful subreddit: it initially stimulated several interactions. However, after a few hours, these interactions ceased and it quickly died.
Comment grabber: it had at least one post capable of stimulating a debate, even if minimal.
Private community: it requires an invitation to be accessed. It is often associated with a specific event of interest for a specific community.
Banned subreddit: it was banned probably because it was associated with a spammer.
Bot: it can be recognised because its posts are always similar and consist of links and comments with links.
In order to characterise these stereotypes, and all the others that we will consider in the following, we have defined three possible orthogonal taxonomies. These are based on:
The number of posts; we considered two possible classes, that is, few posts and many posts;
The number of comments; we considered two possible classes, that is, few comments and many comments;
The number of authors; we considered two possible classes, that is, few authors and many authors.
Taking these three taxonomies into consideration, the previous stereotypes can be classified, as shown in Tables 1 and 2.
Classification of stereotypes concerning the subreddits ‘dead in crib’– Few posts case.
Classification of stereotypes concerning the subreddits ‘dead in crib’– Many posts case.
Observe that a stereotype can often belong to both the classes of a taxonomy. This implies that it cannot be ‘categorized’ based on that taxonomy. For instance, comment grabber, in the presence of many comments and many authors, can be found with both few posts and many posts. This implies that this stereotype can be characterised only by the number of comments and the number of authors, but not by the number of posts. Analogously, in the presence of many posts, banned subreddit cannot be characterised by the number of comments or the number of authors. By contrast, in the presence of few posts, banned subreddit is characterised by few comments and few authors.
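The reasoning above can be illustrated with a minimal Python sketch. The cut-off values separating ‘few’ from ‘many’ are not stated in this section, so `post_cut`, `comment_cut` and `author_cut` below are purely hypothetical parameters, as are the names `SubredditStats`, `taxonomy_classes` and `characterising_axes`. The only data used is the comment grabber case described in the paragraph above.

```python
from dataclasses import dataclass

AXES = ("posts", "comments", "authors")


@dataclass(frozen=True)
class SubredditStats:
    posts: int
    comments: int
    authors: int


def taxonomy_classes(stats, post_cut=10, comment_cut=10, author_cut=5):
    """Map a subreddit to its (posts, comments, authors) classes.

    The three cut-offs are hypothetical; the article does not state them here.
    """
    def label(value, cut):
        return "many" if value >= cut else "few"

    return (
        label(stats.posts, post_cut),
        label(stats.comments, comment_cut),
        label(stats.authors, author_cut),
    )


def characterising_axes(cells):
    """Return the taxonomies on which all the cells of a stereotype agree.

    cells: set of (posts_class, comments_class, authors_class) triples in
    which the stereotype was observed. A taxonomy on which the stereotype
    takes both classes cannot characterise it.
    """
    return [
        axis
        for index, axis in enumerate(AXES)
        if len({cell[index] for cell in cells}) == 1
    ]


# Comment grabber: many comments and many authors, with few or many posts.
comment_grabber_cells = {("few", "many", "many"), ("many", "many", "many")}
example_classes = taxonomy_classes(SubredditStats(posts=3, comments=40, authors=12))
comment_grabber_axes = characterising_axes(comment_grabber_cells)
```

Here `comment_grabber_axes` contains only the number of comments and the number of authors, mirroring the observation that this stereotype is not characterised by the number of posts.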
After having investigated the stereotypes of the ‘dead in crib’ subreddits, we focused on the opposite category of subreddits, that is, those that survived for all the months covered by our data set. We collectively call them ‘survivors’ in the following. We applied the same reasoning and performed the same tasks as for the ‘dead in crib’ subreddits, and we obtained the following stereotypes:
User profile and Bot: these are the same ones we have seen for the subreddits ‘dead in crib’.
Cringe/NSFW subreddit: it contains strange or strong-content posts, submitted by only one user, or, alternatively, it is an NSFW subreddit.
Niche subreddit: its topics are niche ones, and it draws the attention of users interested in them.
Successful subreddit.
Big comment grabber: almost all the posts submitted in it stimulate a debate.
Utility subreddit: it is conceived to support a specific activity (think, for instance, of a subreddit where users ask for a translation).
Based on the three taxonomies defined above, the previous stereotypes can be classified, as shown in Tables 3 and 4.
Classification of stereotypes concerning the subreddits ‘survivors’– Few posts case.
NSFW: not safe for work.
Classification of stereotypes concerning the subreddits ‘survivors’– Many posts case.
NSFW: not safe for work.
After these analyses on the stereotypes belonging to the two extreme categories ‘dead in crib’ and ‘survivors’, we decided to apply the same reasoning and tasks to investigate a third category of stereotypes, intermediate between the two previous ones. Specifically, we focused on those subreddits that lived 5 months after their creation and then died. We call this category ‘undelivered promises’, and we obtained the following stereotypes for it:
User profile, niche subreddit, Bot, cringe/NSFW subreddit, private community and banned subreddit: these are the same ones we have seen for the previous categories.
Unsuccessful boomer: it was successful for a while, but died after a period of decline.
Unsuccessful zombie: it was born without praise or blame, managed to survive for a while in a grey way and, finally, died.
Based on the three taxonomies that we defined above, the previous stereotypes can be classified, as shown in Tables 5 and 6.
Classification of stereotypes concerning the subreddits ‘undelivered promises’– Few posts case.
NSFW: not safe for work.
Classification of stereotypes concerning the subreddits ‘undelivered promises’– Many posts case.
NSFW: not safe for work.
6. Stereotyping authors
In order to determine the possible author stereotypes, we proceeded in a way analogous to what we have done to define subreddit stereotypes. In fact, also for authors, we found three macro-categories of stereotypes, namely, ‘very positive’, ‘neutral’ and ‘very negative’ authors. To better understand the reasoning underlying these categories, we recall that, in Section 4.1, we have found that the number of positive posts is about 16 times the number of negative ones in Reddit. As a consequence, it is possible to use this result as a baseline for a preliminary author classification. Specifically, we considered an author as ‘very positive’ if the number of positive posts submitted by her is at least
Analogously to what we have done for subreddit stereotypes, we have defined two possible orthogonal taxonomies, namely,
The number of posts: the possible classes are few posts and many posts.
The number of comments: the possible classes are few comments and many comments.
Afterwards, we determined the following stereotypes characterising the ‘very positive’ authors, proceeding in a way analogous to the one we adopted for subreddit stereotypes:
Unsuccessful author: she submits posts but she is never capable of stimulating interactions with other authors.
Fame seeker: she submitted (and/or is still submitting) an impressive number of posts in order to reach fame in Reddit.
Cringe/NSFW author: she often submits cringe/NSFW posts.
FBG publisher (few but good publisher): she does not publish a very high number of posts; however, her posts are generally appreciated by other users.
Content creator: she creates and submits contents for people.
Successful author: she submits many posts that receive many positive comments and are appreciated by other users.
Reposter: she simply re-submits posts of other authors.
Based on the two taxonomies that we defined above, the previous stereotypes can be classified, as shown in Table 7.
Classification of the stereotypes concerning ‘very positive’ authors.
NSFW: not safe for work; FBG: few but good.
After the ‘very positive’ authors, we focused on the opposite macro-category of author stereotypes, that is, the ‘very negative’ ones. We obtained the following stereotypes, applying the same reasoning and performing the same tasks as for ‘very positive’ authors:
Unsuccessful author: this stereotype is the same as we have seen for ‘very positive’ authors.
Spammer: she is an author submitting a lot of spam posts evaluated negatively by other users.
Hatred sower: she is a user whose goal is attacking minority groups with hate posts or comments.
Instigator: she is an author using every opportunity to make herself known. For her, it is not important how she is judged, but the fact that one speaks of her.
Based on the two taxonomies defined above, the previous stereotypes can be classified, as shown in Table 8.
Classification of the stereotypes concerning ‘very negative’ authors.
After having analysed the stereotypes belonging to the two extreme categories, that is, ‘very positive’ and ‘very negative’ authors, we decided to investigate ‘neutral’ authors as representative of a third macro-category, intermediate between the two previous ones. We obtained the following stereotypes, applying the same reasoning and performing the same tasks as for the other two macro-categories:
Unsuccessful author and fame seeker: these stereotypes are the same ones we have seen for the previous macro-categories.
PP author (private purpose author): she often creates subreddits for private purposes, for instance to talk about specific topics of interest for a particular community. Often, her subreddits require an invitation to be accessed.
Bot: it is a bot; it can be recognised because it always submits similar posts consisting of links and comments with links.
Moody author: she creates subreddits and submits posts whose topics, expressed positions and evaluations apparently swing without a logic.
Comment grabber: she occasionally submits posts capable of stimulating a debate, even if minimal.
Big comment grabber: almost all the posts submitted by her stimulate a debate.
Based on the two taxonomies defined above for authors, the previous stereotypes can be classified, as shown in Table 9.
Classification of the stereotypes concerning ‘neutral’ authors.
7. Analysing author assortativity
In the past, assortativity has been largely analysed in several social media [25]. In this section, we aim at checking if a form of assortativity exists in Reddit; in particular, we focus on co-posters, that is, authors submitting posts on the same subreddit.
In order to perform our analyses, we define a support network
Here,
The number of nodes of
First of all, we computed the degree centrality of the nodes of

Distribution of degree centrality for the nodes of
We sorted the corresponding authors in a descending order, based on their degree centrality, to verify the possible presence of a degree assortativity in Reddit. Then, we divided the sorted list into intervals of authors. In particular, we considered equi-width intervals
First of all, we considered the first interval

(a) Number of authors of
In order to prove the statistical significance of our results, we generated a null model to compare our findings with the ones obtained in an unbiasedly random scenario. Specifically, we built our null model shuffling the arcs of

(a) Number of authors of
However, this is not sufficient to conclude that there is a degree assortativity for authors in Reddit. In fact, we must check if this trend is also confirmed for the authors with an intermediate degree centrality and for those with a low degree centrality.
Clearly, for an exhaustive analysis, we should repeat the tasks we have previously done for
Figure 24(a) reports the number of authors of

(a) Number of authors of
Also, in this case, we compared these findings with the ones obtained in the null model. These last ones are reported in Figure 25. Looking at these results and the ones represented in Figure 24, we can conclude that, again, the behaviour observed in these last figures is not random but it is a property of Reddit.

(a) Number of authors of
Finally, Figure 26(a) reports the number of authors of

(a) Number of authors of

(a) Number of authors of
Having verified that there exists a sort of backbone among the authors with a high (intermediate and low, respectively) degree centrality, we can conclude that actually Reddit is assortative with respect to degree centrality, as far as the co-posting relationship is concerned.
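As a complement to the interval-based analysis described above, the co-posting relationship and a standard summary measure of degree assortativity can be sketched as follows. This is not the procedure followed in this section: it swaps in Newman's degree assortativity coefficient (the Pearson correlation between the degrees found at the two ends of each edge), and it assumes that two authors are linked when they posted on the same subreddit at least once. The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_poster_edges(posts):
    """Build undirected co-poster edges from (author, subreddit) pairs.

    Assumption: two authors are linked if they posted on the same subreddit.
    """
    authors_by_subreddit = defaultdict(set)
    for author, subreddit in posts:
        authors_by_subreddit[subreddit].add(author)

    edges = set()
    for authors in authors_by_subreddit.values():
        for u, v in combinations(sorted(authors), 2):
            edges.add((u, v))
    return edges


def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Each undirected edge is counted in both directions for symmetry.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so one mean suffices
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var else float("nan")


# Hypothetical toy data: three co-posters on one subreddit, two on another.
posts = [
    ("ann", "r/x"), ("bob", "r/x"), ("cat", "r/x"),
    ("dan", "r/y"), ("eve", "r/y"),
]
r_toy = degree_assortativity(co_poster_edges(posts))

# A star (one hub linked to three leaves) is a classic disassortative case.
r_star = degree_assortativity({("hub", "a"), ("hub", "b"), ("hub", "c")})
```

A coefficient close to 1 indicates that authors tend to be linked to authors of similar degree, while a value close to -1 indicates the opposite; comparing the value against the one obtained on a shuffled network plays the same role as the null model discussed above.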
This important result can be explained by considering the concept of karma and the posting rules in Reddit. Indeed, in this platform, each user is associated with a karma, which is a score taking her past ‘reputation’ into account. In general, users with high karma are very active and often submit many appreciated posts. As a consequence, it is presumable that they have a high degree centrality. In other words, a direct correlation between karma and degree centrality can be recognised for authors. Now, the posting rules of Reddit state that each subreddit is associated with a minimum threshold of karma [48–50], so that only the authors with a karma higher than this threshold can submit a post on it. This threshold is dynamic and changes over time. Clearly, when it is low, all the authors can submit their posts on the subreddit. When it grows, the authors with a low karma (and, presumably, with a low degree centrality) cannot submit posts on it. Finally, when it becomes high, only the authors with a high karma (and, presumably, a high degree centrality) can submit posts on it. This way of proceeding tends to segment users into groups having homogeneous degree centralities.
Having verified the assortativity of Reddit with respect to degree centrality, it is natural to wonder whether this property depends on the type of centrality or is intrinsic to this social platform. As a premise to this investigation, it is worth underlining that each form of assortativity is a story in itself. Therefore, it is impossible to define a general rule. Nevertheless, it is possible to verify if a trend exists, and we have operated in this direction.
To this end, we have chosen a second form of centrality (i.e. the eigenvector centrality) and we have repeated for it all the steps previously seen for degree centrality. The results obtained are shown in Figures 28–30.

(a) Number of authors of

(a) Number of authors of

(a) Number of authors of
They confirm that there is an assortativity among the authors of Reddit also with respect to the eigenvector centrality. As a consequence, we can conclude that the assortativity of Reddit authors is not limited to degree centrality but represents a trend characterising this social platform beyond the form of centrality taken into consideration.
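For readers less familiar with eigenvector centrality, the following Python sketch shows one common way of computing it, namely power iteration. It is only an illustration under stated assumptions: the network is undirected and connected, it is given as a dictionary mapping each node to the set of its neighbours, and the function name and the toy graph are hypothetical. It is not claimed to be the implementation used for the figures above.

```python
def eigenvector_centrality(adjacency, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    adjacency: dict mapping each node to the set of its neighbours
    (undirected, assumed connected).
    Returns a dict node -> score, normalised to unit Euclidean length.
    """
    nodes = list(adjacency)
    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Iterating on A + I (the node's own score is added) keeps the same
        # eigenvectors as A and avoids oscillations on bipartite graphs.
        updated = {
            node: score[node] + sum(score[nb] for nb in adjacency[node])
            for node in nodes
        }
        norm = sum(value * value for value in updated.values()) ** 0.5
        updated = {node: value / norm for node, value in updated.items()}
        if max(abs(updated[node] - score[node]) for node in nodes) < tol:
            return updated
        score = updated
    return score


# Hypothetical toy graph: a path a - b - c, where b is the most central node.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
centrality = eigenvector_centrality(path)
```

Unlike degree centrality, this measure rewards a node for being linked to nodes that are themselves central, which is why the two forms of centrality can rank authors differently.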
8. Discussion
In this section, we examine the results on subreddit stereotypes in order to identify their correlations and build an overview of the knowledge on Reddit extracted in this article.
First of all, we observe that, although in principle subreddit stereotypes and author stereotypes are two orthogonal concepts, in practice, there are strong correlations between them. In fact, certain subreddit stereotypes are the ideal and perfectly tailored places for certain user stereotypes, and vice versa.
Let us now examine these correlations more closely. In the following of this section, for more clarity and to avoid heavy speech, we use the
Clearly, there are very strong and direct correlations between
There is at least a partial relationship between
A less obvious, but extremely interesting correlation exists between
Again,
Finally, there is a quite evident correlation between
After having examined the correlation between subreddit stereotypes and author stereotypes, we continue our discussion by examining the correlations between the results obtained for author stereotypes and those concerning assortativity. In Section 7, we found that there is a degree (eigenvector, respectively) assortativity between Reddit authors. This implies that authors with similar degree (eigenvector, respectively) centrality tend to form a backbone. Keeping in mind the definition and properties of these two forms of centrality, it is possible to make some interesting deductions.
The first one is that fame seekers, who generally have a high degree centrality, tend to form a backbone and, therefore, to support each other. An analogous reasoning can be imagined for successful authors and reposters, who are also characterised by a very high degree centrality. Continuing in this direction, even many authors characterised by negative stereotypes tend to support each other; in particular, this happens for spammers, hatred sowers and instigators. In these cases, a post published by one of them tends to provoke the reaction of the others, giving rise to very long discussions that often involve a huge number of people. A similar situation, even if with a neutral and not negative connotation, can concern the big comment grabbers. Even these authors tend to form communities in which large discussions take place; however, unlike the previous cases, these discussions are not necessarily harmful.
As far as eigenvector centrality is concerned, in addition to all the communities mentioned above, the presence of backbones between FBG publishers or content creators appears possible. In fact, these authors, who tend to use Reddit as a utility tool, may be strongly attracted by subreddits created by authors with the same intentions and, therefore, may tend to form communities. It is interesting to highlight that these types of figures (a sort of ‘grey cardinals’) are the classical ones having a high eigenvector centrality and, as far as we are concerned, a high eigenvector assortativity.
A final discussion concerns the results on assortativity described in this article and the ones on assortativity in social networks described in the past literature. As previously pointed out, Newman’s seminal work showed that social networks are generally assortative, unlike other types of networks, such as technological and biological ones, which are disassortative [23].
Next, Ahn et al. [27] demonstrated that (a) Cyworld is slightly disassortative with respect to degree centrality on a network built taking users and their friendships into account, while it is strongly assortative with respect to degree centrality on a network built considering users and the ‘testimonial’ relationships (a kind of relationship specific of this social network) existing between them; (b) Orkut is assortative with respect to degree centrality on a network built starting from users and their friendships; and (c) MySpace is neutral (that is neither assortative nor disassortative) with respect to degree centrality on a network that takes users and their friendships into account.
Buccafurri et al. [26] showed that Twitter is strongly assortative with respect to degree centrality on a network that takes the sharing of interest among users into account. Furthermore, Buccafurri et al. [25] studied assortativity in Facebook and showed that such a social network is assortative with respect to the tendency of a bridge (i.e. a user joining more social networks) to communicate with other bridges.
Finally, in Hamilton et al. [41], the authors considered Reddit and investigated the concept of assortativity, but for a very particular aspect, that is, loyal communities. In particular, they showed that loyal communities are not assortative with respect to the activity level of the users belonging to them, while assortativity exists in the case of unloyal communities. The lack of assortativity in loyal communities implies that users belonging to them are willing to communicate with all the other users of the same community, regardless of the corresponding activity level. By contrast, the presence of assortativity in unloyal communities implies that the corresponding users tend to partition themselves into subgroups based on their activity level. Indeed, a user with a certain activity level tends to communicate only with users having similar activity levels.
As said before, our article aims to contribute to the study of assortativity in social networks. First, besides degree centrality, it also considers eigenvector centrality. Furthermore, it focuses on the study of assortativity in Reddit, a social platform that was not analysed in the past as far as this feature is concerned, except for the investigations described in Hamilton et al. [41]. However, in this last article, the main topic of the authors’ investigation was not assortativity but loyalty, while assortativity simply served as a feature to assess whether loyal and unloyal communities could be partitioned into smaller groups. Therefore, compared with the general studies on assortativity presented in the literature [25–27], the analysis of Hamilton et al. [41] can be considered a niche one. As a proof of this, we can observe that, contrary to all studies on assortativity proposed in the past, in Hamilton et al. [41], the presence of assortativity among the nodes of a network is seen as a negative factor (leading highly active users to disregard less active and new ones), rather than a positive feature.
Compared with Hamilton et al. [41], our article aims at bringing the study of assortativity in Reddit into the general mainstream of the study of assortativity in social networks, analysing this feature by itself, independently of other features, such as loyalty. As a matter of fact, the results we found are in line with, and even strengthen, the trends on assortativity in social networks hypothesised by Newman and later found by most of the other authors.
9. Possible applications of stereotypes
This section presents some possible applications of the stereotypes previously investigated. It consists of two subsections. The first explains how subreddit stereotypes could be employed to make a subreddit successful. The second highlights how particular types of author stereotypes could be used to improve the content quality of subreddits.
9.1. Subreddit stereotypes
In Section 5, we defined several subreddit stereotypes belonging to three macro-categories, namely, ‘dead in crib’, ‘survivors’ and ‘undelivered promises’. A first application of this research can be the definition of some guidelines to follow in order to make a subreddit successful. Indeed, knowing how a subreddit became successful (unsuccessful, respectively) can lead to the characterization of ‘positive’ (‘negative’, respectively) actions that can influence the ‘lifespan’ of a new subreddit. For instance, consider the subreddit /r/meme. It started during 2008 and, at the time of writing, it has about 806,000 users. Certainly, it represents an example of a successful subreddit. Here, the authors post high-quality and engaging contents. This kind of behaviour could be registered as a ‘best practice’ in the guidelines. By contrast, a subreddit containing only a few contents from a few authors is an example of an unsuccessful subreddit. This failure could be caused by a lack of engaging contents posted in it. Clearly, what is said above provides just an idea of what these guidelines could contain.
Another possible application of subreddit stereotypes could regard the definition and realisation of recommender systems for Reddit. These systems would aim at recommending to a user subreddits with the same stereotype (or the same content) as the ones characterising the subreddits accessed by her in the past. In any case, the recommender system should avoid ‘dead in crib’ subreddits or, more generally, unsuccessful ones. Conversely, the same system should suggest to a user successful subreddits, subreddits currently expanding their community and/or subreddits characterised by contents in line with her profile.
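A minimal sketch of such a recommender is given below, purely as an illustration of the idea. It assumes that each subreddit has already been labelled with a macro-category and a stereotype; the catalogue, the subreddit names and the function `recommend` are hypothetical, and a real system would also need content and growth signals that are not modelled here.

```python
from collections import Counter


def recommend(history, catalogue, limit=5):
    """Suggest unseen subreddits sharing a stereotype with the user's history.

    history: set of subreddit names accessed by the user in the past.
    catalogue: dict name -> (macro_category, stereotype).
    'dead in crib' subreddits are never recommended.
    """
    liked = Counter(
        catalogue[name][1] for name in history if name in catalogue
    )
    candidates = [
        (liked[stereotype], name)
        for name, (macro_category, stereotype) in catalogue.items()
        if name not in history
        and macro_category != "dead in crib"
        and liked[stereotype] > 0
    ]
    # Most frequently matched stereotype first, then alphabetical order.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in candidates[:limit]]


# Hypothetical labelled catalogue.
catalogue = {
    "/r/a": ("survivors", "niche subreddit"),
    "/r/b": ("survivors", "niche subreddit"),
    "/r/c": ("dead in crib", "bot"),
    "/r/d": ("survivors", "utility subreddit"),
    "/r/f": ("survivors", "bot"),
}
suggestions = recommend({"/r/a", "/r/f"}, catalogue)
```

In this toy run, only `/r/b` is suggested: `/r/c` matches a stereotype in the history but is ‘dead in crib’, while `/r/d` has no stereotype in common with the subreddits already accessed.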
A further example of possible usage of subreddit stereotypes could be the definition of an algorithm that finds subreddits to merge or, at least, to integrate. For instance, consider two zombie subreddits with related topics, where authors are posting contents that were not able to attract other users. These two subreddits are surviving, but their interactions with users are so low that they can actually be considered dead. If they were merged or integrated into a single subreddit, they could have more chances of becoming successful. Joining together two, or even more, subreddits having the same (or related) topics/characteristics brings more visibility and more contents to them. These contents would be, otherwise, dispersed in different unsuccessful subreddits. Even if the new integrated subreddit is made up of past zombies, it could become so successful as to attract authors and co-posters from other communities.
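One possible starting point for such an algorithm is sketched below. It is only a hypothetical illustration: topic relatedness is approximated by the Jaccard similarity between sets of topic keywords, the threshold `min_similarity` is arbitrary, and the names and data are invented for the example.

```python
from itertools import combinations


def jaccard(first, second):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def merge_candidates(subreddits, min_similarity=0.5):
    """Find pairs of zombie subreddits with related topics.

    subreddits: dict name -> (stereotype, set_of_topic_keywords).
    Returns (name_a, name_b, similarity) triples, most similar first.
    """
    zombies = {
        name: keywords
        for name, (stereotype, keywords) in subreddits.items()
        if stereotype == "unsuccessful zombie"
    }
    pairs = []
    for a, b in combinations(sorted(zombies), 2):
        similarity = jaccard(zombies[a], zombies[b])
        if similarity >= min_similarity:
            pairs.append((a, b, similarity))
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical labelled subreddits.
subreddits = {
    "/r/w": ("successful subreddit", {"chess", "openings"}),
    "/r/x": ("unsuccessful zombie", {"chess", "openings"}),
    "/r/y": ("unsuccessful zombie", {"chess", "openings", "endgames"}),
    "/r/z": ("unsuccessful zombie", {"gardening"}),
}
candidates = merge_candidates(subreddits)
```

Only `/r/x` and `/r/y` are proposed for merging here, since they are both zombies and share two of their three keywords.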
9.2. Author stereotypes
In Section 6, we defined some possible author stereotypes. Some of them are strictly related to the homonymous or corresponding subreddit stereotypes. Others, instead, are intrinsic to human behaviour and, in particular, to the concept of author. For example, consider ‘fame seekers’ and ‘content creators’. These users could be the target of a proposal for an advertising campaign aiming at promoting them. Take, for instance, a painter or a digital artist who has been classified as a ‘fame seeker’. An advertising company could probably persuade her to engage it to promote her image.
Another possible usage of author stereotypes is the definition and implementation of different categories of recommender systems. A first category could help in bootstrapping a subreddit. Consider, for instance, a newborn subreddit where authors post comic strips created by them. Knowing successful authors of comic strips and being able to convince them to become ‘content creators’ in the new subreddit could help this last one to get visibility. Complementary to this case, a second category of recommender systems could be used for talent scouting. In this case, a ‘fame seeker’, who is also a creator of comic strips, could be recommended to successful subreddits if her contents are high-quality ones.
The last application we present in this overview is the definition of an algorithm that builds blacklists of users based on author stereotypes. As an example, we can define a ‘dangerousness level’ of an author for one subreddit, a set of subreddits or all subreddits. For instance, in such a scenario, ‘hatred sowers’ can be automatically banned from subreddits attended by sensitive people. This way of proceeding could certainly maintain the discussion in these subreddits clean, thus avoiding their visitors being harassed by fake news and cyberbullying.
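The idea of a ‘dangerousness level’ can be illustrated with the short Python sketch below. The numeric levels assigned to each stereotype, the per-subreddit `tolerance` and the function name `blacklist` are all hypothetical choices made for the example; the article does not prescribe any of them.

```python
# Hypothetical dangerousness levels; stereotypes not listed count as 0.
DANGEROUSNESS = {
    "hatred sower": 3,
    "spammer": 2,
    "instigator": 1,
}


def blacklist(authors, tolerance):
    """Return the authors whose dangerousness exceeds a subreddit's tolerance.

    authors: dict author_name -> stereotype.
    tolerance: highest dangerousness level the subreddit accepts.
    """
    return {
        author
        for author, stereotype in authors.items()
        if DANGEROUSNESS.get(stereotype, 0) > tolerance
    }


# Hypothetical authors with their stereotypes.
authors = {
    "u1": "hatred sower",
    "u2": "spammer",
    "u3": "content creator",
    "u4": "instigator",
}
lenient = blacklist(authors, tolerance=2)  # bans only hatred sowers
strict = blacklist(authors, tolerance=0)   # bans every negative stereotype
```

A subreddit attended by sensitive people could adopt the strict setting, while a general-purpose one might prefer the lenient one.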
10. Conclusion
In this article, we have presented an investigation on Reddit, whose aim was analysing three aspects of this social platform that are interesting for both theory and practice. First, we have examined related literature and we have described the data set used for our investigation. Then, we have illustrated some preliminary analyses that allowed us to gather some (partially expected) information, useful to correctly carry out the following activities and interpret the corresponding results.
The first piece of knowledge extracted in our investigation concerns subreddit stereotypes. We have explained the procedure that we followed to determine them, and we have defined three macro-categories and, for each of them, a certain number of stereotypes. Finally, we have proposed three orthogonal taxonomies and we have classified the detected stereotypes according to them. We have proceeded in the same way in performing the second main task of our investigation, namely, the definition and classification of author stereotypes. Afterwards, we have focused on a more theoretical issue. In fact, analogously to what has been carried out for other social platforms, we have verified if Reddit is assortative, and in which way. We have found that a degree and an eigenvector assortativity exist in Reddit and that they involve co-posters. Finally, we have presented several applications that could benefit from subreddit and author stereotypes.
In the future, we plan to develop our research on Reddit along several directions. First of all, we would like to carry out a deep investigation on NSFW subreddits. In fact, although they are very numerous, few analyses on them have been performed in the past literature. Furthermore, in Section 9.1, we have seen that the merge, or at least the integration, of related subreddits could be extremely beneficial. Therefore, we plan to define an approach that finds possible subreddits to merge or to integrate and, then, suggests the tasks necessary to carry out this activity. Finally, we would like to define an approach to find duplicate accounts, that is, two or more Reddit accounts belonging to the same person. We would like to understand the main motivations leading a user to adopt multiple accounts and verify if she has different behaviours in different accounts.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was partially supported by (a) the Italian Ministry for Economic Development (MISE) under the project ‘Smarter Solutions in the Big Data World’, funded within the call ‘HORIZON2020’ PON I&C 2014–2020 (CUP B28I17000250008), and (b) the Department of Information Engineering at the Polytechnic University of Marche under the project ‘A network-based approach to uniformly extract knowledge and support decision making in heterogeneous application contexts’ (RSAB 2018).
