Community recommendation for text post in social media: A case study on Reddit

Abstract

Reddit is a popular social media website where users submit content, such as direct links and text posts, to forums called subreddits. On average, about 500 new subreddits are created per day. Because the number of subreddits is vast and growing, users cannot realistically discover and familiarize themselves with all existing communities before submitting a post. In this paper, we propose new feature sets for online communities: the ratio of text posts, the average length of text in a post, and domain-specific features. A community recommendation framework is designed and evaluated on a Reddit dataset. The framework identifies and collects textual communities by finding their representatives with a clustering algorithm, namely DBSCAN; a logistic regression algorithm is then applied to recommend a list of communities with high content similarity to a given post. Comprehensive experimental evaluations on the Reddit dataset show that the proposed framework achieves a precision of 90%.

Keywords

Online community recommendation, social media, Reddit, DBSCAN, logistic regression

1. Introduction

Reddit is a social website that encourages users to share a variety of content, including text, news articles, pictures and video clips. Content sharing is organized into subreddits, or communities, that are freely created by members. Each subreddit is monitored by its moderators or creator, who have the right to configure and customize the community and to set additional rules for it. Subreddits differ in both the breadth and the depth of the topics they cover. Registered users can subscribe to subreddits that align with their interests to join them and share content. As of 2017, Reddit has 542 million monthly visitors (234 million unique users), ranking as the #5 most visited website in the U.S. and #8 in the world [5].

Reddit supports two types of submission: link posts and text posts. A link post requires a title and a URL that links to an external website where users can view the content. A text post contains a discussion, comment or question to which the community responds by sharing ideas. The motivations behind these two types of post are therefore different. Users who submit text posts usually expect feedback that fulfills their needs, which means that the right subreddit should meet their expectations. Unfortunately, there is no automatic tool that can recommend appropriate subreddits to the user.

Moreover, because the number of existing subreddits is continually growing, it is impossible for users to explore and become familiar with all of them. Therefore, we propose a community recommendation framework focusing on text posts, using data from Reddit as a case study. Our framework recommends a ranked set of communities, from which users can review and choose the subreddit they prefer to post in. A probabilistic classifier, namely logistic regression, is employed to obtain the probabilities of all textual communities for a given post based on content similarity. The subreddits with high probability values are then proposed to the user as a recommendation list. The remainder of this paper is organized as follows:

• Textual community identification procedures, including feature extraction, representative finding and collection processes, are described in Section 3.2.
• The community recommendation framework for text posts on Reddit, based on content similarity determined by logistic regression, is presented in Section 4.
• Experimental results for the domain-specific features associated with Reddit posts, in terms of classification performance, are shown in Section 4.

2. Literature review

Reddit has become a growing source of information for social media data mining tasks in recent years. One reason may be the massive, diverse and acceptable-quality content on the website. Weninger et al. [23] conducted an in-depth study of topical cohesiveness and other aspects of discussion on the website. One of their findings is that the discussed content follows a topical hierarchy rather than being chaotic. They suggested that the data can be used for mining-related tasks. Choi et al. [8] also analyzed trends and characteristics of both content and users in an extensive data set of discussion posts from the website. They found clear evidence of social structure and of conversation driven by topic coherence, and hoped that their work would assist others who wish to better understand discussions on social media websites. These findings led to a surge of studies that take advantage of this massive and diverse source of information, e.g., [14, 21, 22, 9].

Few studies on recommendation using Reddit data sets have been found in the literature. Nguyen et al. [17] proposed a framework that recommends Reddit posts fitting a user’s interests. The framework used the user’s Twitter data to create a user profile based on pre-defined topics. An ensemble classification method based on Naïve Bayes was applied and reached 58% precision. They continued their experiments on user profile creation using the WordNet database in [18]. They applied natural language processing tools and machine learning algorithms to develop a recommendation system for subreddit articles based on interest profiles derived from users’ tweets, and introduced a simple genre classifier based on a similarity measure derived from the WordNet ontology. They found that the WordNet approach has poor precision when the number of tweets is small.

An association rule mining algorithm was applied by [13] to find relations among subreddits. They proposed a web-based application named Recommenddit that allows users to query subreddits associated with a given subreddit. However, their work focused only on large subreddits.

There has been some research on community recommendation in other online domains, such as the IBM Connections dataset. Pal et al. [19] proposed a framework for selecting a set of communities that are most likely to provide the best answer to a given question. Their framework measures various similarities (e.g. tf-idf and parts of speech) between the question obtained from users and the existing communities. The k-nearest neighbor algorithm is then applied to rank the communities. Evaluated on the IBM Connections dataset, the framework reached a precision@5 of 65%.

A collection of tweets related to the “presidential campaigns” of Barack Obama and Mitt Romney, posted from March 1$^{\text{st}}$, 2012 to May 31$^{\text{st}}$, 2012, was crawled and investigated by [16, 15]. The proposed generative topic model captured both types of interests as topics in a parameter universe, with a mechanism that identified the association of interests with either a given user or a given community. The model used the communities derived from the social links of users to avoid the expensive computation of combining the community discovery process with the topic modeling process. The model can recommend topic-related influential users and topic-cohesive interactive communities for a given user’s profile. Its performance, measured in terms of precision@5, reached 35%.

Figure 1.

A subreddit page, namely “PoliticalDiscussion” community, displaying a portion of posts within the community.

Figure 2.

A text post in the “PoliticalDiscussion” community. The community name, title, body and voting score are annotated.

3. Dataset

Our proposed framework explores the top 4,271 communities ranked by number of members, as reported by redditlist.com on October 16$^{\text{th}}$, 2016 [3]. The data were collected by a Python script that connects to the Reddit API [2] using the Python Reddit API Wrapper (PRAW) library [1]. For each community (see Fig. 1), we collect up to 1,000 top-voted posts, resulting in 4,044,142 posts.

A post consists of a community name, title, body, voting score, author name and time of creation (see Fig. 2). The title may contain tags specified by the community’s moderators. The body contains the text of a text post and is blank for a link post. The voting score is calculated from the total vote count. A post normally also contains hidden fields, including a unique identifier and a type indicator (see Table 1).

Table 1
A sample of the collected posts with some essential fields. The range of scores varies depending on the number of users in the community

Community name	Title	Score	…	Type
FloridaMan	Florida Man arrives by ca…	796	…	Link
Knoxville	Yay! We did it again, Kno…	22	…	Link
CafeRacers	Its finally done!!!!…	110	…	Link
Art	Patrick Stewart and Ian M…	3155	…	Link
evenwithcontext	“From our understanding o…	171	…	Link
nasa	Our flying saucer, a Mars…	88	…	Link
entertainment	HE’S BACK! Jon Stewart Re…	1495	…	Link
truegaming	I feel guilty when playin…	345	…	Text
gamingsuggestions	Should I try Portal?…	25	…	Text
mercedes_benz	Fitting vanity plate?…	76	…	Link

Figure 3.

Textual community identification processes.

3.1 Preprocessing

Given a text post (see Fig. 2), the title and the body are concatenated and converted to lower case. Preprocessing also removes Markdown tags and URLs. A bag-of-words representation is then created for the subsequent processes.
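As an illustration, the preprocessing step could be sketched as follows. The regular expressions and the function name `preprocess` are assumptions made for this example; the exact Markdown and URL rules are not specified in the text.

```python
import re
from collections import Counter

# Hypothetical patterns: the exact removal rules are not given in the paper.
MD_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")  # [label](target)
URL = re.compile(r"https?://\S+|www\.\S+")       # bare URLs
MD_MARKUP = re.compile(r"[*_~`#>]+")             # emphasis, heading and quote markers
TOKEN = re.compile(r"[a-z0-9']+")


def preprocess(title: str, body: str) -> Counter:
    """Concatenate title and body, lowercase, strip Markdown and URLs, count words."""
    text = f"{title} {body}".lower()
    text = MD_LINK.sub(r"\1", text)  # keep the link label, drop the target
    text = URL.sub(" ", text)
    text = MD_MARKUP.sub(" ", text)
    return Counter(TOKEN.findall(text))


bag = preprocess(
    "Should I try Portal?",
    "It is **great**, see [the wiki](https://example.com/portal).",
)
```

The resulting `Counter` maps each remaining word to its count and serves as the bag-of-words representation of the post.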

3.2 Textual community identification

Since our research focuses on textual community recommendation, a set of textual communities needs to be identified. We propose an approach that analyzes the data set automatically, without the need for domain experts. It consists of three main steps: feature extraction, textual representative finding, and textual community collection (see Fig. 3).

3.2.1 Community feature extraction

Since Reddit allows users to freely post text or links in the communities, we need to identify the communities that mainly concentrate on text posts (see Algorithm 3.2.1).

Algorithm: Community feature extraction.
procedure ExtractCommunityFeatures(Dataset $D$)
    $V\leftarrow$ Set of community feature vectors, initialized with an empty set $\{\}$
    $C\leftarrow$ Set of communities in the dataset $D$
    for each community $c_{i}$ in $C$ do
        $Np_{i}\leftarrow$ Number of posts in community $c_{i}$
        $\textit{Ntp}_{i}\leftarrow$ Number of text posts in community $c_{i}$
        $x_{i}\leftarrow\textit{Ntp}_{i}/Np_{i}$    ▷ Ratio of text posts in community $c_{i}$
        $P_{i}\leftarrow$ Set of posts in community $c_{i}$
        $L_{i}\leftarrow$ Total length of text in community $c_{i}$, initialized with 0
        for each post $p_{j}$ in $P_{i}$ do
            $l_{j}\leftarrow$ Length of text in post $p_{j}$
            $L_{i}\leftarrow L_{i}+l_{j}$
        end for
        $y_{i}\leftarrow L_{i}/Np_{i}$    ▷ Average length of text in community $c_{i}$
        $v_{i}\leftarrow(x_{i},y_{i})$    ▷ The feature vector of community $c_{i}$
        $V\leftarrow V\cup\{v_{i}\}$
    end for
    for each average length of text $y_{i}$ in $V$ do
        $y_{i}\leftarrow\log_{10}y_{i}$    ▷ Normalized average length of text in community $c_{i}$
    end for
    $y_{\textit{min}}\leftarrow$ Minimum normalized average length of text $y_{i}$ in $V$
    $y_{\textit{max}}\leftarrow$ Maximum normalized average length of text $y_{i}$ in $V$
    for each normalized average length of text $y_{i}$ in $V$ do
        $y_{i}\leftarrow(y_{i}-y_{\textit{min}})/(y_{\textit{max}}-y_{\textit{min}})$    ▷ Scaled average length of text in community $c_{i}$
    end for
    return $V$
end procedure

We propose two community features: the ratio of text posts and the average length of text in a post. The ratio of text posts is the number of text posts divided by the number of all posts in the subreddit, where $n$ is the number of posts in the community (see Eq. (1)).

$\displaystyle\textit{ratio}=\frac{1}{n}\sum_{i=1}^{n}t_{i},\qquad t_{i}=\begin{cases}1&\text{if post $i$ is text}\\ 0&\text{if post $i$ is link}\end{cases}$ (1)

Figure 4.

Normalized community features obtained from community feature extraction algorithm.

Figure 5.

Clustered community features separated by colors. The two largest clusters are circled and annotated: the larger one has lower values of normalized average length of text and ratio of text posts, and the smaller one has higher values of both.

Figure 6.

Two largest clusters, the black cluster with lower feature values represents non-textual communities and the white cluster represents the textual communities.

The average length of text is obtained by averaging the number of characters in the title and the body of the posts in the community, where $n$ is the number of posts in the community (see Eq. (2)). This value is normalized using a log function (see Eq. (3)) and scaled between zero and one (see Eq. (4)), where the minimum and maximum are taken over all communities. Algorithm 3.2.1 describes the feature extraction process in detail.

$\displaystyle\textit{length}_{\textit{average}}=\frac{\sum_{i=1}^{n}\text{length of post $i$}}{n}$ (2)

$\displaystyle\textit{length}_{\textit{normalized}}=\log_{10}\textit{length}_{\textit{average}}$ (3)

$\displaystyle\textit{length}_{\textit{scaled}}=\frac{\textit{length}_{\textit{normalized}}-\min\textit{length}_{\textit{normalized}}}{\max\textit{length}_{\textit{normalized}}-\min\textit{length}_{\textit{normalized}}}$ (4)

The result of the feature extraction algorithm is visualized in Fig. 4. We found that the communities mix the two post types to different degrees. The left-most and right-most plotting areas are two high-density regions that represent low and high values, respectively, of text length and text post proportion.
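The following minimal sketch shows one way Eqs. (1) to (4) could be computed. The input layout (tuples of community name, text-post flag and character length) and the function name are assumptions for illustration, and every community is assumed to have a positive average length so that the logarithm is defined.

```python
import math
from collections import defaultdict


def extract_community_features(posts):
    """posts: iterable of (community, is_text, text_length) tuples (hypothetical layout).

    Returns {community: (ratio_of_text_posts, scaled_log10_average_length)}.
    """
    n_posts = defaultdict(int)
    n_text = defaultdict(int)
    total_len = defaultdict(int)
    for community, is_text, length in posts:
        n_posts[community] += 1
        n_text[community] += 1 if is_text else 0
        total_len[community] += length

    # Eq. (1): share of text posts per community.
    ratio = {c: n_text[c] / n_posts[c] for c in n_posts}
    # Eqs. (2)-(3): log10 of the average length per community.
    log_len = {c: math.log10(total_len[c] / n_posts[c]) for c in n_posts}
    # Eq. (4): min-max scaling over all communities.
    lo, hi = min(log_len.values()), max(log_len.values())
    span = (hi - lo) or 1.0  # assumption: guard against identical lengths
    return {c: (ratio[c], (log_len[c] - lo) / span) for c in n_posts}


features = extract_community_features([
    ("a", True, 1000), ("a", True, 1000),
    ("b", False, 10), ("b", True, 10),
    ("c", False, 100),
])
```

In this toy input, community `a` has only long text posts and ends up at the upper end of both features, while `b` has the shortest average length and is scaled to zero.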

Algorithm: Textual community finding.
procedure FindTextualRepresentative(Community feature vectors $V$)
    $C\leftarrow$ Set of clusters obtained from DBSCAN($V$)
    $C^{\prime}\leftarrow$ List of clusters $C$ sorted by the number of instances in descending order
    $c_{1}\leftarrow$ The first cluster in $C^{\prime}$
    $c_{2}\leftarrow$ The second cluster in $C^{\prime}$
    $x_{1}\leftarrow$ The average ratio of text in cluster $c_{1}$
    $x_{2}\leftarrow$ The average ratio of text in cluster $c_{2}$
    $y_{1}\leftarrow$ The average scaled length of text in cluster $c_{1}$
    $y_{2}\leftarrow$ The average scaled length of text in cluster $c_{2}$
    if $x_{1}>x_{2}$ and $y_{1}>y_{2}$ then
        $c_{\textit{text}}\leftarrow c_{1}$
    else if $x_{2}>x_{1}$ and $y_{2}>y_{1}$ then
        $c_{\textit{text}}\leftarrow c_{2}$
    end if
    return $c_{\textit{text}}$    ▷ Textual community representative
end procedure

3.2.2 Textual representative finding

To obtain the textual community representatives, we apply a clustering algorithm to select communities based on their feature similarities. Based on the data visualization in Fig. 4, we choose density-based spatial clustering of applications with noise (DBSCAN) [10], since it uses density as the mechanism to group instances and excludes outliers. After the clustering process is finished (see Fig. 5), the set of clusters is sorted by the number of instances in descending order. The two clusters with the highest numbers of instances are selected as representatives of the two extreme types of community (see Fig. 6). The textual cluster (or representative) is then the one with the higher average values of both features. The representative contains 280 communities with a ratio of text posts between 0.927 and 1.0 and an average normalized length of text between 0.549 and 0.822 (or 384 to 3,726 characters). The textual community finding algorithm describes the community clustering and the textual representative finding process in detail.
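The clustering and representative selection described above can be sketched with a compact, quadratic-time DBSCAN written in plain Python. The parameters `eps` and `min_pts` and the toy points are hypothetical (the values used in the experiments are not reported here), and the sketch assumes at least two clusters are found.

```python
from collections import Counter


def dbscan(points, eps, min_pts):
    """Minimal O(n^2) DBSCAN on 2-D points; returns one label per point, -1 for noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so expand through it
                queue.extend(nbrs)
    return labels


def find_textual_representative(points, labels):
    """Among the two largest clusters, return the label with higher mean ratio and length."""
    sizes = Counter(label for label in labels if label != -1)
    (c1, _), (c2, _) = sizes.most_common(2)

    def mean_xy(c):
        members = [p for p, label in zip(points, labels) if label == c]
        return (sum(x for x, _ in members) / len(members),
                sum(y for _, y in members) / len(members))

    (x1, y1), (x2, y2) = mean_xy(c1), mean_xy(c2)
    if x1 > x2 and y1 > y2:
        return c1
    if x2 > x1 and y2 > y1:
        return c2
    return None  # neither cluster dominates on both features


# Toy (ratio, scaled length) vectors: a link-heavy group, a text-heavy group, one outlier.
points = [(0.05, 0.20), (0.06, 0.22), (0.04, 0.21), (0.07, 0.19),
          (0.98, 0.70), (0.99, 0.72), (1.00, 0.71), (0.50, 0.95)]
labels = dbscan(points, eps=0.05, min_pts=3)
text_cluster = find_textual_representative(points, labels)
```

Here the outlier is labeled as noise and excluded, and the smaller cluster is chosen as the textual representative because both of its feature averages are higher.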

3.2.3 Textual community collection

To obtain the final collection of textual communities for the recommendation process, we use the average feature values of the textual representative as minimum thresholds. The average length of text is 0.676 (1,263 characters) and the average ratio of text posts is 0.991. We finally obtain 98 communities (see Table 2) with 92,382 posts as the final dataset. Algorithm 3.2.3 describes the textual community collection process.

Algorithm: Textual community collection.
procedure CollectTextualCommunities(Textual community representative $c_{\textit{text}}$, community feature vectors $V$)
    $T\leftarrow$ Set of textual communities, initialized with an empty set $\{\}$
    $x_{\textit{text}}\leftarrow$ The average ratio of text of the representative $c_{\textit{text}}$
    $y_{\textit{text}}\leftarrow$ The average scaled length of text of the representative $c_{\textit{text}}$
    for each feature vector $(x_{i},y_{i})$ in $V$ do
        if $x_{i}\geqslant x_{\textit{text}}$ and $y_{i}\geqslant y_{\textit{text}}$ then
            $c_{i}\leftarrow$ The community of feature vector $(x_{i},y_{i})$
            $T\leftarrow T\cup\{c_{i}\}$
        end if
    end for
    return $T$
end procedure
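A short sketch of the collection step is given below. The dictionary layout and the toy feature values are assumptions for illustration; only the thresholding rule (both features at or above the representative's averages) comes from the algorithm.

```python
def collect_textual_communities(features, representative):
    """features: {community: (ratio, scaled_length)} for all communities.
    representative: names of the communities in the textual cluster c_text.
    """
    x_text = sum(features[c][0] for c in representative) / len(representative)
    y_text = sum(features[c][1] for c in representative) / len(representative)
    # Every community is checked against the thresholds, not only cluster members.
    return {c for c, (x, y) in features.items() if x >= x_text and y >= y_text}


features = {
    "a": (1.00, 0.90),
    "b": (0.94, 0.50),
    "c": (0.99, 0.60),
    "d": (0.20, 0.30),
    "e": (1.00, 0.75),  # not in the representative, e.g. treated as noise by clustering
}
collection = collect_textual_communities(features, ["a", "b", "c"])
```

Because the loop runs over all feature vectors in $V$, a community outside the representative (such as `e` in the toy data) is still collected when it meets both thresholds.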

Table 2
The list of communities selected using the average values of both features calculated from the textual representative

Community name	Ratio of text posts	Average length of text
Advice	0.998	1,592
Animesuggest	0.993	1,357
AskDocs	0.999	1,346
BestOfOutrageCulture	0.999	1,481
BitcoinMarkets	0.998	1,388
BooCRedux	1.0	2,192
BreakUps	0.995	2,019
C_S_T	1.0	2,674
CampHalfBloodRP	1.0	2,243
CasualPokemonTrades	0.996	1,537
CharacterRant	1.0	1,741
CompetitiveEDH	0.995	1,505
CruciblePlaybook	1.0	3,605
DMAcademy	0.998	1,318
DankNation	0.994	1,691
DaystromInstitute	0.998	2,532
DebateAnAtheist	0.999	1,499
DebateReligion	0.999	1,572
DnDBehindTheScreen	0.994	5,603
GGdiscussion	0.996	2,041
Geosim	0.992	1,618
GiftofGames	0.999	1,654
Glitch_in_the_Matrix	1.0	2,138
GodhoodWB	1.0	2,388
HFY	0.997	13,386
IAmA	1.0	1,469
IDontWorkHereLady	1.0	1,742
IronThronePowers	0.999	2,631
IronThroneRP	0.998	3,512
LegalAdviceUK	0.992	1,454
LetsNotMeet	0.999	4,999
LoLeventVoDs	1.0	16,220
MilitaryStories	0.999	5,192
OpiatesRecovery	0.994	1,639
PBPNexus	1.0	3,583
PokemonPlaza	0.994	2,571
Pokemongiveaway	0.995	2,282
ProRevenge	0.993	3,726
PurplePillDebate	1.0	2,252
SRSDiscussion	1.0	1,872
SVExchange	0.996	9,242
SteamGameSwap	0.995	1,357
SuggestALaptop	0.992	1,370
TF2WeaponIdeas	1.0	1,420
TMBR	0.994	1,355
TalesFromRetail	0.999	2,205
TalesFromThePizzaGuy	1.0	1,525
TalesFromTheSquadCar	1.0	5,247
Thetruthishere	1.0	3,291
TiADiscussion	1.0	1,487
TrueDoTA2	1.0	1,786
TrueOffMyChest	1.0	1,532
UnresolvedMysteries	1.0	2,759
UnsentLetters	1.0	1,404
WredditCountryClub	0.998	1,964
airsoftmarket	1.0	1,440
asktransgender	1.0	1,455
asktrp	1.0	1,374
askwomenadvice	1.0	1,490
autotldr	1.0	2,650
buildapcforme	0.995	2,170
changemyview	1.0	2,432
confession	1.0	1,925
copypasta	1.0	1,640
creepyencounters	1.0	3,409
cscareerquestions	1.0	1,450
dfsports	0.996	1,376
gainit	0.996	1,521
getdisciplined	0.997	2,271
learnprogramming	0.998	2,063
legaladvice	1.0	1,917
makeupexchange	1.0	3,124
marriedredpill	0.995	4,192
neckbeardstories	1.0	5,495
needadvice	0.998	1,560
nflstreams	1.0	1,577
nosleep	1.0	11,390
noveltranslations	0.998	1,961
offmychest	1.0	1,909
personalfinance	0.999	2,030
pettyrevenge	0.998	1,955
pokemontrades	0.998	1,550
rant	0.998	1,468
relationship_advice	1.0	2,426
relationships	1.0	3,945
respectthreads	1.0	9,503
self	1.0	1,740
shortscarystories	1.0	1,574
smallbusiness	0.996	1,315
subredditoftheday	0.994	6,431
summonerschool	0.999	3,383
talesfromcallcenters	0.999	2,023
talesfromtechsupport	1.0	2,985
tifu	0.998	2,283
tldr	0.999	3,131
trendingsubreddits	1.0	1,758
truegaming	1.0	2,286
woweconomy	1.0	1,349

We found that Algorithm 3.2.3 plays an important role in the recommendation steps. The experimental results using the two collections (280 communities obtained from DBSCAN and 98 communities obtained from Algorithm 3.2.3) show a significant improvement for the smaller collection (see Section 5 for details).

4. Recommendation framework

Our goal is to provide a list of recommended communities from which users can choose based on their preferences, giving them more freedom in picking a subreddit. Given the title and body of a post, the algorithm recommends communities based on the content similarity between the post and the communities. The framework consists of two steps (see Fig. 7). First, it uses probabilistic classification to obtain a probability that serves as the similarity between each subreddit and the given post. A probabilistic classifier predicts the probability of each target class instead of a single result. We study the classification performance of naïve Bayes (NB) [4] and logistic regression (LR) with stochastic gradient descent (SGD) [24] and coordinate descent (CD) [11] optimization. Second, the framework picks the communities that are most strongly associated with the post according to the predicted probabilities. This list is presented to users as options to review and choose from for their posts. The advantage of our framework is its ability to recommend communities with high content relevance even if they are small or less known.

Figure 7.

Community recommendation framework.

4.1 Content similarity

The framework uses the probability obtained from a probabilistic classifier to determine content similarity. Naïve Bayes and logistic regression are studied to compare their performance. Since logistic regression is typically a binary classifier, we apply the one-vs-rest technique to fit our multiclass problem domain [7]. Classification normally returns only the class with the highest probability as the predicted result, but our framework keeps all predicted probability values for use in the next process. The logistic regression training algorithm describes the training of multiclass logistic regression using the one-vs-rest technique, and the probability prediction algorithm describes how probabilities are predicted with the classifiers trained this way.

Algorithm: Logistic regression training using one-vs-rest.
procedure TrainClassifiers(Dataset $D$)
    $C\leftarrow$ Set of classifiers, initialized with an empty set $\{\}$
    $S\leftarrow$ Set of communities in the dataset $D$
    for each community $s_{i}$ in $S$ do
        $P_{i}\leftarrow$ Set of posts in community $s_{i}$
        $P^{\prime}_{i}\leftarrow$ Set of posts not in community $s_{i}$
        $c_{i}\leftarrow$ Logistic regression classifier for community $s_{i}$, trained with $P_{i}$ and $P^{\prime}_{i}$
        $C\leftarrow C\cup\{c_{i}\}$
    end for
    return $C$
end procedure

Algorithm: Probability prediction using trained classifiers.
procedure PredictProbabilities(Classifiers $C$, Post $x$)
    $P\leftarrow$ Set of community probabilities, initialized with an empty set $\{\}$
    for each classifier $c_{i}$ in $C$ do
        $s_{i}\leftarrow$ Target community of classifier $c_{i}$
        $p_{i}\leftarrow$ Probability of community $s_{i}$ predicted by $c_{i}$ for post $x$
        $P\leftarrow P\cup\{(p_{i},s_{i})\}$
    end for
    return $P$
end procedure
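To make the one-vs-rest scheme concrete, the sketch below trains one binary logistic regression per community with plain, unregularized SGD and then collects the probability from every classifier. It is an illustration only: the toy posts use raw word counts, the hyperparameters are arbitrary, and the optimizer is a simplification rather than the SGD or LIBLINEAR coordinate descent implementations used in the experiments.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Linear score of a sparse feature dict x under weights w and bias b."""
    return b + sum(w.get(f, 0.0) * v for f, v in x.items())


def train_binary(samples, labels, epochs=200, lr=0.5, seed=0):
    """Unregularized SGD for binary logistic regression on sparse dicts."""
    rng = random.Random(seed)
    w, b = {}, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            g = sigmoid(score(w, b, x)) - y  # gradient of the log loss w.r.t. the score
            b -= lr * g
            for f, v in x.items():
                w[f] = w.get(f, 0.0) - lr * g * v
    return w, b


def train_classifiers(dataset):
    """dataset: list of (features, community). Trains one classifier per community."""
    xs = [x for x, _ in dataset]
    communities = sorted({c for _, c in dataset})
    return {s: train_binary(xs, [1 if c == s else 0 for _, c in dataset])
            for s in communities}


def predict_probabilities(classifiers, x):
    """Return a (probability, community) pair for every classifier."""
    return [(sigmoid(score(w, b, x)), s) for s, (w, b) in classifiers.items()]


dataset = [
    ({"portal": 1, "game": 1}, "truegaming"),
    ({"game": 1, "play": 1}, "truegaming"),
    ({"lawyer": 1, "court": 1}, "legaladvice"),
    ({"court": 1, "landlord": 1}, "legaladvice"),
]
classifiers = train_classifiers(dataset)
probs = predict_probabilities(classifiers, {"game": 1, "portal": 1})
```

Note that the one-vs-rest probabilities are produced independently and need not sum to one; they are used only to rank communities.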

We study two optimization methods for logistic regression: stochastic gradient descent and coordinate descent. The term frequency-inverse document frequency (tf-idf) vector of the post is extracted as the feature vector, and the corresponding subreddit is used as the label.
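As a small illustration of the tf-idf feature vector, the sketch below uses the classic weighting of raw term frequency times $\log(N/\textit{df})$. Which tf-idf variant is used in the experiments is not stated, so this particular formula is an assumption.

```python
import math
from collections import Counter


def tfidf(bags):
    """bags: list of Counter word counts, one per post. Returns one {word: weight} per post."""
    n = len(bags)
    # Document frequency: iterating a Counter yields each distinct word once.
    df = Counter(word for bag in bags for word in bag)
    return [{w: tf * math.log(n / df[w]) for w, tf in bag.items()} for bag in bags]


bags = [Counter({"portal": 2, "game": 1}), Counter({"game": 1, "court": 1})]
vectors = tfidf(bags)
```

Under this weighting, a word that appears in every post (here `game`) receives a weight of zero, while words specific to one post are emphasized.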

Domain-specific features

To investigate the classification performance, we study new features associated with Reddit posts. These features include the length of the title and body, the voting score and user behavior. Equation (5) shows the text lengths scaled between zero and one. The minimum and maximum lengths are calculated from the data set.

$\displaystyle\textit{length}_{\textit{scaled}}=\frac{\textit{length}-\min\textit{length}}{\max\textit{length}-\min\textit{length}}$ (5)

The score feature is obtained from the users’ voting score and is used as the instance weight of the post to emphasize user engagement in the community (see Eq. (6)).

$\displaystyle\textit{score}_{\textit{normalized}}=\log_{10}{\textit{score}}$ (6)
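Equations (5) and (6) can be illustrated with the short sketch below. Clamping scores below 1 to 1 is an assumption added here so that the logarithm is defined; how such scores are treated in the experiments is not stated.

```python
import math


def scale_lengths(lengths):
    """Eq. (5): min-max scale title or body lengths to [0, 1]."""
    lo, hi = min(lengths), max(lengths)
    span = (hi - lo) or 1  # assumption: guard against identical lengths
    return [(n - lo) / span for n in lengths]


def score_weights(scores):
    """Eq. (6): log10 of the voting score, used as an instance weight."""
    # Assumption: scores below 1 are clamped to 1 so that log10 is defined.
    return [math.log10(max(s, 1)) for s in scores]


scaled = scale_lengths([10, 60, 110])
weights = score_weights([1, 10, 0])
```

With this weighting a post with a score of 1 receives weight 0, so a practical implementation might add an offset; that choice is left open here.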

Table 3

Feature set combinations. Set A consists solely of tf-idf vectors. Sets B to G consist of tf-idf vectors with one additional feature. Sets X to Z consist of the high-potential features indicated by the performance exploration of sets B to G

Feature set	TF-IDF	Title length	Body length	Hour of creation	Day of creation	Month of creation	Score
A	✓	–	–	–	–	–	–
B	✓	✓	–	–	–	–	–
C	✓	–	✓	–	–	–	–
D	✓	–	–	✓	–	–	–
E	✓	–	–	–	✓	–	–
F	✓	–	–	–	–	✓	–
G	✓	–	–	–	–	–	✓
X	✓	✓	–	–	–	✓	–
Y	✓	–	✓	–	–	–	✓
Z	✓	✓	–	–	–	–	✓

Table 4

Classification performance of studied algorithms, measured using standard metrics with different feature sets for comparison

Algorithm	Feature set	Accuracy	Precision	Recall	F1	Kappa
NB	A	0.544	0.707	0.530	0.543	0.540
	B	0.548	0.709	0.533	0.547	0.543 ${}^{*}$
	C	0.538	0.703	0.523	0.538	0.533
	D	0.538	0.685	0.523	0.536	0.533
	E	0.544	0.693	0.529	0.541	0.539
	F	0.550	0.703	0.535	0.546	0.545 ${}^{*}$
	G	0.413	0.666	0.404	0.416	0.407
	X	0.554	0.710	0.539	0.551	0.549 ${}^{**}$
	Y	0.410	0.662	0.400	0.413	0.403
	Z	0.411	0.665	0.401	0.413	0.404
LR (SGD)	A	0.678	0.680	0.661	0.647	0.675
	B	0.678	0.686	0.662	0.649	0.674
	C	0.682	0.684	0.666	0.652	0.679 ${}^{*}$
	D	0.661	0.663	0.644	0.630	0.657
	E	0.672	0.672	0.656	0.641	0.669
	F	0.657	0.668	0.643	0.629	0.653
	G	0.712	0.716	0.695	0.687	0.709 ${}^{*}$
	X	0.663	0.672	0.647	0.635	0.659
	Y	0.715	0.720	0.698	0.691	0.711 ${}^{**}$
	Z	0.716	0.719	0.699	0.691	0.713
LR (CD)	A	0.763	0.751	0.748	0.742	0.760
	B	0.768	0.757	0.753	0.747	0.766 ${}^{*}$
	C	0.764	0.752	0.751	0.745	0.761
	D	0.761	0.750	0.746	0.741	0.758
	E	0.762	0.749	0.747	0.741	0.759
	F	0.763	0.748	0.749	0.743	0.761
	G	0.773	0.768	0.759	0.756	0.771 ${}^{*}$
	X	0.768	0.754	0.754	0.749	0.766
	Y	0.775	0.768	0.759	0.756	0.772
	Z	0.778	0.771	0.763	0.760	0.776 ${}^{**}$

${}^{*}$ indicates the potential set for the algorithm. ${}^{**}$ indicates the feature set created from the potential features for the algorithm.

We explore the performance of each algorithm using different feature sets (see Tables 3 and 4). Feature sets B to G each consist of tf-idf with one additional feature. Feature sets X, Y, and Z are created by combining the most effective features for each algorithm. We found that the most significant features in most cases are the length of the title and score weighting. After testing, we decided to apply logistic regression with coordinate descent optimization and feature set Z in our framework, since it exhibits higher performance than the others (accuracy 0.778 and F1 0.760 in Table 4).

4.2 Top-K recommendation

After obtaining the predicted probabilities for a given post from the classification process, the framework picks the communities with the highest probability values and presents them to the user. In our experiment, we vary the number of recommended communities from 1 to 5 for performance comparison. The Top-K community selection algorithm describes how communities are selected from the predicted probabilities and the number of final results. The community recommendation algorithm describes the overall recommendation process, which combines the previous algorithms.

Algorithm: Top-K community selection using predicted similarity.
procedure TopKCommunities(Probabilities $P$, Number of recommended communities $K$)
    $T\leftarrow$ List of communities, initialized with an empty list $[]$
    $P^{\prime}\leftarrow$ List of community probabilities $P$ sorted by descending probability $p_{i}$
    for $i\leftarrow 1$ to $K$ do
        $(p_{i},s_{i})\leftarrow P^{\prime}[i]$    ▷ The $i^{\text{th}}$ highest probability $p_{i}$ and its community $s_{i}$ in $P^{\prime}$
        $T$.insert($s_{i}$)
    end for
    return $T$
end procedure

Algorithm: Community recommendation.
procedure RecommendCommunities(Classifiers $C$, Post $x$, Number of recommended communities $K$)
    $R\leftarrow$ List of recommended communities, initialized with an empty list $[]$
    $P\leftarrow$ PredictProbabilities($C$, $x$)
    $R\leftarrow$ TopKCommunities($P$, $K$)
    return $R$
end procedure
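The Top-K selection step can be sketched in a few lines. The function name and the probability values are illustrative; `heapq.nlargest` is used here as one convenient way to obtain the K most probable communities in descending order.

```python
import heapq


def top_k_communities(probabilities, k):
    """probabilities: iterable of (probability, community) pairs.

    Returns the k communities with the highest probabilities, best first.
    """
    best = heapq.nlargest(k, probabilities, key=lambda pair: pair[0])
    return [community for _, community in best]


probs = [(0.10, "rant"), (0.70, "truegaming"), (0.15, "self"), (0.05, "tifu")]
recommended = top_k_communities(probs, 2)
```

A full recommendation would pass the output of the probability prediction step into this function with the desired number of results.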

Table 5
The average precision of the framework with the number of recommended communities (K) set to be from 1 to 5 compared with different algorithms and feature sets

Algorithm	Feature set	K $=$ 1	K $=$ 2	K $=$ 3	K $=$ 4	K $=$ 5
KNN	A	0.502	0.591	0.645	0.678	0.692
RF	A	0.418	0.487	0.521	0.551	0.580
NB	A	0.544	0.653	0.709	0.753	0.786
	X	0.554	0.660	0.717	0.760	0.794
	Y	0.410	0.510	0.571	0.615	0.652
	Z	0.411	0.509	0.570	0.613	0.651
LR (SGD)	A	0.677	0.789	0.837	0.866	0.886
	X	0.661	0.772	0.824	0.856	0.878
	Y	0.715	0.822	0.867	0.893	0.911
	Z	0.715	0.823	0.868	0.894	0.912
LR (CD)	A	0.762	0.861	0.899	0.922	0.937
	X	0.768	0.865	0.903	0.925	0.938
	Y	0.776	0.870	0.908	0.929	0.943
	Z	0.779	0.874	0.911	0.931	0.945

Table 6

Classification performance of studied algorithms, measured using standard metrics with different feature sets for comparison, using the 280 communities in textual representative as dataset

Algorithm	Feature set	Accuracy	Precision	Recall	F1	Kappa
NB	A	0.452	0.659	0.659	0.659	0.659
	B	0.452	0.663	0.438	0.470	0.450
	C	0.447	0.657	0.433	0.465	0.445
	D	0.438	0.645	0.425	0.454	0.436
	E	0.450	0.653	0.435	0.465	0.448
	F	0.454	0.649	0.439	0.465	0.452
	G	0.277	0.644	0.270	0.301	0.275
	X	0.454	0.656	0.439	0.466	0.453
	Y	0.275	0.643	0.268	0.300	0.272
	Z	0.273	0.639	0.265	0.295	0.271
LR (SGD)	A	0.485	0.557	0.471	0.456	0.483
	B	0.468	0.569	0.455	0.448	0.467
	C	0.489	0.558	0.475	0.459	0.487
	D	0.436	0.537	0.423	0.416	0.434
	E	0.479	0.542	0.466	0.450	0.478
	F	0.400	0.584	0.388	0.392	0.398
	G	0.506	0.607	0.490	0.486	0.505
	X	0.399	0.589	0.387	0.391	0.397
	Y	0.510	0.612	0.495	0.491	0.509
	Z	0.510	0.607	0.494	0.491	0.508
LR (CD)	A	0.680	0.682	0.672	0.670	0.678
	B	0.685	0.687	0.678	0.675	0.684
	C	0.680	0.683	0.673	0.671	0.680
	D	0.678	0.679	0.670	0.668	0.677
	E	0.679	0.681	0.672	0.669	0.678
	F	0.683	0.684	0.677	0.673	0.682
	G	0.691	0.703	0.685	0.687	0.690
	X	0.687	0.687	0.681	0.677	0.686
	Y	0.692	0.703	0.685	0.687	0.691
	Z	0.696	0.707	0.690	0.691	0.695

Table 7

The average precision of the framework with the number of recommended communities (K) set to be from 1 to 5 compared with different algorithms and feature sets, using the 280 communities in textual representative as dataset

Algorithm	Feature set	K $=$ 1	K $=$ 2	K $=$ 3	K $=$ 4	K $=$ 5
NB	A	0.452	0.551	0.602	0.639	0.666
	X	0.454	0.554	0.606	0.643	0.670
	Y	0.275	0.357	0.407	0.441	0.469
	Z	0.273	0.354	0.403	0.437	0.465
LR (SGD)	A	0.484	0.602	0.662	0.701	0.730
	X	0.401	0.505	0.567	0.608	0.608
	Y	0.510	0.629	0.686	0.723	0.750
	Z	0.509	0.627	0.684	0.721	0.747
LR (CD)	A	0.680	0.792	0.838	0.865	0.882
	X	0.687	0.796	0.841	0.867	0.884
	Y	0.692	0.801	0.846	0.872	0.889
	Z	0.696	0.805	0.849	0.875	0.891

5. Performance evaluation

We evaluate the performance of the framework using 10-fold cross-validation, with the community names as labels. To assess the ranked classification results, we measure the precision for different numbers of results, known as precision@k. Note that precision@k is a performance metric used in information retrieval tasks to determine the relevance of the results without considering their positions. In our case, this means that a recommendation is treated as correct if the community in which the post was originally submitted is in the recommended list. We measure the average precision over all folds with different feature sets and numbers of recommended communities, and compare against baseline methods, namely k-nearest neighbors (KNN) [6], random forest (RF) [12] with 10 trees, and NB with the basic feature set A (see Table 5). The results indicate that the framework achieves more than 90% precision when recommending a list of 3 to 5 communities (K $=$ 3 to 5): LR (CD) with feature set Z reaches 0.911 at K $=$ 3 and 0.945 at K $=$ 5.

The contribution of textual community collection is investigated in Tables 6 and 7. We found that using the whole textual community representative ($c_{\textit{text}}$, 280 communities) obtained from DBSCAN yields lower performance: for LR (CD) with feature set Z, precision at K $=$ 1 drops from 0.779 (Table 5) to 0.696 (Table 7). Therefore, finding a suitable collection (Algorithm 3.2.3) is important for the recommendation framework.
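The precision@k measure as defined above, where a recommendation counts as correct when the original community appears anywhere in the top-k list, can be sketched as follows. The function name and the toy rankings are illustrative.

```python
def precision_at_k(ranked_lists, true_communities, k):
    """Fraction of posts whose original community is among the first k recommendations."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_lists, true_communities)
        if truth in ranked[:k]
    )
    return hits / len(true_communities)


ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "a", "b"]]
truth = ["a", "a", "b"]
p1 = precision_at_k(ranked, truth, 1)
p2 = precision_at_k(ranked, truth, 2)
p3 = precision_at_k(ranked, truth, 3)
```

By construction the value cannot decrease as k grows, which matches the trend across the columns of Table 5.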

Figure 8.

The effect of noise on the performance of probability prediction measured in precision at K $=$ 1 with different algorithms and feature sets.

5.1 The effect of noise

The effect of noise is studied to see the robustness of our framework. For text-mining domain, stop words are considered as noise [20]. We performed two more experiments, one without preprocessing (noisy data) and the other with noise removal (cleaned data), then we measure their precisions at K $=$ 1. The performance comparison is shown in Fig. 8. The number of features (words) in the cleaned dataset was reduced by 22.5%, from 218,830 down to 169,597. But the time used in both processes are not much different in our case because of the compact size of the dataset.

The results show that KNN performed slightly worse on the noisy dataset, because stop words increase the chance that the algorithm takes incorrect instances sharing those features as neighbors. RF performed slightly better on the noisy dataset than on the cleaned one, since RF randomly selects a subset of features as candidates during tree construction. We found that NB and LR are robust to noise because they are based on probability calculation. The precisions of LR (CD) using feature set Z on the noisy and cleaned datasets are almost identical, at 0.779 and 0.778 respectively. This indicates the robustness of the framework on noisy data. However, removing more features can worsen the performance in some cases owing to the nature of the dataset: in social media data, some words have more classification power than others. For example, some communities ask questions more often than others, so question words that are usually filtered out as stop words may still carry useful signal. Overall, our method exhibits the highest performance and is robust to noise compared with the baseline methods.

Table 8
The time complexity of each algorithm. $N$ is the number of posts in the dataset. $C$ is the number of available communities. $D$ is the number of features in a vector. $K$ is the number of recommended communities

#	Algorithm	Time complexity (training)	Time complexity (testing)
1	Community feature extraction	$O(N)$	–
2	Textual representative finding	$O(N^{2})$	–
3	Textual community collection	$O(C)$	–
4	Logistic regression training	$O(ND)$	–
5	Probability prediction	–	$O(C)$
6	Top-K community selection	–	$O(K)$
7	Community recommendation	–	$O(C)$

5.2 Time complexity

The time complexity of each algorithm is shown in Table 8. The community feature extraction algorithm needs to iterate over all $N$ posts in the dataset once in order to calculate the features of each community, resulting in $O(N)$. The textual representative finding algorithm (Algorithm 6) requires $O(N^{2})$. The textual community collection algorithm then goes through all $C$ community vectors and retrieves those that fit the criteria calculated from the textual representative, yielding $O(C)$. The LR training phase (Algorithm 4.1) needs more computational time than the other algorithms in our framework. We use the LIBLINEAR implementation for LR training in the content similarity process, giving $O(ND)$, where $D$ is the number of features. Once the training processes are completed, the framework performs efficiently: it iterates over all $C$ communities in order to collect and recommend $K$ communities to the user, which is technically $O(C+K)$. However, the number of communities is likely to be higher than the number of recommended communities, so this step can be summarized as $O(C)$. We conclude that the overall testing complexity of the recommendation framework is $O(C+K+C)$, which is $O(C)$.
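The testing-phase step of scanning all $C$ community probabilities and returning $K$ of them can be sketched as below. This is an illustrative sketch, not the selection procedure of the framework: the function name `recommend` and the probability values are hypothetical, and `heapq.nlargest` from the Python standard library is used here as a convenient stand-in. It runs in roughly $O(C \log K)$, which for a small fixed $K$ such as 3 to 5 grows linearly in $C$.

```python
import heapq
from typing import Dict, List


def recommend(probabilities: Dict[str, float], k: int) -> List[str]:
    """Return the names of the k communities with the highest predicted probability."""
    if k <= 0:
        return []
    # Single pass over all C communities, keeping only the k best seen so far.
    top = heapq.nlargest(k, probabilities.items(), key=lambda item: item[1])
    return [name for name, _ in top]


# Hypothetical per-community probabilities for one post (illustrative values only).
probs = {
    "askscience": 0.61,
    "AskReddit": 0.22,
    "relationships": 0.05,
    "legaladvice": 0.12,
}
print(recommend(probs, 3))
```

Because only the $K$ best candidates are kept during the scan, memory use stays proportional to $K$ rather than $C$.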

6. Conclusion

We propose a community recommendation framework designed for text posts and evaluate it on data from Reddit. The framework starts with textual community identification, using DBSCAN to obtain the textual community representative and then using it as a threshold to select textual communities from the dataset. A strength of the framework is that it learns automatically, without the need for domain experts. Logistic regression is applied to predict the content similarity of each community to a post. Finally, the framework selects the set of communities with the highest probability values and provides it to the user as a recommendation. We also propose domain-specific features of Reddit posts that improve classification performance. The results show that, by recommending a list of 3 to 5 communities, the framework achieves 90% precision. We also examine the effect of noise in the dataset and the time complexity of the proposed algorithms. In the near future, we plan to evaluate the framework on more extensive datasets to see the impact of the top-k recommendation process.

Acknowledgments

We gratefully acknowledge the financial support from Kasetsart University Research and Development Institute and the SCIKU 50$^{\text{th}}$ Anniversary Scholarship.

References

1. Praw: PRAW, an acronym for “Python Reddit API Wrapper”, is a Python package that allows for simple access to Reddit’s API, Oct. 2016.

2. Reddit.com: API documentation. https://www.reddit.com/dev/api, Oct. 2016.

3. Redditlist.com – Tracking the top 5000 subreddits. http://redditlist.com/, Oct. 2016.

4. Naive Bayes text classification. https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html, Nov. 2017.

5. Reddit.com Traffic, Demographics and Competitors – Alexa. https://www.alexa.com/siteinfo/reddit.com, Dec. 2017.

6. Altman N.S., An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician 46(3) (1992), 175–185.

7. Bishop C.M., Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

8. Choi, Han, Chung, Ahn Y.-Y., Chun B.-G. and Kwon T.T., Characterizing Conversation Patterns in Reddit: From the Perspectives of Content Properties and User Participation Behaviors, in: Proceedings of the 2015 ACM on Conference on Online Social Networks, COSN ’15, New York, NY, USA, ACM, 2015, pp. 233–243.

9. Cunha T.O., Weber, Haddadi and Pappa G.L., The Effect of Social Feedback in a Reddit Weight Loss Community, in: Proceedings of the 6th International Conference on Digital Health Conference, DH ’16, New York, NY, USA, ACM, 2016, pp. 99–103.

10. Ester, Kriegel H.-P., Sander and Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, AAAI Press, 1996, pp. 226–231.

11. Fan R.-E., Chang K.-W., Hsieh C.-J., Wang X.-R. and Lin C.-J., Liblinear: A library for large linear classification, J. Mach. Learn. Res. 9 (June 2008), 1871–1874.

12. T.K., Random decision forests, in: Proceedings of 3rd International Conference on Document Analysis and Recognition, Vol. 1, Aug. 1995, pp. 278–282.

13. Jamonnak, Kilgallin, Chan C.C. and Cheng, Recommenddit: A Recommendation Service for Reddit Communities, in: 2015 International Conference on Computational Science and Computational Intelligence (CSCI), Dec. 2015, pp. 374–379.

14. Kumar, Dredze, Coppersmith and Choudhury M.D., Detecting Changes in Suicide Content Manifested in Social Media Following Celebrity Suicides, in: Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15, New York, NY, USA, ACM, 2015, pp. 85–94.

15. Peng, Kataria, Sun and Li, Recommending users and communities in social media, ACM Transactions on Knowledge Discovery from Data (TKDD) 10(2) (Oct. 2015), 17.

16. Peng, Kataria, Sun and Li, FRec: A novel framework of recommending users and communities in social media, ACM, 10/27/2013, pp. 1765–1770.

17. Nguyen, Richards, Chan C.-C. and Liszka K.J., RedTweet: Recommendation Engine for Reddit, in: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, ASONAM ’15, New York, NY, USA, ACM, 2015, pp. 1381–1388.

18. Nguyen, Richards, Chan C.-C. and Liszka K.J., RedTweet: Recommendation engine for reddit, Journal of Intelligent Information Systems 47(2) (Oct. 2016), 247–265.

19. Pal, Wang, Zhou M.X., Nichols and Smith B.A., Question routing to user communities, in: Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, CIKM ’13, New York, NY, USA, ACM, 2013, pp. 2357–2362.

20. Saif, Fernández and Alani, On stopwords, filtering and data sparsity for sentiment analysis of Twitter, in: LREC 2014, Ninth International Conference on Language Resources and Evaluation. Proceedings, Reykjavik, Iceland, 2014, pp. 810–817.

21. Tamersoy, De Choudhury and Chau D.H., Characterizing Smoking and Drinking Abstinence from Social Media, in: Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15, New York, NY, USA, ACM, 2015, pp. 139–148.

22. Tan and Lee, All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement, in: Proceedings of the 24th International Conference on World Wide Web, WWW ’15, Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee, 2015, pp. 1056–1066.

23. Weninger, Zhu X.A. and Han, An Exploration of Discussion Threads in Social News Sites: A Case Study of the Reddit Community, in: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, New York, NY, USA, ACM, 2013, pp. 579–583.

24. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, ACM, Apr. 2004, p. 116.
