User graph topic model

Abstract

Finding coherent topics in Twitter data is difficult task because of the sparseness and informal language. Tweets also provide rich contextual and auxiliary metadata which can be used to supervise the topic modeling to get more coherent topics. In this paper, a novel topic model is proposed which extends Author Topic Model for twitter. Standard Author Topic Model cannot be used on Twitter data as every tweet has exactly one author. The proposed User Graph Topic Model (UGTM) considers the semantic relationships among tweet users based on the contextual information like hashtags, user mentions and replies to make a user graph. Related users of author of a tweet are found and used in tweet generation process. Related user information from the user graph is used to obtain the dirichlet prior for user generation. Empirical results show that the proposed UGTM outperforms standard Author Topic Model (ATM) on experimental data.

Keywords

Topic models Latent Dirichlet Allocation user graph

1. Introduction

In the age of Big Data, analyzing large collections of flat text data is a challenging task. Unstructured nature of textual data make analysis difficult. Topic Models [1] have been proven useful for analyzing and discovering important statistical pattern in large collection of text documents. Topic models are probabilistic models which discover the latent semantic structure hidden in the large text data. They have wide spectrum of text mining applications such as multi-document summarization [2 –5], Sentiment analysis [6, 7], opinion mining [8], social network analysis [9], machine translation [10] etc.

Topic models have obtained success in analyzing topics in long flat documents i.e. scientific documents, news articles, literary documents etc. but they fail to produce good results for short text data like tweets [11]. Tweets are very short containing maximum of 140 characters. The reason for unsatisfactory results for short text is their noisy and sparse nature because Topic Models implicitly capture the word co-occurrence relationships in the documents which are very sparse in short documents [12]. Moreover, Tweets contains informal language construct like slangs, abbreviations which increases the size of dictionary and ambiguity. Several methods are proposed for improving topic model results for short text-By extending short text, by adding additional terms [13 –15], by aggregating similar tweets [11 , 16–18], by designing special topic models that suit short text [19 –21].

Although conventional topic models are designed for the flat text documents, they can utilize the meta-data of the text documents to supervise the topic modelling [22]. Tweets have rich contextual information. In addition to the actual text, tweets also have hashtags, URLs, user mentions in the tweet body which provides additional semantics to the tweets. Hashtags are added by users to categorize and highlight topics of the tweets. URLs provides additional information to the tweets. User mentions and replies provide communication relationships among the tweet users. Furthermore, tweet meta-data also contains auxiliary data like user profile, timestamp, reply, location etc. These contextual and auxiliary information of tweets can be used to weekly-supervise topic modelling to get additional word co-occurrence relationship among the tweets. Several works have focused on using these contextual and auxiliary tweet data to enhance topic modeling on tweets [23 –29]. Most of these works focused on using hashtags for supervising topic models. The drawback of these works is that they can be used only when tweets has some hashtags. If a tweet does not contain any hashtag then either hashtag based topic modelling cannot be applied or that tweet does not contribute to topic modeling. For this reason, all hashtag based topic modeling methods use only those tweets which contains hashtags for training and testing.

In Author Topic Model [30], set of authors of the document are treated as document tags. Each author of the document has some contribution to the content of the document. ATM has some drawbacks. First, ATM does not improve topic modeling if there is exactly one author for each document in the corpus. For this reason, ATM cannot be used for tweet data because each tweet has exactly one author. Second, ATM does not consider the impact of those users who are not the author of the document. In a social network like Twitter, the content of text documents is affected by a group of users who are closely communicating on a topic. The group of users communicating on a topic are reflected by using a set of hashtags and URLs, mentioning each other using user mentions and replying to each other on twitter. Such communicating users have a kind of mutual relationships among them, which can be used to model the role of users on tweet generation. For example, consider the following four tweets shown in the first column of table 1, which were tweeted by four different users but they mentioned the same user ‘@elisewho’. There is no hashtag in these tweets. But since the author of these tweets have mentioned to the same user ‘@elisewho’, the authors of these tweets topic are related to the topics of the user ‘@elisewho’. The second column of table 1 shows tweets posted by ‘@elisewho’, which are related to the topics of the tweets shown in the first column of Table 1.

Table 1
Relationship among mentioning and mentioned users

Tweet Text by four different users who mentioned ’elisewho’ Tweets by Mentioned User ’elisewho’

elisewho This is an incredible picture considering the fact a North Korean defector was riddled with bullets at th… Here they are signing their joint statement

@elisewho In a dark twist, at the end of his statement, Kim Jong Un thanks the press earnest… The North’s Kim Jong Un and the South’s Moon Jae-in hand-in-hand as they crossed over the military demarcation line…

@elisewho If only Trump was as sane as Kim Jong Un. The leaders of the two Koreas moments away from announcing a joint agreement with outcomes from today’s historic

Today, Kim Jong Un became the first North Korean leader to walk into South Korea. NPR’s @elisewho reports from near…

Tweet Text by four different users who mentioned ’elisewho’	Tweets by Mentioned User ’elisewho’
elisewho This is an incredible picture considering the fact a North Korean defector was riddled with bullets at th…	Here they are signing their joint statement
@elisewho In a dark twist, at the end of his statement, Kim Jong Un thanks the press earnest…	The North’s Kim Jong Un and the South’s Moon Jae-in hand-in-hand as they crossed over the military demarcation line…
@elisewho If only Trump was as sane as Kim Jong Un.	The leaders of the two Koreas moments away from announcing a joint agreement with outcomes from today’s historic
Today, Kim Jong Un became the first North Korean leader to walk into South Korea. NPR’s @elisewho reports from near…

In this paper, a probabilistic topic model, User Graph Topic Model (UGTM), is proposed that takes the relationships among the tweet users into account and use it to model the effect of users on tweets generation. The proposed UGTM model uses user and their relationships based on hashtags, URLs and mentions. UGTM alleviates the problem faced by hashtag based approches discussed above. Since every tweet has a user, UGTM can be used with every tweet whether that tweet contains hashtag or not. UGTM also improves Author Topic Model by also considering the contribution of those users in tweet generation who are not the author of tweet.

In the proposed model, to generate a word of a tweet, a user is first generated probabilistically using the user relationships. The generated user may be different from the actual author of the tweet. A word is generated then from the generated user’s topic distribution over words. So, a word in a tweet may generated by a user whose topic distribution is related to the tweet content but is not the author of tweet. The contribution of the proposed model is summarized as follows:

The proposed model tackles the sparsity problem of short text by using user relationship based on the tweet contextual information.

Unlike Author Topic Model, the proposed model considers the effect of non-author users on topic modeling.

The user relationship information is used as dirichlet prior for generating user in the generative process.

The proposed model performs effectively on tweet data on clustering and topic quality task.

Remainder of the paper is organized as follows. Next section describes related work. Section 3 presents the proposed model. Section 4 presents experiments and results. Section 5 concludes and discusses future work.

2. Related work

Topic Modelling have been a popular method to analyze the latent semantic structures of large text collections. Topic modelling algorithms finds latent topics in the documents by using a probabilistic generative process to generate hidden topics. Latent Dirichlet Allocation (LDA) [31] is the first approach in this direction which is the extension of probabilistic latent semantic analysis (pLSA) [32]. LDA is generalization of pLSA and is similar to it when a uniform dirichlet prior is used instead of sparse dirichlet prior. LDA introduces dirichlet priors for multinomial distribution of documents over topics. Several extensions of LDA have been proposed. Correlated Topic Model (CTM) [33] finds topic relations in the documents. Dynamic Topic Model (DTM) [34], Relational Topic Model (RTM) [35] and Hierarchical Topic Model (HTM) [36] are some other variants of LDA. LDA and its variants produce excellent results on long documents but they fail on short documents like tweets which are noisy, sparse and short.

2.1. Topic modelling for short texts

Several approaches have been proposed for topic modeling to handle short and noisy data. In one approach, short texts have been extended by appending additional words from external knowledge bases Word-Net, Wikipedia or World Wide Web. Additional words are also added from the document corpus itself using co-occurrence word relationships. Co-frequency Expansion (CoFE) [37] framework added the words in the corpus which co-occur frequently with the document words, assuming that co-occurring words have high probability of falling in the same topic. A distributed representation based expansion (DREx) framework [15] is designed which is based on the concept of word embedding to model word similarities.

Another approach of tackling the short text problem is to aggregate the related short documents into longer documents. The relation among the short documents is based on some heuristics. For example, hashtags [16], users [11] and conversations [17] have been used for tweet aggregation. Hong et al. [11] applies the standard LDA on the aggregated document of all the tweets authored by a single user. Mehrotra [16] et al proposes tweet pooling based on burst-score, timestamp and hashtag, which combines tweets on the basis of tweet bursts (i.e. high number of tweets in unit time), time period (e.g. hour) and tweet hashtags respectively. Hashtag based tweet pooling scheme perform better than standard LDA but burst-score and timestamp based pooling does not improve the results. Alvarez-Melis [17] proposes conversa-tion based pooling which combines those tweets which are part of the same twitter conversation. Author claims that conversation based pooling scheme performs better than hashtag based pooling scheme. Other works have also used other tweet contextual information like URL, mentions for tweet pooling. The drawback of these pooling schemes is that the contextual information like hashtags, replies, URLs are not always available in all the tweets. These pooling schemes can be used only when the contextual information is available for aggregation. Self-Aggregation based Topic Model (SATM) [38] uses clustering methods along with topic modelling to provide a general framework for document aggregation.

Another approach for tackling short text is to design special topic modelling algorithm that are suited for sparse and noisy nature of short texts. Bi-term topic Model (BTM) [19] considers all bigrams instead of words in the entire corpus to reduce the sparsity problem. Twitter-LDA [20] models solves the data sparsity problem by assuming that short text contains exactly one topic. This assumption provides better result and is reasonable for the short text. GPU-DMM topic model [39] enriches short text by using auxiliary word embedding from large text collections. It uses Generalized Polya Urn Model to promote semantically related words during the sampling process. Word network Topic Model (WNTM) [21] works upon the pseudo documents created for each word using word graph. The group of words found in the generative process are treated as latent topics in the corpus.

2.2. Supervised topic models

Supervised topic models uses contextual information available with semi structured text documents for guiding the generative models. Blei proposes Supervised LDA (sLDA) [22] for labelled documents to accommodate a variety of response types that limits the topics to a pre-defined set with their distributions provided a priori. Labelled-LDA [40] uses one to one correspondence between topics and document tags to learn word-tag relations. Labelled-LDA restricts a documents topics to the labels assigned to it. Ramage et al [41] uses labelled-LDA to pick hyper-parameters corresponding to the tweet labels restricting tweet topics to its labels.

Several works have been proposed to use tag information associated with the documents to supervise the topic modelling. Hashtag Graph Topic Model (HGTM) [23], Tag-topic model [24], TagLDA [25], Behaviour-LDA [27], tag-Weighted Dirichlet Allocation (TWDA) [26], hashtag-LDA [28], Tag-Latent Dirichlet Allocation [29] models uses the tags associated with the documents to restrict the topics guided by tags. Y Wang [23] proposes HGTM to utilize hashtag relationship from a hashtag graph to model semi structured tweets. HGTM directly models the influence of those words which does not appear in the document itself through the use of hashtag relationships. Tsai presents Tag-topic model [24] for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic model determines the most likely tags and words for a given topic in a collection of blog posts. TagLDA [25] extend Latent Dirichlet Allocation model by replacing the unigram word distribution with a factored representation conditioned on both the topic and the tag structure of the document. TWDA [26] provides a framework that leverages both the tags and words in each document to infer the topic components for the documents. This allows not only to learn the document-topic and topic-word distributions, but also to infer the tag-topic distributions for text mining. Behavior-LDA [27] uses tweet behavior (reply, retweet, post etc.) as document tags and uses it to guide topic learning. Hashtag-LDA [28] studies the influence of tweet hashtags on the topic distribution by jointly modeling the words and hashtags in tweets. Tag-Latent Dirichlet Allocation (TLDA) [29] presents a topic modeling approach that extends Latent Dirichlet Allocation by incorporating the observed hashtags in the generative process. In TLDA, a hashtag is mapped into the form of a mixture of shared topics. This representation further enables the analysis of the relationships between the hashtags. Author Topic Model (ATM) [30] is another supervised topic model which uses author information for influencing topic modelling that models both document content and author interests. In fact, Tag-Topic model is a variant of ATM, in which tags of the blog documents instead of au-thors are used for supervision.

The proposed work is similar to the Author Topic Model in the respect that it also models the author influence on document generation. The advantage of the proposed model is that it can model those document collections in which there is only one author for each document. Performance of ATM is same as standard LDA in this situation. The proposed model is also similar to the Labelled-LDA in the respect that it modifies dirichlet hyper-parameters to model the author influence on topic distribution. Labelled-LDA uses a set of binary switches to select dirichlet hyper-parameters for selecting the topics related to the tags of the documents. Whereas the proposed model obtains dirichlet hyper-parameter values from the relationships among all the users. The proposed work is related to Word Network topic Model (WNTM) and Hashtag Graph topic Model (HGTM) in the respect that the proposed model also uses a graph of users as WNTM uses word graph and HGTM uses hashtag graph.

3. User graph based topic model

In this section, the proposed user graph topic model (UGTM) is presented. Firstly, a brief overview of latent dirichlet allocation (LDA) and one of its variant Author Topic Model (ATM), on which the proposed topic model is built, are described, then proposed topic model is presented.

3.1. Latent Dirichlet Allocation (LDA) and Author Topic Model (ATM)

Latent Drichlet allocation has been widely applied for topic modelling of text documents. In LDA, each document in the corpus is supposed to be a mixture of multiple topics. Each word of the document is sup-posed to come from a different topic. Each document has a multinomial distribution over a fixed number of topics and each topic has a multinomial distribution over a fixed vocabulary words. To generate a word in a document, the generative process of LDA first generates a topic according to the probability distribution of the document over topics and then generates a word according to the probability distribution of the selected topic over vocabulary words.

Author Topic Model (ATM) extents LDA by considering the authorship information of the documents. ATM is used for the documents when there are multiple authors for a single document. In ATM, each author of the document, instead of the document itself, has a multinomial distribution over topics and each topic has a multinomial distribution over words. In ATM, document is modeled as distribution over topics that is a mixture over distributions over authors. Clearly, ATM is same as LDA when each document has exactly one author.

3.2. Proposed user graph topic model

ATM does not improve the performance substantially on documents which has exactly one authors. There are documents like tweets which has exactly one author. Due to the social nature of twitter a tweet by one author is also affected by other authors. ATM fails to model the influence of other users on the tweet generation. UGTM assumes that the words of a document may also be generated by the topic distribution of other users who are not the author of the document. User Graph Topic Model (UGTM) is a probabilistic topic model that describes a generative process to model the collaborative influence of authors on documents. The block diagram of UGTM is shown in Fig. 1.

Fig. 1

Block Diagram of UGTM.

Let the corpus has N documents. Each document has N_d words w_d1, w_d2, …, w_{dN
_d}. The author of the document d is u_d.

Traditional LDA assumes a fixed number of hidden topics K in the whole corpus. To generate words in a document, it first generates a latent topic and then generates word from that topic. UGTM uses two la-tent variables for the generation of a word. It first generates a latent user variable using the author’s multinomial distribution over all users. Each user has multinomial distribution over topics. UGTM then generates a latent topic variable from the generated user’s topic distribution and then generates word from the generated topic’s word distribution.

UGTM supervises user latent variable generation. Supervision information is taken from user graph which specifies the relationships among users. A user graph is an undirected weighted graph in which nodes are users and edges are obtained from some relationship among users. One relationship between user u_i and u_j is the usage of common hashtags by them in the entire corpus, which shows a kind of semantic relationship between user u_i and u_j. The weight of the edge (u_i, u_j) is the number of common hashtags between u_i and u_j. Several of such relationship can be defined using tweet contextual information e.g. URLs, replies, mentions etc. User graph is represented as two dimensional matrix G_U×U where U is the number of users in the corpus. Let G_u be the row of the matrix G for user u.

UGTM uses dirichilet-multinomial distribution pair to model user, topic and word generation. Dirichilet distribution is the conjugate prior of the multinomial distribution. The effect of using dirichlet distribution as prior for multinomial distribution is the addition of dirichilet prior count to the multinomial counts in the posterior distribution. For this reason dirichlet prior counts are also called pseudo-counts, which provides a smoothing effect on multinomial parameters and exhibits the polya-urn’s rich get richer behavior [42].

3.2.1. Dirichlet hyper-parameter γ for user distribution

In UGTM, user graph is used to find the dirichlet prior for the user generation. In traditional LDA and its other variants, usually a symmetric dirichlet prior is used which uses a single vale for all dirichlet parameters. UGTM uses supervision information from user graph to find the dirichlet prior for user latent variable generation. In the user graph G, each row G_u describes the relationships between the user u and all other users. This relationship information for user u i.e. row G_u of user graph can be used as prior information for guiding to find the other related users who may affect the generation of the document. The user graph matrix row G_u is used to find the dirichlet prior parameters for user latent variable generation. The values in row G_u normalized by the largest value in that row are used as dirichlet prior parameters for multinomial distribution of user latent variable.

Note that there is no loops i.e. an edge from a user node to itself in the user graph, resulting in the zero prior value for the author itself. But prior value for the author must be largest. A parameter π is introduced to make the dirichlet priors correct. Let S_u be the sum of all values in the original row G_u and g_uu′ be the relationship value between u and u’ in that row. The prior for the author u is kept as π × S_u. The prior for other users u’ are kept as (1 - π) × g_uu′. This increases the prior belief that most of the words in the document are generated by the author of the document.

To choose the value of π, experiments are conducted for different values of π. Large values of π (>0.9) will reduce the effect of user graph and make UGTM close to Author Topic Model. Low values of π will take the effect of other users more than the actual user of the tweet. Any value of π greater than 0.5 give preference to the author of the tweets. The value 0.7 is empirically chosen because it gives best results.

3.2.2. Generative process of tweets in UGTM

Let the author of the document d be u_d. For each word position w_{d
_i} in document d, user u_{d
_i} and topic z_{d
_i} latent variables are generated. First, user u_{d
_i} of the word w_{d
_i} is generated. User u_{d
_i} is generated by drawing from the u_d’s multinomial distribution ψ_{u
_d} over all users with dirichlet hyper-parameter γ_{u
_d}. Each user has a multinomial distribution over topics with dirichlet hyper-parameter α. Topic z_{d
_i} is drawn from user u_{d
_i}’s topic distribution. Each topic has a multinomial distribution over terms with dirichlet hyper-parameter β. Finally, Word w_{d
_i} is drawn from topic z_{d
_i}’s term distribution. By assigning a user latent variable, each user has a contribution to the topic distribution of the document. The plate notation of graphical model of UGTM is depicted in Fig. 1.

Fig. 2

Plate Notation of graphical Model of UGTM.

The generative process for UGTM is described as follows: Φ ∼Dirichlet(β)

θ ∼ Dirichlet(α)

ψ_{u
_d} ∼ Dir(γ_{u
_d})

u| u_d∼Multinomial(ψ_{u
_d})

z ∼Multinomial(θ_u)

w ∼ Multinomial(Φ_z)

Dirichlet hyper-parameters α, β are predefined. Dirichlet hyper-parameter γ is derived from the user graph G (Section 3.2.1). Let γ_u be the dirichlet prior for user u. Number of topics K and number of users U are fixed.

For each of the user u = 1: U, draw ψ_u∼Dir(γ_u)

For each of the user u = 1: U, draw θ_u∼Dir(α)

For each of the topic z = 1: K, draw Φ_z∼Dir(β)

For each document d = 1: N, draw document length N_d

Let u_d be the author of document d and dirichlet hyper-parameter for user u_{d
_i} be γ_{u
_{d
_i}}.

For each word w_{_di} i = 1: N_d of document d:

Draw u_{d
_i} for w_{_di}, u_{d
_i}∼Multinomial(ψ_{u
_d} | γ_{u
_d})

Draw z_{_di} for u_{_di}, z_{_di}∼Multinomial(θ_{u
_{d
_i}})

Draw w_{_di} for z_{_di}, w_{_di}∼Multinomial(Φ_{z
_{d
_i}})

3.3. Parameter estimation

UGTM is a probabilistic model that describes the generation of documents as defined in Fig. 2. The author u_d and words w_d of the document are observed. The Dirichlet hyper-parameters γ_u for each user u is derived from the user graph G. For each word in each document, a user and topic latent variables are sampled according to latent multinomial distribution parameters ψ and θ respectively. Now the task is to infer all the hidden variables based on the observed variables. Assuming all words of all the documents are drawn independently, the probability of generating the whole corpus is given by: $p (w | ψ, θ, Φ, u, z, G) = \prod_{d = 1}^{N} p (w_{d} | ψ, θ, Φ, u_{d}, G)$ (1)

Here, N is the number of documents in the corpus. Since all the words of a document are drawn independently, the probability of generating document d is given by:

$p (w_{d} | ψ, θ, Φ, u_{d}, G) = \prod_{i = 1}^{N_{d}} p (w_{d_{i}} | ψ, θ, Φ, u_{d}, G)$ Based on the generative process of UGTM, this can be written as: $= \prod_{i = 1}^{N_{d}} \sum_{u = 1}^{U} \sum_{k = 1}^{K} p (w_{d_{i}}, u_{d_{i}} = u, z_{d_{i}} = k | ψ, θ, Φ, u_{d}, G)$

$\begin{matrix} = \prod_{i = 1}^{N_{d}} \sum_{u = 1}^{U} \sum_{k = 1}^{K} p (w_{d_{i}} | z_{d_{i}} = k, Φ) \\ . p (z_{d_{i}} = k | u_{d_{i}} = u, θ) \\ . p (u_{d_{i}} = u | ψ) \end{matrix}$

Here, U is the number of users and K is the number of topics. $= \prod_{i = 1}^{N_{d}} \sum_{u = 1}^{U} \sum_{k = 1}^{K} Φ_{{kw}_{d_{i}}} θ_{uk} ψ_{u_{d} u}$ (2)

The exact inference of the above probability distribution s intractable. An approximate inference method called Gibbs Sampling is used for the solution. Gibbs Sampling [43] is a special case of Markov Cain Monte Carlo (MCMC) [44] simulation and provides simple algorithm for approximate inference in complex models. A collapsed Gibbs sampler is used which integrates out some of the parameters for model inference. In this application, model parameters Φ, θ and ψ are integrated out.

Adopting the process as described in [45], the full conditionals for the hidden variables z_{_di} and u_{_di} are derived after integrating out other parameters, the sample posterior distributions for z_{_di} and u_{_di} can be written as:

$\begin{matrix} p (u_{d_{i}} = u | U_{- i}, Z_{- i}) \propto \frac{n_{u_{d}, - i}^{u} + γ_{u_{d} u}}{\sum_{u = 1}^{U} (n_{u_{d}, - i}^{u} + γ_{u_{d} u})} \\ \times \frac{n_{u, - i}^{k} + α}{\sum_{k = 1}^{K} (n_{u, - i}^{k} + α)} \end{matrix}$ (3)

$\begin{matrix} p (z_{d_{i}} = k | Z_{- i}, W_{- i}) \propto \frac{n_{u, - i}^{k} + α}{\sum_{k = 1}^{K} (n_{u, - i}^{k} + α)} \\ \times \frac{n_{k, - i}^{t} + β}{\sum_{t = 1}^{T} (n_{k, - i}^{t} + β)} \end{matrix}$ (4)

After sufficient number of sampling iteration, the Gibbs sampling chain converges. The parameters Φ, θ and ψ can be obtained using following equations:

$ψ_{u u^{'}} = \frac{n_{u^{'}}^{u} + χ_{u u^{'}}}{\sum_{u = 1}^{U} (n_{u^{'}}^{u} + χ_{u u^{'}})}$ (5)

$θ_{uk} = \frac{n_{u}^{k} + α}{\sum_{k = 1}^{K} (n_{u}^{k} + α)}$ (6)

$Φ_{kt} = \frac{n_{k}^{t} + β}{\sum_{t = 1}^{T} (n_{k}^{t} + β)}$ (7)

The Gibbs sampling algorithm is described in Algorithm 1.

Algorithm 1 Gibbs sampling algorithm for UGTM

Input: Iterations I, Topics K, User Graph G, α, β, γ,

All D documents with document d having N_d words

C^UU: User-User Count Matrix C^UK: User-Topic

Count Matrix C^KT: Topic-Term Count Matrix

Output: Parameters φ, θ and ψ Randomly Initialize

Count Matrices C^UU, C^UK, C^KT

For i = 1 to I Do

For d = 1 to D Do

For j = 1 to N_d Do

Decrement counts for the current word

Draw user u_{d
_j} according to equation 3

Draw user z_{d
_j} according to equation 4

Update Count Matrices C^UU, C^UK, C^KT

End For

Calculate parameters ψ, θ and φ according to

equations 5, 6 and 7 respectively

End For.

Once the parameters has been inferred, a new documents topic distribution can also be inferred using the estimated parameters. For a new document with its author and words known, its topics can be inferred using the same sampling algorithm using the parameters learned by algorithm 1. The algorithm to infer the topics of a new document is shown in Algorithm 2:

Algorithm 2 UGTM inference for a new Tweet

Input: Iterations I, User Graph G, α, β, γ, u_d, w_d

Output: u_d and z_d for all the words w_d

Randomly initialize the user and topic assignments

For i = 1 to I Do

For j = 1 to N_d Do

Draw user u_{d
_j} according to equation 3

Draw user z_{d
_j} according to equation 4

Update u_{d
_j} and z_{d
_j}

End For

Return all u_{d
_j} and z_{d
_j}

4. Experimental results

This section describes the evaluation methods and experimental re-sults.

4.1. Datasets

The dataset is collected using Twitter Streaming API [46] during April, 2018. Twenty disjoint trending topics are selected to fetch tweets from twitter. For each of the trending topic, certain number of tweets are fetched. Total number of tweets in the dataset is around 193K after discarding retweets. The trending topic of each tweet is assigned as class label of that tweet. The dataset is randomly divided into 80% training set and 20% test set. Topic label distribution in both training and test set is preserved.

Fig. 3

Purity and NMI Score.

4.2. Preprocessing

Firstly, the data collected from Twitter are processed to remove the quirks of the social nature of Twitter. A raw tweet may have any sort of slangs, punctuations and hyphenation. After tokenizing, all stop-words, expressions, mentions and URLs are removed in the Tweet text. HTML characters are escaped and slangs are modified using a slang dictionary. Only those tweets are retained which has more than 3 tokens after preprocessing.

4.3. Evaluation criterion

The proposed model is evaluated on two tasks. First task is tweet clustering. The performance of proposed model is evaluated on how well it clusters the tweets. Since the true labels of the tweets are available, Purity and Normalized Mutual Information (NMI) [47] are used for clustering evaluation. Clustering quality is good if value of Purity and NMI are large. H-Score [19] measure is also used which is the ratio of intra-cluster distance to the inter-cluster distance. Lesser the H-score values, better the performance.

For Purity and NMI, Each topic is considered as a cluster. To assign a tweet to a cluster, cosine similarity is calculated between vector space model representation of the tweet and topic words distribution. Each tweet is assigned to exactly one cluster with maximum similarity.

For H-score calculation, each tweet’s word-topic count vector is found and normalized to make it a vector. For H-score calculation, we select 20 popular hashtags and each hashtag’s tweet-set is considred as a topic. Note that, one tweet may be in more than one topic if it has more than one selected hashtags. Second task is evaluation of topic quality, which is measured using Pointwise Mutual Information (PMI) score [48], which agrees with human-judged topic coherence. PMI score uses point wise mutual information using an external knowledge-base. In our experiment, English Wikipedia of size 3.1 M articles is used for calculating PMI score. Top twenty words of each topic are considered for the evaluation of PMI-score.

4.4. Results & discussion

Results for both the evaluation tasks are presented in the following sub-sections. Three kind of user relationships have been used to construct user graph. The proposed model is denoted as UGTM-H when hashtags are used to construct user graph. The model is denoted as UGTM-M and UGTM-R when user mentions and replies are used for graph construction respectively.

UGTM results are compared with ATM [30] and HGTM [23]. HGTM uses hashtag relationship among tweets to model the tpoics. Work in [23] shows that it outperforms all hashtag based topic models on tweet and hashtag clustering tasks. We have implemented HGTM-T algorithm as described in [23].

4.4.1. Tweet Clustering

The Purity, NMI and H-score results for the dataset are shown in Figs. 4 and 5. The empirical results shows that all variants of UGTM perform better than ATM on all three evaluation measures purity, NMI and H-score. Among the three UGTM variants, UGTM-H perform better than UGTM-M which performs better than UGTM-R. It shows that hashtag based co-occurrence relationship shows stronger semantic relationship among users than relationship based on user mentions and tweet replies. One reason for this is that the experimental dataset is collected using hashtag based keywords and most of the users have used the popular hashtags and are related with each other based on those hashtags. Also the number of tweets mentioning users (41.1 %) and replying other tweets (14.8 %) are much less than the number of tweets having hashtags (53.3 %). UGTM-M and UGTM-R are outperformed by HGTM. but the performance of UGTM-H is at par with HGTM. In fact, UGTM-H performs slightly better than HGTM on H-Score. Note that HGTM is trained on only those tweets which has at least one hashtag and tested on entire test set which also includes tweets that does not include any hashtags. When HGTM is tested on only those test tweets which has some hashtags, the purity, NMI and H-Score values comes out to be 0.65, 0.51 and 0.41 respectively.

Fig. 4

H-Score.

Fig. 5

PMI Score.

Table 2

Top Twenty Words for Each Algorithm

Topic	UGTM-H	UGTM-M	UGTM-R	HGTM	ATM
ipad	ipad apple iphone professional education pencil inch mini support ios air event price free unveils	ipad iphone bad application celebrate giveaway country update detox work money store ago ipads win	ipad apple pencil asifa buy system house year pretty white money mini god job daughter	ipad apple iphone pencil education professional inch ios support mini event application air unveils price	ipad apple support iphone professional pencil education inch air exposed ios price apps unveils tablet
theatreday	day theatre happy bring friends favourite tuesday celebrating celebrate year awesome public lucky love deal	happy good theatre price breaking feel guys friends start expulsion point wait hard days deal	No topic	happy theatre day celebrate today love world stage show celebrating life great amazing years favourite	happy day theatre pet amazing pets great remember yesterday late missed live fun left forward
Champions League	champions league man real liverpool city semi final madrid roma finals click uefa barcelona ahead	league champions win liverpool madrid uefa won set teams article tweet yesterday coutinho premier bayern	league champions win final liverpool semi won madrid finals real season uefa quarter messi roma	champions league liverpool win real semi madrid final uefa city barcelona man finals cognizant manchester	champions league final win media man liverpool semi finals social started silent followers goal roma
Kim Jong Un	kim jong south north home moon korean breaking leader part jae meet history korea proud	kim jong south north korean korea history leader live makes peace moment meet meeting share	kim jong north south korean korea leader moon jae live moment historic war enter bought	kim jong south moon korean north jae korea leader historic summit plane military president joint r	south president kim jong north read report moon leader medical found start meet history jae
Asaram	asaram bapu innocent case pocso misuse jail bogus swamy justice proof fake due frame conspiracy	asaram bapu innocent case pocso media jail stop bogus swamy proof girl allegation called support	asaram bapu innocent case bogus jail years proof swamy saint bapuji sant judiciary single legal	bapu asaram innocent pocso misuse bogus justice swamy due proof frame conspiracy allegation framed law	asaram bapu innocent pocso misuse jail girl due law proof frame injustice innocents framed sheer

4.4.2. Topic Coherence

The PMI for the dataset is shown in Fig. 5. Results shows that all variants of UGTM outperforms ATM on PMI-score. PMI-score for UGTM-H and UGTM-M are better than ATM whereas UGTM-R results are comparable to ATM. UGTM-R and ATM discovers almost similar topic words. The performance UGTM-H is same as HGTM on PMI-Score. When HGTM is tested on only those test tweets which has some hashtags, the PMI-Score comes out to be 1.05.

The top twenty words for each algorithm for five selected topics are shown in table 2. Study of topical words presented in table 2 shows that words discovered by UGTM-H and UGTM-M are more coherent than UGTM-R and ATM. UGTM-R and ATM discovered mostly unrelated words for some topics.

5. Conclusion

Tweet topic modelling requires extra effort due to the sparsity and informality problem. But it allows the contextual and auxiliary data available with tweets to be used to semi-supervise the topic modelling. In this paper, a user graph topic model is presented which uses the user relationships from a user graph which is obtained by finding the relationship among tweet users based on hashtags, user mentions and replies. The proposed topic model, differently from the Author Topic Model, describes the effect of those users on a tweet generation who are not the author of tweet. The paper also shows that the dirichlet prior can be obtained from the user graph to supervise the user generation so that it is restricted to only related users. We also compare proposed UGTM algorithm with state of the art HGTM algorithm. HGTM outperforms UGTM-M and UGTM-R. The performance of UGTM-H is comparable to HGTM. Note that we have used a smaller training dataset containing only those tweets which has atleast one hashtag for executing HGTM. For this reason, hashtag relationships is rich in HGTM and it performs well. Only UGTM-H provides at par results with HGTM. Nevertheless, our goal was to model the effect of users on topics which UGTM fulfills as it provides better results than ATM. User graph topic model enhances semantic relationships among tweet users which is reflected in semantic relationships among tweets in the tweet generation process. Empirical results show that the proposed topic model outperforms standard ATM on tweet clustering and topic coherence evaluation.

References

D.M.

Blei , Probabilistic topic models, Communications of the ACM 55(4) (2012), 77–84.

Wang , et al., Multi-document summarization using sentence-based topic models, In Proceedings of the ACLI-JCNLP 2009 Conference Short Papers. Association for Computational Linguistics, 2009.

Yang , et al., A novel contextual topic model for multi-document summarization, Expert Systems with Applications 42(3) (2015), 1340–1352.

Akhtar ,

Javed and

Ahmad , Hierarchical summarization of text documents using topic modeling and formal concept analysis, Data Management, Analytics and Innovation 2 (2018), 21.

Akhtar , Hierarchical summarization of news tweets with twitter-lda, In Applications of Soft Computing for the Web, Springer, 2017, pp. 83–98.

Lin and

He , Joint sentiment/topic model for sentiment analysis, In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009.

Akhtar ,

Zubair ,

Kumar and

Ahmad , Aspect based sentiment oriented summarization of hotel reviews, Procedia Computer Science 115 (2017), 563–571.

Zhai , et al., Constrained lda for grouping product features in opinion mining, In Pacific-Asia Conference on Knowledge Discovery and Data Mining Heidelberg, Berlin, Springer, 2011.

K.T.

Ho ,

Q.V.

Bui and

Marc , Dynamic social network analysis using author-topic model, In International Conference on Innovations for Community Services, Cham, 2018.

10.

Hu ,

Zhai ,

Edelman and

Boyd-Graber , Polylingual tree-based topic models for translation domain adaptation association for computational linguistics, 2014.

11.

Hong and

B.D.

Davison , Empirical study of topic modeling in twitter, In Workshop on Social Media Analytics (2010).

12.

Yan ,

Guo ,

Lan and

Cheng , A biterm topic model for short texts, In Proceedings of the 22nd international conference on World Wide Web, ACM, 2013, pp. 1445–1456.

13.

Tang ,

Wang ,

Gao ,

Hu and

Liu , Enriching short text representation in microblog for clustering, Frontiers of Computer Science in China 6(1) (2012), 88–101.

14.

Tang ,

Xia ,

Wang ,

Lau and

Zheng , Clustering tweets using wikipedia concepts, In LREC, 2014, pp. 2262–2267.

15.

Bicalho ,

Pita ,

Pedrosa ,

Lacerda and

G.L.

Pappa , A general framework to expand short text for topic modeling, Information Sciences 393 (2017), 66–81.

16.

Mehrotra ,

Sanner ,

Buntine and

Xie , Improving lda topic models for microblogs via tweet pooling and automatic labeling, In SIGIR (2013).

17.

Alvarez-Melis and

Saveski , Topic modeling in twitter: Aggregating tweets by conversations, ICWSM 2016 (2016), 519–522.

18.

Akhtar and

M.M.S.

Beg , Improving microblog clustering: Tweet pooling schemes, CSI Journal of Computing (2018), 19–30.

19.

Yan , et al., A biterm topic model for short texts, In Proceedings of the 22nd International Conference on World Wide Web ACM, 2013.

20.

W.X.

Zhao , et al., Comparing twitter and traditional media using topic models, In European Conference on Information Retrieval, Heidelberg, Berlin, Springer, 2011.

21.

Zuo ,

Zhao and

Xu , Word network topic model: A simple but general solution for short and imbalanced texts, Knowledge and Information Systems 48(2) (2016), 379–398.

22.

J.D.

Mcauliffe and

D.M.

Blei , Supervised topic models, Advances in Neural Information Processing Systems, 2008.

23.

Wang , et al., Hashtag graph based topic model for tweet mining, Data Mining (ICDM) 2014 (2014).

24.

F.S.

Tsai , Atag-topicmodelforblog mining, Expert Systems with Applications 38(5) (2011), 5330–5335.

25.

Zhu ,

Blei and

Lafferty , Taglda: Bringing document structure knowledge into topic models. Technical report, TR-1553, University of Wisconsin, 2006.

26.

Li , et al., Tag-weighted dirichlet allocation, Data Mining (ICDM) 2013 (2013).

27.

Qiu ,

Zhu and

Jiang , It is not just what we say, but how we say them: Lda-based behavior-topic model, In Proceedings of the 2013 SIAM International Conference on Data Mining Society for Industrial and Applied Mathematics, 2013.

28.

Zhao , et al., A personalized hashtag recommendation approach using lda-based topic model in microblog environment, Future Generation Computer Systems 65 (2016), 196–206.

29.

Ma , et al., Tag-latent dirichlet allocation: Understanding hashtags and their relationships, In IEEE Computer Society, editor, Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)-Volume 01, 2013.

30.

Rosen-Zvi , et al., The author-topic model for authors and documents, In Auai Press, editor, Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.

31.

D.M.

Blei ,

A.Y.

Ng and

M.I.

Jordan , Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

32.

Thomas , Probabilistic latent semantic analysis. In Morgan Kaufmann Publishers Inc, editor, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.

33.

D.M.

Blei and

J.D.

Lafferty , Correlated topic models, In Proceedings of the 18th International Conference on Neural Information Processing Systems, 2005.

34.

D.M.

Blei and

J.D.

Lafferty , Dynamic topic models, In Proceedings of the 23rd International Conference on Machine Learning ACM, 2006.

35.

Chang and

David , Relational topic models for document networks, Artificial Intelligence and Statistics (2009).

36.

T.L.

Griffiths , et al., Hierarchical topic models and the nested chinese restaurant process, Advances in Neural Information Processing Systems (2004).

37.

Pedrosa , et al., Topic modeling for short texts with cooccurrence frequency-based expansion, Intelligent Systems (BRACIS) 2016 (2016), 5.

38.

Quan , et al., Short and Sparse Text Topic Modeling via Self-Aggregation, IJCAI (2015).

39.

Li , et al., Topic modeling for short texts with auxiliary word embeddings, In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval ACM, 2016.

40.

Ramage , et al., Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora, In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1 Association for Computational Linguistics, 2009.

41.

Ramage ,

S.T.

Dumais and

D.J.

Liebling , Characterizing microblogs with topic models, ICWSM 10 (2010), 1.

42.

Heinrich , Parameter estimation for text analysis, arbylon.net/publications/text-est.pdf, Web 2005.

43.

Griffiths , Gibbs sampling in the generative model of latent dirichlet allocation, 2002.

44.

W.R.

Gilks , Markov chain monte carlo, Encyclopedia of Biostatistics (2005).

45.

T.L.

Griffiths and

Steyvers , Finding scientific topics, Proceedings of the National Academy of Sciences 101(suppl 1) (2004), 5228–5235.

46.

Twitter streaming api, https://developer.twitter.com/en/docs.html. 16 August 2018.

47.

C.D.

Manning ,

Raghavan and

Schtze , Introduction to Information Retrieval, Cambridge University Press, 2008.

48.

Newman , et al., Automatic evaluation of topic coherence, Human Language Technologies: The 2010 (2010).