Abstract
With the prevalence of short texts, discovering the topics within them has become an important task. The Biterm Topic Model (BTM) is better suited to discovering topics in short texts than traditional topic models. However, challenges remain: BTM ignores document-topic semantic information and fails to capture the true intentions of users. In addition, it is a static method and cannot handle streaming short texts, where the topic distribution of a new text is needed as soon as it arrives. In order to keep document-topic information and obtain the topic distribution of a new short text at once, we propose a joint model based on online algorithms of Latent Dirichlet Allocation (LDA) and BTM, which combines the merits of both models. It not only alleviates sparsity in short texts with the online algorithm of BTM, namely the Incremental Biterm Topic Model (IBTM), but also keeps document-topic information with an extended LDA. Considering the differences between English and Chinese writing, we use combined words in short texts as key words to extend the length of short texts and keep the true intentions of users. Experimental results on two real-world datasets show that our method outperforms the baseline methods. Finally, we describe an application of our method to the task of discovering user interest tags.
Introduction
With the arrival of the era of big data and the development of the network, resources of all kinds, text resources in particular, are increasing explosively. Short texts are prevalent everywhere, for example in social media, news headlines, and short messages (SMS). How to dig out the potential value and the information that users are interested in from complex and disorganized text resources has become a hard problem. In recent years, probabilistic topic models have achieved great success in text mining tasks such as retrieval, summarization, categorization, and clustering, and they are becoming more and more important [33].
The traditional topic models mainly include Probabilistic Latent Semantic Analysis (PLSA) [16] and Latent Dirichlet Allocation (LDA) [4]. They are widely used to uncover the latent topics contained in corpora of long texts. Both rest on the bag-of-words assumption. Many researchers have extended them to solve further problems. For example, Cheng et al. [7] used a scene classification method based on the bag-of-visual-words (BoVW) representation, combining the PLSA model with a k-Nearest Neighbour (k-NN) classifier, for automatic landslide detection. Bassiou and Kotropoulos [3] proposed online PLSA, which assimilates new out-of-vocabulary words and discards old words that appeared exclusively within the context of a varying document stream, in order to obtain the latent topics of a new document. LDA is the Bayesian extension of PLSA; it is a fully generative approach to language modeling that overcomes the inconsistent generative semantics of PLSA [12]. Owing to its good generalization ability and extensibility, LDA has achieved great success in the text mining domain. The Tag-topic model [28], based on the Author-Topic model [26] and LDA [4], determines the most likely tags and words for given topics in a collection of blog posts. Kim and Shim [18] proposed a recommendation system for Twitter using probabilistic modeling based on LDA, which recommends the top-K users to follow and the top-K tweets to read for a user. Guo et al. [14] proposed an LDA-based approach to analyze tourist satisfaction according to ratings and reviews. With the rapid development of the online social media, taking Sina Weibo1
To alleviate the sparsity problem in short texts, researchers have proposed numerous ideas. For example, Weng et al. [30] aggregated the tweets published by each individual user into one document before training LDA. Hong and Davison [17] also aggregated the tweets containing the same words and showed that topic models trained on these aggregated messages work better than conventional LDA. Although these methods alleviate the sparsity problem, they ignore the fact that if the distribution over documents for each topic is heavily skewed, identifying topics from a small number of documents is extremely difficult for LDA [27]. Another way to deal with the problem is to assume that a short document covers only a single topic [24]. For example, Zhao et al. [35] utilized a mixture of unigrams [25] to discover topics from a representative sample of the entire Twitter. However, this assumption does not always hold, because a short text may cover several topics even though it is very short. A third way is to utilize extra information. For example, Yang et al. [32] utilized knowledge incorporation to solve the content sparsity problem and proposed a phrase topic model. Lim et al. [21] proposed the Twitter-Network (TN) topic model, a method for short text modeling that leverages the auxiliary information accompanying tweets; it uses hierarchical Poisson-Dirichlet processes (HPDP) for text modeling and a Gaussian process random function model for social network modeling. Li et al. [19] proposed a topic model for short texts enriched with auxiliary word embeddings, which learns background knowledge about word semantic relatedness from millions of external documents. However, using extra information increases the workload and brings in noise.
Different from the above research, the Biterm Topic Model (BTM) [31] has an advantage over LDA in dealing with short texts, because it learns topics by directly modeling the generation of word co-occurrence patterns in the corpus, making inference effective with rich corpus-level information. However, it loses the semantic information in the document.
In this paper, we introduce a joint model to mine streaming Chinese short texts, which tackles the sparsity problem and keeps the semantic information of the document. Standard LDA, proposed by Blei [4], suffers from the data sparsity problem when documents are extremely short. Unlike LDA, which learns topics from document-level word co-occurrences by modeling each document as a mixture of topics, BTM learns topics from co-occurring biterms in the corpus. Since short texts arrive as a stream and their volume keeps growing, it is not practical to use static topic models such as LDA and BTM, which must sample the whole corpus repeatedly to discover the topics of streaming short texts. Online LDA and online PLSA depend critically on the accuracy of the topics inferred in earlier training and do not tackle the sparsity problem. In order to obtain a more precise topic distribution of short texts, we apply a natural extension of LDA named online Twitter-User LDA (we used it to discover hashtags in English text [20]). It uses a computationally inexpensive online algorithm, namely the incremental Gibbs sampler [5], which updates the estimates of the topics immediately as each short text is observed. To suggest more precise topic words and alleviate the sparsity of short texts, we combine it with the incremental biterm topic model (IBTM) [8]. The result is a joint topic model that mines more precise topics.
In addition, most previous work mines information from English text, and only a little of it addresses Chinese. There are quite a few differences in preprocessing between English and Chinese. In English, stop words such as conjunctions, prepositions, and articles are removed from documents or sentences. Stemming [1] is also used to normalize words, although the utility of normalization has been criticized [10]. In Chinese, because of the difficulty of defining what constitutes a word [11] and the absence of white space between words, word segmentation is needed. Part-of-Speech (POS) tagging is also needed to reduce the noise introduced by Chinese word segmentation [34]. Zhang and Tsai [34] regarded nouns and verbs as the meaningful words and extracted them to mine novel Chinese text. Unlike Zhang and Tsai, we use a different method to select key words as input, which is introduced in Section 3.1.
The main contributions of our research are as follows. Firstly, our model can find more precise topics for each streaming Chinese short text. Secondly, our model can discover the topic distribution of each short text from both the document and the corpus perspectives. Thirdly, our model can capture the true intentions of users through the combined words extracted in preprocessing. Finally, our model can be applied to tasks such as discovering user interest tags and recommending hashtags for streaming short texts.
The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The details of our proposed method are given in Section 3. Section 4 presents the experimental results and a discussion of our proposed method and the baselines. Finally, we present conclusions and thoughts for future work.
In this section, we review the existing online algorithms for processing streaming data: online LDA and IBTM. First, we introduce online LDA and how to obtain the topic distribution of documents with it. Then, we review IBTM and how to obtain the latent topics of the corpus with it.
Online LDA
Online LDA [2] is an extension of LDA based on a modification of the batch Gibbs sampler. The essence of online LDA is the same as that of LDA. The difference is that when new streaming data arrives, online LDA samples the topic of each new word and obtains the topic distribution of the new data based on the previous training. Because topic discovery in online LDA is the same as in LDA during the initial phase, we briefly introduce LDA to explain the process of topic discovery.
LDA, a probabilistic generative model, utilizes a set of latent topics, each of which is a distribution over a fixed vocabulary, to describe the generative process of documents. In simple terms, each document in the corpus is modeled as a multinomial distribution
For each document: draw a topic distribution.
For each topic: draw a topic-specific word distribution.
For each word: draw a topic, then draw a word.
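The generative process above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy vocabulary, and the symmetric Dirichlet hyperparameters `alpha` and `beta` are all chosen for the example.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(rng, vocab, num_topics, doc_lengths, alpha=0.5, beta=0.1):
    """Generate documents by following the LDA generative process."""
    # One word distribution per topic.
    phi = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(num_topics)]
    docs = []
    for length in doc_lengths:
        # One topic distribution per document.
        theta = sample_dirichlet(rng, alpha, num_topics)
        words = []
        for _ in range(length):
            z = rng.choices(range(num_topics), weights=theta)[0]  # draw a topic
            w = rng.choices(vocab, weights=phi[z])[0]             # draw a word
            words.append(w)
        docs.append(words)
    return docs

# Toy vocabulary and sizes are illustrative assumptions.
rng = random.Random(0)
vocab = ["exam", "english", "wedding", "dress", "music", "concert"]
docs = generate_corpus(rng, vocab, num_topics=2, doc_lengths=[5, 8, 3])
```

Topic inference runs this process in reverse: given only the words, it estimates the hidden topic of each word and, from those, the per-document topic distribution.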
Figure 1a illustrates the graphical representation of LDA. Following the above procedures, we can know the topic
In the equation,
In training, online LDA uses collapsed Gibbs sampler [13] to get the topic distribution. When a new document arrives, online LDA applies Eq. (1) to sample the topics of each new word and get the topic distribution of the new document.
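As an illustration of how topics can be sampled for a new document while the trained model is held fixed, the sketch below uses the textbook collapsed Gibbs conditional for LDA, in which the weight of topic k for a word is proportional to (document-topic count + alpha) × (topic-word count + beta) / (topic total + V × beta). The function name, the count structures, and the hyperparameter values are assumptions for this example, and the notation may differ from that of Eq. (1).

```python
import random

def infer_new_document(rng, doc, topic_word, topic_total, vocab_size,
                       alpha=0.5, beta=0.01, sweeps=20):
    """Estimate the topic distribution of a new document.

    topic_word[k] maps a word to its count under topic k in the trained
    model, and topic_total[k] is the total word count of topic k. Both
    stay fixed, so only the new document's assignments are sampled.
    """
    num_topics = len(topic_word)
    doc_topic = [0] * num_topics
    assignments = []
    for _ in doc:  # random initial topic for every new word
        k = rng.randrange(num_topics)
        assignments.append(k)
        doc_topic[k] += 1
    for _ in range(sweeps):
        for i, word in enumerate(doc):
            doc_topic[assignments[i]] -= 1  # exclude the current word
            weights = [
                (doc_topic[k] + alpha)
                * (topic_word[k].get(word, 0) + beta)
                / (topic_total[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights)[0]
            assignments[i] = k
            doc_topic[k] += 1
    norm = len(doc) + num_topics * alpha
    return [(doc_topic[k] + alpha) / norm for k in range(num_topics)]

# Hypothetical trained counts for two topics over a four-word vocabulary.
topic_word = [{"exam": 40, "english": 30}, {"wedding": 35, "dress": 25}]
topic_total = [70, 60]
theta = infer_new_document(random.Random(1), ["exam", "english", "exam"],
                           topic_word, topic_total, vocab_size=4)
```

Because the trained counts are never modified here, the cost of handling a new document depends only on its length, the number of topics, and the number of sweeps.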
The graphical representation of LDA (a) and BTM (b).
IBTM is the extension of BTM; it not only learns the topic distribution of the co-occurring biterms but also updates the model continuously. Before introducing it, we briefly review its base model, BTM.
Conventional topic models, such as PLSA and LDA, model a document as a mixture of topics and reveal topics by implicitly capturing document-level word co-occurrence patterns. Because of this, they suffer from a severe data sparsity problem when applied directly to short texts. BTM outperforms these two models in tackling the sparsity problem of short texts. Unlike LDA, BTM strengthens the learning of the topic model with biterms, which are unordered co-occurring word pairs in short texts. In addition, it samples topics iteratively, using the rich information of the whole corpus to infer the topic distribution of the whole corpus. BTM not only keeps the correlations between words but also retains the independence of each document. BTM can infer the topic distribution of a document from the topic-word distributions in the corpus. The generative process of BTM is as follows:
Draw a topic distribution.
Draw a topic-specific word distribution.
Draw a topic assignment.
Draw two words.
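The following sketch illustrates the two ideas behind BTM: extracting unordered biterms from a short text, and generating biterms from a single corpus-level topic distribution. It is an illustration only; the function names, the toy vocabulary, and the hyperparameters `alpha` and `beta` are assumptions for the example.

```python
import random
from itertools import combinations

def extract_biterms(doc):
    """Return every unordered word pair that co-occurs in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(doc, 2)]

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_biterms(rng, vocab, num_topics, num_biterms, alpha=1.0, beta=0.1):
    """Generate biterms by following the BTM generative process."""
    theta = sample_dirichlet(rng, alpha, num_topics)  # one distribution for the corpus
    phi = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(num_topics)]
    biterms = []
    for _ in range(num_biterms):
        z = rng.choices(range(num_topics), weights=theta)[0]  # topic assignment
        wi, wj = rng.choices(vocab, weights=phi[z], k=2)      # two words, same topic
        biterms.append((wi, wj))
    return biterms

pairs = extract_biterms(["wedding", "dress", "beauty"])
sampled = generate_biterms(random.Random(0), ["exam", "english", "dress"], 2, 4)
```

A text with n words yields n(n-1)/2 biterms, so even a three-word text contributes three co-occurrence observations to the corpus-level model.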
Figure 1b illustrates the graphical representation of BTM. In order to solve the intractable problem of estimating the parameters of BTM exactly, Gibbs sampling as a simple and effective strategy for estimating
Here
IBTM is an extension of BTM, it adopts incremental Gibbs sampler to update the parameter
IBTM (incremental biterm topic model) algorithm
Update
Generate rejuvenation sequence
Update
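The update, rejuvenate, update pattern of an incremental Gibbs sampler for biterms can be sketched as follows. This is a simplified illustration under several assumptions: the class and method names are hypothetical, the conditional uses a common form of the BTM sampling weight, and the rejuvenation sequence is drawn uniformly from all biterms seen so far, which may differ from the scheme used in IBTM.

```python
import random

class IncrementalBTM:
    """Toy incremental Gibbs sampler over a stream of biterms."""

    def __init__(self, num_topics, vocab_size, alpha=1.0, beta=0.01,
                 rejuvenation=5, seed=0):
        self.K, self.V = num_topics, vocab_size
        self.alpha, self.beta = alpha, beta
        self.rejuvenation = rejuvenation
        self.rng = random.Random(seed)
        self.topic_count = [0] * num_topics             # biterms per topic
        self.word_count = [{} for _ in range(num_topics)]  # word counts per topic
        self.biterms = []       # every biterm observed so far
        self.assignments = []   # current topic of each observed biterm

    def _draw_topic(self, biterm):
        wi, wj = biterm
        weights = []
        for k in range(self.K):
            # Each biterm adds two words, so topic k holds 2 * topic_count[k] words.
            total = 2 * self.topic_count[k] + self.V * self.beta
            weights.append((self.topic_count[k] + self.alpha)
                           * (self.word_count[k].get(wi, 0) + self.beta)
                           * (self.word_count[k].get(wj, 0) + self.beta)
                           / (total * total))
        return self.rng.choices(range(self.K), weights=weights)[0]

    def _update(self, biterm, k, sign):
        self.topic_count[k] += sign
        for w in biterm:
            self.word_count[k][w] = self.word_count[k].get(w, 0) + sign

    def observe(self, biterm):
        # Draw a topic for the new biterm and update the counts.
        k = self._draw_topic(biterm)
        self._update(biterm, k, +1)
        self.biterms.append(biterm)
        self.assignments.append(k)
        # Rejuvenation: resample a few previously seen biterms.
        n = len(self.biterms)
        for j in self.rng.sample(range(n), min(self.rejuvenation, n)):
            self._update(self.biterms[j], self.assignments[j], -1)
            new_k = self._draw_topic(self.biterms[j])
            self.assignments[j] = new_k
            self._update(self.biterms[j], new_k, +1)

model = IncrementalBTM(num_topics=2, vocab_size=4)
for b in [("exam", "english"), ("dress", "wedding"),
          ("exam", "english"), ("dress", "wedding")]:
    model.observe(b)
```

The `rejuvenation` parameter trades speed for accuracy: with zero resampled biterms each arrival is processed once and never revisited, while larger values approach repeated batch sampling.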
The framework of discovering topic distribution of a new short text.
In this section, our architecture for discovering topics in Chinese streaming short texts is described in detail. The whole framework is shown in Fig. 2. Given the streaming short texts, we first preprocess the raw short texts, for example by removing stop words. Then we use the extended LDA to infer the topic distribution of streaming short texts from the document perspective and obtain the document semantic information. In addition, we use IBTM to estimate the topic-word distribution of streaming short texts and obtain word information in the dynamic corpus. Finally, we set a threshold
First, we introduce the preprocessing for Chinese short texts. There are many differences between English and Chinese text in preprocessing, and we mainly describe what we did to preprocess Chinese short texts for our experiments. Then, we introduce a new extended model based on LDA and show how to use it to deal with streaming short texts as they are posted. Last but not least, we outline the joint model based on the extended LDA and IBTM, which discovers the topic distribution when a new Chinese short text arrives.
Preprocessing for Chinese short text
In Chinese text, the word is the smallest independent meaningful element, and there is no boundary between words. In order to capture the true intentions of users from texts, word segmentation is needed. Although many researchers have worked to improve the accuracy of segmentation, no tool provides a complete solution for Chinese textual analysis. We decided to use jieba,3
We agree with Zhang’s point that nouns and verbs are meaningful words, but our approach differs from Zhang’s. In order to reduce the errors in word segmentation and capture the deeper meaning intended by users, we collect combined words in the document based on their positions, such as adjective-noun, verb-noun, and noun-noun pairs. The procedure for extracting our meaningful words is outlined in Algorithm 2. Furthermore, after selecting words, we require that each short text contains at least two keywords and the frequency of each word is 1. Next, we illustrate this point using some example short texts.
Example short texts and their key words in Sina Weibo
Table 1 lists 2 example short texts from Sina Weibo in the left-hand column (i.e., “Trailing wedding dress is unparalleled beauty”, “
Extracting combined words algorithm
Words and part-of-speech tagging in a document. The length of them is
If the word is adjective or verb, or noun before any nouns
Get the combined words;
Count the frequency of combined words;
Else
Select nouns or verbs and count them; Sort the selected words in descending order. The length of sorted list is
Get the meaningful words
If the two words are adjacent in the document
Combine them;
Add the combined words into corpus;
Else
Add the former word into corpus.
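The core idea of the procedure above, joining an adjective, verb, or noun with the noun that follows it and otherwise keeping single nouns and verbs, can be sketched as below. This is a simplified illustration and not a transcription of Algorithm 2: the function name, the `joiner` parameter, and the tag prefixes `a`, `v`, and `n` (in the style of common Chinese POS tag sets) are assumptions, and the input is taken to be already segmented and tagged. English tokens are used for readability.

```python
from collections import Counter

HEAD_TAGS = ("a", "v", "n")  # adjective, verb, noun (assumed tag prefixes)

def extract_keywords(tagged_words, joiner=""):
    """Collect combined words and single nouns or verbs with their counts.

    tagged_words is a list of (word, pos_tag) pairs in document order.
    Returns (keyword, frequency) pairs sorted by descending frequency.
    """
    counts = Counter()
    i = 0
    while i < len(tagged_words):
        word, tag = tagged_words[i]
        nxt = tagged_words[i + 1] if i + 1 < len(tagged_words) else None
        if tag.startswith(HEAD_TAGS) and nxt is not None and nxt[1].startswith("n"):
            counts[word + joiner + nxt[0]] += 1  # adjacent pair: combine
            i += 2
        elif tag.startswith(("n", "v")):
            counts[word] += 1                    # keep a lone noun or verb
            i += 1
        else:
            i += 1                               # skip other parts of speech
    return counts.most_common()

tokens = [("beautiful", "a"), ("dress", "n"), ("very", "d"), ("buy", "v"),
          ("ticket", "n"), ("concert", "n"), ("happy", "a")]
keywords = extract_keywords(tokens, joiner=" ")
```

For Chinese input the default empty `joiner` concatenates the two words directly, so a combined word reads as a single longer term.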
Considering that the data is streaming and that a short text may express many topics, we propose a dynamic method based on LDA, the extended LDA, in order to obtain the dynamic topic distribution of each short text. LDA is a static model, and we think it is limited in practical use, because short texts change over time and the batch Gibbs sampler cannot update the model immediately when a new short text arrives [13, 8]. We extend this model with a computationally inexpensive online algorithm, namely the incremental Gibbs sampler [5], which updates the estimates of the topics immediately as each tweet is observed.
Like LDA, we think every document has
Here the
Extended LDA algorithm
Draw topic
Update
Generate fixed-length rejuvenation
Draw topic
Update
Discard the first Chinese short text in
Update
Add new Chinese short text to the end of
Compute
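The window bookkeeping in the last steps of the algorithm (discard the first short text, update, add the new short text to the end) can be illustrated with a fixed-length queue. The sketch below covers only the count maintenance and assumes topic assignments are supplied by a sampler; the class name and its fields are hypothetical.

```python
from collections import Counter, deque

class SlidingWindowCounts:
    """Topic-word counts restricted to the most recent short texts."""

    def __init__(self, window_size, num_topics):
        self.window_size = window_size
        self.window = deque()
        self.topic_word = [Counter() for _ in range(num_topics)]

    def add(self, words, topics):
        """Add a short text whose words already carry topic assignments."""
        if len(self.window) == self.window_size:
            old_words, old_topics = self.window.popleft()  # discard the first text
            for w, k in zip(old_words, old_topics):
                self.topic_word[k][w] -= 1                 # update the counts
                if self.topic_word[k][w] == 0:
                    del self.topic_word[k][w]
        self.window.append((words, topics))                # add to the end
        for w, k in zip(words, topics):
            self.topic_word[k][w] += 1

win = SlidingWindowCounts(window_size=2, num_topics=2)
win.add(["exam", "english"], [0, 0])
win.add(["wedding", "dress"], [1, 1])
win.add(["exam", "music"], [0, 1])  # evicts the first short text
```

Keeping the window at a fixed length bounds both memory and the cost of each update, regardless of how long the stream runs.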
In this work, the reason why we design fixed length of each
We present the details of our proposed method in this part. Discovering topics for streaming Chinese short texts involves two subtasks: determining the topic distribution of the documents from the document-topic distribution while keeping document semantic information; and determining the topic distribution in the corpus from co-occurring biterm information. For these subtasks, we use the extended LDA to reveal the topic distribution with document semantic information and IBTM, which we described in Section 2.2, to reveal the topic distribution with co-occurring biterms in the corpus.
When a new streaming Chinese short text arrives, we use extended LDA and IBTM to discover different topic distributions with different purposes. After having determined its topic distribution, we sort topics by their probabilities. Here topic
Model complexity
Because the joint model includes two subtasks, we explain the time complexity of each model. The time complexity of IBTM in sliding window
Experiments
In this section, we show the effectiveness of our model, which combines the extended LDA and IBTM, in discovering more precise topics and topic words on real-world streaming Chinese short texts. We perform all experiments on a machine with a 3.30 GHz CPU and 16 GB of RAM.
Datasets
The Sina Weibo dataset is a collection of 688738 microblogs posted by 239 users, each of whom posted more than 1000 microblogs on Sina Weibo (because most users often expressed their moods and did not use many meaningful words, where the meaningful words are nouns and verbs). For preprocessing, we first convert traditional Chinese characters into simplified Chinese characters, remove duplicate microblogs, and delete punctuation in the microblogs. Then, we use jieba to process these data. On this dataset, we use two methods to select key words: one follows Zhang [34], and the other is our method in Algorithm 2. Finally, we remove stop words from the key words. The statistics of the Sina Weibo dataset are shown in Table 2.
BaiduQA dataset contains 648514 questions crawled from a popular Q&A website.4
The statistics of Sina Weibo dataset
The statistics of BaiduQA dataset
To investigate the quality of topics discovered by all the test methods, we use a common metric to evaluate the results.
Baseline methods and parameter settings
We compare our method against the following two models unless explicitly specified elsewhere. Yan et al. [31] provided open-source implementation of IBTM via C++.5
In the experiment on Sina Weibo dataset, we set parameter
In BaiduQA dataset, according to the studies of papers [8, 19, 31], we set
Traditionally, topic models are evaluated with perplexity. Nevertheless, some researchers have argued that perplexity does not always agree with human judgment or reflect the semantic coherence of a topic [6, 22]. Another method is the coherence score [22], which measures the extent to which word pairs co-occur in a document and is consistent with the basic assumption of BTM [8]. In order to perform a more comprehensive analysis, we utilize Pointwise Mutual Information (PMI) [23] to evaluate the interpretability or semantic coherence of these topic models.
Given a topic
Where
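For illustration, a PMI-based coherence score for one topic can be computed as the average of log(p(w_i, w_j) / (p(w_i) p(w_j))) over all pairs of its most probable words, with probabilities estimated from document frequencies in a reference corpus. The sketch below follows this commonly used form; the function name, the smoothing constant `eps`, and the choice to give a pair a score of zero when one of its words never occurs are assumptions, and the exact definition used in the experiments may differ.

```python
import math
from itertools import combinations

def pmi_score(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words over a reference corpus."""
    doc_sets = [set(d) for d in documents]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    pairs = list(combinations(top_words, 2))
    total = 0.0
    for wi, wj in pairs:
        p_i, p_j = prob(wi), prob(wj)
        if p_i == 0 or p_j == 0:
            continue  # word unseen in the corpus: pair contributes 0 (assumption)
        total += math.log((prob(wi, wj) + eps) / (p_i * p_j))
    return total / len(pairs)

reference = [["exam", "english"], ["exam", "english"],
             ["wedding", "dress"], ["wedding", "dress"]]
coherent = pmi_score(["exam", "english"], reference)
incoherent = pmi_score(["exam", "wedding"], reference)
```

Words that always appear together score above zero, while words that never share a document are pushed strongly negative by the smoothing term, so a higher average indicates a more coherent topic.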
Topic coherence on Sina Weibo dataset.
Topic coherence on BaiduQA dataset.
Figures 3 and 4 show the PMI-Scores of all methods on both datasets, with the number of most probable words in each topic ranging from 5 to 20. The results demonstrate that our method with combined words achieves better PMI-Scores than the baseline methods and even than our method with nouns and verbs as input. It appears that combined words are beneficial for short text clustering. In addition, the PMI-Scores of IBTM are higher than those of online LDA and of our method with nouns and verbs. This follows from the essence of IBTM, which discovers the topic distribution of the corpus based on co-occurring biterms. Although our model with nouns and verbs as input is worse than IBTM, it is better than online LDA. The reason is that IBTM amends the extended LDA to obtain a more precise topic distribution with co-occurring biterms. Thirdly, although the PMI-Scores of dynamic NMF are higher than those of our method, its scores vary more than those of the other methods. This means that the stability of the co-occurring words in the topics of dynamic NMF is weak.
In order to further explain our model, we take as an example a BaiduQA question that is probably about “how to review English before senior high school entrance examination?”. Table 4 lists its document-topic information and topic-word distribution under different online methods. Here, in order to show that our joint model is better at topic discovery and to show the difference between the two subtasks, we present the results of both subtasks on this question. From the table we can tell that the question is about examinations, or the senior high school entrance examination, according to the top 5 of the most probable words under these online methods. However, the last 5 words in ordered 20 most probable words based on
Example of document-topic information and topic-words distribution of the new short text in BaiduQA dataset
With topic modeling, we can use a tag to represent a short text according to topic distribution
From the results, we can see that both IBTM and the extended LDA perform better with combined words as input than with Zhang’s method as input. This validates that combining verbs, adjectives, and nouns with nouns as input is beneficial for short text topic modeling. In addition, incremental Gibbs sampling is better than collapsed Gibbs sampling at calculating the posterior distribution of topic words, which improves the accuracy of short text classification. That is why the classification accuracy of online LDA is lower than that of IBTM and the extended LDA. It should be mentioned that our joint model is to choose the better topic distribution according to
Average classification accuracy of the methods on BaiduQA dataset
In this section, to show that our model can be applied to multiple tasks, we use our joint model to discover users’ dynamic interests based on their streaming Chinese microblogs on Sina Weibo. We randomly select 10 users from the 239 users in the Sina Weibo dataset. The statistics of the 10 users’ microblogs are shown in Table 6.
The statistics of 10 users’ microblogs
In this experiment, we compare our method against two online methods (online LDA and IBTM) and three static methods (Dirichlet Multinomial Mixture (DMM) Model [24],8
Topic model generated a ranked words based on
We define the hitrate@N: given top
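As a point of reference, a conventional precision@N for tag discovery is the fraction of the top N ranked tags that appear in a user's set of true interest tags. The sketch below implements this conventional form only; the function name and the example tags are hypothetical, and the metric defined in this section may differ in detail.

```python
def precision_at_n(ranked_tags, relevant_tags, n):
    """Fraction of the top-n ranked tags that appear in relevant_tags."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_tags[:n]
    hits = sum(1 for tag in top if tag in relevant_tags)
    return hits / n

# Hypothetical ranked output of a topic model and hand-labelled interests.
ranked = ["music", "concert", "exam", "dress", "travel"]
relevant = {"music", "concert", "travel"}
p2 = precision_at_n(ranked, relevant, 2)
p5 = precision_at_n(ranked, relevant, 5)
```

Dividing by N rather than by the number of tags returned penalizes a model that produces fewer than N candidate tags.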
Tables 7 and 8 list the precision of all methods in discovering interest tags of users’ streaming microblogs. We make the following observations.
Average precision@N of algorithms
In Table 7, firstly, our method achieves the highest precision@2 and precision@5, whether with verbs and nouns or with combined words as input; it significantly outperforms the online baseline methods and even exceeds some static models. This demonstrates that our joint model can discover more precise topics and interest tags. Secondly, our method with combined words is higher than our method with nouns and verbs in precision@2 but lower in precision@5. The reason is that a combined word represents two or three of the remaining words of a document, which reduces the precision under this evaluation method. Thirdly, online LDA is better than IBTM at finding interest tags in a new microblog. This demonstrates that although IBTM is better than online LDA at finding co-occurring words in the corpus, online LDA can find more precise interest tags in a short text than IBTM. In addition, the precision@N of LDA and BTM is higher than that of their respective online algorithms, but the user interest tags found by LDA and BTM are not distinctive. The reason is that the top 5 words in each topic of these models are affected by the high-frequency words in the corpus, and these words cannot serve as interest tags. Last but not least, DMM appears better than the other static models, which suggests that limiting the number of topics is beneficial for obtaining more precise interest tags.
In Table 8, we list one microblog each of users 2213526752 and 5103645868. The interest tags of the microblog are shown in bold. As we can see, most methods discover high-frequency words in the microblogs of user 2213526752, which gives an unclear picture of the user’s interests. From the table we can see that DMM (ARASHI) and our method (c-words) (Sakurai, Ninomiya, Matsumoto) are more precise on the tags. For user 5103645868, these methods all discover many interest tags, but our method (c-words) finds the most tags in top 10 topic-words based on
Example of users and their interest tags on Sina Weibo dataset
With the prevalence of short texts, topic models have become increasingly important for topic discovery. Facing massive data and the sparsity of short texts, we use a joint model to discover the topic distribution of streaming data in real time. In this paper, we first use an extended LDA to reveal the topic proportions of a new streaming Chinese short text. Then, considering the sparsity of short texts, we combine it with IBTM to obtain better topic proportions. Comparing the topic distributions from both models, our joint model returns the better topic distribution for the short text. Our results show that the joint model can discover a better topic distribution. In addition, we use a different method in the preprocessing step, which extends the short texts to alleviate sparsity, captures the true intentions of the user, and reduces the errors in word segmentation. The results support this idea. Finally, we use the joint model to process the short texts of a user to discover his or her dynamic interests. Experimental results demonstrate that our joint model with combined words can significantly outperform other models in discovering interest tags of users.
Because some words in a topic do not have semantic relevance, our future efforts will focus on incorporating semantic similarity algorithms and theories to discover more precise latent topics in Chinese short texts.
Acknowledgments
This paper is funded by the National Natural Science Foundation of China (Grant No: 61673235) and the National Key R&D Program Projects of China (Grant No: 2018YFC1707600).
