A joint model of extended LDA and IBTM over streaming Chinese short texts

Abstract

With the prevalence of short texts, discovering the topics within them has become an important task. The Biterm Topic Model (BTM) is better suited to discovering topics in short texts than traditional topic models. However, challenges remain: BTM ignores document-topic semantic information and therefore fails to capture the true intentions of users. In addition, it is a static method and cannot handle streaming short texts, where each new text must be processed as soon as it arrives. To keep document-topic information and obtain the topic distribution of a new short text at once, we propose a joint model based on online algorithms for Latent Dirichlet Allocation (LDA) and BTM, which combines the merits of both models. The online algorithm of BTM, namely the Incremental Biterm Topic Model (IBTM), alleviates sparsity when processing short texts, while the extended LDA keeps document-topic information. Considering the differences between written English and Chinese, we use combined words in short texts as key words to extend the length of short texts and keep the true intentions of users. Experimental results on two real-world datasets show that our method outperforms the baseline methods. Finally, we describe an application of our method to the task of discovering user interest tags.

Keywords

Streaming Chinese short text, topic discovery, topic models, online algorithms

1. Introduction

With the arrival of the era of big data and the development of the Internet, resources of many kinds, text resources above all, are increasing explosively. Short texts are prevalent everywhere: in social media, news headlines, short messages (SMS) and so on. How to extract the potential value, and the information that users are interested in, from complex and disorganized text resources has become a hard problem. In recent years, probabilistic topic models have achieved great success in text mining tasks such as retrieval, summarization, categorization and clustering, and they have become increasingly important [33].

The traditional topic models mainly include Probabilistic Latent Semantic Analysis (PLSA) [16] and Latent Dirichlet Allocation (LDA) [4]. They are widely used to uncover the latent topics contained in corpora of long texts, and both rest on the bag-of-words assumption. Many researchers have extended them to solve further problems. For example, Cheng et al. [7] used a scene classification method based on the bag-of-visual-words (BoVW) representation, combining the PLSA model with a k-Nearest Neighbour (k-NN) classifier, for automatic landslide detection. Bassiou and Kotropoulos [3] proposed online PLSA, which assimilates new out-of-vocabulary words and discards old words that appeared exclusively within the context of a varying document stream, in order to obtain the latent topics of a new document. LDA is the Bayesian extension of PLSA; it is a fully generative approach to language modeling that overcomes the inconsistent generative semantics of PLSA [12]. Owing to its good generalization ability and extensibility, LDA has achieved great success in the text mining domain. The Tag-topic model [28], based on the Author-Topic model [26] and LDA [4], determines the most likely tags and words for given topics in a collection of blog posts. Kim and Shim [18] proposed a recommendation system for Twitter, using probabilistic modeling based on LDA, which recommends the top-K users to follow and the top-K tweets to read for a user. Guo et al. [14] proposed an LDA-based approach to analyzing tourist satisfaction from ratings and reviews. Online social media has developed rapidly. Taking Sina Weibo¹ as an example, according to its earnings report for the fourth quarter of 2017, its monthly active users (MAU) had reached 392 million,² and the volume of short texts grows by at least millions of items. To cope with such massive data, online algorithms have been proposed for streaming data, such as online PLSA [3] and online LDA [2, 15] (which are based on different optimization algorithms). In essence, these conventional topic models use document-level word co-occurrence to reveal the latent topics of a corpus [29]. That is to say, they are better suited to summarizing a long text into several topic terms. For this reason, applying these models directly to short texts leads to a severe data sparsity problem [17, 31, 27].

¹ https://weibo.com/.

² http://www.chinaz.com/news/2018/0214/858326.shtml.

To alleviate the sparsity problem in short texts, researchers have proposed numerous ideas. For example, Weng et al. [30] aggregated the tweets published by each individual user into one document before training LDA. Hong and Davison [17] also aggregated the tweets containing the same words and showed that topic models trained on these aggregated messages work better than conventional LDA. Although these methods alleviate the sparsity problem, they overlook the fact that, if the distribution over documents for each topic is heavily skewed, identifying topics from a small number of documents is extremely difficult for LDA [27]. A second approach is to assume that a short document covers only a single topic [24]. For example, Zhao et al. [35] used a mixture of unigrams [25] to discover topics from a representative sample of the entire Twitter. However, this assumption does not always hold, because even a very short text may cover several topics. A third approach is to use extra information. For example, Yang et al. [32] used knowledge incorporation to solve the content sparsity problem and proposed a phrase topic model. Lim et al. [21] proposed the Twitter-Network (TN) topic model, a method for short text modeling that leverages the auxiliary information accompanying tweets; it uses hierarchical Poisson-Dirichlet processes (HPDP) for text modeling and a Gaussian process random function model for social network modeling. Li et al. [19] proposed a topic model for short texts enriched by auxiliary word embeddings, which learn background knowledge about the semantic relatedness of words from millions of external documents. However, using external information increases the workload and introduces noise.
Unlike the above approaches, the Biterm Topic Model (BTM) [31] has an advantage over LDA in dealing with short texts, because it learns topics by directly modeling the generation of word co-occurrence patterns in the corpus, which makes inference effective with the rich corpus-level information. However, it loses the semantic information in the document.

In this paper, we introduce a joint model for mining streaming Chinese short texts that tackles the sparsity problem and keeps the semantic information of the document. Standard LDA, proposed by Blei [4], suffers from the data sparsity problem when documents are extremely short. Unlike LDA, which learns topics from document-level word co-occurrences by modeling each document as a mixture of topics, BTM learns topics from the co-occurring biterms in the corpus. Since short texts arrive as a stream whose volume will keep growing, it is not practical to use a static topic model such as LDA or BTM, which must repeatedly sample the whole corpus, to discover the topics of streaming short texts. Online LDA and online PLSA depend critically on the accuracy of the topics inferred in earlier training and do not tackle the sparsity problem. To obtain a more precise topic distribution for short texts, we apply a natural extension of LDA named online Twitter-User LDA (which we previously used to discover hashtags in English text [20]). It uses a computationally inexpensive online algorithm, namely the incremental Gibbs sampler [5], which updates the topic estimates immediately as each short text is observed. To suggest more precise topic words and alleviate the sparsity of short texts, we combine it with the incremental biterm topic model (IBTM) [8]. The result is a joint topic model that mines more precise topics.

In addition, most previous research mines information from English text, and only a little of it addresses Chinese. Preprocessing differs considerably between the two languages. In English, stop words such as conjunctions, prepositions and articles are removed from documents or sentences. Stemming [1] is also used to normalize words, although the utility of such normalization has been criticized [10]. In Chinese, word segmentation is needed because of the difficulty of defining what constitutes a word [11] and the absence of white space between words. Part-of-Speech (POS) tagging is also needed to reduce the noise introduced by Chinese word segmentation [34]. Zhang and Tsai [34] considered nouns and verbs to be the meaningful words and extracted them to mine novel Chinese text. Unlike Zhang and Tsai, we use a different method to select the key words used as input, which is introduced in Section 3.1.

The main contributions of our research are as follows. Firstly, our model finds more precise topics for each streaming Chinese short text. Secondly, our model discovers the topic distribution of each short text from both the document and the corpus perspectives. Thirdly, our model captures the true intentions of users through the combined words extracted in preprocessing. Finally, our model can be applied to tasks such as discovering user interest tags and recommending hashtags for streaming short texts.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The details of our proposed method are given in Section 3. Section 4 presents the experimental results and a discussion of our method and the baselines. Finally, we present conclusions and thoughts for future work.

2. Related work

In this section, we review two existing online algorithms for processing streaming data: online LDA and IBTM. We first introduce online LDA and how the model yields the topic distribution of documents. We then review IBTM and how it yields the latent topics of the corpus.

2.1 Online LDA

Online LDA [2] is an extension of LDA based on a modification of the batch Gibbs sampler. In essence, online LDA is the same as LDA. The difference is that, when new streaming data arrive, online LDA samples the topic of each new word and obtains the topic distribution of the new data on the basis of the previous training. Because topic discovery in online LDA proceeds as in LDA during the initial phase, we briefly introduce LDA to explain the process.

LDA, a probabilistic generative model, uses a set of latent topics, each of which is a distribution over a fixed vocabulary, to describe the generative process of documents. In simple terms, each document in the corpus is modeled as a multinomial distribution $\theta$ over $K$ topics, where the value of $K$ is chosen by experiment. Each topic is modeled as a multinomial distribution $\phi$ over $V$ words, where $V$ is the number of distinct words in the corpus. $\alpha$ and $\beta$ are the Dirichlet hyper-parameters of the multinomial distributions $\theta$ and $\phi$ , respectively. The highest-probability words in each topic, together with the related documents, help users understand the topic distribution of a document. The generative process of LDA is as follows:

1. For each document $d\in\left\{1,\ldots,D\right\}$ :
   Draw a topic distribution $\theta_{d}$ for document $d$ : $\theta_{d}\sim\textit{Dirichlet}(\alpha)$
2. For each topic $z\in\left\{1,\ldots,K\right\}$ :
   Draw a topic-specific word distribution $\phi_{z}$ for topic $z$ : $\phi_{z}\sim\textit{Dirichlet}(\beta)$
3. For each word $n\in\left\{1,\ldots,N_{d}\right\}$ in document $d$ :
   Draw a topic $z_{dn}$ : $z_{dn}\sim\textit{Multi}(\theta_{d})$
   Draw a word $w_{dn}$ : $w_{dn}\sim\textit{Multi}(\phi_{z_{dn}})$

Figure 1a illustrates the graphical representation of LDA. Following the above procedure, the topic $z_{dn}$ of each word $w_{dn}$ is sampled according to Eq. (1).

$\displaystyle P(z_{dn}|\textit{rest})\propto(n_{kd}^{-dn}+\alpha_{k})\frac{n_{kw}^{-dn}+\beta_{w}}{\sum_{w}\beta_{w}+n_{k}^{-dn}}$ (1)

In the equation, $w$ is short for $w_{dn}$ ; $n_{kd}$ , $n_{kw}$ and $n_{k}$ denote, respectively, the number of co-occurrences of topic $k$ and document $d$ , the number of co-occurrences of topic $k$ and word $w$ , and the total number of words assigned to topic $k$ . The superscript $-dn$ indicates that the current word $w_{dn}$ is excluded from the count.

In training, online LDA uses a collapsed Gibbs sampler [13] to obtain the topic distribution. When a new document arrives, online LDA applies Eq. (1) to sample the topic of each new word and obtains the topic distribution of the new document.
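As an illustration of how Eq. (1) is used in one sampling step, the following is a minimal sketch rather than the implementation used in this work. It assumes symmetric scalar hyper-parameters (so that $\sum_{w}\beta_{w}=V\beta$), NumPy count arrays, and the hypothetical function names `lda_conditional` and `sample_topic`.

```python
import numpy as np

def lda_conditional(n_kd, n_kw, n_k, d, w, alpha, beta):
    """Normalized P(z_dn = k | rest) for every topic k, following Eq. (1).

    n_kd: (K, D) topic-document counts; n_kw: (K, V) topic-word counts;
    n_k: (K,) number of words per topic. All counts must already exclude
    the current word (the "-dn" superscript in Eq. (1)).
    alpha, beta: symmetric scalar hyper-parameters (an assumption).
    """
    V = n_kw.shape[1]
    weights = (n_kd[:, d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    return weights / weights.sum()

def sample_topic(rng, probs):
    """Draw one topic index from the normalized conditional."""
    return int(rng.choice(len(probs), p=probs))

# Toy example: 2 topics, 1 document, vocabulary of 2 words.
n_kd = np.array([[1.0], [0.0]])
n_kw = np.array([[1.0, 0.0], [0.0, 0.0]])
n_k = np.array([1.0, 0.0])
probs = lda_conditional(n_kd, n_kw, n_k, d=0, w=0, alpha=0.1, beta=0.01)
topic = sample_topic(np.random.default_rng(0), probs)
```

In the toy example, topic 0 already holds one occurrence of word 0 in document 0, so it receives most of the probability mass.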

Figure 1.
The graphical representation of LDA (a) and BTM (b).

2.2 IBTM

IBTM is an extension of BTM: it not only learns the topic distribution of co-occurring biterms but also updates the model continuously. Before introducing it, we briefly review its base model, BTM.

Conventional topic models such as PLSA and LDA model a document as a mixture of topics and reveal topics by implicitly capturing document-level word co-occurrence patterns. Because of this, applying them directly to short texts leads to a severe data sparsity problem. BTM outperforms both models in tackling the sparsity of short texts. Unlike LDA, BTM strengthens topic learning by using biterms, that is, unordered pairs of words that co-occur in a short text. In addition, it samples topics iteratively, using the rich information of the whole corpus to infer the corpus-level topic distribution. BTM keeps the correlations between words while retaining the independence of each document, and it can infer the topic distribution of a document from the topic-word distribution of the corpus. The generative process of BTM is as follows:

1. Draw a topic distribution $\theta$ from a Dirichlet distribution with hyper-parameter $\alpha$ for the whole collection: $\theta\sim\textit{Dirichlet}(\alpha)$
2. Draw a topic-specific word distribution $\phi_{z}$ from a Dirichlet distribution with hyper-parameter $\beta$ for each topic $z$ : $\phi_{z}\sim\textit{Dirichlet}(\beta)$
3. Draw a topic assignment $z$ from the multinomial distribution $\theta$ for each biterm $b$ in the biterm set $B$ : $z\sim\textit{Multi}(\theta)$
4. Draw two words $w_{i1},w_{i2}$ from the multinomial distribution $\phi_{z}$ : $w_{i1},w_{i2}\sim\textit{Multi}(\phi_{z})$
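The biterm set $B$ used in the process above can be built from the key words of each short text. The sketch below is illustrative only; the function name `extract_biterms` is hypothetical, and it assumes each short text is given as a list of distinct key words.

```python
from itertools import combinations

def extract_biterms(words):
    """Return every unordered word pair (biterm) of one short text.

    Each pair is stored as a sorted tuple so that (a, b) and (b, a)
    count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]

biterms = extract_biterms(["team", "good", "bad"])
```

A short text with $l$ distinct words yields $l(l-1)/2$ biterms, so even a three-word text contributes three co-occurrence observations to the corpus.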

Figure 1b illustrates the graphical representation of BTM. Since estimating the parameters of BTM exactly is intractable, Gibbs sampling is adopted as a simple and effective strategy for estimating $\theta$ and $\phi$ , in comparison with variational inference and maximum a posteriori estimation. It runs a Markov chain over the joint probability of the corpus, with the conditional probability distribution given in Eq. (2). More details can be found in [31].

$\displaystyle P(z|z_{-b},B,\alpha,\beta)\propto(n_{z}^{-b}+\alpha)\frac{(n_{w_{i}|z}^{-b}+\beta)(n_{w_{j}|z}^{-b}+\beta)}{(\sum_{w=1}^{V}n_{w|z}^{-b}+V\beta)(\sum_{w=1}^{V}n_{w|z}^{-b}+V\beta+1)}$ (2)

Here $z_{-b}$ denotes the topic assignments of all biterms in the biterm set $B$ except biterm $b$ . $\alpha$ and $\beta$ are the Dirichlet hyper-parameters. $n_{z}^{-b}$ is the number of biterms assigned to topic $z$ , excluding $b$ . $n_{w|z}^{-b}$ is the number of times word $w$ is assigned to topic $z$ , excluding its occurrence in the current biterm $b$ . $V$ is the size of the vocabulary of the corpus. After a sufficient number of iterations, we obtain the number of biterms in each topic $z$ , denoted by $n_{z}$ , the number of times each word $w$ is assigned to topic $z$ , denoted by $n_{w|z}$ , and the number of biterms in the corpus, $N_{B}$ . We can then estimate the topic-word distribution $\phi$ by Eq. (3) and the global topic distribution $\theta$ by Eq. (4):

$\displaystyle\phi_{w|z}=\frac{n_{w|z}+\beta}{\sum_{w=1}^{V}n_{w|z}+V\beta}$ (3)

$\displaystyle\theta_{z}=\frac{n_{z}+\alpha}{N_{B}+K\alpha}$ (4)
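To make Eqs. (2) to (4) concrete, here is a small sketch under stated assumptions: counts are held in NumPy arrays, the hyper-parameters are symmetric scalars, and the function names `btm_conditional` and `estimate_phi_theta` are hypothetical. It is not the reference implementation of [31].

```python
import numpy as np

def btm_conditional(n_z, n_wz, wi, wj, alpha, beta):
    """Normalized P(z | z_-b, B, alpha, beta) from Eq. (2).

    n_z: (K,) biterms per topic; n_wz: (K, V) word-topic counts.
    Both must already exclude the current biterm b = (wi, wj).
    """
    V = n_wz.shape[1]
    total = n_wz.sum(axis=1) + V * beta
    weights = ((n_z + alpha) * (n_wz[:, wi] + beta) * (n_wz[:, wj] + beta)
               / (total * (total + 1)))
    return weights / weights.sum()

def estimate_phi_theta(n_z, n_wz, alpha, beta):
    """Point estimates of phi (Eq. (3)) and theta (Eq. (4))."""
    K, V = n_wz.shape
    phi = (n_wz + beta) / (n_wz.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_z + alpha) / (n_z.sum() + K * alpha)  # n_z.sum() equals N_B
    return phi, theta

# Toy counts: 3 biterms over a 3-word vocabulary, 2 topics.
n_z = np.array([2.0, 1.0])
n_wz = np.array([[2.0, 2.0, 0.0], [1.0, 0.0, 1.0]])
probs = btm_conditional(n_z, n_wz, wi=0, wj=1, alpha=0.1, beta=0.01)
phi, theta = estimate_phi_theta(n_z, n_wz, alpha=0.1, beta=0.01)
```

Each row of `phi` and the vector `theta` sum to one by construction, which is a quick sanity check on the counts.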

IBTM is an extension of BTM that adopts an incremental Gibbs sampler to update the parameters $\phi$ and $\theta$ immediately when a biterm arrives. In detail, IBTM updates the model in two steps when a biterm $b_{i}$ arrives. First, we draw the topic assignment of $b_{i}$ from $P(z_{i}|z_{i-1},B_{i})$ , where $z_{i-1}=\{z_{j}\}_{j=1}^{i-1}$ denotes all the previous topic assignments and $B_{i}=\{b_{j}\}_{j=1}^{i}$ . Second, we randomly choose some previous biterms to construct a biterm sequence, called the rejuvenation sequence $R(i)$ , and resample their topic assignments. For each biterm $b_{j}\in R(i)$ , we resample its topic assignment $z_{j}$ from $P(z_{j}|z_{-j},i,B_{i})$ . The procedure of IBTM is outlined in Algorithm 1.

Algorithm 1. IBTM (incremental biterm topic model) algorithm
Input: $K,\alpha,\beta$ , biterm sequence $B=\{b_{1},\ldots,b_{n}\}$
Output: $\phi,\theta$
for $i=1$ to $N$ do
    Draw topic $z$ from Eq. (2);
    Update $n_{z}$ and $n_{w|z}$ ;
    Generate rejuvenation sequence $R_{i}$ ;
    for $j\in R_{i}$ do
        Draw topic assignment $z^{\prime}$ from Eq. (2);
        Update $n_{z^{\prime}},n_{w|z^{\prime}}$ ;
    end for
end for
Compute $\phi$ by Eq. (3) and $\theta$ by Eq. (4)
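The loop of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the implementation of [8]: biterms are pairs of integer word ids, the rejuvenation sequence is drawn uniformly from earlier biterms (an assumption), and the function name `ibtm` and its parameter `n_rejuvenate` are hypothetical.

```python
import numpy as np

def ibtm(biterms, K, V, alpha, beta, n_rejuvenate, seed=0):
    """Incremental Gibbs sampling over a biterm stream (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n_z = np.zeros(K)          # biterms per topic
    n_wz = np.zeros((K, V))    # word counts per topic
    z = []                     # topic assignment of each biterm seen so far

    def draw(wi, wj):
        # Conditional of Eq. (2), evaluated on the current counts.
        total = n_wz.sum(axis=1) + V * beta
        p = ((n_z + alpha) * (n_wz[:, wi] + beta) * (n_wz[:, wj] + beta)
             / (total * (total + 1)))
        return int(rng.choice(K, p=p / p.sum()))

    def update(k, wi, wj, delta):
        n_z[k] += delta
        n_wz[k, wi] += delta
        n_wz[k, wj] += delta

    for i, (wi, wj) in enumerate(biterms):
        k = draw(wi, wj)
        z.append(k)
        update(k, wi, wj, +1)
        # Rejuvenation: resample some earlier biterms (uniform choice is an assumption).
        for j in rng.integers(0, i, size=min(n_rejuvenate, i)):
            j = int(j)
            a, b = biterms[j]
            update(z[j], a, b, -1)   # remove the old assignment first
            z[j] = draw(a, b)
            update(z[j], a, b, +1)

    phi = (n_wz + beta) / (n_wz.sum(axis=1, keepdims=True) + V * beta)   # Eq. (3)
    theta = (n_z + alpha) / (n_z.sum() + K * alpha)                      # Eq. (4)
    return phi, theta, z

stream = [(0, 1), (0, 1), (2, 3), (2, 3)]
phi, theta, z = ibtm(stream, K=2, V=4, alpha=0.5, beta=0.01, n_rejuvenate=2)
```

The number of rejuvenation steps per arriving biterm is the main cost knob: more steps move the sampler closer to batch Gibbs sampling at the price of runtime.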

Figure 2.
The framework of discovering topic distribution of a new short text.

3. Joint model framework of extended LDA and IBTM

In this section, we describe in detail our architecture for discovering topics in streaming Chinese short texts. The whole framework is shown in Fig. 2. Given the streaming short texts, we first preprocess the raw texts, for example by removing stop words. We then use the extended LDA to infer the topic distribution of the streaming short texts from the document perspective and obtain the document semantic information. In addition, we use IBTM to estimate the topic-word distribution of the streaming short texts and obtain word information from the dynamic corpus. Finally, we set a threshold $\tau$ to decide the topic distribution of each streaming short text.

We first introduce the preprocessing of Chinese short texts. Preprocessing differs in many ways between English and Chinese text, and we mainly describe what we did to preprocess the Chinese short texts for our experiments. We then introduce a new extended model based on LDA and explain how it handles streaming short texts as they are posted. Finally, we outline the joint model, based on the extended LDA and IBTM, that discovers the topic distribution when a new Chinese short text arrives.

3.1 Preprocessing for Chinese short text

In Chinese text, the word is the smallest independent meaningful element, and there is no boundary between words. Word segmentation is therefore needed to capture the true intentions of users from texts. Although many researchers have worked to improve the accuracy of segmentation, no tool provides complete functionality for Chinese textual analysis. We decided to use jieba,³ a word segmentation and part-of-speech tagging tool for Chinese text analysis. Both English and Chinese text require stop-word removal. Next, we need to select meaningful words. Zhang and Tsai [34] considered nouns and verbs to be the meaningful words and extracted them to mine novel Chinese text.

³ https://github.com/fxsjy/jieba.

We agree with Zhang's view that nouns and verbs are meaningful words, but our procedure differs from his. To reduce errors in word segmentation and capture the deeper meaning intended by users, we collect combined words in the document based on their positions, such as adjective-noun, verb-noun and noun-noun combinations. The procedure for extracting our meaningful words is outlined in Algorithm 2. Furthermore, after selecting words, we require that each short text contain at least two key words and that the frequency of each word be 1. We illustrate this point below with some example short texts.

Table 1

Example short texts and their key words in Sina Weibo

Example short text	Meaningful words (n & v)	Meaningful words (combined words)
UTF8gbsnæ‹–å°¾å©šç°±ï¼Œæ— ä¸Žä¼¦æ¯”çš„ç¾Žä¸½ (Trailing wedding dress is unparalleled beauty.)	UTF8gbsnæ‹– (drag) UTF8gbsnå°¾ (tail) UTF8gbsnå©šç°± (wedding dress) UTF8gbsnç¾Žä¸½ (beauty)	UTF8gbsnæ‹–å°¾å©šç°± (Trailing wedding dress) UTF8gbsnæ‹–å°¾ (trailing) UTF8gbsnå©šç°± (wedding dress) UTF8gbsnç¾Žä¸½ (beauty)
$\sharp$ UTF8gbsnä¼˜ç§€å›¢é˜ŸVSå·®åŠå›¢é˜Ÿ $\sharp$ UTF8gbsnè‡å·±å¯å·å…¥å°§ (Good team VS bad team, compare it to yourself.)	UTF8gbsnå›¢é˜Ÿ (team) UTF8gbsnå·®åŠ (bad) UTF8gbsnå¯å·å…¥å°§ (compare it to yourself)	UTF8gbsnä¼˜ç§€å›¢é˜Ÿ (good team) UTF8gbsnå·®åŠå›¢é˜Ÿ (bad team) UTF8gbsnå¯å·å…¥å°§ (compare it to yourself)

Table 1 lists 2 example short texts from Sina Weibo in the left-hand column (i.e., “UTF8gbsnæ‹–å°¾å©šç°±ï¼Œæ— ä¸Žä¼¦æ¯”çš„ç¾Žä¸½ (Trailing wedding dress is unparalleled beauty)”, “ $\sharp$ UTF8gbsnä¼˜ç§€å›¢é˜ŸVSUTF8gbsnå·®åŠå›¢é˜Ÿ $\sharp$ UTF8gbsnè‡å·±å¯å·å…¥å°§ (Good team VS bad team, compare it to yourself)”). In the central column, we show the nouns and verbs in the short text, according to Zhang’s method. That is to say, we have done word segmentation, part-of-speech tagging, nouns and verbs selection and duplicate words removal with jieba in corpus. From each key words we can see that it cannot express the true intentions of the user and the word segmentation do not conform to human expectations. In the right-hand column, we show the results after extracting the key words with Algorithm 2. Shown in the key words of the right-hand column, using our method could reduce some errors in word segmentation and part-of-speech tagging with jieba. Every key word can express the true intensions of the user and reduce the document frequency of high frequency words in corpus. Therefore, we believe that using our combined words method in preprocessing would significantly enhance topic modeling and the precision of key words in recommending on short texts.

Algorithm 2. Extracting combined words algorithm
Input: Words and part-of-speech tags in a document. Their length is $N$ .
Output: Combined words
Select words:
for $i$ from 1 to $N$ , $i$ is an odd number do
    if the word is an adjective or verb, or a noun, before any noun then
        Get the combined words;
        Count the frequency of the combined words;
    else
        Select nouns or verbs and count them;
    end if
end for
Sort the selected words in descending order. The length of the sorted list is $L$ .
Get the meaningful words:
for $j$ from 1 to $L$ do
    if the two words are adjacent in the document then
        Combine them;
        Add the combined words into the corpus;
    else
        Add the former word into the corpus.
    end if
end for
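The core idea of Algorithm 2, merging an adjective, verb or noun with the noun that follows it, can be sketched as below. This is a deliberately simplified single-pass version, not a faithful transcription of the algorithm: it omits the frequency counting and sorting steps, and it assumes the input is a list of (word, tag) pairs whose tags start with `a`, `v` or `n` for adjectives, verbs and nouns (the tag convention and the function name `combine_words` are assumptions).

```python
def combine_words(tagged):
    """Extract key words from (word, pos_tag) pairs given in document order.

    An adjective, verb or noun directly followed by a noun is merged with it
    into one combined word; remaining nouns and verbs are kept on their own.
    """
    keywords = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if pos[:1] in ("a", "v", "n") and nxt is not None and nxt[1].startswith("n"):
            keywords.append(word + nxt[0])   # e.g. adjective-noun, verb-noun, noun-noun
            i += 2
        elif pos[:1] in ("n", "v"):
            keywords.append(word)
            i += 1
        else:
            i += 1                           # skip stop words, particles, etc.
    # Each key word is kept once per short text (frequency 1), preserving order.
    return list(dict.fromkeys(keywords))

tagged = [("优秀", "a"), ("团队", "n"), ("VS", "eng"), ("差劲", "a"), ("团队", "n")]
keywords = combine_words(tagged)
```

On the second example of Table 1, this keeps "good team" and "bad team" as two distinct key words instead of collapsing both into the single noun "team".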

3.2 Extended LDA

Because the data are streaming and a short text may express several topics, we propose a dynamic method based on LDA, the extended LDA, to obtain the dynamic topic distribution of each short text. LDA is a static model, and we consider it limited in practical use, because short texts change over time and the batch Gibbs sampler cannot update the model immediately when a new short text arrives [13, 8]. We extend the model with a computationally inexpensive online algorithm, namely the incremental Gibbs sampler [5], which updates the topic estimates immediately as each short text is observed.

As in LDA, we assume that every document has $K$ latent topics and that each topic $k$ is represented by a topic-word distribution $\phi_{k}$ . We hold the latest Chinese short texts in a sliding window $L$ of fixed size. $\theta$ represents the topic distribution of a short text $d$ in $L$ . We first train on $d$ and sample a topic for each word. We then obtain $n_{kd}$ , the number of co-occurrences of topic $k$ and short text $d$ ; $n_{kw}$ , the number of co-occurrences of topic $k$ and word $w$ ; and $n_{k}$ , the number of words assigned to topic $k$ . When a new short text arrives, we sample its topic distribution and update $n_{k}$ , $n_{kd}$ , and $n_{kw}$ . After selecting the rejuvenation short texts and resampling their topic assignments, we discard the first short text in $L$ and add the new one. When no more new short texts arrive, we use the latest $n_{k}$ , $n_{kd}$ , and $n_{kw}$ to estimate $\phi$ and $\theta$ as follows:

$\displaystyle\phi_{w|z}=\frac{n_{kw}+\beta}{n_{k}+V\beta}$ (5)

$\displaystyle\theta_{z}=\frac{n_{kd}+\alpha}{n_{d}+K\alpha}$ (6)

Here $n_{d}$ is the length of short text $d$ . The procedure of the extended LDA is outlined in Algorithm 3. Eq. (1) is used twice in this algorithm, for different reasons. The first time, it is used to add the new short text to the corpus: we run collapsed Gibbs sampling [13] to sample its topics while keeping the topic assignments of the old short texts. The second time, collapsed Gibbs sampling is run only on $R(i)$ , resampling the topic assignments of other words so as to move closer to the posterior distribution $P(z|w)$ .

Algorithm 3. Extended LDA algorithm
Input: $K,\alpha,\beta$ , $L$ , $R$ , the length of each Chinese short text $N$ .
Output: $\phi,\theta$
for $i$ from 1 to $N$ do
    Draw topic $k$ from Eq. (1);
    Update $n_{k}$ , $n_{kd}$ , and $n_{kw}$ ;
end for
Generate fixed-length rejuvenation $R(i)$ from $L$ ;
for $j$ from 1 to $R$ do
    Draw topic $k$ from Eq. (1);
    Update $n_{k}$ , $n_{kd}$ , and $n_{kw}$ ;
end for
Discard the first Chinese short text in $L$ ;
Update $n_{k}$ , $n_{kd}$ , and $n_{kw}$ .
Add new Chinese short text to the end of $L$ .
Compute $\phi$ and $\theta$ by Eqs (5) and (6) respectively.
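A compact sketch of the sliding-window procedure in Algorithm 3 is given below. It is an illustration under several assumptions rather than the implementation used here: words are integer ids, rejuvenation texts are chosen uniformly from the window, the oldest text is discarded only once the window is full, and the class name `ExtendedLDASketch` with its parameters is hypothetical.

```python
from collections import deque
import numpy as np

class ExtendedLDASketch:
    def __init__(self, K, V, alpha, beta, window, n_rejuvenate, seed=0):
        self.K, self.V, self.alpha, self.beta = K, V, alpha, beta
        self.window, self.n_rejuvenate = window, n_rejuvenate
        self.rng = np.random.default_rng(seed)
        self.n_k = np.zeros(K)            # words per topic
        self.n_kw = np.zeros((K, V))      # topic-word counts
        self.docs = deque()               # window L: (words, topics, n_kd) per text

    def _draw(self, n_kd, w):
        # Conditional of Eq. (1) with symmetric hyper-parameters.
        p = ((n_kd + self.alpha) * (self.n_kw[:, w] + self.beta)
             / (self.n_k + self.V * self.beta))
        return int(self.rng.choice(self.K, p=p / p.sum()))

    def _shift(self, n_kd, k, w, delta):
        n_kd[k] += delta
        self.n_k[k] += delta
        self.n_kw[k, w] += delta

    def add(self, words):
        """Process one new short text and return its theta (Eq. (6))."""
        topics, n_kd = [], np.zeros(self.K)
        for w in words:                               # first use of Eq. (1)
            k = self._draw(n_kd, w)
            topics.append(k)
            self._shift(n_kd, k, w, +1)
        if self.docs:                                 # rejuvenation R(i)
            size = min(self.n_rejuvenate, len(self.docs))
            for j in self.rng.integers(0, len(self.docs), size=size):
                r_words, r_topics, r_n_kd = self.docs[int(j)]
                for n, w in enumerate(r_words):       # second use of Eq. (1)
                    self._shift(r_n_kd, r_topics[n], w, -1)
                    r_topics[n] = self._draw(r_n_kd, w)
                    self._shift(r_n_kd, r_topics[n], w, +1)
        if len(self.docs) == self.window:             # discard the oldest text
            o_words, o_topics, o_n_kd = self.docs.popleft()
            for k, w in zip(o_topics, o_words):
                self._shift(o_n_kd, k, w, -1)
        self.docs.append((words, topics, n_kd))
        return (n_kd + self.alpha) / (len(words) + self.K * self.alpha)

    def phi(self):
        """Topic-word distribution from Eq. (5)."""
        return (self.n_kw + self.beta) / (self.n_k[:, None] + self.V * self.beta)

model = ExtendedLDASketch(K=2, V=4, alpha=0.1, beta=0.01, window=2, n_rejuvenate=1)
for text in [[0, 1], [2, 3], [0, 1, 1]]:
    theta = model.add(text)
```

After the third text, the window holds only the two most recent texts, and the global counts reflect exactly the five words they contain.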

In this work, we fix the length of each $R(i)$ for two reasons. First, if rejuvenation sampling is performed often enough (depending on the mixing time of the induced Markov chain) and $R(i)$ is long enough, the model closely approximates the posterior distribution of topic words [5]. Second, a large number of rejuvenation short texts consumes considerable time and memory, and the number of resampling steps determines the runtime of incremental Gibbs sampling. A fixed length balances these two considerations.

3.3 Our proposed method: Joint model of extended LDA and IBTM

We give the details of our proposed method in this part. Discovering topics for streaming Chinese short texts involves two subtasks: determining the topic distribution of a document from the document-topic distribution while keeping document semantic information, and determining the topic distribution in the corpus from co-occurring biterm information. For the first subtask, we use the extended LDA to reveal the topic distribution with document semantic information; for the second, we use IBTM, described in Section 2.2, to reveal the topic distribution from co-occurring biterms in the corpus.

When a new streaming Chinese short text arrives, we use the extended LDA and IBTM to discover different topic distributions for different purposes. Having determined its topic distribution, we sort the topics by probability. Topic $k_{\max}$ has the maximum document-topic probability $p_{\max}$ , and topic $k_{\min}$ has the minimum probability $p_{\min}$ in the 5 topic distribution. We then define $\delta=p_{\max}-p_{\min}$ . For the whole corpus, we set a constant threshold $\tau$ . If $\delta>\tau$ , we choose the $N$ top words from $\phi_{\max}$ in the topic-word distribution of the extended LDA; otherwise, we select the $N$ top words from the corresponding maximum-probability topic in both models. Finally, these $N$ words represent the topic-word distribution of the new Chinese short text. That is to say, we choose the best topic-word distribution from the joint model for the short text and return these words to the user.
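The threshold rule can be sketched as follows. Two points are assumptions rather than details given in the text: $p_{\min}$ is taken over the full document-topic distribution passed in, and, when $\delta\leq\tau$, the words of the two models are pooled by taking the larger of the two probabilities for each word. The function name `select_topic_words` and its parameters are hypothetical.

```python
import numpy as np

def select_topic_words(theta_lda, phi_lda, theta_ibtm, phi_ibtm, vocab, tau, n_words):
    """Pick the top words for a new short text using the threshold tau."""
    theta_lda = np.asarray(theta_lda, dtype=float)
    theta_ibtm = np.asarray(theta_ibtm, dtype=float)
    phi_lda = np.asarray(phi_lda, dtype=float)
    phi_ibtm = np.asarray(phi_ibtm, dtype=float)

    k_lda = int(np.argmax(theta_lda))
    delta = float(theta_lda.max() - theta_lda.min())   # delta = p_max - p_min
    if delta > tau:
        # One topic clearly dominates: trust the extended LDA alone.
        scores = phi_lda[k_lda]
    else:
        # Otherwise use the maximum-probability topic of both models
        # (pooling by element-wise maximum is an assumption).
        k_ibtm = int(np.argmax(theta_ibtm))
        scores = np.maximum(phi_lda[k_lda], phi_ibtm[k_ibtm])
    top = np.argsort(scores)[::-1][:n_words]
    return [vocab[i] for i in top]

vocab = ["a", "b", "c"]
theta_lda, phi_lda = [0.9, 0.1], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
theta_ibtm, phi_ibtm = [0.4, 0.6], [[0.3, 0.3, 0.4], [0.05, 0.9, 0.05]]
confident = select_topic_words(theta_lda, phi_lda, theta_ibtm, phi_ibtm, vocab, 0.5, 2)
pooled = select_topic_words(theta_lda, phi_lda, theta_ibtm, phi_ibtm, vocab, 0.9, 2)
```

With $\tau=0.5$ the gap $\delta=0.8$ exceeds the threshold and only the extended LDA topic is used; with $\tau=0.9$ the IBTM topic also contributes and changes the ranking.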

3.4 Model complexity

Because the joint model includes two subtasks, we give the time complexity of each model in turn. The time complexity of IBTM in sliding window $L$ is $O(N_{\textit{iter}}K|B^{(L)}|R(i))$ , where $N_{\textit{iter}}$ is the number of iterations of Gibbs sampling, $K$ is the number of topics, and $B^{(L)}$ is the set of biterms in sliding window $L$ . Its size is $|B^{(L)}|=N_{D}\bar{l}(\bar{l}-1)/2$ , where $N_{D}$ is the number of short texts in sliding window $L$ and $\bar{l}$ is the number of distinct words in a short text. $R(i)$ is the number of resampled biterms. The time complexity of the extended LDA in sliding window $L$ is $O(N_{\textit{iter}}KN_{D}\bar{l}R(i))$ , where $N_{\textit{iter}}$ , $K$ , $N_{D}$ and $\bar{l}$ have the same meanings as for IBTM, but $R(i)=N_{R(i)}\bar{l}$ , where $N_{R(i)}$ is the number of rejuvenation short texts. Because both subtasks are run to obtain a more precise topic distribution for a new short text, the time complexity of the joint model is determined by the more expensive of the two subtasks.
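As a quick numerical illustration of the two window sizes that appear in these bounds, the sketch below evaluates $N_{D}\bar{l}(\bar{l}-1)/2$ and $N_{D}\bar{l}$ for example values. The numbers and the function names are illustrative assumptions, not measurements from the experiments.

```python
def biterm_count(n_docs, avg_len):
    """Number of biterms in the window: N_D * l * (l - 1) / 2."""
    return n_docs * avg_len * (avg_len - 1) / 2

def token_count(n_docs, avg_len):
    """Number of word tokens in the window: N_D * l."""
    return n_docs * avg_len

# Illustrative values: a window of 100000 short texts with 4 distinct words each.
n_biterms = biterm_count(100000, 4)
n_tokens = token_count(100000, 4)
ratio = n_biterms / n_tokens   # equals (l - 1) / 2
```

The ratio of the two counts is $(\bar{l}-1)/2$, so for very short texts the two subtasks touch a comparable number of items per sweep, and the gap widens as texts get longer.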

4. Experiments

In this section, we show the effectiveness of our model, which combines the extended LDA and IBTM, in discovering more precise topics and topic words on real-world streaming Chinese short texts. We perform all experiments on a machine with a 3.30 GHz CPU and 16 GB of RAM.

4.1 Datasets

The Sina Weibo dataset is a collection of 688738 microblogs posted by 239 users, each of whom posted more than 1000 microblogs on Sina Weibo (we chose such users because most users often merely express their moods, so their microblogs contain few meaningful words, where meaningful words are nouns and verbs). For preprocessing, we first convert traditional Chinese characters into simplified Chinese characters, remove duplicate microblogs and delete the punctuation in the microblogs. We then process the data with jieba. On this dataset, we use two methods to select key words: one follows Zhang and Tsai [34], and the other is our method in Algorithm 2. Finally, we remove stop words from the key words. The statistics of the Sina Weibo dataset are shown in Table 2.

The BaiduQA dataset contains 648514 questions crawled from a popular Q&A website.⁴ This dataset has been used in several studies [8, 19, 31]. In addition, each question has been labeled with a category by its asker, which helps us verify the classification accuracy in Section 4.3. The data were preprocessed and used by the original authors, and the details are described in [31, 8]. In these data, we found that the words retained in each question are nouns and verbs, which meets the assumption of Zhang and Tsai [34]. To verify the effectiveness of combined words, we use our method to obtain the key words of each question and remove the words that occur fewer than 3 times in the corpus. The statistics of the BaiduQA dataset are shown in Table 3.

⁴ http://zhidao.baidu.com.

Table 2

The statistics of Sina Weibo dataset

Number of microblogs	688738
Number of user	239
Number of preprocessed microblogs (n & v) ${}^{\text{a}}$	376082
Average length of microblogs (n & v)	11.12
Number of words (n & v)	203886
Number of preprocessed microblogs (c-words) ${}^{\text{b}}$	371119
Average length of preprocessed microblogs (c-words)	11.78
Number of combined words (c-words)	1245537
${}^{\text{a}}$ n & v: After preprocessing with nouns and verbs. ${}^{\text{b}}$ c-words: After preprocessing with combined words.

Table 3

The statistics of BaiduQA dataset

Number of questions	648514
Number of preprocessed questions (n & v) ${}^{\text{a}}$	179042
Average length of questions (n & v)	4.11
Number of words (n & v)	26560
Number of preprocessed questions (c-words) ${}^{\text{b}}$	178755
Average length of preprocessed questions (c-words)	4.48
Number of combined words (c-words)	33967
${}^{\text{a}}$ n & v: After preprocessing with nouns and verbs. ${}^{\text{b}}$ c-words: After preprocessing with combined words.

4.2 Quality of topics

To investigate the quality of topics discovered by all the test methods, we use a common metric to evaluate the results.

4.2.1 Baseline methods and parameter settings

We compare our method against the following two models, online LDA and IBTM, unless explicitly specified otherwise. Yan et al. [31] provided an open-source implementation of IBTM in C++.⁵ To suit our goal of discovering the topic distribution of each short text, we reimplemented it in Java. That is to say, our joint model, online LDA and IBTM are all implemented in Java.

⁵ https://github.com/xiaohuiyan.

In the experiment on the Sina Weibo dataset, we set $\alpha=0.1$, $\beta=0.01$ and iteration $=500$, and take the top 20 words with maximum probability in each topic for online LDA, IBTM and our method. Because there is no best way to determine the topic number $K$, we train these models with $K$ set to 20, 30, 50 and 100. Since online LDA needs a training corpus, we use 100000 microblogs for training. To balance efficiency and effectiveness, we finally set $\{L,L_{\textit{IBTM}},R(i),R(i)_{\textit{IBTM}}\}=\{10000, 100000, 5000, 50000\}$. In addition, considering the time stamps in the Sina Weibo dataset, we add the dynamic Non-negative Matrix Factorization (dynamic NMF) model (https://github.com/derekgreene/dynamic-nmf) as a baseline method. For dynamic NMF, we use the default parameter settings of [9].

On the BaiduQA dataset, following [8, 19, 31], we set $\alpha=50/K$, $\beta=0.01$ for IBTM and $\alpha=0.1$, $\beta=0.01$ for online LDA and extended LDA. For these models, we set iteration $=1000$ and take the top 20 words with maximum probability in each topic. We train these models with $K$ set to 40, 60 and 80. To keep effectiveness and consistency, we set the sliding window to 100000 for all methods and $\{R(i),R(i)_{\textit{IBTM}}\}=\{5000, 50000\}$.

4.2.2 Evaluation method

Traditionally, topic models are evaluated with perplexity. Nevertheless, some researchers have shown that perplexity does not always agree with human judgment or reflect the semantic coherence of a topic [6, 22]. An alternative is the coherence score [22], which measures the extent to which word pairs co-occur in documents and is consistent with the basic assumption of BTM [8]. To perform a more comprehensive analysis, we use Pointwise Mutual Information (PMI) [23] to evaluate the interpretability, or semantic coherence, of these topic models.

Given a topic $k$ with its top $n$ words ( $w_{1},\ldots,w_{n}$ ) ordered by $P(w|z)$ , the PMI-Score of $k$ is:

$\displaystyle\textit{PMI}(k)=\frac{2}{n(n-1)}\sum\limits_{1\leqslant i<j\leqslant n}\log\frac{p(w_{i},w_{j})}{p(w_{i})p(w_{j})}$ (7)

where $p(w_{i},w_{j})$ is the probability that the word pair $(w_{i},w_{j})$ co-occurs in a short text, and $p(w)$ is the probability that a short text in the corpus contains $w$. Because Sina Weibo contains many colloquial expressions and much internet slang, PMI-Scores computed from an external corpus would not be convincing. Thus, we use our own corpus to verify the topic quality and calculate the average PMI-Score, i.e., $\frac{1}{K}\sum_{k}\textit{PMI-Score}(k)$, to evaluate the quality of topics discovered on the Sina Weibo dataset. For the BaiduQA dataset, to test our view, we use 934233 Chinese Wikipedia articles (1.84 G) as the external corpus for calculating the PMI-Scores.
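As an illustration of Eq. (7) and the averaged score, the following Python sketch computes document-level probabilities from a tokenized reference corpus. The function names `pmi_score` and `average_pmi` are hypothetical, and the small constant `eps` is an assumed smoothing term to avoid taking the logarithm of zero; Eq. (7) itself does not specify any smoothing.

```python
import math
from itertools import combinations


def pmi_score(top_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words (at least two distinct words).

    Probabilities are fractions of documents in docs that contain the word(s).
    """
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def p(*words):
        # Fraction of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words)) / n_docs

    pairs = list(combinations(top_words, 2))  # n(n-1)/2 pairs with i < j
    total = 0.0
    for wi, wj in pairs:
        # eps is an assumption, not part of Eq. (7).
        total += math.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
    return total / len(pairs)  # same as multiplying the sum by 2 / (n(n-1))


def average_pmi(topics, docs):
    """Mean PMI-Score over K topics, each given as its list of top words."""
    return sum(pmi_score(words, docs) for words in topics) / len(topics)
```

Two words that always appear together in half of the documents score about $\log 2$, while two words that co-occur exactly as often as chance predicts score 0.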

Figure 3. Topic coherence on Sina Weibo dataset.

Figure 4. Topic coherence on BaiduQA dataset.

4.2.3 Experiment results and analysis

Figures 3 and 4 show the PMI-Scores of all methods on both datasets, with the number of most probable words in each topic ranging from 5 to 20. The results demonstrate that our method with combined words obtains better PMI-Scores than the baseline methods, and even than our method with nouns and verbs as input. This suggests that combined words are beneficial for short text clustering. In addition, as the figures show, the PMI-Scores of IBTM are higher than those of online LDA and of our method with nouns and verbs. This follows from the nature of IBTM, which discovers the topic distribution of the corpus from co-occurring biterms. Although our model with nouns and verbs as input is worse than IBTM, it is better than online LDA. The reason is that IBTM corrects extended LDA, yielding a more precise topic distribution through co-occurring biterms. Thirdly, although the PMI-Scores of dynamic NMF are higher than those of our method, its scores fluctuate more than those of the other methods, which means that the co-occurrence of words within dynamic NMF topics is less stable.

To further explain our model, we take as an example a BaiduQA question that roughly means “how to review English before the senior high school entrance examination?”. Table 4 lists the document-topic information and the topic-words distribution for this question under different online methods. To show that our joint model is better at topic discovery, and to show the difference between its two subtasks, we report the results of both subtasks on this question. From the table, the top 5 of the most probable words under every online method indicate that the question is about examinations, or the senior high school entrance examination. However, the last 5 of the 20 most probable words ordered by $P(w|z)$ do not clearly highlight the examination topic, except for extended LDA. In our joint model, we compare the values of $P(z|d)$ from the two subtasks and assign the topic with the maximum probability, together with its topic-words, to the short text. This returns better topic-words for analysing and understanding what the user wants to say. In the two right-hand columns of the table, bold denotes words correlated with the question.
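The selection step described in this paragraph can be sketched in Python as follows. The function name `choose_topic` and the list-based inputs are hypothetical, and breaking ties in favour of extended LDA is an assumption; the paper only states that the topic with the maximum $P(z|d)$ across the two subtasks is chosen.

```python
def choose_topic(p_z_d_lda, p_z_d_ibtm):
    """Pick the topic from whichever subtask has the larger maximum P(z|d).

    Each argument is a list of probabilities indexed by that model's topic ids.
    Returns (source, topic index, probability).
    """
    k_lda = max(range(len(p_z_d_lda)), key=p_z_d_lda.__getitem__)
    k_ibtm = max(range(len(p_z_d_ibtm)), key=p_z_d_ibtm.__getitem__)
    # Ties go to extended LDA (an assumption made for this sketch).
    if p_z_d_lda[k_lda] >= p_z_d_ibtm[k_ibtm]:
        return ("extended LDA", k_lda, p_z_d_lda[k_lda])
    return ("IBTM", k_ibtm, p_z_d_ibtm[k_ibtm])
```

With the nouns-and-verbs values reported in Table 4, extended LDA (0.978) is more confident than IBTM (0.732), so the joint model keeps the extended LDA topic, which matches the row for our method.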

Table 4
Example of document-topic information and topic-words distribution of the new short text in BaiduQA dataset

Short text	Method	Topic ( $p(z\mid d)$ )	Top5-words ${}^{\text{a}}$	Last5-words ${}^{\text{b}}$
初中 (junior high school) 英语 (English) 复习 (review) 中考 (senior high school entrance examination)	Online LDA (n & v) ${}^{\text{c}}$	6 (0.594)	答案 (answer) 高分 (high score) 中考 (senior high school entrance examination) 年级 (grade) 考试 (examination)	上册 (volume one) 帮忙 (help) VB 在线 (online) 请教 (consult)
	IBTM (n & v)	15 (0.732)	中考 (senior high school entrance examination) 考试 (examination) 2010 2010年 (the year of 2010) 中学 (high school)	教师 (teacher) 英语 (English) 一中 (No. 1 high school) 录取 (enroll) 招生 (recruit students)
	Extended LDA (n & v)	26 (0.978)	中考 (senior high school entrance examination) 年级 (grade) 答案 (answer) 语文 (Chinese) 数学 (mathematics)	一道 (a) 英语 (English) 初二 (grade two of junior school) 单元 (unit) 试题 (examination)
	Our method (n & v)	26	中考 (senior high school entrance examination) 年级 (grade) 答案 (answer) 语文 (Chinese) 数学 (mathematics)	一道 (a) 英语 (English) 初二 (grade two of junior school) 单元 (unit) 试题 (examination)
	IBTM (c-words) ${}^{\text{d}}$	11 (0.932)	年级 (grade) 答案 (answer) 中考 (senior high school entrance examination) 语文 (Chinese) 数学 (mathematics)	成绩 (score) 单元 (unit) 英语 (English) 中学 (high school) 初二 (grade two of junior school)
	Extended LDA (c-words)	38 (0.979)	中考 (senior high school entrance examination) 答案 (answer) 年级 (grade) 语文 (Chinese) 数学 (mathematics)	试卷 (examination paper) 中学 (high school) 单元 (unit) 成绩 (score) 英语 (English)
	Our method (c-words)	38	中考 (senior high school entrance examination) 答案 (answer) 年级 (grade) 语文 (Chinese) 数学 (mathematics)	试卷 (examination paper) 中学 (high school) 单元 (unit) 成绩 (score) 英语 (English)

${}^{\text{a}}$ Top5-words: Top 5 words in each topic-words distribution based on $P(w|z)$ . ${}^{\text{b}}$ Last5-words: Last 5 words in 20 most probable words based on $P(w|z)$ . ${}^{\text{c}}$ n & v: After preprocessing with nouns and verbs. ${}^{\text{d}}$ c-words: After preprocessing with combined words.

4.3 Evaluation on combined words by short text classification

With topic modeling, we can use a tag to represent a short text according to its topic distribution $P(z|d)$. Li et al. [19] report that the summation over words (SW) method yields a large performance improvement over the Naive Bayes (NB) method for topic-level representation of short texts. Thus, we use the SW method in all algorithms. The quality of the topics can be evaluated by the accuracy of text classification using the topic-level representation [8, 20, 31]. Here we use a linear-kernel Support Vector Machine (SVM) classifier in sklearn (http://scikit-learn.org/) with default parameter settings. Because our method focuses on streaming Chinese short texts, we use the last 100000 questions, with 21 labels, of the BaiduQA dataset for this experiment. The classification accuracy is computed by 5-fold cross validation, and the results are shown in Table 5.
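For readers unfamiliar with the SW representation, the sketch below shows one common formulation, as we understand it from [19]: $P(z|d)=\sum_{w}P(z|w)P(w|d)$, with $P(z|w)$ obtained from $P(z)$ and $P(w|z)$ by Bayes' rule and $P(w|d)$ estimated by relative word frequency. The function name `sw_topic_distribution` and the input layout are hypothetical, and skipping out-of-vocabulary words is an assumption.

```python
from collections import Counter


def sw_topic_distribution(doc, p_z, p_w_z):
    """Topic-level representation P(z|d) of one short text by summation over words.

    doc   : list of words in the short text
    p_z   : list with P(z) for each topic
    p_w_z : list of dicts, one per topic, mapping word -> P(w|z)
    """
    K = len(p_z)
    # P(z|w) via Bayes' rule for every word the model knows about.
    p_z_w = {}
    for w in set(doc):
        joint = [p_z[k] * p_w_z[k].get(w, 0.0) for k in range(K)]
        norm = sum(joint)
        if norm > 0:  # out-of-vocabulary words are skipped (an assumption)
            p_z_w[w] = [j / norm for j in joint]
    # P(w|d) as relative frequency among the known words of the document.
    counts = Counter(w for w in doc if w in p_z_w)
    n = sum(counts.values())
    p_z_d = [0.0] * K
    for w, c in counts.items():
        for k in range(K):
            p_z_d[k] += p_z_w[w][k] * c / n
    return p_z_d
```

The resulting $K$-dimensional vector is what would be passed to the classifier as the feature vector of the short text.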

From the results, we can see that both IBTM and extended LDA perform better with combined words as input than with Zhang’s method (nouns and verbs) as input. This validates that combining verbs, adjectives and nouns with nouns as input is beneficial for short text topic modeling. In addition, incremental Gibbs sampling is better than collapsed Gibbs sampling at calculating the posterior distribution of topic-words, which improves the accuracy of short text classification. That is why the classification accuracy of online LDA is lower than that of IBTM and extended LDA. It should be mentioned that, for each short text, our joint model chooses the better topic distribution according to $P(z|d)$ from IBTM and extended LDA. That is to say, whichever one is chosen, the classification result will be better.

Table 5

Average classification accuracy of the methods on BaiduQA dataset

Method	$K=$ 40	$K=$ 60	$K=$ 80
Online LDA (n, v)	0.405	0.406	0.412
IBTM (n, v)	0.596	0.597	0.610
IBTM (c-words)	0.604	0.606	0.614
Extended LDA (n, v)	0.574	0.569	0.560
Extended LDA (c-words)	0.580	0.572	0.570

4.4 Users’ interest tags discovery

In this section, to show that our model can be applied to multiple tasks, we use our joint model to discover users’ dynamic interests from their streaming Chinese microblogs on Sina Weibo. We randomly select 10 of the 239 users in the Sina Weibo dataset. The statistics of these 10 users’ microblogs are shown in Table 6.

Table 6
The statistics of 10 users’ microblogs

User	o-microblogs ${}^{\text{a}}$	re-microblogs ${}^{\text{b}}$	ave-length ${}^{\text{c}}$	com-length ${}^{\text{d}}$
1800611161	7839	4783	13.70	14.33
2060029603	3733	2512	36.58	31.23
2123426191	8106	4938	24.35	23.43
2213526752	15913	10539	19.36	20.73
2774892563	8353	4819	14.96	15.85
3277363951	7870	4559	26.59	26.18
5097562275	4298	2802	19.72	23.35
5103645868	15491	10253	18.36	18.68
5181521136	9531	5762	16.24	18.14
5497480056	5481	3641	19.85	23.60

${}^{\text{a}}$ o-microblogs: Original number of microblogs posted by user. ${}^{\text{b}}$ re-microblogs: Number of microblogs after preprocessing. ${}^{\text{c}}$ ave-length: Average length of microblogs with n & v. ${}^{\text{d}}$ com-length: Average length of microblogs with combined words.

4.4.1 Baseline methods and parameter settings

In this experiment, we compare our method against two online methods (online LDA and IBTM) and three static methods (the Dirichlet Multinomial Mixture (DMM) model [24] (http://jldadmm.sourceforge.net/), LDA and BTM). We set $\alpha=0.1$, $\beta=0.01$ and iteration $=500$, and take the top 20 words with maximum probability in each topic for all methods. We train these models with $K$ set to 30 and 50. To balance efficiency and effectiveness, we finally set the sliding window to 1000 and $R(i)$ to 500 for all online methods.

4.4.2 Evaluation method

A topic model generates a ranked list of words for each topic based on $P(w|z)$. To obtain the interest tags of users, we use the remaining words in each microblog of a user to represent the interest tags.

We define hitrate@N as the top $N$ suggested interest tags for each microblog, and Method@N as the top $N$ interest tags obtained by a baseline method or by our joint model. To measure the precision of interest tag discovery, we compute:

$\displaystyle\textit{precision@N}=\frac{|\textit{hitrate@N}\cap\textit{Method@N}|}{N}$ (8)
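A minimal Python sketch of Eq. (8) follows. It reads the numerator as the number of tags shared by the two top-$N$ lists, which is our interpretation; the function name `precision_at_n` and the example tags are hypothetical.

```python
def precision_at_n(reference_tags, method_tags, n):
    """Share of the top-n positions whose tags appear in both ranked lists.

    reference_tags : ranked suggested interest tags of a microblog (hitrate@N)
    method_tags    : ranked tags produced by a topic model (Method@N)
    """
    hits = set(reference_tags[:n]) & set(method_tags[:n])
    return len(hits) / n
```

For instance, if the reference tags start with `["arashi", "member"]` and a method returns `["member", "movie"]`, one of the two positions is a hit and precision@2 is 0.5.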

4.4.3 Experiment results and analysis

Table 7 lists the precision of all methods in discovering the interest tags of users’ streaming microblogs, and Table 8 gives examples. We obtain the following experiment results.

Table 7
Average precision@N of algorithms

Method	precision@2	precision@5
DMM (n & v) ${}^{\text{a}}$	0.540	0.390
LDA (n & v)	0.500	0.388
BTM (n & v)	0.505	0.382
Online LDA (n & v)	0.485	0.356
IBTM (n & v)	0.465	0.348
Our method (n & v)	0.511	0.398
Our method (c-words) ${}^{\text{b}}$	0.515	0.380

${}^{\text{a}}$ (n & v): After preprocessing with nouns and verbs. ${}^{\text{b}}$ (c-words): After preprocessing with combined words.

In Table 7, firstly, our method achieves the highest precision@5 (0.398 with nouns and verbs), and on precision@2 it is second only to DMM (0.515 with combined words versus 0.540). With either nouns and verbs or combined words as input, it outperforms both online baseline methods on precision@2 and precision@5, and it exceeds the static LDA and BTM models on precision@2. This demonstrates that our joint model can discover more precise topics and interest tags. Secondly, our method with combined words is higher than our method with nouns and verbs on precision@2 but lower on precision@5. The reason is that a combined word stands for two or three of the remaining words of a document, which reduces the precision under this evaluation method. Thirdly, online LDA is better than IBTM at finding interest tags in a new microblog. This shows that, although IBTM is better than online LDA at finding co-occurring words in the corpus, online LDA finds more precise interest tags in a single short text. In addition, the precision@N of LDA and BTM is higher than that of their respective online algorithms, but the interest tags they produce for users are not distinctive. The reason is that the top 5 words in each topic of these models are affected by the high-frequency words in the corpus, and these words cannot serve as interest tags. Last but not least, DMM appears better than the other static models, which suggests that restricting a short text to fewer topics is beneficial for obtaining more precise interest tags.

In Table 8, we list one microblog each from users 2213526752 and 5103645868. Bold marks the interest tags of the microblog. As we can see, for user 2213526752 most methods discover high-frequency words in the microblogs, which gives an unclear picture of the user’s interest. From the table we can see that DMM (岚 (ARASHI)) and our method (c-words) (樱井翔 (Sakurai), 二宫 (Ninomiya), 松本润 (Matsumoto), 村 (とにかく明るい安村)) are more precise on the tags. For user 5103645868, all methods discover many interest tags, but our method (c-words) finds the most tags among the top 10 topic-words based on $P(w|z)$.

Table 8

Example of users and their interest tags on Sina Weibo dataset

User	Short text	Method	Top5-words
221352 6752	UTF8gbsnå¨±äï¼šæ®è°ƒæŸ¥æ•°æ®æ˜¾ç¤°ï¼Œä¸œä°å„å¤§ç”µè§†å°å´æœ«å´åˆå„ç±»èŠ‚ç›®å‡°åœ°çŽ‡å‰ä°”çš„ä½ç½®è¢«åšæˆå‘˜éœ¸å ã€‚å…¶ä¸ä°Œå®«å’ŒäŸå‡°	DMM (n & v)	UTF8gbsnå¨±ä (entertainment)UTF8gbsnç°¢ç™½ (red & white) UTF8gbsnéŸä (music) UTF8gbsnæŒ (song) nhk UTF8gbsnæˆå‘˜ (member)UTF8gbsnåˆæˆ˜ (combat) UTF8gbsnåš (ARASHI)UTF8gbsnæ¼”å”± (sing) UTF8gbsnæ—¥æœ (Japan)
	UTF8gbsnæ¼”çš„ä°†31æ¡£èŠ‚ç›®ï¼Œæ¨±ä°•ç¿”å‡°æ¼”22æ¡£ï¼Œå¤§é‡Žæ™°ã€æ¾æœæ¶¦ã€ç›¸å¶é›…ç°å‡ä¸°21 æ¡£ã€‚ä°Œå®«è¿žç»ä¸¤å´ç™»é¡¶ï¼Œæ¤å¤–å…¥å›´å‰	BTM (n & v)	UTF8gbsnä¸»æ¼” (protagonist) UTF8gbsnç”µå½± (movie) UTF8gbsné¥°æ¼” (play) UTF8gbsnå¨±ä (entertainment)UTF8gbsnå‡°æ¼” (act)UTF8gbsnæ•…ä°‹ (story) UTF8gbsnå¥ˆ (nai) UTF8gbsnè®è¿° (narrate) UTF8gbsnæ”ç¼– (recompose) UTF8gbsnçŒ« (cat)
	UTF8gbsnåçš„è¿˜æœ‰ã¨ã«ã‹ãæ˜Žã‚‹ã„å®‰æ‘ã€å‡°å·å“æœ—ã€æ½éƒ¨ä½‘ã€ç»«éƒ¨ç¥ä°Œç‰ã€‚ (Entertainment: According to the survey,	LDA (n & v)	UTF8gbsné¥°æ¼” (play) UTF8gbsnå¨±ä (entertainment)UTF8gbsnä¸»æ¼” (protagonist) UTF8gbsnå±•å¼€ (unfold) UTF8gbsnå¼€æ’ (be on) UTF8gbsnæ•…ä°‹ (story) UTF8gbsnå‡°æ¼” (act)UTF8gbsnç”µè§†å‰§ (TV play) UTF8gbsnå‰§ä¸ (in the play) UTF8gbsnæ’å‡° (broadcast)
	the positions of the top five appearances of all kinds of programs in major TV stations of Tokyo at the end of the year	IBTM (n & v)	UTF8gbsnæˆå‘˜ (member)UTF8gbsnæˆ˜å£« (warrior) UTF8gbsnå¨±ä (entertainment)UTF8gbsnç¾Žå°‘å¥ (pretty girl) UTF8gbsnå (fame) UTF8gbsnæ’å‡° (broadcast) akb48 UTF8gbsnæ¼”å”± ç°¢ç™½ (red & white) UTF8gbsnèŠ‚ç›® (program)
	were occupied by members of ARASHI. Among them, Ninomiya played in 31 shows, Sakurai appeared in 22 ones,	Online LDA (n & v)	UTF8gbsnå¨±ä (entertainment)UTF8gbsnä¸œä° (Tokyo)UTF8gbsnæˆå‘˜ (member) akb48 UTF8gbsnå‘è¡Œ (publish) UTF8gbsnç»„åˆ (group) UTF8gbsnæ¡¥ (bridge) UTF8gbsnçŽ°åœ° (scene) UTF8gbsnå¼ (Zhang) UTF8gbsnæ¼”å”±ä¼š (concert)
	Satoshi Ohno, Matsumoto and Masaki were in 21 ones. Ninomiya topped the list for two years continuously. In addition,	Our method (n & v)	UTF8gbsnå¨±ä (entertainment)UTF8gbsnæˆå‘˜ (member)UTF8gbsnå±•å¼€ (unfold) UTF8gbsnæˆ˜å£« (warrior) UTF8gbsnå (fame) UTF8gbsnéŸä (music) smap UTF8gbsnå…å¸ƒ (publish) akb48 UTF8gbsnæ¯•ä¸š (graduate)
	UTF8gbsnã¨ã«ã‹ãæ˜Žã‚‹ã„å®‰æ‘, UTF8gbsnã§ãŒã‚ ã¦ã¤ã‚ã†, UTF8gbsnæ½éƒ¨ä½‘, and UTF8gbsnã‚ã‚„ã ã‚†ã†ã˜.) are in top 10.)	Our method (c-words)	UTF8gbsnæ¨±ä°•ç¿” (Sakurai)UTF8gbsnçæ®– (breed) UTF8gbsnå¸¦æœ‰ (equip) UTF8gbsnä°Œå®« (Ninomiya)UTF8gbsnæ¾æœæ¶¦ (Matsumoto)UTF8gbsnèµ°è¿‘ç§‘å¦ (approaches to science) UTF8gbsnå–œæ¢ (like) UTF8gbsnæ‘ (UTF8gbsnã¨ã«ã‹ãæ˜Žã‚‹ã„å®‰æ‘)UTF8gbsnä° (buy) UTF8gbsnä¸å›½ (China)
510364 5868	UTF8gbsnæ˜Ÿå§æŽ¨èæ˜Ÿé—»ï¼šæ®éŸ©å’æŠ¥é“ï¼Œé›èŽ‰å°†é€€å‡°f(x)çš„ç»„åˆæ´»åŠ¨ï¼Œå‡†å¤‡ä»¥æ¼”å‘˜å±•å¼€æ´»åŠ¨ã€‚è™½ç„¶ç¦»å¼€ä°†ç»„åˆä½†å› ä¸ŽSMå…å¸	DMM (n & v)	UTF8gbsnå®‹èŒœ (Victoria Song) fx sm UTF8gbsnæ˜Ÿå§ (Star sister)UTF8gbsnéŸ©å›½ (Korea) UTF8gbsnæˆå‘˜ (member) UTF8gbsné¿æ™— (Lu Han) UTF8gbsnä¸å›½ (China) UTF8gbsnæŽ¨è (recommend)UTF8gbsnæ¼”å”±ä¼š (concert)
	UTF8gbsnçš„åˆç°¦å°šæœåˆ°æœŸæ‰€ä»¥å°†ä¸ä¼šç¦»å¼€SMã€‚è€Œf(x)å°†ä»¥å››ä°°ç»„å±•å¼€æ´»åŠ¨ï¼Œè®¡åˆ’åœ¨å°‘å¥æ—¶ä»£åŽä°Ž9æœˆé‡æ–°è¿›è¡Œç»„åˆæ´»åŠ¨ï¼Œå¯	BTM (n & v)	smUTF8gbsnä¸å›½ (China) UTF8gbsnå…å¸ (company)UTF8gbsnéŸ©å›½ (Korea) UTF8gbsnæˆå‘˜ (member) UTF8gbsné¿æ™— (Lu Han) UTF8gbsnç»„åˆ (group) exo UTF8gbsnå®‹èŒœ (Victoria Song) fx
	UTF8gbsnæ¤SMæœå‘è¡¨å®˜æ–ç«‹åœ°ã€‚ (Star sister recommends News: According to the Korean media report, Sulli will withdraw	LDA (n & v)	mv UTF8gbsnéŸ©å›½ (Korea) UTF8gbsnä¸å›½ (China) UTF8gbsnç»„åˆ (group)UTF8gbsnæ‹æ‘„ (make a record) UTF8gbsnæŽ¨è (recommend)UTF8gbsnå…å¸ (company)UTF8gbsnéŸä fxUTF8gbsnä¸»é¢˜æ› (theme song)
	from the f(x) and prepare to perform activities as an actress. Although she left the group, she will not leave SM because	IBTM (n & v)	sm UTF8gbsnå…å¸ (company)UTF8gbsnå®‹èŒœ (Victoria Song) UTF8gbsnæ´»åŠ¨ (activity) xingwenjiemi UTF8gbsnèŠ‚ç›® (program) UTF8gbsnå‚åŠ (attend) UTF8gbsnå‘Šè¯‰ (tell) UTF8gbsnç»„åˆ (group)UTF8gbsnæœç´¢ (search)
	the contract with SM company has not yet expired. The f(x) will be launched in a four-person group and plans to regroup	Online LDA (n & v)	UTF8gbsnä¸å›½ (China) UTF8gbsnæ‹æ‘„ (make a record) UTF8gbsnéŸ©å›½ (Korea) UTF8gbsnç»„åˆ (group) fx UTF8gbsnå…å¸ (company)UTF8gbsnéƒ‘ç§€æ™¶ (Krystal) UTF8gbsnåˆä½œ (cooperation) smUTF8gbsnå‘å¸ƒ (publish)
	activities in September after the Girls’ Generation. SM has not made an official position.)	Our method (n & v)	UTF8gbsnå±•å¼€ (unfold) UTF8gbsnç»“å©š (get married) UTF8gbsnç»„åˆ (group) smUTF8gbsnåˆ†æ‰‹ (break up) UTF8gbsnä¼ (spread) UTF8gbsnéŸ©å›½ (Korea) UTF8gbsnå…å¸ (company) UTF8gbsnæ´»åŠ¨ (activity) fx
		Our method (c-words)	UTF8gbsnæ˜Ÿå§ (Star sister) UTF8gbsnæ¼”å‘˜ (actress)UTF8gbsné‡å°† (Chong Qing) UTF8gbsnæŠ¥é“ (report) UTF8gbsnå’ (media)UTF8gbsnå—å¼€ (Nankai) tfboys smUTF8gbsnå…å¸ƒ UTF8gbsnæŽ¨èæ˜Ÿé—» (recommend news)

${}^{\text{a}}$ (n & v): After preprocessing with nouns and verbs. ${}^{\text{b}}$ (c-words): After preprocessing with combined words.

5. Conclusions and future work

With the prevalence of short texts, topic models have become increasingly important for topic discovery. Facing massive data and the sparsity of short texts, we use a joint model to discover the topic distribution of streaming data in real time. In this paper, we first use an extended LDA to reveal the topic proportions of a new streaming Chinese short text. Then, considering the sparsity of short texts, we combine it with IBTM to obtain better topic proportions. By comparing the topic distributions from both models, our joint model returns the best topic distribution for the short text. Our results show that the joint model can discover a better topic distribution. In addition, we use a different method in the preprocessing step, which lengthens the short texts to alleviate sparsity, captures the true intentions of the user and reduces word segmentation errors. The results support this idea. Finally, we use the joint model on a user’s short texts to discover his or her dynamic interests. The experiment results demonstrate that our joint model outperforms the other online models in discovering users’ interest tags.

Because some words in a topic do not have semantic relevance, our future efforts will focus on incorporating semantic similarity algorithms and theories to discover more precise latent topics in Chinese short texts.

Acknowledgments

This paper is funded by the National Natural Science Foundation of China (Grant No: 61673235) and the National Key R&D Program Projects of China (Grant No: 2018YFC1707600).

References

1. Baldwin, Cook, Lui, Mackinlay and Wang, How noisy social media text, how diffrnt social media sources? in: Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), 2013, pp. 9–9.
2. Banerjee and Basu, Topic models over text streams: A study of batch and online unsupervised learning, in: SIAM International Conference on Data Mining, April 26–28, 2007, Minneapolis, Minnesota, USA, 2007.
3. Bassiou N.K. and Kotropoulos C.L., Online plsa: batch updating techniques including out-of-vocabulary words, IEEE Transactions on Neural Networks & Learning Systems 25(11) (2014), 1953–1966.
4. Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, Journal of Machine Learning Research 3(Jan) (2003), 993–1022.
5. Canini, Shi and Griffiths, Online inference of topics with latent dirichlet allocation, in: Artificial Intelligence and Statistics, 2009, pp. 65–72.
6. Chang, Boyd-Graber, Gerrish, Wang and Blei D.M., Reading tea leaves: how humans interpret topic models, in: International Conference on Neural Information Processing Systems, 2009, pp. 288–296.
7. Cheng, Guo, Zhao, Han and Fang, Automatic landslide detection from remote-sensing imagery using a scene classification method based on bovw and plsa, International Journal for Remote Sensing 34(1) (2013), 45–59.
8. Cheng, Yan, Lan and Guo, Btm: Topic modeling over short texts, IEEE Transactions on Knowledge and Data Engineering 26(12) (2014), 2928–2941.
9. Derek and Cross J.P., Exploring the political agenda of the european parliament using a dynamic topic modeling approach, Political Analysis 25(1) (2017), 77–94.
10. Eisenstein, What to do about bad language on the internet, in: Proceedings of NAACL, 2013, pp. 359–369.
11. Gao and Huang C.N., Chinese word segmentation and named entity recognition: A pragmatic approach, Computational Linguistics 31(4) (2005), 531–574.
12. Girolami, On an equivalence between plsi and lda, in: International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 433–434.
13. Griffiths T.L. and Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences 101(suppl 1) (2004), 5228–5235.
14. Guo, Barnes S.J. and Jia, Mining meaning from online ratings and reviews: Tourist satisfaction analysis using latent dirichlet allocation, Tourism Management 59 (2017), 467–483.
15. Hoffman M.D., Blei D.M. and Bach, Online learning for latent dirichlet allocation, in: International Conference on Neural Information Processing Systems, 2010, pp. 856–864.
16. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1999, pp. 50–57.
17. Hong and Davison B.D., Empirical study of topic modeling in twitter, in: Proceedings of the First Workshop on Social Media Analytics, ACM, 2010, pp. 80–88.
18. Kim and Shim, Twilite: A recommendation system for twitter using a probabilistic model based on latent dirichlet allocation, Information Systems 42(3) (2014), 59–77.
19. Li, Wang, Zhang, Sun and Ma, Topic modeling for short texts with auxiliary word embeddings, in: International ACM SIGIR Conference on Research & Development in Information Retrieval, 2016, pp. 165–174.
20. and Xu, Suggest what to tag: Recommending more precise hashtags based on users’ dynamic interests and streaming tweet content, Knowledge-Based Systems 106 (2016), 196–205.
21. Lim K.W., Chen and Buntine, Twitter-network topic model: A full bayesian treatment for social network and text modeling, arXiv preprint arXiv:1609.06791v1, 2016.
22. Mimno, Wallach H.M., Talley, Leenders and McCallum, Optimizing semantic coherence in topic models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, Association for Computational Linguistics, 2011, pp. 262–272.
23. Newman, Lau J.H., Grieser and Baldwin, Automatic evaluation of topic coherence, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, 2010, pp. 100–108.
24. Nguyen D.Q., jLDADMM: A Java package for the LDA and DMM topic models, http://jldadmm.sourceforge.net/, 2015.
25. Nigam, McCallum A.K., Thrun and Mitchell, Text classification from labeled and unlabeled documents using em, Machine Learning 39(2) (2000), 103–134.
26. Rosen-Zvi, Griffiths, Steyvers and Smyth, The author-topic model for authors and documents, in: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, AUAI Press, 2004, pp. 487–494.
27. Tang, Meng, Nguyen, Mei and Zhang, Understanding the limiting factors of topic modeling via posterior contraction analysis, in: International Conference on Machine Learning, 2014, pp. 190–198.
28. Tsai F.S., A tag-topic model for blog mining, Expert Systems with Applications 38(5) (2011), 5330–5335.
29. Wang and McCallum, Topics over time: a non-markov continuous-time model of topical trends, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 424–433.
30. Weng, Lim E.-P., Jiang and He, Twitterrank: finding topic-sensitive influential twitterers, in: Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, 2010, pp. 261–270.
31. Yan, Guo, Lan and Cheng, A biterm topic model for short texts, in: Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 1445–1456.
32. Yang, Yao and Wei, Short text understanding by leveraging knowledge into topic model, in: NAACL, 2015, pp. 1232–1237.
33. Zhai C.X., Probabilistic topic models for text data retrieval and analysis, in: The International ACM SIGIR Conference, 2017, pp. 1399–1401.
34. Zhang and Tsai F.S., Chinese novelty mining, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 – Volume 3, EMNLP ’09, Association for Computational Linguistics, 2009, pp. 1561–1570.
35. Zhao W.X., Jiang, Weng, Lim E.P., Yan and Li, Comparing twitter and traditional media using topic models, in: European Conference on Advances in Information Retrieval, 2011, pp. 338–349.
