Utilizing Recurrent Neural Network for topic discovery in short text scenarios

Abstract

The volume of short text data, such as tweets and online Q&A pairs, has grown rapidly in recent years, and it is essential to organize and summarize these data automatically. Topic models are an effective approach, with applications in text mining, personalized recommendation and so on. Conventional models such as pLSA and LDA are designed for long text and suffer from the sparsity problem caused by the lack of words in short text scenarios. Recent studies such as BTM show that using word co-occurrence pairs relieves the sparsity problem. However, BTM and its extensions ignore the quantifiable relationship between words. In our view, the more related two words are, the more likely they should occur in the same topic. Based on this idea, we introduce a model named RIBS, which uses an RNN to learn this relationship. Using the learned relationship, we further introduce a model named RIBS-Bigrams, which can display topics with bigrams. In experiments on two open-source, real-world datasets, RIBS achieves better coherence in topic discovery and RIBS-Bigrams achieves better readability in topic display. In the document characterization task, the document representation of RIBS leads to better purity and entropy in clustering and higher accuracy in classification.

Keywords

Topic model, short text, Recurrent Neural Network, bigrams

1. Introduction

With the development of the Internet, expressing opinions through social networks and asking and answering questions online have become increasingly popular. These daily behaviors produce a huge volume of short text data, which is why short text, also called micro text [7], has attracted much attention from researchers. The massive data emerging every day are of great value, but we can hardly analyse them directly. We therefore need a tool such as a topic model to organize and summarize text data automatically. A topic model represents each document as a distribution over topics and describes each topic as a distribution over words. It is widely applied in domains such as text mining [15, 18], question retrieval [6, 12] and personalized recommendation [4, 13].

Research on topic models has been carried out for years. Conventional topic models include probabilistic Latent Semantic Analysis (pLSA) [10] and Latent Dirichlet Allocation (LDA) [3], which are widely used for discovering hidden topics in a text corpus. Most subsequent work focuses on relaxing model assumptions [26], applying the models to different types of collections [23] and so on. These models are designed for long documents that consist of many words, and they have shown great effectiveness in long text scenarios. Nowadays, however, we have to deal with short text data more often. Unlike long text, short text is sparse in content, which challenges conventional topic models: the document-word matrix is so sparse that there are few words to learn from in the original corpus.

Various strategies have been introduced in recent years to address the sparsity problem. One intuitive strategy is to extend the original short texts into longer ones by aggregating similar texts [14, 28] according to certain rules. The shortcoming of this strategy is that it relies too heavily on auxiliary data, which may be unavailable or hard to obtain. For example, if a dataset has no author information, or little suitable knowledge can be found on the Internet, the effectiveness of such methods is greatly weakened. The second strategy is to add restrictions to the model assumptions. For example, the Dirichlet Multinomial Mixture (DMM) [31] assumes that each document contains only one topic. This kind of solution alleviates the sparsity problem to some extent, but the restriction is too strong, because the number of topics in each document depends on the given corpus. Another creative strategy alleviates the problem by constructing word pairs or word groups to represent the original texts. One representative work is the Biterm Topic Model (BTM), which uses the word co-occurrence relationship in the original corpus to learn topics [30]. The Word Network Topic Model (WNTM) constructs pseudo documents from word groups learned from the word network [33]. These models indeed outperform conventional methods. However, it is worth noting that both ignore the quantifiable relationship between words. For instance, given a document consisting of the words {iPhone, iPad, house}, BTM models the biterms (iPhone, iPad) and (iPhone, house) equally when learning topics. According to our knowledge, however, iPad is more likely to appear in the same topic as iPhone than in the same topic as house. This means we cannot ignore the prior knowledge of the relationship between words. Table 1 shows detailed examples from the Online Questions dataset, which will be introduced later.

Table 1
Examples of biterms extracted from the Online Questions Dataset

Biterms (A, B)	Co-occurrence	Related or not	Biterms (A, C)	Co-occurrence	Related or not
(search, web)	34	Related	(search, bar)	31	Not related
(nissan, sentra)	24	Related	(nissan, install)	22	Not related
(mobile, phones)	14	Related	(mobile, download)	14	Not related

We can see that even when two biterms have similar co-occurrence counts, the relationship between their words can be quite different, so it is essential to take this relationship into account. There are various metrics for the relationship between words. For example, Chen and associates [5] use Pointwise Mutual Information (PMI). Unfortunately, PMI is based purely on co-occurrence statistics. If (A, B) co-occurs as many times as (A, C) does, PMI cannot distinguish the different influence caused by the distance between words A and B and between words A and C. We therefore prefer to learn this relationship by training a Recurrent Neural Network (RNN), which has been applied successfully to language modeling. At the same time, to down-weight high-frequency words, we apply the classic Inverse Document Frequency (IDF) [24] to each word. We call this model the RNN-IDF based Biterm Short-text Topic Model (RIBS). We also introduce an extended model named RIBS-Bigrams, which improves the readability of topics by using bigrams. The main contributions are as follows:

•
We propose an effective RNN-IDF based Biterm Short-text Topic Model that takes the quantifiable relationship between words into consideration. In RIBS, the relationship is determined by training an RNN on the whole corpus and by using the IDF of words. In this way, we obtain a better description of biterms.
•
We introduce RIBS-Bigrams, a simple, fast and effective method that extends RIBS to display topics with bigrams. In RIBS-Bigrams, the generation of each bigram considers both the topical information and the closeness of the two words. In this way, topics become more readable.
•
On two open-source, real-world short text datasets, we evaluate the proposed models in topic quality, topic display and document characterization experiments. The experimental results demonstrate the effectiveness of the proposed models.

This paper is organized as follows. Section 2 reviews related research. Sections 3 and 4 present our topic models RIBS and RIBS-Bigrams. Section 5 reports the experiments, and Section 6 concludes the paper.

2. Related work

Topic models have been developed for years, and there has been much research on topic models in long text scenarios [3, 10, 20]. In this section, we focus on recent work in short text scenarios and give a brief summary.

With the explosive growth of short text data and the high value of applications such as text categorization [27] and news clustering [29], topic modeling in short text scenarios has become a promising research field that attracts more and more researchers. The main challenge of short text lies in the lack of words, which makes the document-word matrix seriously sparse. This is harmful for topic discovery because topics can hardly be described with only a few words. Most models follow one of the strategies below. An early strategy was document aggregation. For example, Hong and associates aggregated tweets that shared the same keywords before applying LDA [11], Weng and associates [28] aggregated texts posted by the same author, and Jin and associates extended short texts with auxiliary related texts [14]. These models need extra text data, which may be limited or hard to obtain. Other researchers developed models with additional strong assumptions. For example, Zhao and associates assumed that each document contains only one topic [32]. Similarly, Lin and associates assumed that each document contains only the most related subset of topics and that each topic is composed of a limited number of words [18]. These restrictions are too strong, and whether the assumptions are reasonable depends heavily on the content of the given corpus, which limits these models. A more recent idea is to construct word groups or word pairs. Using word groups to construct pseudo documents is feasible because semantically related word groups can stand for the same topic; WNTM [33] is based on this idea. Word pairs are also popular: Yan and associates proposed BTM [30], which learns topics by directly modeling the generation of word co-occurrence patterns. Follow-up work such as d-BTM [29] extends BTM by deleting some redundant biterms, as shown in Fig. 1.

For doc1 {Google Map for IOS}, BTM extracts every co-occurring word pair from the document to form a biterm, while d-BTM tries to exclude unimportant biterms. It labels each word as a topic term (T), a general term (G) or a document-specific term (D), and deletes biterms that contain no topic term. For example, Map is a document-specific term and for is a general term, so the biterm Map-for is deleted.

As Fig. 1 illustrates, both BTM and its extensions can relieve the sparsity problem without auxiliary texts. This approach is more suitable and more general for short text scenarios, so we build our research on it. However, BTM ignores the differing relationships between words, and d-BTM filters useless biterms simply by deleting them, which may lose information. In our view, it is more rational to bring in prior knowledge that describes the relationship between words. In our early work [19], we utilized an RNN, rather than word2vec [16, 21], to learn this kind of prior knowledge and thereby address both deficiencies at once. We consider that training word2vec requires auxiliary texts and may bring in noise, whereas an RNN uses only the original corpus for training and has been successfully applied to many text processing tasks [1, 25]. We therefore introduce the RIBS topic model to enhance topic discovery in short text scenarios. Furthermore, benefiting from the prior knowledge learned in RIBS, we extend it with a method for displaying topics with bigrams, which leads to the RIBS-Bigrams topic model.

3. RIBS topic model

In this section, we first give the problem setting of short text topic modeling and then describe in detail how RIBS works. Table 2 lists the notation we use.

Table 2
Notation

Symbol	Description
$\mathbb{D}$	The given corpus, $\mathbb{D}=\{d_{1},d_{2},\ldots,d_{|\mathbb{D}|}\}$
$\mathbb{B}$	The set of all biterms, whose size is $|\mathbb{B}|$
$\mathbb{W}$	The vocabulary of the whole corpus, $\mathbb{W}=\{w_{1},w_{2},\ldots,w_{|\mathbb{W}|}\}$
$K$	The number of topics

Figure 1.

A simple illustration for biterm extraction of BTM and d-BTM.

3.1 Problem setting

A topic model uses observed words to discover latent topics, which is helpful in text mining. The problem setting is given in Definition 1.

Definition 1. Given a corpus $\mathbb{D}$ with $|\mathbb{D}|$ documents and vocabulary size $|\mathbb{W}|$, a topic model aims to discover the topics of each document and to learn topic representations with words. If the corpus has $K$ topics, the topic model should give a $|\mathbb{D}|\times K$ matrix $\theta$ of topic distributions over documents and a $K\times|\mathbb{W}|$ matrix $\phi$ of word distributions over topics. In short text scenarios, each document consists of only a few words.

3.2 Model description

RIBS utilizes prior knowledge to measure the quantifiable relationship between words. If two words are more related, they are more likely to belong to the same topic. Different from BTM’s generative process, we assume that two words in a biterm are drawn from a topic probabilistically based on their relationship, whereas a topic is still sampled from a topic mixture over the whole corpus. The generative process is shown in Fig. 2 and described as follows:

1. For each word $w$, learn prior knowledge $\beta$ with RNN and IDF from corpus $\mathbb{D}$.
2. Draw $\theta\sim$ Dirichlet$(\alpha)$, treated as the topic distribution over biterms $b$.
3. For each topic $k\in[1,K]$:
   (a) draw $\phi_{k,w_{i}}\sim$ Dirichlet$(\beta_{i})$.
   (b) draw $\phi_{k,w_{j}}\sim$ Dirichlet$(\beta_{j})$.
4. For each biterm $b\in\mathbb{B}$, where $b=(w_{i},w_{j})$:
   (a) draw $z\sim$ Multinomial$(\theta)$.
   (b) draw $w_{i}\sim$ Multinomial$(\phi_{z_{w_{i}}})$ and $w_{j}\sim$ Multinomial$(\phi_{z_{w_{j}}})$.

where $z$ is a variable that represents the topic id. $\theta$ is a $K$-dimensional multinomial distribution, where $\theta_{k}$ represents the probability of topic $z_{k}$ (we denote the topic as $z_{k}$ when $z_{k}=k$, $k\in[1,K]$). $\phi$ is a $K\times|\mathbb{W}|$ matrix of word distributions over topics; we denote its $k$-th row as $\phi_{k}$, the word distribution of topic $z_{k}$. $w_{i}$ and $w_{j}$ are two observed words. $\alpha$ and $\beta$ are the symmetric Dirichlet priors for $\theta$ and $\phi$. In RIBS, we bring in prior knowledge for $\beta$.

Figure 2.
Graphical representation of RIBS.

3.3 Prior knowledge learning

Most short text topic models, such as BTM and d-BTM, ignore the quantifiable relationship between words. This relationship is important, however, because the more related two words are, the more likely they are to occur in the same topic. We think the prior knowledge should satisfy the following properties:

•
If two words are more likely to appear in the same generated sentence, they are more related. Conversely, if two words are far away from each other in the same sentence, the relationship between them should be weakened.
•
The order of two words should be taken into account when learning the relationship. For example, we want to generate the bigram white house rather than house white.
•
If a word appears in many documents, it may contain less topical information, so its probability of representing topics should be weakened.

We introduce two kinds of prior knowledge, described in turn below, and combine them to satisfy these properties.

Artificial neural networks have been found quite effective in learning relationship between words for sentence generation [2, 25]. We find that RNN is a good choice, which satisfies the first and second property for the following reasons:

•
The conditional probability learned by RNN can quantify word $w_{j}$ ’s generation probability when given word $w_{i}$ and previously observed words. This probability can reflect the similarity and tightness between two words.
•
The learning process of RNN guarantees that the earlier a word is observed, the less influence it has on the word currently being learned.
•
Words trained by RNN appear in sequence, so the learned quantifiable relationship can help generate readable phrases and bigrams for further research.

Encouraged by recent work [2], which utilized RNN for short text representation, we use a simple recurrent neural network called Elman [8] to learn relationship between words. We can also choose LSTM [9], which is known for preserving information observed long ago. However, the long-time memory ability of LSTM might build unexpected strong relationship between two far apart words and bring in some noise. Detailed analysis will be discussed in the experiment section. In this section, we will show the learning process of relationship between words with Elman network. The Elman network is as Fig. 3 shows.

Figure 3.
A simple Elman recurrent neural network.

In Fig. 3, $w_{t}\in\mathbb{R}^{L}$ represents the current word, where $L$ is the length of the vectorized $w_{t}$ . $h_{t}\in\mathbb{R}^{H}$ is a hidden layer, where $H$ is the size of the hidden layer. $y_{t}\in\mathbb{R}^{|\mathbb{W}|}$ is the output layer. $t$ is the current input time.

Since the hidden layers $h_{t-1}$ and $h_{t}$ have a recurrent connection, we can regard $h_{t-1}$ as a memory of all the words observed before time $t$. This means the RNN can learn the relationship between the current word and the previously observed words. Additionally, the influence of previously observed words decreases over time.

The input layer $x_{t}\in\mathbb{R}^{L+H}$ is defined as $x_{t}=[w_{t},h_{t-1}]$ . Then we can compute hidden and output layers with $x_{t}$ :

$\displaystyle h_{t}=\sigma(\mathbf{U}x_{t}),$ (1)

$\displaystyle y_{t}=g(\mathbf{V}h_{t}),$ (2)

where $\sigma$ is the sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$ and $g$ is the softmax function $g(z_{m})=\frac{e^{z_{m}}}{\sum_{k}{e^{z_{k}}}}$ . $\mathbf{U}\in\mathbb{R}^{H\times(L+H)}$ and $\mathbf{V}\in\mathbb{R}^{|\mathbb{W}|\times H}$ are two weight matrices for us to learn.
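
As a minimal sketch of Eqs (1) and (2), the forward step of the Elman network can be written with NumPy as follows. The toy sizes, the random weights and the function name `elman_step` are illustrative assumptions and are not taken from our implementation; training of $\mathbf{U}$ and $\mathbf{V}$ is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def elman_step(w_t, h_prev, U, V):
    """One forward step: x_t = [w_t, h_{t-1}], Eq. (1), then Eq. (2)."""
    x_t = np.concatenate([w_t, h_prev])  # input layer of size L + H
    h_t = sigmoid(U @ x_t)               # hidden layer, Eq. (1)
    y_t = softmax(V @ h_t)               # distribution over the vocabulary, Eq. (2)
    return h_t, y_t

# Toy dimensions (assumed for illustration only).
L, H, W = 4, 3, 5
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, L + H))
V = rng.normal(scale=0.1, size=(W, H))

h = np.zeros(H)
for w_t in rng.normal(size=(3, L)):      # a "sentence" of three word vectors
    h, y = elman_step(w_t, h, U, V)
```

After the loop, `y` is the predicted distribution over the next word given the last input word and the hidden state that summarizes the earlier words.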

Once we have learned the conditional probability, we can give the definition of the relationship between $w_{i}$ and $w_{j}$ , denoted as $y_{i}(j)$ , which is the expectation of the $j$ -th value in $y_{i}$ in different contexts.

$\displaystyle y_{i}(j)={\rm{E}}(P(w_{j}|w_{i},h_{i-1})).$ (3)

$y_{i}(j)$ represents the probability that $w_{j}$ appears given $w_{i}$. Meanwhile, conditioning on $h_{i-1}$ guarantees that previously observed words also affect $w_{j}$, with an influence that depends on their distance.
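
One possible way to approximate the expectation in Eq. (3) is to average the network outputs over every context in which $w_{i}$ occurs. The sketch below assumes that the pairs `(word_id, y_t)` have already been collected from a trained network; the function name and the empirical-mean estimator are assumptions made for illustration.

```python
from collections import defaultdict
import numpy as np

def average_relationship(occurrences, vocab_size):
    """Estimate y_i(j) as the mean of y_t[j] over all occurrences of word i.

    occurrences: iterable of (word_id, y_t), where y_t is the softmax output
    produced right after word_id was fed to the network (assumed given).
    """
    sums = defaultdict(lambda: np.zeros(vocab_size))
    counts = defaultdict(int)
    for word_id, y_t in occurrences:
        sums[word_id] += y_t
        counts[word_id] += 1
    return {i: sums[i] / counts[i] for i in sums}

# Word 0 is observed in two different contexts, word 1 in one.
occurrences = [
    (0, np.array([0.1, 0.7, 0.2])),
    (0, np.array([0.3, 0.5, 0.2])),
    (1, np.array([0.6, 0.2, 0.2])),
]
rel = average_relationship(occurrences, vocab_size=3)
```

Here `rel[i][j]` plays the role of $y_{i}(j)$.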

What’s more, to satisfy the third property, work like d-BTM just deletes some topic-irrelevant biterms. Since we are already short of words, deleting biterms may cause further information loss. So we decide to utilize Inverse Document Frequency (IDF) for measuring each word as follows:

$\displaystyle{\rm{IDF}}_{w_{i}}=\log{\frac{|\mathbb{D}|}{|\{d\in\mathbb{D}:w_{i}\in d\}|}},$ (4)

where $|\{d\in\mathbb{D}:w_{i}\in d\}|$ is the number of documents in which word $w_{i}$ appears. The more documents $w_{i}$ appears in, the smaller ${\rm{IDF}}_{w_{i}}$ is. We can use this weight to decrease the probability that $w_{i}$ generates a topic.
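
A minimal sketch of Eq. (4) on a toy corpus, assuming each document is already a list of tokens (the function name `idf` is illustrative):

```python
import math

def idf(docs):
    """Eq. (4): log(|D| / number of documents that contain the word)."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for w in set(doc):  # count each word once per document
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(n_docs / c) for w, c in doc_freq.items()}

docs = [
    ["search", "web", "bar"],
    ["search", "web"],
    ["nissan", "sentra", "search"],
]
idf_values = idf(docs)
```

In this toy corpus `search` appears in every document, so its IDF is 0 and it contributes nothing to the prior, while rarer words such as `nissan` receive a larger weight.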

Now we can give the final definition of prior knowledge $\beta$ , by making use of RNN and IDF at the same time. For word $w_{i}$ and $w_{j}$ :

$\displaystyle\beta_{i}=\epsilon\times y_{i}(j)\times{\rm{IDF}}_{w_{i}},$ (5)

$\displaystyle\beta_{j}=\epsilon\times y_{i}(j)\times{\rm{IDF}}_{w_{j}},$ (6)

where $\epsilon$ is a scaling factor that prevents $\beta$ from being too small.

After learning prior knowledge, we can introduce biterm construction. RIBS constructs biterms from any two distinct words in a single document, so an $n$-word document yields $C_{n}^{2}$ biterms. Unlike the biterm extraction procedure of BTM, we attach prior knowledge to each biterm. For each biterm $b\in\mathbb{B}$, the new definition is as follows:

$\displaystyle b=(w_{i},w_{j},r_{ij}),\ \text{where}\ r_{ij}=\langle{\rm{IDF}}_{w_{i}},{\rm{IDF}}_{w_{j}},y_{i}(j)\rangle$

Biterm construction is executed in the same pass that scans the whole corpus.
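
The biterm definition above and the priors of Eqs (5) and (6) can be sketched as follows. The lookup tables `idf_values` and `rel` (standing for $y_{i}(j)$), their toy values and the function names are hypothetical; skipping pairs of identical words is our reading of "two distinct words".

```python
from itertools import combinations

def build_biterms(doc, idf_values, rel):
    """Return biterms (w_i, w_j, r_ij) with r_ij = <IDF_wi, IDF_wj, y_i(j)>."""
    biterms = []
    for w_i, w_j in combinations(doc, 2):  # keeps the word order of the document
        if w_i == w_j:
            continue
        r_ij = (idf_values[w_i], idf_values[w_j], rel.get((w_i, w_j), 0.0))
        biterms.append((w_i, w_j, r_ij))
    return biterms

def biterm_priors(biterm, eps):
    """Eqs (5) and (6): beta_i and beta_j for the two words of one biterm."""
    _, _, (idf_i, idf_j, y_ij) = biterm
    return eps * y_ij * idf_i, eps * y_ij * idf_j

# Toy values, assumed for illustration only.
idf_values = {"iphone": 1.0, "ipad": 1.2, "house": 0.5}
rel = {("iphone", "ipad"): 0.3, ("iphone", "house"): 0.01, ("ipad", "house"): 0.02}

biterms = build_biterms(["iphone", "ipad", "house"], idf_values, rel)
beta_i, beta_j = biterm_priors(biterms[0], eps=50)
```

With these toy numbers the biterm (iphone, ipad) receives much larger priors than (iphone, house), which is the behaviour motivated by the example in the introduction.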

3.4 Gibbs sampling for parameter estimation

As in BTM, we employ Gibbs sampling to learn the parameters, additionally taking prior knowledge into consideration. Applying the chain rule to the joint probability of the corpus, we obtain the following conditional probability:

$\displaystyle p(z|z_{-b},\mathbb{B})\propto\frac{(n_{-b,z}+\alpha)}{|\mathbb{B}|+K\alpha}\frac{(n_{-b,w_{i}|z}+\beta_{i})(n_{-b,w_{j}|z}+\beta_{j})}{(\sum_{w}(n_{-b,w|z}+\beta))^{2}},$ (7)

where $n_{-b,z}$ is the number of biterms assigned to topic $z$, excluding biterm $b$, and $n_{-b,w_{i}|z}$ is the number of times word $w_{i}$ is assigned to topic $z$, excluding biterm $b$. We can then estimate the global topic parameter $\theta$ and the topic-word distribution parameter $\phi$ as follows:

$\displaystyle\theta_{k}=\frac{(n_{z_{k}}+\alpha)}{|\mathbb{B}|+K\alpha}.$ (8)

For words $w_{i}$ and $w_{j}$:

$\displaystyle\phi_{k,w_{i}}=\frac{n_{w_{i}|z_{k}}+\beta_{i}}{\sum_{w}(n_{w|z_{k}}+\beta)}.$ (9)

$\displaystyle\phi_{k,w_{j}}=\frac{n_{w_{j}|z_{k}}+\beta_{j}}{\sum_{w}(n_{w|z_{k}}+\beta)}.$ (10)

The Gibbs sampling procedure is as Algorithm 1 shows.

Algorithm 1. Gibbs sampling algorithm for RIBS
Input: topic number $K$, $\alpha$, $\beta$, biterm set $\mathbb{B}$.
Output: $\theta$ and $\phi$.
Initialize topic assignments for each biterm randomly.
for $\textit{iter}\leftarrow 1$ to $N_{\textit{iter}}$ do
    for each biterm $b=(w_{i},w_{j},r_{ij})\in\mathbb{B}$ do
        Draw topic $z_{k}$ from $P(z|z_{-b},\mathbb{B})$.
        Update $n_{z_{k}}$, $n_{w_{i}|z_{k}}$, $n_{w_{j}|z_{k}}$.
    end for
end for
Compute $\theta$ by Eq. (8) and $\phi$ by Eqs (9) and (10).

According to Eqs (9) and (10), we denote $\phi_{k}=[\phi_{k,w_{1}},\phi_{k,w_{2}},\ldots,\phi_{k,w_{|\mathbb{W}|}}]$ as the word distribution of topic $z_{k}$.
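
The sampling loop of Algorithm 1 can be sketched in plain Python as below. This is an illustration under stated assumptions, not our released implementation: the per-biterm priors are passed in as a list aligned with the biterms, the prior mass in the denominator of Eq. (7) is collapsed into a single hypothetical constant `beta_sum`, all priors are assumed strictly positive, and the factor $1/(|\mathbb{B}|+K\alpha)$ is dropped because it does not depend on $z$.

```python
import random
from collections import defaultdict

def gibbs_ribs(biterms, priors, K, alpha, beta_sum, n_iter, seed=0):
    """Collapsed Gibbs sampling over biterms, following the shape of Eq. (7).

    biterms : list of (w_i, w_j) word pairs
    priors  : list of (beta_i, beta_j) aligned with `biterms`, Eqs (5) and (6)
    beta_sum: ASSUMPTION, total prior mass added to a topic's word count
    """
    rng = random.Random(seed)
    n_z = [0] * K                                # biterms per topic
    n_wz = [defaultdict(int) for _ in range(K)]  # word counts per topic
    n_words = [0] * K                            # sum over w of n_{w|z}

    def shift(k, w_i, w_j, delta):
        n_z[k] += delta
        n_wz[k][w_i] += delta
        n_wz[k][w_j] += delta
        n_words[k] += 2 * delta

    # Random initial topic assignment for every biterm.
    z = [rng.randrange(K) for _ in biterms]
    for (w_i, w_j), k in zip(biterms, z):
        shift(k, w_i, w_j, +1)

    for _ in range(n_iter):
        for b, ((w_i, w_j), (beta_i, beta_j)) in enumerate(zip(biterms, priors)):
            shift(z[b], w_i, w_j, -1)            # exclude biterm b from the counts
            weights = [
                (n_z[k] + alpha)
                * (n_wz[k][w_i] + beta_i) * (n_wz[k][w_j] + beta_j)
                / (n_words[k] + beta_sum) ** 2
                for k in range(K)
            ]
            z[b] = rng.choices(range(K), weights=weights)[0]
            shift(z[b], w_i, w_j, +1)

    theta = [(n_z[k] + alpha) / (len(biterms) + K * alpha) for k in range(K)]  # Eq. (8)
    return z, theta, n_wz

biterms = [("iphone", "ipad"), ("iphone", "house"), ("ipad", "house"), ("search", "web")]
priors = [(15.0, 18.0), (0.5, 0.25), (1.2, 0.5), (9.0, 8.0)]  # toy values
z, theta, n_wz = gibbs_ribs(biterms, priors, K=2, alpha=25.0, beta_sum=10.0, n_iter=20)
```

The returned counts `n_wz` are what Eqs (9) and (10) need to compute $\phi$ once sampling has finished.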

3.5 Topics inference

Because RIBS models topics on biterms, we have to infer the topic distributions over documents from the knowledge learned by Gibbs sampling. The proportion of topic $z_{k}$ in a document $d\in\mathbb{D}$ is derived as follows:

$\displaystyle P(z_{k}|d)=\sum_{b\in\mathbb{B}}P(z_{k},b|d)=\sum_{b\in\mathbb{B}}P(z_{k}|b,d)P(b|d).$ (11)

We assume the topic of $b$ denoted as $z_{k}$ is conditionally independent of $d$ , which means $P(z_{k}|b,d)=P(z_{k}|b)$ , so we can get the following simplified equation:

$\displaystyle P(z_{k}|d)=\sum_{b\in\mathbb{B}}P(z_{k}|b)P(b|d).$ (12)

We can calculate $P(z_{k}|b)$ via Bayes' formula:

$\displaystyle P(z_{k}|b)=\frac{P(z_{k})P(w_{i}|z_{k})P(w_{j}|z_{k})}{\sum_{k^{\prime}\in K}{P(z_{k^{\prime}})P(w_{i}|z_{k^{\prime}})P(w_{j}|z_{k^{\prime}})}},$ (13)

where $P(z_{k})=\theta_{k}$ , $P(w_{i}|z_{k})=\phi_{k,w_{i}}$ , $\theta$ and $\phi$ are parameters learned in RIBS.

To calculate $P(b|d)$, we can simply treat it as a counting problem:

$\displaystyle P(b|d)=\frac{n_{d}(b)}{\sum_{b\in\mathbb{B}}n_{d}(b)},$ (14)

where $n_{d}(b)$ is the frequency of biterm $b$ in document $d$. The topic distribution of document $d$ is therefore $P(z|d)=[P(z_{1}|d),P(z_{2}|d),\ldots,P(z_{K}|d)]$.

Outputs of RIBS are a $|\mathbb{D}|\times K$ matrix for topic distributions over all documents and a $K\times|\mathbb{W}|$ matrix for word distributions over all topics, calculated as Eqs (15) and (16) show:

$\displaystyle P(z|\mathbb{D})=[P(z|d_{1}),P(z|d_{2}),\ldots,P(z|d_{|\mathbb{D}|})].$ (15)

$\displaystyle\phi=[\phi_{z_{1}},\phi_{z_{2}},\ldots,\phi_{z_{K}}].$ (16)
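
The inference of Eqs (12) to (14) for a single document can be sketched as follows, assuming that $\theta$ is a list and $\phi$ is a list of per-topic dictionaries. The toy parameter values and function names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def topic_given_biterm(w_i, w_j, theta, phi):
    """Eq. (13): P(z_k | b) by Bayes' formula."""
    scores = [theta[k] * phi[k][w_i] * phi[k][w_j] for k in range(len(theta))]
    total = sum(scores)
    return [s / total for s in scores]

def topic_given_doc(doc, theta, phi):
    """Eq. (12), with P(b | d) obtained by counting as in Eq. (14)."""
    counts = Counter(combinations(doc, 2))  # n_d(b) for every biterm of d
    n_biterms = sum(counts.values())
    p = [0.0] * len(theta)
    for (w_i, w_j), c in counts.items():
        p_zb = topic_given_biterm(w_i, w_j, theta, phi)
        for k in range(len(theta)):
            p[k] += p_zb[k] * c / n_biterms
    return p

# Toy parameters for K = 2 topics (assumed values).
theta = [0.5, 0.5]
phi = [
    {"iphone": 0.6, "ipad": 0.3, "house": 0.1},
    {"iphone": 0.1, "ipad": 0.1, "house": 0.8},
]
p_doc = topic_given_doc(["iphone", "ipad", "house"], theta, phi)
```

Stacking `p_doc` for every document gives the $|\mathbb{D}|\times K$ matrix of Eq. (15).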

4. RIBS-Bigrams topic model

When we read texts, bigrams are usually much easier to understand and carry more information than single words. Most topic models, whether designed for long text or for short text, such as LDA [3], pLSA [10], d-BTM [29], BTM [30] and WNTM [33], use a set of single words to describe a learned topic when displaying topics. This makes the topics hard to read, which is one of the deficiencies of most topic models. For example, if we read the two words captain and America separately, they have quite low semantic similarity, although we find that they often occur in the same word set describing a topic about movies. When we treat the two words as the bigram captain America, however, they become the name of a popular movie and carry more specific topical information. We therefore believe it is necessary to describe topics with bigrams, for the following reasons:

•
Using bigrams to display topics improves readability and reduces ambiguity, which reduces the time humans need to read and summarize topics.
•
When the number of words or bigrams for displaying topics is fixed, bigrams carry more information.

We believe every bigram to be generated for topic description should satisfy the following two properties, given words $w_{i}$ and $w_{j}$ :

•
$w_{i}$ and $w_{j}$ should have a high similarity in latent topic distribution, which means they should describe the similar topics.
•
$w_{i}$ should have a high probability to co-occur with $w_{j}$ , and it is important to discover the connection order of $w_{i}$ and $w_{j}$ from the corpus.

To satisfy the first property, we use a simple strategy: for every topic $k$, we sort $\phi_{z_{k}}$, the word distribution of topic $k$, in descending order and select the top $T$ words with the highest probability as candidates. This guarantees that these $T$ words are highly similar in describing topic $k$.

Because words are fed into the RNN in order during training, we can satisfy the second property by directly using the quantifiable relationship between words. $y_{i}(j)$ is the conditional probability learned by the RNN, which reflects the probability that $w_{j}$ occurs given $w_{i}$ and its context. We can therefore generate bigrams by comparing the learned quantifiable relationship among topically similar candidate words.

Algorithm 2. Bigrams generation algorithm for topic display
Input: topic number $K$, the number $M$ of unigrams or bigrams for displaying topics, threshold $\delta$ of relationship between words, threshold $T$ of candidate word numbers.
Output: the set of unigrams or bigrams for displaying topics $\mathbb{UB}^{k}$, $k\in\{1,2,\ldots,K\}$.
for $k\leftarrow 1$ to $K$ do
    Initialize the used words set $\mathbb{UW}^{k}=\emptyset$.
    Sort the word list $W_{k}$ according to the descending order of $\phi_{z_{k}}$.
    for $i\leftarrow 1$ to $M$ do
        if $w_{i}\in\mathbb{UW}^{k}$ then
            Set $w_{i}\leftarrow w_{i+1}$ and remove $w_{i}$ from the list.
        end if
        Initialize $max=0$.
        for $j\leftarrow 1$ to $T$ do
            $max={\rm{MAX}}(y_{i}(j),y_{j}(i))$. Update $w_{1},w_{2}$ synchronously.
        end for
        if $max<\delta$ then
            Insert $w_{i}$ to $\mathbb{UB}^{k}$ and insert $w_{i}$ to $\mathbb{UW}^{k}$.
        else
            Insert $w_{1}+w_{2}$ to $\mathbb{UB}^{k}$ and insert $w_{1}$, $w_{2}$ to $\mathbb{UW}^{k}$.
        end if
    end for
end for

We propose a simple and effective bigrams generation algorithm for topic display; the details are shown in Algorithm 2. This work is an extension of RIBS that aims to improve the readability of topics. Because the RNN has already been trained for the RIBS topic model, we do not need to spend additional time or learn additional knowledge to generate bigrams. The prior knowledge learned in RIBS thus improves both the quality of topic discovery and the readability of topics in short text scenarios. We name this model RIBS-Bigrams.

The bigrams generation algorithm is extensible and flexible because the most important knowledge it utilizes is the relationship between words, which can also be obtained from other NLP algorithms and techniques.
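
One reading of the bigram generation procedure for a single topic is sketched below. It is an interpretation rather than a transcription: the relationship table `rel` (standing for $y_{i}(j)$), its toy values, the function name and the choice to skip candidates that have already been used are assumptions.

```python
def display_topic(ranked_words, rel, M, T, delta):
    """Pick up to M unigrams or bigrams for one topic.

    ranked_words: words of the topic sorted by descending phi
    rel[(a, b)] : hypothetical lookup for y_a(b), the ordered relationship
    """
    candidates = ranked_words[:T]
    used, shown = set(), []
    for w in ranked_words:
        if len(shown) == M:
            break
        if w in used:
            continue
        best, pair = 0.0, None
        for c in candidates:
            if c == w or c in used:
                continue
            forward = rel.get((w, c), 0.0)   # w followed by c
            backward = rel.get((c, w), 0.0)  # c followed by w
            if forward > best:
                best, pair = forward, (w, c)
            if backward > best:
                best, pair = backward, (c, w)
        if pair is None or best < delta:
            shown.append(w)                  # relationship too weak: keep the unigram
            used.add(w)
        else:
            shown.append(" ".join(pair))     # emit the bigram in its learned order
            used.update(pair)
    return shown

ranked = ["captain", "movie", "america", "film"]
rel = {("captain", "america"): 0.4, ("america", "captain"): 0.05, ("movie", "film"): 0.02}
shown = display_topic(ranked, rel, M=3, T=4, delta=0.1)
# shown == ["captain america", "movie", "film"]
```

The pair (movie, film) stays below the threshold $\delta$ in this toy table, so both words are displayed as unigrams.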

5. Experiments

In this section, we conduct experiments to show RIBS outperforms the state-of-the-art topic models in short text scenarios. We evaluate the performance in terms of topic quality and document characterization respectively. We also conduct topic display experiments to compare RIBS-Bigrams with other unigram models. The experimental results show that our proposed model RIBS is more effective and RIBS-Bigrams can describe topics much better than other models.

5.1 Datasets

We use two open-source and real-world short text datasets for experiments:

•
Online Questions: the open-source corpus is collected from the popular online Q&A website Yahoo! Answers and is offered by Yahoo! Research (https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&did=10). We deleted the 25 questions that have no label. After preprocessing, we obtained over 50,000 questions for our experiments. Each question is labeled according to the forum in which it was posted. There are 24 categories in the corpus, the vocabulary size is 9696 and the average length of a question is 4.950 words. It is a typical short text dataset.
•
Online News: the open-source corpus is collected from a web aggregator in the period from 10-March-2014 to 10-August-2014 and is offered by the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/News+Aggregator). After preprocessing, we obtained over 170,000 news headlines for our experiments. Each headline is annotated with a category by the data provider, without missing values. There are 4 categories in the corpus, the vocabulary size is 7247 and the average length of a headline is 6.876 words. It is also a typical short text dataset.

Both datasets are preprocessed in three steps. First, we lowercase all capital letters to reduce the vocabulary size without any information loss. Second, we delete all stop words based on a common English stop word list. Finally, we delete documents containing fewer than 3 words.
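
The three preprocessing steps can be sketched as follows. The tiny stop word set and the whitespace tokenization are assumptions made for illustration; they stand in for the common English stop word list mentioned above.

```python
# A tiny illustrative stop word set (assumed), not the full list used in practice.
STOP_WORDS = {"the", "a", "an", "is", "of", "for", "to", "in", "and", "how", "do", "i"}

def preprocess(raw_docs, stop_words=STOP_WORDS, min_len=3):
    cleaned = []
    for text in raw_docs:
        tokens = text.lower().split()                         # step 1: lowercase
        tokens = [t for t in tokens if t not in stop_words]   # step 2: drop stop words
        if len(tokens) >= min_len:                            # step 3: drop short documents
            cleaned.append(tokens)
    return cleaned

raw = ["How do I install Nissan Sentra brakes", "The web", "Google Map for IOS"]
docs = preprocess(raw)
# docs == [["install", "nissan", "sentra", "brakes"], ["google", "map", "ios"]]
```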

5.2 Experiment settings

We compare RIBS with four topic models:

•
LDA is a well-known topic model that performs well in regular text scenarios. We use a standard open-source LDA implementation based on Gibbs sampling (http://jgibblda.sourceforge.net/).
•
BTM is a popular topic model for short text. We run experiments with the standard code provided by the BTM authors (https://github.com/xiaohuiyan/BTM).
•
d-BTM extends BTM by deleting some topic-irrelevant biterms. We implement this model based on the BTM source code.
•
RIBS-LSTM uses LSTM cells to learn the prior knowledge for the RIBS topic model. We include this model to show that LSTM is not suitable for topic discovery in short text scenarios.

Parameter settings are as follows.

•
For training RNN: we implement the network training with TensorFlow (version 1.2, https://www.tensorflow.org/versions/r1.2/). We set the length $L$ of the input word vector to 100, the batch size to 128, the hidden size $H$ to 200 and the learning rate $lr$ to 0.01.
•
For training topic models: we set $\alpha=50/K$ and $\beta=0.05$ for all models.
•
For learning prior knowledge of RIBS: we set $\epsilon=50$ for the Online Questions dataset and $\epsilon=100$ for the Online News dataset. These values are determined experimentally, as Fig. 4 shows. We find that the parameter $\epsilon$ is necessary for RIBS to keep the influence of prior knowledge from being too weak or too strong.
•
For RIBS-Bigrams: we set the threshold $\delta=0.1$ for the relationship between words and the threshold $T=30$ for the number of candidate words.

Figure 4.
Topic discovery performance with the change of parameter $\epsilon$ on the Online Questions and Online News datasets. Subfigures (a) and (b) show the coherence on the two datasets respectively.

5.3 Experiments and analysis

5.3.1 Better topic discovery ability of RIBS

Topic model is designed for topic discovery, so topic quality is a significant judgement of model performance. This experiment aims to show RIBS has a better performance in topic discovery than baselines. We choose coherence [22] as the evaluation metric. The main idea of coherence is that a good topic should consist of words in cohesive semantic similarity. It is calculated as follows:

$\displaystyle C=\frac{1}{K}\sum_{z=1}^{K}\sum_{m=2}^{M}\sum_{l=1}^{m-1}\log\frac{n_{D}(w_{m}^{z},w_{l}^{z})+\epsilon^{\prime}}{n_{D}(w_{l}^{z})},$ (17)

where $[w_{1}^{z},w_{2}^{z},\ldots,w_{M}^{z}]$ denotes the $M$ most representative words of topic $z$, $n_{D}(w_{l}^{z})$ is the word frequency of $w_{l}^{z}$ and $n_{D}(w_{m}^{z},w_{l}^{z})$ is the co-occurrence count of the two words in the corpus. $C$ is a negative number, and a higher value (closer to 0) indicates better performance. We conduct this experiment with $K=$ 5, 10, 15, 20, 25, 30 and calculate coherence with $M$ set to 5, 10 and 20.
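As a minimal sketch of how Eq. (17) can be evaluated, the function below computes $C$ from the top words of each topic and a tokenised corpus. Two points are assumptions rather than details given in the text: counts are taken as the number of documents containing the word or word pair, and the smoothing constant $\epsilon^{\prime}$ defaults to 1. The function and argument names are hypothetical.

```python
import math


def topic_coherence(topics, documents, eps=1.0):
    """Average coherence C over topics, following the structure of Eq. (17).

    topics: one list per topic with its M most representative words, best first.
    documents: tokenised corpus, one list of words per document.
    eps: smoothing constant standing in for epsilon' (assumed value).
    """
    doc_sets = [set(doc) for doc in documents]

    def n_d(*words):
        # Assumption: a count is the number of documents containing all words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    total = 0.0
    for words in topics:
        for m in range(1, len(words)):  # m = 2..M in the paper's 1-based indexing
            for l in range(m):          # l = 1..m-1
                # Top words are drawn from the corpus, so n_d(words[l]) >= 1.
                total += math.log((n_d(words[m], words[l]) + eps) / n_d(words[l]))
    return total / len(topics)
```

Because the sum runs over $M(M-1)/2$ word pairs per topic, the magnitude of $C$ grows with $M$, which is why values for $M=20$ are much larger in absolute terms than those for $M=5$.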

In Tables 3 and 4, the four short text topic models outperform LDA in almost every setting; the exception is $K=30$ on the Online News Dataset, where LDA is competitive. This suggests that LDA is poorly suited to short texts because the document-word matrix is too sparse. The results of BTM and d-BTM show that biterm construction benefits short text topic discovery. However, d-BTM achieves worse coherence than BTM. We attribute this to the deletion of biterms in d-BTM, which leads to information loss when the text is very short. The performance of RIBS-LSTM is similar to that of BTM, with improvements in only a few cases. This indicates that the quantifiable relationship learned by LSTM is not as suitable as that learned by the Elman network. Whatever the value of $M$, the coherence of RIBS is closer to 0 than that of BTM and d-BTM, and the improvement comes from the quantifiable relationship introduced by RIBS. This kind of prior knowledge is learned from the whole corpus and remembers observed words over time, which encourages two words with a closer semantic relationship to occur in the same topic with higher probability. Compared to our previous work [19], a new observation is that the improvement of RIBS-LSTM in topic coherence is smaller than that of RIBS. We think this is due to the noise introduced by LSTM cells, which may reduce performance in short text scenarios.

Table 3
Topic coherence on Online Questions Dataset (the best results are in boldface)

	$K=5$			$K=10$			$K=15$
	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$
LDA	$-217.1\pm 6.5$	$-1174.1\pm 21.1$	$-5702.9\pm 32.4$	$-94.3\pm 3.2$	$-514.6\pm 13.7$	$-2625.0\pm 24.3$	$-60.2\pm 2.2$	$-314.7\pm 6.5$	$-1627.5\pm 27.2$
BTM	$-20.3\pm 2.2$	$-104.2\pm 8.1$	$-535.9\pm 13.5$	$-22.2\pm 0.9$	$-113.8\pm 2.5$	$-581.4\pm 17.9$	$-21.8\pm 1.2$	$-117.5\pm 2.4$	$-607.2\pm 8.4$
d-BTM	$-22.3\pm 2.8$	$-107.4\pm 5.4$	$-540.8\pm 16.6$	$-23.2\pm 1.2$	$-115.8\pm 2.4$	$-589.7\pm 10.7$	$-23.0\pm 1.1$	$-118.7\pm 1.3$	$-616.1\pm 7.3$
RIBS-LSTM	$-23.4\pm 2.4$	$-111.4\pm 6.5$	$-546.3\pm 7.7$	$-20.5\pm 0.5$	$-114.1\pm 2.6$	$-593.1\pm 13.8$	$-21.0\pm 0.6$	$-119.1\pm 2.5$	$-612.3\pm 5.8$
RIBS	$-17.1\pm 0.2$	$-89.7\pm 0.1$	$-523.8\pm 5.7$	$-17.7\pm 0.5$	$-107.7\pm 2.0$	$-550.1\pm 5.5$	$-19.2\pm 0.4$	$-112.0\pm 3.0$	$-597.7\pm 5.1$
	$K=20$			$K=25$			$K=30$
	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$
LDA	$-41.6\pm 1.3$	$-223.8\pm 5.9$	$-1153.8\pm 12.0$	$-31.4\pm 1.3$	$-172.6\pm 1.6$	$-897.9\pm 11.0$	$-25.4\pm 0.9$	$-139.2\pm 2.6$	$-727.3\pm 10.8$
BTM	$-21.4\pm 0.9$	$-117.6\pm 2.0$	$-623.2\pm 9.0$	$-20.5\pm 0.8$	$-118.0\pm 1.6$	$-629.6\pm 11.0$	$-21.4\pm 0.8$	$-121.1\pm 1.4$	$-640.6\pm 5.7$
d-BTM	$-22.9\pm 0.6$	$-119.4\pm 1.5$	$-623.9\pm 6.6$	$-21.8\pm 0.4$	$-121.0\pm 1.4$	$-634.2\pm 6.6$	$-22.2\pm 0.4$	$-122.3\pm 2.1$	$-644.7\pm 6.6$
RIBS-LSTM	$-20.1\pm 1.0$	$-117.6\pm 1.7$	$-615.5\pm 4.3$	$-20.4\pm 0.8$	$-117.3\pm 2.9$	$-621.1\pm 7.3$	$-20.7\pm 0.7$	$-119.6\pm 1.3$	$-636.2\pm 5.0$
RIBS	$-19.1\pm 0.7$	$-112.0\pm 2.5$	$-599.6\pm 4.3$	$-18.5\pm 0.5$	$-112.1\pm 1.1$	$-606.3\pm 5.9$	$-19.3\pm 0.6$	$-114.1\pm 1.6$	$-615.8\pm 6.1$

Table 4
Topic coherence on Online News Dataset (the best results are in boldface)

	$K=5$			$K=10$			$K=15$
	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$
LDA	$-155.2\pm 5.4$	$-851.3\pm 19.7$	$-4443.7\pm 45.2$	$-70.5\pm 2.0$	$-392.0\pm 7.0$	$-2065.2\pm 28.6$	$-44.6\pm 2.0$	$-245.4\pm 7.0$	$-1305.8\pm 19.3$
BTM	$-22.3\pm 1.4$	$-109.6\pm 7.9$	$-548.7\pm 23.0$	$-20.6\pm 1.2$	$-118.6\pm 3.3$	$-633.7\pm 13.6$	$-21.7\pm 2.1$	$-127.3\pm 5.9$	$-671.2\pm 20.0$
d-BTM	$-25.5\pm 2.1$	$-125.1\pm 4.1$	$-589.4\pm 33.2$	$-23.2\pm 1.6$	$-132.3\pm 5.0$	$-665.1\pm 16.6$	$-24.1\pm 2.7$	$-136.3\pm 8.4$	$-689.1\pm 22.8$
RIBS-LSTM	$-22.4\pm 1.3$	$-109.8\pm 7.6$	$-555.0\pm 30.6$	$-22.0\pm 1.3$	$-125.1\pm 2.9$	$-652.3\pm 10.6$	$-21.7\pm 1.2$	$-129.3\pm 5.1$	$-669.3\pm 17.3$
RIBS	$-16.6\pm 2.1$	$-100.0\pm 3.5$	$-511.1\pm 6.4$	$-17.7\pm 1.5$	$-104.9\pm 5.3$	$-588.7\pm 14.1$	$-19.6\pm 1.8$	$-115.4\pm 5.4$	$-627.8\pm 17.0$
	$K=20$			$K=25$			$K=30$
	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$	$M=5$	$M=10$	$M=20$
LDA	$-32.6\pm 0.8$	$-183.9\pm 3.3$	$-967.3\pm 9.2$	$-26.2\pm 1.5$	$-144.8\pm 5.1$	$-763.0\pm 15.7$	$-21.9\pm 0.7$	$-119.9\pm 2.8$	$-633.5\pm 8.9$
BTM	$-23.3\pm 1.8$	$-132.7\pm 3.7$	$-676.6\pm 17.5$	$-24.5\pm 2.0$	$-133.8\pm 3.8$	$-690.3\pm 8.5$	$-22.7\pm 0.7$	$-132.3\pm 2.6$	$-692.9\pm 9.1$
d-BTM	$-24.7\pm 1.9$	$-139.8\pm 5.0$	$-703.0\pm 17.0$	$-24.1\pm 1.3$	$-135.8\pm 2.9$	$-694.8\pm 11.5$	$-24.1\pm 1.2$	$-134.8\pm 3.2$	$-706.8\pm 7.2$
RIBS-LSTM	$-23.9\pm 2.6$	$-135.0\pm 7.3$	$-684.3\pm 11.5$	$-22.3\pm 1.4$	$-131.7\pm 5.3$	$-683.7\pm 15.4$	$-22.5\pm 1.3$	$-130.0\pm 12.3$	$-680.4\pm 12.3$
RIBS	$-18.7\pm 0.6$	$-119.8\pm 2.4$	$-637.5\pm 11.0$	$-19.6\pm 1.0$	$-121.1\pm 3.4$	$-653.9\pm 11.3$	$-20.5\pm 1.9$	$-123.4\pm 4.4$	$-657.5\pm 15.3$

5.3.2 Better topic description ability of RIBS-Bigrams

This experiment aims to show the improvement in topic readability by displaying the topics discovered by different topic models. Most topic models describe topics with a set of single words, which is one of the deficiencies of topic models. In RIBS, the relationship between words learned by the RNN not only describes biterms better but also helps generate bigrams for displaying topics. We therefore introduce a bigram-based short text topic model named RIBS-Bigrams, extended from RIBS. The comparison between BTM and RIBS-Bigrams on both datasets is shown in Tables 5 and 6.

Table 5

Topic display on Online Questions Dataset

TOPIC about relationship		TOPIC about business		TOPIC about digital products
BTM	RIBS-Bigrams	BTM	RIBS-Bigrams	BTM	RIBS-Bigrams
Love	Doesnt love	Credit	Money	Phone	Cell phone
Girl	Girl	Money	Bad credit	Music	Itunes music
Guy	Guy	Business	Business	Computer	Connect computer
Deal	Friend	Job	Buy	Ipod	Ipod video
Life	Deal	Yahoo	Buy house	Player	Cd player
Person	Sex life	Card	Company	Dvd	Dvd player
Sex	Time	Company	Pay	Cell	Transfer songs
Husband	Husband	Real	Credit card	Mp3	Download dvd
Time	Person	Search	Real estate	Transfer	Mp3 player
Friend	Boyfriend	Email	Income tax	Songs	Tv

Table 6

Topic display on Online News Dataset

TOPIC about TV and movies		TOPIC about phones		TOPIC about entertainment
BTM	RIBS-Bigrams	BTM	RIBS-Bigrams	BTM	RIBS-Bigrams
Star	Season finale	Samsung	Samsung galaxy	Kim	Kim kardashian
Wars	Game thrones	Galaxy	Apple	Kardashian	Kanye west
Movie	Movie trailer	Apple	Galaxy s5	Kanye	Kardashian’s wedding
Trailer	Watch	Google	Android	West	Justin bieber
Box	Box office	Android	Ipad iphone	Wedding	Baby north
Office	Video game	S5	Microsoft	Justin	Selena gomez
Film	Movie review	Microsoft	Specs price	Miley	Chris martin
America	Captain america	Iphone	Google	Cyrus	Seth rogen
Episode	Season episode	Price	Release surface	Bieber	Video
Captain	Episode recap	Phone	Htc m8	Video	Photo

Take the topic about TV and movies in Table 6 as an example. Taken separately, the words box and office have little correlation with TV and movies. When they appear together, however, box office denotes “a place at a theater or other arts establishment where tickets are bought or reserved”, which is a common phrase in the movie industry. Moreover, bigrams carry a more specific meaning when describing topics. Take the topic about entertainment in Table 6 as an example: BTM discovers many first names or surnames, such as justin. Many people are named ‘justin’, so we can hardly tell which field this ‘justin’ belongs to. RIBS-Bigrams resolves this ambiguity by generating the bigram justin bieber. With this phrase, we can tell that it is the name of a popular singer rather than, say, a manager, and such a name is more likely to describe a topic about entertainment. Another advantage of bigrams is that more topical information can be listed within a fixed number of words or bigrams. Consider the topic about entertainment again. Compared with BTM, RIBS-Bigrams discovers five new topical entries (baby north, selena gomez, chris martin, seth rogen, photo), which are all closely related to this topic.

5.3.3 Advantage of document characterization

Document characterization is a common application of topic models, so we conduct clustering and classification experiments to show the advantage of RIBS from another perspective.

Clustering aims to gather unlabeled documents into several clusters, each of which contains semantically similar documents. This is an effective way to measure topic quality. For a fair comparison, we use the same clustering method as BTM. We take each topic as a cluster and assign each document $d$ to the topic cluster $z$ with the highest conditional probability $P(z|d)$. We denote the set of output clusters as $\Omega=\{\omega_{1},\omega_{2},\ldots,\omega_{K}\}$ and the $P$ labeled categories of the documents $\mathbb{D}$ as the set $\mathbb{C}=\{c_{1},c_{2},\ldots,c_{P}\}$. We use purity and entropy, two common evaluation metrics for clustering, to compare the five models:

•
Purity calculates the ratio of the dominant category in each cluster, where a larger value indicates a better performance. Formally:

${\text{purity}}(\Omega,\mathbb{C})=\frac{1}{|\mathbb{D}|}\sum_{i=1}^{K}\max_{j\in\{1,2,\ldots,P\}}|\omega_{i}\cap c_{j}|.$
•
Entropy measures the chaos in a set of data, so a smaller entropy indicates a better performance. Formally:

${\text{entropy}}(\Omega,\mathbb{C})=-\frac{1}{|\mathbb{D}|}\sum_{i=1}^{K}\sum_{j=1}^{P}|\omega_{i}\cap c_{j}|\log_{2}\frac{|\omega_{i}\cap c_{j}|}{|\omega_{i}|}.$

We report purity and entropy averaged over ten runs. We vary $K$ from 5 to 30 in steps of 5 for both datasets. The results are shown in Fig. 5.
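The clustering protocol and the two metrics can be sketched as follows. This is an illustrative implementation under the definitions above, not the authors' evaluation script; the function names are hypothetical, and terms with $|\omega_{i}\cap c_{j}|=0$ are skipped, following the usual convention $0\log 0=0$.

```python
import math
from collections import Counter, defaultdict


def assign_clusters(doc_topic):
    """Assign each document to the topic z with the highest P(z|d)."""
    return [max(range(len(p)), key=p.__getitem__) for p in doc_topic]


def purity_and_entropy(clusters, labels):
    """Return (purity, entropy) for cluster ids and gold category labels."""
    n = len(labels)
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        by_cluster[cluster][label] += 1  # |omega_i ∩ c_j|

    # Purity: size of the dominant category in each cluster, summed, over |D|.
    purity = sum(max(counts.values()) for counts in by_cluster.values()) / n

    # Entropy: only non-empty intersections are stored, so log2 is always defined.
    entropy = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())  # |omega_i|
        for k in counts.values():
            entropy -= k * math.log2(k / size)
    return purity, entropy / n


doc_topic = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
clusters = assign_clusters(doc_topic)
purity, entropy = purity_and_entropy(clusters, ["sports", "sports", "tech", "tech"])
```

In this toy case each cluster holds a single category, so the clustering is perfectly pure; putting all four documents into one cluster would halve the purity and raise the entropy to 1 bit.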

We have the following observations. Although d-BTM improves on BTM in clustering in most cases on the Online Questions Dataset, on the Online News Dataset its performance is even worse than that of BTM in some cases. We think that deleting biterms may remove several topic-irrelevant ones but also loses some word-topic information at the same time. The results in all four subfigures show that RIBS-LSTM and RIBS perform better than the other models on both datasets. The reason is that, unlike d-BTM, RIBS uses probabilistic knowledge learned from IDF to eliminate high-frequency words, which retains as much word-topic information as possible. We think this is useful for achieving better topic representations. We also find that the performance of RIBS-LSTM in the clustering task is not very stable: it is better than LDA, BTM and d-BTM in most cases but may fail in a few. This shows that bringing in prior knowledge is effective, but that using LSTM is not as good as using the Elman network.

Classification aims to assign a label to each document by learning from labeled documents. We use the topic distributions over documents $p(z|d)$ inferred by the topic models as document features. A higher classification accuracy indicates that the learned topic representations are more discriminative. We choose naïve Bayes and SVM as the two classification algorithms and compute the classification accuracy through tenfold cross-validation on both datasets. We vary $K$ from 5 to 30 in steps of 5.
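The evaluation loop can be sketched in a few lines. The example below is a sketch under stated assumptions: it uses a Gaussian naïve Bayes classifier written from scratch (the text does not say which naïve Bayes variant or which library was used), a simple strided split into ten folds, and hypothetical function names. Each document is represented by its $K$-dimensional vector $p(z|d)$.

```python
import math
import random
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-6):
    """Estimate a log prior and per-feature mean and variance for every class."""
    groups = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)
    model = {}
    for label, rows in groups.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # var_floor keeps variances positive when a feature is constant in a class.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(zip(*rows), means)]
        model[label] = (math.log(n / len(y)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=log_posterior)


def cross_val_accuracy(X, y, folds=10, seed=0):
    """Accuracy pooled over the held-out folds of a k-fold cross-validation."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]  # every folds-th shuffled index forms one fold
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_gaussian_nb([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
        correct += sum(predict_gaussian_nb(model, X[i]) == y[i] for i in test_idx)
    return correct / len(y)
```

With real data, `X` would hold the inferred $p(z|d)$ vectors of one topic model and `y` the category labels, and the procedure would be repeated for each model and each value of $K$.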

We have the following observations from Tables 7 and 8. Firstly, on both datasets, our RIBS-LSTM and RIBS models significantly outperform existing models in most cases. This observation shows that the improvement benefits from involving the quantifiable relationship between biterms in short text topic modeling. Secondly, on the Online News Dataset, RIBS-LSTM and RIBS achieve quite similar performance. This may be due to the average document length of the Online News Dataset, which is longer than that of the Online Questions Dataset. As a result, the noise that LSTM introduces by remembering irrelevant words may be reduced, and RIBS-LSTM performs similarly to RIBS.

In summary, the topic discovery experiments show that the RIBS topic model can discover more coherent topics and that RIBS-Bigrams can select more suitable bigrams with higher probability to represent topics. From both the clustering and classification experiments, we conclude that RIBS performs better than the baselines in document characterization tasks.

Table 7
Classification accuracy on Online Questions Dataset (the best results are in boldface)

Table 8
Classification accuracy on Online News Dataset (the best results are in boldface)

Figure 5.
Clustering performance on Online Question and Online News Datasets. Subfigures (a) and (b) are purity and entropy performance on Online Question Dataset respectively. Subfigures (c) and (d) are purity and entropy performance on Online News Dataset respectively.

6. Conclusion and future work

Topic models are widely accepted as an effective tool for organizing and summarizing digital data automatically. With the explosive growth of social networks on the Internet, topic modeling for short text has become a promising research field. Analysing short text data suffers from the sparsity problem. In this paper, we introduce a novel short text topic model named RIBS, which brings in prior knowledge learned from RNN and IDF for describing biterms. We also introduce RIBS-Bigrams, a topic model extended from RIBS that displays topics with bigrams. To the best of our knowledge, few topic models have considered solving the sparsity problem and displaying topics better at the same time. Experimental results on two open-source, real-world datasets show that this kind of prior knowledge is important and useful for short text topic discovery, and demonstrate the effectiveness of RIBS and RIBS-Bigrams. As for future work, since most short text data emerge continuously, we would like to extend RIBS into an online model and apply it in practice.

Acknowledgments

This paper is supported by the National Key Research and Development Program of China (grant no. 2016YFB1001102) and the National Natural Science Foundation of China (grant no. 61375069, 61403156, 61502227). This research is also supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University. We would also like to thank the UCI Machine Learning Repository and Yahoo! Research for the datasets.

References

1. Alikaniotis, Yannakoudakis and Rei, Automatic text scoring using neural networks, CoRR, abs/1606.04289, 2016.
2. Amiri and Daumé III, Short text representation for detecting churn in microblogs, In Proceedings of the 30th AAAI conference on Artificial Intelligence, 2016, pp. 2566–2572.
3. Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, Journal of machine Learning Research 3 (Jan 2003), 993–1022.
4. Chen, Zheng, Zhou and Chen, Making recommendations on microblogs through topic modeling, In International Conference on Web Information Systems Engineering, Springer, 2013, pp. 252–265.
5. Chen G.-B. and Kao H.-Y., Word co-occurrence augmented topic model in short text, Intelligent Data Analysis 21(S1) (2017), S55–S70.
6. Chen, Jose J.M., Yuan and Zhang, A semantic graph based topic model for question retrieval in community question answering, In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, 2016, pp. 287–296.
7. Dent and Paul, Through the twitter glass: detecting questions in micro-text, In Proceedings of the 5th AAAI Conference on Analyzing Microtext, AAAI Press, 2011, pp. 8–13.
8. Elman J.L., Finding structure in time, Cognitive Science 14(2) (1990), 179–211.
9. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.
10. Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 1999, pp. 50–57.
11. Hong and Davison B.D., Empirical study of topic modeling in twitter, In Proceedings of the first workshop on social media analytics, ACM, 2010, pp. 80–88.
12. Wang and He, Question-answer topic model for question retrieval in community question answering, In Proceedings of the 21st ACM international conference on Information and knowledge management, ACM, 2012, pp. 2471–2474.
13. Jiang, Qian, Shen and Mei, Author topic model-based collaborative filtering for personalized poi recommendations, IEEE Transactions on Multimedia 17(6) (2015), 907–918.
14. Jin, Liu N.N., Zhao and Yang, Transferring topical knowledge from auxiliary long texts for short text clustering, In Proceedings of the 20th ACM international conference on Information and knowledge management, ACM, 2011, pp. 775–784.
15. Lau J.H., Collier and Baldwin, On-line trend analysis with topic models: #twitter trends detection topic model online, In COLING, 2012, pp. 1519–1534.
16. Wang, Zhang, Sun and Ma, Topic modeling for short texts with auxiliary word embeddings, In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, ACM, 2016, pp. 165–174.
17. Lichman, UCI machine learning repository, 2013.
18. Lin, Tian, Mei and Cheng, The dual-sparse topic model: Mining focused topics and focused terms in short text, In Proceedings of the 23rd international conference on World wide web, ACM, 2014, pp. 539–550.
19. Xie L.-Y., Kang, Wang C.-J. and Xie J.-Y., Don’t forget the quantifiable relationship between words: Using recurrent neural network for short text topic discovery, In AAAI, 2017, pp. 1192–1198.
20. Mcauliffe J.D. and Blei D.M., Supervised topic models, In Advances in neural information processing systems, 2008, pp. 121–128.
21. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, In Advances in neural information processing systems, 2013, pp. 3111–3119.
22. Mimno, Wallach H.M., Talley, Leenders and McCallum, Optimizing semantic coherence in topic models, In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 262–272.
23. Shalit, Weinshall and Chechik, Modeling musical influence with topic models, In ICML (2), 2013, pp. 244–252.
24. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 28(1) (1972), 11–21.
25. Sutskever, Martens and Hinton G.E., Generating text with recurrent neural networks, In Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 1017–1024.
26. Teh Y.W., Jordan M.I., Beal M.J. and Blei D.M., Sharing clusters among related groups: Hierarchical dirichlet processes, In Advances in neural information processing systems, 2005, pp. 1385–1392.
27. Wang, Zhang and Hao, A robust framework for short text categorization based on topic model and integrated classifier, In 2014 International Joint Conference on Neural Networks, IEEE, 2014, pp. 3534–3539.
28. Weng, Lim E.-P., Jiang and He, Twitterrank: finding topic-sensitive influential twitterers, In Proceedings of the third ACM international conference on Web search and data mining, ACM, 2010, pp. 261–270.
29. Xia, Tang, Hussain and Cambria, Discriminative bi-term topic model for headline-based social news clustering, In FLAIRS Conference, 2015, pp. 311–316.
30. Yan, Guo, Lan and Cheng, A biterm topic model for short texts, In Proceedings of the 22nd international conference on World Wide Web, ACM, 2013, pp. 1445–1456.
31. Yin and Wang, A dirichlet multinomial mixture model-based approach for short text clustering, In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014, pp. 233–242.
32. Zhao W.X., Jiang, Weng, Lim, Yan and Li, Comparing twitter and traditional media using topic models, In European Conference on Information Retrieval, Springer, 2011, pp. 338–349.
33. Zuo, Zhao and Xu, Word network topic model: a simple but general solution for short and imbalanced texts, Knowledge and Information Systems, 2014, pp. 1–20.
