A practical algorithm for solving the sparseness problem of short text clustering

Abstract

Dirichlet Multinomial Mixture (DMM) models have been successful in clustering short texts. However, the word co-occurrence information that these models can capture is limited to the short text corpus itself. If two words are strongly related but rarely co-occur in short texts, these models cannot fully capture the semantic relatedness between them. In this paper, we propose a novel model, called WDMM, that incorporates word-word correlation into DMM. By constructing a sparse graph from word-word relationships, our model expands each short text with the neighboring words of the words it contains, which helps to alleviate the sparseness of short texts. The cluster label of each text is therefore influenced not only by its own words but also by their similar words in the corpus. Experimental results on real-world datasets demonstrate the substantial superiority of WDMM over state-of-the-art methods.

Keywords

Short text clustering, Dirichlet multinomial mixture

1. Introduction

Along with the emergence and popularity of social media (e.g. Twitter and Facebook), short text clustering is critical for building many applications such as user profiling [17] and recommendation [30]. Compared with regular texts, short text clustering has the following three challenges: only very limited word co-occurrence information is available, the frequency of words plays a less discriminative role, and the limited contexts make it more difficult to identify the senses of ambiguous words [25]. Therefore, to address these challenges, two major heuristic strategies have been adopted to deal with how to cluster short texts.

One way to overcome these issues is to use topic models to cluster short texts. The most common one follows the simple assumption that each text is sampled from only one latent cluster, known as the mixture of unigrams or Dirichlet Multinomial Mixture (DMM) model [21, 34, 35]. This assumption is unsuited to regular texts, but it fits short texts better than the more complex assumption adopted by LDA[3] and PLSA[3] that each text is modeled over a set of topics. Many variants of topic models for short texts have been proposed recently [16, 18, 23, 25]. This strategy alleviates the problem of sparseness and converges quickly. However, the word co-occurrence information that these models can capture is limited to the short text corpus itself. If two words are strongly related but rarely co-occur in short texts, these models cannot fully capture the semantic relatedness between them. The second strategy takes advantage of word embeddings or autoencoders to learn distributed representations of short texts, and then chooses a clustering method (e.g., KMeans[9]) to group those texts [31, 13, 29, 33, 4]. For example, Skip-thoughts [13] trains encoder-decoder Recurrent Neural Networks (RNN) without supervision to predict the next and the previous sentences given the current sentence. Unlike the TF-IDF metric, which relies on the surface form of the text, distributed representations encode a text as a compact, dense and lower dimensional vector, with the semantic meaning of the text distributed along the dimensions of the vector. As noted by Finegan-Dollak et al.[5], distributed representations ought to perform better on a dataset that uses a wide variety of words to express the same idea, but should enjoy no advantage on datasets whose underlying data can be clustered tightly.

In this paper, we try to keep the advantages of topic models and use the advantages of distributed representations to overcome the weakness of topic models when applied to short text clustering. We propose a practical method for short text clustering called WDMM, which incorporates Word-word correlation into the Dirichlet Multinomial Mixture. Our solution achieves excellent performance in short text clustering without a radical increase in the computational complexity of the inference algorithm. More specifically, we do not incorporate word-word correlations at the prior level but rather softly enforce them at the posterior level. In this sense, our method expands each short text by linking semantically relevant words together, even if they rarely co-occur in short texts. We conduct comprehensive experiments to evaluate WDMM and to demonstrate its effectiveness. We compare WDMM with a traditional clustering method (K-Means [32]) and a probabilistic topic model (LDA [3]). We also compare with distributed representation models such as Word2Vec [19] and the Skip-thoughts model [13], with the Dirichlet Multinomial Mixture model [34], and with its variations GPU-DMM and GPU-PDMM[16]. WDMM achieves state-of-the-art performance across various datasets.

The rest of this paper is organized as follows. In Section 2, we review related work. In Section 3, we detail our approach and provide an efficient collapsed Gibbs sampling algorithm. Finally, in Section 4, we illustrate the efficacy of our approach in several real-world datasets.

2. Related work

Our work is related to three lines of research in literature: model-based methods, text representation, and short text clustering with side information.

2.1 Model-based methods

One of the most promising approaches for short text clustering is model-based clustering [12, 2, 28]. Among these algorithms, the Gaussian Mixture Model (GMM) [26] is widely adopted, based on the assumption that data points are generated by a mixture of Gaussian distributions. Since the complexity of GMM is too large for high-dimensional data like text, GMM is not widely used for short text. Nigam et al. [21] proposed a mixture of unigrams model, which assumes that each text is generated by one cluster. The unigram model is an EM-based algorithm for the Dirichlet Multinomial Mixture (DMM) model. Besides basic expectation maximization (EM), a number of inference methods have been used to estimate the parameters, including variational inference and Gibbs sampling. Yu et al. [36] proposed the DMAFP model based on a variational inference algorithm [10]. Yin and Wang [34] proposed a collapsed Gibbs sampling algorithm for DMM (abbreviated GSDMM), and found that GSDMM can infer the number of clusters automatically. To reduce the running time, Yin and Wang proposed a new algorithm, FGSDMM [35], which uses an online clustering scheme for initialization. Qiang et al. [24] proposed a novel model based on the Pitman-Yor Process to capture the power-law phenomenon of the cluster distribution. In addition, many variants of GSDMM have recently been proposed for short text topic modeling [25, 37] and streaming short text clustering [18]. All these methods focus on clustering short texts with limited word co-occurrence information, and no additional data is leveraged.

2.2 Text representation

Text representation is one of the fundamental problems in text mining and Information Retrieval (IR). It aims to represent unstructured text documents numerically to make them mathematically computable. Unlike model-based methods, which directly produce the cluster label of each text, these approaches first obtain a representation of each text and then apply a clustering algorithm (KMeans[32], Affinity Propagation (AP)[7], and so on). They fall into two major categories: traditional text representation methods and distributed representations. Traditional text representation methods rely on the surface form of the text; Term Frequency-Inverse Document Frequency (TFIDF), N-grams, and dependency counts are the most commonly used [5]. Because only very limited word co-occurrence information is available in short texts, traditional text representation methods do not work very well for short text clustering. Recently, with the help of word embeddings, neural networks have shown strong performance in constructing text representations, since distributed representations can capture the semantic meanings of words. Word2Vec [19] and Glove [22] are state-of-the-art word representation models. Inspired by Word2Vec, Doc2Vec [14] directly learns vector representations of sentences and documents. Skip-thoughts [13] trains encoder-decoder Recurrent Neural Networks (RNN) without supervision to predict the next and the previous sentences given the current sentence. Different autoencoders are also used to learn representations of text documents automatically by trying to reconstruct the input at the output layer [4].

2.3 Short text clustering with side information

Due to the sparse nature of short texts, many approaches employ large, external document repositories, such as Wikipedia [1] or the Open Directory Project [6], to incorporate additional world knowledge into the clustering process. Tang et al. [29] proposed an end-to-end approach to train a text expansion algorithm that uses the search results from a large collection to expand short texts. The sheer size of many of these external collections can make these techniques difficult or time consuming to apply. GPU-DMM, applied in short text topic modeling [16], leverages the word-word relationship during the topic inference process to tackle the data sparsity issue. The direct difference between GPU-DMM and WDMM is that WDMM can infer the number of clusters. In addition, our model adopts a graph-based proposal distribution to incorporate the similarity matrix, whereas in GPU-DMM semantically related words are extracted and promoted using the generalized Pólya Urn model. Finally, our model adjusts the promotion weight based on the similarity value between words, whereas this weight is a fixed value in GPU-DMM.

3. Short text clustering

In this section, we describe how our algorithm uses word-word side information to address the sparseness of short texts. As in GSDMM, we keep the generative process intact, thus enjoying the conjugacy between the Dirichlet and multinomial distributions. Unlike GSDMM, we use the word-word relationship to push the posterior distributions over correlated words to be similar, which biases the model towards the desired effect. We first introduce a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture (GSDMM), then specify how we represent word-word relationships, and finally give our model. The meaning of the variables used in the paper is shown in Table 1.

Table 1
The notations of symbols used in the paper

$D$	The whole document corpus
$V$	The vocabulary in $D$
$K$	The number of clusters
$\alpha,\beta$	The hyperparameters of the Dirichlet priors
$\theta$	The cluster distribution in $D$
$\phi_{k}$	The probability distribution of words in cluster $k$
$z_{d}$	The cluster index of document $d$
$w_{d,i}$	The $i$ -th word of document $d$
$m_{k}$	The number of documents in $D$ belonging to cluster $k$
$n_{k}^{w}$	The number of word $w$ in cluster $k$
$n_{d}^{w}$	The number of word $w$ in document $d$
$n_{k}$	The number of words belonging to topic $k$
$G$	A stochastic graph
$\mathcal{N}(w)$	All neighbors of node $w$ in $G$
$\lambda$	A user-specified parameter
$S$	A similarity matrix
$S_{wv}$	The similarity value between word $w$ and $v$

3.1 Dirichlet multinomial mixture model

In GSDMM [34, 35], one assumes that a document is sampled from a single cluster and all words of this document are from this cluster. That is, the documents are generated following the graphical model in Fig. 1.

Figure 1.

Graphical model of GSDMM.

Suppose the corpus contains $D$ short documents grouped into $K$ clusters, over a vocabulary of $V$ unique words. For the whole corpus, draw a cluster distribution $\theta$ from a Dirichlet distribution with concentration parameter $\alpha$ ,

$\displaystyle\theta\sim\textit{Dir}(\alpha)$ (1)

Each cluster $k$ draws a word distribution from a Dirichlet distribution with concentration parameter $\beta$ ,

$\displaystyle\phi_{k}\sim\textit{Dir}(\beta)$ (2)

Each document $d\in\{1,\ldots,D\}$ draws a cluster from the multinomial $\theta$ via,

$\displaystyle z_{d}\sim\textit{Discrete}(\theta)$ (3)

Each word $i\in\{1,\ldots,n_{d}\}$ in the document $d$ is sampled from the cluster-word multinomial distribution $\phi_{z_{d}}$ via,

$\displaystyle w_{d,i}\sim\textit{Discrete}(\phi_{z_{d}})$ (4)
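The generative process of Eqs (1) to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, symmetric Dirichlet priors are drawn by normalizing Gamma variates, and every document is given the same fixed length `doc_len`, which the model itself does not prescribe.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(num_docs, num_clusters, vocab_size, doc_len, alpha, beta, seed=0):
    """Sample cluster labels and documents (lists of word ids) from the DMM model."""
    rng = random.Random(seed)
    theta = sample_dirichlet(rng, alpha, num_clusters)              # Eq. (1)
    phi = [sample_dirichlet(rng, beta, vocab_size)                  # Eq. (2)
           for _ in range(num_clusters)]
    labels, docs = [], []
    for _ in range(num_docs):
        z = rng.choices(range(num_clusters), weights=theta)[0]      # Eq. (3)
        # Assumption: fixed document length; all words share the cluster z.
        words = rng.choices(range(vocab_size), weights=phi[z], k=doc_len)  # Eq. (4)
        labels.append(z)
        docs.append(words)
    return labels, docs
```

Note that every word of a document is drawn from the single distribution $\phi_{z_{d}}$, which is the one-cluster-per-document assumption that distinguishes DMM from LDA.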

A key property to derive an efficient sampler for DMM is the fact that the Dirichlet distribution is a conjugate prior to the multinomial distribution. This allows us to integrate out $\theta$ and $\phi$ . This yields a Gibbs sampler to draw $p(z_{d}\mid rest)$ efficiently. The conditional probability is given by,

$\displaystyle p(z_{d}=k\mid\textit{rest})\propto\frac{m_{k,-d}+\alpha}{D-1+K\alpha}\times\frac{\prod_{w\in d}\prod_{j=1}^{n_{d}^{w}}(n_{k,-d}^{w}+\beta+j-1)}{\prod_{i=1}^{n_{d}}(n_{k,-d}+V\beta+i-1)}$ (5)

where $n_{d}^{w}$ is term frequency of word $w$ in document $d$ , $m_{k,-d}$ is the number of documents assigned to cluster $k$ without considering document $d$ , $n_{k,-d}^{w}$ is the number of word $w$ in cluster $k$ without considering document $d$ , and $n_{k,-d}$ is the number of words assigned to cluster $k$ without considering document $d$ . The posterior cluster-word distribution can be calculated,

$\displaystyle p(w\mid z=k)=\frac{n_{k}^{w}+\beta}{\sum_{v=1}^{V}n_{k}^{v}+V\beta}$ (6)
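As a rough illustration of how the conditional in Eq. (5) is evaluated in practice, the sketch below computes its logarithm for one candidate cluster, together with a smoothed cluster-word probability normalized by $n_{k}+V\beta$ so that it sums to one over the vocabulary. The function and argument names are hypothetical, and the counts passed in are assumed to already exclude document $d$.

```python
import math
from collections import Counter

def gsdmm_log_score(doc, m_k, n_k, n_kw, num_docs, num_clusters,
                    vocab_size, alpha, beta):
    """Unnormalized log p(z_d = k | rest) following Eq. (5).

    doc  : list of words in document d
    m_k  : number of documents in cluster k (excluding d)
    n_k  : number of words in cluster k (excluding d)
    n_kw : dict mapping word -> count in cluster k (excluding d)
    """
    log_p = math.log(m_k + alpha) - math.log(num_docs - 1 + num_clusters * alpha)
    for word, freq in Counter(doc).items():
        for j in range(1, freq + 1):
            log_p += math.log(n_kw.get(word, 0) + beta + j - 1)
    for i in range(1, len(doc) + 1):
        log_p -= math.log(n_k + vocab_size * beta + i - 1)
    return log_p

def cluster_word_probability(word, n_k, n_kw, vocab_size, beta):
    """Smoothed probability of a word under cluster k."""
    return (n_kw.get(word, 0) + beta) / (n_k + vocab_size * beta)
```

Working in log space avoids numerical underflow, since the product over the words of a document becomes very small even for short texts.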

3.2 Representation of word-word relationship

Here, the word-word relationship is represented as a sparse graph $G$ , in which two words are connected if they are semantically similar. Suppose $G$ is a stochastic graph, where the weight between two words is a probability between zero and one (i.e. $G_{wv}\in[0,1]$ ) and the weights over all neighbors of a node sum to 1 ( $\sum_{v\in\mathcal{N}(w)}{G_{wv}}=1$ ). Here, $\mathcal{N}(w)$ denotes all neighbors of node $w$ in $G$ . A similarity matrix $S$ is defined as follows,

$\displaystyle S=(1-\lambda)\times I+\lambda\times G$ (7)

where $I$ is the $V\times V$ identity matrix and $V$ is the number of words in the vocabulary. $\lambda\in[0,1]$ is a user-specified parameter that controls the influence of the neighbors $\mathcal{N}(w)$ on node $w$ . By Eq. (7), $S$ is a stochastic matrix: row $w$ of $S$ sums to $(1-\lambda)+\lambda\sum_{v\in\mathcal{N}(w)}G_{wv}=1$ , and the non-zero entries in row $S_{w}$ are those indexed by $\mathcal{N}(w)\cup\{w\}$ .
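One possible way to build $S$ from word embeddings is sketched below. The construction of $G$ is an assumption made for illustration: each word is linked to its `num_neighbors` most cosine-similar words with positive similarity, and the edge weights are normalized to sum to one. The function names and the dictionary-of-rows representation are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def build_similarity(embeddings, num_neighbors, lam):
    """Return S as {w: {v: S_wv}} with S = (1 - lam) * I + lam * G, as in Eq. (7).

    embeddings    : dict mapping word -> list of floats
    num_neighbors : size of the neighborhood N(w) (assumption: top-k by cosine)
    lam           : the smoothing parameter lambda in [0, 1]
    """
    S = {}
    for w, vec_w in embeddings.items():
        scored = [(cosine(vec_w, vec_v), v)
                  for v, vec_v in embeddings.items() if v != w]
        top = sorted((sv for sv in scored if sv[0] > 0), reverse=True)[:num_neighbors]
        total = sum(s for s, _ in top)
        if total == 0:
            # Assumption: a word without neighbors keeps all mass on itself.
            S[w] = {w: 1.0}
            continue
        row = {w: 1.0 - lam}
        for s, v in top:
            row[v] = lam * s / total   # G_wv is the normalized similarity
        S[w] = row
    return S
```

Storing only the entries for $\mathcal{N}(w)\cup\{w\}$ keeps each row sparse, so the cost of using $S$ during sampling grows with the neighborhood size rather than with $V$.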

3.3 Model inference

We use collapsed Gibbs sampling to carry out posterior inference. The hidden text-level multinomial variables ( $z$ ) are sampled conditioned on a complete assignment of all other hidden variables. WDMM and GSDMM share the same generative process and graphical representation, but differ in the inference process. We now describe how to incorporate the similarity matrix into the graph-based proposal distribution.

For each text $d$ in $D$ , the multinomial distributions $\phi$ and $\theta$ can be integrated out. Suppose that there are $l$ active clusters containing ( $m_{1}$ ,…, $m_{l}$ ) texts, respectively. For an existing cluster label $k\in\{1,\ldots,l\}$ , we use the following conditional probability distribution,

$\displaystyle p(z_{d}=k\mid\textit{rest})\propto\frac{m_{k,-d}+\alpha}{D-1+K\alpha}\times\frac{\prod_{w\in d}\prod_{j=1}^{n_{d}^{w}}(\sum_{v\in\mathcal{N}(w)\cup\{w\}}{S_{wv}n_{k,-d}^{v}}+\beta+j-1)}{\prod_{i=1}^{n_{d}}(n_{k,-d}+V\beta+i-1)}$ (8)

When $\lambda$ equals 0 in Eq. (7), $S_{ij}=0$ for $i\neq j$ and $S_{ii}=1$ , so Eq. (8) reduces to Eq. (5). Unlike in GSDMM, where the cluster label $z_{d}$ is determined by the cluster distribution $\theta$ , $z_{d}$ in WDMM depends on both $\theta$ and the cluster assignments of the neighboring words of each word in the document. By considering both the words of each text and their neighboring words, WDMM can alleviate the sparseness problem that affects GSDMM. As we increase $\lambda$ , the contribution of neighboring words in the graph increases. The left part of Eq. (8) follows the assumptions of the Chinese Restaurant Process and favors clusters with more texts, and the right part indicates that text $d$ tends to choose a cluster whose texts share more words with it.

Compared with Eq. (5), the graph based proposal can be simply achieved by replacing the cluster-word counts statistics $n_{k,-d}^{w}$ by their graph-based convex counterparts,

$\displaystyle n_{k,-d}^{w}=\sum_{v\in\mathcal{N}(w)\cup\{w\}}{S_{wv}n_{k,-d}^{v}}$ (9)

For a new cluster $l+1$ , we use the following equation,

$\displaystyle p(z_{d}=l+1\mid\textit{rest})\propto\frac{\alpha}{D-1+\alpha}\times\frac{\prod_{w\in d}\prod_{j=1}^{n_{d}^{w}}(\beta+j-1)}{\prod_{i=1}^{n_{d}}(V\beta+i-1)}$ (10)
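The following sketch shows how the graph-smoothed counts of Eq. (9) enter the scores of Eqs (8) and (10). It is an illustration, not the authors' implementation: the names are hypothetical, $S$ is assumed to be stored as a dictionary of sparse rows, and a word missing from $S$ is assumed to have no neighbors.

```python
import math
from collections import Counter

def smoothed_count(word, S, n_kw):
    """Eq. (9): sum of S_wv * n_k^v over v in N(w) union {w}."""
    row = S.get(word, {word: 1.0})   # assumption: unknown words have no neighbors
    return sum(weight * n_kw.get(v, 0) for v, weight in row.items())

def wdmm_log_score_existing(doc, m_k, n_k, n_kw, S, num_docs, num_clusters,
                            vocab_size, alpha, beta):
    """Unnormalized log p(z_d = k | rest) for an existing cluster, Eq. (8)."""
    log_p = math.log(m_k + alpha) - math.log(num_docs - 1 + num_clusters * alpha)
    for word, freq in Counter(doc).items():
        base = smoothed_count(word, S, n_kw)
        for j in range(1, freq + 1):
            log_p += math.log(base + beta + j - 1)
    for i in range(1, len(doc) + 1):
        log_p -= math.log(n_k + vocab_size * beta + i - 1)
    return log_p

def wdmm_log_score_new(doc, num_docs, vocab_size, alpha, beta):
    """Unnormalized log p(z_d = l + 1 | rest) for a new, empty cluster, Eq. (10)."""
    log_p = math.log(alpha) - math.log(num_docs - 1 + alpha)
    for freq in Counter(doc).values():
        for j in range(1, freq + 1):
            log_p += math.log(beta + j - 1)
    for i in range(1, len(doc) + 1):
        log_p -= math.log(vocab_size * beta + i - 1)
    return log_p
```

For instance, with $S_{\text{car},\text{car}}=0.8$ and $S_{\text{car},\text{automobile}}=0.2$ , a cluster that contains "automobile" ten times but never "car" still contributes a smoothed count of 2 for "car", so a document mentioning only "car" is drawn towards that cluster.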

3.4 Algorithm implementation

It has been validated that the topic coherence measure based on word co-occurrence patterns is a reliable indicator of topic quality and is highly consistent with domain expert annotation [16]. Accordingly, it is reasonable that words with high semantic relatedness should be clustered together under the same topic. The details of our WDMM model are given in Algorithm 1. The initialization goes from step 1 to step 9. Step 2 assigns each document to an independent cluster. Steps 3 to 9 record the following information: $z_{d}$ (cluster label of each text), $m_{k}$ (number of texts in cluster $k$ ), $n_{k}$ (number of words in cluster $k$ ), and $n_{k}^{w}$ (number of occurrences of word $w$ in cluster $k$ ). This initialization does not consume much time, since the number of occupied tables is close to the real number of clusters after 2 or 3 iterations. In contrast with the random assignment of texts to $K_{\textit{max}}$ clusters in GSDMM, our proposed method does not introduce any noise during initialization. The problem with FGSDMM is that the number of clusters after initialization is lower than the real number of clusters when applied to datasets whose underlying data can be clustered tightly.

Algorithm 1 WDMM Algorithm
Input: $\alpha$ , $\beta$ , $S$ and $D$ short documents
Output: Cluster label $z_{d}$ for each document
1: for $d$ = 1 to $D$ in the corpus do //Initialization
2:     $k\leftarrow d$
3:     $z_{d}\leftarrow k$
4:     $m_{k}\leftarrow m_{k}+1$
5:     for each $w\in d$ do
6:         $n_{k}\leftarrow n_{k}+1$
7:         $n_{k}^{w}\leftarrow n_{k}^{w}+1$
8:     end for
9: end for
10: for each of the iterations do
11:     for $d$ = 1 to $D$ in the corpus do
12:         $k\leftarrow z_{d}$
13:         $m_{k}\leftarrow m_{k}-1$
14:         for each $w\in d$ do
15:             $n_{k}\leftarrow n_{k}-1$
16:             $n_{k}^{w}\leftarrow n_{k}^{w}-1$
17:         end for
18:         if $m_{k}$ == 0 then
19:             Remove cluster $k$
20:         end if
21:         Compute the probability of $d$ choosing existing clusters or a new cluster $l+1$ with Eqs (8) and (10), respectively
22:         Sample cluster index $k$ for document $d$
23:         $z_{d}\leftarrow k$
24:         $m_{k}\leftarrow m_{k}+1$
25:         for each $w\in d$ do
26:             $n_{k}\leftarrow n_{k}+1$
27:             $n_{k}^{w}\leftarrow n_{k}^{w}+1$
28:         end for
29:     end for
30: end for

From steps 10 to 30, we iteratively sample the cluster label of each document after removing it from its previous cluster, until all iterations are completed. Steps 13 to 17 remove document $d$ from its current cluster $z_{d}$ , which means that the corresponding information in $m_{k}$ , $n_{k}$ and $n_{k}^{w}$ is updated accordingly. Then, WDMM computes the probability of $d$ choosing each of the $l$ existing clusters or a new cluster $l+1$ through Eqs (8) and (10), samples a table $z_{d}$ for document $d$ , and updates $m_{k}$ , $n_{k}$ and $n_{k}^{w}$ . This means that the time complexity of WDMM is linear in the number of existing clusters. Finally, only some active tables have customers after some iterations, and the customers at one table are from the same cluster.
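The sampling loop described above can be sketched compactly in Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, $S$ is a dictionary of sparse rows (a word missing from $S$ is treated as having no neighbors), and $K$ in Eq. (8) is taken to be the current number of active clusters.

```python
import math
import random
from collections import Counter, defaultdict

def wdmm_gibbs(docs, S, vocab_size, alpha=0.1, beta=0.1, iterations=10, seed=0):
    """Collapsed Gibbs sampling following the steps of the WDMM algorithm."""
    rng = random.Random(seed)
    D = len(docs)
    z = list(range(D))                 # each document starts in its own cluster
    m = Counter(z)                     # documents per cluster
    n = Counter()                      # words per cluster
    n_w = defaultdict(Counter)         # word counts per cluster
    for d, doc in enumerate(docs):
        n[d] += len(doc)
        n_w[d].update(doc)
    next_label = D

    def log_word_term(doc, counts, total):
        # Word part of Eq. (8); with empty counts it reduces to Eq. (10).
        lp = 0.0
        for w, freq in Counter(doc).items():
            row = S.get(w, {w: 1.0})
            base = sum(s * counts.get(v, 0) for v, s in row.items())
            for j in range(freq):
                lp += math.log(base + beta + j)
        for i in range(len(doc)):
            lp -= math.log(total + vocab_size * beta + i)
        return lp

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            k = z[d]                   # remove d from its current cluster
            m[k] -= 1
            n[k] -= len(doc)
            n_w[k].subtract(doc)
            if m[k] == 0:              # drop empty clusters
                del m[k]
                del n[k]
                del n_w[k]
            labels = list(m)
            K = len(labels)            # assumption: K = number of active clusters
            log_p = [math.log(m[c] + alpha) - math.log(D - 1 + K * alpha)
                     + log_word_term(doc, n_w[c], n[c]) for c in labels]
            log_p.append(math.log(alpha) - math.log(D - 1 + alpha)
                         + log_word_term(doc, {}, 0))
            top = max(log_p)           # normalize in log space before sampling
            weights = [math.exp(lp - top) for lp in log_p]
            choice = rng.choices(range(K + 1), weights=weights)[0]
            if choice == K:            # open a new cluster
                k = next_label
                next_label += 1
            else:
                k = labels[choice]
            z[d] = k
            m[k] += 1
            n[k] += len(doc)
            n_w[k].update(doc)
    return z
```

Each sweep evaluates one score per active cluster plus one for a new cluster, which is where the linear dependence on the number of existing clusters comes from.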

3.5 Representation of clusters

Based on the fact that the Dirichlet distribution is conjugate to the multinomial distribution, we can get the posterior cluster-word distribution of $\phi$ as follows,

$\displaystyle\phi_{k}^{w}=\frac{\sum_{v\in\mathcal{N}(w)\cup\{w\}}{S_{wv}n_{k}^{v}}+\beta}{\sum_{v=1}^{V}n_{k}^{v}+V\beta}$ (11)

where $\phi_{k}^{w}$ is the probability of word $w$ being generated by cluster $k$ , and can be considered as the importance of word $w$ to cluster $k$ .

3.6 Relationship to other DMM models

WDMM vs. FGSDMM. FGSDMM [35] is closely related to our model, but there are several differences. FGSDMM explicitly enforces sparsity by sampling only one cluster for each document. If two words are strongly related but rarely co-occur in short texts, FGSDMM cannot fully capture the semantic relatedness between them. Our model also assigns one cluster to each document, but additionally incorporates the neighboring words of each word into the inference process, which helps to capture the semantic relatedness.

It is important to note that the number of clusters after initialization in FGSDMM is usually less than the true number when clustering short texts. This means that $K_{\textit{max}}$ is smaller than $K$ in FGSDMM, which decreases the performance. WDMM assigns all documents to different clusters in the initialization, with the advantage that no noise is introduced during initialization.

WDMM vs. GPU-DMM. Our model is also reminiscent of GPU-DMM, applied in short text topic modeling [16], which leverages the word-word relationship during the topic inference process to tackle the data sparsity issue. The direct difference between the two models is that WDMM can infer the number of clusters. In addition, our model adopts a graph-based proposal distribution to incorporate the similarity matrix, whereas in GPU-DMM semantically related words are extracted and promoted using the generalized Pólya Urn model. Finally, our model adjusts the promotion weight based on the similarity value between words, whereas this weight is a fixed value in GPU-DMM.

4. Experiments

In this section, we study the empirical performance of WDMM compared to the baselines.

4.1 Dataset

We adopt the same datasets as GSDMM and its variation [24, 34] for evaluation: Google News (http://news.google.com) and Tweet (http://trec.nist.gov/data/microblog.html), as described below.

Google News: the news articles about the same event in Google News are grouped into clusters automatically. We adopted a snapshot of Google News from November 27, 2013, which was first used in [34]. The dataset was divided into three datasets: TitleSet (TSet), SnippetSet (SSet), and TitleSnippetSet (TSSet). The TitleSet and SnippetSet contain only the titles and snippets, respectively, while the TitleSnippetSet contains both the titles and snippets.

Tweet: consists of 2,472 tweets that are highly relevant to 89 queries. The relevance between tweets and queries was manually labelled in the 2011 and 2012 microblog tracks at the Text REtrieval Conference (TREC).

For each dataset, we conduct the following preprocessing: (1) Convert all letters into lowercase; (2) Remove non-Latin characters and stop words; (3) Remove words whose lengths are smaller than 2 or larger than 15; (4) Lemmatize words with the WordNet Lemmatizer of NLTK (http://www.nltk.org); (5) Remove words whose frequency is less than 2. The statistics of the datasets are summarized in Table 2.
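The five preprocessing steps can be sketched as a single function. The sketch below makes several assumptions that the text does not settle: the stop-word list is a tiny hypothetical stand-in, "non-Latin characters" is read as everything outside a-z (so digits are dropped too), word frequency is counted over the whole corpus, and the lemmatizer is passed in as a callable that defaults to the identity so the example runs without NLTK data.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def preprocess(texts, stop_words=STOP_WORDS, lemmatize=lambda w: w,
               min_len=2, max_len=15, min_freq=2):
    """Apply the five preprocessing steps to a list of raw strings."""
    docs = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())          # steps (1) and (2)
        tokens = [t for t in tokens if t not in stop_words]    # step (2), stop words
        tokens = [t for t in tokens if min_len <= len(t) <= max_len]  # step (3)
        docs.append([lemmatize(t) for t in tokens])            # step (4)
    freq = Counter(t for doc in docs for t in doc)
    # Step (5): drop words that occur fewer than min_freq times in the corpus.
    return [[t for t in doc if freq[t] >= min_freq] for doc in docs]
```

To follow the setup described above more closely, one could pass `nltk.stem.WordNetLemmatizer().lemmatize` as the `lemmatize` argument and an NLTK stop-word list as `stop_words`, provided the corresponding NLTK corpora are installed.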

Table 2

Summary of the text datasets (D: the number of documents, K: the number of clusters, V: vocabulary size, AVG: the average length of the documents)

Dataset	D	K	V	AVG
Tweet	2472	86	5098	8.56
TSet	11109	152	8110	6.23
SSet	11109	152	18477	22.20
TSSet	11109	152	19671	28.43

4.2 Comparison with baseline methods

We compare our WDMM with a wide range of other models including traditional text representation method (TFIDF), topic models (NMF and LDA), distributed representation models (Skip-thoughts), Dirichlet multinomial mixture model (GSDMM) and its variation (FGSDMM and GPU-DMM), as listed below.

TFIDF[32]: a traditional text representation method, Term Frequency-Inverse Document Frequency (TFIDF), which relies only on the surface form of the text. We cluster the resulting vectors with the open-source sklearn implementation of KMeans [32], using KMeans++ initialization and Euclidean distance.

LDA[3]: a directed graphical model which models a document as a mixture of topics and a topic as a mixture of words. When LDA is used for text clustering, we assign each document to the topic with the highest probability in its topic mixture. We use an LDA implementation based on Gibbs sampling [8] (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm).

NMF[15]: an unsupervised learning technique originally employed to decompose high-dimensional data $A\in\mathbb{R}^{m\times n}$ into two non-negative factors $W\in\mathbb{R}^{m\times k}$ and $H\in\mathbb{R}^{k\times n}$ whose product is an approximation of $A$ . The coefficient vectors in $W$ can be interpreted as the $k$ topic membership weights of the corresponding documents. All the documents are first represented using the Term Frequency metric. We use the open-source sklearn implementation of NMF based on Euclidean distance.

Skip-thoughts model (Skip)[13]: a distributed representation model that directly learns vector representations of documents. The Skip-thoughts model trains encoder-decoder Recurrent Neural Networks (RNN) without supervision to predict the next and the previous sentences given the current sentence. The pretrained Skip-thoughts model computes vectors as sentence representations. As with TF-IDF, K-Means is used as the clustering method. We adopt the code released by the authors (https://github.com/ryankiros/skip-thoughts).

GSDMM[34]: a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model. We use the source code provided with its paper (https://github.com/jackyin12/GSDMM/).

FGSDMM[35]: a fast GSDMM model using an online clustering scheme for initialization. We implement FGSDMM on our own, since the implementation is not publicly available.

GPU-DMM and GPU-PDMM[16]: a DMM model for short text topic modeling, which promotes semantically related words under the same topic during the sampling process by using the generalized Pólya Urn (GPU) model. GPU-PDMM is a variant of GPU-DMM. Specifically, each short text in PDMM is assumed to be generated by a limited number of topics (e.g., one, two, or three topics), where the topic number is modeled by a Poisson distribution. We use the code released by the authors (https://github.com/NobodyWHU/GPUDMM).

WDMM: the proposed model by this paper.

Training details: Following [34], $\alpha$ and $\beta$ of LDA and GSDMM are both set to 0.1. Following [35], $\alpha$ and $\beta$ of FGSDMM are set to 1 and 0.05. In GPU-DMM, we use the recommended settings $\alpha=$ 50/ $K$ , $\beta=$ 0.01 and $\mu=$ 0.1. Our model sets $\alpha=$ 0.1, $\beta=$ 0.1, smoothness factor $\lambda=$ 0.01 and a dynamic neighborhood of size 5. For TF-IDF, NMF, LDA, Skip-thoughts, GPU-DMM and GPU-PDMM, we set $K$ to the true number of clusters of each dataset. For GSDMM and FGSDMM, we set the initial number of clusters to 500 for all datasets. For all algorithms, we set the maximum number of iterations to 500 to make a fair comparison. As word-word information, we used GloVe [22] embeddings, with cosine similarity as the measure of similarity between words.

4.3 Evaluation metrics

The clustering results on real-world data are evaluated by comparing the label of each text obtained by the algorithms with the real label. Five metrics are used to measure the clustering performance: Normalized Mutual Information (NMI) [36], Homogeneity (H)[27], Completeness (C) [27], Adjusted Rand Index (ARI)[11], and Adjusted Mutual Information (AMI)[20]. For all metrics, a larger score indicates better clustering performance. In the experiment, we adopted sklearn (http://scikit-learn.org) to compute these metrics.

• Normalized Mutual Information (NMI) is a clustering validation metric that measures the amount of statistical information shared by the predicted cluster assignments and the ground truth, independent of the absolute cluster label values.

• Homogeneity (H) represents the objective that each cluster contains only members of a single ground truth group, and completeness (C) represents the objective that all members of a ground truth group are assigned to the same cluster [27].

• Rand Index (RI): two texts are assigned to the same cluster if and only if they are similar, so clustering can be viewed as a series of pair-wise decisions. RI measures the percentage of these decisions that are correct. Rand Index can be adjusted for the chance grouping of elements, which results in one of its variants called Adjusted Rand Index (ARI). RI has a value between 0 and 1, whereas ARI is at most 1 and can take negative values.

• Adjusted Mutual Information (AMI)[20] corrects the effect of agreement solely due to chance between clusters, similar to the way Adjusted Rand Index (ARI) corrects the Rand index.
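To make the pair-counting view of the Rand Index concrete, the sketch below computes RI and ARI directly from the contingency counts. It is a plain-Python illustration with a hypothetical function name, not the sklearn code used in the experiments, and it assumes at least two items.

```python
from collections import Counter
from math import comb

def rand_indices(true_labels, pred_labels):
    """Return (RI, ARI) computed from pair counts; assumes at least two items."""
    n = len(true_labels)
    pairs = comb(n, 2)
    contingency = Counter(zip(true_labels, pred_labels))
    # Pairs placed together in both labelings, in the truth, and in the prediction.
    same_both = sum(comb(c, 2) for c in contingency.values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    # Correct decisions: together in both, or apart in both.
    agreements = same_both + (pairs - same_true - same_pred + same_both)
    ri = agreements / pairs
    expected = same_true * same_pred / pairs      # chance level of same_both
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return ri, 1.0                            # degenerate case
    ari = (same_both - expected) / (max_index - expected)
    return ri, ari
```

For the labels `[0, 0, 1, 1]`, a prediction of `[1, 1, 0, 0]` gives RI = 1 and ARI = 1, since label names do not matter, while a prediction of `[0, 1, 0, 1]` gives RI = 1/3 and ARI = -0.5, a result worse than chance.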

Figure 2.

Performance of the models on the TweetSet.

Figure 3.

Performance of the models on the TSet.

Figure 4.

Performance of the models on the SSet.

Figure 5.

Performance of the models on the TSSet.

4.4 Comparison with existing methods

Model-based methods rely on stochastic elements in their initialization phase, which can lead to different results on the same corpus with the same parameter values. Therefore, each model is run twenty times on each dataset, and we report the mean and standard deviation of the five evaluation metrics introduced above. Figure 2 shows the performance of all models on the Tweet dataset. Traditional text clustering methods (TFIDF, NMF, LDA) do not perform well on this dataset, which indicates that the assumption that a text can belong to more than one cluster is unsuited to short texts. Although the distributed representation model (Skip) is better than TFIDF, it is worse than the DMM-based models (GSDMM, FGSDMM, GPU-DMM, GPU-PDMM, WDMM). DMM-based models consistently achieve higher performance, which suggests that the assumption that each text is generated by one cluster is suitable for short texts. GPU-DMM and GPU-PDMM consistently achieve higher scores than the other DMM-based baselines (GSDMM, FGSDMM), which suggests the effectiveness of pre-training word embeddings on a large external corpus to learn general knowledge. Our WDMM model outperforms all other models. For example, WDMM obtains an NMI of 91.2%, higher than the 89.7% achieved by GPU-PDMM. This indicates that incorporating word-word relationships derived from word embeddings into DMM is beneficial for short text clustering.

Figure 6.

Number of clusters on Tweet, TSet, SSet and TSSet.

Figure 7.

Performance of WDMM varying iterations.

The experimental results on Google News (TSet, SSet and TSSet) are shown in Figs 3–5. First, WDMM performs better than the baselines. As on Tweet, this again indicates that incorporating word-word relatedness knowledge is useful for short text clustering. Second, GPU-PDMM is the second best model on the five metrics. Although WDMM, GPU-DMM and GPU-PDMM all use side information, WDMM adjusts the promotion weight based on the similarity value between words, whereas this weight is a fixed value in GPU-DMM and GPU-PDMM. In conclusion, the results on the four datasets suggest that WDMM is a desirable choice for short text clustering.

4.5 Number of clusters found automatically

Similar to GSDMM and FGSDMM, WDMM can infer the number of clusters in a dataset. We investigate how close the number of clusters found by the models is to the real number of clusters by varying the number of iterations. Since GSDMM needs an initial number of clusters, we choose a large number, 500, for GSDMM. The results on Tweet and Google News (TSet, SSet and TSSet) are shown in Fig. 6. The number of clusters discovered by FGSDMM is considerably higher and unstable compared to the real number of clusters on Tweet, and is severely lower than the real number on SSet and TSSet. On these datasets (SSet and TSet), whose underlying data can be clustered tightly, the number of clusters after initialization in FGSDMM is lower than the real number, which leads to the worst results. Compared with GSDMM, the number inferred by WDMM is closer to the real number of clusters.

4.6 Influence of the number of iterations

We now study the effect of the number of iterations on WDMM. Figure 7 depicts the NMI of WDMM as the number of iterations varies from 2 to 50 on the four datasets. The performance of WDMM improves very quickly in the first two iterations, and WDMM reaches a steady state after seven iterations. The one exception is the Tweet dataset, where the NMI improves slightly from 0.90 to 0.91 at the 21st iteration. Overall, WDMM converges quickly.
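A steady state of this kind can be located programmatically from a list of per-iteration NMI values. The sketch below uses one possible definition, which is an assumption rather than the criterion used in the paper: the first iteration after which every later score stays within a tolerance `tol` of that iteration's score. The function name and the scores are illustrative.

```python
def first_steady_iteration(scores, start_iteration=1, tol=0.005):
    """First iteration from which all later scores stay within tol.

    scores[i] is the score at iteration start_iteration + i.
    Returns None for an empty list.
    """
    for i, score in enumerate(scores):
        if all(abs(later - score) <= tol for later in scores[i:]):
            return start_iteration + i
    return None


# Made-up NMI values for iterations 2, 3, 4, ...
nmi_by_iteration = [0.60, 0.85, 0.88, 0.90, 0.90, 0.90]
steady_at = first_steady_iteration(nmi_by_iteration, start_iteration=2)
```

The choice of tolerance matters: a late change of 0.01, such as the one reported on Tweet, exceeds a tolerance of 0.005, so a stricter or looser `tol` changes which iteration is reported as steady.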

5. Conclusion

In this paper, we proposed a novel model, WDMM, that incorporates word-word correlation into the Dirichlet Multinomial Mixture model to cluster short texts without knowing the true number of clusters in advance. We presented a collapsed Gibbs sampling algorithm that combines the advantages of the Dirichlet multinomial mixture model and distributed representations. By constructing a sparse graph from word-word relationships, our model expands each short text with the neighboring words of the words it contains, which helps to alleviate the sparseness of short texts. If two words are strongly related but rarely co-occur in short texts, WDMM can still capture the semantic relatedness between them. Experimental results on real-world datasets demonstrated the superiority of WDMM over state-of-the-art methods.

Footnotes

Acknowledgments

This research is partially supported by the National Natural Science Foundation of China under grants 61703362, 61702441, 61503116 and 61402203, the Natural Science Foundation of Jiangsu Province of China under grants BK20170513 and BK20161338, the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China under grant 17KJB520045, and the Science and Technology Planning Project of Yangzhou of China under grant YZ2016238.

References

1. Banerjee, Ramanathan and Gupta, Clustering short texts using wikipedia, In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2007, pp. 787–788.

2. Beykikhoshk, Arandjelović, Phung and Venkatesh, Discovering topic structures of a temporally evolving document corpus, Knowledge and Information Systems 55(3) (2018), 599–632.

3. Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, The Journal of Machine Learning Research 3 (2003), 993–1022.

4. Chen and Zaki M.J., Kate: K-competitive autoencoder for text, arXiv preprint arXiv:1705.02033, 2017.

5. Finegan-Dollak, Coke, Zhang and Radev D.R., Effects of creativity and cluster tightness on short text clustering performance, In ACL (1), 2016.

6. Fodeh, Punch and Tan P.-N., On ontology-driven document clustering using core semantic features, Knowledge and Information Systems 28(2) (2011), 395–421.

7. Frey B.J. and Dueck, Clustering by passing messages between data points, Science 315(5814) (2007), 972–976.

8. Griffiths T.L. and Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences 101(Suppl 1) (2004), 5228–5235.

9. Han, Kamber and Pei, Data mining: concepts and techniques, Elsevier, 2011.

10. Huang, Wang, Zhang and Shi, Dirichlet process mixture model for document clustering with feature partition, IEEE Transactions on Knowledge and Data Engineering 25(8) (2013), 1748–1759.

11. Hubert and Arabie, Comparing partitions, Journal of Classification 2(1) (1985), 193–218.

12. Ibrahim, Elbagoury, Kamel M.S. and Karray, Tools and approaches for topic detection from twitter streams: survey, Knowledge and Information Systems, 2017, pp. 1–29.

13. Kiros, Zhu, Salakhutdinov R.R., Zemel, Urtasun, Torralba and Fidler, Skip-thought vectors, In Advances in neural information processing systems, 2015, pp. 3294–3302.

14. Le and Mikolov, Distributed representations of sentences and documents, In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188–1196.

15. Lee D.D. and Seung H.S., Algorithms for non-negative matrix factorization, In NIPS, 2001, pp. 556–562.

16. Duan, Wang, Zhang, Sun and Ma, Enhancing topic modeling for short texts with auxiliary word embeddings, ACM Transactions on Information Systems 36(2) (2017), 11.

17. Ritter and Hovy E.H., Weakly supervised user profile extraction from twitter, In ACL (1), 2014, pp. 165–174.

18. Liang, Yilmaz and Kanoulas, Dynamic clustering of streaming short documents, In SIGKDD, ACM, 2016, pp. 995–1004.

19. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, In Advances in neural information processing systems, 2013, pp. 3111–3119.

20. Nguyen X.V., Epps and James, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11 (2010), 2837–2854.

21. Nigam, McCallum A.K., Thrun and Mitchell, Text classification from labeled and unlabeled documents using em, Machine Learning 39(2–3) (2000), 103–134.

22. Pennington, Socher and Manning, Glove: Global vectors for word representation, In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

23. Qiang, Chen, Wang and Wu, Topic modeling over short texts by incorporating word embeddings, In Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2017, pp. 363–374.

24. Qiang, Yuan and Wu, Short text clustering based on pitman-yor process mixture model, Applied Intelligence 48(7) (2018), 1802–1812.

25. Quan, Kit and Pan S.J., Short and sparse text topic modeling via self-aggregation, In AAAI, 2015, pp. 2270–2276.

26. Reynolds, Gaussian mixture models, Encyclopedia of Biometrics, 2015, pp. 827–832.

27. Rosenberg and Hirschberg, V-measure: A conditional entropy-based external cluster evaluation measure, In AAAI, 2007, pp. 410–420.

28. Sun and Chan P.K., Estimating effectiveness of twitter messages with a personalized machine learning approach, Knowledge and Information Systems 56(1) (2018), 27–53.

29. Tang, Wang, Zheng and Mei, End-to-end learning for short text expansion, In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 1105–1113.

30. Wang, Chen Y.P. and Lin, Recommendation in internet forums and blogs, In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 257–265.

31. Wang, Liu, Zhang, Wang and Hao, Semantic clustering and convolutional neural network for short text categorization, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, 2015, pp. 352–357.

32. Kumar, Quinlan J.R., Ghosh, Yang, Motoda, McLachlan G.J., Liu, Philip S.Y., et al., Top 10 algorithms in data mining, Knowledge and Information Systems 14(1) (2008), 1–37.

33. Wang, Zheng, Tian and Zhao, Self-taught convolutional neural networks for short text clustering, Neural Networks 88 (2017), 22–31.

34. Yin and Wang, A dirichlet multinomial mixture model-based approach for short text clustering, In SIGKDD, 2014, pp. 233–242.

35. Yin and Wang, A text clustering algorithm using an online clustering scheme for initialization, In SIGKDD, 2016, pp. 1995–2004.

36. Huang and Wang, Document clustering via dirichlet process mixture model with feature selection, In SIGKDD, ACM, 2010, pp. 763–772.

37. Zuo, Zhang, Lin, Wang and Xiong, Topic modeling of short texts: A pseudo-document view, In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 2105–2114.

¹ http://news.google.com.

⁵ http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm.

⁹ http://scikit-learn.org.