Abstract
Dirichlet Multinomial Mixture (DMM) models have been successful in clustering short texts. However, the word co-occurrence information that can be captured by these models is limited to the short text corpus itself. If two words have strong relatedness but rarely co-occurring in short texts, these models can not fully capture the semantic relatedness between the two words. In this paper, we propose a novel model by incorporating word-word correlation into DMM, called WDMM. By constructing a sparse graph using word-word relationship, our model expands each short text using their neighboring words in each text that can help to solve the problem of sparseness in short texts. Therefore, the cluster label of each text is not only influenced by its words, but decided by their similar words in this corpus. Experimental results on real-world datasets demonstrated the substantial superiority of our WDMM model over the state-of-the-art methods.
Introduction
Along with the emergence and popularity of social media (e.g. Twitter and Facebook), short text clustering is critical for building many applications such as user profiling [17] and recommendation [30]. Compared with regular texts, short text clustering has the following three challenges: only very limited word co-occurrence information is available, the frequency of words plays a less discriminative role, and the limited contexts make it more difficult to identify the senses of ambiguous words [25]. Therefore, to address these challenges, two major heuristic strategies have been adopted to deal with how to cluster short texts.
One way to overcome these issues is to explore some topic modelings to cluster short texts. For example, the most common one follows the simple assumption that each text is sampled from only one latent clustering, known as mixture of unigrams or Dirichlet Multinomial Mixture (DMM) model [21, 34, 35]. It is totally unsuited to regular texts, but it can be suitable for short texts compared to the complex assumption adopted by LDA[3] and PLSA[3] that each text is modeled over a set of topics. Many variants of topic modelings for short texts have been proposed recently [16, 18, 23, 25]. The strategy can alleviate the problem of sparseness, and is fast to converge. However, the word co-occurrence information that can be captured by these model is limited to the short text corpus itself. If two words have strong relatedness but rarely co-occurring in short texts, these models can not fully capture the semantic relatedness between the two words. The second strategy takes advantage of word embedding or autoencoder to learn distributed representations of short texts, and choose a clustering methods (e.g., KMeans[9] to group those texts [31, 13, 29, 33, 4]. For example, Skip-thoughts [13] trains encoder-decoder Recurrent Neural Networks (RNN) without supervision to predict the next and the previous sentences given the current sentence. Different from TF-IDF metric that relies on surface form of the text, distributed representations encode a text as a compact, dense and lower dimensional vector with the semantic meaning of the text distributed along the dimensions of the vector. As noted by Finegan-Dollak et al.[5], distributed representations ought to perform better on a dataset uses a wide variety of words to express the same idea, but should enjoy no advantage when applied to these datasets whose underlying data can be clustered tightly.
In this paper, we try to keep the advantages of topic modelings and utilize the advantages of distributed representations to overcome the weakness of topic modelings when applied to short text clustering. We propose a practical method for short text clustering called WDMM, which incorporating Word-word correlation into Dirichlet Multinomial Mixture. Our solution can achieve excellent performance in short text clustering without incurring a radical increase in the computational complexity of the inference algorithm. More specifically, we adopted a different method that does not incorporate word-word correlations as a prior level but rather softly enforces it at the posterior level. In this sense, our method expands each short text through linking the semantically relevant words together, even if they rarely co-occurring in short texts. We conduct comprehensive experiments to evaluate WDMM and to demonstrate the effectiveness of our model. We compare WDMM with traditional clustering methods (K-Means [32]) and probabilistic topic model (LDA [3]). We also compare with distributed representation models such as Word2Vec [19] and Skip-thoughts model [13], Dirichlet Multinomial Mixture model [34] and its variations GPU-DMM and GPU-PDMM[16]. WDMM achieves state-of-the-art performance across various datasets.
The rest of this paper is organized as follows. In Section 2, we review related work. In Section 3, we detail our approach and provide an efficient collapsed Gibbs sampling algorithm. Finally, in Section 4, we illustrate the efficacy of our approach in several real-world datasets.
Related work
Our work is related to three lines of research in literature: model-based methods, text representation, and short text clustering with side information.
Model-based methods
One of the most promising approach for short text clustering is model-based clustering algorithms [12, 2, 28]. Among them, Gaussian Mixture Model (GMM) [26] is widely adopted based on the assumption that data points are generated by a mixture of Gaussian distributions. Since the complexity of GMM is too large for high-dimensional data like text, GMM is not widely used for short text. Nigam et al. [21] proposed a mixture of unigram model, which assumes that each text is generated by one cluster. The unigram model is an EM-based algorithm for Dirichlet Multinomial Mixture (DMM) model. Except the basic expectation maximization (EM), there have been a number of inference methods that have been used to estimate the parameters including variation inference and Gibbs sampling. Therefore, Yu et al. [36] proposed the DMAFP model based on variational inference algorithm [10]. Yin and Wang [34] proposed a collapsed Gibbs sampling algorithm for DMM (abbr. to GSDMM), and found that GSDMM can infer the number of clusters automatically. For speeding up the time, Yin and Wang proposed a new algorithm as FGSDMM [35] using an online clustering scheme for initialization. Qiang et al. [24] proposed a novel based on Pitman-Yor Process to capture the power-law phenomenon of the cluster distribution. In addition, many variants of GSDMM have been proposed for short text topic modeling [25, 37] and streaming short text clustering [18], recently. All these methods focus on clustering short texts with limited word co-occurrence information, and no additional data is leveraged.
Text representation
Text representation is one of the fundamental problems in text mining and Information Retrieval (IR). It aims to numerically represent the unstructured text documents to make them mathematically computable. Different from model-based methods that can directly get the cluster label for each text, one of clustering algorithms (KMeans[32], Affinity Propagation (AP)[7], and so on) is selected for clustering short texts after obtaining the representation of each text. This kind of approaches basically fall into two major categories: traditional text representation methods and distributed representations. Traditional text representation methods rely on surface form of the text, in which Term Frequency-Inverse Document Frequency (TFIDF), N-grams, and dependency Counts are the most commonly used methods [5]. Due to the fact that only very limited word co-occurrence information is available in short texts, traditional text representation methods cannot work very well for short text clustering. Recently, with the help of word embeddings, neural networks demonstrate their great performance for constructing text representation, since distributed representations can capture semantic meanings of words. Word2Vec [19] and Glove [22] are state-of-the-art word representation models. Inspired by Word2Vec, Doc2Vec [14] can directly learn vector representations of sentences and documents. Skip-thoughts [13] trains encoder-decoder Recurrent Neural Networks (RNN) without supervision to predict the next and the previous sentences given the current sentence. Different autoencoders are also used to learn possibly trivial representations of text documents, which can automatically learn data representations by trying to reconstruct its input at the output layer [4].
Short text clustering with side information
Due to the sparse nature of short texts, many approaches employ large, external document repositories, such as Wikipedia [1] or the Open Directory Project [6], to incorporate additional world knowledge into the clustering process. Tang et al. [29] proposed an end-to-end approach to train the text expansion algorithm which used the search results from a large collection to expand short texts. The sheer size of many of these external collections can make these techniques difficult or time consuming to apply. GPU-DMM applied in short text topic modeling [16] leveraged the word-word relationship during the topic inference process, to tackle the data sparsity issue. The direct difference between GPU-DMM and WDMM is that WDMM can infer the real number of clusters. Our model adopts graph-based proposal distribution to incorporate the similarity matrix, and their semantically related words in GPU-DMM are extracted and promoted by using the generalized Pólya Urn model. Besides, our model adjusts the promotion weight based on the similarity value between words, which is a fixed value in GPU-DMM.
Short text clustering
In this section, we describe our algorithm how to utilize word-word side information for solving the problem of sparseness in short texts. The same to GSDMM, we keep its generative process intact, thus enjoying the nice conjugate properties between the Dirichlet and Multinomial distributions. But unlike GSDMM, we use the word-word relationship to influence the posterior distribution over correlated words to be similar, which biases the model towards the desired effect. We first introduce a collapsed Gibbs sampling algorithm for Dirichlet Multinomial Mixture (GSDMM), then specify how we represent word-word relationships, and finally give our model. The meaning of the variables in the paper is shown in Table 1.
The notations of symbols used in the paper
The notations of symbols used in the paper
In GSDMM [34, 35], one assumes that a document is sampled from a single cluster and all words of this document are from this cluster. That is, the documents are generated following the graphical model in Fig. 1.
Graphical model of GSDMM.
Suppose the corpus contains
Each cluster
Each document
Each word
A key property to derive an efficient sampler for DMM is the fact that the Dirichlet distribution is a conjugate prior to the multinomial distribution. This allows us to integrate out
where
Here, word-word relationship is represented as a sparse graph
where
We use collapsed Gibbs sampling to carry out posterior inference for parameter learning. The hidden multinomial variables text-level variables (
For each text
When
Compared with Eq. (5), the graph based proposal can be simply achieved by replacing the cluster-word counts statistics
For a new cluster
It has been validated that the topic coherence measure based on word co-occurrence pattern is a reliable indicator of topic quality and is highly consistent with domain expert annotation [16]. Accordingly, it is reasonable that the words with high semantic relatedness should be clustered together under the same topic. In the following, we present the detail of our WDMM model is given in Algorithm 1. The initialization steps go from steps 1 to 9. Step 2 assigns each document to an independent cluster. Steps 3 to 9 record the following information:
From steps 10 to 30, we iteratively sample the cluster label for each document after removing it from his last cluster, until completing all iterations. Steps 13 to 17 remove document
Representation of clusters
Based on the fact that the Dirichlet distribution is conjugate to the multinomial distribution, we can get the posterior cluster-word distribution of
where
WDMM vs. FGSDMM. FGSDMM [35] is closely related to our model, but there are several differences. FGSDMM explicitly enforces sparsity by only sampling one cluster for each document. If two words have strong relatedness but rarely co-occurring in short texts, FGSDMM can not fully capture the semantic relatedness between the two words. Our model not only focuses on one cluster for each document, but incorporates the neighboring words of each words into the inference process that can help capture the semantic relatedness.
It is important to note that the number of cluster after initialization in FGSDMM usually less than the true number when clustering short texts. This means that
WDMM vs. GPU-DMM. Our model is also reminiscent of GPU-DMM applied in short text topic modeling [16], which leverages the word-word relationship during the topic inference process, to tackle the data sparsity issue. The direct difference between these two models is that WDMM can infer the real number of clusters. And, our model adopts graph-based proposal distribution to incorporate the similarity matrix, and their semantically related words in GPU-DMM are extracted and promoted by using the generalized Pólya Urn model. Besides, our model adjusts the promotion weight based on the similarity value between words, which is a fixed value in GPU-DMM.
Experiments
In this section, We study the empirical performance of WDMM compared to the baselines.
Dataset
We adopt the same datasets with GSDMM and its variation [24, 34] for evaluation. They are Google News1
For each dataset, we conduct the following preprocessing: (1) Convert all letters into lowercase; (2) Remove non-latin characters and stop words; (3) Remove words whose lengths are smaller than 2 or larger than 15; (4) Perform stemming from words with the WordNet Lemmatizer of NLTK;4
Summary of the text datasets (D: the number of documents, K: the number of clusters, V: vocabulary size, AVG: the average length of the documents)
We compare our WDMM with a wide range of other models including traditional text representation method (TFIDF), topic models (NMF and LDA), distributed representation models (Skip-thoughts), Dirichlet multinomial mixture model (GSDMM) and its variation (FGSDMM and GPU-DMM), as listed below.
The clustering results on real-world data are evaluated through comparing the label of each text obtained by algorithms with the real label. Five metrics are used to measure the clustering performance: Normalized Mutual Information (NMI) [36], Homogeneity (H)[27], Completeness (C) [27], Adjusted Rand Index (ARI)[11], and Adjusted Mutual Information (AMI)[20]. For all metrics, a larger score indicates better clustering performance. In the experiment, we adopted sklearn9
Normalized Mutual Information (NMI) is a clustering validation metric that effectively measures the amount of statistical information shared by the predicted cluster assignments and the ground truth, independent of the absolute cluster label values. Two patients are assigned to the same cluster if and only if they are similar, thus clustering can be viewed as a series of pair-wise decisions. Homogeneity (H) represents the objective that each cluster contains only members of a ground true group and completeness (C) represents the objective that all members of a ground true group are assigned to the same cluster [27]. Rand Index (RI) measures the percentage of clustering decisions that are correct. Rand Index can be adjusted for the chance grouping of elements, which will result in one of its variants called Adjusted Rand Index (ARI). ARI has a value between 0 and 1, and RI can have negative values. Adjusted Mutual Information (AMI)[20] corrects the effect of agreement solely due to chance between clusters, similar to the way Adjusted Rand Index (ARI) corrects the Rand index.
Performance of the models on the TweetSet.
Performance of the models on the TSet.
Performance of the models on the SSet.
Performance of the models on the TSSet.
Model-based methods rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. Therefore, each model is run twenty times on each dataset, and we report the mean and standard deviation of the performance measured by the five evaluation metrics introduced. Figure 2 shows the performance of all models on the Tweet dataset. We can see that traditional text clustering methods (including TFIDF, NMF, LDA) do not perform well on this dataset, which indicates that texts can belong to more than one cluster is unsuited to short texts. Although distributed representation model (Skip) is better than TFIDF, its performance is worse than DMM-based models (GSDMM, FGSDMM, GPU-DMM, GPU-PDMM, WDMM). DMM-based models consistently achieve higher performance, which means that this assumption that each text is generated by one cluster can be suitable for short texts. GPU-DMM and GPU-PDMM consistently achieves higher accuracies than other DMM-based methods (GSDMM, FGSDMM), which suggests the effectiveness of pre-training word embeddings on a large external corpus to learn general knowledge. Our WDMM model significantly outperforms all other models. For example, WDMM obtains 91.2% NMI value which is significantly higher than the 89.7% achieved by GPU-PDMM. This validates that incorporating word-word relationship by obtaining word embeddings with WDMM models is beneficial for short text clustering.
Number of clusters on Tweet, TSet, SSet and TSSet.
Performance of WDMM varying iterations.
The experimental results on Google (TSet, SSet and TSSet) are shown on Figs 3–5. First, we can see that WDMM performs better than the baselines. Similar to the conclusion in Tweet, this validates that our strategy about incorporating word-word relatedness knowledge is useful for short text clustering again. Second, GPU-PDMM is the second best model using five metrics. Although both WDMM, GPU-DMM and GPU-PDMM use the side information, WDMM adopts a useful way to incorporate the side information, and adjusts the promotion weight based on the similarity value between words, which is a fixed value in GPU-DMM and GPU-PDMM. In conclusion, the results on the four datasets suggest that WDMM is a desired choice for short text clustering.
Similar to GSDMM and FGSDMM, WDMM also can infer the number of clusters in dataset. We will investigate whether the number of clusters found by the models is related to the real number of clusters in dataset by varying the number of iterations. Since GSDMM needs an initial number of clusters, we choose a large number 500 for GSDMM. The results varying the number of iterations on Tweet and Google news (TSet, SSet and TSSet) are shown in Fig. 6, respectively. The number of clusters discovered by FGSDMM is rather higher and unstable when compared to the real number of clusters in Tweet, and is severely lower than the real number of clusters in SSet and TSSet. When dealing with these datasets (SSet and TSet) whose underlying data can be clustered tightly, the number of clusters after initialization in FGSDMM is lower than the real number in result of the worst results. Compared with GSDMM, the number inferred by WDMM is close to the real number of clusters.
Influence of the number of iterations
By varying the number of iterations, we now study the effect of the number of iterations in WDMM. Figure 7 depicts the NMI results of WDMM varying the number of iterations from 2 to 50 on the four datasets. The performance of WDMM improves very quickly on the first two iterations. WDMM reaches a steady state using seven iterations. With one exception, the NMI of WDMM has a little improvement from 0.90 to 0.91 on the 21th interation using Tweet dataset. As a whole, we can see that WDMM is faster to converge.
Conclusion
In this paper, we propose a novel model by incorporating Word-word correlation into Dirichlet Multinomial Mixture (WDMM) to cluster short texts without the assumption of the true number of clusters. We presented a collapsed Gibbs Sampling algorithm by jointly considering the advantages of Dirichlet multinomial mixture mode and distributed representations. By constructing a sparse graph using word-word relationship, our model expands each short text using their neighboring words in each text that can help to solve the problem of sparseness in short texts. If two words have a strong relationship but rarely co-occur in short texts, WDMM can capture the semantic relatedness between the two words. Experimental results on real-world datasets demonstrated the substantial superiority of our WDMM model over the state-of-the-art methods.
Footnotes
Acknowledgments
This research is partially supported by the the National Natural Science Foundation of China under grants (61703362, 61702441, 61503116, 61402203), Natural Science Foundation of Jiangsu Province of China under grants (BK20170513, BK20161338), the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China under grant 17KJB520045, and the Science and Technology Planning Project of Yangzhou of China under grant YZ2016238.
