A new Chinese text clustering algorithm based on WRD and improved K-means

Abstract

Text clustering has been widely used in data mining, document management, search engines, and other fields. The K-means algorithm is a representative text clustering algorithm. However, the traditional K-means algorithm usually uses Euclidean distance or cosine distance to measure the similarity between texts, which is ineffective for high-dimensional data and cannot retain enough semantic information. To address these problems, we combine the word rotator’s distance with the K-means algorithm and propose the WRDK-means algorithm, which uses the word rotator’s distance to calculate the similarity between texts and preserves more text features. Furthermore, we define a new cluster center initialization method that reduces the instability caused by random selection of the initial cluster centers. To handle texts of inconsistent length, we also propose a new iterative approximation method for updating cluster centers. We selected three suitable datasets and five evaluation indicators to verify the feasibility of the proposed algorithm. The RI value of our algorithm exceeds 90%, and for Macro_F1 our scheme is about 37.77%, 23.2%, 13.06% and 20.12% better than the other four methods, respectively.

Keywords

Text clustering, K-means algorithm, word rotator’s distance, WRDK-means

1. Introduction

With the development of the Internet and the explosive growth of text data, more attention has been paid to research on text clustering. Cluster analysis is a statistical technique that is widely used in fields such as machine learning, data mining, pattern recognition, and image analysis. A clustering algorithm divides the sample set into different subsets according to the similarity between samples, as measured by a distance. The purpose is to make the similarity of samples in the same subset as high as possible, and the similarity of samples in different subsets as low as possible [1].

The K-means algorithm is widely used in text clustering due to its simple principle and fast convergence. The traditional K-means algorithm uses Euclidean distance to measure the similarity between data [2]. However, unlike the clustering of numerical data, clustering text data requires the contextual semantic information of the text to be considered. In addition, like many other types of data, text data is characterized by high dimensionality and sparsity. Since the Euclidean distance treats all attributes of the data equally, it cannot take the correlation between them into account, and it also performs poorly on high-dimensional data. Therefore, the traditional K-means clustering algorithm does not perform well in text clustering.

In view of the distance and similarity measurement problems of the traditional K-means algorithm in text clustering, we propose an improved K-means text clustering algorithm (WRDK-means). The algorithm incorporates the Word Rotator’s Distance (WRD), which is suitable for computing distances between texts, and uses the semantic features of words extracted by the CBOW model [3] of the Word2vec tool to calculate the similarity between two texts. The improved K-means algorithm shows a good clustering effect for text processing, especially for high-dimensional text data containing a large amount of contextual semantic information.

The rest of this article is organized as follows. In Section 2, the relevant research work in the field of text clustering is analyzed; Section 3 briefly discusses related technical knowledge; Section 4 elaborates on the improved K-means algorithm proposed in this paper for text clustering; Section 5 uses the algorithm proposed in this paper to carry out clustering experiments and analysis; Section 6 is a summary of this article.

2. Related works

According to their underlying principles, clustering algorithms can be divided into hierarchical-based methods, density-based methods, partition-based methods, and grid-based methods. The partition-based clustering method, also known as distance-based clustering, divides the data set into several subsets according to distance, so that the distance between different subsets is as large as possible and the distance within the same subset is as small as possible. The K-means algorithm, one of the representative partition-based algorithms, was first proposed by J. MacQueen at the Fifth Berkeley Symposium on Mathematical Statistics and Probability in 1967 [4]. One of the main disadvantages of the K-means algorithm is that it handles high-dimensional and sparse data poorly, because the traditional distance measures cannot process such data effectively.

Based on this, experts and scholars in China and abroad have optimized and improved the K-means algorithm. J. Li and J.H. Li combined the K-means algorithm with the SVM decision tree [5] to improve the efficiency and accuracy of the traditional algorithm. K-means becomes slow for large and high-dimensional datasets. The traditional FPAC algorithm mitigated this problem, but the improvement in speed came at the cost of reducing the quality of the clustering results. S. Bejos [6] therefore proposed an improved FPAC algorithm that does not greatly increase the runtime. In [7], Xiao-Dong Wang et al. proposed a fast adaptive K-means (FAKM) type subspace clustering model, in which an adaptive loss function provides a flexible mechanism for calculating cluster indicators, making the model suitable for datasets with different distributions.

The K-means algorithm also falls easily into local optima. In addition, many researchers first reduce the dimensionality of the data and then cluster it to improve the algorithm. For example, Q. Xu et al. [8] used the K-means algorithm to cluster data after dimensionality reduction through principal component analysis, and the clustering results were greatly improved. In [9], a new type of multi-view k-means, called feature-reduction multi-view k-means (FRMVK), is proposed. That work constructs a learning mechanism that lets the multi-view k-means algorithm automatically compute individual feature weights, which reduces the irrelevant feature components in each view. A new multi-view K-means objective function is first proposed for constructing the learning mechanism for feature weights in multi-view clustering, and a scheme for eliminating irrelevant features with small weights is then considered for feature reduction. A. Chakraborty et al. [10] evaluated the K-means clustering algorithm with the city block, cosine and correlation distances and compared the results in terms of accuracy. For classification, K-means reached 98% accuracy with the city block and correlation distances.

The traditional K-means algorithm cannot retain contextual semantic information well when processing high-dimensional data, and its results are therefore unsatisfactory. To address these shortcomings, we propose the WRDK-means algorithm.

3. Key technologies

3.1 Text clustering process

According to the general process of text clustering, we divide text clustering into four steps: text preprocessing, feature extraction, similarity calculation, and clustering realization, as shown in Fig. 1.

1. Text preprocessing

Because text data is unstructured or semi-structured, it needs to be converted into structured data that can be directly processed by the computer. Therefore, text preprocessing is required first, and the quality of the preprocessing results affects the results of clustering. Text preprocessing mainly includes two steps: word segmentation and stop word removal. We use jieba word segmentation to segment the Chinese text into words according to its dictionary, and use the Baidu stop word list to remove words with a high occurrence probability, such as pronouns, auxiliary words, prepositions, and conjunctions.

Figure 1. Technical route of text clustering.

2. Feature extraction

For long texts, we first use the TF-IDF algorithm to extract keywords in order to reduce the time complexity; for short texts, we perform feature selection directly. For the task of Chinese text clustering, we choose a suitable data set as the corpus, perform the same text preprocessing as in step 1, and use the CBOW language model to vectorize it, so that each word in the corpus corresponds to a unique vector. Finally, the words of the texts preprocessed in the previous step are matched to their vectors in the trained corpus.

3. Similarity calculation and clustering algorithm

Since text clustering partitions texts according to the similarity between them, the method used to calculate text similarity directly affects the clustering results. In clustering, similarity measures such as Euclidean distance and cosine distance are usually used. However, such methods do not handle high-dimensional data well and perform poorly in long text clustering tasks. To solve these problems, we use the word rotator’s distance (WRD) to calculate the similarity between different texts, and propose the WRDK-means text clustering algorithm, which combines WRD with the traditional K-means algorithm.

3.2 TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining to evaluate the importance of a word to a document in a document set [11]. The importance of a word increases in proportion to the number of times it occurs in the document and decreases with the number of documents in the corpus that contain it. The term frequency ( $TF$ ) represents the frequency of the keyword $w$ in the document $D_{i}$ :

$\displaystyle TF_{w,D_{i}}=\frac{\textit{count}(w)}{|D_{i}|}$ (1)

Where $\textit{count}(w)$ is the number of occurrences of the keyword $w$ , and $|{D_{i}}|$ is the number of all words in the document $D_{i}$ .

The inverse document frequency IDF reflects the prevalence of keywords:

$\displaystyle\textit{IDF}_{w}=\log\frac{N}{1+\sum_{i=1}^{N}I(w,D_{i})}$ (2)

Here, $N$ is the total number of documents, and $I({w,D_{i}})$ indicates whether the document $D_{i}$ contains the keyword $w$ : it is 1 if the document contains the keyword and 0 otherwise.

Then the TF-IDF value of the keyword $w$ in the document $D_{i}$ is [12]:

$\displaystyle\textit{TF-IDF}_{w,D_{i}}=TF_{w,D_{i}}\times\textit{IDF}_{w}$ (3)
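Eqs. (1)-(3) can be illustrated with a minimal Python sketch. It assumes the documents are already segmented into word lists; the function names `tf_idf` and `top_keywords` and the toy documents are illustrative and not part of the original method.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: TF-IDF} dict per tokenized document, following Eqs. (1)-(3)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF (Eq. 1) times IDF (Eq. 2), as in Eq. (3).
            word: (count / len(doc)) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, limit):
    """Keep the `limit` highest-scoring words of one document."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:limit]]


# Toy, already-segmented documents (illustrative data only).
docs = [
    ["game", "player", "game", "score"],
    ["stock", "market", "price"],
    ["market", "player", "team"],
    ["team", "coach", "league"],
]
scores = tf_idf(docs)
keywords = top_keywords(scores[0], 2)
```

Because of the `1 +` term in the denominator of Eq. (2), a word that appears in almost every document can receive a zero or negative score, which pushes it to the bottom of the keyword ranking.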

3.3 CBOW language model

In 2013, Tomas Mikolov et al. at Google [13] improved and optimized the feedforward neural network language model (NNLM) and released the word2vec tool. After training, the tool can vectorize all words and quantitatively measure the relationship between words, and the output word vectors can be used for natural language processing tasks such as clustering. The CBOW model chosen in our paper is a language model in word2vec, which predicts the current word from its context and constructs a supervised training criterion from unlabeled data [14].

The full name of CBOW is Continuous Bag of Words. It uses Hierarchical Softmax to train the word vector model of text data and maps the text feature words to a low-dimensional space in a distributed manner, so that the positional relationship of the word vectors in the low-dimensional space reflects their semantic connection well. The model consists of an input layer, a projection layer and an output layer [15]. The structure of the model is shown in Fig. 2. The input layer is composed of the context words $w_{({t\pm i})}$ , each represented by a one-hot encoded vector; the projection layer is obtained by summing and averaging the corresponding entries of the dictionary lookup table formed from the corpus; the output layer is also represented by one-hot encoding and outputs the predicted word $w_{(t)}$ .

Figure 2. CBOW model structure.

3.4 Word Rotator’s Distance

Word Rotator’s Distance (WRD) was proposed by S. Yokoi et al. in 2020 as an improvement on Word Mover’s Distance (WMD) [16]. WMD itself was introduced in 2015 by Matt Kusner et al. of Washington University in St. Louis [17]. WMD abstracts the weights of the words in two texts as supply and capacity. Semantic similarity is defined through the minimum cumulative distance required for the total weight of the words in one text to fully migrate to and fill the total capacity of the words in the other text. The larger the WMD, the smaller the text similarity; conversely, the smaller the WMD, the greater the text similarity.

Consider two sentences $s=({t_{1},t_{2},\ldots,t_{n}})$ and $s^{\prime}=(t^{\prime}_{1},t^{\prime}_{2},\ldots,t^{\prime}_{n^{\prime}})$ , whose word vector sequences are $({\bm{w_{1}},\bm{w_{2}},\ldots,\bm{w_{n}}})$ and $({\bm{w^{\prime}_{1}},\bm{w^{\prime}_{2}},\ldots,\bm{w^{\prime}_{n^{\prime}}}})$ , respectively. Let $T\in R^{n\times n^{\prime}}$ , where $T_{ij}\geqslant 0$ is the weight moved from word $i$ in a text $d$ to word $j$ in another text ${d}^{\prime}$ . $d_{i}$ represents the weight of word $i$ in the document $d$ , $d^{\prime}_{j}$ represents the weight of word $j$ in the document ${d}^{\prime}$ , and $d_{i,j}$ represents the distance between word $i$ and word $j$ . The model of WMD is expressed as follows [18]:

$\displaystyle\min\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}T_{ij}d_{i,j}$ (4)

$\displaystyle s.t.\ \sum_{j=1}^{n^{\prime}}T_{ij}=d_{i},\quad\sum_{i=1}^{n}T_{ij}=d^{\prime}_{j},$ (5)

$\displaystyle d_{i,j}=||\bm{w_{i}}-\bm{w^{\prime}_{j}}||$ (6)

The Euclidean distance used in WMD is less effective than cosine distance in measuring the distance between words. Furthermore, WMD is theoretically unbounded, so a similarity threshold is hard to choose. WRD was proposed to solve these problems.

The WRD paper [16] first proposed and demonstrated that the norm of a word vector is positively related to the importance of the word. The larger the WRD, the smaller the text similarity, and vice versa. Since the similarity measure used is cosine distance, the transformation between two vectors is more like a rotation than a movement. The WRD model is expressed as follows [17]:

$\displaystyle\min\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}T_{ij}d_{i,j}$ (7)

$\displaystyle s.t.\ \sum_{j=1}^{n^{\prime}}T_{ij}=||\bm{w_{i}}||d_{i},\quad\sum_{i=1}^{n}T_{ij}=||\bm{w^{\prime}_{j}}||d^{\prime}_{j},$ (8)

$\displaystyle d_{i,j}=1-\frac{\bm{w_{i}}\cdot\bm{w^{\prime}_{j}}}{||\bm{w_{i}}||\times||\bm{w^{\prime}_{j}}||}$ (9)
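The ingredients of the WRD model can be sketched in plain Python as follows. This is not a full WRD solver: the exact distance requires solving the transportation problem with a linear programming or optimal transport solver. The sketch only builds the cosine cost of Eq. (9) and norm-based word masses, and evaluates the objective of Eq. (7) for one feasible plan, which gives an upper bound on WRD. It assumes that the masses are rescaled to sum to 1 so that both texts carry the same total mass; all function names are illustrative.

```python
import math


def norm(vec):
    """Euclidean norm of a word vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_distance(u, v):
    """Word-to-word cost of Eq. (9)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm(u) * norm(v))


def norm_masses(vectors):
    """Mass of each word: its vector norm, rescaled to sum to 1 (assumption)."""
    norms = [norm(v) for v in vectors]
    total = sum(norms)
    return [n / total for n in norms]


def transport_cost(plan, s_vectors, t_vectors):
    """Objective of Eq. (7) for a given transport plan T."""
    return sum(
        plan[i][j] * cosine_distance(s_vectors[i], t_vectors[j])
        for i in range(len(s_vectors))
        for j in range(len(t_vectors))
    )


def wrd_upper_bound(s_vectors, t_vectors):
    """Cost of the feasible plan T_ij = a_i * b_j, an upper bound on WRD."""
    a = norm_masses(s_vectors)
    b = norm_masses(t_vectors)
    # Row sums of this plan equal a and column sums equal b, so it is feasible.
    plan = [[a_i * b_j for b_j in b] for a_i in a]
    return transport_cost(plan, s_vectors, t_vectors)


# Toy 2-dimensional word vectors (illustrative data only).
s = [[1.0, 0.0], [0.0, 2.0]]
t = [[3.0, 0.0]]
bound = wrd_upper_bound(s, t)
```

In this toy case the second text has a single word, so the plan above is the only feasible one and the bound coincides with the exact distance, 2/3. The example also shows the two texts having different numbers of words, which is the property the later sections rely on.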

4. WRDK-means text clustering algorithm

4.1 K-means clustering algorithm

The K-means algorithm is a typical representative of partition-based clustering algorithms. It usually uses Euclidean distance as a measure of similarity between data objects: the smaller the distance between two data objects, the higher their similarity and the more likely they are to belong to the same cluster; conversely, the larger the distance, the lower the similarity and the less likely they are to belong to the same cluster. Based on this principle, the data to be clustered is finally divided into $k$ categories.

Given a data set to be clustered $\{{x_{i},i=1,2,3\ldots,n}\}$ , the distance $\textit{dist}({x_{i},x_{j}})$ between data is calculated as follows:

$\displaystyle\textit{dist}(x_{i},x_{j})=||x_{i}-x_{j}||\quad i,j=1,2,\ldots,n$ (10)

The specific algorithm flow of K-means is shown in Algorithm 1 [19].

Algorithm 1: K-means algorithm
Input: the data set to be clustered $X=\{{x_{1},x_{2},x_{3}\ldots,x_{n}}\}$ , the number of clusters $k$ ;
Output: clustering result $C=\{{C_{1},C_{2},C_{3}\ldots,C_{k}}\}$ .
Randomly select $k$ data as the initial cluster centers, and set the maximum number of iterations to $T$ .
For $\textit{iteration}=1,2,\ldots,T$
    For every $x_{i}$
        According to Eq. (10), calculate $\textit{dist}({x_{i},C_{j}})$ for every cluster center $C_{j}$
        Assign $x_{i}$ to the cluster whose center is nearest
    End for
    Update every cluster center to the mean of the data assigned to it
End for
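The flow of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the early stop when the centers no longer change, the handling of empty clusters, the `seed` parameter and the toy data are additions for the example.

```python
import math
import random


def euclidean(p, q):
    """Distance of Eq. (10)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coord) / len(points) for coord in zip(*points)]


def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # random initial cluster centers
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [min(range(k), key=lambda c: euclidean(x, centers[c])) for x in data]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [x for x, label in zip(data, labels) if label == c]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(mean_point(members) if members else centers[c])
        if new_centers == centers:  # converged before reaching max_iter
            break
        centers = new_centers
    return labels, centers


# Two well-separated toy groups (illustrative data only).
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(data, k=2)
```

The update step relies on every data point having the same dimension, which is exactly the property that text matrices of different lengths do not have.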

The K-means algorithm has the advantages of simple principle and fast convergence speed, but it also has its limitations, mainly as follows:

1. The number $k$ of clusters needs to be set in advance. Whether the value of $k$ is reasonable affects the accuracy of the clustering results, but the number of clusters cannot be known before clustering.

2. Sensitivity to initial conditions. The random selection of the initial cluster centers makes the clustering process prone to falling into a local optimum, which leads to fluctuating and unstable clustering results.

3. The curse of dimensionality. The K-means algorithm is suitable for processing low-dimensional data. As the dimensionality increases, the Euclidean distance becomes less useful as a metric, and the convergence of the cluster centers becomes more difficult.

4.2 WRDK-means text clustering algorithm

Unsupervised text clustering algorithms usually use dimensionality reduction and highly sparse matrix representations to select text features, and then use measures such as Euclidean distance to calculate text similarity. This approach largely ignores the semantic information in the text. As a result, such algorithms cannot express text features well in the presence of synonyms, polysemy and strong contextual dependence.

Therefore, this paper introduces WRD into the K-means algorithm and proposes the WRDK-means clustering algorithm. WRD has several advantages. It can measure the similarity between texts of different lengths and takes the overall meaning of the context into account. Calculating the word rotator’s distance between texts essentially means finding the minimum word transfer cost, where the transfer weights are obtained by solving a transportation problem, which is a linear program. Although the solution of the transportation problem may ignore the word order of the text, the distance still captures considerably more semantic information overall. In addition, when clustering long texts, the TF-IDF algorithm can be used to extract keywords from them, reducing the time cost by sacrificing a small amount of unimportant semantic information.

However, the introduction of the WRD has brought a lot of difficulties to the K-means algorithm, mainly in the following two aspects:

1. Conventional text features have a uniform dimension after dimensionality reduction, which greatly simplifies subsequent calculations. To preserve as much of the semantic information of the text as possible, WRD works on text matrices whose sizes differ from text to text, which makes the selection of the initial cluster centers more difficult.

2. The traditional K-means algorithm updates cluster centers by summing and averaging, which is not suitable for text features with different dimensions.

In view of the above shortcomings and difficulties, we propose the WRDK-means clustering algorithm. The WRDK-means clustering algorithm represents the word set $\{{w_{i},i=1,2,3\ldots,m}\}$ obtained after text segmentation as a matrix $\bm{F}$ , as shown in Eq. (11). Here, $\bm{v_{i}}$ is the word vector of the word $w_{i}$ computed by the CBOW model, and its length $n$ is determined by the CBOW model. The number of rows of the matrix $\bm{F}$ is the number of words $m$ in the text after segmentation.

$\displaystyle\bm{F_{m\times n}}=\left[\begin{array}{c}(\bm{v_{1}})_{1\times n}\\ \vdots\\ (\bm{v_{m}})_{1\times n}\end{array}\right]_{m\times n}$ (11)
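As a small illustration of Eq. (11), the sketch below stacks the word vectors of a segmented text into a matrix with one row per word. The `embeddings` dictionary stands in for the lookup table of a trained CBOW model, and skipping words that are missing from the table is an assumption of this example; the names and values are hypothetical.

```python
def text_matrix(words, embeddings):
    """Stack the word vectors of a segmented text into the matrix F of Eq. (11)."""
    # Assumption: words missing from the lookup table are skipped.
    return [list(embeddings[word]) for word in words if word in embeddings]


# Hypothetical 3-dimensional lookup table (a real CBOW model would supply this).
embeddings = {
    "game": [0.2, 0.7, 0.1],
    "player": [0.3, 0.6, 0.0],
    "stock": [0.9, 0.1, 0.4],
}
short_text = text_matrix(["game", "player"], embeddings)
long_text = text_matrix(["stock", "game", "player", "unseen"], embeddings)
```

The two matrices share the same number of columns but have 2 and 3 rows respectively, which is why the usual averaging step of K-means cannot be applied to them directly.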

The following is a detailed description of the algorithm.

4.2.1 Determine the number of clusters

Many studies have proposed targeted solutions for determining the number of clusters $k$ , such as the silhouette coefficient method [18], a method based on generalization ability [21], and a new cluster validity index based on average comprehensiveness, DAS [22]. We choose the silhouette coefficient method to determine the value of $k$ . With this method, silhouette coefficients are computed for different values of $k$ according to the resulting distribution of clusters, and the $k$ with the highest coefficient is selected as the best value.
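The silhouette-based choice of $k$ can be sketched as follows. The sketch assumes the standard definition of the silhouette coefficient, a distance function that returns 0 for identical points, and at least two clusters per labeling; the candidate labelings are written by hand here, whereas in practice they would come from running the clustering algorithm once per candidate $k$.

```python
def silhouette_score(points, labels, dist):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def abs_dist(p, q):
    """Toy one-dimensional distance used only for this example."""
    return abs(p - q)


# Toy one-dimensional data with three obvious groups (illustrative only).
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
best_k = max(candidates, key=lambda k: silhouette_score(points, candidates[k], abs_dist))
```

Because the score only needs pairwise distances, the same function could in principle be used with a text distance such as WRD in place of `abs_dist`.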

4.2.2 Initialize cluster center

When initializing cluster centers, especially for high-dimensional data, the word rotator’s distance between texts of different categories is large, so the initial cluster centers should keep a certain distance from each other, which helps to improve the stability of the algorithm. Real text data, however, tends to contain a small amount of extreme data (i.e. points far away from almost all of the sample space). If an extreme point is used as an initial cluster center, the center may become “dead”: no other points are assigned to it, so it can never be updated. At the same time, to ensure that the cluster centers can accommodate enough semantic information, the size of each cluster center should be the same as the size of the longest text matrix in the data set. The algorithm proposed in this paper first randomly selects a text $x_{r1}$ in the data set $X$ as the first cluster center. Then, from the remaining data, a text $x_{r2}$ for which the product of its word rotator’s distances to the existing cluster centers is greater than the threshold $\lambda^{m}$ is randomly selected as the second cluster center, and so on, until $k$ cluster centers are found. The detailed steps are as follows:

1. Randomly select a text $x_{r1}$ as the first cluster center, denoted as $C_{1}$ . Initialize a blank cluster center matrix. The number of columns in the cluster center is equal to the dimension of the word vector in the CBOW model and the number of rows in the cluster center is equal to the maximum length of the text word set in the data set to be classified $X=\{{x_{1},x_{2},x_{3}\ldots,x_{n}}\}$ . Fill the word vectors in the text $x_{r1}$ into the cluster center matrix in turn and fill the remaining blank parts with random numbers.

2. Initialize a blank cluster center matrix and select as the $m$ -th text $x_{rm}$ a text in the data set that satisfies the following inequality:

$\displaystyle\mathop{\prod}\limits_{i=1}^{m-1}\textit{dist}({c_{i},x_{rm}})>\lambda^{m},\quad\lambda\in({0,1})$ (12)

Fill the matrix with the word vectors of $x_{rm}$ in the same way as in step 1 to obtain the $m$ -th initial cluster center. Repeat this step until all initial cluster centers are obtained.
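The two initialization steps can be sketched as follows. The distance function is passed in as a parameter: in the paper it is WRD, while the `toy_dist` used here (cosine distance between summed word vectors) is only a stand-in so that the example runs without a transport solver. Scanning the remaining texts in shuffled order and raising an error when no text satisfies Eq. (12) are further assumptions of this sketch.

```python
import math
import random


def init_centers(texts, k, dist, lam, seed=0):
    """Pick k initial centers whose distance product satisfies Eq. (12)."""
    rng = random.Random(seed)
    max_len = max(len(text) for text in texts)  # rows of every center matrix
    dim = len(texts[0][0])                      # word vector dimension

    def to_center(text):
        # Copy the word vectors, then fill the remaining rows with random numbers.
        padding = [[rng.random() for _ in range(dim)] for _ in range(max_len - len(text))]
        return [list(vec) for vec in text] + padding

    pool = list(texts)
    rng.shuffle(pool)
    centers = [to_center(pool.pop())]           # step 1: random first center
    while len(centers) < k:                     # step 2: threshold test
        m = len(centers) + 1
        for index, candidate in enumerate(pool):
            product = math.prod(dist(center, candidate) for center in centers)
            if product > lam ** m:
                centers.append(to_center(pool.pop(index)))
                break
        else:
            raise ValueError("no remaining text satisfies Eq. (12)")
    return centers


def toy_dist(a, b):
    """Stand-in for WRD: cosine distance between the summed word vectors."""
    sa = [sum(col) for col in zip(*a)]
    sb = [sum(col) for col in zip(*b)]
    dot = sum(x * y for x, y in zip(sa, sb))
    norm_a = math.sqrt(sum(x * x for x in sa))
    norm_b = math.sqrt(sum(x * x for x in sb))
    return 1.0 - dot / (norm_a * norm_b)


# Four toy texts of two 2-dimensional word vectors each (illustrative only).
texts = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.0, 1.0]],
]
centers = init_centers(texts, k=2, dist=toy_dist, lam=0.5)
```

With these toy texts, whichever text is drawn first, the second center always comes from the other group, because texts from the same group have a distance far below the threshold $\lambda^{2}=0.25$.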

4.2.3 Update cluster center

Algorithm 2: WRDK-means algorithm
Input: the data set to be clustered $X=\{{x_{1},x_{2},x_{3}\ldots,x_{n}}\}$ , the number of clusters $k$ ;
Output: clustering result $C=\{{C_{1},C_{2},C_{3}\ldots,C_{k}}\}$ .
According to Eq. (12), select $k$ data as the initial cluster centers, and set the maximum number of iterations to $T$ .
For $\textit{iteration}=1,2,\ldots,T$
    For every $x_{i}$
        According to the WRD model, solve $\textit{dist}({x_{i},C_{j}})$ for every cluster center $C_{j}$
        Assign $x_{i}$ to the cluster whose center is nearest
    End for
    For every cluster center $C_{j}$
        For every text $y_{i}$ assigned to $C_{j}$
            According to the WRD model, solve $\textit{WRD}({C_{j},y_{i}})$
            According to Eqs (13)–(16), update the cluster center $C_{j}$
        End for
    End for
End for

After the dataset $X=\{{x_{1},x_{2},x_{3}\ldots,x_{n}}\}$ has been divided among the clusters of the current cluster centers, the cluster centers are updated by iterative approximation. Solving the transportation problem of WRD yields both the minimum word transfer cost and the transfer weight of each pair of words between the two texts. Suppose that the text set $Y=\{{y_{1},y_{2},y_{3}\ldots,y_{n}}\}$ is assigned to the original cluster center $C_{1}$ . The word transfer weights between the words $\{{C_{11},C_{12}\ldots,C_{1a}}\}$ in the original cluster center $C_{1}$ and the words $\{{y_{11},y_{12}\ldots,y_{1b}}\}$ in the text $y_{1}$ can then be obtained. The result of the WRD model is as follows:

$\displaystyle\textit{WRD}({C_{1},y_{1}})=({\textit{dist}({C_{1},y_{1}}),K_{C1y1}})$ (13)

If the number of words contained in $C_{1}$ is $m$ and the number of words contained in $y_{1}$ is $n$ , then:

$\displaystyle K_{C1y1}=\left[\begin{array}{ccc}k_{C11y11}&\cdots&k_{C11y1n}\\ \vdots&\ddots&\vdots\\ k_{C1my11}&\cdots&k_{C1my1n}\end{array}\right]$ (14)

The iterative approximation formula between the word $C_{11}$ in the cluster center $C_{1}$ and the word $y_{11}$ in the text $y_{1}$ is:

$\displaystyle C_{11}=C_{11}-({C_{11}-y_{11}})\times k_{C11y11}\times lr$ (15)

In order to stabilize the iteration of the algorithm, $lr$ is used to control the step size of the iterative approximation. Assuming that the maximum number of iterations of the algorithm is a constant $T$ , $lr$ satisfies the formula:

$\displaystyle lr=1-\frac{\textit{iteration}-1}{T}$ (16)

Among them, iteration is the number of rounds of the current iteration.

The specific process of WRDK-means algorithm is shown in Algorithm 2.
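The iterative approximation of Eqs. (15) and (16) can be sketched as follows. The transfer weights are assumed to be given, for example as the plan returned by a transport solver; applying the word-pair updates one after another in row and column order is an assumption of this sketch, since the equations only state the update for a single pair of words.

```python
def learning_rate(iteration, max_iter):
    """Step size of Eq. (16); it decays linearly from 1 as iterations proceed."""
    return 1 - (iteration - 1) / max_iter


def update_center(center, text, weights, lr):
    """Move each center word vector toward the text's word vectors, as in Eq. (15)."""
    updated = [list(vec) for vec in center]  # leave the input center untouched
    for a, center_vec in enumerate(updated):
        for b, text_vec in enumerate(text):
            step = weights[a][b] * lr        # transfer weight times step size
            for d in range(len(center_vec)):
                center_vec[d] -= (center_vec[d] - text_vec[d]) * step
    return updated


# One center word, one text word, transfer weight 0.5 (illustrative values only).
center = [[0.0, 0.0]]
text = [[2.0, 4.0]]
weights = [[0.5]]
first_lr = learning_rate(1, 10)
new_center = update_center(center, text, weights, first_lr)
```

In the first iteration the step size is 1, so the center word moves halfway toward the text word, to (1.0, 2.0). A pair with zero transfer weight leaves the center word unchanged, so each center word is only pulled toward the text words it is matched with.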

5. Experimental results and analysis

5.1 Data set

5.1.1 Experimental corpus

The corpus comes from the dataset THUCNews of the Natural Language Processing Laboratory of Tsinghua University. The corpus is generated by filtering historical data of the Sina News RSS subscription channel from 2005 to 2011. It contains 740,000 news documents, totaling 2.19 GB [20].

5.1.2 Experimental data set

The experiment used three Chinese datasets to demonstrate the effect of the WRDK-means clustering algorithm proposed in our paper. The datasets are as follows:

1. The THUCNews dataset of the Natural Language Processing Laboratory of Tsinghua University. THUCNews has 14 candidate categories, namely: Finance, Lottery, Real Estate, Stocks, Home Furnishing, Education, Technology, Society, Fashion, Current Affairs, Sports, Constellation, Games, and Entertainment. We randomly select 200 pieces of data from each category as classification dataset 1.

2. The Chinese text classification corpus of the Natural Language Processing Group of the International Database Center, Department of Computer Information and Technology, Fudan University. It has 20 candidate categories. After removing the categories with fewer than 200 pieces of data, the remaining 9 categories are: C3-Art, C7-History, C11-Space, C19-Computer, C31-Environment, C32-Agriculture, C34-Economy, C38-Politics and C39-Sports. We randomly select 200 pieces of data from each category as classification dataset 2.

3. The text classification corpus provided by Sogou Lab. This corpus contains news collected from the Sohu News website together with manually assigned category labels. There are 9 candidate categories, namely: Internet, Sports, Health, Military, Recruitment, Education, Culture, Tourism and Finance. We randomly select 200 pieces of data from each category as classification dataset 3.

We take the THUCNews dataset of the Natural Language Processing Laboratory of Tsinghua University as an example. The dataset consists of 14 category folders, each of which contains a large number of text files, none of them blank. A file in the “Games” category, for example, consists of two parts, the news title and the text, as shown in Fig. 3.

Figure 3. Data display of the data set THUCNews from the Natural Language Processing Laboratory of Tsinghua University.

5.2 Evaluation index

Among the many indicators for evaluating the performance of clustering algorithms, we choose purity, rand index, precision, recall and $F$ value.

5.2.1 Purity

Purity is a simple cluster evaluation method. It is the ratio of the number of correctly clustered data to the total number of data. The formula is as follows:

$\displaystyle\textit{purity}({{\Omega},C})=\frac{1}{N}\sum_{k}\max_{j}|{\omega_{k}\cap c_{j}}|$ (17)

Here, ${\Omega}=\{{\omega_{1},\omega_{2},\ldots,\omega_{k}}\}$ is the set of clusters, in which $\omega_{k}$ represents the set of data in the $k$ th cluster. $C=\{{c_{1},c_{2},\ldots,c_{j}}\}$ is the set of true categories, in which $c_{j}$ represents the set of data in the $j$ th category. $N$ represents the total number of data.
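Eq. (17) can be computed with a few lines of Python. The sketch assumes that each data point carries a cluster label from the algorithm and a true category label; the function name and the toy labels are illustrative.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Purity of Eq. (17): each cluster is credited with its majority category."""
    clusters = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        clusters.setdefault(cluster, Counter())[cls] += 1
    # For every cluster, count the members of its most frequent true category.
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(class_labels)


# Six toy data points in two clusters (illustrative labels only).
cluster_labels = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "a"]
score = purity(cluster_labels, class_labels)
```

In this example each cluster contains two points of its majority category, so the purity is 4/6.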

5.2.2 Rand index

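The rand index (RI) is a means of evaluating clusters. It is based on combinations: every pair of samples is examined, and RI is the proportion of pairs on which the clustering decision is correct. The formula is as follows:

$\displaystyle RI=\frac{TP+TN}{TP+FP+TN+FN}$ (18)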

TP represents the number of positive samples predicted to be positive samples. FP represents the number of negative samples predicted to be positive samples. FN represents the number of positive samples predicted to be negative samples. TN represents the number of negative samples predicted to be negative samples.
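A minimal sketch of the rand index is given below. It uses the standard pair-counting form, $(TP+TN)$ divided by the number of pairs, and assumes the usual pair-level reading of the four counts: a pair of samples is "positive" when both samples share a true category, and "predicted positive" when both are placed in the same cluster. The function names and toy labels are illustrative.

```python
from itertools import combinations


def pair_counts(cluster_labels, class_labels):
    """Count TP, FP, FN and TN over all pairs of samples."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(class_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            tp += 1  # same category, same cluster
        elif same_cluster:
            fp += 1  # different categories, same cluster
        elif same_class:
            fn += 1  # same category, different clusters
        else:
            tn += 1  # different categories, different clusters
    return tp, fp, fn, tn


def rand_index(cluster_labels, class_labels):
    """Proportion of sample pairs on which clustering and categories agree."""
    tp, fp, fn, tn = pair_counts(cluster_labels, class_labels)
    return (tp + tn) / (tp + fp + fn + tn)


# Six toy data points in two clusters (illustrative labels only).
cluster_labels = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "a"]
counts = pair_counts(cluster_labels, class_labels)
ri = rand_index(cluster_labels, class_labels)
```

For the six toy points there are 15 pairs, of which 2 are true positives and 5 are true negatives, giving RI = 7/15. With many categories most pairs are true negatives, which is one reason RI values tend to be much higher than purity or F1 on the same clustering.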

5.2.3 Precision

Precision is a commonly used evaluation metric. It is the proportion of samples predicted to be positive that are actually positive. The formula is as follows:

$\displaystyle P=\frac{TP}{TP+FP}$ (19)

5.2.4 Recall

Recall is the proportion of actually positive samples that are predicted to be positive. The formula is as follows:

$\displaystyle R=\frac{TP}{TP+FN}$ (20)

5.2.5 F-Measure

Precision and recall often trade off against each other, so the two need to be considered together. The most common method is the $F$ value, which is calculated as the weighted harmonic mean of precision and recall. The formula is as follows:

$\displaystyle F=\frac{({\alpha^{2}+1})\times P\times R}{\alpha^{2}\times P+R}$ (21)

When the parameter $\alpha=1$ , it is the F1 value.

5.2.6 Macro-averaging

First, the above evaluation index values are computed for each category, and then the arithmetic mean over all categories is calculated, as in Eqs (22)–(24).

$\displaystyle\textit{Macro\_P}=\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}P_{i}$ (22)

$\displaystyle\textit{Macro\_R}=\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}R_{i}$ (23)

$\displaystyle\textit{Macro\_F}=\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}F_{i}$ (24)

Among them, $P$ , $R$ and $F$ represent Precision, Recall and $F$ value respectively.
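The per-category metrics and their macro averages in Eqs. (19)-(24) can be sketched as follows, with $\alpha=1$ so that the $F$ value is F1. The sketch assumes that every cluster has already been mapped to a category, so that each sample has a predicted category; how that mapping is obtained is not specified here. Treating an undefined ratio as 0 is also an assumption of the example.

```python
def macro_scores(true_labels, pred_labels):
    """Return (Macro_P, Macro_R, Macro_F1) over the categories in true_labels."""
    categories = sorted(set(true_labels))
    p_sum = r_sum = f_sum = 0.0
    for cat in categories:
        pairs = list(zip(true_labels, pred_labels))
        tp = sum(1 for t, p in pairs if t == cat and p == cat)
        fp = sum(1 for t, p in pairs if t != cat and p == cat)
        fn = sum(1 for t, p in pairs if t == cat and p != cat)
        # Assumption: a ratio with a zero denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        p_sum += precision
        r_sum += recall
        f_sum += f1
    n = len(categories)
    return p_sum / n, r_sum / n, f_sum / n


# Six toy samples from two categories (illustrative labels only).
true_labels = ["a", "a", "a", "a", "b", "b"]
pred_labels = ["a", "a", "a", "b", "b", "b"]
macro_p, macro_r, macro_f1 = macro_scores(true_labels, pred_labels)
```

Here category "a" has precision 1 and recall 0.75, while category "b" has precision 2/3 and recall 1, so the macro averages weight the two categories equally regardless of their sizes.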

5.3 Experimental results and analysis

5.3.1 Initial point selection experiment for clustering

In the WRDK-means algorithm, the value of the threshold $\lambda^{m}$ determines how the cluster centers other than the first one are selected. We compared purely random initialization of the cluster centers with threshold-based initialization under four different thresholds. For each condition, we performed 10 experiments. Table 1 lists the maximum and median values of the evaluation metrics in each case. According to Table 1, the results of random initialization of cluster centers (i.e. $\lambda^{m}=0$ ) are mediocre. When the threshold $\lambda^{m}=0.95$ , the experimental results are poor. When the threshold $\lambda^{m}\leqslant 0.85$ , the best-case results are good. In our experiment, we get the best results with $\lambda^{m}=0.5$ . In this case, the five evaluation index values of our scheme are approximately 0.6362, 0.948, 0.6761, 0.6324 and 0.6295.

Table 1
Initial point selection experiment for clustering

Method	$\lambda^{m}$	Purity	RI	$P_{i}$	$R_{i}$	$F_{i}$
Maximum of random experiments	$\backslash$	0.5665	0.938	0.6201	0.5624	0.5695
Median of random experiments	$\backslash$	0.5359	0.9337	0.5558	0.5309	0.5209
Maximum of $\lambda^{m}=0.95$ experiments	0.95	0.5355	0.9336	0.5718	0.5315	0.5247
Median of $\lambda^{m}=0.95$ experiments	0.95	0.483	0.8014	0.4941	0.4801	0.4734
Maximum of $\lambda^{m}=0.85$ experiments	0.85	0.6191	0.9455	0.6542	0.616	0.6201
Median of $\lambda^{m}=0.85$ experiments	0.85	0.4917	0.9273	0.4636	0.4895	0.4632
Maximum of $\lambda^{m}=0.7$ experiments	0.7	0.6221	0.946	0.6211	0.6176	0.614
Median of $\lambda^{m}=0.7$ experiments	0.7	0.4658	0.9236	0.4916	0.4648	0.4619
Maximum of $\lambda^{m}=0.5$ experiments	0.5	0.6362	0.948	0.6761	0.6324	0.6295
Median of $\lambda^{m}=0.5$ experiments	0.5	0.4808	0.9258	0.456	0.4796	0.4617

5.3.2 Experiments on clustering effects

All experimental results are averaged over 10 runs. To balance the running speed and accuracy of the WRDK-means algorithm, the word vectors produced by the CBOW model are fixed at 300 dimensions. The words of each text are ranked by TF-IDF and the 20 most important keywords are retained, so the length of each cluster center is also fixed at 20. For dataset 1, the rand index, precision $P_{i}$ , recall $R_{i}$ and F1 value $F_{i}$ of each category obtained by the WRDK-means algorithm are shown in Table 2, and the confusion matrix is shown in Fig. 4. Table 3 compares the purity, rand index, Macro_P, Macro_R and Macro_F1 of the algorithm on datasets 1, 2 and 3.
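The keyword truncation step can be illustrated with a short Python sketch. It assumes that the texts are already segmented into tokens and uses the common weighting $\mathit{tf}\cdot\log(N/\mathit{df})$; the exact TF-IDF variant used in the experiments is not specified here, so the weighting, the function name `top_k_keywords` and the toy documents are all assumptions.

```python
import math
from collections import Counter


def top_k_keywords(docs, k=20):
    """Return, for each tokenized document, its k highest-scoring TF-IDF tokens."""
    n_docs = len(docs)
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))  # count each token once per document

    keywords = []
    for tokens in docs:
        term_counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / document_frequency[word])
            for word, count in term_counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        keywords.append(ranked[:k])
    return keywords


# Toy tokenized documents; real Chinese texts would first need word segmentation.
toy_docs = [
    ["stock", "market", "rise", "stock"],
    ["football", "match", "market"],
    ["football", "league", "match", "football"],
]
toy_keywords = top_k_keywords(toy_docs, k=2)
print(toy_keywords)
```

With `k=20`, every text is represented by at most 20 word vectors, which bounds the cost of each distance computation but may discard informative words in texts whose topic vocabulary is scattered.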

As can be seen from Table 2, the F1 values of the categories are unbalanced. The performance of some categories, such as “Home” and “Finance”, is significantly lower than the macro average. This may be because the keywords of such topics are scattered, so that the 20 keywords retained by TF-IDF are not enough to preserve their semantic information. During the experiments, different initial cluster centers had a significant impact on the per-category results, indicating that the cluster center initialization method still has considerable room for improvement. For text clustering tasks, the more categories a dataset contains and the more imbalanced the category sizes are, the more difficult clustering becomes. This is consistent with the clustering results on dataset 2 (9 categories) being better than those on dataset 1 (14 categories). The experimental corpus in this paper is the same as that of dataset 1. The categories of dataset 1 largely overlap with those of dataset 2 but differ considerably from those of dataset 3, and the clustering results on dataset 3 are lower than those on dataset 2. This suggests that the choice of corpus also has a considerable impact on the clustering results.

Table 2
WRDK-means clustering results of the Chinese dataset of Tsinghua University

Category	Rand index	$P_{i}$	$R_{i}$	$F_{i}$
Education	0.9799	0.8783	0.8383	0.8578
Constellation	0.9762	0.7816	0.9226	0.8463
Society	0.9671	0.7217	0.895	0.7991
Real estate	0.974	0.9705	0.6633	0.788
Current affairs	0.9624	0.7621	0.705	0.7324
Sports	0.9478	0.6052	0.809	0.6924
Entertainment	0.9489	0.6277	0.7286	0.6744
Games	0.9263	0.4965	0.72	0.5877
Lottery	0.9573	0.9493	0.3989	0.5617
Fashion	0.9357	0.5578	0.5353	0.5463
Technology	0.9529	0.96	0.3636	0.5274
Stock	0.9255	0.4782	0.5076	0.4925
Finance	0.904	0.3726	0.5206	0.4344
Home	0.9139	0.3034	0.2458	0.2716
Macro-averaging	0.948	0.6761	0.6324	0.6295

Table 3
Clustering results of three data sets

Data set	Purity	Macro_RI	$\textit{Macro\_P}_{i}$	$\textit{Macro\_R}_{i}$	$\textit{Macro\_F}_{i}$
Data set 1 (14 categories)	0.6362	0.948	0.6761	0.6324	0.6295
Data set 2 (9 categories)	0.6538	0.9386	0.6799	0.6653	0.6669
Data set 3 (9 categories)	0.5822	0.9071	0.5927	0.5829	0.5707

Table 4
Comparison of the WRDK-means algorithm with four schemes from the literature

Method	Feature extraction	Feature selection	Similarity comparison method	Clustering method
WRDK-means (proposed)	TF-IDF, Word2vec	Text matrix	Word rotator’s distance	K-means (improved)
The combination of LDA, VSM and K-means [23]	TF-IDF	LDA, VSM	LDA, cosine distance	LDA, K-means (traditional)
Unsupervised K-means [24]	TF-IDF	VSM	Cosine distance	K-means (traditional)
WCK-means [25]	TF-IDF, Word2vec	VSM	Euclidean distance	Canopy, K-means (traditional)
Text hierarchical clustering based on VSM and Birch [26]	TF-IDF	VSM	Cosine distance	Birch

Figure 4.
Confusion matrix of WRDK-means clustering on the Tsinghua University Chinese data set.

5.3.3 Comparative experiments

On dataset 1, we compared the clustering algorithm proposed in this paper with four existing text clustering algorithms using the five evaluation indicators. Table 4 lists the specific techniques used by the proposed scheme and by the other four algorithms at each of four stages: feature extraction, feature selection, similarity comparison and clustering. The purity, rand index, $\textit{Macro\_P}_{i}$ , $\textit{Macro\_R}_{i}$ and $\textit{Macro\_F}_{i}$ values of the five methods on dataset 1 are given in Table 5. According to the results, the five evaluation index values of our scheme are 0.6362, 0.948, 0.6761, 0.6324 and 0.6295, each higher than the corresponding value of the other four schemes. In particular, the Macro_F1 of our scheme is about 37.77, 23.2, 13.06 and 20.12 percentage points higher than that of [23, 24, 25, 26], respectively. The proposed algorithm therefore shows a clear advantage on this dataset.
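The Macro_F1 gaps quoted above are absolute differences between the Macro_F values reported in Table 5, expressed in percentage points. The short check below reproduces them; the dictionary keys and the helper name `gaps_in_points` are illustrative labels only.

```python
# Macro_F values on dataset 1, copied from Table 5.
macro_f = {
    "WRDK-means (proposed)": 0.6295,
    "LDA + VSM + K-means [23]": 0.2518,
    "Unsupervised K-means [24]": 0.3975,
    "WCK-means [25]": 0.4989,
    "VSM + Birch [26]": 0.4283,
}


def gaps_in_points(scores, reference):
    """Absolute gap (in percentage points) between the reference method and each other method."""
    return {
        name: round((scores[reference] - value) * 100, 2)
        for name, value in scores.items()
        if name != reference
    }


gaps = gaps_in_points(macro_f, "WRDK-means (proposed)")
for name, gap in gaps.items():
    print(f"{name}: +{gap} percentage points")
```

Note that these are differences in score, not relative improvements; the relative gain over a given baseline would be its gap divided by that baseline's own Macro_F value.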

Table 5
Results of comparative experiments

Method	Purity	Macro_RI	$\textit{Macro\_P}_{i}$	$\textit{Macro\_R}_{i}$	$\textit{Macro\_F}_{i}$
WRDK-means (proposed)	0.6362	0.948	0.6761	0.6324	0.6295
The combination of LDA, VSM and K-means [23]	0.2147	0.8938	0.2691	0.2625	0.2518
Unsupervised K-means [24]	0.3865	0.7249	0.4979	0.3914	0.3975
WCK-means [25]	0.5013	0.8048	0.5582	0.4988	0.4989
Text hierarchical clustering based on VSM and Birch [26]	0.4171	0.7916	0.5729	0.4225	0.4283

6. Conclusion

In this paper, we propose WRDK-means, a text clustering algorithm that combines WRD with an improved K-means. Based on the concepts and properties of WRD, the algorithm uses WRD in place of the Euclidean distance of the traditional VSM model to measure the similarity between texts, and thereby retains more contextual semantic information. To combine WRD and K-means soundly, we redesign the text feature extraction method, reduce the cluster instability caused by random selection of the initial cluster centers, and propose a new iterative approximation method for cluster centers. We evaluate the proposed algorithm on three datasets and compare it with four other algorithms. The results show that for text clustering, especially the clustering of high-dimensional text data, the WRDK-means algorithm significantly improves purity, rand index, precision, recall and F1 value. In particular, the RI value of our algorithm exceeds 90%, and the Macro_F1 is improved by up to about 38 percentage points. However, the algorithm runs slowly and the initialization method still has room for improvement; both will be addressed in future work.

Data availability

All data, models and code generated or used during the study are available from the corresponding author upon request.

Ethical approval

The work is a novel work and has not been published elsewhere nor is it currently under review for publication elsewhere.

Authors’ contributions

Cui Zicai – conceptualization, modeling, methodology, analysis, computer simulation and writing. Zhong Bocheng – conceptualization, modeling, analysis and supervision. Bai Chen – modeling, computer simulation, analysis, supervision and writing.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62006150).

Conflict of interest

The authors declare that they have no conflict of interest.

References

1. Wadnare R.J., Sherekar S.S. and Thakare V.M., Development of Text Clustering Method with K-Means for Analysis of Text Data, International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2021.

2. Jie, Jiyue, Junhui, Yusheng, Huiping and Kaiyan, Review on the Research of K-means Clustering Algorithm in Big Data, in: 2020 IEEE 3rd International Conference on Electronics and Communication Engineering (ICECE), 2020, pp. 107–111. doi: 10.1109/ICECE51594.2020.9353036.

3. Ling, Dyer, Black A.W. et al., Two/too simple adaptations of word2vec for syntax problems, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1299–1304.

4. James M.Q., Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1(14) (1967).

5. Lee and Lee J.H., K-means clustering based SVM ensemble methods for imbalanced data problem, in: 2014 Joint 7th International Conference on Soft Computing and Intelligent Systems (SCIS) and 15th International Symposium on Advanced Intelligent Systems (ISIS), 2014, pp. 614–617.

6. Bejos, Feliciano-Avelino, Martínez-Trinidad J.F. et al., Improved fast partitional clustering algorithm for text clustering, Journal of Intelligent and Fuzzy Systems 39(2) (2020), 2137–2145.

7. Wang, Chen, Yan, Zeng and Hong, Fast Adaptive K-Means Subspace Clustering for High-Dimensional Data, in: IEEE Access, vol. 7, 2019, pp. 42639–42651. doi: 10.1109/ACCESS.2907043.

8. Ding, Liu et al., PCA-guided search for K-means, Pattern Recognition Letters 54 (2015), 50–55.

9. Yang M.-S. and Sinaga K.P., A Feature-Reduction Multi-View k-Means Clustering Algorithm, in: IEEE Access, vol. 7, 2019, pp. 114472–114486. doi: 10.1109/ACCESS.2934179.

10. Chakraborty, Faujdar, Punhani et al., Comparative Study of K-Means Clustering Using Iris Data Set for Various Distance, in: 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 2020.

11. Marcińczuk, Gniewkowski, Walkowiak et al., Text document clustering: Wordnet vs. TF-IDF vs. word embeddings, in: Proceedings of the 11th Global Wordnet Conference, 2021, pp. 207–214.

12. Mikolov, Chen, Corrado et al., Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013.

13. Bafna, Pramod and Vaidya, Document clustering: TF-IDF approach, in: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), 2016, pp. 61–66.

14. Rong, word2vec parameter learning explained, arXiv:1411.2738, 2014.

15. Jatnika, Bijaksana M.A. and Suryani A.A., Word2vec model analysis for semantic similarities in english words, Procedia Computer Science 157 (2019), 160–167.

16. Yokoi, Takahashi, Akama et al., Word Rotator’s Distance, arXiv:2004.15003, 2020.

17. Kusner, Sun, Kolkin et al., From word embeddings to document distances, International conference on machine learning, PMLR, 2015, 957–966.

18. Feng, Yin, Shangguan et al., Travel Mode Selecting Prediction Method Based on Passenger Portrait and Random Forest, in: 2020 Chinese Automation Congress (CAC), 2020.

19. Yang, Research on Chinese Text Clustering Algorithm Based on Semantic Similarity, ChengDu University of Electronic Science and Technology of China, 2018.

20. Sun, Guo, Zhao, Zheng and Liu, THUCTC: An Efficient Chinese Text Classifier, 2016.

21. Lei and Li, Improved K-means Clustering Algorithm by Combining with Multiple Factors, in: International Conference on Advances in Computer Technology, Information Science and Communication (CTISC), 2021, pp. 258–263.

22. Pugazhenthi and Kumar L.S., Selection of Optimal Number of Clusters and Centroids for K-means and Fuzzy C-means Clustering: A Review, in: 5th International Conference on Computing, Communication and Security (ICCCS), 2021, pp. 1–4.

23. Yang and Jiang, The research on text clustering based on LDA joint model, Journal of Intelligent & Fuzzy Systems 32(5) (2017), 3655–3667.

24. Sangaiah A.K., Fakhry A.E., Abdel-Basset et al., Arabic text clustering using improved clustering algorithms with dimensionality reduction, Cluster Computing 22(2) (2019), 4535–4549.

25. Wang, Zhou and Li, Design and application of a text clustering algorithm based on parallelized K-means clustering, Rev. d’Intelligence Artif 33(6) (2019), 453–460.

26. Yang, Xia, Zhao and Dou, Research on the text clustering method of Chinese encyclopedia based on feature dictionary construction and BIRCH algorithm, Computer Era 23-27+31 (Chinese), 2019.
