The research on text clustering based on LDA joint model

Abstract

This paper proposes a clustering algorithm that combines the LDA (Latent Dirichlet Allocation) probabilistic topic model with the VSM (Vector Space Model), using a three-tier framework of text, topic and feature word. Although LDA alone can uncover hidden topic knowledge, its low-dimensional representation struggles to preserve the full text information, which limits its capacity to distinguish texts. The paper therefore carries out cluster analysis in terms of both feature words and topics by integrating the two models. A better mix of LDA and VSM improves the clustering effect, and the optimal number of clusters K for the K-means algorithm and the optimal number of topics T for the LDA model are determined at the same time. To assess the algorithm objectively, the silhouette coefficient and the Dunn coefficient are used as evaluation measures.

Keywords

Text clustering; LDA model; K-means algorithm; VSM model; silhouette coefficient; Dunn coefficient

1 Introduction

The rapid development of Internet technology, including browsing and recommendation services, makes it easier for people to access the text information they need [1]. However, much of this information is repetitive or near-duplicate. To obtain useful information quickly, it is necessary to generate high-quality abstracts that summarize large volumes of text. In the current age of knowledge, abstract production calls for replacing manual work with intelligent algorithms and natural language processing, and research into automatic abstracting systems is thriving [2]. The text clustering algorithms on which this paper focuses are one offshoot of that research; they offer flexibility and automatic processing without manual supervision [3], dedicated training, or preliminary labeling of texts. However, text clustering faces a major challenge: relying on word frequency data alone no longer meets modern requirements [4]. More and more researchers recognize the importance of introducing external semantic knowledge, or of more effective means of extracting the internal semantic knowledge that the texts themselves contain [5]. At present, the most popular model for extracting internal semantic knowledge, the LDA model, is trained without supervision using mature probabilistic algorithms [6]. In addition, the model is straightforward to use, and its fixed-size parameter space makes it well suited to processing large-scale text sets [7].

Among the many text clustering algorithms, no system for side-by-side comparison and analysis has yet been established. In practical work, algorithm selection and parameter setting rely heavily on experience, which may hinder more rigorous study and analysis of the algorithms [8]. Despite the lack of such a system, there is an alternative: evaluating the text clustering results, by which algorithm design, algorithm selection, algorithm performance and parameter optimization can all be measured [9]. This paper mainly studies the K-means clustering algorithm based on the LDA model [10]. After comparing the experimental results and algorithms and evaluating the clustering, a text clustering method that integrates the LDA model, the VSM model and the K-means algorithm is proposed and appraised objectively [11].

The remainder of the paper is organized as follows. Section 2 reviews previous work on text clustering; Section 3 describes preparatory text processing based on word segmentation; Section 4 presents the LDA-based text clustering method; Section 5 reports the test results and analysis; Section 6 concludes.

2 Related work

Researchers at home and abroad have done a great deal of work on text information extraction in pursuit of better text analysis performance. Salton proposed the Vector Space Model [12], one of the most commonly used models. The LDA model, first presented by Blei [13], processes text features by reducing them to several topic spaces, determining the link between hidden topics and words within each space, and obtaining the probability distribution of topics in the text. Deerwester advanced the Latent Semantic Analysis (LSA) model, which is designed to discover the latent semantic correlation between texts and words [14]. Ma Dashun established an LDA text model for single texts to retrieve information effectively [15]. Inspired by Ma, Wei Xing and W.B. Croft improved the LDA model [16], using the LDA text model for better retrieval. Liu Zhenlu studied LSA based on LDA [17], separating the semantics into three spaces by frequency and revising the clustering result by means of semantic interaction and text types. Li Wenbo integrated text type information into the conventional topic model [18], producing an extended LDA model with better classification performance. Xu Shuo proposed a new co-occurrence clustering method dedicated to extracting maximal frequent item sets [19]. Previous research also showed that frequent word sets markedly improve the clustering of short texts [20, 21], whereas conventional clustering, such as K-means alone, performs poorly on them. This paper initially combined the LDA model with K-means for the analysis; the test results then led to adding the VSM model, so that feature words and topics are analyzed separately. LDA excels at extracting the latent topic information from within the text, but its low-dimensional character weakens its ability to differentiate texts [22]. The VSM model and K-means are therefore brought in to compensate for this deficiency, leading to better clustering performance, which is appraised by the silhouette coefficient [23] and the Dunn coefficient [24].

3 Preparatory text processing based on word segmentation method

3.1 The common methods for Chinese word segmentation

Chinese text differs from English in that no spaces separate words or characters, so operations such as classification, clustering, machine learning and text search are carried out on the basis of Chinese words [25]. Because the word, or in some cases the single character, is the smallest unit that carries meaning, concept extraction, subject analysis, and natural language understanding and processing all follow the step of separating a sentence into words [26]. The dominant word segmentation methods are currently based on string matching, statistics and understanding [27].

3.1.1 Word separation based on string matching

The key step in this approach is to match the string to be segmented against the entries of a chosen dictionary. Keyword recognition repeats a single cycle, in which any substring of the target string found in the dictionary is taken as a word, until the whole string has been segmented [28]. Matching methods are classified by match length or by scanning direction. By length there are maximum and minimum matching; by scanning direction there are forward matching, reverse matching and two-way scanning [29]. Generally, reverse matching is more accurate than forward matching, and a flexible combination of methods is also recommended. Other string-matching methods, such as word-by-word traversal and optimum matching, are straightforward and efficient but demand a high-quality dictionary [30].
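As an illustration of dictionary-based matching, the following minimal Python sketch implements forward and reverse maximum matching. The function names, the `max_len` parameter and the four-word toy dictionary are assumptions made for this example only; a practical system would rely on a large curated dictionary.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorten it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can move on.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    return words[::-1]  # words were collected from the end, so restore text order


# Hypothetical toy dictionary: research, graduate student, life, origin.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"  # "research the origin of life"

print(forward_max_match(sentence, toy_dictionary))   # ['研究生', '命', '起源']
print(backward_max_match(sentence, toy_dictionary))  # ['研究', '生命', '起源']
```

On this one sentence the forward scan greedily takes the three-character word and leaves a stray character, while the reverse scan recovers the intended words. It is a single example and not evidence of a general rule.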

3.1.2 Word separation based on statistics

The basic idea comes from the observation that every word is made up of one or more characters. If certain adjacent characters occur together again and again [31], there is a high probability that they form a word. A threshold is set on the number of co-occurrences, and if the adjacent characters co-occur more often than the threshold, they are taken as a keyword, without the laborious step of building a dictionary. One deficiency is hard to avoid, however: spurious keywords, such as the Chinese words meaning “my” or “your”, occur frequently and can be disruptive [32].

3.1.3 Word separation based on the understanding

With artificial intelligence, the computer simulates the way a person understands a sentence in order to segment it [33]. In this case, semantic analysis is required before the machine segments the words, a process that can sometimes produce ambiguity [34]. Although two approaches are available, expert systems and neural networks, they may run into difficulties in practice because the huge variety of combinations and contexts in Chinese text is far beyond what machines can currently handle [35].

3.2 Preparatory text processing based on “jieba” word separation

“Jieba” (Chinese for “to stutter”) is an effective tool for preparatory text processing; it works in three main steps [36].

A “Trie” structure supports efficient scanning of the word graph, from which a DAG (Directed Acyclic Graph) is built that covers all possible word combinations of the characters [37].

Dynamic programming is then used to find the maximum-probability path and obtain the segmentation that best matches the word frequencies [38]. Dynamic programming, a branch of operations research, is one of the leading mathematical methods used in practical decision processes. Its strength lies in transforming a multi-stage problem into a series of single-stage problems and reaching the final answer by exploiting the links among stages [39].

“Jieba” segmentation is also handy for removing the spurious keywords mentioned above; what remains is a word matrix derived from the original texts. The matrix is simply the collection of all keywords, and the order in which they appear is not considered [40].
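The DAG construction and the maximum-probability path search described above can be sketched as follows. This is an illustrative sketch and not the code of the “jieba” library: a plain Python dictionary stands in for the Trie, the word frequencies are invented, and unknown single characters are given a frequency of 1 as a smoothing assumption.

```python
import math


def build_dag(text, freq):
    """For each start index, list the end indices of dictionary words beginning there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in freq]
        dag[i] = ends or [i]  # unknown character: fall back to a one-character word
    return dag


def best_segmentation(text, freq):
    """Return the segmentation whose words have the highest total log-probability."""
    total = sum(freq.values())
    dag = build_dag(text, freq)
    n = len(text)
    # route[i] = (best log-probability of text[i:], end index of the first word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):  # work backwards so route[j + 1] is already known
        route[i] = max(
            (math.log(freq.get(text[i:j + 1], 1) / total) + route[j + 1][0], j)
            for j in dag[i]
        )
    # Follow the stored end indices from the start of the text to read off the words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words


# Hypothetical word frequencies, chosen only to make the example concrete.
toy_freq = {"研究": 50, "研究生": 10, "生命": 40, "起源": 30, "生": 5, "命": 5}
print(best_segmentation("研究生命起源", toy_freq))
```

Working from the end of the sentence backwards means each position is solved once and reused, which is the multi-stage to single-stage reduction that dynamic programming provides.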

4 The text clustering based on LDA

Clustering algorithms fall into several types, such as partitioning-based, density-based, hierarchical (layer-based) and model-based [41]. K-means, which is discussed in this paper, belongs to the partitioning-based type. With simple computation and fast convergence, K-means is well suited to large data sets [42]. Another commonly used partitioning-based algorithm is the fuzzy clustering algorithm Fuzzy C-means (FCM) [43], which produces results close to those of K-means but requires more computation time because of the fuzzy measure calculations involved [44]. This paper aims to improve K-means and proposes an updated clustering method that can analyze public opinion well in a relatively short time. Traditional K-means is prone to converging to a local minimum rather than the global one, which leads to poor clustering performance. To ensure stable and accurate results, obtaining appropriate initial cluster centers is therefore particularly important [45]. For this reason, the paper puts forward an LDA-based joint model to improve K-means and optimize text clustering.

4.1 The text clustering based on K-means

4.1.1 K-means algorithms

The K-means algorithm, proposed by J.B. MacQueen in 1967, is a classic clustering algorithm widely used in scientific research and industry. Its rationale is to refine the cluster centers iteratively: the distance between each object and every cluster center is calculated, and each object is assigned to its nearest center [46]. The specific procedure is as follows [47].

i. Select k objects as the initial cluster centers.

ii. Calculate the distance between each object and each cluster center, and assign the object to the nearest center.

iii. Once all objects have been assigned, recalculate the cluster centers.

iv. Compare the centers from step iii with the previous cluster centers. If they differ, repeat from step ii.

v. Output the final result.

First, k objects are chosen as the initial cluster centers. Each remaining object is assigned to the most similar cluster according to its distance from the centers [48]. The cluster centers are recalculated repeatedly until the criterion function, measured by the squared error, converges. The criterion is given in Equation (1). $E = \sum_{i = 1}^{k} \sum_{p \in C_{i}} | p - m_{i} |^{2}$ (1)

Here E is the sum of squared errors over all objects, p is an object in cluster C_i, and m_i is the mean (center) of cluster C_i.

Minimizing this criterion makes each cluster as compact as possible and the clusters as distinct from one another as possible. Its flexibility and efficiency make the algorithm suitable for large-scale data sets.
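The procedure and the criterion of Equation (1) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the initial centers are drawn at random with a fixed seed, an empty cluster keeps its previous center, and the six two-dimensional points are invented.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # choose k objects as initial centers
    for _ in range(max_iter):
        # Assign every object to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Recalculate the centers; an empty cluster keeps its old center (assumption).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    # Criterion E of Equation (1): sum of squared distances to the cluster centers.
    sse = sum(squared_distance(p, centers[i])
              for i, cluster in enumerate(clusters) for p in cluster)
    return centers, clusters, sse


# Invented toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centers, toy_clusters, toy_sse = kmeans(toy_points, k=2)
print(sorted(len(c) for c in toy_clusters), round(toy_sse, 3))
```

Because the result depends on the initial centers, running the sketch with several seeds and keeping the run with the smallest E is a common precaution against poor local minima.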

4.1.2 The determination of the optimal clustering number in K-means

K-means is simple to compute and adaptable, but it requires the number of clusters to be specified beforehand as an input parameter, and the value chosen affects the clustering result. For appraising the result, the silhouette coefficient combines cohesion and separation in a single measure [49]. Its value ranges from –1 to 1, and a value closer to 1 indicates better clustering. In principle, several candidate numbers of clusters must be tried, with the calculation repeated for each, to make the choice more reliable. As Table 1 and Fig. 1 indicate, the number of clusters with the maximum silhouette coefficient is the most suitable.

Table 1
Silhouette coefficient for each number of clusters K under the K-means algorithm

K	2	3	4	5	6	7	8	9	10
Silhouette coefficient	0.1	0.222	0.312	0.703	0.69	0.6	0.5	0.49	0.32

Fig.1

The silhouette coefficient of K.

In Fig. 1, the horizontal axis is the number of clusters and the vertical axis is the silhouette coefficient. Since the maximum silhouette coefficient indicates the best clustering performance, and the coefficient peaks at 0.703 when the number of clusters is 5 and is lower on either side, the number of clusters is set to 5 in what follows.
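The selection rule amounts to taking the K with the largest silhouette coefficient. A short sketch using the values reported in Table 1 (the variable names are assumptions of this example):

```python
# Silhouette coefficients from Table 1, keyed by the number of clusters K.
silhouette_by_k = {
    2: 0.1, 3: 0.222, 4: 0.312, 5: 0.703, 6: 0.69,
    7: 0.6, 8: 0.5, 9: 0.49, 10: 0.32,
}

# The most suitable K is the one with the maximum silhouette coefficient.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, silhouette_by_k[best_k])  # 5 0.703
```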

4.2 The text clustering based on VSM

4.2.1 VSM model

The Vector Space Model, proposed by G. Salton, rests on the idea that the combinations and positions of characteristic items in a text can be ignored because they have little effect on text classification. What matters is which characteristic items occur, provided they are independent and significant parts of the text [50]. Characteristic items may be characters, words or phrases, but words are used overwhelmingly. A single character carries limited meaning, so it carries less weight as a characteristic item when describing the target text [51].

Assume a text contains n characteristic items; it can then be represented as t = (t₁, t₂, … t_n) or d = ((t₁, w₁), (t₂, w₂), … (t_n, w_n)). Here t_i is the i-th characteristic item and w_i is its weight. The VSM, one of the most widely used models for processing Chinese-language information, transforms each text into such a vector, so that the text collection forms a high-dimensional vector space. The similarity between texts is then calculated with the cosine of the angle between their vectors [52].

Suppose any two text vectors are denoted $\vec{d_{1}}, \vec{d_{2}}$; the similarity between the texts is given by Equation (2). $sim (\vec{d_{1}}, \vec{d_{2}}) = \cos (\vec{d_{1}}, \vec{d_{2}}) = \frac{\vec{d_{1}} \cdot \vec{d_{2}}}{| \vec{d_{1}} | \times | \vec{d_{2}} |}$ (2)
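Equation (2) can be written as a short function. The sketch below assumes that both vectors have the same length and, as a convention of this example only, returns 0 when either vector is all zeros.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length weight vectors (Equation (2))."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty text is treated as similar to nothing
    return dot / (norm1 * norm2)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))        # no shared characteristic items
```

Because the measure depends only on the angle, two texts with the same proportions of characteristic items are judged identical regardless of their lengths.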

4.2.2 The representation of the tf-idf weight in VSM

A commonly used weight formula is Equation (3). $W (t, \bar{d}) = tf (t, \bar{d}) \times \log (\frac{N}{n_{t}} + 0.5)$ (3)

Here $W (t, \bar{d})$ is the weight of word t in text d, $tf (t, \bar{d})$ is the number of times t occurs in d, N is the total number of texts, and n_t is the number of texts that contain t. This paper uses tf-idf to calculate the weights and construct the VSM [53].
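A minimal sketch of the weighting in Equation (3) follows. It assumes that each text has already been segmented into a list of words, that tf is the raw occurrence count, and that the logarithm is natural; the function name and the two toy texts are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return the vocabulary and one weight vector per text, following Equation (3)."""
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Number of texts that contain each word (each text counted once per word).
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence counts; absent words count as 0
        vectors.append([
            tf[word] * math.log(n_docs / doc_freq[word] + 0.5)
            for word in vocabulary
        ])
    return vocabulary, vectors


# Invented, already segmented toy texts.
toy_docs = [["driver", "road", "driver"], ["road", "police"]]
toy_vocabulary, toy_vectors = tfidf_vectors(toy_docs)
print(toy_vocabulary)
print(toy_vectors)
```

A word that occurs in every text still receives a small positive weight here, because the 0.5 inside the logarithm keeps its argument above 1.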

Each comment is treated as a point, and the similarity between every pair of comments is calculated from their tf-idf vectors [54]. In the actual computation this similarity serves as the distance, and K-means then clusters the comments according to the distance between every two comments. The number of clusters is again set to 5.

4.2.3 The K-means algorithms based on VSM

The VSM and K-means can be combined in the following steps [55].

Treat each comment as a text and calculate the tf-idf weight of each characteristic item in the comment. If a comment contains n characteristic items, their weights can be written w₁, w₂, w₃, … w_n, and the whole comment is represented as d = ((t₁, w₁), (t₂, w₂), … (t_n, w_n)).

Form the vector of every comment and input the vectors into the K-means algorithm.

Take the output of K-means as the clustering result.

4.3 LDA model

Latent Dirichlet Allocation (LDA) is a generative probabilistic topic model, used here for processing Chinese-language information, that groups related documents sharing hidden topics in certain proportions. It rests on a common-sense hypothesis: each document in the collection is a random mixture of hidden topics, so that the whole collection can be characterized by features reduced to those hidden topics [56].

In topic models, a topic differs from the everyday sense of the word: it is represented by a set of conditional probabilities over related words [57]. The link between documents and topics is expressed through the generative model, a process in which each word is drawn with some probability from a topic that has itself been drawn with some probability. The generation of any word reduces to Equation (4). $p (word \mid text) = \sum_{topic} p (word \mid topic) \times p (topic \mid text)$ (4)
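Equation (4) is a weighted sum over topics, which the following sketch makes concrete. The two topics, their word probabilities and the topic proportions of the text are invented for illustration.

```python
def word_probability(word, p_word_given_topic, p_topic_given_text):
    """p(word | text) as the topic-weighted sum of p(word | topic) (Equation (4))."""
    return sum(
        p_topic_given_text[topic] * p_word_given_topic[topic].get(word, 0.0)
        for topic in p_topic_given_text
    )


# Hypothetical topics, each a probability distribution over words.
toy_topics = {
    "traffic": {"driver": 0.6, "road": 0.4},
    "law": {"driver": 0.2, "police": 0.8},
}
# Hypothetical topic proportions of one text.
toy_text_topics = {"traffic": 0.7, "law": 0.3}

for toy_word in ("driver", "road", "police"):
    print(toy_word, word_probability(toy_word, toy_topics, toy_text_topics))
```

For the word “driver” the sum is 0.7 × 0.6 + 0.3 × 0.2 = 0.48, and the probabilities of the three words add up to 1, as they should for a distribution over the vocabulary.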

The LDA model recognizes the hidden topic information in documents by constructing the model automatically [58]. It is a form of machine learning that needs no human supervision. Based on the idea that a text is a random mixture of hidden topics with different probabilities, LDA adopts the bag-of-words method to turn textual elements into mathematical vectors, which makes the model easier to build and simplifies computation [59]. The bag-of-words representation likewise ignores word order [60].

The steps for generating a text from topics are as follows.

For each text, pick a topic at random from the set of topics associated with it.

Pick a word from the topic selected in the step above.

Repeat the procedure until all words have been generated.

The model therefore has three layers: the document collection level, the document level and the characteristic word level.

The parameters α and β are at the document collection level, as shown in Fig. 2. The vector α describes the relative strength of the hidden topics, θ gives the probability distribution of the hidden topics in a given text, and each entry of β gives the probability that a given characteristic word belongs to a given hidden topic [61].

Fig.2

The sketch map of the layers of LDA Model.

Suppose the number of characteristic words is N and the number of hidden topics is T. Parameter α is then a T-dimensional vector, from which the vector θ is generated in the next stage. Parameter β is a T × N matrix giving the probability distribution of words within each hidden topic. Parameters α and β are at the document collection level, so they need to be determined only once.

Next comes the characteristic word level, with the variables w and z, which reflect how the characteristic words are distributed. The vector w_i represents the characteristic words of a given text, and z_i indicates the hidden topics assigned to those words. The variable w is observed, while θ and z are hidden variables, with z generated from θ. The observed variable w is in turn generated from z and β [62].

From the above, the joint probability is given by Equation (5). $p (θ, z, w \mid α, β) = p (θ \mid α) \prod_{n = 1}^{N} p (z_{n} \mid θ) p (w_{n} \mid z_{n}, β)$ (5)

Here D = {d₁, d₂, d₃, … d_M} is the document collection; d = {w₁, w₂, w₃, … w_N} is a document; Z = {z₁, z₂, z₃, … z_T} is the set of hidden topics; N is the number of characteristic words; M is the number of documents in the collection; and T is the number of hidden topics. A document is generated by the LDA topic model in the following steps [63].

First, draw the number of characteristic words in the text, N ~ Poisson(ɛ).

Next, draw the topic distribution of the text, θ ~ Dir(α), where α is the parameter of the Dirichlet distribution.

Finally, given the number of characteristic words, generate the words themselves, which are denoted by W.
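The generative steps above can be simulated with the standard library alone. The sketch below is illustrative: the Poisson draw uses Knuth's multiplication method, the Dirichlet draw normalizes independent Gamma variates, and the vocabulary, α, β and mean length are invented values rather than estimates from the data.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_dirichlet(rng, alpha):
    """Draw theta ~ Dir(alpha) by normalising independent Gamma(alpha_j, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, alpha, beta, vocabulary, mean_length):
    n_words = sample_poisson(rng, mean_length)  # N ~ Poisson
    theta = sample_dirichlet(rng, alpha)        # theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to theta ...
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        # ... then pick the word from that topic's row of beta.
        words.append(rng.choices(vocabulary, weights=beta[z])[0])
    return theta, words


# Hypothetical vocabulary and parameters: 2 topics over 4 words.
toy_vocabulary = ["driver", "road", "police", "law"]
toy_alpha = [0.5, 0.5]
toy_beta = [[0.5, 0.4, 0.05, 0.05],   # topic 0 favours "driver" and "road"
            [0.05, 0.05, 0.4, 0.5]]   # topic 1 favours "police" and "law"

toy_theta, toy_words = generate_document(
    random.Random(0), toy_alpha, toy_beta, toy_vocabulary, mean_length=8)
print(toy_theta, toy_words)
```

Training LDA runs this process in reverse: the words are observed, and θ, z and the parameters are inferred from them.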

Following the steps above, the probability of generating a characteristic word in a text is given by Equation (6). $p (w_{i}) = \sum_{j = 1}^{T} p (w_{i} \mid z_{i} = j) p (z_{i} = j)$ (6)

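The probability of a given characteristic word w in text d is given by Equation (7), where $φ_{w}^{j}$ is the probability of word w under topic j and $θ_{j}^{d}$ is the probability of topic j in text d. $p (w \mid d) = \sum_{j = 1}^{T} φ_{w}^{j} \times θ_{j}^{d}$ (7)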

Once the three layers of LDA have been constructed, the parameters α and β are estimated by maximum likelihood, using the log-likelihood in Equation (8). $\ell (α, β) = \sum_{i = 1}^{M} \log p (d_{i} \mid α, β)$ (8)

Here p (d_i | α, β) is the conditional probability of generating the characteristic words of text d_i.

4.4 The determination of the optimum number of topics based on LDA

With the LDA model described, the model extracts the hidden topic structures by analyzing the large data set, and the best structure is then selected. In this section, the average similarity criterion is used: the model is optimal when the average similarity among topics is at its minimum [64].

In the β matrix, each topic is represented by its distribution over the V-dimensional word space, which serves as the topic vector. The correlation between topic vectors is measured by the standard cosine measure, as given in Equation (9).

$\begin{matrix} corre (z_{i}, z_{j}) = corre (β_{i}, β_{j}) \\ = \frac{\sum_{v = 1}^{V} β_{iv} \times β_{jv}}{\sqrt{\sum_{v = 1}^{V} (β_{iv})^{2} \sum_{v = 1}^{V} (β_{jv})^{2}}} \end{matrix}$ (9)

The smaller the value of corre (z_i, z_j), the more independent the two topics are. The average similarity over all pairs of topics is given by Equation (10), where T is the number of topics. $avg\_corre (structure) = \frac{\sum_{i = 1}^{T - 1} \sum_{j = i + 1}^{T} corre (z_{i}, z_{j})}{T \times (T - 1) / 2}$ (10)

The average similarity thus measures the stability of the topic structure: the best topic structure is the one for which the average similarity is at its minimum. Assuming the optimal number of topics is T, the LDA process described above yields, for each comment, a vector over the T topics.
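The average similarity of Equations (9) and (10) can be computed from the topic-word matrix as sketched below. The function names mirror the notation of the equations, and the small β matrices are invented; at least two topics with non-zero rows are assumed.

```python
import math
from itertools import combinations


def corre(beta_i, beta_j):
    """Cosine measure between two topic vectors (Equation (9))."""
    dot = sum(a * b for a, b in zip(beta_i, beta_j))
    norm = math.sqrt(sum(a * a for a in beta_i) * sum(b * b for b in beta_j))
    return dot / norm


def avg_corre(beta):
    """Average of corre over all unordered topic pairs (Equation (10))."""
    pairs = list(combinations(range(len(beta)), 2))  # one entry per pair i < j
    return sum(corre(beta[i], beta[j]) for i, j in pairs) / len(pairs)


# Invented topic-word matrices: each row is one topic's distribution over words.
distinct_topics = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
overlapping_topics = [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1]]

print(avg_corre(distinct_topics))     # close to 0: nearly independent topics
print(avg_corre(overlapping_topics))  # close to 1: strongly overlapping topics
```

Evaluating `avg_corre` on the β matrix obtained for each candidate number of topics and keeping the candidate with the smallest value corresponds to the selection rule described in this section.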

The textual similarity based on the LDA latent topic vectors is given by Equation (11). ${sim}_{LDA} (d_{i}, d_{j}) = \frac{d_{i (LDA)} \cdot d_{j (LDA)}}{| d_{i (LDA)} | \times | d_{j (LDA)} |}$ (11)

With the similarity computed in this way, clustering with the K-means algorithm proceeds in the same manner as before, giving text clustering based on the combination of LDA and K-means [65].

The flow diagram of the combination of the LDA model and the K-means algorithm is shown in Fig. 3.

Fig.3

LDA model combined with K-Means algorithm flow chart.

4.5 The text clustering algorithms combining LDA, VSM and K-means

4.5.1 The advantage of the combination of the LDA and VSM

Conventional clustering analysis calculates textual similarity at the characteristic word level and may therefore fail to reach the deeper semantic information. The LDA model makes it easier to build a three-tier structure of documents, topics and feature words, so that topics can be extracted from within the texts; its flaw is a poor ability to distinguish texts [66]. This section conducts the clustering analysis in terms of both feature words and topics by combining the LDA and VSM models [67].

4.5.2 The combination of LDA, VSM and K-means

The conventional clustering technique uses the VSM feature-word vector, weighted by the tf-idf strategy, to calculate the similarity of texts [68].

For a given text, the vector based on the tf-idf weight strategy is d_i(TF-IDF) = (w₁, w₂, w₃, … w_N), where N is the number of feature words. The vector based on the LDA topic model is d_i(LDA) = (t₁, t₂, t₃, … t_T), where T is the number of latent topics.

The similarity of any two texts based on the tf-idf weight strategy is given by Equation (12). $S_{VSM} (d_{i}, d_{j}) = \frac{d_{i (TF\text{-}IDF)} \cdot d_{j (TF\text{-}IDF)}}{| d_{i (TF\text{-}IDF)} | \times | d_{j (TF\text{-}IDF)} |}$ (12)

The similarity of any two texts based on the LDA topic model is given by Equation (13). $S_{LDA} (d_{i}, d_{j}) = \frac{d_{i (LDA)} \cdot d_{j (LDA)}}{| d_{i (LDA)} | \times | d_{j (LDA)} |}$ (13)

The design uses the LDA model and the tf-idf-based VSM to calculate the two similarities separately [69]. The two similarities are then combined linearly and passed to the K-means clustering algorithm for the clustering analysis. The linear combination is given by Equation (14). $S (d_{i}, d_{j}) = ɛ S_{VSM} (d_{i}, d_{j}) + (1 - ɛ) S_{LDA} (d_{i}, d_{j})$ (14)

Here ɛ is the linear combination coefficient. The flow chart of the combination of the LDA model, the VSM and the K-means algorithm is shown in Fig. 4.

Fig.4

LDA model combined with K-Means algorithm and VSM model flow chart.
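The linear combination of Equations (12) to (14) can be sketched as follows. The default value ɛ = 0.5 and the toy vectors are assumptions of this example; the paper does not state the value of ɛ in this section.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def combined_similarity(tfidf_i, tfidf_j, lda_i, lda_j, epsilon=0.5):
    """Weighted sum of the VSM and LDA similarities (Equation (14))."""
    s_vsm = cosine(tfidf_i, tfidf_j)  # Equation (12): feature-word level
    s_lda = cosine(lda_i, lda_j)      # Equation (13): topic level
    return epsilon * s_vsm + (1 - epsilon) * s_lda


# Two hypothetical texts that share no feature words but have the same topic mix.
toy_similarity = combined_similarity(
    tfidf_i=[1.2, 0.0], tfidf_j=[0.0, 0.8],
    lda_i=[0.2, 0.8], lda_j=[0.2, 0.8],
    epsilon=0.3,
)
print(toy_similarity)
```

In this example the word-level similarity is 0 and the topic-level similarity is 1, so the combined value is 1 - ɛ; the two texts are still recognised as related even though they share no words.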

5 Experiments

5.1 The creation of the data set

Using crawler technology [70], the test gathers Sina Weibo comments on “the woman driver beaten in Chengdu for her intentional violations” and builds a collection of all the comments. Chinese word segmentation is then carried out as preparatory work to provide the operational data for the text clustering algorithms. What makes this incident a valuable test case is that public opinion shifted several times, which makes the texts more diverse and better suited to clustering.

The crawler collected some 7000 comments, enough to form a data collection. The fields gathered are user ID, nickname, Sina membership, verified account, content posted, number of pictures, Weibo link, date, posting method, number of reposts, number of comments, number of likes, etc. Only the content posted by users is used for text clustering; the other fields are discarded [71]. “Jieba” word segmentation is used to remove insignificant but disruptive words. A keyword matrix is formed as a simple collection of keywords, without regard to their order, which simplifies the subsequent model construction and clustering.

5.2 The determination of the optimal number of topics in LDA

Following the preceding steps, the number of topics is varied from 2 to 40 in steps of 2, and the average similarity is measured at each value. Table 2 shows the number of topics and the similarity among topics. In Table 2 and Fig. 5, T is the number of topics and S is the similarity.

Table 2
Average similarity S among topics for each number of topics T

T	2	4	6	8	10	12	14	16	18	20
S	0.18	0.08	0.03	0.01	0.012	0.012	0.013	0.015	0.017	0.02
T	22	24	26	28	30	32	34	36	38	40
S	0.023	0.025	0.027	0.03	0.032	0.035	0.03	0.041	0.042	0.05

Fig.5

Optimal Distribution of T.

5.3 The comparison of two textual clustering results

The test results show that the combination of the VSM and K-means alone is unsatisfactory, so the LDA model is brought in; it performs best when T is 8. The following table introduces a further quantity, SIZE, used to describe the clustering outcome.

Taking all test results so far together, the clustering performs best with eight topics and five clusters. In Table 3, SIZE is the number of comments assigned to each cluster. Of the roughly 7000 comments gathered, cluster 1 contains 459 comments (7% of the total), cluster 2 contains 3058 (42%), cluster 3 contains 598 (8%), cluster 4 contains 2021 (28%) and cluster 5 contains 1108 (15%). Because the LDA model also has its deficiency, the VSM and LDA models are next used together with the K-means algorithm.

Table 3
The test results of LDA model and K-Means algorithm

K	1	2	3	4	5
SIZE	459	3058	598	2021	1108

Fig. 6 shows the result with the participation of the three methods.

Fig.6

Diagram of LDA model and K-Means algorithm based on test results.

In Table 4, SIZE is again the number of comments assigned to each cluster. As shown in Fig. 7, cluster K1 contains 507 comments (7% of the total), K2 contains 941 (13%), K3 contains 1448 (20%), K4 contains 1956 (27%) and K5 contains 2392 (33%).

Table 4

The results of LDA model combined with K-means algorithm and VSM model

K	K1	K2	K3	K4	K5
SIZE	507	941	1448	1956	2392

Fig.7

Diagram of LDA model combined with K-means algorithm and VSM model based on test results.

5.4 The appraisal of the clustering performance

Different stages of text clustering analysis call for indexes that meet different standards. An index that agrees with human judgment is the better choice, because it steers the clustering algorithm and the surrounding processing toward what people actually need [72]. Such an index can be used for side-by-side comparison of different algorithms, for analysis of algorithm performance and for optimization of the related parameters, so that the clustering result fits human judgment. A function-based index, designed for the clustering computation itself, is more a part of the algorithm than a criterion for judging it. For instance, each iteration of K-means moves toward a smaller sum of squared errors. The sum of squared errors is here both the objective function and the criterion for the clustering. Of course, a redesigned algorithm is free to adopt other objective functions, which gives rise to variants of K-means. If clusters with high cohesion are sought, an index that measures cohesion is recommended [73]. A function-based index is not suitable for side-by-side comparison, because it serves as both part of the algorithm and the criterion. It is, however, well suited to locating the optimal clustering result. If the cohesion within a cluster or the separation among clusters is to be measured, a purely text-related index is required.

In fact, the indexes used directly to evaluate a textual clustering result derive from these two standards (algorithm operation and appraisal criterion), which is fundamentally different from adopting the two standards themselves. Various classification criteria derive from the basic two. A cluster-level index appraises a single cluster, while an overall index measures the whole clustering result; hence the former can be used to build up the latter.

If the performance of algorithms is to be compared, an appraisal without human intervention is recommended, such as a cluster appraisal index. Automatic appraisal is the better choice for relatively small collections, and it avoids the variation that human evaluation may introduce.

5.4.1 Silhouette coefficient evaluation

Cluster cohesion and separation are correlated: their sum is constant, so minimum cohesion implies maximum separation and vice versa, and there is no point in analyzing a clustering with either cohesion or separation alone. The silhouette coefficient, put forward by Kaufman, resolves this by combining cohesion and separation in a weighted form [74].

Equation (15) defines the silhouette coefficient of a single object:

$S_{i} = \frac{b_{i} - a_{i}}{\max (b_{i}, a_{i})}$ (15)

Here $a_{i}$ is the average distance between the object $d_{i}$ and the other objects within the same cluster. Let $D(i, C)$ denote the average distance between $d_{i}$ and all objects within a cluster $C$ to which $d_{i}$ does not belong. By the same rationale, $b_{i} = \min_{C} \{D(i, C)\}$ is the smallest of these average distances. For a single clustering, the silhouette coefficient is then defined by Equation (16):

$S_{k} = \frac{1}{n} \sum_{i = 1}^{n} S_{i}$ (16)

Here, “n” represents the number of objects in the data set and “k” represents the number of clusters.
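
A minimal Python sketch of Equations (15) and (16) may make the definitions concrete. It assumes Euclidean distance and at least two clusters; the distance measure, the function names, the toy points and the convention of returning 0 for an object that is alone in its cluster are assumptions for illustration, not details given in the paper.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette_values(points, labels):
    """Return S_i for every object, following Eq. (15). Requires >= 2 clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Object alone in its cluster: S_i set to 0 by convention (assumption).
            values.append(0.0)
            continue
        # a_i: average distance to the other objects of the same cluster.
        a_i = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b_i: smallest average distance D(i, C) over all other clusters C.
        b_i = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        # Note: fails with a division by zero if a_i and b_i are both 0.
        values.append((b_i - a_i) / max(a_i, b_i))
    return values

def silhouette_coefficient(points, labels):
    """Mean of S_i over all n objects, following Eq. (16)."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)

# Toy data, made up for illustration: two tight pairs far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_coefficient(toy_points, [0, 0, 1, 1])  # pairs kept together
bad_score = silhouette_coefficient(toy_points, [0, 1, 0, 1])   # pairs split apart
print(good_score, bad_score)
```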

The silhouette coefficient of an object indicates how well the object fits its cluster, with values ranging from “–1” to “1”. A value close to “1” shows that the average distance within the cluster is much smaller than the minimum average distance to another cluster, meaning the object fits its cluster well. A value close to “–1” means the object would fit a neighboring cluster better than its own.

The clustering silhouette coefficient can be computed for each number of categories “k” and used for further analysis, for example to select the optimal clustering number. The procedure computes the coefficient for every candidate number of categories and takes the maximum. The “k” at which the coefficient peaks is the optimal clustering number, the peak value is the optimal clustering silhouette coefficient, and the corresponding clustering result is the best one.
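
The selection rule can be sketched in a few lines of Python. The coefficients below are the values listed in Table 1 of this paper for K = 2 to 10; the variable names are illustrative.

```python
# Silhouette coefficient per cluster count K, copied from Table 1 of this paper.
silhouette_by_k = {
    2: 0.1, 3: 0.222, 4: 0.312, 5: 0.703, 6: 0.69,
    7: 0.6, 8: 0.5, 9: 0.49, 10: 0.32,
}

# The optimal K is the candidate whose clustering has the largest coefficient.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
best_score = silhouette_by_k[best_k]

print(best_k, best_score)
```

With these values the maximum falls at K = 5, which agrees with the five clusters reported in the result tables.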

The silhouette coefficient is now used to measure the three clustering methods, as shown in Table 5.

Table 5

Silhouette coefficient of clustering results

Algorithms	SC
VSM+K-means	0.122382
LDA+K-means	0.632356
VSM+LDA+K-means	0.6693

In Fig. 8, the vertical axis represents the silhouette coefficient. The combination of LDA and VSM performs best. The clustering algorithm based on LDA alone performs similarly well, and both are far better than VSM alone.

Fig.8

Diagram of the silhouette coefficient.

5.4.2 Dunn’s validation index evaluation

Dunn’s validation index (DVI) is the ratio of the minimum distance between categories to the maximum diameter within a category [75]. The numerator signifies the separation between categories and the denominator indicates the cohesion within a category. Because the index is a ratio, a greater Dunn value means a greater distance between categories relative to their diameters, which indicates better clustering performance.
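
A minimal Python sketch of this ratio follows. It assumes Euclidean distance, takes the smallest pairwise distance between objects of different categories as the between-category distance, and takes the largest pairwise distance inside a category as its diameter. The paper does not state which variant of these distances it uses, so these choices, the function names and the toy points are illustrative assumptions.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dunn_index(points, labels):
    """Minimum between-category distance divided by maximum category diameter.

    Requires at least two categories and at least one category with two
    distinct objects; otherwise the ratio is undefined and this sketch fails.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Separation: smallest distance between two objects of different categories.
    min_between = min(
        euclidean(p, q)
        for first, second in combinations(clusters.values(), 2)
        for p in first
        for q in second
    )
    # Cohesion: largest distance between two objects of the same category.
    max_diameter = max(
        euclidean(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )
    return min_between / max_diameter

# Toy data, made up for illustration: two tight pairs far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
compact_dunn = dunn_index(toy_points, [0, 0, 1, 1])  # pairs kept together
mixed_dunn = dunn_index(toy_points, [0, 1, 0, 1])    # pairs split apart
print(compact_dunn, mixed_dunn)
```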

The evaluation by Dunn’s validation index is displayed in Table 6.

Table 6
Dunn coefficient of clustering results

Algorithms	DVI
VSM+K-means	0.03756
LDA+K-means	0.328456
VSM+LDA+K-means	0.764323

Table 6 and Fig. 9 show that the combination of VSM and LDA achieves better performance than either VSM or LDA alone.

Fig.9

Diagram of the Dunn coefficient.

6 Conclusion

This paper performs textual clustering with LDA. It then combines the vector space model based on TF-IDF weights with the text-topic model generated by LDA, which extracts the internal semantic information in depth and improves the quality of textual clustering. Compared with the conventional algorithm, the combination of the TF-IDF weighted vector space model and the LDA-based textual clustering algorithm achieves higher accuracy.

On data gathered with a crawler technique, the three types of algorithms produced their clustering results. Comparing the three results shows that the LDA model extracts the latent topics of texts, traversing three layers from documents to topics to feature words. The Dunn validation index and the silhouette coefficient both indicate that textual clustering based on the LDA model is superior to the conventional method, and the result is best when LDA and VSM are integrated. Adding the LDA and VSM similarities in linear form makes the similarity more precise and offsets the problems potentially caused by the excessive dimensions of the LDA and VSM models. The clustering based on LDA, VSM and K-means clearly performs best, which demonstrates, both theoretically and practically, the advantage of clustering algorithms based on the LDA model.
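
The linear addition of the two similarities can be sketched as follows. This is a minimal Python illustration and not the paper's exact formula: cosine similarity for both representations, the mixing parameter `weight`, its default of 0.5, the function names and the toy vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def combined_similarity(tfidf_a, tfidf_b, topics_a, topics_b, weight=0.5):
    """Linear mix: weight * VSM similarity + (1 - weight) * LDA similarity."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    vsm_sim = cosine(tfidf_a, tfidf_b)    # similarity of TF-IDF word vectors
    lda_sim = cosine(topics_a, topics_b)  # similarity of document-topic vectors
    return weight * vsm_sim + (1 - weight) * lda_sim

# Toy documents, made up for illustration: no shared words, similar topics.
tfidf_a, tfidf_b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
topics_a, topics_b = [0.9, 0.1], [0.8, 0.2]

vsm_only = combined_similarity(tfidf_a, tfidf_b, topics_a, topics_b, weight=1.0)
lda_only = combined_similarity(tfidf_a, tfidf_b, topics_a, topics_b, weight=0.0)
mixed = combined_similarity(tfidf_a, tfidf_b, topics_a, topics_b)
print(vsm_only, lda_only, mixed)
```

In this toy case the word vectors alone see no similarity, while the topic vectors do, so the mixed score lies between the two.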

Acknowledgments

The work is supported by grants from the National Science and Technology Supporting Program of China (2014BAH10F00) and the University Research Program of Communication University of China (3132015XNG1522). We thank the reviewers and editor for their helpful comments.

References

1. Oliver J.J., Buntine W.L., Roumeliotis, System and method for adaptive text recommendation, 2015.
2. Salvador S.W. and Magdin, Predictive natural language processing models, 2016.
3. Hamou R.M., Bouarara H.A. and Amine, Bio-inspired techniques in the clustering of texts: Synthesis and comparative study, International Journal of Applied Metaheuristic Computing (2015), 39–68.
4. Wei, et al., A semantic approach for text clustering using WordNet and lexical chains, Expert Systems with Applications 42(4) (2015), 2264–2275.
5. Errecalde M.L., Cagnina L.C. and Rosso, Silhouette attraction: A simple and effective method for text clustering, Natural Language Engineering 1 (2015), 1–40.
6. Martinez, et al., LDA-based probabilistic graphical model for excitation-emission matrices, Intelligent Data Analysis 19(5) (2015), 1109–1130.
7. Chen, A novel clustering algorithm for large-scale text collection and its incremental version, Information Technology & Control 45(2) (2016).
8. Corriveau, et al., Bayesian network as an adaptive parameter setting approach for genetic algorithms, Complex & Intelligent Systems (2016), 1–22.
9. Bharill, Tiwari and Malviya, Fuzzy based clustering algorithms to handle Big Data with implementation on Apache Spark, IEEE Second International Conference on Big Data Computing Service and Applications, 2016, pp. 95–104.
10. Kemaiaia and Merouani H.F., Clustering with probabilistic topic models on Arabic texts: A comparative study of LDA and K-means, International Arab Journal of Information Technology 13(2) (2015).
11. Kumar, Yadav D.K. and Gupta V.K., Frequent term based text document clustering: A new approach, International Conference on Soft Computing Techniques and Implementations IEEE, 2015.
12. Salton, Wong and Yang C.S., A vector space model for automatic indexing, Communications of the ACM 18(11) (1975), 613–620.
13. Blei, Ng and Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.
14. Deerwester S.C., Dumais S.T. and Landauer T.K., et al., Indexing by latent semantic analysis [J], JASIS 41(6) (1990), 391–407.
15. […], Rao and Wang, An empirical study of SLDA for information retrieval [J], Information Retrieval Technology (1) (2011), 84–92.
16. Wei and Croft W.B., LDA-based document models for Adhoc retrieval, Proceeding of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
17. Liu, et al., An approach of latent semantic space partition and web document clustering, Journal of Chinese Information Processing 25(1) (2011), 60–59.
18. […] L.W., Beijing. Text Classification Based on Labeled-LDA Model [J], Chinese Journal of Computers 31(4) (2009), 620–627.
19. […] and Qiao, A novel approach for Co-occurrence clustering analysis: Maximal frequent item set mining, Journal of the China Society for Scientific and Technical Information 31(2) (2012), 143–150.
20. Wang Y.H., Jia and Yang S.Q., Massive short documents classification method based on frequent term set clustering, Computer Engineering & Design 28(8) (2007), 1744–1746.
21. Wang, Jia and Yang, Study on massive short documents clustering technology, Computer Engineering 33(14) (2007), 38–40.
22. […], et al., Microblog dimensionality reduction – A deep learning approach, IEEE Transactions on Knowledge & Data Engineering (2016), 1–1.
23. Chang C.J., Dai W.L. and Chen C.C., A novel procedure for multi model development using the grey silhouette coefficient for small-data-set forecasting, Journal of the Operational Research Society 66(11) (2015), 1887–1894.
24. Trauwaert, On the meaning of Dunn’s partition coefficient for fuzzy clusters, Fuzzy Sets and Systems 25(2) (1988), 217–242.
25. Xia X.U., Peifeng L.I. and Zhu, A Semi-supervised Chinese Event Extraction Method, Journal of Chinese Information Processing 30(2) (2016), 168–174.
26. Bouhriz, Benabbou and Benlahmer, Text concepts extraction based on Arabic WordNet and formal concept analysis, International Journal of Computer Applications 111(16) (2015), 30–34.
27. Gang, et al., Hybrid FA: A memory reduction technique for the AC automata based on statistics, Journal on Communications 36(7) (2015), 31–39.
28. Tian W.D. and Huang, Study on the Application of Frequent Sub-tree Patterns in Focus Words Recognition, Microelectronics & Computer 32(11) (2015), 27–32.
29. Wang, et al., Track fusion based on threshold factor classification algorithm in wireless sensor networks, International Journal of Communication Systems (2016), DOI: 10.1002/dac.3164
30. Beguet and Burmako, Traversal Query Language For Scala Meta Epfl, 2015.
31. Wang and Huang S.T., Chinese word segmentation based on A-priori and adjacent characters, International Conference on Machine Learning and Cybernetics, Vol. 6, 2005, pp. 3808–3813.
32. Zhou, Clothing-to-words mapping using word separation method, Computers & Electrical Engineering 39(2) (2013), 361–372.
33. Aljindi, Information security, artificial intelligence and legacy information systems, Dissertations & Theses – Gradworks, 2015, 192 pages; 3740130.
34. Hua, et al., Short text understanding through lexical-semantic analysis, IEEE, International Conference on Data Engineering IEEE, 2015, pp. 495–506.
35. Miyani, Doshi and Jain, Word problem solver system using artificial intelligence, Procedia Computer Science 45 (2015), 800–807.
36. […], Liu and Li, The simply implement of effective naïve bayes web news text classification model, Statistical and Application 3 (2014), 30–35.
37. Bendavid, et al., High dimensional Bayesian inference for Gaussian directed acyclic graph models, arXiv:1109.4371v5 [math.ST], 6 Mar 2015, 1–55.
38. Ross S.M., Introduction to stochastic dynamic programming, Journal of the American Statistical Association (2015), 1–27.
39. Gluss, An elementary introduction to dynamic programming: A state equation approach, Journal of Regional Science 14(1) (1974), 150–152.
40. Han, Yuan and Xiao, Research review on water science based on co-word cluster analysis of keywords, Journal of North China University of Water Resources & Electric Power 36(4) (2015), 20–25.
41. Aggarwal C.C., et al., Frequent pattern mining with uncertain data, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 29–38.
42. Ordonez and Omiecinski, Efficient disk-based K-means clustering for relational databases, IEEE Transactions on Knowledge & Data Engineering 16(8) (2004), 909–921.
43. Yang M.S., A survey of fuzzy clustering, Mathematical & Computer Modelling 18(11) (1993), 1–16.
44. Ghosh and Dubey S.K., Comparative analysis of K-Means and fuzzy C-means algorithms, International Journal of Advanced Computer Science & Applications 4(4) (2013).
45. Hamerly and Elkan, Learning the K in K-means, Advances in Neural Information Processing Systems 17(2004) (2003).
46. Liu, et al., Kernel-based fuzzy C-means clustering method based on parameter optimization, Jilin Daxue Xuebao 46(1) (2016), 246–251.
47. Krishna and Narasimha Murty, Genetic K-means algorithm, IEEE Transactions on Systems Man & Cybernetics Part B Cybernetics A Publication of the IEEE Systems Man & Cybernetics Society 29(3) (1999), 433–439.
48. Chen J.Y. and He H.H., Research on density-based clustering algorithm for mixed data with determine cluster centers automatically, Acta Automatica Sinica 41(10) (2015), 1798–1813.
49. Eler D.M., Macanha P.A. and Garcia R.E., Simplified Stress and Simplified Silhouette Coefficient to a Faster Quality Evaluation of Multidimensional Projection Techniques and Feature Spaces, 2015, pp. 133–139.
50. Liu J.W., Zheng J.C. and Chen, A new method of behavior characteristic similarity calculation between children learners based on knowledge graphs and VSM, Journal of Anqing Teachers College 22(2) (2016), 54–59.
51. Voborník, Effective determining of the degree of similarity of selected properties of objects through characteristic text strings, International Journal of Mathematics & Computers in Simulation 10 (2016), 90–99.
52. […], et al., An improved focused crawler based on semantic similarity vector space model, Applied Soft Computing 36 (2015), 392–407.
53. Adji T.B., Abidin and Nugroho H.A., System of negative Indonesian website detection using TF-IDF and Vector Space Model, International Conference on Electrical Engineering and Computer Science IEEE, 2015, pp. 206–210.
54. Alodadi and Janeja V.P., Similarity in Patient Support Forums Using TF-IDF and Cosine Similarity Metrics, International Conference on Healthcare Informatics IEEE, 2015, pp. 521–522.
55. Roul R.K., et al., A novel modified apriori approach for web document clustering, Computer Science 33 (2015), 159–171.
56. Kar, Nunes and Ribeiro, Summarization of changes in dynamic text collections using Latent Dirichlet Allocation model, Information Processing & Management 51(6) (2015), 809–833.
57. Thu H.N.T., Thanh T.D. and Hai T.N., et al., Building Vietnamese topic modeling based on core terms and applying in text classification [C], Fifth International Conference on Communication Systems and Network Technologies, IEEE (2015), 1284–1288.
58. Gao, Chen and Zhu, Streaming Gibbs Sampling for LDA Model, 2016.
59. […], Kontonatsios and Ananiadou, Supporting systematic reviews using LDA-based document representations, Systematic Reviews 4(1) (2015), 1–12.
60. Wen-Bo, Le and Da-Kun, Text Classification Based on Labeled-LDA Model [J], Chinese Journal of Computers 31(4) (2009), 620–627.
61. Tran D.T., Sakurai and Lee J.H., Integration of a topic probability distribution into surgical phase estimation with a hidden Markov model, Industrial Electronics Society, IECON 2015-, Conference of the IEEE IEEE, 2015.
62. Kabir C.A. and Kumar S.A., Discrete Characteristic Probability Distribution Theorem, Scholars Press, 2015.
63. Wang, Fu and Chen, Analyzing Knowledge Structure Research with LDA Model. New Technology of Library & Information Service, 2016.
64. Zhang, et al., UT-LDA Based Similarity Computing in Microblog, IEEE International Conference on Software Quality, Reliability and Security – Companion IEEE, 2015.
65. Zheng and Hong L.I., Texts clustering of K-means based on LDA, Computer & Modernization 1(8) (2013), 78–80.
66. […], Qin and Liu, Open-categorical text classification based on multi-LDA models, Soft Computing 19(1) (2015), 29–38.
67. Zheng, Liu J.L. and Xiang, FAQ Answering System Based on VSM and LDA Model, Computer Technology & Development 24(1) (2014), 133–135.
68. Lin, et al., Intelligent medical guide system based on VSM weight improvement algorithm, Computer Applications & Software 32(9) (2015), 81–83.
69. […], et al., Performance of using LDA for Chinese news text classification, 2015, pp. 1260–1264.
70. Zhou and Xie, The integration technology of sensor network based on web crawler, 2015, pp. 1–7.
71. Dařena and Žižka, Revealing Groups of Semantically Close Textual Documents by Clustering: Problems and Possibilities. Modern Computational Models of Semantic Discovery in Natural Language, 2015.
72. Smith and Agrawal, A Comparison of Patent Classifications with Clustering Analysis. Web Information Systems Engineering – WISE 2015. Springer International Publishing, 2015.
73. Cafieri, Costa and Hansen, Modularity maximization clustering with cohesion conditions, 2015.
74. Ajaykumar, Gupta and Merchant P.S.N., Automated Lane Detection by K-means Clustering: A Machine Learning Approach. Electronic Imaging, 2016.
75. Mary S.A.L., Evaluation of clustering algorithm with cluster validation metrics, European Journal of Scientific Research 69(1) (2012), 61–72.
