K-means based method for overlapping document clustering

Abstract

Overlapping clustering algorithms have been shown to be effective for clustering documents. However, current overlapping document clustering algorithms produce a large number of clusters, which makes them of little use to the user. Therefore, in this paper, we propose a k-means based method for overlapping document clustering, which allows the user to specify the number of groups to be built. Our experiments with different corpora show that our proposal obtains better results in terms of FBcubed than other recent overlapping document clustering methods reported in the literature.

Keywords

Clustering, overlapping clustering, document clustering

1 Introduction

Clustering is the process of creating groups of instances such that, according to a metric, instances inside a group are similar to each other, while they are dissimilar to instances in other groups. In the literature, several clustering algorithms have been developed using different techniques for different domains [8, 46]. Clustering has been widely applied in several areas, such as data analysis [28], data stream classification [30], image analysis [13], image classification [27], marketing [29] and segmentation [42].

In recent years, there has been an increasing interest in document clustering, since it helps to analyze and organize textual data coming from different sources such as the Internet [1], social networks [37], e-mails [25], recommender systems [41] and opinion systems [15]. Several algorithms for document clustering, for example those proposed in [4–6, 47], build non-overlapping clusterings. However, it is common for a document to belong to more than one category (for example, themes or topics). For this reason, some document clustering algorithms designed to produce overlapping clusterings have also been developed [16, 39]. None of these algorithms needs the number of clusters to build as an input parameter. However, they tend to produce a high number of clusters, which makes them of little use for document clustering. Therefore, in this paper, we propose a k-means based method for overlapping document clustering that allows the user to specify the number of groups to be built. Our experiments using different corpora show that our proposal obtains better results in terms of FBcubed than other works on overlapping document clustering reported in the literature.

The rest of the paper is organized as follows: Section 2 describes the related work. Section 3 introduces the proposed method. Section 4 shows our experimental results. Finally our conclusions and future work are provided in Section 5.

2 Related work

Document clustering is a line of research that has been widely studied in the literature [2, 50]. In recent years, since a document can belong to more than one category (theme or topic), clustering algorithms have started to consider the overlapping of themes in documents. In this section, we briefly review the main overlapping document clustering algorithms reported in the literature.

Star [10] is an incremental overlapping clustering algorithm. It works with an undirected graph in which each vertex represents a document and each edge is weighted by the similarity between the two documents it connects (computed with the cosine similarity function). From this similarity graph, a new graph called the β-similarity graph is built by preserving only those edges whose similarity is greater than a certain threshold β. The Star algorithm builds a cover of the β-similarity graph using star-shaped subgraphs, where a star-shaped subgraph is a graph in which one vertex, called the center, is connected by an edge to each of the remaining vertices, called satellites.

Star follows a greedy heuristic for covering the β-similarity graph with star-shaped subgraphs: it selects highly connected vertices of the β-similarity graph as the centers of the star-shaped subgraphs. The centers of the star-shaped subgraphs in the final clustering are stored in a list of centers named X. First, all isolated vertices in the β-similarity graph are included in X, since they are degenerate star-shaped subgraphs. Then, Star works in two steps. In the first step, the degree of each vertex of the β-similarity graph is computed and all non-isolated vertices are sorted into a list of candidate centers L, in descending order of vertex degree. In the second step, Star iteratively takes the vertex with the highest degree in L as the center of a star-shaped subgraph and, together with its adjacent vertices in the β-similarity graph as satellites, builds a cluster. The vertices of the new star-shaped subgraph are removed from L and its center is added to X. This process is repeated until the set of star-shaped subgraphs completely covers the original graph. The clusters defined by the centers in X overlap, since a vertex can be adjacent to more than one center.
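The greedy cover described above can be illustrated with a short Python sketch. It is only a minimal reading of the description in this section, not the implementation of [10]: the function names, the sparse dictionary representation of documents and the tie-breaking by index are assumptions made for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two documents given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def star_clustering(docs, beta):
    """Greedy star cover of the beta-similarity graph; returns {center: members}."""
    n = len(docs)
    # Beta-similarity graph: keep only edges with similarity greater than beta.
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(docs[i], docs[j]) > beta:
                adj[i].add(j)
                adj[j].add(i)
    # Isolated vertices are degenerate stars and go directly into X.
    clusters = {i: {i} for i in range(n) if not adj[i]}
    covered = set(clusters)
    # List L: non-isolated vertices in descending order of degree (ties by index).
    candidates = sorted((i for i in range(n) if adj[i]),
                        key=lambda i: (-len(adj[i]), i))
    for c in candidates:
        if c in covered:  # already removed from L by an earlier star
            continue
        clusters[c] = {c} | adj[c]  # center plus all its satellites
        covered |= clusters[c]
    return clusters
```

On a chain of five documents where only consecutive documents share a term, vertices 1 and 3 become centers and vertex 2 is a satellite of both, which is how the overlap arises.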

The Generalized Star algorithm (GStar), proposed in [36], is a generalization of the Star algorithm. In the first step, when building the list L, the degree of each vertex v is computed considering only those adjacent vertices that have a degree lower than the degree of v. Later, in the second step, GStar proceeds as the Star algorithm does.

Based on GStar, the CStar algorithm was proposed in [3, 34]. Unlike GStar, this algorithm reduces the number of generated clusters. First, the GStar algorithm is executed, and then those subgraphs (clusters) whose vertices are all contained in other subgraphs are eliminated, since they are considered redundant clusters. An algorithm with a similar idea was proposed in [35].

Another extension of the GStar algorithm, called OCDC (Overlapping Clustering based on Density and Compactness), was reported in [31]. Unlike GStar, in the first step OCDC sorts the list of candidate centers L by the average of the density and the compactness of each vertex. The density of a vertex v is computed as the number of adjacent vertices of v with a lower degree than v, while the compactness of v is computed as the number of adjacent vertices whose intra-cluster similarity is smaller than that of v. The intra-cluster similarity of a vertex is the average similarity between this vertex and its adjacent vertices. In the second step, the list X is built in the same way as in Star, but with the vertices ordered by their average density and compactness. Finally, to reduce the number of clusters, a star-shaped subgraph is eliminated if all its vertices are contained in other star-shaped subgraphs. According to the authors, OCDC obtains fewer clusters, with less overlap, than GStar.
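As an illustration, the density and compactness used to rank candidate centers can be computed as in the following Python sketch. It follows the wording of this section literally (raw counts, then their average); the function name `ocdc_scores` and the input format are assumptions, and the original algorithm in [31] may normalize these quantities differently.

```python
def ocdc_scores(adj, sim):
    """Average of density and compactness for every non-isolated vertex.

    adj: {vertex: set of adjacent vertices} of the beta-similarity graph.
    sim: {(u, v): similarity} with u < v for every edge.
    """
    def s(u, v):
        return sim[(u, v) if u < v else (v, u)]

    # Intra-cluster similarity: mean similarity between a vertex and its neighbors.
    intra = {v: sum(s(v, u) for u in nb) / len(nb) for v, nb in adj.items() if nb}
    scores = {}
    for v, nb in adj.items():
        if not nb:
            continue
        density = sum(1 for u in nb if len(adj[u]) < len(adj[v]))
        compactness = sum(1 for u in nb if intra[u] < intra[v])
        scores[v] = (density + compactness) / 2
    return scores
```

Sorting the vertices by this score in descending order gives the order in which candidate centers would be visited.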

In [33] an overlapping clustering algorithm, OClustR (Overlapping Clustering based on Relevance) is presented, which is based on OCDC, but, unlike OCDC, it sorts the list of possible centers L according to the average of the relative density and the relative compactness of a vertex. The relative density of a vertex is computed as the average of the density of its adjacent vertices, and the relative compactness is computed as the average of the compactness of its adjacent vertices. Afterward, it proceeds as the OCDC algorithm.

The DClustR algorithm [32] is a dynamic overlapping clustering algorithm based on OClustR. DClustR builds the clustering similar to OClustR, but, since DClustR is a dynamic algorithm, it allows adding or deleting documents without re-building the clustering from scratch. This is achieved through the detection of the connected components where the documents were added or deleted and then only the star-shaped subgraphs involved in these connected components are updated.

In [16], words that co-occur within a document are considered a concept. The overlapping clustering algorithm proposed by the authors starts by extracting all concepts from a text document collection and representing each document as a graph of concepts. In this graph, the vertices are the concepts in a document and an edge connects a pair of concepts if their significance value is greater than a certain threshold. The significance value is computed as the number of documents where the concepts co-occur divided by the number of documents where they appear alone. Initially, each graph of concepts (document) forms a cluster, which is called a cluster of concepts. Then, pairs of clusters of concepts are joined iteratively: two clusters are joined if they share a certain percentage of concepts. While this process is performed, each cluster can be joined separately with different clusters, creating overlapping clusters. The process stops when no more clusters of concepts can be joined.

Parallel GPU-based implementations for the OClustR and DClustR algorithms were proposed in [39]. These implementations, called CUDA-OClus and CUDA-DClus, improve the efficiency of the sequential algorithms.

None of the overlapping clustering algorithms reviewed above requires the number of clusters to build as input. However, these algorithms tend to produce a high number of clusters, including groups with only one document, which is not useful in most applications. For this reason, in this paper, we propose a k-means based method for overlapping document clustering where the number of groups to build can be provided by the user.

3 Proposed method

Since the method proposed in this paper is based on the weighted overlapping k-means algorithm (WOKM), in Section 3.1 we first describe the WOKM algorithm. WOKM, like k-means, depends on the initialization of the centers; to address this problem, our method applies the k-Harmonic Means algorithm, described in Section 3.2. Finally, in Section 3.3, the proposed method is introduced.

3.1 WOKM algorithm

The weighted overlapping k-means (WOKM) algorithm was proposed in [14]. This algorithm is based on the k-means algorithm, but it builds overlapped clusters by introducing local weighting for each cluster.

Let χ = {x₁, x₂, . . . , x_n} be a set of instances where each instance is described through p features. Let $C = \{C_{1}, C_{2}, \ldots, C_{k}\}$ be the set of overlapping clusters to build and $M = \{m_{1}, m_{2}, \ldots, m_{k}\}$ the centers of these clusters, respectively.

The WOKM algorithm follows the idea of the k-means algorithm but optimizes the following objective function: $φ (\{C_{j}\}_{j = 1}^{k}) = \sum_{x_{i} \in χ} \sum_{v = 1}^{p} γ_{i, v}^{δ} {| x_{i, v} - φ_{v} (x_{i}) |}^{2}$ (1)

Since $C$ is a coverage of χ, each instance x_i belongs to at least one cluster. In Equation (1), φ_v (x_i) denotes the “image” of x_i in terms of feature v, defined by the combination of the prototypes ( $m_{c} \in M$ ) of the clusters to which x_i belongs. It is computed as follows: $φ_{v} (x_{i}) = \frac{\sum_{m_{j} \in A_{i}} λ_{j, v}^{δ} m_{j, v}}{\sum_{m_{j} \in A_{i}} λ_{j, v}^{δ}},$

where A_i = {m_j | x_i ∈ C_j}. In other words, φ_v (x_i) is a weighted average of the centroids of the clusters to which x_i belongs.

In (1) γ_i,v is computed as follows: $γ_{i, v} = \frac{\sum_{m_{j} \in A_{i}} λ_{j, v}}{| A_{i} |}$ (2) where λ_j,v is the weight associated to the feature v in the cluster j, such that $\forall j, \sum_{v = 1}^{p} λ_{j, v} = 1$ .

The parameter δ > 1 that appears in (1) is a parameter for regulating the influence of the local weighting.

Given k centers chosen from χ or randomly generated, and the initial feature weights λ_j,v = 1/p for v = 1, . . . , p and j = 1, . . . , k, the objective function (1) is optimized by iteratively performing the following three steps:

Multi-assignment: Each instance x_i is assigned to one or more clusters in such a way that the objective function is minimized. To this end, a new set A_i is computed with the following heuristic:

Given a set of overlapping clusters $C = \{C_{1}, C_{2}, \ldots, C_{k}\}$ , their respective centers $M = \{m_{1}, m_{2}, \ldots, m_{k}\}$ , and an instance x_i to be assigned to one or more clusters of $C$ , at the beginning A_i = ∅. Then the centers of the clusters are evaluated iteratively, from the nearest to the farthest from x_i. First, x_i is assigned to the cluster C_j with the nearest center m_j, i.e., m_j is included in A_i. Then the multi-assignment of x_i to both the cluster of the nearest center and the cluster of the second nearest center is evaluated: x_i is assigned to the cluster C_l of the second nearest center m_l, i.e., m_l is included in A_i, only if the value of Equation (1) improves. While the value of Equation (1) improves, this process is repeated, assigning x_i each time to one more cluster (the next nearest in the order); the process stops the first time this condition is not fulfilled. At the end, A_i contains the multi-assignment for x_i.

Cluster centers updating: When a new center $m_{j}^{*}$ is obtained for the cluster C_j, those instances x_i that belong to more clusters have less impact on the position of the new centroid, as expressed in the following equation: $m_{j}^{*} = \frac{1}{\sum_{x_{i} \in C_{j}} α_{i}} \sum_{x_{i} \in C_{j}} α_{i} \left(| A_{i} | x_{i} - \sum_{m_{c} \in A_{i} \setminus \{m_{j}\}} m_{c}\right)$

where α_i = 1/|A_i|².

Weights updating: Once the clusters and centroids have been computed, the local weights are updated as follows: $λ_{j, v} = \frac{\left(\sum_{x_{i} \in C_{j},\ | A_{i} | = 1} (x_{i, v} - m_{j, v})^{2}\right)^{1 / (1 - δ)}}{\sum_{u = 1}^{p} \left(\sum_{x_{i} \in C_{j},\ | A_{i} | = 1} (x_{i, u} - m_{j, u})^{2}\right)^{1 / (1 - δ)}}$ where the inner sums run over the instances that belong only to the cluster C_j (|A_i| = 1).

These steps are repeated until the clusters do not change or a maximum number of iterations is reached. At the end, the WOKM algorithm obtains k overlapping clusters.
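The three steps above can be put together in a compact Python sketch. This is a hedged illustration of the description in this section, not the implementation of [14]: all function names are assumptions, instances are plain lists of numbers, centers are ranked by their single-assignment cost, centers are updated one after another, and a small constant `eps` is added to the dispersions to avoid raising zero to a negative power.

```python
def instance_cost(x, a, centers, weights, delta):
    """Term of Eq. (1) for one instance x, given the indices a of the clusters in A_i."""
    cost = 0.0
    for v in range(len(x)):
        lam = [weights[j][v] for j in a]
        # Image phi_v(x): weighted average of the assigned centers on feature v.
        phi = (sum(l ** delta * centers[j][v] for l, j in zip(lam, a))
               / sum(l ** delta for l in lam))
        gamma = sum(lam) / len(a)  # Eq. (2)
        cost += gamma ** delta * (x[v] - phi) ** 2
    return cost


def multi_assign(x, centers, weights, delta):
    """Add clusters from nearest to farthest while the cost of x keeps decreasing."""
    order = sorted(range(len(centers)),
                   key=lambda j: instance_cost(x, [j], centers, weights, delta))
    a = [order[0]]
    best = instance_cost(x, a, centers, weights, delta)
    for j in order[1:]:
        cost = instance_cost(x, a + [j], centers, weights, delta)
        if cost >= best:  # first non-improving center stops the process
            break
        a.append(j)
        best = cost
    return sorted(a)


def update_center(j, X, assign, centers):
    """Center update with alpha_i = 1 / |A_i|^2; keeps the old center if C_j is empty."""
    p = len(centers[j])
    num, den = [0.0] * p, 0.0
    for x, a in zip(X, assign):
        if j not in a:
            continue
        alpha = 1.0 / len(a) ** 2
        den += alpha
        for v in range(p):
            others = sum(centers[c][v] for c in a if c != j)
            num[v] += alpha * (len(a) * x[v] - others)
    return [s / den for s in num] if den else list(centers[j])


def update_weights(j, X, assign, centers, delta, eps=1e-12):
    """Local weights from the per-feature dispersion of instances assigned only to C_j."""
    p = len(centers[j])
    disp = [eps] * p  # eps is an assumption of this sketch
    for x, a in zip(X, assign):
        if a == [j]:
            for v in range(p):
                disp[v] += (x[v] - centers[j][v]) ** 2
    powered = [d ** (1.0 / (1.0 - delta)) for d in disp]
    total = sum(powered)
    return [w / total for w in powered]


def wokm(X, centers, delta=2.0, max_iter=50):
    """Iterate the three steps until the assignments stop changing."""
    k, p = len(centers), len(centers[0])
    centers = [list(m) for m in centers]
    weights = [[1.0 / p] * p for _ in range(k)]  # initial weights 1/p
    assign = None
    for _ in range(max_iter):
        new_assign = [multi_assign(x, centers, weights, delta) for x in X]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(k):
            centers[j] = update_center(j, X, assign, centers)
        weights = [update_weights(j, X, assign, centers, delta) for j in range(k)]
    return assign, centers, weights
```

For two well-separated groups of points plus one point halfway between them, this sketch assigns the halfway point to both clusters and every other point to a single cluster.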

3.2 k-Harmonic Means

The k-Harmonic Means (KHM) algorithm was proposed in [48, 49]. It is a k-means based clustering algorithm that minimizes the harmonic average of the distances from all the instances to the centers of all the clusters. The harmonic average supplies a weight to each instance based on its proximity to each center.

Let $M = \{m_{1}, m_{2}, \ldots, m_{k}\}$ be the centers of the clusters. The objective function optimized in KHM is: $φ (\{m_{l}\}_{l = 1}^{k}) = \sum_{i = 1}^{n} \frac{k}{\sum_{j = 1}^{k} \frac{1}{∥ x_{i} - m_{j} ∥^{t}}},$ (3) where t is a parameter (t ≥ 2) and each summand $\frac{k}{\sum_{j = 1}^{k} \frac{1}{∥ x_{i} - m_{j} ∥^{t}}}$ is the harmonic mean of the values $∥ x_{i} - m_{j} ∥^{t}$ , j = 1, . . . , k.

The centers are updated using the equation: $m_{j} = \frac{\sum_{i = 1}^{n} z (m_{j} | x_{i}) w (x_{i}) x_{i}}{\sum_{i = 1}^{n} z (m_{j} | x_{i}) w (x_{i})},$ (4) where z (m_j|x_i) is the membership function of x_i to the cluster with center m_j, which is computed as: $z (m_{j} | x_{i}) = \frac{∥ x_{i} - m_{j} ∥^{- t - 2}}{\sum_{l = 1}^{k} ∥ x_{i} - m_{l} ∥^{- t - 2}},$

and w (x_i) is the weight associated to each x_i, which is computed as: $w (x_{i}) = \frac{\sum_{j = 1}^{k} ∥ x_{i} - m_{j} ∥^{- t - 2}}{(\sum_{j = 1}^{k} ∥ x_{i} - m_{j} ∥^{- t})^{2}}$

The KHM algorithm begins with k centers chosen randomly. Each instance is associated with the cluster with the nearest center; next, KHM updates the centers using Equation (4), and the process is repeated. The KHM algorithm stops when the clusters do not change or a maximum number of iterations is reached.
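The center update of Equation (4) can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function names are assumptions, the exponent of the distance is the parameter t of Equation (3), distances are clipped at a small `eps` (an assumption) so that an instance lying exactly on a center does not cause a division by zero, and convergence is detected from the shift of the centers.

```python
from math import dist


def khm_step(X, centers, t=2.0, eps=1e-9):
    """One KHM update of all centers following Eq. (4)."""
    k, p = len(centers), len(centers[0])
    num = [[0.0] * p for _ in range(k)]
    den = [0.0] * k
    for x in X:
        d = [max(dist(x, m), eps) for m in centers]
        a = [dj ** (-t - 2) for dj in d]
        b = [dj ** (-t) for dj in d]
        sa, sb = sum(a), sum(b)
        w = sa / sb ** 2  # weight w(x_i)
        for j in range(k):
            z = a[j] / sa  # membership z(m_j | x_i)
            for v in range(p):
                num[j][v] += z * w * x[v]
            den[j] += z * w
    return [[num[j][v] / den[j] for v in range(p)] for j in range(k)]


def khm(X, centers, t=2.0, max_iter=100, tol=1e-9):
    """Repeat the update until the centers move less than tol."""
    for _ in range(max_iter):
        new = khm_step(X, centers, t)
        shift = max(dist(a, b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:
            break
    return centers
```

With two well-separated groups of points, the returned centers lie close to the centroid of each group, since distant instances receive a negligible membership.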

3.3 Our proposal

As we have discussed in section 2, most overlapping document clustering algorithms reported in the literature have the drawback that they cannot be provided with the number of clusters to build, and commonly they tend to produce too many clusters, some of them containing just a single document. Therefore, in this section, we propose a method, which follows the k-means approach, for overlapping document clustering where the number of clusters can be specified a priori by the user. In Fig. 1, the workflow of our proposed method, called Harmonic Weighted Overlapping k-means (HWOKM), is shown.

Fig. 1. Harmonic weighted overlapping k-means.

Let $D = \{d_{1}, d_{2}, \ldots, d_{n}\}$ be a set of documents where each document is described through p terms (features). Let $C = \{C_{1}, C_{2}, \ldots, C_{k}\}$ be the set of overlapping clusters to build. Our method follows the k-means approach but optimizes the objective function: $φ (\{C_{j}\}_{j = 1}^{k}) = \sum_{d_{i} \in D} \sum_{v = 1}^{p} γ_{i, v}^{δ} {| d_{i, v} - φ_{v} (d_{i}) |}^{2}$ (5) where $γ_{i, v}$ is computed as in Equation (2).

In the literature, the k-means algorithm has been extended for producing overlapping clusters, specifically we can find OKM [21], WOKM and OKMED [14]. In [9] an experimental comparison among the algorithms Overlapping k-means (OKM), Overlapping k-medoids (OKMED) and Weighted Overlapping k-means (WOKM) was performed. From this study, the authors concluded that the WOKM algorithm, described in section 3.1, performs better in terms of clustering quality than the OKM and OKMED algorithms.

On the other hand, it is well known that k-means is sensitive to the initialization of the centers, and this problem is inherited by the WOKM algorithm. To reduce this problem, our method takes advantage of the ideas of [49], where the authors show that using the harmonic mean in the KHM algorithm yields good centers for initializing k-means [20, 23].

Thus, we propose using KHM to initialize the centers for WOKM, under the hypothesis that good centers for building good disjoint clusters are also suitable for building good overlapping clusters. Our proposed method first applies KHM to the input corpus. As a result, KHM produces a set of k disjoint clusters. The centers of the clusters built by KHM are then used as initial centers for WOKM, which uses these centers together with the original corpus to produce a set of overlapping clusters.

A pseudocode of our proposed method is shown in Algorithm 1. From this pseudocode, it can be noticed that the complexity of the k-means algorithm is inherited by HWOKM.

4 Experiments

In this section, in order to evaluate the efficacy of our proposal, we compare our proposed method against the overlapping k-means (OKM) algorithm [21], the weighted overlapping k-means (WOKM) algorithm [14], and the harmonic overlapping k-means (HOKM) algorithm [24]. The comparison was done in this way because our proposed method is based on WOKM, and thus a comparison against WOKM and its predecessor, the OKM algorithm, is mandatory. Additionally, we include in our comparison the OClustR algorithm, which is one of the most recent overlapping document clustering algorithms and which, according to the results reported in [33], creates fewer clusters than other state-of-the-art overlapping document clustering algorithms. We did not include the algorithm proposed in [16] in our comparison, since this algorithm was designed only for short texts (titles of papers).

4.1 Experimental setup

The datasets used in our experiments were taken from the repository of Mulan, an open-source Java library for learning from multi-label datasets. For our experiments, we chose twelve document datasets. These corpora come from different domains, such as web pages, laws, medicine, bibliographic information and e-mails, and contain from 978 to 87856 documents. The number of terms varies from 500 to 37187 and the number of classes varies between 26 and 983. A summary of the number of documents, terms and classes of the selected datasets appears in Table 1.

Table 1
Summary of the document datasets used in our experiments

Corpus	Documents	Terms	Classes
Arts	7485	23146	26
Bibtex	7395	1836	159
Bookmarks	87856	2150	208
Business	5505	21924	30
Delicious	16105	500	983
Education	6030	27534	33
Enron	1702	1001	53
EUR-Lex (DC)³	17407	5000	412
EUR-Lex (SM)⁴	19348	5000	201
Health	4558	30605	32
Medical	978	1449	45
Science	6428	37187	40

³Directory codes. ⁴Subject matters.

The clustering results were assessed using the FBcubed measure [7], which is an external measure specially designed to evaluate overlapping clustering algorithms. For OKM and WOKM algorithms, the same random initialization was used. For HOKM and HWOKM methods, the k-centers obtained by KHM (using the same initialization used for OKM and WOKM) were used as initial centers.
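For reference, the following Python sketch shows how an FBcubed value can be computed for overlapping clusterings. It follows our reading of the extended BCubed precision and recall defined in [7], in which the score of a pair of items is bounded by the number of clusters and classes they share; the function names and the dictionary-based input format are assumptions made for the example, and every item is assumed to belong to at least one cluster and one class.

```python
def _bcubed_side(a_sets, b_sets):
    """Average, over items, of the pairwise scores min(|shared a|, |shared b|) / |shared a|."""
    items = list(a_sets)
    total = 0.0
    for e in items:
        vals = []
        for o in items:
            shared_a = len(a_sets[e] & a_sets[o])
            if shared_a:  # only pairs sharing at least one group on this side
                shared_b = len(b_sets[e] & b_sets[o])
                vals.append(min(shared_a, shared_b) / shared_a)
        total += sum(vals) / len(vals) if vals else 0.0
    return total / len(items)


def fbcubed(clusters, classes):
    """clusters, classes: {item: set of cluster ids} and {item: set of class labels}."""
    precision = _bcubed_side(clusters, classes)
    recall = _bcubed_side(classes, clusters)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, placing three documents in a single cluster when two of them share one class and the third has a different class gives a recall of 1 and a precision of 5/9, hence an FBcubed of 5/7.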

To verify that our results are statistically significant, the Wilcoxon signed rank test [45] was applied. This test is a non-parametric test for the comparison of two small samples. The samples were obtained through ten executions of each algorithm with different initial centers. The Wilcoxon test was computed using the Keel (Knowledge Extraction based on Evolutionary Learning) software [45], which is an open source Java software tool for different knowledge data discovery tasks.
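The statistic behind this test can be sketched in a few lines of Python. The sketch below is only illustrative and is not the KEEL implementation: the function name is an assumption, zero differences are discarded, tied absolute differences receive average ranks, and the two-sided p-value is computed exactly by enumerating all sign assignments, which is feasible for samples of ten paired executions.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Return (W+, W-, exact two-sided p-value) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    total = sum(ranks)
    # Exact null distribution: every difference is positive or negative with equal chance.
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w:
            count += 1
    return w_plus, w_minus, count / 2 ** len(ranks)
```

If one method is better in all ten paired executions, the exact two-sided p-value is 2/1024, which is below 0.05.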

The implementations of OKM, WOKM and OClustR were provided by the authors of [9, 33], and the implementation of KHM was taken from the Speech Processing Toolbox for MATLAB. The experiments were performed on a server with an Intel Xeon E5540 2.67 GHz processor with 4 physical cores and 12 GB of RAM, running the Ubuntu distribution of the Linux operating system.

4.2 Experimental results

In this section, we show the results obtained in our experiments with the document datasets shown in Table 1. The columns in Table 2 correspond to the clustering algorithms evaluated in our comparison, while the rows correspond to the datasets. Each entry in Table 2 is the average FBcubed over the ten executions of the algorithm in the respective column on the dataset in the corresponding row. For each dataset (row), we highlight in bold the highest FBcubed value, which corresponds to the algorithm (column) that obtained the best result.

Table 2
FBcubed results of the evaluated overlapping document clustering algorithms over the datasets of Table 1

Dataset	OKM	WOKM	HOKM	HWOKM	OClustR	Significant
Arts	0.2541	0.2442	0.3188	0.3397	0.0307	+
Bibtex	0.1639	0.1640	0.1894	0.2052	0.1671	+
Bookmarks	0.2134	0.2131	0.1713	0.2331	0.1812	+
Business	0.6599	0.6532	0.4069	0.4677	0.0087	Equal
Delicious	0.0666	0.0686	0.5259	0.5294	0.0309	+
Education	0.2022	0.2023	0.2308	0.2428	0.0288	+
Enron	0.5658	0.5704	0.5886	0.6005	0.1018	Equal
EUR-Lex (DC)	0.2099	0.2064	0.2412	0.2894	0.0351	+
EUR-Lex (SM)	0.2613	0.2708	0.3064	0.3798	0.0231	+
Health	0.4149	0.3389	0.5037	0.5057	0.0228	+
Medical	0.3535	0.3588	0.2878	0.3861	0.5265	+
Science	0.1899	0.1670	0.1986	0.2058	0.0257	+

The results reported in Table 2 were obtained by fixing k as the number of classes of the respective corpus, which appears in the last column of Table 1. The value of the parameter δ in WOKM was fixed as δ = 2, as suggested by [9]. Additionally, the OClustR algorithm requires the value of β as a parameter for building the β-similarity graph. To determine this value, we tested values between 0.1 and 0.5 (as suggested in [33]) with increments of 0.1, and we chose β = 0.4, since this value gave the best FBcubed results.

From the results in Table 2, we can see that in general, our proposal gets better results than the other algorithms. The column of HWOKM has most of its values in bold.

Another important point that deserves discussion is that our method, by using KHM to determine the initial centers for WOKM, assumes that good centers for building good disjoint clusters are also suitable for building good overlapping clusters. From the results in Table 2, we can conclude that in general this assumption holds: if we compare WOKM and HWOKM, we can see that when KHM is used to determine the initial centers of WOKM (see column HWOKM), it gets better results than using random initial centers (see column WOKM). The same happens if we compare OKM and HOKM. In both cases, there are a couple of datasets where the use of KHM does not yield the best results. Explaining the reasons for this requires a deeper study, which is out of the scope of this paper.

On the other hand, since OClustR determines the number of clusters by itself, Table 3 shows the number of clusters it generated for each dataset. As can be seen in this table, the number of clusters built by OClustR is much larger than the number of real classes in each corpus (see the last column of Table 1).

Table 3

Clusters obtained by the OClustR algorithm

Corpus	Clusters
Arts	3884
Bibtex	2957
Bookmarks	24429
Business	3536
Delicious	1086
Education	3141
Enron	497
EUR-Lex (DC)	1306
EUR-Lex (SM)	1297
Health	2489
Medical	88
Science	2091

Additionally, we performed a statistical comparison between the results obtained by our proposed method and those of the second best algorithm, WOKM, through the Wilcoxon signed rank test at a confidence level of 0.95 (i.e., a significance level of α = 0.05). This test shows that in most cases the advantage of HWOKM over WOKM is statistically significant; see the last column of Table 2, where “+” means that HWOKM was statistically better than WOKM, and “Equal” means that there was no statistically significant difference.

5 Conclusions

In this paper, a new method for overlapping document clustering was introduced. It is based on an overlapping clustering algorithm (WOKM) that can be provided with the number of clusters, and it uses the centers produced by the KHM algorithm as a good initialization for building good overlapping clusters. Experiments were performed on several document datasets with overlap, comparing the results of our proposal against different overlapping document clustering algorithms proposed in the literature. Based on our results, we can conclude that the proposed algorithm, HWOKM, obtains better overlapping clustering results, in terms of the FBcubed measure, than the other evaluated document clustering algorithms in most of the datasets.

As future work, we will develop a fast method for deterministically determining the initial centers for the WOKM algorithm.

Acknowledgments

The first author gratefully acknowledges the National Council of Science and Technology of Mexico (CONACyT) for her Ph.D. fellowship, through scholarship 481750.

References

1. Abualigah, Khader A.T. and Hanandeh, A Novel Weighting Scheme Applied to Improve the Text Document Clustering Techniques (2018), 305–320.

2. Allahyari, Pouriyeh S.A., Assefi, Safaei, Trippe E.D., Gutierrez J.B. and Kochut, A brief survey of text mining: Classification, clustering and extraction techniques, CoRR, abs/1707.02919 (2017).

3. Alonso A.G., Suárez A.P. and Medina-Pagola J.E., ACONS: A new algorithm for clustering documents, In Progress in Pattern Recognition, Image Analysis and Applications, 12th Iberoamerican Congress on Pattern Recognition, CIARP 2007, Valparaiso, Chile, November 13-16, 2007, Proceedings (2007), pages 664–673.

4. Amador, García, Lío D.G. and Guevara D.M., Semclustdml: algoritmo para agrupar artículos científicos basado en la información brindada por las referencias bibliográficas, Revista Cubana de Ciencias Informáticas 11(2) (2017).

5. Penichet L.A., Guevara D.M. and Lorenzo M.M.G., New Similarity Function for Scientific Articles Clustering based on the Bibliographic References, Computación y Sistemas 22 (2018), 93–102.

6. Amato and Savino, Approximate Similarity Search in Metric Spaces Using Inverted Files, In Proceedings of the 3rd International Conference on Scalable Information Systems, pages 28:1–28:10, ICST, Brussels, Belgium (2008). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering).

7. Amigó, Gonzalo, Artiles and Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval 12(4) (2009), 461–486.

8. Amini, Wah T.Y. and Saboohi, On density-based data streams clustering algorithms: A survey, Journal of Computer Science and Technology 29(1) (2014), 116–141.

9. Aroche-Villarruel A.A., Carrasco-Ochoa J.A., Martínez-Trinidad J.F., Arturo Olvera-López and Pérez-Suárez, Study of overlapping clustering algorithms based on kmeans through fbcubed metric, In J.F. Martínez-Trinidad, J.A. Carrasco-Ochoa, J.A. Olvera-Lopez, J. Salas-Rodríguez and C.Y. Suen, editors, Pattern Recognition, pages 112–121, Cham (2014). Springer International Publishing.

10. Aslam, Pelekhov and Rus, Static and dynamic information organization with star clusters, In Proceedings of the Seventh International Conference on Information and Knowledge Management, CIKM ’98, pages 208–217, New York, NY, USA (1998). ACM.

11. Baadel, Thabtah and Lu, Overlapping clustering: A review, In 2016 SAI Computing Conference (SAI) (2016), pp. 233–237.

12. Ceselli, Colombo and Cordone, Balanced compact clustering for efficient range queries in metric spaces, Discrete Applied Mathematics 169 (2014), 43–67.

13. Chouhan S.S., Kaul and Singh U.P., Soft computing approaches for image segmentation: a survey, Multimedia Tools and Applications 77(21) (2018), 28483–28537.

14. Cleuziou, Two variants of the okm for overlapping clustering, In F. Guillet, G. Ritschard, D.A. Zighed and H. Briand, editors, EGC (best of volume), volume 292 of Studies in Computational Intelligence, pages 149–166. Springer (2010).

15. Coavoux, Elsahar and Gallé, Unsupervised aspect-based multi-document abstractive summarization, In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 42–47, Hong Kong, China, November (2019). Association for Computational Linguistics.

16. Dey, Ranjan, Verma and Naskar, A semantic overlapping clustering algorithm for analyzing short-texts, In V. Flores, F. Gomide, A. Janusz, C. Meneses, D. Miao, G. Peters, D. Slezak, G. Wang, R. Weber and Y. Yao, editors, Rough Sets, pages 470–479, Cham (2016). Springer International Publishing.

17. Edla D.R., Tripathi, Kuppili and Cheruku, Survey on clustering techniques, In 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pages 696–703, April (2018).

18. Fahad, Alshatri, Tari, Alamri, Khalil, Zomaya A.Y., Foufou and Bouras, A survey of clustering algorithms for big data: Taxonomy and empirical analysis, IEEE Transactions on Emerging Topics in Computing 2(3) (2014), 267–279.

19. Gilpin and Davidson, A flexible ILP formulation for hierarchical clustering, Artificial Intelligence 244(C) (2017), 95–109.

20. Goel, Gupta and Yadav R.K., Improved Kharmonic means in wireless sensor networks, In 2017 IEEE 15th Student Conference on Research and Development (SCOReD) (2017), pp. 275–279.

21.

Guillaume

, An extended version of the k-means method for overlapping clustering. 2008 19th International Conference on Pattern Recognition, pages 1–4, (2008).

22. Hassan M.T., Karim, Kim J.-B. and Jeon, CDIM: document clustering by discrimination information maximization, Inf Sci 316 (2015), 87–106.

23. Khan, Ni, Fan and Shi, An improved K-means clustering algorithm based on an adaptive initial parameter estimation procedure for image segmentation, International Journal of Innovative Computing, Information and Control 13 (2017), 1509–1526.

24. Khanmohammadi, Adibeig and Shanehbandy, An Improved Overlapping K-means Clustering Method for Medical Applications, Expert Syst Appl 67(C) (2017), 12–18.

25. Lekha, Maheshwaran, Tharani, Ram P.K., Surya M.K. and Manikandan, Efficient detection of spam messages using OBF and CBF blocking techniques, In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), pages 1175–1179, (2019).

26. Mei, Wang, Chen and Miao, Large scale document categorization with fuzzy clustering, IEEE Transactions on Fuzzy Systems 25(5) (2017), 1239–1251.

27. Memon, Ali and Pirzada, A novel technique for region-based features similarity for content-based image retrieval, Mehran University Research Journal of Engineering & Technology 37 (2017), 11.

28. Metsalu and Vilo, Clustvis: a web tool for visualizing clustering of multivariate data using principal component analysis and heatmap, Nucleic Acids Research 43(W1) (2015), W566–W570.

29. Mitik, Korkmaz, Karagoz, Toroslu I.H. and Yucel, Data mining approach for direct marketing of banking products with profit/cost analysis, The Review of Socionetwork Strategies 11(1) (2017), 17–31.

30. Nguyen H.-L., Woon Y.-K. and Ng W.-K., A survey on data stream clustering and classification, Knowledge and Information Systems 45(3) (2015), 535–569.

31. Pérez-Suárez, Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Medina-Pagola J.E., A New Overlapping Clustering Algorithm Based on Graph Theory, In Advances in Artificial Intelligence, pages 61–72, Berlin, Heidelberg, (2013). Springer Berlin Heidelberg.

32. Pérez-Suárez, Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Medina-Pagola J.E., An algorithm based on density and compactness for dynamic overlapping clustering, Pattern Recognition 46(11) (2013), 3040–3055.

33. Pérez-Suárez, Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Medina-Pagola J.E., OClustR: A new graph-based algorithm for overlapping clustering, Neurocomputing 121 (2013), 234–247.

34. Pérez-Suárez, Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Medina-Pagola J.E., A new graph-based algorithm for clustering documents, In Workshops Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15–19, 2008, Pisa, Italy, pages 710–719, (2008).

35. Pérez-Suárez, Martínez-Trinidad J.F., Carrasco-Ochoa J.A. and Medina-Pagola J.E., A New Incremental Algorithm for Overlapped Clustering, In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15–18, 2009, Proceedings, pages 497–504, (2009).

36. Pérez-Suárez and Medina-Pagola J.E., A clustering algorithm based on generalized Stars, In Petra Perner, editor, Machine Learning and Data Mining in Pattern Recognition, pages 248–262, Berlin, Heidelberg, (2007). Springer Berlin Heidelberg.

37. Salloum, Al-Emran and Shaalan, Mining text in news channels: A case study from Facebook, International Journal of Information Technology and Language Studies (IJITLS) 1 (2018), 1–9.

38. Shah and Mahajan, Document clustering: A detailed review, International Journal of Applied Information Systems 4(5) (2012), 30–38. Published by Foundation of Computer Science, New York, USA.

39. González Soler L.J., Suárez A.P. and Fernández-Jambrina, Static and incremental overlapping clustering algorithms for large collections processing in GPU, Informatica (Slovenia) 42(2) (2018).

40. Tan, Yang and He, Feature co-shrinking for co-clustering, Pattern Recognition 77 (2018), 12–19.

41. Thanh N.D., Ali and Son L.H., A novel clustering algorithm in a neutrosophic recommender system for medical diagnosis, Cognitive Computation 9(4) (2017), 526–544.

42. Tkaczynski, Segmentation Using Two-Step Cluster Analysis, Springer Singapore, Singapore, (2017).

43. Triguero, González, Moyano, García, Alcala-Fdez, Luengo, Fernández, Del Jesus M.J., Sanchez and Herrera, Keel 3.0: An open source software for multi-stage analysis in data mining, International Journal of Computational Intelligence Systems 10 (2017), 1238–1249.

44. Tsoumakas, Spyromitros-Xioufis, Vilcek and Vlahavas, Mulan: A Java Library for Multi-Label Learning, Journal of Machine Learning Research 12 (2011), 2411–2414.

45. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1(6) (1945), 80–83.

46. Xu and Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2(2) (2015), 165–193.

47. Yung-Shen, Jung-Yi and Shie-Jue, A Similarity Measure for Text Classification and Clustering, IEEE Transactions on Knowledge and Data Engineering 26(7) (2014), 1575–1590.

48. Zhang, Hsu and Dayal, K-Harmonic Means - A Spatial Clustering Algorithm with Boosting, In Proceedings of the First International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining-Revised Papers, TSDM ’00, pages 31–45, London, UK, (2000). Springer-Verlag.

49. Zhang, Hsu, Dayal and Data, K-Harmonic Means - A Data Clustering Algorithm, Hewlett Packard Research Laboratory Technical Report 12 (1999).

50. Zhu, Zhang and Shi, Application of algorithm CARDBK in document clustering, Wuhan University Journal of Natural Sciences 23 (2018), 514–524.