Abstract
Graph-based clustering performs efficiently for identifying clusters in local and nonlinear data Patterns. The existing methods face the problem of parameter selection, such as the setting of
Keywords
Introduction
As an important branch of clustering technique, graph-based clustering algorithms have received great attention in the area of data mining and pattern recognition [1, 2, 3]. The technology first represents data as a graph, and then uses the connectivity information in the graph to find clusters in the data. Compared with prototype based clustering algorithms, such as k-means, the main advantage of graph clustering algorithm is that it does not need to set the number of clusters in advance. Therefore, graph-based clustering algorithm has made great progress in the past decades.
At present, there are mainly three types of constructing graph: the fully connected graph,
Although the above graphs can effectively represent the data and mine nonlinear patterns, noise seriously affects the clustering results. Therefore, the method named CutESC [13] use a threshold of edges to conduct edge-cutting for excluding noises based on a k-nearest neighbor graph. CHKNN sets another parameter to control the size of isolated points and uses the relationship between it and the number of mutual nearest neighbors of each point to detect noises. Obviously, the performance of CHKNN relies on appropriate parameters. The algorithm OPS [14] applies a reconstruct method based on k-nearest neighbor graph to conduct node cutting. The LASSO regularization model with optimization is introduced for feature selection. CutPC assumes that the density of each point is different. It detects noise based on neighborhood density, so it is seriously limited by density. The common feature of these noise detection methods is the need for parameters.
In all, the main feature of existing graph based clustering algorithms is the need for parameters in graph construction or noise detection. The choice of parameters will inevitably affect the performance of clustering algorithm. In this study, we propose a non-parametric clustering algorithm with adaptive noise detecting. The advantage of this method is that it does not need prior knowledge to set parameters both in graph construction and noise detection. Therefore, the proposed algorithm overcomes the problem of parameter dependence of current graph-based clustering algorithms. A weighted natural neighbor graph is developed to represent the original data set and effective graph characteristics information is extracted from this graph to identify noises. Neighborhood information is comprehensively used to improve the noise detecting adaptability and robustness. The proposed method makes three main contributions to the current state, which are summarized as follows:
A non-parametric noise identification way is constructed. We take adequate utilization of structural connectivity information, density information, and directional diversity information to improve noise detection performance. A clustering method termed NonPC is proposed, which realizes parameter-free graph-based clustering. The proposed method does not need parameter presetting. The optimal clustering results can be generated through the algorithm process. We provide a persuasive performance evaluation to demonstrate the validity and advancement and noise robustness of the proposed method using synthetic and UCI real-world datasets. The promising results certify that our method is superior to the alternative nonlinear clustering methods.
A
.
(
where
.
(Reverse Neighbors). The reverse neighbors of object
.
(Natural Neighbors (NaN)). For each object
.
(Natural Neighbors Graph (NaNG)). NaNG is a undirected graph which is denoted by
From Eq. (3), it is not difficult to find that the NaNG is an undirected graph without weight. However, from Figs 1b and 2b, we can find that there is other abundant and useful information between samples, such as direction. NaNG is unable to represent such sufficient information. Hence, it is not beneficial to identify noises.
The clustering process of NonPC. In the first stage (a–b), we construct a weighted natural neighbor graph to represent the original data patterns. In the schematic of Non-parametric node detecting (b–d), we conduct an unsupervised method on attributes to detect the noises. The red dots represent the detected noises. In the last stage (d–f), we perform clustering and assign the noise and outer points to their nearest cluster.
Method overview
As shown in Fig. 1, the proposed clustering method broadly carries out three steps: (1) Construct a weighted natural neighbor graph to represent the original data and reveal the structural information among data objects. In such a graph representation model, the objects are expressed by nodes, the similarity is expressed by distance weight, and the connections with neighbors are expressed by edges. (2) Non-parametric noise detecting. We pick up five attributes from the constructed graph for observations, including the number of bi-neighbors and revise neighbors, the density of neighbors and revise neighbors, and the directional diversity of each object. These attributes are used to divide the data into clean data and noise. This process is non-parametric and adaptive. (3) Clustering and assigning the noises to their nearest cluster respectively. The clusters will be found by searching the strongly connected components in the weighted natural neighborhood graph of clean data. The number of components is the optimal number of clusters. The removed noisy points will be assigned to the nearest cluster with the most neighbors belongs to.
Constructing the weighted natural neighbor graph
Given a data set
.
(weighted Natural Neighbors Graph (wNaNG)). wNaNG is a directed graph which is denoted by
where
Obviously, the constructed representation graph does not need to set any parameters. What is more, the graph ensures each sample point has the same number of neighbors, which reduces the influence of density difference on cluster partition. Importantly, wNaNG is a weighted directed graph. The information contained is profit to noise identification.
Noise detection is completed in a non-parametric way. We use the connectivity difference between observations in clusters and out of clusters to complete noise detection. Five attributes are extracted from the wNaNG, including the number of bi-connected pairs and revise neighbors, the density of the neighborhood and the revise neighborhood and the directional diversity of each object. These attributes are able to exhibit the distribution differences of data and noise, which can be used to identify the noises hidden in the data set.
.
(Bi-connected pairs). A bi-connected pair represents two points that are natural neighbors of each other, which is defined as follows:
.
(Bi-number). The Bi-number is the number of bi-connected pairs of
The Bi-number of a point can discover the degree of connection between the object and other points. Therefore, the noisy points with small Bi-number have a sparse connection relationship with their neighbors, and the points with large Bi-number have a tight connection with their neighbors. Consequently, a point that is tightly connected with its neighbors implies that the point is very close to its neighbors and their similarity is high.
.
(Reverse connectivity). The reverse connectivity of
Reverse connectivity reveals the connected density of the region where
.
(Neighborhood density). The neighborhood density of
where
.
(Reverse neighborhood density). The reverse neighborhood density of
where
The density of the neighborhood and reverse neighborhood are conducive to demonstrating the location of a point. A point with low neighborhood density and high reverse neighbor density indicates that the point is in a dense area and the neighbors are very close to it. That is the similarity with its neighbors is high.
.
(Directional diversity). The directional diversity of
where
The directional diversity reflects the neighbor distribution of
The visualization of different types of points. From att1 to att5 means Bi-number, reverse connectivity, reverse neighborhood density, directional diversity, and neighborhood density, respectively. For noise point 
Figure 2a display two types of points at different location. In Fig. 2b the noise point
According to the above analysis, it is concluded that there are obvious differences between the data and noise in the extracted connectivity attributes. Figure 1c shows the dimension reduction results of the connectivity attributes by the Principal Component Analysis algorithm. It is found that there are two clusters that are obviously separated. Therefore, unsupervised methods can be conducted to detect noises.
.
(Non-parametric noise detecting). We apply an unsupervised algorithm such as k-means to cluster the original data into two clusters, and identify the cluster with higher Bi-number, reverse connectivity, reverse neighborhood density, and lower directional diversity and neighborhood density as noises.
Node cutting is performed based on noise detection. The noisy and isolated points and the edges connected to these points are removed from the wNaNG. (In practice, we directly remove the columns and rows of noisy and outer points from the connected matrix.). Figure 1d lists the nodes identified as noises, demonstrating that the noises can be determined appropriately.
After removing the noisy and isolated points from the original connections of datasets, we obtain the pure datasets and the pure connected matrix. In this situation, we can perform clustering by detecting the strongly connected components and setting them as clusters straightforwardly on the new connected matrix. A strongly connected component is a subgraph in which any two points are strongly connected to each other, no more vertices can be added to the subgraph. If we find out all the strongly connected components in the wNaNG successfully, the clusters in a data set are identified.
We use the Tarjan [16] method to detect the strongly connected components of the graph directly. Tarjan is based on the depth-first search algorithm, each strongly connected component is a subtree in the search tree. The components are fully determined and the desirable number of clusters is generated. In other words, the clusters of pure datasets are grabbed. Figure 1e shows that the clusters of data are distinguished fittingly.
The advantage of making use of Tarjan is that this method can achieve the complexity of linear time. The space and time requirements of this algorithm are bounded by
Assign the noisy and isolated points to clusters
The last step of the NonPC is to assign the noisy and isolated points to clusters that have already been decided in Section 3.4. wNaNG has revealed the connected relationship between all objects in the datasets, it is sufficient to assign the removed points to clusters. We straightforwardly make use of wNaNG to complete node assigning: each removed point is assigned to a cluster via its own
where
This part analyzes the computational complexity of the proposed NonPC. We set
Distributions of the original synthetic datasets.
In order to evaluate the performance of the NonPC, simulation studies are conducted based on various synthetic datasets. Figure 3 illustrates the original distributions for each synthetic data set. These datasets are assumed to contain noisy data distribution with linear and nonlinear patterns. D1_5 is composed of spherical clusters with different numbers, which contains five clusters with noises and a total of 513 observations. D2_2 consists of two clusters with 420 objects, including noises. D3_4 consists of one circle cluster and three spherical clusters with noisy observations for a total of 1114 points. D4_6 is composed of six tightly-connected clusters: four spherical clusters in two right-angle line clusters with some noise objects, for a total of 1915 objects. D5_4 is made up of three spherical clusters and one manifold cluster including noises with 1427 objects. D6_7 are composed of seven tightly-connected nonlinear clusters with 990 points, including noises.
Simulation setup
We compare the proposed method with different categories clustering algorithms: k-means [17], density-based spatial clustering of applications with noise (DBSCAN) [18], spectral clustering (SC) [19] and cut-point clustering algorithm (CutPC) [11], Watershed clustering based on
To compare the performances of the proposed and comparative methods, we used four external criteria: the Adjust Rand Index (ARI) [22], Fowlkes-Mallows index (FMI) and Normalized Mutual Information (NMI) [23].
Noise sensitivity analysis. The x-axis represents the noise ratio (0–50%), and the y-axis represents the Adjusted Rand index obtained.
In this part, we analyze the noise detecting robustness of the proposed NonPC algorithm by assessing its performance with different amounts of noise in a data set. We use the six synthetic data and add external noises randomly with the noise ratio
Figure 4 presents the results for the proposed and six comparison clustering benchmark algorithms. The x-axis represents the noise ratio, and the y-axis represents the Adjusted Rand index obtained. A method with a larger Adjusted Rand Index over the noise ratio would be regarded as a better one. The test results demonstrate that the NonPC evidently outperforms the other comparison algorithms in the aspect of the robustness to noises: the Adjusted Rand Index values for the NonPC tend to obtain higher than those of the other methods, exhibiting smaller Rand Index changes with increases in the noise ratio. This implies that the proposed method efficiently detects noises and thus accurately identified intrinsic cluster structures.
With the number of noises increasing, more similar edges of high weight may be generated by these noises. Spectral clustering is a kind of edge-cutting algorithm, it is not easy to search for an optimal edge cut that reveals intrinsic cluster structures. Noise-robust clustering methods perform better than spectral clustering. However, the performance obtained by the DBSCAN also degraded significantly with an increase in noise because this method is sensitive to the tuning parameter. The DBSCAN needs to adjust the tuning parameter because the increased noises change the distribution of the base data. And the same for the AP algorithm. For the CutPC, although it is also based on node cutting, it is limited to the density of clusters. The external nodes make unbalanced densities. Therefore, the performance of CutPC is inferior to that of NonPC. The K-means algorithm cannot perform well in non-convex data set and it is sensitive to noises and outliers. Hence, it is difficult to achieve optimal clustering performance with a large number of noise. On the other hand, the proposed NonPC algorithm is robust to noise because it not only ensures the consistency of the number of neighbors but also conducts node cutting so that it removes the generated noise data correctly and identifies intrinsic cluster structures.
Figure 5 illustrates the graphical clustering results for the proposed NonPC method. From the graphical results, we can verify that NonPC can effectively process complex and nonlinear patterns. When applied to noisy data, the proposed NonPC method shows robust clustering capability.
The characteristics of the six real-word datasets
The characteristics of the six real-word datasets
Graphical clustering results for the proposed NonPC algorithm.
To further demonstrate the superiority and the applicability to real situations of NonPC, we test the clustering performance by using six real-world datasets from UCI. The data characteristics can be seen in Table 1. These datasets are widely used for testing in the field of machine learning.
Table 2 shows the ARI score of all algorithms. We can see that compared with graph-based clustering algorithms SC, CutPC, and WC in the Wine, Seeds, Iris, HCV, Thyroid, and HCV datasets, NonPC shows excellent performance.
ARI of real-world datasets
ARI of real-world datasets
NMI of real-world datasets
FMI of real-world datasets
Table 3 lists the NMI score. It is clear that NonPC obtains the best clustering performance and achieves greater improvement in the Wine, Seeds, Iris, Thyroid, and HCV datasets. This comparison also illustrates that NonPC can generate desirable clustering results on tightly-connected patterns. Considering the FMI aspect, as demonstrated in Table 4, NonPC gets the best clustering performance in the Wine, Iris, Thyroid, and HCV datasets. What is more, the score of NonPC exceeds the other graph-based clustering in all datasets except the Ionosphere data set.
In summary, for the clusters with different densities and nonlinear distribution, the performance of the proposed NonPC method is better than other graph-based clustering methods, such as SC, CutPC, and WC algorithm, and also performs better than other types of clustering algorithms. To some extent, it confirms that the proposed algorithm contributes to the progress of graph-based clustering algorithms.
In this paper, we propose a non-parameter clustering algorithm with adaptive noise detecting based on a weighted natural neighbor graph. The constructed weighted natural neighbor graph represents the original data patterns more appropriately. In the noise detecting process, we carefully analyze and confirm the effectiveness of graph-based connectivity features for noise detection. The connectivity information is adequately excavated and used by k-means to identify the noises adaptively. According to the extensive experiments, the proposed graph-based clustering algorithm shows superior in detecting noise and dealing with nonlinear and tightly connected patterns in a non-parameter way. In conclusion, the proposed clustering approach is of great significance to promote the research progress of graph-based clustering algorithms. However, the method has some limitations, such as it is unable to identify clusters with spatial overlap.
Footnotes
Acknowledgments
This work is supported by the Natural Science Foundation of China (No. 41804112), the Youth Project of Science and Technology Research Program of Chongqing Education Commission of China (No. KJQN202001143) and the High Quality Development Plan of Graduate Education of Chongqing University of Technology (No. gzlcx20223216).
