NonPC: Non-parametric clustering algorithm with adaptive noise detecting

Abstract

Graph-based clustering is effective at identifying clusters in local and nonlinear data patterns. Existing methods, however, face the problem of parameter selection, such as setting $k$ for the k-nearest neighbor graph and the threshold used in noise detection. In this paper, a non-parametric clustering algorithm (NonPC) is proposed to tackle these inherent limitations and improve clustering performance. The weighted natural neighbor graph (wNaNG) is developed to represent the given data without any prior knowledge. Moreover, the proposed NonPC method adaptively detects noise in an unsupervised way, based on attributes extracted from the wNaNG. The algorithm works without preliminary parameter settings while automatically identifying clusters with unbalanced densities, arbitrary shapes, and noise. To assess the advantages of the NonPC algorithm, extensive experiments have been conducted against several classic and recent clustering methods. The results demonstrate that the proposed NonPC algorithm significantly outperforms the state-of-the-art and well-known algorithms in terms of the Adjusted Rand Index, Normalized Mutual Information, and Fowlkes-Mallows index.

Keywords

Weighted natural neighbor graph, non-parametric method, graph-based clustering, unsupervised learning, noise detecting

1. Introduction

As an important branch of clustering, graph-based clustering algorithms have received great attention in data mining and pattern recognition [1, 2, 3]. These algorithms first represent the data as a graph and then use the connectivity information in the graph to find clusters. Compared with prototype-based clustering algorithms such as k-means, the main advantage of graph-based clustering is that the number of clusters does not need to be set in advance. For this reason, graph-based clustering algorithms have made great progress in the past decades.

At present, there are three main types of graph construction: the fully connected graph, the $\varepsilon$-neighborhood graph, and the k-nearest neighbor graph. The fully connected graph connects any two points with positive similarity and weights every edge with a similarity function such as a Gaussian kernel. Its drawback is high time complexity [4]. An $\varepsilon$-neighborhood graph connects two points when their distance is smaller than $\varepsilon$. The number of connections in the graph is therefore greatly reduced, but an appropriate $\varepsilon$ must be determined [5]. The k-nearest neighbor graph stores, for each point, only the edges to its $k$ nearest points [6]. This graph and its variants are the most widely used in graph clustering algorithms because of their simplicity [7]. For example, the 1-nearest neighbor graph is used to represent data [8] and find clusters [9], and a clustering method based on the hybrid K-nearest neighbor graph (CHKNN) is proposed in [10] to mine complex data patterns. In [11], a parameter-free data representation named the natural neighbor graph and the cut-point clustering algorithm (CutPC) are developed. The authors of [12] improve the natural neighbor graph by adding a comparison between the number of connected components of the graph and the number of clusters. Nevertheless, the information and neighborhood structure expressed by the natural neighbor graph are limited because it is an undirected graph without weights.

Although the above graphs can effectively represent the data and mine nonlinear patterns, noise seriously affects the clustering results. To exclude noise, the CutESC method [13] cuts edges of a k-nearest neighbor graph according to an edge threshold. CHKNN sets another parameter to control the size of isolated points and detects noise from the relationship between this parameter and the number of mutual nearest neighbors of each point. The performance of CHKNN therefore relies on appropriate parameters. The OPS algorithm [14] applies a reconstruction method based on the k-nearest neighbor graph to conduct node cutting. A LASSO regularization model with optimization is introduced for feature selection. CutPC assumes that the density of each point is different and detects noise based on neighborhood density, so it is seriously limited by density. The common feature of these noise detection methods is the need for parameters.

In short, the main feature of existing graph-based clustering algorithms is the need for parameters in graph construction or noise detection, and the choice of parameters inevitably affects clustering performance. In this study, we propose a non-parametric clustering algorithm with adaptive noise detecting. Its advantage is that no prior knowledge is needed to set parameters in either graph construction or noise detection. The proposed algorithm therefore overcomes the parameter dependence of current graph-based clustering algorithms. A weighted natural neighbor graph is developed to represent the original data set, and graph characteristics extracted from this graph are used to identify noise. Neighborhood information is used comprehensively to improve the adaptability and robustness of noise detection. The three main contributions of the proposed method are summarized as follows:

• A non-parametric noise identification scheme is constructed. We make full use of structural connectivity information, density information, and directional diversity information to improve noise detection performance.
• A clustering method termed NonPC is proposed, which realizes parameter-free graph-based clustering. The proposed method needs no parameter presetting. The optimal clustering results are generated through the algorithm process.
• We provide a performance evaluation on synthetic and UCI real-world datasets to demonstrate the validity, advancement, and noise robustness of the proposed method. The results indicate that our method is superior to the alternative nonlinear clustering methods.

2. Related work

A $k$-nearest neighbor graph creates an edge between two objects according to their neighbor relationship. Its disadvantage is that the parameter $k$ needs to be determined manually. The natural neighbor graph was proposed to avoid this problem: it iteratively uses the numbers of k-nearest neighbors and reverse neighbors to obtain $k$ automatically. The k-nearest neighbors and reverse neighbors are defined as follows:

Definition ($k$-nearest neighbor). The $k$-nearest neighbors of point $x_{i}$ are the subset of the dataset $D$ consisting of all points $x_{j}$ that $x_{i}$ considers to be among its $k$ nearest neighbors, defined as follows.

$\displaystyle NN_{k}(x_{i})=\{x_{j}\in D\mid d(x_{i},x_{j})\leqslant d(x_{i},q)\},$ (1)

where $q$ is the $k$-th nearest neighbor of object $x_{i}$, and $d(x_{i},x_{j})$ is a distance function that returns the distance between samples $x_{i}$ and $x_{j}$. In this paper, we use the Euclidean distance to produce all the results.

Definition (Reverse neighbors). The reverse neighbors of object $x_{i}$ are the subset of the dataset $D$ consisting of all points $x_{j}$ that take $x_{i}$ as one of their $k$ nearest neighbors, defined as follows.

$\displaystyle\textit{RNN}_{k}(x_{i})=\{x_{j}\in D\mid x_{i}\in NN_{k}(x_{j})\}.$ (2)
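As an illustration of Eqs (1) and (2), the following minimal sketch computes k-nearest neighbors and reverse neighbors by brute force with the Euclidean distance. The function names `knn` and `reverse_neighbors` are ours, and the sketch assumes ties in distance are broken by index order, so it returns exactly $k$ neighbors where Eq. (1) would include every tied point.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean, brute force)."""
    others = [j for j in range(len(points)) if j != i]
    # sorted() is stable, so equal distances keep index order (an assumption).
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def reverse_neighbors(points, i, k):
    """Indices j such that points[i] is among the k nearest neighbors of points[j]."""
    return [j for j in range(len(points)) if j != i and i in knn(points, j, k)]

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(knn(points, 3, 1))                # [2]
print(reverse_neighbors(points, 3, 1))  # []
print(reverse_neighbors(points, 1, 1))  # [0, 2]
```

The isolated point at (10, 0) has a nearest neighbor but is nobody's nearest neighbor, which is the asymmetry that the reverse-neighbor definition captures.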

Definition (Natural Neighbors (NaN)). For each object $x_{i}$, the natural neighbors, denoted as $\textit{NaN}(x_{i})$, are defined as its k-nearest neighbors with $k$ equal to the Natural Neighbor Eigenvalue $\lambda$. $\lambda$ is generated in a neighbor searching process in which the search range of neighbors $k$ is continuously expanded. When all points have reverse neighbors or the number of points without reverse neighbors remains stable, the iteration stops, and the search range $k$ is the Natural Neighbor Eigenvalue $\lambda$.
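The search for $\lambda$ described above can be sketched as follows. This is a simplified reading of the stopping rule, not the authors' implementation: "keeps stable" is interpreted as "unchanged between two consecutive values of $k$", brute-force sorting replaces the KD-tree, and the function name `natural_neighbor_eigenvalue` is hypothetical.

```python
import math

def natural_neighbor_eigenvalue(points):
    """Grow k until every point has a reverse neighbor, or the number of
    points without one stops changing; return that k (lambda)."""
    n = len(points)
    # Sort all other points by distance once; round k uses the first k entries.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j, i=i: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    previous_orphans = None
    for k in range(1, n):
        has_reverse = [False] * n
        for i in range(n):
            for j in order[i][:k]:
                has_reverse[j] = True  # i names j as one of its k neighbors
        orphans = has_reverse.count(False)
        if orphans == 0 or orphans == previous_orphans:
            return k
        previous_orphans = orphans
    return n - 1

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
lam = natural_neighbor_eigenvalue(points)
print(lam)  # 2
```

In this toy data the point at (10, 0) never gains a reverse neighbor, so the count of such points is 1 for both $k=1$ and $k=2$ and the search stops at $\lambda=2$.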

Definition (Natural Neighbors Graph (NaNG)). NaNG is an undirected graph denoted by $G=(V,E)$, where $V$ is the set of samples, and $\forall(\mu,\nu)\in E$: $\nu$ is a natural neighbor of $\mu$. Two points are natural neighbors if each is among the k-nearest neighbors of the other. An edge $e_{ij}$ between nodes is defined as:

$\displaystyle e_{ij}=\left\{\begin{array}{ll}1&\text{if }x_{j}\in\textit{NaN}(x_{i})\text{ or }x_{i}\in\textit{NaN}(x_{j})\\ 0&\text{otherwise}\end{array}\right..$ (3)

From Eq. (3), it is easy to see that the NaNG is an undirected graph without weights. However, Figs 1b and 2b show that there is other abundant and useful information between samples, such as direction. NaNG is unable to represent this information and is therefore of limited use for identifying noise.

Figure 1.

The clustering process of NonPC. In the first stage (a–b), we construct a weighted natural neighbor graph to represent the original data patterns. In the non-parametric noise detecting stage (b–d), we apply an unsupervised method to the extracted attributes to detect the noise. The red dots represent the detected noise. In the last stage (d–f), we perform clustering and assign the noise and outer points to their nearest cluster.

3. Proposed method

3.1 Method overview

As shown in Fig. 1, the proposed clustering method carries out three steps. (1) Constructing a weighted natural neighbor graph to represent the original data and reveal the structural information among data objects. In this graph representation, objects are expressed by nodes, similarity is expressed by distance weights, and connections with neighbors are expressed by edges. (2) Non-parametric noise detecting. We extract five attributes from the constructed graph: the number of bi-neighbors, the number of reverse neighbors, the density of the neighbors, the density of the reverse neighbors, and the directional diversity of each object. These attributes are used to divide the data into clean data and noise. This process is non-parametric and adaptive. (3) Clustering and assigning each noise point to its nearest cluster. The clusters are found by searching for the strongly connected components in the weighted natural neighbor graph of the clean data. The number of components is the optimal number of clusters. Each removed noisy point is assigned to the cluster to which most of its neighbors belong.

3.2 Constructing the weighted natural neighbor graph

Consider a data set ${\{x_{1},x_{2},\ldots,x_{n}\}}$ of $n$ points in $\Re^{d}$, where $d\in N$. The definitions related to the weighted natural neighbor graph are as follows:

Definition (Weighted Natural Neighbors Graph (wNaNG)). wNaNG is a directed graph denoted by $G=(V,E)$, where $V$ is the set of samples, and $\forall(\mu,\nu)\in E$: $\nu$ is a $k$ nearest neighbor of $\mu$. An edge $e_{ij}$ between nodes is defined as:

$\displaystyle e_{ij}=\left\{\begin{array}{ll}w_{ij}&\text{if }x_{j}\in\textit{NaN}(x_{i})\\ 0&\text{otherwise}\end{array}\right.,$ (4)

where $w_{ij}$ is the distance $d(x_{i},x_{j})$. The constructed wNaNG can be represented by a connected matrix $C$ whose element $C_{ij}$ corresponds to the edge $e_{ij}$ of the wNaNG. Figure 1b shows the wNaNG constructed from the original data set. Note that searching the nearest neighbors of every object in the data set is costly. Therefore, the KD-tree [15] is introduced into the weighted natural neighbor searching algorithm.

The constructed graph does not require any parameters to be set. Moreover, the graph ensures that each sample point has the same number of neighbors, which reduces the influence of density differences on cluster partition. Importantly, wNaNG is a weighted directed graph, and the information it contains is beneficial to noise identification.
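To make Eq. (4) concrete, the sketch below builds a wNaNG as a dictionary of weighted out-edges. It is an illustration only: the function name `build_wnang` is hypothetical, the value of $\lambda$ is passed in as `k` rather than searched for, and a brute-force neighbor search stands in for the KD-tree.

```python
import math

def build_wnang(points, k):
    """Directed weighted graph {i: {j: d(x_i, x_j)}} with an edge i -> j
    for each of the k nearest neighbors j of i (k plays the role of lambda)."""
    n = len(points)
    graph = {}
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        graph[i] = {j: math.dist(points[i], points[j]) for j in nearest}
    return graph

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
g = build_wnang(points, 2)
print(g[3])       # {2: 8.0, 1: 9.0}
print(3 in g[2])  # False: the edge 3 -> 2 has no counterpart 2 -> 3
```

Every node has exactly `k` out-edges, while the number of in-edges varies from node to node; this asymmetry is what the attributes of Section 3.3 exploit.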

3.3 Non-parametric noise detecting

Noise detection is completed in a non-parametric way. We use the difference in connectivity between observations inside and outside clusters to detect noise. Five attributes are extracted from the wNaNG: the number of bi-connected pairs, the number of reverse neighbors, the density of the neighborhood, the density of the reverse neighborhood, and the directional diversity of each object. These attributes exhibit the distribution differences between data and noise and can therefore be used to identify the noise hidden in the data set.

Definition (Bi-connected pairs). A bi-connected pair consists of two points that are natural neighbors of each other, which is defined as follows:

$\displaystyle x_{i}\in\textit{NaN}(x_{j})\wedge x_{j}\in\textit{NaN}(x_{i}).$ (5)

Definition (Bi-number). The Bi-number of $x_{i}$ is the number of bi-connected pairs that contain $x_{i}$.

The Bi-number of a point reveals the degree of connection between the object and other points. Noisy points, which have a small Bi-number, are sparsely connected to their neighbors, whereas points with a large Bi-number are tightly connected to their neighbors. A point that is tightly connected to its neighbors is very close to them and highly similar to them.

Definition (Reverse connectivity). The reverse connectivity of $x_{i}$ is equal to the number of reverse neighbors of $x_{i}$.

Reverse connectivity reveals the connected density of the region in which $x_{i}$ is located within the whole data distribution. A point with high reverse connectivity is located in a dense region and is closely connected with other points. Accordingly, noise points have low reverse connectivity. In addition, by the concept of natural neighbor, the number of reverse neighbors of a natural outlier is zero.
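Both counting attributes can be read directly off the directed graph. The sketch below assumes the wNaNG is stored as a dictionary of out-edges `{i: {j: weight}}`; this layout and the function names are our own choices, not the authors'.

```python
def bi_number(graph, i):
    """Number of nodes j joined to i by edges in both directions."""
    return sum(1 for j in graph[i] if i in graph[j])

def reverse_connectivity(graph, i):
    """Number of nodes j that have an edge j -> i."""
    return sum(1 for j in graph if i in graph[j])

# Toy wNaNG with lambda = 2; node 3 is an isolated point far from the others.
graph = {
    0: {1: 1.0, 2: 2.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 0: 2.0},
    3: {2: 8.0, 1: 9.0},
}
print(bi_number(graph, 1), reverse_connectivity(graph, 1))  # 2 3
print(bi_number(graph, 3), reverse_connectivity(graph, 3))  # 0 0
```

Node 3 still has two out-edges, because every node has $\lambda$ neighbors, but no node points back to it, so both attributes are zero.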

Definition (Neighborhood density). The neighborhood density of $x_{i}$ is defined as the mean distance from $x_{i}$ to its neighbors.

$\displaystyle\textit{Nei\_density}(x_{i})=\frac{1}{k}\sum_{x_{j}\in\textit{NaN}(x_{i})}d(x_{i},x_{j}),$ (6)

where $k$ equals the Natural Neighbor Eigenvalue $\lambda$, i.e., the number of neighbors of $x_{i}$.

Definition (Reverse neighborhood density). The reverse neighborhood density of $x_{i}$ is equal to the mean distance from $x_{i}$ to its reverse neighbors. In this method, we use $\textit{RNN}_{k}(x_{i})$ to calculate the reverse density.

$\displaystyle\textit{re\_density}(x_{i})=\frac{1}{k_{r}}\sum_{x_{j}\in\textit{RNN}_{k}(x_{i})}d(x_{i},x_{j}),$ (7)

where $k_{r}$ is the number of reverse neighbors of $x_{i}$ . $\textit{RNN}_{k}(x_{i})$ represents the set of reverse neighbors of point $x_{i}$ .

The neighborhood density and the reverse neighborhood density help to indicate the location of a point. A point with low neighborhood density and high reverse neighborhood density is in a dense area, with neighbors that are very close to it. That is, its similarity with its neighbors is high.
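Eqs (6) and (7) are plain averages of edge weights, as the following sketch shows on a toy graph stored as `{i: {j: weight}}`. The text does not state how Eq. (7) is evaluated when a point has no reverse neighbors ($k_{r}=0$); returning `None` in that case is purely an illustrative choice of ours.

```python
def nei_density(graph, i):
    """Eq. (6): mean distance from i to its own neighbors (out-edges)."""
    weights = list(graph[i].values())
    return sum(weights) / len(weights)

def re_density(graph, i):
    """Eq. (7): mean distance from i to its reverse neighbors (in-edges).
    Returns None when i has no reverse neighbors (an assumption)."""
    weights = [graph[j][i] for j in graph if i in graph[j]]
    return sum(weights) / len(weights) if weights else None

graph = {
    0: {1: 1.0, 2: 2.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 0: 2.0},
    3: {2: 8.0, 1: 9.0},
}
print(nei_density(graph, 1), re_density(graph, 0))  # 1.0 1.5
print(nei_density(graph, 3), re_density(graph, 3))  # 8.5 None
```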

Definition (Directional diversity). The directional diversity of $x_{i}$ is defined as the sum of the norms of the pairwise differences between the neighbor directional vectors of $x_{i}$:

$\displaystyle\textit{Dir\_diversity}(x_{i})=\sum_{j^{\prime}=1}^{k-1}\sum_{j^{\prime\prime}=j^{\prime}+1}^{k}\|a_{j^{\prime}}-a_{j^{\prime\prime}}\|,\quad a_{j^{\prime}},a_{j^{\prime\prime}}\in M,\quad M=\{a_{j}\mid a_{j}=x_{i}-x_{j},\ x_{j}\in\textit{NaN}(x_{i})\},$ (8)

where $k$ equals the Natural Neighbor Eigenvalue $\lambda$, i.e., the number of neighbors of $x_{i}$, $\textit{NaN}(x_{i})$ is the set of neighbors of point $x_{i}$, and $M$ is the set of the neighbor directional vectors of $x_{i}$.

The directional diversity reflects how the neighbors of $x_{i}$ are distributed, and this distribution differs clearly between points at different locations. For points within clusters, the neighbors are expected to scatter in all directions. For border points, the neighbor directional vectors point mostly toward the cluster. For noise points, the neighbor directional vectors are expected to point in similar directions.
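Eq. (8) can be evaluated as written by forming the direction vectors $a_{j}=x_{i}-x_{j}$ and summing the norms of all pairwise differences. The sketch below is a direct transcription under that reading; the function name `dir_diversity` and the toy coordinates are ours.

```python
import math
from itertools import combinations

def dir_diversity(points, i, neighbors):
    """Eq. (8): sum of ||a_j' - a_j''|| over all unordered pairs of
    direction vectors a_j = x_i - x_j, with j ranging over the neighbors of i."""
    vectors = [tuple(a - b for a, b in zip(points[i], points[j]))
               for j in neighbors]
    return sum(math.dist(u, v) for u, v in combinations(vectors, 2))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
value = dir_diversity(points, 0, [1, 2, 3])
print(round(value, 3))  # 4.828
```

Since $a_{j^{\prime}}-a_{j^{\prime\prime}}=x_{j^{\prime\prime}}-x_{j^{\prime}}$, the quantity equals the sum of pairwise distances between the neighbors themselves; here that is $2+2\sqrt{2}$.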

Figure 2.

The visualization of different types of points. att1 to att5 denote Bi-number, reverse connectivity, reverse neighborhood density, directional diversity, and neighborhood density, respectively. The noise point $a_{1}$ has only neighbor connections and no reverse neighbors, and its neighbor vectors point in a consistent direction. For the in-cluster object $b_{3}$, the numbers of neighbors and reverse neighbors are close to equal, and the neighbor vectors spread out in all directions. For the border point $b_{1}$, the number of reverse neighbors is smaller than the number of neighbors; the neighbor vectors uniformly point to the center of the cluster, while the reverse neighbor vectors point in from scattered directions. The attribute values of the points in the cluster are close to each other and separated from those of the noise points.

Figure 2a displays two types of points at different locations. In Fig. 2b, the noise point $a_{1}$ has only neighbor connections and no reverse neighbors, and its neighbor vectors point in a consistent direction. For the in-cluster object $b_{3}$, the numbers of neighbors and reverse neighbors are close to equal, and the neighbor vectors spread out in all directions. Its neighbors are also very close. For the border nodes of clusters, the number of reverse neighbors is smaller than the number of neighbors. In addition, the neighbor vectors uniformly point to the center of the cluster, while the reverse neighbor vectors point in from scattered directions. This feature is very helpful for distinguishing noise from the border nodes of clusters. Figure 2c plots the attributes of six points (three noise points and three points in clusters) as curves. The attribute values of the points in clusters are close to each other, and their curves are separated from the other three. Note that, although the difference in neighborhood density is small, the neighborhood density of the noise points is still larger than that of the points in clusters. The curves show that, compared with noise, the points in clusters have higher Bi-number, reverse connectivity, and reverse neighborhood density, and lower directional diversity and neighborhood density.

According to the above analysis, it is concluded that there are obvious differences between the data and noise in the extracted connectivity attributes. Figure 1c shows the dimension reduction results of the connectivity attributes by the Principal Component Analysis algorithm. It is found that there are two clusters that are obviously separated. Therefore, unsupervised methods can be conducted to detect noises.

Definition (Non-parametric noise detecting). We apply an unsupervised algorithm such as k-means to the extracted attributes to divide the objects into two clusters, and identify the cluster with lower Bi-number, reverse connectivity, and reverse neighborhood density, and higher directional diversity and neighborhood density as noise.

Node cutting is performed based on noise detection. The noisy and isolated points and the edges connected to these points are removed from the wNaNG. (In practice, we directly remove the columns and rows of noisy and outer points from the connected matrix.) Figure 1d shows the nodes identified as noise, demonstrating that the noise can be determined appropriately.
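The detection step can be sketched with a small two-cluster k-means over attribute rows. Several choices here are assumptions of ours rather than details given in the text: only three of the five attributes are used, the attributes are not rescaled, the initial centers are the rows with the smallest and largest Bi-number, and the noise cluster is taken to be the one whose center has the lower Bi-number. The function names are hypothetical.

```python
import math

def two_means(rows, iters=100):
    """Lloyd's algorithm with k = 2 and a deterministic initialization
    (rows with the smallest and the largest first attribute)."""
    centers = [list(min(rows, key=lambda r: r[0])),
               list(max(rows, key=lambda r: r[0]))]
    labels = None
    for _ in range(iters):
        new_labels = [min((0, 1), key=lambda c: math.dist(r, centers[c]))
                      for r in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def detect_noise(rows):
    """Indices of rows in the cluster whose center has the lower Bi-number."""
    labels, centers = two_means(rows)
    noise_label = min((0, 1), key=lambda c: centers[c][0])
    return [i for i, lab in enumerate(labels) if lab == noise_label]

# Each row: (Bi-number, reverse connectivity, neighborhood density); toy values.
rows = [(2, 2, 1.0), (2, 3, 1.0), (2, 2, 1.1), (0, 0, 8.5), (0, 1, 7.9)]
noise = detect_noise(rows)
print(noise)  # [3, 4]
```

The iteration cap `iters` only bounds the k-means loop and is not a tuning parameter of the noise detector itself.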

3.4 Clustering by finding strongly connected component

After removing the noisy and isolated points from the original connections of the data set, we obtain the pure data set and the pure connected matrix. Clustering can then be performed directly on the new connected matrix by detecting the strongly connected components and taking them as clusters. A strongly connected component is a maximal subgraph in which any two points are strongly connected to each other, that is, no more vertices can be added to it. Once all the strongly connected components in the wNaNG are found, the clusters in the data set are identified.

We use Tarjan's method [16] to detect the strongly connected components of the graph directly. Tarjan's method is based on depth-first search, and each strongly connected component is a subtree of the search tree. Once the components are determined, the number of clusters follows. In other words, the clusters of the pure data set are obtained. Figure 1e shows that the clusters of the data are distinguished appropriately.

The advantage of Tarjan's method is that it runs in linear time. The space and time requirements of this algorithm are bounded by $O(V+E)$, where $V$ is the number of vertices and $E$ is the number of edges of the graph.
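Node cutting followed by component search can be sketched as below. This is a textbook recursive version of Tarjan's algorithm on a graph stored as `{node: {successor: weight}}`, not the authors' code; the helper `cut_nodes` and the toy graph are ours, and for large graphs an iterative variant would be needed to stay within Python's recursion limit.

```python
def cut_nodes(graph, removed):
    """Drop the removed nodes and every edge that touches them."""
    removed = set(removed)
    return {i: {j: w for j, w in nbrs.items() if j not in removed}
            for i, nbrs in graph.items() if i not in removed}

def strongly_connected_components(graph):
    """Tarjan's algorithm; returns a list of sorted node lists."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []

    def visit(v):
        index[v] = low[v] = len(index)  # discovery order of v
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(sorted(component))

    for v in graph:
        if v not in index:
            visit(v)
    return components

wnang = {
    0: {1: 1.0, 2: 2.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 0: 2.0},
    3: {2: 8.0, 1: 9.0},                      # detected as noise
    4: {5: 1.0, 6: 2.0}, 5: {4: 1.0, 6: 1.0}, 6: {5: 1.0, 4: 2.0},
}
clusters = strongly_connected_components(cut_nodes(wnang, [3]))
print(clusters)  # [[0, 1, 2], [4, 5, 6]]
```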

3.5 Assign the noisy and isolated points to clusters

The last step of NonPC is to assign the noisy and isolated points to the clusters determined in Section 3.4. Since wNaNG has revealed the connection relationships between all objects in the data set, it is sufficient for assigning the removed points to clusters. We use wNaNG directly to complete node assigning: each removed point is assigned to a cluster via its own $k$-nearest neighbors. The cluster labels of the neighbors of each noise point are counted, and the point receives the most frequent label among its neighbors. The assigning scheme is expressed as follows:

$\displaystyle y_{i}=\mathop{\arg\max}_{\textit{Clu}_{m}}\sum\textit{Clu}_{x_{j}},\quad x_{i}\in\textit{Noise},\ x_{j}\in\textit{NaN}(x_{i}),$ (9)

where $\textit{Clu}_{m}$ denotes the $m$-th identified cluster and $y_{i}$ is the cluster label assigned to the noisy or isolated point $x_{i}$. The final clustering results can be seen in Fig. 1f, where NonPC recovers the clusters of the example data well.
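The majority vote of Eq. (9) can be sketched as follows. Two details are assumptions of ours because the text does not specify them: neighbors that were themselves removed do not vote, and a removed point with no labeled neighbor is left unassigned. The function name `assign_removed` is hypothetical.

```python
from collections import Counter

def assign_removed(graph, labels, removed):
    """graph: full wNaNG {i: {j: weight}} before cutting.
    labels: {clean node: cluster id}. Returns labels extended to removed nodes."""
    assigned = dict(labels)
    for i in removed:
        votes = Counter(labels[j] for j in graph[i] if j in labels)
        if votes:  # skip points whose neighbors were all removed (assumption)
            assigned[i] = votes.most_common(1)[0][0]
    return assigned

graph = {
    0: {1: 1.0, 2: 2.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 0: 2.0},
    3: {2: 8.0, 1: 9.0},
    4: {5: 1.0, 6: 2.0}, 5: {4: 1.0, 6: 1.0}, 6: {5: 1.0, 4: 2.0},
}
labels = {0: 0, 1: 0, 2: 0, 4: 1, 5: 1, 6: 1}
final = assign_removed(graph, labels, [3])
print(final[3])  # 0
```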

3.6 Computational complexity analysis

This part analyzes the computational complexity of the proposed NonPC. Let $n$ be the total number of observations in the data set. Because a KD-tree is used in NaN-searching when constructing the wNaNG, the time complexity of the NaN-searching algorithm is $O(n\log n)$. The time complexity of extracting the connectivity attributes and finding the noise and isolated points is $O(n^{2})$. Clustering by finding strongly connected components can be solved in linear time $O(V+E)$. Assigning the removed objects to clusters has the same computational complexity as constructing the wNaNG. In summary, the overall computational complexity of NonPC is $O(n^{2})$.

Figure 3.

Distributions of the original synthetic datasets.

4. Simulation study

To evaluate the performance of NonPC, simulation studies are conducted on various synthetic datasets. Figure 3 illustrates the original distribution of each synthetic data set. These datasets contain noisy data distributions with linear and nonlinear patterns. D1_5 consists of five spherical clusters containing different numbers of points, with noise, for a total of 513 observations. D2_2 consists of two clusters with 420 objects, including noise. D3_4 consists of one circular cluster and three spherical clusters with noisy observations, for a total of 1114 points. D4_6 is composed of six tightly-connected clusters, four spherical clusters and two right-angle line clusters, with some noise objects, for a total of 1915 objects. D5_4 is made up of three spherical clusters and one manifold cluster, including noise, with 1427 objects. D6_7 is composed of seven tightly-connected nonlinear clusters with 990 points, including noise.

4.1 Simulation setup

We compare the proposed method with clustering algorithms of different categories: k-means [17], density-based spatial clustering of applications with noise (DBSCAN) [18], spectral clustering (SC) [19], the cut-point clustering algorithm (CutPC) [11], watershed clustering based on the $k$-nearest-neighbor graph and the Pauta criterion (WC) [20], and the Affinity Propagation (AP) clustering algorithm [21]. The experiments are carried out on a PC equipped with an Intel(R) Core(TM) i5-9400 CPU and 8 GB RAM, running 64-bit Windows 10, with VSCode as the programming environment.

To compare the performance of the proposed and comparative methods, we use three external criteria: the Adjusted Rand Index (ARI) [22], the Fowlkes-Mallows index (FMI), and Normalized Mutual Information (NMI) [23].
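For reference, two of these criteria can be computed from pair counts alone, as in the sketch below, which uses the standard textbook definitions of ARI and FMI. The function names are ours, the label vectors are toy values, and at least two objects are assumed.

```python
import math
from collections import Counter

def _pairs(m):
    """Number of unordered pairs among m items."""
    return m * (m - 1) // 2

def pair_counts(truth, pred):
    """Pairs grouped together in both labelings, in truth, in pred, and overall."""
    both = sum(_pairs(c) for c in Counter(zip(truth, pred)).values())
    in_truth = sum(_pairs(c) for c in Counter(truth).values())
    in_pred = sum(_pairs(c) for c in Counter(pred).values())
    return both, in_truth, in_pred, _pairs(len(truth))

def adjusted_rand_index(truth, pred):
    both, t, p, total = pair_counts(truth, pred)
    expected = t * p / total        # agreement expected by chance
    maximum = (t + p) / 2
    if maximum == expected:         # degenerate case: identical trivial labelings
        return 1.0
    return (both - expected) / (maximum - expected)

def fowlkes_mallows_index(truth, pred):
    both, t, p, _ = pair_counts(truth, pred)
    return both / math.sqrt(t * p) if t and p else 0.0

truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(round(adjusted_rand_index(truth, pred), 4))    # 0.3243
print(round(fowlkes_mallows_index(truth, pred), 4))  # 0.6172
```

Both scores are invariant to a renaming of the cluster ids, which is why they are suitable for comparing a clustering with ground-truth classes.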

Figure 4.

Noise sensitivity analysis. The x-axis represents the noise ratio (0–50%), and the y-axis represents the Adjusted Rand index obtained.

4.2 Noise detecting performance

In this part, we analyze the noise detecting robustness of the proposed NonPC algorithm by assessing its performance with different amounts of noise in a data set. We use the six synthetic datasets and randomly add external noise, with the noise ratio $\alpha$ increased from 0 to 50%. The noise is generated from a Gaussian distribution.

Figure 4 presents the results for the proposed algorithm and the six comparison algorithms. The x-axis represents the noise ratio, and the y-axis represents the Adjusted Rand Index obtained. A method that maintains a larger Adjusted Rand Index across noise ratios is regarded as better. The results demonstrate that NonPC outperforms the comparison algorithms in robustness to noise: its Adjusted Rand Index values tend to be higher than those of the other methods and change less as the noise ratio increases. This implies that the proposed method efficiently detects noise and thus accurately identifies intrinsic cluster structures.

As the number of noise points increases, they may generate more similar edges of high weight. Because spectral clustering is an edge-cutting algorithm, it has difficulty finding an optimal edge cut that reveals intrinsic cluster structures. Noise-robust clustering methods perform better than spectral clustering. However, the performance of DBSCAN also degrades significantly as noise increases because this method is sensitive to its tuning parameter, which needs to be readjusted when the added noise changes the distribution of the base data. The same holds for the AP algorithm. CutPC is also based on node cutting, but it is limited by the density of clusters: the external nodes produce unbalanced densities, so the performance of CutPC is inferior to that of NonPC. The k-means algorithm cannot perform well on non-convex data sets and is sensitive to noise and outliers, so it is difficult for it to achieve optimal clustering performance with a large amount of noise. In contrast, the proposed NonPC algorithm is robust to noise because it not only ensures a consistent number of neighbors but also conducts node cutting, so that it removes the generated noise correctly and identifies intrinsic cluster structures.

Figure 5 illustrates the graphical clustering results for the proposed NonPC method. From the graphical results, we can verify that NonPC can effectively process complex and nonlinear patterns. When applied to noisy data, the proposed NonPC method shows robust clustering capability.

Table 1
The characteristics of the six real-world datasets

Datasets	Number of instances	Number of attributes	Number of clusters
Ionosphere	351	34	2
Wine	178	13	3
Seeds	210	7	3
Iris	140	4	3
Thyroid	215	5	3
HCV	615	12	4

Figure 5.

Graphical clustering results for the proposed NonPC algorithm.

5. Verification with real-world datasets

To further demonstrate the superiority and the applicability to real situations of NonPC, we test the clustering performance by using six real-world datasets from UCI. The data characteristics can be seen in Table 1. These datasets are widely used for testing in the field of machine learning.

Table 2 shows the ARI scores of all algorithms. Compared with the graph-based clustering algorithms SC, CutPC, and WC, NonPC achieves the highest ARI on the Wine, Seeds, Iris, Thyroid, and HCV datasets.

Table 2
ARI of real-world datasets

Datasets	K-means	DBSCAN	AP	SC	CutPC	WC	NonPC
Ionosphere	0.7011	0.7230	0.0297	0.1032	0.3987	0.5630	$-$0.0037
Wine	0.3711	0.2873	0.1706	0.3590	0.0029	0.3540	0.3684
Seeds	0.7166	0.3936	0.2772	0.6160	0.0025	0.4034	0.6162
Iris	0.7302	0.5681	0.6073	0.7591	0.4819	0.5300	0.8274
Thyroid	0.5791	0.6869	0.1704	0.2953	0.4474	0.4698	0.7039
HCV	0.4999	0.3650	0.0159	0.1060	0.3805	0.0888	0.3844

Table 3

NMI of real-world datasets

Datasets	K-means	DBSCAN	AP	SC	CutPC	WC	NonPC
Ionosphere	0.1151	0.2950	0.1334	0.0672	0.3810	0.3329	0.0213
Wine	0.4287	0.3879	0.3360	0.4199	0.0273	0.4047	0.5944
Seeds	0.6949	0.4750	0.4958	0.6436	0.0123	0.4034	0.6632
Iris	0.7581	0.7336	0.6657	0.8056	0.5530	0.2506	0.8560
Thyroid	0.4945	0.5291	0.3850	0.8007	0.3832	0.4585	0.6172
HCV	0.2679	0.5530	0.1828	0.1800	0.2468	0.1508	0.4043

Table 4

FMI of real-world datasets

Datasets	K-means	DBSCAN	AP	SC	CutPC	WC	NonPC
Ionosphere	0.6100	0.7456	0.3635	0.5736	0.8062	0.5663	0.7252
Wine	0.1685	0.5692	0.3355	0.5762	0.4976	0.5696	0.8068
Seeds	0.8106	0.6392	0.4602	0.7459	0.4815	0.6108	0.7462
Iris	0.8208	0.7714	0.7208	0.8407	0.7277	0.7250	0.8827
Thyroid	0.8063	0.8711	0.4279	0.6046	0.7682	0.8030	0.8777
HCV	0.8801	0.8830	0.1961	0.5640	0.8714	0.4981	0.9099

Table 3 lists the NMI scores. NonPC obtains the best NMI on the Wine and Iris datasets and the second best on the Seeds, Thyroid, and HCV datasets. This comparison also illustrates that NonPC can generate desirable clustering results on tightly-connected patterns. As for FMI, shown in Table 4, NonPC achieves the best clustering performance on the Wine, Iris, Thyroid, and HCV datasets. Moreover, the FMI of NonPC exceeds that of the other graph-based clustering methods on all datasets except Ionosphere.

In summary, for clusters with different densities and nonlinear distributions, the proposed NonPC method performs better than the other graph-based clustering methods (SC, CutPC, and WC) on most of the tested datasets and is competitive with the other types of clustering algorithms. To some extent, this confirms that the proposed algorithm contributes to the progress of graph-based clustering algorithms.

6. Conclusions

In this paper, we propose a non-parametric clustering algorithm with adaptive noise detecting based on a weighted natural neighbor graph. The constructed weighted natural neighbor graph represents the original data patterns more appropriately. In the noise detecting process, we analyze and confirm the effectiveness of graph-based connectivity features for noise detection. The connectivity information is extracted and used by k-means to identify the noise adaptively. According to the extensive experiments, the proposed graph-based clustering algorithm shows superior performance in detecting noise and in dealing with nonlinear and tightly connected patterns in a non-parametric way. The proposed clustering approach thus helps to advance research on graph-based clustering algorithms. However, the method has some limitations; for example, it is unable to identify clusters with spatial overlap.

Acknowledgments

This work is supported by the Natural Science Foundation of China (No. 41804112), the Youth Project of Science and Technology Research Program of Chongqing Education Commission of China (No. KJQN202001143) and the High Quality Development Plan of Graduate Education of Chongqing University of Technology (No. gzlcx20223216).

References

1. Bose and Mali, Type-reduced vague possibilistic fuzzy clustering for medical images, Pattern Recognition 112(2) (2021), 107784.

2. Peng, Ser, Chen and Lin, Robust semi-supervised nonnegative matrix factorization for image clustering, Pattern Recognition 111 (2021), 107683.

3. Sridhar R.S., Prasad and Balakrishnan, Spatio-Temporal association rule based deep annotation-free clustering (STAR-DAC) for unsupervised person re-identification, Pattern Recognition 122 (2022), 108287.

4. Liu, Qin, Wan, Pan, Gao and Wen, A quantum algorithm for solving eigenproblem of the Laplacian matrix of a fully connected graph, arXiv preprint arXiv:2203.14451, 2022.

5. Dal Col and Petronetto, Graph regularization multidimensional projection, Pattern Recognition 129 (2022), 108690.

6. Stevens S.S., Mathematics, measurement and psychophysics, Handbook of Experimental Psychology, 1951.

7. Zhu, Feng and Huang, Natural neighbor: A self-adaptive neighborhood method without parameter K, Pattern Recognition Letters 80 (2016), 30–36.

8. Yan, Huang and Zhao, Hierarchical Superpixel Segmentation by Parallel CRTrees Labeling, IEEE Transactions on Image Processing 31 (2022), 4719–4732.

9. Mehta and Pasari, Hyperspectral Image Clustering Using Nearest Neighbor, in: 2021 IEEE International India Geoscience and Remote Sensing Symposium (InGARSS), 2021, pp. 194–197.

10. Qin, Zhu, Wang and Li, A Novel clustering method based on hybrid K-nearest-neighbor graph, Pattern Recognition 74 (2018), 1–14.

11. Xiong, Dai, Zha, Zhang and Dan, A novel graph-based clustering method using noise cutting, Information Systems 91 (2020), 101504.

12. Zhang, Ding, Wang and Hou, Chameleon algorithm based on improved natural neighbor graph generating sub-clusters, Applied Intelligence 51(11) (2021), 8399–8415.

13. Aksaç, Özyer and Alhajj, CutESC: Cutting edge spatial clustering technique based on proximity graphs, Pattern Recognition 96 (2019), 106948.

14. Kim and Kim S.B., Outer-Points shaver: Robust graph-based clustering via node cutting, Pattern Recognition 97 (2020), 107001.

15. Bentley J.L., Multidimensional binary search trees used for associative searching, Communications of the ACM 18(9) (1975), 509–517.

16. Tarjan, Depth-first search and linear graph algorithms, in: 12th Annual Symposium on Switching and Automata Theory (swat 1971), 1971, pp. 114–121.

17. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Univ. California Press, Berkeley, Calif., Vol. I: Statistics, 1967, pp. 281–297.

18. Ester, Kriegel, Sander and Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996, pp. 226–231.

19. Ng A.Y., Jordan M.I. and Weiss, On Spectral Clustering: Analysis and an Algorithm, in: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 2001, pp. 849–856.

20. Xia, Zhang, Wang, Han and Yan, WC-KNNG-PC: Watershed clustering based on k-nearest-neighbor graph and Pauta Criterion, Pattern Recognition 121 (2022), 108177.

21. Frey B.J. and Dueck, Clustering by passing messages between data points, Science 315(5814) (2007), 972–976.

22. McInnes and Healy, Accelerated Hierarchical Density Based Clustering, in: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017, pp. 33–42.

23. Vinh N.X., Epps and Bailey, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, J. Mach. Learn. Res. 11 (2010), 2837–2854.
