Abstract
Density peaks clustering (DPC) is as an efficient algorithm due for the cluster centers can be found quickly. However, this approach has some disadvantages. Firstly, it is sensitive to the cutoff distance; secondly, the neighborhood information of the data is not considered when calculating the local density; thirdly, during allocation, one assignment error may cause more errors. Considering these problems, this study proposes a domain density peak clustering algorithm based on natural neighbor (NDDC). At first, natural neighbor is introduced innovatively to obtain the neighborhood of each point. Then, based on the natural neighbors, several new methods are proposed to calculate corresponding metrics of the points to identify the centers. At last, this study proposes a new two-step assignment strategy to reduce the probability of data misclassification. A series of experiments are conducted that the NDDC offers higher accuracy and robustness than other methods.
Introduction
As an unsupervised process, clustering aims to partition data points to different clusters according to the similarity between data samples. The data in the same category should be as similar as possible, and those in different categories should be as different as possible. Clustering has been widely used in image processing [1], medical information analysis [2], pattern recognition [3], geography [4], and biological information analysis [5]. In the past few decades, numerous clustering methods have been proposed, including partition-based clustering [6, 7, 8], grid-based clustering [9, 10], hierarchical clustering [11, 12], model-based clustering [13, 14], depth clustering [15, 16], and density-based clustering [17]. Density-based clustering methods have become a research hotspot because this method can process clusters of any shape and number. The most representative method is density-based spatial clustering of applications with noise (DBSCAN) [18], which can identify clusters and filter outliers without specifying the number of clusters. However, it is sensitive to the parameter settings, such as the radius and density. The ordering points used to identify the clustering structure (OPTICS) algorithm is an improvement of DBSCAN and it can obtain clusters of different densities by ordering the queue of the output [19]. Subsequently, many variants of these two methods have been developed and discussed to improve the clustering performance [20, 21].
In 2014, Rodriguez and Laio presented a novel density-based algorithm called density peak clustering (DPC) [22]. It uses the local density and
In summary, the existing methods have two limitations: Firstly, the DPC algorithm and its improved are sensitive to parameters, the changes of
We use the natural neighbors instead of the nearest neighbors to describe the relationship between data. Each point has different numbers of natural neighbors. Points in dense regions have more neighbors than in sparse regions, and outliers have no natural neighbors. This neighborhood relationship can better reflect the data distribution and the relationship between data than the nearest neighbor method. We propose several new methods to calculate corresponding metrics of the points based on natural neighbors. Most DPC extensions select the same neighborhood size for all data. In contrast, natural neighbors can adaptively define the neighborhood size of each point. Thus, based on natural neighbors, we calculate the local density, the local domain density and We design a two-step assignment strategy based on the natural eigenvalue. In DPC, a single assignment error may produce additional errors. In our proposed, we divide the points into dense points and remaining points. If the number of natural neighbors of a point is greater than the ratio of the natural eigenvalue, it is a dense point; otherwise, it is a remaining point. The dense points around the same center should be in the same cluster; therefore, the first step is assigning the dense points to form the cluster prototype. In the second step, the remaining points are assigned to their superior points. Since the clustering prototype has been formed, it does not require too many times to find the superior point, thereby reducing the probability of generating errors. The proposed algorithm is effective and robust. Existing algorithms need to adjust the cutoff distance or the number of neighbors, and a change in parameters significantly influences the results. Our algorithm only needs to adjust one parameter to determine the definition of the dense point, and changing the parameter does not substantially impact the clustering results. Experiments show that our algorithm is extremely robust and can be applied to datasets with any cluster size.
The remainder of this paper is organized as follows: Section 2 outlines the related works. Section 3 describes our proposed NDDC method. In Section 4, the experimental results are given and analyzed. Section 5 summarizes the paper.
In this section, we introduce the DPC process and recent studies. Regarding the natural neighbor, we introduce various neighborhood relationships and describe the concepts of the natural neighbor, stable state, natural eigenvalue, and noise.
DPC algorithm
DPC is based on two premises: 1) the local density of the center point is greater than that of its neighbors. 2) The distance between the center points of different clusters is relatively large. DPC uses the local density and the distance from the nearest larger density point to describe the data points. DPC uses the cutoff distance method defined in Eq. (1) or the kernel function method in Eq. (2) to calculate the local density
where
Equation (3) is used to calculate the parameter distance
The objective is to find a data point that is closest to point
DPC draws a decision graph with
After the selection, the non-central point is assigned to its superior point’s cluster. If the superior point does not yet belong to a cluster, the algorithm finds the superior of the superior point until the clustering is completed.
The original DPC algorithm has been widely used in various fields, but it has several shortcomings. First, the definition of the local density requires the participation of all data, which is costly, and the neighborhood information of the data is not considered. Therefore, Du et al. incorporate the KNN into the DPC and propose the DPC-KNN [23], which limits the calculation space to the range of KNNs. They also use principal component analysis to preprocess high-dimensional data. Jiang et al. [24] also use the KNN to improve the calculation method of the
In recent years, the concepts of the reverse nearest neighbor and mutual nearest neighbor have also been applied to clustering [32, 33]. However, these methods have problems with neighborhood size selection. In the following section, we describe our proposed improvements.
Natural neighbors are inspired by human society. As a new concept of neighbors, this method can effectively solve the problem of neighborhood selection [27]. When two people regard each other as friends, they are natural neighbors. In a society, if everyone has at least one friend, the society is in a stable state. Similarly, if every point except noise has at least one natural neighbor, the dataset is in a relatively stable state. Several concepts related to natural neighbors are explained below.
(K-Nearest Neighbor).
Suppose the dataset is
where
(Reverse K-Nearest Neighbor).
The reverse KNN of
(Natural neighbor).
If the natural neighbor of point
(Stable state and natural eigenvalue).
A natural neighbor search is a process of gradually expanding the search range until it reaches a stable state. The definition of the steady state is as follows:
where
(Noise points).
When the search range is
We use a simple dataset to show the acquisition process of natural neighbors. Figure 1a–c shows the neighbor relationship obtained under the continuous expansion of the neighborhood search range. First, each point finds its nearest neighbor, and then verifies the reverse neighbor relationship. If the two data points are each other’s nearest neighbor, the neighbor relationship is established. AC, DG, and EF have established neighbor relationships with each other in Fig. 1a. Then expand the neighbor range and get the neighbor relationship as shown in Fig. 1b. All points except point H have at least one neighbor. After the neighbor range is expanded again, point H still has no neighbor, so H is an outlier, the dataset reaches the stable state, and the natural eigenvalue is the current neighborhood search range.
The process of obtaining the natural neighbors of the dataset. (a) neighbor relationship (
Because the natural neighbor search process is self-adaptive and the natural neighbor relationship is effective and robust to describe the data, it has been used in machine learning applications, such as classification and anomaly detection [34]. In this paper, we use this approach to improve the DPC.
This section introduces the details of the proposed algorithm. First, we obtain the natural neighbors of each point and the global natural eigenvalue; the local density
Workflow of the proposed NDDC algorithm.
We obtain the neighborhood information based on the natural neighbors and calculate the relevant metric values. This approach does not require manual selection of the k value to determine neighborhood information.
The first step is obtaining the natural neighbors and natural eigenvalue. We initialize the
where
Next, we calculate the local domain density based on the local density to highlight the center points.
(Local domain density).
The local domain density of point
The weight coefficient
Subsequently, for each point, the algorithm finds the nearest point with the highest local domain density. The distance between these two points is the
After obtaining the
This measure requires parameter
The detailed algorithm is shown in Algorithm 1.
[H] : Calculate
get the maximum Domain Density
This step assigns dense points to form the prototype of the cluster. In the allocation process, we introduce a ratio parameter
(Dense point).
The dense point is a point with many natural neighbors. The global natural eigenvalue
First, the label of the center point is assigned to the dense point of its natural neighbors. Then, the label is assigned to the natural neighbor’s natural neighbor. This process is repeated until all the dense points are allocated.
The ratio parameter
Details of the two stages of label assignment. (a) The original data (Aggregation); (b) identified cluster prototype; (c) assigning labels to the remaining points.
After assigning the dense points and forming the cluster prototype, the final step is to assign the remaining data points to form the final cluster. We use the DPC, i.e., we find a point that is closest to it and whose local domain density is greater than that of the point. The label of this point is the same as that of its superior point. Since the clustering prototype has been formed, the superior point can be found easily, thereby minimizing the “domino effect”.
Figure 3 depicts the clustering results of the two-stage strategy. Figure 3a shows the data distribution of the dataset consisting of seven clusters. First, the natural neighbors of each data point are obtained through Algorithm 1, and the global natural eigenvalue is 5. Then we select several points with the highest
Algorithm 2 provides the steps of the two-step allocation strategy, which is described in Section 3.2 and Section 3.3. The first step of the allocation strategy uses a queue as the data structure and enqueue and dequeue operations to allocate the dense points. The time complexity of this step is
We also analyze the space complexity. A distance matrix is constructed when calculating the local density and other metrics, and the space complexity is
[H] : NDDC’s two-step allocation strategy[1] PMC list, ratio parameter
Experimental results and analysis
We use some synthetic datasets and the UCI datasets for testing to verify the effectiveness of the NDDC algorithm. The experimental results are compared with the K-Means, DBSCAN [18], DPC [22], FKNN [29], SNN-DPC [31], DBFN [30], and FSDPC algorithms [28].
Evaluation measures
The evaluation measures used in this article include the clustering accuracy (Acc), normalized mutual information (NMI), and the adjusted Rand index (ARI). The Acc is defined in Eq. (17):
where
Given two variables
where
where
The ARI is often used to evaluate clustering results. The Rand index (RI) is the proportion of correctly allocated results.
The ARI is defined based on the RI and measures the degree of agreement between the two data distributions:
where
We make 16 databases for the results in the experiments, including 6 synthetic datasets and 10 UCI real datasets. In order to ensure the reliability of the dataset, all the test datasets we selected did not contain missing values. The details of the synthetic and UCI real datasets are listed in Tables 1 and 2, respectively.
Synthetic datasets
Synthetic datasets
UCI datasets
Performances of different algorithms on different synthetic datasets
Mean and variance of the accuracy of different algorithms
The clustering results of the 7 algorithms on the Aggregation dataset.
The clustering results of the 7 algorithms on the Path-based dataset.
Performances of different algorithms on different UCI datasets
The mean and variance of the clustering accuracy of each algorithm on the four datasets
Robustness analysis of the four algorithms on the E-coli dataset.
Robustness analysis of the four algorithms on the Heart-failure-clinical dataset.
Robustness analysis of the four algorithms on the Libras Movement dataset.
Robustness analysis of the four algorithms on the Seeds dataset.
We adjust the parameters in each algorithm to reach the optimum state. The K-Means algorithm only requires the number of clusters. DBSCAN requires adjusting two parameters (eps and minpts); we set
Results on the synthetic datasets
Six synthetic datasets are used to test the performance of the proposed algorithm. These datasets have different distributions and sizes so that various situations can be simulated. For the NDDC, we select the kernel function with the best clustering performance. The final clustering results are listed in Table 3. Most of the algorithms provide satisfactory clustering results on the 4k2-far dataset, and our proposed method also obtained the best performance. The proposed method has the highest performance on the Aggregation and Path-based datasets when using the cutoff kernel and Gaussian kernel, respectively. Our proposed algorithm (exponential kernel) performs well on the Compound dataset based on the NMI indicator. None of the other algorithms reach 0.9. The FKNN algorithm has the best clustering performance on the Flame dataset, and the DBFN algorithm has the best clustering performance on the R15 dataset. FSDPC has the same excellent performance as our proposed algorithm on Aggregation dataset. However, the NDDC algorithm does not perform significantly better than these optimal algorithms, which is reflected in the numerical difference of about 0.01. We calculated the mean and variance of the Acc of each algorithm. Our proposed algorithm has the highest average accuracy and the smallest variance, and it is the most stable method. The results are listed in Table 4.
We choose the Aggregation and Path-based datasets to display the clustering results of different algorithms visually. Figures 4 and 5, respectively, show the clustering results of K-Means, DBSCAN, DPC, FKNN, SNN, DBFN, FSDPC, and the proposed NDDC algorithm on these two datasets. For the Aggregation dataset, Fig. 4h shows the results obtained by the NDDC algorithm, and Fig. 4a–g shows the results of the other state-of-the-art algorithms. The K-Means incorrectly divides a large cluster into three clusters, and DBSCAN does not predict the correct number of clusters; thus, these two algorithms cannot complete the basic clustering tasks. The DPC, FKNN, SNN, and DBFN algorithms can complete the basic clustering tasks, but there are some allocation errors. For example, in Fig. 4e, the three clusters in the lower-left corner have assignment errors, and there are errors in the upper left corner in Fig. 4f. Our proposed algorithm does not only complete the basic clustering tasks but also has the highest accuracy with FSDPC among all algorithms. Most algorithms cannot identify the true distribution of the Path-based dataset. The black dots in Fig. 5b represent unallocated data, indicating that the DBSCAN algorithm recognizes these data as noise because the density is less than the specified threshold. The K-Means, DPC, FKNN, DBFN, and FSDPC algorithms cannot identify the ring clusters. Only the SNN and our proposed algorithm correctly complete the clustering task. Based on the clustering accuracy and other indicators, our proposed algorithm provides the best performance.
Results on the UCI real datasets
Ten synthetic datasets are used to test the performance of the proposed algorithm. These datasets are often used to test the classification or clustering performance. The clustering results are listed in Table 5. On the Ionosphere and Teaching-Assistant-Evaluation datasets, our proposed algorithm provides the best clustering performance, with the top performance of all indicators. On the Haberman, Heart-failure-clinical, Banknote-authentication, and Shill Bidding datasets, our algorithm also reaches the top performance in several indicators. On the Iris, Libras Movement, E-coli, and Seeds datasets, our algorithm does not achieve the optimal results, but it exhibits strong robustness. We selected the Libras Movement, E-coli, Seeds and Heart-failure-clinical datasets to analyze the robustness of the NDDC, SNN, FKNN, DBFN. Since the parameter settings of FSDPC algorithm have been given, the robustness analysis of FSDPC algorithm is not performed. Figures 6 to 9 show the changes in the clustering accuracy of each algorithm as the parameters change. The X-axis represents the value of the parameter, and the Y-axis represents the clustering accuracy. On the E-coli dataset, the highest accuracy of the SNN algorithm is 0.84, but when K exceeds 20, the accuracy of the algorithm drops sharply. In contrast, the accuracy of the NDDC algorithm is maintained at about 0.7. On the Seeds dataset, the accuracy of our algorithm is stable at 0.8–0.9, whereas that of the SNN algorithm fluctuates considerably. The proposed algorithm is the most stable on the Heart-failure-clinical dataset, with almost no fluctuations. The mean and variance of the clustering accuracy of each algorithm on each dataset are listed in Table 6. The variance of the NDDC is the smallest on all datasets, and its average clustering accuracy is the highest on most datasets, demonstrating the superiority of the NDDC.
In summary, our proposed algorithm has better clustering results than the other algorithms on some datasets, and the parameters of the NDDC are easier to adjust than those of the other algorithms. The experiments show that the NDDC algorithm is highly robust on most datasets.
Conclusion
We used the natural neighbor instead of the nearest neighbor and proposed the NDDC to overcome the deficiencies of the DPC and its improvements. The algorithm first obtains the global natural eigenvalue and the natural neighbors of each point. Subsequently, the local density, local domain density,
In the future, we will adapt the NDDC to process streaming data to enable data analysis in the production environment, solve practical problems, and improve the production efficiency in related fields.
Footnotes
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants with No. 62273164, No. 61873324, No. 61903156, the Natural Science Foundation of Shandong Province under Grant with No. ZR2019MF040, and the Higher Educational Science and Technology Program of Jinan City under Grant with No. 2020GXRC057. We thank openbiox community and Hiplot team (https://hiplot.com.cn) for providing technical assistance and valuable tools for data analysis and visualization.
