Abstract
DBSCAN (density-based spatial clustering of applications with noise) is one of the most widely used density-based clustering algorithms, which can find arbitrary shapes of clusters, determine the number of clusters, and identify noise samples automatically. However, the performance of DBSCAN is significantly limited as it is quite sensitive to the parameters of eps and MinPts. Eps represents the eps-neighborhood and MinPts stands for a minimum number of points. Additionally, a dataset with large variations in densities will probably trap the DBSCAN because its parameters are fixed. In order to overcome these limitations, we propose a new density-clustering algorithm called GNN-DBSCAN which uses an adaptive Grid to divide the dataset and defines local core samples by using the Nearest Neighbor. With the help of grid, the dataset space will be divided into a finite number of cells. After that, the nearest neighbor lying in every filled cell and adjacent filled cells are defined as the local core samples. Then, GNN-DBSCAN obtains global core samples by enhancing and screening local core samples. In this way, our algorithm can identify higher-quality core samples than DBSCAN. Lastly, give these global core samples and use dynamic radius based on k-nearest neighbors to cluster the datasets. Dynamic radius can overcome the problems of DBSCAN caused by its fixed parameter eps. Therefore, our method can perform better on dataset with large variations in densities. Experiments on synthetic and real-world datasets were conducted. The results indicate that the average Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI) and V-measure of our proposed algorithm outperform the existing algorithm DBSCAN, DPC, ADBSCAN, and HDBSCAN.
Introduction
Clustering is always favored by data processing scientists. It is the most widely used unsupervised learning technique. Clustering aims to separate data samples into specific classes by comparing the similarity between the data samples. Samples of the same cluster are highly similar to those between different clusters [1]. As a fundamental technique in machine learning, clustering has applications in a wide variety of domains such as social networks [2], neural networks [3], medical science [4], cyberspace security [5], Community detection [6, 7] and fuzzy time series forecasting [8, 9].
DBSCAN [10] is a widely used density-based clustering algorithm. In DBSCAN, clusters are defined as regions connected by dense areas, which are separated by sparse areas. There are three types of samples defined in DBSCAN, namely core samples, boundary samples and noise samples. A core sample is defined by its parameters eps and MinPts that sample p is a core sample if at least MinPts samples are within eps-neighborhood of sample p. DBSCAN works as follows. It scans the dataset and checks each point p whether it is a core sample or not. If p is a core sample, then it creates a new cluster with p and then retrieves all the samples density reachable from p, and marks them with the same cluster label. The process terminates when no new objects can be added to any cluster. DBSCAN can find arbitrary shapes of clusters and determine the number of clusters, as well as identify noise samples automatically. However, there are two major shortcomings of DBSCAN: (1) The main weakness of DBSCAN is its sensitivity to the parameters eps and MinPts. DBSCAN will perform differently even if the parameters are similar. (2) The assumption of DBSCAN that the density of each cluster is uniformly distributed may not be realistic. It may fail on datasets having clusters with different densities. The factors leading to the above shortcomings are as follows. (1) The parameters of DBSCAN are determined by hand. Thus, multiple manual experiments should be taken to determine the appropriate parameters without any prior knowledge. (2) Small eps and MinPts are suitable for clusters with high-density, while great eps and MinPts are appropriate to clusters with low-density. Thus, it may fail on dataset with large variational density
In this paper, we propose a combination of grid technique and the nearest neighbor to overcome the drawbacks of DBSCAN. GNN-DBSCAN is a new density-based algorithm proposed in this paper. “G” stands for an adaptive grid that is used to divide dataset and “NN” represents the Nearest Neighbor. Firstly, the adaptive grid is used to divide the dataset. With the help of grid, we can get infinite subspaces of dataset. Then we identify all the nearest neighbor lying in every filled cell and adjacent filled cells as the local core samples. The nearest neighbor stands for the highest local density. The way to find local core samples is parameter-free and does not depend on the density distribution. Therefore, the quality of core samples identified by GNN-DBSCAN is better than DBSCAN due to the fact that DBSCAN relies on density distribution heavily. Lastly, we use dynamic radius for the clustering process while DBSCAN adopts fixed eps.
The contributions of this paper are in three-fold: This paper proposes a new density-based clustering algorithm called GNN-DBSCAN, which uses an adaptive grid to divide the dataset and defines the nearest neighbor as local core samples. Our method uses a parameter-free way to find local core samples. What’s more, the identification of local core samples has no concern with density distribution of dataset. In this way, GNN-DBSCAN can overcome the drawbacks of DBSCAN. We provide a new parameter-free way to find local core samples, which can overcome the drawbacks of DBSCAN that requires many manual experiments to determine parameters. An adaptive grid is used in our method. Thus, we can avoid the uncertainty caused by manual selection of the cell length.
The rest of this paper is organized as follows. Section 2 describes the related work of DBSCAN. Section 3 discusses our proposed algorithm. Section 4 shows the results of experiments compared with other clustering algorithms. Conclusion and future work are given in Section 5
Related work
As mentioned earlier, two parameters eps and MinPts are used to identify core samples in DBSCAN. A core sample p is marked by the way that if the number of samples lying in sample p’s eps-neighborhood is greater than or equals to MinPts, then p will be marked as a core sample. After finding all core samples, the clustering process of DBSACN begins with a core sample and links all directly density-reachable samples together to form a cluster. DBSCAN can discover clusters of arbitrary shapes and different sizes while being robust to noise points. However, the performance has been significantly limited due to it is quite sensitive to the parameters of eps and MinPts. For one thing, the performance needs to go through multiple experiments for setting appropriate parameters. For another, the fixed parameters are difficult to adapt to the density variation of clusters. Many studies focus on improving DBSCAN or overcoming the drawbacks brought by the fixed parameters. MDBSCAN [11] algorithm can generate two different parameters by using a statistical method. Based on the parameters, clustering results of MDBSCAN can be more accurate with varied densities. The parameters of MDBSCAN are calculated from the density distribution of dataset while the parameters of DBSCAN are determined by hand. DPC [12] can find cluster centers through two assumptions that cluster centers have higher density than their neighbors and there is a large distance between them. Besides, DPC decides core samples by generating an additional decision graph. Zhu et al. [13] proposed a density-ratio based method to overcome the weakness that density-based clustering algorithms like DBSCAN have difficulty identifying all clusters if there are great variations in the density of clusters. Lu et al. [14] use random walk model to estimate the importance of data samples(core samples). Then, based on the core samples, they use a k-nearest neighbors chain to find the proper number of clusters and initial clusters. K-PRSCAN [15] uses PageRank algorithm to measure the importance of data points in K clusters. K-PRSCAN can distinguish both globular and non-globular clusters, and reduce the effect of noise samples by comparing the value of samples importance. HDBSCAN [16] is developed by Campello et al, which adopts a hierarchical clustering approach to improve DBSCAN. HDBSCAN can find clusters of varying densities and be more robust to the setting of parameters. NS-DBSCAN [17] is also a hierarchical clustering approach, which provide a new technique for visualizing the density distribution and indicating the intrinsic clustering structure. In addition, many approaches based on the k-nearest neighbors have been proposed because k-nearest neighbors can properly estimate the density of data sample. The Shared Nearest Neighbors (SNN) density-based clustering algorithm [18] uses SNN similarity in place of the parameter eps in DBSCAN. Moreover, [18] solves the problem that cluster data in a nonparametric way when the globular concept cannot be accepted. CMUNE [19] is a variation of the SNN algorithm, which uses mutual nearest neighbor to find dense regions. IS-DBSCAN [20], ISB-DBSCAN [21], RECORD [22], RNN-DBSCAN [23] and DBSCRN [24] use the reverse k-nearest neighbors as an estimate of observation density and identify core samples. ADBSCAN [25] is the latest density-based clustering algorithm based on the nearest neighbor. In ADBSCAN, the core sample is defined as the nearest neighbor of subgraph which is constructed by Union-Find algorithm with path compression. It uses the concept of subgraph path to set the radius of each core sample, which overcomes the shortcomings of DBSCAN brought by fixed eps. However, only one of the two nearest neighbors is kept as core samples in ADBSCAN, which makes a lot of core samples lost. Meanwhile, some other clustering methods [26–28] also attract researchers attention.
In summary, the above methods improve DBSCAN in two ways: based on the statistical approaches and based on the k-nearest neighbors approaches. These methods have their drawbacks like the parameters in statistical approaches mostly rely on the distances among samples, which would cause parameters sensitivity. Additionally, or k-nearest neighbors, studies [29, 30] have shown that k is a sensitive parameter in different datasets. What’s more, approches [11–24] define core samples from a global perspective. They overcome the two main limitations of DBSCAN to an extent, but they would ignore some key core samples that exist in different density regions [25]. finds core samples from a local scope using subgraph to divide dataset, through which reasonable core samples in different density regions can be found. But [25] only keeps one of the two nearest neighbors as the core sample, which makes lots of core samples lost. Our proposed method divides dataset with an adaptive grid and keeps both of the two nearest neighbors in cells or adjacent cells as local core samples, thereby holding local core samples to the maximum extent.
The proposed algorithm: GNN-DBSCAN
In this paper, we propose a new density-based algorithm named GNN-DBSCAN, in which “G” stands for an adaptive grid that is used to divide the dataset and “NN” represents the Nearest Neighbor. The steps of GNN-DBSCAN are as follows. Step 1, divide dataset with an adaptive grid. Then identify samples in every filled cell and adjacent filled cells as local core samples. Step 2, obtain global core samples by enhancing and screening local core samples with knowledge of probability theory. Step 3, identify cluster by using dynamic radius based on k-nearest neighbors. In this paper, results are produced by Euclidean distance. Fig. 1 shows a synthetic dataset Syn1 is divided into a finite number of cells that form a grid structure.

Dataset Syn1 is divided into finite number of cells that form a grid structure.
1: Initialize cluster id C = 0
2: grid, LCSs = ConstructGrid(X) //algorithm 2
3: GCSs = LCSCheck(X,grid,LCSs,k,core_percent) //algorithm 3
4:
5: λ g = (1 + (1/(1 + eδ(D g ))))
6: radius
g
= λ
g
*
7:
8:
9:
10: continue
11:
12:
13: label s as cluster C i : s ∈ C i
14:
15: neighbor s = {x|x ∈ X, d (s, x) ≤ radius s }
16: for each sample u in neighbor s do step 12
17:
18:
19: C = C + 1
20:
21: Set unlabeled samples to noise
22: return label

Quantile-quantile plot of LCSs density. It shows the LCSs density distribution coincides with Gaussian distribution.
1: Randomly choose
2: cell_length = 1 + e-δ(L) * d
3: Divide dataset X into grid with cell_length
4: Identify all LCS according to definition 2.
5: return grid, LCSs
There are three steps of our proposed algorithm GNN-DBSCAN: (1) Divide the dataset with an adaptive grid and find out all LCSs according to definition 2. (2) Obtain GCSs by enhancing and screening LCSs. (3) Identify clusters with GCSs using dynamic radius based on k-nearest neighbors. GNN-DBSCAN works as Algorithm 1. Algorithm 2 implements step (1) and Algorithm 3 fulfills step (2) with a series of given parameters k and core_percent.
1:
2:
3:
4: s ∈ LCSs
5:
6:
7:
8:
9:
10:
11: ρ (X) = log (ρ (X))
12: θ = mean (ρ (X)) + φ (core _ percent) * σ (ρ (X))
13:
14:
15: u ∈ GCSs
16:
17:
18: return GCSs
In algorithm 1, we use dynamic radius for the process of clustering, which is different from DBSCAN which uses fixed eps. A proper radius of sample should correspond to the region’s density that the sample belongs to. It can be overt that samples lying in high-density will have shorter distance with its k-nearest neighbors than samples lying in low-density region. Therefore, the distance of k-nearest neighbors can reflect the density distribution of the sample’s location. Thus, we use the the distance of k-nearest neighbors to compute the radius of GCSs. In this way, the radius used in GNN-DBSCAN is more reasonable than DBSCAN. Therefore, we firstly define a radius expansion factor as follows:
Algorithm 3 is the process of enhancement and filtration. Enhancement can enrich local core samples, which identifies the samples in cell as LCS only if the cell with LCS and its point_ratio is greater than the average of all cells point_ratio. Filtration help us to improve the quality of core samples, which delete LCSs lying in low-density regions. Fig. 3 shows the effect of Algorithm 3 on dataset Syn1, where the parameters k is 35 and core_percent is 0.8.

The points in red in (a) and (b) are LCS. (a) shows LCSs are identified in Algorithm 2, we can see many samples in grey (not core samples) lying in the dense region. After enhancement by Algorithm 3, the LCSs in red are shown in (b), from which we can see that many samples in dense regions are identified as LCS while in sparse regions are not. (c) shows the GCSs after filtration in Algorithm 3, some LCSs belonging to sparse regions are deleted.
Assume that n is the number of samples in the dataset. The computational complexity of GNN-DBSCAN depends on its solution to the All Nearest Neighbor (ANN) problem and the nearest neighbor problem. The computational complexity of ANN can be reduced to (n*logn) by the natural neighbor algorithm optimized by KD-tree [32] or R*Tree [33]. The nearest neighbor problem can be solved in time O (n2) by a brute force method. According to [34], the nearest neighbor problem could be solved in linear time with randomness. In summary, the complexity of the proposed clustering method is O (n * logn).
Experimental analysis
In this section, we will discuss the parameters selection of GNN-DBSCAN (GDBS) and compare the proposed method with other state-of-the-art clustering methods including DBSCAN (DBS), DPC, ADBSCAN (ADBS), and HDBSCAN (HDBS). ADBSCAN is the latest clustering algorithm based on the nearest neighbor. The code of ADBSCAN is provided by the original author and the rest comparison algorithms use open source code. We use The Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI) and V-measure to measure the performance of clustering results. All of them are well-known metrics to evaluate clustering algorithms. ARI is one of the most popular measures in cluster validation, which compares two partitions: one labeled by clustering process and the other given in the true clusters, and NMI is also a popular measure used for evaluating clustering results by employing information theory to quantify the differences between two clustering partitions. What’s more, AMI is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusters with a larger number of clusters, regardless of whether there is actually more information shared and V-measure is the harmonic mean between homogeneity and completeness. The experiments are performed on a PC with Intel i3-5005U, 8G RAM, Windows 10 64 bit OS, and the python3 programming environment.
Selection of parameters
As mentioned above, two parameters are used in GNN-DBSCAN. (1) k is the number of nearest neighbors, which decides the density of sample and provides a crucial reference for the dynamic radius used in clustering process. (2) core_percent is the core ratio of the dataset, we use it to screen LCSs and the range of core_percent is 0 ≤ core _ percent ≤ 1. Six synthetic datasets DS1_1-DS1_6 were generated to analyze the effects of the two parameters, which are depicted in Fig. 4 and details are shown in Table 1.

Datasets DS1_1 to DS1_6 used in the experiments of section 4.1.
Details of synthetic datasets DS1_1-DS1_6
In the experiments of selecting parameter k, we set core_percent to 0.8 and 0.9 for comparison and 0 ≤ k ≤ 140. The results are shown in Fig. 5, from which we can see the ARI performance of GNN-DBSCAN has slightly fluctuation when k is approximately 15. On DS1_1 to DS1_5 (without noise samples), the ARI performance tends to increase flatly when the value of k is greater than 20 and has reached a high level when k is roughly 60, which demonstrates GNN-DBSCAN’s relatively insensitivity to parameter k. It also can be observed from Fig. 5 that the ARI performance is better with higher core_percent 0.9 on DS1_1 to DS1_5 (without noise samples). But on DS1_6 (with noise samples), the core_percent 0.8 has a better ARI performance than 0.9. It is normal because DS1_6 has noise samples and lower core_percent is more proper. But DS1_1 to DS1_5 is pure, and higher core_percent is more reasonable.

Results of experiments about parameter k on DS1_1 to DS1_6. The different colors of solid lines stand for the results on different datasets and core_percent is 0.9. The same color of dashed lines represents the results and core_percent is 0.8.
In the following experiments of selecting parameter core_percent, we set the parameter k to 60 and 0 ≤ core _ percent ≤ 1. The results are depicted in Fig. 6, which reveals that on DS1_1 to DS1_5 (without noise samples) the higher core_percent is, the better ARI performance is. On the contrary, the ARI performance is good on DS1_6(noise samples are 660) when the core_percent is lower. From Fig. 6, we can see the value of ARI is greater than 0.9 when parameter core_percent is greater than 0.75. Besides, DS1_3 and DS1_6 achieve the worst ARI performance when core_percent is 1. The reasons for this include two aspects. For one thing, some LCSs lie in different cluster boundaries. For another, these LCSs are close to each other. And core_percent is 1 means the filtration process won’t work, which makes those LCSs to be treated as GCSs and take part in the clustering process. Consequently, the suitable interval of core_percent is [0.75,1) if dataset has no noise samples. On DS1_6 rthere will be the best ARI performance when core_percent is around 0.65.

Results of experiments about parameter core_percent on DS1_1 to DS1_6. The different colors of solid lines stand for the results of different datasets and k is 60.
In order to reflect the influences of parameter core_percent more visually rwe set k = 60 and adjust core_percent to plot the GCSs of DS1_6 shown in Fig. 7.

GCSs (points in red) of DS1_6 with parameters k = 60 and different core_percent. We can observe from Fig. 7 that with the core_percent decreasing rfewer GCSs are held and these GCSs are closer to the cluster center. Therefore core_percent can be regarded as the “thickness” of the cluster boundary. The smaller core_percent is rthe thicker “thickness” is.
To sum up rwe present three experimental rules for selecting the proper parameters of k and core_percent.
(1) k is associated with the size of the dataset rwhich is used to compute densities and clustering radius. A practical rule is that k = min (N/linebreak100 r 100) ±10 rwhere N is the size of dataset.
(2) core_percent help us to filter LCSs lying in low-density regions or cluster boundary rand the reasonable range is 0.75 r1) if the dataset is pure. On the contrary rif dataset has noise samples ra suggested rule is that core_percent = 1-2 * noise_ratio. Meanwhile rthere are three types of samples defined in DBSCAN named core samples rboundary samples and noise samples. Therefore rthe parameter core_percent of GNN-DBSCAN is not recommended to be set to 1.
(3) core_percent and k should be a negative correlation. The value of k should decrease correspondingly when the value of core_percent increases.
To verify the performance of our method, we compare it with other state-of-the-art algorithms on eight synthetic datasets and several benchmarking real-world datasets. The synthetic datasets are displayed in Fig. 8 and the characteristics are described in Table 2. The real-world datasets are obtained from the University of California, Irvine (UCI) Machine Learning Respository [35]. They are Wine, Iris, Ecoli, Banknote, Page block, and the details of real-world datasets are summarized in Table 3.
Details of eight synthetic datasets
Details of eight synthetic datasets
Details of real-world datasets

Eight synthetic datasets which used in our experiments of section 4.2.
We use the parameters given in the original paper, and we set the parameters not included in that paper as the default parameters for each algorithm. For GNN-DBSCAN, we set k = min (N/100, 100) ±10 and choose core_percent from interval [0.75,1) if dataset is pure. Otherwise we set core_percent = 1-2*noise_ratio. Table 4 illustrates the parameter settings of each clustering method in eight synthetic datasets. The experimental results are displayed in Fig. 9 rand the ARI rNMI rAMI and V-measure performance for synthetic datasets are shown in Table 5.
Details of parameters on synthetic datasets used in each clustering algorithm
acp is core_percent. bnp is noise_percent.
The results of each clustering algorithm on synthetic datasets

Clustering results on synthetic datasets. Grey points to stand for the noise samples.
As can be seen from the results depicted in Fig. 9 rDBSCAN ras the earliest proposal of all these algorithms rshows strong stability and only has poor performance on datasets with large different densities. For example ron DS2_4 rthe two clusters on the top of the picture are identified as the same cluster. The reason is that it is difficult to set the proper parameters to isolate the high-density samples when connecting the two clusters. Moreover ron DS2_3 rsome samples lying in the cluster boundary are identified as a cluster. For DPC rit has superior performance on DS2_1 rDS2_5 and DS2_7 rwhich consists of spherical-like clusters or linear clusters. However rit has poor performance on DS2_2 and DS2_6 because the location of these samples in the cluster is not good for the clustering process reven though some samples meet the hypothesis of DPC. The ADBSCAN algorithm is the latest clustering algorithm based on the nearest neighbors rits performance is good and stable through all the synthetic datasets. Nevertheless rsome samples located at the edge of the cluster were wrongly clustered or even a single sample was identified as a cluster such as on DS2_1 rDS2_2 rDS2_3 and DS2_4. The major reason for the phenomenon is that ADBSCAN only keeps one of the two nearest neighbors as core samples rwhich results in the loss of core samples. As for HDBSCAN rit has a good performance with the default parameters rbut it has poor performance when dataset has large noise samples or the clusters are highly close to each other.
It can be observed from the experiments on synthetic datasets from Table. 5, rGNN-DBSCAN benefits from its parameter-free method to identify LCS and holds LCS to the maximal extent rthus rit works well on most datasets like DS2_2 rDS2_7 and DS2_8. Its ARI performance surpasses that of DBSCAN on all of these synthetic datasets. It is worth noting that the average values of GNN-DBSCAN on ARI rNMI rAMI and V-measure are the highest.
Table 6 reports the results for all comparison algorithms on ARI rNMI rAMI and V-measure on real-world datasets. According to Table 6 rwe can find that GNN-DBSCA obtains the best average ARI rNMI rAMI and V-measure performance. It is obvious that we have the best performance on Pageblock and the worst performance on Wine and Iris.
The results of each clustering algorithm on real-world datasets
The test results of consuming time are given in Table 7 and Table 8. From which we can observe that our methods have larger time consumption than other comparison algorithms but close to DPC.
The consuming time on synthetic datasets
The consuming time on real-world datasets
In summary rthe overall test results are shown in this section demonstrate that our proposed method outperforms the other comparison algorithms on most datasets. What’s more rwe discuss the parameters selection and demonstrate how k and core_percent affect the algorithm’s performance with a set of experiments. From the results of synthetic and real-world datasets rwe can observe that the size of dataset have less influence on the performance of our proposed method except time consumption. Meanwhile rour method can handle the shuffled dataset because our method identifies core samples from a local perspective.
In this paper, we propose a new density-based algorithm called GNN-DBSCAN, which uses an adaptive grid to divide the dataset and defines local core samples by using the nearest neighbor. The nearest neighbor stands for the local highest density. Also, with the help of grid, GNN-DBSCAN overcomes the two major shortcomings of DBSCAN. What’s more, we compared our method with DBSCAN, DPC, ADBSCAN, and HDBSCAN on synthetic and real-world datasets and the results indicate that the average performance of our proposed algorithm is superior to other comparison algorithms. According to the results, it can be observed that the average ARI, NMI, AMI and V-measure on synthetic datasets are 0.938, 0.933, 0.932, 0.933 and the average ARI, NMI, AMI and V-measure on real-workd datasets are 0.508, 0.516, 0.487, 0.508, respectively.
However, as can be seen from experiments on real-world datasets, we are trapped by the curse of dimensionality. Consequently, the future work will focus on improving the performance of the GNN-DBSCAN algorithm in high-dimensional space.
Footnotes
Acknowledgment
The authors would like to appreciate the editor and reviewers for their valuable comments and suggestions. This work was supported in part by the National Key Research and Development Program of China [grant number 2020YFB1805400]; in part by the National Natural Science Foundation of China [grant number U1736212, U19A2068, and 62032002]; in part by the China Postdoctoral Science Foundation [grant number 2020M683345].
