GNN-DBSCAN: A new density-based algorithm using grid and the nearest neighbor

Abstract

DBSCAN (density-based spatial clustering of applications with noise) is one of the most widely used density-based clustering algorithms. It can find clusters of arbitrary shape, determine the number of clusters, and identify noise samples automatically. However, its performance is significantly limited by its sensitivity to the parameters eps and MinPts, where eps is the radius of the eps-neighborhood and MinPts is the minimum number of points required in that neighborhood. Additionally, because these parameters are fixed, DBSCAN is likely to fail on a dataset with large variations in density. To overcome these limitations, we propose a new density-based clustering algorithm called GNN-DBSCAN, which uses an adaptive Grid to divide the dataset and defines local core samples by using the Nearest Neighbor. With the help of the grid, the data space is divided into a finite number of cells. The nearest neighbors lying in every filled cell and in adjacent filled cells are then defined as the local core samples. Next, GNN-DBSCAN obtains global core samples by enhancing and screening the local core samples. In this way, our algorithm can identify higher-quality core samples than DBSCAN. Lastly, given these global core samples, it clusters the dataset using a dynamic radius based on k-nearest neighbors. The dynamic radius overcomes the problems caused by the fixed parameter eps of DBSCAN, so our method performs better on datasets with large variations in density. Experiments were conducted on synthetic and real-world datasets. The results indicate that the average Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI) and V-measure of the proposed algorithm are higher than those of the existing algorithms DBSCAN, DPC, ADBSCAN, and HDBSCAN.

Keywords

Density-based clustering algorithm; Grid; The nearest neighbor; DBSCAN

1 Introduction

Clustering has long been favored by data processing scientists and is the most widely used unsupervised learning technique. Clustering aims to separate data samples into specific classes by comparing the similarity between the data samples. Samples in the same cluster are more similar to each other than to samples in different clusters [1]. As a fundamental technique in machine learning, clustering has applications in a wide variety of domains such as social networks [2], neural networks [3], medical science [4], cyberspace security [5], community detection [6, 7] and fuzzy time series forecasting [8, 9].

DBSCAN [10] is a widely used density-based clustering algorithm. In DBSCAN, clusters are defined as dense regions that are separated by sparse areas. There are three types of samples defined in DBSCAN, namely core samples, boundary samples and noise samples. A core sample is defined by the parameters eps and MinPts: a sample p is a core sample if at least MinPts samples are within the eps-neighborhood of p. DBSCAN works as follows. It scans the dataset and checks whether each point p is a core sample. If p is a core sample, DBSCAN creates a new cluster with p, retrieves all the samples that are density-reachable from p, and marks them with the same cluster label. The process terminates when no new objects can be added to any cluster. DBSCAN can find clusters of arbitrary shape and determine the number of clusters, as well as identify noise samples automatically. However, DBSCAN has two major shortcomings: (1) Its main weakness is its sensitivity to the parameters eps and MinPts. DBSCAN can produce different results even if the parameter values are similar. (2) The assumption of DBSCAN that the density of each cluster is uniformly distributed may not be realistic, so it may fail on datasets having clusters with different densities. The factors leading to these shortcomings are as follows. (1) The parameters of DBSCAN are determined by hand. Thus, without any prior knowledge, multiple manual experiments are needed to determine appropriate parameters. (2) Small eps and MinPts are suitable for clusters with high density, while large eps and MinPts are appropriate for clusters with low density. Thus, DBSCAN may fail on datasets with large variations in density.
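The core-sample test described above can be sketched in a few lines of Python. This is a brute-force illustration rather than the original DBSCAN implementation; the function name `core_samples` and the convention that a sample counts itself inside its own eps-neighborhood are assumptions made here.

```python
import math

def core_samples(points, eps, min_pts):
    """Return the indices of samples that have at least min_pts samples
    (the sample itself included, by assumption) within distance eps."""
    cores = []
    for i, p in enumerate(points):
        # Brute-force neighborhood count, O(n) per sample.
        count = sum(1 for q in points if math.dist(p, q) <= eps)
        if count >= min_pts:
            cores.append(i)
    return cores

# Three tightly packed samples and one isolated sample.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dbscan_cores = core_samples(points, eps=0.2, min_pts=3)
```

With a single fixed `eps`, the isolated sample can never become a core sample unless `eps` is enlarged for every sample at once, which is the sensitivity discussed in the text.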

In this paper, we combine a grid technique with the nearest neighbor to overcome the drawbacks of DBSCAN. The resulting density-based algorithm is called GNN-DBSCAN, where “G” stands for an adaptive grid that is used to divide the dataset and “NN” represents the Nearest Neighbor. Firstly, the adaptive grid is used to divide the dataset, which gives a finite number of subspaces (cells). Then we identify all the nearest neighbors lying in every filled cell and in adjacent filled cells as the local core samples. The nearest neighbors stand for the highest local density. This way of finding local core samples is parameter-free and does not depend on the density distribution. Therefore, the quality of the core samples identified by GNN-DBSCAN is better than that of DBSCAN, which relies heavily on the density distribution. Lastly, we use a dynamic radius for the clustering process, while DBSCAN adopts a fixed eps.

The contributions of this paper are threefold:

This paper proposes a new density-based clustering algorithm called GNN-DBSCAN, which uses an adaptive grid to divide the dataset and defines the nearest neighbors as local core samples. Our method finds local core samples in a parameter-free way. What’s more, the identification of local core samples does not depend on the density distribution of the dataset. In this way, GNN-DBSCAN can overcome the drawbacks of DBSCAN.

We provide a new parameter-free way to find local core samples, which overcomes the drawback that DBSCAN requires many manual experiments to determine its parameters.

An adaptive grid is used in our method. Thus, we can avoid the uncertainty caused by manual selection of the cell length.

The rest of this paper is organized as follows. Section 2 describes the related work on DBSCAN. Section 3 discusses our proposed algorithm. Section 4 shows the results of experiments compared with other clustering algorithms. The conclusion and future work are given in Section 5.

2 Related work

As mentioned earlier, two parameters, eps and MinPts, are used to identify core samples in DBSCAN. A sample p is marked as a core sample if the number of samples lying in its eps-neighborhood is greater than or equal to MinPts. After finding all core samples, the clustering process of DBSCAN begins with a core sample and links all directly density-reachable samples together to form a cluster. DBSCAN can discover clusters of arbitrary shapes and different sizes while being robust to noise points. However, its performance is significantly limited because it is quite sensitive to the parameters eps and MinPts. For one thing, multiple experiments are needed to set appropriate parameters. For another, the fixed parameters are difficult to adapt to the density variation of clusters. Many studies focus on improving DBSCAN or overcoming the drawbacks brought by the fixed parameters. The MDBSCAN [11] algorithm generates two different parameters by using a statistical method. Based on these parameters, the clustering results of MDBSCAN can be more accurate on datasets with varied densities. The parameters of MDBSCAN are calculated from the density distribution of the dataset, while the parameters of DBSCAN are determined by hand. DPC [12] finds cluster centers through two assumptions: cluster centers have higher density than their neighbors, and there is a large distance between them. Besides, DPC decides core samples by generating an additional decision graph. Zhu et al. [13] proposed a density-ratio based method to overcome the weakness that density-based clustering algorithms like DBSCAN have difficulty identifying all clusters if there are great variations in the density of clusters. Lu et al. [14] use a random walk model to estimate the importance of data samples (core samples). Then, based on the core samples, they use a k-nearest neighbors chain to find the proper number of clusters and the initial clusters. K-PRSCAN [15] uses the PageRank algorithm to measure the importance of data points in K clusters. K-PRSCAN can distinguish both globular and non-globular clusters, and reduces the effect of noise samples by comparing the importance values of samples. HDBSCAN [16], developed by Campello et al., adopts a hierarchical clustering approach to improve DBSCAN. HDBSCAN can find clusters of varying densities and is more robust to the setting of parameters. NS-DBSCAN [17] is also a hierarchical clustering approach, which provides a new technique for visualizing the density distribution and indicating the intrinsic clustering structure. In addition, many approaches based on the k-nearest neighbors have been proposed because k-nearest neighbors can properly estimate the density of a data sample. The Shared Nearest Neighbors (SNN) density-based clustering algorithm [18] uses SNN similarity in place of the parameter eps in DBSCAN. Moreover, [18] addresses the problem of clustering data in a nonparametric way when the globular concept cannot be accepted. CMUNE [19] is a variation of the SNN algorithm, which uses mutual nearest neighbors to find dense regions. IS-DBSCAN [20], ISB-DBSCAN [21], RECORD [22], RNN-DBSCAN [23] and DBSCRN [24] use the reverse k-nearest neighbors as an estimate of observation density and to identify core samples. ADBSCAN [25] is the latest density-based clustering algorithm based on the nearest neighbor. In ADBSCAN, the core sample is defined as the nearest neighbor of a subgraph, which is constructed by the Union-Find algorithm with path compression. It uses the concept of subgraph path to set the radius of each core sample, which overcomes the shortcomings of DBSCAN brought by the fixed eps. However, only one of the two nearest neighbors is kept as a core sample in ADBSCAN, so many core samples are lost. Meanwhile, some other clustering methods [26–28] have also attracted researchers’ attention.

In summary, the above methods improve DBSCAN in two ways: through statistical approaches and through k-nearest neighbors approaches. These methods have their drawbacks. The parameters in statistical approaches mostly rely on the distances among samples, which causes parameter sensitivity. Additionally, for k-nearest neighbors, studies [29, 30] have shown that k is a sensitive parameter across different datasets. What’s more, approaches [11–24] define core samples from a global perspective. They overcome the two main limitations of DBSCAN to an extent, but they may ignore some key core samples that exist in regions of different density. ADBSCAN [25] finds core samples from a local scope, using subgraphs to divide the dataset, through which reasonable core samples in regions of different density can be found. But [25] only keeps one of the two nearest neighbors as the core sample, so many core samples are lost. Our proposed method divides the dataset with an adaptive grid and keeps both of the two nearest neighbors in cells or adjacent cells as local core samples, thereby retaining local core samples to the maximum extent.

3 The proposed algorithm: GNN-DBSCAN

In this paper, we propose a new density-based algorithm named GNN-DBSCAN, in which “G” stands for an adaptive grid that is used to divide the dataset and “NN” represents the Nearest Neighbor. The steps of GNN-DBSCAN are as follows. Step 1: divide the dataset with an adaptive grid, then identify the nearest neighbors in every filled cell and in adjacent filled cells as local core samples. Step 2: obtain global core samples by enhancing and screening the local core samples with knowledge of probability theory. Step 3: identify clusters by using a dynamic radius based on k-nearest neighbors. In this paper, all results are produced with the Euclidean distance. Fig. 1 shows a synthetic dataset, Syn1, divided into a finite number of cells that form a grid structure.

Fig. 1

Dataset Syn1 divided into a finite number of cells that form a grid structure.

Algorithm 1 GNN-DBSCAN(Dataset X, k, core_percent)

1: Initialize cluster id C = 0

2: grid, LCSs = ConstructGrid(X) //algorithm 2

3: GCSs = LCSCheck(X,grid,LCSs,k,core_percent) //algorithm 3

4: for each global core sample core_g in GCSs do

5: λ_g = (1 + (1/(1 + e^δ(D_g))))

6: radius_g = λ_g * $\frac{\sum_{j = 1}^{k} d (x_{i}, x_{i}^{j})}{k}$

7: end for

8: for each sample s in X do

9: if s ∈C_i or s ∉ GCSs then

10: continue

11: end if

12: if s ∉ C_i then

13: label s as cluster C_i: s ∈ C_i

14: if s ∈ GCSs then

15: neighbor_s = {x|x ∈ X, d (s, x) ≤ radius_s}

16: for each sample u in neighbor_s do step 12

17: end if

18: end if

19: C = C + 1

20: end for

21: Set unlabeled samples to noise

22: return label
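The expansion loop of Algorithm 1 can be read as a breadth-first search that grows a cluster only through global core samples, each with its own radius. The sketch below is one plausible reading under that assumption, not the authors' code; the function name `expand_clusters`, the use of -1 as the noise label, and the precomputed `radius` mapping are all introduced here for illustration.

```python
import math
from collections import deque

def expand_clusters(points, gcs, radius):
    """Label samples by expanding from global core samples (GCSs).

    gcs    : set of indices of global core samples (assumed given)
    radius : dict mapping each GCS index to its dynamic radius (assumed given)
    Returns a list of labels; -1 marks samples left unlabeled (noise).
    """
    labels = [-1] * len(points)
    cluster_id = 0
    for seed in sorted(gcs):
        if labels[seed] != -1:
            continue  # already absorbed by an earlier cluster
        labels[seed] = cluster_id
        queue = deque([seed])
        while queue:
            s = queue.popleft()
            for u, q in enumerate(points):
                if labels[u] == -1 and math.dist(points[s], q) <= radius[s]:
                    labels[u] = cluster_id
                    if u in gcs:
                        queue.append(u)  # only core samples keep expanding
        cluster_id += 1
    return labels

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (50, 50)]
labels = expand_clusters(points, gcs={0, 1, 3}, radius={0: 1.2, 1: 1.2, 3: 1.5})
```

In this toy run the first three samples form one cluster, the next two form another, and the far-away sample stays unlabeled. Non-core samples are labeled but never expanded from, which mirrors the role of boundary samples in DBSCAN.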

3.1 Basic concepts

Definition 1. Nearest Neighbor: Given a set P of points in a d-dimensional space R^d, construct a data structure which, given any query point q, finds the point p in P with the smallest Euclidean distance to q. Then, the point q and the point p are the nearest neighbors.

Definition 2. Local Core Sample (LCS): Many cells are obtained after an adaptive grid is used to divide the dataset. In each cell with more than two samples, the nearest neighbors in it are defined as Local Core Samples. Meanwhile, the nearest neighbors that lie in adjacent cells are also regarded as Local Core Samples. Based on LCSs, our method can obtain high-quality core samples called global core samples, which are defined in Definition 4.

Definition 3. Density: The density of a sample is defined by the function ρ(.). $ρ (x_{i}) = \frac{\sum_{j = 1}^{k} d (x_{i}, x_{i}^{j})}{k}$ (1) where x_i ∈ dataset X, $x_{i}^{j}$ (j = 1,2,...,k) is the j-th nearest neighbor of x_i, d(.) represents the distance between x_i and $x_{i}^{j}$, and k is a parameter of GNN-DBSCAN, which stands for the number of nearest neighbors of a sample, with 0 ≤ k ≤ |X|. Since ρ is an average distance, a smaller value of ρ corresponds to a denser region.
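Eq. (1) can be evaluated directly, as in the brute-force sketch below. The function name `knn_mean_distance` is hypothetical, and excluding the sample itself from its own neighbor list is an assumption made here; an actual implementation would more likely use a KD-tree.

```python
import math

def knn_mean_distance(points, i, k):
    """rho(x_i) from Eq. (1): mean distance from sample i to its k nearest
    neighbors (the sample itself is excluded, by assumption)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return sum(dists[:k]) / k

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
rho0 = knn_mean_distance(points, 0, k=2)  # neighbors at distance 1 and 2
rho3 = knn_mean_distance(points, 3, k=2)  # neighbors at distance 9 and 10
```

The sample near the other points gets a small ρ, while the remote sample gets a large one, so low ρ indicates a dense neighborhood.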

Definition 4. Global Core Sample (GCS): Using LCSs for clustering directly may result in poor performance, because many LCSs lie in low-density regions. To deal with this problem, we compute a threshold θ with Eq. (2). Before computing θ, the density of each sample is transformed by the logarithmic function, which makes the density distribution closer to a normal distribution. We can then compute a reasonable threshold with the help of the quantile function. The quantile-quantile plot of the samples’ density is shown in Fig. 2. With θ, a Global Core Sample is defined as an LCS whose density value ρ does not exceed θ (recall that a small ρ, i.e. a small average distance to the k-nearest neighbors, indicates a dense region). $θ = \text{mean} (ρ (X)) + φ (\text{core\_percent}) * σ (ρ (X))$ (2) where mean(.) is the mean value function, φ(.) is the quantile function [31] of the normal distribution, which is widely used in probability theory to compute quantiles, σ(.) is the standard deviation function, and core_percent is the parameter of GNN-DBSCAN that stands for the ratio of core samples, with 0 ≤ core_percent ≤ 1.
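A small sketch of Eq. (2) and the resulting screening is given below, using only the Python standard library. The helper names are hypothetical, the population standard deviation (`pstdev`) is an assumption since the text does not say which estimator is used, and `NormalDist().inv_cdf` only accepts values strictly between 0 and 1, so the endpoints of the core_percent range are not handled here.

```python
import math
from statistics import NormalDist, mean, pstdev

def density_threshold(rho, core_percent):
    """Return (log-densities, theta) following Eq. (2).
    rho holds the k-NN mean distance of every sample (all values > 0)."""
    log_rho = [math.log(r) for r in rho]
    z = NormalDist().inv_cdf(core_percent)  # standard normal quantile
    theta = mean(log_rho) + z * pstdev(log_rho)
    return log_rho, theta

def global_core_samples(lcs_indices, rho, core_percent):
    """Keep the LCSs whose log-density does not exceed theta."""
    log_rho, theta = density_threshold(rho, core_percent)
    return [i for i in lcs_indices if log_rho[i] <= theta]

rho = [1.0, 1.0, 1.0, math.exp(2.0)]  # the last sample lies in a sparse region
_, theta = density_threshold(rho, core_percent=0.5)
gcs = global_core_samples([0, 1, 3], rho, core_percent=0.5)
```

With core_percent = 0.5 the quantile is zero and θ is simply the mean log-density, so the sparse LCS (index 3) is dropped. Raising core_percent raises θ and keeps more LCSs.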

Fig. 2

Quantile-quantile plot of LCSs density. It shows the LCSs density distribution coincides with Gaussian distribution.

Algorithm 2 ConstructGrid(X)

1: Randomly choose $n = \sqrt{| X |}$ samples and calculate the minimum distance d between the n samples.

2: cell_length = 1 + e^-δ(L) * d

3: Divide dataset X into grid with cell_length

4: Identify all LCS according to definition 2.

5: return grid, LCSs
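Step 3 of Algorithm 2 and the notion of adjacent filled cells can be illustrated as follows. This is a minimal sketch that takes `cell_length` as given; the function names and the choice of treating all cells that touch a cell, diagonals included, as adjacent are assumptions, since the text does not specify the adjacency rule.

```python
import math
from collections import defaultdict
from itertools import product

def build_grid(points, cell_length):
    """Map each filled cell (a tuple of integer coordinates) to the indices
    of the samples that fall into it."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / cell_length) for c in p)
        grid[key].append(idx)
    return dict(grid)

def adjacent_filled_cells(grid, key):
    """Return the filled cells that touch `key` (diagonals included, by assumption)."""
    neighbors = []
    for offset in product((-1, 0, 1), repeat=len(key)):
        if any(offset):  # skip the cell itself
            other = tuple(a + b for a, b in zip(key, offset))
            if other in grid:
                neighbors.append(other)
    return neighbors

points = [(0.1, 0.1), (0.4, 0.2), (1.2, 0.3), (3.5, 3.5)]
grid = build_grid(points, cell_length=1.0)
near_origin = adjacent_filled_cells(grid, (0, 0))
```

Only filled cells are stored, so memory grows with the number of samples rather than with the volume of the data space.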

3.2 GNN-DBSCAN algorithm

There are three steps of our proposed algorithm GNN-DBSCAN: (1) Divide the dataset with an adaptive grid and find out all LCSs according to definition 2. (2) Obtain GCSs by enhancing and screening LCSs. (3) Identify clusters with GCSs using dynamic radius based on k-nearest neighbors. GNN-DBSCAN works as Algorithm 1. Algorithm 2 implements step (1) and Algorithm 3 fulfills step (2) with a series of given parameters k and core_percent.

Algorithm 3 LCSCheck(X,grid,LCSs,k,core_percent)

1: for each grid’s cell c with LCS do

2: if PR_c ≥ mean (PR_grid) then

3: for each sample s in cell do

4: s ∈ LCSs

5: end for

6: end if

7: end for

8: for each sample q in X do

9: $ρ (x_{i}) = \frac{\sum_{j = 1}^{k} d (x_{i}, x_{i}^{j})}{k}$

10: end for

11: ρ (X) = log (ρ (X))

12: θ = mean (ρ (X)) + φ (core _ percent) * σ (ρ (X))

13: for each u in LCSs do

14: if ρ (u) ≤ θ then

15: u ∈ GCSs

16: end if

17: end for

18: return GCSs

In Algorithm 1, we use a dynamic radius for the clustering process, unlike DBSCAN, which uses a fixed eps. A proper radius for a sample should correspond to the density of the region that the sample belongs to. Clearly, samples lying in a high-density region have shorter distances to their k-nearest neighbors than samples lying in a low-density region. Therefore, the distances to the k-nearest neighbors reflect the density distribution around the sample, and we use them to compute the radius of each GCS. In this way, the radius used in GNN-DBSCAN is more reasonable than that of DBSCAN. We first define a radius expansion factor as follows: $λ_{g} = 1 + (1 / (1 + e^{δ (D_{g})}))$ (3) where D_g is the set of distances between sample g and its k-nearest neighbors, and δ(.) is the variance function. Because the variance is non-negative, λ_g always lies in the interval (1, 1.5]. The radius of a GCS is then computed as follows: ${radius}_{i} = λ_{i} * \frac{\sum_{j = 1}^{k} d (x_{i}, x_{i}^{j})}{k}$ (4) where $\frac{\sum_{j = 1}^{k} d (x_{i}, x_{i}^{j})}{k}$ is the average distance between x_i and its k-nearest neighbors.

Algorithm 2 describes the construction of the grid. Firstly, we randomly select $n = \sqrt{N}$ samples, where N is the dataset’s size. Then we compute the minimum distance d between the n samples. If we used d to divide the dataset directly, we might face two issues. (1) If d is too short, the cells of the grid will be numerous and small, which leads to an excessive division problem. (2) If d is too long, the cells of the grid will be few and large. Thus, a cell length that is too long or too short limits the quality of the grid. To address this, we use d and the variance to compute cell_length as in Eq. (5). $\text{cell\_length} = 1 + e^{- δ (L)} * d$ (5) where δ(.) is the variance function, and L is the set of distances between each of the $n = \sqrt{N}$ samples and its nearest neighbor.
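Eqs. (3) and (4) translate into a few lines of Python. The sketch below assumes the population variance (`statistics.pvariance`), since the text does not state which variance estimator is meant; the function name and the overflow guard are additions made here.

```python
import math
from statistics import mean, pvariance

def dynamic_radius(knn_distances):
    """Radius of a global core sample from the distances to its k nearest
    neighbors, following Eqs. (3) and (4)."""
    var = pvariance(knn_distances)
    # Guard added here: math.exp overflows for very large arguments, and the
    # expansion factor tends to 1 in that case anyway.
    lam = 1.0 if var > 700 else 1.0 + 1.0 / (1.0 + math.exp(var))
    return lam * mean(knn_distances)

r_uniform = dynamic_radius([1.0, 1.0, 1.0])  # zero variance, so lambda = 1.5
r_spread = dynamic_radius([1.0, 3.0])        # variance 1, so lambda = 1 + 1/(1 + e)
```

The expansion factor is largest when the neighbor distances are uniform and shrinks toward 1 as they become more spread out, so the radius stays between 1 and 1.5 times the mean k-nearest-neighbor distance.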

Algorithm 3 is the process of enhancement and filtration. Enhancement enriches the local core samples: all samples in a cell are identified as LCSs if the cell already contains an LCS and its point_ratio is not less than the average point_ratio of all cells. Filtration helps us to improve the quality of the core samples by deleting LCSs that lie in low-density regions. Fig. 3 shows the effect of Algorithm 3 on dataset Syn1, where the parameter k is 35 and core_percent is 0.8.

Fig. 3

The points in red in (a) and (b) are LCSs. (a) shows the LCSs identified by Algorithm 2; many samples in grey (not core samples) lie in the dense region. (b) shows the LCSs in red after enhancement by Algorithm 3; many samples in dense regions are now identified as LCSs, while those in sparse regions are not. (c) shows the GCSs after filtration in Algorithm 3; some LCSs belonging to sparse regions have been deleted.
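The enhancement step of Algorithm 3 (lines 1 to 7) can be sketched as below. The text does not define point_ratio explicitly, so this sketch assumes it is the fraction of all samples that fall in a cell, averaged over the filled cells only; both the definition and the function name `enhance_lcs` are assumptions.

```python
def enhance_lcs(grid, lcs, n_points):
    """Promote every sample of a cell to LCS when the cell already holds an LCS
    and its point ratio is at least the mean ratio over filled cells.

    grid : dict mapping cell keys to lists of sample indices
    lcs  : set of indices of current local core samples
    """
    ratio = {key: len(members) / n_points for key, members in grid.items()}
    mean_ratio = sum(ratio.values()) / len(ratio)
    enhanced = set(lcs)
    for key, members in grid.items():
        has_lcs = any(i in lcs for i in members)
        if has_lcs and ratio[key] >= mean_ratio:
            enhanced.update(members)
    return enhanced

grid = {(0, 0): [0, 1, 2, 3], (1, 0): [4, 5], (3, 3): [6]}
enhanced = enhance_lcs(grid, lcs={0, 1, 4}, n_points=7)
```

Here only the crowded cell (0, 0) passes the ratio test, so its remaining samples are promoted, while the sparser cell (1, 0) keeps just its original LCS.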

3.3 Complexity analysis

Assume that n is the number of samples in the dataset. The computational complexity of GNN-DBSCAN depends on its solution to the All Nearest Neighbor (ANN) problem and the nearest neighbor problem. The computational complexity of ANN can be reduced to O(n log n) by the natural neighbor algorithm optimized with a KD-tree [32] or an R*-tree [33]. The nearest neighbor problem can be solved in O(n²) time by a brute-force method. According to [34], the nearest neighbor problem can be solved in linear time with randomization. In summary, the complexity of the proposed clustering method is O(n log n).

4 Experimental analysis

In this section, we discuss the parameter selection of GNN-DBSCAN (GDBS) and compare the proposed method with other state-of-the-art clustering methods, including DBSCAN (DBS), DPC, ADBSCAN (ADBS), and HDBSCAN (HDBS). ADBSCAN is the latest clustering algorithm based on the nearest neighbor. The code of ADBSCAN was provided by its original author, and the remaining comparison algorithms use open-source code. We use the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI) and V-measure to measure the performance of clustering results. All of them are well-known metrics for evaluating clustering algorithms. ARI is one of the most popular measures in cluster validation; it compares two partitions, one produced by the clustering process and the other given by the true clusters. NMI is also a popular measure for evaluating clustering results, which employs information theory to quantify the differences between two clustering partitions. AMI is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared. V-measure is the harmonic mean of homogeneity and completeness. The experiments are performed on a PC with an Intel i3-5005U CPU, 8 GB RAM, the Windows 10 64-bit OS, and the Python 3 programming environment.
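For reference, the ARI can be computed from the contingency counts of the two partitions with the standard pair-counting formula. The sketch below is a self-contained illustration, not the evaluation code used in the experiments; it assumes at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index of two labelings of the same samples (n >= 2)."""
    n = len(labels_true)
    # Pairs of samples grouped together in both partitions, in the true
    # partition, and in the predicted partition, respectively.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # same partition, renamed labels
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions regardless of how the cluster labels are named, and it can become negative when the agreement is worse than expected by chance.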

4.1 Selection of parameters

As mentioned above, two parameters are used in GNN-DBSCAN. (1) k is the number of nearest neighbors, which decides the density of a sample and provides a crucial reference for the dynamic radius used in the clustering process. (2) core_percent is the core ratio of the dataset, which we use to screen LCSs; its range is 0 ≤ core_percent ≤ 1. Six synthetic datasets, DS1_1 to DS1_6, were generated to analyze the effects of the two parameters. They are depicted in Fig. 4 and their details are shown in Table 1.

Fig. 4

Datasets DS1_1 to DS1_6 used in the experiments of section 4.1.

Table 1

Details of synthetic datasets DS1_1-DS1_6

Datasets	Sizes	Noise samples	Clusters
DS1_1	500	0	2
DS1_2	800	0	2
DS1_3	1500	0	3
DS1_4	2500	0	6
DS1_5	5000	0	6
DS1_6	3500	660	6

In the experiments on selecting parameter k, we set core_percent to 0.8 and 0.9 for comparison and vary k over 0 ≤ k ≤ 140. The results are shown in Fig. 5, from which we can see that the ARI of GNN-DBSCAN fluctuates slightly when k is approximately 15. On DS1_1 to DS1_5 (without noise samples), the ARI increases gently when k is greater than 20 and has reached a high level when k is roughly 60, which demonstrates that GNN-DBSCAN is relatively insensitive to parameter k. It can also be observed from Fig. 5 that the ARI is better with the higher core_percent of 0.9 on DS1_1 to DS1_5 (without noise samples). On DS1_6 (with noise samples), however, a core_percent of 0.8 gives a better ARI than 0.9. This is expected: DS1_6 has noise samples, so a lower core_percent is more appropriate, whereas DS1_1 to DS1_5 are pure, so a higher core_percent is more reasonable.

Fig. 5

Results of the experiments on parameter k on DS1_1 to DS1_6. Solid lines of different colors show the results on different datasets with core_percent = 0.9. Dashed lines of the same colors show the corresponding results with core_percent = 0.8.

In the following experiments on selecting parameter core_percent, we set the parameter k to 60 and vary core_percent over 0 ≤ core_percent ≤ 1. The results are depicted in Fig. 6, which reveals that on DS1_1 to DS1_5 (without noise samples), the higher core_percent is, the better the ARI is. In contrast, the ARI on DS1_6 (with 660 noise samples) is good when core_percent is lower. From Fig. 6, we can see that the ARI is greater than 0.9 when core_percent is greater than 0.75. Besides, DS1_3 and DS1_6 achieve their worst ARI when core_percent is 1. There are two reasons for this. For one thing, some LCSs lie on the boundaries of different clusters. For another, these LCSs are close to each other. Setting core_percent to 1 means that the filtration process does not work, so those LCSs are treated as GCSs and take part in the clustering process. Consequently, the suitable interval of core_percent is [0.75, 1) if the dataset has no noise samples. On DS1_6, the best ARI is obtained when core_percent is around 0.65.

Fig. 6

Results of experiments about parameter core_percent on DS1_1 to DS1_6. The different colors of solid lines stand for the results of different datasets and k is 60.

In order to show the influence of parameter core_percent more visually, we set k = 60 and adjust core_percent to plot the GCSs of DS1_6 shown in Fig. 7.

Fig. 7

GCSs (points in red) of DS1_6 with parameter k = 60 and different values of core_percent. We can observe from Fig. 7 that as core_percent decreases, fewer GCSs are retained and these GCSs are closer to the cluster center. Therefore, core_percent can be regarded as the “thickness” of the cluster boundary. The smaller core_percent is, the thicker the boundary is.

To sum up, we present three experimental rules for selecting proper values of the parameters k and core_percent.

(1) k, which is used to compute densities and the clustering radius, is associated with the size of the dataset. A practical rule is that k = min(N/100, 100) ± 10, where N is the size of the dataset.

(2) core_percent helps us to filter out LCSs lying in low-density regions or on cluster boundaries, and its reasonable range is [0.75, 1) if the dataset is pure. On the contrary, if the dataset has noise samples, a suggested rule is that core_percent = 1 - 2 * noise_ratio. Meanwhile, there are three types of samples defined in DBSCAN, namely core samples, boundary samples and noise samples. Therefore, it is not recommended to set the parameter core_percent of GNN-DBSCAN to 1.

(3) core_percent and k should be negatively correlated. The value of k should decrease correspondingly when the value of core_percent increases.
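Rules (1) and (2) can be written as a small helper. This is only a sketch of the stated heuristics: the function name is hypothetical, the ± 10 tolerance is returned as a range to search rather than a single value, and for pure datasets the lower end of the suggested interval [0.75, 1) is returned as a starting point.

```python
def suggest_parameters(n_samples, noise_ratio=0.0):
    """Return ((k_low, k_high), core_percent) following rules (1) and (2)."""
    k_center = min(n_samples / 100, 100)
    k_range = (max(1, round(k_center - 10)), round(k_center + 10))
    if noise_ratio > 0:
        core_percent = 1 - 2 * noise_ratio  # rule for noisy datasets
    else:
        core_percent = 0.75  # lower end of the suggested interval [0.75, 1)
    return k_range, core_percent

pure = suggest_parameters(1500)               # a pure dataset of 1500 samples
noisy = suggest_parameters(3500, 660 / 3500)  # size and noise count of DS1_6
```

For DS1_6 the rule gives a core_percent of roughly 0.62, which is close to the value of about 0.65 reported as best in the experiments above. Note that the rule yields non-positive values once noise_ratio reaches 0.5, so it is only meaningful for moderate noise levels.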

4.2 Results on the clustering algorithms

To verify the performance of our method, we compare it with other state-of-the-art algorithms on eight synthetic datasets and several benchmark real-world datasets. The synthetic datasets are displayed in Fig. 8 and their characteristics are described in Table 2. The real-world datasets are obtained from the University of California, Irvine (UCI) Machine Learning Repository [35]. They are Wine, Iris, Ecoli, Banknote and Page block, and their details are summarized in Table 3.

Table 2
Details of eight synthetic datasets

Datasets	Sizes	Noise samples	Clusters
DS2_1	1500	0	4
DS2_2	1500	0	3
DS2_3	1500	400	4
DS2_4	4200	200	5
DS2_5	788	0	7
DS2_6	240	0	2
DS2_7	312	0	3
DS2_8	8000	764	5

Table 3

Details of real-world datasets

Datasets	Sizes	Dimensions	Clusters
Wine	178	13	3
Iris	150	4	3
Ecoli	336	7	8
Banknote	1372	4	2
Page block	5473	10	5

Fig. 8

Eight synthetic datasets used in the experiments of Section 4.2.

For each comparison algorithm, we use the parameters given in its original paper and set the parameters not mentioned there to their default values. For GNN-DBSCAN, we set k = min(N/100, 100) ± 10 and choose core_percent from the interval [0.75, 1) if the dataset is pure. Otherwise, we set core_percent = 1 - 2 * noise_ratio. Table 4 lists the parameter settings of each clustering method on the eight synthetic datasets. The experimental results are displayed in Fig. 9, and the ARI, NMI, AMI and V-measure for the synthetic datasets are shown in Table 5.

Table 4

Details of parameters on synthetic datasets used in each clustering algorithm

DataSet	GDBS	DBSC	DPC	ADBSC	HDBSC
DS2_1	k=10	eps=0.035	-	MinPts=26	MinPts=15
	cp^a =0.95	MinPts=8	np^b =0.1
DS2_2	k=20	eps=0.05	-	MinPts=40	MinPts=15
	cp=0.95	MinPts=8	np=0.01
DS2_3	k=15	eps=0.027	-	MinPts=50	MinPts=15
	cp=0.5	MinPts=10	np=0.5
DS2_4	k=20	eps=0.0315	-	MinPts=60	MinPts=45
	cp=0.85	MinPts=10	np=0.20
DS2_5	k=10	eps=0.048	-	MinPts=20	MinPts=12
	cp=0.80	MinPts=8	np=0.02
DS2_6	k=5	eps=0.068	-	MinPts=20	MinPts=3
	cp=0.75	MinPts=4	np=0.5
DS2_7	k=5	eps=0.025	-	MinPts=10	MinPts=2
	cp=0.95	MinPts=23	np=0.02
DS2_8	k=20	eps=0.03	-	MinPts=40	MinPts=27
	cp=0.80	MinPts=15	np=0.2

^acp is core_percent. ^bnp is noise_percent.

Table 5

The results of each clustering algorithm on synthetic datasets

DataSet		GDBS	DBS	DPC	ADBSC	HDBS
DS2_1	ARI	0.966	0.882	0.981	0.896	0.902
	NMI	0.950	0.884	0.971	0.881	0.905
	AMI	0.951	0.884	0.971	0.880	0.904
	V-measure	0.951	0.884	0.971	0.881	0.905
DS2_2	ARI	1.000	1.000	0.034	0.995	1.000
	NMI	1.000	1.000	0.165	0.990	1.000
	AMI	1.000	1.000	0.163	0.990	1.000
	V-measure	1.000	1.000	0.165	0.990	1.000
DS2_3	ARI	0.681	0.680	0.413	0.665	0.613
	NMI	0.706	0.718	0.413	0.712	0.693
	AMI	0.705	0.717	0.481	0.711	0.692
	V-measure	0.706	0.718	0.482	0.712	0.693
DS2_4	ARI	0.946	0.810	0.427	0.980	0.788
	NMI	0.948	0.867	0.602	0.962	0.827
	AMI	0.948	0.867	0.601	0.962	0.827
	V-measure	0.948	0.867	0.602	0.962	0.827
DS2_5	ARI	0.988	0.986	0.996	0.912	0.838
	NMI	0.981	0.980	0.993	0.950	0.902
	AMI	0.981	0.980	0.992	0.950	0.901
	V-measure	0.981	0.980	0.992	0.950	0.902
DS2_6	ARI	0.971	0.939	0.536	0.830	0.764
	NMI	0.935	0.866	0.543	0.729	0.687
	AMI	0.935	0.866	0.542	0.727	0.685
	V-measure	0.935	0.866	0.543	0.729	0.687
DS2_7	ARI	1.000	0.995	1.000	1.000	1.000
	NMI	1.000	0.992	1.000	1.000	1.000
	AMI	1.000	0.992	1.000	1.000	1.000
	V-measure	1.000	0.992	1.000	1.000	1.000
DS2_8	ARI	0.951	0.951	0.539	0.952	0.930
	NMI	0.940	0.939	0.657	0.939	0.912
	AMI	0.940	0.939	0.658	0.939	0.912
	V-measure	0.940	0.939	0.657	0.939	0.912
Average	ARI	0.938	0.906	0.626	0.904	0.854
	NMI	0.933	0.906	0.669	0.896	0.866
	AMI	0.932	0.905	0.676	0.895	0.865
	V-measure	0.933	0.906	0.676	0.896	0.866

Fig. 9

Clustering results on synthetic datasets. Grey points stand for noise samples.

As can be seen from the results depicted in Fig. 9, DBSCAN, as the earliest proposal of all these algorithms, shows strong stability and only performs poorly on datasets with large differences in density. For example, on DS2_4, the two clusters at the top of the picture are identified as the same cluster. The reason is that it is difficult to set proper parameters to isolate the high-density samples that connect the two clusters. Moreover, on DS2_3, some samples lying in the cluster boundary are identified as a separate cluster. DPC has superior performance on DS2_1, DS2_5 and DS2_7, which consist of spherical-like clusters or linear clusters. However, it performs poorly on DS2_2 and DS2_6 because the location of the samples within these clusters is unfavorable for its clustering process, even though some samples meet the hypothesis of DPC. ADBSCAN is the latest clustering algorithm based on the nearest neighbors, and its performance is good and stable across all the synthetic datasets. Nevertheless, some samples located at the edge of a cluster were wrongly clustered, or even a single sample was identified as a cluster, for example on DS2_1, DS2_2, DS2_3 and DS2_4. The major reason for this phenomenon is that ADBSCAN only keeps one of the two nearest neighbors as a core sample, which results in the loss of core samples. As for HDBSCAN, it performs well with the default parameters, but it performs poorly when the dataset has many noise samples or the clusters are very close to each other.

It can be observed from the experiments on synthetic datasets in Table 5 that GNN-DBSCAN benefits from its parameter-free method of identifying LCSs and retains LCSs to the maximal extent; thus, it works well on most datasets, such as DS2_2, DS2_7 and DS2_8. Its ARI matches or surpasses that of DBSCAN on all of these synthetic datasets (the two are tied on DS2_2 and DS2_8). It is worth noting that the average values of GNN-DBSCAN on ARI, NMI, AMI and V-measure are the highest.

Table 6 reports the ARI, NMI, AMI and V-measure of all comparison algorithms on the real-world datasets. According to Table 6, GNN-DBSCAN obtains the best average ARI, NMI, AMI and V-measure. Our method has its best performance on Page block and its worst performance on Wine and Iris.

Table 6

The results of each clustering algorithm on real-world datasets

Dataset	Metric	GNN-DBSCAN	DBSCAN	DPC	ADBSCAN	HDBSCAN
Wine	ARI	0.568	0.241	0.403	0.389	0.249
	NMI	0.395	0.375	0.395	0.391	0.352
	AMI	0.368	0.363	0.389	0.387	0.344
	V-measure	0.385	0.375	0.389	0.391	0.352
Iris	ARI	0.568	0.499	0.573	0.494	0.568
	NMI	0.734	0.590	0.613	0.567	0.734
	AMI	0.732	0.582	0.608	0.561	0.732
	V-measure	0.734	0.590	0.613	0.567	0.734
Ecoli	ARI	0.341	0.294	0.369	0.108	0.294
	NMI	0.481	0.361	0.357	0.189	0.361
	AMI	0.395	0.349	0.345	0.147	0.350
	V-measure	0.477	0.361	0.357	0.189	0.361
Banknote	ARI	0.535	0.315	0.069	0.439	0.312
	NMI	0.333	0.169	0.074	0.288	0.159
	AMI	0.561	0.530	0.195	0.554	0.396
	V-measure	0.563	0.531	0.196	0.555	0.397
Page block	ARI	0.736	0.529	0.044	0.548	0.287
	NMI	0.637	0.531	0.196	0.555	0.397
	AMI	0.382	0.164	0.072	0.266	0.155
	V-measure	0.383	0.169	0.074	0.288	0.159
Average	ARI	0.508	0.376	0.292	0.396	0.342
	NMI	0.516	0.405	0.327	0.398	0.400
	AMI	0.487	0.398	0.322	0.383	0.395
	V-measure	0.508	0.405	0.327	0.398	0.400
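
Scores such as the ARI values in Table 6 compare a predicted labelling with the ground-truth classes. As a minimal sketch, assuming the standard pair-counting definition of the Adjusted Rand Index (the source does not say which implementation produced the reported numbers), ARI can be computed from the contingency table as follows. The function name adjusted_rand_index and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts (standard definition)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as perfect agreement

    # Contingency table and its row/column sums, stored as counters
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both / in each labelling
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # chance-level agreement
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both labellings are the same trivial partition
    return (sum_cells - expected) / (max_index - expected)


# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every true cluster scores below 0.
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

In this sketch, noise samples (for example, those labelled -1 by a density-based method) are simply treated as one more label; whether the reported scores handle noise this way is not stated in the source.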

The running times are given in Table 7 and Table 8, from which we can observe that our method generally takes longer than the other comparison algorithms, although its running time is close to that of DPC.

Table 7

The consuming time on synthetic datasets

Dataset	GNN-DBSCAN	DBSCAN	DPC	ADBSCAN	HDBSCAN
DS2_1	6.2815	0.0249	6.7396	0.3931	0.0407
DS2_2	8.1241	0.0449	6.3586	0.7337	0.0478
DS2_3	11.1345	0.0154	10.2953	0.6945	0.0604
DS2_4	59.0168	0.0634	51.8831	1.9699	0.1662
DS2_5	2.0756	0.0062	1.7843	0.2815	0.0250
DS2_6	0.3062	0.0025	0.2692	0.0487	0.0096
DS2_7	0.3207	0.0047	0.3418	0.0859	0.0194
DS2_8	>60	0.0930	>60	2.6506	0.2912

Table 8

The consuming time on real-world datasets

Dataset	GNN-DBSCAN	DBSCAN	DPC	ADBSCAN	HDBSCAN
Wine	0.2212	0.0031	0.0912	0.0400	0.0070
Iris	0.1601	0.0020	0.0577	0.0314	0.0056
Ecoli	0.4782	0.0051	0.3026	0.1296	0.0141
Banknote	5.0492	0.9872	5.7540	0.4637	0.0954
Page block	>60	0.0198	>60	1.2590	0.7091
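
Running times of the kind listed in Tables 7 and 8 can be collected with a small timing harness. The sketch below is only one possible way to do this: the source does not describe how the times were measured, and the helper name time_clustering, the repeats parameter and the stand-in workload are all hypothetical.

```python
import statistics
import time


def time_clustering(cluster_fn, data, repeats=5):
    """Return the median wall-clock time, in seconds, of cluster_fn(data)."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        cluster_fn(data)  # the result is discarded; only the duration is kept
        elapsed.append(time.perf_counter() - start)
    return statistics.median(elapsed)


# Stand-in workload (sorting) used in place of a real clustering algorithm.
median_seconds = time_clustering(sorted, list(range(1000, 0, -1)), repeats=3)
print(f"median running time: {median_seconds:.6f} s")
```

Taking the median over several repeats reduces the effect of occasional slow runs; any real comparison would pass each algorithm's clustering function in place of the stand-in workload.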

In summary, the overall test results shown in this section demonstrate that our proposed method outperforms the other comparison algorithms on most datasets. What’s more, we discussed the parameter selection and demonstrated how k and core_percent affect the algorithm’s performance with a set of experiments. From the results on synthetic and real-world datasets, we can observe that the size of the dataset has little influence on the performance of our proposed method, except for time consumption. Meanwhile, our method can handle shuffled datasets because it identifies core samples from a local perspective.

5 Conclusions and future works

In this paper, we propose a new density-based algorithm called GNN-DBSCAN, which uses an adaptive grid to divide the dataset and defines local core samples by using the nearest neighbor. The nearest neighbor stands for the local highest density. Also, with the help of the grid, GNN-DBSCAN overcomes the two major shortcomings of DBSCAN, namely its sensitivity to the parameters eps and MinPts and its difficulty with datasets of widely varying density. What’s more, we compared our method with DBSCAN, DPC, ADBSCAN, and HDBSCAN on synthetic and real-world datasets, and the results indicate that the average performance of our proposed algorithm is superior to that of the other comparison algorithms. According to the results, the average ARI, NMI, AMI and V-measure are 0.938, 0.933, 0.932 and 0.933 on synthetic datasets and 0.508, 0.516, 0.487 and 0.508 on real-world datasets, respectively.

However, as can be seen from the experiments on real-world datasets, our method is affected by the curse of dimensionality. Consequently, future work will focus on improving the performance of the GNN-DBSCAN algorithm in high-dimensional space.

Acknowledgment

The authors would like to thank the editor and reviewers for their valuable comments and suggestions. This work was supported in part by the National Key Research and Development Program of China [grant number 2020YFB1805400]; in part by the National Natural Science Foundation of China [grant numbers U1736212, U19A2068, and 62032002]; in part by the China Postdoctoral Science Foundation [grant number 2020M683345].


