Density peaks clustering based on local fair density and fuzzy k-nearest neighbors membership allocation strategy

Abstract

The density peaks clustering algorithm (DPC) has attracted wide attention since it was proposed in 2014. It does not need the number of clusters to be specified in advance and requires only one parameter. However, DPC still has several disadvantages: (1) Choosing a suitable local density calculation method requires repeated experiments because dataset scales vary, which adds time cost. (2) An optimal cutoff distance threshold is difficult to find, since different values not only affect the selection of cluster centers but also directly affect the quality of clusters. (3) The allocation strategy has poor fault tolerance, especially on manifold datasets or datasets with uneven density distribution. To address these problems, a density peaks clustering algorithm based on local fair density and a fuzzy k-nearest neighbors membership allocation strategy (LF-DPC) is proposed in this paper. First, to obtain a more balanced local density, two classic local density calculation methods are combined, and the local fair density is calculated through an optimization function that minimizes the local density difference. Second, a robust two-stage allocation strategy for the remaining points is designed. In the first stage, k-nearest neighbors are used to quickly and accurately allocate points starting from the cluster centers. In the second stage, to further improve the accuracy of allocation, a fuzzy k-nearest neighbors membership method is designed to allocate the remaining points. Finally, LF-DPC is evaluated on several synthetic and real-world datasets. The results show that the proposed algorithm has clear advantages over five other algorithms.

Keywords

Density peaks clustering; local fair density; fuzzy k-nearest neighbors membership; allocation strategy

1 Introduction

As one of the core technologies of data mining, clustering is a statistical method for studying sample classification problems. Clustering automatically divides objects into different clusters according to similar characteristics of the samples and certain rules. As a result, samples in the same cluster tend to be highly similar, while samples in separate clusters often differ greatly [1, 2]. Clustering is also an important field in machine learning, and its algorithms fall into several categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods [1–4]. With the development of data mining and machine learning, clustering technology has been widely used in diverse research fields, such as image processing [5, 6], community detection [7, 8], and commercial market analysis [9, 10].

So far, many classic clustering algorithms have been developed, such as K-means [11], BIRCH [12], STING [13], EM [14], and DBSCAN [15]. Among them, DBSCAN is a representative density-based clustering algorithm. It defines a cluster as the largest set of density-connected points, which allows it to divide high-density areas into clusters and find clusters of arbitrary shape in noisy samples. Although DBSCAN does not need the number of clusters to be specified in advance, it requires the minimum number of points (MinPts) and the neighborhood radius (ε) to be set, and the optimal values of these two parameters are difficult to obtain.

In 2014, Rodriguez et al. [16] published a paper in Science introducing a novel algorithm, clustering by fast search and find of density peaks (DPC). The algorithm can cluster data of arbitrary shape and easily identify abnormal points. The DPC clustering process is simple and efficient and requires only one parameter, so it has attracted wide attention. DPC is based on two assumptions: (1) A cluster center is always surrounded by data points of lower density. (2) A cluster center is far away from other high-density data points. Based on these assumptions, DPC first calculates the local density and the relative distance using the cutoff distance threshold. Second, it plots a two-dimensional decision graph of local density against relative distance. It then selects appropriate cluster centers, and finally assigns the remaining points to the most suitable cluster.

Since DPC was proposed, it has been favored by many scholars. Although the algorithm performs well on some datasets, it still has some drawbacks: (1) Selecting a suitable cutoff distance is difficult, because there is no unified way of calculating the local density, and different local density calculation methods affect the quality of clustering differently. (2) Cluster centers are chosen as data points with large local density and relative distance, which is prone to the multiple-peaks problem, especially on unevenly distributed datasets. (3) The allocation strategy easily causes a “domino” effect, resulting in the misallocation of remaining points.

To solve these problems, we propose a density peaks clustering algorithm based on local fair density and a fuzzy k-nearest neighbors membership allocation strategy (LF-DPC), with the following innovations and contributions:

(1) Based on two typical density calculation methods, the new local fair density is defined by the local density difference optimization function.

(2) A two-stage remaining points allocation strategy is designed: a breadth-first search is applied to allocate the remaining points in the first stage, and a fuzzy k-nearest neighbors membership allocation strategy is designed for the second stage. After the allocation is completed, if the membership difference of a sample point is lower than a certain threshold, the point is re-allocated by expanding the range of its k-nearest neighbors.

(3) The clustering performance of LF-DPC is evaluated using a large number of experimental datasets, and the results are compared to the other five algorithms.

The rest of this paper is organized as follows: Section 2 discusses related work; Section 3 introduces the DPC algorithm; Section 4 describes the proposed algorithm in detail; Section 5 compares the proposed algorithm with five other algorithms and presents the experimental analyses; the final section gives conclusions and an outlook on future research.

2 Related works

Many researchers have proposed improvements to address DPC’s flaws, which may be split into three categories: (1) local density design, (2) remaining point allocation strategy, and (3) sub-cluster merging strategy.

Many researchers have worked on the local density deficiencies of DPC. Du et al. [17] proposed a density peaks clustering algorithm based on the k-nearest neighbors (DPC-KNN), in which the idea of k-nearest neighbors was introduced into DPC and another option was provided for the calculation of local density. Xie et al. [18] proposed an improved DPC based on the fuzzy weighted k-nearest neighbors clustering algorithm (FKNN-DPC); by combining the concept of k-nearest neighbors with a redesigned local density calculation, the algorithm became completely independent of the cutoff distance. Liu et al. [19] presented an adaptive density peaks clustering algorithm (ADPC-KNN), in which k-nearest neighbors were applied to calculate the global parameter and local density. The algorithm also employed a new method of automatically selecting the initial cluster centers. The comparative density peaks algorithm (CDP) was proposed by Li [20], in which mutual k-nearest neighbors were used to calculate the local density and the geodesic distance was applied to define a new relative distance. Liu et al. [21] designed a shared-nearest-neighbor-based density peaks clustering algorithm (SNN-DPC), in which the concept of shared neighbors was used to define a new local density. A density peak clustering algorithm for natural neighbors (NaNDP) was proposed by Cheng [22], which used natural nearest neighbors as k-nearest neighbors to calculate local density. Wu et al. [23] proposed a density peaks clustering with symmetric neighborhood relationship (DPC-SNR), whose local density was calculated using the reverse k-nearest neighbors. To reduce the complexity of calculating local density in DPC, Zhao et al. [24] designed a density peaks clustering based on circular partition and grid similarity (DPC-CP-GS), in which a new grid local density was proposed. Fan et al. [25] proposed a density peaks clustering based on k-nearest neighbors sharing (DPC-KNNS), which defined the local density by using the similarity between samples. However, each DPC derivative has its own local density calculation method, and a balanced calculation method that combines different local density characteristics is still lacking. Besides, DPC provides two local density calculation methods for datasets of different sizes, which increases the cost of experiments.

In terms of allocation strategy, FKNN-DPC [18] designed a new allocation method, which applied breadth-first search and semi-supervised learning to assign outlier and non-outlier points. Yu et al. [26] introduced an improved DPC named DPCSA with a weighted local density sequence and two-stage assignment strategies; the algorithm used a fixed k-value to calculate the local density and designed a two-stage remaining points allocation strategy. Density fragment clustering without peaks (DFC) was proposed by Jiang [27], who designed density fragment generation and aggregation, but the clustering effect was not satisfactory on path-based datasets. However, none of these strategies considers the membership distribution among the k-nearest neighbors of the remaining points.

Many scholars have also conducted extensive research on the effectiveness of DPC on manifold datasets and datasets with uneven density distribution. For unevenly distributed datasets, Zhuo et al. [28] proposed a density peaks clustering algorithm employing a hierarchical strategy (HCFS), which applied a new mechanism for measuring the similarity and connectivity of subclusters and merged subclusters with a high degree of similarity and connectivity. Wang et al. [29] proposed a DPC based on local minimal spanning tree (DPC-LMST), which merged similar subclusters through a subcluster merging factor (SCMF). Xu et al. [30] considered multiple density peaks in one cluster and designed a feasible density peaks clustering algorithm (FDPC) based on a new merging strategy inspired by the support vector machine. For the multiple peaks problem, Ren et al. [31] proposed a density peaks clustering algorithm based on layered k-nearest neighbors and subcluster merging (LKSM-DPC), in which a new subcluster similarity measure and subcluster merging strategy were designed. Although the above literature provides subcluster merging strategies, the allocation of the remaining points was ignored, and the time cost of these merging strategies was not taken into account.

To solve the above-mentioned deficiencies in local density and allocation strategy, as well as some limits in Table 1, this paper is dedicated to researching an improved DPC algorithm framework that considers the local fair density and fuzzy k-nearest neighbors membership allocation strategy.

Table 1
Advantages and limits of related algorithms

Improved aspects	Method	Advantages	Limits
–	DPC [16] (Original algorithm)	-Quickly finds cluster centers and outliers -Achieves arbitrary shape clustering	-Selecting a suitable cutoff distance is difficult -Different local density calculation methods -Allocation strategy causes a “domino” effect
1. Local density	DPC-KNN [17]	-The local density calculation uses the KNN instead of the cutoff distance	-Can’t effectively process vertical stripes data
	SNN-DPC [21]	-The local density calculation uses shared nearest neighbors, effectively handling variable-density clusters	-Can’t automatically determine the parameter K
	ADPC-KNN [19]	-A new way of selecting initial cluster centers -A new idea of cluster density reachability	-The parameter K is pre-specified
2. Allocation strategy	FKNN-DPC [18]	-Optimized the allocation strategy -Effectively identifies outliers	-High computational cost in allocation strategy -The K value of KNN is specified manually
	DPCSA [26]	-No need to specify parameters in advance -Improved the allocation strategy	-High local density calculation cost
3. Sub-cluster merging strategy	DPC-LMST [29]	-Based on inner densities and boundary distance, sub-cluster merging is designed to select the initial center more effectively	-On small data sets, the running time is longer
	FDPC [30]	-No requirement on the prior number of cluster centers -A novel merging strategy	-High computational cost

3 The DPC algorithm

The DPC algorithm involves a novel series of mathematical steps based on the density clustering. DPC reveals two features of an appropriate cluster center: (1) The cluster center has a higher local density. (2) The existing cluster centers are far away from the other cluster centers. Therefore, there are two key variables in the cluster center: local density ρ and relative distance δ.

For datasets on different scales, DPC provides two methods of local density calculation: $ρ_{i} = \sum_{j} χ (d_{ij} - d_{c}), \quad χ (x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$ (1) $ρ_{i} = \sum_{j} \exp (- (\frac{d_{ij}}{d_{c}})^{2})$ (2)

Where d_ij is the distance between data points i and j, d_c is the cutoff distance, and χ (.) is a piecewise indicator function. For small-scale datasets, DPC adopts the Gaussian kernel of formula (2). Formula (1) is applicable to large-scale datasets. To get a better clustering effect, DPC therefore needs different density calculation methods for different datasets, causing extra workload.
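As a minimal sketch of formulas (1) and (2), the following Python computes both densities from a distance matrix. The helper names, the toy points, and the choice to skip the point itself (j = i) in the sum are illustrative assumptions rather than part of the original description.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix d[i][j] for a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]

def cutoff_density(d, d_c):
    """Formula (1): count the other points that lie closer than d_c."""
    n = len(d)
    return [sum(1 for j in range(n) if j != i and d[i][j] < d_c)
            for i in range(n)]

def gaussian_density(d, d_c):
    """Formula (2): Gaussian-kernel density scaled by d_c."""
    n = len(d)
    return [sum(math.exp(-(d[i][j] / d_c) ** 2) for j in range(n) if j != i)
            for i in range(n)]

# Toy data (assumed): three nearby points and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
d = pairwise_distances(points)
rho_cut = cutoff_density(d, d_c=1.5)
rho_gauss = gaussian_density(d, d_c=1.5)
```

The cutoff version yields integer counts, so ties are common on small datasets, whereas the Gaussian version yields continuous values; this is one reason the two formulas behave differently as the dataset size changes.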

Besides, when the relative distance δ is defined as the shortest distance from data point i to a point with a higher density, the calculation method can be formulated by equation (3) as: $δ_{i} = min_{j : ρ_{j} > ρ_{i}} (d_{ij})$ (3)

When a data point has the highest local density among all samples, DPC considers it a density peak, and its relative distance is set to the maximum distance from it to any other point. The calculation method is formulated as follows: $δ_{i} = max_{j} (d_{ij})$ (4)
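Formulas (3) and (4) can be sketched together in a few lines. The function name and the toy distance matrix below are assumptions for illustration; points tied for the highest density are all treated as density peaks here.

```python
def relative_distance(d, rho):
    """Relative distance: formula (3), or formula (4) for the densest point."""
    n = len(rho)
    delta = []
    for i in range(n):
        # Distances from i to every point with strictly higher density.
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(d[i]))
    return delta

# Toy symmetric distance matrix and densities (assumed values).
d = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 3.0],
     [4.0, 3.0, 0.0]]
rho = [3.0, 2.0, 1.0]
delta = relative_distance(d, rho)
```

In this toy case point 0 is the densest, so it receives the largest distance in its row, while the other two points receive the distance to their nearest denser neighbor.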

Based on the local density and relative distance of all data points, a decision graph is constructed. In the graph, points with greater local density and relative distance are considered to be candidate cluster centers. A formula for selecting the number of potential cluster centers is as follows: $γ_{i} = ρ_{i} \times δ_{i}$ (5)
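To illustrate formula (5), the sketch below ranks points by γ and keeps the top few as candidate centers. The parameter `n_centers` is a hypothetical stand-in for the manual choice made on the decision graph, and the input values are made up.

```python
def select_centers(rho, delta, n_centers):
    """Rank points by gamma = rho * delta and return the top candidates."""
    gamma = [r * dl for r, dl in zip(rho, delta)]
    order = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    return gamma, order[:n_centers]

# Assumed toy values: point 3 is dense and isolated, point 0 is the runner-up.
rho = [5.0, 4.0, 1.0, 6.0, 2.0]
delta = [0.5, 0.4, 1.0, 4.0, 0.3]
gamma, centers = select_centers(rho, delta, n_centers=2)
```

Because γ multiplies the two quantities, a point needs both a high density and a large relative distance to rank near the top.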

However, when the dataset is unevenly distributed, DPC often has difficulty in finding the cluster centers and is prone to multiple peaks problems.

After the cluster centers are determined, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. However, this allocation strategy tends to cause a chain reaction, in which sparse points are erroneously allocated to the closest dense area. It can be seen from Fig. 2(c) that the data points in the upper cluster of Pathbased are incorrectly allocated to the two middle clusters.

The detailed steps of the DPC algorithm are shown in Algorithm 1.

Algorithm 1 The DPC algorithm

Input: A dataset X, and the parameter d_c;

Output: the clustering result Y;

Step 1: Compute the Euclidean distance matrix d_ij;

Step 2: Calculate the local density ρ_i using (1) or (2);

Step 3: Calculate the distance δ_i using (3) and (4);

Step 4: Plot the decision graph using (5);

Step 5: Select some suitable cluster centers with the largest γ_i from the decision graph;

Step 6: Each remaining point is assigned to the same cluster as its nearest neighbor of higher density;

Step 7: Return Y.
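Step 6 of Algorithm 1 can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the densest point is among the selected centers, breaks density ties by sort order, and uses a made-up 1-D example.

```python
def assign_by_denser_neighbor(d, rho, centers):
    """Give each non-center point the label of its nearest denser point."""
    n = len(rho)
    labels = [-1] * n
    for c_idx, c in enumerate(centers):
        labels[c] = c_idx
    # Visit points from densest to sparsest so the parent is labelled first.
    order = sorted(range(n), key=lambda i: rho[i], reverse=True)
    if labels[order[0]] == -1:
        raise ValueError("the densest point is assumed to be a cluster center")
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue
        # Nearest point among those that precede i in density order.
        parent = min(order[:pos], key=lambda j: d[i][j])
        labels[i] = labels[parent]
    return labels

# Assumed toy data: two groups on a line, centers at indices 1 and 3.
coords = [0.0, 1.0, 2.0, 10.0, 11.0]
d = [[abs(a - b) for b in coords] for a in coords]
rho = [2.0, 3.0, 1.0, 3.5, 2.5]
labels = assign_by_denser_neighbor(d, rho, centers=[1, 3])
```

Each point inherits its label from a single parent, so one wrong parent propagates to every point that later chooses it; this is the chain reaction described above.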

4 The LF-DPC algorithm

In this section, the two contributions of LF-DPC are first introduced. Secondly, the detailed steps of the proposed algorithm are provided. Finally, the complexity of the algorithm is analyzed.

4.1 The main contribution of LF-DPC

4.1.1 The local fair density

Local density is the key to DPC, for it not only impacts the choice of cluster centers but also affects the quality of the final clusters. Therefore, designing a reliable local density calculation method is a research focus for many researchers. DPC uses different local density calculation methods for different datasets, and its clustering results vary greatly. Although much of the literature calculates the local density from k-nearest neighbors, the results differ because the calculation formulas differ.

Based on the idea of k-nearest neighbors, references [17] and [18] give two different local density calculation methods, shown in formulas (6) and (7), respectively. Formula (6) focuses on the local structure of the data, and formula (7) concentrates on the distribution of the k-nearest neighbors. $ρ_{i} = \exp (- (\frac{1}{k} \sum_{x_{j} \in KNN (x_{i})} d (x_{i}, x_{j})^{2}))$ (6) $ρ_{i} = \sum_{x_{j} \in KNN (x_{i})} \exp (- d (x_{i}, x_{j}))$ (7)

Where d (x_i, x_j) represents the Euclidean distance from sample point x_i to x_j, KNN (x_i) is the k-nearest neighbors of x_i, k is the parameter value of KNN.
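The two k-nearest-neighbor densities in formulas (6) and (7) can be compared side by side with a short sketch. The function names and the 1-D toy data are assumptions; ties in distance are broken by index order.

```python
import math

def knn_indices(d, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(d)) if j != i]
    return sorted(others, key=lambda j: d[i][j])[:k]

def density_formula_6(d, k):
    """Formula (6): exp of minus the mean squared distance to the kNN."""
    return [math.exp(-sum(d[i][j] ** 2 for j in knn_indices(d, i, k)) / k)
            for i in range(len(d))]

def density_formula_7(d, k):
    """Formula (7): sum of exp(-distance) over the kNN."""
    return [sum(math.exp(-d[i][j]) for j in knn_indices(d, i, k))
            for i in range(len(d))]

# Assumed toy data: three close points and one distant point on a line.
coords = [0.0, 1.0, 2.0, 10.0]
d = [[abs(a - b) for b in coords] for a in coords]
alpha = density_formula_6(d, k=2)
beta = density_formula_7(d, k=2)
```

Both formulas agree here that the middle point (index 1) is the densest, but their values live on different scales (formula (6) is at most 1, formula (7) at most k), which is why a normalization step is needed before they are combined.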

In this paper, two typical local density calculation methods are combined for the proposal of a new local fair density measurement method, which makes the local density of data points more balanced. Therefore, it not only effectively selects the cluster center, but also facilitates the formation of the final cluster. The new local fair density ρ is formulated as follows: $\begin{matrix} ρ = θ_{1} * nor (α) + θ_{2} * nor (β) \\ s . t . θ_{1} + θ_{2} = 1 \end{matrix}$ (8)

Where θ₁ and θ₂ are fair coefficients of local density in these two methods. α and β respectively represent the local density in two calculation methods; nor is a normalized function which maps the local density in different dimensions to the interval [0,1].

In formula (8), as long as θ₁ is calculated, the local fair density can be obtained. To calculate θ₁, a concept of local density difference is defined as:

Definition 1. (Local Density Difference). The local density contribution values of the two methods are θ₁nor (α) δ_ij and θ₂nor (β) δ_ij. To treat the two local density calculation methods equally, the local density difference is defined as follows: $\sum_{i = 1}^{2} \sum_{j = 1}^{n} {(θ_{1} nor (α) δ_{ij} - θ_{2} nor (β) δ_{ij})}^{2}$ (9)

Where δ_ij is the relative distance set in these two methods, which can be seen from formula (3) and (4).

To minimize the local density difference between any two methods, a local density balance model with the local density tending to be fair as the optimization function is established as follows: $\begin{matrix} min \sum_{i = 1}^{2} \sum_{j = 1}^{n} {(θ_{1} nor (α) δ_{ij} - θ_{2} nor (β) δ_{ij})}^{2} \\ s . t . θ_{1}, θ_{2} \geq 0 \\ θ_{1} + θ_{2} = 1 \end{matrix}$ (10)

The new local fair density can be calculated using the aforementioned method.
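One possible reading of model (10) admits a closed-form solution, sketched below. The reading is an assumption: index i is taken to select the relative distance produced by each of the two methods, nor is taken to be min-max normalization, and a_j, b_j denote the normalized densities of point j. Substituting θ₂ = 1 − θ₁ and setting the derivative to zero gives θ₁ = Σ w_j b_j (a_j + b_j) / Σ w_j (a_j + b_j)², with w_j = δ_1j² + δ_2j². All names in the code are hypothetical.

```python
def min_max(values):
    """Map values to [0, 1]; a constant list maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fair_coefficients(alpha, beta, delta_a, delta_b):
    """Closed-form minimiser of (10) under the reading stated above."""
    a, b = min_max(alpha), min_max(beta)
    w = [da ** 2 + db ** 2 for da, db in zip(delta_a, delta_b)]
    num = sum(wj * bj * (aj + bj) for wj, aj, bj in zip(w, a, b))
    den = sum(wj * (aj + bj) ** 2 for wj, aj, bj in zip(w, a, b))
    theta1 = 0.5 if den == 0 else num / den
    return theta1, 1.0 - theta1

def local_fair_density(alpha, beta, theta1, theta2):
    """Formula (8): weighted sum of the two normalised densities."""
    a, b = min_max(alpha), min_max(beta)
    return [theta1 * aj + theta2 * bj for aj, bj in zip(a, b)]

# Assumed toy input: both methods give identical densities.
alpha = [0.2, 0.9, 0.5]
beta = [0.2, 0.9, 0.5]
delta = [1.0, 3.0, 0.5]
theta1, theta2 = fair_coefficients(alpha, beta, delta, delta)
rho_fair = local_fair_density(alpha, beta, theta1, theta2)
```

Because a_j and b_j are non-negative, the numerator never exceeds the denominator, so θ₁ falls in [0, 1] and the inequality constraints of (10) hold automatically under this reading. When the two methods agree, as in the toy input, the weights come out equal.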

4.1.2 The Remaining points allocation strategy

The remaining points allocation strategy of the DPC algorithm considers only distance as the allocation metric. Once one point is misallocated, a “domino” effect spreads to other points. As shown in Fig. 2(c), although the cluster centers were correctly identified, the points on the left and right sides were clearly assigned incorrectly. Therefore, we propose a two-stage remaining points allocation strategy. The first stage draws on the idea of breadth-first search and uses a k-nearest neighbors allocation method similar to the first phase of the allocation strategy in FKNN-DPC. In the second stage, we design a robust fuzzy k-nearest neighbors membership allocation strategy. This strategy differs from that of FKNN-DPC, which considers distance similarity and probability. In addition, the points allocated in the first stage are regarded as definitely belonging to one particular cluster. Inspired by membership in fuzzy mathematics [32], we use another membership method to allocate the remaining points in this paper, in which the degree of membership focuses on the distribution of points rather than distance. Furthermore, once all of the sample points have been allocated, the membership difference of each sample point is calculated in the second stage. When it falls below a certain threshold, the range of k-nearest neighbors is expanded and the sample point is re-allocated to improve the accuracy of the allocation.

Allocation strategy 1 Use k-nearest neighbors to quickly allocate remaining points

Step 1: Select an unvisited center point c_i from the cluster center set SC, and mark c_i as visited;

Step 2: Find the k-nearest neighbors set KNN (c_i) of c_i, merge these points into the cluster of c_i in sequence, initialize the queue Q, and sequentially enqueue the data points in KNN (c_i) into Q;

Step 3: Dequeue the head element q of Q; each unallocated point x ∈ KNN (q) is assigned to the cluster of q and enqueued into Q;

Step 4: If Q is not empty, go to Step 3;

Step 5: If there are still unvisited data points in set SC, go to Step 1; otherwise end strategy 1.
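The steps of Allocation strategy 1 can be sketched with a standard queue-based breadth-first search. This is a hedged illustration: the neighbor lists are given directly as input, all centers are labelled up front so that one cluster cannot absorb another center (an added assumption), and no extra distance condition is applied because none is stated in the steps.

```python
from collections import deque

def allocate_by_knn_bfs(knn, centers):
    """Spread each center's label through its kNN lists, breadth first.

    knn[i] is the list of k nearest neighbours of point i.
    Returns labels with -1 for points that were never reached.
    """
    labels = [-1] * len(knn)
    for c_idx, c in enumerate(centers):
        labels[c] = c_idx
    for c_idx, c in enumerate(centers):
        queue = deque()
        # Step 2: seed the queue with the center's own neighbours.
        for x in knn[c]:
            if labels[x] == -1:
                labels[x] = c_idx
                queue.append(x)
        # Steps 3-4: expand until the queue is empty.
        while queue:
            q = queue.popleft()
            for x in knn[q]:
                if labels[x] == -1:
                    labels[x] = c_idx
                    queue.append(x)
    return labels

# Assumed toy neighbour lists (k = 2): two tight groups plus point 6,
# which lists others as neighbours but is nobody's neighbour itself.
knn = [[1, 2], [0, 2], [1, 0], [4, 5], [3, 5], [4, 3], [2, 5]]
labels = allocate_by_knn_bfs(knn, centers=[0, 3])
```

Point 6 stays unlabelled because the search follows neighbor lists outward from the centers and never reaches it; such points are the ones left for the second stage.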

Allocation strategy 2 Use fuzzy k-nearest neighbors membership to allocate unvisited points in strategy 1

Step 1: For each unassigned data point x_j, calculate the number of allocated points in KNN (x_j);

Step 2: Calculate the membership degree u_ij of x_j by using formulas (11) and (12);

Step 3: Sort the membership degrees u_ij, assign x_j to the cluster c_i with the highest membership, and mark x_j as assigned;

Step 4: If there are unallocated points, go to Step 1;

Step 5: If, for any x_j, the two largest membership degrees satisfy |u_ij − u_bj| ≤ 0.05, set the parameter k of KNN to 2 * k + 1 and re-allocate x_j according to the maximum membership degree; otherwise end strategy 2.
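A rough sketch of Allocation strategy 2 follows, using the membership ratio of formula (12). Several details are assumptions made for illustration: pending points are processed in order of how many of their neighbors are already allocated, the enlarged k is capped at n − 1, and a point with no allocated neighbors simply falls to cluster 0 (real use would need a better fallback).

```python
def knn_indices(d, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(d)) if j != i]
    return sorted(others, key=lambda j: d[i][j])[:k]

def memberships(d, labels, x, k, n_clusters):
    """Share of x's k nearest neighbours already assigned to each cluster."""
    nbrs = knn_indices(d, x, k)
    counts = [0] * n_clusters
    for j in nbrs:
        if labels[j] != -1:
            counts[labels[j]] += 1
    return [c / len(nbrs) for c in counts]

def allocate_by_membership(d, labels, k, n_clusters, threshold=0.05):
    labels = list(labels)
    pending = [i for i, lab in enumerate(labels) if lab == -1]
    todo = list(pending)
    # Steps 1-4: assign the pending point with the most allocated neighbours.
    while todo:
        def allocated(i):
            return sum(labels[j] != -1 for j in knn_indices(d, i, k))
        x = max(todo, key=allocated)
        u = memberships(d, labels, x, k, n_clusters)
        labels[x] = max(range(n_clusters), key=lambda c: u[c])
        todo.remove(x)
    # Step 5: re-check ambiguous points with an enlarged neighbourhood.
    big_k = min(2 * k + 1, len(d) - 1)
    for x in pending:
        top = sorted(memberships(d, labels, x, k, n_clusters), reverse=True)
        if len(top) > 1 and top[0] - top[1] <= threshold:
            u = memberships(d, labels, x, big_k, n_clusters)
            labels[x] = max(range(n_clusters), key=lambda c: u[c])
    return labels

# Assumed toy data: two labelled groups on a line and two pending points.
coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 3.0, 8.0]
d = [[abs(a - b) for b in coords] for a in coords]
initial = [0, 0, 0, 1, 1, 1, -1, -1]
final = allocate_by_membership(d, initial, k=3, n_clusters=2)
```

In the toy data neither pending point is ambiguous, so the enlarged neighborhood is never used; it only comes into play when the two largest memberships are within the threshold of each other.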

Definition 2. (Fuzzy k-nearest neighbors membership). Assume that X = {x₁, …, x_j, …, x_n} is a dataset with n samples, {C₁, …, C_i, …, C_m} is the dataset currently allocated to m clusters. KNN (x_j) represents the k-nearest neighbors set of the sample point x_j. CKNN (C_i, x_j) denotes the sample set that has been certainly assigned to C_i among the k-nearest neighbors of the sample x_j. Then the membership degree u_ij of the fuzzy k-nearest neighbors of the sample point x_j is formulated as follows: $CKNN (C_{i}, x_{j}) = KNN (x_{j}) \cap C_{i}$ (11) $u_{ij} = \frac{| CKNN (C_{i}, x_{j}) |}{| KNN (x_{j}) |}$ (12)

Where | … | is the symbol that represents the number of elements in the sample set.

It can be seen from Definition 2 that formula (11) mainly concerns the sample points to be allocated in the second stage. The most appropriate cluster is assigned by calculating the membership of the k-nearest neighbors of each sample point, which differs from DPC and FKNN-DPC. For example, as shown in Fig. 1, the red points c₁ and c₂ are the cluster centers of the two clusters; the blue points have been allocated to cluster c₁; the yellow points have been allocated to cluster c₂; the black points are the remaining unallocated points. The point x_j is being allocated, and d (c₁, x_j) > d (c₂, x_j). In the dotted circle, x_j has 7 nearest neighbors, of which 3 blue points have been allocated to cluster c₁ and 2 yellow points to cluster c₂. So the degrees of membership are u_1j = 3/7 and u_2j = 2/7. According to the DPC allocation principle, x_j would be allocated to c₂, but allocating it to c₁ is more reasonable.

Fig. 1

Schematic diagram of membership degree of fuzzy k-nearest neighbors.
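The worked example of Fig. 1 can be reproduced directly from formulas (11) and (12). The function name and the encoding of the seven neighbors are assumptions; exact fractions are used so the values match the text.

```python
from fractions import Fraction

def fuzzy_knn_membership(knn_of_x, labels, n_clusters):
    """Formula (12): |KNN(x) ∩ C_i| / |KNN(x)| for every cluster i."""
    counts = [0] * n_clusters
    for j in knn_of_x:
        if labels[j] != -1:          # formula (11): keep allocated neighbours
            counts[labels[j]] += 1
    return [Fraction(c, len(knn_of_x)) for c in counts]

# Seven neighbours of x_j: three in cluster c1 (label 0), two in cluster c2
# (label 1), and two still unallocated (label -1).
labels = [0, 0, 0, 1, 1, -1, -1]
u = fuzzy_knn_membership(range(7), labels, n_clusters=2)
best = max(range(2), key=lambda c: u[c])
```

Since unallocated neighbors count in the denominator but in no numerator, the memberships of a point need not sum to 1.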

In summary, the design of the remaining point allocation strategy is shown in Allocation strategy 1 and 2.

4.2 The flow and complexity of LF-DPC

4.2.1 The steps of LF-DPC

The LF-DPC algorithm can be divided into four major steps: (1) Construct a distance matrix through the dataset. (2) Calculate local fair density and relative distance. (3) Plot a decision graph and select a candidate cluster center. (4) Assign the remaining points according to the new allocation strategy. The detailed steps of the LF-DPC algorithm are as follows:

Algorithm 2 The LF-DPC algorithm

Input: A dataset X, and the parameter k;

Output: the clustering result Y;

Step 1: Standardize the dataset X;

Step 2: Compute the Euclidean distance matrix d_ij using X;

Step 3: Calculate the local fair density using formulas (6)–(10);

Step 4: Calculate the relative distance using formulas (3) and (4);

Step 5: Plot the decision graph according to the local fair density and relative distance, and select the appropriate candidate cluster centers in the decision graph;

Step 6: Allocate the remaining points using Allocation strategy 1;

Step 7: If there are remaining points that have not been allocated, use Allocation strategy 2 to allocate them;

Step 8: Return Y.

4.2.2 Complexity analysis

In this section, we evaluate the complexity of the LF-DPC algorithm. Assume n is the size of the dataset, C denotes the number of candidate subclusters, and k represents the number of nearest neighbors.

The time complexity of the LF-DPC algorithm is mainly determined by the following aspects: (1) Calculating the distance matrix of the dataset (O (n²)). (2) Evaluating the local fair density (O (n²)). (3) (a) Allocation strategy 1 is similar to a graph breadth-first search, which uses an adjacency matrix or adjacency list for traversal, and its worst-case time complexity is O (n²); (b) Allocation strategy 2 uses the degree of membership to allocate the remaining points, and its time complexity does not exceed O (n²). Overall, the time complexity of the algorithm in this paper is O (n²), which is the same as DPC.

The space complexity mainly depends on the storage space required by the algorithm: (1) Store the distance matrix (O (n²)). (2) Store k-nearest neighbors for each point(O (kn)). (3) Additional space needs to be used in the allocation strategy. (a) Strategy 1 uses a queue, which is O (n); (b) A fuzzy membership matrix is used in Strategy 2, which is O (Cn). Since C and k are much smaller than n, the space complexity of the LF-DPC algorithm does not increase more than O (n²).

5 Experiments and analyses

In this section, the test datasets and experimental settings are first described in detail. The clustering effect of the proposed algorithm on the synthetic and real-world datasets is then analyzed, and finally the running time of the algorithm is summarized.

5.1 Test datasets and experimental settings

To verify the effectiveness of the LF-DPC algorithm, experiments and comparisons with K-means, DBSCAN, DPC, DPC-KNN, and FKNN-DPC were carried out. K-means and DBSCAN are among the most popular clustering algorithms. DPC is the original algorithm on which our algorithm is based. DPC-KNN only improves the calculation method of local density and does not consider the remaining points allocation strategy. FKNN-DPC designs a remaining points allocation strategy, but it is time-consuming. Therefore, it is reasonable to include all of these algorithms in the comparative experiments. The K-means function was provided by Matlab; we implemented DBSCAN and DPC-KNN based on the authors’ papers; the codes of DPC and FKNN-DPC were provided by their authors. The experimental environment is a PC with an Intel (R) Core(TM) i7-7500 CPU @ 2.70GHz, 2.90 GHz, 12G RAM, and the Windows 10 64-bit operating system; the programming software is MATLAB 2015b.

The experiment tested 16 datasets, including six synthetic datasets and ten real-world datasets from the UCI Machine Learning Repository. The synthetic datasets are Pathbased, Spiral, Flame, D31, Aggregation, and S2. Due to their different sizes and attributes, they are widely used in various clustering algorithms. Through experiments on synthetic datasets, the clustering performance of the algorithm in different scenarios can be simulated. The real-world datasets come from different fields, with different sizes, attributes, and the number of clusters. The adaptability of the algorithm may be assessed by experiments on various datasets. Tables 2 and 3 show all test datasets.

Table 2
Synthetic datasets

Dataset	Size	Attribute	Cluster
Pathbased	300	2	3
Spiral	312	2	3
Flame	240	2	2
D31	3100	2	31
Aggregation	788	2	7
S2	5000	2	15

Table 3

Real-world datasets

Dataset	Size	Attribute	Cluster
Iris	150	4	3
Seeds	210	7	3
Wine	178	13	3
Ionosphere	351	34	2
Scadi	70	206	7
Libras Movement	360	91	15
Wdbc	569	30	2
Segmentation	2310	19	7
Parkinsons	195	23	2
Dermatology	366	34	6

Three indicators are used in the experiments to evaluate clustering performance: clustering accuracy (ACC), adjusted Rand index (ARI), and adjusted mutual information (AMI). They are classic benchmarks for cluster evaluation [33, 34]. The closer an indicator is to 1, the better the clustering performance.
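Of the three indicators, ARI is the simplest to compute by hand. The sketch below uses the standard pair-counting definition of the adjusted Rand index, not code from the paper; the function name and toy labelings are assumptions, and at least two samples are required.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Pair-counting ARI computed from the contingency table."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:        # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same partition with renamed cluster ids, then a crossed partition.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI depends only on which points are grouped together, so renaming cluster ids does not change it. Values below 0, such as the K-means result on Spiral in Table 4, mean the agreement is below what chance would give.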

To reflect the performance of the proposed algorithm fairly, all the compared algorithms are tuned to obtain their best clustering effect. K-means takes the actual number of clusters as its parameter. Two parameters need to be determined in DBSCAN: the neighborhood radius ε, ranging from 0.04 to 0.99, and the minimum number of points MinPts, selected from 2 to 38. DPC needs the cutoff distance threshold to be set, within the range from 0.3% to 5%. DPC-KNN takes a percentage as an input parameter, ranging from 1% to 8%. FKNN-DPC and LF-DPC take the number of nearest neighbors as a parameter, ranging between 2 and 30.

Because the attributes of the test dataset itself have different dimensions and magnitude, all datasets are standardized before the experiment, which is formulated as follows: $x_{ij} = \frac{x_{ij} - min (x_{j})}{max (x_{j}) - min (x_{j})}$ (13)

Where x_ij is the value of the attribute column j of the data point x_i; min(x_j) and max(x_j) respectively represent the minimum and maximum values of the point in the attribute column j.
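Formula (13) is ordinary column-wise min-max scaling, which can be sketched as below. The function name and the handling of a constant column (mapped to 0 to avoid division by zero) are assumptions not specified in the text.

```python
def min_max_standardize(X):
    """Scale every attribute column of X to the interval [0, 1]."""
    cols = list(zip(*X))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in X
    ]

# Assumed toy data: two attributes on very different scales.
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
X_std = min_max_standardize(X)
```

After scaling, both attributes contribute on the same scale to the Euclidean distances computed in the next step.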

5.2 Experiments on synthetic datasets

The proposed algorithm was compared with K-means, DBSCAN, DPC, DPC-KNN, and FKNN-DPC on the synthetic datasets. Table 4 summarizes the ACC, ARI, and AMI of the six algorithms, together with the parameter (Par) that produced each result. LF-DPC achieved the best or tied-best result on five of the six datasets; on Aggregation it tied for the best ACC, while DPC was marginally ahead in ARI and AMI. Since all six synthetic datasets are two-dimensional, these results mainly show that LF-DPC handles datasets of different sizes, cluster shapes, and numbers of clusters. The clustering results on the six synthetic datasets are shown in Figs. 2–7.

Table 4
Cluster evaluation of six algorithms on synthetic datasets

Algorithm	ACC	ARI	AMI	Par	ACC	ARI	AMI	Par
		Pathbased				Spiral
K-means	0.7433	0.4613	0.5098	3	0.3429	-0.0061	-0.0056	3
DBSCAN	0.8033	0.5890	0.6884	0.065/4	1	1	1	0.04/2
DPC	0.7400	0.4572	0.5054	2%	1	1	1	4%
DPC-KNN	0.7600	0.4797	0.5294	5%	1	1	1	4%
FKNN-DPC	0.8967	0.7323	0.7744	8	1	1	1	7
LF-DPC	0.9900	0.9699	0.9525	8	1	1	1	5
		Flame			D31
K-means	0.8417	0.4647	0.3938	2	0.8868	0.8795	0.9397	31
DBSCAN	0.9417	0.9081	0.7570	0.065/4	0.8281	0.8078	0.8895	0.04/38
DPC	1	1	1	5%	0.9687	0.9372	0.9564	2%
DPC-KNN	1	1	1	2%	0.9677	0.9353	0.9551	1%
FKNN-DPC	0.9917	0.9666	0.9267	5	0.9690	0.9375	0.9566	9
LF-DPC	1	1	1	2	0.9745	0.9486	0.9634	28
		Aggregation			S2
K-means	0.7525	0.6963	0.7947	7	0.8884	0.8630	0.9133	15
DBSCAN	0.9835	0.9779	0.9529	0.04/6	0.8210	0.7485	0.8511	0.04/30
DPC	0.9975	0.9956	0.9922	2%	0.9696	0.9370	0.9446	2%
DPC-KNN	0.9962	0.9935	0.9892	1%	0.9678	0.9335	0.9429	2%
FKNN-DPC	0.9975	0.9949	0.9907	8	0.9588	0.9157	0.9341	8
LF-DPC	0.9975	0.9949	0.9905	7	0.9718	0.9415	0.9484	30

Fig. 2. Clustering results of six algorithms on the Pathbased dataset.

Fig. 3. Clustering results of six algorithms on the Spiral dataset.

Fig. 4. Clustering results of six algorithms on the Flame dataset.

Fig. 5. Clustering results of six algorithms on the D31 dataset.

Fig. 6. Clustering results of six algorithms on the Aggregation dataset.

Fig. 7. Clustering results of six algorithms on the S2 dataset.

Figure 2 shows the clustering results of the six algorithms on the Pathbased dataset, which consists of two inner clusters surrounded by a ring-shaped cluster. Although K-means found the cluster centers, the points on the left and right sides of the ring were incorrectly assigned to the inner clusters. DBSCAN allocated all the points on the right side of the ring incorrectly. The allocation strategy of DPC and DPC-KNN essentially assigns each point to the nearest cluster. Because the center of the ring-shaped cluster lies at its top and the points on both sides are closer to the centers of the inner clusters, the ring cannot be recovered correctly. FKNN-DPC improves the allocation strategy, but some points on the right side were still misallocated. LF-DPC clustered the Pathbased dataset almost perfectly (ACC 0.9900), mainly owing to its allocation strategy.
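The failure mode described for the ring-shaped cluster can be reproduced with a toy geometry. The sketch below is a deliberate simplification: it assigns each point directly to its closest center, which is not the full DPC allocation rule, and the coordinates are made up for illustration rather than taken from Pathbased. The ring's center point sits at the top of a unit circle, and one inner cluster's center sits at the origin.

```python
from math import cos, sin, pi, dist

ring_center = (0.0, 1.0)   # density peak of the ring, located at its top
inner_center = (0.0, 0.0)  # density peak of an inner cluster


def nearest_center(point):
    """Label a point by whichever of the two centers is closer."""
    if dist(point, ring_center) <= dist(point, inner_center):
        return "ring"
    return "inner"


# Eight points that all truly belong to the ring (unit circle, 45 degree steps).
ring_points = [(cos(2 * pi * i / 8), sin(2 * pi * i / 8)) for i in range(8)]
labels = [nearest_center(p) for p in ring_points]
wrong = labels.count("inner")
print(f"{wrong} of {len(ring_points)} ring points assigned to the inner cluster")
# prints: 5 of 8 ring points assigned to the inner cluster
```

Only the points near the top of the ring stay with the ring's center; points on the sides and at the bottom are closer to the inner center, which mirrors the misallocation visible for DPC and DPC-KNN.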

The Spiral dataset consists of three intertwined spiral-shaped clusters. Figure 3 shows the clustering results of the six algorithms on this dataset. K-means performed poorly (ARI and AMI close to 0), which reflects a known limitation of partitioning algorithms on non-convex clusters. The other five algorithms, all of which are density-based, identified the three clusters perfectly.

The Flame dataset is a commonly used dataset with an uneven density distribution. As Fig. 4 shows, K-means, DPC, DPC-KNN, FKNN-DPC, and LF-DPC all detected the cluster centers correctly. K-means nevertheless made several allocation errors. DBSCAN found the two clusters but flagged a few data points as noise. FKNN-DPC misallocated only two points. DPC, DPC-KNN, and the proposed algorithm recognized the two clusters perfectly.

Figure 5 shows the results of the six algorithms on the D31 dataset. K-means failed to find some of the cluster centers. DBSCAN failed to identify a small number of clusters and labeled many boundary points as noise. Although DPC, DPC-KNN, FKNN-DPC, and LF-DPC all recognized the 31 clusters, some points were clearly misallocated. Of the six algorithms, LF-DPC had the best clustering performance.

The Aggregation dataset has seven clusters of different shapes. Figure 6 shows the clustering results of the six algorithms on this dataset. K-means failed to find the centers of some clusters and had the worst clustering result. DBSCAN identified the seven clusters but labeled a few points as noise. DPC, DPC-KNN, FKNN-DPC, and LF-DPC also clustered this dataset effectively, with only a few misassigned points at the boundaries of connected clusters.

Figure 7 shows the clustering results of the six algorithms on the S2 dataset. DBSCAN labeled the boundary points of many clusters as noise and had the worst clustering performance. K-means failed to find the centers of three clusters. DPC, DPC-KNN, FKNN-DPC, and LF-DPC all identified the 15 clusters, with LF-DPC performing best.

5.3 Experiments on real-world datasets

LF-DPC was also compared with the other five algorithms on the real-world datasets; Table 5 summarizes the results.

As Table 5 shows, LF-DPC obtained the best ACC, ARI, and AMI on the Wine and Libras datasets, and tied with FKNN-DPC for the best values on Iris. On the Seeds dataset, LF-DPC was marginally behind FKNN-DPC in ACC (0.9238 versus 0.9240) but ahead in ARI and AMI. On the Ionosphere dataset, the proposed algorithm ranked second, behind only DBSCAN. On the high-dimensional Scadi dataset, LF-DPC had the highest ACC and was slightly behind DPC-KNN in ARI and AMI. On the Wdbc dataset, K-means performed best, which suggests that partitioning algorithms may be better suited to this dataset; among the density-based algorithms, LF-DPC had the highest ACC and AMI, and its ARI (0.4756) was marginally below that of DBSCAN (0.4786). On the Segmentation dataset, LF-DPC had the best ACC and AMI, while its ARI was slightly below those of FKNN-DPC and DPC-KNN. DPC-KNN, FKNN-DPC, and LF-DPC had identical clustering performance on the Parkinsons dataset. On the Dermatology dataset, LF-DPC led in ACC and ARI and was second only to FKNN-DPC in AMI. In summary, the proposed algorithm achieved the best or tied-best ACC on seven of the ten real-world datasets.
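The summary can be checked directly against the ACC column of Table 5. The short script below is only a reading aid: it transcribes the ACC values from the table (DBSCAN has no reported result on Scadi, recorded here as `None`) and counts the datasets on which LF-DPC attains the highest ACC, ties included. The helper name `winners` is illustrative.

```python
ALGORITHMS = ["K-means", "DBSCAN", "DPC", "DPC-KNN", "FKNN-DPC", "LF-DPC"]

# ACC values transcribed from Table 5, in the order of ALGORITHMS.
ACC = {
    "Iris":         [0.8867, 0.7400, 0.8867, 0.9600, 0.9733, 0.9733],
    "Seeds":        [0.8905, 0.6905, 0.9000, 0.9143, 0.9240, 0.9238],
    "Wine":         [0.9494, 0.8146, 0.8315, 0.8933, 0.9490, 0.9719],
    "Ionosphere":   [0.7123, 0.9145, 0.6752, 0.7379, 0.7520, 0.8661],
    "Scadi":        [0.6286, None,   0.6143, 0.6571, 0.7286, 0.7571],
    "Libras":       [0.4556, 0.3444, 0.4306, 0.4361, 0.4111, 0.4583],
    "Wdbc":         [0.9279, 0.8471, 0.8260, 0.8418, 0.8401, 0.8489],
    "Segmentation": [0.6000, 0.4143, 0.6238, 0.6381, 0.7160, 0.7238],
    "Parkinsons":   [0.6308, 0.5949, 0.7385, 0.8205, 0.8205, 0.8205],
    "Dermatology":  [0.4553, 0.5894, 0.7374, 0.7402, 0.7737, 0.8017],
}


def winners(scores):
    """Names of all algorithms that share the highest reported score."""
    best = max(s for s in scores if s is not None)
    return [name for name, s in zip(ALGORITHMS, scores) if s == best]


lf_best = [d for d, scores in ACC.items() if "LF-DPC" in winners(scores)]
print(len(lf_best), "of", len(ACC), "datasets:", ", ".join(lf_best))
# prints: 7 of 10 datasets: Iris, Wine, Scadi, Libras, Segmentation, Parkinsons, Dermatology
```

The three remaining datasets are won by FKNN-DPC (Seeds), DBSCAN (Ionosphere), and K-means (Wdbc), in line with the discussion above.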

Table 5
Cluster evaluation of six algorithms on real-world datasets

Algorithm	ACC	ARI	AMI	Par	ACC	ARI	AMI	Par
		Iris				Seeds
K-means	0.8867	0.7163	0.7331	3	0.8905	0.7049	0.6705	3
DBSCAN	0.7400	0.6120	0.5692	0.12/5	0.6905	0.5291	0.5302	0.24/16
DPC	0.8867	0.7196	0.7668	2%	0.9000	0.7341	0.7172	2%
DPC-KNN	0.9600	0.8857	0.8605	2%	0.9143	0.7664	0.7303	2%
FKNN-DPC	0.9733	0.9222	0.9124	7	0.9240	0.7900	0.7590	8
LF-DPC	0.9733	0.9222	0.9124	7	0.9238	0.7909	0.7669	6
		Wine				Ionosphere
K-means	0.9494	0.8471	0.8301	3	0.7123	0.1776	0.1294	2
DBSCAN	0.8146	0.5292	0.5484	0.5/21	0.9145	0.6835	0.5520	0.78/9
DPC	0.8315	0.5716	0.6461	2%	0.6752	0.1191	0.0764	2%
DPC-KNN	0.8933	0.6990	0.7228	8%	0.7379	0.2183	0.1355	1%
FKNN-DPC	0.9490	0.8520	0.8310	7	0.7520	0.2840	0.3550	8
LF-DPC	0.9719	0.9150	0.8800	7	0.8661	0.5276	0.4001	8
		Scadi				Libras.
K-means	0.6286	0.4682	0.5135	7	0.4556	0.3225	0.5304	15
DBSCAN	-	-	-	-	0.3444	0.1948	0.4217	0.9/2
DPC	0.6143	0.5044	0.4543	2%	0.4306	0.3128	0.5326	0.3%
DPC-KNN	0.6571	0.6588	0.5509	2%	0.4361	0.2694	0.4778	1%
FKNN-DPC	0.7286	0.6191	0.5319	6	0.4111	0.3270	0.5302	9
LF-DPC	0.7571	0.6323	0.5505	6	0.4583	0.33387	0.5430	6
		Wdbc				Segmentation
K-means	0.9279	0.7302	0.6110	2	0.6000	0.5106	0.6076	7
DBSCAN	0.8471	0.4786	0.3581	0.46/38	0.4143	0.2129	0.3693	0.5/4
DPC	0.8260	0.4106	0.3641	2%	0.6238	0.4952	0.6102	4%
DPC-KNN	0.8418	0.4552	0.4017	2%	0.6381	0.5412	0.6320	8%
FKNN-DPC	0.8401	0.4502	0.3974	7	0.7160	0.5550	0.6550	7
LF-DPC	0.8489	0.4756	0.4189	4	0.7238	0.5409	0.6683	9
		Parkinsons				Dermatology
K-means	0.6308	0.0520	0.2129	2	0.4553	0.3542	0.6049	6
DBSCAN	0.5949	0.0252	0.0071	0.5/17	0.5894	0.4106	0.5779	0.99/3
DPC	0.7385	0.1989	0.0994	2%	0.7374	0.6554	0.7999	2%
DPC-KNN	0.8205	0.2686	0.1772	2%	0.7402	0.6349	0.7731	2%
FKNN-DPC	0.8205	0.2686	0.1772	5	0.7737	0.7299	0.8645	7
LF-DPC	0.8205	0.2686	0.1772	6	0.8017	0.7523	0.8489	6

5.4 Run time comparison of six algorithms

Table 6 lists the run time of LF-DPC and the other five algorithms. On the small-scale datasets, LF-DPC and FKNN-DPC ran at a similar speed, while K-means and DBSCAN were considerably faster. On the larger datasets, such as S2, the run time of FKNN-DPC increased significantly (48.1286 s versus 26.7913 s for LF-DPC), which can be ascribed to its time-consuming remaining-points allocation strategy. LF-DPC also uses a dedicated allocation strategy, yet it ran faster than FKNN-DPC on 12 of the 16 datasets.
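The text does not state how the run times were measured. One common way to obtain such wall-clock figures is sketched below, assuming a best-of-several-runs policy and Python's `time.perf_counter`; the helper `time_algorithm` and the toy workload are illustrative and are not the authors' timing code.

```python
import time


def time_algorithm(fn, *args, repeats=3):
    """Return the smallest wall-clock time in seconds over `repeats` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Toy workload standing in for a clustering call: all pairwise squared distances.
def pairwise_sq_dists(points):
    return [[(a - b) ** 2 for b in points] for a in points]


elapsed = time_algorithm(pairwise_sq_dists, list(range(300)))
print(f"{elapsed:.4f} s")
```

Taking the minimum over a few repeats reduces the influence of background load, which matters for the small datasets in Table 6 where run times are well below one second.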

Table 6
Run time of six algorithms on 16 datasets (time is measured in seconds)

Datasets	K-means	DBSCAN	DPC	DPC-KNN	FKNN-DPC	LF-DPC
Pathbased	0.0062	0.0242	0.1685	0.1436	0.3156	0.3016
Spiral	0.0075	0.0082	0.1765	0.1648	0.3142	0.3175
Flame	0.0082	0.0581	0.1642	0.1570	0.4822	0.4583
D31	1.7324	0.4265	1.1244	0.9254	9.6580	9.4266
Aggregation	0.0132	0.0175	0.2326	0.1838	0.9855	1.0993
S2	0.0278	0.8056	2.5542	1.9756	48.1286	26.7913
Iris	0.0062	0.0078	0.1528	0.1556	0.2262	0.2176
Seeds	0.0076	0.0084	0.1118	0.1452	0.2366	0.2562
Wine	0.0084	0.0092	0.1658	0.1426	0.2420	0.2398
Ionosphere	0.0125	0.0121	0.1742	0.1874	0.5557	0.5421
Scadi	0.0116	–	0.3406	0.1456	0.5152	0.2138
Libras	0.0304	0.0326	0.2282	0.2241	0.3520	0.4126
Wdbc	0.0094	0.0252	0.2054	0.1951	0.4752	0.2246
Segmentation	0.1282	0.2044	0.3726	0.3740	0.9620	0.6481
Parkinsons	0.0122	0.0264	0.1466	0.1458	0.3402	0.2862
Dermatology	0.0070	0.0164	0.1608	0.1784	0.4436	0.3884

6 Conclusion

To address the deficiencies of DPC, we proposed the LF-DPC algorithm. First, a method for calculating the local fair density was proposed, which identifies cluster centers more accurately and makes high-quality clustering results easier to obtain. Second, a two-stage remaining-points allocation strategy was proposed to address the chain-reaction problem in the DPC allocation strategy, which markedly improved clustering accuracy. Finally, the effectiveness of the proposed algorithm was verified on multiple datasets against five other algorithms. The experimental results show that LF-DPC can adapt to datasets of different types, sizes, and dimensions. However, the key parameter K is still specified manually, and an improperly chosen K degrades clustering performance. In the future, density peaks clustering algorithms that determine their parameters automatically, and their application scenarios, will be further investigated.

Acknowledgment

We are very grateful to Professor Juanying Xie et al. for providing the source code of the FKNN-DPC algorithm [18], which was published in Information Sciences in 2016.

This work has been supported in part by the National Key Research and Development Program of China (No. 2018YFB1701500 and No. 2018YFB1701502).

Author contributions

Manuscript writing and experimental design: Chunhua Ren; manuscript revision: Chunhua Ren, Linfu Sun, and Yunhui Gao; manuscript typesetting: Chunhua Ren, Yunhui Gao, and Yang Yu.

References

1. Frey B.J. and Dueck, Clustering by passing messages between data points, Science 315(5814) (2007), 972–976.

2. […] and Wunsch D.C., Survey of clustering algorithms, IEEE Transactions on Neural Networks 16(3) (2005), 645–678.

3. Han J.W., Kamber and Pei, Data mining: concepts and techniques, Data Mining Concepts Models Methods and Algorithms Second Edition 5(4) (2011), 1–18.

4. Jain, Murty and Flynn, Data clustering: a review, ACM Computing Surveys 31(3) (1999), 264–323.

5. Chaira and Panwar, An Atanassov’s intuitionistic fuzzy kernel clustering for medical image segmentation, International Journal of Computational Intelligence Systems 7(2) (2014), 360–370.

6. Ghai, Gera and Jain, A new approach to extract text from images based on DWT and K-means clustering, International Journal of Computational Intelligence Systems 9(5) (2016), 900–916.

7. Bai, Yang and Shi, An overlapping community detection algorithm based on density peaks, Neurocomputing 226 (2016), 7–15.

8. Liu, Jin and Baquero, Genetic algorithm with a local search strategy for discovering communities in complex networks, International Journal of Computational Intelligence Systems 6(2) (2013), 354–369.

9. Hosseini, Maleki and Gholamian M.R., Cluster analysis using data mining approach to develop CRM methodology to assess the customer loyalty, Expert Systems with Applications 37(7) (2010), 5259–5264.

10. Wang C.H., Outlier identification and market segmentation using kernel-based clustering techniques, Expert Systems with Applications 36(2) (2009), 3744–3750.

11. Jain A.K., Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31(8) (2010), 651–666.

12. Zhang, Ramakrishnan and Livny, Birch: an efficient data clustering method for very large databases, ACM Sigmod Record 25(2) (1996), 103–114.

13. Wang, Yang and Muntz, Sting: a statistical information grid approach to spatial data mining, In Proceedings of the Very Large Databases (VLDB), Athens, Greece, 1997, pp. 186–195.

14. Dempster, Laird and Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society 39(1) (1977), 1–38.

15. Ester, Kriegel, Sander and Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, 1996, pp. 226–231.

16. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.

17. […], Ding and Jia, Study on density peaks clustering based on k-nearest neighbors and principal component analysis, Knowledge-Based Systems 99 (2016), 135–145.

18. Xie, Gao, Xie and Liu, Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors, Information Sciences 354 (2016), 19–40.

19. Liu, Ma and Yu, Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy, Knowledge-Based Systems 133 (2017), 208–220.

20. […] and Tang, Comparative density peaks clustering, Expert Systems with Applications 95 (2018), 236–247.

21. Liu, Wang and Yu, Shared-nearest-neighbor-based clustering by fast search and find of density peaks, Information Sciences 450 (2018), 200–226.

22. Cheng, Zhu, Huang and Yang, Natural neighbor-based clustering algorithm with density peaks, In Proceedings International Joint Conference on Neural Networks (IJCNN), Vancouver, Canada, 2016, pp. 92–98.

23. […], Lee, Isokawa, Yao and Xia, Efficient clustering method based on density peaks with symmetric neighborhood relationship, IEEE Access 7 (2019), 60684–60696.

24. Zhao, Tang, Fan, et al., Density peaks clustering based on circular partition and grid similarity, Concurrency and Computation Practice and Experience 32(3) (2019), e5567.

25. Fan, Yao, Han, et al., Density peaks clustering based on k-nearest neighbors sharing, Concurrency and Computation Practice and Experience (2020), e5993.

26. […], Liu, Guo and Liu, Density peaks clustering based on weighted local density sequence and nearest neighbor assignment, IEEE Access 7 (2019), 34301–34317.

27. Jiang, Tao and Li, DFC: density fragment clustering without peaks, Journal of Intelligent and Fuzzy Systems 34(1) (2018), 525–536.

28. Zhuo, Li, Liao, Li, Wei and Li, HCFS: a density peak based on clustering algorithm employing a hierarchical strategy, IEEE Access 7 (2019), 74612–74624.

29. Wang and Zhu, Density peaks clustering based on local minimal spanning tree, IEEE Access 7 (2019), 108438–108446.

30. […], Ding, Xu, Liao and Xue, A feasible density peaks clustering algorithm with a merging strategy, Soft Computing 23(13) (2019), 5171–5183.

31. Ren, Sun, Yu, et al., Effective density peaks clustering algorithm based on the layered K-Nearest neighbors and subcluster merging, IEEE Access 8 (2020), 123449–123468.

32. Keller, Gray and Givens, A fuzzy K-nearest neighbor algorithm, IEEE Transactions on Systems, Man, and Cybernetics 15(4) (1985), 580–585.

33. Vinh, Epps and Bailey, Information theoretic measures for clustering comparison: variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11 (2010), 2837–2854.

34. Boudane and Berrichi, Gabriel graph-based connectivity and density for internal validity of clustering, Progress in Artificial Intelligence 9(3) (2020), 221–238.
