Extended clustering algorithm based on cluster shape boundary

Abstract

Based on the shape characteristics of the sample distribution in the clustering problem, this paper proposes an extended clustering algorithm based on cluster shape boundary (ECBSB). The algorithm automatically determines the number of clusters and the classification discrimination boundaries by finding the boundary closures of the clusters from a global perspective of the sample distribution. Since ECBSB is insensitive to local features of the sample distribution, it can accurately identify clusters with complex shapes and uneven density distributions. ECBSB first detects the shape boundary points of the clusters in the sample set after edge noise points have been eliminated, and then generates boundary closures around the clusters based on the boundary points. Finally, the cluster labels of the boundary points are propagated to the entire sample set by a nearest neighbor search. The proposed method is evaluated on multiple benchmark datasets. Extensive experimental results show that the proposed method achieves highly accurate and robust clustering results, and is superior to the classical clustering baselines on most of the test data.

Keywords

Clustering; Global characteristics; Boundary closure; Complicated shape

1. Introduction

Clustering is one of the most fundamental problems in data science and machine learning communities [1, 2]. Enormous efforts have been made on clustering algorithms over the past decades [3, 4]. However, with the development of consumer-level data acquisition devices, the data acquired in real-world applications nowadays are highly disordered, unlabeled, and complex, which brings new challenges to this problem [5].

Clustering algorithms [6, 7, 8] can be classified by their representation and methodology: partition clustering [9, 10, 11, 12, 13, 14, 15, 16, 17], density-based clustering [18, 19, 20, 21, 22, 23], model-based clustering [24, 25], graph clustering [27, 28, 29, 30], and learning-based clustering [31, 32, 33], as shown in Table 1. These methods tackle the clustering problem either by using hand-crafted pipelines that first find the cluster centers and then determine the labels, or by applying data-driven strategies to learn the clustering rules. To be specific, K-means [9, 10, 11, 12, 13] repeatedly adjusts the centers of mass in order to minimize the objective function. The ${l}^{2}$-Wkmeans algorithm [14] is a K-means clustering framework that extends W-k-means to feature weights by ${{l}^{2}}$ norm regularization. Meanwhile, K-medoids [15] uses the sample closest to the center of mass to replace the center of mass. Fuzzy C-means [16, 17] obtains the membership degree of each sample point by optimizing the objective function, so as to determine the cluster to which each sample point belongs. DPC [18] constructs a decision graph according to the relationship between the density and the relative distance of samples, which provides a reliable reference for users to determine the number of clusters. DBSCAN [19, 20, 21] determines the relationship of samples in clusters by a density connection method, and can identify clusters with arbitrary complex shapes. DENCLUE [22] uses an influence function to describe samples and searches for cluster centers through density attractors, which gives it a strict mathematical basis. Furthermore, Chameleon [23] dynamically balances the interconnectivity and the closeness between clusters, and can find clusters of any shape. GMM [24] first calculates the probability that each sample belongs to each Gaussian distribution, and then allocates samples to the distribution with the highest probability. The work in [25] proposed a semi-random model for K-means clustering that generalizes the GMM, and thus improves the robustness of the algorithm. The affinity propagation algorithm [26] carries out information transfer between data points in the distance matrix, and determines the final clustering centers by updating the responsibility matrix and the availability matrix. Additionally, the spectral clustering method [27, 28, 29, 30] calculates the eigenvectors of a Laplacian matrix, and then uses K-means to cluster the eigenvectors. Lastly, the Deepcluster [31] algorithm uses K-means iteratively to group features before training the neural network, and then updates the network weights.

Table 1
Overview of clustering algorithms

Types	Algorithms	Characteristic
Partition clustering	K-means; K-medoids; ${{l}^{2}}$-Wkmeans; Fuzzy C-means	Optimize the objective function; the K value needs to be set
Density-based clustering	DPC; DBSCAN; DENCLUE; Chameleon	Calculate the sample density and adjust the clustering results by parameters
Model-based clustering	GMM; Semi-random model	Find a model that can best fit a given dataset
Graph clustering	Spectral clustering	The object of clustering is eigenvector; the K value needs to be set
Learning-based clustering	Deepcluster	Effective combination of deep learning method and K-means algorithm

Although the aforementioned algorithms are able to produce relatively convincing results in some scenarios, the limitations are very obvious: (1) the number of clusters and clustering parameters need to be defined carefully, which is cumbersome and infeasible for some practical applications; (2) for data with a complex density distribution, the location of clustering center may be fuzzy and difficult to determine; and (3) it is difficult to identify clusters with complex shapes. For example, K-means, C-means, K-medoids, and DPC require the user to enter the number of clusters. Furthermore, K-means and K-medoids are not suitable for finding clusters with nonconvex shapes, and DBSCAN is not good at identifying clusters with uneven density distribution. The GMM-based algorithm assumes that the samples obey $K$ different Gaussian distributions, which means that the K value needs to be determined manually. Moreover, the affinity propagation algorithm needs to set parameter $P$ , which is positively related to the number of clustering centers. The spectral clustering and Deepcluster algorithms must also set $K$ manually when using K-means.

In this paper, an extended clustering algorithm based on cluster shape boundary (ECBSB) is proposed. ECBSB searches the cluster boundary closure from the global perspective of sample distribution, and then propagates the cluster label from the boundary closure to the center, which reduces the algorithm’s dependence on the cluster shape. It is significantly different from the existing clustering algorithms, as it better utilizes the global characteristics of data distribution by analyzing the cluster boundary explicitly.

The contributions of this paper are threefold:

• A clustering method that extends from boundary to center is proposed. Based on the global characteristics of the sample distribution, the proposed method searches the cluster boundary closure, and then propagates the cluster labels from the boundary closure to the center of the cluster to realize clustering.

• ECBSB can automatically determine the number of clusters according to the modified number of boundary closures, and can identify clusters with complex shapes. ECBSB can also well identify clusters with an uneven density distribution.

• Experiments demonstrate that the proposed method can achieve high accuracy and robust clustering results, and outperforms the classical clustering algorithms on most of the datasets.

The rest of this paper is organized as follows. Section 2 briefly introduces the related works. Section 3 introduces the main idea and framework of ECBSB. Section 4 introduces the specific implementation of ECBSB based on the Aggregation dataset, and then explains the parameter selection and noise processing method. Section 5 shows the test results of ECBSB on benchmark datasets, and compares it with some classical clustering algorithms. Conclusions and directions for future work are provided in Section 6.

2. Related works

It is a common strategy for most clustering methods to find clusters by finding cluster centers. However, this often requires a lot of calculation. Searching for the boundary closure can greatly simplify the steps of clustering, and lead to better clustering results. This section introduces the related boundary clustering methods, and gives a detailed description of the current popular clustering algorithms.

Esra et al. [34] constructed a connected graph method based on small surface detection, and theoretically proved that the small surface based on pattern clustering is the boundary of clustering. Therefore, clustering based on boundary information is feasible. In order to solve the problem that two adjacent points belonging to different classes may be clustered into one class, Zhong et al. [35] proposed a clustering algorithm based on cluster boundary information. The algorithm selects boundary points in the sample space by setting the density threshold. Then, the idea of transitivity is used to cluster the non-boundary points. Although this method is based on the boundary information, it still extends from the cluster center to the boundary, and needs to search a large number of non-boundary points, and thus the efficiency of the algorithm is not high. Additionally, it is likely to link the samples in two different clusters during the transmission process, and hence it is not suitable for clusters with an uneven density distribution.

The partition-based K-means algorithm first randomly selects $K$ cluster centers, and then repeatedly adjusts the cluster centers to minimize the objective function so as to determine their final locations. The K-means++ algorithm [36] chooses the initial cluster centers so that they tend to be far apart. It first randomly selects one initial cluster center, then chooses each subsequent center from the remaining points with probability proportional to the squared distance to the nearest center already chosen, and repeats this process until all $K$ centers are selected.

The DBSCAN algorithm describes the density of the sample distribution according to a set of neighborhood parameters $(\varepsilon,\textit{MinPts})$ , and derives the maximal density-connected sample sets by defining core points, direct density reachability, density reachability, and so on. The DPC algorithm uses the index $\gamma$ ( $\gamma=\text{density}\times\text{relative distance}$ ) to select the cluster centers, based on the observation that cluster centers have a high density and are far from each other. However, the DPC algorithm needs the number of clusters $K$ to be set manually, so that the $K$ samples with the largest $\gamma$ values can be taken as the cluster centers.

The spectral clustering algorithm based on graph theory needs to calculate the similarity matrix and Laplacian matrix between samples, and clusters the eigenvectors of the Laplacian matrix in order to cluster the samples. The principle of spectral clustering is similar to the dimensionality reduction of data, that is, mapping data from a high-dimensional space to a low-dimensional space. Although there are many spectral clustering methods, the spectral clustering algorithm considered in this paper is the most basic one: it is based on eigenvectors and uses an algorithm similar to K-means to cluster them.

3. Main idea of ECBSB

The main idea of ECBSB is to emphasize the global distribution characteristics of the data points in the sample set by finding the shape boundary of the cluster, and to gather the samples in the cluster boundary closure into a class, rather than cluster based on the density or distance of samples.

For example, when clustering the samples in Fig. 1a, we will determine the clustering results according to whether samples belong to the same closure, rather than determining the distance and density of samples. Take the clustering problem of points $A$ , $B$ , and $C$ in Fig. 1b as an example. If we only cluster these three points according to Euclidean distance or density, it is difficult to cluster points $A$ and $B$ into the same category. However, points $A$ and $B$ belong to Area_1 in the global distribution of the sample set, and thus it is more reasonable to cluster them together even though point $C$ is closer to $A$ than point $B$ is to $A$ .

Figure 1.

Global characteristics of sample distribution. a Sample set; b divide the sample set into clusters Area_1 and Area_2 according to the characteristics of global distribution

Based on this idea, ECBSB finds cluster boundary closures according to the global distribution of the sample set. However, there are usually noise points in the sample set (such as the points in the circle in Fig. 1a), which will cause serious interference to the clustering of other samples. Therefore, the interference of noise should be eliminated, and then the cluster boundary should be extracted according to the overall distribution characteristics of the sample set. ECBSB consists of the following four steps:

• STEP1: Eliminate edge noise points. Determine and eliminate edge noise points in the original dataset in order to highlight the shape boundaries of the cluster and reduce the interconnectivity between the clusters.

• STEP2: Extract boundary points. Find the shape boundary points of the cluster.

• STEP3: Close the cluster boundary. In the set of boundary points, connect boundary points according to the nearest neighbor search to form the boundary closure of the cluster. The boundary closure is detected and modified, and then the clustering number is automatically determined according to the number of closures.

• STEP4: Extended clustering. According to the nearest neighbor principle, transfer cluster labels from boundary points to non-boundary points.

The extended clustering algorithm based on cluster shape boundary is presented in Algorithm 1.

Algorithm 1: Framework of ECBSB
Input: Raw sample data: $\Phi$
Output: Label set of clustering results: $\Omega$
Get Euclidean distance matrix $E$ from $\Phi$
$\textit{Processed\_Data}\leftarrow\textit{Eliminate\_edge\_noise}(E)$
$Q\leftarrow\textit{Extract\_boundary\_points}(\textit{Processed\_Data})$ // $Q$ represents the set of boundary points.
$\textit{BPL}\leftarrow\textit{Closure\_Formation}(Q,\textit{Processed\_Data})$ // BPL represents cluster label of boundary points.
$\Omega\leftarrow\textit{Extension}(\Phi,\textit{BPL},E)$
End

4. ECBSB: Extended clustering algorithm based on cluster shape boundary

In this section, we introduce the specific implementation method of the extended clustering algorithm based on cluster shape boundary. The pseudocode of the key steps is given, and the flow of the algorithm is explained in detail with the Aggregation [34] dataset. Then, we explain how the algorithm processes noise, and how the parameters are set.

4.1 Implementation of ECBSB

Before the implementation of the algorithm, the related variables and concepts are introduced. Sample set $D$ contains $N$ samples. Each sample has $f$ features and is expressed as $\mathbf{d}_{i}=({{d}_{i1}},{{d}_{i2}},\ldots,{{d}_{if}})\in D,\ (i=1,\ldots,N)$ . The neighborhood of ${{d}_{i}}$ is defined as $\mathbf{Nei}({{d}_{i}},\textit{radius})=\{{{d}_{k}}\in D\mid\textit{dist}({{d}_{i}},{{d}_{k}})\leqslant\textit{radius}\}$ , where $\textit{dist}({{d}_{i}},{{d}_{k}})$ represents the Euclidean distance between ${{d}_{i}}$ and ${{d}_{k}}$ , and radius is the neighborhood radius. The Euclidean distance matrix between the $N$ samples is expressed as ${{E}_{N\times N}}$ .
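The definitions above can be illustrated with a minimal Python sketch. The function names `distance_matrix` and `neighborhood` are hypothetical and chosen only for illustration; the sketch assumes that samples are given as tuples of coordinates and that ${{d}_{i}}$ itself belongs to its own neighborhood, since its distance to itself is 0.

```python
import math

def distance_matrix(samples):
    """Return the N x N Euclidean distance matrix E as nested lists."""
    return [[math.dist(a, b) for b in samples] for a in samples]

def neighborhood(E, i, radius):
    """Indices k with dist(d_i, d_k) <= radius.

    Assumption: d_i is included in its own neighborhood (distance 0).
    """
    return [k for k, d in enumerate(E[i]) if d <= radius]

# Small illustrative sample set (not taken from the paper).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
E = distance_matrix(samples)
nei_of_first = neighborhood(E, 0, 2.0)  # indices of samples within radius 2 of samples[0]
```

Storing $E$ once and reusing it in later steps mirrors the way the framework passes the distance matrix between steps.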

Eliminate edge noise points

In order to find the boundary points accurately, we need to first remove the edge noise points. The points satisfying the following conditions in the sample set are determined as edge noise points, and are eliminated.

$\displaystyle 1)\ {en_{i}}\in D$ (1)
$\displaystyle 2)\ {en_{i}}\_\textit{num}<\textit{Cnum}$ (2)

Here, $en_{i}\_\textit{num}$ represents the number of samples belonging to $\mathbf{Nei(e{{n}_{i}},Crad)}$ ; Cnum and the neighborhood radius Crad are parameters set by the user before the algorithm is run, and are explained in Sect. 4.4. The set of edge noise points is represented as $\mathbf{C}=\{e{{n}_{1}},\ldots,e{{n}_{c}}|e{{n}_{i}}\in D,i=1,\ldots,c\}$ , where $c$ is the number of edge noise points.

After the edge noise points are determined, the distance matrix $\mathbf{Processed\_Data}{{}_{(N-c)\times(N-c)}}$ without edge noise points can be obtained by eliminating the rows and columns involving edge noise points in matrix $\mathbf{E}$ . After this step, the connectivity between clusters is reduced, which is helpful for determining the boundary points in the next step.
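Conditions (1) and (2) and the construction of the reduced distance matrix can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `eliminate_edge_noise` and its return values are hypothetical, and the neighbor count is assumed to include the point itself.

```python
import math

def eliminate_edge_noise(samples, crad, cnum):
    """Split samples into edge noise and kept points, and reduce E accordingly.

    Assumption: the neighbor count of a point includes the point itself.
    """
    E = [[math.dist(a, b) for b in samples] for a in samples]
    counts = [sum(1 for d in row if d <= crad) for row in E]
    noise = [i for i, c in enumerate(counts) if c < cnum]   # set C
    keep = [i for i, c in enumerate(counts) if c >= cnum]   # set D - C
    # Drop the rows and columns of E that belong to edge noise points.
    processed = [[E[i][j] for j in keep] for i in keep]
    return noise, keep, processed

# Four points on a unit square plus one isolated point (illustrative data).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
noise, keep, processed = eliminate_edge_noise(samples, crad=1.5, cnum=3)
```

In this toy example only the isolated point has fewer than Cnum neighbors within Crad, so it is the only one removed.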

Extract boundary points

Taking any sample ${{q}_{i}}$ in set $D-C$ as the center, its neighborhood $\mathbf{Nei({{q}_{i}},Crad)}$ is partitioned into quadrants. If every quadrant contains at least one sample, ${{q}_{i}}$ is an interior point; otherwise, ${{q}_{i}}$ is determined to be a boundary point. The specific process is as follows.

In set $D-C$ , we use $f$ -bit binary code ${0\ldots 101}$ to represent the difference between ${{q}_{i}}$ and samples in $\mathbf{Nei({{q}_{i}},Crad)}$ . Each bit of the encoded value represents the difference in the corresponding attribute (where 0 represents a positive difference and 1 represents a negative difference). The encoded value ${0\ldots 00},{0\ldots 01},\ldots,{1\ldots 11}$ corresponds to the ${{2}^{f}}$ quadrants of ${{q}_{i}}$ in turn. According to their encoded values, we put the samples in $\mathbf{Nei({{q}_{i}},Crad)}$ into different quadrants of ${{q}_{i}}$ , and store the number of neighborhood points in quadrants in $\mathbf{S}={{[0\ldots 0]}_{1\times{{2}^{f}}}}$ . If the matrix $\mathbf{S}$ of sample ${{q}_{i}}$ does not include 0, then ${{q}_{i}}$ is not a boundary point, otherwise, ${{q}_{i}}$ is determined as a boundary point. The set of boundary points is expressed as $\mathbf{Q}=\{{{q}_{1}},{{q}_{2}},{{q}_{3}},\ldots,{{q}_{n}}|{{q}_{i}}\in D-C\}$ , where $n$ is the number of boundary points.

Figure 2.

Example of boundary point determination.

Taking the two-dimensional sample space as an example, the encoded values $00$ , $01$ , $10$ , and $11$ correspond to the first, second, third, and fourth quadrants of ${{q}_{i}}$ , respectively. The encoded value $00$ indicates that the differences ${{q}_{k}}-{{q}_{i}}$ $({{q}_{k}}\in\mathbf{Nei({{q}_{i}},Crad)})$ in both feature dimensions are greater than 0. That is, when the encoded value of ${{q}_{k}}$ is $00$ , the sample is assigned to the first quadrant of ${{q}_{i}}$ , and the count in the first position of the $\mathbf{S}$ matrix of ${{q}_{i}}$ is increased by 1, e.g., from ${{[0,0,0,0]}_{1\times 4}}$ to $\mathbf{S}={{[1,0,0,0]}_{1\times 4}}$ .

As shown in Fig. 2, the $\mathbf{S}$ matrix $[2,3,2,1]$ of point $A$ does not contain 0, so point $A$ is not a boundary point. The $S$ matrix $[0,1,0,1]$ of point $C$ contains 0, and thus point $C$ is a boundary point. Points $B$ and $D$ can then be judged according to this method.
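The quadrant encoding and the boundary test can be sketched in a few lines of Python. The names `quadrant_code` and `is_boundary_point` are hypothetical. Two details are assumptions not fixed by the text: a zero difference is treated as positive (bit 0), and ${{q}_{i}}$ itself is not counted in any quadrant.

```python
import math

def quadrant_code(q_i, q_k):
    """f-bit code of q_k relative to q_i: bit 0 = positive, bit 1 = negative difference.

    Assumption: a zero difference is encoded as 0 (positive).
    """
    code = 0
    for a, b in zip(q_i, q_k):
        code = (code << 1) | (1 if b - a < 0 else 0)
    return code

def is_boundary_point(points, i, crad):
    """Return (is_boundary, S), where S counts neighbors in each of the 2^f quadrants."""
    f = len(points[i])
    S = [0] * (2 ** f)
    for k, q_k in enumerate(points):
        # Assumption: q_i itself is skipped when filling S.
        if k != i and math.dist(points[i], q_k) <= crad:
            S[quadrant_code(points[i], q_k)] += 1
    return (0 in S), S

# A center point surrounded by one neighbor per quadrant (illustrative data).
points = [(0.0, 0.0), (1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
center_result = is_boundary_point(points, 0, 2.0)
corner_result = is_boundary_point(points, 1, 2.0)
```

Here the center has a neighbor in every quadrant and is interior, while the corner point has an empty first quadrant and is therefore flagged as a boundary point.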

Close the cluster boundary

The core step of ECBSB is to connect the boundary points according to the nearest neighbor search principle to form the boundary closure of the cluster. Before connecting the boundary points, we first define cluster labels and link labels for all points in sample set $D$ . The labels are represented by array $\textit{Label}=[\textit{Cluster\_label},\textit{Link\_label}]$ . The cluster label represents the category of the cluster to which it belongs. Link_label is the label for each point when connecting boundary points. If the value of link label is true, it means the points are connected to each other; otherwise, the points are not connected. The initial value of Label is $[0,\textit{FALSE}]$ .

We specify a starting point before connecting boundary points. The starting point ${{q}_{k}}\in Q,(1\leqslant k\leqslant n)$ is randomly selected from the boundary points whose Label is $[0,\textit{FALSE}]$ . The cluster label of the starting point is $k$ , and the link label is 1. The starting point is also the end point of the boundary closure. All points connecting the start and end points are called process points. The cluster label of a process point is equal to the cluster label of the starting point, and its link label is $-1$ .

Figure 3.

Example of boundary closure formation.

For example, the boundary points ${{q}_{A}},{{q}_{B}},{{q}_{C}}$ , and ${{q}_{D}}$ in Fig. 3 are initially marked $[0,0]$ . When connecting the boundary points, we first randomly select ${{q}_{A}}$ as the starting point, that is, the label of ${{q}_{A}}$ becomes $[A,1]$ . In the boundary point set $\{{{q}_{A}},{{q}_{B}},{{q}_{C}},{{q}_{D}}\}$ , ${{q}_{B}}$ is closest to ${{q}_{A}}$ , and the label of ${{q}_{B}}$ is $[0,0]$ ; thus, the label of ${{q}_{B}}$ becomes $[A,-1]$ . Similarly, the labels of ${{q}_{C}}$ and ${{q}_{D}}$ both change to $[A,-1]$ . The point closest to ${{q}_{D}}$ is ${{q}_{A}}$ , but the link label of ${{q}_{A}}$ is 1, and therefore ${{q}_{A}}$ is taken as the end of the connection. Finally, samples whose cluster label is $A$ are clustered together.

We give the pseudocode of the cluster boundary closing process in Algorithm 2. Algorithm 2 randomly selects a point marked as [0, 0] in the set of boundary points as the starting point, and then finds the process points through a recursive function. The recursive process is shown in Algorithm 3.

Algorithm 2: Closure_formation
Input: $Q$ , Processed_Data
Output: BPL
for $i=1\to n$ do
    if ${{q}_{i}}\_\textit{Label}=[0,0]$ then
        ${{q}_{i}}\_\textit{Label}=[i,1]$ ;
        Search neighborhood points set $QN$ of ${{q}_{i}}$ in $Q$ ; // $QN=\{q{{n}_{1}},q{{n}_{2}},\ldots,q{{n}_{k}}|q{{n}_{j}}\in Q,j=1,\ldots,k\}$ .
        for $j=1\to k$ do
            if $q{{n}_{j}}\_\textit{Label}=[0,0]$ then
                Recursive (Processed_Data, $q{{n}_{j}}$ , $i$ , $Q$ );
            end if
        end for
    end if
end for
return BPL

Algorithm 3: Recursive (Processed_Data, $q{{n}_{j}}$ , $i$ , $Q$ )
Input: Processed_Data, $q{{n}_{j}}$ , $i$ , $Q$
Output: BPL
$q{{n}_{j}}\_\textit{Label}=[i,-1]$ ;
Search neighborhood points set $QN$ of $q{{n}_{j}}$ in $Q$ ; // $QN=\{q{{n}_{1}},q{{n}_{2}},\ldots,q{{n}_{w}}|q{{n}_{k}}\in Q,k=1,\ldots,w\}$ .
for $k=1\to w$ do
    if $q{{n}_{k}}\_\textit{Label}=[i,1]$ then
        break;
    end if
    Recursive (Processed_Data, $q{{n}_{k}}$ , $i$ , $Q$ );
end for
return BPL

When the nearest neighbor of the process point is the starting point, the recursion ends. However, when the starting point and the process point are closest to each other, the recursive process easily falls into an endless loop. Therefore, we need to set another threshold; when the number of process points with the same cluster label is greater than this threshold, the process points are allowed to connect with the starting point.

After the boundary closure is formed, the boundary points on the same closure have the same cluster labels. However, ring clusters, such as Area_1 in Fig. 1b, have two boundary closures and two different cluster labels, and thus we need to modify the cluster labels of boundary closures. If the two boundary closures can be connected through non-boundary points, then the two closures can be grouped into one class, and the same cluster label can be used.
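The nearest-neighbor chaining idea behind the closure step can be illustrated with a deliberately simplified Python sketch. It is not the authors' Closure_formation procedure: it is iterative rather than recursive, picks starting points in index order rather than at random, numbers closures consecutively, and omits the link label, the loop threshold, and the merging of ring-cluster closures. The function name `close_boundaries`, the stopping parameter `max_gap`, and the helper `ring` are hypothetical.

```python
import math

def close_boundaries(boundary_pts, max_gap):
    """Greedy nearest-neighbor chaining of boundary points into labeled closures.

    Assumption: a chain stops when the nearest unlabeled boundary point is
    farther than max_gap (a hypothetical parameter, not from the paper).
    """
    n = len(boundary_pts)
    labels = [0] * n          # 0 means "not yet on any closure"
    next_label = 0
    for start in range(n):
        if labels[start] != 0:
            continue
        next_label += 1
        labels[start] = next_label
        current = start
        while True:
            candidates = [(math.dist(boundary_pts[current], boundary_pts[k]), k)
                          for k in range(n) if labels[k] == 0]
            if not candidates:
                break
            gap, nearest = min(candidates)
            if gap > max_gap:
                break             # no nearby unlabeled point: this chain is finished
            labels[nearest] = next_label
            current = nearest
    return labels

def ring(cx, m=12):
    """m points on a unit circle centered at (cx, 0); illustrative boundary data."""
    return [(cx + math.cos(2 * math.pi * t / m), math.sin(2 * math.pi * t / m))
            for t in range(m)]

labels = close_boundaries(ring(0.0) + ring(10.0), max_gap=1.0)
```

On two well-separated rings, each ring receives its own label, so the number of distinct labels matches the number of closures.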

Extended clustering

According to the nearest neighbor principle, the cluster labels are propagated from the boundary point whose cluster label is not 0 to the sample whose cluster label is 0. As shown in Table 2, Point_i represents the non-boundary points. Column A(1) represents the sample that is closest to the non-boundary points, and column A(N-1) represents the sample furthest from the non-boundary points. Here $Nn=N-n$ represents the number of non-boundary points.

Table 2

Sorting table of distance between non-boundary points and other points

Non-boundary point	Neighborhood point
	A(1) (Nearest)	A(2)	$\cdots$	A(N-1) (Farthest)
Point_1	P1_1	P1_2	$\cdots$	P1_(N-1)
Point_2	P2_1	P2_2	$\cdots$	P2_(N-1)
$\cdots$	$\cdots$	$\cdots$	$\cdots$	$\cdots$
Point_(Nn)	P(Nn)_1	P(Nn)_2	$\cdots$	P(Nn)_(N-1)

First, we check whether the cluster label of the neighbor point in column A(1) of each non-boundary point is 0. If it is not 0, the cluster label of the non-boundary point in the corresponding row is changed to the cluster label of that neighbor point. For example, when the cluster label of P2_1 in column A(1) is $k$ and $k\neq 0$ , the cluster label of Point_2 is changed to $k$ . Then, the cluster labels of the samples in columns A(2), A(3), …, A(N-1) are checked in turn until the cluster labels of all non-boundary points are nonzero. We test ECBSB on the Aggregation dataset, as shown in Fig. 4. The process of connecting boundary points to form the boundary closure is shown in EC.mp4, which is included in the supplemental materials.
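The column-by-column propagation described above can be sketched as follows. The function name `extend_labels` is hypothetical. The sketch assumes that, within one column pass, all unlabeled points look at a snapshot of the labels taken at the start of that pass, and that points labeled in earlier passes can pass their labels on; the text does not fix either detail.

```python
import math

def extend_labels(E, labels):
    """Propagate nonzero cluster labels to label-0 points, column A(1), A(2), ... in turn."""
    n = len(labels)
    labels = list(labels)
    # order[i] lists all other points from nearest to farthest (one row of Table 2).
    order = [sorted((k for k in range(n) if k != i), key=lambda k, i=i: E[i][k])
             for i in range(n)]
    for col in range(n - 1):
        unlabeled = [i for i in range(n) if labels[i] == 0]
        if not unlabeled:
            break
        snapshot = list(labels)   # assumption: labels are updated simultaneously per column
        for i in unlabeled:
            neighbor_label = snapshot[order[i][col]]
            if neighbor_label != 0:
                labels[i] = neighbor_label
    return labels

# One-dimensional toy data: two labeled boundary points, four unlabeled points.
pts = [(0.0,), (1.0,), (2.5,), (10.0,), (11.0,), (12.5,)]
E = [[math.dist(a, b) for b in pts] for a in pts]
result = extend_labels(E, [1, 0, 0, 0, 0, 2])
```

Provided at least one point starts with a nonzero label, every point is labeled by the time the last column has been processed, because each labeled point appears somewhere in every other point's row.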

Figure 4.

Running result chart of each step of ECBSB. a Distribution of samples in the original sample set; b the samples marked with an asterisk are the detected edge noise points; c sample distribution after eliminating edge noise points; d the samples marked in red are the detected boundary points; e boundary points connected to form closed-loop cluster boundaries; f clustering results.

4.2 Complexity analysis

The time complexity of the ECBSB algorithm is the sum of the time complexities of its four steps, and is expressed as $T(n)={{t}_{1}}+{{t}_{2}}+{{t}_{3}}+{{t}_{4}}$ . Since the algorithm needs to scan all samples in set $D$ when removing edge noise and extracting boundary points, the time complexity of Step 1 and Step 2 is $O(N)$ , that is, ${{t}_{1}}={{t}_{2}}=O(N)$ . In Step 3, the algorithm performs the connection operation in the boundary point set $Q$ , and the time complexity is ${{t}_{3}}=1+2+\ldots+n=\frac{(1+n)n}{2}$ , where the number of boundary points is far less than the total number of samples, that is, $n\ll N$ . In the extended clustering stage, the labels need to be copied to $Nn$ samples, and the time complexity is ${{t}_{4}}=Nn(N-1)$ . Therefore, the worst-case time complexity of the ECBSB algorithm is $T(n)=O({{N}^{2}})$ .
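For readers who want the intermediate step, the bound can be written out using the per-step costs stated above and $Nn=N-n$ . This is only a worked restatement of those terms, using $0\leqslant n\leqslant N$ :

```latex
\begin{aligned}
T &= t_{1}+t_{2}+t_{3}+t_{4} \\
  &= O(N)+O(N)+\frac{n(n+1)}{2}+(N-n)(N-1) \\
  &\leqslant O(N)+\frac{N(N+1)}{2}+N(N-1) \\
  &= O(N^{2}).
\end{aligned}
```

The quadratic term comes from Step 4, since each of the $N-n$ non-boundary points may have to inspect up to $N-1$ neighbors.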

4.3 Processing of edge noise points

There are usually noise points in the sample set. While the existing clustering algorithms usually introduce a threshold to eliminate noise points, when the threshold setting is unreasonable, non-noise points will be removed. Although ECBSB eliminates the edge noise points in Step1, it classifies the edge noise points into the nearest cluster in Step4. It is reasonable to classify noise points into clusters closest to themselves. ECBSB has a high tolerance for noise points, and reduces the probability of misclassification of non-noise points.

4.4 Parameters Cnum and Crad

The parameters Crad and Cnum are used to determine and eliminate the edge noise points in Step1; however, it is not easy to determine these two parameters through calculation. We therefore suggest adjusting the two parameters interactively with the help of a plot of the data. First, according to the sample distribution, estimate the approximate diameter of the smaller clusters in the space. Take 1/4 of this diameter as Crad, and then adjust Cnum. The selected parameters are reasonable when the interconnection between clusters is reduced and the shape boundaries of the clusters stand out clearly.

5. Experiments and analysis

In this section, we test the performance of ECBSB on benchmark datasets, and compare its performance with those of other clustering algorithms via ACC and other indicators. All experiments are implemented in MATLAB2020. The experimental results verify the effectiveness of ECBSB.

All algorithms follow the same principle of parameter selection as the original algorithm. In K-means, K-means++, DPC, and SC algorithms, we take the real clustering number as the $K$ value. Because of the randomness of the K-means algorithm and K-means++ algorithm in choosing the initial clustering center, we conducted the experiments 100 times, and took the average value as the final result. For the DBSCAN algorithm, we adjusted the parameters many times and chose the best result as the final result.

5.1 Experimental setting

Dataset

There are nine benchmark datasets used for testing: six labeled datasets and three unlabeled datasets [23, 37, 38, 39, 40]. The shapes of the clusters differ across datasets, with different features such as embedding, connection, and density overlap. The details are shown in Table 3.

The two clusters in the Flame dataset are connected, and the whole cluster is in the shape of an arrow; the clusters in the Compound dataset overlap each other, and the dataset contains an embedded structure; in the Aggregation, Triangle2, Xclara, and D31 datasets, the clusters are elliptic convex clusters, but they are connected and have noise; the shapes of the clusters in the PanelB, T4.8k, and T5.8k datasets are complex, and there are many noise samples and artifacts between clusters.

Table 3
Datasets

Labels	Dataset (#Objects)	#Clusters	Dataset features
With label	Flame (240)	2	Interconnectivity
	Compound (390)	6	Embedded
	Aggregation (788)	7	Interconnectivity
	Triangle2 (1000)	4	Normal distribution, different variances
	Xclara (3000)	3
	D31 (3100)	31	High density, Overlapping
Without label	PanelB (4000)	Unknown	High density, Artifacts, Interconnectivity, Embedded
	T4.8k (8000)	Unknown
	T5.8k (8000)	Unknown

Evaluation Indicators

In this paper, in order to quantitatively describe the performance of the proposed algorithm on each dataset, we use four common external indices and one internal index: ACC, NMI, ARI, F-measure, and DBI [1, 2, 3].

• Accuracy (ACC): ACC is used to represent the matching degree between clustering results and real labels. The closer the clustering result is to the real labels, the higher the ACC score.

• Normalized Mutual Information (NMI): NMI measures the relationship between the predicted results and the real results. When the prediction completely matches the real results, NMI reaches its maximum.

• Adjusted Rand Index (ARI): ARI measures the consistency of two data distributions (i.e., real label and predicted label). The value range is $[-1,1]$ . The closer the value is to 1, the closer the predicted label distribution is to the real label distribution.

• F-Measure: The F-Measure is defined as the weighted harmonic mean of the pairwise precision $P$ and recall $R$ . The calculation formula is: $F=\frac{({{\beta}^{2}}+1)\cdot P\cdot R}{{{\beta}^{2}}\cdot P+R}$ . When $\beta$ is greater than 1, recall is more important; meanwhile, precision is more important when $\beta$ is less than 1. In the experiment, the value of $\beta$ is 1, that is, precision and recall are equally important.

• Davies-Bouldin index (DBI): DBI is an internal evaluation index and does not need label information of samples. The smaller the DBI value, the smaller the intra-cluster distance and the larger the inter-cluster distance, thus indicating a better clustering result.
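As an illustration of the pairwise F-Measure, the following Python sketch counts sample pairs and applies the standard $F_{\beta}$ form $({{\beta}^{2}}+1)PR/({{\beta}^{2}}P+R)$ . The function name `pairwise_f_measure` is hypothetical, and the pair-counting convention (a pair is positive when both samples share a cluster) is an assumption about how "pairwise" precision and recall are computed; the experiments in this paper were run in MATLAB and may use a different implementation.

```python
from itertools import combinations

def pairwise_f_measure(true_labels, pred_labels, beta=1.0):
    """Pairwise F-beta between a reference labeling and a predicted clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1          # pair grouped together in both labelings
        elif same_pred:
            fp += 1          # grouped together only in the prediction
        elif same_true:
            fn += 1          # grouped together only in the reference
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Toy example: one sample of the second class is merged into the first cluster.
f1 = pairwise_f_measure([0, 0, 1, 1], [0, 0, 0, 1])
```

Because only pair co-membership is compared, the score does not depend on how the cluster identifiers are named.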

5.2 Experimental results

As shown in Fig. 5, on the Flame dataset, although the two clusters are highly interconnected, ECBSB can identify clusters similar to the arrow shape, because the interconnection between clusters has been reduced after Step1. The Compound dataset contains an embedded structure, but this does not affect the formation of the boundary closure, and thus ECBSB can accurately identify the embedded structure. The T4.8k dataset contains many clusters with complex shapes and artifacts, but ECBSB can also accurately find the cluster boundary, and automatically determine the number of clusters according to the boundary closure. The idea of ECBSB is also applicable in the three-dimensional sample space. It only needs to adjust the cluster boundary closure method in Step3. In this paper, the Delaunay triangulation algorithm is used to connect boundary points to form a three-dimensional boundary closure, as shown in Fig. 5d.

Figure 5.

Experimental results of ECBSB.

Figure 6.

The results of four algorithms are compared when the density distribution of clusters is not uniform.

As shown in Fig. 6, the original sample set consists of three clusters with normal distribution, but the density distribution of clusters is not uniform. In this condition, only ECBSB can accurately identify the three clusters (Fig. 6a). Because of the low density and sparse distribution of the green cluster, DBSCAN does not accurately identify the boundary part of the red cluster, as shown in Fig. 6b. Meanwhile, influenced by the high density of the black cluster, K-means and DPC gather the samples belonging to the red cluster into the black cluster (Fig. 6c and 6d).

Table 4

Comparison of evaluation indexes of different algorithms

ACC	K-means	K-means++	DBSCAN	DPC	SC	ECBSB
Flame	0.8493	0.851	0.975	0.7875	0.9917	0.9792
Compound	0.6126	0.61	0.5815	0.6491	0.6416	0.8672
Aggregation	0.786	0.7785	0.986	0.9987	0.9937	0.9975
D31	0.8627	0.851	0.92	0.9674	0.941	0.9713
Xclara	0.9977	0.9977	0.9853	0.999	0.7007	0.9973
Triangle2	0.9533	0.9615	0.981	0.994	0.997	0.992
NMI	K-means	K-means++	DBSCAN	DPC	SC	ECBSB
Flame	0.428	0.4348	0.84	0.4049	0.9269	0.8489
Compound	0.68	0.6684	0.2282	0.7448	0.7306	0.8352
Aggregation	0.795	0.7958	0.96	0.9956	0.98	0.9923
D31	0.9265	0.922	0.9	0.9568	0.9537	0.9604
Xclara	0.987	0.987	0.9314	0.9936	0.5641	0.9842
Triangle2	0.8624	0.8694	0.9287	0.9756	0.9862	0.9642
ARI	K-means	K-means++	DBSCAN	DPC	SC	ECBSB
Flame	0.4862	0.4912	0.944	0.3269	0.9666	0.9177
Compound	0.5478	0.5349	0.1948	0.5605	0.5156	0.8481
Aggregation	0.71	0.7118	0.976	0.9978	0.987	0.9956
D31	0.8436	0.832	0.8	0.9345	0.9175	0.9419
Xclara	0.9929	0.9929	0.9766	0.997	0.5569	0.9921
Triangle2	0.8805	0.89	0.9623	0.9802	0.99	0.9759
F-Measure	K-means	K-means++	DBSCAN	DPC	SC	ECBSB
Flame	0.8521	0.8539	0.983	0.7903	0.9917	0.9792
Compound	0.684	0.6781	0.5481	0.7289	0.7025	0.8832
Aggregation	0.831	0.827	0.9865	0.9987	0.9937	0.9975
D31	0.8877	0.879	0.9321	0.9674	0.9534	0.9713
Xclara	0.9977	0.9977	0.9921	0.999	0.7742	0.9973
Triangle2	0.9542	0.9612	0.9879	0.994	0.997	0.992
DBI	K-means	K-means++	DBSCAN	DPC	SC	ECBSB
Flame	1.1052	1.1054	1.4489	1.1338	1.1545	0.1182
Compound	0.7067	0.7074	0.8465	0.8029	0.5614	2.3274
Aggregation	0.6332	0.633	0.6357	0.4365	0.453	0.4969
D31	0.5222	0.5274	0.7461	0.4568	0.4923	0.0984
Xclara	0.4042	0.4072	0.789	0.384	0.5582	0.3838
Triangle2	0.4696	0.4536	0.8742	0.5017	0.4125	0.5006

5.3 Performance comparison

We compare the ECBSB algorithm with the K-means, K-means++, DBSCAN, DPC, and spectral clustering (SC) algorithms on different datasets, and the results are presented in Table 4. The K-means and K-means++ algorithms perform well on convex datasets, but their performance is significantly lower than that of the ECBSB algorithm on nonconvex datasets. The ACC and NMI scores of the DBSCAN algorithm are lower than those of the ECBSB algorithm on all six datasets, because the parameters of the DBSCAN algorithm can only reflect the local attributes of the sample distribution, while the parameters of the ECBSB algorithm reflect the global characteristics of the sample distribution. The DPC algorithm is better than the ECBSB algorithm on the convex datasets Xclara and Triangle2, but performs worse than the ECBSB algorithm on nonconvex datasets such as Flame and Compound. Spectral clustering can reasonably find the structural characteristics of the data distribution, and its scores on most datasets are similar to those obtained by the ECBSB algorithm.
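The comparison can be checked directly against the ACC block of Table 4. The sketch below copies those ACC values and computes, per dataset, the best-scoring algorithm, the margin of ECBSB over a chosen baseline, and the mean ACC per algorithm. The helper names `best_by_dataset`, `margin`, and `mean_score` are hypothetical and introduced only for illustration.

```python
# ACC values copied from Table 4; columns follow the table's order.
ALGORITHMS = ["K-means", "K-means++", "DBSCAN", "DPC", "SC", "ECBSB"]
ACC = {
    "Flame":       [0.8493, 0.8510, 0.9750, 0.7875, 0.9917, 0.9792],
    "Compound":    [0.6126, 0.6100, 0.5815, 0.6491, 0.6416, 0.8672],
    "Aggregation": [0.7860, 0.7785, 0.9860, 0.9987, 0.9937, 0.9975],
    "D31":         [0.8627, 0.8510, 0.9200, 0.9674, 0.9410, 0.9713],
    "Xclara":      [0.9977, 0.9977, 0.9853, 0.9990, 0.7007, 0.9973],
    "Triangle2":   [0.9533, 0.9615, 0.9810, 0.9940, 0.9970, 0.9920],
}


def best_by_dataset(scores):
    """Map each dataset to the algorithm with the highest score."""
    best = {}
    for dataset, row in scores.items():
        top = max(range(len(row)), key=lambda k: row[k])
        best[dataset] = ALGORITHMS[top]
    return best


def margin(scores, target, baseline):
    """Per-dataset score difference target minus baseline."""
    t, b = ALGORITHMS.index(target), ALGORITHMS.index(baseline)
    return {ds: round(row[t] - row[b], 4) for ds, row in scores.items()}


def mean_score(scores):
    """Mean score of each algorithm over all datasets."""
    n = len(scores)
    return {
        name: sum(row[k] for row in scores.values()) / n
        for k, name in enumerate(ALGORITHMS)
    }


if __name__ == "__main__":
    print(best_by_dataset(ACC))
    print(margin(ACC, "ECBSB", "DBSCAN"))
    print(mean_score(ACC))
```

Replacing `ACC` with another block of Table 4 (NMI, ARI, or F-Measure) gives the same summary for that index; for DBI, where smaller is better, `max` in `best_by_dataset` would need to become `min`.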

However, the K-means, K-means++, and spectral clustering algorithms require the number of clusters to be specified from prior knowledge, which is difficult to obtain in practice. Conversely, the ECBSB algorithm can independently determine the number of clusters by counting the cluster boundaries. Moreover, the DBSCAN algorithm is sensitive to the density of clusters, especially when the density difference is large. Although the DPC algorithm provides a decision graph for the user's reference, the number of clusters still needs to be set manually. In contrast, the ECBSB algorithm finds the shape and structure of clusters according to the global characteristics of the data distribution, and therefore obtains better clustering results when the density distribution among clusters is uneven. The clustering results of the six algorithms on different datasets are shown in the appendix.

6. Conclusion and future work

The proposed extended clustering algorithm based on cluster shape boundary (ECBSB) finds the shape boundary of clusters from the global perspective of the sample distribution, and then extends the boundary to the cluster center to realize clustering. Moreover, ECBSB can automatically determine the number of clusters according to the number of boundary closures, and can identify clusters with complex shapes. Finally, experiments on benchmark datasets show that ECBSB outperforms the DBSCAN, K-means, K-means++, DPC, and spectral clustering algorithms on most of the test data.

The idea of ECBSB is applicable to high-dimensional space, but its operability is poor. Therefore, this paper only gives a method to find the shape boundary of clusters in low-dimensional data space. Our next work will apply ECBSB to the high-dimensional data space and explore the method of building a shape boundary in the high-dimensional space. Additionally, we note that ECBSB is not suitable for those clusters without boundary points or boundary closures. This needs to be addressed to improve the adaptability of ECBSB to different datasets.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61825305.

Appendix A. Comparison results of different algorithms

This appendix presents the performance of six algorithms on different datasets, as shown in Fig. 7 to Fig. 14.

As shown in Fig. 7, the clustering results of the ECBSB, spectral clustering, and DBSCAN algorithms on the Flame dataset are basically correct, but DBSCAN labels many samples as noise points. The K-means, K-means++, and DPC algorithms fail to identify the arrow-shaped clusters on the Flame dataset. Only the ECBSB algorithm can find the embedded structure in the Compound dataset, and the number of samples it identifies as noise is far smaller than that identified by the DBSCAN algorithm, as shown in Fig. 8. On the Aggregation and PanelB datasets, the clustering results of the K-means and K-means++ algorithms are obviously wrong due to the large impact of density, as shown in Fig. 9 and Fig. 10.

The clustering results of the ECBSB algorithm and the other five algorithms are basically the same in Fig. 11 to Fig. 13. In clusters with complex shapes, the ECBSB algorithm can achieve the same results as K-means and K-means++, as shown in Fig. 14.

Figure 7.

Flame.

Figure 8.

Compound.

Figure 9.

Aggregation.

Figure 10.

PanelB.

Figure 11.

Triangle2.

Figure 12.

Xclara.

Figure 13.

D31.

Figure 14.

T5.8k.

References

1. Xu and Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2 (2015), 165–193.

2. Pérez-Suárez et al., A review of conceptual clustering algorithms, Artificial Intelligence Review 52(2) (2019), 1267–1296.

3. Saxena et al., A review of clustering techniques and developments, Neurocomputing (2017), 664–681.

4. Xie et al., Unsupervised deep embedding for clustering analysis, in: International Conference on Machine Learning, 2016, pp. 478–487.

5. Yue S.H. et al., Clustering mechanism for electric tomography imaging, Sci China Inf Sci 55 (2012), 2849–2864.

6. Suo et al., Neighborhood grid clustering and its application in fault diagnosis of satellite power system, Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering 233 (2019), 1270–1283.

7. Fahad et al., A survey of clustering algorithms for big data: Taxonomy and empirical analysis, IEEE Transactions on Emerging Topics in Computing 2 (2014), 267–279.

8. Ghaffari et al., Improved parallel algorithms for density-based network clustering, in: International Conference on Machine Learning, 2019, pp. 2201–2210.

9. Sinha, K-means clustering using random matrix sparsification, in: International Conference on Machine Learning, 2018, pp. 4684–4692.

10. Datta, Bhattacharjee and Das, Clustering with missing features: A penalized dissimilarity measure based approach, Machine Learning 107(12) (2018), 1–39.

11. Chen et al., Fast density peak clustering for large scale data based on KNN, Knowledge Based Systems (2020).

12. Likas, Vlassis and Verbeek J.J., The global k-means clustering algorithm, Pattern Recognition 36 (2003), 451–461.

13. Yue S.H. et al., An unsupervised grid-based approach for clustering analysis, Sci China Inf Sci 53 (2010), 1345–1357.

14. Huang et al., A new weighting k-means type clustering framework with an l2-norm regularization, Knowledge Based Systems 151 (2018), 165–179.

15. Pardeshi and Toshniwal, Improved k-medoids clustering based on cluster validity index and object density, in: IEEE International Advance Computing Conference, 2010, pp. 379–384.

16. Zhou K.L. and Yang S.L., Fuzziness parameter selection in fuzzy c-means: The perspective of cluster validation, Sci China Inf Sci 57 (2014), 112206(8).

17. Lei et al., Significantly fast and robust fuzzy c-means clustering algorithm based on morphological reconstruction and membership filtering, IEEE Transactions on Fuzzy Systems (2018), 1–1.

18. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344 (2014), 1492–1496.

19. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, Knowledge Discovery and Data Mining (1996), 226–231.

20. Chen et al., A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data, Pattern Recognition (2018), 375–387.

21. Hess et al., The SpectACl of nonconvex clustering: A spectral approach to density-based clustering, National Conference on Artificial Intelligence 33(01) (2019), 3788–3795.

22. Hinneburg and Keim D.A., An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of the 4th International Conference on Knowledge Discovery and Datamining (KDD’98), New York, 1998, pp. 58–65.

23. Karypis, Han E.H. and Kumar, Chameleon: Hierarchical clustering using dynamic modeling, Computer 32 (1999), 68–75.

24. Janoušek et al., Gaussian mixture model cluster forest, in: International Conference on Machine Learning and Applications, Miami, FL, 2015, pp. 1019–1023.

25. Vijayaraghavan and Awasthi, Clustering semi-random mixtures of Gaussians, in: International Conference on Machine Learning, 2018, pp. 5055–5064.

26. Frey B.J. and Dueck, Clustering by passing messages between data points, Science 315 (2007), 972–976.

27. Ng A.Y. et al., On spectral clustering: Analysis and an algorithm, in: Neural Information Processing Systems, 2001, pp. 849–856.

28. He et al., Fast large-scale spectral clustering via explicit feature mapping, IEEE Transactions on Systems, Man, and Cybernetics 49(3) (2019), 1058–1071.

29. Kang et al., Low-rank kernel learning for graph-based clustering, Knowledge Based Systems (2019), 510–517.

30. Yang et al., Fast spectral clustering learning with hierarchical bipartite graph for large-scale data, Pattern Recognition Letters (2020), 345–352.

31. Caron et al., Deep clustering for unsupervised learning of visual features, in: European Conference on Computer Vision, 2018, pp. 139–156.

32. Xie et al., Unsupervised deep embedding for clustering analysis, in: International Conference on Machine Learning, 2016, pp. 478–487.

33. Yang et al., Towards K-means-friendly spaces: Simultaneous deep learning and clustering, in: International Conference on Machine Learning, 2017, pp. 3861–3870.

34. Ataer-Cansizoglu, Akcakaya and Erdogmus, Minor surfaces are boundaries of mode-based clusters, IEEE Signal Processing Letters 22(7) (2015), 891–895.

35. Zhong et al., A new clustering algorithm by using boundary information, IEEE Congress on Evolutionary Computation (2018), 1–8.

36. Richard et al., K-variates++: More pluses in the k-means++, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 145–154.

37. Gionis, Mannila and Tsaparas, Clustering aggregation, in: 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 2005, pp. 341–352.

38. Zahn C.T., Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Transactions on Computers C-20 (2006), 68–86.

39. Veenman C.J., Reinders M.J.T. and Backer, A maximum variance cluster algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), 1273–1280.

40. Fu and Medico, Flame, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinformatics 8 (2007), 3–3.