A multi-center clustering algorithm based on mutual nearest neighbors for arbitrarily distributed data

Abstract

Multi-center clustering algorithms have attracted the attention of researchers because they can deal with complex data sets more effectively. However, the reasonable determination of cluster centers and their number as well as the final clusters is a challenging problem. In order to solve this problem, we propose a multi-center clustering algorithm based on mutual nearest neighbors (briefly MC-MNN). Firstly, we design a center-point discovery algorithm based on mutual nearest neighbors, which can adaptively find center points without any parameters for data sets with different density distributions. Then, a sub-cluster discovery algorithm is designed based on the connection of center points. This algorithm can effectively utilize the role of multiple center points, and can effectively cluster non-convex data sets. Finally, we design a merging algorithm, which can effectively obtain final clusters based on the degree of overlapping and distance between sub-clusters. Compared with existing algorithms, the MC-MNN has four advantages: (1) It can automatically obtain center points by using the mutual nearest neighbors; (2) It runs without any parameters; (3) It can adaptively find the final number of clusters; (4) It can effectively cluster arbitrarily distributed data sets. Experiments show the effectiveness of the MC-MNN and its superiority is verified by comparing with five related algorithms.

Keywords

Multiple centers, data clustering, mutual nearest neighbors, arbitrary distribution

1. Introduction

With the development of information technology, it is not difficult to collect data. However, how to better understand and utilize the collected data is a problem faced by various fields. Machine learning methods are effective tools for this problem. Classification and clustering are two of the most widely used machine learning methods. Classification is the division of data sets with prior knowledge. To improve the accuracy and efficiency of classification, many methods have been proposed [1, 2, 3, 4]. However, classification needs to know part of the data label information in advance. In many practical scenarios, it is difficult to know the data label information in advance. As an unsupervised machine learning method, clustering divides a data set into multiple clusters according to the inherent similarity of data, so that data within a cluster are similar, while data in different clusters are not similar. Partitioning unlabeled data is the basis for many studies and applications. In medicine, the classification of disease types can help doctors treat their patients more effectively and accurately [5, 6]. In business, the segmentation of customer groups and the discovery of communities facilitate effective marketing [7, 8, 9]. In information retrieval, the contents of documents are divided and sorted, which is convenient and efficient for information retrieval [10, 11, 12]. In traffic, detecting traffic incidents can help people to solve traffic congestion problems [13, 14, 15]. In image recognition, it helps people to better segment and recognize images [16, 17]. In addition, clustering is often used as a pre-step in other studies [18, 19, 20]. With the increasing of the volume of the data and the complexity of the data, the classic clustering algorithms [21, 22, 23, 24, 25] show their limitations.

K-means algorithm [21] is a simple and widely used classic clustering algorithm based on partition. It needs to specify the initial clustering centers and form stable clustering results through continuous iterations. Each cluster has a center. This algorithm is suitable for convex data sets with Gaussian distribution. For many complex data sets, it is difficult to use one center to represent a cluster, and the K-means algorithm is no longer effective. DBSCAN algorithm [23] is a classic density based clustering algorithm. It no longer uses a center point to represent a cluster, but defines a cluster as a set of the largest density connected data points. The algorithm can find clusters with arbitrary shape. However, the definition of density connected data points depends on the two parameters: Eps and MinPts. Different values of Eps and MinPts need to be specified for different data sets, and it is difficult to get a suitable Eps and MinPts without prior knowledge. Even with suitable Eps and MinPts, DBSCAN still fails for clusters with varying densities. The model-based clustering algorithm also does not need to specify clustering centers. The algorithm assumes that all clusters in the data set obey a certain probability distribution, and the data set fits a mixture model [26, 27]. The final clustering result is found by the optimal parameter values of the model. Model-based clustering algorithms usually assume that all clusters obey the same type of model with different parameters. Based on this assumption, the algorithm has poor clustering effect on data sets with multiple distributions. The grid-based clustering algorithm divides the data set into multiple grids, transforms the data clustering into the grid clustering, and improves the clustering efficiency. 
Some improved grid-based clustering algorithms can effectively cluster data with non-uniform density distribution by dividing the data using grids of different sizes [28, 29], but it is difficult to set appropriate grids sizes without prior knowledge. The Chameleon algorithm is a hierarchical clustering algorithm [30] that can be used for non-convex data sets. The algorithm uses a node to represent a data point, and an edge between nodes represents the similarity of two nodes to form a graph. After that, the algorithm divides the graph composed of data points into multiple sub-clusters, and then merges the sub-clusters to obtain the clusters. The algorithm can effectively cluster data sets with non-convex shapes. However, the Chameleon algorithm is sensitive to the parameters involved. In order to better cluster complex data, many researchers have proposed adaptive clustering algorithms based on multi-centers.

Liang [31] designed a multi-center clustering method called MC. The MC algorithm sets multiple centers for large clusters, and the number of data points represented by each center is equal to that of small clusters to avoid the phenomenon that k-means algorithm wrongly clusters the data of larger clusters into small clusters (uniform effect). The algorithm has a good effect on imbalanced data sets, but has a poor effect on non-convex data sets. Tao [32] proposed a multi-center clustering algorithm which can effectively cluster non-convex data sets, and the two-level hierarchical subtractive clustering algorithm was employed to effectively cluster large-scale data. But, the performance of this algorithm on data sets with overlapping between clusters is poor. Xia et al. [33] proposed a clustering method named WC-KNNG-PC. If the parameters are set properly, then WC-KNNG-PC algorithm can find clusters and noises in complex data sets and has good clustering result. However, its efficiency is very poor because it needs many iterations to obtain appropriate parameters, especially for real data sets. Lu et al. [34] proposed a clustering algorithm based on multi-center competitive learning (SMCL). It uses multiple subsets (each subset has a center) to represent a cluster, and automatically adjusts the number of subsets through competitive learning until it is stable. This method uses multiple centers to represent a cluster, which is suitable for data sets with unbalanced distribution. However, the algorithm iteratively finds stable clustering results in the process of competitive learning. So the algorithm is time-consuming. Density peak clustering (DPC) [35] is a clustering algorithm which takes the density peak points (data points with the highest density in the local area) as the cluster centers. It manifests its strengths in a non-iterative way. Bie et al. 
[36] proposed an improved DPC algorithm, which selects more density peak points than the actual number of cluster centers to form sub-clusters, and then merges the sub-clusters to form the final clustering result. However, this algorithm tends to overlook sparse clusters. Wang et al. [37] proposed a multi-center density peak clustering algorithm called McDPC. The algorithm stratifies the density and selects density peak points from different density regions, so it is able to find sparse clusters. However, compared with the DPC algorithm, it introduces three new parameters. Different parameter values must be set for different data sets, and repeated tuning is required to find the optimal values. This is unrealistic for practical applications in which the data labels are unknown.

According to the above analysis, multi-center clustering algorithms can handle arbitrarily distributed data sets, but some problems remain. The main problems encountered by current multi-center clustering algorithms include: (1) how to better set the values of parameters; (2) how to obtain proper multiple center points; (3) how to merge sub-clusters; (4) how to determine the final number of clusters. Aiming at these problems, we propose a multi-center clustering algorithm based on mutual nearest neighbors. Firstly, the algorithm obtains multiple center points by finding the nearest neighbors and the mutual nearest neighbors of each data point, and assigns the neighbors of each center point to that center point. Then, sub-clusters are obtained by continuously connecting center points that fall in each other’s neighborhood, and data points that are not assigned to any sub-cluster are treated as noise points. Finally, a sub-cluster merging algorithm is designed to determine the final number of clusters by using two indicators (the distance between sub-clusters and the overlapping degree of sub-clusters), so as to get the final clustering results. The main contributions of this article are:

1. A new clustering algorithm, MC-MNN, is proposed for arbitrarily distributed data sets. The algorithm does not need any parameters.
2. An algorithm that automatically finds multiple center points is proposed for data sets with different density distributions.
3. A sub-cluster construction method based on the connection of center points is proposed for data sets with different shapes.
4. A sub-cluster merging algorithm that automatically finds the final number of clusters is proposed.

The rest of this article is organized as follows. The related concepts are briefly introduced in Section 2. Section 3 introduces MC-MNN algorithm in detail. The experimental results and comparison algorithms are introduced in Section 4, and conclusions are given in Section 5.
2. Related basic theories

This section introduces the concepts of K nearest neighbors, reverse nearest neighbors and mutual nearest neighbors related to the proposed algorithm MC-MNN.

(K nearest neighbors [38, 39]).

Given a data set $S=\{s_{i}\}_{i=1}^{n}$ , the distance between two data points $s_{i}$ and $s_{j}$ is $d(s_{i},s_{j})$ . The K nearest neighbors of a data point $s_{i}$ are a subset $N_{K}(s_{i})$ of $S$ such that $N_{K}(s_{i})\subseteq S\setminus\{s_{i}\}$ , $N_{K}(s_{i})$ contains $K$ elements, and $\forall s_{j}\in N_{K}(s_{i})$ , $\forall s_{t}\in S\setminus(N_{K}(s_{i})\cup\{s_{i}\})$ : $d(s_{i},s_{j})\leqslant d(s_{i},s_{t})$ .

The K-nearest neighbor method is used in classification [40, 41, 42]. Its core idea is that, given a training set, for a new input data point, first find its $K$ nearest neighbors. If most of these $K$ data points belong to a certain class, the input data point is classified into this class. Currently, the K-nearest neighbor method is widely used for clustering [43, 44]. However, the clustering result usually is affected by noise points. In order to reduce the influence of noise points, some researchers [45, 38, 39] use the reverse nearest neighbors in the clustering algorithm to improve the clustering result.

(Reverse $K$ nearest neighbors).

The reverse $K$ nearest neighbors of data point $s_{i}$ are a subset $\textit{RN}_{K}(s_{i})$ of $S$ : if $s_{i}\in N_{K}(s_{j})$ , then $s_{j}\in\textit{RN}_{K}(s_{i})$ .

Both $K$ nearest neighbors and reverse $K$ nearest neighbors are asymmetric relations and cannot fully express the similarity between data points in a data set. Some researchers [46, 47, 48] use the concept of mutual neighbors to express the similarity between data points more accurately.

(Mutual $K$ nearest neighbors).

The mutual $K$ nearest neighbors of data point $s_{i}$ are a subset $\textit{MN}_{K}(s_{i})$ of $S$ satisfying:

1. $\textit{MN}_{K}(s_{i})\subseteq S\setminus\{s_{i}\}$ .
2. If $s_{j}\in N_{K}(s_{i})$ and $s_{i}\in N_{K}(s_{j})$ , then $s_{j}\in\textit{MN}_{K}(s_{i})$ and $s_{i}\in\textit{MN}_{K}(s_{j})$ .
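
The three neighbor relations defined above can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a brute-force search; the function names `knn`, `reverse_knn` and `mutual_knn` and the four one-dimensional sample points are hypothetical and serve only as an illustration.

```python
import math

def knn(points, i, K):
    """Indices of the K nearest neighbors N_K(s_i), nearest first."""
    others = [j for j in range(len(points)) if j != i]
    # Brute-force ordering by Euclidean distance (ties broken by index).
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:K]

def reverse_knn(points, i, K):
    """RN_K(s_i): every s_j that has s_i among its own K nearest neighbors."""
    return [j for j in range(len(points))
            if j != i and i in knn(points, j, K)]

def mutual_knn(points, i, K):
    """MN_K(s_i): members of N_K(s_i) that also have s_i in their N_K."""
    return [j for j in knn(points, i, K) if i in knn(points, j, K)]

# Illustrative data only: four points on a line.
points = [(0.0,), (1.0,), (2.1,), (3.3,)]
print(knn(points, 2, 1))          # nearest neighbor of point 2
print(reverse_knn(points, 1, 1))  # points whose nearest neighbor is point 1
print(mutual_knn(points, 2, 1))   # empty: point 1 prefers point 0
```

In this toy data, point 2 regards point 1 as its nearest neighbor, but point 1 does not return the relation, which is the asymmetry that the mutual relation removes.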

3. MC-MNN method

The proposed method MC-MNN includes three steps: (1) Center point discovery: a multi-center discovery algorithm based on mutual nearest neighbors is designed, which makes full use of data structure information and automatically discovers multiple center points. (2) Sub-cluster generation: an algorithm based on the connection of center points is designed, which effectively obtains sub-clusters and noise points. (3) Sub-cluster merging: a merging algorithm based on distance and overlapping degree is designed, which merges sub-clusters to obtain the final clusters. MC-MNN uses the $k$ -th nearest neighbor and the $k$ -th mutual nearest neighbor when determining center points. Their definitions are as follows:

Figure 1.

The process of determining the center points.

(The $K$ -th nearest neighbor).

Given a data set $X=\{x_{i}\}_{i=1}^{n}$ , $d_{ij}$ is the distance between two data points $x_{i}$ and $x_{j}$ . If $d_{ij}=\min_{t=1,t\neq i}^{n}\{d_{it}\}$ , then $x_{j}$ is the (first) nearest neighbor of $x_{i}$ . Similarly, if $x_{p}$ ranks $K$ -th when the other points of $X$ are sorted by increasing distance to $x_{i}$ , then $x_{p}$ is called the $K$ -th nearest neighbor of $x_{i}$ .

(The $k$ -th mutual nearest neighbor).

If $x_{j}$ is the $k_{1}$ -th nearest neighbor of $x_{i}$ and $x_{i}$ is the $k_{2}$ -th nearest neighbor of $x_{j}$ , then $x_{i}$ and $x_{j}$ are called the $k$ -th mutual neighbors, and $x_{i}$ ( $x_{j}$ ) is called the $k$ -th mutual nearest neighbor of $x_{j}$ ( $x_{i}$ ), where $k=\max\{k_{1},k_{2}\}$ .
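
As a small illustration of this definition, the following Python sketch computes the order $k=\max\{k_{1},k_{2}\}$ for a pair of points. It assumes Euclidean distance and brute-force ranking; `rank_of` and `mutual_order` are hypothetical helper names, and the sample points are made up.

```python
import math

def rank_of(points, i, j):
    """Return K such that x_j is the K-th nearest neighbor of x_i (1-based)."""
    order = sorted((p for p in range(len(points)) if p != i),
                   key=lambda p: math.dist(points[i], points[p]))
    return order.index(j) + 1

def mutual_order(points, i, j):
    """k = max(k1, k2): x_i and x_j are k-th mutual nearest neighbors."""
    k1 = rank_of(points, i, j)  # x_j is the k1-th nearest neighbor of x_i
    k2 = rank_of(points, j, i)  # x_i is the k2-th nearest neighbor of x_j
    return max(k1, k2)

# Illustrative data only: four points on a line.
points = [(0.0,), (1.0,), (2.1,), (3.3,)]
print(mutual_order(points, 0, 1))
print(mutual_order(points, 1, 2))
print(mutual_order(points, 0, 2))
```

Because the order is a maximum of two ranks, it is symmetric in the two points, unlike the individual ranks $k_{1}$ and $k_{2}$.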

3.1 Multi-center discovery algorithm

In this section, we propose a multi-center discovery algorithm based on the mutual nearest neighbors. The algorithm does not need to set parameters, and can automatically find center points through the mutual neighbor relationship of data points.

Given a data set $X=\{x_{i}\}_{i=1}^{n}$ , for each data point $x_{i}\in X$ , let $Nj(i)$ denote the set of the $j$ -th nearest neighbors of $x_{i}$ and $\textit{NM}j(i)$ denote the number of the $j$ -th mutual nearest neighbors of $x_{i}$ for $i=1,2,\ldots,n$ , where $j=1,2,\ldots$ . Let $\textit{SNM}k(i)$ denote the total number of the first $k$ mutual nearest neighbors of $x_{i}$ , i.e., $\textit{SNM}k(i)=\textit{NM}1(i)+\textit{NM}2(i)+\ldots+\textit{NM}k(i)$ .

In [49], authors proposed an adaptive method to automatically determine the number $k$ of the $k$ nearest neighbors (KNN) by

$\displaystyle k=\min\{t\mid\big((x_{i}\in\textit{KNN}_{t}(x_{j}))\wedge(x_{j}\in\textit{KNN}_{t}(x_{i}))\big)\ \text{or}\ \textit{rep}\geqslant\sqrt{t-\textit{rep}},\ i,j=1,2,\ldots,n\},$ (1)

where $\textit{KNN}_{t}(x_{i})$ denotes the set of the first $t$ nearest neighbors of point $x_{i}$ , $\textit{NN}(i)=\{x_{j}\mid x_{i}\in\textit{KNN}_{t}(x_{j})\wedge x_{j}\in\textit{KNN}_{t}(x_{i})\}$ denotes the natural neighbors of $x_{i}$ , and rep is the number of consecutive rounds in which the number of data points with zero natural neighbors remains unchanged.

Table 1

The process of obtaining center points

Points	$k=$ 1			$k=$ 2			$k=$ 3			$k=$ 4
	N1	NM1	SNM1	N2	NM2	SNM2	N3	NM3	SNM3	N4	NM4	SNM4
1	2	1	1	4	0	1	3	0	1	12	0	1
2	1	1	1	3	0	1	4	0	1	14	0	1
3	12	1	1	14	0	1	4	2	3	5	0	3
4	3	0	0	5	0	0	12	1	1	6	0	1
5	11	1	1	10	0	1	12	2	3	6	1	4
6	5	0	0	11	0	0	8	0	0	7	2	2
7	8	1	1	6	0	1	11	0	1	9	1	2
8	7	1	1	11	0	1	10	1	2	9	2	4
9	17	1	1	8	0	1	10	0	1	15	2	3
10	11	0	0	15	2	2	5	1	3	8	1	4
11	5	1	1	10	1	2	8	1	3	12	1	4
12	3	1	1	14	1	2	5	1	3	11	1	4
13	18	1	1	15	1	2	14	0	2	12	1	3
14	12	0	0	18	2	2	3	1	3	13	1	4
15	10	0	0	13	2	2	9	0	2	11	1	3
16	17	0	0	15	1	1	13	0	1	9	0	1
17	9	1	1	16	1	2	15	0	2	10	0	2
18	13	1	1	14	1	2	12	0	2	15	0	2
19	28	1	1	20	1	2	25	1	3	26	0	3
20	21	0	0	19	2	2	26	0	2	25	0	2
21	26	0	0	20	2	2	27	1	3	25	0	3
22	23	1	1	27	1	2	26	0	2	25	1	3
23	22	1	1	24	1	2	25	0	2	26	0	2
24	28	0	0	23	1	1	25	1	2	26	1	3
25	26	1	1	28	1	2	19	1	3	24	1	4
26	25	1	1	21	1	2	27	1	3	22	1	4
27	26	0	0	22	1	1	21	2	3	25	0	3
28	19	1	1	25	1	2	24	1	3	26	0	3

*The rows where the center points are located are gray.

We make use of the above idea of determining number $k$ of KNN to design a method to determine the center points. The detail is as follows:

For each data point $x_{i}\in X$ , first find $N1(i)$ and compute $\textit{NM}1(i)$ . Let $\textit{SNM}1(i)=\textit{NM}1(i)$ . For each fixed $t\geqslant 1$ , let $\textit{NZ}(t)$ denote the number of data points $x_{i}$ satisfying $\textit{SNM}t(i)=0$ , i.e., $\textit{NZ}(t)=|\{i|\textit{SNM}t(i)=0,i=1,2,\ldots,n\}|$ . Let $\textit{rep}_{t}$ denote the number of consecutive rounds up to round $t$ in which $\textit{NZ}$ has remained unchanged. Obviously, $\textit{rep}_{1}=0$ . If $\textit{SNM}1(i)\neq 0$ for all $i=1,2,\ldots,n$ , or $\textit{rep}_{1}\geqslant\sqrt{1-\textit{rep}_{1}}$ , let $t=1$ and stop. Otherwise, find $N2(i)$ and compute $\textit{NM}2(i)$ for each data point $x_{i}\in X$ , compute $\textit{SNM}2(i)=\textit{NM}1(i)+\textit{NM}2(i)$ for each $x_{i}$ , and compute $\textit{NZ}(2)$ and $\textit{rep}_{2}$ . If $\textit{SNM}2(i)\neq 0$ for all $i=1,2,\ldots,n$ , or $\textit{rep}_{2}\geqslant\sqrt{2-\textit{rep}_{2}}$ , let $t=2$ and stop. Otherwise, repeat the above process until, for some $t$ , either $\textit{SNM}t(i)=\textit{NM}1(i)+\textit{NM}2(i)+\ldots+\textit{NM}t(i)\neq 0$ for all $i=1,2,\ldots,n$ or $\textit{rep}_{t}\geqslant\sqrt{t-\textit{rep}_{t}}$ . The search then stops with this value of $t$ .

Figure 2.

Data set and its center points. (a) data set; (b) center points.

When the searching ends, each data point $x_{i}$ satisfying $\textit{SNM}t(i)=t$ is called a center point. The process of determining the center points is summarized in Fig. 1.

In order to further illustrate how the center points are determined, we take the data set in Fig. 2 as an example and determine the center points step by step through Table 1. Figure 2a shows a data set with different density distributions. Table 1 shows the process of finding $Nk(i)$ and calculating $\textit{NM}k(i)$ and $\textit{SNM}k(i)$ until the process ends, where the first column lists the index $i$ of each data point $x_{i}$ . First, the first nearest neighbor $N1(i)$ of each $x_{i}\in X$ is found and put in the second column, and the number of first mutual nearest neighbors of $x_{i}$ , i.e., $\textit{NM}1(i)$ , is computed and put in the third column. $\textit{SNM}1(i)=\textit{NM}1(i)$ is put in the fourth column. Ten data points, 4, 6, 10, 14, 15, 16, 20, 21, 24, and 27, do not have a first mutual nearest neighbor, i.e., their values of $\textit{NM}1$ are zero. Note that $\textit{NZ}(1)=10$ and $\textit{rep}_{1}=0$ , thus $\textit{rep}_{1}<\sqrt{1-\textit{rep}_{1}}$ , and a second round is needed. Next, the second nearest neighbor $N2(i)$ of each $x_{i}\in X$ is found and put in the fifth column, and the number of second mutual nearest neighbors $\textit{NM}2(i)$ of $x_{i}$ is computed and put in the sixth column. Data points 4 and 6 have neither a first nor a second mutual nearest neighbor, i.e., $\textit{SNM}2(i)=\textit{NM}1(i)+\textit{NM}2(i)=0$ for $i=4,6$ . The values of $\textit{SNM}2(i)$ are put in the seventh column. Now $\textit{NZ}(2)=2$ and $\textit{rep}_{2}=0$ , so $\textit{rep}_{2}<\sqrt{2-\textit{rep}_{2}}$ , and a third round is needed. Similarly, the third nearest neighbor $N3(i)$ is found and $\textit{NM}3(i)$ is computed. Data point 6 has no first, second, or third mutual nearest neighbor, i.e., $\textit{SNM}3(6)=\textit{NM}1(6)+\textit{NM}2(6)+\textit{NM}3(6)=0$ . Now $\textit{NZ}(3)=1$ and $\textit{rep}_{3}=0$ , so $\textit{rep}_{3}<\sqrt{3-\textit{rep}_{3}}$ . Repeating this process, we find the fourth nearest neighbor $N4(i)$ and compute $\textit{NM}4(i)$ . Now each data point has at least one $k$ -th mutual nearest neighbor for some $k\in\{1,2,3,4\}$ , i.e., $\textit{SNM}4(i)\neq 0$ for each $x_{i}\in X$ , so the search process is completed with $t=4$ . Each data point $x_{i}$ with $\textit{SNM}4(i)=4$ is a center point. In this example, data points 5, 8, 10, 11, 12, 14, 25, and 26 (in red circles) are center points, as shown in Fig. 2b.

Algorithm 1: Multi-center discovery algorithm
Input: Data set $X=\{x_{1},x_{2},\ldots,x_{n}\}$ .
Output: Center point set $M c$ , the number of center points Num, $t$ .
1: $Mc=\emptyset$ , $\textit{Num}=0$ , $t=1$ , $\textit{SNM}t=0$ , $\textit{flag}=0$ ;
2: Create a $k-d$ tree $T$ from data set $X$ ;
3: Discover center points;
4: while $\textit{flag}==0$ do
5:     determine $Nt(X)$ by $k-d$ tree $T$ ;
6:     for $i=1$ to $n$ do
7:         determine $\textit{NM}t(i)$ ;
8:         calculate $\textit{SNM}t(i)$ ;
9:         $i=i+1$ ;
10:     end for
11:     $\textit{NZ}=|\{i|\textit{SNM}t(i)=0,i=1,2,\ldots,n\}|$ ;
12:     $\textit{rep}=\textit{repeat}(\textit{NZ})$ ;
13:     if $\textit{SNM}t(X)\neq 0\parallel\textit{rep}\geqslant\sqrt{t-\textit{rep}}$ then
14:         $\textit{flag}=1$ ;
15:     else
16:         $t=t+1$ ;
17:     end if
18: end while
19: for $i=1:n$ do
20:     if $\textit{SNM}t(i)==t$ then
21:         $Mc=Mc\cup x_{i}$ ;
22:         $\textit{Num}=\textit{Num}+1$ ;
23:     end if
24: end for
25: return $M c$ , Num, $t$ .

It can be intuitively seen from Fig. 2a and b that there are two clusters in the data set and that these two clusters have different density distributions. The center points found are located in the central regions of the two clusters. This indicates that the proposed center point discovery method, which is based on the mutual nearest neighbors of points, is not affected by the density distribution of data points. So the method is suitable for data sets with nonuniform density distribution.

The multi-center discovery algorithm is summarized in Algorithm 1. Line 5 determines the $t$ -th nearest neighbor, and lines 6–10 determine the mutual nearest neighbors and the number of mutual nearest neighbors of each data point. Lines 11–17 determine whether to end the search. Lines 19–24 obtain the center point set $M c$ and the number of center points Num.
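
The search described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it replaces the k-d tree with a brute-force $O(n^{2}\log n)$ ranking, uses Euclidean distance, interprets rep as the number of consecutive rounds in which NZ equals its previous value, and adds a safety stop at $t=n-1$. The name `discover_centers` and the sample points are hypothetical.

```python
import math

def discover_centers(points):
    """Return (centers, t) following the stopping rule of Section 3.1."""
    n = len(points)
    # rank[i][j] = K such that x_j is the K-th nearest neighbor of x_i.
    rank = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for pos, j in enumerate(order, start=1):
            rank[i][j] = pos
    snm = [0] * n                      # SNMt(i), accumulated over rounds
    prev_nz, rep, t = None, 0, 0
    while True:
        t += 1
        for i in range(n):
            # NMt(i): points whose mutual order max(k1, k2) equals t.
            snm[i] += sum(1 for j in range(n)
                          if j != i and max(rank[i][j], rank[j][i]) == t)
        nz = snm.count(0)              # NZ(t)
        rep = rep + 1 if nz == prev_nz else 0   # assumed meaning of rep
        prev_nz = nz
        # Stop when every point has a mutual neighbor, when NZ stagnates,
        # or (safety guard, not in the paper) when no ranks remain.
        if nz == 0 or rep >= math.sqrt(t - rep) or t >= n - 1:
            break
    centers = [i for i in range(n) if snm[i] == t]
    return centers, t

# Illustrative data only: four points on a line.
points = [(0.0,), (1.0,), (2.1,), (3.3,)]
centers, t = discover_centers(points)
print(centers, t)
```

On these four points the search stops in the second round, and the two inner points are reported as centers because each has accumulated exactly $t$ mutual nearest neighbors.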

3.2 Sub-cluster generation algorithm

After the center points are obtained, let $M c$ denote the set of center points. A sub-cluster generation algorithm is designed as follows. First, for any center point $c_{1}\in Mc$ , we construct a sub-cluster by assigning the $t$ nearest neighbors of $c_{1}$ to this sub-cluster, where $t$ is obtained by Algorithm 1, and remove $c_{1}$ from $M c$ . If there is another center point $c_{2}$ among these $t$ nearest neighbors, we also assign the $t$ nearest neighbors of $c_{2}$ to this sub-cluster and remove $c_{2}$ from $M c$ . This process is repeated until no unprocessed center point remains in the sub-cluster. Then take another center point $c_{3}\in Mc$ , and repeat the above process until $M c$ is empty. Data points that do not belong to any sub-cluster are noise points. Algorithm 2 summarizes the sub-cluster generation process.

Algorithm 2: Sub-cluster generation algorithm
Input: $t$ , $M c$ .
Output: Labels of data points label, number of sub-clusters $k$ .
1: $k=0$ ;
2: while $Mc\neq\emptyset$ do
3:     $k=k+1$ ;
4:     take a center point $c_{i}\in Mc$ ;
5:     Delete $c_{i}$ from $M c$ ;
6:     $\textit{label}(c_{i})=k$ ;
7:     Assign the $t$ nearest neighbors $Nt(c_{i})$ of center $c_{i}$ to $c_{i}$ ;
8:     if $Nt(c_{i})\cap Mc\neq\emptyset$ then
9:         $C_{i}=Nt(c_{i})\cap Mc$ ;
10:         Delete $C_{i}$ from $M c$ ;
11:         for $\widetilde{c}_{i}\in C_{i}$ do
12:             Assign the $t$ nearest neighbors of center $\widetilde{c}_{i}$ to $c_{i}$ ;
13:             Repeat lines 8–12;
14:         end for
15:     end if
16: end while
17: return the labels of data points label, number of sub-clusters $k$ ;

We use the previous example to illustrate the process of generating sub-clusters in Fig. 3. Note that the center point set $M c$ $=$ $\{$ 5, 8, 10, 11, 12, 14, 25, 26 $\}$ and $t=4$ are obtained by Algorithm 1. The sub-cluster generation algorithm first takes data point 5 from $M c$ to generate sub-cluster 1 and removes center point 5 from $M c$ . Then, the $t$ nearest neighbors of data point 5 are assigned to sub-cluster 1. Among these neighbors, 10, 11, and 12 are also center points, so they are removed from $M c$ . Each of data points 10, 11, and 12 is then processed in turn. For data point 10, its $t$ nearest neighbors are assigned to sub-cluster 1. Neighbor 8 of data point 10 is also a center point, so center point 8 is removed from $M c$ and the $t$ nearest neighbors of center point 8 are assigned to sub-cluster 1. Since there is no remaining center point among the $t$ nearest neighbors of point 8, the operation for center point 10 is finished. Center points 11 and 12 are processed in the same way. Thereafter, another center point is taken from $M c$ and the above process is repeated until $M c$ is empty. Figure 3a shows the sub-cluster generating process, where data points in red circles are center points and the blue arrows show the neighbor assigning process. Figure 3b shows the two sub-clusters found, where black data points are noise points.
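
The expansion over connected center points can be sketched in Python as a breadth-first traversal. This is an illustrative sketch under assumptions: the $t$ nearest neighbors are supplied as precomputed index lists, label 0 marks noise, and a point that was already labeled keeps its first label (the text does not specify this case). The name `generate_subclusters` and the toy neighbor lists are hypothetical.

```python
from collections import deque

def generate_subclusters(knn, centers):
    """knn[i]: indices of the t nearest neighbors of point i.
    centers: indices of the center points (the set Mc).
    Returns (labels, k); label 0 marks a noise point."""
    labels = [0] * len(knn)
    remaining = list(centers)          # Mc, processed in the given order
    k = 0
    while remaining:
        k += 1
        queue = deque([remaining.pop(0)])   # seed a new sub-cluster
        while queue:
            c = queue.popleft()
            labels[c] = k
            for j in knn[c]:
                if labels[j] == 0:     # assumption: keep the first label
                    labels[j] = k
                if j in remaining:     # a connected center: expand it too
                    remaining.remove(j)
                    queue.append(j)
    return labels, k

# Illustrative neighbor lists (t = 2) for seven points; point 6 is never
# among the neighbors of a center, so it stays noise.
knn = [[1, 2], [0, 2], [1, 0], [4, 5], [3, 5], [4, 3], [5, 4]]
labels, k = generate_subclusters(knn, centers=[1, 4])
print(labels, k)
```

A queue is used here instead of the recursive "repeat" step of Algorithm 2; both visit the same set of connected center points.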

Figure 3.

An example to generate sub-clusters. (a) searching process; (b) sub-clusters.

The sub-cluster generation process is summarized in Fig. 4, where rectangles represent non-center points, and the ellipses represent the center points.

Figure 4.

The sub-cluster generating process.

3.3 Sub-clusters merging algorithm

After obtaining sub-clusters, we designed a sub-clusters merging algorithm. The algorithm uses two metrics (distance and overlapping degree between sub-clusters) to judge the relationship between sub-clusters, which can make better use of the inherent information of data sets and obtain good merging results. For any two sub-clusters $M$ and $N$ , if data point $x_{i}$ belongs to sub-cluster $M$ (or $N$ ), and some of its $t$ nearest neighbors are in sub-cluster $N$ (or $M$ ), then data point $x_{i}$ is called the shared data point of sub-clusters $M$ and $N$ .

The overlapping degree of clusters $M$ and $N$ is given below:

$\displaystyle OL_{(M,N)}=\left(\sum_{x_{i}\in M}\textit{SN}_{i}+\sum_{x_{j}\in N}\textit{SN}_{j}\right)/\min(|M|,|N|),$ (2)

where $|M|$ and $|N|$ are the numbers of data points in sub-clusters $M$ and $N$ , respectively. If $x_{i}$ belongs to sub-cluster $M$ (or $N$ ) and some of its $t$ nearest neighbors are in sub-cluster $N$ (or $M$ ), then $\textit{SN}_{i}=1$ , else, $\textit{SN}_{i}=0$ . The number of all shared data points of clusters $M$ and $N$ is $(\sum_{x_{i}\in M}\textit{SN}_{i}+\sum_{x_{j}\in N}\textit{SN}_{j})$ .

The distance between sub-clusters refers to the minimum distance between pairs of data points from the two sub-clusters. For sub-clusters $M$ and $N$ , the distance $d_{(M,N)}$ between $M$ and $N$ is given by

$\displaystyle d_{(M,N)}=\min\limits_{x_{i}\in M,x_{j}\in N}\{d_{ij}\}.$ (3)

Obviously, if the overlapping degree between every pair of sub-clusters in a data set is zero, then the sub-clusters are the final clusters and there is no need to merge them. Otherwise, we use the grouping algorithm used in [31, 34] to merge the sub-clusters. Suppose there are $k$ sub-clusters forming the set of sub-clusters $C=\{C_{1},C_{2},\ldots,C_{k}\}$ .

Set $G_{k}=C$ . We define the overlapping degree metric $\textit{MOL}_{k}$ of $G_{k}$ as the maximal overlapping degree $\textit{OL}_{(C_{i},C_{j})}$ over all pairs of sub-clusters in $G_{k}$ :

$\displaystyle\textit{MOL}_{k}=\textit{OL}_{(C_{p},C_{q})}=\max_{C_{i}\in G_{k},C_{j}\in G_{k},1\leqslant i<j\leqslant k}\textit{OL}_{(C_{i},C_{j})}.$ (4)

Merge $C_{p}$ and $C_{q}$ into one sub-cluster. Put the merged sub-cluster into $G_{k}$ and remove $C_{p}$ and $C_{q}$ from $G_{k}$ to get $G_{k-1}$ . $G_{k-1}$ contains $k-1$ sub-clusters and, without loss of generality, is denoted by $G_{k-1}=\{C_{1},C_{2},\ldots,C_{k-1}\}$ .

At the same time, use the distance between sub-clusters $C_{p}$ and $C_{q}$ as the distance metric of $G_{k}$ :

$\displaystyle Md_{k}=d_{(C_{p},C_{q})}.$ (5)

Then, we can define $\textit{MOL}_{k-1}$ and $Md_{k-1}$ of $G_{k-1}$ and get $G_{k-2}$ . Similarly, we can define $\textit{MOL}_{i}$ and $Md_{i}$ of $G_{i}$ and get $G_{i-1}$ sequentially for $i=k-2,k-3,\ldots,2$ .
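
A minimal Python sketch of Eqs. (2)–(5) follows. It assumes Euclidean distance, precomputed $t$-nearest-neighbor index lists, and that ties for the maximal overlapping degree are resolved by taking the first pair found; the names `overlap`, `min_distance` and `merge_sequence` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def overlap(M, N, knn):
    """OL_(M,N) of Eq. (2): shared points over the smaller sub-cluster size."""
    in_m, in_n = set(M), set(N)
    shared = sum(1 for i in M if in_n.intersection(knn[i]))   # SN_i, x_i in M
    shared += sum(1 for j in N if in_m.intersection(knn[j]))  # SN_j, x_j in N
    return shared / min(len(M), len(N))

def min_distance(M, N, points):
    """d_(M,N) of Eq. (3): smallest pairwise distance between M and N."""
    return min(math.dist(points[i], points[j]) for i in M for j in N)

def merge_sequence(clusters, knn, points):
    """Merge the most-overlapping pair repeatedly; return the raw
    (unnormalized) MOL_i and Md_i for i = k, k-1, ..., 2 as dictionaries."""
    groups = [list(c) for c in clusters]
    mol, md = {}, {}
    while len(groups) > 1:
        k = len(groups)
        p, q = max(combinations(range(k), 2),
                   key=lambda pq: overlap(groups[pq[0]], groups[pq[1]], knn))
        mol[k] = overlap(groups[p], groups[q], knn)          # Eq. (4)
        md[k] = min_distance(groups[p], groups[q], points)   # Eq. (5)
        merged = groups[p] + groups[q]
        groups = [g for idx, g in enumerate(groups)
                  if idx not in (p, q)] + [merged]
    return mol, md

# Illustrative data only: three sub-clusters on a line, t = 2.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
knn = [[1, 2], [0, 2], [1, 3], [2, 1], [5, 3], [4, 3]]
mol, md = merge_sequence([[0, 1], [2, 3], [4, 5]], knn, points)
print(mol, md)
```

Note that the overlapping degree can exceed 1, since shared points are counted from both sub-clusters but divided by the size of the smaller one.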

To eliminate the scale discrepancy between the two metrics, we normalize $\textit{MOL}_{i}$ and $Md_{i}$ and still denote the normalized metrics by $\textit{MOL}_{i}$ and $Md_{i}$ , i.e., set

$\displaystyle\textit{MOL}_{i}=\textit{MOL}_{i}/\max_{2\leqslant j\leqslant k}\textit{MOL}_{j}.$ (6)

$\displaystyle Md_{i}=Md_{i}/\max_{2\leqslant j\leqslant k}Md_{j}.$ (7)

The higher the overlapping degree between two sub-clusters is, the higher their priority for merging. Similarly, the smaller the distance between two sub-clusters is, the more they need to be merged. In order to consider these two metrics simultaneously, we integrate them into one as follows. First, define $\textit{MO}_{i}$ by

$\displaystyle\textit{MO}_{i}=1-\textit{MOL}_{i}.$ (8)

Then we define the degree of difficulty of merging $G_{k}$ to get $G_{k-1}$ by $\textit{MO}d_{k}$ as follows

$\displaystyle\textit{MO}d_{k}=\textit{MO}_{k}+Md_{k}.$ (9)

The larger the value of $\textit{MO}d_{k}$ , the more difficult it is to obtain $G_{k-1}$ from $G_{k}$ . When sub-clusters in $G_{i}$ are gradually merged for $i=k,k-1,\ldots,2$ , the value of $\textit{MO}d_{i}$ becomes larger and larger, which means that merging $G_{i}$ becomes more and more difficult. If the largest change in the whole process occurs from $\textit{MO}d_{i}$ to $\textit{MO}d_{i-1}$ , it means that the difficulty of merging rises most sharply at this step, so the merging should be stopped at $G_{i-1}$ . The final number of clusters is obtained by

$\displaystyle K^{*}=\arg\max\limits_{3\leqslant i\leqslant k}\{\textit{MO}d_{i-1}-\textit{MO}d_{i}\}-1.$ (10)

$G_{K^{*}}$ is the merging result. The selection of $K^{*}$ is shown in Fig. 5.
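
The selection rule of Eqs. (6)–(10) can be sketched in Python as follows. The sketch assumes that the raw values $\textit{MOL}_{i}$ and $Md_{i}$ for $i=2,\ldots,k$ are given as dictionaries keyed by $i$, that $k\geqslant 3$, and that both maxima are nonzero; the name `select_k_star` and the numbers in the example are made up for illustration.

```python
def select_k_star(mol, md):
    """mol[i], md[i]: raw MOL_i and Md_i for i = 2..k.
    Returns K* according to Eq. (10)."""
    k = max(mol)
    if k < 3:
        raise ValueError("Eq. (10) needs at least three sub-clusters")
    mol_max = max(mol.values())        # normalizer of Eq. (6)
    md_max = max(md.values())          # normalizer of Eq. (7)
    # MOd_i = (1 - MOL_i) + Md_i on normalized metrics, Eqs. (8) and (9).
    mod = {i: (1 - mol[i] / mol_max) + md[i] / md_max for i in mol}
    # Largest increase MOd_{i-1} - MOd_i over 3 <= i <= k, Eq. (10).
    best_i = max(range(3, k + 1), key=lambda i: mod[i - 1] - mod[i])
    return best_i - 1

# Made-up metrics for k = 5 sub-clusters: the last merge (from 2 clusters
# to 1) has low overlap and a large distance, so it should be refused.
mol = {5: 0.8, 4: 0.6, 3: 0.5, 2: 0.05}
md = {5: 0.1, 4: 0.2, 3: 0.3, 2: 2.0}
print(select_k_star(mol, md))
```

In this example the sharpest rise in merging difficulty occurs between $\textit{MO}d_{3}$ and $\textit{MO}d_{2}$, so the sketch keeps two clusters.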

Figure 5.

Selection process of $K^{*}$ .

Finally, the noise points are assigned to the clusters to which they should belong, which gives the final clustering results. The sub-cluster merging process is summarized in Algorithm 3.

Algorithm 3: Sub-clusters merging
Input: Data set $X$ , $C=\{C_{1},C_{2},\ldots,C_{k}\}$ , the number of sub-clusters $k$ .
Output: Labels of data points label, number of clusters $K^{*}$ .
for $i=1$ to $k-1$ do
    for $j=i+1$ to $k$ do
        Calculate $OL_{(M,N)}$ by Eq. (2);
        Calculate $d_{(M,N)}$ by Eq. (3);
    end for
end for
$G_{k}=C$ ; $G=\{G_{1},G_{2},\ldots,G_{k}\}$ ;
while $k>1$ do
    Obtain $C_{p}$ , $C_{q}$ by calculating $\textit{MOL}_{k}$ using Eq. (4);
    Calculate $Md_{k}$ by Eq. (5);
    $G_{k-1}=G_{k}\setminus\{C_{p},C_{q}\}\cup\{C_{p}\cup C_{q}\}$ ;
    $k=k-1$ ;
end while
Select the number of clusters $K^{*}$ by Eq. (10);
Assign noise points into clusters;
Return the labels of data points label, number of clusters $K^{*}$ ;

Figure 6 shows the clustering process of the proposed algorithm on the data set Spiral, where there is no overlap between the sub-clusters. The center points found are shown in red circles in Fig. 6a. The three sub-clusters found, which have no shared data points among their $t$ nearest neighbors, are shown in Fig. 6b; that is, there is no overlap between any two sub-clusters. Therefore, there is no need to merge, and the sub-clusters are the final clustering result. Here $t=3$ does not need to be specified; it is calculated according to Eq. (1).

Figure 6.

Example for sub-clusters not requiring merging. (a) data set and its center points; (b) clustering result.

Table 2

The 15 data sets and their features

Data sets	Samples	Attribute	Clusters	Cluster sizes	Convex	Similar	Noise	Source
Gaussian	2000	2	4	1212, 606, 121, 61	Convex	No	Yes	[34]
Ids2	3200	2	5	2000, 400, 400, 200, 200	Convex	No	Yes	[34]
Aggregation	788	2	7	273, 170, 130, 102, 45, 34, 34	Convex	No	Yes	[50]
Circles	152	2	2	101, 51	Non-convex	No	Yes	[50]
Lithuanian	2400	2	2	2000, 400	Non-convex	No	Yes	[51]
Banana	2400	2	2	2000, 400	Non-convex	No	Yes	[51]
Breast	683	9	2	444, 239	–	Yes	–	[52]
Ecoli	327	5	5	143, 77, 52, 35, 20	–	No	–	[52]
Car	1728	6	4	1210, 384, 69, 65	–	No	–	[52]
Vote	232	13	2	124, 108	–	Yes	–	[52]
Seeds	210	7	3	70, 70, 70	–	Yes	–	[52]
Thyroid	215	5	3	150, 35, 30	–	No	–	[52]
Pageblock	5357	10	3	4913, 329, 115	–	No	–	[52]
Robot navigation	5456	24	4	2205, 2097, 826, 328	–	No	–	[52]
Olivetti face	200	4096	20	10 samples in each cluster	–	Yes	–	[53]

Figure 7 shows the merging process of the proposed algorithm on the data set Ids2. Figure 7a shows the 11 sub-clusters found. The determined number of clusters is shown in Fig. 7b. Figure 7c shows the result of merging the sub-clusters, where the data points in black are noise points. The clustering result after assigning the noise points is shown in Fig. 7d.

Figure 7.

Sub-cluster merging procedure.

3.4 Time complexity analysis of the proposed algorithm MC-MNN

For a data set $X=\{x_{i}\}_{i=1}^{n}$ containing $n$ samples, the time complexity of building a k-d tree for $X$ is $O(n\log n)$. The time complexity of finding the $t$-th nearest neighbor is $O(t\,n\log n)$. According to [49], the value of $t$ is 6 to 7 for general data sets, and greater than 20 but not greater than 30 for complex data sets. Since $t$ is thus bounded by a small constant, the time complexity of finding the $t$-th nearest neighbor is $O(n\log n)$. The time complexity of determining the center points is $O(n)$, and that of generating the sub-clusters is $O(n)$. Assuming that $k$ sub-clusters are generated $(k\ll n)$, the worst-case time complexity of calculating the degree of overlap between sub-clusters is $O(k\cdot(n/k))=O(n)$, and the time complexity of calculating the distance between sub-clusters is $O((n/k)^{2})$. The time complexity of determining the final number of clusters is $O(k)$. Therefore, the total time complexity of the algorithm is $O(n\log n)$ or $O((n/k)^{2})$, whichever term is larger.
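To get a feeling for which of the two terms dominates, the following Python sketch compares $n\log n$ with $(n/k)^{2}$ for given $n$ and $k$. It ignores constant factors and assumes a base-2 logarithm, so it is only a rough guide; the helper name `dominant_term` is hypothetical. The first call uses the Ids2 values reported in Tables 2 and 3 (3200 samples, 11 sub-clusters); the second uses a made-up larger $k$.

```python
import math


def dominant_term(n, k):
    """Compare the candidate leading terms n*log2(n) and (n/k)^2.

    Constant factors are ignored, so the result is indicative only.
    """
    tree_term = n * math.log2(n)
    distance_term = (n / k) ** 2
    name = "n log n" if tree_term >= distance_term else "(n/k)^2"
    return name, tree_term, distance_term


# Ids2: n = 3200 samples (Table 2), k = 11 sub-clusters (Table 3).
ids2 = dominant_term(3200, 11)
# Hypothetical case with more sub-clusters for the same n.
many = dominant_term(3200, 40)
print(ids2[0], many[0])
```

With few sub-clusters the pairwise distance term is the larger one, while a larger $k$ shifts the balance toward the neighbor search.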

4. Experimental results

4.1 Data sets and the compared algorithms

To test the clustering results of the proposed algorithm MC-MNN on arbitrarily distributed data sets, we use the 15 data sets listed in Table 2, which include convex and non-convex data sets, data sets with similar cluster sizes and with large differences in cluster size, and data sets with and without noise. Among them, the first six are synthetic data sets and the last nine are real data sets. Data sets Banana and Lithuanian are non-convex with thick lines, while data set Circles is non-convex with thin lines. Data sets Circles and Aggregation have similar cluster sizes, and data sets Lithuanian, Banana, Gaussian and Ids2 have large differences in cluster size. The nine real data sets differ in dimension, number of samples, and number of clusters; Olivetti face is an image data set. We used principal component analysis to preprocess the real data sets.

To assess the proposed algorithm MC-MNN, we compared it with five related clustering algorithms: McDPC [37], SMCL [34], Fuzzy-CFSFDP [36], DPADN [54], and SNNDPC [55]. Among them, McDPC [37], SMCL [34], DPADN [54], and Fuzzy-CFSFDP [36] are all based on multiple centers, while SNNDPC [55] is a clustering algorithm based on shared nearest neighbors for arbitrarily shaped data. All experiments were run on a Windows 10 PC with an Intel i7 3.30 GHz CPU and 16 GB RAM.

4.2 Experimental results

4.2.1 The clustering process of MC-MNN for 6 synthetic data sets

The MC-MNN algorithm includes three parts: center point discovery, sub-cluster construction, and cluster number determination together with final cluster acquisition. To demonstrate the experimental process and results more clearly, Figs 8–13 show the clustering process on the six synthetic data sets.

Figure 8.

Data set Gaussian. (a) center point determination. (b) sub-cluster obtaining. (c) number of cluster determination. (d) clustering result.

Figure 9.

Data set Ids2. (a) center point determination. (b) sub-cluster obtaining. (c) number of cluster determination. (d) clustering result.

Figure 10.

Data set Aggregation. (a) center point determination. (b) sub-cluster obtaining. (c) number of cluster determination. (d) clustering result.

Figure 11.

Data set Circles. (a) center point determination. (b) sub-cluster obtaining. (c) number of cluster determination. (d) clustering result.

Figure 12.

Data set Lithuanian. (a) center point determination. (b) sub-cluster obtaining. (c) number of cluster determination. (d) clustering result.

Table 3

Clustering information of synthetic data sets

Data sets	$t$	Num	$K^{*}$
Gaussian	10	25	4
Ids2	14	11	5
Aggregation	6	7	7
Circles	5	11	2
Lithuanian	21	8	2
Banana	17	7	2

* Num is the number of sub-clusters.

Figure 13.

Data set Banana. (a) center point determination. (b) sub-cluster obtaining. (c) number of cluster determination. (d) clustering result.

Figures 8a–13a show the center points of the six synthetic data sets in red circles. We can see that the center points found are representative data points of the main parts of the data sets. The obtained sub-clusters are shown in Figs 8b–13b (noise points are excluded). Figures 8c–13c are line charts for determining the final number of clusters, where the blue lines represent the difficulty of merging in terms of overlapping degree, the red lines represent the difficulty of merging in terms of distance, and the yellow lines represent the overall difficulty of merging. These values are recorded as the number of sub-clusters changes from $k-1$ to $1$ ($k$ is the number of sub-clusters obtained from Algorithm 2). When the number of clusters changes from $K^{*}$ to $K^{*}-1$, the change of $\textit{MO}d_{k}$ is the largest, so we take $K^{*}$ as the final number of clusters and obtain the final clusters. Figures 8d–13d show the final clustering results. We take the Gaussian data set as an example to illustrate the merging process. For data set Gaussian, the value of $t$ is 10 when Algorithm 1 terminates. Figure 8b shows the 25 sub-clusters obtained by Algorithm 2. As shown in Fig. 8c, the greatest change in the value of $\textit{MO}d_{k}$ occurs from $K^{*}=4$ to $K^{*}=3$, so $K^{*}=4$ is the final number of clusters. As can be seen from Figs 8d–13d, the proposed algorithm MC-MNN achieves excellent clustering results on the six synthetic data sets. We can conclude that whether the data set is convex or non-convex, with or without noise, and with balanced or imbalanced cluster sizes, MC-MNN achieves good clustering results. The clustering process information of the synthetic data sets is shown in Table 3.

4.2.2 Performance evaluations on synthetic and real data sets

In the experiments, we compare the proposed algorithm MC-MNN with five related algorithms: McDPC [37], SMCL [34], Fuzzy-CFSFDP [36], DPADN [54], and SNNDPC [55]. We use six performance metrics to evaluate the six algorithms: accuracy (ACC), normalized mutual information (NMI), recall (RE), the number of determined clusters (Clusters), the number of parameters and the sensitivity to them (Parameters), and time cost. The comparison results on the 6 synthetic data sets are shown in Table 4, and those on the 9 real data sets are shown in Table 5. For each performance metric, the best value among the six algorithms is shown in bold, and the second best value is underlined. For the performance metric Parameters, Nonsen means insensitive to parameters, and None means no parameters.

Algorithm McDPC needs four parameters to be specified, $\gamma$, $\theta$, $\lambda$ and $pct$; SMCL needs four parameters, $\alpha$, $\eta$, $K$ and $E$; SNNDPC needs the parameter $k$; Fuzzy-CFSFDP and DPADN need the parameter $dc$. Among them, McDPC and SNNDPC are more sensitive to their parameters, while SMCL, Fuzzy-CFSFDP, and DPADN are less sensitive. For the parameters $\gamma$ and $\theta$ in McDPC, we search the interval from 0.1 to 1 with step 0.1; for the parameter $\lambda$, we search the interval from the average $\delta$ to $\delta+1$ with step 0.1; the parameter $pct$ is set to 2. For SNNDPC, the parameter $k$ is varied from 5 to 15 with step 1. Finally, we choose the parameter values that give the number of clusters closest to the correct one, and these values are reported in Tables 4 and 5. In SMCL, $\alpha$, $\eta$, $E$, and $K$ are set to 0.005, 0.005, 1000, and 10, respectively, for all test data sets. In Fuzzy-CFSFDP and DPADN, $dc$ is set so that the average number of neighbors is approximately 2% of the total number of data points. Our algorithm MC-MNN has no parameter to be assigned.
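The 2% rule for $dc$ can be illustrated as follows. The source does not spell out how the compared algorithms compute $dc$, so this sketch assumes the common heuristic of taking the 2% quantile of all pairwise distances, and it counts a point as a neighbor when its distance is at most $dc$. The function names `choose_dc` and `average_neighbors` and the toy data are hypothetical.

```python
import math
from itertools import combinations


def choose_dc(points, fraction=0.02):
    """Pick dc as the `fraction` quantile of all pairwise distances."""
    dists = sorted(math.dist(a, b) for a, b in combinations(points, 2))
    position = max(0, round(fraction * len(dists)) - 1)
    return dists[position]


def average_neighbors(points, dc):
    """Average number of other points within distance dc of a point."""
    close_pairs = sum(
        1 for a, b in combinations(points, 2) if math.dist(a, b) <= dc
    )
    # Each close pair contributes one neighbor to both of its end points.
    return 2 * close_pairs / len(points)


# Toy data: 100 equally spaced points on a line (illustrative only).
points = [(float(i), 0.0) for i in range(100)]
dc = choose_dc(points)
avg = average_neighbors(points, dc)
print(dc, avg)
```

For the toy data the average neighbor count comes out close to 2, that is, about 2% of the 100 points. The quadratic number of pairs makes this sketch suitable only for small data sets.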

Table 4
Results of the synthetic data sets

Data sets	Metric	McDPC	SMCL	Fuzzy-CFSFDP	DPADN	SNNDPC	MC-MNN
Gaussian	ACC	0.9910	0.9810	0.6060	0.6070	0.7325	0.9910
	RE	0.9897	0.9868	0.7644	0.5607	0.8639	0.9940
	NMI	0.9437	0.9101	0.2539	0.2075	0.5448	0.9468
	Clusters	4.0000	4.0000	2.0000	3.0000	4.0000	4.0000
	Parameters	0.1, 0.1, 1, 2	Nonsen	Nonsen	Nonsen	14	None
Ids2	ACC	0.9263	0.9931	0.8703	0.6250	0.8797	0.9931
	RE	0.7956	0.9970	0.8934	0.2000	0.9563	0.9970
	NMI	0.9571	0.9652	0.7985	0.0011	0.7847	0.9683
	Clusters	12.0000	5.0000	4.0000	2.0000	5.0000	5.0000
	Parameters	0.1, 0.1, 1.1, 2	Nonsen	Nonsen	Nonsen	15	None
Aggregation	ACC	0.9048	0.9949	0.9928	0.9987	0.9695	0.9962
	RE	0.8351	0.9968	0.9935	0.9989	0.9788	0.9973
	NMI	0.9194	0.9851	0.9824	0.9957	0.9432	0.9883
	Clusters	7.0000	7.0000	6.0000	7.0000	7.0000	7.0000
	Parameters	0.1, 0.1, 1.5, 2	Nonsen	Nonsen	Nonsen	14	None
Circles	ACC	0.8026	1.0000	0.3092	1.0000	0.6711	1.0000
	RE	0.7059	1.0000	0.2393	1.0000	0.6667	1.0000
	NMI	0.8488	1.0000	0.4576	1.0000	0.9753	1.0000
	Clusters	3.0000	2.0000	10.0000	2.0000	2.0000	2.0000
	Parameters	0.1, 0.4, 1, 2	Nonsen	Nonsen	Nonsen	5	None
Lithuanian	ACC	0.4808	1.0000	0.5946	0.6946	0.6008	1.0000
	RE	0.4585	1.0000	0.5168	0.4168	0.7605	1.0000
	NMI	0.5659	1.0000	0.0001	0.0644	0.1895	1.0000
	Clusters	4.0000	2.0000	2.0000	2.0000	2.0000	2.0000
	Parameters	0.1, 0.7, 1.6, 2	Nonsen	Nonsen	Nonsen	15	None
Banana	ACC	0.5425	0.9996	0.6923	0.9923	0.5008	1.0000
	RE	0.3255	0.9988	0.8049	0.9949	0.7005	1.0000
	NMI	0.6257	0.9928	0.2678	0.9878	0.1396	1.0000
	Clusters	3.0000	2.0000	2.0000	2.0000	2.0000	2.0000
	Parameters	0.1, 0.8, 1.9, 2	Nonsen	Nonsen	Nonsen	15	None

*The best values are in bold and the second best values are underlined.

Firstly, the performances of the six algorithms are compared on the six synthetic data sets. It can be seen from Table 4 that MC-MNN achieved the best results on the 4 performance indicators ACC, RE, NMI, and Clusters for the 5 synthetic data sets Gaussian, Ids2, Circles, Lithuanian and Banana. For data set Aggregation, MC-MNN obtained the correct number of clusters and the second best results on ACC, RE and NMI (0.9962, 0.9973, 0.9883), which is still very competitive. Moreover, MC-MNN does not need any parameters to be set. McDPC is more sensitive to its four parameters. It obtained relatively good clustering results on data sets Gaussian, Aggregation and Ids2, and poor clustering results on data sets Circles, Lithuanian and Banana. SMCL obtained the best clustering results on data sets Circles and Lithuanian, and relatively good clustering results on the other four synthetic data sets. It is not very sensitive to its parameters and is a very competitive clustering algorithm. Fuzzy-CFSFDP is also insensitive to its parameter $dc$. It works relatively well on data sets Ids2 and Aggregation, but performs relatively poorly on the other four synthetic data sets. It automatically selects points with $\delta$ greater than $2\sigma(\delta)$ and $\rho$ greater than $\mu(\rho)$ as center points, so the center points in sparse clusters are ignored. DPADN is also not sensitive to its parameter $dc$. It obtained the best clustering results on data sets Aggregation and Circles, and relatively good clustering results on data set Banana, but it performs relatively poorly on the other three synthetic data sets. In DPADN, fewer center points are selected in a small cluster than in a large cluster, so small clusters are easily merged into large ones. Therefore, its clustering results on Gaussian, Ids2 and other imbalanced data sets are relatively poor. SNNDPC needs the cluster centers to be selected manually according to the decision graph. The algorithm achieved relatively good clustering results on data sets Ids2 and Aggregation, but relatively poor results on the other four synthetic data sets. SNNDPC is also sensitive to its parameter. In conclusion, MC-MNN performs the best on the synthetic data sets compared with the other 5 algorithms.

Table 5

Results of the real data sets

Data sets	Metric	McDPC	SMCL	Fuzzy-CFSFDP	DPADN	SNNDPC	MC-MNN
Breast	ACC	0.9239	0.9678	0.3499	0.6515	0.7862	0.9634
	RE	0.9289	0.9656	0.5000	0.5021	0.7129	0.9622
	NMI	0.6410	0.7830	0.0001	0.0092	0.7030	0.7670
	Clusters	2.0000	2.0000	1.0000	3.0000	3.0000	2.0000
	Parameters	0.1, 0.3, 3.7, 2	Nonsen	Nonsen	Nonsen	10	None
Ecoli	ACC	0.6453	0.6483	0.4373	0.4373	0.6544	0.6820
	RE	0.3766	0.3792	0.2000	0.2012	0.5528	0.5642
	NMI	0.5134	0.5102	0.0001	0.0156	0.6139	0.6644
	Clusters	3.0000	2.0000	1.0000	2.0000	3.0000	3.0000
	Parameters	0.1, 0.5, 0.3, 2	Nonsen	Nonsen	Nonsen	14	None
Car	ACC	0.5547	0.7002	0.2905	0.2667	0.4615	0.6863
	RE	0.2478	0.2500	0.1141	0.2064	0.2619	0.2779
	NMI	0.2033	0.0001	0.2018	0.1210	0.1976	0.0379
	Clusters	3.0000	1.0000	10.0000	2.0000	4.0000	2.0000
	Parameters	0.1, 0.1, 1, 2	Nonsen	Nonsen	Nonsen	8	None
Vote	ACC	0.6422	0.8664	0.8060	0.5302	0.8621	0.8750
	RE	0.6384	0.8771	0.8048	0.4960	0.8667	0.8771
	NMI	0.3449	0.4812	0.5589	0.0075	0.3225	0.4633
	Clusters	2.0000	2.0000	3.0000	2.0000	2.0000	2.0000
	Parameters	0.1, 0.1, 1.4, 2.5	Nonsen	Nonsen	Nonsen	8	None
Seeds	ACC	0.4524	0.5524	0.6033	0.3667	0.6313	0.6190
	RE	0.4524	0.5524	0.6033	0.3667	0.6313	0.6190
	NMI	0.1199	0.3441	0.5778	0.0586	0.5679	0.5778
	Clusters	3.0000	3.0000	2.0000	3.0000	2.0000	5.0000
	Parameters	0.1, 0.6, 0.6, 2	Nonsen	Nonsen	Nonsen	6	None
Thyroid	ACC	0.5907	0.6977	0.6977	0.6977	0.6326	0.7302
	RE	0.5925	0.3333	0.3333	0.3333	0.3241	0.4000
	NMI	0.2715	0.2154	0.0001	0.0425	0.2539	0.1286
	Clusters	3.0000	2.0000	1.0000	2.0000	2.0000	2.0000
	Parameters	0.1, 0.2, 3.4, 2	Nonsen	Nonsen	Nonsen	6	None
Pageblock	ACC	0.7118	0.9171	0.9171	0.9177	0.8997	0.8161
	RE	0.4399	0.3333	0.3333	0.3420	0.2272	0.4526
	NMI	0.0155	0.0001	0.0001	0.0293	0.0507	0.0530
	Clusters	3.0000	1.0000	1.0000	5.0000	3.0000	2.0000
	Parameters	0.1, 0.7, 245.7, 2	Nonsen	Nonsen	Nonsen	5	None
Robot navigation	ACC	0.0940	0.1514	0.3063	0.3091	0.1514	0.2538
	RE	0.0717	0.1514	0.2760	0.2020	0.2500	0.3499
	NMI	0.1195	0.0001	0.0046	0.0001	0.0001	0.1695
	Clusters	88.0000	1.0000	2.0000	2.0000	1.0000	14.0000
	Parameters	1, 1, 2.5, 2	Nonsen	Nonsen	Nonsen	5	None
Olivetti face	ACC	0.4300	0.0500	0.6350	0.1500	0.6150	0.7850
	RE	0.4300	0.0500	0.6350	0.1500	0.6150	0.7850
	NMI	0.6330	0.0001	0.8314	0.4443	0.8414	0.9177
	Clusters	18.0000	1.0000	16.0000	3.0000	16.0000	16.0000
	Parameters	0.1, 0.2, 8.2, 2	Nonsen	Nonsen	Nonsen	6	None

*The best values are in bold and the second best values are underlined.

Table 6

Run time (in seconds) of MC-MNN and the compared algorithms

Data sets	McDPC	SMCL	Fuzzy-CFSFDP	DPADN	SNNDPC	MC-MNN
Gaussian	7.9896	159.7747	8.5953	11.5980	90.6851	8.7891
Ids2	16.1765	662.5271	9.3085	17.4781	32.6991	17.2431
Aggregation	4.0532	66.0548	4.9604	5.8905	53.9368	2.6448
Circles	1.5691	7.8790	1.0793	4.8895	20.0849	1.1645
Lithuanian	9.4235	234.3669	5.1581	6.2672	47.2117	19.3818
Banana	9.6061	194.7357	5.1533	8.6874	42.5286	18.3806
Breast	3.7369	1395.1937	3.3639	3.1235	27.7495	23.1423
Ecoli	3.8326	36.1947	2.8165	2.6915	27.1364	4.4623
Car	5.8326	2309.9407	11.1695	9.6891	63.3476	22.1479
Vote	12.1421	452.2448	1.1765	0.8710	31.0023	23.7628
Seeds	2.8517	41.0608	3.1632	2.4089	44.0242	4.5092
Thyroid	2.3147	22.1362	1.7737	1.4642	16.5361	7.1873
Pageblock	26.5478	15636.6311	13.1765	16.8710	150.1547	42.1028
Robot navigation	121.6040	11794.6610	24.7368	26.9120	186.0462	45.3002
Olivetti face	4.1347	821.2551	1.1631	1.0615	48.5231	1.7462

Figure 14.

Clustering results of MC-MNN algorithm on Olivetti faces data set.

Secondly, the nine real data sets are analyzed according to 45 performance indicator values in total (5 indicator values per data set). MC-MNN achieved 27 best performance indicator values (briefly, best values) and 10 second best performance indicator values (briefly, second best values). McDPC achieved 10 best values and 3 second best values. SMCL achieved 8 best values and 15 second best values. SNNDPC achieved 5 best values and 12 second best values. DPADN achieved 5 best values and 12 second best values. Fuzzy-CFSFDP achieved 3 best values and 20 second best values. In addition, for imbalanced data sets such as Car and Pageblock, the algorithms SMCL and Fuzzy-CFSFDP place the data points into a single cluster, which is why they achieve better results on the ACC index (ACC is the proportion of correctly clustered data points among all data points). Compared with ACC, the RE index better evaluates the clustering of data sets with large differences in cluster size (RE is the ratio of the number of correctly clustered data points in each cluster to the total number of data points in that cluster). On all real data sets, MC-MNN obtains the best or second best value on the RE index. In summary, MC-MNN also performs well on real data sets.
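The difference between ACC and RE on imbalanced data can be made concrete with a small Python sketch. The source does not state how predicted clusters are matched to true classes, so the sketch assumes that each predicted cluster is assigned its majority true class and that RE is averaged over the true classes; the function name `acc_and_re` is hypothetical. The example uses the Pageblock class sizes from Table 2 with every point placed in one cluster.

```python
from collections import Counter, defaultdict


def acc_and_re(true_labels, pred_labels):
    """ACC and class-averaged RE under a majority-class mapping (assumed)."""
    members = defaultdict(list)
    for t, p in zip(true_labels, pred_labels):
        members[p].append(t)
    # Assumption: a predicted cluster stands for its most frequent true class.
    mapping = {p: Counter(ts).most_common(1)[0][0] for p, ts in members.items()}

    correct_per_class = Counter()
    for t, p in zip(true_labels, pred_labels):
        if mapping[p] == t:
            correct_per_class[t] += 1

    class_sizes = Counter(true_labels)
    acc = sum(correct_per_class.values()) / len(true_labels)
    recall = sum(
        correct_per_class[c] / size for c, size in class_sizes.items()
    ) / len(class_sizes)
    return acc, recall


# Pageblock class sizes from Table 2; all points put into a single cluster.
true_labels = [0] * 4913 + [1] * 329 + [2] * 115
pred_labels = [0] * len(true_labels)
acc, recall = acc_and_re(true_labels, pred_labels)
print(round(acc, 4), round(recall, 4))
```

Under these assumptions the single-cluster result gives a high ACC but an RE of only one third, which agrees with the ACC of 0.9171 and RE of 0.3333 reported in Table 5 for SMCL and Fuzzy-CFSFDP on Pageblock.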

To further verify the image processing ability of the algorithm, the first two hundred photos of the Olivetti faces data set [53] were selected. The size of each photo is $64\times 64$. There are 20 people, each with ten photos taken at different angles and with different expressions. Firstly, principal component analysis was performed on the data set, and the features with a cumulative contribution rate greater than 85% were selected for clustering. Photos assigned to the same cluster are shown in the same color. To indicate more clearly the cluster to which each photo belongs, the cluster number is marked in white Arabic numerals in the lower left part of the image. The clustering results are shown in Fig. 14.
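The 85% cumulative contribution criterion can be sketched as follows. The function assumes that the eigenvalues of the covariance matrix (the variances along the principal components) are already available; the name `components_for_contribution` and the example eigenvalues are hypothetical and do not come from the Olivetti experiment.

```python
def components_for_contribution(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative
    contribution rate exceeds `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total > threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues: cumulative rates are 0.5, 0.8, 0.9, 1.0.
n_components = components_for_contribution([5.0, 3.0, 1.0, 1.0])
print(n_components)
```

Here the first two components explain 80% of the variance, so a third one is needed to pass the 85% threshold.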

As can be seen from Fig. 14, MC-MNN groups the 200 photos into sixteen clusters. Among them, eleven clusters (3, 5, 6, 7, 10, 11, 12, 13, 14, 15, and 16) are completely correct. The first cluster contains 7 photos of the first person and 8 photos of the 15th person. The second cluster contains all photos of the 2nd person, 6 photos of the 4th person, and 9 photos of the 14th person. The fourth cluster contains 4 photos of the 4th person and all photos of the 12th person. The eighth cluster contains all photos of the 8th person, one photo of the 14th person and all photos of the 18th person. The ninth cluster contains all photos of the 9th person, 2 photos of the 15th person and 3 photos of the first person.

From Table 5, it can be seen that on the Olivetti face data set MC-MNN achieved the best clustering results on the performance indicators ACC, RE, and NMI. Fuzzy-CFSFDP obtained the second best results on ACC and RE, and SNNDPC obtained the second best result on NMI. In terms of the number of obtained clusters, McDPC obtains the best result. Overall, MC-MNN has the best clustering effect on the Olivetti faces data set compared with the five related algorithms.

4.2.3 Time cost comparison

The time costs of algorithms MC-MNN, McDPC, Fuzzy-CFSFDP and DPADN are of the same order of magnitude on the 15 data sets, and the time cost of algorithm SNNDPC is an order of magnitude higher than theirs. This is because SNNDPC requires manual judgment to select the center points, which takes more time. The time cost of algorithm SMCL is the highest among all algorithms, because it needs many iterations in many situations. The time costs of the six algorithms on the 15 data sets are shown in Table 6.

5. Conclusion

In this study, we proposed a multi-center clustering algorithm based on mutual nearest neighbors, called MC-MNN, which uses multiple centers to represent a cluster and aims to effectively cluster arbitrarily distributed data. In MC-MNN, we first design a center point discovery algorithm based on mutual nearest neighbors, which can adaptively find center points without any parameters. Because the center points are found according to their mutual nearest neighbors, independently of the distance and density between data points, the algorithm is suitable for data sets with different density distributions. Then, a sub-cluster construction algorithm is designed based on the connection of the center points. Using the connection of multiple center points to construct sub-clusters is effective for clustering non-convex data sets. Finally, we measure the difficulty of merging sub-clusters according to the degree of overlapping and the distance between sub-clusters, and design a cluster number determination algorithm that can effectively determine the final number of clusters and obtain the clustering results. Compared with the existing algorithms, the MC-MNN algorithm has four advantages: 1) it can obtain center points automatically by using the mutual nearest neighbors; 2) it runs without any parameters; 3) it can automatically obtain the number of clusters; 4) it can be effectively applied to arbitrarily distributed data.

Acknowledgments

This article is supported by NSFC (61872281) and NKR&DPC (2017YFC1703506).

References

1. Alam KMR, Siddique, Adeli. A dynamic ensemble learning algorithm for neural networks. Neural Comput Appl. 2020; 32(12): 8675-8690. Available from: https://doi.org/10.1007/s00521-019-04359-7.

2. Pereira, Piteri, de Souza, Papa, Adeli. FEMa: A finite element machine for fast learning. Neural Comput Appl. 2020; 32(10): 6393-6404. Available from: https://doi.org/10.1007/s00521-019-04146-4.

3. Rafiei, Adeli. A new neural dynamic classification algorithm. IEEE Transactions on Neural Networks and Learning Systems. 2017; 28(12): 3074-3083.

4. Ahmadlou, Adeli. Enhanced probabilistic neural network with local decision circles: A robust classifier. Integrated Computer-Aided Engineering. 2010; 17(3): 197-210.

5. Zhou, Menche, Barabási, Sharma. Human symptoms-disease network. Nature Communications. 2014; 5(1): 1-10.

6. Yang, Niyongabo, Shu, Wang, Chang, et al. Integrated network analysis of symptom clusters across disease conditions. Journal of Biomedical Informatics. 2020; 107(12): 103482.

7. Lee, Chang, Sano. Clustering and classification based on distributed automatic feature engineering for customer segmentation. Symmetry. 2021; 13(9): 1557.

8. Akbar, Liu, Latif. Discovering knowledge by comparing silhouettes using k-means clustering for customer segmentation. International Journal of Knowledge Management (IJKM). 2020; 16(3): 70-88.

9. Pan, Yan. Exploiting higher-order patterns for community detection in attributed graphs. Integrated Computer-Aided Engineering. 2021; 28(2): 207-218.

10. Djenouri, Belhadi, Fournier-Viger, Lin JCW. Fast and effective cluster-based information retrieval using frequent closed itemsets. Information Sciences. 2018; 453: 154-167.

11. Leuski. Evaluating document clustering for interactive information retrieval. In: Proceedings of the Tenth International Conference on Information and Knowledge Management. CIKM ’01. New York, NY, USA: Association for Computing Machinery. 2001; 33-40. Available from: https://doi.org/10.1145/502585.502592.

12. Zhang, Wang. An unsupervised semantic sentence ranking scheme for text documents. Integrated Computer-Aided Engineering. 2021; 28(1): 17-33.

13. Ghosh-Dastidar, Adeli. Wavelet-clustering-neural network model for freeway incident detection. Computer-Aided Civil and Infrastructure Engineering. 2003; 18(5): 325-338.

14. Xia, Wang. A data-driven approach to determining freeway incident impact areas with fuzzy and graph theory-based clustering. Computer-Aided Civil and Infrastructure Engineering. 2020; 35(2): 178-199.

15. Jiang, Adeli. Clustering-neural network models for freeway work zone capacity estimation. International Journal of Neural Systems. 2004; 14(03): 147-163.

16. Mirzaei, Adeli. Segmentation and clustering in brain MRI imaging. Reviews in the Neurosciences. 2019; 30(1): 31-44.

17. Avola, Bernardi, Cinque, Massaroni, Foresti. Fusing self-organized neural network and keypoint clustering for localized real-time background subtraction. International Journal of Neural Systems. 2020; 30(4): 2050016.

18. Jiang, Adeli. Fuzzy clustering approach for accurate embedding dimension identification in chaotic time series. Integrated Computer-Aided Engineering. 2003; 10(3): 287-302.

19. Ortiz-Rosario, Adeli, Buford. MUSIC-expected maximization gaussian mixture methodology for clustering and detection of task-related neuronal firing rates. Behavioural Brain Research. 2017; 317: 226-236.

20. Mammone, Ieracitano, Adeli, Bramanti, Morabito. Permutation jaccard distance-based hierarchical clustering to estimate EEG network density modifications in MCI subjects. IEEE Transactions on Neural Networks and Learning Systems. 2018; 29(10): 5122-5135.

21. MacQueen, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Oakland, CA, USA. 1967; 1: 281-297.

22. Zhang, Ramakrishnan, Livny. BIRCH: An efficient data clustering method for very large databases. ACM Sigmod Record. 1996; 25(2): 103-114.

23. Ester, Kriegel, Sander, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press. 1996; 226-231.

24. Wang, Yang, Muntz, et al. STING: A statistical information grid approach to spatial data mining. 1997; 97.

25. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning. 1987; 2(2): 139-172.

26. McLachlan, Basford. Mixture models: Inference and applications to clustering. M. Dekker New York. 1988; 38.

27. McLachlan, Lee, Rathnayake. Finite mixture models. Annual Review of Statistics and its Application. 2019; 6: 355-378.

28. Wang, Yang, Muntz. STING: A statistical information grid approach to spatial data mining. In: VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases. August 25-29, 1997, Athens, Greece. Morgan Kaufmann. 1997; 186-195. Available from: http://www.vldb.org/conf/1997/P186.PDF.

29. Liao, Liu, Choudhary. A grid-based clustering algorithm using adaptive mesh refinement. In: 7th Workshop on Mining Scientific and Engineering Datasets of SIAM International Conference on Data Mining. 2004; 22: 61-69.

30. Karypis, Han, Kumar. Chameleon: Hierarchical clustering using dynamic modeling. Computer. 1999; 32(8): 68-75.

31. Liang, Bai, Dang, Cao. The K-means-type algorithms versus imbalanced data distributions. IEEE Transactions on Fuzzy Systems. 2012; 20(4): 728-745.

32. Tao. Unsupervised fuzzy clustering with multi-center clusters. Fuzzy Sets and Systems. 2002; 128(3): 305-322.

33. Xia, Zhang, Wang, Han, Yan. WC-KNNG-PC: Watershed clustering based on k-nearest-neighbor graph and pauta criterion. Pattern Recognition. 2022; 121: 108177.

34. Cheung, Tang. Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering. IEEE Transactions on Cybernetics. 2021; 51(3): 1598-1612.

35. Rodriguez, Laio. Clustering by fast search and find of density peaks. Science. 2014; 344(6191): 1492-1496.

36. Bie, Mehmood, Ruan, Sun, Dawood. Adaptive fuzzy clustering by fast search and find of density peaks. Personal and Ubiquitous Computing. 2016; 20(5): 785-793.

37. Wang, Zhang, Pang, Miao, Tan, et al. McDPC: Multi-center density peak clustering. Neural Computing and Applications. 2020; 32(17): 13465-13478.

38. Bryant, Cios. RNN-DBSCAN: A density-based clustering algorithm using reverse nearest neighbor density estimates. IEEE Transactions on Knowledge and Data Engineering. 2017; 30(6): 1109-1121.

39. Tong, Wang, Liu. An adaptive clustering algorithm based on local-density peaks for imbalanced data without parameters. IEEE Transactions on Knowledge and Data Engineering. 2021.

40. Cover, Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 1967; 13(1): 21-27.

41. Basu, Murthy. Towards enriching the quality of k-nearest neighbor rule for document classification. International Journal of Machine Learning and Cybernetics. 2014; 5(6): 897-905.

42. Chen, Yang, Wang, Liu, Wang, et al. A novel bankruptcy prediction model based on an adaptive fuzzy k-nearest neighbor method. Knowledge-Based Systems. 2011; 24(8): 1348-1359.

43. Ding, Jia. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems. 2016; 99: 135-145.

44. Chen, Fan, Shen, Zhang, Liu, et al. Fast density peak clustering for large scale data based on kNN. Knowledge-Based Systems. 2020; 187: 104824.

45. Vadapalli, Valluri, Karlapalem. A simple yet effective data clustering algorithm. In: Sixth International Conference on Data Mining (ICDM’06). IEEE. 2006; 1108-1112.

46. Abbas, El-Zoghabi, Shoukry. DenMune: Density peak based clustering using mutual nearest neighbors. Pattern Recognition. 2021; 109: 107589.

47. Cottam, Curtis. The use of distance measures in phytosociological sampling. Ecology. 1956; 37(3): 451-460.

48. Gowda, Krishna. Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognition. 1978; 10(2): 105-112.

49. Zhu, Feng, Huang. Natural neighbor: A self-adaptive neighborhood method without parameter k. Pattern Recognition Letters. 2016; 80: 30-36.

50. Fränti, Sieranoja. K-means properties on six clustering benchmark datasets. Applied Intelligence. 2018; 48(12): 4743-4759.

51. Duin, Juszczak, Paclik, Pekalska, De Ridder, Tax, et al. Prtools4.1, a matlab toolbox for pattern recognition. Delft University of Technology. 2007; 2600.

52. Dua, Graff. UCI machine learning repository. 2017. Available from: http://archive.ics.uci.edu/ml.

53. Samaria, Harter. Parameterisation of a stochastic model for human face identification. In: Proceedings of 1994 IEEE Workshop on Applications of Computer Vision. IEEE. 1994; 138-142.

54. Tong, Liu, Gao XZ. A density-peak-based clustering algorithm of automatically determining the number of clusters. Neurocomputing. 2021; 458: 655-666.

55. Liu, Wang. Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Information Sciences. 2018; 450: 200-226.
