DCE-IVI: Density-based clustering ensemble by selecting internal validity index

Abstract

No single clustering algorithm can efficiently partition datasets of arbitrary shape, so clustering ensembles have been proposed to integrate multiple clustering results consistently and obtain a better division. Most ensemble research applies a single algorithm with different parameters. Such results are easy to integrate, but the approach struggles to divide complex datasets. Other methods integrate different algorithms, which divide datasets from different aspects but fail to take outliers into account, and this has a negative effect on the partition results. To solve these problems, we cluster datasets with three different density-based algorithms. The innovations of this paper are as follows: (1) by setting dynamic thresholds, lower-frequency evidence in the co-association matrix is gradually deleted to obtain multiple reconstructed matrices; (2) these reconstructed matrices are analyzed by hierarchical clustering to obtain basic clustering results; (3) an internal validity index is designed from the compactness within clusters and the correlation between clusters, and is used to select the final clustering result. With these innovations, the clustering effect is significantly improved. Finally, a series of experiments is designed, and the results verify the improvement and effectiveness of the proposed technique (DCE-IVI).

Keywords

Clustering ensemble, density-based clustering algorithms, negative effects, internal validity index

1. Introduction

Clustering analysis is a highly important unsupervised machine learning strategy. Its main task is to divide a set of data objects into different classes so that the data objects of the same class are more similar, but there are obvious differences between objects of different classes [8]. As a supervised learning method, classification requires clearly known information on each class and that all data objects have a corresponding class [30]. Unlike classification, clustering analysis is based on the relationship between data objects when the dataset is divided [25]. In short, it works to separate similar objects into a group without considering the specific category; that is, there is no class label information.

Although there are a large number of clustering algorithms, there is no single algorithm that can efficiently handle all types of data distribution. Since each clustering algorithm has its own analysis strategy and performance standards, different clustering algorithms or parameters may produce different clustering results with significant deviations [1].

Ensemble learning is based on the combination of multiple different models to solve specific problems [4]. Its basic principle is based on the idea of using different techniques rather than a single method to establish a set of hypotheses and then combining them in order to improve results [5]. The use of ensemble learning in classification has become increasingly effective in recent years.

In the absence of supervision information, it is difficult to determine which partition structure is most suitable for the actual distribution. Therefore, selecting an appropriate algorithm is a challenging task. To solve this issue, many researchers have combined clustering analysis algorithms with ensemble learning to perform “classification”, and then consistently integrated the clustering results produced by several clustering algorithms. This process is called clustering ensemble.

Generally speaking, clustering ensemble can be defined as the process of combining multiple clustering results of a dataset. By merging the results of multiple clustering algorithms, higher quality and more robust clustering results can be obtained [24]. In general, clustering ensemble is divided into two stages: the generation of cluster members and consistency integration of the cluster members [34]. The main task of the first stage is to select appropriate clustering algorithms to generate multiple different cluster members. Since the clustering label has no physical meaning, the same label divided by different algorithms may represent different clusters. Therefore, it is more complicated to integrate the cluster members consistently. Different consistency integration schemes have a significant impact on the accuracy of clustering results [15].

Most scholars currently involved in clustering ensemble research tend to focus on in-depth research and innovation, but sometimes ignore relatively simplistic problems. The common problems are as follows:

(1) Many scholars employ the k-means algorithm as the basic partition generator in the stage of generating cluster members [24, 13, 14]. However, the algorithm is not suitable for clustering non-spherical datasets and cannot process datasets with significant density differences [18]. Therefore, many clustering results are poor and have a negative impact on consistency integration.
(2) In the first stage, most methods use the same algorithm to analyze the dataset several times to obtain diverse basic clustering results. However, as Topchy et al. pointed out in [38], given a set of weak clustering algorithms with sufficient differences, consistency integration can produce efficient division results. To obtain efficient ensemble results, it is necessary to use different clustering algorithms as much as possible, as each clustering algorithm reveals different aspects of the dataset. This will provide a superior final ensemble result.
(3) Whether at the stage of generating cluster members or consistency integration of cluster members, it is easy to ignore the influence of outliers on the final results. For example, the selected basic partition generator is insensitive to outliers. In addition, many scholars choose to generate a co-association matrix and then add information that describes the cluster structure to refine it [42]. However, in a practical sense, this method may cause some data points to be recognized as noise, which has a negative impact on the final clustering during consistency integration [7].

Cluster member generation and consistency integration methods have a significant influence on the accuracy of the final clustering results. In view of the above problems, this paper proposes the following solutions:

(1) Three density-based clustering algorithms (DBSCAN [10], DPC [36, 33], and OPTICS [2]), which can handle a variety of data structures, are employed to perform clustering analysis on datasets to generate different cluster members.
(2) A co-association matrix is generated from the cluster members. According to certain rules, the evidence with a lower frequency in the co-association matrix is deleted. The co-association matrix is thereby transformed into multiple reconstruction matrices, and hierarchical clustering analysis is performed on these to obtain multiple basic clustering results.
(3) By analyzing the compactness of the clusters and the correlation between clusters in the basic clustering results, an internal validity index is set to select among the basic clustering results and obtain the optimal clustering result.

The rest of this article is organized as follows. Section 2 reviews related work on the clustering ensemble problem and provides a specific description of the innovative methods proposed in this paper. Section 3 introduces the clustering ensemble method based on density-based clustering algorithms (DCE-IVI). The experimental results are given in Section 4. Finally, Section 5 summarizes the full text.

2. Related work

The clustering ensemble problem was first introduced by Strehl and Ghosh in 2002 [37]. They described it as combining multiple clustering results of a set of objects without accessing the original features. Gionis et al. also described the problem in 2007: given a set of clustering results, the goal of clustering ensemble is to find a clustering that agrees as much as possible with all input clustering results [17]. It can thus be observed that clustering ensemble works to generate the best result by combining and dividing multiple clustering results using certain rules.

2.1 Background of cluster ensemble method

2.1.1 Methods of generating cluster members

In clustering ensemble, the method of generating cluster members, such as the selection of clustering algorithms and the design of parameters, has a significant impact on the quality of the final clustering result. The three main generation methods currently available are described as follows:

The first method is to cluster the dataset with the same clustering algorithm but set different parameters for each clustering iteration. In 2002, Fred et al. adopted the k-means algorithm with variable k [13]. Using this method, a reasonable k was first drawn from $k\in[K_{\min},K_{\max}]$ , where $K_{\min}\geqslant 2$ and $K_{\max}\leqslant n$ (n being the number of data objects in the dataset), and then k randomly selected data objects were employed as the initial cluster centers to divide the dataset and generate cluster members.

The second method is to divide the dataset into multiple different data subsets according to certain rules and use the same clustering algorithm to process these subsets. Using this technique, Fischer and Buhmann [11] applied the bootstrap method to obtain several subsets.

The third method is to use different types of clustering algorithms to cluster the same dataset. In 2007, Gionis et al. used single link, complete link, average link, Ward's clustering, and k-means to generate basic clustering results with this technique [17].

2.1.2 Methods of consistency integration for cluster members

After generating clustering members, consistency integration is a key step that determines the quality and accuracy of the final result. Numerous methods have been proposed to solve this problem. The several main methods are outlined as follows:

The first technique is a voting method that conducts voting according to the division of data points by cluster members and calculates the proportion of data points divided into each cluster. If the proportion is bigger than the set threshold, it will be divided into this cluster. In 2001, Fred used partition information to calculate the number of times a pair of data points were divided into the same cluster, which was used as the vote of whether two data points belong to the same cluster [12].

The second method is to transform the clustering ensemble problem into the minimum cut problem of a hypergraph, and to use a clustering algorithm based on graph theory to perform clustering ensemble. In 2002, Strehl et al. proposed the cluster-based similarity partitioning algorithm (CSPA) [37]. CSPA first employs the calculated co-association matrix as the similarity matrix, and then uses data points as the vertices and the similarity values as the edge weights to construct a hypergraph. Finally, the graph partitioning algorithm METIS is used to divide the dataset to obtain the clustering result.

The third method is the evidence accumulation (EA) method proposed by Fred in 2002. Considering each generated cluster member as the independent evidence, the number of times a pair of data points is divided into the same cluster is calculated to obtain the co-association matrix [14], and then the final clustering result is obtained through hierarchical clustering.

The fourth method is to weight the generated cluster members to achieve clustering ensemble. In 2006, Zhou et al. used the normalized mutual information (NMI) between a pair of clusters to calculate the weight of each cluster and selected the optimal cluster by excluding those whose mutual information weight was less than the threshold [42].

2.1.3 Existing clustering ensemble methods

In recent years, new clustering ensemble methods have been continuously proposed, with considerable improvement and innovation to the above methods.

In 2015, Caiming Zhong et al. proposed a new hypergraph transformation method [44]. First, the co-association matrix is generated and refined by calculating the probability. Then, the matrix is transformed into a path-based matrix, and spectral clustering is applied to the path-based matrix to generate the final clustering result. In 2016, Feijiang Li et al. proposed a clustering ensemble algorithm based on evidence theory [23]. The neighbors of each data point are first located, and the label probabilities for each member are generated. After that, the probabilities of these labels are fused to produce the final result. Later, in 2019, Feijiang Li et al. proposed the concept of sample stability to determine the contribution of samples. With this method, the dataset is divided into cluster core and cluster halo, and the samples in the cluster core are used to determine the clear structures. According to the stability of the samples, the samples in the cluster halo are then gradually assigned to the clear structures [21]. In 2019, Huang Dong et al. proposed the U-SPEC algorithm, which can effectively process large-scale datasets. By integrating multiple U-SPEC clusterers into a unified ensemble clustering framework, clustering analysis with high efficiency and high robustness was realized [18]. Liang Bai et al. also proposed a multiple k-means clustering ensemble algorithm in 2020 to locate nonlinearly separable clusters [3], which extracts local data labels from clustering members. This method not only inherits the scalability of k-means but also overcomes its limitation of only handling linearly separable datasets.

2.2 Density-based clustering algorithms

Different clustering algorithms are sensitive to varying dataset structures, meaning that different clustering algorithms will produce different clustering results when dividing datasets. If only one clustering algorithm is used to partition the original dataset, the quality of the final clustering results cannot be guaranteed. This article uses three density-based clustering algorithms that can handle different dataset structures to generate basic cluster members, namely DBSCAN, DPC, and OPTICS algorithms.

DBSCAN (density-based spatial clustering of applications with noise) can cluster dense datasets of any shape [10]. It is highly sensitive to the input parameters $\varepsilon$ and minPts, and can find outliers while clustering. However, if the density of the dataset is not uniform and the spacing between clusters varies widely, the quality of clustering will be poor. In clustering by fast search and find of density peaks (DPC), a cluster center is surrounded by samples with lower local density and is far away from any sample with higher local density [36]. DPC can automatically locate cluster centers, effectively assign data points and remove noise points, and efficiently cluster datasets of any shape. Therefore, it is highly suitable for the clustering analysis of large-scale data [33]. Another density-based clustering algorithm is ordering points to identify the clustering structure (OPTICS), which improves on the concept of DBSCAN. The algorithm is insensitive to its initial input parameters, can efficiently locate outliers, and can obtain clusters of different densities [2].

2.3 Hierarchical clustering method

A hierarchical clustering algorithm decomposes a given dataset hierarchically and organizes the data points into a clustering tree. There are two methods of hierarchical decomposition. One is condensed (agglomerative) hierarchical clustering, which is a bottom-up method. In the beginning, each data point is regarded as a separate cluster, and then the two closest clusters are repeatedly merged until all points form one cluster or a termination condition is reached [28]. The other is split (divisive) hierarchical clustering, also known as the top-down method. It regards all data points in the dataset as one cluster at the beginning and then gradually splits it into smaller clusters until each data point is in a separate cluster or a termination condition is reached [35].

In the clustering ensemble method based on hierarchical clustering, the condensed hierarchical clustering method is usually used to analyze the co-association matrix. Assuming that $D=\{x_{1},x_{2},\ldots,x_{n}\}$ is the dataset and that it contains n data points, an $n\times n$ co-association matrix can be obtained. The basic steps of the condensed hierarchical clustering algorithm are as follows:

1. Regard each data point as a cluster. The similarity between clusters is equal to the similarity of the corresponding data points;
2. Find the two closest clusters and merge them;
3. Calculate the similarity between the new cluster and the original clusters;
4. Repeat steps 2 and 3 until the similarity between all clusters reaches a threshold, then stop clustering.
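
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `agglomerate` is hypothetical, the similarity between two clusters is assumed to be the average pairwise similarity (the text does not fix a linkage), and "reaches a threshold" is read as "no remaining pair of clusters is at least as similar as the threshold".

```python
def agglomerate(sim, threshold):
    """Condensed (agglomerative) clustering on a similarity matrix.

    Average linkage is an assumption of this sketch; merging stops once
    the most similar pair of clusters falls below `threshold`.
    """
    n = len(sim)
    clusters = [[i] for i in range(n)]  # step 1: one cluster per data point

    def link(a, b):
        # average similarity over all cross-cluster pairs
        return sum(sim[u][v] for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # step 2: find the two closest (most similar) clusters
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = link(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:  # step 4: stopping rule
            break
        i, j = pair
        # step 3: merge; similarities to the merged cluster are
        # recomputed by link() on the next pass
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * n
    for c, members in enumerate(clusters):
        for u in members:
            labels[u] = c
    return labels


# Toy similarity matrix with two obvious groups (illustrative values).
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
labels = agglomerate(sim, 0.5)
```

Because the number of clusters is decided by the threshold, no target cluster count has to be supplied in advance.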

As described above, in the process of consistency integration for cluster members, most clustering ensemble methods adopt the condensed hierarchical clustering method. In addition, the condensed hierarchical clustering method based on a similarity threshold avoids setting the final number of clusters. It is based on the similarity matrix formed by cluster members, which reflects the similarity measure between data points, and can be applied to the minimum spanning tree. The final clustering result can thus be obtained by setting a threshold to cut off the weak connections. In the evidence accumulation (EA) method [14] proposed by Fred, the threshold t was set to 0.5, and hierarchical clustering analysis was performed on the similarity matrix to obtain the final partition result.

3. Design of DCE-IVI

3.1 Method overview

At present, most clustering ensemble methods for research and analysis are based on k-means. Using such methods, the similarity matrix is obtained according to the basic clustering result, and the information describing the cluster structure is added to the similarity matrix to refine it. However, these schemes fail to consider the outliers and the shape of the dataset, providing inadequate clustering results.

Figure 1.

The overview of the proposed method DCE-IVI.

In this paper, we use three density-based algorithms that can handle different data structures, DBSCAN, DPC, and OPTICS, to cluster the dataset and obtain different basic clustering results. We then create a co-association matrix (CM) from the basic clustering results and zero the elements of the matrix that are smaller than each of a series of thresholds t, which yields multiple reconstructed matrices. Hierarchical clustering (HC) analysis is applied to each reconstructed matrix to obtain the corresponding clustering results. Using the internal validity index DCE-IVI, the optimal clustering result is then selected as the final result. The overall process is shown in Fig. 1.

3.2 Form a co-association matrix (CM)

Figure 2.

Clustering results of DBSCAN, DPC and OPTICS algorithms on the Jain dataset.

The above three density-based clustering algorithms vary in sensitivity to the dataset’s density, structure, and noise points. By using these algorithms to partition the dataset respectively, we can obtain the different clustering results, as shown in Fig. 2, which help divide the dataset more efficiently.

The co-association matrix is a method proposed by Fred to measure the similarity between data points [14]. The underlying idea is that data points that truly belong to the same cluster are likely to be placed in the same cluster across different partitions. By considering each cluster member as independent evidence, the number of times that a pair of data points is divided into the same cluster is counted, and the co-association matrix is obtained.

Assuming that $D=\{x_{1},x_{2},\ldots,x_{n}\}$ is the dataset that needs to be partitioned, $P=\{P_{1},\ldots,P_{m}\}$ represents the set of the basic cluster partitions, where $x_{t}$ is the t-th data point in the dataset $D$ ; $P_{i}=\{C_{i1},C_{i2},\ldots,C_{ik}\}$ , where $C_{ij}$ represents the j-th cluster in the cluster member $P_{i}$ , and the similarity between the u-th data point and the v-th data point is expressed as follows:

$\displaystyle A_{uv}=\frac{\textit{The number of times that }x_{u}\textit{ and }x_{v}\textit{ belong to the same cluster}}{\textit{The number of cluster members }m}$ (1)

The co-association matrix transforms the results generated by cluster members into a matrix, which can be regarded as the similarity matrix between all points in the dataset [41]. Thus, the mathematical expression of each element in the co-association matrix (CM) is as follows:

$\displaystyle CM(u,v)=\frac{1}{m}\sum_{i=1}^{m}\delta(x_{u},x_{v})$ (2)

Among them, $CM(u,v)$ represents the element in the u-th row and v-th column of the CM matrix, m is the number of cluster members obtained, and $\delta(x_{u},x_{v})$ is defined as follows:

$\displaystyle\delta(x_{u},x_{v})=\left\{\begin{array}{ll}1,&C_{i}(x_{u})=C_{i}(x_{v})\\ 0,&\text{otherwise}\\ \end{array}\right.$ (3)

$C_{i}(x_{u})$ and $C_{i}(x_{v})$ respectively represent the clusters of data points $x_{u}$ and $x_{v}$ in the basic cluster $P_{i}$ . $\delta(x_{u},x_{v})$ indicates whether the data points $x_{u}$ and $x_{v}$ belong to the same cluster in the basic cluster $P_{i}$ . If they are in the same cluster, the value is 1; otherwise, the value is 0.
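
As a hedged illustration of Eqs. (2) and (3), the following Python sketch builds the co-association matrix from a list of label vectors. The function name `co_association`, the use of the label -1 to mark noise points, and the choice that two noise points never count as co-clustered are assumptions of this sketch, not statements from the paper.

```python
NOISE = -1  # assumed marker for points a density-based algorithm labels as noise


def co_association(partitions):
    """CM(u, v): fraction of partitions in which x_u and x_v share a cluster."""
    m = len(partitions)        # number of cluster members
    n = len(partitions[0])     # number of data points
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for u in range(n):
            for v in range(n):
                # delta(x_u, x_v) = 1 when both points are in the same cluster
                same = labels[u] == labels[v] and labels[u] != NOISE
                if u == v or same:
                    counts[u][v] += 1
    return [[counts[u][v] / m for v in range(n)] for u in range(n)]


# Three toy partitions of four points (illustrative labels only).
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, NOISE],
]
cm = co_association(partitions)
```

In this toy example, points 0 and 1 share a cluster in two of the three partitions, so `cm[0][1]` is 2/3, while points 0 and 3 are never grouped together and `cm[0][3]` is 0.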

3.3 Generate the second-order basis clustering results

The analysis shows that some evidence in the co-association matrix may hinder the effective clustering of data points, so that data points are assigned to incorrect clusters or two well-separated clusters are connected into one. Such evidence can be regarded as related to outliers, and it has a negative impact on the final clustering result. Removing this negative evidence can therefore provide a more effective clustering result. It is difficult, however, to determine which evidence is negative. What can be observed is that when a pair of data points belongs to the same cluster in some cluster member but to different clusters in the real partition, the corresponding value in CM is, by the definition of the co-association matrix, very small.

As illustrated by the above analysis, negative evidence is difficult to determine. To address this issue, Ren et al. proposed the concept of confusion [32]. The confusion of a pair of data points represents the uncertainty of their division into the same cluster. When the frequency between a pair of data points is 0.5, the degree of confusion is the largest. When the frequency is far less than 0.5, the corresponding pair of data points can be removed as negative evidence. Therefore, we step through [0, 0.5] with step size r, each time removing the evidence that is less than the current value, to generate different reconstruction matrices ( $\textit{CMs}=\{CM_{1},CM_{2},\ldots,CM_{50}\}$ ). Then, following [14], the threshold t is set to 0.5 and condensed hierarchical clustering is performed on the generated reconstruction matrices to obtain the corresponding second-order basis clustering results ( $B=\{B_{1},B_{2},\ldots B_{50}\}$ ).

Algorithm 1: Generate the second base clustering results
Input: Original dataset D, Co-association matrix CM
Output: Second base clustering results B
1 $\phi$ $\to$ B
2 $[r_{1},r_{2},\ldots,r_{50}]\to S$
3 for s $\in$ S do
4 $\quad$ CM $\to$ CMs
5 $\quad$ for CMs do
6 $\qquad$ if CMs (i, j) $\leqslant$ s then
7 $\quad$ $\qquad$ 0 $\to$ CMs (i, j)
8 $\quad$ B $\cup$ HC (CMs, D) $\to$ B

The procedure is summarized in Algorithm 1. The originally generated co-association matrix is taken as the input, and hierarchical clustering is performed on CMs to obtain the second-order basic clustering results B, where the condensed hierarchical clustering $HC(\textit{CMs},D)$ can be described as follows:

1. Regard each data point corresponding to each element of the co-association matrix as a cluster. The similarity between clusters is equal to the similarity of the corresponding data points;
2. Find the two closest clusters and merge them;
3. Recalculate the similarity between the new cluster and the original clusters;
4. Repeat steps 2 and 3 until the similarity between all clusters reaches 0.5, then stop clustering.
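
The threshold sweep of Algorithm 1 combined with the hierarchical step can be sketched in Python as below. Several details are assumptions made only for illustration: the names `hc_cut` and `second_order_results` are hypothetical, the 50 thresholds are taken as 0.01, 0.02, ..., 0.50 (the value of r is not stated in the text), and average linkage is assumed so that zeroed entries can change which clusters merge at the 0.5 cut.

```python
def hc_cut(sim, threshold=0.5):
    """Average-link agglomerative clustering stopped at `threshold` (assumed linkage)."""
    n = len(sim)
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                s = sum(sim[u][v] for u in a for v in b) / (len(a) * len(b))
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * n
    for c, members in enumerate(clusters):
        for u in members:
            labels[u] = c
    return labels


def second_order_results(cm, steps=50, r=0.01, threshold=0.5):
    """For each s in {r, 2r, ..., steps*r}: zero entries <= s, then cluster."""
    n = len(cm)
    results = []
    for k in range(1, steps + 1):
        s = k * r
        # reconstructed matrix: a copy of CM with low-frequency evidence removed
        cms = [[0.0 if cm[i][j] <= s else cm[i][j] for j in range(n)]
               for i in range(n)]
        results.append(hc_cut(cms, threshold))
    return results


# Toy matrix (illustrative values): point 2 is tied strongly to point 0
# but only weakly to point 1.
cm = [
    [1.0, 1.0, 0.9],
    [1.0, 1.0, 0.155],
    [0.9, 0.155, 1.0],
]
B = second_order_results(cm)
```

With these toy values, the weak entry 0.155 survives the first thresholds and all three points merge; once it is zeroed, point 2 is left in its own cluster, so the sweep yields different candidate partitions.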

Once the clustering results B have been determined, a validity index is needed to select the best clustering member as the final clustering result.

3.4 Integration of basic clustering results

3.4.1 Minimax similarity

Suppose that $D=\{x_{1},x_{2},\ldots,x_{n}\}$ is the dataset and $G=(D,E)$ is the fully connected graph corresponding to $D$ : the $n$ data points are the $n$ vertices of the graph, the similarity between two data points is the weight of the edge connecting them, and $SM_{D}$ is the similarity matrix. Minimax similarity is defined as follows:

Let $\varphi_{ij}^{D}$ represent the set of all paths from vertex $i$ to vertex $j$ in a fully connected graph. For each path $p\in\varphi_{ij}^{D}$ , the effective similarity $s_{ij}^{p}$ between the data points $x_{i}$ and $x_{j}$ is the minimum edge weight along the path $p$ :

$\displaystyle s_{ij}^{p}=\underset{i\leqslant h\leqslant|p|}{\min}s(p[h],p[h+1])$ (4)

The total similarity $\textit{sim}(i,j,D)$ between vertex $i$ and vertex $j$ is defined as the maximum value of the effective similarity of all paths in $\varphi_{ij}^{D}$ :

$\displaystyle\textit{sim}(i,j,D)=\underset{p\in\varphi_{ij}^{D}}{\max}\{s_{ij}^{p}\}$ (5)

Then, the minimax similarity is obtained as follows:

$\displaystyle\textit{sim}(i,j,D)=\underset{p\in\varphi_{ij}^{D}}{\max}\left\{\underset{i\leqslant h\leqslant|p|}{\min}s(p[h],p[h+1])\right\}$ (6)

Among them, $p[h]$ is the h-th vertex along the path from vertex $i$ to vertex $j$ , and $s(p[h],p[h+1])$ is the weight of the edge from the h-th vertex to the next vertex, which is the similarity of data points $x_{h}$ and $x_{h+1}$ in $SM_{D}$ .
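
Enumerating every path in Eq. (6) is impractical, but the same max-min value can be obtained for all pairs with a Floyd-Warshall style update. The sketch below is one possible way to compute it, offered as an illustration rather than as the procedure used in the paper; the function name `minimax_similarity` is hypothetical.

```python
def minimax_similarity(sm):
    """All-pairs minimax similarity: the best bottleneck over all paths.

    best[i][j] is raised whenever routing through vertex k gives a path
    whose weakest edge is stronger than the current value.
    """
    n = len(sm)
    best = [row[:] for row in sm]  # start from the direct edge weights
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = min(best[i][k], best[k][j])
                if via_k > best[i][j]:
                    best[i][j] = via_k
    return best


# Toy similarity matrix (illustrative values).
sm = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]
msim = minimax_similarity(sm)
```

In the toy matrix, the direct edge between points 0 and 2 has weight 0.1, but the path through point 1 has a weakest edge of min(0.9, 0.8) = 0.8, so `msim[0][2]` becomes 0.8. This cubic-time update is a design choice for clarity; a maximum spanning tree would give the same values more efficiently.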

According to the definition, when two data points do not belong to the same cluster, the similarity between them is very small. However, if there are some abnormal noise values between two clusters, even if two data points are in different clusters, the similarity between them may be very large. In other words, the similarity measure based on path is very sensitive to noise points. To address this issue, Chang et al. proposed a robust minimax similarity method to eliminate the influence of noise on clustering results [6]:

$\displaystyle\textit{Rsim}(i,j,D)=\underset{p\in\varphi_{ij}^{D}}{\max}\left\{\underset{i\leqslant h\leqslant|p|}{\min}s(p[h],p[h+1])w_{h}w_{h+1}\right\}$ (7)

In the above formula, the weight $w_{h}$ can be expressed as follows:

$\displaystyle w_{h}=\frac{\sum_{x_{a}\in N(x_{h})}s(x_{h},x_{a})}{\underset{x_{b}\in D}{\max}(\sum_{x_{c}\in N(x_{b})}s(x_{b},x_{c}))}$ (8)

where $N(x_{h})$ is the nearest neighbor set of the data point $x_{h}$ .
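
Eqs. (7) and (8) can be illustrated with the short Python sketch below. It assumes that $N(x_{h})$ consists of the l most similar other points (l being a user-chosen parameter) and that the weights multiply each edge before the max-min search; the function names are hypothetical and the code is a sketch, not the reference implementation of [6].

```python
def neighbor_weights(sm, l):
    """w_h of Eq. (8): neighbor-similarity sum, scaled by the largest such sum."""
    n = len(sm)
    sums = []
    for h in range(n):
        # N(x_h) is assumed to be the l most similar other points
        others = sorted((sm[h][a] for a in range(n) if a != h), reverse=True)
        sums.append(sum(others[:l]))
    top = max(sums)
    return [s / top for s in sums]


def robust_minimax_similarity(sm, l):
    """Rsim of Eq. (7): minimax similarity on edges scaled by w_h * w_{h+1}."""
    n = len(sm)
    w = neighbor_weights(sm, l)
    best = [[sm[a][b] * w[a] * w[b] for b in range(n)] for a in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = min(best[i][k], best[k][j])
                if via_k > best[i][j]:
                    best[i][j] = via_k
    return best


# Toy similarity matrix (illustrative values), one nearest neighbor per point.
sm = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]
w = neighbor_weights(sm, 1)
rsim = robust_minimax_similarity(sm, 1)
```

A point with weak ties to its neighbors receives a weight below 1, so every path passing through it is discounted. This is how an isolated noise point between two clusters is prevented from acting as a strong bridge.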

3.4.2 Internal validity index DCE-IVI

In the process of clustering, a validity index can determine the optimal number of clusters and select the optimal partition [44]. These indices can be divided into three categories: external, internal, and relative validity indices [40]. An external validity index can be used when external information about the dataset is available: the clustering partition is compared against the external criteria, and the degree of matching is used to evaluate the performance of different clustering algorithms. A relative validity index tests different parameter settings of a clustering algorithm against pre-defined evaluation criteria and finally selects the optimal parameters and clustering mode. An internal validity index is mainly based on the geometric structure of the dataset and evaluates the clustering division in terms of compactness and separation. As can be seen from the definition of clustering, the data points in the same cluster are closely distributed, while the data points in different clusters are scattered. Compactness describes the distance between data points in the same cluster, and separation describes the distance between data points in different clusters [26]. In this paper, without using the original dataset information, an internal validity index is set to select among the results and obtain the final clustering result.

Assuming that $B_{i}=\{C_{1},C_{2},\ldots,C_{k}\}$ is the base clustering to be measured, the compactness and separation based on the robust minimax similarity are defined as follows:

$\displaystyle\textit{compact}(C_{i})=\underset{x_{p}\in C_{i1},x_{q}\in C_{i2}}{\max}\textit{Rsim}(x_{p},x_{q},C_{i})$ (9)

$\displaystyle\textit{separate}(C_{i},C_{j})=\underset{x_{p}\in C_{i},x_{q}\in C_{j}}{\max}\textit{Rsim}(x_{p},x_{q},D)$ (10)

Where $\textit{compact}(C_{i})$ represents the stability within the cluster, $\textit{separate}(C_{i},C_{j})$ represents the correlation between different clusters, and $C_{i1}$ and $C_{i2}$ are the results of the normalized cutting of cluster $C_{i}$ . Therefore, the definition of DCE-IVI based on minimax similarity is as follows:

$\displaystyle\textit{DCE-IVI}=\sum_{1\leqslant i\leqslant k}\frac{\textit{separate}(C_{i},D\setminus C_{i})}{\textit{compact}(C_{i})}$ (11)

Algorithm 2: Get the quality by the internal validity index: DCE-IVI
Input: Similarity matrix $SM_{D}$ , the number of nearest neighbors l, base cluster $B_{i}=\{C_{1},C_{2},\ldots,C_{k}\}$
Output: Quality measure by DCE-IVI
1 $\textit{Rsim}(x_{p},x_{q},D)\|x_{p},x_{q}\in D\to\textit{Rsim}_{D}$
2 0 $\to$ ivi
3 for $i=1,2,\ldots,k$ do
4 max $\textit{Rsim}_{D}\|x_{p}\in C_{i},x_{q}\in D\backslash C_{i}\to\textit{separate}$
5 $SM_{D}(C_{i})\to SM_{Ci}$
6 $\textit{Rsim}(x_{p},x_{q},C_{i})\|x_{p},x_{q}\in C_{i}\to\textit{Rsim}_{Ci}$
7 $\textit{Ncut}(\textit{Rsim}_{Ci},2)\to[C_{i1},C_{i2}]$
8 min $\textit{Rsim}_{Ci}\|x_{p}\in C_{i1},x_{q}\in C_{i2}\to\textit{compact}$
9 ivi $+$ separate/compact $\to$ ivi
10 return ivi

It can be seen from [37] that a cluster is a high-density region separated from others by low-density regions. To obtain the optimal clustering, the connectivity between different clusters should be low and the internal stability of each cluster should be strong; accordingly, the smaller the internal validity index DCE-IVI, the better the clustering effect. Algorithm 2 describes the calculation process of DCE-IVI, in which $\textit{Rsim}_{D}$ is a robust minimax similarity matrix based on dataset $D$ , $SM_{D}$ is the similarity matrix of all data points in dataset $D$ , $SM_{Ci}$ is a similarity matrix based on data points in cluster $C_{i}$ , and $\textit{Rsim}_{Ci}$ is a robust minimax similarity matrix based on data points in cluster $C_{i}$ .

The final clustering ensemble result is obtained by selecting the second-order basic clustering result with the smallest DCE-IVI value, as shown in Algorithm 3. In addition, the role of domain knowledge in the entire process is shown in Table 1.

Algorithm 3: Obtain the final clustering ensemble result: F
Input: Second base clustering results $B=\{B_{1},B_{2},\ldots,B_{50}\}$
Output: Final clustering ensemble result: F
1 NULL $\to$ F
2 for $B_{i}\in B$ , where $i=1,2,\ldots,50$ do
4 $\quad$ if F is NULL or DCE-IVI (F) $>$ DCE-IVI ( $B_{i}$ ) then
5 $\qquad$ $B_{i}\to F$
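The selection loop of Algorithm 3 amounts to keeping the candidate with the smallest index value. A minimal Python sketch follows; the function name `select_final`, the `score` callable and the toy scores are hypothetical placeholders for the second-order basis clustering results and their DCE-IVI values.

```python
def select_final(candidates, score):
    """Return the candidate with the smallest score, scanning once as in Algorithm 3."""
    final, final_score = None, None
    for labels in candidates:
        current = score(labels)
        # keep the first candidate, then replace it whenever a smaller value appears
        if final is None or final_score > current:
            final, final_score = labels, current
    return final

# Hypothetical index values for three candidate results.
toy_scores = {"B1": 3.2, "B2": 0.7, "B3": 1.5}
chosen = select_final(list(toy_scores), toy_scores.get)
```

Caching the score of the current best candidate avoids recomputing the index for F in every iteration.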

Table 1

The role of domain knowledge in the entire process

Mathematical symbol	Role
D	Dataset
P	the set of the basic cluster partitions
$P_{i}$	cluster member
$C_{ij}$	the j-th cluster in $P_{i}$
CM	co-association matrix
CM (u, v)	the element in the u-th row and v-th column of CM
$C_{i}(x_{u})$	the clusters of data points $x_{u}$ and $x_{v}$ in the basic cluster $P_{i}$
CMs	reconstruction matrices
B	corresponding second-order basis clustering results
$\varphi_{ij}^{D}$	the set of all paths from vertex $i$ to vertex $j$ in a fully connected graph
p	the path in $\varphi_{ij}^{D}$
$s_{ij}^{p}$	the minimum edge weight along the path $p$
$\textit{sim}(i,j)$	the maximum value of the effective similarity of all paths in $\varphi_{ij}^{D}$
$p[h]$	the h-th vertex along the path from vertex $i$ to vertex $j$
$s(p[h],p[h+1])$	the weight of the edge from the h-th vertex to the next vertex
$w_{h}$	the nearest neighbor set of the data point $x_{h}$
$\textit{compact}(C_{i})$	the stability within the cluster
$\textit{separate}(C_{i},C_{j})$	the correlation between different clusters
DCE-IVI	internal validity index
$SM_{D}$	the similarity matrix of all data points in dataset $D$
$SM_{Ci}$	the similarity matrix based on data points in cluster $C_{i}$
$\textit{Rsim}_{C_{i}}$	the robust minimax similarity matrix based on data points in cluster $C_{i}$

3.5 Time complexity

The algorithm consists of four main steps: generating the basic clustering results, forming the co-association matrix CM, generating the second-order basis clustering results from the reconstructed matrices, and selecting the final clustering result through the internal validity index DCE-IVI. The time complexity of each step is analyzed below, where N is the number of data points:

The basis clustering results are generated by the DBSCAN, DPC and OPTICS algorithms. The basic time complexity of DBSCAN and OPTICS is O (N * the time required to find the points in the eps-neighborhood), and their worst-case time complexity is $O(N^{2})$ [10, 2]; the time complexity of the DPC algorithm is $O(N^{2})$ [33]. It can be concluded that the time complexity of this stage is $O(N^{2})$ .

In the stage of calculating the co-association matrix through the basis clustering members, assuming that the number of basis cluster members is m, each basis partition has $\sqrt{N}$ clusters, and the clusters have the same size, the calculation time $T\approx\frac{1}{m}\ast N^{\frac{3}{2}}$ and the time complexity is $O(N^{\frac{3}{2}})$ [14].
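For illustration, a co-association matrix can be built from the basis partitions as in the following Python sketch. It assumes the usual evidence-accumulation definition, in which CM (u, v) is the fraction of basis partitions that place $x_{u}$ and $x_{v}$ in the same cluster; the function name is hypothetical, and the naive double loop shown here is written for clarity rather than for the cost estimate given above.

```python
def co_association(partitions):
    """CM[u][v] = fraction of partitions in which points u and v share a cluster."""
    m = len(partitions)       # number of basis cluster members
    n = len(partitions[0])    # number of data points
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for u in range(n):
            for v in range(n):
                if labels[u] == labels[v]:
                    counts[u][v] += 1
    return [[c / m for c in row] for row in counts]

# Two hypothetical basis partitions of three points.
cm = co_association([[0, 0, 1], [0, 1, 1]])
```

An implementation that iterates over the members of each cluster instead of over all pairs touches only the co-clustered pairs, which is the setting the cost estimate refers to.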

When generating the second-order basis clustering results from the reconstruction matrices, the elements of these matrices that are smaller than the threshold are first set to zero, which has time complexity $O(N^{2})$ . Hierarchical clustering is then used to generate the second-order basis clustering results. It needs $N$ iterations, and the matrix must be updated and stored in each iteration, so its time complexity is $O(N^{3})$ . The time complexity of this stage is $O(N^{2}+N^{3})=O(N^{3})$ .
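The two operations of this stage can be sketched in Python as follows. The sketch is only an illustration under assumptions: the linkage criterion is not restated in this section, so single linkage is used as a stand-in, merging stops once only zero-similarity pairs remain, and the function names and the toy matrix are hypothetical.

```python
def prune(cm, threshold):
    """Zero every co-association entry that is smaller than the threshold."""
    return [[v if v >= threshold else 0.0 for v in row] for row in cm]

def single_linkage(sim, k):
    """Merge the most similar clusters until k remain or no positive evidence is left."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(((sim[u][v], u, v) for u in range(n) for v in range(u + 1, n)),
                   reverse=True)
    clusters = n
    for s, u, v in edges:
        if clusters == k or s <= 0.0:
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[rv] = ru
            clusters -= 1
    roots = {}
    # relabel the roots as 0, 1, 2, ... in order of first appearance
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Hypothetical co-association matrix with one weak link between points 1 and 2.
cm4 = [[1.0, 1.0, 0.0, 0.0],
       [1.0, 1.0, 0.2, 0.0],
       [0.0, 0.2, 1.0, 1.0],
       [0.0, 0.0, 1.0, 1.0]]
before = single_linkage(cm4, 1)
after = single_linkage(prune(cm4, 0.5), 1)
```

In this toy case the weak entry 0.2 joins the two groups before pruning, whereas after pruning the groups stay apart, which is the intended effect of deleting low-frequency evidence.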

In the final stage, the final clustering result is selected through the internal validity index DCE-IVI, whose two components, compactness and separation, are calculated from the robust path-based similarity. The time complexity of the robust path-based similarity is $O(N^{2})$ [43, 31], so this stage has the same complexity: $O(N^{2})$ .

Through the above analysis, the time complexity of this algorithm is $O(N^{3})$ .

4. Experiment analysis

4.1 Experimental dataset

In this paper, 10 datasets, including five two-dimensional synthetic datasets and five real multi-dimensional datasets, were employed for testing. The real distribution of data points in the five two-dimensional datasets is shown in Fig. 3, and a detailed description of these 10 datasets is provided in Table 2.

Two large-scale real datasets were also tested. A detailed description of these two datasets is provided in Table 3.

Table 2
Detailed description of above 10 datasets

Datasets	Objects	Dimensions	Classes	Refs
Aggregation	788	2	7	[17]
Flame	240	2	2	[16]
Jain	373	2	2	[20]
R15	600	2	15	[39]
Spiral	312	2	3	[6]
Glass	214	9	6	[9]
Iris	150	4	3	[9]
Ionosphere	351	34	2	[9]
Segmentation	2130	19	7	[9]
WDBC	569	30	2	[9]

Table 3

Detailed description of 2 large-scale real datasets

Datasets	Objects	Dimensions	Classes	Refs
Pendigits	10992	16	10	[9]
Shill Bidding	6321	13	2	[9]

Figure 3.

Synthetic datasets.

Table 4

Specific information of datasets

Datasets	Specific information
Aggregation	the dataset contains narrow “bridges” between clusters, uneven-sized clusters, and outliers
Flame	the dataset contains narrow “bridges” between clusters, and outliers
Jain	the dataset consists of two sparse uneven clusters, and there are noise points in it
R15	the dataset consists of 15 similar two-dimensional Gaussian clusters, and the clusters are located in rings
Spiral	it is a 3-spiral dataset, and it is more suitable for path-based clustering
Glass	it is the Glass Identification dataset, and contains 9 attributes
Iris	it quantifies the morphologic variation of Iris flowers of three related species
Ionosphere	the dataset describes the radar returns (Good and Bad) from the ionosphere
Segmentation	the dataset includes image data described by high-level numeric-valued attributes, and the images were hand-segmented to create a classification for every pixel
WDBC	the dataset includes information about breast cancer, and its two categories are benign and malignant
Pendigits	it is the Pen-Based Recognition of Handwritten Digits dataset, built by collecting 250 samples from 44 writers
Shill Bidding	the dataset includes the auction data of eBay auctions of a popular product

Furthermore, to better understand the characteristics of the data, we collected specific information on the datasets used in the experiment and present it in Table 4. None of the datasets has missing values.

4.2 Algorithms compared

To compare the density-based clustering ensemble algorithm proposed in this paper with other techniques, two common groups of methods were selected: link-based clustering ensemble (LCE) [19] and Strehl’s algorithms [37]. A more recent selective clustering ensemble method (DSME) [22] was also included.

Link-based clustering ensemble (LCE) improves the co-association matrix by locating the hidden information of the basic partitions, and proposes three improved schemes: weighted connected-triple (WCT), weighted triple-quality (WTQ), and combined similarity measure (CSM). In [37], three clustering ensemble methods were proposed: the cluster-based similarity partitioning algorithm (CSPA), the hypergraph partitioning algorithm (HGPA), and the meta-clustering algorithm (MCLA). The selective clustering ensemble method (DSME) evaluates the quality of each cluster and embeds it into a framework, DS, which considers the difference between the result of the ensemble selection stage and the result of the ensemble integration stage [22].

4.3 Evaluation index

Three clustering evaluation indexes were employed to measure the clustering ensemble results: classification accuracy (CA), normalized mutual information (NMI), and the adjusted Rand index (ARI) [27].

Suppose $D=\{x_{1},x_{2},\ldots,x_{n}\}$ is the dataset, $T=\{T_{1},\ldots,T_{K_{t}}\}$ is the true labeling of the dataset $D$ , and $S=\{S_{1},\ldots,S_{K_{s}}\}$ is the labeling obtained by analyzing the dataset $D$ , where $n$ is the number of data points, and $K_{t}$ and $K_{s}$ are the numbers of clusters of $T$ and $S$ .

CA is used to compare the label obtained by analyzing the dataset and the true label of the dataset. It can be defined as [29]:

$\displaystyle CA(T,S)=\frac{1}{n}\sum_{i=1}^{K_{s}}|S_{i}\cap\textit{mode}(S_{i},T)|$ (12)

where $\textit{mode}(S_{i},T)=\textit{argmax}_{T_{j}\in T}|S_{i}\bigcap T_{j}|$ .
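Equation (12) can be evaluated with a short Python sketch. The function name and the toy labels are hypothetical; each obtained cluster is matched to the true class with which it shares the most points.

```python
from collections import Counter

def classification_accuracy(true_labels, pred_labels):
    """CA(T, S): each obtained cluster S_i contributes its overlap with its majority true class."""
    n = len(true_labels)
    by_cluster = {}
    for t, s in zip(true_labels, pred_labels):
        by_cluster.setdefault(s, Counter())[t] += 1
    # max over T_j of |S_i ∩ T_j| equals |S_i ∩ mode(S_i, T)|
    return sum(max(counts.values()) for counts in by_cluster.values()) / n

# Hypothetical example: one of six points is placed in the wrong cluster.
ca = classification_accuracy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

Here the first obtained cluster contributes 2 points and the second contributes 3, so CA is 5/6.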

Mutual information (MI) can describe the shared information of a pair of clusters, while normalized mutual information (NMI) uses entropy as the denominator to adjust the MI value between 0 and 1. It is usually used as an external validity index [37], which is defined as follows:

$\displaystyle\textit{NMI}(T,S)=\frac{\sum_{i=1}^{K_{s}}\sum_{j=1}^{K_{t}}|S_{i}\cap T_{j}|\log\left(\frac{n|S_{i}\cap T_{j}|}{|S_{i}||T_{j}|}\right)}{\sqrt{\left(\sum_{i=1}^{K_{s}}|S_{i}|\log\frac{|S_{i}|}{n}\right)\left(\sum_{j=1}^{K_{t}}|T_{j}|\log\frac{|T_{j}|}{n}\right)}}$ (13)

where $|S_{i}|$ is the number of data points in $S_{i}$ .
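A direct Python transcription of Eq. (13) is sketched below. The function name and the toy labels are hypothetical; empty intersections are skipped (the usual convention that 0 log 0 = 0), and both labelings are assumed to contain at least two clusters so that the denominator is non-zero.

```python
import math
from collections import Counter

def nmi(true_labels, pred_labels):
    """Normalized mutual information with the geometric-mean entropy normalization."""
    n = len(true_labels)
    joint = Counter(zip(pred_labels, true_labels))   # |S_i ∩ T_j|, non-empty pairs only
    size_s = Counter(pred_labels)                    # |S_i|
    size_t = Counter(true_labels)                    # |T_j|
    numerator = sum(c * math.log(n * c / (size_s[s] * size_t[t]))
                    for (s, t), c in joint.items())
    sum_s = sum(c * math.log(c / n) for c in size_s.values())
    sum_t = sum(c * math.log(c / n) for c in size_t.values())
    # both sums are negative, so their product is positive
    return numerator / math.sqrt(sum_s * sum_t)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # no shared information
```

The two toy cases reproduce the end points of the range: relabeled but identical partitions give 1, and partitions that share no information give 0.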

The adjusted Rand index (ARI) measures the degree of agreement between two partitions of the data. Its value range is $[-1,1]$ . The larger the value is, the more consistent the clustering result is with the real situation [27].

Let $a$ be the number of pairs of data points that are in the same cluster in both $T$ and $S$, and let $b$ be the number of pairs of data points that are in different clusters in both $T$ and $S$. Then ARI can be defined as:

$\displaystyle\textit{ARI}=\frac{RI-E[RI]}{\max(RI)-E[RI]}$ (14)

where the Rand index (RI) can be expressed as follows:

$\displaystyle RI=\frac{a+b}{C_{2}^{n_{\textit{sample}}}}$ (15)

$C_{2}^{n_{\textit{sample}}}$ represents the number of pairs that can be formed from the data points in the dataset, and the range of RI is [0, 1]. The larger RI is, the more accurate the clustering and the higher the purity of each cluster. ARI is derived from RI by correcting for chance agreement and therefore has higher discrimination [27].
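Equations (14) and (15) can be illustrated with the following Python sketch. RI is computed by direct pair counting; for ARI, the expected and maximum values are taken from the commonly used contingency-table (permutation model) form, which is an assumption here since the text does not spell out how $E[RI]$ is obtained. The function names and toy labels are hypothetical, and the degenerate case in which the denominator is zero is not handled.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(true_labels, pred_labels):
    """RI = (a + b) / C(n, 2): fraction of pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum((true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
                for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(true_labels, pred_labels):
    """Chance-corrected agreement in pair-count form (permutation model, assumption)."""
    n = len(true_labels)
    same_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    same_t = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_s = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = same_t * same_s / comb(n, 2)
    maximum = (same_t + same_s) / 2
    return (same_both - expected) / (maximum - expected)

ri = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
ari = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

For the first pair of labelings RI is 1/3 while ARI is negative (-0.5), which shows how the chance correction separates a poor partition from a merely mediocre one.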

4.4 Experimental results

The final visualization results of DCE-IVI for the synthetic datasets are shown in Fig. 4, the visualizations of the clustering ensemble results from various methods (CSPA, HGPA, MCLA, and DSME) for the Aggregation and Jain datasets are shown in Figs 5 and 6, and the experimental results are shown in Tables 5–10. In these tables, DCE-IVI represents the method proposed in this paper. The three link-based clustering ensemble methods WCT, WTQ and CSM, the three clustering ensemble methods CSPA, HGPA and MCLA, and the selective clustering ensemble method DSME were also tested on the cited datasets and compared with the clustering ensemble method proposed in this paper.

Figure 4.

The final visualization results of synthetic datasets.

Figure 5.

The visualization results from various methods for Aggregation.

Figure 6.

The visualization results from various methods for Jain.

For each clustering task used in the experiment, their cluster ranking is not unique, but the use of cluster quality index is unique, which needs to be obtained through preprocessing. For the clustering ensemble methods used, the parameters were set as follows:

The DBSCAN algorithm and the OPTICS algorithm were run twice, respectively, and each algorithm was run several times to obtain the corresponding optimal value.

For the DCE-IVI method proposed in this paper, the parameters of each base clustering algorithm were set to the values that gave its best result, and the number of nearest neighbors of the data points was set to 3.

For WCT, WTQ, and CSM, the maximum number of iterations was 100, and the number of basic partitions was 10. The attenuation factor parameter DC was set to 0.9.

For CSPA, HGPA and MCLA, the maximum number of iterations was 100, and the number of basic partitions was also set to 100.

For DSME, the selection threshold t was 0.5, the consensus function was the hierarchical clustering algorithm, and the number of basic partitions was set to 50.

Table 5

CA clustering quality for 10 datasets, the highest quality of clustering results in each row is highlighted in bold

Datasets	Clustering ensemble algorithms
	DCE-IVI	WCT	WTQ	CSM	CSPA	HGPA	MCLA	DSME
Aggregation	0.9987	0.91	0.87	0.91	0.82	0.87	0.84	0.79
Flame	0.9417	0.96	0.96	0.95	0.83	0.86	0.84	0.78
Jain	1.00	0.79	0.89	0.78	0.76	0.74	0.77	0.88
R15	0.9950	1.00	1.00	0.99	1.00	1.00	1.00	0.74
Spiral	1.00	0.40	0.41	0.38	0.38	0.60	0.44	0.38
Glass	0.5187	0.60	0.60	0.61	0.61	0.61	0.58	0.45
Iris	0.94	0.94	0.95	0.91	0.96	0.97	0.96	0.86
Ionosphere	0.9145	0.68	0.71	0.66	0.66	0.68	0.71	0.63
Segmentation	0.5933	0.69	0.63	0.68	0.69	0.67	0.68	0.48
WDBC	0.8822	0.73	0.78	0.71	0.76	0.82	0.81	0.80

Table 6

NMI clustering quality for 10 datasets, the highest quality of clustering results in each row is highlighted in bold

Datasets	Clustering ensemble algorithms
	DCE-IVI	WCT	WTQ	CSM	CSPA	HGPA	MCLA	DSME
Aggregation	0.9957	0.83	0.79	0.83	0.69	0.71	0.71	0.75
Flame	0.7864	0.83	0.80	0.79	0.45	0.46	0.45	0.24
Jain	1.00	0.28	0.57	0.36	0.47	0.48	0.47	0.55
R15	0.9914	1.00	1.00	0.99	0.93	0.93	0.93	0.86
Spiral	1.00	0.03	0.02	0.01	0.56	0.56	0.55	0.02
Glass	0.4303	0.33	0.32	0.33	0.73	0.72	0.75	0.35
Iris	0.8244	0.85	0.86	0.81	0.64	0.65	0.65	0.77
Ionosphere	0.5732	0.13	0.13	0.11	0.49	0.43	0.48	0.07
Segmentation	0.6691	0.60	0.61	0.57	0.68	0.68	0.67	0.47
WDBC	0.4963	0.29	0.37	0.27	0.45	0.47	0.46	0.38

Table 7

ARI clustering quality for 10 datasets, the highest quality of clustering results in each row is highlighted in bold

Datasets	Clustering ensemble algorithms
	DCE-IVI	WCT	WTQ	CSM	CSPA	HGPA	MCLA	DSME
Aggregation	0.9978	0.67	0.59	0.67	0.54	0.64	0.56	0.69
Flame	0.8573	0.87	0.86	0.82	0.4	0.52	0.47	0.30
Jain	1.00	0.23	0.57	0.15	0.27	0.17	0.28	0.58
R15	0.9892	1.00	1.00	0.99	1.00	1.00	1.00	0.68
Spiral	1.00	0.02	0.02	0.01	0.01	0.23	0.06	0.01
Glass	0.2691	0.18	0.19	0.18	0.18	0.19	0.17	0.18
Iris	0.8343	0.83	0.86	0.78	0.90	0.92	0.89	0.69
Ionosphere	0.6820	0.12	0.17	0.07	0.11	0.13	0.17	0.06
Segmentation	0.4755	0.54	0.47	0.53	0.53	0.51	0.52	0.30
WDBC	0.5775	0.21	0.32	0.18	0.27	0.42	0.39	0.37

The results of the CA test are shown in Table 5. The first column lists the datasets, the second column gives the CA of the method proposed in this paper, and the remaining columns give the CA of the other clustering ensemble methods. For the datasets Aggregation, Jain, Spiral, Ionosphere, and WDBC, the proposed method performs better than all other methods. For the Spiral dataset in particular, it reaches 1.00, whereas the best competing method reaches 0.60. For the other datasets, the CA of the proposed method is within 0.10 of the best CA of the other methods. The clustering structure of some datasets, especially the high-dimensional ones, is fairly complex, and the selected base clustering algorithms themselves cluster datasets of arbitrary shape well. It can thus be observed that our algorithm has a strong analysis and processing ability for complex datasets.

The results of the NMI test are shown in Table 6. The proposed method behaves almost the same as in the CA test. For the dataset Glass, DCE-IVI is better than the three improved link-based clustering ensemble algorithms (WCT, WTQ, CSM) and DSME, but much worse than the three Strehl’s algorithms (CSPA, HGPA, MCLA). For the dataset Ionosphere, the method in this paper has a better clustering ensemble effect than all other methods (0.5732 versus at most 0.49).

The results of the ARI test are provided in Table 7. Generally speaking, for the same clustering result, the value produced by the ARI test is smaller than the value produced by the CA test, but both have the same upper limit of 1.00. From this table, we can see that DCE-IVI has the highest value on six datasets. Particularly for the datasets Jain, Spiral, Ionosphere, and WDBC, the ARI of the proposed method is much higher than that of the other methods. On the remaining datasets except Iris, the ARI of the proposed method is similar to that of the other methods. For Iris, the value of the proposed method (0.8343) is noticeably lower than that of the clustering ensemble method with the highest ARI (0.92).

Table 8

CA clustering quality for 2 large-scale datasets, the highest quality of clustering results in each row is highlighted in bold

Datasets	Clustering ensemble algorithms
	DCE-IVI	CSPA	HGPA	MCLA
Pendigits	0.7376	0.7331	0.3190	0.6632
Shill Bidding	0.8722	0.5053	0.7513	0.5093

Table 9

NMI clustering quality for 2 large-scale datasets, the highest quality of clustering results in each row is highlighted in bold

Datasets	Clustering ensemble algorithms
	DCE-IVI	CSPA	HGPA	MCLA
Pendigits	0.7436	0.6679	0.2345	0.6656
Shill Bidding	0.0079	0.0003	0.0063	0.0009

Table 10

ARI clustering quality for 2 large-scale datasets, the highest quality of clustering results in each row is highlighted in bold

Datasets	Clustering ensemble algorithms
	DCE-IVI	CSPA	HGPA	MCLA
Pendigits	0.5902	0.5579	0.1548	0.5346
Shill Bidding	$-0.0276$	0.0001	0.0462	0.0003

Tables 8–10 show the results of the algorithm comparison test on two large-scale real datasets. CA, NMI, and ARI were taken as evaluation indexes, and DCE-IVI was compared with the three improved Strehl’s algorithms. DCE-IVI has the best analysis effect on both the Pendigits and Shill Bidding datasets when CA and NMI are used for evaluation. When ARI is used, DCE-IVI still has the best analysis effect on the Pendigits dataset, but it has a poor effect on the Shill Bidding dataset. Note that the value range of ARI is $[-1,1]$ and that it is obtained by analyzing pairs of data points. Observing the real distribution of the data in the Shill Bidding dataset shows that the distribution density of the data points is fairly uniform, so the density-based partition obtained in the experiment is poor, which lowers the value obtained by the ARI analysis, possibly below 0.

Based on the test results in Tables 5–10, we can observe that the DCE-IVI algorithm proposed in this paper has good clustering ensemble results for two-dimensional datasets and multi-dimensional datasets, and it can also obtain outstanding analysis results for complex datasets.

4.5 Experiment on business emission data

In order to verify the effectiveness of DCE-IVI, this paper also analyzes the real business emission data. These data are obtained from six different collection points of two businesses in a certain place, and the data of four months are collected in total. Table 11 is a detailed description of the business emission datasets.

Table 11
Description of the business emission datasets

Datasets	Objects	Dimensions	Datasets	Objects	Dimensions
1-1649	2077	8	3-1649	2976	8
1-1650	2078	8	3-1650	2976	8
1-1651	2078	8	3-1651	2977	8
1-1652	2078	8	3-1652	2976	8
1-2196	2503	8	3-2196	2649	8
1-2197	2503	8	3-2197	2649	8
2-1649	1278	8	4-1649	2688	8
2-1650	1279	8	4-1650	2687	8
2-1651	1278	8	4-1651	2687	8
2-1652	1279	8	4-1652	2687	8
2-2196	1655	8	4-2196	1032	8
2-2197	1655	8	4-2197	1032	8

In the above datasets, the data can be divided into four types: high production and low pollution control, high production and high pollution control, low production and low pollution control, and low production and high pollution control. The experiment is needed to analyze and judge the operating status and abnormal conditions of the pollutant discharge businesses, so that the relevant unit can effectively monitor the pollutant discharge situation and identify the substandard businesses that require particular attention.

DCE-IVI was used to analyze these datasets, and the experimental results were returned to the businesses. Manual analysis of the datasets indicates that the partition results obtained by our experiments are of high quality.

5. Conclusion

The basic clustering algorithms in a clustering ensemble have different division effects on datasets with different structures. However, many clustering ensemble algorithms do not consider this point and normally only generate different basic clustering members. In addition, to improve clustering quality, clustering ensemble algorithms based on the co-association matrix usually add some lost information to the matrix to describe the original data structure more effectively, but fail to consider the effect of outliers. To address these problems, this paper proposed a density-based clustering ensemble algorithm based on the co-association matrix and an internal validity index. The selected density-based clustering algorithms can effectively cluster datasets with arbitrary shapes. According to the VAT analysis, some negative evidence was present, making the division results deviate from the true partition. As the similarity value of negative evidence in the co-association matrix was low, we removed the negative evidence to achieve a better division. Agglomerative hierarchical clustering was then performed on the obtained matrices to get the corresponding division results. Finally, based on the compactness within clusters and the correlation between clusters, we designed an internal validity index to select the best result.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61873324, the Natural Science Foundation of Shandong Province under Grants No. ZR2019MF040, No. ZR2020LZH004 and No. ZR2020LZH006, and the Higher Educational Science and Technology Program of Jinan City under Grant No. 2020GXRC057.

References

1. Alqurashi and Wang, Clustering ensemble method, International Journal of Machine Learning and Cybernetics 10(6) (2019), 1227–1246.
2. Ankerst, Breunig M.M., Kriegel H.-P. and Sander, OPTICS: Ordering points to identify the clustering structure, ACM SIGMOD Record 28(2) (1999), 49–60.
3. Bai, Liang and Cao, A multiple k-means clustering ensemble algorithm to find nonlinearly separable clusters, Information Fusion 61 (2020), 36–47.
4. Bolón-Canedo and Alonso-Betanzos, Ensembles for feature selection: A review and future trends, Information Fusion 52 (2019), 1–12.
5. Brown, Ensemble learning, Encyclopedia of Machine Learning 312 (2010), 15–19.
6. Chang and Yeung D.-Y., Robust path-based spectral clustering, Pattern Recognition 41(1) (2008), 191–203.
7. Chen M.-S., Huang, Wang C.-D., Huang and Lai J.-H., Relaxed multi-view clustering in latent embedding space, Information Fusion 68 (2021), 8–21.
8. Cheng, Zhu, Huang, Yang and Wu, Natural neighbor-based clustering algorithm with local representatives, Knowledge-Based Systems 123 (2017), 238–253.
9. Dua and Graff, UCI machine learning repository, 2017.
10. Ester, Kriegel H.-P., Sander and Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, Vol. 96, pages 226–231, 1996.
11. Fischer and Buhmann, Bagging for path-based clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 25(11) (2003), 1411–1415.
12. Fred, Finding consistent clusters in data partitions, in: International Workshop on Multiple Classifier Systems, pages 309–318, Springer, 2001.
13. Fred and Jain A.K., Evidence accumulation clustering based on the k-means algorithm, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 442–451, Springer, 2002.
14. Fred A.L. and Jain A.K., Data clustering using evidence accumulation, in: Object recognition supported by user interaction for service robots, Vol. 4, pages 276–280, IEEE, 2002.
15. Fred A.L.N. and Jain A.K., Combining multiple clusterings using evidence accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6) (2005), 835–850.
16. and Medico, Flame, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinformatics 8(1) (2007), 1–15.
17. Gionis, Mannila and Tsaparas, Clustering aggregation, ACM Transactions on Knowledge Discovery from Data (TKDD) 1(1) (2007), 4–es.
18. Huang, Wang C.-D., J.-S., Lai J.-H. and Kwoh C.-K., Ultra-scalable spectral clustering and ensemble clustering, IEEE Transactions on Knowledge and Data Engineering 32(6) (2019), 1212–1226.
19. Iam-On, Boongoen, Garrett and Price, A link-based approach to the cluster ensemble problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(12) (2011), 2396–2409.
20. Jain A.K. and Law M.H., Data clustering: A user’s dilemma, in: International Conference on Pattern Recognition and Machine Intelligence, pages 1–10, Springer, 2005.
21. Qian, Wang, Dang and Jing, Clustering ensemble based on sample’s stability, Artificial Intelligence 273 (2019), 37–55.
22. Qian, Wang, Dang and Liu, Cluster’s quality evaluation and selective clustering ensemble, ACM Transactions on Knowledge Discovery from Data (TKDD) 12(5) (2018), 1–27.
23. Qian, Wang and Liang, Multigranulation information fusion: A Dempster-Shafer evidence theory-based clustering ensemble method, Information Sciences 378 (2017), 389–409.
24. and Wang, Research on cluster ensembles methods based on hierarchical clustering, Computer Engineering and Applications (27) (2010), 124–127.
25. Liu, Zhang, Wang and Zhao, An improved path-based clustering algorithm, Knowledge-Based Systems 163 (2019), 69–81.
26. Liu and Dou, Performance evaluation index of genetic algorithm, Cooperative Economy and Technology (8) (2012), 116–116.
27. Mimaroglu and Aksehirli, Diclens: Divisive clustering ensemble with automatic cluster number, IEEE/ACM Transactions on Computational Biology and Bioinformatics 9(2) (2011), 408–420.
28. Nguyen, Armisen, Sánchez-Hernández, Casabayó and Agell, An OWA-based hierarchical clustering approach to understanding users’ lifestyles, Knowledge-Based Systems 190 (2020), 105308.
29. Nguyen and Caruana, Consensus clusterings, in: Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 607–612, IEEE, 2007.
30. Pruitt R.C. and James, Classification algorithms, Technometrics 29(400) (1986), 499–499.
31. Rathore, Ghafoori, Bezdek J.C., Palaniswami and Leckie, Approximating Dunn’s cluster validity indices for partitions of big data, IEEE Transactions on Cybernetics 49(5) (2019), 1629–1641.
32. Ren, Domeniconi, Zhang and Yu, Weighted-object ensemble clustering: Methods and analysis, Knowledge and Information Systems 51(2) (2017), 661–689.
33. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.
34. Sandes N.C. and Coelho A.L., Clustering ensembles: A hedonic game theoretical approach, Pattern Recognition 81 (2018), 95–111.
35. Shi and Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8) (2000), 888–905.
36. Shi, Cao, Chen and Han, Fast and effective active clustering ensemble based on density peak, IEEE Transactions on Neural Networks and Learning Systems pp(99) (2020), 1–15.
37. Strehl and Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3(3) (2002), 583–617.
38. Topchy, Jain A.K. and Punch, Combining multiple weak clusterings, in: Third IEEE International Conference on Data Mining, pages 331–338, IEEE, 2003.
39. Veenman C.J., Reinders M.J.T. and Backer, A maximum variance cluster algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(9) (2002), 1273–1280.
40. and Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2(2) (2015), 165–193.
41. Yang, Peng, Zhu and Nie, Co-clustering ensemble based on bilateral k-means algorithm, IEEE Access 8 (2020), 51285–51294.
42. Zhi-Hua Zhou, W.T., Clusterer ensemble, Knowledge-Based Systems 19(1) (2006), 77–83.
43. Zhong, Malinen, Miao and Fränti, A fast minimum spanning tree algorithm based on k-means, Information Sciences 295 (2015), 1–17.
44. Zhong, Yue, Zhang and Lei, A clustering ensemble: Two-level-refined co-association matrix with path-based transformation, Pattern Recognition 48(8) (2015), 2699–2709.
