Abstract
Because no single clustering algorithm can efficiently partition datasets with arbitrary shapes, clustering ensemble was proposed to integrate multiple clustering results consistently and obtain a better division. Most ensemble research applies a single algorithm with different parameters. This is easy to integrate, but it can hardly divide complex datasets. Other available methods integrate different algorithms, which divide datasets from different aspects but fail to take outliers into account, and this has a negative effect on the partition results. To solve these problems, we cluster datasets with three different density-based algorithms. The innovations of this paper are as follows: (1) by setting dynamic thresholds, lower frequency evidence in the co-association matrix is gradually deleted to obtain multiple reconstructed matrices; (2) these reconstructed matrices are analyzed by hierarchical clustering to obtain basic clustering results; (3) an internal validity index is designed from the compactness within clusters and the correlation between clusters, and is used to select the final clustering result. With these innovations, the clustering effect is significantly improved. Finally, a series of experiments is designed, and the results verify the improvement and effectiveness of the proposed technique (DCE-IVI).
Introduction
Clustering analysis is a highly important unsupervised machine learning strategy. Its main task is to divide a set of data objects into different classes so that objects of the same class are similar to each other, while objects of different classes differ clearly [8]. As a supervised learning method, classification requires clearly known information on each class and that all data objects have a corresponding class [30]. Unlike classification, clustering analysis divides the dataset based only on the relationships between data objects [25]. In short, it groups similar objects together without considering a specific category; that is, no class label information is used.
Although there are a large number of clustering algorithms, there is no single algorithm that can efficiently handle all types of data distribution. Since each clustering algorithm has its own analysis strategy and performance standards, different clustering algorithms or parameters may produce different clustering results with significant deviations [1].
Ensemble learning is based on the combination of multiple different models to solve specific problems [4]. Its basic principle is based on the idea of using different techniques rather than a single method to establish a set of hypotheses and then combining them in order to improve results [5]. The use of ensemble learning in classification has become increasingly effective in recent years.
In the absence of supervision information, it is difficult to determine which partition structure is most suitable for the actual distribution. Therefore, selecting an appropriate algorithm is a challenging task. To solve this issue, many researchers have combined clustering analysis algorithms with ensemble learning to perform "classification", and then consistently integrated the clustering results produced by several clustering algorithms. This process is called clustering ensemble.
Generally speaking, clustering ensemble can be defined as the process of combining multiple clustering results of a dataset. By merging the results of multiple clustering algorithms, higher quality and more robust clustering results can be obtained [24]. In general, clustering ensemble is divided into two stages: the generation of cluster members and consistency integration of the cluster members [34]. The main task of the first stage is to select appropriate clustering algorithms to generate multiple different cluster members. Since the clustering label has no physical meaning, the same label divided by different algorithms may represent different clusters. Therefore, it is more complicated to integrate the cluster members consistently. Different consistency integration schemes have a significant impact on the accuracy of clustering results [15].
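The label-correspondence issue mentioned above can be illustrated with a minimal sketch. The two label vectors below are hypothetical cluster members, not taken from the experiments; the helper `co_membership` is an illustrative name. The point is that label values cannot be compared directly, whereas the set of point pairs that share a cluster does not depend on how the clusters are named.

```python
from itertools import combinations


def co_membership(labels):
    """Return the set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


# Two hypothetical cluster members over the same five points.
# The label values differ, but the grouping of the points is identical.
member_a = [0, 0, 1, 1, 2]
member_b = [2, 2, 0, 0, 1]

labels_equal = member_a == member_b
same_grouping = co_membership(member_a) == co_membership(member_b)
print(labels_equal, same_grouping)
```

Pairwise co-membership of this kind is what consensus schemes based on a co-association matrix rely on, which is one way to avoid matching labels between members explicitly.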
Most scholars currently involved in clustering ensemble research tend to focus on in-depth research and innovation, but sometimes overlook relatively simple problems. The common problems are as follows:
(1) Many scholars employ the k-means algorithm as the basic partition generator in the stage of generating cluster members [24, 13, 14]. However, the algorithm is not suitable for clustering non-spherical datasets and cannot process datasets with significant density differences [18]. Therefore, many clustering results are poor and have a negative impact on consistency integration.
(2) In the first stage, most methods run the same algorithm on the dataset several times to obtain diverse basic clustering results. However, as Topchy et al. pointed out in [38], given a set of weak clustering algorithms with sufficient differences, consistency integration can produce efficient division results. To obtain efficient ensemble results, it is therefore necessary to use as many different clustering algorithms as possible, since each clustering algorithm reveals different aspects of the dataset. This will provide a superior final ensemble result.
(3) Whether at the stage of generating cluster members or consistency integration of cluster members, it is easy to ignore the influence of outliers on the final results. For example, the selected basic partition generator is insensitive to outliers. In addition, many scholars choose to generate a co-association matrix and then add information that describes the cluster structure to refine it [42]. However, in a practical sense, this method may cause some data points to be recognized as noise, which has a negative impact on the final clustering during consistency integration [7].
Cluster member generation and consistency integration methods have a significant influence on the accuracy of the final clustering results. In view of the above problems, this paper proposes the following solutions:
(1) Three density-based clustering algorithms (DBSCAN [10], DPC [36, 33], and OPTICS [2]), which can handle a variety of data structures, are employed to perform clustering analysis on datasets and generate different cluster members.
(2) A co-association matrix is generated from the cluster members. According to certain rules, the evidence with a lower frequency in the co-association matrix is deleted. The co-association matrix is thus transformed into multiple reconstruction matrices, and hierarchical clustering analysis is performed on these to obtain multiple basic clustering results.
(3) By analyzing the compactness of the clusters and the correlation between clusters in the basic clustering results, an internal validity index is set to select among the basic clustering results and obtain the optimal clustering result.
The rest of this article is organized as follows. Section 2 reviews related work on the clustering ensemble problem and provides a specific description of the innovative methods proposed in this paper. Section 3 introduces the clustering ensemble method based on density-based clustering algorithms (DCE-IVI). The experimental results are given in Section 4. Finally, Section 5 summarizes the full text.
The clustering ensemble problem was first introduced by Strehl and Ghosh in 2002 [37]. They described it as combining multiple clustering results of a set of objects without accessing the original features. Gionis et al. also described the problem in 2007: given a set of clustering results, the goal of clustering ensemble is to find a clustering that is as consistent as possible with all input clustering results [17]. It can thus be observed that clustering ensemble works to generate the best result by combining and dividing multiple clustering results using certain rules.
Background of cluster ensemble method
Methods of generating cluster members
In clustering ensemble, the method of generating cluster members, such as the selection of clustering algorithms and the design of parameters, has a significant impact on the quality of the final clustering result. The three main generation methods currently available are described as follows:
The first method is to cluster the dataset with the same clustering algorithm but set different parameters for each clustering iteration. In 2002, Fred et al. adopted the k-means algorithm with variable k [13]. Using this method, a reasonable k was initially produced as
The second method is to divide the dataset into multiple different data subsets according to certain rules and use the same clustering algorithm to process these subsets. Using this technique, Fischer and Buhmann [11] applied the bootstrap method to obtain several subsets.
The third method is to use different types of clustering algorithms to cluster the same dataset. In 2007, Gionis et al. used single link, complete link, average link, Ward's clustering, and k-means to generate basic clustering results with this technique [17].
Methods of consistency integration for cluster members
After generating clustering members, consistency integration is a key step that determines the quality and accuracy of the final result. Numerous methods have been proposed to solve this problem. The several main methods are outlined as follows:
The first technique is a voting method that conducts voting according to the division of data points by cluster members and calculates the proportion of data points divided into each cluster. If the proportion is bigger than the set threshold, it will be divided into this cluster. In 2001, Fred used partition information to calculate the number of times a pair of data points were divided into the same cluster, which was used as the vote of whether two data points belong to the same cluster [12].
The second method is to transform the clustering ensemble problem into the minimum cut problem of a hypergraph, and use a clustering algorithm based on graph theory to perform clustering ensemble. In 2002, Strehl et al. proposed the cluster-based similarity partitioning algorithm (CSPA) [37]. CSPA first employs the calculated co-association matrix as the similarity matrix, and then uses data points as the vertices and the similarity values as the edge weights to construct a hypergraph. Finally, the graph partitioning algorithm METIS is used to divide the dataset and obtain the clustering result.
The third method is the evidence accumulation (EA) method proposed by Fred in 2002. Considering each generated cluster member as the independent evidence, the number of times a pair of data points is divided into the same cluster is calculated to obtain the co-association matrix [14], and then the final clustering result is obtained through hierarchical clustering.
The fourth method is to weight the generated cluster members to achieve clustering ensemble. In 2006, Zhou et al. used the normalized mutual information (NMI) between pairs of clusterings to calculate the weight of each clustering, and selected the clusterings to combine by excluding those whose mutual information weight was less than the threshold [42].
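As a rough illustration of how pairwise NMI can drive the selection of cluster members, the following sketch computes NMI between label vectors and averages it per member. It is not the weighting scheme of [42]: the normalization by the geometric mean of the entropies, the helper names `nmi` and `mean_nmi_per_member`, the example members, and the cut-off value are all assumptions made for the example.

```python
from collections import Counter
from math import log, sqrt


def nmi(a, b):
    """NMI of two label vectors, normalized by sqrt(H(a) * H(b))."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    mutual_info = sum(
        (n_xy / n) * log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in count_ab.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if entropy_a == 0.0 or entropy_b == 0.0:
        # A single-cluster labeling carries no information; treat two such
        # labelings as identical and anything else as unrelated.
        return 1.0 if entropy_a == entropy_b else 0.0
    return mutual_info / sqrt(entropy_a * entropy_b)


def mean_nmi_per_member(members):
    """Average NMI of each member against all other members (needs >= 2 members)."""
    scores = []
    for i, a in enumerate(members):
        others = [nmi(a, b) for j, b in enumerate(members) if j != i]
        scores.append(sum(others) / len(others))
    return scores


members = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first member, different label names
    [0, 1, 0, 1],  # grouping unrelated to the first two
]
scores = mean_nmi_per_member(members)
threshold = 0.25  # hypothetical cut-off, chosen only for this example
kept = [m for m, s in zip(members, scores) if s >= threshold]
```

In this toy case the first two members agree with each other and are kept, while the third shares no information with them and falls below the cut-off.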
Existing clustering ensemble methods
In recent years, new clustering ensemble methods have been continuously proposed, with considerable improvement and innovation to the above methods.
In 2015, Caiming Zhong et al. proposed a new hypergraph transformation method [44]. Firstly, the co-association matrix is generated, and the co-association matrix is refined by calculating the probability. Then, the matrix is transformed into a path-based matrix, and spectral clustering is applied to the path-based matrix to generate the final clustering result. In 2016, Feijiang Li et al. proposed a clustering ensemble algorithm based on evidence theory [23]. The neighbors of each data are first located, and the label probabilities for each member are generated. After that, the probabilities of these labels are fused to produce the final result. Later, in 2019, Feijiang Li et al. proposed the concept of sample stability to determine the contribution of samples. By this method, the dataset is divided into cluster core and cluster halo, and the samples in the cluster core are used to determine the clear structures. According to the stability of the samples, the samples in the cluster halo are then gradually assigned to the clear structures [21]. In 2019, Huang Dong et al. proposed a U-SPEC algorithm that could effectively process large-scale datasets. By integrating multiple U-SPEC into a unified ensemble clustering framework, clustering analysis with high efficiency and high robustness was realized [18]. Liang Bai et al. also proposed a multiple k-means clustering ensemble algorithm in 2020 to locate nonlinearly separable clusters [3], which extracted local data labels from clustering members. This method not only inherits the scalability of k-means but also overcomes its limitation of only analyzing linear datasets.
Density-based clustering algorithms
Different clustering algorithms are sensitive to varying dataset structures, meaning that different clustering algorithms will produce different clustering results when dividing datasets. If only one clustering algorithm is used to partition the original dataset, the quality of the final clustering results cannot be guaranteed. This article uses three density-based clustering algorithms that can handle different dataset structures to generate basic cluster members, namely DBSCAN, DPC, and OPTICS algorithms.
DBSCAN (density-based spatial clustering of applications with noise) can cluster dense datasets of any shape [10]. It is highly sensitive to the input parameters
Hierarchical clustering method
A hierarchical clustering algorithm decomposes a given dataset hierarchically and organizes the data points into a clustering tree. There are two methods of hierarchical decomposition. One is condensed (agglomerative) hierarchical clustering, which is a bottom-up clustering method. In the beginning, each data point is regarded as a separate cluster, and then the two closest clusters are repeatedly merged until all points form one cluster or a termination condition is reached [28]. The other method is split (divisive) hierarchical clustering, which is also known as the top-down clustering method. It regards all data points in the dataset as one cluster at the beginning, and then gradually splits the cluster into smaller clusters until each data point is in a separate cluster or a termination condition is reached [35].
In the clustering ensemble method based on hierarchical clustering, the condensed hierarchical clustering method is usually used to analyze the co-association matrix. Assuming that
(1) Regard each data point as a cluster. The similarity between clusters is equal to the similarity of the corresponding data points;
(2) Find the two closest clusters and merge them;
(3) Calculate the similarity between the new cluster and the original clusters;
(4) Repeat steps 2 and 3 until the similarity between all clusters reaches a threshold, then stop clustering.
As described above, in the process of consistency integration for cluster members, most clustering ensemble methods adopt the condensed hierarchical clustering method. In addition, the condensed hierarchical clustering method based on similarity threshold avoids setting the final number of clusters. It is based on the similarity matrix formed by cluster members, which reflects the similarity measure between data points, and can be applied to the minimum spanning tree. The final clustering result can thus be realized by setting a threshold to cut off the weak connection. In the evidence accumulation (EA) method [14] proposed by Fred, the threshold t was set to 0.5, and hierarchical clustering analysis was performed on the similarity matrix to obtain the final partition result.
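The threshold-based variant described above can be sketched as follows, under two assumptions that the text does not fix: clusters are linked by their most similar pair of points (single link), and a connection is kept when its similarity is at least t. Under these assumptions, cutting the weak connections is equivalent to taking the connected components of the thresholded similarity graph. The function name `threshold_clusters` and the example matrix are illustrative only.

```python
def threshold_clusters(sim, t=0.5):
    """Label points by the connected components of the graph whose edges
    are the pairs with similarity >= t (single-link cut at threshold t)."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Find the representative of x, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= t:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Relabel the components as 0, 1, 2, ... in order of first appearance.
    component_ids = {}
    labels = []
    for i in range(n):
        root = find(i)
        labels.append(component_ids.setdefault(root, len(component_ids)))
    return labels


# Hypothetical similarity matrix for four points.
sim = [
    [1.0, 0.9, 0.2, 0.0],
    [0.9, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
labels = threshold_clusters(sim, t=0.5)
```

With t = 0.5 the link of weight 0.4 between the second and third points is cut, so two clusters remain; the number of clusters follows from the threshold and is never specified directly.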
Method overview
At present, most clustering ensemble methods for research and analysis are based on k-means. Using such methods, the similarity matrix is obtained according to the basic clustering result, and the information describing the cluster structure is added to the similarity matrix to refine it. However, these schemes fail to consider the outliers and the shape of the dataset, providing inadequate clustering results.
The overview of the proposed method DCE-IVI.
In this paper, we use three density-based algorithms that can handle different data structures, DBSCAN, DPC, and OPTICS, to cluster the dataset and obtain different basic clustering results. We then create a co-association matrix (CM) from the basic clustering results and zero the elements of the matrix that are smaller than each of a series of thresholds t, which yields multiple reconstructed matrices. Hierarchical clustering (HC) analysis is applied to each reconstructed matrix to obtain the corresponding clustering results. Using the internal validity index DCE-IVI, the optimal clustering result is then selected as the final result. The overall process is shown in Fig. 1.
Clustering results of DBSCAN, DPC and OPTICS algorithms on the Jain dataset.
The above three density-based clustering algorithms vary in sensitivity to the dataset’s density, structure, and noise points. By using these algorithms to partition the dataset respectively, we can obtain the different clustering results, as shown in Fig. 2, which help divide the dataset more efficiently.
The co-association matrix is a method proposed by Fred to measure the similarity between data points [14]. The underlying idea is that data points that truly belong to the same cluster are likely to be placed in the same cluster in different data partitions. By considering each cluster member as independent evidence, the number of times that each pair of data points is divided into the same cluster is calculated, and the co-association matrix is obtained.
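A minimal sketch of this evidence accumulation step is given below, assuming that each cluster member is a label vector over the same points and that each entry is the fraction of members that place the two points in the same cluster. The function name and the example partitions are illustrative. Noise labels (for example, the -1 that some DBSCAN implementations return) are treated here like any other label, which a real pipeline would probably want to handle separately.

```python
def co_association(partitions):
    """Build a co-association matrix from a list of label vectors.

    Entry [i][j] is the fraction of partitions in which points i and j
    carry the same cluster label.
    """
    m = len(partitions)
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same points")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


# Three hypothetical cluster members over four points.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
cm = co_association(partitions)
```

Points 0 and 1 are grouped together by all three members, so their entry is 1.0, while points 0 and 3 are never grouped together and receive 0.0.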
Assuming that
The co-association matrix transforms the results generated by cluster members into a matrix, which can be regarded as the similarity matrix between all points in the dataset [41]. Thus, the mathematical expression of each element in the co-association matrix (CM) is as follows:
Among them,
It can be seen from the analysis that some evidence in the co-association matrix may hinder the effective clustering of data points, so that data points are assigned to incorrect clusters or two well-separated clusters are connected into one. Such evidence can be considered to be related to outliers, and it has a negative impact on the final clustering result. Removing this negative evidence can therefore provide a more effective clustering result. It is difficult, however, to determine which evidence is negative. We can observe that negative evidence arises when a pair of data points belongs to the same cluster in some cluster member but to different clusters in the real partition. According to the expression of the co-association matrix (CM), the value at the corresponding position in CM for such a pair of data points is very small.
As illustrated by the above analysis, negative evidence is difficult to determine. To address this issue, Ren et al. proposed the concept of confusion [32]. The confusion of a pair of data points represents the uncertainty of their being placed in the same cluster. When the frequency between a pair of data points is 0.5, the degree of confusion is the largest. When the frequency is far less than 0.5, the corresponding evidence can be removed as negative evidence. Therefore, we use r as the step size in [0, 0.5] and gradually remove the evidence whose frequency is less than the current threshold, generating different reconstruction matrices (
The algorithm is largely represented by Algorithm 1 below. The original generated co-association matrix must be taken as the input, and hierarchical clustering on CMs is required to obtain the second-order basic clustering results B. Where, the condensed hierarchical clustering
(1) Regard each data point corresponding to each element of the co-association matrix as a cluster. The similarity between clusters is equal to the similarity of the corresponding data points;
(2) Find the two closest clusters and merge them;
(3) Recalculate the similarity between the new cluster and the original clusters;
(4) Repeat steps 2 and 3 until the similarity between all clusters reaches 0.5, then stop clustering.
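The removal of low-frequency evidence that precedes these steps can be sketched as follows. The sketch assumes that the thresholds are r, 2r, ... up to 0.5 (a threshold of 0 would remove nothing) and that entries strictly smaller than the current threshold are set to zero; the function name and the example matrix are illustrative, and this is not a reproduction of Algorithm 1.

```python
def reconstructed_matrices(cm, r=0.1, t_max=0.5):
    """Return {threshold: matrix}, where each matrix is a copy of cm with
    every entry smaller than the threshold set to zero."""
    steps = int(t_max / r + 1e-9)  # number of thresholds r, 2r, ..., <= t_max
    result = {}
    for k in range(1, steps + 1):
        t = round(k * r, 10)  # avoid accumulated floating-point drift
        result[t] = [[v if v >= t else 0.0 for v in row] for row in cm]
    return result


# Hypothetical co-association matrix for three points.
cm = [
    [1.0, 0.67, 0.33],
    [0.67, 1.0, 0.1],
    [0.33, 0.1, 1.0],
]
matrices = reconstructed_matrices(cm, r=0.1)
```

Each reconstructed matrix would then be passed to the hierarchical clustering steps above, giving one second-order basic clustering result per threshold.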
Once the clustering results B have been determined, a validity index is required to select the best clustering member as the final clustering result.
Minimax similarity
Suppose that
Let
The total similarity
Then, the minimax similarity is obtained as follows:
Among them,
Thus, to determine the similarity of data points
According to the definition, when two data points do not belong to the same cluster, the similarity between them is very small. However, if there are some abnormal noise values between two clusters, even if two data points are in different clusters, the similarity between them may be very large. In other words, the similarity measure based on path is very sensitive to noise points. To address this issue, Chang et al. proposed a robust minimax similarity method to eliminate the influence of noise on clustering results [6]:
In the above formula, the weight
where
In the process of clustering, the validity index can determine the optimal number of clusters and select the optimal partition [44]. These indicators can be divided into three categories: external validity index, internal validity index, and relative validity index [40]. The external validity index can be used when external information about the dataset is available: the degree of matching between the clustering partition and the external criteria is used to evaluate the performance of different clustering analysis algorithms. According to pre-defined evaluation criteria, the relative validity index tests different parameter settings of the clustering algorithm and finally selects the optimal parameters and clustering mode. The internal validity index is mainly based on the geometric structure information of the dataset, and evaluates the clustering division from the aspects of compactness and separation. As can be seen from the definition of clustering, the data points in the same cluster are closely distributed, while the data points in different clusters are scattered. Compactness describes the distance between data points in the same cluster, and separation describes the distance between data points in different clusters [26]. In this paper, no original dataset information is used; the final clustering result is obtained by using the internal validity index to select among the results.
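To make the notions of compactness and separation concrete, the following sketch computes a generic internal index as the ratio of the mean intra-cluster distance to the mean inter-cluster distance. This is an illustration only and is not the DCE-IVI index; the function name, the use of a plain distance matrix, and the one-dimensional example points are assumptions made for the example.

```python
def compactness_separation_ratio(dist, labels):
    """Mean intra-cluster distance divided by mean inter-cluster distance.

    Smaller values indicate tighter, better separated clusters.
    """
    intra, inter = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                intra.append(dist[i][j])
            else:
                inter.append(dist[i][j])
    if not intra or not inter:
        raise ValueError("need at least one intra-cluster and one inter-cluster pair")
    compactness = sum(intra) / len(intra)
    separation = sum(inter) / len(inter)
    return compactness / separation


# Hypothetical one-dimensional points forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

good = compactness_separation_ratio(dist, [0, 0, 1, 1])
bad = compactness_separation_ratio(dist, [0, 1, 0, 1])
best_labels = min([[0, 0, 1, 1], [0, 1, 0, 1]],
                  key=lambda lab: compactness_separation_ratio(dist, lab))
```

The partition that follows the two groups scores lower than the one that mixes them, so selecting the candidate with the smallest value picks the natural grouping in this toy case.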
Assuming that
Where
It can be seen from [37] that a cluster is a high-density region divided by some low-density regions. In order to obtain the optimal cluster, it is necessary to make the connectivity between different clusters low and the internal stability of the cluster strong; the smaller the internal validity index DCE-IVI, the better the clustering effect. Algorithm 2 describes the calculation process of DCE-IVI, in which
The final clustering ensemble result is obtained by selecting the second-order basic clustering result with the smallest DCE-IVI value, as shown in Algorithm 3. In addition, the role of domain knowledge in the entire process is shown in Table 1.
The role of domain knowledge in the entire process
The algorithm mainly includes several steps: generating basic clustering results, forming the co-association matrix CM, generating second-order basic clustering results based on the reconstructed matrices, and selecting the final clustering result through the internal validity index DCE-IVI. The time complexity of each step is analyzed as follows, where N is the number of data points:
Generating basis clustering results is realized by DBSCAN, DPC and OPTICS algorithms. Among them, the basic time complexity of DBSCAN and OPTICS algorithms is O (N * the time required to find points in eps field), the worst-case time complexity is
In the stage of calculating the co-association matrix through the basis clustering members, assuming that the number of basis cluster members is m, each basis partition has
When generating the second-order basis clustering results based on reconstruction matrices, firstly, the elements in these matrices that are less than the threshold value need to be zeroed, and the time complexity is
In the final stage of selecting the final clustering result through the internal validity index DCE-IVI, the two components of DCE-IVI are compactness and separation, which are calculated by the robustness path similarity. The time complexity of similarity based on robust path is
Through the above analysis, the time complexity of this algorithm is
Experiment analysis
Experimental dataset
In this paper, 10 datasets, including five two-dimensional synthetic datasets and five real multi-dimensional datasets, were employed for testing. The real distribution of data points in the five two-dimensional datasets is shown in Fig. 3, and a detailed description of these 10 datasets is provided in Table 2.
Two large-scale real datasets were also tested. A detailed description of these two datasets is provided in Table 3.
Detailed description of above 10 datasets
Detailed description of 2 large-scale real datasets
Synthetic datasets.
Specific information of datasets
Furthermore, in order to better understand certain data characteristics of the datasets, we collected specific information on the datasets used in the experiment and present it in Table 4. None of the datasets has missing values.
In order to compare the density-based clustering ensemble algorithm proposed in this paper with other techniques, two common methods were selected: link-based clustering ensemble (LCE) [19] and Strehl's algorithms [37]. A novel selective clustering ensemble method (DSME) [22] was also included.
Link-based clustering ensemble (LCE) improves the co-association matrix by locating the hidden information of the basic partitions. Three improved schemes, weighted connected-triple (WCT), weighted triple-quality (WTQ), and combined similarity measure (CSM), are then proposed. In [37], three improved clustering ensemble methods were also proposed: the cluster-based similarity partitioning algorithm (CSPA), the hypergraph partitioning algorithm (HGPA), and the meta-clustering algorithm (MCLA). The selective clustering ensemble method (DSME) considers the quality of each cluster and embeds it into a framework DS, which considers the difference between the result in the ensemble selection stage and the result in the ensemble integration stage [22].
Evaluation index
Three clustering evaluation indexes were employed to measure the clustering ensemble results: classification accuracy (CA) [27], normalized mutual information (NMI), and the adjusted Rand index (ARI).
Suppose
CA is used to compare the label obtained by analyzing the dataset and the true label of the dataset. It can be defined as [29]:
where
Mutual information (MI) can describe the shared information of a pair of clusters, while normalized mutual information (NMI) uses entropy as the denominator to adjust the MI value between 0 and 1. It is usually used as an external validity index [37], which is defined as follows:
where
ARI measures the coincidence degree of two data distributions. Its value range is
Let a represent the number of pairs of data points that are in the same cluster in both T and S, and b represent the number of pairs of data points that are in different clusters in both T and S, then ARI can be defined as:
Where RI can be expressed as follows:
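For reference, the standard pair-counting forms of RI and ARI can be sketched as follows. The sketch assumes that a and b count pairs of data points (pairs placed together in both labelings, and pairs separated in both), that RI = (a + b) / C(n, 2), and that ARI uses the usual chance-corrected form; it is not necessarily written exactly as in the definitions of this paper. The function name and the example label vectors are illustrative.

```python
from collections import Counter
from math import comb


def rand_indices(t_labels, s_labels):
    """Return (RI, ARI) for two labelings T and S of the same points."""
    n = len(t_labels)
    if n != len(s_labels) or n < 2:
        raise ValueError("need two labelings of the same n >= 2 points")
    total_pairs = comb(n, 2)
    # Pairs placed in the same cluster by both, by T, and by S.
    same_both = sum(comb(c, 2) for c in Counter(zip(t_labels, s_labels)).values())
    same_t = sum(comb(c, 2) for c in Counter(t_labels).values())
    same_s = sum(comb(c, 2) for c in Counter(s_labels).values())

    a = same_both
    b = total_pairs - same_t - same_s + same_both  # separated in both
    ri = (a + b) / total_pairs

    expected = same_t * same_s / total_pairs
    maximum = (same_t + same_s) / 2
    if maximum == expected:
        ari = 1.0  # degenerate case, e.g. both labelings put all points together
    else:
        ari = (same_both - expected) / (maximum - expected)
    return ri, ari


# Hypothetical true labels T and predicted labels S for four points.
ri, ari = rand_indices([0, 0, 1, 1], [0, 0, 1, 2])
```

Here a = 1 and b = 4 out of 6 pairs, so RI = 5/6, while the chance-corrected ARI is 4/7; the gap illustrates why ARI values tend to be lower than uncorrected agreement measures for the same result.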
The final visualization results of DCE-IVI for the synthetic datasets are shown in Fig. 4, the visualizations of the clustering ensemble results from the other methods (CSPA, HGPA, MCLA, and DSME) for the Aggregation and Jain datasets are shown in Figs. 5 and 6, and the experimental results are shown in Tables 5–10. In these tables, DCE-IVI represents the method proposed in this paper. The three link-based clustering ensemble methods WCT, WTQ, and CSM, the three clustering ensemble methods CSPA, HGPA, and MCLA, and the selective clustering ensemble method DSME were also tested on the cited datasets and compared with the clustering ensemble method proposed in this paper.
The final visualization results of synthetic datasets.
The visualization results from various methods for Aggregation.
The visualization results from various methods for Jain.
For each clustering task used in the experiment, their cluster ranking is not unique, but the use of cluster quality index is unique, which needs to be obtained through preprocessing. For the clustering ensemble methods used, the parameters were set as follows:
The DBSCAN and OPTICS algorithms were each run twice, and each algorithm was run several times to obtain the corresponding optimal value. For the DCE-IVI method proposed in this paper, the best-performing parameters of each base clustering algorithm were used, and the number of nearest neighbors of the data points was set to 3. For WCT, WTQ, and CSM, the maximum number of iterations was 100, and the number of basic partitions was 10. The attenuation factor parameter DC was set to 0.9. For CSPA, HGPA, and MCLA, the maximum number of iterations was 100, and the number of basic partitions was set to 100. For DSME, the selection threshold t was 0.5, the consensus function was the hierarchical clustering algorithm, and the number of basic partitions was set to 50.
CA clustering quality for 10 datasets, the highest quality of clustering results in each row is highlighted in bold
NMI clustering quality for 10 datasets, the highest quality of clustering results in each row is highlighted in bold
ARI clustering quality for 10 datasets, the highest quality of clustering results in each row is highlighted in bold
The results of the CA test are shown in Table 5. The first column lists the cited datasets, the second column gives the CA test quality of the method proposed in this paper, and the remaining columns give the CA test quality of the other clustering ensemble methods. It can be seen that, for the datasets Aggregation, Jain, Spiral, Ionosphere, and WDBC, the clustering ensemble method proposed in this paper has a better effect than the other methods. For the Spiral dataset in particular, the quality of the method reaches 1.0, which is much better than the other methods. For the remaining datasets, the CA test quality of the method in this paper is close to the best CA test quality of the other methods. The clustering structure of some datasets, especially high-dimensional datasets, is fairly complex, but the selected base clustering algorithms themselves have a good clustering effect on datasets with arbitrarily shaped distributions. It can thus be observed that our algorithm has a strong analysis and processing ability for complex datasets.
The results of the NMI test are shown in Table 6. The proposed method performs almost the same as in the CA test. For dataset Glass, DCE-IVI is better than three improved link-based clustering ensemble algorithms WCT, WTQ, CSM and DSME. However, for the three improved Strehl’s algorithms, the effect is much worse. For the dataset Ionosphere, the method in this paper has a much better clustering ensemble effect than other methods.
The results of the ARI test are provided in Table 7. Generally speaking, for the same clustering result, the value produced by the ARI test is smaller than the value produced by the CA test, but they have the same upper limit of 1.00. From this table, we can see that DCE-IVI has the highest value on six datasets. Particularly for the datasets Jain, Spiral, Ionosphere, and WDBC, the ARI test quality of the method proposed in this paper is much higher than that of the other methods. Except for Iris, the ARI test quality of the proposed method on the remaining datasets is similar to that of the other methods. For Iris, the value of the proposed method is significantly different from that of the clustering ensemble method with the highest ARI test quality.
CA clustering quality for 2 large-scale datasets, the highest quality of clustering results in each row is highlighted in bold
NMI clustering quality for 2 large-scale datasets, the highest quality of clustering results in each row is highlighted in bold
ARI clustering quality for 2 large-scale datasets, the highest quality of clustering results in each row is highlighted in bold
Tables 8–10 show the results of the algorithm comparison test on two large-scale real datasets. CA, NMI, and ARI were taken as evaluation indexes, and DCE-IVI was compared with the three improved Strehl's algorithms. DCE-IVI has the best analysis effect on both the Pendigits and Shill Bidding datasets when CA and NMI are used for evaluation. When ARI is used, DCE-IVI still has the best analysis effect on the Pendigits dataset, but it has a poor effect on the Shill Bidding dataset. It can be observed that the ARI value range is
Based on the test results in Tables 5–10, we can observe that the DCE-IVI algorithm proposed in this paper has good clustering ensemble results for two-dimensional datasets and multi-dimensional datasets, and it can also obtain outstanding analysis results for complex datasets.
To further verify the effectiveness of DCE-IVI, this paper also analyzes real business emission data. These data were obtained from six different collection points of two businesses in a certain place, and four months of data were collected in total. Table 11 gives a detailed description of the business emission datasets.
Description of the business emission datasets
In the above datasets, the data can be divided into four types: high production and low pollution control, high production and high pollution control, low production and low pollution control, and low production and high pollution control. The experiment is needed to analyze and judge the operating status and abnormal conditions of the pollutant discharge businesses, so that the relevant units can effectively monitor the pollutant discharge situation and identify the substandard businesses that require close attention.
DCE-IVI was used to analyze these datasets, and the experimental results were returned to the businesses. Manual analysis of the datasets indicates that the partition results obtained by our experiments are of high quality.
The basic clustering algorithms in a clustering ensemble have different division effects on datasets with different structures. However, many clustering ensemble algorithms do not consider this point and normally only generate different basic clustering members. In addition, to improve clustering quality, clustering ensemble algorithms based on the co-association matrix usually add some lost information to the matrix to describe the original data structure more effectively, but they fail to consider the effect of outliers. To address these problems, this paper proposed a density-based clustering ensemble algorithm based on the co-association matrix and an internal validity index. Density-based clustering algorithms were selected because they can effectively cluster datasets with arbitrary shapes. According to the VAT analysis, some negative evidence was present, which made the division results deviate from the true partition. As the similarity value of negative evidence in the co-association matrix was low, we removed the negative evidence to achieve a better division. Agglomerative hierarchical clustering was then performed on the resulting matrices to get the corresponding division results. Finally, based on the compactness within clusters and the correlation between clusters, we designed an internal validity index to select the best result.
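The pipeline summarized above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the DCE-IVI implementation. The base partitions are written by hand instead of being produced by three density-based algorithms, average linkage and a user-supplied cluster count `k` are assumed for the hierarchical step, and `placeholder_index` (mean within-cluster similarity minus mean between-cluster similarity) only stands in for the paper's internal validity index. All function names are hypothetical.

```python
from itertools import combinations


def co_association(partitions):
    """Entry (i, j) is the fraction of base partitions that put
    points i and j in the same cluster."""
    n, m = len(partitions[0]), len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
            for i in range(n)]


def prune(ca, threshold):
    """Delete low-frequency evidence: zero every entry below the threshold."""
    return [[v if v >= threshold else 0.0 for v in row] for row in ca]


def average_linkage(sim, k):
    """Agglomerative clustering on a similarity matrix until k clusters remain
    (average linkage is an assumption, not taken from the paper)."""
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid
    return clusters


def placeholder_index(ca, clusters):
    """Stand-in validity score: compactness within clusters minus
    average similarity between clusters (NOT the paper's index)."""
    within = [ca[i][j] for c in clusters for i, j in combinations(c, 2)]
    between = [ca[i][j] for a, b in combinations(clusters, 2)
               for i in a for j in b]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(within) - mean(between)


def ensemble(partitions, k, thresholds):
    """One candidate partition per threshold; keep the best-scoring one."""
    ca = co_association(partitions)
    candidates = [average_linkage(prune(ca, t), k) for t in thresholds]
    best = max(candidates, key=lambda c: placeholder_index(ca, c))
    labels = [0] * len(ca)
    for label, members in enumerate(best):
        for i in members:
            labels[i] = label
    return labels


# Three hand-written base partitions of six points (illustrative only).
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],  # point 2 is grouped with the second cluster once
]
print(ensemble(base, k=2, thresholds=[0.0, 0.5]))  # [0, 0, 0, 1, 1, 1]
```

In this toy run the single disagreement about point 2 contributes a co-association value of 1/3, which the threshold 0.5 removes before the hierarchical step; the candidate partitions are scored on the unpruned matrix so that results from different thresholds are compared on a common scale.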
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61873324, the Natural Science Foundation of Shandong Province under Grant Nos. ZR2019MF040, ZR2020LZH004, and ZR2020LZH006, and the Higher Educational Science and Technology Program of Jinan City under Grant No. 2020GXRC057.
