Abstract
Clustering ensemble refers to the problem of obtaining a final clustering of a dataset by combining multiple partitions computed by different clustering algorithms. The clustering ensemble has emerged as a prominent method for improving robustness of unsupervised classification solutions. This problem has been received an increasing attention in recent years but a little attention has been paid to weight the combined clusterings without access the original data. We address in this paper the problem of weighted clustering ensemble problem by defining an unsupervised method to compute the weight of each combined clustering without access the original data. The weight of each base clustering is computed using its quality and the quality of its neighbouring clusterings. The proposed method permits to estimate the right number of clusters of the final clustering before the combining step by exploiting the generated weights.
Introduction
Clustering algorithms classify the data into homogeneous groups called clusters such as the objects in each cluster have a maximum similarity with each other and maximum dissimilarity with the objects of other clusters. A large number of clustering algorithms have been developed to tackle the clustering problem [10, 11, 13, 15, 20]. However, there is no single clustering algorithm that can discover all cluster shapes. In addition, for the same dataset, different clusterings can be found by using several clustering algorithms or by using different initializations in a same clustering algorithm. For these reasons, it is difficult to determine the best clustering result of a given dataset.
Clustering ensemble has emerged as an important method for improving the quality of unsupervised classification solutions. The problem aims at finding a better and more robust clustering by combining a set of clustering results of the same dataset. Clustering ensemble techniques have been used in many research fields [5, 6, 24]. This problem has been studied by many researchers in data mining and many approaches have been explored to find the final clustering such as: voting approaches [3], hyper graph approaches [1, 4, 21], optimization techniques [7, 12, 15, 16, 17], mutual information approaches [2]. A literature review of the most important approaches can be found in [19].
The existing clustering ensemble approaches do not take into account the relative importance of each combined clustering when generating consensus clustering. Low quality clusterings can be combined with others of better quality which affects the quality of the final clustering. To address this problem, we propose in this paper, a weighted clustering ensemble approach to reflect the relative importance of each clustering before aggregation. The proposed approach is based on defining a method to automatically generate the weight of each clustering based on its quality and the quality of its neighbouring clusterings. Using the computed weights, the number of clusters in the final clustering can be estimated before the combining step and a better quality final clustering can be then computed.
The rest of this paper is organized as follows. In Section 2, a brief description of the related works in weighted clustering ensemble problem is presented. The proposed method is presented in Section 3. An illustrative example of the proposed approach is detailed in Section 4. An experimental study using several datasets is given in Section 5. The computation complexity of the proposed approach is discussed in Section 6. Finally, we conclude with a brief discussion about the proposed approach and some future works.
Literature review
The problem of weighted clustering ensemble has been tackled in several research papers and many approaches have been explored to deal with this problem. These approaches can be classified into reformulation base approaches [23, 14, 18], kernel function based approaches [22] and cluster validity based approaches [9, 8].
In the first category, the weighted clustering ensemble problem is reformulated as a known problem. In [23] the authors proved that only a subset of the input clusterings contribute to the final consensus clustering, i.e. clustering with larger weight contributes more to the final consensus clustering. As consequence, the weighted consensus clustering has been reformulated to a regularization of LASSO problem. The proposed approach in [14] is based on the idea that in many real-world problems, some points may be correlated with respect to a given set of dimensions. Each dimension could be relevant to at least one of the clusters. As a consequence, the problem can be reformulated to a subspace clustering problem. In [18], the problem has been reformulated as a special instance of Maximum-Weight Independent Set (MWIS) problem. The problem is represented by a graph in which the vertices represent the clusters. An index that measures both cohesion and separation has been used to weight each cluster. A variant of simulated annealing method has been used to find the final clustering.
In the second category, kernel functions have been used to solve the problem. In [22], the authors proposed an approach that aims at analyzing the set of partitions in the cluster ensemble to extract valuable information that can improve the quality of the combination process. A weight is assigned to each partition according to how much this partition satisfies a set of properties. To obtain the final clustering, a kernel function representing a new similarity measure between partitions has been proposed.
In the last category, the problem has been solved by assigning a validity index to each cluster. In [9], the authors proposed an approach based on crowd agreement estimation and multi-granularity link analysis. The authors weight the base clusterings in accordance with their clustering validity by defining a normalized crowd agreement index and the source aware connected triple similarity measure (SACT). The SACT similarity between two clusters is computed using a reference pair of clusters (the pair of clusters with the maximum SACT). In [8], the authors proposed to estimate the uncertainty of each cluster in the ensemble using an entropic criterion. A local weighting strategy based on the uncertainty of the each cluster and an ensemble-driven cluster validity index (ECI) has been proposed. To obtain the final clustering a local weighted co-association matrix (LWCA) has been used.
We propose in this paper to use an approach that permits to address two aspects. Firstly, the proposed approach permits to use the information at the clustering level in order to generate the weight of clusterings without access at the original data. Secondly, the weight of each base clustering is generated using its neighbourhood. The details of the proposed approach will be described in the following section.
The proposed approach
Our approach consists to group the most similar base clusterings using a clustering technique. The weight of each base clustering is then computed using the clusterings of the same cluster. Therefore, the weight of each base clustering will be influenced by its neighbourhood.
Let
Compute the similarity Compute the distance measure Compute a clustering For each cluster
Compute the relative importance
Compute the normalized weight of each base clustering
Compute the final clustering using a weighted evidence accumulation algorithm [9] as follows:
For each base clustering Compute the weighted co-association matrix
Transform
Compute the final clustering by applying a clustering algorithm to the matrix
The proposed approach is presented in Algorithm 1.
To illustrate the different steps of the proposed approach, let’s consider an artificial dataset composed of 6 objects. Let’s have a set of 4 different clusterings for the dataset (Table 1). Each base clustering has a different number of clusters.
Base clusterings of the illustrative example
Base clusterings of the illustrative example
Rand index has been used to compute the similarity between each couple of the base clustering as follows:
where
Table 2 summarizes the similarity between each couple of clusterings of the example.
Similarity matrix of the base clusterings
The distance measure
Distance matrix of the base clusterings
Using the distance matrix computed previously, the third step consists to compute a clustering for the base clusterings. For this step, the
Results of the clustering step
Results of the clustering step
The cluster
Let’s, now, compute the weight of each base clustering using Eq. (2):
It is important to notice that the base clustering having a weight equal to 0 will be naturally excluded from the aggregation step according to Eq. (3).
Compute the final clustering
This step starts by the computation of the pair-wise co-occurrence matrix of each base clustering. In Table 5 is presented the pair-wise co-occurrence matrix of the base clustering
Pair-wise co-occurrence matrix of the base clustering
Pair-wise co-occurrence matrix of the base clustering
Based on the pair-wise co-occurrence matrices, the weights computed previously and Eqs (3) and (4), the distance matrix
Distance matrix of the weighted clustering ensemble problem
The final step is to compute the final clustering using the matrix
To prove the applicability of the proposed approach, we present in the current section some experimental results. The proposed approach has been experimented using seven real datasets form UCI repository. For each dataset, a set of clusterings has been generated using three clustering algorithms (hierarchic clustering,
Description of the data used in the experimentation
Description of the data used in the experimentation
In Table 8 are presented the weights computed by the proposed approach. For Iris dataset, the base clustering having the maximum weight is obtained using
In a second time, the proposed approach has been compared to two clustering ensemble algorithms; CL_CONSENSUS developed in [12] and COMUSA developed in [21]. Tow criteria have been used to compare the results:
The number of clusters in the final clustering. The quality of the final clustering computed using rand index between the final clustering and the real dataset.
For each approach, the experimentations have been executed 10 times. The best result of each algorithm has been retained. Results are summarized in Table 9. The results show that the proposed approach generates a better quality final clustering with the right number of clusters compared to CL_CONSENSUS. Compared to COMUSA the proposed approach generates a final clustering with better quality except for ionosphere dataset.
Experimental results of the proposed approach
Comparison of the proposed approach to CL_CONSENUS and COMSA algorithms
The computation complexity of the proposed approach depends on the complexity the clustering algorithm (used in the third and the last steps), the complexity of the pair-wise comparisons used in the computation of the similarity measure between each couple of the base clusterings and the computation of the co-occurrence matrices (last step).
The complexity of the clustering algorithm (like
To evaluate the performance of the proposed approach, the execution time has been computed for each dataset used in the experimentation. The results (see Table 10) show that the execution time of the proposed approach is less than 1 minute for all the dataset except for vehicle dataset.
Execution time of the proposed approach
Execution time of the proposed approach
In this paper we tackled the problem of defining the weight of the base clusterings in the weighted clustering ensemble problem. We proposed an unsupervised approach to automatically generate the weight of each base clustering.
Compared to the major related works in weighted clustering ensemble, the proposed approach has some similarities. Firstly, the proposed approach uses a weighting strategy without access the original data. Furthermore, the proposed approach is based on an extended version of evidence accumulated algorithm to obtain the final clustering this algorithm is has been also used in [8, 9]. Nevertheless, compared to the most recent related works [8, 9], our approach is based on weighting strategy at clustering level while other approaches are based on a weighting strategy at cluster level. Finally, in our approach the number of cluster in the final clustering is estimated automatically this issue has not been addressed in the other works.
As we can see, the proposed approach depends on a set of parameters. Different similarity measures can be used to compute the similarity between each couple of the base clusterings. Different clustering algorithms can be used to group the base clusterings. All these parameters can affect the quality of the final clustering. For these reason, in the future works, it is important to evaluate the sensitivity of the proposed approach against the change of all these parameters.
Footnotes
Authors’ Bios
