Weighted clustering ensemble: Towards learning the weights of the base clusterings

Abstract

A clustering ensemble obtains a final clustering of a dataset by combining multiple partitions computed by different clustering algorithms. Clustering ensembles have emerged as a prominent method for improving the robustness of unsupervised classification solutions. The problem has received increasing attention in recent years, but little attention has been paid to weighting the combined clusterings without access to the original data. In this paper we address the weighted clustering ensemble problem by defining an unsupervised method that computes the weight of each combined clustering without access to the original data. The weight of each base clustering is computed from its quality and the quality of its neighbouring clusterings. By exploiting the generated weights, the proposed method also makes it possible to estimate the number of clusters of the final clustering before the combining step.

Keywords

Clustering, clustering ensemble, weight, evidence accumulation

1. Introduction

Clustering algorithms partition the data into homogeneous groups called clusters, such that the objects in each cluster have maximum similarity with each other and maximum dissimilarity with the objects of other clusters. A large number of clustering algorithms have been developed to tackle the clustering problem [10, 11, 13, 15, 20]. However, no single clustering algorithm can discover all cluster shapes. In addition, for the same dataset, different clusterings can be found by using several clustering algorithms or by using different initializations of the same clustering algorithm. For these reasons, it is difficult to determine the best clustering result for a given dataset.

Clustering ensemble has emerged as an important method for improving the quality of unsupervised classification solutions. It aims at finding a better and more robust clustering by combining a set of clustering results of the same dataset. Clustering ensemble techniques have been used in many research fields [5, 6, 24]. The problem has been studied by many researchers in data mining, and many approaches have been explored to find the final clustering, such as voting approaches [3], hypergraph approaches [1, 4, 21], optimization techniques [7, 12, 15, 16, 17] and mutual information approaches [2]. A literature review of the most important approaches can be found in [19].

The existing clustering ensemble approaches do not take into account the relative importance of each combined clustering when generating the consensus clustering. Low quality clusterings can be combined with others of better quality, which degrades the quality of the final clustering. To address this problem, we propose in this paper a weighted clustering ensemble approach that reflects the relative importance of each clustering before aggregation. The proposed approach automatically generates the weight of each clustering from its quality and the quality of its neighbouring clusterings. Using the computed weights, the number of clusters in the final clustering can be estimated before the combining step, and a final clustering of better quality can then be computed.

The rest of this paper is organized as follows. Section 2 briefly describes the related works on the weighted clustering ensemble problem. The proposed method is presented in Section 3. An illustrative example of the proposed approach is detailed in Section 4. An experimental study using several datasets is given in Section 5. The computational complexity of the proposed approach is discussed in Section 6. Finally, Section 7 concludes with a brief discussion of the proposed approach and some future works.

2. Literature review

The problem of weighted clustering ensemble has been tackled in several research papers, and many approaches have been explored to deal with it. These approaches can be classified into reformulation-based approaches [23, 14, 18], kernel-function-based approaches [22] and cluster-validity-based approaches [9, 8].

In the first category, the weighted clustering ensemble problem is reformulated as a known problem. In [23] the authors proved that only a subset of the input clusterings contributes to the final consensus clustering, i.e. a clustering with a larger weight contributes more to the final consensus clustering. As a consequence, weighted consensus clustering has been reformulated as a LASSO regularization problem. The approach proposed in [14] is based on the idea that, in many real-world problems, some points may be correlated with respect to a given set of dimensions, and each dimension could be relevant to at least one of the clusters. As a consequence, the problem can be reformulated as a subspace clustering problem. In [18], the problem has been reformulated as a special instance of the Maximum-Weight Independent Set (MWIS) problem. The problem is represented by a graph in which the vertices represent the clusters. An index that measures both cohesion and separation is used to weight each cluster, and a variant of simulated annealing is used to find the final clustering.

In the second category, kernel functions have been used to solve the problem. In [22], the authors proposed an approach that aims at analyzing the set of partitions in the cluster ensemble to extract valuable information that can improve the quality of the combination process. A weight is assigned to each partition according to how much this partition satisfies a set of properties. To obtain the final clustering, a kernel function representing a new similarity measure between partitions has been proposed.

In the last category, the problem has been solved by assigning a validity index to each cluster. In [9], the authors proposed an approach based on crowd agreement estimation and multi-granularity link analysis. The authors weight the base clusterings in accordance with their clustering validity by defining a normalized crowd agreement index and the source aware connected triple similarity measure (SACT). The SACT similarity between two clusters is computed using a reference pair of clusters (the pair of clusters with the maximum SACT). In [8], the authors proposed to estimate the uncertainty of each cluster in the ensemble using an entropic criterion. A local weighting strategy based on the uncertainty of each cluster and an ensemble-driven cluster validity index (ECI) has been proposed. To obtain the final clustering, a locally weighted co-association matrix (LWCA) has been used.

The approach we propose in this paper addresses two aspects. Firstly, it uses the information available at the clustering level to generate the weights of the clusterings without access to the original data. Secondly, the weight of each base clustering is generated using its neighbourhood. The details of the proposed approach are described in the following section.

3. The proposed approach

Our approach consists in grouping the most similar base clusterings using a clustering technique. The weight of each base clustering is then computed using the clusterings that belong to the same cluster. Therefore, the weight of each base clustering is influenced by its neighbourhood.

Let $\Pi=\left\{\pi_{1},\pi_{2},\pi_{3},\ldots,\pi_{m}\right\}$ be the set of the base clusterings. The proposed approach is based on the following steps:

Step 1:
Compute the similarity $S\left(\pi_{i},\pi_{j}\right)$ between each pair of base clusterings using the Rand index.
Step 2:
Compute the distance measure $d\left(\pi_{i},\pi_{j}\right)$ between each pair of base clusterings by $d\left(\pi_{i},\pi_{j}\right)=1-S\left(\pi_{i},\pi_{j}\right)$.
Step 3:
Compute a clustering $CL=\left\{C_{1},C_{2},C_{3},\ldots,C_{q}\right\}$ of the base clusterings into $q$ clusters by using a clustering algorithm and the distance computed previously.
Step 4:
For each cluster $C_{l}$ in $CL$ do

Compute the relative importance ${ri}_{t}$ of each base clustering $\pi_{t}$ in $C_{l}$ with respect to its neighbouring clusterings by using:

$\displaystyle {ri}_{t}=\frac{1}{n}\sum\limits_{\pi_{r}\in C_{l},\,r\neq t}{S\left(\pi_{r},\pi_{t}\right)}$ (1)

where $n=\left|C_{l}\right|$ is the number of base clusterings in $C_{l}$, as in the worked example of Section 4.4. A base clustering that is alone in its cluster therefore has a relative importance of 0.

Compute the normalized weight of each base clustering $\pi_{t}$ in $CL$ using:

$\displaystyle w_{t}=\frac{{ri}_{t}}{\sum\limits_{s=1}^{m}{ri}_{s}}$ (2)

Step 5:
Compute the final clustering using a weighted evidence accumulation algorithm [9] as follows:

For each base clustering $\pi_{t}$, compute the pair-wise co-occurrence matrix $B_{t}$, where $B_{t}\left(u,v\right)=1$ if the objects $u$ and $v$ are placed in the same cluster by the clustering $\pi_{t}$ and $B_{t}\left(u,v\right)=0$ otherwise.

Compute the weighted co-association matrix $wco$ by:

$\displaystyle wco=\sum\limits_{t=1}^{\left|\Pi\right|}{w_{t}B_{t}}$ (3)

Transform $wco$ into a distance matrix $M$ ($wco$ is considered as a similarity matrix) by:

$\displaystyle M_{ij}=1-{wco}_{ij}$ (4)

Compute the final clustering by applying a clustering algorithm to the matrix $M$.

The proposed approach is presented in Algorithm 1.

Algorithm 1. Neighbourhood based Weighted Clustering Ensemble

Input

X: a dataset

${\Pi}=\left\{\pi_{1},\pi_{2},\pi_{3},\ldots,\pi_{m}\right\}$ : a set of the base clusterings

Output

$\pi^{*}$ : final clustering composed of $k$ clusters

Begin

/* compute the similarity and the distance between each pair of base clusterings */
1: For $i$ in $1$ to $m-1$ do
2:   For $j$ in $i+1$ to $m$ do
3:     $S\left(\pi_{i},\pi_{j}\right)\leftarrow\textit{Rand\_index}\left(\pi_{i},\pi_{j}\right)$
4:     $d\left(\pi_{i},\pi_{j}\right)\leftarrow 1-S\left(\pi_{i},\pi_{j}\right)$
5:   End for
6: End for
/* compute a clustering of the base clusterings composed of $q$ clusters using the matrix $d$ */
7: CL $\leftarrow$ clustering_algorithm $(d,q)$
/* compute the relative importance of each clustering $\pi_{t}$ with respect to its neighbourhood ($n=\left|C_{l}\right|$) */
8: For each cluster $C_{l}$ in CL do
     For each $\pi_{t}$ in $C_{l}$ do
9:     $\displaystyle{ri}_{t}\leftarrow\frac{1}{n}\sum\limits_{\pi_{r}\in C_{l},\,r\neq t}{S\left(\pi_{r},\pi_{t}\right)}$
10:   End for
11: End for
/* compute the normalized weight of each clustering $\pi_{t}$ */
12: For each cluster $C_{l}$ in CL do
13:   For each $\pi_{t}$ in $C_{l}$ do
14:     $\displaystyle w_{t}\leftarrow\frac{{ri}_{t}}{\sum\limits_{s=1}^{m}{ri}_{s}}$
15:   End for
16: End for
/* $k$ is the number of clusters in the clustering having the maximum weight */
17: $w_{\max}\leftarrow\max(w_{t})$
18: $k\leftarrow\textit{nb\_cluster}(\pi_{t},w_{\max})$
/* compute the co-occurrence matrix for each base clustering */
19: For each $\pi_{t}$ in $\Pi$ do
20:   For $(u,v)$ in $X$ do
      /* if objects $u$ and $v$ are placed in the same cluster by the clustering $\pi_{t}$ */
21:     If $C(u)=C(v)$ then
22:       $B_{t}\left(u,v\right)\leftarrow 1$
23:     Else
24:       $B_{t}\left(u,v\right)\leftarrow 0$
25:     End if
26:   End for
27: End for
/* compute the distance matrix from the weighted co-occurrence matrices */
28: $\displaystyle M_{ij}\leftarrow 1-\sum\limits_{t=1}^{\left|\Pi\right|}{w_{t}B_{t}\left(i,j\right)}$
/* compute the final clustering using the matrix $M$ and the value of $k$ generated in step 18 */
29: $\displaystyle\pi^{*}\leftarrow\textit{clustering\_algorithm}(M,k)$
30: End.

4. Illustrative example

To illustrate the different steps of the proposed approach, consider an artificial dataset composed of 6 objects and a set of 4 different clusterings of this dataset (Table 1). Each base clustering has a different number of clusters.

Table 1
Base clusterings of the illustrative example

	$x_{1}$	$x_{2}$	$x_{3}$	$x_{4}$	$x_{5}$	$x_{6}$
$\pi_{1}$	1	1	2	1	2	2
$\pi_{2}$	1	2	3	1	2	3
$\pi_{3}$	1	2	3	4	4	3
$\pi_{4}$	1	2	3	5	4	3

4.1 Compute the similarity between each couple of the base clusterings

The Rand index is used to compute the similarity between each pair of base clusterings as follows:

$\displaystyle S\left(\pi_{i},\pi_{j}\right)=\frac{A}{A+D}$ (5)

where $A$ denotes the number of all pairs of data points which are either put into the same cluster by both $\pi_{i}$ and $\pi_{j}$ or put into different clusters by both clusterings. Conversely, $D$ denotes the number of all pairs of data points that are put into the same cluster by one clustering, but into different clusters by the other clustering.

Table 2 summarizes the similarity between each couple of clusterings of the example.

Table 2

Similarity matrix of the base clusterings

	$\pi_{1}$	$\pi_{2}$	$\pi_{3}$	$\pi_{4}$
$\pi_{1}$	1	0.667	0.6	0.667
$\pi_{2}$	0.667	1	0.8	0.867
$\pi_{3}$	0.6	0.8	1	0.933
$\pi_{4}$	0.667	0.867	0.933	1
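As a rough check of Eq. (5), the following Python sketch counts agreeing and disagreeing object pairs directly. The function name `rand_index` and the list-of-labels encoding of Table 1 are choices made for this illustration only; under those assumptions the printed values should agree with Table 2 up to rounding.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index S = A / (A + D) between two labelings of the same objects."""
    agree = 0     # A: pairs treated the same way by both clusterings
    disagree = 0  # D: pairs grouped together by exactly one clustering
    for u, v in combinations(range(len(labels_a)), 2):
        same_a = labels_a[u] == labels_a[v]
        same_b = labels_b[u] == labels_b[v]
        if same_a == same_b:
            agree += 1
        else:
            disagree += 1
    return agree / (agree + disagree)


# Base clusterings of Table 1 (one label per object x1..x6).
base = [
    [1, 1, 2, 1, 2, 2],  # pi_1
    [1, 2, 3, 1, 2, 3],  # pi_2
    [1, 2, 3, 4, 4, 3],  # pi_3
    [1, 2, 3, 5, 4, 3],  # pi_4
]

# Similarity matrix (Eq. 5) and the derived distance d = 1 - S.
S = [[rand_index(a, b) for b in base] for a in base]
D = [[1.0 - s for s in row] for row in S]

for row in S:
    print(["%.3f" % s for s in row])
```

The naive pair enumeration is quadratic in the number of objects, which is acceptable for an illustration but is only one of several ways to compute the index.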

4.2 Compute the dissimilarity matrix between each couple of the base clusterings

The distance measure $d\left(\pi_{i},\pi_{j}\right)$ between each couple of base clusterings is computed by:

$\displaystyle d\left(\pi_{i},\pi_{j}\right)=1-S\left(\pi_{i},\pi_{j}\right)$ (6)

Table 3

Distance matrix of the base clusterings

	$\pi_{1}$	$\pi_{2}$	$\pi_{3}$	$\pi_{4}$
$\pi_{1}$	0	0.333	0.4	0.333
$\pi_{2}$	0.333	0	0.2	0.133
$\pi_{3}$	0.4	0.2	0	0.067
$\pi_{4}$	0.333	0.133	0.067	0

4.3 Compute a clustering for the base clusterings

Using the distance matrix computed previously, the third step computes a clustering of the base clusterings. For this step, the $k$-medoids algorithm [13] is run with different values of the number of clusters ($q$), and the result having the maximum average silhouette width is retained. As shown in Table 4, the clustering step has been executed using two values of $q$ (2 and 3). The best clustering result is obtained with $q=2$. Thus, the base clusterings are grouped into two clusters $C_{1}=\left\{\pi_{1}\right\}$ and $C_{2}=\left\{\pi_{2},\pi_{3},\pi_{4}\right\}$.

Table 4
Results of the clustering step

$q$	Average silhouette width
2	0.46
3	0.29
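The selection criterion can be illustrated with a small Python sketch that evaluates the average silhouette width of a given grouping from a precomputed distance matrix. The function `average_silhouette` is a hypothetical helper written for this example, and it assumes the common convention that an object alone in its cluster has a silhouette of 0; the source does not state which convention was used. With the rounded distances of Table 3 and the grouping retained for $q=2$, it gives a value close to 0.466, which is compatible with the 0.46 of Table 4 if that figure was truncated.

```python
def average_silhouette(dist, labels):
    """Average silhouette width for a precomputed distance matrix.

    Assumes at least two clusters; singletons contribute 0 (assumed convention).
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette taken as 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Distance matrix of Table 3 (rounded values as printed).
dist = [
    [0.0, 0.333, 0.4, 0.333],
    [0.333, 0.0, 0.2, 0.133],
    [0.4, 0.2, 0.0, 0.067],
    [0.333, 0.133, 0.067, 0.0],
]

# Grouping retained for q = 2: C1 = {pi_1}, C2 = {pi_2, pi_3, pi_4}.
width_q2 = average_silhouette(dist, [1, 2, 2, 2])
print(round(width_q2, 3))
```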

4.4 Compute the weight of each base clustering regarding its neighbourhood

The cluster $C_{1}$ is composed of only one base clustering, $\pi_{1}$; thus, its relative importance is equal to 0. For the cluster $C_{2}$, by applying Eq. (1) we have: ${ri}_{2}=\frac{1}{3}\left(S\left(\pi_{2},\pi_{3}\right)+S\left(\pi_{2},\pi_{4}\right)\right)=0.556$. Using the same Eq. (1), we have ${ri}_{3}=0.578$ and ${ri}_{4}=0.6$.

We now compute the weight of each base clustering using Eq. (2):

$\displaystyle w_{1}=\frac{0}{0+0.556+0.578+0.6}=0$

$\displaystyle w_{2}=\frac{0.556}{0+0.556+0.578+0.6}=0.32$

$\displaystyle w_{3}=\frac{0.578}{0+0.556+0.578+0.6}=0.33$

$\displaystyle w_{4}=\frac{0.6}{0+0.556+0.578+0.6}=0.35$

It is important to notice that the base clustering having a weight equal to 0 will be naturally excluded from the aggregation step according to Eq. (3).
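The arithmetic of Eqs (1) and (2) can be sketched in Python as follows. The helper name `neighbourhood_weights` and the encoding of the grouping as one group label per base clustering are assumptions of this illustration, and the divisor $n$ of Eq. (1) is read as the size of the cluster $C_{l}$, which is the reading suggested by the factor $\frac{1}{3}$ in the worked example.

```python
def neighbourhood_weights(S, groups):
    """Relative importance (Eq. 1) and normalized weights (Eq. 2).

    S      -- square similarity matrix between base clusterings
    groups -- groups[t] is the cluster of base clusterings containing pi_t
    """
    m = len(S)
    ri = []
    for t in range(m):
        members = [r for r in range(m) if groups[r] == groups[t]]
        size = len(members)  # n in Eq. (1), read here as |C_l| (assumption)
        ri.append(sum(S[r][t] for r in members if r != t) / size)
    total = sum(ri)
    if total == 0:
        raise ValueError("all relative importances are zero")
    return ri, [x / total for x in ri]


# Similarity matrix of Table 2 (rounded values as printed).
S = [
    [1.0, 0.667, 0.6, 0.667],
    [0.667, 1.0, 0.8, 0.867],
    [0.6, 0.8, 1.0, 0.933],
    [0.667, 0.867, 0.933, 1.0],
]

# Grouping of Section 4.3: C1 = {pi_1}, C2 = {pi_2, pi_3, pi_4}.
ri, w = neighbourhood_weights(S, [1, 2, 2, 2])
print([round(x, 3) for x in ri])
print([round(x, 2) for x in w])
```

A base clustering that is alone in its group receives a relative importance of 0 and hence a weight of 0, which is how $\pi_{1}$ drops out of the aggregation in this example.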

4.5 Compute the final clustering

This step starts with the computation of the pair-wise co-occurrence matrix of each base clustering. Table 5 presents the pair-wise co-occurrence matrix of the base clustering $\pi_{2}$.

Table 5
Pair-wise co-occurrence matrix of the base clustering $\pi_{2}$

	$x_{1}$	$x_{2}$	$x_{3}$	$x_{4}$	$x_{5}$	$x_{6}$
$x_{1}$	0	0	0	1	0	0
$x_{2}$	0	0	0	0	1	0
$x_{3}$	0	0	0	0	0	1
$x_{4}$	1	0	0	0	0	0
$x_{5}$	0	1	0	0	0	0
$x_{6}$	0	0	1	0	0	0

Based on the pair-wise co-occurrence matrices, the weights computed previously and Eqs (3) and (4), the distance matrix $M$ is computed and presented in Table 6.

Table 6

Distance matrix of the weighted clustering ensemble problem

	$x_{1}$	$x_{2}$	$x_{3}$	$x_{4}$	$x_{5}$	$x_{6}$
$x_{1}$	0	1	1	0.68	1	1
$x_{2}$	1	0	1	1	0.68	1
$x_{3}$	1	1	0	1	1	0
$x_{4}$	0.68	1	1	0	0.67	1
$x_{5}$	1	0.68	1	0.67	0	1
$x_{6}$	1	1	0	1	1	0
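A minimal Python sketch of Eqs (3) and (4) is given below, using the base clusterings of Table 1 and the rounded weights of Section 4.4. The function name `weighted_distance_matrix` is hypothetical. The sketch applies the definition of $B_{t}$ literally, so every object co-occurs with itself and the diagonal of $M$ is 0 whenever the weights sum to 1; the clipping at 0 only guards against floating-point rounding and is an addition of this example.

```python
def weighted_distance_matrix(base, weights):
    """Distance matrix M = 1 - wco, with wco the weighted co-association matrix."""
    n = len(base[0])
    M = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            # Eq. (3): sum the weights of the clusterings that group u with v.
            wco = sum(
                w for labels, w in zip(base, weights) if labels[u] == labels[v]
            )
            # Eq. (4), clipped at 0 to absorb floating-point rounding.
            M[u][v] = max(0.0, 1.0 - wco)
    return M


# Base clusterings of Table 1 and rounded weights of Section 4.4.
base = [
    [1, 1, 2, 1, 2, 2],  # pi_1
    [1, 2, 3, 1, 2, 3],  # pi_2
    [1, 2, 3, 4, 4, 3],  # pi_3
    [1, 2, 3, 5, 4, 3],  # pi_4
]
weights = [0.0, 0.32, 0.33, 0.35]

M = weighted_distance_matrix(base, weights)
for row in M:
    print(["%.2f" % x for x in row])
```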

The final step computes the final clustering by applying a clustering algorithm to the matrix $M$. We propose to use the $k$-medoids algorithm with $k=5$, which corresponds to the number of clusters of the base clustering having the maximum weight ($\pi_{4}$). The final clustering is organized as follows: $\left\{\left\{x_{1}\right\},\left\{x_{2}\right\},\left\{x_{3},x_{6}\right\},\left\{x_{4}\right\},\left\{x_{5}\right\}\right\}$.
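The choice of $k$ and the final partition can be checked on this small example with the sketch below. It is not the $k$-medoids implementation used in the paper: `exhaustive_k_medoids` is a hypothetical stand-in that tries every set of $k$ medoids and keeps the one with the lowest total distance, which is only feasible for very small datasets. Ties in the maximum weight are broken by taking the first such clustering, which is also an assumption.

```python
from itertools import combinations


def estimate_k(base, weights):
    """Number of clusters of the base clustering with the largest weight."""
    best = max(range(len(base)), key=lambda t: weights[t])
    return len(set(base[best]))


def exhaustive_k_medoids(M, k):
    """Brute-force minimizer of the k-medoids objective (tiny inputs only)."""
    n = len(M)
    best_cost, best_medoids = None, None
    for medoids in combinations(range(n), k):
        cost = sum(min(M[u][c] for c in medoids) for u in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = []
    for u in range(n):
        if u in best_medoids:
            labels.append(u)  # a medoid represents its own cluster
        else:
            labels.append(min(best_medoids, key=lambda c: M[u][c]))
    return labels


base = [
    [1, 1, 2, 1, 2, 2],  # pi_1
    [1, 2, 3, 1, 2, 3],  # pi_2
    [1, 2, 3, 4, 4, 3],  # pi_3
    [1, 2, 3, 5, 4, 3],  # pi_4
]
weights = [0.0, 0.32, 0.33, 0.35]

# Distance matrix of Table 6.
M = [
    [0, 1, 1, 0.68, 1, 1],
    [1, 0, 1, 1, 0.68, 1],
    [1, 1, 0, 1, 1, 0],
    [0.68, 1, 1, 0, 0.67, 1],
    [1, 0.68, 1, 0.67, 0, 1],
    [1, 1, 0, 1, 1, 0],
]

k = estimate_k(base, weights)
labels = exhaustive_k_medoids(M, k)
partition = sorted(
    sorted(i for i, lab in enumerate(labels) if lab == c) for c in set(labels)
)
print(k, partition)  # object indices are 0-based: index 2 is x3, index 5 is x6
```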

5. Experimentation

To demonstrate the applicability of the proposed approach, we present in this section some experimental results. The proposed approach has been tested on seven real datasets from the UCI repository. For each dataset, a set of clusterings has been generated using three clustering algorithms (hierarchical clustering, $k$-means and $k$-medoids). Each algorithm has been executed using several values of the number of clusters. A complete description of the data used in the experimentation is presented in Table 7.

Table 7
Description of the data used in the experimentation

Dataset	# of objects	# of clusters	# of base clusterings	# of clusters in the base clusterings
Iris	150	3	12	2-3-4-5
Wine	178	3	12	2-3-4-5
Soybean	47	4	12	2-3-4-5
Glass	214	6	18	2-3-4-5-6-7
Ionosphere	351	2	12	2-3-4-5
Vehicle	846	4	12	2-3-4-5
UKM (User Knowledge Modeling)	403	4	12	2-3-4-5

Table 8 presents the weights computed by the proposed approach. For the Iris dataset, the base clustering having the maximum weight is obtained using the $k$-medoids algorithm with $k=3$, which corresponds to the real number of clusters. For the Glass dataset, the base clustering having the maximum weight is obtained by executing a hierarchical clustering algorithm with $k=6$, which again corresponds to the real number of clusters. The same observation holds for the rest of the datasets. For this reason, the clustering step that generates the final clustering has been executed using the number of clusters of the base clustering having the maximum weight.

In a second step, the proposed approach has been compared to two clustering ensemble algorithms: CL_CONSENSUS, developed in [12], and COMUSA, developed in [21]. Two criteria have been used to compare the results:

The number of clusters in the final clustering.

The quality of the final clustering, computed as the Rand index between the final clustering and the real clustering of the dataset.

For each approach, the experiments have been executed 10 times and the best result of each algorithm has been retained. Results are summarized in Table 9. Compared to CL_CONSENSUS, the proposed approach finds the right number of clusters on every dataset and generates a final clustering of better quality, except on the Vehicle dataset where both approaches reach the same quality (0.623). Compared to COMUSA, the proposed approach generates a final clustering of better quality except on the Glass dataset (0.759 against 0.852 in Table 9).

Table 8

Experimental results of the proposed approach

Algorithm	# of clusters	Weight per dataset
		Iris	Wine	Soybean	Glass	Ionosphere	Vehicle	UKM
Hierarchic	2	0	0.0658	0	0.0367	0.0652	0.0723	0.08853
	3	0.0921	0.0882	0.0954	0.0367	0.0652	0.0863	0.09253
	4	0.0954	0.0881	0.0959	0.0576	0.0762	0.0861	0.09253
	5	0.0926	0.0857	0.0877	0.0622	0.0762	0.0892	0.08264
	6	/	/	/	0.0622	/	/	/
	7	/	/	/	0.0622	/	/	/
$k$ -medoids	2	0.0741	0.0677	0.0783	0.0598	0.0976	0.0777	0.05359
	3	0.0986	0.0907	0.0957	0.0606	0.0937	0.0867	0.09053
	4	0.0942	0.0906	0.0976	0.0592	0.0715	0.0891	0.09158
	5	0.0920	0.0900	0.0964	0.0499	0.0715	0.0888	0.09251
	6	/	/	/	0.0500	/	/	/
	7	/	/	/	0.0490	/	/	/
$k$ -means	2	0.0741	0.0703	0.0783	0.0594	0.0973	0.0775	0.05359
	3	0.0986	0.0908	0.0957	0.0585	0.0933	0.0712	0.08737
	4	0.0958	0.0855	0.0869	0.0592	0.0967	0.0895	0.08470
	5	0.0923	0.0866	0.0920	0.0583	0.0958	0.0856	0.08989
	6	/	/	/	0.0590	/	/	/
	7	/	/	/	0.0596	/	/	/

Table 9

Comparison of the proposed approach to CL_CONSENSUS and COMUSA algorithms

	Proposed approach		CL_CONSENSUS		COMUSA
Dataset	# of clusters in the final clustering	Quality of the final clustering	# of clusters in the final clustering	Quality of the final clustering	# of clusters in the final clustering	Quality of the final clustering
Iris	3	0.903	3	0.837	3	0.837
Soybean	4	0.850	3	0.824	3	0.820
Wine	3	0.726	4	0.711	2	0.619
Glass	6	0.759	5	0.568	6	0.852
Ionosphere	2	0.586	3	0.576	2	0.507
Vehicle	4	0.623	4	0.623	4	0.615
UKM	4	0.704	4	0.681	4	0.586

6. Discussion on the complexity of the proposed approach

The computational complexity of the proposed approach depends on the complexity of the clustering algorithm (used in the third and the last steps), the complexity of the pair-wise comparisons used to compute the similarity measure between each pair of base clusterings, and the computation of the co-occurrence matrices (last step).

The complexity of the clustering algorithm ($k$-medoids, for example) depends on both the number of clusters ($k$) and the size of the dataset ($n$). It is $O(k(n-k)^{2})$. The complexity of computing the similarity matrix is $O((|\Pi|\cdot n)^{2})$, where $|\Pi|$ is the number of base clusterings. Finally, the complexity of computing the co-occurrence matrices in the last step is $O(|\Pi|\cdot n^{2})$.
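To give a sense of scale for the last two terms, the count below assumes that each Rand index is obtained by enumerating all object pairs and that symmetry is exploited, with $m=|\Pi|$; this is one possible implementation, not necessarily the one that was timed. Both counts stay within the stated bounds $O((|\Pi|\cdot n)^{2})$ and $O(|\Pi|\cdot n^{2})$.

```latex
\[
\underbrace{\binom{m}{2}}_{\text{pairs of base clusterings}}
\times
\underbrace{\binom{n}{2}}_{\text{pairs of objects}}
= \frac{m(m-1)}{2}\cdot\frac{n(n-1)}{2},
\qquad
m\binom{n}{2} = m\cdot\frac{n(n-1)}{2}.
\]
% Iris (Table 7): m = 12 base clusterings, n = 150 objects
\[
\binom{12}{2}\binom{150}{2} = 66 \times 11175 = 737550,
\qquad
12\binom{150}{2} = 12 \times 11175 = 134100.
\]
```

The first expression counts the pair comparisons needed for the similarity matrix, the second the upper-triangular entries of the co-occurrence matrices.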

To evaluate the performance of the proposed approach, the execution time has been measured for each dataset used in the experimentation. The results (see Table 10) show that the execution time of the proposed approach is less than 1 minute for all the datasets except the Vehicle dataset.

Table 10
Execution time of the proposed approach

Dataset	Execution time (s)
Iris	6.427
Soybean	1.529
Wine	8.502
Glass	17.222
Ionosphere	24.991
Vehicle	138.767
UKM	32.058

7. Conclusion

In this paper we tackled the problem of defining the weights of the base clusterings in the weighted clustering ensemble problem. We proposed an unsupervised approach to automatically generate the weight of each base clustering.

The proposed approach shares some similarities with the major related works in weighted clustering ensemble. Firstly, it uses a weighting strategy that does not require access to the original data. Furthermore, it relies on an extended version of the evidence accumulation algorithm to obtain the final clustering; this algorithm has also been used in [8, 9]. Nevertheless, compared to the most recent related works [8, 9], our approach is based on a weighting strategy at the clustering level, while the other approaches are based on a weighting strategy at the cluster level. Finally, in our approach the number of clusters in the final clustering is estimated automatically; this issue has not been addressed in the other works.

As we can see, the proposed approach depends on a set of parameters. Different similarity measures can be used to compute the similarity between each pair of base clusterings. Different clustering algorithms can be used to group the base clusterings. All these parameters can affect the quality of the final clustering. For this reason, in future works, it is important to evaluate the sensitivity of the proposed approach to changes in all these parameters.

Authors’ Bios

Baroudi Rouba is a teacher/researcher at the Science and Technology Department, Faculty of Science and Technology at the University of Mostaganem, Algeria. He graduated in 2002 from the Computer Science Department, Faculty of Sciences, University of Oran1, Algeria. He received his postgraduate degree in computer science in 2005 and his PhD degree at the University of Oran1 in 2015. He is a member of the “Data Engineering and Web Technologies” Research Group in the LITIO laboratory. His research interests include clustering, data mining and multicriteria decision aid.

Safia Nait Bahloul is a full professor at the Computer Science Department, Faculty of Science at the University of Oran1, Algeria. After several scientific stays at CNAM in Paris and at the LIRIS laboratory of the University of Claude Bernard1-Lyon, she obtained her PhD degree at the University of Oran1 in 1997. She is a member of the Computer and Information Technology Laboratory of Oran (LITIO), which was approved in 2009. Since 2011, she has been leading a team on the topic of Data Engineering and Web Technology. Her research covers advanced aspects of databases, Web technology and data mining. Her works have been published in several international journals and conferences.

References

1. Strehl and Ghosh, Cluster ensembles: A knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3(1) (2002), 583–617.
2. Topchy, Jain A.K. and Punch, A mixture model for clustering ensembles, in: Proceedings of the SIAM International Conference on Data Mining, Florida, USA (22–24 April 2004), 379–390.
3. Topchy, Jain A.K. and Punch, Combining multiple weak clusterings, in: Proceedings of the 3rd IEEE International Conference on Data Mining, Melbourne (19–22 November 2003), 331–338.
4. Fischer and Buhmann J.M., Path-based clustering for grouping of smooth curves and texture segmentation, IEEE Transaction on Pattern Analysis and Machine Intelligence 25(4) (2003), 513–518.
5. Rouba and Nait Bahloul, A multicriteria clustering approach based on similarity indices and clustering ensemble techniques, International Journal of Information Technology and Decision Making 13(4) (2014), 811–837.
6. Rouba and Nait Bahloul, Minimization of the disagreements in clustering aggregation, in: Proceedings of the International Conference of Intelligent Computing, Shanghai, China (15–18 September 2008), 517–524.
7. Rouba, Bahloul S.N., Ammour D.N. and Zaaf, GACMC: A binary method for combining multiple clusterings using a genetic algorithm, The Mediterranean Journal of Computers and Networks MEDJCN 8(3) (2012), 93–101.
8. Huang, Wang C.-D. and Lai J.-H., Locally weighted ensemble clustering, IEEE Transactions on Cybernetics PP(99) (2017), 1–14.
9. Huang, Lai J.H. and Wang C.D., Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis, Neurocomputing 170(3) (2015), 240–250.
10. Sheikholeslami, Chaterjee and Zhang, WaveCluster: A multi-resolution clustering approach for very large databases, in: Proceedings of the 24th International Conference of Very Large Databases VLDB, New York, USA (24–27 August 1998), 428–439.
11. McQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 1 (1967), (Univ of Calif Press), 281–297.
12. Hornik and Bohm, Hard and soft Euclidean consensus partitions, Studies in Classification, Data Analysis, and Knowledge Organization, Springer (2008), 147–154.
13. Kaufman and Rousseeuw P.J., Groups in data: An introduction to cluster analysis, Ed Wiley, New York, 1990.
14. Al-Razgan and Domeniconi, Weighted clustering ensembles, ACM Transactions on Knowledge Discovery from Data 2(4) (2009), 1–40.
15. Ester, Kriegel H.P., Sander and Xu, A density based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the 2nd International Conference of Knowledge Discovery and Data Mining, Portland, Oregon (2–4 August 1996), 226–231.
16. Mohammadi, Azade, Saberi and Azaron, Genetic algorithm-based clustering ensemble: Determination number of clusters, International Journal of Business Forecasting and Marketing Intelligence 1(3) (2010), 201–216.
17. Mohammadi, Nikanjam and Rahmani, An evolutionary approach to clustering ensemble, in: Proceedings of the 4th International Conference of Natural Computation, Jinan, China (18–20 October 2008), 77–82.
18. [...] and Latecki L.J., Clustering aggregation as maximum-weight independent set, in: Proceedings of the 25th Conference on Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, USA (3–6 December 2012), 791–799.
19. Ghaemi, Sulaiman, Ibrahim and Mustapha, A survey: Clustering ensembles techniques, in: Proceedings of the World Academy of Science, Engineering and Technology 38 (2009), 636–645.
20. Guha, Rastogi and Shim, CURE: An efficient clustering algorithm for large databases, in: Proceedings of the ACM SIGMOD International Conference of Management of Data, Seattle, WA, USA (1–4 June 1998), 73–84.
21. Mimaroglu and Erdil, Combining multiple clusterings using similarity graph, Pattern Recognition 44 (2011), 694–703.
22. Vega-Pons, Correa-Morris and Ruiz-Shulcloper, Weighted partition consensus via kernels, Pattern Recognition 43(8) (2010), 2712–2724.
23. [...] and Ding, Weighted consensus clustering, in: Proceedings of the 8th SIAM International Conference on Data Mining, Atlanta, Georgia (24–26 April 2008), 798–809.
24. Zhang, Cheng, Zhang, Chen and Fang, Clustering aggregation based on genetic algorithm for document clusterings, in: Proceedings of the IEEE Congress on Evolutionary Computation, Hong Kong, China (1–6 June 2008), 3156–3161.
