Consensus fuzzy clustering by sequential quadratic programming approach

Abstract

Existing fuzzy clustering ensemble approaches do not consider cluster dependability, which makes them fragile when the ensemble contains unsuitable basic partitions. Although many ensemble clustering approaches have recently been introduced to improve partitioning quality, the lack of a median-partition-based consensus function that lets more reliable clusters participate more strongly remains an unsolved problem. To address this problem, we propose a weighted fuzzy cluster ensemble framework based on an approximation of cluster dependability. To combine the fuzzy clusters, a fuzzy co-association matrix is extracted from the initial fuzzy clusters in a weighted manner, according to their dependabilities. The resulting objective function is a constrained nonlinear function, which we solve by sparse sequential quadratic programming (SSQP). Experiments indicate that our method can outperform modern clustering ensemble approaches.

Keywords

Fuzzy cluster ensemble, cluster dependability, consensus function, base clustering, sequential quadratic programming

1 Introduction

Clustering is the categorization of unlabeled data objects into a set of non-predefined groups such that the objects of a group are as similar to each other as possible and the objects of different groups are as dissimilar as possible (that is, intra-group distances are minimized while inter-group distances are maximized). Clustering can be used in various unsupervised applications such as pattern recognition, image analysis, document retrieval, marketing research, bioinformatics, data mining, and many other multidisciplinary domains.

Generally, clustering methods are divided into two categories: fuzzy (soft) and hard (crisp) methods. Unlike hard methods, fuzzy methods can assign an object to more than one cluster. They associate a membership degree with each (data object, cluster) pair, which indicates the degree to which the data object belongs to that cluster. This is motivated by the fact that some objects cannot be clustered in a hard manner even by a human. Consider, for instance, satellite image segmentation: a pixel of a satellite image may correspond to an area of land that belongs to several land types at once. In hard methods, by contrast, a data object belongs to exactly one cluster [1]; hard methods can therefore be regarded as special cases of soft methods. The foundation of fuzzy clustering analysis is the basic fuzzy c-means (FCM) method, introduced and completed by Bezdek [2]. Many soft methods have since emerged from the original FCM in order to adapt it to datasets with different structures, for example the Gath-Geva algorithm (GG) [3], kernel-based fuzzy clustering (KFCM) [4], and the Gustafson-Kessel algorithm (GK) [5].

Many clustering methods have arisen, each based on a different similarity criterion, and they are therefore inherently different. Consequently, applying these methods to a given dataset produces dissimilar partitions. Even a single method with different parameters (or different initializations, if the method is unstable) can generate different partitions of the same dataset. According to the “no free lunch” theorem, there is no dominant algorithm; each one is better under specific conditions. Hence, an alternative way to deal with many contradictory objective functions is to combine some of these algorithms. This approach is called clustering aggregation or clustering ensemble [6], and it has recently become popular in the scientific community [7–14]. It is widely accepted that clustering ensembles can generate more robust, novel, accurate, and stable results than simple traditional methods [15]. They can be run in parallel for scalability, they are able to discover the real number of clusters in a dataset, and they are able to cluster heterogeneous datasets.

Clustering ensemble approaches consist of two (usually independent) steps. In the first step, a number of base clusterings that are as diverse as possible are generated. In the second step, a module usually referred to as a consensus function is applied to them to extract the final clustering. Since clustering is unsupervised, producing the final clustering with maximum similarity to all base clusterings is very difficult and is an NP-hard problem [16]. For this purpose, various consensus functions have been proposed, each with its own approach and each using different information from the base clusterings obtained in the first step, sometimes together with the initial characteristics of the data. Current clustering ensemble methods are categorized as: (a) intermediate space clustering ensemble methods [10, 17], (b) co-association matrix based clustering ensemble methods [6, 18–21], (c) hyper-graph based clustering ensemble methods [6, 21], (d) expectation maximization clustering ensemble methods [14], (e) mathematical modeling clustering ensemble methods (median partition) [15, 22], (f) voting-based approaches [23–25], and (g) the quadratic mutual information approach [20]. The mathematical modeling (median partition) category is widely considered to be better than the others [26].

While soft clustering methods are widely considered to be better than their hard counterparts, soft clustering ensembles are less developed and still in their early stages. Some current soft clustering ensembles first transform the soft partitions into hard partitions and then extract the consensus partition by applying a traditional hard consensus function. Much of the information in the basic partitions may be lost during this conversion. Additionally, most existing approaches produce a crisp final clustering. Consequently, efficient fuzzy consensus clustering methods are still lacking.

A low-quality base clustering strongly affects the consensus process in ensemble clustering: low-quality or even noisy basic partitions can badly degrade the final partition. To handle such basic partitions, weighting the basic partitions according to their approximated qualities has been recommended in order to improve the quality of consensus functions [27–29]. However, these methods assign a weight to each partition, not to each cluster [27–29], so a good cluster discovered by a bad partition may be ignored because of the low weight of its partition. Here, we aim at the following targets: (1) considering soft clusters as the base members of our ensemble, (2) defining a quality estimator at the cluster level, (3) defining cluster weights according to the approximated cluster quality, (4) defining a consensus function in which each soft cluster participates according to its weight during the extraction of the final partition, and finally (5) defining a consensus function that generates a soft partition as output. It is worth mentioning that, among the available studies, none has considered the role of dependability in soft cluster ensembles. Measuring soft cluster dependability in a fuzzy clustering ensemble is a challenging task.

In the median partition based consensus function approach, the discovery of the consensus partition is formulated as an optimization problem: the approach searches for the partition that maximizes the average similarity to all partitions in the ensemble. Although a great number of clustering ensemble approaches have been introduced over the past years, relatively few studies handle fuzzy clustering ensembles with the median partition approach, and none of them has investigated co-association (Co) ensemble methods together with the median partition approach.

To address the aforementioned challenges, we introduce a soft cluster ensemble based on ensemble-driven cluster undependability approximation and a cluster-level weighting approach. The flowchart of our ensemble structure is depicted in Fig. 1. To benefit from both ensemble diversity and fuzzy clustering, we integrate cluster undependability and validity into our ensemble to enhance the consensus quality. The undependability of each fuzzy cluster is approximated according to its acceptability in relation to a reference set (here, the whole ensemble). Specifically, the undependability of a given fuzzy cluster is approximated by a newly defined metric. Based on this estimate, a Dependability Driven Cluster Indicator (DDCI) is then computed to express the dependability of each cluster. The key point is that the pool of diverse clusters in the ensemble can be used as a guideline for evaluating every single cluster. By assessing the clusters of the ensemble and assigning weights to them through the DDCI module, a weighted Co matrix (WCoM), whose weights are based on dependability, is computed. Finally, finding the consensus partition is modeled as an optimization problem, which is solved by sequential quadratic programming.

Our contribution is a soft cluster ensemble that aims to satisfy the following requirements: (1) considering soft clusters as the base members of the ensemble, (2) defining a consensus function in which each soft cluster participates according to its weight during the extraction of the final partition, and finally (3) defining a consensus function that generates a soft partition as output.

The next section is dedicated to related work. After that, the background is presented in Section 3. Our contribution and approach are introduced in Section 4. The experimental study is presented in Section 5. Finally, Section 6 concludes the paper and gives guidelines for future work.

2 Literature

The works of Fred and Jain [18] and Strehl and Ghosh [16] are widely considered to be among the initial attempts in the clustering ensemble field. In these works, aggregators were proposed to extract the final partition from a pool of base partitions without access to the original dataset features or the base clustering algorithms; the problem was solved by transforming it into a constrained optimization problem. While much work has been done on ensembles of hard clusterings, we only review the works dedicated to soft cluster ensembles, which are briefly described below.

Berikov introduced a probabilistic framework to aggregate a soft cluster ensemble according to the WCoM [30]. Each of his basic partitions is generated by applying a different basic soft clusterer. He proposes to compute the variance of the Hellinger distance [31] between any pair of data objects across the ensemble, and takes the inverse of this value as the similarity of the two objects in his WCoM. Finally, he applies a traditional hierarchical clusterer to obtain the consensus partition.

Punera and Ghosh introduced soft versions of CSPA and MCLA, which had been proposed by Strehl and Ghosh [16], and of HBGF, which had been proposed by Fern and Brodley [7, 8]. The soft versions are named sCSPA, sMCLA, and sHBGF [8].

One voting-based consensus clustering method is the voting-merging algorithm (VMA) [32]. This fuzzy ensemble clustering method computes the final clustering by averaging the membership matrices of all basic soft partitions. All base soft clusters must first be relabeled, which is a time-consuming process.

Another related work, introduced by Saha et al., is SVMeFC [33]. In this method, several soft clusterers, including FCIDE, MoDEFC, GAFC, GAFPSC, and FCM, are applied to a given dataset. Some objects of the derived soft partitions are chosen to train a support vector machine (SVM), and the remaining objects of each partition are relabeled by the SVM. At the end of this procedure, CSPA is applied to the outputs of the SVM to obtain the final clusters. SVMs are very powerful algorithms for classifying and separating data, especially when combined with other machine learning methods. This procedure best fits cases where high precision is required, provided that the mapping functions are chosen properly. However, it is very time-consuming because of the high computational complexity of SVMs, and it also consumes a lot of memory.

Alizadeh et al. converted the soft cluster ensemble problem into a binary bit string problem [34]. They formulated the problem as a constrained non-linear objective function, named the fuzzy string objective function (FSOF). Although their output partition is soft, the input basic partitions must be hard.

Sevillano et al. [25] introduced a voting approach to aggregate the base soft clusterings in the ensemble. Their approach has two steps: (a) a relabeling step and (b) a voting step. In the first step, the relabeling is accomplished by the Hungarian algorithm [35, 36], with complexity O(K³), where K is the number of clusters. In the second step, the consensus partition is generated by applying the sum voting rule or the product voting rule [37].

Arlyd and Anna introduced a stable soft cluster ensemble [38] in which a set of basic soft clusterers such as GK, FCM, GG, and KFCM is first employed to generate the ensemble, and FCM is then used as the consensus function to extract the final partition from the CoM of the ensemble. No weighting mechanism is used.

Szabo and Nunes de Castro [39] offer a method for soft cluster ensembles based on Particle Swarm Optimization (PSO), which can be applied to both fuzzy and crisp clusters. The initial clusters in this method are generated by PSO with different parameters. A pruning process then selects a very small fraction of elite partitions. An internal cluster validity index such as Ball-Hall [40], Calinski-Harabasz [41], the Dunn index [42], the Silhouette index [43], or Xie-Beni [44] is used to evaluate a partition for selection. Finally, PSO is employed as the consensus function to extract the final partition from the selected partitions. In this method each particle represents a cluster, unlike other PSO-based methods, in which each particle represents a clustering.

To handle imbalanced clustering, Parvin et al. proposed a weighted locally adaptive clustering algorithm (FWLAC). Because the performance of FWLAC depends on careful tuning of its two parameters, they proposed an elite soft cluster ensemble. Their procedure first converts soft clusters into crisp clusters and treats each cluster as a clustering; normalized mutual information (NMI) is then employed to assess each cluster [45].

3 Background

Before explaining the proposed approach, the general formulation of the data, the fuzzy clustering ensemble, and the entropy definitions are introduced as follows:

Definition 1. A data object is a tuple $(d_{i}^{1}, d_{i}^{2}, \dots, d_{i}^{N})$, denoted $\vec{d_{i}}$, where $d_{i}^{j}$ is the j-th feature of the i-th data object and $d_{:}^{j}$ is the j-th feature over the whole dataset. N is the number of features, $N = | d_{1}^{:} |$, and M is the number of data objects, $M = | d_{:}^{1} |$.

Definition 2. A fuzzy clustering of a dataset d is a two-dimensional matrix of size M × c, where $M = | d_{:}^{1} |$ and c is the number of clusters, denoted π (d), such that: $\forall i \in \{1, \dots, M\}, \forall j \in \{1, \dots, c\} : π {(\vec{d_{i}})}^{j} \in [0, 1]$ (1) and $\forall i : \sum_{j = 1}^{c} π {(\vec{d_{i}})}^{j} = 1$ (2) where $π {(\vec{d_{i}})}^{j}$ is the membership degree with which the i-th data object belongs to the j-th cluster.

Definition 3. A clustering ensemble consists of β base clusterings defined as: $Π = \{π^{1}, \dots, π^{β}\}$ (3) where $π^{m} = \{C_{1}^{m}, \dots, C_{n^{m}}^{m}\}$ (4) Here, π^m is the m-th base clustering in Π, $C_{i}^{m}$ is the i-th cluster in clustering π^m, and n^m is the number of clusters in π^m.

To sum up, the set of all clusters in the ensemble is $C = \{C_{1}^{1}, \dots, C_{n^{β}}^{β}\}$ (5) where $C_{i}^{j}$ is the i-th cluster of clustering π^j. Thus the number of all clusters in the base clusterings (c) is computed as: $c = n^{1} + \dots + n^{β}$ (6)

Definition 4. The β base clusterings π^m are summarized in an ensemble matrix. The ensemble matrix E, of size M × c, is expressed as: $E_{i, j}^{π_{:} (x)} = π^{m} {(\vec{x_{i}})}^{g}$ (7) where i and j are the data object index and the cluster index in E, respectively, with j = (m - 1) k + g when every base clustering has k clusters; g and m are the cluster index and the clustering index in Π, respectively, with g ≤ n^m; and c is the number of all clusters in the base clusterings.

Example 1. Three fuzzy clusterings π¹, π² and π³ (β = 3) on dataset x with 6 data objects (M = 6) are shown in Table 1.

Table 1

Three fuzzy clusterings π¹, π² and π³

	π¹			π²			π³		
	c₁	c₂	c₃	c₁	c₂	c₃	c₁	c₂	c₃
x₁	0.1	0.7	0.2	0.6	0.3	0.1	0.7	0.2	0.1
x₂	0.0	0.8	0.2	0.8	0.2	0.0	0.9	0.1	0.0
x₃	0.1	0.4	0.5	0.5	0.5	0.0	0.9	0.0	0.1
x₄	0.1	0.2	0.7	0.2	0.7	0.1	0.2	0.6	0.2
x₅	0.0	0.1	0.9	0.1	0.2	0.7	0.1	0.9	0.0
x₆	0.8	0.1	0.1	0.1	0.3	0.6	0.0	0.2	0.8

Table 2

The ensemble E of three fuzzy clusterings π¹, π² and π³

	c₁	c₂	c₃	c₄	c₅	c₆	c₇	c₈	c₉
x₁	0.1	0.7	0.2	0.6	0.3	0.1	0.7	0.2	0.1
x₂	0.0	0.8	0.2	0.8	0.2	0.0	0.9	0.1	0.0
x₃	0.1	0.4	0.5	0.5	0.5	0.0	0.9	0.0	0.1
x₄	0.1	0.2	0.7	0.2	0.7	0.1	0.2	0.6	0.2
x₅	0.0	0.1	0.9	0.1	0.2	0.7	0.1	0.9	0.0
x₆	0.8	0.1	0.1	0.1	0.3	0.6	0.0	0.2	0.8
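The construction of the ensemble matrix in Definition 4 can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `is_fuzzy_partition` and `ensemble_matrix` are hypothetical, the membership values are copied from the columns of E in Table 2, and the index formula assumes that every base clustering has the same number of clusters k.

```python
# Membership matrices of the three base clusterings (rows: x1..x6),
# copied from the columns of the ensemble matrix E in Table 2.
pi1 = [[0.1, 0.7, 0.2], [0.0, 0.8, 0.2], [0.1, 0.4, 0.5],
       [0.1, 0.2, 0.7], [0.0, 0.1, 0.9], [0.8, 0.1, 0.1]]
pi2 = [[0.6, 0.3, 0.1], [0.8, 0.2, 0.0], [0.5, 0.5, 0.0],
       [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]
pi3 = [[0.7, 0.2, 0.1], [0.9, 0.1, 0.0], [0.9, 0.0, 0.1],
       [0.2, 0.6, 0.2], [0.1, 0.9, 0.0], [0.0, 0.2, 0.8]]


def is_fuzzy_partition(pi, tol=1e-9):
    """Check constraints (1) and (2): entries in [0, 1], rows summing to 1."""
    return all(
        all(0.0 <= u <= 1.0 for u in row) and abs(sum(row) - 1.0) <= tol
        for row in pi
    )


def ensemble_matrix(partitions):
    """Concatenate the membership matrices column-wise (Definition 4)."""
    n_objects = len(partitions[0])
    return [[u for pi in partitions for u in pi[i]] for i in range(n_objects)]


E = ensemble_matrix([pi1, pi2, pi3])
m, g, k = 2, 3, 3            # clustering 2, cluster 3, k = 3 clusters each
j = (m - 1) * k + g          # 1-based column index in E
print(len(E), len(E[0]))     # 6 9
print(E[4][j - 1])           # membership of x5 in cluster 3 of clustering 2: 0.7
```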

Definition 5. For a discrete random variable X, the entropy H (X) is expressed as: $H (X) = - \sum_{x \in X} p (x) {\log}_{2} p (x)$ (8) where X is the set of values that x can take, and p (x) is the probability mass function of X.

The equality H (X, Y) = H (X) + H (Y) holds if and only if X and Y are independent. Thus, given n independent random variables X₁, …, X_n, the following holds: $H (X_{1}, \dots, X_{n}) = H (X_{1}) + \dots + H (X_{n})$ (9)
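The additivity property in Equation (9) can be checked numerically. The sketch below uses two hypothetical independent variables with made-up probability mass functions; the function name `entropy` is an illustrative choice.

```python
import math
from itertools import product


def entropy(pmf):
    """Shannon entropy in bits (Equation (8)) of a probability mass function."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


# Two hypothetical independent random variables.
px = [0.5, 0.5]
py = [0.25, 0.75]
# Under independence, the joint pmf is the product of the marginals.
pxy = [a * b for a, b in product(px, py)]

print(entropy(px))                                              # 1.0
print(abs(entropy(pxy) - (entropy(px) + entropy(py))) < 1e-12)  # True
```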

4 Proposed approach

In this paper, a new fuzzy clustering ensemble approach based on ensemble-driven cluster undependability estimation and a local weighting strategy is proposed. The steps of the proposed approach are shown in Fig. 1 (after green boxes). In the first step, the acceptability of each cluster is computed (Section 4.1). In the second step, the undependability of each cluster is approximated (Section 4.2). In the third step, the cluster weights are computed (Section 4.3). In the fourth step, the fuzzy weighted co-association matrix is computed (Section 4.4), and in the fifth step, the consensus clustering is obtained (Section 4.5).

Fig. 1

Proposed approach steps.

4.1 Cluster acceptability

The first step in Fig. 1 is to compute the acceptability of each cluster in relation to the clusters of the other clusterings, i.e., the degree of agreement between two clusters in different clusterings, which is obtained according to Definition 6. We measure the cluster acceptability with the ‘similarity between fuzzy sets’ proposed by Zheng [46], which measures the agreement between two fuzzy sets.

Definition 6. The acceptability of cluster $C_{i}^{ki}$ (cluster C_i ∈ π^ki) in relation to cluster $C_{j}^{kj}$ (cluster C_j ∈ π^kj), where i ≠ j, is computed through Equation (10):

$Accp (C_{i}^{ki}, C_{j}^{kj}) = \frac{\sum_{t = 1}^{M} \min (1 - π^{ki} {(\vec{x_{t}})}^{i} + π^{kj} {(\vec{x_{t}})}^{j}, 1 + π^{ki} {(\vec{x_{t}})}^{i} - π^{kj} {(\vec{x_{t}})}^{j})}{\sum_{t = 1}^{M} \max (1 - π^{ki} {(\vec{x_{t}})}^{i} + π^{kj} {(\vec{x_{t}})}^{j}, 1 + π^{ki} {(\vec{x_{t}})}^{i} - π^{kj} {(\vec{x_{t}})}^{j})}$ (10)

The value of the cluster acceptability (Accp) computed from (10) lies in the range [0, 1]. For instance, for the columns c₁ and c₂ of Table 2, the minima sum to 3.4 and the maxima sum to 8.6, which gives Accp = 3.4 / 8.6 ≈ 0.3953.

Example 2. The Accp values of the fuzzy clusterings of Example 1 (Table 1) are shown in Table 3.

Table 3

The values of Accp corresponding to the fuzzy clusterings presented in Table 1

	c₁	c₂	c₃	c₄	c₅	c₆	c₇	c₈	c₉
c₁	1.0000	0.3953	0.3483	0.3953	0.4815	0.7143	0.2903	0.4458	0.9672
c₂	0.3953	1.0000	0.4118	0.9355	0.5190	0.3333	0.7910	0.3483	0.4118
c₃	0.3483	0.4118	1.0000	0.4458	0.7143	0.4815	0.3333	0.7647	0.3636
c₄	0.3953	0.9355	0.4458	1.0000	0.5584	0.3333	0.7910	0.3483	0.4118
c₅	0.4815	0.5190	0.7143	0.5584	1.0000	0.4458	0.4286	0.5789	0.5000
c₆	0.7143	0.3333	0.4815	0.3333	0.4458	1.0000	0.2371	0.6438	0.6901
c₇	0.2903	0.7910	0.3333	0.7910	0.4286	0.2371	1.0000	0.2500	0.3043
c₈	0.4458	0.3483	0.7647	0.3483	0.5789	0.6438	0.2500	1.0000	0.4634
c₉	0.9672	0.4118	0.3636	0.4118	0.5000	0.6901	0.3043	0.4634	1.0000
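The acceptability measure can be sketched as follows, assuming that Accp is the ratio of the summed minima to the summed maxima over all data objects, which is the reading that agrees with the entries of Table 3. The function name `accp` and the variable names are illustrative; the membership columns are taken from the ensemble matrix E in Table 2.

```python
def accp(a, b):
    """Acceptability of two fuzzy clusters given as membership columns.

    Assumed form: sum of min(1 - a_t + b_t, 1 + a_t - b_t) divided by
    the sum of max(1 - a_t + b_t, 1 + a_t - b_t) over all data objects t.
    """
    num = sum(min(1 - x + y, 1 + x - y) for x, y in zip(a, b))
    den = sum(max(1 - x + y, 1 + x - y) for x, y in zip(a, b))
    return num / den


# Columns c1, c2, c4 and c9 of the ensemble matrix E (Table 2).
c1 = [0.1, 0.0, 0.1, 0.1, 0.0, 0.8]
c2 = [0.7, 0.8, 0.4, 0.2, 0.1, 0.1]
c4 = [0.6, 0.8, 0.5, 0.2, 0.1, 0.1]
c9 = [0.1, 0.0, 0.1, 0.2, 0.0, 0.8]

print(round(accp(c1, c2), 4))  # 0.3953
print(round(accp(c2, c4), 4))  # 0.9355
print(round(accp(c1, c9), 4))  # 0.9672
print(round(accp(c1, c1), 4))  # 1.0
```

The measure is symmetric in its two arguments and equals 1 only when the two membership columns coincide.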

4.2 Cluster undependability

Because we assume that no knowledge of the original data object features is available, the dependability of each cluster is evaluated with an entropy measure applied to the membership degrees of the data objects across the entire ensemble. The second step in Fig. 1 estimates the undependability of the clusters based on an entropic criterion in the ensemble. It is obtained without knowledge of the original data features and without any assumption on the data distribution. Entropy serves here as a measure of cluster undependability, associated with a random variable, where every cluster consists of a set of data objects. Given a cluster $C_{i}^{ki}$ and a base clustering π^kj ∈ Π, the membership degrees of the data objects in C_i may differ from their membership degrees in the clusters of π^kj; that is, the membership degrees of the data objects in C_i may not be accepted by all clusters in π^kj. The undependability (or entropy) of C_i with regard to π^kj is obtained from the manner in which the data objects in C_i are clustered in π^kj and, following Definition 5, is approximated according to Definition 7.

Definition 7. The undependability of cluster $C_{i}^{ki}$ (cluster C_i in the clustering π^ki) with respect to clustering π^m (ki ≠ m) in the ensemble Π is denoted by $UnR (C_{i}^{ki}, π^{m})$ and is calculated as:

$UnR (C_{i}^{ki}, π^{m}) = \frac{1}{n^{m}} \sum_{j = 1}^{n^{m}} - Accpn (C_{i}^{ki}, C_{j}^{m}) \log_{2} Accpn (C_{i}^{ki}, C_{j}^{m})$ (11) where $Accpn (C_{i}^{ki}, C_{j}^{m})$ is the value of $Accp (C_{i}^{ki}, C_{j}^{m})$ normalized to the interval [0, 1], expressed by Equation (12)

$Accpn (C_{i}^{ki}, C_{j}^{m}) = \frac{Accp (C_{i}^{ki}, C_{j}^{m})}{\sum_{t = 1}^{n^{m}} Accp (C_{i}^{ki}, C_{t}^{m})}$ (12) where n^m is the number of clusters in π^m, $C_{j}^{m}$ is the j-th cluster in π^m, and $Accp (C_{i}^{ki}, C_{j}^{m})$ is given by Equation (10).

Example 3. The Accpn values for the fuzzy clusterings in Table 1 are computed from the Accp values in Table 3 according to Equation (12); the results are shown in Table 4. The corresponding UnR values with respect to each base clustering are then computed according to Equation (11), and the results are shown in Table 5.

Table 4

The values of Accpn corresponding to the Accp values presented in Table 3

	c₁	c₂	c₃	c₄	c₅	c₆	c₇	c₈	c₉
c₁	–	–	–	0.2485	0.3026	0.4489	0.1704	0.2617	0.5678
c₂	–	–	–	0.5233	0.2903	0.1864	0.5100	0.2246	0.2655
c₃	–	–	–	0.2716	0.4351	0.2933	0.2280	0.5232	0.2488
c₄	0.2225	0.5266	0.2509	–	–	–	0.5100	0.2246	0.2655
c₅	0.2808	0.3027	0.4166	–	–	–	0.2843	0.3840	0.3317
c₆	0.4671	0.2180	0.3149	–	–	–	0.1509	0.4098	0.4393
c₇	0.2052	0.5592	0.2356	0.5430	0.2942	0.1628	–	–	–
c₈	0.2860	0.2234	0.4906	0.2217	0.3685	0.4098	–	–	–
c₉	0.5550	0.2363	0.2087	0.2570	0.3121	0.4308	–	–	–
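The normalization of Equation (12) can be sketched as below. The helper names `accp` and `accpn` are illustrative, and `accp` assumes the ratio-of-sums form of the acceptability that agrees with Table 3. The columns are taken from the ensemble matrix E in Table 2.

```python
def accp(a, b):
    """Acceptability (assumed ratio-of-sums form) of two membership columns."""
    num = sum(min(1 - x + y, 1 + x - y) for x, y in zip(a, b))
    den = sum(max(1 - x + y, 1 + x - y) for x, y in zip(a, b))
    return num / den


def accpn(cluster, clustering):
    """Normalize the acceptabilities of `cluster` over the clusters of one
    base clustering so that they sum to 1 (Equation (12))."""
    values = [accp(cluster, other) for other in clustering]
    total = sum(values)
    return [v / total for v in values]


# Cluster c1 of the first clustering and the clusters c4..c6 of the second.
c1 = [0.1, 0.0, 0.1, 0.1, 0.0, 0.8]
c4 = [0.6, 0.8, 0.5, 0.2, 0.1, 0.1]
c5 = [0.3, 0.2, 0.5, 0.7, 0.2, 0.3]
c6 = [0.1, 0.0, 0.0, 0.1, 0.7, 0.6]

row = accpn(c1, [c4, c5, c6])
print([round(v, 4) for v in row])  # [0.2485, 0.3026, 0.4489]
```

The printed values correspond to the first row of Table 4 (columns c₄ to c₆).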

Table 5

The values of UnR corresponding to Accpn values presented in Table 4

$C_{1_{2}}^{1}$	$C_{1_{3}}^{1}$	$C_{2_{2}}^{1}$	$C_{2_{3}}^{1}$	$C_{3_{2}}^{1}$	$C_{3_{3}}^{1}$	$C_{1_{1}}^{2}$	$C_{1_{3}}^{2}$	$C_{2_{1}}^{2}$
0.3557	0.3409	0.3370	0.3385	0.3586	0.3530	0.2203	0.3348	0.2289
$C_{2_{3}}^{2}$	$C_{3_{1}}^{2}$	$C_{3_{3}}^{2}$	$C_{1_{1}}^{3}$	$C_{1_{2}}^{3}$	$C_{2_{1}}^{3}$	$C_{2_{2}}^{3}$	$C_{3_{1}}^{3}$	$C_{3_{2}}^{3}$
0.2394	0.2276	0.2016	0.3290	0.2185	0.3558	0.3670	0.3585	0.3643

We assume that the basic partitions of the ensemble are mutually independent [47]. Hence, according to Equation (9), the undependability of C_i with respect to the ensemble Π can be obtained by summing the undependabilities of C_i with respect to the basic partitions in Π, as given by Equation (13).

Definition 8. The undependability of cluster $C_{i}^{ki}$ in the ensemble Π is defined as: $UnR^{Π} (C_{i}^{ki}) = \sum_{j = 1, j \neq ki}^{β} UnR (C_{i}^{ki}, π^{j})$ (13)

Example 4. The UnR values with respect to all base clusterings (UnR^Π) corresponding to Example 3 (Table 5) are computed according to Equation (13), and the results are shown in Table 6.

Table 6

The values of UnR^Π corresponding to UnR values presented in Table 5

$Un R^{Π} (C_{1}^{1})$	$Un R^{Π} (C_{2}^{1})$	$Un R^{Π} (C_{3}^{1})$	$Un R^{Π} (C_{1}^{2})$	$Un R^{Π} (C_{2}^{2})$
0.6967	0.6756	0.7116	0.5551	0.4683
$Un R^{Π} (C_{3}^{2})$	$Un R^{Π} (C_{1}^{3})$	$Un R^{Π} (C_{2}^{3})$	$Un R^{Π} (C_{3}^{3})$
0.4292	0.5475	0.7228	0.7228
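Equations (11) to (13) can be sketched as follows. This is only a sketch under stated assumptions: the function names are illustrative, `accp` uses the ratio-of-sums form that agrees with Table 3, and the logarithm is passed as a parameter because Equation (11) is written with base 2 while the first entry of Table 5 (0.3557) is reproduced with the natural logarithm. The sum in `unr_ensemble` runs over the other base clusterings only, following the condition ki ≠ m of Definition 7.

```python
import math


def accp(a, b):
    """Acceptability (assumed ratio-of-sums form) of two membership columns."""
    num = sum(min(1 - x + y, 1 + x - y) for x, y in zip(a, b))
    den = sum(max(1 - x + y, 1 + x - y) for x, y in zip(a, b))
    return num / den


def accpn(cluster, clustering):
    """Normalized acceptabilities over one base clustering (Equation (12))."""
    values = [accp(cluster, other) for other in clustering]
    total = sum(values)
    return [v / total for v in values]


def unr(cluster, clustering, log=math.log2):
    """Undependability w.r.t. one base clustering (Equation (11))."""
    probs = accpn(cluster, clustering)
    return sum(-p * log(p) for p in probs if p > 0) / len(clustering)


def unr_ensemble(cluster, other_clusterings, log=math.log2):
    """Undependability w.r.t. the ensemble (Equation (13))."""
    return sum(unr(cluster, pi, log) for pi in other_clusterings)


# Columns of the ensemble matrix E (Table 2).
c1 = [0.1, 0.0, 0.1, 0.1, 0.0, 0.8]
pi2 = [[0.6, 0.8, 0.5, 0.2, 0.1, 0.1],   # c4
       [0.3, 0.2, 0.5, 0.7, 0.2, 0.3],   # c5
       [0.1, 0.0, 0.0, 0.1, 0.7, 0.6]]   # c6
pi3 = [[0.7, 0.9, 0.9, 0.2, 0.1, 0.0],   # c7
       [0.2, 0.1, 0.0, 0.6, 0.9, 0.2],   # c8
       [0.1, 0.0, 0.1, 0.2, 0.0, 0.8]]   # c9

print(f"{unr(c1, pi2, math.log):.4f}")   # 0.3557 (natural logarithm)
print(f"{unr(c1, pi2):.4f}")             # 0.5132 (base-2 logarithm)
print(unr_ensemble(c1, [pi2, pi3], math.log))
```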

4.3 Cluster dependability

The third step in Fig. 1 computes the dependabilities of all clusters of the ensemble. Once the undependability of each cluster has been estimated with respect to the whole ensemble, its dependability is obtained through a dependability driven cluster indicator (DDCI).

Definition 9. For an ensemble Π with β base clusterings, the weight of a cluster C_i in the clustering ensemble (i.e., the dependability driven cluster indicator, DDCI) is defined as $DDCI (C_{i}^{ki}, Π) = e^{- \frac{UnR^{Π} (C_{i}^{ki})}{β Ø}}$ (14) where the parameter Ø > 0 tunes the effect of the undependability on the cluster weight. Based on empirical results, the best results are obtained when Ø is a real value within the interval [0.3, 8].

Since UnR^Π(C_i) ∈ [0, +∞), it holds that DDCI(C_i) ∈ (0, 1] for any cluster C_i in Π. When the undependability of a cluster C_i is minimal (UnR^Π(C_i) = 0), its DDCI is maximal (DDCI(C_i) = 1). DDCI(C_i) is thus taken as the dependability of cluster C_i.
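The weighting of Equation (14) can be sketched as a one-line function, assuming that the exponent is the undependability divided by the product of β and the tuning parameter Ø. The function name `ddci`, the argument name `phi` (standing for Ø), and the value 0.5 are illustrative choices, not values taken from the experiments.

```python
import math


def ddci(unr_total, beta, phi):
    """Dependability driven cluster indicator, assuming the form
    exp(-UnR / (beta * phi)) for Equation (14)."""
    if phi <= 0:
        raise ValueError("phi must be positive")
    return math.exp(-unr_total / (beta * phi))


# A cluster with zero undependability receives the maximal weight.
print(ddci(0.0, beta=3, phi=0.5))                                   # 1.0
# A larger undependability yields a smaller weight.
print(ddci(0.6967, beta=3, phi=0.5) > ddci(0.7228, beta=3, phi=0.5))  # True
```

Under this assumed form, a larger Ø flattens the weights towards 1, while a smaller Ø penalizes undependable clusters more strongly.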

Example 5. The DDCI values corresponding to Table 6 are computed according to Equation (14), and the results are shown in Table 7.

Table 7

The values of DDCI corresponding to the UnR^Π values in Table 6

$C_{1}^{1}$	$C_{2}^{1}$	$C_{3}^{1}$	$C_{1}^{2}$	$C_{2}^{2}$	$C_{3}^{2}$	$C_{1}^{3}$	$C_{2}^{3}$	$C_{3}^{3}$
0.4232	0.4042	0.4256	0.4172	0.4271	0.4272	0.4221	0.4152	0.4387

4.4 WCoM

As can be seen in Fig. 1, the fourth step computes the co-association matrix with regard to the dependability of each cluster in the ensemble. One of the most common methods used to combine the base clusterings is the co-association matrix based method known as Evidence Accumulation Clustering (EAC), which was first proposed by Fred and Jain [18]. EAC maps the individual clusterings of the data objects in a clustering ensemble into a new pairwise similarity measure.

Unlike in the general crisp evidence accumulation method, a sample does not belong to any basic soft cluster absolutely, so the entries of the CoM cannot be obtained by counting how many times a pair of instances is assigned to a shared cluster. In soft clustering, each instance belongs to each cluster with a membership degree. Therefore, we need another way to evaluate the strength of association between data objects.

Definition 10. The soft CoM is expressed as: $FCo_{i, j}^{π_{:} (x)} = \frac{1}{β} \sum_{k = 1}^{β} \sup_{t = 1}^{n^{k}} \left( \inf (π^{k} {(\vec{x_{i}})}^{t}, π^{k} {(\vec{x_{j}})}^{t}) \right)$ (15) where $\vec{x_{i}}$ and $\vec{x_{j}}$ are data objects and the supremum aggregates over the n^k clusters of the k-th clustering. If inf(x, y) = xy and sup(x, y) = x + y, then $FCo^{π_{:} (x)} = \frac{1}{β} E^{π_{:} (x)} \times {E^{π_{:} (x)}}^{T}$ (16)

As mentioned in Definition 10, the co-association matrix reflects the strength of association between data objects. To take the dependability of each cluster into account, DDCI is used as a multiplier (weight) in the computation of the co-association matrix, which leads to the weighted fuzzy co-association matrix of Definition 11.

Definition 11. The weighted fuzzy co-association clustering ensemble matrix (WFCo) is defined as:

$WFCo_{i, j}^{π_{:} (x)} = \frac{1}{β} \sum_{k = 1}^{β} \sup_{t = 1}^{n^{k}} \left( DDCI (C_{t}^{k}, Π) * \inf (π^{k} {(\vec{x_{i}})}^{t}, π^{k} {(\vec{x_{j}})}^{t}) \right)$ (17)

Example 6. The WFCo matrix of the fuzzy cluster ensemble of Example 1 (Table 1), based on the UnR^Π values in Table 6, is shown in Table 8. With β = 3, we have: $DDCI (C_{1}^{1}, Π) = 0.4232,$ $DDCI (C_{2}^{1}, Π) = 0.4042,$ $DDCI (C_{3}^{1}, Π) = 0.4256,$ $DDCI (C_{1}^{2}, Π) = 0.4172,$ $DDCI (C_{2}^{2}, Π) = 0.4271,$ $DDCI (C_{3}^{2}, Π) = 0.4272,$ $DDCI (C_{1}^{3}, Π) = 0.4221,$ $DDCI (C_{2}^{3}, Π) = 0.4152,$ $DDCI (C_{3}^{3}, Π) = 0.4387,$ $π^{1} {(\vec{x_{1}})}^{1} = 0.1, π^{1} {(\vec{x_{2}})}^{1} = 0.0,$ $π^{1} {(\vec{x_{1}})}^{2} = 0.7, π^{1} {(\vec{x_{2}})}^{2} = 0.8,$ $π^{1} {(\vec{x_{1}})}^{3} = 0.2, π^{1} {(\vec{x_{2}})}^{3} = 0.2,$ $π^{2} {(\vec{x_{1}})}^{1} = 0.6, π^{2} {(\vec{x_{2}})}^{1} = 0.8,$ $π^{2} {(\vec{x_{1}})}^{2} = 0.3, π^{2} {(\vec{x_{2}})}^{2} = 0.2,$ $π^{2} {(\vec{x_{1}})}^{3} = 0.1, π^{2} {(\vec{x_{2}})}^{3} = 0.0,$ $π^{3} {(\vec{x_{1}})}^{1} = 0.7, π^{3} {(\vec{x_{2}})}^{1} = 0.9,$ $π^{3} {(\vec{x_{1}})}^{2} = 0.2, π^{3} {(\vec{x_{2}})}^{2} = 0.1,$ $π^{3} {(\vec{x_{1}})}^{3} = 0.1$ and $π^{3} {(\vec{x_{2}})}^{3} = 0.0 .$

According to Definition 11, $\begin{array}{l} WFCo_{1, 2} = \frac{1}{3} ( \sup [DDCI (C_{1}^{1}, Π) * \inf (π^{1} (\vec{x_{1}})^{1}, π^{1} (\vec{x_{2}})^{1}), DDCI (C_{2}^{1}, Π) * \inf (π^{1} (\vec{x_{1}})^{2}, π^{1} (\vec{x_{2}})^{2}), DDCI (C_{3}^{1}, Π) * \inf (π^{1} (\vec{x_{1}})^{3}, π^{1} (\vec{x_{2}})^{3})] \\ + \sup [DDCI (C_{1}^{2}, Π) * \inf (π^{2} (\vec{x_{1}})^{1}, π^{2} (\vec{x_{2}})^{1}), DDCI (C_{2}^{2}, Π) * \inf (π^{2} (\vec{x_{1}})^{2}, π^{2} (\vec{x_{2}})^{2}), DDCI (C_{3}^{2}, Π) * \inf (π^{2} (\vec{x_{1}})^{3}, π^{2} (\vec{x_{2}})^{3})] \\ + \sup [DDCI (C_{1}^{3}, Π) * \inf (π^{3} (\vec{x_{1}})^{1}, π^{3} (\vec{x_{2}})^{1}), DDCI (C_{2}^{3}, Π) * \inf (π^{3} (\vec{x_{1}})^{2}, π^{3} (\vec{x_{2}})^{2}), DDCI (C_{3}^{3}, Π) * \inf (π^{3} (\vec{x_{1}})^{3}, π^{3} (\vec{x_{2}})^{3})] ) \end{array}$

Substituting the values into the above equation, with inf(x, y) = xy and sup(x, y) = x + y as in Definition 10, we have $\begin{array}{l} WFCo_{1, 2} = \frac{1}{3} ( \sup [0.4232 * \inf (0.1, 0.0), 0.4042 * \inf (0.7, 0.8), 0.4256 * \inf (0.2, 0.2)] \\ + \sup [0.4172 * \inf (0.6, 0.8), 0.4271 * \inf (0.3, 0.2), 0.4272 * \inf (0.1, 0.0)] \\ + \sup [0.4221 * \inf (0.7, 0.9), 0.4152 * \inf (0.2, 0.1), 0.4387 * \inf (0.1, 0.0)] ) \\ = \frac{1}{3} (0.2434 + 0.2259 + 0.2742) = 0.2478 \end{array}$

In a similar way, the other entries of WFCo matrix can be obtained (see Table 8).

Table 8

The WFCo matrix of fuzzy clustering ensemble presented in Table 1

	x₁	x₂	x₃	x₄	x₅	x₆
x₁	–	0.2478	0.2115	0.1274	0.0966	0.0700
x₂	0.2478	–	0.2481	0.1184	0.0788	0.0364
x₃	0.2115	0.2481	–	0.1520	0.1020	0.0626
x₄	0.1274	0.1184	0.1520	–	0.1997	0.1027
x₅	0.0966	0.0788	0.1020	0.1997	–	0.1076
x₆	0.0700	0.0364	0.0626	0.1027	0.1076	–
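The worked entry WFCo₁,₂ can be reproduced with a short sketch. It assumes inf(x, y) = xy and sup(x, y) = x + y, so that the weighted co-association reduces to a weighted inner product of rows of E divided by β. The function name `wfco` is illustrative; E is taken from Table 2 and the weights from Table 7.

```python
# Ensemble matrix E (Table 2): rows x1..x6, columns c1..c9.
E = [
    [0.1, 0.7, 0.2, 0.6, 0.3, 0.1, 0.7, 0.2, 0.1],
    [0.0, 0.8, 0.2, 0.8, 0.2, 0.0, 0.9, 0.1, 0.0],
    [0.1, 0.4, 0.5, 0.5, 0.5, 0.0, 0.9, 0.0, 0.1],
    [0.1, 0.2, 0.7, 0.2, 0.7, 0.1, 0.2, 0.6, 0.2],
    [0.0, 0.1, 0.9, 0.1, 0.2, 0.7, 0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1, 0.1, 0.3, 0.6, 0.0, 0.2, 0.8],
]
# DDCI weights of the nine clusters (Table 7).
weights = [0.4232, 0.4042, 0.4256, 0.4172, 0.4271,
           0.4272, 0.4221, 0.4152, 0.4387]
beta = 3


def wfco(E, weights, beta):
    """Weighted fuzzy co-association matrix (Equation (17)), assuming
    inf(x, y) = x * y and sup(x, y) = x + y."""
    n = len(E)
    return [
        [sum(w * E[i][t] * E[j][t] for t, w in enumerate(weights)) / beta
         for j in range(n)]
        for i in range(n)
    ]


W = wfco(E, weights, beta)
print(round(W[0][1], 4))  # 0.2478
```

With all weights set to 1, the same function yields the unweighted matrix of Equation (16).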

4.5 Consensus function

The final step in Fig. 1 computes the final clustering (consensus clustering). The process of extracting the final clustering from the co-association matrix using the EAFC method (Definition 11) is called the consensus function. To obtain it, we first define the objective function from which the final clustering is derived in Section 4.5.1, and then explain its solution in Section 4.5.2.

4.5.1 Objective function

A consensus function derives the final fuzzy clustering π^* from the ensemble Π by solving Equation (18). We first formalize the problem of finding π^* as an objective function. The objective function must take into account both the diversity and the dependability of the fuzzy clusters in the ensemble, which are summarized in the weighted fuzzy co-association matrix (WFCo). We look for a final fuzzy clustering π^* whose co-association matrix is approximately equal to WFCo. According to Equation (16), the co-association matrix of π^* is π^* × π^*′. Hence we minimize the squared error between WFCo and the co-association matrix of π^*, as formalized in Equation (18). Because π^* is a fuzzy clustering, each of its elements must lie in [0, 1], which gives constraint (19). In addition, the memberships of a data object in all clusters of π^* must sum to 1, which gives constraint (20). $\text{minimize} \sum_{k = 1}^{M} \sum_{l = 1}^{M} {\left( WFCo_{kl} - \sum_{j = 1}^{K} π^{*}_{kj} \times π^{*}_{lj} \right)}^{2}$ (18)

Subject to: $\forall k \in \{1, \dots, M\}, j \in \{1, \dots, K\}: 0 \leq π^{*}_{kj} \leq 1$ (19) $\forall k \in \{1, \dots, M\}: \sum_{j = 1}^{K} π^{*}_{kj} = 1$ (20) where WFCo is the fuzzy co-association matrix of the base clusterings Π, π^* is the final clustering matrix, M is the number of data objects, and K is the number of clusters in the final clustering. Note that π^* is an M × K membership matrix: rows correspond to data objects, columns correspond to clusters, and each element is the membership degree of a data object in a particular cluster.
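To make the objective and its constraints concrete, the following minimal Python sketch evaluates the squared error of Equation (18) and checks constraints (19) and (20) for a candidate membership matrix. The function names `objective` and `is_feasible`, the tolerance, and the toy matrices are illustrative assumptions and are not taken from the paper.

```python
def objective(wfco, pi):
    """Squared error between WFCo and the co-association pi * pi' (Equation (18))."""
    m = len(pi)
    total = 0.0
    for k in range(m):
        for l in range(m):
            # (pi * pi')_{kl} is the inner product of rows k and l of pi.
            co = sum(a * b for a, b in zip(pi[k], pi[l]))
            total += (wfco[k][l] - co) ** 2
    return total


def is_feasible(pi, tol=1e-9):
    """Check constraints (19) and (20): entries in [0, 1], rows summing to 1."""
    return all(
        all(-tol <= v <= 1 + tol for v in row) and abs(sum(row) - 1.0) <= tol
        for row in pi
    )


# Toy data (hypothetical, not from the paper): 3 data objects, 2 clusters.
pi = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# This WFCo was built as pi * pi', so the error of pi itself is zero.
wfco = [[1.0, 0.5, 0.0], [0.5, 0.5, 0.5], [0.0, 0.5, 1.0]]

print(objective(wfco, pi))
print(is_feasible(pi))
```

Any membership matrix whose co-association differs from WFCo yields a strictly positive value, which is the quantity the solver drives down.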

4.5.2 Problem solver

The proposed solver, named FCESQP (Fuzzy Clustering Ensemble by Sequential Quadratic Programming), is introduced in this section. As mentioned in the previous section, the cluster ensemble problem is formulated as an optimization problem.

The goal of the optimization problem is to minimize the soft clustering ensemble objective function. This indirectly yields a clustering in which the dependability between the basic soft clusters is maximized. Any nonlinear optimization solver can be applied to the proposed model. Sequential quadratic programming (SQP) methods have proved highly effective for constrained optimization problems with smooth nonlinear functions in the objective and constraints [48, 49]. Our objective function is nonlinear and its constraints are linear. Because the coefficient matrix of the constraints is sparse, sparse SQP (SSQP) is applied to solve our optimization problem.

The SSQP algorithm is fully described by Gill, Murray and Saunders [50]. It employs a sparse sequential quadratic programming (SQP) algorithm with limited-memory quasi-Newton approximations to the Hessian of the Lagrangian. It is especially effective for nonlinear problems whose functions and gradients are expensive to evaluate. The functions should be smooth but need not be convex. SSQP is suitable for large-scale general nonlinear programs of the form $\min_{x} f_{0}(x)$ (21) $\text{subject to: } Ax = a$ (22) $x_{l} \leq x \leq x_{u}$ (23) where f₀(x) is a smooth scalar objective function, A is a sparse matrix holding the coefficients of the constraints, a is the constant right-hand-side vector, and x_l and x_u are the constant lower and upper bounds of the variable x. We map the objective function (18) with constraints (19) and (20) to the form of Equations (21)–(23) as follows:

First, the matrix π^* is transformed into the vector x (containing M × K scalar variables) according to Equation (24). $i = 1, \dots, M, \; j = 1, \dots, K: \; x_{t} = π^{*}_{ij}$ (24) where t = (i − 1) × K + j, so that the rows of π^* are stacked one after another, as in Example 7.

After this transformation, the (M × K) × 1 vector x_l is set to zero, the (M × K) × 1 vector x_u is set to one, the M × 1 vector a is set to one, and the M × (M × K) sparse matrix A is defined according to Equation (25).

$\begin{matrix} i = 1, \dots, M, \; t = 1, \dots, M \times K \\ A_{it} = \begin{cases} 1 & \text{if } (i - 1) \times K < t \leq i \times K \\ 0 & \text{otherwise} \end{cases} \end{matrix}$ (25)

Example 7. Suppose K = 3. The structure of the final clustering (a 6 × 3 matrix π^*) for the fuzzy base clusterings of Example 1, whose WFCo is shown in Table 8, is shown in Table 9. This matrix is transformed into the 18 × 1 vector x by Equation (24), which gives x = ( $π^{*}_{11}$ , $π^{*}_{12}$ , $π^{*}_{13}$ , $π^{*}_{21}$ , $π^{*}_{22}$ , $π^{*}_{23}$ , $π^{*}_{31}$ , $π^{*}_{32}$ , $π^{*}_{33}$ , $π^{*}_{41}$ , $π^{*}_{42}$ , $π^{*}_{43}$ , $π^{*}_{51}$ , $π^{*}_{52}$ , $π^{*}_{53}$ , $π^{*}_{61}$ , $π^{*}_{62}$ , $π^{*}_{63}$ ). The corresponding 6 × 18 matrix A is computed by Equation (25), and its values are shown in Table 10.

Table 9

The structure of the final clustering corresponding to the base clusterings in Table 1

$π^{*} = [\begin{matrix} π^{*}_{11} & π^{*}_{12} & π^{*}_{13} \\ π^{*}_{21} & π^{*}_{22} & π^{*}_{23} \\ π^{*}_{31} & π^{*}_{32} & π^{*}_{33} \\ π^{*}_{41} & π^{*}_{42} & π^{*}_{43} \\ π^{*}_{51} & π^{*}_{52} & π^{*}_{53} \\ π^{*}_{61} & π^{*}_{62} & π^{*}_{63} \end{matrix}]$

Table 10

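Coefficient matrix A related to Example 7

$[\begin{matrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \end{matrix}]$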
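The construction of the coefficient matrix can be sketched in a few lines of Python. The sketch assumes the row-by-row ordering of x used in Example 7 (first all memberships of x₁, then those of x₂, and so on); the helper names `flat_index` and `build_A` are hypothetical, and a dense list of lists is used only for readability, whereas a real solver would store A in a sparse format.

```python
def flat_index(i, j, K):
    """1-based position t of the entry (i, j) of pi* in x (row-by-row order)."""
    return (i - 1) * K + j


def build_A(M, K):
    """0/1 matrix with A[i][t] = 1 when x_t holds a membership of data object i."""
    A = [[0] * (M * K) for _ in range(M)]
    for i in range(1, M + 1):
        for j in range(1, K + 1):
            A[i - 1][flat_index(i, j, K) - 1] = 1
    return A


# Setting of Example 7: M = 6 data objects and K = 3 final clusters.
A = build_A(6, 3)
for row in A:
    print(" ".join(map(str, row)))
```

With this A and a vector a of ones, the equality A x = a states that the memberships of every data object sum to one, which is constraint (20).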

As the proposed algorithm is based on the SQP approach, we provide a short synopsis of that method here. Because the objective function f₀(x) in Equation (21) is nonlinear, we approximate it by a quadratic form using the first three terms of its Taylor series, as in Equation (26). $F (\bar{x}) = F (x) + g^{T} (x) (\bar{x} - x) + \frac{1}{2} {(\bar{x} - x)}^{T} H (x) (\bar{x} - x)$ (26) where H and g are the Hessian matrix and the gradient vector of the objective function and x is the current value of the variable. At each iteration x is fixed and $\bar{x}$ is the only variable in Equation (26), so F (x) is constant and can be dropped. The constraints of our problem are linear, so they need no approximation. The sub-problem is therefore rewritten as Equations (27)–(29). $F (\bar{x}) = g^{T} (x) (\bar{x} - x) + \frac{1}{2} {(\bar{x} - x)}^{T} H (x) (\bar{x} - x)$ (27) $A \bar{x} = a$ (28) $B \bar{x} \leq b$ (29) where the rows $b_{j}^{T}$ of B and the entries b_j of b describe the inequality constraints.
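For the objective of Equation (18), the gradient g that enters the quadratic model can be written in closed form. The short derivation below is an illustration and is not spelled out in the original formulation; the residual matrix E is shorthand introduced here, and a and c are index names chosen for a data object and a cluster.

```latex
E = WFCo - \pi^{*} \pi^{*\prime}, \qquad
f(\pi^{*}) = \sum_{k=1}^{M} \sum_{l=1}^{M} E_{kl}^{2}

\frac{\partial f}{\partial \pi^{*}_{ac}}
  = -2 \sum_{k=1}^{M} \sum_{l=1}^{M} E_{kl}
      \left( \delta_{ka} \, \pi^{*}_{lc} + \delta_{la} \, \pi^{*}_{kc} \right)
  = -2 \left[ (E + E^{\prime}) \, \pi^{*} \right]_{ac}
```

When WFCo is symmetric, E is symmetric as well and the gradient reduces to −4 E π^*. Stacking these entries in the order in which π^* is mapped to x gives the vector g.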

Equations (27)–(29) form a quadratic program, so we solve a sequence of quadratic programming (QP) sub-problems, one in each iteration. Since the problem has both equality and inequality constraints, the active-set method for QP is used to solve it as follows:

Start from an arbitrary point x₀, then obtain the next iterate by setting x_{k+1} = x_k + p_k d_k, where p_k is the step length and d_k is the search direction at iteration k.

At the current iterate x_k, determine the index set of the active inequality constraints: $A^{k} = \{ j \mid b_{j}^{T} x_{k} - b_{j} = 0, \; j = 1, \dots, m_{2} \}$ (30)

Then we find the direction d by solving Equation (31). $\min_{d} \left\{ g^{T} (x_{k} + d) + \frac{1}{2} {(x_{k} + d)}^{T} H (x_{k} + d) \right\}$ (31) subject to: $a_{i}^{T} (x_{k} + d) = a_{i}, \; i = 1, \dots, m_{1}$ (32) $b_{j}^{T} (x_{k} + d) = b_{j}, \; j \in A^{k}$ (33) where m₁ is the number of equality constraints and m₂ is the number of inequality constraints.

Expanding Equation (31), simplifying, and dropping the constant terms gives Equation (34); the right-hand sides of the constraints vanish because x_k is feasible. $\min_{d} \left\{ {(H x_{k} + g)}^{T} d + \frac{1}{2} d^{T} H d \right\}$ (34) subject to: $A d = 0$ (35) $\tilde{B} d = 0$ (36) $\text{where } \tilde{B} = [\begin{matrix} ⋮ \\ {b_{j}}^{T} \\ ⋮ \end{matrix}], \; j \in A^{k}$ (37)

The search direction d_k is therefore obtained by solving Equation (38): $\min_{d} \left\{ \frac{1}{2} d^{T} H d + {[g^{k}]}^{T} d \right\}$ (38) subject to: $A d = 0$ (39) $\tilde{B} d = 0$ (40) where g^k = Hx_k + g.

The Karush–Kuhn–Tucker (KKT) optimality conditions [35] lead to Equation (41): $[\begin{matrix} H & A^{T} & {\tilde{B}}^{T} \\ A & 0 & 0 \\ \tilde{B} & 0 & 0 \end{matrix}] [\begin{matrix} d \\ λ \\ \tilde{μ} \end{matrix}] = [\begin{matrix} - g^{k} \\ 0 \\ 0 \end{matrix}]$ (41) where λ and $\tilde{μ}$ are the Lagrange multipliers corresponding to the equality constraints and the active inequality constraints, respectively.

If d_k is a solution of QP, then there are λ_k and ${\tilde{μ}}_{k}$ such that $H d_{k} + g^{k} + A^{T} λ_{k} + {\tilde{B}}^{T} {\tilde{μ}}_{k} = 0$ (42) $A d_{k} = 0$ (43) $\tilde{B} d_{k} = 0$ (44)

There are two cases: either d_k=0 or d_k ≠ 0.

Case 1: d_k = 0. Equation (42) reduces to Equation (45). $g^{k} + A^{T} λ_{k} + {\tilde{B}}^{T} {\tilde{μ}}_{k} = 0$ (45)

Case 1-a: If ${\tilde{μ}}_{k} \geq 0$ , then x_{k+1} = x_k is a KKT point. Stop.

Case 1-b: If some components of ${\tilde{μ}}_{k}$ are negative, then x_k is not an optimal solution. Let $μ_{j_{o}} = \min \{ {\tilde{μ}}_{j} \mid {\tilde{μ}}_{j} < 0, \; j \in A^{k} \}$ . Remove the index j_o from A^k and solve the quadratic programming problem (46). $\min_{d} \left\{ \frac{1}{2} d^{T} H d + {[g^{k}]}^{T} d \right\}$ (46) subject to: $a_{i}^{T} d = 0$ (47) $b_{j}^{T} d = 0, \; j \in A^{k} ∖ \{ j_{o} \}$ (48)

The direction d_k obtained in this way is a descent direction for the QP.

Case 2: d_k ≠ 0. Determine a step length p_k that guarantees that x_k + p_k d_k is feasible for the QP.

A common choice of p_k that satisfies all constraints is $p_{k} = \min \left\{ 1, \frac{b_{j} - {b_{j}}^{T} x_{k}}{{b_{j}}^{T} d_{k}} \; \middle| \; j \notin A^{k} \text{ and } {b_{j}}^{T} d_{k} > 0 \right\}$ (49)

If p_k < 1, then $p_{k} = \frac{b_{j_{o}} - {b_{j_{o}}}^{T} x_{k}}{{b_{j_{o}}}^{T} d_{k}}$ for some j_o ∉ A^k. This implies that b_{j_o}^T (x_k + p_k d_k) = b_{j_o}. That is, the inequality constraint with index j_o becomes active and must be added to the active set: A^{k+1} = A^k ∪ { j_o }.
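The step-length rule of Case 2 is a ratio test over the inactive inequality constraints. The Python sketch below illustrates it under the assumption that the inequalities are stored row-wise as B x ≤ b; the function name `step_length` and the small two-variable example are hypothetical and only serve to show which constraint blocks the step.

```python
def step_length(x, d, B, b, active):
    """Largest p in [0, 1] that keeps B x <= b along the direction d.

    Returns (p, j_o), where j_o is the index of the blocking constraint,
    or None when the full step p = 1 is feasible.
    """
    p, blocking = 1.0, None
    for j, (row, bj) in enumerate(zip(B, b)):
        if j in active:
            continue
        slope = sum(r * di for r, di in zip(row, d))  # b_j' d
        if slope > 0:
            slack = bj - sum(r * xi for r, xi in zip(row, x))  # b_j - b_j' x
            ratio = slack / slope
            if ratio < p:
                p, blocking = ratio, j
    return p, blocking


# Hypothetical example: the bounds 0 <= x_t <= 1 written as B x <= b.
B = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b = [1, 1, 0, 0]
x = [0.5, 0.5]
d = [1.0, -0.25]
p, j_o = step_length(x, d, B, b, active=set())
print(p, j_o)  # the upper bound on the first variable blocks the step
```

After the step, the blocking constraint holds with equality at the new point, which is why its index is added to the active set.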

The SQP algorithm is summarized in Algorithm 1.

Algorithm 1: Consensus_clustering_SQP {

// Input: objective function (Equation (18)), number of final clusters K

// Output: final clustering π^*

1: Transform the final clustering membership matrix π^* to the vector form x by Equation (24);

2: Rewrite the objective function (18) in the general nonlinear form (21)–(23) using Equations (24) and (25);

3: Approximate the nonlinear objective function by its Taylor series to obtain the quadratic form (27);

4: Give a start vector x₀ as the initial solution for x;

5: Identify the active index set A⁰;

6: Set k = 0;

7: while (no convergence) {

8: Compute g^k = Hx_k + g;

9: Obtain d_k, λ_k and ${\tilde{μ}}_{k}$ by solving the KKT equations (41) for $\min_{d} \left\{ \frac{1}{2} d^{T} H d + {[g^{k}]}^{T} d \right\}$ subject to $a_{i}^{T} d = 0$ and $b_{j}^{T} d = 0, \; j \in A^{k}$ ;

10: if (d_k = 0) {

11: if ( ${\tilde{μ}}_{k} \geq 0$ ) {

12: STOP! x_k is a KKT point.

13: } else {

14: $μ_{j_{o}} = \min \{ {\tilde{μ}}_{j} \mid {\tilde{μ}}_{j} < 0, \; j \in A^{k} \}$ ;

15: Update the index set A^k ← A^k ∖ { j_o } and GOTO Step 9.

16: } // end if ${\tilde{μ}}_{k} \geq 0$

17: } // end if (d_k = 0)

18: if (d_k ≠ 0) {

19: Compute the step length as $p_{k} = \min \left\{ 1, \frac{b_{j} - {b_{j}}^{T} x_{k}}{{b_{j}}^{T} d_{k}} \; \middle| \; j \notin A^{k} \text{ and } {b_{j}}^{T} d_{k} > 0 \right\}$ ;

20: Update x_{k+1} = x_k + p_k d_k;

21: Update the active index set: if p_k = 1, then A^{k+1} = A^k; else A^{k+1} = A^k ∪ { j_o }, where $p_{k} = \frac{b_{j_{o}} - {b_{j_{o}}}^{T} x_{k}}{{b_{j_{o}}}^{T} d_{k}}$ with ${b_{j_{o}}}^{T} d_{k} > 0$ ;

22: Update k ← k + 1;

23: } // end if (d_k ≠ 0)

24: } // end while

25: Transform the vector x to the M × K matrix π^* by Equation (50) as the final clustering;

}
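To show what solving this constrained problem looks like in practice, the following Python sketch minimizes the objective of Equation (18) on a toy instance. It deliberately swaps in a much simpler technique, projected gradient descent with step halving, and is therefore not the active-set SQP of Algorithm 1 nor the SSQP solver of [50]; it only illustrates the same objective and constraints. All function names, the step-size rule, and the toy matrix W are assumptions made for this illustration.

```python
import random


def objective(W, pi):
    """Equation (18): squared error between W and the co-association pi * pi'."""
    m = len(pi)
    return sum((W[k][l] - sum(a * b for a, b in zip(pi[k], pi[l]))) ** 2
               for k in range(m) for l in range(m))


def gradient(W, pi):
    """Gradient of the objective: -2 (E + E') pi, with E = W - pi * pi'."""
    m, K = len(pi), len(pi[0])
    E = [[W[k][l] - sum(a * b for a, b in zip(pi[k], pi[l])) for l in range(m)]
         for k in range(m)]
    return [[-2.0 * sum((E[a][l] + E[l][a]) * pi[l][c] for l in range(m))
             for c in range(K)] for a in range(m)]


def project_row(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for r, ur in enumerate(u, start=1):
        css += ur
        t = (css - 1.0) / r
        if ur - t > 0:
            theta = t  # keeps the threshold of the largest valid r
    return [max(vi - theta, 0.0) for vi in v]


def solve(W, pi0, iters=200):
    """Projected gradient descent with step halving; returns (pi, objective)."""
    pi, f, step = [row[:] for row in pi0], objective(W, pi0), 1.0
    for _ in range(iters):
        G = gradient(W, pi)
        accepted = False
        while step > 1e-12:
            cand = [project_row([p - step * g for p, g in zip(prow, grow)])
                    for prow, grow in zip(pi, G)]
            f_cand = objective(W, cand)
            if f_cand <= f:  # accept only steps that do not increase the error
                accepted = True
                break
            step *= 0.5
        if not accepted:
            break
        pi, f = cand, f_cand
        step *= 2.0
    return pi, f


# Toy instance (hypothetical): W is the co-association of two crisp groups.
P = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
W = [[sum(a * b for a, b in zip(r, s)) for s in P] for r in P]
rng = random.Random(0)
pi0 = [project_row([rng.random() for _ in range(2)]) for _ in range(6)]
pi_star, f_star = solve(W, pi0)
print(round(f_star, 6))
```

Projecting each row onto the probability simplex enforces constraints (19) and (20) at every iterate, which plays the role that the bound and equality constraints play in the SQP sub-problems. A random start is used because the uniform membership matrix is a stationary point of this symmetric problem.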

The approximate Hessian matrix H is updated from iteration to iteration using one of the variable metric updating formulas [51]. Because the coefficient matrix of the constraints (Equation (25)) is sparse, the SQP algorithm exploits sparsity in the constraint Jacobian and maintains a limited-memory quasi-Newton approximation H_k to the Hessian of the Lagrangian [52].

Finally, the vector x is transformed back to the M × K matrix π^* by Equation (50).

$\begin{matrix} t = 1, \dots, M \times K \\ π^{*}_{ij} = x_{t} \text{ where } i = \lceil t / K \rceil \text{ and } j = t - K \times (i - 1) \end{matrix}$ (50)
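The mapping between the vector x and the matrix π^* is easy to get wrong by one position, so a round-trip check is useful. The Python sketch below assumes the row-by-row ordering of Example 7 and 1-based indices; the helper names `to_index`, `from_index` and `unflatten` are hypothetical.

```python
def to_index(i, j, K):
    """Position t of the entry (i, j) of pi* in x (1-based, row-by-row order)."""
    return (i - 1) * K + j


def from_index(t, K):
    """Inverse mapping: recover (i, j) from the position t."""
    i = (t + K - 1) // K  # integer form of ceil(t / K)
    j = t - K * (i - 1)
    return i, j


def unflatten(x, M, K):
    """Rebuild the M x K membership matrix from the flat vector x."""
    return [list(x[i * K:(i + 1) * K]) for i in range(M)]


M, K = 6, 3
x = list(range(1, M * K + 1))  # placeholder values: x_t = t
pi = unflatten(x, M, K)
print(from_index(4, K))  # the fourth entry of x is the first membership of x2
print(pi[5])
```

Storing the placeholder value t at position t makes it easy to see that the last row of the rebuilt matrix holds the last K entries of x.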

4.5.3 Consensus algorithm

To obtain consensus clustering from base clusterings Π according to the local Dependability of each cluster in the ensemble, it is necessary to compute each cluster’s Dependability in the ensemble according to Equation (14). For this purpose, the entropy of each cluster in Π needs to be computed according to Equation (13). To do this, we should compute the undependability of each cluster with respect to clustering π^m in ensemble Π, according to Equation (11). For this computation, computing the acceptability of the clusters in ensemble Π according to Equation (10) is necessary. After computing the DDCI, the weighted fuzzy co-association clustering ensemble matrix (WFCo) is obtained according to Equation (17). Then based on WFCo the objective function (Equation (18)) is constructed. Finally, by applying the Consensus_clustering_SQP algorithm as solver over the Equation (18) the final clustering is obtained.

This algorithm, named FCESQP (Fuzzy Clustering Ensemble by SQP), is presented in detail in Algorithm 2. In this algorithm, Π is the base clustering ensemble and k is the number of clusters in the final clustering.

Algorithm 2 FCESQP (Fuzzy Clustering Ensemble by SQP Algorithm) Input:Π, k.

// Π is an ensemble of basic clusterings

// k is the number of final clusters

1: Compute the acceptability of each cluster in relation to the other clusters in each clustering π^m belonging to the ensemble Π, according to Definition 6.

2: Compute the undependability of each cluster with respect to clustering π^m in the ensemble Π according to definition 7.

3: Compute the undependability of the clusters in the ensemble Π according to definition 8.

4: Compute the DDCI of the clusters in ensemble according to definition 9.

5: Construct the WFCo matrix according to definition 11.

6: Construct the objective function from WFCo (Equation (18)).

7: Obtain the final clustering with k clusters via Consensus_clustering_SQP (Equation (18), k)

// π^* = Consensus_clustering_SQP (Equation (18), k)

Output: the consensus clustering π^*.

5 Experimental study

5.1 Benchmark and evaluation criteria

To evaluate the robustness and quality of the proposed fuzzy clustering ensemble approach, twelve datasets are used: datasets selected from the UCI Machine Learning Repository [52], the “Galaxy” dataset described in [53], and the well-known HalfRing dataset. The description of these datasets is shown in Table 11.

Two evaluation criteria NMI and Dunn are applied here to assess the quality of clustering.

NMI is the normalized mutual information between two clusterings [16]. For two clusterings π¹ and π², it is calculated as $NMI (π^{1}, π^{2}) = \frac{I (π^{1}, π^{2})}{\sqrt{H (π^{1})} \sqrt{H (π^{2})}}$ (51) where I (π¹, π²) denotes the mutual information between the two clusterings and H (π¹) denotes the entropy of π¹. In this paper, π¹ is the final clustering and π² is the ground truth of each dataset. A larger NMI value indicates a better clustering result.
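As an illustration of Equation (51), the Python sketch below computes NMI for two hard labelings. It assumes that a fuzzy clustering is first hardened (for example by assigning each data object to its cluster of highest membership), which is an assumption of this sketch and not a statement about the evaluation procedure of the paper; natural logarithms are used, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a labeling (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same data objects."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (ca[i] * cb[j]))
               for (i, j), nij in cab.items())


def nmi(a, b):
    """Equation (51): I(a, b) / (sqrt(H(a)) * sqrt(H(b)))."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        return 0.0  # convention used here when one labeling has a single cluster
    return mutual_information(a, b) / math.sqrt(ha * hb)


truth = [0, 0, 1, 1]
print(nmi(truth, [1, 1, 0, 0]))  # same partition with swapped label names
print(nmi(truth, [0, 1, 0, 1]))  # partition independent of the ground truth
```

Because NMI compares partitions rather than label names, renaming the clusters does not change the score.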

The Dunn Index [42] is defined as

$Dunn (π^{i}) = \min_{j, k \in \{1, \dots, c^{i}\}, \; j \neq k} \left\{ \frac{\mathrm{min\_dis} (C_{j}^{i}, C_{k}^{i})}{\max_{t \in \{1, \dots, c^{i}\}} diam (C_{t}^{i})} \right\}$ (52) where $\mathrm{min\_dis} (C_{j}^{i}, C_{k}^{i})$ is the distance between the two nearest data objects in clusters $C_{j}^{i}$ and $C_{k}^{i}$ , and $diam (C_{t}^{i})$ is the diameter of cluster $C_{t}^{i}$ . As with NMI, a higher value of the Dunn index indicates a better clustering result.
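Equation (52) can be illustrated with a direct, quadratic-time Python sketch. It assumes Euclidean distance, hard cluster labels, and the maximum pairwise distance inside a cluster as its diameter; these choices and the function name `dunn_index` are assumptions made for the illustration. The sketch requires at least two clusters and at least one cluster with two or more data objects.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    groups = list(clusters.values())
    # Distance between the two nearest data objects of any two different clusters.
    min_sep = min(math.dist(p, q)
                  for g, h in combinations(groups, 2) for p in g for q in h)
    # Largest distance between two data objects of the same cluster.
    max_diam = max(math.dist(p, q)
                   for g in groups for p, q in combinations(g, 2))
    return min_sep / max_diam


# Hypothetical example: two tight clusters that lie far apart.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))  # 5.0: separation 5, diameter 1
```

Compact, well-separated clusters give a large ratio, which is why a higher Dunn index is read as a better clustering.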

5.2 Ensemble generation

To evaluate the consensus quality over various ensembles, base clusterings are constructed with the FCM and K-means clustering algorithms. To obtain diverse base clusterings, FCM and K-means are run with different numbers of clusters. The number of clusters is chosen randomly from the interval $[2, \sqrt{M}]$ , where M is the number of data objects in the dataset under experiment.

The ensemble size for the performance evaluation was set to β = 10. Based on empirical results, the best result is obtained when the parameter ∅ = 0.4. To rule out chance and provide a fair comparison, the proposed approach and the state-of-the-art fuzzy clustering ensemble methods were assessed by the average of their quality criteria and AC robustness over 40 runs.

The experiments were conducted in MATLAB R2014a (64-bit).

5.3 Comparison with state of the art

The proposed SQP approach was compared with eight clustering ensemble methods: WEAC [27], GPMGLA [27], SVC [25], PVC [25], BVC [25], ISC [25], Berikov [30], and FSCEOGA1 [34]. The two quality criteria, NMI and Dunn, were applied to assess the quality of the final clustering produced by the proposed method and the baseline methods. The number of clusters for each dataset is set to its number of predefined classes (ground truth).

For comparison purposes, the proposed method and each baseline method are executed 40 times. The average NMI and Dunn values of the different methods over the 40 runs are shown in Tables 12 and 13, respectively. The value in bold in each row is the best value obtained for that dataset by the examined algorithms. The last row shows the average for each algorithm over all datasets. Because FSCEOGA1 is computationally expensive, it cannot handle large datasets within a reasonable execution time. For this reason, its results on the Vehicle dataset are missing and are denoted by a dash.

Table 12
The NMI resulted from different algorithms

Dataset	WEAC	GPMGLA	SVC	PVC	BVC	ISC	Berikov	FSCEOGA1	FCESQP
Breast	0.7896	0.0029	0.2789	0.1715	0.6985	0.3750	0.0457	0.6977	0.7894
Galaxy	0.2821	0.2768	0.3083	0.0461	0.0445	0.3196	0.0986	0.2309	0.2851
Glass	0.3918	0.3648	0.3129	0.0823	0.0944	0.3584	0.0901	0.2797	0.3585
Haberman	0.0078	0.0002	0.0253	0.0021	0.0002	0.0231	0.0164	0.0003	0.1234
Halfring	0.3113	0.3088	0.2838	0.0051	0.2238	0.2608	0.0264	0.2886	0.3210
Ionosphere	0.1264	0.0165	0.1448	0.1485	0.1312	0.1403	0.0642	0.1227	0.1568
Iris	0.5967	0.7869	0.5923	0.1272	0.5923	0.5458	0.0815	0.6813	0.7515
Knowledge	0.2373	0.1115	0.2681	0.0815	0.1703	0.2502	0.0669	0.2455	0.2803
Seeds	0.6286	0.6286	0.4161	0.2751	0.4698	0.7075	0.0741	0.5963	0.7075
SAHeart	0.0767	0.0000	0.0388	0.0303	0.0592	0.0379	0.0194	0.0741	0.0780
Wine	0.3522	0.3927	0.2673	0.1256	0.3806	0.2315	0.0389	0.4033	0.4302
Vehicle	0.2010	0.2027	0.1610	0.0720	0.0807	0.2061	0.0170	–	0.2111
Alg. Avg	0.3335	0.2577	0.2581	0.0973	0.2455	0.2880	0.0533	0.3291	0.3744

Table 13

The Dunn index resulted from different algorithms

Dataset	WEAC	GPMGLA	FSCEOGA1	SVC	PVC	BVC	ISC	Berikov	FCESQP
Breast	1.58	0.18	1.63	0.26	0.57	0.85	0.39	0.08	1.88
Galaxy	1.20	1.39	0.69	1.10	0.10	0.43	1.27	0.05	1.48
Glass	0.43	0.33	0.37	0.51	1.77	0.10	0.75	0.01	1.75
Haberman	1.77	1.98	2.01	0.51	1.02	0.35	0.57	0.08	2.17
Halfring	2.46	2.46	2.47	1.07	0.85	0.62	1.46	0.01	2.65
Ionosphere	1.02	0.17	0.92	0.27	0.60	1.02	0.35	0.07	1.12
Iris	2.29	2.41	1.92	2.15	1.61	2.15	1.12	0.04	2.45
Knowledge	1.03	0.51	1.03	1.08	0.70	0.32	1.02	0.10	1.10
Seeds	2.35	2.35	1.88	1.30	0.28	0.54	2.35	0.04	2.33
SAHeart	1.00	0.20	1.02	0.00	1.11	0.09	0.75	0.10	1.11
Wine	3.19	1.33	1.83	0.41	1.53	0.74	0.27	0.06	2.98
Vehicle	1.70	1.73	–	0.50	0.44	0.56	1.56	0.03	1.80
Alg. Avg	1.67	1.25	1.43	0.76	0.88	0.65	0.99	0.06	1.90

According to Table 12, FCESQP obtains the best NMI on eight datasets (tied with ISC on Seeds), while WEAC is best on two datasets (Breast and Glass) and ISC and GPMGLA are each best on one dataset (Galaxy and Iris, respectively). FCESQP obtains the third best result twice (Galaxy and Glass). FCESQP also achieves the best average NMI, with a value of 0.3744.

To ensure that the results did not occur by chance and to assess the quality of the proposed method, a statistical analysis is necessary. The Friedman test [54] is applied to the results of Table 12 under the null hypothesis that the mean ranks of all examined algorithms are equal. The significance level is set to 0.05. The result of the Friedman test on Table 12 is shown in Fig. 2. As observed in Fig. 2, the null hypothesis that the mean NMI rank is equal for all algorithms is rejected because the p-value is 7.779E-7, which indicates a significant difference. According to the mean ranks, FCESQP has the highest NMI score, followed by WEAC and then SVC.
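The quantities behind such a test can be sketched as follows. The Python code below computes the mean rank of each algorithm and the Friedman chi-square statistic for a table of scores (rows are datasets, columns are algorithms). It is a simplified illustration: rank 1 is assigned to the largest score by assumption, no tie correction is applied, the p-value computation is omitted, and the scores are toy numbers that are not taken from Table 12.

```python
def ranks_desc(values):
    """Rank 1 = largest value; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square (no tie correction) and mean ranks of an N x k table."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(ranks_desc(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Toy scores (hypothetical): 3 datasets, 3 algorithms, always the same ordering.
scores = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2], [0.7, 0.4, 0.3]]
chi2, mean_ranks = friedman_statistic(scores)
print(chi2, mean_ranks)
```

The statistic is compared with a chi-square distribution with k − 1 degrees of freedom; a large value, as in the fully consistent toy ranking above, speaks against the hypothesis of equal mean ranks.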

Fig. 2

Friedman test result of Table 12.

According to Table 13, FCESQP obtains the best Dunn index on nine datasets (including ties), while WEAC is best on two datasets and PVC is best on the Glass dataset. According to the last row (average values over all datasets), FCESQP achieves the best average Dunn index with a value of 1.90, WEAC has the second best score, and FSCEOGA1 has the third best score.

The result of the Friedman test on Table 13 is shown in Fig. 3. As observed in Fig. 3, the null hypothesis that the mean rank of the Dunn index is equal for all algorithms is rejected because the p-value is 3.705E-8, which indicates a significant difference. According to the mean ranks, FCESQP has the highest Dunn score, followed by WEAC and then FSCEOGA1.

Fig. 3

Friedman test result of Table 13.

Fig. 4

The results of different methods on Yeast dataset.

As the datasets shown in Table 11 are all small, we also try a somewhat larger dataset to show the effectiveness of the proposed method under these conditions: the Yeast dataset, with 1484 instances, 8 attributes, and 10 classes. As shown in Fig. 4, the proposed method still outperforms the other methods in terms of NMI.

Table 11

Description of the datasets

Dataset	Number of data objects (M)	Number of classes (k)	Number of features (N)
Breast	683	2	9
Galaxy	323	7	4
Glass	214	7	10
Haberman	306	2	3
Halfring	400	2	2
Ionosphere	351	2	34
Iris	150	3	4
Knowledge	258	4	5
Seeds	210	3	7
SAHeart	462	2	9
Wine	178	3	13
Vehicle	846	4	18

6 Conclusions and guidelines to future works

In this paper, a novel fuzzy cluster ensemble approach based on the estimation of fuzzy cluster undependability has been proposed. The uncertainty of a fuzzy cluster with respect to a clustering is approximated using an entropic criterion. Then a new dependability-driven cluster indicator, termed DDCI, based on cluster undependability and a local weighting strategy has been introduced. The DDCI measure does not depend on the original data features and makes no assumption about the data distribution. A local weighting scheme that improves the conventional co-association matrix through the DDCI weights, named WFCo, has also been introduced. Instead of letting all clusters participate equally in the co-association matrix, in this approach each cluster participates according to its dependability in the ensemble. To extract the final clustering from the WFCo matrix, a constrained nonlinear optimization problem was formed and solved by sparse sequential quadratic programming. The experimental results over twelve datasets confirm the improvement in quality in comparison with other fuzzy cluster ensemble methods.

A parallel solution of the optimization problem for obtaining the final clustering can be considered as future work. Solving this nonlinear optimization problem by other methods can also be investigated. Applying this approach to real-world applications, especially engineering applications, will also be carried out.

References

Peng

, Zhang

, Qin

, Kong

Joint non negative and fuzzy coding with graph regularization for efficient data clustering, Egyptian Informatics Journal (2020), DOI: 10.1016/j.eij.2020.05.001.

Bezdek

J.C.

, Ehrlich

and Full

, FCM: The fuzzy c-meansclustering algorithm, Computers & Geosciences 10(2–3) (1984), 191–203.

Gath

and Geva

A.B.

, Unsupervised optimal fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7) (1989), 773–780.

Graves

and Pedrycz

, Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study, Fuzzy Sets and Systems 161(4) (2010). 522–543. https://doi.org/10.1016/j.fss.2009.10.021.

Gustafson

D.E.

, Kessel

W.C.

Fuzzy clustering with a fuzzy covariance matrix. In Decision and Control including the 17th Symposium on Adaptive Processes, 1978 IEEE Conference on (1979), 761–766.

Strehl

and Ghosh

, Cluster ensembles –a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3 (2002)b, 583–617. https://doi.org/101162/153244303321897735.

Fern

X.Z.

and Brodley

C.E.

, Random projection for high dimensional data clustering: A cluster ensemble approach, Proceedings of the Twentieth International Conference/non Machine Learning 20 (2003), 186–193. https://doi.org/101.1.72.6059.

Fern

X.Z.

, Brodley

C.E.

Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning (2004), 36.

Greene

, Tsymbal

, Bolshakova

, Cunningham

Ensemble clustering in medical diagnostics. In Computer- Based Medical Systems, 2004. CBMS 2004. Proceedings. 17th IEEE Symposium on (2004), 576–581.

10.

Hadjitodorov

S.T.

, Kuncheva

L.I.

and Todorova

L.P.

, Moderate diversity for better cluster ensembles, Information Fusion 7(3) (2006), 264–275. https://doi.org/101016/j.inffus.2005.01.008.

11.

Kuncheva

L.I.

, Hadjitodorov

S.T.

, Todorova

L.P.

Experimental comparison of cluster ensemble methods. In Information Fusion, 2006 9th International Conference on (2006), 1–7.

12.

Topchy

, Jain

A.K.

, Punch

Combining multiple weak clusterings, Third IEEE International Conference on Data Mining (2003), 0–7. https://doi.org/101109/ICDM.2003.1250937.

13.

Topchy

A.P.

, Jain

A.A.K.

and Punch

W.F.

, A mixture model for clustering ensembles, Sdm (2004), 379–390. https://doi.org/101137/1.9781611972740.35.

14.

Topchy

, Jain

A.K.

and Punch

, Clustering ensembles: models of consensus and weak partitions, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(12) (2005)a, 1866–1881. https://doi.org/101109/TPAMI.2005.237.

15.

Vega-Pons

, Correa-Morris

and Ruiz-Shulcloper

, Weighted partition consensus via kernels, Pattern Recognition 43(8) (2010), 2712–2724. https://doi.org/101016/j.patcog.2010.03.001.

16.

Strehl

, Ghosh

Cluster Ensembles —A Knowledge Reuse Framework for Combining Multiple Partitions, Journal of Machine Learning Research 3(Dec) (2002a), 583–617. Retrieved from http://www.jmlr.org/papers/v3/strehl02a.html.

17.

Franek

and Jiang

, Ensemble clustering by means of clustering embedding in vector spaces, Pattern Recognition 47(2) (2014), 833–842. https://doi.org/101016/j.patcog.2013.08.019.

18.

Fred

A.L.N.

and Jain

A.K.

, Data clustering using evidence accumulation, Object Recognition Supported by User Interaction for Service Robots 4 (2002), 276–280. https://doi.org/101109/ICPR.2002.1047450.

19.

Fred

A.L.N.

and Jain

A.K.

, Combining multiple clusterings using evidence accumulation, , IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6) (2005), 835–850.

20.

Minaei-Bidgoli

, Topchy

and Punch

W.F.

, Ensembles of partitions via data resampling, International Conference on Information Technology: Coding Computing, ITCC 2(Cl) (2004), 188–192. https://doi.org/101109/ITCC.2004.1286629.

21.

Zhong

, Yue

, Zhang

and Lei

, A clustering ensemble: Two-level-refined co-association matrix with path-based transformation, Pattern Recognition 48(8) (2015), 2699–2709. https://doi.org/101016/j.patcog.2015.02.014.

22.

Singh

, Mukherjee

, Peng

J.M.

and Xu

J.H.

, Ensemble clustering using semidefinite programming with applications, Machine Learning 79(1–2) (2010), 177–200. https://doi.org/Doi10.1007/S10994-009-5158-Y.

23.

Ayad

H.G.

and Kamel

M.S.

, Cumulative voting consensus method for partitions with variable number of clusters, IEEE Transactions on Pattern Analysis and Machine Intelligence 30(1) (2008), 160–173.

24.

Ayad

H.G.

and Kamel

M.S.

, On voting-based consensus of cluster ensembles, Pattern Recognition 43(5) (2010), 1943–1953.

25.

Sevillano

, Alías

and Socoró

J.C.

, Positional and confidence voting-based consensus functions for fuzzy cluster ensembles, Fuzzy Sets and Systems 193(Supplement C) (2012), 1–32. https://doi.org/10.1016/j.fss.2011.09.007.

26.

Barthelemy

J.-P.

Leclerc

The median procedure for partitions, Partitioning Data Sets 19 (1993), 3–34.

27.

Huang

, Lai

J.-H.

and Wang

C.-D.

, Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis, Neurocomputing 170(November 2016) (2015), 240–250. https://dx-doi-org.web.bisu.edu.cn/10.1016/j.neucom.2014.05.094.

28.

, Ding

Weighted consensus clustering. In Proceedings of the 2008 SIAM International Conference on Data Mining (2008), 798–809.

29.

, Li

, Gao

, You

, Liu

, Wong

H.-S.

and Han

, Hybrid clustering solution selection strategy, Pattern Recognition 47(10) (2014), 3362–3375.

30.

Berikov

V.B.

, A probabilistic model of fuzzy clustering ensemble, Pattern Recognition and Image Analysis 28(1) (2018), 1–10.

31.

Kailath

, The divergence and Bhattacharyya distance measures in signal selection, IEEE Transactions on Communication Technology 15(1) (1967), 52–60.

32.

Dimitriadou

, Weingessel

and Hornik

, A combination scheme for fuzzy clustering, International Journal of Pattern Recognition and Artificial Intelligence 16(07) (2002), 901–912. https://doi.org/101142/S0218001402002052.

33.

Saha

, Maulik

, Bandyopadhyay

and Plewczynski

, SVMeFC: SVM ensemble fuzzy clustering for satellite image segmentation, IEEE Geoscience and Remote Sensing Letters 9(1) (2012), 52–55. https://doi.org/101109/LGRS.2011.2160150.

34.

Alizadeh

, Minaei

and Parvin

, Optimizing fuzzy cluster ensemble in string representation, International Journal of Pattern Recognition and Artificial Intelligence 27(02) (2013), 1350005. https://doi.org/101142/S0218001413500055.

35.

Kuhn

H.W.

Nonlinear programming. In Proceedings of 2nd Berkeley Symposium. Berkeley: University of California Press (1951), 481–492.

36.

Kuhn

H.W.

, The Hungarian method for the assignment problem, Naval Research Logistics (NRL) 2(1–2) (1955), 83–97.

37.

Van

An overviewand comparison of voting methods for pattern recognition. In Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on (2002), 195–200.

38.

Bedalli

, Mancellari

Asilkan

, A heterogeneous cluster ensemble model for improving the stability of fuzzy cluster analysis, Procedia Computer Science 102(August) (2016), 129–136. https://doi.org/101016/j.procs.2016.09.379.

39.

de Oliveira

J.V.

Szabo

de Castro

L.N.

, Particle swarm clustering in clustering ensembles: Exploiting pruning and alignment free consensus, Applied Soft Computing 55(Supplement C) (2017), 141–153. https://doi.org/10.1016/j.asoc.2017.01.035.

40.

Ball

, Hall Dj

A novel method of data analysis and pattern classification. Isodata, A novel method of data analysis and pattern classification. Tch. Report 5RI, Project 5533. Stanford RI, Menlo Park, CA. USA (1965).

41.

Caliński

and Harabasz

, A dendrite method for clusteranalysis, Communications in Statistics-Theory and Methods 3(1) (1974), 1–27.

42.

Dunn

J.C.

, Well-separated clusters and optimal fuzzy partitions, Journal of Cybernetics 4(1) (1974), 95–104.

43.

Rousseeuw

P.J.

, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987), 53–65.

44.

Pal

N.R.

and Bezdek

J.C.

, On cluster validity for the fuzzy c-means model, IEEE Transactions on Fuzzy Systems 3(3) (1995), 370–379.

45.

Minaei-bidgoli

H.P.B.

A clustering ensemble framework based on selection of fuzzy weighted clusters in a locally adaptive clustering algorithm (2015), 87–112. https://doi.org/10.1007/s10044-013-0364-4.

46.

Zheng

, A similarity measure between fuzzy sets, In Applied Mechanics and Materials 229 (2012), 2663–2666.

47.

Vega-Pons

and Ruiz-Shulcloper

, A survey of clustering ensemble algorithms, International Journal of Pattern Recognition and Artificial Intelligence 25(03) (2011), 337–372. https://doi.org/10.1142/S0218001411008683.

48.

Edgar

T.F.

, Himmelblau

D.M.

and Lasdon

L.S.

, Optimization of Chemical Processes. McGraw-Hill (2001).

49.

Haftka

R.T.

, Gurdal

, Elements of Structural Optimization, Third revised and expanded edition. Kluwer Academic Publishers (1992).

50.

Gill

P.E.

, Murray

and Saunders

M.A.

, SNOPT: An SQP algorithm for large-scale constrained optimization, SIAM Review 47(1) (2005), 99–131.

51.

Han

S.-P.

, Superlinearly convergent variable metric algorithms for general nonlinear programming problems, Mathematical Programming 11(1) (1976), 263–282.

52.

Blake

C.L.

, Merz

C.J.

, UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (1998).

55.

Dennis

J.E.

Jr. and Schnabel

R.B.

, Numerical methods for unconstrained optimization and nonlinear equations (Vol. 16). SIAM (1996).

53.

Odewahn

S.C.

, Stockwell

E.B.

, Pennington

R.L.

, Humphreys

R.M.

and Zumach

W.A.

, Automated star/galaxy discrimination with neural networks, The Astronomical Journal 103 (1992), 318–331.

54.

Iman

R.L.

and Davenport

J.M.

, Approximations of the critical region of the Friedman statistic, Communications in Statistics-Theory and Methods 9(6) (1980), 571–595.