SPCM: Efficient semi-possibilistic c-means clustering algorithm

Abstract

The required division and exponentiation operations needed per iteration for the possibilistic c-means (PCM) clustering algorithm complicate its implementation, especially on homomorphically-encrypted data. This paper presents a novel efficient soft clustering algorithm based on the possibilistic paradigm, termed SPCM. It aims at easing future applications of PCM to encrypted data. It reduces the required exponentiation and division operations at each iteration by restricting the membership values to an ordered set of discrete values in [0,1], resulting in a better performance in terms of runtime and several other performance indices. At each iteration, distances to the new clusters’ centers are determined, then the distances are compared to the initially computed and dynamically updated range of values, that divide the entire range of distances associated with each cluster center into intervals (bins), to assign appropriate soft memberships to objects. The required number of comparisons is O(log the number of discretization levels). Thus, the computation of centers and memberships is greatly simplified during execution. Also, the use of discrete values for memberships allows soft modification (increment or decrement) of the soft memberships of identified outliers and core objects instead of rough modification (setting to zero or one) in related algorithms. Experimental results on synthetic and standard test data sets verified the efficiency and effectiveness of the proposed algorithm. The average percent of the achieved reduction in runtime is 35% and the average percent of the achieved increase in v-measure, adjusted mutual information, and adjusted rand index is 6% on five datasets compared to PCM. The larger the dataset, the higher the reduction in runtime. Also, SPCM achieved a comparable performance with less computational complexity compared to variants of related algorithms.

Keywords

Clustering algorithms fuzzy clustering possibilistic c-means hybrid soft clustering homomorphic encryption

1 Introduction

Clustering is an unsupervised learning technique that is used to group a set of similar objects such that an object is more similar to objects in its group than to objects in other groups. By clustering analysis, the correlation of data attributes and the globally distributed model can be discovered. Partitioning clustering algorithms such as K-means [1], fuzzy c-means (FCM) algorithm [2], possibilistic c-means(PCM) [3, 4], and mode-based methods such as competitive learning [5] are the most popular and widely used on the cloud due to their efficiency and simplicity. PCM is more robust than FCM since it gives low membership for outliers in all clusters. Other categories of clustering algorithms are self-organizing map (SOM) [6], grid-based methods [7], density-based methods [8] and hierarchical methods [9].

In hard c-means, each data point is given a membership of 0 or 1, on the other hand, fuzzy c-means gives each data point a membership value between 0 and 1 to each cluster. The closer an object is to a cluster center the greater is its membership in that cluster. Clearly, the sum of memberships of each object should be equal to one. After each iteration, membership values and clusters’ centers are updated according to a given formula. Fuzzy c-means suffers from high computation time, and sensitivity to noise.

To overcome the difficulties of the fuzzy c-means, a new clustering model named possibilistic c-Means (PCM) has been proposed in [3, 4]. A unifying view of fuzzy and possibilistic memberships can be found in [10] where clusters are considered neither independent nor rigidly related which results in a better outlier rejection, relative to FCM, and better cluster isolation relative to PCM.

Both FCM and PCM algorithms aim to capture effectively different grades of uncertainty that help to cope with noise and outliers, producing a more compact and well-separated clustering structure. However, hard clustering is still a practical option when the cost of misclassification is very high [11], i.e., the number of incorrectly classified points should be minimized independently from the correctly classified ones.

Rough fuzzy clustering is introduced in [12] where objects assignments to lower or upper approximations are based on membership degrees instead of Euclidean distances, and where points in lower approximations and boundary regions are weighted differently. In [13], neutrosophic set (ns) theory is proposed as a generalization of the theory of intuitionistic fuzzy sets. In ns theory, uncertainty is represented in terms of a truth-membership, an indeterminacy-membership, and a falsity-membership. One of the recent applications of this theory to MR image segmentation was found in [14] where MR regions are segmented based on the neutrosophical entropy information (NEI) function that provides the basis for the representation of ambiguity in MR images.

In [15], two new scalable algorithms based on the c-means algorithm are proposed as enhancements to clustering by fast search and find of density peaks (CFSFDP) [16]. One of the two algorithms uses an acceleration mechanism based on concept approximation while the other is faster and uses exemplar clustering to obtain slightly less accurate results compared to the original algorithm. The complexity of density peaks is reduced in [17] to allow efficient clustering of very large datasets.

A three-way Multi-view clustering is proposed in [18] in which an objective function is constructed and optimized to preserve the consistency between the views by utilizing the low-rank sparse representation technique to build the consensus low rank matrix. Also, they pursue the neighborhood-based three-way clustering technique to reflect the data items into core and fringe regions.

The weighted possibilistic fuzzy c-mean WPFCM in [19] considers the importance of membership and typicality in the clustering process while assigning weights. Re-substitution errors of WPFCM are found to be near to FCM and PFCM, but far less than PCM. According to several performance indices, WPFCM overcomes the sensitivity weakness of FCM and resolves the coincident problem of PCM and unreasonable weight parameters of PFCM. Another work in the same direction can be found in [20].

In [21], an online generalized possibilistic c-means clustering algorithm, called Online Generalized Adaptive Possibilistic C-Means (O-GAPCM) is proposed. Based on APCM and extending its abilities, O-GAPCM is also able to unravel hyper-ellipsoidally shaped clusters while performing online processing, which makes it very computationally efficient. Experimental results show that O-GAPCM exhibits the best compromise between accuracy and time efficiency, compared to other related algorithms.

In [22], Intuitionist Possibilistic Fuzzy c-mean (IPFCM) is proposed to minimize the effect of outliers during the clustering process. IPFCM achieves reliable outlier detection while maintaining clustering consistency. In [23], a two-stage self-tuning possibilistic c-means clustering algorithm is proposed that can reliably handle outlier data than PCM, and reduce the number of parameters.

Early implementation for semi-fuzzy (soft) clustering is found in [24] where three mathematical models for semi-fuzzy clustering with their corresponding algorithms are introduced. The first algorithm limits the fuzziness to a smaller subset of the available clusters while the other two algorithms allow a pattern to be associated with the proper number of clusters based on a threshold value on distance or membership. In [13], a rough-fuzzy c-medoids algorithm is based on the principles of rough sets and fuzzy sets [15]. The concept of soft clustering is applied to fuzzy CLARANS [25] in [26]. Also, in [27], a soft bi-clustering algorithm is introduced in which the memberships of rows and columns are changed between a set of fixed values as an alternative to hard bi-clustering.

PCM is very sensitive to initialization and may generate coincident clusters. Reducing the memberships of core points to other clusters [28] and the adaptive determination of the penalty parameters [29] can greatly help in reducing the sensitivity to initialization, removing the clusters that gradually become obsolete, and avoiding coincident clusters.

Due to its high computational complexity, PCM is not suitable for clustering big data with a large number of objects efficiently. Furthermore, PCM needs to load all the objects into the main memory which is limited in local computing devices. Cloud computing offers a scalable and cost-efficient solution for clustering big data by providing tremendous memory space and high processing power [30]. Several improved PCM algorithms have been proposed for big data clustering such as weighted kernel possibilistic c-means algorithm [31], incremental WPCM [32, 33], as well as distributed weighted possibilistic c-means algorithm [34], and secure weighted possibilistic c-means for privacy-preserved clustering of big data on the cloud [35].

Uploading the raw data to the cloud directly as in [34] poses a serious threat to data security. Also, privacy-preserved algorithms [35] that are based on approximating the required calculations using Taylor’s expansion always produce lower clustering accuracy than their corresponding conventional algorithms. In [35], a secure weighted possibilistic c-means algorithm based on a full homomorphic encryption scheme (BGV) [36] is introduced for secure data clustering on the cloud. The BGV scheme allows only addition and multiplication on encrypted data. In [37], a k-means algorithm is proposed that guarantees privacy-preserving under a given security model to avoid degradation caused by using encryption schemes in big data analysis, however, it is limited to a specific security model. Mohassel et al. [38] learned the k-means protocol based on privacy-preserving between different databases, aiming at protecting contextual privacy. Zhu et al. [39] designed a privacy-preserving k-means clustering scheme based on distributed data in P2P networks.

When building a privacy-preserving approach for a data mining algorithm, the appropriate scenario and the balance of privacy preservation, accuracy, data utility, and efficiency are both major problems. To help tackle these problems, this paper proposes a semi possibilistic clustering algorithm based on the possibilistic paradigm. The main contributions of this work can be summarized as follows:

The proposed algorithm (SPCM) reduces the complexity of possibilistic c-means (PCM) by restricting the memberships to a set of discrete values between zero and one differing by a small step size. The required multiplications in computing the centers are reduced to O (cr) where r is the number of discretization levels and c is the number of clusters.

SPCM avoids the division and exponentiation operations needed per iteration of the possibilistic c-means (PCM) clustering algorithm for computing the memberships which complicate its implementation, especially on homomorphically-encrypted data.

The between-class relationships are increased by decrementing the soft memberships of non-winner prototypes and incrementing the winner typicality. In this way, the algorithm avoids the effect of early wrong rough decisions that may be taken in [28].

The proposed algorithm can deal with outliers efficiently. After identifying them as the lower tail of the soft memberships-based histogram, the identified outliers’ soft memberships are decremented (instead of setting them to zero). It can incorporate additional information from the user such as the expected percentage of noise.

However, determining the typicality still necessitates a magnitude relation comparison, which is difficult to execute over encrypted data and has been deferred for now.

The efficiency and effectiveness of the proposed algorithm are evaluated on several standard datasets and synthetic datasets by comparing SPCM with hard c-means (HCM), Fuzzy c-means (FCM), possibilistic c-means (PCM), possibilistic fuzzy c-means (PFCM), and the weighted possibilistic c-means (WPCM). The results show that SPCM is more efficient than related fuzzy algorithms and its effectiveness compares favorably to PCM.

The paper is organized as follows. The detailed description of PCM and WPCM are reviewed in Section 2. The proposed algorithm is presented in Section 3. The experimental results are given in Section 4 and, finally, the paper is concluded in section 5.

2 Related work

Before describing the proposed algorithm and other related algorithms in the following sections, commonly used abbreviations and symbols in these sections are listed in Table 1.

Table 1
Abbreviations and symbols used in the text

Abbrev. Description Abbrev. Description

HCM Hard c-means W Weight matrix of size n×k used only in WPCM

PCM possibilistic c-means algorithm U membership matrix of size n×k

WPCM weighted possibilistic c-means algorithm E Input threshold used for the termination condition

SPCM The proposed semi-possibilistic c-means algorithm α winner prototype threshold

FCM Fuzzy c-means μ ^m Represents the values in μ raised to power m

PFCM Possibilistic fuzzy c-means (FCM) B_i |μ –{0, 1}| bin values for cluster i

X input dataset consisting of n objects and d attributes a,b weights are given to FCM and PCM in PFCM

u _ij membership (typicality) of object x_j in cluster i VMI v-measure index

m The fuzzification factor AMI adjusted mutual information index

c the required number of clusters ARI adjusted rand index

V A set of k centers v₁, v₂, … v_k each of size 1×d W Weight matrix of size n×k used only in WPCM

maxiter Maximum number of iterations U membership matrix of size n×k

μ a set of r + 1 discrete membership values between 0 and 1 given as input or computed from r. r if fixed step is required then the range of soft memberships is 0..r and the corresponding discrete values will be μ={0, 1/r, 2/r, . . . . . . , 1}

η _i The penalty term coefficient in PCM.

Abbrev.	Description	Abbrev.	Description
HCM	Hard c-means	W	Weight matrix of size n×k used only in WPCM
PCM	possibilistic c-means algorithm	U	membership matrix of size n×k
WPCM	weighted possibilistic c-means algorithm	E	Input threshold used for the termination condition
SPCM	The proposed semi-possibilistic c-means algorithm	α	winner prototype threshold
FCM	Fuzzy c-means	μ ^m	Represents the values in μ raised to power m
PFCM	Possibilistic fuzzy c-means (FCM)	B_i	\|μ –{0, 1}\| bin values for cluster i
X	input dataset consisting of n objects and d attributes	a,b	weights are given to FCM and PCM in PFCM
u _ij	membership (typicality) of object x_j in cluster i	VMI	v-measure index
m	The fuzzification factor	AMI	adjusted mutual information index
c	the required number of clusters	ARI	adjusted rand index
V	A set of k centers v₁, v₂, … v_k each of size 1×d	W	Weight matrix of size n×k used only in WPCM
maxiter	Maximum number of iterations	U	membership matrix of size n×k
μ	a set of r + 1 discrete membership values between 0 and 1 given as input or computed from r.	r	if fixed step is required then the range of soft memberships is 0..r and the corresponding discrete values will be μ={0, 1/r, 2/r, . . . . . . , 1}
η _i	The penalty term coefficient in PCM.

2.1 Possibilistic c-means (PCM)

The possibilistic approach to clustering [3, 4] assumes that the membership value of a data point to a fuzzy set (or cluster) is absolute, i.e. it is an evaluation of a degree of typicality and does not depend on the membership values of the same point to other clusters.

Let X = {x₁, x₂ …… , x_n} be a set of unlabeled data points, V = {v₁, v₂ …… , v_c} a set of cluster centers (or prototypes) and U = [u_ij] the fuzzy membership matrix. In the possibilistic c-means (PCM) algorithms, the constraints on the elements of U are relaxed to: $u_{ij} \in [0, 1] for i = 1, 2, . . c and j = 1, 2, . . n$ $0 < \sum_{j = 1 squo}^{n} u_{ij} < n for i = 1, 2, . . c$ $\lor_{i} u_{ij} > 0 for j = 1, 2, \dots n$

The membership u_ij is the fuzzy membership of object x_j in cluster i and can be defined heuristically in several ways. The above requirements simply imply that a cluster cannot be empty and each pattern must be assigned to at least one cluster. This turns a standard fuzzy clustering procedure into a mode-seeking algorithm [3, 4]. The objective function of PCM contains two terms; the first one is the objective function of the fuzzy c-means [2], while the second is a penalty term: $\begin{matrix} J_{m} (U, Y) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} u_{ij}^{m} {∥ x_{j} {- v}_{i} ∥}^{2} + \\ \sum_{i = 1}^{c} η_{i} \sum_{j = 1}^{n} {(1 - u_{ij})}^{m} \end{matrix}$ (1) where m is a fuzzifier ∈ [1, ∞) given as input and c is an input parameter representing the number of clusters. The values of η_i for i = 1 to c should be estimated before the clustering procedure starts depending on the average size of the i-th cluster. The solution that is obtained by minimizing the above objective function will be highly dependent on the parameter η_i. Note that if η_i→ ∞ for i = 1,2, . . . c (i.e., the second term of J_m (U, Y) is omitted), then a trivial solution is obtained by the minimization of the remaining cost function (i.e., u_ij = 0 for i = 1, 2,..c and j = 1, 2,..n, since no probabilistic constraint is assumed). The pair (U; Y) minimizes J_m, under the above constraints only if: $u_{ij} = \frac{1}{1 + {(d_{ij}^{2} / η_{i})}^{\frac{1}{m - 1}}} for i = 1, 2, . . . c and j = 1, 2, \dots n$ (2)

and $v_{i} = \frac{\sum_{j = 1}^{n} x_{j} u_{ij}^{m}}{\sum_{j = 1}^{n} u_{ij}^{m}} fori = 1, 2, \dots c$ (3)

Equations (2) and (3) can be used as formulas for updating the membership functions and the clusters centers.

Algorithm 1 shows the pseudo-code for the traditional possibilistic c-means algorithm (PCM).

Algorithm 1: Traditional PCM algorithm.
Input: X, c, m, maxiter
Output: U, V
Initialize v_i and η_I for i = 1,2, . . . ,c
for iteration = 1 to maxiter do
compute memberships U using Equation (2)
V‘ = V
compute new centers V using Equation (3)
if (\|\|V‘-V\|\|< ɛ) then return U, V endif
Endfor
return U, V

2.2 Weighted possibilistic c-means

In PCM, each object is assigned the same weight, however, the importance of objects differs. Furthermore, PCM cannot yield the desirable clustering results for the datasets including noisy objects or outliers. To tackle this problem, a weighted possibilistic c-means algorithm (WPCM) is presented in [16] by assigning a weight value to each object, which leads to the following objective function of WPCM: $\begin{matrix} J_{m} (U, Y) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} u_{ij}^{m} {| | x_{j} - v_{i} | |}^{2} + \\ \sum_{i = 1}^{c} \frac{1}{β_{i}} \sum_{j = 1}^{n} (w_{j} - u_{ij}) \end{matrix}$ (4)

The weights are updated using the following formula: $w_{j} = \sum_{i = 1}^{c} e^{- α | | x_{j} - v_{i} {| |}^{2}}$ (5)

The memberships are computed as follows: $u_{ij} = \frac{w_{j}}{{(1 + \frac{{| | x_{j} - v_{i} | |}^{2}}{η_{i}})}^{\frac{1}{m - 1}}}$ (6)

While the centers are computed as in Equation (3).

Algorithm 2: Weighted PCM algorithm
Input: X, c, m, maxiter
Output: U, V
Initialize v_i and η_i for i = 1,2, . . . ,c
for iteration = 1 to maxiter do
compute weights W using Equation (5)
compute memberships U using Equation (6)
V^′= V
compute new centers V using Equation (3)
if (\|\|V^′-V\|\|< ɛ) then return U, V endif
endfor
return U, V

3 The proposed soft possibilistic c-means (SPCM)

3.1 Soft memberships

By restricting the memberships to a set of discrete values between zero and one, the squared distances that correspond to these membership values can be expressed as a function of these memberships. Using Equation (2) of the standard PCM, a squared distance can be expressed as follows: ${(d_{ij}^{2} / η_{i})}^{\frac{1}{m - 1}} = (\frac{1}{u_{ij}} - 1)$

Raising the two sides of the equation to (m-1)>0 power gives $(d_{ij}^{2} / η_{i}) = {(\frac{1}{u_{ij}} - 1)}^{m - 1}$

Then $d_{ij}^{2} = η_{i} {(\frac{1}{u_{ij}} - 1)}^{m - 1}$ (7)

When the set of discrete membership values is given as an input parameter, then the corresponding bins can be initially computed, for each cluster, given the value of η_i for i = 1, 2,.., c. Each time, η_i is updated, the new corresponding discretization bin boundaries can be updated accordingly. To compute the membership of an object in a certain cluster, its distance to the cluster center is compared to the values stored in the bin vector associated with this cluster. For example, if a fixed step equal to 1/r is used then the corresponding discrete values μ will be {0, 1/r, 2/r, …, 1}. Each cluster i will have a bin vector B_i where B_i is a set of r-1 bin values corresponding to μ –{0, 1}. As the bin vector is sorted, searching for the proper bin given the squared distance needs O (log r) comparisons. B_i is used for discretizing the distances to the center of cluster i. The experimental results show that when using large enough (16 or more) discrete values, the proposed algorithm can produce accurate results with a high reduction in runtime as the cost of comparisons is less than the cost of exponentiation and division operations. The values stored in B = {b₁, b₂, b₃, … b_r-1} can be used to approximate the exact PCM membership given d² as shown in Equation (8). μ can have arbitrary values between 0 and 1, for example, we may choose μ = {0, 0.01, 0.2, 0.4, 0.45, 0.5, 0.55, 0.6, 0.9, 1} to increase the accuracy of memberships in a certain region. To ease the computation of the centers, another vector μ^m is initially computed to store the same values of μ raised to power m. Each value in μ is considered a category and crisp membership functions in these categories are used in Equation (8). Another approach is to use Trapezoidal membership for categories 0 and 1 and Triangular membership for the other categories as shown in Table 2. In the fuzzy membership approach, the division is needed to compute the approximated membership from the squared distance d² as shown in Equation (9). Figure 1 shows the exact membership in Equation (2) compared to the crisp and fuzzy approaches for approximating the membership of PCM given d² and B. The start and the end of the triangles in the lower part of Fig. 1 correspond to the values in B. As the upper part of Fig. 1 shows, the crisp approach can approximate the membership well if the number of discretization levels is chosen large enough (for example 16 or higher). In this approach we avoid the need for division and exponentiation which is the main concern of the proposed algorithm and at the same time the worst case for the cost of the calculation of membership is reduced. In the rest of the paper Equation (8) is used. The MATLAB code that produced Fig. 1 will be available in [40].

$u = {\begin{matrix} 0 & d^{2} > b_{1} \\ 1 / r & b_{1} ⩾ d_{>}^{2} b_{2} \\ 2 / r & b_{1} ⩾ d^{2} > b_{2} \\ ⋮ & ⋮ \\ (r - 1) / r & b_{r - 2} ⩾ d^{2} > b_{r - 1} \\ 1 & d^{2} ⩽ b_{r - 1} \end{matrix}$ (8) $u = \begin{matrix} 0 & d^{2} {> 2 b}_{1} {- b}_{2} \\ {(2 b}_{1} {- b}_{2} {- d}^{2} {) / (b}_{2} {- b}_{1}) r & b_{1} < d^{2} ⩽ {2 b}_{1} {- b}_{2} \\ \begin{matrix} {[2 (b}_{1} {- d}^{2} {) + (d}^{2} {- b}_{2} {)] / (b}_{2} {- b}_{1}) r \\ ⋮ \\ \begin{matrix} b_{r - 1} {[r (b}_{r - 1} {- d}^{2} {) + (r - 1) (d}^{2} - \frac{b_{r - 1}}{2})] / 2 r \\ 1 \end{matrix} \end{matrix} & \begin{matrix} b_{1} ⩾ d^{2} {> b}_{2} \\ ⋮ \\ \begin{matrix} b_{r - 1} ⩾ d^{2} {> b}_{r - 1} / 2 \\ d^{2} ⩽ b_{r - 1} / 2 \end{matrix} \end{matrix} \end{matrix}$ (9)

Fig. 1

Approximating the membership function of PCM.

3.2 Computing the centers of the clusters

Knowing the number of objects having a soft membership s_ij = k in cluster i (n_ik), the computation of a center requires O(r) multiplications where r is the discretization level. The vector μ^m stores the corresponding membership of μ raised to power m. It is computed once for all the values stored in the soft membership vector μ. B is updated in each iteration while μ and μ^m are fixed. The centers are computed as follows: $v_{i} = \frac{\sum_{k = 1}^{r} μ_{k}^{m} \sum_{s_{ij = k}}}{\sum_{k = 1}^{r} n_{ik} μ_{k}^{m}} for i = 1, 2, \dots c$ (10)

3.3 Dealing with outliers

To be able to give a lower weight for outliers, the proposed algorithm computes a histogram $h_{j} = \sum_{i = 1}^{c} s_{ij}$ , i.e., the sum of the soft memberships of object x_j in all clusters for j = 1, 2, . . . , n. The objects in the lower tail (objects having h_j less than the average of h_j by three standard deviations) of the histogram and before the first peak are considered outliers in the proposed algorithm. After identifying candidate outliers, their typicality, which are integer values representing their soft memberships in SPCM, are decremented. The expected percentage of noise can be supplied by the user as an input parameter to be used in identifying candidate outliers.

3.4 Introducing the between-class relationships

To overcome the PCM’s coincident clustering problem, the cutset idea, introduced in [28], is used. In this approach, the winner prototype is identified as the prototype having the maximum membership and its membership is higher than a threshold. The between-class relationships are introduced by setting the membership of a core object in the winner prototype to one and setting the non-winner typicality degree to zero to increase the separability of the produced clusters. Such modifications do not affect the typicality of outliers since PCM by its nature gives small typicality values to outliers. In the proposed algorithm SPCM, the threshold is chosen experimentally as a soft membership near 0.7. The between-class relationships are increased by decrementing the soft memberships of non-winner prototypes and incrementing the winner typicality.

3.5 The proposed algorithm (SPCM)

As shown in the pseudo code of algorithm 3, the SPCM algorithm starts by selecting initial cluster centers. If a fixed step 1/r is used, the corresponding memberships are computed and stored in the vector μ and the memberships raised to power m are computed and stored in the vector μ^m. For example, if r = 4 and m = 2 then μ={0,0.25,0.50,0.75,1} and μ^m = {0, 0 . 25², 0 . 50², 0 . 75², 1}).

Also, each cluster i has a vector B _i containing the discretization bins, i.e., the distances that correspond to the values in μ and it is updated using Equation (11).

Initial values for η_i are computed from the initial clustering. At each iteration, the distances to the cluster centers are computed and compared to the discretization bins in a manner similar to binary search. The values in S are the soft memberships (integer values) of the identified discretization level for each object in a certain cluster. An efficient implementation for calculating the memberships is done in Python using list comprehensions and the “bisect“ function [40].

Outliers belonging to the lower tail of a histogram for $h_{j} = \sum_{i = 1}^{c} s_{ij}$ are identified as described in section 3.3, then the soft memberships of identified outliers are decremented. The objects inside each cluster are identified in the step of computing the memberships (core points). The updated soft memberships of core points along with their distances to the center of the cluster to which they have the highest memberships are used to calculate the average distance needed for computing the corresponding penalty parameter η for that cluster. Based on the computed soft memberships, core points are identified as in section 3.4 and their memberships to the other clusters are decremented instead of setting them to zero as in [28].

Algorithm 3: semi-possibilistic c-means (SPCM)
Input:
X: a dataset of size n×d
c: number of clusters
m: fuzzification factor
α: threshold help in deciding the winner prototype
β: used in updating the penalty parametersη
μ: is a vector of size r + 1
Output:
U: the membership matrix of size n×c
V: c cluster centers of size 1×d
Begin
1. Compute μ ^m from μ
2. Select initial centers v_i for i = 1,2, . . . ,c
3. Initialize η_i for i = 1,2, . . . ,c
4. Compute initial discretization bins B =B₁ . . B_c using
Equation (7)
5. repeat
5.1. compute $d_{ij}^{2} = \| \| x_{j} - v_{i} \| \|^{2}$ for i = 1,2 . . . c and
j = 1,2, . . . ..n
5.2. Let S_i be the result of discretizing $D_{i} = {d_{ij}^{2}$ for
j = 1 . . n} into r levels using B_i then s_ij is the soft
membership that corresponds to d_ij
// refining the memberships of candidate core objects
5.3 for each object $q = {argmax}_{i = 1}^{c} s_{ij}$
5.3.1 sumdin_i = sumdin_i+d_iq
5.3.2 increment countin_i
5.3.3 If and s_iq > α
increment the soft memberships of q in i
decrement other soft memberships for the same
object j in the other clusters
Endif
endfor
//updating the penalty parameters
5.6 $η_{i} = (\underset{i}{sumdin} (1 / β_{i} - 1)^{m - 1}) / {countin}_{i}$ for i = 1, 2,
. . . , c //refine the memberships of candidate outliers
5.7 decrement the soft memberships of each candidate
outlier object q in the lower tail of a histogram for:
$h_{j} = \sum_{i = 1}^{c} s_{ij}$
5.8 update η_i for i = 1, 2, . . . , c using the refined S
5.9 update the bins vectorB_i using Equation (11) for
i = 1, 2, . . . , c
5.10 V‘ = V //store old centers
5.11 compute the new cluster centers V using Equation (10)
until \|\|V-V^′\|\|< ɛ
6. return U, V
End

Finally, the new value for the vector B_i representing the discretization bins for cluster i in iteration t + 1 is updated as follows: $B_{i} (t + 1) = \frac{η_{i (t + 1)}}{η_{i (t)}} B_{i} (t)$ (11)

The algorithm stops if it reaches the maximum number of iterations or the change in clusters centers becomes very small within two successive iterations. The process may be repeated numlocal times and the best solution is returned as shown in Fig. 2.

Fig. 2

Block diagram for the proposed algorithm (SPCM).

3.6 Explanatory example

In the following illustrative example, we show one iteration of SPCM over two-dimensional datasets represented by the first three columns in Table 3. As shown in Fig. 3, the number of clusters is chosen equals to 2 (c = 2) and the corresponding initial centroids are: v₁ = (1, 1) and v₂ = (5, 7). The squared Euclidian distance of each object to the first and second cluster centers are shown in the fourth and the fifth column respectively.

Fig. 3

The dataset used in the explanatory example.

Table 2

Fuzzy membership functions to approximate the PCM membership given d²

Category	Membership function (d²)
1	Trap<0,0, b_r-1/2, b_r-1>
r-1/r	Tri<b_r-1/2, b_r-1, b_r-2>
r-2/r	Tri<b_r-1, b_r-2, b_r-3>
⋮	⋮
2/r	Tri<b₃,b₂,b₁>
1/r	Tri<b₂,b₁, 2b₁- b₂>
0	Trap<b₁, 2b₁-b₂, ∞, ∞>

Table 3

Illustrative example

Object	Feature values		Distance to	Distance to	Initial memberships		Refined memberships
	x	y	Center 1	Center 2	C1	C2	C1	C2	Cause
x ₁	1	1	0	51.99	5	1	5	0	bet.
x ₂	1.5	2	1.25	37.21	3	1	3	1
x ₃	3	4	13.03	13.03	1	2	1	2
x ₄	5	7	51.98	0.0	1	5	0	5	bet.
x ₅	3.5	5	22.28	6.25	1	2	1	2
x ₆	4.5	5	28.19	4.24	1	3	1	3
x ₇	3.5	4.5	18.49	8.53	1	2	1	2
x ₈	4	15	205	65	1	1	0	0	Out.

Given the initial clustering, the 1st, 2nd and 3rd objects belong to the first cluster and their distances (din) to their cluster center are 0, 1.12, 3.61, respectively. Similarly, the remaining five objects belong to the second cluster and their distances to its center are 0, 3.61, 2.5, 2.06, 2.92, 8.06. The minimum din is 0 while the maximum is 3.61 for the first cluster and 8.06 for the second cluster.

By choosing the fuzzification factor m = 2 and the membership vector μ=[0,.05,.25,.5,.75,.9, 1] then μ² = [0,.0025,.063,.25,.56,.81, 1].

The discretization bins for each cluster are computed from the corresponding memberships in μ (excluding 0 and 1) by using Equation (7). At the current iteration, the discretization bins for the first cluster become B₁ = η₁ * [399, 3, 1, 0 . 33, 0 . 11] where η_i is dynamically computed at each iteration and selected in the interval max(din) [1/ α –1] ^m-1 > η_i>min(din) [1/0.99 - 1] ^m-1 where 0.5 = < α < 1. If α is chosen equal to 0.5 then 3.61 > η₁ > 0. If we select η₁= 1.8 then B₁ = 1.8×[399, 3, 1, 0.33, 0.11] = [718.2, 5.4, 1.8, 0.594, 0.198]. Similarly, 8.06> η₂> 0, 4.03 is a suitable choice for η₂. B₂ = 5 * [399, 3,1, 0.33, 0.11] = [1608,15,5, 1.65, 0.55]. The soft memberships for cluster 1 are S_1 = [5, 4, 3, 3, 3, 3, 3] and the soft memberships for the second cluster are S₂ =[3, 3, 3, 6, 3, 3, 3]. Using the procedure proposed in section 3.3, x₈ may be identified as a candidate outlier and the soft memberships are further refined by decreasing the memberships of x₈ by one level in the two clusters. Also, the between-class distance is increased by decrementing the soft memberships of non-winner prototypes and incrementing the winner typicality as shown in columns 8 and 9. The threshold is chosen equals to the soft membership 4 (between 0.75 and 0.9).

The new centroid v₁ for cluster 1 can be computed using the refined values of the soft memberships and their corresponding values stored in μ^m as follows: $\begin{matrix} v_{1} = \frac{0.0025 (x_{3} + x_{5} + x_{6} + x_{7}) + 0.56 x_{2} + x_{1}}{4 \times 0.0025 + 0.56 + 1} \\ = \frac{[2.64, 3.18]}{1.57} = [1.18, 1.36] \end{matrix}$

Similarly, the new centroid v₂ for cluster 2 can be computed as follows: $\begin{matrix} v_{2} = \frac{0.063 (x_{3} + x_{5} + x_{7}) + 0.0025 x_{2} 0.56 x_{6} + x_{4}}{3 \times 0.063 + 0.56 + 1} \\ = [4.66, 6.09] \end{matrix}$

It is clear that the new centers after one iteration are very close to the centers of the true clusters 1,2 and {3, 4, 5, 6, 7}.

As shown in the above example, the memberships are calculated without the need of division or exponentiation. Also, the number of multiplications needed for computing the centers is reduced to the order of r multiplications. Still, few divisions of (O(c)) are needed in computing cluster centers and η. These divisions can be approximated using Taylor expansion as shown in [35].

4 Experimental results

4.1 Tuning the fuzzification parameter (m)

The fuzzification parameter m plays an important role in computing the discretization bins B using Equation (7). Its value determines the rate of decay in the computed values of B from the discrete memberships μ given as input to SPCM. When m = 1, the memberships in μ are crisp i.e. {0, 1} and B_i will have a single value equal to η_i where i = 1, 2, . . . , c. All points with d² (x_j, V_i) greater than η_i will have a zero membership to cluster i. For large values of m, the values in B_i decay slowly. Similar to PCM, increasing values of m in SPCM increases the possibility that all points in the dataset belong to a given cluster resulting in coincidental clusters. Whereas, in the FCM, increasing values of m increase the sharing of points among all clusters. Thus, the value of m that gives us satisfactory performance is different in the two approaches. A value of m = 2 is known to give good results with the FCM while a more appropriate choice for PCM is found to be m = 1.5. As shown in Figs. 4 and 5, a value between 2 to 2.5 is appropriate for the proposed algorithm on the Iris dataset in terms of ARI and runtime, respectively. A value of 2 for the fuzzifier m is chosen for SPCM in the rest of the paper.

Fig.4

ARI for different values of fuzzifier m on Iris.

Fig. 5

Runtime for different values of m on Iris.

4.2 Performance evaluation on a very small synthetic dataset

In this section, the proposed algorithm is applied to a two-dimensional data set X₁₂ with 12 points, as shown in Table 4. The coordinates of X₁₂ are given in the 2nd and 3rd rows of Table 4. Let {x₁, x₂, …, x₁₀} ∪ {x₁₁, x₁₂} be denoted by X₁₂. x₁₁ is called an “inlier” (bridge) and x₁₂ an “outlier” (noise). x₁₁ and x₁₂ are equidistant from all corresponding pairs of points in the two clusters. For ease of comparison to the other five algorithms, a normalized coefficient [28] in Equation (12) is used. The results in Table 4 are those reported in [28], the fuzzification factor m is chosen equal to 2 for SPCM similar to the value chosen in the experiments with the related algorithms in Table 4.

Table 4
Comparing SPCM to related algorithms on X12 using Normalized weight coefficients of different prototypes

Pts x ₁ x ₂ x ₃ x ₄ x ₅ x ₆ x ₇ x ₈ x ₉ x ₁₀ x ₁₁ x ₁₂

Data x –5 –3.34 –3.34 –3.34 –1.67 1.67 3.34 3.34 3.34 5 0 0

y 0 1.67 0 –1.67 0 0 1.67 0 –1.67 0 0 20

FCM Φ _1,j 0.8713 0.9999 0.9571 0.7521 0.8038 0.0131 0.0001 0.0010 0.0207 0.0059 0.2557 0.2556

Φ _2,j 0.0059 0.0000 0.0010 0.0206 0.0130 0.8035 1.0000 0.9570 0.7519 0.8713 0.2554 0.2556

PCM η₁, η₂= 2.3 Φ _1,j 0.0076 0.0162 0.0186 0.0162 0.0561 0.8841 0.4417 0.9979 0.4406 0.2659 0.2193 0.0000

Φ _2,j 0.0075 0.0162 0.0185 0.0162 0.0560 0.8818 0.4423 1.0000 0.4412 0.2669 0.2185 0.0000

PFCM a:b=10 : 1 Φ _1,j 0.8239 1.0000 0.9337 0.7094 0.7708 0.0126 0.0000 0.0008 0.0189 0.0053 0.2420 0.2386

Φ _2,j 0.0053 0.0000 0.0008 0.0189 0.0126 0.7707 1.0000 0.9337 0.7094 0.8239 0.2419 0.2386

PFCM a:b=1 : 1 Φ _1,j 0.5975 0.9995 0.8762 0.5044 0.6253 0.0105 0.0000 0.0003 0.0103 0.0020 0.1760 0.1539

Φ _2,j 0.0020 0.0000 0.0002 0.0103 0.0105 0.6250 1.0000 0.8759 0.5042 0.5975 0.1758 0.1539

PFCM a:b=1 : 10 Φ _1,j 0.2413 0.4460 1.0000 0.2206 0.3905 0.0085 0.0014 0.0014 0.0017 0.0000 0.0667 0.0254

Φ _2,j 0.0000 0.0014 0.0014 0.0017 0.0085 0.3905 0.4460 1.0000 0.2206 0.2413 0.0667 0.0254

C-PCM η = 5, β=.5 Φ _1,j 0.3424 0.4164 1.0000 0.4151 0.5212 0.0000 0.0000 0.0000 0.0000 0.0000 0.1212 0.0001

Φ _2,j 0.0000 0.0000 0.0000 0.0000 0.0000 0.5212 0.4164 1.0000 0.4151 0.3424 0.1212 0.0001

SPCM Φ _1,j 0.3750 0.4375 1.0000 0.4375 0.5625 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625 0.0000

Φ _2,j 0.0000 0.0000 0.0000 0.0000 0.0000 0.5625 0.4375 1.0000 0.4375 0.3750 0.0625 0.0000

	Pts	x ₁	x ₂	x ₃	x ₄	x ₅	x ₆	x ₇	x ₈	x ₉	x ₁₀	x ₁₁	x ₁₂
Data	x	–5	–3.34	–3.34	–3.34	–1.67	1.67	3.34	3.34	3.34	5	0	0
	y	0	1.67	0	–1.67	0	0	1.67	0	–1.67	0	0	20
FCM	Φ _1,j	0.8713	0.9999	0.9571	0.7521	0.8038	0.0131	0.0001	0.0010	0.0207	0.0059	0.2557	0.2556
	Φ _2,j	0.0059	0.0000	0.0010	0.0206	0.0130	0.8035	1.0000	0.9570	0.7519	0.8713	0.2554	0.2556
PCM η₁, η₂= 2.3	Φ _1,j	0.0076	0.0162	0.0186	0.0162	0.0561	0.8841	0.4417	0.9979	0.4406	0.2659	0.2193	0.0000
	Φ _2,j	0.0075	0.0162	0.0185	0.0162	0.0560	0.8818	0.4423	1.0000	0.4412	0.2669	0.2185	0.0000
PFCM a:b=10 : 1	Φ _1,j	0.8239	1.0000	0.9337	0.7094	0.7708	0.0126	0.0000	0.0008	0.0189	0.0053	0.2420	0.2386
	Φ _2,j	0.0053	0.0000	0.0008	0.0189	0.0126	0.7707	1.0000	0.9337	0.7094	0.8239	0.2419	0.2386
PFCM a:b=1 : 1	Φ _1,j	0.5975	0.9995	0.8762	0.5044	0.6253	0.0105	0.0000	0.0003	0.0103	0.0020	0.1760	0.1539
	Φ _2,j	0.0020	0.0000	0.0002	0.0103	0.0105	0.6250	1.0000	0.8759	0.5042	0.5975	0.1758	0.1539
PFCM a:b=1 : 10	Φ _1,j	0.2413	0.4460	1.0000	0.2206	0.3905	0.0085	0.0014	0.0014	0.0017	0.0000	0.0667	0.0254
	Φ _2,j	0.0000	0.0014	0.0014	0.0017	0.0085	0.3905	0.4460	1.0000	0.2206	0.2413	0.0667	0.0254
C-PCM η = 5, β=.5	Φ _1,j	0.3424	0.4164	1.0000	0.4151	0.5212	0.0000	0.0000	0.0000	0.0000	0.0000	0.1212	0.0001
	Φ _2,j	0.0000	0.0000	0.0000	0.0000	0.0000	0.5212	0.4164	1.0000	0.4151	0.3424	0.1212	0.0001
SPCM	Φ _1,j	0.3750	0.4375	1.0000	0.4375	0.5625	0.0000	0.0000	0.0000	0.0000	0.0000	0.0625	0.0000
	Φ _2,j	0.0000	0.0000	0.0000	0.0000	0.0000	0.5625	0.4375	1.0000	0.4375	0.3750	0.0625	0.0000

$\emptyset_{ij} = \frac{u_{ij} - \min_{ij} u_{ij}}{\max_{ij} u_{ij} - \min_{ij} u_{ij}}$ (12)

As shown in Table 4, it is clear from column 14 that the proposed algorithm SPCM performed much better than the other five related algorithms and was able to assign the outlier object x₁₂ the lowest membership which is zero resulting in a negligible influence on the determination of the cluster centers as shown in Fig. 6. Consequently, the SPCM preserves the strong performance of the PCM to outliers. Also, the proposed SPCM method successfully overcomes the coincident clustering problem of the PCM and obtains distinct centroids as shown in Fig. 6.

Fig. 6

Determined ceneters for X12 dataset using SPCM.

4.3 The efficiency of the proposed algorithm on large synthetic datasets

In this section, the average runtime of SPCM is compared to the average runtime of the traditional PCM on several synthetic datasets of different sizes ranging from 50k to 500k objects, on datasets of size 50k with two features and number of clusters ranging from 2 to 50, and, finally, on datasets of 50k objects and 10 clusters with the number of features ranging from 2 to 10.

Figure 7 shows the average runtime of the traditional PCM and the proposed SPCM. The runtime of SPCM is much less than PCM. The difference increases as the dataset size increases. The runtime of PCM is about five times the runtime of the SPCM for the dataset of size 500k.

Fig. 7

Average runtime (sec.) on generated datasets of size 50k objects with two features and number of clusters ranging from to 2 to 50.

Figure 8 shows the average runtime (sec.) on generated datasets of size 50k objects, with two features and a different number of clusters. The runtime of both PCM and SPCM algorithms increases with the increase in the number of clusters as the complexity analysis shows later. The runtime of SPCM is always less than PCM. The higher the number of clusters the higher the increase of the average runtime of PCM compared to SPCM. The runtime of PCM is about twelve times the average runtime of SPCM when the number of clusters is equal to fifty.

Fig. 8

Average runtime (sec.) on generated datasets of different sizes ranging from 50k to 500k.

Figure 9 shows the average runtime on generated datasets of size 50k objects, with 10 clusters and various number of features. The runtime of both PCM and SPCM algorithms increases as the number of features increases. The runtime of SPCM is always less than PCM. The higher the number of features the higher the increase of the average runtime of PCM compared to SPCM. The runtime of PCM is about four times the average runtime of SPCM when the number of clusters equals fifty. The parameters of SPCM are fine tuned. Based on the fine-tuning experiments, in the above and next experiments, the discretization levels of SPCM are chosen equal to 256, α= 0.7, β= 0.55 and the increment and decrement steps are chosen equal to 8.

Fig. 9

Average runtime (sec.) on generated datasets of size 50k with ten clusters and different number of features.

4.4 Performance on standard datasets

4.4.1 Dataset description

Table 5, describes the adopted datasets. In our investigation in this work, we used the Iris data. Iris data is one of the earliest and most widely used data sets in data mining. In the Iris data set, there are 150 samples from three different Iris flower types. Each sample of Iris flowers has four attributes representing different measures (the length and width of the sepal, and the length and width of the petal). Each of the 5800 samples in the shuttle dataset belongs to one of five different classes and has 9 attributes all of which are numerical. About 80% of the data belongs to class 1. The Breast dataset has 569 samples and 9 numerical features that describe characteristics of the cell nuclei present in a digitized image of a fine needle aspirate (FNA) of a breast mass. Features in the Wine dataset are the quantities of 13 constituents (Alcohol, Malic acid,...) found using chemical analysis of 178 samples of wines grown in the same region in Italy but derived from three different cultivars (3 classes). The 34 features of Ionosphere dataset correspond to 17 complex values that represent the autocorrelation between the time of a pulse and its pulse number for each of the 17 pulses of the Goose Bay system. All datasets used in our experiments have numerical attributes and have no missing data. Euclidean distance is used as a measure of dissimilarity between samples. The Principal Component Analysis algorithm implemented in [41] is used to reduce the dimension to 10 for any dataset having a number of attributes larger than 10.

Table 5
Performance of SPCM compared to five related clustering algorithms on five standard datasets

Dataset Samples/Features/ Metric Classes Algorithms

HCM PCM FCM PFCM WPCM SPCM

Wine 178/13/3 VMI 0.79 0.736 0.424 0.475 0.834 0.842

AMI 0.79 0.734 0.417 0.468 0.833 0.840

ARI 0.81 0.747 0.380 0.373 0.854 0.881

Avg. Runtime (sec) 0.015 0.156 0.156 0.131 0.265 0.091

Iris 150/4/3 VMI 0.65 0.671 0.645 0.691 0.643 0.673

AMI 0.65 0.668 0.640 0.687 0.639 0.668

ARI 0.62 0.649 0.607 0.664 0.621 0.658

Avg. Runtime (sec) 0.015 0.140 0.124 0.125 0.374 0.082

Ionosphere 351/34/2 VMI 0.56 0.581 0.478 0.564 0.609 0.611

AMI 0.56 0.581 0.477 0.564 0.608 0.617

ARI 0.68 0.670 0.594 0.684 0.669 0.701

Avg. Runtime (sec) 0.031 0.203 0.156 0.141 0.296 0.136

Breast 683/9/2 VMI 0.56 0.609 0.639 0.558 0.485 0.560

AMI 0.56 0.608 0.639 0.558 0.484 0.571

ARI 0.68 0.670 0.750 0.677 0.568 0.687

Avg. Runtime (sec) 0.031 0.281 0.203 0.149 0.359 0.125

Shuttle 58000/9/5 VMI 0.32 0.274 0.105 0.194 0.312 0.291

AMI 0.32 0.274 0.104 0.194 0.312 0.290

ARI 0.25 0.217 0.064 0.125 0.147 0.235

Avg. Runtime (sec) 5 208 277 218 233 143

Dataset	Samples/Features/	Metric Classes	Algorithms
Wine	178/13/3	VMI	0.79	0.736	0.424	0.475	0.834	0.842
		AMI	0.79	0.734	0.417	0.468	0.833	0.840
		ARI	0.81	0.747	0.380	0.373	0.854	0.881
		Avg. Runtime (sec)	0.015	0.156	0.156	0.131	0.265	0.091
Iris	150/4/3	VMI	0.65	0.671	0.645	0.691	0.643	0.673
		AMI	0.65	0.668	0.640	0.687	0.639	0.668
		ARI	0.62	0.649	0.607	0.664	0.621	0.658
		Avg. Runtime (sec)	0.015	0.140	0.124	0.125	0.374	0.082
Ionosphere	351/34/2	VMI	0.56	0.581	0.478	0.564	0.609	0.611
		AMI	0.56	0.581	0.477	0.564	0.608	0.617
		ARI	0.68	0.670	0.594	0.684	0.669	0.701
		Avg. Runtime (sec)	0.031	0.203	0.156	0.141	0.296	0.136
Breast	683/9/2	VMI	0.56	0.609	0.639	0.558	0.485	0.560
		AMI	0.56	0.608	0.639	0.558	0.484	0.571
		ARI	0.68	0.670	0.750	0.677	0.568	0.687
		Avg. Runtime (sec)	0.031	0.281	0.203	0.149	0.359	0.125
Shuttle	58000/9/5	VMI	0.32	0.274	0.105	0.194	0.312	0.291
		AMI	0.32	0.274	0.104	0.194	0.312	0.290
		ARI	0.25	0.217	0.064	0.125	0.147	0.235
		Avg. Runtime (sec)	5	208	277	218	233	143

4.4.2 Performance indices

The third column in Table 5 lists the three metrics that are used in our experiments namely v-measure index (VMI), adjusted mutual information index (AMI) and adjusted rand index (ARI). AMI is a symmetric metric, i.e., swapping the predicted labels with the true labels does not change the score. Thus, it can be used as a consensus measure. VMI is also independent of the absolute values of the labels and it is measured as the harmonic mean between homogeneity and completeness. ARI measures the similarity between the true and predicted labeling, ignoring permutations. ARI has a score range of [–1,1] and it can be used to compare all kinds of clustering algorithms given the true labels.

4.4.3 Performance results

Table 5 presents the performance indices and runtime of the proposed SPCM algorithm compared to another five related methods on five standard datasets [42].

As shown in Table 5, SPCM outperforms all algorithms on both Wine and Ionosphere datasets in terms of VMI, AMI, and ARI. FCM was the winner algorithm on the Breast dataset. On the Iris dataset, the results of the five algorithms were close to each other, SPCM was the best in terms of VMI while PFCM in terms of AMI and ARI. The runtime of SPCM is slightly higher than HCM and less than the runtime of any of the other algorithms.

4.5 Detailed performance comparison on iris data

Table 6 shows the iteration time, number of iterations, total runtime, silhouette index, maxdist, mindist, and sumdist for five local minima computed using different random initialization on the Iris dataset. The maxdist, mindist, and sumdist correspond to the square root of the maximum, minimum and sum of g_i over i in Equation 13 where $v_{i}^{*}$ denotes the actual center for cluster i. The results show that the proposed algorithm requires a smaller number of iterations than PCM and the cost of a single iteration is about 80% of the cost of a PCM iteration. The total runtime of SPCM is the smallest. SPCM is the best in terms of silhouette index and slightly lower than the best in terms of other indices.

Table 6
Detailed performance on Iris dataset

Algorithm Runtime per Total Iteration No. local VMI AMI ARI Silhouette.index maxdist mindist sumdist

iteration(ms) runtime(ms) count minimum

PCM 44.501 2492.078 56 5 0.643 0.639 0.621 0.454 0.766 0.707 2.121

FCM 31.138 3176.032 102 5 0.684 0.680 0.653 0.454 0.523 0.379 1.138

SPCM 35.650 1568.614 44 5 0.673 0.668 0.659 0.459 0.791 0.703 2.1079

PFCM 48.428 1791.819 37 5 0.607 0.687 0.664 0.451 0.708 0.560 1.6807

WPCM 120.981 2419.626 20 5 0.641 0.636 0.611 0.457 0.939 0.787 2.3599

HCM 1.133 19275.234 17009 5 0.673 0.669 0.641 0.448 0.596 0.423 1.168

Algorithm	Runtime per	Total	Iteration	No. local	VMI	AMI	ARI	Silhouette.index	maxdist	mindist	sumdist
PCM	44.501	2492.078	56	5	0.643	0.639	0.621	0.454	0.766	0.707	2.121
FCM	31.138	3176.032	102	5	0.684	0.680	0.653	0.454	0.523	0.379	1.138
SPCM	35.650	1568.614	44	5	0.673	0.668	0.659	0.459	0.791	0.703	2.1079
PFCM	48.428	1791.819	37	5	0.607	0.687	0.664	0.451	0.708	0.560	1.6807
WPCM	120.981	2419.626	20	5	0.641	0.636	0.611	0.457	0.939	0.787	2.3599
HCM	1.133	19275.234	17009	5	0.673	0.669	0.641	0.448	0.596	0.423	1.168

$g_{i} = ∥ v_{i}^{*} - v_{i} ∥^{2}$ (13)

4.6 Performance of the proposed algorithm for different discretization levels

Results on Ionosphere for different values of discretization levels r = 2, 4, 8, 16, 32, 64, 128, 256, 1024, 2048, and 3072 on Ionosphere data are shown in Figs. 10 and 11. As illustrated in Fig. 10, the ARI increases up to r = 256 then saturates. However, this parameter is data-dependent, if the clusters highly overlap, a larger value for r can be chosen without much increase in runtime. On the other hand, the runtime of SPCM increases linearly with r as shown in Fig. 11. Compared to the traditional PCM, the runtime of SPCM is much lower for all values of r. From the complexity analysis of SPCM it is obvious that the log r comparisons required for computing the membership of an object to a cluster-center are less expensive than the required operations in the other algorithms because it does not require multiplications, divisions, and exponentiations.

Fig. 10

ARI of SPCM on Ionosphere for different discretization levels.

Fig. 11

The runtime of SPCM on ionosphere for different discretization levels.

4.7 Computational complexity

The step of computing the centers of the clusters in SPCM is O (r c) multiplications where r is the number of discretization levels and c is the number of clusters, while in PCM it is O (n c) where n> > r for large n. The step of computing the memberships for SPCM is O (n c log r) comparisons while for PCM it is O (n c). However, the cost of a comparison operation is much less than that of exponentiation, multiplication and division which is clear from the runtime analysis of the algorithms.

Results on Ionosphere for different discretization levels r = 2, 4, 8, 16, 32, 64, 128, 256, 1024, 2048, and 3072 as shown in Figs. 10 and 11. As illustrated in Fig. 10, the ARI increases up to r = 256 then saturates.

4.8 Implementation details

The above three metrics are available in Scikit-Learn [41]. The proposed algorithm along with several related algorithms are implemented in Python 3.8 on windows 10 and all machine learning tasks have been performed using Scikit-Learn with Intel Core i5, 2.5 GHz processor, and 16 GB RAM. The source code, as well as datasets and sample output are available at GitHub [40].

5 Conclusion

In this paper, an efficient semi-possibilistic c-means clustering algorithm (SPCM) is presented aiming at facilitating future applications of PCM to encrypted data. The required exponentiation and division operations in the traditional possibilistic c-means (PCM) complicate its implementation, especially on homomorphically-encrypted data. The key idea is to avoid the need for exponentiation and division operations at each iteration by restricting the memberships to a set of discrete values between zero and one differing by a small step size (possibly unfixed). The analysis of the computational complexity of SPCM shows that the computation of centers, memberships, dealing with noise, and the adaptation of its parameters during its execution are simple and more efficient than PCM. Also, unlike existing algorithms, SPCM increment the soft memberships of the identified core object in the winner prototype while it decrements its memberships in the other clusters instead of setting them to one for the winner and zero otherwise as in the cutset algorithm [28]. Similarly, the memberships of identified outlier objects are decremented in all clusters instead of setting their memberships to zero. Soft modification to the memberships of identified core and outlier objects increases the robustness of SPCM. As shown from the experimental results, the average percent of the achieved reduction in runtime is 35% and the average percent of the achieved increase in v-measure, adjusted mutual information, and adjusted rand index is 6% on five datasets compared to PCM. The larger the dataset, the higher the reduction in runtime. Also, SPCM achieved a comparable performance with less computational complexity compared to variants of related algorithms. Applying SPCM to homomorphically-encrypted data to cluster big data on the cloud without the disclosure of private data is a subject of an ongoing investigation.

Footnotes

Acknowledgments

I would like to thank Prof. Amin Shoukry (Alexandria University) for helping in improving the manuscript.

References

Khanmohammadi

, Adibeig

and Shanehbandy

, An improved overlapping k-means clustering method for medical applications, Expert Systems with Applications 67 (2017), 12–18.

Bezdek

J.C.

Objective function clustering, in Pattern recognition with fuzzy objective function algorithms, 1981, Springer, p. 43–93.

Krishnapuram

and Keller

J.M.

, The possibilistic c-means algorithm: insights and recommendations, IEEE transactions on Fuzzy Systems 4(3) (1996), 385–393.

Krishnapuram

and Keller

J.M.

, A possibilistic approach to clustering, IEEE transactions on Fuzzy Systems 1(2) (1993), 98–110.

Rumelhart

D.E.

and Zipser

, Feature discovery by competitive learning, Cognitive science 9(1) (1985), 75–112.

Yang

M.-S.

, Hung

W.-L.

and Chen

D.-H.

, Self-organizing map for symbolic data, , Fuzzy Sets and Systems 203 (2012), 49–73.

Wang

, Yang

, Muntz

STING: A statistical information grid approach to spatial data mining, in VLDB, 1997.

Ester

et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in Kdd, 1996.

Murtagh

and Contreras

, Algorithms for hierarchical clustering:an overview, II, Wiley Interdisciplinary Reviews: Data Miningand Knowledge Discovery 7(6) (2017), e1219.

10.

Masulli

and Rovetta

, Soft transition from probabilistic to possibilistic fuzzy clustering, IEEE Transactions on Fuzzy Systems 14(4) (2006), 516–527.

11.

Peters

, Is there any need for rough clustering? , Pattern Recognition Letters 53 (2015), 31–37.

12.

Mitra

, Banka

and Pedrycz

, Rough–fuzzy collaborative clustering, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 36(4) (2006), 795–805.

13.

Smarandache

Neutrosophy, a new Branch of Philosophy, 2002: Infinite Study.

14.

Singh

, A neutrosophic-entropy based clustering algorithm (NEBCA) with HSV color system: A special application in segmentation of Parkinson’s disease (PD) MR images, , Computer Methods and Programs in Biomedicine 189 (2020), 105317.

15.

Bai

et al., Fast density clustering strategies based on the k-means algorithm, , Pattern Recognition 71 (2017), 375–386.

16.

Mehmood

et al., Clustering by fast search and find of density peaks via heat diffusion, , Neurocomputing 208 (2016), 210–217.

17.

Sieranoja

and Fränti

, Fast and general density peaks clustering, , Pattern Recognition Letters 128 (2019), 551–558.

18.

Khan

G.A.

et al., Multi-view low rank sparse representation method for three-way clustering, International Journal of Machine Learning and Cybernetics 13(1) (2022), 233–253.

19.

Chen

et al., A Weight Possibilistic Fuzzy C-Means Clustering Algorithm, , Scientific Programming 2021 (2021).

20.

Askari

, Fuzzy C-Means clustering algorithm for data with unequal cluster sizes and contaminated with noise and outliers: Review and development, , Expert Systems with Applications 165 (2021), 113856.

21.

Xenaki

S.D.

, Koutroumbas

K.D.

, Rontogiannis

A.A.

Koutroumbas and A.A. Rontogiannis, A Novel Online Generalized Possibilistic Clustering Algorithm for Big Data Processing, in 2018 26th European Signal Processing Conference (EUSIPCO), 2018, IEEE.

22.

Chowdhary

C.L.

et al., An efficient segmentation and classification system in medical images using intuitionist possibilistic fuzzy C-mean clustering and fuzzy SVM algorithm, Sensors 20(14) (2020), 3903.

23.

Szilagyi

, Lefkovits

and Szil´

S.M.

, Self-Tuning Possibilistic c-Means Clustering Models, International Journal of Uncertainty,Fuzziness and Knowledge-Based Systems 27(Supp01) (2019), 143–159.

24.

Selim

S.Z.

and Ismail

M.A.

, Soft clustering of multidimensional data: a semi-fuzzy approach, Pattern Recognition 17(5) (1984), 559–568.

25.

Mahfouz

M.A.

, Ismail

Fuzzy relatives of the CLARANS algorithm with application to text clustering, in Proceedings of World Academy of Science, Engineering and Technology, 2009, Citeseer.

26.

Mahfouz

M.A.

, Ismail

M.A.

Efficient soft relational clustering based on randomized search applied to selection of bio-basis for amino acid sequence analysis, in 2012 Seventh International Conference on Computer Engineering & Systems (ICCES), 2012, IEEE.

27.

Mahfouz

M.A.

, Ismail

M.A.

Soft flexible overlapping biclustering utilizing hybrid search strategies, in International Conference on Advanced Machine Learning Technologies and Applications, 2012, Springer.

28.

and Fan

, Cutset-type possibilistic c-means clustering algorithm, Applied Soft Computing 64 (2018), 401–422.

29.

Xenaki

S.D.

, Koutroumbas

K.D.

and Rontogiannis

A.A.

, A novel adaptive possibilistic clustering algorithm, IEEE transactions on Fuzzy Systems 24(4) (2015), 791–810.

30.

Armbrust

et al., A view of cloud computing, Communications of the ACM 53(4) (2010), 50–58.

31.

Zhang

and Chen

, A weighted kernel possibilistic c-means algorithm based on cloud computing for clustering big data, International Journal of Communication Systems 27(9) (2014), 1378–1391.

32.

Havens

T.C.

et al., Fuzzy c-means algorithms for very large data, IEEE transactions on Fuzzy Systems 20(6) (2012), 1130–1146.

33.

Zhang

et al., A High-Order Possibilistic $C$-Means Algorithm for Clustering Incomplete Multimedia Data, IEEE Systems Journal 11(4) (2015), 2160–2169.

34.

Zhang

et al., Privacy-preserving double-projection deep computation model with crowdsourcing on cloud for big data feature learning, IEEE Internet of Things Journal 5(4) (2017), 2896–2903.

35.

Zhang

et al., Secure weighted possibilistic c-means algorithm on cloud for clustering big data, , Information Sciences 479 (2019), 515–525.

36.

Al-Fuqaha

et al., Internet of things: A survey on enablingtechnologies, protocols, and applications, IEEE Communications Surveys&Tutorials 17(4) (2015), 2347–2376.

37.

Gheid

, Challal

Efficient and privacy-preserving k-means clustering for big data mining, in 2016 IEEE Trustcom/BigDataSE/ISPA, 2016, IEEE.

38.

Mohassel

, Rosulek

, Trieu

Practical privacypreserving k-means clustering, Cryptology ePrint Archive, 2019.

39.

Zhu

and Li

, Privacy-preserving k-means clustering with local synchronization in peer-to-peer networks, Peer-to-Peer Networking and Applications 13(6) (2020), 2272–2284.

40.

Available at: https://github.com/mamahfouz66/SPCM.

41.

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12 (2011), 2825–2830. Available from: https://scikit-learn.org/stable/about.html.

42.

UC Irvine Machine Learning Repository available at: http://archive.ics.uci.edu/ml/datasets.