Abstract
The required division and exponentiation operations needed per iteration for the possibilistic c-means (PCM) clustering algorithm complicate its implementation, especially on homomorphically-encrypted data. This paper presents a novel efficient soft clustering algorithm based on the possibilistic paradigm, termed SPCM. It aims at easing future applications of PCM to encrypted data. It reduces the required exponentiation and division operations at each iteration by restricting the membership values to an ordered set of discrete values in [0,1], resulting in a better performance in terms of runtime and several other performance indices. At each iteration, distances to the new clusters’ centers are determined, then the distances are compared to the initially computed and dynamically updated range of values, that divide the entire range of distances associated with each cluster center into intervals (bins), to assign appropriate soft memberships to objects. The required number of comparisons is O(log the number of discretization levels). Thus, the computation of centers and memberships is greatly simplified during execution. Also, the use of discrete values for memberships allows soft modification (increment or decrement) of the soft memberships of identified outliers and core objects instead of rough modification (setting to zero or one) in related algorithms. Experimental results on synthetic and standard test data sets verified the efficiency and effectiveness of the proposed algorithm. The average percent of the achieved reduction in runtime is 35% and the average percent of the achieved increase in v-measure, adjusted mutual information, and adjusted rand index is 6% on five datasets compared to PCM. The larger the dataset, the higher the reduction in runtime. Also, SPCM achieved a comparable performance with less computational complexity compared to variants of related algorithms.
Keywords
Introduction
Clustering is an unsupervised learning technique that is used to group a set of similar objects such that an object is more similar to objects in its group than to objects in other groups. By clustering analysis, the correlation of data attributes and the globally distributed model can be discovered. Partitioning clustering algorithms such as K-means [1], fuzzy c-means (FCM) algorithm [2], possibilistic c-means(PCM) [3, 4], and mode-based methods such as competitive learning [5] are the most popular and widely used on the cloud due to their efficiency and simplicity. PCM is more robust than FCM since it gives low membership for outliers in all clusters. Other categories of clustering algorithms are self-organizing map (SOM) [6], grid-based methods [7], density-based methods [8] and hierarchical methods [9].
In hard c-means, each data point is given a membership of 0 or 1, on the other hand, fuzzy c-means gives each data point a membership value between 0 and 1 to each cluster. The closer an object is to a cluster center the greater is its membership in that cluster. Clearly, the sum of memberships of each object should be equal to one. After each iteration, membership values and clusters’ centers are updated according to a given formula. Fuzzy c-means suffers from high computation time, and sensitivity to noise.
To overcome the difficulties of the fuzzy c-means, a new clustering model named possibilistic c-Means (PCM) has been proposed in [3, 4]. A unifying view of fuzzy and possibilistic memberships can be found in [10] where clusters are considered neither independent nor rigidly related which results in a better outlier rejection, relative to FCM, and better cluster isolation relative to PCM.
Both FCM and PCM algorithms aim to capture effectively different grades of uncertainty that help to cope with noise and outliers, producing a more compact and well-separated clustering structure. However, hard clustering is still a practical option when the cost of misclassification is very high [11], i.e., the number of incorrectly classified points should be minimized independently from the correctly classified ones.
Rough fuzzy clustering is introduced in [12] where objects assignments to lower or upper approximations are based on membership degrees instead of Euclidean distances, and where points in lower approximations and boundary regions are weighted differently. In [13], neutrosophic set (ns) theory is proposed as a generalization of the theory of intuitionistic fuzzy sets. In ns theory, uncertainty is represented in terms of a truth-membership, an indeterminacy-membership, and a falsity-membership. One of the recent applications of this theory to MR image segmentation was found in [14] where MR regions are segmented based on the neutrosophical entropy information (NEI) function that provides the basis for the representation of ambiguity in MR images.
In [15], two new scalable algorithms based on the c-means algorithm are proposed as enhancements to clustering by fast search and find of density peaks (CFSFDP) [16]. One of the two algorithms uses an acceleration mechanism based on concept approximation while the other is faster and uses exemplar clustering to obtain slightly less accurate results compared to the original algorithm. The complexity of density peaks is reduced in [17] to allow efficient clustering of very large datasets.
A three-way Multi-view clustering is proposed in [18] in which an objective function is constructed and optimized to preserve the consistency between the views by utilizing the low-rank sparse representation technique to build the consensus low rank matrix. Also, they pursue the neighborhood-based three-way clustering technique to reflect the data items into core and fringe regions.
The weighted possibilistic fuzzy c-mean WPFCM in [19] considers the importance of membership and typicality in the clustering process while assigning weights. Re-substitution errors of WPFCM are found to be near to FCM and PFCM, but far less than PCM. According to several performance indices, WPFCM overcomes the sensitivity weakness of FCM and resolves the coincident problem of PCM and unreasonable weight parameters of PFCM. Another work in the same direction can be found in [20].
In [21], an online generalized possibilistic c-means clustering algorithm, called Online Generalized Adaptive Possibilistic C-Means (O-GAPCM) is proposed. Based on APCM and extending its abilities, O-GAPCM is also able to unravel hyper-ellipsoidally shaped clusters while performing online processing, which makes it very computationally efficient. Experimental results show that O-GAPCM exhibits the best compromise between accuracy and time efficiency, compared to other related algorithms.
In [22], Intuitionist Possibilistic Fuzzy c-mean (IPFCM) is proposed to minimize the effect of outliers during the clustering process. IPFCM achieves reliable outlier detection while maintaining clustering consistency. In [23], a two-stage self-tuning possibilistic c-means clustering algorithm is proposed that can reliably handle outlier data than PCM, and reduce the number of parameters.
Early implementation for semi-fuzzy (soft) clustering is found in [24] where three mathematical models for semi-fuzzy clustering with their corresponding algorithms are introduced. The first algorithm limits the fuzziness to a smaller subset of the available clusters while the other two algorithms allow a pattern to be associated with the proper number of clusters based on a threshold value on distance or membership. In [13], a rough-fuzzy c-medoids algorithm is based on the principles of rough sets and fuzzy sets [15]. The concept of soft clustering is applied to fuzzy CLARANS [25] in [26]. Also, in [27], a soft bi-clustering algorithm is introduced in which the memberships of rows and columns are changed between a set of fixed values as an alternative to hard bi-clustering.
PCM is very sensitive to initialization and may generate coincident clusters. Reducing the memberships of core points to other clusters [28] and the adaptive determination of the penalty parameters [29] can greatly help in reducing the sensitivity to initialization, removing the clusters that gradually become obsolete, and avoiding coincident clusters.
Due to its high computational complexity, PCM is not suitable for clustering big data with a large number of objects efficiently. Furthermore, PCM needs to load all the objects into the main memory which is limited in local computing devices. Cloud computing offers a scalable and cost-efficient solution for clustering big data by providing tremendous memory space and high processing power [30]. Several improved PCM algorithms have been proposed for big data clustering such as weighted kernel possibilistic c-means algorithm [31], incremental WPCM [32, 33], as well as distributed weighted possibilistic c-means algorithm [34], and secure weighted possibilistic c-means for privacy-preserved clustering of big data on the cloud [35].
Uploading the raw data to the cloud directly as in [34] poses a serious threat to data security. Also, privacy-preserved algorithms [35] that are based on approximating the required calculations using Taylor’s expansion always produce lower clustering accuracy than their corresponding conventional algorithms. In [35], a secure weighted possibilistic c-means algorithm based on a full homomorphic encryption scheme (BGV) [36] is introduced for secure data clustering on the cloud. The BGV scheme allows only addition and multiplication on encrypted data. In [37], a k-means algorithm is proposed that guarantees privacy-preserving under a given security model to avoid degradation caused by using encryption schemes in big data analysis, however, it is limited to a specific security model. Mohassel et al. [38] learned the k-means protocol based on privacy-preserving between different databases, aiming at protecting contextual privacy. Zhu et al. [39] designed a privacy-preserving k-means clustering scheme based on distributed data in P2P networks.
When building a privacy-preserving approach for a data mining algorithm, the appropriate scenario and the balance of privacy preservation, accuracy, data utility, and efficiency are both major problems. To help tackle these problems, this paper proposes a semi possibilistic clustering algorithm based on the possibilistic paradigm. The main contributions of this work can be summarized as follows: The proposed algorithm (SPCM) reduces the complexity of possibilistic c-means (PCM) by restricting the memberships to a set of discrete values between zero and one differing by a small step size. The required multiplications in computing the centers are reduced to O (cr) where r is the number of discretization levels and c is the number of clusters. SPCM avoids the division and exponentiation operations needed per iteration of the possibilistic c-means (PCM) clustering algorithm for computing the memberships which complicate its implementation, especially on homomorphically-encrypted data. The between-class relationships are increased by decrementing the soft memberships of non-winner prototypes and incrementing the winner typicality. In this way, the algorithm avoids the effect of early wrong rough decisions that may be taken in [28]. The proposed algorithm can deal with outliers efficiently. After identifying them as the lower tail of the soft memberships-based histogram, the identified outliers’ soft memberships are decremented (instead of setting them to zero). It can incorporate additional information from the user such as the expected percentage of noise.
However, determining the typicality still necessitates a magnitude relation comparison, which is difficult to execute over encrypted data and has been deferred for now.
The efficiency and effectiveness of the proposed algorithm are evaluated on several standard datasets and synthetic datasets by comparing SPCM with hard c-means (HCM), Fuzzy c-means (FCM), possibilistic c-means (PCM), possibilistic fuzzy c-means (PFCM), and the weighted possibilistic c-means (WPCM). The results show that SPCM is more efficient than related fuzzy algorithms and its effectiveness compares favorably to PCM.
The paper is organized as follows. The detailed description of PCM and WPCM are reviewed in Section 2. The proposed algorithm is presented in Section 3. The experimental results are given in Section 4 and, finally, the paper is concluded in section 5.
Related work
Before describing the proposed algorithm and other related algorithms in the following sections, commonly used abbreviations and symbols in these sections are listed in Table 1.
Abbreviations and symbols used in the text
Abbreviations and symbols used in the text
The possibilistic approach to clustering [3, 4] assumes that the membership value of a data point to a fuzzy set (or cluster) is absolute, i.e. it is an evaluation of a degree of typicality and does not depend on the membership values of the same point to other clusters.
Let X = {x1, x2 …… , x
n
} be a set of unlabeled data points, V = {v1, v2 …… , v
c
} a set of cluster centers (or prototypes) and U = [u
ij
] the fuzzy membership matrix. In the possibilistic c-means (PCM) algorithms, the constraints on the elements of U are relaxed to:
The membership u
ij
is the fuzzy membership of object x
j
in cluster i and can be defined heuristically in several ways. The above requirements simply imply that a cluster cannot be empty and each pattern must be assigned to at least one cluster. This turns a standard fuzzy clustering procedure into a mode-seeking algorithm [3, 4]. The objective function of PCM contains two terms; the first one is the objective function of the fuzzy c-means [2], while the second is a penalty term:
and
Equations (2) and (3) can be used as formulas for updating the membership functions and the clusters centers.
Algorithm 1 shows the pseudo-code for the traditional possibilistic c-means algorithm (PCM).
In PCM, each object is assigned the same weight, however, the importance of objects differs. Furthermore, PCM cannot yield the desirable clustering results for the datasets including noisy objects or outliers. To tackle this problem, a weighted possibilistic c-means algorithm (WPCM) is presented in [16] by assigning a weight value to each object, which leads to the following objective function of WPCM:
The weights are updated using the following formula:
The memberships are computed as follows:
While the centers are computed as in Equation (3).
Soft memberships
By restricting the memberships to a set of discrete values between zero and one, the squared distances that correspond to these membership values can be expressed as a function of these memberships. Using Equation (2) of the standard PCM, a squared distance can be expressed as follows:
Raising the two sides of the equation to (m-1)>0 power gives
Then
When the set of discrete membership values is given as an input parameter, then the corresponding bins can be initially computed, for each cluster, given the value of η i for i = 1, 2,.., c. Each time, η i is updated, the new corresponding discretization bin boundaries can be updated accordingly. To compute the membership of an object in a certain cluster, its distance to the cluster center is compared to the values stored in the bin vector associated with this cluster. For example, if a fixed step equal to 1/r is used then the corresponding discrete values μ will be {0, 1/r, 2/r, …, 1}. Each cluster i will have a bin vector Bi where Bi is a set of r-1 bin values corresponding to μ –{0, 1}. As the bin vector is sorted, searching for the proper bin given the squared distance needs O (log r) comparisons. Bi is used for discretizing the distances to the center of cluster i. The experimental results show that when using large enough (16 or more) discrete values, the proposed algorithm can produce accurate results with a high reduction in runtime as the cost of comparisons is less than the cost of exponentiation and division operations. The values stored in B = {b1, b2, b3, … br-1} can be used to approximate the exact PCM membership given d2 as shown in Equation (8). μ can have arbitrary values between 0 and 1, for example, we may choose μ = {0, 0.01, 0.2, 0.4, 0.45, 0.5, 0.55, 0.6, 0.9, 1} to increase the accuracy of memberships in a certain region. To ease the computation of the centers, another vector μ m is initially computed to store the same values of μ raised to power m. Each value in μ is considered a category and crisp membership functions in these categories are used in Equation (8). Another approach is to use Trapezoidal membership for categories 0 and 1 and Triangular membership for the other categories as shown in Table 2. In the fuzzy membership approach, the division is needed to compute the approximated membership from the squared distance d2 as shown in Equation (9). Figure 1 shows the exact membership in Equation (2) compared to the crisp and fuzzy approaches for approximating the membership of PCM given d2 and B. The start and the end of the triangles in the lower part of Fig. 1 correspond to the values in B. As the upper part of Fig. 1 shows, the crisp approach can approximate the membership well if the number of discretization levels is chosen large enough (for example 16 or higher). In this approach we avoid the need for division and exponentiation which is the main concern of the proposed algorithm and at the same time the worst case for the cost of the calculation of membership is reduced. In the rest of the paper Equation (8) is used. The MATLAB code that produced Fig. 1 will be available in [40].

Approximating the membership function of PCM.
Knowing the number of objects having a soft membership s
ij
= k in cluster i (n
ik
), the computation of a center requires O(r) multiplications where r is the discretization level. The vector μ
m
stores the corresponding membership of μ raised to power m. It is computed once for all the values stored in the soft membership vector μ. B is updated in each iteration while μ and μ
m
are fixed. The centers are computed as follows:
To be able to give a lower weight for outliers, the proposed algorithm computes a histogram
Introducing the between-class relationships
To overcome the PCM’s coincident clustering problem, the cutset idea, introduced in [28], is used. In this approach, the winner prototype is identified as the prototype having the maximum membership and its membership is higher than a threshold. The between-class relationships are introduced by setting the membership of a core object in the winner prototype to one and setting the non-winner typicality degree to zero to increase the separability of the produced clusters. Such modifications do not affect the typicality of outliers since PCM by its nature gives small typicality values to outliers. In the proposed algorithm SPCM, the threshold is chosen experimentally as a soft membership near 0.7. The between-class relationships are increased by decrementing the soft memberships of non-winner prototypes and incrementing the winner typicality.
The proposed algorithm (SPCM)
As shown in the pseudo code of algorithm 3, the SPCM algorithm starts by selecting initial cluster centers. If a fixed step 1/r is used, the corresponding memberships are computed and stored in the vector μ and the memberships raised to power m are computed and stored in the vector μ m . For example, if r = 4 and m = 2 then μ={0,0.25,0.50,0.75,1} and μ m = {0, 0 . 252, 0 . 502, 0 . 752, 1}).
Also, each cluster i has a vector B i containing the discretization bins, i.e., the distances that correspond to the values in μ and it is updated using Equation (11).
Initial values for η i are computed from the initial clustering. At each iteration, the distances to the cluster centers are computed and compared to the discretization bins in a manner similar to binary search. The values in S are the soft memberships (integer values) of the identified discretization level for each object in a certain cluster. An efficient implementation for calculating the memberships is done in Python using list comprehensions and the “bisect“ function [40].
Outliers belonging to the lower tail of a histogram for
Finally, the new value for the vector B
i
representing the discretization bins for cluster i in iteration t + 1 is updated as follows:
The algorithm stops if it reaches the maximum number of iterations or the change in clusters centers becomes very small within two successive iterations. The process may be repeated numlocal times and the best solution is returned as shown in Fig. 2.

Block diagram for the proposed algorithm (SPCM).
In the following illustrative example, we show one iteration of SPCM over two-dimensional datasets represented by the first three columns in Table 3. As shown in Fig. 3, the number of clusters is chosen equals to 2 (c = 2) and the corresponding initial centroids are: v1 = (1, 1) and v2 = (5, 7). The squared Euclidian distance of each object to the first and second cluster centers are shown in the fourth and the fifth column respectively.

The dataset used in the explanatory example.
Fuzzy membership functions to approximate the PCM membership given d2
Illustrative example
Given the initial clustering, the 1st, 2nd and 3rd objects belong to the first cluster and their distances (din) to their cluster center are 0, 1.12, 3.61, respectively. Similarly, the remaining five objects belong to the second cluster and their distances to its center are 0, 3.61, 2.5, 2.06, 2.92, 8.06. The minimum din is 0 while the maximum is 3.61 for the first cluster and 8.06 for the second cluster.
By choosing the fuzzification factor m = 2 and the membership vector μ=[0,.05,.25,.5,.75,.9, 1] then μ2 = [0,.0025,.063,.25,.56,.81, 1].
The discretization bins for each cluster are computed from the corresponding memberships in μ (excluding 0 and 1) by using Equation (7). At the current iteration, the discretization bins for the first cluster become B1 = η1 * [399, 3, 1, 0 . 33, 0 . 11] where η i is dynamically computed at each iteration and selected in the interval max(din) [1/ α –1] m-1 > ηi>min(din) [1/0.99 - 1] m-1 where 0.5 = < α < 1. If α is chosen equal to 0.5 then 3.61 > η1 > 0. If we select η1= 1.8 then B1 = 1.8×[399, 3, 1, 0.33, 0.11] = [718.2, 5.4, 1.8, 0.594, 0.198]. Similarly, 8.06> η2> 0, 4.03 is a suitable choice for η2. B2 = 5 * [399, 3,1, 0.33, 0.11] = [1608,15,5, 1.65, 0.55]. The soft memberships for cluster 1 are S1 = [5, 4, 3, 3, 3, 3, 3] and the soft memberships for the second cluster are S2 =[3, 3, 3, 6, 3, 3, 3]. Using the procedure proposed in section 3.3, x8 may be identified as a candidate outlier and the soft memberships are further refined by decreasing the memberships of x8 by one level in the two clusters. Also, the between-class distance is increased by decrementing the soft memberships of non-winner prototypes and incrementing the winner typicality as shown in columns 8 and 9. The threshold is chosen equals to the soft membership 4 (between 0.75 and 0.9).
The new centroid v1 for cluster 1 can be computed using the refined values of the soft memberships and their corresponding values stored in μ
m
as follows:
Similarly, the new centroid v2 for cluster 2 can be computed as follows:
It is clear that the new centers after one iteration are very close to the centers of the true clusters 1,2 and {3, 4, 5, 6, 7}.
As shown in the above example, the memberships are calculated without the need of division or exponentiation. Also, the number of multiplications needed for computing the centers is reduced to the order of r multiplications. Still, few divisions of (O(c)) are needed in computing cluster centers and η. These divisions can be approximated using Taylor expansion as shown in [35].
Tuning the fuzzification parameter (m)
The fuzzification parameter m plays an important role in computing the discretization bins B using Equation (7). Its value determines the rate of decay in the computed values of B from the discrete memberships μ given as input to SPCM. When m = 1, the memberships in μ are crisp i.e. {0, 1} and Bi will have a single value equal to η i where i = 1, 2, . . . , c. All points with d2 (x j , V i ) greater than η i will have a zero membership to cluster i. For large values of m, the values in Bi decay slowly. Similar to PCM, increasing values of m in SPCM increases the possibility that all points in the dataset belong to a given cluster resulting in coincidental clusters. Whereas, in the FCM, increasing values of m increase the sharing of points among all clusters. Thus, the value of m that gives us satisfactory performance is different in the two approaches. A value of m = 2 is known to give good results with the FCM while a more appropriate choice for PCM is found to be m = 1.5. As shown in Figs. 4 and 5, a value between 2 to 2.5 is appropriate for the proposed algorithm on the Iris dataset in terms of ARI and runtime, respectively. A value of 2 for the fuzzifier m is chosen for SPCM in the rest of the paper.

ARI for different values of fuzzifier m on Iris.

Runtime for different values of m on Iris.
In this section, the proposed algorithm is applied to a two-dimensional data set X12 with 12 points, as shown in Table 4. The coordinates of X12 are given in the 2nd and 3rd rows of Table 4. Let {x1, x2, …, x10} ∪ {x11, x12} be denoted by X12. x11 is called an “inlier” (bridge) and x12 an “outlier” (noise). x11 and x12 are equidistant from all corresponding pairs of points in the two clusters. For ease of comparison to the other five algorithms, a normalized coefficient [28] in Equation (12) is used. The results in Table 4 are those reported in [28], the fuzzification factor m is chosen equal to 2 for SPCM similar to the value chosen in the experiments with the related algorithms in Table 4.
Comparing SPCM to related algorithms on X12 using Normalized weight coefficients of different prototypes
Comparing SPCM to related algorithms on X12 using Normalized weight coefficients of different prototypes
As shown in Table 4, it is clear from column 14 that the proposed algorithm SPCM performed much better than the other five related algorithms and was able to assign the outlier object x12 the lowest membership which is zero resulting in a negligible influence on the determination of the cluster centers as shown in Fig. 6. Consequently, the SPCM preserves the strong performance of the PCM to outliers. Also, the proposed SPCM method successfully overcomes the coincident clustering problem of the PCM and obtains distinct centroids as shown in Fig. 6.

Determined ceneters for X12 dataset using SPCM.
In this section, the average runtime of SPCM is compared to the average runtime of the traditional PCM on several synthetic datasets of different sizes ranging from 50k to 500k objects, on datasets of size 50k with two features and number of clusters ranging from 2 to 50, and, finally, on datasets of 50k objects and 10 clusters with the number of features ranging from 2 to 10.
Figure 7 shows the average runtime of the traditional PCM and the proposed SPCM. The runtime of SPCM is much less than PCM. The difference increases as the dataset size increases. The runtime of PCM is about five times the runtime of the SPCM for the dataset of size 500k.

Average runtime (sec.) on generated datasets of size 50k objects with two features and number of clusters ranging from to 2 to 50.
Figure 8 shows the average runtime (sec.) on generated datasets of size 50k objects, with two features and a different number of clusters. The runtime of both PCM and SPCM algorithms increases with the increase in the number of clusters as the complexity analysis shows later. The runtime of SPCM is always less than PCM. The higher the number of clusters the higher the increase of the average runtime of PCM compared to SPCM. The runtime of PCM is about twelve times the average runtime of SPCM when the number of clusters is equal to fifty.

Average runtime (sec.) on generated datasets of different sizes ranging from 50k to 500k.
Figure 9 shows the average runtime on generated datasets of size 50k objects, with 10 clusters and various number of features. The runtime of both PCM and SPCM algorithms increases as the number of features increases. The runtime of SPCM is always less than PCM. The higher the number of features the higher the increase of the average runtime of PCM compared to SPCM. The runtime of PCM is about four times the average runtime of SPCM when the number of clusters equals fifty. The parameters of SPCM are fine tuned. Based on the fine-tuning experiments, in the above and next experiments, the discretization levels of SPCM are chosen equal to 256, α= 0.7, β= 0.55 and the increment and decrement steps are chosen equal to 8.

Average runtime (sec.) on generated datasets of size 50k with ten clusters and different number of features.
Dataset description
Table 5, describes the adopted datasets. In our investigation in this work, we used the Iris data. Iris data is one of the earliest and most widely used data sets in data mining. In the Iris data set, there are 150 samples from three different Iris flower types. Each sample of Iris flowers has four attributes representing different measures (the length and width of the sepal, and the length and width of the petal). Each of the 5800 samples in the shuttle dataset belongs to one of five different classes and has 9 attributes all of which are numerical. About 80% of the data belongs to class 1. The Breast dataset has 569 samples and 9 numerical features that describe characteristics of the cell nuclei present in a digitized image of a fine needle aspirate (FNA) of a breast mass. Features in the Wine dataset are the quantities of 13 constituents (Alcohol, Malic acid,...) found using chemical analysis of 178 samples of wines grown in the same region in Italy but derived from three different cultivars (3 classes). The 34 features of Ionosphere dataset correspond to 17 complex values that represent the autocorrelation between the time of a pulse and its pulse number for each of the 17 pulses of the Goose Bay system. All datasets used in our experiments have numerical attributes and have no missing data. Euclidean distance is used as a measure of dissimilarity between samples. The Principal Component Analysis algorithm implemented in [41] is used to reduce the dimension to 10 for any dataset having a number of attributes larger than 10.
Performance of SPCM compared to five related clustering algorithms on five standard datasets
Performance of SPCM compared to five related clustering algorithms on five standard datasets
The third column in Table 5 lists the three metrics that are used in our experiments namely v-measure index (VMI), adjusted mutual information index (AMI) and adjusted rand index (ARI). AMI is a symmetric metric, i.e., swapping the predicted labels with the true labels does not change the score. Thus, it can be used as a consensus measure. VMI is also independent of the absolute values of the labels and it is measured as the harmonic mean between homogeneity and completeness. ARI measures the similarity between the true and predicted labeling, ignoring permutations. ARI has a score range of [–1,1] and it can be used to compare all kinds of clustering algorithms given the true labels.
Performance results
Table 5 presents the performance indices and runtime of the proposed SPCM algorithm compared to another five related methods on five standard datasets [42].
As shown in Table 5, SPCM outperforms all algorithms on both Wine and Ionosphere datasets in terms of VMI, AMI, and ARI. FCM was the winner algorithm on the Breast dataset. On the Iris dataset, the results of the five algorithms were close to each other, SPCM was the best in terms of VMI while PFCM in terms of AMI and ARI. The runtime of SPCM is slightly higher than HCM and less than the runtime of any of the other algorithms.
Detailed performance comparison on iris data
Table 6 shows the iteration time, number of iterations, total runtime, silhouette index, maxdist, mindist, and sumdist for five local minima computed using different random initialization on the Iris dataset. The maxdist, mindist, and sumdist correspond to the square root of the maximum, minimum and sum of gi over i in Equation 13 where
Detailed performance on Iris dataset
Detailed performance on Iris dataset
Results on Ionosphere for different values of discretization levels r = 2, 4, 8, 16, 32, 64, 128, 256, 1024, 2048, and 3072 on Ionosphere data are shown in Figs. 10 and 11. As illustrated in Fig. 10, the ARI increases up to r = 256 then saturates. However, this parameter is data-dependent, if the clusters highly overlap, a larger value for r can be chosen without much increase in runtime. On the other hand, the runtime of SPCM increases linearly with r as shown in Fig. 11. Compared to the traditional PCM, the runtime of SPCM is much lower for all values of r. From the complexity analysis of SPCM it is obvious that the log r comparisons required for computing the membership of an object to a cluster-center are less expensive than the required operations in the other algorithms because it does not require multiplications, divisions, and exponentiations.

ARI of SPCM on Ionosphere for different discretization levels.

The runtime of SPCM on ionosphere for different discretization levels.
The step of computing the centers of the clusters in SPCM is O (r c) multiplications where r is the number of discretization levels and c is the number of clusters, while in PCM it is O (n c) where n> > r for large n. The step of computing the memberships for SPCM is O (n c log r) comparisons while for PCM it is O (n c). However, the cost of a comparison operation is much less than that of exponentiation, multiplication and division which is clear from the runtime analysis of the algorithms.
Results on Ionosphere for different discretization levels r = 2, 4, 8, 16, 32, 64, 128, 256, 1024, 2048, and 3072 as shown in Figs. 10 and 11. As illustrated in Fig. 10, the ARI increases up to r = 256 then saturates.
Implementation details
The above three metrics are available in Scikit-Learn [41]. The proposed algorithm along with several related algorithms are implemented in Python 3.8 on windows 10 and all machine learning tasks have been performed using Scikit-Learn with Intel Core i5, 2.5 GHz processor, and 16 GB RAM. The source code, as well as datasets and sample output are available at GitHub [40].
Conclusion
In this paper, an efficient semi-possibilistic c-means clustering algorithm (SPCM) is presented aiming at facilitating future applications of PCM to encrypted data. The required exponentiation and division operations in the traditional possibilistic c-means (PCM) complicate its implementation, especially on homomorphically-encrypted data. The key idea is to avoid the need for exponentiation and division operations at each iteration by restricting the memberships to a set of discrete values between zero and one differing by a small step size (possibly unfixed). The analysis of the computational complexity of SPCM shows that the computation of centers, memberships, dealing with noise, and the adaptation of its parameters during its execution are simple and more efficient than PCM. Also, unlike existing algorithms, SPCM increment the soft memberships of the identified core object in the winner prototype while it decrements its memberships in the other clusters instead of setting them to one for the winner and zero otherwise as in the cutset algorithm [28]. Similarly, the memberships of identified outlier objects are decremented in all clusters instead of setting their memberships to zero. Soft modification to the memberships of identified core and outlier objects increases the robustness of SPCM. As shown from the experimental results, the average percent of the achieved reduction in runtime is 35% and the average percent of the achieved increase in v-measure, adjusted mutual information, and adjusted rand index is 6% on five datasets compared to PCM. The larger the dataset, the higher the reduction in runtime. Also, SPCM achieved a comparable performance with less computational complexity compared to variants of related algorithms. Applying SPCM to homomorphically-encrypted data to cluster big data on the cloud without the disclosure of private data is a subject of an ongoing investigation.
Footnotes
Acknowledgments
I would like to thank Prof. Amin Shoukry (Alexandria University) for helping in improving the manuscript.
