Abstract
DKIFCM (Density Based Kernelized Intuitionistic Fuzzy C Means) is the new proposed clustering algorithm that is based on outlier identification, kernel functions, and intuitionist fuzzy approach. DKIFCM is an inspiration from Kernelized Intuitionistic Fuzzy C Means (KIFCM) algorithm and it addresses the performance issue in the presence of outliers. It first identifies outliers based on density of data and then clusters are computed accurately by mapping the data to high dimensional feature space. Performance and effectiveness of various algorithms are evaluated on synthetic 2D data sets such as Diamond data set (D10, D12, and D15), and noisy Dunn data set as well as on high dimension real-world data set such as Fisher-Iris, Wine, and Wisconsin Breast Cancer Data-set. Results of DKIFCM are compared with results of other algorithms such as Fuzzy-C-Means (FCM), Intuitionistic FCM (IFCM), Kernel-Intuitionistic FCM (KIFCM), and density-oriented FCM (DOFCM), and the performance of proposed algorithm is found to be superior even in the presence of outliers and noise. Key advantages of DKIFCM are outlier identification, robustness to noise, and accurate centroid computation.
Introduction
Fuzzy clustering, a part of unsupervised learning algorithms [1–3], is very effective in dealing with uncertainty, vagueness, and fuzziness in data. It is widely used in various domains such as resource optimization [4, 5], anomaly detection and fault diagnosis [6, 37], recommender systems [8], image segmentation [9, 10] including medical imaging [11], product clustering [12], customer segmentation [13], and network optimization [14–16] etc.
Fuzzy logic and fuzzy set theory are most suitable to deal with vagueness and uncertainties in the data. In 1965, fuzzy sets were introduced [17], followed by ISODATA algorithm [18], and popular fuzzy clustering algorithm, Fuzzy-c-means (FCM) [19] subsequently. However, FCM does not perform well for noisy and outlier contaminated data. To overcome this problem of FCM, a number of algorithms such as Possibilistic-C-means (PCM) [20], Noise Clustering [21], Credibilistic Fuzzy-c-means (CFCM) [22], and Possibilistic-Fuzzy-C-means (PFCM) [23] are proposed in the literature. During the last decade, lot of research has been done in fuzzy clustering domain. For example, Kaur P. et al. [38] clearly shows the comparison of different algorithms on MR brain images and on different real datasets, showing limitations and relative advantages of different algorithms. PCM, which is based on possibilistic approach, is helpful sometimes in dealing with noisy data, but it is over sensitive to initializations and many a times results in identical or overlapping clusters. Noise Clustering works on the concept of noise prototype and forms separate cluster for noise, but it fails to result in very efficient clusters when the number of clusters are increased for the same data set or when the proportion of outliers is increased. CFCM defines credibility and performs clustering on the basis of credibility score. It outperforms FCM, PCM, and PFCM but has certain issues such as assigning an outlier to multiple clusters and sensitivity to the choice of initial prototype. Unlike Noise Clustering, CFCM doesn’t identify outliers, rather it only emphasizes on reducing their effect on resultant clusters. This issue has been illustrated in many research papers [34–36] such as Gosain A. et al. [31] in which performance comparison of various fuzzy clustering algorithms on D12 data set is discussed. PFCM works on fusion of fuzzy and possibilistic approach, and hence it has possibilistic and fuzzy membership. It outperforms FCM and PCM, but fails for outlier contaminated data and data with highly size variant clusters. All the above discussed algorithms focus on reducing the impact of outliers instead of identifying and removing the outliers. Kaur P. et al. in 2011 proposed Density oriented FCM (DOFCM) [24, 25], which works on identification of outliers. It works excellent in outlier identification but clusters and centroid computation can be improved. Also, DOFCM fails for non-linearly separable clusters. Parallelly, Chaira [26, 27] proposed IFCM by incorporating hesitation degree with membership function and intuitionistic fuzzy entropy in objective function. Although, it successfully converges the cluster center to a desired location, but it is not much effective for noisy data and fails for nonlinearly separable data. Kaur [28] proposed KIFCM by integrating RBF-kernel function with intuitionistic fuzzy sets. KIFCM works well with nonlinear data, and handles the noise to some extent. However, it fails to perform well in the presence of outliers.
The proposed algorithm, DKIFCM, is an inspiration from KIFCM and solves the issue of dealing with outliers. DKIFCM is a hybridization of density-based approach, kernel function, and IFCM. Four standard two-dimensional data sets and three real high dimensional data sets are used for experimental comparison of the proposed algorithm with FCM, IFCM, KIFCM, and DOFCM.
Further paper is comprised of four parts. Part I gives concise reread of FCM, IFCM, KIFCM, and DOFCM, and Part II is depiction of DKIFCM which is the proposed algorithm. Part III is experimental assessment on synthetic 2D and real high dimensional data sets. Last part is conclusion which is followed by the reference list of useful research articles.
Literature survery
In this section, a concise reread of FCM, Intuitionistic-FCM (IFCM), Kernelised IFCM (KIFCM), and DOFCM is done. X represents {x1, x2, x3, ... , x
n
}, data set, each x
i
is a data object and n is number of data object in X. C represents {C1, C2, C3, ... , C
cc
}, set of cluster centroids, each C
j
is centroid of j
th
cluster and cc is number of clusters in X.
The fuzzy C means (FCM) [19]
FCM is the most novel and popular fuzzy clustering algorithm. It is based on the assumption that number of clusters are known in advance. Objective is to minimize the following equation subject to constraint in Equation (2).
But FCM always fails to detect noise and outliers, thus it performs drastically low in their presence.
On the bases of possibilistic approach, PCM was proposed. The membership constraint in Equation (2) is relaxed and the following objective function is formed:
Here, the former term focuses on distance minimization among data objects and centroids, and the latter term causes η
j
to gain its maximum value, avoiding all trivial solutions. η
j
is positive number randomly chosen as per data set. Centroid updation and membership updation are as per following equations:
PCM is sometimes helpful with noisy data. But it suffers from the inconsistencies similar to greedy approach techniques, that is, minimization of objective function steps at a local minima instead of global minima [29].
Pal came up with a new idea of fusion of fuzzy and possibilistic approach. Hence, PFCM has a possibilistic membership (t ki ) and a fuzzy membership (u ki ) associated with each data point. It works on minimizing subsequent objective function:
This algorithm outperforms PCM and FCM but fails to produce accurate results when size of clusters is unequal and when outliers are present [31].
K. K. Chintalapudi and K. Moshe introduced a new variable called credibility that is defined as follows:
CFCM lowers the effect of the presence of outliers on cluster computation [26, 29]. Thus, it is observed that it improves centroid computation but still doesn’t provide the accurate centroids and also allocates some outliers to two or more clusters [28, 30].
Xu proposed IFCM which is based on intuitionistic fuzzy set theory and is helpful in dealing with uncertain and vague data. Its objective function is:
IFCM produces overlapping clusters and hence it becomes really difficult to assign a cluster to the points lying in overlapping region. Also, IFCM fails to handle outliers as this algorithm treats outliers as data objects.
KIFCM makes use of Radial Basis kernel function [28] for computing the distance between the centroids and data objects, improving the accuracy of IFCM. Its objective function is:
Kaur proposed an algorithm DOFCM by introducing a new term neighborhood membership, which is defined as follows:
Outlier identification is done by setting a threshold value ’α’ for the neighborhood membership. If
After studying FCM and its variants DOFCM, IFCM, and KIFCM, IFCM can be improved with the idea of density approach and kernelization. A new algorithm named ‘Density Based Kernelized approach to IFCM’ (DKIFCM) is proposed. It produces ‘n + 1’ clusters where one cluster is for outlier and ‘n’ clusters are for data objects. Proposed algorithm first identifies outliers and removes them by withholding their membership. Then, KIFCM [28] is applied on all data objects to make accurate and noise free clusters. In the tailing subsections, identification of outliers is explained, followed by clustering process of proposed approach.
Outlier identification
Outliers are those data points that show huge variation to other data points in a data set, as if they do not belong to the data set. Various methods based on statistics, proximity, and clustering are proposed in literature to identify outliers. In this paper, outliers are identified based on the density of a data object corresponding to its neighborhood by using a density factor called neighborhood membership [24, 25], which is defined in DOFCM [24]. Neighborhood membership is formulized as follows:
To calculate a threshold value ’α’ for neighborhood membership, firstly neighborhood radius (r
neighborhood
) is calculated as per Ester [33], and then count of data objects in neighborhood (
All data objects with neighborhood membership value less than set threshold value are outliers. Figure 1. represents functioning of outlier identification technique on noisy Dunn data set. Value of r neighborhood is calculated as 0.8677 and for this noisy data set, various values of α are tried on visual basis, resulting in best value of α as 0.15. Then, neighborhood membership is computed for all the data objects. All data objects whose neighborhood membership is less than α are considered to be outliers and such data objects are highlighted in Fig. 1 by drawing a circle of r neighborhood radius with that data object as center. With the above approach, we are able to identify all outliers except four outliers, as shown in Fig. 1.

Illustration of Outlier Identification.
A Density Based kernelized Approach to IFCM is described here. Objective function for this proposed approach is:
K(x
i
)-K(C
j
) represents distance between xi and Cj in high dimension space. Mathematically, Radial Basis Kernel [31] is defined as:
And Equation (19) can be revised as:
To determine
Outlier Identification:
Step 1. Calculate neighborhood radius (r neighborhood ) using Ester [33]
Step 2. Set
Step 3.
Step 4.
compute
Step 5. Set α, on visual basis of data set (X).
Step 6.
OL←OL U xi
Step 7. Set parameters like l, m, SC
Step 8. Set a random value to membership function- uij, ∀i ε [1,2,..,n, 1,2,..,n], and ∀j ε [1,2,..,cc, 1,2,..,cc] subject to constraints in Equation (2)
Step 9. Set Obj_fun0 and V using (19) and (27) respectively
Step 10. For ∀ OL, set membership value equal to zero
Step 11.
Compute new_Centroids (C), using (27)
Compute new_Obj_funk, using (24)
Update uij for all i, using (26)
Return C
Step 12. Return C
Algorithms are implemented using MATLAB Version 7 software and on a system with hardware configuration as intel-i5 2.5 GHz processor and 8 GB RAM. Experimental results are projected in form of tables and figures. Common algorithmic parameters in the implementation of proposed and other algorithms are: fuzzifier is set as 2, maximum iterations is set as 100, and minimum improvement in every consecutive iteration is set to 0.00001. Experiments are executed on standard 2D and real-high dimension data sets. Notation used to symbolize clustering results are: green/blue dot (’.’ or ’.’) for data objects of different clusters, ’*’ as centroids, ’∘’ as outlier.
Example 1. (D10 and D12 data sets): D10 is also known as Diamond data set. It is noise free and consists of 2 identical clusters, each cluster consists of five data objects. Figures 2 and 3 show D10 data set and clustering results of FCM, IFCM, KIFCM, DKIFCM, and DOFCM. D12[21] data set is the union of data objects of D10, a noise point - x11, and an outlier - x12. Figures 4 and 5 show D12 data set and clustering results of FCM, IFCM, KIFCM, DKIFCM, and DOFCM. Table 1 lists the coordinates of centroids generated by the proposed and other algorithms, and error in terms of computed centroids and ideal centroids on D10 data set. The ideal centroids for D10 and D12 data set are:

(a) D10 (b) FCM result, (c) IFCM result.

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.

(a) D12 (b) FCM result, (c) IFCM result.

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.
Computed cluster centroids and average error on D10
The error is computed as
Computed cluster centroids and average error on D12
On analysing Figs. 2, 3, 4, and 5, it is observed that FCM performs very good for outlier free and noise free data such as D10, but FCM performance is drastically affected even in the presence of a single outlier as FCM fails to identify outliers and treat outliers just like any other data object. IFCM performs much better than FCM in presence of noise but it also fails to identify outliers and sometime noise points are allocated to more than one cluster. When the data spread is in a way that linear separation is not possible, then kernelized versions of fuzzy clustering algorithms such as KIFCM become helpful. Kernelized approach also improves the accuracy of centroids and the same is verified in experimental results shown in Tables 1 and 2. Our proposed algorithm, DKIFCM, has best of all characteristics i.e. identification of outliers, robustness to noise, and accurate centroids computation.
Example 2. (D15 data sets): D15 data set is the union of data objects of D10, a noise point - x11, and four outliers. Figures 6 and 7 show D15 data set and clustering results of FCM, IFCM, KIFCM, DKIFCM, and DOFCM. Table 3 lists the coordinates of centroids and computed error on D15 data set. The ideal centroids for D15 and error are same as specified in example 1.

(a) D15 (b) FCM result, (c) IFCM result.

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.
On analysing Fig. 6, Fig. 7, and Table 3, it is observed that FCM performance is worst as it forms outliers as one of cluster and all data objects as second cluster. IFCM performs much better than FCM even in the presence of noise and high proportion of outliers. But it too fails to identify outliers and noise points are allocated to more than one cluster. However, KIFCM shows high robustness to the presence of outliers as it successfully identifies right clusters with least impact on centroid position in comparison to FCM, IFCM, and DOFCM. DOFCM successfully identifies outliers and generates better clusters. However, proposed algorithm, DKIFCM performs best as it offers outliers identification, robustness to noise, and accurate centroid computation.
Computed cluster centroids and average error on D15
Example 3. (Dunn Data-set): Dunn Data consists of two clusters that are square in shape and highly vary in density. Figure. 8(a) shows noisy Dunn data set, and Fig. 8(b-c) and Fig. 9 show clustering results of FCM, KIFCM, IFCM, DKIFCM and DOFCM. Table 4 lists the coordinates of centroids and error in centroid computation generated by the proposed and other algorithms. The ideal centroids for this data set are:

(a) Noisy Dunn (b) FCM result, (c) IFCM result.

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.
Computed cluster centroid and average error on Dunn
On analysing Figs. 8 and 9, it is observed that FCM, IFCM, and KIFCM fail to identify outliers as these algorithms treat outliers as other data objects. However, DOFCM and DKIFCM identify most of the outliers marked as ’o’, and then clustering is performed. On analysing Table 4, KIFCM performs slightly better than FCM in computing accurate centroid position and IFCM is worst choice for such data set. DOFCM and DKIFCM perform equally well in outlier identification and both outperforms FCM, IFCM and KIFCM. Between DKIFCM and DOFCM, former outperforms the later in accuracy of centroid computation.
Example 4. (High Dimension Real World Data set):
Performance of proposed algorithm is tested on three well known real world datasets - (1) Fisher-Iris, (2) Wine, (3) Wisconsin Breast Cancer. Accuracy for high-dimension data is computed as follows:
Equation (28) is Huang’s measure, where n is total number of data objects in a given data set. ai is count of data objects that truly belong to ith cluster (True Positive). Accuracy ranges from zero to one, and accuracy for perfect clusters will be 1.
Fisher-Iris is a real-world data set of three species of flowers and it has 50 instances of each species. Each flower is represented by 4 attributes - sepal length, petal length, sepal width, and sepal length. Table 5 lists the accuracy and misclassification of proposed and other algorithms for Fisher-Iris data set. On analysing Table 5, it is observed that DKIFCM performs best, followed by IFCM, FCM and DOFCM. KIFCM performance has been worst among FCM, IFCM, DOFCM, DKIFCM.
For Iris Data-set, misclassification and accuracy using FCM, KIFCM, IFCM, DOFCM & DKIFCM
Wine dataset has 178 instances from three class of wines, each instance has 13 attribute such as Alcohol, Ash, Malic acid, etc. Wisconsin Breast Cancer dataset has 569 instances for Benign and Malign classes and each instance has 32 attributes.
Tables 6 and 7 list the accuracy and misclassification of proposed and other algorithms for Wine dataset and Wisconsin Breast Cancer dataset respectively. Since, these datasets have no outlier and in the absence of outliers, performance of DKIFCM is similar to KIFCM. However, DKIFCM performance is much better in comparison to other algorithms such as FCM, IFCM, DOFCM.
For Wine Data-set, misclassification and accuracy using FCM, KIFCM, IFCM, DOFCM & DKIFCM
For Wisconsin Breast Cancer Data-set, misclassification and accuracy using FCM, KIFCM, IFCM, DOFCM & DKIFCM
Due to digitization of today’s world, clustering is no less than a magical wand for a wide range of applications such as customer segmentation, pattern recognition, and object recognition etc. Since decades, researchers and academicians are putting efforts to improve clustering techniques. In last decade, a revolution in computation power and availability of vast, versatile and huge data has given clustering new horizons. In this paper, we proposed a new robust clustering algorithm, DKIFCM, which is a hybridization of density focused approach, intuitionistic fuzzy sets, and kernel functions. It uses RBF kernel function to map data objects to feature space to improve resulting clusters accuracy, density-oriented approach to identify outliers, and intuitionistic fuzzy sets based clustering approach to improve centroids position of resultant clusters. Hand in hand working of these approaches has resulted in a robust clustering algorithm which significantly outperforms the existing algorithms - FCM, IFCM, KIFCM, DOFCM. Performance of DKIFCM is tested on standard data sets in literature such as D10, D12, D15, and noisy Dunn-Square data set and also on high dimension data set such as Fisher-Iris, Wine, Wisconsin Breast Cancer. Experimentally, it is found that proposed method is significantly effective in the presence of outliers and noise for non-linearly as well as linearly separable data. Major advantages of proposed algorithm are outlier identification, robustness to noise, and accurate centroid computation, with future scope of improving outlier identification effectiveness.
