An effective fuzzy clustering algorithm with outlier identification feature

Abstract

DKIFCM (Density Based Kernelized Intuitionistic Fuzzy C Means) is a new clustering algorithm based on outlier identification, kernel functions, and the intuitionistic fuzzy approach. DKIFCM is inspired by the Kernelized Intuitionistic Fuzzy C Means (KIFCM) algorithm and addresses its performance degradation in the presence of outliers. It first identifies outliers based on the density of the data and then computes clusters accurately by mapping the data to a high-dimensional feature space. The performance and effectiveness of the algorithms are evaluated on synthetic 2D data sets, namely the Diamond data sets (D10, D12, and D15) and the noisy Dunn data set, as well as on high-dimensional real-world data sets, namely Fisher-Iris, Wine, and Wisconsin Breast Cancer. Results of DKIFCM are compared with those of Fuzzy C-Means (FCM), Intuitionistic FCM (IFCM), Kernel Intuitionistic FCM (KIFCM), and density-oriented FCM (DOFCM), and the performance of the proposed algorithm is found to be superior even in the presence of outliers and noise. Key advantages of DKIFCM are outlier identification, robustness to noise, and accurate centroid computation.

Keywords

Fuzzy clustering, identification of outlier, FCM, IFCM, DOFCM, KIFCM, kernel functions

1 Introduction

Fuzzy clustering, a family of unsupervised learning algorithms [1–3], is very effective in dealing with uncertainty, vagueness, and fuzziness in data. It is widely used in domains such as resource optimization [4, 5], anomaly detection and fault diagnosis [6, 37], recommender systems [8], image segmentation [9, 10] including medical imaging [11], product clustering [12], customer segmentation [13], and network optimization [14–16].

Fuzzy logic and fuzzy set theory are well suited to dealing with vagueness and uncertainty in data. Fuzzy sets were introduced in 1965 [17], followed by the ISODATA algorithm [18] and, subsequently, the popular fuzzy clustering algorithm Fuzzy C-Means (FCM) [19]. However, FCM does not perform well on noisy and outlier-contaminated data. To overcome this problem, a number of algorithms such as Possibilistic C-Means (PCM) [20], Noise Clustering [21], Credibilistic Fuzzy C-Means (CFCM) [22], and Possibilistic Fuzzy C-Means (PFCM) [23] have been proposed in the literature. During the last decade, a lot of research has been done in the fuzzy clustering domain. For example, Kaur P. et al. [38] compare different algorithms on MR brain images and on different real data sets, showing their limitations and relative advantages. PCM, which is based on the possibilistic approach, is sometimes helpful in dealing with noisy data, but it is oversensitive to initialization and often produces identical or overlapping clusters. Noise Clustering works on the concept of a noise prototype and forms a separate cluster for noise, but it does not produce efficient clusters when the number of clusters is increased for the same data set or when the proportion of outliers is increased. CFCM defines credibility and performs clustering on the basis of a credibility score. It outperforms FCM, PCM, and PFCM but has certain issues, such as assigning an outlier to multiple clusters and sensitivity to the choice of initial prototypes. Unlike Noise Clustering, CFCM does not identify outliers; it only reduces their effect on the resultant clusters. This issue has been illustrated in several research papers [34–36], such as Gosain A. et al. [31], which compares the performance of various fuzzy clustering algorithms on the D12 data set.
PFCM fuses the fuzzy and possibilistic approaches and hence has both a possibilistic and a fuzzy membership. It outperforms FCM and PCM but fails for outlier-contaminated data and for data with clusters of highly varying size. All the algorithms discussed above focus on reducing the impact of outliers instead of identifying and removing them. In 2011, Kaur P. et al. proposed Density Oriented FCM (DOFCM) [24, 25], which works by identifying outliers. It performs very well in outlier identification, but its cluster and centroid computation can be improved. Also, DOFCM fails for non-linearly separable clusters. In parallel, Chaira [26, 27] proposed IFCM by incorporating a hesitation degree into the membership function and intuitionistic fuzzy entropy into the objective function. Although it successfully converges the cluster centers to the desired locations, it is not very effective for noisy data and fails for non-linearly separable data. Kaur [28] proposed KIFCM by integrating the RBF kernel function with intuitionistic fuzzy sets. KIFCM works well with nonlinear data and handles noise to some extent. However, it fails to perform well in the presence of outliers.

The proposed algorithm, DKIFCM, is inspired by KIFCM and addresses its weakness in dealing with outliers. DKIFCM is a hybridization of the density-based approach, kernel functions, and IFCM. Four standard two-dimensional data sets and three real high-dimensional data sets are used for experimental comparison of the proposed algorithm with FCM, IFCM, KIFCM, and DOFCM.

The rest of the paper is organized in four parts. Section 2 gives a concise review of FCM, IFCM, KIFCM, and DOFCM, and Section 3 describes the proposed algorithm, DKIFCM. Section 4 presents the experimental assessment on synthetic 2D and real high-dimensional data sets. Section 5 concludes the paper and is followed by the reference list.

2 Literature survey

This section gives a concise review of FCM, Intuitionistic FCM (IFCM), Kernelized IFCM (KIFCM), and DOFCM, along with the related PCM, PFCM, and CFCM algorithms. X = {x₁, x₂, x₃, ... , x_n} denotes the data set, where each x_i is a data object and n is the number of data objects in X. C = {C₁, C₂, C₃, ... , C_cc} denotes the set of cluster centroids, where C_j is the centroid of the j^th cluster and cc is the number of clusters in X. $U = \begin{bmatrix} u_{11} & u_{12} & \dots & u_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{cc1} & u_{cc2} & \dots & u_{ccn} \end{bmatrix}$ denotes the fuzzy membership matrix, where u_ji is the fuzzy membership of x_i in cluster C_j. The equations that follow write this membership as u_ij, with i indexing the data object and j the cluster. $Eu\_d_{ij} = \sqrt{\sum_{k = 1}^{d} (x_{ik} - C_{jk})^{2}}$ is the Euclidean distance between data object x_i and cluster centroid C_j for a d-dimensional data set.

2.1 The fuzzy C means (FCM) [19]

FCM is a pioneering and popular fuzzy clustering algorithm. It assumes that the number of clusters is known in advance. The objective is to minimize Equation (1) subject to the constraint in Equation (2): $J_{FCM} = \sum_{j = 1}^{cc} \sum_{i = 1}^{n} u_{ij}^{m} \, Eu\_d_{ij}^{2}$ (1) where cc, X, n, and Eu_d_ij are as described in Section 2, m > 1 is the fuzzifier, and u_ij is the fuzzy membership of data object x_i in cluster C_j. $\sum_{j = 1}^{cc} u_{ij} = 1 \text{ for } i = 1, 2, 3, \dots, n$ (2)

However, FCM cannot detect noise and outliers, so its performance degrades drastically in their presence.
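As an illustration of Equations (1) and (2), the following minimal Python sketch evaluates the FCM objective and applies the standard FCM membership update. The update rule itself is not stated in this section; it is the commonly used closed form and is included here as an assumption. The function names and the `eps` guard against zero distances are hypothetical choices made for this example.

```python
import math

def euclidean(x, c):
    """Euclidean distance between a data object x and a centroid c."""
    return math.sqrt(sum((xk - ck) ** 2 for xk, ck in zip(x, c)))

def fcm_memberships(X, C, m=2.0, eps=1e-12):
    """Standard FCM update: U[i][j] is the membership of object i in cluster j."""
    U = []
    for x in X:
        # eps avoids division by zero when an object coincides with a centroid
        d = [max(euclidean(x, c), eps) for c in C]
        row = []
        for j in range(len(C)):
            denom = sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(len(C)))
            row.append(1.0 / denom)
        U.append(row)
    return U

def fcm_objective(X, C, U, m=2.0):
    """J_FCM of Equation (1): membership-weighted sum of squared distances."""
    return sum(U[i][j] ** m * euclidean(X[i], C[j]) ** 2
               for i in range(len(X)) for j in range(len(C)))

X = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
C = [(0.0, 0.0), (4.0, 0.0)]
U = fcm_memberships(X, C)
print(U[1])  # the midpoint is shared equally: [0.5, 0.5]
```

Each row of `U` sums to 1, which is the constraint of Equation (2). That constraint is also why a distant outlier still receives substantial membership in some cluster.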

2.2 Possibilistic C means algorithm [20]

PCM is based on the possibilistic approach. The membership constraint in Equation (2) is relaxed and the following objective function is formed: $J_{PCM} = \sum_{j = 1}^{cc} \sum_{i = 1}^{n} u_{ij}^{m} \, Eu\_d_{ij}^{2} + \sum_{j = 1}^{cc} \eta_{j} \sum_{i = 1}^{n} (1 - u_{ij})^{m}$ (3)

Here, the former term minimizes the distances between data objects and centroids, and the latter term forces u_ij to be as large as possible, which avoids the trivial solution. η_j is a positive number chosen according to the data set. Centroid and membership updates are given by the following equations: $C_{j} = \frac{\sum_{i = 1}^{n} u_{ij}^{m} x_{i}}{\sum_{i = 1}^{n} u_{ij}^{m}} \quad \forall j$ (4) $u_{ij} = \frac{1}{1 + \left(\frac{\| x_{i} - C_{j} \|^{2}}{\eta_{j}}\right)^{\frac{1}{m - 1}}}$ (5)

PCM is sometimes helpful with noisy data. However, it suffers from the same inconsistency as greedy techniques: minimization of the objective function can stop at a local minimum instead of the global minimum [29].
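The membership rule of Equation (5) can be sketched in a few lines of Python. The function name and the sample values of η_j are hypothetical and chosen only for illustration.

```python
def pcm_membership(dist_sq, eta, m=2.0):
    """Possibilistic membership of Equation (5) for one object/cluster pair.

    dist_sq is the squared distance to the centroid and eta is the
    cluster-specific positive constant eta_j.
    """
    return 1.0 / (1.0 + (dist_sq / eta) ** (1.0 / (m - 1.0)))

# An object lying far from both centroids gets a low membership in each cluster,
# because the sum-to-one constraint of Equation (2) no longer applies.
far = [pcm_membership(100.0, 1.0), pcm_membership(144.0, 1.0)]
print(sum(far) < 1.0)  # True
```

Because each membership depends only on the distance to its own centroid, nothing prevents two centroids from converging to the same location, which fits the identical or overlapping clusters reported for PCM.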

2.3 Possibilistic fuzzy C means [23]

Pal et al. proposed fusing the fuzzy and possibilistic approaches. PFCM therefore associates both a possibilistic membership (t_ij) and a fuzzy membership (u_ij) with each data point. It minimizes the following objective function:

$J_{PFCM} = \sum_{j = 1}^{cc} \sum_{i = 1}^{n} (a u_{ij}^{m} + b t_{ij}^{\eta}) \, Eu\_d_{ij}^{2} + \sum_{j = 1}^{cc} \Upsilon_{j} \sum_{i = 1}^{n} (1 - t_{ij})^{\eta}$ (6) subject to the constraint given in Equation (2) and 0 ≤ u_ij ≤ 1, 0 ≤ t_ij ≤ 1, η > 1, m > 1, a > 0, and b > 0. The constants ‘a’ and ‘b’ describe the relative importance of the fuzzy membership and the possibilistic membership in Equation (6) [30]. Fuzzy membership is defined in Equation (5) and possibilistic membership is defined below: $t_{ij} = \frac{1}{1 + \left(\frac{b}{\Upsilon_{j}} Eu\_d_{ij}^{2}\right)^{\frac{1}{\eta - 1}}}$ (7) where 1 ≤ j ≤ cc, and the cluster center C_j is defined as: $C_{j} = \frac{\sum_{i = 1}^{n} (a u_{ij}^{m} + b t_{ij}^{\eta}) x_{i}}{\sum_{i = 1}^{n} (a u_{ij}^{m} + b t_{ij}^{\eta})}$ (8)

This algorithm outperforms PCM and FCM but fails to produce accurate results when size of clusters is unequal and when outliers are present [31].
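A minimal sketch of Equations (7) and (8) follows, assuming plain Python lists for the memberships of a single cluster. The function names are hypothetical, and `gamma` stands for the cluster constant written ϒ_j above.

```python
def pfcm_typicality(dist_sq, gamma, b, eta=2.0):
    """Possibilistic membership t_ij of Equation (7)."""
    return 1.0 / (1.0 + ((b / gamma) * dist_sq) ** (1.0 / (eta - 1.0)))

def pfcm_centroid(X, u, t, a, b, m=2.0, eta=2.0):
    """Cluster center of Equation (8) for one cluster.

    u and t hold the fuzzy and possibilistic memberships of every object
    in this cluster; a and b weight the two kinds of membership.
    """
    w = [a * ui ** m + b * ti ** eta for ui, ti in zip(u, t)]
    total = sum(w)
    dim = len(X[0])
    return tuple(sum(wi * x[k] for wi, x in zip(w, X)) / total for k in range(dim))

print(pfcm_centroid([(0.0, 0.0), (2.0, 0.0)], [1.0, 1.0], [1.0, 1.0], a=1.0, b=1.0))
# (1.0, 0.0)
```

Raising `b` relative to `a` lets the typicality values dominate the weights, so objects far from the centroid contribute less to its position.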

2.4 Credibilistic fuzzy C-Means [22]

Chintalapudi and Kam introduced a new variable called credibility, defined as follows: $\Psi_{k} = 1 - \frac{(1 - \theta) \alpha_{k}}{\max_{i = 1 .. n} (\alpha_{i})}, \quad 0 \le \theta \le 1$ (9) where $\alpha_{k} = \min_{j = 1 .. cc} (Eu\_d_{kj})$ is the distance of point x_k from its nearest centroid. The noisiest point x_k gets a credibility value equal to θ, so θ is the parameter that controls the minimum value of Ψ_k. CFCM, their proposed algorithm, uses credibility to specify a new constraint: $\sum_{j = 1}^{cc} u_{kj} = \Psi_{k}, \quad k = 1 .. n$ (10) with the objective function: $J_{CFCM} = \sum_{j = 1}^{cc} \sum_{i = 1}^{n} u_{ij}^{m} \, Eu\_d_{ij}^{2}$ (11)

CFCM lowers the effect of outliers on cluster computation [26, 29]. It improves centroid computation but still does not provide accurate centroids, and it allocates some outliers to two or more clusters [28, 30].
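The credibility of Equation (9) can be computed directly from the nearest-centroid distances. The sketch below assumes those distances are already available as a list; the function name and sample values are hypothetical.

```python
def credibility(alphas, theta):
    """Credibility of Equation (9) for every data object.

    alphas[k] is the distance of object k from its nearest centroid and
    theta in [0, 1] is the credibility given to the farthest object.
    """
    a_max = max(alphas)
    if a_max == 0.0:
        # every object sits on a centroid, so all are fully credible
        return [1.0 for _ in alphas]
    return [1.0 - (1.0 - theta) * a / a_max for a in alphas]

psi = credibility([0.0, 1.0, 4.0], theta=0.2)
print([round(p, 3) for p in psi])  # [1.0, 0.8, 0.2]
```

The farthest object receives exactly θ, and its memberships then sum to that value under Equation (10), which is how CFCM reduces the pull of outliers on the centroids.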

2.5 Intuitionistic fuzzy C means [26, 27]

Chaira proposed IFCM, which is based on intuitionistic fuzzy set theory and is helpful in dealing with uncertain and vague data. Its objective function is: $J_{IFCM} = \sum_{i = 1}^{n} \sum_{j = 1}^{cc} (u_{ij}^{*})^{m} \, Eu\_d_{ij}^{2} + \sum_{j = 1}^{cc} \eta_{j}^{*} e^{1 - \eta_{j}^{*}}$ (12) with m = 2 and $u_{ij}^{*} = u_{ij} + \eta_{ij}$, where $u_{ij}^{*}$ represents the intuitionistic fuzzy membership [10] and η_ij is the hesitation degree [10], defined as: $\eta_{ij} = 1 - u_{ij} - (1 - u_{ij}^{\alpha})^{1 / \alpha}, \quad \alpha > 0$ (13) and $\eta_{j}^{*} = \frac{1}{n} \sum_{i = 1}^{n} \eta_{ij}$

IFCM produces overlapping clusters and hence it becomes really difficult to assign a cluster to the points lying in overlapping region. Also, IFCM fails to handle outliers as this algorithm treats outliers as data objects.
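The hesitation degree of Equation (13) and the resulting intuitionistic membership can be sketched as follows. The function names are hypothetical, and the value α = 0.85 is only an illustrative assumption, not a setting taken from this paper.

```python
def hesitation(u, alpha):
    """Hesitation degree of Equation (13) for a fuzzy membership u in [0, 1]."""
    return 1.0 - u - (1.0 - u ** alpha) ** (1.0 / alpha)

def intuitionistic_membership(u, alpha):
    """Intuitionistic fuzzy membership u* = u + hesitation degree."""
    return u + hesitation(u, alpha)

# alpha = 1 gives no hesitation at all; the example value 0.85 gives a small
# positive hesitation for memberships strictly between 0 and 1.
print(hesitation(0.5, 1.0))         # 0.0
print(hesitation(0.5, 0.85) > 0.0)  # True
```

Crisp memberships (u = 0 or u = 1) carry no hesitation under this formula, so the correction mainly affects objects whose membership is ambiguous.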

2.6 Kernel version of intuitionistic fuzzy C means [28, 32]

KIFCM makes use of the Radial Basis kernel function [28] for computing the distance between the centroids and data objects, improving the accuracy of IFCM. Its objective function is: $J_{KIFCM} = \sum_{i = 1}^{n} \sum_{j = 1}^{cc} (u_{ij}^{*})^{m} \, \| K(x_{i}) - K(C_{j}) \|^{2} + \sum_{j = 1}^{cc} \pi_{j}^{*} e^{1 - \pi_{j}^{*}}$ (14) where K(x_i) denotes the image of x_i in the kernel-induced feature space.

2.7 Density oriented fuzzy C means [24, 25]

Kaur proposed the DOFCM algorithm by introducing a new term, neighborhood membership, defined as follows: $M_{neighborhood}^{i} (X) = \frac{\eta_{neighborhood}^{i}}{\eta_{max}}$ (15) where $\eta_{neighborhood}^{i}$ is the count of data objects in the neighborhood of x_i and $\eta_{max} = \max_{i = 1 \dots n} (\eta_{neighborhood}^{i})$ is the maximum count of neighborhood objects. A data object x_p is said to be in the neighborhood of x_i if it satisfies the following condition: $x_{p} \in X, x_{i} \in X \mid dist (x_{p}, x_{i}) \le r_{neighborhood}$ (16) where dist(x_p, x_i) is the Euclidean distance between x_i and x_p, and r_neighborhood is the neighborhood radius.

Outlier identification is done by setting a threshold value α for the neighborhood membership. If the $M_{neighborhood}^{i} (X)$ value of a point x_i is less than α, then x_i is considered an outlier. DOFCM first identifies outliers, assigns them zero fuzzy membership, and then performs clustering using the FCM algorithm.

3 Proposed algorithm

A study of FCM and its variants DOFCM, IFCM, and KIFCM suggests that IFCM can be improved by combining it with the density approach and kernelization. A new algorithm named ‘Density Based Kernelized approach to IFCM’ (DKIFCM) is therefore proposed. It produces ‘cc + 1’ clusters, where one cluster is for outliers and ‘cc’ clusters are for data objects. The proposed algorithm first identifies outliers and removes them by withholding their membership. Then, KIFCM [28] is applied to the data objects to produce accurate and noise-free clusters. The following subsections explain the identification of outliers and then the clustering process of the proposed approach.

3.1 Outlier identification

Outliers are data points that deviate so strongly from the other data points in a data set that they appear not to belong to it. Various methods based on statistics, proximity, and clustering have been proposed in the literature to identify outliers. In this paper, outliers are identified based on the density of a data object with respect to its neighborhood, using a density factor called neighborhood membership [24, 25], which is defined in DOFCM [24]. Neighborhood membership is formulated as follows: $M_{neighborhood}^{j} = \frac{\eta_{neighborhood}^{j}}{\eta_{max}}$ (17) where $\eta_{max} = \max_{j = 1, 2, .., n} (\eta_{neighborhood}^{j})$ and $\eta_{neighborhood}^{j}$ is the count of data objects in the neighborhood of data object x_j. The neighborhood of any data object x_j is mathematically defined as: $\{ q \in X \mid dist (q, x_{j}) \le r_{neighborhood} \}$ (18) where dist(q, x_j) denotes the Euclidean distance between point q and point x_j, and r_neighborhood denotes the neighborhood radius computed as per Ester [33].

To calculate a threshold value α for neighborhood membership, the neighborhood radius (r_neighborhood) is first calculated as per Ester [33], and then the count of data objects in the neighborhood ($\eta_{neighborhood}^{j}$) of every data object is calculated using r_neighborhood. η_max is the greatest value of $\eta_{neighborhood}^{j}$. Using Equation (17), the neighborhood membership of each data object is computed. A particular x_k is selected based on visual observation, and its $M_{neighborhood}^{k}$ value is set as the threshold value (α). Every data object x_i is then classified as: $M_{neighborhood}^{i} \begin{cases} < \alpha, & \text{then } x_{i} \text{ is an outlier} \\ \ge \alpha, & \text{then } x_{i} \text{ is not an outlier} \end{cases}$

All data objects with a neighborhood membership value less than the threshold are outliers. Figure 1 illustrates the outlier identification technique on the noisy Dunn data set. The value of r_neighborhood is calculated as 0.8677, and for this noisy data set various values of α were tried on a visual basis, giving 0.15 as the best value of α. Neighborhood membership is then computed for all the data objects. All data objects whose neighborhood membership is less than α are considered outliers; they are highlighted in Fig. 1 by a circle of radius r_neighborhood centered on the data object. With this approach, all outliers except four are identified, as shown in Fig. 1.

Fig. 1

Illustration of Outlier Identification.
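The density test of Equations (17) and (18) can be sketched as below. This is a minimal illustration under stated assumptions: the radius and α are passed in directly rather than derived as per Ester [33] or by visual inspection, an object is not counted as its own neighbor (the text does not say either way), and the function names and toy data are hypothetical.

```python
import math

def neighborhood_membership(X, radius):
    """Neighborhood membership of Equation (17) for every data object."""
    counts = []
    for i, xi in enumerate(X):
        # Equation (18): objects within the neighborhood radius of x_i
        # (assumption: x_i itself is excluded from the count)
        count = sum(1 for p, xp in enumerate(X)
                    if p != i and math.dist(xi, xp) <= radius)
        counts.append(count)
    eta_max = max(counts)
    if eta_max == 0:
        return [0.0] * len(X)
    return [c / eta_max for c in counts]

def find_outliers(X, radius, alpha):
    """Indices of objects whose neighborhood membership is below alpha."""
    M = neighborhood_membership(X, radius)
    return [i for i, mi in enumerate(M) if mi < alpha]

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
print(find_outliers(X, radius=1.5, alpha=0.15))  # [4]
```

The pairwise scan costs O(n²) distance computations, which is acceptable for the small data sets used here; a spatial index would be a natural substitute for larger inputs.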

3.2 Clustering using kernelized approach to IFCM

A Density Based kernelized Approach to IFCM is described here. Objective function for this proposed approach is:

$J_{DKIFCM} = \sum_{j = 1}^{cc + 1} \sum_{i = 1}^{n} (u_{ij}^{*})^{m} \, \| K(x_{i}) - K(C_{j}) \|^{2} + \sum_{j = 1}^{cc + 1} \pi_{j}^{*} e^{1 - \pi_{j}^{*}}$ (19) where m is the fuzzifier, X = {x₁, x₂, x₃, ... , x_n}, x_i is the i^th data object, C_j is the centroid of the j^th cluster, and $u_{ij}^{*} = u_{ij} + \pi_{ij}$, where $u_{ij}^{*}$ denotes the IFCM membership and u_ij the FCM membership of the i^th data object in the j^th cluster; π_ij is the hesitation degree, defined as follows: $\pi_{ij} = 1 - u_{ij} - (1 - u_{ij}^{\beta})^{1 / \beta}, \quad \beta > 0$ (20) $\pi_{j}^{*} = \frac{1}{n} \sum_{i = 1}^{n} \pi_{ij}$ (21)

∥K(x_i) − K(C_j)∥ represents the distance between x_i and C_j in the high-dimensional feature space, where K(x_i) denotes the image of x_i in that space. Mathematically, the Radial Basis Kernel [31] is defined as: $K (x, y) = \exp \left( - \frac{\| x - y \|^{2}}{h^{2}} \right)$ (22) where h defines the kernel width and K(x, x) = 1. So, ∥K(x_i) − K(C_j)∥² can be simplified as: $\| K(x_{i}) - K(C_{j}) \|^{2} = 2 (1 - K (x_{i}, C_{j}))$ (23)
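The simplification in Equation (23) follows from expanding the squared norm in the feature space. The steps below assume the usual kernel property that the inner product of two mapped points equals the kernel evaluated on the original points, together with K(x, x) = 1 for the Radial Basis Kernel:

```latex
\begin{aligned}
\| K(x_i) - K(C_j) \|^{2}
  &= \langle K(x_i), K(x_i) \rangle - 2 \langle K(x_i), K(C_j) \rangle + \langle K(C_j), K(C_j) \rangle \\
  &= K(x_i, x_i) - 2 K(x_i, C_j) + K(C_j, C_j) \\
  &= 2 \left( 1 - K(x_i, C_j) \right)
\end{aligned}
```

The feature-space distance is therefore bounded above by 2 however far x_i lies from C_j in the input space, which limits the contribution a distant object can make to the objective function.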

And Equation (19) can be revised as:

$J_{DKIFCM} = 2 \sum_{j = 1}^{cc + 1} \sum_{i = 1}^{n} (u_{ij}^{*})^{m} \left( 1 - \exp \left( - \frac{\| x_{i} - C_{j} \|^{2}}{h^{2}} \right) \right) + \sum_{j = 1}^{cc + 1} \pi_{j}^{*} e^{1 - \pi_{j}^{*}}$ (24)

To determine $u_{ij}^{*}$ and C_j, J_DKIFCM is minimized subject to the condition specified below: $0 \le u_{ij}^{*} \le 1 \text{ and } \sum_{j = 1}^{cc + 1} u_{ij}^{*} = 1, \text{ for } i = 1, 2, .., n$ (25) Here, an outlier x_i has full membership in the outlier cluster, $u_{i, cc + 1}^{*} = 1$, and a regular data point x_i satisfies $\sum_{j = 1}^{cc} u_{ij}^{*} = 1$. Minimizing Equation (24) under these constraints gives the following equations for the membership update $(u_{ij}^{*})$ and centroid computation (C_j): $u_{ij}^{*} = \frac{1}{\sum_{k = 1}^{cc + 1} \left( \frac{\| K(x_{i}) - K(C_{j}) \|^{2}}{\| K(x_{i}) - K(C_{k}) \|^{2}} \right)^{\frac{1}{m - 1}}}$ (26) $C_{j} = \frac{\sum_{i = 1}^{n} (u_{ij}^{*})^{m} x_{i}}{\sum_{i = 1}^{n} (u_{ij}^{*})^{m}}$ (27)

DKIFCM algorithm

Input: data set (X), Fuzzifier (m), number of clusters (cc + 1), maximum iterations (l), minimum_successive_improvement (SC)

Output: Cluster Centroids (C), Set of Outliers (OL)

Outlier Identification:

Step 1. Calculate neighborhood radius (r_neighborhood) using Ester [33]

Step 2. Set $\eta_{neighborhood}^{i} \leftarrow$ count of data objects within r_neighborhood of x_i, ∀i ∈ {1, 2, .., n}

Step 3. $\eta_{max} \leftarrow \max_{i} \{ \eta_{neighborhood}^{i} \}$

Step 4. for i = 1 to n

Begin

compute $M_{neighbourhood}^{i}$ using (17)

End

Step 5. Set α, on visual basis of data set (X).

Step 6. for i = 1 to n

Begin

IF( $M_{neighbourhood}^{i}$ < α)

OL ← OL ∪ {x_i}

ENDIF

End

Clustering process:

Step 7. Set parameters like l, m, SC

Step 8. Set a random value to membership function u_ij, ∀i ∈ {1, 2, .., n} and ∀j ∈ {1, 2, .., cc}, subject to constraints in Equation (2)

Step 9. Set Obj_fun₀ and C using (19) and (27) respectively

Step 10. For all x_i ∈ OL, set membership value equal to zero

Step 11. for k = 1, 2,.., l

Begin

Compute new_Centroids (C), using (27)

Compute new_Obj_fun_k, using (24)

Update u_ij for all i, using (26)

IF (|Obj_fun_k − Obj_fun_{k−1}| < SC)

Return C

EndIF

End

Step 12. Return C
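The steps above can be sketched in Python as follows. This is a hedged reading rather than the authors' MATLAB implementation, and it rests on several assumptions: the neighborhood radius, α, and the kernel width h are supplied by the caller; β = 0.85 is an illustrative default; the kernel form of Equation (26) is used to obtain u_ij, after which u*_ij = u_ij + π_ij; and the returned outlier list plays the role of the (cc + 1)-th cluster. All function and parameter names are hypothetical.

```python
import math
import random

def rbf(x, c, h):
    """Radial Basis Kernel of Equation (22)."""
    return math.exp(-math.dist(x, c) ** 2 / h ** 2)

def hesitation(u, beta):
    """Hesitation degree of Equation (20)."""
    return 1.0 - u - (1.0 - u ** beta) ** (1.0 / beta)

def dkifcm(X, cc, radius, alpha, h, m=2.0, beta=0.85,
           max_iter=100, sc=1e-5, seed=0):
    n, dim = len(X), len(X[0])
    # Steps 1-6: density-based outlier identification (self excluded from counts)
    counts = [sum(1 for q in range(n)
                  if q != i and math.dist(X[i], X[q]) <= radius)
              for i in range(n)]
    eta_max = max(counts) or 1
    outliers = [i for i in range(n) if counts[i] / eta_max < alpha]
    inliers = [i for i in range(n) if counts[i] / eta_max >= alpha]
    if not inliers:
        return [], outliers

    # Step 8: random memberships normalised per Equation (2);
    # Step 10: rows of outliers stay at zero
    rng = random.Random(seed)
    U = [[0.0] * cc for _ in range(n)]
    PI = [[0.0] * cc for _ in range(n)]
    for i in inliers:
        row = [rng.random() + 1e-9 for _ in range(cc)]
        total = sum(row)
        U[i] = [v / total for v in row]

    prev_obj, C = None, []
    for _ in range(max_iter):
        # Equation (27): centroids as membership-weighted means
        C = []
        for j in range(cc):
            w = [U[i][j] ** m for i in range(n)]
            total = sum(w)
            C.append(tuple(sum(w[i] * X[i][k] for i in range(n)) / total
                           for k in range(dim)))
        # Equation (24): kernel distance term plus intuitionistic entropy term
        obj = 2.0 * sum(U[i][j] ** m * (1.0 - rbf(X[i], C[j], h))
                        for i in inliers for j in range(cc))
        for j in range(cc):
            pi_star = sum(PI[i][j] for i in range(n)) / n
            obj += pi_star * math.exp(1.0 - pi_star)
        if prev_obj is not None and abs(obj - prev_obj) < sc:
            break
        prev_obj = obj
        # Equation (26) via (23), then u* = u + pi
        for i in inliers:
            d = [max(1.0 - rbf(X[i], C[j], h), 1e-12) for j in range(cc)]
            for j in range(cc):
                u = 1.0 / sum((d[j] / d[k]) ** (1.0 / (m - 1.0))
                              for k in range(cc))
                PI[i][j] = hesitation(u, beta)
                U[i][j] = u + PI[i][j]
    return C, outliers
```

A call such as `dkifcm(X, cc=2, radius=1.5, alpha=0.15, h=5.0)` on two compact groups of points plus one distant point returns two centroids and the index of the distant point. Because outlier rows of the membership matrix stay at zero, those objects never influence the centroids of Equation (27).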

4 Experimental analysis and results

The algorithms are implemented in MATLAB Version 7 on a system with an Intel i5 2.5 GHz processor and 8 GB RAM. Experimental results are presented as tables and figures. The algorithmic parameters common to the proposed and other algorithms are: the fuzzifier is set to 2, the maximum number of iterations is set to 100, and the minimum improvement between consecutive iterations is set to 0.00001. Experiments are executed on standard 2D and real high-dimensional data sets. The notation used in the clustering results is: green/blue dots (’.’) for data objects of different clusters, ’*’ for centroids, and ’∘’ for outliers.

Example 1. (D10 and D12 data sets): D10 is also known as the Diamond data set. It is noise free and consists of two identical clusters of five data objects each. Figures 2 and 3 show the D10 data set and the clustering results of FCM, IFCM, KIFCM, DKIFCM, and DOFCM. The D12 [21] data set is the union of the data objects of D10, a noise point x11, and an outlier x12. Figures 4 and 5 show the D12 data set and the clustering results of FCM, IFCM, KIFCM, DKIFCM, and DOFCM. Table 1 lists the coordinates of the centroids generated by the proposed and other algorithms on the D10 data set, together with the error between the computed and ideal centroids. The ideal centroids for the D10 and D12 data sets are given by the matrix C_Ideal that follows Table 1.

Fig. 2

(a) D10 (b) FCM result, (c) IFCM result.

Fig. 3

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.

Fig. 4

(a) D12 (b) FCM result, (c) IFCM result.

Fig. 5

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.

Table 1

Computed cluster centroids and average error on D10

Algorithms	Cluster 1		Cluster 2		Average Error
	cx*	cy*	cx*	cy*
FCM	3.3591000000	0.0000000000	–3.3591000000	0.0000000000	0.0003648100
IFCM	3.1403208037	0.0000000000	–3.1403208037	0.0000000000	0.0398717814
KIFCM	3.4557000000	0.0001000000	–3.4557000000	–0.0001000000	0.0133865000
DOFCM	3.3591000000	0.0000000000	–3.3591000000	0.0000000000	0.0003648100
DKIFCM	3.3454000000	–0.0082000000	–3.3415000000	0.0084000000	0.0000846050

$C_{Ideal} = \begin{bmatrix} 3.34 & 0 \\ -3.34 & 0 \end{bmatrix}$

The error is computed as $Error = \| C_{Ideal} - C_{*} \|^{2}$, where * indicates FCM/KIFCM/IFCM/DKIFCM/DOFCM. Table 2 lists the coordinates of the centroids and the error in centroid computation generated by the proposed and other algorithms for the D12 data set.
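The error measure can be sketched as the squared Euclidean distance between each computed centroid and its ideal counterpart, averaged over the clusters. The averaging step is an assumption inferred from the "Average Error" column; under it, the FCM and DKIFCM rows of Table 1 are reproduced. The function name is hypothetical.

```python
def average_centroid_error(ideal, computed):
    """Mean over clusters of the squared distance to the ideal centroid."""
    total = 0.0
    for c_ideal, c_found in zip(ideal, computed):
        total += sum((a - b) ** 2 for a, b in zip(c_ideal, c_found))
    return total / len(ideal)

ideal = [(3.34, 0.0), (-3.34, 0.0)]
fcm_d10 = [(3.3591, 0.0), (-3.3591, 0.0)]     # FCM row of Table 1
dkifcm_d10 = [(3.3454, -0.0082), (-3.3415, 0.0084)]  # DKIFCM row of Table 1

print(round(average_centroid_error(ideal, fcm_d10), 8))     # 0.00036481
print(round(average_centroid_error(ideal, dkifcm_d10), 10))  # 8.4605e-05
```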

Table 2

Computed cluster centroids and average error on D12

Algorithms	Cluster 1		Cluster 2		Average Error
	cx*	cy*	cx*	cy*
FCM	2.9852000000	0.5434000000	–2.9855000000	0.5437000000	0.4212232700
IFCM	3.4401225373	0.0888294644	–3.4401751444	0.0888883394	0.0179256963
KIFCM	3.3467000000	0.2008000000	–3.2994000000	0.1876000000	0.0386038250
DOFCM	3.1672000000	0.0000000000	–3.1674000000	0.0000000000	0.0298253000
DKIFCM	3.139561878	0.0086371025	–3.1389692781	–0.0226215249	0.0037890150

On analysing Figs. 2, 3, 4, and 5, it is observed that FCM performs very well on outlier-free and noise-free data such as D10, but its performance is drastically affected by even a single outlier, because FCM fails to identify outliers and treats them like any other data object. IFCM performs much better than FCM in the presence of noise, but it also fails to identify outliers, and noise points are sometimes allocated to more than one cluster. When the data are spread such that linear separation is not possible, kernelized versions of fuzzy clustering algorithms such as KIFCM become helpful. The kernelized approach also improves the accuracy of centroids, as verified by the experimental results in Tables 1 and 2. The proposed algorithm, DKIFCM, combines the best of these characteristics, i.e., identification of outliers, robustness to noise, and accurate centroid computation.

Example 2. (D15 data set): The D15 data set is the union of the data objects of D10, a noise point x11, and four outliers. Figures 6 and 7 show the D15 data set and the clustering results of FCM, IFCM, KIFCM, DKIFCM, and DOFCM. Table 3 lists the coordinates of the centroids and the computed error on the D15 data set. The ideal centroids and the error measure for D15 are the same as specified in Example 1.

Fig. 6

(a) D15 (b) FCM result, (c) IFCM result.

Fig. 7

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.

On analysing Fig. 6, Fig. 7, and Table 3, it is observed that FCM performs worst, as it forms one cluster from the outliers and a second cluster from all the remaining data objects. IFCM performs much better than FCM even in the presence of noise and a high proportion of outliers, but it too fails to identify outliers, and noise points are allocated to more than one cluster. KIFCM shows high robustness to the presence of outliers, as it identifies the right clusters with the least impact on centroid position in comparison to FCM, IFCM, and DOFCM. DOFCM successfully identifies outliers and generates better clusters. However, the proposed algorithm, DKIFCM, performs best, as it offers outlier identification, robustness to noise, and accurate centroid computation.

Table 3

Computed cluster centroids and average error on D15

Algorithms	Cluster 1		Cluster 2		Average Error
	cx*	cy*	cx*	cy*
FCM	0.675664771	23.17382119	0.004747364	0.122728763	277.663534
IFCM	3.583866483	0.025541416	–3.580651086	0.022600503	0.059273477
KIFCM	3.308539989	–0.024125867	–3.209771978	0.015214635	0.009381306
DOFCM	3.167456691	–5.9E-09	–3.167223882	5.1E-09	0.02981139
DKIFCM	3.388887575	–0.000409094	–3.424288773	0.00032228	0.004747432

Example 3. (Dunn data set): The Dunn data set consists of two square-shaped clusters that vary highly in density. Figure 8(a) shows the noisy Dunn data set, and Fig. 8(b-c) and Fig. 9 show the clustering results of FCM, IFCM, KIFCM, DOFCM, and DKIFCM. Table 4 lists the coordinates of the centroids and the error in centroid computation generated by the proposed and other algorithms. The ideal centroids for this data set are: $C_{Ideal} = \begin{bmatrix} 15.2 & 0 \\ 5.4 & 0 \end{bmatrix}$

Fig. 8

(a) Noisy Dunn (b) FCM result, (c) IFCM result.

Fig. 9

(a) KIFCM result, (b) DOFCM result, (c) DKIFCM result.

Table 4

Computed cluster centroid and average error on Dunn

Algorithms	Cluster 1		Cluster 2		Average Error
	cx*	cy*	cx*	cy*
FCM	15.3149583035	0.3322331999	5.7652222484	0.1165370845	0.1352812467
DOFCM	15.3847795993	0.0086151724	5.4869883645	0.1718854955	0.0356646603
IFCM	15.8884354895	0.2603106162	5.3129720416	0.0486664588	0.2758236650
KIFCM	15.4049770055	0.3079146714	5.463145783	0.0599450682	0.0722039094
DKIFCM	15.2697135416	0.0209208252	5.3543545124	0.0786688079	0.0067849753

On analysing Figs. 8 and 9, it is observed that FCM, IFCM, and KIFCM fail to identify outliers, as these algorithms treat outliers like other data objects. However, DOFCM and DKIFCM identify most of the outliers, marked as ’o’, before clustering is performed. On analysing Table 4, KIFCM computes more accurate centroid positions than FCM, and IFCM is the worst choice for such a data set. DOFCM and DKIFCM perform equally well in outlier identification and both outperform FCM, IFCM, and KIFCM. Between DKIFCM and DOFCM, the former outperforms the latter in accuracy of centroid computation.

Example 4. (High Dimension Real World Data set):

Performance of proposed algorithm is tested on three well known real world datasets - (1) Fisher-Iris, (2) Wine, (3) Wisconsin Breast Cancer. Accuracy for high-dimension data is computed as follows: $accuracy = \frac{\sum_{i = 1}^{cc} a_{i}}{n}$ (28)

Equation (28) is Huang’s measure, where n is total number of data objects in a given data set. a_i is count of data objects that truly belong to i^th cluster (True Positive). Accuracy ranges from zero to one, and accuracy for perfect clusters will be 1.
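Equation (28) can be evaluated from per-cluster counts of correctly assigned objects, or equivalently from the number of misclassified objects. The short sketch below uses hypothetical function names and checks the relation against the FCM row of Table 5 (150 objects, 16 misclassified).

```python
def huang_accuracy(correct_per_cluster, n):
    """Equation (28): correctly clustered objects divided by all objects."""
    return sum(correct_per_cluster) / n

def accuracy_from_misclassified(n, misclassified):
    """Same measure, written in terms of the misclassification count."""
    return (n - misclassified) / n

print(round(accuracy_from_misclassified(150, 16), 5))  # 0.89333
```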

Fisher-Iris is a real-world data set of three species of flowers with 50 instances of each species. Each flower is represented by 4 attributes: sepal length, sepal width, petal length, and petal width. Table 5 lists the accuracy and misclassification of the proposed and other algorithms for the Fisher-Iris data set. On analysing Table 5, it is observed that DKIFCM and KIFCM perform best (12 misclassifications, accuracy 0.92), followed by IFCM (15 misclassifications). FCM and DOFCM perform worst, with 16 misclassifications each.

Table 5

For Iris Data-set, misclassification and accuracy using FCM, KIFCM, IFCM, DOFCM & DKIFCM

Algorithm	Misclassification	Accuracy
FCM	16	0.89333
IFCM	15	0.90000
KIFCM	12	0.92000
DOFCM	16	0.89333
DKIFCM	12	0.92000

The Wine data set has 178 instances from three classes of wine; each instance has 13 attributes such as Alcohol, Ash, and Malic acid. The Wisconsin Breast Cancer data set has 569 instances from the Benign and Malignant classes, and each instance has 32 attributes.

Tables 6 and 7 list the accuracy and misclassification of the proposed and other algorithms for the Wine and Wisconsin Breast Cancer data sets, respectively. Since these data sets have no outliers, the performance of DKIFCM is similar to that of KIFCM. However, DKIFCM performs much better than the other algorithms, namely FCM, IFCM, and DOFCM.

Table 6

For Wine Data-set, misclassification and accuracy using FCM, KIFCM, IFCM, DOFCM & DKIFCM

Algorithm	Misclassification	Accuracy
FCM	56	0.68539
IFCM	54	0.69662
KIFCM	47	0.73593
DOFCM	56	0.68539
DKIFCM	47	0.73593

Table 7

For Wisconsin Breast Cancer Data-set, misclassification and accuracy using FCM, KIFCM, IFCM, DOFCM & DKIFCM

Algorithm	Misclassification	Accuracy
FCM	83	0.85413
IFCM	75	0.86819
KIFCM	53	0.90685
DOFCM	83	0.85413
DKIFCM	53	0.90685

5 Conclusion

With the digitization of today’s world, clustering has become a valuable tool for a wide range of applications such as customer segmentation, pattern recognition, and object recognition. For decades, researchers and academicians have worked to improve clustering techniques. In the last decade, growth in computation power and the availability of vast and varied data have given clustering new horizons. In this paper, we proposed a new robust clustering algorithm, DKIFCM, which is a hybridization of a density-based approach, intuitionistic fuzzy sets, and kernel functions. It uses the RBF kernel function to map data objects to a feature space to improve the accuracy of the resulting clusters, a density-oriented approach to identify outliers, and an intuitionistic fuzzy set based clustering approach to improve the centroid positions of the resultant clusters. The combination of these approaches results in a robust clustering algorithm that significantly outperforms the existing algorithms FCM, IFCM, KIFCM, and DOFCM. The performance of DKIFCM is tested on standard data sets from the literature, namely D10, D12, D15, and the noisy Dunn-Square data set, and also on high-dimensional data sets, namely Fisher-Iris, Wine, and Wisconsin Breast Cancer. Experimentally, the proposed method is found to be significantly effective in the presence of outliers and noise for non-linearly as well as linearly separable data. The major advantages of the proposed algorithm are outlier identification, robustness to noise, and accurate centroid computation. Improving the effectiveness of outlier identification is left as future work.

References

Han

, Pei

and Kamber

, Data mining: concepts and techniques, (Elsevier, 2011, 3rd Edition).

, Kumar

, Ross Quinlan

, et al., Top 10 algorithms in data mining,(Springer, Knowledge and Information Systems 14 (2008), 1–37.

Jain

A.K.

, Narasimha Murty

and Flynn

P.J.

, Data clustering: a review, ACM Computing Surveys (CSUR) 31(3) (1999), 264–323.

Albert

and Nanjappan

, An Efficient Kernel FCM and Artificial Fish Swarm Optimization Based Optimal Resource Allocation in Cloud, Journal of Circuits Systems and Computers (2020).

Sonti

S.R.

and Prasad

M.S.G.

, Enhanced fuzzy C-means clustering based cooperative spectrum sensing combined with multi-objective resource allocation approach for delay-aware CRNs, IET Communications 14(4) (2019), 619–626.

Rodríguez-Ramos

, da Silva Neto

A.J.

, et al., An approach to fault diagnosis with online detection of novel faults using fuzzy clustering tools, Expert Systems with Applications 113 (2018), 200–212.

Nancy

, Muthurajkumar

, Ganapathy

, et al., Intrusion detection using dynamic feature selection and fuzzy temporal decision tree classification for wireless sensor networks, IET Communications 14(5) (2020), 888–895.

Oner

S.C.

and Oztaysi

, An interval type 2 hesitant fuzzy MCDM approach and a fuzzy c means clustering for retailer clustering, Soft Computing 22(15) (2018), 4971–4987.

Wen

, Xuan

, Li

, et al., Image segmentation algorithm based on neutrosophic fuzzy clustering with non-local information, IET Image Processing 14(3) (2019), 576–584.

10.

, Chen

and Wang

, An Intuitionistic Kernel-Based Fuzzy C-Means Clustering Algorithm With Local Information for Power Equipment Image Segmentation, IEEE Access 8 (2020), 4500–4514.

11.

Sheela

C.J.J.

and Suganthi

, Morphological edge detection and brain tumor segmentation in Magnetic Resonance (MR) images based on region growing and performance evaluation of modified Fuzzy C-Means (FCM) algorithm, Multimedia Tools and Applications (2020), 1–14.

12. Moodley, Chiclana, Caraffini, et al., A product-centric data mining algorithm for targeted promotions, Journal of Retailing and Consumer Services (2019), 101940.

13. Bodyanskiy Y.V., Tyshchenko O.K. and Mashtalir S.V., Fuzzy Clustering High-Dimensional Data Using Information Weighting, In International Conference on Artificial Intelligence and Soft Computing, Springer, (2019), 385–395.

14. Rajput and Kumaravelu V.B., Scalable and sustainable wireless sensor networks for agricultural application of Internet of things using fuzzy c-means algorithm, Sustainable Computing: Informatics and Systems 22 (2019), 62–74.

15. Rajput, Kumaravelu V.B. and Murugadass, Smart Monitoring of Farmland Using Fuzzy-Based Distributed Wireless Sensor Networks, In Emerging Technologies for Agriculture and Environment, Springer, Singapore, (2020), 53–75.

16. Patil S.D. and Ragha, An adaptive fuzzy based message dissemination and micro-artificial bee colony algorithm optimized routing scheme for vehicular ad hoc network, IET Communications (2020).

17. Zadeh L.A., Fuzzy sets, Information and Control 8(3) (1965), 338–353.

18. Dunn J.C., A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, (1973), 32–57.

19. Bezdek J.C., Ehrlich and Full, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences 10(2-3) (1984), 191–203.

20. Krishnapuram and Keller J.M., A possibilistic approach to clustering, IEEE Transactions on Fuzzy Systems 1(2) (1993), 98–110.

21. Dave R.N., Characterization and detection of noise in clustering, Pattern Recognition Letters 12(11) (1991), 657–664.

22. Chintalapudi K.K. and Kam, The credibilistic fuzzy c means clustering algorithm, In SMC’98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 98CH36218), IEEE 2 (1998), 2034–2039.

23. Pal N.R., Pal, Keller J.M. and Bezdek J.C., A possibilistic fuzzy c-means clustering algorithm, IEEE Transactions on Fuzzy Systems 13(4) (2005), 517–530.

24. Kaur and Gosain, A density oriented fuzzy c-means clustering algorithm for recognising original cluster shapes from noisy data, International Journal of Innovative Computing and Applications 3(2) (2011), 77–87.

25. Kaur and Gosain, Density-oriented approach to identify outliers and get noiseless clusters in Fuzzy C-Means, In International Conference on Fuzzy Systems, IEEE, (2010), 1–8.

26. Chaira, A novel intuitionistic fuzzy C means clustering algorithm and its application to medical images, Applied Soft Computing 11(2) (2011), 1711–1717.

27. and Wu, Intuitionistic fuzzy c-means clustering algorithms, Journal of Systems Engineering and Electronics 21(4) (2010), 580–590.

28. Kaur, Soni A.K. and Gosain, RETRACTED: A robust kernelized intuitionistic fuzzy c-means clustering algorithm in segmentation of noisy medical images, (2013), 163–175.

29. Kaur, Soni A.K. and Gosain, Robust Intuitionistic Fuzzy C-means clustering for linearly and nonlinearly separable data, International Conference on Image Information Processing, IEEE, (2011), 1–6.

30. Romdhane L.B., Bannour and el Ayeb, IMIOL: a system for indexing images by their semantic content based on possibilistic fuzzy clustering and adaptive resonance theory neural networks learning, Applied Artificial Intelligence 24(9) (2010), 821–846.

31. Gosain and Dahiya, Performance analysis of various fuzzy clustering algorithms: a review, Procedia Computer Science 79 (2016), 100–111.

32. Tushir and Srivastava, A new Kernelized hybrid c-mean clustering model with optimized parameters, Applied Soft Computing 10(2) (2010), 381–389.

33. Ester, Kriegel H.-P., Sander and Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In KDD 96(34) (1996), 226–231.

34. Gosain and Dahiya, A New Robust Fuzzy Clustering Approach: DBKIFCM, Neural Processing Letters 52(3) (2020), 2189–2210.

35. Dahiya, Gosain and Mann, Experimental Analysis of Fuzzy Clustering Algorithms, Intelligent Data Engineering and Analytics, Springer, Singapore, (2020), 311–320.

36. Dahiya, Gosain and Gupta, RKT2FCM: RBF Kernel-Based Type-2 Fuzzy Clustering, Available at SSRN 3577549, (2020).

37. Dahiya, Nanda, Artwani and Varshney, Using Clustering techniques and Classification Mechanisms for Fault Diagnosis, International Journal 9(2) (2020).

38. Kaur, Soni A.K. and Gosain, Robust kernelized approach to clustering by incorporating new distance measure, Engineering Applications of Artificial Intelligence 26(2) (2013), 833–847.