Feature Linkage Weight Based Feature Reduction using Fuzzy Clustering Method

Abstract

In this paper, a novel algorithm, Feature-Reduction Fuzzy C-means with Feature Linkage Weight (FRFCM-FLW), is introduced. It combines Feature-Reduction Fuzzy C-means (FRFCM) with a feature linkage weight to form a new feature selection model based on fuzzy clustering. The larger the number of features, the more complex the problem and the longer it takes to produce the output of the classifier or model. Feature selection is an established method for choosing the features that best describe the data under a given criterion or measure. The proposed method has three stages: 1) Data formation: the process of data collection and data cleaning; 2) FRFCM-FLW: this stage reduces the feature set automatically while still producing good clustering results, computing a novel weight for every feature by combining a modified Mahalanobis distance with the feature variance δ_m in the FRFCM algorithm; 3) Fuzzy C-means (FCM) clustering. On most of the tested datasets, the proposed FRFCM-FLW method achieves a higher Accuracy Rate (AR), Rand Index (RI) and Jaccard Index (JI) than the other algorithms compared, namely WFCM, EWKM, WKM, FCM and FRFCM.

Keywords

Data mining, fuzzy logic, feature selection, FCM

1 Introduction

Data mining is the process of discovering previously unknown patterns and trends in databases and using that information to build predictive models. It combines statistical analysis, machine learning and database technology to extract hidden patterns and relationships from large databases.

Clustering is a method that automatically examines the relations between data and groups them into thematically coherent structures, for example clusters of texts that share related topics. Automatic clustering needs neither human intervention nor prior knowledge about the inputs, which makes it a widely applied technique for information analysis [1].

Data clustering is recognized as a significant area of data mining [1]. It is the process of partitioning data elements into groups (known as clusters) such that the elements within a cluster are highly similar to one another while being dissimilar to the elements in other clusters.

In data mining, clustering falls into two categories: (i) soft clustering (overlapping clustering) and (ii) hard clustering (exclusive clustering). In soft clustering methods, fuzzy sets are used to group data, so that every point may belong to two or more clusters with different degrees of membership; each data point is associated with an appropriate membership value. In many situations fuzzy clustering is more natural than hard clustering: objects on the boundaries between several classes are not forced to belong fully to one of the classes, but are instead assigned membership degrees between zero and one indicating their partial membership. In hard clustering methods, by contrast, data are grouped exclusively, so that a datum that belongs to one cluster cannot be included in another. Fuzzy C-Means (FCM) is a widely used soft clustering method, and K-means is a fundamental hard clustering method. The proposed technique belongs to the soft clustering category and builds on the Feature-Reduction FCM (FRFCM) algorithm.

The objective of cluster analysis is to assign data samples with similar properties to the same cluster and dissimilar samples to different clusters [2]. In most clustering problems, however, the attributes are not equally important.

The proposed work presents an enhanced Feature-Reduction FCM (FRFCM) clustering procedure with a new Feature Linkage Weight (FRFCM-FLW). The proposed method extends the FRFCM method [13] by adding a Feature Linkage Weight that determines the significance of each feature. The feature selection process computes the Feature Linkage Weight (FLW) of every feature. The advantage of the proposed FLW method is its use of a modified Mahalanobis distance, which means that the variability and correlation of the data are taken into account when computing the clusters. The main contribution of this paper is the feature linkage weight index, which plays a major role in shaping the clusters.

Under the feature estimation index, the smaller the distance value of an attribute, the more significant the feature. The modified Mahalanobis distance takes into account the FLW of the features in addition to the variability of the data features.

The objective of this work is to develop a novel fuzzy-based automatic feature selection algorithm that efficiently removes irrelevant features.

2 Related work

Coletta et al., 2012 [7] reviewed several variants of the widely used Fuzzy C-Means (FCM) algorithm for clustering data distributed across different locations. Those techniques were studied under two categories, namely parallel fuzzy clustering and collaborative clustering. The authors offered several extensions of two FCM-based clustering algorithms for distributed data, arriving at effective ways of determining important parameters of the algorithms (including the number of clusters) and providing a set of systematically derived guidelines for selecting a particular algorithm depending on the nature of the data environment.

M. Gong, L. Su, M. Jia, and W. Chen, 2014 [8] proposed an approach that focuses on modifying the membership instead of modifying the objective function. The membership approach is computationally simple in all the steps involved. Its objective function reduces directly to the original form of FCM, which leads to lower time consumption than that of several recently improved FCM algorithms. In addition, their approach modifies the membership of every pixel according to a novel form of MRF energy function through which the neighbors of every pixel, as well as their relationship, are considered. Theoretical analysis and experimental results on real-world datasets showed that the proposed approach can detect the actual changes as well as reduce the effect of speckle noise.

S. T. Chang, K. P. Lu, and M. S. Yang, 2015 [9] designed a technique called the Fuzzy Change Point (FCP) algorithm to detect change points (CPs) and simultaneously estimate the parameters of the regression model. The fuzzy c-partitions concept is first embedded into the CP regression models. Any possible collection of all CPs is regarded as a partition of the data with a fuzzy membership. They transferred these memberships into pseudo-memberships of data points belonging to each individual cluster, and so obtained estimates of the model parameters by the fuzzy c-regressions technique. The authors then used fuzzy c-means clustering to obtain new iterates of the CP collection memberships by minimizing an objective function of the deviations between the predicted response values and the data values.

Miin-Shen Yang and Yi-Cheng Tian, 2015 [10] proposed a bias-correction term with an updating equation to reduce the effect of initialization on fuzzy clustering algorithms. They first proposed the so-called bias-correction fuzzy clustering for the generalized FCM algorithm. The authors then constructed the bias-correction FCM, bias-correction Gustafson and Kessel clustering and bias-correction inter-cluster separation algorithms. They compared their proposed bias-correction fuzzy clustering algorithms with other fuzzy clustering algorithms on numerical examples, and also applied them to real-world data sets.

G. Teng, et al., 2016 [11] introduced the Group Method of Data Handling (GMDH) to cluster ensembles, and proposed a cluster ensemble framework based on the group method of data handling (CE-GMDH). CE-GMDH consists of three components: an initial solution, a transfer function and an external criterion. Several CE-GMDH models can be built according to different types of transfer functions and external criteria. In their study, three novel models were presented based on different transfer functions: cluster-based similarity partitioning algorithm, least squares approach and semi-definite programming.

C. C. Yeh and M. S. Yang, 2017 [12] discussed evaluation measures for cluster ensembles based on the fuzzy generalized Rand index (FGRI). The authors first used graph and relation matrices to transform a membership matrix into a signal relation matrix, and then used matrix multiplication to compute similarity measures. The resulting similarity measures cover a broad range of situations: (1) between a fuzzy cluster ensemble and a crisp partition, (2) between a fuzzy cluster ensemble and a cluster ensemble, (3) between a fuzzy cluster ensemble and a fuzzy partition, (4) between two fuzzy cluster ensembles, and (5) between two different object data sets with the same cardinality and the same partitioning technique.

Miin-Shen Yang and Yessica Nataliani, 2018 [13] designed a novel technique for improving fuzzy clustering methods that can automatically estimate individual feature weights and simultaneously eliminate unimportant features. The authors considered the Fuzzy C-Means objective function with feature-weighted entropy, constructed a learning schema for the parameters, and then removed the irrelevant feature components. They called it feature-reduction FCM (FRFCM). In the FRFCM procedure, a new process for eliminating irrelevant feature(s) with small weight(s) is created for feature reduction.

Lin et al., 2018 [20] proposed a novel fuzzy rough set model for feature reduction in multi-label classification. Unlike single-label feature reduction, a bottleneck of fuzzy rough sets for multi-label feature reduction is finding the true different-class instances for the target instance, which strongly affects the robustness of the fuzzy lower and upper approximations. They defined the gain vector of every instance to estimate the likelihood of each different-class sample with respect to the target sample. Then, restricted sampling is leveraged to build a robust distance among instances, which provides robustness against noisy information when computing the fuzzy upper and lower approximations under all the labels.

Zixiao Shen, Xin Chen, Jonathan M. Garibaldi, 2019 [19] proposed a novel weighted combination feature selection technique using bootstrap and fuzzy sets. The technique consists of three steps: fuzzy set generation using bootstrap, weighted combination of fuzzy sets, and feature ranking based on defuzzification. The authors implemented the technique by combining four state-of-the-art feature selection techniques and evaluated its performance on three publicly available biomedical datasets using fivefold cross validation. Based on the feature selection results, their technique produced classification accuracies comparable to the best of the individual feature selection techniques for all evaluated datasets.

S. Kashef and H. Nezamabadi-pour, 2019 [21] presented a fast and accurate filter-based feature selection technique designed specifically for multi-label datasets to find label-specific features. It maps the features to a multi-dimensional space based on a filter technique, and selects the most significant features with the help of Pareto-dominance concepts from the multi-objective optimization domain. Their technique can also be used for online feature selection, which deals with problems in which features arrive sequentially while the number of data samples is fixed.

Sun, et al., 2019 [22] proposed a novel neighborhood rough set and entropy measure-based gene selection method with Fisher score for tumor classification, which can deal with real-valued data while preserving the original gene classification information. First, the Fisher score technique is employed to remove irrelevant genes and considerably reduce computational complexity. After that, several neighborhood entropy-based uncertainty measures are examined for handling the uncertainty and noise of gene expression data.

Sun, et al., 2020 [23] proposed a novel NMRS-based attribute reduction technique using Lebesgue and entropy measures in incomplete neighborhood decision systems. First, some concepts of positive and negative NMRS models in incomplete neighborhood decision systems are defined, respectively. Then, a Lebesgue measure is combined with NMRS to study neighborhood tolerance class-based uncertainty measures. To analyze the uncertainty, redundancy and noise of incomplete neighborhood decision systems, several neighborhood multi-granulation entropy-based uncertainty measures are presented by integrating Lebesgue and entropy measures.

3 Proposed methodology

All experiments with the proposed FRFCM-FLW method were conducted on high dimensional datasets [12]. The FRFCM-FLW algorithm removes unimportant features according to their weights, which means that the data are clustered using only the selected, differently weighted features. The overall FRFCM-FLW flow diagram is shown in Fig. 1.

Fig. 1

Proposed FRFCM-FLW algorithm Flow Diagram.

In Fig. 1, real-world and synthetic datasets are taken as inputs to the MATLAB simulation. The first step, data preparation, determines the number of instances, features and classes. Next, the FRFCM-FLW algorithm works in four steps: (i) feature mean and variance; (ii) fuzzy objective; (iii) fuzzy memberships; and (iv) feature linkage weight. In this last step the proposed algorithm performs the feature reduction process with the appropriate feature linkage weight measure. The final process clusters the data on the retained features and computes the clustering accuracy.

3.1 Data preparation

Data preparation is the process of data cleaning. Data cleaning ensures that the data gathered is accurate, so that the decisions based on it are valid. Data is collected from a variety of sources on different web pages. The real datasets include plant data (soybean, seeds), cancer data (colon cancer, breast cancer, ovarian cancer), disease data (thyroid, Pima Indians, bupa), handwritten data (USPS), and text data (basehock), taken from the UCI data repository [17] and the Kent Ridge Biomedical Data Set [18]. The real-world datasets are described in Table 1.

Table 1
Real Datasets

Dataset	Number of Instances	Classes	Number of Features
Iris	150	3	4
Thyroid	215	2	5
Bupa	345	2	6
Seeds	210	3	7
Breast cancer	699	2	8
Pima Indians	788	2	8
Soybean	47	4	21
USPS	4000	10	256
Ovarian cancer	216	2	4000
Basehock	1993	2	4862
Colon cancer	62	2	2000

The data thus obtained may not be well organized and may include incomplete information. Consequently, the collected data must be subjected to data cleaning. The data that is collected must be processed or organized for analysis. This includes structuring the data as required by the relevant analysis tools. For instance, the data might have to be placed into rows and columns in a table within a spreadsheet or statistical application.

The processed data may be incomplete, contain duplicates, or contain errors. Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset; it refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. There are several types of data cleaning, depending on the type of data. For example, during cleaning the data might be checked against reliable published numbers or defined thresholds.

3.2 Feature-reduction FCM with feature linkage weight (FRFCM-FLW)

The Enhanced Feature-Reduction FCM with Feature Linkage Weight (FRFCM-FLW) is an extended version of the FRFCM algorithm. The existing Feature-Reduction FCM (FRFCM) algorithm automatically computes different feature weights by considering the FCM objective function with feature-weighted entropy. Moreover, it uses a feature-reduction schema to eliminate irrelevant features with small weights, so that the computational time decreases and clustering performance improves. The proposed FRFCM-FLW algorithm randomly updates the feature-weight vectors in its training phase rather than using a fixed feature-linkage weight vector, because a weight vector that stays fixed throughout the clustering process cannot properly reflect the relevance of the weighted features to the changing cluster information.

Let the dataset D = {x₁, x₂, x₃, ..., x_n} denote a set of data points to be grouped into c clusters, where x_i (i = 1, 2, ..., n) is a data point. The fuzzy objective function is used to find nonlinear associations between the data points through mappings that link the features of the data to new feature linkage weight spaces. The proposed FRFCM-based feature linkage weight algorithm is an iterative feature reduction method that minimizes the objective function (thereby discovering the non-linear relationships in the dataset).

Given a dataset D = {x₁, ..., x_n} ⊂ R^p (the patterns), the proposed algorithm divides D into c fuzzy partitions by minimizing the following objective function: $\begin{matrix} J (U, V, W) = \sum_{k = 1}^{C} \sum_{i = 1}^{N} \sum_{j = 1}^{f} μ_{ij}^{m} δ_{j} {[d_{ij}^{w}]}^{2} \\ d_{ij}^{w} = \sum_{k = 1}^{C} \sum_{i = 1}^{N} \sum_{j = 1}^{f} w_{j} {(x_{ij} - V_{kj})}^{2} \end{matrix}$ (1)

where J is the objective function, U is the fuzzy membership matrix, V is the set of cluster centers, w is the feature weight, m is the fuzzifier exponent (set to 2), and d is the weighted distance measure between a data point and the class C_k. The cluster center V_kj is computed as follows: $V_{kj} = \frac{\sum_{i = 1}^{n} μ_{ij}^{m} x_{ij}}{\sum_{i = 1}^{n} μ_{ij}^{m}}$ (2)
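As an illustration of Equation (2), the following minimal Python sketch computes the cluster centers from a membership matrix. It assumes, as in standard FCM, that memberships are indexed by sample and cluster (U[i][k]); the function name update_centers and the toy data are hypothetical and are not taken from the MATLAB implementation used in this paper.

```python
def update_centers(X, U, m=2.0):
    """Membership-weighted cluster centers in the spirit of Equation (2).

    X: n rows of f feature values; U: n rows of c memberships (assumed layout).
    Returns c centers, each a list of f coordinates.
    """
    n, f, c = len(X), len(X[0]), len(U[0])
    centers = []
    for k in range(c):
        # Membership of every sample in cluster k, raised to the exponent m.
        mu_m = [U[i][k] ** m for i in range(n)]
        total = sum(mu_m)
        centers.append(
            [sum(mu_m[i] * X[i][j] for i in range(n)) / total for j in range(f)]
        )
    return centers


# Hypothetical toy data: three samples, two features, two clusters.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
U = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
centers = update_centers(X, U)
print(centers)
```

With m = 2, the middle sample contributes to each center with weight 0.25, so each center is pulled only slightly away from the sample that fully belongs to it.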

The ratio (μ(z)/σ(z))_q for feature q in the FRFCM algorithm can effectively control the scattering among clusters in the dataset. Consequently, (mean(z)/var(z))_q is used to estimate δ_q, that is: $δ_{q} = {(\frac{μ (z)}{σ (z)})}_{q}$ (3)

where μ(z) is the mean and σ(z) is the variance of feature q.
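Equation (3) can be sketched in a few lines of Python. The sketch below assumes the population variance is intended (the text does not say whether the population or sample variance is used) and that no feature is constant; feature_delta is a hypothetical helper name.

```python
from statistics import mean, pvariance


def feature_delta(X):
    """Mean-to-variance ratio of every feature column, as in Equation (3).

    Assumption: population variance; a constant feature (zero variance)
    would raise ZeroDivisionError and should be handled beforehand.
    """
    f = len(X[0])
    deltas = []
    for q in range(f):
        column = [row[q] for row in X]
        deltas.append(mean(column) / pvariance(column))
    return deltas


# Hypothetical toy data: three samples, two features.
X = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
deltas = feature_delta(X)
print(deltas)
```

In this toy example the second feature is twice as spread out as the first, so it receives the smaller δ value.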

To build a feature-reduction process, the FRFCM-FLW algorithm removes unimportant features with the help of a threshold value, which decides which features are retained. Consider a dataset with n data samples, each having f features, under the constraint $\sum_{j = 1}^{f} w_{j} = 1$ on the feature weights w_j (j = 1, 2, ..., f). If the number of features f is large, the threshold for feature elimination is naturally chosen as 1/f. However, the proposed FRFCM-FLW algorithm is intended to suit a wide range of datasets, including those with small f. In this sense, the number of data points n should be considered as another factor. Therefore, $1 / \sqrt{nf}$ is taken as an appropriate threshold for removing the irrelevant feature(s) in the datasets.
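A minimal sketch of this thresholding step is given below. For the Iris dataset (n = 150, f = 4) the threshold is 1/√600 ≈ 0.0408. The weight values and the renormalization of the surviving weights (so that they again sum to 1) are illustrative assumptions, and the function names are hypothetical.

```python
import math


def elimination_threshold(n, f):
    """Threshold 1 / sqrt(n * f) for n samples and f features."""
    return 1.0 / math.sqrt(n * f)


def discard_features(weights, n):
    """Keep features whose weight exceeds the threshold.

    Returns the indices of the kept features and their weights,
    renormalized to sum to 1 (the renormalization is an assumption).
    """
    threshold = elimination_threshold(n, len(weights))
    kept = [j for j, w in enumerate(weights) if w > threshold]
    total = sum(weights[j] for j in kept)
    renormalized = {j: weights[j] / total for j in kept}
    return kept, renormalized


threshold = elimination_threshold(150, 4)  # Iris: n = 150, f = 4
# Hypothetical weights for four features (not taken from the paper).
kept, renormalized = discard_features([0.02, 0.03, 0.45, 0.50], n=150)
print(round(threshold, 4), kept, renormalized)
```

Here the first two hypothetical weights fall below 0.0408 and are discarded, leaving the third and fourth features.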

The proposed feature linkage weight clustering method extends the FRFCM technique described in [13] to determine the FLW of the features and, furthermore, uses a modified Mahalanobis distance with the feature variance δ_m as the weight measure, which takes into account the FLW of the features in addition to the variability of the data. The FRFCM features are initialized from the input dataset. The FRFCM-FLW method is then executed on the dataset, using the modified Mahalanobis distance with the feature δ_m measure within the FRFCM algorithm. The primary difference from FRFCM is that the Mahalanobis distance is applied together with the feature δ_m measure. In this way, the variability and correlation of the data are taken into account when computing the clusters.

The proposed FRFCM-FLW algorithm determines a new set of features, with modified feature linkage weights, that is used for further clustering. FCM clustering methods commonly use the Euclidean and Mahalanobis distance measures. The Euclidean distance between a and b is: $d^{2} (a, b) = {(a - b)}^{T} (a - b)$ (4)

Let a and b be pattern vectors (each pattern is a particular data point in the dataset, and the elements of a and b are the values that the attributes take for that data sample), written as a^p = [a₁^p, a₂^p, ..., a_n^p] and b^p = [b₁^p, b₂^p, ..., b_n^p].

The Mahalanobis distance between a and a center v, which accounts for the variability and correlation of the data, is: $d^{2} (a, v, CV) = {(a - v)}^{T} {CV}^{- 1} (a - v)$ (5)

In the Mahalanobis distance, CV is the covariance matrix; using CV captures the variability and correlation of the data. To take the weight of the features into account when computing the distance between two data points, the FLW technique uses (a-b)_δm (the feature-weighted difference) instead of (a-b) in the distance calculation, whether it is Euclidean or Mahalanobis. Here (a-b)_δm is a vector whose i^th component is the product of the i^th element of the vector (a-b) and the i^th element of the vector FLW. The feature index is used to calculate the FLW of every feature as follows: suppose f₁, f₂, ..., f_n are the n features of a dataset, and featureIndex(f_i) and FLW(f_i) are the feature index and Feature Linkage Weight of feature f_i, respectively; then $\begin{matrix} FLW (f_{i}) = \\ \frac{(\sum_{j = 1}^{n} featureIndex (f_{j})) - featureIndex (f_{i})}{\sum_{j = 1}^{n} featureIndex (f_{j})}, 1 ⩽ i ⩽ n \end{matrix}$ (6)
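Equation (6) is straightforward to evaluate. The sketch below uses hypothetical feature index values to show that a smaller feature index yields a larger FLW; the function name is an assumption made for illustration.

```python
def feature_linkage_weights(feature_index):
    """FLW of every feature following Equation (6).

    feature_index: one non-negative index value per feature, with a
    non-zero sum. Each FLW is (total - own index) / total, so it lies
    in [0, 1] and grows as the feature's own index shrinks.
    """
    total = sum(feature_index)
    return [(total - value) / total for value in feature_index]


# Hypothetical feature index values for four features.
flw = feature_linkage_weights([4.0, 3.0, 2.0, 1.0])
print(flw)
```

For these values the total is 10, so the weights are 0.6, 0.7, 0.8 and 0.9: the feature with the smallest index receives the largest linkage weight.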

Feature Linkage Weight assignment is a special case of feature selection in which the features are ranked according to their significance. Each feature is assigned a value in the interval [0, 1] indicating its significance; this value is called the FLW. A vector FLW is defined whose i^th element is FLW(f_i). Having calculated the FLW of every feature of the dataset, these values must now be taken into account when computing the distance between data points, which is of great importance in feature selection. With this modification, Equations (4) and (5) take the form: $d_{δm}^{2} (a, b) = {(a - b)}_{δm}^{T} {(a - b)}_{δm}$ (7)

and $d_{δm}^{2} (a, v, CV) = {(a - v)}_{δm}^{T} {CV}^{- 1} {(a - v)}_{δm}$ (8)

respectively, where ${(a - b)}_{δm} (i) = (a - b) (i) \times FLW (i)$ (9)

The modified distance is used in the FRFCM-FLW algorithm to cluster datasets with distinct feature linkage weights (i.e., feature similarity). The proposed FLW values are then compared against $1 / \sqrt{nf}$, which is taken as a suitable threshold for discarding unimportant feature(s) in the FRFCM-FLW clustering algorithm.
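The FLW-scaled distances can be sketched as follows. The sketch scales each component of the difference vector by the corresponding FLW and then evaluates the squared Euclidean and Mahalanobis forms. It assumes the inverse covariance matrix is supplied by the caller, and all function names and numeric values are hypothetical.

```python
def scaled_difference(a, b, flw):
    """Component-wise product of (a - b) with the FLW vector."""
    return [(ai - bi) * wi for ai, bi, wi in zip(a, b, flw)]


def modified_euclidean_sq(a, b, flw):
    """Squared Euclidean distance of the FLW-scaled difference."""
    d = scaled_difference(a, b, flw)
    return sum(di * di for di in d)


def modified_mahalanobis_sq(a, v, cv_inv, flw):
    """Squared Mahalanobis distance of the FLW-scaled difference.

    cv_inv is the inverse covariance matrix, assumed to be precomputed.
    """
    d = scaled_difference(a, v, flw)
    size = len(d)
    return sum(d[i] * cv_inv[i][j] * d[j] for i in range(size) for j in range(size))


# Hypothetical two-feature example.
a, v, flw = [2.0, 4.0], [0.0, 0.0], [0.5, 1.0]
euclid = modified_euclidean_sq(a, v, flw)
mahal = modified_mahalanobis_sq(a, v, [[2.0, 0.0], [0.0, 0.5]], flw)
print(euclid, mahal)
```

With an identity inverse covariance the two distances coincide; the diagonal matrix used here re-weights the two scaled components, giving 10 instead of 17.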

Algorithm: FRFCM-FLW Pseudo code

Procedure FRFCM-FLW (Data D, number of clusters C, feature variance δ)
1: Fix C, 2 < C < n
2: Fix ɛ (e.g., ɛ = 0.001) // process termination condition
3: Fix maximum iterations (e.g., max_Iterations = 100)
4: Randomly initialize V_0 = v₁, v₂, ..., v_c cluster centers
5: Calculate δ_q using data points D by Equation (3)
6: for t = 1 to max_Iterations do
       Calculate feature variance using Equation (1)
       Update cluster centers using Equation (2)
       Calculate the modified distances using Equation (9)
       Discard total f_r number of j^th feature components for weight FLW^(t)
       if FLW^(t) ≤ $1 / \sqrt{nf}$
           Update feature f^(new) = f - f_r
       end if
       if FLW^(t) - FLW^(t-1) < ɛ
           Break;
       else
           t = t + 1
           f = f^(new)
       end if
   end for

The feature elimination behavior of FRFCM-FLW on the Iris dataset is shown in Table 2. After the first round, the third and fourth features have larger feature weights than the first and second features. Because the first and second features have very small weights, these two features are treated as unnecessary for the clustering procedure. FRFCM-FLW finally keeps the two relevant features, i.e., petal length (PL) and petal width (PW), after three rounds, with a very good clustering result of AR = 0.985.

Table 2

Feature Elimination Performance for Iris Dataset with FRFCM-FLW Algorithm

	SL	SW	PL	PW
	Feature weights
Initial Weight	0.25	0.25	0.25	0.25
Round 1	0.1978	0.4922	0.6597	0.9370
Round 2	–	–	0.7769	0.9844
Round 3	–	–	0.7743	0.9844

4 Experimental results

The proposed FRFCM-FLW feature selection algorithm was evaluated experimentally. The experiments were run on an Intel i5-7200U 2.71 GHz x64-based processor with 8 GB of main memory, under the Windows 10 operating system, using MATLAB R2017a.

To evaluate feature-reduction-based fuzzy clustering performance, the clustering accuracy of the proposed FRFCM-FLW is compared with that of the existing FRFCM [13], FCM [3], Entropy-Weighted k-means (EWKM) [5], Weighted K-means (WKM) [4] and Weighted FCM (WFCM) [6]. Both synthetic and real-world datasets are used in the experiments. The proposed FRFCM-FLW feature reduction with Fuzzy C-means clustering is run on each dataset to measure clustering accuracy under various test conditions.

In this experiment, the proposed FRFCM-FLW algorithm is applied to the real-world Iris dataset, which has four features (sepal length (SL, in cm), sepal width (SW, in cm), petal length (PL, in cm), and petal width (PW, in cm)) and 150 data samples. The three classes of the Iris dataset are setosa, versicolor, and virginica. Table 3 shows the Accuracy Rates (ARs) of the compared algorithms. With all features as input, FRFCM gives an AR of 0.973 (4 misclassified points out of 150 samples) while ultimately using only the features PW and PL. The feature reduction behavior of FRFCM-FLW on the Iris dataset is shown in Table 2. Furthermore, using the Feature Linkage Weight method, with the modified Mahalanobis distance measure and the feature mean and variance (δ_m), gives the best clustering result of AR = 0.985 (2 misclassified points out of 150). That is, under FRFCM-FLW, PL is the most relevant feature and PW is the second most relevant feature for the Iris dataset. The proposed method also gives good clustering results on the high dimensional Ovarian cancer and Basehock datasets, with ARs of 0.831 and 0.682 (Table 3).

Table 3
The Best ARs Using Different Initial Cluster Centers with Fixed Initial Feature Weights

Datasets	WFCM	EWKM	WKM	FCM	FRFCM	FRFCM-FLW
Iris	0.953	0.960	0.960	0.893	0.973	0.985
Thyroid	0.954	0.661	0.861	0.791	0.901	0.972
Bupa	0.525	0.597	0.588	0.563	0.551	0.635
Seeds	0.895	0.891	0.909	0.895	0.919	0.926
Breast cancer	0.938	0.957	0.955	0.936	0.953	0.959
Pima Indians	0.720	1.00	1.00	0.659	1.00	0.992
Soybean	0.894	1.000	0.830	0.787	0.979	0.980
USPS	0.402	0.443	0.527	0.402	0.464	0.598
Colon cancer	0.565	0.613	0.597	0.548	0.613	0.699
Ovarian cancer	0.741	0.810	0.759	0.713	0.796	0.831
Basehock	0.537	0.616	0.551	0.536	0.635	0.682

In [2], the authors discuss the Iris dataset with its four features, i.e., Sepal Length (SL), Sepal Width (SW), Petal Length (PL) and Petal Width (PW). Fig. 2 shows a clustering of the Iris dataset based on features SL and SW (purple and gray colors), while Fig. 3 shows a clustering based on PL and PW (sky blue and brown colors). In Fig. 2, there is considerably more overlap among the classes, and it is difficult to separate the classes from the feature points. Fig. 3 shows that, for the classification of the Iris dataset, features PL and PW are more significant than SL and SW. The feature weighting (SL = 0, SW = 0, PL = 1, PW = 1) is better than (SL = 1, SW = 1, PL = 0, PW = 0) for Iris classification.

Fig. 2

Result of Iris Feature Weights (1, 1, 0, 0) with Fuzzy c-means Algorithm.

Fig. 3

Result of Iris Feature Weights (0, 0, 1, 1) with Fuzzy c-means Algorithm.

Fig. 4

Comparison of Accuracy Rate (AR) values with Real world datasets.

Fig. 5

Comparison of Jaccard Index (JI) values with Real world datasets.

Fig. 6

Comparison of Jaccard Index (JI) values with Real world datasets.

The clustering accuracy of the proposed FRFCM-FLW is quantified with the Accuracy Rate (AR), Rand Index (RI), and Jaccard Index (JI). To measure clustering performance, the Accuracy Rate $AR = \frac{1}{n} \sum_{k = 1}^{c} n (c_{k})$ is computed, where n(c_k) is the number of data samples correctly clustered in cluster c_k, and n is the total number of data samples. A higher AR indicates better clustering performance. The results obtained using different initial cluster centers with fixed initial feature weights (w = 0.25) are shown in Tables 3 to 5. Judging by the best ARs, RIs, and JIs obtained from all algorithms, the proposed FRFCM-FLW algorithm achieves better results on most datasets. The AR, JI and RI were extended by Yeh and Yang [14]. The higher the AR, JI and RI, the better the clustering performance.
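As a rough illustration of the AR formula, the sketch below counts correctly clustered samples under the best one-to-one matching between cluster labels and class labels. For instance, 146 correctly clustered samples out of 150 give AR ≈ 0.973. The brute-force matching over permutations is an assumption made for this sketch (the paper does not state how clusters are matched to classes), and it is only practical for a small number of clusters that does not exceed the number of classes.

```python
from itertools import permutations


def accuracy_rate(true_labels, cluster_labels):
    """AR = (number of correctly clustered samples) / n.

    Clusters are matched to classes by trying every one-to-one
    assignment and keeping the best (an illustrative assumption).
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        correct = sum(
            1 for t, c in zip(true_labels, cluster_labels) if mapping[c] == t
        )
        best = max(best, correct)
    return best / len(true_labels)


# Hypothetical labels: cluster ids are arbitrary, so 1 maps to class 0.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 0, 0, 0, 0]
ar = accuracy_rate(true_labels, cluster_labels)
print(ar)
```

In this toy case five of the six samples are correctly clustered under the best matching, so AR = 5/6.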

Table 4

The Best RIs Using Different Initial Cluster Centers with Fixed Initial Feature Weights

Datasets	WFCM	EWKM	WKM	FCM	FRFCM	FRFCM-FLW
Iris	0.950	0.950	0.950	0.880	0.966	0.979
Thyroid	0.924	0.589	0.792	0.719	0.857	0.946
Bupa	0.506	0.518	0.514	0.500	0.531	0.596
Seeds	0.873	0.872	0.891	0.874	0.900	0.923
Breast cancer	0.884	0.918	0.915	0.879	0.919	0.936
Pima Indians	0.548	1.000	1.000	0.550	1.000	1.000
Soybean	0.906	1.000	0.859	0.843	0.843	0.986
USPS	0.702	0.842	0.893	0.702	0.868	0.906
Colon cancer	0.501	0.518	0.511	0.497	0.518	0.587
Ovarian cancer	0.614	0.691	0.633	0.589	0.674	0.711
Basehock	0.503	0.535	0.505	0.502	0.536	0.590

Table 5

The Best JIs Using Different Initial Cluster Centers with Fixed Initial Feature Weights

Datasets	WFCM	EWKM	WKM	FCM	FRFCM	FRFCM-FLW
Iris	0.818	0.858	0.892	0.654	0.901	0.915
Thyroid	0.868	0.446	0.675	0.550	0.765	0.893
Bupa	0.362	0.480	0.447	0.429	0.343	0.566
Seeds	0.678	0.677	0.716	0.682	0.737	0.798
Breast cancer	0.812	0.861	0.857	0.805	0.848	0.891
Pima Indians	0.455	1.000	1.000	0.442	1.000	1.000
Soybean	0.676	1.000	0.627	0.615	0.910	1.000
USPS	0.218	0.272	0.350	0.224	0.315	0.403
Colon cancer	0.349	0.493	0.358	0.344	0.493	0.511
Ovarian cancer	0.457	0.544	0.470	0.431	0.518	0.588
Basehock	0.361	0.500	0.499	0.359	0.450	0.534

Rand [15] proposed an objective criterion for the evaluation of clustering techniques, known as the Rand index (RI). RI is now commonly used for measuring the similarity between two clustering partitions. The RI is defined by $RI = (x + y) / \binom{n}{2}$, where x is the number of pairs of elements that belong to the same cluster in both clustering results and y is the number of pairs of elements that are in different clusters in both clustering results. The larger the RI, the better the clustering performance.

Another measure of clustering quality is the Jaccard index (JI), proposed by Jaccard [16]. The JI is defined as JI = x/(x + y + z), where z refers to the number of pairs of elements that are negative for the different clusters and positive for the same clusters. A higher JI indicates better clustering performance.
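Both indices can be computed by counting pairs of elements. The Python sketch below uses the standard pair-counting forms: the RI counts agreeing pairs over all $\binom{n}{2}$ pairs, and the JI divides the pairs grouped together in both results by the pairs grouped together in at least one result. Using the standard Jaccard denominator is an assumption about the intended definition, because the symbols in the text are not fully specified. The function names are illustrative.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count pairs by agreement: same/same, same/diff, diff/same, diff/diff."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(labels_a, labels_b):
    """RI = (same/same + diff/diff) / C(n, 2)."""
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_index(labels_a, labels_b):
    """Standard pair-counting Jaccard: same/same over pairs co-clustered at least once."""
    ss, sd, ds, _ = pair_counts(labels_a, labels_b)
    denom = ss + sd + ds
    if denom == 0:
        raise ValueError("JI is undefined when no pair shares a cluster in either result")
    return ss / denom
```

For the partitions `[0, 0, 1, 1]` and `[0, 0, 1, 2]`, one pair is grouped together in both results, one pair is grouped together in only the first, and four pairs are separated in both, giving RI = 5/6 and JI = 1/2.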

5 Conclusion

This paper analyzed the advancement of data feature reduction concepts with the Fuzzy C-means clustering technique. The proposed algorithm works in three stages, namely data formation, Enhanced Feature-Reduction FCM with Feature Linkage Weight (FRFCM-FLW), and FCM clustering. The original synthetic and raw datasets are preprocessed using a data cleaning process. Feature selection is carried out using the FRFCM-FLW method, which removes the features with low feature weight. Different feature selection algorithms are discussed in the related work. The proposed FRFCM-FLW clustering algorithm reduces the feature size automatically and produces high-quality clustering results. The FRFCM-FLW algorithm calculates a novel weight for every feature by combining the Mahalanobis distance with the feature δ_m distance measure in the FRFCM algorithm. The feature weights play a significant role in forming the feature dimension of the clusters. These novel weights are used to update the fuzzy memberships and cluster centers for the dataset throughout the iterations. The results show that the FRFCM-FLW method obtains higher AR, RI and JI values than other feature reduction algorithms such as WFCM, EWKM, WKM, FCM and FRFCM.

References

1. Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco (2001).

2. Wang, Wang and Wan, Improving fuzzy c-means clustering based on feature-weight learning, Pattern Recognition Letters 25 (2004).

3. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J Cybern 3(3) (1973), 32–57.

4. Huang J.Z., Ng M.K., Rong and Li, Automated variable weighting in k-means type clustering, IEEE Trans Pattern Anal Mach Intell 27(5) (2005), 657–668.

5. Jing, Ng M.K. and Huang J.Z., An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data, IEEE Trans Knowl Data Eng 19(8) (2007), 1026–1041.

6. Wang, Wang and Wang, Improving fuzzy c-means clustering based on feature weight learning, Pattern Recognit Lett 25(10) (2004), 1123–1132.

7. Coletta L.F.S., Vendramin, Hruschka E.R., Campello R.J.G.B. and Pedrycz, Collaborative fuzzy clustering algorithms: Some refinements and design guidelines, IEEE Trans Fuzzy Syst 20(3) (2012), 444–462.

8. Gong, Su, Jia and Chen, Fuzzy clustering with a modified MRF energy function for change detection in synthetic aperture radar images, IEEE Trans Fuzzy Syst 22(1) (2014), 98–109.

9. Chang S.T., Lu K.P. and Yang M.S., Fuzzy change-point algorithms for regression models, IEEE Trans Fuzzy Syst 23(6) (2015), 2343–2357.

10. Yang M.-S. and Tian Y.-C., Bias-correction fuzzy clustering algorithms, Information Sciences 309 (2015), Elsevier, 138–162.

11. Teng, et al., Cluster ensemble framework based on the group method of data handling, Appl Soft Comput J (2016).

12. Yeh C.C. and Yang M.S., Evaluation measures for cluster ensembles based on a fuzzy generalized Rand index, Appl Soft Comput (2017).

13. Yang M.-S. and Nataliani, A Feature-Reduction Fuzzy Clustering Algorithm Based on Feature-Weighted Entropy, IEEE Transactions on Fuzzy Systems 26(2) (2018).

14. Yeh C.C. and Yang M.S., A generalization of Rand and Jaccard indices with its fuzzy extension, Int J Fuzzy Syst 18 (2016), 1008–1018.

15. Rand W.M., Objective criteria for the evaluation of clustering methods, J Amer Statist Assoc 66 (1971), 846–850.

16. Jaccard, Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines, Bull de la Soc Vaudoise des Sci Naturelles 37 (1901), 241–272.

17. Lichman, UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science (2013). [Online]. Available: https://archive.ics.uci.edu/ml/datasets.html.

18. Alon, et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc Nat Acad Sci USA 96 (1999), 6745–6750.

19. Shen, Chen and Garibaldi J.M., A Novel Weighted Combination Method for Feature Selection using Fuzzy Sets, IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (2019).

20. Lin, Li, Wang and Chen, Attribute reduction for multi-label learning with fuzzy rough set, Knowl Based Syst 152 (2018), 51–61.

21. Kashef and Nezamabadi-pour, A label-specific multi-label feature selection algorithm based on the Pareto dominance concept, Pattern Recognit 88 (2019), 654–667.

22. Sun, Zhang X.-Y., Qian Y.-H., Xu J.-C., Zhang S.-G. and Tian, Joint neighborhood entropy-based gene selection method with Fisher score for tumor classification, Appl Intell 49(4) (2019), 1245–1259.

23. Sun, Wang, Ding, Qian and Xu, Neighborhood multi-granulation rough sets-based attribute reduction using Lebesgue and entropy measures in incomplete neighborhood decision systems, Knowl Based Syst 192 (2020), Art. no. 105373.

Table 1

Real Datasets

Dataset	Number of Instances	Classes	Number of Features
Iris	150	3	4
Thyroid	215	2	5
Bupa	345	2	6
Seeds	210	3	7
Breast cancer	699	2	8
Pima Indians	788	2	8
Soybean	47	4	21
USPS	4000	10	256
Ovarian cancer	216	2	4000
Basehock	1993	2	4862
Colon cancer	62	2	2000