Abstract
The feature reduction fuzzy c-means (FRFCM) algorithm has been proven to be effective for clustering data with redundant/unimportant feature(s). However, the FRFCM algorithm still has the following disadvantages. 1) The FRFCM uses the mean-to-variance-ratio (MVR) index to measure the feature importance of a dataset, but this index is affected by data normalization, i.e., a large MVR value of original feature(s) may become small if the data are normalized, and vice versa. Moreover, the MVR value(s) of the important feature(s) of a dataset may not necessarily be large. 2) The feature weights obtained by the FRFCM are sensitive to the initial cluster centers and initial feature weights. 3) The FRFCM algorithm may be unable to assign the proper weights to the features of a dataset. Thus, in the feature reduction learning process, important features may be discarded, but unimportant features may be retained. These disadvantages can cause the FRFCM algorithm to discard important feature components. In addition, the threshold for the selection of the important feature(s) of the FRFCM may not be easy to determine. To mitigate the disadvantages of the FRFCM algorithm, we first devise a new index, named the marginal kurtosis measure (MKM), to measure the importance of each feature in a dataset. Then, a novel and robust feature reduction fuzzy c-means clustering algorithm called the FRFCM-MKM, which incorporates the marginal kurtosis measure into the FRFCM, is proposed. Furthermore, an accurate threshold is introduced to select important feature(s) and discard unimportant feature(s). Experiments on synthetic and real-world datasets demonstrate that the FRFCM-MKM is effective and efficient.
Introduction
AS a data-driven technique for data analysis, clustering has been widely used in research fields such as statistics, pattern recognition and machine learning. Given a dataset
To overcome these shortcomings, feature weighting and feature selection techniques are used in the traditional k-means and FCM. Feature selection techniques assume that each of the selected features has the same degree of importance. The feature weighting method is an extension of the former and makes feature weights take values in the interval of [0,1]; the more important a feature is, the larger the weight should be.
Feature weighting is an important technique, which can be found in many literatures [8]. In clustering analysis, some variants of the k-means and FCM have been presented, such as the weighted k-means (WKM) [9]; the entropy-weighted k-means (EWKM) [10]; the Minkowski weighted k-means (MWKM) [11]; the Minkowski metric fuzzy weighted c-means (MWFCM) [12]; the weighted FCM using feature-weight learning (WFCM) [13]; the two version of feature-weighted FCM, which is called simultaneous clustering and attribute discrimination (SCAD1 & SCAD2) [14]; and the enhanced soft subspace clustering (ESSC) [15] algorithm, which can be referred to as an extension of conventional feature weighting clustering. By using a feature weighting technique in the k-means or FCM, the performance may be improved. However, these algorithms do not have a mechanism to remove unimportant features in the clustering process, except for the FRFCM [16] that uses feature reduction techniques. As a pioneer work of feature reduction methods, the FRFCM algorithm uses the mean-to-variance-ratio (MVR) to measure the feature importance and integrate it into the objective function to control the within cluster dispersion term and the entropy term. The feature reduction technique is also introduced to the multi-view k-means clustering [17] algorithm. Subsequently, this feature reduction learning method is extended to feature-weighted possibilistic c-means clustering [18].
The main contributions of this paper are as follows: (1) an MKM index is proposed to measure the feature importance, (2) a new and robust feature reduction fuzzy clustering algorithm is proposed to cluster high dimension data, and (3) a threshold with a theoretical basis is designed to determine which features need to be retained and which features need to be deleted in the clustering process.
The rest of the paper is organized as follows. Section 2 first introduces the feature-weighted k-means and feature-weighted FCM clustering algorithms used in the present work, and then introduces feature reduction fuzzy c-means clustering in detail. Section 3 introduces the techniques of the proposed algorithm, its convergence proof, and complexity analysis. Section 4 presents the results of the experiment, and, finally, Section 5 presents the conclusion and ideas for future work.
Related works
In this section, we provide a review of the literature on feature weighted clustering methods including the FRFCM. The commonalities and characteristics of each algorithm are discussed.
A. Feature Weighting using K-Means
Huang et al. [9] proposed the WKM algorithm which has the following objective function:
In its implementation, it adds an extra step to the basic k-means to determine the feature weights of each iteration in the k-means clustering process. The weight of each feature is inversely proportional to the sum of the variances within the feature cluster. Therefore, unimportant features can be identified, and the influence of unimportant features on the clustering result is significantly reduced, which can improve the clustering accuracy and performance. However, the WKM requires the user to subjectively specify an additional parameter, i.e., the index of the feature weights. Therefore, it is difficult for users to determine an appropriate value for this parameter to obtain a desired quality clustering result. In addition, the feature weights generated using this method may not represent the feature importance well [19, 20].
Jing et al. [10] proposed the entropy weighting K-means method EWKM, which is developed as a subspace clustering technique, and particularly useful for high-dimensional sparse data since it uses using feature weighting methods. The EWKM minimizes the differences within the cluster and maximizes the negative entropy. The reason behind this is that it uses more dimensions to identify clusters of high-dimensional sparse data, avoiding problems related to identifying such clusters using only a few dimensions. The EWKM has the following objective function:
In its implementation, it extends the standard k-means algorithm with an additional step to calculate the feature weight of each cluster in each iteration of the clustering process. The weight is inversely proportional to the sum of the intra-cluster variances of the variables in the cluster. Because the EWKM requires many calculations, it is very time-consuming.
B. Feature Weighting in Fuzzy clustering
WFCM was proposed by Wang et al. [13] to improve the performance of the FCM. The WFCM attempts to minimize the following objective function:
SCAD2 was proposed by Frigui and Nasraoui [14] by using a central weighing scheme. The objective function of SCAD2 is as follows.
In SCAD2, the feature weights produced by it do not properly represent the importance of the features in the clusters in some situations [22].
By using the between-cluster information of the dataset, Deng et al. [15] proposed ESSC algorithm. The ESSC uses the following objective function:
where v0j denotes the jth feature of the center of the whole dataset and
C. Fuzzy C lustering With Feature Reduction Schema
As a pioneer work on feature reduction learning algorithms, Yang et al [16] developed a fuzzy clustering algorithm named the FRFCM, which uses the feature-weighted entropy with a feature reduction scheme to cluster data. Their proposed algorithm uses the mean-to-variance-ratio (MVR) index to measure the feature importance of a dataset, and then the MVR index is integrated into the objective function to control the within cluster weighted feature dispersion term and entropy term. By calculating the feature weight using an updating equation, the FRFCM sets a threshold to discard feature(s) with small weight(s) less than it. After reducing the unimportant features from data, the FRFCM can improve the accuracy of the clustering algorithm and speed up the clustering process. The objective function of the FRFCM algorithm is given as follows:
subject to
The iteration formula of w
j
is as follows:
In Equation (7), since δ
j
is irrelevant to subscripts i and k, it can be modified as follows:
Let
Since δ j is used to measure the feature importance, the FRFCM should produce a large weight for a feature with a large δ j , and the algorithm should retain the feature(s) with large weight(s) in the feature reduction clustering process. However, as can be seen from Equation 9, the updated equation of w j is a function of δ j and S j and can produce a large weight for the jth feature if both δ j and S j are small; it cannot produce a large weight for a large δ j unless S j is small enough. As a consequence, important feature(s) may not be identified by the FRFCM, which make the algorithm incorrectly discard important features from the dataset in the first iteration of the FRFCM. This is one of the shortcomings of the FRFCM algorithm. In addition, the FRFCM still has two other disadvantages. 1) The MVR values for the important feature(s) of some datasets are not necessarily large. For example, the MVR index of the iris dataset is δ = [8 . 522, 16 . 244, 1 . 207, 2 . 058], indicating that the 1st feature and 2nd feature are two important features, yet most feature weighting clustering algorithms consider the 3rd feature and 4th feature as two important features [16]. Moreover, the MVR index changes when some data are normalized. A large MVR value may become small and also a small MVR value may become large if the data are normalized. 2) The feature weights obtained by the FRFCM are very sensitive to the initialization. These disadvantages can degrade the performance of the FRFCM.
Next, we propose a novel feature reduction fuzzy c-means clustering algorithm that leverages the marginal kurtosis measure to mitigate the above disadvantages of the FRFCM algorithm.
A. Problem Definition
In data mining and data analysis tasks, if a feature or variable possesses only one mode (Fig. 1 (a)), it is a less important feature or variable. However, if a feature or value has two different modes (

Features of two modal types. (a) Unimodal distribution; (b) bimodal distribution.
In Equation (10), δ
j
is the mean-to-standard-variance-ratio of η, and it is actually a function of the kurtosis of
Next, we use an example to illustrate the advantage of the MKM over the MVR to measure the feature importance.
To present an intuitive understanding of the property of feature importance, we further investigate the distribution of the features of dataset D1 shown in Fig. 4. We can clearly see that the 2nd feature and 3rd feature are compact with a cluster-like structure; therefore, they should be more important than the other two features, which coincides with δ MKM = [1 . 072, 1 . 360, 1 . 341, 1 . 143] indicating the 2nd feature and 3rd feature of dataset D1 are two importance features. If the MVR is used to measure the feature importance, the MVR of D1 is δ MVR = [1 . 497, 0 . 981, 1 . 089, 3 . 085], indicating that the 1st feature and 4th feature are important features, but in fact, the 2nd feature and 3rd feature are actually two important features of dataset D1. What mislead the algorithm is that if D1 is normalized, the MVR of D1 becomes δ MVR = [5 . 944, 8 . 940, 8 . 908, 6 . 161], and then the 2nd and 3rd features become important features. Similarly, the 3rd feature and 4th feature of the iris dataset are two important features because these features have a cluster-like structure, as shown in Fig. 5, and the MKM of iris is δ MKM = [0 . 834, 0 . 666, 1 . 282, 1 . 222]; however, the MVR of iris is δ MVR = [8 . 103, 13 . 455, 5 . 228, 4 . 527]. Thus, in this paper, we use the MKM index to measure the feature importance instead of the MVR index.
B. The F ormulation of Feature Reduction Fuzzy C-Means Clustering L everaging the M arginal K urtosis M easure
In this section, inspired by the work of Yang et al. [16], we propose a novel and robust feature reduction fuzzy clustering algorithm leveraging the marginal kurtosis measure, named the FRFCM-MKM, to mitigate the disadvantages of the FRFCM algorithm. In this method, the MKM index is used to measure the feature importance of a dataset, and each feature has an individual weight updating equation in each iteration. We develop a new threshold to determine which feature(s) is/are to be discarded. In this process, the features with small weights less than the threshold will be removed from datasets. Let
To obtain the optimal clustering results, the FRFCM-MKM is minimized subject to the following constraints:
According to Equation (14), the iteration formula of u ij can be obtained as follows:
From Equation (15), we obtain
From Equation (16), it can be obtained that
I- denotes the set of indexes of the zero-weight (redundant) features, while I+ denotes the set of indexes of the positive weight (important) features. |I-| and |I+| are the cardinality of I- and I+, respectively. The elements of I- and I+ are computed by the same method as in [23], which is outlined as follows.
Here we give the detailed derivation of the update equations of w j from the Lagrangian in Equation (13) and the Karush–Kuhn–Tucker conditions in Equations (16)–(18). Combining Equations (13) and (16) results in
According to
By plugging Equation (24) into Equation (23), the closed-form solution for the feature weights is obtained as
Now we consider Equation (17) for two respective cases.
1) γ j = 0, ∀ j ∈ {1, . . . , d}, Equation. (25) becomes
For each feature j, this set of solutions is valid only if
Otherwise, we consider the second case.
2) If γ
j
> 0 for at least one j, according to Equation (18) when γ
j
> 0, w
j
= 0; therefore,
From the above equation, we obtain that
Constrained by Equation (12), obviously, it is known that not all w
j
can be 0. Therefore, the set { j|j = 1, . . . , d} is separated into two subsets denoted as I- and I+, where
Therefore, Equation (27) becomes
From the above equation, it can be obtained that
By replacing the γ j in Equation (25) with Equation (29), after some simplification, we obtain the final form as in Equation (21).
How to choose the threshold is an important step for the proposed clustering algorithm. To develop a feature-reduction scheme using an FRFCM algorithm, Yang et al. [16] used
Threshold for different sized datasets drawn from same distribution
Now, we try to find a proper threshold to remove the feature(s) with small weight(s) instead of using
C. FRFCM-MKM Algorithm
Next, we provide the detailed proof of convergence theorems for the FRFC-Means algorithm.
D. Convergence Theorems for FRFCM-MKM Algorithm
Referring to [24–26], Theorems 1 and 2 can be easily proved. Because the main difference between FCM and FRFCM-MKM exists in the involved feature weights, here we only give a detailed proof of theorem 3 below.
Then, follow the steps from Equation (20) to Equation (26), we have
The necessary condition is proved.
Secondly, we prove the sufficient condition for minimum of J (
Thus, the bordered Hessian matrix with regard to
Since we know τ > 0, and n · c > 0, the
And in the same way, the case of feature reduction can be proved. Then theorem 3 can be verified. □
In this section, to evaluate the performance of the proposed FRFCM-MKM algorithm, the WKM, EWKM, WFCM, SCAD2, ESSC, and the FRFCM are chosen for analysis. A series of experiments are performed with various datasets. In most cases, running the clustering algorithm on raw data does not work well. Therefore, we normalized each feature
We set the threshold value ɛ and the maximum number of restarts to ɛ = 105 and t = 300, respectively, in all the algorithms. The parameter of the fuzziness m in all fuzzy-type clustering algorithms and our algorithm is set to 2. The experiments are performed on a PC with an Intel Core i5-4590 with 4 GB of memory. All code is written in the MATLAB computing environment.
A. Performance M etrics
Three performance metrics including the accuracy (AC) [27], the normalized mutual information (NMI) [28], and the adjusted Rand index (ARI) [29] are used to evaluate the performance of clustering algorithms. They are averaged over thirty different runs with random initialization of the feature weights and with random initializations of the centroids using the initialization strategy in reference [30].
where n is the total number of data points, c is the number of clusters, n
ij
= |S
i
∩ S
j
|,
B. Experiment
In this experiment, we generate the same dataset denoted by D2 as that used in example 1 of reference [16]. This dataset has 400 data points with components x2 and x3 being generated from the Gaussian mixture model
To compare the feature weight representation abilities of the WKM, FRFCM and FRFCM-MKM, we apply the WKM, FRFCM and FRFCM-MKM algorithms to dataset D2 thirty times to examine whether the algorithms assign proper weights to features. Since the prototype-based clustering algorithms are sensitive to the initialization, to reduce the sensitivity of the initialization for clustering algorithms, we first choose five sets of centers
Weight assignments of the WKM, FRFCM and FRFCM-MKM for unnormalized D2 with fixed cluster centers
Weight assignments of the WKM, FRFCM and FRFCM-MKM for normalized D2 with fixed initial weights
We next run the WKM, FRFCM and FRFCM-MKM for the unnormalized and normalized D2 datasets with equal initial feature weights
Weight assignments of the WKM, FRFCM and FRFCM-MKM for unnormalized D2 when the initial feature weights are fixed as
Weight assignments of the WKM, FRFCM and FRFCM-MKM for normalized D2 when the initial feature weights are fixed as
The reason that the FRFCM-MKM represents the features well is that the second term in the objective function is used to control the value of w j s; and the larger the value of τ is, the closer the value of w j is to that of δ j , and the more robust the algorithm. Fig. 6 shows the weight assignments of the FRFCM-MKM under different parameters. As τ increases, the feature weights remain the same for each run, which indicates the robustness of the proposed method.
Now we compare the clustering results achieved by the WKM, EWKM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM algorithms. Table 8 lists the clustering results of the WKM, EWKM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM algorithms for unnormalized and normalized D2. The WKM and EWKM produce unsuitable feature weights, and the clustering results are worse. The SCAD2 and ESSC produce suitable feature weights on normalized D2 (Table 7), and the clustering result are desirable. The FRFCM does not produce suitable feature weights for dataset D2, and the important feature(s) is/are discarded from the data at the first iteration, which leads to an incorrect clustering result. When D2 is normalized, the FRFCM assigns comparative larger weights to the four features (Table 3). Meanwhile, the threshold calculated by

(a) Two component data points generated from a Gaussian mixture model; (b)The Gaussian mixture model data points by adding component x3.

(a) δ MVR value of D1 (unnormalized); (b) δ MVR value for D1 (normalized),

Distribution of the different features of dataset D1. (a) 1st feature. (b) 2nd feature. (c) 3rd feature. (d) 4th feature.

Distribution of the different features of the iris dataset. (a) 1st feature. (b) 2nd feature. (c) 3rd feature. (d) 4th feature.
Weight assignments of the WKM, FRFCM and FRFCM-MKM for unnormalized D2 with the centers and weights randomly initialized
The final weight assignments of the SCAD2 and ESSC on normalized D2
Comparing the clustering results achieved by seven algorithms on dataset D2
In this experiment, for the iris dataset, we add h (h = 4, 8, 12) noisy features to 4 real ones. The noisy features are Gaussian noise. We let the noisy features follow a Gaussian distribution, i.e.,

The weight assignments of the FRFCM-MKM under different parameters.

(a)∼(c)Weight assignments of the WKM with 4, 8, and 12 noisy features added to iris, respectively. (d)∼(f) Weight assignments of the FRFCM with 4, 8, and 12 noisy features added to iris, respectively. (g)∼(i) Weight assignments of the FRFCM-MKM with 4, 8, and 12 noisy features added to iris, respectively.

(a)∼(c) Weight assignments of the EWKM with 4, 8, and 12 noisy features added to iris, respectively. (d)∼(f) Weight assignments of SCAD2 with 4, 8, and 12 noisy features added to iris, respectively. (g)∼(i) Weight assignments of ESSC with 4, 8, and 12 noisy features added to iris, respectively.
Comparing the clustering results achieved by the seven algorithms on the Iris data and noisy Iris data
Now we perform clustering on six high-dimensional synthetic datasets to demonstrate the effectiveness of the proposed algorithm. Each dataset contains 1024 samples and 16 clusters, and the datasets used in this experiment can be downloaded from the webpage of the Speech and Image Processing Unit [32]. The detailed information on these six high dimensional synthetic datasets is summarized in Table 10.
Brief information for the high dimensional synthetic datasets
Table 11 shows the clustering results for the seven different methods on the datasets ranging from 64 to 1,024 samples. The clustering algorithms are applied to the six synthetic datasets Dim64 to Dim1024 that represent higher dimensionality varying datasets. Compared with the other approaches, the FRFCM-MKM algorithm achieves excellent clustering results as represented by the AC, NMI and ARI on Dim256∼Dim1024. As shown in Table 12, Moreover, the run-time of the FRFCM-MKM algorithm on these six datasets are lowest compared with those of the other algorithms. Its run-time ranges from 0.004 to 0.006 seconds on Dim256∼Dim1024. The average time of WKM, EWKM, FCM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM on dim032 to dim1024 are 0.159, 0.421, 0.068, 4.136, 0.832, 0.842, 0.058, 0.005 second respectively. The FRFCM-MKM is 10 times faster than the FRFCM in average. The SCAD2, ESSC and FRFCM methods also achieve desirable results; however, the run-times of these algorithms are relatively high. Although the FRFCM uses a feature reduction scheme, when these data are normalized, the FRFCM fails to discard the unimportant features from Dim032∼Dim256. The FRFCM can remove 40 unimportant features and retain 472 features for dim512. Meanwhile, the FRFCM removes 523 unimportant features from Dim1024 and retains 501 features. However, the FRFCM-MKM retains 5, 5, 4, 6, 5, and 10 features for Dim032∼Dim1024, respectively, and it only utilizes a small number of features in the clustering process (Table 12).
Clustering performance on dim032-dim1024 using the WKM, EWKM, FCM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM
Numbers of original and final features obtained from the FRFCM and the FRFCM-MKM and the total run-times (in seconds) for the synthetic datasets for each algorithm
In this experiment, we investigate the performances of the proposed algorithm on some real world datasets, and their properties are summarized in Table 13. We compare the AC, NMI and ARI of the FRFCM-MKM with those of the WKM, EWKM, FCM, WFCM, SCAD2, and FRFCM-MKM. The comparisons use different initial cluster centers with different initial feature weights.
Summary of some real world Datasets
Table 14 shows the clustering results of seven different clustering algorithms on thirteen real datasets. Compared with the other approaches, the FRFCM-MKM achieves excellent clustering results on nine out of thirteen datasets according to the AC, NMI and ARI. The improvement of the FRFCM-MKM compared to the FRFCM is significant both in the performance metrics and feature reduction. On the easily-clustered datasets such as iris, wdbc_all, breast_cancer, FRFCM-MKM is superior to the other six algorithms. The differences between FRFCM-MKM and FRFCM on the iris dataset are 6.67%, 11.62%, and 21.51% for AC, NMI and ARI metric. On the wdbc_all dataset the differences are 0.76%, 3.94%, and 3.45%, while on the breast_cancer dataset they are 1.16%, 14.15%, 2%. The final features by running FRFCM-MKM on these datasets are 2, 10, and 5, respectively, compared with 4, 30 and 9 by FRFCM. On other hard-clustered datasets such as glass, movement_libras, and SMK_CAN_1987 FRFCM-MKM is still superior to the other six algorithms. The differences between FRFCM-MKM and FRFCM on glass are 7.55%, 7.64% and 18.29% for AC, NMI and ARI metric. On the movement_libras on the SMK_CAN_1987 dataset they are 1.49%, 12.90% and 5.26%. As shown in Table 15, the FRFCM-MKM retains 4, 26, and 21 features in the final step on these datasets respectively, while FRFCM retains7, 90 and 19993 features. On the wpbc dataset, the proposed algorithm retains 11 important features in the cluster process with AC = 0.576, while the FRFCM retains 31 features with AC = 0.581. On the ORL dataset, our method uses 215 important features of the data to cluster with AC = 0.503. Although the EWKM algorithm performs best on these data with AC = 0.600, it uses all 2567 features for clustering; therefore, the EWKM takes more time to cluster data. The clustering performance of the proposed algorithm is not the best on these data sets, but its performances are almost equivalent to those of the comparison algorithm while using a small number of features. Table 16 shows the average performance on all datasets. The performance of proposed algorithm is obviously better than the other six methods in terms of AC, NMI and ARI. According to the experimental results on the datasets, we can infer the following conclusions:
Clustering performances on real datasets using the WKM, EWKM, SCAD2, FCM, FRFCM, and FRFCM-MKM
Numbers of original and final features obtained from the FRFCM and FRFCM–MKM and the total run-time (in seconds) for the real datasets for each algorithm
The average value of the results of the FRFCM-MKM and other six methods on thirteen real datasets
The proposed algorithm is robust to initialization, while other algorithms are lack of robustness to initialization. The proposed method represents the feature weights well, while the other methods do not produce proper weights for features. Compared with FRFCM, FRFCM-MKM achieves better result in overall. The MKM index proposed in this paper is better than the MVR index in measuring feature importance. The threshold based on Pythagorean means of using the information of feature weights is more reasonable than the threshold used in FRCM.
Next, we analyze the computational complexity of the FRFCM-MKM. The computational complexity of the FRFCM-MKM depends on three updating stages: (1) update the membership matrix
This study focuses on the feature reduction learning in fuzzy clustering. We develop the MKM index to measure the feature importance. Based on it, a novel objective function is proposed to simultaneously minimize the within cluster dispersion and the discrepancy between the feature importance and feature weights. Then, the feature reducing scheme is used in the clustering process. From the results of clustering both synthetic datasets and real datasets. We observe that the proposed algorithm not only achieves superior performance in terms of the accuracy, adjusted Rand index, and normalized mutual information but also shows great stability regarding the initialization.
This paper proposed an index named MKM to for measuring feature importance. Despite its advantages in measuring feature importance, this index only considers the marginal kurtosis of features. It does not take the feature dispersion into account. This is the shortcoming of our paper. The MKM index works on most datasets in our experiments, but it does not work on some datasets in our experiment, such as the ionosphere and ORL datasets. For future research, one proposed research direction is to design a more much better index to for measuring feature importance by taking MKM and feature dispersion into account together, based on which new feature reduction learning algorithms based will be developed. Another possible research direction is to extend application of our method to multi-view clustering and possibilistic clustering.
