Feature reduction fuzzy C-Means algorithm leveraging the marginal kurtosis measure

Abstract

The feature reduction fuzzy c-means (FRFCM) algorithm has been proven to be effective for clustering data with redundant/unimportant feature(s). However, the FRFCM algorithm still has the following disadvantages. 1) The FRFCM uses the mean-to-variance-ratio (MVR) index to measure the feature importance of a dataset, but this index is affected by data normalization, i.e., a large MVR value of original feature(s) may become small if the data are normalized, and vice versa. Moreover, the MVR value(s) of the important feature(s) of a dataset may not necessarily be large. 2) The feature weights obtained by the FRFCM are sensitive to the initial cluster centers and initial feature weights. 3) The FRFCM algorithm may be unable to assign the proper weights to the features of a dataset. Thus, in the feature reduction learning process, important features may be discarded, but unimportant features may be retained. These disadvantages can cause the FRFCM algorithm to discard important feature components. In addition, the threshold for the selection of the important feature(s) of the FRFCM may not be easy to determine. To mitigate the disadvantages of the FRFCM algorithm, we first devise a new index, named the marginal kurtosis measure (MKM), to measure the importance of each feature in a dataset. Then, a novel and robust feature reduction fuzzy c-means clustering algorithm called the FRFCM-MKM, which incorporates the marginal kurtosis measure into the FRFCM, is proposed. Furthermore, an accurate threshold is introduced to select important feature(s) and discard unimportant feature(s). Experiments on synthetic and real-world datasets demonstrate that the FRFCM-MKM is effective and efficient.

Keywords

Fuzzy c-means; feature reduction learning; marginal kurtosis measure; mean-to-variance ratio

1 Introduction

As a data-driven technique for data analysis, clustering has been widely used in research fields such as statistics, pattern recognition and machine learning. Given a dataset X consisting of data points x_i, where i = 1, 2, . . . , n, clustering algorithms aim to group X into k clusters S = {S₁, S₂, . . . , S_k} such that the data points within the same cluster have the highest similarity and the data points in different clusters have the highest dissimilarity. Perhaps the two best-known partition-based clustering algorithms are the k-means [1] and the fuzzy c-means (FCM) [2, 3]; the latter is a generalization of the former. The k-means is known for its simplicity, and the FCM remains popular because it can represent data points that belong to more than one cluster. Both have been widely studied and applied in many areas [4–6]. However, the k-means and FCM do not take feature importance into account: they cluster data using equal feature weights without considering the structure of the data. That is, important and less important features make the same contribution to the clustering. As a result, unimportant or redundant features may cause the k-means and FCM to produce incorrect clustering results and significantly degrade the clustering performance in some applications [7].

To overcome these shortcomings, feature weighting and feature selection techniques are used in the traditional k-means and FCM. Feature selection techniques assume that each of the selected features has the same degree of importance. The feature weighting method is an extension of the former and makes feature weights take values in the interval of [0,1]; the more important a feature is, the larger the weight should be.

Feature weighting is an important technique that is widely covered in the literature [8]. In clustering analysis, several variants of the k-means and FCM have been presented, such as the weighted k-means (WKM) [9]; the entropy-weighted k-means (EWKM) [10]; the Minkowski weighted k-means (MWKM) [11]; the Minkowski metric fuzzy weighted c-means (MWFCM) [12]; the weighted FCM using feature-weight learning (WFCM) [13]; the two versions of feature-weighted FCM called simultaneous clustering and attribute discrimination (SCAD1 & SCAD2) [14]; and the enhanced soft subspace clustering (ESSC) [15] algorithm, which can be regarded as an extension of conventional feature weighting clustering. Using a feature weighting technique in the k-means or FCM may improve the performance. However, these algorithms do not have a mechanism to remove unimportant features during the clustering process, except for the FRFCM [16], which uses a feature reduction technique. As pioneering work on feature reduction methods, the FRFCM algorithm uses the mean-to-variance-ratio (MVR) to measure the feature importance and integrates it into the objective function to control the within-cluster dispersion term and the entropy term. The feature reduction technique has also been introduced to the multi-view k-means clustering [17] algorithm. Subsequently, this feature reduction learning method was extended to feature-weighted possibilistic c-means clustering [18].

The main contributions of this paper are as follows: (1) an MKM index is proposed to measure the feature importance, (2) a new and robust feature reduction fuzzy clustering algorithm is proposed to cluster high-dimensional data, and (3) a threshold with a theoretical basis is designed to determine which features are retained and which are deleted in the clustering process.

The rest of the paper is organized as follows. Section 2 first introduces the feature-weighted k-means and feature-weighted FCM clustering algorithms used in the present work, and then introduces feature reduction fuzzy c-means clustering in detail. Section 3 introduces the techniques of the proposed algorithm, its convergence proof, and complexity analysis. Section 4 presents the results of the experiment, and, finally, Section 5 presents the conclusion and ideas for future work.

2 Related works

In this section, we provide a review of the literature on feature weighted clustering methods including the FRFCM. The commonalities and characteristics of each algorithm are discussed.

A. Feature Weighting using K-Means

Huang et al. [9] proposed the WKM algorithm, which has the following objective function: $\begin{matrix} J (U, V, w) = \sum_{i = 1}^{n} \sum_{k = 1}^{c} μ_{ik} \sum_{j = 1}^{d} w_{j}^{β} {(x_{ij} - v_{kj})}^{2} \\ \text{s.t.} \; \sum_{j = 1}^{d} w_{j} = 1, \; \sum_{k = 1}^{c} μ_{ik} = 1, \; μ_{ik} \in \{0, 1\} . \end{matrix}$ (1)

In its implementation, it adds an extra step to the basic k-means to determine the feature weights of each iteration in the k-means clustering process. The weight of each feature is inversely proportional to the sum of the variances within the feature cluster. Therefore, unimportant features can be identified, and the influence of unimportant features on the clustering result is significantly reduced, which can improve the clustering accuracy and performance. However, the WKM requires the user to subjectively specify an additional parameter, i.e., the index of the feature weights. Therefore, it is difficult for users to determine an appropriate value for this parameter to obtain a desired quality clustering result. In addition, the feature weights generated using this method may not represent the feature importance well [19, 20].

Jing et al. [10] proposed the entropy weighting k-means method (EWKM), which was developed as a subspace clustering technique and is particularly useful for high-dimensional sparse data because it uses feature weighting. The EWKM minimizes the within-cluster dispersion and maximizes the negative weight entropy. The rationale is that the entropy term encourages more dimensions to contribute to the identification of clusters in high-dimensional sparse data, avoiding the problems that arise when such clusters are identified using only a few dimensions. The EWKM has the following objective function:

$\begin{matrix} J (U, V, w) = \sum_{k = 1}^{c} [\sum_{i = 1}^{n} \sum_{j = 1}^{d} μ_{ik} w_{kj} {(x_{ij} - v_{kj})}^{2} + γ \sum_{j = 1}^{d} w_{kj} \log w_{kj}] \\ \text{s.t.} \; \sum_{j = 1}^{d} w_{kj} = 1, \; w_{kj} ⩾ 0 . \end{matrix}$ (2)

In its implementation, it extends the standard k-means algorithm with an additional step to calculate the feature weight of each cluster in each iteration of the clustering process. The weight is inversely proportional to the sum of the intra-cluster variances of the variables in the cluster. Because the EWKM requires many calculations, it is very time-consuming.

B. Feature Weighting in Fuzzy Clustering

The WFCM was proposed by Wang et al. [13] to improve the performance of the FCM. The WFCM attempts to minimize the following objective function: $J (U, V, w) = \sum_{i = 1}^{n} \sum_{k = 1}^{c} μ_{ik}^{m} \sum_{j = 1}^{d} w_{j}^{2} {(x_{ij} - v_{kj})}^{2}$ (3) where w_j is a feature weight calculated by the algorithm proposed by Yeung and Wang [21].

SCAD2 was proposed by Frigui and Nasraoui [14] by using a central weighting scheme. The objective function of SCAD2 is as follows: $J (U, V, w) = \sum_{j = 1}^{k} \sum_{i = 1}^{n} \sum_{l = 1}^{p} μ_{ij}^{m} w_{jl}^{q} {(x_{il} - v_{jl})}^{2}$ (4)

In some situations, the feature weights produced by SCAD2 do not properly represent the importance of the features in the clusters [22].

By using the between-cluster information of the dataset, Deng et al. [15] proposed the ESSC algorithm. The ESSC uses the following objective function:

$\begin{matrix} J (U, V, W) = \sum_{k = 1}^{c} \sum_{i = 1}^{n} μ_{ik}^{m} \sum_{j = 1}^{d} w_{kj} {(x_{ij} - v_{kj})}^{2} \\ + γ \sum_{k = 1}^{c} \sum_{j = 1}^{d} w_{kj} ln w_{kj} - η \sum_{k = 1}^{c} \sum_{i = 1}^{n} μ_{ik}^{m} \sum_{j = 1}^{d} w_{kj} {(v_{kj} - v_{0 j})}^{2} \end{matrix}$ (5)

where v_0j denotes the jth feature of the center of the whole dataset and $\sum_{k = 1}^{c} \sum_{i = 1}^{n} μ_{ik}^{m} \sum_{j = 1}^{d} w_{kj} {(x_{ij} - v_{kj})}^{2}$ is the within-cluster dispersion. The ESSC algorithm minimizes the within-cluster dispersion, maximizes the negative entropy, and, through the term with the negative coefficient -η, simultaneously maximizes the weighted distance between the cluster centers and the data center. For some datasets, the obtained feature weights also do not represent the feature importance of the clusters well.

C. Fuzzy Clustering With Feature Reduction Schema

As pioneering work on feature reduction learning algorithms, Yang et al. [16] developed a fuzzy clustering algorithm named the FRFCM, which uses the feature-weighted entropy with a feature reduction scheme to cluster data. Their algorithm uses the mean-to-variance-ratio (MVR) index to measure the feature importance of a dataset, and the MVR index is integrated into the objective function to control the within-cluster weighted feature dispersion term and the entropy term. After calculating the feature weights with an updating equation, the FRFCM discards the feature(s) whose weight(s) fall below a threshold. By removing the unimportant features from the data, the FRFCM can improve the accuracy of the clustering and speed up the clustering process. The objective function of the FRFCM algorithm is given as follows:

$\begin{matrix} \min_{U, V, w} J (U, V, w) = & \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} \sum_{j = 1}^{d} δ_{j} w_{j} {(x_{ij} - v_{kj})}^{2} \\ & + \frac{n}{c} \sum_{j = 1}^{d} w_{j} \log (δ_{j} w_{j}) \end{matrix}$ (6)

subject to $\sum_{k = 1}^{c} u_{ik} = 1, \forall i = 1, \dots, n,$ and $\sum_{j = 1}^{d} w_{j} = 1,$ where $δ_{j} = {(\frac{mean (x)}{var (x)})}_{j}, j = 1, 2, \dots, d .$

The iteration formula of w_j is as follows: $w_{j} = \frac{\frac{1}{δ_{j}} exp (- \frac{c}{n} \sum_{k = 1}^{c} \sum_{i = 1}^{n} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2})}{\sum_{t = 1}^{d} \frac{1}{δ_{t}} exp (- \frac{c}{n} \sum_{k = 1}^{c} \sum_{i = 1}^{n} u_{ik}^{m} δ_{t} {(x_{it} - v_{kt})}^{2})}$ (7)

In Equation (7), since δ_j is irrelevant to subscripts i and k, it can be modified as follows: $w_{j} = \frac{\frac{1}{δ_{j}} exp (- \frac{c}{n} δ_{j} \sum_{k = 1}^{c} \sum_{i = 1}^{n} u_{ik}^{m} {(x_{ij} - v_{kj})}^{2})}{\sum_{t = 1}^{d} \frac{1}{δ_{t}} exp (- \frac{c}{n} \sum_{k = 1}^{c} \sum_{i = 1}^{n} u_{ik}^{m} δ_{t} {(x_{it} - v_{kt})}^{2})}$ (8)

Let $S_{j} = \sum_{k = 1}^{c} \sum_{i = 1}^{n} u_{ik}^{m} {(x_{ij} - v_{kj})}^{2}$ , where S_j is the sum of the within-cluster dispersion of the jth feature. Because the denominator is the same positive constant for every j, if we denote this constant as Const, then Equation (7) can be simplified as follows: $w_{j} = \frac{1}{δ_{j} Const} \exp (- \frac{c}{n} δ_{j} S_{j})$ (9)

Since δ_j is used to measure the feature importance, the FRFCM should produce a large weight for a feature with a large δ_j, and the algorithm should retain the feature(s) with large weight(s) in the feature reduction clustering process. However, as can be seen from Equation (9), the update equation of w_j is a function of δ_j and S_j: it produces a large weight for the jth feature if both δ_j and S_j are small, and it cannot produce a large weight for a large δ_j unless S_j is small enough. As a consequence, important feature(s) may not be identified by the FRFCM, which makes the algorithm incorrectly discard important features from the dataset as early as the first iteration. This is one of the shortcomings of the FRFCM algorithm. In addition, the FRFCM has two other disadvantages. 1) The MVR values for the important feature(s) of some datasets are not necessarily large. For example, the MVR index of the iris dataset is δ = [8.522, 16.244, 1.207, 2.058], indicating that the 1st and 2nd features are the two important features, yet most feature weighting clustering algorithms consider the 3rd and 4th features to be the two important features [16]. Moreover, the MVR index changes when the data are normalized: a large MVR value may become small, and a small MVR value may become large. 2) The feature weights obtained by the FRFCM are very sensitive to the initialization. These disadvantages can degrade the performance of the FRFCM.
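The behaviour of Equation (9) described above can be checked numerically. The following is a minimal sketch, not the authors' implementation; the function name `frfcm_weights` and the input values are hypothetical and chosen only to show that a feature with a large δ_j can still receive a small weight.

```python
import numpy as np

def frfcm_weights(delta, S, n, c):
    """Weights of Eq. (9): w_j proportional to (1/delta_j) * exp(-(c/n) * delta_j * S_j)."""
    delta = np.asarray(delta, dtype=float)
    S = np.asarray(S, dtype=float)
    raw = np.exp(-(c / n) * delta * S) / delta
    return raw / raw.sum()  # dividing by the sum plays the role of Const

# Hypothetical inputs: both features have the same dispersion S_j,
# but feature 0 has a much larger importance index delta_j.
w = frfcm_weights(delta=[10.0, 1.0], S=[20.0, 20.0], n=100, c=2)
print(w)  # the feature with the larger delta_j ends up with the smaller weight
```

Under these assumed inputs the feature that the index marks as more important receives almost no weight, which is the failure mode the paragraph describes.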

Next, we propose a novel feature reduction fuzzy c-means clustering algorithm that leverages the marginal kurtosis measure to mitigate the above disadvantages of the FRFCM algorithm.

3 The feature reduction fuzzy c-means clustering leveraging the marginal kurtosis measure

A. Problem Definition

In data mining and data analysis tasks, if a feature or variable possesses only one mode (Fig. 1 (a)), it is a less important feature or variable. However, if a feature or variable has two different modes (Fig. 1 (b)), it is important for the learning algorithm. Because the MVR index in [16] cannot reliably measure feature importance for such distributions, in this paper we devise a new index to measure feature importance.

Fig. 1

Features of two modal types. (a) Unimodal distribution; (b) bimodal distribution.

Definition 1. Given a dataset X, let η = (X - E(X))². The marginal kurtosis measure of X is defined as follows:

$\begin{matrix} δ_{j} = {(\frac{E η}{\sqrt{var (η)}})}_{j} \\ = {(\frac{E ({(X - E (X))}^{2})}{\sqrt{E {({(X - E (X))}^{2} - E ({(X - E (X))}^{2}))}^{2}}})}_{j} \\ = {(\frac{E ({(X - E (X))}^{2})}{\sqrt{E ({(X - E (X))}^{4}) - {(E ({(X - E (X))}^{2}))}^{2}}})}_{j} \\ = {(1 / \sqrt{\frac{E ({(X - E (X))}^{4})}{{(E ({(X - E (X))}^{2}))}^{2}} - 1})}_{j} \\ = {(\frac{1}{\sqrt{kurtosis (X) - 1}})}_{j} \end{matrix}$ (10)

In Equation (10), δ_j is the ratio of the mean of η to its standard deviation, and it is in fact a function of the kurtosis of X. Thus, we name it the marginal kurtosis measure and denote it as MKM. Unlike the MVR of X, the MKM of X is unaffected by data normalization, because the kurtosis of a feature does not change when the feature is shifted or rescaled.
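As a small illustration of Definition 1, the sketch below computes the MKM per feature with NumPy and checks that it is unchanged by min-max normalization. The function name `mkm` and the test data are assumptions made for this example; population (biased) moments are used, matching the expectations in Equation (10).

```python
import numpy as np

def mkm(X):
    """Marginal kurtosis measure per column, Eq. (10): 1 / sqrt(kurtosis - 1)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    m2 = (centered ** 2).mean(axis=0)   # E[(X - EX)^2]
    m4 = (centered ** 4).mean(axis=0)   # E[(X - EX)^4]
    return 1.0 / np.sqrt(m4 / m2 ** 2 - 1.0)

# Assumed test data: three uniform features on different scales.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 3)) * np.array([4.0, 2.0, 10.0])

# Min-max normalization is an affine map per feature, so the MKM should not change.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(mkm(X))
print(mkm(X_norm))
```

Both printed vectors agree up to floating-point error, since the ratio of the fourth central moment to the squared variance is scale-free.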

Next, we use an example to illustrate the advantage of the MKM over the MVR to measure the feature importance.

Example 1. In this example, we produce a synthetic dataset with 400 data points. Components x₂ and x₃ of this dataset are generated from a Gaussian mixture model $\sum_{k = 1}^{2} α_{k} N (μ_{k}, Σ_{k})$ with the parameters α₁ = α₂ = 0.5, $Σ_{1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $Σ_{2} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}$ . Components x₁ and x₄ follow uniform distributions over the interval [0, 4] and the interval [0, 2], respectively. We denote this dataset as D1. The MVR value of the original dataset D1 is δ_MVR = [1.497, 0.981, 1.089, 3.085]. Since δ₁ and δ₄ are larger than δ₂ and δ₃, the 1st and 4th features are considered to be the two important features. However, if D1 is normalized, the MVR of D1 becomes δ_MVR = [5.944, 8.940, 8.906, 6.161], and then the 2nd and 3rd features become the important features (Fig. 3). Since feature importance is an attribute of a dataset, it should remain unchanged even if the data are normalized. The MVR index of D1, however, changes when D1 is normalized. Therefore, if the MVR index is used to measure the feature importance, the algorithm cannot reliably identify the important feature(s).

To present an intuitive understanding of the property of feature importance, we further investigate the distribution of the features of dataset D1 shown in Fig. 4. We can clearly see that the 2nd and 3rd features are compact with a cluster-like structure; therefore, they should be more important than the other two features. This coincides with δ_MKM = [1.072, 1.360, 1.341, 1.143], which indicates that the 2nd and 3rd features of dataset D1 are the two important features. If the MVR is used to measure the feature importance, the MVR of D1 is δ_MVR = [1.497, 0.981, 1.089, 3.085], indicating that the 1st and 4th features are important, although the 2nd and 3rd features are in fact the two important features of dataset D1. What misleads the algorithm is that if D1 is normalized, the MVR of D1 becomes δ_MVR = [5.944, 8.940, 8.908, 6.161], and then the 2nd and 3rd features become the important features. Similarly, the 3rd and 4th features of the iris dataset are the two important features because they have a cluster-like structure, as shown in Fig. 5, and the MKM of iris is δ_MKM = [0.834, 0.666, 1.282, 1.222]; however, the MVR of iris is δ_MVR = [8.103, 13.455, 5.228, 4.527]. Thus, in this paper, we use the MKM index instead of the MVR index to measure the feature importance.
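The comparison in Example 1 can be reproduced in spirit with a D1-like dataset. This is only a sketch under stated assumptions: the mixture means are not given in the text, so the values 2 and 7 used below are hypothetical, and the resulting numbers will differ from those reported for D1.

```python
import numpy as np

def mvr(X):
    """Mean-to-variance ratio per column."""
    return X.mean(axis=0) / X.var(axis=0)

def mkm(X):
    """Marginal kurtosis measure per column, Eq. (10)."""
    centered = X - X.mean(axis=0)
    m2 = (centered ** 2).mean(axis=0)
    m4 = (centered ** 4).mean(axis=0)
    return 1.0 / np.sqrt(m4 / m2 ** 2 - 1.0)

rng = np.random.default_rng(0)
n = 400
# Features x2, x3: two-component Gaussian mixture with covariances I and 0.5*I.
# The component means (2, 2) and (7, 7) are assumed, not taken from the paper.
labels = rng.integers(0, 2, size=n)
means = np.where(labels[:, None] == 0, 2.0, 7.0)
stds = np.where(labels[:, None] == 0, 1.0, np.sqrt(0.5))
x23 = means + stds * rng.standard_normal((n, 2))
# Features x1, x4: uniform on [0, 4] and [0, 2].
x1 = rng.uniform(0.0, 4.0, size=n)
x4 = rng.uniform(0.0, 2.0, size=n)
D1 = np.column_stack([x1, x23, x4])
D1_norm = (D1 - D1.min(axis=0)) / (D1.max(axis=0) - D1.min(axis=0))

print("MVR raw:       ", mvr(D1))
print("MVR normalized:", mvr(D1_norm))
print("MKM raw:       ", mkm(D1))
print("MKM normalized:", mkm(D1_norm))
```

On this assumed data the raw MVR ranks the two uniform features highest, whereas the MKM ranks the two bimodal features highest and gives the same values before and after normalization.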

B. The Formulation of Feature Reduction Fuzzy C-Means Clustering Leveraging the Marginal Kurtosis Measure

In this section, inspired by the work of Yang et al. [16], we propose a novel and robust feature reduction fuzzy clustering algorithm leveraging the marginal kurtosis measure, named the FRFCM-MKM, to mitigate the disadvantages of the FRFCM algorithm. In this method, the MKM index is used to measure the feature importance of a dataset, and each feature has an individual weight updating equation in each iteration. We develop a new threshold to determine which feature(s) is/are to be discarded: the features whose weights are less than the threshold are removed from the dataset. Let X = {x_i | i = 1, 2, . . . , n} be a dataset in $ℝ^{d}$, and let the weights of the features be denoted by the weight vector w = [w₁, . . . , w_d]^T. The objective function to be minimized by the FRFCM-MKM algorithm is as follows:

$\begin{matrix} J (U, V, w) = & \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} \sum_{j = 1}^{d} δ_{j} w_{j} {(x_{ij} - v_{kj})}^{2} \\ + τ \cdot n \cdot c \sum_{j = 1}^{d} {(w_{j} - δ_{j})}^{2} \end{matrix}$ (11)

To obtain the optimal clustering results, the objective function of the FRFCM-MKM is minimized subject to the following constraints: $\text{s.t.} \; \sum_{k = 1}^{c} u_{ik} = 1, \; \sum_{j = 1}^{d} w_{j} = 1, \; w_{j} ⩾ 0$ (12) where δ_j is computed by Equation (10). δ_j is the marginal kurtosis measure of the jth feature of X and is used to control the weight w_j in the FRFCM-MKM algorithm. The clustering problem of finding the optimal cluster assignments of objects is thus formulated as a constrained optimization problem, i.e., finding the optimal values of U, V, and w subject to a set of constraints. Since U, V and w are continuous variables, we can use the method of Lagrange multipliers with the first-order necessary conditions to derive their solutions.

$\begin{matrix} L = & J + \sum_{i = 1}^{n} λ_{i} (1 - \sum_{k = 1}^{c} u_{ik}) \\ & + β (1 - \sum_{j = 1}^{d} w_{j}) + \sum_{j = 1}^{d} γ_{j} w_{j} \end{matrix}$ (13)

$\frac{\partial L}{\partial u_{ik}} = m u_{ik}^{m - 1} \sum_{j = 1}^{d} δ_{j} w_{j} {(x_{ij} - v_{kj})}^{2} - λ_{i} = 0$ (14)

$\frac{\partial L}{\partial v_{kj}} = - 2 δ_{j} w_{j} \sum_{i = 1}^{n} u_{ik}^{m} (x_{ij} - v_{kj}) = 0$ (15)

$\begin{matrix} \frac{\partial L}{\partial w_{j}} = & \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} \\ & + 2 τ \cdot n \cdot c \cdot (w_{j} - δ_{j}) - β + γ_{j} = 0 \end{matrix}$ (16)

$γ_{j} ⩾ 0$ (17)

$γ_{j} w_{j} = 0$ (18)

According to Equation (14), the iteration formula of u_ik can be obtained as follows:

$u_{ik} = \frac{{[\sum_{j = 1}^{d} δ_{j} w_{j} {(x_{ij} - v_{kj})}^{2}]}^{\frac{1}{1 - m}}}{\sum_{s = 1}^{c} {[\sum_{j = 1}^{d} δ_{j} w_{j} {(x_{ij} - v_{sj})}^{2}]}^{\frac{1}{1 - m}}}$ (19)

From Equation (15), we obtain $v_{kj} = \sum_{i = 1}^{n} u_{ik}^{m} x_{ij} / \sum_{i = 1}^{n} u_{ik}^{m}$ (20)
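The membership and center updates in Equations (19) and (20) can be sketched in NumPy as follows. The function names, the small constant `eps` (added here only to avoid division by zero when a point coincides with a center) and the toy data are assumptions made for illustration.

```python
import numpy as np

def update_membership(X, V, w, delta, m=2.0, eps=1e-12):
    """Eq. (19): memberships from feature-weighted squared distances."""
    diff2 = (X[:, None, :] - V[None, :, :]) ** 2   # shape (n, c, d)
    d2 = diff2 @ (delta * w) + eps                 # sum_j delta_j w_j (x_ij - v_kj)^2
    inv = d2 ** (1.0 / (1.0 - m))
    return inv / inv.sum(axis=1, keepdims=True)    # each row sums to 1

def update_centers(X, U, m=2.0):
    """Eq. (20): centers as membership-weighted means."""
    Um = U ** m                                    # shape (n, c)
    return (Um.T @ X) / Um.sum(axis=0)[:, None]    # shape (c, d)

# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
V0 = np.array([[0.0, 0.0], [10.0, 10.0]])
w = np.array([0.5, 0.5])
delta = np.array([0.5, 0.5])

U = update_membership(X, V0, w, delta)
V1 = update_centers(X, U)
print(U.round(3))
print(V1.round(3))
```

Note that Equation (20) does not involve δ_j or w_j: for a fixed feature j both factors cancel between numerator and denominator, so the centers depend on the weights only through U.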

From Equation (16), it can be obtained that

$\begin{matrix} w_{j} = & \underbrace{\frac{1}{| I^{+} |} - \frac{1}{| I^{+} |} \sum_{s \in I^{+}} δ_{s} + δ_{j}}_{\mathrm{I}} \\ & + \underbrace{\frac{1}{2 τ \cdot n \cdot c | I^{+} |} \sum_{s \in I^{+}} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{s} {(x_{is} - v_{ks})}^{2} - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2}}_{\mathrm{II}} \end{matrix}$ (21)

$\begin{matrix} I^{-} = \{ j : w_{j} = 0 \} \\ I^{+} = \{ j : w_{j} > 0 \} \end{matrix}$ (22)

I^- denotes the set of indexes of the zero-weight (redundant) features, while I⁺ denotes the set of indexes of the positive weight (important) features. |I^-| and |I⁺| are the cardinality of I^- and I⁺, respectively. The elements of I^- and I⁺ are computed by the same method as in [23], which is outlined as follows.

Procedure for determining I^- and I⁺
1. Initialize s = 0, $I_{0}^{+} = φ$, $I_{0}^{-} = \{1, 2, \dots, d\}$;
2. s ← s + 1, $I_{s}^{+} = I_{s - 1}^{+} \cup \{p\}, I_{s}^{-} = I_{s - 1}^{-} \setminus \{p\}$ , where $p = \underset{j \in I_{s - 1}^{-}}{argmin} \{ \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} - δ_{j} \}$ ;
3. Examine whether w_f > 0 by computing Equation (21) with $I^{+} = I_{s}^{+}$, where $f = \underset{j \in I_{s}^{+}}{argmax} \{ \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} - δ_{j} \}$ . If yes and $I_{s}^{-}$ is not empty, go to step 2; if yes and $I_{s}^{-}$ is empty, set $I^{+} = I_{s}^{+}$ , $I^{-} = I_{s}^{-}$ and terminate; otherwise, set $I^{+} = I_{s - 1}^{+}$ , $I^{-} = I_{s - 1}^{-}$ and terminate.
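The procedure above, combined with Equation (21), can be sketched as follows. Writing D_j for the weighted dispersion $\sum_{i} \sum_{k} u_{ik}^{m} δ_{j} (x_{ij} - v_{kj})^{2}$ and a_j = δ_j - D_j / (2τnc), Equation (21) reads w_j = a_j + (1 - Σ_{s∈I⁺} a_s) / |I⁺| for j ∈ I⁺. The names `update_weights`, `D` and `a` are introduced here for illustration only, and the input values are hypothetical.

```python
import numpy as np

def update_weights(delta, D, tau, n, c):
    """Greedy construction of I+ followed by the weight formula of Eq. (21).

    delta : MKM values, shape (d,)
    D     : D_j = sum_i sum_k u_ik^m * delta_j * (x_ij - v_kj)^2, shape (d,)
    """
    a = delta - D / (2.0 * tau * n * c)
    order = np.argsort(-a)          # step 2 picks the largest remaining a_j first
    active = []                     # indices currently in I+
    for p in order:
        trial = active + [p]
        shift = (1.0 - a[trial].sum()) / len(trial)
        # Step 3: p has the smallest a_j in the trial set, so it has the smallest weight.
        if a[p] + shift <= 0.0:
            break                   # keep the previous I+ and stop
        active = trial
    w = np.zeros_like(a)
    w[active] = a[active] + (1.0 - a[active].sum()) / len(active)
    return w

# Hypothetical inputs: the last feature has a very large dispersion.
delta = np.array([0.25, 0.25, 0.25, 0.25])
D = np.array([0.0, 0.0, 0.0, 100.0])
w = update_weights(delta, D, tau=1.0, n=10, c=2)
print(w)
```

With U and V fixed, minimizing Equation (11) over w under the constraints in Equation (12) amounts to finding the point of the probability simplex closest to the vector a, so this loop is one way to compute that Euclidean projection.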

Here we give the detailed derivation of the update equations of w_j from the Lagrangian in Equation (13) and the Karush–Kuhn–Tucker conditions in Equations (16)–(18). Combining Equations (13) and (16) results in

$\begin{matrix} w_{j} = & \frac{β}{2 τ \cdot n \cdot c} - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} \\ {(x_{ij} - v_{kj})}^{2} - \frac{1}{2 τ \cdot n \cdot c} γ_{j} + δ_{j} \end{matrix}$ (23)

According to $\sum_{j = 1}^{d} w_{j} = 1$ , it can be obtained that

$\begin{matrix} \frac{β d}{2 τ \cdot n \cdot c} - \frac{1}{2 τ \cdot n \cdot c} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} \\ - \frac{1}{2 τ \cdot n \cdot c} \sum_{j = 1}^{d} γ_{j} + \sum_{j = 1}^{d} δ_{j} = 1 \end{matrix}$

and therefore

$\begin{matrix} β = & \frac{2 τ \cdot n \cdot c}{d} (1 - \sum_{j = 1}^{d} δ_{j}) + \frac{1}{d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} \\ & + \frac{1}{d} \sum_{j = 1}^{d} γ_{j} \end{matrix}$ (24)

By plugging Equation (24) into Equation (23), the closed-form solution for the feature weights is obtained as

$\begin{matrix} w_{j} = \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} \\ u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} γ_{j} - \frac{1}{2 τ \cdot n \cdot c} \\ \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} - \frac{1}{2 τ \cdot n \cdot c} γ_{j} + δ_{j} \end{matrix}$ (25)

Now we consider Equation (17) for two respective cases.

1) If γ_j = 0 for all j ∈ {1, . . . , d}, Equation (25) becomes

$\begin{matrix} w_{j} = \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} \\ {(x_{ij} - v_{kj})}^{2} - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + δ_{j} \end{matrix}$

For each feature j, this set of solutions is valid only if

$\begin{matrix} \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} \\ {(x_{ij} - v_{kj})}^{2} - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + δ_{j} ⩾ 0 \end{matrix}$

Otherwise, we consider the second case.

2) If γ_j > 0 for at least one j, according to Equation (18) when γ_j > 0, w_j = 0; therefore, $\begin{matrix} \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} \\ {(x_{ij} - v_{kj})}^{2} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} γ_{j} - \frac{1}{2 τ \cdot n \cdot c} \\ \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} - \frac{1}{2 τ \cdot n \cdot c} γ_{j} + δ_{j} = 0 \end{matrix}$ (26)

From the above equation, we obtain that $\begin{matrix} \frac{1}{2 τ \cdot n \cdot c} γ_{j} = \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \\ \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} γ_{j} \\ - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + δ_{j} \end{matrix}$ (27)

Constrained by Equation (12), obviously, it is known that not all w_j can be 0. Therefore, the set { j|j = 1, . . . , d} is separated into two subsets denoted as I^- and I⁺, where $\begin{matrix} I^{-} = {j : w_{j} = 0} \\ I^{+} = {j : w_{j} > 0} \neq φ \end{matrix}$ (28)

Therefore, Equation (27) becomes

$\begin{matrix} \frac{1}{2 τ \cdot n \cdot c} γ_{j} = & \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} \\ & + \frac{1}{2 τ \cdot n \cdot c \cdot d} (\sum_{s \in I^{+}} γ_{s} + \sum_{s \in I^{-}} γ_{s}) - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + δ_{j} \\ = & \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} \\ & + \frac{| I^{-} |}{2 τ \cdot n \cdot c \cdot d} γ_{j} - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + δ_{j} \end{matrix}$

Moving the γ_j term to the left-hand side and using d - |I^-| = |I⁺| gives

$\begin{matrix} \frac{| I^{+} |}{2 τ \cdot n \cdot c \cdot d} γ_{j} = & \frac{1}{d} - \frac{1}{d} \sum_{j = 1}^{d} δ_{j} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} \\ & - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + δ_{j} \end{matrix}$

From the above equation, it can be obtained that $\begin{matrix} γ_{j} = & \frac{2 τ \cdot n \cdot c}{| I^{+} |} - \frac{2 τ \cdot n \cdot c}{| I^{+} |} \sum_{j = 1}^{d} δ_{j} + \frac{1}{| I^{+} |} \sum_{j = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} \\ & - \frac{d}{| I^{+} |} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + \frac{2 τ \cdot n \cdot c \cdot d}{| I^{+} |} δ_{j} \end{matrix}$ (29)

By replacing the γ_j in Equation (25) with Equation (29), after some simplification, we obtain the final form as in Equation (21).

Choosing the threshold is an important step in the proposed clustering algorithm. To develop a feature-reduction scheme for the FRFCM algorithm, Yang et al. [16] used $1 / \sqrt{nd}$ as a threshold to determine which feature(s) is/are to be selected or discarded. However, $1 / \sqrt{nd}$ decreases as n increases. Datasets of different sizes generated from the same distribution have the same feature importance and should therefore receive similar feature weights in a feature weighting clustering algorithm. Consequently, their thresholds should be equal. Nevertheless, the thresholds computed by $1 / \sqrt{nd}$ range from 0.071 to 0.0071 (Table 1), although the feature importance of the datasets is the same. Thus, $1 / \sqrt{nd}$ is not a suitable threshold for some data in the feature reduction learning algorithm.

Table 1

Threshold for different sized datasets drawn from same distribution

Size	n = 50	n = 200	n = 500	n = 1000	n = 5000
$\frac{1}{\sqrt{nd}}$	0.071	0.050	0.022	0.0158	0.0071

We now look for a proper threshold to remove the feature(s) with small weight(s) instead of using $1 / \sqrt{nd}$. Since δ_j is a feature importance measure, we can construct a threshold using the information in the δ_j values. In Equation (11), if the δ_j values are normalized so that they sum to 1 and τ → +∞, we obtain w_j ≈ δ_j when the FRFCM-MKM algorithm converges, which inspired us to use the δ_j values to construct a threshold. Among the three Pythagorean means, the harmonic mean is the smallest; it emphasizes the impact of small δ_j values while reducing the impact of large ones. We therefore pick $\sqrt{d} / \sum_{j = 1}^{d} \frac{1}{δ_{j}}$ as the threshold, since for normalized δ_j and d > 1, $\sqrt{d} / \sum_{j = 1}^{d} \frac{1}{δ_{j}} < d / \sum_{j = 1}^{d} \frac{1}{δ_{j}} ⩽$ $\sqrt[d]{\prod_{j = 1}^{d} δ_{j}} ⩽ \frac{1}{d} \sum_{j = 1}^{d} δ_{j} = \frac{1}{d}$ . The proposed FRFCM-MKM algorithm can be summarized as follows.
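The threshold and the chain of inequalities above can be checked with a few lines of NumPy. This is a sketch under the assumption, taken from the text, that the δ_j values are first normalized to sum to 1; the function name `mkm_threshold` is hypothetical, and the input is the MKM vector reported for D1 in Example 1.

```python
import numpy as np

def mkm_threshold(delta):
    """Threshold sqrt(d) / sum_j (1/delta_j), computed on normalized delta."""
    delta = np.asarray(delta, dtype=float)
    delta = delta / delta.sum()          # normalization assumed by the text
    return np.sqrt(delta.size) / np.sum(1.0 / delta)

delta = np.array([1.072, 1.360, 1.341, 1.143])   # MKM of D1 (Example 1)
delta_n = delta / delta.sum()
d = delta_n.size

thr = mkm_threshold(delta)
hm = d / np.sum(1.0 / delta_n)                   # harmonic mean
gm = np.exp(np.mean(np.log(delta_n)))            # geometric mean
am = delta_n.mean()                              # arithmetic mean, equals 1/d
print(thr, hm, gm, am)
```

The threshold equals the harmonic mean divided by √d, so it does not depend on the number of data points n, in contrast to $1 / \sqrt{nd}$.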

C. FRFCM-MKM Algorithm

Initialization: Fix ɛ > 0, give the cluster number c, randomly initialize the cluster centers V⁽⁰⁾, randomly initialize the feature weights w⁽⁰⁾, and set t = 0.
Step 1: Compute δ_j from the dataset X by Equation (10).
Step 2: Calculate the membership matrix U^(t) using δ_j, V^(t-1) and w^(t-1) by Equation (19).
Step 3: Update the cluster centers V^(t) using U^(t) by Equation (20).
Step 4: Update w^(t) using δ_j, U^(t) and V^(t) by Equation (21), and normalize w^(t).
Step 5: Remove the r feature components j for which $w_{j}^{(t)} ⩽ \sqrt{d} / \sum_{j = 1}^{d} \frac{1}{δ_{j}}$ , and set d^(new) = d - r.
Step 6: Recompute w^(t) for the remaining features by Equation (21).
Step 7: If ∥U^(t) - U^(t-1)∥ < ɛ, then stop; otherwise set t = t + 1, d = d^(new) and go back to Step 1.
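The steps above can be put together in a compact NumPy sketch. It is an illustration under several assumptions rather than the authors' implementation: the weights start uniform instead of random, δ is normalized to sum to 1 in every iteration, a small constant guards against zero distances, feature removal is skipped once a single feature remains, and all function and variable names are hypothetical.

```python
import numpy as np

def mkm(X):
    centered = X - X.mean(axis=0)
    m2, m4 = (centered ** 2).mean(axis=0), (centered ** 4).mean(axis=0)
    return 1.0 / np.sqrt(m4 / m2 ** 2 - 1.0)            # Eq. (10)

def weights_from(a):
    """Eq. (21) with I+ built greedily; a_j = delta_j - D_j / (2*tau*n*c)."""
    active = []
    for p in np.argsort(-a):
        trial = active + [p]
        if a[p] + (1.0 - a[trial].sum()) / len(trial) <= 0.0:
            break
        active = trial
    w = np.zeros_like(a)
    w[active] = a[active] + (1.0 - a[active].sum()) / len(active)
    return w

def frfcm_mkm(X, c, m=2.0, tau=1.0, eps=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    keep = np.arange(d)                                  # retained feature indices
    V = X[rng.choice(n, size=c, replace=False)]          # random initial centers
    w = np.full(d, 1.0 / d)                              # uniform start (assumption)
    U_old = np.zeros((n, c))
    for _ in range(max_iter):
        Xk = X[:, keep]
        delta = mkm(Xk)
        delta = delta / delta.sum()                      # Step 1, normalized
        diff2 = (Xk[:, None, :] - V[None, :, :]) ** 2
        d2 = diff2 @ (delta * w) + 1e-12                 # Step 2, Eq. (19)
        U = d2 ** (1.0 / (1.0 - m))
        U /= U.sum(axis=1, keepdims=True)
        Um = U ** m
        V = (Um.T @ Xk) / Um.sum(axis=0)[:, None]        # Step 3, Eq. (20)
        diff2 = (Xk[:, None, :] - V[None, :, :]) ** 2
        D = delta * np.einsum("ik,ikj->j", Um, diff2)
        a = delta - D / (2.0 * tau * n * c)
        w = weights_from(a)                              # Step 4, Eq. (21)
        if keep.size > 1:
            thr = np.sqrt(keep.size) / np.sum(1.0 / delta)
            mask = w > thr                               # Step 5: drop small weights
            keep, V = keep[mask], V[:, mask]
            w = weights_from(a[mask])                    # Step 6
        if np.linalg.norm(U - U_old) < eps:              # Step 7
            break
        U_old = U
    return U, V, w, keep

# Assumed demo data: two bimodal features plus two uniform noise features.
rng = np.random.default_rng(1)
lab = rng.integers(0, 2, size=200)
blob = np.where(lab[:, None] == 0, 2.0, 7.0) + rng.standard_normal((200, 2))
X_demo = np.column_stack([rng.uniform(0, 4, 200), blob, rng.uniform(0, 2, 200)])
U, V, w, keep = frfcm_mkm(X_demo, c=2)
print("retained features:", keep, "weights:", w.round(3))
```

Which features survive depends on τ, the data scale and the initialization, so the printed result should be read as one run of a sketch rather than a reference output.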

Next, we provide the detailed proof of the convergence theorems for the FRFCM-MKM algorithm.

D. Convergence Theorems for FRFCM-MKM Algorithm

Theorem 1. If V and w are fixed, then U is a strict local minimum of J (U, V, w) if and only if U is calculated by (19).

Theorem 2. If U and w are fixed, then V is a strict local minimum of J (U, V, w) if and only if V is calculated by (20).

Theorem 3. If U and V are fixed, then w is a strict local minimum of J (U, V, w) if and only if w is calculated by (21).

Referring to [24–26], Theorems 1 and 2 can be easily proved. Because the main difference between the FCM and the FRFCM-MKM lies in the feature weights, here we only give a detailed proof of Theorem 3.

Proof. We first prove the necessary condition for a minimum of J (U, V, w) by using the Lagrangian multiplier method to transform the constrained minimization problem into an unconstrained problem. We take the Lagrangian function as follows. $L_{1} (w, β) = \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} \sum_{j = 1}^{d} δ_{j} w_{j} {(x_{ij} - v_{kj})}^{2} + τ \cdot n \cdot c \sum_{j = 1}^{d} {(w_{j} - δ_{j})}^{2} + β (1 - \sum_{j = 1}^{d} w_{j})$ (30) where β is the Lagrangian multiplier. Setting the gradient of L₁ with respect to w_j and β to zero, we obtain:

$\frac{\partial L_{1}}{\partial w_{j}} = \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + 2 τ \cdot n \cdot c \cdot (w_{j} - δ_{j}) - β = 0$ (31)

$\frac{\partial L_{1}}{\partial β} = 1 - \sum_{j = 1}^{d} w_{j} = 0$ (32)

Then, following the steps from Equation (20) to Equation (26), we have $w_{j} = \frac{1}{d} - \frac{1}{d} \sum_{l = 1}^{d} δ_{l} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{l = 1}^{d} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{l} {(x_{il} - v_{kl})}^{2} + \frac{1}{2 τ \cdot n \cdot c \cdot d} \sum_{l = 1}^{d} γ_{l} - \frac{1}{2 τ \cdot n \cdot c} \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} - \frac{1}{2 τ \cdot n \cdot c} γ_{j} + δ_{j}$ (33)

The necessary condition is proved.
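As a brief check of the step from Equations (31) and (32) to Equation (33), consider the simplified case in which the multipliers γ_j vanish (an assumption made here only to keep the algebra short), and write A_j for the data term in Equation (31). Solving Equation (31) for w_j and substituting into the constraint of Equation (32) gives:

```latex
A_j = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{m}\,\delta_j\,(x_{ij}-v_{kj})^{2},
\qquad
w_j = \delta_j + \frac{\beta - A_j}{2\tau n c},
\\[4pt]
\sum_{j=1}^{d} w_j = 1
\;\Longrightarrow\;
\beta = \frac{1}{d}\Bigl[\,2\tau n c\Bigl(1-\sum_{l=1}^{d}\delta_l\Bigr) + \sum_{l=1}^{d} A_l\Bigr],
\\[4pt]
w_j = \frac{1}{d} - \frac{1}{d}\sum_{l=1}^{d}\delta_l
      + \frac{1}{2\tau n c\, d}\sum_{l=1}^{d} A_l
      - \frac{1}{2\tau n c}\,A_j + \delta_j .
```

The last line agrees with Equation (33) once the γ terms are set to zero.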

Secondly, we prove the sufficient condition for a minimum of J (U, V, w). To analyze the sufficiency of Theorem 3, we check the Hessian matrix of the Lagrangian function L₁ (w, β), denoted as $H_{L_{1}} (w, β)$. From Equation (31), $\frac{\partial L_{1}}{\partial w_{j}} = \sum_{i = 1}^{n} \sum_{k = 1}^{c} u_{ik}^{m} δ_{j} {(x_{ij} - v_{kj})}^{2} + 2 τ \cdot n \cdot c \cdot (w_{j} - δ_{j}) - β$, so that $\frac{\partial^{2} L_{1}}{\partial w_{j} \partial w_{l}} = 2 τ \cdot n \cdot c \cdot κ_{jl}$ (34) where κ_jl is the Kronecker delta with $κ_{jl} = \begin{cases} 1, & \text{if } j = l \\ 0, & \text{if } j \neq l \end{cases}$ , and $\frac{\partial^{2} L_{1}}{\partial w_{l} \partial β} = \frac{\partial^{2} L_{1}}{\partial β \partial w_{l}} = - 1 .$

Thus, the block of the Hessian matrix corresponding to w is

$H_{L_{1}} (w, β) = \begin{bmatrix} 2 τ \cdot n \cdot c & 0 & \dots & 0 \\ 0 & 2 τ \cdot n \cdot c & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 2 τ \cdot n \cdot c \end{bmatrix}$ (35)

Since τ > 0 and n · c > 0, $H_{L_{1}} (w, β)$ is a diagonal matrix with positive diagonal entries and is therefore positive definite. Hence, J (U, V, w) attains a minimum at this point, and Equation (25) is a sufficient condition for w to be a local minimum of J (U, V, w).

In the same way, the case with feature reduction can be proved. Thus, Theorem 3 is verified. □

4 Experimental study

In this section, to evaluate the performance of the proposed FRFCM-MKM algorithm, the WKM, EWKM, WFCM, SCAD2, ESSC, and FRFCM are chosen for comparison. A series of experiments is performed on various datasets. In most cases, running a clustering algorithm on raw data does not work well. Therefore, we normalize each feature f_j = [x_1j, x_2j, . . . , x_nj] ^T by subtracting its mean and dividing by its range, as shown below: $y_{j} = \frac{f_{j} - \bar{f}_{j} \mathbf{1}}{\max (f_{j}) - \min (f_{j})}$ where $\bar{f}_{j} = \frac{1}{n} \mathbf{1}^{T} f_{j}$ is the mean of the j-th feature and $\mathbf{1}$ is the all-ones vector.
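The normalization can be sketched for a single feature column as follows. The function name is hypothetical, and mapping a constant feature to zeros is an assumption, since the text does not cover a zero range.

```python
def mean_range_normalize(column):
    """Subtract the mean and divide by the range (max - min) of one feature."""
    n = len(column)
    mean = sum(column) / n
    span = max(column) - min(column)
    if span == 0:
        # Assumption: a constant feature carries no information, so map it to zeros.
        return [0.0] * n
    return [(v - mean) / span for v in column]

f = [2.0, 4.0, 6.0, 8.0]
y = mean_range_normalize(f)
print(y)
```

After this transformation every feature has zero mean and a range of exactly one, so no feature dominates the distance computation merely because of its scale.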

We set the convergence threshold ɛ and the maximum number of iterations to ɛ = 10⁻⁵ and t = 300, respectively, in all the algorithms. The fuzziness parameter m in all fuzzy-type clustering algorithms, including ours, is set to 2. The experiments are performed on a PC with an Intel Core i5-4590 processor and 4 GB of memory. All code is written in the MATLAB computing environment.

A. Performance Metrics

Three performance metrics including the accuracy (AC) [27], the normalized mutual information (NMI) [28], and the adjusted Rand index (ARI) [29] are used to evaluate the performance of clustering algorithms. They are averaged over thirty different runs with random initialization of the feature weights and with random initializations of the centroids using the initialization strategy in reference [30].

Accuracy: We evaluate the clustering results by comparing the labels obtained by the clustering algorithms with the provided ground truths. The AC is defined as $AC = \frac{1}{n} \sum_{k = 1}^{c} d_{k}$ , where d_k is the number of data points that are correctly clustered for cluster k, and n is the number of data points in the dataset. A larger AC means better performance of the clustering algorithm.
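Computing d_k requires matching each cluster to a class. The text does not state how this matching is done, so the sketch below assumes the one-to-one matching that maximizes the number of correctly clustered points, found by brute force over permutations. This is only practical for a small number of clusters; an assignment solver such as the Hungarian algorithm would be needed for larger problems.

```python
from collections import Counter
from itertools import permutations

def clustering_accuracy(truth, pred):
    """AC = (1/n) * sum_k d_k under the best one-to-one cluster-to-class matching."""
    counts = Counter(zip(pred, truth))       # counts[(cluster, class)]
    clusters = sorted(set(pred))
    classes = sorted(set(truth))
    if len(clusters) <= len(classes):
        best = max(sum(counts[(k, c)] for k, c in zip(clusters, perm))
                   for perm in permutations(classes, len(clusters)))
    else:
        best = max(sum(counts[(k, c)] for k, c in zip(perm, classes))
                   for perm in permutations(clusters, len(classes)))
    return best / len(truth)

truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
ac = clustering_accuracy(truth, pred)
print(ac)
```

In the example, cluster 1 is matched to class 0 and cluster 0 to class 1, so five of the six points count as correctly clustered.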

ARI: The Rand index (RI) proposed by Rand [31] measures the agreement between two crisp partitions of a set of data points. The ARI is the adjusted form of the Rand index. The ARI has a maximum value of 1, and its expected value is 0 in the case of random clusters. A larger ARI implies more agreement between two partitions, and, hence, the greater effectiveness of the algorithm since the two partitions are produced by a clustering algorithm and human experts, respectively. The adjusted Rand Index is computed as follows:

$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - [\sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2}] / \binom{n}{2}}{\frac{1}{2} [\sum_{i} \binom{a_{i}}{2} + \sum_{j} \binom{b_{j}}{2}] - [\sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2}] / \binom{n}{2}}$ (36)

where n is the total number of data points, c is the number of clusters, n_ij = |S_i ∩ S_j|, $a_{i} = \sum_{j = 1}^{c} | S_{i} \cap S_{j} |$ and $b_{j} = \sum_{i = 1}^{c} | S_{i} \cap S_{j} |$ .
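Equation (36) can be evaluated directly from the contingency counts, as in the following minimal sketch. The function name is hypothetical, and the degenerate case in which the denominator is zero (for example, both partitions consisting of a single group) is not handled.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index computed from the contingency counts n_ij, a_i and b_j."""
    n = len(truth)
    n_ij = Counter(zip(pred, truth))   # points shared by cluster i and class j
    a = Counter(pred)                  # cluster sizes a_i
    b = Counter(truth)                 # class sizes b_j
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_part = adjusted_rand_index([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
print(ari_same, ari_part)
```

The first call compares two partitions that differ only in their label names, so the index equals 1; the second gives (4 - 2.8) / (6.5 - 2.8), which is about 0.324.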

NMI: The NMI [28] index is another popular index that is used to evaluate clustering results. This index measures the agreement between the clustering results produced by an algorithm and the ground truth. If we refer to a class as the ground truth and a cluster as the result of a clustering algorithm, the NMI is calculated as follows: $NMI = \frac{\sum_{i} \sum_{j} n_{ij} \log (\frac{n \cdot n_{ij}}{a_{i} \cdot b_{j}})}{\sqrt{(\sum_{i} a_{i} \log \frac{a_{i}}{n}) (\sum_{j} b_{j} \log \frac{b_{j}}{n})}}$ (37) where n is the total number of data points, c is the number of clusters, n_ij = |S_i ∩ S_j|, $a_{i} = \sum_{j = 1}^{c} | S_{i} \cap S_{j} |$ and $b_{j} = \sum_{i = 1}^{c} | S_{i} \cap S_{j} |$ .
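A minimal sketch of Equation (37) follows, under the assumption that the two sums in the denominator run over the cluster sizes a_i and the class sizes b_j. The function name is hypothetical, and a partition with a single group (which makes the denominator zero) is not handled.

```python
from collections import Counter
from math import log, sqrt

def normalized_mutual_information(truth, pred):
    """NMI computed from the contingency counts n_ij, a_i and b_j."""
    n = len(truth)
    n_ij = Counter(zip(pred, truth))   # points shared by cluster i and class j
    a = Counter(pred)                  # cluster sizes a_i
    b = Counter(truth)                 # class sizes b_j
    mutual = sum(v * log(n * v / (a[i] * b[j])) for (i, j), v in n_ij.items())
    h_a = sum(v * log(v / n) for v in a.values())   # non-positive
    h_b = sum(v * log(v / n) for v in b.values())   # non-positive
    return mutual / sqrt(h_a * h_b)

nmi_same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_indep = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(nmi_same, nmi_indep)
```

Partitions that agree up to relabeling give a value of 1, while the second pair shares no information and gives 0.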

B. Experiment

Experiment 1: the feature weight representation ability

In this experiment, we generate the same dataset, denoted by D2, as that used in example 1 of reference [16]. This dataset has 400 data points with components x₂ and x₃ generated from the Gaussian mixture model $\sum_{k = 1}^{2} α_{k} N (μ_{k}, Σ_{k})$ , where the parameters are α₁ = α₂ = 0.5, μ₁ = (3, 3) ^T, μ₂ = (7, 7) ^T, $Σ_{1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $Σ_{2} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}$ . Components x₁ and x₄ follow uniform distributions over the intervals [0, 10] and [0, 12], respectively.
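A dataset with the same distribution can be sampled as in the sketch below. It will not reproduce the exact points used in reference [16]; the function name and the seed are hypothetical. Because both covariance matrices are diagonal, x₂ and x₃ are drawn independently, with standard deviation equal to the square root of the stated variance.

```python
import random

def generate_d2(n=400, seed=0):
    """Sample n points: (x2, x3) from the two-component mixture, x1 and x4 uniform."""
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        k = 0 if rng.random() < 0.5 else 1                  # mixing weights 0.5 / 0.5
        mean, var = (3.0, 1.0) if k == 0 else (7.0, 0.5)    # component mean and variance
        std = var ** 0.5
        x2 = rng.gauss(mean, std)
        x3 = rng.gauss(mean, std)
        x1 = rng.uniform(0.0, 10.0)                         # unimportant feature
        x4 = rng.uniform(0.0, 12.0)                         # unimportant feature
        data.append([x1, x2, x3, x4])
        labels.append(k)
    return data, labels

data, labels = generate_d2()
print(len(data), data[0])
```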

To compare the feature weight representation abilities of the WKM, FRFCM and FRFCM-MKM, we apply the WKM, FRFCM and FRFCM-MKM algorithms to dataset D2 thirty times to examine whether the algorithms assign proper weights to features. Since the prototype-based clustering algorithms are sensitive to the initialization, to reduce the sensitivity of the initialization for clustering algorithms, we first choose five sets of centers V₁, V₂, V₃, V₄ and V₅ from D2 as the initial cluster centers by using the initialization strategy in the k-means++ [30] and then implement the WKM, FRFCM and FRFCM-MKM with different initial feature weights. To save space, though the algorithms are conducted thirty times, we merely plot the bar graphs of the first ten runs to study the feature assignment problem of the three algorithms. For the WKM, we plot the bar graphs of the final weights, but for the FRFCM and FRFCM-MKM, the bar graphs are for the first iterations since the weights in the first iterations play a vital role in the feature reduction process. Each color of the bar graphs represents the weight assignment for a run of the algorithm. As shown in the bar graphs (Table 2), for the five sets of fixed initial clusters, the WKM and FRFCM do not properly assign weights to features using random initial weights because different initial weights may lead to different weight assignments. Especially, the FRFCM algorithm may assign small weights to important features (i.e., the 2nd and 3rd features) but large weights to unimportant features (i.e., the 1st and 4th features). Thus, in the feature reduction process, the 2nd and 3rd features may be discarded. The case of D2 being normalized is given in Table 3. ${\tilde{V}}_{1}$ to ${\tilde{V}}_{5}$ are five sets of fixed initial cluster centers. The bar graphs show that the weights produced by the FRFCM are less sensitive to different initial feature weights when the data are normalized. 
However, the FRFCM still does not properly assign weights to features since the 2nd and 3rd features obtain small weights. In contrast, our method is always capable of properly assigning weights to features whether the data are normalized or not, and the 2nd and 3rd features always obtain large weights.

Table 2

Weight assignments of the WKM, FRFCM and FRFCM-MKM for unnormalized D2 with fixed cluster centers

Table 3

Weight assignments of the WKM, FRFCM and FRFCM-MKM for normalized D2 with fixed initial weights

We next run the WKM, FRFCM and FRFCM-MKM on the unnormalized and normalized D2 datasets with equal initial feature weights w = [1/4, 1/4, 1/4, 1/4] ^T and with random initial cluster centers. The bar graphs of the weight assignments for these algorithms are shown in Tables 4 and 5, which demonstrate that the FRFCM-MKM represents the feature importance well. When the initial weights are fixed and the centers are randomly initialized, neither the WKM nor the FRFCM represents the features well.

Table 4

Weight assignments of the WKM, FRFCM and FRFCM-MKM for unnormalized D2 when the initial feature weights are fixed as w = [1/4, 1/4, 1/4, 1/4] ^T

Table 5

Weight assignments of the WKM, FRFCM and FRFCM-MKM for normalized D2 when the initial feature weights are fixed as w = [1/4, 1/4, 1/4, 1/4] ^T

The reason that the FRFCM-MKM represents the features well is that the second term in the objective function is used to control the value of w_j s; and the larger the value of τ is, the closer the value of w_j is to that of δ_j, and the more robust the algorithm. Fig. 6 shows the weight assignments of the FRFCM-MKM under different parameters. As τ increases, the feature weights remain the same for each run, which indicates the robustness of the proposed method.

Now we compare the clustering results achieved by the WKM, EWKM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM algorithms. Table 8 lists their clustering results for unnormalized and normalized D2. The WKM and EWKM produce unsuitable feature weights, and their clustering results are worse. The SCAD2 and ESSC produce suitable feature weights on normalized D2 (Table 7), and their clustering results are desirable. The FRFCM does not produce suitable feature weights for dataset D2, and the important feature(s) is/are discarded from the data at the first iteration, which leads to an incorrect clustering result. When D2 is normalized, the FRFCM assigns comparatively large weights to all four features (Table 3). Meanwhile, the threshold calculated by $1 / \sqrt{nd}$ is 0.025, and this value is used to determine which feature(s) will be retained or removed. However, the threshold is so small that the FRFCM fails to remove any feature(s) in the clustering process. In this case, the FRFCM algorithm clusters the data well in the full space, but its feature reduction mechanism fails. The proposed algorithm achieves AC = 0.998, NMI = 0.977 and ARI = 0.990 on both versions of D2, outperforming the other methods on unnormalized D2 and matching the best results on normalized D2 (Table 8). The main reason is that the FRFCM-MKM properly represents the feature weights, so that the 1st and 4th features are discarded in the feature reduction learning process. The FRFCM-MKM is thus actually run in a 2-D subspace of D2, as shown in Fig. 2(a), and the desired result can be obtained. Moreover, the weights obtained by the FRFCM-MKM always represent the importance of the features well regardless of whether the data are normalized, because the weight update equation is dominated by part I (see Equation (21)), and part II is merely used to adjust the weight value.
This characteristic makes the FRFCM-MKM produce almost equal weights whether the data are normalized or unnormalized, and it makes the feature weights robust to the initialization. This experiment demonstrates that the FRFCM-MKM can properly assign weights to the features and that the weights it produces are robust to the initialization.

Fig. 2

(a) Two-component data points generated from a Gaussian mixture model; (b) the Gaussian mixture model data points after adding component x₃.

Fig. 3

(a) δ_MVR value of D1 (unnormalized); (b) δ_MVR value of D1 (normalized).

Fig. 4

Distribution of the different features of dataset D1. (a) 1st feature. (b) 2nd feature. (c) 3rd feature. (d) 4th feature.

Fig. 5

Distribution of the different features of the iris dataset. (a) 1st feature. (b) 2nd feature. (c) 3rd feature. (d) 4th feature.

Table 6

Weight assignments of the WKM, FRFCM and FRFCM-MKM for unnormalized D2 with the centers and weights randomly initialized

Table 7

The final weight assignments of the SCAD2 and ESSC on normalized D2

Method	1st feature	2nd feature	3rd feature	4th feature
SCAD2	0.246	0.253	0.265	0.235
	0.233	0.264	0.275	0.227
ESSC	0.198	0.321	0.329	0.152
	0.133	0.367	0.380	0.120

Table 8

Comparing the clustering results achieved by seven algorithms on dataset D2

Datasets	Performance metric	WKM	EWKM	WFCM	SCAD2	ESSC	FRFCM	FRFCM-MKM
D2	AC	0.834±0.221	0.836±0.209	0.988±0.067	0.935±0.000	0.910±0.098	0.822±0.218	0.998 ± 0.000
(unnormalized)	NMI	0.625±0.466	0.597±0.436	0.958±0.138	0.660±0.000	0.664±0.319	0.619±0.479	0.977 ± 0.000
	ARI	0.635±0.473	0.620±0.450	0.970±0.140	0.756±0.000	0.714±0.303	0.627±0.485	0.990 ± 0.000
D2	AC	0.839±0.213	0.864±0.199	0.932±0.165	0.998±0.000	0.998±0.000	0.998±0.000	0.998 ± 0.000
(Normalized)	NMI	0.621±0.459	0.712±0.431	0.841±0.342	0.977±0.000	0.977±0.000	0.977±0.000	0.977 ± 0.000
	ARI	0.635±0.467	0.683±0.452	0.852±0.347	0.990±0.000	0.990±0.000	0.990±0.000	0.990 ± 0.000

Experiment 2: the effect of suppressing noisy/redundant feature(s)

In this experiment, we add h (h = 4, 8, 12) noisy features to the 4 real features of the iris dataset. The noisy features follow a Gaussian distribution, i.e., ${\tilde{x}}_{j} \sim N (0, h)$ , where j = 1, . . . , h, and the resulting datasets are normalized. We now compare the noisy feature suppression abilities of the WKM, EWKM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM on the noisy iris datasets. Because the feature weights produced by the comparison algorithms depend on the dispersion of the features, the weights they assign to noisy features are much larger. In contrast, the FRFCM-MKM algorithm produces close-to-zero weights for noisy features, because its weight update equation is mainly controlled by part I (see Equation (21)). Fig. 7(a)∼(c) demonstrate that the WKM is able to produce large weights for important features and small weights for unimportant and noisy features, but the weights produced by the WKM are sensitive to the initialization. The WKM may produce large weight(s) for noisy features when the initialization is improper. The FRFCM can assign feature weights for the noisy iris dataset, but the assignment decreases the weights of the important features and gives comparatively large weights to noisy features (Fig. 7(d)∼(f)). Moreover, the thresholds computed by $1 / \sqrt{nd}$ for the noisy iris datasets are 0.029, 0.024, and 0.020, respectively. The thresholds are so small that the FRFCM is not capable of removing any unimportant or noisy features from the dataset; hence, the FRFCM cannot suppress the noisy features well. The FRFCM-MKM algorithm suppresses the noisy features well and produces large weights for the 3rd and 4th features, since these are the important features, small weights for the 1st and 2nd features, and close-to-zero weights for the noisy features (Fig. 7(g)∼(i)). Thus, unimportant features and noisy features can be identified and discarded.
It is interesting to note that, on the iris dataset with noisy features, our proposed method performs surprisingly well. It achieves the same results as on the original iris dataset, with AC = 0.967, NMI = 0.885 and ARI = 0.904. This indicates that the FRFCM-MKM can suppress noisy features well and that the noisy features have no negative effect on our method. Compared with the FRFCM, the improvements of the FRFCM-MKM on the iris with 4 and 8 noisy features are 5.11% for AC, 10.04% for NMI and 15.01% for ARI. Because the EWKM, SCAD2 and ESSC are unable to identify the noisy features and assign large weights to them (Fig. 8), their clustering performance is rather poor. The clustering results of the seven algorithms are listed in Table 9.
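The FRFCM thresholds quoted above follow directly from $1 / \sqrt{nd}$ with n = 150 iris samples and d = 4 + h features. The short check below uses a hypothetical helper name and also reproduces the value 0.025 quoted for D2 (n = 400, d = 4).

```python
import math

def frfcm_threshold(n, d):
    """FRFCM feature-removal threshold 1 / sqrt(n * d)."""
    return 1.0 / math.sqrt(n * d)

iris_n = 150
noisy_iris = [round(frfcm_threshold(iris_n, 4 + h), 3) for h in (4, 8, 12)]
d2 = frfcm_threshold(400, 4)
print(noisy_iris, d2)
```

With 8, 12 and 16 normalized features, a weight would have to fall below roughly 0.02 to 0.03 to be removed, which is well under the uniform weight 1/d in each case.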

Fig. 6

The weight assignments of the FRFCM-MKM under different parameters.

Fig. 7

(a)∼(c) Weight assignments of the WKM with 4, 8, and 12 noisy features added to iris, respectively. (d)∼(f) Weight assignments of the FRFCM with 4, 8, and 12 noisy features added to iris, respectively. (g)∼(i) Weight assignments of the FRFCM-MKM with 4, 8, and 12 noisy features added to iris, respectively.

Fig. 8

(a)∼(c) Weight assignments of the EWKM with 4, 8, and 12 noisy features added to iris, respectively. (d)∼(f) Weight assignments of SCAD2 with 4, 8, and 12 noisy features added to iris, respectively. (g)∼(i) Weight assignments of ESSC with 4, 8, and 12 noisy features added to iris, respectively.

Table 9

Comparing the clustering results achieved by the seven algorithms on the Iris data and noisy Iris data

Dataset	Performance metric	WKM	EWKM	WFCM	SCAD2	ESSC	FRFCM	FRFCM-MKM
4 features	AC	0.855±0.184	0.854±0.105	0.920±0.115	0.880±0.000	0.893±0.000	0.900±0.000	0.967 ± 0.000
	NMI	0.807±0.111	0.728±0.044	0.836±0.072	0.723±0.000	0.743±0.000	0.766±0.000	0.885 ± 0.000
	ARI	0.774±0.201	0.692±0.095	0.833±0.122	0.702±0.000	0.729±0.000	0.744±0.000	0.904 ± 0.000
+4 noise	AC	0.884±0.165	0.807±0.148	0.953±0.000	0.893±0.000	0.913±0.000	0.920±0.000	0.967 ± 0.000
	NMI	0.809±0.144	0.711±0.064	0.850±0.000	0.739±0.000	0.767±0.000	0.777±0.000	0.885 ± 0.000
	ARI	0.797±0.199	0.655±0.133	0.868±0.000	0.728±0.000	0.771±0.000	0.786±0.000	0.904 ± 0.000
+8 noise	AC	0.855±0.184	0.854±0.105	0.920±0.115	0.880±0.000	0.893±0.000	0.900±0.000	0.967 ± 0.000
	NMI	0.807±0.111	0.728±0.044	0.836±0.072	0.723±0.000	0.743±0.000	0.766±0.000	0.885 ± 0.000
	ARI	0.774±0.201	0.692±0.095	0.833±0.122	0.702±0.000	0.729±0.000	0.744±0.000	0.904 ± 0.000
+12 noise	AC	0.881±0.178	0.795±0.101	0.933±0.000	0.833±0.000	0.887±0.000	0.893±0.000	0.967 ± 0.000
	NMI	0.789±0.210	0.664±0.052	0.823±0.000	0.665±0.000	0.705±0.000	0.723±0.000	0.885 ± 0.000
	ARI	0.781±0.240	0.604±0.091	0.818±0.000	0.619±0.000	0.711±0.000	0.726±0.000	0.904 ± 0.000

Experiment 3: clustering results on synthetic datasets.

Now we perform clustering on six high-dimensional synthetic datasets to demonstrate the effectiveness of the proposed algorithm. Each dataset contains 1024 samples and 16 clusters, and the datasets used in this experiment can be downloaded from the webpage of the Speech and Image Processing Unit [32]. The detailed information on these six high dimensional synthetic datasets is summarized in Table 10.

Table 10

Brief information for the high dimensional synthetic datasets

Dataset	#Instance	#Feature	#Cluster	#Object in each cluster
Dim032	1024	32	16	64
Dim064	1024	64	16	64
Dim128	1024	128	16	64
Dim256	1024	256	16	64
Dim512	1024	512	16	64
Dim1024	1024	1024	16	64

Table 11 shows the clustering results of the different methods on the six synthetic datasets Dim032 to Dim1024, whose dimensionality ranges from 32 to 1,024. Compared with the other approaches, the FRFCM-MKM algorithm achieves excellent clustering results in terms of the AC, NMI and ARI on Dim256∼Dim1024. Moreover, as shown in Table 12, the run-times of the FRFCM-MKM algorithm on these six datasets are the lowest among all the algorithms. Its run-time ranges from 0.004 to 0.006 seconds on Dim256∼Dim1024. The average run-times of the WKM, EWKM, FCM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM on Dim032 to Dim1024 are 0.159, 0.421, 0.068, 4.136, 0.832, 0.842, 0.058 and 0.005 seconds, respectively. On average, the FRFCM-MKM is more than 10 times faster than the FRFCM. The SCAD2, ESSC and FRFCM methods also achieve desirable results; however, the run-times of these algorithms are relatively high. Although the FRFCM uses a feature reduction scheme, when these data are normalized, the FRFCM fails to discard any unimportant features from Dim032∼Dim256. The FRFCM removes 40 unimportant features and retains 472 features for Dim512. Meanwhile, the FRFCM removes 523 unimportant features from Dim1024 and retains 501 features. In contrast, the FRFCM-MKM retains 5, 5, 4, 6, 5, and 10 features for Dim032∼Dim1024, respectively, and therefore utilizes only a small number of features in the clustering process (Table 12).

Table 11

Clustering performance on dim032-dim1024 using the WKM, EWKM, FCM, WFCM, SCAD2, ESSC, FRFCM and FRFCM-MKM

Dataset	Performance metric	WKM	EWKM	WFCM	SCAD2	ESSC	FRFCM	FRFCM-MKM
Dim032	AC	0.682±0.083	0.933±0.043	0.994±0.021	0.995±0.017	0.995±0.019	0.995 ± 0.017	0.981±0.035
	NMI	0.905±0.031	0.982±0.013	0.998±0.006	0.999±0.005	0.999±0.005	0.999 ± 0.005	0.995±0.010
	ARI	0.681±0.093	0.932±0.046	0.995±0.020	0.995±0.018	0.995±0.019	0.995 ± 0.018	0.982±0.034
Dim064	AC	0.640±0.083	0.938±0.035	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000	0.986±0.032
	NMI	0.890±0.032	0.983±0.010	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000	0.996±0.009
	ARI	0.639±0.100	0.938±0.034	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000	0.987±0.030
Dim128	AC	0.654±0.082	0.948±0.047	0.302±0.081	1.000±0.000	1.000±0.000	1.000 ± 0.000	0.987±0.031
	NMI	0.895±0.031	0.986±0.013	0.683±0.077	1.000±0.000	1.000±0.000	1.000 ± 0.000	0.996±0.008
	ARI	0.649±0.092	0.946±0.050	0.270±0.091	1.000±0.000	1.000±0.000	1.000 ± 0.000	0.987±0.029
Dim256	AC	0.606±0.101	0.985±0.028	0.263±0.072	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
	NMI	0.876±0.041	0.996±0.007	0.657±0.063	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
	ARI	0.605±0.108	0.985±0.028	0.237±0.070	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
Dim512	AC	0.613±0.066	0.977±0.031	0.344±0.072	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
	NMI	0.883±0.026	0.994±0.008	0.720±0.057	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
	ARI	0.623±0.066	0.976±0.032	0.305±0.071	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
Dim1024	AC	0.612±0.095	0.989±0.025	0.460±0.076	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
	NMI	0.875±0.044	0.997±0.007	0.809±0.045	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000
	ARI	0.598±0.119	0.989±0.025	0.457±0.099	1.000±0.000	1.000±0.000	1.000±0.000	1.000 ± 0.000

Table 12

Numbers of original and final features obtained from the FRFCM and the FRFCM-MKM and the total run-times (in seconds) for the synthetic datasets for each algorithm

Datasets	Original dimension	Final dimension by FRFCM	Final dimension by FRFCM-MKM	Total run-time(seconds)
				WKM	EWKM	FCM	WFCM	SCAD2	ESSC	FRFCM	FRFCM-MKM
Dim032	32	32	5	0.012	0.029	0.006	0.138	0.140	0.130	0.010	0.004
Dim064	64	64	5	0.021	0.064	0.013	0.317	0.154	0.123	0.011	0.004
Dim128	128	128	4	0.059	0.186	0.028	2.514	0.380	0.415	0.031	0.005
Dim256	256	256	6	0.123	0.295	0.050	3.254	0.564	0.571	0.063	0.004
Dim512	512	472	5	0.232	0.618	0.114	6.409	1.302	1.205	0.109	0.004
Dim1024	1024	501	10	0.508	1.334	0.195	12.183	2.452	2.608	0.122	0.006
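The average run-times and the speed-up quoted in the text can be recomputed from the FRFCM and FRFCM-MKM columns of Table 12, as in this short check (the variable names are illustrative only).

```python
# Run-times in seconds for Dim032 ... Dim1024, copied from Table 12.
runtime = {
    "FRFCM": [0.010, 0.011, 0.031, 0.063, 0.109, 0.122],
    "FRFCM-MKM": [0.004, 0.004, 0.005, 0.004, 0.004, 0.006],
}

mean = {name: sum(values) / len(values) for name, values in runtime.items()}
speedup = mean["FRFCM"] / mean["FRFCM-MKM"]
print(mean, speedup)
```

The means are about 0.058 and 0.0045 seconds, giving a ratio of roughly 12.8, which is consistent with the statement that the FRFCM-MKM is about 10 times faster on average.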

Experiment 4: clustering results on real datasets

In this experiment, we investigate the performance of the proposed algorithm on some real-world datasets, whose properties are summarized in Table 13. We compare the AC, NMI and ARI of the FRFCM-MKM with those of the WKM, EWKM, WFCM, SCAD2, ESSC, and FRFCM. The comparisons use different initial cluster centers with different initial feature weights.

Table 13

Summary of some real world Datasets

Dataset	#Instances	#Features	#Cluster
iris	150	4	3
wdbc_all	569	30	2
wpbc	198	32	2
sonar	208	60	2
glass	214	9	6
Breast_cancer	699	8	2
Movement_libras	360	90	15
ionosphere	351	33	2
bupa	345	6	2
colon	62	2000	2
prostate_GE	102	5966	2
SMK_CAN_1987	187	19993	2
ORL	400	2576	40

Table 14 shows the clustering results of seven different clustering algorithms on thirteen real datasets. Compared with the other approaches, the FRFCM-MKM achieves excellent clustering results on nine out of thirteen datasets according to the AC, NMI and ARI. The improvement of the FRFCM-MKM over the FRFCM is significant in both the performance metrics and the feature reduction. On easily clustered datasets such as iris, wdbc_all, and breast_cancer, the FRFCM-MKM is superior to the other six algorithms. The differences between the FRFCM-MKM and the FRFCM on the iris dataset are 6.67%, 11.62%, and 21.51% for the AC, NMI and ARI metrics, respectively. On the wdbc_all dataset the differences are 0.76%, 3.94%, and 3.45%, while on the breast_cancer dataset they are 1.16%, 14.15%, and 2%. The numbers of final features obtained by running the FRFCM-MKM on these datasets are 2, 10, and 5, respectively, compared with 4, 30 and 9 for the FRFCM. On other hard-to-cluster datasets such as glass, movement_libras, and SMK_CAN_1987, the FRFCM-MKM is still superior to the FRFCM. The differences between the FRFCM-MKM and the FRFCM on glass are 7.55%, 7.64% and 18.29% for the AC, NMI and ARI metrics, respectively. On the SMK_CAN_1987 dataset they are 1.49%, 12.90% and 5.26%. As shown in Table 15, the FRFCM-MKM retains 4, 26, and 21 features in the final step on these datasets, respectively, while the FRFCM retains 7, 90 and 19993 features. On the wpbc dataset, the proposed algorithm retains 11 important features in the clustering process with AC = 0.576, while the FRFCM retains 31 features with AC = 0.581. On the ORL dataset, our method uses 215 important features of the data to cluster with AC = 0.503. Although the EWKM algorithm performs best on these data with AC = 0.600, it uses all 2576 features for clustering; therefore, the EWKM takes more time to cluster the data.
The clustering performance of the proposed algorithm is not the best on these datasets, but it is almost equivalent to that of the comparison algorithms while using only a small number of features. Table 16 shows the average performance over all datasets. On average, the proposed algorithm performs better than the other six methods in terms of the AC, NMI and ARI. According to the experimental results on the datasets, we can draw the following conclusions:

Table 14

Clustering performances on real datasets using the WKM, EWKM, SCAD2, FCM, FRFCM, and FRFCM-MKM

Dataset	Performance metric	WKM	EWKM	WFCM	SCAD2	ESSC	FRFCM	FRFCM-MKM
iris	AC	0.857±0.178	0.852±0.104	0.937±0.083	0.880±0.000	0.893±0.000	0.900±0.000	0.967 ± 0.000
	NMI	0.786±0.161	0.727±0.047	0.847±0.052	0.723±0.000	0.743±0.000	0.766±0.000	0.885 ± 0.000
	ARI	0.759±0.230	0.689±0.095	0.850±0.088	0.702±0.000	0.729±0.000	0.744±0.000	0.904 ± 0.000
wdbc_all	AC	0.852±0.000	0.926±0.006	0.921±0.000	0.928±0.000	0.930±0.000	0.926±0.000	0.933 ± 0.000
	NMI	0.477±0.000	0.627±0.030	0.615±0.000	0.615±0.000	0.627±0.000	0.609±0.000	0.633 ± 0.000
	ARI	0.486±0.000	0.722±0.021	0.706±0.000	0.730±0.000	0.736±0.000	0.724±0.000	0.749 ± 0.000
wpbc	AC	0.571±0.029	0.590±0.021	0.601 ± 0.000	0.581±0.000	0.581±0.000	0.581±0.000	0.576±0.000
	NMI	0.013±0.011	0.024±0.006	0.027 ± 0.000	0.015±0.000	0.015±0.000	0.021±0.000	0.013±0.000
	ARI	0.015±0.018	0.027±0.016	0.035 ± 0.000	0.020±0.000	0.020±0.000	0.022±0.000	0.018±0.000
sonar	AC	0.582±0.015	0.541±0.025	0.536±0.012	0.548±0.001	0.548±0.007	0.540±0.003	0.625 ± 0.000
	NMI	0.059±0.021	0.024±0.023	0.005±0.002	0.007±0.000	0.008±0.002	0.005±0.001	0.049 ± 0.000
	ARI	0.023±0.008	0.004±0.008	0.001±0.003	0.004±0.000	0.005±0.003	0.002±0.001	0.058 ± 0.000
glass	AC	0.468±0.041	0.461±0.040	0.466±0.037	0.435±0.029	0.441±0.015	0.437±0.027	0.470 ± 0.021
	NMI	0.310±0.053	0.332±0.049	0.333±0.038	0.334±0.029	0.329±0.027	0.314±0.030	0.338 ± 0.006
	ARI	0.197±0.042	0.191±0.045	0.201±0.038	0.183±0.022	0.181±0.015	0.175±0.027	0.207 ± 0.004
breast cancer	AC	0.946±0.043	0.659±0.016	0.951±0.000	0.951±0.000	0.957±0.000	0.948±0.000	0.959 ± 0.000
	NMI	0.693±0.100	0.026±0.039	0.706±0.000	0.704±0.000	0.729±0.000	0.692±0.000	0.718 ± 0.000
	ARI	0.798±0.117	0.009±0.029	0.812±0.000	0.813±0.000	0.834±0.000	0.802±0.000	0.818 ± 0.000
movement libras	AC	0.410±0.053	0.436 ± 0.029	0.447±0.025	0.245±0.023	0.266±0.026	0.248±0.020	0.342±0.015
	NMI	0.533±0.081	0.580 ± 0.020	0.575±0.017	0.358±0.020	0.376±0.025	0.365±0.022	0.451±0.010
	ARI	0.256±0.063	0.295 ± 0.026	0.299±0.022	0.148±0.017	0.162±0.017	0.152±0.015	0.183±0.008
ionosphere	AC	0.703±0.041	0.725 ± 0.032	0.704±0.000	0.709±0.000	0.709±0.000	0.709±0.000	0.698±0.000
	NMI	0.133±0.051	0.160 ± 0.057	0.120±0.000	0.130±0.000	0.130±0.000	0.130±0.000	0.104±0.000
	ARI	0.161±0.059	0.198 ± 0.063	0.163±0.000	0.173±0.000	0.173±0.000	0.173±0.000	0.153±0.000
bupa	AC	0.548±0.005	0.548±0.003	0.548±0.000	0.507±0.000	0.522±0.000	0.507±0.000	0.551 ± 0.000
	NMI	0.503±0.002	0.503±0.000	0.503±0.000	0.499±0.000	0.499±0.000	0.499±0.000	0.504 ± 0.000
	ARI	– 0.006±0.004	– 0.006±0.001	– 0.007±0.000	– 0.007±0.000	– 0.008±0.000	– 0.007±0.000	0.006 ± 0.000
colon	AC	0.557±0.017	0.556±0.018	0.548±0.000	0.548±0.000	0.581±0.018	0.565±0.000	0.600 ± 0.000
	NMI	0.006±0.004	0.005±0.003	0.002±0.000	0.006±0.000	0.006±0.003	0.010±0.000	0.008 ± 0.000
	ARI	– 0.003±0.010	– 0.004±0.006	– 0.008±0.000	– 0.006±0.000	0.006±0.009	0.001±0.000	0.008 ± 0.000
prostate_GE	AC	0.563±0.017	0.576±0.009	0.570±0.000	0.570±0.000	0.572±0.000	0.568±0.000	0.580 ± 0.000
	NMI	0.016±0.021	0.018±0.008	0.017±0.000	0.018±0.000	0.017±0.000	0.016±0.000	0.020 ± 0.000
	ARI	0.015±0.013	0.016±0.007	0.017±0.000	0.017±0.000	0.017±0.000	0.016±0.000	0.017 ± 0.000
SMK_CAN_1987	AC	0.537±0.015	0.549±0.015	0.551±0.000	0.604±0.000	0.604±0.000	0.604±0.000	0.613 ± 0.000
	NMI	0.004±0.004	0.007±0.004	0.007±0.000	0.031±0.000	0.031±0.000	0.031±0.000	0.035 ± 0.000
	ARI	0.002±0.005	0.006±0.006	0.006±0.000	0.038±0.000	0.038±0.000	0.038±0.000	0.040 ± 0.000
ORL	AC	0.490±0.006	0.600 ± 0.046	0.367±0.016	0.038±0.002	0.112±0.033	0.088±0.623	0.503±0.192
	NMI	0.775±0.030	0.819 ± 0.018	0.656±0.010	0.361±0.012	0.426±0.025	0.354±0.041	0.709±0.008
	ARI	0.385±0.056	0.597 ± 0.044	0.227±0.013	0.054±0.008	0.048±0.009	0.040±0.010	0.343±0.015
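Several of the percentage differences quoted in the text are consistent with a relative improvement of the FRFCM-MKM over the FRFCM, i.e., (new - old) / old. This reading is an assumption, since the text does not define the differences, but it reproduces the glass figures from the mean values in Table 14:

```python
def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# (FRFCM-MKM, FRFCM) mean values for the glass dataset, copied from Table 14.
glass = {"AC": (0.470, 0.437), "NMI": (0.338, 0.314), "ARI": (0.207, 0.175)}
gains = {metric: round(relative_improvement(new, old), 2)
         for metric, (new, old) in glass.items()}
print(gains)
```

The computed values are 7.55, 7.64 and 18.29 percent, matching the figures reported for glass.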

Table 15

Numbers of original and final features obtained from the FRFCM and FRFCM–MKM and the total run-time (in seconds) for the real datasets for each algorithm

Datasets	Original dimension d	Final dimension d by FRFCM	Final dimension d by FRFCM-MKM	total run-time (seconds)
				WKM	EWKM	WFCM	SCAD2	ESSC	FRFCM	FRFCM-MKM
iris	4	4	2	0.001	0.001	0.005	0.004	0.005	0.001	0.001
wdbc_all	30	30	10	0.002	0.003	0.045	0.036	0.034	0.003	0.003
wpbc	32	31	11	0.001	0.002	0.038	0.040	0.354	0.001	0.003
sonar	60	57	5	0.001	0.003	0.085	0.119	0.621	0.004	0.002
glass	9	7	4	0.001	0.001	0.022	0.021	0.020	0.001	0.002
breast cancer	10	9	5	0.001	0.003	0.020	0.014	0.028	0.002	0.003
movement libras	90	90	26	0.010	0.012	0.962	1.388	1.946	0.010	0.007
ionosphere	33	33	12	0.002	0.002	0.020	0.016	0.022	0.004	0.004
bupa	6	5	3	0.001	0.002	0.023	0.019	0.021	0.002	0.003
colon	2000	1700	13	0.005	0.007	0.133	0.166	2.493	0.017	0.003
prostate_GE	5966	5666	5	0.043	0.297	1.546	1.104	1.340	0.063	0.002
SMK_CAN_1987	19993	19693	21	0.231	1.903	10.431	13.828	16.211	0.284	0.009
ORL	2576	2538	22	1.302	11.391	142.839	52.946	80.028	0.608	0.020

Table 16

Average results of the FRFCM-MKM and the six other methods over the thirteen real datasets

Performance metric	WKM	EWKM	WFCM	SCAD2	ESSC	FRFCM	FRFCM-MKM
AC	0.622	0.617	0.627	0.580	0.594	0.586	0.647
NMI	0.331	0.296	0.339	0.292	0.303	0.293	0.368
ARI	0.238	0.211	0.254	0.221	0.226	0.222	0.270
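Some ARI entries in the per-dataset results are negative, which can look odd next to AC and NMI. The following is a minimal sketch of the standard pair-counting formula for the adjusted Rand index, written only to illustrate why the chance correction can push the value below zero. The function name `adjusted_rand_index` and the toy label vectors are illustrative assumptions, not the evaluation code used for the tables.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two partitions."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: nothing to compare

    # n_ij: number of points in true class i and predicted cluster j
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Toy example (hypothetical labels): the predicted clusters cut across the classes.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # relabelled but identical partition
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance, so negative
```

A value near zero therefore indicates agreement at roughly the level expected from random labelling, and a small negative value indicates slightly less agreement than that.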

The proposed algorithm is robust to initialization, whereas the other algorithms lack robustness to initialization.

The proposed method represents the feature weights well, while the other methods do not produce proper weights for features.

Compared with the FRFCM, the FRFCM-MKM achieves better results overall.

The MKM index proposed in this paper is better than the MVR index in measuring feature importance.

The threshold based on the Pythagorean means of the feature weights is more reasonable than the threshold used in the FRFCM.

Next, we analyze the computational complexity of the FRFCM-MKM, which is determined by three update steps: (1) updating the membership matrix U, which requires O(nc²d); (2) updating the cluster centers v_k, which requires O(nc); and (3) updating the feature weights w_j, which requires O(ncd²). The overall computational complexity of the FRFCM-MKM is therefore O(nc²d + nc + ncd²).
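To make the relative size of the three terms concrete, the sketch below simply evaluates nc²d, nc, and ncd² for given n, c, and d, ignoring all constants hidden by the O notation. The function names are hypothetical, and the values n = 100 and c = 2 are assumed for illustration only; d = 2000 and d = 13 are the original and final dimensions reported for the colon dataset in Table 15.

```python
def frfcm_mkm_cost_terms(n, c, d):
    """Evaluate the three stated complexity terms (constants ignored)."""
    return {
        "membership U": n * c**2 * d,  # O(nc^2 d)
        "centers v_k": n * c,          # O(nc)
        "weights w_j": n * c * d**2,   # O(ncd^2)
    }


def dominant_term(n, c, d):
    """Name of the largest of the three terms for the given sizes."""
    terms = frfcm_mkm_cost_terms(n, c, d)
    return max(terms, key=terms.get)


# n and c are assumed values; d is taken from Table 15 (colon dataset).
for d in (2000, 13):
    print(d, frfcm_mkm_cost_terms(100, 2, d), dominant_term(100, 2, d))
```

Comparing nc²d with ncd² shows that the weight update dominates whenever d > c, so under these stated terms the cost falls roughly quadratically in d as features are discarded.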

5 Conclusion

This study focuses on feature reduction learning in fuzzy clustering. We develop the MKM index to measure feature importance. Based on this index, a novel objective function is proposed that simultaneously minimizes the within-cluster dispersion and the discrepancy between the feature importance and the feature weights. A feature reduction scheme is then applied during the clustering process. From the clustering results on both synthetic and real datasets, we observe that the proposed algorithm not only achieves superior performance in terms of accuracy, adjusted Rand index, and normalized mutual information but also shows great stability with respect to initialization.

This paper proposed an index, named the MKM, for measuring feature importance. Despite its advantages, this index considers only the marginal kurtosis of the features and does not take the feature dispersion into account, which is a limitation of this work. The MKM index works on most datasets in our experiments but not on some of them, such as the ionosphere and ORL datasets. For future research, one direction is to design a better index for measuring feature importance that takes both the MKM and the feature dispersion into account, based on which new feature reduction learning algorithms will be developed. Another possible direction is to extend our method to multi-view clustering and possibilistic clustering.
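The remark that a kurtosis-based index ignores feature dispersion can be illustrated with a small sketch. The code below computes the plain sample kurtosis m4 / m2² of each feature; this is an assumption made for illustration and is not the MKM formula defined earlier in the paper, which may differ. The helper names and the toy column are hypothetical. Rescaling and shifting a feature changes its variance but leaves its kurtosis unchanged.

```python
def central_moment(values, order):
    """Central moment of the given order (population form, divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def marginal_kurtosis(column):
    """Sample kurtosis m4 / m2**2 of a single feature (no bias correction)."""
    m2 = central_moment(column, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for a constant feature")
    return central_moment(column, 4) / m2**2


def per_feature_kurtosis(rows):
    """Apply marginal_kurtosis to every column of a row-major dataset."""
    return [marginal_kurtosis(list(col)) for col in zip(*rows)]


# Toy feature (hypothetical values) and an affinely rescaled copy of it.
column = [1.0, 2.0, 3.0, 4.0, 10.0]
scaled = [0.5 * v + 3.0 for v in column]

print(central_moment(column, 2), central_moment(scaled, 2))  # variances differ
print(marginal_kurtosis(column), marginal_kurtosis(scaled))  # kurtosis agrees
```

This invariance makes a kurtosis-based index insensitive to data normalization, and it is also why dispersion has to be brought in separately if it is to influence the importance score.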

References

1. Lloyd S.P., Least squares quantization in PCM, IEEE Trans Information Theory 28 (1982), 129–137.

2. Dunn J.C., A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters, J Cybernet 3 (1973), 32–57.

3. Bezdek J.C., Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA (1981).

4. Deng, Jiang, Chung F.L., Ishibuchi, Choi and Wang, Transfer prototype-based fuzzy clustering, IEEE Transactions on Fuzzy Systems 24(5) (2016), 1210–1232.

5. Gong, Su, Jia and Chen, Fuzzy clustering with a modified MRF energy function for change detection in synthetic aperture radar images, IEEE Transactions on Fuzzy Systems 22(1) (2014), 98–109.

6. Chang S.T., Lu K.P. and Yang M.S., Fuzzy change-point algorithms for regression models, IEEE Transactions on Fuzzy Systems 23(6) (2015), 2343–2357.

7. Zhou, Chen C.L.P., Zhang and Li H.-X., Fuzzy clustering with the entropy of attribute weights, Neurocomputing 198 (2016), 125–134.

8. Guyon, Gunn, Ben-Hur and Dror, Result Analysis of the NIPS 2003 Feature Selection Challenge, in Neural Information Processing Systems (NIPS) (2005).

9. Huang J.Z., Ng M.K., Rong and Li, Automated Variable Weighting in k-Means Type Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5) (2005), 657–668.

10. Jing, Ng M.K. and Huang J.Z., An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data, IEEE Transactions on Knowledge and Data Engineering 19(8) (2007), 1026–1041.

11. Amorim R.C.D. and Mirkin, Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition 45(3) (2012), 1061–1075.

12. Svetlova, Mirkin and Lei, MFWK-Means: Minkowski metric Fuzzy Weighted K-Means for high dimensional data clustering, Proceedings of the 2013 IEEE 14th International Conference on Information Reuse and Integration, IEEE IRI 2013 (2013), 692–699.

13. Wang, Wang and Wang, Improving fuzzy c-means clustering based on feature weight learning, Pattern Recognition Letters 25(10) (2004), 1123–1132.

14. Frigui and Nasraoui, Unsupervised learning of prototypes and attribute weights, Pattern Recognition 37(3) (2004), 567–581.

15. Deng, Choi, Chung F.L. and Wang, Enhanced soft subspace clustering integrating within-cluster and between-cluster information, Pattern Recognition 43(3) (2010), 767–781.

16. Yang M.S. and Nataliani, A Feature-Reduction Fuzzy Clustering Algorithm Based on Feature-Weighted Entropy, IEEE Transactions on Fuzzy Systems 26(2) (2018), 817–835.

17. Yang and Sinaga K.P., A Feature-Reduction Multi-View k-Means Clustering Algorithm, IEEE Access 7 (2019), 114472–114486. doi: 10.1109/ACCESS.2019.2934179

18. Yang and Benjamin J.B.M., Feature-Weighted Possibilistic C-Means Clustering With a Feature-Reduction Framework, IEEE Transactions on Fuzzy Systems. doi: 10.1109/TFUZZ.2020.2968879

19. Xing H.J., Wang and Ha, A comparative experimental study of feature-weight learning approaches, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Anchorage, Alaska, USA, October 9–12, 2011, IEEE (2011).

20. Tsai C.-Y. and Chiu C.-C., Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm, Comput Statist Data Anal 52 (2008), 4658–4672.

21. Yeung and Wang, Improving performance of similarity-based clustering by feature weight learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4) (2002), 556–561.

22. Hashemzadeh, Oskouei A.G. and Farajzadeh, New fuzzy C-means clustering method based on feature-weight and cluster-weight learning, Applied Soft Computing Journal (2019).

23. Mei J.P. and Chen, Fuzzy clustering with weighted medoids for relational data, Pattern Recognition 43(5) (2010), 1964–1974.

24. Bezdek J.C., A convergence theorem for the fuzzy ISODATA clustering algorithms, IEEE Trans Pattern Anal Mach Intell 2 (1980), 1–8.

25. Wang, Wang, Chung, et al., Fuzzy partition based soft subspace clustering and its applications in high dimensional data, Information Sciences (2013).

26. Zangwill W.I., Nonlinear programming: a unified approach, Englewood Cliffs, NJ: Prentice Hall (1969).

27. Cai, He and Han, Document clustering using locality preserving indexing, IEEE Transactions on Knowledge and Data Engineering 17(12) (2005), 1624–1637.

28. Strehl and Ghosh, Cluster ensembles – A knowledge reuse framework for combining multiple partitions, J Mach Learn Res 3 (2003), 583–617.

29. Hubert and Arabie, Comparing partitions, J Classif 2(1) (1985), 193–218.

30. Arthur and Vassilvitskii, k-means++: The advantages of careful seeding, in SODA, pp. 1027–1035 (2007).

31. Rand W.M., Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66 (1971), 846–850.

32. Speech and Image Processing Unit, School of Computing, University of Eastern Finland, Clustering datasets [Online], <http://cs.joensuu.fi/sipu/datasets/>.