Simultaneous feature extraction and selection of microarray data using fuzzy-rough based multiobjective nonnegative matrix factorization

Abstract

Microarray data are important for disease detection. However, they typically contain a large number of genes and only a small number of samples, which slows convergence and reduces prediction accuracy. Dimension reduction is therefore needed as a preprocessing step before classification. Two families of methods can be used for dimension reduction: feature extraction and feature selection. Feature extraction methods transform the data into another space and then select a subset of features according to some criterion, so the projected measurements differ from the original data. Feature selection methods, in contrast, select relevant features without changing their values, but they require more time than feature extraction. Some algorithms select and extract features simultaneously to take advantage of both approaches. This paper proposes a new simultaneous feature extraction/selection method for high-dimensional microarray data. The proposed method combines the fuzzy neighborhood rough set method with nonnegative matrix factorization based on a multiobjective evolutionary algorithm. To evaluate the accuracy of our approach, computational experiments were performed on seven gene microarray datasets with diverse characteristics. The experimental results show that the proposed method outperforms the other algorithms in terms of the performance measures.

Keywords

Principal component analysis (PCA), independent component analysis (ICA), nonnegative matrix factorization (NMF), fuzzy neighborhood rough set (FNRS)

1 Introduction

In biological research involving, for example, proteomic mass spectrometry data, genome-wide microarray data, and metabolic data, there are a large number of genes (features) [60]. Determining the relevant features is important for understanding the function of genes, the types of cells, and so on. However, microarrays typically contain thousands of genes with relatively few samples, which makes efficient and accurate analysis difficult. Useful information may be lost, which leads to inaccurate predictions and slow classification [7]. Dimension reduction (DR) is therefore needed before any further analysis.

DR is defined as the operation of selecting a map that transforms a sample from the measurement space into the feature space. The two common ways to perform DR are to select features in 1) the measurement space or 2) the transformed space. The first approach, feature selection, aims to remove irrelevant or redundant features [5].

The second approach aims to build a new transformed space using all the information in the measurement space. The higher-dimensional data are thereby mapped to a lower-dimensional space; this approach is called feature extraction [17, 30].

There are several feature extraction approaches, such as principal component analysis (PCA) [6, 32], nonnegative matrix factorization (NMF) [1, 2] and independent component analysis (ICA) [3, 27]. These methods have been extensively applied to biological data analysis [18, 34]. PCA reduces the data to uncorrelated features in several steps: the covariance matrix is computed first, and then its eigenvalues and eigenvectors are computed [6]. The eigenvectors with the smallest variance are removed and, in the final step, the original data are transformed by matrix multiplication. However, PCA has some drawbacks: the number of eigenvectors must be specified, and its basis vectors are not well suited to biological data analysis because they are orthogonal, whereas the hidden patterns may not be. ICA, on the other hand, is a statistical method that decomposes a multidimensional feature into statistically independent components to discover the hidden factors in a set of random variables [3]; however, it is computationally expensive. Both ICA and PCA produce positive and negative coefficients, which contradicts physical reality in gene expression [8]. To solve this problem, Lee and Seung introduced NMF, which deals with nonnegative data [35]. Several NMF methods are presented in [11].

In all of the extraction algorithms above, the original data are transformed into a lower-dimensional feature space. These transformations are irreversible, so the projected measurements differ from the original data. To accomplish DR while keeping the feature values unchanged, feature selection algorithms are needed.

Feature selection (also called variable selection or attribute selection) is the process of selecting a subset of the existing features without a transformation [17]. Feature selection methods can be divided into two types: filter and wrapper methods. Wrapper methods [9] are used together with a learning algorithm [48, 58]. The sequential floating forward selection (SFFS) algorithm is one of the popular feature selection methods [49]. SFFS consists of two steps, an insertion step and a deletion step, which partially avoids becoming stuck at a local optimum of the correct classification rate (CCR). In [16], V. Dimitrios introduced an improved SFFS algorithm in which the execution time is reduced by a preliminary statistical test that removes redundant features. The accuracy is also improved by a tentative test that chooses the features achieving a statistically significant improvement in CCR. However, SFFS is computationally expensive and therefore cannot be applied to large data.

Filter methods select a subset of features without a learning algorithm and are faster than wrapper methods. Many algorithms use the filter approach, such as the maximum relevance, minimum redundancy (MRMR) criterion based on mutual information [43, 47], which maximizes the relevance of a gene subset while minimizing its redundancy [22].

Uncertainty is one of the main problems in data analysis. Rough set theory [42, 46] has recently been used to deal with it. Rough sets are a mathematical tool for performing DR and have many applications, including data mining [4, 23] and pattern recognition [61]. Rough sets can be used to select features with discretized feature values [28, 45]. Neighborhood rough sets [25, 26] are suitable for heterogeneous datasets. However, real applications usually involve real-valued data and fuzzy information, which leads to errors.

Combining fuzzy and rough sets makes it possible to work with uncertainty in real application data [64]. Fuzzy-rough sets also solve the discretization problem of the rough set, and they have been applied successfully to feature selection on real-valued datasets [10, 59]. In [41], a new clustering algorithm, called fuzzy–rough supervised feature clustering (FRSAC), is proposed to determine such groups of genes. The FRSAC algorithm measures the similarity between genes using the information of sample groups or class labels, so that the redundancy is reduced. However, feature selection methods have higher time complexity than feature extraction methods [38]. Hence, it is difficult to decide whether a feature selection (FS) algorithm or a feature extraction algorithm is more suitable for performing DR.

1.1 Related works

Some algorithms extract and select features simultaneously. In [19], the authors described an algorithm that combines PCA, ICA, and a fuzzy classifier for breast cancer detection. In [56, 57], ICA/PCA is combined with rough sets (RS) for image feature extraction and reduction: PCA/ICA is used for feature extraction, and the rough set is then used to refine the features. This combination is applied to face recognition in [56] and to mammogram classification in [57]. In [36], PCA is combined with the rough set for feature extraction from remote sensing images. In [52], the discretization problem of the rough ICA algorithm is avoided by using rough-fuzzy sets, and the algorithm is applied to web mining to cluster web user sessions. However, these methods (ICA, PCA) suffer from some limitations. PCA is not the optimal choice for biological data analysis because it requires the data to be orthogonal, a condition that may not be satisfied. ICA is computationally expensive, like NMF, since both provide only one solution at each iteration.

Therefore, the main contribution of this paper is to convert the NMF method into a nonlinear equation system and to use multiobjective evolutionary algorithms (MOEAs) to solve this system. With MOEAs, several solutions (located in different fronts) are obtained at each iteration.

Based on this modified version of NMF, this paper proposes an alternative feature extraction/selection algorithm. It combines a fuzzy neighborhood rough set (FRS) with NMF based on MOEAs to find the relevant genes in microarray gene expression data. Our algorithm simultaneously selects and extracts features from microarray data. Its steps are a) extracting features using NMF based on MOEA, b) computing the fuzzy similarity relation, c) selecting a reduct, and d) classifying the data. The effectiveness of the proposed algorithm, along with a comparison with other methods, is demonstrated on a set of real-life data.

The rest of this paper is organized as follows. Section 2 gives a brief overview of the fuzzy-rough set, nonlinear equation systems and nonnegative matrix factorization. Section 3 introduces the proposed NMF-based algorithm. Section 4 presents the experimental results and a statistical analysis of the proposed algorithm. The conclusion is given in Section 5.

2 Preliminary

2.1 Rough set

Rough set theory is one of the more recent mathematical approaches to imprecision and uncertainty. In an Information System (IS), the data can be represented as a table in which each object (row) is described by some information or features (columns). Mathematically, an information system is denoted by I = (U, A, V, f), where U is a non-empty finite set of objects (the universe), A is a non-empty finite set of features or attributes, V is the union of the feature domains, V = ∪_{a∈A} V_a, and f_a : U → V_a, where V_a is the set of values of feature a [46].

A decision system has the same data structure, but each object also has its own decision or class label (i.e. A = C ∪ {d}, where C is the set of condition features and d is the decision feature).

Each non-empty subset P ⊂ A determines an equivalence relation as follows:

$IND(P) = \{(x, y) \in U \times U \mid \forall a \in P,\ f(a, x) = f(a, y)\}$ (1)

The partition of U that is generated by P is denoted by U/P and given by $U / P = \{[x]_{P} \mid x \in U\}$ (2) where [x]_P is the equivalence class of x under the P-indiscernibility relation, which is the mathematical basis of rough set theory.

The lower approximation $(\underline{P} X)$ and upper approximation $(\bar{P} X)$ of the set X ⊆ U can be defined as $\underline{P} X = \{x \in U \mid [x]_{P} \subseteq X\}$ (3) $\bar{P} X = \{x \in U \mid [x]_{P} \cap X \neq \emptyset\}$ (4)

Let P, Q ⊆ A induce equivalence relations over U. The positive region of the partition U/Q with respect to P, denoted by β_P (Q), is defined as: $β_{P} (Q) = \bigcup_{X \in U / Q} \underline{P} X$ (5)

where β_P (Q) represents the set of all objects of U that can be uniquely classified into the blocks or classes of U/Q by means of P.

Determining the dependency between attributes is an important task in data analysis. Given P, Q ⊆ A, suppose that all feature values from Q are determined by the feature values from P. Then Q depends totally on P (IND (P) ⊆ IND (Q)), which is denoted by P → Q. In other words, the partition generated by P is finer than the partition generated by Q. The degree of dependency of Q on P is defined as $γ_{P} (Q) = \frac{| β_{P} (Q) |}{| U |}$ (6) where |·| represents the cardinality. If γ_P (Q) = 1, then Q depends totally on P; if γ_P (Q) = 0, then Q does not depend on P; and if 0 < γ_P (Q) < 1, then Q depends partially on P. In decision systems, the degree of dependency represents the quality of approximation of the classification.
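
As an illustration of Equations (1) to (6), the following minimal Python sketch computes equivalence classes, lower approximations, the positive region and the degree of dependency of a decision feature on a set of condition features. The function names and the toy decision table are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group object indices into the equivalence classes of IND(attrs)."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(idx)
    return list(blocks.values())

def lower_approximation(rows, attrs, target):
    """Union of the equivalence classes that are fully contained in target."""
    result = set()
    for block in partition(rows, attrs):
        if block <= target:
            result |= block
    return result

def dependency(rows, cond_attrs, dec_attrs):
    """Size of the positive region of dec_attrs w.r.t. cond_attrs, divided by |U|."""
    positive = set()
    for decision_class in partition(rows, dec_attrs):
        positive |= lower_approximation(rows, cond_attrs, decision_class)
    return len(positive) / len(rows)

# Toy decision table (hypothetical): columns 0 and 1 are condition
# features, column 2 is the decision feature.
table = [
    (0, 0, "no"),
    (0, 0, "yes"),
    (0, 1, "yes"),
    (1, 1, "yes"),
    (1, 0, "no"),
]
gamma = dependency(table, [0, 1], [2])
```

In this toy table the first two objects share the same condition values but carry different labels, so they fall outside the positive region and the dependency is 3/5.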

2.2 Fuzzy-rough set

The fuzzy-rough set model has been proposed to deal with real-world applications, which usually contain real-valued features. Fuzzy rough sets address this drawback of rough sets, since Pawlak’s rough set model can only deal with data sets with discrete values [46]. Following [38], given a non-empty finite set X and a binary relation R defined on X, R is denoted by a relation matrix M (R) [31]: $M (R) = \begin{pmatrix} r_{11} & r_{12} & \dots & r_{1 n} \\ r_{21} & r_{22} & \dots & r_{2 n} \\ \dots & \dots & \dots & \dots \\ r_{n 1} & r_{n 2} & \dots & r_{nn} \end{pmatrix}$ (7)

where r_ij ∈ [0, 1] is the relation value of x_i and x_j.

The fuzzy partition of the universe U, generated by a fuzzy equivalence relation R, is defined as: $U / R = \{[x_{i}]_{R}\}_{i = 1}^{n},$ (8)

where [x_i]_R is the fuzzy equivalence class generated by x_i and R. Since R is a fuzzy equivalence relation, U/R is a fuzzy partition and [x_i]_R is a fuzzy set defined as: $[x_{i}]_{R} = \frac{r_{i 1}}{x_{1}} + \frac{r_{i 2}}{x_{2}} + \dots + \frac{r_{in}}{x_{n}}$ (9)

Given a fuzzy equivalence relation R on the universe U, the lower and upper approximations of X can be defined as: $\underline{R} X = \{x_{i} \mid [x_{i}]_{R} \subseteq X, x_{i} \in U\}$ (10) $\bar{R} X = \{x_{i} \mid [x_{i}]_{R} \cap X \neq \emptyset, x_{i} \in U\}$ (11)

2.3 The nonlinear equation systems as multiobjective optimization problem

Nonlinear equation systems (NESs) have received considerable attention in many fields, such as physics [24], chemical processes [44], engineered materials [21] and robotics [12]. An NES can be stated as follows: $e_{i} (h) = 0, i = 1, 2, \dots, n$ (12)

where h = [h₁, …, h_m] is the decision vector of dimension m. Several methods convert the NES into a multiobjective optimization problem by using e_i (h) as a fitness function. For example, Grosan and Abraham [20] proposed an algorithm called CA, which consists of two stages: 1) the NES is represented in the form of a multiobjective optimization problem (MOP); 2) an MOEA is used to solve the MOP.

However, the complexity of solving the problem transformed by CA with an MOEA is high, since the number of objective functions is equal to the number of equations in the NES.

In [63], an algorithm called multiobjective optimization for NESs (MONES) is presented. In this algorithm, the NES is transformed into a bi-objective optimization problem through two parts: 1) the first part defines the location function as:

$\begin{matrix} min α_{1} (h) = h_{1} \\ min α_{2} (h) = 1 - h_{1} \end{matrix}$ (13)

where h = [h₁, …, h_m] represents the decision vector and h₁ is the first decision variable.

2) The second part is the system function, which has the following form:

$\begin{aligned} \min γ_{1} (h) &= \sum_{j = 1}^{n} | e_{j} (h) | = \sum_{j = 1}^{n} | f_{j} (h) - v_{j} | \\ \min γ_{2} (h) &= n \cdot \max (| e_{1} (h) |, \dots, | e_{n} (h) |) \end{aligned}$ (14)

These two parts are then combined as in Equation (15), which represents the bi-objective optimization problem [63]: $\begin{aligned} \min f_{1} (h) &= α_{1} (h) + γ_{1} (h) = h_{1} + \sum_{j = 1}^{n} | e_{j} (h) | \\ \min f_{2} (h) &= α_{2} (h) + γ_{2} (h) = 1 - h_{1} + n \cdot \max (| e_{1} (h) |, \dots, | e_{n} (h) |) \end{aligned}$ (15)
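
To make the transformation concrete, the sketch below evaluates the two objectives of Equation (15) for a given decision vector. The function name `bi_objective` and the two-equation example system are hypothetical illustrations, not part of MONES [63] itself.

```python
def bi_objective(h, equations):
    """Return (f1, f2) of Equation (15) for the decision vector h.

    equations is a list of callables e_j(h) whose common root is sought.
    """
    residuals = [abs(e(h)) for e in equations]
    n = len(residuals)
    f1 = h[0] + sum(residuals)            # location term h_1 plus summed residuals
    f2 = 1.0 - h[0] + n * max(residuals)  # location term 1 - h_1 plus scaled worst residual
    return f1, f2

# Hypothetical system with the single root h = (0.5, 0.5).
system = [
    lambda h: h[0] + h[1] - 1.0,
    lambda h: h[0] - h[1],
]
at_root = bi_objective([0.5, 0.5], system)
off_root = bi_objective([0.25, 0.5], system)
```

At any root all residuals vanish, so f1 + f2 = 1 there, whereas any non-zero residual pushes the sum above 1.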

The transformed problem is solved with one of the most popular MOEAs, the non-dominated sorting genetic algorithm II (NSGA-II) [14].

The first step in NSGA-II is to generate a random population P of size N and compute the fitness values of each individual in P. NSGA-II then randomly selects two individuals, x and y, using binary tournament selection. The individual with the better nondomination fitness is chosen as a parent and added to a mating pool (of size N/2). For example, x is chosen as the parent if f (x) < f (y); if f (x) = f (y), the individual that resides in the less crowded region is selected (for details of the crowding distance calculation, refer to [13]).

To construct the offspring Q_G (the new individuals in the feasible solution space), crossover and polynomial mutation operators are applied to the selected parents. Polynomial mutation uses the polynomial probability distribution. The offspring are created as: $y = p_{i}^{t} + (p_{i}^{u} - p_{i}^{l}) \bar{δ}$ (16)

where p_i^u and p_i^l are the upper and lower bounds of the i-th variable and the parameter $\bar{δ}$ is calculated from the polynomial probability distribution as: $\bar{δ} = \begin{cases} (2 r_{i})^{\frac{1}{η_{m} + 1}} - 1, & 0 \leq r_{i} < 0.5 \\ 1 - (2 (1 - r_{i}))^{\frac{1}{η_{m} + 1}}, & 0.5 \leq r_{i} \leq 1 \end{cases}$ (17)

Here η_m ∈ R⁺ is the mutation constant and r_i ∈ [0, 1] is a random number.
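
The following sketch illustrates Equations (16) and (17). It assumes the usual form of polynomial mutation in which the first branch is (2r)^{1/(η_m+1)} − 1, so that the perturbation is continuous at r = 0.5. Mutating every variable, clipping the result to the bounds, and the default η_m = 20 are simplifying assumptions made here for illustration.

```python
import random

def polynomial_delta(r, eta_m):
    """Perturbation factor of Equation (17) for a random number r in [0, 1]."""
    exponent = 1.0 / (eta_m + 1.0)
    if r < 0.5:
        return (2.0 * r) ** exponent - 1.0
    return 1.0 - (2.0 * (1.0 - r)) ** exponent

def polynomial_mutation(parent, lower, upper, eta_m=20.0, rng=random):
    """Mutate every variable as in Equation (16) and clip it to its bounds."""
    child = []
    for p, lo, hi in zip(parent, lower, upper):
        delta = polynomial_delta(rng.random(), eta_m)
        child.append(min(hi, max(lo, p + (hi - lo) * delta)))
    return child

random.seed(0)
child = polynomial_mutation([0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
```

A larger η_m concentrates the perturbation around zero, so the offspring stay closer to the parent.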

2.4 Nonnegative matrix factorization

Nonnegative matrix factorization aims to decompose a nonnegative matrix V ∈ R^{I×J} into a product of two nonnegative matrices W ∈ R^{I×k} and H ∈ R^{k×J}, such that WH approximates V as well as possible [1]. NMF has many applications, including document clustering [53], face recognition [35], signal processing [50] and music transcription [54]. A commonly used loss function is the Euclidean distance:

$D_{ED} (V, WH) = \frac{1}{2} \| V - WH \|^{2} = \frac{1}{2} \sum_{ij} (V_{ij} - [WH]_{ij})^{2}$ (18)

A general algorithm for minimizing the above loss function is based on multiplicative updates, defined as [35]: $W_{ik} = W_{ik} \frac{(V H^{T})_{ik}}{(W H H^{T})_{ik}}, \quad H_{kj} = H_{kj} \frac{(W^{T} V)_{kj}}{(W^{T} W H)_{kj}}$ (19)
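
The multiplicative updates can be sketched in plain Python as follows. The sketch assumes the usual Lee and Seung form, in which the denominator of the H update is W^T W H. The small constant `eps`, the toy matrix V, the rank k = 2 and the number of iterations are illustrative assumptions.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def loss(v, w, h):
    """Half the squared Frobenius norm of V - WH, as in Equation (18)."""
    wh = matmul(w, h)
    return 0.5 * sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def update(v, w, h, eps=1e-12):
    """One round of multiplicative updates; eps guards against division by zero."""
    num, den = matmul(transpose(w), v), matmul(matmul(transpose(w), w), h)
    h = [[x * n / (d + eps) for x, n, d in zip(rh, rn, rd)]
         for rh, rn, rd in zip(h, num, den)]
    num, den = matmul(v, transpose(h)), matmul(w, matmul(h, transpose(h)))
    w = [[x * n / (d + eps) for x, n, d in zip(rw, rn, rd)]
         for rw, rn, rd in zip(w, num, den)]
    return w, h

rng = random.Random(0)
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0]]  # toy nonnegative data
k = 2
W = [[rng.random() + 0.1 for _ in range(k)] for _ in V]
H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(k)]
history = [loss(V, W, H)]
for _ in range(200):
    W, H = update(V, W, H)
    history.append(loss(V, W, H))
```

Because every factor in the update is nonnegative, W and H stay nonnegative without any explicit projection step.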

There are also many other algorithms for solving the NMF problem, such as projected gradient, Newton and others described in [11]. However, all of these approaches suffer from some drawbacks: they depend on the initial solution, easily become stuck in a locally optimal solution, and need gradient information. In addition, these algorithms aim to find only one optimal solution at each iteration. Therefore, an MOEA is used to find several optimal solutions in a single run [13, 14].

3 The proposed algorithm

The proposed Fuzzy Neighborhood Rough MONMF method (called FNRMONMF) is illustrated in Algorithm 1 and Fig. 1. The method extracts features using NMF and then applies the upper and lower approximations to find the reduct set of features. The FNRMONMF method consists of three stages: 1) feature extraction, 2) feature selection and 3) classification.

Fig.1

The framework of the proposed FNRMONMF algorithm.

Algorithm 1: FNRMONMF Algorithm
1. Input: Data matrix V
2. Output: Γ, the set of reduct features.
3. Initialize Γ = ∅
Feature extraction stage
4. [W, H] = MOEANMF(V, k, ɛ)
Feature selection stage
5. Compute the relation matrix M (R) using (7) and
$r_{ij} = \begin{cases} 1 & Δ^{w_{k}} (w_{ik}, w_{jk}) \leq \epsilon \\ 0 & \text{otherwise} \end{cases}$
6. For index = Size(W, 2):-1:1
7. For l = 1:index
8. Compute SIG (l) = γ_{M_l} (d) for each M_l
9. EndFor
10. Select w_k which satisfies SIG (w_k, W, d) = $max_{l} (SIG (w_{l}, Γ, d))$
11. If SIG (w_k, W, d) > 0
Γ = Γ ∪ {k}
12. Else
Return Γ
13. EndIf
14. EndFor
15. Return Γ
Classification stage
16. Split the data using the fivefold CV method.
17. Train the KNN and SVM classifiers.
18. Compute the performance.

Before any stage, the method initializes the set of selected features to the empty set (i.e. Γ = ∅).

Feature Extraction Stage:

The next step in the proposed method is to extract the features from the data V by decomposing it into WH, in which W represents the extracted features. The NMF is first transformed into a multiobjective problem (MOP) using the generic transformation [63] as in Equation (15), by considering e_j (h) = f_j (h) - v_j = (wh)_j - v_j.

The MOEANMF algorithm (Algorithm 2) then decomposes V, taking the number of clusters k and the tolerance ɛ as input.

Algorithm 2: MOEANMF (V, k, ɛ)
1. Input: V input data, k number of clusters, ɛ tolerance
2. Output: W, H, iter, objective value and time
3. W = rand (I, k)
4. H = rand (k, J)
5. Repeat
6. For j = 1 to J do // in parallel environment
7. H (:, j) = MOEA (V (:, j), W) // solves (12)
8. EndFor
9. For i = 1 to I do // in parallel environment
10. W (i, :) = MOEA (V (i, :), H) // solves (12)
11. EndFor
12. Until ‖WH - V‖ ≤ ɛ

The next step is to generate two random matrices H and W and then solve the converted NMF for H as an NES using the MOEA (Algorithm 3) in parallel. The best elements of H are then selected from the first front, which contains many solutions, and the matrix W is updated using the MOEA (with H fixed). These steps are repeated until the error between WH and V is less than ɛ. In this stage, the running time can be reduced by computing W and H in a parallel environment, since each W (i, :) does not depend on the other rows of W (and likewise for the columns of H).

Algorithm 3: MOEA Algorithm
1. Input: V, W, N = (I or J)
2. Output: The best population P.
3. G = 0; // G is the generation number
4. Randomly generate an initial population P_G of size N from the decision space.
5. Evaluate each individual in P_G based on (15).
6. Apply binary tournament selection, simulated binary crossover, and polynomial mutation (16) to generate the offspring population Q_G.
7. Evaluate each individual in Q_G based on (15).
8. Z_G = P_G ∪ Q_G.
9. Divide Z_G into several nondomination levels (denoted as ND₁, ND₂, …, ND_N) using fast nondominated sorting.
10. P_{G+1} = ∅ and i = 1.
11. While |P_{G+1}| < N
P_{G+1} = P_{G+1} ∪ ND_i and i = i + 1
12. EndWhile
13. Let P_{G+1} = P_{G+1} ∖ ND_{i-1}, delete (|ND_{i-1}| + |P_{G+1}| - N) individuals with the smallest crowding-distance values in ND_{i-1}, and let P_{G+1} = P_{G+1} ∪ ND_{i-1}.
14. If the stopping criterion is satisfied, stop and output the final population; otherwise set G = G + 1 and go to step 6.
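
The division into nondomination levels used in Algorithm 3 can be illustrated with the sketch below. It uses a simple repeated-peeling procedure rather than the bookkeeping of the fast nondominated sorting in NSGA-II [14], and the sample points are hypothetical. Both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_fronts(objectives):
    """Split the indices of the objective vectors into fronts ND_1, ND_2, ..."""
    remaining = set(range(len(objectives)))
    fronts = []
    while remaining:
        # A point belongs to the current front if nothing left dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(objectives[j], objectives[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

points = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (4.0, 4.0)]
fronts = nondominated_fronts(points)
```

The first three points are mutually nondominated and form the first front; the remaining two are peeled off one level at a time.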

Feature Selection Stage:

The next step in FNRMONMF is to select the best subset of features from W. The relation matrix M (R) is therefore computed for each feature of W using Equation (7), and the significance of each w_i ∈ W - Γ is computed as:

$\begin{aligned} γ_{W} (d) &= | {POS}_{W} (d) | / | W |, \\ {POS}_{W} (d) &= \bigcup_{X \in W / d} \underline{B} (X), \end{aligned}$ (20) where $\underline{B} (X)$ and POS_W (d) are the lower approximation and the positive region, respectively, W is the set of all features and d is the class label.

Then the feature w_k with the largest significance is selected. Finally, if the significance of w_k is greater than zero, it is added to Γ; otherwise Γ is returned.
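
The selection loop can be sketched as a greedy forward search. The sketch below rests on several assumptions that the text does not fix: the relation is the crisp neighborhood relation r_ij = 1 if |w_ik − w_jk| ≤ ε, several features are combined by intersecting their neighborhoods, and the significance of a candidate is its gain in dependency. The function names and the toy matrix are hypothetical.

```python
def dependency(W, labels, features, eps):
    """Fraction of samples whose eps-neighborhood on `features` has a single class."""
    if not features:
        return 0.0
    consistent = 0
    for i, row_i in enumerate(W):
        neighbors = [
            j for j, row_j in enumerate(W)
            if all(abs(row_i[k] - row_j[k]) <= eps for k in features)
        ]
        if all(labels[j] == labels[i] for j in neighbors):
            consistent += 1
    return consistent / len(W)

def select_reduct(W, labels, eps):
    """Greedy forward selection: add the column with the largest gain until no gain is left."""
    reduct, best = [], 0.0
    candidates = set(range(len(W[0])))
    while candidates:
        gains = {k: dependency(W, labels, reduct + [k], eps) - best for k in candidates}
        k_star = max(sorted(gains), key=gains.get)
        if gains[k_star] <= 0:
            break
        reduct.append(k_star)
        best += gains[k_star]
        candidates.remove(k_star)
    return reduct

# Toy coefficient matrix: rows are samples, columns are extracted features.
W = [
    [0.10, 0.90, 0.50],
    [0.15, 0.10, 0.52],
    [0.80, 0.85, 0.49],
    [0.85, 0.15, 0.51],
]
labels = ["a", "a", "b", "b"]
reduct = select_reduct(W, labels, eps=0.1)
```

In this toy example only the first column separates the two classes, so the search stops after selecting it.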

Classification stage:

To assess the performance of the selected features, the dataset is classified using either the KNN or the SVM classifier. These two classifiers are widely used in the literature and provide good results, as in [65–67]. Before the classifier is applied, the dataset is divided using the fivefold cross-validation (CV) method. This method splits the dataset into 5 groups, selects 4 groups as the training set and one group as the testing set, and repeats this process 5 times until every group has been used for testing. The output of this process is the average classification accuracy over the 5 runs.
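
The fivefold procedure can be sketched as follows. For brevity the sketch uses a 1-nearest-neighbor rule as a stand-in for the KNN classifier, and the two-class data set is hypothetical; the function names are illustrative.

```python
import random

def kfold_indices(n, folds=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into `folds` nearly equal groups."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[f::folds] for f in range(folds)]

def nearest_neighbor(train_x, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    def squared_distance(a):
        return sum((p - q) ** 2 for p, q in zip(a, x))
    best = min(range(len(train_x)), key=lambda i: squared_distance(train_x[i]))
    return train_y[best]

def cross_validate(xs, ys, folds=5, seed=0):
    """Average test accuracy, with each group serving once as the testing set."""
    scores = []
    for test in kfold_indices(len(xs), folds, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        hits = sum(nearest_neighbor(tx, ty, xs[i]) == ys[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Two well separated hypothetical classes, ten samples each.
xs = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 1.0) for i in range(10)]
ys = [0] * 10 + [1] * 10
accuracy = cross_validate(xs, ys)
```

Every sample is used for testing exactly once, so the returned value averages the accuracy over all five held-out groups.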

4 Experimental results

To evaluate the accuracy of the FRNMONMF algorithm in classifying the high-dimensional biological data given in Table 1, it is compared with nine algorithms, namely ICA [18], PCA [32], RS [46], FRS [39], neighborhood RS (NRS) [25], Fuzzy-Rough ICA (FRICA) [52], Multi-Cluster/Class Feature Selection (MFCS) [15], Entropy NMF (ENMF) [64] and Fuzzy-Rough neighborhood NMF (FRNNMF).

Table 1
The description of datasets

Dataset No.	Dataset	Samples	Genes	Classes	Ref.
1	ALLAML	38	5000	3	[64]
2	Brain Tumor	90	5905	5	[62]
3	CNS	60	7129	5	[64]
4	Colon	62	2000	2	[62]
5	DLBCL	77	5469	2	[62]
6	Leukemia1	72	5230	3	[64]
7	9_Tumors	60	5726	9	[5]

Each dataset is divided into training and testing sets using the fivefold cross-validation (CV) method. Two classifiers, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM), are used since they provide good results in the literature [65, 66]. The experiments are implemented in Matlab 2014b on Windows 7 (64-bit), running on a Core2 Duo CPU with 4 GB of RAM. For the statistical analysis, all the compared algorithms are run 30 times.

4.1 Discussion

Figs. 2 and 3 (and Tables 2–3) show the average performance over all datasets using SVM and KNN, respectively. From these results we can observe that ICA has among the lowest accuracies across the datasets, whereas MFCS generally performs better than ICA (except on Colon and Brain Tumor). RS, FRS and NRS have nearly the same accuracy on all the datasets. We can also conclude that, in general, the FRNMONMF algorithm performs better than the FRNNMF algorithm (except on Colon and 9 Tumors).

Fig.2

The average classification accuracy of each algorithm over all datasets using SVM.

Fig.3

The average classification accuracy of each algorithm over all datasets using KNN.

Table 2

The average classification accuracy of the algorithms using SVM on each dataset

	RS	FRS	NRS	ICA	PCA	MFCS	FRNMONMF	FRNNMF	ENMF	FRICA
ALLAML	85.88	86.95	87.05	80	67.82	80	92.5	90.95	88	85.6
Brain Tumor	86.916	86.19	85.91	80.5	90.86	74.86	96.18	94.94	89.95	85.55
CNS	88.96	89.62	87.62	71.81	88.88	79.87	97.58	95.45	85.07	75.07
Colon	80.9	81.22	82.02	79.66	87.61	75.83	86.55	88.66	88.7	85.55
DLBCL	87.01	86.66	85.17	69.66	75.9	76.89	96.32	95.83	74.53	74.53
Leukemia	93.06	91.17	92.79	83.17	82.88	84.31	96.86	95.25	93.16	88.04
9 Tumors	75.99	75.26	76.26	74.71	90	83.36	94.76	95.26	89.7	79.17

Table 3

The average classification accuracy of the algorithms using KNN on each dataset

	RS	FRS	NRS	ICA	PCA	MFCS	FRNMONMF	FRNNMF	ENMF	FRICA
ALLAML	84.04	83.86	83.78	70	80	80.43	94.73	92.04	83.47	77.6
Brain Tumor	75.05	75.05	75.05	72.14	78.51	77.59	85.96	82.18	76.29	70.29
CNS	90.9	91.2	90.2	77.9	88.2	83.2	96.92	93.8	94.08	82.17
Colon	81.03	79.86	80.06	86.75	85.33	85.06	95.68	96.86	80.55	70.06
DLBCL	77.38	77.03	77.88	71.59	81.06	79.68	89.42	88.53	78.43	74.21
Leukemia	86.58	86.88	87.18	66.96	85.32	78.33	97.53	96.68	96.67	70.71
9 Tumors	89.09	88.95	88.88	78.76	82.42	84.17	91.27	93.88	87.08	82.05

4.2 Statistical analysis

For further statistical analysis, the non-parametric Wilcoxon rank sum test is used [67]. This test determines whether there is a significant difference between the median of the control group (the proposed algorithm) and those of the other groups (algorithms) at a significance level of 5%. The null hypothesis assumes that there is no significant difference between the proposed algorithm and the other algorithms.
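
As a rough illustration of how such a test operates, the sketch below implements a two-sided rank sum test with the normal approximation and no tie correction, which is a simplification of what statistical packages provide. The two samples of 30 accuracies are hypothetical values and not results from this paper.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test via the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # average rank shared by tied values
        ranks_x += mid_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    n, m = len(x), len(y)
    mean = n * (n + m + 1) / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (ranks_x - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical accuracies from 30 runs of two algorithms.
proposed = [0.90 + 0.001 * i for i in range(30)]
baseline = [0.80 + 0.001 * i for i in range(30)]
z, p = rank_sum_test(proposed, baseline)
significant = p < 0.05
```

A p-value below 0.05 leads to rejecting the null hypothesis of no difference at the 5% level.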

Table 4
The time required by each algorithm to select the features

	RS	FRS	NRS	ICA	PCA	MFCS	FRNMONMF	FRNNMF	ENMF	FRICA
ALLAML	5.1	7.8	6.90	6.50E3	4.90E3	5.10E3	5.30E3	5.70E3	5.40E3	7.30E3
Brain Tumor	6.1	6.8	8.10	5.90E4	5.10E4	5.30E4	5.20E4	5.50E4	5.10E4	6.50E4
CNS	6.9	4.50E1	4.30E1	4.80E4	4.10E4	4.60E4	4.10E4	4.20E4	4.50E4	4.90E4
Colon	0.7	1.5	2.1	4.10E4	3.20E4	3.30E4	4.00E4	3.60E4	3.90E4	5.20E4
DLBCL	9.1	4.90E1	5.60E1	3.00E3	2.10E3	2.20E3	2.80E3	3.10E3	2.80E3	3.40E3
Leukemia	6.8	3.50E1	4.00E1	4.00E3	2.90E3	2.70E3	4.60E3	4.70E3	4.20E3	4.50E3
9 Tumors	0.5	1.9	2.80	8.90E3	7.90E3	8.20E3	8.30E3	8.50E3	8.40E3	9.40E3
Average	5.028	2.10E1	2.27E1	2.43E4	2.02E4	2.14E4	2.20E4	2.21E4	2.22E4	2.72E4

Table 5

The results of Wilcoxon rank sum test

		RS	FRS	NRS	ICA	PCA	MFCS	FRNNMF	ENMF	FRICA
ALLAML	SVM	1	1	1	1	1	1	0	0	1
	KNN	1	1	1	1	1	1	0	1	1
Brain Tumor	SVM	1	1	1	1	1	1	0	1	1
	KNN	1	1	1	1	1	1	0	1	1
CNS	SVM	1	1	1	1	1	1	0	1	1
	KNN	1	1	1	1	1	1	0	0	1
Colon	SVM	1	1	1	1	0	1	0	0	0
	KNN	1	1	1	1	1	1	0	1	1
DLBCL	SVM	1	1	1	1	1	1	0	1	1
	KNN	1	1	1	1	1	1	0	1	1
Leukemia	SVM	0	1	0	1	1	1	0	0	1
	KNN	1	1	1	1	1	1	0	0	1
9 Tumors	SVM	1	1	1	1	1	1	0	0	1
	KNN	0	0	0	1	1	1	0	1	1

Table 5 shows the results of the Wilcoxon rank sum test, where, consistent with the discussion below, 1 indicates a significant difference from the FRNMONMF algorithm and 0 indicates none. The table shows that there is a significant difference between the FRNMONMF algorithm and the other algorithms, except the FRNNMF algorithm. There is also no significant difference with ENMF on the Leukemia dataset, on ALLAML, Colon and 9 Tumors when SVM is used, and on CNS when KNN is used. For the PCA and FRICA algorithms, there is no significant difference only on the Colon dataset with SVM. Moreover, the results of the proposed algorithm are not significantly different from those of RS and NRS on Leukemia (with SVM) and 9 Tumors (with KNN). Similarly, when FRS is compared with FRNMONMF, only the 9 Tumors dataset with KNN shows no significant difference.

From all the previous results, it can be observed that FRNMONMF is the better algorithm and improves the classification accuracy on microarray datasets that contain a large number of genes. FRNNMF is the only algorithm that shows no significant difference on any dataset, although the proposed algorithm still gives, in general, slightly better results than FRNNMF.

The good performance of the proposed FRNMONMF method results from two facts: 1) the fuzzy neighborhood rough set determines the degree of dependency between the features and the label (target) feature; 2) in the proposed method, the NMF is converted into a bi-objective optimization problem and an MOEA is used to determine the optimal solutions. This differs from solving the NMF through the multiplicative rule, which can become stuck and provides only one solution at each iteration (for W or H).

5 Conclusion

Dimension reduction is a necessary preprocessing step for microarray data, because such data contain many irrelevant or redundant genes (features). The fuzzy-rough set is one of the popular feature selection methods and, unlike the classical rough set, does not need the dataset to be discretized. However, fuzzy-rough algorithms suffer from a large computation time and are therefore not well suited to big data. Feature extraction methods such as PCA and ICA can be used for dimension reduction, but they change the meaning of the original data and produce negative values. In this paper, we combine the feature extraction method NMF with fuzzy-rough sets to improve classification performance. NMF attempts to maximize the similarity among the extracted features. The main drawback of NMF is the need to determine the number of clusters; fuzzy-rough sets are used to reduce this dependency. In addition, the MOEA is used with NMF to find multiple optimal solutions, which is not done in other NMF algorithms that aim to find just one optimal solution in each run. The results on seven microarray datasets show that the proposed algorithm is effective and that the irrelevant features can be removed without decreasing classification performance.

References

1. Abd El Aziz M.E., Khidr, Nonnegative Matrix Factorization Based On Projected Hybrid Conjugate Gradient Algorithm, Signal, Image and Video Processing, 2014.

2. Abd El Aziz M.E. and Khidr, A novel algorithm for source localization based on nonnegative matrix factorization using αβ-divergence in cochleagram, WSEAS Transactions on Computers 10(10) (2013).

3. Abd El-Aziz M.E., EL-Sayed Waheed and Osama A.M., Mixture of Generalized Gamma Density-Based Score Function for Fastica, Hindawi Publishing Corporation, 2011.

4. Adjei, Chen, Heng-Da, Cooley D.H., Cheng R.J. and Twombly, A fuzzy search method for Rough Sets in Data mining, IFSA World Congress and 20th NAFIPS International Conference, 2, 2001, pp. 980–985.

5. Alon, et al., Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays, Proc Nat’l Academy of Sciences USA 96(12) (1999), 6745–6750.

6. Boutsidis, Mahoney M.W. and Drineas, Unsupervised Feature Selection for Principal Components Analysis, KDD’08, Las Vegas, Nevada, USA, 2008, pp. 1–12.

7. Brunet J.P., et al., Metagenes and molecular pattern discovery using matrix factorization, Proc Natl Acad Sci USA 101 (2004), 4164–4169.

8. Carmona-Saez, Pascual-Marqui, Tirado, Carazo and Pascual-Montano, Biclustering of gene expression data by nonsmooth non-negative matrix factorization, BMC Bioinformatics 7(1) (2006).

9. Caruana and Freitag, Greedy Feature Selection, Proceedings of the 11th International Conference on Machine Learning, 1994, pp. 28–36.

10. Chen, Zhang, Zhao, Hu and Zhu, A novel algorithm for finding reducts with fuzzy rough sets, IEEE Transactions on Fuzzy Systems 20(2) (2012), 385–389.

11. Cichocki, Zdunek, Phan A.-H. and Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis, John Wiley, ISBN: 978-0-470-74666-0, 2009, p. 552.

12. Collins C.L., Forward kinematics of planar parallel manipulators in the Clifford algebra of P2, Mechanism and Machine Theory 37(8) (2002), 799–813.

13. Deb and Srinivas, Multiobjective optimization using nondominated sorting in genetic algorithms, Evol Comp 2(3) (1994), 221–248.

14. Deb, Multi-objective optimization using evolutionary algorithms, Baffins Lane, Chichester, Wiley, 2001.

15. Deng, Chiyuan and Xiaofei, Unsupervised feature selection for multi-cluster data, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2010.

16. Dimitrios and Kotropoulos, Fast and accurate sequential floating forward feature selection with the Bayes classifier applied to speech emotion recognition, Signal Processing 88(12) (2008), 2956–2970.

17. J.G. and Brodley C.E., Feature subset selection and order identification for unsupervised learning, in Proc 17th International Conference on Machine Learning, 2000, pp. 247–254.

18. Engreitz, Daigle Jr, Marshall and Altman, Independent component analysis: Mining microarray data for fundamental human gene expression modules, J Biomedical Informatics 43 (2010), 932–944.

19. Fadi and Ikhlas, A Computer-Aided Diagnosis System for Breast Cancer Using Independent Component Analysis and Fuzzy Classifier, Modelling and Simulation in Engineering, Volume 2008, Article ID 238305, 2008, p. 9.

20. Grosan and Abraham, A new approach for solving nonlinear equation systems, IEEE Transactions on Systems Man and Cybernetics - Part A 38(3) (2008), 698–714.

21. , Qu, Li and Han, AC Losses in HTS tapes and devices with transport current solved through the resistivity-adaption algorithm, IEEE Transactions on Applied Superconductivity 23(2) (2013), 8201708.

22. Guyon and Elisseeff, An introduction to variable and feature selection, J Mach Learn Res 3 (2003), 1157–1182.

23. Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2000.

24. Henderson, Sacco W.F., Barufatti and Ali M.M., Calculation of critical points of thermodynamic mixtures with differential evolution algorithms, Industrial & Engineering Chemistry Research 49(4) (2010), 1872–1882.

25. , Yu, Liu and Wu, Neighborhood rough set based heterogeneous feature subset selection, Information Sciences 178 (2008), 3577–3594.

26. , Pedrycz, Yu and Lang, Selecting discrete and continuous features based on neighborhood decision error minimization, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 40(1) (2010), 137–150.

27. Hyvärinen and Oja, Independent component analysis by general nonlinear Hebbian-like learning rules, Signal Processing 64(3) (1998), 301–313.

28. Jensen and Shen, Semantics-preserving dimensionality reduction: Rough and fuzzy-rough-based approach, IEEE Transactions on Knowledge and Data Engineering 16(12) (2004), 1457–1471.

29. Jensen and Shen, New approaches to fuzzy-rough feature selection, IEEE Transactions on Fuzzy Systems 17(4) (2009), 824–838.

30. Dai J.J., Lieu and Rocke, Dimension Reduction for Classification with Gene Expression Microarray Data, Statistical Applications in Genetics and Molecular Biology, 5, No. 1, Article 6, 2006.

31. Jianhua and Qing, Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification, Applied Soft Computing 13 (2013), 211–221.

32. Jolliffe I.T., Principal Components Analysis, New York: Springer-Verlag, 2002.

33. Joseph, King B.W., Diane, Ali, Wold B.J. and Hart C.E., Mining gene expression data by interpreting principal components, BMC Bioinformatics 7 (2006), 194.

34. Kim and Park, Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis, Bioinformatics 23 (2007), 1495–1502.

35. Lee D.D. and Seung H.S., Algorithms for non-negative matrix factorization, in Advances in Neural Information Processing Systems, 13, 2001, pp. 556–562.

36. Lei, Wan and Chou, The comparison of PCA and discrete rough set for feature extraction of remote sensing image classification – A case study on rice classification, Comput Geosci 12(1) (2008), 1–14.

37. Maji and Pal S.K., Feature selection using f-information measures in fuzzy approximation spaces, IEEE Transactions on Knowledge and Data Engineering 22(6) (2010), 854–867.

38. Maji and Paul, Rough set based maximum relevance-maximum significance criterion and gene selection from microarray data, International Journal of Approximate Reasoning 52(3) (2011), 408–426.

39. Maji, Fuzzy-rough supervised feature clustering algorithm and classification of microarray data, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 41(1) (2011).

40. Maji and Pal S.K., Fuzzy-rough sets for information measures and selection of relevant genes from microarray data, IEEE Trans Syst, Man, Cybern B, Cybern 40(3) (2010), 741–752.

41. Maji and Garai, Fuzzy-rough simultaneous feature selection and feature extraction algorithm, Cybernetics, IEEE Transactions on 43 (2013), 1166–1177.

42. Maulik and Chakraborty, Fuzzy preference based feature selection and semisupervised SVM for cancer classification, IEEE Transactions on Nanobioscience 13(2) (2014).

43. Ooi, Chetty and Teng, Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data, BMC Bioinformatics 7 (2006), 320–339.

44. Patrascioiu and Marinoiu, The applications of the non-linear equations systems algorithms for the heat transfer processes, in Proceedings of the 12th WSEAS International Conference on Mathematical Methods, Computational Techniques and Intelligent Systems, 2010, pp. 30–35.

45. Parthalain, Shen and Jensen, A distance measure approach to exploring the rough set boundary region for feature reduction, IEEE Transactions on Knowledge and Data Engineering 22(3) (2010), 305–317.

46. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Dordrecht, The Netherlands: Kluwer, 1991.

47. Peng, Long and Ding, Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans Pattern Anal Mach Intell 27(8) (2005), 1226–1238.

48. Piyushkumar and Jagath, SVM-RFE with MRMR filter for gene selection, IEEE Transactions on Nanobioscience 9(1) (2010).

49. Pudil, Novovicova and Kittler, Floating search method in feature selection, Pattern Recognition Letters 15 (1994), 1119–1125.

50. Sajda, Du, Brown, Stoyanova, Shungu, Mao, et al., Nonnegative matrix factorization for rapid recovery of constituent spectra in magnetic resonance chemical shift imaging of the brain, IEEE Transactions on Medical Imaging 23(12) (2004), pp. 1453–1465.

51. Sassi R.J., Silva L.A. and Hernandez E.M., Neural networks and rough sets: A comparative study on data classification, Int Conf Artificial Intelligence (ICAI’06) 1(1) (2006), 1–10.

52. Siriporn, Salim, Ngadiman M.S., Chimphlee and Srinoy, Independent Component Analysis And Rough Fuzzy Based Approach To Web Usage Mining, Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, AIA 2006, 2006, pp. 422–427.

53. Shahnaz and Berry, Document clustering using nonnegative matrix factorization, Inf Process Manag 42 (2006), 373–386.

54. Smaragdis and Brown, Non-negative matrix factorization for polyphonic music transcription, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2004, pp. 177–180.

55. Swiniarski and Skowron, Independent component analysis, principal component analysis and rough sets in face recognition, Transactions on Rough Sets I, Lecture Notes in Computer Science 3100 (2004), 392–404.

56. Swiniarski, Lim H.K., Shin J.H. and Skowron, Independent component analysis, principal component analysis and rough sets in hybrid mammogram classification, in Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV ’06), Las Vegas, Nev, USA, vol. 2, 2006, pp. 640–645.

57. Tang, Zhang Y.-Q. and Huang, Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis, IEEE Trans Comput Biol Bioinformatics 4(3) (2007), 365–381.

58. Tsang E.C.C., Chen, Yeung D.S., Wang X.-Z. and Lee, Features reduction using fuzzy rough sets, IEEE Transactions on Fuzzy Systems 16(5) (2008), 1130–1141.

59. Golub T.R., Slonim D.K., Tamayo, Huard, Gaasenbeek, Mesirov J.P., Coller, Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D. and Lander E.S., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286(5439) (1999), 531–537.

60. Weixiang, Tianfu and Siping, Regularized Nonnegative Matrix Factorization for Clustering Gene Expression Data, 2013 IEEE International Conference on Bioinformatics and Biomedicine.

61. Weixiang, Kehong and Datian, On α-divergence based nonnegative matrix factorization for clustering cancer gene expression data, Artificial Intelligence in Medicine 44 (2008), 1–5.

62. Yong, Song, Han-Xiong and Zixing, Locating multiple optimal solutions of nonlinear equation systems based on multiobjective optimization, IEEE Trans Evol Comp (2014).

63. Yifeng and Alioune, Nonnegative least-squares methods for the classification of high-dimensional biological data, IEEE/ACM Transactions on Computational Biology and Bioinformatics 10(2) (2013).

64. Zhao, Tsang E.C.C., Chen and Wang, Building a rule-based classifier: A fuzzy-rough set approach, IEEE Transactions on Knowledge and Data Engineering 22(5) (2010), 624–638.

65. El Aziz M.A. and Hassanien A.E., An improved social spider optimization algorithm based on rough sets for solving minimum number attribute reduction problem, Neural Comput & Applic (2016).

66. El Aziz M.A. and Hassanien A.E., Modified cuckoo search algorithm with rough sets for feature selection, Neural Comput & Applic (2016).

67. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1(6) (1945), 80–83.


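Table 1
The description of datasets

| Dataset No. | Dataset | Samples | Genes | Classes | Ref. |
|---|---|---|---|---|---|
| 1 | ALLAML | 38 | 5000 | 3 | [64] |
| 2 | Brain Tumor | 90 | 5905 | 5 | [62] |
| 3 | CNS | 60 | 7129 | 5 | [64] |
| 4 | Colon | 62 | 2000 | 2 | [62] |
| 5 | DLBCL | 77 | 5469 | 2 | [62] |
| 6 | Leukemia1 | 72 | 5230 | 3 | [64] |
| 7 | 9_Tumors | 60 | 5726 | 9 | [5] |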
