Feature genes selection using Fisher transformation method

Abstract

Selecting feature genes with high discriminative ability from gene expression profiles is of great significance in biology. However, most existing feature gene selection methods have high time complexity, which leads to poor performance. Motivated by this, we propose an effective feature selection method, called Fisher transformation (FT), which is based on an improved Fisher discriminant analysis (FDA) algorithm and a neighborhood rough set algorithm. The FT method has two benefits: (1) the multiple neighborhood rough set algorithm is used to solve the small sample size problem of FDA; (2) the improved FDA algorithm is used to select feature genes and to remedy the weak classification ability of conventional FDA. Furthermore, we measure the impact of the FT approach on the final selection result. The results obtained on four public tumor microarray datasets provide insight into both the benefits and the limitations of the method, paving the way for the exploration of new and broader feature selection schemes.

Keywords

Fisher discriminant analysis; neighborhood rough set; feature selection; Fisher transformation

1 Introduction

Tumors develop through either acquired mutations or epigenetic changes that cause differential gene expression profiles in cancerous cells. Microarray technology is widely used in genomic research to identify differentially expressed genes, to study the activation of oncogene pathways and to detect novel biomarkers for clinical tumor diagnosis [1, 2]. Microarray datasets often contain fewer than 200 samples, each with thousands of genes. Since the number of genes is typically much larger than the number of samples, the classification of microarray data is subject to the curse of dimensionality [3, 4].

To account for the effect of gene sets on tumor classification, two types of feature selection methods have been employed: wrapper methods and filter methods [5]. Wrapper methods search the space of possible feature gene subsets and estimate the quality of a feature subset S in terms of prediction accuracy [6]. For example, the Markov blanket (MB) [7] was introduced into an incremental wrapper-based feature subset selection (FSS) process, rather than evaluating the quality of all the features ranked by a filter method. Inza [8] used a greedy search algorithm called sequential forward selection together with various supervised learning machines to find expressed genes, and showed that classification accuracy improved significantly with a notably reduced number of genes. In contrast, filter methods select a subset of features in a pre-processing step, and the selected features are then used to design a classifier [9]. Nguyen [10] proposed a novel gene selection method based on a modification of the analytic hierarchy process (AHP). Dan [11] proposed an MRMR (Max-Relevance and Min-Redundancy) criterion. Wrapper methods usually achieve better performance in classification tasks, but they have higher time complexity and unsatisfactory generalization capability. Filter methods, by contrast, are typically much faster than wrapper methods and are suitable for gene expression profile datasets [12].

The Fisher discriminant analysis (FDA) algorithm [13] is a classical filter-type feature selection method that is widely employed in biological information processing. The goal of the FDA algorithm is to find an optimal projection that separates the samples as well as possible (i.e. it seeks the projection that maximizes the ratio of the between-class scatter to the within-class scatter of the projected samples). In recent years, spectral graph theory has been applied to the FDA algorithm, combining local manifold data mining with between-class discrimination. To cope with high dimensionality, Yan [14] proposed a marginal Fisher analysis method in which an intrinsic graph describes the intra-class distance by connecting each data point with its neighboring points of the same class, while a penalty graph considers the marginal points and describes the inter-class separability. He [15] put forward a novel semi-supervised method called maximum margin projection (MMP), which is designed to discover the local manifold structure. However, all of these methods often encounter the small sample size (SSS) problem [16].

At present, several methods are used to solve the SSS problem (e.g. the pseudo-inverse, pre-processing, perturbation and singular value decomposition methods) [17]. The pre-processing strategy performs feature selection to reduce the dimensionality of the original data to a certain degree and thereby ensure that the within-class scatter matrix is invertible. For instance, the Fisherface method [18] uses principal component analysis to reduce the data to (n - c) dimensions, where n is the number of samples and c is the number of classes, and then uses linear discriminant analysis (LDA) for projection. Indeed, this two-step processing strategy achieves more reliable and robust selection results [19].

Numerical attribute reduction based on the neighborhood rough set (NRS) algorithm has attracted wide attention in recent years [20]. Many studies have shown that it performs well in terms of stability, especially on high-dimensional data [21, 23]. The neighborhood size is determined directly by the threshold setting in the NRS algorithm, and a proper threshold δ yields a subset of highly predictive features. However, a global neighborhood is usually used when dealing with the decision system (i.e. each sample uses the same neighborhood value for every combination of condition attributes). This approach has high time complexity and does not necessarily yield the optimal δ value [24].

To address the singularity of the within-class scatter matrix in the FDA algorithm and its poor performance in classifying gene expression profiles, this paper puts forward a novel feature selection method, called Fisher transformation (FT), for classifying tumor microarray datasets. A pre-processing strategy is used for feature selection: it reduces the dimensionality of the original data to a certain degree and thereby guarantees that the within-class scatter matrix is invertible. First, an improved multi-neighborhood rough set method is adopted, and the original data are mapped into a low-dimensional space. Then the improved maximizing margin Fisher criterion algorithm is used for feature selection; it introduces a scaling factor in order to make full use of the scale invariance of the weight matrix. Finally, the performance of various classifiers is compared on the selected feature gene datasets. Our results show that the FT method selects a small set of non-redundant, disease-related genes with high specificity.

2 Basic concepts

2.1 Neighborhood rough set

Rough set theory is regarded as a promising approach to feature selection, but it achieves low classification accuracy on genomic datasets. A key reason is that the continuous real-valued gene expression profiles must first be discretized. The neighborhood rough set, however, provides a new way of dealing with numeric attribute sets directly [25].

Definition 1. Let Ω be a given N-dimensional real space, where R is the set of real numbers and R^N is the N-dimensional real vector space. A function Δ: R^N × R^N → R is called a distance function (metric) on R^N.

Definition 2. Given a real space Ω and a non-empty finite set U = {x₁, x₂, …, x_n}, the neighborhood δ of any x_i ∈ U is defined as δ (x_i) = {x | x ∈ U, Δ (x, x_i) ≤ δ}, where δ ≥ 0.

Definition 3. Given a non-empty finite set U = {x₁, x₂, …, x_n}, let A and D be the conditional attribute set and the decision attribute set, respectively. If the condition attributes A generate a δ-neighborhood relation N on U, the neighborhood decision system is represented as NDS = <U, A ∪ D, δ>.

Definition 4. Given a neighborhood decision system NDS = <U, A ∪ D, δ>, suppose the domain U is divided into N equivalence classes (X₁, X₂, …, X_N) by the decision attribute D. For any B ⊆ A, the upper and lower approximations of the decision attribute D with respect to the subset B are defined as $\bar{N}_{B} D = \bigcup_{i = 1}^{N} \bar{N}_{B} X_{i}$ (1) $\underline{N}_{B} D = \bigcup_{i = 1}^{N} \underline{N}_{B} X_{i}$ (2) where $\bar{N}_{B} X = \{x_{i} | δ_{B} (x_{i}) \cap X \neq \emptyset, x_{i} \in U\}$ and $\underline{N}_{B} X = \{x_{i} | δ_{B} (x_{i}) \subseteq X, x_{i} \in U\}$.

The boundary of the decision system is defined as $BN (D) = \bar{N}_{B} D - \underline{N}_{B} D$ (3)

The positive domain and negative domain of the neighborhood decision system are defined as ${Pos}_{B} (D) = \underline{N}_{B} D$ (4) ${Neg}_{B} (D) = U - \bar{N}_{B} D$ (5)

The dependency degree of D on B is defined as $k_{D} = γ_{B} (D) = \frac{| {Pos}_{B} (D) |}{| U |}$ (6)

It can be concluded from the above equations that the attribute dependency k_D is monotonic: if B₁ ⊆ B₂ ⊆ … ⊆ A, then γ_{B₁} (D) ≤ γ_{B₂} (D) ≤ ⋯ ≤ γ_A (D). For a condition attribute a ∈ A - B, the contribution of the single feature a to the approximation of D can be defined as $SIG (a, B, D) = γ_{B \cup \{a\}} (D) - γ_{B} (D)$ (7)
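As an illustration of Equations (4), (6) and (7), the following minimal Python sketch computes the dependency degree and the significance of an attribute on a toy dataset. It assumes NumPy and the Euclidean distance as Δ; the function names and the toy data are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np

def delta_neighborhood(X, i, attrs, delta):
    """Indices of samples within Euclidean distance delta of sample i on attrs."""
    cols = np.asarray(attrs, dtype=int)
    dist = np.linalg.norm(X[:, cols] - X[i, cols], axis=1)
    return np.flatnonzero(dist <= delta)

def dependency(X, y, attrs, delta):
    """gamma_B(D) = |Pos_B(D)| / |U|, Eq. (6).

    A sample is in the positive domain when its whole neighborhood
    carries the same decision label (lower approximation, Eq. (4)).
    """
    pos = [i for i in range(len(X))
           if np.all(y[delta_neighborhood(X, i, attrs, delta)] == y[i])]
    return len(pos) / len(X)

def significance(X, y, a, attrs, delta):
    """SIG(a, B, D) = gamma_{B u {a}}(D) - gamma_B(D), Eq. (7)."""
    return dependency(X, y, list(attrs) + [a], delta) - dependency(X, y, attrs, delta)

# Toy data (hypothetical): attribute 0 separates the classes, attribute 1 does not.
X = np.array([[0.0, 0.0], [0.1, 1.0], [1.0, 0.1], [1.1, 1.1]])
y = np.array([0, 0, 1, 1])
print(dependency(X, y, [0], 0.2))       # attribute 0 alone
print(dependency(X, y, [1], 0.2))       # attribute 1 alone
print(significance(X, y, 0, [1], 0.2))  # gain from adding attribute 0 to {1}
```

On this toy data, attribute 0 alone gives full dependency while attribute 1 alone gives none, so adding attribute 0 to the subset {1} yields the maximum possible significance.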

2.2 Fisher discriminant analysis

The aim of feature selection is to map the original data points into a low-dimensional space while preserving the geometric and topological relations between the data points. An ideal low-dimensional mapping should maximize the inter-class dissimilarity and minimize the intra-class dissimilarity. In more detail, the FDA algorithm defines the separation between two distributions as the ratio of the between-class variance to the within-class variance. It is generally assumed that the FDA algorithm has an explicit mapping between the high-dimensional data u_i and the low-dimensional representation Y_i, that is $Y_{i} = V^{T} u_{i}$ (8) where V is the projection matrix.

Given a space U containing N samples of dimensionality d, the mean vector m_i of each class is $m_{i} = \frac{1}{N_{i}} \sum_{u \in U_{i}} u, i = 1, 2$ (9) where U_i is the set of samples in class i and N_i is the number of samples in U_i.

The within-class scatter matrix S_i of each class and the global within-class scatter matrix S_w are $S_{i} = \sum_{u \in U_{i}} (u - m_{i}) (u - m_{i})^{T}, i = 1, 2$ (10) $S_{w} = S_{1} + S_{2}$ (11)

The global between-class scatter matrix S_b is $S_{b} = (m_{1} - m_{2}) (m_{1} - m_{2})^{T}$ (12) where both S_w and S_b are symmetric positive semidefinite matrices, and the rank of S_b is at most 1 in binary classification tasks.

Let w₁, w₂, …, w_c denote the c classes, and let S_b, S_w and S_t denote the global between-class scatter matrix, the within-class scatter matrix and the total scatter matrix, respectively.

The FDA criterion function is defined as follows. $J_{f} (u) = u^{T} S_{b} u / (u^{T} S_{w} u), u \neq 0$ (13)

The generalized FDA criterion function is defined as follows. $J_{f} (u) = u^{T} S_{b} u / (u^{T} S_{t} u), u \neq 0$ (14) where S_b, S_w and S_t are non-negative definite matrices, and S_t = S_b + S_w.
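The scatter matrices and the criterion in Equation (13) can be sketched in a few lines of Python. This is a minimal illustration for the two-class case, assuming NumPy; the function names and the toy data are hypothetical.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class S_w and between-class S_b for two classes (Eqs. 9-12)."""
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("this sketch covers the binary case only")
    d = X.shape[1]
    S_w = np.zeros((d, d))
    means = []
    for c in classes:
        Xc = X[y == c]
        m = Xc.mean(axis=0)          # class mean, Eq. (9)
        centered = Xc - m
        S_w += centered.T @ centered  # S_i summed into S_w, Eqs. (10)-(11)
        means.append(m)
    diff = (means[0] - means[1])[:, None]
    S_b = diff @ diff.T               # Eq. (12), rank at most 1
    return S_w, S_b

def fisher_criterion(u, S_b, S_w):
    """J_f(u) = u^T S_b u / (u^T S_w u), Eq. (13); u must not lie in the null space of S_w."""
    return float(u @ S_b @ u) / float(u @ S_w @ u)

# Toy data (hypothetical): the classes differ along the first axis only.
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0]])
y = np.array([0, 0, 1, 1])
S_w, S_b = scatter_matrices(X, y)
print(fisher_criterion(np.array([1.0, 0.0]), S_b, S_w))
print(fisher_criterion(np.array([0.0, 1.0]), S_b, S_w))
```

Projecting onto the first axis gives a much larger criterion value than projecting onto the second, which matches the intuition that the class means differ only along the first axis.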

3 Proposed approach

3.1 Multiple neighborhood rough set

The neighborhood calculation is central to the neighborhood rough set, so it is crucial to choose a suitable measure of the similarity between genomic samples. In studies of the neighborhood rough set, the neighborhood value is usually set by experience and experiment. In addition, the decision system is commonly handled with a global neighborhood (i.e. each sample uses the same neighborhood value for every combination of condition attributes). Experiments have found that: 1) small changes in the parameter δ have a significant impact on attribute reduction and rule extraction; 2) the optimal parameter δ is obtained through a search algorithm, which often has high time complexity and may even fail to find the optimal δ. The main cause of this situation is the varied distribution of the data: it is usually difficult to describe the data accurately with a single neighborhood value. Moreover, the classification results fluctuate as δ changes, which affects the stability of the model.

With the generalization of the model in mind, the Euclidean distance is chosen as the measure; it is the most common choice for real-valued data and can prevent over-fitting to a certain degree. In addition, the neighborhood influences the interaction among different attributes in the neighborhood rough set algorithm, and under this interaction some genes may have a negative effect on classification. Therefore, we propose a multiple neighborhood rough set algorithm (i.e. a different neighborhood is set for each attribute).

Given a decision table DT = (U, A ∪ D, {V_a}, f_a), a ∈ A, the Euclidean distance Δ (x, y, R) between any two points x, y ∈ U on a feature subset R ⊆ A is $Δ (x, y, R) = \sqrt{\sum_{a \in R} (f_{a} (x) - f_{a} (y))^{2}}$ (15)

Within the feature subset R, the Euclidean distance is used to calculate the neighborhood on each attribute value. For each x ∈ U, the multiple neighborhood restricted to the feature set R is defined as $L_{R}^{Δ} (x) = \{y | y \in U, Δ (x, y, \{a\}) \leq r\}$ (16) where a ∈ R and r is a threshold.
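One possible reading of Equation (16) is sketched below in Python. It assumes that a sample y belongs to the multiple neighborhood of x when the single-attribute distance Δ(x, y, {a}) = |f_a(x) - f_a(y)| is within the threshold r for every attribute a in R; this interpretation, the function name and the toy data are assumptions made for illustration only.

```python
import numpy as np

def multiple_neighborhood(X, i, attrs, r):
    """Indices of samples within threshold r of sample i on every attribute in attrs.

    Assumed reading of Eq. (16): the distance is evaluated attribute by
    attribute, so Delta(x, y, {a}) reduces to |f_a(x) - f_a(y)|.
    """
    inside = np.ones(len(X), dtype=bool)
    for a in attrs:
        inside &= np.abs(X[:, a] - X[i, a]) <= r
    return np.flatnonzero(inside)

# Toy data (hypothetical), with the threshold value r = 0.56 used later in the experiments.
X = np.array([[0.0, 0.0], [0.5, 0.1], [0.3, 0.9], [2.0, 2.0]])
print(multiple_neighborhood(X, 0, [0], 0.56))     # neighborhood on attribute 0 only
print(multiple_neighborhood(X, 0, [0, 1], 0.56))  # neighborhood on both attributes
```

Under this reading, adding an attribute can only shrink a neighborhood, which is consistent with the monotonicity of the dependency degree.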

3.2 Maximizing margin Fisher criterion

For microarray data, the number of genes is typically larger than the number of samples. An effective approach is to map the original data into a low-dimensional space in which the topological relations between the data points are unchanged, or changed as little as possible. In the best low-dimensional mapping, samples of different categories are far apart and samples of the same category are close to each other. The FDA criterion function combines the within-class scatter matrix and the between-class scatter matrix along the projection vector. The physical interpretation of taking the vector V that maximizes the objective function J_f (u) as the projection direction is that, after projection, the ratio of the between-class scatter to the within-class scatter is maximal [26].

However, the conventional FDA algorithm shows weak classification ability. To overcome this limitation, an improved maximizing margin Fisher criterion is applied in a novel nonlinear feature selection method.

The conventional maximizing margin criterion [27] is defined as $J = \max \{\sum_{ij} p_{i} p_{j} [d (m_{i}, m_{j}) - s (m_{i}) - s (m_{j})]\}$ (17) where p_i and p_j are the prior probabilities of classes i and j, d (m_i, m_j) is the between-class distance, and s (m_i) is a measure of the dispersion of class i.

In statistics, s (m_i) can be measured by the trace (tr) of the covariance matrix of class i. Equation (17) can then be rewritten as $J = 2 tr (S_{b} - S_{w})$ (18)

To improve the classification performance on genomic datasets, we introduce a scaling factor into Equation (18), making full use of the scale invariance of the weight matrix. First, the centroid of each class is kept unchanged. Then a scaling operation is performed between the samples of each class and their centroid. Consequently, the between-class scatter matrix S_b remains unchanged, while the scaling makes the within-class scatter matrix S_w more compact. The derivation is shown in Equations (20) and (21), where μ is the scaling factor (μ ≥ 0), Y_j is a low-dimensional embedding vector, n_i is the number of samples in class w_i, m_i and m are the mean of class w_i and the global mean, and a prime denotes a quantity after scaling. Combining with Equation (8), the improved maximizing margin criterion is obtained as $J = tr \{V^{T} (S_{b} - μ S_{w}) V\}$ (19) $\begin{matrix} S_{b}^{'} & = & \sum_{i = 1}^{c} n_{i} (m_{i}^{'} - m^{'}) (m_{i}^{'} - m^{'})^{T} \\ & = & \sum_{i = 1}^{c} n_{i} (m_{i} - m) (m_{i} - m)^{T} = S_{b} \end{matrix}$ (20) $\begin{matrix} S_{w}^{'} & = & \sum_{i = 1}^{c} \sum_{Y_{j} \in w_{i}} (Y_{j}^{'} - m_{i}^{'}) (Y_{j}^{'} - m_{i}^{'})^{T} \\ & = & μ \sum_{i = 1}^{c} \sum_{Y_{j} \in w_{i}} (Y_{j} - m_{i}) (Y_{j} - m_{i})^{T} = μ S_{w} \end{matrix}$ (21) where each column of V is a unit vector. The improved maximizing margin Fisher criterion algorithm is expressed as the eigenvalue problem $(S_{b} - μ S_{w}) V = λ V$ (22) where λ is an eigenvalue and V is the corresponding eigenvector.

In more detail, the best projection direction V is the one that maximizes J. As in the FDA algorithm, the eigenvector corresponding to the largest eigenvalue is the best projection direction. If the eigenvectors corresponding to the first d eigenvalues (sorted in descending order) are selected as projection directions, the samples are mapped into a d-dimensional space.
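The eigenvalue problem in Equation (22) and the selection of the top d eigenvectors can be sketched as follows. This is a minimal Python illustration assuming NumPy; the function name and the toy matrices are hypothetical, and only the value μ = 0.4 is taken from the experimental settings reported later.

```python
import numpy as np

def mmc_projection(S_b, S_w, mu, d):
    """Top-d eigenpairs of (S_b - mu * S_w), Eq. (22).

    Both scatter matrices are symmetric, so the symmetric solver
    np.linalg.eigh applies; it returns unit-norm eigenvectors with
    eigenvalues in ascending order, which are reordered here.
    """
    M = S_b - mu * S_w
    M = (M + M.T) / 2.0                    # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1][:d]  # indices of the d largest eigenvalues
    return eigvals[order], eigvecs[:, order]

# Toy scatter matrices (hypothetical).
S_b = np.array([[16.0, 0.0], [0.0, 0.0]])
S_w = np.array([[4.0, 0.0], [0.0, 4.0]])
vals, V = mmc_projection(S_b, S_w, mu=0.4, d=1)

# Project samples stored as rows; each row of Y equals V^T u_i as in Eq. (8).
U = np.array([[1.0, 5.0], [3.0, -2.0]])
Y = U @ V
print(vals, V.ravel(), Y.ravel())
```

Because the columns of V are unit eigenvectors, the criterion in Equation (19) evaluated at this V equals the sum of the selected eigenvalues.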

3.3 Feature genes selection using Fisher transformation

In the FDA algorithm, the large number of features and the small number of samples may make some matrices non-invertible during the computation (i.e. the SSS problem), whereas the neighborhood rough set performs very well in attribute reduction. We therefore propose an effective FT method for the selection of feature genes. First, the multiple neighborhood rough set algorithm is used for attribute reduction, which maps the data into a new feature space. Then the improved maximizing margin Fisher criterion algorithm is used for feature selection and for finding a better scaling factor. The FT method for feature gene selection is described in Algorithm 1.

Algorithm 1. Feature genes selection using Fisher transformation
Input: Dataset U = (x₁, x₂, …, x_N)
Output: Feature subset S
Step1: Initialize the attribute reduction set red = ∅
Step2: Calculate the neighborhood $L_{R}^{Δ} (x)$ of each attribute a_i
Step3: For each attribute a_i, calculate the positive domain Pos and the significance SIG of a_i using the neighborhood rough set, and sort the a_i in descending order of SIG
Step4: If SIG > β, record k, set red = red + a_k and U = U - Pos_k, and return to Step3; if SIG ≤ β, output the reduction result red. Here Pos_k is the positive domain of attribute a_k and β is the lower limit of significance
Step5: Calculate the mean vector m_i of each class on the reduction result red, where m_i = 1/N_i · ∑_{u∈U_i} u and N_i is the number of samples in class U_i
Step6: Calculate the global within-class scatter matrix S_w and the between-class scatter matrix S_b, where the within-class scatter matrix of each class is S_i = ∑_{u∈U_i} (u - m_i) (u - m_i)^T, i = 1, 2, S_w = S₁ + S₂ and S_b = (m₁ - m₂) (m₁ - m₂)^T
Step7: Solve the eigenvalue problem in Equation (22), select the eigenvectors corresponding to the d largest eigenvalues as projection directions, and obtain the optimal projection matrix V = [v₁, v₂, …, v_d]
Step8: Project the data into the d-dimensional space by Y_i = V^T u_i to obtain the feature subset S
Step9: End

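Steps 1 to 4 of Algorithm 1 can be sketched as a greedy forward search. The Python code below is a minimal illustration under several assumptions that are not stated in the algorithm itself: an attribute is added only while its significance exceeds β, significance is measured as the fraction of all samples newly placed in the positive domain, and neighborhoods are evaluated attribute by attribute with a single threshold r. The function names and the toy data are hypothetical.

```python
import numpy as np

def pure_samples(X, y, attrs, r, active):
    """Active samples whose per-attribute r-neighborhood has a single class label."""
    idx = np.flatnonzero(active)
    sub = X[np.ix_(idx, attrs)]
    pos = []
    for row, i in enumerate(idx):
        inside = np.all(np.abs(sub - sub[row]) <= r, axis=1)
        if np.all(y[idx][inside] == y[i]):
            pos.append(i)
    return pos

def greedy_reduction(X, y, r, beta):
    """Forward attribute reduction (Steps 1-4, assumed reading)."""
    n, p = X.shape
    red = []                              # Step1: red = empty set
    active = np.ones(n, dtype=bool)       # samples not yet in the positive domain
    while active.any() and len(red) < p:
        best_a, best_pos = None, []
        for a in range(p):                # Step3: evaluate every remaining attribute
            if a in red:
                continue
            pos = pure_samples(X, y, red + [a], r, active)
            if best_a is None or len(pos) > len(best_pos):
                best_a, best_pos = a, pos
        sig = len(best_pos) / n           # gain in dependency from the best attribute
        if sig <= beta:                   # Step4: stop when the gain is too small
            break
        red.append(best_a)
        active[best_pos] = False          # U = U - Pos_k
    return red

# Toy data (hypothetical): only attribute 0 separates the two classes.
X = np.array([[0.0, 0.0], [0.1, 1.0], [1.0, 0.1], [1.1, 1.1]])
y = np.array([0, 0, 1, 1])
print(greedy_reduction(X, y, r=0.2, beta=0.01))
```

Removing the samples that already lie in the positive domain after each step keeps later iterations cheaper, since only the still-ambiguous samples need to be re-examined.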

4 Experimental results

4.1 Datasets

To verify the effectiveness of the proposed FT algorithm, four public tumor microarray datasets are used in simulation experiments. All of them represent binary classification tasks. Detailed information on the datasets is shown in Table 1.

Table 1
Experimental datasets

Dataset	No. of features	No. of instances
Leukemia [28]	7129	72
Colon [29]	2000	62
Lung [30]	12600	203
Prostate [31]	12600	102

All numerical experiments were performed on a personal computer with a 3.1 GHz AMD Athlon(tm) II processor and 4 GB of memory, running Windows 7 with Matlab R2010 and Weka 3.9.0.

4.2 Results and analysis

Before applying the FT method, the PCA algorithm is used to analyze the four gene expression profile datasets, and a Pareto diagram of the variance explained by the principal components (i.e. the information contained in the genomic data) is drawn for each dataset (the blue curve represents the cumulative information content of the first n components). The results are shown in Fig. 1.

Fig.1

Pareto diagram of the principal components explained variance.

The cumulative contribution rate of most datasets (all except the Lung dataset) exceeds 90 percent when 50 principal components are retained (see Fig. 1). This illustrates that gene expression profile datasets contain a large amount of redundancy (i.e. irrelevant and confounding factors) and that the feature genes form only a small set, so it is necessary to remove the redundancy before gene selection. It also supports the design of the FT method, which uses the multiple neighborhood rough set algorithm to remove redundancy (i.e. the pre-processing strategy).
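The cumulative contribution rate discussed above can be computed from the singular values of the centered data matrix. The following Python sketch assumes NumPy; the function names and the toy data are hypothetical and unrelated to the four datasets.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative fraction of variance explained by the leading principal components."""
    Xc = X - X.mean(axis=0)                   # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    var = s ** 2                              # proportional to component variances
    return np.cumsum(var) / var.sum()

def components_needed(X, target):
    """Smallest number of components whose cumulative rate reaches target."""
    cum = cumulative_explained_variance(X)
    return int(np.searchsorted(cum, target) + 1)

# Toy data (hypothetical): most of the variance lies along the first axis.
X = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(cumulative_explained_variance(X))
print(components_needed(X, 0.85), components_needed(X, 0.95))
```

Plotting the returned cumulative curve against the component index gives the line of a Pareto diagram such as the one in Fig. 1.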

After repeated trials, we set the threshold r = 0.56, the attribute significance lower limit β = 0.01 and the scaling factor μ = 0.4. We then vary the number of retained eigenvalues k from k = 2 to k = 15 (i.e. k is the number of feature genes) to find its optimal value. The average accuracy as a function of k is shown in Fig. 2. The average correct rate and the number of feature genes are shown in Table 2.

Fig.2

Classifier results.

Table 2

Number of selected genes and average correct rate

Dataset	No. of selected genes	Average correct rate (%)
Leukemia	4	99.01
Colon	3	97.36
Lung	5	100
Prostate	7	98.79

The curve shows an overall declining trend (see Fig. 2), and the best number of feature genes is less than 10. This illustrates that the average correct rate differs markedly for different values of k and that a larger number of feature genes does not necessarily lead to better classification performance. To validate the effectiveness of the FT method for genomic datasets, we test the classification performance of the feature genes from the following four aspects.

Differential expression of feature genes

An important indicator of feature selection quality is the discriminative power of the selected genes (i.e. the difference in gene expression level between abnormal and normal tissue samples). The t-statistic is used to test whether the expression level of each gene differs significantly between abnormal and normal tissues, at a significance level of α = 0.05. The null hypothesis is H₀ : μ_t - μ_n = 0 and the alternative hypothesis is H₁ : μ_t - μ_n ≠ 0, where μ_t and μ_n are the mean gene expression levels in abnormal and normal tissue samples, respectively. The means of the selected feature genes in abnormal and normal tissue samples are shown in Fig. 3.
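The test above can be illustrated with a two-sample t-statistic. The paper does not state which variant of the t-test is used, so the Python sketch below assumes Welch's unequal-variance form; it uses only the standard library, and the function name and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(tumor, normal):
    """Welch's t-statistic and approximate degrees of freedom for H0: mu_t - mu_n = 0."""
    n1, n2 = len(tumor), len(normal)
    v1, v2 = variance(tumor), variance(normal)   # sample variances
    se2 = v1 / n1 + v2 / n2                      # squared standard error of the difference
    t = (mean(tumor) - mean(normal)) / sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical expression levels of one gene in abnormal and normal tissue samples.
tumor = [5.0, 6.0, 7.0]
normal = [1.0, 2.0, 3.0]
t, df = welch_t(tumor, normal)
print(t, df)
```

For these toy values the statistic is about 4.90 with 4 degrees of freedom, which exceeds the two-sided critical value of 2.776 at α = 0.05, so H₀ would be rejected for this hypothetical gene.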

Fig.3

Differentially expressed genes of feature subsets.

The expression levels of all the feature genes differ significantly between abnormal and normal tissue samples (see Fig. 3). For example, four Leukemia feature genes are selected by the FT method, of which genes No. 1, 2 and 4 show low expression levels in normal tissues, whereas gene No. 3 shows a high expression level in normal tissues. In the Prostate dataset, genes No. 2, 3, 5 and 7 show low expression levels in normal tissues, while genes No. 1, 4 and 6 show high expression levels in normal tissues. These results illustrate that the feature subsets extracted by the FT method have strong discriminative capability. The differential expression of the feature gene subsets not only provides reliable information for classification, but also gives a more intuitive understanding of gene expression in tumor tissues by comparison with gene expression in normal tissues. This helps to distinguish normal from abnormal tissues and provides an important basis for further research on tumor mechanisms and clinical treatment.

Classification performance of feature genes

To verify the classification performance of the FT method, we use several common classifiers in the Weka tool to classify the gene expression profile datasets, and the same experiments are also carried out without the pre-processing strategy (i.e. direct classification). All experiments use 10-fold cross-validation; the results are shown in Table 3.

Table 3

Classification precision (%) of different classifiers (direct classification / FT method)

Dataset	Lib-SVM	C4.5	Naive Bayes	KNN
Leukemia	65.27/99.01	79.16/100	80.21/98.26	84.72/99.47
Colon	64.51/97.36	75.80/96.51	71.23/99.32	77.41/97.03
Lung	591.62/100	89.01/97.16	68.34/99.43	96.05/100
Prostate	56.61/98.79	80.14/99.81	69.46/96.48	83.08/90.01

(Note: The value to the left of the slash (/) is the result of direct classification; the value to the right of the slash is the result of the FT method.)

The feature genes selected in this paper show good classification performance for normal and abnormal tissue samples. Compared with the approach that does not use the de-noising step (i.e. the pre-processing strategy), classification accuracy is greatly improved (see Table 3). In summary, the feature selection method in this paper retains the discriminative information of the data while reducing its dimensionality. The FT method is therefore feasible and effective.

Differences among feature selection methods

To further validate the effectiveness of our method and to examine the differences among feature selection methods, several existing classical approaches are used in our experiment, i.e. ODP (original data processing), PCA, NRS and Fisher. In addition, we set the same threshold for the NRS and multiple NRS algorithms so that the comparison is fair. The Lib-SVM classifier in the Weka tool is used for the simulation experiments; the classification results are shown in Fig. 4.

Fig.4

Classifier results.

The ODP method obtains the lowest accuracy (e.g. the test accuracy on the Leukemia dataset is 65.27%, lower than that of PCA and the other methods), which shows that gene expression profile datasets contain much redundant information (see Fig. 4). The NRS and Fisher methods achieve similar classification accuracy, but neither is as good as the FT method. The other methods all use an attribute reduction operation (i.e. a pre-processing strategy), which improves classification accuracy. Clearly, removing redundant information improves classification accuracy, whereas redundant features may reduce the discriminative ability of the data. The FT method shows outstanding classification ability.

Comparison with recent related methods

To compare our method with recent research, six feature selection methods are used for comparison; they represent relevant algorithms proposed or improved by domain experts in recent years. These methods, such as FBFE [32], have been used for feature selection in gene expression profiles. The Lib-SVM classifier in the Weka tool is used for the simulation experiments. The number of feature genes and the classification results are shown in Table 4 and Table 5.

Table 4

No. of selected genes

Dataset	FT	FBFE [32]	PSO-dICA [33]	BDE [34]	DRF0 [35]	BQPSO [36]	ILASSO [37]
Leukemia	4	35	300	7	13	9	14
Colon	3	30	20	3	10	11	4
Lung	5	80	1000	3	17	10	7
Prostate	7	50	1000	3	113	10	9

Table 5

Classification precision of different feature selection methods

Dataset	FT	FBFE [32]	PSO-dICA [33]	BDE [34]	DRF0 [35]	BQPSO [36]	ILASSO [37]
Leukemia	99.01	91.23	94.44	82.40	91.18	100	98.61
Colon	97.36	83.34	96.48	75.00	90.00	95.52	90.32
Lung	100	85.21	99.31	98.00	98.66	99.96	100
Prostate	98.79	83.23	96.77	94.10	85.69	99.25	96.08

(Note: FBFE: fuzzy backward feature elimination; PSO-dICA: discriminant independent component analysis based on PSO; BDE: binary differential evolution; DRF0: the distributed ranking filter approach removing the features with information gain zero from the ranking; BQPSO: distributed feature selection: an application to microarray data classification; ILASSO: iterative lasso)

As shown in Table 4 and Table 5, the FT method selects fewer feature genes and achieves higher accuracy in most cases. For example, on the Colon data, related methods such as FBFE obtain lower classification precision than our method and select more feature genes. Among the compared algorithms, individual methods (e.g. the BQPSO algorithm on the Leukemia and Prostate datasets) are slightly better in terms of accuracy. However, BQPSO selects 9 and 10 feature genes on these datasets, compared with 4 and 7 for our method. To summarize, the FT method achieves classification accuracy that is higher than or comparable to that of the compared algorithms while selecting fewer feature genes in most cases, which validates the effectiveness of the feature selection method. Therefore, with the FT feature selection operation, the classification process can be sped up by reducing dimensionality, and the accuracy is considerably higher than that of the conventional techniques.

5 Conclusions

In this work, we explore the effects and benefits of the FT method in the context of feature selection from high-dimensional genomic data. Specifically, using a pre-processing strategy, an improved FDA method is applied to feature selection tasks. The main contributions of our study are as follows:

We analyze both the benefits and the limitations of the Fisher method, and propose the improved maximizing margin Fisher criterion to address the singularity of the within-class covariance matrix and the poor classification ability;

A multiple neighborhood rough set method, which extends the NRS method, is proposed to solve the SSS problem for gene expression profiles.

The results of our experiments give insight into both the strengths and the weaknesses of the FT method and could serve as a useful starting point for better understanding the behavior of these techniques and the extent of their applicability to specific tumor problems. In more detail, we study intrinsic genomic information to better understand the pathogenesis of tumors and to provide a reference for their clinical treatment. However, our research only considers binary classification, so future work will focus on multiclass gene selection problems.

Acknowledgments

This work is supported by National Natural Science Foundation of China (Nos. 61370169, 61402153, 60873104), Project funded by China Postdoctoral Science Foundation (No. 2016M602247), Key Project of Science and Technology Department of Henan Province (Nos. 142102210056, 162102210261).

References

1. Yong-Hyuk and Yourim, A genetic filter for cancer classification on gene expression data, Bio-Medical Materials and Engineering 26 (2015), 1993–2002.

2. Chen, Zheng, Baade P.D., et al., Cancer statistics in China, CA Cancer J Clin 2 (2016), 115–132.

3. Saeys, Inza and Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23(19) (2007), 2507–2517.

4. Mahdizadeh and Eftekhari, Generating fuzzy rule base classifier for highly imbalanced datasets using a hybrid of evolutionary algorithms and subtractive clustering, Journal of Intelligent and Fuzzy Systems 27(6) (2014), 3033–3046.

5. Zhou and He J.Y., Survey of the gene selection technologies based on microarray in bioinformatics, Computer Science 34 (2007), 143–150.

6. Chen X.W., Margin-based wrapper methods for gene identification using microarray, Neurocomputing 69 (2006), 2236–2243.

7. Wang, An, Chen, et al., Incremental wrapper based gene selection with Markov blanket, IEEE International Conference on Bioinformatics and Biomedicine (2014), 74–79.

8. Inza, Sierra, Blanco and Larrañaga, Gene selection by sequential search wrapper approaches in microarray cancer class prediction, J Intelligent Fuzzy Syst 12 (2002), 25–33.

9. J.C., Li and Sun, Feature gene selection method based on logistic and correlation information entropy, Bio-Medical Materials and Engineering 26 (2015), 1953–1959.

10. Nguyen, Khosravi, Creighton, et al., A novel aggregate gene selection method for microarray data classification, Pattern Recognition Letters 60 (2015), 16–23.

11. Dan, Rish, Haws, et al., MINT: Mutual information based transductive feature selection for genetic trait prediction, IEEE/ACM Transactions on Computational Biology and Bioinformatics 13 (2016), 578–583.

12. Stańczyk, Feature selection for data and pattern recognition, Studies in Computational Intelligence 584 (2015), 1–7.

13. Fisher R.A., The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936), 179–188.

14. Yan, Xu, Zhang, et al., Graph embedding and extensions: A general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(1) (2007), 40–51.

15. He, Cai and Han, Learning a maximum margin subspace for image retrieval, IEEE Transactions on Knowledge and Data Engineering 20(2) (2007), 189–201.

16. Barchinejad and Eftekhari, Unsupervised feature selection method based on sensitivity and correlation concepts for multiclass problems, Journal of Intelligent and Fuzzy Systems 30(5) (2015), 2883–2895.

17. Guo, Ding, Fang, et al., Fisher linear discriminant embedded metric learning, Neurocomputing 143 (2014), 7–13.

18. Belhumeur P.N., Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Trans Pattern Anal Machine Intell 19 (1997), 711–720.

19. Xiao and Jian, Supervised group Lasso with applications to microarray data analysis, BMC Bioinformatics 8(1) (2007), 1–17.

20. Q.H., Pan, et al., An efficient gene selection technique for cancer recognition based on neighborhood mutual information, International Journal of Machine Learning and Cybernetics 1 (2010), 63–74.

21.

Liu

, Xie

, Wang

, et al., Hyperspectral band selection based on a variable precision neighborhood rough set, Applied Optics 55 (2016), 462–472.

22.

, Pedrycz

and Miao

, Neighborhood rough sets based multi-label classification for automatic image annotation, International Journal of Approximate Reasoning 54 (2013), 1373–1387.

23.

, Zhou

, Hu

, et al., Mechanical fault diagnosis based on redundant second generation wavelet packet transform, neighborhood rough set and support vector machine, Mechanical Systems and Signal Processing 28(2) (2012), 608–621.

24.

Hui

J.L.

, Pan

, Wu

K.K.

, et al., Attribute reduction based on asymmetric variable neighborhood rough set, Computer Science 42 (2015), 282–287.

25.

Q.H.

, Yu

D.R.

and Xie.

Z.X.

, Numerical attribute reduction based on neighborhood granulation and rough approximation, Journal of Software 19(3) (2008), 640–649.

26.

Liu

and Yang

J.Y.

, An efficient algorithm for Foley-Sammon optimal set of discrimiant vectory by algebraic method, International Journal of Pattern Recognition and Artificial Intelligence 6 (1992), 817–829.

27.

Golub

G.H.

and Van

C.F.

, Loan, Matrix computations, The Johns Hopkins Univ Press 47 (1996), 392–396.

28.

Golub

T.R.

, Slonim

D.K.

, Tamayo

, et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286 (1999), 531–537.

29.

Alon

, Barkai

, Notterman

D.A.

, et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc Natl Acad Sci 96 (1999), 6745–6750.

30.

Bhattacharjee

, Richards

W.G.

, Staunton

, et al., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses, Proc Natl Acad Sci 98(24) (2001), 13790–13795.

31.

Singh

, Febbo

P.G.

, Ross

, et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell 1 (2002), 203–209.

32.

Aziz

, Verma

C.K.

, Srivastava

, et al., A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics Data 8 (2016), 4–15.

33.

Mollaee

and Moattar

M.H.

, A novel feature extraction approach based on ensemble feature selection and modified discriminant independent component analysis for microarray data classification, Biocybernetics and Biomedical Engineering 36 (2016), 521–529.

34.

Apolloni

and Leguizamĺőn

and Alba

, Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments, Applied Soft Computing 38 (2015), 922–932.

35.

Bolĺőn-Canedo

, Sĺćnchez-Marono

and Alonso-Betanzos

, Distributed feature selection: An application to microarray data classification, Applied Soft Computing 30 (2015), 136–150.

36.

, Sun

and Li

, et al. Cancer feature selection and classification using a binary quantum-behaved particle swarm optimization and support vector machine., Computational and Mathematical Methods in Medicine 2016 (2016), 1–9.

37.

Zhang