A fuzzy gaussian rank aggregation ensemble feature selection method for microarray data

Abstract

In microarray data, high classification accuracy is difficult to achieve because the data are high-dimensional, noisy, and contain many irrelevant features, while the number of samples is small relative to the number of gene expression values. To increase the classification accuracy and the processing speed of the model, an optimal subset of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks produced by the Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods, using Fuzzy Gaussian membership function ordering. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used for selecting the optimal features, with an RBF kernel-based Support Vector Machine (SVM) classifier as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets, using Accuracy, Recall, Precision, and F1-Score as performance metrics. The experimental results show that the proposed method outperforms the other feature selection methods.

Keywords

Hybrid feature selection, ensemble method, IBPSO, Relief, mRMR, FC, kernel SVM

1. Introduction

Microarray data, consisting of genes and their expression levels, is a typical example of high-dimensional data. Analysis of gene expression can provide valuable information for predicting how certain diseases form and which countermeasures are needed. DNA Microarray data contains genes, gene annotations, gene expression levels, observations, and observation annotations. Working with such a dataset requires feature selection: without it, classification may be inaccurate and performance sub-optimal. Feature selection improves the accuracy of the classification algorithm and, by removing irrelevant and redundant features, also enhances the quality of the input data.

Feature selection methods are broadly classified into three types: filter, wrapper, and embedded methods [34]. In the filter method, feature subsets are selected using statistical measures before the learning model is applied. In the wrapper method [13], the learning model serves as the fitness function and feature subsets are selected using search strategies. The embedded method [15] selects the features during the training phase itself and is specific to a learning model. Hybrid feature selection is composed of two stages: in the first stage, a filter method selects the best candidate features, and in the second stage, a wrapper method selects the optimal feature subset from these candidates. Hybrid models include Chi-square and GA [19], Multiple-Filter-Multiple-Wrapper (MFMW) [20], Mutual Information and PSO [33], and Mutual Information (MI) and Recursive Feature Elimination (RFE) [35].

Many feature selection algorithms have been proposed for microarray data, such as [28, 16, 31], to mention a few. Ding and Peng use the mRMR feature selection method for finding the optimal feature subsets for the microarray dataset [10]. Li et al. use a combination of GA (Genetic Algorithm) with kNN for the selection of feature subsets [21]. Wahde and Szallasi reviewed wrapper methods based on evolutionary algorithms, in which the selection of feature subsets is optimized using genetic operations [38]. Clustering-based feature selection is explored in [32]. A cuttlefish optimization approach is studied in [1]. A more comprehensive approach to feature selection is nevertheless still needed. In this paper, we propose a hybrid feature selection method for Microarray data (high dimensional data). The proposed method uses an ensemble of the Relief, mRMR, and Feature Correlation filter methods in the filter phase, and IBPSO for optimizing the features in the wrapper phase. We use IBPSO instead of BPSO because BPSO tends to get stuck in local optima. Each particle uses two fitness values, pBest and gBest; in IBPSO, if the gBest value is trapped in a local optimum, it is reset so that the search can escape [8]. By combining Ensemble Feature Selection (EFS) in the filter phase with Improved Binary Particle Swarm Optimization (IBPSO) in the wrapper phase, the proposed method (EFS-IBPSO) achieves better performance than the state-of-the-art feature selection methods. Our contributions in this paper are as follows.

• A framework is proposed for performing feature selection for high dimensional data in a systematic way. To achieve better performance, an ensemble of three rank-based filter feature selection methods is used, followed by IBPSO in the wrapper phase.
• To aggregate the ranks in the filter phase, the Fuzzy Gaussian method is proposed.
• A kernel SVM classifier is used as the evaluator for IBPSO in the wrapper phase.
• Experimental analysis on data of different dimensions is conducted to validate the EFS-IBPSO algorithm.

The remainder of the paper is structured as follows. Section 2 reviews the literature on feature selection for microarray data (high dimensional data). Section 3 presents the proposed framework. Section 4 includes dataset details and result analysis. Section 5 provides conclusions and directions for future work.

2. Related work

This section provides a literature review on feature selection for microarray data (high dimensional data). Li et al. [22] use a linear support vector machine classifier to overcome the massive time consumption of SVM-RFE (Support Vector Machine based on Recursive Feature Elimination). In addition, a re-sampling method is used to preprocess the datasets and obtain a balanced distribution of samples.

Wang et al. [39] proposed the BCO-MDP (Bacterial Colony Optimization with Multi-Dimensional Population) feature selection method to overcome the problem of local optima. Population-based algorithms generally suffer from convergence to local optima; to overcome this, [39] uses a population with multiple dimensions, whose members represent feature subsets of different sizes. Ayyad et al. [2] proposed a new MKNN (Modified k-nearest neighbor) classification method for gene data. It has two variants, LMKNN (largest modified KNN) and SMKNN (small modified KNN), and uses weighting strategies for the neighbors.

Rani and Ramyachitra [29] use the Spider Monkey Optimization (SMO) algorithm to select the optimal feature subset for Microarray gene datasets. An SVM classifier then classifies the cancer dataset using the feature subsets selected by the SMO optimization technique. The authors report that their method outperforms the existing techniques it was compared with.

Mundra and Rajapakse [26] proposed a hybrid feature selection method by combining MRMR and support vector machine recursive feature elimination (SVM-RFE). The MRMR feature selection method is used as a filter, and SVM-RFE is used as a wrapper. First, the MRMR filter is applied to the gene datasets and the best feature subset is selected. The features chosen by the filter are then used as input for the wrapper, and the final feature subsets are selected using SVM-RFE.

El Akadi et al. [13] proposed a two-stage feature selection method by combining the mRMR filter method and the GA wrapper method. Candidate feature sets are selected in the first stage using MI-based mRMR, which filters out irrelevant and redundant features and thereby also reduces the computational cost of the second stage. These candidate features are the input for the second stage, in which the final optimal features are selected using the GA wrapper method. An SVM classifier is used for training the model.

Yuan et al. [41] use partial maximum correlation information (PMCI) for Microarray data. To evaluate the importance of a feature, PMCI extracts several orthogonal components from the feature space, using the correlation between feature and class. The proposed method is evaluated on ten benchmark datasets against six existing algorithms.

Bolón-Canedo et al. [5] proposed an ensemble of filters for feature selection. Five filter methods are used for selecting feature subsets. The feature subsets obtained from these five filters are forwarded to specific classifiers, and the outputs of these classifiers are combined using simple voting. Ten benchmark Microarray datasets are used for evaluating the performance of the proposed model.

Pashaei and Aydin [27] proposed the Binary Black Hole Algorithm (BBHA) for optimization problems. Existing methods have the drawbacks of being computationally expensive and of getting trapped in local optima. To overcome the local optima problem, BBHA uses an efficient global search. Eight benchmark datasets are used for evaluating the performance of the proposed method, which is compared with Simulated Annealing (SA), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA).

Morovvat and Osareh [25] proposed an ensemble of filter and wrapper methods for the Microarray dataset. First, a collection of filters is used to select candidate features, and these candidate features are used as inputs for the wrapper method to obtain optimal feature subsets. The optimal features are then evaluated with an ensemble of the J48, SMO (Sequential minimal optimization – SVM), and Naive Bayes classifiers.

Jain et al. [16] proposed a two-stage hybrid model for the classification of Microarray data. Correlation-based feature selection (CFS) and improved-Binary Particle Swarm Optimization (iBPSO) methods are integrated. Using iBPSO overcomes the drawback of the local optimum of traditional optimization algorithms. Eleven benchmark datasets are used for evaluating the performance of the proposed model. Naive Bayes classifier with 10-fold cross-validation is used for classification of the Microarray datasets.

Dash [9] proposed a bi-objective rank-based Pareto front technique that uses two rank-based techniques. The drawback of relying on a single ranking technique is that the same feature can receive different ranks from different ranking methods. As a result, irrelevant features may be selected and some relevant features may be rejected. For ranking, the model grading method is used.

Lu et al. [24] proposed the hybrid MIMAGA feature selection method, which combines mutual information maximization (MIM) and an adaptive genetic algorithm (AGA). With the MIMAGA selection method, most of the redundant features are eliminated, and a significantly reduced set of features is selected. To ensure the robustness of the MIMAGA method, Lu et al. use four classifiers on the reduced feature datasets. MIM is used for finding the features with the strongest dependency on all other features of the same class. AGA is used for finding the most suitable values of the crossover probability and the mutation probability in order to reach the best solutions.

Gao et al. [14] proposed a hybrid feature selection method that combines Information Gain and Support Vector Machine (IG-SVM). IG is used for removing redundant and irrelevant features. SVM is then used for removing the noise in the dataset and for further removal of redundant features.

Sharbaf et al. [31] proposed a hybrid feature selection method that combines the Fisher criterion as a filter method with cellular learning automata (CLA) optimized with ant colony optimization (ACO) as a wrapper approach. kNN, SVM, and Naive Bayes classifiers are used for evaluating the selected features. CLA is used for solving the NP-complete problem of searching for feature subsets. ACO is used for selecting optimal features and for overcoming the local maxima problem.

The literature indicates that hybrid feature selection approaches can provide better performance. In this paper, we propose a hybrid feature selection approach for improving the accuracy of classification.

3. Proposed methodologies

In this section, we propose a hybrid feature selection method that combines an ensemble of filter feature selection methods with the IBPSO wrapper method to improve the performance of the classification model for microarray data.

Figure 1.

Overview of the proposed framework.

3.1 Problem definition

When data is high-dimensional, it is difficult for machine learning algorithms to achieve high accuracy. Such data suffer from the curse of dimensionality, which makes machine learning algorithms inefficient or even renders them useless. From the literature, it is observed that a filter method alone cannot find the best feature subset because it does not use any learning algorithm for subset selection. A wrapper method does use a learning algorithm, but searching for the best subset is an NP-hard problem. Hybrid feature selection methods are proposed to overcome the problems of both the filter and the wrapper method. Designing the best feature selection algorithm for high dimensional data analytics nevertheless remains challenging. We address this problem by defining a hybrid ensemble feature selection algorithm that produces a subset of features to meet an objective.

3.2 Overview of the proposed framework

This section provides a broad overview of the proposed hybrid ensemble feature selection method for selecting the optimal feature subsets. The proposed method has two phases, namely the filter phase followed by the wrapper phase. In the filter phase, the rankings of three rank-based filter methods are aggregated using a fuzzy Gaussian rank aggregator, and the top $n$ features are selected using a threshold value. The three rank-based filter methods are Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC). The top $n$ features are given as input to the Improved Binary Particle Swarm Optimization (IBPSO) wrapper method, which selects the optimal feature subset. The final feature subset is then used to train the model using a kernel SVM classifier. Figure 1 represents the overview of the proposed framework. A ten-fold cross-validation technique is used to reduce the risk of over-fitting.

Figure 2.

Ensemble feature selection process.

As presented in Fig. 2, the Microarray dataset is given as input to the ensemble feature selection method, which makes use of the combined benefits of multiple algorithms. Three rank-based filter methods, namely Feature Correlation, mRMR, and Relief, are used to rank the features according to their impact on the target feature. Rank aggregation methods from the literature include Mean, Median, Highest Rank [40], Lowest Rank, Geomean, $l_{2}$ norm, Borda, and Robust Rank Aggregation [11]. These methods have drawbacks. The mean rank is affected by very small or very large ranks and is influenced by outliers and skewed distributions. The median does not depend on all the values in the series, since it is concerned only with the middle value, which makes it a less representative average. To overcome the problems of these existing methods, we use Fuzzy Gaussian Membership Function Ordering [4] for rank aggregation.

From the aggregated ranking, we need to select the top $n$ features. The value of $n$ is obtained using a threshold based on the Fisher discriminant ratio proposed in [3], which is calculated using Eq. (1).

$\displaystyle f_{d}=\frac{\sum_{i=1,j=1,i\neq j}^{C}p_{i}p_{j}(\mu_{i}-\mu_{j})^{2}}{\sum_{i=1}^{C}p_{i}\sigma_{i}^{2}}$ (1)

where $\sigma_{i}^{2}$, $\mu_{i}$ and $p_{i}$ are the variance, mean and proportion of the $i^{\text{th}}$ class, respectively, and $C$ is the number of classes. The $f_{d}$ values for all the features are calculated and the final feature subset is obtained by using Eq. (2).

$\displaystyle e[v]=\lambda\times\frac{1}{f_{d}}+(1-\lambda)\times\rho$ (2)

where $\lambda\in[0,1]$ ($\lambda=0.7$ is used in this work) and $\rho$ is the percentage of features included.
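
As an illustration of Eqs. (1) and (2), the following minimal sketch computes the Fisher discriminant ratio of a single feature and the value $e[v]$. It is not the authors' implementation: the function names, the toy data, the use of the population variance, and the treatment of $\rho$ as a fraction in $[0,1]$ are assumptions made for the example.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_discriminant_ratio(values, labels):
    """Eq. (1) for one feature: between-class separation over within-class variance."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    n = len(values)
    # Per-class proportion p_i, mean mu_i and (population) variance sigma_i^2.
    stats = [(len(g) / n, fmean(g), pvariance(g)) for g in groups.values()]
    between = sum(
        p_i * p_j * (mu_i - mu_j) ** 2
        for a, (p_i, mu_i, _) in enumerate(stats)
        for b, (p_j, mu_j, _) in enumerate(stats)
        if a != b
    )
    within = sum(p * var for p, _, var in stats)
    return between / within if within > 0 else float("inf")


def selection_criterion(f_d, rho, lam=0.7):
    """Eq. (2): weighted sum of 1/f_d and the share of features kept (rho)."""
    return lam * (1.0 / f_d) + (1.0 - lam) * rho


# Toy feature with two classes (hypothetical values, not from the datasets).
toy_feature = [0.0, 2.0, 4.0, 6.0]
toy_labels = [0, 0, 1, 1]
f_d = fisher_discriminant_ratio(toy_feature, toy_labels)
e_v = selection_criterion(f_d, rho=0.5)
```

The double sum runs over ordered pairs $i\neq j$, as written in Eq. (1), so each pair of classes contributes twice.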

The following subsections explain each of the rank-based filter feature selection methods in detail.

3.3 Relief

Relief (RF) is a rank-based filter feature selection method that assigns weights to the features using instance-based learning [18]. The weights are assigned based on the nearest hit and the nearest miss: they are updated according to how well a given instance is distinguished from its nearest miss and its nearest hit. The larger the difference between the nearest miss and the nearest hit, the higher the weight of the feature. If $Y$ is a feature with the discrete values $\{y_{1},y_{2},\ldots,y_{n}\}$, its RF is calculated by using Eq. (3).

$\displaystyle\textit{RF}(Y)=\frac{I_{G}(Y)\sum_{x_{i}\in X}p(x_{i})^{2}}{(1-\sum_{c\in C}p(c)^{2})\sum_{c\in C}p(c)^{2}}$ (3)

where $C$ is the target feature and $I_{G}$ is the Gini index [18].
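
The nearest-hit and nearest-miss update described above can be sketched as follows. This is a minimal version of the basic instance-based Relief procedure, not the Gini-index form of Eq. (3) and not the authors' code; the function name, the Manhattan distance, the min-max scaling, and the toy data are assumptions for illustration.

```python
def relief_weights(X, y):
    """Basic Relief: reward features that differ at the nearest miss
    and agree at the nearest hit. Each class needs at least two samples."""
    n, d = len(X), len(X[0])
    # Min-max scale each feature so differences are comparable across features.
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]
    Z = [[(row[j] - lo[j]) / span[j] for j in range(d)] for row in X]

    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    w = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(Z[i], Z[k]))
        miss = min(misses, key=lambda k: dist(Z[i], Z[k]))
        for j in range(d):
            w[j] += (abs(Z[i][j] - Z[miss][j]) - abs(Z[i][j] - Z[hit][j])) / n
    return w


# Toy data: feature 0 separates the classes, feature 1 does not.
X_toy = [[0.0, 0.2], [0.1, 0.9], [1.0, 0.3], [0.9, 0.8]]
y_toy = [0, 0, 1, 1]
weights = relief_weights(X_toy, y_toy)
ranking = sorted(range(len(weights)), key=lambda j: -weights[j])
```

Sorting the features by decreasing weight gives the Relief ranking that enters the ensemble.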

Table 1

Individual’s ranking and different rank aggregation methods

	Feature rankings			ARM		MED		GEM
Feature	FS1	FS2	FS3	Score	Rank	Score	Rank	Score	Rank
f1	1	2	1	1.33	1	1	1	1.26	1
f2	2	1	2	1.67	2	2	2	1.59	2
f3	3	4	4	3.67	4	4	4	3.63	4
f4	4	3	3	3.33	3	3	3	3.30	3

ARM $=$ arithmetic average; GEM $=$ geometric mean; MED $=$ median.

3.4 Minimum redundancy maximum relevance (mRMR)

El Akadi et al. [13] use the mRMR feature selection method to select a maximally relevant and minimally redundant set of features for a given gene expression dataset. The Mutual Information (MI) based mRMR method is employed for this purpose. Let $X=\{g_{i,j}\}_{n\times J}$ denote the data matrix of the Microarray dataset, where $g_{i,j}$ is the expression of the $i^{\text{th}}$ gene in sample $j$, $n$ is the number of genes, and $J$ is the number of samples in the dataset. Let $g_{j}=(g_{1,j},g_{2,j},\ldots,g_{n,j})$ represent the $j^{\text{th}}$ sample of the gene dataset. Similarly, $g_{i}=(g_{i,1},g_{i,2},\ldots,g_{i,J})$ represents the expression of the $i^{\text{th}}$ gene across samples, and $G=\{1,2,\ldots,n\}$ represents the gene index set. The MI between gene $i$ and class label $c$ is calculated using Eq. (4).

$\displaystyle\textit{MI}(i,c)=\sum_{x_{i}}p(i;c)\log\frac{p(i;c)}{p(i)p(c)}$ (4)

The relevancy $R_{S}$ [26] of the genes is given by Eq. (5)

$\displaystyle R_{S}=\frac{1}{|S|}\sum_{l}\sum_{i\in S}\textit{MI}(l,i)$ (5)

where $\textit{MI}(l,i)$ represents the MI between gene $i$ and class label $l$, and $S$ is a subset of $G$. Similarly, the redundancy of gene $i$ with the other genes of $S$ is measured by using Eq. (6)

$\displaystyle Q_{S,i}=\frac{1}{|S|^{2}}\sum_{i^{\prime}\in S,\,i^{\prime}\neq i}\textit{MI}(i,i^{\prime})$ (6)

The gene ranking ( $g(\text{rank})$ ) is evaluated by using the Eq. (7) i.e., the ratio between $R_{S}$ and $Q_{S,i}$ .

$\displaystyle g(\text{rank})=\operatorname{arg\,max}_{i\in S}\frac{R_{S}}{Q_{S,i}}$ (7)
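
The following sketch illustrates how a relevance-to-redundancy ratio in the spirit of Eq. (7) can produce a gene ranking. It uses a common greedy scheme (pick the most relevant gene first, then repeatedly add the gene with the best ratio against the genes already selected), which is an assumption and may differ in detail from the procedure of [13, 26]. The function names and toy data are hypothetical, and the expression values are assumed to be already discretized.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """MI (in nats) between two discrete sequences, as in Eq. (4)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log((c / n) / ((pa[u] / n) * (pb[v] / n)))
        for (u, v), c in pab.items()
    )


def mrmr_ranking(features, labels, eps=1e-12):
    """Greedy mRMR: repeatedly add the gene with the best relevance/redundancy ratio."""
    relevance = [mutual_information(f, labels) for f in features]
    remaining = list(range(len(features)))
    # Start from the most relevant gene; redundancy is undefined for an empty set.
    first = max(remaining, key=lambda i: relevance[i])
    selected = [first]
    remaining.remove(first)
    while remaining:
        def score(i):
            redundancy = sum(
                mutual_information(features[i], features[s]) for s in selected
            ) / len(selected)
            return relevance[i] / (redundancy + eps)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: genes 0 and 1 are identical copies, gene 2 is a different noisy view of the label.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
genes = [
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 1, 1, 1],
]
order = mrmr_ranking(genes, labels)
```

In this toy case the duplicated gene is pushed to the last position because it adds no information beyond its copy, which is the behavior the redundancy term is meant to produce.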

3.5 Feature correlation (FC)

FC is another rank-based filter feature selection method used in the proposed ensemble in the filter phase. Here, the correlation of feature values is given importance. For a feature $f$ and $K$ classes, the discrimination value $S(f)$ is computed using Eq. (8).

$\displaystyle S(f)=\frac{\sum_{k=1}^{K}p_{k}(m_{k}-m)^{2}}{\sigma^{2}(f)\sum_{k=1}^{K}p_{k}(1-p_{k})}$ (8)

where $m$ denotes the overall mean value of the feature, $m_{k}$ the mean value of the feature over the data of the $k^{\text{th}}$ class, $\sigma^{2}(f)$ the variance of the feature, and $p_{k}$ the probability of occurrence of the $k^{\text{th}}$ class. When $K=2$ and both classes are equally likely, Eq. (8) simplifies, up to a constant factor that does not affect the ranking, to Eq. (9).

$\displaystyle S_{12}(f)=\frac{(m_{1}(f)-m(f))^{2}+(m_{2}(f)-m(f))^{2}}{2\sigma^{2}(f)}$ (9)

A large value of $S_{12}(f)$ indicates good discriminative power between the two classes.
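
A minimal sketch of the discrimination value of Eq. (8) is given below. It assumes that the class-mean deviations $(m_{k}-m)$ are squared, in line with the two-class form of Eq. (9), and that the population variance is used; the function name and the toy expression values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def discrimination_score(values, labels):
    """S(f) of Eq. (8), with squared class-mean deviations (an assumption)."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    n = len(values)
    m = fmean(values)        # overall mean of the feature
    var = pvariance(values)  # overall variance sigma^2(f)
    between = sum(len(g) / n * (fmean(g) - m) ** 2 for g in groups.values())
    spread = sum(len(g) / n * (1 - len(g) / n) for g in groups.values())
    if var * spread == 0:
        return 0.0           # constant feature or a single class: no discrimination
    return between / (var * spread)


# Rank two toy features by decreasing discrimination value.
toy_labels = [0, 0, 1, 1]
toy_features = {"g1": [0.0, 2.0, 4.0, 6.0], "g2": [1.0, 5.0, 2.0, 6.0]}
scores = {name: discrimination_score(v, toy_labels) for name, v in toy_features.items()}
fc_ranking = sorted(scores, key=scores.get, reverse=True)
```

Features are ranked by decreasing $S(f)$, so the feature whose class means lie furthest apart relative to its variance comes first.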

The three filter feature selection methods described above are rank-based, that is, each of them sorts the features by the scores it assigns. To obtain greater diversity in the final subset, feature selection methods with different metrics are used.

3.6 Ranking combination

The literature offers a wide range of ensemble rank aggregation techniques for combining the ranks obtained from different methods. They range from simple mathematical measures such as minimum, maximum, and mean to more advanced measures like SVM-Rank [30, 23]. Other rank aggregation methods, based on scores calculated using the Borda algorithm [23], are the Mean, Geometric Mean, Median, and $l_{2}$ norm. These methods are useful for small datasets but are not suitable for high dimensional datasets. The mean ranking is affected by very small or very large ranks and is influenced by outliers and skewed distributions. The median does not depend on all the values in the series, since it is concerned only with the middle item, which makes it a less representative average. To overcome the problems of existing methods and to have a fair ranking scheme, we use the Fuzzy Gaussian membership method for rank aggregation in this paper.

Table 1 represents the aggregate ranking of three feature ranking methods FS1, FS2 and FS3. The scores are computed for these methods using different rank aggregate methods ARM (arithmetic average), MED (median), GEM (geometric mean) and their rank aggregates are tabulated.
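
The scores in Table 1 can be reproduced with a few lines of code. The sketch below is only an illustration of the three baseline aggregators; the helper name `aggregate` and the dictionary layout are assumptions.

```python
from statistics import fmean, geometric_mean, median

# Rank of each feature under FS1, FS2 and FS3 (values from Table 1).
ranks = {"f1": [1, 2, 1], "f2": [2, 1, 2], "f3": [3, 4, 4], "f4": [4, 3, 3]}


def aggregate(ranks, score_fn):
    """Score every feature with score_fn, then rank by ascending score."""
    scores = {f: score_fn(r) for f, r in ranks.items()}
    ordered = sorted(scores, key=scores.get)
    return scores, {f: pos + 1 for pos, f in enumerate(ordered)}


arm_scores, arm_ranks = aggregate(ranks, fmean)           # arithmetic average
med_scores, med_ranks = aggregate(ranks, median)          # median
gem_scores, gem_ranks = aggregate(ranks, geometric_mean)  # geometric mean
```

For this small example all three aggregators agree on the final order; they start to differ when the input rankings disagree more strongly.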

Borda's method uses the mean position of a feature; inspired by this, we also take the variance of the feature position into account when calculating the membership values. Equation (10) gives the membership function of each feature as a function of rank position.

$\displaystyle\mu_{f_{i}}(x)=\frac{1}{\sqrt{2\pi\bar{\sigma}^{2}_{f_{i}}}}\exp\left(-\frac{1}{2}\left[\frac{(x-\bar{x}_{f_{i}})^{2}}{\bar{\sigma}^{2}_{f_{i}}}\right]\right)$ (10)

where $\bar{x}_{f_{i}}$ and $\bar{\sigma}^{2}_{f_{i}}$ are the mean and the variance of the rank positions of feature $f_{i}$.

Example 1. Let $\textit{FS}1$, $\textit{FS}2$ and $\textit{FS}3$ be three filter ranking methods with $\textit{FS}1=[f1,f2,f3,f4]$, $\textit{FS}2=[f2,f1,f4,f3]$ and $\textit{FS}3=[f1,f2,f4,f3]$. The position of a feature in a list represents its rank; for example, in $\textit{FS}1$ the rank of feature $f2$ is 2. The mean positions of the features are $\bar{x}_{f1}=(1+2+1)/3=1.33$, $\bar{x}_{f2}=1.67$, $\bar{x}_{f3}=3.67$, $\bar{x}_{f4}=3.33$, and the variances are $\bar{\sigma}^{2}_{f1}=[(1-1.33)^{2}+(2-1.33)^{2}+(1-1.33)^{2}]/3=0.22$, $\bar{\sigma}^{2}_{f2}=0.22$, $\bar{\sigma}^{2}_{f3}=0.22$, $\bar{\sigma}^{2}_{f4}=0.22$. Using Eq. (10), we have

$\displaystyle\mu_{f1}(1)=\frac{1}{\sqrt{2\pi\times 0.22}}\exp\left(-\frac{1}{2}\left[\frac{(1-1.33)^{2}}{0.22}\right]\right)=0.8360,$
$\displaystyle\mu_{f1}(2)=0.8060,\ \mu_{f1}(3)=0.6234,\ \mu_{f1}(4)=0.3870,$
$\displaystyle\mu_{f2}(1)=0.8059,\ \mu_{f2}(2)=0.8360,\ \mu_{f2}(3)=0.6959,\ \mu_{f2}(4)=0.4649,$
$\displaystyle\mu_{f3}(1)=0.3870,\ \mu_{f3}(2)=0.6234,\ \mu_{f3}(3)=0.8060,\ \mu_{f3}(4)=0.8360,$
$\displaystyle\mu_{f4}(1)=0.4649,\ \mu_{f4}(2)=0.6959,\ \mu_{f4}(3)=0.8360,\ \mu_{f4}(4)=0.8059.$

In this way, the membership value of each feature at each position is obtained as a function of the mean and variance of its positions. Using the membership function ordering (MFO) technique [4], the feature having the highest membership value at a given position is assigned to that position, and the features are thus arranged in the aggregated ranking. For Example 1, the aggregated ranking is $L=[f1,f2,f4,f3]$: for position 1, $\max_{i\in\{f1,\ldots,f4\}}\mu_{i}(1)=\mu_{f1}(1)$, so in the aggregated rank list $L$, $L(1)=f1$. The remaining positions are filled in the same way; if the feature with the highest membership value has already been assigned a position, the feature with the next highest value is considered.
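
The aggregation of Example 1 can be sketched in code as follows. This is a minimal illustration of Eq. (10) combined with membership function ordering, not the authors' implementation. The function names, the small variance floor `min_var` (which keeps the membership finite when all methods give a feature the same rank), and the alphabetical tie-breaking are assumptions.

```python
from math import exp, pi, sqrt
from statistics import fmean, pvariance


def gaussian_membership(x, mean, var):
    """Eq. (10): Gaussian membership of rank position x."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)


def fuzzy_gaussian_aggregate(rankings, min_var=1e-6):
    """Aggregate ranked lists via membership function ordering (MFO)."""
    features = rankings[0]
    # 1-based rank positions of each feature across all input rankings.
    positions = {f: [r.index(f) + 1 for r in rankings] for f in features}
    mean = {f: fmean(p) for f, p in positions.items()}
    # Floor the variance so that unanimously ranked features stay finite.
    var = {f: max(pvariance(p), min_var) for f, p in positions.items()}

    aggregated, unassigned = [], set(features)
    for x in range(1, len(features) + 1):
        # Among features without a position yet, take the highest membership at x.
        best = max(
            sorted(unassigned),
            key=lambda f: gaussian_membership(x, mean[f], var[f]),
        )
        aggregated.append(best)
        unassigned.remove(best)
    return aggregated


FS1 = ["f1", "f2", "f3", "f4"]
FS2 = ["f2", "f1", "f4", "f3"]
FS3 = ["f1", "f2", "f4", "f3"]
L = fuzzy_gaussian_aggregate([FS1, FS2, FS3])
```

On the three lists of Example 1 this procedure yields the aggregated order $[f1,f2,f4,f3]$ stated in the text.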

3.7 Ensemble feature selection

Generally, a single learning model is used for prediction. To further improve performance, ensemble learning combines the outputs of multiple prediction models, on the premise that combining the outputs of various models is better than using one learning model. This idea has been applied successfully to machine learning classification models. Ensemble feature selection follows the same concept: the feature subsets obtained from different feature selection methods are combined into a single feature subset by an aggregation method, and this subset is then used by the classification model.

In this paper, Fuzzy Gaussian Membership function ordering is used for combining the rankings of the features to obtain an aggregate ranking. The number of features to select from the aggregate ranking is obtained by using the Fisher discriminant ratio value $f_{d}$ [3], which is calculated by using Eq. (1).

Algorithm 1. Pseudo code of the ensemble feature selection

Input: Microarray dataset D
Output: Selected features F

Initialize feature rank vectors V1, V2, V3
for each feature selection method f do
    V1 ← obtain feature ranking based on Relief
    V2 ← obtain feature ranking based on mRMR
    V3 ← obtain feature ranking based on Feature Correlation
end for
for each feature $f_{i}$ do
    Calculate the mean position $\bar{x}_{f_{i}}$ for feature $f_{i}$
    Calculate the variance position ${\bar{\sigma}}^{2}_{f_{i}}$ for feature $f_{i}$
    Calculate the membership function of each feature as a function of rank positions using Eq. (10)
end for
$C_{\text{final}}=$ Aggregate the ranking lists V1, V2 and V3 as explained in Example 1
$F=$ Select top $T$ attributes from $C_{\text{final}}$ ($T$ value is obtained by using Eq. (1))
return $F$

Algorithm 1 describes the proposed ensemble feature selection method. The algorithm combines three filter ranking feature selection methods, namely Relief, mRMR and Feature Correlation. The feature rankings of these methods are aggregated using the Fuzzy Gaussian Membership function in the filter phase. The selected top $T$ features are used as input for IBPSO in the wrapper phase.

3.8 Improved binary particle swarm optimization (wrapper phase)

Particle swarm optimization (PSO), a population-based stochastic optimization technique, was proposed in 1995 [17]. It is based on the behavior of birds in a flock. The best solutions are found by combining each particle's own memory with the knowledge gained by the swarm. A fitness function is used for optimization, and each particle has a fitness value. Every particle updates its position and velocity based on the best position found by itself and by its neighbors. These are tracked by two fitness values, pbest and gbest: gbest is the global best fitness value, whereas pbest is the best value of the individual particle. To overcome the problem of local optima, the weights are fine-tuned.

IBPSO, proposed in [8], is used to overcome the premature convergence of Binary Particle Swarm Optimization (BPSO) [7] and PSO. IBPSO is also used for achieving superior classification results with a reduced number of features. The kernel-based SVM classifier evaluates the selected features and provides the fitness value. In IBPSO, the position of each particle is a binary string in which each bit corresponds to one feature: 1 means the feature is selected, and 0 means the feature is not selected. The initial position of each particle is chosen by using a random function given in [36]. The velocity and position of the particles are updated iteratively based on the performance of the kernel SVM classifier and the number of selected features. $\text{pbest}_{n}$ (where $n$ indexes the particles) represents the best fitness value of each particle, whereas gbest represents the global best fitness value, i.e. the best fitness value among all $\text{pbest}_{n}$. The position and velocity of each particle are updated based on the pbest and gbest values until a predefined number of iterations is reached. The following equations are used for updating the velocity and position of each particle.

$\displaystyle v^{i+1}_{pd}=wv^{i}_{pd}+c_{1}\times\Psi_{1}(\text{pbest}_{pd}-x^{i}_{pd})+c_{2}\times\Psi_{2}(\text{gbest}_{d}-x^{i}_{pd})$
$\displaystyle\text{if }v^{i+1}_{pd}\notin(V_{\min},V_{\max})\text{ then }v^{i+1}_{pd}=\max(\min(V_{\max},v^{i+1}_{pd}),V_{\min})$
$\displaystyle S(v^{i+1}_{pd})=\frac{1}{1+e^{-v^{i+1}_{pd}}}$
$\displaystyle\text{if }(\Psi_{3}<S(v^{i+1}_{pd}))\text{ then }x^{i+1}_{pd}=1;\text{ else }x^{i+1}_{pd}=0.$ (13)

where $w$ is the inertia weight; $\Psi_{1}$, $\Psi_{2}$ and $\Psi_{3}$ are random numbers in $[0,1]$; and $x^{i}_{pd},x^{i+1}_{pd}$ and $v^{i}_{pd},v^{i+1}_{pd}$ are the position and velocity of the $d^{\text{th}}$ feature of the $p^{\text{th}}$ particle in the $i^{\text{th}}$ and $(i+1)^{\text{th}}$ iterations, respectively. The velocity bounds $\left[V_{\min},V_{\max}\right]$ are $[-6,6]$ in our case, and $c_{1}=2$ and $c_{2}=2$ are positive constants. These values are taken from [8].
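
One update of a single particle according to the equations above can be sketched as follows. The function name and the default inertia weight `w=1.0` are assumptions, since no value of $w$ is stated here; the remaining defaults follow the constants given above.

```python
import random
from math import exp


def update_particle(x, v, pbest, gbest, w=1.0, c1=2.0, c2=2.0,
                    v_min=-6.0, v_max=6.0, rng=random):
    """One velocity/position update of a binary particle (bit d = feature d)."""
    new_x, new_v = [], []
    for d in range(len(x)):
        psi1, psi2, psi3 = rng.random(), rng.random(), rng.random()
        vel = (w * v[d]
               + c1 * psi1 * (pbest[d] - x[d])
               + c2 * psi2 * (gbest[d] - x[d]))
        vel = max(min(v_max, vel), v_min)    # clamp to [V_min, V_max]
        s = 1.0 / (1.0 + exp(-vel))          # sigmoid transfer function S(v)
        new_v.append(vel)
        new_x.append(1 if psi3 < s else 0)   # stochastic bit update
    return new_x, new_v


# Example: a particle over four features, seeded for repeatability.
rng = random.Random(0)
x_next, v_next = update_particle(
    x=[0, 1, 0, 1], v=[0.0, 0.0, 0.0, 0.0],
    pbest=[1, 1, 0, 0], gbest=[1, 0, 1, 0], rng=rng,
)
```

Clamping the velocity to $[-6,6]$ keeps $S(v)$ away from exactly 0 or 1, so every bit retains a small chance of flipping.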

The $(i+1)^{\text{th}}$ velocity of a particle is calculated by multiplying its $i^{\text{th}}$ velocity by the inertia weight and adding the terms that pull the particle toward pbest and gbest. To overcome the problem of local optima, the gbest value is evaluated before each particle position is updated. From the gbest value of each iteration we can decide whether the search is trapped in a local optimum: if the gbest value does not change for three consecutive iterations, the search is considered trapped. The gbest position is then reset to ‘0’ to escape premature convergence. Algorithm 2 describes the pseudo code for the proposed method.
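
The stagnation check can be sketched as a small helper. The function name, the `patience` parameter, and the choice to return a flag alongside the position are assumptions; how the fitness stored for gbest is handled after a reset is left to the caller.

```python
def reset_if_stagnant(gbest, gbest_fitness_history, patience=3):
    """Return an all-zero gbest when its fitness stayed unchanged
    over the last `patience` iterations; otherwise return gbest as is."""
    recent = gbest_fitness_history[-patience:]
    if len(recent) == patience and len(set(recent)) == 1:
        return [0] * len(gbest), True
    return gbest, False


# gbest fitness was 0.9 in the last three iterations, so the position is reset.
stuck = reset_if_stagnant([1, 0, 1], [0.8, 0.9, 0.9, 0.9])
# Fitness is still improving, so gbest is kept.
moving = reset_if_stagnant([1, 0, 1], [0.8, 0.9, 0.95])
```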

Table 2 lists the default parameter values used in the implementation of IBPSO. These values were chosen after many trials to obtain the best objective values and are also based on related work [17, 12].

Table 2

IBPSO parameter values

Parameters	Values
Particle size	100
No. of iterations (T)	100
$c_{1}$	2
$c_{2}$	2
$v_{\max}$	6
$v_{\min}$	$-6$

Algorithm 2. Pseudo code for proposed method

Load Microarray dataset D
Initialize feature rank vectors V1, V2, V3
for each feature selection method f do
    V1 = obtain feature ranking based on Relief
    V2 = obtain feature ranking based on mRMR
    V3 = obtain feature ranking based on Feature Correlation
end for
for each feature $f_{i}$ do
    Calculate the mean position $\bar{x}_{f_{i}}$ for feature $f_{i}$
    Calculate the variance position ${\bar{\sigma}}^{2}_{f_{i}}$ for feature $f_{i}$
    Calculate the membership function of each feature as a function of rank positions using Eq. (10)
end for

/* IBPSO */
Initialize the position vector of particles using the function in [36]
    (in the position vector, bit 1 represents a selected feature and bit 0 a feature that is not selected)
while the number of iterations or the stopping criterion is not met do
    Evaluate fitness of particle swarm by kernel SVM
    for $p=1$ to number of particles do
        if fitness of $X_{p}$ is greater than the fitness of $\text{pbest}_{p}$ then
            $\text{pbest}_{p}=X_{p}$
        end if
        if fitness of any particle of the particle swarm is $>$ gbest then
            gbest $=$ position of particle
        end if
        if fitness of gbest is the same 3 times then
            reset gbest
        end if
        for $d=1$ to number of dimensions of particle do
            $v^{i+1}_{pd}=wv^{i}_{pd}+c_{1}\times\Psi_{1}(\text{pbest}_{pd}-x^{i}_{pd})+c_{2}\times\Psi_{2}(\text{gbest}_{d}-x^{i}_{pd})$
            if $v^{i+1}_{pd}\notin(V_{\min},V_{\max})$ then $v^{i+1}_{pd}=\max(\min(V_{\max},v^{i+1}_{pd}),V_{\min})$
            $S(v^{i+1}_{pd})=\frac{1}{1+e^{-v^{i+1}_{pd}}}$
            if $\Psi_{3}<S(v^{i+1}_{pd})$ then $x^{i+1}_{pd}=1$; else $x^{i+1}_{pd}=0$
        end for
    end for
end while
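The per-particle update and the gbest reset in the IBPSO part of the pseudo code can be illustrated with a minimal Python sketch. It is not the MATLAB implementation used in this work: the function names `update_particle` and `reset_if_stagnant` are hypothetical, the inertia weight `w` is passed in by the caller because its value is not stated here, and the reset is assumed to set every bit of gbest to 0 when its fitness has been identical over the last three iterations.

```python
import math
import random

V_MIN, V_MAX = -6.0, 6.0   # velocity bounds from Table 2
C1, C2 = 2.0, 2.0          # acceleration constants from Table 2


def sigmoid(v):
    """Map a velocity to the probability of setting the bit to 1."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(x, v, pbest, gbest, w, rng):
    """Return the new (position, velocity) of one particle (hypothetical helper)."""
    new_x, new_v = [], []
    for d in range(len(x)):
        psi1, psi2, psi3 = rng.random(), rng.random(), rng.random()
        vel = (w * v[d]
               + C1 * psi1 * (pbest[d] - x[d])
               + C2 * psi2 * (gbest[d] - x[d]))
        vel = max(min(V_MAX, vel), V_MIN)   # clamp to [V_MIN, V_MAX]
        new_v.append(vel)
        # Bit d is selected (1) when psi3 falls below S(velocity).
        new_x.append(1 if psi3 < sigmoid(vel) else 0)
    return new_x, new_v


def reset_if_stagnant(gbest, fitness_history, patience=3):
    """Zero gbest if its fitness was identical over the last `patience` iterations."""
    recent = fitness_history[-patience:]
    if len(recent) == patience and len(set(recent)) == 1:
        return [0] * len(gbest)
    return gbest
```

With `rng = random.Random(0)`, a call such as `update_particle([0, 1, 0, 1], [0.0] * 4, [1, 1, 0, 0], [1, 0, 0, 1], w, rng)` returns a new bit vector and a velocity vector whose entries all lie within $[-6,6]$, for whatever inertia weight `w` the caller chooses.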

In this work we use an RBF kernel SVM to evaluate the IBPSO particles. Several kernel functions are available for kernel SVM, including the polynomial, radial basis function (RBF), and sigmoid kernels [6]. We use the RBF kernel because it is suitable for analyzing higher-dimensional data [36] and requires only two parameters, $C$ and $\gamma$. Random sampling from an exponential probability distribution is used for selecting the hyperparameter values ($C$ and $\gamma$) [37]. A high $C$ value fits the training examples more closely, whereas a low $C$ value gives a smoother decision surface. Similarly, the $\gamma$ value must be chosen carefully for better classification accuracy. In this paper the RBF kernel SVM model uses the near-optimal hyperparameter values $C=0.5$ and $\gamma=0.02$, which are taken from the literature.
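As an illustration of the two quantities discussed above, the following Python sketch evaluates the standard RBF kernel $K(x,y)=e^{-\gamma\|x-y\|^{2}}$ and draws candidate $(C,\gamma)$ pairs from exponential distributions. The function names are hypothetical, and using 0.5 and 0.02 as the means of the two distributions is an assumption made only for the example; the sampling scales actually used are not stated here.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sample_hyperparameters(rng, c_mean=0.5, gamma_mean=0.02):
    """Draw (C, gamma) from exponential distributions with the given means.

    The means are illustrative assumptions, not values from the paper.
    """
    # random.expovariate takes the rate lambda = 1 / mean.
    c = rng.expovariate(1.0 / c_mean)
    gamma = rng.expovariate(1.0 / gamma_mean)
    return c, gamma
```

Two identical expression profiles give a kernel value of 1, and the value decays towards 0 as the profiles move apart, at a rate controlled by $\gamma$.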

4. Dataset details and result analysis

The Colon, Breast, Prostate, Lymphoma and SRBCT microarray datasets are publicly available and were collected from http://csse.szu.edu.cn/staff/zhuzx/Datasets.html [42, 9]. The datasets collected from this repository were cleaned.

Table 3
Datasets used in the empirical study

Data sets	No. of genes	No. of samples	No. of class labels
Colon	2000	60	2
Breast	24481	97	2
Prostate [9]	12600	34	2
Lymphoma	4026	62	3
SRBCT	2308	83	4

Table 4

Number of genes before and after feature selection method

Dataset	Original	Relief	mRMR	IBPSO	Relief $+$ IBPSO	mRMR $+$ IBPSO	Proposed method
Colon	2000	100	120	50	45	43	35
Breast	24481	600	95	86	83	70	65
Prostate	12600	340	80	75	72	68	53
Lymphoma	4026	200	56	48	40	38	30
SRBCT	2308	150	40	38	32	30	26

Table 5

Comparison of classification accuracy (%) obtained by the proposed method with other competing methods

	Feature selection methods
Dataset	Original	Relief	mRMR	IBPSO*	Relief $+$ IBPSO*	mRMR $+$ IBPSO*	Proposed method*
Colon	71.87	76.92	80.00	76.92	85.42	92.11	94.57
Breast	61.54	76.04	79.17	76.04	84.38	89.47	92.31
Prostate	72.92	75.00	81.25	78.13	86.46	93.33	95.89
Lymphoma	69.23	72.00	80.21	78.85	89.58	92.31	93.33
SRBCT	73.96	76.04	80.21	78.13	91.67	95.60	97.22

* RBF kernel based SVM is used as fitness function.

Table 3 describes the datasets in terms of the number of genes, samples, and class labels. For example, the Colon dataset contains 2000 genes and 60 observations with two class labels. SRBCT has four class labels and Lymphoma has three, whereas the remaining datasets have two. The proposed method is implemented in MATLAB.

4.1 Result analysis

The proposed method is implemented in two phases. In the first phase, the features are ranked using three rank-based filter feature selection methods: Relief, mRMR, and Feature Correlation. An ensemble rank aggregation technique, the Fuzzy Gaussian membership function ordering, is then used to obtain the final feature ranks. The top $T$ ranked features are selected from this aggregated ranking list and passed as input to the second phase. Here $T$ is a threshold value calculated using Eq. (1).

In the second phase, the Improved Binary Particle Swarm Optimization (IBPSO) method is used for further feature selection on the Microarray datasets, with an RBF kernel-based SVM classifier as the fitness function. Four performance measures, namely Classification Accuracy, Recall, Precision, and F1-Score, are used to compare the proposed method with existing feature selection methods.

Table 4 reports the original number of genes in each dataset and the number of genes selected by each feature selection method. The proposed method selects the fewest genes on every dataset compared with the filter, wrapper and hybrid methods. The filter methods, Relief and mRMR, select more features than the wrapper, hybrid and proposed methods. The IBPSO wrapper method selects fewer features than the filter methods but slightly more than the hybrid and proposed methods. Similarly, the Relief $+$ IBPSO and mRMR $+$ IBPSO hybrid methods select fewer features than the filter and wrapper methods but more than the proposed method.

In the first experiment, we apply the RBF kernel SVM classifier to the five datasets without any feature selection. In Table 5, the Original column reports classification using all genes available in the datasets. The resulting accuracy is low, between 61.54% and 73.96%.

In the second experiment, feature subsets are selected with the filter methods (Relief, mRMR), the wrapper method (IBPSO), the hybrid methods (Relief $+$ IBPSO, mRMR $+$ IBPSO), and the proposed method. The values in Table 5 are the classification accuracies obtained with the features selected by each method. The RBF kernel SVM is used both for computing the fitness and for measuring the classification accuracy of the selected features. The proposed method achieves the highest classification accuracy on all five datasets. In Table 5, bold values indicate the highest accuracy among the FS methods for each dataset.

Table 5 compares the classification accuracy (%) of the proposed method with that of the filter methods (Relief, mRMR), the wrapper method (IBPSO) and the hybrid methods (Relief $+$ IBPSO, mRMR $+$ IBPSO). Original denotes the case without feature selection, i.e. all features are considered for training the model. The proposed method attains 94.57% accuracy on the Colon dataset, 92.31% on Breast, 95.89% on Prostate, 93.33% on Lymphoma and 97.22% on SRBCT, in each case the highest among the compared methods. We attribute this to the ensemble approach, which can have a synergistic effect when multiple methods are combined appropriately.

Table 6
Recall, Precision and F1-score obtained on each dataset with the different feature selection methods

Feature selection methods	Datasets	Recall	Precision	F1-score
Original	Colon	77.36	73.21	75.23
	Breast	87.80	66.66	76.19
	Prostate	77.35	74.55	75.93
	Lymphoma	82.65	60.00	75.00
	SRBCT	68.52	82.22	74.75
Relief	Colon	75.00	85.71	81.26
	Breast	76.67	83.64	80.00
	Prostate	70.00	79.55	74.47
	Lymphoma	88.85	72.75	80.00
	SRBCT	82.61	71.70	76.77
mRMR	Colon	81.56	82.16	81.85
	Breast	84.68	85.52	85.09
	Prostate	84.87	85.68	85.27
	Lymphoma	83.65	84.41	84.02
	SRBCT	84.12	85.64	84.87
IBPSO	Colon	91.64	92.64	92.13
	Breast	93.62	94.85	94.23
	Prostate	94.68	95.32	94.99
	Lymphoma	93.61	94.48	94.04
	SRBCT	94.24	95.74	94.98
Relief $+$ PSO	Colon	91.12	92.36	91.73
	Breast	91.98	93.14	92.55
	Prostate	92.65	93.98	93.31
	Lymphoma	94.32	95.54	94.92
	SRBCT	93.64	94.56	94.09
mRMR $+$ IBPSO	Colon	92.48	93.35	92.91
	Breast	93.54	94.64	94.08
	Prostate	92.84	93.84	93.33
	Lymphoma	92.68	93.98	93.32
	SRBCT	93.42	94.54	93.97
Proposed	Colon	94.48	95.68	93.81
	Breast	95.64	96.84	96.56
	Prostate	94.62	95.64	95.87
	Lymphoma	94.86	95.87	95.84
	SRBCT	95.67	96.85	95.68

Table 6 reports the Recall, Precision and F1-Score of the proposed method and of the competing methods: Original, Relief, mRMR, Improved Binary Particle Swarm Optimization (IBPSO), Relief $+$ PSO and mRMR $+$ IBPSO. Original denotes the case without feature selection, i.e. all features are considered for training the model, whereas Relief $+$ PSO and mRMR $+$ IBPSO are hybrid feature selection methods.
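For reference, the three measures in Table 6 follow the standard definitions based on true positives (TP), false positives (FP) and false negatives (FN). The sketch below covers the binary case only; the function name is hypothetical, and how the measures were averaged over classes for the multi-class datasets is not specified here, so that step is left out.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators by returning 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because F1 is the harmonic mean of Precision and Recall, it always lies between the two values.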

The ROC (Receiver Operating Characteristic) curve is used as a performance measure in machine learning. It plots the true positive rate (sensitivity) against the false positive rate (1 $-$ specificity). Each point on the ROC curve represents a sensitivity/specificity pair for a particular threshold value. The AUC (Area Under Curve) measures how well a model can distinguish between the classes: the larger the AUC, the better the model.

An ROC curve was drawn for each feature selection method. To plot it, sensitivity and specificity are calculated for each method. The Area Under Curve (AUC) is used to decide which method is best: the larger the area, the better the classification performance. The method with the highest AUC is considered the best feature selection method.
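The construction of an ROC curve and its AUC can be sketched as follows. This is a generic illustration rather than the procedure used to produce the figures: it assumes binary labels (1 for the positive class), a real-valued classifier score per sample, at least one sample of each class, and the trapezoidal rule for the area. The function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    Assumes labels in {0, 1} with at least one positive and one negative.
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores are added together as one threshold step.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every positive sample above every negative one yields an AUC of 1, while random scoring gives a value close to 0.5.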

Figure 3.

Receiver operating characteristic curve for colon dataset.

Figure 4.

Receiver operating characteristic curve for breast dataset.

Figures 3–7 show the ROC curves for the Colon, Breast, Prostate, Lymphoma and SRBCT datasets. The $X$-axis represents the false positive rate, whereas the $Y$-axis represents the true positive rate. In all five figures, the AUC of the proposed method is larger than that of the other feature selection methods, namely Original, Relief, mRMR, Feature Correlation, Particle Swarm Optimization (PSO) and Improved Binary Particle Swarm Optimization (IBPSO). Original denotes the case without feature selection, i.e. all features are considered for training the model.

Figure 5.

Receiver operating characteristic curve for prostate dataset.

Figure 6.

Receiver operating characteristic curve for lymphoma dataset.

Figure 7.

Receiver operating characteristic curve for SRBCT dataset.

5. Conclusion and future work

In this paper, a feature selection method for high dimensional data is proposed. Even the best classification methods may show sub-optimal performance when feature selection is not carried out, because datasets used as they are cause many issues due to redundant and irrelevant features. The efficiency of the classification algorithms goes down, and sometimes they may be useless unless feature selection is carried out. We proposed a Hybrid Ensemble Feature Selection (EFS) method, which combines the Relief, mRMR, and Feature Correlation FS methods in the filter phase and IBPSO in the wrapper phase. By using these three filter approaches as part of the ensemble, the proposed method provides better results when applied to Microarray data (high dimensional data). The performance of the ensemble is improved by adopting fuzzy Gaussian rank aggregation, which outperformed the mean, median and geometric mean methods used in the literature. The top-ranked features are then passed to IBPSO (the wrapper method) to obtain an optimal set of features. On the five benchmark datasets, the proposed framework shows better performance than the compared state-of-the-art feature selection methods. In the future, the proposed ensemble method can be extended to feature selection in streaming data.

References

1. Ang J.C., Mirzal, Haron and Hamed H.N.A., Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection, IEEE/ACM Transactions on Computational Biology and Bioinformatics 13(5) (2016), 971–989.
2. Ayyad S.M., Saleh A.I. and Labib L.M., Gene expression cancer classification using modified K-nearest neighbors technique, BioSystems 176 (2019), 41–51.
3. Basu and Ho T.K., Data complexity in pattern recognition, Springer Science & Business Media, 2006.
4. Beg M.S. and Ahmad, Soft computing techniques for rank aggregation on the world wide web, World Wide Web 6(1) (2003), 5–22.
5. Bolón-Canedo, Sánchez-Maroño and Alonso-Betanzos, An ensemble of filters and classifiers for microarray data classification, Pattern Recognition 45(1) (2012), 531–539.
6. Burges C.J., A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2(2) (1998), 121–167.
7. Cervante, Xue, Zhang and Shang, Binary particle swarm optimisation for feature selection: A filter based approach, in: 2012 IEEE Congress on Evolutionary Computation, IEEE, 2012, pp. 1–8.
8. Chuang L.-Y., Chang H.-W., C.-J. and Yang C.-H., Improved binary pso for feature selection using gene expression data, Computational Biology and Chemistry 32(1) (2008), 29–38.
9. Dash, A two stage grading approach for feature selection and classification of microarray data using pareto based feature ranking techniques: A case study, Journal of King Saud University-Computer and Information Sciences (2017).
10. Ding and Peng, Minimum redundancy feature selection from microarray gene expression data, Journal of Bioinformatics and Computational Biology 3(2) (2005), 185–205.
11. Dittman D.J., Khoshgoftaar T.M., Wald and Napolitano, Classification performance of rank aggregation techniques for ensemble gene selection, in: The Twenty-Sixth International FLAIRS Conference, 2013.
12. Eberhart R.C., Shi and Kennedy, Swarm intelligence, San Francisco: Morgan Kaufmann Publishers, 2001.
13. El Akadi, Amine, El Ouardighi and Aboutajdine, A two-stage gene selection scheme utilizing mrmr filter and ga wrapper, Knowledge and Information Systems 26(3) (2011), 487–500.
14. Gao and Huang, Hybrid method based on information gain and support vector machine for gene selection in cancer classification, Genomics, Proteomics & Bioinformatics 15(6) (2017), 389–395.
15. Guyon and Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3(Mar) (2003), 1157–1182.
16. Jain V.K. and Jain, Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification, Applied Soft Computing 62 (2018), 203–215.
17. Kennedy and Eberhart, Particle swarm optimization (pso), in: Proc IEEE International Conference on Neural Networks, Perth, Australia, 1995, pp. 1942–1948.
18. Kononenko, Estimating attributes: Analysis and extensions of relief, in: European Conference on Machine Learning, Springer, 1994, pp. 171–182.
19. Lee C.-P. and Leu, A novel hybrid feature selection method for microarray data analysis, Applied Soft Computing 11(1) (2011), 208–213.
20. Leung and Hung, A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 7(1) (2010), 108–117.
21. Weinberg C.R., Darden T.A. and Pedersen L.G., Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the ga/knn method, Bioinformatics 17(12) (2001), 1131–1142.
22. Xie and Liu, Efficient feature selection and classification for microarray data, PloS one 13(8) (2018), 1–21.
23. Lin, Rank aggregation methods, Wiley Interdisciplinary Reviews: Computational Statistics 2(5) (2010), 555–570.
24. Chen, Yan, Jin, Xue and Gao, A hybrid feature selection algorithm for gene expression data classification, Neurocomputing 256 (2017), 56–62.
25. Morovvat and Osareh, An ensemble of filters and wrappers for microarray data classification, Mach Learn Appl An Int J 3(2) (2016), 1–17.
26. Mundra P.A. and Rajapakse J.C., Svm-rfe with MRMR filter for gene selection, IEEE Transactions on Nanobioscience 9(1) (2010), 31–37.
27. Pashaei and Aydin, Binary black hole algorithm for feature selection and classification on biological data, Applied Soft Computing 56 (2017), 94–106.
28. Rakkeitwinai, Lursinsap, Aporntewan and Mutirangura, New feature selection for gene expression classification based on degree of class overlap in principal dimensions, Computers in Biology and Medicine 64 (2015), 292–298.
29. Rani R.R. and Ramyachitra, Microarray cancer gene feature selection using spider monkey optimization algorithm and cancer classification using svm, Procedia Computer Science 143 (2018), 108–116.
30. Seijo-Pardo, Porto-Díaz, Bolón-Canedo and Alonso-Betanzos, Ensemble feature selection: Homogeneous and heterogeneous approaches, Knowledge-Based Systems 118 (2017), 124–139.
31. Sharbaf F.V., Mosafer and Moattar M.H., A hybrid gene selection approach for microarray data classification using cellular learning automata and ant colony optimization, Genomics 107(6) (2016), 231–238.
32. Song and Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data, IEEE Transactions on Knowledge and Data Engineering 25(1) (2013), 1–14.
33. Unler, Murat and Chinnam R.B., mr2pso: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification, Information Sciences 181(20) (2011), 4625–4641.
34. Venkatesh and Anuradha, A review of feature selection and its methods, Cybernetics and Information Technologies 19(1) (2019), 3–26.
35. Venkatesh and Anuradha, A hybrid feature selection approach for handling a high-dimensional data, in: Innovations in Computer Science and Engineering, Springer, 2019, pp. 365–373.
36. Vieira S.M., Mendonça L.F., Farinha G.J. and Sousa J.M., Modified binary pso for feature selection using svm applied to mortality prediction of septic patients, Applied Soft Computing 13(8) (2013), 3494–3504.
37. Vivian-Griffiths, Baker, Schmidt K.M., Bracher-Smith, Walters, Artemiou, Holmans, O’donovan M.C., Owen M.J., Pocklington et al., Predictive modeling of schizophrenia from genomic data: Comparison of polygenic risk score with kernel support vector machines approach, American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 180(1) (2019), 80–85.
38. Wahde and Szallasi, A survey of methods for classification of gene expression data using evolutionary algorithms, Expert Review of Molecular Diagnostics 6(1) (2006), 101–110.
39. Wang, Tan and Niu, Feature selection for classification of microarray gene expression cancers using bacterial colony optimization with multi-dimensional population, Swarm and Evolutionary Computation, 2019.
40. Willett, Combination of similarity rankings using data fusion, Journal of Chemical Information and Modeling 53(1) (2013), 1–10.
41. Yuan, Yang and Ji, Partial maximum correlation information: A new feature selection method for microarray data classification, Neurocomputing 323 (2019), 231–243.
42. Zhu, Ong Y.-S. and Dash, Markov blanket-embedded genetic algorithm for gene selection, Pattern Recognition 40(11) (2007), 3236–3248.