Gene selection for enhanced classification on microarray data using a weighted k-NN based algorithm

Abstract

Feature selection is a common approach to microarray analysis. Previous approaches select features either with classical statistical tests, possibly tuned with a classifier, or with regularization penalties incorporated into the cost function. Here we propose instead a feature ranking and weighting scheme that combines statistical techniques with a weighted $k$-NN classifier using a modified forward selection procedure.

We demonstrate that the classification performance of our proposal exceeds that of existing methods on a range of public microarray gene expression datasets. The proposed method is also compared to state-of-the-art feature selection algorithms by means of the Friedman test.

Although many feature selection techniques have been used for genomic data, the experimental results show the classification superiority of our method on most of the gene expression datasets considered.

Keywords

Computational genomics, microarray data analysis, feature selection, feature ranking, feature weighting, k-nearest neighbors

1. Introduction

The wide use of gene expression technologies, such as microarrays, makes it possible to screen thousands of genes over multiple observations. In general, a microarray dataset is a high-dimensional structure consisting of few samples ($n$) with thousands of genes ($p$). Gene expression information helps to monitor and measure relevant data for understanding biological processes, and facilitates analysis in specific contexts such as cancer diagnosis or the classification of different tumor types [7, 18, 21]. Due to the nature of microarrays, some drawbacks of classic machine learning and statistical algorithms for classification purposes have been identified. First, classification performance is poor in high-dimensional domains with few samples, referred to as $n \ll p$, in which machine learning techniques do not have sufficient observations to perform good classification or prediction. Second, datasets with thousands of features are more likely to contain many correlated or irrelevant features (genes), yielding poor classification accuracy and high computational costs [36].

One possible solution to these issues is to use feature selection methods to reduce the data dimensionality while preserving uncorrelated genes with the most discriminative information. Feature selection (FS) techniques have been widely explored in the past decades [8, 30]. The main idea behind FS is to select a small subset of $m$ features from a larger set of $p$ variables without degrading the classification performance [4, 27, 36]. According to the way they operate, feature selection techniques are grouped into three main categories: filter methods, wrapper methods, and embedded methods.

Filter methods remove correlated features before the classification step, typically by evaluating and selecting the most relevant features individually [34]. Wrapper methods, on the other hand, use a classifier to evaluate the relevance of different subsets of features. The selected feature subset is the one with the highest classification accuracy. As a benefit, wrappers tend to lead to better classification results [19, 28]. In embedded methods, feature selection is part of the model construction, where strongly correlated features tend to be in or out of the model altogether, commonly by means of $\ell_{1}$-norm penalization, also known as sparse discriminant analysis [9, 45, 59].

Guaranteeing the optimum classification accuracy would require evaluating all possible feature subsets, which is an inefficient solution. To trade off efficiency and effectiveness, several strategies have been developed in which feature ranking plays an important role in finding the best features, leading to competitive and close-to-optimum results [48, 57].

Several feature selection techniques have been used in the microarray data context. Some of them use genetic algorithms or meta-heuristics to perform the variable selection, as shown in [13, 29, 33, 41]. Other works combine different strategies, such as hybrid wrapper-filter methods [2, 23, 24, 31], an ensemble consisting of a variety of filters and classifiers [5], or a nearest neighbors wrapper-based ensemble [40]. Some efforts are devoted to reducing the processing time, as in [49, 6], where a methodology to accelerate the evaluation of candidate features in a wrapper-based algorithm and a distributed filter approach to improve the running time are presented. Furthermore, a comprehensive study presented in [26] shows that statistical methods constitute a valuable feature ranking tool for gene selection in microarray datasets.

In this paper, an efficient feature selection method using a weighted nearest neighbor classification scheme is presented. The method uses a novel way to rank feature relevance, and the ranking values are also used as the weights for the Euclidean distance computation. A sequential search is then carried out to select the genes with the most discriminative information. The method is compared with 5 state-of-the-art feature selection algorithms over 11 public microarray datasets in terms of the $F$-measure, and statistical significance is assessed by means of the Friedman test. Although many feature selection techniques have been used on genomic data, the experimental results show that our method outperforms previous work on most of the gene expression datasets considered.

The rest of the paper is organized as follows. Section 2 presents the proposed feature selection algorithm, including the feature ranking and weighting schemes, the forward selection algorithm, and the weighted $k$-NN procedure. Section 3 is devoted to experimental results and discussion, where we present a comparative analysis of the proposed feature selection method with feature selection algorithms in the literature. In addition, a multiple pairwise test following the Friedman test was carried out to provide a more precise comparison. The last section presents the conclusions of this work.

2. Proposed feature selection algorithm

The aim of this section is to introduce the present proposal. As an initial condition, the dataset needs to be standardized, so that each feature has zero mean and unit standard deviation. The feature ranking and weighting are then performed using the $1$-NN classifier. If the user decides to take advantage of an early stopping condition, a pre-filtering step can be carried out by taking the first $m<p$ features from the feature-ranked dataset; otherwise, the algorithm takes all $p$ features. As a final step, the proposed forward selection procedure is applied. A general view of the proposed method is illustrated in Fig. 1. The feature ranking and weighting, as well as the forward selection method, are described in detail in the following subsections.
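The standardization step can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the helper name `standardize` is hypothetical, and the use of the population standard deviation (instead of the sample one) is an assumption.

```python
from statistics import mean, pstdev

def standardize(X):
    """Scale each column of X (rows = samples, columns = features)
    to zero mean and unit standard deviation.

    Assumption: the population standard deviation is used; a constant
    feature is mapped to all zeros to avoid dividing by zero.
    """
    scaled_columns = []
    for col in zip(*X):
        mu = mean(col)
        sigma = pstdev(col)
        if sigma == 0:
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([(v - mu) / sigma for v in col])
    # Transpose back to the samples-by-features layout.
    return [list(row) for row in zip(*scaled_columns)]

# Toy data: 3 samples, 2 features on very different scales.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
Z = standardize(X)
```

After this step both toy features contribute on a comparable scale to any Euclidean distance computed on `Z`.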

Figure 1.

Diagram of the proposed feature selection algorithm.

2.1 Feature ranking

Feature ranking, or variable ranking, is a common baseline step in several feature selection methods [19]. It measures the relevance of each individual feature in order to select variables that provide an enhanced classification result. Commonly, a correlation criterion, Shannon entropy-based mutual information, or classical statistical tests (e.g., T-test, F-test, Chi-squared) are used [20]. In this paper, a simple criterion for feature ranking is proposed, consisting of the evaluation of each feature based on the performance obtained by the $1$-NN, measured on the training set, with that feature alone.

Let $X=\left[{\phi_{1},\phi_{2},\ldots,\phi_{p}}\right]=\left[{X_{1},X_{2},\ldots,X_{n}}\right]^{T}$ be a matrix of size $n\times p$ with $n$ observations and $p$ features, and let $y=\left[{y_{1},y_{2},\ldots,y_{n}}\right]^{T}$ be an $n$-dimensional column vector containing the class of each observation in $X$. The criterion used in this paper to rank the features/genes is the feature-wise classification performance (FWC), which represents individual attribute evaluation using the $1$-NN, as shown in the following equation:

$\displaystyle R_{\textit{FW}}\left(i\right)=f_{\textit{1NN}}\left({\phi_{i},y}\right)$ (1)

where $f_{\textit{1NN}}$ represents a function that evaluates each feature $\phi_{i}$ according to its ability to predict the value of the class variable $y$, using the $1$-NN classifier. The values of $R_{\textit{FW}}$ are then used to rank the features from best to worst.

The incorporation of the $f_{\textit{1NN}}$ function as a measure to estimate the relevance of each feature in the dataset constitutes a contribution to feature ranking methods. This property allows the reduction of the processing time while preserving the discriminatory ability of the classifier, as we will show in subsequent sections.
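A minimal sketch of the feature-wise ranking in Eq. (1) is given below. It is an illustration under stated assumptions, not the WEKA-based implementation: the function names are hypothetical, the single-feature $1$-NN is evaluated by leave-one-out on the training set, and the $F$-measure is macro-averaged over classes; the averaging and the validation scheme used by the original implementation may differ.

```python
def one_nn_loo_predictions(values, labels):
    """Leave-one-out 1-NN predictions using a single feature (assumption)."""
    preds = []
    for i, v in enumerate(values):
        # Nearest other sample along this single feature.
        nearest = min((j for j in range(len(values)) if j != i),
                      key=lambda j: abs(values[j] - v))
        preds.append(labels[nearest])
    return preds

def macro_f_measure(y_true, y_pred):
    """Macro-averaged F-measure in percent (averaging is an assumption)."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores)

def feature_wise_scores(X, y):
    """R_FW(i): score of the 1-NN classifier using feature i alone."""
    return [macro_f_measure(y, one_nn_loo_predictions(col, y))
            for col in zip(*X)]

def rank_features(scores):
    """Feature indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [1.1, 1.1]]
y = ["a", "a", "b", "b"]
scores = feature_wise_scores(X, y)
order = rank_features(scores)
```

In the toy data the first feature is ranked ahead of the second because it alone classifies every sample correctly.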

2.2 Feature weighting

One of the most important steps in this proposal is the computation of the weight of each feature. The proposed method is a classification performance-based approach, in which the $1$-NN classifier is run with one feature at a time. The obtained classification performance is directly used as the feature weight in the Euclidean distance computed to predict the class of a sample. Some existing methods use the classification performance to help compute the weights; however, none of them uses it in the way we propose in this paper. Commonly, performance-based methods parameterize a distance function with feature weights and use the performance feedback to iteratively optimize these parameters [53]. In contrast, we propose a feature-wise classification weighting in which the performance itself serves as the feature weight. The feature-wise classification weighting is obtained as follows: take the $R_{\textit{FW}}$ vector, computed as shown in Eq. (1), and map its values to the interval between 0 and 1:

$\displaystyle W_{\textit{FW}}\left(i\right)=\frac{R_{\textit{FW}}\left(i\right)}{100}=\frac{f_{\textit{1NN}}\left({\phi_{i},y}\right)}{100}$ (2)

Example: Let $R_{\textit{FW}}=$ [58.6 48 88 87] be the feature-wise performance vector, the feature-wise weighting would be $W_{\textit{FW}}=$ [0.586 0.48 0.88 0.87].

Note: Values for $R_{\textit{FW}}$ can be obtained by using different evaluation criteria, such as accuracy, area under the ROC curve, and so on. However, due to its capability to deal with class-imbalanced problems, we adopt the $F$-measure as the evaluation measure.

2.3 Weighted k-NN

The $k$-nearest neighbor classifier ($k$-NN) is a non-parametric classification algorithm and one of the most representative instance-based classifiers. Although the algorithm was proposed in 1951 [15], it remains relevant today, not only for historical reasons but also for its ease of implementation and classification effectiveness. This is evidenced by a range of recent papers in the literature using the $k$-NN. These contributions include modifications of the classifier [14, 39, 42, 58] and its use in different applications, such as computational linguistics and text mining [17, 46], classification in big data and large-scale datasets [12, 32], biomedical engineering [1, 51, 54] and, of course, microarray dataset classification [5, 6, 35, 40, 49, 56].

The nearest neighbor ($k=1$) decision rule assigns an unclassified instance to the class of the nearest previously classified instance; its probability of error is asymptotically bounded above by twice the Bayes minimum error probability [10]. How close one instance is to another is determined by a distance function. The Euclidean, Manhattan, and Mahalanobis distances are among the most commonly used metrics. However, choosing a distance is not an easy task, and some previous works present comprehensive studies to help with this issue [3, 47, 52]. In [3], the authors argue that the choice of the most popular similarity measures is commonly justified empirically, from a practical point of view. They suggest that the discussion about the use of different cost functions is not concluded and encourage researchers to continue experimenting in this “fascinating area”. Similarly, the study presented in [52] reports different metrics depending on what kind of classifier is desired. For example, multiple local metrics are suggested for a highly accurate classifier, and low-rank distance metrics with ball-trees are recommended for a high-speed $k$-NN classifier. The authors of [47] conclude that the Euclidean distance is the most suitable for practical implementations and applications, and recommend its use when no further prior knowledge of the data is available. They also review the main advantages of choosing the Mahalanobis distance over the Euclidean. Nevertheless, in the microarray context, which is a high-dimensional problem ($n \ll p$), the Mahalanobis metric increases the computational burden, since the inverse of the covariance matrix is needed to calculate the distance between each pair of instances. Based on the previous analysis, we choose the Euclidean distance as the similarity measure for nearest neighbor classification.
Furthermore, it is not necessary to know the exact distance between two instances, only the order relationship among the nearest ones; for this reason, the squared Euclidean distance is used instead, which avoids computing a square root for each pair of observations.

On the other hand, it is well known that some features are more informative than others for class prediction [19]. Consequently, a nearest neighbor rule implementing a feature-weighted Euclidean distance is used in this paper. Regardless of how the weights $W$ are calculated, the distance function can be obtained as expressed in Eq. (3).

$\displaystyle d_{W}\left({X_{i},X_{j}}\right)^{2}=\sum\limits_{k=1}^{p}{W\left(k\right)\times\left({X_{i}\left(k\right)-X_{j}\left(k\right)}\right)^{2}}$ (3)

In accordance with what is presented in the previous sections, the feature ranking values are also used as the weights to construct the nearest neighbor classifier.
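The weighted squared distance of Eq. (3), together with the weights of Eq. (2), can be illustrated with the short sketch below. The function names and the toy data are hypothetical; only the $R_{\textit{FW}}$ vector is taken from the example in Section 2.2.

```python
def weighted_sq_distance(a, b, weights):
    """Squared feature-weighted Euclidean distance of Eq. (3)."""
    return sum(w * (x - z) ** 2 for w, x, z in zip(weights, a, b))

def predict_1nn(train_X, train_y, query, weights):
    """Label of the training sample closest to `query` under the weights."""
    nearest = min(range(len(train_X)),
                  key=lambda i: weighted_sq_distance(train_X[i], query, weights))
    return train_y[nearest]

# Eq. (2): weights are the feature-wise scores (in percent) divided by 100.
R_FW = [58.6, 48.0, 88.0, 87.0]
W_FW = [r / 100.0 for r in R_FW]

# Toy two-feature problem (hypothetical data) where the weights change
# which training sample is the nearest neighbor of the query.
train_X = [[0.0, 3.0], [2.0, 0.0]]
train_y = ["a", "b"]
query = [0.5, 0.0]
unweighted = predict_1nn(train_X, train_y, query, [1.0, 1.0])
weighted = predict_1nn(train_X, train_y, query, [0.9, 0.1])
```

With uniform weights the query is assigned to class "b", whereas down-weighting the second feature makes the sample of class "a" the nearest one. No square root is taken, since only the ordering of the distances matters.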

2.4 Stepwise forward selection using the 1-NN

Wrapper methods evaluate the relevance of feature subsets by using a classification algorithm. Clearly, in order to guarantee the best classification results, it would be necessary to evaluate all possible combinations of gene subsets, which increases the computational complexity exponentially. For this reason, wrapper methods are commonly undervalued, as they seem to be brute-force methods [19, 28]. The time complexity represents a big problem when processing datasets with thousands of features. Fortunately, efficient strategies such as greedy search are suitable to alleviate this issue.

Stepwise forward selection (SFS) uses a greedy search method which starts with an empty subset and selects the first feature at iteration zero. At the $k$th iteration, for $k=0,\ldots,p-1$, the algorithm considers all the ($p-k$) remaining features, tentatively incorporates each of them into the current subset, one variable at a time, and keeps the one whose subset performs best with the selected classifier. Counting the empty subset, the evaluated subsets amount to a total of $1+\sum_{k=0}^{p-1}{\left({p-k}\right)}=1+p\left({p+1}\right)/2$ [25, 44].

In this study, we modify this procedure by evaluating every feature individually and ranking the features from best to worst, so that the first feature processed by forward selection is the best one (regardless of the approach used to rank them) and the next in the queue is the best of the remaining variables. In the proposed feature selection method, the first step consists of computing the feature ranking $R$ and the feature weighting $W$, and initializing the best classification performance $\textit{BC}=0$ and the best subset $\textit{BS}=\left\{\emptyset\right\}$. Afterwards, the next best-ranked feature $\phi_{i}\in R$ is tentatively appended to the best subset and the nearest neighbor classification is performed. If the classification performance improves, the feature is kept in the best subset and the best classification performance is updated; otherwise, the feature is discarded. This procedure is repeated from the first to the last ranked feature.

The main advantage of this approach is that the size of the search space is only $p$, where $p$ is the number of features in the dataset. Although this approach reduces the number of computations in comparison to the conventional stepwise forward selection (SFS) scheme, it is possible to provide a more efficient procedure by means of an early stopping criterion. Commonly, this criterion stops the subset search if, at any given iteration, the predictive performance decreases when the next feature is added to the current best subset. Instead, we decided to include a pre-filtering stage as the stopping criterion, in order to contribute to the algorithm’s flexibility. Algorithm 1 shows the pseudo-code of this method. Unlike exhaustive best subset selection or conventional forward selection, which involve fitting $2^{p}$ models and $p(p+1)/2$ models, respectively, our proposal fits only $m\leq p$ models, with $m=p$ as the worst-case scenario (when the stopping criterion is not used).
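To give a sense of scale for these counts, the following sketch compares the number of models fitted by each strategy. The function names are hypothetical; $p=2308$ is the number of features of the SRBCT dataset in Table 1, and the pre-filter size $p/10$ is rounded down here by assumption.

```python
def exhaustive_count(p):
    """Models fitted by exhaustive best subset selection: 2^p."""
    return 2 ** p

def conventional_sfs_count(p):
    """Models fitted by conventional forward selection: sum of (p - k)
    for k = 0..p-1, which equals p(p+1)/2."""
    return sum(p - k for k in range(p))

def ranked_forward_count(p, m=None):
    """Models fitted by the ranked forward scheme: m if a pre-filter
    size m is given, otherwise p (no early stopping)."""
    return p if m is None else min(m, p)

p = 2308  # number of features in SRBCT (Table 1)
counts = {
    "conventional SFS": conventional_sfs_count(p),
    "ranked, no pre-filter": ranked_forward_count(p),
    "ranked, p/10 pre-filter": ranked_forward_count(p, p // 10),
}
# The exhaustive count is too large to print usefully; report its digits.
exhaustive_digits = len(str(exhaustive_count(p)))
```

For $p=2308$, conventional forward selection amounts to 2,664,586 model fits, against 2,308 for the ranked scheme without pre-filtering and 230 with a $p/10$ pre-filter.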

Algorithm 1. Proposed Forward Selection-based scheme for the Weighted Nearest Neighbor classification

Input: Microarray gene expression dataset

Output: A set of selected features/genes: BS

(1) Select the parameters to use with the $f_{\textit{1NN}}$ algorithm
(2) Initialize $\textit{BC}=0$ and $\textit{BS}=\left\{\emptyset\right\}$
(3) for $i=1$ to $m$
(4)     $\textit{aux\_subset}=\textit{BS}\cup\phi_{i}$
(5)     $\textit{aux\_class}=f_{\textit{1NN}}\left(\textit{aux\_subset},y\right)$
(6)     if ($\textit{aux\_class}>\textit{BC}$)
(7)         $\textit{BS}=\textit{aux\_subset}$
(8)         $\textit{BC}=\textit{aux\_class}$
(9)     end if
(10) end for
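The loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the WEKA implementation: `forward_select` and `toy_evaluate` are hypothetical names, and the scoring function is passed in as a parameter standing in for the weighted $1$-NN evaluation $f_{\textit{1NN}}$.

```python
def forward_select(ranked_features, evaluate, m=None):
    """Ranked forward selection following Algorithm 1.

    ranked_features: feature indices ordered from best to worst.
    evaluate: callable mapping a list of feature indices to a score
              (stand-in for the weighted 1-NN performance).
    m: optional pre-filter size; only the first m ranked features
       are visited when it is given.
    """
    limit = len(ranked_features) if m is None else min(m, len(ranked_features))
    best_score = 0.0   # BC
    best_subset = []   # BS
    for feature in ranked_features[:limit]:
        candidate = best_subset + [feature]   # aux_subset
        score = evaluate(candidate)           # aux_class
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset, best_score

def toy_evaluate(subset):
    """Hypothetical scorer: features 3, 0 and 2 help, any other hurts."""
    gains = {3: 60.0, 0: 25.0, 2: 10.0}
    return sum(gains.get(f, -5.0) for f in subset)

ranked = [3, 1, 0, 2]
full_search = forward_select(ranked, toy_evaluate)
pre_filtered = forward_select(ranked, toy_evaluate, m=2)
```

Each ranked feature is evaluated exactly once, so the number of calls to the scorer is at most the number of visited features. In the toy run, feature 1 is discarded because it lowers the score, and the pre-filtered run stops after the first two ranked features.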

3. Experimental results and discussions

Throughout the experimental study, 11 gene expression microarray datasets with thousands of features, described in Section 3.1, were used. Section 3.2 presents the details of the algorithm implementation, as well as the measures used for evaluating the classification performance and for the significance analysis. Finally, the comparative analysis results and discussion are presented in Section 3.3.

3.1 Datasets

The experiments were conducted over 11 publicly available microarray gene expression datasets with varying numbers of classes, which can be obtained from [60]. Summary information about the number of features, samples, classes, and the class distribution, together with a brief description, is given in Table 1. In general, the datasets used in this paper have an imbalanced class distribution, which poses serious complications for most machine learning algorithms. The difficulty arises because classification algorithms tend to ignore the minority class at the learning stage. As a result, instances belonging to small or minority classes are commonly misclassified [43]. To give more insight into imbalanced data, let us assume an imbalanced two-class problem in which the main goal is to determine whether a person is healthy or has a rare disease. Instances from the class representing the disease case are scarce (e.g., 5% of instances) in comparison to the healthy group (95%). In this scenario, it is relatively simple for an algorithm to maximize the classification accuracy on the dataset, without performing any learning, by predicting each person to be healthy. For this reason, the accuracy, i.e., the percentage of correctly classified instances, is not a convenient measure to evaluate the classification performance, and another evaluation measure should be used.
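The healthy/disease scenario above can be checked with a few lines of Python. The label names and the sample size of 100 are illustrative assumptions matching the 5%/95% split of the example.

```python
# 100 hypothetical patients: 5% with the disease, 95% healthy.
y_true = ["disease"] * 5 + ["healthy"] * 95
# A trivial "classifier" that predicts everyone to be healthy.
y_pred = ["healthy"] * len(y_true)

# Fraction of correctly classified instances.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall of the disease class: detected disease cases over all disease cases.
true_positives = sum(t == "disease" and p == "disease"
                     for t, p in zip(y_true, y_pred))
disease_recall = true_positives / y_true.count("disease")
```

The trivial predictor reaches an accuracy of 0.95 while its recall on the disease class is 0, which is why accuracy alone is misleading on imbalanced data.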

Table 1
Microarray cancer datasets. First group description

Dataset	Features	Samples	Classes	Class distribution (in %)	Description
9 tumors	5,726	60	9	15/12/13/10/10/13/13/4/10	9 various human tumor types
11 tumors	12,533	174	11	16/5/15/13/7/6/4/15/3/8/8	11 various human tumor types
14 tumors	15,009	308	26	10/4/6/5/7/4/3/3/5/4/4/5/4/6/2/3/2/4/2/2/2/2/4/3/1/3	14 various human tumor types
Braintumor 1	5,920	90	5	67/11/11/4/7	5 human brain tumor types
Braintumor 2	10,367	50	4	28/14/28/30	4 malignant glioma types
DLBCL	5,469	77	2	75/25	Diffuse large B-cell and follicular lymphomas
Leukemia 1	5,327	72	3	53/13/34	AML, ALL B-cell, and ALL T-cell
Leukemia 2	11,225	72	3	39/33/28	AML, ALL, and MLL
Lung cancer	12,600	203	5	68/9/10/10/3	4 lung cancer types and normal tissue
Prostate tumor	10,509	102	2	49/51	Prostate tumor and normal tissue
SRBCT	2,308	83	4	35/30/13/22	Small round blue cell tumors

3.2 Experimental setup

The FWC feature selection algorithm was implemented using the libraries of the Waikato Environment for Knowledge Analysis (WEKA) (version 3.8) [22]. This software is written in Java and allows the integration of different components in the data mining field, such as pre-processing methods, classification algorithms, cross-validation schemes, visualization, filters, and so on. The parameters for the $1$-NN classifier (named IB1 in WEKA) are kept at their default values. To validate the effectiveness of the proposed FWC, we use ten-fold cross-validation, performing the feature selection within each fold, where the $F$-measure guiding the selection is evaluated only on the training set. The average results achieved by this procedure over 30 repeated runs on each dataset are reported. As explained before, the classification accuracy rate is not a good measure for imbalanced datasets and another measure should be used. The $F$-measure is computed as the harmonic mean of precision ($P$) and recall ($R$), which means that precision and recall are weighted evenly in the classifier evaluation, making it more suitable for class-imbalanced data. For this reason, the $F$-measure was adopted for ranking, weighting, and evaluation purposes, and it is computed as follows:

$\displaystyle F=\frac{2}{\frac{1}{P}+\frac{1}{R}}=\frac{2PR}{P+R}$ (4)
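As a small worked example of Eq. (4), the sketch below computes the $F$-measure from given precision and recall values. The function name and the input values are illustrative; returning 0 when both precision and recall are 0 is a convention assumed here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, Eq. (4).

    Assumption: the result is defined as 0 when both inputs are 0.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A classifier with perfect recall but only 50% precision.
harmonic = f_measure(0.5, 1.0)
arithmetic = (0.5 + 1.0) / 2
```

The harmonic mean (2/3) is lower than the arithmetic mean (0.75), since it is pulled towards the smaller of the two values and therefore penalizes a classifier that is strong on only one of precision and recall.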

Statistical significance tests consist of rejecting (or not) the null hypothesis $H_{0}$, which states that there are no significant differences between observations. In this regard, the Friedman test [16] was used. This is a non-parametric test based on average ranks that allows detecting statistically relevant differences among a group of classifiers. The results of comparing the proposed method and state-of-the-art algorithms indicate significant differences, with a value of $p=1.52\times10^{-7}$. The Friedman test cannot detect precisely which classifiers differ significantly from the others; for this reason, we use the Nemenyi post-hoc test [37], a multiple pairwise test, considering a confidence of 95% ($\alpha=0.05$).
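The average ranks on which the Friedman test is based can be computed as in the following sketch. The function names and the score table are hypothetical; within each dataset the best (highest) score receives rank 1 and tied methods share the average of their ranks, which is the usual convention.

```python
def rank_row(scores):
    """Ranks within one dataset: 1 = highest score, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks

def average_ranks(score_table):
    """Average rank of each method over all datasets (rows of the table)."""
    per_dataset = [rank_row(row) for row in score_table]
    return [sum(col) / len(per_dataset) for col in zip(*per_dataset)]

# Hypothetical F-measures: 3 datasets (rows) by 3 methods (columns).
table = [
    [90.0, 85.0, 85.0],
    [100.0, 100.0, 70.0],
    [80.0, 95.0, 60.0],
]
avg = average_ranks(table)
```

A lower average rank indicates a method that is more often among the best; the average ranks of $k$ methods always sum to $k(k+1)/2$.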

In addition, the $z$-statistic was used to compare the $i$-th and the $j$-th methods directly. It can be calculated by Eq. (5), as shown by Demšar in [11], where $R_{i}$ and $R_{j}$ are the average ranks of the $i$-th and the $j$-th feature selection algorithms, respectively. SE is expressed in Eq. (6) and represents the standard error of the pairwise comparison between these methods, where $k$ is the number of methods to compare and ND represents the number of datasets. The $z$ value is used to find the corresponding $p$-value from the standard normal distribution ${\cal N}(0,1)$.

$\displaystyle z=\frac{\left({R_{i}-R_{j}}\right)}{\textit{SE}}$ (5)

$\displaystyle\textit{SE}=\sqrt{\frac{k\left({k+1}\right)}{6\,\textit{ND}}}$ (6)
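Eqs. (5) and (6) can be illustrated with the sketch below, which also converts $z$ into a two-sided $p$-value under ${\cal N}(0,1)$. The function names and the input values are hypothetical and are not meant to reproduce the entries of Table 4.

```python
import math

def standard_error(k, n_datasets):
    """SE of Eq. (6) for k methods compared over n_datasets datasets."""
    return math.sqrt(k * (k + 1) / (6.0 * n_datasets))

def z_statistic(rank_i, rank_j, k, n_datasets):
    """z of Eq. (5) from the average ranks of two methods."""
    return (rank_i - rank_j) / standard_error(k, n_datasets)

def two_sided_p_value(z):
    """Two-sided p-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical comparison: 5 methods, 10 datasets, average ranks 4.5 and 2.0.
z = z_statistic(4.5, 2.0, k=5, n_datasets=10)
p_value = two_sided_p_value(z)
significant = p_value < 0.05
```

Here the standard error is $\sqrt{30/60}\approx 0.707$, so a rank difference of 2.5 gives $z\approx 3.54$ and the difference between the two hypothetical methods would be declared significant at $\alpha=0.05$.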

All experiments were carried out on a PC with an Intel Core i3 Processor (3.00 GHz) running Windows 7 Professional operating system with 4096 MB of RAM.

3.3 Experimental results

In this section, the proposed FWC feature selection algorithm is compared to the conventional SFS algorithm using the greedy stepwise search implemented in WEKA, and to state-of-the-art methods such as the bacterial colony optimization based feature selection algorithm (BCO-$k$NN) [50], the hybrid information gain-genetic algorithm (IG-GA-$k$NN) [55], the two-population genetic algorithm with distance-based $k$-nearest neighbor voting classifier (TGA-$k$NNV) [29], and the sequential random $k$-nearest neighbors (SR$k$NN) [40]. BCO-$k$NN is a weight-based feature selection method which exploits the benefits of evolutionary algorithms for optimization. To this end, it employs the evolutionary bacterial colony optimization (BCO) algorithm [38] to assign different weights to the features and guide the selection process. The classification error of the feature subset is taken as the fitness function for BCO. IG-GA-$k$NN and TGA-$k$NNV are both hybrid filter/wrapper methods using a genetic algorithm (GA). IG-GA-$k$NN first computes the feature relevance using the information gain (IG) criterion and filters the features according to a relevance threshold; feature selection is then carried out by means of a GA. TGA-$k$NNV, on the other hand, first clusters genes using Fisher’s least significant differences and then applies a two-population GA. The first population is used to select a cluster from the first step, whereas the second population selects genes from the clusters according to their relevance. Finally, the SR$k$NN approach (SRKNN in the tables) is an ensemble of $k$-NN classifiers arranged similarly to Random Forests, where each base classifier uses a forward selection strategy and a majority voting scheme gives the predictive result.

The experimental comparative results of the proposed FWC, conventional forward selection, and state-of-the-art feature selection methods are given in Table 2, where the best results are bold-faced. The number of selected features is shown in parentheses. According to Table 2, the FWC algorithm outperforms the rest of the methods, achieving a higher $F$-measure in most of the datasets (7 out of 11). In general, algorithms such as IG-GA/$k$NN and SRKNN tend to select more features, which are either irrelevant or redundant. BCO-$k$NN consistently selects a smaller number of features; however, its classification results are quite similar to those obtained with the proposed FWC.

It should be mentioned that in FWC the gene selection is consistent even if the pre-filtering stage is applied to reduce the search space to at most ($p$/3), ($p$/5), or ($p$/10) features (i.e., when the early stopping condition is used). Regardless of this choice, our proposal selects the same attributes in datasets such as Braintumor1, Leukemia1, Lung cancer, Prostate tumor, and SRBCT. Similarly, in some other datasets (11 tumors, Braintumor2, DLBCL, and Leukemia2) our proposal selects attributes consistently, i.e., FWC ($p$/10) selects only 2 attributes fewer than FWC without the stopping condition. This supports the claim that the proposed feature ranking scheme properly evaluates the attributes, improving the $F$-measure.

Table 2
Classification performance ( $F$ -measure and selected features) comparison of feature selection methods on microarray datasets

Dataset	Proposed FWC $k=1$	Proposed FWC ($p$/3) $k=1$	Proposed FWC ($p$/5) $k=1$	Proposed FWC ($p$/10) $k=1$	Conventional SFS $k=1$	Conventional SFS, stop crit. $k=1$	BCO-$k$NN $k=5$	IG-GA/$k$NN $k=1$	TGA-$k$NNV $k=3$	SRKNN [40] $k=1$
9 tumors	86.14% (43)	82.19% (35)	77.52% (31)	68.67% (16)	–	75.00% (16)	92.22% (28)	85.00% (52)	–	43% (25)
11 tumors	95.95% (54)	95.87% (52)	95.87% (52)	95.87% (52)	–	89.08% (15)	89.62% (24)	92.53% (479)	–	84% (72)
14 tumors	81.28% (195)	76.44% (144)	74.80% (130)	69.75% (100)	–	76.30% (58)	67.64% (44)	65.26% (810)	–	69% (83)
Braintumor1	96.15% (19)	96.15% (19)	96.15% (19)	96.15% (19)	–	93.33% (10)	96.30% (21)	93.33% (244)	93.33% (16)	85% (28)
Braintumor2	100.00% (15)	100.00% (15)	98.00% (14)	96.00% (13)	–	98.00% (5)	100.00% (8)	88.00% (489)	94.00% (8)	74% (23)
DLBCL	100.00% (12)	100.00% (12)	98.69% (10)	98.69% (10)	–	100.00% (4)	100.00% (3)	100.00% (107)	100.00% (12)	97% (44)
Leukemia1	100.00% (10)	100.00% (10)	100.00% (10)	100.00% (10)	–	100.00% (4)	100.00% (7)	100.00% (82)	98.60% (6)	94% (28)
Leukemia2	100.00% (7)	100.00% (7)	100.00% (7)	98.62% (6)	–	100.00% (6)	100.00% (4)	98.61% (782)	100.00% (9)	92% (22)
Lung cancer	99.50% (25)	99.50% (25)	99.50% (25)	99.50% (25)	–	98.52% (10)	99.34% (32)	95.57% (2101)	97.00% (11)	77% (53)
Prostate tumor	98.04% (10)	98.04% (10)	98.04% (10)	98.04% (10)	–	95.10% (5)	100.00% (7)	96.08% (343)	98.00% (8)	91% (25)
SRBCT	100.00% (11)	100.00% (11)	100.00% (11)	100.00% (11)	100.00% (1717)	100.00% (7)	100.00% (9)	100.00% (56)	100.00% (10)	98% (35)
Average	96.10%	95.29%	94.42%	92.84%	–	92.86%	95.01%	92.22%	–	82.18%

So far, the different feature selection algorithms report very similar effectiveness. Nevertheless, Table 3 shows the Friedman average ranks of the compared methods, where the four strategies proposed in this paper rank within the best five.

Table 4 summarizes the results of the Nemenyi post-hoc test on the Friedman test and of the $z$-test, where we compare our proposal with other feature selection methods. Accordingly, we can in general safely reject the null hypothesis at a significance level of 5% for the FWC and FWC ($p$/3) methods as compared to $k$-NN using conventional SFS, IG-GA, and TGA schemes, and also in comparison with the SRKNN classification ensemble. In particular, the proposed FWC ($p$/5) is significantly different from two of the state-of-the-art methods described here (TGA-$k$NNV and SRKNN). The results of the $z$-test show an analogous comparative behavior.

Table 3

Friedman average ranks

Method	Avg. rank
Proposed FWC	2.7727
Proposed FWC ( $p$ /3)	3.2273
BCO- $k$ NN	3.6364
Proposed FWC ( $p$ /5)	4.1364
Proposed FWC ( $p$ /10)	5.0000
TGA- $k$ NNV	5.5909
Conventional SFS (w/stopping criterion)	5.8182
IG-GA- $k$ NN	6.0909
SRKNN	8.7273

Table 4

Friedman test with significance level $\alpha=$ 0.05, and $z$ -test with critical value $q_{0.05}=$ 2.576 for two-tailed Nemenyi post-hoc test. The $p$ -values $<$ 0.05 are shown in bold

vs	Proposed FWC		Proposed FWC ( $p$ /3)		Proposed FWC ( $p$ /5)		Proposed FWC ( $p$ /10)
	$p$ -value	$z$ -test	$p$ -value	$z$ -test	$p$ -value	$z$ -test	$p$ -value	$z$ -test
Conventional SFS (w/stopping criterion)	0.034	4.517	0.034	3.843	0.317	2.495	1.000	1.214
BCO- $k$ NN	1.000	1.281	1.000	0.607	0.480	0.742	0.317	2.023
IG-GA/ $k$ NN	0.005	4.922	0.034	4.248	0.096	2.899	0.096	1.618
TGA- $k$ NNV	0.005	4.180	0.005	3.506	0.020	2.157	0.527	0.877
SRKNN [40]	0.001	8.832	0.001	8.158	0.001	6.809	0.001	5.528

Table 5

Time comparison (in seconds) of feature selection methods on a range of data sets; entries of the form hh:mm:ss are given in hours, minutes, and seconds

Data set	Proposed FWC $k=1$	Proposed FWC ($p$/3) $k=1$	Proposed FWC ($p$/5) $k=1$	Proposed FWC ($p$/10) $k=1$	Conventional SFS $k=1$	Conventional SFS, stop crit. $k=1$
9 tumors	268.54	99.95	64.71	43.46	–	478.00
11 tumors	3294.79	1130.41	875.82	397.68	–	1959.00
14 tumors	10004.84	3282.85	1909.56	1026.04	–	18:56:21
Braintumor1	392.31	143.51	97.70	60.23	–	252.00
Braintumor2	789.91	286.46	207.79	119.00	–	231.00
DLBCL	293.47	109.70	72.43	47.08	–	102.00
Leukemia1	261.38	98.61	64.82	42.11	–	85.00
Leukemia2	1214.49	397.10	276.54	153.76	–	278.00
Lung cancer	4397.32	1727.84	958.48	552.86	–	903.00
Prostate tumor	1345.42	419.05	298.75	168.56	–	296.00
SRBCT	77.89	30.72	22.18	18.47	40:25:02	72.00
Average	2030.94	702.38	440.80	239.02	–	465.60

From Table 4, we can say that our proposed FWC and FWC ($p$/3) are not significantly different from the evolutionary BCO-$k$NN algorithm. Likewise, the post-hoc analysis found no significant difference between FWC ($p$/10) and conventional SFS with stopping criterion. Although SFS seems to be effective in some cases, the proposed FWC yielded much better execution times, as shown in Table 5. The conventional SFS, on the other hand, leads to very large execution times, making it impractical for high-dimensional problems.

4. Conclusions

In this paper, an efficient feature selection method with a simple strategy for gene ranking and weighting was presented. It comprises a previously unexplored procedure that ranks features according to their individual classification performance. The ranking values also serve as the weights of the Euclidean distance in the $1$ -NN classifier. Experiments were carried out on eleven well-known microarray datasets. To compare our proposal with state-of-the-art algorithms, a ten-fold cross-validation scheme was used. The comparative analysis shows that the FWC was the best-ranked algorithm according to the Friedman test. In addition, the $z$ -test found significant differences between each of the proposed FWC and FWC ( $p$ /3) feature selection strategies and four of the compared algorithms. As reported in this study, the FWC can speed up the feature selection process by adding an early stopping criterion (pre-filtering stage), leading to efficient classification. In the pre-filtering stage, the maximum number of features is limited to a predefined size, which shrinks the search space and speeds up the selection. However, an appropriate search space must still be identified for classification in high-dimensional domains when no prior knowledge of the data is available.
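The ideas summarized above (ranking each feature by its individual classification performance, reusing the ranking values as distance weights, and pre-filtering to a maximum number of features) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation evaluated in this paper: it scores features with leave-one-out accuracy rather than the ten-fold scheme used in the experiments, it applies each weight to the squared difference, and all function names are hypothetical.

```python
import math


def weighted_distance(x, y, weights):
    """Euclidean distance with one non-negative weight per feature.

    Assumption: the weight multiplies the squared difference.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def predict_1nn(train_X, train_y, query, weights):
    """Label of the training sample closest to `query`."""
    best = min(
        range(len(train_X)),
        key=lambda i: weighted_distance(train_X[i], query, weights),
    )
    return train_y[best]


def loo_accuracy(X, y, weights):
    """Leave-one-out accuracy of the weighted 1-NN classifier."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if predict_1nn(rest_X, rest_y, X[i], weights) == y[i]:
            hits += 1
    return hits / len(X)


def rank_features(X, y):
    """Score every feature by the accuracy it reaches when used alone."""
    p = len(X[0])
    scores = []
    for f in range(p):
        only_f = [1.0 if g == f else 0.0 for g in range(p)]
        scores.append(loo_accuracy(X, y, only_f))
    return scores


def prefilter(scores, max_features):
    """Keep the scores of the top `max_features` features, zero the rest."""
    top = sorted(range(len(scores)), key=lambda f: -scores[f])[:max_features]
    keep = set(top)
    return [s if f in keep else 0.0 for f, s in enumerate(scores)]


# Made-up data: feature 0 separates the classes, feature 1 is misleading.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [1.1, 1.1]]
y = [0, 0, 1, 1]
scores = rank_features(X, y)
weights = prefilter(scores, max_features=1)
label = predict_1nn(X, y, [0.9, 1.0], weights)
```

In this toy example the misleading feature receives a zero weight, so it no longer influences the nearest-neighbor search; a forward selection step would then operate only on the features that survive the pre-filter.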

According to the results, the FWC constitutes a reliable and effective feature selection method. It is worth noting that our proposal is able to remove more than 98% of correlated or redundant genes while also providing enhanced classification accuracy.

In future research, we will focus on a wider analysis that applies the presented feature selection scheme with a variety of classification algorithms. Moreover, we suggest extending this proposal to other high-dimensional datasets in different contexts. This would greatly help to characterize the behavior of the presented method in terms of classification performance and to measure the removal of redundant genes by classification algorithms based on different approaches, such as statistical learning, decision trees or knowledge-based algorithms. Overall, our proposal is competitive with state-of-the-art methods and can be applied to problems of high research interest.

Acknowledgments

The authors would like to thank the following institutions for their support: Science and Technology National Council of Mexico, Universidad Autónoma de Guerrero (School of Engineering), Instituto Politécnico Nacional of Mexico (Center for Computing Research, and Center for Innovation and Computing Technological Development). Special thanks to Line Clemmensen, whose constructive comments helped to improve the present work.

Conflict of interest

The authors declare no conflict of interest.

References

1. Amaral J.L.M., Lopes A.J., Veiga, Faria A.C.D. and Melo P.L., High-accuracy Detection of Airway Obstruction in Asthma Using Machine Learning Algorithms and Forced Oscillation Measurements, Comput Methods Programs Biomed (2017).

2. Apolloni, Leguizamón and Alba, Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments, Appl Soft Comput 38 (2016), 922–932. doi: 10.1016/j.asoc.2015.10.037.

3. Biehl, Hammer and Villmann, Distance measures for prototype based classification, in: Int Work Brain-Inspired Comput, 2013, pp. 100–116.

4. Blum A.L. and Langley, Selection of relevant features and examples in machine learning, Artif Intell 97 (1997), 245–271. doi: 10.1016/S0004-3702(97)00063-5.

5. Bolón-Canedo, Sánchez-Maroño and Alonso-Betanzos, An ensemble of filters and classifiers for microarray data classification, Pattern Recognit 45 (2012), 531–539. doi: 10.1016/j.patcog.2011.06.006.

6. Bolón-Canedo, Sánchez-Maroño and Alonso-Betanzos, Distributed feature selection: An application to microarray data classification, Appl Soft Comput 30 (2015), 136–150. doi: 10.1016/j.asoc.2015.01.035.

7. Chan W.H., Mohamad M.S., Deris, Zaki, Kasim, Omatu, Corchado J.M. and Al Ashwal, Identification of informative genes and pathways using an improved penalized support vector machine with a weighting scheme, Comput Biol Med 77 (2016), 102–115. doi: 10.1016/j.compbiomed.2016.08.004.

8. Chandrashekar and Sahin, A survey on feature selection methods, Comput Electr Eng 40 (2014), 16–28. doi: 10.1016/j.compeleceng.2013.11.024.

9. Clemmensen, Hastie, Witten and Ersbøll, Sparse Discriminant Analysis, Technometrics 53 (2011), 406–413. doi: 10.1198/TECH.2011.08118.

10. Cover and Hart, Nearest neighbor pattern classification, IEEE Trans Inf Theory 13 (1967), 21–27. doi: 10.1109/TIT.1967.1053964.

11. Demšar, Statistical comparisons of classifiers over multiple data sets, J Mach Learn Res 7 (2006), 1–30. doi: 10.1.1.141.3142.

12. Deng, Zhu, Cheng, Zong and Zhang, Efficient kNN classification algorithm for big data, Neurocomputing 195 (2016), 143–148.

13. Elyasigomari, Lee D.A., Screen H.R.C. and Shaheed M.H., Development of a two-stage gene selection method that incorporates a novel hybrid approach using the cuckoo optimization algorithm and harmony search for cancer classification, J Biomed Inform 67 (2017), 11–20.

14. Ertuğrul Ö.F. and Tağluk M.E., A novel version of k nearest neighbor: Dependent nearest neighbor, Appl Soft Comput 55 (2017), 480–490.

15. Fix and Hodges J.L., Discriminatory analysis nonparametric discrimination: Consistency properties, Int Stat Rev/Rev Int Stat 57 (1989), 238. doi: 10.2307/1403797.

16. Friedman, The Use of Ranks to Avoid the Assumption of Normality Implicit in the analysis of variance, J Am Stat Assoc 32 (1937), 675–701. doi: 10.1080/01621459.1937.10503522.

17. Gali, Mariescu-Istodor and Fränti, Using linguistic features to automatically extract web page title, Expert Syst Appl 79 (2017), 296–312.

18. Golub T.R., Molecular Classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286 (1999), 531–537. doi: 10.1126/science.286.5439.531.

19. Guyon and Elisseeff, An introduction to variable and feature selection, J Mach Learn Res 3 (2003), 1157–1182.

20. Guyon and Elisseeff, An introduction to feature extraction, in: Featur Extr, Springer, 2006, pp. 1–25.

21. Guyon, Weston, Barnhill and Vapnik, Gene selection for cancer classification using support vector machines, Mach Learn 46 (2002), 389–422.

22. Hall, Frank, Holmes, Pfahringer, Reutemann and Witten I.H., The WEKA data mining software: an update, ACM SIGKDD Explor Newsl 11 (2009), 10–18.

23. Hira Z.M. and Gillies D.F., A review of feature selection and feature extraction methods applied on microarray data, Adv Bioinformatics 2015 (2015), 1–13. doi: 10.1155/2015/198363.

24. Hsu H.-H., Hsieh C.-W. and Lu M.-D., Hybrid feature selection by combining filters and wrappers, Expert Syst Appl 38 (2011), 8144–8150.

25. James, Witten, Hastie and Tibshirani, An Introduction to Statistical Learning, Springer, 2013.

26. Jeffery I.B., Higgins D.G. and Culhane A.C., Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data, BMC Bioinformatics 7 (2006), 359.

27. Kira and Rendell L.A., The feature selection problem: Traditional methods and a new algorithm, in: AAAI, 1992, pp. 129–134.

28. Kohavi and John G.H., Wrappers for feature subset selection, Artif Intell 97 (1997), 273–324. doi: 10.1016/S0004-3702(97)00043-X.

29. Lee C.P. and Lin W.S., Using the two-population genetic algorithm with distance-based k-nearest neighbour voting classifier for high-dimensional data, Int J Data Min Bioinform 14 (2016), 315. doi: 10.1504/IJDMB.2016.075820.

30. Cheng, Wang, Morstatter, Trevino R.P., Tang and Liu, Feature selection: A data perspective, ACM Comput Surv 50 (2017), 94.

31. Chen, Yan, Jin, Xue and Gao, A hybrid feature selection algorithm for gene expression data classification, Neurocomputing (2017).

32. Maillo, Ramírez, Triguero and Herrera, kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data, Knowledge-Based Syst 117 (2017), 3–15.

33. Mandal and Mukhopadhyay, Multiobjective PSO-based rank aggregation: Application in gene ranking from microarray data, Inf Sci 385 (2017), 55–75.

34. Molina L.C., Belanche and Nebot, Feature selection algorithms: A survey and experimental evaluation, in: 2002 IEEE Int. Conf. Data Mining, 2002. Proceedings, IEEE Comput. Soc, 2002, pp. 306–313. doi: 10.1109/ICDM.2002.1183917.

35. Mortazavi and Moattar M.H., Robust feature selection from microarray data based on cooperative game theory and qualitative mutual information, Adv Bioinformatics 2016 (2016), 1–16. doi: 10.1155/2016/1058305.

36. Narendra P.M. and Fukunaga, A branch and bound algorithm for feature subset selection, IEEE Trans Comput 26 (1977), 917–922. doi: 10.1109/tc.1977.1674939.

37. Nemenyi P.B., Distribution-free multiple comparisons, Princeton University, 1963.

38. Niu and Wang, Bacterial Colony Optimization, Discret Dyn Nat Soc 2012 (2012), 1–28. doi: 10.1155/2012/698057.

39. Pan, Wang and Ku, A new general nearest neighbor classification based on the mutual neighborhood information, Knowledge-Based Syst 121 (2017), 142–152.

40. Park C.H. and Kim S.B., Sequential random k-nearest neighbor feature selection for high-dimensional data, Expert Syst Appl 42 (2015), 2336–2342. doi: 10.1016/j.eswa.2014.10.044.

41. Shreem S.S., Abdullah and Nazri M.Z.A., Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm, Int J Syst Sci 47 (2016), 1312–1329.

42. Song, Liang and Zhao, An efficient instance selection algorithm for k nearest neighbor regression, Neurocomputing (2017).

43. Sun, Wong A.K.C. and Kamel M.S., Classification of imbalanced data: A review, Int J Pattern Recognit Artif Intell 23 (2009), 687–719.

44. Tang, Alelyani and Liu, Feature selection for classification: A review, Data Classif Algorithms Appl (2014), 37.

45. Tibshirani, Regression shrinkage and selection via the lasso, J R Stat Soc Ser B (1996), 267–288.

46. Trstenjak, Mikac and Donko, KNN with TF-IDF based framework for text categorization, Procedia Eng 69 (2014), 1356–1364.

47. Walters-Williams and Li, Comparative study of distance functions for nearest neighbors, Adv Tech Comput Sci Softw Eng (2010), 79–84.

48. Wang, Chen and Alterovitz, Improving PLS-RFE based gene selection for microarray data classification, Comput Biol Med 62 (2015), 14–24. doi: 10.1016/j.compbiomed.2015.04.011.

49. Wang, Chen and Alterovitz, Accelerating wrapper-based feature selection with K-nearest-neighbor, Knowledge-Based Syst 83 (2015), 81–91. doi: 10.1016/j.knosys.2015.03.009.

50. Wang, Jing and Niu, A discrete bacterial algorithm for feature selection in classification of microarray gene expression cancer data, Knowledge-Based Syst 126 (2017), 8–19. doi: 10.1016/j.knosys.2017.04.004.

51. Wei, Wan, Guo and Wong K.K.L., A novel hierarchical selective ensemble classifier with bioinformatics application, Artif Intell Med (2017).

52. Weinberger K.Q. and Saul L.K., Distance metric learning for large margin nearest neighbor classification, J Mach Learn Res 10 (2009), 207–244.

53. Wettschereck, Aha D.W. and Mohri, A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms, Artif Intell Rev 11 (1997), 273–314.

54. Yang C.-H., Weng Z.-J., Chuang L.-Y. and Yang C.-S., Identification of SNP-SNP interaction for chronic dialysis patients, Comput Biol Med 83 (2017), 94–101.

55. Yang C.-H., Chuang L.-Y. and Yang C.H., IG-GA: a hybrid filter/wrapper method for feature selection of microarray data, J Med Biol Eng 30 (2010), 23–28.

56. Yang, Zhou, Zhu and Ji, Iterative ensemble feature selection for multiclass classification of imbalanced microarray data, J Biol Res 23 (2016), 13.

57. Zhang and Zhang, Significance of gene ranking for classification of microarray samples, IEEE/ACM Trans Comput Biol Bioinforma 3 (2006), 312–320. doi: 10.1109/TCBB.2006.42.

58. Zhang, Kotagiri, Tari and Cheriet, KRNN: k Rare-class Nearest Neighbour classification, Pattern Recognit 62 (2017), 33–44.

59. Zou and Hastie, Regularization and variable selection via the elastic net, J R Stat Soc Ser B 67 (2005), 301–320. doi: 10.1111/j.1467-9868.2005.00503.x.

60. Cancer microarray data sets. Plymouth University, (2005). http://www.tech.plym.ac.uk/spmc/links/bioinformatics/microarray/microarray_cancers.html (accessed February 14, 2017).
