A New Filter Approach Based on Effective Ranges for Classification of Gene Expression Data

Abstract

Over the years, many studies have been carried out to reduce and eliminate the effects of diseases on human health. Gene expression data sets play a critical role in diagnosing and treating diseases. These data sets consist of thousands of genes and a small number of samples. This situation creates the curse of dimensionality, which makes such data sets problematic to analyze. One of the most effective strategies to solve this problem is feature selection. Feature selection is a preprocessing step that improves classification accuracy by selecting the most relevant and informative features. In this article, we propose a new statistical filter method for feature selection, named Effective Range-based Feature Selection Algorithm (FSAER). As an extension of the previous Effective Range based Gene Selection (ERGS) and Improved Feature Selection based on Effective Range (IFSER) algorithms, our novel method includes the advantages of both methods while taking into account the disjoint area. To illustrate the efficacy of the proposed algorithm, experiments have been conducted on six benchmark gene expression data sets. The results of FSAER and the other filter methods have been compared in terms of classification accuracy to demonstrate the effectiveness of the proposed method. Support vector machines, the naive Bayes classifier, and the k-nearest neighbor algorithm have been used as classification methods.

Introduction

The gene expression data sets obtained by digitizing the microarray image data are in the form of a matrix containing the genes and expression levels. A typical setting is a small number of samples represented by thousands of genes. Dealing with this high-dimensional data set and classifying it directly is a difficult problem.^1–3 Dimensionality reduction therefore plays a very important role, since it reduces the number of genes using techniques such as feature selection and feature extraction. These approaches reduce the high-dimensional feature space to a low-dimensional representation in order to achieve higher classification accuracy. In contrast to feature extraction, feature selection chooses a subset of the original features with minimal redundancy and maximum relevance to improve classification accuracy.

Classifying such a data set is disadvantageous in terms of computational time because it is high dimensional. In addition, noise in gene expression, which is likely to be present in the feature set, adversely affects the performance of the classification.⁴ Studies have shown that some genes carry no disease-related information and give misleading results for disease diagnosis.^2,5,6 Therefore, researchers have turned to different feature (gene) selection approaches to determine the important genes in DNA microarray data sets.⁷

Feature selection methods

Feature selection is defined in the literature in three main topics, namely filter, wrapper, and embedded methods.^6,8

Filter methods, which are the methods we focus on in this study, are the statistical methods that are independent of the classifier and evaluate the features individually. In the literature, there have been many different studies related to filter methods, also known as statistical-based methods.^7,9

Wrapper methods often use classification accuracy as an indicator when selecting the best subset of genes. These methods require a classification algorithm at each step of the feature selection process. Unlike filter methods, they interact with the classification algorithm. Because gene data sets contain thousands of genes, wrapper methods are less preferred than filter methods since they have higher computational costs.^7,10–15

Embedded methods simultaneously perform gene selection and classification while selecting the best subset of genes. There are many embedded method studies used for gene selection in the literature.^16–19 In recent years, hybrid methods that combine filter and wrapper methods have been advanced and frequently used as feature selection methods.^5,8

A review of the literature shows that, for high-dimensional data, methods that rely on a search strategy are computationally expensive. Wrapper and embedded methods can select an optimal feature subset only with respect to a particular classifier. Therefore, when other classifiers are used, the selected features may yield worse results. In addition, these two methods are disadvantageous in terms of computational time compared with the filter methods.^20–22 Therefore, in gene data sets with a large number of features, it may be more appropriate to use filter methods to select a feature subset.^20–23 For these reasons, our study focuses on filter-based feature selection, which is a statistical-based method and also constitutes an important part of hybrid methods.

Some of the commonly used filter methods are Fisher-score,^7,24 chi-square (χ²),²⁵ information gain (IG),²⁶ mutual information,²⁷ correlation-based feature selection method,²⁸ and Relief-F.^9,29

Related works

A related study proposed a sequential forward search method based on the mutual information between the selected genes and the class.³⁰ Another article used support vectors based on the t statistic for feature selection.³¹ In contrast to earlier statistical methods, Effective Range based Gene Selection (ERGS) defines an effective range of each feature for each class.³² The basic principle of this approach is that more weight is given to features that clearly distinguish classes. A study used the BW ratio (between-class to within-class sums of squares) method based on Fisher's linear discriminant analysis in different gene expression data sets to perform the feature selection.³³

A different article suggested “Improved Feature Selection based on Effective Range (IFSER),” which addresses a shortcoming of the ERGS algorithm.³⁴ Another work on the issue proposed a novel filter feature selection method based on the maximal information coefficient (MIC) and Gram–Schmidt orthogonalization, named Orthogonal MIC Feature Selection.³⁵ Lastly, a study proposed an effective feature selection method that combines double Radial Basis Function kernels with weighted analysis and exploits their nonlinear mapping ability to extract feature genes from gene expression data.³⁶

One of the aforementioned studies³² proposed a filter method that calculates an effective range of each feature for each class, based on the Chebyshev inequality, and weights the features by the overlap of these effective ranges. While ERGS considers only the overlapping area (OA), IFSER also takes into account the case where the range of one class covers the range of another class (including area [IA]).

Our contributions

Our original contributions through this article are as follows:

The major deficiency of the ERGS and IFSER algorithms based on effective ranges is that they assign the same weight value to all of the disjoint ranges. We propose a new feature selection approach named “Feature Selection Algorithm based on Effective Ranges” (FSAER), which distinguishes the disjoint effective ranges.

FSAER has the advantages of ERGS and IFSER and it further defines a new total area that is needed for handling the disjoint areas. For gene expression data sets, there are thousands of genes, so some of them will include the disjoint area. Therefore, FSAER outperforms the original ERGS and IFSER, and some of the other current filter methods.

The rest of this article is composed as follows. A brief overview of filter methods is summarized in the Filter Methods: A Brief Overview section. The proposed feature selection algorithm is introduced in detail in the Proposed Feature Selection Algorithm section. Classification methods used for analysis are described in the Classification Methods Used for Analysis section. The Experiment and Results section reports experimental results on six well-known gene expression data sets and provides a comparative evaluation of the proposed algorithm. Finally, conclusions are provided in the Conclusions section.

Filter Methods: A Brief Overview

In the scope of this study, to show the effectiveness of the proposed algorithm, five different filter methods have been used for comparison. These commonly used five filter methods in the literature are χ² statistic,²⁵ Relief-F,^9,29 IG,²⁶ ERGS,³² and IFSER.³⁴ The brief descriptions of these five filter algorithms are given as follows.

χ² statistic

The χ² filter method is based on χ² statistics. The value of the χ² statistic is computed for each feature individually. Each feature with numerical values must be discretized before computing χ² statistics. For each i-th feature, the χ² statistic is defined as follows.³² $χ^{2} = \sum_{x \in X_{i}} \sum_{c \in C} \frac{{(n_{(x \in X_{i} \,\&\, c \in C)} - e_{(x \in X_{i} \,\&\, c \in C)})}^{2}}{e_{(x \in X_{i} \,\&\, c \in C)}}$ (1)

where $n_{(x \in X_{i} \,\&\, c \in C)}$ represents the number of samples in $X_{i}$ for class c whose value is x. The expected frequency $e_{(x \in X_{i} \,\&\, c \in C)}$ is computed by the following formula: $e_{(x \in X_{i} \,\&\, c \in C)} = \frac{n_{x \in X_{i}} \times n_{c \in C}}{n}$ (2)

where $n_{x \in X_{i}}$ denotes the number of samples in $X_{i}$ with value x, $n_{c \in C}$ denotes the number of samples of class c, and n is the total number of samples.

In statistical analysis, the Cramer-Phi coefficient is used as a measure of association for nominal variables. It is possible to use this measure as a filter for feature selection. This coefficient is defined as follows: $ϕ = \sqrt{\frac{χ^{2}}{n (k - 1)}}$ (3)

As seen from Equation (3), this measure depends on the definition of $χ^{2}$ . In this study, k is the smaller of the number of rows and the number of columns of the contingency table.

The features are selected based on the ranked values of the Cramer-Phi coefficient calculated with the χ² statistic for each feature. A higher weight is given to the features where this coefficient is bigger.
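Equations (1)–(3) can be illustrated with a minimal sketch for a single, already discretized feature. The function names and the toy data below are hypothetical and are not taken from the study's R implementation; the sketch only assumes that k is the smaller dimension of the feature-by-class contingency table, as stated above.

```python
import math
from collections import Counter

def chi_square(feature_values, labels):
    """Chi-square statistic of one discretized feature against the class labels (Eqs. 1-2)."""
    n = len(labels)
    value_counts = Counter(feature_values)                # n_{x in X_i}
    class_counts = Counter(labels)                        # n_{c in C}
    joint_counts = Counter(zip(feature_values, labels))   # observed n_{(x, c)}
    chi2 = 0.0
    for x, n_x in value_counts.items():
        for c, n_c in class_counts.items():
            expected = n_x * n_c / n                      # Eq. (2)
            observed = joint_counts.get((x, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramer_phi(feature_values, labels):
    """Cramer-Phi coefficient (Eq. 3); k is the smaller table dimension."""
    n = len(labels)
    k = min(len(set(feature_values)), len(set(labels)))
    if k < 2:
        return 0.0  # a constant feature carries no association
    return math.sqrt(chi_square(feature_values, labels) / (n * (k - 1)))

# Hypothetical toy data: gene_a separates the classes, gene_b barely does.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
gene_a = ["low", "low", "low", "high", "high", "high"]
gene_b = ["low", "high", "low", "high", "low", "high"]
print(cramer_phi(gene_a, labels), cramer_phi(gene_b, labels))
```

In this toy case the perfectly separating feature reaches the maximum coefficient of 1, so it would be ranked ahead of the weakly associated one.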

Relief-F

The Relief algorithm is a feature selection algorithm that was proposed for binary class problems.³⁷ While this algorithm can deal with discrete and continuous features, it cannot deal with incomplete data. Relief-F has been proposed as an extension of the original Relief algorithm to deal with noisy, incomplete, and multiclass data sets.³² The basic logic of the Relief-F algorithm is to weight features according to how well they distinguish between the samples of different classes and how well they cluster samples of the same class.³⁸

In this method, a relevance weight is assigned to each feature. A sample r is selected at random from the n samples. The relevance values are updated according to the differences between r and its nearest sample of the same class, H (nearest hit), and its nearest samples of each different class, M(C) (nearest miss of class C). Features that discriminate the sample from its neighbors of different classes are given more weight. The weights are updated using the average contribution of the nearest misses M(C), weighted by the prior probability of each class, P(C). The update rule for $w_{i}$ , the weight of the i-th feature X_i, is given in the following equation.³² $w_{i} = w_{i} - \frac{ϕ (X_{i}, r, H)}{n} + \sum_{C \neq C_{r}} \frac{P (C) \times ϕ (X_{i}, r, M (C))}{n}$ (4)

where the function $ϕ (X_{i}, r, s)$ gives the distance in feature X_i between the selected sample r and a sample s, which is either the nearest hit H or a nearest miss $M (C)$ , and $C_{r}$ is the class of r.
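The update rule in Equation (4) can be sketched as follows. This is a simplified illustration, not the Biocomb implementation used in the study: it uses one nearest hit and one nearest miss per class, measures distances as range-scaled absolute differences, and divides by the number of sampled instances, which is how this sketch reads the n of Equation (4). All names and the toy data are hypothetical.

```python
import random

def relief_f(X, y, n_iterations, seed=0):
    """Simplified Relief-F weights following the update rule of Eq. (4)."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    # Per-feature value range, used to scale differences to [0, 1] (assumption).
    spans = []
    for i in range(n_features):
        column = [row[i] for row in X]
        spans.append((max(column) - min(column)) or 1.0)
    priors = {c: y.count(c) / n_samples for c in set(y)}  # P(C)

    def diff(i, a, b):
        return abs(X[a][i] - X[b][i]) / spans[i]

    def nearest(r, candidates):
        return min(candidates,
                   key=lambda s: sum(diff(i, r, s) for i in range(n_features)))

    weights = [0.0] * n_features
    for _ in range(n_iterations):
        r = rng.randrange(n_samples)
        hits = [s for s in range(n_samples) if s != r and y[s] == y[r]]
        if not hits:
            continue  # no same-class neighbor to compare with
        hit = nearest(r, hits)
        misses = {c: nearest(r, [s for s in range(n_samples) if y[s] == c])
                  for c in priors if c != y[r]}
        for i in range(n_features):
            weights[i] -= diff(i, r, hit) / n_iterations
            for c, miss in misses.items():
                weights[i] += priors[c] * diff(i, r, miss) / n_iterations
    return weights

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.2], [1.0, 0.8], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_f(X, y, n_iterations=20)
print(weights)
```

The informative feature ends with a positive weight and the noisy one with a negative weight, which is the ordering the filter relies on.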

Information gain

IG is often used as a criterion for selecting features in decision trees.²⁶ A study used this method also as a criterion for gene selection.²⁵ IG is a measure based on entropy. Entropy is used to measure uncertainty in a variable. When all values of the variable are equal, there is no uncertainty and the entropy value is 0, whereas the entropy reaches its maximum value when the values of the variable are uniformly distributed.³⁹ The IG value is often used as a filter for ordering genes when gene selection is made. A high IG value means that the gene provides more information.

Consider a data set with samples $N = \{1, 2, \dots, n\}$ and k classes, where $C_{i}, i = 1, 2, \dots, k$ , is the set of samples that belong to the i-th class and $P (C_{i}, N)$ is the proportion of the samples in N that belong to $C_{i}$ . The entropy of the data set can be calculated as follows.²⁵ $\mathrm{Entropy} (N) = - \sum_{i = 1}^{k} P (C_{i}, N) \times \log P (C_{i}, N)$ (5)

If a gene $γ$ has levels $V = \{v_{1}, v_{2}, \dots, v_{m}\}$ and $N_{j} = \{x \in N \mid γ (x) = v_{j}\}$ is the set of samples for which $γ$ takes the level $v_{j}$ , the entropy value for gene $γ$ is given by the following: $\mathrm{Entropy}_{γ} (N) = \sum_{j = 1}^{m} \frac{|N_{j}|}{|N|} \times \mathrm{Entropy} (N_{j})$ (6)

As a result, the IG value for a gene $γ$ can be calculated as follows: $\mathrm{InformationGain} (γ) = \mathrm{Entropy} (N) - \mathrm{Entropy}_{γ} (N)$ (7)

Both IG and χ² statistic methods have high performance in distinguishing the features.³⁸
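Equations (5)–(7) can be checked on a small example. The sketch below assumes base-2 logarithms (the text does not fix the base) and a gene that has already been discretized into levels; the function names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a set of class labels (Eq. 5), using log base 2 (assumption)."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(gene_levels, labels):
    """Information gain of a discretized gene (Eqs. 6-7)."""
    n = len(labels)
    conditional = 0.0
    for level in set(gene_levels):
        subset = [lab for g, lab in zip(gene_levels, labels) if g == level]
        conditional += len(subset) / n * entropy(subset)   # |N_j| / |N| * Entropy(N_j)
    return entropy(labels) - conditional

# Hypothetical toy data: gene_a separates the classes, gene_b barely does.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
gene_a = ["low", "low", "low", "high", "high", "high"]
gene_b = ["low", "high", "low", "high", "low", "high"]
print(information_gain(gene_a, labels), information_gain(gene_b, labels))
```

With two balanced classes the data set entropy is 1 bit, and the separating gene recovers all of it, while the weakly associated gene recovers less than 0.1 bit.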

Effective Range based Gene Selection

One of the studies, which offers a starting point for this article, proposed a filter method that calculates an effective range of each feature for each class, based on the Chebyshev inequality, and weights the features by the overlap of these effective ranges.³² ERGS works under the principle that a feature should be given more weight if the decision boundaries among classes are very far away from each other, that is, if the classes can easily be distinguished. The decision boundaries of the classes are obtained by a statistically defined effective range.³²

Calculating effective ranges ( $R_{i j}$ )

Assume that a data set has d features, with the set of features denoted by $X = \{X_{1}, X_{2}, \dots, X_{d}\}$ . Let l be the number of classes and n_j ( $j = 1, 2, \dots, l$ ) be the number of samples in the j-th class. Then the following quantities can be defined: $n = \sum_{j = 1}^{l} n_{j}$ (sample size of the data set) (8) $p_{j} = \frac{n_{j}}{n}$ (the probability of the j-th class) (9)

$μ_{i j}$ and $σ_{i j}$ denote the mean and standard deviation of the j-th class for the i-th feature, respectively. Effective Range ( $R_{i j}$ ) of j-th class for the i-th feature is defined as follows $R_{i j} = [r_{i j}^{-}, r_{i j}^{+}] = [μ_{i j} - (1 - p_{j}) γ σ_{i j}, μ_{i j} + (1 - p_{j}) γ σ_{i j}]$ (10)

where p_j is the prior probability of the j-th class and $r_{i j}^{-}$ and $r_{i j}^{+}$ are the lower and upper bounds of the effective range, respectively. $γ$ is a constant that is determined statistically by Chebyshev inequality given in the following formula, which is true for all distributions whose mean and variance are finite. $P (|X - μ_{i j}| \geq γ σ_{i j}) \leq \frac{1}{γ^{2}}$ (11)

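The value $γ = 1.732$ (approximately $\sqrt{3}$ ) is used, since Equation (11) then gives $1 ∕ γ^{2} \approx 1 ∕ 3$ , so that the interval $μ_{i j} \pm γ σ_{i j}$ contains at least two-thirds of the data. The factor $(1 - p_{j})$ in Equation (10) narrows this interval for classes with high prior probability, which reduces the effect of such classes.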
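As an illustration of Equation (10), the effective ranges of a single feature can be computed as in the sketch below. The function name and toy values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

GAMMA = 1.732  # value given in the text, approximately sqrt(3)

def effective_ranges(values, labels, gamma=GAMMA):
    """Effective range (lower, upper) of one feature for every class (Eq. 10)."""
    n = len(labels)
    ranges = {}
    for c in sorted(set(labels)):
        class_values = [v for v, lab in zip(values, labels) if lab == c]
        p = len(class_values) / n         # prior probability p_j (Eq. 9)
        mu = mean(class_values)           # class mean
        sigma = pstdev(class_values)      # population SD (assumption)
        half_width = (1 - p) * gamma * sigma
        ranges[c] = (mu - half_width, mu + half_width)
    return ranges

# Hypothetical expression values of one gene for two classes.
values = [1, 2, 3, 10, 12, 14]
labels = [0, 0, 0, 1, 1, 1]
print(effective_ranges(values, labels))
```

For these values the two ranges do not touch, which corresponds to the disjoint situation discussed later in the article.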

The ERGS algorithm calculates the overlap value of the ranges given in Equation (10) for each feature. There should be little overlap between classes in highly distinguished features.

ERGS algorithm

The steps of the ERGS algorithm, for the i-th feature X_i, can be given as follows:

1. Calculate effective ranges ( $R_{i j}$ ) for each class

2. Sort the effective ranges based on lower bounds ( $r_{i j}^{-}$ ) in ascending order

3. Compute $O A_{i}$ $O A_{i} = \sum_{j = 1}^{l - 1} \sum_{k = j + 1}^{l} φ_{i} (j, k)$ (12)

where $φ_{i} (j, k) = \begin{cases} r_{i j}^{+} - r_{i k}^{-} & \text{if } r_{i k}^{-} < r_{i j}^{+} \\ 0 & \text{otherwise} \end{cases}$

4. Compute area coefficient ( $A C_{i}$ ) $A C_{i} = \frac{O A_{i}}{\max (r_{i 1}^{+}, r_{i 2}^{+}, \dots, r_{i l}^{+}) - r_{i 1}^{-}}$ (13)

5. Compute normalized area coefficient ( $N A C_{i}$ ) $N A C_{i} = \frac{A C_{i}}{\max (A C_{s})} \text{ for } s = 1, 2, \dots, d$ (14)

6. Compute weight (w_i) $w_{i} = 1 - N A C_{i}$ (15)

7. Select the features that satisfy $w_{i} \geq θ$ , where $θ$ is a chosen threshold

Some features range from 0 to 10, while others can range from 0 to 10,000. This means that for features with a higher data range, the OA can be greater. To override this effect, the OA is divided by the data range of the feature.
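The ERGS steps can be sketched as follows, taking the effective ranges of each feature as input. The names and toy ranges are hypothetical; the normalization step is taken here to divide each area coefficient by the largest one over all features, which is an assumption of this sketch.

```python
def overlap_area(ranges):
    """OA_i of Eq. (12); `ranges` holds one (lower, upper) pair per class."""
    ordered = sorted(ranges)  # ascending lower bound (step 2)
    total = 0.0
    for j in range(len(ordered) - 1):
        upper_j = ordered[j][1]
        for k in range(j + 1, len(ordered)):
            lower_k = ordered[k][0]
            if lower_k < upper_j:           # ranges j and k overlap
                total += upper_j - lower_k
    return total

def area_coefficient(ranges):
    """AC_i of Eq. (13): overlap divided by the data range covered by the classes."""
    ordered = sorted(ranges)
    span = max(upper for _, upper in ordered) - ordered[0][0]
    return overlap_area(ranges) / span

def ergs_weights(ranges_per_feature):
    """Weights w_i = 1 - NAC_i; normalization by the largest AC is an assumption."""
    coefficients = [area_coefficient(r) for r in ranges_per_feature]
    largest = max(coefficients)
    return [1 - (c / largest if largest > 0 else 0.0) for c in coefficients]

# Hypothetical effective ranges of three features for two classes.
features = [
    [(0, 2), (1, 3)],   # partial overlap
    [(0, 1), (5, 6)],   # disjoint
    [(0, 4), (1, 3)],   # one range inside the other
]
print(ergs_weights(features))
```

Dividing by the span in `area_coefficient` is what keeps a feature measured on a 0 to 10,000 scale from being penalized relative to one measured on a 0 to 10 scale.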

An example for illustration of ERGS algorithm

Mixed-lineage leukemia (MLL), a benchmark gene data set, contains 72 samples of 12,562 genes.⁴⁰ These samples consist of three types: acute lymphoblastic leukemia (ALL), MLL, and acute myeloid leukemia (AML).³²

In the ERGS algorithm, more weight is assigned to a gene if it has less OA. Using the ERGS algorithm, the weight for the gene with Gene Accession No. “32864_at” is computed as 0.90, which is a high weight for this gene. This is consistent with the fact that there is no OA between the AML and ALL types of leukemia for gene “32864_at,” as shown in Figure 1. It means that gene “32864_at” introduces no ambiguity when distinguishing the AML and ALL types of leukemia. The example therefore shows that the ERGS algorithm gives more weight to those features that are helpful for classifying the data accurately, that is, features that do not lead to any ambiguity in the classification process. The features with higher weights also clearly describe the decision boundary in the classification process.³²

FIG. 1.

Data plot of MLL data set: Gene Accession No. 32684_at.³² ALL, acute lymphoblastic leukemia; AML, acute myeloid leukemia; MLL, mixed-lineage leukemia.

Improved Feature Selection based on Effective Range

Wang et al.³⁴ extended the ERGS algorithm to take into account the case where the range of one class covers the range of another class (IA). The steps of the IFSER algorithm, for the i-th feature X_i, can be given as follows.

1. Calculate effective ranges ( $R_{i j}$ ) for each class

2. Sort the effective ranges based on lower bounds ( $r_{i j}^{-}$ ) in ascending order

3. Compute $O A_{i}$ $O A_{i} = \sum_{j = 1}^{l - 1} \sum_{k = j + 1}^{l} φ_{i} (j, k)$ (16)

where $φ_{i} (j, k) = \begin{cases} r_{i j}^{+} - r_{i k}^{-} & \text{if } r_{i k}^{-} < r_{i j}^{+} \\ 0 & \text{otherwise} \end{cases}$

4. Compute $I A_{i}$ $I A_{i} = \sum_{j = 1}^{l - 1} \sum_{k = j + 1}^{l} ψ_{i} (j, k)$ (17)

where $ψ_{i} (j, k) = \begin{cases} r_{i k}^{+} - r_{i k}^{-} & \text{if } r_{i k}^{+} < r_{i j}^{+} \\ 0 & \text{otherwise} \end{cases}$

5. Compute $A C_{i}$ $A C_{i} = \frac{O A_{i} + I A_{i}}{\max (r_{i 1}^{+}, r_{i 2}^{+}, \dots, r_{i l}^{+}) - r_{i 1}^{-}}$ (18)

6. Compute $N A C_{i}$

7. Compute the normalized values $N H_{i}$ and $G H_{i}$ $\begin{matrix} N H_{i} = 1 - H_{i j} ∕ \max (H_{s j}) \\ G H_{i} = 1 - G_{i j} ∕ \max (G_{s j}) \end{matrix} \text{ for } s = 1, 2, \dots, d$ (20)

where $H_{i j}$ and $G_{i j}$ denote the number of samples in $O A_{i}$ and $I A_{i}$ for the j-th class.

8. Compute weight (w_i) $w_{i} = N A C_{i} \times (N H_{i} + G H_{i})$ (21)

9. Select the features that satisfy $w_{i} \geq θ$ , where $θ$ is a chosen threshold

Proposed Feature Selection Algorithm

Motivation

The major deficiency of the ERGS and IFSER algorithms based on effective ranges is that they assign the same weight value to all of the disjoint ranges (Fig. 2c). The ERGS algorithm selects the feature by considering only the OAs defined in Figure 2a. The algorithm uses the same weight value for all of the ranges that correspond to Figure 2c, while partly taking into account the condition given in Figure 2b.

FIG. 2.

Situations of effective ranges relative to each other (a) overlap, (b) including, (c) disjoint.

The IFSER algorithm has made improvements in the ERGS algorithm, taking into account the state of Figure 2b, but this algorithm does not distinguish between the corresponding ranges in the case given in Figure 2c, just similar to ERGS.

Consider the Golub data to explain this situation through an example. Golub data is a data set consisting of 2 classes and 3051 features with 38 samples.⁴¹ After calculating the effective ranges in this data set, the number of (a) overlap, (b) including, (c) disjoint situations given in Figure 2 is summarized in Table 1.

Table 1.

Number of (a) overlap, (b) including, (c) disjoint situations for Golub data set

No. of overlaps	No. of including	No. of disjoints
1151	1713	187

As seen in Table 1, the $ϕ_{i} (j, k)$ value is greater than zero for 187 features. The ERGS and IFSER algorithms give the same weight value to all of these features, regardless of the size of $ϕ_{i} (j, k)$ . For example, among these 187 features, the largest value is $ϕ_{829} (1, 2) = 1.461251$ and the smallest is $ϕ_{1206} (1, 2) = 0.0001502558$ , yet the $O A_{i}$ value in the ERGS algorithm, and the $O A_{i}$ and $I A_{i}$ values in the IFSER algorithm, are 0 for both of these features. However, a larger value of $ϕ_{i} (j, k)$ can also indicate greater discrimination power. Taking this situation into account should increase the quality of classification.

As the Golub example shows, a data set may have many disjoint areas. Considering only the situations in Figure 2a and b when calculating the areas assigns an area of 0, and therefore an identical weight, to every feature whose ranges are disjoint, which ignores the better separation in Figure 2c.

It would be wrong to say that all 187 features given in Table 1 have the same power in separating classes. The larger the disjoint area, the greater the weight of that feature should be.

In this study, we propose a new feature selection algorithm that distinguishes the disjoint effective ranges indicated by $ϕ_{i} (j, k)$ . The steps of this algorithm are given below.

Algorithm

The steps of the algorithm, for the i-th feature X_i, are given below and are summarized as a flowchart in Figure 3 and as pseudocode in Algorithm 1.

FIG. 3.

Flowchart of the FSAER. FSAER, Effective Range-based Feature Selection Algorithm.

1. Calculate effective ranges ( $R_{i j}$ ) for each class

2. Sort the effective ranges based on lower bounds ( $r_{i j}^{-}$ ) in ascending order

3. Compute total area ( $T A_{i}$ ) $T A_{i} = \sum_{j = 1}^{l - 1} \sum_{k = j + 1}^{l} [φ_{i} (j, k) + ψ_{i} (j, k) - ϕ_{i} (j, k)]$ (22)

where

$φ_{i} (j, k) = \begin{cases} r_{i j}^{+} - r_{i k}^{-} & \text{if } r_{i k}^{-} < r_{i j}^{+} \\ 0 & \text{otherwise} \end{cases}$

$ψ_{i} (j, k) = \begin{cases} r_{i k}^{+} - r_{i k}^{-} & \text{if } r_{i k}^{+} < r_{i j}^{+} \\ 0 & \text{otherwise} \end{cases}$

$ϕ_{i} (j, k) = \begin{cases} r_{i k}^{-} - r_{i j}^{+} & \text{if } r_{i j}^{+} < r_{i k}^{-} \\ 0 & \text{otherwise} \end{cases}$

4. Compute $A C_{i}$ $A C_{i} = \frac{T A_{i}}{\max (r_{i 1}^{+}, r_{i 2}^{+}, \dots, r_{i l}^{+}) - r_{i 1}^{-}}$ (23)

5. Compute $N A C_{i}$

6. Compute weight (w_i) $w_{i} = 1 - N A C_{i}$ (25)

7. Select the features that satisfy $w_{i} \geq θ$ , where $θ$ is a chosen threshold

Algorithm 1

Input: Data matrix X ∈ R^(n×d) with d features, X = {X_i} = {X₁, X₂, …, X_d}, i = 1, 2, …, d;
the number of classes l and the number of samples n_j (j = 1, 2, …, l) in the j-th class;
and the number of selected features k.
Output: Feature subset.
(1) Calculate effective ranges (R_ij) for each class by (10);
(2) Compute the TA_i by (22);
(3) Compute the AC_i by (23);
(4) Normalize the AC_i by (24);
(5) Compute the weight of each feature by (25);
(6) Sort the weight of all features in descending order;
(7) Select the best k features.
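Steps (2)–(7) of Algorithm 1 can be sketched as follows, starting from effective ranges that have already been computed by Equation (10). The names and toy ranges are hypothetical. The normalization step is taken here to divide each area coefficient by the largest one, with a fallback of 1 when no feature has a positive coefficient; both choices are assumptions of this sketch rather than a statement of the authors' implementation.

```python
def total_area(ranges):
    """TA_i of Eq. (22); `ranges` holds one (lower, upper) pair per class."""
    ordered = sorted(ranges)  # ascending lower bound
    total = 0.0
    for j in range(len(ordered) - 1):
        lower_j, upper_j = ordered[j]
        for k in range(j + 1, len(ordered)):
            lower_k, upper_k = ordered[k]
            overlap = upper_j - lower_k if lower_k < upper_j else 0.0    # overlap term
            including = upper_k - lower_k if upper_k < upper_j else 0.0  # including term
            disjoint = lower_k - upper_j if upper_j < lower_k else 0.0   # disjoint term
            total += overlap + including - disjoint
    return total

def fsaer_weights(ranges_per_feature):
    """Weights w_i = 1 - NAC_i (Eqs. 23-25); the normalization is an assumption."""
    coefficients = []
    for ranges in ranges_per_feature:
        ordered = sorted(ranges)
        span = max(upper for _, upper in ordered) - ordered[0][0]
        coefficients.append(total_area(ranges) / span)
    scale = max(coefficients)
    if scale <= 0:
        scale = 1.0  # sketch-only fallback that keeps the ordering
    return [1 - c / scale for c in coefficients]

def fsaer_select(ranges_per_feature, k):
    """Indices of the k features with the largest weights (steps 6-7)."""
    weights = fsaer_weights(ranges_per_feature)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]

# Hypothetical effective ranges of four features for two classes.
features = [
    [(0, 2), (1, 3)],   # overlap
    [(0, 1), (5, 6)],   # disjoint, wide gap
    [(0, 1), (2, 3)],   # disjoint, narrow gap
    [(0, 4), (1, 3)],   # including
]
print(fsaer_weights(features), fsaer_select(features, 2))
```

In this toy case the two disjoint features receive different weights, with the wider gap ranked first, whereas an overlap-only area would be 0 for both of them.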

Classification Methods Used for Analysis

In this study, the results obtained for the proposed method and the other methods were compared in terms of classification accuracy using the support vector machine (SVM), the naive-Bayes classifier (NBC), and the K-nearest neighbor (KNN) algorithm. A description of these methods is given below.

Support vector machine

SVM is a supervised learning algorithm that, in its basic form, separates the data of different classes from each other using a line, plane, or hyperplane (i.e., a decision boundary).^42,43 To create an optimal hyperplane, an iterative training algorithm is used to minimize the error function $Λ (w)$ . This error function $Λ (w)$ can be defined as follows³²: $Λ (w) = \frac{1}{2} w^{T} w + C \sum_{i = 1}^{n} ξ_{i}$ (26)

Subject to the constraints: $y_{i} [w^{T} K (x_{i}) + b] \geq 1 - ξ_{i} \text{ and } ξ_{i} \geq 0, i = 1, 2, \dots, n$ (27)

where b is a constant, w denotes the coefficients' vector, C is the penalty constant, and $ξ_{i}, i = 1, 2, \dots, n$ are slack variables that allow for misclassified samples. x_i is the vector of independent variables of the i-th training sample and y_i is its class label. The kernel function (K) maps the input data to a higher dimensional feature space.
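To make Equations (26) and (27) concrete, the sketch below evaluates the error function for a given w and b. It assumes the simplest case, in which K leaves the input unchanged (a linear machine) and class labels are coded as +1 and -1, and it sets each slack variable to the smallest value that satisfies the constraints. It does not train an SVM, and the names and numbers are hypothetical.

```python
def svm_objective(w, b, X, y, C):
    """Value of the error function (Eq. 26) with the smallest feasible slacks (Eq. 27)."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1 - margin))   # slack is zero when the margin is at least 1
    objective = 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical one-dimensional data; the third sample lies on the wrong side.
X = [[2.0], [-2.0], [0.5]]
y = [1, -1, -1]
objective, slacks = svm_objective(w=[1.0], b=0.0, X=X, y=y, C=1.0)
print(objective, slacks)
```

Only the sample that violates the margin contributes a nonzero slack, so C controls how strongly such violations are traded off against the size of w.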

Naive-Bayes classifier

The NBC is a classifier based on Bayes theorem, working according to probability principles, assuming independence. A study has found that this assumption is less effective than thought.⁴⁴

Bayes' theorem generalizes conditional probability to k discrete events. This probability can be defined as follows.^45,46 $P (C_{j} \mid X) = \frac{P (X \mid C_{j}) P (C_{j})}{P (X)}$ (28)

where

$P (C_{j} \mid X)$ : probability of class C_j given that the event X is known

$P (X \mid C_{j})$ : probability of the event X occurring in class C_j

$P (C_{j})$ : probability of occurrence of class C_j

$P (X)$ : probability of occurrence of the event X
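Equation (28) can be illustrated with a small naive Bayes sketch. The choice of Gaussian class-conditional densities is an assumption made here for numerical features; the text does not state which likelihood model the study used. Function names and data are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_gaussian_nb(X, y):
    """Per-class prior and per-feature (mean, SD), assuming independent features."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == c]
        stats = [(mean(col), max(pstdev(col), 1e-9)) for col in zip(*rows)]
        model[c] = (len(rows) / len(y), stats)
    return model

def posterior(model, x):
    """Posterior P(C_j | X) from Eq. (28), computed in log space for stability."""
    joint = {}
    for c, (prior, stats) in model.items():
        log_p = math.log(prior)                      # log P(C_j)
        for value, (mu, sigma) in zip(x, stats):     # sum of log P(x_i | C_j)
            log_p += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                      - (value - mu) ** 2 / (2 * sigma ** 2))
        joint[c] = log_p
    top = max(joint.values())
    unnormalized = {c: math.exp(v - top) for c, v in joint.items()}
    evidence = sum(unnormalized.values())            # plays the role of P(X)
    return {c: v / evidence for c, v in unnormalized.items()}

# Hypothetical one-gene data set with two classes.
X = [[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_gaussian_nb(X, y)
print(posterior(model, [2.5]), posterior(model, [5.0]))
```

A sample close to the class A mean receives a posterior near 1 for A, while a sample exactly midway between the two class means receives 0.5 for each class.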

K-nearest neighbors

KNN algorithm, which is used in cases where the independent variables are numerical (as in the case of gene data), performs classification according to the distances between the observations. The steps of the method can be summarized as follows.^37,47

Step 1. Determine the k value (number of neighbors)

Step 2. Calculate distances between observations

Step 3. Sort observations in ascending order by distance

Step 4. Assign the observation to the class that occurs most often among the k observations with the smallest distances
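The four steps above can be sketched in a few lines. The sketch assumes Euclidean distance, which the text does not specify, and uses hypothetical names and data.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: distance from the query to every training observation.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: most frequent class among the k closest observations.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-gene data set with two classes.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, [0.5, 0.5]), knn_predict(train_X, train_y, [5.5, 5.5]))
```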

Experiment and Results

In this section, the performance of the proposed feature selection algorithm (FSAER) has been evaluated using six different well-known gene expression data sets, namely Leukemia_G,⁴¹ Leukemia_Z,⁴⁸ Colon,⁴⁹ Leukemia_C,⁵⁰ Prostate,⁵¹ and Small Round Blue Cell Tumors (SRBCT).⁵² To show the effectiveness of the proposed algorithm, five different filter methods also have been applied to these six data sets. These are χ² statistic, Relief-F, IG, ERGS, and IFSER.

SVM, NBC, and KNN algorithms have been used as the classification methods to measure the performance of the features selected by the filter methods.

The classification accuracies obtained by applying leave-one-out cross-validation (LOOCV) are presented in detail in Tables 3–5.

The ERGS, IFSER, and FSAER algorithms have been implemented in the R language. The Biocomb package in R has been used for the χ², Relief-F, and IG algorithms, and the Caret package in R has been used to obtain classification accuracies.
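The LOOCV procedure can be sketched as follows. This is an illustration in Python rather than the R and Caret code used in the study; the 1-nearest-neighbor classifier is only a stand-in, and the sketch does not address whether feature selection is repeated inside each fold. All names and data are hypothetical.

```python
import math

def nearest_neighbor(train_X, train_y, query):
    """1-NN stand-in classifier; any function with this signature would do."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return train_y[best]

def loocv_accuracy(X, y, predict):
    """Leave-one-out cross-validation accuracy in percent."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # hold out sample i
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return 100.0 * correct / len(X)

# Hypothetical one-gene data set with two well-separated classes.
X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
accuracy = loocv_accuracy(X, y, nearest_neighbor)
print(accuracy)
```

Because each sample is held out exactly once, the accuracy moves in steps of one sample: with 72 samples, for example, 69 correct predictions correspond to 95.83% and 71 to 98.61%.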

Data set description

Leukemia_G (ALL-AML)

The Leukemia_G data set contains information on 72 patients, of whom 47 have ALL and 25 have AML (Table 2). Bone marrow samples were obtained from each of the 72 patients at the time of diagnosis. This data set consists of 2 classes for 7129 gene expression profiles from 72 samples.

Table 2.

Characteristics and accessible sources of data sets

Data set	Sample size	No. of genes	No. of classes	Source
Leukemia_G	72	7129	2	Ref.⁵⁹
Leukemia_Z	72	7129	3	Ref.⁶⁰
Colon	62	2000	2	Ref.⁵⁹
Leukemia_C	111	12,625	2	Ref.⁵⁹
Prostate	102	12,600	2	Ref.⁵⁹
SRBCT	83	2308	4	Ref.⁶¹

SRBCT, Small Round Blue Cell Tumors.

Leukemia_Z (ALL-AML_3)

The Leukemia_Z data set contains gene expression data for a total of 72 patients: 25 with AML, 38 with B cell ALL, and 9 with T cell ALL. This data set consists of 3 classes for 7129 gene expression profiles from 72 samples.

Colon

The Colon gene expression data set contains colon epithelial cell samples from 62 different colon cancer patients. Samples include normal biopsies collected from healthy portions of the same patient's colon and tumor biopsies collected from tumors. This data set has 2 classes, tumor and healthy, and contains expression values of 2000 genes belonging to 62 patients.

Leukemia_C (ALL)

The Leukemia_C gene expression data set contains information on 111 patients with T cell and B cell ALL. This data set has 111 samples and 2 classes and contains expression values of 12,625 genes.

Prostate

The Prostate gene expression data set consists of 102 radical prostatectomy samples, which were reported to be of high quality, including 52 tumor prostate samples and 50 normal prostate samples. This data set has 102 samples and 2 classes and contains expression values of 12,600 genes.

Small Round Blue Cell Tumors

The SRBCT gene expression data set has been derived from childhood cancer studies. In this data set, there are 29 samples with Ewing sarcoma, 11 with Burkitt's lymphoma, 18 with neuroblastoma, and 25 with rhabdomyosarcoma. In total, this data set consists of 4 classes with expression values of 2308 genes from 83 samples.

The obtained results

The results obtained in R with the aforementioned methods are presented in Tables 3–5. In addition, Figures 4–6 display the classification results of the six different methods in heatmap form.

Table 3.

Leave-one-out cross-validation classification accuracies (%) with naive Bayes classifier of six different data sets for different feature selection methods using 10–100 selected features

Data set	Method	NBC
Data set	Method	10	20	40	60	80	100
Leukemia_G	χ²-Statistic	95.83	95.83	95.83	95.83	95.83	95.83
	Relief-F	90.28	93.05	94.44	95.83	95.83	95.83
	IG	95.83	95.83	95.83	95.83	97.22	97.22
	ERGS	100	98.61	98.61	98.61	98.61	98.61
	IFSER	100	98.61	98.61	98.61	98.61	98.61
	FSAER	95.83	98.61	100	98.61	98.61	98.61
Leukemia_Z	χ²-Statistic	81.94	91.67	94.44	97.22	97.22	98.61
	Relief-F	94.44	93.06	95.83	95.83	97.22	97.22
	IG	95.83	95.83	95.83	97.22	97.22	97.22
	ERGS	97.22	95.83	97.22	97.22	97.22	97.22
	IFSER	97.22	98.61	98.61	97.22	97.22	98.61
	FSAER	98.61	97.22	97.22	97.22	97.22	98.61
Colon	χ²-Statistic	85.48	85.48	85.48	87.10	85.48	87.10
	Relief-F	82.26	85.48	87.10	87.10	87.10	88.71
	IG	87.10	87.10	83.87	82.26	82.26	83.87
	ERGS	80.64	77.41	77.41	79.03	79.03	79.03
	IFSER	75.80	79.03	80.64	77.41	79.03	75.80
	FSAER	80.64	77.41	77.41	79.03	79.03	79.03
Leukemia_C	χ²-Statistic	82.88	79.28	81.08	79.28	76.58	77.48
	Relief-F	58.82	70.58	67.65	66.67	79.41	78.43
	IG	81.08	77.48	81.08	77.48	77.48	75.68
	ERGS	84.68	82.88	85.58	86.49	85.58	85.58
	IFSER	84.68	83.78	86.49	85.58	84.68	85.58
	FSAER	85.58	81.98	85.58	86.49	85.58	85.58
Prostate	χ²-Statistic	92.17	92.17	89.22	92.17	93.14	93.14
	Relief-F	56.86	60.78	72.55	72.55	70.59	70.59
	IG	92.16	91.18	91.18	93.14	93.14	93.14
	ERGS	94.12	94.12	93.14	92.17	92.17	93.14
	IFSER	95.10	94.12	93.14	92.17	92.17	91.18
	FSAER	95.10	95.10	93.14	92.17	92.17	93.14
SRBCT	χ²-Statistic	95.18	96.38	97.59	98.79	100	100
	Relief-F	60.24	68.67	90.36	92.77	90.36	92.77
	IG	97.59	97.59	100	100	100	100
	ERGS	98.79	98.79	100	100	100	100
	IFSER	85.54	96.38	98.79	100	100	100
	FSAER	98.79	100	100	100	100	100

ERGS, Effective Range based Gene Selection; FSAER, Effective Range-based Feature Selection Algorithm; IFSER, Improved Feature Selection based on Effective Range; IG, information gain; NBC, naive Bayes classifier.

Table 4.

Leave-one-out cross-validation classification accuracies (%) with support vector machines of six different data sets for different feature selection methods using 10–100 selected features

Data set	Method	SVM
Data set	Method	10	20	40	60	80	100
Leukemia_G	χ²-Statistic	88.88	93.05	94.44	95.83	98.61	98.61
	Relief-F	87.50	94.44	93.06	97.22	97.22	95.83
	IG	97.22	91.67	98.61	97.22	98.61	98.61
	ERGS	98.61	97.22	95.83	98.61	100	100
	IFSER	98.61	97.22	95.83	97.22	100	100
	FSAER	95.93	93.05	95.83	98.61	100	100
Leukemia_Z	χ²-Statistic	93.05	93.05	95.83	97.22	94.44	95.83
	Relief-F	91.67	94.44	98.61	98.61	98.61	98.61
	IG	94.44	94.44	95.83	95.83	95.83	95.83
	ERGS	97.22	97.22	98.61	98.61	95.83	97.22
	IFSER	93.05	97.22	95.83	95.83	95.83	97.22
	FSAER	97.22	95.83	98.61	97.22	97.22	97.22
Colon	χ²-Statistic	83.87	79.03	83.87	87.10	85.48	83.87
	Relief-F	79.03	87.09	77.42	77.42	82.26	85.48
	IG	83.87	82.25	72.58	80.64	82.26	87.10
	ERGS	85.48	82.25	85.48	80.64	80.64	79.03
	IFSER	75.80	85.48	85.48	82.25	77.41	75.80
	FSAER	85.48	82.25	85.48	80.64	80.64	79.03
Leukemia_C	χ²-Statistic	85.58	81.08	75.67	72.07	77.48	84.68
	Relief-F	65.69	70.59	84.31	80.39	81.37	81.37
	IG	82.88	77.48	79.28	81.98	78.38	82.88
	ERGS	90.09	93.69	90.10	89.19	92.79	93.69
	IFSER	89.19	93.69	91.89	87.39	90.09	89.19
	FSAER	88.29	92.79	92.79	93.69	91.89	92.79
Prostate	χ²-Statistic	97.06	93.13	94.11	93.13	95.10	94.11
	Relief-F	55.88	65.69	76.47	69.61	69.61	82.35
	IG	96.08	89.22	91.18	93.14	94.12	95.10
	ERGS	93.14	96.08	95.10	92.17	92.17	93.14
	IFSER	92.17	96.08	92.17	89.26	90.20	91.18
	FSAER	92.17	97.06	92.17	92.17	92.17	94.12
SRBCT	χ²-Statistic	96.38	96.38	96.38	98.79	98.79	100
	Relief-F	63.85	77.11	92.77	96.38	97.59	100
	IG	96.38	96.38	100	100	100	100
	ERGS	100	100	100	100	100	100
	IFSER	89.15	98.79	100	100	100	100
	FSAER	96.38	100	100	100	100	100

SVM, support vector machine.

Table 5.

Leave-one-out cross-validation classification accuracies (%) with k-nearest neighbor of six different data sets for different feature selection methods using 10–100 selected features

Data set	Method	10	20	40	60	80	100
Leukemia_G	χ²-Statistic	86.11	93.05	94.44	91.67	93.06	93.06
	Relief-F	88.89	94.44	94.44	91.67	90.27	91.67
	IG	90.28	95.83	95.83	93.05	94.44	93.05
	ERGS	98.61	98.61	97.22	98.61	97.22	95.83
	IFSER	98.61	98.61	97.22	95.83	95.83	95.83
	FSAER	95.83	97.22	97.22	98.61	97.22	95.83
Leukemia_Z	χ²-Statistic	94.44	91.67	93.05	95.83	95.83	95.83
	Relief-F	90.28	90.28	91.67	94.44	94.44	95.83
	IG	90.28	93.05	95.83	95.83	97.22	95.83
	ERGS	97.22	97.22	95.83	95.83	95.83	97.22
	IFSER	97.22	93.05	95.83	93.05	97.22	95.83
	FSAER	94.44	95.83	97.22	94.44	95.83	95.83
Colon	χ²-Statistic	83.87	85.48	85.48	85.48	82.26	82.26
	Relief-F	83.87	87.10	87.10	87.10	85.48	85.48
	IG	85.48	85.48	83.87	83.87	87.10	85.48
	ERGS	85.48	85.48	88.70	85.48	87.10	85.48
	IFSER	77.41	85.48	87.10	87.10	87.10	85.48
	FSAER	85.48	85.48	88.70	85.48	87.10	85.48
Leukemia_C	χ²-Statistic	87.39	82.88	83.78	83.78	85.58	82.88
	Relief-F	58.82	62.74	76.47	82.35	84.31	82.35
	IG	86.49	81.98	86.48	83.78	81.08	79.28
	ERGS	86.49	87.39	87.39	87.39	85.58	86.49
	IFSER	87.39	89.19	89.19	88.29	89.19	88.29
	FSAER	86.49	87.39	87.39	86.48	84.68	88.29
Prostate	χ²-Statistic	90.20	89.22	88.24	89.22	92.17	92.17
	Relief-F	54.90	55.88	76.47	76.47	79.41	78.43
	IG	90.20	89.22	88.24	89.22	94.11	94.11
	ERGS	92.17	94.18	90.20	90.20	90.20	90.20
	IFSER	91.18	90.20	90.20	90.20	90.20	90.20
	FSAER	91.18	93.14	90.20	90.20	92.17	90.20
SRBCT	χ²-Statistic	96.38	97.59	98.79	98.79	98.79	100
	Relief-F	54.22	54.22	84.34	90.36	90.36	89.16
	IG	96.38	98.79	98.79	98.79	100	100
	ERGS	97.59	98.79	98.79	100	100	100
	IFSER	84.33	91.56	96.38	98.79	98.79	98.79
	FSAER	96.38	98.79	97.59	98.79	100	100

KNN, k-nearest neighbor.

FIG. 4.

Heatmap for LOOCV classification accuracies (%) with NBC of six different data sets for different feature selection methods using 10–100 selected features. LOOCV, leave-one-out cross-validation; NBC, naive Bayes classifier.

FIG. 5.

Heatmap for LOOCV classification accuracies (%) with SVM of six different data sets for different feature selection methods using 10–100 selected features. SVM, support vector machines.

FIG. 6.

Heatmap for LOOCV classification accuracies (%) with KNN of six different data sets for different feature selection methods using 10–100 selected features.

Comparative evaluation

To investigate the efficiency of the FSAER algorithm, the classification accuracy of the selected features using this algorithm has been compared with the accuracies obtained from other feature selection algorithms. For a better comparison, LOOCV has been used to calculate the classification accuracy.
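The LOOCV procedure can be sketched as follows. This is a minimal illustration only: the toy data, the function names, and the use of a simple k-nearest neighbor classifier are assumptions made for the example, not the experimental code behind the reported results.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, x, k=1):
    """Predict the label of x by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, k=1):
    """Hold out each sample once, train on the rest, and return accuracy in percent."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k) == y[i]:
            correct += 1
    return 100.0 * correct / len(X)


# Hypothetical toy data: six samples, two selected features, two classes.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.2), (1.2, 0.9)]
y = ["A", "A", "A", "B", "B", "B"]
accuracy = loocv_accuracy(X, y, k=1)
print(f"LOOCV accuracy: {accuracy:.2f}%")
```

Each sample serves as the test set exactly once, so the accuracy is the fraction of held-out samples that are predicted correctly.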

For six different data sets, the classification accuracies obtained with the χ²-statistic, Relief-F, IG, ERGS, IFSER, and FSAER methods for different numbers of selected genes (between 10 and 100) are presented in Tables 3–5.

In addition, the gene accession numbers and gene descriptions of the selected genes are available. For the Leukemia_G data, the top 10 genes selected by the FSAER algorithm are shown in Table 6. These results are consistent with the clinically proven results.

Table 6.

The top 10 genes selected using the Effective Range-based Feature Selection Algorithm (FSAER) for Leukemia_G data

Gene accession number	Gene description
D88270_at	GB DEF = (lambda) DNA for immunoglobin light chain
U05259_rna1_at	MB-1 gene
M31523_at	TCF3 Transcription factor 3 (E2A immunoglobulin enhancer-binding factors E12/E47)
M11722_at	Terminal transferase mRNA
M92287_at	CCND3 Cyclin D3
X59417_at	PROTEASOME IOTA CHAIN
X82240_rna1_at	TCL1 gene (T cell leukemia) extracted from H.sapiens mRNA for T cell leukemia/lymphoma 1
M84371_rna1_s_at	CD19 gene
M89957_at	IGB Immunoglobulin-associated beta (B29)
J05243_at	SPTAN1 Spectrin, alpha, nonerythrocytic 1 (alpha-fodrin)

Experimental results using NBC

Table 3 shows the classification accuracies obtained with the NBC on six gene expression data sets, for feature subsets of different sizes (10–100) produced by each feature selection method.

Table 3 and Figure 7 show that the FSAER method has the highest accuracy rates for most of the data sets. To evaluate the results in detail, we examine each data set in this table separately.

FIG. 7.

LOOCV classification accuracies (%) with NBC of (a) Leukemia_G, (b) Leukemia_C, (c) Prostate, (d) SRBCT for different feature selection methods using 10–100 selected features. ERGS, Effective Range based Gene Selection; FSAER, Effective Range-based Feature Selection Algorithm; IFSER, Improved Feature Selection based on Effective Range; IG, information gain; SRBCT, Small Round Blue Cell Tumors.

When the results obtained for the Leukemia_G data set are examined, it is seen that the FSAER method has the highest accuracy rate of 100% when 40 features are selected. In addition, the FSAER algorithm has yielded very good results compared with other methods when the selected feature sizes are 20, 40, 60, 80, and 100.

When the results obtained for the Leukemia_Z data set are examined, it is seen that the highest accuracy rate of 98.61% is reached when 10 and 100 features are selected with the FSAER method.

Leukemia_G and Leukemia_Z are two data sets of the same size, but their number of classes is different. When the results of two data sets are examined, it is seen that the FSAER method gives quite effective results for both two-class and three-class data sets. This shows that the method works effectively for both two-class and multiclass data sets.
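Because LOOCV tests each sample exactly once, every reported accuracy is a multiple of 100/n for a data set with n samples. The sketch below converts between percentages and counts of correctly classified samples. The sample count of 72 for the Leukemia_G and Leukemia_Z data sets is an assumption that is not stated in this section (it matches the step size of the reported values), and the helper names are illustrative.

```python
def correct_count(accuracy_percent, n_samples):
    """Number of correctly classified samples implied by a LOOCV accuracy."""
    return round(accuracy_percent * n_samples / 100)


def loocv_percent(n_correct, n_samples):
    """LOOCV accuracy in percent, rounded to two decimals as in the tables."""
    return round(100 * n_correct / n_samples, 2)


N_SAMPLES = 72  # assumed sample count for Leukemia_G and Leukemia_Z

for reported in (98.61, 97.22, 95.83):
    n_correct = correct_count(reported, N_SAMPLES)
    print(reported, "->", n_correct, "correct,", N_SAMPLES - n_correct, "misclassified")
```

Under this assumption, the difference between 98.61% and 97.22% corresponds to a single misclassified sample, which is worth keeping in mind when comparing methods.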

The highest classification accuracy for the Colon data set (88.71% with 100 features) has been provided by the Relief-F method, while the FSAER and ERGS methods have produced identical results.

In the Leukemia_C data set, ERGS and FSAER have given the highest accuracy rate of 86.49% for 60 features. In the case of 10 features, FSAER has given better results than all other methods.

When the Prostate data set is examined, it is seen that the highest accuracy rate in the table is 95.10% when 10 and 20 genes are selected with the FSAER method. Although there is a decrease in accuracy rates as the size of the feature increases, the classification accuracy for FSAER in other cases is quite high.

Leukemia_C and Prostate are higher-dimensional than the other data sets, and the results show that FSAER is quite successful on both. This suggests that the proposed FSAER method gives effective results as the data dimensionality increases.

In the SRBCT data set (4 classes), it is seen that the FSAER method reaches the highest accuracy rates regardless of how many genes are selected (10, 20, 40, 60, 80, and 100).

Experimental results using SVM classifier

Table 4 shows the classification accuracies obtained with the SVM classifier on six gene expression data sets, for feature subsets of different sizes (10–100) produced by each feature selection method. Table 4 and Figure 8 show that the FSAER method often achieves the highest accuracy rates across the data sets.

FIG. 8.

LOOCV classification accuracies (%) with SVM of (a) Leukemia_G, (b) Leukemia_C, (c) Prostate, (d) SRBCT for different feature selection methods using 10–100 selected features.

When the results obtained from the Leukemia_G data set are examined, it is seen that the highest accuracy rate of 100% is reached when the 80 and 100 features are selected with FSAER, ERGS, and IFSER methods. In addition, FSAER and ERGS methods have given the highest accuracy rate for the 60 selected features.

When the results obtained from the Leukemia_Z data set are examined, it is seen that FSAER and ERGS methods have the highest accuracy rate of 98.61% when 40 features are selected.

For the Colon data set, FSAER and ERGS achieve the highest accuracy rate among the compared methods (85.48%) when 10 features are selected. In addition, FSAER, ERGS, and IFSER methods have given the highest accuracy rate for 40 selected features.

In the Leukemia_C data set, when 20 features are selected, ERGS and IFSER methods provide the highest accuracy rate of 93.69%. When 60 features are selected, the FSAER method has the highest accuracy rate.

When the Prostate data set is examined, the χ² statistics method reaches the highest accuracy rate of 97.06% when 10 features are selected. When 20 features are selected, the FSAER method reaches the highest accuracy rate.

For the SRBCT data set, which has four classes, ERGS reaches 100% accuracy for every subset size, and FSAER reaches 100% for 20 or more selected features (96.38% with 10 features).

Experimental results using KNN classifier

Table 5 shows the classification accuracies obtained with the KNN classifier on six gene expression data sets, for feature subsets of different sizes (10–100) produced by each feature selection method. Table 5 and Figure 9 show that the FSAER method often achieves the highest accuracy rates across the data sets.

FIG. 9.

LOOCV classification accuracies (%) with KNN of (a) Leukemia_G, (b) Leukemia_C, (c) Prostate, (d) SRBCT for different feature selection methods using 10–100 selected features.

When the results obtained from the Leukemia_G data set are examined, it is seen that FSAER and ERGS methods reach the highest accuracy rate of 98.61% when 60 features are selected.

When the results obtained from the Leukemia_Z data set are examined, it is seen that the FSAER method reaches the highest accuracy rate of 97.22% when 40 features are selected.

The highest accuracy rate for the Colon data set is 88.70%. FSAER and ERGS methods have reached this classification accuracy with 40 features. These two methods have the best accuracy together.
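A comparison of this kind can be checked directly against the table. The sketch below uses the Colon rows of Table 5 (KNN) and returns every method and subset size that reaches the top accuracy. The function name and the dictionary layout are illustrative choices, and `chi2` is shorthand for the χ²-statistic.

```python
# KNN accuracies (%) for the Colon data set, copied from Table 5.
SUBSET_SIZES = [10, 20, 40, 60, 80, 100]
COLON_KNN = {
    "chi2": [83.87, 85.48, 85.48, 85.48, 82.26, 82.26],
    "Relief-F": [83.87, 87.10, 87.10, 87.10, 85.48, 85.48],
    "IG": [85.48, 85.48, 83.87, 83.87, 87.10, 85.48],
    "ERGS": [85.48, 85.48, 88.70, 85.48, 87.10, 85.48],
    "IFSER": [77.41, 85.48, 87.10, 87.10, 87.10, 85.48],
    "FSAER": [85.48, 85.48, 88.70, 85.48, 87.10, 85.48],
}


def best_configurations(table, sizes):
    """Return the top accuracy and every (method, subset size) pair that reaches it."""
    best = max(max(row) for row in table.values())
    winners = [
        (method, size)
        for method, row in table.items()
        for size, acc in zip(sizes, row)
        if acc == best
    ]
    return best, winners


best, winners = best_configurations(COLON_KNN, SUBSET_SIZES)
print(best, winners)
```

For these rows the function returns 88.70 with the pairs (ERGS, 40) and (FSAER, 40), matching the statement above.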

Conclusions

This study proposes a new statistical feature selection approach named FSAER. As an extension of the previous ERGS and IFSER algorithms, our novel method includes the advantages of both methods while taking into account the disjoint area.

Three classification methods (NBC, SVM, and KNN) have been applied to six different publicly available gene expression data sets using the features selected by FSAER, and the resulting classification accuracies have been compared with those of previously known filter methods. As a result, the knowledge of “the selected feature subsets,” denoted on the charts, can be obtained. In this way, the genes needed to diagnose and treat the related diseases can be identified.

The purpose of feature selection methods is to reduce the size of the features before classification in multidimensional data sets. The proposed method does this by assigning weight to the features. Therefore, this method is not only applicable to gene expression data sets but also to other multidimensional data sets.
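The selection step of a weight-based filter can be sketched in a few lines. The weights below are hypothetical values chosen for illustration; they are not computed by FSAER, and the function name is an assumption.

```python
def select_top_k(weights, k):
    """Return the names of the k features with the largest weights, best first."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]


# Hypothetical feature weights for illustration only; not FSAER output.
weights = {"gene_a": 0.12, "gene_b": 0.87, "gene_c": 0.45, "gene_d": 0.66}
selected = select_top_k(weights, k=2)
print(selected)
```

Because the ranking depends only on the weights, the same step applies to any data set for which per-feature weights can be computed, which is the sense in which a filter method carries over to other multidimensional data.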

In recent years, hybrid methods have been used for feature selection in the literature.^53–58 We studied statistically based filter methods, which are a very important step in hybrid methods, to make a more effective contribution to this field. Our forthcoming studies will propose hybrid methods that include the filter method we have improved.

Footnotes

Authors' Contributions

D.T.: Methodology (lead); writing—original draft (lead); software (lead); and writing—review and editing (equal). B.A.: Methodology (supporting); writing—original draft (supporting); software (supporting); and writing—review and editing (equal). Ö.Y.: Methodology (supporting); writing—original draft (supporting); and writing—review and editing (equal).

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for this article.

Abbreviations Used

References

1. McPherson, Steel, Dixon. Breast cancer—Epidemiology, risk factors, and genetics. BMJ, 2000; 321(7261):624–628.

2. Pashaei, Pashaei. Gene selection using hybrid dragonfly black hole algorithm: A case study on RNA-seq COVID-19 data. Anal Biochem, 2021; 627:114242.

3. Baldi, Long. A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics, 2001; 17(6):509–519.

4. Tang, Alelyani, Liu. Feature selection for classification: A review. In: Data Classification: Algorithms and Applications. (Aggarwal, ed.) Chapman & Hall/CRC: Boca Raton; 2014; pp. 37.

5. Bolón-Canedo, Sánchez-Maroño, Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of big data. Knowl Based Syst, 2015; 86:33–45.

6. Zheng, Zhu, Tang, et al. Gene selection for microarray data classification via adaptive hypergraph embedded dictionary learning. Gene, 2019; 706:188–200.

7. Bolón-Canedo, Sánchez-Maroño, Alonso-Betanzos, et al. A review of microarray datasets and applied feature selection methods. Inf Sci, 2014; 282:111–135.

8. Saeys, Inza, Larranaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 2007; 23(19):2507–2517.

9. Kumar, Vanaja. Analysis of feature selection algorithms on classification: A survey. Int J Comput Appl, 2014; 96(17):28–35.

10. Yu, Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03). (Fawcett, Mishra, eds.) AAAI Press: Washington, DC; 2003; pp. 856–863.

11. Duan, Rajapakse, Wang, et al. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobiosci, 2005; 4(3):228–234.

12. Zhou, Tuck. MSVM-RFE: Extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics, 2007; 23(9):1106–1114.

13. Lee, Lin, Chen, et al. Gene selection and sample classification on microarray data based on adaptive genetic algorithm/k-nearest neighbor method. Expert Syst Appl, 2011; 38(5):4661–4667.

14. Gheyas, Smith. Feature subset selection in large dimensionality domains. Pattern Recognit, 2010; 43(1):5–13.

15. Inza, Sierra, Blanco, et al. Gene selection by sequential search wrapper approaches in microarray cancer class prediction. J Intell Fuzzy Syst, 2002; 12(1):25–33.

16. Guyon, Weston, Barnhill, et al. Gene selection for cancer classification using support vector machines. Mach Learn, 2002; 46(1):389–422.

17. Maldonado, Weber, Basak. Simultaneous feature selection and classification using kernel-penalized support vector machines. Inf Sci, 2011; 181(1):115–128.

18. Canul-Reich, Hall, Goldgof, et al. Iterative feature perturbation as a gene selector for microarray data. Int J Pattern Recognit Artif Intell, 2012; 26(05):1260003.

19. Kang, Huo, Xin, et al. Feature selection and tumor classification for microarray data using relaxed Lasso and generalized multi-class support vector machine. J Theor Biol, 2019; 463:77–91.

20. Almuallim, Dietterich. Learning with many irrelevant features. In: AAAI-91. AAAI Press: Washington, DC; 1991; pp. 547–552.

21. Kohavi, John. Wrappers for feature subset selection. Artif Intell, 1997; 97(1–2):273–324.

22. Radovic, Ghalwash, Filipovic, et al. Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics, 2017; 18(1):1–14.

23. Remeseiro, Bolon-Canedo. A review of feature selection methods in medical applications. Comput Biol Chem, 2019; 112:103375.

24. Duda, Hart, Stork. (2000). Pattern Classification. John Wiley & Sons: New Jersey, USA; 2006.

25. Liu, Setiono. Chi2: Feature selection and discretization of numeric attributes. In: Proceedings of 7th IEEE International Conference on Tools with Artificial Intelligence. IEEE Computer Society Press: Washington, DC; 1995; pp. 388–391.

26. Quinlan JR. Induction of decision trees. Mach Learn, 1986; 1(1):81–106.

27. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw, 1994; 5(4):537–550.

28. Hall MA. Correlation-Based Feature Selection for Machine Learning. PhD Thesis, The University of Waikato, Hamilton, New Zealand; 1999.

29. Robnik-Šikonja, Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn, 2003; 53(1):23–69.

30. Cai, Hao, Yang, et al. An efficient gene selection algorithm based on mutual information. Neurocomputing, 2009; 72(4–6):991–999.

31. Mundra, Rajapakse. Gene and sample selection for cancer classification with support vectors based t-statistic. Neurocomputing, 2010; 73(13–15):2353–2362.

32. Chandra, Gupta. An efficient statistical feature selection approach for classification of gene expression data. JBI, 2011; 44(4):529–535.

33. Lorena, Costa, Spolaôr, et al. Analysis of complexity indices for classification problems: Cancer gene expression data. Neurocomputing, 2012; 75(1):33–42.

34. Wang, Zhou, Yi, et al. An improved feature selection based on effective range for classification. ScientificWorldJournal, 2014; 2014:972125.

35. Lyu, Wan, Han, et al. A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining. Comput Biol Med, 2017; 89:264–274.

36. Liu, Xu, Zhang, et al. Feature selection of gene expression data for cancer classification using double RBF-kernels. BMC Bioinformatics, 2018; 19(1):1–14.

37. Lai, Yeh, Chang. Gene selection using information gain and improved simplified swarm optimization. Neurocomputing, 2016; 218:331–338.

38. Wang, Tetko, Hall, et al. Gene selection from microarray data for cancer classification—A machine learning approach. Comput Biol Chem, 2005; 29(1):37–46.

39. Han, Pei, Kamber. Data Mining: Concepts and Techniques. Elsevier: Amsterdam, The Netherlands; 2011.

40. Armstrong, Staunton, Silverman, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet, 2002; 30(1):41–47.

41. Golub, Slonim, Tamayo, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999; 286(5439):531–537.

42. Boser, Guyon, Vapnik. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. (Haussler D, ed.) Association for Computing Machinery: New York, NY; 1992; pp. 144–152.

43. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media: New York, USA; 1999.

44. Domingos, Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn, 1997; 29(2):103–130.

45. Zhang. The optimality of naive Bayes. AAAI, 2004; 1(2):3.

46. Deng, Sun, Chang, et al. Probabilistic models for classification. In: Data Classification. Chapman and Hall/CRC; 2014; pp. 93–114.

47. Zaki, Meira Jr W. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press: Cambridge, UK; 2014.

48. Zhu, Ong, Dash. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognit, 2007; 40(11):3236–3248.

49. Alon, Barkai, Notterman, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A, 1999; 96(12):6745–6750.

50. Chiaretti, Li, Gentleman, et al. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 2004; 103(7):2771–2778.

51. Singh, Febbo, Ross, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002; 1(2):203–209.

52. Khan, Wei, Ringner, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 2001; 7(6):673–679.

53. Deng, Li, Deng, et al. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med Biol Eng Comput, 2022; 60:663–681.

54. Hammadi, Qasim. Hybrid binary atom search optimization approaches with statistical dependence for feature selection. In: 2022 International Conference on Computer Science and Software Engineering (CSASE). IEEE: New Jersey, USA; 2022; pp. 218–223.

55. Tahmouresi, Rashedi, Yaghoobi, et al. Gene selection using pyramid gravitational search algorithm. PLoS One, 2022; 17(3):e0265351.

56. Pashaei, Ozen, Aydin. Biomarker discovery based on BBHA and AdaboostM1 on microarray data for cancer classification. In: 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2016; pp. 3080–3083.

57. Pashaei, Pashaei. Hybrid binary COOT algorithm with simulated annealing for feature selection in high-dimensional microarray data. Neural Comput Appl, 2023; 35(1):353–374.

58. Pashaei, Pashaei. Gene selection for cancer classification using a new hybrid of binary black hole algorithm. In: 28th Signal Processing and Communications Applications Conference (SIU). IEEE: New Jersey, USA; 2020; pp. 1–4.

59. Available from: https://www.rdocumentation.org/packages/datamicroarray/versions/0.2.3 (Last accessed: June 24, 2023).

60. Zhu, Ong, Dash. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognit, 2007; 40(11):3236–3248.

61. Khan, Wei, Ringner, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 2001; 7:673–679.
