An efficient search algorithm for biomarker selection from RNA-seq prostate cancer data

Abstract

RNA-sequencing technology makes it possible to measure the expression of thousands of genes simultaneously. Large-scale gene expression data contain a huge number of genes but only a few samples. Therefore, algorithms that can accurately detect the genes associated with a specific disease among a huge number of unrelated genes can help experts detect and treat the disease early.

A two-phase search algorithm is proposed in this paper to discover biomarkers in an RNA-seq gene expression dataset for prostate cancer diagnosis. After statistical noise removal from the original large-scale dataset, a multi-objective optimization process selects the best non-dominated subset of genes with the maximum classification accuracy and the minimum number of genes, simultaneously. Finally, the proposed cache-based modification of the sequential forward floating selection (CMSFFS) algorithm is applied to the selected subset of genes to discover the most discriminant genes.

The obtained results show that the proposed algorithm is able to achieve the classification accuracy, sensitivity and specificity of 100% in the large scale RNA-seq prostate cancer dataset by selecting only three biomarkers.

Keywords

RNA-seq; large-scale prostate cancer data; two-phase search algorithm; multi-objective-based optimization; CMSFFS

1 Introduction

The prostate is a small gland in the male reproductive system. Prostate cancer, also known as carcinoma of the prostate, occurs when some of the cells of the prostate reproduce much faster than normal. RNA-sequencing data are large-scale data, which scientists use to examine the differences between normal and abnormal cells in the human body. Because of the high cost of genetic tests, the number of samples in such data is very low compared to the number of extracted genes. This is the main reason for the poor performance of many reported gene selection techniques. It is therefore of great importance to develop an algorithm that accurately detects the genes associated with prostate cancer among a huge number of unrelated genes. Several studies on gene selection from large-scale prostate cancer data have been reported, as follows.

Chiang and Ho [1] combined a rough-based feature selection method with a radial basis function (RBF) neural network for the classification of gene expression data. This prediction scheme can find the relevant features without requiring the number of clusters to be known a priori and can identify centers that approximate the correct ones. A hybrid gene selection method (IG-SVM) has been proposed in [2] to select informative genes for cancer classification. IG is a filter method that eliminates irrelevant features in high-dimensional gene expression data, and the wrapper SVM method is then used to eliminate redundant genes among those selected by the filter. Given the small size of the datasets tested, the method proposed in that study should be further validated on larger datasets. Chen et al. [3] proposed a gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. This method iteratively searches for the best gene weights while optimizing the clustering objective function, using an adaptive distance to learn the weights of genes during the clustering process. Sharma et al. [4] proposed a gene selection method to classify microarray gene expression data. The genes are first divided into a number of subsets. Then, the informative smaller subset of genes is merged with another gene subset in order to update the gene subsets. This procedure is repeated until all subsets are merged into one informative subset. An evolutionary algorithm based on genetic algorithms and artificial intelligence (IDGA) has been introduced in [5] to identify predictive genes for cancer classification. At first, a filter method is applied to reduce the dimensionality of the feature space. Next, an integer-coded genetic algorithm with variable-length genotype, adaptive parameters and modified genetic operators is applied.

An efficacious hybrid feature (gene) selection algorithm has been proposed in this paper in order to select the sub-optimal subset of genes for the classification of the large-scale RNA-seq prostate cancer data. Heuristic optimization techniques cannot guarantee the global optimum; therefore usually sub-optimal results are obtained. The sub-optimal results obtained in this investigation are acceptable.

At first, the noisy data are omitted by applying a statistical method. The sub-optimal subset of genes is selected from the RNA-seq prostate cancer dataset using the introduced Shuffled Frog Leaping-based Multi-objective Optimization Algorithm (SFLMOA) by considering the two objectives in this paper, i.e. “minimum number of selected genes” and “maximum amount of classification accuracy”. Then, the final subset of genes is selected using a Cache-based Modification of the Sequential Floating Feature Selection (CMSFFS) algorithm.

A combination of the teaching learning-based optimization (TLBO) algorithm and the PSO algorithm has been proposed in our previous work [6] to detect the smallest subset of genes involved in breast cancer with the highest performance. It is noted that in our previous work, a one-phase gene selection method based on a combination of two metaheuristic optimization techniques, i.e. the fuzzy adaptive PSO and TLBO, has been proposed for the feature selection of the breast cancer dataset. In contrast, in this study, a hybrid search algorithm based on the SFLMOA and the CMSFFS algorithm has been proposed to identify the most relevant genes. In other words, the combination of the introduced SFLMOA and CMSFFS algorithm in this study can find the sub-optimal subset of genes in the RNA-seq prostate cancer dataset.

The paper is organized as follows. The proposed hybrid gene selection algorithm is described in Section 2. In Section 3, the results are presented. Finally, the discussion and conclusion have been provided in Sections 4 and 5, respectively.

2 Proposed hybrid gene selection algorithm

In order to effectively represent the processes of the proposed hybrid gene selection algorithm, each of the components has been separately described.

2.1 The support vector machine (SVM) classifier

Vapnik et al. [7] proposed the support vector machine, a very effective and important machine learning approach. A critical issue in large-scale data analysis, such as the analysis of RNA-seq gene expression data, is the low number of samples versus a huge number of features, which is known as the “curse of dimensionality”. Support vector machines (SVMs) can overcome the challenge of the curse of dimensionality [8]. SVM-based methods maximize the margin between the original training data and the separating hyper-plane. If possible, a linear discriminant function is created from a small number of critical border samples from each class; otherwise, the “kernel” technique is applied in order to automatically map the training samples into a higher-dimensional space [9]. The linear, polynomial, radial basis function (RBF) and sigmoid functions, the four basic kernels in SVM, are respectively expressed as follows.

$K (X_{i}, X_{j}) = X_{i}^{T} X_{j}$ (1)

$K (X_{i}, X_{j}) = (γ X_{i}^{T} X_{j} + r)^{g}, \quad γ > 0$ (2)

$K (X_{i}, X_{j}) = \exp (- γ \| X_{i} - X_{j} \|^{2}), \quad γ > 0$ (3)

$K (X_{i}, X_{j}) = \tanh (γ X_{i}^{T} X_{j} + r), \quad γ > 0$ (4)

where r, γ and g are the kernel parameters, T indicates the transpose operator and X_i, X_j are the i-th and j-th sample vectors.
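As an illustration only, the four kernels can be sketched in plain Python. The function names and default parameter values below are assumptions made for this sketch and are not taken from the paper; the sigmoid kernel is written with the hyperbolic tangent, which is its usual definition.

```python
import math

def dot(x, y):
    """Inner product X_i^T X_j of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # Equation (1)
    return dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, r=0.0, g=2):
    # Equation (2); gamma, r and g are the kernel parameters (defaults assumed)
    return (gamma * dot(x, y) + r) ** g

def rbf_kernel(x, y, gamma=1.0):
    # Equation (3): exp(-gamma * squared Euclidean distance)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    # Equation (4), using the hyperbolic tangent
    return math.tanh(gamma * dot(x, y) + r)
```

The RBF kernel equals 1 for identical samples and decays toward 0 as the samples move apart, which is why γ controls how local the decision boundary is.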

2.2 Statistical pre-processing

A statistical pre-processing step based on the following equation is used to reduce the initial dimension of the gene expression data by removing the noisy genes.

$SNR = \frac{(mean_{D} - mean_{N})^{2}}{Std_{D}^{2} + Std_{N}^{2}}$ (5)

where mean and Std are the mean and standard deviation of a gene, respectively, and the subscripts D and N denote the diseased and normal samples, respectively.
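A minimal sketch of this filter is given below. It assumes the population standard deviation (the paper does not say which estimator is used), assumes that genes are ranked by descending score, and uses the hypothetical helper names `snr_score` and `top_genes`.

```python
from statistics import mean, pstdev

def snr_score(diseased, normal):
    """Equation (5) for one gene, given its expression values in each class."""
    num = (mean(diseased) - mean(normal)) ** 2
    den = pstdev(diseased) ** 2 + pstdev(normal) ** 2
    if den == 0:
        # Assumption: a gene that is constant within both classes scores 0 when
        # the class means coincide and infinity when they differ.
        return float("inf") if num > 0 else 0.0
    return num / den

def top_genes(expr_diseased, expr_normal, n_keep):
    """Return the indices of the n_keep genes with the highest scores.

    expr_diseased[g] and expr_normal[g] hold the values of gene g.
    """
    scores = [snr_score(d, n) for d, n in zip(expr_diseased, expr_normal)]
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order[:n_keep]
```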

2.3 Shuffled frog-leaping algorithm

Eusuff et al. [10] originally developed the shuffled frog-leaping algorithm (SFLA), a memetic meta-heuristic optimization method for finding the global optimal solution. The main idea of the SFLA is that each member shares in the previous experience of all other members. In the SFLA, each frog is described by a memetic vector and acts as a host for memes. Each meme consists of a number of memotypes, which are analogous to the genes of a chromosome in a genetic algorithm (GA). The frogs communicate with each other and improve their memes by infecting one another. Improving the memes changes the position of an individual frog and moves it toward a global solution [11].

At first, an initial population of P frogs is randomly generated, and the frogs are then divided into m memeplexes according to their fitness. The division is done so that the highest-ranked frog goes to the first memeplex, the second one goes to the second memeplex, the m-th frog goes to the m-th memeplex and the (m + 1)-th frog goes back to the first memeplex.
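The round-robin division can be sketched as follows. The function name is hypothetical, and the sketch assumes that a larger fitness value is better.

```python
def partition_into_memeplexes(frogs, fitness, m):
    """Deal frogs into m memeplexes in order of decreasing fitness."""
    ranked = sorted(frogs, key=fitness, reverse=True)
    memeplexes = [[] for _ in range(m)]
    for rank, frog in enumerate(ranked):
        # Rank 0 goes to memeplex 0, rank 1 to memeplex 1, ...,
        # and rank m wraps around to memeplex 0 again.
        memeplexes[rank % m].append(frog)
    return memeplexes
```

With this scheme every memeplex receives frogs from the whole quality range, so no memeplex starts with only poor solutions.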

The position of the worst frog in each memeplex at iteration t is changed as follows.

$x_{worst, k}^{t} = x_{worst, k}^{t - 1} + Δ x_{k}^{t}$ (6)

$Δ x_{k}^{t} = rand * (x_{best, k}^{t - 1} - x_{worst, k}^{t - 1})$ (7)

where x_worst and x_best are respectively the worst and the best solution (frog) in each memeplex and rand is a random number in the range [0, 1].

If this process produces a frog with better fitness function value, it replaces the worst frog. Otherwise, x_best is replaced by x_gbest in Equation (7) and the above process is repeated in which x_gbest is the global best solution. If no improvement becomes possible in this case, the worst frog in each memeplex is updated with a new random one.
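The local search step described above might look like the following sketch. It assumes that fitness is maximized, that frogs are lists of floats, and that the caller supplies a hypothetical `random_frog` factory for the fallback case; none of these details are fixed by the paper.

```python
import random

def leap(worst, target, rng=random):
    """Equations (6)-(7): move the worst frog toward a target frog."""
    rand = rng.random()  # one random scalar in [0, 1)
    return [w + rand * (t - w) for w, t in zip(worst, target)]

def update_worst(memeplex, gbest, fitness, random_frog, rng=random):
    """Update the worst frog of one memeplex in place and return the new frog."""
    best = max(memeplex, key=fitness)
    worst_idx = min(range(len(memeplex)), key=lambda i: fitness(memeplex[i]))
    worst = memeplex[worst_idx]
    # Try the memeplex best first, then the global best.
    for target in (best, gbest):
        candidate = leap(worst, target, rng)
        if fitness(candidate) > fitness(worst):
            memeplex[worst_idx] = candidate
            return candidate
    # No improvement in either attempt: replace the worst frog randomly.
    memeplex[worst_idx] = random_frog()
    return memeplex[worst_idx]
```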

After finishing the local search processes in the m memeplexes, all of the solutions are reshuffled and then the SFL algorithm is returned to the stage of sorting and division of the frogs into m memeplexes. The local search and the shuffling processes continue until the termination criterion is met.

2.4 Mutated SFLA

In order to overcome the problem of trapping in local optima, a new mutated version of the shuffled frog-leaping algorithm has been proposed in this paper. Mutation is an effective technique for improving the efficiency of the SFL algorithm. In each iteration of the mutated SFL algorithm, four frogs are selected randomly from the current memeplex. The mutated frog ( $x_{mutation, k}^{t}$ ) is generated as follows.

$x_{mutation, k}^{t} = x_{gbest} + F_{1} (x_{1, k}^{t} - x_{2, k}^{t}) + F_{2} (x_{3, k}^{t} - x_{4, k}^{t})$ (8)

where x_gbest is the global best solution and F₁ and F₂ are the mutation parameters, which are selected in the range 0.1 ≤ F₁, F₂ ≤ 0.9.

The mutation parameters are selected randomly in each iteration. For greater effectiveness, the four selected frogs should differ from each other and from the target frog: $x_{1, k}^{t} \neq x_{2, k}^{t} \neq x_{3, k}^{t} \neq x_{4, k}^{t} \neq x_{k}^{t}$ .

Finally, in order to produce a new improved solution (frog), the generated mutated frog ( $x_{mutation, k}^{t}$ ) is blended with the target frog according to the following equation.

$x_{improved, k, d}^{t} = \begin{cases} x_{mutation, k, d}^{t}, & \text{if } Cr > rand \\ x_{k, d}^{t}, & \text{otherwise} \end{cases}, \quad d = 1, 2, \dots, D$ (9)

where D is the dimension of each solution (in other words, d is the index of the elements of each solution) and Cr is the crossover constant in the range 0 ≤ Cr ≤ 1.

It should be noted that the bounds for all elements in each new generated solution should be checked. If any element in each new generated solution exceeds its limitations, it should be replaced by its own upper or lower bound.
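The mutation, blending and bound-checking steps can be combined in one short sketch. The function name, the argument layout and the default Cr = 0.5 are assumptions for illustration; `others` is expected to hold at least four memeplex members that differ from the target frog.

```python
import random

def mutate_and_blend(target, gbest, others, lower, upper, cr=0.5, rng=random):
    """Equations (8)-(9) followed by clipping to the variable bounds."""
    # Four distinct frogs drawn without replacement from the memeplex.
    x1, x2, x3, x4 = rng.sample(others, 4)
    # Mutation parameters are redrawn on every call, within [0.1, 0.9].
    f1 = rng.uniform(0.1, 0.9)
    f2 = rng.uniform(0.1, 0.9)
    mutated = [
        g + f1 * (a - b) + f2 * (c - d)
        for g, a, b, c, d in zip(gbest, x1, x2, x3, x4)
    ]
    # Element-wise crossover: take the mutated value when Cr > rand.
    improved = [m if cr > rng.random() else t for m, t in zip(mutated, target)]
    # Replace any out-of-range element by its lower or upper bound.
    return [min(max(v, lo), hi) for v, lo, hi in zip(improved, lower, upper)]
```

Setting Cr close to 1 lets most elements come from the mutated frog, while Cr close to 0 keeps the target frog almost unchanged.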

2.5 Binary SFLA

In order to use the SFL algorithm in binary mode, the following equations have been introduced in [11]. With these equations, each variable of a frog takes the value “0” or “1” in the binary SFL (BSFL) algorithm, where “0” or “1” in the position of a variable represents the absence or the presence of the specified gene, respectively. The position vector (x) in the BSFL algorithm is updated by using Δx_k in each iteration.

$x_{worst, k}^{t} = x_{worst, k}^{t - 1} \oplus Δ x_{k}^{t}$ (10)

$Δ x_{k}^{t} = (c_{1, k}^{t} . (x_{best, k}^{t - 1} \oplus x_{worst, k}^{t - 1})) + (c_{2, k}^{t} . (x_{gbest, k}^{t - 1} \oplus x_{worst, k}^{t - 1}))$ (11)

where c_1,k and c_2,k are random binary vectors and the symbols “.”, “+” and “⊕” represent the logical AND, OR and XOR operators, respectively. As mentioned, if this process generates a better solution, it replaces the worst frog. Otherwise, a new solution is randomly generated in order to replace the considered frog. In this study, in order to minimize the number of selected genes and simultaneously maximize the accuracy, sensitivity and specificity of the classification, the BSFL algorithm has been applied to optimally determine the presence or the absence of each gene in the selected subset of genes.
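The binary update of Equations (10) and (11) maps directly onto Python's bitwise operators on 0/1 integers. The sketch below uses a hypothetical function name and draws the random binary vectors c₁ and c₂ uniformly, which is an assumption.

```python
import random

def binary_leap(worst, best, gbest, rng=random):
    """Equations (10)-(11) on lists of 0/1 integers."""
    c1 = [rng.randint(0, 1) for _ in worst]
    c2 = [rng.randint(0, 1) for _ in worst]
    # "." is AND (&), "+" is OR (|), and the circled plus is XOR (^).
    delta = [
        (a & (b ^ w)) | (c & (g ^ w))
        for a, b, c, g, w in zip(c1, best, c2, gbest, worst)
    ]
    return [w ^ d for w, d in zip(worst, delta)]
```

A bit of the worst frog can flip only where it disagrees with the memeplex best or the global best, so positions on which all three frogs agree are left untouched.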

2.6 The cache-based modification of the SFFS algorithm

Let [V_i] be the set of i features from the original set [P]. The importance of the feature v in the set [V_i] and of the feature p in the set [P - V_i] are respectively calculated using Equations (12) and (13).

$Imp (v) = CF ([V_{i}]) - CF ([V_{i}] - v)$ (12)

$Imp (p) = CF ([V_{i}] + p) - CF ([V_{i}])$ (13)

where CF is the Cost Function value for each subset of features and [.] is a vector containing the index of each feature. The classification accuracy obtained using the SVM classifier has been considered the Cost Function value in this paper. It is noteworthy that the highest value of Imp belongs to the most important feature.
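The two importance measures can be written as small helpers that take the cost function as an argument. In the paper CF is the SVM classification accuracy; here `cf` is any callable that maps a list of feature indices to a score, and all function names are hypothetical.

```python
def imp_remove(cf, subset, v):
    """Equation (12): loss in CF when feature v is removed from the subset."""
    return cf(subset) - cf([f for f in subset if f != v])

def imp_add(cf, subset, p):
    """Equation (13): gain in CF when feature p is added to the subset."""
    return cf(subset + [p]) - cf(subset)

def least_important(cf, subset):
    """Feature of the subset whose removal costs the least."""
    return min(subset, key=lambda v: imp_remove(cf, subset, v))

def most_important_candidate(cf, subset, pool):
    """Feature outside the subset whose addition gains the most."""
    candidates = [p for p in pool if p not in subset]
    return max(candidates, key=lambda p: imp_add(cf, subset, p))
```

For example, with a toy additive cost function such as `cf = lambda s: sum(w[f] for f in s)`, the most important candidate is simply the unused feature with the largest weight.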

In the first stage of the Cache-Based Modification of the Sequential Forward Floating Selection (CMSFFS) algorithm, the Sequential Forward Selection (SFS) method [12] is used to select the initial subset of features, and the search process starts from this initial subset ([V₂]). The most important member (p) of the [P - V_i] set is then added to the subset [V_i] to generate [V_i+1], provided that the candidate feature is not listed in the Black List and the subset [V_i] + p is not stored in the Cache memory.

If [V_i+1] is in the Cache memory, the most important member is added to the Black List and the process to select the most important member for aggregating with [V_i] is repeated.
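The interplay between the Cache memory and the Black List in this forward step can be sketched as follows. Representing the Cache as a set of frozensets and the Black List as a set of feature indices is an assumption of this sketch, as is the function name; the full CMSFFS procedure is described in [13].

```python
def forward_step(cf, subset, pool, cache, black_list):
    """Pick the next feature to add, skipping cached subsets.

    Returns the chosen feature, or None when no candidate is left.
    black_list is updated in place.
    """
    while True:
        candidates = [
            p for p in pool if p not in subset and p not in black_list
        ]
        if not candidates:
            return None
        # Most important member according to Imp(p) = CF(V + p) - CF(V).
        p = max(candidates, key=lambda q: cf(subset + [q]) - cf(subset))
        if frozenset(subset + [p]) in cache:
            # This subset was already visited: black-list p and retry.
            black_list.add(p)
            continue
        return p
```

Because the cache stores subsets rather than single features, a feature rejected in one context can still be selected later, once the Black List has been cleared.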

Let v_l be the least important feature in [V_i+1]. In this case, if v_l = p then the new generated [V_i+1] is stored in the Cache memory, i + 1 is considered as i and the process of the most important member selection to be aggregated with [V_i] is repeated after clearing the Black List. If v_l ≠ p then v_l is deleted from [V_i+1], and [V′_i] is generated and the second stage of the proposed CMSFFS algorithm is started.

In the second stage, if the number of members of [V′_i] is equal to the number of members of [V_initial], then [V_i] is replaced by [V′_i], CF ([V_i]) is replaced by CF ([V′_i]), the newly generated [V′_i] is stored in the Cache memory and the process of selecting the most important member to be added to [V_i] is repeated after clearing the Black List. Otherwise, the least important feature in [V′_i] is identified as v_ll. If CF ([V′_i] - v_ll) ≤ CF ([V_i-1]), then [V_i] is replaced by [V′_i], CF ([V_i]) is replaced by CF ([V′_i]), the newly generated [V′_i] is stored in the Cache memory and the process of selecting the most important member to be combined with [V_i] is repeated after clearing the Black List. Otherwise, v_ll is deleted from [V′_i], [V′_i-1] is generated and i - 1 is considered as i. Then the condition of equality between the number of members of [V′_i] and the number of members of [V_initial] is checked again. This process continues until the desired number of features is reached. More details of the CMSFFS algorithm are available in [13].

2.7 Multi-objective approach

In order to resolve a weakness of some previous works [4, 14–20], which treat the feature selection problem as single-objective, the optimization problem has been considered bi-objective in this paper. The highest classification accuracy and the smallest subset of genes are pursued simultaneously by employing a multi-objective approach and storing the non-dominated solutions in the repository matrix (Rep). Details of the multi-objective technique used in this paper are available in [21].

Assuming the minimization problem, the concept of the non-dominated solution is described as follows:

If x₁ and x₂ are two candidate solutions in the multi-objective problem and x₁ dominates x₂, then the following relations must be satisfied.

$\begin{cases} f_{i} (x_{1}) \leq f_{i} (x_{2}), & \forall i \\ f_{i} (x_{1}) < f_{i} (x_{2}), & \exists i \end{cases}, \quad i = 1, 2$ (14)

where f_i, i = 1, 2 are the objective functions of the multi-objective optimization problem.
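A dominance test for the minimization case, together with a filter that keeps only the non-dominated points, can be sketched as follows. To keep both objectives in minimization form, the example represents each solution as a tuple (classification error, number of genes); this representation and the function names are assumptions of the sketch.

```python
def dominates(f_a, f_b):
    """True if objective vector f_a dominates f_b (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    strictly_better = any(a < b for a, b in zip(f_a, f_b))
    return no_worse and strictly_better

def non_dominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, among the points (0.0, 9), (0.05, 3), (0.05, 5) and (0.2, 9), only (0.0, 9) and (0.05, 3) survive, because each of the other two is matched or beaten on both objectives by one of them.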

If f₁ and f₂ are, respectively, the objective functions of classification accuracy and the number of selected genes, their membership functions are defined as follows.

$μ_{1} = \begin{cases} 0, & f_{1} \leq f_{1, min} \\ \frac{f_{1} - f_{1, min}}{f_{1, max} - f_{1, min}}, & f_{1, min} < f_{1} \leq f_{1, max} \\ 1, & f_{1} > f_{1, max} \end{cases}$ (15)

$μ_{2} = \begin{cases} 1, & f_{2} \leq f_{2, min} \\ \frac{f_{2, max} - f_{2}}{f_{2, max} - f_{2, min}}, & f_{2, min} < f_{2} \leq f_{2, max} \\ 0, & f_{2} > f_{2, max} \end{cases}$ (16)

where the subscripts min and max indicate the minimum and maximum values of the objective functions.

f₁ and f₂ lie, respectively, in the ranges 0 ≤ f₁ ≤ 1 and 1 ≤ f₂ ≤ m, where m is the largest number of genes considered in the optimization process.
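The two membership functions translate into short piecewise functions. The function names are hypothetical; the branch conditions follow Equations (15) and (16).

```python
def mu_accuracy(f1, f1_min, f1_max):
    """Equation (15): membership grows with the classification accuracy."""
    if f1 <= f1_min:
        return 0.0
    if f1 > f1_max:
        return 1.0
    return (f1 - f1_min) / (f1_max - f1_min)

def mu_gene_count(f2, f2_min, f2_max):
    """Equation (16): membership shrinks as more genes are selected."""
    if f2 <= f2_min:
        return 1.0
    if f2 > f2_max:
        return 0.0
    return (f2_max - f2) / (f2_max - f2_min)
```

Both functions map their objective onto [0, 1] with 1 as the preferred end, which makes the accuracy and the gene count directly comparable.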

Since the optimization problem in this study is multi-objective, x_worst, x_best and x_gbest (in the SFL algorithm) cannot be calculated directly. In this study, to choose the best solutions for x_best and x_gbest at each iteration of the proposed SFLA-based multi-objective optimization algorithm, a fuzzy interactive (min-max) method has been considered according to the goal defined in Equation (17). Also, in order to choose the worst solution (x_worst) at each iteration, the max-max method has been proposed according to the goal defined in Equation (18).

$\min_{j} [\max_{i} (μ_{i}^{ref} - μ_{i} ({Rep}_{j}))]$ (17)

$\max_{j} [\max_{i} (μ_{i}^{ref} - μ_{i} ({Rep}_{j}))]$ (18)

where i and j indicate the i-th objective function and the j-th member of the population, respectively. Also, μ_i (Rep_j) represents the i-th membership function of the j-th member of the population and μ^ref is a randomly selected reference vector with members in the range 0 < μ^ref ≤ 1. The term “interactive” in a fuzzy multi-objective optimization problem refers to the adjustment of the reference membership level ( $μ_{i}^{ref}$ ) by the decision maker, who can efficiently derive the “satisficing (satisfying) solution” according to his or her preferences. In order to create a Pareto front with uniform distribution, the reference value should be varied from 0 to 1. In the proposed approach, all of the generated non-dominated solutions are stored in the repository matrix.
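Given the membership values of every repository member, the min-max and max-max selections reduce to a few lines. The sketch below assumes that `rep_mu[j]` is the tuple (μ₁, μ₂) of member j and that `mu_ref` is the reference vector; the function names are hypothetical.

```python
def worst_gap(mu, mu_ref):
    """Inner max over objectives i of (mu_ref_i - mu_i)."""
    return max(r - m for r, m in zip(mu_ref, mu))

def select_best(rep_mu, mu_ref):
    """Equation (17): index j that minimizes the largest gap."""
    return min(range(len(rep_mu)), key=lambda j: worst_gap(rep_mu[j], mu_ref))

def select_worst(rep_mu, mu_ref):
    """Equation (18): index j that maximizes the largest gap."""
    return max(range(len(rep_mu)), key=lambda j: worst_gap(rep_mu[j], mu_ref))
```

With `rep_mu = [(1.0, 0.2), (0.8, 0.7), (0.3, 1.0)]` and `mu_ref = (1.0, 1.0)`, the largest gaps are roughly 0.8, 0.3 and 0.7, so the balanced second member is selected as best and the first as worst.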

2.8 The proposed algorithm

Figure 1 represents the flowchart of the proposed gene selection algorithm in this study.

In this study, an approach based on the shuffled frog-leaping algorithm (SFLA) and the CMSFFS algorithm is proposed to identify the most informative genes in the RNA-seq prostate cancer dataset. In the first stage, the mutated binary SFLA-based multi-objective optimization is used to simultaneously incorporate both objectives, the maximum accuracy and the minimum number of selected genes, in the proposed gene selection algorithm. Finally, the cache-based modification of the SFFS algorithm is used to choose the most discriminant genes from those selected in the SFLA-based multi-objective optimization stage. The SVM classifier with RBF kernel has been considered as the central core of the proposed approach to classify the RNA-seq prostate cancer data.

Let $S \in ℝ^{p \times q}$ be the original RNA-seq prostate cancer dataset, where p and q respectively indicate the total number of genes and the total number of samples in the dataset. After applying the statistical pre-processing stage, the number of genes is reduced to N_p and the matrix of the reduced dataset, denoted by $\hat{S} \in ℝ^{N_{p} \times q}$, is sent to the proposed mutated SFLA-based interactive multi-objective optimization stage.

If $X \in B^{N \times N_{p}}$ is the binary matrix of the initial population in the proposed mutated SFL algorithm, N and N_p represent the number of solutions (frogs) and the dimension of each solution, respectively. Also, N_I, N_F and N_R are the maximum number of iterations, the number of folds and the number of repetitions of the validation process in the proposed SFLA-based multi-objective optimization algorithm. Repeating the validation in this way reduces the randomness of the results in the multi-objective optimization stage. By simultaneously including the two conflicting objectives of the highest classification accuracy and the lowest number of genes in the multi-objective optimization stage, the sub-optimal subset of genes is extracted from the RNA-seq prostate cancer dataset. Eventually, the final subset of genes most strongly associated with prostate cancer is selected from the sub-optimal subset of genes by applying the CMSFFS algorithm.

3 Experimental results

Fig.1

Flowchart of the proposed gene selection algorithm.

3.1 The RNA-seq prostate cancer dataset

In order to evaluate the performance of the proposed gene selection algorithm in this paper, the publicly available RNA-seq prostate cancer dataset [22] has been employed. The RNA-sequencing prostate cancer dataset consists of 20502 genes and 20 samples with 10 benign prostate tissues and 10 cancerous prostate tissues. Also, the number of classes in these datasets is two.

3.2 Assessment process

In order to prevent over-fitting and data leakage, the original data are shuffled and 20% of the samples in the RNA-seq prostate cancer dataset are set aside as external test data, in line with five-fold cross-validation; only the remaining 80% of the samples are used in the proposed gene selection process. The external test data are then used to evaluate the results of the proposed gene selection algorithm. Hence, the final results reported in Table 1 are based only on external test data that never participated in the gene selection process.
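One possible reading of this protocol is a single shuffled hold-out of 20% of the samples. The sketch below additionally stratifies the split by class so that both classes appear in the test set; the stratification, the fixed seed and the function name are assumptions and are not stated in the paper.

```python
import random

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split sample indices into (train, test), class by class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)  # shuffle within the class before cutting
        n_test = round(len(idx) * test_fraction)
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)
```

For a dataset with 10 benign and 10 cancerous samples, this yields 16 samples for gene selection and 4 external test samples, 2 from each class.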

Table 1
Comparison of gene selection algorithms from the large-scale prostate cancer datasets

Method	# of folds	# of selected genes	Acc. (%)	Sens. (%)	Spec. (%)
[1]	10	668	99.62	N/A^*	N/A
[2]	10	3	96.08	N/A	N/A
[3]	10	10	96.85	N/A	N/A
[4]	3	4	97	N/A	N/A
[5]	10	14	96.3	N/A	N/A
The proposed algorithm	N_F × N_R = 5 × 20	3	100	100	100

^*N/A stands for not available.

By considering the two classes of “normal” and “tumor” in the RNA-seq prostate cancer gene expression dataset, the following parameters are defined to evaluate the performance of the proposed gene selection algorithm.

TP (True Positive) represents the samples which the algorithm correctly identified as tumor samples.

TN (True Negative) represents the samples which the algorithm correctly identified as normal samples.

FP (False Positive) represents the samples which the algorithm falsely identified as tumor samples.

FN (False Negative) represents the samples which the algorithm falsely identified as normal samples.

In this study, the sensitivity is the ratio of tumor cases that the classifier correctly assigns to the tumor class. The specificity is the ratio of normal cases that the classifier correctly assigns to the normal class. Accuracy is another important quantitative criterion, which represents the proportion of all samples whose membership or non-membership in the specified group is correctly diagnosed by the proposed gene selection algorithm. The three criteria are mathematically formulated as follows.

$Sensitivity = \frac{\# TP}{\# TP + \# FN}$ (19)

$Specificity = \frac{\# TN}{\# TN + \# FP}$ (20)

$Accuracy = \frac{\# TP + \# TN}{\# TP + \# FN + \# TN + \# FP}$ (21)
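The three criteria can be computed from true and predicted labels as in the sketch below. Encoding the tumor class as 1 and the normal class as 0 is an assumption, as are the function names; the sketch also assumes that both classes occur in `y_true`, since otherwise a denominator is zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN with `positive` as the tumor label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred, positive=1):
    """Equations (19)-(21) as a dictionary of ratios in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```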

3.3 Case studies

As explained before, the statistical pre-processing has been used to remove the noisy genes and to select the most informative genes in the RNA-seq prostate cancer dataset to be sent to the mutated SFLA-based multi-objective optimization stage. Figure 2 shows the result of applying the pre-processing stage on the RNA-seq prostate cancer dataset. In this figure, the horizontal axis and the vertical axis respectively indicate the number of selected genes with the highest scores and the average classification accuracy. The value N_p = 100 has been chosen as the number of genes sent to the proposed multi-objective stage. Around N_p = 100, Fig. 2 shows a flat region in which increasing the number of genes does not affect the maximum classification accuracy.

Fig.2

The result of applying the pre-processing stage on the RNA-seq prostate cancer dataset.

Fig.3

The result of applying the proposed multi-objective optimization method on the RNA-seq prostate cancer dataset.

In this study, the initial population size in the proposed multi-objective optimization stage has been set to N = 300.

As mentioned before, the binary SFL algorithm has high capability to find the sub-optimal solution. Also, the mutation method used in the SFL algorithm increases the capability of local search of the proposed optimization algorithm. The proposed mutated SFL-based multi-objective optimization method has been applied to the filtered RNA-seq data in order to select the sub-optimal subset of genes by considering, simultaneously, the two defined objectives, i.e. “minimum number of selected genes” and “maximum amount of classification accuracy”. In this paper N_I = 10 has been considered. Applying data folding technique with N_F = 5 and also the repetition of the validation process in the proposed SFLA-based multi-objective optimization algorithm with N_R = 20 enhance the robustness and reduce the randomness of the proposed algorithm.

Figure 3 demonstrates the result of applying the first stage of the gene selection process, i.e. the proposed multi-objective optimization method, on the RNA-seq prostate cancer dataset. In Fig. 3, the horizontal and the vertical axes specify the accuracy of the RNA-seq data classification and the number of selected genes, respectively. The star-shaped symbols on the Pareto front indicate the sub-optimal solutions, and the sub-optimal solution with the highest classification accuracy and the lowest number of genes has been marked in Fig. 3. As indicated in Fig. 3, the proposed mutated SFL-based multi-objective optimization method is able to achieve a classification accuracy of 100% in the first stage of the gene selection process with only nine genes.

In the second stage of the proposed gene selection approach, the CMSFFS algorithm has been applied to the sub-optimal subset of genes selected in the multi-objective optimization stage.

By applying the CMSFFS algorithm, the final subset of genes is selected from the RNA-seq prostate cancer dataset in a repetitive cache-based process. The final result of applying the CMSFFS gene selection algorithm on the sub-optimal subset of genes selected in the multi-objective optimization stage on the RNA-seq prostate cancer dataset has been shown in Fig. 4.

The horizontal and the vertical axes in Fig. 4 represent the number of selected genes and the classification accuracy of the RNA-seq prostate cancer data, respectively. As shown in Fig. 4, the proposed gene selection algorithm is capable of selecting a sub-optimal subset of only three genes from the large-scale RNA-seq prostate cancer dataset such that 100% classification accuracy is achieved.

Fig.4

The final result of applying the CMSFFS gene selection algorithm on the sub-optimal subset of genes selected in the multi-objective optimization stage on the RNA-seq prostate cancer dataset.

4 Discussion

Table 1 compares the results of the mutated SFL-based gene selection algorithm proposed in this paper with those of several recent studies on large-scale prostate cancer datasets.

A combination of the rough-based feature selection method and an RBF neural network has been used in [1] in order to classify the gene expression data. The number of folds was 10 and the reported classification accuracy was 99.62% with 668 genes selected from the prostate cancer dataset. In [2], a hybrid SVM-based gene selection method has been used to analyze the gene expression data. The classification accuracy obtained in [2] was 96.08% with 3 selected genes, using the folding technique with 10 folds. A clustering-based gene selection method has been proposed in [3]. The number of folds in that study was 10 for the prostate cancer dataset and the classification accuracy obtained was 96.85% with 10 selected genes. In [4], a feature selection algorithm has been proposed based on the division of genes into small subsets, such that the subsets of features provide high classification accuracy. The number of folds was 3 for the prostate cancer dataset, and the classification accuracy obtained was 97% with 4 selected genes. A method based on a genetic algorithm has been proposed in [5] to select the informative genes for cancer classification. The classification accuracy obtained for the prostate cancer data was 96.3% with 14 selected genes, and the number of folds was 10.

Most of the techniques proposed in recent years could optimize only one of the two important objectives, i.e. minimization of the number of selected genes or maximization of the classification accuracy in the considered gene expression dataset.

In this study, at first the noisy genes are filtered in the statistical pre-processing stage. Then by applying the proposed fuzzy interactive multi-objective SFL-based optimization algorithm on the filtered RNA-seq prostate cancer dataset, the sub-optimal subset of informative genes associated with prostate cancer is obtained. In the proposed binary SFL-based multi-objective optimization algorithm, the ability of local search has been increased by employing the proposed mutation method with 5 members (x_gbest and 4 different members) for each member of the population. In the multi-objective optimization stage, the sub-optimal subset with the lowest number of genes and the highest classification accuracy is selected. Our proposed SFL-based multi-objective optimization method can achieve 100% accuracy of classification by selecting 9 genes of the RNA-seq prostate cancer dataset. In the second stage of the proposed search approach, the sub-optimal subset of genes obtained from the multi-objective optimization stage is sent to the CMSFFS algorithm. Ultimately, the most informative genes are selected from the sub-optimal subset of genes by applying the CMSFFS algorithm.

By applying the proposed hybrid gene selection approach to the RNA-seq prostate cancer dataset, a final subset of only three genes was selected, with which a classification accuracy of 100% was achieved.
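The second-stage search can be sketched as a sequential forward floating selection with a memoization cache, so that no gene subset is scored twice. This is only one plausible reading of "cache-based" and should not be taken as the exact CMSFFS procedure. The function `cached_sffs`, the scorer `toy_score`, and the target set are hypothetical. In practice the scorer would be something like a cross-validated classifier accuracy.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Subset = FrozenSet[int]


def cached_sffs(features: Iterable[int],
                score_fn: Callable[[Subset], float],
                max_size: int) -> Tuple[Subset, float, int]:
    """Floating forward selection with a score cache (assumes max_size >= 1).

    Returns (best subset, its score, number of distinct subsets scored)."""
    cache: Dict[Subset, float] = {}

    def score(subset: Subset) -> float:
        # Each distinct subset is evaluated by score_fn at most once.
        if subset not in cache:
            cache[subset] = score_fn(subset)
        return cache[subset]

    pool = list(features)
    selected: Subset = frozenset()
    best_by_size: Dict[int, Tuple[float, Subset]] = {}

    while len(selected) < max_size:
        remaining = [f for f in pool if f not in selected]
        if not remaining:
            break
        # Forward step: add the feature that gives the best-scoring subset.
        selected = max((selected | {f} for f in remaining), key=score)
        size = len(selected)
        if size not in best_by_size or score(selected) > best_by_size[size][0]:
            best_by_size[size] = (score(selected), selected)
        # Floating step: drop a feature while that beats the best known
        # subset of the smaller size.
        while len(selected) > 2:
            reduced = max((selected - {f} for f in selected), key=score)
            if score(reduced) > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (score(reduced), reduced)
            else:
                break

    # Prefer the highest score; break ties in favour of fewer features.
    best_score, best_subset = max(best_by_size.values(),
                                  key=lambda t: (t[0], -len(t[1])))
    return best_subset, best_score, len(cache)


# Toy scorer (hypothetical): rewards genes 1, 3 and 4, penalises all others.
TARGET = frozenset({1, 3, 4})


def toy_score(subset: Subset) -> float:
    return len(subset & TARGET) - 0.1 * len(subset - TARGET)


best_subset, best_score, n_evals = cached_sffs(range(6), toy_score, max_size=5)
print(sorted(best_subset), best_score)  # [1, 3, 4] 3.0
```

The cache matters because the floating step revisits subsets that earlier forward steps already scored, and each real evaluation would require training a classifier.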

By using the folding technique and repeating the assessment process several times, the effect of randomness is reduced in this study and the robustness of the proposed algorithm is increased. The SVM classifier is used to evaluate the proposed algorithm because it can overcome the challenge of the “curse of dimensionality”. Despite the high dimensionality of the search space in the RNA-seq prostate cancer dataset, the proposed hybrid gene selection approach could find the global sub-optimal solution in its two stages.
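The folding-and-repetition idea can be sketched as follows, assuming plain (non-stratified) k-fold splitting and a user-supplied scoring callback. The names `kfold_indices`, `repeated_kfold_accuracy`, and `majority_baseline`, as well as the toy labels and the choice of 10 folds and 5 repeats, are hypothetical. The study uses an SVM classifier; a trivial majority-class baseline stands in for it here so that the sketch has no external dependencies.

```python
import random
from statistics import mean
from typing import Callable, List


def kfold_indices(n_samples: int, k: int, rng: random.Random) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_kfold_accuracy(n_samples: int, k: int, repeats: int,
                            fit_and_score: Callable[[List[int], List[int]], float],
                            seed: int = 0) -> float:
    """Average the per-fold accuracy over several reshuffled k-fold runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = kfold_indices(n_samples, k, rng)
        for i, test_idx in enumerate(folds):
            # All folds except fold i form the training set.
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            scores.append(fit_and_score(train_idx, test_idx))
    return mean(scores)


# Toy labels (hypothetical): 12 samples of class 0 and 8 of class 1.
labels = [0] * 12 + [1] * 8


def majority_baseline(train_idx: List[int], test_idx: List[int]) -> float:
    """Stand-in for a real classifier: predict the majority training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return mean(1.0 if labels[i] == majority else 0.0 for i in test_idx)


accuracy = repeated_kfold_accuracy(len(labels), k=10, repeats=5,
                                   fit_and_score=majority_baseline)
print(round(accuracy, 3))
```

Averaging over several reshuffled runs makes the reported accuracy less dependent on one particular split of the samples.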

Details of the selected genes in the RNA-seq prostate cancer dataset are listed in Table 2.

Table 2
Details of the final subset of genes in the RNA-seq prostate cancer dataset

Gene Index	Gene_ID
14478	RPL29P2
6271	FLJ44606
4706	DHRS12

5 Conclusion

A hybrid gene selection algorithm based on multi-objective optimization was proposed in this paper to select the most informative subset of genes in the RNA-seq prostate cancer dataset. After the noisy genes were removed, a sub-optimal subset of genes associated with prostate cancer was selected in the multi-objective optimization stage. Combining the proposed mutation method with the binary SFL-based optimization algorithm reduced the probability of being trapped in a local optimum and increased the local search ability of the proposed algorithm. The multi-objective optimization process enables the decision maker to select the required solution according to the number of genes and the classification accuracy. Finally, the CMSFFS algorithm was used to identify the final subset of genes from the RNA-seq prostate cancer dataset. With the three identified genes, the final subset classified the RNA-seq data with a classification accuracy, sensitivity and specificity of 100% on the external test data. In addition, the probability of obtaining random results was minimized by using data folding and multiple runs of the proposed gene selection algorithm.

References

1. Chiang J.-H. and Ho S.-H., A combination of rough-based feature selection and RBF neural network for classification using gene expression data, IEEE Transactions on Nanobioscience 7 (2008), 91–99.

2. Gao, Ye, Lu, et al., Hybrid method based on information gain and support vector machine for gene selection in cancer classification, Genomics, Proteomics & Bioinformatics 15 (2017), 389–395.

3. Chen, Zhang and Gutman, A kernel-based clustering method for gene selection with gene expression data, Journal of Biomedical Informatics 62 (2016), 12–20.

4. Sharma, Imoto and Miyano, A top-r feature selection algorithm for microarray gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 9 (2012), 754–764.

5. Dashtban and Balafar, Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts, Genomics 109 (2017), 91–107.

6. Shahbeig, Helfroush M.S. and Rahideh, A fuzzy multi-objective hybrid TLBO–PSO approach to select the associated genes with breast cancer, Signal Processing 131 (2017), 58–65.

7. Vapnik V.N., The nature of statistical learning theory, Springer Heidelberg, 1995.

8. Melgani and Bazi, Classification of electrocardiogram signals with support vector machines and particle swarm optimization, IEEE Transactions on Information Technology in Biomedicine 12 (2008), 667–677.

9. Hsu C.-C., Chen M.-C. and Chen L.-S., Integrating independent component analysis and support vector machine for multivariate process monitoring, Computers & Industrial Engineering 59 (2010), 145–156.

10. Eusuff, Lansey and Pasha, Shuffled frog-leaping algorithm: A memetic meta-heuristic for discrete optimization, Engineering Optimization 38 (2006), 129–154.

11. Gomez-Gonzalez, Ruiz-Rodriguez and Jurado, Probabilistic optimal allocation of biomass fueled gas engine in unbalanced radial systems with metaheuristic techniques, Electric Power Systems Research 108 (2014), 35–42.

12. Whitney A.W., A direct method of nonparametric measurement selection, IEEE Transactions on Computers 100 (1971), 1100–1103.

13. Shahbeig, Rahideh, Helfroush M.S., et al., Gene expression feature selection for prostate cancer diagnosis using a two-phase heuristic–deterministic search strategy, IET Systems Biology (2018).

14. Aziz, Verma and Srivastava, A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics Data 8 (2016), 4–15.

15. Chandra and Gupta, An efficient statistical feature selection approach for classification of gene expression data, Journal of Biomedical Informatics 44 (2011), 529–535.

16. Cui, Zheng C.-H., Yang, et al., Sparse maximum margin discriminant analysis for feature extraction and gene selection on gene expression data, Computers in Biology and Medicine 43 (2013), 933–941.

17. Gonzalez-Navarro F.F. and Belanche-Muñoz L.A., Feature selection for microarray gene expression data using simulated annealing guided by the multivariate joint entropy, Computación y Sistemas 18 (2014), 275–293.

18. Liu, Cui, Jiang, et al., A combinational feature selection and ensemble neural network method for classification of gene expression data, BMC Bioinformatics 5 (2004).

19. Nguyen, Khosravi, Creighton, et al., Hidden Markov models for cancer classification using gene expression profiles, Information Sciences 316 (2015), 293–307.

20. Wang and Han, Hybrid feature selection method for gene expression analysis, Electronics Letters 50 (2014), 1269–1271.

21. Shahbeig, Rahideh, Helfroush M.S., et al., Gene selection from large-scale gene expression data based on fuzzy interactive multi-objective binary optimization for medical diagnosis, Biocybernetics and Biomedical Engineering 38 (2018), 313–328.

22. Smith B.A., Sokolov, Uzunangelov, et al., A basal stem cell signature identifies aggressive prostate cancer phenotypes, Proceedings of the National Academy of Sciences 112 (2015), E6544–E6552.
