Abstract
Gene selection as an important data preprocessing technique for cancer classification is one of the most challenging issues in the field of microarray data analysis. In this paper, to deal with gene expression data more effectively, a locally linear embedding (LLE) and neighborhood rough sets-based gene selection method using Lebesgue measure for cancer classification is proposed. First, to solve the problems that the traditional LLE method cannot effectively identify category information, and is susceptible to noise pollution and other issues, the intra-class neighborhood is defined and a new method of calculating reconstruction weight is proposed by combining with the Euclidean distance to improve LLE. Then, the Lebesgue measure is introduced into neighborhood rough sets, a δ -neighborhood measure is defined, and the dependency degree and the significance measure are presented in neighborhood decision systems. Finally, an improved LLE and neighborhood rough sets-based gene selection algorithm is designed, where the improved LLE algorithm is used to reduce the initial dimensions of gene expression data and obtain a candidate gene subset, and the Lebesgue measure and dependency degree-based relative reduction for gene expression data is developed to further screen the candidate subset to select the final gene subset. The experimental results under several public gene expression data sets prove that the proposed method is effective for selecting the most relevant genes with high classification accuracy.
Keywords
Introduction
Microarray techniques have been used to delineate cancer groups or to identify candidate genes for cancer prognosis, and various classification methods have been applied to analyze or interpret gene expression data as such problems can be viewed as classification ones [1, 2]. However, due to the availability of small number of effective samples compared to the large number of genes in microarray data, many computational methods have failed to identify a small subset of important genes [3, 4]. In general, the classification of cancer using microarray data involves data acquisition pre-processing, gene selection and classification [5]. Classification performance obtained through these processes is evaluated, and gene selection is an important aspect in the course of microarray data analysis. The aim of gene selection is to reduce the dimensionality of microarray data in order to enhance the accuracy of classification task [6]. The feature selection methods broadly divided into four categories including filter, wrapper, embedded, and hybrid approaches [7–12]. Independent of the classifier, filter methods have been widely used because it holds the advantage of high speed and is capable of dealing with large data sets, but they are easily trapped into local optimum. Though the wrapper approach contains a given learning model, it suffers from high computational cost especially for high-dimensional microarray data sets. The main advantage of embedded approach is the interaction with learning model, but training a given classifier with the full gene set is time-consuming especially. The major disadvantage of hybrid approach is that the filter and wrapper approaches are not truly integrated with each other, which leads to lower classification performance. Our gene selection method is based on the filter approach, in which a heuristic search algorithm is used to find an optimal gene subset with neighborhood rough sets for gene expression data sets.
The pre-processing of tumor gene selection is dimensionality reduction of data sets. The locally linear embedding (LLE) is an efficient way to reduce the dimension [13–16]. Vanderplas and Connolly [13] introduced LLE to the astronomical community as a classification technique. Liu et al. [16] used the unsupervised LLE learning algorithm to transform multivariate MRI data of regional brain volume and cortical thickness to a locally linear space with fewer dimensions. Su et al. [17] developed a fault diagnosis method based on incremental enhanced supervised LLE and adaptive nearest neighbor classifier to improve the accuracy of machinery fault diagnosis. Sun et al. [18] presented a gene selection method based on LLE and neighborhood rough sets for gene expression data classification. Xu et al. [19] combined LLE and correlation coefficient to classify microarray data. Although the computational complexity of LLE is low, LLE is unsupervised and cannot identify the categories of data. In order to get fewer genes with higher classification ability while reducing time complexity, this paper focuses on improving classical LLE algorithm.
For evaluating a gene selection method, in addition to the predictive ability of gene subsets, two other important aspects that need to be considered are the stability of the selected genes and the computational costs [17, 18]. Thus, gene subsets which have low dimensionality and high classification ability would be selected from gene expression profiles. In this paper, the neighborhood rough sets [20–23] is introduced to handle continuous numerical data, can avoid information loss and improves classification accuracy of gene subsets. Hu et al. [23] proposed a technique for heterogeneous feature subset selection based on neighborhood rough sets. Meng et al. [24] presented neighborhood system to deal directly with information table formed by integrating gene expression data with biological knowledge. Liu et al. [25] studied a quick attribute reduction algorithm for neighborhood rough sets. Sun et al. [26] raised a gene selection approach to improve the classification performance of microarray data based on Fisher linear discriminant and neighborhood rough sets. Chen et al. [27] proposed a gene selection method for tumor classification using neighborhood rough sets and entropy measures. Mu et al. [28] designed a feature selection method based on improved Fisher discriminant analysis and neighborhood rough sets. However, when dealing with high-dimensional data, neighborhood rough sets still have some issues of more time consuming and space complexity.
In order to decrease time complexity and improve classification accuracy of selecting gene subset for cancer classification, a gene selection method based on improved LLE and neighborhood rough sets using Lebesgue measure is proposed. Firstly, by introducing the concept of intra-class neighborhood and redefining the right of reconfiguration, the LLE algorithm is improved. The pre-processing of the prior optimization can be carried out, which makes the reduction data contain more effective classification information. Then, on the basic of neighborhood, a gene expression data set is granulated by neighborhood parameters, and the Lebesgue measure is introduced to develop a new dependency degree in neighborhood rough sets. The method of calculating significance measure is improved. Finally, LLE and neighborhood rough sets are combined to build an effective cancer gene selection and classification model, which can improve the classification ability of gene subsets. From many numerical experiment results, it can been easily observed that the number of genes selected by our algorithm is the least on the Prostate and Leukemia data sets, and is slightly second to two extended neighborhood rough set methods on the Colon and Gastric data sets. What’s more, our calculated average classification accuracy is the highest on all the data sets. Therefore, the proposed method shows the great performance for gene expression data classification.
The article is mainly composed of the following parts. Section 2 recalls some basic knowledge of LLE and neighborhood rough sets. In Section 3, LLE is improved and neighborhood rough set model with Lebesgue measure is investigated. Section 4 analyzes the simulation results of gene expression data sets. Section 5 is a summary of this article.
Preliminaries
In this section, we briefly review some basic concepts about LLE and neighborhood rough sets. These notations have been given in [13–27].
Locally linear embedding
The LLE is an unsupervised nonlinear dimensionality reduction method for mapping high-dimensional data nonlinearly to a lower-dimensional space [18]. Its basic idea is that of global minimisation of the reconstruction error of the set of all local neighbors in the data set. The LLE algorithm has some attractive characteristics: it does not require an iterative algorithm and just a few parameters need to be set [14]. A feature mapping is established as follows: the low-dimensional embedding maintains the same local neighborhood relationship in high-dimensional space. Thus, the LLE can obtain the low-dimensional embedding from the nearest neighbor graph in high-dimensional space under certain conditions [19].
As input, LLE maps a data set of N D-dimensional vectors assembled in a matrix X of size D × N, i.e., X = {x1, x2, ⋯ , x N }. Its output is a set of N M-dimensional vectors in a matrix Y of size M × N, i.e., Y = {y1, y2, ⋯ , y N }, where M ⪡ D, and the k th column vector of Y corresponds to the k th column vector of X. Assuming the data lies on a nonlinear manifold which locally can be approximated linearly, it employs three stages as follows: (I) locally fitting hyperplanes around each sample x i , based on its k nearest neighbors, (II) calculating reconstruction weights, and (III) finding lower-dimensional coordinates y i for each x i , by minimising a mapping function based on these weights.
In stage I, each sample x i is approximated by a weighted linear combination of its k nearest neighbors, making use of the assumption that neighboring samples will lie on a locally linear patch of the nonlinear manifold. Then, the adjacent points of each point by using the k nearest neighbors can be obtained.
In stage II, the weights w
ij
that best linearly reconstruct X from its neighbors, need to be computed to solve the constrained least squares problem. Assume that each sample x
i
in a high-dimensional space can be linearly represented by the samples in its local neighborhood. Then, the reconstruction weights can be obtained from minimizing the quadratic sum of local reconstruction deviation of x
i
by using thefollowing cost function minimised:
In stage III, the weights w
ij
are fixed and new m-dimensional vectors y
i
are sought, which minimise the criterion as an embedding cost function:
The neighborhood rough set model is a method to solve the problem that classical rough sets cannot handle continuous numerical data [23]. In gene expression data sets, the measured gene expression levels and pharmaceutical tests are presented by continuous-valued data at different magnitudes [12]. By utilizing neighborhood rough sets, the discretization of continuous data can be avoided. Note that the gene expression data sets can be described by a neighborhood decision system, where a sample is an object, a gene contains a conditional attribute, and a subclass of cancer corresponds to a decision attribute.
Given a neighborhood decision system NS = (U, C, D, V, f, Δ, δ), U = {x1, x2, ⋯ , x n } is a sample set, and C = {a1, a2, ⋯ , a m } is a set of all attributes, while D is a decision attribute set. V = ⋃ a∈{C∪D}V a , where V a is a value set of attribute a. f : U × {C ∪ D} → V is a map function and f (a, x) represents the value of x on attribute a ∈ C ∪ D. Δ → [0, ∞) is a distance function, and δ is a neighborhood parameter, where 0 ≤ δ ≤ 1. In the following, NS = (U, C, D, V, f, Δ, δ) is simply noted by NS = (U, C, D, δ).
Since the Euclidean distance function effectively reflects the basic information of the unknown data [4, 29], it is introduced into this paper, expressed as
Given a neighborhood decision system NS = (U, C, D, δ) and a distance function Δ → [0, ∞), for any B ⊆ C and δ ∈ [0, 1], the similarity relation resulting by the subset B is described as
Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, and U/D = {D1, D2, ⋯ , D
N
}, then the neighborhood lower approximation set and upper approximation set of D with respect to B are described respectively as
Improved locally linear embedding
The traditional LLE algorithm has some deficiencies [13]. For example, it is assumed that the data is evenly and densely sampled in the LLE algorithm, while for those data polluted by noise, sparsely sampled and with larger curvature, the low-dimensional embedding result will be seriously damaged, and the original manifold learning algorithms are all unsupervised. Hence, the LLE algorithm does not make the best use of category information and the result of dimensionality reduction is not conducive to the later identification and classification. Furthermore, LLE is a kind of expression without explicit mapping, so the ability to generate new samples is weak. Therefore, by considering the category information of samples, the neighborhood of samples can be reconstructed to solve this defect of the LLE algorithm.
The specific steps of improved LLE-based dimensionality reduction algorithm are as follows:
Step 1: Let X = {x1, x2, ⋯ , x
N
} be a given data set of N point, where x
i
∈ X, select the intra-class neighborhood of samples, calculate the distance between samples by adopting the following Euclidean distance:
Step 2: Calculate the reconstruction weight matrix W of sample neighborhood. Assume that each sample x
i
in a high-dimensional space can be linearly represented by the samples in its local neighborhood, and the new reconstruction weight w
ij
can be calculated by Equation (7). Then, from the minimised cost function
Step 3: Calculate the optimal low-dimensional embedded matrix Y, select an appropriate embedded dimension M and obtain the best low-dimensional embedded matrix Y by the following function:
Since
In short, the process of solving Y is equal to seeking the eigenvector of matrix M which is sparse, symmetric and semi-positive definite. Here, the Lagrange multiplier is firstly used to calculate L (Y) = YMY
T
- λ (YY
T
- NI), and the partial derivative of Y can be obtained. Then, L (Y) is minimised to get the following formula: L (Y) =
Let E be any point set in R
n
, and for each column open ball I
i
covering E, its volume is summed as μ = ∑
i
|i|. The entire μ make up a number set with bounded below, and its lower bound (determined by E completely) is claimed as the Lebesgue external measure of E [31], which is described as
In a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C,
(1) K (B, D) = K (C, D),
(2) K (B, D) > K (B - {a}, D), where any a ∈ B.
Obviously, a relative reduct of C with respect to D is the minimal attribute subset to retain the dependency degree of C with respect to D.
The specific steps of Lebesgue measure and dependency degree-based relative reduction algorithm for gene expression data are as follows.
Step 1: Use min-max standardization
Step 2: Select all the gene columns to make up a gene set S A except for the decision in Ai×t.
Step 3: Initialize a reduct set red =∅.
Step 4: Calculate the dependency degree K({a i }, D) for all a i in S A by Eq. (12), and get the significance measure SIG using Equations. (13) and (14), where a i (i = 1, 2, ⋯ , N) represents the gene column in S A .
Step 5: If SIG= 0, then delete the redundant a i .
Step 6: Calculate the SIG of the gene a i which satisfies max{a i ∈ S A |K (C, D)}.
Step 7: If SIG ≤ λ, then let red = red ∪ {a k } and S = S ∪ POS k , and return to Step 4.
Step 8: Find out the corresponding gene columns in Ai×tfrom red, and obtain an optimal gene subset.
Step 9: Return an optimal gene subset.
In this subsection, the improved LLE algorithm is used to select genes from the initial gene expression data sets. The reconstruction weight matrix of the data sets is obtained by the intra-class neighborhood, and then to screen genes with certain distinguishing ability the low-dimensional embedded sample matrix is calculated to obtain the candidate gene subsets by Algorithm 1. Algorithm 2 is employed to make gene selection in the candidate gene subsets. The candidate subsets are normalized, the maximum dependence degree and the significance measure of each gene are calculated respectively, and then one can judge whether it is redundant. When the significance measure of the genes is bigger than λ, these genes can be grouped into a subset, and then the process of selecting the best genes can be carried out. Thus, the steps of improved LLE and neighborhood rough sets-based gene selection (LLENRS-GS) algorithm are as follows.
Step 1: Select the intra-class neighborhood of the gene expression data sets with Step 1 of Algorithm 1.
Step 2: Calculate the reconstruction weight matrix W with Step 2 of Algorithm 1.
Step 3: Calculate the low-dimensional embedded matrix Y, and obtain the preliminary reduced dimension of gene subsets with Step 3 of Algorithm 1.
Step 4: Standardize the gene subset and generate a matrix A by Step 1 of Algorithm 2.
Step 5: Group all the gene columns in the matrix A into a gene set S A with Step 2 of Algorithm 2.
Step 6: Initialize a reduct set red =∅.
Step 7: Calculate the dependency degree of a i and the SIG, where any a i ∈ Ai×t - red in Ai×t.
Step 8: Output a great gene subset by using Steps 5-9 of Algorithm 2.
Experimental results and analysis
Experiment preparation
To verify the classification performance of the proposed LLENRS-GS algorithm, the simulation experiments are performed on five public gene expression data sets, which can be downloaded at http://bioinform-atics.rutgers. ed/Static/Supplemens/CompCancer/data- sets. The description of the five gene expression data sets is shown in Table 1.
Overview of the five gene expression data sets
Overview of the five gene expression data sets
The experimental operating system is Windows 7, Intel Core i55200U, 1.50 GHZ, and 4.0 GB memory. All simulation experiments are implemented in Matlab R2014a and Weka 3.8.
In this experiment, the LLE algorithm [13] and our improved LLE (ILLE) algorithm (Algorithm 1) are performed on the five gene expression data sets in Table 1. Four kinds of different classifiers including LibSVM, J48, Random Tree and Random SubSpace are run in Weka to test the dimensionality reduction results of the LLE and ILLE algorithms. Moreover, the classification accuracy of the dimensionality reduction results is verified with the 10-fold cross-validation method. The verification results of classification accuracy of LLE and ILLE on the four classifiers for the five gene expression data sets are shown in Table 2. Then, it can be seen from Table 2 that all of the classification accuracy of ILLE on the four classifiers for the five gene expression data sets are higher than that of LLE. Therefore, the ILLE algorithm owns the obvious advantage. That is, Algorithm 1 is effective.
Classification accuracy of LLE and ILLE on the four classifiers for the five gene expression data sets
Classification accuracy of LLE and ILLE on the four classifiers for the five gene expression data sets
Taking the Prostate Tumor data set as an example, for Algorithm 1, supposed that the number of neigh-bors is 8, the low-dimensional embedded dimension is 30, and then the preliminary dimension reduction can be achieved. Supposed that the parameter λ = 0.4, and Algorithm 2 is performed and the selected gene subset with 11 genes can be obtained. So, liking this above process, namely using the LLENRS-GS algorithm, the gene selection results of the five gene expression data sets are illustrated in Table 3.
Classification performance of related reduction algorithms
This portion of our experiments evaluates the performance of our proposed algorithm in terms of the number of selected genes, the running time, and the classification accuracy on the selected genes. Then, the classification performance of the LLENRS-GS algorithm is compared with those of the other six related reduction algorithms on the five gene expression data sets in Table 1. These methods include: (1) the LLE-based gene selection algorithm (LLE) [32], (2) the gene selection algorithm based on locally linear embedding and neighborhood rough sets (LLE-NRS) [18], (3) the principal component analysis-based gene selection algorithm (PCA) [7], (4) the PCA algorithm [7] combined with the NRS algorithm [23] (PCA + NRS), (5) the Relief-based feature selection (Relief) [33], and (6) the Relief algorithm [33] combined with the NRS algorithm [23] (Relief + NRS). The LibSVM, J48, Random Tree, and Random SubSpace classifiers in Weka are used to do some simulation experiments. By comparing our LLENRS-GS algorithms with the above six reduction algorithms, the classification results of the seven reduction algorithms on the five gene expression data sets are shown in Tables 4-8, respectively.
Table 4 shows that the LLENRS-GS obtains the highest classification accuracy and selects the least number of genes, though its time-consuming is larger than that of LLE, LLE-NRS, PCA and PCA + NRS. The PCA, PCA + NRS, Relief and Relief + NRS algorithms obtains the higher classification accuracy on the J48, Random Tree, and Random SubSpace classifiers than that on the LibSVM classifier. The LLENRS-GS not only achieves the highest classification accuracy in the J48, Random Tree and Random SubSpace classifiers, but also is sensitive to the LibSVM classifier. The classification accuracy is 98.0392%, which is about 47% higher than that of the PCA and Relief algorithms. The classification accuracy of LLENRS-GS is nearly 17% higher than that of LLE. Furthermore, the LLENRS-GS algorithm achieves the highest average classification accuracy for the selected Prostate Tumor genes. Therefore, our algorithm can effectively remove noises from the original Prostate Tumor data set.
Table 5 shows the specific comparison results of gene selection for Colon cancer with the seven reduction algorithms. As we can see, the classification accuracy of genes selected by LLENRS-GS is significantly higher than that of the other six algorithms. Especially in the LibSVM classifier, the classification accuracy is 10% higher than that of LLE, and 40% higher than that of PCA. Thus, the classification accuracy obtained by LLENRS-GS has been significantly improved. When the classification accuracy is verified in J48, Random Tree, and Random SubSpace, the precision of LLENRS-GS is the higher than the other algorithms. In addition, the other six algorithms can only maintain the relatively better accuracy in a certain classifier, while the accuracy in the other classifiers is reduced a lot and incapable to keep the stable precision. However, our algorithm can maintain the relatively great classification accuracy in the four classifiers. Moreover, LLENRS-GS obtains the highest average classification accuracy for the selected Colon Cancer genes. Thus, our algorithm can not only effectively remove noises from the Colon Cancer data set, but can also obviously improve the classification accuracy of the selected Colon Cancer genes.
Classification performance of the LLENRS-GS algorithm for the five gene expression data sets
Classification performance of the LLENRS-GS algorithm for the five gene expression data sets
Classification performance of the seven reduction algorithms for Prostate Tumor
Classification performance of the seven reduction algorithms for Colon Cancer
Table 6 describes the comparison results of gene selection for the Leukemia data set. As shown in Table 6, the LLENRS-GS algorithm obtains the least number of selected genes and achieves the highest classification accuracy on the LibSVM, Random Tree, and Random SubSpace classifiers for the selected Leukemia genes. Especially, in the LibSVM classifier, the classification accuracy reaches 100% without any error. The accuracy of LLENRS-GS is about 47% higher than that of PCA, PCA + NRS, Relief and Relief + NRS. Even it is nearly 8% higher in classification accuracy than that of LLE. In addition, the classification accuracy of LLENRS-GS is relatively stable and maintains the higher state. However, on the J48 classifier, the classification accuracy is slightly lower. The reason is that when the data processing is performed, the genes which are more suitable for J48 classifiers are likely to be deleted. Furthermore, the LLENRS-GS algorithm achieves the highest average classification accuracy for the selected Leukemia genes. Hence, it can be shown that our algorithm has better classification performance for the selected Leukemia gene data.
Classification performance of the seven reduction algorithms for Leukemia
Table 7 compares the classification of the Gastric Cancer data set. The LLENRS-GS algorithm achieves the smaller number of selected genes, and has less time-consuming and the highest accuracy of 80% in the LibSVM classifier, 85% in the Random Tree classifier, and 92.5% in the Random SubSpace classifier. Note that the accuracy 90% of our algorithm in the J48 classifier is lower than 92.5% of Relief. However, the Relief algorithm achieves the largest number of the selected genes. The reason is that when LLENRS-GS processes the gene data set, some noises of the gene data sets are not fully filtered, which reduces the classification ability of the selected gene subset, so this reduces the classification accuracy in the J48 classifier. Meanwhile, it can be seen from Table 7 that LLENRS-GS achieves the highest average classification accuracy for the selected Gastric Cancer genes.
Classification performance of the seven reduction algorithms for Gastric Cancer
Similar to the classification results in Tables 4 and 5, Table 8 shows that the LLENRS-GS algorithm obtains the highest classification accuracy, and achieves the smaller number of the selected genes. For example, the classification accuracy of LLENRS-GS has approximately 21%, 30%, 27%, and 35% higher than that of PCA in the four classifiers, respectively. It can be observed that LLENRS-GS is more sensitive to the LibSVM classifier than the other six algorithms, and then it has a strong adaptability. What’s more, the LLENRS-GS has the highest average classification accuracy for the selected Breast genes. Therefore, it can be concluded that our algorithm achieves the best classification performance for the Breast data set.
Classification performance of the seven reduction algorithms for Breast
Classification accuracy of the selected genes with the ten dimensionality reduction methods
To further verify the classification performance of our proposed method, ten methods are evaluated in terms of the number of selected genes and the classification accuracy on the selected genes. The LLENRS-GS algorithm is compared with the nine related dimensionality reduction methods, which include: (1) the original data processing method (ODP), (2) the Fisher score algorithm [34], (3) the Lasso algorithm [35], (4) the neighborhood rough set model (NRS) [23], (5) the gene selection algorithm based on the fisher linear discriminant and the neighborhood rough sets (FLD-NRS) [26], (6) the gene selection algorithm based on locally linear embedding and neighborhood rough sets (LLE-NRS) [18], (7) the Relief algorithm [33] combined with the NRS algorithm [23, 36] (Relief + NRS), (8) the fuzzy backward feature elimination (FBFE) [37], and (9) the binary differential evolution (BDE) [38]. The LibSVM classifier in Weka tool is used to do some simulation experiments. The classification accuracy of the selected genes is shown in Table 9.
According to the results of the classification accuracy in Table 9, the differences among the ten methods can be clearly identified. As for the classification accuracy, the LLENRS-GS algorithm obtains higher classification accuracy than the NRS algorithm. However, some genes with classification information also are deleted, which leads to low classification accuracy of the NRS. The three extended NRS methods (FLD-NRS, LLE-NRS and Relief + NRS) overcome this drawback to improve the classification accuracy. Compared with these methods, the LLENRS-GS algorithm achieves the highest classification accuracy for the Prostate Tumor, Colon Cancer and Leukemia data sets. Compared with the FBFE and BDE algorithms, the LLENRS-GS algorithm has slightly improved classification accuracy for all of the three data sets. Thus, our proposed approach can obviously reduce the dimension of gene expression data sets and outperforms the other nine related dimensionality reduction methods. In summary, our method is an efficient dimensionality reduction technique for high-dimensional gene expression data sets.
During the above experiments, the rough ordering of these nine methods with respect to time complexity is as follows: O(LLENRS-GS) = O(Fisher score) < O(FBFE) < O(BDE) < O(FLD-NRS) < O(Relief + NRS) < O(NRS) < O(LLE-NRS) < O(Lasso), where O(A) denotes the time complexity of A algorithm. For high-dimensional gene expression data, the Lasso algorithm has the highest time complexity, which is O (nm3) [35], where n denotes the number of samples, and m describes the number of genes. For the NRS algorithm and its extension forms, the time complexity is O (m2n + m2nlogn) for the LLE-NRS algorithm [18], and O (m2nlogn) for NRS algorithm [23]. Moreover, the Relief + NRS algorithm has the time complexity of O(mn + mnlogn) [23, 36], and the complexity of FLD-NRS is O(mnlogn) [26]. The time complexities of three extended NRS methods (FLD-NRS, LLE-NRS and Relief + NRS) are lower than that of NRS. Since the population initialization is the main process of the BDE algorithm, the complexity of BDE is close to O(nm) [38]. For FBFE, the time is mainly spent on evaluating the relevance of the genes using entropy, and its time complexity is no more than O(nm) [37]. The complexities of LLENRS-GS and Fisher score [34] are O (m) approximately and lower than those of the other seven algorithms. Although the Lasso algorithm has better classification accuracy, due to n ⪡ m in most cases, it has the much higher time complexity than our LLENRS-GS algorithm. Hence, these results demonstrate that our algorithm can effectively reduce the dimension of gene expression data sets, increase the classification accuracy, and expedite the classification process with less time consumption.
According to the abovementioned experimental results of the comparative analysis of our method with other schemes, it can be concluded that the gene subset selected by LLENRS-GS has high classification accuracy and is obviously superior to the other related reduction algorithms. At the same time, by comparing the classification accuracy in each classifier, it can be found that the classification accuracy of LLENRS-GS in each classifier has the high stability, so it has better compatibility in practical applications. What’s more, LLENRS-GS is more sensitive to the LibSVM classifiers than other algorithms, and has a short running time and high processing efficiency.
Conclusion
Identifying cancer-related genes is helpful for earlier cancer diagnosis and drug design. Gene selection is one of the important steps in cancer classification. In order to deal with gene expression data more effectively, this paper proposes a gene selection algorithm based on improved LLE and neighborhood rough sets. By considering the category information of samples, the neighborhood of samples is reconstructed to improve the traditional LLE algorithm, and the improved LLE-based dimensionality reduction algorithm is designed to treat the processed gene expression data sets with the initial dimensionality reduction and obtain the candidate gene subset. Then, the Lebesgue measure-based dependency degree and significance measure are presented in neighborhood decision systems. Finally, a heuristic relative reduction algorithm for gene selection is developed. The simulation results under the five gene expression data sets demonstrate that the gene subset selected by the proposed algorithm is highly representative, and our method has great classification performance. For Prostate Tumor, the genes selected by LLENRS-GS are the least and the average accuracy is 14.9485% -21.3235% higher than the other algorithms; on Colon Cancer, the number of genes selected by Relief + NRS is only less than that of our method, however, our average accuracy is 10% -29% higher than all the other methods; for Leukemia, the genes selected by LLENRS-GS is also the least and its average accuracy is 4.8611% -32.2917% higher than the others; on Gastric Cancer, the LLENRS-GS only selects one more gene than the LLE + NRS, whereas the average accuracy of our proposed algorithm is 1.875% -29.75% higher than the other algorithms; and for Breast, the genes selected by both LLE + NRS and Relief + NRS are less than that of LLENRS-GS, but our average accuracy is 16.6667% -30.4487% higher than all the other algorithms. Therefore, it can be proved that our algorithm can obtain a small, effective gene subset with higher classification accuracy, and outperforms other reduction methods.
Footnotes
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (No. 61772176, No. 61402153), the China Postdoctoral Science Foundation (No. 2016M602247), the Plan for Scientific Innovation Talent of Henan Province (No. 184100510003), the Key Scientific and Technological Project of Henan Province (No. 182102210362, No. 182102210078), the Young Scholar Program of Henan Province (No. 2017GGJS041), and the Natural Science Foundation of Henan Province (No. 182300410130, No. 182300410306, No. 182300410368).
