Improved LLE and neighborhood rough sets-based gene selection using Lebesgue measure for cancer classification on gene expression data

Abstract

Gene selection as an important data preprocessing technique for cancer classification is one of the most challenging issues in the field of microarray data analysis. In this paper, to deal with gene expression data more effectively, a locally linear embedding (LLE) and neighborhood rough sets-based gene selection method using Lebesgue measure for cancer classification is proposed. First, to solve the problems that the traditional LLE method cannot effectively identify category information, and is susceptible to noise pollution and other issues, the intra-class neighborhood is defined and a new method of calculating reconstruction weight is proposed by combining with the Euclidean distance to improve LLE. Then, the Lebesgue measure is introduced into neighborhood rough sets, a δ -neighborhood measure is defined, and the dependency degree and the significance measure are presented in neighborhood decision systems. Finally, an improved LLE and neighborhood rough sets-based gene selection algorithm is designed, where the improved LLE algorithm is used to reduce the initial dimensions of gene expression data and obtain a candidate gene subset, and the Lebesgue measure and dependency degree-based relative reduction for gene expression data is developed to further screen the candidate subset to select the final gene subset. The experimental results under several public gene expression data sets prove that the proposed method is effective for selecting the most relevant genes with high classification accuracy.

Keywords

Rough sets, neighborhood rough sets, gene selection, locally linear embedding, cancer classification

1. Introduction

Microarray techniques have been used to delineate cancer groups and to identify candidate genes for cancer prognosis, and because such problems can be viewed as classification problems, various classification methods have been applied to analyze and interpret gene expression data [1, 2]. However, because microarray data contain few effective samples relative to the large number of genes, many computational methods fail to identify a small subset of important genes [3, 4]. In general, cancer classification using microarray data involves data acquisition, pre-processing, gene selection and classification [5]. The classification performance obtained through these steps is then evaluated, and gene selection is an important part of microarray data analysis. The aim of gene selection is to reduce the dimensionality of microarray data in order to improve classification accuracy [6]. Feature selection methods are broadly divided into four categories: filter, wrapper, embedded, and hybrid approaches [7–12]. Filter methods are independent of the classifier and have been widely used because they are fast and can handle large data sets, but they are easily trapped in local optima. The wrapper approach incorporates a given learning model, but it suffers from high computational cost, especially for high-dimensional microarray data sets. The main advantage of the embedded approach is its interaction with the learning model, but training a given classifier on the full gene set is time-consuming. The major disadvantage of the hybrid approach is that the filter and wrapper components are not truly integrated with each other, which leads to lower classification performance. Our gene selection method is based on the filter approach, in which a heuristic search algorithm is used to find an optimal gene subset with neighborhood rough sets for gene expression data sets.

The pre-processing step of tumor gene selection is dimensionality reduction of the data sets. Locally linear embedding (LLE) is an efficient way to reduce the dimension [13–16]. Vanderplas and Connolly [13] introduced LLE to the astronomical community as a classification technique. Liu et al. [16] used the unsupervised LLE learning algorithm to transform multivariate MRI data of regional brain volume and cortical thickness to a locally linear space with fewer dimensions. Su et al. [17] developed a fault diagnosis method based on incremental enhanced supervised LLE and an adaptive nearest neighbor classifier to improve the accuracy of machinery fault diagnosis. Sun et al. [18] presented a gene selection method based on LLE and neighborhood rough sets for gene expression data classification. Xu et al. [19] combined LLE and the correlation coefficient to classify microarray data. Although the computational complexity of LLE is low, LLE is unsupervised and cannot identify the categories of data. In order to obtain fewer genes with higher classification ability while reducing time complexity, this paper focuses on improving the classical LLE algorithm.

When evaluating a gene selection method, in addition to the predictive ability of gene subsets, two other important aspects need to be considered: the stability of the selected genes and the computational cost [17, 18]. Thus, gene subsets with low dimensionality and high classification ability should be selected from gene expression profiles. In this paper, neighborhood rough sets [20–23] are introduced to handle continuous numerical data, which avoids information loss and improves the classification accuracy of gene subsets. Hu et al. [23] proposed a technique for heterogeneous feature subset selection based on neighborhood rough sets. Meng et al. [24] presented a neighborhood system to deal directly with the information table formed by integrating gene expression data with biological knowledge. Liu et al. [25] studied a quick attribute reduction algorithm for neighborhood rough sets. Sun et al. [26] proposed a gene selection approach to improve the classification performance of microarray data based on Fisher linear discriminant and neighborhood rough sets. Chen et al. [27] proposed a gene selection method for tumor classification using neighborhood rough sets and entropy measures. Mu et al. [28] designed a feature selection method based on improved Fisher discriminant analysis and neighborhood rough sets. However, when dealing with high-dimensional data, neighborhood rough sets still suffer from high time and space complexity.

In order to decrease time complexity and improve the classification accuracy of the selected gene subset for cancer classification, a gene selection method based on improved LLE and neighborhood rough sets using Lebesgue measure is proposed. First, the LLE algorithm is improved by introducing the concept of intra-class neighborhood and redefining the reconstruction weight. This pre-processing step makes the reduced data contain more effective classification information. Then, on the basis of neighborhoods, a gene expression data set is granulated by neighborhood parameters, the Lebesgue measure is introduced to develop a new dependency degree in neighborhood rough sets, and the method of calculating the significance measure is improved. Finally, LLE and neighborhood rough sets are combined to build an effective cancer gene selection and classification model, which can improve the classification ability of gene subsets. The numerical experiments show that the number of genes selected by our algorithm is the smallest on the Prostate and Leukemia data sets, and is second only to two extended neighborhood rough set methods on the Colon and Gastric data sets. Moreover, our average classification accuracy is the highest on all the data sets. Therefore, the proposed method performs well for gene expression data classification.

The article is mainly composed of the following parts. Section 2 recalls some basic knowledge of LLE and neighborhood rough sets. In Section 3, LLE is improved and neighborhood rough set model with Lebesgue measure is investigated. Section 4 analyzes the simulation results of gene expression data sets. Section 5 is a summary of this article.

2. Preliminaries

In this section, we briefly review some basic concepts about LLE and neighborhood rough sets. These notations have been given in [13–27].

2.1. Locally linear embedding

The LLE is an unsupervised nonlinear dimensionality reduction method for mapping high-dimensional data nonlinearly to a lower-dimensional space [18]. Its basic idea is that of global minimisation of the reconstruction error of the set of all local neighbors in the data set. The LLE algorithm has some attractive characteristics: it does not require an iterative algorithm and just a few parameters need to be set [14]. A feature mapping is established as follows: the low-dimensional embedding maintains the same local neighborhood relationship in high-dimensional space. Thus, the LLE can obtain the low-dimensional embedding from the nearest neighbor graph in high-dimensional space under certain conditions [19].

As input, LLE maps a data set of N D-dimensional vectors assembled in a matrix X of size D × N, i.e., X = {x₁, x₂, ⋯ , x_N}. Its output is a set of N M-dimensional vectors in a matrix Y of size M × N, i.e., Y = {y₁, y₂, ⋯ , y_N}, where M ⪡ D, and the k^th column vector of Y corresponds to the k^th column vector of X. Assuming the data lies on a nonlinear manifold which locally can be approximated linearly, it employs three stages as follows: (I) locally fitting hyperplanes around each sample x_i, based on its k nearest neighbors, (II) calculating reconstruction weights, and (III) finding lower-dimensional coordinates y_i for each x_i, by minimising a mapping function based on these weights.

In stage I, each sample x_i is approximated by a weighted linear combination of its k nearest neighbors, making use of the assumption that neighboring samples lie on a locally linear patch of the nonlinear manifold. The adjacent points of each sample are therefore obtained by finding its k nearest neighbors.

In stage II, the weights w_ij that best linearly reconstruct each x_i from its neighbors are computed by solving a constrained least squares problem. Assume that each sample x_i in the high-dimensional space can be linearly represented by the samples in its local neighborhood. Then, the reconstruction weights are obtained by minimizing the sum of squared local reconstruction errors, i.e., the following cost function: $ɛ (W) = \sum_{i = 1}^{N} | x_{i} - \sum_{j = 1}^{k} w_{ij} x_{j} |^{2},$ (1) where the reconstruction weight matrix W is an N × N sparse matrix, the weights w_ij of each vector x_i sum to 1, and w_ij = 0 if x_j is not a neighbor of x_i. It follows that the matrix W is calculated by least squares.
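As an illustration of stage II, the following Python sketch computes the weights of a single sample by solving the local least squares problem: under the sum-to-one constraint, the minimizer is proportional to the solution of C w = 1, where C is the local Gram matrix of the differences between the sample and its neighbors. It is a minimal sketch rather than the authors' implementation. The function names `solve` and `lle_weights`, and the small regularization term `reg` that keeps C invertible, are assumptions that do not appear in the text.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lle_weights(x, neighbors, reg=1e-3):
    """Weights (summing to 1) that minimise |x - sum_j w_j * neighbors[j]|^2."""
    k = len(neighbors)
    # Differences between the sample and each of its neighbors.
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    # Local Gram matrix C[j][l] = (x - n_j) . (x - n_l).
    C = [[sum(a * b for a, b in zip(diffs[j], diffs[l])) for l in range(k)]
         for j in range(k)]
    # Regularisation (an assumption, not in the text) keeps C invertible.
    trace = sum(C[j][j] for j in range(k))
    for j in range(k):
        C[j][j] += (reg * trace if trace > 0 else reg)
    w = solve(C, [1.0] * k)
    total = sum(w)
    return [wj / total for wj in w]  # enforce the sum-to-one constraint


if __name__ == "__main__":
    # A point halfway between its two neighbors is reconstructed with equal weights.
    print(lle_weights([0.5, 0.0], [[0.0, 0.0], [1.0, 0.0]]))
```

Repeating this for every sample and placing the k weights of sample x_i in row i (zeros elsewhere) gives the sparse N × N matrix W.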

In stage III, the weights w_ij are fixed and new M-dimensional vectors y_i are sought that minimise the following embedding cost function: $ɛ (Y) = \sum_{i = 1}^{N} | y_{i} - \sum_{j = 1}^{k} w_{ij} y_{j} |^{2},$ (2) where $\sum_{i = 1}^{N} y_{i} = 0$ , $\frac{\sum_{i = 1}^{N} y_{i} y_{i}^{T}}{N} = I$ , and I is a d × d unit matrix (d < M).

2.2. Neighborhood rough sets

The neighborhood rough set model addresses the problem that classical rough sets cannot handle continuous numerical data [23]. In gene expression data sets, the measured gene expression levels and pharmaceutical tests are represented by continuous-valued data of different magnitudes [12]. By utilizing neighborhood rough sets, the discretization of continuous data can be avoided. Note that a gene expression data set can be described by a neighborhood decision system, where a sample is an object, a gene corresponds to a conditional attribute, and a subclass of cancer corresponds to a decision attribute.

Given a neighborhood decision system NS = (U, C, D, V, f, Δ, δ), U = {x₁, x₂, ⋯ , x_n} is a sample set, and C = {a₁, a₂, ⋯ , a_m} is a set of all attributes, while D is a decision attribute set. V = ⋃ _a∈{C∪D}V_a, where V_a is a value set of attribute a. f : U × {C ∪ D} → V is a map function and f (a, x) represents the value of x on attribute a ∈ C ∪ D. Δ → [0, ∞) is a distance function, and δ is a neighborhood parameter, where 0 ≤ δ ≤ 1. In the following, NS = (U, C, D, V, f, Δ, δ) is simply noted by NS = (U, C, D, δ).

Since the Euclidean distance function effectively reflects the basic information of the unknown data [4, 29], it is introduced into this paper, expressed as $Δ_{B} (x, y) = \sqrt{\sum_{k = 1}^{| B |} | f (a_{k}, x) - f (a_{k}, y) |^{2}},$ (3) where |B| is the cardinality of subset B.

Given a neighborhood decision system NS = (U, C, D, δ) and a distance function Δ → [0, ∞), for any B ⊆ C and δ ∈ [0, 1], the similarity relation induced by the subset B is described as ${NR}_{δ} (B) = \{(x, y) \in U \times U \mid Δ_{B} (x, y) \leq δ\} .$ (4) For any x ∈ U, the neighborhood class of x with respect to B is denoted by $n_{B}^{δ} (x) = \{y \mid x, y \in U, Δ_{B} (x, y) \leq δ\} .$ (5)

Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, and U/D = {D₁, D₂, ⋯ , D_N}, then the neighborhood lower approximation set and upper approximation set of D with respect to B are described respectively as ${\underline{N}}_{B} (D) = \cup_{i = 1}^{N} {\underline{N}}_{B} (D_{i})$ , and ${\bar{N}}_{B} (D) = \cup_{i = 1}^{N} {\bar{N}}_{B} (D_{i})$ . It follows that ${POS}_{B} (D) = {\underline{N}}_{B} (D)$ describes the positive domain of a neighborhood decision system, and ${NEG}_{B} (D) = U - {\bar{N}}_{B} (D)$ represents the negative domain of the neighborhood decision system. Then, the dependency degree of D to B is denoted by $γ_{B} (D) = \frac{| {POS}_{B} (D) |}{| U |} .$ (6) Obviously, 0 ≤ γ_B (D) ≤ 1.
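To make Eqs. (3), (5) and (6) concrete, the following Python sketch computes the δ-neighborhood of each sample and the dependency degree γ_B (D) on a toy data set. It assumes the usual lower approximation, in which a sample belongs to the positive domain when its whole neighborhood lies inside its own decision class. The function names and the toy data are illustrative assumptions, not part of the original method description.

```python
import math


def distance(x, y, B):
    """Euclidean distance of Eq. (3), restricted to the attribute indices in B."""
    return math.sqrt(sum((x[a] - y[a]) ** 2 for a in B))


def neighborhood(i, samples, B, delta):
    """Indices of the samples within distance delta of sample i, as in Eq. (5)."""
    return {j for j, y in enumerate(samples)
            if distance(samples[i], y, B) <= delta}


def positive_region(samples, labels, B, delta):
    """Samples whose whole delta-neighborhood shares their decision class."""
    pos = set()
    for i in range(len(samples)):
        if all(labels[j] == labels[i]
               for j in neighborhood(i, samples, B, delta)):
            pos.add(i)
    return pos


def dependency(samples, labels, B, delta):
    """Dependency degree of Eq. (6): |POS_B(D)| / |U|."""
    return len(positive_region(samples, labels, B, delta)) / len(samples)


if __name__ == "__main__":
    # Hypothetical toy data: 4 samples, 2 genes, 2 classes.
    samples = [[0.1, 0.9], [0.15, 0.2], [0.8, 0.5], [0.85, 0.1]]
    labels = [0, 0, 1, 1]
    print(dependency(samples, labels, [0], 0.1))   # gene 0 separates the classes
    print(dependency(samples, labels, [1], 0.15))  # gene 1 mixes two samples
```

In this toy example the first gene alone gives a dependency degree of 1, whereas the second gene places two samples of different classes in each other's neighborhood and reaches only 0.5.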

3. Proposed gene selection method

3.1. Improved locally linear embedding

The traditional LLE algorithm has some deficiencies [13]. For example, it is assumed that the data is evenly and densely sampled in the LLE algorithm, while for those data polluted by noise, sparsely sampled and with larger curvature, the low-dimensional embedding result will be seriously damaged, and the original manifold learning algorithms are all unsupervised. Hence, the LLE algorithm does not make the best use of category information and the result of dimensionality reduction is not conducive to the later identification and classification. Furthermore, LLE is a kind of expression without explicit mapping, so the ability to generate new samples is weak. Therefore, by considering the category information of samples, the neighborhood of samples can be reconstructed to solve this defect of the LLE algorithm.

Definition 1. The intra-class neighborhood of a sample x is defined as the k samples that are closest to x and have the same class label as x; the neighborhood distance between x and any sample with a different class label is taken to be infinite.
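Definition 1 can be sketched in a few lines of Python. Treating the distance to samples of other classes as infinite is equivalent to excluding them before the nearest-neighbor search, which is what the hypothetical helper `intra_class_neighbors` below does. If a class has fewer than k other members, the sketch simply returns all of them, a choice the text does not specify.

```python
import math


def intra_class_neighbors(i, samples, labels, k):
    """Indices of the k nearest samples to sample i that share its class label."""
    same = [j for j in range(len(samples))
            if j != i and labels[j] == labels[i]]
    # Sort same-class candidates by Euclidean distance to sample i.
    same.sort(key=lambda j: math.dist(samples[i], samples[j]))
    return same[:k]


if __name__ == "__main__":
    # Sample 2 is the closest point to sample 0 but has a different label.
    samples = [[0.0, 0.0], [1.0, 0.0], [0.1, 0.0], [3.0, 0.0], [0.2, 0.0]]
    labels = [0, 0, 1, 0, 1]
    print(intra_class_neighbors(0, samples, labels, 2))
```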

Definition 2. If the Geodesic distance is used to describe the structural information among data, then a new method of calculating reconstruction weight by combining the Euclidean distance can be denoted by $w_{ij} = w_{ij}^{S} \times w_{ij}^{D},$ (7) where w_ij is a kind of reconstruction weight, $w_{ij}^{S}$ is a structural weight and $w_{ij}^{S} = \frac{D_{G} (x_{i}, x_{ij})}{\sum_{j = 1}^{k} D_{G} (x_{i}, x_{ij})}$ , D_G (x_i, x_ij) is the Geodesic distance between the sample x_i and the neighborhood sample x_ij of x_i, $w_{ij}^{D}$ is a distance weight and $w_{ij}^{D} = \frac{D_{E} (x_{i}, x_{ij})}{\sum_{j = 1}^{k} D_{E} (x_{i}, x_{ij})}$ , and D_E (x_i, x_ij) is the Euclidean distance between the sample x_i and the neighborhood sample x_ij of x_i. In this paper, the Dijkstra algorithm [30] is introduced to calculate the Geodesic distance.
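The weight of Eq. (7) can be sketched as follows. The text does not say how the graph for the Geodesic distance is built, so the sketch takes the graph as an input (a dictionary mapping each node to its neighbors and edge lengths) and assumes it is connected; this input format and the function names are assumptions. The Dijkstra routine uses a binary heap from the Python standard library.

```python
import heapq
import math


def dijkstra(graph, source):
    """Shortest-path lengths from source; graph maps node -> {neighbor: length}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, length in graph[u].items():
            nd = d + length
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def improved_weights(i, neighbor_ids, samples, graph):
    """Eq. (7): product of the normalised Geodesic and Euclidean distances."""
    geo = dijkstra(graph, i)
    d_g = [geo[j] for j in neighbor_ids]  # Geodesic distances D_G
    d_e = [math.dist(samples[i], samples[j]) for j in neighbor_ids]  # D_E
    sum_g, sum_e = sum(d_g), sum(d_e)
    return [(g / sum_g) * (e / sum_e) for g, e in zip(d_g, d_e)]


if __name__ == "__main__":
    # Hypothetical example: sample 2 is reachable from sample 0 only via sample 1.
    samples = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    graph = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
    print(improved_weights(0, [1, 2], samples, graph))
```

Note that the products computed exactly as written in Eq. (7) do not in general sum to 1 over the neighbors of a sample, so any sum-to-one normalization would have to be applied as a separate step.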

The specific steps of improved LLE-based dimensionality reduction algorithm are as follows:

Algorithm 1.

Input: A data set of N D-dimensional vectors in a matrix X of size D × N, i.e., X = {x₁, x₂, ⋯ , x_N}.

Output: A data set of N M-dimensional vectors in a matrix Y of size M × N, i.e., Y = {y₁, y₂, ⋯ , y_N}.

Step 1: Let X = {x₁, x₂, ⋯ , x_N} be a given data set of N points, where x_i ∈ X. Select the intra-class neighborhood of the samples, calculate the distance between samples with the Euclidean distance $d_{ij} = \sqrt{\sum_{k = 1}^{D} {| x_{ik} - x_{jk} |}^{2}}$ , where i, j = 1, 2, ⋯, N, and find the reconstruction neighborhood formed by the k nearest neighbors of each data point. Choosing the neighborhood size k is difficult. It is well known that if k is too small, the topology of the samples may be changed and the information carried by the data set cannot be correctly reflected, whereas if k is too large, the smoothness of the entire data set may be affected and the spatial structure may disappear.

Step 2: Calculate the reconstruction weight matrix W of sample neighborhood. Assume that each sample x_i in a high-dimensional space can be linearly represented by the samples in its local neighborhood, and the new reconstruction weight w_ij can be calculated by Equation (7). Then, from the minimised cost function $ɛ (W) = \sum_{i = 1}^{N} | x_{i} - \sum_{j = 1}^{k} w_{ij} x_{j} |^{2}$ , the reconstruction weights can be obtained from minimizing the quadratic sum of local reconstruction deviation of x_i. When the N samples in X are traversed, a reconstruction weight matrix W = [w_ij]_N ×N can be constructed.

Step 3: Calculate the optimal low-dimensional embedded matrix Y, select an appropriate embedded dimension M and obtain the best low-dimensional embedded matrix Y by the following function: $min Φ = min \sum_{i = 1}^{N} {| y_{i} - \sum_{x_{j} \in N_{i}} w_{ij} y_{j} |}^{2}$ , and then $ɛ (Y) = \sum_{i = 1}^{N} | y_{i} - \sum_{j = 1}^{k} w_{ij} y_{j} |^{2}$ . Thus, one has a data set of N M-dimensional vectors, i.e., Y = {y₁, y₂, ⋯ , y_N}.

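Since $\sum_{i} | a_{i} |^{2} = \sum_{i} a_{i}^{T} a_{i} = trace (A^{T} A) = trace (A A^{T})$ for a matrix A with columns a_i, it can be concluded that $min Φ = min \sum_{i = 1}^{N} {| y_{i} - \sum_{x_{j} \in N_{i}} w_{ij} y_{j} |}^{2} = min \sum_{i = 1}^{N} | Y I_{i} - Y w_{i} |^{2} = min | Y (I - W^{T}) |^{2} = min \, trace (Y (I - W)^{T} (I - W) Y^{T}) = min \, trace (Y M Y^{T}),$ where I_i is the i^th column of the N × N identity matrix I and w_i is the i^th row of W written as a column vector. Hence, M = (I - W) ^T (I - W) holds (here M denotes an N × N matrix, not the embedded dimension).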
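The identity between the embedding cost and the trace form can be checked numerically. The sketch below evaluates the cost Φ directly from its definition and again as trace (Y M Y^T) with M = (I - W) ^T (I - W), using small hand-picked matrices. The helper names and the example values are assumptions chosen only for illustration; Y is stored as a list of rows, so that y_i is its i^th column.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def embedding_cost(Y, W):
    """Phi = sum_i |y_i - sum_j w_ij y_j|^2, with y_i the i-th column of Y."""
    n = len(W)
    cols = transpose(Y)  # cols[i] is the vector y_i
    total = 0.0
    for i in range(n):
        recon = [sum(W[i][j] * cols[j][d] for j in range(n))
                 for d in range(len(Y))]
        total += sum((a - b) ** 2 for a, b in zip(cols[i], recon))
    return total


def trace_form(Y, W):
    """trace(Y M Y^T) with M = (I - W)^T (I - W)."""
    n = len(W)
    I_minus_W = [[(1.0 if i == j else 0.0) - W[i][j] for j in range(n)]
                 for i in range(n)]
    M = matmul(transpose(I_minus_W), I_minus_W)
    YMYt = matmul(matmul(Y, M), transpose(Y))
    return sum(YMYt[d][d] for d in range(len(Y)))


if __name__ == "__main__":
    W = [[0.0, 0.5, 0.5], [0.3, 0.0, 0.7], [1.0, 0.0, 0.0]]
    Y = [[1.0, 2.0, 4.0], [0.0, -1.0, 3.0]]
    print(embedding_cost(Y, W), trace_form(Y, W))
```

Both expressions agree up to rounding error, which is consistent with taking M = (I - W) ^T (I - W) when w_ij is the weight of x_j in the reconstruction of x_i.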

In short, solving for Y amounts to finding eigenvectors of the matrix M, which is sparse, symmetric and positive semidefinite. Here, the Lagrange multiplier method is used: the Lagrangian is $L (Y) = Y M Y^{T} - λ (Y Y^{T} - N I)$ , and setting its partial derivative with respect to Y to zero gives $\frac{\partial L}{\partial Y} = 2 M Y^{T} - 2 λ Y^{T} = 0$ , i.e., M Y^T = λ Y^T. Finally, the eigenvectors of M corresponding to the d smallest eigenvalues are combined to form the low-dimensional embedded sample matrix Y. In general, the process of the improved LLE can be summarized as X → W → Y.

3.2. Lebesgue measure and dependency degree-based relative reduction

Let E be any point set in Rⁿ. For each sequence of open balls {I_i} covering E, the sum of their volumes is μ = ∑_i |I_i|. The set of all such μ is bounded below, and its greatest lower bound (determined completely by E) is called the Lebesgue outer measure of E [31], which is described as $m^{*} (E) = \inf_{E \subset \cup I_{i}} \sum_{i} | I_{i} | .$ (8) Then, the Lebesgue inner measure of E is m_∗ (E) = |I| - m^∗ (I - E), where I is an open ball containing E. If m_∗ (E) = m^∗ (E), then E is measurable and its measure is denoted by m (E). When the Lebesgue measure of U is 0, it is represented by the cardinality of U. Here, m (X) is used uniformly to describe the Lebesgue measure of X ⊆ U in this paper.
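As a short worked illustration of Eq. (8) (a standard measure-theoretic fact, added here for orientation and not taken from the source), consider a finite set E = {p₁, p₂, ⋯ , p_K} in Rⁿ, where K is a hypothetical number of points. For any ε > 0, each p_i can be covered by an open ball I_i of volume |I_i| = ε/2^i, so that $m^{*} (E) \leq \sum_{i = 1}^{K} | I_{i} | = \sum_{i = 1}^{K} \frac{ε}{2^{i}} < ε .$ Since ε is arbitrary, m^∗ (E) = 0. Every finite sample set therefore has Lebesgue measure zero, which suggests why the cardinality is used as the measure of a finite universe U.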

Definition 3. Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, Δ (x, y) is the Euclidean distance function between two objects, the neighborhood radius parameter δ ≥ 0, and under the δ-neighborhood relationship between objects, a δ-neighborhood measure for B on any object x, y ∈ U based on Lebesgue measure is defined as $m (n_{B}^{δ} (x)) = m (\{y \mid x, y \in U, Δ_{B} (x, y) \leq δ\}) .$ (9)

Definition 4. Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, U/D = {D₁, D₂, ⋯ , D_N}, and for any X ⊆ U, the neighborhood lower approximation set and the neighborhood upper approximation set of D on B based on Lebesgue measure are defined respectively as $m ({\underline{N}}_{B} (D)) = m (\cup_{i = 1}^{N} {\underline{N}}_{B} (D_{i})),$ (10)

$m ({\bar{N}}_{B} (D)) = m (\cup_{i = 1}^{N} {\bar{N}}_{B} (D_{i})),$ (11) where ${\underline{N}}_{B} (D_{i}) = \{x \in U \mid n_{B}^{δ} (x) \subseteq D_{i}\}$ , ${\bar{N}}_{B} (D_{i}) = \{x \in U \mid n_{B}^{δ} (x) \cap D_{i} \neq \emptyset\}$ , and i = 1, 2, ⋯ , N.

In a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, $m ({POS}_{B} (D)) = m ({\underline{N}}_{B} (D))$ describes the Lebesgue measure of the positive domain.

Definition 5. Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, the dependency degree of B with respect to D based on Lebesgue measure is defined as $K (B, D) = \frac{m ({POS}_{B} (D))}{m (U)} = \frac{m ({\underline{N}}_{B} (D))}{m (U)} .$ (12)

Definition 6. (Internal significance) Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, for any attribute a ∈ B, the significance measure of a in B with respect to D is defined as ${SIG}^{inner} (a, B, D) = K (B, D) - K (B - \{a\}, D) .$ (13)

Definition 7. Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C and any a ∈ B, the attribute a is called redundant in B with respect to D if K (B, D) = K (B - {a}, D); otherwise, i.e., if K (B, D) > K (B - {a}, D), the attribute a is indispensable in B with respect to D. B is called independent if every attribute in B is indispensable with respect to D. B is called a relative reduct of C with respect to D if it satisfies the following two conditions:

(1) K (B, D) = K (C, D),

(2) K (B, D) > K (B - {a}, D) for any a ∈ B.

Obviously, a relative reduct of C with respect to D is the minimal attribute subset to retain the dependency degree of C with respect to D.

Definition 8. (External significance) Given a neighborhood decision system NS = (U, C, D, δ) with B ⊆ C, for any attribute a ∈ C - B, the significance measure of a with respect to D is defined as ${SIG}^{outer} (a, B, D) = K (B \cup \{a\}, D) - K (B, D) .$ (14) When B = ∅, SIG^outer (a, B, D) = K ({a}, D). From Definition 8, the significance of attribute a is the increment of the distinguishing information after adding a into B. The larger the value of SIG^outer (a, B, D) is, the more important the attribute a is for B with respect to D.

The specific steps of Lebesgue measure and dependency degree-based relative reduction algorithm for gene expression data are as follows.

Algorithm 2.

Input: A gene data set, as a neighborhood decision system NS = (U, C, D, δ), and a lower limit λ.

Output: An optimal gene subset.

Step 1: Use min-max standardization $\frac{x_{ij} - min}{max - min}$ to standardize the given gene data set, where x_ij is the gene value of the original data. A matrix A_i×t is obtained, where i is the number of samples and t is the number of genes.

Step 2: Select all the gene columns to make up a gene set S_A except for the decision in A_i×t.

Step 3: Initialize a reduct set red =∅.

Step 4: Calculate the dependency degree K({a_i}, D) for all a_i in S_A by Eq. (12), and get the significance measure SIG using Eqs. (13) and (14), where a_i (i = 1, 2, ⋯ , N) represents the gene column in S_A.

Step 5: If SIG= 0, then delete the redundant a_i.

Step 6: Calculate the SIG of the gene a_i which satisfies max{a_i ∈ S_A|K (C, D)}.

Step 7: If SIG > λ, then let red = red ∪ {a_k} and S = S ∪ POS_k, where a_k is the gene found in Step 6, and return to Step 4; otherwise, go to Step 8.

Step 8: Find the gene columns in A_i×t that correspond to red, and obtain an optimal gene subset.

Step 9: Return an optimal gene subset.
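One possible reading of the steps above is a greedy forward selection, sketched below in Python. The sketch rests on several assumptions that are not fixed by the text: the counting measure stands in for the Lebesgue measure m (·) on a finite sample set, K (∅, D) is taken as 0, a gene is added only while its outer significance (Eq. (14)) exceeds the lower limit λ (as the description in Section 3.3 suggests), and the explicit deletion of zero-significance genes in Step 5 is left implicit because such genes are never selected. All function names are hypothetical.

```python
import math


def min_max(column):
    """Step 1: min-max standardisation of one gene column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]


def standardize(samples):
    """Apply min_max to every gene (column) of a samples-by-genes table."""
    cols = [min_max(list(c)) for c in zip(*samples)]
    return [list(row) for row in zip(*cols)]


def dependency(samples, labels, B, delta):
    """K(B, D) of Eq. (12), with counting measure in place of m(.)."""
    if not B:
        return 0.0  # assumption: an empty gene set carries no information
    pos = 0
    for x, lx in zip(samples, labels):
        # x is in the positive domain if no sample of another class
        # lies inside its delta-neighborhood.
        pure = all(
            ly == lx
            or math.sqrt(sum((x[a] - y[a]) ** 2 for a in B)) > delta
            for y, ly in zip(samples, labels)
        )
        pos += 1 if pure else 0
    return pos / len(samples)


def greedy_reduct(samples, labels, delta, lam):
    """Add the gene with the largest outer significance while it exceeds lam."""
    genes = range(len(samples[0]))
    red, current = [], 0.0
    while True:
        best_gene, best_sig = None, 0.0
        for a in genes:
            if a in red:
                continue
            sig = dependency(samples, labels, red + [a], delta) - current
            if sig > best_sig:
                best_gene, best_sig = a, sig
        if best_gene is None or best_sig <= lam:
            return red
        red.append(best_gene)
        current += best_sig


if __name__ == "__main__":
    # Hypothetical toy data: gene 0 separates the two classes, gene 1 does not.
    raw = [[0.1, 0.9], [0.15, 0.2], [0.8, 0.5], [0.85, 0.1]]
    labels = [0, 0, 1, 1]
    print(greedy_reduct(standardize(raw), labels, delta=0.2, lam=0.1))
```

On the toy data the first gene already reaches a dependency degree of 1, so the second gene adds no significance and the returned subset contains only the first gene.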

3.3. Improved LLE and neighborhood rough sets-based gene selection algorithm

In this subsection, the improved LLE algorithm is used to select genes from the initial gene expression data sets. The reconstruction weight matrix of the data sets is obtained from the intra-class neighborhood, and then, to screen genes with a certain distinguishing ability, the low-dimensional embedded sample matrix is calculated by Algorithm 1 to obtain the candidate gene subsets. Algorithm 2 is then employed to select genes from the candidate gene subsets: the candidate subsets are normalized, the maximum dependency degree and the significance measure of each gene are calculated, and each gene is judged to be redundant or not. Genes whose significance measure is bigger than λ are grouped into a subset, from which the best genes are selected. Thus, the steps of the improved LLE and neighborhood rough sets-based gene selection (LLENRS-GS) algorithm are as follows.

Algorithm 3.

Input: A gene data set, as a neighborhood decision system NS = (U, C, D, δ), and a lower limit λ.

Output: A selected gene subset.

Step 1: Select the intra-class neighborhood of the gene expression data sets with Step 1 of Algorithm 1.

Step 2: Calculate the reconstruction weight matrix W with Step 2 of Algorithm 1.

Step 3: Calculate the low-dimensional embedded matrix Y, and obtain the preliminary reduced dimension of gene subsets with Step 3 of Algorithm 1.

Step 4: Standardize the gene subset and generate a matrix A by Step 1 of Algorithm 2.

Step 5: Group all the gene columns in the matrix A into a gene set S_A with Step 2 of Algorithm 2.

Step 6: Initialize a reduct set red =∅.

Step 7: Calculate the dependency degree of a_i and the SIG, where any a_i ∈ A_i×t - red in A_i×t.

Step 8: Output the selected gene subset by using Steps 5-9 of Algorithm 2.

4. Experimental results and analysis

4.1. Experiment preparation

To verify the classification performance of the proposed LLENRS-GS algorithm, simulation experiments are performed on five public gene expression data sets, which can be downloaded at http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/datasets. The description of the five gene expression data sets is shown in Table 1.

Table 1
Overview of the five gene expression data sets

Data set	Genes	Samples	Classes
Prostate Tumor	10509	102	2
Colon Cancer	2000	50	2
Leukemia	5327	72	2
Gastric Cancer	1519	40	2
Breast	24482	78	2

The experiments are run on Windows 7 with an Intel Core i5-5200U CPU at 1.50 GHz and 4.0 GB of memory. All simulation experiments are implemented in Matlab R2014a and Weka 3.8.

4.2. Comparison of LLE and improved LLE

In this experiment, the LLE algorithm [13] and our improved LLE (ILLE) algorithm (Algorithm 1) are run on the five gene expression data sets in Table 1. Four different classifiers, namely LibSVM, J48, Random Tree and Random SubSpace, are run in Weka to test the dimensionality reduction results of the LLE and ILLE algorithms, and the classification accuracy is verified with 10-fold cross-validation. The classification accuracy of LLE and ILLE on the four classifiers for the five gene expression data sets is shown in Table 2. It can be seen from Table 2 that the classification accuracy of ILLE is higher than that of LLE on all four classifiers and all five data sets. Therefore, the ILLE algorithm has a clear advantage; that is, Algorithm 1 is effective.

Table 2
Classification accuracy of LLE and ILLE on the four classifiers for the five gene expression data sets

Data set	LibSVM (LLE)	LibSVM (ILLE)	J48 (LLE)	J48 (ILLE)	Random Tree (LLE)	Random Tree (ILLE)	Random SubSpace (LLE)	Random SubSpace (ILLE)
Prostate Tumor	81.3825%	98.0392%	73.5294%	84.3137%	71.5686%	89.2157%	71.5686%	92.1569%
Colon Cancer	86%	92%	76%	90%	68%	72%	78%	88%
Leukemia	93.0556%	100%	80.5556%	84.7222%	83.3333%	90.2778%	87.5%	95.8333%
Gastric Cancer	50%	92.5%	70%	90%	50%	82.5%	62.5%	92.5%
Breast	66.6667%	87.1795%	51.2821%	76.9231%	55.1282%	74.359%	56.4103%	82.0513%
Average	75.421%	93.9437%	70.2734%	85.1918%	65.606%	81.6705%	71.1958%	90.1083%

4.3. Classification performance of LLENRS-GS

Taking the Prostate Tumor data set as an example, for Algorithm 1, suppose that the number of neighbors is 8 and the low-dimensional embedded dimension is 30; then the preliminary dimension reduction can be achieved. Suppose that the parameter λ = 0.4; then Algorithm 2 is performed and a selected gene subset with 11 genes is obtained. Following the same process, namely using the LLENRS-GS algorithm, the gene selection results of the five gene expression data sets are given in Table 3.

4.4. Classification performance of related reduction algorithms

This part of our experiments evaluates the performance of the proposed algorithm in terms of the number of selected genes, the running time, and the classification accuracy on the selected genes. The classification performance of the LLENRS-GS algorithm is compared with that of six related reduction algorithms on the five gene expression data sets in Table 1. These methods are: (1) the LLE-based gene selection algorithm (LLE) [32], (2) the gene selection algorithm based on locally linear embedding and neighborhood rough sets (LLE-NRS) [18], (3) the principal component analysis-based gene selection algorithm (PCA) [7], (4) the PCA algorithm [7] combined with the NRS algorithm [23] (PCA + NRS), (5) the Relief-based feature selection (Relief) [33], and (6) the Relief algorithm [33] combined with the NRS algorithm [23] (Relief + NRS). The LibSVM, J48, Random Tree, and Random SubSpace classifiers in Weka are used in the simulation experiments. The classification results of the seven reduction algorithms on the five gene expression data sets are shown in Tables 4-8, respectively.

Table 4 shows that LLENRS-GS obtains the highest classification accuracy and selects the fewest genes, though its running time is longer than that of LLE, LLE-NRS, PCA and PCA + NRS. The PCA, PCA + NRS, Relief and Relief + NRS algorithms obtain higher classification accuracy on the J48, Random Tree, and Random SubSpace classifiers than on the LibSVM classifier. LLENRS-GS not only achieves the highest classification accuracy on the J48, Random Tree and Random SubSpace classifiers, but also works well with the LibSVM classifier. Its classification accuracy on LibSVM is 98.0392%, which is about 47 percentage points higher than that of the PCA and Relief algorithms and nearly 17 percentage points higher than that of LLE. Furthermore, the LLENRS-GS algorithm achieves the highest average classification accuracy for the selected Prostate Tumor genes. Therefore, our algorithm can effectively remove noise from the original Prostate Tumor data set.

Table 5 shows the comparison results of gene selection for Colon Cancer with the seven reduction algorithms. The classification accuracy of the genes selected by LLENRS-GS is higher than that of the other six algorithms. In particular, on the LibSVM classifier, its classification accuracy is 10 percentage points higher than that of LLE and 40 percentage points higher than that of PCA. On J48, Random Tree, and Random SubSpace, the accuracy of LLENRS-GS is also higher than that of the other algorithms. In addition, the other six algorithms maintain relatively good accuracy only on certain classifiers, while their accuracy on the other classifiers drops considerably and is not stable. In contrast, our algorithm maintains relatively high classification accuracy on all four classifiers. Moreover, LLENRS-GS obtains the highest average classification accuracy for the selected Colon Cancer genes. Thus, our algorithm can not only effectively remove noise from the Colon Cancer data set, but can also clearly improve the classification accuracy of the selected Colon Cancer genes.

Table 3
Classification performance of the LLENRS-GS algorithm for the five gene expression data sets

Data set	Genes	Classification accuracy	Error rate
Prostate Tumor	11	98.0392%	0.0196
Colon Cancer	15	96%	0.04
Leukemia	14	100%	0
Gastric Cancer	14	85%	0.15
Breast	14	87.1795%	0.1538
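The relation between the accuracy and error-rate columns can be illustrated with a small sketch. It assumes that accuracy is the fraction of correctly classified samples over all samples of a data set, using the sample counts listed in Table 1 (102, 50 and 72); the helper name `implied_correct` is hypothetical and only serves this illustration.

```python
# Sample counts as listed in Table 1; accuracies as listed in Table 3.
samples = {"Prostate Tumor": 102, "Colon Cancer": 50, "Leukemia": 72}
accuracy_pct = {"Prostate Tumor": 98.0392, "Colon Cancer": 96.0, "Leukemia": 100.0}


def implied_correct(n_samples, acc_pct):
    """Nearest whole number of correctly classified samples.

    Assumption: accuracy = correct / n_samples over the whole data set.
    """
    return round(n_samples * acc_pct / 100.0)


for name, n in samples.items():
    correct = implied_correct(n, accuracy_pct[name])
    error_rate = (n - correct) / n  # error rate = 1 - accuracy
    print(f"{name}: {correct}/{n} correct, error rate {error_rate:.4f}")
```

Under this assumption, the Prostate Tumor accuracy of 98.0392% corresponds to 100 of 102 samples being classified correctly, which matches the reported error rate of 0.0196.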

Table 4

Classification performance of the seven reduction algorithms for Prostate Tumor

Algorithm	Genes	Time (s)	LibSVM	J48	Random Tree	Random SubSpace	Average
LLE	29	0.347	81.3825%	73.5294%	71.5686%	71.5686%	74.5123%
LLE-NRS	19	3.182	75.4902%	66.6667%	65.6863%	74.5098%	70.5883%
PCA	30	0.031	50.9804%	76.4706%	73.5294%	83.3333%	71.0784%
PCA + NRS	17	2.773	50.9804%	79.4118%	71.5686%	70.5882%	68.1373%
Relief	61	22.566	50.9804%	73.5294%	70.5882%	82.3529%	69.3627%
Relief + NRS	16	27.907	50.9804%	77.451%	78.4314%	81.3725%	72.0588%
LLENRS-GS	11	9.445	98.0392%	80.3922%	85.2941%	94.1176%	89.4608%
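The Average column appears to be the arithmetic mean of the four classifier accuracies. The following sketch recomputes it from the Table 4 rows and compares it with the reported value; the variable and function names are illustrative only and are not part of the original experiments.

```python
# Rows of Table 4: (LibSVM, J48, Random Tree, Random SubSpace) accuracies in %,
# followed by the reported Average column.
table4 = {
    "LLE":          ([81.3825, 73.5294, 71.5686, 71.5686], 74.5123),
    "LLE-NRS":      ([75.4902, 66.6667, 65.6863, 74.5098], 70.5883),
    "PCA":          ([50.9804, 76.4706, 73.5294, 83.3333], 71.0784),
    "PCA + NRS":    ([50.9804, 79.4118, 71.5686, 70.5882], 68.1373),
    "Relief":       ([50.9804, 73.5294, 70.5882, 82.3529], 69.3627),
    "Relief + NRS": ([50.9804, 77.4510, 78.4314, 81.3725], 72.0588),
    "LLENRS-GS":    ([98.0392, 80.3922, 85.2941, 94.1176], 89.4608),
}


def row_average(accuracies):
    """Arithmetic mean of the per-classifier accuracies."""
    return sum(accuracies) / len(accuracies)


for name, (accs, reported) in table4.items():
    computed = row_average(accs)
    # Allow a small tolerance for rounding in the published table.
    status = "ok" if abs(computed - reported) < 1e-3 else "differs"
    print(f"{name:13s} computed {computed:.4f}  reported {reported:.4f}  {status}")
```

The same check can be applied to the Average columns of the other comparison tables by substituting their rows.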

Table 5

Classification performance of the seven reduction algorithms for Colon Cancer

Algorithm	Genes	Time (s)	LibSVM	J48	Random Tree	Random SubSpace	Average
LLE	30	0.056	86%	76%	68%	78%	77%
LLE-NRS	16	0.833	88%	80%	76%	80%	81%
PCA	30	0.011	56%	50%	68%	74%	62%
PCA + NRS	16	1.024	56%	72%	78%	80%	71.5%
Relief	17	56.238	56%	80%	76%	78%	72.5%
Relief + NRS	9	56.798	56%	84%	80%	84%	76%
LLENRS-GS	15	2.06	96%	94%	84%	90%	91%

Table 6 describes the comparison results of gene selection for the Leukemia data set. As shown in Table 6, the LLENRS-GS algorithm selects the fewest genes and achieves the highest classification accuracy on the LibSVM, Random Tree, and Random SubSpace classifiers for the selected Leukemia genes. On the LibSVM classifier in particular, the classification accuracy reaches 100%. This is about 47 percentage points higher than that of PCA, PCA + NRS, Relief and Relief + NRS, and about 7 percentage points higher than that of LLE. In addition, the classification accuracy of LLENRS-GS is relatively stable and remains at a high level. On the J48 classifier, however, the classification accuracy is slightly lower. The reason is that, during data processing, genes that are more suitable for the J48 classifier are likely to be deleted. Furthermore, the LLENRS-GS algorithm achieves the highest average classification accuracy for the selected Leukemia genes. Hence, our algorithm has better classification performance for the selected Leukemia gene data.

Table 6

Classification performance of the seven reduction algorithms for Leukemia

Algorithm	Genes	Time (s)	LibSVM	J48	Random Tree	Random SubSpace	Average
LLE	30	0.108	93.0556%	80.5556%	79.1667%	88.8889%	85.4167%
LLE-NRS	22	1.773	93.0556%	81.9444%	84.7222%	88.8889%	87.1528%
PCA	30	0.031	52.7778%	90.2778%	69.4444%	93.0556%	76.3889%
PCA + NRS	16	1.693	52.7778%	84.7222%	77.7778%	84.7222%	75.0000%
Relief	468	423.620	52.7778%	95.8333%	59.7222%	95.8333%	76.0417%
Relief + NRS	17	486.210	52.7778%	59.7222%	65.2778%	61.1111%	59.7222%
LLENRS-GS	14	4.483	100%	84.7222%	86.1111%	97.2222%	92.0139%

Table 7 compares the classification results on the Gastric Cancer data set. The LLENRS-GS algorithm selects a relatively small number of genes, has a relatively short running time, and achieves the highest accuracy on three classifiers: 80% on LibSVM, 85% on Random Tree, and 92.5% on Random SubSpace. Note that the 90% accuracy of our algorithm on the J48 classifier is lower than the 92.5% obtained by Relief. However, Relief selects the largest number of genes. The lower J48 accuracy arises because, when LLENRS-GS processes the gene data set, some noise is not fully filtered out, which reduces the classification ability of the selected gene subset. Meanwhile, Table 7 shows that LLENRS-GS achieves the highest average classification accuracy for the selected Gastric Cancer genes.

Table 7

Classification performance of the seven reduction algorithms for Gastric Cancer

Algorithm	Genes	Time (s)	LibSVM	J48	Random Tree	Random SubSpace	Average
LLE	30	0.052	50%	70%	50%	62.5%	58.125%
LLE-NRS	13	0.697	75%	77.5%	77.5%	80%	58.75%
PCA	30	0.03	50%	62.5%	70%	62.5%	60.625%
PCA + NRS	16	0.572	57.5%	62.5%	60%	70%	62.5%
Relief	49	28.304	72.5%	92.5%	82.5%	92.5%	85%
Relief + NRS	14	29.655	75%	85%	80%	77.5%	79.375%
LLENRS-GS	14	1.059	80%	90%	85%	92.5%	86.875%

Similar to the classification results in Tables 4 and 5, Table 8 shows that the LLENRS-GS algorithm obtains the highest classification accuracy while selecting a relatively small number of genes. For example, the classification accuracy of LLENRS-GS is approximately 31, 29, 27, and 35 percentage points higher than that of PCA on the four classifiers, respectively. LLENRS-GS also performs better with the LibSVM classifier than the other six algorithms, which indicates strong adaptability. Moreover, LLENRS-GS has the highest average classification accuracy for the selected Breast genes. Therefore, it can be concluded that our algorithm achieves the best classification performance on the Breast data set.

Table 8

Classification performance of the seven reduction algorithms for Breast

Algorithm	Genes	Time (s)	LibSVM	J48	Random Tree	Random SubSpace	Average
LLE	30	0.171	66.6667%	51.2821%	55.1282%	56.4103%	57.3718%
LLE-NRS	11	2.269	70.5128%	58.9744%	51.2821%	56.4103%	59.2849%
PCA	30	0.102	56.4103%	48.7179%	44.8718%	47.4359%	49.3590%
PCA + NRS	14	2.719	56.4103%	53.8462%	51.2821%	55.1282%	54.1667%
Relief	102	929.051	56.4103%	65.3846%	65.3846%	65.3846%	63.1410%
Relief + NRS	13	940.839	61.5385%	58.9744%	57.6923%	69.2308%	61.8590%
LLENRS-GS	14	4.934	87.1795%	78.2051%	71.7949%	82.0513%	79.8077%

Table 9

Classification accuracy of the selected genes with the ten dimensionality reduction methods

Data set	ODP	Fisher score	Lasso	NRS	FLD-NRS	LLE-NRS	Relief + NRS	FBFE	BDE	LLENRS-GS
Prostate Tumor	56.6%	86%	96.1%	64.7%	80%	71.1%	64.2%	83.2%	94.1%	98.0392%
Colon Cancer	64.5%	83.8%	88.7%	61.1%	88%	84%	56.4%	83.3%	75%	96%
Leukemia	65.3%	93.4%	98.6%	64.5%	82.8%	86.8%	56.3%	91.2%	82.4%	100%
Average	62.13%	87.73%	94.47%	63.43%	83.60%	80.63%	58.97%	85.90%	83.83%	98.01%

4.5. Classification performance of related dimensionality reduction algorithms

To further verify the classification performance of our proposed method, ten methods are evaluated in terms of the classification accuracy on the selected genes. The LLENRS-GS algorithm is compared with nine related dimensionality reduction methods: (1) the original data processing method (ODP), (2) the Fisher score algorithm [34], (3) the Lasso algorithm [35], (4) the neighborhood rough set model (NRS) [23], (5) the gene selection algorithm based on the Fisher linear discriminant and neighborhood rough sets (FLD-NRS) [26], (6) the gene selection algorithm based on locally linear embedding and neighborhood rough sets (LLE-NRS) [18], (7) the Relief algorithm [33] combined with the NRS algorithm [23, 36] (Relief + NRS), (8) the fuzzy backward feature elimination (FBFE) [37], and (9) the binary differential evolution (BDE) [38]. The LibSVM classifier in the Weka tool is used for the simulation experiments. The classification accuracy of the selected genes is shown in Table 9.

According to the classification accuracy results in Table 9, the differences among the ten methods can be clearly identified. The NRS algorithm deletes some genes that carry classification information, which leads to its low classification accuracy, and the LLENRS-GS algorithm obtains clearly higher accuracy than NRS. The three extended NRS methods (FLD-NRS, LLE-NRS and Relief + NRS) aim to overcome this drawback and improve the classification accuracy. Compared with these methods, the LLENRS-GS algorithm achieves the highest classification accuracy for the Prostate Tumor, Colon Cancer and Leukemia data sets. Compared with the FBFE and BDE algorithms, the LLENRS-GS algorithm also improves the classification accuracy on all three data sets. Thus, our proposed approach can clearly reduce the dimension of gene expression data sets and outperforms the other nine related dimensionality reduction methods. In summary, our method is an efficient dimensionality reduction technique for high-dimensional gene expression data sets.

During the above experiments, the rough ordering of these nine methods with respect to time complexity is as follows: O(LLENRS-GS) = O(Fisher score) < O(FBFE) < O(BDE) < O(FLD-NRS) < O(Relief + NRS) < O(NRS) < O(LLE-NRS) < O(Lasso), where O(A) denotes the time complexity of algorithm A. For high-dimensional gene expression data, the Lasso algorithm has the highest time complexity, which is O(nm³) [35], where n denotes the number of samples and m denotes the number of genes. For the NRS algorithm and its extensions, the time complexity is O(m²n + m²n log n) for the LLE-NRS algorithm [18] and O(m²n log n) for the NRS algorithm [23]. Moreover, the Relief + NRS algorithm has a time complexity of O(mn + mn log n) [23, 36], and the complexity of FLD-NRS is O(mn log n) [26]. The time complexities of three extended NRS methods (FLD-NRS, LLE-NRS and Relief + NRS) are lower than that of NRS. Since population initialization is the main process of the BDE algorithm, the complexity of BDE is close to O(nm) [38]. For FBFE, the time is mainly spent on evaluating the relevance of the genes using entropy, and its time complexity is no more than O(nm) [37]. The complexities of LLENRS-GS and Fisher score [34] are approximately O(m) and lower than those of the other seven algorithms. Although the Lasso algorithm achieves relatively good classification accuracy, it has a much higher time complexity than our LLENRS-GS algorithm, since n ≪ m in most cases. Hence, these results indicate that our algorithm can effectively reduce the dimension of gene expression data sets, increase the classification accuracy, and speed up the classification process.
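To give a feeling for the scale of these expressions, the following sketch evaluates each stated complexity term for the Prostate Tumor dimensions listed in Table 1 (n = 102 samples, m = 10509 genes). This is only an order-of-magnitude illustration: big-O notation hides constant factors, the base-2 logarithm is an assumption, and the resulting numbers are not measured running times.

```python
import math

# Assumed dimensions: Prostate Tumor as listed in Table 1.
n, m = 102, 10509
log_n = math.log2(n)  # the base of the logarithm is an assumption

# Each entry evaluates the complexity expression stated in the text,
# treating it as a raw operation count (constants are ignored).
cost = {
    "Lasso": n * m**3,
    "LLE-NRS": m**2 * n + m**2 * n * log_n,
    "NRS": m**2 * n * log_n,
    "Relief + NRS": m * n + m * n * log_n,
    "FLD-NRS": m * n * log_n,
    "BDE": n * m,
    "FBFE": n * m,
    "Fisher score": m,
    "LLENRS-GS": m,
}

# Print the methods from the smallest to the largest term.
for name, value in sorted(cost.items(), key=lambda kv: kv[1]):
    print(f"{name:13s} {value:.3e}")
```

With these dimensions, the Lasso term exceeds the NRS term by several orders of magnitude, which illustrates why n ≪ m makes the cubic dependence on m so costly.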

According to the above comparative experiments, the gene subset selected by LLENRS-GS has high classification accuracy and is clearly superior to those selected by the other related reduction algorithms. At the same time, a comparison across classifiers shows that the classification accuracy of LLENRS-GS is highly stable, so the method has good compatibility in practical applications. Moreover, LLENRS-GS performs better with the LibSVM classifier than the other algorithms, and has a short running time and high processing efficiency.

5. Conclusion

Identifying cancer-related genes is helpful for earlier cancer diagnosis and drug design, and gene selection is one of the important steps in cancer classification. To deal with gene expression data more effectively, this paper proposes a gene selection algorithm based on improved LLE and neighborhood rough sets. By considering the category information of samples, the neighborhood of samples is reconstructed to improve the traditional LLE algorithm, and the improved LLE-based dimensionality reduction algorithm is applied to the preprocessed gene expression data sets to perform the initial dimensionality reduction and obtain the candidate gene subset. Then, the Lebesgue measure-based dependency degree and significance measure are presented in neighborhood decision systems. Finally, a heuristic relative reduction algorithm for gene selection is developed. The simulation results on the five gene expression data sets demonstrate that the gene subset selected by the proposed algorithm is highly representative, and that our method has good classification performance. For Prostate Tumor, LLENRS-GS selects the fewest genes and its average accuracy is 14.9485%-21.3235% higher than that of the other algorithms. On Colon Cancer, only Relief + NRS selects fewer genes than our method, whereas our average accuracy is 10%-29% higher than that of all the other methods. For Leukemia, LLENRS-GS again selects the fewest genes and its average accuracy is 4.8611%-32.2917% higher than that of the others. On Gastric Cancer, LLENRS-GS selects only one more gene than LLE-NRS, whereas the average accuracy of our proposed algorithm is 1.875%-28.75% higher than that of the other algorithms. For Breast, both LLE-NRS and Relief + NRS select fewer genes than LLENRS-GS, but our average accuracy is 16.6667%-30.4487% higher than that of all the other algorithms. Therefore, these results indicate that our algorithm can obtain a small, effective gene subset with higher classification accuracy, and that it outperforms the other reduction methods.
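The accuracy margins quoted above can be derived from the Average columns of the comparison tables. As a minimal sketch, the following recomputes the margin range for Colon Cancer from the averages in Table 5; the function name `margin_range` is hypothetical and is introduced only for this illustration.

```python
# Average accuracies (%) for Colon Cancer, taken from the Average column of Table 5.
colon_average = {
    "LLE": 77.0,
    "LLE-NRS": 81.0,
    "PCA": 62.0,
    "PCA + NRS": 71.5,
    "Relief": 72.5,
    "Relief + NRS": 76.0,
    "LLENRS-GS": 91.0,
}


def margin_range(averages, ours="LLENRS-GS"):
    """Smallest and largest gap (in percentage points) between `ours` and the rest."""
    gaps = [averages[ours] - value for name, value in averages.items() if name != ours]
    return min(gaps), max(gaps)


low, high = margin_range(colon_average)
print(f"Colon Cancer: {low:g} to {high:g} percentage points higher")
```

The smallest gap is taken against the best competing method (LLE-NRS) and the largest against the weakest one (PCA), which gives the 10%-29% range stated for Colon Cancer.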

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 61772176, No. 61402153), the China Postdoctoral Science Foundation (No. 2016M602247), the Plan for Scientific Innovation Talent of Henan Province (No. 184100510003), the Key Scientific and Technological Project of Henan Province (No. 182102210362, No. 182102210078), the Young Scholar Program of Henan Province (No. 2017GGJS041), and the Natural Science Foundation of Henan Province (No. 182300410130, No. 182300410306, No. 182300410368).

References

1. Jain, V.K. Jain and Jain, Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification, Applied Soft Computing 62 (2018), 203–215.

2. J.X. Liu, Xu, C.H. Zheng, Kong and Z.H. Lai, RPCA-based tumor classification using gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics 12 (2015), 964–970.

3. Wan and A.A. Freitas, An empirical evaluation of hierarchical feature selection methods for classification in bioinformatics datasets with gene ontology-based features, Artificial Intelligence Review 50(2) (2018), 201–240.

4. Sun, X.Y. Zhang, Y.H. Qian, J.C. Xu, S.G. Zhang and Tian, Joint neighborhood entropy-based gene selection method with fisher score for tumor classification, Applied Intelligence (2018). DOI: 10.1007/s10489-018-1320-1

5. Elyasigomari, M.S. Mirjafari, H.R.C. Screen and M.H. Shaheed, Cancer classification using a novel gene selection approach by means of shuffling based on data clustering with optimization, Applied Soft Computing 35 (2015), 43–51.

6. Sina, Ali, Reza and Parham, Gene selection for microarray data classification using a novel ant colony optimization, Neurocomputing 168 (2015), 1024–1036.

7. Sun, J.C. Xu and Yin, Principal component-based feature selection for tumor classification, Bio-Medical Materials and Engineering 26 (2015), 2011–2017.

8. C.Z. Wang, Q.H. Hu, X.Z. Wang, D.G. Chen and Y.H. Qian, Feature selection based on neighborhood discrimination index, IEEE Transactions on Neural Networks and Learning Systems 29(7) (2018), 2986–2999.

9. C.Z. Wang, He, M.W. Shao and Q.H. Hu, Feature selection based on maximal neighborhood discernibility, International Journal of Machine Learning and Cybernetics 9(11) (2018), 1929–1940.

10. Min, Z.H. Zhang and Dong, Ant colony optimization with partial-complete searching for attribute reduction, Journal of Computational Science 25 (2018), 170–182.

11. Sun, J.C. Xu and Tian, Feature selection using rough entropy-based uncertainty measures in incomplete decision systems, Knowledge-Based Systems 36 (2012), 206–216.

12. Feng, J.C. Xu and T.H. Xu, An efficient gene selection technique based on self-organizing map and particle swarm optimization, Journal of Intelligent & Fuzzy Systems 33(6) (2017), 3287–3294.

13. VanderPlas and Connolly, Reducing the dimensionality of data: Locally linear embedding of Sloan galaxy spectra, The Astronomical Journal 138(5) (2009), 1365–1379.

14. De Ridder, Kouropteva, Okun, Pietikäinen and R.P.W. Duin, Supervised locally linear embedding. In: Kaynak, Alpaydin, Oja, Xu, eds., Artificial Neural Networks and Neural Information Processing-ICANN/ICONIP 2003, Springer, Berlin, Heidelberg, Lecture Notes in Computer Science 2714 (2003), 333–341.

15. Sun, R.N. Liu, J.C. Xu, S.G. Zhang and Tian, An affinity propagation clustering method using hybrid kernel function with LLE, IEEE Access 6 (2018), 68892–68909.

16. Liu, D.G. Tosun and M.W. Weiner, Locally linear embedding (LLE) for MRI based Alzheimer's disease classification, NeuroImage 83 (2013), 148–157.

17. Z.Q. Su, B.P. Tang, J.H. Ma and Deng, Fault diagnosis method based on incremental enhanced supervised locally linear embedding and adaptive nearest neighbor classifier, Measurement 48 (2014), 136–148.

18. Sun, J.C. Xu, Wang and Yin, Locally linear embedding and neighborhood rough set-based gene selection for gene expression data classification, Genetics and Molecular Research 15(3) (2016), gmr.15038990.

19. J.C. Xu, H.Y. Mu, Wang and F.Z. Huang, Feature genes selection using supervised locally linear embedding and correlation coefficient for microarray classification, Computational and Mathematical Methods in Medicine 2018 (2018), Article ID 5490513.

20. Y.Y. Yao, Relation interpretation of neighborhood operators and rough set approximation operators, Information Sciences 195 (1998), 239–259.

21. W.Z. Wu and W.X. Zhang, Neighborhood operator systems and approximations, Information Sciences 144 (2002), 201–217.

22. Wang, Y.H. Qian, X.Y. Liang, Guo and J.Y. Liang, Local neighborhood rough set, Knowledge-Based Systems 153 (2018), 53–64.

23. Q.H. Hu, D.R. Yu, J.F. Liu and C.X. Wu, Neighborhood rough set based heterogeneous feature subset selection, Information Sciences 178 (2008), 3577–3594.

24. Meng, Zhang and Y.S. Luan, Gene selection integrated with biological knowledge for plant stress response using neighborhood system and rough set theory, IEEE/ACM Transactions on Computational Biology and Bioinformatics 12 (2015), 433–444.

25. Liu, W.L. Huang, Y.L. Jiang and Z.Y. Zeng, Quick attribute reduct algorithm for neighborhood rough set model, Information Sciences 271 (2014), 65–81.

26. Sun, X.Y. Zhang, J.C. Xu and Wang, A gene selection approach based on the Fisher linear discriminant and the neighborhood rough set, Bioengineered 9(1) (2018), 144–151.

27. Y.M. Chen, Z.J. Zhang, J.Z. Zheng, Ma and Xue, Gene selection for tumor classification using neighborhood rough sets and entropy measures, Journal of Biomedical Informatics 67 (2017), 59–68.

28. H.Y. Mu, J.C. Xu, Wang and Sun, Feature genes selection using Fisher transformation method, Journal of Intelligent & Fuzzy Systems 34(6) (2018), 4291–4300.

29. Sun and J.C. Xu, Information entropy and mutual information-based uncertainty measures in rough set theory, Applied Mathematics & Information Sciences 8(4) (2014), 1973–1985.

30. Yang, Li, Hu, Gao and Wang, Multimode process monitoring based on geodesic distance, International Journal of Software Engineering and Knowledge Engineering 28(9) (2018), 1225–1248.

31. P.R. Halmos, Measure Theory, World Publishing Corporation, 2007, pp. 100–152.

32. Y.X. Lang, Zheng and Xing, An effective gene selection method for cancer classification based on locally linear embedding, Journal of Computational and Theoretical Nanoscience 8(10) (2011), 2108–2111.

33. R.J. Urbanowicz, Meeker, C.W. La, R.S. Olson and J.H. Moore, Relief-based feature selection: Introduction and review, Journal of Biomedical Informatics 85 (2018), 189–203.

34. Yang, Y.L. Liu, C.S. Feng and G.Q. Zhu, Applying the Fisher score to identify Alzheimer's disease-related genes, Genetics and Molecular Research 15(2) (2016), gmr.15028798.

35. S.F. Zheng and W.X. Liu, An experimental comparison of gene selection by Lasso and Dantzig selector for cancer classification, Computers in Biology and Medicine 41(11) (2011), 1033–1040.

36. Sun and J.C. Xu, Feature selection using mutual information based uncertainty measures for tumor classification, Bio-Medical Materials and Engineering 24(1) (2014), 763–770.

37. Aziz, C.K. Verma and Srivastava, A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics Data 8 (2016), 4–15.

38. Apolloni, Leguizamon and Alba, Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments, Applied Soft Computing 38 (2016), 922–932.
