An efficient gene selection technique based on Self-organizing Map and Particle Swarm Optimization

Abstract

Of the large number of genes present in microarray gene expression data, only a small fraction is effective for performing a given diagnostic test. For this reason, reducing the dimensionality of gene expression data is imperative. An improved Self-organizing map method based on a neighborhood mutual information correlation measure is proposed and then combined with the Particle swarm optimization method to construct an efficient gene selection algorithm, denoted by ICMSOM-PSO. Experimental results show that the proposed method can reduce the dimensionality of the dataset, identify the most informative gene subset, and improve classification accuracy.

Keywords

Self-organizing map, neighborhood mutual information, particle swarm optimization, gene selection

1 Introduction

In recent years, molecular diagnosis of tumors based on gene expression profiles (GEP) has attracted a great number of medical researchers and computer scientists aiming at precise and early tumor diagnosis [1–3]. However, the curse of dimensionality caused by the high dimensionality and small sample size of tumor datasets seriously challenges tumor classification. Selecting important gene subsets from the thousands of genes in a GEP dataset, so as to drastically reduce its dimensionality, is therefore the first key step in addressing this problem [4].

Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics [4]. When applied to gene expression data, conventional clustering algorithms encounter the problem of a huge number of genes (attributes) versus a small number of samples [5].

Self-organizing maps (SOM) [6–8] have a number of features that make them particularly well suited to clustering and analyzing gene expression patterns. They are ideally suited to exploratory data analysis, allowing one to impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the lack of structure of K-means clustering) and facilitating easy visualization and interpretation. SOM has many good computational properties and is scalable to large data sets. As for correlation measures, Euclidean distance and Pearson’s correlation coefficient are widely used for clustering [9]. However, for measuring the correlation between genes, Euclidean distance is not effective enough to describe functional similarity such as positive or negative correlation in values. Empirical studies have shown that it may assign a high similarity score to a pair of dissimilar genes [10]. Furthermore, most clustering methods are not able to cope effectively with continuous attributes, which are a distinctive characteristic of gene expression data. When applied to continuous attributes, conventional methods commonly discretize the data into a finite number of intervals, but discretization may lead to information loss. The conventional Self-organizing map uses Euclidean distance to measure the correlation between genes. In this paper, we use neighborhood mutual information instead of Euclidean distance to evaluate that correlation. By applying the improved SOM to gene expression data, clusters of genes based on their neighborhood mutual information correlation can be discovered.

In this paper, we present a gene selection method that combines particle swarm optimization with the improved SOM. First, the improved SOM is used to preprocess the original gene expression data and obtain the winning neuron weights. Next, particles are initialized with random positions representing the generated set of winning neuron weights. Finally, the identified relevant feature subsets are evaluated by particle swarm optimization, which searches for optimal feature subsets.

The rest of this paper is organized as follows. Section 2 introduces the concepts of Self-organizing map clustering, neighborhood mutual information, and particle swarm optimization. An effective and efficient attribute clustering algorithm is proposed in Section 3. To evaluate the performance of the proposed algorithm, three gene expression datasets are used, and the experimental results are presented in Section 4. Finally, the conclusion is drawn in Section 5.

2 Related work

2.1 Self-organizing map

Self-organizing map [11, 12] is an unsupervised neural network mapping technique that forms an ordered non-linear projection of high-dimensional input data items onto a low-dimensional, often one- or two-dimensional, grid. SOM is a clustering tool that converts the non-linear statistical relationships between high-dimensional data into simple geometric relationships of their image points on a low-dimensional display. In this way, data points with similar properties are placed close to each other in the output of the SOM algorithm [7].

SOM consists of a set of neurons, usually arranged in a two-dimensional structure, with neighborhood relations among the neurons that dictate the topology and structure. As shown in Fig. 1, the SOM network usually consists of two layers of neurons: an input layer and an output layer. Although the neurons on the input layer are fully connected to the neurons on the output layer, the neurons within each layer are not connected to one another. Neurons on the output layer are arranged in either a rectangular or a hexagonal lattice. The choice of lattice shape affects the performance of the SOM because it determines the number of neighbors of each neuron. The performance of the SOM with a hexagonal lattice is expected to be higher, since more neighbors are modified than in a rectangular lattice [11]. Each neuron is represented by an n-dimensional weight vector. The SOM algorithm is initialized by assigning the weight vectors of the output neurons linearly or randomly. Training starts by presenting a randomly chosen data point to the network. The distances between this data point and the weight vectors of all neurons are computed using a distance measure such as Euclidean distance. By comparing these distances, the nearest Kohonen neuron is identified as the ‘winning’ neuron. The weights of the winning neuron are adjusted to move closer to the actual data point. The weights of neighboring neurons are also updated, so that the order of the input space is preserved [13]. In the SOM network, the feature mapping has two characteristics: it is topologically ordered and it is feature choosing. “Topologically ordered” means that the spatial position of a neuron in the network corresponds to a feature or region of the input samples. “Feature choosing” means that, given data with a non-linear distribution in the input space, SOM can choose the best features for approximating it [12, 14]. These two characteristics make the SOM network suitable for constructing a disease feature map.

Fig. 1

Structure of SOM.

There are several steps in the application of the algorithm. Competition and learning are needed to find the winner, and the winner finding and weight updating steps below are repeated until the feature mapping is formed. The process is as follows:

Initialization: Choose random values for the initial weights w_j.

Winner Finding: Find the winning neuron C at time t, that is, the neuron nearest to the input, using the minimum-distance criterion:

$C = \arg\min_{j} ∥ X - w_{j} ∥, \quad ∥ X - w_{j} ∥ = \sqrt{\sum_{i = 1}^{l} (x_{i} (t) - w_{ji} (t))^{2}}$ (1) where X = [x₁, x₂, …, x_l] ^T represents an input vector at time t, w_ji (t) is the ith component of the weight vector w_j, j = 1, 2, …, N, and N is the total number of neurons.

Weights Updating: Adjust the weights of the winner and its neighbors, using the following rule: $w_{j} (t + 1) = w_{j} (t) + η (t) h_{j, C} (t) [X (t) - w_{j} (t)]$ (2) $h_{j, C} (t) = \exp \left[ - \frac{{NR}_{δ} (r_{c}; r_{i})}{2 σ^{2} (t)} \right]$ (3) where X (t) represents the input data at time t, h_j,C (t) is the topological neighborhood function of the winner neuron C at time t, η (t) is a positive factor called the ‘learning-rate factor’, and r_c ∈ R² and r_i ∈ R² are the location vectors of nodes c and i, respectively. σ (t) defines the width of the kernel. Both η (t) and σ (t) decrease with time. It should be emphasized that the success of the map formation depends critically on the values of the main parameters (i.e., h_j,C (t) and η (t)), the initial values of the weight vectors, and the prespecified number of iterations.

Repeat the Winner Finding and Weights Updating steps until the changes in the disease feature mapping are very small or the maximum number of iterations is reached.
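The steps above can be sketched in a few lines of Python. This is a minimal illustration of the conventional SOM loop, not the authors' MATLAB implementation. The function names, the exponential decay schedule, and the use of the squared grid distance inside the Gaussian kernel (the usual choice, in place of the NR_δ (r_c; r_i) term written in Equation (3)) are assumptions made for this example.

```python
import math
import random


def train_som(data, rows, cols, n_iter=500, eta0=0.3, sigma0=1.0, seed=0):
    """Train a small rectangular SOM on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization: one random weight vector per grid node.
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    for t in range(n_iter):
        # Assumed schedule: learning rate and kernel width decay with time.
        decay = math.exp(-t / n_iter)
        eta, sigma = eta0 * decay, sigma0 * decay
        x = rng.choice(data)
        # Winner finding: the node whose weight vector is nearest to x.
        winner = best_matching_unit(x, weights)
        # Weights updating: pull the winner and its grid neighbors towards x.
        for j, (r, c) in enumerate(grid):
            d2 = (r - grid[winner][0]) ** 2 + (c - grid[winner][1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))
            weights[j] = [w + eta * h * (xi - w) for w, xi in zip(weights[j], x)]
    return grid, weights


def best_matching_unit(x, weights):
    """Index of the neuron with the smallest Euclidean distance to x."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))


if __name__ == "__main__":
    samples = [[0.1, 0.2], [0.15, 0.1], [0.9, 0.8], [0.85, 0.95]]
    grid, weights = train_som(samples, rows=2, cols=2)
    for s in samples:
        print(s, "->", grid[best_matching_unit(s, weights)])
```

Because each update is a convex combination of the old weight and the input, weights initialized in [0, 1] stay in [0, 1] when the inputs are scaled to that interval.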

2.2 Neighborhood mutual information measure

Hu et al. [15] proposed neighborhood mutual information to cope with continuous gene data by evaluating the relevance between attributes.

Employing mutual information in gene evaluation is problematic because the probability density of genes is difficult to estimate. Neighborhood mutual information therefore combines the concept of neighborhood with information theory and generalizes Shannon’s entropy to numerical information. Training samples are usually given as vectors of attribute values and the attributes are numerical, as shown in Table 1, where A₁ and A₂ are two attributes, while C is the decision label of samples.

Table 1
Experiment data sets

Data set	Genes	Classes	Samples
Leukemia	7129	2	72
SRBCT	2308	5	88
Breast	9216	5	84

Let U = {x₁, x₂, ⋯ , x_n} be a set of samples described with gene set F, let Δ be a distance function on U, and let δ ≥ 0 be a constant. Then the neighborhood of sample x is denoted by $δ (x) = \{ x_{i} \mid Δ (x, x_{i}) \leq δ \}$ (4)

Given a subset of genes S ⊆ F, the neighborhood of sample x_i in S is denoted by δ_S (x_i). The neighborhood uncertainty of x_i is denoted by ${NH}_{δ}^{x_{i}} (S) = - \log \frac{∥ δ_{S} (x_{i}) ∥}{n}$ (5) where ∥ δ_S (x_i) ∥ is the number of samples in the neighborhood, and the average uncertainty of the set of samples is computed as ${NH}_{δ} (S) = - \frac{1}{n} \sum_{i = 1}^{n} \log \frac{∥ δ_{S} (x_{i}) ∥}{n}$ (6)

Given two subsets of genes R, S ⊆ F, the neighborhood of sample x_i in the gene subspace S ∪ R is denoted by δ_S∪R (x_i), and the joint neighborhood entropy of S ∪ R is computed as ${NH}_{δ} (R, S) = - \frac{1}{n} \sum_{i = 1}^{n} \log \frac{∥ δ_{S \cup R} (x_{i}) ∥}{n}$ (7)

Let R, S ⊆ F be two subsets of genes. Then the neighborhood mutual information of R and S is denoted by ${NMI}_{δ} (R; S) = - \frac{1}{n} \sum_{i = 1}^{n} \log \frac{∥ δ_{R} (x_{i}) ∥ \cdot ∥ δ_{S} (x_{i}) ∥}{n ∥ δ_{S \cup R} (x_{i}) ∥}$ (8)
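Equations (4) to (8) can be evaluated directly by counting neighbors. The sketch below is illustrative only: the function names are hypothetical, and the distance Δ is assumed to be the Chebyshev (maximum-coordinate) distance, which the text does not specify. Natural logarithms are used.

```python
import math


def neighborhood(samples, i, genes, delta):
    """Indices of the samples within distance delta of sample i on `genes`.

    Assumption: the distance is the Chebyshev (max-coordinate) distance.
    """
    xi = samples[i]
    return {
        k
        for k, xk in enumerate(samples)
        if max(abs(xi[g] - xk[g]) for g in genes) <= delta
    }


def neighborhood_entropy(samples, genes, delta):
    """Average neighborhood uncertainty, Equations (6) and (7)."""
    n = len(samples)
    total = sum(
        math.log(len(neighborhood(samples, i, genes, delta)) / n)
        for i in range(n)
    )
    return -total / n


def neighborhood_mutual_information(samples, r, s, delta):
    """Neighborhood mutual information of gene subsets r and s, Equation (8)."""
    n = len(samples)
    joint = sorted(set(r) | set(s))
    total = 0.0
    for i in range(n):
        n_r = len(neighborhood(samples, i, r, delta))
        n_s = len(neighborhood(samples, i, s, delta))
        n_rs = len(neighborhood(samples, i, joint, delta))
        total += math.log((n_r * n_s) / (n * n_rs))
    return -total / n


if __name__ == "__main__":
    # Four samples described by two genes (columns 0 and 1), values in [0, 1].
    samples = [[0.10, 0.20], [0.15, 0.80], [0.90, 0.85], [0.50, 0.50]]
    print(neighborhood_mutual_information(samples, [0], [1], delta=0.15))
```

Every sample belongs to its own neighborhood, so the counts are never zero and the logarithms are always defined. Comparing the formulas term by term also shows that NMI_δ (R; S) = NH_δ (R) + NH_δ (S) − NH_δ (R, S).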

2.3 Particle Swarm Optimization (PSO)

Many optimization algorithms exist, such as GA, DA, and PSO. The PSO algorithm has been widely applied in different fields owing to its simple concept, ease of implementation, and fast convergence.

Particle swarm optimization (PSO), developed by Kennedy and Eberhart [16–18], originated from the simulation of the social behavior of birds in a flock. Like other evolutionary strategies, PSO keeps a population-based global search strategy. In PSO, physical position is not an important factor. Each member, called a particle, is initialized with a random position and velocity and flies through the search space with a velocity adjusted by its own flying memory and its companions’ flying experience.

During each iteration, every particle is accelerated towards its own personal best position, as well as towards the global best position. All particles have fitness values, which are determined by a fitness function. Each particle updates its velocity and position according to Equations (9) and (10) in every iteration. $v_{id}^{k + 1} = w v_{id}^{k} + c_{1} γ_{1} (p_{id}^{k} - x_{id}^{k}) + c_{2} γ_{2} (p_{gd}^{k} - x_{id}^{k})$ (9) $x_{id}^{k + 1} = x_{id}^{k} + v_{id}^{k + 1}$ (10)

where w is the inertia weight, and $v_{id}^{k}$ and $x_{id}^{k}$ stand for the velocity and position of the ith particle at the kth iteration, respectively. $p_{id}^{k}$ denotes the previous best position of particle i (pbest), and $p_{gd}^{k}$ denotes the global best position of the swarm (gbest). c₁ and c₂ are acceleration constants (generally chosen in the interval [0, 2]), and γ₁ and γ₂ are random numbers in the range [0, 1].
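A single application of Equations (9) and (10) can be written as follows. This is a sketch only. The function name `pso_step` and the default values of w, c₁ and c₂ are assumptions for illustration and are not taken from the paper.

```python
import random


def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=2.0, c2=2.0, rng=random):
    """Apply Equations (9) and (10) once to every particle, in place.

    positions, velocities and pbest are lists of per-particle lists;
    gbest is a single list holding the best position found by the swarm.
    """
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            g1, g2 = rng.random(), rng.random()  # gamma_1, gamma_2 in [0, 1)
            # Equation (9): inertia + cognitive pull + social pull.
            v[d] = (
                w * v[d]
                + c1 * g1 * (pbest[i][d] - x[d])
                + c2 * g2 * (gbest[d] - x[d])
            )
            # Equation (10): move the particle with its new velocity.
            x[d] = x[d] + v[d]


if __name__ == "__main__":
    pos = [[0.2, 0.9], [0.7, 0.1]]
    vel = [[0.0, 0.0], [0.0, 0.0]]
    pso_step(pos, vel, pbest=[[0.2, 0.9], [0.7, 0.1]], gbest=[0.7, 0.1])
    print(pos)
```

A particle that already sits on both its pbest and the gbest with zero velocity does not move, which is a quick sanity check on the signs in Equation (9).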

3 Efficient gene selection algorithm

3.1 The novel correlation measure

As for correlation measures, Euclidean distance and Pearson’s correlation coefficient are widely used for clustering [19]. However, for measuring the correlation between genes, Euclidean distance is not effective enough to describe functional similarity such as positive or negative correlation in values. Empirical studies have shown that it may assign a high similarity score to a pair of dissimilar genes [10]. Conventional SOM uses Euclidean distance to measure the correlation between genes, but it cannot cope effectively with continuous attributes, which are a distinctive characteristic of gene expression data. Au et al. presented an information measure to evaluate the correlation between attributes.

It is called the interdependence redundancy measure [20] between two attributes A_i and A_j, i, j ∈ {1, ⋯ , m}, and is denoted by $R (A_{i} : A_{j}) = \frac{I (A_{i} : A_{j})}{H (A_{i}, A_{j})}$ (11) where I (A_i : A_j) is the mutual information between A_i and A_j, and H (A_i, A_j) is the joint entropy of A_i and A_j.

Definition 1. Let A_i and A_j, i, j ∈ {1, ⋯ , m}, be two attributes. Then we have ${NR}_{δ} (A_{i}; A_{j}) = \frac{{NMI}_{δ} (A_{i}; A_{j})}{{NH}_{δ} (A_{i}, A_{j})}$ (12) where NMI_δ (A_i ; A_j) is the neighborhood mutual information between A_i and A_j, and NH_δ (A_i, A_j) is the joint neighborhood entropy of A_i and A_j.
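As a worked step that only restates the definitions, substituting Equations (7) and (8) into Equation (12) cancels the common factor −1/n and expresses the measure directly through neighborhood sizes. The shorthand δ_i, δ_j and δ_ij for the neighborhoods of a sample in A_i, A_j and A_i ∪ A_j is introduced here for illustration and is not the paper's notation: ${NR}_{δ} (A_{i}; A_{j}) = \frac{\sum_{k = 1}^{n} \log \frac{∥ δ_{i} (x_{k}) ∥ \cdot ∥ δ_{j} (x_{k}) ∥}{n ∥ δ_{ij} (x_{k}) ∥}}{\sum_{k = 1}^{n} \log \frac{∥ δ_{ij} (x_{k}) ∥}{n}}$ In particular, when A_i = A_j the three neighborhoods coincide, each term of the numerator equals the corresponding term of the denominator, and NR_δ = 1 whenever the joint neighborhood entropy is non-zero.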

Definition 2. Let X = [x₁, x₂, ⋯ , x_L] ^T be an input sample. Choose random values for the initial weight vectors w_j = [w_j1, w_j2, ⋯ , w_jL] ^T, j = 1, 2, ⋯ , N, where N is the number of output neurons (normally N ≤ s/2, where s is the number of diseases). Let w_ji (t) be the ith component of the weight vector of output neuron j at time t, and x_i (t) the ith input of X at time t. In this paper, we use neighborhood mutual information instead of Euclidean distance to evaluate the correlation between genes. The correlation between the input X and neuron j is defined as ${NR}_{δ} (x_{i} (t); w_{ji} (t)) = \frac{{NMI}_{δ} (x_{i} (t); w_{ji} (t))}{{NH}_{δ} (x_{i} (t), w_{ji} (t))}$ (13) where NMI_δ (x_i (t) ; w_ji (t)) is the neighborhood mutual information between x_i (t) and w_ji (t), and NH_δ (x_i (t) , w_ji (t)) is the joint neighborhood entropy of x_i (t) and w_ji (t).

3.2 The description of the improved SOM clustering algorithm (ICMSOM)

In order to cope with the continuous attributes effectively, we redefine the winning neuron C.

Definition 3. Let X = [x₁, x₂, …, x_L] ^T represent an input vector at time t, let N be the total number of neurons, and let w_j be the weight vector of neuron j. The winning neuron C at time t is defined as: $C = \arg\max_{j} {NR}_{δ} (x_{i} (t); w_{ji} (t)), \quad j = 1, 2, \dots, N$ (14)

There are several steps in the application of the algorithm: competition and learning, through which the winner is obtained. These steps are repeated until the feature mapping is formed. The detailed ICMSOM algorithm is described as Algorithm 1.

Algorithm 1. ICMSOM

Step 1: Initialization: Choose random values for the initial weights w_j.

Step 2: Winner Finding: Find the winning neuron C at time t, according to Equation (14).

Step 3: Weights Updating: Adjust the weights of the winner and its neighbors, according to Equations (2) and (3).

Step 4: Repeat Steps 2 and 3 until the changes in the disease feature mapping are very small or the maximum number of iterations is reached.
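The winner-finding rule of Step 2 can be sketched as below. The text does not spell out how NR_δ is evaluated for one input vector and one weight vector, so this example rests on a loudly stated assumption: the L components of X and of w_j are treated as L paired observations of two numerical attributes, and neighborhoods are taken over those components. The function names are hypothetical.

```python
import math


def nr(a, b, delta):
    """NR_delta between two equal-length numeric sequences (Equation (12)).

    Assumption: position i of `a` and `b` is treated as one paired
    observation, and neighborhoods are formed over positions.
    """
    n = len(a)
    nmi = 0.0
    joint = 0.0
    for i in range(n):
        near_a = [abs(a[i] - a[k]) <= delta for k in range(n)]
        near_b = [abs(b[i] - b[k]) <= delta for k in range(n)]
        n_a = sum(near_a)
        n_b = sum(near_b)
        n_ab = sum(1 for p, q in zip(near_a, near_b) if p and q)
        nmi -= math.log((n_a * n_b) / (n * n_ab)) / n
        joint -= math.log(n_ab / n) / n
    # Convention for this sketch: return 0 when the joint entropy vanishes.
    return nmi / joint if joint > 0 else 0.0


def find_winner(x, weights, delta=0.15):
    """Equation (14): index of the neuron whose weights maximize NR_delta."""
    return max(range(len(weights)), key=lambda j: nr(x, weights[j], delta))


if __name__ == "__main__":
    x = [0.0, 0.5, 1.0]
    weights = [[0.3, 0.3, 0.3], [0.1, 0.6, 0.9]]
    print(find_winner(x, weights))
```

Under this reading, a constant weight vector carries no information about the input and scores 0, while a weight vector with the same neighborhood structure as the input scores 1.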

3.3 ICMSOM-PSO procedure

This paper proposes an improved Self-organizing map (SOM) method, and then combines it with Particle swarm optimization (PSO) method to construct an efficient gene selection algorithm, denoted by ICMSOM-PSO.

Definition 4. Let ω and C be the inertia weight and the winning neuron, respectively. The fitness function is defined as: $Fitness = \frac{ω}{C}$ (15) where C is obtained according to Equation (14).

The detailed ICMSOM-PSO algorithm is described as Algorithm 2, and the flow chart is shown in Fig. 2.

Algorithm 2. ICMSOM-PSO

Step 1: A set of winning neuron weights is generated by ICMSOM algorithm.

Step 2: A swarm of particles with random position is generated in the search space, representing the generated set of winning neuron weights.

Step 3: All the particles are evaluated by a fitness function.

Step 4: Each particle updates its velocity and position according to Equations (9) and (10).

Step 5: At the end of each iteration pbest and gbest are calculated.

Step 6: If the quality of solution found by a particle is higher than its previous pbest, this solution will be the new pbest for that particle, and the pbest among all particles is selected as gbest.

Step 7: Once the termination condition is met, output the final solution, otherwise go to Step 4.

Step 8: END.

Fig. 2

The flowchart of ICMSOM-PSO.
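The particle-swarm part of Algorithm 2 (Steps 2 to 7) can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: the fitness of Equation (15) is replaced by a hypothetical stand-in (`neg_sphere`), particles start at random positions rather than from ICMSOM winning-neuron weights, a fixed number of iterations serves as the termination condition, and all names and parameter values are chosen for the example.

```python
import random


def neg_sphere(x):
    """Hypothetical stand-in fitness (higher is better, optimum at the origin)."""
    return -sum(v * v for v in x)


def pso_maximize(fitness, dim, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    # Step 2: random initial positions; velocities start at zero.
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    # Step 3: evaluate every particle.
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]
    for _ in range(n_iter):
        for i in range(n_particles):
            # Step 4: velocity and position update, Equations (9) and (10).
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            # Step 6: keep the better of the new position and the old pbest.
            f = fitness(xs[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], f
        # Steps 5 and 6: the best pbest of the swarm becomes gbest.
        g = max(range(n_particles), key=lambda i: pbest_fit[i])
        gbest, gbest_fit = pbest[g][:], pbest_fit[g]
        history.append(gbest_fit)
    # Step 7: the iteration budget is the termination condition here.
    return gbest, gbest_fit, history


if __name__ == "__main__":
    best, best_fit, history = pso_maximize(neg_sphere, dim=3)
    print(best_fit)
```

Because pbest values are only ever replaced by better ones, the recorded gbest fitness never decreases from one iteration to the next.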

4 Experimental analysis

4.1 Dataset

In this section, we demonstrate the performance of the ICMSOM-PSO algorithm given in Section 3. To test the proposed algorithm, three cancer recognition datasets were collected. A summary of these datasets is given in Table 1. The Leukemia dataset [21] consists of 7129 genes and 72 samples of two types: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The training set contains 38 samples (27 ALL and 11 AML), while the testing set contains 34 samples (20 ALL and 14 AML). The SRBCT dataset [22], on small round blue cell tumors, consists of 2308 genes and 88 samples. The breast cancer dataset [23] consists of 9216 genes and 84 samples, which include 18 ER+ (estrogen receptor positive) samples and 20 ER− samples.

4.2 Experimental process

In this experiment, the operating environment is a Lenovo PC running Windows 7 with a 3.1 GHz CPU and 4 GB of RAM. The experimental software is MATLAB 2013a. The inputs and outputs of the dataset are normalized over the interval [–1, 1]. In the training stage of the SOM algorithm, the maximum number of iterations is set to 500, the initial learning rate η₀ is 0.3, and the initial neighborhood region h_j,C (0) is 1. Since feature value scaling can enhance pattern recognition accuracy, the feature values are normalized to [0, 1], and δ is set to 0.15. The normalization is given by Equation (16): $f_{value}^{'} = \frac{f_{value} - {value}_{min}}{{value}_{max} - {value}_{min}}$ (16) where $f_{value}^{'}$ is the scaled value of a feature, f_value is the original value of the feature, value_max is the upper boundary of the feature value, and value_min is the lower boundary of the feature value.
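Equation (16) is ordinary min-max scaling applied to each feature. A minimal sketch follows. The function name and the handling of a constant feature (mapped to 0, since the denominator would otherwise be zero) are assumptions made for the example.

```python
def min_max_scale(values):
    """Scale one feature to [0, 1] as in Equation (16)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```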

To validate the effectiveness of the proposed method, we compare it with the single SOM and single PSO methods, which have also been used for gene selection. The parameter settings of these two algorithms are the same as for the ICMSOM-PSO method. Three popular classification algorithms are employed to evaluate the quality of these methods: linear support vector machine (LSVM), k-nearest-neighbor classifier (KNN), and CART. Ten-fold cross-validation is employed to determine the training set and test set. In this stage, all the samples are first divided into training samples and testing samples. The ten-fold procedure is: (1) the n samples are divided randomly into 10 subsets of equal size; (2) 9 of the 10 subsets are used for gene selection and to train a classifier using the selected genes; (3) the remaining subset is used to test the performance. After ten runs, the average and standard deviation are output as the final results.

Here, Acc (%) is calculated by ten-fold cross-validation, and Avg (N) is the average number of genes selected each time. Table 2 shows the testing accuracy and the number of genes selected over 10 runs on the three datasets based on LSVM; the average Acc values for the three datasets are 95.0%, 88.6%, and 94.4%, respectively. Table 3 shows the testing accuracy and the number of genes selected based on CART; the average Acc values are 95.0%, 88.9%, and 94.4%, respectively. Table 4 shows the corresponding results based on KNN; the average Acc values are 95.0%, 89.0%, and 94.4%, respectively.
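The splitting part of the ten-fold procedure can be sketched as follows. The function name is hypothetical, and the gene selection and classifier training of step (2) are only indicated by a comment because they depend on the method under test. When n is not a multiple of 10, fold sizes differ by at most one sample.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for a random k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds nearly equal subsets.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(72):
        # Step (2): gene selection and classifier training would use only
        # `train` here, so that the held-out fold stays unseen until step (3).
        print(len(train), len(test))
```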

Table 2

LSVM accuracies with selected genes on three datasets based on ICMSOM-PSO

Running times	Data set
	Leukemia		SRBCT		Breast
	Acc (%)	Avg (N)	Acc (%)	Avg (N)	Acc (%)	Avg (N)
1	96.3	22	91.3	19	97.0	28
2	95.2	22	90.4	18	96.2	27
3	93.6	19	86.6	16	94.6	27
4	94.2	19	87.6	17	95.7	26
5	96.9	18	88.3	18	94.7	25
6	93.8	19	88.7	17	93.2	24
7	94.3	21	87.1	16	92.4	26
8	94.4	20	90.5	18	93.2	26
9	95.8	19	88.2	15	92.7	25
10	95.5	21	87.3	16	94.3	26
Average	95.0	20	88.6	17	94.4	26
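The Average row is the arithmetic mean of the ten runs. As a small check, the snippet below recomputes the mean of the Leukemia Acc column of Table 2 and also reports the sample standard deviation mentioned in the experimental procedure. The variable names are illustrative, and the standard deviation is computed here from the listed values rather than quoted from the paper.

```python
import statistics

# Leukemia Acc (%) values for runs 1 to 10, copied from Table 2.
leukemia_acc = [96.3, 95.2, 93.6, 94.2, 96.9, 93.8, 94.3, 94.4, 95.8, 95.5]

mean_acc = statistics.mean(leukemia_acc)
std_acc = statistics.stdev(leukemia_acc)  # sample standard deviation

print(f"mean = {mean_acc:.1f}, std = {std_acc:.2f}")
```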

Table 3

CART accuracies with selected genes on three datasets based on ICMSOM-PSO

Running times	Data set
	Leukemia		SRBCT		Breast
	Acc (%)	Avg (N)	Acc (%)	Avg (N)	Acc (%)	Avg (N)
1	97.2	22	92.5	19	97.3	27
2	94.1	22	91.6	17	96.4	27
3	93.6	19	86.8	18	93.4	26
4	95.2	18	86.7	17	95.5	26
5	97.3	18	88.3	17	95.1	26
6	93.7	19	88.7	16	94.2	25
7	94.3	22	86.9	16	92.6	24
8	94.3	20	92.3	17	92.4	24
9	94.7	19	88.5	16	93.7	25
10	95.6	21	86.7	15	93.4	26
Average	95.0	20	88.9	16.8	94.4	25.6

Table 4

KNN accuracies with selected genes on three datasets based on ICMSOM-PSO

Running times	Data set
	Leukemia		SRBCT		Breast
	Acc (%)	Avg (N)	Acc (%)	Avg (N)	Acc (%)	Avg (N)
1	96.8	22	91.6	18	97.0	26
2	96.4	21	91.0	17	96.4	27
3	95.2	19	90.1	17	94.5	27
4	94.2	19	88.6	16	95.4	26
5	95.9	18	88.7	16	95.1	26
6	94.1	18	87.7	17	94.6	25
7	93.4	21	87.2	16	93.4	24
8	93.7	21	89.5	16	92.2	24
9	94.7	18	88.3	15	92.1	25
10	95.6	22	87.3	16	93.3	26
Average	95.0	19.9	89.0	16.4	94.4	28.3

4.3 Result analysis

Table 5 gives the number of selected genes and the performance based on SOM, and Table 6 gives the same for PSO. From these tables, we can see that the classification accuracy on the Leukemia data using the ICMSOM-PSO method is 95.0%, which is better than that of single SOM and PSO. For the SRBCT data, the accuracy obtained by the proposed method (88.6% to 89.0%, depending on the classifier) also compares favorably with the results obtained by the single SOM and PSO methods. For the Breast cancer data, the classification accuracy of the proposed method is 94.4%, which is also better than that of single SOM and PSO. We can conclude that the proposed method performs better for gene selection than single SOM or PSO.

Table 5

Number of selected genes and performance based on SOM

Data set	LSVM		CART		KNN
	Acc	Avg	Acc	Avg	Acc	Avg
	(%)	(N)	(%)	(N)	(%)	(N)
Leukemia	88.2	30	87.8	30	90.2	29
SRBCT	79.8	25	78.5	26	82.6	26
Breast	82.3	40	83.6	42	84.5	40

Table 6

Number of selected genes and performance based on PSO

Data set	LSVM		CART		KNN
	Acc	Avg	Acc	Avg	Acc	Avg
	(%)	(N)	(%)	(N)	(%)	(N)
Leukemia	90.3	34	88.8	33	92.2	33
SRBCT	84.4	26	80.5	28	86.6	27
Breast	83.6	42	83.8	40	88.5	42

5 Conclusion

Clustering and classification are key tasks of gene identification. To handle continuous attributes, this paper proposes an improved Self-organizing map clustering algorithm based on a neighborhood mutual information correlation measure, combined with the PSO algorithm for feature selection. We have demonstrated that this approach reduces the number of genes selected and increases the classification accuracy. Experimental results show that this algorithm outperforms the single SOM and PSO approaches. In the present research, the cluster configuration is studied only by qualitative analysis. To gain a thorough understanding of gene expression profiles and to extend our model to improve its generalization, we have to investigate cluster quality further, in terms of shape, size, distribution, and so on. This is the focus of our future work.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61370169, 60873104), the Key Project of Science and Technology Department of Henan Province (Nos. 112102210194, 142102210056), the Science and Technology Research Key Project of Educational Department of Henan Province (Nos. 12A520027, 13A520529), the Key Project of Science and Technology of Xinxiang Government (No. ZG13004), and the Education Fund for Youth Key Teachers of Henan Normal University.

References

1. Shreem S.S., Abdullah and Nazri M.Z.A., Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm, International Journal of Systems Science (2014), 1312–1329.

2. Guerrero-Enamorado, Morell and Noaman A.Y., An algorithm evaluation for discovering classification rules with gene expression programming, International Journal of Computational Intelligence System 9(2) (2016), 9263–9280.

3. Wan, The research of fast clustering algorithm of high dimension data mining, International Journal of Digital Content Technology & Its Applic 7(2) (2013), 604–611.

4. …, Li and Sun, Feature gene selection method based on logistic and correlation information entropy, Bio-Medical Materials and Engineering 26(s1) (2015), S1953–S1959.

5. …, Xu, Sun, et al., An efficient gene selection technique based on fuzzy C-means and neighborhood rough set, Applied Mathematics & Information Sciences 8(6) (2014), 3101–3110.

6. Buonamente, Dindo and Johnsson, Hierarchies of Self-Organizing Maps for action recognition, Cognitive Systems Research 39 (2016), 33–41.

7. … J.L., Chang P.C., Tsao C.C., et al., A patent quality analysis and classification system using self-organizing maps with support vector machine, Applied Soft Computing 41(C) (2016), 305–316.

8. Wang, Fault diagnosis method based on fuzzy support vector machines and self-organizing map neural network, International Journal of Advancements in Computing Technology 4(19) (2012), 139–148.

9. …, Gao, Li, et al., A greedy correlation measure based attribute clustering algorithm for gene selection, Journal of Computers 8(4) (2013).

10. … W.H., Chan K.C.C., Wong A.K.C., et al., Correction to “attribute clustering for grouping, selection, and classification of gene expression data”, IEEE/ACM Transactions on Computational Biology & Bioinformatics 2(2) (2005), 83–101.

11. Budayan, Dikmen and Birgonul M.T., Comparing the performance of traditional cluster analysis, self-organizing maps and fuzzy C-means method for strategic grouping, Expert Systems with Applications 36 (2009), 11772–11781.

12. Zhang and Chai, Self-organizing feature map for cluster analysis in multi-disease diagnosis, Expert Systems with Applications 37 (2010), 6359–6367.

13. Curry, Davies, Evans and Moutinho, The Kohonen self-organizing map: An application to the study of strategic groups in the UK hotel industry, Expert Systems (2001), 19–31.

14. Chi S.C. and Yang C.C., Integration of ant colony SOM and K-means for clustering analyses, In 10th International Conference KES2006: Knowledge-Based Intelligent Information and Engineering Systems, 2006, pp. 1–8.

15. Hu, Pan, An, et al., An efficient gene selection technique for cancer recognition based on neighborhood mutual information, International Journal of Machine Learning & Cybernetics 1(1) (2010), 63–74.

16. Kennedy and Eberhart, Particle swarm optimization, IEEE International Conference on Neural Networks, 1995, pp. 1942–1948.

17. Shahreza M.L., Moazzami, Moshiri, et al., Anomaly detection using a self-organizing map and particle swarm optimization, Scientia Iranica 18(6) (2011), 1460–1468.

18. … S.T., Wu X.X. and Tan, Gene selection using hybrid particle swarm optimization and genetic algorithm, Soft Comput 12(11) (2008), 1039–1048.

19. Chiu D.K.Y. and Wong A.K.C., Multiple pattern association for interpreting structural and functional characteristic of biomolecules, Information Sciences 167(1) (2004), 23–39.

20. Wong A.K.C. and Liu T.S., Typicality, diversity and feature patterns of an ensemble, IEEE Trans on Computers C-24(2) (1975), 158–181.

21. Scheidegger, Sigg and Behra, Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Brain Research 501(2) (1989), 205–214.

22. Khan, Wei J.S., Ringnér, et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine 7(6) (2001), 673–679.

23. Perou C.M., Sørlie, Eisen M.B., et al., Molecular portraits of human breast tumours, Nature 406(6797) (2000), 747–752.
