Abstract
Among the large amount of genes presented in microarray gene expression data, only a small fraction of them is effective for performing a certain diagnostic test. It is for this reason that reducing the dimensionality of gene expression data is imperative. An improved Self-organizing map method based on neighborhood mutual information correlation measure is proposed, and then combines with Particle swarm optimization method to construct an efficient gene selection algorithm, denoted by ICMSOM-PSO. Experimental results show that the proposed method can reduce the dimensionality of the dataset, and confirm the most informative gene subset and improve classification accuracy.
Introduction
In recent years, gene expression profiles (GEP) based molecular diagnosis of tumor have attracted a great number of medical researchers and computer scientists for the goal of realizing precise and early tumor diagnosis [1–3]. However, the curse of dimensionality caused by high dimensionality and small sample size of tumor dataset seriously challenges the tumor classification. So how to select important gene subsets from thousands of genes in GEP dataset to drastically reduce the dimensionality of tumor dataset is the first key step to address this problem [4].
Clustering is a main task of explorative data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics, etc. [4]. When applied to gene expression data, conventional clustering algorithms may encounter a problem that a huge number of genes (attributes) versus a small number of samples [5].
Self-organizing map (SOM) [6–8] has a number of features that make them particularly well suitable for clustering and analyzing of gene expression patterns. They are ideally suited to explore data analysis, allowing one to impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the non-structure ofK-means clustering) and facilitating easy visualization and interpretation. SOM has many good computational properties and is scalable to large data sets. As for correlation measures, Euclidean distance and Pearson’s correlation coefficient are widely used for clustering [9]. However, for measuring thecorrelation between genes, Euclidean distance is not effective enough to describe functional similarity such as positive or negative correlation in values. Empirical studies have shown that it may assign a high similarity score to a pair of dissimilarity genes [10]. Furthermore, most clustering methods are not able to cope with continuous attributes effectively, which is also a distinctive characteristic of gene expression data. When applied to the continuous attributes, conventional methods commonly discretize the continuous data into a finite number of intervals for data mining. But discretization may lead to information loss. As we know, Self-organizing map utilizes Euclidean distance measuring the correlation between genes. However, in this paper, we use neighborhood mutual information instead of Euclidean distance to evaluate the correlation between genes. By applying improved SOM to gene expression data, clusters of genes based on their neighborhood mutual information correlation can be discovered.
In this paper, we present a gene selection method using combined particle swarm optimization method with the improved SOM. Firstly, the improved SOM was used to preprocess the original gene expression data, and then obtained winning neuron weights. Furthermore, particles were initialized with random position, representing the generated set of winning neuron weights. Finally, the identified relevant feature subsets were tested by the particle swarm optimization method, which tried to determine optimal feature subsets.
The rest of this paper is organized as follows: Section 2 introduced the concepts of Self-organizing map clustering and neighborhood mutual information and particle swarm optimization method. An effective and efficient attribute clustering algorithm was proposed in Section 3. In order to evaluate the performance of the proposed algorithm, three gene expression datasets were applied. The experimental results were presented in Section 4. Finally, the conclusion is drawn in Section 5.
Related work
Self-organizing map
Self-organizing map [11, 12] is an unsupervised neural network computational mapping technique that forms ordered non-linear projection of high dimensional input data items to a low, often one or two dimensional grids. SOM is a clustering tool which can convert the non-linear statistical relationships between high dimensional data into simple geometric relationships of their image points on a low-dimensional display. By that way, the data points which show similar properties are placed closely to each other within the output of SOM algorithm [7].
SOM consists of a set of neurons which usually are arranged in a two dimensional structure, there exist neighborhood relations among the neurons, which dictates the topology and structure. As it is shown in Fig. 1, the network of SOM usually consists of two layers of neurons: an input layer and output layer. Although, the neurons on input layer are fully connected to the neurons that on output layer, the neurons on each layer have no connection to the neurons in that layer. Neurons on the output layer are arranged in either a rectangular or hexagonal lattice. The selection of lattice shape affects the performance of the SOM due to the number of neighbor of neurons. The performance of the SOM by using a hexagonal shape is expected to be higher since more neighbors are modified in the hexagonal lattice when it is compared with a rectangular lattice [11]. Each neuron is represented by an n-dimensional weight vectors. The algorithm of SOM is initialized by assigning the values of weight vectors of each output neuron linearly or randomly. Training process of SOM starts by representing a data point randomly in the network. The distances between these data points and the weight vectors of all neurons are computed by using distance measures such as Euclidean distance. By comparing these distances, the nearest Kohonen neuron, is identified as the ‘winning’ neuron. The weights of the winning neuron are adjusted in order to get close to the actual data point. The weights of neighboring neurons are also updated, so that the order of the input space can be satisfied [13]. In the SOM network, the character mapping is topologically ordered and character choosing. Topologically “ordered” means the space position of the neuron in the network being is mapped from the character or some territory of the input samples. “Character” choosing means that when a data is given in the input space of non-linear distributed, SOM can choose the best character for the approach [12, 14]. Both of the two characters make the SOM network suitable for the disease feature map of the construction.

Structure of SOM.
There are several steps in the application of the algorithm. To get the winner, competition and learning are needed in the process. The steps above are repeated until the feature mapping is formed. The process is as follows: Initialization: Choose random values for the initial weights w
j
. Winner Finding: Find the winning neuron C at the time t, using the correlated criterion: Weights Updating: Adjust the weights of the winner and its neighbors, using the following rule:
Repeat the Steps (2) and (3) until the changes in the disease feature mapping are very small or the maximum iteration time is reached.
Hu et al. [15] proposed neighborhood mutual information to cope with continuous gene data, evaluating the relevance between attributes.
There is a problem to employ mutual information in gene evaluation due to the difficulty in estimating probability density of genes. So neighborhood mutual information combines the concept of neighborhood with information theory, and generalizes Shannon’s entropy to numerical information. Training samples are usually given as vectors of attribute values and the attributes are numerical, as shown in Table 1, where A1 and A2 are two attributes, while C is the decision label of samples.
Experiment data sets
Experiment data sets
Let U = {x1, x2, ⋯ , x
n
} be a set of samples described with gene set F, and Δ is a distance function on U, δ ≥ 0 is a constant, then the neighborhood of sample x is denoted by
Given S ⊆ F is a subset of genes, the neighborhood of sample x
i
in S is denoted by δ
S
(x
i
). The neighborhood uncertainty of x
i
is denoted by
Given R, S ⊆ F are two subsets of genes, the neighborhood of sample x
i
in gene subspace S ∪ R is denoted by δS∪R (x
i
), and the joint neighborhood entropy of S ∪ R is computed as
Let R, S ⊆ F be two subsets of genes, then the neighborhood mutual information of R and S is denoted by
Nowadays, there have existed many types of optimization algorithms, such as GA, DA, PSO. The PSO algorithm has been widely applied in different fields owing to its simple concept, easy to implement and fast to converge.
Particle swarm optimization (PSO) originated from the simulation of social behavior of birds in a flock, which was developed by Kennedy and Eberhart [16–18]. Compared with other evolution strategy, PSO has kept the global search strategy based on population. In PSO, physical position is not an important factor. The member that is called particle is initialized by assigning random positions and velocities, and flies in the search space with a velocity adjusted by its own flying memory and its companion’s flying experience.
During each iteration, every particle is accelerated towards its own personal best, as well as in the direction of the global best position. All particles have fitness values which are decided by a fitness function. Each particle updates its own position and velocity according to the Equations (9 and 10) in every iteration.
Where w is the inertia weight,
The novel correlation measure
As for correlation measures, Euclidean distance and Pearson’s correlation coefficient are widely used for clustering [19]. However, for measuring the correlation between genes, Euclidean distance is not effective enough to describe functional similarity such as positive or negative correlation in values. Empirical studies have shown that it may assign a high similarity score to a pair of dissimilarity genes [10]. Conventional SOM uses Euclidean distance measuring the correlation between genes, while it can’t be able to effectively cope with continuous attributes which is also a distinctive characteristic of gene expression data. Au et al. presented an information measure to evaluate the correlation between attributes.
It is called the interdependence redundancy measure [20] between two attributes, A
i
and A
j
, i, j ∈ {1, ⋯ , m}, which is denoted by
In order to cope with the continuous attributes effectively, we redefine the winning neuron C.
There are several steps in the application of the algorithm. These are competition and learning,getting the winner in the process. The steps above are repeated until the feature mapping is formed. The detailed ICMSOM algorithm is described as Algorithm 1.
Step 1: Initialization: Choose random values for the initial weights w j .
Step 2: Winner Finding: Find the winning neuron C at time t, according to the Equation (14).
Step 3: Weights Updating: Adjust the weights of the winner and its neighbors, according to the Equations (2 and 3).
Step 4: Repeat the Steps 2 and 3 until the changes in the disease feature mapping are very small or the maximum iteration time is reached.
This paper proposes an improved Self-organizing map (SOM) method, and then combines it with Particle swarm optimization (PSO) method to construct an efficient gene selection algorithm, denoted by ICMSOM-PSO.
The detailed ICMSOM-PSO algorithm is described as Algorithm 2, and the flow chart is shown in Fig. 2.
Step 1: A set of winning neuron weights is generated by ICMSOM algorithm.
Step 2: A swarm of particles with random position is generated in the search space, representing the generated set of winning neuron weights.
Step 3: All the particles are evaluated by a fitness function.
Step 4: Each particle updates its position and velocity according to Equations 6 and 7.
Step 5: At the end of each iteration pbest and gbest are calculated.
Step 6: If the quality of solution found by a particle is higher than its previous pbest, this solution will be the new pbest for that particle, and the pbest among all particles is selected as gbest.
Step 7: Once the termination condition is met, output the final solution, otherwise go to Step 4.
Step 8: END.

The flowchart of ICMSOM-PSO.
Dataset
In this section, we shall demonstrate the performance of our algorithm ICMSOM-PSO given in Section 3. In order to test the proposed algorithm, three cancer recognition datasets are collected. A review of these sets is given in Table 1. The Leukemia dataset [21] consists of 7129 genes and 72 samples from two different types of samples: acute lymphblastic leukemia (ALL) and acute myloid leukemia (AML). The training dataset contains 38 samples (27 ALL and 11AML) while testing dataset consists of 34 samples (20 ALL and 14AML). The SRBCT dataset [22] is the small round blue cell tumors, consists of 2308 genes and 88 samples. The breast cancer dataset [23] consists of 9216 genes and 84 samples which contain 18 ER+(estrogen receptor) samples and 20 ER-samples.
Experimental process
In this experiment, the operating environment is Lenovo Windows7 PC with 3.1 GHZ CPU and 4 GB RAM. The experimental software is MATLAB 2013a. The inputs and outputs of dataset are normalized over the interval [–1, 1]. In the training stage of SOM algorithm, the maximum iterative time is set as 500, and the initial learning rate η0 is 0.3, the initial neighborhood region hj,C (0) is 1. Since feature value scaling can enhance pattern recognition accuracy, the values are normalized to [0,1], and set δ = 0.15. The normalization is given by the Equation (16):
To validate the effectiveness of the proposed method, we compare our proposed method with single SOM and single PSO method, which have been also used for gene selection. The corresponding parameter settings of the two algorithms are the same as the ICMSOM-PSO method. Three popular classification algorithms are employed for evaluating the quality of these method. Linear support vector machine (LSVM), k-nearest-neighbor classifier (KNN) and CART are introduced to compute classification performance. Ten-fold-cross-validation is employed to determine the training set and test set. In this stage, all the samples are firstly divided into training samples and testing samples. The tenfold procedure is: (1) the n samples are divided randomly into 10 subsets of equal size; (2) 9 of the 10 subsets are used for gene selection and to train a classifier using the genes selected; (3) the remained subset is used to test the performance. After running ten times, the average and standard deviation are output as the final results.
Here, Acc (%) is calculated by ten-fold cross-validation, Avg (N) is the average numbers of selected genes each time. As we can see, the Table 2 shows the testing accuracy and number of genes selected in 10 times on the three datasets based on LSVM, and the average results of Acc for three datasets are 95.0%, 88.6%, and 94.4% respectively. The results in Table 3 show that the testing accuracy and number of genes selected based on CART. The average results of Acc for three datasets are 95.0%, 88.9%, and 94.4% respectively. Referring to Table 4, the genes selected based on KNN, The average results of Acc for three datasets are 95.0%, 89.0%, and 94.4% respectively.
LSVM accuracies with selected genes on thress datasets based on ICMSOM-PSO
Cart accuracies with selected genes on thress datasets based on ICMSOM-PSO
KNN accuracies with selected genes on thress datasets based on ICMSOM-PSO
Table 5 gives the number of selected genes and performance based on SOM. PSO based on the number of selected genes and performance is shown in Table 6. From these tables, we can see that the best classification results are 95.0% on Leukemia data using the ICMSOM-PSO method, which is better than that of single SOM and PSO. For the SRBCT data, the accuracy is obtained by the proposed method, which is also competitive to the results obtained by single SOM and PSO method. For the Breast cancer data, the best classification accuracy is 94.4% by the proposed method, which is also better than that of single SOM and PSO. We can draw a conclusion that the proposed method has better performance for gene selection compared to single SOM or PSO.
Number of selected genes and performance based on SOM
Number of selected genes and performance based on SOM
Number of selected genes and performance based on PSO
Clustering and classification are key tasks of gene identification. In virtue of the continuous attributes, in this paper, an improved Self-organizing map clustering algorithm based on neighborhood mutual information correlation measure are proposed, combined with PSO algorithm for feature selection. Then we have demonstrated that this approach reduces the number of genes selected and increases the classification accuracy rate. Experimental results show that this algorithm outperforms the other approaches. In the recent research, the cluster configuration is studied by using qualitative analysis. In order to get thorough understanding about gene expression profiles, and extend our model to improve its generation, we have to investigate the cluster quality further, which refers to its shape, size and distribution, etc. That is what we want to be involved with in the further concern.
Footnotes
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Nos. 61370169, 60873104), the Key Project of Science and Technology Department of Henan Province (Nos. 112102210194, 142102210056), the Science and Technology Research Key Project of Educational Department of Henan Province (Nos. 12A520027, 13A520529), the Key Project of Science and Technology of Xinxiang Government (No. ZG13004), and the Education Fund for Youth Key Teachers of Henan Normal University.
