Abstract
Research has proved that DNA Microarray data containing gene expression profiles are potentially excellent diagnostic tools in the medical industry. A persistent problem with regard to accessible microarray datasets is that the number of samples are much lesser than the number of features that are present. Thus, in order to extract accurate information from the dataset, one must use a robust technique. Feature selection (FS) has proved to be an effective way by which irrelevant and noisy data can be discarded. In FS, relevant features are picked, and result in commendable classification accuracy. This paper proposes a model that employs a compounded hybrid feature selection technique (Filter + Wrapper) to classify microarray cancer data. Initially, a filter method called Information Gain (IG) to eliminate redundant features that will not contribute significantly to the final classification is used. Following to that, an evolutionary computing technique (micro Genetic Algorithm (mGA)) to find the best minimal subset of required features is employed. Then the features are classified using a traditional Support Vector Classifier and also cross validated to obtain high classification accuracy, using a minimal number of features. The complexity of the model is reduced significantly by adding mGA, as opposed to already existing models that use various other feature selection algorithms.
Introduction
An important challenge that is being grappled with in the medicine industry is to classify a patient diagnosed with cancer, to their appropriate class. Class specific treatment can significantly reduce toxicity and increase the efficiency of the therapy. Traditional classification techniques that are being used today, however have relatively low prediction accuracies. With regard to non-invasive technologies, imaging and screening techniques often lead to misinterpretation of benign lesions as tumours or vice versa, since they are morphologically similar although molecularly different. For eg. Mammograms, used to diagnose breast cancer, are said to pick up only around 80–90% of tumours, leaving 10–20% of the tumours undetected. Invasive technologies like biopsies (examination of a body tissue for disease) and FNAC (fine needle aspiration cytology - uses a thin, hollow needle to extract a few cells from the tumour) may lead to side effects like slight bleeding, tenderness, bruising, pain etc. Owing to these disadvantages, analysis of microarray gene data has been considered as a better way to achieve more accurate results. Microarray analysis techniques that are used for interpreting the data, allow researchers to investigate the expression state of a large number of genes - in many cases, an organisms’ entire genome - in a single experiment.
Microarray gene expression and micro genetic algorithm
Many diseases and other biological processes are resultants of the activity of genes. Microarray data is widely used in the research concerning cancer prediction and classification. The process of genetic material being converted to microarray data is as follows: The messenger RNA (mRNA) levels in all the cells are measured, which is then converted to complementary DNA (cDNA). The cDNA levels are denoted by fluorescent markers. Since tumour cells are said to have a different gene expression pattern than normal cells, fluorescent green and red is used to differentiate between the two. A dot that glows bright red is a highly expressed gene in a tumour cell and a dot that glows bright green is a highly expressed gene in a normal cell. A gene highly expressed in both categories will glow yellow. In this way, based on the intensity of the colour of the dots, they are converted to numbers.
Microarray analysis techniques that are used for interpreting the data, allow researchers to investigate the expression state of a large number of genes - in many cases, an organisms’ entire genome - in a single experiment. But microarray datasets are high-dimensional, making it hard to be able to analyse the data without succumbing to the curse of dimensionality [1, 2]. This is where the role of Computer Science comes into play. With the development of data science and machine learning, various techniques by which large volumes of data can be handled efficiently and intelligently have been devised. One such technique is feature selection.
An evolutionary algorithm, in its truest sense, is a stochastic algorithm that helps to find an effective solution to problems in various domains. It makes use of mechanisms inspired by biological evolution like reproduction, mutation, crossover (recombination) and selection. Just like how biological evolution follows the principle [3] of ‘survival of the fittest’, evolutionary algorithms converge to a global optima. So given an initial population of individuals, the environment it is subject to causes natural selection, which in turn increases the overall fitness of the population. This fitness value is customised in a fitness function [4], which can be modelled using parameters relevant to the application. Based upon this fitness value, participants of a generation are retained and promoted to the next iteration. Variation in the population is caused by two main functions - Crossover and Mutation. Crossover is applied over two candidates, where with reference to the given number of points (n, i.e 1 point crossover, 2 point crossover and so on) they are recombined. Mutation is applied over one candidate, where slight modifications are made to the candidate to create a new member. Both crossover and mutation functions are dependant on the specified probabilities - crossover and mutationprobabilities.
The Micro Genetic Algorithm is a simple yet robust evolutionary algorithm that can solve the most complex problems much faster than most other pure (non-hybrid) heuristic methods. mGA is much faster than conventional Genetic Algorithms [5, 6] and gives better solutions without having to initialise certain input parameters like the mutation rate. The algorithmic structure of mGA is shown in Fig. 1. The initial population is downsized by close to a tenth. In the traditional GA, we normally initialize the initial population to about 20–30 individuals. mGA can drastically reduce both programming time, processing time and memory requirements.

Algorithmic structure of mGA.
Feature selection is a concept of machine learning and statistics, and as the name suggests is a method by which useful features contained in samples of high dimensions are identified and selected, without altering the original data. It is also known as variable selection, attribute selection or variable subset selection.
The proposed hybrid approach
Hybridizing classical approaches of solving real world problems with bio-inspired computation based approaches are the most common research area. The hybrid approaches are attributed to the advantages they render in achieving good solution with less computational complexity. This is due to the nature that the hybridization approaches are able to balance the strength and weakness of the constituent approaches. By considering the advantages of the hybrid approaches the model proposed in this paper hybridizing the filter method for feature selection with mGA for feature reduction [7]. Proposing different hybrid approaches for classification of medical data for prediction is most popularly carried out in state-of-the-art [8–10]. The strategy followed to hybridize the chosen approaches and the results obtained through hybridized model is explained in this section.
The microarray data [11] for training the model is obtained from the Kentridge Biomedical Repository. This repository contains labelled microarray datasets for the following types of cancers - Lung, Ovarian and Leukemia. Also, the samples contained within these datasets are for a set of patients, wherein some are normal samples and some are cancerous samples, i.e they are binary. These datasets are stored in. csv file format (comma separated values), and directly imported to Microsoft Excel. The raw, unprocessed, dataset is interpreted that each row is for patient and each column is for a gene expression levels of each feature. The last column denotes whether the patient is diagnosed with cancer or not.
Owing to the largest size of the dataset, in order to simplify the dataset dimension-wise, the filter method is used for feature selection. It treats each feature individually and not taking the relationship between features. Some commonly used filter methods on microarray data are Correlation based feature selection and Information Gain etc. The filter method employed to select relevant genes, for our experiment, is Information Gain. Information gain (IG) is a feature ranking method based on decision trees that exhibits good classification performance. Information gain selects features that give out the most information about the classes [12]. It is a measure based on entropy - it indicates to what extent the whole entropy is reduced if we know the value of a specific attribute. Therefore, the IG value indicates how much information this attribute contributes to the data set. Each feature has its own IG value which determines whether this feature is to be selected or not. A threshold value is used for checking the features; if a feature has a greater IG value than the threshold, the feature is chosen; otherwise, it is not selected.
Information gain is the amount of information that’s gained by knowing the value of the attribute, which is the entropy of the distribution before the split minus the entropy of the distribution after it (Refer Equations (1 and 2)). A higher information gain will result in a higher likelihood of obtaining pure classes in a target class. This part of work is carried out using the data mining tool WEKA.
Next, since the wrapper method of feature selection is more complex, the proposed model uses an Evolutionary Computing approach for further feature reduction. The micro genetic algorithm (mGA) is used for this model. This algorithm is similar to the conventional Genetic Algorithm (GA), except the initial population size is downsized by close to a tenth. The mGA approach implemented following to the filter method performed a systematic random search to identify a reduce set of features with increased accuracy as its fitness. The step by step working of this proposed model is presented in next section (Section 5). The algorithms used in the proposed model are described below
ALGORITHM GENERATE-POPULATION (SIZE, LENGTH):
POP: = { } : Empty population
for i = 1 - >SIZE do
IND = “ ” : Individual
for j = 1 - >LENGTH do
IND.append (random(1))
end
end
end
ALGORITHM FITNESS (POPULATION)
T = 0.5 + n- - n+/n- + n+′; Adjust threshold for SVC
B = – 1 : best fitness seen so far
I = “ ” : best individual seen so far
for i = 1 - >PS do
XS: = POPULATION [i]: Selected Individuvals
A: = SCORE (SVC, XS, y): Accuracy of selected genes
if A >B then
I: = POPULATION [i]
B: = A
end
end
RETURN I, B
End
ALGORITHM SELECT-FEATURES (X, y)
P: = 5 : population size
F: = X.size.columns : Number of genes per individual
POP: = { } : Population
N = 200 : Number of generations
K = 30 : Number of generations with no improvement
limit
/*Initial population generation*/
POP: = GENERATE-POPULATION (PS,F)
k = 0
best = – 1 : Best score so far
I = “ ” : Best individual so far
for n = 1 – >N
SELECTION (POP)
CROSSOVER (POP)
MUTATE (POP)
INDIVIDUAL, fitness = FITNESS (POP)
/* Returns the best individual and its fitness score*/
If fitness >best then
best: = fitness
I = INDIVIDUAL
else
k: = k + 1
end
/* Kill the population and restart with thebest one*/
if k == K then
k: = 0
POP: = { I}:Initialize the population with the best
individuval
POP.append(GENERATE_POPULATION(PS - 1,F))
end
RETURNI
End
This work is coded using Python programming language, by importing necessary functions from the library ‘scikit learn’, and importing necessary files from the DEAP (Distributed Evolutionary Algorithms). The step by step processes are Load the specific datasets onto the system. Set the threshold value, using the parameters alpha and delta, which is used to classify the samples. Initialise the classifier, this work uses SVC (Support Vector Classifier) As per the specified population size (minimal, since mGA), pick the initial population and randomly assign 0’s or 1’s. The population is varied using crossover, mutation and selection functions. For each chromosome (size = no. of features), the selected individuals (1’s) are scored with the help of the SVC. The accuracy, which is measured using the Equation (3), is used as fitness. This process is repeated over a number of iterations. The best individuals and their fitness values are retained before moving on to subsequent generations.
The details (Number of Samples, Number of Features and Accuracy) of the raw dataset are presented in Table 1 for three different datasets (Lung Cancer, Ovarian Cancer and CNS cancer). The results obtained after the filtering phase is shown in Table 2. The final results obtained after involving the mGA algorithm is shown in Table 3. The Table 1 shows that for none of the data set the classification accuracy is 100%. On comparing Tables 1 and 2, it is found that there are good reductions in the number of features and significant improvement in the classification accuracy. Comparison of Tables 1–3 reveals the fact that the proposed hybrid algorithm is able to reduce the number of features by 65% on average, on all the date set, while still giving significant increase in classification accuracy.
Raw datasets statistics
Post-filter statistics
Post-mGA statistics
For the Lung Cancer data the proposed approach achieves 100% accuracy with 1706 number of features. Then, for the Ovarian cancer data the proposed approach able to achieve 95.05% of accuracy with 2844 number of features. Finally, for the cancer data the proposed approach has given 92.86% accuracy with only 29 features. The obtained results proved the superiority of the proposed hybrid approach.
An approach to employ hybrid feature selection technique is proposed to classify the microarray cancer data. This approach involves a filter method, initially, to remove least significant features. From the reduced feature set a best minimal set is generated using an Evolutionary Algorithm. This final reduced set is used for classification by Support Vector machines. The obtained results showed that the proposed approach reduces the features to great extent with significant increase in the classification accuracy.
Since the classification problem dealt in this work is to detect cancer 100% accuracy is necessary. Thus this approach can be tweaked slightly, in order to give rise to a higher accuracy. Also, designing a graphical interface to the system would be another useful extension, as the representation of results on a GUI is always better interpreted.
