Abstract
Recently, available data has increased explosively in both number of samples and dimensionality. This abundance of high-dimensional data brings with it noisy, redundant and irrelevant dimensions. Such dimensions can increase the time and computational cost of the learning process and even degrade the performance of learning tasks. One way to reduce dimensionality is Feature Selection (FS). The aim of this paper is to study feature selection based on expert knowledge and on traditional methods (filter, wrapper and embedded) and to analyze their performance in classification tasks. Three datasets from the human cancer domain were used for feature selection: Breast Cancer (BC), Primary Tumor (PT) and Central Nervous System (CNS). C4.5, K-Nearest Neighbors, Support Vector Machine and Multi Layer Perceptron were trained with the best subset of features for each cancer dataset. The subset of features selected by the wrapper method achieves the best average accuracy on the BC and PT datasets, while the subset selected by the embedded method reaches the highest average accuracy on the CNS dataset.
Introduction
We are living in the era of the data deluge, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation [1, 2]. Large datasets may contain tens or hundreds of attributes, several of which may be irrelevant to classification tasks in machine learning [3, 4]. In other words, high dimensionality increases the time and computational cost of training and also degrades the performance of classifiers, owing to irrelevant, redundant and noisy dimensions [5, 6].
Feature selection (FS) reduces the dimensionality of a dataset by selecting a subset of the most relevant features [7]. Different authors [8, 9] assert that feature selection methods offer several advantages: improved classifier performance; better visualization and understanding of the data; reduced time and computational cost; and simpler models that train faster.
There are two main approaches to feature selection: selection based on expert knowledge [10] and traditional FS, namely filter, wrapper and embedded methods [9, 12].
The aim of this paper is to study the feature selection approaches (expert knowledge and traditional methods) and to analyze their performance in classification tasks. Datasets from the human cancer domain were used for feature selection. The remainder of this paper is organized as follows: Section 2 describes the cancer datasets, the feature selection approaches and the related work; Section 3 describes the experimentation with the feature selection approaches; Section 4 presents the results of the classification algorithms using the subset of features selected for each cancer dataset; and Section 5 presents conclusions and future work.
Material and methods
This section presents the feature selection approaches, the related work and the description of the cancer datasets.
Background
High-dimensional data in machine learning contains features that can be irrelevant, misleading or redundant. These features enlarge the search space, make the data harder to process and do not contribute to the learning process [13]. Feature selection approaches based on expert knowledge and on traditional methods are proposals for addressing high dimensionality.
Feature selection based on expert knowledge
The development of intelligent systems requires adequate representations of knowledge bases in order to analyze a given problem and generate the best possible solutions. Expert knowledge includes facts about the related domain and requires the use of data and information. Artificial Intelligence seeks to represent generalizations, that is, not to represent each individual situation but to group the situations that share important properties [14]. Therefore, a formal language is independent of the model in which it is situated; it makes it possible to represent all the elements or objects of an application domain and to make inferences that reach conclusions from the represented knowledge [15]. Some common knowledge representations are rules, frames, ontologies, semantic networks, object-oriented languages and Petri nets.
In particular, the ontology is one of the knowledge structures most widely adopted across domains. As Uschold et al. [16] define: “An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.” The knowledge obtained from experts can be modeled and stored in ontologies, so that it can be shared and used by different applications and communities [17].
Expert knowledge represented in an ontology makes the different variables of an application domain easy to understand. This allows the main relationships between the concepts that define meaningful events to be identified. In addition, the concepts of an ontology can be semantically enriched [18], which supports the search for similar terms and helps resolve the ambiguities generally present in linguistic comparisons.
A domain ontology was used as the expert knowledge source in order to identify the most relevant attributes in each dataset (see Section 3.1).
Traditional feature selection
Feature selection (FS) is the process of selecting, among all the features, the best ones for a given machine learning task [13]. FS algorithms fall into three categories:
Filter methods: features are scored by a ranking criterion, and attributes whose score falls below a threshold are removed [3]. Filter methods are applied before the machine learning task to filter out the less relevant attributes. These methods are considered faster and have a low computational cost, but they are less reliable in classification tasks than wrapper methods [13].
Wrapper methods: these are based on training a classifier on the feature space; features are added or removed according to the performance of the classifier [19]. These methods have a high computational cost when the feature space is large, because each feature set must be evaluated with the trained classifier [13].
Embedded methods: the aim is to reduce the computation time that wrapper methods spend reclassifying different subsets [20–22]. These methods incorporate feature selection into the training process. Thus, embedded methods are formulated as minimization problems with respect to the feature weights during training and the settings of a classifier [12].
One algorithm from each of these categories was selected (Section 3.2) for use on the datasets explained in Section 2.3.
Related works
This subsection reviews the current literature on two major topic areas. The first part covers related work on feature selection based on expert knowledge. The second part concerns related work on feature selection with traditional methods.
Feature selection based on expert knowledge
In [23], the knowledge and judgment of experts are used to guide a feature selection process through a wrapper technique. The process searches for the feature names that match the concepts given by the expert in relation to an objective variable, in this case one related to the monitoring of industrial processes. The expert thereby improves an unsupervised learning process using a dataset that contains the attributes most closely related to the problem. Similarly, [10, 24] compare feature selection based on expert knowledge with feature selection based on data mining techniques. The objective is to improve classification tasks on medical data, where datasets contain a large number of attributes. The results of the experiments show that the features selected from expert knowledge improve the sensitivity of a classifier, while the features selected by the data mining technique CFS (Correlation-based Feature Selection) improve the predictive power of a classifier in the majority class.
Expert knowledge can also be harnessed from domain ontologies, which hold such knowledge in a structured and hierarchical way. Along these lines, [25] presents a novel method that uses a domain ontology to extract the relevant features of a dataset for text classification. Searching for the feature names in the ontology makes it possible to determine their relevance from the hierarchical relations in the ontology. The experimental results show that the proposed approach improves the accuracy of the KNN classifier. In addition, performance on individual categories is more balanced when using a dataset of smaller dimensionality that still contains the most representative variables for the classification task.
Traditional feature selection
Numerous researchers address high dimensionality through traditional feature selection methods. Table 1 summarizes the main feature selection algorithms, organized by filter, wrapper and embedded methods. The algorithms shown in Table 1 are the most widely used according to different literature reviews [9, 26–29]. Most of the literature found on feature selection focuses on solving high dimensionality in bioinformatics databases.
Traditional feature selection algorithms
Selected attributes from ontology
*An updated list of attributes selected according to the ontology was provided by the authors of [39].
In order to compare against feature selection based on expert knowledge, three traditional FS algorithms were selected: (i) Chi-squared test (chi2), (ii) Sequential backward selection with random forest (SBS) and (iii) the SVM method of recursive feature elimination (SVM-R). Section 3.2 explains how they work and lists the features they selected for the datasets presented in Section 2.3.
Three public datasets from the human cancer domain were used, obtained from different follow-ups of patients' conditions. They correspond to those used by da Silva et al. [39] to present a proposal for the automatic, semantic pre-selection of features. Thus, besides the datasets being public, we could use the results of that research for our approach in Section 3. Two datasets were obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, and were used in [40]:
Breast Cancer (BC): The class in this dataset relates to the recurrence of cancer in patients who have received treatment. The dataset contains 9 nominal attributes with patient information and associated tumor characteristics. It has 286 instances, of which 201 correspond to non-recurrence events and 85 to recurrence events.
Primary Tumor (PT): This dataset contains 18 nominal attributes including the class, which represents the body part where the cancer originated or where the first tumor appeared. It has 339 instances, and there are 23 possible values for the class.
The last dataset corresponds to the one used in [41]: Central Nervous System (CNS): This dataset contains 60 samples of patients with central nervous system embryonal tumor, 7129 nominal attributes that correspond to genes and a class containing the information whether or not the patient survived the treatments. 21 instances correspond to survivors (labelled as “Class1”) and 39 to failures (labelled as “Class0”).
These datasets contain different numbers of attributes, and reducing them can help to improve classification tasks for the generation of prediction models. Such a reduction can be approached through data mining tasks or through the knowledge of experts in the area.
Feature selection: expert knowledge vs traditional methods
This section explains and compares the feature selection approaches. Expert knowledge and traditional methods are applied to the cancer datasets with the aim of obtaining the best subset of features.
Feature selection based on expert knowledge
In [39], a semantic selection of features in cancer-related datasets is proposed that makes use of a domain ontology. The names of the dataset attributes are searched semantically in a domain ontology, and a lexical ontology is used for word recognition. The domain ontology is the National Cancer Institute (NCI) ontology, available as a Thesaurus reference terminology [42], which provides a vocabulary for clinical care and integrates cancer-related clinical and molecular information. The ontology contains 127776 classes and 97 properties, with concepts related to cancers, findings, drugs, therapies, anatomy, genes, pathways, cellular and subcellular processes, proteins and experimental organisms. The names of the attributes in each dataset were extracted and searched within the NCI ontology to find the concepts related to each one. Since each dataset represents a different cancer-related problem, searching for the attributes in the ontology makes it possible to identify which variables are most related to the problem represented by the class of each dataset. The lexical ontology used is WordNet [43]. If the name of an attribute in the dataset is not found in the NCI ontology, it is searched in WordNet in order to find synonyms and hypernyms that are contained in the ontology.
As a result, the semantic comparison between the names of the attributes in each dataset and the concepts of the cancer domain ontology identifies the attributes most related to the class of each dataset. This makes it possible to reduce the dimension of the dataset using expert knowledge, and the task can be carried out by a person with no experience in the application domain. With an optimized dataset, classification tasks through machine learning can achieve better results.
Accordingly, the ontology was consulted in order to identify the attributes (features) most related to the class of each dataset, which describes a problem in the cancer domain. This task is an expert knowledge approach based on the NCI ontology, and the results are shown in Table 2.
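The lookup described in this subsection (search the attribute name in the domain ontology, then fall back to synonyms from a lexical resource) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy concept set and the toy synonym table are hypothetical and do not reproduce the NCI ontology, WordNet, or the semantic search used in [39], which also weighs how each concept relates to the class.

```python
from typing import Dict, List, Optional, Set


def normalize(name: str) -> str:
    """Lower-case a name and turn separators into spaces."""
    return name.lower().replace("-", " ").replace("_", " ").strip()


def match_attribute(name: str,
                    ontology_terms: Set[str],
                    synonyms: Dict[str, List[str]]) -> Optional[str]:
    """Return the ontology term matched by an attribute name, or None."""
    key = normalize(name)
    if key in ontology_terms:          # direct hit in the domain ontology
        return key
    for candidate in synonyms.get(key, []):   # fallback: lexical resource
        candidate = normalize(candidate)
        if candidate in ontology_terms:
            return candidate
    return None


def select_attributes(attributes: List[str],
                      ontology_terms: Set[str],
                      synonyms: Dict[str, List[str]]) -> List[str]:
    """Keep only the attributes that can be linked to an ontology concept."""
    return [a for a in attributes
            if match_attribute(a, ontology_terms, synonyms) is not None]


# Hypothetical toy data, for illustration only.
TOY_TERMS = {"age", "tumor size", "radiation therapy"}
TOY_SYNONYMS = {"irradiat": ["irradiation", "radiation therapy"]}

selected = select_attributes(
    ["age", "tumor-size", "irradiat", "record-id"], TOY_TERMS, TOY_SYNONYMS
)
print(selected)
```

In this toy run, `record-id` is dropped because neither its name nor any listed synonym appears in the concept set.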
Traditional feature selection
Three feature selection algorithms (FSA) were chosen to reduce the high dimensionality of the cancer datasets. Each algorithm corresponds to a different feature selection method.
Filter method: chi-squared test
The chi-squared test (chi2) was the filter method used. This method works on nominal attributes. Several authors affirm that chi2 is the most used statistical test for measuring divergence from the expected distribution [13, 27]. Chi2 takes as its null hypothesis that the occurrence of an attribute is independent of the class values. It thus evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class. If the class is found to be independent of the attribute, the attribute is discarded [33]. Table 3 presents the attributes selected by the chi-squared test.
Selected attributes by Chi-squared test
Figures 1–3 present the results obtained by the chi-squared test.

Results obtained by Chi-squared test in Breast Cancer dataset.

Results obtained by Chi-squared test in Primary Tumor dataset.

Results obtained by Chi-squared test in CNS dataset.
For the Breast Cancer dataset (Fig. 1), chi2 removed the attributes with importance (i) below 0.1, in this case node-caps (i = 0.09) and age (i = 0). Menopause is the most important attribute, with i = 0.53; the importance of the remaining attributes ranges from 0.13 to 0.14.
For the Primary Tumor dataset, chi2 discarded the attributes age, histologic-type, degree-of-diffe, lung, pleura, peritoneum, brain, neck and mediastinum. These attributes achieved an importance below 0.1, as shown in Fig. 2.
Fig. 3 shows the indices of the attributes of the CNS dataset with importance greater than 0. Chi2 removed 7058 attributes with importance equal to 0.
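The scoring and thresholding described in this subsection can be sketched as follows. This is a minimal illustration that computes the raw chi-squared statistic of a nominal attribute against the class from a contingency table; the function names, the toy columns and the threshold value are assumptions, and the raw statistic is not on the same scale as the importance values reported in Figs. 1–3.

```python
from collections import Counter
from typing import Dict, List, Sequence


def chi_squared(attribute_values: Sequence[str],
                class_labels: Sequence[str]) -> float:
    """Chi-squared statistic of one nominal attribute against the class."""
    n = len(class_labels)
    attr_counts = Counter(attribute_values)
    class_counts = Counter(class_labels)
    observed = Counter(zip(attribute_values, class_labels))
    statistic = 0.0
    for value, n_value in attr_counts.items():
        for label, n_label in class_counts.items():
            # Count expected under independence of attribute and class.
            expected = n_value * n_label / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic


def select_by_chi2(columns: Dict[str, Sequence[str]],
                   class_labels: Sequence[str],
                   threshold: float) -> List[str]:
    """Keep the attributes whose statistic reaches the threshold."""
    return [name for name, values in columns.items()
            if chi_squared(values, class_labels) >= threshold]


# Hypothetical toy data: one attribute that follows the class, one that does not.
labels = ["0", "0", "1", "1"]
toy_columns = {
    "informative": ["x", "x", "y", "y"],
    "noise": ["x", "y", "x", "y"],
}
print(select_by_chi2(toy_columns, labels, threshold=1.0))
```

An attribute distributed identically across classes scores 0 and is discarded, which mirrors the treatment of the zero-importance attributes above.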
Sequential backward selection (SBS) was used as the wrapper method, with random forest [44] as the wrapped classifier. In the first iteration, SBS trains the random forest with the full set of attributes; it then removes attributes one at a time in each subsequent iteration. Once an attribute is removed, performance is estimated using validation techniques such as cross-validation and measures such as precision and accuracy [4, 11]. Table 4 shows the subset of features selected by SBS.
Selected attributes by wrapper method
For the Breast Cancer dataset, SBS with the random forest classifier reaches its highest accuracy (0.42) with the attributes deg-malig and inv-nodes, as depicted in Fig. 4.

Results obtained by Sequential Backward Selection in BC dataset.
In the Primary Tumor dataset, the best subset of features found by SBS contains 15 variables, with an accuracy of 0.57, as presented in Fig. 6.

Results obtained by Sequential Backward Selection in CNS dataset.
In the CNS dataset, all subsets of features reach accuracies greater than 0.48. However, the highest accuracy is 0.73, for a subset of 73 variables, as shown in Fig. 5.

Results obtained by Sequential Backward Selection in PT dataset.
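The backward search described above can be sketched as a loop that, at each step, drops the attribute whose removal leaves the best-scoring subset, and remembers the best subset seen. This is a minimal sketch under stated assumptions: `score` stands in for the validated performance of the wrapped classifier (random forest in this paper), and the toy scorer with its integer gains is hypothetical and only serves to make the example self-contained.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_backward_selection(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Greedy backward elimination driven by a subset-scoring function."""
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > min_features:
        # One candidate subset per attribute that could be removed.
        candidates = [[f for f in current if f != removed] for removed in current]
        # Keep the removal that yields the highest score.
        top_score, current = max(
            ((score(c), c) for c in candidates), key=lambda pair: pair[0]
        )
        if top_score >= best_score:
            best_subset, best_score = list(current), top_score
    return best_subset, best_score


# Hypothetical stand-in for classifier performance: useful attributes add
# to the score, and every attribute carries a small cost.
TOY_GAIN = {"a": 30, "b": 20, "c": 0, "d": 0}


def toy_score(subset: List[str]) -> float:
    return sum(TOY_GAIN[f] for f in subset) - len(subset)


subset, value = sequential_backward_selection(["a", "b", "c", "d"], toy_score)
print(subset, value)
```

Because every candidate subset is scored at every step, the number of classifier evaluations grows quickly with the number of attributes, which is the computational cost attributed to wrapper methods.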
Support vector machine recursive feature elimination (SVM-R) was used as the embedded method [36]. Initially all attributes are considered, and the method gradually excludes those that do not help to separate samples of different classes. The usefulness of an attribute is judged by its weight, which results from training SVMs with the current set of attributes [26]. Recursive feature elimination is used to increase the likelihood of selecting the best features, and it includes cross-validation steps [45–47]. Table 5 presents the attributes selected by SVM-R.
Selected attributes by Support Vector Machine based on Recursive Feature Elimination
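The elimination loop described above can be sketched in a few lines. This is a simplified illustration, not the configuration used in the experiments: the linear SVM is trained by plain subgradient descent on the regularised hinge loss without a bias term, attributes are ranked by squared weight (an assumed, commonly used criterion), the cross-validation steps are omitted, and the hyperparameters and toy data are hypothetical.

```python
from typing import List, Sequence


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     epochs: int = 200, lr: float = 0.1,
                     lam: float = 0.01) -> List[float]:
    """Subgradient descent on the L2-regularised hinge loss (no bias term)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X: Sequence[Sequence[float]], y: Sequence[int],
            n_select: int) -> List[int]:
    """Return the indices of the attributes that survive elimination."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        # Drop the attribute with the smallest squared weight, then retrain.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        del remaining[worst]
    return remaining


# Hypothetical toy data: attribute 0 separates the classes, attribute 1 is
# weakly related to the class, attribute 2 is constant.
toy_X = [[2.0, 0.5, 0.0], [2.0, -0.5, 0.0], [-2.0, -0.5, 0.0], [-2.0, -0.5, 0.0]]
toy_y = [1, 1, -1, -1]
print(svm_rfe(toy_X, toy_y, n_select=1))
```

Retraining after each removal is what distinguishes the recursive procedure from ranking the weights of a single trained model.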
Table 6 presents a summary of the features defined as relevant and irrelevant by feature selection approaches.
Features defined as relevant and irrelevant by feature selection approaches
In the Breast Cancer dataset, Chi2 defined the highest number of features as relevant (7), while Expert Knowledge was the approach that considered the highest number of features relevant in the Primary Tumor (16) and CNS (236) datasets.
Once the FS approaches had selected the best subset of features for each cancer dataset, the classification algorithms most used in the literature [48–51] were applied: C4.5 Decision Tree (confidence factor values of 0.1–1 and a minimum number of instances per leaf of 2–4); K-Nearest Neighbors (K-NN) with K values from 1 to 6; Support Vector Machine (SVM) with values of c of 0.3–4 and epsilon of 1.0E-16–0.001; and Multi Layer Perceptron (MLP) with learning rate and momentum ranging over 0.1–0.8 and 0.1–1.0, respectively. The classifiers were evaluated with 10-fold cross-validation in the Weka toolkit.
Table 7 shows the accuracy of the classifiers under each feature selection approach (FSA): expert knowledge from the ontology (EK) and the traditional methods Chi-squared (Chi2), Sequential Backward Selection (SBS) and Support Vector Machine based on Recursive Feature Elimination (SVM-R), for each dataset. The original dataset (No FSA) is also included for comparison. The underlined values are the highest accuracy for each feature selection approach.
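The evaluation protocol above (k-fold cross-validation over a range of K values for K-NN) can be illustrated with a minimal sketch. It is not the Weka implementation: the folds here are assigned by interleaving instance indices without shuffling or stratification, the Hamming distance is an assumed choice for nominal attributes, and the toy dataset is hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


def knn_predict(train_X: Sequence[Row], train_y: Sequence[str],
                x: Row, k: int) -> str:
    """Majority vote among the k nearest rows under Hamming distance."""
    distances = sorted(
        (sum(a != b for a, b in zip(row, x)), idx)
        for idx, row in enumerate(train_X)
    )
    votes = Counter(train_y[idx] for _, idx in distances[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X: Sequence[Row], y: Sequence[str],
                       k_neighbors: int, n_folds: int = 10) -> float:
    """Accuracy over n_folds folds; instance i is tested in fold i % n_folds."""
    n = len(X)
    correct = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_X, train_y, X[i], k_neighbors) == y[i]:
                correct += 1
    return correct / n


# Hypothetical toy dataset: the first attribute determines the class.
toy_y: List[str] = ["yes" if i % 2 == 0 else "no" for i in range(20)]
toy_X: List[Row] = [
    ("a" if label == "yes" else "b", "u" if i < 10 else "v")
    for i, label in enumerate(toy_y)
]

for k in range(1, 7):  # K values 1 to 6, as in the experiments
    print(k, cross_val_accuracy(toy_X, toy_y, k_neighbors=k))
```

The same loop structure applies to the other classifiers: only the prediction function and the parameter grid change.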
Classifiers accuracy using feature selection methods
For the Breast Cancer dataset, Chi2 and SBS obtained the best accuracy (with a difference of 0.40 between the two approaches) when used as training data for a K-NN classifier. For the Primary Tumor dataset, SBS gave the best accuracy, again with K-NN. Finally, for the CNS dataset the most accurate configuration was SVM-R with the SVM classifier. The feature selection approaches produced better training sets for classification, that is, for predicting the class value of new input data. Although feature selection based on expert knowledge did not achieve the best accuracy, it generated training data that performed better in the classification task than the original dataset.
A comparison of average accuracy between the four FSA and the original dataset is shown in Fig. 7. The SBS method presents the best average accuracy in the Breast Cancer and Primary Tumor datasets and the second best average accuracy in the CNS dataset. SVM-R reaches the highest average accuracy in the CNS dataset. Additionally, the average accuracy of the expert knowledge-based approach is high for all three datasets.

Comparison of average accuracy between four FSA and original dataset.
In this paper, the feature selection approaches based on expert knowledge and on traditional methods (filter, wrapper, embedded) were studied using datasets from the cancer domain. Four classification algorithms were applied to the subsets of features selected by expert knowledge and by the traditional methods. From this work we conclude the following:
Feature selection based on expert knowledge and filter methods are faster and have a low computational cost. However, filter methods assume independence among features. Each feature is therefore considered separately, which may lead to worse classification performance, whereas the approach based on expert knowledge searches for the most relevant features from the perspective of the application domain.
The subsets of features generated by the wrapper method for the cancer datasets obtained better classification accuracy than the expert knowledge approach. Although wrapper methods have a high computational cost, they achieve high performance in several application domains because the feature selection criterion is the performance of the classifier, i.e. the classifier is wrapped in a search algorithm that finds the subset giving the highest classifier performance [3].
Similarly, the embedded method, Support Vector Machine based on Recursive Feature Elimination, obtained better classification accuracy than the expert knowledge approach, especially with Support Vector Machines. This occurs because each embedded method is designed around a specific classifier that performs feature selection during training [28].
Based on these conclusions, Table 8 proposes four scenarios for the use of feature selection methods.
Scenarios for the use of feature selection approaches
As future work, we propose to:
Build a mechanism that combines the expert knowledge approach with wrapper and embedded methods, where each algorithm has a confidence weight defined by the user.
Create a case-based reasoning system for the automatic recommendation of a suitable FS algorithm, considering the meta-features of a dataset.
Compare traditional methods and expert knowledge approaches on other data quality issues, such as missing values, outliers, imbalanced classes, contradictory instances and duplicate instances [52].
Construct a feature extraction system (FES) based on expert knowledge. An FES computes new features from the original features in order to increase classifier performance and allow higher classification accuracy [53].
Acknowledgments
The authors are grateful for the technical support of the Control Learning Systems Optimization Group (CAOS) of the Carlos III University of Madrid and the Telematics Engineering Group (GIT) of the University of Cauca, and to the Ministerio de Economía y Competitividad de España (Proyecto TRA2011-29454-C03-03. i-Support: Sistema Inteligente Basado en Agentes de Soporte al Conductor) and Colciencias (Colombia) for the PhD scholarship granted to MSc. David Camilo Corrales.
