Abstract
Feature selection is a pre-processing method that identifies the significant features in high-dimensional data and reduces the computational cost of the learning algorithm by removing irrelevant and redundant features. It has traditionally been applied to a wide range of problems, including biological data processing, pattern recognition, and computer vision. The aim of this paper is to identify the feature subsets of benchmark datasets that best improve classifier performance. Existing filter-based feature selection approaches can fail to choose the relevant features from the original feature set. To obtain a small subset of relevant features, we introduce a novel filter-based feature selection method called ReCFS. The proposed method combines feature-feature correlation with nearest-neighbor feature weighting to find an optimal subset of features with minimal correlation among them. The effectiveness of the selected feature subset is evaluated with two classifiers, Naïve Bayes and K-Nearest Neighbour, on real-life datasets. To cover diverse conditions, the experiments are conducted on eight real-life datasets of varied dimensionality and number of instances. The results demonstrate that the proposed method finds promising feature subsets that improve classification accuracy over competing feature selection methods.
Introduction
At an abstract level, data mining methods in medical disciplines gather information from earlier experience and examine it to distinguish patterns and provide solutions for present circumstances. Data mining is an effective diagnostic procedure for recognizing ambiguous and significant records in the large volume of medical data drawn from medical scientific disciplines and clinical tests. It can be defined as the composite computational process of automatic pattern identification and intelligent decision making based on training data [1]. Given the availability of medical data in data mining domains, selecting the relevant features and discarding the extraneous ones is a critical problem for better diagnosis. To address this problem, we have developed a new diagnostic model for medical datasets that identifies the informative features using a two-stage feature selection framework, which can reduce the computational cost of any learning algorithm.
In the machine learning domain, the aim of feature selection is to search for a discriminative feature subset within the original feature space. It is usually used as a pre-processing step to increase the predictive power, understandability, scalability, and generalization capability of the learning engine [2]. Recently, data have grown explosively in both sample size and dimensionality in many computer science fields, and these characteristics make it difficult to discover a compact set of features that really contribute to the response of the model. Researchers have therefore employed several feature selection (FS) approaches to choose a relevant feature subset from the original feature set. Feature selection is an approach to reducing high-dimensional data that is widely used in machine learning, pattern recognition, and data mining. The main objective of feature selection methods is to choose m pertinent features from the d original features according to criteria such as correlation, redundancy, similarity, and inconsistency, where m < d [3].
A plethora of FS methods has been considered in the literature, most of them arising from the need to analyze microarray data and select the most discriminative genes [4]. Selecting informative genes is required because the data generally contain noisy, redundant, and irrelevant measurements. It also plays an important role in early-stage tumour detection and cancer discovery. The gene selection process has a powerful impact on cancer identification and thereby on clinical treatment [5]. Over the last five years, the most commonly used feature selection techniques have been Correlation-based Feature Selection (CFS) [6], Relief-F [7], Double Input Symmetrical Relevance (DISR) [8], and minimum Redundancy Maximum Relevance (mRMR) [9]. These methods can handle gene expression data with a huge number of features but a small sample size (tens to hundreds), whereas they cannot recognize feature interactions. In general, feature selection methods are widely applied before classification learning in data-intensive fields such as bioinformatics.
Feature selection methods are classified into four categories: filter, wrapper, embedded, and hybrid methods. A filter method [1] selects a prominent feature subset without using a classification algorithm and is generally used for high-dimensional datasets such as microarray datasets. A wrapper method [10] uses a classification algorithm to evaluate the accuracy obtained with a candidate feature subset. Wrapper methods are therefore slower than filter-based methods: for specific classifiers they can give the maximum classification performance, but they generally carry a high computational burden. An embedded method implements its own FS process and performs feature selection and classification simultaneously [11]. A hybrid method integrates filter and wrapper-based methods [12]: a filter-based method first selects a candidate feature subset from the original dataset, and a wrapper method is then applied to the candidate set to refine it.
A correct selection of features can improve the inductive learner in terms of learning speed, generalization capacity, or simplicity of the induced model. The authors of [10] took the composition of feature relevancy into account and proposed the composition of feature relevancy feature selection method. The authors of [13] proposed the five-dimensional joint mutual information feature selection algorithm, which takes higher-order feature interactions into account and adopts the maximum-of-the-minimum criterion. However, there is currently no widely accepted method, and in particular no widely accepted way to measure feature interactions.
Therefore, in the proposed method we select optimal feature subsets by combining the correlation-based FS (CFS) method with the nearest-neighbor weight-based selection method (Relief-F) for data classification on University of California, Irvine (UCI) and biological data, which helps the physician arrive at an accurate diagnosis. The features pass through a two-stage combination process that forms the feature subset with the lowest cardinality; that is, according to the rank of the features, the minimum number of features that increases the accuracy is selected. The technique uses an efficient strategy of combining feature selectors in a two-layer filter method. The main highlights of this paper are as follows. This paper presents a new two-stage filter method that combines two filters to improve classification. The proposed diagnostic framework combines the Relief-F scheme and the CFS approach to obtain optimal feature subsets. It is based on combining feature rankings that contain all the ordered features. The experimental outcomes demonstrate that the proposed framework, using a popular decision-tree-based classifier model (Random Forest), achieves remarkable results on the biological datasets obtained from the UCI repository.
The main contribution of this paper is the combination of the correlation-based FS (CFS) method and the nearest-neighbor weight-based selection method (Relief-F) for data classification on UCI and biological datasets. The method is shown to perform satisfactorily in terms of classification accuracy on benchmark datasets with a huge number of features. The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed method. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.
Related work
The main objective of feature selection approaches is to find an optimal set of features, based on criteria such as correlation, similarity, and consistency, that provides good classification results on biological datasets. Once the dimensionality of the data is reduced, the critical features of the original data are preserved, the prominent features are selected, and a low-dimensional dataset is produced. In the literature, researchers have used two types of feature selection methods, filter and wrapper methods, for cancer diagnosis. The simplest correlation criterion is the linear correlation coefficient, which may not be able to capture correlations of a non-linear nature. On the other hand, information-theoretic criteria such as mutual information and symmetric uncertainty, which are based on the concept of entropy, have become popular in machine learning because they can handle both linear and nonlinear relationships [10].
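The contrast between the linear correlation coefficient and an entropy-based criterion can be illustrated with a small sketch. The code below is not taken from the paper; it assumes discrete-valued data, and the toy vectors x and y are hypothetical. It uses the standard definition of symmetric uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)).

```python
import math
from collections import Counter

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); assumes H(X) + H(Y) > 0."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy         # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)

# Hypothetical toy data: y depends on x only through a quadratic relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

print(pearson(x, y))                # approximately 0: no linear trend
print(symmetric_uncertainty(x, y))  # about 0.82: y is fully determined by x
```

In this toy case the linear coefficient reports no association, while symmetric uncertainty detects the dependence, which is the behaviour the paragraph above describes.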
Feature selection methods for high dimensional datasets
Many feature selection algorithms have been proposed for classification, where it is critical to minimize the classification error [11, 15]. Consider input data D with N samples and M features, F = {fi, i = 1, 2, ..., M}, and the target classification variable C (class). The feature selection problem is to find, from the M-dimensional observation space R^M, a subspace R^m of m features that optimally characterizes C. The most common approach to feature selection is to search for an optimal set of features that provides good classification results. Most feature selection algorithms use statistical measures, such as correlation and chi-square, or population-based approaches, such as genetic algorithms and ant colony optimization. These techniques aim to select a feature subset from the benchmark dataset based on criteria such as redundancy, relevancy, and consistency. According to Yu et al. [16], features fall into four categories: completely irrelevant and noisy features; weakly relevant and redundant features; weakly relevant and non-redundant features; and strongly relevant features.
In [17], the authors distinguish the two most popular FS methods, filter and wrapper. The filter method selects the informative features and provides general solutions for various classifiers because it is independent of any learning algorithm. It is simple, fast, and computationally efficient, but it is unable to detect indirect relationships between features. The wrapper method, on the other hand, is conceptually the simplest technique; however, it is prohibitive in terms of computational cost because it depends on a learning algorithm to evaluate each candidate subset. An ensemble method builds a group of feature selectors and produces the combined feature subset of the group [18]. The purpose of the ensemble design is to address the perturbation and repeatability issues on gene expression datasets. The performance of ensemble feature selection no longer depends on a single feature subset, so it is more convenient and flexible for microarray gene expression datasets. Further discussion of the ensemble method can be found in [19]. In current trends in feature selection, hybrid and ensemble methods are the most powerful. A hybrid method is designed by integrating two different methods, such as a filter and a wrapper, and takes advantage of both by combining their complementary strengths [20]. The most commonly used ensemble approach can be found in [21].
Filter-based FS methods differ in nature. Univariate methods, such as the t-test and the Laplacian score, are generally used for high-dimensional datasets; however, they are less effective than multivariate models. According to [22], the most popular filter method, mRMR, is generally applied to handle the complex structure of gene selection problems. Genes selected by mRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. Among wrapper methods, the genetic algorithm is a population-based approach that is coupled with a classifier. The wrapper is based on well-established machine learning techniques and requires no knowledge of the underlying data distribution. In gene expression analysis, wrapper approaches have been used to distinguish tumour subtypes for cancer prediction, to decrease the number of genes to examine for a new tissue, and to support drug discovery and early diagnosis [23].
In [24], a non-unique decision measure was proposed that captures the degree to which a given feature subset is relevant to different categories. It helps to represent efficiently the uncertainty information in the boundary region of a granular model such as rough sets or fuzzy-rough sets. Most feature selection algorithms that use rank-based schemes [25] depend on binary discrimination and rank features according to the weight of the selected feature subset. According to Sun et al. [22], Local Learning-Based Feature Selection (LLBFS) methods handle complex problems. Giorgio et al. [18] introduced the concept of infinite feature selection, which exploits the convergence properties of power series of matrices. The primary challenge of gene selection is to extract the genes that are informative for classification at low computational cost; the authors of [26] therefore integrated the Kendall correlation with a filter-based feature selection method for better classification and prediction.
Because of the limitations of individual filter and wrapper methods, many hybrid feature selection methods have been proposed to resolve the shortcomings of earlier individual methods, with the goal of choosing feature subsets that maximize the accuracy of a particular classifier. In [27], Pashaei et al. developed a hybrid binary black hole algorithm that integrates the metaheuristic binary black hole algorithm and a binary PSO (4–2) model to find the best gene subsets. This model was combined with a random forest recursive feature elimination pre-filtering approach and with sparse partial least squares discriminant analysis, k-nearest neighbor, and naive Bayes classifiers. Its performance was estimated on two benchmark and three clinical microarray datasets. Similarly, the authors of [28] investigated a hybrid binary bat enhanced particle swarm optimization algorithm for gene subset selection on high-dimensional datasets. A set of assessment indicators was used to assess and compare the different methods on standard datasets. Lee et al. [29] proposed a wrapper that couples an adaptive genetic algorithm with k-nearest neighbor. It was used to select the optimal gene subsets, reduce the size of the dataset, and classify correctly. Its accuracy was better than that of GA, and it took only half the execution time of GA.
Similarly, Das et al. [30] introduced a group incremental FS algorithm that combines rough sets and GA to select the important feature subset. The fitness function of the genetic algorithm for incremental FS was defined using the previously generated reduct and the positive region of the target set, following rough set concepts. Shukla et al. [12] proposed another approach to the feature selection problem, in which a parallel implementation of a genetic algorithm was developed to produce useful features. That method was tested on benchmark University of California, Irvine (UCI) datasets. Similarly, Li et al. [31] proposed a multi-population genetic algorithm for feature selection. The numerical results show that it can be used not only for FS but also for parallel feature selection, with maximum accuracy and a small number of features. As the related work above shows, a variety of research has been done on gene datasets. This study aims to overcome the existing challenges of this field.
Methodology
Traditional feature selection methods incur a high computational cost when choosing the prominent feature subset from a high-dimensional dataset because of the curse of dimensionality. The proposed method uses a combination of correlation-based features and nearest-neighbor weighted features to select a subset of relevant features. It considers both feature-class correlation and feature-feature correlation to determine the feature subset. In addition, the proposed method enhances classification accuracy and reduces the search complexity of generating the feature subset on the benchmark datasets.
Given a benchmark dataset D of dimension n with feature set F = {f1, f2, ..., fn}, the problem is to choose a subset of features F’ that satisfies two conditions: F’ is a subset of F, and a classifier trained on F’ gives high classification accuracy. In other words, the aim is to identify a subset of features in which, for every pair of features fa and fb, the feature-class correlation is maximum and the feature-feature correlation is minimum.
Proposed strategy
The selection of feature subsets is based on an evaluation function that favors features highly correlated with the sample class (c) and uncorrelated with each other. It selects the highly weighted features that discriminate an instance from its neighbors of different categories. The weight wi of the ith feature is updated according to Equation (1).
where R is an instance sampled from the n instances, and the function ψ(Ai, R, H) evaluates the distance between the sampled instance R and its nearest hit H (the nearest neighbor of the same category) or nearest miss n(c) (the nearest neighbor of a different category). The combination of Relief-F and CFS leads to an effective feature selection scheme. The proposed feature selection (ReCFS) method works according to Equation (2).
where c(gi, gj) is the correlation coefficient, and gi and gj denote features in the candidate set selected from the sample dataset.
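The weight update idea can be sketched in a few lines. The code below is a hedged illustration of the basic Relief scheme (one nearest hit and one nearest miss per instance), not the Relief-F variant or Equation (1) of this paper. It assumes numeric features scaled to [0, 1], at least two instances per class, and visits every instance once instead of sampling at random; the function names and the toy data are hypothetical.

```python
def diff(a, b):
    """Per-feature distance for numeric features assumed scaled to [0, 1]."""
    return abs(a - b)

def nearest(data, labels, idx, same_class):
    """Index of the nearest neighbor of instance idx that has the same label
    (a hit, same_class=True) or a different label (a miss, same_class=False)."""
    best, best_dist = None, float("inf")
    for j, row in enumerate(data):
        if j == idx or (labels[j] == labels[idx]) != same_class:
            continue
        dist = sum(diff(a, b) for a, b in zip(data[idx], row))
        if dist < best_dist:
            best, best_dist = j, dist
    return best

def relief_weights(data, labels):
    """Basic Relief: w_i += (diff to nearest miss - diff to nearest hit) / n."""
    n, d = len(data), len(data[0])
    w = [0.0] * d
    for r in range(n):  # every instance is used once (no random sampling)
        hit = nearest(data, labels, r, same_class=True)
        miss = nearest(data, labels, r, same_class=False)
        for i in range(d):
            w[i] += (diff(data[r][i], data[miss][i])
                     - diff(data[r][i], data[hit][i])) / n
    return w

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
data = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5], [0.8, 0.3], [0.9, 0.8], [1.0, 0.4]]
labels = [0, 0, 0, 1, 1, 1]
weights = relief_weights(data, labels)
print(weights)  # feature 0 receives the larger weight
```

A feature earns weight when it differs across the nearest miss and agrees with the nearest hit, so ranking features by these weights yields the candidate set that a second, correlation-based stage can then prune.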
In the proposed method, we first use the Relief-F method to choose highly ranked features from the sample data and then apply the CFS method to select, from this candidate set, the subset used to measure predictive accuracy. The proposed method comprises six stages, as shown in Fig. 1. The pseudo-code for this approach is given in Algorithm 1.

Fig. 1. Flowchart of the proposed method for classification.
The flowchart in Fig. 1 illustrates the classification process on a benchmark dataset in an orderly manner; it comprises six stages that process the data and classify the provided samples into their corresponding classes. Benchmark datasets suffer from high dimensionality, so a feature selection mechanism must be taken into consideration. It involves finding the most informative features among the massive number of available features. The proposed method integrates feature selection methods to select the best feature subset and passes it on for classification. After feature selection, the dataset is ready to be explored by the classification stage. The selected subset serves as the training set for the classifier. The classifier aims to predict the correct label of each sample, and its decisions are assessed by the average accuracy.
The proposed feature selection method selects features based on the ranking procedures of Relief-F and CFS. It selects the feature that has the highest feature-class correlation but the minimum feature-feature correlation, which requires solving an FS optimization problem. The soundness of the proposed method derives from the fact that it selects features that are either strongly relevant or weakly redundant (as illustrated in the example that follows). A strongly relevant feature has high feature-class correlation but low feature-feature correlation, and a weakly relevant feature has high feature-class correlation but medium feature-feature correlation [32]. In addition, the method handles ties (i.e., two features with the same rank) by selecting the feature that has the higher feature-class correlation.
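The trade-off between feature-class and feature-feature correlation can be made concrete with the merit heuristic commonly associated with CFS, Merit = k * r_cf / sqrt(k + k(k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation of a subset of k features. The sketch below uses this standard form as an assumption; it is not necessarily identical to Equation (2) of this paper, and the input values are hypothetical.

```python
import math

def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a feature subset under the standard CFS heuristic.

    feature_class_corrs: one (absolute) correlation with the class per feature.
    feature_feature_corrs: (absolute) correlations of all feature pairs in the
    subset; may be empty when the subset holds a single feature.
    """
    k = len(feature_class_corrs)
    mean_rcf = sum(feature_class_corrs) / k
    if feature_feature_corrs:
        mean_rff = sum(feature_feature_corrs) / len(feature_feature_corrs)
    else:
        mean_rff = 0.0
    return k * mean_rcf / math.sqrt(k + k * (k - 1) * mean_rff)

# Two equally relevant features: the merit drops as they become more redundant.
print(cfs_merit([0.8, 0.8], [0.0]))  # uncorrelated pair
print(cfs_merit([0.8, 0.8], [0.9]))  # highly redundant pair
```

Under this heuristic, adding a fully redundant copy of a feature leaves the merit unchanged, whereas adding an equally relevant but uncorrelated feature increases it.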
Example
Let a real-life dataset [33] have a set of three features F = {f1, f2, f3}: age of the patient (f1), year of operation (f2), and number of positive axillary nodes detected (f3). The main objective is to identify, among the features present in the dataset, those most highly correlated with the class. First, we compute the feature-class correlation of each feature and select the feature with the maximum value. Feature f3 has the highest value; hence f3 is removed from F and placed in the optimal feature set F’. Next, for each feature fj ∈ F, we compute its feature-feature correlation with each feature fi ∈ F’ and store the average value for fj. In this way, we compute the feature-feature correlation for f1 and f2. We then compute the feature-class correlation for f1 and f2 again. This procedure continues until we obtain a subset of k features.
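The procedure in this example can be sketched as a greedy loop. The code below is an illustrative sketch under stated assumptions, not the authors' Algorithm 1: it uses the absolute Pearson coefficient for both correlations, scores each remaining feature as relevance minus average redundancy (an assumed way of combining the two quantities), and breaks ties by the higher feature-class correlation. The feature values are hypothetical and are not taken from the dataset in [33].

```python
import math

def abs_pearson(x, y):
    """Absolute Pearson correlation; returns 0.0 for a constant sequence."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def greedy_select(features, target, k):
    """features: dict mapping name -> list of values.
    Returns the names of k features in the order they were selected."""
    relevance = {name: abs_pearson(col, target) for name, col in features.items()}
    remaining = set(features)
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]  # first pick: feature-class correlation only
            redundancy = sum(abs_pearson(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy  # assumed trade-off
        # Ties on the score are broken by the higher feature-class correlation.
        best = max(sorted(remaining), key=lambda name: (score(name), relevance[name]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical values for three features and a binary class label.
features = {
    "f1": [2, 4, 1, 7, 9, 6],
    "f2": [3, 5, 4, 5, 6, 4],
    "f3": [1, 2, 1, 8, 9, 8],
}
target = [0, 0, 0, 1, 1, 1]
order = greedy_select(features, target, k=2)
print(order)  # f3 is selected first because it has the highest feature-class correlation
```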
The detailed steps of ReCFS are summarized in Algorithm 1. The first step takes the feature space F of the experimental dataset, containing M features, and selects the highly ranked features from the sample dataset. The second step integrates Relief-F and CFS by maximizing the correlation between each feature and the class label. The third step selects the top-ranked features from the candidate dataset. In the fourth step, the training phase splits the feature space of the training set F” into K disjoint subsets. The fifth step builds a classifier, concentrating on selecting relevant features and deciding how to select them. In the sixth step, we apply widely used machine learning classifiers, namely KNN and NB, to achieve maximum accuracy.
In this study, we use the most common microarray gene datasets [26] and datasets from the UCI machine learning repository. We choose three DNA microarray datasets: ALL_AML, colon cancer, and lung cancer. Table 1 presents the description of the UCI datasets used.
Dataset description
Extensive experiments are carried out to evaluate the proposed feature selection method using the UCI repository. Details of the datasets are given in Table 1. We perform experiments on selected UCI datasets and gene datasets and compare the proposed method (ReCFS) with other methods using two classifiers, K-Nearest Neighbour and Naïve Bayes. We use 10-fold cross-validation to evaluate the performance of the feature subsets with these classifiers. The comparison of the proposed algorithm is shown in Tables 2 to 5 for all benchmark datasets used. The differences in the cardinality of the selected optimal feature subsets are shown in Figs. 2 to 5.
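The evaluation protocol can be sketched as follows. This is a minimal illustration of 10-fold cross-validation with a k-nearest-neighbour classifier, written with the Python standard library only; it is not the experimental code of this paper. The plain (non-stratified) fold split, the value k = 3, the random seeds, and the synthetic two-class data are all assumptions made for the example.

```python
import random

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    votes = [train_y[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)

def cross_val_accuracy(x, y, folds=10, k=3, seed=0):
    """Mean accuracy over `folds` disjoint test folds (plain, non-stratified split)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    fold_ids = [idx[f::folds] for f in range(folds)]  # disjoint folds covering all indices
    scores = []
    for test in fold_ids:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tx, ty = [x[i] for i in train], [y[i] for i in train]
        correct = sum(knn_predict(tx, ty, x[i], k) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Hypothetical, well-separated two-class data standing in for a reduced feature set.
rng = random.Random(1)
x = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)] + \
    [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(30)]
y = [0] * 30 + [1] * 30
accuracy = cross_val_accuracy(x, y)
print(accuracy)
```

In a comparison such as the one reported here, the same folds would be reused for each feature selection method so that differences in accuracy reflect the selected features and not the split.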
Table 2 reports the classification accuracy on the three DNA microarray datasets obtained with the different feature selection methods. We investigate six gene subsets, ranging in size from 10 to 100 genes, to show the effectiveness of the ReCFS method. From Table 2, the average accuracy of the NB classifier with the proposed method is in the range 74.04–83.80% for the Colon dataset and 83.28–92.71% for the Lung dataset. For the ALL_AML dataset, the accuracy achieved by the proposed ReCFS method for the 10–100 gene subsets lies in the range 91.78–97.21%.
Table 2. Percentage accuracy of Naïve Bayes classification on six gene subsets for the Relief-F, CFS, MRMR, JMI, DISR, and ReCFS methods
Table 2 shows that the ReCFS algorithm selects more informative genes for classification than the other feature selection techniques. With the top 100 genes selected, the ReCFS method achieves 98.08% classification accuracy using NB on the ALL_AML dataset, 83.8% on the Colon dataset, and 92.71% on the Lung cancer dataset. Fig. 2(a) to (c) show the performance comparison of the proposed method with the NB classifier on the three datasets. As shown in Fig. 2(b) to (c), the proposed method achieves the best NB classification accuracy on the Colon and Lung cancer datasets.

Fig. 2. (a)-(c) Performance comparison of the proposed method with the NB classifier on the Lung cancer, ALL_AML, and Colon datasets.
Table 3 shows that the highest average accuracy of the k-NN classifier on the ALL_AML dataset is 95.78%. As Table 3 indicates, ReCFS achieves higher accuracy with fewer features than the other popular feature selection methods. These results reflect the applicability of ReCFS feature selection to survival classification on the ALL_AML and Lung datasets.
Table 3. Percentage accuracy of K-Nearest Neighbour classification on six gene subsets for the Relief-F, CFS, MRMR, JMI, DISR, and ReCFS methods
As seen in Table 3, for the Colon and Lung data the ReCFS method shows improvement across the feature subsets from 10 to 100 genes. ReCFS selects more informative genes than the other well-known techniques, and it attains a high degree of accuracy with the K-NN classifier on high-dimensional gene datasets such as Lung cancer and ALL_AML.
Table 4 shows that the highest average accuracy of the K-NN classifier on the Ionosphere dataset is 91.96%. As Table 4 indicates, ReCFS achieves higher accuracy with fewer features than the other popular feature selection methods. These results reflect the applicability of ReCFS feature selection to survival classification on the Heart, Ionosphere, Lymphography, Sonar, and Vehicle datasets.
Table 4. Percentage accuracy of K-Nearest Neighbour classification of four feature subsets for the feature selection methods
Figures 4(a) to (e) show the performance comparison of the proposed method with the K-NN classifier on five datasets. As shown in Fig. 4, the proposed method achieves the best K-nearest neighbour classification performance on the Heart, Ionosphere, Lymphography, Sonar, and Vehicle datasets.

Fig. 3. (a)-(c) Performance comparison of the proposed method with the K-NN classifier on the Lung, ALL_AML, and Colon datasets.

Fig. 4. (a)-(e) Performance comparison of the proposed method with the K-NN classifier on the Heart, Ionosphere, Lymphography, Sonar, and Vehicle datasets.
Table 5 shows that the highest average accuracy of the NB classifier on the Ionosphere dataset is 91.83%. As Table 5 indicates, ReCFS achieves higher accuracy with fewer features than the other popular feature selection methods. These results reflect the applicability of ReCFS feature selection to survival classification on the Heart, Ionosphere, Lymphography, Sonar, and Vehicle datasets.
Table 5. Percentage accuracy of Naïve Bayes classification on four UCI datasets for the feature selection methods
Figures 5(a) to (e) show the performance comparison of the proposed method with the NB classifier on five datasets. As shown in Fig. 5, the proposed method achieves the best Naïve Bayes classification performance on the Heart, Ionosphere, Lymphography, Sonar, and Vehicle datasets.
We conducted the Wilcoxon test, a non-parametric statistical hypothesis test, on the benchmark datasets to evaluate the performance of the proposed method with the two classifiers, as shown in Table 6.

Fig. 5. (a)-(e) Performance comparison of the proposed method with the NB classifier on the Heart, Ionosphere, Lymphography, Sonar, and Vehicle datasets.
Table 6. Wilcoxon signed-rank test results of the proposed method against other feature selection methods on the UCI and gene datasets at the significance level α = 0.05 (+ denotes outperformed rank and ± denotes equivalent rank) with the KNN and NB classifiers
According to the Wilcoxon signed-rank test at the significance level α = 0.05, Table 6 shows that the proposed method gives better results than other feature selection methods such as MRMR and JMI on benchmark datasets with a large number of labels.
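For readers unfamiliar with the test, the signed-rank statistic can be computed as in the sketch below. It is a generic illustration, not the analysis code of this paper: zero differences are dropped, tied absolute differences receive average ranks, and only the rank sums W+ and W- are returned (deriving a p-value would additionally require a critical-value table or a statistics library). The paired accuracy values are hypothetical.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus) for paired samples a and b.
    Zero differences are dropped; tied |differences| share their average rank."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next absolute difference is tied with this one.
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical paired accuracies (%) of two methods on five datasets.
method_a = [90, 85, 78, 92, 88]
method_b = [86, 84, 80, 85, 82]
print(wilcoxon_signed_rank(method_a, method_b))  # (13.0, 2.0)
```

The smaller of the two rank sums is compared with the critical value for the chosen significance level; a small value indicates that one method is consistently better across datasets.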
To select highly scored features, in this paper we have developed a technique for high-dimensional real-life datasets that combines the Relief-F and CFS methods, called ReCFS, which yields relevant and highly correlated feature subsets. In the first stage of the proposed method, the Relief-F method finds the candidate feature set, filtering out erroneous features and reducing the processing time of the next stage. The CFS method is then applied directly to choose highly correlated features and explicitly filter out irrelevant features from the candidate feature set. The performance of several data mining techniques and computational methods has been evaluated in terms of classification accuracy on benchmark datasets. We investigated the performance of the proposed method with two classifiers (NB and KNN) and applied 10-fold cross-validation to measure their classification performance. The experiments compared the classification accuracy of the two classifiers on the benchmark datasets. We found that the proposed method is a promising feature selection method with the NB classifier on microarray gene datasets such as ALL_AML and Lung cancer, with accuracy ranging from 94.78–98.08% on ALL_AML and 74.04–83.8% on Lung cancer, and on all UCI datasets used. The proposed method (ReCFS) is also a promising feature selection method with the KNN classifier on microarray gene datasets, with accuracy ranging from 88.57–95.78% on ALL_AML, 64.04–72.95% on Colon cancer, and 81.00–90.80% on Lung cancer, and on all UCI datasets used.
