Abstract
Feature selection is the problem of eliminating the features which are irrelevant and/or redundant. It can also be assumed as the problem of selecting a small subset of features which are necessary and sufficient to describe the target concept. In this paper, a new feature selection method based on the concepts of sensitivity and Pearson’s correlation is introduced which is called Sensitivity and Correlation based Feature Selection-SCFS. The sensitivity of one feature is computed via applying the subtractive clustering and is utilized as feature-target relevancy. Pearson’s correlation coefficient is used to determine the redundancy among a subset of selected features. The introduced measure increases the score of a selected feature subset which has maximum relevancy to the target concept and minimum redundancy among features. The proposed criterion is employed as the fitness function in a genetic algorithm in order to evaluate feature subsets. Some well-known benchmark datasets are utilized for investigating the performance of the proposed method. Also, the results of our method are compared with the other similar feature selection methods. The obtained results show however SCFS is an unsupervised filter; it is well comparable to the other well-known supervised methods in terms of classification accuracy and the number of selected features.
Introduction
Dimensionality reduction methods can be generally classified into feature extraction (FE) and feature selection (FS) approaches [1, 2]. FE methods [3, 4] map the original feature space to a lower-dimensional one and create new features. FS method [5, 6] is a search process or technique used to select a subset of features to build robust learning models. Some irrelevant and/or redundant features generally exist in the learning data that make learning harder and reduce general performance of the learned models [7]. An irrelevant feature is not correlated with the target variable and does not affect the target concept in any way, and a redundant feature does not add anything new to the target concept. So, the main idea of FS is to choose a subset of features by eliminating irrelevant features and also redundant features which are strongly correlated [8]. There are many potential advantages of feature selection: making data visualization and data understanding easy, reducing the processing and storage requirements, reducing training and utilization times and improving prediction performance [9]. Some applications of feature selection are speech recognition [10], text categorization [11], gene selection from microarray data [12], face recognition [13, 14].
FS techniques can be classified into the following categories: filter, wrapper and embedded techniques [7, 15–22]. The filter approach evaluates and selects feature subsets based on the general characteristics of data. Examples of filter methods are the simple auxiliary criteria such as correlation [23], mutual information [9], information gain [11], consistency [24], distance [25], dependency [26] and the others easy usually include some statistical measures without employing any learning model. One advantage of filter methods is that they are usually fast since they do not use the learning algorithm, and therefore they are suitable for large datasets [15].
In contrast to the filter, the wrapper technique involves a learning model and uses its performance as the evaluation criterion [7, 27]. The wrapper methods search the space of possible feature subsets to obtain the best evaluation of a specific subset of features [16]. Variety of search methods is frequently used to obtain optimal subset. Although the wrapper approach is computationally more expensive than the filter approach, the general performance of this approach is better than the filter [7].
Most of studies combine filters and wrappers. They use filters either for the ranking of features or to reduce the number of the candidate features. In particular, these hybrid methods are based on a sequential (e.g., two-step) approach where the first step is usually based on the filter methods to reduce the number of features considered in the second step. Using this reduced set, a wrapper method is then employed to select the desired number of features in the second step. This scheme possesses the superiorities of model independence in filter and uses the advantages of wrapper [16]. The performance of this approach is more than the filter and less than the wrapper.
FS algorithms with filter and embedded models may return either a subset of selected features or the weights of all features. According to the type of the output, they can be divided into feature weighting and subset selection algorithms. Algorithms which use the wrapper model usually return feature subset [15].
Many of the filter and embedded methods estimate weighting score of the features then features with bad scores are removed. These methods can be classified into univariate and multivariate methods. The univariates only use feature relevancy (feature-target correlation) to compute weighting scores and they ignore feature redundancy (feature-feature correlation), so they result in poor classification performance. There are numerous well-known univariate filter methods, such as Kruskal wallis [28], Gini index [29], Information gain [30], Bayesian logistic regression (Blogreg) [31], Fisher score [32], Relief-F [33] and Chi-square Score [34]. The multivariate methods, that take both criteria into account, act better than the others. Such as Minimal-Redundancy-Maximal-Relevance criterion (MRMR) [26], Correlation-based Feature Selection (CFS) [23], Sparse multinomial logistic regression via bayesian regularization (Sbmlr) [35] and Fast Correlation-Based Filter(FCBF) [36].
In this study, an unsupervised filter method based on subtractive clustering and Pearson’s correlation coefficient is proposed. Then, it is integrated within a genetic algorithm (GA) and named Sensitivity and Correlation based Feature Selection - SCFS. In the proposed method, sensitivity of the features is computed through subtractive clustering and is used to calculate the relevancy of features. The Pearson’s correlation coefficient is applied to calculate the redundancy. The purpose of this method is the selection of a feature subset with maximum relevancy, minimum redundancy and minimum size. This method, unlike most of the common methods, is suitable for both supervised and unsupervised problems and is well comparable to the other supervised methods in terms of classification accuracy and the number of selected features.
The remainder of this paper is organized as follows: In Section 2, the Correlation based Feature Selection, the GA method as global optimization procedure and clustering methods are described briefly. In Section 3, the proposed method is introduced. In Section 4, the proposed algorithm is compared with some common methods and the experimental results are presented. Finally, Section 5 concludes the paper.
Preliminaries
Correlation-based feature selection
The term correlation refers to a degree of dependency or predictability of one variable toward another. The Correlation-based Feature Selection (CFS), proposed by Hall [23], evaluates subsets of features on the basis of the following hypothesis: A good feature subset is one that contains features highly correlated with (predictive of) the target, yet uncorrelated with (not predictive of) each other. If the correlation between each of the features and the target variable is found, and the inter-correlation between each pair of the features is obtained, then the merit of a feature subset consisting of k features can be predicted by the following equation:
Here, is the average value of all feature-target correlations, is the average value of all feature-feature correlations and k is the number of selected features. Here correlation criterion is based on Pearson’s correlation coefficient which can only detect linear dependencies between the feature and the target. A simple way of lifting this restriction is to make a non-linear fit of the target with single features and rank according to the goodness of fit. In this paper a non-linear criterion is proposed which is suitable for labeled and unlabeled datasets.
In statistics, the Pearson’s correlation coefficient [37], is a measure of the correlation between two variables X and Y. Its result is a value between 1 and –1 inclusive. Correlations equal to 1 or –1 correspond to data points lying exactly on a line. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of –1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. This criterion is widely used in the sciences as a measure of the strength of linear dependence between two variables. Pearson’s correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations.
The formula is as follows:
The Pearson’s correlation coefficient is symmetric: correlation (X, Y) = correlation (Y, X).
In this section, feature selection strategies based on GA are reviewed. GA is generally quite effective for the rapid search of large, nonlinear and poorly understood spaces. It is one of the most widely used techniques for feature and instance selection, and can improve the performance of data mining algorithms [38]. In [39] a comparative study of algorithms for feature selection is carried out. The results of many experiments of this study show that GA is more suitable than the other heuristic search methods for large and medium sized problems. Such methods can result in several optimal (or close-to-optimal) feature subsets as output. In GA a binary chromosome, which the length of which is equal to the number of the original features, is employed for feature selection. A zero or one in the jth position of the chromosome denotes the absence or presence of the jth feature in this particular subset. A population of chromosomes is usually produced randomly while the size of the population and how they are created are important issues. The typical genetic operators (crossover and mutation) are applied to the pool of feature subsets. Again, the choice to use which types of crossover and mutation must be carefully considered. This generates a new feature subset pool which may be evaluated in two different ways. If a filter approach is adopted, the fitness of individuals is to be calculated using a suitable criterion function. This function evaluates the “goodness” of feature subset; a larger value of a function indicates a better feature subset. A criterion function could be entropy, correlation measure or a combination of several criteria. In the wrapper approach, chromosomes are evaluated by inducing a classifier based on the feature subset, and obtaining the classification accuracy (or an estimation of it) on the data. To guide the search toward minimal feature subsets, the subset size can also be incorporated into the fitness function of both filter and wrapper methods.
A suitable stopping criterion must be chosen. This is typically achieved by limiting the number of generations that take place or by setting some threshold which must be exceeded by the fitness function. If the stopping criterion is not satisfied, the individuals are selected from the current subset pool and the process described above repeats.
Clustering
Clustering can be considered as the most important issue in unsupervised learning. It is an approach to measure the similarity present in data and puts the similar data into the same groups. A clustering algorithm partitions a data set into several groups or clusters so that the similarity within a group is maximized while minimizing this factor between the groups is done. In classical clustering, each sample belongs to only one cluster while it may belong to several clusters with a membership degree. The main difference between classical and fuzzy clustering is that a sample can belong to more than one cluster. The fuzzy clustering algorithm is one of the most important methods used in unsupervised pattern recognition. In recent years, these approaches have been used in a wide range including pattern recognition, fuzzy control and machine vision. This section reviews four of the most representative clustering techniques [40]: K-means clustering, Fuzzy c-means clustering, Mountain clustering, Subtractive clustering.
The first technique is a crisp method and the other techniques are fuzzy approaches. By having compared these four methods, the subtractive clustering approach is utilized in this paper since it is better than the other three methods in terms of stability. The stability of one clustering technique means different executions of algorithm will lead to the same clusters. This is provided by the stopping criteria and having fixed clustering parameters. Also, in the subtractive clustering the number of clusters is not determined by the user instead it depends on the radii parameters. If one uses a fixed set of radii parameters and runs the algorithm several times, the resulted clusters are always the same.
K-means clustering is a very applicable algorithm but it requires each sample to belong only to one cluster. It seems more reasonable to use fuzzy techniques. Fuzzy c-means algorithm is one of the effective methods in unsupervised clustering. However, due to the different parameters and lack of effective methods for their choice, it has a limited number of applications and if we use a constant value for the parameters of the algorithm, the sensitivity of the algorithm to the initial data would be very high. Moreover, stopping at local minimum or deceleration convergence of the algorithm are the deficiencies of this algorithm. Although the stopping problem of this algorithm has been solved by [41], the initial membership degree has to be determined randomly as well. Also, in the first two clustering approaches the number of clusters should be specified in advance. In the feature selection the number of clusters is not specified, so these approaches are not suitable for this problem.
The mountain clustering method is a simple and effective algorithm. However, its computation grows exponentially by increasing the dimension of the problem because the method must evaluate the mountain function over all the grid points. This approach is suitable for two-and three-dimensional data. High dimension data, in this algorithm requires excessive computing. Subtractive clustering has solved this problem by using data points as the candidates for cluster centers, instead of the grid points as in mountain clustering. Therefore, in this study, the subtractive clustering is employed for scoring and weighting the features. The subtractive clustering method is an extension of the mountain clustering method. This method assumes that data point is a potential cluster center and calculates a measure of the likelihood as if each data point would define the cluster center, based on the density of surrounding data points. This algorithm in the first step selects the data point with the highest potential to be the first cluster center and then eliminates the effect of all data points in the vicinity of the first cluster center (as determined by radii), in order to determine the next data cluster and its center location. This process iterates until all the data points are covered by a cluster center. It seems that this clustering approach is suitable for feature selection problem. This approach is described in detail in Section 3.1.
Proposed method
In this section, we first define the relevance of features in terms of subtractive clustering -the suitable clustering method for feature selection problem; then, we show how to aggregate three criteria (relevancy, redundancy and the number of selected features) in a fitness function. Eventually, we optimize this function with GA.
Feature relevancy evaluation via sensitivity analysis based on subtractive clustering
Subtractive clustering is a clustering method, proposed by Chiu [42]. Formally, let X = { X1, …, X
k
, …, X
n
} be a dataset, where X
k
= { xk1, …, x
kj
, …, x
km
} ∈ R
m
. n is the number of samples and m is the number of features in dataset X. Since each data point is a candidate for cluster centers, a density measure at data point X
i
is defined as:
or
∥∥ denotes the Euclidean distance and r a is a positive constant. Hence, a data point will have a high density value if it has many neighboring data points. The points outside r a have little influence on its density. After the density of every data point has been computed, we select the data point with the highest density as the first cluster center.
Let X
c
1
be the first cluster center and D
c
1
be its density value. We revise the density measure of each data point X
i
by:
r b is a positive constant. Therefore the data points near the first cluster center X c 1 will have significantly reduced density measures, thereby making the points unlikely to be selected as the next cluster center. To get the distance between cluster centers, r b is considered larger than r a ; a good choice is r b = 1.5 r a . After revising the density function, the data point with greatest density value will be selected as the next cluster center [43, 44]. This process of revising the density function and finding the next cluster center continues until a sufficient number of clusters or density changes are attained.
The density measure implies the relations between the features and the sample clusters/classes. In the Equation (4) the density value is affected by the sample feature values. We can define the sensitivity of the density measure of ith sample with respect to the kth feature as:
The sensitivity S (i, k) represents how much does the kth feature influences the density measure of the ith data point. In fact, it means how much information there is to cluster the ith sample in kth feature.
According to Equation (4), is:
where i ≠ p, j = p,
where i = p, j ≠ p,
and where i = p, j = p or i ≠ p, j ≠ p,
The sensitivity of the kth feature for dataset X is defined as follows:
In calculating the sensitivity of the kth feature in dataset X, its impact on computing the density of all samples is considered. The features with higher sensitivity value can contain more information about the clusters than the others.
There are two basic steps in any filter based subset selection method. These steps are: 1) generation step that generates candidate feature subsets and 2) evaluation step that evaluates the generated candidate feature subsets according to a merit measure. The stopping criteria will determine whether this process should be stopped or continued. The process will terminate if the required number of features is obtained or the user-specified iterations are reached. According to the search mechanism used in a feature selection method, the feature selection methods are divided into three groups, named complete, heuristic and random methods. In heuristic methods, some meta-heuristic search algorithms such as GA are utilized for performing the two steps of feature selection. In the present study GA is used for the feature selection while a combination of information based (sensitivity) and dependency based (correlation coefficient) criteria is employed as the evaluation merit. This combination is on the basis of CFS merit and is defined as follows:
is the average value of all selected features’ sensitivity, is the average value of all feature-feature correlations coefficient in the selected subset, and k is the number of selected features. This merit must be maximized that means maximizing features relevancy to the target concept () along with minimizing features-features redundancy () and number of selected features (k). The proposed Merit is utilized as the fitness function of GA. The proposed feature selection scheme is shown in Fig. 1. Initially, dataset has been normalized to fit in a range then it is used to find the best feature subset. Selected subset must be tested using some classifier methods. In the subset selection step at first sensitivity values of features are calculated and then these values are fed to GA. GA initializes population randomly, but afterwards, in the evaluation step, the fitness of generated subset using relevancy, redundancy, and number of selected features is determined. The GA process continues until the number of generation reachs 50. Finally selected feature subset will be used for the test step.
This section presents SCFS performance on several real-world benchmark datasets. The proposed method is compared with some of the most commonly used algorithms. The details of datasets, the comparative methods, the user specified parameters, and finally the comparisons of SCFS with the other existing works are also discussed in this section.
Datasets
Datasets are Libras, Parkinson, Soybean, Breast cancer, Ecoli, Yeast, Page-blocks, Dermatology, Seeds, Waveform, Thyroid Disease, Diabetes, Vehicle, Iris, and Laryngeal. These datasets have been the subject of many studies in machine learning and data mining. The characteristics of these datasets are shown in Table 1. Most of these datasets are freely available in the University of California Irvine Machine Learning Repository (UCI) [45].
Methods for comparison
In our experiments eleven algorithms (two embedded and nine filter) are utilized, which are listed in Table 2. All of these algorithms are implemented in a package with ASU feature selection repository [15]. In Table 2, column 2, the types of learning algorithms, which are used by the feature selection methods, are presented. Column 3 indicates the approach of algorithms which can be either filter or embedded. Column 4 specifies whether the features are evaluated individually or not. Most of the methods are univariate and cannot handle the feature’s redundancy. Column 5, determines the type of algorithms in terms of their produced outputs. The results of the feature ranking methods are in a collection of all features which are sorted according to their importance. In contrast to the ranking methods the other ones return a subset of features as the final result.
Parameters setup
In this work, all samples of a given dataset are used for training in order to find the best feature subset. Then these samples are given to the k-fold cross validation process for evaluating the classification accuracy of the selected features. Table 3 shows the parameters of GA, which are shared among all datasets. These parameters have been chosen after some trial and error executions. For making a fair comparison, first, the number of features for ranking methods are considered as 1, 2, 3 … No. of features respectively. Then, the averages of classification accuracies and number of features,corresponding to the five best classifier accuracies, are reported as the performance measures for feature ranking methods.
K-fold cross validation
In utilizing this technique the dataset with N samples are divided into k partitions of approximately equal size. The classifier is trained k times. Each time the kth fold is used as the test set and the other samples are used to train the classifier at each time. During the training phase, the classifier is trained on k–1 out of k-folds. The prediction performance of the classifier is estimated by considering the average classification accuracy of the k cross-validation experiments.
Results
It can be easily recognized that in all experiments, the proposed GA based solution (SCFS) is better than most of the other defined methods. SCFS usually returns the most minimized feature vector compared to the other subset selection methods. It is very important to minimize the number of the selected features in order to control the complexity of the pattern recognition models. Our experiments show that solutions with the best accuracy rates do not necessarily have more features. Experiments are run 20 times. The results of these experiments are considered as an evaluation function. The evaluation function is based on the average classification accuracy and the number of the selected features in 20 runs. This function is:
EF, CA, NOF and NSF are the evaluation function, the average classification accuracy, the number of the original features and the average number of the selected features, respectively. In order to make an equal weight of importance for both clasification accuracy and number of selected features, the value of coefficient (α) is considered to be 0.5 [7]. Classifiers for these experiments are classical K-Nearest-Neighbor (K-NN) and Binary Tree. Existing classifiers are divided into two categories [46], lazy and eager. In these results we have used both two types. K-NN is a lazy classifier and Binary Tree is a typical example of the latter category. K-NN is a non-parametric method for classification and regression, which predicts the target or class memberships based on the K closest training examples in the feature space. Here K is considered 1, 3 and 7. The results in term of evaluation function with K-NN (K = 1, 3, 7) and Binary Tree are reported in Tables 4–7. In all 4 tables, SCFS obtained the best performance. For example in Table 4, SCFS, FCBF and Kruskal present good performance on second, third and seventh datasets respectively. SBMLR, Blogreg, Gini index shows good performance only on one dataset. In this table CFS, Fisher score, information gain, Relief-F, MRMR and Chi-square do not have good performance in any dataset. Generally, the performance of the feature subset selected by the SCFS algorithm is the best comparing to the selected features by the other methods, though in some cases the feature subset selected by the other defined algorithms can show good performance as well. We also extract the features that have been selected by two or more feature selection algorithms simultaneously and list them in Table 8. As shown in Table 8, the features selected by SCFS are also selected by other feature selection algorithms.
For example, the fourth feature of Iris dataset was ranked 1st in SCFS, FCBF, Blogreg, Relief-F and Kruskal and it was ranked 2nd in seven other algorithms. Because of using GA, SCFS in each run may return a little different feature subset. Therefore, in this method each feature which is repeated more than the others, has the best rank. The features which are more frequently selected may potentially be more important feature markers for diagnosing different types of samples.
To prove the superiority of the proposed method compared to the other defined methods, the results of the previous section are analyzed using non-parametric statistical tests. In this paper we have applied non-parametric statistical tests for multiple comparisons by combining all the datasets. We first have applied the Friedman test [47] in order to compute a ranking among the distributions. The Friedman test is a version of the Repeated -Measures ANOVA that can be performed on ordinal data. They will be useful to detect significant differences between the methods. Then we apply post-hoc procedures. This test allows us to detect effective statistical differences between the control approach, i.e., the one with the lowest Friedman rank, and the remaining approaches. Table 9 shows the obtained average rankings across all data sets following the Friedman procedure for each method. Obtained p-values by Friedman test are also shown in column 2 of Table 10. If the p-value is lower than the level of significance α (in the experiments, α= 0.05), we can reject the null hypothesis and affirm that there exist statistical differences between SCFS and the other methods. Otherwise, no statistical difference exists among the distributions, and therefore, all algorithms evolve toward similar results. In the intended column, the eight first hypotheses are rejected. They are bold. It shows that differences between SCFS and the eight first methods are significant but its results are similar to FCBF, Kruskal wallis and the results of Gini index. Obtained p-values through applying post hoc methods over the results of Friedman procedure, are shown in Table 10. The control approach is SCFS. Finner and Li post hoc tests are employed. According to these results, the differences between SCFS and most of the explained methods are significant. Freedman and post hoc p-values can demonstrate this. These experiments were conducted using KEEL [48] (Knowledge Extraction based on Evolutionary Learning), a tool for creating, learning, optimizing and evaluating various models. KEEL has been developed in the Java environment by a group of Spanish research centers and is available free for non-commercial purposes [49].
Discussions
As we have analyzed, the objective function of SCFS in Equation (12) is similar to the CFS method. Thus, the feature selection result of the proposed method is compared with CFS and other ten common methods in this section. SCFS mostly proposed a feature vector with a minimum size compared to the other implemented methods. It is very important to minimize the number of the selected features to control the complexity of the pattern recognition problem, but there should also be maximized classification accuracy.
Classification accuracy and the number of selected features are formulated in Equation (13). In order to make the same importance for classification accuracy and the number of selected features, α in this equation is considered 0.5 [7]. Tables 4–7 show that SCFS has the best results in this term. Also, non-parametric tests are implemented to show significant difference between SCFS and the other described methods. Results of these tests are as follows: According to Table 9 SCFS has the best rank in Friedman test. And the following remarks are drawn from Table 10:
According to the result of Friedman test, the difference between SCFS and all methods are meaningful except FCBF, Kruskal wallis and Gini index (second column).
In Finner’s procedure the p-values greater than 0.041099 are bolded in the third column that means the differences between the average raking of SCFS and those of the other methods are significant except for the two last methods in the above table (FCBF and Kruskal wallis).
It can be seen that Li’s procedure, compared to the other methods, rejects more hypotheses, which that means the difference between SCFS and all the methods is meaningful except for FCBF.
Concluding remarks
A new filter method-SCFS was presented in this study to find a minimum size feature subset while preserving the highest classification accuracy. The advantages of this method include the ability to find small subsets of features that have good modeling accuracy and apply on unsupervised problems. The subtractive clustering and Pearson’s correlation coefficient criterion were utilized to compute the relevancy and the redundancy of the features. Our proposed algorithm performs a heuristic search using GA. The fitness value was calculated according to a merit like the CFS feature selection merit while the sensitivity value was utilized instead of correlation coefficient based relevancy measure. The sensitivity measure was calculated based on the subtractive clustering. The analysis of sensitivity has been introduced based on the fuzzy C-means clustering algorithm in the literature. In the present study, subtractive clustering algorithm was used to present an unsupervised relevancy measure. The proposed method is a multiclass filter and it removes redundant and irrelevant features. This method was compared with eleven common filters and embedded algorithms. The results of the comparisons reveal the superiority of the proposed algorithm compared to the others in terms of classification performance and the number of selected features. Some non-parametric statistical tests were utilized in order to reach a scientific comparison of the results. The outcome of these tests confirms the ability of the proposed method in comparison to the others. It is worth noting that the contributions of this paper can be summarized as follows: Deriving the sensitivity measure based on the subtractive clustering along with deriving all related formulas. Proposing a new feature evaluation measure based on the sensitivity and Pearson’s correlation. Presenting a scientific comparison of feature selection methods based on the non-parametric statistical tests.
In this work, we focused on the searching strategies using binary projection. One interesting future research direction is to propose a nonlinear feature-feature correlation criterion and replace the Pearson’s correlation coefficient.
