Abstract
In many machine learning or patter recognition tasks such as classification, datasets with a large number of features are involved. Feature selection aims at eliminating the redundant and irrelevant features which would bring computational burden and degrade the performance of learning algorithms. Particle swarm optimization (PSO) has been widely used in feature selection due to its global search ability and computational efficiency. However, PSO was originally designed for continuous optimization problems and the discretization of PSO in feature selection is still a problem which needs further investigation. This paper develops a novel feature selection algorithm based on a set based discrete PSO (SPSO). SPSO employs a set based encoding scheme which makes it able to characterize the discrete search space in feature selection problem. It also redefines the velocity term and the corresponding arithmetic operators which enables it to search for the optimal feature subset in the discrete space. In addition, a novel feature subset evaluation criterion based on contribution rate is proposed as the fitness function in SPSO. The proposed criterion does not need any pre-determined parameter to keep the balance between relevance and redundancy of the feature subset. The proposed method is compared with six filter approaches and four wrapper approaches on ten well known UCI dataset and the experimental results demonstrate the proposed method is promising.
Keywords
Introduction
With the rapid development of computer hardware and data storage, the amount of information grows explosively in which data are usually described by a huge number of features. However, not all the features are useful in building predictive models. Some redundant or irrelevant features would improve the computational cost and have a negative effect on the performance of learning algorithms, known as “the curse of dimensionality” [1]. Feature selection is an essential data pre-processing step in pattern recognition and data mining tasks. Feature selection aims to seek a subset of features while remaining the discriminating information as much as possible. Feature selection can speed up the learning process and lead to a more accurate and understandable predictive model.
Based on the feature subset evaluation criterion, feature selection methods can be generally categorized into filter approach and wrapper approach. In the wrapper approach, a pre-determined classifier is used to calculate the classification accuracy of a feature subset. The filter approach does not need any classifiers and it evaluates feature subsets with the statistical characteristics of features. In most cases, wrappers can achieve better classification performance since the feature subset is directly chosen based on the classification performance, but they are computational expensive, especially in high-dimensional datasets and they may suffer from over-fitting to the pre-determined classifier. Compared with wrappers, filters show higher computational efficiency and good generalization ability as the feature selection process is independent of any classifier. Furthermore, the bias caused by various classification algorithms would not affect the performance of filter methods.
Except the feature subset evaluation criterion, search strategy is also a crucial part in feature selection. Feature selection is a very difficult combinatorial optimization problem as the search space increases exponentially with the number of features. For a dataset with
Evolutional computation (EC) techniques are known for their global search ability in high-dimensional search space. Many EC based algorithms have been employed to feature selection, including genetic algorithm (GA), ant colony optimization (ACO), particle swarm optimization (PSO), and differential evolution (DE). In EC based feature selection methods, a feature subset is evaluated as a group. EC based wrapper approaches aim at finding feature subsets with high classification accuracy but they suffer from high computational cost in high dimensional datasets and prone to over-fitting the training data [8, 9, 10]. Compared with other EC techniques, PSO is simpler to implement and converges more quickly to the global optimal. Hence, PSO has been widely used in feature selection problems. However, existing PSO based feature selection methods still face two main drawbacks.
PSO was originally designed to solve optimization problems in continuous space. But feature selection is an optimization problem in discrete space. The discrete PSO algorithms used in existing PSO based feature selection methods generally maintains the simple mechanism of PSO on the whole. Their searching ability in discrete space is not satisfactory, compared with other EC such as GA and ACO. Therefore, the potential of PSO in feature selection has not been fully investigated. In PSO based feature selection method, one important factor is the design of objective function. Filter approach aims to generate feature subset with maximum relevance and minimum redundancy. Many researchers used a weighing parameter to keep the balance between relevancy and redundancy. But the parameter needs to be fine tuned in order to obtain good results in different datasets. Even when multi-objective approaches have been proposed to optimize relevance and redundancy simultaneously, two problems still exist: (1) it is hard to choose the optimal feature subset from the set of non-dominated solutions; (2) these two objectives may not be completely conflicting in some cases, so multi-objective approaches cannot guarantee the optimal feature subsets.
Based on those abovementioned problems, this research mainly focuses on: (1) how to discretize the original PSO in order to improve its search ability in discrete search space; (2) how to design the feature subset evaluation criterion which can keep a good balance between the relevance and redundancy in various feature selection problems.
In order to overcome the existing problems, this paper proposes a novel feature selection method using a discrete variant of PSO and a parameter-free feature subset evaluation criterion (SPSOFS). Set based PSO (SPSO) is a relatively recent version of discrete PSO [11]. SPSO introduces a set based particle encoding scheme and redefines the corresponding operators in PSO. SPSO keeps the main advantages of PSO, such as quick converging speed and ease of implementation and it has shown promising results in traveling salesman problem and the multidimensional knapsack problem.
In addition, a novel feature subset evaluation criterion based on contribution rate is proposed as the fitness function in SPSO. The proposed criterion does not require any pre-determined control parameters while the balance between relevance and redundancy is achieved by using the idea of the contribution rate of relevance and redundancy, respectively. The motivation of designing the criterion is to make the relevancy term and redundancy term always comparable in different datasets. Since the objective function is parameter-free, the proposed method can be used in different datasets without any modifications.
The rest of the paper is organized as follows: Section 2 reviews the related works. Section 3 briefly introduces some basic concepts about mutual information and PSO. Section 4 presents the proposed SPSOFS. Section 5 shows the experimental results and analysis. Section 6 concludes the whole paper.
Feature selection is an important data preprocessing step in pattern recognition and data analysis. It can reduce the dimensionality of the dataset and avoid the curse of dimensionality. Filter approaches aim at selecting the most relevant features and alleviating feature redundancy.
Feature relevance means its relationship with the target class. Feature relevance can be measured by some criteria like mutual information [2], distance [11], similarity [13] and consistency [14] and then the top ranked features are chosen to build the prediction model. Among all, mutual information (MI) is a robust and promising criterion which can measure arbitrary dependence relationships between two features and it has been widely utilized to find the most discriminative feature subset. MI can be directly used to select feature subset by maximizing the relevance between features and the target class, which is called the Max-Relevance strategy. The main problem of this method is that it would select several relevant but redundant features.
Therefore, feature selection methods considering reducing feature redundancy are proposed. Feature redundancy can defined as the relationship between a set of selected features. MIFS [2] and mRMR [3] are two well-known feature selection algorithms considering both relevance and redundancy. MIFS selected features with max relevance and employed a weight coefficient to alleviate the redundant information. mRMR replaced the weight coefficient in MIFS with the reciprocal of the feature subset cardinality. Hence, users do not need to set a suitable weight coefficient in order to obtain good feature subsets.
Some researchers proposed a new criterion which is to calculate the discriminative information that a feature can newly provide given a set of selected features. Based on this criterion, two representative methods JMI [6] and CMIM [15] were proposed. These methods select feature according to how much new classification this feature could provide. One drawback of these MI based approaches is that they neglect the interaction between features. One irrelevant feature may provide useful information when combined with other features.
In order to solve the problem, multi-information was proposed to measure the 3-way interaction between three features [16]. It needs to be noted that multi-information can take both positive and negative values. Some feature selection methods employed multi-information as the evaluation criterion, such as CIFE [4], IWFS [17], and ICAP [18]. In these methods, multi-information is used to measure the redundancy between two features and redundancy of a feature subset can be reduced.
These abovementioned feature selection methods mainly employ the greedy forward search strategy and they cannot guarantee the global optimal solution. Therefore, EC techniques have been extended to feature selection problem due to its global search ability and fast convergence speed. This paper mainly focuses on filter approaches. Therefore, EC based filters will be reviewed. GA is widely used in feature selection problems. Chakraborty proposed a GA based feature selection method using fuzzy set fitness function [19]. DE is a popular population based optimization algorithm which has been used in various problems as well as feature selection. Bhadra et al. developed a DE based filter feature selection method which optimizes the average standard deviation and dissimilarity of the selected feature subset, and the average similarity of non-selected features [20].
PSO is a relatively simple optimization framework and PSO have also been employed to feature selection. Wang et al. proposed a feature selection model based on rough set theory and PSO [20]. Bae et al. used an improve version of PSO for feature selection which shows better computational efficiency than the traditional PSO [22]. Cervante et al. proposed a PSO based feature selection technique aiming at maximizing the relevance and minimizing the redundancy of the obtained feature subsets [23]. ACO is a swarm based intelligent optimization algorithm. Tabakhi et al. proposed a filter method based on ACO, named UFSACO [24]. UFSACO selected features in several iterations without using any classifer. Tabakhi and Moradi represented the search space as a graph and used ACO to rank the features [25].
Multi-objective optimization techniques have also been applied to feature selection to optimize multiple filter criteria simultaneously. Xue et al. proposed several multi-objective EC based filter approaches, in which they used multi-objective binary PSO, NSGAII, and SPEA2 to optimize two different criteria simultaneously [26, 27, 28]. Hancer et al. developed a multi-objective artificial bee colony (MOABC) based feature selection model in which a new fuzzy mutual information criterion was used to evaluate the relevance of feature subsets [29]. Das et al. presented two bi-objective feature weighting and selection models based on MOEA/D and the two objectives were relevance and redundancy [30, 31].
Basic concepts
This section introduces several fundamental concepts related to this research. The proposed SPSOFS will be described later based on these basic backgrounds. Section 3.1 briefly describes mutual information. Section 3.2 introduces PSO and Section 3.3 explains how to extend PSO to feature selection problems.
Mutual information
In information theory, mutual information can effectively measure the correlation between two variables. For two discrete variable
where
Particle swarm optimization is a swarm based intelligence optimization algorithm, which is inspired by the social behavior of bird flocking or fish schooling [33]. PSO has drawn a lot of attentions due to its fast convergence and ease of implementation. It has shown promising results in a wide variety of optimization problems. In the standard PSO, each particle is encoded as a candidate solution of the problem to be solved. Particles fly in a multi-dimensional space searching for the optimal position according to their own flying experience and the experience of other particles in the swarm.
Let
where
The original PSO was predominately used to solve optimization problems in continuous space. However, feature selection is an optimization problem defined in discrete space. Therefore, PSO needs to be modified in order to be extended to solve feature selection problems. In addition, these modifications should preserve the important characteristics and main advantages of PSO, such as ease of implementation and fast convergence speed.
Most of the PSO based feature selection approaches employed the binary PSO (BPSO) developed by Kennedy and Eberhart [34]. In BPSO based feature selection approaches, the position of each particle represents a candidate feature subset. Each dimension can take value 1 or 0 which indicates whether the corresponding feature is selected in this feature subset. The velocity indicates the probability of the corresponding dimension of the position taking value 1. BPSO follows the simple framework of the canonical PSO and the search ability of BPSO in discrete search space is not satisfactory [11]. Therefore, PSO needs to be fine tuned to enhance its search ability in discrete space.
A Set-based discrete PSO model for feature selection
After the introduction of the basic concepts in Section 3, this section describes the proposed SPSOFS in detail. Section 4.1 describes the encoding scheme. Sections 4.2 and 4.3 introduce the velocity and position updating rules in SPSOFS, respectively. Section 4.4 describes the feature subset evaluation criterion.
Particle encoding scheme
In SPSOFS, the position vector represents a candidate feature subset. Denote
In standard PSO, the velocity term is crucial since it decides the moving direction and speed of particles. Velocity is the guidance of particle which determines whether the particle can find the optimal position. In SPSOFS, the velocity term is defined as a set of candidate features and their corresponding possibility of being selected. The velocity of particle
where
In PSO, velocity decides the moving direction and tendency of particles. Particles update their positions with the new velocities. As the position and velocity terms are redefined in SPSOFS, the velocity updating rule in Eq. (2) used in continuous search space is no longer suitable for discrete optimization. The updating rules of PSO need to be redefined to be extended to feature selection problem. In order to update velocities, some operators in Eq. (2) are redefined:
For example, given velocity
position – position: Given two position vectors
In the new definition,
According to Eq. (8), if one feature exists in both velocity terms, its new possibility is set to the larger one between
When the velocity is updated, each particle adjusts its current position with the new velocity. In this process, the particle learns some important information from the velocity. In Eq. (3), velocity term and position term can be added directed. However, in SPSOFS, these two operands have different formats.
In order to update the position with the velocity, the velocity term is transformed into a set of features firstly. A rand number
The cut set and the current position will form the new position together. Denote the size of cutest is
Feature subset evaluation criterion
In SPSOFS, a group of features are evaluated as a whole. MI can be used to measure the relevance between the feature subset and the target class label and the redundancy between the selected features. However, it is a difficult problem to keep the balance between the relevance and the redundancy term as the magnitude of these two terms varies a lot in different datasets. It is impractical to choose a weighting parameter which can obtain consistent performance in all the datasets.
In this paper, a novel criterion based on contribution rate is proposed. The proposed criterion does not need any pre-determined parameter and the balance of relevance and redundancy is achieved by considering the contribution rate of the feature subset in terms of relevance and redundancy, respectively. Hence, the new criterion shows good robustness and consistency over different datasets. Given a particle
As
Since the total redundancy between all the
Finally, the fitness function is defined as:
The main advantage of this criterion is that the two terms in Eq. (11) are both in the range of [0,1], so relevance and redundancy are always comparable in different datasets. This criterion seeks for the best combination of
Datasets
In order to verify the effectiveness of the proposed method (refer to Algorithm 1), ten datasets from the UCI machine learning repository are chosen to conduct experiments. These datasets show a large diversity over the number of classes, features, and instances. The detailed information of these datasets is given in Table 1. For the datasets with missing values, these missing values are made up by averaging their adjacent data. For the datasets with continuous values, the MDL discretization method [35] is used to transform the continuous values into discrete values in order to compute the MI between the features.
Datasets
Datasets
For each dataset, all the instances are randomly divided into 10 equally sized folds. 9 folds are used as the training set and the remaining 1 fold is used as the test data. All the feature selection approaches are run on training data to obtain feature subsets. Then the test set is used to testify the classification accuracy of the obtained feature subsets. The cross-validation process is repeated 10 times with each of the 10 folds as the test data. The mean classification accuracy of the 10-fold cross validation is the classification performance of the feature subset. In this paper,
We compare the proposed SPSOFS with several state-of-the-art feature selection algorithms to testify the effectiveness of SPSOFS. These comparative algorithms are listed as follows, including six filters and four wrappers.
Filter approach: Six filters can be categorized into two groups: (1) non-EC based approaches: mRMR, CIFE, and JMI; (2) EC based approaches: Simultaneous Feature Selection and Weighting (SFSW) [36], Feature weighting and selection with Pareto optimal concept (FWSP) [31], and Differential Evolution based Multi-Objective Feature Selection (DEMOFS) [37]. Wrapper approach: Binary particle swarm optimization (BPSO) [34], Barebones particle swarm optimization (BBPSO) [38], PSO (4-2) [39], and Binary particle swarm optimization with catfish effect (BPSO-CE) [40].
SPSOFS is first compared with three non-EC based filter approaches about the classification accuracy in each dataset with specific number of selected features. Then, the average classification accuracy and number of features of SPSOFS are compared with three EC based filters. Thirdly, four wrappers are used as comparison in terms of classification accuracy and number of features. Last, we further investigate the effectiveness of the contribution rate based fitness function.
The experiments are performed on a machine with Intel(R) Core(TM) i5-6500 at 3.2 GHz and 8.00 GB of RAM using MATLAB. The operating system is MS Windows 10. The results of mRMR, CIFE, and JMI are run with the Feature Selection Toolbox (FEAST) which is developed by Brown et al. [41]. (Available at:
Results and discussions
In this section, the experimental results are reported. In order to measure the performance of the SPSOFS, four sets of experiments are conducted. In the first set of experiments, SPSOFS is compared with three non-EC approaches. In the second set, three EC based filters are compared with SPSOFS. The third set compares SPSOFS with four PSO based wrappers and the effectiveness of the proposed contribution rate based evaluation criterion is examined in the last set of experiment.
Comparison with non-EC based filter approaches
Three non-EC approaches, mRMR, CIFE, and JMI, all employ the forward sequential search strategy. Features are ranked in the descending order according to their own criteria and the top
Comparison of the classification accuracy based on the first
selected features
Comparison of the classification accuracy based on the first
It can be found from Table 2 that SPSOFS achieves the highest classification accuracy in most of the datasets, especially in the datasets with relatively larger number of features, such as Sonar, Musk and Arrhythmia. For instance, in the Musk dataset, SPSOFS obtains the highest classification accuracy with three different numbers of chosen features. In datasets with small number of features, the performance of SPSOFS is also very competitive which can be seen from the results of Heart, Wine, and German. It should be noticed that SPSOFS is able to find a small subset of features with high classification accuracy in datasets with a large number of features. For instance, in the Sonar dataset, the accuracy is 0.7936 for SPSOF with 10 features while the classification accuracies with the top 30 features of mRMR and CIFE are still lower than 0.7936. This demonstrates the superiority of the group based feature selection method. SPSOFS can find the best combination of
In addition, SPSOFS shows good consistency over the ten datasets. Only SPSOFS produces promising results in all the datasets while other comparative algorithms work well on some but not all the datasets. For instance, the performance of mRMR in Sonar dataset is not as good as other algorithms while CIFE and JMI fail to find informative feature subsets in Heart and Wine datasets.
Classification results for Heart dataset.
Classification results for Ionosphere dataset.
Another effective way to measure the performance of these four methods is to compare the classification accuracies with the increase of number of selected features. Four datasets with different number of features (Heart, wine, sonar, and musk) are chosen as the representative datasets to conduct this experiment. Figures 1–4 show the classification accuracies of mRMR, CIFE, JMI, and SPSOFS over different number of selected features in the four datasets. The
Classification results for Sonar dataset.
Three recently proposed EC based feature selection methods are compared with SPSOFS in order to further illustrate the effectiveness of SPSOFS. These comparative methods also use KNN with 10-fold cross validation to calculate the classification accuracy. The results of these methods are derived from the corresponding papers. Hence, only the results of the datasets which are used in comparative methods and this paper are presented in Table 3. The results of SPSOFS shown in Table 3 are the best mean classification accuracies in each dataset (the maximum number of features is restricted to 30). The highest classification accuracy in each dataset is shown in boldface.
Comparison with 3 EC based filter approaches
Comparison with 3 EC based filter approaches
Classification results for Musk dataset.
As can be seen from Table 3, SPSOFS outperforms other comparative algorithms in terms of classification accuracy in 7 out of 10 datasets. In other dataset, SPSOFS obtains slightly lower classification accuracy than the best one in each dataset but SPSOFS selects smaller feature sets in the three datasets, especially in the datasets with a large number of features. On a whole, SPSOFS shows very competitive results compared with other EC based filter approaches.
In this section, SPSOFS is compared with four EC based wrappers, including BPSO, BBPSO, PSO (4-2) and BPSO-CE. Table 4 reports the mean classification accuracies of the four wrapper approaches and SPSOFS in ten independent runs on the ten datasets. For each dataset, the best result is shown in boldface.
Comparison with four PSO based wrapper approaches
Comparison with four PSO based wrapper approaches
According to Table 4, it can be seen that SPSOFS obtains the best classification performance in 6 out of 10 datasets and achieves the second best result in 3 dataset. The classification performance of SPSOFS is not satisfactory in Musk dataset but the gap between SPSOFS and other methods is not large. In terms of the average classification, SPSOFS offers the best result of 0.8125 while second best is 0.7868 which is achieved by BBSPO. Consequently, it can be concluded from Table 4 that SPSOFS shows superior performance to the four PSO based wrapper approaches.
In this section, we will investigate the effectiveness of the contribution rate based feature subset evaluation criterion. In most EC based filter approaches, both relevance and redundancy terms are included in the evaluation criterion and a weighting parameter is adopted to keep the balance between relevance and redundancy. The fitness function is shown as follows:
where
In this section, SPSOFS with the contribution rate (CR) based feature subset evaluation criterion is compared with SPSOFS using Eq. (12) as the fitness function. Three different values of
Comparison with different weighting parameters
It can be seen from Table 5 that SPSOFS with the contribution rate based evaluation criterion shows the best result in most of the cases. When using Eq. (12) as the evaluation criterion, the fine tuning of the weighting parameter is very important. Different balancing parameter would lead to different results. For example, in Heart dataset, when
It can be concluded from the abovementioned results that it is crucial to set a proper weighting parameter in order to guarantee good performance. However, it is almost impractical to choose a weighting parameter which is suitable in all the datasets. The contribution rate based feature subset evaluation criterion is parameter free which does not need any prior information about the dataset in order to choose a proper weighting parameter. Moreover, the proposed evaluation criterion shows superior performance in terms of classification accuracy than using the weighting parameter. Henceï¼the proposed criterion can be used directly in various real world problems without setting any control parameter and it can also produce promising results.
The aim of this research is to select high quality feature subsets within acceptable time. This paper proposes a novel feature selection method (SPSOFS) using a set based particle swarm optimization and a contribution rate based feature subset evaluation criterion. In EC based filters, two main problems are the search strategy and the feature evaluation criterion. The proposed method shows several attractive advantages. This paper first applies SPSO to feature selection problem to make use of its search ability in the discrete search space. Compared with the greedy forward search strategy which is employed by many filter feature selection methods, SPSO shows powerful global search ability which enables it to generate optimal feature subsets in the high dimensional feature space. Moreover, the novel feature subset evaluation criterion proposed in this paper does not need any pre-specified parameter to keep the balance between relevance and redundancy. It is easy to implement and can produce stable results in different datasets without any modifications.
In order to verify the effectiveness of the proposed method, it is compared with six filter and four wrapper approaches on ten UCI datasets. The experimental results show that SPSOFS can find feature subsets with sufficient discriminative information. SPSOFS achieves better classification accuracy than other methods in most of the cases. Moreover, a detailed discussion also demonstrates the effectiveness of the novel feature subset evaluation criterion. Future work may focus on incorporating the multi-information to form a new feature subset evaluation criterion. Multi-information can be used to describe the complementary or contradictory interactions between features and find the interactive features.
Footnotes
Acknowledgments
This work was supported by the NUPTSF under grant no. NY214186 and the Natural Science Foundation of Jiangsu Province under grant no. BK20160898.
