Feature selection using a set based discrete particle swarm optimization and a novel feature subset evaluation criterion

Abstract

Many machine learning and pattern recognition tasks, such as classification, involve datasets with a large number of features. Feature selection aims at eliminating the redundant and irrelevant features, which impose a computational burden and degrade the performance of learning algorithms. Particle swarm optimization (PSO) has been widely used in feature selection due to its global search ability and computational efficiency. However, PSO was originally designed for continuous optimization problems, and the discretization of PSO for feature selection still needs further investigation. This paper develops a novel feature selection algorithm based on a set based discrete PSO (SPSO). SPSO employs a set based encoding scheme that allows it to characterize the discrete search space of the feature selection problem. It also redefines the velocity term and the corresponding arithmetic operators, which enables it to search for the optimal feature subset in the discrete space. In addition, a novel feature subset evaluation criterion based on contribution rate is proposed as the fitness function in SPSO. The proposed criterion does not need any pre-determined parameter to keep the balance between relevance and redundancy of the feature subset. The proposed method is compared with six filter approaches and four wrapper approaches on ten well-known UCI datasets, and the experimental results demonstrate that the proposed method is promising.

Keywords

Feature selection; particle swarm optimization; discrete search space; mutual information; feature subset evaluation criterion

1. Introduction

With the rapid development of computer hardware and data storage, the amount of information is growing explosively, and data are usually described by a huge number of features. However, not all the features are useful in building predictive models. Redundant or irrelevant features increase the computational cost and have a negative effect on the performance of learning algorithms, known as “the curse of dimensionality” [1]. Feature selection is an essential data pre-processing step in pattern recognition and data mining tasks. It aims to find a subset of features that retains as much of the discriminating information as possible. Feature selection can speed up the learning process and lead to a more accurate and understandable predictive model.

Based on the feature subset evaluation criterion, feature selection methods can generally be categorized into filter approaches and wrapper approaches. In the wrapper approach, a pre-determined classifier is used to calculate the classification accuracy of a feature subset. The filter approach does not need any classifier; it evaluates feature subsets using the statistical characteristics of the features. In most cases, wrappers achieve better classification performance since the feature subset is chosen directly based on the classification performance, but they are computationally expensive, especially on high-dimensional datasets, and they may over-fit to the pre-determined classifier. Compared with wrappers, filters show higher computational efficiency and good generalization ability, as the feature selection process is independent of any classifier. Furthermore, the bias caused by various classification algorithms does not affect the performance of filter methods.

Besides the feature subset evaluation criterion, the search strategy is also a crucial part of feature selection. Feature selection is a very difficult combinatorial optimization problem, as the search space grows exponentially with the number of features. For a dataset with $n$ features, there are $2^{n}$ candidate feature subsets. Exhaustive search is impractical on large datasets due to the high computational cost. Many heuristic approaches have been proposed to obtain the best feature subset. Many methods employ a greedy forward search strategy (such as MIFS [2], mRMR [3], CIFE [4] and JMI [6]). They start from the most relevant feature and add one new feature at a time, considering its relevance to the target class and its redundancy with the features already chosen. The major disadvantage of this search strategy is that it cannot evaluate a set of features simultaneously: the best $m$ individual features may not be the best feature subset with cardinality $m$. Another problem with the greedy forward strategy is that it may easily fall into local optima.

Evolutionary computation (EC) techniques are known for their global search ability in high-dimensional search spaces. Many EC based algorithms have been applied to feature selection, including genetic algorithm (GA), ant colony optimization (ACO), particle swarm optimization (PSO), and differential evolution (DE). In EC based feature selection methods, a feature subset is evaluated as a group. EC based wrapper approaches aim at finding feature subsets with high classification accuracy, but they suffer from high computational cost on high-dimensional datasets and are prone to over-fitting the training data [8, 9, 10]. Compared with other EC techniques, PSO is simpler to implement and converges more quickly to the global optimum. Hence, PSO has been widely used in feature selection problems. However, existing PSO based feature selection methods still face two main drawbacks.

a.
PSO was originally designed to solve optimization problems in continuous space, but feature selection is an optimization problem in discrete space. The discrete PSO algorithms used in existing PSO based feature selection methods generally maintain the simple mechanism of PSO on the whole. Their search ability in discrete space is not satisfactory compared with other EC techniques such as GA and ACO. Therefore, the potential of PSO in feature selection has not been fully investigated.
b.
In PSO based feature selection methods, one important factor is the design of the objective function. The filter approach aims to generate a feature subset with maximum relevance and minimum redundancy. Many researchers use a weighting parameter to keep the balance between relevance and redundancy, but this parameter needs to be fine-tuned in order to obtain good results on different datasets. Even though multi-objective approaches have been proposed to optimize relevance and redundancy simultaneously, two problems still exist: (1) it is hard to choose the optimal feature subset from the set of non-dominated solutions; (2) these two objectives may not be completely conflicting in some cases, so multi-objective approaches cannot guarantee the optimal feature subsets.

Based on those abovementioned problems, this research mainly focuses on: (1) how to discretize the original PSO in order to improve its search ability in discrete search space; (2) how to design the feature subset evaluation criterion which can keep a good balance between the relevance and redundancy in various feature selection problems.

In order to overcome the existing problems, this paper proposes a novel feature selection method using a discrete variant of PSO and a parameter-free feature subset evaluation criterion (SPSOFS). Set based PSO (SPSO) is a relatively recent version of discrete PSO [11]. SPSO introduces a set based particle encoding scheme and redefines the corresponding operators in PSO. SPSO keeps the main advantages of PSO, such as quick converging speed and ease of implementation and it has shown promising results in traveling salesman problem and the multidimensional knapsack problem.

In addition, a novel feature subset evaluation criterion based on contribution rate is proposed as the fitness function in SPSO. The proposed criterion does not require any pre-determined control parameters while the balance between relevance and redundancy is achieved by using the idea of the contribution rate of relevance and redundancy, respectively. The motivation of designing the criterion is to make the relevancy term and redundancy term always comparable in different datasets. Since the objective function is parameter-free, the proposed method can be used in different datasets without any modifications.

The rest of the paper is organized as follows: Section 2 reviews the related works. Section 3 briefly introduces some basic concepts about mutual information and PSO. Section 4 presents the proposed SPSOFS. Section 5 shows the experimental results and analysis. Section 6 concludes the whole paper.
2. Related works

Feature selection is an important data preprocessing step in pattern recognition and data analysis. It can reduce the dimensionality of the dataset and avoid the curse of dimensionality. Filter approaches aim at selecting the most relevant features and alleviating feature redundancy.

Feature relevance refers to the relationship between a feature and the target class. It can be measured by criteria such as mutual information [2], distance [11], similarity [13] and consistency [14], after which the top ranked features are chosen to build the prediction model. Among these, mutual information (MI) is a robust and promising criterion which can measure arbitrary dependence relationships between two variables, and it has been widely utilized to find the most discriminative feature subset. MI can be directly used to select a feature subset by maximizing the relevance between features and the target class, which is called the Max-Relevance strategy. The main problem of this method is that it tends to select several features that are relevant but mutually redundant.

Therefore, feature selection methods that also reduce feature redundancy have been proposed. Feature redundancy can be defined as the relationship between the features in a set of selected features. MIFS [2] and mRMR [3] are two well-known feature selection algorithms considering both relevance and redundancy. MIFS selects features with maximum relevance and employs a weight coefficient to alleviate the redundant information. mRMR replaces the weight coefficient in MIFS with the reciprocal of the feature subset cardinality. Hence, users do not need to set a suitable weight coefficient in order to obtain good feature subsets.

Some researchers proposed a new criterion that calculates the discriminative information a feature can newly provide given a set of selected features. Based on this criterion, two representative methods, JMI [6] and CMIM [15], were proposed. These methods select a feature according to how much new classification information it could provide. One drawback of these MI based approaches is that they neglect the interaction between features: a feature that is irrelevant on its own may provide useful information when combined with other features.

In order to solve the problem, multi-information was proposed to measure the 3-way interaction between three features [16]. It needs to be noted that multi-information can take both positive and negative values. Some feature selection methods employed multi-information as the evaluation criterion, such as CIFE [4], IWFS [17], and ICAP [18]. In these methods, multi-information is used to measure the redundancy between two features and redundancy of a feature subset can be reduced.

The abovementioned feature selection methods mainly employ the greedy forward search strategy and cannot guarantee the globally optimal solution. Therefore, EC techniques have been extended to the feature selection problem due to their global search ability and fast convergence speed. Since this paper mainly focuses on filter approaches, EC based filters are reviewed here. GA is widely used in feature selection problems. Chakraborty proposed a GA based feature selection method using a fuzzy set fitness function [19]. DE is a popular population based optimization algorithm which has been used in various problems, including feature selection. Bhadra et al. developed a DE based filter feature selection method which optimizes the average standard deviation and dissimilarity of the selected feature subset, and the average similarity of non-selected features [20].

PSO is a relatively simple optimization framework and has also been applied to feature selection. Wang et al. proposed a feature selection model based on rough set theory and PSO [20]. Bae et al. used an improved version of PSO for feature selection which shows better computational efficiency than the traditional PSO [22]. Cervante et al. proposed a PSO based feature selection technique aiming at maximizing the relevance and minimizing the redundancy of the obtained feature subsets [23]. ACO is a swarm based intelligent optimization algorithm. Tabakhi et al. proposed a filter method based on ACO, named UFSACO [24]. UFSACO selects features over several iterations without using any classifier. Tabakhi and Moradi represented the search space as a graph and used ACO to rank the features [25].

Multi-objective optimization techniques have also been applied to feature selection to optimize multiple filter criteria simultaneously. Xue et al. proposed several multi-objective EC based filter approaches, in which they used multi-objective binary PSO, NSGAII, and SPEA2 to optimize two different criteria simultaneously [26, 27, 28]. Hancer et al. developed a multi-objective artificial bee colony (MOABC) based feature selection model in which a new fuzzy mutual information criterion was used to evaluate the relevance of feature subsets [29]. Das et al. presented two bi-objective feature weighting and selection models based on MOEA/D and the two objectives were relevance and redundancy [30, 31].

3. Basic concepts

This section introduces several fundamental concepts related to this research. The proposed SPSOFS will be described later based on these basic backgrounds. Section 3.1 briefly describes mutual information. Section 3.2 introduces PSO and Section 3.3 explains how to extend PSO to feature selection problems.

3.1 Mutual information

In information theory, mutual information can effectively measure the correlation between two variables. For two discrete variables $x$ and $c$, the MI between them is defined as follows:

$\displaystyle I(x;c)=H(x)-H(x|c)$ (1)

where $H(x)$ is the entropy of variable $x$ and $H(x|c)$ denotes the conditional entropy of $x$ given $c$. $I(x;c)$ is non-negative. When $I(x;c)=0$, $x$ and $c$ are completely independent. MI can be used to evaluate the discriminative ability of features. MI based feature selection methods usually employ successive search strategies to generate the feature subset [32].
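As an illustration of Eq. (1), the following minimal sketch estimates MI from empirical frequencies of two discrete sequences. The function names and the use of base-2 logarithms (MI in bits) are assumptions made for this example; the paper does not state which logarithm base is used.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy H(x), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def conditional_entropy(x, c):
    """H(x|c): entropy of x within each group of equal c, weighted by p(c)."""
    n = len(x)
    groups = {}
    for xi, ci in zip(x, c):
        groups.setdefault(ci, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mutual_information(x, c):
    """I(x;c) = H(x) - H(x|c), following Eq. (1)."""
    return entropy(x) - conditional_entropy(x, c)

feature = [0, 0, 1, 1]
print(mutual_information(feature, [0, 0, 1, 1]))  # feature fully determines the class
print(mutual_information(feature, [0, 1, 0, 1]))  # feature is independent of the class
```

In the first call the feature determines the class, so the MI equals the entropy of the feature; in the second the two sequences are independent and the MI vanishes.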

3.2 Particle swarm optimization

Particle swarm optimization is a swarm based intelligent optimization algorithm inspired by the social behavior of bird flocking and fish schooling [33]. PSO has drawn considerable attention due to its fast convergence and ease of implementation, and it has shown promising results in a wide variety of optimization problems. In the standard PSO, each particle encodes a candidate solution of the problem to be solved. Particles fly in a multi-dimensional space searching for the optimal position according to their own flying experience and the experience of other particles in the swarm.

Let $X_{i}=(x_{1},x_{2},\ldots,x_{m})$ denote the position of the $i$ th particle in the swarm. $m$ is the dimension of the search space. Its current velocity is $V_{i}=(v_{1},v_{2},\ldots,v_{m})$ . In the basic PSO algorithm, the positions of particles are updated by the following equations:

$\displaystyle V_{i}^{t+1}=w\times V_{i}^{t}+c_{1}\times r_{1}\times(\textit{pbest}_{i}-X_{i}^{t})+c_{2}\times r_{2}\times(\textit{gbest}-X_{i}^{t})$ (2)

$\displaystyle X_{i}^{t+1}=X_{i}^{t}+V_{i}^{t+1}$ (3)

where $V_{i}^{t}$ is the velocity of particle $i$ in cycle $t$; $X_{i}^{t}$ is the position of particle $i$ in cycle $t$; $\textit{pbest}_{i}$ is the personal best position of particle $i$; $\textit{gbest}$ is the global best position; $w$ is the inertia weight, which plays an important role in balancing global and local search and is typically set between 0 and 1. A relatively large inertia weight improves the global search ability, while a relatively small inertia weight focuses on local search. $c_{1}$ is the cognitive weight and $c_{2}$ is the social weight; $r_{1}$ and $r_{2}$ are two random numbers in [0,1].
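Eqs. (2) and (3) can be sketched for a single particle as follows. This is only an illustration: the function name `pso_step` and the default parameter values are assumptions, and $r_{1}$ and $r_{2}$ are drawn once per particle as the equations are written (many implementations draw them per dimension instead).

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, rng=random):
    """One continuous PSO update of a single particle, following Eqs. (2) and (3).

    x, v, pbest and gbest are equal-length lists of floats.
    Returns the new position and the new velocity.
    """
    r1, r2 = rng.random(), rng.random()
    new_v = [
        w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
        for xi, vi, pb, gb in zip(x, v, pbest, gbest)
    ]
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v

# A particle already sitting on both bests keeps moving only through inertia.
print(pso_step([1.0, 2.0], [0.5, -0.5], [1.0, 2.0], [1.0, 2.0], w=0.5))
```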

3.3 Particle swarm optimization for feature selection

The original PSO was predominately used to solve optimization problems in continuous space. However, feature selection is an optimization problem defined in discrete space. Therefore, PSO needs to be modified in order to be extended to solve feature selection problems. In addition, these modifications should preserve the important characteristics and main advantages of PSO, such as ease of implementation and fast convergence speed.

Most of the PSO based feature selection approaches employed the binary PSO (BPSO) developed by Kennedy and Eberhart [34]. In BPSO based feature selection approaches, the position of each particle represents a candidate feature subset. Each dimension can take value 1 or 0 which indicates whether the corresponding feature is selected in this feature subset. The velocity indicates the probability of the corresponding dimension of the position taking value 1. BPSO follows the simple framework of the canonical PSO and the search ability of BPSO in discrete search space is not satisfactory [11]. Therefore, PSO needs to be fine tuned to enhance its search ability in discrete space.
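For comparison, a BPSO update can be sketched as below. The paper only states that the velocity indicates the probability of a bit taking value 1; the sigmoid mapping used here is the one commonly associated with Kennedy and Eberhart's binary PSO and is an assumption of this sketch, as are the function name, the inertia weight and the default parameter values (the clamping bound of 6 mirrors the velocity bounds reported in Section 5.3).

```python
import math
import random

def bpso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, v_max=6.0, rng=random):
    """One binary PSO update: x, pbest and gbest are lists of 0/1 bits."""
    new_x, new_v = [], []
    for xi, vi, pb, gb in zip(x, v, pbest, gbest):
        vi = w * vi + c1 * rng.random() * (pb - xi) + c2 * rng.random() * (gb - xi)
        vi = max(-v_max, min(v_max, vi))      # clamp the velocity to [-v_max, v_max]
        prob = 1.0 / (1.0 + math.exp(-vi))    # assumed sigmoid mapping to P(bit = 1)
        new_v.append(vi)
        new_x.append(1 if rng.random() < prob else 0)
    return new_x, new_v

rng = random.Random(0)
print(bpso_step([1, 0, 0, 1], [0.0] * 4, [1, 1, 0, 0], [0, 1, 0, 1], rng=rng))
```

Each bit is resampled independently from its own probability, which illustrates why the size of the selected subset is not controlled directly in this encoding.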

4. A Set-based discrete PSO model for feature selection

After the introduction of the basic concepts in Section 3, this section describes the proposed SPSOFS in detail. Section 4.1 describes the encoding scheme. Sections 4.2 and 4.3 introduce the velocity and position updating rules in SPSOFS, respectively. Section 4.4 describes the feature subset evaluation criterion.

4.1 Particle encoding scheme

In SPSOFS, the position vector represents a candidate feature subset. Denote $X_{i}$ as the position of the $i$ th particle, with $X_{i}=\{x_{1},x_{2},\ldots,x_{m}\}$ and $X_{i}\subseteq F$, where $F$ is the original feature set. The dimension of the position vector is equal to the number of features to be selected. Each dimension is an integer in [1, $n$ ], where $n$ denotes the number of original features. For example, in a dataset with 10 features, a particle $X_{i}=$ {1, 4, 5, 8} means that features 1, 4, 5, and 8 are selected in this feature subset. At the beginning of the algorithm, each particle position is generated by randomly choosing $m$ different integers in [1, $n$ ].

In standard PSO, the velocity term is crucial since it decides the moving direction and speed of particles. Velocity is the guidance of particle which determines whether the particle can find the optimal position. In SPSOFS, the velocity term is defined as a set of candidate features and their corresponding possibility of being selected. The velocity of particle $i$ is defined as follows:

$\displaystyle V_{i}=\{f/p(f)|f\in F\}$ (4)

where $f$ is a feature chosen from $F$ and $p(f)$ is its corresponding possibility. For example, the $i$ th particle’s velocity $V_{i}=$ {4/0.4, 6/0.7} means that the possibilities of features 4 and 6 being selected are 0.4 and 0.7, respectively. In order to keep the velocity term short and simple during the optimization process, when the possibility of a feature $f_{i}$ is smaller than a pre-defined threshold, the item $f_{i}/p(f_{i})$ is deleted from the velocity term because $f_{i}$ is very unlikely to be selected. In this paper, the threshold is set to 0.1. Initially, the velocity of each particle is generated by randomly choosing one feature from the entire feature set and assigning it a possibility drawn at random from [0,1].
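The encoding and initialization described in this subsection can be sketched as follows. Representing a position as a Python set and a velocity as a dictionary from feature index to possibility is a choice made for this example, not something prescribed by the paper; the function names are likewise hypothetical.

```python
import random

THRESHOLD = 0.1  # pruning threshold reported in the paper

def init_position(n, m, rng=random):
    """Choose m distinct feature indices from 1..n as the initial feature subset."""
    return set(rng.sample(range(1, n + 1), m))

def init_velocity(n, rng=random):
    """Start from one randomly chosen feature with a random possibility in [0, 1)."""
    return {rng.randint(1, n): rng.random()}

def prune(velocity, threshold=THRESHOLD):
    """Delete items whose possibility is smaller than the threshold."""
    return {f: p for f, p in velocity.items() if p >= threshold}

rng = random.Random(42)
print(init_position(10, 4, rng))
print(init_velocity(10, rng))
print(prune({4: 0.05, 6: 0.7}))
```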

4.2 Velocity updating rules

In PSO, velocity decides the moving direction and tendency of particles. Particles update their positions with the new velocities. As the position and velocity terms are redefined in SPSOFS, the velocity updating rule in Eq. (2) used in continuous search space is no longer suitable for discrete optimization. The updating rules of PSO need to be redefined to be extended to feature selection problem. In order to update velocities, some operators in Eq. (2) are redefined:

a.
$w\times V$ is calculated as follows:

$\displaystyle w\times V=\left\{\begin{array}{ll}\{f/1\},&\text{if}\ p(f)\times w>1\\ \{f/(p(f)\times w)\},&\text{otherwise}\end{array}\right.$ (5)

For example, given velocity $V_{i}=$ {4/0.4, 6/0.7} and $w=$ 0.4, we have $w\times V_{i}=$ {4/0.16, 6/0.28}.
b.
position – position: Given two position vectors $X_{1}$ and $X_{2}$ , $X_{1}-X_{2}$ is defined as:

$\displaystyle X_{1}-X_{2}=\left\{{f|f\in X_{1}\ \text{and}\ f\notin X_{2}}\right\}$ (6)

In the new definition, $X_{1}-X_{2}$ contains the features that exist in subset $X_{1}$ but not in subset $X_{2}$. For example, given $X=$ {1, 4, 5, 8} and $\textit{gbest}=$ {2, 4, 6, 8}, $\textit{gbest}-X=$ {2, 6}.
c.
$c\times r\times\textit{position}$ is calculated as follows:

$\displaystyle c\times r\times\textit{position}=\left\{\begin{array}{ll}\{f/1\},&\text{if}\ c\times r>1\\ \{f/(c\times r)\},&\text{otherwise}\end{array}\right.$ (7)

$c\times r\times(\textit{gbest}-X)$ and $c\times r\times(\textit{pbest}-X)$ can be computed in this way. By using this operator, the position term is converted into a set of features and their corresponding possibility, which is the same as the format of the velocity term.
d.
$V_{1}+V_{2}$ : The last operator in Eq. (2) that needs to be redefined is the plus operator for two velocity terms. According to the abovementioned definitions, the three operands in Eq. (2) have the same format, i.e., a set of features and their corresponding possibilities. Given two velocity terms $V_{1}=\{f/p_{1}(f)|f\in F\}$ and $V_{2}=\{f/p_{2}(f)|f\in F\}$ , $V_{1}+V_{2}$ is defined as:

$\displaystyle V_{1}+V_{2}=\{f/\max(p_{1}(f),p_{2}(f))|f\in F\}$ (8)

According to Eq. (8), if one feature exists in both velocity terms, its new possibility is set to the larger one between $p_{1}(f)$ and $p_{2}(f)$ . For example, given $V_{1}=\{4/0.4,6/0.7\}$ and $V_{2}=\{4/0.8,5/0.4\}$ , we have $V_{1}+V_{2}=\{4/0.8,5/0.4,6/0.7\}$ . In this example, it is clear to see that feature 4 exists in both velocity terms, so its possibility in the new velocity is set as the larger one.
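The four redefined operators, and their combination into the velocity update of Eq. (2), can be sketched as follows. Velocities are represented as dictionaries from feature to possibility and positions as sets; this representation, the function names, and the treatment of a feature missing from a velocity as having possibility 0 are assumptions of the sketch. Pruning with the 0.1 threshold from Section 4.1 is applied at the end.

```python
def scale_velocity(w, velocity):
    """w x V, Eq. (5): multiply each possibility by w and cap it at 1."""
    return {f: min(1.0, p * w) for f, p in velocity.items()}

def position_diff(x1, x2):
    """X1 - X2, Eq. (6): features that are in x1 but not in x2."""
    return set(x1) - set(x2)

def weight_position(c, r, features):
    """c x r x position, Eq. (7): every feature gets possibility min(1, c * r)."""
    p = min(1.0, c * r)
    return {f: p for f in features}

def add_velocities(v1, v2):
    """V1 + V2, Eq. (8): union of the features, keeping the larger possibility."""
    out = dict(v1)
    for f, p in v2.items():
        out[f] = max(p, out.get(f, 0.0))
    return out

def update_velocity(v, x, pbest, gbest, w, c1, r1, c2, r2, threshold=0.1):
    """Eq. (2) with the redefined operators, followed by threshold pruning."""
    inertia = scale_velocity(w, v)
    cognitive = weight_position(c1, r1, position_diff(pbest, x))
    social = weight_position(c2, r2, position_diff(gbest, x))
    combined = add_velocities(add_velocities(inertia, cognitive), social)
    return {f: p for f, p in combined.items() if p >= threshold}

# The worked examples from the text:
print(scale_velocity(0.4, {4: 0.4, 6: 0.7}))
print(position_diff({2, 4, 6, 8}, {1, 4, 5, 8}))
print(add_velocities({4: 0.4, 6: 0.7}, {4: 0.8, 5: 0.4}))
```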
4.3 Position updating rules

Once the velocity is updated, each particle adjusts its current position with the new velocity. In this process, the particle learns some important information from the velocity. In Eq. (3), the velocity term and the position term can be added directly. However, in SPSOFS, these two operands have different formats.

In order to update the position with the velocity, the velocity term is first transformed into a set of features. A random number $\alpha\in(0,1)$ is generated. For each item in the velocity, if the feature’s corresponding possibility is larger than $\alpha$ , the feature is stored in a cut set. The cut set is the new information that a particle learns from the velocity.

The cut set and the current position together form the new position. Denote the size of the cut set as $l$ and the dimension of the position as $m$ . When $l$ is larger than $m$ , $m$ features are randomly chosen from the cut set to form the new position, and the current position is not used. When $l$ is smaller than $m$ , $m-l$ features are randomly chosen from the current position and combined with the $l$ features in the cut set to form the new position.
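The position update described above can be sketched as follows. Two details are not specified in the text and are filled in here as assumptions: the case $l=m$ is handled like $l>m$ (the cut set becomes the new position), and the $m-l$ features kept from the current position are drawn only from features not already in the cut set, so that the new position always contains $m$ distinct features. The function name is hypothetical.

```python
import random

def update_position(position, velocity, rng=random):
    """Build the new feature subset from the cut set and the current position.

    position: set of m feature indices; velocity: dict feature -> possibility.
    """
    m = len(position)
    alpha = rng.random()  # random number standing in for alpha in (0, 1)
    cut_set = [f for f, p in velocity.items() if p > alpha]
    if len(cut_set) >= m:
        # Enough learned features: sample m of them, ignore the current position.
        return set(rng.sample(cut_set, m))
    # Otherwise keep m - l features of the current position (assumed: not in the cut set).
    leftovers = sorted(set(position) - set(cut_set))
    kept = rng.sample(leftovers, m - len(cut_set))
    return set(cut_set) | set(kept)

rng = random.Random(1)
print(update_position({1, 4, 5, 8}, {2: 1.0, 6: 1.0}, rng))
```

Because the current position has $m$ distinct features and at most $l$ of them can appear in the cut set, at least $m-l$ features are always available for the second branch.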

4.4 Feature subset evaluation criterion

In SPSOFS, a group of features are evaluated as a whole. MI can be used to measure the relevance between the feature subset and the target class label and the redundancy between the selected features. However, it is a difficult problem to keep the balance between the relevance and the redundancy term as the magnitude of these two terms varies a lot in different datasets. It is impractical to choose a weighting parameter which can obtain consistent performance in all the datasets.

In this paper, a novel criterion based on contribution rate is proposed. The proposed criterion does not need any pre-determined parameter and the balance of relevance and redundancy is achieved by considering the contribution rate of the feature subset in terms of relevance and redundancy, respectively. Hence, the new criterion shows good robustness and consistency over different datasets. Given a particle $X$ and the original feature set $F$ , the dimension of the particle is $m$ and the total feature number is $n$ . First, the contribution rate of the feature subset $X$ in terms of relevance is defined as follows:

$\displaystyle\textit{Rel\_R}=\frac{\textit{chosenMI}}{\textit{totalMI}}=\frac{\sum\limits_{x\in X}I(x;c)}{\sum\limits_{f\in F}I(f;c)}$ (9)

As $X$ is a subset of $F$ , $0\leqslant\textit{Rel\_R}\leqslant 1$ . The larger Rel_R is, the more discriminative power the feature subset $X$ provides. Then, the contribution rate of the feature subset $X$ in terms of redundancy is calculated as follows:

$\displaystyle\textit{Red\_R}=\frac{\sqrt{\textit{chosenRED}}}{\sqrt{\textit{totalRED}}}=\frac{\sqrt{\sum\limits_{x_{i},x_{j}\in X}I(x_{i};x_{j})}}{\sqrt{\sum\limits_{f_{i},f_{j}\in F}I(f_{i};f_{j})}}$ (10)

Since the total redundancy between all the $n$ features has $n^{2}$ terms and the redundancy between the selected $m$ features has $m^{2}$ terms, the square root of the redundancy is used to calculate the share of redundancy that the feature subset takes up. According to the definition, $0\leqslant\textit{Red\_R}\leqslant 1$ . The larger Red_R is, the more redundancy the feature subset $X$ contains.

Finally, the fitness function is defined as:

$\displaystyle\textit{Fitness}=\textit{Rel\_R}-\textit{Red\_R}$ (11)

The main advantage of this criterion is that the two terms in Eq. (11) are both in the range of [0,1], so relevance and redundancy are always comparable in different datasets. This criterion seeks for the best combination of $m$ features which contains a large proportion of relevance and a small proportion of redundancy. Therefore, the fitness function can guide SPSOFS to search for high quality feature subsets with maximum relevance and minimum redundancy.
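Given pre-computed MI values, Eqs. (9) to (11) can be sketched as follows. The argument names and the 0-based feature indices are choices made for this example. The sums in Eq. (10) are taken over all ordered pairs, including $i=j$, which is an interpretation based on the remark that the totals contain $n^{2}$ and $m^{2}$ terms.

```python
import math

def fitness(subset, relevance, redundancy):
    """Contribution-rate fitness Rel_R - Red_R of a feature subset.

    subset:     iterable of 0-based feature indices
    relevance:  relevance[f] = I(f; c) for every feature f
    redundancy: redundancy[i][j] = I(f_i; f_j) for every pair of features
    Assumes the total relevance and total redundancy are both positive.
    """
    subset = list(subset)
    n = len(relevance)
    rel_r = sum(relevance[f] for f in subset) / sum(relevance)               # Eq. (9)
    chosen_red = sum(redundancy[i][j] for i in subset for j in subset)
    total_red = sum(redundancy[i][j] for i in range(n) for j in range(n))
    red_r = math.sqrt(chosen_red) / math.sqrt(total_red)                     # Eq. (10)
    return rel_r - red_r                                                     # Eq. (11)

# Toy values: three mutually independent features with unit self-information.
relevance = [0.6, 0.3, 0.1]
redundancy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(fitness([0], relevance, redundancy))
print(fitness([0, 1, 2], relevance, redundancy))
```

Selecting every feature gives Rel_R = Red_R = 1 and therefore a fitness of zero, which shows how the two normalized terms stay comparable regardless of the absolute MI magnitudes.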

Algorithm 1. The pseudo-code of SPSOFS

1: initialize the swarm;
2: evaluate all the particles according to the fitness function in Section 4.4;
3: initialize the personal best of each particle and the global best;
4: for each iteration do
5:     for each particle i do
6:         compute its new velocity with the velocity updating rules introduced in Section 4.2;
7:         compute its new position with the position updating rules in Section 4.3;
8:         evaluate particle i according to the fitness function;
9:         update the personal best of particle i;
10:     end
11:     update the global best of the swarm;
12: end
13: compute the classification accuracy of the global best feature subset on the test dataset;
14: return the global best, the corresponding feature subset, and the classification accuracy on the test dataset.

5. Experimental results and analysis

5.1 Datasets

In order to verify the effectiveness of the proposed method (refer to Algorithm 1), ten datasets from the UCI machine learning repository are chosen for the experiments. These datasets show a large diversity in the number of classes, features, and instances. The detailed information of these datasets is given in Table 1. For the datasets with missing values, the missing values are imputed by averaging their adjacent data. For the datasets with continuous values, the MDL discretization method [35] is used to transform the continuous values into discrete values in order to compute the MI between the features.

Table 1
Datasets

Dataset	Number of features	Number of classes	Number of instances	Missing value?
Glass	9	7	214	No
Heart	10	2	270	Yes
Wine	13	3	178	No
Australia	15	2	690	Yes
German	24	2	1000	No
Ionosphere	34	2	351	No
Waveform	40	3	5000	No
Sonar	60	2	208	No
Musk	166	2	476	No
Arrhythmia	279	16	452	Yes

For each dataset, all the instances are randomly divided into 10 equally sized folds. Nine folds are used as the training set and the remaining fold is used as the test set. All the feature selection approaches are run on the training data to obtain feature subsets. The test set is then used to evaluate the classification accuracy of the obtained feature subsets. The cross-validation process is repeated 10 times, with each of the 10 folds used once as the test data. The mean classification accuracy over the 10 folds is taken as the classification performance of the feature subset. In this paper, $K$ nearest neighbors (KNN) is employed as the classifier to evaluate the classification accuracy, with $K$ set to 5. In order to make fair comparisons, all the methods are run on the same data.
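The evaluation protocol can be sketched with a small KNN classifier and a 10-fold split, as below. This sketch only scores a given feature subset; in the protocol described above, the feature selection itself is also rerun on the training folds. The Euclidean distance, the majority vote, the fold construction and all function names are assumptions, since the paper does not specify them, and the dataset must contain at least as many instances as folds.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training points closest to the query (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def cross_val_accuracy(xs, ys, subset, k=5, folds=10, seed=0):
    """Mean KNN accuracy over the folds, using only the features in `subset`."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)

    def project(row):
        return [row[f] for f in subset]

    scores = []
    for fold in range(folds):
        test = idx[fold::folds]               # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tx = [project(xs[i]) for i in train]
        ty = [ys[i] for i in train]
        hits = sum(knn_predict(tx, ty, project(xs[i]), k) == ys[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / folds

# Toy data: feature 0 separates the two classes, feature 1 is constant.
xs = [[i * 0.01, 5.0] for i in range(20)] + [[10 + i * 0.01, 5.0] for i in range(20)]
ys = [0] * 20 + [1] * 20
print(cross_val_accuracy(xs, ys, subset=[0]))
```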

5.2 Comparative algorithms

We compare the proposed SPSOFS with several state-of-the-art feature selection algorithms to verify its effectiveness. The comparative algorithms, six filters and four wrappers, are listed as follows.

a.
Filter approach: Six filters can be categorized into two groups: (1) non-EC based approaches: mRMR, CIFE, and JMI; (2) EC based approaches: Simultaneous Feature Selection and Weighting (SFSW) [36], Feature weighting and selection with Pareto optimal concept (FWSP) [31], and Differential Evolution based Multi-Objective Feature Selection (DEMOFS) [37].
b.
Wrapper approach: Binary particle swarm optimization (BPSO) [34], Barebones particle swarm optimization (BBPSO) [38], PSO (4-2) [39], and Binary particle swarm optimization with catfish effect (BPSO-CE) [40].

SPSOFS is first compared with the three non-EC based filter approaches in terms of classification accuracy on each dataset for specific numbers of selected features. Then, the average classification accuracy and number of features of SPSOFS are compared with those of the three EC based filters. Thirdly, the four wrappers are used for comparison in terms of classification accuracy and number of features. Last, we further investigate the effectiveness of the contribution rate based fitness function.
5.3 Experimental settings

The experiments are performed on a machine with an Intel(R) Core(TM) i5-6500 at 3.2 GHz and 8.00 GB of RAM using MATLAB. The operating system is MS Windows 10. The results of mRMR, CIFE, and JMI are obtained with the Feature Selection Toolbox (FEAST) developed by Brown et al. [41] (available at: http://www.cs.man.ac.uk/~gbrown/fstoolbox/). In SPSOFS, the maximum number of iterations is empirically set to 50. The number of individuals in the population is set to 50. The cognitive and social components, $c_{1}$ and $c_{2}$ , are both set to 1.2, as suggested in [11]. A time decreasing inertia weight is used with $w_{\max}=$ 0.9 and $w_{\min}=$ 0.4. For the four PSO based wrapper methods, the maximum number of iterations, the population size, and the inertia weight are set the same as in SPSOFS. Their cognitive and social components, $c_{1}$ and $c_{2}$ , are both set to 2. The upper and lower bounds for velocity are set to 6 and $-$ 6, respectively. SPSOFS and the four PSO based wrappers are run ten times on each dataset with random seeds to reduce the influence of random factors.
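The paper does not give the exact schedule of the time decreasing inertia weight. A commonly used choice is a linear decrease from $w_{\max}$ to $w_{\min}$ over the iterations, which the following sketch assumes; the function name and the 0-based iteration index are hypothetical.

```python
def inertia_weight(t, max_iter=50, w_max=0.9, w_min=0.4):
    """Linearly decreasing inertia weight (assumed schedule).

    t is the 0-based iteration index; the weight is w_max at t = 0
    and reaches w_min at t = max_iter - 1.
    """
    if max_iter <= 1:
        return w_max
    return w_max - (w_max - w_min) * t / (max_iter - 1)

print([round(inertia_weight(t), 3) for t in (0, 24, 49)])
```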

5.4 Results and discussions

In this section, the experimental results are reported. In order to measure the performance of SPSOFS, four sets of experiments are conducted. In the first set, SPSOFS is compared with three non-EC approaches. In the second set, three EC based filters are compared with SPSOFS. The third set compares SPSOFS with four PSO based wrappers, and the last set examines the effectiveness of the proposed contribution rate based evaluation criterion.

5.4.1 Comparison with non-EC based filter approaches

The three non-EC approaches, mRMR, CIFE, and JMI, all employ the forward sequential search strategy. Features are ranked in descending order according to each method's own criterion and the top $m$ features are chosen as the feature subset, where $m$ is a pre-specified parameter. SPSOFS aims at selecting $m$ features from the entire feature set. All four feature selection methods therefore need a pre-specified parameter, i.e., the number of features to be selected. Hence, the classification accuracies of these four methods are compared for different numbers of selected features. For the ten datasets, we choose several values of the number of selected features around which the classification accuracies become stable in most cases. The results are shown in Table 2. The classification accuracy of SPSOFS shown in Table 2 is the mean classification accuracy over ten independent runs. The best classification accuracy obtained among the four algorithms is shown in boldface.

Table 2
Comparison of the classification accuracy based on the first $m$ selected features

Dataset	N.F.	mRMR	CIFE	JMI	SPSOFS
Glass	m $=$ 3	0.6445	0.6465	0.6680	0.6688
	m $=$ 6	0.6870	0.7076	0.7076	0.6918
Heart	m $=$ 3	0.8333	0.7	0.7185	0.8259
	m $=$ 6	0.8370	0.7519	0.7519	0.8407
Wine	m $=$ 3	0.9205	0.8998	0.9292	0.9391
	m $=$ 6	0.9744	0.9214	0.9744	0.9744
	m $=$ 9	0.9802	0.9134	0.9607	0.9802
Australia	m $=$ 3	0.8293	0.8203	0.8159	0.8493
	m $=$ 6	0.8422	0.7420	0.7754	0.8449
	m $=$ 9	0.8130	0.7275	0.7536	0.8232
German	m $=$ 5	0.7230	0.7170	0.7060	0.73
	m $=$ 10	0.7160	0.7330	0.7210	0.7270
	m $=$ 15	0.7240	0.7330	0.74	0.7260
Ionosphere	m $=$ 5	0.8603	0.8660	0.8715	0.8840
	m $=$ 10	0.8687	0.8460	0.8631	0.8863
	m $=$ 15	0.8601	0.8402	0.8630	0.8687
Waveform	m $=$ 10	0.7510	0.7198	0.8202	0.8283
	m $=$ 15	0.8216	0.6968	0.8328	0.8342
	m $=$ 20	0.8268	0.6776	0.8284	0.8322
Sonar	m $=$ 10	0.7068	0.7793	0.7786	0.7936
	m $=$ 20	0.7370	0.7543	0.7621	0.8343
	m $=$ 30	0.7803	0.7807	0.8186	0.8321
Musk	m $=$ 10	0.7332	0.7572	0.7507	0.7963
	m $=$ 20	0.8130	0.7752	0.7688	0.8298
	m $=$ 30	0.8109	0.7889	0.7788	0.8254
Arrhythmia	m $=$ 10	0.5968	0.5390	0.5679	0.6077
	m $=$ 20	0.6057	0.5565	0.5635	0.6322
	m $=$ 30	0.6256	0.5591	0.5746	0.6432

It can be seen from Table 2 that SPSOFS achieves the highest classification accuracy on most of the datasets, especially on the datasets with a relatively large number of features, such as Sonar, Musk, and Arrhythmia. For instance, on the Musk dataset, SPSOFS obtains the highest classification accuracy for all three numbers of selected features. On datasets with a small number of features, the performance of SPSOFS is also very competitive, as can be seen from the results for Heart, Wine, and German. It should be noted that SPSOFS is able to find a small subset of features with high classification accuracy on datasets with a large number of features. For instance, on the Sonar dataset, the accuracy of SPSOFS with 10 features is 0.7936, while the classification accuracies of mRMR and CIFE with their top 30 features are still lower than 0.7936. This illustrates the advantage of the group based feature selection method: SPSOFS can find the best combination of $m$ features rather than the best $m$ individual features.
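The difference between the best combination of $m$ features and the best $m$ individual features can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy data are hypothetical (the class is the XOR of two features), joint mutual information stands in for the subset criterion, and exhaustive search stands in for SPSO. Ranking by individual relevance is also a simplification of the forward search used by mRMR, CIFE, and JMI.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits for two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Hypothetical toy data: the class is f0 XOR f1, so neither feature is
# informative alone; f2 agrees with the class in 6 of the 8 samples.
f0 = [0, 0, 1, 1, 0, 0, 1, 1]
f1 = [0, 1, 0, 1, 0, 1, 0, 1]
label = [a ^ b for a, b in zip(f0, f1)]
f2 = [0, 1, 1, 0, 0, 1, 0, 1]
features = [f0, f1, f2]


def joint_relevance(subset):
    """I(subset; class), treating the selected features as one joint variable."""
    joint = list(zip(*(features[i] for i in subset)))
    return mutual_information(joint, label)


# Strategy A: rank features by individual relevance and keep the top two.
ranked = sorted(range(len(features)),
                key=lambda i: mutual_information(features[i], label),
                reverse=True)
top_m = tuple(sorted(ranked[:2]))

# Strategy B: score every pair as a group and keep the best pair.
best_pair = max(combinations(range(len(features)), 2), key=joint_relevance)

print("top-2 by ranking:", top_m, joint_relevance(top_m))
print("best pair       :", best_pair, joint_relevance(best_pair))
```

The ranking keeps f2 because it is the only individually relevant feature, whereas the pair (f0, f1) determines the class completely even though each of its members scores zero on its own.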

In addition, SPSOFS shows good consistency over the ten datasets. Only SPSOFS produces promising results on all the datasets, while the other algorithms work well on some but not all of them. For instance, mRMR performs worse than the other algorithms on the Sonar dataset, while CIFE and JMI fail to find informative feature subsets on the Heart and Wine datasets.

Figure 1. Classification results for Heart dataset.

Figure 2. Classification results for Ionosphere dataset.

Another effective way to measure the performance of these four methods is to compare the classification accuracies as the number of selected features increases. Four datasets with different numbers of features (Heart, Ionosphere, Sonar, and Musk) are chosen as representative datasets for this experiment. Figures 1–4 show the classification accuracies of mRMR, CIFE, JMI, and SPSOFS over different numbers of selected features on the four datasets. The $x$-axis refers to the number of selected features and the $y$-axis represents the classification accuracy. From Fig. 1, we can see that mRMR and SPSOFS obtain much better results than CIFE and JMI on the Heart dataset, where SPSOFS achieves the highest classification accuracy of 0.8444 with 4 features. Figures 2 and 3 show that SPSOFS obtains better results than the other algorithms in most cases. On the Musk dataset (Fig. 4), SPSOFS achieves better classification accuracy than the other algorithms at every plotted number of selected features. On the whole, SPSOFS produces better results than the other feature selection algorithms compared.

Figure 3. Classification results for Sonar dataset.

5.4.2 Comparison with EC based filter approaches

Three recently proposed EC based feature selection methods are compared with SPSOFS in order to further illustrate its effectiveness. These methods also use KNN with 10-fold cross validation to calculate the classification accuracy. Their results are taken from the corresponding papers. Hence, Table 3 presents only the results for datasets that are used both in the comparative methods and in this paper, and a dash marks a dataset for which a method reports no result. The results of SPSOFS shown in Table 3 are the best mean classification accuracies on each dataset (the maximum number of features is restricted to 30). The number of selected features is given in parentheses, and the highest classification accuracy on each dataset is shown in boldface.

Table 3
Comparison with 3 EC based filter approaches

Dataset	SFSW	FWSP	DEMOFS	SPSOFS
Glass	0.6776(4.40)	0.6916(4)	–	0.6919(6)
Heart	0.8(5.30)	0.7840(5)	–	0.8444(4)
Wine	0.9605(6.90)	0.9607(8)	0.8965(6)	0.9802(9)
Australia	0.8464(4.70)	0.8656(5)	0.7730(4)	0.8493(3)
German	0.7130(10.5)	–	0.7010(1)	0.727(10)
Ionosphere	0.8831(11.5)	0.8650(3)	–	0.8915(9)
Waveform	0.8365(16.0)	–	–	0.8342(15)
Sonar	0.8274(20.0)	0.7932(22)	0.7860(10)	0.8343(20)
Musk	0.8152(59.3)	–	0.8345(58)	0.8298(20)
Arrhythmia	0.6577(100)	–	–	0.6432(30)

Figure 4. Classification results for Musk dataset.

As can be seen from Table 3, SPSOFS achieves the highest classification accuracy on 6 out of 10 datasets. On the remaining four datasets (Australia, Waveform, Musk, and Arrhythmia), SPSOFS obtains slightly lower classification accuracy than the best method but selects a smaller feature set, especially on the datasets with a large number of features. On the whole, SPSOFS shows very competitive results compared with the other EC based filter approaches.

5.4.3 Comparison with EC based wrappers

In this section, SPSOFS is compared with four EC based wrappers, including BPSO, BBPSO, PSO (4-2) and BPSO-CE. Table 4 reports the mean classification accuracies of the four wrapper approaches and SPSOFS in ten independent runs on the ten datasets. For each dataset, the best result is shown in boldface.

Table 4
Comparison with four PSO based wrapper approaches

Dataset	BPSO	BBPSO	PSO(4-2)	BPSO-CE	SPSOFS
Glass	0.6992	0.6805	0.6721	0.6907	0.6918
Heart	0.8296	0.8321	0.7978	0.8444	0.8407
Wine	0.9704	0.9704	0.9526	0.9667	0.9802
Australia	0.8423	0.84	0.8334	0.8386	0.8492
German	0.7207	0.7319	0.6847	0.7229	0.73
Ionosphere	0.8491	0.8472	0.8689	0.8368	0.8913
Waveform	0.8021	0.8316	0.8032	0.7958	0.8342
Sonar	0.7206	0.7667	0.7794	0.7710	0.8343
Musk	0.8392	0.8442	0.8487	0.8492	0.8298
Arrhythmia	0.5165	0.5231	0.4814	0.4976	0.6432
Average	0.7790	0.7868	0.7722	0.7814	0.8125

According to Table 4, SPSOFS obtains the best classification performance on 6 out of 10 datasets and achieves the second best result on 3 datasets. The classification performance of SPSOFS is not satisfactory on the Musk dataset, but the gap between SPSOFS and the other methods is not large. In terms of the average classification accuracy, SPSOFS offers the best result of 0.8125, while the second best is 0.7868, achieved by BBPSO. Consequently, it can be concluded from Table 4 that SPSOFS shows superior performance to the four PSO based wrapper approaches.
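The summary figures quoted from Table 4 can be recomputed with a short Python sketch. The accuracies are copied from Table 4; the variable names and the ranking rule (rank 1 means no other method is strictly better) are choices made for this example.

```python
methods = ["BPSO", "BBPSO", "PSO(4-2)", "BPSO-CE", "SPSOFS"]

# Mean classification accuracies copied from Table 4, one row per dataset.
table4 = {
    "Glass":      [0.6992, 0.6805, 0.6721, 0.6907, 0.6918],
    "Heart":      [0.8296, 0.8321, 0.7978, 0.8444, 0.8407],
    "Wine":       [0.9704, 0.9704, 0.9526, 0.9667, 0.9802],
    "Australia":  [0.8423, 0.8400, 0.8334, 0.8386, 0.8492],
    "German":     [0.7207, 0.7319, 0.6847, 0.7229, 0.7300],
    "Ionosphere": [0.8491, 0.8472, 0.8689, 0.8368, 0.8913],
    "Waveform":   [0.8021, 0.8316, 0.8032, 0.7958, 0.8342],
    "Sonar":      [0.7206, 0.7667, 0.7794, 0.7710, 0.8343],
    "Musk":       [0.8392, 0.8442, 0.8487, 0.8492, 0.8298],
    "Arrhythmia": [0.5165, 0.5231, 0.4814, 0.4976, 0.6432],
}

# Column averages (the "Average" row of Table 4).
averages = {
    name: sum(row[i] for row in table4.values()) / len(table4)
    for i, name in enumerate(methods)
}

# Rank of SPSOFS on each dataset: 1 + number of methods with a strictly higher accuracy.
spsofs = methods.index("SPSOFS")
ranks = {
    dataset: 1 + sum(value > row[spsofs] for value in row)
    for dataset, row in table4.items()
}
best_count = sum(rank == 1 for rank in ranks.values())
second_count = sum(rank == 2 for rank in ranks.values())

for name in methods:
    print(f"{name:9s} {averages[name]:.4f}")
print("SPSOFS best on", best_count, "datasets, second best on", second_count)
```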

5.4.4 Effect of the contribution rate based evaluation criterion

In this section, we will investigate the effectiveness of the contribution rate based feature subset evaluation criterion. In most EC based filter approaches, both relevance and redundancy terms are included in the evaluation criterion and a weighting parameter is adopted to keep the balance between relevance and redundancy. The fitness function is shown as follows:

$\displaystyle\textit{Fitness}=\sum\limits_{x\in X}{I(x,c)}+\alpha*\sqrt{\sum\limits_{x_{i},x_{j}\in X}{I(x_{i},x_{j})}}$ (12)

where $\alpha$ is the weighting parameter which determines the relative importance of relevance and redundancy in the fitness function. When $\alpha<1$, relevance is considered more important than redundancy. When $\alpha>1$, the algorithm focuses on reducing redundancy. When $\alpha=1$, the two terms are assumed to be equally important. Different values of $\alpha$ therefore guide the feature selection algorithm towards different regions of the search space, and users need to choose a proper value of the weighting parameter to satisfy their requirements. For example, when the goal of feature selection is to include as much useful information as possible, a small value of $\alpha$ is suitable, whereas when eliminating redundancy is the primary goal, $\alpha$ should be relatively large. However, choosing $\alpha$ in this way may be impractical in real world applications, since the magnitudes of the two terms in Eq. (12) may vary considerably across problems.
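A minimal Python sketch of the two terms in Eq. (12) is given below for discrete features. Several points are assumptions of this example rather than statements of the paper: mutual information is estimated from empirical counts, the redundancy sum runs over unordered pairs $i<j$, the function names `eq12_terms` and `eq12_fitness` and the toy data are hypothetical, and the two terms are combined with the plus sign exactly as Eq. (12) is printed above.

```python
from collections import Counter
from itertools import combinations
from math import log2, sqrt


def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits for two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def eq12_terms(subset, features, labels):
    """Return (relevance term, square-root redundancy term) of Eq. (12)."""
    relevance = sum(mutual_information(features[i], labels) for i in subset)
    # Assumption: each unordered pair of selected features is counted once.
    pairwise = sum(mutual_information(features[i], features[j])
                   for i, j in combinations(subset, 2))
    return relevance, sqrt(max(0.0, pairwise))  # guard against rounding below zero


def eq12_fitness(subset, features, labels, alpha):
    """Combine the two terms with weight alpha, using the sign printed in Eq. (12)."""
    relevance, redundancy = eq12_terms(subset, features, labels)
    return relevance + alpha * redundancy


# Hypothetical toy data.
features = [
    [0, 0, 1, 1, 0, 0, 1, 1],  # f0: identical to the class label
    [0, 0, 1, 1, 0, 0, 1, 1],  # f1: exact copy of f0, fully redundant with it
    [0, 1, 0, 1, 0, 1, 0, 1],  # f2: independent of f0 and of the class label
]
labels = [0, 0, 1, 1, 0, 0, 1, 1]

for subset in [(0, 1), (0, 2)]:
    print(subset, eq12_terms(subset, features, labels))
for alpha in (0.75, 1.0, 1.25):
    print(alpha, eq12_fitness((0, 1), features, labels, alpha))
```

Printing the two terms separately makes their relative magnitudes visible for a given dataset, which is the quantity that a fixed $\alpha$ has to balance.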

In this section, SPSOFS with the contribution rate (CR) based feature subset evaluation criterion is compared with SPSOFS using Eq. (12) as the fitness function. Three different values of $\alpha$ are used for experiments, which are 0.75, 1, and 1.25. Four representative datasets with different number of features are used for comparison, including Heart, Ionosphere, Sonar, and Musk. Table 5 shows the mean classification accuracies of the SPSOFS with different fitness functions on the four representative datasets. The best result in each dataset with different feature subset size is shown in boldface.

Table 5
Comparison with different weighting parameters

Dataset	N.F.	$\alpha=$ 0.75	$\alpha=$ 1	$\alpha=$ 1.25	CR
Heart	m $=$ 3	0.8358	0.8284	0.8264	0.8259
	m $=$ 6	0.7889	0.7889	0.8	0.8333
Ionosphere	m $=$ 5	0.8716	0.8772	0.8774	0.8840
	m $=$ 10	0.8603	0.8716	0.8802	0.8863
	m $=$ 15	0.8601	0.8629	0.8516	0.8687
Sonar	m $=$ 10	0.7671	0.7936	0.7771	0.7936
	m $=$ 20	0.8057	0.7957	0.7993	0.8343
	m $=$ 30	0.8221	0.8286	0.8233	0.8321
Musk	m $=$ 10	0.7684	0.7837	0.7705	0.7963
	m $=$ 20	0.8014	0.8087	0.8083	0.8298
	m $=$ 30	0.8043	0.8107	0.8279	0.8254
Average		0.8169	0.8227	0.8220	0.8372

It can be seen from Table 5 that SPSOFS with the contribution rate based evaluation criterion shows the best result in most cases. When Eq. (12) is used as the evaluation criterion, fine tuning of the weighting parameter is very important, since different values lead to different results. For example, on the Heart dataset, $\alpha=0.75$ gives the highest classification accuracy with feature subset size 3. However, when the feature subset size is 6, CR shows the best performance while $\alpha=0.75$ ranks only third. On the Ionosphere dataset, $\alpha=1.25$ performs better than $\alpha=0.75$ and $\alpha=1$ with feature subset sizes 5 and 10, but falls behind both when the feature subset size is 15.
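The dependence of the best weighting parameter on the dataset and subset size can be read off Table 5 with a short Python sketch. The accuracies are copied from Table 5; the data layout and variable names are choices made for this example.

```python
alphas = [0.75, 1.0, 1.25]

# (dataset, m) -> (accuracies for alpha = 0.75, 1, 1.25; accuracy for CR), from Table 5.
table5 = {
    ("Heart", 3):       ([0.8358, 0.8284, 0.8264], 0.8259),
    ("Heart", 6):       ([0.7889, 0.7889, 0.8000], 0.8333),
    ("Ionosphere", 5):  ([0.8716, 0.8772, 0.8774], 0.8840),
    ("Ionosphere", 10): ([0.8603, 0.8716, 0.8802], 0.8863),
    ("Ionosphere", 15): ([0.8601, 0.8629, 0.8516], 0.8687),
    ("Sonar", 10):      ([0.7671, 0.7936, 0.7771], 0.7936),
    ("Sonar", 20):      ([0.8057, 0.7957, 0.7993], 0.8343),
    ("Sonar", 30):      ([0.8221, 0.8286, 0.8233], 0.8321),
    ("Musk", 10):       ([0.7684, 0.7837, 0.7705], 0.7963),
    ("Musk", 20):       ([0.8014, 0.8087, 0.8083], 0.8298),
    ("Musk", 30):       ([0.8043, 0.8107, 0.8279], 0.8254),
}

# Best of the three weighting parameters in each row.
best_alpha = {
    key: alphas[max(range(len(alphas)), key=lambda i: acc[i])]
    for key, (acc, _) in table5.items()
}

# Rows in which CR is at least as accurate as every weighted variant.
cr_best_or_tied = sum(cr >= max(acc) for acc, cr in table5.values())

for key, alpha in best_alpha.items():
    print(key, "best alpha:", alpha)
print("CR best or tied in", cr_best_or_tied, "of", len(table5), "rows")
```

Each of the three values of $\alpha$ is the best weighted choice in at least one row, which is why no single setting can be recommended in advance.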

It can be concluded from the above results that a proper weighting parameter is crucial for good performance. However, it is hardly practical to choose a single weighting parameter that is suitable for all datasets. The contribution rate based feature subset evaluation criterion is parameter free, so no prior information about the dataset is needed to tune a weighting parameter. Moreover, the proposed evaluation criterion achieves better classification accuracy than the weighted criterion in most cases. Hence, the proposed criterion can be used directly in various real world problems without setting any control parameter, while still producing promising results.

6. Conclusions

The aim of this research is to select high quality feature subsets within acceptable time. This paper proposes a novel feature selection method (SPSOFS) using a set based particle swarm optimization and a contribution rate based feature subset evaluation criterion. In EC based filters, the two main issues are the search strategy and the feature evaluation criterion, and the proposed method shows attractive advantages on both. This paper first applies SPSO to the feature selection problem to make use of its search ability in the discrete search space. Compared with the greedy forward search strategy employed by many filter feature selection methods, SPSO shows a powerful global search ability which enables it to find high quality feature subsets in a high dimensional feature space. Moreover, the novel feature subset evaluation criterion proposed in this paper does not need any pre-specified parameter to keep the balance between relevance and redundancy. It is easy to implement and produces stable results on different datasets without any modification.

In order to verify the effectiveness of the proposed method, it is compared with six filter and four wrapper approaches on ten UCI datasets. The experimental results show that SPSOFS can find feature subsets with sufficient discriminative information. SPSOFS achieves better classification accuracy than other methods in most of the cases. Moreover, a detailed discussion also demonstrates the effectiveness of the novel feature subset evaluation criterion. Future work may focus on incorporating the multi-information to form a new feature subset evaluation criterion. Multi-information can be used to describe the complementary or contradictory interactions between features and find the interactive features.

Acknowledgments

This work was supported by the NUPTSF under grant no. NY214186 and the Natural Science Foundation of Jiangsu Province under grant no. BK20160898.

References

1. Guyon and Elisseeff, An introduction to variable and feature selection, J Mach Learn Res (2003), 1157–1182.

2. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans Neural Netw 5(4) (1994), 537–550.

3. Peng, Long and Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min redundancy, IEEE Trans Pattern Anal Mach Intell 27(8) (2005), 1226–1238.

4. Lin and Tang, Conditional infomax learning: An integrated framework for feature extraction and fusion, in: Proc. 9th Eur. Conf. Comput. Vis., 2006, pp. 68–82.

5. and Hung, Distance-based feature selection on classification of uncertain objects, Adv Artif Intell LNCS 7106 (2011), 172–181.

6. Yang and Moody, Data visualization and feature selection: New algorithms for nongaussian data, Advances Neural Inf Process Syst 12 (1999), 687–693.

7. S.J., Yang W.J., Sun, Yao and Yang, Similarity-based feature selection for learning from examples with continuous values, Adv Knowl Discov Data Min LNCS (2009), 957–964.

8. Xue, Zhang and Browne W.N., Particle swarm optimisation for feature selection in classification: novel initialisation and updating mechanisms, Appl Soft Comput 18 (2014), 261–276.

9. Moradi and Gholampour, A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy, Applied Soft Computing 43 (2016), 117–130.

10. Ghaemi and Feizi-Derakhshi M.R., Feature selection using forest optimization algorithm, Pattern Recognit 60 (2016), 121–129.

11. Chen, Zhang, Henry S.H., Zhong and Shi, A novel set-based particle swarm optimization method for discrete optimization problems, IEEE Transactions on Evolutionary Computation 14(2) (2010), 278–300.

12. and Hung, Distance-based feature selection on classification of uncertain objects, Adv Artif Intell LNCS 7106 (2011), 172–181.

13. S.J., Yang W.J., Sun, Yao and Yang, Similarity-based feature selection for learning from examples with continuous values, Adv Knowl Discov Data Min LNCS (2009), 957–964.

14. Azofra A.A., Benitez J.M. and Castro J.L., Consistency measures for feature selection, J Intell Inf Syst 30 (2008), 273–292.

15. Fleuret, Fast binary feature selection with conditional mutual information, J Mach Learn Res 5 (2004), 1531–1555.

16. Brown, Pocock, Zhao and Luján, Conditional likelihood maximisation: A unifying framework for information theoretic feature selection, J Mach Learn Res 13 (2012), 27–66.

17. Zeng, Zhang and Yin, A novel feature selection method considering feature interaction, Pattern Recognition 48 (2015), 2656–2666.

18. Jakulin, Machine learning based on attribute interactions, Ph.D. dissertation, Faculty Comput. Inf. Sci., Ljubljana Univ., Ljubljana, Slovenia, 2005.

19. Chakraborty, Genetic algorithm with fuzzy fitness function for feature selection, in: IEEE International Symposium on Industrial Electronics (ISIE’02), vol. 1, 2002, pp. 315–319.

20. Bhadra and Bandyopadhyay, Unsupervised feature selection using an improved version of Differential Evolution, Expert Systems with Applications 42 (2015), 4042–4053.

21. Wang, Yang, Teng, Xia and Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters 28(4) (2007), 459–471.

22. Bae, Yeh, Chung and Liu, Feature selection with intelligent dynamic swarm and rough set, Expert Systems with Applications 37 (2010), 7026–7032.

23. Cervante, Xue, Zhang and Lin, Binary particle swarm optimisation for feature selection: A filter based approach, in: IEEE Congress on Evolutionary Computation, CEC 2012, pp. 889–896.

24. Tabakhi, Moradi and Akhlaghian, An unsupervised feature selection algorithm based on ant colony optimization, Engineering Applications of Artificial Intelligence 32 (2014), 112–123.

25. Tabakhi and Moradi, Relevance-redundancy feature selection based on ant colony optimization, Pattern Recognition 48 (2015), 2798–2811.

26. Xue, Cervante, Shang and Zhang, A particle swarm optimisation based multi-objective filter approach to feature selection for classification, in: PRICAI, ser. Lecture Notes in Computer Science, vol. 7458, 2012, pp. 673–685.

27. Xue, Cervante, Shang, Browne and Zhang, A multi-objective PSO for filter-based feature selection in classification problems, Connection Science 24(2–3) (2012), 91–116.

28. Xue, Cervante, Shang, Browne W.N. and Zhang, Multiobjective evolutionary algorithms for filter based feature selection in classification, International Journal on Artificial Intelligence Tools 22(4) (2013), 1350024.

29. Hancer, Xue, Zhang, Karaboga and Akay, A multiobjective artificial bee colony approach to feature selection using fuzzy mutual information, in: Proc. IEEE Congr. Evol. Comput. (CEC), Sendai, Japan, 2015, pp. 2420–2427.

30. Paul and Das, Simultaneous feature selection and weighting - An evolutionary multi-objective optimization approach, Pattern Recognition Letters 65 (2015), 51–59.

31. Das and Das, Feature weighting and selection with a Pareto-optimal trade-off between relevancy and redundancy, Pattern Recognition Letters 88 (2017), 12–19.

32. Liu and Motoda, Feature Selection for Knowledge Discovery and Data Mining. Boston, MA, USA: Kluwer, 1998.

33. Kennedy and Eberhart R.C., Particle swarm optimization, in: Proceedings of IEEE International Conference on Neural Networks, 1995, pp. 1942–1948.

34. Kennedy and Eberhart, A discrete binary version of the particle swarm algorithm, in: IEEE International Conference on Systems, Man, and Cybernetics, 1997, pp. 4104–4108.

35. Fayyad and Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Proceedings of Thirteenth International Joint Conference on Artificial Intelligence, 1993, pp. 1022–1027.

36. Paul and Das, Simultaneous feature selection and weighting - an evolutionary multi-objective optimization approach, Pattern Recognit Lett 65 (2015), 51–59.

37. Xue and Zhang, Multi-objective feature selection in classification: a differential evolution approach, Simul Evol Learn LNCS 8886 (2014), 516–528.

38. Kennedy, Bare bones particle swarms, in: Proceeding of the 2003 IEEE Swarm Intelligence Symposium, 2003, pp. 80–87.

39. Xue, Zhang and Browne W.N., Particle swarm optimisation for feature selection in classification: novel initialisation and updating mechanisms, Appl Soft Comput 18 (2014), 261–276.

40. Chuang L.-Y., Tsai S.-W. and Yang C.-H., Improved binary particle swarm optimization using catfish effect for feature selection, Expert Syst Appl 38 (2011), 12699–12707.

41. Brown, Pocock, Zhao M.-J. and Luján, Conditional likelihood maximisation: a unifying framework for information theoretic feature selection, J Mach Learn Res 13(1) (2012), 27–66.
