Feature selection via computational intelligence techniques

Abstract

Feature selection (FS) has become an essential task in overcoming high-dimensional and complex machine learning problems. FS reduces the size of a dataset by removing unnecessary and irrelevant features from it. This process improves the performance of classification algorithms and reduces evaluation time by enabling the use of smaller datasets that contain only useful features. FS aims to find a minimal feature subset in a problem domain while retaining the accuracy of the original data. In this study, four computational intelligence techniques, namely migrating birds optimization (MBO), simulated annealing (SA), differential evolution (DE) and particle swarm optimization (PSO), are implemented as search algorithms for the FS problem and compared on 17 well-known datasets taken from the UCI machine learning repository, where the dimension of the datasets varies from 4 to 500. This is the first time that MBO is applied to the FS problem. To judge the quality of the subsets generated by the search algorithms, two subset evaluation methods are implemented: probabilistic consistency-based FS (PCFS) and correlation-based FS (CFS). The algorithms are compared using three well-known classifiers: k-nearest neighbor, naive Bayes and decision tree (C4.5). As a benchmark, the accuracy values obtained by the classifiers on the datasets with all features are used. The experimental results show that our MBO-based filter approach outperforms the other three approaches in terms of accuracy. The experiments also show that, as a subset evaluator, CFS outperforms PCFS and, as a classifier, C4.5 gives better results than k-nearest neighbor and naive Bayes.

Keywords

Feature selection, computational intelligence, dimensionality reduction, meta-heuristics, classification algorithms, subset evaluators

1 Introduction

As technology evolves, the size of data increases rapidly, which causes problems for effective and efficient data management. We are in the era of big data, in which large amounts of data are generated at an unprecedented speed in areas such as healthcare, social media and internet applications [17]. Besides important features, big data also contains many unimportant features, which leads to the curse of dimensionality. Because of the high storage and computational cost, the curse of dimensionality degrades the performance of algorithms. Machine learning techniques are therefore needed to reduce the dimension of big data. Feature extraction and feature selection are the two most commonly used techniques for dimensionality reduction.

In feature extraction (FE), the original high-dimensional features are first combined and then projected into a new feature space with lower dimensionality. Feature selection (FS), on the other hand, selects a subset of important (relevant) features for model construction by excluding irrelevant and redundant features. In terms of readability and interpretability, FS is better than FE [18].

FS is a very important pre-processing step for machine learning. It not only decreases computational expense by reducing the number of exploited features but also increases the accuracy of the learning model by eliminating redundant or irrelevant features [22]. FS is, however, a difficult process. For example, a feature that seems unnecessary or irrelevant on its own can be very useful when combined with other features. Therefore, care should be taken when deciding which features to discard and which to keep. To do this exactly, all combinations of the features in the dataset would have to be considered. However, a dataset with N features has 2^N possible feature subsets. Thus, as the number of features increases, the solution space grows exponentially.

In the literature, existing FS methods are categorized into three classes: filter, wrapper and embedded methods. Filter methods rank features based on certain criteria; examples include chi-square analysis [43], rough set and fuzzy-rough set-based dependency [27, 28], information gain and symmetrical uncertainty [36], group-based fuzzy-rough FS (FRFS) [27], probabilistic consistency-based FS (PCFS) [23] and correlation-based FS (CFS) [21]. Wrapper methods [4, 44], on the other hand, include a learning algorithm in the evaluation. They generally provide better solutions but take more time to execute than filter methods. Embedded methods are similar to wrapper methods, but feature selection is performed during the training process. In our study, we focus on filter methods and use two commonly preferred subset evaluators, i) PCFS and ii) CFS; more information on them is given in Section 4.3.

The FS problem is NP-hard. Therefore, no polynomial-time exact method is known for it. However, heuristics and meta-heuristics have proven successful in supplying powerful and flexible search strategies that generate high-quality solutions for NP-hard problems in acceptable run times. They generally use random processes to explore the search space and thereby avoid the combinatorial explosion that exact methods face. Even though their designs are stochastic, well-engineered exploration and exploitation strategies make their search effective. A main benefit is their ability to escape local minima by accepting a temporary deterioration of the objective function during the search.

In this study, we first build an extendable framework in which search algorithms, evaluators, learner algorithms and datasets are integrated in a plug-and-play fashion. As the search algorithms (which are used to generate the feature subsets), four meta-heuristics are used: the migrating birds optimization (MBO), simulated annealing (SA), differential evolution (DE) and particle swarm optimization (PSO) algorithms. To the best of our knowledge, this is the first time that MBO is applied to the FS problem. The MBO algorithm is inspired by the V formation of migrating birds in real life [6]. Solutions (corresponding to birds in the analogy) are initialized randomly in the solution space and try to move to better positions by searching their neighborhoods. Throughout the algorithm, the birds share their unused neighbors with the follower birds, which are hypothetically placed in a V formation. SA was proposed by Kirkpatrick et al. in 1983 [35]. It is a very powerful and popular meta-heuristic for optimization, designed to escape from local optima by accepting, with a low probability, a worse solution as the new solution. The DE algorithm is an evolutionary computation technique that looks for the optimal solution of a problem by iteratively trying to improve a candidate solution with respect to a given measure of quality [16]. The particle swarm concept originated from the simulation of the behavior of biological organisms; the starting point was to visually simulate the nuanced but unpredictable motion of bird flocks. The particle swarm optimization (PSO) algorithm imitates flying birds and the way they exchange information in order to solve optimization problems [34].

The algorithms are compared on datasets taken from the UCI Machine Learning Repository [25] using the aforementioned subset evaluators and the accuracy values computed by widely used classifiers: k-nearest neighbor, decision tree (C4.5) and naive Bayes. Both high-dimensional and low-dimensional datasets are used. As a benchmark, we obtain the accuracy values of the classifiers with the full set of features. The four meta-heuristic-based search algorithms are then compared with each other and with the benchmark accuracy.

The contributions of this study are:

A novel MBO-based filter approach is proposed for the FS problem.

Its performance is compared with three well-known meta-heuristic based filter approaches with respect to

two different subset evaluators,

three different classifier algorithms,

on the datasets that vary from high dimensional to low dimensional.

The performance of two well-known subset evaluators is compared under three different classifiers on datasets having different characteristics.

The rest of the paper is organized as follows. In Section 2, the problem definition and a literature survey are given for the FS problem and the four meta-heuristic algorithms. In Section 3, the application of these algorithms to the FS problem is explained in detail. The computational experiment setup is given in Section 4. The results and discussions are detailed in Section 5, and the paper is concluded with some future work in Section 6.

2 Problem definition and literature review

In this section, the feature selection problem is briefly defined and the related literature is surveyed. The meta-heuristics used in this study are also briefly explained, together with the literature on their application to the FS problem.

2.1 Feature selection

In general, FS algorithms are separated into three groups: i) wrapper approaches, ii) filter approaches and iii) embedded approaches. In wrapper approaches, the evaluation process includes a learner / classification algorithm. A newly generated feature subset is sent to the classification algorithm, which provides feedback about this subset. According to this feedback, the wrapper approach forms a new subset and sends it to the classification algorithm again. This process continues until a pre-determined stopping criterion is met. Filter approaches, on the other hand, work without a learner / classification algorithm. In this case, the feedback is taken from evaluators, which respond much more quickly (in polynomial time). Because of their high complexity, wrapper approaches are slower and have a higher computational cost than filter approaches [29]. Embedded approaches are similar to wrapper approaches; however, feature selection is performed during the training process.

Various strategies have been used in FS to generate feature subsets while searching for the best subset. Some of these are the greedy search-based sequential forward selection and sequential backward selection methods. However, these methods suffer from high computational cost and from getting stuck at local optima. As the success of wrapper approaches mainly depends on efficient search strategies, various meta-heuristic methods (genetic algorithms, ant colony optimization, particle swarm optimization, etc.) have been used for feature subset creation and search. This is because meta-heuristics can find the global optimum, or results close to it, at low computational cost. In recent years, various meta-heuristic-based algorithms have been applied to the FS problem [2, 42].

Oh et al. developed a hybrid genetic algorithm (HGA) for the FS problem [14]. In that study, a local search method parameterized by a fluctuation factor is embedded in the HGA to make the search more careful. However, a PSO algorithm with the same fitness function, developed in [19], performed better than the HGA. In a similar study, Jing [37] developed an HGA for subset selection that incorporates a local search operation based on rough set theory to fine-tune the solutions of the genetic algorithm. Hamdani et al. developed a multi-objective FS algorithm (number of features and classification performance); however, this algorithm was not compared with any other FS algorithm [38].

Nieto et al. introduced a modified binary differential evolution algorithm for the FS problem that uses a support vector machine classifier in a wrapper method [15]. However, this algorithm needs many iterations (4000) to perform well. Khushaba et al. used statistical feature distribution criteria to assist the evolution process in DE while selecting the most promising feature subset [31]. However, the algorithm cannot discover the optimal feature subset size because it is limited to selecting feature subsets with a predetermined cardinality [1]. Addressing this drawback, Al-Ani et al. used a new DE algorithm to identify relevant feature subsets. In this algorithm, a series of wheels is created and used to build a feature subset. The authors proposed that the search area could be narrowed with the help of the wheel approach. In another study, a DE-based FS algorithm was designed for implication analysis in a resource-deficient language [39].

Gao et al. developed an ant colony optimization (ACO)-based, wrapper-oriented FS algorithm for network intrusion detection [11]. Fisher’s discriminant ratio is adopted as the heuristic information for ACO. In another study, a hybrid algorithm based on ACO and rough sets is proposed for the FS problem [12]. The proposed method starts with the features in the core of the rough set. However, this method was not compared with commonly used FS algorithms.

Wang et al. used the PSO algorithm to search for optimal feature subsets and to apply rough set-based reduction [40]. Liu et al. used a multi-swarm PSO algorithm for the FS problem [41]. More recently, a multi-objective dual PSO algorithm, which takes into account both the number of features and the classification performance, was proposed in [2]. In this PSO algorithm, factors such as crowding, mutation and dominance are taken into consideration. The experiments showed that the proposed algorithm obtains better feature subsets than traditional FS algorithms.

2.2 Search algorithms

2.2.1 Migrating birds optimization

One of the recently proposed swarm intelligence techniques is the Migrating Birds Optimization (MBO) algorithm. The algorithm starts with a set of initial solutions, corresponding to birds in a V formation, and searches their neighborhoods. Processing begins with the first solution (representing the leader bird) and continues along the lines toward the tails of the flock. Each solution is replaced by its best neighbor solution only if that neighbor is an improvement. A benefit mechanism is at work here: each solution assesses its own neighbors as candidates for replacing it, and it also benefits from the neighbors passed on by the solution in front of it. After all solutions have been improved by this neighbor mechanism, one of the second solutions in the V formation takes the first place and another loop starts. The algorithm runs until the stopping criteria are met [6]. MBO is a new solution approach for the FS research field. In our preliminary work, the performance of MBO on the FS problem was compared with some meta-heuristics under one classifier and one subset evaluator, and MBO was shown to outperform the other algorithms [10].
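The loop described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `mbo_sketch`, the `loops` stopping rule, the clipping of weights to [0, 1], the index layout of the V formation and the alternation of sides at leader replacement are all choices made for this sketch. The parameter names `nob`, `non`, `nof`, `olf` and `r` follow Section 4.2, the neighbor move follows Eq. 1 of Section 3, and the flock size `nob` is assumed to be odd with `non >= 2 * olf + 1`.

```python
import random

def mbo_sketch(score, dim, nob=5, non=3, nof=5, olf=1, r=0.05, loops=20, seed=0):
    """Maximise score(x) over weight vectors x in [0, 1]^dim (simplified MBO)."""
    rng = random.Random(seed)

    def neighbor(x):
        # One random step shared by all elements, clipped to [0, 1] (assumption).
        t = (-1) ** int(rng.random() + 0.5) * rng.random() * r
        return [min(1.0, max(0.0, v + t)) for v in x]

    def scored(x):
        return (score(x), x)

    def try_improve(i, pool):
        # Replace bird i by the best candidate only if it improves; return the rest.
        pool.sort(reverse=True)
        if pool and pool[0][0] > flock[i][0]:
            flock[i] = pool.pop(0)
        return pool

    # flock[0] leads; odd indices form the left line, even indices the right line.
    flock = [scored([rng.random() for _ in range(dim)]) for _ in range(nob)]
    best = max(flock)
    for loop in range(loops):
        for _ in range(nof):
            unused = try_improve(0, [scored(neighbor(flock[0][1])) for _ in range(non)])
            shared = {1: unused[:olf], 2: unused[olf:2 * olf]}
            for i in range(1, nob):
                own = [scored(neighbor(flock[i][1])) for _ in range(non - olf)]
                unused = try_improve(i, own + shared.get(i, []))
                shared[i + 2] = unused[:olf]   # pass the best unused neighbors back
            best = max(best, max(flock))
        # The leader moves to the tail of one line; that line's first bird leads.
        line = [0] + list(range(1 + loop % 2, nob, 2))
        birds = [flock[j] for j in line]
        for j, b in zip(line, birds[1:] + birds[:1]):
            flock[j] = b
    return best  # (score, weights)

# Hypothetical stand-in for a subset evaluator score (higher is better).
toy = lambda x: -sum((v - 0.7) ** 2 for v in x)
print(mbo_sketch(toy, dim=4))
```

In an FS setting, `score` would round the weight vector to a binary mask and call a subset evaluator on the selected features.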

2.2.2 Particle swarm optimization

The particle swarm concept grew out of simulations of the behavior of biological organisms, with the nuanced but unpredictable motion of bird flocks as the starting point. With this in mind, the particle swarm optimization (PSO) algorithm searches for a solution by imitating flying birds, or particles, that move with certain velocities and exchange information. The velocity of each particle is tuned according to its own flying experience and the experience of the other particles in the swarm. PSO is widely used for hard combinatorial optimization problems and also performs well on the feature selection problem, since the particles have a chance to discover the best feature combinations while flying through the problem space. PSO has shown strong search capability, finding near-optimal solutions in reasonable computational time [16, 40]. Liu et al. proposed an improved version of PSO [41], and Chuang et al. improved binary PSO using the catfish effect [20] for the feature selection problem.
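As an illustration of the velocity tuning described above, the following Python sketch shows one update of a single particle in the common textbook form with an inertia weight. It is not the variant adapted from [33] in this study: the function name `pso_step`, the default coefficient values and the clipping of positions to [0, 1] are assumptions made for the sketch, while ω, ϕ₁ and ϕ₂ correspond to the parameters listed in Section 4.2.

```python
import random

def pso_step(x, v, pbest, gbest, omega=0.5, phi1=1.0, phi2=1.0, rng=random):
    """One textbook PSO update for a single particle; returns (new_x, new_v)."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        vel = (omega * vi
               + phi1 * rng.random() * (pi - xi)    # pull toward the particle's own best
               + phi2 * rng.random() * (gi - xi))   # pull toward the swarm's best
        new_v.append(vel)
        new_x.append(min(1.0, max(0.0, xi + vel)))  # clipping to [0, 1] is an assumption
    return new_x, new_v

x, v = pso_step([0.2, 0.9, 0.4], [0.0, 0.0, 0.0],
                pbest=[0.3, 0.8, 0.4], gbest=[0.6, 0.7, 0.4])
print(x, v)
```

A particle that already sits on both its personal best and the swarm best with zero velocity does not move, which is why diversity in the swarm matters for exploration.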

2.2.3 Simulated annealing

Simulated Annealing (SA), proposed by Kirkpatrick et al. in 1983 [35], is one of the oldest stochastic meta-heuristics for approximating the global optimum. SA is inspired by annealing, a metallurgical technique that heats and cools a metal to increase the size of its crystals and reduce their defects in order to reach the best quality. SA can also be used for the feature selection problem. It works on just one feature subset while searching for the most appropriate solution, whereas population-based algorithms in this area select and improve multiple candidate solutions. SA lowers the temperature parameter by a cooling rate once a pre-given criterion is satisfied at the current temperature. Managing only one subset clearly reduces computational time; however, it may also prevent the algorithm from finding the best solution if the initial settings are not tuned well enough. Diao and Shen applied SA to the FS problem and compared it with nine other meta-heuristics [26]. Meiri and Zahavi applied the SA approach to the FS problem in marketing applications [30].
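A minimal Python sketch of such a single-solution search is given below, using the usual Metropolis-style acceptance rule for a maximization problem. It is an illustration only: the names `sa_sketch`, `eq1_neighbor` and `toy` are hypothetical, the evaluation budget `max_evals` is an assumed stopping rule, and the schedule (dividing T by `a` and multiplying R by `b` after each temperature level) is an assumption based on the parameter names in Section 4.2, not a confirmed detail of the study's implementation.

```python
import math
import random

def sa_sketch(score, start, neighbor, T=100.0, R=5, a=1.5, b=1.1, max_evals=2000, seed=0):
    """Maximise score(x) with simulated annealing; returns (best_x, best_score)."""
    rng = random.Random(seed)
    cur, cur_s = start, score(start)
    best, best_s = cur, cur_s
    evals = 1
    while evals < max_evals:
        for _ in range(int(R)):
            cand = neighbor(cur, rng)
            cand_s = score(cand)
            evals += 1
            delta = cand_s - cur_s                 # > 0 means the candidate is better
            if delta >= 0 or rng.random() < math.exp(delta / T):
                cur, cur_s = cand, cand_s          # worse moves pass with prob. exp(delta / T)
                if cur_s > best_s:
                    best, best_s = cur, cur_s
        T = max(T / a, 1e-12)                      # cool down (assumed: divide T by a)
        R *= b                                     # more iterations at lower temperatures
    return best, best_s

def eq1_neighbor(x, rng, r=0.05):
    # Neighbor move in the spirit of Eq. 1 (Section 3); clipping is an assumption.
    t = (-1) ** int(rng.random() + 0.5) * rng.random() * r
    return [min(1.0, max(0.0, v + t)) for v in x]

# Hypothetical stand-in for a subset evaluator score (higher is better).
toy = lambda x: -sum((v - 0.7) ** 2 for v in x)
print(sa_sketch(toy, [0.1, 0.1, 0.1], eq1_neighbor, T=0.01))
```

The initial temperature has to be chosen relative to the scale of the score differences; otherwise almost every worse move is accepted in the early phase.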

2.2.4 Differential evolution

The Differential Evolution (DE) algorithm is an evolutionary computation (EC) technique that iteratively tries to improve a candidate solution with respect to a given measure of quality. Compared to other EC techniques, DE is an inexpensive and easy-to-implement stochastic direct search method with few predetermined parameters. In multi-dimensional real-valued optimization problems, DE starts from randomly generated initial candidate solutions and searches for the global optimum while avoiding local optima. Mutation and crossover operations, which make DE evolutionary, take place after the initialization step to obtain a better population [3, 34]. DE was used for FS by Hancer et al. [7], and Zorarpaci and Ozel combined it with the artificial bee colony algorithm into a hybrid approach for feature selection [8].
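To make the mutation and crossover steps concrete, the sketch below builds one trial vector with the classic DE/rand/1 mutation and binomial crossover. This is the textbook scheme, shown for illustration; the operators actually used in this study follow [7] and may differ. The function name `de_trial` and the clipping to [0, 1] are assumptions, `F` and `cxp` correspond to the differential weight and crossover rate of Section 4.2, and the separate mutation rate `mp` is not modeled here.

```python
import random

def de_trial(pop, i, F=0.5, cxp=0.9, rng=random):
    """Build a trial vector for pop[i] with DE/rand/1 mutation and binomial crossover."""
    # Three distinct population members, all different from the target pop[i].
    a, b, c = rng.sample([p for k, p in enumerate(pop) if k != i], 3)
    dim = len(pop[i])
    forced = rng.randrange(dim)          # at least one gene comes from the mutant
    trial = []
    for d in range(dim):
        if d == forced or rng.random() < cxp:
            mutant = a[d] + F * (b[d] - c[d])
            trial.append(min(1.0, max(0.0, mutant)))   # clipping is an assumption
        else:
            trial.append(pop[i][d])
    return trial

pop = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7], [0.5, 0.5, 0.5], [0.3, 0.6, 0.9]]
print(de_trial(pop, 0))
```

In the usual selection step, the trial vector replaces `pop[i]` in the next generation only if its score is not worse.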

3 Application of algorithms to feature selection problem

Recall that the aim of the FS problem is to reduce the number of features (dimensions) while keeping the accuracy at least the same as on the original dataset. Recall also that the filter approach is one of the most commonly used FS approaches because of its speed; in it, subset evaluators are used to measure the quality of a feature subset. In this study, we investigate the performance of i) search algorithms and ii) subset evaluators from different perspectives. The search algorithms are evaluated in terms of the accuracy values and the number of exploited features obtained with three different classifiers and two evaluators. The subset evaluators, in turn, are evaluated in terms of accuracy and number of exploited features under three classifiers when they work with four search algorithms. Furthermore, all of these algorithms are integrated in a framework that can be extended swiftly in any dimension.

In this study, we focus on filter approaches and propose a novel filter approach based on MBO. The performance of MBO is compared with three state-of-the-art meta-heuristic-based filter approaches from several perspectives. In our framework, MBO, PSO, SA and DE are used as the search algorithms that find the feature subsets. Probabilistic consistency-based FS (PCFS) and correlation-based FS (CFS) are then used to judge the quality of the feature subsets. After a search algorithm has decided on the best subset (by communicating with the evaluator several times), the true performance of the subset is determined by a classifier algorithm. As classifiers, we use three of the most commonly used ones: k-nearest neighbor, naive Bayes and decision tree (C4.5). The general structure of our framework for the proposed algorithms is shown in Figure 1.

Fig. 1

The framework designed for comparing the algorithms on the feature selection problem.

To apply a meta-heuristic algorithm to a problem successfully, the most important design decisions are how a solution is defined and how a new solution is created from an existing one (neighbor generation). In this study, for the MBO algorithm, we define a solution as a vector whose elements are the weight values of the features in the [0, 1] range. By rounding its elements to the nearest integer, this vector is transformed into a binary-valued vector, so that the ith element indicates the selection status of the ith feature. In the initialization phase of the search algorithm, solution vectors are generated by assigning random weight values to their elements. In the neighbor generation phase, we modify the current solution vector by mutating all elements by a small increment or decrement (t) obtained from Eq. 1.

$t = (-1)^{\mathrm{int}(G(0,1) + 0.5)} \cdot G(0,1) \cdot r$ (1)

where G(0, 1) is a random number generator that returns values between 0 and 1, and r is the radius. A high r value yields large jumps, whereas a small r value yields small jumps for the mutation operator.

The representation of a solution and the neighbor generation are shown in Figure 2 on a hypothetical dataset having five features. As shown in Figure 2, a solution vector with elements (0.80, 0.34, 0.21, 0.55, 0.49) is mutated. In this mutation, let r = 0.02 and let the two invocations of G(0, 1) produce 0.40 and 0.15, respectively. Then, by applying Eq. 1, we obtain a neighbor solution with values (0.818, 0.358, 0.228, 0.568, 0.508). Observe the selection status of the features for both the original vector and the neighbor vector (1 means the feature is selected, whereas 0 means it is not): rounding gives (1, 0, 0, 1, 0) for the original vector and (1, 0, 0, 1, 1) for the neighbor. At each iteration, the current solution and the newly generated solution are compared according to a score computed by the subset evaluator, where a higher score means better quality. The algorithm runs until a stopping criterion is met.
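The representation and the neighbor move can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, a single step t is assumed to be shared by all elements of a neighbor (as the example suggests), values exactly at 0.5 are assumed to round up, and clipping to [0, 1] is an added assumption. Note that with r = 0.02 and draws of 0.40 and 0.15, Eq. 1 as written evaluates to t = 0.003, whereas the neighbor vector quoted above corresponds to a uniform shift of 0.018; the sketch therefore takes t as an explicit argument so that either value can be tried.

```python
import random

def to_mask(weights):
    """Round each weight in [0, 1] to 0/1; 1 means the feature is selected."""
    return [1 if w >= 0.5 else 0 for w in weights]

def step_size(g_sign, g_mag, r):
    """Eq. 1: t = (-1)^int(g_sign + 0.5) * g_mag * r for two draws from G(0, 1)."""
    return (-1) ** int(g_sign + 0.5) * g_mag * r

def shift(weights, t):
    """Add the same step t to every element (clipping to [0, 1] is an assumption)."""
    return [min(1.0, max(0.0, w + t)) for w in weights]

def random_neighbor(weights, r, rng=random):
    """Draw the two random numbers of Eq. 1 and apply the resulting step."""
    return shift(weights, step_size(rng.random(), rng.random(), r))

current = [0.80, 0.34, 0.21, 0.55, 0.49]
print(to_mask(current))                  # [1, 0, 0, 1, 0]
print(step_size(0.40, 0.15, 0.02))       # about 0.003
print(to_mask(shift(current, 0.018)))    # [1, 0, 0, 1, 1]
```

Only elements whose weights lie within the step size of 0.5 can change their selection status in one move, so the radius r controls how quickly the binary mask can change.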

Fig. 2

Solution and neighbor generation.

Initial solutions for SA, DE and PSO are generated in the same way as in MBO. In SA, a new solution is generated with the same neighbor generation mechanism as in MBO. The mutation and crossover operations for DE are implemented as given in [7]. The velocity update and position update operations for the PSO algorithm are adapted from [33] and modified for the FS problem.

When a search algorithm stops, it outputs the best feature subset found so far. This feature subset is then sent to a classifier algorithm to obtain the corresponding accuracy value. The four search algorithms are compared with each other and with the benchmark (classifier performance with the full set of features) according to the accuracy values obtained with two different subset evaluators and three different classifiers. The subset evaluators are compared with each other in the same way.

4 Computational experiment setup

In this section, we provide details of the computational experiment setups and exploited algorithms. Firstly, we will present the details of the datasets that we used in the computational experiments. Then, we will explain the parameters of the search algorithms. It will be followed by the subset evaluators and classifiers.

4.1 Datasets

To examine the performance of the algorithms, we used 17 datasets taken from the UCI machine learning repository [25]. To obtain a comprehensive understanding of the performance of the selection algorithms and evaluators, we used both small and large datasets, where size refers to the number of features (in line with the context of the manuscript): small datasets have fewer than 100 features and large datasets have 100 or more. Table 1 gives information about the datasets; the number of features ranges from 4 to 500. The datasets also differ in the number of instances and classes, which range from 101 to 20000 and from 2 to 41, respectively. Note that the dataset named "arrhythmia" has missing values in four features; these features are removed in the data preparation step.

Table 1
Description of UCI datasets

Dataset	Instances	Features	Classes
abalone	4177	8	29
arrhythmia	452	275	16
ionosphere	351	34	2
iris	150	4	3
letter	20000	16	26
lymphography	148	18	4
madelon	2000	500	2
muskv1	476	166	2
optdigits	5620	64	10
promoters	106	57	2
sonar	208	60	2
spect	267	22	2
splice	3190	60	3
ticdata2000	5822	85	41
vehicle	846	18	4
wine	178	13	3
zoo	101	17	7

4.2 Search algorithms

The parameters of the search algorithms used in our implementation are as follows:

MBO: number of birds (nob), number of neighbors (non), number of flaps (nof), overlap factor (olf) and radius (r).

PSO: inertia weight (ω), acceleration coefficients (ϕ₁ and ϕ₂), number of generations (nog) and number of solutions (nos).

SA: initial temperature (T), number of iterations in each temperature (R), temperature decrease ratio (a), increase ratio of R (b) and radius (r).

DE: crossover rate (cxp), mutation rate (mp), differential weight (F), number of generations (nog) and number of solutions (nos).

To get the best performance from the algorithms, we use their best-performing parameter values. These values are taken from [10], where the same algorithms were applied to the same problem (see Table 2). With the parameters fixed, the number of solutions that each algorithm can generate while exploring the solution space is limited to 50000 in the detailed experiments. The experiments are repeated ten times and the results are presented as the average of these ten runs.

Table 2
Values of parameters used in the fine-tuning experiments (bold ones are the best) [10]

MBO		PSO		SA		DE
Param.	Values	Param.	Values	Param.	Values	Param.	Values
nob	5,21,21,51	ϕ₁	0.1,0.5,1,2	T	100,1000	cxp	0.1,0.5,0.9
non	3,5,7	ϕ₂	0.1,0.5,1,2	R	5,20	mp	0.1,0.5,0.9
nof	5,10	ω	0.1,0.5,1	a	1.1,1.5	F	0.1,0.5,1,2
olf	1,2,3	nog	50,100,200	b	1.1,1.5	nog	50,100,200
r	0.01,0.02,0.05	nos	50,100,200	r	0.01,0.02,0.05	nos	50,100,200

4.3 Subset evaluators

In the evaluation of the feature subsets, we implemented two subset evaluators, probabilistic consistency-based FS (PCFS) [23] and correlation-based FS (CFS) [21], whose computational complexity and characteristics differ from each other. PCFS focuses on an inconsistency measure, by which it identifies inconsistent groups of features and removes unrelated features. As mentioned in [23], a pattern is the part of an instance without the class label; in other words, it is the set of values that a specific instance takes on the feature subset. In PCFS, the consistency measure is defined by the inconsistency rate, which is calculated in three steps.

i) If two instances have the same feature values but different class labels, the pattern is called inconsistent.

ii) The inconsistency count of a pattern of a feature subset is the number of times the pattern appears in the data minus the largest number of its occurrences under a single class label.

iii) The inconsistency rate of a feature subset is the sum of the inconsistency counts over all patterns of the feature subset that appear in the data, divided by the total number of instances [23].
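The three steps above can be sketched in Python as follows. This is a minimal illustration that assumes discrete feature values and an in-memory list of rows; the function name `inconsistency_rate` and the toy data are hypothetical and not taken from the study.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """rows: list of feature tuples; labels: class labels; subset: selected feature indices."""
    by_pattern = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[j] for j in subset)   # instance restricted to the subset
        by_pattern[pattern][label] += 1
    total = 0
    for counts in by_pattern.values():
        # occurrences of the pattern minus occurrences under its most frequent class
        total += sum(counts.values()) - max(counts.values())
    return total / len(rows)

# Toy data: the class label is fully determined by the second feature.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["a", "b", "a", "b"]
print(inconsistency_rate(rows, labels, [0]))   # 0.5
print(inconsistency_rate(rows, labels, [1]))   # 0.0
```

A search algorithm that maximizes a score could use, for example, one minus this rate; that particular conversion is an assumption of this note.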

CFS identifies subsets of features that are highly correlated with the class but uncorrelated with each other (see Eq. 2).

$\mathrm{cor}_{zc} = \frac{m \, \overline{\mathrm{cor}_{zi}}}{\sqrt{m + m (m - 1) \, \overline{\mathrm{cor}_{ii}}}}$ (2)

where $\mathrm{cor}_{zc}$ is the correlation between the summed features and the class label, m is the number of features in the subset, $\overline{\mathrm{cor}_{zi}}$ is the average of the correlations between the features and the class label, and $\overline{\mathrm{cor}_{ii}}$ is the average inter-correlation between features.
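A short Python sketch of Eq. 2 shows how the score rewards relevance and penalizes redundancy. The function name `cfs_merit` is hypothetical, and the correlations are passed in as plain numbers; how they are measured on real data is outside the scope of this sketch.

```python
import math

def cfs_merit(feature_class_corr, feature_feature_corr):
    """Eq. 2. feature_class_corr: one correlation per selected feature;
    feature_feature_corr: correlations of all feature pairs (empty if m == 1)."""
    m = len(feature_class_corr)
    mean_zi = sum(feature_class_corr) / m
    mean_ii = (sum(feature_feature_corr) / len(feature_feature_corr)
               if feature_feature_corr else 0.0)
    return m * mean_zi / math.sqrt(m + m * (m - 1) * mean_ii)

print(cfs_merit([0.6], []))            # single feature: 0.6
print(cfs_merit([0.6, 0.6], [0.0]))    # second, uncorrelated feature: about 0.849
print(cfs_merit([0.6, 0.6], [1.0]))    # second, fully redundant feature: about 0.6
```

Adding an equally relevant but uncorrelated feature raises the merit, while adding a fully redundant one leaves it unchanged.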

4.4 Classifiers

To judge the quality of the subsets found by the four algorithms, three commonly used classification methods are used: k-nearest neighbor (k-NN) [5], naive Bayes [9] and decision tree (C4.5) [32]. k-NN is a lazy technique that classifies a new instance according to the majority class label of its k nearest neighbors. In our experiments, we used the 5 nearest neighbors. C4.5 is a tree-based classification technique whose model consists of a root, branches and leaves; each path from the root to a leaf node represents a classification rule. Naive Bayes is another commonly used classification technique, based on Bayes' theorem. In our implementation, we used the Waikato Environment for Knowledge Analysis (WEKA) libraries for these three classification techniques.

In the classification part, we used 10-fold cross-validation. Each dataset is divided into 10 folds of roughly equal size; nine of them are used as the training set and the remaining one as the test set. Each fold is used as the test set once while the remaining folds form the training set, and the average over the 10 runs (one per fold) is taken as the performance of the classifier.
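The validation procedure can be sketched in Python as follows. This is an illustration only and does not reproduce the WEKA routines used in the study: the function names are hypothetical, the folds are formed by simple interleaving of instance indices (an assumption; other splitting schemes are possible), and the small k-NN classifier with squared Euclidean distance is included only to make the example self-contained.

```python
from collections import Counter

def knn_fit_predict(train_X, train_y, test_X, k=5):
    """Predict each test point by majority vote among its k nearest training points."""
    preds = []
    for q in test_X:
        order = sorted(range(len(train_X)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], q)))
        votes = Counter(train_y[i] for i in order[:k])
        preds.append(votes.most_common(1)[0][0])
    return preds

def k_fold_accuracy(X, y, fit_predict, k=10):
    """Average accuracy over k folds; each fold serves as the test set exactly once."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]   # roughly equal-sized folds
    accs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        accs.append(correct / len(test_idx))
    return sum(accs) / k

# Two well-separated toy clusters (hypothetical data).
X = [(float(i),) for i in range(10)] + [(100.0 + i,) for i in range(10)]
y = ["a"] * 10 + ["b"] * 10
print(k_fold_accuracy(X, y, knn_fit_predict))
```

To score a feature subset, the columns of `X` would first be restricted to the selected features before calling `k_fold_accuracy`.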

5 Results and discussion

After fine-tuning the parameters of all algorithms, we ran an extensive set of tests to compare the performance of the search algorithms and subset evaluators.

We conducted experiments on 17 datasets with different characteristics. In the experiments, MBO is compared with the PSO, SA and DE algorithms in terms of accuracy and the average number of features in the subsets they find. Table 3 presents the accuracy values (in percentages) of the search algorithms with respect to PCFS and CFS using the k-nearest neighbor (KNN), naive Bayes (NB) and decision tree (C4.5) classifiers, whereas Table 4 presents the number of features exploited by these algorithms in the same layout.

Table 3
Accuracy values (%) of search algorithms with respect to PCFS and CFS using KNN, NB and C4.5 (bolds are best values)

Dataset	Classifier	PCFS					CFS
		Benchmark	MBO	PSO	SA	DE	Benchmark	MBO	PSO	SA	DE
abalone	KNN	22.96	24.40	24.40	25.31	23.96	22.96	23.34	23.34	22.48	22.48
arrhythmia		58.85	62.39	61.28	60.18	61.28	58.85	61.73	61.50	63.50	66.15
ionosphere		84.90	84.90	85.47	85.47	85.75	84.90	89.46	87.18	88.32	87.75
iris		95.33	96.00	96.00	96.00	96.00	95.33	96.67	96.67	96.67	96.67
letter		95.50	95.77	95.59	95.28	95.67	95.50	95.81	95.78	95.51	94.46
lymphography		79.73	79.73	82.43	83.11	81.76	79.73	83.11	81.08	82.43	79.05
madelon		56.30	56.15	56.80	56.65	57.85	56.30	59.25	57.23	57.55	85.35
muskv1		83.19	85.29	83.61	84.87	83.82	83.19	84.03	83.82	82.98	84.24
optdigits		98.70	97.69	97.88	97.30	97.85	98.70	98.47	98.01	98.67	98.59
promoters		79.25	75.47	78.30	80.19	83.02	79.25	86.79	83.96	83.96	90.57
sonar		84.62	85.10	86.06	86.54	85.58	84.62	84.62	86.54	85.58	85.10
spect		80.90	83.90	83.90	82.77	82.40	80.90	84.27	83.90	84.64	84.64
splice.data		79.84	82.35	80.94	74.64	80.44	79.84	85.39	82.32	84.39	90.72
ticdata2000		59.45	64.75	57.40	61.61	56.80	59.45	73.02	70.66	72.76	85.01
vehicle		72.70	71.63	71.63	71.04	71.63	72.70	69.39	71.75	69.98	64.07
wine		95.51	97.19	97.19	96.63	97.19	95.51	97.75	97.75	97.75	97.75
zoo		94.06	96.04	95.05	93.07	96.04	94.06	97.03	96.04	96.04	94.06
AVERAGE		77.75	78.75	78.47	78.27	78.65	77.75	80.60	79.86	80.19	82.74
abalone	NB	23.77	24.16	24.23	23.77	24.11	23.77	24.99	24.73	24.73	24.73
arrhythmia		59.29	61.95	62.39	61.73	60.84	59.29	64.82	63.72	64.16	68.58
ionosphere		82.62	86.04	85.19	85.19	86.61	82.62	91.17	89.74	90.60	92.02
iris		96.00	96.00	96.00	96.00	96.00	96.00	96.00	96.00	96.00	96.00
letter		64.01	65.22	64.03	64.50	64.82	64.01	63.88	65.03	66.06	64.63
lymphography		82.43	81.76	82.43	83.11	85.81	82.43	84.46	82.43	82.43	81.76
madelon		58.40	59.80	59.55	60.20	57.60	58.40	60.50	59.28	60.90	59.15
muskv1		75.21	77.10	77.31	79.20	77.52	75.21	78.15	77.10	77.73	80.88
optdigits		91.33	89.27	88.35	89.54	87.53	91.33	90.62	89.84	91.05	91.51
promoters		90.57	89.62	88.68	87.74	88.68	90.57	89.62	88.68	90.57	88.68
sonar		67.79	72.12	71.63	72.12	72.60	67.79	71.63	71.63	73.08	68.27
spect		75.66	80.15	77.53	79.40	80.15	75.66	79.78	79.03	81.27	80.15
splice.data		95.36	92.29	91.19	91.54	90.38	95.36	92.26	93.07	92.48	91.32
ticdata2000		77.57	82.55	82.34	80.20	81.91	77.57	84.97	84.46	85.14	86.50
vehicle		44.80	51.65	50.95	49.76	51.30	44.80	50.00	47.64	49.17	48.46
wine		96.63	98.31	96.63	96.63	97.19	96.63	98.31	97.75	97.19	97.19
zoo		98.02	99.01	100.00	100.00	100.00	98.02	100.00	100.00	100.00	100.00
AVERAGE		75.26	76.88	76.38	76.51	76.65	75.26	77.72	77.07	77.79	77.64
abalone	C4.5	20.49	23.10	21.76	23.10	21.38	20.49	19.34	20.68	19.34	19.34
arrhythmia		60.40	62.39	60.18	61.50	62.39	60.40	62.83	63.05	63.50	63.94
ionosphere		91.45	91.17	92.02	91.45	92.31	91.45	92.59	92.88	91.74	91.45
iris		96.00	96.00	96.00	96.00	96.00	96.00	96.00	96.00	96.00	96.00
letter		87.92	88.19	87.98	87.62	88.03	87.92	88.22	88.03	88.23	87.28
lymphography		75.00	79.73	83.11	80.41	81.76	75.00	81.08	78.38	78.38	78.38
madelon		70.35	71.85	73.60	72.50	71.25	70.35	71.17	69.20	69.80	78.40
muskv1		84.87	85.50	86.97	82.77	84.24	84.87	84.66	83.61	82.56	86.34
optdigits		90.68	89.77	90.32	90.37	90.07	90.68	91.01	90.93	90.85	90.52
promoters		81.13	81.13	81.13	83.02	81.13	81.13	86.79	84.91	85.85	85.85
sonar		71.15	78.37	74.04	77.88	77.88	71.15	76.92	77.40	77.88	78.37
spect		80.90	84.27	80.90	83.15	82.40	80.90	84.27	82.40	83.15	81.65
splice.data		94.36	92.16	92.04	87.34	91.25	94.36	94.73	94.23	94.83	95.08
ticdata2000		94.45	94.37	94.59	93.94	93.99	94.45	94.54	94.42	94.56	91.88
vehicle		72.58	71.75	72.34	71.87	73.40	72.58	72.34	73.40	73.52	68.32
wine		93.82	93.26	94.94	94.38	94.38	93.82	94.38	94.38	93.82	93.82
zoo		97.03	99.01	99.01	99.01	98.02	97.03	100.00	98.02	100.00	99.01
AVERAGE		80.15	81.30	81.23	80.96	81.17	80.15	81.82	81.29	81.41	81.51

Table 4

Average number of features found by search algorithms with respect to PCFS and CFS using KNN, NB and C4.5 (bolds are best values)

Dataset	Classifier	Benchmark	PCFS				CFS
			MBO	PSO	SA	DE	MBO	PSO	SA	DE
abalone	KNN	8	4.5	4.1	5.0	4.8	5.4	6.4	5.0	5.0
arrhythmia		275	139.3	138.6	135.5	137.6	123.1	135.2	128.5	56.0
ionosphere		34	21.6	32.7	22.6	25.6	17.5	23.9	17.3	14.4
iris		4	3.1	2.7	2.8	3.3	2.0	2.0	2.0	2.0
letter		16	12.8	15.5	12.9	14.3	10.1	12.4	9.2	9.0
lymphography		18	11.9	11.6	11.0	14.6	9.9	8.2	10.5	11.0
madelon		500	254.1	252.0	247.0	251.0	229.3	243.3	248.0	15.5
muskv1		166	80.1	85.3	85.9	82.4	90.2	82.5	76.6	48.4
optdigits		64	32.7	32.2	31.9	34.6	37.0	31.0	36.7	38.0
promoters		57	29.0	27.2	28.0	29.9	22.1	27.2	23.6	6.6
sonar		60	35.0	54.4	34.3	37.2	24.2	27.2	22.2	19.1
spect		22	17.0	21.1	18.2	20.3	11.0	18.7	12.9	12.0
splice.data		60	30.2	31.7	30.3	32.2	25.5	34.9	24.5	8.2
ticdata2000		85	43.1	51.2	45.5	50.7	33.7	41.4	35.4	4.4
vehicle		18	8.7	8.6	9.8	9.5	11.0	13.0	10.7	11.0
wine		13	8.0	8.1	7.6	9.4	8.6	11.0	8.3	8.0
zoo		17	10.9	10.8	9.8	13.0	9.0	9.9	6.6	7.2
abalone	NB	8	4.6	4.9	4.7	4.4	5.3	5.7	5.0	5.0
arrhythmia		275	142.6	134.6	137.7	135.4	123.4	136.8	131.8	56.8
ionosphere		34	21.7	31.4	23.2	25.3	17.6	20.9	16.8	14.0
iris		4	2.7	3.1	3.0	3.1	2.0	2.0	2.0	2.0
letter		16	12.9	15.3	13.0	14.9	8.9	11.2	9.4	9.0
lymphography		18	12.0	12.1	12.0	13.7	9.3	8.3	10.4	11.0
madelon		500	252.1	252.6	249.5	246.0	209.7	227.1	229.0	15.0
muskv1		166	81.7	84.7	83.3	83.0	91.4	82.3	76.9	47.6
optdigits		64	33.4	30.8	34.8	32.1	33.6	31.1	35.4	38.0
promoters		57	28.9	30.4	30.2	29.7	20.8	27.4	23.5	6.4
sonar		60	33.8	54.9	36.3	34.0	22.7	28.0	24.4	19.1
spect		22	17.6	21.3	18.0	19.7	13.1	18.2	12.5	12.0
splice.data		60	31.6	28.9	31.7	31.1	24.7	31.7	25.2	8.1
ticdata2000		85	45.8	49.5	44.8	48.1	36.5	38.3	33.3	3.7
vehicle		18	9.1	9.5	8.9	9.9	10.9	14.8	10.5	11.0
wine		13	8.1	8.3	7.9	10.1	9.2	11.4	8.5	8.0
zoo		17	10.8	11.0	10.6	11.7	8.3	9.2	5.9	6.4
abalone	C4.5	8	4.1	4.1	4.3	4.4	5.0	5.8	5.0	5.0
arrhythmia		275	136.7	135.1	139.0	136.0	130.1	136.3	132.3	55.5
ionosphere		34	22.0	32.1	22.1	26.4	16.7	24.2	17.8	14.6
iris		4	3.1	3.0	3.4	2.9	2.0	2.0	2.0	2.0
letter		16	13.2	14.7	12.8	15.1	9.7	11.4	10.3	9.0
lymphography		18	11.9	12.0	11.2	14.8	9.5	8.8	10.6	11.0
madelon		500	252.9	245.6	245.7	253.0	239.6	223.4	237.0	16
muskv1		166	79.4	85.5	85.8	84.9	89.2	77.9	80.4	47.9
optdigits		64	32.8	31.2	33.7	32.3	36.0	31.0	36.3	38.0
promoters		57	29.2	29.9	27.3	29.9	21.2	29.2	23.2	6.2
sonar		60	36.2	53.9	36.6	34.1	24.6	27.3	25.2	19.0
spect		22	16.1	21.6	18.1	20.8	11.4	19.2	12.6	12.0
splice.data		60	31.8	29.5	29.2	30.8	26.1	33.9	25.8	8.3
ticdata2000		85	43.0	49.3	42.9	50.7	34.2	37.4	34.0	3.8
vehicle		18	9.1	9.2	8.9	10.6	10.5	13.6	10.5	11.0
wine		13	8.2	8.8	8.3	9.5	9.1	10.7	8.5	8.0
zoo		17	11.3	10.0	9.0	12.8	7.6	9.1	5.8	6.5

The first and most general conclusion can be drawn from the overall performance of the algorithms. According to Table 3, the search algorithms achieve higher accuracy than the benchmark on all but a few datasets. Recall that the benchmark results are obtained by running the classifiers on the full set of features. As seen in Table 4, the search algorithms use fewer features than the benchmark on every dataset. Recall also that the goal in the FS problem is to reduce the size of the dataset while preserving, or even improving, the original accuracy. Therefore, the results presented in Table 3 and Table 4 show that our search algorithms are applied to the FS problem successfully.

Next, we analyze which evaluator, learner and search algorithm perform better. Starting with the evaluators, the number of exploited features decreases by 38% on average with PCFS and by 50% with CFS (both figures can be derived from Table 4). CFS therefore reduces the number of features considerably more than PCFS. Furthermore, the search algorithms reach higher accuracy when they use the CFS evaluator rather than PCFS. This can be observed from the AVERAGE rows of Table 3, which give the average accuracy for each classifier.
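
The reduction percentages quoted above can be derived from Table 4 with a calculation along the following lines. This is a minimal sketch: the helper names are hypothetical, only three rows of Table 4 (KNN block, MBO columns) are used, and the exact averaging scheme behind the 38% and 50% figures (over which algorithms and classifiers) is not stated in the text, so this excerpt is not expected to reproduce them.

```python
def reduction_percent(n_original: int, n_selected: float) -> float:
    """Percentage of features removed relative to the full feature set."""
    return 100.0 * (1.0 - n_selected / n_original)


def mean_reduction(rows) -> float:
    """Average the per-dataset reduction over (n_original, n_selected) pairs."""
    values = [reduction_percent(n, s) for n, s in rows]
    return sum(values) / len(values)


# (benchmark feature count, average selected features) for abalone, iris, wine,
# copied from Table 4, KNN block, MBO columns.
pcfs_rows = [(8, 4.5), (4, 3.1), (13, 8.0)]
cfs_rows = [(8, 5.4), (4, 2.0), (13, 8.6)]

print(f"PCFS: {mean_reduction(pcfs_rows):.1f}%  CFS: {mean_reduction(cfs_rows):.1f}%")
```

Averaging per-dataset percentages, rather than dividing total selected features by total original features, keeps large datasets such as madelon from dominating the result; which of the two the figures in the text correspond to is an open assumption.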

Turning to the classifiers, C4.5 gives better results than KNN and NB. With the full set of features, C4.5 reaches 80.15% accuracy, whereas KNN and NB reach 77.75% and 75.26%, respectively, averaged over the 17 datasets (see the AVERAGE rows of Table 3). The same order is largely preserved when the classifiers are run on the feature subsets selected by the search algorithms; the exception is KNN with CFS and DE, whose 82.74% average exceeds every C4.5 average. Therefore, among these three classifiers, C4.5 is generally preferable to KNN and NB.

Having identified the better evaluator and learner, we can focus on the performance of the search algorithms. Our first observation is that MBO is the best-performing search algorithm overall. According to the AVERAGE rows of Table 3, MBO attains the highest average accuracy in four of the six classifier and evaluator combinations; DE is best for KNN with CFS (82.74%) and SA is marginally best for NB with CFS (77.79% versus 77.72% for MBO). Furthermore, MBO shows the best performance with the best learner (C4.5) and the best evaluator (CFS), with an average of 81.82% (see the intersection of C4.5 and CFS in Table 3). After MBO, DE shows the next best performance.
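
The comparison of average accuracies can be checked mechanically. The sketch below copies the AVERAGE rows of Table 3 (benchmark column omitted) into a dictionary and picks the search algorithm with the highest average for each classifier and evaluator pair; the variable and function names are illustrative only.

```python
# AVERAGE rows of Table 3, search-algorithm columns only.
averages = {
    ("KNN", "PCFS"): {"MBO": 78.75, "PSO": 78.47, "SA": 78.27, "DE": 78.65},
    ("KNN", "CFS"): {"MBO": 80.60, "PSO": 79.86, "SA": 80.19, "DE": 82.74},
    ("NB", "PCFS"): {"MBO": 76.88, "PSO": 76.38, "SA": 76.51, "DE": 76.65},
    ("NB", "CFS"): {"MBO": 77.72, "PSO": 77.07, "SA": 77.79, "DE": 77.64},
    ("C4.5", "PCFS"): {"MBO": 81.30, "PSO": 81.23, "SA": 80.96, "DE": 81.17},
    ("C4.5", "CFS"): {"MBO": 81.82, "PSO": 81.29, "SA": 81.41, "DE": 81.51},
}


def best_algorithm(scores: dict) -> str:
    """Name of the search algorithm with the highest average accuracy."""
    return max(scores, key=scores.get)


winners = {combo: best_algorithm(scores) for combo, scores in averages.items()}
for (classifier, evaluator), name in winners.items():
    print(f"{classifier:5s} {evaluator:5s} -> {name}")
```

Note that these are plain averages over 17 datasets; the differences between algorithms are often well below one percentage point, and no statistical test is applied here.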

Table 5 and Table 6 show the number of winning cases according to accuracy and number of features, respectively. That is, each entry in these tables gives the number of datasets (out of 17) on which a search algorithm performed at least as well as all the others. Note that the entries in a column may sum to more than 17 because of draws. From Table 5 we observe that MBO has the most winning cases for four of the six evaluator and learner combinations and the highest total (46).

Table 5

Number of winning cases in terms of accuracy

Search Algorithms	KNN		NB		C4.5		Total
	PCFS	CFS	PCFS	CFS	PCFS	CFS
MBO	10	7	8	6	8	7	46
PSO	5	5	4	3	7	4	28
SA	4	4	5	7	5	5	30
DE	7	9	6	7	4	6	39
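
The counting rule behind Table 5, including draws, can be sketched as follows. The example uses the KNN accuracies obtained with the PCFS evaluator, copied from Table 3 with the benchmark column omitted; the function name `count_wins` is hypothetical. Every algorithm that ties for the best value on a dataset is credited with a win, which is why a column can sum to more than 17.

```python
from collections import Counter

ALGORITHMS = ("MBO", "PSO", "SA", "DE")

# KNN accuracy (%) with PCFS, per dataset, in the order MBO, PSO, SA, DE.
knn_pcfs = {
    "abalone": (24.40, 24.40, 25.31, 23.96),
    "arrhythmia": (62.39, 61.28, 60.18, 61.28),
    "ionosphere": (84.90, 85.47, 85.47, 85.75),
    "iris": (96.00, 96.00, 96.00, 96.00),
    "letter": (95.77, 95.59, 95.28, 95.67),
    "lymphography": (79.73, 82.43, 83.11, 81.76),
    "madelon": (56.15, 56.80, 56.65, 57.85),
    "muskv1": (85.29, 83.61, 84.87, 83.82),
    "optdigits": (97.69, 97.88, 97.30, 97.85),
    "promoters": (75.47, 78.30, 80.19, 83.02),
    "sonar": (85.10, 86.06, 86.54, 85.58),
    "spect": (83.90, 83.90, 82.77, 82.40),
    "splice.data": (82.35, 80.94, 74.64, 80.44),
    "ticdata2000": (64.75, 57.40, 61.61, 56.80),
    "vehicle": (71.63, 71.63, 71.04, 71.63),
    "wine": (97.19, 97.19, 96.63, 97.19),
    "zoo": (96.04, 95.05, 93.07, 96.04),
}


def count_wins(table: dict, higher_is_better: bool = True) -> dict:
    """Count, per algorithm, the datasets on which it attains the best value."""
    wins = Counter({name: 0 for name in ALGORITHMS})
    pick = max if higher_is_better else min
    for scores in table.values():
        best = pick(scores)
        for name, score in zip(ALGORITHMS, scores):
            if score == best:  # draws credit every tied algorithm
                wins[name] += 1
    return dict(wins)


print(count_wins(knn_pcfs))  # {'MBO': 10, 'PSO': 5, 'SA': 4, 'DE': 7}
```

For this column the counts agree with the KNN/PCFS entries of Table 5. Passing `higher_is_better=False` gives the corresponding count for the feature numbers in Table 4, where smaller is better.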

Table 6

Number of winning cases in terms of average number of features

Search Algorithms	KNN		NB		C4.5		Total
	PCFS	CFS	PCFS	CFS	PCFS	CFS
MBO	6	2	8	2	5	4	27
PSO	4	3	3	3	4	2	19
SA	7	4	5	4	7	4	31
DE	0	11	2	12	2	12	39

6 Conclusion

Feature selection is a very important pre-processing step for machine learning. It not only decreases computational expense but can also increase the accuracy of the induction algorithm by eliminating redundant or irrelevant features. In this study, we developed a framework that integrates several feature selection approaches. Furthermore, to the best of our knowledge, this is the first time that the MBO algorithm, a state-of-the-art optimizer, is applied to the feature selection problem. Our framework makes it possible to compare the performance of four metaheuristics (MBO, PSO, SA and DE), which serve as the selection mechanism, and two evaluators (PCFS and CFS), which are the auxiliary elements of the filter approach. Performance is measured with three different classifiers, namely KNN, NB and C4.5. Computational experiments are conducted on a large set of datasets, in contrast to the existing studies in the literature. The results show that, among the search algorithms, MBO gives better accuracy values than the others on average in most evaluator and classifier combinations, and that it works well across a wide variety of datasets. From the evaluator perspective, we observe that CFS provides better results than PCFS. Our framework also allows the learner algorithms to be compared; from this perspective, we observed that C4.5 provides better results on average.

As future work, a novel wrapper approach may be developed by integrating the MBO algorithm with several induction algorithms. In addition, our framework can be extended with other classical metaheuristics, evaluators and learner algorithms.


MBO		PSO		SA		DE
Param.	Values	Param.	Values	Param.	Values	Param.	Values
nob	5,21,21,51	ϕ₁	0.1,0.5,1,2	T	100,1000	cxp	0.1,0.5,0.9
non	3,5,7	ϕ₂	0.1,0.5,1,2	R	5,20	mp	0.1,0.5,0.9
nof	5,10	ω	0.1,0.5,1	a	1.1,1.5	F	0.1,0.5,1,2
olf	1,2,3	nog	50,100,200	b	1.1,1.5	nog	50,100,200
r	0.01,0.02,0.05	nos	50,100,200	r	0.01,0.02,0.05	nos	50,100,200