Feature selection in classification using self-adaptive owl search optimization algorithm with elitism and mutation strategies

Abstract

The fundamental aim of feature selection is to reduce the dimensionality of data by removing irrelevant and redundant features. Because finding the best subset among all possible subsets is computationally expensive, especially for high-dimensional datasets, meta-heuristic algorithms are often used for the task. In this paper, a variant of a recent meta-heuristic, the Owl Search Optimization algorithm (OSA), is proposed for solving the feature selection problem within a wrapper-based framework. Several strategies are incorporated to strengthen BOSA (the binary version of OSA) in searching for the global best solution. The meta-parameter of BOSA is initialized dynamically and then adjusted by a self-adaptive mechanism during the search. In addition, elitism and mutation operations are combined with BOSA to better control exploitation and exploration. This improved BOSA is named the Modified Binary Owl Search Algorithm (MBOSA). A Decision Tree (DT) classifier is used in the wrapper-based fitness function, and the final classification performance of the selected feature subset is evaluated with a Support Vector Machine (SVM) classifier. Simulation experiments are conducted on twenty well-known benchmark datasets from the UCI repository, and the results are reported in terms of classification accuracy, number of selected features, and execution time. BOSA and three common meta-heuristic algorithms, Binary Bat Algorithm (BBA), Binary Particle Swarm Optimization (BPSO), and Binary Genetic Algorithm (BGA), are used for comparison. The results show that the proposed approach outperforms these methods by significantly reducing the number of features while maintaining a comparable level of classification accuracy.

Keywords

Feature subset selection; Binary owl search algorithm; Meta-heuristic optimization; Self-adaptive mechanism

1 Introduction

Supervised classification is an important area of machine learning in which the class of a new observation is predicted from a model learned on training data. Datasets frequently contain numerous features that are not equally important for classification. Redundant and irrelevant features make the classification model more complex, increase the computational burden, and can degrade classification performance [1]. Feature selection addresses this problem by selecting the most informative or discriminatory features for classification.

Feature subset selection algorithms can be divided into two basic approaches: feature ranking and subset selection [2]. In the ranking approach, features are ordered according to a feature evaluation metric, and high-ranking features are selected based on a predefined number of features or a threshold on the metric. Subset selection uses a search strategy that aims to find the optimal feature subset among all possible subsets. Based on the evaluation metric, feature selection approaches are categorized into wrapper, filter, and embedded approaches [3]. The wrapper approach uses classification accuracy to evaluate subsets. In contrast, filter approaches use intrinsic characteristics of the dataset as the evaluation metric. The wrapper approach is computationally more expensive than filter techniques, but it generally produces better classification accuracy. In the embedded approach, feature selection is performed during the model-building process.

Feature subset selection can be considered as an optimization problem in which the most informative subset of m features is selected from a total of n features (m < n). An exhaustive search can find the optimum feature subset, but it is not feasible for relatively large datasets, because the number of candidate subsets grows as O(2ⁿ). To overcome this problem, greedy strategies such as sequential forward search (SFS) and sequential backward search (SBS) are popularly used. However, they are comparatively slow and yield sub-optimal subsets for high-dimensional datasets, as they explore only a limited area of the search space [4]. A different search approach is meta-heuristic search. These stochastic approaches are effective in solving global optimization problems within a reasonable time. Although an optimum solution is not guaranteed, they usually produce good sub-optimal solutions. These methods have been used effectively in feature subset selection; examples include Genetic Algorithm (GA) [5], Tabu search [6], Simulated Annealing (SA) [7], and Particle Swarm Optimization (PSO) [8].

Recently, the Owl Search Algorithm (OSA) was proposed to solve continuous optimization problems with satisfactory performance [9]. To date, OSA and its variants have been employed to solve global optimization problems in several application areas. To track the maximum power point (MPP) of photovoltaic (PV) systems in electric energy production, OSA was combined with the Perturb and Observe (P&O) method, and the combined P&O-OSA approach was found to converge faster in producing maximum power [10]. An improved version of OSA was proposed to optimize a telecommunication supply system, and the reported results were better than those of approaches based on PSO and the Emperor Penguins Optimizer (EPO) [11]. A variant named developed OSA (DOSA) was used to optimize combined cooling, heating, and power (CCHP) systems, where it achieved faster convergence of the cost function than GA [12]. In addition, chaotic OSA was employed to optimize the quality of a negotiation process and achieved a better average fitness value than PSO [13]. These studies indicate that OSA has potential for solving global optimization problems. Moreover, the algorithm is straightforward and, unlike popular meta-heuristic techniques such as GA and PSO, has only one parameter to tune. Although OSA has been used to solve continuous optimization problems, its capability to solve the feature selection problem has not been investigated before. These observations motivate us to employ OSA for the feature selection problem.

In the authors' previous work, a binary variant of the Owl Search Algorithm, named BOSA, was proposed and shown to have good potential for feature selection [14]. However, during feature selection with BOSA, we observed several issues that need to be addressed to improve its performance. First, solution quality is highly sensitive to the parameter β, so initializing β statically and decreasing it linearly during the search does not yield good feature subsets for many datasets. Second, BOSA converges comparatively slowly, and its premature convergence, especially on high-dimensional datasets, is likely to result in a smaller reduction in the number of features. Considering these issues, the aim of this study is to further improve BOSA for feature subset selection in a wrapper-based framework. To this end, this paper proposes a new feature subset selection algorithm based on a meta-heuristic search strategy, named the Modified Binary Owl Search Algorithm (MBOSA). To achieve a proper balance between exploration and exploitation in the search, three strategies are incorporated: a self-adaptive strategy for parameter tuning, an elitism mechanism, and a mutation operation. A Decision Tree (DT) classifier is used in the wrapper-based fitness function, while the final classification performance of the selected feature subset is evaluated with a Support Vector Machine (SVM) classifier. To obtain empirical evidence of its performance, the algorithm is applied to several public benchmark datasets and compared with similar state-of-the-art algorithms.

The structure of the paper is organized as follows: Section 2 presents the related works on feature subset selection. Section 3 describes the details of Owl Search Algorithm. The proposed approach of feature selection is presented in section 4. Section 5 reports the information of the datasets, parameter values used for the algorithms and experimental setups. Section 6 presents the experimental results and discussion. Finally, the conclusion is presented in section 7.

2 Related works

In recent years, various meta-heuristic approaches have been proposed and applied to solve the feature subset selection problem. Among them, population-based algorithms, which consider a set of candidate solutions during the search, are frequently used in feature selection. This is due to their reasonable computational cost and efficient performance in dealing with complex real-world problems.

Early research on population-based feature selection focused on genetic algorithms (GA) and particle swarm optimization (PSO). A filter-based feature selection method was introduced in which fuzzy operations are combined with GA [15]. Another GA-based approach was a multi-objective GA for feature selection on microarray gene expression data [16]. Guha et al. hybridized GA with the great deluge algorithm to perform feature selection [17]. A binary GA with granular information was proposed for selecting significant features in [18]. Shukla et al. proposed a hybrid framework that combines the benefits of the filter and wrapper approaches: conditional mutual information maximization (CMIM) selects the prominent features, which are then refined by GA [19]. PSO has also been used extensively to obtain an optimal feature subset. A binary variant of PSO (BPSO) was proposed for binary optimization problems [20] and has also been applied to feature selection. An extension of BPSO is the Hamming distance-based BPSO algorithm for high-dimensional feature selection [21]. Furthermore, Chakraborty introduced a BPSO with a fuzzy fitness function for feature selection [22]. A multi-objective formulation was also embedded in PSO for the feature selection problem [23]. Lu et al. proposed an improved PSO-based feature selection method for text classification [24]. A combination of GA and PSO was also reported in [25].

Another commonly used meta-heuristic for the feature selection problem is Ant Colony Optimization (ACO). In [26], the authors used ACO for text feature classification. A similar wrapper-based approach in which ACO was hybridized with an artificial neural network (ANN) is reported in [27]. A multivariate filter-based feature selection method that uses ACO for unsupervised learning was reported in [28]. Harmony search (HS) has also shown potential for the feature selection problem. A recent example of HS-based feature selection for imbalanced data is presented in [29]. To improve harmony search-based feature selection for high-dimensional medical data, adaptive strategies are suggested in [30]. The apparent merit of this approach is its limited computational overhead. A related approach is feature selection using self-adjusting harmony search [31]. A variation of Artificial Bee Colony (ABC), in which a similarity search approach was integrated with the binary ABC (BABC) algorithm, was reported to find the optimum feature subset [32]. Another ABC-based approach is a multi-objective Artificial Bee Colony (MOABC) that employs fuzzy mutual information as a new filter fitness evaluator for efficient feature selection [33]. The Cuckoo Search Algorithm (CSA) has also been applied: a rough set-based fitness function was combined with CSA to select relevant features in [34]. In [35], it was also observed that gravitational search optimization with a wrapper approach can be an effective feature selection strategy [36].

Some contemporary studies on the feature selection problem focus on recent population-based meta-heuristics. Rodrigues et al. [37] proposed the binary Bat Algorithm (BBA) with Optimum-Path Forest for feature selection. Another approach hybridized BBA with enhanced particle swarm optimization to address the feature selection problem [38]. Mafarja et al. presented a feature selection strategy based on the hybridization of the Binary Grasshopper Optimization Algorithm (BGOA) with incremental hill climbing search [39]. To improve the wrapper feature selection method, the authors of [40] combined the Binary Salp Swarm Algorithm (BSSA) with a Random Weight Network (RWN) classifier and suggested three objectives. Another recent strategy for finding the optimal feature subset is the Chaotic Salp Swarm Algorithm (CSSA), in which SSA was hybridized with ten different chaotic maps [41]. Chaos theory was also embedded in other meta-heuristic algorithms for feature selection, such as the Chaotic Crow Search Algorithm (CCSA) [42], Chaotic Dragonfly Algorithm (CDA) [43], and Chaotic Multi-Verse Optimization (CMVO) [44]. In another recent work, Grey Wolf Optimization (GWO) was integrated with Particle Swarm Optimization (PSO) to explore the feature search space [45]. A similar work combines GWO and the Whale Optimization Algorithm (WOA) [46]. Two other recently proposed feature selection approaches based on meta-heuristic algorithms are Binary Teaching-Learning Based Optimization (BTLBO) [47] and a binary variant of Butterfly Optimization (BBO) [48].

It is observed that several population-based metaheuristics are proposed for solving feature selection problem. According to No Free Lunch theorem [49], a single optimization algorithm cannot work well for all optimization problems. Therefore, new meta-heuristics have the potential to perform well in feature subset selection problem.

3 Owl search algorithm (OSA)

Owl Search Algorithm (OSA) is a nature-inspired algorithm proposed for global optimization problem in [9]. The inspiration comes from the auditory behavior of owls during hunting their prey. Owls can determine the location of their prey using a unique auditory system whereby the sounds reach one ear before the other. The brain of an owl generates an auditory map of the prey’s sound, which induces the owl to fly towards the prey in dark [9]. This nature-inspired strategy has been employed for solving global optimization problems where a group of owls works to find prey or optimum solution. In the d dimensional space, each owl is represented by a randomly generated position. At each iteration, this position is updated based on the influence of the prey’s movement. Let us suppose that the number of owls is denoted as n. The initial position vector of the i^th owl is defined as follows: $O_{i}^{j} = O_{L}^{j} + U (0, 1) \times (O_{U}^{j} - O_{L}^{j})$ (1) Here $O_{i}^{j}$ is the i^th owl in j^th dimension. U (0, 1) is a uniform random number ∈[0, 1]. The upper and lower bound of the position of i^th owl in j^th dimension is $O_{U}^{j}$ and $O_{L}^{j}$ respectively.

The fitness function f (.), associated with the optimization problem, evaluates each search agent O_i. The fitness value of O_i is directly related to the intensity information received through the ears and is represented as $f_{i} = f ([O_{i}^{1}, O_{i}^{2}, \dots, O_{i}^{d}])$ (2) where $O_{i}^{1}, O_{i}^{2}, \dots, O_{i}^{d}$ is the position vector of the i^th owl in d dimensions. The best owl, which receives the maximum intensity, is the one closest to the prey. The normalized intensity value of the i^th owl is used to update the positions of the owls and is calculated using the following equation. $I_{i} = \frac{f_{i} - w}{b - w}$ (3) where $b = \max_{k \in \{1, \dots, n\}} f_{k}$ and $w = \min_{k \in \{1, \dots, n\}} f_{k}$ are the maximum and minimum intensity values among the owls, respectively. The distance between each owl O_i and the prey is calculated as the Euclidean distance, where V denotes the location of the prey. V is taken from the owl with the highest fitness value, that is, the owl nearest to the prey. $R_{i} = \| O_{i} - V \|_{2}$ (4) Owls receive a change of intensity while moving towards the prey. The change in intensity for the i^th owl is calculated as follows: ${Ic}_{i} = \frac{I_{i}}{R_{i}^{2}} + N_{r}$ (5) where N_r is random noise that is used to make the model more realistic.

Finally, the new position of the owl O_i is updated by the following equation: $O_{i} (t + 1) = \begin{cases} O_{i} (t) + β \times {Ic}_{i} \times | α V - O_{i} (t) |, & \text{if } p_{vm} < 0.5 \\ O_{i} (t) - β \times {Ic}_{i} \times | α V - O_{i} (t) |, & \text{if } p_{vm} \geq 0.5 \end{cases}$ (6) where O_i (t + 1) indicates the solution at iteration t + 1. α is a uniformly distributed random number ∈[0, 0.5]. β is a user-defined parameter that decreases linearly from 1.9 to 0. The parameter p_vm is the probability of prey movement.
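To make Equations 3 to 6 concrete, the following is a minimal Python sketch of one continuous OSA update for a single owl. It is an illustration under stated assumptions, not the reference implementation of [9]: the function names are hypothetical, p_vm is drawn uniformly from [0, 1), and the two guards (equal best and worst fitness, zero distance to the prey) are assumptions, since the text does not say how these cases are handled.

```python
import math
import random


def normalized_intensity(fitness):
    """Eq. (3): scale fitness values to [0, 1] using the best (b) and worst (w)."""
    b, w = max(fitness), min(fitness)
    if b == w:
        # Assumption: all owls are equally fit, so avoid dividing by zero.
        return [1.0 for _ in fitness]
    return [(f - w) / (b - w) for f in fitness]


def euclidean_distance(owl, prey):
    """Eq. (4): Euclidean distance between an owl and the prey V."""
    return math.sqrt(sum((o - v) ** 2 for o, v in zip(owl, prey)))


def intensity_change(intensity, distance, noise=0.0):
    """Eq. (5): inverse-square change of intensity plus random noise N_r."""
    if distance == 0:
        # Assumption: the owl sits on the prey, so skip the inverse-square term.
        return intensity + noise
    return intensity / distance ** 2 + noise


def update_position(owl, prey, beta, ic, rng):
    """Eq. (6): step towards or away from the prey depending on p_vm."""
    p_vm = rng.random()              # assumed uniform in [0, 1)
    alpha = rng.uniform(0.0, 0.5)    # alpha in [0, 0.5], one draw per update
    sign = 1.0 if p_vm < 0.5 else -1.0
    return [o + sign * beta * ic * abs(alpha * v - o) for o, v in zip(owl, prey)]
```

Passing a seeded `random.Random` instance as `rng` keeps runs reproducible, which is convenient when comparing parameter settings.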

4 Proposed approach for feature subset selection

In this section, the proposed Modified Binary Owl Search Algorithm (MBOSA) for the feature selection problem is described in detail. Although the original Owl Search Algorithm (OSA) is effective in continuous optimization problems, it cannot be applied directly to the feature selection problem. Therefore, a binary representation of the Owl Search Algorithm (BOSA) was proposed earlier by the authors and is presented briefly in the following subsection. The self-adaptive strategy, elitism, and mutation, which are the key improvements to BOSA proposed in this work, are then described. Finally, the characteristics of the fitness function and the overall procedure of feature selection are presented.

4.1 Binary representation of OSA

Binary encoding is used for the feature selection problem because it is straightforward to implement. Assuming that $s_{i} = \{s_{i}^{1}, s_{i}^{2}, \dots, s_{i}^{d}\}$ is a feature set with a total of d features, the j^th feature is selected if the decision variable $s_{i}^{j} = 1$; otherwise $s_{i}^{j} = 0$ and the feature is not selected. A subset of features is thus represented by an owl in d-dimensional space, with each dimension having the value 0 or 1. The initial position of each owl is randomly generated, and the position of the i^th owl in the j^th dimension can be represented as follows: $O_{i}^{j} = U_{b} (0, 1)$ (7) where U_b (0, 1) is a binary random number, that is, a value drawn at random from {0, 1}.

As the positions of the owls are represented in a binary space, Hamming distance is used as a distance measure between the prey and the owl. The original equation of Euclidean distance in (4) is, therefore, replaced as follows. $R_{i} = \sum_{j = 1}^{d} | O_{i}^{j} - V^{j} |$ (8) where $O_{i}^{j}, V^{j} \in \{0, 1\}$.

To update the owls' positions in binary space, the original updating equation of OSA in Equation 6 is modified into the following equation. $Δ O_{i} (t + 1) = \begin{cases} O_{i} (t) + β \times {Ic}_{i} \times | α V - O_{i} (t) |, & \text{if } p_{vm} < 0.5 \\ O_{i} (t) - β \times {Ic}_{i} \times | α V - O_{i} (t) |, & \text{if } p_{vm} \geq 0.5 \end{cases}$ (9) where ΔO_i (t + 1) is the step vector of the i^th owl at iteration t + 1. From this step vector, the binary position vector O_i (t + 1) is calculated using a transfer function. In this paper, the sigmoid transfer function is adopted for this mapping. The sigmoid function is a common transfer function that produces probability values between 0 and 1 [20], and the mapping is as follows: $T (Δ O_{i}^{j} (t)) = \frac{1}{1 + e^{- Δ O_{i}^{j} (t)}}$ (10)

Finally, the current position of i^th owl at (t + 1) iteration in j^th dimension is updated according to the following equation based on the probability value of Equation 10. $O_{i}^{j} (t + 1) = \begin{cases} 1, & \text{if } B < T (Δ O_{i}^{j} (t + 1)) \\ 0, & \text{if } B \geq T (Δ O_{i}^{j} (t + 1)) \end{cases}$ (11) where B is a random number in [0,1].
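The binary-space pieces (Equations 8, 10, and 11) can be sketched in a few lines of Python. This is an illustrative sketch only; the function names are hypothetical, and the sign-dependent branch in `sigmoid` is an implementation choice made here to avoid floating-point overflow rather than something described in the text.

```python
import math
import random


def hamming_distance(owl, prey):
    """Eq. (8): number of positions in which the owl and the prey differ."""
    return sum(abs(o - v) for o, v in zip(owl, prey))


def sigmoid(x):
    """Eq. (10): S-shaped transfer function mapping a step value to [0, 1]."""
    # Branching on the sign keeps math.exp from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binarize(step, rng):
    """Eq. (11): bit j becomes 1 when a uniform draw B is below T(step_j)."""
    return [1 if rng.random() < sigmoid(s) else 0 for s in step]
```

A step value of 0 gives a probability of 0.5, so the corresponding bit is equally likely to be 0 or 1; large positive steps push the bit towards 1 and large negative steps towards 0.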

4.2 Self-adaptive strategy

The controlling parameter β strongly influences the performance of BOSA. It has to be initialized with a relatively large value at the beginning and is gradually decreased as the search progresses. In the proposed approach, the initial value of β is defined dynamically using the following equation.

$β_{0} = \frac{f_{e}}{(\sum_{i = 1}^{n} f_{i}) / n}$ (12) where β₀ is the initial value of β and n is the number of owls. The denominator of the fraction is the mean fitness of the owl population before the search begins. The numerator is the final expected fitness f_e, which is 1.0 (the maximum) according to the fitness function used in this research. Here β₀ > 0, and it depends mainly on the fitness values of the initial owl population.

In each iteration, the parameter β is exponentially decreased using the following mathematical model.

$β (t) = β_{0} e^{- 10 a \frac{t}{T}}$ (13) where β (t) is the value of β at iteration t, β₀ is the initial value, T is the total number of iterations, and a is a parameter that is initially set to 1. Parameter a plays a significant role because it controls β through the self-adaptive mechanism. During the search, the highest and the lowest fitness values of the owl population are stored in each iteration to check whether they change between successive iterations. If neither value changes from one iteration to the next, the search is assumed to have reached a local optimum. This is treated as a stuck condition, and the iteration number τ at which it occurs is recorded. At every occurrence of a stuck condition, τ is used to modify the parameter a, which controls the behavior of β (t). The parameter a is changed according to the following equation, where T is the total number of iterations.

$a (τ) = τ / T$ (14) If a is modified relatively early in the search, its value is small, which slows the decay of β and increases exploration. Conversely, a large value of a, which occurs when a is modified at an advanced stage of the search, increases exploitation. The shape of Equation 13 for different values of a is shown in Fig. 1.

Fig. 1

Shape of β during iteration with 3 different a values.
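The self-adaptive schedule of Equations 12 to 14 can be sketched as follows. This is a minimal illustration with hypothetical function names; comparing fitness values with exact equality in `is_stuck` is an assumption, since the text does not state whether a tolerance is used.

```python
import math


def initial_beta(fitness, expected_fitness=1.0):
    """Eq. (12): expected final fitness divided by the mean initial fitness."""
    return expected_fitness / (sum(fitness) / len(fitness))


def beta_at(t, total_iters, beta0, a=1.0):
    """Eq. (13): exponential decay of beta, shaped by the parameter a."""
    return beta0 * math.exp(-10.0 * a * t / total_iters)


def is_stuck(prev_best, prev_worst, best, worst):
    """Stuck condition: best and worst fitness both unchanged since the last iteration."""
    # Assumption: exact equality, with no tolerance.
    return best == prev_best and worst == prev_worst


def adapt_a(tau, total_iters):
    """Eq. (14): reset a to tau / T when a stuck condition occurs at iteration tau."""
    return tau / total_iters
```

For example, with T = 100 a stuck condition at τ = 20 gives a = 0.2, so β decays five times more slowly in the exponent than with the initial a = 1, which keeps the step sizes larger for longer.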

4.3 Elitism strategy

Elitism is a strategy where the best individual is selected for the next generation. Elitism strategy is incorporated in owl search optimization algorithm in order to speed up the convergence of search. In this strategy, the current fitness value of each owl is compared with its immediate earlier fitness value. Between them, the owl with the best fitness value is considered as the candidate owl for the next generation. Mathematically elitism strategy can be represented as follows:

$O_{i} (t) = \begin{cases} O_{i} (t), & \text{if } f (O_{i} (t)) \geq f (O_{i} (t - 1)) \\ O_{i} (t - 1), & \text{otherwise} \end{cases}$ (15) where f (O_i (t - 1)) and f (O_i (t)) are the fitness of O_i owl at (t - 1) ^th and t^th iteration, respectively.
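Equation 15 amounts to a per-owl comparison between the current and the previous position. A minimal Python sketch, with a hypothetical function name and plain lists standing in for the population, could look like this:

```python
def apply_elitism(prev_owls, prev_fitness, owls, fitness):
    """Eq. (15): each owl keeps whichever of its last two positions is fitter."""
    kept_owls, kept_fitness = [], []
    for old, f_old, new, f_new in zip(prev_owls, prev_fitness, owls, fitness):
        if f_new >= f_old:
            # The new position is at least as good, so it survives.
            kept_owls.append(list(new))
            kept_fitness.append(f_new)
        else:
            # Otherwise the owl falls back to its previous position.
            kept_owls.append(list(old))
            kept_fitness.append(f_old)
    return kept_owls, kept_fitness
```

Because ties are resolved in favor of the new position, an owl can still drift across a plateau of equal fitness instead of freezing in place.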

4.4 Mutation strategy

Because elitism strengthens exploitation, the solutions may become trapped in a local optimum, and diversification is then needed to move the search into a more promising area of the solution space. The mutation strategy is incorporated to perform this task. Usually, a mutation operator is used in an evolutionary algorithm to modify a single chromosome. In the proposed method, the mutation operation is performed when the minimum and maximum fitness values of the owl population do not change over two consecutive iterations. Uniform bit-flip mutation is used, in which each position value of each owl is flipped independently with a mutation probability mp_r [50]. This value is set to mp_r = 1/d, where d is the dimension of the owl. Overall, the mutation operation is performed using the following equation.

$O_{i}^{j} (t) = \begin{cases} 1 - O_{i}^{j} (t), & \text{if } R \leq {mp}_{r} \\ O_{i}^{j} (t), & \text{otherwise} \end{cases}$ (16) where R is a random number within [0, 1], and $O_{i}^{j}$ is the position of the i^th owl in the j^th dimension.
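The uniform bit-flip mutation of Equation 16 can be sketched as below. The function name and the optional `rate` argument are hypothetical conveniences; by default the sketch uses mp_r = 1/d as stated in the text.

```python
import random


def bit_flip_mutation(owl, rng, rate=None):
    """Eq. (16): flip each bit independently with probability mp_r (default 1/d)."""
    mp_r = 1.0 / len(owl) if rate is None else rate
    return [1 - bit if rng.random() <= mp_r else bit for bit in owl]
```

With mp_r = 1/d, one bit per owl is flipped on average, so a mutated owl usually stays close to its previous position while still being able to leave a local optimum.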

4.5 Fitness function for evaluation

The goodness of a feature subset is measured by a fitness function. In other words, the binary position of an owl at a particular iteration t is evaluated with this fitness function. The proposed approach uses a wrapper-based strategy in which classification accuracy is employed for evaluation. In addition, the number of selected features is taken into account. These two criteria are combined into a single objective that is used as the evaluation function. Classification accuracy is calculated by a Decision Tree (DT) with the Gini index as the splitting measure [51]. Optimizing the fitness function is a maximization problem in which a better fitness value corresponds to higher classification accuracy with fewer features. The fitness function is given as follows. It is noted that the maximum value of the objective function is 1.

$f (O_{i}) = ω \times A (O_{i}) + (1 - ω) \times (\frac{L_{T} - S_{T}}{L_{T}})$ (17) where f (O_i) is the fitness of the i^th feature subset O_i, A (O_i) is the classifier accuracy, S_T is the number of features in the subset O_i, and L_T is the total number of original features. Here ω is a weight that controls the trade-off between the two criteria. The value of ω can be between 0 and 1. In this research, it is set to 0.9 so that classification accuracy is given more importance than the cardinality of the feature subset.

In this wrapper approach, the classifier accuracy A (O_i) is measured using the following equation: $A (O_{i}) = \frac{TP + TN}{TP + TN + FP + FN}$ (18) where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. We used 10-fold cross-validation to ensure the stability of the obtained results. The training set with the selected features is used to train the DT classifier, while the testing set is used to evaluate the selected features. DT is used here to measure the classification accuracy because it is flexible, makes no assumptions concerning the actual data distribution, and does not require scaling of the data [52]. Besides, DT can handle non-linear relationships between features and classes [53].
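As a worked illustration of Equations 17 and 18, the sketch below computes the accuracy from confusion-matrix counts and combines it with the subset size. The function names are hypothetical, and the accuracy is passed in as a number rather than obtained from a trained Decision Tree, so that the example stays self-contained.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (18): fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def fitness(acc, n_selected, n_total, omega=0.9):
    """Eq. (17): weighted sum of accuracy and the fraction of features removed."""
    return omega * acc + (1.0 - omega) * (n_total - n_selected) / n_total


# Worked example: 85 of 100 samples classified correctly, 10 of 40 features kept.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)               # 0.85
score = fitness(acc, n_selected=10, n_total=40)         # 0.9 * 0.85 + 0.1 * 0.75
```

In this example the score is approximately 0.84. Under Equation 17, a value of exactly 1 would require perfect accuracy with no features selected, so 1 acts as an upper bound rather than a value a non-empty subset can reach.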

4.6 Full procedure of feature subset selection

In this subsection, the entire process of MBOSA for feature subset selection is described. The procedure is illustrated in Fig. 2 and Algorithm 1. This algorithm takes the training data with full features as input value and returns the best feature subset as an output. At first, the owl populations are randomly initialized, and mean fitness value is calculated. The parameter β is then defined based on this fitness value. Besides, other parameters such as a, the global best solution, the position of prey, the number of owls, and the number of iterations are also set.

Fig. 2

Structure of MBOSA.

After all parameters have been initialized, several steps are repeated until the stopping criterion (the number of iterations) is met. The elitism strategy is applied in each iteration so that the better owls carry over to the next iteration. After elitism, the fitness values are calculated, and the maximum and minimum fitness values are recorded. The value of β is normally decreased exponentially according to Equation 13. When both the maximum and the minimum fitness values of the owl population are the same in consecutive iterations, β is adjusted in a self-adaptive way, and the mutation strategy is applied to move the owls to another area of the search space. The self-adaptive mechanism changes the parameter a according to Equation 14, which in turn controls the value of β.

Once mutation has been performed, the minimum and maximum fitness values are measured again. The current best owl becomes the global best if its fitness value is greater than that of the global best. The global best solution is then assigned to the prey's position. The next step updates the position vector of each owl; the important part here is mapping the continuous values to binary ones using the sigmoid (S-shaped) transfer function. After the final iteration, the algorithm returns the global best owl as the best feature subset.
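The procedure described above can be condensed into the following Python sketch. It is a hedged reading of the prose, not a transcription of Algorithm 1: the function names are hypothetical, the noise term N_r of Equation 5 is omitted, the handling of a zero Hamming distance and of a zero mean fitness are assumptions, and the fitness function is supplied by the caller (in the paper it would wrap a Decision Tree, whereas any function of a bit list works here).

```python
import math
import random


def sigmoid(x):
    """Numerically stable form of Eq. (10)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mbosa_sketch(fitness_fn, d, n_owls=20, iters=100, seed=0):
    rng = random.Random(seed)
    owls = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_owls)]  # Eq. (7)
    fit = [fitness_fn(o) for o in owls]
    mean_fit = sum(fit) / n_owls
    beta0 = 1.0 / mean_fit if mean_fit > 0 else 1.0  # Eq. (12); fallback is an assumption
    a = 1.0
    prev_extremes = None
    best, best_fit = None, -math.inf

    for t in range(1, iters + 1):
        extremes = (max(fit), min(fit))
        if extremes == prev_extremes:                # stuck condition
            a = t / iters                            # Eq. (14)
            owls = [[1 - b if rng.random() <= 1.0 / d else b for b in o]
                    for o in owls]                   # Eq. (16)
            fit = [fitness_fn(o) for o in owls]
            extremes = (max(fit), min(fit))
        prev_extremes = extremes
        hi, lo = extremes
        if hi > best_fit:                            # update the global best
            best_fit, best = hi, list(owls[fit.index(hi)])
        prey = best
        beta = beta0 * math.exp(-10.0 * a * t / iters)   # Eq. (13)

        for i in range(n_owls):
            inten = (fit[i] - lo) / (hi - lo) if hi > lo else 1.0   # Eq. (3)
            r = sum(abs(x - v) for x, v in zip(owls[i], prey))      # Eq. (8)
            ic = inten / (r * r) if r > 0 else inten  # Eq. (5) without noise; r = 0 is an assumption
            sign = 1.0 if rng.random() < 0.5 else -1.0
            alpha = rng.uniform(0.0, 0.5)
            step = [x + sign * beta * ic * abs(alpha * v - x)
                    for x, v in zip(owls[i], prey)]                 # Eq. (9)
            cand = [1 if rng.random() < sigmoid(s) else 0 for s in step]  # Eqs. (10)-(11)
            cand_fit = fitness_fn(cand)
            if cand_fit >= fit[i]:                   # elitism, Eq. (15)
                owls[i], fit[i] = cand, cand_fit

    if max(fit) > best_fit:                          # account for the last update
        best_fit = max(fit)
        best = list(owls[fit.index(best_fit)])
    return best, best_fit
```

A quick way to exercise the sketch is a toy objective such as `lambda bits: 1.0 - sum(bits) / len(bits)`, which rewards smaller subsets; replacing it with a cross-validated classifier score turns the loop into a wrapper method.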

5 Simulation experiments

In this section, the description of the datasets used, design of the simulation experiments, and the settings of parameters for simulation experiments are presented.

5.1 Dataset description

To evaluate the performance of the proposed MBOSA, twenty benchmark datasets from the UCI repository are employed [54]. Table 1 shows the datasets used in this experiment. These datasets differ in the number of features, instances, and classes. Several of the datasets have more than 500 features. Since some datasets contain missing values, a preprocessing step is performed. For numeric features, a missing value is replaced with the median value of the feature. For categorical features, a missing value is replaced with the most frequent value.
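The imputation rule described above can be sketched with the Python standard library. The function name is hypothetical, missing entries are assumed to be encoded as `None`, and ties between equally frequent categories are broken by first occurrence, which is an assumption the text does not address.

```python
from collections import Counter
from statistics import median


def impute_column(values, numeric):
    """Replace None entries with the median (numeric) or the most frequent value."""
    observed = [v for v in values if v is not None]
    if numeric:
        fill = median(observed)
    else:
        # most_common(1) returns the first-encountered value among ties.
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```

For instance, `impute_column([1.0, None, 3.0, 10.0], numeric=True)` fills the gap with 3.0, the median of the observed values, which is less affected by the outlying 10.0 than the mean would be.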

Table 1
Datasets

Datasets	No. of Features	No. of Instances	No. of Classes
Arrhythmia	279	452	16
Breast-w	9	683	2
Clean1	166	476	2
Cnae	856	1080	9
Colon	2000	62	2
Dermatology	34	358	6
Hepatitis	19	155	2
Ionosphere	34	351	2
Libras-move	90	360	15
Lung-cancer	56	32	3
Mice-protein	82	1080	8
Micro-Mass	1300	571	20
Parkinsons	22	195	2
Pendigits	16	10992	10
Promoters	57	106	2
Qsar-biodeg	41	1055	2
Semeion	256	1593	10
Vehicle	18	846	4
Waveform	40	5000	3
Wisconsin	30	569	2

5.2 Comparative algorithms

To compare the proposed MBOSA with similar approaches, four state-of-the-art algorithms are selected for feature subset selection: Binary Particle Swarm Optimization (BPSO) [23], Binary Genetic Algorithm (BGA) [17], Binary Bat Algorithm (BBA) [55], and Binary Owl Search Algorithm (BOSA) [14]. In [14], the authors built the BOSA model with different transfer functions. To avoid bias from the transfer function, the same transfer function in Equation 10 is used here for BBA, BPSO, BOSA, and MBOSA. The parameters of the individual algorithms are shown in Table 2. These parameters are set according to reported works on feature selection and by trial and error on small simulations. All the algorithms use the same fitness function, defined in Equation 17. For each algorithm, the population size is set to 20, and the stopping criterion for each run is 100 iterations. Populations are initialized randomly in each algorithm.

Table 2
Parameter settings of algorithms

Algorithm	Parameter	Value
BBA	Number of bats	20
	Maximum number of iterations	100
	Frequency minimum, Q_min	0
	Frequency maximum, Q_max	2
	Loudness, A	0.5
	Pulse rate, r	0.5
BPSO	Number of particles	20
	Maximum number of iterations	100
	Inertia weight, w	0.1
	Acceleration coefficients, c₁ and c₂	2
BGA	Number of chromosomes	20
	Maximum number of iterations	100
	Crossover ratio	0.9
	Mutation rate	0.1
BOSA	Number of owls	20
	Maximum number of iterations	100
	Initial value of β	3
MBOSA	Number of owls	20
	Maximum number of iterations	100

5.3 Evaluation of the selected feature subset

The feature subsets selected by the different algorithms are evaluated by classification experiments using a Support Vector Machine (SVM) with a linear kernel [56]. SVM is used in this research because it is one of the most widely used learning algorithms and can classify both linear and non-linear data well. A 10-fold cross-validation strategy is used to evaluate the classification accuracy of each selected feature subset. After feature subset selection, the training samples restricted to the selected features are used to build the SVM model, and the test samples restricted to the same features are used to test the model and calculate the classification accuracy.
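The evaluation protocol described here can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: the feature mask is a hypothetical stand-in for a subset returned by a feature selection algorithm, the feature scaling step and the stratified, shuffled folds are assumptions (the text only states "10-fold cross-validation"), and scikit-learn's bundled breast cancer data is used because it has the same shape (30 features, 569 instances, 2 classes) as the Wisconsin entry in Table 1.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_subset(X, y, mask, n_splits=10, seed=0):
    """Mean and std of linear-SVM accuracy over k-fold CV on the selected columns."""
    # Scaling is an assumption added here so the linear SVM trains quickly.
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    # Stratified, shuffled folds are also an assumption of this sketch.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X[:, mask], y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()


X, y = load_breast_cancer(return_X_y=True)
# Hypothetical feature subset: keep every other feature (15 of 30).
mask = np.arange(X.shape[1]) % 2 == 0
mean_acc, std_acc = evaluate_subset(X, y, mask)
print(f"{mask.sum()} features: accuracy {100 * mean_acc:.2f} +/- {100 * std_acc:.2f}")
```

Because the model is refit inside each fold, the test fold never influences training, which mirrors the train/test separation described above.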

All algorithms are implemented in Python 3.7 and executed on a machine with an Intel Core i5-9500 CPU at 3.00 GHz, 8 GB RAM, and Windows 10. Each experiment is repeated twenty times to obtain statistically meaningful results. For each algorithm, the mean classification accuracy, mean number of selected features, and mean computational time are calculated, along with their standard deviations. The Wilcoxon signed-rank test [57] is used to check for a significant difference between MBOSA and each of the other algorithms. To further assess the statistical significance of the results, the Friedman test [58] and the Nemenyi post-hoc test [59] are carried out.

6 Simulation results and discussion

Table 3 shows the average classification accuracy (Avg.) and standard deviation (Std.) of BBA, BPSO, BGA, BOSA, and MBOSA on the twenty datasets. The best result is highlighted in bold text. In terms of classification accuracy, MBOSA performs best, achieving the highest accuracy on half of the datasets. The relatively low standard deviation of MBOSA's accuracy indicates little variation across runs. The second-best algorithm is BGA, which produces the best results for six datasets. BBA obtains better classification accuracy than MBOSA on five datasets (Cnae, Ionosphere, Libras-move, Lung-cancer, Parkinsons). BOSA produces better average classification accuracy than MBOSA on three datasets (Breast-w, Cnae, Waveform). Furthermore, MBOSA attains better classification accuracy than BPSO on all datasets except Hepatitis.

Table 3
Average classification accuracy and standard deviation (std.) of different methods

Dataset	BBA		BPSO		BGA		BOSA		MBOSA
	Avg.	Std.	Avg.	Std.	Avg.	Std.	Avg.	Std.	Avg.	Std.
Arrhythmia	63.180	1.826	64.451	1.333	64.161	1.520	63.379	2.029	66.609	1.495
Breast-w	96.247	0.229	96.275	0.269	96.246	0.300	96.320	0.290	96.305	0.257
Clean1	76.458	5.084	76.458	5.084	83.021	4.608	76.979	3.453	81.354	2.924
Cnae	89.361	1.360	86.611	1.518	91.148	0.581	90.139	1.443	88.907	1.364
Colon	83.143	2.939	84.000	1.239	84.119	2.606	83.548	2.037	85.405	1.384
Dermatology	96.742	1.074	96.659	1.169	96.766	0.976	96.737	0.751	97.300	0.348
Hepatitis	75.600	1.978	76.425	1.407	76.900	2.018	75.929	2.634	76.325	1.499
Ionosphere	85.009	2.253	83.471	2.575	82.642	3.217	82.021	2.649	84.888	2.475
Libras-move	67.361	2.797	63.472	5.821	68.056	5.399	63.889	5.072	66.806	3.730
Lung-cancer	53.619	10.266	45.333	8.166	46.417	8.430	49.853	12.144	51.500	5.361
MiceProtein	97.546	2.973	97.935	1.730	92.870	0.000	97.843	1.771	99.935	0.088
Micro-mass	76.000	1.309	77.739	3.360	78.462	3.548	76.696	2.041	78.696	1.890
Parkinsons	81.405	1.767	80.774	1.560	79.284	2.159	79.679	1.062	81.266	1.609
Pendigits	97.261	0.357	96.835	0.598	97.666	0.114	97.427	0.595	97.804	0.141
Promoters	68.882	3.637	69.618	4.545	68.464	4.344	67.145	4.169	74.345	4.304
Qsar-biodeg	81.756	1.655	81.831	1.411	82.770	1.295	82.429	1.188	82.658	1.105
Semeion	89.328	0.856	90.688	0.723	91.637	0.417	90.771	0.922	91.788	0.702
Vehicle	74.731	3.066	74.861	2.612	77.500	2.411	75.734	1.277	77.902	1.672
Waveform	85.138	0.576	84.956	0.433	85.726	0.700	85.604	0.547	85.312	0.792
Wisconsin	95.081	0.549	95.221	0.688	95.290	0.329	95.045	0.800	95.433	0.599

Table 4 shows the average number of selected features (the cardinality of the final feature subset) and the standard deviation for BBA, BPSO, BGA, BOSA, and MBOSA on the twenty datasets. The best result is highlighted in bold text. MBOSA selects the fewest features in almost all cases (18 out of 20, including a tie with BPSO on Breast-w). MBOSA outperforms BOSA in feature reduction on all datasets and BGA on all datasets except Pendigits, whereas BBA selects fewer features than MBOSA in only two cases (MiceProtein and Pendigits). BPSO provides better reduction than MBOSA only on the Pendigits dataset. Overall, BGA is much less successful than MBOSA at reducing the feature set size. For large datasets such as Cnae, Colon, Waveform, and Micro-mass, MBOSA notably outperforms the other methods in feature reduction.

Table 4

Average number of selected features and standard deviation (std.) of different methods

Dataset	BBA		BPSO		BGA		BOSA		MBOSA
	Avg.	Std.	Avg.	Std.	Avg.	Std.	Avg.	Std.	Avg.	Std.
Arrhythmia	110.40	6.10	100.50	6.33	133.30	5.66	128.10	17.97	97.60	7.17
Breast-w	5.00	0.67	4.80	0.79	4.90	0.57	5.10	0.57	4.80	0.63
Clean1	84.50	7.15	84.50	7.15	155.50	5.30	102.20	6.27	84.10	3.54
Cnae	521.10	70.23	441.60	15.95	533.80	49.59	561.00	11.76	424.20	37.12
Colon	1186.40	97.31	1002.50	23.15	1224.70	60.55	1191.10	97.41	910.10	81.83
Dermatology	17.70	1.42	18.50	2.46	22.90	2.18	21.90	3.78	14.50	2.12
Hepatitis	8.70	2.31	7.00	2.00	9.20	1.55	10.60	1.26	6.70	1.25
Ionosphere	19.70	6.60	16.50	1.72	20.20	3.08	20.00	2.58	15.50	2.42
Libras-move	46.80	5.49	46.00	5.79	83.90	2.81	58.10	6.92	43.80	3.94
Lung-cancer	26.00	5.79	27.80	2.94	35.90	3.25	30.90	5.43	22.10	3.03
MiceProtein	38.10	4.98	42.00	3.94	80.00	0.00	45.50	7.07	41.10	4.36
Micro-mass	724.90	87.30	659.60	24.38	859.70	22.63	779.70	92.18	638.30	8.12
Parkinsons	11.60	2.22	11.60	2.59	15.80	1.75	13.20	2.20	10.10	2.08
Pendigits	13.90	0.88	13.00	0.94	14.40	0.70	15.00	0.67	14.80	1.75
Promoters	29.10	2.96	28.80	3.16	36.30	3.37	33.80	5.09	25.20	2.57
Qsar-biodeg	23.20	5.57	22.80	2.66	26.80	3.39	24.80	2.86	19.60	3.31
Semeion	149.90	20.70	129.90	7.56	163.00	14.66	155.10	5.70	115.80	9.24
Vehicle	13.90	2.81	12.40	2.01	15.50	1.35	15.30	1.16	12.30	2.75
Waveform	26.50	4.22	21.70	2.75	27.20	2.30	23.10	3.38	19.00	1.56
Wisconsin	12.90	4.53	13.00	2.67	15.70	3.09	16.90	3.03	11.50	2.64

Fig. 3 compares the proposed MBOSA with the other approaches in terms of the average feature selection ratio, defined as the number of selected features divided by the total number of features and expressed as a percentage. MBOSA selects less than 60% of the features for 18 datasets, and for some datasets it selects less than 40%. MBOSA is also the most effective in feature reduction, obtaining the minimum feature selection ratio for almost all datasets.
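As a worked check of this definition, the ratio can be recomputed for MBOSA from the total feature counts in the dataset table and the MBOSA averages in Table 4. The snippet below is only an illustrative sketch; the dictionary and variable names are hypothetical, while the numbers are copied from the two tables.

```python
# Dataset name -> (total number of features, MBOSA average number of selected features).
MBOSA_SELECTION = {
    "Arrhythmia": (279, 97.60), "Breast-w": (9, 4.80), "Clean1": (166, 84.10),
    "Cnae": (856, 424.20), "Colon": (2000, 910.10), "Dermatology": (34, 14.50),
    "Hepatitis": (19, 6.70), "Ionosphere": (34, 15.50), "Libras-move": (90, 43.80),
    "Lung-cancer": (56, 22.10), "Mice-protein": (82, 41.10), "Micro-mass": (1300, 638.30),
    "Parkinsons": (22, 10.10), "Pendigits": (16, 14.80), "Promoters": (57, 25.20),
    "Qsar-biodeg": (41, 19.60), "Semeion": (256, 115.80), "Vehicle": (18, 12.30),
    "Waveform": (40, 19.00), "Wisconsin": (30, 11.50),
}


def selection_ratio(selected, total):
    """Feature selection ratio in percent."""
    return 100.0 * selected / total


ratios = {name: selection_ratio(sel, total) for name, (total, sel) in MBOSA_SELECTION.items()}
below_60 = sorted(name for name, r in ratios.items() if r < 60.0)
below_40 = sorted(name for name, r in ratios.items() if r < 40.0)

print(f"below 60%: {len(below_60)} datasets, below 40%: {len(below_40)} datasets")
for name, r in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:13s} {r:6.2f}%")
```

With these table values, only Pendigits and Vehicle stay above the 60% mark.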

Fig. 3

Average selection ratio of feature obtained with BBA, BPSO, BGA, BOSA, and MBOSA.

Fig. 4 displays the average computational time for all the feature selection techniques employed in this research. In general, datasets with a large number of features or instances take longer regardless of the approach. MBOSA takes less CPU time than BOSA for the majority of the datasets (15 out of 20). On the remaining five datasets, where MBOSA takes more computational time than BOSA, it selects a smaller number of features. Compared with all other approaches, the proposed approach takes less time on 60% of the datasets. Notably, in terms of CPU time, MBOSA performs better than the other algorithms on datasets with a large number of instances (Cnae, Semeion, Mice-protein, Waveform, Pendigits). MBOSA is also faster than BGA on 13 out of 20 datasets. However, MBOSA may not achieve the best CPU time on some high-dimensional datasets.

Fig. 4

Average computational time obtained with BBA, BPSO, BGA, BOSA, and MBOSA.

Fig. 5 presents the overall classification accuracy of each method, obtained by averaging the accuracy values over all datasets. The bar chart shows that MBOSA performs better than the other methods in terms of accuracy. Fig. 6 shows the mean feature selection ratio of each algorithm, averaged over the datasets. Together, the results show that MBOSA performs better than the others with respect to both feature reduction and classification accuracy.

Fig. 5

Overall comparison of methods in respect of average classification accuracy.

Fig. 6

Overall comparison of methods in respect of average feature selection ratio.

To determine whether the difference between the proposed MBOSA and each of the four other approaches is statistically significant, a two-sided Wilcoxon signed-rank test at a significance level of 0.05 has been performed. The null hypothesis is that there is no significant difference between the two approaches, whereas the alternative hypothesis states that the two approaches differ. When the p-value is lower than the significance level of 0.05, the null hypothesis is rejected in favor of the alternative. Table 5 shows the p-values of the Wilcoxon signed-rank test when MBOSA is compared against the other approaches in terms of classification accuracy and the number of selected features. Non-significant values are highlighted in bold. In addition, for each pairwise comparison, the number of datasets on which the proposed approach wins (W), draws (D), or loses (L) against the other approach is reported in the last row of the table.
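The per-dataset decision behind each win, draw, or loss can be sketched with SciPy's `wilcoxon`. This is an illustration under stated assumptions rather than the authors' script: the helper name `compare_runs` is hypothetical, the direction of a significant difference is decided here by comparing means (the text does not say how it was decided), and the two arrays are placeholder values standing in for the twenty paired runs of two algorithms on one dataset.

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_runs(proposed, other, alpha=0.05, higher_is_better=True):
    """Two-sided Wilcoxon signed-rank test on paired runs; returns (p-value, 'W'/'D'/'L')."""
    proposed = np.asarray(proposed, dtype=float)
    other = np.asarray(other, dtype=float)
    p_value = wilcoxon(proposed, other, alternative="two-sided").pvalue
    if p_value >= alpha:
        return p_value, "D"  # null hypothesis not rejected: counted as a draw
    # Assumption of this sketch: the sign of the mean difference decides win or loss.
    proposed_better = (proposed.mean() > other.mean()) == higher_is_better
    return p_value, "W" if proposed_better else "L"


# Placeholder accuracies for 20 paired runs (not values from the paper).
other_acc = np.linspace(62.0, 65.0, 20)
proposed_acc = other_acc + np.linspace(0.5, 2.5, 20)

p_value, outcome = compare_runs(proposed_acc, other_acc)
print(f"p = {p_value:.2g}, outcome = {outcome}")
```

For the number of selected features, where smaller is better, the same helper would be called with `higher_is_better=False`.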

Table 5

p-values of the Wilcoxon signed-rank test computed between MBOSA and the other approaches for classification accuracy and number of selected features (p ≥ 0.05 are bold).

Dataset	Classification accuracy				Feature Selection
	BBA	BPSO	BGA	BOSA	BBA	BPSO	BGA	BOSA
Arrhythmia	<0.05	<0.05	<0.05	<0.05	<0.05	0.36	<0.05	<0.05
Breast-w	0.75	0.68	0.55	0.76	0.41	0.74	0.56	0.26
Clean1	<0.05	<0.05	0.14	<0.05	0.84	0.84	<0.05	<0.05
Cnae	0.65	<0.05	<0.05	0.11	<0.05	0.15	<0.05	<0.05
Colon	<0.05	<0.05	0.26	<0.05	<0.05	<0.05	<0.05	<0.05
Dermatology	0.17	0.15	0.11	0.08	<0.05	<0.05	<0.05	<0.05
Hepatitis	0.39	0.95	0.59	0.96	<0.05	0.68	<0.05	<0.05
Ionosphere	0.86	0.26	<0.05	<0.05	0.08	0.32	<0.05	<0.05
Libras-move	0.48	<0.05	0.21	<0.05	0.26	0.48	<0.05	<0.05
Lung-cancer	0.48	<0.05	0.12	0.44	<0.05	<0.05	<0.05	<0.05
MiceProtein	<0.05	<0.05	<0.05	<0.05	0.14	0.68	<0.05	0.09
Micro-mass	<0.05	0.72	0.96	<0.05	<0.05	<0.05	<0.05	<0.05
Parkinsons	0.58	0.31	<0.05	<0.05	0.19	0.07	<0.05	<0.05
Pendigits	<0.05	<0.05	<0.05	0.07	0.13	<0.05	0.51	0.89
Promoters	<0.05	<0.05	<0.05	<0.05	<0.05	<0.05	<0.05	<0.05
Qsar-biodeg	0.24	0.06	0.96	0.65	0.20	<0.05	<0.05	<0.05
Semeion	<0.05	<0.05	0.80	<0.05	<0.05	<0.05	<0.05	<0.05
Vehicle	<0.05	<0.05	0.96	<0.05	0.26	0.72	<0.05	<0.05
Waveform	0.68	0.15	0.06	0.31	<0.05	<0.05	<0.05	<0.05
Wisconsin	0.24	0.33	0.37	0.31	0.51	0.29	<0.05	<0.05
(W)-(D)-(L)	(9)-(11)-(0)	(11)-(9)-(0)	(6)-(13)-(1)	(11)-(9)-(0)	(10)-(10)-(0)	(9)-(11)-(0)	(18)-(2)-(0)	(17)-(3)-(0)

Results indicate that MBOSA shows significantly better classification accuracy than BOSA and BPSO for the majority of the datasets (wins on 11 datasets each). The difference between MBOSA and BBA is significant in 9 out of 20 datasets. Compared to BGA, however, MBOSA shows significantly better classification accuracy in only 6 out of 20 datasets and is significantly inferior in 1 dataset, while the results are not significant for the remaining 13 datasets.

According to the p-values in Table 5 for the number of selected features, the most pronounced difference is between MBOSA and BGA: significant results are obtained for almost all datasets, with only two non-significant cases. Similarly, a significant difference is found between MBOSA and BOSA for 17 out of 20 datasets. MBOSA also wins over BBA on 50% of the datasets in terms of the number of selected features, and over BPSO on nine datasets. MBOSA selects significantly fewer features than BGA while producing competitive classification accuracy. It can be concluded that the proposed approach has a good ability to reduce the number of features while maintaining reasonable classification accuracy.

To evaluate whether the differences among the algorithms in classification accuracy and in the number of selected features are significant, we use the Friedman test and the Nemenyi post-hoc procedure at a significance level of α = 0.05. The Friedman test shows that statistically significant differences exist among the algorithms in terms of both classification accuracy and feature selection results (i.e., p < 0.05 in both cases). Table 6 shows the Friedman mean ranks of the approaches; MBOSA obtains the best mean rank in both accuracy and feature selection. The pairwise comparison of algorithms with the Nemenyi test is shown in Fig. 7. A significant difference between two algorithms exists when the distance between their average ranks exceeds the critical distance (CD). Algorithms that are not significantly different are connected by a straight line. Fig. 7a shows a significant difference between MBOSA and each of BBA, BOSA, and BPSO in terms of accuracy. In Fig. 7b, a significant difference between MBOSA and each of BBA, BOSA, and BGA is seen when the number of selected features is considered.

Table 6

Friedman mean ranks for data sets.

Algorithm	mean ranks (Accuracy)	mean ranks (Feature selection)
BBA	3.675	2.750
BPSO	3.625	2.175
BGA	2.500	4.650
BOSA	3.450	4.200
MBOSA	1.750	1.225
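The critical distance is not stated numerically in the text. As a sketch, it can be computed from the standard Nemenyi formula CD = q_α · sqrt(k(k + 1) / (6N)) with k = 5 algorithms and N = 20 datasets, assuming the commonly tabulated critical value q_0.05 = 2.728 for five algorithms. The mean ranks are taken from Table 6; the function and variable names are hypothetical.

```python
import math

Q_ALPHA_005_K5 = 2.728  # assumed tabulated Nemenyi critical value for k = 5, alpha = 0.05


def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical distance for comparing k algorithms over n_datasets datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


ACCURACY_RANKS = {"BBA": 3.675, "BPSO": 3.625, "BGA": 2.500, "BOSA": 3.450, "MBOSA": 1.750}
FEATURE_RANKS = {"BBA": 2.750, "BPSO": 2.175, "BGA": 4.650, "BOSA": 4.200, "MBOSA": 1.225}


def significant_vs(reference, ranks, cd):
    """Algorithms whose mean rank differs from the reference by more than cd."""
    return sorted(
        name for name, rank in ranks.items()
        if name != reference and abs(rank - ranks[reference]) > cd
    )


cd = nemenyi_cd(k=5, n_datasets=20, q_alpha=Q_ALPHA_005_K5)
print(f"CD = {cd:.3f}")
print("accuracy:", significant_vs("MBOSA", ACCURACY_RANKS, cd))
print("features:", significant_vs("MBOSA", FEATURE_RANKS, cd))
```

Under these assumptions the critical distance is about 1.364, and the pairs flagged as significant agree with the differences reported for Fig. 7a and Fig. 7b.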

Fig. 7

Nemenyi’s post-hoc test.

6.1 Discussions

Based on the above empirical study on benchmark datasets, for high-dimensional data MBOSA yields fewer features with relatively better classification accuracy, though in some cases it takes longer. BGA produces better classification accuracy with less computational time on high-dimensional datasets, but it cannot reduce the number of features substantially. MBOSA performs better than BOSA on all three metrics: accuracy, feature reduction, and CPU time. Based on these metrics, the nearest competitors of MBOSA are BPSO and BBA.

Overall, with a few exceptions, the proposed feature selection approach is superior to the other approaches in reducing the number of features and improving classification accuracy. This is attributed to the mutation, self-adaptation, and elitism strategies of MBOSA, which bring several benefits. Elitism carries the better candidate solutions over to the next generation to improve the convergence of the search. If premature convergence is detected during the search, the mutation strategy is applied to move solutions away from the local optimum. Along with the mutation strategy, the parameter β is adjusted self-adaptively, which also helps to avoid early convergence. Another merit is that β is initialized dynamically before the iterations begin, which avoids the difficult manual parameter tuning required by many meta-heuristic approaches. Therefore, the proposed approach is able to find a near-optimal solution by balancing exploration and exploitation.
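The interplay of elitism and mutation can be illustrated with a generic sketch. It is not the MBOSA update rule: the function names, the bit-flip mutation operator, the externally supplied stagnation flag, and the toy fitness (number of selected features, lower is better) are all assumptions made for illustration only.

```python
import random


def bit_flip(solution, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in solution]


def next_generation(old_pop, new_pop, fitness, n_elite, stagnated, p_mut, rng):
    """Combine elitism with stagnation-triggered mutation (lower fitness is better)."""
    # Elitism: remember copies of the best solutions of the previous generation.
    elites = [list(s) for s in sorted(old_pop, key=fitness)[:n_elite]]
    # Mutation: diversify the new candidates only when stagnation was flagged.
    if stagnated:
        new_pop = [bit_flip(s, p_mut, rng) for s in new_pop]
    # Elites replace the worst new candidates, so the population size is unchanged.
    survivors = sorted(new_pop, key=fitness)[:len(new_pop) - n_elite]
    return elites + survivors


rng = random.Random(42)
old_pop = [[rng.randint(0, 1) for _ in range(10)] for _ in range(6)]
new_pop = [[rng.randint(0, 1) for _ in range(10)] for _ in range(6)]
toy_fitness = sum  # hypothetical fitness: fewer selected features is better

result = next_generation(old_pop, new_pop, toy_fitness,
                         n_elite=2, stagnated=True, p_mut=0.1, rng=rng)
print(min(map(toy_fitness, old_pop)), min(map(toy_fitness, result)))
```

Because the elites are copied unchanged, the best fitness in the returned population can never be worse than the best fitness of the previous generation, even when mutation disrupts the new candidates.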

7 Conclusions

In this work, an improved version of the Binary Owl Search Algorithm, named MBOSA, has been developed for the feature subset selection problem in pattern classification. To avoid premature convergence and improve the search direction toward the global best solution(s), three main strategies, namely self-adaptive parameter tuning, mutation, and elitism, are incorporated into BOSA. The proposed approach is evaluated on 20 benchmark datasets from UCI, and its performance is compared with that of other popular metaheuristic algorithms, namely BBA, BPSO, BGA, and BOSA. Three evaluation criteria, namely classification accuracy, number of selected features, and computational time, are considered for comparison among the methods. For all the algorithms, the DT classifier serves as part of the fitness function during feature subset selection, and SVM is used for the final classification accuracy.

The experimental results show that, compared to the other approaches, MBOSA produces the best classification accuracy for 50% of the datasets, in most cases with small standard deviations. MBOSA also selects the fewest features in most of the datasets (18 out of 20), and its feature selection ratio is less than 60% in most of the datasets. Although MBOSA has significantly better classification accuracy than BGA in only six cases, it achieves significantly better feature reduction than BGA in almost all cases. In feature selection, MBOSA also performs significantly better than BOSA in most cases. Considering both the number of selected features and classification accuracy, MBOSA shows significantly better performance than BBA, BPSO, and BGA on roughly half of the datasets. In terms of CPU time, MBOSA also outperforms the other methods for the majority of the datasets, especially those with a large number of instances. In general, the proposed approach can reduce the number of features while providing decent classification accuracy, and thus it can be considered a powerful feature subset selection tool.

In the future, MBOSA can be extended by using other classifiers in the wrapper, including Naive Bayes (NB), K-nearest neighbor (KNN), and SVM. The proposed approach can also be applied to larger-scale datasets from real-world applications, such as text mining and feature selection in genomic microarray data.

References

1. [...], Li and Liu, Recent advances in feature selection and its applications, Knowledge and Information Systems 53 (2017), 551–577.

2. Goswami, Chakrabarti and Chakraborty, An efficient feature selection technique for clustering based on a new measure of feature importance, Journal of Intelligent and Fuzzy Systems 32(6) (2017), 3847–3858.

3. Chandrashekar and Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40(1) (2014), 16–28.

4. [...] J.G. and Brodley C.E., Feature selection for unsupervised learning, Journal of Machine Learning Research 5 (2004), 845–889.

5. Anand, Devaraj and Kannapiran, A novel intrusion detection system for wireless mesh network with hybrid feature selection technique based on ga and mi, Journal of Intelligent & Fuzzy Systems 34(3) (2018), 1243–1250.

6. Zhang and Sun, Feature selection using tabu search method, Pattern Recognition 35(3) (2002), 701–711.

7. Lin S.W., Lee Z.J., Chen S.C. and Tseng T.Y., Parameter determination of support vector machine and feature selection using simulated annealing approach, Applied Soft Computing 8(4) (2008), 1505–1512.

8. Lin S.W., Ying K.C., Chen S.C. and Lee Z.J., Particle swarm optimization for parameter determination and feature selection of support vector machines, Expert Systems with Applications 35(4) (2008), 1817–1824.

9. Jain, Maurya, Rani and Singh, Owl search algorithm: A novel nature-inspired heuristic paradigm for global optimization, Journal of Intelligent and Fuzzy Systems 34(3) (2018), 1573–1582.

10. Farhan A.F., Feilat E.A. and Al-Salaymeh A.S., Maximum Power Point Tracking Technique Using Combined Perturb & Observe and Owl Search Algorithms, Proceedings of the International Conference on Electrical and Computing Technologies and Applications (ICECTA), (2019), 1–5.

11. [...], Deng, Su and Song, Providing a guaranteed power for the BTS in telecom tower based on improved balanced owl search algorithm, Energy Reports 6 (2020), 297–307.

12. Cao, Wang, Jermsittiparsert and Shafiee, A new optimized configuration for capacity and operation improvement of CCHP system based on developed owl search algorithm, Energy Reports 6 (2020), 315–324.

13. El-Ashmawi W.H., Abd Elminaam D.S., Nabil A.M. and Eldesouky, A chaotic owl search algorithm based bilateral negotiation model, Ain Shams Engineering Journal (2020).

14. Mandal A.K., Sen and Chakraborty, Binary owl search algorithm for feature subset selection, Proceedings of the 10th International Conference on Awareness Science and Technology (iCAST), (2019), 1–6.

15. Chakraborty, Genetic algorithm with fuzzy operators for feature subset selection, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 85(9) (2002), 2089–2092.

16. Chakraborty and Chakraborty, Multi-objective optimization using pareto ga for gene-selection from microarray data for disease classification, in 2013 IEEE International Conference on Systems, Man, and Cybernetics, (2013), 2629–2634.

17. Guha, Ghosh, Kapri, Shaw, Mutsuddi, Bhateja and Sarkar, Deluge based Genetic Algorithm for feature selection, Evolutionary Intelligence (2019), 1–11.

18. Dong, Li, Ding and Sun, A novel hybrid genetic algorithm with granular information for feature selection and optimization, Applied Soft Computing 65 (2018), 33–46.

19. Shukla A.K., Singh and Vardhan, A hybrid framework for optimal feature subset selection, Journal of Intelligent & Fuzzy Systems 36(3) (2019), 2247–2259.

20. Kennedy and Eberhart R.C., A discrete binary version of the particle swarm algorithm, in IEEE International Conference on Systems, Man and Cybernetics, Computational Cybernetics and Simulation (1997), 4104–4108.

21. Banka and Dara, A hamming distance based binary particle swarm optimization (hdbpso) algorithm for high dimensional feature selection, classification and validation, Pattern Recognition Letters 52 (2015), 94–100.

22. Chakraborty, Feature subset selection by particle swarm optimization with fuzzy fitness function, 3rd International Conference on Intelligent System and Knowledge Engineering, (2008), 1038–1042.

23. Xue, Zhang and Browne W.N., Particle swarm optimization for feature selection in classification: A multi-objective approach, IEEE Transactions on Cybernetics 43 (2013), 1656–1671.

24. [...], Liang, Ye and Cao, Improved particle swarm optimization algorithm and its application in text feature selection, Applied Soft Computing 35 (2015), 629–636.

25. Ghamisi and Benediktsson J.A., Feature selection based on hybridization of genetic algorithm and particle swarm optimization, IEEE Geoscience and Remote Sensing Letters 12 (2015), 309–313.

26. Aghdam M.H., Ghasem-Aghaee and Basiri M.E., Text feature selection using ant colony optimization, Expert Systems with Applications 36(3) (2009), 6843–6853.

27. Manoj R.J., Praveena M.A. and Vijayakumar, An aco–ann based feature selection algorithm for big data, Cluster Computing 22(2) (2019), 3953–3960.

28. Tabakhi and Moradi, Relevance redundancy feature selection based on ant colony optimization, Pattern Recognition 48(9) (2015), 2798–2811.

29. Moayedikia, Ong K.-L., Boo Y.L., Yeoh W.G. and Jensen, Feature selection for high dimensional imbalanced class data using harmony search, Engineering Applications of Artificial Intelligence 57 (2017), 38–49.

30. Dash, An adaptive harmony search approach for gene selection and classification of high dimensional medical data, Journal of King Saud University - Computer and Information Sciences (2018).

31. Zheng, Diao and Shen, Self-adjusting harmony search-based feature selection, Soft Computing 19(6) (2015), 1567–1579.

32. Hancer, Xue, Karaboga and Zhang, A binary abc algorithm based on advanced similarity scheme for feature selection, Applied Soft Computing 36 (2015), 334–348.

33. Hancer, Xue, Zhang, Karaboga and Akay, A multi-objective artificial bee colony approach to feature selection using fuzzy mutual information, in 2015 IEEE Congress on Evolutionary Computation (CEC), (2015), 2420–2427.

34. Aziz M.A.E. and Hassanien A.E., Modified cuckoo search algorithm with rough sets for feature selection, Neural Computing and Applications 29(4) (2018), 925–934.

35. Han, Chang, Quan, Xiong, Li, Zhang and Liu, Feature subset selection by gravitational search algorithm optimization, Information Sciences 281 (2014), 128–146. Multimedia Modeling.

36. Xiaobing, Xianrui and Hong, An improved gravitational search algorithm for global optimization, Journal of Intelligent & Fuzzy Systems 37 (2019), 1–9.

37. Rodrigues, Pereira L.A.M., Nakamura R.Y.M., Costa K.A.P., Yang, Souza A.N. and Papa J.P., A wrapper approach for feature selection based on Bat Algorithm and Optimum-Path Forest, Expert Systems with Applications 41(5) (2014), 2250–2258.

38. Tawhid M.A. and Dsouza K.B., Hybrid Binary Bat Enhanced Particle Swarm Optimization Algorithm for solving feature selection problems, Applied Computing and Informatics (2018).

39. Mafarja, Aljarah, Faris, Hammouri A.I., Al-Zoubi A.M. and Mirjalili, Binary grasshopper optimisation algorithm approaches for feature selection problems, Expert Systems with Applications 117 (2019), 267–286.

40. Faris, Heidari A.A., Al-Zoubi A.M., Mafarja, Aljarah, Eshtay and Mirjalili, Time-varying hierarchical chains of salps with random weight networks for feature selection, Expert Systems with Applications 140 (2020), 112898.

41. Sayed G.I., Khoriba and Haggag M.H., A novel chaotic salp swarm algorithm for global optimization and feature selection, Applied Intelligence 48(10) (2018), 3462–3481.

42. Sayed G.I., Hassanien A.E. and Azar A.T., Feature selection via a novel chaotic crow search algorithm, Neural Computing and Applications 31(1) (2019), 171–188.

43. Sayed G.I., Tharwat and Hassanien A.E., Chaotic dragonfly algorithm: an improved metaheuristic algorithm for feature selection, Applied Intelligence 49(1) (2019), 188–205.

44. Ewees A.A., El Aziz M.A. and Hassanien A.E., Chaotic multi-verse optimizer-based feature selection, Neural Computing and Applications (2017), 1–16.

45. Al-Tashi, Kadir S.J.A., Rais H.M., Mirjalili and Alhussian, Binary optimization using hybrid grey wolf optimization for feature selection, IEEE Access 7 (2019), 39496–39508.

46. Mafarja M.M. and Mirjalili, Hybrid Whale Optimization Algorithm with simulated annealing for feature selection, Neurocomputing 260 (2017), 302–312.

47. Allam and Nandhini, Optimal feature selection using binary teaching learning based optimization algorithm, Journal of King Saud University - Computer and Information Sciences (2018).

48. Arora and Anand, Binary butterfly optimization approaches for feature selection, Expert Systems with Applications 116 (2019), 147–160.

49. Wolpert D.H. and Macready W.G., et al., No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1(1) (1997), 67–82.

50. Chicano, Sutton A.M., Whitley L.D. and Alba, Fitness probability distribution of bit-flip mutation, Evolutionary Computation 23(2) (2015), 217–248.

51. Myles A.J., Feudale R.N., Liu, Woody N.A. and Brown S.D., An introduction to decision tree modeling, Journal of Chemometrics: A Journal of the Chemometrics Society 18(6) (2004), 275–285.

52. Mahesh and Mather P.M., An assessment of the effectiveness of decision tree methods for land cover classification, Remote Sensing of Environment 86(4) (2003), 554–565.

53. Friedl M.A. and Brodley C.E., Decision tree classification of land cover from remotely sensed data, Remote Sensing of Environment 23 (1997), 399–409.

54. Dua and Graff, UCI machine learning repository, (2017).

55. Mirjalili, Mirjalili S.M. and Yang X.S., Binary bat algorithm, Neural Computing and Applications 25(3-4) (2014), 663–681.

56. Schölkopf, Smola A.J. and Bach, Learning with kernels: support vector machines, regularization, optimization, and beyond, The MIT Press, (2018).

57. Wilcoxon, Individual comparisons by ranking methods, Biometrics 1(6) (1945), 80–83.

58. Friedman, A comparison of alternative tests of significance for the problem of m rankings, The Annals of Mathematical Statistics 11(1) (1940), 86–92.

59. Nemenyi, Distribution-Free Multiple Comparison, PhD Thesis, Princeton University, (1963).