Instance-based classification with Ant Colony Optimization

Abstract

Instance-based learning (IBL) methods predict the class label of a new instance based directly on the distance between the new unlabeled instance and each labeled instance in the training set, without constructing a classification model in the training phase. In this paper, we introduce a novel class-based feature weighting technique, in the context of instance-based distance methods, using the Ant Colony Optimization meta-heuristic. We address three different approaches of instance-based classification: $k$ -Nearest Neighbours, distance-based Nearest Neighbours, and Gaussian Kernel Estimator. We present a multi-archive adaptation of the ACO ${}_{\mathbb{R}}$ algorithm and apply it to the optimization of the key parameter in each IBL algorithm and of the class-based feature weights. We also propose an ensemble of classifiers approach that makes use of the archived populations of the ACO ${}_{\mathbb{R}}$ algorithm. We empirically evaluate the performance of our proposed algorithms on 36 benchmark datasets, and compare them with conventional instance-based classification algorithms, using various parameter settings, as well as with a state-of-the-art coevolutionary algorithm for instance selection and feature weighting for Nearest Neighbours classifiers.

Keywords

Machine learning, instance-based learning, lazy classifiers, Swarm Intelligence, Ant Colony Optimization

1. Introduction

Instance-based learning (IBL) is a widely used classification approach where the class of an unlabeled instance is predicted directly based on the labels of training set instances, without constructing a model in a training phase [4, 3]. The $k$ -Nearest Neighbours ( $k$ -NN) algorithm is perhaps the most well-known IBL algorithm [77], and one of the most relevant algorithms in data mining [33, 73]. In the $k$ -NN algorithm, the class of a new instance is determined based on the classes of the $k$ nearest instances in the training set to this new instance, where $k$ is a user-supplied parameter. However, $k$ -NN is just one member in the broad family of IBL algorithms. Distance-based Nearest Neighbours ( $d$ -NN) and Gaussian Kernel Estimator (GKE) are other IBL classification algorithms that are widely-used in machine learning [10, 6]. In the $d$ -NN algorithm, only instances of the training set within a certain distance $d$ to the new instance are used to determine its class; while in the GKE algorithm, each instance in the training set contributes in deciding the class of a given new instance according to a Gaussian kernel function.

IBL classification is very sensitive to the quality of the underlying dataset used in the class prediction process. Hence, several data preparation tasks, such as feature selection, feature weighting, instance selection, and instance weighting are often used to improve the effectiveness of an IBL algorithm. This opens the door for various optimization techniques to be effectively applied to perform these tasks to improve IBL.

The Ant Colony Optimization (ACO) meta-heuristic, introduced by Dorigo et al. [23, 22], deals with artificial systems that are inspired by the foraging behaviour of real ants. ACO has been successfully applied to solve a broad range of combinatorial optimization problems [12, 24, 25], and has been applied to classification using a variety of different types of classification models, including classification rules [53, 46, 62, 63, 50, 49], decision trees [13, 51], and various types of Bayesian network classifiers [66, 67, 64, 65]. ACO has also recently been applied to instance and feature selection [7, 8, 61], but to the best of our knowledge has not been previously applied to feature weighting. In addition, ACO ${}_{\mathbb{R}}$ has recently been introduced as an Ant Colony-based algorithm for continuous optimization [70, 41], and has been applied to neural network training [69, 60, 2, 1]. However, to the best of our knowledge, the ACO ${}_{\mathbb{R}}$ algorithm has not been previously applied in the context of instance-based classification prior to this work. For a comprehensive review of ACO algorithms in data mining, the reader is directed to [45].

In this paper, we introduce the use of the ACO ${}_{\mathbb{R}}$ algorithm to perform feature weighting and parameter optimization for three different algorithms in the context of instance-based learning. More precisely, the contributions of this work can be defined as follows:

1.
We propose a new class-based feature weighting scheme, in which several feature weight sets are optimized, one for each available class value. Such a weighting scheme would improve the effectiveness of an IBL classification algorithm in terms of predictive accuracy, and provide the user with more comprehensible knowledge regarding the relative importance of each feature to each class value.
2.
We propose a multi-archive adaptation of ACO ${}_{\mathbb{R}}$ , and use it to optimize the class-based feature weights in the context of three different IBL classification algorithms, namely $k$ -nearest neighbours ( $k$ -NN), distance-based nearest neighbours ( $d$ -NN) and Gaussian kernel estimator (GKE). ACO ${}_{\mathbb{R}}$ is also used to optimize the key parameter for each algorithm – i.e., the number of nearest neighbours ( $k$ ), the maximum distance of the instance neighbourhood ( $d$ ), and the spread (smoothing) parameter $\sigma$ in the GKE.
3.
We propose creating an ensemble of classifiers from the population of the ant colony, rather than selecting the single best solution as the final classifier, in order to improve predictive accuracy on the test set and to reduce classifier over-fitting.

We empirically evaluated the performance of our proposed ACO-based algorithms on 36 benchmark UCI classification datasets [9] and compared them to the conventional version of these IBL algorithms – using a variety of parameter settings – as our evaluation baseline. In addition, we compared our proposed algorithms with a state-of-the-art coevolutionary algorithm for instance selection and feature weighting in the nearest neighbours classifier, namely CIW-NN [21]. The non-parametric Wilcoxon signed-ranks test indicates that our ensemble-based ACO- $k$ NN algorithm is statistically significantly better than the state-of-the-art CIW-NN algorithm, when the two algorithms are given comparable CPU resources.

The rest of the paper is structured as follows. An overview of instance-based learning concepts, algorithms and common improvement techniques is given in the following section. In Section 3, we provide background on the existing evolutionary algorithms in the literature used to improve IBL. In Section 4, we review the ACO ${}_{\mathbb{R}}$ algorithm. The novel class-based feature weighting is introduced in Section 5. In Section 6, we present our proposed multi-archive adaptation of ACO ${}_{\mathbb{R}}$ and its use in optimizing the $k$ -NN, $d$ -NN, and GKE algorithms. The ensemble of classifiers, based on the ACO population, is discussed in Section 7. Our experimental methodology is described in Section 8, followed by the computational results and related discussions in Section 9. Finally, we conclude with general remarks and future research directions in Section 10.

2. Instance-based classification overview

Most classification techniques perform eager learning [77, 33]. That is, the learning algorithm discovers classification patterns from a training set, and represents them in a model constructed by the algorithm during a training phase. Later, this constructed model is used to predict the class of a new instance, without the need to go back to the original training data. However, in lazy learning, all the work is done when the time comes to classify a new instance, and model construction is not necessary. A lazy classifier predicts the class of a new instance directly based on the training set, with respect to the similarity (or distance) between the new instance and each instance in the database (training set) of the domain instances. Such a technique is also known as lazy classification, case-based learning, memory-based learning, and instance-based learning [4, 3].

Instance-based learning’s popularity in the field of data mining is due to two main advantages. First, IBL is a non-parametric method, that is, the only assumption that is made about the domain data is that: similar inputs (features) have similar outputs (classes). Second, IBL methods are incremental learning methods, that is, a new set of instances can be easily accommodated in the domain’s database to be used for further prediction tasks, without having to rebuild a new classification model, as in eager learning methods. Therefore, IBL algorithms, in general, are simple to apply and perform well in a wide range of application domains.

As a start, let us establish some notation that is going to be used throughout the paper. Let $\mathcal{T}_{r}$ be the training set, which includes $N$ data instances with $M$ input features. $\mathcal{T}_{r}$ is the database used by the lazy classifier to find the similar instances (neighbours) to any given new instance. ${\bf x}_{i}^{l}=(x_{i,1},x_{i,2},...,x_{i,M})$ is the $i$ -th labeled instance in $\mathcal{T}_{r}$ , where $x_{i,j}$ is the value of the $j$ -th feature ( $f_{j}$ ) in the $i$ -th instance ${\bf x}_{i}$ , and $C({\bf x}_{i})=l$ is the class label of the instance ${\bf x}_{i}$ . The class label $l\in C$ , where $C$ is the set of the available values in the target class domain. A new (unlabeled) instance to be classified is denoted as ${\bf x}_{s}$ .

Now, in order to implement a lazy classification algorithm, the following elements of the algorithm should be defined:

•
First, a proximity measure should be specified. For any two given instances, ${\bf x}_{1}$ and ${\bf x}_{2}$ , a proximity measure specifies how similar (or dissimilar) ${\bf x}_{1}$ and ${\bf x}_{2}$ are, with respect to their input feature values. The most well-known dissimilarity measure for IBL is Euclidean distance, which we generalize as follows to apply to both numeric and categorical features:

$\textit{EuclideanDistance}({\bf x}_{1},{\bf x}_{2})=\sqrt{\sum_{j=1}^{M}{\left[\phi({\bf x}_{1,j},{\bf x}_{2,j})\right]^{2}}}$ (1)

where

$\phi({\bf x}_{a,j},{\bf x}_{b,j})=\begin{cases}\left\lvert{\bf x}_{a,j}-{\bf x}_{b,j}\right\rvert&{\rm\ if\ feature\ }j{\rm\ is\ numeric}\\ 0&{\rm\ if\ feature\ }j{\rm\ is\ categorical\ and\ }x_{a,j}=x_{b,j}\\ 1&{\rm\ if\ feature\ }j{\rm\ is\ categorical\ and\ }x_{a,j}\neq x_{b,j}\end{cases}$ (2)

Of course, the smaller the distance, the more similar the instances ${\bf x}_{1}$ and ${\bf x}_{2}$ . The distance is 0 when all the (used) feature values of ${\bf x}_{1}$ are the same as in ${\bf x}_{2}$ .
•
Second, the effective instances for class prediction should be determined. The effective instances are the subset of $\mathcal{T}_{r}$ with the most similar instances to the new instance ${\bf x}_{s}$ , by which the class label of ${\bf x}_{s}$ is predicted. The most popular approach is to select the $k$ most similar instances (neighbours) to the new instance to be classified, and use these $k$ instances for the class label prediction. This is the basic idea of the $k$ -NN classifier [3, 77, 33]. In this case, the $k$ parameter’s value should be specified by the user. Nonetheless, this is not the only approach to determine the effective instances.

Another approach is to use all the instances within the data partition that ${\bf x}_{s}$ belongs to as the effective instances. One method is the histogram estimator, where the data space is partitioned into equal-sized intervals [6]. However, a better method is to consider all the instances within distance $d$ to the new instance ${\bf x}_{s}$ to be classified as the set of effective instances, which is the idea of the $d$ -NN classifier [6, 10], which will be described in Section 6.2. In this case, ${\bf x}_{s}$ will be the centre of the data partition, and the $d$ parameter’s value should be specified by the user. A third approach is to use all the instances in $\mathcal{T}_{r}$ as effective instances, but with different contributing weights to the final class label prediction of the new instance. This is the idea of the Gaussian Kernel Estimator (GKE) [6, 10], which will be described in Section 6.3.
•
Third, how the selected effective instances are used to predict the class label of a new instance ${\bf x}_{s}$ should be specified. A straightforward approach is to perform majority voting, where the class label that has the highest occurrence among them is assigned to ${\bf x}_{s}$ . A more elaborate approach is to perform weighted voting, where the effective instances with higher similarity (smaller distance) to the new instance have higher weights in determining its class label. Several approaches can be used to define an effective instance’s weight based on its distance to the instance ${\bf x}_{s}$ to be classified; the Gaussian kernel is a popular function for this task [6], and will be discussed in the context of the GKE classifier (Section 6.3).

Another method is to construct a local classification model using the effective instances set to be used to predict the class label of a new instance ${\bf x}_{s}$ [77]. However, since this method is computationally expensive (learning a different classification model for each new instance), simple classification algorithms are usually used, such as Naïve-Bayes [81].
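The three elements above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the function names, the `numeric_mask` argument (a hypothetical per-feature flag marking numeric features), and the tie-breaking behaviour of the vote are assumptions made for illustration only.

```python
import math
from collections import Counter

def phi(a, b, is_numeric):
    """Per-feature difference of Eq. (2)."""
    if is_numeric:
        return abs(a - b)
    return 0.0 if a == b else 1.0

def euclidean_distance(x1, x2, numeric_mask):
    """Generalized Euclidean distance of Eq. (1)."""
    return math.sqrt(sum(phi(a, b, n) ** 2
                         for a, b, n in zip(x1, x2, numeric_mask)))

def knn_predict(train, labels, x_s, numeric_mask, k):
    """Majority vote among the k nearest training instances.

    Ties are broken by the order in which labels are first seen
    (an arbitrary choice made for this sketch).
    """
    order = sorted(range(len(train)),
                   key=lambda i: euclidean_distance(train[i], x_s, numeric_mask))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def dnn_effective_instances(train, x_s, numeric_mask, d):
    """Indices of the training instances within distance d of x_s."""
    return [i for i, x in enumerate(train)
            if euclidean_distance(x, x_s, numeric_mask) <= d]
```

For example, with one numeric and one categorical feature, `numeric_mask = (True, False)` makes the second feature contribute 0 or 1 to the sum, as prescribed by Eq. (2).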

IBL methods demand careful training data preparation as a pre-processing step before performing classification [77, 73]. IBL classification is very sensitive to the quality of the underlying training set used in the process, since irrelevant features, variable-scale feature values, redundant instances, and outliers affect the choice of the nearest neighbours, and consequently affect the class prediction. Therefore, one important data preparation step is to normalize the values in the domain of each continuous feature, so that each feature has values in a standard range, e.g. from 0 to 1.
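As a small illustration of the normalization step, the sketch below rescales one continuous feature to the range from 0 to 1 using min-max scaling. The choice of min-max scaling and the handling of a constant feature (mapped to 0.0) are assumptions of this sketch; the text above only requires that each feature end up in a standard range.

```python
def min_max_normalize(column):
    """Rescale a list of numeric values to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: no spread to rescale (assumed convention).
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```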

Feature selection [42], which consists of selecting the most relevant feature subset to the class prediction, is a well-known useful process for classification algorithms in general. However, feature weighting [76] is more relevant in the context of instance-based classification, and has been a successful approach to improve lazy classifiers [27]. A drawback of Euclidean distance is that it makes the implicit assumption that all features are equally important. In practice, there can be redundancies and correlations among the features; for example, in a medical application: one feature can represent weight while another represents waist size. Furthermore, some features can be noisy or can even be irrelevant to the classification task. In addition, for nominal features, the distance between two equal feature values is 0, and the distance is 1 when the two values are different. This implicitly gives more emphasis to the nominal features over the continuous features in computing the Euclidean distance between two examples, since the difference between two values in the domain of a continuous feature rarely reaches the extreme difference of 0 or 1.

In feature weighting, the distance measure is extended to apply a different weight $w_{j}$ to each feature $j$ , which modifies the way in which the distance measure is computed between two instances ${\bf x}_{1}$ and ${\bf x}_{2}$ . In the case of Euclidean distance, feature weighting takes the form of:

$d({\bf x}_{1},{\bf x}_{2})=\sqrt{\sum_{j=1}^{M}{w_{j}\left[\phi({\bf x}_{1,j},{\bf x}_{2,j})\right]^{2}}}$ (3)

where $\phi$ is defined in Eq. (2) and $W=(w_{1},w_{2},...,w_{M})$ are real-valued feature weights, each giving a different level of emphasis to its associated feature, and $w_{j}$ is the weight of the $j$ -th feature. Note that feature weighting can be considered as a generalized version of feature selection, since the features that are given zero or very small ineffective weights are considered to be outside the subset of the selected features. The feature weights can be prescribed using expert knowledge of the problem domain; however, in practice this is rarely available.
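A minimal sketch of Eq. (3) follows. The function name and the `numeric_mask` argument are hypothetical; the example also shows how a zero weight removes a feature from the distance, which is the sense in which feature weighting generalizes feature selection.

```python
import math

def weighted_distance(x1, x2, weights, numeric_mask):
    """Feature-weighted Euclidean distance of Eq. (3)."""
    total = 0.0
    for a, b, w, numeric in zip(x1, x2, weights, numeric_mask):
        # phi of Eq. (2): absolute difference or 0/1 mismatch.
        diff = abs(a - b) if numeric else (0.0 if a == b else 1.0)
        total += w * diff ** 2
    return math.sqrt(total)
```

With weights `(1, 1)` the function reduces to Eq. (1); with weights `(1, 0)` the second feature is effectively deselected.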

Approaches to feature selection and feature weighting can be broadly classified into three primary categories: filter, wrapper, and embedded methods. Filter methods [44, 39] assign a heuristic score to each feature based on how well a given feature contributes to the discrimination between classes. Early scoring methods used statistical measures of correlation between features and class labels [80], and have since evolved into more sophisticated measures [48]. Filter methods typically operate on the dataset before it is presented to the learning algorithm, and have an advantage of being relatively light in terms of computational cost.

Wrapper methods [57, 79] use the learning algorithm as a black-box, and its predictive accuracy as a quality evaluation function. Evolutionary Computation (EC) methods such as CIW-NN, as well as ACO methods [7, 8, 61], fall within this category. Wrapper methods generally have good performance but are typically computationally expensive in the training phase compared to filter methods.

Embedded methods [32, 55] perform feature selection or weighting as part of the training phase of the machine learning algorithm. Examples include network pruning in neural network classifiers and limiting the depth of the decision tree in the C4.5 [58] algorithm. These methods are also usually less computationally demanding compared to wrapper methods.

Instance selection [43] is another data preparation task in the context of IBL, which consists of selecting a subset of the most appropriate instances in the training set in order to get rid of outliers, and irrelevant and noisy data instances. Its goal is to isolate the smallest set of instances which enables the classification algorithm to predict the class of a new instance with the same or better predictive accuracy than using the initial dataset. In addition, by minimizing the dataset size, the space complexity and computational costs of the IBL algorithms are reduced. Instance weighting is a generalization of instance selection, which focuses on modifying the way in which distances are measured with respect to the positions of the instances in the training set [52, 21]. One approach is for each instance to have an associated weight that depends on its class [21].

3. Evolutionary Algorithms for IBL

Evolutionary Algorithms (EA) have been investigated [16, 40, 34, 29, 30, 17, 20] for Instance Selection (IS), Instance Weighting (IW), and Feature Weighting (FW). However, thus far, the focus has been limited to the $k$ -NN classifier. Kuncheva [40], Cano et al. [16], and García et al. [29] have applied several flavors of EAs to IS. Kelly and Davis [38] have applied a GA to FW. Other adaptive approaches to FW have been considered by others [36, 72, 39, 76, 47, 37]. Jahromi et al. [35] have presented an adaptive IW approach called Weighted Distance Nearest Neighbor (WDNN) in which the weight for each instance is iteratively optimized by considering the leave-one-out error; the WDNN method has been applied by others [54] to EEG signal classification. Approaches that combine FW and IS have been explored in [34, 38]. Garcia-Pedrajas et al. [31] have applied coevolutionary approaches to IS, and Derrac et al. [19] have presented a coevolutionary method to construct an elaborate multi-classifier based on multiple 1-NN classifiers. Cervantes et al. [17] have presented a Particle Swarm Optimization approach to optimizing the prototypes for a nearest-prototype classifier (which differs from the nearest-neighbours classifier in that the prototypes do not need to be instances of the training set).

In [21], Derrac et al. proposed a coevolutionary approach, based on the 1-nearest neighbour classifier, called CIW-NN (Coevolution of Instance Selection and Weighting schemes for Nearest Neighbour classifier) which has separate coevolving populations for each of the following three tasks: Instance Selection (IS), Feature Weighting (FW), and Instance Weighting (IW).

Each chromosome in the IS population is a $|\mathcal{T}_{r}|$ -dimensional bit-string, where $|\mathcal{T}_{r}|$ is the size of the training set. In a given chromosome, if bit $j$ is equal to 1, this means that the $j$ -th instance of the training set is included in the reduced set. The IS population uses a variant of the CHC (Cross-generational elitist selection, Heterogeneous recombination, and Cataclysmic mutation) algorithm, a classical GA introduced by Eshelman [26] in 1991.

In the FW population, each chromosome is an $M$ -dimensional real-valued vector, where $M$ is the number of features. The $j$ -th entry in each chromosome represents the weight $w_{j}$ to be used in the weighted Euclidean distance proximity measure of Eq. (3). The FW population uses the classical SSGA (Steady State GA) algorithm with multiple descendants [68].

For IW, the CIW-NN algorithm uses an approach in which the weight of an instance depends on its class label, and all instances with the same class label have the same weight. The distance between a labeled training instance ${\bf x}_{1}$ with class label $l({\bf x}_{1})$ and a test instance ${\bf x}_{2}$ is computed as:

$\displaystyle d({\bf x}_{1},{\bf x}_{2})=w_{l({\bf x}_{1})}\cdot\textit{EuclideanDistance}({\bf x}_{1},{\bf x}_{2})$ (4)

Each chromosome in the IW population therefore consists of a vector of $|C|$ real numbers, where the $j$ -th entry in a chromosome represents the weight of the class label $j$ in Eq. (4). Like the FW population, the IW population also uses the SSGA algorithm with multiple descendants.
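To make Eq. (4) concrete, the sketch below scales the plain Euclidean distance by the weight of the training instance's class. It assumes purely numeric features and represents the IW chromosome as a dictionary from class label to weight; both choices, and the function name, are illustrative assumptions and are not taken from [21].

```python
import math

def class_weighted_distance(x_train, label, x_test, class_weights):
    """Eq. (4): Euclidean distance scaled by the training instance's class weight."""
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_train, x_test)))
    return class_weights[label] * euclid
```

A class with a weight below 1 has all of its instances pulled closer to every test instance, which makes that class more likely to win a nearest-neighbour decision.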

Fitness evaluation of an individual in any of the three populations requires the selection of a collaborator from each of the other two populations. The approach followed in CIW-NN is that the most fit individual in any population is always selected as the collaborator for individuals from the other populations. Thus, the fitness function is applied using an individual from each of the three populations (the desired individual from one population, and the most-fit individuals from the other two populations).

Derrac et al. [21] compared CIW-NN to a large number of other approaches: the classical 1-NN classifier, the CHC algorithm applied to IS without coevolution, the SSGA algorithm applied to FW without coevolution, the SSGA algorithm applied to IW without coevolution, a Steady State Memetic Algorithm (SSMA) for IS [29], gradient descent based approaches [52], the WDNN algorithm [35] for IW, a Tabu-Search (TS) based method [72] for Feature Selection and Feature Weighting (FW), a Relief-based approach for FW [39], a FW method based on Mutual Information [76], and a method for simultaneous IS and FW called GOCBR (Global Optimization of feature weighting and instance selection using GA for Case-Based Reasoning) [5]. Compared to these numerous methods, CIW-NN was found [21] to have the best predictive accuracy.

4. Ant Colony Optimization

Ant Colony Optimization (ACO) [24] is a general-purpose, biologically-motivated, population-based optimization meta-heuristic that can be applied to a wide variety of domains [25]. ACO is based on a number of primitive processing elements, each operating in parallel with little centralized control. The processing elements in ACO are called ants, and the collection of processing elements is called a colony. In an ACO algorithm, there is usually a central data structure, analogous to pheromone information in biological ant systems, that represents the time-evolving collective knowledge of the group. In each iteration, each ant typically generates a candidate solution, making use of the central pheromone data structure in its solution construction. After all ants have generated their solutions, a subset of those solutions is used to update the central data structure.

The majority of research on ACO has focused on discrete (combinatorial) optimization problems [25]. However, ACO methods for continuous problem domains have also been investigated [71, 70, 74]. In this paper, we focus on the ACO ${}_{\mathbb{R}}$ algorithm [70], which has been applied to a number of continuous optimization problems [71, 70, 41].

Suppose the ACO ${}_{\mathbb{R}}$ algorithm is to be applied to an optimization problem over $n$ real-valued variables $V_{1},V_{2},\ldots,V_{n}$ . The central data structure, analogous to pheromone information in natural ants, that is maintained by ACO ${}_{\mathbb{R}}$ is an archive $A$ of $R$ previously-generated candidate solutions. Each element $s_{a}$ in the archive, for $a=1,2,\ldots,R$ , is an $n$ -dimensional real-valued vector, $s_{a}=(s_{a,1},s_{a,2},\ldots,s_{a,n})$ . For example, $s_{a,j}$ refers to the value of the $j$ -th variable in the $a$ -th solution in the archive. The archive is sorted by solution quality, so that $Q(s_{1})\geqslant Q(s_{2})\geqslant\ldots\geqslant Q(s_{R})$ . Each solution $s_{a}$ in the archive has an associated weight $\omega_{a}$ that is related to $Q(s_{a})$ , so that $\omega_{1}\geqslant\omega_{2}\geqslant\ldots\geqslant\omega_{R}$ .

The ACO ${}_{\mathbb{R}}$ algorithm consists of repeated iterations until some termination criterion is reached (e.g. solution cost falls below some desired threshold, some maximum number of iterations is reached, etc.). In each iteration, there are two phases: solution construction and pheromone update. In the solution construction phase, each ant probabilistically constructs a solution based on the solution archive $A$ (representing pheromone information). The solution archive $A$ is initialized with $R$ randomly-generated solutions, where the size $R$ is a user-supplied parameter of the ACO ${}_{\mathbb{R}}$ algorithm. Then, in the pheromone update phase, the $m$ constructed solutions (where $m$ is the number of ants) are added to $A$ , resulting in the size of $A$ temporarily being $R+m$ . The archive $A$ is then sorted by solution quality, and the $m$ worst solutions are discarded, so that the size of $A$ returns to being $R$ . Figure 1 shows the conceptual flow of the ACO ${}_{\mathbb{R}}$ algorithm.
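The pheromone update phase described above amounts to a merge, a sort, and a truncation. A minimal sketch, assuming solutions are Python lists and `quality` is a hypothetical callable in which higher values are better (consistent with the archive ordering $Q(s_{1})\geqslant\ldots\geqslant Q(s_{R})$):

```python
def update_archive(archive, new_solutions, quality, R):
    """Add the m new solutions, sort by quality, keep the best R."""
    merged = archive + new_solutions              # size R + m
    merged.sort(key=quality, reverse=True)        # best solution first
    return merged[:R]                             # discard the m worst
```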

Figure 1.

The workflow of the ACO ${}_{\mathbb{R}}$ algorithm: 1) Archive initialization. 2) Solution selection from the archive. 3) Probabilistic solution creation. 4) Quality evaluation. 5) Archive update. 6) Search parameters update.

The heart of the algorithm is the solution construction phase. In this phase, each ant $i$ generates a candidate solution $s_{i}$ , where $s_{i}$ is an $n$ -dimensional vector, and $s_{i,j}$ represents an assignment to the $j$ -th variable $V_{j}$ . In constructing its solution $s_{i}$ , ant $i$ is influenced by one of the $R$ solutions in the archive $A$ . The ant first probabilistically selects one of the $R$ solutions in the archive according to:

$\text{Pr}(\text{select}\,s_{a})=\frac{\omega_{a}}{\sum_{r=1}^{R}\omega_{r}}$ (5)

Thus, the probability of selecting the $a$ -th solution is proportional to its weight $\omega_{a}$ . Recall that the archive $A$ is sorted by quality, so that solution $s_{a}$ has rank $a$ , with the best solution having a rank of 1. The weights $\omega_{a}$ that are used in Eq. (5) are constructed in each iteration as:

$\displaystyle\omega_{a}=g(a;1,qR)$ (6)

where $g$ is the Gaussian function:

$g(y;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y-\mu)^{2}}{2\sigma^{2}}}$ (7)

Thus, Eq. (6) assigns the weight $\omega_{a}$ to be the value of the Gaussian function with argument $a$ , mean 1.0, and standard deviation $(qR)$ . The value of $q$ is a user-supplied parameter of the algorithm, where smaller values of $q$ cause the better ranked solutions to have higher weights $\omega$ (and thus make the algorithm more exploitative), while larger values of $q$ result in a more uniform distribution.
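The effect of $q$ can be checked numerically. The sketch below evaluates Eqs. (5) to (7) for a given archive size; the function names are hypothetical, and the particular values of $R$ and $q$ used when calling it are illustrative rather than the settings used in the paper.

```python
import math

def gaussian(y, mu, sigma):
    """Gaussian function of Eq. (7)."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def rank_weights(R, q):
    """Weights of Eq. (6) for ranks a = 1..R (rank 1 is the best solution)."""
    return [gaussian(a, 1.0, q * R) for a in range(1, R + 1)]

def selection_probabilities(R, q):
    """Selection probabilities of Eq. (5)."""
    w = rank_weights(R, q)
    total = sum(w)
    return [x / total for x in w]
```

Comparing `selection_probabilities(10, 0.1)` with `selection_probabilities(10, 1.0)` shows the first distribution concentrating its mass on the top-ranked solutions, while the second is much closer to uniform.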

Let $s_{a}$ be the solution of $A$ that is selected by ant $i$ according to Eq. (5) in a given iteration. Ant $i$ then generates each solution element $s_{i,j}$ by sampling the Gaussian probability density function (PDF):

$\displaystyle s_{i,j}\sim N(s_{a,j},\sigma_{a,j})$ (8)

where $N(\mu,\sigma)$ represents the Gaussian PDF with mean $\mu$ and standard deviation $\sigma$ . One way to sample the Gaussian PDF is through the Box-Muller transform [14], which is based on the following property: if $a_{1}$ and $a_{2}$ are two independent, uniformly distributed random numbers in the interval $(0,1]$ , then $b_{1}$ and $b_{2}$ will be two independent random numbers with a Gaussian distribution $N(0,1)$ , where

$\displaystyle b_{1}=\sqrt{-2\ln{a_{1}}}\cos{(2\pi a_{2})},\;b_{2}=\sqrt{-2\ln{a_{1}}}\sin{(2\pi a_{2})}$ (9)
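A minimal sketch of Eq. (9) using only the Python standard library is given below. The helper names are hypothetical. Because `random.Random.random()` returns values in $[0,1)$, the sketch uses `1.0 - rng.random()` to obtain values in $(0,1]$ so that the logarithm is always defined.

```python
import math
import random

def box_muller(rng):
    """Return two independent N(0, 1) samples via Eq. (9)."""
    a1 = 1.0 - rng.random()   # uniform in (0, 1]
    a2 = 1.0 - rng.random()   # uniform in (0, 1]
    radius = math.sqrt(-2.0 * math.log(a1))
    return radius * math.cos(2 * math.pi * a2), radius * math.sin(2 * math.pi * a2)

def sample_gaussian(mu, sigma, rng):
    """Sample N(mu, sigma) by shifting and scaling a standard normal sample."""
    b1, _ = box_muller(rng)
    return mu + sigma * b1
```

In practice a library routine such as `random.gauss` can be used instead; the explicit transform is shown only to mirror the equation.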

In Eq. (8), $s_{a,j}$ represents the value that the solution $s_{a}$ assigns to variable $V_{j}$ , and the standard deviation $\sigma_{a,j}$ is computed according to:

$\sigma_{a,j}=\xi\sum_{r=1}^{R}\frac{\mid s_{a,j}-s_{r,j}\mid}{R-1}$ (10)

where $\xi$ is a user-supplied parameter of the algorithm. The effect of Eq. (10) is that the average distance from $s_{a}$ to other solutions in the archive, for the $j$ -th dimension, is computed, and is then multiplied by $\xi$ . The parameter $\xi$ plays a role in ACO ${}_{\mathbb{R}}$ similar to that of evaporation rate in other ACO algorithms. The higher the value of $\xi$ , the less the extent to which the search is biased towards the area of the search space around the solutions stored in the archive, and the slower the algorithm will converge. Once each ant constructs its solution, the archive $A$ is updated as described above. The process repeats until the desired termination criteria are met.
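The computation in Eq. (10) and the per-variable sampling of Eq. (8) can be sketched as follows. The function names are hypothetical, archive indices are 0-based in the code (so index 0 corresponds to rank 1), and `random.Random.gauss` stands in for the Gaussian sampler.

```python
import random

def sigma_for(archive, a, j, xi):
    """Eq. (10): xi times the mean absolute distance, in dimension j,
    from solution a to the other solutions in the archive."""
    R = len(archive)
    return xi * sum(abs(archive[a][j] - archive[r][j]) for r in range(R)) / (R - 1)

def construct_solution(archive, a, xi, rng):
    """Eq. (8): sample every variable around the selected solution a."""
    n = len(archive[a])
    return [rng.gauss(archive[a][j], sigma_for(archive, a, j, xi)) for j in range(n)]
```

As a worked example, for an archive whose first coordinates are 0, 2 and 4, the first solution has a mean distance of $(2+4)/2=3$ to the others, so $\xi=0.5$ gives a standard deviation of 1.5. When all archived solutions agree on a coordinate, the standard deviation is 0 and that coordinate is copied unchanged.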

In all, the algorithm has four user-supplied parameters $m$ , $R$ , $q$ , and $\xi$ , in addition to any parameters related to the termination criteria. The parameter $m$ determines the number of ants; the parameter $R$ determines the number of solutions stored in the archive $A$ ; the parameter $q$ controls the extent to which the top solutions in the archive will dominate solution construction (Eq. (6)); and the parameter $\xi$ influences the degree of diversity in solution construction (Eq. (10)).
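Putting the pieces together, the following is a compact sketch of the single-archive algorithm as described in this section, not of the multi-archive adaptation proposed later. The function name, the default values of `R`, `q`, `xi` and `iterations`, the uniform initialization within `bounds`, and the fixed iteration budget as the termination criterion are all assumptions of the sketch. Sampled values are not clipped to `bounds`.

```python
import math
import random

def aco_r(quality, n, bounds, R=10, m=1, q=0.1, xi=0.85, iterations=200, seed=0):
    """Maximize `quality` over n real variables; returns the sorted archive."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Archive initialization with R random solutions, best first.
    archive = [[rng.uniform(lo, hi) for _ in range(n)] for _ in range(R)]
    archive.sort(key=quality, reverse=True)
    # Rank weights of Eq. (6); the constant factor of Eq. (7) is dropped
    # because it cancels in the selection probabilities of Eq. (5).
    weights = [math.exp(-((a - 1) ** 2) / (2 * (q * R) ** 2)) for a in range(1, R + 1)]
    for _ in range(iterations):
        new_solutions = []
        for _ant in range(m):
            a = rng.choices(range(R), weights=weights)[0]   # Eq. (5)
            solution = []
            for j in range(n):
                spread = sum(abs(archive[a][j] - archive[r][j]) for r in range(R))
                sigma = xi * spread / (R - 1)                # Eq. (10)
                solution.append(rng.gauss(archive[a][j], sigma))  # Eq. (8)
            new_solutions.append(solution)
        # Pheromone update: merge, sort, and drop the m worst solutions.
        archive = sorted(archive + new_solutions, key=quality, reverse=True)[:R]
    return archive
```

For instance, `aco_r(lambda s: -sum(v * v for v in s), n=2, bounds=(-5.0, 5.0))` searches for the minimum of a two-dimensional sphere function. Because the best archived solution is never discarded, the quality of the top-ranked solution cannot decrease from one iteration to the next.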

The solution archive $A$ can be considered an indirect representation of a probability distribution. When the number of ants $m$ is large, e.g. 10, then the probability distribution represented is sampled 10 times, before the generated solutions are used to update the distribution. When the number of ants $m$ is small, e.g. 1, then the probability distribution is updated immediately after each candidate solution is generated. This latter approach ( $m=1$ ) is employed in our experimental results.

Recently, an extension of ACO ${}_{\mathbb{R}}$ for problems with a mixture of real-valued, ordinal and categorical variables, was introduced, called ACO ${}_{\text{MV}}$ [41]. In ACO ${}_{\text{MV}}$ , real-valued variables are handled as they are in ACO ${}_{\mathbb{R}}$ (as described above), ordinal variables are handled by ACO ${}_{\text{MV}}$ -o, and categorical variables are handled by ACO ${}_{\text{MV}}$ -c.

Suppose an ordinal variable $U$ has $s$ values in its domain, ordered from lowest to highest as $u_{1},u_{2},\ldots,u_{s}$ . ACO ${}_{\text{MV}}$ -o treats $U$ as a real-valued variable $U^{\prime}$ whose value can range from $1$ to $s$ , and operates the same as ACO ${}_{\mathbb{R}}$ , with a single exception. When a value of the real-valued variable $U^{\prime}$ is to be evaluated, it is first rounded to the nearest integer. If it is rounded to $\ell$ , then this corresponds to a value of $u_{\ell}$ for the original ordinal variable $U$ , and the quality evaluation proceeds with the value $u_{\ell}$ .
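The decoding step for an ordinal variable can be sketched as below. The function name is hypothetical, and two details are assumptions not specified in the text above: ties at .5 are rounded up, and sampled values that fall outside the range from $1$ to $s$ are clamped to the nearest valid index.

```python
import math

def decode_ordinal(u_real, domain):
    """Map a real-valued sample to one of the ordered values u_1..u_s."""
    s = len(domain)
    index = int(math.floor(u_real + 0.5))   # round to nearest integer (half up)
    index = max(1, min(s, index))           # clamp to 1..s (assumed convention)
    return domain[index - 1]                # the domain list is 0-based
```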

In handling a categorical variable $U$ with $s$ category labels $u_{1},\ldots,u_{s}$ , the operation of ACO ${}_{\text{MV}}$ -c is quite elaborate and takes into consideration whether each category label $u_{i}$ is represented in the archive as well as the quality of the highest-ranked archived solution that includes each represented category label. The detailed mechanisms of ACO ${}_{\text{MV}}$ -c are outside our scope, however, since we do not work with categorical variables in this paper.

5. Class-based feature weighting for IBL

In many real-world problem domains, some input features are more important to the prediction of a specific class label, while others are less important to the prediction of the same class. This feature-class relevance can be different from one class label to another. For example, consider a medical dataset of patients, where the input features consist of several symptoms and observations of the patient condition, and the target class is one of several diseases. In this situation, some symptoms and observations may be more relevant to a specific disease, while the same symptoms and observations may be less relevant to another disease in the target class domain. An input feature that encodes blood glucose level would likely have a greater discriminatory relevance for a class label of diabetes than it would for a class label of brain stroke. Hence, having one weight value for each symptom feature, with respect to all the diseases in the target class domain, will not reflect the relative symptom-disease (feature-class label) importance.

Therefore, we propose a new class-based feature weighting scheme, in which each feature has a different weight value for each target class label, to give a different level of emphasis to each feature with respect to each class label. More precisely, we propose to optimize ${\bf W}=(W^{1},W^{2},\ldots,W^{|C|})$ , which consists of $|C|$ sets of feature weights $W^{l}$ , where $|C|$ is the number of class labels in the domain of the class $C$ , and $W^{l}=(w_{1}^{l},w_{2}^{l},\ldots,w_{M}^{l})$ is the set of feature weights with respect to class label $l$ . The value $w_{j}^{l}$ is the weight of the $j$ -th feature with respect to the class label $l$ ; there are $|C|\times M$ real-valued weights to be specified in total. The class-based feature weights are used as follows. When the lazy classification algorithm calculates the distance between ${\bf x}_{s}$ (the new unlabeled instance to be classified) and ${\bf x}_{i}^{l}$ (the $i$ -th labeled instance in the training set $\mathcal{T}_{r}$ ), the set of feature weights $W^{l}$ that corresponds to the class label $l$ of the current training instance ${\bf x}_{i}^{l}$ is used, as shown in the following equation:

$d({\bf x}_{s},{\bf x}_{i}^{l},{\bf W})=\sqrt{\sum_{j=1}^{M}{w^{l}_{j}\left[\phi({\bf x}_{s,j},{\bf x}_{i,j}^{l})\right]^{2}}}$ (11)

where $\phi$ is defined in Eq. (2).
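As an illustration of Eq. (11), the following Python sketch computes the class-weighted distance. Since Eq. (2) is not repeated here, the per-feature difference `phi` below is an assumed stand-in (absolute difference for numeric values, 0/1 mismatch for categorical values); the function names and the toy weights are likewise hypothetical.

```python
import math

def phi(a, b):
    # Assumed stand-in for Eq. (2): absolute difference for numeric values,
    # 0/1 mismatch for categorical values.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0

def class_weighted_distance(x_s, x_i, label_i, W):
    """Eq. (11): distance from x_s to x_i using the weight set of x_i's class label."""
    w = W[label_i]
    return math.sqrt(sum(w_j * phi(a, b) ** 2 for w_j, a, b in zip(w, x_s, x_i)))

# Toy weights: one weight vector per class label (illustrative values only).
W = {"diabetes": [0.9, 0.1], "stroke": [0.2, 0.8]}
x_s = [0.5, 0.5]
print(class_weighted_distance(x_s, [0.9, 0.5], "diabetes", W))
print(class_weighted_distance(x_s, [0.9, 0.5], "stroke", W))
```

The same pair of feature vectors yields a different distance depending on the class label of the training instance, which is the intended effect of the scheme.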

Note that the class-based weighting scheme would not only improve the effectiveness of an IBL classification algorithm in terms of predictive accuracy, by providing fine-grained feature weights with respect to each class, but would also provide the user with more comprehensible knowledge [28] regarding the relative importance of each feature to each class label, which improves the interpretability of the instance-based classifier. For instance, if the user is trying to compare and analyze two different class labels, $l_{1}$ and $l_{2}$ , in the application domain, an input feature $f_{j}$ that has a large difference between the values of $w_{j}^{l_{1}}$ and $w_{j}^{l_{2}}$ can be more interesting for analysis than another feature that has similar weight values for both class labels. The way this class-based feature weighting is performed using ACO ${}_{\mathbb{R}}$ is described in the following section.

6. Adapting ACO ${}_{\mathbb{R}}$ to optimize the parameters of various IBL algorithms

We adapt ACO ${}_{\mathbb{R}}$ to optimize class-based feature weights (discussed in Section 5) in the context of three different instance-based classification algorithms (each described later in this section), as well as one key parameter for each algorithm. Our adaptation, called ACO ${}_{\mathbb{R}}$ -IBL, contains several archives of solution components, each archive holding $R$ elements. Suppose the classification problem at hand has $|C|$ class labels and $M$ features. Then, the algorithm utilizes $(|C|+1)$ archives. In the first archive, $A_{0}$ , each entry is a scalar that represents a potential value of the lazy classifier’s key parameter. Each of the remaining $|C|$ archives corresponds to one of the $|C|$ class labels $l$ . An entry $W^{l}_{a}$ in the $A_{l}$ archive is an $M$ -dimensional real-valued vector that represents candidate feature weights for class label $l$ , where $M$ is the number of input features of the current dataset. Hence, a complete candidate solution of the ACO ${}_{\mathbb{R}}$ -IBL algorithm is composed of an entry from each of the $(|C|+1)$ archives, and can be represented as a pair $(p,{\bf W})$ , where $p$ is the value of the algorithm’s key parameter, and ${\bf W}$ is the set of feature weight sets, one for each class label $l\in C$ , where $|{\bf W}|=|C|\times M$ . Algorithm 1 shows the overall procedure of ACO ${}_{\mathbb{R}}$ -IBL.

Algorithm 1 Pseudo-code of ACO ${}_{\mathbb{R}}$ -IBL.
1:	Begin
2:	${\mathcal{T}_{r}}\leftarrow\textit{input};$
3:	$t=1;$
4:	InitializeArchive $(A_{0});$ // key-parameter archive
5:	for $l=1\to|C|$ do
6:	InitializeArchive $(A_{l});$ // feature weights archive
7:	end for
8:	repeat
9:	$p_{a}=$ SelectFromArchive $(A_{0});$ // select a parameter value
10:	$p_{t}=$ GenerateNewParameterValue $(p_{a});$
11:	${\bf W}_{t}=\emptyset;$
12:	for $l=1\to|C|$ do
13:	$W^{l}_{a}=$ SelectFromArchive $(A_{l});$ // select a vector of feature weights
14:	$W^{l}_{t}=$ GenerateNewFWVector $(W^{l}_{a});$
15:	${\bf W}_{t}={\bf W}_{t}\ \cup\ W^{l}_{t};$
16:	end for
17:	$\textit{CLS}_{t}=$ SetupLazyClassifier $(p_{t},{\bf W}_{t},{\mathcal{T}_{r}});$
18:	${\mathcal{T}_{\textit{correct}}}=$ Classify $(\textit{CLS}_{t},{\mathcal{T}_{r}});$ // leave-one-out procedure
19:	$Q_{t}=$ EvaluateQuality $({\mathcal{T}_{\textit{correct}}});$
20:	UpdateArchive $(A_{0},p_{t},Q_{t});$
21:	for $l=1\to|C|$ do
22:	$Q^{l}_{t}=$ EvaluateQuality $({\mathcal{T}_{\textit{correct}}},l);$
23:	UpdateArchive $(A_{l},W^{l}_{t},Q^{l}_{t});$
24:	end for
25:	UpdateSearchParameters ();
26:	$t=t+1;$
27:	until $t=t_{\textit{max}}$ or Convergence $(I_{\textit{max}})$ ;
28:	return BestSolution ();
29:	End

Essentially, the ACO ${}_{\mathbb{R}}$ -IBL algorithm executes as follows. At the beginning, the archives are initialized with $R$ randomly generated solutions that are evaluated and sorted according to their quality (lines 4 to 7). Then, in each iteration $t$ , a new candidate solution is created as follows. The algorithm probabilistically selects a parameter value $p_{a}$ from the parameter value archive $A_{0}$ (line 9), using fitness-proportionate selection based on the solution ranks (Eq. (5)). Then, a new parameter value $p_{t}$ is generated from the selected value $p_{a}$ (line 10), by sampling the Gaussian Probability Density Function (PDF) $N(p_{a},\sigma_{a})$ , where $\sigma_{a}$ is computed as:

$\sigma_{a}=\xi\sum_{r=1}^{R}\frac{|p_{a}-p_{r}|}{R-1}$ (12)

Next, a set of feature weights is generated for each class label (lines 12 to 16). For each archive $A_{l}$ , the algorithm probabilistically selects a vector, $W_{a}^{l}$ , which represents a feature weight vector for class label $l$ . Then, a new weight vector $W_{t}^{l}$ is probabilistically generated from $W_{a}^{l}$ . Let $W^{l}_{t,j}$ denote the $j$ -th element of $W_{t}^{l}$ , i.e. the weight of the $j$ -th feature with respect to class label $l$ . For each $j=1,\ldots,M$ , $W_{t,j}^{l}$ is generated by sampling the Gaussian PDF $N(W_{a,j}^{l},\sigma_{a,j}^{l})$ , where $W_{a,j}^{l}$ is the value of the $j$ -th element of the selected archived solution $W_{a}^{l}$ , and $\sigma_{a,j}^{l}$ is computed as:

$\sigma_{a,j}^{l}=\xi\sum_{r=1}^{R}\frac{|W_{a,j}^{l}-W_{r,j}^{l}|}{R-1}$ (13)

Note that Eqs (12) and (13) are adaptations of Eq. (10).
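Eqs (12) and (13) share the same form, so a single helper can serve both the key-parameter archive and each coordinate of a feature-weight archive. The sketch below is a minimal Python illustration; the function names are hypothetical, and the clamping of sampled values to a permitted range is an assumption, since the text does not state how bounds are enforced.

```python
import random

def sigma_for(selected_index, archive_values, xi):
    """Eq. (12)/(13): xi times the average absolute distance from the
    selected archive value to the archive values (divided by R - 1)."""
    R = len(archive_values)
    center = archive_values[selected_index]
    return xi * sum(abs(center - v) for v in archive_values) / (R - 1)

def sample_new_value(selected_index, archive_values, xi, rng, low=None, high=None):
    """Sample N(selected value, sigma) for one scalar (p, or one coordinate of W^l)."""
    sigma = sigma_for(selected_index, archive_values, xi)
    value = rng.gauss(archive_values[selected_index], sigma)
    # Clamping to [low, high] is an assumption made for this sketch.
    if low is not None:
        value = max(low, value)
    if high is not None:
        value = min(high, value)
    return value

rng = random.Random(0)
k_archive = [3.0, 5.0, 9.0, 12.0]  # toy key-parameter archive, best first
print(sample_new_value(0, k_archive, xi=0.85, rng=rng, low=1.0, high=21.0))
```

When the archive entries are spread out, the standard deviation is large and sampling explores widely; as the entries cluster together, the standard deviation shrinks and sampling concentrates around the selected value.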

At this point, we have a complete candidate solution – consisting of an element from each of the archives – to run the lazy classifier. The lazy classifier $\textit{CLS}_{t}$ is set up (line 17) using the generated parameters and the training set $\mathcal{T}_{r}$ . The way each of the three lazy classifiers works to classify an instance is discussed later in this section. A leave-one-out validation procedure is performed (line 18) – i.e., for each instance ${\bf x}_{i}\in$ $\mathcal{T}_{r}$ , ${\bf x}_{i}$ becomes the validation instance to be classified, while $\left(\mathcal{T}_{r}-\{{\bf x}_{i}\}\right)$ becomes the set of input instances, from which the effective instances are selected and used to determine the class of the validation instance ${\bf x}_{i}$ . Note that the employed distance measure utilizes the class-based feature weights, as in Eq. (11). This leave-one-out quality evaluation approach was also used in the fitness function of CIW-NN [21].

The output of the leave-one-out procedure is the set of the correctly classified instances, ${\mathcal{T}_{\textit{correct}}}$ . We evaluate the quality $Q_{t}$ of the key parameter value $p_{t}$ using predictive accuracy:

$\textit{Accuracy}=\frac{|\mathcal{T}_{\textit{correct}}|}{N}$ (14)

where $N$ is the size of the training set $\mathcal{T}_{r}$ . This quality evaluation is used to update the parameter value archive $A_{0}$ (lines 19 and 20).

In contrast, the quality evaluation of the created class-based feature weights ${\bf W}$ is performed as follows. Each set of feature weights $W^{l}$ , associated with class label $l$ , is evaluated independently, to update its related archive $A_{l}$ , independently of the other archives (lines 22 and 23). More precisely, the quality value $Q_{t}^{l}$ , which is associated with the $W^{l}_{t}$ feature weight set, is calculated as follows:

$\textit{Accuracy}(l)=\frac{|{\mathcal{T}_{\textit{correct}}^{l}}|}{N^{l}}$ (15)

where $\mathcal{T}_{\textit{correct}}^{l}$ is the set of instances that are correctly classified to class $l$ during the leave-one-out procedure, and $N^{l}$ is the number of instances in the training set labeled by $l$ . In other words, to compute $Q^{l}_{t}$ for the feature weight set $W^{l}$ of class label $l$ , we only compute the accuracy in the context of class label $l$ , as shown in Eq. (15). After that, each archive $A_{l}$ is updated with $W_{t}^{l}$ according to its quality $Q^{l}_{t}$ . For example, if $C=\{l_{1},l_{2},l_{3}\}$ , $W_{t}^{l_{1}}$ may be the third in the ranking for archive $A_{l_{1}}$ , $W_{t}^{l_{2}}$ may be the tenth in the ranking for archive $A_{l_{2}}$ , and $W_{t}^{l_{3}}$ may not be inserted at all into archive $A_{l_{3}}$ .

The idea here is to avoid rewarding a bad feature weight set of a specific class because the feature weight sets of the other classes are good, and similarly to avoid penalizing good feature weights of a specific class because the feature weight sets of the other classes are bad. This can be even more useful in the case of class imbalance problems, where some class labels have relatively low frequencies with respect to other class labels. Using the general accuracy measure (Eq. (14)) would make the accuracy of predicting the high-frequency class labels dominate the overall quality of the classifier. Therefore, the proposed approach should provide a more fine-grained class-based quality evaluation.
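The difference between Eqs (14) and (15) can be seen in a small Python sketch. The function name and the toy labels are hypothetical; the computation follows the two equations directly.

```python
from collections import Counter

def overall_and_class_accuracy(true_labels, predicted_labels):
    """Return (Eq. (14) overall accuracy, {label: Eq. (15) class accuracy})."""
    n_per_class = Counter(true_labels)
    correct_per_class = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    overall = sum(correct_per_class.values()) / len(true_labels)
    per_class = {l: correct_per_class[l] / n for l, n in n_per_class.items()}
    return overall, per_class

# Imbalanced toy example: every instance is predicted as the majority class.
true_labels = ["a", "a", "a", "a", "b"]
predicted = ["a", "a", "a", "a", "a"]
overall, per_class = overall_and_class_accuracy(true_labels, predicted)
print(overall, per_class)
```

In this toy case the overall accuracy is high although the minority class is never predicted correctly; the class-based quality exposes this, so the weight set of the minority class is not rewarded for the performance of the majority class.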

After the archives are updated in iteration $t$ , the ACO ${}_{\mathbb{R}}$ -IBL algorithm updates the search parameters using Eq. (6) and moves to iteration $t+1$ (lines 25 and 26). The repeat-until loop (lines 8 to 27) is repeated until $t_{\textit{max}}$ iterations have elapsed, or until the algorithm converges. The algorithm is considered to have converged when there is no improvement in the quality of the best solution for $I_{\textit{max}}$ consecutive iterations. The best solution, which consists of the parameter value at the top of the parameter archive $A_{0}$ , as well as the feature weight vector at the top of each class-based feature weight archive $A_{l}$ , is returned at the end of the algorithm (line 28). Note that, in Algorithm 1, the number of ants $m$ is always 1. This means that only one solution is created in each iteration $t$ ; that constructed solution will be discarded unless its quality is better than the worst solution in the archive (as described in Section 4).
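The archive update rule described above (UpdateArchive in Algorithm 1) can be sketched as follows. This is an illustrative Python version under stated assumptions: the archive is kept as a list of (quality, solution) pairs sorted best first, and a new solution whose quality only ties the worst entry is discarded. Each of the $(|C|+1)$ archives would be updated independently with its own quality value.

```python
def update_archive(archive, solution, quality, R):
    """Insert (quality, solution) into a best-first archive of capacity R.

    Returns True if the solution was inserted, False if it was discarded.
    """
    if len(archive) >= R and quality <= archive[-1][0]:
        return False  # not better than the worst archived solution
    archive.append((quality, solution))
    archive.sort(key=lambda entry: entry[0], reverse=True)  # best first
    del archive[R:]  # drop the worst entry if the archive overflowed
    return True

A_0 = [(0.90, 7.0), (0.80, 3.0), (0.70, 15.0)]  # toy key-parameter archive
print(update_archive(A_0, 9.0, 0.75, R=3), A_0)
print(update_archive(A_0, 11.0, 0.60, R=3), A_0)
```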

6.1 K-nearest neighbours

The $k$ -nearest neighbours algorithm is the most common lazy classifier. Its key parameter is $k$ , which represents the number of the nearest (effective) instances to be used to classify the validation instance ${\bf x}_{s}$ . Hence, our ACO ${}_{\mathbb{R}}$ -IBL algorithm is responsible for optimizing the value of the integer parameter $k$ – with respect to the current dataset – along with the real-valued class-based feature weights ${\bf W}$ used in calculating the distance $d_{si}$ between the validation instance ${\bf x}_{s}$ and a training instance ${\bf x}_{i}$ . The integer parameter $k$ is handled in the same way that ordinal variables are handled in the ACO ${}_{\text{MV}}$ [41] algorithm (described in Section 4). Specifically, the value of $k$ is treated as a real-valued scalar by the ACO ${}_{\mathbb{R}}$ algorithm, but is rounded to the nearest integer before being used as a cutoff for the nearest neighbours.
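A minimal Python sketch of this cutoff is given below. It assumes that the distances $d_{si}$ have already been computed with Eq. (11) and that the class is decided by an unweighted majority vote among the $k$ nearest instances; the voting rule and the function name are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(neighbours, k_real):
    """neighbours: list of (d_si, label) pairs; k_real: real-valued k from the optimizer."""
    k = max(1, math.floor(k_real + 0.5))  # round to the nearest integer, at least 1
    nearest = sorted(neighbours, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

neighbours = [(0.1, "a"), (0.2, "b"), (0.3, "b"), (0.9, "a")]
print(knn_predict(neighbours, 2.6))  # uses the 3 nearest instances
print(knn_predict(neighbours, 1.2))  # uses only the nearest instance
```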

6.2 Distance-based nearest neighbours

A more robust alternative to $k$ -NN is the distance-based nearest neighbours ( $d$ -NN) algorithm. The key parameter for the $d$ -NN classification algorithm is $d$ , which specifies the maximum distance between a training instance ${\bf x}_{i}$ and the validation instance ${\bf x}_{s}$ in order for ${\bf x}_{i}$ to be considered as one of the effective instances that are used to determine the class label of ${\bf x}_{s}$ . In the context of $d$ -NN, the ACO ${}_{\mathbb{R}}$ -IBL algorithm is responsible for optimizing the value of the parameter $d$ – with respect to the current dataset – along with the class-based feature weights ${\bf W}$ for the distance measure.

In essence, in each iteration of the leave-one-out procedure, the distance $d_{si}$ is calculated between the validation instance ${\bf x}_{s}$ and each training instance ${\bf x}_{i}$ . However, the training instance ${\bf x}_{i}$ is added to the effective instances only if $d_{si}\leqslant d$ , where $d$ is the current value of the maximum allowed distance. Hence, the effective instances in the $d$ -NN algorithm are all the instances in a sphere of radius $d$ in the data space, where ${\bf x}_{s}$ (the instance to be classified) is in the centre of the sphere. Note that the number of the effective instances is fixed to $k$ in $k$ -NN, even if some of these $k$ instances are far from the validation instance ${\bf x}_{s}$ in the data space. On the other hand, in $d$ -NN, the number of effective instances varies from one validation instance to another, depending on where the validation instance is in the data space and how many training instances are within distance $d$ to it.

We employ weighted voting in order to predict the class label of ${\bf x}_{s}$ , where the weight of instance ${\bf x}_{i}$ is $w_{i}=\frac{1}{d_{si}}$ . More precisely, for each class label $l$ , we sum the weights of all the instances ${\bf x}_{i}^{l}$ in the effective instances set that belong to $l$ . Then, we assign the validation instance ${\bf x}_{s}$ to the class label with the highest sum of weight values.
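The $d$ -NN decision rule can be sketched in Python as follows. Two details are assumptions of this sketch rather than part of the description above: a small constant `eps` guards against division by zero when $d_{si}=0$ , and `None` is returned when no training instance lies within distance $d$ .

```python
from collections import defaultdict

def dnn_predict(neighbours, d, eps=1e-12):
    """neighbours: list of (d_si, label) pairs; d: maximum allowed distance."""
    scores = defaultdict(float)
    for d_si, label in neighbours:
        if d_si <= d:  # effective instance: inside the sphere of radius d
            scores[label] += 1.0 / max(d_si, eps)  # inverse-distance weight
    if not scores:
        return None  # no effective instances (handling assumed for this sketch)
    return max(scores, key=scores.get)

neighbours = [(0.1, "a"), (0.4, "b"), (0.5, "b"), (0.9, "a")]
print(dnn_predict(neighbours, d=0.6))
print(dnn_predict(neighbours, d=0.05))
```

In the first call, the single very close instance of class "a" outweighs the two more distant instances of class "b", which shows how the inverse-distance weights differ from a simple count.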

6.3 Gaussian kernel estimator

Unlike the two previous lazy classifiers, which use a selected subset of the training instances as the effective instances, GKE uses all the instances in $\mathcal{T}_{r}$ as the effective instances. However, each instance ${\bf x}_{i}$ will have a different weight $w_{i}$ in the weighted voting performed by GKE. Each weight $w_{i}$ is calculated as a function of the distance $d_{si}$ between the training (effective) instance ${\bf x}_{i}$ and the current validation instance ${\bf x}_{s}$ .

Recall that in the context of $d$ -NN, the weight function is $w_{i}=f(d_{si})=\frac{1}{d_{si}}$ . That is, the weight value $w_{i}$ is inversely proportional to the distance value $d_{si}$ . However, in the case of the GKE lazy classifier, $w_{i}=f^{\prime}(d_{si})=g(0;d_{si},\sigma)$ , where $g(y;\mu,\sigma)$ is the Gaussian function defined in Eq. (7).

The key parameter in the GKE is $\sigma$ , the spread (or smoothing) parameter that defines the shape of the influence of the training instances in determining the class label of the validation instance. When the value of $\sigma$ is small, the training instances near to ${\bf x}_{s}$ have a large weight, and this weight decreases rapidly as the distance $d_{si}$ increases. On the other hand, when the value of $\sigma$ is larger, we get a smoother effect, where the difference between the weights of the near and the far instances is not as large. This is illustrated in Fig. 2, in which the $x$ -axis represents the distance $d_{si}$ and the $y$ -axis represents the weight $w_{i}$ . The graph shows plots for GKE for several different values of $\sigma$ , as well as for $d$ -NN.

Figure 2.

Plot of weight $w_{i}$ ( $y$ -axis) versus distance $d_{si}$ ( $x$ -axis) for GKE, for several values of $\sigma$ , as well as for $d$ -NN.
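The effect of $\sigma$ can also be checked numerically. The Python sketch below assumes that the Gaussian function of Eq. (7) is the standard normal density $g(y;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-(y-\mu)^{2}/(2\sigma^{2})}$ ; the function names and the toy distances are hypothetical.

```python
import math
from collections import defaultdict

def gaussian(y, mu, sigma):
    # Assumed form of Eq. (7): the normal probability density function.
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def gke_weight(d_si, sigma):
    """w_i = g(0; d_si, sigma)."""
    return gaussian(0.0, d_si, sigma)

def gke_predict(neighbours, sigma):
    """All training instances vote, each weighted by the Gaussian kernel."""
    scores = defaultdict(float)
    for d_si, label in neighbours:
        scores[label] += gke_weight(d_si, sigma)
    return max(scores, key=scores.get)

# One near instance of class "a" against three far instances of class "b".
neighbours = [(0.05, "a"), (0.5, "b"), (0.5, "b"), (0.5, "b")]
for sigma in (0.1, 0.4):
    ratio = gke_weight(0.05, sigma) / gke_weight(0.5, sigma)
    print(sigma, ratio, gke_predict(neighbours, sigma))
```

With the smaller $\sigma$ the near instance dominates the vote, whereas with the larger $\sigma$ the three distant instances together outweigh it.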

7. Ensemble of classifiers using ACO population

An ensemble of classifiers is often used to combine the predictions from separate classifiers in order to increase predictive accuracy [59, 78]. The idea is to construct an ensemble of classifiers having different inductive biases, so that they will make different errors. Hence, combining their classification outputs will make the overall prediction of the ensemble more accurate [33].

With the use of an ACO-based algorithm to learn classifiers, or to optimize classifier parameters as in the current work, there is a potential to apply the idea of a classifier ensemble for free, since many solutions (i.e., classifiers) are created throughout the algorithm’s execution. Thus, instead of using the best constructed solution as the final output classifier of the ACO algorithm, one can select a set of top solutions in the colony to compose an ensemble of classifiers. Nonetheless, the crucial aspect of this approach is to make sure that, at the end of the ACO algorithm’s execution, the colony has a set of solutions with high diversity among them, instead of having very similar solutions that are (nearly) converged to the best solution. In order to promote this, the exploration behaviour should be enforced from the beginning of the algorithm’s execution to increase diversity among the constructed solutions, and this diversity should be maintained until the end of the algorithm.

At the end of execution of ACO ${}_{\mathbb{R}}$ -IBL, there are several ways that the $\left(|C|+1\right)$ archives can be used to construct an ensemble of classifiers. The approach we employ in the present work is the simplest approach, and consists of constructing an ensemble of $R$ classifiers, where the $i$ -th classifier in the ensemble, for $i=1,\ldots,R$ , is composed of the $i$ -th ranked element of each of the $(|C|+1)$ archives. Each test instance is presented to each of the $R$ classifiers, the output of each classifier is recorded, and a majority vote is performed to decide the final predicted class label of the instance.
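This construction can be sketched in a few lines of Python. The archives are assumed to be lists sorted best first, and `classify` is a hypothetical callable standing in for any of the three lazy classifiers configured with a key parameter value and a set of class-based weights.

```python
from collections import Counter

def build_ensemble(param_archive, weight_archives):
    """Member i combines the i-th ranked entry of A_0 with the i-th ranked
    entry of every class archive A_l (weight_archives maps label -> archive)."""
    R = len(param_archive)
    return [
        (param_archive[i], {l: A[i] for l, A in weight_archives.items()})
        for i in range(R)
    ]

def ensemble_predict(members, classify, x):
    """Majority vote over the predictions of all R ensemble members."""
    votes = Counter(classify(p, W, x) for p, W in members)
    return votes.most_common(1)[0][0]

# Toy usage with a stand-in classifier that thresholds on the key parameter.
A_0 = [5.0, 7.0, 12.0]
A_l = {"a": [[0.9, 0.1]] * 3, "b": [[0.2, 0.8]] * 3}
members = build_ensemble(A_0, A_l)
print(ensemble_predict(members, lambda p, W, x: "a" if p < 10 else "b", x=None))
```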

Recall that the solutions in the archives are initialized randomly at the beginning of the algorithm, which means that the algorithm begins with (presumably) a diversity of initial (probably poor) solutions in the search space. In order to keep the diversity of the solutions in the archives while evolving better solutions, search diversity should be promoted in the algorithm, which is controlled by parameter $q$ , as discussed in Section 4.

More precisely, when the value of $q$ is small, the probability of selecting the best solution in the archive is very high, and the probability of selecting any other solution in the archive is almost zero (according to Eqs (5) and (6)), which makes the algorithm very exploitative. This is illustrated in Fig. 3. For an archive of size $R=50$ , let $r_{k}$ denote the ratio of the probability of selection of the highest-quality solution in the archive (with rank 1) to the probability of selection of the solution in the archive of rank $k$ . Figure 3 shows a plot in which the $x$ -axis represents the parameter $q$ and the $y$ -axis represents the corresponding ratio $r_{50}$ . We can see that the relative likelihood of selecting the worst solution in the archive depends dramatically on $q$ . This is further illustrated in Fig. 4. In this graph, the $x$ -axis is the rank $k$ of a solution in the archive, and the $y$ -axis is the ratio $r_{k}$ . The graph shows plots for several values of $q$ .

Figure 3.

Plot of the ratio $r_{50}$ ( $y$ -axis) versus value of the parameter $q$ ( $x$ -axis), using a logarithmic scale for the $y$ -axis.

Figure 4.

Plot of solution rank $k$ ( $x$ -axis) versus the ratio $r_{k}$ ( $y$ -axis), for several values of the parameter $q$ , using a logarithmic scale for the $y$ -axis.

The effect of selecting only the best solution is that all the new solutions are generated around the best solution in the search space, and with time, they occupy more and more of the archive. In this case, by the end of the algorithm’s execution, the archive will be full of the same (or very similar) solution, which will be useless for a classifier ensemble. On the other hand, if the value of $q$ is relatively large (as is the case in our computational results), all solutions in the archives will have a non-negligible probability of being selected, thus increasing the diversity of the population.
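The dependence of the ratio $r_{k}$ on $q$ can be reproduced with a short Python sketch. Because Eqs (5) and (6) are not repeated in this section, the sketch assumes the rank-based Gaussian weighting commonly used in ACO ${}_{\mathbb{R}}$ , $\omega_{r}\propto e^{-(r-1)^{2}/(2q^{2}R^{2})}$ , with selection probability proportional to $\omega_{r}$ ; under that assumption $r_{k}=e^{(k-1)^{2}/(2q^{2}R^{2})}$ . The ratio is returned on a base-10 logarithmic scale to avoid overflow for small $q$ .

```python
import math

def rank_weight(rank, q, R):
    # Assumed rank-based weight: Gaussian in the rank, mean 1, std q * R.
    return math.exp(-((rank - 1) ** 2) / (2.0 * (q * R) ** 2)) / (q * R * math.sqrt(2.0 * math.pi))

def selection_probabilities(q, R):
    weights = [rank_weight(r, q, R) for r in range(1, R + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log10_ratio(k, q, R):
    """log10 of r_k = P(select rank 1) / P(select rank k), in closed form."""
    return ((k - 1) ** 2) / (2.0 * (q * R) ** 2) / math.log(10.0)

for q in (0.01, 0.1, 0.25, 0.5):
    print(q, log10_ratio(50, q, R=50))
```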

8. Experimental methodology

We evaluate the performance of three versions of our proposed ACO ${}_{\mathbb{R}}$ -IBL algorithm:

• ACO ${}_{\mathbb{R}}$ - $k$ NN: The ACO ${}_{\mathbb{R}}$ -IBL version with the $k$ nearest neighbours lazy classifier. The key parameter, $k$ , is the number of the nearest neighbours to be used as the effective instances.
• ACO ${}_{\mathbb{R}}$ - $d$ NN: The ACO ${}_{\mathbb{R}}$ -IBL version with the distance-based nearest neighbours lazy classifier. The key parameter, $d$ , is the maximum distance of the neighbours to be used as the effective instances.
• ACO ${}_{\mathbb{R}}$ -GKE: The ACO ${}_{\mathbb{R}}$ -IBL version with the Gaussian kernel estimator lazy classifier. The key parameter, $\sigma$ , is the smoothing (spread) parameter in the Gaussian kernel function.

Table 1
Parameter settings of IBL algorithms used in the experiments

Algorithm	Parameter	Description	Value
ACO ${}_{\mathbb{R}}$ -IBL	$t_{\textit{max}}$	Maximum number of iterations	10,000
	$m$	Number of ants per iteration	1
	$R$	Archive size	25
	$I_{\textit{max}}$	Maximum number of non-improving iterations	50
	$\xi$	Controls speed of convergence	0.85
	$q$	Controls exploration/exploitation	0.25
$k$ -NN	$k$	Number of the nearest neighbours	$\{1,\ldots,21\}$
$d$ -NN	$d$	Maximum distance of a neighbour	$\{0.01,0.05,0.1,0.15,\ldots,1\}$
GKE	$\sigma$	Smoothing parameter in the kernel function	$\{0.01,0.0333,0.0666,0.1,0.1333,\ldots,0.6\}$
ACO- $k$ NN	$k_{\textit{min}}$	Minimum permitted value of $k$ (# of neighbours)	1
	$k_{\textit{max}}$	Maximum permitted value of $k$ (# of neighbours)	21
ACO- $d$ NN	$d_{\textit{min}}$	Minimum permitted value of $d$ (maximum distance)	0.01
	$d_{\textit{max}}$	Maximum permitted value of $d$ (maximum distance)	1
ACO-GKE	$\sigma_{\textit{min}}$	Minimum permitted value of $\sigma$ (smoothing parameter)	0.01
	$\sigma_{\textit{max}}$	Maximum permitted value of $\sigma$ (smoothing parameter)	0.5

In each of the above algorithms, the ant colony meta-heuristic optimizes the key parameter of the algorithm, along with the class-based feature weights. We compare the predictive performance of our ACO versions of the lazy classification algorithms with the conventional versions of these algorithms, where there are no feature weights, and the parameter value of each algorithm is user-supplied. For each of the conventional algorithms, we used around twenty different values for the key parameter associated with the algorithm. The parameter configurations used in the experiments for the various algorithms are shown in Table 1.

In addition, we compared our proposed ant-based algorithms with CIW-NN (discussed in Section 3), which is a state-of-the-art coevolutionary genetic algorithm that integrates instance selection, instance weighting, and feature weighting for nearest neighbours classifiers [21].

For the CIW-NN algorithm, we used the default parameters in [21]. However, for fairness of comparison, we assigned the CIW-NN algorithm and our ACO algorithms the same computational budget. In other words, we set the maximum number of evaluations (i.e., candidate solution creation and evaluation) in the CIW-NN algorithm’s entire execution to $t_{\textit{max}}$ , which is the maximum number of iterations in ACO ${}_{\mathbb{R}}$ -IBL. Since only one solution ( $m=1$ ) is created and evaluated per iteration in ACO ${}_{\mathbb{R}}$ -IBL, the overall maximum number of evaluations in both algorithms is the same. Note, however, that the maximum number might not be utilized completely; both ACO and evolutionary algorithms might use a smaller number of iterations if they converge earlier.

Table 2
Characteristics of the datasets used in the experiments

Dataset	Instances	Classes	Features		
			Total	Numeric	Categorical
annealing	896	6	38	9	29
audiology	200	24	70	0	70
balance	625	3	4	0	4
breast-l	283	2	9	0	9
breast-p	198	2	32	32	0
breast-tissue	106	6	9	9	0
breast-w	569	2	30	30	0
car	1,728	4	6	0	6
chess	3,196	2	36	0	36
credit-a	690	2	14	6	8
credit-g	1,000	2	20	7	13
cylinder	540	2	35	19	16
dermatology	366	6	34	1	33
ecoli	336	8	7	7	0
hay	132	3	4	0	4
heart-c	303	5	13	7	6
heart-h	293	5	13	7	6
hepatitis	155	2	19	6	13
ionosphere	351	2	34	34	0
iris	150	3	4	4	0
liver-disorders	345	2	6	6	0
lymphography	148	4	18	3	15
monks	556	2	6	0	6
mushrooms	8,124	2	22	0	22
nursery	12,960	5	8	0	8
parkinsons	195	2	22	22	0
pima	768	2	8	8	0
pop	90	3	8	0	8
s-heart	270	2	13	6	7
segmentation	2,273	7	19	19	0
soybean	307	19	35	0	35
transfusion	722	2	4	4	0
ttt	958	2	9	0	9
voting	425	2	16	0	16
wine	178	3	13	13	0
zoo	101	7	16	0	16

The experiments were carried out using the stratified ten-fold cross validation procedure. In essence, a dataset is divided into ten mutually exclusive partitions (folds), with approximately the same number of instances in each partition. Then, each classification algorithm is run ten times, where each time a different partition is used as the test set and the other nine partitions are used as the training set. The results (predictive accuracy rate on the test set) are then averaged and reported as the accuracy rate of the classifier. Since ACO ${}_{\mathbb{R}}$ -IBL and CIW-NN are stochastic algorithms, we run each ten times – using a different random seed to initialize the search each time – for each of the ten iterations of the cross-validation procedure (i.e., 100 runs in total, for each dataset). In the case of the deterministic algorithms, each one is run just once for each iteration of the cross-validation procedure.
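A stratified partition of this kind can be produced as in the Python sketch below. It is an illustrative construction only (indices of each class are shuffled and dealt to the folds in round-robin order); the function names and the seed handling are assumptions, and the experiments may have used a different implementation.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Return a fold index (0..n_folds-1) for every instance, keeping the
    class proportions approximately equal across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for idx in indices:  # deal the instances of this class round-robin
            fold_of[idx] = next_fold
            next_fold = (next_fold + 1) % n_folds
    return fold_of

def train_test_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    fold_of = stratified_folds(labels, n_folds, seed)
    for f in range(n_folds):
        test = [i for i, g in enumerate(fold_of) if g == f]
        train = [i for i, g in enumerate(fold_of) if g != f]
        yield train, test

labels = ["a"] * 60 + ["b"] * 40
for train, test in train_test_splits(labels):
    print(len(train), len(test))
```

For a stochastic algorithm, each of the ten training/test splits would additionally be repeated with ten different random seeds, giving the 100 runs per dataset mentioned above.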

The performance of our ACO ${}_{\mathbb{R}}$ -IBL algorithms was evaluated using 36 public-domain datasets from the well-known University of California at Irvine (UCI) dataset repository [9]. The main characteristics of the datasets are shown in Table 2.

9. Computational results

9.1 Predictive accuracy results

9.1.1 Experiment A

Figure 5.

Experiment A: Plot of average predictive accuracy rank ( $y$ -axis) versus value of the $k$ -NN parameter $k$ ( $x$ -axis).

Figure 6.

Experiment A: Plot of average predictive accuracy rank ( $y$ -axis) versus value of the $d$ -NN parameter $d$ ( $x$ -axis).

Figure 7.

Experiment A: Plot of average predictive accuracy rank ( $y$ -axis) versus value of the GKE parameter $\sigma$ ( $x$ -axis).

In Experiment A, the classical $k$ -NN algorithm was run, on the 36 datasets listed in Table 2, with 21 different values of the key parameter $k$ , for

$\displaystyle k\in\{1,2,\ldots,21\}$ (16)

using 10-fold cross-validation as described in Section 8. The average rank of each setting of $k$ was obtained. The average rank for a given setting is obtained by first computing its rank on each dataset individually. In case two or more settings have the same accuracy on a given dataset, the tied settings are given the average of the ranks they span. The individual ranks are then averaged across all datasets to obtain the overall average rank for each setting. Note that the lower the value of the rank, the better. Figure 5 shows a plot of average rank versus value of the parameter $k$ , and indicates that the best average rank is obtained with $k=12$ .
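The ranking procedure, including the treatment of ties, can be sketched in Python as follows. The function names and the toy accuracy values are hypothetical and are not taken from the experimental results.

```python
def ranks_with_ties(accuracies):
    """Rank 1 = highest accuracy; tied settings share the average of the ranks they span."""
    order = sorted(range(len(accuracies)), key=lambda i: -accuracies[i])
    ranks = [0.0] * len(accuracies)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and accuracies[order[end + 1]] == accuracies[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2.0  # average of ranks pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

def average_ranks(table):
    """table maps dataset -> list of accuracies, one per parameter setting."""
    per_dataset = [ranks_with_ties(row) for row in table.values()]
    n = len(per_dataset)
    return [sum(col) / n for col in zip(*per_dataset)]

toy = {"dataset-1": [90.0, 85.0, 90.0, 70.0], "dataset-2": [60.0, 80.0, 70.0, 70.0]}
print(average_ranks(toy))
```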

The $d$ -NN algorithm was similarly run with 21 different settings of $d$ , with

$\displaystyle d\in\{0.01,0.05,0.10,0.15,\ldots,1.0\}$ (17)

while the GKE algorithm was run with 19 different settings of $\sigma$ , with

$\displaystyle\sigma\in\{0.01,0.0333,0.0666,0.10,0.1333,0.1666,\ldots,0.6\}$ (18)

The average rank of each setting of each algorithm’s key parameter was computed, and is shown in Fig. 6 for $d$ -NN and Fig. 7 for GKE. These figures indicate that the best average rank is obtained with $d=0.45$ for $d$ -NN, and with $\sigma=0.40$ for GKE.

The best-performing parameter settings identified in this experiment ( $k=12$ for $k$ -NN, $d=0.45$ for $d$ -NN, and $\sigma=0.40$ for GKE) will be used in Experiment B.

9.1.2 Experiment B

Table 3
Experiment B: Predictive accuracy (%) results for the two variations of ACO- $k$ NN, and for classical $k$ -NN with the optimized parameter setting identified in Experiment A

Dataset	ACO- $k$ NN		$k$ -NN
	Basic	Ensemble	( $k=$ 12)
annealing	94.20	94.20	72.58
audiology	79.17	79.17	78.33
balance	82.67	85.02	84.83
breast-l	73.78	74.15	73.70
breast-p	75.10	75.10	75.24
breast-tissue	62.91	64.83	69.00
breast-w	96.82	96.82	95.26
car	93.57	94.44	94.56
chess	96.60	96.73	95.88
credit-a	85.81	85.81	86.96
credit-g	73.40	74.07	73.90
cylinder	77.03	77.03	76.76
dermatology	95.62	96.99	97.28
ecoli	77.59	79.09	86.96
hay	67.97	67.97	59.90
heart-c	56.50	57.77	56.76
heart-h	67.03	67.03	65.34
hepatitis	83.25	83.25	82.63
ionosphere	94.26	94.26	86.23
iris	89.29	89.29	92.62
liver-disorders	55.60	63.93	60.56
lymphography	81.82	77.10	81.19
monks	63.25	63.25	52.64
mushrooms	99.99	100.00	99.99
nursery	98.73	98.81	97.37
parkinsons	82.18	82.18	87.68
pima	71.22	71.22	74.21
pop	71.96	71.96	69.46
s-heart	83.70	84.41	82.22
segmentation	91.95	93.87	93.29
soybean	84.48	84.48	82.76
transfusion	69.90	69.88	70.67
ttt	98.95	98.95	97.16
voting	93.40	93.74	89.80
wine	95.90	95.90	94.38
zoo	97.32	97.32	100.00
#wins	14	24	11
rank (avg)	2.14	1.65	2.21

Table 4

Experiment B: Predictive accuracy (%) results for the two variations of ACO- $d$ NN, and for classical $d$ -NN with the optimized parameter setting identified in Experiment A

Dataset	ACO- $d$ NN		$d$ -NN
	Basic	Ensemble	( $d=$ 0.45)
annealing	86.08	90.68	61.60
audiology	72.50	75.83	76.67
balance	88.17	90.67	92.50
breast-l	72.84	72.84	71.05
breast-p	76.74	76.74	73.71
breast-tissue	62.91	62.91	52.91
breast-w	82.04	82.04	87.53
car	91.52	91.52	70.76
chess	94.60	97.55	92.04
credit-a	89.71	89.71	83.62
credit-g	73.90	74.90	72.80
cylinder	69.82	69.82	69.68
dermatology	94.24	94.24	65.56
ecoli	68.25	71.85	52.99
hay	74.74	79.77	75.80
heart-c	58.84	58.84	58.76
heart-h	69.57	69.57	66.38
hepatitis	80.00	80.00	71.04
ionosphere	93.39	93.39	93.97
iris	94.00	95.45	92.62
liver-disorders	57.60	58.60	58.32
lymphography	77.10	76.43	79.76
monks	50.81	45.89	56.48
mushrooms	100.00	98.02	100.00
nursery	90.84	90.76	91.44
parkinsons	82.58	82.58	77.87
pima	65.22	65.82	67.05
pop	73.39	74.64	70.89
s-heart	81.48	81.48	84.07
segmentation	86.70	88.70	86.83
soybean	83.79	83.79	76.55
transfusion	70.13	80.13	72.34
ttt	94.32	94.32	72.00
voting	92.94	91.85	88.66
wine	87.61	87.61	94.38
zoo	92.54	95.89	98.75
#wins	15	23	12
rank (avg)	2.06	1.71	2.24

Table 5

Experiment B: Predictive accuracy (%) results for the two variations of ACO-GKE, and for classical GKE with the optimized parameter setting identified in Experiment A

Dataset	ACO-GKE		GKE
	Basic	Ensemble	( $\sigma=$ 0.4)
annealing	83.76	83.76	65.18
audiology	85.00	85.00	70.93
balance	85.67	91.67	77.43
breast-l	72.84	72.84	66.30
breast-p	72.58	75.58	67.84
breast-tissue	53.36	56.36	61.60
breast-w	89.36	89.36	87.86
car	93.69	93.69	87.16
chess	96.98	97.10	88.48
credit-a	79.42	79.42	79.56
credit-g	74.40	74.40	66.50
cylinder	66.92	66.92	69.36
dermatology	93.69	93.77	89.88
ecoli	70.81	72.69	80.46
hay	75.47	75.47	52.50
heart-c	57.77	57.77	49.36
heart-h	67.03	67.03	57.94
hepatitis	81.25	81.25	75.23
ionosphere	81.92	81.92	78.83
iris	91.95	91.95	85.22
liver-disorders	63.93	63.93	53.16
lymphography	83.81	83.81	73.79
monks	63.43	63.43	45.24
mushrooms	100.00	100.00	92.59
nursery	97.94	98.88	89.97
parkinsons	83.53	83.53	80.28
pima	72.14	72.14	66.81
pop	73.39	73.39	62.06
s-heart	85.93	85.93	74.82
segmentation	68.05	68.05	85.89
soybean	87.59	87.59	75.36
transfusion	72.86	68.86	63.27
ttt	99.16	99.16	89.76
voting	91.74	92.74	82.40
wine	83.17	83.17	86.98
zoo	98.75	98.75	92.60
#wins	24	29	6
rank (avg)	1.76	1.57	2.67

Table 6

Experiment B: Results of applying the Friedman test with the Holm post hoc test to the $k$ -NN results of Table 3. In each case, the difference is statistically significant, at the 0.05 threshold, if the reported $p$ value is less than or equal to the corresponding Holm threshold

Comparison	$p$	Holm	sig.?
Ens-ACO- $k$ NN vs. $k$ -NN (best $k$ )	0.0184	0.025	yes
Ens-ACO- $k$ NN vs. ACO- $k$ NN	0.0392	0.05	yes

Table 7

Experiment B: Results of applying the Friedman test with the Holm post hoc test to the $d$ -NN results of Table 4. In each case, the difference is statistically significant, at the 0.10 threshold, if the reported $p$ value is less than or equal to the corresponding Holm threshold

Comparison	$p$	Holm	sig.?
Ens-ACO- $d$ NN vs. $d$ -NN (best $d$ )	0.0251	0.05	yes
Ens-ACO- $d$ NN vs. ACO- $d$ NN	0.1407	0.1	no

Table 8

Experiment B: Results of applying the Friedman test with the Holm post hoc test to the GKE results of Table 5. In each case, the difference is statistically significant, at the 0.05 threshold, if the reported $p$ value is less than or equal to the corresponding Holm threshold

Comparison	$p$	Holm	sig.?
Ens-ACO-GKE vs. GKE (best $\sigma$ )	3.2E-6	0.025	yes
Ens-ACO-GKE vs. ACO-GKE	0.4094	0.05	no

In Experiment B, our basic ACO- $k$ NN algorithm is compared to its ensemble-based version (which we abbreviate as Ens-ACO- $k$ NN) and to classical $k$ -NN using the optimized parameter setting identified in Experiment A. Table 3 reports the predictive accuracy for these three algorithms. For Tables 3–5, as well as Table 9, the highest accuracy for each dataset is shown in boldface, the last row in each table reports the average ranks of the algorithms being compared, and the penultimate row reports the number of datasets for which each algorithm had the highest accuracy. Tables 4–5 report the analogous results for $d$ -NN and GKE.

From Table 3, we note that the highest accuracy was obtained by one of the two versions of ACO- $k$ NN in 25 of the 36 datasets. Ens-ACO- $k$ NN performed better than ACO- $k$ NN in 15 datasets, and worse in only 2 datasets, with 19 ties. The best average rank was obtained by Ens-ACO- $k$ NN with an average rank of 1.65, followed by ACO- $k$ NN with an average rank of 2.14, followed by $k$ -NN with the optimized parameter setting identified in Experiment A with an average rank of 2.21. A Friedman test with the Holm post hoc test is applied to the predictive accuracy results reported in Table 3. The Friedman statistic $\chi^{2}_{F}$ is computed to be 6.60 with 2 degrees of freedom, corresponding to a $p$ value of 0.0369, indicating that the null hypothesis can be rejected at the 0.05 threshold. Table 6 reports the results of the Holm post-hoc tests (at the 0.05 threshold). For Tables 6–8, there is a statistically significant difference for each pair of algorithms being compared if the reported value of $p$ is less than or equal to the corresponding Holm threshold. Statistically significant values of $p$ are shown in boldface. Table 6 indicates that Ens-ACO- $k$ NN is significantly better than each of ACO- $k$ NN and classical $k$ -NN with the optimized parameter setting identified in Experiment A.
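The Holm thresholds listed in Table 6 (0.025 and 0.05) are consistent with the usual step-down form of the procedure, in which the $i$ -th smallest of $m$ $p$ values is compared with $\alpha/(m-i+1)$ and testing stops at the first comparison that fails. The following Python sketch assumes that standard form; the function name `holm_step_down` is illustrative and does not come from the paper.

```python
from typing import List, Tuple


def holm_step_down(p_values: List[float],
                   alpha: float = 0.05) -> List[Tuple[float, float, bool]]:
    """Return (p, Holm threshold, significant?) in ascending order of p."""
    m = len(p_values)
    results = []
    still_rejecting = True
    for i, p in enumerate(sorted(p_values)):
        threshold = alpha / (m - i)  # alpha/m for the smallest p, then alpha/(m-1), ...
        significant = still_rejecting and p <= threshold
        if not significant:
            # Once one hypothesis is retained, all larger p values are retained too.
            still_rejecting = False
        results.append((p, threshold, significant))
    return results


# The two p values reported in Table 6, tested at the 0.05 threshold.
table6 = holm_step_down([0.0184, 0.0392], alpha=0.05)
```

Under this assumption both comparisons of Table 6 are flagged as significant, in agreement with the table.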

For $d$ -NN, we observe from Table 4 that the highest accuracy was obtained by one of the two versions of ACO- $d$ NN in 25 of the 36 datasets. Ens-ACO- $d$ NN performed better than ACO- $d$ NN in 14 datasets, and worse in 5 datasets, with 17 ties. The best average rank was obtained by Ens-ACO- $d$ NN with an average rank of 1.71, followed by ACO- $d$ NN with an average rank of 2.06, followed by $d$ -NN with the optimized parameter setting identified in Experiment A with an average rank of 2.24. A Friedman test is applied to the predictive accuracy results reported in Table 4. The Friedman statistic $\chi^{2}_{F}$ is computed to be 5.18 with 2 degrees of freedom, corresponding to a $p$ value of 0.0750, indicating that the null hypothesis cannot be rejected at the 0.05 threshold, but can be rejected at the 0.10 threshold. Table 7 reports the results of the Holm post-hoc tests at the 0.10 threshold, and indicates that Ens-ACO- $d$ NN is significantly better (at the 0.10 threshold) than classical $d$ -NN with the optimized parameter setting identified in Experiment A, but is not significantly better than ACO- $d$ NN.

For GKE, we observe from Table 5 that the highest accuracy was obtained by one of the two versions of ACO-GKE in 30 of the 36 datasets. Ens-ACO-GKE performed better than ACO-GKE for 8 datasets, and worse for only one dataset, with 27 ties. The best average rank was obtained by Ens-ACO-GKE with an average rank of 1.57, followed by ACO-GKE with an average rank of 1.76, followed by GKE with the optimized parameter setting identified in Experiment A with an average rank of 2.67. A Friedman test with the Holm post hoc test is applied to the predictive accuracy results reported in Table 5. The Friedman statistic $\chi^{2}_{F}$ is computed to be 24.68 with 2 degrees of freedom, corresponding to a $p$ value of 4.4E-6, indicating that the null hypothesis can be rejected. Table 8 reports the results of the Holm post-hoc tests at the 0.05 threshold, and indicates that Ens-ACO-GKE is significantly better than classical GKE with the optimized parameter setting identified in Experiment A, but is not significantly better than ACO-GKE.
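The three $p$ values quoted above can be checked against the reported Friedman statistics if one assumes the chi-square approximation with 2 degrees of freedom, for which the upper-tail probability has the closed form $\exp(-\chi^{2}_{F}/2)$ . The short Python sketch below relies on that assumption; the function name is illustrative.

```python
import math


def chi2_upper_tail_2df(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# Friedman statistics reported for Tables 3, 4 and 5, respectively.
for label, stat in [("k-NN", 6.60), ("d-NN", 5.18), ("GKE", 24.68)]:
    print(label, chi2_upper_tail_2df(stat))
```

The values obtained this way agree, after rounding, with the reported 0.0369, 0.0750 and 4.4E-6.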

9.1.3 Experiment C

Table 9
Experiment C: Predictive accuracy (%) results for our proposed ACO-based algorithms and for the CIW-NN coevolutionary algorithm

Dataset	ACO- $k$ NN		ACO- $d$ NN		ACO-GKE		CIW-NN
	Basic	Ensemble	Basic	Ensemble	Basic	Ensemble
annealing	94.20	94.20	86.08	90.68	83.76	83.76	97.84
audiology	79.17	79.17	72.50	75.83	85.00	85.00	81.67
balance	82.67	85.02	88.17	90.67	85.67	91.67	86.17
breast-l	73.78	74.15	72.84	72.84	72.84	72.84	69.03
breast-p	75.10	75.10	76.74	76.74	72.58	75.58	75.76
breast-tissue	62.91	64.83	62.91	62.91	53.36	56.36	68.82
breast-w	96.82	96.82	82.04	82.04	89.36	89.36	95.61
car	93.57	94.44	91.52	91.52	93.69	93.69	95.56
chess	96.60	96.73	94.60	97.55	96.98	97.10	97.74
credit-a	85.81	85.81	89.71	89.71	79.42	79.42	80.58
credit-g	73.40	74.07	73.90	74.90	74.40	74.40	70.20
cylinder	77.03	77.03	69.82	69.82	66.92	66.92	79.18
dermatology	95.62	96.99	94.24	94.24	93.69	93.77	95.62
ecoli	77.59	79.09	68.25	71.85	70.18	72.69	71.30
hay	67.97	67.97	74.74	79.77	75.47	75.47	73.08
heart-c	56.50	57.77	58.84	58.84	57.77	57.77	54.84
heart-h	67.03	67.03	69.57	69.57	67.03	67.03	62.03
hepatitis	83.25	83.25	80.00	80.00	81.25	81.25	76.79
ionosphere	94.26	94.26	93.39	93.39	81.92	81.92	64.48
iris	89.29	89.29	94.00	95.45	91.95	91.95	93.33
liver-disorders	55.60	63.93	57.60	58.60	63.93	63.93	63.19
lymphography	81.82	77.10	77.10	76.43	83.81	83.81	70.90
monks	63.25	63.25	50.81	45.89	63.43	63.43	59.27
mushrooms	99.99	100.00	100.00	98.02	100.00	100.00	100.00
nursery	98.73	98.81	90.84	90.76	97.94	98.88	96.50
parkinsons	82.18	82.18	82.58	82.58	83.53	83.53	96.42
pima	71.22	71.22	65.22	65.82	72.14	72.14	69.28
pop	71.96	71.96	73.39	74.64	73.39	73.39	70.00
s-heart	83.70	84.41	81.48	81.48	85.93	85.93	75.56
segmentation	91.95	93.87	86.70	88.70	68.05	68.05	84.52
soybean	84.48	84.48	83.79	83.79	87.59	87.59	82.07
transfusion	69.90	69.88	70.13	80.13	72.86	68.86	68.00
ttt	98.95	98.95	94.32	94.32	99.16	99.16	85.26
voting	93.40	93.74	92.94	91.85	91.74	92.74	93.99
wine	95.90	95.90	87.61	87.61	83.17	83.17	97.16
zoo	97.32	97.32	92.54	95.89	98.75	98.75	98.75
#wins	3	9	5	9	10	12	10
rank (avg)	4.06	3.39	4.57	4.08	4.00	3.61	4.29

Table 10

Experiment C: Results of applying a (one-tailed) Wilcoxon signed-ranks test comparing Ens-ACO- $k$ NN (the highest-ranked of our proposed algorithms) to CIW-NN, a state-of-the-art coevolutionary algorithm

Comparison	$N$	$z$	$p$	sig.?
Ens-ACO- $k$ NN vs. CIW-NN	35	$-$ 2.05	0.0202	yes

The main results of this paper are reported in Table 9, which compares our six ACO algorithms to each other and to CIW-NN, a recent state-of-the-art algorithm. The average ranks indicate that Ens-ACO- $k$ NN has the best predictive accuracy of the seven algorithms compared in this table (average rank 3.39). The second- and third-best average ranks belong to Ens-ACO-GKE (3.61) and ACO-GKE (4.00), respectively. ACO- $k$ NN (4.06) and Ens-ACO- $d$ NN (4.08) are in fourth and fifth place, respectively. In next-to-last place is CIW-NN (4.29), followed by ACO- $d$ NN (4.57) in last place.

Focusing our comparison on Ens-ACO- $k$ NN (the highest-ranked of our proposed ACO algorithms) and CIW-NN (a state-of-the-art coevolutionary algorithm), we find that Ens-ACO- $k$ NN has better accuracy than CIW-NN in 21 of the 36 datasets, and worse in 14 of the 36 datasets, with a single tie. As reported in Table 10, a (one-tailed) Wilcoxon signed-ranks test was used to compare these two algorithms. The resulting $p$ -value is 0.0202, indicating that the results for Ens-ACO- $k$ NN are significantly better than those for CIW-NN.
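Table 10 lists $N=35$ although 36 datasets were used. This is consistent with the common convention of discarding tied pairs (here, the single tie) before ranking, although the text does not state this explicitly. The Python sketch below, which assumes that convention and the normal approximation to the test statistic, tallies the paired outcomes and converts a $z$ value into a one-tailed $p$ value; both function names are illustrative.

```python
import math
from typing import Sequence, Tuple


def tally(a: Sequence[float], b: Sequence[float]) -> Tuple[int, int, int]:
    """Count datasets where a beats b, loses to b, or ties with b."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    ties = len(a) - wins - losses
    return wins, losses, ties


def lower_tail_p(z: float) -> float:
    """P(Z <= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Three rows of Table 9 (annealing, breast-l, mushrooms):
# Ens-ACO-kNN accuracy versus CIW-NN accuracy.
wins, losses, ties = tally([94.20, 74.15, 100.00], [97.84, 69.03, 100.00])
effective_n = wins + losses  # tied pairs are dropped under the assumed convention

p_one_tailed = lower_tail_p(-2.05)  # z value reported in Table 10
```

With the reported $z=-2.05$ , this conversion yields a value that rounds to the reported 0.0202.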

9.2 Interpreting the class-based feature weights

In addition to potentially improved accuracy, another advantage of our use of class-based feature weights is increased interpretability. As discussed in [28], feature weighting in nearest neighbours classifiers can provide an important form of knowledge that is interpretable by many users. In many applications, users may need to see the nearest neighbours of the instance being classified in order to get some justification or explanation for the class being predicted for that instance. Such an explanation might be needed, e.g., for legal reasons. For instance, a customer whose credit card application is rejected by a bank based on the prediction of a nearest neighbours classifier might have the legal right to know why her/his application was rejected. As another example, in this case requiring interpretability because human lives are at stake, if a medical doctor’s diagnosis of some cancer is supported by a nearest neighbours classifier, the doctor should not blindly trust the prediction of the classifier; she/he should interpret the nearest neighbours to see if the prediction makes sense from a medical perspective. In many applications, however, it is not practical for the user to examine all feature values in the nearest neighbours, since the number of features can be very large and many features have a small weight (relevance) for a given prediction. Hence, in practice, when a user asks to see the nearest neighbours that were used to classify the current test instance, providing the features in decreasing order of weight (relevance) can help the user to focus her/his attention on the most relevant features.

Our use of class-based feature weights takes this idea one step further by allowing for the representation of the fact that a feature may be more relevant to some classes than to others. By examining the class-based feature weights of the trained classifier, there is the potential to discover interesting patterns, such as features that have very different weight values for different classes, and features that have low weight values for all the classes (which would be good candidates for removal in a feature selection process). These types of observations can be useful to domain experts either for purposes of validating the classifier, or for gaining insight into the problem domain.

Figure 8.

Illustration of class-based feature weights for the car dataset. We observe that the feature maint (price of maintenance) has limited significance for the classes unacceptable and good, but is significant for the classes acceptable and very-good. The feature buying (buying price) is less significant for the class very-good than for the other classes.

Figure 9.

Illustration of class-based feature weights for the pop (Post-Operative Patient) dataset. We observe that the feature CORE-STBL (stability of patient core temperature) is not relevant to class home (patient sent home from hospital), but is relevant to the other two classes, and the feature SURF-STBL (stability of patient surface temperature) is least relevant to the class ICU (patient sent to intensive care) and most relevant to the class general (patient sent to general hospital floor).

Figure 10.

Illustration of class-based feature weights for the iris dataset. We observe that sepal-width has less importance for the versicolor class than for the other two classes; furthermore, the sepal-length and petal-width features both have greater relevance to the versicolor class than to the other two classes.

To illustrate this, we show, in Figs 8–10, a graphical representation of the class-based feature weights for a sample run (randomly selected out of the 100 runs of the 10-times 10-fold cross-validation procedure) of the ACO- $k$ NN algorithm for each of several datasets. In each case, the weights are normalized so that the sum of the weights is 100%. Each graph is a stacked bar-graph in which the bars represent the class labels, and the components of each bar correspond to the feature weights. The names of the features are given in the legend for each figure.
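The preparation of such a plot can be sketched as follows. The Python fragment is only an illustration under stated assumptions: the normalization is taken to be applied separately to each class (so that each bar sums to 100%), the weight values are hypothetical, and all function names are invented for this example. It also shows the two uses of the weights discussed above, namely listing features in decreasing order of weight and flagging features whose weight is low for every class.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, Dict[str, float]]  # class label -> feature name -> weight


def normalise_per_class(weights: Weights) -> Weights:
    """Rescale each class's feature weights so that they sum to 100 (percent)."""
    normalised: Weights = {}
    for label, feature_weights in weights.items():
        total = sum(feature_weights.values())
        if total <= 0:
            raise ValueError(f"weights for class {label!r} must have a positive sum")
        normalised[label] = {f: 100.0 * w / total
                             for f, w in feature_weights.items()}
    return normalised


def features_by_relevance(feature_weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """List (feature, weight) pairs in decreasing order of weight."""
    return sorted(feature_weights.items(), key=lambda item: item[1], reverse=True)


def low_for_all_classes(weights: Weights, cutoff: float) -> List[str]:
    """Features whose weight is below `cutoff` for every class."""
    features = next(iter(weights.values())).keys()
    return [f for f in features if all(weights[c][f] < cutoff for c in weights)]


# Hypothetical raw weights for two classes (A, B) and three features (f1..f3).
raw = {"A": {"f1": 0.0, "f2": 1.0, "f3": 3.0},
       "B": {"f1": 2.0, "f2": 0.5, "f3": 1.5}}
shares = normalise_per_class(raw)
```

In this toy example, feature f1 has no weight for class A but half of the total weight for class B, which is the kind of class-dependent pattern that the figures are meant to expose.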

Let us consider Fig. 8, illustrating the car dataset. This dataset, derived from a simple hierarchical decision model, has four class labels (very-good, good, acceptable, unacceptable) representing possible evaluations of a car, based on six categorical features: buying price (buying), cost of maintenance (maint), number of doors (doors), carrying capacity of car in terms of number of persons (persons), size of the luggage boot (lug-boot), and estimated safety of the car (safety). From Fig. 8, we can make a number of meaningful observations. For example, the feature maint (cost of maintenance) has limited significance for the classes unacceptable and good, but is significant for the classes acceptable and very-good. The feature buying (buying price) is less significant for the class very-good than for the other classes.

Consider next Fig. 9, which illustrates the Post-Operative Patient (pop) dataset. Each instance in this dataset corresponds to a post-operative patient. The features represent readings related to the patient, and the class labels are: patient prepared to go home (home), patient sent to general hospital floor (general), and patient sent to intensive care unit (ICU). By a simple examination of the figure, we can make several observations: the learned classifier does not consider the feature CORE-STBL (stability of patient core temperature) to be relevant to the class home (patient sent home from hospital), but considers it relevant to the other two classes; the classifier considers the feature SURF-STBL (stability of patient surface temperature) to be less relevant to the class ICU (patient sent to intensive care) and more relevant to the class general (patient sent to general hospital floor). These types of observations can be useful to domain experts either for purposes of validating the classifier, or for gaining insight into the problem domain.

As a final example, consider Fig. 10, which illustrates the iris dataset, one of the most well-known datasets in machine learning. In this dataset, each instance corresponds to an iris plant, the features represent physical measurements of various aspects of the plant, and the classes (versicolor, virginica, and setosa) represent three different types of plant. We can observe that sepal-width has less importance for the versicolor class than for the other two classes; the sepal-length and petal-width features both have greater relevance to the versicolor class than to the other two classes.

The reader should note that our objective, in this section, is not to actually extract domain knowledge from the discovered class-based feature weights illustrated in Figs 8–10, but rather to give an example of how these weights can be useful to a domain expert in validating a classifier, or even potentially in gaining insight into the problem domain.

10. Discussion & conclusions

In this paper, we have made several contributions, which can be summarized as follows:

Class-based feature weights (Section 5). We have proposed a new class-based feature-weighting scheme in which each feature has a different real-valued weight for each target class label. This allows each feature to have a different level of emphasis with respect to each class label. In addition to improving predictive accuracy, this scheme has the added advantage that it provides comprehensible knowledge regarding the relative importance of each feature to each class label, which improves the interpretability of the learned classification model.

A multi-archive adaptation of ACO ${}_{\mathbb{R}}$ (Section 6), which is used to optimize the class-based feature weights in the $k$ nearest neighbours classifier, the distance-based nearest-neighbours ( $d$ -NN) classifier and the Gaussian kernel estimator (GKE) nearest neighbours classifier, as well as the key classifier parameter in each of these classifiers, i.e. $k$ in $k$ -NN, $d$ in $d$ -NN, and $\sigma$ in GKE.

The use of an ensemble of classifiers based on the ACO population (Section 7). Rather than using the best constructed solution as the final output classifier of the ACO algorithm, we treat the entire final ACO population as an ensemble of classifiers, employing a majority vote to determine the final predicted class label.

An extensive experimental evaluation, using 36 benchmark datasets and a rigorous 10-times 10-fold cross-validation regimen, of our six proposed algorithms, in addition to the CIW-NN algorithm [21], a state-of-the-art coevolutionary algorithm. As mentioned in Section 3, Derrac et al. [21] have compared CIW-NN to a large number of competing approaches from the literature and found it to have superior predictive accuracy.
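The majority-vote step of the ensemble contribution listed above can be sketched as follows. This Python fragment is a minimal illustration, not the implementation of Section 7: each ensemble member is modelled as a plain callable, and ties are broken in favour of the label predicted by the earliest member, which is an assumption made here for determinism.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

Classifier = Callable[[Sequence[float]], Hashable]


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the label seen first."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:  # scan in member order to break ties
        if counts[label] == top:
            return label


def ensemble_predict(classifiers: Sequence[Classifier],
                     instance: Sequence[float]) -> Hashable:
    """Classify `instance` with every member and combine by majority vote."""
    return majority_vote([clf(instance) for clf in classifiers])
```

Here each callable would stand for one classifier built from one member of the final ACO population.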

In general, from the results presented in Section 9, we observe that none of the seven algorithms compared in Table 9 performed better than the others on all datasets. For each algorithm, there was at least one dataset for which it had the highest predictive accuracy, and at least one dataset for which it had the lowest predictive accuracy.

It is interesting to relate the relative performance of the three versions (Ens-ACO- $k$ NN, Ens-ACO- $d$ NN, and Ens-ACO-GKE) of our ensemble-based ACO-IBL algorithm to some basic characteristics of the datasets. Considering all 36 datasets, Ens-ACO- $k$ NN performed best, followed by Ens-ACO-GKE and then Ens-ACO- $d$ NN. When we restrict our attention to the 18 two-class datasets, the pattern is the same: Ens-ACO- $k$ NN is best, followed by Ens-ACO-GKE and then Ens-ACO- $d$ NN. The pattern is also the same for the 18 datasets with more than two classes: Ens-ACO- $k$ NN performs best, followed by Ens-ACO-GKE and then Ens-ACO- $d$ NN. For the 12 datasets with only numeric features, Ens-ACO-GKE is best, followed by Ens-ACO- $k$ NN and then Ens-ACO- $d$ NN. For the 14 datasets with only categorical features, Ens-ACO- $k$ NN is best, followed by Ens-ACO- $d$ NN and then Ens-ACO-GKE. For the 10 datasets with a mix of categorical and numeric features, Ens-ACO- $k$ NN is best, followed by Ens-ACO- $d$ NN and then Ens-ACO-GKE.

A related and worthwhile future research direction is to seek to predict the relative performance of different data mining algorithms based on dataset characteristics, which is an issue addressed in the area of meta-learning [15]. For example, one might find that Algorithm $X$ is better for datasets with a large number of class labels, that Algorithm $Y$ is better for datasets where the ratio of categorical to continuous features is large (or small), or that Algorithm $Z$ is better for datasets where the instance sparsity in the data space is low (or high). In the same vein, one might seek to automatically select the proximity measure best suited to the dataset at hand. In the present work, we have used Euclidean distance as our proximity measure. However, numerous proximity measures have, of course, been studied in the literature [18, 75]. An interesting meta-learning task would be to predict the relative performance of each based on measurable characteristics of the problem domain or dataset.

From Table 9, we observe that five of our proposed six ACO algorithms (including all three ensemble-based ACO algorithms) had a better overall average ranking than CIW-NN. Comparing our best ACO algorithm (Ens-ACO- $k$ NN) to CIW-NN, we note that Ens-ACO- $k$ NN had better predictive accuracy on 21 datasets, worse on 14 datasets, and the same on a single dataset. As shown in Table 10, a Wilcoxon signed-ranks test indicates that Ens-ACO- $k$ NN has significantly better predictive accuracy than CIW-NN.

In addition, Ens-ACO- $k$ NN has the added advantage of better interpretability of the classifiers because of the more fine-grained class-based feature weights, as was illustrated in Figs 8–10. On the other hand, CIW-NN has the advantage that it also includes Instance Selection in addition to Feature Weighting and a class-based Instance Weighting approach, while our approach does not include any Instance Selection. Instance Selection has a number of benefits: the unselected instances can be interesting for analysis per se; the reduced training set is less sensitive to outliers and noisy data; processing time in the classification phase is reduced (fewer instances to scan); and less storage is required. In future work, we would like to investigate incorporating data reduction mechanisms (Instance Selection as well as Feature Selection) within our ACO approach.

One advantage of the ensemble of classifiers approach presented in Section 7 is that it improves predictive accuracy in a majority of datasets, as indicated in our experimental results. However, the disadvantage is a decrease in interpretability due to producing an ensemble with different (to some extent contradictory) sets of class-based feature weights, rather than a single set of class-based feature weights.

It is possible to classify our multi-archive adaptation of ACO ${}_{\mathbb{R}}$ as a cooperative coevolutionary [56] approach. Using the terminology of coevolution, the $(|C|+1)$ archives can be considered symbiotic coevolving populations. An element of a symbiotic population cannot be evaluated by itself; one element from each population must first be selected to form a complete “super-organism” before a quality evaluation function can be applied. In the approach we follow here, an element is selected from each population using the standard roulette-wheel mechanisms of ACO ${}_{\mathbb{R}}$ , and fitness is assigned to each element independently of the others, as described in Section 6.
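The selection step mentioned above can be illustrated with a short Python sketch. It assumes the rank-based Gaussian weighting commonly given for ACO ${}_{\mathbb{R}}$ , in which the element of rank $j$ in an archive of size $K$ receives a weight proportional to $\exp(-(j-1)^{2}/(2q^{2}K^{2}))$ , and it assumes each archive is sorted from best to worst. The parameter $q$ , the archive contents, and all function names are placeholders chosen for this example; the details of the actual procedure are those of Section 6. Only the selection of one element per archive is shown, not the subsequent sampling or quality evaluation.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def rank_weights(archive_size: int, q: float) -> List[float]:
    """Gaussian rank weights; rank 1 (the best element) gets the largest weight."""
    k = archive_size
    scale = q * k * math.sqrt(2.0 * math.pi)
    return [math.exp(-((rank - 1) ** 2) / (2.0 * q * q * k * k)) / scale
            for rank in range(1, k + 1)]


def roulette_pick(weights: Sequence[float], rng: random.Random) -> int:
    """Pick an index with probability proportional to its weight."""
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off


def select_one_per_archive(archives: Sequence[Sequence[T]], q: float,
                           rng: random.Random) -> List[T]:
    """Select one element from each archive (each sorted best-first)."""
    return [archive[roulette_pick(rank_weights(len(archive), q), rng)]
            for archive in archives]
```

The list returned by `select_one_per_archive` corresponds to the complete “super-organism” to which a quality evaluation function can then be applied.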

With the classifier ensemble approach, it is important to promote diversity in the population. One possibility for future research is to incorporate opposition-based (or collision-avoiding) mechanisms [11] in the algorithm. For example, the solution construction procedure in ACO ${}_{\mathbb{R}}$ can be modified to include a mechanism that penalizes potential solutions that are too close to existing members of the archive.

Another avenue that we would like to pursue in future work is the use of the Cauchy PDF in the ACO ${}_{\mathbb{R}}$ -IBL algorithm. An additional direction we would like to explore is to combine the three lazy classifiers ( $k$ -NN, $d$ -NN, and GKE) in an ensemble-based approach, as follows. ACO would be used to optimize the class-based feature weights and the three parameters $k$ , $d$ and $\sigma$ . A candidate solution would consist of a single set of class-based feature weights, in addition to values for the three parameters. To evaluate the quality of a constructed solution, three lazy classifiers ( $k$ -NN, $d$ -NN, and GKE) would be produced. Each instance would be classified by each of the three classifiers, and the predicted class of the instance would be decided by majority vote.

Acknowledgments

The partial support of a grant from the Brandon University Research Council (BURC) is gratefully acknowledged.

References

1. Abdelbar A.M., El-Nabarawy, Wunch D.C. and Salama K.M., Ant colony optimization applied to the training of a high order neural network with adaptable exponential weights, in: Applied Artificial Higher Order Neural Networks for Control and Recognition, Zhang, ed., IGI Global Press, Hershey, PA, USA, 2016.

2. Abdelbar A.M. and Salama K.M., A gradient-guided ACO algorithm for neural network learning, in: Proceedings IEEE Swarm Intelligence Symposium (SIS-2015), 2015, pp. 1133–1140.

3. Aha, Lazy Learning, Kluwer, Norwell, MA, USA, 1997.

4. Aha, Kibler and Albert, Instance-based learning algorithms, Machine Learning 6(1) (1991), 37–66.

5. Ahn and Kim K.-J., Bankruptcy prediction modeling with hybrid case-based reasoning and genetic algorithms approach, Applied Soft Computing 9(2) (2009), 599–607.

6. Alpaydin, Introduction to Machine Learning, MIT Press, Cambridge, MA, USA, 2010.

7. Anwar I.M., Salama K.M. and Abdelbar A.M., ADR-Miner: An ant-based data reduction algorithm for classification, in: Proceedings IEEE Congress of Evolutionary Computation (CEC-2015), 2015, pp. 515–521.

8. Anwar I.M., Salama K.M. and Abdelbar A.M., Instance selection with ant colony optimization, in: Proceedings INNS Conference on Big Data, 2015, pp. 248–256.

9. Asuncion and Newman, UCI machine learning repository, http://www.ics.uci.edu/mlearn/MLRepository.html, 2007.

10. Bishop C.M., Pattern Recognition and Machine Learning, Springer, Berlin, Heidelberg, 2007.

11. Blackwell and Bentley, Don’t push me! Collision-avoiding swarms, in: Proceedings IEEE Congress on Evolutionary Computation (CEC-2002), Vol. 2, 2002, pp. 1691–1696.

12. Blum and Merkle, Swarm Intelligence: Introduction and Applications, Springer, New York, NY, USA, 2008.

13. Boryczka and Kozak, An adaptive discretization in the ACDT algorithm for continuous attributes, in: Proceedings International Conference on Computational Collective Intelligence (ICCI-2011), 2011, pp. 475–484.

14. Box and Muller, A note on the generation of random normal deviates, The Annals of Mathematical Statistics 29(2) (1958), 610–611.

15. Brazdil, Giraud-Carrier, Soares and Vilalta, Metalearning: Applications to Data Mining, Springer, Berlin, Heidelberg, 2009.

16. Cano, Herrera and Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study, IEEE Transactions on Evolutionary Computation 7(6) (2003), 561–575.

17. Cervantes, Galvan and Isasi, AMPSO: A new particle swarm method for nearest neighbor classification, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 39(5) (2009), 1082–1091.

18. Chomboon, Chujai, Teerarassamee and Kerdprasop, An empirical study of distance metrics for k-nearest neighbor algorithm, in: Proceedings International Conference on Industrial Application Engineering, 2015, pp. 280–285.

19. Derrac, García and Herrera, IFS-CoCo: Instance and feature selection based on cooperative coevolution with nearest neighbor rule, Pattern Recognition 43(6) (2010), 2082–2105.

20.

Derrac

Triguero

García

and Herrera

, A co-evolutionary framework for nearest neighbor enhancement: Combining instance and feature weighting with instance selection, in: Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science Volume 7209, Springer, Berlin, Heidelberg, 2012, pp. 176–187.

21.

Derrac

Triguero

García

and Herrera

, Integrating instance selection, instance weighting, and feature weighting for nearest neighbor classifiers by coevolutionary algorithms, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 42(5) (2012), 1383–1397.

22.

Dorigo

Caro

G.M.

and Gambardella

L.M.

, Ant algorithms for discrete optimization, Artificial Life 5(2) (1999), 137–172.

23.

Dorigo

Maniezzo

and Colorni

, Ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 26 (1996), 1–13.

24.

Dorigo

and Stützle

, Ant Colony Optimization, MIT Press, Cambridge, MA, USA, 2004.

25.

Dorigo

and Stützle

, Ant colony optimization: Overview and recent advances, in: Handbook of Metaheuristics Gendreau

and Potvin

, eds, Springer, New York, NY, USA, 2010, pp. 227–263.

26.

Eshelman

L.J.

, The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination, in: Foundations of Genetic Algorithm Rawlings

, ed., San Francisco, CA, USA, Morgan Kauffman, 1991, pp. 265–283.

27.

Fernandez

and Isasi

, Local feature weighting in nearest prototype classification, IEEE Transactions on Neural Networks 19(1) (2008), 40–53.

28.

Freitas

A.A.

, Comprehensible classification models: A position paper, ACM SIGKDD Explorations 15(1) (2013), 1–10.

29. García, Cano and Herrera, A memetic algorithm for evolutionary prototype selection: A scaling up approach, Pattern Recognition 41(8) (2008), 2693–2709.

30. García and Herrera, Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy, Evolutionary Computation 17(3) (2009), 275–306.

31. Garcia-Pedrajas, del Castillo and Ortiz-Boyer, A cooperative coevolutionary algorithm for instance selection for instance-based learning, Machine Learning 78(3) (2010), 381–420.

32. Guyon and Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003), 1157–1182.

33. Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, USA, 2000.

34. S.-Y., Liu C.-C., Liu and Jou J.-W., Design of an optimal nearest neighbor classifier using an intelligent genetic algorithm, in: Proceedings IEEE Congress on Evolutionary Computation (CEC-2002), Vol. 1, 2002, pp. 594–599.

35. Jahromi, Parvinnia and John, A method of learning weighted similarity function to improve the performance of nearest neighbor, Information Sciences 179(17) (2009), 2964–2973.

36. Kar, Chakraborti and Ravindran, Feature weighting and confidence based prediction for case based reasoning systems, in: Case-Based Reasoning Research and Development, Lecture Notes in Computer Science Volume 7466, 2012, pp. 211–215.

37. Kardan, Kavian and Esmaeili, Simultaneous feature selection and feature weighting with K selection for KNN classification using BBO algorithm, in: Proceedings Conference on Information and Knowledge Technology, 2013, pp. 349–354.

38. Kelly J.D. and Davis, A hybrid genetic algorithm for classification, in: Proceedings International Joint Conference on Artificial Intelligence (IJCAI-1991), Vol. 2, 1991, pp. 645–650.

39. Kononenko, Estimating attributes: Analysis and extensions of relief, in: Proceedings European Conference on Machine Learning (ECML-1994), Lecture Notes in Computer Science Volume 7209, Bergadano and Raedt, eds, Springer, Berlin, Heidelberg, 1994, pp. 171–182.

40. Kuncheva L.I., Editing for the k-nearest neighbors rule by a genetic algorithm, Pattern Recognition Letters 16 (1995), 809–814.

41. Liao, Socha, Montes de Oca, Stützle and Dorigo, Ant colony optimization for mixed-variable optimization problems, IEEE Transactions on Evolutionary Computation 18(4) (2014), 503–518.

42. Liu and Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective, Springer, Berlin, Heidelberg, 1998.

43. Liu and Motoda, Instance Selection and Construction for Data Mining, Springer-Verlag, New York, 2001.

44. Liu and Setiono, A probabilistic approach to feature selection: A filter solution, in: Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 319–327.

45. Martens, Baesens and Fawcett, Editorial survey: Swarm intelligence for data mining, Machine Learning 82(1) (2011), 1–42.

46. Martens, De Backer, Haesen, Vanthienen, Snoeck and Baesens, Classification with ant colony optimization, IEEE Transactions on Evolutionary Computation 11 (2007), 651–665.

47. Mateos-Garcia, Garcia-Gutierrez and Riquelme-Santos J.C., On the evolutionary optimization of k-NN by label-dependent feature weighting, Pattern Recognition Letters 33 (2012), 2232–2238.

48. Mladenic and Grobelnik, Feature selection for unbalanced class distribution and naive Bayes, in: Proceedings International Conference on Machine Learning (ICML-1999), 1999, pp. 258–267.

49. Otero and Freitas, Improving the interpretability of classification rules discovered by an ant colony algorithm, in: Proceedings Genetic and Evolutionary Computation Conference (GECCO-2013), 2013, pp. 73–80.

50. Otero, Freitas and Johnson, A new sequential covering strategy for inducing classification rules with ant colony algorithms, IEEE Transactions on Evolutionary Computation 17 (2013), 64–74.

51. Otero F.E.B., Freitas A.A. and Johnson C.G., Inducing decision trees with an ant colony optimization algorithm, Applied Soft Computing 12(11) (2012), 3615–3626.

52. Paredes and Vidal, Learning weighted metrics to minimize nearest-neighbor classification error, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006), 1100–1110.

53. Parpinelli R.S., Lopes H.S. and Freitas A.A., Data mining with an ant colony optimization algorithm, IEEE Transactions on Evolutionary Computation 6(4) (2002), 321–332.

54. Parvinnia, Sabeti, Jahromi M.Z. and Boostani, Classification of EEG signals using adaptive weighted distance nearest neighbor algorithm, Journal of King Saud University-Computer and Information Sciences 26 (2014), 1–6.

55. Peng, Long and Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8) (2005), 1226–1238.

56. Potter M.A. and De Jong K.A., Cooperative coevolution: An architecture for evolving coadapted subcomponents, Evolutionary Computation 8(1) (2000), 1–29.

57. Punch, Goodman, Pei, Chia-Shun, Hovland and Enbody, Further research on feature selection and classification using genetic algorithms, in: Proceedings International Conference on Genetic Algorithms, 1993, pp. 557–564.

58. Quinlan, Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, USA, 1993.

59. Rokach, Ensemble-based classifiers, Artificial Intelligence Review 33(1–2) (2010), 1–39.

60. Salama K.M. and Abdelbar A.M., Learning neural network structures with ant colony algorithms, Swarm Intelligence 9(4) (2015), 229–265.

61. Salama K.M., Abdelbar A.M. and Anwar, Data reduction for classification with ant colony algorithms, Intelligent Data Analysis 20(5) (2017), 1021–1059.

62. Salama K.M., Abdelbar A.M. and Freitas, Multiple pheromone types and other extensions to the Ant-Miner classification rule discovery algorithm, Swarm Intelligence 5(3–4) (2011), 149–182.

63. Salama K.M., Abdelbar A.M., Otero and Freitas, Utilizing multiple pheromones in an ant-based algorithm for continuous-attribute classification rule discovery, Applied Soft Computing 13(1) (2013), 667–675.

64. Salama K.M. and Freitas, Clustering-based Bayesian multi-net classifier construction with ant colony optimization, in: Proceedings IEEE Congress on Evolutionary Computation (CEC-2013), 2013, pp. 3079–3086.

65. Salama K.M. and Freitas, Extending the ABC-Miner Bayesian classification algorithm, in: 6th International Workshop on Nature Inspired Cooperative Strategies for Optimization (NICSO-2013), volume 512 of Series on Computational Intelligence, Berlin, Springer, 2013, pp. 1–12.

66. Salama K.M. and Freitas, Learning Bayesian network classifiers using ant colony optimization, Swarm Intelligence 7(2) (2013), 229–254.

67. Salama K.M. and Freitas, Ant colony algorithms for constructing Bayesian multi-net classifiers, Intelligent Data Analysis 19(2) (2015), 233–257.

68. Sanchez A.M., Lozano, Villar and Herrera, Hybrid crossover operators with multiple descendents for real-coded genetic algorithms: Combining neighborhood-based crossover operators, International Journal of Intelligent Systems 24 (2009), 540–567.

69. Socha and Blum, Training feed-forward neural networks with ant colony optimization: An application to pattern classification, in: Proceedings International Conference on Hybrid Intelligent Systems (HIS-2005), 2005, pp. 233–238.

70. Socha and Blum, An ant colony optimization algorithm for continuous optimization: Application to feed-forward neural network training, Neural Computing & Applications 16 (2007), 235–247.

71. Socha and Dorigo, Ant colony optimization for continuous domains, European Journal of Operational Research 185 (2008), 1155–1173.

72. Tahir M.A., Bouridane and Kurugollu, Simultaneous feature selection and feature weighting using hybrid tabu search/k-nearest neighbor classifier, Pattern Recognition Letters 28(4) (2007), 438–446.

73. Tan P.-N., Steinbach and Kumar, Introduction to Data Mining, Addison Wesley, Boston, MA, USA, 2005.

74. Tsutsui, Ant colony optimisation for continuous domains with aggregation pheromones metaphor, in: Proceedings International Conference on Recent Advances in Soft Computing (RASC-2004), 2004, pp. 207–212.

75. Walters-Williams and Li, Comparative study of distance functions for nearest neighbors, in: Advanced Techniques in Computing Sciences and Software Engineering, Elleithy, ed., Springer, Berlin, Heidelberg, 2010, pp. 79–84.

76. Wettschereck, Aha and Mohri, A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms, Artificial Intelligence Review 11 (1997), 273–314.

77. Witten I.H. and Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, CA, USA, 2010.

78. Woźniak, Graña and Corchado, A survey of multiple classifier systems as hybrid systems, Information Fusion 16 (2014), 3–17.

79. Yang and Honavar, Feature subset selection using a genetic algorithm, IEEE Intelligent Systems and Their Applications 13(2) (1998), 44–49.

80. Yang and Pedersen, A comparative study on feature selection in text categorization, in: Proceedings International Conference on Machine Learning (ICML-1997), 1997, pp. 412–420.

81. Zheng and Webb G.I., Semi-naive Bayesian classification, Journal of Machine Learning Research 87(1) (2008), 93–125.

Dataset	Instances	Classes	Features (total)	Features (numeric)	Features (categorical)
annealing	896	6	38	9	29
audiology	200	24	70	0	70
balance	625	3	4	0	4
breast-l	283	2	9	0	9
breast-p	198	2	32	32	0
breast-tissue	106	6	9	9	0
breast-w	569	2	30	30	0
car	1,728	4	6	0	6
chess	3,196	2	36	0	36
credit-a	690	2	14	6	8
credit-g	1,000	2	20	7	13
cylinder	540	2	35	19	16
dermatology	366	6	34	1	33
ecoli	336	8	7	7	0
hay	132	3	4	0	4
heart-c	303	5	13	7	6
heart-h	293	5	13	7	6
hepatitis	155	2	19	6	13
ionosphere	351	2	34	34	0
iris	150	3	4	4	0
liver-disorders	345	2	6	6	0
lymphography	148	4	18	3	15
monks	556	2	6	0	6
mushrooms	8,124	2	22	0	22
nursery	12,960	5	8	0	8
parkinsons	195	2	22	22	0
pima	768	2	8	8	0
pop	90	3	8	0	8
s-heart	270	2	13	6	7
segmentation	2,273	7	19	19	0
soybean	307	19	35	0	35
transfusion	722	2	4	4	0
ttt	958	2	9	0	9
voting	425	2	16	0	16
wine	178	3	13	13	0
zoo	101	7	16	0	16

Table 3
Experiment B: Predictive accuracy (%) results for the two variations of ACO- $k$ NN, and for classical $k$ -NN with the optimized parameter setting identified in Experiment A

Table 9
Experiment C: Predictive accuracy (%) results for our proposed ACO-based algorithms and for the CIW-NN coevolutionary algorithm