Instance-based learning (IBL) methods predict the class label of a new instance based directly on the distance between the new unlabeled instance and each labeled instance in the training set, without constructing a classification model in the training phase. In this paper, we introduce a novel class-based feature weighting technique, in the context of instance-based distance methods, using the Ant Colony Optimization meta-heuristic. We address three different approaches of instance-based classification: -Nearest Neighbours, distance-based Nearest Neighbours, and Gaussian Kernel Estimator. We present a multi-archive adaptation of the ACO algorithm and apply it to the optimization of the key parameter in each IBL algorithm and of the class-based feature weights. We also propose an ensemble of classifiers approach that makes use of the archived populations of the ACO algorithm. We empirically evaluate the performance of our proposed algorithms on 36 benchmark datasets, and compare them with conventional instance-based classification algorithms, using various parameter settings, as well as with a state-of-the-art coevolutionary algorithm for instance selection and feature weighting for Nearest Neighbours classifiers.
Instance-based learning (IBL) is a widely used classification approach where the class of an unlabeled instance is predicted directly based on the labels of training set instances, without constructing a model in a training phase [4, 3]. The -Nearest Neighbours (-NN) algorithm is perhaps the most well-known IBL algorithm [77], and one of the most relevant algorithms in data mining [33, 73]. In the -NN algorithm, the class of a new instance is determined based on the classes of the nearest instances in the training set to this new instance, where is a user-supplied parameter. However, -NN is just one member in the broad family of IBL algorithms. Distance-based Nearest Neighbours (-NN) and Gaussian Kernel Estimator (GKE) are other IBL classification algorithms that are widely-used in machine learning [10, 6]. In the -NN algorithm, only instances of the training set within a certain distance to the new instance are used to determine its class; while in the GKE algorithm, each instance in the training set contributes in deciding the class of a given new instance according to a Gaussian kernel function.
IBL classification is very sensitive to the quality of the underlying dataset used in the class prediction process. Hence, several data preparation tasks, such as feature selection, feature weighting, instance selection, and instance weighting are often used to improve the effectiveness of an IBL algorithm. This opens the door for various optimization techniques to be effectively applied to perform these tasks to improve IBL.
The Ant Colony Optimization (ACO) meta-heuristic, introduced by Dorigo et al. [23, 22], deals with artificial systems that are inspired by the foraging behaviour of real ants. ACO has been successfully applied to solve a broad range of combinatorial optimization problems [12, 24, 25], and has been applied to classification using a variety of different types of classification models, including classification rules [53, 46, 62, 63, 50, 49], decision trees [13, 51], and various types of Bayesian network classifiers [66, 67, 64, 65]. ACO has also recently been applied to instance and feature selection [7, 8, 61], but to the best of our knowledge has not been previously applied to feature weighting. In addition, ACO has recently been introduced as an Ant Colony-based algorithm for continuous optimization [70, 41], and has been applied to neural network training [69, 60, 2, 1]. However, to the best of our knowledge, the ACO algorithm has not been previously applied in the context of instance-based classification prior to this work. For a comprehensive review of ACO algorithms in data mining, the reader is directed to [45].
In this paper, we introduce the use of the ACO algorithm to perform feature weighting and parameter optimization for three different algorithms in the context of instance-based learning. More precisely, the contributions of this work can be defined as follows:
We propose a new class-based feature weighting scheme, in which several feature weight sets are optimized, one for each available class value. Such a weighting scheme would improve the effectiveness of an IBL classification algorithm in terms of predictive accuracy, and provide the user with more comprehensible knowledge regarding the relative importance of each feature to each class value.
We propose a multi-archive adaptation of ACO, and use it to optimize the class-based feature weights in the context of three different IBL classification algorithms, namely -nearest neighbours (-NN), distance-based nearest neighbours (-NN) and Gaussian kernel estimator (GKE). ACO is also used to optimize the key parameter for each algorithm – i.e., the number of nearest neighbours (), the maximum distance of the instance neighbourhood (), and the spread (smoothing) parameter in the GKE.
We propose creating an ensemble of classifiers using the population of the ant colony, rather than selecting the best created solution as the final classifier in order to improve the testing predictive accuracy and avoid classifier over-fitting.
We empirically evaluated the performance of our proposed ACO-based algorithms on 36 benchmark UCI classification datasets [9] and compared them to the conventional version of these IBL algorithms – using a variety of parameter settings – as our evaluation baseline. In addition, we compared our proposed algorithms with a state-of-the-art coevolutionary algorithm for instance selection and feature weighting in the nearest neighbours classifier, namely CIW-NN [21]. The non-parametric Wilcoxon signed-ranks test indicates that our ensemble-based ACO-NN algorithm is statistically significantly better than the state-of-the-art CIW-NN algorithm, when the two algorithms are given comparable CPU resources.
The rest of the paper is structured as follows. An overview of instance-based learning concepts, algorithms and common improvement techniques are given in the following section. We provide a background on the existing evolutionary algorithms in the literature used to improve IBL in Section 3. In Section 4, we review the ACO algorithm. The novel class-based feature weighting is introduced in Section 5. In Section 6, we present our proposed multi-archive adaptation of ACO and its use in optimizing the -NN, -NN, and GKE algorithms. The ensemble of classifiers, based on the ACO population, is discussed in Section 7. Our experimental methodology is described in Section 8, followed by the computational results and related discussions in Section 9. Finally, we conclude with general remarks and future research directions in Section 10.
Instance-based classification overview
Most classification techniques perform eager learning [77, 33]. That is, the learning algorithm discovers classification patterns from a training set, and represents them in a model constructed by the algorithm during a training phase. Later, this constructed model is used to predict the class of a new instance, without the need to go back to the original training data. However, in lazy learning, all the work is done when the time comes to classify a new instance, and model construction is not necessary. A lazy classifier predicts the class of a new instance directly based on the training set, with respect to the similarity (or distance) between the new instance and each instance in the database (training set) of the domain instances. Such a technique is also known as lazy classification, case-based learning, memory-based learning, and instance-based learning [4, 3].
Instance-based learning’s popularity in the field of data mining is due to two main advantages. First, IBL is a non-parametric method, that is, the only assumption that is made about the domain data is that: similar inputs (features) have similar outputs (classes). Second, IBL methods are incremental learning methods, that is, a new set of instances can be easily accommodated in the domain’s database to be used for further prediction tasks, without having to rebuild a new classification model, as in eager learning methods. Therefore, IBL algorithms, in general, are simple to apply and perform well in a wide range of application domains.
As a start, let us establish some notation that is going to be used throughout the paper. Let be the training set, which includes data instances with input features. is the database used by the lazy classifier to find the similar instances (neighbours) to any given new instance. is the -th labeled instance in , where is the value of the -th feature () in the -th instance , and is the class label of the instance . The class label , where is the set of the available values in the target class domain. A new (unlabeled) instance to be classified is denoted as .
Now, in order to implement a lazy classification algorithm, the following elements of the algorithm should be defined:
First, a proximity measure should be specified. For any two given instances, and , a proximity measure specifies how similar (or dissimilar) and are, with respect to their input feature values. The most well-known dissimilarity measure for IBL is Euclidean distance, which we generalize as follows to apply to both numeric and categorical features:
where
Of course, the smaller the distance, the more similar the instances and . The distance is 0 when all the (used) feature values of are the same as in .
Second, the effective instances for class prediction should be determined. The effective instances are the subset of with the most similar instances to the new instance , by which the class label of is predicted. The most popular approach is to select the most similar instances (neighbours) to the new instance to be classified, and use these instances for the class label prediction. This is the basic idea of the -NN classifier [3, 77, 33]. In this case, the parameter’s value should be specified by the user. Nonetheless, this is not the only approach to determine the effective instances.
Another approach is to use all the instances within the data partition that belongs to as the effective instances. One method is the histogram estimator, where the data space is partitioned into equal-sized intervals [6]. However, a better method is to consider all the instances within distance to the new instance to be classified as the set of effective instances, which is the idea of the -NN classifier [6, 10], which will be described in Section 6.2. In this case, will be the centre of the data partition, and the parameter’s value should be specified by the user. A third approach is to use all the instances in as effective instances, but with different contributing weights to the final class label prediction of the new instance. This is the idea of the Gaussian Kernel Estimator (GKE) [6, 10], which will be described in Section 6.3.
Third, how the selected effective instances are used to predict the class label of a new instance should be specified. A straightforward approach is to perform majority voting, where the class label that has the highest occurrence among them is assigned to . A more elaborate approach is to perform a weighted voting, where the effective instances with higher similarity (smaller distance) to the new class have higher weights in determining the class label of the new instance. Several approaches can be used to define an effective instance’s weight based on its distance to the instance to be classified, yet the Gaussian kernel is a popular function for this task [6], which will be discussed in the context of the GKE classifier (Section 6.3).
Another method is to construct a local classification model using the effective instances set to be used to predict the class label of a new instance [77]. However, since this method is computationally expensive (learning a different classification model for each new instance), simple classification algorithms are usually used, such as Naïve-Bayes [81].
IBL methods demand careful training data preparation as a pre-processing step before performing classification [77, 73]. IBL classification is very sensitive to the quality of the underlying training set used in the process, since irrelevant features, variable-scale feature values, redundant instances, and outliers affect the choice of the nearest neighbours, and consequently affect the class prediction. Therefore, one important data preparation step is to normalize the values in the domain of each continuous feature, so that each feature has values in a standard range, e.g. from 0 to 1.
Feature selection [42], which consists of selecting the most relevant feature subset to the class prediction, is a well-known useful process for classification algorithms in general. However, feature weighting [76] is more relevant in the context of instance-based classification, and has been a successful approach to improve lazy classifiers [27]. A drawback of Euclidean distance is that it makes the implicit assumption that all features are equally important. In practice, there can be redundancies and correlations among the features; for example, in a medical application: one feature can represent weight while another represents waist size. Furthermore, some features can be noisy or can even be irrelevant to the classification task. In addition, for nominal features, the distance between two equal feature values is 0, and the distance is 1 when the two values are different. This implicitly gives more emphasis to the nominal features over the continuous features in computing the Euclidean distance between two examples, since the difference between two values in the domain of a continuous feature rarely reaches the extreme difference of 0 or 1.
In feature weighting, the distance measure is extended to apply a different weight to each feature , which modifies the way in which the distance measure is computed between two instances and . In the case of Euclidean distance, feature weighting takes the form of:
where is defined in Eq. (2) and are real-valued feature weights, each giving a different level of emphasis to its associated feature, and is the weight of the -th feature. Note that feature weighting can be considered as a generalized version of feature selection, since the features that are given zero or very small ineffective weights are considered to be outside the subset of the selected features. The feature weights can be prescribed using expert knowledge of the problem domain; however, in practice this is rarely available.
Approaches to feature selection and feature weighting can be broadly classified into three primary categories: filter, wrapper, and embedded methods. Filter methods [44, 39] assign a heuristic score to each feature based on how well a given feature contributes to the discrimination between classes. Early scoring methods used statistical measures of correlation between features and class labels [80], and have since evolved into more sophisticated measures [48]. Filter methods typically operate on the dataset before it is presented to the learning algorithm, and have an advantage of being relatively light in terms of computational cost.
Wrapper methods [57, 79] use the learning algorithm as a black-box, and its predictive accuracy as a quality evaluation function. Evolutionary Computation (EC) methods such as CIW-NN, as well as ACO methods [7, 8, 61], fall within this category. Wrapper methods generally have good performance but are typically computationally expensive in the training phase compared to filter methods.
Embedded methods [32, 55] perform feature selection or weighting as part of the training phase of the machine learning algorithm. Examples include network pruning in neural network classifiers and limiting the depth of the decision tree in the C4.5 [58] algorithm. These methods are also usually less computationally demanding compared to wrapper methods.
Instance selection [43] is another data preparation task in the context of IBL, which consists of selecting a subset of the most appropriate instances in the training set in order to get rid of outliers, and irrelevant and noisy data instances. Its goal is to isolate the smallest set of instances which enables the classification algorithm to predict the class of a new instance with the same or better predictive accuracy than using the initial dataset. In addition, by minimizing the dataset size, the space complexity and computational costs of the IBL algorithms are reduced. Instance weighting is a generalization of instance selection, which focuses on modifying the way in which distances are measured with respect to the positions of the instances in the training set [52, 21]. One approach is for each instance to have an associated weight that depends on its class [21].
Evolutionary Algorithms for IBL
Evolutionary Algorithms (EA) have been investigated [16, 40, 34, 29, 30, 17, 20] for Instance Selection (IS), Instance Weighting (IW), and Feature Weighting (FW). However, thus far, the focus has been limited to the -NN classifier. Kuncheva [40], Cano et al. [16], and García et al. [29] have applied several flavors of EAs to IS. Kelly and Davis [38] have applied a GA to FW. Other adaptive approaches to FW have been considered by others [36, 72, 39, 76, 47, 37]. Jahromi et al. [35] have presented an adaptive IW approach called Weighted Distance Nearest Neighbor (WDNN) in which the weight for each instance is iteratively optimized by considering the leave-one-out error; the WDNN method has been applied by others [54] to EEG signal classification. Approaches that combine FW and IS have been explored in [34, 38]. Garcia-Pedrajas et al. [31] have applied coevolutionary approaches to IS, and Derrac et al. [19] have presented a coevolutionary method to construct an elaborate multi-classifier based on multiple 1-NN classifiers. Cervantes et al. [17] have presented a Particle Swarm Optimization approach to optimizing the prototypes for a nearest-prototype classifier (which differs from the nearest-neighbours classifier in that the prototypes do not need to be instances of the training set).
In [21], Derrac et al. proposed a coevolutionary approach, based on the 1-nearest neighbour classifier, called CIW-NN (Coevolution of Instance Selection and Weighting schemes for Nearest Neighbour classifier) which has separate coevolving populations for each of the following three tasks: Instance Selection (IS), Feature Weighting (FW), and Instance Weighting (IW).
Each chromosome in the IS population is a -dimensional bit-string, where is the size of the training set. In a given chromosome, if bit is equal to 1, this means that the -th instance of the training set is included in the reduced set. The IS population uses a variant of the CHC (Cross-generational elitist selection, Heterogeneous recombination, and Cataclysmic mutation) algorithm, a classical GA introduced by Eschelman [26] in 1991.
In the FW population, each chromosome is a -dimensional real-valued vector, where is the number of features. The -th entry in each chromosome represents the weight to be used in the weighted Euclidean distance proximity measure of Eq. (3). The FW population uses the classical SSGA (Steady State GA) algorithm with multiple descendants [68].
For IW, the CIW-NN algorithm uses an approach in which the weight of an instance depends on its class label, and all instances with the same class label have the same weight. The distance between a labeled training instance with class label and a test instance is computed as:
Each chromosome in the IW population therefore consists of a vector of real numbers, where the -th entry in a chromosome represents the weight of the class label in Eq. (4). Like the FW population, the IS population also uses the SSGA algorithm with multiple descendants.
Fitness evaluation of an individual in any of the three populations requires the selection of a collaborator from each of the other two populations. The approach followed in CIW-NN is that the most fit individual in any population is always selected as the collaborator for individuals from the other populations. Thus, the fitness function is applied using an individual from each of the three populations (the desired individual from one population, and the most-fit individuals from the other two populations).
Derrac et al. [21] compared CIW-NN to a large number of other approaches: the classical 1-NN classifier, the CHC algorithm applied to IS without coevolution, the SSGA algorithm applied to FW without coevolution, the SSGA algorithm applied to IW without coevolution, a Steady State Memetic Algorithm (SSMA) for IS [29], gradient descent based approaches [52], the WDNN algorithm [35] for IW, a Tabu-Search (TS) based method [72] for Feature Selection and Feature Weighting (FW), a Relief-based approach for FW [39], a FW method based on Mutual Information [76], and a method for simultaneous IS and FW called GOCBR (Global Optimization of feature weighting and instance selection using GA for Case-Based Reasoning) [5]. Compared to these numerous methods, CIW-NN was found [21] to have the best predictive accuracy.
Ant Colony Optimization
Ant Colony Optimization (ACO) [24] is a general-purpose, biologically-motivated, population-based optimization meta-heuristic that can be applied to a wide variety of domains [25]. ACO is based on a number of primitive processing elements, each operating in parallel with little centralized control. The processing elements in ACO are called ants, and the collection of processing elements are called a colony. In an ACO algorithm, there is usually a central data structure, analogous to pheromone information in biological ant systems, that represents the time-evolving collective knowledge of the group. In each iteration, each ant typically generates a candidate solution, making use of the central pheromone data structure in some way in its solution construction. After all ants have generated their solutions, a subset of those solutions is then used to update the central data structure in some way.
The majority of research on ACO has focused on discrete (combinatorial) optimization problems [25]. However, ACO methods for continuous problem domains have also been investigated [71, 70, 74]. In this paper, we focus on the ACO algorithm [70], which has been applied to a number of continuous optimization problems [71, 70, 41].
Suppose the ACO algorithm is to be applied to an optimization problem over real-valued variables . The central data structure, analogous to pheromone information in natural ants, that is maintained by ACO is an archive of previously-generated candidate solutions. Each element in the archive, for , is an -dimensional real-valued vector, . For example, refers to the value of the -th variable in the -th solution in the archive. The archive is sorted by solution quality, so that . Each solution in the archive has an associated weight that is related to , so that .
The ACO algorithm consists of repeated iterations until some termination criteria is reached (e.g. solution cost falls below some desired threshold, some maximum number of iterations is reached, etc.). In each iteration, there are two phases: solution construction and pheromone update. In the solution construction phase, each ant probabilistically constructs a solution based on the solution archive (representing pheromone information). The solution archive is initialized with randomly-generated solutions, where the size is a user-supplied parameter of the ACO algorithm. Then, in the pheromone update phase, the constructed solutions (where is the number of ants) are added to , resulting in the size of temporarily being . The archive is then sorted by solution quality, and the worst solutions are discarded, so that the size of returns to being . Figure 1 shows the conceptual flow of the ACO algorithm.
The work flow of the ACO algorithms: 1) Archive initialization. 2) Solution selection form the archive. 3) Probabilistic solution creation. 4) Quality evaluation. 5) Archive update. 6) Search parameters update.
The heart of the algorithm is the solution construction phase. In this phase, each ant generates a candidate solution , where is an -dimensional vector, and represents an assignment to the -th variable . In constructing its solution , ant is influenced by one of the solutions in the archive . The ant first probabilistically selects one of the solutions in the archive according to:
Thus, the probability of selecting the -th solution is proportional to its weight . Recall that the archive is sorted by quality, so that solution has rank , with the best solution having a rank of 1. The weights that are used in Eq. (5) are constructed in each iteration as:
where is the Gaussian function:
Thus, Eq. (6) assigns the weight to be the value of the Gaussian function with argument , mean 1.0, and standard deviation . The value of is a user-supplied parameter of the algorithm, where smaller values of cause the better ranked solutions to have higher weights (and thus make the algorithm more exploitative), while larger values of result in a more uniform distribution.
Let be the solution of that is selected by ant according to Eq. (5) in a given iteration. Ant then generates each solution element by sampling the Gaussian probability density function (PDF):
where represents the Gaussian PDF with mean and standard deviation . One way to sample the Gaussian PDF is through the Box-Muller transform [14], which is based on the following property: if and are two independent, uniformly distributed random numbers in the interval , then and will be two independent random numbers with a Gaussian distribution , where
In Eq. (8), represents the value that the solution assigns to variable , and the standard deviation is computed according to:
where is a user-supplied parameter of the algorithm. The effect of Eq. (10) is that the average distance from to other solutions in the archive, for the -th dimension, is computed, and is then multiplied by . The parameter plays a role in ACO similar to that of evaporation rate in other ACO algorithms. The higher the value of , the less the extent to which the search is biased towards the area of the search space around the solutions stored in the archive, and the slower the algorithm will converge. Once each ant constructs its solution, the archive is updated as described above. The process repeats until the desired termination criteria are met.
In all, the algorithm has four user-supplied parameters , , , and , in addition to any parameters related to the termination criteria. The parameter determines the number of ants; the parameter determines the number of solutions stored in the archive ; the parameter controls the extent to which the top solutions in the archive will dominate solution construction (Eq. (6)); and the parameter influences the degree of diversity in solution construction (Eq. (10)).
The solution archive can be considered an indirect representation of a probability distribution. When the number of ants is large, e.g. 10, then the probability distribution represented is sampled 10 times, before the generated solutions are used to update the distribution. When the number of ants is small, e.g. 1, then the probability distribution is updated immediately after each candidate solution is generated. This latter approach () is employed in our experimental results.
Recently, an extension of ACO for problems with a mixture of real-valued, ordinal and categorical variables, was introduced, called ACO [41]. In ACO, real-valued variables are handled as they are in ACO (as described above), ordinal variables are handled by ACO-o, and categorical variables are handled by ACO-c.
Suppose an ordinal variable has values in its domain, ordered from lowest to highest as . ACO-o treats as a real-valued variable whose value can range from to , and operates the same as ACO, with a single exception. When a value of the real-valued variable is to be evaluated, it is first rounded to the nearest integer. If it is rounded to , then this corresponds to a value of for the original ordinal variable , and the quality evaluation proceeds with the value .
In handling a categorical variable with category labels , the operation of ACO-c is quite elaborate and takes into consideration whether each category label is represented in the archive as well as the quality of the highest-ranked archived solution that includes each represented category label. The detailed mechanisms of ACO-c are outside our scope, however, since we do not work with categorical variables in this paper.
Class-based feature weighting for IBL
In many real-world problem domains, some input features are more important to the prediction of a specific class label, while others are less important to the prediction of the same class. This feature-class relevance can be different from one class label to another. For example, consider a medical dataset of patients, where the input features consist of several symptoms and observations of the patient condition, and the target class is one of several diseases. In this situation, some symptoms and observations may be more relevant to a specific disease, while the same symptoms and observations may be less relevant to another disease in the target class domain. An input feature that encodes blood glucose level would likely have a greater discriminatory relevance for a class label of diabetes than it would for a class label of brain stroke. Hence, having one weight value for each symptom feature, with respect to all the diseases in the target class domain, will not reflect the relative symptom-disease (feature-class label) importance.
Therefore, we propose a new class-based feature weighting scheme, in which each feature has a different weight value for each target class label, to give a different level of emphasis to each feature with respect to each class label. More precisely, we propose to optimize , which consists of sets of feature weights , where is the number of class labels in the domain of the class , and is the set of feature weights with respect to class label . The value is the weight of the -th feature with respect to the class label ; there are real-valued weights to be specified in total. We propose that the class-based feature weights would be used as follows. When the lazy classification algorithm calculates the distance between (the new unlabeled instance to be classified) and (the -th labeled instance in the training set ), the set of feature weights, , which corresponds to the class label of the current training instance is used, as shown in the following equation:
Note that, not only would the class-based weighting scheme improve the effectiveness of an IBL classification algorithm in terms of predictive accuracy – by providing fine-grained feature weights with respect to each class – such a class-based weighting scheme would provide the user with more comprehensible knowledge [28] regarding the relative importance of each feature to each class label, which improves the interpretability of the instance-based classifier. For instance, if the user is trying to compare and analyze two different class labels, and , in the application domain, the input feature that has a large difference between the value of and can be more interesting for analysis than another feature that may have similar weight values for both of the class labels. The way this class-based feature weighting is performed using ACO is described in the following section.
Adapting ACO to optimize the parameters of various IBL algorithms
We adapt ACO to optimize class-based feature weights (discussed in Section 5) in the context of three different instance-based classification algorithms (each will be illustrated later in the current section), as well as one key parameter for each algorithm. Our adaptation, called ACO-IBL, contains several archives of solution components, each archive holding elements. Suppose the classification problem at hand has class labels and features. Then, the algorithm would utilize archives. In the first archive, , each entry is a scalar that represents a potential value of the lazy classifier’s key parameter. Each of the remaining archives corresponds to one of the class labels . An entry in the archive is a -dimensional real-valued vector that represents candidate feature weights for class label , where is the number of the input features of the current dataset. Hence, a complete candidate solution of the ACO-IBL algorithm is composed of an entry from each of the archives, and can be represented as a pair , where is the value of the algorithm’s key parameter, and is the set of feature weight sets, one for each class label , where . Algorithm 1 shows the overall procedure of ACO-IBL.
Algorithm 1 Pseudo-code of ACO-IBL.
1:
Begin
2:
3:
4:
InitializeArchive // key-parameter archive
5:
for
6:
InitializeArchive // feature weights archive
7:
end for
8:
repeat
9:
SelectFromArchive // select a parameter value
10:
GenerateNewParameterValue
11:
12:
fordo
13:
SelectFromArchive // select a vector of feature weights
14:
GenerateNewFWVector
15:
16:
end for
17:
SetupLazyClassifier
18:
Classify // leave-one-out procedure
19:
EvaluateQuality
20:
UpdateArchive
21:
fordo
22:
EvaluateQuality
23:
UpdateArchive
24:
end for
25:
UpdateSearchParameters ();
26:
27:
untilor Convergence ;
28:
return BestSolution ();
29:
End
Essentially, the ACO-IBL algorithm executes as follows. At the beginning, the archives are initialized with randomly generated solutions that are evaluated and sorted according to their quality (lines 4 to 7). Then, in each iteration , a new candidate solution is created as follows. The algorithm probabilistically selects a parameter value from the parameter value archive (line 9), using fitness-proportionate selection based on the solution ranks (Eq. (5)). Then, a new parameter value is generated from the selected value (line 10), by sampling the Gaussian Probability Density Function (PDF) , where is computed as:
Next, a set of feature weights are generated for each class label (lines 12 to 16). For each archive , the algorithm probabilistically selects a vector, , which represents a feature weight vector for class label . Then, a new weight vector is probabilistically generated from . Let denote the -th element of , i.e. the weight of the -th feature with respect to class label . For each , is generated by sampling the Gaussian PDF , where is the value of the -th element of the selected archived solution , and is computed as:
Note that Eqs (12) and (13) are adaptations of Eq. (10).
At this point, we have a complete candidate solution – consisting of an element from each of the archives – to run the lazy classifier. The lazy classifier is setup (line 17) using the generated parameters and the training set . The way each of the three lazy classifiers works to classify an instance is discussed later in this section. A leave-one-out validation procedure is performed (line 18) – i.e., for each instance , becomes the validation instance to be classified, while becomes the set of input instances, from which the effective instances are selected and used to determine the class of the validation instance . Note that the employed distance measure utilizes the class-based feature weights, as in Eq. (11). This leave-one-out quality evaluation approach was also used in the fitness function of CIW-NN [21].
The output of the leave-one-out procedure is the set of the correctly classified instances, . We evaluate the quality of the key parameter value using predictive accuracy:
where is the size of the training set . This quality evaluation is used to update the parameter value archive (lines 19 and 20).
In contrast, the quality evaluation of the created class-based feature weights is performed as follows. Each set of feature weights , associated with class label , is evaluated independently, to update its related archive , independently of the other archives (lines 22 and 23). More precisely, the quality value , which is associated with the feature weight set, is calculated as follows:
where are the instances that are correctly classified to class during the leave-one-out procedure, and is the number of instances in the training set labeled by . In other words, to compute for the feature weight set of class label , we only compute the accuracy in the context of class label , as shown in Eq. (15). After that, each archive is updated with according to its quality . For example, if , may be the third in the ranking for archive , may be the tenth in the ranking for archive , and may not be inserted at all into archive .
The idea here is to avoid rewarding a bad feature weight set of a specific class because the feature weight sets of the other classes are good, and similarly to avoid penalizing good feature weights of a specific class because the feature weight sets of the other classes are bad. This can be even more useful in the case of class imbalance problems, where some class labels have relatively low frequencies with respect to other class labels. Using the general accuracy measure (Eq. (14)) would make the accuracy of predicting the high-frequency class labels dominate the overall quality of the classifier. Therefore, the proposed approach should provide a more fine-grained class-based quality evaluation.
After the archives are updated in iteration , the ACO-IBL algorithm updates the search parameters using Eq. (6) and moves to iteration (lines 25 and 26). The repeat-until loop (lines 8 to 27) is repeated until iterations have elapsed, or until the algorithm converges. The algorithm is considered to have converged when there is no improvement in the quality of the best solution for consecutive iterations. The best solution, which consists of the parameter value at the top of the parameter archive , as well as the feature weight vector at the top of each class-based feature weight archive , is returned at the end of the algorithm (line 28). Note that, in Algorithm 1, the number of ants is always 1. This means that only one solution is created in each iteration ; that constructed solution will be discarded unless its quality is better than the worst solution in the archive (as described in Section 4).
K-nearest neighbours
The -nearest neighbours algorithm is the most common lazy classifier. Its key parameter is , which represents the number of the nearest (effective) instances to be used to classify the validation instance . Hence, our ACO-IBL algorithm is responsible for optimizing the value of the integer parameter – with respect to the current dataset – along with the real-valued class-based feature weights W used in calculating the distance between the validation instance and a training instance . The integer parameter is handled in the same way that ordinal variables are handled in the ACO [41] algorithm (described in Section 4). Specifically, the value of is treated as a real-valued scalar by the ACO algorithm, but is rounded to the nearest integer before being used as a cutoff for the nearest neighbours.
Distance-based nearest neighbours
A more robust alternative to -NN is the distance-based nearest neighbours (-NN) algorithm. The key parameter for the -NN classification algorithm is , which specifies the maximum distance between a training instance and the validation instance in order for to be considered as one of the effective instances that are used to determine the class label of . In the context of -NN, the ACO-IBL algorithm is responsible for optimizing the value of the parameter – with respect to the current dataset – along with the class-based feature weights W for the distance measure.
In essence, in each iteration of the leave-one-out procedure, the distance is calculated between the validation instance and each training instance . However, the training instance is added to the effective instances only if , where is the current value of the maximum allowed distance. Hence, the effective instances in the -NN algorithm are all the instances in a sphere of radius in the data space, where (the instance to be classified) is in the centre of the sphere. Note that the number of the effective instances is fixed to in -NN, even if some of these instances are far from the validation instance in the data space. On the other hand, in -NN, the number of effective instances varies from one validation instance to another, depending on where the validation instance is in the data space and how many training instances are within distance to it.
We employ weighted voting in order to predict the class label of , where the weight of instance . More precisely, for each class label , we sum the weights of all the instances in the effective instances set that belong to . Then, we assign the validation instance to the class label with the highest sum of weight values.
Gaussian kernel estimator
Unlike the two previous lazy classifiers, which use a selected subset of the training instances as the effective instances, GKE uses all the instances in as the effective instances. However, each instance will have a different weight in the weighted voting performed by GKE. Each weight is calculated as a function of the distance between the training (effective) instance and the current validation instance .
Recall that in the context of -NN, the weight function . That is, the weight value is inversely proportional to the distance value . However, in the case of the GKE lazy classifier, , where is the Gaussian function defined in Eq. (7).
The key parameter in the GKE is , which is the spread (or smoothing) parameter which defines the shape of the influences of the training instances in determining the class label of the validation instance. When the value of is small, the training instances near to have a large weight, and this weight decreases rapidly as the distance increases. On the other hand, when the value of is larger, we get a smoother effect, where the difference between the weights of the near and the far instances is not as large. This is illustrated in Fig. 2, in which the -axis represents the distance and the -axis represents the weight . The graph shows plots for GKE for several different values of , as well as for -NN.
Plot of weight (-axis) versus distance (-axis) for GKE, for several values of , as well as for -NN.
Ensemble of classifiers using ACO population
An ensemble of classifiers is often used to combine the predictions from separate classifiers in order to increase predictive accuracy [59, 78]. The idea is to construct an ensemble of classifiers having different inductive biases, so that they will make different errors. Hence, combining their classification outputs will make the overall prediction of the ensemble more accurate [33].
With the use of an ACO-based algorithm to learn classifiers, or to optimize classifier parameters as in the current work, there is a potential to apply the idea of a classifier ensemble for free, since many solutions (i.e., classifiers) are created throughout the algorithm’s execution. Thus, instead of using the best constructed solution as the final output classifier of the ACO algorithm, one can select a set of top solutions in the colony to compose an ensemble of classifiers. Nonetheless, the crucial aspect of this approach is to make sure that, at the end of the ACO algorithm’s execution, the colony has a set of solutions with high diversity among them, instead of having very similar solutions that are (nearly) converged to the best solution. In order to promote this, the exploration behaviour should be enforced from the beginning of the algorithm’s execution to increase diversity among the constructed solutions, and this diversity should be maintained until the end of the algorithm.
At the end of execution of ACO-IBL, there are several ways that the archives can be used to construct an ensemble of classifiers. The approach we employ in the present work is the simplest approach, and consists of constructing an ensemble of classifiers, where the -th classifier in the ensemble, for , is composed of the -th ranked element of each of the archives. Each test instance is presented to each of the classifiers, the output of each classifier is recorded, and a majority vote is performed to decide the final predicted class label of the instance.
Recall that the solutions in the archives are initialized randomly at the beginning of the algorithm, which means that the algorithm begins with (presumably) a diversity of initial (probably poor) solutions in the search space. In order to keep the diversity of the solutions in the archives while evolving better solutions, search diversity should be promoted in the algorithm, which is controlled by parameter , as discussed in Section 4.
More precisely, when the value of is small, the probability of selecting the best solution in the archive is very high, and the probability of selecting any other solution in the archive is almost zero (according to Eqs (5) and (6)), which makes the algorithm very exploitive. This is illustrated in Fig. 3. For an archive of size , let denote the ratio of the probability of selection of the highest-quality solution in the archive (with rank 1) to the probability of selection of the solution in the archive of rank . Figure 3 shows a plot in which the -axis represents the parameter and the -axis represents the corresponding ratio . We can see that the relative likelihood of selecting the worst solution in the archive depends dramatically on . This is further illustrated in Fig. 4. In this graph, the -axis is the rank of a solution in the archive, and the -axis is the ratio . The graph shows plots for several values of .
Plot of the ratio (-axis) versus value of the parameter (-axis), using a logarithmic scale for the -axis.
Plot of solution rank (-axis) versus the ratio (-axis), for several values of the parameter , using a logarithmic scale for the -axis.
The effect of selecting only the best solution is that all the new solutions are generated around the best solution in the search space, and with time, they occupy more and more of the archive. In this case, by the end of the algorithm’s execution, the archive will be full of the same (or very similar) solution, which will be useless for a classifier ensemble. On the other hand, if the value of is relatively large (as is the case in our computational results), all solutions in the archives will have a non-negligible probability of being selected, thus increasing the diversity of the population.
Experimental methodology
We evaluate the performance of three versions of our proposed ACO-IBL algorithm:
ACO-NN: The ACO-IBL version with the nearest neighbours lazy classifier. The key parameter, , is the number of the nearest neighbours to be used as the effective instances.
ACO-NN: The ACO-IBL version with the distance-based nearest neighbours lazy classifier. The key parameter, , is the maximum distance of the neighbours to be used as the effective instances.
ACO-GKE: The ACO-IBL version with the Gaussian kernel estimator lazy classifier. The key parameter, , is the smoothing (spread) parameter in the Gaussian kernel function.
Parameter settings of IBL algorithms used in the experiments
Algorithm
Parameter
Description
Value
ACO-IBL
Maximum number of iterations
10,000
Number of ants per iteration
1
Archive size
25
Maximum number of non-improving iterations
50
Controls speed of convergence
0.85
Controls exploration/exploitation
0.25
-NN
Number of the nearest neighbours
-NN
Maximum distance of a neighbour
GKE
Smoothing parameter in the kernel function
ACO-NN
Minimum permitted value of (# of neighbours)
1
Maximum permitted value of (# of neighbours)
21
ACO-NN
Minimum permitted value of (maximum distance)
0.01
Maximum permitted value of (maximum distance)
1
ACO-GKE
Minimum permitted value of (smoothing parameter)
0.01
Maximum permitted value of (smoothing parameter)
0.5
In each of the previous algorithms, the ant colony meta-heuristic optimized the key parameter of the algorithm, along with the class-based feature-weights. We compare the predictive performance of our ACO versions of the lazy classification algorithms with the conventional version of these algorithms, where there are no feature weights, and the parameter value of each algorithm is user-supplied. For each of the conventional algorithms, we used around twenty different values for the key parameter associated with the algorithm. The parameter configurations used in the experiments for the various algorithms are shown in Table 1.
In addition, we compared our proposed ant-based algorithms with CIW-NN (discussed in Section 3), which is a state-of-the-art coevolutionary genetic algorithm that integrates instance selection, instance weighting, and feature weighting for nearest neighbours classifiers [21].
For the CIW-NN algorithm, we used the default parameters in [21]. However, for fairness of comparison, we assigned the CIW-NN algorithm and our ACO algorithms the same computational budget. In other words, we set the maximum number of evaluations (i.e., candidate solution creation and evaluation) in the CIW-NN algorithm’s entire execution to , which is the maximum number of iterations in ACO-IBL. Since only one solution () is created and evaluated per iteration in ACO-IBL, the overall maximum number of evaluations in both algorithms is the same. Note, however, that the maximum number might not be utilized completely; both ACO and evolutionary algorithms might use a smaller number of iterations if they converge earlier.
Characteristics of the datasets used in the experiments
Dataset
Instances
Classes
Features
Total
Numeric
Categorical
annealing
896
6
38
9
29
audiology
200
24
70
0
70
balance
625
3
4
0
4
breast-l
283
2
9
0
9
breast-p
198
2
32
32
0
breast-tissue
106
6
9
9
0
breast-w
569
2
30
30
0
car
1,728
4
6
0
6
chess
3,196
2
36
0
36
credit-a
690
2
14
6
8
credit-g
1,000
2
20
7
13
cylinder
540
2
35
19
16
dermatology
366
6
34
1
33
ecoli
336
8
7
7
0
hay
132
3
4
0
4
heart-c
303
5
13
7
6
heart-h
293
5
13
7
6
hepatitis
155
2
19
6
13
ionosphere
351
2
34
34
0
iris
150
3
4
4
0
liver-disorders
345
2
6
6
0
lymphography
148
4
18
3
15
monks
556
2
6
0
6
mushrooms
8,124
2
22
0
22
nursery
12,960
5
8
0
8
parkinsons
195
2
22
22
0
pima
768
2
8
8
0
pop
90
3
8
0
8
s-heart
270
2
13
6
7
segmentation
2,273
7
19
19
0
soybean
307
19
35
0
35
transfusion
722
2
4
4
0
ttt
958
2
9
0
9
voting
425
2
16
0
16
wine
178
3
13
13
0
zoo
101
7
16
0
16
The experiments were carried out using the stratified ten-fold cross validation procedure. In essence, a dataset is divided into ten mutually exclusive partitions (folds), with approximately the same number of instances in each partition. Then, each classification algorithm is run ten times, where each time a different partition is used as the test set and the other nine partitions are used as the training set. The results (predictive accuracy rate on the test set) are then averaged and reported as the accuracy rate of the classifier. Since ACO-IBL and CIW-NN are stochastic algorithms, we run each ten times – using a different random seed to initialize the search each time – for each of the ten iterations of the cross-validation procedure (i.e., 100 runs in total, for each dataset). In the case of the deterministic algorithms, each one is run just once for each iteration of the cross-validation procedure.
The performance of our ACO-IBL algorithms was evaluated using 36 public-domain datasets from the well-known University of California at Irvine (UCI) dataset repository [9]. The main characteristics of the datasets are shown in Table 2.
Computational results
Predictive accuracy results
Experiment A
Experiment A: Plot of average predictive accuracy rank (-axis) versus value of the -NN parameter (-axis).
Experiment A: Plot of average predictive accuracy rank (-axis) versus value of the -NN parameter (-axis).
Experiment A: Plot of average predictive accuracy rank (-axis) versus value of the GKE parameter (-axis).
In Experiment A, the classical -NN algorithm was run, on the 36 datasets listed in Table 2, with 21 different values of the key parameter , for
using 10-fold cross-validation as described in Section 8. The average rank of each setting of was obtained. The average rank for a given setting is obtained by first computing its rank on each dataset individually. In case two or more settings have the same accuracy on a given dataset, the tied settings are given the average of the ranks they span. The individual ranks are then averaged across all datasets to obtain the overall average rank for each setting. Note that the lower the value of the rank, the better. Figure 5 shows a plot of average rank versus value of the parameter , and indicates that the best average rank is obtained with .
The -NN algorithm was similarly run with 21 different settings of , with
while the GKE algorithm was run with 19 different settings of , with
The average rank of each setting of each algorithm’s key parameter was computed, and is shown in Fig. 6 for -NN and Fig. 7 for GKE. These figures indicate that the best accuracy is obtained with for -NN, and with for GKE.
The best-performing parameter settings identified in this experiment ( for -NN, for -NN, and for GKE) will be used in Experiment B.
Experiment B
Experiment B: Predictive accuracy (%) results for the two variations of ACO-NN, and for classical -NN with the optimized parameter setting identified in Experiment A
Dataset
ACO-NN
-NN
Basic
Ensemble
( 12)
annealing
94.20
94.20
72.58
audiology
79.17
79.17
78.33
balance
82.67
85.02
84.83
breast-l
73.78
74.15
73.70
breast-p
75.10
75.10
75.24
breast-tissue
62.91
64.83
69.00
breast-w
96.82
96.82
95.26
car
93.57
94.44
94.56
chess
96.60
96.73
95.88
credit-a
85.81
85.81
86.96
credit-g
73.40
74.07
73.90
cylinder
77.03
77.03
76.76
dermatology
95.62
96.99
97.28
ecoli
77.59
79.09
86.96
hay
67.97
67.97
59.90
heart-c
56.50
57.77
56.76
heart-h
67.03
67.03
65.34
hepatitis
83.25
83.25
82.63
ionosphere
94.26
94.26
86.23
iris
89.29
89.29
92.62
liver-disorders
55.60
63.93
60.56
lymphography
81.82
77.10
81.19
monks
63.25
63.25
52.64
mushrooms
99.99
100.00
99.99
nursery
98.73
98.81
97.37
parkinsons
82.18
82.18
87.68
pima
71.22
71.22
74.21
pop
71.96
71.96
69.46
s-heart
83.70
84.41
82.22
segmentation
91.95
93.87
93.29
soybean
84.48
84.48
82.76
transfusion
69.90
69.88
70.67
ttt
98.95
98.95
97.16
voting
93.40
93.74
89.80
wine
95.90
95.90
94.38
zoo
97.32
97.32
100.00
#wins
14
24
11
rank (avg)
2.14
1.65
2.21
Experiment B: Predictive accuracy (%) results for the two variations of ACO-NN, and for classical -NN with the optimized parameter setting identified in Experiment A
Dataset
ACO-NN
-NN
Basic
Ensemble
( 0.45)
annealing
86.08
90.68
61.60
audiology
72.50
75.83
76.67
balance
88.17
90.67
92.50
breast-l
72.84
72.84
71.05
breast-p
76.74
76.74
73.71
breast-tissue
62.91
62.91
52.91
breast-w
82.04
82.04
87.53
car
91.52
91.52
70.76
chess
94.60
97.55
92.04
credit-a
89.71
89.71
83.62
credit-g
73.90
74.90
72.80
cylinder
69.82
69.82
69.68
dermatology
94.24
94.24
65.56
ecoli
68.25
71.85
52.99
hay
74.74
79.77
75.80
heart-c
58.84
58.84
58.76
heart-h
69.57
69.57
66.38
hepatitis
80.00
80.00
71.04
ionosphere
93.39
93.39
93.97
iris
94.00
95.45
92.62
liver-disorders
57.60
58.60
58.32
lymphography
77.10
76.43
79.76
monks
50.81
45.89
56.48
mushrooms
100.00
98.02
100.00
nursery
90.84
90.76
91.44
parkinsons
82.58
82.58
77.87
pima
65.22
65.82
67.05
pop
73.39
74.64
70.89
s-heart
81.48
81.48
84.07
segmentation
86.70
88.70
86.83
soybean
83.79
83.79
76.55
transfusion
70.13
80.13
72.34
ttt
94.32
94.32
72.00
voting
92.94
91.85
88.66
wine
87.61
87.61
94.38
zoo
92.54
95.89
98.75
#wins
15
23
12
rank (avg)
2.06
1.71
2.24
Experiment B: Predictive accuracy (%) results for the two variations of ACO-GKE, and for classical GKE with the optimized parameter setting identified in Experiment A
Dataset
ACO-GKE
GKE
Basic
Ensemble
( 0.4)
annealing
83.76
83.76
65.18
audiology
85.00
85.00
70.93
balance
85.67
91.67
77.43
breast-l
72.84
72.84
66.30
breast-p
72.58
75.58
67.84
breast-tissue
53.36
56.36
61.60
breast-w
89.36
89.36
87.86
car
93.69
93.69
87.16
chess
96.98
97.10
88.48
credit-a
79.42
79.42
79.56
credit-g
74.40
74.40
66.50
cylinder
66.92
66.92
69.36
dermatology
93.69
93.77
89.88
ecoli
70.81
72.69
80.46
hay
75.47
75.47
52.50
heart-c
57.77
57.77
49.36
heart-h
67.03
67.03
57.94
hepatitis
81.25
81.25
75.23
ionosphere
81.92
81.92
78.83
iris
91.95
91.95
85.22
liver-disorders
63.93
63.93
53.16
lymphography
83.81
83.81
73.79
monks
63.43
63.43
45.24
mushrooms
100.00
100.00
92.59
nursery
97.94
98.88
89.97
parkinsons
83.53
83.53
80.28
pima
72.14
72.14
66.81
pop
73.39
73.39
62.06
s-heart
85.93
85.93
74.82
segmentation
68.05
68.05
85.89
soybean
87.59
87.59
75.36
transfusion
72.86
68.86
63.27
ttt
99.16
99.16
89.76
voting
91.74
92.74
82.40
wine
83.17
83.17
86.98
zoo
98.75
98.75
92.60
#wins
24
29
6
rank (avg)
1.76
1.57
2.67
Experiment B: Results of applying the Friedman test with the Holm post hoc test to the -NN results of Table 3. In each case, the difference is statistically significant, at the 0.05 threshold, if the reported value is less than or equal to the corresponding Holm threshold
Comparison
Holm
sig.?
Ens-ACO-NN vs. -NN (best )
0.0184
0.025
yes
Ens-ACO-NN vs. ACO-NN
0.0392
0.05
yes
Experiment B: Results of applying the Friedman test with the Holm post hoc test to the -NN results of Table 4. In each case, the difference is statistically significant, at the 0.10 threshold, if the reported value is less than or equal to the corresponding Holm threshold
Comparison
Holm
sig.?
Ens-ACO-NN vs. -NN (best )
0.0251
0.05
yes
Ens-ACO-NN vs. ACO-NN
0.1407
0.1
no
Experiment B: Results of applying the Friedman test with the Holm post hoc test to the GKE results of Table 5. In each case, the difference is statistically significant, at the 0.05 threshold, if the reported value is less than or equal to the corresponding Holm threshold
Comparison
Holm
sig.?
Ens-ACO-GKE vs. GKE (best )
3.2E-6
0.025
yes
Ens-ACO-GKE vs. ACO-GKE
0.4094
0.05
no
In Experiment B, our basic ACO-NN algorithm is compared to its ensemble-based version (which we abbreviate as Ens-ACO-NN) and to classical -NN using the optimized parameter setting identified in Experiment A. Table 3 reports the predictive accuracy for these three algorithms. For Tables 3–5, as well as Table 9, the highest accuracy for each dataset is shown in boldface, the last row in each table reports the average ranks of the algorithms being compared, and the penultimate row reports the number of datasets for which each algorithm had the highest accuracy. Tables 4–5 report the analogous results for -NN and GKE.
From Table 3, we note that the highest accuracy was obtained by one of the two versions of ACO-NN in 25 of the 36 datasets. Ens-ACO-NN performed better than ACO-NN in 15 datasets, and worse in only 2 datasets, with 19 ties. The best average rank was obtained by Ens-ACO-NN with an average rank of 1.65, followed by ACO-NN with an average rank of 2.14, followed by -NN with the optimized parameter setting identified in Experiment A with an average rank of 2.21. A Friedman test with the Holm post hoc test is applied to the predictive accuracy results reported in Table 3. The Friedman statistic is computed to be 6.60 with 2 degrees of freedom, corresponding to a value of 0.0369, indicating that the null hypothesis can be rejected at the 0.05 threshold. Table 6 reports the results of the Holm post-hoc tests (at the 0.05 threshold). For Tables 6–8, there is a statistically significant difference for each pair of algorithms being compared if the reported value of is less than or equal to the corresponding Holm threshold. Statistically significant values of are shown in boldface. Table 6 indicates that Ens-ACO-NN is significantly better than each of ACO-NN and classical -NN with the optimized parameter setting identified in Experiment A.
For -NN, we observe from Table 4 that the highest accuracy was obtained by one of the two versions of ACO-NN in 25 of the 36 datasets. Ens-ACO-NN performed better than ACO-NN in 14 datasets, and worse in 5 datasets, with 17 ties. The best average rank was obtained by Ens-ACO-NN with an average rank of 1.71, followed by ACO-NN with an average rank of 2.06, followed by -NN with the optimized parameter setting identified in Experiment A with an average rank of 2.24. A Friedman test is applied to the predictive accuracy results reported in Table 4. The Friedman statistic is computed to be 5.18 with 2 degrees of freedom, corresponding to a value of 0.0750, indicating that the null hypothesis cannot be rejected at the 0.05 threshold, but can be rejected at the 0.10 threshold. Table 7 reports the results of the Holm post-hoc tests at the 0.10 threshold, and indicates that Ens-ACO-NN is significantly better (at the 0.10 threshold) than classical -NN with the optimized parameter setting identified in Experiment A, but is not significantly better than ACO-NN.
For GKE, we observe from Table 5 that the highest accuracy was obtained by one of the two versions of ACO-GKE in 30 of the 36 datasets. Ens-ACO-GKE performed better than ACO-GKE for 8 datasets, and worse for only one dataset, with 27 ties. The best average rank was obtained by Ens-ACO-GKE with an average rank of 1.57, followed by ACO-GKE with an average rank of 1.76, followed by GKE with the optimized parameter setting identified in Experiment A with an average rank of 2.67. A Friedman test with the Holm post hoc test is applied to the predictive accuracy results reported in Table 5. The Friedman statistic is computed to be 24.68 with 2 degrees of freedom, corresponding to a value of 4.4E-6, indicating that the null hypothesis can be rejected. Table 8 reports the results of the Holm post-hoc tests at the 0.05 threshold, and indicates that Ens-ACO-GKE is significantly better than classical GKE with the optimized parameter setting identified in Experiment A, but is not significantly better than ACO-GKE.
Experiment C
Experiment C: Predictive accuracy (%) results for our proposed ACO-based algorithms and for the CIW-NN coevolutionary algorithm
Dataset
ACO-NN
ACO-NN
ACO-GKE
CIW-NN
Basic
Ensemble
Basic
Ensemble
Basic
Ensemble
annealing
94.20
94.20
86.08
90.68
83.76
83.76
97.84
audiology
79.17
79.17
72.50
75.83
85.00
85.00
81.67
balance
82.67
85.02
88.17
90.67
85.67
91.67
86.17
breast-l
73.78
74.15
72.84
72.84
72.84
72.84
69.03
breast-p
75.10
75.10
76.74
76.74
72.58
75.58
75.76
breast-tissue
62.91
64.83
62.91
62.91
53.36
56.36
68.82
breast-w
96.82
96.82
82.04
82.04
89.36
89.36
95.61
car
93.57
94.44
91.52
91.52
93.69
93.69
95.56
chess
96.60
96.73
94.60
97.55
96.98
97.10
97.74
credit-a
85.81
85.81
89.71
89.71
79.42
79.42
80.58
credit-g
73.40
74.07
73.90
74.90
74.40
74.40
70.20
cylinder
77.03
77.03
69.82
69.82
66.92
66.92
79.18
dermatology
95.62
96.99
94.24
94.24
93.69
93.77
95.62
ecoli
77.59
79.09
68.25
71.85
70.18
72.69
71.30
hay
67.97
67.97
74.74
79.77
75.47
75.47
73.08
heart-c
56.50
57.77
58.84
58.84
57.77
57.77
54.84
heart-h
67.03
67.03
69.57
69.57
67.03
67.03
62.03
hepatitis
83.25
83.25
80.00
80.00
81.25
81.25
76.79
ionosphere
94.26
94.26
93.39
93.39
81.92
81.92
64.48
iris
89.29
89.29
94.00
95.45
91.95
91.95
93.33
liver-disorders
55.60
63.93
57.60
58.60
63.93
63.93
63.19
lymphography
81.82
77.10
77.10
76.43
83.81
83.81
70.90
monks
63.25
63.25
50.81
45.89
63.43
63.43
59.27
mushrooms
99.99
100.00
100.00
98.02
100.00
100.00
100.00
nursery
98.73
98.81
90.84
90.76
97.94
98.88
96.50
parkinsons
82.18
82.18
82.58
82.58
83.53
83.53
96.42
pima
71.22
71.22
65.22
65.82
72.14
72.14
69.28
pop
71.96
71.96
73.39
74.64
73.39
73.39
70.00
s-heart
83.70
84.41
81.48
81.48
85.93
85.93
75.56
segmentation
91.95
93.87
86.70
88.70
68.05
68.05
84.52
soybean
84.48
84.48
83.79
83.79
87.59
87.59
82.07
transfusion
69.90
69.88
70.13
80.13
72.86
68.86
68.00
ttt
98.95
98.95
94.32
94.32
99.16
99.16
85.26
voting
93.40
93.74
92.94
91.85
91.74
92.74
93.99
wine
95.90
95.90
87.61
87.61
83.17
83.17
97.16
zoo
97.32
97.32
92.54
95.89
98.75
98.75
98.75
#wins
3
9
5
9
10
12
10
rank (avg)
4.06
3.39
4.57
4.08
4.00
3.61
4.29
Experiment C: Results of applying a (one-tailed) Wilcoxon signed-ranks test comparing Ens-ACO-NN (the highest-ranked of our proposed algorithms) to CIW-NN, a state-of-the-art coevolutionary algorithm
Comparison
sig.?
Ens-ACO-NN vs. CIW-NN
35
2.05
0.0202
yes
The main results of this paper are reported in Table 9, which compares our six ACO algorithms to each other and to CIW-NN, a recent state-of-the-art algorithm. The average rankings indicate that Ens-ACO-NN has the best predictive accuracy of the seven algorithms compared in this table. The second- and third-best accuracy ranks belong to Ens-ACO-GKE and ACO-GKE, respectively. Ens-ACO-NN and ACO-NN are in the fourth and fifth place, respectively. In the next to last place is CIW-NN, followed by ACO-NN in last place.
Focusing our comparison on Ens-ACO-NN (the highest-ranked of our proposed ACO algorithms) and CIW-NN (a state-of-the-art coevolutionary algorithm), we find that Ens-ACO-NN has better accuracy than CIW-NN in 21 of the 36 datasets, and worse in 14 of the 36 datasets, with a single tie. As reported in Table 10, a (one-tailed) Wilcoxon signed-ranks test was used to compare the best of our proposed algorithms, Ens-ACO-NN, to CIW-NN, a state-of-the-art coevolutionary technique. As the table indicates, the -value was determined to be 0.0202, indicating that the results for Ens-ACO-NN are significantly better than for CIW-NN.
Interpreting the class-based feature weights
In addition to potentially improved accuracy, another advantage of our use of class-based feature weights is increased interpretability. As discussed in [28], feature weighting in nearest neighbours classifiers can provide an important form of knowledge that is interpretable by many users. In many applications, users could need to see the nearest neighbours of the instance being classified in order to get some justification or explanation for the class being predicted for that instance. Such an explanation might be needed, e.g., for legal reasons. For instance, a customer whose credit card application is rejected by a bank based on the prediction of a nearest neighbours classifier might have the legal right of knowing why her/his application was rejected. As another example, in this case requiring interpretability because human lives are at stake, if a medical doctor’s diagnosis of some cancer is supported by a nearest neighbours classifier, the doctor should not blindly trust the prediction of the classifier; she/he should interpret the nearest neighbours to see if the prediction makes sense from a medical perspective. In many applications, however, it is not practical for the user to examine all features values in the nearest neighbours, since the number of features can be very large and many features have a small weight (relevance) for a given prediction. Hence, in practice, when a user asks to see the nearest neighbours that were used to classify the current test instance, providing the features in decreasing order of weight (relevance) can help the user to focus her/his attention on the most relevant features.
Our use of class-based feature weights takes this idea one step further by allowing for the representation of the fact that a feature may be more relevant to some classes than to others. By examining the class-based feature weights of the trained classifier, there is the potential to discover interesting patterns, such as features that have very different weight values for different classes, and features that have low weight values for all the classes (which would be good candidates for removal in a feature selection process). These types of observations can be useful to domain experts either for purposes of validating the classifier, or for gaining insight into the problem domain.
Illustration of class-based feature weights for the car dataset. We observe that the feature maint (price of maintenance) has limited significance for the classes unacceptable and good, but is significant for the classes acceptable and very-good. The feature buying (buying price) is less significant for the class very-good than for the other classes.
Illustration of class-based feature weights for the pop (Post-Operative Patient) dataset. We observe that the feature CORE-STBL (stability of patient core temperature) is not relevant to class home (patient sent home from hospital), but is relevant to the other two classes, and the feature SURF-STBL (stability of patient surface temperature) is least relevant to the class ICU (patient sent to intensive care) and most relevant to the class general (patient sent to general hospital floor).
Illustration of class-based feature weights for the iris dataset. We observe that sepal-width has less importance for the versicolor class than for the other two classes; furthermore, the sepal-length and petal-width features both have greater relevance to the versicolor class than for the other two classes.
To illustrate this, we show, in Figs 8–10, a graphical representation of the class-based feature weights for a sample run (randomly selected out of the 100 runs of the 10-times 10-fold cross-validation procedure) of the ACO-NN algorithm for each of several datasets. In each case, the weights are normalized so that the sum of the weights is 100%. Each graph is a stacked bar-graph in which the bars represent the class labels, and the components of each bar correspond to the feature weights. The names of the features are given in the legend for each figure.
Let us consider Fig. 8, illustrating the car dataset. This dataset, derived from a simple hierarchical decision model, has four class labels (very-good, good, acceptable, unacceptable) representing possible evaluations of a car, based on six categorical features: buying price (buying), cost of maintenance (maint), number of doors (doors), carrying capacity of car in terms of number of persons (persons), size of the luggage boot (lug-boot), and estimated safety of the car (safety). From Fig. 8, we can make a number of meaningful observations. For example, the feature maint (cost of maintenance) has limited significance for the classes unacceptable and good, but is significant for the classes acceptable and very-good. The feature buying (buying price) is less significant for the classes very-good than for the other classes.
Considering Fig. 9, illustrating the Post-Operative Patient (pop) dataset. Each instance in this dataset corresponds to a post-operative patient. The features represent readings related to the patient, and the class labels are: patient prepared to go home (home), patient sent to general hospital floor (general), and patient sent to intensive care unit (ICU). By a simple examination of the figure, we can make several observations: the learned classifier does not consider the feature CORE-STBL (stability of patient core temperature) to be relevant to the class home (patient sent home from hospital), but to be relevant to the other two classes; the classifier considers the feature SURF-STBL (stability of patient surface temperature) to be less relevant to the class ICU (patient sent to intensive care) and more relevant to the class general (patient sent to general hospital floor). These types of observations can be useful to domain experts either for purposes of validating the classifier, or for gaining insight into the problem domain.
As a final example, consider Fig. 10, which illustrates the iris dataset, one of the most well-known datasets in machine learning. In this dataset, each instance corresponds to an iris plant, the features represent physical measurements of various aspects of the plant, and the classes (versicolor, virginica, and setosa) represent three different types of plant. We can observe that sepal-width has less importance for the versicolor class than for the other two classes; the sepal-length and petal-width features both have greater relevance to the versicolor class than for the other two classes.
The reader should note that our objective, in this section, is not to actually extract domain knowledge from the discovered class-based feature weights illustrated in Figs 8–10, but rather to give an example of how these weights can be useful to a domain expert in validating a classifier, or even potentially in gaining insight into the problem domain.
Discussion & conclusions
In this paper, we have made several contributions, which can be summarized as follows:
Class-based feature weights (Section 5). We have proposed a new class-based feature-weighting scheme in which each feature has a different real-valued weight for each target class label. This allows each feature to have a different level of emphasis with respect to each class label. In addition to improving predictive accuracy, this scheme has the added advantage that it provides comprehensible knowledge regarding the relative importance of each feature to each class label, which improves the interpretability of the learned classification model.
A multi-archive adaptation of ACO (Section 6), which is used to optimize the class-based feature weights in the nearest neighbours classifier, the distance-based nearest-neighbours (-NN) classifier and the Gaussian kernel estimator (GKE) nearest neighbours classifier, as well as the key classifier parameter in each of these classifiers, i.e. in -NN, in -NN, and in GKE.
The use of an ensemble of classifiers based on the ACO population (Section 7). Rather than using the best constructed solution as the final output classifier of the ACO algorithm, we treat the entire final ACO population as an ensemble of classifiers, employing a majority vote to determine the final predicted class label.
An extensive experimental evaluation, using 36 benchmark datasets and a rigorous 10-times 10-fold cross-validation regimen, of our six proposed algorithms, in addition to the CIW-NN algorithm [21], a state-of-the-art coevolutionary algorithm. As mentioned in Section 3, Derrac et al. [21] have compared CIW-NN to a large number of competing approaches from the literature and found it to have superior predictive accuracy.
In general, from the results presented in Section 9, we observe that none of the seven algorithms compared in Table 9 performed better than the others on all datasets. For each algorithm, there was at least one dataset for which it had the highest predictive accuracy, and at least one dataset for which it had the lowest predictive accuracy.
It is interesting to relate the relative performance of the three versions (Ens-ACO-NN, Ens-ACO-NN, and Ens-ACO-GKE) of our ensemble-based ACO-IBL algorithm to some basic characteristics of the datasets. Considering all 36 datasets, Ens-ACO-NN performed best followed by Ens-ACO-GKE followed by Ens-ACO-NN. Restricting our attention to the 18 two-class datasets, the pattern is the same: Ens-ACO-NN is best followed by Ens-ACO-GKE followed by Ens-ACO-NN. The pattern is also the same for the 18 datasets with more than two classes: Ens-ACO-NN performs best followed by Ens-ACO-GKE followed by Ens-ACO-NN. For the 12 datasets with only numeric features: Ens-ACO-GKE is best followed by Ens-ACO-NN followed by Ens-ACO-NN. For the 14 datasets with only categorical features: Ens-ACO-NN is best followed by Ens-ACO-NN followed by Ens-ACO-GKE. For the 10 datasets with a mix of categorical and numeric features: Ens-ACO-NN is best followed by Ens-ACO-NN followed by Ens-ACO-GKE.
A related and worthwhile future research direction is to seek to predict the relative performance of different data mining algorithms based on dataset characteristics, which is an issue addressed in the area of meta-learning [15]. For example, one might find that Algorithm is better for datasets with a large number of class labels, that Algorithm is better for datasets where the ratio of categorical to continuous features is large (or small), or that Algorithm is better for datasets where the instance sparsity in the data space is low (or high). In the same vein, one might seek to automatically select the proximity measure to the dataset at hand. In the present work, we have used Euclidean distance as our proximity measure. However, numerous proximity measures have, of course, been studied in the literature [18, 75]. An interesting meta-learning task would be to predict the relative performance of each based on measurable characteristics of the problem domain or dataset.
From Table 9, we observe that five of our proposed six ACO algorithms (including all three ensemble-based ACO algorithms) had a better overall average ranking than CIW-NN. Comparing our best ACO algorithm (Ens-ACO-NN) to CIW-NN, we note that Ens-ACO-NN had better predictive accuracy on 21 datasets, worse on 14 datasets, and the same on a single dataset. As shown in Table 10, a Wilcoxon signed-ranks test indicates that Ens-ACO-NN has significantly better predictive accuracy than CIW-NN.
In addition, Ens-ACO-NN has the added advantage of better interpretability of the classifiers because of the more fine-grained class-based feature weights, as was illustrated in Figs 8–10. On the other hand, CIW-NN has the advantage that it also includes Instance Selection in addition to Feature Weighting and a class-based Instance Weighting approach, while our approach does not include any Instance Selection. Instance Selection has a number of benefits: the unselected instances can be interesting for analysis per se, and the reduced training set is less sensitive to outliers and noisy data, as well as the benefits of reduced processing time in the classification phase (fewer instances to scan), and efficiency in terms of storage. In future work, we would like to investigate incorporating data reduction mechanisms (Instance Selection as well as Feature Selection) within our ACO approach.
One advantage of the ensemble of classifiers approach presented in Section 7 is that it improves predictive accuracy in a majority of datasets, as indicated in our experimental results. However, the disadvantage is a decrease in interpretability due to producing an ensemble with different (to some extent contradictory) sets of class-based feature weights, rather than a single set of class-based feature weights.
It is possible to classify our multi-archive adaptation of ACO as a cooperative coevolutionary [56] approach. Using the terminology of coevolution, the archives can be considered symbiotic coevolving populations. An element of a symbiotic population cannot be evaluated by itself; one element from each population must first be selected to form a complete “super-organism” before a quality evaluation function can be applied. In the approach we follow here, an element is selected from each population using the standard roulette-wheel mechanisms of ACO, and fitness is assigned to each element independently of the others, as described in Section 6.
With the classifier ensemble approach, it is important to promote diversity in the population. One possibility for future research is to incorporate opposition-based (or collision-avoiding) mechanisms [11] in the algorithm. For example, the solution construction procedure in ACO can be modified to include a mechanism that penalizes potential solutions that are too close to existing members of the archive.
Another avenue that we would like to pursue in future work is the use of the Cauchy PDF in the ACO-IBL algorithm. An additional direction we would like to explore is to combine the three lazy classifiers (-NN, -NN, and GKE) in an ensemble-based approach, as follows. ACO would be used to optimize the class-based feature weights and the three parameters , and . A candidate solution would consist of a single set of class-based feature weights, in addition to values for the three parameters. To evaluate the quality of a constructed solution, three lazy classifiers (-NN, -NN, and GKE) would be produced. Each instance would be classified by each of the three classifiers, and the predicted class of the instance would be decided by majority vote.
Footnotes
Acknowledgments
The partial support of a grant from the Brandon University Research Council (BURC) is gratefully acknowledged.
References
1.
AbdelbarA.M.El-NabarawyI.WunchD.C. and SalamaK.M., Ant colony optimization applied to the training of a high order neural network with adaptable exponential weights, in: Applied Artificial Higher Order Neural Networks for Control and RecognitionZhangM., ed., IGI Global Press, Hershey, PA, USA, 2016.
2.
AbdelbarA.M. and SalamaK.M., A gradient-guided ACO algorithm for neural network learning, in: Proceedings IEEE Swarm Intelligence Symposium (SIS-2015), 2015, pp. 1133–1140.
3.
AhaD., Lazy Learning, Kluwer, Norwell, MA, USA, 1997.
4.
AhaD.KiblerD. and AlbertM., Instance-based learning algorithms, Machine Learning6(1) (1991), 37–66.
5.
AhnH. and KimK.-J., Bankruptcy prediction modeling with hybrid case-based reasoning and genetic algorithms approach, Applied Soft Computing9(2) (2009), 599–607.
6.
AlpaydynE., Introduction to Machine Learning, MIT Press, Cambridge, MA, USA, 2010.
7.
AnwarI.M.SalamaK.M. and AbdelbarA.M., ADR-Miner: An ant-based data reduction algorithm for classification, in: Proceedings IEEE Congress of Evolutionary Computation (CEC-2015), 2015, pp. 515–521.
8.
AnwarI.M.SalamaK.M. and AbdelbarA.M., Instance selection with ant colony optimization, in: Proceedings INNS Conference on Big Data, 2015, pp. 248–256.
9.
AsuncionA. and NewmanD., UCI machine learning repository http://www.ics.uci.edu/mlearn/MLRepository.html, 2007.
10.
BishopC.M., Pattern Recognition and Machine Learning, Springer, Berlin, Heidelberg, 2007.
11.
BlackwellT. and BentleyP., Don’t push me! Collision-avoiding swarms, in: Proceedings IEEE Congress on Evolutionary Computation (CEC-2002), Vol. 2, 2002, pp. 1691–1696.
12.
BlumC. and MerkleD., Swarm Intelligence: Introduction and Applications, Springer, New York, NY, USA, 2008.
13.
BoryczkaU. and KozakJ., An adaptive discretization in the ACDT algorithm for continuous attributes, in: Proceedings International Conference on Computational Collective Intelligence (ICCI-2011), 2011, pp. 475–484.
14.
BoxG. and MullerM., A note on the generation of random normal deviates, The Annals of Mathematical Statistics29(2) (1958), 610–611.
15.
BrazdilP.Giraud-CarrierC.SoaresC. and VilaltaR., Metalearning: Applications to Data Mining, Springer, Berlin, Heidelberg, 2009.
16.
CanoJ.HerreraF. and LozanoM., Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study, IEEE Transactions on Evolutionary Computation7(6) (2003), 561–575.
17.
CervantesA.GalvanI. and IsasiP., AMPSO: A new particle swarm method for nearest neighbor classification, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics39(5) (2009), 1082–1091.
18.
ChomboonK.ChujaiP.TeerarassameeP. and KerdprasopK., An empirical study of distance metrics for k-nearest neighbor algorithm, in: Proceedings International Conference on Industrial Application Engineering, 2015, pp. 280–285.
19.
DerracJ.GarcíaS. and HerreraF., IFS-CoCo: Instance and feature selection based on cooperative coevolution with nearest neighbor rule, Pattern Recognition43(6) (2010), 2082–2105.
20.
DerracJ.TrigueroI.GarcíaS. and HerreraF., A co-evolutionary framework for nearest neighbor enhancement: Combining instance and feature weighting with instance selection, in: Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science Volume 7209, Springer, Berlin, Heidelberg, 2012, pp. 176–187.
21.
DerracJ.TrigueroI.GarcíaS. and HerreraF., Integrating instance selection, instance weighting, and feature weighting for nearest neighbor classifiers by coevolutionary algorithms, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics42(5) (2012), 1383–1397.
22.
DorigoM.CaroG.M. and GambardellaL.M., Ant algorithms for discrete optimization, Artificial Life5(2) (1999), 137–172.
23.
DorigoM.ManiezzoV. and ColorniA., Ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics26 (1996), 1–13.
24.
DorigoM. and StützleT., Ant Colony Optimization, MIT Press, Cambridge, MA, USA, 2004.
25.
DorigoM. and StützleT., Ant colony optimization: Overview and recent advances, in: Handbook of MetaheuristicsGendreauM. and PotvinY., eds, Springer, New York, NY, USA, 2010, pp. 227–263.
26.
EshelmanL.J., The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination, in: Foundations of Genetic AlgorithmRawlingsG., ed., San Francisco, CA, USA, Morgan Kauffman, 1991, pp. 265–283.
27.
FernandezF. and IsasiP., Local feature weighting in nearest prototype classification, IEEE Transactions on Neural Networks19(1) (2008), 40–53.
28.
FreitasA.A., Comprehensible classification models: A position paper, ACM SIGKDD Explorations15(1) (2013), 1–10.
29.
GarcíaS.CanoJ. and HerreraF., A memetic algorithm for evolutionary prototype selection: A scaling up approach, Pattern Recognition41(8) (2008), 2693–2709.
30.
GarcíaS. and HerreraF., Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy, Evolutionary Computation17(3) (2009), 275–306.
31.
Garcia-PedrajasN.del CastilloJ. and Ortiz-BoyerD., A cooperative coevolutionary algorithm for instance selection for instance-based learning, Machine Learning78(3) (2010), 381–420.
32.
GuyonI. and ElisseeffA., An introduction to variable and feature selection, Journal of Machine Learning Research3 (2003), 1157–1182.
33.
HanJ. and KamberM., Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, USA, 2000.
34.
HoS.-Y.LiuC.-C.LiuS. and JouJ.-W., Design of an optimal nearest neighbor classifier using an intelligent genetic algorithm, in: Proceedings IEEE Congress on Evolutionary Computation (CEC-2002), Vol. 1, 2002, pp. 594–599.
35.
JahromiM.ParvinniaE. and JohnR., A method of learning weighted similarity function to improve the performance of nearest neighbor, Information Sciences179(17) (2009), 2964–2973.
36.
KarD.ChakrabortiS. and RavindranB., Feature weighting and confidence based prediction for case based reasoning systems, in: Case-Based Reasoning Research and Development, Lecture Notes in Computer Science Volume 7466, 2012, pp. 211–215.
37.
KardanA.KavianA. and EsmaeiliA., Simultaneous feature selection and feature weighting with K selection for KNN classification using BBO algorithm, in: Proceedings Conference on Information and Knowledge Technology, 2013, pp. 349–354.
38.
KellyJ.D. and DavisL., A hybrid genetic algorithm for classification, in: Proceedings International Joint Conference on Artificial Intelligence (IJCAI-1991), Vol. 2, 1991, pp. 645–650.
39.
KononenkoI., Estimating attributes: Analysis and extensions of relief, in: Proceedings European Conference on Machine Learning (ECML-1994), Lecture Notes in Computer Science Volume 7209BergadanoF. and RaedtL., eds, Springer, Berlin, Heidelberg, 1994, pp. 171–182.
40.
KunchevaL.I., Editing for the k-nearest neighbors rule by a genetic algorithm, Pattern Recognition Letters16 (1995), 809–814.
41.
LiaoT.SochaK.Montes de OcaM.StützleT. and DorigoM., Ant colony optimization for mixed-variable optimization problems, IEEE Transactions on Evolutionary Computation18(4) (2014), 503–518.
42.
LiuH. and MotodaH., Feature Extraction, Construction and Selection: A Data Mining Perspective, Springer, Berlin, Heidelberg, 1998.
43.
LiuH. and MotodaH., Instance Selection and Construction for Data Mining, Springer-Verlag, New York, 2001.
44.
LiuH. and SetionoR., A probabilistic approach to feature selection: A filter solution, in: Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 319–327.
45.
MartensD.BaesensB. and FawcettT., Editorial survey: Swarm intelligence for data mining, Machine Learning82(1) (2011), 1–42.
46.
MartensD.De BackerM.HaesenR.VanthienenJ.SnoeckM. and BaesensB., Classification with ant colony optimization, IEEE Transactions on Evolutionary Computation11 (2007), 651–665.
47.
Mateos-GarciaD.Garcia-GutierrezJ. and Riquelme-SantosJ.C., On the evolutionary optimization of k-NN by label-dependent feature weighting, Pattern Recognition Letters33 (2012), 2232–2238.
48.
MladenicD. and GrobelnikM., Feature selection for unbalanced class distribution and naive Bayes, in: Proceedings International Conference on Machine Learning (ICML-1999), 1999, pp. 258–267.
49.
OteroF. and FreitasA., Improving the interpretability of classification rules discovered by an ant colony algorithm, in: Proceedings Genetic and Evolutionary Computation Conference (GECCO-2013), 2013, pp. 73–80.
50.
OteroF.FreitasA. and JohnsonC., A new sequential covering strategy for inducing classification rules with ant colony algorithms, IEEE Transactions on Evolutionary Computation17 (2013), 64–74.
51.
OteroF.E.B.FreitasA.A. and JohnsonC.G., Inducing decision trees with an ant colony optimization algorithm, Applied Soft Computing12(11) (2012), 3615–3626.
52.
ParedesR. and VidalE., Learning weighted metrics to minimize nearest-neighbor classification error, IEEE Transactions on Pattern Analysis and Machine Intelligence28 (2006), 1100–1110.
53.
ParpinelliR.S.LopesH.S. and FreitasA.A., Data mining with an ant colony optimization algorithm, IEEE Transactions on Evolutionary Computation6(4) (2002), 321–332.
54.
ParvinniaE.SabetiM.JahromiM.Z. and BoostaniR., Classification of EEG signals using adaptive weighted distance nearest neighbor algorithm, Journal of King Saud University-Computer and Information Sciences26 (2014), 1–6.
55.
PengH.LongF. and DingC., Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence27(8) (2005), 1226–1238.
56.
PotterM.A. and De JongK.A., Cooperative coevolution: An architecture for evolving coadapted subcomponents, Evolutionary Computation8(1) (2000), 1–29.
57.
PunchW.GoodmanE.PeiM.Chia-ShunL.HovlandP. and EnbodyR., Further research on feature selection and classification using genetic algorithms, in: Proceedings International Conference on Genetic Algorithms, 1993, pp. 557–564.
58.
QuinlanJ., Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, USA, 1993.
SalamaK.M. and AbdelbarA.M., Learning neural network structures with ant colony algorithms, Swarm Intelligence9(4) (2015), 229–265.
61.
SalamaK.M.AbdelbarA.M. and AnwarI., Data reduction for classification with ant colony algorithms, Intelligent Data Analysis20(5) (2017), 1021–1059.
62.
SalamaK.M.AbdelbarA.M. and FreitasA., Multiple pheromone types and other extensions to the Ant-Miner classification rule discovery algorithm, Swarm Intelligence5(3–4) (2011), 149–182.
63.
SalamaK.M.AbdelbarA.M.OteroF. and FreitasA., Utilizing multiple pheromones in an ant-based algorithm for continuous-attribute classification rule discovery, Applied Soft Computing13(1) (2013), 667–675.
64.
SalamaK.M. and FreitasA., Clustering-based Bayesian multi-net classifier construction with ant colony optimization, in: Proceedings IEEE Congress on Evolutionary Computation (CEC-2013), 2013, pp. 3079–3086.
65.
SalamaK.M. and FreitasA., Extending the ABC-Miner Bayesian classification algorithm, in: 6th International Workshop on Nature Inspired Cooperative Strategies for Optimization (NICSO-2013), volume 512 of Series on Computational Intelligence, Berlin, Springer, 2013, pp. 1–12.
66.
SalamaK.M. and FreitasA., Learning Bayesian network classifiers using ant colony optimization, Swarm Intelligence7(2) (2013), 229–254.
67.
SalamaK.M. and FreitasA., Ant colony algorithms for constructing Bayesian multi-net classifiers, Intelligent Data Analysis19(2) (2015), 233–257.
68.
SanchezA.M.LozanoM.VillarP. and HerreraF., Hybrid crossover operators with multiple descendents for real-coded genetic algorithms: Combining neighborhood-based crossover operators, International Journal of Intelligent Systems24 (2009), 540–567.
69.
SochaK. and BlumC., Training feed-forward neural networks with ant colony optimization: An application to pattern classification, in: Proceedings International Conference on Hybrid Intelligent Systems (HIS-2005), 2005, pp. 233–238.
70.
SochaK. and BlumC., An ant colony optimization algorithm for continuous optimization: Application to feed-forward neural network training, Neural Computing & Applications16 (2007), 235–247.
71.
SochaK. and DorigoM., Ant colony optimization for continuous domains, European Journal of Operational Research185 (2008), 1155–1173.
72.
TahirM.A.BouridaneA. and KurugolluF., Simultaneous feature selection and feature weighting using hybrid tabu search/k-nearest neighbor classifier, Pattern Recognition Letters28(4) (2007), 438–446.
73.
TanP.-N.SteinbachM. and KumarV., Introduction to Data Mining, Addison Wesley, Boston, MA, USA, 2005.
74.
TsutsuiS., Ant colony optimisation for continuous domains with aggregation pheromones metaphor, in: Proceedings International Conference on Recent Advances in Soft Computing (RASC-2004), 2004, pp. 207–212.
75.
Walters-WilliamsJ. and LiY., Comparative study of distance functions for nearest neighbors, in: Advanced Techniques in Computing Sciences and Software EngineeringElleithyK., ed., Springer, Berlin, Heidelberg, 2010, pp. 79–84.
76.
WettschereckD.AhaD. and MohriT., A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms, Artificial Intelligence Review11 (1997), 273–314.
77.
WittenI.H. and FrankE., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, CA, USA, 2010.
78.
WoniakM.GrañaM. and CorchadoE., A survey of multiple classifier systems as hybrid systems, Information Fusion16 (2014), 3–17.
79.
YangJ. and HonavarV., Feature subset selection using a genetic algorithm, IEEE Intelligent Systems and Their Applications13(2) (1998), 44–49.
80.
YangY. and PedersenJ., A comparative study on feature selection in text categorization, in: Proceedings International Conference on Machine Learning (ICML-1997), 1997, pp. 412–420.
81.
ZhengF. and WebbG.I., Semi-naive Bayesian classification, Journal of Machine Learning Research87(1) (2008), 93–125.