Abstract
Cost-sensitive classification is broadly investigated in many real-life decision-making applications where the different misclassification errors may cause asymmetric costs. However, the cost values are usually hardly specified in practice for decision makers especially when they are faced with the complex multi-class decision problems. In this paper we attempt to take advantage of the pairwise comparisons originally proposed in Analytic Hierarchy Process (AHP) and offer decision makers a flexible, qualitative manner for cost specification rather than the definite, quantitative cost assignment. The cost ratios associated with the classes are then derived from the comparison matrix and used as the parameter of the subsequent cost-sensitive classifiers. To promote the performance of single classifier we construct the ensembles of multiple cost-sensitive Learning Vector Quantization Neural Networks (LVQ-NNs) trained independently and then combined together by various weighted voting approaches. Empirical study on some real-world databases reveal that the well optimized cost-sensitive ensembles based on evolutionary computing approaches perform significantly better than the best single classifier within the ensemble in terms of generalization ability and stability.
Keywords
Introduction
The misclassification cost has an important role in many real-life applications which favor some decision errors over the others under the consideration of the potential cost. Cost-sensitive learning has therefore received increasing attention in machine learning research with the emphasis on dealing with asymmetric costs of different misclassification errors. Cost-sensitive learning is also regarded as one approach to solve the imbalanced problems that the class distribution in data is highly skewed (i.e., some class samples are overwhelmed by the others) leading to the much lower classification accuracy on the minority class than the majority class. The majority of the existing works in the area of cost-sensitive classification include resampling, instance-weighting, instance-relabeling, post hoc threshold adjusting, and algorithm-level approach [1, 2, 3]. In the literature, the reported research mostly focused on developing advanced learning algorithms, with little attention on cost assignment problem. Most variants for cost-sensitive classification are based on a cost matrix that specifies definitely the cost values of various misclassification errors. In some problems the cost values could be estimated statistically and represented quantitatively. For example, in credit risk assessment the cost of predicting a real “bad” credit incorrectly can be determined by the loss of loan; Alternatively, the cost of predicting a real “good” credit incorrectly can be estimated by the potential profit. However, in reality, it is unrealistic for many decision problems especially when the decision makers are faced with complex multi-class decision problems where a quantity of cost values are involved. It therefore becomes a problem of practical interest to specify the misclassification cost through a qualitative way.
Ensemble learning characterized by a consensus or committee of decision makers refers to the process of constructing a set of learners and aggregating the individual outputs appropriately to attain the final decision. The ensemble designed for classification problems is also called Multiple Classifier System (MCS) and the members within the ensemble are called base (component, or elementary) classifier. Selective ensemble (SEN) is gaining a lot of interest in both academic and industry with a view of reducing the size of ensembles whilst receiving remarkable generalization ability through selecting an optimal subsets of learners. A lot of reported research have found that selective ensemble of appropriate classifiers can improve the generalization error in a variety of real-world decision applications including financial credit assessment [4], image classification [5], chemical process prediction [6], sea-ice thickness estimation [7], rotary machinery fault diagnosis [8], and so forth. The selection (or weighting) of the component classifiers becomes a challenging issue in recent selective ensemble learning studies.
In this paper we propose a selective ensemble framework adopting Learning Lector Quantization Neural Networks (LVQ-NNs) as the base classifiers for cost-sensitive classification. Firstly, motivated by the pairwise comparison matrix originally proposed in Analytic Hierarchy Process (AHP), we divide the cost assignment problem into a series of simple pairwise comparisons which indicate the relative importance of two classes with respect to the impact on the misclassification cost. The cost ratios associated with the classes are then derived from the comparison matrix and applied as parameters of the subsequent cost-sensitive classifiers (i.e., weighted LVQ-NN in this paper). This provides a flexible and qualitative way for decision makers to determine the cost ratios instead of the definite, quantitative cost values. Secondly, a generalized selective ensemble is designed to promote the overall generalization ability via weighted voting combination. We construct an ensemble of multiple weighted LVQ-NN classifiers that are trained independently and then combined appropriately via weight assignment. Various heuristic approaches including simple voting, direct weighted voting, Particle Swarm Optimization (PSO), Genetic Algorithm (GA), and Simulated Annealing (SA) are employed to specify (and optimize) the weights that denote the contribution of learners to the ensemble. Experiments on some popular datasets from UCI database and a real-world financial database reveal that the well optimized cost-sensitive ensembles based on evolutionary computing paradigms can achieve considerably stable performance significantly better than the best single classifier within the ensemble.
The main contribution of this approach is twofold. Firstly, it offers a qualitative and intuitive way to determine the misclassification cost when the definite cost values are difficultly assigned. Secondly, it proposes a generalized selective ensemble able to boost the generalization ability based on evolutionary computing paradigms. The rest of this paper is organized as follows. The related research background is introduced briefly regarding cost-sensitive classification and ensemble learning in Section 2. Section 3 describes the methodology of a generalized cost-sensitive selective ensemble framework focusing attention on the cost determination by means of pairwise comparisons, weighed LVQ-NN algorithm used in the cost-sensitive context, and selective ensemble schemes based on weight optimization. Section 4 includes the description of experimental databases and the comparative study of different selective ensemble schemes. Finally, we conclude the paper along with some future research lines in Section 5.
General framework of the cost-sensitive selective ensemble and experimental design.
It is well known that all learning methods have their own strength and limitations as stated by the famous “No Free Lunch” (NFL) Theorem in real-world machine learning applications. Therefore, ensemble learning emerged as a promising computing paradigm where several learners are jointly together used for the same decision tasks [9]. To ensure the effective fusion, the learners comprised in the ensemble should have high performance (individual requirement) and low intercorrelation (committee requirement). Many empirical and theoretical studies in a variety of real-world applications have demonstrated that ensemble learning is actually of benefit to improve the stand-alone classifiers although it was not always reported superior to the single best classifier [10, 11].
To make up an ensemble of high overall performance, two ingredients should be considered: one is increasing the diversity of learners, and the other is selecting appropriate classifiers to constitute the ensemble. A variety of approaches have been proposed to construct diverse component classifiers in either homogeneous or heterogenous structure. The homogeneous ensemble is built on multiple classifiers of the same machine learning family by varying the training data (e.g., bagging and boosting), selecting the features of data (e.g., random subspace and rainforest), adjusting the parameters of learning algorithms (e.g., the neuron topology of ANN), or injecting the randomness to the algorithms. Alternatively the heterogenous ensemble is constructed upon multiple classifiers of different learning families. These strategies, no matter what, were developed with the motivation to increase the diversity of learners that make independent misclassifications and potentially lead to better overall decision after combination. The recent studies paid much attention to hybridizing the aforementioned strategies for the purpose of high diversity of elementary classifiers. For example, SVM-based ensembles developed by bagging and random subspace simultaneously [12], or by different kernel functions and feature subsets [13] were shown to outperform the individual SVMs.
Some research work on selective ensemble learning
Some research work on selective ensemble learning
Notations: NB: Naive bayes, LR: Logistic regression, BLR: Bayesian logistic regression, LVQ-NN: Learning vector quantization neural networks, GA: Genetic algorithm, PSO: Particle swarm optimization, SA: Simulated annealing.
The prevailing fusion approaches of multiple learners include linear approach (e.g., averaging), nonlinear approach (e.g., majority voting and weighted voting), probabilistic approach (e.g., Bayesian), correspondence analysis, entropy modelling, and stacking methods. Recent studies put emphasis on selective ensembles (SEN) to enhance the generalization ability with pruned ensembles. A pioneer work is a selective ensemble method (GASEN) using GA [14], which specifies initial weights to a set of neural networks and then optimizes the weights using genetic operators with respect to the overall accuracy as the fitness value. The evolved weights determine which learners are retained to make up the ensemble. They demonstrated by selecting the appropriate learners GASEN can produce smaller-sized ensemble of better generalization accuracy compared with Bagging and Boosting. In another work [15] GA was used to select the near optimal subsets of base classifiers with respect to the prediction accuracy and variance influence factor (VIF). Liu et al. designed a general framework of selective ensemble based on particle swarm optimization [16]. Table 1 summarizes some related research work in the discipline of selective ensemble learning, including the component classifiers and fusion approach.
The prior research work demonstrate that the ensemble learning, particular selective ensemble, is a well-recognized technique to improve the prediction performance, and evolutionary computing approaches are commonly employed in building selective ensembles. In this article we employed three evolutionary computing paradigms, namely, PSO, GA, and SA, in cost-sensitive selective ensembles aiming to pursue the optimal weights of ensemble members with respect to overall performance in cost-sensitive context.
Figure 1 outlines the framework of the proposed cost-sensitive selective ensemble and the experimental design in Section 4. At the beginning the relative importance of one class against another is assigned by decision makers to construct a pairwise comparison matrix as was done in Analytic Hierarchy Process (AHP). The matrix is then processed mathematically to calculate the weights (or priorities) of the compared classes. Also the consistency of the comparisons is evaluated under a criterion for accepting or rejecting the pairwise comparisons. If the criterion is not accepted the comparison matrix has to be revised. Once the consistency is verified, the class weights are transformed to the cost ratios of misclassifications so that higher ratio indicates the more importance (or higher cost potentially caused by misclassification) of the corresponding class. In the following phase, the training dataset (
Pairwise comparison matrix and cost ratio calculation
The analytic hierarchy process (AHP) is a mathematical technique able to combine qualitative and quantitative factors for solving the complex decision problems. Since originally developed by Thomas L. Saaty in the 1970s [28], AHP has been extensively studied and applied with success in a wide variety of fields. The process of AHP is characterized by (1) decomposing the decision problem into a hierarchy of comprehended sub-problems where the root is the decision goal, the leaves are the compared alternatives, and the other nodes are the related factors, (2) comparing each element with another with respect to the impact on the above element and forming the pairwise comparison matrices, (3) deriving the weights from the pairwise comparison matrices respectively, (4) evaluating the consistency of comparison matrices and the whole hierarchy, (5) calculating the priorities of the alternatives over the decision goal.
In pairwise comparison matrix, each value represents the relative importance of one element against another with respect to the impact over the element above (probably a factor or the decision goal) within the hierarchy. The traditional method is to specify the comparison judgement as a positive integer from
Paired comparison judgements
Paired comparison judgements
An example of pairwise comparison matrix with 6 classes
Consistency evaluation: CI
Given a
where RI is the pre-calculated average CI of a large number of randomly generated comparison matrices. The smaller the CR, the better the consistency of the comparison matrix. Normally the consistency of A is acceptable if
In this paper we take advantage of the comparison matrix to specify the misclassification cost by means of simple and intuitive paired comparisons. In the comparison matrix, the classes are the compared alternatives and the goal is the misclassification cost. Let
Although represented in the similar format, the comparison matrix is different from the traditional cost matrix. (1) In the former the values indicate the qualitative judgement of relative importance of two classes with respect to the potential misclassification cost, whilst in the latter the values indicate the definite cost of misclassification errors. (2) In a
Learning vector quantization (LVQ) is a neural network (NN) defined for supervised learning as a variant of self-organized feature map. The neurons are arranged regularly and each is associated with a prototype which represents the class region. LVQ-NN is trained competitively in the manner that for each input the neuron closest (called best-matching neuron) is activated by updating the prototype and adjusting the class boundary accordingly. LVQ is widely applied in a variety of domains with success compared with support vector machines, multivariate statistical methods, and other learning methods [25]. In recent studies some variants of LVQ-NN were developed for learning in the cost-sensitive context [29, 1]. In this paper we used the weighted LVQ-NN algorithm (wLVQ-NN) proposed in [29]. This learning algorithm is applicable for multi-class cost-sensitive classification problems by treating the cost ratios as the weight of training instances during the learning and labeling of the network.
Let
Weight specification strategies of learners within an ensemble
Weight specification strategies of learners within an ensemble
The wLVQ-NN algorithm is trained online by incorporating the instance weights into the updating of prototypes. The rational is that the instances of higher weights impact more on the best-matching neuron so that they are harder to be misclassified. The wLVQ-NN algorithm is described in the following.
Notations:
After each iteration the neurons are re-labeled by weighted voting principle. The training instances are projected to the best-matching neurons via the nearest principle forming a Voronoi set for each neuron. The Voronoi set is normally composed of instances from different classes. Let
The number of neurons is the factor that impacts mostly on the performance of LVQ. To generate a series of diverse learners, we adopt wLVQ-NN as the base learning algorithm with Bootstrap (a sampling technique with replacement) and randomly selected number of neurons. Each solution provides a different cost-sensitive classifier to make up the ensembles.
In the cost-sensitive classification context, the performance of classifiers is usually evaluated by the total test cost [30] or expected misclassification cost [31] due to the misclassification errors. In this problem using cost ratios as parameter of cost-sensitive learning, we define the total cost ratio as an extension of expected misclassification cost for multi-class classification. Let
Generalized selective ensemble schemes
It is known that an ensemble of multiple classifiers is a feasible approach to promote the performance of a single classifier as reported in many studies. Within the ensemble the component classifiers that make the decision independently are combined in some manners. One of the mostly used combination approaches is weighted voting, which directly quantifies the weight of a vote (denoting the degree of confidence to the final decision) to all available learners. Particularly simple ensemble uses majority voting that assigns the same weight to all learners, i.e., the learners are equally important. Ensemble based on weighted voting combination can be seen as a generalized selective ensemble in the sense that in selective ensembles the discarded learners are equivalently specified with a weight of
In the paper we attempt to apply six strategies, i.e., S-Vote (Simple majority Voting), DW-Vote (Direct Weighted Voting), PSO-WV (Particle Swarm Optimization Weighted Voting), GA-WV (Genetic Algorithm Weighted Voting), SA-WV (Simulated Annealing Weighted Voting), and B-One (Best One), to specify the weights
Combine the component classifiers with the same voting weights (S-Vote); Combine the component classifiers with the voting weights proportional to the individual performance (DW-Vote); Combine the component classifierswith the voting weights optimized by particle swarm optimization (PSO-WV); Combine the component classifiers with the voting weights optimized by genetic algorithm (GA-WV); Combine the component classifiers with the voting weights optimized by simulated annealing (SA-WV); Select the single component classifier of the best performance (B-One).
Parameter setting in the experiments
Given an instance
In the experiments, we carry out an empirical comparison of cost-sensitive selective ensembles built on various combination strategies. The experiments were conducted in the MATLAB environment as follows:
The pairwise comparison matrix is generated randomly and the consistency is verified. Given an experimental data set For each generated training data set, the ensembles are constructed based on a series of weighted LVQ-NNs by using Bootstrap and varying the number of neurons randomly selected from 10 to 100. The classifiers are then combined by different weighted voting approaches respectively. The test data is input to the constructed ensembles respectively, and the performance is evaluated in terms of TCR. The process is repeated 10 times with randomly generated pairwise comparison matrix, and the average results are calculated.
The selective ensembles are implemented in Matlab environment. Table 5 lists the parameters and values (most parameters use the default setting) used in the experiments.
Financial ratios of french database
Performance comparison of various ensemble schemes in terms of average TCR. Each is associated with
In the experiments, 9 datasets, namely Horse, German, Lung, Credit, Iris, Wine, Abalone, Glass, and Zoo, obtained from UCI database [32] are used for performance comparison both in binary class and multi-class. Another benchmark dataset studied in the experiments is French, a financial data set of small or middle scaled business companies described in Table 6. It was used to predict the status (healthy or distress) of a company over a period of years. The data is preprocessed to generate a balanced subset composed of 600 distressed companies and 600 healthy ones. The properties of the experimental data sets are characterized in Table 7. The categorical features contained in horse and credit are converted to several binary ones by means of one categorical value corresponding to one binary feature in the new data set.
Pairwise significance test of various ensemble schemes in terms of average TCR over 10 datasets
Pairwise significance test of various ensemble schemes in terms of average TCR over 10 datasets
*denote the statistical significance at 5% level.
Stability comparison of different SEN schemes.
Performance comparison by varying the size of ensembles on French dataset.
In Table 7, the performance results of various ensemble schemes with a fixed number (10 by default) of learners on the test dataset are presented in terms of the TCR value. The results on each dataset have been averaged over 50 trials. The last column gives the performance resulted from the worst single learner (W-One). Each ensemble is associated with a 3-tuple
The stability property denotes the robustness of an algorithm on different data sets. Let
The stability of different SEN schemes in terms of the sum of relative importance w.r.t. TCR is shown in Fig. 2. The distribution of relative performance on each data set is plotted in stack. It is found that referring to the stability, the ensemble schemes based on evolutionary algorithms (namely, PSOWV-SEN, GAWV-SEN, and SAWV-SEN) perform better than the ensembles upon simple voting and direct weighted voting combination, as well as the best single classifier.
In the following experiment we investigate the performance of different ensemble schemes by varying the size of ensemble from 10 to 50 at the same pairwise comparison matrix. In Fig. 3, French dataset is taken as an example. It was found that increasing the size of ensembles does not always contribute to the overall generalization ability. The ensembles based on evolutionary computing paradigms (particulary PSO and GA) outperform those using simple and direct weighted voting combination.
Conclusions and future work
Cost-sensitive learning and ensemble computing have drawn a lot of research interests in the machine learning field. This paper focuses on the cost-sensitive selective ensemble framework and provides two insights to the investigated field. Firstly different from the commonly used cost matrix, a novel approach is presented that specifies the relative importance of two classes (alternatives) regarding the misclassification cost (decision goal) through a series of pairwise comparisons. The element weights derived from the comparison matrix denote the cost ratios of compared classes. This approach assists decision makers to determine the misclassification cost through a qualitative and intuitive way when the definite cost values in traditional cost matrix are difficultly assigned and hence limit the application in real-life decision problems. Secondly, a selective ensemble framework is constructed to promote the overall generalization ability by adopting the weighted LVQ-NNs as the base classifiers and weighted voting combination through some heuristic approaches in order to achieve better performance. Experiments on some real-life datasets from diverse domains reveal the impact of combination strategies on the generalization ability of ensembles. The selective cost-sensitive ensembles based on evolutionary computing paradigm for ensemble weight adjustments demonstrate promising performance compared with the best single classifier in terms of discriminatory power and stability.
In the future work, some limitations of this empirical study will be addressed as the research directions. Firstly, the selective ensembles were established in homogeneous structure in the sense that all learners were designed by weighted LVQ-NN as the learning method by varying the training instances and parameters. As the framework is applicable to any learning method that can take the cost ratios as parameters, we will constitute various ensembles in heterogenous structure to enhance the diversity of members. Further comparative experiments with competing ensembles such as Bagging and Random Subspace will be done in future research. Secondly, in the proposed approaches the weights of learners are optimized to best fit the validation data. It is of worth to investigate the overfitting problem of evolutionary optimization that probably degrades the generalization ability of ensembles. Thirdly, the proposed ensembles are based on weighted voting combination so that all learners join the ensembles with specified voting weights. A threshold will be used to discard some learners completely and make up an ensembles of small size in future study.
Footnotes
Acknowledgments
This work was supported by national funds through the Beijing National Science Foundation (9182017), President Youth Fund of Institutes of Sciences and Development, CAS (Y7X1151Q01), National Natural Science Foundation of China (11601129), and Introduction of Talent Research Fund of Henan Polytechnic University (Y2017-1).
