Abstract
We propose a weight-adjusted-voting framework that combines an ensemble of classifiers to improve the sensitivity of prediction. In this framework, we first adjust the weight of each classifier in the ensemble based on its ability to make correct predictions, and then use the classifiers’ weights together with a voting strategy to make the final prediction. We also propose a step-wise classifier selection approach and apply it within the framework to select, from all candidate classifiers in an ensemble, the subset that yields better sensitivity. We evaluated the framework on two datasets from the UCI machine learning repository, comparing its sensitivity with that of two other approaches to combining classifiers (voting and stacking) and with that of each single classifier in the ensemble. The results show that our weight-adjusted-voting framework generally achieves higher sensitivity than the other approaches compared in the experiments.
Introduction and related work
In machine learning, sensitivity is a classifier metric that measures the proportion of objects with a positive status (e.g. patients suffering from a disease) that are correctly predicted as positive. Many machine learning applications emphasize high sensitivity in prediction. For example, an automatic case identification system with high sensitivity can assist manual annotators in distinguishing cases from non-cases in free-text electronic medical records [2]. In computer-aided diagnosis such as screening mammography [4, 10] and screening for diabetic retinopathy [1, 22], high sensitivity (i.e. the percentage of sick patients who are correctly diagnosed) is generally considered a required performance indicator.
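In terms of confusion-matrix counts, this definition can be written as follows. The symbols TP (true positives, sick patients correctly predicted as sick) and FN (false negatives, sick patients predicted as healthy) are standard notation introduced here for illustration; the text above does not use them.

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN}
\]
```

As a hypothetical example, a classifier that correctly flags 80 out of 100 sick patients has a sensitivity of 80/(80 + 20) = 0.8, regardless of how it performs on healthy subjects.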
To support these kinds of applications, machine learning based predictions with high sensitivity are needed. Machine learning based prediction can be roughly divided into two categories: a single classifier, and an ensemble of classifiers. A single classifier uses one classification technique, such as an artificial neural network, naïve Bayes, nearest-neighbor, support vector machine, or random forest, to make predictions. However, predictions made by a single classifier may be weak, as they reflect only one classifier’s decision, and different classifiers may reach different decisions on the same classification task. To address this issue, approaches that use an ensemble of classifiers for prediction have been proposed. An ensemble of classifiers combines two or more classifiers to make a final prediction. The existing methods for combining multiple classifiers include voting [14, 15], boosting [7, 17], stacking [6, 9], bagging [19, 23], and some variants [12]. The most commonly used voting strategies are majority voting, a decision rule that predicts the class with the most votes, and weighted majority voting, where each classifier has a weight that indicates the confidence in that classifier’s decision. Boosting algorithms such as AdaBoost convert weak classifiers into a strong classifier by iteratively learning weak classifiers and adding them to the final strong classifier. Stacking first trains some classifiers, and then uses the prediction results of these classifiers as additional inputs to train a different classifier. Bagging (bootstrap aggregating) draws samples with replacement from the original dataset to generate additional training sets, and then uses these sets to build the classifiers in an ensemble.
Since different classifiers may produce different prediction results, some results may contradict or complement others. Hence, predictions based on the classification results of an ensemble of classifiers are more powerful and complete, as the decision-making takes the individual decision of each classifier in the ensemble into consideration.
However, the existing approaches to combining multiple classifiers have four issues. First, some methods [9, 21] do not consider the weight of classifiers. In [9], classifiers with different error types are combined by a learnable classifier that learns the errors and expertise of the ensemble’s classifiers to improve the overall classification accuracy. In [21], multiple classifiers are combined by fusion at the classification level (e.g. “vote” in WEKA, a machine learning software package). These approaches overlook the fact that classifiers which usually make correct decisions should have higher weights than classifiers which usually make wrong decisions. Second, some approaches (e.g. [12]) do assign weights to the classifiers in an ensemble, but they first iteratively adjust the weights of instances (i.e. hard-to-classify instances get higher weights), which influences the decisions of the classifiers and in turn affects the classifiers’ weights. Involving the weights of instances and the iterative interaction between classifiers and instances makes the prediction procedure complex and time-consuming. Third, most approaches that combine multiple classifiers (e.g. [12, 13, 18]) focus only on accuracy and do not report sensitivity, so they cannot support predictions that require high sensitivity. Finally, none of the approaches discussed above considers how to select the proper classifiers among all the available candidate classifiers in the ensemble in order to achieve a high sensitivity.
To address the above shortcomings, we propose a weight-adjusted-voting framework that combines multiple classifiers to improve the sensitivity of the ensemble. The contributions of our work are fourfold. First, the weight of each classifier in the ensemble depends on its predictive power: a classifier with a good prediction history gets a higher weight. A weight-adjusting algorithm is proposed to dynamically adjust the weight of each classifier in the ensemble. Second, the weight-adjusting algorithm only adjusts the weights of classifiers and does not iteratively change the weights of instances, which makes it simpler than approaches that iteratively change both. Third, the step-wise classifier selection in our work can select the proper classifiers among all the available candidates in the ensemble. Finally, to demonstrate the performance of our work, we conduct experimental studies on two different datasets, the Pima Indian Diabetes (PID) dataset [20] and the Diabetic Retinopathy Debrecen (DRD) dataset [5], using the proposed weight-adjusted-voting framework and other approaches (each single classifier in the ensemble, stacking, and voting). The experimental results show that our framework achieves higher sensitivity than the other approaches in most settings on both datasets.
The rest of the paper is organized as follows. In Section 2, we discuss our weight-adjusted-voting framework. In Section 3, we show the experimental results on the PID and DRD datasets. Section 4 explains our step-wise classifier selection approach. Section 5 discusses our work. Section 6 concludes the paper.
Weight-adjusted-voting framework
The idea of weight-adjusted-voting of classifiers in an ensemble of size three has been proposed in our previous work [16]. It is summarized as follows: For a given instance, if a classifier, say
To generalize the above idea of weight-adjusted-voting to an ensemble of any size, we extend our approach to the weight-adjusted-voting framework shown in Fig. 1.
Fig. 1. The weight-adjusted-voting framework on an ensemble of classifiers. The figure does not show any input for stage 4 (the step-wise classifier selection), as the input of this stage varies from M classifiers to fewer than M classifiers. Details of stage 4 are discussed in Section 4.
Assume there are
In this section, we discuss the algorithm for adjusting the weights of the M classifiers in an ensemble (stage 2 of the proposed framework), as shown in Algorithm 1.
We use an ensemble of 10 classifiers to illustrate the algorithm. If all 10 classifiers make correct decisions or all make wrong decisions for a given data instance (i.e. line 6 of the “Processing” part of Algorithm 1), the weight of each classifier does not change for that instance. If the number of correct classifiers (i.e. the classifiers that make correct decisions) is no less than half the number of classifiers (i.e. line 8 of the “Processing” part of Algorithm 1), at least half of the classifiers make correct decisions. In this case, correct classifiers are rewarded with a weight increase and wrong classifiers (i.e. the classifiers that make wrong decisions) are penalized with a weight decrease. Since the number of correct classifiers is no less than the number of wrong classifiers, each correct classifier gains no more weight than each wrong classifier loses. The total weight of all classifiers always remains 1. For example, out of 10 classifiers, if 6 classifiers make correct decisions, then the weight of each correct classifier is increased by step_size1/1.5 (as there are 6 correct classifiers and 4 wrong classifiers), and the weight of each wrong classifier is decreased by step_size1. On the other hand, if the number of correct classifiers is smaller than half the number of classifiers (i.e. line 11 of the “Processing” part of Algorithm 1), more than half of the classifiers make wrong decisions. Hence, each correct classifier gains more weight, as few classifiers make correct decisions, while each wrong classifier loses less weight. For instance, out of 10 classifiers, if 2 classifiers make correct decisions, then the weight of each correct classifier is increased by 4
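The per-instance update described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function name `adjust_weights`, the default step sizes, and the name `step_size2` for the case where fewer than half of the classifiers are correct are all hypothetical. The gain of each correct classifier is derived from the stated constraint that the weights always sum to 1, which reproduces the step_size1/1.5 figure for 6 correct and 4 wrong classifiers.

```python
def adjust_weights(weights, correct, step_size1=0.01, step_size2=0.01):
    """Return updated classifier weights after seeing one data instance.

    weights: current weights of the M classifiers (assumed to sum to 1).
    correct: booleans, True where the classifier predicted this instance correctly.
    step_size1 / step_size2: amount each wrong classifier loses when at least
    half / fewer than half of the classifiers are correct (assumed names).
    """
    m = len(weights)
    n_correct = sum(correct)
    n_wrong = m - n_correct

    # All correct or all wrong: the instance carries no information, no change.
    if n_correct == 0 or n_wrong == 0:
        return list(weights)

    # Each wrong classifier loses `step`; which step applies depends on
    # whether the correct classifiers form at least half of the ensemble.
    step = step_size1 if n_correct >= m / 2 else step_size2

    # Assumption: the total weight removed from wrong classifiers is shared
    # equally among the correct ones, so the weights still sum to 1.
    gain = step * n_wrong / n_correct

    # No clipping is applied here; the source does not say how negative
    # weights would be handled.
    return [w + gain if ok else w - step for w, ok in zip(weights, correct)]
```

With 10 equal weights of 0.1, 6 correct classifiers and `step_size1=0.03`, each correct classifier moves to about 0.12 (a gain of 0.03/1.5) and each wrong classifier to about 0.07.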
Voting
We extend the voting rule proposed in our previous work [16] to combine the decisions of
We again use an ensemble of 10 classifiers to illustrate Algorithm 2, as shown in Table 1. If the number of positive votes is larger than the number of negative votes (e.g. 6 positive votes and 4 negative votes), then the decision of the ensemble is positive. If the number of positive votes is greater than 0 but does not exceed the number of negative votes, the average weight of the positive voters is compared with the average weight of the negative voters to make the final decision: if the average weight of the positive voters is not less than that of the negative voters, the decision of the ensemble is positive; otherwise, it is negative. If there is no positive voter, the decision of the ensemble is negative.
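The three cases of this voting rule can be sketched in Python as follows. This is a minimal sketch of the rule as described in the paragraph above, not Algorithm 2 itself; the function name `ensemble_vote` and the encoding of votes as booleans (True for positive) are assumptions made for illustration.

```python
def ensemble_vote(votes, weights):
    """Combine binary votes into one decision; True means positive.

    votes:   booleans, True where a classifier votes positive.
    weights: the corresponding classifier weights.
    """
    pos = [w for v, w in zip(votes, weights) if v]
    neg = [w for v, w in zip(votes, weights) if not v]

    # Case 1: a strict majority of positive votes decides positive.
    if len(pos) > len(neg):
        return True

    # Case 3: no positive voter at all decides negative.
    if not pos:
        return False

    # Case 2: positives are a non-empty minority (or tie), so `neg` is
    # non-empty too; compare the average weights of the two camps.
    return sum(pos) / len(pos) >= sum(neg) / len(neg)
```

Because a positive minority can still win when its voters carry higher average weight, the rule leans toward positive predictions, which is the direction that favors sensitivity.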
Table 1. The voting algorithm applied in an ensemble of 10 classifiers
To evaluate the performance of our proposed work, we used two data sets from the UCI machine learning repository [24] and used Weka [8] for training each classifier. In this section, we provide a brief discussion on these datasets and their respective experimental results.
Datasets
The Pima Indian Diabetes dataset. This dataset has 268 diabetes patients and 500 normal subjects. All subjects are females who are at least 21 years old and of Pima Indian heritage. Each subject has eight attributes: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test (OGTT), diastolic blood pressure, triceps skin fold thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, and age. After removing subjects with zero values in attributes where zero is biologically impossible, the dataset used in the experiment included 251 diabetes patients and 478 normal subjects. We conducted feature selection on the dataset to find the smallest number of features that achieves the best classification performance. In our experiment, five features were selected: 2-hour OGTT plasma glucose, 2-hour serum insulin, body mass index, diabetes pedigree function, and age.
The Diabetic Retinopathy Debrecen dataset. This dataset contains 18 features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy (DR) or not [3]. It includes 540 images that do not have signs of DR and 611 images that contain signs of DR. All of the images and their features were used in our experiment.
Results
In Table 2, we report the performance of ensembles of different sizes (3 to 10 classifiers) on the Pima Indian Diabetes dataset. We compared the sensitivity of our ensemble with the best sensitivity obtained by other voting strategies on the same ensemble, using the rules of average of probabilities, product of probabilities, majority voting, minimum probability, or maximum probability. We also compared it with the sensitivity of stacking on the same ensemble. All experiments with the other voting rules and stacking were conducted with WEKA. In Table 3, we report the sensitivity of ensembles of different sizes on the Diabetic Retinopathy Debrecen dataset, with our approach, the other voting rules, and stacking.
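For readers unfamiliar with the baseline voting rules named above, the following Python sketch shows one common way to apply them to a binary problem. It is an illustrative assumption, not WEKA's implementation: the function name `combine`, the rule labels, and the tie handling (ties go to negative) are all hypothetical, and each classifier is assumed to output an estimated probability of the positive class.

```python
import math


def combine(probs_pos, rule):
    """Combine per-classifier P(positive) estimates for one instance.

    probs_pos: each classifier's estimated probability of the positive class.
    rule: one of "average", "product", "majority", "min", "max".
    Returns True for a positive ensemble decision (ties go to negative).
    """
    probs_neg = [1.0 - p for p in probs_pos]

    if rule == "majority":
        # Each classifier casts a hard vote; a strict majority is required.
        pos_votes = sum(p > 0.5 for p in probs_pos)
        return pos_votes > len(probs_pos) / 2

    reducers = {
        "average": lambda xs: sum(xs) / len(xs),
        "product": math.prod,
        "min": min,
        "max": max,
    }
    reducer = reducers[rule]
    # Reduce the probabilities of each class separately, then pick the larger.
    return reducer(probs_pos) > reducer(probs_neg)
```

For the hypothetical estimates `[0.9, 0.4, 0.4]`, majority voting decides negative (one positive vote out of three), while the average, product, minimum, and maximum rules all decide positive, which illustrates why the rules can yield different sensitivities on the same ensemble.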
Table 2. Comparison of the sensitivity of classifiers on the Pima Indian Diabetes dataset
Table 3. Comparison of the sensitivity of classifiers on the Diabetic Retinopathy Debrecen dataset
Table 4. The sensitivity of the ensemble of size 10 vs. the sensitivity of each single classifier on the Pima Indian Diabetes dataset
Table 5. The sensitivity of the ensemble of size 10 vs. the sensitivity of each single classifier on the Diabetic Retinopathy Debrecen dataset
As shown in Tables 2 and 3, weight-adjusted-voting on an ensemble of classifiers generally increases the sensitivity. Table 2 shows that the sensitivity of the ensemble tends to increase as the ensemble gets larger. However, as shown in Table 3, on the Diabetic Retinopathy Debrecen dataset the sensitivity of the ensemble fluctuates as the ensemble gets larger.
We also compared the sensitivity of the ensemble with the sensitivity of each single classifier in the ensemble, for different ensemble sizes. Taking an ensemble of size 10 as an example, Tables 4 and 5 show the comparison of sensitivity on the Pima Indian Diabetes dataset and the Diabetic Retinopathy Debrecen dataset, respectively. The ensemble achieves a higher sensitivity than each single classifier in the ensemble.
As observed from Table 3, the sensitivity of the ensemble of classifiers fluctuates as the ensemble gets larger. This is because adding a classifier to an ensemble does not necessarily improve the ensemble’s performance, even though the ensemble is larger. Instead, the performance of the ensemble might decrease after adding a new classifier (e.g. a classifier with a relatively lower performance than the existing classifiers), as the final decision of the ensemble is based on the weighted voting of the existing classifiers and the new classifier. Therefore, finding the combination of classifiers within a given ensemble that produces the best sensitivity becomes an issue. To resolve this issue, we propose a step-wise classifier selection approach that selects the proper classifiers from a given ensemble for the best sensitivity of the ensemble. The idea of our approach is similar to backward elimination in stepwise regression [11] for feature selection. The approach starts from an ensemble with all candidate classifiers and tests the deletion of each single classifier. If deleting a classifier improves the sensitivity of the ensemble, the classifier is removed from the ensemble. If deleting a classifier lowers the sensitivity of the ensemble, the classifier is kept in the ensemble. This procedure is repeated until no further improvement is possible or until there is only one classifier left in the ensemble. Details of the algorithm are shown in Algorithm 3.
For example, on the Diabetic Retinopathy Debrecen dataset, suppose we want to find the best ensemble of classifiers from an ensemble of size 10: ANN, Naïve Bayes, SVM, SimpleLogistic, RBFNetwork, SGD, Decision Tree, Random Forest, BayesNet, and K-Nearest Neighbor. We first remove ANN to see whether the sensitivity of the ensemble of the remaining 9 classifiers improves. If the sensitivity of the remaining 9 classifiers drops, we keep ANN in the ensemble. Then we remove Naïve Bayes from the ensemble of size 10 to see whether the sensitivity of the ensemble of the remaining 9 classifiers improves. If the sensitivity of the remaining 9 classifiers drops, we keep Naïve Bayes in the ensemble, and so on. On the other hand, if the sensitivity of the remaining 9 classifiers improves when we remove ANN, we permanently remove ANN from the original ensemble of size 10. We then start from the ensemble of size 9 (as ANN has been removed) and repeat the selection process. The process stops when there is only one classifier left or no classifier can be further removed from the ensemble.
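The backward-elimination procedure walked through above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than Algorithm 3 itself: the function name `stepwise_select` and the `evaluate` callback (which is assumed to return the ensemble sensitivity for a given list of classifiers, e.g. by running weight adjusting and voting on that subset) are hypothetical. A removal that leaves the sensitivity unchanged is treated here as no improvement, which the text does not specify.

```python
def stepwise_select(candidates, evaluate):
    """Backward elimination of classifiers to maximize ensemble sensitivity.

    candidates: list of classifier identifiers (e.g. names).
    evaluate:   callable taking a list of classifiers and returning the
                sensitivity of the ensemble built from them (assumed helper).
    Returns the selected classifiers and their sensitivity.
    """
    selected = list(candidates)
    best = evaluate(selected)

    improved = True
    while improved and len(selected) > 1:
        improved = False
        for clf in selected:
            # Tentatively delete one classifier and re-evaluate the rest.
            trial = [c for c in selected if c != clf]
            score = evaluate(trial)
            if score > best:
                # Deletion helps: remove it for good and restart the scan
                # from the smaller ensemble.
                selected, best = trial, score
                improved = True
                break
            # Otherwise the classifier is kept and the next one is tested.
    return selected, best
```

In the worst case the scan restarts after every removal, so for M candidates the number of ensemble evaluations grows on the order of M squared, which remains modest for an ensemble of 10 classifiers.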
Discussion
In the weight-adjusted-voting on an ensemble of
As shown in Tables 2 and 3, our approach has a higher sensitivity than the other voting strategies for combining multiple classifiers. For every ensemble size tested, the sensitivity of our approach is higher than the sensitivity of the ensemble using the other voting strategies. In addition, compared with stacking, another way of combining multiple classifiers, our approach has a higher sensitivity for all ensemble sizes on the Diabetic Retinopathy Debrecen dataset. On the Pima Indian Diabetes dataset, our approach has a lower sensitivity than stacking when the ensemble is small (i.e. 3, 4, 5, or 6 classifiers). However, when the ensemble is larger (i.e. 7, 8, 9, or 10 classifiers), the sensitivity of our approach becomes higher than that of stacking.
Moreover, Tables 4 and 5 show that the sensitivity of the ensemble of size 10 is higher than the sensitivity of each single classifier in the ensemble. We conducted experiments for various ensemble sizes (size 3 to size 10). In most cases, the sensitivity of the ensemble is higher than the sensitivity of each single classifier in the ensemble. We intentionally added to the ensemble some classifiers with relatively high sensitivity and some with relatively low sensitivity to see how a single classifier’s sensitivity affects the sensitivity of the ensemble. It turned out that a single classifier’s sensitivity has no clear-cut influence on the ensemble. That is to say, although adding a classifier with a low sensitivity may lower the sensitivity of the ensemble, the decrease may not be very significant. Also, adding a classifier with a relatively high sensitivity may not always improve the ensemble. For example, on the Diabetic Retinopathy Debrecen dataset, when the classifier SimpleLogistic with a sensitivity of 77.9% was added to the ensemble, the sensitivity of the ensemble was 72.5%, which was slightly lower than the sensitivity (73.5%) of the ensemble of size 3 (i.e. SVM, Naïve Bayes, ANN). However, on the Pima Indian Diabetes dataset, when the classifier K-Nearest Neighbor with a sensitivity of 70.6% was added to the ensemble, the sensitivity of the ensemble increased from 76.4% (with 9 classifiers) to 81.2%.
The size of the ensemble affects the performance as well. On the Pima Indian Diabetes dataset, as the size of the ensemble increases, the sensitivity of the ensemble also increases, as shown in Table 2. Although the sensitivity of an ensemble of size 8 is slightly lower than that of an ensemble of size 7, overall the sensitivity trends upward as the ensemble gets larger. However, on the Diabetic Retinopathy Debrecen dataset, as the size of the ensemble increases, the sensitivity of the ensemble rises and falls. To eliminate the classifiers that may decrease the sensitivity of the ensemble, we applied the proposed step-wise classifier selection approach to the 10-classifier ensemble. On the Diabetic Retinopathy Debrecen dataset, for example, it turned out that among the 10 candidate classifiers, a combination of 9 classifiers (SVM, ANN, SimpleLogistic, RBFNetwork, SGD, Naïve Bayes, Decision Tree, Random Forest, K-Nearest Neighbor) produces a sensitivity of 85.8%, which is better than the other selections from the 10 candidate classifiers.
Conclusion
In summary, in this work we propose a weight-adjusted-voting framework on an ensemble of classifiers. This framework involves weight adjusting, voting, and step-wise classifier selection. Experiments have shown that the proposed weight-adjusted-voting can increase the sensitivity of ensembles of different sizes on different datasets. Our work can benefit classification tasks such as computer-aided medical diagnosis, where sensitivity plays a particularly significant role because it is very important to be able to diagnose a sick patient’s illness. Doctors and patients can then take proper action ahead of time for the patient’s health. However, the proposed approach applies only to binary classification problems. Our future work will explore how to use an ensemble of classifiers on time-series data and for multi-class classification problems.
