A weight-adjusted-voting framework on an ensemble of classifiers for improving sensitivity

Abstract

We propose a weight-adjusted-voting framework that combines an ensemble of classifiers to improve the sensitivity of prediction. In this framework, we first adjust each individual classifier’s weight in the ensemble based on its ability to make correct predictions, and then use the weights of the classifiers and a voting strategy to make final predictions. We also propose a step-wise classifier selection approach and apply it in the weight-adjusted-voting framework to select the proper classifiers from all the candidate classifiers in an ensemble for better sensitivity. We used two datasets from the UCI machine learning repository to compare the sensitivity of the proposed weight-adjusted-voting with that of two other approaches to combining classifiers, voting and stacking, and with that of each single classifier in the ensemble. The results demonstrate that our weight-adjusted-voting framework generally achieves better sensitivity than the other approaches compared in the experiment.

Keywords

Ensemble classifiers, weight-adjusted-voting, sensitivity, step-wise selection

1. Introduction and related work

In machine learning, sensitivity measures the proportion of actual positive objects (e.g. patients suffering from a disease) that a classifier correctly predicts as positive. Many machine learning applications emphasize high sensitivity in prediction. For example, an automatic case identification system with high sensitivity can assist manual annotators in distinguishing cases from non-cases in free-text electronic medical records [2]. In computer-aided diagnosis such as screening mammography [4, 10] and screening for diabetic retinopathy [1, 22], high sensitivity (i.e. the percentage of sick patients who are correctly diagnosed) is generally considered a required performance indicator.
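For reference, the definition above can be written in terms of confusion-matrix counts. The symbols $TP$ (true positives) and $FN$ (false negatives) are not used elsewhere in this paper and are introduced here only for illustration:

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN}
\]
```

Here $TP$ is the number of actual positive instances predicted as positive, and $FN$ is the number of actual positive instances predicted as negative.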

To support these kinds of applications, machine learning based predictions with high sensitivity are needed. Machine learning based prediction can be roughly divided into two categories: a single classifier, and an ensemble of classifiers. A single classifier uses one classification technique, such as an artificial neural network, naïve Bayes, nearest-neighbor, support vector machine, or random forest, to make predictions. However, predictions made by a single classifier may be weak, as they reflect only one classifier’s decision, and different classifiers may reach different decisions on the same classification task. To address this issue, approaches that use an ensemble of classifiers have been proposed. An ensemble combines two or more classifiers to make a final prediction. Existing methods for combining multiple classifiers include voting [14, 15], boosting [7, 17], stacking [6, 9], bagging [19, 23], and some variants [12]. The most often used voting strategies are majority voting, a decision rule that predicts the class with the most votes, and weighted majority voting, where each classifier has a weight that indicates the confidence in its decision. Boosting algorithms such as AdaBoost convert weak classifiers into a strong classifier by iteratively learning weak classifiers and adding them to the final strong classifier. Stacking first trains some classifiers, and then uses their prediction results as additional inputs to train a different classifier. Bagging (bootstrap aggregating) samples the original dataset with replacement to generate additional training sets, and then uses these sets to build the classifiers in an ensemble. Since different classifiers may produce different prediction results, some results may contradict or complement others. Hence, predictions based on the classification results of an ensemble of classifiers are more powerful and complete, as the decision-making takes the individual decision of each classifier in the ensemble into consideration.
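To make the two voting strategies mentioned above concrete, the following is a minimal sketch of plain majority voting and weighted majority voting for a single instance. The function names and the list-based inputs are illustrative assumptions and are not taken from any of the cited works.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that receives the most votes."""
    return Counter(votes).most_common(1)[0][0]


def weighted_majority_vote(votes, weights):
    """Return the class label with the largest total classifier weight."""
    totals = {}
    for vote, weight in zip(votes, weights):
        totals[vote] = totals.get(vote, 0.0) + weight
    return max(totals, key=totals.get)
```

With votes `[1, 0, 0]`, plain majority voting returns the negative class, whereas weighted majority voting with weights `[0.6, 0.2, 0.1]` returns the positive class because the single positive voter carries more total weight.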

However, the existing approaches to combining multiple classifiers have four issues. First, some methods [9, 21] do not consider the weights of classifiers. In [9], classifiers with different error types are combined by a learnable classifier, which learns the errors and expertise of the ensemble’s classifiers to improve the overall classification accuracy. In [21], multiple classifiers are combined by fusion at the classification level (e.g. “vote” in WEKA, a machine learning software package). These approaches overlook the fact that classifiers which usually make correct decisions should have higher weights than classifiers which usually make wrong decisions. Second, some approaches (e.g. [12]) do assign weights to classifiers in an ensemble, but they first adjust the weights of instances iteratively (i.e. hard-to-classify instances get higher weights), which influences the decisions of the classifiers and in turn affects the classifiers’ weights. Involving the weights of instances and the iterative interaction between classifiers and instances makes the prediction procedure complex and time-consuming. Third, most approaches that combine multiple classifiers (e.g. [12, 13, 18]) focus on accuracy and do not report sensitivity, so they cannot support predictions that require high sensitivity. Finally, none of the approaches discussed above considers how to select the proper classifiers among all the available candidate classifiers in the ensemble in order to achieve a high sensitivity.

To address the above shortcomings, we propose a weight-adjusted-voting framework that combines multiple classifiers to improve the sensitivity of the ensemble. The contributions of our work are fourfold. First, the weight of each classifier in the ensemble depends on its predictive power: a classifier with a good prediction history gets a higher weight. A weight-adjusting algorithm is proposed to dynamically adjust the weight of each classifier in the ensemble. Second, the weight-adjusting algorithm adjusts only the weights of classifiers and does not iteratively change the weights of instances, which makes it simpler than approaches that iteratively change both. Third, the step-wise classifier selection in our work selects the proper classifiers among all the available candidates in the ensemble. Finally, to demonstrate the performance of our work, we conduct experimental studies on two datasets, the Pima Indian Diabetes (PID) dataset [20] and the Diabetic Retinopathy Debrecen (DRD) dataset [5], using the proposed weight-adjusted-voting framework and other approaches: each single classifier in the ensemble, stacking, and voting. The experimental results show that our framework achieves higher sensitivity than the other approaches in most settings on both datasets.

The rest of the paper is organized as follows. In Section 2, we discuss our weight-adjusted-voting framework. We show the experimental results on the PID and DRD datasets respectively in Section 3. Section 4 explains our stepwise classifier selection approach. Section 5 is the discussion of our work. In Section 6, we conclude the paper.

2. Weight-adjusted-voting framework

The idea of weight-adjusted-voting of classifiers in an ensemble of size three was proposed in our previous work [16]. It is summarized as follows. For a given instance, if a classifier, say $A$ , makes a correct prediction while the other two classifiers, say $B$ and $C$ , make wrong predictions, then classifier $A$ ’s weight is increased and the weights of $B$ and $C$ are equally decreased. Since classifier $A$ is the only one that correctly classifies the given instance in this scenario, classifier $A$ gets more credit (i.e. more weight is added to classifier $A$ ). On the other hand, if classifier $A$ makes a wrong prediction while $B$ and $C$ make correct predictions, then classifier $A$ ’s weight drops and the other two classifiers’ weights increase equally. In this case, $B$ and $C$ each get less credit (i.e. less weight is added), as two classifiers make correct predictions. If all three classifiers make correct predictions or all make wrong predictions, then none of the weights change. After the weight of each classifier is determined, a voting strategy is applied to combine the decisions of the classifiers. If the number of positive votes is larger than the number of negative votes, the decision is positive. If the number of positive votes is zero, the decision is negative. If the number of positive votes is non-zero but does not exceed the number of negative votes, then the final decision is determined by the group of voters (positive or negative) with the higher average weight.

To illustrate the above idea of weight-adjusted-voting of classifiers in the ensemble of any size, we extend our approach to a weight-adjusted-voting framework shown in Fig. 1.

Figure 1.

The weight-adjusted-voting framework on an ensemble of classifiers. It does not show any input for stage 4: the step-wise classifier selection, as the input of this stage varies from M classifiers to less than M classifiers. Details of stage 4 are discussed in Section 4.

Assume there are $M$ classifiers in the ensemble, and $N$ instances in the data set. In stage 1, we randomly select 1/3 of positive instances and 1/3 of negative instances from the data set, and use them as the training data for training models of each single classifier (i.e. trained classifier 1, trained classifier 2, etc.). By saying positive/negative instances, we mean the instances with classification label “1”/“0” in binary classification. In stage 2, we select a different 1/3 of positive instances and 1/3 of negative instances from the data set, and use them as the “weight adjuster” data. We use the “adjuster” dataset to test each trained classifier and generate the prediction results of using the trained classifiers on the dataset (i.e. result of using classifier 1, result of using classifier 2, etc. in stage 2). The “adjuster” dataset, together with the prediction result of each classifier for individual instance in the set, is used to adjust each classifier’s weight in the ensemble for making the final decision. In stage 3, we use the remaining 1/3 data of the $N$ instances as the testing dataset to generate the prediction results of using the trained classifiers on the dataset (i.e. result of using classifier 1, result of using classifier 2, etc. in stage 3). At the end, we apply our proposed voting rule (see Section 2.2) to combine the decisions of $M$ classifiers. The voting rule takes the majority voting among $M$ classifiers and the weight of each classifier into consideration to make the final decision. In stage 4, we apply the step-wise classifier selection to select the proper classifiers from the given ensemble for the best sensitivity of the ensemble. When applying the step-wise classifier selection, stages 1, 2, and 3 are repeated sequentially for a different number of classifiers, varying from $M$ classifiers to less than $M$ classifiers. 
The output of the framework is an optimal combination of classifiers that could produce the best sensitivity from the candidate classifiers in the given ensemble.
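The data partitioning used in stages 1 to 3 can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_three_way_split`, the use of a seeded shuffle, and the choice to place any remainder (when a class size is not divisible by three) in the testing part are ours and are not specified in the framework description.

```python
import random


def stratified_three_way_split(labels, seed=0):
    """Split instance indices into train / adjuster / test parts, one third per class.

    labels: list of 0/1 class labels, one per instance.
    Returns a dict mapping part name to a list of instance indices.
    """
    rng = random.Random(seed)
    parts = {"train": [], "adjuster": [], "test": []}
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        third = len(idx) // 3
        parts["train"] += idx[:third]               # stage 1: train each classifier
        parts["adjuster"] += idx[third:2 * third]   # stage 2: adjust classifier weights
        parts["test"] += idx[2 * third:]            # stage 3: remaining instances (assumption)
    return parts
```

Splitting each class separately keeps the proportion of positive and negative instances roughly equal across the three parts, which matches the description of selecting 1/3 of the positive and 1/3 of the negative instances at each stage.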

2.1 Weight-adjusting algorithm

In this section, we discuss the algorithm of adjusting weight of M classifiers in an ensemble – stage 2 of the proposed framework, as shown in Algorithm 1.

Algorithm 1: Adjusting the weight of each classifier in an ensemble of size $M$
Input:
$T$ : the number of instances in the “adjuster” data set (as shown in Fig. 1)
Array $i$ of size $T$ , which saves the prediction results of the “adjuster” data set using classifier $i$ , where $i=$ 0, 1, 2, …, $M$ -1 and $M\in$ Z ${}^{+}$
Array $L$ of size $T$ , which saves the actual class membership of each instance in the “adjuster” data set
step_size1: the step size used to adjust weights when the number of classifiers that make a correct decision is no less than $H$ (half the number of classifiers, rounded up)
step_size2: the step size used to adjust weights when the number of classifiers that make a correct decision is smaller than $H$
Output:
$W$ : a vector ( $W_{0}$ , $W_{1}$ , $W_{2}$ , … $W_{M-1}$ ) that stores the weight of classifier 0, 1, 2, …, $M$ -1 respectively
Processing:
1 compare array 0 and array $L$ , array 1 and array $L$ , array 2 and array $L$ , …, array $M$ -1 and array $L$ to build a matrix $R$ of $T$ rows and $M$ columns such that $R[i][j]$ records whether classifier $j$ correctly classifies instance $i$ (0 for wrong decision, 1 for correct decision), where $i=$ 0, 1, …, $T$ -1 and $j=$ 0, 1, … $M$ -1.
2 let $W_{0}=W_{1}=W_{2}=\ldots=W_{M-1}:=1/M$
3 let $H=\lceil M/2\rceil$
4 for i $:=$ 0 to $T$ -1 do
5     check the values in $R[i][j]$ , where $j=$ 0, 1, … $M$ -1
6     if all $R[i][j]$ are 0 or all $R[i][j]$ are 1 ( $j=$ 0, 1, … $M$ -1)
7         do nothing; /* do not change the weights $W_{0}$ , $W_{1}$ , $W_{2}$ , …, $W_{M-1}$ */
8     else if (the number of correct classifiers $>=$ $H$ )
9         weight of each correct classifier $+=$ step_size1/(the number of correct classifiers/the number of wrong classifiers)
10        weight of each wrong classifier $-=$ step_size1
11    else if (the number of correct classifiers $<$ $H$ )
12        weight of each correct classifier $+=$ (the number of wrong classifiers/the number of correct classifiers) $\times$ step_size2
13        weight of each wrong classifier $-=$ step_size2
14 return $W$

Take an ensemble of 10 classifiers (so $H=5$ ) to exemplify the algorithm. If all 10 classifiers make correct decisions or all make wrong decisions for a given data instance (i.e. line 6 of the “Processing” part of Algorithm 1), the weight of each classifier does not change for that instance. If the number of correct classifiers (i.e. the classifiers that make correct decisions) is no less than $H$ (i.e. line 8 of the “Processing” part of Algorithm 1), at least half of the classifiers make correct decisions. In this case, correct classifiers are credited with an increase in weight and wrong classifiers (i.e. the classifiers that make wrong decisions) are penalized with a decrease in weight. Since the number of correct classifiers is no less than the number of wrong classifiers, each correct classifier gains no more weight than each wrong classifier loses. The total weight of all classifiers is always 1. For example, out of 10 classifiers, if 6 classifiers make correct decisions, then the weight of each correct classifier is increased by step_size1/1.5 (as there are 6 correct classifiers and 4 wrong classifiers), and the weight of each wrong classifier is decreased by step_size1. On the other hand, if the number of correct classifiers is smaller than $H$ (i.e. line 11 of the “Processing” part of Algorithm 1), more than half of the classifiers make wrong decisions. Hence, each correct classifier gains more weight, as only a few classifiers make correct decisions, while each wrong classifier loses less weight. For instance, out of 10 classifiers, if 2 classifiers make correct decisions, then the weight of each correct classifier is increased by 4 $\times$ step_size2 (as there are 8 wrong classifiers and 2 correct classifiers), and the weight of each wrong classifier is decreased by step_size2.
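The weight-adjusting procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in our experiments: the function name `adjust_weights` and the layout of `predictions` (one list of 0/1 predictions per classifier) are assumptions made for the example.

```python
from math import ceil


def adjust_weights(predictions, labels, step_size1, step_size2):
    """Adjust classifier weights on the "adjuster" data set (sketch of Algorithm 1).

    predictions[j][i]: prediction of classifier j for adjuster instance i.
    labels[i]: actual class of adjuster instance i.
    Returns the list of weights W_0, ..., W_{M-1}.
    """
    m = len(predictions)
    weights = [1.0 / m] * m          # every classifier starts at 1/M
    half = ceil(m / 2)               # H in Algorithm 1
    for i, truth in enumerate(labels):
        correct = [j for j in range(m) if predictions[j][i] == truth]
        wrong = [j for j in range(m) if predictions[j][i] != truth]
        if not correct or not wrong:
            continue                 # all correct or all wrong: weights unchanged
        # step_size1 when at least H classifiers are correct, step_size2 otherwise
        step = step_size1 if len(correct) >= half else step_size2
        gain = step * len(wrong) / len(correct)
        for j in correct:
            weights[j] += gain       # credit shared among the correct classifiers
        for j in wrong:
            weights[j] -= step       # fixed penalty for each wrong classifier
    return weights
```

For the 10-classifier example above with 6 correct classifiers, each correct weight grows by `step_size1 * 4 / 6`, which equals step_size1/1.5, and each wrong weight shrinks by step_size1, so the weights still sum to 1.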

2.2 Voting

We extend the voting rule proposed in our previous work [16] to combine the decisions of $M$ classifiers. The rule involves the majority voting and also takes the weights of classifiers into consideration, as shown in Algorithm 2.

Algorithm 2: Voting on an ensemble of size $M$
Input:
An instance in the testing data set (as shown in Fig. 1)
Array $P$ of size $M$ , which saves the prediction result for the given instance using classifier $i$ , where $i=$ 0, 1, 2, …, $M$ -1 and $M\in$ Z ${}^{+}$
$W$ : a vector ( $W_{0}$ , $W_{1}$ , $W_{2}$ , … $W_{M-1}$ ) that stores the weight of classifier 0, 1, 2, …, $M$ -1 respectively
Output:
$R$ : the final prediction result of the given instance
Processing:
1 if the number of positive votes (i.e. value $=$ 1) in the array P exceeds the number of negative votes (i.e. value $=$ 0)
2     R $=$ 1
3 if the number of positive votes in the array P is not 0 and does not exceed the number of negative votes
4     if the average weight of positive voters $>=$ the average weight of negative voters
5         R $=$ 1
6     else
7         R $=$ 0
8 if the number of positive votes in the array P is 0
9     R $=$ 0
10 return R

We again take an ensemble of 10 classifiers as an example to illustrate Algorithm 2, as shown in Table 1. If the number of positive votes is larger than the number of negative votes (e.g. 6 positive votes and 4 negative votes), then the decision of the ensemble is positive. If the number of positive votes is greater than 0 but does not exceed the number of negative votes, the average weight of positive voters is compared with the average weight of negative voters to make the final decision. If the average weight of positive voters is not less than the average weight of negative voters, the decision of the ensemble is positive. Otherwise, the decision of the ensemble is negative. If there is no positive voter, the decision of the ensemble is negative.

Table 1
The voting algorithm applied in an ensemble of 10 classifiers

Number of positive votes	Number of negative votes	Decision of the ensemble
10, 9, 8, 7, or 6	0, 1, 2, 3, or 4	Positive (i.e. R $=$ 1), as the number of positive votes is greater than the number of negative votes. See lines 1–2 of the “Processing” part of Algorithm 2.
5, 4, 3, 2, or 1	5, 6, 7, 8, or 9	Positive (i.e. R $=$ 1) if (1) the number of positive votes is greater than 0 and does not exceed the number of negative votes, AND (2) the average weight of positive voters is not less than the average weight of negative voters. See lines 3–5 of the “Processing” part of Algorithm 2. Negative (i.e. R $=$ 0) if (1) is true but (2) is false. See lines 3 and 6–7 of the “Processing” part of Algorithm 2.
0	10	Negative (i.e. R $=$ 0), as the number of positive votes is 0. See lines 8–9 of the “Processing” part of Algorithm 2.
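The voting rule of Algorithm 2, as summarized in Table 1, can be sketched as follows. The function name `weighted_vote` and the list-based inputs are assumptions made for illustration.

```python
def weighted_vote(votes, weights):
    """Combine 0/1 votes of M classifiers for one instance (sketch of Algorithm 2).

    votes[j]: prediction (1 = positive, 0 = negative) of classifier j.
    weights[j]: weight of classifier j, e.g. as produced by the weight-adjusting step.
    Returns the final prediction R (1 or 0).
    """
    positive = [w for v, w in zip(votes, weights) if v == 1]
    negative = [w for v, w in zip(votes, weights) if v == 0]
    if len(positive) > len(negative):
        return 1                      # positive majority
    if not positive:
        return 0                      # no positive vote at all
    # Positive voters are a non-empty minority (or tie): compare average weights.
    avg_positive = sum(positive) / len(positive)
    avg_negative = sum(negative) / len(negative)
    return 1 if avg_positive >= avg_negative else 0
```

Note that the rule is asymmetric: a positive majority always wins, whereas a negative majority can be overruled by positive voters with a higher (or equal) average weight. This asymmetry favors positive decisions, which is in line with the goal of improving sensitivity.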

3. Experimental results

To evaluate the performance of our proposed work, we used two data sets from the UCI machine learning repository [24] and used Weka [8] for training each classifier. In this section, we provide a brief discussion on these datasets and their respective experimental results.

3.1 Datasets

The Pima Indian diabetes data set. This data set has 268 diabetes patients and 500 normal subjects. All subjects are females who are at least 21 years old and of Pima Indian heritage. Each subject has eight attributes: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test (OGTT), diastolic blood pressure, triceps skin fold thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, and age. After removing subjects with zero values where zeros are biologically impossible, the data set used in the experiment included 251 diabetes patients and 478 normal subjects. We conducted feature selection on the data set to find the smallest number of features that achieves the best classification performance. In our experiment, five features were selected: 2-hour OGTT plasma glucose concentration, 2-hour serum insulin, body mass index, diabetes pedigree function, and age.

The Diabetic Retinopathy Debrecen data set. This dataset contains 18 features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy (DR) or not [3]. It includes 540 images that do not have signs of DR and 611 images that contain signs of DR. All of the images and their features were used in our experiment.

3.2 Results

In Table 2, we report the sensitivity obtained with different numbers of classifiers (3 to 10) in the ensemble on the Pima Indian diabetes dataset. We compared the sensitivity of our ensemble with the best sensitivity obtained by other voting strategies on the same ensemble, using the rules of average of probabilities, product of probabilities, majority voting, minimum probability, or maximum probability. We also compared it with the sensitivity of stacking on the same ensemble. All of the experiments with other voting rules and stacking were conducted with WEKA. In Table 3, we report the sensitivity obtained with different numbers of classifiers in the ensemble on the Diabetic Retinopathy Debrecen dataset, with our approach, other voting rules, and stacking.

Table 2
Comparison of the sensitivity of classifiers on the Pima Indian diabetes dataset

Size of the ensemble	Sensitivity of weight-adjusted-voting (%)	Best sensitivity of other voting rules (%)	Sensitivity of stacking (%)
3	62.4	61.2	64.7
4	62.4	61.2	69.4
5	64.7	62.4	67.1
6	64.7	61.2	65.9
7	70.6	62.5	67.1
8	69.4	62.5	69.4
9	76.4	62.5	70.6
10	81.2	62.5	74.1

Table 3

Comparison of the sensitivity of classifiers on the Diabetic Retinopathy Debrecen dataset

Size of the ensemble	Sensitivity of weight-adjusted-voting (%)	Best sensitivity of other voting rules (%)	Sensitivity of stacking (%)
3	73.5	68.6	61.8
4	72.5	71.6	61.8
5	84.3	70.3	61.8
6	80.9	70.3	62.3
7	79.9	70.9	62.7
8	83.3	70.9	70.1
9	81.4	70.9	68.1
10	82.4	70.9	71.6

Table 4

The sensitivity of the ensemble of size 10 vs. the sensitivity of each single classifier on the Pima Indian diabetes dataset

Classifier	Sensitivity (%)
SVM	58.8
Naive Bayes	60.0
ANN	55.3
SimpleLogistic	63.5
RBFNetwork	64.7
SGD	60.0
Decision Tree	67.1
Random Forest	60.0
BayesNet	61.2
K-Nearest Neighbor	70.6
Ensemble	81.2

Table 5

The sensitivity of the ensemble of size 10 vs. the sensitivity of each single classifier on the Diabetic Retinopathy Debrecen dataset

Classifier	Sensitivity (%)
SVM	66.2
Naive Bayes	63.2
ANN	68.6
SimpleLogistic	77.9
RBFNetwork	70.6
SGD	21.6
Decision Tree	64.7
Random Forest	68.6
BayesNet	41.2
K-Nearest Neighbor	58.3
Ensemble	82.4

As shown in Tables 2 and 3, weight-adjusted-voting on an ensemble of classifiers generally increases the sensitivity. Table 2 shows that the sensitivity of the ensemble tends to increase as the size of the ensemble gets larger. However, as shown in Table 3, for the Diabetic Retinopathy Debrecen dataset, the sensitivity of the ensemble goes up and down as the size of the ensemble gets bigger.

We also compared the sensitivity of the ensemble with the sensitivity of each single classifier in the ensemble, for different sizes of ensemble. Take an ensemble of size 10 as an example. Tables 4 and 5 show the comparison of sensitivity on the Pima Indian diabetes dataset and Diabetic Retinopathy Debrecen dataset, respectively. The ensemble of classifiers demonstrates a higher sensitivity than each classifier’s sensitivity in the ensemble.

4. Stepwise classifier selection approach

As observed from Table 3, the sensitivity of the ensemble of classifiers goes up and down as the size of ensemble gets bigger. This is because the addition of a classifier to an ensemble may not contribute to the increment of the ensemble’s performance although the size of the ensemble is bigger. Instead, the performance of the ensemble might decrease after adding a new classifier (e.g. a classifier with a relatively lower performance than the performance of the existing classifiers) as the final decision of the ensemble is based on the weighted voting of the existing classifiers and the new classifier. Therefore, from the given ensemble of classifiers, how to find a combination of several classifiers that can produce the best sensitivity becomes an issue. To resolve this issue, we propose a stepwise classifier selection approach to select the proper classifiers from a given ensemble for best sensitivity of the ensemble. The idea of our approach is similar to the backward elimination in stepwise regression [11] for feature selection. The approach starts from an ensemble with all candidate classifiers and tests the deletion of each single classifier. If deleting a classifier improves the sensitivity of the ensemble, then the classifier is removed from the ensemble. If deleting a classifier drops the sensitivity of the ensemble, then the classifier is kept in the ensemble. This procedure is repeated until no further improvement is possible or until there is only one classifier left in the ensemble. Details of the algorithm are shown in Algorithm 3.

Algorithm 3: Stepwise classifier selection process
Input:
A set of $N$ classifiers in the ensemble, where $N$ is the total number of all the available classifiers
Output:
A set of $M$ classifiers in the ensemble ( $1\leqslant M\leqslant N$ ), where $M$ is the optimal number of the selected classifiers among all the available candidates
Processing:
1 if $N=$ 1
2     return the classifier
3 for $i:=$ 0 to $N$ -1 do
4     find the sensitivity of the ensemble with classifier $i$ removed
5     if the sensitivity of the ensemble with classifier $i$ removed increases
6         remove classifier $i$ from the original ensemble
7         quit the process and start over the whole process with the input ensemble with classifier $i$ removed
8     if the sensitivity of the ensemble with classifier $i$ removed decreases
9         keep the classifier $i$ in the original ensemble
10 return the classifiers left in the ensemble

For example, on the Diabetic Retinopathy Debrecen dataset, to find the best ensemble of classifiers from an ensemble of size 10 (ANN, Naive Bayes, SVM, SimpleLogistic, RBFNetwork, SGD, Decision Tree, Random Forest, BayesNet, and K-Nearest Neighbor), we first remove ANN to see if the sensitivity of the ensemble of the remaining 9 classifiers improves. If the sensitivity of the remaining 9 classifiers drops, we keep ANN in the ensemble. Then we remove Naive Bayes from the ensemble of size 10 to see if the sensitivity of the ensemble of the remaining 9 classifiers improves. If the sensitivity of the remaining 9 classifiers drops, we keep Naive Bayes in the ensemble, and so on. On the other hand, if the sensitivity of the remaining 9 classifiers improves when we remove ANN, we permanently remove ANN from the original ensemble of size 10. We then start from an ensemble of size 9 (as ANN is removed), and repeat the selection process. If there is only one classifier left or no classifier can be further removed from the ensemble, the process stops.
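The selection procedure described above can be sketched as a backward-elimination loop. This is a minimal illustration under assumptions: `stepwise_select` and the callback `sensitivity_of` are hypothetical names, and `sensitivity_of` stands for whatever procedure evaluates a candidate ensemble (in our framework, repeating stages 1, 2, and 3). Because Algorithm 3 does not state what happens when the sensitivity is unchanged, the sketch keeps the classifier in that case.

```python
def stepwise_select(classifiers, sensitivity_of):
    """Backward elimination of classifiers (sketch of Algorithm 3).

    classifiers: list of classifier identifiers (e.g. names).
    sensitivity_of: function mapping a list of classifiers to the ensemble's sensitivity.
    Returns the list of classifiers kept in the ensemble.
    """
    current = list(classifiers)
    improved = True
    while improved and len(current) > 1:
        improved = False
        baseline = sensitivity_of(current)
        for clf in current:
            reduced = [c for c in current if c != clf]
            if sensitivity_of(reduced) > baseline:
                current = reduced     # remove the classifier and start over
                improved = True
                break
            # otherwise the classifier is kept and the next one is tried
    return current
```

As a toy usage example, with hypothetical scores `{"A": 0.9, "B": 0.8, "C": 0.2}` and an evaluation function that returns the mean score of the ensemble, the loop first drops C, then B, and returns `["A"]`. A real evaluation function would not be a simple mean, so the outcome on actual data depends on how the classifiers interact in the vote.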

5. Discussion

In weight-adjusted-voting on an ensemble of $M$ classifiers, it is important to determine the values of the step sizes. The step sizes are determined and fine-tuned experimentally for better sensitivity, through heuristic trial and error. Take an ensemble of size 10 on the Pima Indian Diabetes dataset as an example. Based on our experiment, with both step_size1 and step_size2 equal to 0.005, the final weight of each classifier is as follows: ANN – 0.1144; Naive Bayes – 0.084; SVM – 0.022; SimpleLogistic – 0.0151; RBFNetwork – 0.121; SGD – 0.0675; Decision Tree – 0.139; Random Forest – 0.051; BayesNet – 0.252; K-Nearest Neighbor – 0.134. We also tried other values of step_size1 and step_size2, but they did not yield good sensitivity. For a different dataset, the weight of each classifier in the same ensemble may be different. For instance, in an ensemble of size 10 on the Diabetic Retinopathy Debrecen dataset, based on our experiment, with both step_size1 and step_size2 equal to 0.0005, the final weight of each classifier is as follows: ANN – 0.1049; Naive Bayes – 0.0983; SVM – 0.0904; SimpleLogistic – 0.0979; RBFNetwork – 0.1062; SGD – 0.0902; Decision Tree – 0.0966; Random Forest – 0.1023; BayesNet – 0.0943; K-Nearest Neighbor – 0.1189. Similarly, other values of step_size1 and step_size2 did not produce good sensitivity of the ensemble. In addition, with the weight adjustment presented in Algorithm 1, a classifier’s weight may become greater than 1 or less than 0. Weights above 1 or below 0 are acceptable in our experiment, as the voting rule depends only on the number of positive/negative votes and on a comparison of the average weights of positive and negative voters. A classifier with a weight greater than 1 plays an important role in the final voting; a classifier with a negative weight is less important in the final voting.
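The possibility of weights outside $[0, 1]$ is compatible with the fixed total weight noted in Section 2.1. As a short check, consider an “adjuster” instance with $c \geq 1$ correct and $w \geq 1$ wrong classifiers, and let $s$ denote whichever of step_size1 or step_size2 applies (the symbols $c$, $w$, and $s$ are introduced here for illustration only). Both branches of Algorithm 1 add $s \cdot w/c$ to each correct classifier and subtract $s$ from each wrong one, so the net change in the total weight is

```latex
\[
\Delta \sum_{j=0}^{M-1} W_j \;=\; c \cdot \frac{s\,w}{c} \;-\; w \cdot s \;=\; 0 .
\]
```

The sum therefore stays at its initial value of 1, while an individual weight can still drift above 1 or below 0 when a classifier is repeatedly credited or penalized. The two sets of final weights listed above are consistent with this: each sums to 1.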

As shown in Tables 2 and 3, our approach has a higher sensitivity than other voting strategies for combining multiple classifiers: for every ensemble size, the sensitivity of our approach is higher than the best sensitivity of the ensemble using other voting strategies. In addition, compared with stacking, another way of combining multiple classifiers, our approach has a higher sensitivity for all ensemble sizes on the Diabetic Retinopathy Debrecen dataset. On the Pima Indian Diabetes dataset, our approach has a lower sensitivity than stacking when the ensemble size is small (i.e. 3, 4, 5, or 6 classifiers). However, when the ensemble size gets bigger (i.e. 7, 9, or 10 classifiers), the sensitivity of our approach becomes higher than that of stacking; with 8 classifiers, the two are equal (69.4%).

Moreover, Tables 4 and 5 show that the sensitivity of the ensemble of size 10 is higher than the sensitivity of each single classifier in the ensemble. We conducted experiments for various sizes of ensemble (size 3 to size 10). In most cases, the sensitivity of the ensemble is higher than the sensitivity of each single classifier in the ensemble. We intentionally added some classifiers with a relatively high sensitivity and some with a relatively low sensitivity to the ensemble to see how a single classifier’s sensitivity affects the sensitivity of the ensemble. It turned out that the influence of a single classifier’s sensitivity on the ensemble is not pronounced. That is to say, although adding a classifier with a low sensitivity may reduce the sensitivity of the ensemble, the reduction may not be very large. Also, adding a classifier with a relatively high sensitivity may not always improve the ensemble. For example, on the Diabetic Retinopathy Debrecen dataset, when the classifier SimpleLogistic with a sensitivity of 77.9% was added to the ensemble, the sensitivity of the ensemble was 72.5%, which was slightly lower than the sensitivity (73.5%) of the ensemble of size 3 (i.e. SVM, Naive Bayes, ANN). However, on the Pima Indian Diabetes dataset, when the classifier K-Nearest Neighbor with a sensitivity of 70.6% was added to the ensemble, the sensitivity of the ensemble increased from 76.4% (with 9 classifiers) to 81.2%.

The size of the ensemble affects the performance as well. On the Pima Indian Diabetes dataset, the sensitivity of the ensemble generally increases with the size of the ensemble, as shown in Table 2. Although the sensitivity of an ensemble of size 8 is slightly lower than that of an ensemble of size 7, the overall trend is upward as the size of the ensemble gets bigger. However, on the Diabetic Retinopathy Debrecen dataset, the sensitivity of the ensemble rises and falls as the size of the ensemble increases. To eliminate the classifiers that may reduce the sensitivity of the ensemble, we applied the proposed step-wise classifier selection approach to the 10-classifier ensemble. Take the Diabetic Retinopathy Debrecen dataset as an example. Among the 10 candidate classifiers in the ensemble, a combination of 9 classifiers (SVM, ANN, SimpleLogistic, RBFNetwork, SGD, Naive Bayes, Decision Tree, Random Forest, and K-Nearest Neighbor) produces a sensitivity of 85.8%, which is better than other selections from the 10 candidate classifiers.

6. Conclusion

To sum up, in this work we propose a weight-adjusted-voting framework on an ensemble of classifiers. The framework involves weight adjusting, voting, and step-wise classifier selection. Experiments have shown that the proposed weight-adjusted-voting can increase the sensitivity of ensembles of different sizes on different datasets. Our work can benefit classification tasks such as computer-aided medical diagnosis. Sensitivity plays an especially important role in medical diagnosis, because it is very important to correctly identify patients who are sick, so that doctors and patients can take proper action early. However, the proposed approach currently applies only to binary classification problems. Our future work will explore how to use an ensemble of classifiers on time-series data and for multi-class classification problems.

References

1. Abràmoff M.D. et al., Automated analysis of retinal images for detection of referable diabetic retinopathy, JAMA Ophthalmology 131(3) (2013), 351–357.

2. Afzal et al., Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records, BMC Medical Informatics and Decision Making 13(1) (2013), 1.

3. Antal and Hajdu, An ensemble-based system for automatic screening of diabetic retinopathy, Knowledge-Based Systems 60 (2014), 20–27.

4. Burhenne, Wood L.J. et al., Potential contribution of computer-aided detection to the sensitivity of screening mammography, Radiology 215(2) (2000), 554–562.

5. Diabetic Retinopathy Debrecen Data Set [Online], Available: https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set.

6. Džeroski and Ženko, Is combining classifiers with stacking better than selecting the best one? Machine Learning 54(3) (2004), 255–273.

7. Galar et al., A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(4) (2012), 463–484.

8. Hall, Frank, Holmes, Pfahringer, Reutemann and Witten I.H., The WEKA data mining software: an update, ACM SIGKDD Explorations Newsletter 11(1) (2009), 10–18.

9. Hatami and Ebrahimpour, Combining multiple classifiers: diversify with boosting and combining by stacking, Int. J. Comput. Sci. Network Security 7(1) (2007), 127–131.

10. Helvie M.A. et al., Sensitivity of Noncommercial Computer-aided Detection System for Mammographic Breast Cancer Detection: Pilot Clinical Trial, Radiology 231(1) (2004), 208–214.

11. Kadane J.B. and Lazar N.A., Methods and criteria for model selection, Journal of the American Statistical Association 99(465) (2004), 279–290.

12. Kim, Moon and Ahn, A weight-adjusted voting algorithm for ensembles of classifiers, Journal of the Korean Statistical Society 40(4) (2011), 437–449.

13. Krajewski, Schnieder, Sommer, Batliner and Schuller, Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech, Neurocomputing 84 (2012), 65–75.

14. Kuncheva L.I. and Rodríguez J.J., A weighted voting framework for classifiers ensembles, Knowledge and Information Systems 38(2) (2014), 259–275.

15. Lam and Suen S.Y., Application of majority voting to pattern recognition: an analysis of its behavior and performance, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 27(5) (1997), 553–568.

16. Diagnosis of Diabetes Using a Weight-Adjusted Voting Approach, in: Bioinformatics and Bioengineering (BIBE), 2014 IEEE International Conference on, 2014 November, pp. 320–324. IEEE.

17. Mayr et al., The evolution of boosting algorithms, Methods of Information in Medicine 53(6) (2014), 419–427.

18. Nanni and Lumini, Ensemblator: An ensemble of classifiers for reliable classification of biological data, Pattern Recognition Letters 28(5) (2007), 622–630.

19. Oza N.C., Online bagging and boosting, in: 2005 IEEE International Conference on Systems, Man and Cybernetics, Vol. 3, 2005. IEEE.

20. Pima Indians Diabetes Data Set, [Online], Available: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.

21. Salama G.I., Abdelhalim and Zeid M.A.E., Breast cancer diagnosis on three different datasets using multi-classifiers, Breast Cancer (WDBC) 32(569) (2012), 2.

22. Sánchez C.I. et al., Evaluation of a computer-aided diagnosis system for diabetic retinopathy screening on public data, Investigative Ophthalmology & Visual Science 52(7) (2011), 4866–4871.

23. M.C., Dongil and Dongkyoo, A comparative study of medical data classification methods based on decision tree and bagging algorithms, in: Dependable, Autonomic and Secure Computing, 2009. DASC’09. Eighth IEEE International Conference on. IEEE, 2009.

24. UCI Machine Learning Repository, [Online], Available: http://archive.ics.uci.edu/ml/.