Abstract
Empirical evidence suggests that ensembles with adequate levels of pairwise diversity among a set of accurate member algorithms can significantly outperform any of the individual algorithms. As a result, several diversity measures have been developed for use in optimizing ensembles. We show, however, that there is natural tension between the pairwise diversity of ensemble members and their individual accuracy. While efficient ensembles can be built with stronger forms of diversity, they also suffer in overall accuracy. On the other hand, ensembles built with weaker forms of diversity can be very accurate, but tend to be significantly more computationally expensive. We discuss these findings in light of the notion of diversity space.
Introduction
Ensemble learning consists of assembling a set of learning algorithms, providing each algorithm the same dataset
There is natural tension between accuracy and diversity, such that an ensemble where all algorithms have high accuracy exhibits low diversity, while an ensemble where the algorithms are very diverse tends to have lower accuracy. This is rather intuitive since, if all algorithms are very accurate, their induced models must be very similar, congregating in a small area of the hypothesis space, and thus make the same, or very similar, predictions; hence, the ensemble’s diversity is low. On the other hand, if the models induced by the algorithms make different predictions, they clearly range over a wider area of the hypothesis space, and their combination (e.g., via voting) is less likely to produce a model approaching the target concept; hence the ensemble’s accuracy is low. In both cases, the expected added value of ensembling is not realized. When building ensembles, designers should thus seek to include algorithms that have high accuracy and that also exhibit good pairwise diversity, as this makes it more likely for the group to predict new instances accurately, even when a minority of the algorithms predict incorrectly.
A number of diversity measures have been proposed, and while much research has focused on analyzing their effect on ensemble accuracy, relatively little has been done to understand how they compare with each other when interacting with the accuracy of the base algorithms in ensemble design. We propose to fill this gap here. We compare the accuracy and efficiency of ensembles produced by optimizing the accuracy of the base algorithms, optimizing the diversity among the base algorithms, and optimizing a simple composite measure. We consider a sample of four representative diversity measures, and two popular ensemble learning techniques, namely voting and stacking. We reach the somewhat unexpected conclusion that weaker diversity measures lead to more accurate ensembles while stronger diversity measures lead to more efficient ensembles.
Related work
It would be impossible to cover all of the literature on ensemble learning and the role that diversity plays in ensemble design. Rather, we focus our attention on the work most relevant to our own. Tang et al. [12] derive a number of theoretical results with respect to the relationship between diversity and maximum margin. They also hint at the overall ineffectiveness of using diversity alone for ensemble design, in the sense that “large diversity may not consistently correspond to a better generalization performance”.
Kuncheva and Whitaker [6] reach a similar conclusion empirically. Their experiments focus on showing the relationship between a voting ensemble’s diversity and the performance improvement over the single best algorithm and the average over the group’s algorithms. While their results show that improvement may be achieved, they also suggest that there is little difference in the various diversity measures used by different authors. Indeed, they find that two diversity measures, namely double fault and coincident failure diversity, form their own individual singleton clusters, while all others are lumped together in a single cluster.
Both of these works focus exclusively on diversity and its impact on ensemble accuracy. By contrast, Zeng et al. [15] propose to combine diversity and accuracy in their F-like weighted average accuracy and diversity (WAD) measure to design ensembles. Their results show that WAD-designed ensembles are of reasonable size and exhibit robust and improved performance over no selection, accuracy-only selection, and diversity-only selection. However, only one diversity measure, namely the disagreement measure, is considered.
These studies are limited to voting ensembles, and either do not take into account the interplay between diversity and base algorithm accuracy, or do so in a very restricted context. Here, we analyze both voting and stacking ensembles, and analyze and compare the behavior of several diversity measures in conjunction with accuracy in terms of both the ensemble’s final accuracy and its efficiency.
Diversity space and ensemble-building measures
We consider the following well-known measures of pairwise diversity from [7], several of which are also studied in [12, 6]. We adopt the notation of [7], reproduced in Table 1, where
Diversity space components
Diversity space components
Double fault (DF) [5]: the probability that
Disagreement measure (DM) [11]: the probability that either
Hamann’s coefficient (
Classifier output difference (COD) [8]: the probability that
Note that other diversity measures exist, such as error correlation, Pearson’s correlation coefficient, and
Voting EnsembleLearn
Classify
In addition to the above individual diversity measures, we propose two complementary measures that strike a balance between diversity and accuracy, as follows.
Dual accuracy-diversity selection (sorted). We leverage a search process that entails initially ranking a set of ensembles by their average pairwise DF in ascending order. We choose DF because it implies a certain level of accuracy and diversity. If we were to simply rank the ensembles according to the average accuracy of their base learners, we could not ensure that there would be a relatively higher score of Weighted accuracy-diversity (composite). We propose a second measure that additively combines accuracy and diversity, as
Finally, we use the following two baseline measures.
Accuracy-only (accuracy). This measure simply optimizes on overall accuracy by choosing the base learners that have the highest classification rates. Random selection (random). This measure creates an ensemble through a random choice of algorithms.
As a reminder, we briefly review voting and stacking here. Let
The stacking ensemble is shown in pseudocode as Algorithm 1. During the learning phase (function Learn), it begins by creating an entirely new dataset that is comprised of the predictions of the models induced by each of the member classification learning algorithms in
Stacking Ensemble
Learn
Classify
Note that stacking ensembles generally come in two forms. The simple form replaces the attributes of the training data by the predictions of the member classification learning algorithms, as in Algorithm 1. A more sophisticated form adds these predictions to the original features of the training data. We restrict our attention to simple stacking.
As potential base learners, we consider 48 classification learning algorithms from Weka [3], as shown in Table 2. We restrict our analysis to ensembles of size 5, since it constitutes a small percentage of the 48 base algorithms and will not result in high degrees of base learner overlap among ensembles. Furthermore, we found that as the size of the ensemble grows with respect to the number of algorithms available, the difference in the accuracy between the best and worst ensembles starts to shrink.
Base classification learning algorithms
Base classification learning algorithms
Ensembles of size 5 also allow us to implement an exhaustive search when optimizing with respect to the different diversity measures. While this brute-force method is expensive, it is feasible given the efficiency of the python itertools library in finding combinations.1
Not all 48 algorithms can be run on all datasets due to input specifications.
We run our experiments on a large sample of 164 datasets, selected such that 1) they represent multi-class classification problems, 2) they can be run successfully against most of our classification learning algorithms, and 3) they are of sufficient size to accommodate 10-fold cross-validation. Ninety-one datasets come from the UCI++ repository,3
Using 10-fold cross-validation on each dataset, the ensembles are optimized over 9 folds and tested on the held out fold. This allows us to construct a meta-dataset consisting of 10
Accuracy of the single best classifier. We follow [6] and select the single learner with the highest accuracy to serve as a benchmark. Note that the best classifier generally changes from instance to instance and does not refer to the single best classifier over all the datasets. However, it does give a notion of how “hard” the particular fold is to classify. Accuracy of best performing ensembles. For each ensemble design method, voting and stacking, and for each diversity measure, DF, DM, Execution time of the base learners in the best performing ensemble. In order to study ensemble efficiency, we focus on the execution times of the base learners. We also isolate the execution time of the base learners to highlight how expensive the algorithms are that are chosen by different metrics. In practice, there is some overhead associated with the voting mechanism and potentially more for the stacking meta-learner. However, an analysis of the base learners allows us to focus on that which the diversity measures can control. The execution times are based on a pre-trial, where each algorithm is run against each dataset. This provides uniformity by avoiding the case where an arbitrary algorithm records different running times when used in two different ensembles. Since in principle base learners can be run in parallel, the bottleneck in the classification process is the slowest base learner. We record only the execution time of that slowest base learner.
Table 3 shows the results of the best performing ensembles overall, i.e., with accuracy averaged over all
Accuracy results of optimized ensembles
Accuracy results of optimized ensembles
(a) Voting
(b) Stacking NB
(c) Stacking DT
(d) Stacking MLP
As expected, the results in Table 3 show that the top performers approach or exceed the performance of the best overall single classifier, do significantly better than the average overall single classifier, and come close to the performance of the single best classifier (within only 2%–3%). The results further highlight remarkable consistency among our measures across the ensemble techniques:
Accuracy, DF, and CompDF rank highest (except in one case). This is consistent with the results of Kuncheva et al. [6] who also found that in their voting ensembles the relationship between accuracy and diversity was strongest with DF.
These results, however, were somewhat unexpected. We had anticipated that our composite measures, which strike a balance between accuracy and diversity, would have yielded the best results. We therefore revisit these findings and attempt to provide some intuition for them in the context of what we refer loosely as diversity strength.
If
By contrast, DF exhibits a rather different behavior. Indeed, as stated above, presentations of DF generally agree that smaller values of DF correspond to increased diversity. While on the surface this seems intuitive, the reality is more nuanced. Consider the case of
With this in mind, one may be tempted to think that diversity, at least a strong measure thereof, does not significantly enhance an ensemble and that the naïve method of selecting the most accurate learners is preferable. However, the right diversity does enhance ensembles in an unexpected way: it improves their execution time. Table 4 shows the overall average execution times of the slowest of the base learners in the ensembles chosen by the different measures. Since the execution time of an ensemble can be no shorter than the execution time of its slowest component learner (assuming parallel implementation), it suffices to consider that time for comparison across ensembles. There is only one entry for Composite since it represents the average running time of the most time consuming component learner, which is independent of the ensembling technique.
Average execution time (in seconds) of ensembles by measure
The results show that the three measures that produce the most accurate ensembles, namely Accuracy, DF, and Sorted, also give rise to the most computationally expensive ensembles, while the strong measures of diversity produce significantly more efficient ensembles. From a high-level perspective, relatively high accuracy is generally the result of an algorithm’s ability to exploit complex decision boundaries or to perform a more exhaustive search in the hypothesis space than a simpler learner. This, of course, implies longer runtimes. Strong diversity, on the other hand, actually tends to discourage inclusion of the most accurate learners for a given dataset, because it, in turn, results in selecting poor classifiers in order to obtain high pairwise diversity. As a result, the base learners of an ensemble that is optimized by a stronger diversity measure will tend to have less computationally expensive learners.
We take a closer look at this relationship between diversity and efficiency, with respect to a number of relevant data characteristics, or meta-features. We restrict our analysis to voting ensembles since these proved to be slightly more accurate (see Table 3). We also focus our attention on only the weakest (DF) and strongest (COD) diversity measures. We begin with the notion of data regularity. Regularity is used here to refer to the degree to which well-structured patterns exist in the data. Because there is no particular measure of regularity, we use the classification rate of SBC as a simple proxy for structure. The assumption is simple: the higher the classification rate of SBC, the “easier” it is to detect patterns in the data. We partitioned our datasets according to 5% accuracy ranges of SBC, and averaged values for each group. For example, SBC had 100% accuracy on 120 datasets, accuracy between 95% and 100% on 249 datasets, accuracy between 90% and 95% on 177 datasets, etc. Fig. 1 shows the classification rate and average execution time of voting ensembles with Accuracy, DF, and COD, as a function of the classification rate of SBC.
Classification rate and average execution time of voting ensembles w.r.t. classification rate of SBC.
Classification rate and average execution time of voting ensembles w.r.t. number of instances.
As expected, the classification rates of the ensembles tracks the classification rate of SBC linearly. Accuracy and DF are almost indistinguishable, and dominate COD. The execution times of
Classification rate and average execution time of voting ensembles w.r.t. number of features.
Classification rate and average execution time of voting ensembles w.r.t. number of classes.
We now consider three other meta-features of the data, namely, number of instances, number of features, and number of classes. When considering classification rate, we add the performances of two ensembles that represent the lower and upper bounds of classification for comparison purposes. The lower bound is represented by Random, and is the minimum bar that any useful method should clear. For the upper bound, which we call Best Ensemble, we show the results of whichever method is most accurate for a given instance. Figures 2–4 show the results.
Classification rate and average execution time of voting ensembles w.r.t. number of learners in common.
Dietterich’s graphical depiction of ensemble learning as shown in [2].
Figure 2 shows very little separation in classification rates between Accuracy and DF. They both are very close to, and often match, the best ensemble. COD performs worse than Random across the whole range of numbers of instances tested. For higher number of instances, Accuracy generally has higher execution times. This may be explained by Accuracy gravitating towards higher error bias learners when given more data and relying on lower error bias learners when the datasets are small. The execution times of DF experience volatility, but do not increase by nearly as much as those of Accuracy as the number of instances increases. As with data regularity, COD’s execution time is almost unaffected by the number of instances. Similar observations can be made about Figs 3 and 4, although COD’s classification rates seem to be more erratic. It is interesting to note the consistency of running times for the COD ensembles. They seem mostly impervious to changes in meta-features. This is because
Finally, Fig. 5 shows an interesting result in which there is a distinct pattern in classification rates and execution times with respect to the number of base learners that voting ensembles with Accuracy and DF have in common.
It appears that the classification rates are highest when Accuracy and DF either have no learners in common or all learners in common. Recall that our ensembles have size 5. A closer look reveals that the case where DF and Accuracy have all learners in common occurred in only 9 instances. Due to the small sample size, we ignore the case of total overlap for DF and Accuracy, and note that the classification rate decreases somewhat monotonically with the number of base learners in common. One explanation is that simpler datasets can be classified by most any classifier. This leads to many ensembles having relatively similar DF scores. This wide diversity of ensembles with low DF scores increases the probability that there will be little or no overlap with ensembles optimized for accuracy. However, as datasets become less linearly separable, a given learner’s error bias increasingly affects its classification rate, leading to disparity in the accuracy of the available base learners. There will be fewer base learners to choose from and the intersection between the Accuracy and DF ensembles will grow.
Accurate learners approximating 
Diverse learners do not approximate 
To further analyze our results, we invoke Dietterich’s insightful graphical depiction [2], reproduced in Fig. 6, of how spaces spanned by ensembles better approximate a signal
Figure 6 shows a hypothesis space
Our results show that diverse, inferior learners are able to span a space that is just as effective for modeling
In this paper, we have revisited the competing requirements of accuracy and diversity in ensemble design, and shown the following.
Weak diversity (e.g., DF) is superior to strong diversity (e.g., COD) with respect to accuracy. In fact, the optimal diversity for accuracy is so weak that it actually positively correlates with accuracy. Stronger diversity results in more efficient ensembles. The stronger the diversity the more efficient the ensemble. The region of the diversity space A less obvious benefit of diversity is that it moves away from computationally expensive learners and remains comparable to an ensemble optimized for accuracy. This has the two-fold benefit of being accurate and less expensive than Accuracy. In the special case of datasets that do not exhibit regularity, ensembles optimized for accuracy are more efficient. The more diverse ensembles simply model noise in different ways and do not lead to a coherent meta-dataset in the case of stacking or a structured consensus in the case of voting.
The first conclusion encourages researchers to combine
Future work should consider the effect of the size of ensembles (here, fixed at 5), the impact of the value of
