Diversity, accuracy and efficiency in ensemble learning: An unexpected result

Abstract

Empirical evidence suggests that ensembles with adequate levels of pairwise diversity among a set of accurate member algorithms can significantly outperform any of the individual algorithms. As a result, several diversity measures have been developed for use in optimizing ensembles. We show, however, that there is natural tension between the pairwise diversity of ensemble members and their individual accuracy. While efficient ensembles can be built with stronger forms of diversity, they also suffer in overall accuracy. On the other hand, ensembles built with weaker forms of diversity can be very accurate, but tend to be significantly more computationally expensive. We discuss these findings in light of the notion of diversity space.

Keywords

Diversity, ensemble learning, metalearning

1. Introduction

Ensemble learning consists of assembling a set of learning algorithms, providing each algorithm the same dataset $D$ , and combining the results of their induced models in clever ways so as to provide high classification rates for $D$ . In this sense, computational ensemble methods resemble musical ensembles where each instrument reads from a score of music and the combined output is richer than the output of each individual instrument. The interest in ensemble learning is similarly motivated by the idea that the whole may be greater than the sum of its parts, and empirical results do indeed show that ensembles often perform significantly better than their individual constituent algorithms [2]. This improved performance is understood to be largely dependent on the individual learners being both accurate and diverse [1]. Accuracy measures how well the model induced by an algorithm predicts previously unseen instances, and diversity refers to the degree to which the models induced by different algorithms make different predictions.

There is natural tension between accuracy and diversity, such that an ensemble where all algorithms have high accuracy exhibits low diversity, while an ensemble where the algorithms are very diverse tends to have lower accuracy. This is rather intuitive since, if all algorithms are very accurate, their induced models must be very similar, congregating in a small area of the hypothesis space, and thus make the same, or very similar, predictions; hence, the ensemble’s diversity is low. On the other hand, if the models induced by the algorithms make different predictions, they clearly range over a wider area of the hypothesis space, and their combination (e.g., via voting) is less likely to produce a model approaching the target concept; hence the ensemble’s accuracy is low. In both cases, the expected added value of ensembling is not realized. When building ensembles, designers should thus seek to include algorithms that have high accuracy and that also exhibit good pairwise diversity, as this makes it more likely for the group to predict new instances accurately, even when a minority of the algorithms predict incorrectly.

A number of diversity measures have been proposed, and while much research has focused on analyzing their effect on ensemble accuracy, relatively little has been done to understand how they compare with each other when interacting with the accuracy of the base algorithms in ensemble design. We propose to fill this gap here. We compare the accuracy and efficiency of ensembles produced by optimizing the accuracy of the base algorithms, optimizing the diversity among the base algorithms, and optimizing a simple composite measure. We consider a sample of four representative diversity measures, and two popular ensemble learning techniques, namely voting and stacking. We reach the somewhat unexpected conclusion that weaker diversity measures lead to more accurate ensembles while stronger diversity measures lead to more efficient ensembles.

2. Related work

It would be impossible to cover all of the literature on ensemble learning and the role that diversity plays in ensemble design. Rather, we focus our attention on the work most relevant to our own. Tang et al. [12] derive a number of theoretical results with respect to the relationship between diversity and maximum margin. They also hint at the overall ineffectiveness of using diversity alone for ensemble design, in the sense that “large diversity may not consistently correspond to a better generalization performance”.

Kuncheva and Whitaker [6] reach a similar conclusion empirically. Their experiments focus on showing the relationship between a voting ensemble’s diversity and the performance improvement over the single best algorithm and the average over the group’s algorithms. While their results show that improvement may be achieved, they also suggest that there is little difference in the various diversity measures used by different authors. Indeed, they find that two diversity measures, namely double fault and coincident failure diversity, form their own individual singleton clusters, while all others are lumped together in a single cluster.

Both of these works focus exclusively on diversity and its impact on ensemble accuracy. By contrast, Zeng et al. [15] propose to combine diversity and accuracy in their F-like weighted average accuracy and diversity (WAD) measure to design ensembles. Their results show that WAD-designed ensembles are of reasonable size and exhibit robust and improved performance over no selection, accuracy-only selection, and diversity-only selection. However, only one diversity measure, namely the disagreement measure, is considered.

These studies are limited to voting ensembles, and either do not take into account the interplay between diversity and base algorithm accuracy, or do so in a very restricted context. Here, we analyze both voting and stacking ensembles, and analyze and compare the behavior of several diversity measures in conjunction with accuracy in terms of both the ensemble’s final accuracy and its efficiency.

3. Diversity space and ensemble-building measures

We consider the following well-known measures of pairwise diversity from [7], several of which are also studied in [12, 6]. We adopt the notation of [7], reproduced in Table 1, where $h$ is the target classification hypothesis, and $h_{1}$ and $h_{2}$ are the hypotheses induced by two algorithms $A_{1}$ and $A_{2}$ , respectively. The quantities in Table 1 allow us to characterize the components of the diversity space that make up our diversity measures. Definitions are as follows.

Table 1
Diversity space components

Component	Description	Definition
$N^{11}$	Number of instances on which both $h_{1}$ and $h_{2}$ are correct	$N^{11}=|\{x:h_{1}(x)=h_{2}(x)=h(x)\}|$
$N^{10}$	Number of instances on which $h_{1}$ is correct, but $h_{2}$ is incorrect	$N^{10}=|\{x:h_{1}(x)=h(x)\wedge h_{2}(x)\neq h(x)\}|$
$N^{01}$	Number of instances on which $h_{2}$ is correct, but $h_{1}$ is incorrect	$N^{01}=|\{x:h_{1}(x)\neq h(x)\wedge h_{2}(x)=h(x)\}|$
$N^{00}$	Number of instances on which both $h_{1}$ and $h_{2}$ are incorrect (they can either make the same or different predictions)	$N^{00}=|\{x:h_{1}(x)\neq h(x)\wedge h_{2}(x)\neq h(x)\}|=N_{S}^{00}+N_{D}^{00}$, where $N_{S}^{00}=|\{x:h_{1}(x)=h_{2}(x)\neq h(x)\}|$ and $N_{D}^{00}=|\{x:h_{1}(x)\neq h(x)\wedge h_{2}(x)\neq h(x)\wedge h_{1}(x)\neq h_{2}(x)\}|$
$N$	Total number of instances	$N=N^{11}+N^{10}+N^{01}+N^{00}$

• Double fault (DF) [5]: the probability that $h_{1}$ and $h_{2}$ are both incorrect. Lower values of DF correspond to increased diversity.

$\textit{DF}=\frac{N^{00}}{N}$ (1)

• Disagreement measure (DM) [11]: the probability that either $h_{1}$ or $h_{2}$ is correct, but not both. Higher values of DM correspond to increased diversity.

$\textit{DM}=\frac{N^{01}+N^{10}}{N}$ (2)

• Hamann’s coefficient ( $H$ ) [4]: the degree of association between $h_{1}$ and $h_{2}$ in the context of $h$ . $H$ is used as an alternative to Yule’s $Q$ statistic [14], since $Q$ cannot distinguish among different output distributions. Lower values of $H$ correspond to increased diversity.

$H=\frac{(N^{11}+N^{00})-(N^{10}+N^{01})}{N}$ (3)

• Classifier output difference (COD) [8]: the probability that $h_{1}$ and $h_{2}$ make different predictions. Higher values of COD correspond to increased diversity.

$\textit{COD}=\frac{N^{10}+N^{01}+N^{00}_{D}}{N}$ (4)

Note that other diversity measures exist, such as error correlation, Pearson’s correlation coefficient, and $\kappa$ . Our selection of the above measures is motivated by 1) the intuition that an ideal diversity measure would favor $N^{11}$ , $N^{10}$ , and $N^{01}$ , over $N^{00}$ , since this allows for a higher probability that a subset of the ensemble will predict the true classification, and 2) a deliberate attempt at including different combinations of elements of the diversity space, that capture important subsets of that space.
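The four measures can be computed directly from the components of Table 1. The following is a minimal Python sketch, not the implementation used in the experiments; the function names and the toy label lists are illustrative assumptions.

```python
from collections import Counter

def diversity_space(y_true, pred1, pred2):
    """Count the diversity-space components of Table 1 for two prediction lists."""
    counts = Counter()
    for y, p1, p2 in zip(y_true, pred1, pred2):
        if p1 == y and p2 == y:
            counts["N11"] += 1      # both correct
        elif p1 == y:
            counts["N10"] += 1      # only h1 correct
        elif p2 == y:
            counts["N01"] += 1      # only h2 correct
        elif p1 == p2:
            counts["N00_S"] += 1    # both incorrect, same prediction
        else:
            counts["N00_D"] += 1    # both incorrect, different predictions
    return counts

def diversity_measures(y_true, pred1, pred2):
    """Return DF, DM, H and COD as defined in Eqs. (1)-(4)."""
    c = diversity_space(y_true, pred1, pred2)
    n = len(y_true)
    n00 = c["N00_S"] + c["N00_D"]
    return {
        "DF": n00 / n,
        "DM": (c["N01"] + c["N10"]) / n,
        "H": ((c["N11"] + n00) - (c["N10"] + c["N01"])) / n,
        "COD": (c["N10"] + c["N01"] + c["N00_D"]) / n,
    }

# Toy three-class example (made-up labels, for illustration only).
y_true = ["a", "b", "c", "a", "b"]
pred1 = ["a", "b", "a", "b", "c"]
pred2 = ["a", "c", "a", "c", "b"]
measures = diversity_measures(y_true, pred1, pred2)
print(measures)
```

In this toy example each of the five components $N^{11}$ , $N^{10}$ , $N^{01}$ , $N^{00}_{S}$ , and $N^{00}_{D}$ occurs exactly once, which makes it easy to check each formula by hand.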

Algorithm 1 Voting Ensemble

function Learn( $\mathcal{A}$ , $T$ )
    $\mathcal{H}\leftarrow\emptyset$
    for each $A_{i}\in\mathcal{A}$ do
        $h_{i}\leftarrow$ model induced by $A_{i}$ from $T$
        $\mathcal{H}\leftarrow\mathcal{H}\cup\{h_{i}\}$

function Classify( $\mathcal{H}$ , $Y$ , $q$ )
    return $\textit{argmax}_{y\in Y}\sum_{h_{i}\in\mathcal{H}}\delta(y,h_{i}(q))$

In addition to the above individual diversity measures, we propose two complementary measures that strike a balance between diversity and accuracy, as follows.

• Dual accuracy-diversity selection (sorted). We use a two-step search. First, we rank the candidate ensembles by their average pairwise DF in ascending order. We choose DF because a low DF implies a certain level of both accuracy and diversity. If we simply ranked the ensembles by the average accuracy of their base learners, we could not ensure that $N^{00}$ stays small relative to the more favorable regions $N^{10}$ and $N^{01}$ . Second, from the top $k$ ensembles, we select the ensemble with the highest average pairwise COD. This process serves the dual purpose of maximizing accuracy and diversity, but prioritizes accuracy in the search. The justification for giving priority to accuracy is that an ensemble can generally do no worse than its least accurate member algorithm, provided that the least accurate algorithm has a better classification rate than a random guesser. This is the case even with minimal diversity. However, an ensemble in which the member algorithms make uncorrelated errors at a rate higher than a random guesser will increase the error rate of the ensemble [2]. We refer to this measure as Sorted, since the step that differentiates it from DF is the ranking by COD values of the top $k$ ensembles. We use $k=$ 1,000 in our experiments.

• Weighted accuracy-diversity (composite). We propose a second measure that additively combines accuracy and diversity, as $\alpha\cdot\textit{ensemble}(A)+(1-\alpha)\cdot\textit{diversity}(A)$ , where $\textit{ensemble}(A)$ is the accuracy of the ensemble of algorithms in $A$ , $\textit{diversity}(A)$ is the pairwise average of diversity values over all pairs of algorithms in $A$ , and $\alpha$ is an adjustable parameter. Here, ensemble is either voting or stacking, and diversity is either DF or COD. We refer to these measures as Composite since they meld accuracy and diversity, and denote them simply as CompDF and CompCOD in their respective ensembling contexts. We set $\alpha=$ 0.5 in our experiments.
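The two selection rules above can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the pairwise tables are keyed by sorted pairs of learner names, the toy values are made up, and `composite_score` takes the diversity value as given, since the text does not state how DF (where lower means more diverse) is oriented inside the composite.

```python
from itertools import combinations

def average_pairwise(members, table):
    """Average a symmetric pairwise table over all pairs of ensemble members."""
    pairs = list(combinations(sorted(members), 2))
    return sum(table[pair] for pair in pairs) / len(pairs)

def sorted_selection(ensembles, df_table, cod_table, k):
    """Sorted: rank by ascending average DF, then take the highest average COD in the top k."""
    by_df = sorted(ensembles, key=lambda e: average_pairwise(e, df_table))
    return max(by_df[:k], key=lambda e: average_pairwise(e, cod_table))

def composite_score(ensemble_accuracy, diversity, alpha=0.5):
    """Composite: alpha * accuracy + (1 - alpha) * diversity."""
    return alpha * ensemble_accuracy + (1 - alpha) * diversity

# Toy pairwise tables for four hypothetical learners A-D (values are made up).
df_table = {("A", "B"): 0.10, ("A", "C"): 0.20, ("A", "D"): 0.30,
            ("B", "C"): 0.15, ("B", "D"): 0.25, ("C", "D"): 0.40}
cod_table = {("A", "B"): 0.20, ("A", "C"): 0.50, ("A", "D"): 0.60,
             ("B", "C"): 0.30, ("B", "D"): 0.70, ("C", "D"): 0.10}

ensembles = list(combinations("ABCD", 3))
chosen = sorted_selection(ensembles, df_table, cod_table, k=2)
print(chosen)
```

With $k=2$ , the two ensembles with the lowest average DF are (A, B, C) and (A, B, D), and the second has the higher average COD, so it is the one selected. With $k=1$ the rule reduces to plain DF optimization.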

Finally, we use the following two baseline measures.

• Accuracy-only (accuracy). This measure simply optimizes on overall accuracy by choosing the base learners that have the highest classification rates.

• Random selection (random). This measure creates an ensemble through a random choice of algorithms.

As a reminder, we briefly review voting and stacking here. Let $\mathcal{A}=\{A_{1},\ldots,A_{m}\}$ be a set of classification learning algorithms, $T$ be a training set of data, $Y$ be a finite set of target class values, and $\delta$ be the generalized Kronecker function, i.e., $\delta(x,y)=1$ if $x=y$ , and 0 otherwise. The voting ensemble is arguably the simplest of the ensemble methods. It is described in pseudocode as Algorithm 1. During the learning phase (function Learn), each algorithm in $\mathcal{A}$ is trained on $T$ , and the corresponding model is stored. During classification (function Classify), when a new instance $q$ is presented to be classified, each induced model in $\mathcal{H}$ makes a prediction, and a majority vote is taken over all of the target class values in $Y$ , yielding the ensemble’s predicted class for the new instance.
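As an illustration, the voting procedure can be written as a short Python sketch. The three base learners below are hypothetical stand-ins (they are not Weka algorithms), each modeled as a function that takes a training set of (x, y) pairs and returns a prediction function.

```python
from collections import Counter

def majority_learner(training_set):
    """Hypothetical learner: always predicts the most frequent training label."""
    label = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour_learner(training_set):
    """Hypothetical learner: predicts the label of the closest training point."""
    return lambda x: min(training_set, key=lambda e: abs(e[0] - x))[1]

def mean_threshold_learner(training_set):
    """Hypothetical learner: predicts 'pos' above the mean training value."""
    mean = sum(x for x, _ in training_set) / len(training_set)
    return lambda x: "pos" if x > mean else "neg"

def learn(algorithms, training_set):
    """Learn: train every algorithm on the same training set T."""
    return [algorithm(training_set) for algorithm in algorithms]

def classify(models, classes, query):
    """Classify: return the class y in Y that collects the most votes."""
    return max(classes, key=lambda y: sum(1 for h in models if h(query) == y))

T = [(1, "neg"), (2, "neg"), (8, "pos"), (9, "pos"), (3, "neg")]
H = learn([majority_learner, nearest_neighbour_learner, mean_threshold_learner], T)
print(classify(H, ["neg", "pos"], 8))
print(classify(H, ["neg", "pos"], 4))
```

For the query 8, the majority learner votes "neg" while the other two vote "pos", so the ensemble predicts "pos" even though one member is wrong. Ties, which this sketch breaks by the order of the classes in $Y$ , are not addressed in the pseudocode.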

The stacking ensemble is shown in pseudocode as Algorithm 2. During the learning phase (function Learn), it begins by creating an entirely new dataset that consists of the predictions of the models induced by each of the member classification learning algorithms in $\mathcal{A}$ for each training instance $\langle X,y\rangle$ in $T$ . A separate learner, $A_{\textit{meta}}$ , sometimes termed a meta-learner, then induces a prediction meta-model, $h_{\textit{meta}}$ , from this new dataset. During classification (function Classify), when a new instance $q$ is presented to be classified, it is replaced by a new instance consisting of the predictions of the induced models, which is then presented for classification to the meta-model.

Algorithm 2 Stacking Ensemble

function Learn( $\mathcal{A}$ , $T$ , $A_{\textit{meta}}$ )
    $\mathcal{H}\leftarrow\emptyset$
    for each $A_{i}\in\mathcal{A}$ do
        $h_{i}\leftarrow$ model induced by $A_{i}$ from $T$
        $\mathcal{H}\leftarrow\mathcal{H}\cup\{h_{i}\}$
    $\tau\leftarrow\emptyset$
    for each $\langle X,y\rangle\in T$ do
        $e\leftarrow\langle h_{1}(X),h_{2}(X),\ldots,h_{|\mathcal{H}|}(X),y\rangle$
        $\tau\leftarrow\tau\cup\{e\}$
    $h_{\textit{meta}}\leftarrow$ model induced by $A_{\textit{meta}}$ from $\tau$

function Classify( $\mathcal{H}$ , $h_{\textit{meta}}$ , $q$ )
    return $h_{\textit{meta}}(\langle h_{1}(q),h_{2}(q),\ldots,h_{|\mathcal{H}|}(q)\rangle)$

Note that stacking ensembles generally come in two forms. The simple form replaces the attributes of the training data by the predictions of the member classification learning algorithms, as in Algorithm 2. A more sophisticated form adds these predictions to the original features of the training data. We restrict our attention to simple stacking.
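Simple stacking, as described above, can be sketched as follows. The base learners and the lookup-table meta-learner are hypothetical stand-ins chosen to keep the example self-contained; the experiments use Weka's J48, naïve Bayes, and multi-layer perceptron as meta-learners instead.

```python
from collections import Counter

def majority_learner(training_set):
    """Hypothetical learner: always predicts the most frequent training label."""
    label = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour_learner(training_set):
    """Hypothetical learner: predicts the label of the closest training point."""
    return lambda x: min(training_set, key=lambda e: abs(e[0] - x))[1]

def mean_threshold_learner(training_set):
    """Hypothetical learner: predicts 'pos' above the mean training value."""
    mean = sum(x for x, _ in training_set) / len(training_set)
    return lambda x: "pos" if x > mean else "neg"

def lookup_table_learner(training_set):
    """Hypothetical meta-learner: majority label per input tuple, global majority otherwise."""
    table = {}
    for x, y in training_set:
        table.setdefault(x, Counter())[y] += 1
    default = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: table[x].most_common(1)[0][0] if x in table else default

def learn(algorithms, training_set, meta_algorithm):
    """Learn: induce base models, build the prediction dataset tau, induce the meta-model."""
    models = [algorithm(training_set) for algorithm in algorithms]
    tau = [(tuple(h(x) for h in models), y) for x, y in training_set]
    return models, meta_algorithm(tau)

def classify(models, meta_model, query):
    """Classify: replace the query by the base predictions and ask the meta-model."""
    return meta_model(tuple(h(query) for h in models))

T = [(1, "neg"), (2, "neg"), (8, "pos"), (9, "pos"), (3, "neg")]
base = [majority_learner, nearest_neighbour_learner, mean_threshold_learner]
H, h_meta = learn(base, T, lookup_table_learner)
print(classify(H, h_meta, 8))
print(classify(H, h_meta, 2))
```

Because the original attributes are dropped, the meta-model only ever sees tuples of base predictions. The more sophisticated form would instead pass the original features of each instance together with these predictions.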

4. Experiments

As potential base learners, we consider 48 classification learning algorithms from Weka [3], as shown in Table 2. We restrict our analysis to ensembles of size 5, since it constitutes a small percentage of the 48 base algorithms and will not result in high degrees of base learner overlap among ensembles. Furthermore, we found that as the size of the ensemble grows with respect to the number of algorithms available, the difference in the accuracy between the best and worst ensembles starts to shrink.

Table 2
Base classification learning algorithms

bayes.AODE	misc.HyperPipes
bayes.AODEsr	misc.VFI
bayes.BayesianLogisticRegression	rules.ConjunctiveRule
bayes.BayesNet	rules.DecisionTable
bayes.ComplementNaiveBayes	rules.DTNB
bayes.DMNBtext	rules.JRip
bayes.NaiveBayes	rules.NNge
bayes.NaiveBayesMultinomial	rules.OneR
bayes.NaiveBayesMultinomialUpdateable	rules.PART
bayes.NaiveBayesSimple	rules.Ridor
bayes.NaiveBayesUpdateable	rules.ZeroR
functions.Logistic	trees.ADTree
functions.MultilayerPerceptron	trees.BFTree
functions.RBFNetwork	trees.DecisionStump
functions.SimpleLogistic	trees.FT
functions.SMO	trees.J48
functions.SPegasos	trees.J48graft
functions.VotedPerceptron	trees.LADTree
functions.Winnow	trees.LMT
lazy.IB1	trees.NBTree
lazy.IBk	trees.RandomForest
lazy.KStar	trees.RandomTree
lazy.LBR	trees.REPTree
lazy.LWL	trees.SimpleCart

Ensembles of size 5 also allow us to implement an exhaustive search when optimizing with respect to the different diversity measures. While this brute-force method is expensive, it is feasible given the efficiency of the Python itertools library in finding combinations.¹

¹ https://docs.python.org/2/library/itertools.html.

One important element of the ensembles we test is that they are mixed ensembles. It is common to use ensembles of homogeneous base learners where hyperparameters are adjusted. By contrast, each one of our ensembles consists of unique base learners, and hyperparameters remain unchanged from base learner to base learner. So, for each dataset, we choose the ensemble with 5 learners from a set of approximately 45 available base learners on average.² It follows that for ensembles of size 5, we must test $\binom{45}{5}=$ 1,221,759 combinations. While this is a large number, each test involves indexing into a $45\times 45$ table 10 times (once for each of the $\binom{5}{2}=10$ pairs of learners in an ensemble), and since we can quickly index into such tables, the exhaustive search is tractable for the small ensemble size we are analyzing. Unfortunately, optimizing Sorted and Composite is too computationally expensive using the brute-force method. Hence, we simply take the best of the first 2,000 combinations that are output by itertools as a reasonable approximation. For the stacking ensembles, we use Weka’s decision tree learning algorithm J48 (tree), naïve Bayes (bayes), and multi-layer perceptron (mlp) as meta-learners. We never mix meta-learners for a given ensemble, and run the meta-learners separately on each optimized ensemble.

² Not all 48 algorithms can be run on all datasets due to input specifications.
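The exhaustive search can be sketched with `itertools.combinations`, as below. This is a sketch under assumptions rather than the code used in the experiments: `pairwise` stands for a precomputed symmetric table of pairwise diversity values, and the toy table uses integer scores purely for illustration. `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb

def best_ensemble(learners, pairwise, size, maximize=True):
    """Score every ensemble of the given size by its average pairwise table value."""
    def score(members):
        pairs = list(combinations(members, 2))
        return sum(pairwise[a][b] for a, b in pairs) / len(pairs)
    choose = max if maximize else min
    return choose(combinations(learners, size), key=score)

n_ensembles = comb(45, 5)   # candidate ensembles of size 5 drawn from 45 learners
n_lookups = comb(5, 2)      # pairwise table lookups per candidate ensemble
print(n_ensembles, n_lookups)

# Toy table over six hypothetical learners, indexed 0-5 (scores are made up).
learners = range(6)
pairwise = [[abs(i - j) for j in learners] for i in learners]
print(best_ensemble(learners, pairwise, size=3))
print(best_ensemble(learners, pairwise, size=3, maximize=False))
```

Measures for which higher values mean more diversity (DM, COD) would be maximized, while DF and $H$ would be minimized by passing `maximize=False`.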

We run our experiments on a large sample of 164 datasets, selected such that 1) they represent multi-class classification problems, 2) they can be run successfully against most of our classification learning algorithms, and 3) they are of sufficient size to accommodate 10-fold cross-validation. Ninety-one datasets come from the UCI++ repository,³ supplemented by 53 datasets from the Connectionist Artificial Intelligence Laboratory,⁴ 18 datasets from the Weka Collection of Datasets,⁵ and 2 further unique datasets from the UCI Machine Learning Repository.⁶

³ https://github.com/lpfgarcia/ucipp.

⁴ http://inf.ufrgs.br/liac.

⁵ http://www.cs.waikato.ac.nz/ml/weka/datasets.html.

⁶ https://archive.ics.uci.edu/ml/datasets.html.

Using 10-fold cross-validation on each dataset, the ensembles are optimized over 9 folds and tested on the held out fold. This allows us to construct a meta-dataset consisting of 10 $\times$ 164 $=$ 1,640 instances, where each instance represents the results on one test fold for one dataset. For each instance, we record the following.

• Accuracy of the single best classifier. We follow [6] and select the single learner with the highest accuracy to serve as a benchmark. Note that the best classifier generally changes from instance to instance and does not refer to the single best classifier over all the datasets. However, it does give a notion of how “hard” the particular fold is to classify.

• Accuracy of best performing ensembles. For each ensemble design method, voting and stacking, and for each diversity measure, DF, DM, $H$ , COD, Accuracy, Random, Sorted, CompDF, and CompCOD, we record the accuracy of the best performing ensemble. Hence, a total of 36 optimized ensemble accuracies are recorded: 9 for voting ensembles (one for each measure) and 3 $\times$ 9 $=$ 27 for stacking ensembles (one for each combination of a meta-learner and a diversity measure).

• Execution time of the base learners in the best performing ensemble. In order to study ensemble efficiency, we focus on the execution times of the base learners. We also isolate the execution time of the base learners to highlight how expensive the algorithms chosen by the different metrics are. In practice, there is some overhead associated with the voting mechanism and potentially more for the stacking meta-learner. However, an analysis of the base learners allows us to focus on that which the diversity measures can control. The execution times are based on a pre-trial, where each algorithm is run against each dataset. This provides uniformity by avoiding the case where an arbitrary algorithm records different running times when used in two different ensembles. Since in principle base learners can be run in parallel, the bottleneck in the classification process is the slowest base learner. We record only the execution time of that slowest base learner.
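The efficiency figure recorded for an ensemble is therefore simply the largest pre-trial time among its members. A minimal sketch, in which the timing values are made-up placeholders rather than measured results:

```python
def ensemble_time(members, pretrial_seconds):
    """Execution time of an ensemble, taken as that of its slowest base learner."""
    return max(pretrial_seconds[name] for name in members)

# Hypothetical pre-trial times in seconds for one dataset (illustrative only).
pretrial_seconds = {
    "trees.J48": 0.12,
    "bayes.NaiveBayes": 0.03,
    "functions.SMO": 1.80,
    "rules.OneR": 0.02,
    "lazy.IBk": 0.40,
}

ensemble = ["trees.J48", "bayes.NaiveBayes", "functions.SMO", "rules.OneR", "lazy.IBk"]
print(ensemble_time(ensemble, pretrial_seconds))
```

Using the maximum rather than the sum reflects the stated assumption that base learners can be run in parallel.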

5. Results

Table 3 shows the results of the best performing ensembles overall, i.e., with accuracy averaged over all 1,640 instances in our meta-dataset, ranked in descending order of average accuracy, organized by ensembling technique. For purposes of comparison, the results of the single best classifier (SBC), as well as of the average overall single classifier (AOSC), and the best overall single classifier (BOSC) are included (shaded rows in Table 3). The average overall single classifier is the overall average classification rate of all base learners over all datasets. The best overall single classifier, here trees.LMT, is the algorithm with the highest overall classification rate of all base learners averaged over all datasets. It is different from the single best classifier, which represents the average performance of the best classifiers for each instance in the meta-dataset, and is thus an ideal performance upper bound. The best overall single classifier captures a kind of middle ground, and the average overall single classifier may serve as a lower bound on the performance of ensembles.

Table 3
Accuracy results of optimized ensembles

(a) Voting

Measure	Mean	Median	Std Dev.
SBC	80.86%	85.54%	17.49%
Accuracy	78.57%	84.57%	22.29%
DF	78.03%	83.33%	22.39%
CompDF	77.83%	83.33%	21.88%
Sorted	76.58%	82.14%	23.09%
BOSC	76.53%	82.10%	19.30%
CompCOD	74.81%	80.00%	22.79%
Random	74.40%	80.00%	23.61%
H	70.64%	75.00%	22.77%
DM	70.57%	75.00%	22.81%
AOSC	70.49%	75.00%	21.75%
COD	65.03%	70.00%	25.75%

(b) Stacking NB

Measure	Mean	Median	Std Dev.
SBC	80.86%	85.54%	17.49%
Accuracy	78.33%	83.33%	22.61%
DF	78.03%	83.33%	22.86%
Sorted	77.02%	82.55%	23.02%
BOSC	76.53%	82.10%	19.30%
CompCOD	76.47%	81.53%	22.90%
Random	76.15%	81.48%	23.26%
CompDF	74.95%	81.25%	24.46%
DM	73.98%	78.57%	22.76%
H	73.88%	78.80%	22.81%
COD	70.82%	75.86%	23.87%
AOSC	70.49%	75.00%	21.75%

(c) Stacking J48

Measure	Mean	Median	Std Dev.
SBC	80.86%	85.54%	17.49%
Accuracy	77.82%	83.33%	22.84%
DF	77.08%	83.33%	23.25%
CompDF	76.95%	82.98%	23.14%
BOSC	76.53%	82.10%	19.30%
Sorted	75.98%	81.48%	23.57%
CompCOD	75.69%	81.25%	23.58%
Random	75.55%	80.79%	23.37%
DM	74.02%	78.76%	22.25%
H	73.86%	78.57%	22.40%
COD	70.93%	75.00%	23.74%
AOSC	70.49%	75.00%	21.75%

(d) Stacking MLP

Measure	Mean	Median	Std Dev.
SBC	80.86%	85.54%	17.49%
Accuracy	77.27%	83.33%	23.36%
CompDF	76.74%	81.48%	22.56%
BOSC	76.53%	82.10%	19.30%
DF	76.34%	82.61%	23.83%
Sorted	75.54%	80.95%	24.03%
Random	74.51%	80.00%	24.31%
CompCOD	74.33%	80.00%	24.13%
H	72.84%	77.78%	23.41%
DM	72.82%	77.78%	23.21%
AOSC	70.49%	75.00%	21.75%
COD	70.29%	75.00%	24.11%

As expected, the results in Table 3 show that the top performers approach or exceed the performance of the best overall single classifier, do significantly better than the average overall single classifier, and come close to the performance of the single best classifier (within only 2%–3%). The results further highlight remarkable consistency among our measures across the ensemble techniques:

• Accuracy, DF, and CompDF rank highest (except in one case). This is consistent with the results of Kuncheva and Whitaker [6], who also found that in their voting ensembles the relationship between accuracy and diversity was strongest with DF.

• $H$ , DM, and COD rank lowest, with COD performing significantly worse than $H$ and DM, and even worse than the average overall single classifier in two cases.

These results, however, were somewhat unexpected. We had anticipated that our composite measures, which strike a balance between accuracy and diversity, would yield the best results. We therefore revisit these findings and attempt to provide some intuition for them in the context of what we loosely refer to as diversity strength.

If $h_{1}$ performs well on a dataset (i.e., has good accuracy), maximizing COD will inevitably yield $h_{2}$ with lower accuracy than $h_{1}$ , since at least one hypothesis must be incorrect for an instance to count towards the COD value. Hence, although a clear measure of diversity, COD may be a bit too strong relative to accuracy in the context of ensemble learning. This is further confirmed by the results of Table 3, where CompCOD, which effects a trade-off between COD’s diversity and accuracy, always yields significantly better results than COD alone, often reaching a performance level comparable to that of the best overall single classifier. Since $N^{00}_{D}=0$ for binary classification tasks, DM is equivalent to COD in the binary case, and a little weaker in multi-class classification. Indeed, DM also requires that one of the algorithms misclassify an instance, but, unlike COD, it ignores instances on which both algorithms are incorrect, whether or not their predictions agree. Finally, as seen in their definitions, minimizing $H$ is akin to maximizing DM or COD, since $H$ plays the values of $N^{11}+N^{00}$ against the values of $N^{10}+N^{01}$ . Hence, as expected, $H$ behaves similarly to DM and COD.
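These relationships can be made explicit with a short derivation from Eqs. (2)–(4) and the identity $N=N^{11}+N^{10}+N^{01}+N^{00}$ of Table 1. It is offered as a worked check rather than as a result taken from the cited works.

```latex
\begin{align*}
H &= \frac{(N^{11}+N^{00})-(N^{10}+N^{01})}{N}
   = \frac{N-2\,(N^{10}+N^{01})}{N}
   = 1-2\,\textit{DM},\\[4pt]
\textit{COD} &= \frac{N^{10}+N^{01}}{N}+\frac{N^{00}_{D}}{N}
   = \textit{DM}+\frac{N^{00}_{D}}{N}\;\geq\;\textit{DM}.
\end{align*}
```

The first line shows that $H$ is a decreasing linear function of DM, so the two measures induce exactly opposite rankings of any set of ensembles. The second shows that COD and DM differ only by the term $N^{00}_{D}/N$ , which vanishes for binary tasks.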

By contrast, DF exhibits a rather different behavior. Indeed, as stated above, presentations of DF generally agree that smaller values of DF correspond to increased diversity. While on the surface this seems intuitive, the reality is more nuanced. Consider the case of $\textit{DF}=1$ , presumably signaling low diversity. If the task is a binary classification task, then there is indeed no diversity, since we have $\textit{DF}=\frac{N^{00}}{N}=\frac{N^{00}_{S}+N^{00}_{D}}{N}=\frac{N^{00}_{S}}{N}=1$ , so $h_{1}$ and $h_{2}$ are both incorrect on every instance and must therefore make the same predictions. On the other hand, if the task is a multi-class classification task, it is clearly possible for both $h_{1}$ and $h_{2}$ to be incorrect and yet be different on all instances, so that $\textit{DF}=\frac{N^{00}}{N}=\frac{N^{00}_{D}}{N}=1$ , and diversity is actually maximized. In both cases, the accuracies of $h_{1}$ and $h_{2}$ are 0. Similarly, when $\textit{DF}=0$ , presumably signaling higher diversity, we have $N=N^{11}+N^{01}+N^{10}$ . Thus, it is indeed possible to have high diversity if $h_{1}$ and $h_{2}$ differ on every instance but one is always correct (i.e., $N^{01}+N^{10}$ tends to $N$ ), or to have no diversity at all if $h_{1}$ and $h_{2}$ make the same correct predictions on all instances (i.e., $N^{11}$ tends to $N$ ). In both cases, the accuracies of $h_{1}$ and $h_{2}$ are higher. These observations do call into question the validity of DF as a measure of diversity. Yet, within the context of ensemble design, such weak measures seem to yield better ensembles than the stronger ones.
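The four extreme cases discussed above can be checked numerically. The following sketch uses made-up label vectors; `df` and `cod` are illustrative helper names, with COD computed as the fraction of instances on which the two hypotheses disagree, which matches Eq. (4).

```python
def df(y_true, pred1, pred2):
    """Double fault: fraction of instances on which both hypotheses are incorrect."""
    both_wrong = sum(1 for y, a, b in zip(y_true, pred1, pred2) if a != y and b != y)
    return both_wrong / len(y_true)

def cod(pred1, pred2):
    """Classifier output difference: fraction of instances with different predictions."""
    return sum(1 for a, b in zip(pred1, pred2) if a != b) / len(pred1)

# DF = 1, binary task: both always wrong, hence always identical.
y_bin = [0, 1, 0, 1]
h1 = h2 = [1, 0, 1, 0]
print(df(y_bin, h1, h2), cod(h1, h2))

# DF = 1, three-class task: both always wrong, yet always different.
y_multi = ["a", "b", "c", "a"]
g1 = ["b", "c", "a", "b"]
g2 = ["c", "a", "b", "c"]
print(df(y_multi, g1, g2), cod(g1, g2))

# DF = 0, complementary hypotheses: they always differ, and one is always correct.
y = ["a", "b", "a", "b"]
c1 = ["a", "b", "b", "a"]
c2 = ["b", "a", "a", "b"]
print(df(y, c1, c2), cod(c1, c2))

# DF = 0, identical and always correct hypotheses: no diversity at all.
print(df(y, y, y), cod(y, y))
```

In each pair of cases DF takes the same value while COD sits at opposite ends of its range, which is the sense in which DF is a weak measure of diversity.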

With this in mind, one may be tempted to think that diversity, at least a strong measure thereof, does not significantly enhance an ensemble and that the naïve method of selecting the most accurate learners is preferable. However, the right diversity does enhance ensembles in an unexpected way: it improves their execution time. Table 4 shows the overall average execution times of the slowest of the base learners in the ensembles chosen by the different measures. Since the execution time of an ensemble can be no shorter than the execution time of its slowest component learner (assuming parallel implementation), it suffices to consider that time for comparison across ensembles. There is only one entry for Composite since it represents the average running time of the most time consuming component learner, which is independent of the ensembling technique.

Table 4
Average execution time (in seconds) of ensembles by measure

Measure	Mean	Median	Std Dev.
Accuracy	2.72	0.27	6.51
DF	2.27	0.26	5.83
Sorted	2.00	0.18	5.45
Random	1.44	0.18	4.64
DM	0.97	0.09	3.43
Composite	0.92	0.14	2.91
$H$	0.87	0.09	3.00
COD	0.46	0.07	2.06

The results show that the three measures that produce the most accurate ensembles, namely Accuracy, DF, and Sorted, also give rise to the most computationally expensive ensembles, while the strong measures of diversity produce significantly more efficient ensembles. From a high-level perspective, relatively high accuracy is generally the result of an algorithm’s ability to exploit complex decision boundaries or to perform a more exhaustive search in the hypothesis space than a simpler learner. This, of course, implies longer runtimes. Strong diversity, on the other hand, actually tends to discourage inclusion of the most accurate learners for a given dataset, because it, in turn, results in selecting poor classifiers in order to obtain high pairwise diversity. As a result, the base learners of an ensemble that is optimized by a stronger diversity measure will tend to have less computationally expensive learners.

We take a closer look at this relationship between diversity and efficiency, with respect to a number of relevant data characteristics, or meta-features. We restrict our analysis to voting ensembles since these proved to be slightly more accurate (see Table 3). We also focus our attention on only the weakest (DF) and strongest (COD) diversity measures. We begin with the notion of data regularity. Regularity is used here to refer to the degree to which well-structured patterns exist in the data. Because there is no particular measure of regularity, we use the classification rate of SBC as a simple proxy for structure. The assumption is simple: the higher the classification rate of SBC, the “easier” it is to detect patterns in the data. We partitioned the instances of our meta-dataset according to 5% accuracy ranges of SBC, and averaged values for each group. For example, SBC had 100% accuracy on 120 instances, accuracy between 95% and 100% on 249 instances, accuracy between 90% and 95% on 177 instances, etc. Fig. 1 shows the classification rate and average execution time of voting ensembles with Accuracy, DF, and COD, as a function of the classification rate of SBC.
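The grouping step can be sketched as follows. The handling of bin edges (each range closed at its lower end, with 100% as its own group) is an assumption, as are the function name and the toy records.

```python
from collections import defaultdict

def average_by_sbc_range(records, width=5):
    """Group (SBC accuracy in %, value) pairs into ranges of the given width and average."""
    groups = defaultdict(list)
    for sbc_accuracy, value in records:
        lower_edge = int(sbc_accuracy // width) * width
        groups[lower_edge].append(value)
    return {edge: sum(vals) / len(vals) for edge, vals in sorted(groups.items())}

# Made-up (SBC accuracy, execution time in seconds) pairs, for illustration only.
records = [(100.0, 0.5), (97.2, 1.0), (95.0, 3.0), (83.3, 2.0)]
averages = average_by_sbc_range(records)
print(averages)
```

Each key is the lower edge of a 5% range, so the entry for 95 averages all records whose SBC accuracy is at least 95% and below 100%.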

Figure 1. Classification rate and average execution time of voting ensembles w.r.t. classification rate of SBC.

Figure 2. Classification rate and average execution time of voting ensembles w.r.t. number of instances.

As expected, the classification rates of the ensembles track the classification rate of SBC linearly. Accuracy and DF are almost indistinguishable, and dominate COD. The execution times of DF are generally greater than those of Accuracy when the classification rate of SBC is less than about 70%, and generally lower thereafter. At lower levels of regularity, Accuracy seems to be intentionally choosing higher error bias, or simpler, algorithms because such algorithms increase accuracy, while DF is leaning towards relatively lower error bias, or more expensive, algorithms. This occurs with DF because it has the latitude to choose less accurate algorithms as long as there is enough diversity in the ensemble such that the number of instances in $N^{10}$ and $N^{01}$ is significantly higher than the number of instances in $N^{00}$ . These results suggest that for datasets that tend towards lower classification rates, there are computational savings in using Accuracy. However, for more separable, regular datasets it pays computationally to use DF. COD’s execution time matches that of Accuracy and DF for low regularity, up to an SBC classification rate of about 40%, and is almost unaffected thereafter.

Figure 3. Classification rate and average execution time of voting ensembles w.r.t. number of features.

Figure 4. Classification rate and average execution time of voting ensembles w.r.t. number of classes.

We now consider three other meta-features of the data, namely, number of instances, number of features, and number of classes. When considering classification rate, we add the performances of two ensembles that represent the lower and upper bounds of classification for comparison purposes. The lower bound is represented by Random, and is the minimum bar that any useful method should clear. For the upper bound, which we call Best Ensemble, we show the results of whichever method is most accurate for a given instance. Figures 2–4 show the results.

Figure 5.

Classification rate and average execution time of voting ensembles w.r.t. number of learners in common.

Figure 6.

Dietterich’s graphical depiction of ensemble learning as shown in [2].

Figure 2 shows very little separation in classification rates between Accuracy and DF. Both are very close to, and often match, the best ensemble. COD performs worse than Random across the whole range of numbers of instances tested. For higher numbers of instances, Accuracy generally has higher execution times. This may be explained by Accuracy gravitating towards higher error bias learners when given more data and relying on lower error bias learners when the datasets are small. The execution times of DF are volatile, but do not increase by nearly as much as those of Accuracy as the number of instances increases. As with data regularity, COD’s execution time is almost unaffected by the number of instances. Similar observations can be made about Figs 3 and 4, although COD’s classification rates seem to be more erratic. It is interesting to note the consistency of running times for the COD ensembles. They seem mostly impervious to changes in meta-features. This is because COD, when optimized, must force accurate learners out of the ensemble. Given the general positive correlation between learner accuracy and running time, COD gravitates towards efficient, simpler algorithms in order to maximize diversity. No matter the complexity of the dataset, in terms of number of classes, number of features, etc., these types of algorithms are available for COD. The efficiency does come at a serious cost, however. By imposing diversity, the accuracy of COD is generally lower than even that of the random ensembles. On the other hand, the ensembles optimized by DF are comparable to those optimized by Accuracy in terms of classification rate. DF maintains this accuracy by directly targeting $N^{10}$ and $N^{01}$ while excluding $N^{00}$. These results suggest that the regions $N^{01}$ and $N^{10}$ of the diversity space are most useful in terms of efficiency. They create diversity in the ensemble, which indirectly generates efficiency, while maintaining a reasonable level of accuracy.

Finally, Fig. 5 shows an interesting result in which there is a distinct pattern in classification rates and execution times with respect to the number of base learners that voting ensembles with Accuracy and DF have in common.

It appears that the classification rates are highest when Accuracy and DF have either no learners in common or all learners in common. Recall that our ensembles have size 5. A closer look reveals that the case where DF and Accuracy have all learners in common occurred in only 9 instances. Due to the small sample size, we ignore the case of total overlap for DF and Accuracy, and note that the classification rate decreases somewhat monotonically with the number of base learners in common. One explanation is that simpler datasets can be classified by almost any classifier. This leads to many ensembles having relatively similar DF scores. This wide variety of ensembles with low DF scores increases the probability that there will be little or no overlap with ensembles optimized for accuracy. However, as datasets become less linearly separable, a given learner’s error bias increasingly affects its classification rate, leading to disparity in the accuracy of the available base learners. There will be fewer base learners to choose from, and the intersection between the Accuracy and DF ensembles will grow.
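The overlap analysis described above can be sketched as follows: count the base learners two ensembles share, then average the classification rate within each overlap level. The learner names, function names, and rates below are hypothetical placeholders, not the learners or results used in this study.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def learners_in_common(ens_a: Iterable[str], ens_b: Iterable[str]) -> int:
    """Number of base learners shared by two ensembles."""
    return len(set(ens_a) & set(ens_b))


def mean_rate_by_overlap(records: List[Tuple[int, float]]) -> Dict[int, float]:
    """Average classification rate for each overlap level."""
    grouped = defaultdict(list)
    for overlap, rate in records:
        grouped[overlap].append(rate)
    return {k: mean(v) for k, v in sorted(grouped.items())}


# Hypothetical ensembles of size 5.
accuracy_ens = ["A", "B", "C", "D", "E"]
df_ens = ["C", "D", "E", "F", "G"]
overlap = learners_in_common(accuracy_ens, df_ens)

# Hypothetical (overlap, classification rate) pairs, one per dataset.
records = [(0, 0.90), (0, 0.80), (3, 0.70), (5, 0.95)]
by_overlap = mean_rate_by_overlap(records)
print(overlap, by_overlap)
```

Grouping in this way also exposes how many datasets fall into each overlap level, which matters when a level, such as total overlap here, is backed by very few cases.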

Figure 7.

Accurate learners approximating $f$ .

Figure 8.

Diverse learners do not approximate $f$ well but their average does.

To further analyze our results, we invoke Dietterich’s insightful graphical depiction [2], reproduced in Fig. 6, of how spaces spanned by ensembles better approximate a signal $f$ than individual classifiers.

Figure 6 shows a hypothesis space $H$ , three different true hypotheses $f$ , and various learners $h_{1}$ , $h_{2}$ , etc. Ensembles generally better approximate the signal $f$ by spanning a larger space than individual classifiers. Specifically, the upper left depiction of $H$ (denoted Statistical) shows how the classifiers can combine their outputs and thereby reduce the risk of misclassifying $f$ . The upper right depiction of $H$ (denoted Computational) depicts the computational savings of ensemble methods. Each learner $h_{1}$ , $h_{2}$ , and $h_{3}$ may be able to accurately model $f$ given enough training and tuning. However, the learners can reach a state similar to learners in the Statistical example with much less training, thereby reducing the computational expense of the ensemble. The example of $H$ labeled Representational is similar to the Statistical depiction, except that in this case the learners are not able to model $f$ individually. However, they can be combined to model $f$ .

Our results show that diverse, inferior learners are able to span a space that is just as effective for modeling $f$ as the space spanned by an ensemble of accurate learners. Consider Figs 7 and 8. The accurate learners in Fig. 7 are able to individually approximate $f$ to a considerable degree, but do so only after much training and fine-tuning. The diverse learners in Fig. 8 do not individually approximate $f$ as well as those in Fig. 7. However, their outputs can be combined to produce a result comparable to the combination of the learners in Fig. 7. In the process, observe that the individual diverse learners did not have to search $H$ as much as the accurate learners did. Hence, their runtimes are lower.
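A simple calculation illustrates why weaker learners can still combine well. The sketch below computes the probability that a majority vote is correct under the idealized assumptions of a two-class problem and learners that err independently, each being correct with the same probability $p$. These assumptions are ours and do not hold exactly for real learners; the function name is hypothetical, and the ensemble size of 5 matches the one used in this study.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int = 5) -> float:
    """Probability that a majority of n independent voters is correct.

    Assumes a two-class problem, an odd n, and that every voter is
    correct with the same probability p, independently of the others.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority exists")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(need, n + 1))


vote_acc = majority_vote_accuracy(0.7, n=5)
print(round(vote_acc, 5))  # 0.83692
```

Under these assumptions, five learners that are each correct 70% of the time vote correctly about 83.7% of the time. The gain shrinks as the learners' errors become correlated, which is why diversity matters in the first place.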

6. Conclusion

In this paper, we have revisited the competing requirements of accuracy and diversity in ensemble design, and shown the following.

Weak diversity (e.g., DF) is superior to strong diversity (e.g., COD) with respect to accuracy. In fact, the optimal diversity for accuracy is so weak that it actually correlates positively with accuracy. Stronger diversity results in more efficient ensembles: the stronger the diversity, the more efficient the ensemble. The region $N^{11}$ of the diversity space is most critical for optimizing the accuracy of an ensemble, while the regions $N^{10}$ and $N^{01}$ are most critical for infusing diversity that maintains relative accuracy. Directly optimizing levels of these three regions, as opposed to combining them with elements of $N^{00}$, results in accurate, efficient ensembles.

A less obvious benefit of diversity is that it moves the ensemble away from computationally expensive learners while its classification rate remains comparable to that of an ensemble optimized for accuracy. Such an ensemble thus has the two-fold benefit of being accurate and less expensive than one optimized with Accuracy.

In the special case of datasets that do not exhibit regularity, ensembles optimized for accuracy are more efficient. The more diverse ensembles simply model noise in different ways and do not lead to a coherent meta-dataset in the case of stacking or a structured consensus in the case of voting.

The first conclusion encourages researchers to combine $N^{11},N^{10},$ and $N^{01}$ in novel ways to optimize both accuracy and efficiency. The last two conclusions encourage practitioners to implement measures based on the regularity of their data.

Future work should consider the effect of the size of ensembles (here, fixed at 5), the impact of the value of $\alpha$ (here, fixed at 0.5) in the Composite measures, as well as the value of $k$ (here, fixed at 1,000) in the Sorted measure. Furthermore, our selection of datasets is such that important characteristics, such as noise and/or missing values, are likely to be present in some of the datasets. However, we did not report results along these characteristics. It may be interesting to study the impact of these and other data characteristics on ensemble performance. Finally, another potentially interesting aspect that was not explored here is that of concept drift. Intuition would suggest that the diversity measures would perform better with respect to accuracy, given that diverse ensembles generalize better to new and changing data. In that setting, optimizing ensembles with Accuracy would essentially lead to overfitting.

References

1. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32.

2. Dietterich T.G., Ensemble Methods in Machine Learning, in: Proceedings of the First International Workshop on Multiple Classifier Systems, Springer, Berlin, Heidelberg, 2000, pp. 1–15.

3. Eibe, Hall M.A. and Witten I.H., The WEKA Workbench, Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Fourth Edition, 2016.

4. Gatnar, A diversity measure for tree-based classifier ensembles, in: Data Analysis and Decision Support, Baier, Decker and Schmidt-Thieme, eds., Springer, Berlin, Heidelberg, 2005, pp. 30–38.

5. Giacinto and Roli, Design of effective neural network ensembles for image classification purposes, Image and Vision Computing 19(9) (2001), 699–707.

6. Kuncheva L.I. and Whitaker C.J., Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Machine Learning 51(2) (2003), 181–207.

7. Lee J.W. and Giraud-Carrier, A metric for unsupervised metalearning, Intelligent Data Analysis 15(6) (2011), 827–841.

8. Peterson A.H. and Martinez T.R., Estimating the potential for combining learning models, in: Proceedings of the ICML Workshop on Meta-learning 2005, pp. 68–75.

9. R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2007.

10. Rudolph and Martinez, Finding the real differences between learning algorithms, International Journal on Artificial Intelligence Tools 24(3) (2015), 1550001.

11. Skalak D.B., The sources of increased accuracy for two proposed boosting algorithms, in: Proceedings of the AAAI Workshop on Integrating Multiple Learned Models Workshop, 1996, pp. 120–125.

12. Tang E.K., Suganthan P.N. and Yao, An analysis of diversity measures, Machine Learning 65(1) (2006), 247–271.

13. Wolpert D.H., Stacked generalization, Neural Networks 5(2) (1992), 241–259.

14. Yule G.U., VII. On the association of attributes in statistics: with illustrations from the material of the childhood society, &c., Philosophical Transactions of the Royal Society A 194(252–261) (1900), 257–319.

15. Zeng, Wong D.F. and Chao L.S., Constructing better classifier ensemble based on weighted accuracy and diversity measure, The Scientific World Journal, 2014.