A cross-entropy based stacking method in ensemble learning

Abstract

Stacking is one of the major types of ensemble learning techniques, in which a set of base classifiers contribute their outputs to a meta-level classifier, and the meta-level classifier combines them so as to produce more accurate classifications. In this paper, we propose a new stacking algorithm that uses the cross-entropy as the loss function for the classification problem. Training is carried out with a neural network and the stochastic gradient descent technique. One major characteristic of our method is that it treats each meta instance as a whole with one optimization model. This differs from some other stacking methods, such as stacking with multi-response linear regression and stacking with multi-response model trees, in which each meta instance is divided into a set of sub-instances. Multiple models are applied to those sub-instances, one for each class label, and there is no connection between the different models. It is very likely that our treatment is a better choice for finding suitable weights. Experiments with 22 data sets from the UCI machine learning repository show that the proposed stacking approach performs well. On average, it outperforms all three base classifiers, several state-of-the-art stacking algorithms, and some other representative ensemble learning methods.

Keywords

Ensemble learning; Stacking; Cross entropy; Gradient descent

1 Introduction

Classifier ensembles have received a lot of attention in recent years due to their ability to improve classification accuracy in different applications [10, 22]. In ensemble learning, there are various ways of combining classifiers, such as bagging, boosting, stacking and others [9]. Stacking is a way of using a high-level classifier to combine multiple different base classifiers [12, 30]. A typical stacking approach can be divided into three steps: firstly, two or more base classifiers such as Bayes, IBk, J4.8, and so on are chosen and trained with some data to reach a given accuracy threshold; secondly, a meta-classifier is chosen to combine the results of the base classifiers, and it also needs to be trained with some data to reach satisfactory prediction accuracy; finally, new instances are classified by using the full stacked model. Many researchers have investigated the stacking method and concluded that, if used properly, the stacking algorithm can be more effective than an approach based on an individual classifier [18].

When applying stacking, two issues need to be considered carefully [6]. One is which output of the base classifiers is fed into the meta-level classifier, and the other is the stacking method used to make the final decision. For the first issue, two types of input have been considered in building the meta-classifier: first, the class label predicted for the instance; second, the probabilities that the instance belongs to each of the classes. Previous research demonstrates that probability distributions over classes are more effective than class labels [21, 26]. For the second issue, researchers have proposed a number of different approaches to stacking, such as stacking with multi-response linear regression [26], stacking with meta-decision trees [27], stacking with multi-response model trees [23], dynamically-weighted stacked ensemble [1], ant colony and bee colony optimization based stacking [4, 5], and others [3, 14].

In general, there are two types of methods to combine the outputs of multiple base classifiers: fixed combining methods and trainable combining methods. The former work by applying a mathematical function; they are easy to build and run, and no training is required. Sum, Product, Majority Vote, Max, Min, and Median are typical fixed combining methods. The latter need training data to learn the prediction model, and among them stacking is the most frequently used [14, 35]. In recent years, stacked ensembles have been investigated and used in many application areas [7, 8]. According to the style of prediction made by the base-level classifiers, we may divide stacking-based ensemble approaches into two types: label-based stacking and Probability Distributions (PD)-based stacking. The standard stacking algorithm introduced by Wolpert is a label-based stacking that uses the class labels predicted by the base-level classifiers as the meta-level attributes for classification tasks [29]. Other methods, such as Troika [20] and stacking based on evolutionary and swarm intelligence algorithms [15, 24], are PD-based stacking approaches. For stacking approaches with high-level learning, many researchers find that probability distributions perform better than class labels as input attributes [18]. Therefore, in the following we review several state-of-the-art PD-based stacking approaches.

Ting and Witten [26] proposed stacking with MLR, which uses multi-response linear regression (MLR) as the meta-classifier. Stacking with MLR uses a one-against-all class binarization strategy with a regression learner for each class model. Similar to stacking with MLR, Seewald proposed StackingC with MLR [23]. StackingC reduces the dimensionality of the meta features by a factor equal to the number of classes. It uses only the class probabilities associated with the class on which the linear regression model is built. Tang et al. [25] proposed a variant of the stacking ensemble algorithm, which extracts the paired preference information from the meta-level data and then applies the Ranking Support Vector Machine (RSVM). Nguyen et al. [31] introduced a fuzzy rule inference based method to capture the uncertainty in the outputs of the base classifiers. Xia et al. [33] proposed a credit scoring model that integrates the bagging algorithm with the stacking framework and employs the XGBoost algorithm as the meta-classifier. Zhang et al. [36] proposed a stacking random forest learning algorithm for contour detection, which can be seen as a two-layer framework. At the first layer, four sub-forests are separately trained and used as base classifiers. At the second layer, a random forest is employed as the meta-classifier. Some evolutionary and swarm intelligence algorithms have also been used with stacking by posing the stacking configuration as an optimization problem. The genetic algorithm [15], artificial bee colony [5, 24], artificial ant colony [4], and many others can be applied. One disadvantage of these methods is that their computational cost is higher than that of other stacking approaches.

In this work, we use the cross entropy [11, 37] to measure how well the stacking model matches the ideal model. For a given meta-level instance, two probability distributions are calculated. One is derived from the list of predicted scores, which are obtained by combining the predictions of all base classifiers. The other is the probability distribution derived from the ground truth labels of all the classes, or the ideal distribution. Based on the two probability distributions, we can obtain the cross-entropy loss. Accordingly, we minimize this cross entropy in order to get the predicted scores that are closest to the ideal distribution. In this paper, we present a novel stacking method, Stacking-CE (Stacking with the Cross-Entropy loss function), in which an Artificial Neural Network (ANN) is used to train weights for all base classifiers. Specifically, we use the probabilities predicted by the different base classifiers for the same class as inputs to train the meta-level classifier and then calculate the total loss by the cross-entropy function over all meta-level instances. Furthermore, considering the convexity of the cross-entropy function, we are able to apply stochastic gradient descent to minimize the loss.

Compared with some other stacking methods such as stacking with multi-response linear regression [26], StackingC [23], and stacking with multi-response model trees [6], our method treats meta instances in a different way. In those stacking methods, each meta instance is divided into a set of sub-instances, and a separate model is applied to each of them for a specific class. One possible drawback of such a treatment is the lack of connection among those models. Our proposed method treats each meta instance as a whole with one optimization model. Our experiments demonstrate that it is more effective than the methods mentioned above.

The remainder of this paper is organized as follows: in Section 2, we present the classification process of the proposed stacking method. Section 3 presents the settings of the experiment for evaluating the performance of the proposed method. The experimental results are presented and discussed in Section 4. Finally, the conclusion is given in Section 5.

2 Proposed stacked ensemble algorithm

All the symbols and their meanings used in this paper are summarized in Table 1.

Table 1
Symbols and their meanings used in this paper

Symbols	Meanings
C	N base-level classifiers C = (C₁, C₂, ... , C_N) built using N learning algorithms.
f	The meta-level learning algorithm.
J(P, Q)	The loss function based on the cross-entropy metric.
l_i	Class labels of the instances in the training data set (1 ≤ i ≤ L).
L	Number of class labels for all the instances in the training data set.
M	The meta-level training data set.
N	Number of base classifiers.
P	Normalized probability distribution vector generated from the ground truth labels.
Q	Normalized probability distribution vector from the meta classifier.
R	Label judgment R = (r₁, r₂, ... , r_L).
score_j	The score calculated by combining the predictions of all the base-level classifiers for class label l_j.
s_kj	The probability predicted by the k-th base classifier for the j-th class.
T	Number of iterations in the learning process.
U	Features for base classifiers.
V	Number of features for base classifiers.
W	The weight vector used in the single-layer ANN.
∇W	The gradient vector of weights.
η	Learning rate in the learning process.

2.1 The meta-level classifier

A meta-level classifier works in the following way: each instance is first put through all the base-level classifiers. Suppose there are N base classifiers and L classes. A meta instance in the form of a matrix can then be generated from the outputs of those N base classifiers. See Fig. 1 for an example of such a matrix.

Fig.1

Example of a classified instance represented as a matrix of probability distribution

In Fig. 1, there are three base classifiers (C₁, C₂, C₃) and four class labels (l₁, l₂, l₃, l₄). The value in row j and column k is the probability, predicted by base classifier C_k, that the instance belongs to class l_j.

Next, we extend the above matrix with a score list calculated by a specific function such as the sum function (the column entitled S) and the label judgment list (the column entitled R) as Fig. 2 shows.

Fig.2

An extended matrix with scores calculated by the sum function and label judgments

In Fig. 2, R = (r₁, r₂, r₃, r₄) is a vector of label judgments: r_j = 1 if l_j (1 ≤ j ≤ 4) is the true label and 0 otherwise. S = (score₁, score₂, score₃, score₄) is a score vector. The j-th element of S can be calculated by

${score}_{j} = f \left( \sum_{k = 1}^{N} w_{k} \times s_{kj} \right)$ (1)

where s_kj is the probability predicted by the k-th base classifier for the j-th class, w_k is the weight for the k-th base classifier, and score_j is the score calculated for class label l_j by combining the predictions of all base classifiers. More specifically, when f is defined as a logistic function with a bias term w₀, Equation 1 becomes

${score}_{j} = f \left( w_{0} + \sum_{k = 1}^{N} w_{k} \times s_{kj} \right) = \frac{1}{1 + e^{- (w_{0} + \sum_{k = 1}^{N} w_{k} \times s_{kj})}}$ (2)

Finally, the calculated scores by Equation 2 are fed into the softmax function to obtain the probability distribution of all the classes. The final classification can be decided by comparing the predicted probabilities of all the labels.
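The scoring and softmax steps described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the probability matrix `s` (stored with one row per base classifier), and the weight values are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, s):
    """weights = [w0, w1, ..., wN]; s[k][j] is the probability that
    base classifier k+1 assigns to class j."""
    n_classes = len(s[0])
    # Equation 2: logistic function of the weighted sum, one score per class
    scores = [
        sigmoid(weights[0] + sum(w * row[j] for w, row in zip(weights[1:], s)))
        for j in range(n_classes)
    ]
    q = softmax(scores)  # probability distribution over the classes
    label = max(range(n_classes), key=q.__getitem__)  # most probable class
    return scores, q, label

# Hypothetical meta instance: 3 base classifiers, 4 classes
s = [
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.5, 0.3, 0.1, 0.1],
]
weights = [0.0, 1.0, 1.0, 1.0]  # hypothetical bias and classifier weights
scores, q, label = predict(weights, s)
print(label, [round(x, 3) for x in q])
```

Because every score lies in (0, 1), the softmax output can never be more peaked than a ratio of e between the most and least probable class; the final decision still depends only on which class has the largest score.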

Fig. 3 shows the details of the meta classifier Stacking-CE, in which three base classifiers and two class labels are used for illustration. For any instance, a group of features U = (u₁, u₂, ... , u_V) are chosen as input to base classifiers. Their prediction scores are passed to the meta classifier. As output we obtain Q = (q₁, q₂), which are the predicted probability scores for the two classes.

Fig.3

The structure of the Stacking-CE with 3 base classifiers and 2 class labels

2.2 The learning process and loss function

In Stacking-CE, we first calculate two different probability distributions P = (p₁, p₂, ... , p_L) and Q = (q₁, q₂, ... , q_L) by Equation 3 and Equation 4, respectively:

$p_{j} = \frac{\exp (r_{j})}{\sum_{i = 1}^{L} \exp (r_{i})}$ (3)

$q_{j} = \frac{\exp ({score}_{j})}{\sum_{i = 1}^{L} \exp ({score}_{i})}$ (4)

where exp() represents the exponential function, q_j represents the probability generated from the predicted score score_j, and p_j represents the probability generated from the label judgment r_j.

Next, the divergence between the two probability distributions can be calculated by

$J (P, Q) = - \sum_{j = 1}^{L} p_{j} \log q_{j}$ (5)

where J() is the cross-entropy loss between P and Q.
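As a small worked example of the two distributions and the loss, the sketch below evaluates them for a two-class case. The helper names and the values chosen for Q are hypothetical. One consequence of the definition is worth noting: since P is the softmax of a 0/1 vector, it is not itself one-hot.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    # J(P, Q) = -sum_j p_j * log(q_j)
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))

r = [1, 0]          # label judgment: the first class is the true one
p = softmax(r)      # roughly (0.731, 0.269), i.e. e/(e+1) and 1/(e+1)
q = [0.6, 0.4]      # hypothetical predicted distribution

loss = cross_entropy(p, q)
floor = cross_entropy(p, p)  # value reached when Q equals P exactly
print(round(p[0], 3), round(loss, 3), round(floor, 3))
```

The loss is smallest when Q coincides with P, at which point it equals the entropy of P rather than zero.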

Finally, for a given group of instances as the training data set, we use a single-layer artificial neural network (see Fig. 3) to learn the ensemble model by minimizing the cross entropy over them. More specifically, we use stochastic gradient descent to train the ANN. Its weight gradient is defined as

$\nabla W = (\nabla w_{0}, \nabla w_{1}, \ldots, \nabla w_{N})$

Because

$\frac{\partial f (\sum_{k = 1}^{N} w_{k} s_{kj})}{\partial w_{k}} = {score}_{j} (1 - {score}_{j}) s_{kj}$ (6)

(for the bias weight w₀, s_{0j} is taken as 1), we have

$\nabla w_{k} = \frac{\partial J (P, Q)}{\partial w_{k}} = \sum_{j = 1}^{L} (q_{j} - p_{j}) \, {score}_{j} (1 - {score}_{j}) \, s_{kj}$ (7)

The weight w_k (0 ≤ k ≤ N) can be updated by

$w_{k} = w_{k} - \eta \times \nabla w_{k}$ (8)

Equation 7 and Equation 8 can be used to train the optimal weights W = (w₀, w₁, ... , w_N).
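One way to check the gradient expression in Equation 7 is to compare it against a finite-difference approximation of the loss. The sketch below does this for a single meta instance; the instance, the weights, and all function names are hypothetical, and the bias weight w₀ is paired with a constant input of 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(w, s):
    # s[k][j]: probability from base classifier k+1 for class j
    scores = [sigmoid(w[0] + sum(wk * row[j] for wk, row in zip(w[1:], s)))
              for j in range(len(s[0]))]
    return scores, softmax(scores)

def loss(w, s, r):
    _, q = forward(w, s)
    p = softmax(r)
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))

def gradient(w, s, r):
    """Analytic gradient following Equation 7."""
    scores, q = forward(w, s)
    p = softmax(r)
    inputs = [[1.0] * len(r)] + s  # row 0 is the constant input for w0
    return [sum((q[j] - p[j]) * scores[j] * (1.0 - scores[j]) * row[j]
                for j in range(len(r)))
            for row in inputs]

def numeric_gradient(w, s, r, eps=1e-6):
    """Central finite differences, used only as a cross-check."""
    grads = []
    for k in range(len(w)):
        up = w[:k] + [w[k] + eps] + w[k + 1:]
        down = w[:k] + [w[k] - eps] + w[k + 1:]
        grads.append((loss(up, s, r) - loss(down, s, r)) / (2 * eps))
    return grads

s = [[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]  # hypothetical: 3 classifiers, 2 classes
r = [1, 0]
w = [0.1, 0.5, 0.2, 0.9]
analytic = gradient(w, s, r)
numeric = numeric_gradient(w, s, r)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_gap)
```

The factor (q_j − p_j) comes from differentiating the cross entropy through the softmax, which relies on the entries of P summing to one.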

Algorithm 1: Pseudo-code for the ANN learning process

Input:
M: The meta-level training data set
f: ANN algorithm with the initial weights W; each element in W is assigned a random number between 0 and 1
T: Iteration number
η: Learning rate
L: Number of class labels
N: Number of base classifiers
PD_compute(): Obtain P using Equation 3
SD_compute(): Obtain Q using Equation 4

Output:
f: The trained ANN

for (int i = 1; i <= T; i++) {
    for (int j = 1; j <= M.size; j++) {
        P = PD_compute(M.instance(j).Matrix());
        Q = SD_compute(M.instance(j).Matrix());
        for (int k = 0; k <= N; k++) {
            Compute gradient ∇w_k using Equation 7;
            Update w_k = w_k - η × ∇w_k;
        }
    }
}
return (f with the learned weight vector W)

Algorithm 1 presents the learning process of the artificial neural network. As input we need to provide a meta-level training data set, an initial ANN, and four parameters: iteration number T, learning rate η, number of class labels L, and number of base classifiers N. The output of the algorithm is the ANN with learned weights. In this algorithm, two sub-functions PD_compute() and SD_compute() are used to calculate the probability distributions P and Q, respectively. The algorithm mainly comprises a nested loop of three levels. The weights are updated for all the instances involved, and this is repeated T times.
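The three-level loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudo-code, not the authors' code: the tiny meta-level data set, the function names, and the random seed are hypothetical, while the default learning rate and iteration number are the values selected in Section 3.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(w, s):
    # s[k][j]: probability from base classifier k+1 for class j
    scores = [sigmoid(w[0] + sum(wk * row[j] for wk, row in zip(w[1:], s)))
              for j in range(len(s[0]))]
    return scores, softmax(scores)

def mean_loss(w, data):
    total = 0.0
    for s, r in data:
        _, q = forward(w, s)
        p = softmax(r)
        total -= sum(pj * math.log(qj) for pj, qj in zip(p, q))
    return total / len(data)

def train(data, w_init, eta=0.025, iterations=150):
    w = list(w_init)
    for _ in range(iterations):              # T passes over the meta-level data
        for s, r in data:                    # one meta instance at a time
            p = softmax(r)                   # Equation 3
            scores, q = forward(w, s)        # Equations 2 and 4
            inputs = [[1.0] * len(r)] + s    # constant input for the bias w0
            for k, row in enumerate(inputs):  # k = 0..N
                grad = sum((q[j] - p[j]) * scores[j] * (1.0 - scores[j]) * row[j]
                           for j in range(len(r)))  # Equation 7
                w[k] -= eta * grad           # Equation 8
    return w

# Hypothetical meta-level data: (probability matrix, label judgment R)
meta_data = [
    ([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4]], [1, 0]),
    ([[0.3, 0.7], [0.2, 0.8], [0.4, 0.6]], [0, 1]),
    ([[0.9, 0.1], [0.6, 0.4], [0.7, 0.3]], [1, 0]),
    ([[0.1, 0.9], [0.4, 0.6], [0.3, 0.7]], [0, 1]),
]
rng = random.Random(0)
w_start = [rng.random() for _ in range(4)]   # weights drawn from (0, 1)
w_trained = train(meta_data, w_start)
before = mean_loss(w_start, meta_data)
after = mean_loss(w_trained, meta_data)
print(round(before, 4), round(after, 4))
```

As in the pseudo-code, P and Q are computed once per instance and then reused for all N + 1 weight updates of that instance.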

3 Experimental setup

To evaluate the performance of our stacking approach, we carry out an extensive experiment. Apart from our method, three base-level classifiers, six fixed combining methods and 12 ensemble learning classifiers are trained to enable comparisons. The WEKA machine learning suite is used. The three base classifiers are as follows:

NB: the naive Bayes algorithm

J48: the C4.5 decision tree algorithm

SL: the learning algorithm for building linear logistic regression models

Each of the selected classifiers has a different learning hypothesis (naive Bayes, decision tree, and linear logistic regression) and a different inductive bias. We hope that such a selection generates diversity among the base classifiers and is therefore more favourable for the ensemble.

The six fixed combining methods are Sum, Product, Max, Min, Majority Vote and Median.
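For readers unfamiliar with the fixed combining rules, the sketch below applies each of them to one instance. It is only an illustration under simple assumptions: the probability values are hypothetical, the function names are ours, and ties are broken arbitrarily, which may differ from the behaviour of the WEKA implementations used in the experiments.

```python
from math import prod
from statistics import median

# Hypothetical outputs: one row per base classifier, one column per class
probs = [
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.5, 0.3, 0.1, 0.1],
]

def combine(probs, rule):
    """Apply a fixed rule to each class across classifiers and
    return the index of the class with the highest combined score."""
    per_class = list(zip(*probs))  # per_class[j]: scores of class j from every classifier
    scores = [rule(col) for col in per_class]
    return max(range(len(scores)), key=scores.__getitem__)

def majority_vote(probs):
    """Each classifier votes for its most probable class; ties are broken arbitrarily."""
    votes = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return max(set(votes), key=votes.count)

rules = {"Sum": sum, "Product": prod, "Max": max, "Min": min, "Median": median}
decisions = {name: combine(probs, rule) for name, rule in rules.items()}
decisions["Majority Vote"] = majority_vote(probs)
print(decisions)
```

None of these rules has parameters to learn, which is what distinguishes them from trainable combiners such as stacking.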

The thirteen ensemble learning classifiers are listed below. Nine of them are variants of stacking (including ours) and the other four are representative ensemble learning methods that are not stacking-based.

Stacking-CE (S-CE): stacking with the cross-entropy based meta method (proposed in this paper)

Stacking-MLR (S-MLR): stacking with the multi-response linear regression learning algorithm [26, 35]

Stacking-AdaBoost(J48) (S-AB(J48)): stacking with the AdaBoost (J48) learning algorithm [13]

StackingJ48 (S-J48): stacking with the decision tree learning algorithm [25]

StackingC (S-C): a variant of stacking [23]

Stacking-RSVM (S-RSVM): stacking with the ranking support vector machine algorithm [25]

Stacking-RandomForest (S-RF): stacking with the random forest learning algorithm [36]

Stacking-Bagging(J48) (S-Bag(J48)): stacking with the bagging (J48) learning algorithm [38]

Stacking-XGBoost (S-XGB): stacking with the XGBoost learning algorithm [33]

SelectBest (Sel-Best): the ensemble algorithm that selects the best from a group of candidates [28]

Bagging: a typical decision tree-based bagging [34]

RandomForest (RF): the classifier constructing a forest of random trees [2]

AdaBoost-SVM (AB-SVM): boosting the SVM classifier using the AdaBoostM1 method [32]

Twenty-two real-world data sets are used in this experiment. All of them are downloaded from the UCI Machine Learning Repository. The statistics of these data sets are listed in Table 2.

Table 2
Details of the data sets used in the experiments, which are taken from the UCI Machine Learning Repository

No.	Data Set	# of Instances	# of Attributes	# of Classes
1	Autos	205	26	7
2	Balance-scale	625	5	3
3	Breast-w	699	10	2
4	Cylinder-bands	540	40	2
5	Ecoli	336	8	8
6	Heart-h	294	14	5
7	Hypothyroid	3772	30	4
8	Ionosphere	351	35	2
9	Iris	150	5	3
10	Letter	20000	17	26
11	Lymph	148	19	4
12	Nursery	12960	9	5
13	Optdigits	5620	65	10
14	Segment	2310	20	7
15	Soybean	683	36	19
16	Spambase	4601	58	2
17	Vote	435	17	2
18	Kr-vs-kp	3196	37	2
19	Splice	3190	61	3
20	Tic-tac-toe	958	10	2
21	Solar-flare_1	323	13	6
22	Phishing Websites	11055	31	2

For our method Stacking-CE, two parameters need to be set: the learning rate η and the iteration number T. To find reasonable values for η and T, we ran a pilot study on Soybean, which has 19 classes and 683 instances with an imbalanced class distribution. 30% of the instances were randomly selected and different combinations of η and T were tried. Fig. 4 shows the loss curves for different numbers of iterations on Soybean. In Fig. 4, the curves for all learning rates (0.005, 0.015, 0.025, 0.035, and 0.045) decrease rapidly at the beginning and then flatten as the iteration number increases. The curve with a learning rate of 0.025 looks slightly better than the others, and it decreases very slowly once the iteration number exceeds 150. Therefore, a learning rate of 0.025 and an iteration number of 150 appear to be a good option. A similar pattern is observed on four other data sets: Breast-w, Ecoli, Autos, and Segment. Among them, Segment is a balanced data set with seven classes and 2310 instances. Thus the same setting is used on all the other data sets in the experiments. Although this setting works well, it might not be the optimum for every data set.

Fig.4

Average loss curves varying with iteration numbers on Soybean

Four metrics are used to evaluate the performance of the methods involved. They are classification accuracy (Acc), win-loss ratio of two methods on Acc, relative improvement ratio of one method over the other on Acc, and F1. F1 is a measure that considers both precision and recall.

In our experiments, we use five-fold stratified cross validation, a form of cross validation in which the class distribution inside each fold is kept as close as possible to that of the whole data set. To increase the reliability of the experimental results, we repeat the above process five times and report the average performance over all runs. Furthermore, we carry out the Friedman test and the Wilcoxon signed-rank test to compare our method with the others. Bonferroni adjustment is also applied to the Wilcoxon signed-rank test.
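The idea of stratified fold assignment can be sketched as follows. This is a simplified illustration with hypothetical names and labels; the procedure actually used by WEKA may differ in its details.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign instance indices to folds so that each fold keeps
    roughly the class proportions of the whole data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # carries over between classes to keep fold sizes balanced
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(members)
    return folds

# Hypothetical imbalanced label vector: 50 / 30 / 20 instances per class
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
folds = stratified_folds(labels, n_folds=5, seed=1)

# Repeating with different seeds gives the five repetitions of the protocol
repeats = [stratified_folds(labels, n_folds=5, seed=run) for run in range(5)]
print([len(f) for f in folds])
```

In each repetition, every fold serves once as the test set while the remaining four folds are used for training.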

4 Results and discussion

The experiments were conducted on 22 real-world data sets and 21 methods were involved as baselines. We divide them into three groups and present their classification accuracy in Tables 3, 4 and 5, respectively. Table 3 covers nine stacking-based methods including ours, Table 4 presents the results of six popular fixed combining methods, and Table 5 presents the results of three base classifiers and four popular ensemble learning algorithms that are not stacking-based. For convenience, the results of our stacking model are presented in all three tables.

Table 3
Classification Accuracy (%) of the nine stacking-based methods on 22 real-world data sets. The highest accuracy among all the methods is marked in boldface. Standard deviation (%) of five runs of Stacking-CE’s accuracy is presented in parentheses

No.	S-RF	S-AB(J48)	S-Bag(J48)	S-XGB	S-MLR	S-J48	S-C	S-RSVM	S-CE
1	76.390	76.780	76.976	78.731	76.878	73.659	79.220	79.220	80.195(2.32)
2	92.544	91.712	91.936	91.167	89.120	90.112	88.320	89.216	89.504(0.25)
3	95.908	94.762	96.566	96.566	96.651	96.451	96.508	96.079	96.737(0.29)
4	70.259	73.074	73.963	71.889	76.296	72.074	75.741	75.519	76.333(1.26)
5	85.306	84.002	85.719	86.198	85.715	83.816	85.834	86.608	86.909(0.37)
6	80.810	81.154	82.381	81.013	83.867	82.581	84.420	84.077	84.553(0.54)
7	99.449	99.396	99.518	99.475	99.512	99.300	99.523	99.539	99.539(0.05)
8	90.878	90.366	91.165	91.734	91.678	90.880	91.964	90.541	92.304(1.02)
9	94.000	93.467	93.733	93.467	95.067	92.933	95.467	95.333	96.133(0.26)
10	90.743	91.314	88.435	90.970	88.178	85.824	87.989	87.305	88.408(0.25)
11	79.766	75.623	79.862	79.366	82.717	77.784	83.269	83.789	83.779(1.94)
12	97.844	97.535	97.884	97.981	97.096	97.656	97.164	97.110	97.312(0.14)
13	97.256	97.071	97.089	97.196	97.231	96.794	97.214	97.345	97.356(0.10)
14	96.814	96.433	96.675	96.866	96.857	96.312	96.883	96.658	96.831(0.23)
15	93.323	90.859	88.866	93.323	93.556	88.136	93.645	93.791	93.791(0.49)
16	92.106	91.667	93.515	93.280	93.780	93.454	93.780	93.510	93.901(0.21)
17	94.805	94.897	95.724	94.851	95.862	95.494	95.862	95.816	95.954(0.28)
18	99.136	99.161	99.080	99.193	99.255	99.086	99.255	99.255	99.255(0.07)
19	95.925	95.643	96.082	95.981	96.232	95.680	96.213	96.226	96.309(0.10)
20	97.849	97.703	98.080	98.142	98.330	98.121	98.330	98.330	98.330(0.00)
21	69.357	68.627	70.842	67.377	69.919	68.067	71.090	71.589	71.775(0.42)
22	95.459	95.412	95.966	95.924	95.884	95.841	95.877	95.814	95.955(0.06)

Table 4

Classification Accuracy (%) of six fixed combining methods and the presented stacking approach on 22 real-world data sets. The highest accuracy among all the methods is marked in boldface

No.	Sum	Product	Max	Min	Majority Vote	Median	S-CE
1	76.976	79.512	78.243	78.829	75.805	64.293	80.195
2	86.176	80.960	82.656	79.680	89.824	86.944	89.504
3	96.565	95.649	95.593	95.593	96.566	95.421	96.737
4	76.593	76.593	76.407	76.407	73.444	58.333	76.333
5	86.852	84.766	85.124	84.588	86.911	85.485	86.909
6	84.008	84.008	84.212	84.212	84.007	81.553	84.553
7	97.588	98.956	98.977	99.326	96.936	95.785	99.539
8	92.532	88.375	86.778	86.607	92.419	83.014	92.304
9	96.000	95.200	95.467	95.200	95.867	95.600	96.133
10	86.609	87.032	87.186	85.975	81.616	68.527	88.408
11	83.241	81.228	82.018	80.947	82.455	78.966	83.779
12	96.299	97.233	97.298	97.304	93.900	90.432	97.312
13	96.206	90.612	91.794	90.484	96.139	92.808	97.356
14	95.671	95.403	93.576	94.294	95.342	89.593	96.831
15	93.733	93.350	93.264	93.351	93.557	90.337	93.791
16	93.410	82.225	82.269	82.047	93.610	91.689	93.901
17	95.356	94.115	93.057	93.057	95.448	93.149	95.954
18	98.924	99.099	99.262	99.262	98.079	88.586	99.255
19	95.975	95.235	95.129	95.191	95.856	94.144	96.309
20	88.686	87.621	87.767	87.767	88.998	78.746	98.330
21	72.024	72.087	71.590	71.714	70.849	66.195	71.775
22	95.483	95.886	95.911	95.911	94.625	93.261	95.955

Table 5

Classification Accuracy (%) of three base classifiers and five ensemble methods (including the presented stacking approach) on 22 real-world data sets. The highest accuracy among all the methods is marked in boldface

No.	NB	J48	SL	Sel-Best	Bagging	RF	AB (SVM)	S-CE
1	56.878	80.195	73.561	77.366	63.414	80.488	34.927	80.195
2	90.016	77.984	87.616	90.016	83.232	81.376	89.632	89.504
3	96.079	94.306	96.280	95.879	95.649	96.108	95.564	96.737
4	73.815	57.778	74.630	73.074	59.444	72.259	71.407	76.333
5	85.833	82.981	86.378	86.076	83.457	83.689	78.267	86.909
6	84.141	79.994	83.871	83.190	80.609	79.581	64.831	84.553
7	95.302	99.512	96.554	99.512	99.491	99.067	93.016	99.539
8	82.730	89.345	86.950	88.716	92.132	93.214	93.558	92.304
9	95.733	95.200	95.600	95.467	95.067	94.400	96.533	96.133
10	64.052	87.372	77.357	87.372	89.554	93.567	97.314	88.408
11	83.159	76.207	82.455	83.697	76.202	79.480	82.308	83.779
12	90.262	96.724	92.553	96.724	97.022	97.836	99.965	97.312
13	91.295	90.029	97.093	97.139	93.915	96.530	60.235	97.356
14	80.217	96.416	95.048	96.416	95.948	97.203	60.814	96.831
15	92.502	90.923	92.972	92.883	83.457	91.419	91.274	93.791
16	79.592	92.715	92.650	92.667	93.914	94.697	84.381	93.901
17	90.161	95.816	95.356	95.862	95.172	95.724	95.126	95.954
18	87.841	99.255	96.871	99.255	98.973	98.586	98.179	99.255
19	95.317	93.712	95.574	95.423	93.379	89.586	94.502	96.309
20	70.188	83.362	98.226	98.184	88.685	91.148	97.015	98.330
21	66.135	70.224	70.907	70.966	71.462	70.907	70.658	71.775
22	92.930	95.815	93.905	95.815	95.792	96.917	95.077	95.955

From Table 3, we can see that Stacking-CE achieves the best classification accuracy on 17 out of 22 data sets, while the figures for Stacking-RSVM, StackingC, Stacking-MLR, Stacking-XGBoost, Stacking-RandomForest, Stacking-Bagging(J48), StackingJ48 and Stacking-AdaBoost(J48) are 5, 3, 2, 2, 1, 1, 0, and 0, respectively. For Stacking-CE, the standard deviation of the accuracies over five runs is also shown in Table 3. In all cases the standard deviation is small (at most 2.32%), which indicates that the performance of the proposed method is stable across runs. Comparing the results of Stacking-CE with the six popular fixed combining methods in Table 4, we can see that Stacking-CE outperforms all of them on average. Stacking-CE achieves the best classification accuracy on 17 out of 22 data sets, while the figures for Sum, Product, Max, Min, Majority Vote and Median are 2, 2, 1, 1, 1 and 0, respectively.

The results in Table 5 show the performance of the three base classifiers and four popular meta-classifiers that are not stacking-based. Comparing our method with the three base classifiers, we find that it outperforms all of them on 19 data sets and is one of the best performers on two of the remaining data sets (Autos and Kr-vs-kp). Comparing our method with the four other meta-classifiers, we find that Stacking-CE achieves the best classification accuracy on 13 out of 22 data sets, while the figures for AdaBoost(SVM), RandomForest, SelectBest and Bagging are 4, 4, 2 and 0, respectively.

In order to have a clear view of all the methods involved, we make a head-to-head comparison for each pair of them. The results are shown in Tables 6–8, one for each of the three groups. In these three tables, the improvement percentage and the win-loss ratio of two given methods are shown. For example, in Table 6 the entry in the row headed “S-CE” and the column headed “S-RF” is “1.5 19/03”. It means that S-CE is 1.5% more accurate than S-RF on average and that the win-loss ratio between them is 19/03 over all 22 data sets.
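The example entry can be recomputed from the S-RF and S-CE columns of Table 3, as sketched below. The formula for the improvement percentage is not spelled out in the text; taking the mean of the per-data-set relative gains is our assumption, adopted here because it reproduces the reported value. The function names are hypothetical.

```python
# Accuracy (%) on the 22 data sets, copied from Table 3
s_rf = [76.390, 92.544, 95.908, 70.259, 85.306, 80.810, 99.449, 90.878,
        94.000, 90.743, 79.766, 97.844, 97.256, 96.814, 93.323, 92.106,
        94.805, 99.136, 95.925, 97.849, 69.357, 95.459]
s_ce = [80.195, 89.504, 96.737, 76.333, 86.909, 84.553, 99.539, 92.304,
        96.133, 88.408, 83.779, 97.312, 97.356, 96.831, 93.791, 93.901,
        95.954, 99.255, 96.309, 98.330, 71.775, 95.955]

def win_loss(a, b):
    """Count the data sets on which method a is strictly better or worse than b."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    return wins, losses

def mean_relative_improvement(a, b):
    """Average of (a - b) / b over the data sets, in percent (assumed definition)."""
    return 100.0 * sum((x - y) / y for x, y in zip(a, b)) / len(a)

wins, losses = win_loss(s_ce, s_rf)
improvement = mean_relative_improvement(s_ce, s_rf)
print(f"{improvement:.1f} {wins:02d}/{losses:02d}")
```

Ties count as neither a win nor a loss, which is why the two numbers of a win-loss ratio do not always add up to 22.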

Table 6

The improvement percentage of one method over the other method and the win-loss ratio between the two methods for all the methods listed in Table 3

	S-RF	S-AB(J48)	S-Bag(J48)	S-XGB	S-MLR	S-J48	S-C	S-RSVM
S-AB(J48)	-0.4 06/16
S-Bag(J48)	0.3 15/07	0.8 19/03
S-XGB	0.3 16/05	0.7 15/06	0.0 10/11
S-MLR	0.8 18/04	1.3 19/03	0.5 14/08	0.6 14/08
S-J48	-0.8 08/14	-0.4 10/12	-1.1 03/19	-1.1 05/17	-1.6 02/20
S-C	1.1 18/04	1.5 19/03	0.8 17/05	0.8 16/06	0.2 11/07	1.9 20/02
S-RSVM	1.0 17/05	1.5 19/03	0.7 14/08	0.8 15/07	0.2 11/09	1.9 17/05	0.0 08/11
S-CE	1.5 19/03	1.9 19/03	1.2 18/04	1.2 18/04	0.6 19/01	2.3 20/02	0.4 19/01	0.4 17/01

Table 7

The improvement percentage of one method over the other and the win-loss ratio between the two methods with the results from Table 4

	Sum	Product	Max	Min	Majority Vote	Median
Product	-1.5 07/13
Max	-1.5 07/15	-0.1 13/09
Min	-1.8 06/16	-0.4 07/14	-0.3 07/08
Majority Vote	-0.7 06/16	0.9 12/10	1.0 13/09	1.3 13/09
Median	-6.4 01/21	-4.9 05/17	-4.8 06/16	-4.5 06/16	-5.9 00/22
S-CE	1.3 19/03	3.0 20/02	3.0 20/02	3.4 20/02	2.1 19/03	8.9 22/00

Table 8

The improvement percentage of one method over the other and the win-loss ratio between the two methods for all the methods listed in Table 5

	Sel-Best	Bagging	RF	AB (SVM)
Bagging	-3.6 05/17
RF	-0.9 08/14	3.1 15/07
AB (SVM)	-7.6 04/18	-4.2 10/12	-6.6 08/14
S-CE	1.1 20/01	5.3 20/02	2.1 15/07	14.4 17/05

From Table 6, we can see that our method is the best, S-RSVM is in second place, and S-AB(J48) is the worst among all the stacking-based ensemble methods. Our method has an improvement rate of 0.4% over the second-best method, S-RSVM, and the win-loss ratio between them is 17/1 (plus 4 ties). Among the other stacking methods, S-MLR, StackingC and S-RSVM outperform S-AB(J48), S-XGB, S-Bag(J48) and S-RF. StackingC slightly outperforms S-MLR with an improvement rate of 0.2%, and S-RSVM and StackingC are very close in performance.

From Table 7, we can see that our method is the best, Sum is in second place and Median is the worst. Our method clearly outperforms all the fixed combining methods. It has an improvement rate of 1.3% over Sum, and the win-loss ratio between them is 19/3. We can also observe that Majority Vote outperforms Product, Max and Min. Max and Min are very close in performance, and both of them outperform Median.

From Table 8, we can see that our method is the best, SelectBest is in second place, and AdaBoost(SVM) is the worst among all the methods in this group. Our method has an improvement rate of 1.1% over SelectBest, and the win-loss ratio between them is 20/1. SelectBest outperforms the three other ensemble algorithms, Bagging, RandomForest and AdaBoost(SVM), while RandomForest outperforms Bagging with an improvement rate of 3.1%.

We then test whether the difference between our method and the others is statistically significant. The Friedman test is carried out, and the result shows that, as a whole, the proposed method differs significantly from the others at a confidence level of 95%. Furthermore, we carry out the Wilcoxon signed-rank test to compare our method with every other method separately. Table 9 presents the results. It shows that Stacking-CE is better than all the other methods involved at a confidence level of 95%. Since 21 other methods are compared at the same time, the Bonferroni adjustment may be applied to the results of the Wilcoxon signed-rank test. Rather than setting α = 0.05, we let α_B = 0.05/21 ≈ 0.0024. The same conclusion then holds for 16 of the methods, while the differences between our method and five methods (S-RF, S-Bag(J48), S-XGB, RF and AB(SVM)) are not significant.
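The rank sums R⁺ and R⁻ used by the Wilcoxon signed-rank test can be computed as in the sketch below. It assumes one common convention (zero differences are discarded and tied absolute differences share their average rank); the function name `signed_rank_sums` and the accuracies are hypothetical, and the p-value computation is left out.

```python
def signed_rank_sums(acc_a, acc_b):
    """Return (R_plus, R_minus) for paired accuracies of methods A and B.

    Zero differences are dropped and tied absolute differences receive
    the average of the ranks they span (an assumed convention).
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run ordered[i..j] sharing the same absolute difference.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return r_plus, r_minus


# Made-up accuracies on five hypothetical data sets (one exact tie).
r_plus, r_minus = signed_rank_sums([90, 85, 70, 95, 80], [88, 85, 72, 90, 79])
print(r_plus, r_minus)  # 7.5 2.5
```

With n non-tied data sets the two sums add up to n(n+1)/2, which is 253 for n = 22; smaller totals indicate that some data sets were tied and dropped.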

Table 9

Wilcoxon signed-rank test based on accuracy, comparing Stacking-CE with each of the other 21 methods, with positive ranks (R⁺), negative ranks (R⁻), and p-values. Superscript 'b' indicates a significant difference at the level of 95% with Bonferroni adjustment

Pairwise Comparisons	R⁺	R⁻	p-value
Stacking-CE vs. S-RF	210	43	0.007
Stacking-CE vs. S-AB(J48)	223	30	0.002^b
Stacking-CE vs. S-Bag(J48)	218	35	0.003
Stacking-CE vs. S-XGB	206	47	0.010
Stacking-CE vs. S-MLR	209	1	0.000^b
Stacking-CE vs. S-J48	236	17	0.000^b
Stacking-CE vs. S-C	208	2	0.000^b
Stacking-CE vs. S-RSVM	170	1	0.000^b
Stacking-CE vs. Sum	235	18	0.000^b
Stacking-CE vs. Product	244	9	0.000^b
Stacking-CE vs. Max	248	5	0.000^b
Stacking-CE vs. Min	247	6	0.000^b
Stacking-CE vs. Majority Vote	243	10	0.000^b
Stacking-CE vs. Median	253	0	0.000^b
Stacking-CE vs. NB	250	3	0.000^b
Stacking-CE vs. J48	210	0	0.000^b
Stacking-CE vs. SL	253	0	0.000^b
Stacking-CE vs. Sel-Best	223	8	0.000^b
Stacking-CE vs. Bagging	240	13	0.000^b
Stacking-CE vs. RF	193	60	0.031
Stacking-CE vs. AB(SVM)	212	41	0.006
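The Bonferroni adjustment can be re-applied to the p-values printed in Table 9 with a few lines of Python. This is only a sketch: the p-values are rounded to three decimals in the table, so a borderline case such as 0.002 is classified here on its rounded value, and the variable names are ours.

```python
ALPHA = 0.05

# p-values as printed in Table 9 (Stacking-CE against each other method).
p_values = {
    "S-RF": 0.007, "S-AB(J48)": 0.002, "S-Bag(J48)": 0.003, "S-XGB": 0.010,
    "S-MLR": 0.000, "S-J48": 0.000, "S-C": 0.000, "S-RSVM": 0.000,
    "Sum": 0.000, "Product": 0.000, "Max": 0.000, "Min": 0.000,
    "Majority Vote": 0.000, "Median": 0.000, "NB": 0.000, "J48": 0.000,
    "SL": 0.000, "Sel-Best": 0.000, "Bagging": 0.000, "RF": 0.031,
    "AB(SVM)": 0.006,
}

# Bonferroni: divide the significance level by the number of comparisons.
alpha_b = ALPHA / len(p_values)

significant = [m for m, p in p_values.items() if p < alpha_b]
not_significant = [m for m, p in p_values.items() if p >= alpha_b]

print(f"alpha_B = {alpha_b:.5f}")
print(f"{len(significant)} significant, not significant: {not_significant}")
```

On these rounded values, 16 comparisons stay significant and five do not, in line with the superscripts in Table 9.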

Finally, we examine how these methods perform on imbalanced data sets, in which the instances are not evenly distributed across classes. The imbalance ratio (IR) [17] of a data set is defined as Max_num/Min_num, where Max_num and Min_num stand for the maximum and minimum number of instances belonging to one class in the data set, respectively. From the 22 data sets, we choose the six most imbalanced: Autos, Ecoli, Hypothyroid, Lymph, Nursery, and Soybean. Fig. 6 shows the average F1 scores of 13 ensemble methods on them. Stacking-CE performs the best on five of the six data sets. This result shows that Stacking-CE is able to deal with imbalanced data sets with good performance.
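As a small illustration of the imbalance ratio defined above, the following sketch computes IR from a list of class labels. The function name `imbalance_ratio` and the toy labels are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """IR = (size of the largest class) / (size of the smallest class)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are needed")
    return max(counts.values()) / min(counts.values())


# Toy data set with 60, 30 and 10 instances in its three classes.
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
print(imbalance_ratio(labels))  # 6.0
```

A perfectly balanced data set has IR = 1, and larger values indicate stronger imbalance, so ranking data sets by IR in descending order gives the most imbalanced ones first.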

Fig. 5

Average rank of all the methods involved in 22 data sets. The best ranking algorithm is at the left-most side of the diagram.

Fig. 6

Performance of 13 ensemble methods on six imbalanced data sets measured by F1

5 Conclusions

In this paper, we have explored how to use stacking to improve ensemble performance. A variant of the stacking-based approach has been investigated. It trains the weights for combining the base classifiers by using an ANN whose loss function is the cross entropy between the predicted and the real probability distributions. Stochastic gradient descent is applied to minimize the total loss. Experiments are conducted with 22 real-world data sets to evaluate the performance of our method, with a large number of other ensemble methods as baselines. The experiments show that, on average, the proposed stacking variant performs better than the other methods. Therefore, the proposed method is very competitive for classification tasks.
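The idea summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the meta-level network is taken to be a single softmax layer over the concatenated class probabilities of the base classifiers, and the learning rate, number of epochs, function names and toy data are all made up.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(w, b, x):
    """Class probabilities of the single-layer combiner for one meta instance."""
    return softmax([sum(wk * xk for wk, xk in zip(w_c, x)) + b_c
                    for w_c, b_c in zip(w, b)])


def train_meta(meta_x, labels, n_classes, lr=0.5, epochs=200, seed=0):
    """Minimize the cross-entropy loss with per-instance SGD."""
    rng = random.Random(seed)
    n_inputs = len(meta_x[0])
    w = [[0.0] * n_inputs for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(meta_x)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = meta_x[i], labels[i]
            p = forward(w, b, x)
            for c in range(n_classes):
                # Gradient of -log p[y] w.r.t. logit c is p[c] - 1[c == y].
                grad = p[c] - (1.0 if c == y else 0.0)
                for k in range(n_inputs):
                    w[c][k] -= lr * grad * x[k]
                b[c] -= lr * grad
    return w, b


def predict(w, b, x):
    """Index of the most probable class."""
    p = forward(w, b, x)
    return max(range(len(p)), key=p.__getitem__)


# Toy meta instances: [P1(c0), P1(c1), P2(c0), P2(c1)] from two
# hypothetical base classifiers; the second one is unreliable.
meta_x = [
    [0.9, 0.1, 0.6, 0.4],
    [0.8, 0.2, 0.3, 0.7],
    [0.2, 0.8, 0.4, 0.6],
    [0.3, 0.7, 0.7, 0.3],
]
labels = [0, 0, 1, 1]

w, b = train_meta(meta_x, labels, n_classes=2)
print([predict(w, b, x) for x in meta_x])
```

Because every meta instance is fed to the network as one vector and all class outputs share the same softmax, the weights for the different classes are fitted jointly rather than by one independent model per class.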

For future work, there are two possible extensions to this method. Firstly, we have used a simple ANN to train the weights for all base classifiers. Such a meta-level classifier treats the instances of different classes equally. A more sophisticated multi-layer ANN should be able to distinguish instances of different classes and treat them differently. Secondly, we have used the predicted scores from all base classifiers as input. Such scores may not convey all the information that the original features possess. By using some useful original features in addition to the predictions from the base classifiers, it may be possible to obtain more effective ensemble classifiers.

References

1. Büyükçakir, Bonab H.R. and Can, A Novel Online Stacked Ensemble for Multi-Label Stream Classification, 27th ACM International Conference on Information and Knowledge Management, Italy, 2018, pp. 1063–1072.

2. Breiman, Random Forests, Machine Learning 45(1) (2001), 5–32.

3. Campos R.R., Canuto S.D., Salles, de Sá C.C.A. and Gonçalves M.A., Stacking Bagged and Boosted Forests for Effective Automated Classification, In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Japan, 2017, pp. 105–114.

4. Chen, Wong M.L., Li, Applying Ant Colony Optimization to Configuring Stacking Ensembles for Data Mining, Expert Systems with Applications 41(6) (2014), 2688–2702.

5. Ding, Wu, ABC-based stacking method for multi-label classification, Turkish Journal of Electrical Engineering and Computer Sciences 27(6) (2019), 4231–4245.

6. Dzeroski, Zenko, Is Combining Classifiers with Stacking Better than Selecting the Best One? Machine Learning 54(3) (2004), 255–273.

7. Ekbal, Saha, Stacked Ensemble Coupled with Feature Selection for Biomedical Entity Extraction, Knowledge Based Systems 46 (2013), 22–32.

8. Fatai, Labadin, Abdulraheem, Improving the Prediction of Petroleum Reservoir Characterization with a Stacked Generalization Ensemble Model of Support Vector Machines, Applied Soft Computing 26 (2015), 483–496.

9. Hamzeh-Khani, Parvin, Rad, A Classifier Ensemble Enriched with Unsupervised Learning, 14th Mexican International Conference on Artificial Intelligence, Mexico, 2015, pp. 509–517.

10. Hänsch, Hellwich, Classification of PolSAR Images by Stacked Random Forests, ISPRS International Journal of Geo-Information 7(2) (2018), 74.

11. Huang, Li, Weng, Lee C.H., Beyond Cross-Entropy: towards Better Frame-Level Objective Functions for Deep Neural Network Training in Automatic Speech Recognition, 15th Annual Conference of the International Speech Communication Association, Singapore, 2014, pp. 1214–1218.

12. Jurek, Bi, Wu, Nugent C.D., Clustering-Based Ensembles as an Alternative to Stacking, IEEE Transactions on Knowledge and Data Engineering 26(9) (2014), 2120–2137.

13. Kang, Michalak, Enhanced version of AdaBoostM1 with J48 Tree learning method, CoRR abs/1802.03522, (2018).

14. Large, Lines, Bagnall A.J., The Heterogeneous Ensembles of Standard Classification Algorithms (HESCA): the Whole is Greater than the Sum of its Parts, CoRR abs/1710.09220, (2017).

15. Ledezma, Aler, Sanchis, Borrajo, GA-stacking: Evolutionary Stacked Generalization, Intelligent Data Analysis 14(1) (2010), 89–119.

16. Lin, Lin, Wu, Xu, Learning to Rank with Cross Entropy, 20th ACM Conference on Information and Knowledge Management (CIKM 2011), United Kingdom, 2011.

17. Liu, Tang, Cai, Wang, Chen, A hybrid method based on ensemble WELM for handling multi class imbalance in cancer microarray data, Neurocomputing 266 (2017), 641–650.

18. Lorente M.P.S., Ledezma, Sanchis, Generating Ensembles of Heterogeneous Classifiers Using Stacked Generalization, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(1) (2015), 21–34.

19. Matlock, Niz C.D., Rahman, Ghosh, Pal, Investigation of Model Stacking for Drug Sensitivity Prediction, BMC Bioinformatics 19-S(3) (2018), 21–33.

20. Menahem, Rokach, Elovici, Troika – An Improved Stacking Schema for Classification Tasks, Information Sciences 179(24) (2009), 4097–4122.

21. Merz C.J., Using Correspondence Analysis to Combine Classifiers, Machine Learning 36(1-2) (1999), 33–58.

22. van Rijn J.N., Holmes, Pfahringer and Vanschoren, The Online Performance Estimation Framework: Heterogeneous Ensemble Learning for Data Streams, Machine Learning 107(1) (2018), 149–176.

23. Seewald A.K., How to Make Stacking Better and Faster While Also Taking Care of an Unknown Weakness, 9th International Conference on Machine Learning, Australia, 2002, pp. 554–561.

24. Shunmugapriya, Kanmani, Optimization of Stacking Ensemble Configurations through Artificial Bee Colony Algorithm, Swarm and Evolutionary Computation 12 (2013), 24–32.

25. Tang, Chen, Wang, Reranking for Stacking Ensemble Learning, In Proceedings of Neural Information Processing – 17th International Conference, Australia, 2010, pp. 575–584.

26. Ting K.M., Witten I.H., Issues in Stacked Generalization, Journal of Artificial Intelligence Research 10 (1999), 271–289.

27. Todorovski, Dzeroski, Combining Multiple Models with Meta Decision Trees, 4th European Conference on Principles of Data Mining and Knowledge Discovery, France, 2000, pp. 54–64.

28. Witten I.H., Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, USA: Morgan Kaufmann, 1999.

29. Wolpert D.H., Stacked generalization, Neural Networks 5(2) (1992), 241–259.

30. Wozniak, Graña, Corchado, A Survey of Multiple Classifier Systems as Hybrid Systems, Information Fusion 16 (2014), 3–17.

31. Nguyen T.T., Nguyen M.P., Pham X.C., Liew A.W.C., Heterogeneous classifier ensemble with fuzzy rule-based meta learner, Information Sciences 422 (2018), 144–160.

32. , Wang, Hu, AdaBoost-SVM for Electrical Theft Detection and GRNN for Stealing Time Periods Identification, 44th Annual Conference of the IEEE Industrial Electronics Society, USA, 2018, pp. 3073–3078.

33. Xia, Liu, Da, Xie, A Novel Heterogeneous Ensemble Credit Scoring Model Based on Bstacking Approach, Expert Systems with Applications 93 (2018), 182–199.

34. Xie, Minn, Real-Time Sleep Apnea Detection by Classifier Combination, IEEE Transactions on Information Technology in Biomedicine 16(3) (2018), 469–477.

35. Zhang C.X., Duin R.P.W., An Experimental Study of One- and Two-Level Classifier Fusion for Different Sample Sizes, Pattern Recognition Letters 32(14) (2011), 1756–1767.

36. Zhang, Yan, Li, Bie, Contour Detection via Stacking Random Forest Learning, Neurocomputing 275 (2018), 2702–2715.

37. Zhang, Sabuncu M.R., Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels, CoRR abs/1805.07836, (2018).

38. Zhang, Lo, Xia, Sun, An Empirical Study of Classifier Combination for Cross-Project Defect Prediction, 39th IEEE Annual Computer Software and Applications Conference, Taiwan, 2015, pp. 264–269.