Abstract
Stacking is one of the major types of ensemble learning techniques in which a set of base classifiers contributes their outputs to the meta-level classifier, and the meta-level classifier combines them so as to produce more accurate classifications. In this paper, we propose a new stacking algorithm that defines the cross-entropy as the loss function for the classification problem. The training process is conducted by using a neural network with the stochastic gradient descent technique. One major characteristic of our method is its treatment of each meta instance as a whole with one optimization model, which is different from some other stacking methods such as stacking with multi-response linear regression and stacking with multi-response model trees. In these methods each meta instance is divided into a set of sub-instances. Multiple models apply to those sub-instances and each for a class label. There is no connection between different models. It is very likely that our treatment is a better choice for finding suitable weights. Experiments with 22 data sets from the UCI machine learning repository show that the proposed stacking approach performs well. It outperforms all three base classifiers, several state-of-the-art stacking algorithms, and some other representative ensemble learning methods on average.
Introduction
Classifier ensembles have received a lot of attention in recent years due to their ability of improving classification accuracy in different applications [10, 22]. In ensemble learning, there are various ways of combining classifiers, such as bagging, boosting, stacking and others [9]. Stacking is a way of using a high-level classifier to combine multiple different base classifiers [12, 30]. A typical stacking approach can be divided into three steps as follows: firstly, two or more base classifiers such as Bayes, IBk, J4.8, and so on need to be chosen and trained with some data to reach a given accuracy threshold; secondly, a meta-classifier is chosen to combine results of base classifiers; it also needs to be trained with some data for satisfactory prediction accuracy; finally, new instances can be classified by using the full stacked model. Many researchers have investigated the stacking method and conclude that if used properly the stacking algorithm can be more effective than an individual classifier-based approach [18].
When applying the stacking, we need to carefully consider two issues [6]. One is the output from the base classifiers to feed into the meta-level classifier and the other is the stacking method to make the final decision. For the first issue, two types of input have been considered in building the meta-classifier: first, the class labels to which the instance belongs; second, the probabilities that the instance belongs to all the classes. Previous research demonstrates that probability distributions of classes are more effective than class labels [21, 26]. For the second issue, many researchers have proposed a number of different approaches to stacking such as stacking with multi-response linear regression [26], stacking with meta-decision trees [27], stacking with multi-response model trees [23], dynamically-weighted stacked ensemble [1], ant colony and bee colony optimization based stacking [4, 5], and others [3, 14].
In general, there are two types of methods to combine the outputs of multiple base classifiers: fixed combining methods and trainable combining methods. The former works by applying a mathematical function. It is easy to build and run it and training is not required. Methods such as Sum, Product, Majority Vote, Max, Min, and Median are typical fixed combining methods. The latter needs training data to learn the prediction model, in which stacking is the most frequently used [14, 35]. In recent years, stacked ensembles have been investigated and used in many application areas [7, 8]. According to the styles of prediction by the base level classifiers, we may divide stacking-based ensemble approaches into two types: label-based stacking and Probability Distributions (PD)-based stacking. The standard stacking algorithm introduced by Wolpert is a label-based stacking that uses the predicted class labels by the base-level classifiers as the meta-level attributes for classification tasks [29]. Other methods such as Troika [20] and some evolutionary and swarm intelligence algorithms based stacking [15, 24] are PD-based stacking approaches. For the stacking approaches with high level learning, many researchers find that as input attributes, probability distributions perform better than class labels [18]. Therefore, in the following we review several state-of-the-art PD-based stacking approaches.
Ting and Witten [26] proposed stacking with MLR that used multi-response linear regression as the meta-classifier. Stacking with Multiple Linear Regression (MLR) uses a one-against-all binarization classification strategy and regression learners for each class model. Similar to stacking with MLR, Seelwald proposed StackingC with MLR. StackingC reduces the dimensionality of the meta features by a factor equal to the number of classes. It uses only the class probabilities associated with the class on which the linear regression model is built. Tang et al. [25] proposed a variant of the stacking ensemble algorithm, which extracted the paired preference information from the meta-level data and then applied the Ranking Support Vector Machine (RSVM). Nguyen et al. [31] introduced a fuzzy rule inference based method to capture the uncertainty in the outputs of the base classifiers. Xia et al. [33] proposed a credit scoring model that integrated the bagging algorithm with the stacking framework and employed the XGBoost algorithm as the meta-classifier. Zhang et al. [36] proposed a stacking random forest learning algorithm for contour detection. This stacking could be seen as a two-layer framework for contour detection. At the first layer, four sub-forests are separately trained and used as base classifiers. At the second layer, a random forest is employed as the meta-classifier for contour detection. Some evolutionary and swarm intelligence algorithms have also been used with stacking that pose the stacking configuration as an optimization problem. The genetic algorithm [15], artificial bee colony [5, 24], artificial ant colony [4], and many others can be applied. One disadvantage of these methods is their higher computational cost than other stacking approaches.
In this piece of work, we use the cross entropy [11, 37] to measure how well the stacking model matches the ideal model. For a given meta-level instance, two probability distributions are calculated. One is the list of all predicted scores which are obtained by a combination of all base classifiers’ predictions. The other is the probability distribution regarding the ground truth labels of all the classes, or the ideal distribution. Based on the two probability distributions, we can obtain the cross-entropy loss. Accordingly, we are interested in minimizing the above-mentioned cross entropy in order to get the optimal predicted scores that are closest in distance to the ideal distribution. In this paper, we present a novel stacking method Stacking-CE (Stacking with the Cross-Entropy loss function), in which an Artificial Neural Network (ANN) is used to train weights for all base classifiers. Specifically, we use the probabilities predicted by the different base classifiers for the same class as inputs to train the meta-level classifier and then calculate the total loss by the cross-entropy function over all meta-level instances. Furthermore, considering the property of lower convex function that is related to the cross entropy, we are able to apply the stochastic gradient descent to achieve loss minimization.
Compared with some other stacking methods such as stacking with multi-response linear regression [26], StackingC [23], and stacking with multi-response model trees [6], our method treats meta instances in a different way. In those stacking methods, each meta instance is divided into a set of sub-instances, and a separate model is applied to each of them for a specific class. One possible drawback of such a treatment is the lack of connection among those models. The proposed method of ours treats each meta instance as a whole with one optimization model. Our experiments demonstrate that it is more effective than those methods mentioned.
The remainder of this paper is organized as follows: in Section 2, we present the classification process of the proposed stacking method. Section 3 presents the settings of the experiment for evaluating the performance of the proposed method. The experimental results are presented and discussed in Section 4. Finally, the conclusion is given in Section 5.
Proposed stacked ensemble algorithm
All the symbols and their meanings used in this paper are summarized in Table 1.
Symbols and their meanings used in this paper
Symbols and their meanings used in this paper
A meta-level classifier works in the follow way: each instance is first put through all the base-level classifiers. Suppose there are N base classifiers and L classes. A meta instance as a matrix can be generated from those N base classifiers. See Fig. 1 for an example of such a matrix.

Example of a classified instance represented as a matrix of probability distribution
In Fig. 1, there are three base classifiers (C1,C2,C3) and four class labels (l1, l2, l3, l4). The value in line j and column k is the probability of the instance belonging to class l j predicted by base classifier C k .
Next, we extend the above matrix with a score list calculated by a specific function such as the sum function (the column entitled S) and the label judgment list (the column entitled R) as Fig. 2 shows.

An extended matrix with scores calculated by the sum function and label judgments
In Fig. 2, R = (r1, r2, ... , r4) is a vector of label judgments. r
j
=1 if l
j
(1 ≤ j ≤ 4) is a true label and 0 otherwise. S = (score1, score2, ... , score4) is a score vector. The j-th element of S can be calculated by
where s
kj
is the probability predicted by the k-th base classifier for the j-th class, w
k
is the weight for the k-th base classifier, and score
j
is the score calculated for class label l
j
by combining all base classifiers’ predictions. More specifically, f is defined as a logistic function, then Equation 1 becomes
Finally, the calculated scores by Equation 2 are fed into the softmax function to obtain the probability distribution of all the classes. The final classification can be decided by comparing the predicted probabilities of all the labels.
Fig. 3 shows the details of the meta classifier Stacking-CE, in which three base classifiers and two class labels are used for illustration. For any instance, a group of features U = (u1, u2, ... , u V ) are chosen as input to base classifiers. Their prediction scores are passed to the meta classifier. As output we obtain Q = (q1, q2), which are the predicted probability scores for the two classes.

The structure of the Stacking-CE with 3 base classifiers and 2 class labels
In Stacking-CE, we first calculate two different probability distributions P = (p1, p2, ... , p
L
) and Q = (q1, q2, ... , q
L
) by Equation 3 and Equation 4, respectively
Next, the divergence between two probability distributions can be calculated by
Finally, for a given group of instances as the training data set, we use a single-layer artificial neural network (Refer to Fig. 3) to learn the ensemble model by minimizing the cross entropy of them. More specifically, we use stochastic gradient descent to train the ANN. Its weight gradient is defined as
Because
We have
The weight w k (0 ≤ j ≤ N) can be updated by
Equation 7 and Equation 8 can be used to train the optimal weights W = (w0, w1, ... , w N ).
Input:
M: The meta-level training data set
f: ANN algorithm with the initial weights W; each element in W is assigned a random number between 0 and 1
T: Iteration number
η: Learning rate
L: Number of class labels
N: Number of base classifiers
Output:
f: Output the trained ANN{
for (int i : =1 ; i< = T ; i ++){
for (int j : =1 ; j< = M.size ; j ++){
for (int k = 0 ; k< = N ; k ++){
Compute gradient ▽w k using Equation 7;
Update w k = w k - η × ▽ w k ;
}
}
}
return(f with the learned weight vector W)
}
Algorithm 1 presents the algorithm for the learning process of the artificial neural network. As input we need to provide a meta-level training data set, an initial ANN, four parameters including iteration number T, learning rate η, number of class labels L, and number of base classifiers N. The output of the algorithm is the ANN with learned weights. In this algorithm, two sub-functions
To evaluate the performance of our stacking approach, we carry out an extensive experiment. Apart from our method, three base level classifiers, six fixed combining methods and 12 ensemble learning classifiers are trained to enable comparisons. The WEKA machine learning suite
1
is used. The three base classifiers are as follows: NB: the naive Bayes algorithm J48: the C4.5 decision tree algorithm SL: the learning algorithm for building linear logistic regression models
Each of the selected classifier has different learning hypotheses (naive Bayes, decision tree, and linear logistic regression)and different inductive bias. We hope such a selection can generate diversity among the base classifiers and more favourable for ensemble.Six fixed combining methods include Sum, Product, Max, Min, Majority Vote and Median.
Thirteen ensemble learning classifiers are as follows: Nine of them are variants of stacking and four others are representative ensemble learning methods that are not stacking-based. StackingCE (S-CE): stacking with cross-entropy based meta method (proposed in this paper) StackingMLR(S-MLR): stacking with the multi-response linear regression learning algorithm [26, 35] Stacking-AdaBoost(J48)(S-AB(J48)): stacking with the AdaBoost (J48) learning algorithm [13] StackingJ48(S-J48): stacking with the decision tree learning algorithm [25] StackingC(S-C): a variant of stacking [23] Stacking-RSVM(S-RSVM): stacking with the ranking support vector machine algorithm [25] Stacking-RandomForest(S-RF): stacking with the random forest learning algorithm [36] Stacking-Bagging(J48)(S-Bag(J48)): stacking with bagging (J48) learning algorithm [38] Stacking-XGBoost(S-XGB): stacking with the XGBoost learning algorithm [33] SelectBest(Sel-Best): the ensemble algorithm that selects the best from a group of candidates [28] Bagging: a typical decision tree-based bagging [34] RandomForest(RF): the classifier constructing a forest of random trees [2] AdaBoost-SVM (AB-SVM): boosting the SVM classifier using the AdaboostM1 method [32]
22 real-world data sets are used in this experiment. All of them are downloaded from the UCI Machine Learning Repository 2 . The statistics of these data sets are listed in Table 2.
Details of the datasets used in the experiments which are taken from the UCI Machine Learning Repository References
Details of the datasets used in the experiments which are taken from the UCI Machine Learning Repository References
For our method Stacking-CE, two parameters need to be set: learning rate η and iteration number T. In order to set a reasonable value for parameters η and T, we tried Soybean. It has 19 classes and 683 imbalanced instances. 30% of instances were randomly selected and different combinations of η and T were tried. Fig. 4 shows the loss curves with different number of iterations on Soybean. In Fig. 4, all curves with different learning rates (0.005, 0.015, 0.025, 0.035, and 0.045) decrease rapidly at the beginning and then slow down when iteration number increases. The curve with a learning rate of 0.025 looks a little better than the others. When the iteration number is larger than 150, the curve decreases very slowly. Therefore, it looks that 0.025 as the learning rate and 150 as the iteration number is a good option. The similar pattern is observed on four other data sets Breast-w, Ecoli, Autos, and Segment. Among them, Segment is a balanced dataset with seven classes and 2310 instances. Thus the same setting is used on other data sets for the experiments. Although this setting is good, it might not be the optimum for all the data sets.

Average loss curves varying with iteration numbers on Soybean
Four metrics are used to evaluate the performance of the methods involved. They are classification accuracy (Acc), win-loss ratio of two methods on Acc, relative improvement ratio of one method over the other on Acc, and F1. F1 is a measure that considers both precision and recall.
In our experiments, we use five-fold stratified cross validation, which is a form of cross validation and in each fold we let the data inside it preserves the class distributions as close as possible to the original whole data set. In order to increase the reliability of the experimental results, we repeat the above process five times and the reported performance are the average of all the instances. Furthermore, we carry out the Friedman test and the Wilcoxon signed-rank test to compare our method with the others. Bonferroni adjustment is also considered for the Wilcoxon signed-rank test.
The experiments were conducted on 22 real-world data sets and 21 methods were involved as baseline. We divide them into three groups and present their classification accuracy in Tables 3, 4 and 5, respectively. In Table 3, we have nine stacking-based methods including ours, Table 4 presents the results of six popular fixed combining methods, while Table 5 presents the results of three base classifiers and four popular ensemble learning algorithms that are not stacking-based. For convenience, the results of our stacking model is presented in all three tables.
Classification Accuracy (%) of the nine stacking-based methods on 22 real-world data sets. The highest accuracy among all the methods is marked in boldface. Standard deviation (%) of five runs of Stacking-CE’s accuracy is presented in parentheses
Classification Accuracy (%) of the nine stacking-based methods on 22 real-world data sets. The highest accuracy among all the methods is marked in boldface. Standard deviation (%) of five runs of Stacking-CE’s accuracy is presented in parentheses
Classification Accuracy (%) of six fixed combining methods and the presented stacking approach on 22 real-world data sets. The highest accuracy among all the methods is marked in boldface
Classification Accuracy (%) of three base classifiers and five ensemble methods (including the presented stacking approach) on 22 real-world data sets. The highest accuracy among all the methods is marked in boldface
From Table 3, we can see that Stacking-CE achieves the best classification accuracy on 17 out of 22 data sets, while the figures for Stacking-RSVM, StackingC, Stacking-MLR, Stacking-XGBoost, Stacking-RandomForest, Stacking-Bagging(J48), StackingJ48 and Stacking-AdaBoost(J48) are 5, 3, 2, 2, 1, 1,0, and 0, respectively. For Stacking-CE, standard deviation of five runs’ accuracies is also shown in Table 3. We can see that in all the cases standard deviation is small. It demonstrates that the proposed method is robust. Comparing the results of Stacking-CE with six popular fixed combining methods in Table 4, we can see that Stacking-CE outperforms all of them. Stacking-CE achieves the best classification accuracy on 17 out of 22 data sets, while the figures for Sum, Product, Max, Min, Majority Vote and Median are 2, 2, 1, 1, 1 and 0, respectively.
The results in Table 5 show the performances of three base classifiers and four popular non-stacking based meta-classifiers. Comparing our method with three base classifiers, we find that it outperforms all of them in 19 data sets and it is one of the best performers in two remaining data sets (autos and kr-vs-kp). Comparing our method with four other meta-classifiers, we find that Stacking-CE achieves the best classification accuracy on 13 out of 22 data sets. While the figures for AdaBoost(SVM), RandomForest, S-SelectBest and Bagging are 4, 4, 2 and 0, respectively.
In order to have a clear view of all the methods involved, we make a head-to-head comparison for each pair of them. The results are shown in Tables 6–8, for each of the three groups, respectively. In these three tables, the improvement percentage and win-loss ratio of two given methods are shown. For example, in Table 6, we may see that “1.5 19/03” for the item in row headed with “S-CE” and in column headed with S-RF. It means that S-CE is 1.5% more effective than S-RF on average and the win-loss ratio between them is 19/03 in all 22 data sets.
The improvement percentage of one method over the other method and the win-loss ratio between the two methods for all the methods listed in Table 3
The improvement percentage of one method over the other and the win-loss ratio between the two methods with the results from Table 4
The improvement percentage of one method over the other and the win-loss ratio between the two methods for all the methods listed in Table 5
From Table 6, we can see that our method is the best, S-RSVM is in the second place, while S-AB(J48) is the worst among all the stacking-based ensemble methods. Our method has an improvement rate of 0.4% over the second best method S-RSVM and the win-loss ratio between them is 17/1 (plus 4 ties). For other stacking methods, we can observe that S-MLR, StackingC and S-RSVM outperform S-AB(J48), S-XGB, S-Bag(J48) and S-RF. StackingC slightly outperforms S-MLR with an improvement rate of 0.2%. S-RSVM and Stacking-C are very close in performance.
From Table 7, we can see that our method is the best, Sum is in the second place and Median is the worst. Our method clearly outperforms all fixed combining methods. Compared with Sum, our method has an improvement rate of 1.3% over Sum and the win-loss ratio between them is 19/3. We can observe that Majority outperforms Product, Max and Min. Max and Min are very close in performance and both of them outperform Median.
From Table 8, we can see that our method is the best, SelectBest is in the second place, and AdaBoost(SVM) is the worst among all the methods in this group. Comparing our method with the second best SelectBest in this group, our method has an improvement rate of 1.1% over SelectBest and the win-loss ratio between them is 20/1. SelectBest outperforms three other ensemble algorithms Bagging, Randomforest and AdaBoost(SVM), while Randomforest outperforms Bagging with an improvement rate of 3.1%.
We try to find if the difference between our method and the others is significant or not. The Friedman test is carried out and the result shows that as a whole the proposed method is different from the others significantly at a confidence level of 95%. Furthermore, we carry out the Wilcoxon signed-rank test to compare our method with every other method separately. Table 9 presents the results. It shows that Stacking-CE is better than all other methods involved at a significance level of 95%. Considering there are 21 other methods involved at the same time, Bonferroni adjustment may be applied to the results of the Wilcoxon signed-rank test. Rather than setting α = 0.05, we let α B =0.05/21=0.0023, then the same conclusion for 16 of them, while the difference between our method and five methods including S-RF,S-Bag(J48),S-XGB, RF and AB(SVM) are not significant.
Wilcoxon signed-rank test based on accuracy; comparing Stacking-CE and all other 21 methods with positive ranks (R+), negative ranks (R-), and p-values. Superscipt ’b’ indicates a significant difference at the level of 95% with Bonferroni adjustment
Finally, we try to find how these methods perform on inbalanced data sets, in which the instances are not evenly distributed across classes. The imbalance ratio (IR) [17] of a data set is defined as Max_num/Min_num, where Max_num and Min
_ num stand for the maximum and minimum number of instances belong to one class in the data set, respectively. In all 22 data sets, we choose top six unbalanced: Autos, Ecoli, Hypothyroid, Lymph, Nursery, and Soybean. Fig. 6 shows the average F1 scores of 13 ensemble methods on them. Stacking-CE performs the best in five out of six data sets. This result shows that stacking-CE is able to deal with imbalanced data sets with good performance.

Average rank of all the methods involved in 22 data sets. The best ranking algorithm is at the left-most side of the diagram.

Performance of 13 ensemble methods on six imbalanced data sets measured by F1
In this paper, we have explored how to use stacking to improve ensemble performance. A variant of stacking based approach has been investigated. It trains weights for the combination of base classifiers by using an ANN with the cross entropy of predicted and real probability distributions as its loss function. Moreover, we have applied stochastic gradient descent to minimize the total loss. Experiments are conducted with 22 real-world data sets to evaluate the performance of our method and a large number of other ensemble methods as baseline. Our experiments show that on average, the proposed stacking variant performs better than the other methods. Therefore, the proposed method is very competitive for such tasks.
As our future work, there are two possible extensions to this method. Firstly, we have used a simple ANN to train the weights for all base classifiers. Such a meta-level classifier treats the instances of different classes equally. A more sophisticated multi-layer ANN, should be able to distinguish instances of different classes and treat them differently. Secondly, we have used predicted score from all base classifiers as input. Such scores may not convey all the information that the original features possess. If using some useful original features besides the predictions from base classifiers, it is possible to obtain more effective ensemble classifiers.
