Abstract
Binary Relevance (BR) is a simple and direct approach to the Multi-Label Classification (MLC). It decomposes the multi-label problem into several binary problems, one per label. It uses an algorithm of traditional supervised classification in order to solve these binary problems. On the other hand, Credal C4.5 (CC4.5) is a modification of the classical C4.5. CC4.5 estimates the probability of the class variable by using imprecise probabilities. In the literature, this new classification algorithm has obtained better results than C4.5 when both have been applied on datasets with class noise. In MLC, since there are not just a class, but multiple labels are disposed, it is more probable that there is intrinsic noise than in traditional classification. From the previous reasons, in this work it is studied the performance of BR using Credal C4.5 as base classifier versus BR with C4.5. It is carried out an experimental study with several muti-label datasets and a considerable number of measures for MLC. This study shows that the performance of BR is improved when it uses CC4.5 as base classifier versus BR with C4.5. In consequence, it is probably suitable to apply imprecise probabilities in Decision Trees within the MLC field too.
Introduction
The Multi-Label Classification (MLC) is a Machine Learning task that aims to predict the set of labels which are associated with a certain instance, unlike supervised traditional classification, where each example belongs to just a single class. This paradigm is suitable to be applied to a considerable number of domains, such as text categorization [16, 22], image recognition [17] or biology [6, 7]. In those fields it has sense to consider that one instance belongs to several labels, instead of assuming that just one class is associated with an instance.
There are a large number of approaches to the MLC task. A summary of most of them can be found in [28]. Basically, there are two types of algorithms for MLC. On the one hand, the problem transformation methods convert the problem of MLC into other well-known classification problems. On the other hand, the algorithm adaptation methods aim to adapt the supervised traditional classification to the case of MLC.
One of the most direct algorithms within the problem transformation methods is Binary Relevance (BR) [8]. It decomposes the problem of MLC into q binary problems, being q the number of possible labels. This assumes that all the labels are independent and some of the binary problems that are generated might be unbalanced, since it is possible that very few instances belong to a certain label. Nevertheless, in spite of the previous points, this method has shown a good performance in practise. An example about this fact can be found in [13], where the main approaches to MLC are compared with an exhaustive experimentation. In that research, BR is one of the methods that provides the best results. Moreover, this algorithm of MLC is very simple.
In BR, an algorithm of traditional supervised classification is employed in order to approach the different binary problems. One of the most known algorithms of single-label classification is the C4.5 algorithm [11, 21]. In fact, it is one of the best algorithms based on Decision Trees (DT). It is known that DTs are simple, transparent and interpretable models.
In the literature, new algorithms of decision trees based on imprecise probabilities have been proposed. They are known as Credal Decision Trees (CDTs) [5]. Basically, the building process of this kind of trees uses uncertainty measures on credal sets, i.e closed and convex sets of probability distributions. In [2–4] CDTs have been shown a good performance specially when there is class noise. Within CDTs a new version of the C4.5 algorithm based on imprecise probabilities has been proposed, called Credal C4.5 (CC4.5). In [14] and [15] it is shown that both C4.5 and CC4.5 methods provide statistically equivalent results without class noise and CC4.5 is significantly better than C4.5 when there is class noise in the data.
Usually, real datasets can contain class noise although they are not manipulated, which is called intrinsic noise [18]. Furthermore, in MLC we do not have only a class but multiple labels. Hence, in these cases the intrinsic noise tends to be higher. The reason is quite simple: It is clearly less probable that one example has a wrong class value when there is only one class, as in traditional supervised classification than one instance has some incorrect label when there are multiple tags, as occurs in MLC.
Therefore, CC4.5 is a model that has a good behavior when the datasets have class noise and, on the other hand, it is more probable that MLC problems have more intrinsic noise than datasets with just one class variable. For the reasons explained, it is interesting to study the usage of CC4.5 as base learner in BR instead of employing C4.5. Hence, the performance of BR when CC4.5 is used as base classifier is studied in this research, checking if it supposes an improvement versus the usage of C4.5.
A complete experimental research with different datasets and a large number of measures for MLC is carried out in this work. This experimental study shows that, in general, the BR method provides better results using CC4.5 as base classifier than employing C4.5. The rest of this paper is structured as follows: In Section 2, the paradigm of MLC is introduced. The BR method is explained in Section 3. The CC4.5 is introduced in Section 4. In Section 5, the BR with CC4.5 is exposed, explaining its differences versus employing C4.5 as base learner. The experimental research carried out is detailed in Section 6. Finally, Section 7 is devoted to conclusions and future works.
Multi-label classification
Given an instance, the Multi-Label Classification (MLC) tries to predict the set of labels to which belong that instance. For this purpose, in the same way that traditional classification, it is based on a model learned from a training set.
Formally, the task of MLC starts from the following issues: A d-dimensional space attribute A set of label of size q, A training set A quality criterion
When an example
The aim is to find a function h :
Alternatively, in many cases, the goal is to find a function
This function f gives rise to a ranking function for an instance
In addition, the set of labels predicted for an instance
As explained before, the Binary Relevance method (BR) [8] is a direct approach of MLC. BR decomposes the MLC problem into several binary problems, one per class. In order to solve the binary problem corresponding to the label l j , 1 ≤ j ≤ q the BR algorithm first constructs the following binary set:
As it can be observed, to solve the corresponding binary problem, the training set is constructed with the same instances as the original training set
After that, a binary classifier h
j
:
Likewise, let
The BR method is summarized in the pseudo-code given in Fig. 1.

Schedule of BR algorithm.
It is appropriate to remark that the BR approach has two important drawbacks. The first of them is that it completely ignores the possible dependence relations among the labels, which are often frequent. A second disadvantage is the class-imbalance problem which may be suffered in some binary problems due to the small number of instances associated with some labels. However, despite the previous points, in the literature the BR algorithm has shown to provide satisfactory results, as in [13]. Moreover, BR method is the most simple direct approach for MLC. Another advantage of BR is that the implementation of the training phase can be done in parallel, building the classifiers h
j
, 1 ≤ j ≤ q at the same time. Likewise, for an unseen instance
Let C be the class variable and {c1, …, c
k
} be its possible values. Let X be a variable whose possible values are {x1, …, x
n
} and let
The Credal C4.5 (CC4.5) algorithm is based on Decision Trees (DT). The difference between this method and the known C4.5 algorithm [19] is the split criterion employed. The criterion used by C4.5 is the Info-Gain Ratio (IGR). It is based on the Information-Gain (IG), defined as follows:
being H the known Shannon Entropy [23]:
The IGR is obtained from IG by normalizing by the attribute entropy:
The Imprecise Dirichlet Model (IDM) [25], which is a formal theory of imprecise probabilities, estimates that the probability for each possible value of the class variable c j , 1 ≤ j ≤ k is within the interval:
s is a hyperparameter but [25] does not give a recommendation for its value. However, in that work two values are suggested: s = 1 and s = 2. In this research the value s = 1 is used due to the recommendation of Walley and because the procedure to obtain the information measures employed for the special case of the IDM reaches its lowest computational cost for s ≤ 1 [1].
It is easy to check that if the value of s is higher then the interval is wider. Hence, how small is the value of s indicates how reliable is considered the dataset in order to estimate probabilities.
It is also interesting to observe that the lower is the value of N i.e, the smaller is the dataset, the wider the interval is. Therefore, this model considers that the dataset is more reliable if it is bigger and vice-versa.
These probability intervals give rise to a closed and convex set of probability distributions (also called credal set) on the variable C [1]. This set is defined as follows:
being
The procedure to build CC4.5 uses the maximum of entropy function in the credal set defined in (8). It is defined in the following way:
This function is a well established total uncertainty measure on credal sets, according to [12].
The procedure to calculate the probability distribution that maximizes H* is not explained in this work, but it is detailed in [14].
Based on the previous measure, the Imprecise Info-Gain (IIG) is defined in [5] as follows:
Once reached this point, it is possible to define the criterion that is used in the CC4.5 algorithm, which is called Imprecise Info Gain Ratio (IIGR) and it is defined as follows:
In this Section, the method used in this research and its main properties are explained.
We recall that in this work the Binary Relevance method is used with CC4.5 as base classifier. Thus, for each label l
j
∈
It should be said that in this work CC4.5 and C4.5 are compared as base classifiers for the BR. For this reason, the advantages of CC4.5 versus C4.5 are explained in Sections 5.1 and 5.2.
Differences between Credal-C4.5 and C4.5
When s = 0, C4.5 and CC4.5 are equal. When s > 0, the higher is the size of the training set, the smaller is the credal set. Thus, if N is very large (it usually happens in upper levels of the tree) IIGR and IGR give similar values and in these cases CC4.5 and C4.5 can be considered equivalent. However, when the size of the training set is small, usually in lower levels of the tree, the credal set corresponding to it, contains many probability distributions that can be considerably distinct from the one used to calculate H. Thus, IIGR and IGR may be different because H* and H can be very distinct. Therefore, CC4.5 and C4.5 have similar behavior in upper levels of the tree but they are different in lower levels.
Furthermore, it is remarkable that the value of
In [15], it is studied and analyzed with examples that H* function, defined in (10) is more robust to noise than the Shannon’s entropy H. Hence, it can be said that CC4.5 is more robust to noise than C4.5. This fact was corroborated in that work with an extensive experimental research. Therefore, it could be thought that BR is more robust to noise using CC4.5 as base learner than employing C4.5.
Noise in MLC
It is known that real world datasets have usually intrinsic noise, i.e, there is often errors in the data although there is no additional noise introduced. As explained in the introduction, in MLC there might be more intrinsic noise than in single-class classification. In this Section this fact is argued with more detail.
In traditional supervised classification, let p be the probability that an instance has a wrong value of the class variable. Let suppose that in MLC, where there are q labels, the probability that an instance has a wrong value of a certain label, i.e, that the example has assigned erroneously a label or has not assigned a label to which actually belongs, is also p (we are supposing the same probability error for all labels).
Whereas in single-label classification the probability that the example has the correct class value is
Description of the experiments
The MLC algorithm considered in the experiments carried out in this work is the BR method, explained in Section 3. Two base classifiers are used for this algorithm: C4.5 and CC4.5, exposed in Section 4. The implementation given in the MULAN library [24] for the BR algorithm with its default configuration has been employed. The implementation of C4.5 algorithm provided by Weka [27] was used with its default configuration. We added the necessary methods to build CC4.5 trees with the same experimental conditions. For the CC4.5 algorithm the value of s is fixed to s = 1 (see Section 4).
A set of 10 datasets from different domains are used in these experiments, which can be downloaded from the official website of MULAN. These datasets have been employed in an exhaustive research with most of the MLC algorithms carried out in [13]. In that work, the dataset bookmarks is also employed, but it was not considered in this research due to the computational cost that it requires. Table 1 shows, for each dataset, its name, the domain to which the problem belongs, the number of instances, the number of attributes, the number of labels and its label cardinality (LCARD), i.e the average number of labels per instance [20]. Formally:
Description of the datasets used in this experimental research
N is the total number of instances, N_CA is the number of continuous attributes, N_DA is the number of discrete attributes, N_L its number of labels and L_C is the label cardinality
For each one of the previous datasets and for both of the base classifiers employed in these experiments, a 5-fold cross-validation has been executed. It is known that the process of evaluation in MLC is not as trivial as in the case of traditional classification, since there are multiple labels and not just one class variable. For this reason, a set of 14 metrics have been used in order to compare both base classifiers in BR. All of these measures are employed in the experiments carried out by [13] and they are detailed in that research. Basically, these metrics are divided into two groups [28]:
Once all the measures have been computed, for each one of them, a Wilcoxon test [26] with a level of significance of α = 0.05 is used to compare results. It has been done following the recommendations of [9, 10] and taking into account that in this case only two algorithms will be compared.
In this Section, the outcomes corresponding to the Wilcoxon tests are presented. The complete results can be found in the Appendix A. Table 2 shows, for each measure, which algorithm has better performance and if the difference is significant.
Summary of the results of the Wilcoxon tests obtained for each measure and for each method
Summary of the results of the Wilcoxon tests obtained for each measure and for each method
•indicates that the algorithm in the column has a significatively better performance than the method corresponding to the other column and ∘ means a better performance but without significative differences
Through this table we discuss the results. We firstly comment the results of the tests corresponding to example-based measures. Secondly, we analyze the results of the test associated with label-based metrics. After that, the results obtained for the ranking-based metrics are discussed. Finally, the results are summarized.
For Example-based metrics, the higher the value is, the better is the performance, except for hamming loss.
Within the example based-measures, we remark that in Hamming Loss the BR obtains significatively better results with CC4.5 as base learner than with C4.5. With Recall the results are better with C4.5, being the differences significant too. The results obtained for Accuracy, Precision and Subset Accuracy are better in the case of CC4.5, whereas C4.5 gets better results for F1. However, in these cases the differences are not significant.
The previous point implies that in BR with CC4.5 there is less difference between the sets of labels predicted and relevant labels for an instance than with C4.5 because of the results obtained in Hamming Loss. In addition, the better results obtained for Subset Accuracy by BR with CC4.5 mean that it gets better performance in predicting the entire set of relevant labels for the instances (and only the relevant labels). The improvements in the previous measures using CC4.5 as base classifier of BR instead of C4.5 reveal that in MLC there might be a notable intrinsic noise. Therefore, the usage of imprecise probabilities supposes an improvement on the predicted labels sets. Although with C4.5 as base learner BR predicts more relevant labels than with CC4.5 due to the better results obtained in Recall, CC4.5 predicts less irrelevant labels, because of its better performance in Precision. The reason is the following: As noted above, in BR many binary problems are unbalanced. Furthermore, CC4.5 stops branching the tree before than C4.5, overfitting less the data. Consequently, in many cases, CC4.5 does not reach parts of the tree where instances of the minority class (in our case, instances for which a label is relevant) are predicted correctly. However, in these parts of the tree the noise has a very negative influence.
In summary, for one example-based measure BR with CC4.5 is significantly better than BR with C4.5 and, for another measure of this kind the results are significantly better for BR with C4.5. In three of four of the remaining measures the BR with CC4.5 is better than C4.5, although the differences are not significant. It means that, for example-based measures both base classifiers for BR may be almost equivalent, having CC4.5 a slightly better performance.
Label-based measures
For all Label-based measures if the value is higher the performance is better.
For both Macro and Micro Precision the performance of BR is significantly better with CC4.5 as base classifier. The results obtained for both Micro and Macro Recall are significantly better for C4.5. BR obtains better performance with CC4.5 for Micro F1. For Macro F1, the results are better using C4.5. In these last two cases the differences are not significant.
For the previous reasons, BR with C4.5 tends to predict more relevant labels than with CC4.5. On the other hand, BR usually predicts less irrelevant labels as relevant using CC4.5 as base classifier than with C4.5. The reason has been explained above: In BR many binary problems are unbalanced and CC4.5 stops branching the tree before than C4.5.
Hence, it could be said that CC4.5 and C4.5 obtains an equivalent performance for label-based measures when they are used as base-classifiers for BR.
Ranking-based measures
Recall that for Ranking Loss, One Error and coverage the performance is better if the value of these measures is lower. The opposite happens with average precision.
It is easy to deduct that, for Ranking-based measures, BR obtains much better results with CC4.5 as base learner than using C4.5. In fact, it can be observed that for all ranking measures employed in this research the results obtained by CC4.5 are significantly better than those obtained by C4.5.
Therefore, it is obvious that BR obtains a better performance with CC4.5 than with C4.5 for the real-valuated and the corresponding ranking functions, i.e, CC4.5 ranks better the labels when it is used as base classifier of BR versus C4.5. This happens because the intrinsic noise in MLC has a strong influence in the label ranking given by a MLC algorithm and CC4.5 is more robust to noise than C4.5.
Summary of the results
Generally, BR with C4.5 predicts more relevant labels than with CC4.5 due to the better performance obtained with Recall for both example and label-based metrics. On the other hand CC4.5 predicts less irrelevant labels when it is employed as base classifier of BR. This fact can be deducted from the better results obtained in Precision by BR with CC4.5. The performance obtained in Hamming Loss implies that the label set predicted by BR with CC4.5 as base classifier differs less than with C4.5 from the relevant labels set. Moreover, BR ranks better the labels when it uses CC4.5 as base classifier than with C4.5 because the results obtained for BR are better in the first case for all Ranking-based metrics. The reasons of the previous facts have been explained above: in MLC there might be more intrinsic noise than in traditional classification. Thus, the usage of imprecise probabilities in C4.5 when it is used as base classifier of BR is suitable. Nevertheless, with CC4.5 as base learner BR has more difficulty in predicting relevant labels correctly because in BR many binary problems are not balanced and CC4.5 stops branching the tree before than C4.5. Consequently, with CC4.5 it is more difficult to reach parts of the tree where the labels are predicted as relevant (minority class).
The results obtained for 10 of the 16 measures offer significant differences between the two base classifiers for BR considered in this work. The results of 7 of them are favorable to CC4.5 while for 3 are better for C4.5. That is, 7 of 10 significant comparison results are favorable for BR using CC4.5 as base classifier.
For the other 6 measures the comparison results do not present significant differences. However, for 4 of them the results are better for CC4.5, whereas only for 2 of these metrics the results are favorable to C4.5. This is an additional favorable comparison for BR using CC4.5 instead of C4.5. These results are summarized in Table 3.
Summary of the results
Summary of the results
One row indicates, for each algorithm, the number of metrics with the algorithm of the column is significatively better. The second row, shows, for each algorithm, the number of measures for which the method is better but the differences are not significant.
Moreover, for both example-based and label-based measures the results obtained for both algorithms are equivalent, even thought in the first group the results are slightly better for CC4.5, being the difference very little remarkable. Nevertheless, the results obtained for ranking based metrics the results obtained using CC4.5 as base learner of BR are clearly better than the obtained by employing C4.5. It shows that the label ranking given by a MLC algorithm is strongly affected by the presence of intrinsic noise in the data.
Therefore, it could be said that CC4.5 supposes an improvement versus classical C4.5 when it is employed as base classifier for BR.
In this work, we have started from the Binary Relevance algorithm, a very simple and direct approach to Multi-Label Classification that has been proved to obtain good results in the practice. Recall that this method decomposes the multi-label task into several binary problems, one per tag and it uses a supervised traditional classification algorithm in order to solve each binary problem.
In this research it has been considered the Credal C4.5 algorithm as base classifier of Binary Relevance, which has been shown to be an improvement over the classical C4.5 in the presence of noise, and we should keep in mind that in MLC it is more probable that there is intrinsic noise in the data than in single-class classification. In concrete, the performance of the BR using CC4.5 as base learner has been studied versus BR using C4.5.
An experimental study carried out with a set of datasets for MLC has shown that the results obtained with CC4.5 are slightly better for example-based measures and significantly better for ranking measures. In general, there are more metrics for which BR with CC4.5 is better than C4.5. The previous points allow us to conclude that BR employing CC4.5 as base learner is an improvement over C4.5.
The contribution of this work allows us to think that the usage of imprecise probabilities in Decision Trees also provides good results in MLC. For future research, it is interesting to study how this idea works in other algorithms of MLC. For instance, it is worth to design a new adaptation of Decision Trees to MLC using the Imprecise-Dirichlet Model, giving rise to Multi-Label Decision Trees. It is expected that these new future adaptations of Decision Trees to MLC will provide better results than the existing adaptations.
Footnotes
Acknowledgments
This work has been supported by the Spanish “Ministerio de Economía y Competitividad” and by “Fondo Europeo de Desarrollo Regional” (FEDER) under Project TEC2015-69496-R.
Appendix A. Complete results from the experimental study
In this Section, the complete results from the experiments carried out in this research are exposed. In particular, we present the results for each dataset used in the experiments and for each measure across the 5-CV. The results corresponding to example-based measures are presented (Section 7.0.1), followed by the results of label-based measures (Section 7.0.2) and the results associated with ranking-based metrics (Section 7.0.3). In the tables with the results, for each dataset, the better result obtained by the algorithms is marked with bold fonts.
