Using Credal-C4.5 with Binary Relevance for Multi-Label Classification

Abstract

Binary Relevance (BR) is a simple and direct approach to the Multi-Label Classification (MLC). It decomposes the multi-label problem into several binary problems, one per label. It uses an algorithm of traditional supervised classification in order to solve these binary problems. On the other hand, Credal C4.5 (CC4.5) is a modification of the classical C4.5. CC4.5 estimates the probability of the class variable by using imprecise probabilities. In the literature, this new classification algorithm has obtained better results than C4.5 when both have been applied on datasets with class noise. In MLC, since there are not just a class, but multiple labels are disposed, it is more probable that there is intrinsic noise than in traditional classification. From the previous reasons, in this work it is studied the performance of BR using Credal C4.5 as base classifier versus BR with C4.5. It is carried out an experimental study with several muti-label datasets and a considerable number of measures for MLC. This study shows that the performance of BR is improved when it uses CC4.5 as base classifier versus BR with C4.5. In consequence, it is probably suitable to apply imprecise probabilities in Decision Trees within the MLC field too.

Keywords

Multi-label classification Binary Relevance Credal C4.5 C4.5 imprecise probabilities

1 Introduction

The Multi-Label Classification (MLC) is a Machine Learning task that aims to predict the set of labels which are associated with a certain instance, unlike supervised traditional classification, where each example belongs to just a single class. This paradigm is suitable to be applied to a considerable number of domains, such as text categorization [16, 22], image recognition [17] or biology [6, 7]. In those fields it has sense to consider that one instance belongs to several labels, instead of assuming that just one class is associated with an instance.

There are a large number of approaches to the MLC task. A summary of most of them can be found in [28]. Basically, there are two types of algorithms for MLC. On the one hand, the problem transformation methods convert the problem of MLC into other well-known classification problems. On the other hand, the algorithm adaptation methods aim to adapt the supervised traditional classification to the case of MLC.

One of the most direct algorithms within the problem transformation methods is Binary Relevance (BR) [8]. It decomposes the problem of MLC into q binary problems, being q the number of possible labels. This assumes that all the labels are independent and some of the binary problems that are generated might be unbalanced, since it is possible that very few instances belong to a certain label. Nevertheless, in spite of the previous points, this method has shown a good performance in practise. An example about this fact can be found in [13], where the main approaches to MLC are compared with an exhaustive experimentation. In that research, BR is one of the methods that provides the best results. Moreover, this algorithm of MLC is very simple.

In BR, an algorithm of traditional supervised classification is employed in order to approach the different binary problems. One of the most known algorithms of single-label classification is the C4.5 algorithm [11 , 21]. In fact, it is one of the best algorithms based on Decision Trees (DT). It is known that DTs are simple, transparent and interpretable models.

In the literature, new algorithms of decision trees based on imprecise probabilities have been proposed. They are known as Credal Decision Trees (CDTs) [5]. Basically, the building process of this kind of trees uses uncertainty measures on credal sets, i.e closed and convex sets of probability distributions. In [2 –4] CDTs have been shown a good performance specially when there is class noise. Within CDTs a new version of the C4.5 algorithm based on imprecise probabilities has been proposed, called Credal C4.5 (CC4.5). In [14] and [15] it is shown that both C4.5 and CC4.5 methods provide statistically equivalent results without class noise and CC4.5 is significantly better than C4.5 when there is class noise in the data.

Usually, real datasets can contain class noise although they are not manipulated, which is called intrinsic noise [18]. Furthermore, in MLC we do not have only a class but multiple labels. Hence, in these cases the intrinsic noise tends to be higher. The reason is quite simple: It is clearly less probable that one example has a wrong class value when there is only one class, as in traditional supervised classification than one instance has some incorrect label when there are multiple tags, as occurs in MLC.

Therefore, CC4.5 is a model that has a good behavior when the datasets have class noise and, on the other hand, it is more probable that MLC problems have more intrinsic noise than datasets with just one class variable. For the reasons explained, it is interesting to study the usage of CC4.5 as base learner in BR instead of employing C4.5. Hence, the performance of BR when CC4.5 is used as base classifier is studied in this research, checking if it supposes an improvement versus the usage of C4.5.

A complete experimental research with different datasets and a large number of measures for MLC is carried out in this work. This experimental study shows that, in general, the BR method provides better results using CC4.5 as base classifier than employing C4.5. The rest of this paper is structured as follows: In Section 2, the paradigm of MLC is introduced. The BR method is explained in Section 3. The CC4.5 is introduced in Section 4. In Section 5, the BR with CC4.5 is exposed, explaining its differences versus employing C4.5 as base learner. The experimental research carried out is detailed in Section 6. Finally, Section 7 is devoted to conclusions and future works.

2 Multi-label classification

Given an instance, the Multi-Label Classification (MLC) tries to predict the set of labels to which belong that instance. For this purpose, in the same way that traditional classification, it is based on a model learned from a training set.

Formally, the task of MLC starts from the following issues:

A d-dimensional space attribute $X = (X^{1}, \dots, X^{d}) \subseteq ℝ^{d}$ .

A set of label of size q, L = {l₁, l₂, …, l_q}, normally q > 1.

A training set $D = {(x_{i}, L_{i}) | 1 \leq i \leq m},$ , m is the number of instances. $x_{i} = (x_{i}^{1}, \dots, x_{i}^{d}) \in X$ is the vector of attributes of the i-th instance 1 ≤ i ≤ m and L_i ⊂ L is its set of labels.

A quality criterion Q. It measures the predictive capacity of a certain model, as well as its complexity.

When an example x has associated a certain label l_j ∈ L, 1 ≤ j ≤ q it is said that the label l_j is relevant for x. Otherwise the label l_j is said to be irrelevant for x.

The aim is to find a function h : X → 2^L such that h maximizes the criterion Q.

Alternatively, in many cases, the goal is to find a function $f : X \times L \to ℝ$ that maximizes the criterion Q. This real-valuated function measures, for an instance whose feature set is x ∈ X, and for a label l_i ∈ L, 1 ≤ i ≤ q, f (x, l_i) the posterior probability that an example is associated with the label l_i.

This function f gives rise to a ranking function for an instance x, rank _ f_x : L → {1, 2 …, q}, which is defined implicitly verifying that if f (x, l_i) > f (x, l_j) then rank _ f_x (l_i) < rank _ f_x (l_j), where l_i, l_j ∈ L.

In addition, the set of labels predicted for an instance x is extracted from the function f as follows: $h (x) = {l_{j} \in L ∣ f (x, l_{j}) > t (x), 1 \leq j \leq q},$ (1) being $t : X \to ℝ$ a threshold function that let us distinguish between relevant and irrelevant labels for an example. Normally, this threshold function is fixed as a constant. This value tends to be equal to 0.5.

3 Binary relevance

As explained before, the Binary Relevance method (BR) [8] is a direct approach of MLC. BR decomposes the MLC problem into several binary problems, one per class. In order to solve the binary problem corresponding to the label l_j, 1 ≤ j ≤ q the BR algorithm first constructs the following binary set:

$D_{j} = {(x_{i}, φ (L_{i}, l_{j})) ∣ 1 \leq i \leq m},$ (2) where φ (L_i, l_j) =1 if l_j ∈ L_i and 0 otherwise.

As it can be observed, to solve the corresponding binary problem, the training set is constructed with the same instances as the original training set $D$ . For each instance, the entire set of attributes is taken. Now, there is only one binary class attribute. For an instance, it contains the value 1 if the instance has the label l_j and 0 otherwise.

After that, a binary classifier h_j : X → {0, 1} is built from the training set D_j using an algorithm of traditional supervised classification. When a new instance x is wanted to be classified, BR predicts as relevant labels the ones which are predicted to be 1 by the binary classifier h_j for the instance x. The function h : X → 2^L is then defined as follows:

$h (x) = {l_{j} ∣ h_{j} (x) = 1, 1 \leq j \leq q}$ (3)

Likewise, let $f_{j} : X \to ℝ$ be the function which assigns the confidence that an instance has associated the label l_j, 1 ≤ j ≤ q (f_j is the real-valuated function associated with the binary classifier h_j). Then, the real-valuated function $f : X \times L \to ℝ$ associated with h is defined simply as f (x, l_j) = f_j (x) ∀j = 1, …, q. In other words, the confidence that an instance has associated the label l_j is the confidence that the corresponding binary classifier gives to the instance.

The BR method is summarized in the pseudo-code given in Fig. 1.

Fig. 1

Schedule of BR algorithm.

It is appropriate to remark that the BR approach has two important drawbacks. The first of them is that it completely ignores the possible dependence relations among the labels, which are often frequent. A second disadvantage is the class-imbalance problem which may be suffered in some binary problems due to the small number of instances associated with some labels. However, despite the previous points, in the literature the BR algorithm has shown to provide satisfactory results, as in [13]. Moreover, BR method is the most simple direct approach for MLC. Another advantage of BR is that the implementation of the training phase can be done in parallel, building the classifiers h_j, 1 ≤ j ≤ q at the same time. Likewise, for an unseen instance x the prediction of each h_j (x) can be also done in parallel. It clearly supposes a reduction in the computational cost.

4 Credal-C4.5

Let C be the class variable and {c₁, …, c_k} be its possible values. Let X be a variable whose possible values are {x₁, …, x_n} and let $D$ be the training set.

The Credal C4.5 (CC4.5) algorithm is based on Decision Trees (DT). The difference between this method and the known C4.5 algorithm [19] is the split criterion employed. The criterion used by C4.5 is the Info-Gain Ratio (IGR). It is based on the Information-Gain (IG), defined as follows: $IG (C, X) = H (C) - \sum_{i = 1}^{n} P (X = x_{i}) H (C | X = x_{i}),$ (4)

being H the known Shannon Entropy [23]: $H (C) = - \sum_{j = 1}^{K} P (C = c_{j}) \log (P (C = c_{j}))$ (5) where P (X = x_i) is estimated by the proportion of instances in the dataset that verify that X = x_i. The same with P (C = c_j)

The IGR is obtained from IG by normalizing by the attribute entropy: $IGR (C, X) = \frac{IG (C, X)}{H (X)}$ (6) CC4.5 uses a similar criterion to the C4.5 algorithm, but the first algorithm estimates the probabilities variables by using imprecise probabilities.

The Imprecise Dirichlet Model (IDM) [25], which is a formal theory of imprecise probabilities, estimates that the probability for each possible value of the class variable c_j, 1 ≤ j ≤ k is within the interval:

$p (c_{j}) \in [\frac{n_{c_{j}}}{N + s}, \frac{n_{c_{j}} + s}{N + s}],$ (7) where n_{c
_j} is the number of instances in $D$ that have the value c_j for the variable C, N is the total number of instances in $D$ and s is a given parameter of the model.

s is a hyperparameter but [25] does not give a recommendation for its value. However, in that work two values are suggested: s = 1 and s = 2. In this research the value s = 1 is used due to the recommendation of Walley and because the procedure to obtain the information measures employed for the special case of the IDM reaches its lowest computational cost for s ≤ 1 [1].

It is easy to check that if the value of s is higher then the interval is wider. Hence, how small is the value of s indicates how reliable is considered the dataset in order to estimate probabilities.

It is also interesting to observe that the lower is the value of N i.e, the smaller is the dataset, the wider the interval is. Therefore, this model considers that the dataset is more reliable if it is bigger and vice-versa.

These probability intervals give rise to a closed and convex set of probability distributions (also called credal set) on the variable C [1]. This set is defined as follows: $K (C) = {p ∣ p (c_{j}) \in I_{j}, \forall j = 1, \dots, k}$ (8)

being $I_{j} = [\frac{n_{c_{j}}}{N + s}, \frac{n_{c_{j}} + s}{N + s}]$ (9)

The procedure to build CC4.5 uses the maximum of entropy function in the credal set defined in (8). It is defined in the following way: $H^{*} (K (C)) = max {H (p) ∣ p \in K (C)}$ (10)

This function is a well established total uncertainty measure on credal sets, according to [12].

The procedure to calculate the probability distribution that maximizes H^* is not explained in this work, but it is detailed in [14].

Based on the previous measure, the Imprecise Info-Gain (IIG) is defined in [5] as follows: $\begin{matrix} {IIG}^{D} (C, X) = H^{*} (K^{D} (C)) - \\ \sum_{i = 1}^{n} P^{D} (X = x_{i}) H^{*} (K^{D} (C | X = x_{i})), \end{matrix}$ (11) where $P^{D} (X = X_{i})$ is the probability distribution that gives the maximum of entropy in K (X). It is estimated similarly to the distribution that gives the maximum value of entropy in (10), see [14].

Once reached this point, it is possible to define the criterion that is used in the CC4.5 algorithm, which is called Imprecise Info Gain Ratio (IIGR) and it is defined as follows: ${IIGR}^{D} (C, X) = \frac{{IIG}^{D} (C, X)}{H (X)}$ (12)

5 Binary Relevance with Credal-C4.5

In this Section, the method used in this research and its main properties are explained.

We recall that in this work the Binary Relevance method is used with CC4.5 as base classifier. Thus, for each label l_j ∈ L, 1 ≤ j ≤ q, let consider the training set defined in (2). From this training set the corresponding binary classifier h_j is built using the CC4.5 algorithm, which has been explained in Section 4. Then, for a new instance x, the set of predicted labels is the set h (x), defined in 1. As argued in Section 3, this function gives rise to a real valuated function f. For each label l_j, and for an instance x, f (x, l_j) = f_j (x), where f_j (x) is the confidence given by the CC4.5 model associated with the label l_j to the instance x.

It should be said that in this work CC4.5 and C4.5 are compared as base classifiers for the BR. For this reason, the advantages of CC4.5 versus C4.5 are explained in Sections 5.1 and 5.2.

5.1 Differences between Credal-C4.5 and C4.5

When s = 0, C4.5 and CC4.5 are equal. When s > 0, the higher is the size of the training set, the smaller is the credal set. Thus, if N is very large (it usually happens in upper levels of the tree) IIGR and IGR give similar values and in these cases CC4.5 and C4.5 can be considered equivalent. However, when the size of the training set is small, usually in lower levels of the tree, the credal set corresponding to it, contains many probability distributions that can be considerably distinct from the one used to calculate H. Thus, IIGR and IGR may be different because H^* and H can be very distinct. Therefore, CC4.5 and C4.5 have similar behavior in upper levels of the tree but they are different in lower levels.

Furthermore, it is remarkable that the value of ${IIGR}^{D} (C, X)$ can be negative for an attribute X [3], unlike the value of IGR used in C4.5 algorithm. This fact avoids that IIGR criterion selects features that worse the information of the class attribute. Thus, the branching of the tree may stop before in CC4.5 than in C4.5 algorithm. It implies that the overfitting is usually lower in CC4.5 than in C4.5.

In [15], it is studied and analyzed with examples that H^* function, defined in (10) is more robust to noise than the Shannon’s entropy H. Hence, it can be said that CC4.5 is more robust to noise than C4.5. This fact was corroborated in that work with an extensive experimental research. Therefore, it could be thought that BR is more robust to noise using CC4.5 as base learner than employing C4.5.

5.2 Noise in MLC

It is known that real world datasets have usually intrinsic noise, i.e, there is often errors in the data although there is no additional noise introduced. As explained in the introduction, in MLC there might be more intrinsic noise than in single-class classification. In this Section this fact is argued with more detail.

In traditional supervised classification, let p be the probability that an instance has a wrong value of the class variable. Let suppose that in MLC, where there are q labels, the probability that an instance has a wrong value of a certain label, i.e, that the example has assigned erroneously a label or has not assigned a label to which actually belongs, is also p (we are supposing the same probability error for all labels).

Whereas in single-label classification the probability that the example has the correct class value is $(1 - p)$ (13) in MLC, supposing that there is independence among errors in the labels, the probability that an instance has the correct value for all labels is $(1 - p)^{q}$ (14) In the previous situation, let suppose that p = 0.05, the probability that in traditional classification a certain example has the right class value is $1 - 0.05 = 0.95$ (15) If q = 10, the probability that an instance in MLC has the right value for all labels is $(0.95)^{10} = 0.6$ (16) In other words, the probability that an instance has a wrong value for some label is 0.4. Hence, although the probability of noise in one label is very small, the probability of noise in any label can be very notable, as is in this case. From the previous fact it is easy to conclude that the intrinsic noise can be considerably higher in MLC than in single-label classification. For this reason, it seems to be interesting to use a classifier more robust to noise. Summarizing, has been remarked that BR provides good results in practice and that C4.5 is one of the most known learners of traditional classification. Has been also argued that CC4.5 is more robust to noise than C4.5. In addition, according to the reason explained previously, the noise in MLC is expected to be higher than in single-label classification. Therefore, it is worth to study the usage of CC4.5 as base classifier of BR in order to solve MLC problems.

6 Experimental analysis

6.1 Description of the experiments

The MLC algorithm considered in the experiments carried out in this work is the BR method, explained in Section 3. Two base classifiers are used for this algorithm: C4.5 and CC4.5, exposed in Section 4. The implementation given in the MULAN library [24] for the BR algorithm with its default configuration has been employed. The implementation of C4.5 algorithm provided by Weka [27] was used with its default configuration. We added the necessary methods to build CC4.5 trees with the same experimental conditions. For the CC4.5 algorithm the value of s is fixed to s = 1 (see Section 4).

A set of 10 datasets from different domains are used in these experiments, which can be downloaded from the official website of MULAN. These datasets have been employed in an exhaustive research with most of the MLC algorithms carried out in [13]. In that work, the dataset bookmarks is also employed, but it was not considered in this research due to the computational cost that it requires. Table 1 shows, for each dataset, its name, the domain to which the problem belongs, the number of instances, the number of attributes, the number of labels and its label cardinality (LCARD), i.e the average number of labels per instance [20]. Formally: $LCARD (D) = \frac{1}{m} \sum_{i = 1}^{m} | L_{i} |$ (17)

Table 1

Description of the datasets used in this experimental research

Dataset	Domain	N	N_A	N_L	L_C
bibtex	Text	7395	1836	159	2.4
corel5k	Multimedia	5000	499	374	3.52
delicious	Text (Web)	16105	500	983	19.02
emotions	Multimedia	593	72	6	1.87
enron	Text	1702	1001	53	3.38
mediamill	Multimedia	43907	120	101	4.38
medical	Text	978	1449	45	1.24
scene	Multimedia	2407	294	6	1.07
tmc2007	Text	28596	49060	22	2.16
yeast	Biology	2417	103	14	4.24

N is the total number of instances, N_CA is the number of continuous attributes, N_DA is the number of discrete attributes, N_L its number of labels and L_C is the label cardinality

For each one of the previous datasets and for both of the base classifiers employed in these experiments, a 5-fold cross-validation has been executed. It is known that the process of evaluation in MLC is not as trivial as in the case of traditional classification, since there are multiple labels and not just one class variable. For this reason, a set of 14 metrics have been used in order to compare both base classifiers in BR. All of these measures are employed in the experiments carried out by [13] and they are detailed in that research. Basically, these metrics are divided into two groups [28]:

Bipartitions-based: These measures compare the predicted labels for the examples with their really relevant labels. This kind of metrics falls into two categories:

Example-based: This subgroup of measures focuses on the examples, comparing how differ the predicted labels for each example with their real labels. The example-based metrics measure, for each label, the accurate prediction for it over all the examples. Six example-based measures have been employed: Subset Accuracy, Hamming Loss, Accuracy, Precision, Recall and F1.

Label-based: Six measures of this kind have been used: Macro Precision, Macro Recall, Macro F1, Micro Precision, Micro Recall and Micro F1.

Rankings-based: The metrics of this group measure the comparison of the predicted ranking of the labels for the instances with their truth relevant labels. Four ranking-based measures have been used: One Error, Coverage, Ranking Loss and Average Precision.

Once all the measures have been computed, for each one of them, a Wilcoxon test [26] with a level of significance of α = 0.05 is used to compare results. It has been done following the recommendations of [9, 10] and taking into account that in this case only two algorithms will be compared.

6.2 Results and discussion

In this Section, the outcomes corresponding to the Wilcoxon tests are presented. The complete results can be found in the Appendix A. Table 2 shows, for each measure, which algorithm has better performance and if the difference is significant.

Table 2
Summary of the results of the Wilcoxon tests obtained for each measure and for each method

Metric C45 CC4.5

Subset Accuracy ∘

Hamming Loss •

Accuracy ∘

Precision ∘

Recall •

F1 ∘

Micro Precision •

Macro Precision •

Micro Recall •

Macro Recall •

Micro F1 ∘

Macro F1 ∘

Ranking Loss •

One Error •

Coverage •

Average Precision •

Metric	C45	CC4.5
Subset Accuracy		∘
Hamming Loss		•
Accuracy		∘
Precision		∘
Recall	•
F1	∘
Micro Precision		•
Macro Precision		•
Micro Recall	•
Macro Recall	•
Micro F1		∘
Macro F1	∘
Ranking Loss		•
One Error		•
Coverage		•
Average Precision		•

•indicates that the algorithm in the column has a significatively better performance than the method corresponding to the other column and ∘ means a better performance but without significative differences

Through this table we discuss the results. We firstly comment the results of the tests corresponding to example-based measures. Secondly, we analyze the results of the test associated with label-based metrics. After that, the results obtained for the ranking-based metrics are discussed. Finally, the results are summarized.

6.2.1 Example-based measures:

For Example-based metrics, the higher the value is, the better is the performance, except for hamming loss.

Within the example based-measures, we remark that in Hamming Loss the BR obtains significatively better results with CC4.5 as base learner than with C4.5. With Recall the results are better with C4.5, being the differences significant too. The results obtained for Accuracy, Precision and Subset Accuracy are better in the case of CC4.5, whereas C4.5 gets better results for F1. However, in these cases the differences are not significant.

The previous point implies that in BR with CC4.5 there is less difference between the sets of labels predicted and relevant labels for an instance than with C4.5 because of the results obtained in Hamming Loss. In addition, the better results obtained for Subset Accuracy by BR with CC4.5 mean that it gets better performance in predicting the entire set of relevant labels for the instances (and only the relevant labels). The improvements in the previous measures using CC4.5 as base classifier of BR instead of C4.5 reveal that in MLC there might be a notable intrinsic noise. Therefore, the usage of imprecise probabilities supposes an improvement on the predicted labels sets. Although with C4.5 as base learner BR predicts more relevant labels than with CC4.5 due to the better results obtained in Recall, CC4.5 predicts less irrelevant labels, because of its better performance in Precision. The reason is the following: As noted above, in BR many binary problems are unbalanced. Furthermore, CC4.5 stops branching the tree before than C4.5, overfitting less the data. Consequently, in many cases, CC4.5 does not reach parts of the tree where instances of the minority class (in our case, instances for which a label is relevant) are predicted correctly. However, in these parts of the tree the noise has a very negative influence.

In summary, for one example-based measure BR with CC4.5 is significantly better than BR with C4.5 and, for another measure of this kind the results are significantly better for BR with C4.5. In three of four of the remaining measures the BR with CC4.5 is better than C4.5, although the differences are not significant. It means that, for example-based measures both base classifiers for BR may be almost equivalent, having CC4.5 a slightly better performance.

6.2.2 Label-based measures

For all Label-based measures if the value is higher the performance is better.

For both Macro and Micro Precision the performance of BR is significantly better with CC4.5 as base classifier. The results obtained for both Micro and Macro Recall are significantly better for C4.5. BR obtains better performance with CC4.5 for Micro F1. For Macro F1, the results are better using C4.5. In these last two cases the differences are not significant.

For the previous reasons, BR with C4.5 tends to predict more relevant labels than with CC4.5. On the other hand, BR usually predicts less irrelevant labels as relevant using CC4.5 as base classifier than with C4.5. The reason has been explained above: In BR many binary problems are unbalanced and CC4.5 stops branching the tree before than C4.5.

Hence, it could be said that CC4.5 and C4.5 obtains an equivalent performance for label-based measures when they are used as base-classifiers for BR.

6.2.3 Ranking-based measures

Recall that for Ranking Loss, One Error and coverage the performance is better if the value of these measures is lower. The opposite happens with average precision.

It is easy to deduct that, for Ranking-based measures, BR obtains much better results with CC4.5 as base learner than using C4.5. In fact, it can be observed that for all ranking measures employed in this research the results obtained by CC4.5 are significantly better than those obtained by C4.5.

Therefore, it is obvious that BR obtains a better performance with CC4.5 than with C4.5 for the real-valuated and the corresponding ranking functions, i.e, CC4.5 ranks better the labels when it is used as base classifier of BR versus C4.5. This happens because the intrinsic noise in MLC has a strong influence in the label ranking given by a MLC algorithm and CC4.5 is more robust to noise than C4.5.

6.2.4 Summary of the results

Generally, BR with C4.5 predicts more relevant labels than with CC4.5 due to the better performance obtained with Recall for both example and label-based metrics. On the other hand CC4.5 predicts less irrelevant labels when it is employed as base classifier of BR. This fact can be deducted from the better results obtained in Precision by BR with CC4.5. The performance obtained in Hamming Loss implies that the label set predicted by BR with CC4.5 as base classifier differs less than with C4.5 from the relevant labels set. Moreover, BR ranks better the labels when it uses CC4.5 as base classifier than with C4.5 because the results obtained for BR are better in the first case for all Ranking-based metrics. The reasons of the previous facts have been explained above: in MLC there might be more intrinsic noise than in traditional classification. Thus, the usage of imprecise probabilities in C4.5 when it is used as base classifier of BR is suitable. Nevertheless, with CC4.5 as base learner BR has more difficulty in predicting relevant labels correctly because in BR many binary problems are not balanced and CC4.5 stops branching the tree before than C4.5. Consequently, with CC4.5 it is more difficult to reach parts of the tree where the labels are predicted as relevant (minority class).

The results obtained for 10 of the 16 measures offer significant differences between the two base classifiers for BR considered in this work. The results of 7 of them are favorable to CC4.5 while for 3 are better for C4.5. That is, 7 of 10 significant comparison results are favorable for BR using CC4.5 as base classifier.

For the other 6 measures the comparison results do not present significant differences. However, for 4 of them the results are better for CC4.5, whereas only for 2 of these metrics the results are favorable to C4.5. This is an additional favorable comparison for BR using CC4.5 instead of C4.5. These results are summarized in Table 3.

Table 3
Summary of the results

C4.5 CC4.5

Better with significative improvement 3 7

Better without significative improvement 2 4

	C4.5	CC4.5
Better with significative improvement	3	7
Better without significative improvement	2	4

One row indicates, for each algorithm, the number of metrics with the algorithm of the column is significatively better. The second row, shows, for each algorithm, the number of measures for which the method is better but the differences are not significant.

Moreover, for both example-based and label-based measures the results obtained for both algorithms are equivalent, even thought in the first group the results are slightly better for CC4.5, being the difference very little remarkable. Nevertheless, the results obtained for ranking based metrics the results obtained using CC4.5 as base learner of BR are clearly better than the obtained by employing C4.5. It shows that the label ranking given by a MLC algorithm is strongly affected by the presence of intrinsic noise in the data.

Therefore, it could be said that CC4.5 supposes an improvement versus classical C4.5 when it is employed as base classifier for BR.

7 Conclusions and future work

In this work, we have started from the Binary Relevance algorithm, a very simple and direct approach to Multi-Label Classification that has been proved to obtain good results in the practice. Recall that this method decomposes the multi-label task into several binary problems, one per tag and it uses a supervised traditional classification algorithm in order to solve each binary problem.

In this research it has been considered the Credal C4.5 algorithm as base classifier of Binary Relevance, which has been shown to be an improvement over the classical C4.5 in the presence of noise, and we should keep in mind that in MLC it is more probable that there is intrinsic noise in the data than in single-class classification. In concrete, the performance of the BR using CC4.5 as base learner has been studied versus BR using C4.5.

An experimental study carried out with a set of datasets for MLC has shown that the results obtained with CC4.5 are slightly better for example-based measures and significantly better for ranking measures. In general, there are more metrics for which BR with CC4.5 is better than C4.5. The previous points allow us to conclude that BR employing CC4.5 as base learner is an improvement over C4.5.

The contribution of this work allows us to think that the usage of imprecise probabilities in Decision Trees also provides good results in MLC. For future research, it is interesting to study how this idea works in other algorithms of MLC. For instance, it is worth to design a new adaptation of Decision Trees to MLC using the Imprecise-Dirichlet Model, giving rise to Multi-Label Decision Trees. It is expected that these new future adaptations of Decision Trees to MLC will provide better results than the existing adaptations.

Footnotes

Acknowledgments

This work has been supported by the Spanish “Ministerio de Economía y Competitividad” and by “Fondo Europeo de Desarrollo Regional” (FEDER) under Project TEC2015-69496-R.

Appendix A. Complete results from the experimental study

In this Section, the complete results from the experiments carried out in this research are exposed. In particular, we present the results for each dataset used in the experiments and for each measure across the 5-CV. The results corresponding to example-based measures are presented (Section 7.0.1), followed by the results of label-based measures (Section 7.0.2) and the results associated with ranking-based metrics (Section 7.0.3). In the tables with the results, for each dataset, the better result obtained by the algorithms is marked with bold fonts.

References

Abellán , Uncertainty measures on probability intervals from the imprecise dirichlet model, International Journal of General Systems 35(5) (2006), 509–528.

Abellán , Ensembles of decision trees based on imprecise probabilities and uncertainty measures, Information Fusion 14(4) (2013), 423–430.

Abellán and

C.J.

Mantas , Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring, Expert Systems with Applications 41(8) (2014), 3825–3830.

Abellán and

Masegosa , An experimental study about simple decision trees for bagging ensemble on datasets with classification noise, In Symbolic and Quantitative Approaches to Reasoning with Uncertainty, volume 5590 of Lecture Notes in Computer Science, Springer, 2009, pp. 446–456.

Abellán and

Moral , Building classification trees using the total uncertainty criterion, International Journal of Intelligent Systems 18(12) (2003), 1215–1225.

R.T.

Alves ,

M.R.

Delgado and

A.A.

Freitas , Knowledge discovery with arti_cial immune systems for hierarchical multi-label classi_cation of protein functions, In International Conference on Fuzzy Systems, 2010, pp. 1–8.

Barutcuoglu ,

R.E.

Schapire and

O.G.

Troyanskaya , Hierarchical multi-label prediction of gene function, Bioinformatics 22(7) (2006), 830–836.

M.R.

Boutell ,

Luo ,

Shen and

C.M.

Brown , Learning multi-label scene classification, Pattern Recognition 37(9) (2004), 1757–1771.

Charte ,

A.J.

Rivera ,

Charte ,

M.J.

del Jesus and

Herrera , Tips, guidelines and tools for managing multi-label datasets: The mldr.datasets r package and the cometa data repository. Neurocomputing, In Press, 2018.

10.

Demšar , Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

11.

I.N.

Inza ,

Sierra ,

Blanco and

Larrañaga , Gene selection by sequential search wrapper approaches in microarray cancer class prediction, Journal of Intelligent and Fuzzy Systems 12(1) (2002), 25–33.

12.

G.J.

Klir , Uncertainty and Information: Foundations of Generalized Information Theory. John Wiley And Sons, Inc., 2005.

13.

Madjarov ,

Kocev ,

Gjorgjevikj and

Džeroski , An extensive experimental comparison of methods for multi-label learning, Pattern Recognition 45(9) (2012), 3084–3104.

14.

C.J.

Mantas and

Abellán , Credal-C4.5: Decision tree based on imprecise probabilities to classify noisy data, Expert Systems with Applications 41(10) (2014), 4625–4637.

15.

C.J.

Mantas ,

Abellán and

J.G.

Castellano , Analysis of Credal-C4.5 for classification in noisy domains, Expert Systems with Applications 61 (2016), 314–326.

16.

McCallum , Multi-label text classification with a mixture model trained by EM, In AAAI'99 Workshop on Text Learning, 1999, pp. 1–7.

17.

Nasierding and

Kouzani , Image to Text Translation by Multi-Label Classification, In Advanced Intelligent Computing Theories and Applications With Aspects of Artificial Intelligence, volume 6216 of Lecture Notes in Computer Science, chapter 31, Springer, 2010, pp. 247–254.

18.

Pachowicz and

Bala , A noise-tolerant approach to symbolic learning from sensory data, Journal ofIntelligent and Fuzzy Systems 2 (1994), 347–361.

19.

J.R.

Quinlan , C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

20.

Read ,

Pfahringer ,

Holmes and

Frank. , Classifier chains for multi-label classification, Machine Learning 85(3) (2011), 333.

21.

Ruiz ,

Riquelme and

Aguilar-Ruiz , Projectionbased measure for efficient feature selection, Journal ofIntelligent and Fuzzy Systems 12 (2002), 175–183.

22.

R.E.

Schapire and

Singer , Boostexter: A boostingbased system for text categorization, Machine Learning 39(2) (2000), 135–168.

23.

C.E.

Shannon , A mathematical theory of communication, Bell System Technical Journal 27(3) (1948), 379–423.

24.

Tsoumakas ,

Spyromitros-Xioufis ,

Vilcek and

Vlahavas , Mulan: A java library for multi-label learning, Journal of Machine Learning Research 12 (2011), 2411–2414.

25.

Walley , Inferences from multinomial data; learning about a bag of marbles (with discussion), Journal of the Royal Statistical Society Series B (Methodological) 58(1) (1996), 3–57.

26.

Wilcoxon , Individual comparisons by ranking methods, Biometrics Bulletin 1(6) (1945), 80–83.

27.

I.H.

Witten and

Frank , Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, second edition, 2005.

28.

M.L.

Zhang and

Z.H.

Zhou , A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26(8) (2014), 1819–1837.

Using Credal-C4.5 with Binary Relevance for Multi-Label Classification

Abstract

Keywords

1 Introduction

2 Multi-label classification

5.1 Differences between Credal-C4.5 and C4.5

5.2 Noise in MLC

6.1 Description of the experiments

6.2.2 Label-based measures

6.2.3 Ranking-based measures

6.2.4 Summary of the results

Table 3 Summary of the results C4.5 CC4.5 Better with significative improvement 3 7 Better without significative improvement 2 4

Footnotes

Acknowledgments

Appendix A. Complete results from the experimental study

References

Table 3
Summary of the results

C4.5 CC4.5

Better with significative improvement 3 7

Better without significative improvement 2 4