A general framework for multi-label learning towards class correlations and class imbalance

Abstract

In multi-label classification settings, one of the most common problems is the massive label output space. To alleviate this, some methods opt to exploit label correlations to reduce the output space during prediction. However, these methods sacrifice efficiency or ignore global label correlations. In addition, label imbalances are another problem that is prevalent in multi-label classification. Current methods of correcting for imbalance oftentimes use single-label methods, which fail to consider label correlations. In this paper, we introduce general frameworks that incorporate topic modeling to seamlessly address both problems. We show that these frameworks can allow even the most naïve methods, such as Binary Relevance, to perform similarly to state-of-the-art methods. Furthermore, we show that our frameworks can also adapt state-of-the-art methods to perform better than the methods by themselves.

Keywords

Multi-label learning, label correlations, class imbalance, topic model

1. Introduction

In many real-world applications, objects are often members of several different groups, which makes the single-label assumption of multiclass classification unrealistic. For example, an image may be annotated with classes “sea” and “sunset”, or a document may be annotated with “culture” and “history”. Multi-label classification is the problem of categorizing these objects, or instances, into any number of classes. Formally speaking, let $X=\mathbb{R}^{d}$ denote the input space of $d$ -dimensional feature vectors and $Y=\{y_{1},y_{2},\ldots,y_{q}\}$ denote the output space of $q$ class labels. Given the multi-label training set $D=\{(x_{i},Y_{i})|1\leqslant i\leqslant N\}$ , where $x_{i}\in X$ is a $d$ -dimensional feature vector and $Y_{i}\subseteq Y$ is the set of labels associated with $x_{i}$ , the task is to learn a multi-label classifier $h:X\rightarrow 2^{Y}$ from $D$ which maps from the space of feature vectors to the space of label sets.

However, there are two primary challenges that are commonly associated with multi-label classification:

1. Size of the output space. Due to the nature of multi-label classification, the output space of the learning problem is massive. For example, the number of possible label sets for 20 class labels is $2^{20}$ . To tackle this challenge, state-of-the-art techniques exploit label correlations to reduce the size of the output space [17]. For example, if we know an image is already labeled with “processor” and “motherboard”, then there is a high probability that it is also labeled with “computer”. However, many correlation-based methods consider only pairwise label relationships, which are expensive to compute, especially with a massive label space. Recent methods have used classifier chains to capture the global correlations among labels. Unfortunately, the performance of these classifier chains is highly sensitive to the order of the chains as well as the size of the label space.
2. Label imbalance. This is the problem in which one class may label many more instances than other classes. Although label imbalance is a complication that exists in many machine learning problems, it is more complex in multi-label classification settings [25]. Some methods attempt to generalize to the multi-label classification problem by using traditional single-label methods for correcting imbalance, such as under- and oversampling. However, these methods are incapable of exploiting label correlations. Others redesign a specific multi-label algorithm to adapt it to the imbalance problem. However, these redesigns fail to generalize to other methods.

In this paper, we first introduce our work, LDAML, which specifically tackles massive output spaces with label correlation exploitation. LDAML is a general framework that can adapt and improve most multi-label classification methods. By exploiting global correlations with topic modeling, we show that LDAML can improve naïve methods such as Binary Relevance to the point that they are on par with state-of-the-art methods. Our method accomplishes this without significantly increasing the time cost. However, LDAML does not take into account the label imbalance problem. We therefore also introduce an extended framework that builds upon LDAML while addressing label imbalance.

In the rest of this paper, we briefly introduce related work, present the new framework, report on our experiments, and finally conclude our work.
2. Related work

In the past few decades, many algorithms have been proposed to solve multi-label classification problems. These learning algorithms can be categorized into two major types [25].

1. Problem transformation methods. The key idea of this category is to change the data to fit the algorithm. A classic method of this type is Binary Relevance (BR) [2]. BR converts the multi-label classification problem to a single-label classification problem by independently training one binary classifier for each label.
2. Algorithm adaptation methods. The key idea of this category is to change the algorithm to fit the data. Standard algorithms in this category include MLkNN [24], which fits single-label $k$ -nearest neighbors to multi-label learning, and ML-DT [5], which fits decision trees to multi-label learning.
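As a rough illustration of the problem transformation idea, the sketch below shows how BR can split a multi-label training set into one binary dataset per label. The function name `binary_relevance_split` and the toy data are hypothetical and are not taken from the cited work; any binary learner could then be trained on each resulting dataset.

```python
def binary_relevance_split(dataset, labels):
    """Turn a multi-label dataset into one binary dataset per label.

    dataset: list of (feature_vector, label_set) pairs.
    labels:  the full label space Y as a list.
    Returns a dict mapping each label to a list of (feature_vector, 0/1) pairs.
    """
    return {
        label: [(x, 1 if label in label_set else 0) for x, label_set in dataset]
        for label in labels
    }


# Toy data, made up for illustration only.
toy = [
    ([0.1, 0.9], {"sea", "sunset"}),
    ([0.8, 0.2], {"sea"}),
    ([0.5, 0.5], set()),
]
per_label = binary_relevance_split(toy, ["sea", "sunset"])
```

Because each binary dataset is built independently, this transformation by itself carries no information about how labels co-occur.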

Corresponding to problem 1 in the introduction, exploiting correlations among labels can drastically reduce the size of the output space and thus improve learning performance. A large number of methods that explicitly consider label correlations have been proposed. Some of these methods, such as CLR [8] and COCOA [23], utilize pairwise label correlations. CLR ranks labels by generating all possible label pairs. COCOA predicts labels by building several multiclass imbalance learners, each of which pairs the current label with another label. However, these methods consider only pairwise relationships, which are expensive to compute, especially with massive label spaces.

Another category of methods exploits global label correlations with different structures. Classifier Chains (CC) [14] transform multi-label classification problems into chains of binary classification problems. To find the optimal order of label chains, several follow-up algorithms have been proposed based on CC: Ensembles of Classifier Chains (ECC) [14], Entropy Chain Classifiers (ETCC) [13], and Group sensitive Classifier Chains (GCC) [11]. Random $k$ -Labelsets (RAkEL) [18] transforms the multi-label learning problem into a multiclass classification one with random label subsets. Multi-label Linear Discriminant Analysis (MLDA) [20, 15] takes advantage of label correlations and uses the classification capability of classical LDA to deal with multi-label multiclass problems. However, the performance of these types of methods is highly sensitive to the structures they rely on.

In addition, corresponding to problem 2 in the introduction, label imbalance in multi-label classification can be viewed from two perspectives [21]: imbalance among different labels and imbalance within individual labels. The former is related to the greatly differing rate of positive instances of different labels. The latter only depends on the degree of imbalance of each individual label.

One technique is to transform the label imbalance problem in multi-label learning into the class imbalance problem in single-label learning, and then solve it with techniques such as under- and over-sampling, ensemble learning, and cost-sensitive learning [9, 22, 16, 12]. Some methods, such as ML-ROS, ML-RUS [3], and Multi-Label Synthetic Minority Over-sampling Technique (MLSMOTE) [4], directly apply techniques that solve the class imbalance problem in single-label learning to multi-label learning. Some methods also try to build training sets based on the inherent properties of the majority class to solve the imbalance problem.

Moreover, imbalance-aware algorithms have recently been proposed. COCOA [23] builds a binary-class imbalance classifier for the current class and aggregates it with several multiclass imbalance classifiers for other classes to make final predictions. [21] solved the label imbalance problem in multi-label learning by formulating the problem as a constrained minimization consisting of a submodular objective function. [6] introduced an extension of structured forests and proposed an imbalance-aware formulation by altering the manner in which splitting functions are learned.
3. Framework overview

3.1 LDAML framework

Our approach to the problem of multi-label classification is to exploit label correlations by introducing topic modeling, specifically in the form of latent Dirichlet allocation (LDA) [1]. Unlike traditional methods that exploit correlations with label subsets or label chains, LDAML achieves this goal by discovering abstract label topics. As in the introduction section, let $X=\mathbb{R}^{d}$ denote the input space of $d$ -dimensional feature vectors and $Y=\{y_{1},y_{2},\ldots,y_{q}\}$ denote the output space of $q$ class labels. We are given a multi-label training set $D=\{(x_{i},Y_{i})|1\leqslant i\leqslant N\}$ , where $x_{i}\in X$ is a $d$ -dimensional feature vector and $Y_{i}\subseteq Y$ is the set of labels associated with $x_{i}$ . We frame each instance as a document and each label $y_{ij}$ as a word in the corresponding document. Intuitively, these documents can be described by topics, in which we can expect some labels to appear more often than others. This is especially true in multi-label datasets, which contain large quantities of related labels.

Exploiting label topics from the training set. We introduce LDA into training set $D$ . Each instance $x_{i}$ denotes a document and each label $y_{ij}$ denotes the $j$ -th label in the $i$ -th instance. Then, the generative process is as follows:

1. Choose the label topic number $K$ and the Dirichlet distribution parameters $\alpha$ and $\beta$ ;
2. Choose $\theta_{i}\sim\textit{Dir}(\alpha)$ , where $i\in\{1,\ldots,N\}$ ;
3. Choose $\phi_{k}\sim\textit{Dir}(\beta)$ , where $k\in\{1,\ldots,K\}$ ;
4. For each label $y_{ij}$ , where $i\in\{1,\ldots,N\}$ and $j\in\{1,\ldots,q\}$ :

   – Choose a topic $z_{ij}\sim\textit{Multinomial}(\theta_{i})$ ;
   – Choose a label $y_{ij}\sim\textit{Multinomial}(\phi_{z_{ij}})$ ;
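The generative steps above can be sketched with the Python standard library. This is a minimal illustration, not the authors' implementation: it assumes symmetric Dirichlet priors and categorical draws for the topic and the label, as in standard LDA, and it draws labels with replacement, so a label may repeat within one instance. The function names, parameter defaults, and sizes are hypothetical.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of the given size."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_label_documents(n_instances, n_labels, n_topics, labels_per_instance,
                             alpha=0.5, beta=0.5, seed=0):
    """Run the generative process: topic-label distributions first, then labels."""
    rng = random.Random(seed)
    # phi[k] is the label distribution of topic k.
    phi = [sample_dirichlet(beta, n_labels, rng) for _ in range(n_topics)]
    theta, documents = [], []
    for _ in range(n_instances):
        theta_i = sample_dirichlet(alpha, n_topics, rng)  # topic mixture of instance i
        doc = []
        for _ in range(labels_per_instance):
            z = rng.choices(range(n_topics), weights=theta_i)[0]  # topic assignment
            y = rng.choices(range(n_labels), weights=phi[z])[0]   # label drawn from topic z
            doc.append(y)
        theta.append(theta_i)
        documents.append(doc)
    return theta, phi, documents


theta, phi, documents = generate_label_documents(
    n_instances=4, n_labels=6, n_topics=2, labels_per_instance=3)
```

In the framework itself the process runs in the opposite direction: the label sets are observed, and inference recovers the instance-topic matrix $\theta$ .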

Then, we compute the instance-topic probability distribution matrix $\theta$ , where $\theta_{ij}$ denotes the probability of the $i$ -th instance in the $j$ -th topic.

Discrete distribution of topics. After calculating $\theta$ , we obtain the probability value that each instance belongs to each topic. To determine which topic each instance belongs to, we require a discrete value (e.g., 0 or 1) instead of the probability. We define the method for discretization in Algorithm 3.1.

Algorithm: Discretization of the instance-topic probability distribution matrix
Input: the instance-topic probability distribution matrix $\theta[N][K]$ .
Output: the discretization matrix of instance-topic $Y_{T}$ .

for $i=1$ to $N$ do
    $\textit{max}=\textit{MAX}(\theta[i][1],\ldots,\theta[i][K])$
    for $j=1$ to $K$ do
        if $(\textit{max}-\theta[i][j])<\frac{1}{K}$ then
            $Y_{T}[i][j]=1$
        else
            $Y_{T}[i][j]=0$
        end if
    end for
end for
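The discretization rule can be written directly in Python. The sketch below follows the rule as stated (a topic is kept when its probability lies within $\frac{1}{K}$ of the row maximum); the function name and the example matrix are illustrative only.

```python
def discretize_topics(theta):
    """Mark topic j for instance i when theta[i][j] is within 1/K of the row maximum."""
    discretized = []
    for row in theta:
        k = len(row)
        row_max = max(row)
        discretized.append([1 if (row_max - p) < 1.0 / k else 0 for p in row])
    return discretized


example = discretize_topics([[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]])
# example == [[1, 0, 0], [1, 1, 1]]
```

A peaked row keeps only its dominant topic, whereas a flat row keeps several topics, so an instance can belong to more than one topic.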

Predict topics of test examples. We assume that the topic probability distribution of the test dataset is identical to that of the training dataset. Therefore, the topics of test examples can be predicted by a model learned from the training dataset. Let $D_{T}$ denote the dataset reconstructed from $D$ for the $K$ topics: $D_{T}=\{(x_{i},{Y_{T}}_{i})|1\leqslant i\leqslant N\}$ , where ${Y_{T}}_{i}$ is a $K$ -dimensional vector and ${Y_{T}}_{ij}$ indicates whether instance $i$ belongs to topic $j$ . Thus, $D_{T}$ is also a multi-label dataset. Since the topics associated with each instance are a high-level description of the label set space, correlations among topics are weaker than among labels. Thus, the predictive model for topics can be produced by a naïve multi-label classification method $H_{T}(x)$ , such as BR or CC, on $D_{T}$ . Finally, we obtain the predicted topic set for the test example by querying the predictive model.

The choice of topic number $K$ is crucial: a smaller $K$ yields topics that carry less information but are easier to predict, whereas a larger $K$ yields topics that carry more detailed information but are more difficult to predict. To combine both, given an increasing sequence $K=\{K_{1},K_{2},\ldots,K_{p}\}$ , where $p$ is the iteration number, we iteratively select each value as the number of topics and then execute the discretization algorithm. In each iteration, we then augment the original feature space in the training set with the discretized instance-topic matrix, and the feature space of the test set with the predicted topic set. In this manner, we can introduce more information in each topic prediction loop.

Predict labels of test examples. Since we assume that the label topics introduce label correlation information, we augment the original feature space with the topic space as features in the label prediction stage. Let $X=\mathbb{R}^{d}$ denote the input space of $d$ -dimensional feature vectors, $Y_{T}=\{topic_{1},topic_{2},\ldots,topic_{K}\}$ denote the topic space of $K$ topics, where $K=\sum_{i=1}^{p}K_{i}$ , and $Y=\{y_{1},y_{2},\ldots,y_{q}\}$ denote the output space of $q$ class labels. We build the new multi-label training set $D^{\prime}=\{(x_{i}^{\prime},Y_{i})|1\leqslant i\leqslant N\}$ , where $x_{i}^{\prime}$ is feature vector $x_{i}\in X$ augmented with topic vector ${Y_{T}}_{i}\in Y_{T}$ and $Y_{i}\subseteq Y$ is the set of labels associated with $x_{i}$ . Then, the test examples have their feature space augmented with the predicted topic set. Finally, we produce the predictive model for label prediction on the new training set $D^{\prime}$ with a multi-label classifier $H(\cdot)$ . We obtain the predicted label set for the test example by querying the predictive model. Algorithm 3.1 summarizes the complete procedure of the LDAML approach.

Algorithm: LDAML
Input:
    $D$ : the multi-label training set, $D=\{(x_{i},Y_{i})|1\leqslant i\leqslant N\}$
    $p$ : the number of iterations
    $K$ : the increasing sequence of topic number in each loop, $K=\{K_{1},K_{2},\ldots,K_{p}\}$
    $M_{T}$ : the base multi-label classifier for topic prediction
    $M$ : the base multi-label classifier for label prediction
    $t$ : a test example, $t=(\hat{x},\hat{Y})$
Output: the set of predicted labels for $t$

$D^{0}=D$ , $t^{0}=t$ ;
for $n=1$ ; $n\leqslant p$ ; $n++$ do
    calculating the instance-topic probability distribution matrix $\theta^{n}$ by generating the LDA model with $K_{n}$ topics on $D^{n-1}$ ;
    calculating the discretization matrix ${Y_{T}}^{n}$ of $\theta^{n}$ with Algorithm 3.1;
    constructing the training set ${D_{T}}^{n}$ for predicting $K_{n}$ topics, ${D_{T}}^{n}=\{({x_{i}}^{n-1},{{Y_{T}}_{i}}^{n})|1\leqslant i\leqslant N\}$ and ${x_{i}}^{n-1}$ is the feature space in $D^{n-1}$ ;
    constructing test example ${t_{T}}^{n}=(\hat{x^{n-1}},\hat{{Y_{T}}^{n}})$ and $\hat{x^{n-1}}$ is the feature space in $t^{n-1}$ ;
    $H_{T}=\textit{BuildClassifier}({D_{T}}^{n},M_{T})$ ;
    the set of predicted topics $\hat{{Y_{T}}^{n}}=H_{T}(t_{T}^{n})$ ;
    constructing training set $D^{n}=\{({x_{i}}^{n},Y_{i})|1\leqslant i\leqslant N\}$ , where ${x_{i}}^{n}$ is ${x_{i}}^{n-1}$ augmented with ${{Y_{T}}_{i}}^{n}$ ;
    constructing test example $t^{n}=(\hat{x}^{n},\hat{Y})$ , where $\hat{x}^{n}$ is $\hat{x}^{n-1}$ in $t^{n-1}$ augmented with $\hat{{Y_{T}}^{n}}$ ;
end for
$H=\textit{BuildClassifier}({D}^{p},M)$ ;
$\hat{Y}=H(t^{p})$ ;
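The control flow of the LDAML procedure can be sketched as follows. This is only an outline under stated assumptions: the topic model and both base classifiers are passed in as caller-supplied callables (`fit_topic_matrix`, `build_topic_classifier`, `build_label_classifier` are hypothetical names, not part of any library), and the stand-in components at the bottom exist solely to exercise the loop.

```python
def discretize(theta):
    """Row-wise discretization rule: keep topics within 1/K of the row maximum."""
    return [[1 if (max(row) - p) < 1.0 / len(row) else 0 for p in row] for row in theta]


def ldaml_predict(train_x, train_labels, test_x, topic_counts,
                  fit_topic_matrix, build_topic_classifier, build_label_classifier):
    """Augment features with topics for each K_n, then predict labels.

    fit_topic_matrix(label_sets, k) -> N x k matrix of topic probabilities (e.g. from LDA).
    build_*_classifier(X, Y)        -> function mapping one feature vector to a prediction.
    All three callables are placeholders supplied by the caller.
    """
    cur_train = [list(x) for x in train_x]
    cur_test = list(test_x)
    for k in topic_counts:
        theta = fit_topic_matrix(train_labels, k)
        topics = discretize(theta)                                # Y_T^n
        topic_model = build_topic_classifier(cur_train, topics)   # H_T, trained on D_T^n
        predicted_topics = topic_model(cur_test)                  # topics of the test example
        # Training features get the discretized topics, test features the predicted ones.
        cur_train = [x + row for x, row in zip(cur_train, topics)]
        cur_test = cur_test + list(predicted_topics)
    label_model = build_label_classifier(cur_train, train_labels)  # H, trained on D^p
    return label_model(cur_test), cur_test


# Stand-in components, only to exercise the control flow.
def uniform_topics(label_sets, k):
    return [[1.0 / k] * k for _ in label_sets]


def constant_classifier(X, Y):
    return lambda x: Y[0]


labels, augmented = ldaml_predict(
    train_x=[[0.1, 0.2], [0.3, 0.4]],
    train_labels=[{"a"}, {"b"}],
    test_x=[0.5, 0.6],
    topic_counts=[2, 3],
    fit_topic_matrix=uniform_topics,
    build_topic_classifier=constant_classifier,
    build_label_classifier=constant_classifier,
)
```

With topic counts 2 and 3, the test vector grows from 2 to 7 features, which matches $K=\sum_{i=1}^{p}K_{i}$ added topic features.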
3.2 LDAML-IMB framework

As stated in the introduction, there are two key challenges in multi-label learning. One is the huge number of possible label sets for prediction in multi-label data, which is exponential in the size of the label space. To address this, the LDAML framework focuses on exploiting correlations among class labels with topic modeling. The other key challenge is the class imbalance problem, which typically leads to performance degradation in multi-label learning. To tackle this problem, we propose the extended framework, LDAML-IMB, which is based on LDAML.

Coupling labels with topics. In the LDAML framework, we obtained the topic distribution of each label set. We couple each label $y_{j}$ with $K$ topics as follows: given the label-topic pair $\{(y_{j},\textit{topic}_{k})|1\leqslant j\leqslant q,1\leqslant k\leqslant K\}$ , we reconstruct a multiclass data set $D_{jk}$ from $D$ .

$\displaystyle D_{jk}=\{x_{i},\Phi(y_{j},\textit{topic}_{k})\}$ (1)

$\displaystyle\Phi(y_{j},\textit{topic}_{k})=\left\{\begin{array}{ll}0,&\text{if }y_{j}=0\text{ and }\textit{topic}_{k}=0\\ +1,&\text{if }y_{j}=0\text{ and }\textit{topic}_{k}=1\\ +2,&\text{if }y_{j}=1\end{array}\right.$ (2)

In total, there are four possible classes, which are the four possible combinations of $y_{j}$ and $\textit{topic}_{k}$ . However, we only focus on the positive assignments for $y_{j}$ , so the combinations in which $y_{j}=1$ can be merged together. Thus, the multiclass data $D_{jk}$ can be transformed into three classes. Ideally, the merging can weaken the impact of imbalanced classes.
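The coupling function $\Phi$ and the resulting tri-class dataset are small enough to sketch directly. The function names below are hypothetical; labels and topics are assumed to be encoded as 0/1 indicators per instance.

```python
def couple_label_with_topic(y_j, topic_k):
    """Eq. (2): merge both y_j = 1 cases into class 2, keep the topic split when y_j = 0."""
    if y_j == 1:
        return 2
    return 1 if topic_k == 1 else 0


def build_triclass_dataset(features, label_column, topic_column):
    """Eq. (1): pair every feature vector with its coupled class."""
    return [
        (x, couple_label_with_topic(y, t))
        for x, y, t in zip(features, label_column, topic_column)
    ]


classes = [couple_label_with_topic(y, t) for y, t in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# classes == [0, 1, 2, 2]
```

The four label-topic combinations collapse to three classes, with the topic used only to split the negative instances of $y_{j}$ .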

Building multiclass imbalance learners.

Single-label imbalance learning techniques, such as random or synthetic under- or over-sampling, can be applied to the multiclass dataset $D_{jk}$ . By choosing a single-label imbalance learner $\mathcal{M}$ on the multiclass data $D_{jk}$ , we can train the imbalance-aware multiclass classifier $h_{jk}$ . For each class label $y_{j}$ , there may be $K$ such multiclass classifiers. The LDAML result can then be adjusted by these $K$ imbalance-aware classifiers to better adapt to the label imbalance problem.
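As one possible instance of such a single-label technique, the sketch below applies plain random over-sampling to a tri-class dataset. It is an illustrative choice, not necessarily the learner used in the experiments; the function name and toy data are hypothetical.

```python
import random


def random_oversample(dataset, seed=0):
    """Duplicate random examples of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, c in dataset:
        by_class.setdefault(c, []).append((x, c))
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


# Toy tri-class data: four examples of class 0, one of class 1, two of class 2.
toy = [([0], 0)] * 4 + [([1], 1)] + [([2], 2)] * 2
balanced = random_oversample(toy)
```

Any multiclass learner can then be trained on the balanced data to play the role of $h_{jk}$ .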

Computing whole confidence of learners.

For each class label $y_{j}$ , we can compute the prediction confidence from each classifier. In the LDAML-IMB framework, we merge the confidence of LDAML and the $K$ multiclass classifiers as follows:

$\displaystyle h_{j}(x)=h_{j}^{\textit{LDAML}}(+1|x)+\frac{1}{K}\sum_{k\in\textit{topic}}h_{jk}(+2|x)$ (3)

Here, $h_{j}^{\textit{LDAML}}(+1|x)$ returns the confidence value of LDAML when label $j$ is predicted as 1, and $h_{jk}$ is the multiclass classifier constructed with the label-topic couple $(y_{j},\textit{topic}_{k})$ . As shown in Eq. (2), $h_{jk}(+2|x)$ returns the confidence value of the multiclass learner when label $j$ is predicted as 1. In Eq. (3), we average the confidence of the $K$ multiclass classifiers, use it to adjust the result of LDAML, and thereby obtain the final confidence value of label $j$ .
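Eq. (3) amounts to adding the LDAML confidence to the mean confidence of the $K$ label-topic learners. A minimal sketch, with a hypothetical function name and made-up confidence values:

```python
def merged_confidence(ldaml_confidence, multiclass_confidences):
    """Eq. (3): LDAML confidence plus the mean confidence of the K label-topic learners."""
    k = len(multiclass_confidences)
    return ldaml_confidence + sum(multiclass_confidences) / k


# One LDAML confidence and K = 2 multiclass confidences for class +2.
score = merged_confidence(0.40, [0.30, 0.50])
```

If each individual confidence lies in $[0,1]$ , the merged value lies in $[0,2]$ , so a fixed cut-off of 0.5 is no longer a natural choice.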

Computing the best threshold of confidence.

In traditional methods, the threshold $a$ is set to a fixed value (usually 0.5), and $x$ is predicted to be positive for $y_{j}$ if the confidence value $h_{j}(x)>a$ , and negative otherwise. In the LDAML-IMB framework, the threshold $t_{j}$ varies across labels: $x$ is predicted to be positive for $y_{j}$ if $h_{j}(x)>t_{j}$ . The problem lies in how to determine the value of the threshold $t_{j}$ . Ideally, $t_{j}$ is the threshold at which the classifier $h_{j}(x)$ obtains its best performance on the dataset. Here, we adopt the macro-averaged F-measure, which is a widely used metric for binary classifiers, especially for evaluating performance in the context of the imbalance problem. Because the macro-averaged F-measure is the mean of the per-label F-measures, it is maximized by choosing, for each label separately, the real-valued threshold that maximizes that label's F-measure. The threshold $t_{j}$ is computed as follows:

$\displaystyle t_{j}=\arg\max\limits_{t\in\mathbb{R}}F(f_{j},t,D_{j})$ (4)

where $F(f_{j},t,D_{j})$ denotes the F-measure value achieved by applying $\{f_{j},t\}$ to the binary training set $D_{j}$ .
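One way to approximate the maximization in Eq. (4) is to search a finite set of candidate thresholds, since the predictions only change when the threshold crosses an observed confidence value. The sketch below assumes this search strategy (midpoints between sorted confidences), which is not specified in the text; the function names are hypothetical.

```python
def f_measure(y_true, y_pred):
    """F1 for one binary label; defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(confidences, y_true):
    """Search candidate thresholds and return the one maximizing F1 (ties: lowest)."""
    values = sorted(set(confidences))
    # Midpoints between consecutive confidences, plus one value below the minimum.
    candidates = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_f = candidates[0], -1.0
    for t in candidates:
        preds = [1 if c > t else 0 for c in confidences]
        score = f_measure(y_true, preds)
        if score > best_f:
            best_t, best_f = t, score
    return best_t, best_f


# Confidences for four training instances of one label and their true 0/1 values.
t_j, f_j = best_threshold([0.2, 0.9, 1.4, 0.6], [0, 1, 1, 0])
```

In this toy case the selected threshold falls between 0.6 and 0.9, which separates the two positive instances from the two negative ones.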

Algorithm 3.2 summarizes the complete procedure of the proposed LDAML-IMB framework.

Algorithm: LDAML-IMB
Input:
    $D$ : the multi-label training data set, $D=\{(x_{i},Y_{i})|1\leqslant i\leqslant N\}$
    $K$ : the topic number
    $M$ : the multiclass imbalance classifier
    $t$ : a test example
Output: the set of predicted labels for $t$

for $j=1$ ; $j\leqslant q$ ; $j++$ do
    Building the LDAML learner $H_{j}^{\textit{LDAML}}$ and calculating the confidence value $h_{j}^{\textit{LDAML}}(+1|t)$ according to Algorithm 3.1
    Obtaining the $K$ topic of label set according to Algorithm 3.1
    for $k=1$ ; $k\leqslant K$ ; $k++$ do
        Building the tri-class data set $D^{jk}$ with label-topic couple $(y_{j},\textit{topic}_{k})$
        Building the imbalance multiclass classification $H_{jk}=M(D^{jk})$ and calculating the confidence value $h_{jk}(+2|t)$
    end for
    Calculating the whole confidence $h_{j}(t)$ according to Eq. (3)
    Calculating the threshold $t_{j}$ of confidence to maximize the performance according to Eq. (4)
end for
$H(t)=\{y_{j}|h_{j}(t)>t_{j},1\leqslant j\leqslant q\}$ ;

In this section, we proposed the LDAML-IMB framework, which aims to exploit label correlations as well as to deal with the label imbalance problem. The first part, LDAML, exploits label correlations with topic models, and the second part adjusts the output of the first part for the imbalance problem with several multiclass imbalance learners coupled with the topics from LDAML. LDAML-IMB degenerates to LDAML if we ignore the second part.

4. Experimental design

4.1 Dataset description

We describe our datasets with the statistical metrics commonly used for multi-label datasets. Given a multi-label dataset $D$ , $|D|$ denotes the number of instances in the dataset, $L(D)$ denotes the number of label classes, $\textit{dim}(D)$ denotes the number of features, $\textit{LCard}(D)$ denotes the average number of label classes per instance, $\textit{LDen}(D)$ denotes $\textit{LCard}(D)$ normalized by the number of label classes, and $DL(D)$ denotes the number of distinct label sets. Lastly, $\textit{max}_{\textit{IMB}}(D),\textit{min}_{\textit{IMB}}(D)$ , and $\textit{avg}_{\textit{IMB}}(D)$ denote the maximum, minimum, and average values of label imbalance ratio, respectively. We summarize the characteristics of our datasets in Table 1.
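These statistics are straightforward to compute from the label sets. In the sketch below, $\textit{LDen}$ is taken as $\textit{LCard}$ divided by the number of labels and $DL$ as the number of distinct label sets. The imbalance ratio is not defined in the text; the code assumes the common definition of majority count over minority count for each label, which should be read as an assumption. All names and the toy data are hypothetical.

```python
def label_statistics(label_sets, n_labels):
    """Return LCard, LDen and DL for a list of label sets."""
    n = len(label_sets)
    lcard = sum(len(s) for s in label_sets) / n       # average labels per instance
    lden = lcard / n_labels                           # LCard normalized by L(D)
    dl = len({frozenset(s) for s in label_sets})      # distinct label sets
    return lcard, lden, dl


def imbalance_ratio(label_sets, label):
    """Assumed definition: majority count over minority count for one label.

    Raises ZeroDivisionError if the label is carried by all or by no instances.
    """
    pos = sum(1 for s in label_sets if label in s)
    neg = len(label_sets) - pos
    return max(pos, neg) / min(pos, neg)


toy_sets = [{"a", "b"}, {"a"}, {"a", "b"}, {"c"}]
lcard, lden, dl = label_statistics(toy_sets, n_labels=3)
ratio_a = imbalance_ratio(toy_sets, "a")
```

The maximum, minimum, and average imbalance values would then be taken over the per-label ratios.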

Table 1
Statistical descriptions of the datasets

Dataset	$|D|$	$\textit{dim}(D)$	$L(D)$	$\textit{LCard}(D)$	$\textit{LDen}(D)$	$DL(D)$	$\textit{min}_{\textit{IMB}}(D)$	$\textit{max}_{\textit{IMB}}(D)$	$\textit{avg}_{\textit{IMB}}(D)$	Type
Emotions	593	72	6	1.869	0.311	27	1.247	3.003	2.146	Music
Flags	194	19	7	3.392	0.485	54	1.042	6.462	2.753	Images
Corel5k	5000	499	374	3.522	0.009	3175	3.460	50.000	17.857	Images
CAL500	502	68	174	26.044	0.150	502	1.040	24.390	3.846	Music

In this paper, we chose four benchmark multi-label datasets for our experiments. These datasets cover two types of data: music (CAL500, Emotions) and images (Corel5k, Flags). They also cover different scales of instances and label classes, so experiments on these datasets are reasonably representative.

4.2 Compared algorithms

We compared LDAML with several state-of-the-art learning algorithms, selected to cover nearly all categories of multi-label classification. In terms of transformation strategy, we compared with the lazy method MLkNN [24]; CC and ECC, which transform the problem into binary classification; CLR [8], which transforms it into label ranking; and RAkEL [18] and COCOA [23], which transform it into multiclass classification. Notably, COCOA also considers the imbalance problem in addition to label correlations. In terms of label correlations, BR and MLkNN are first-order methods, CLR is a second-order method, and CC, ECC, COCOA, and RAkEL are high-order methods.

4.3 Experimental settings

We designed two groups of experiments to compare the algorithms. The first group aimed to show how performance changes with an increasing topic number $K$ . We set the topic numbers to the cumulative sums of the prime number sequence (i.e., the topic number $K$ follows the sequence $2$ , $2+3$ , $2+3+5$ , etc.). We present the performance of our framework with the increasing topic number on the CAL500 dataset. In the second group of experiments, we show how the LDAML and LDAML-IMB frameworks improve on each unaltered multi-label algorithm on the other three benchmark datasets (Corel5k, Flags, and Emotions). Since we are only concerned with the comparison, we simply fix the topic number at 2 in this group of experiments. Given the time cost of COCOA, we used its results only as a reference for comparison in each group of experiments and did not use it as a base classifier in our frameworks.
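The topic number sequence used in the first group can be generated as cumulative sums of the primes. The helper below is a small illustration with a hypothetical name; its first three values match the topic numbers 2, 5, and 10 that appear in Table 2.

```python
def cumulative_prime_topic_numbers(p):
    """Return the first p cumulative sums of the primes: 2, 2+3, 2+3+5, ..."""
    totals, primes, candidate, running = [], [], 2, 0
    while len(totals) < p:
        # A candidate is prime if no smaller prime divides it.
        if all(candidate % q for q in primes):
            primes.append(candidate)
            running += candidate
            totals.append(running)
        candidate += 1
    return totals


topic_numbers = cumulative_prime_topic_numbers(3)
# topic_numbers == [2, 5, 10]
```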

We instantiated all of the algorithms as follows. We adopted the SMO method, which is provided by the widely used Weka platform, as the base binary classifier in all multi-label methods [10]. The representative comparison multi-label algorithms, such as MLkNN, CLR, ECC, and RAkEL, are provided by the MULAN multi-label learning library [19].

4.4 Experiment results

In multi-label scenarios, there are many evaluation metrics, such as Subset Accuracy, Hamming Loss, One-error, Coverage, and Ranking Loss. However, the F-measure, which integrates both precision and recall, is the most-used evaluation metric, because it can provide better insight into classification performance than conventional metrics such as accuracy [7]. We chose both the micro- and macro-averaged F-measure as evaluation metrics: the former pools the decisions for all instance-label pairs before computing a single score, whereas the latter averages the scores computed for each label separately. In class imbalance scenarios, the macro-averaged F-measure and macro-AUC are the most-used evaluation metrics. Thus, we evaluated each method’s performance with three metrics: the micro-averaged F-measure, macro-averaged F-measure, and macro-AUC. The micro- and macro-averaged F-measures are label-based measures that evaluate the generalization performance of each class label’s predictor, while macro-AUC is a label-based ranking metric. For all three metrics, a higher value indicates better performance.
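The difference between the two averaging schemes can be made concrete with a short sketch. The code assumes the standard definitions (micro: pool the true-positive, false-positive, and false-negative counts over all labels; macro: average the per-label F1 scores) and treats an undefined F1 as 0; the names and toy matrices are illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; 0.0 when nothing is predicted or present."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def micro_macro_f1(y_true, y_pred):
    """y_true and y_pred are N x q 0/1 matrices (rows: instances, columns: labels)."""
    q = len(y_true[0])
    counts = []
    for j in range(q):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1_from_counts(*c) for c in counts) / q            # mean of per-label F1
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))    # F1 of pooled counts
    return micro, macro


# Label 0 is frequent and always predicted; label 1 is rare and never predicted.
truth = [[1, 0], [1, 0], [1, 1], [1, 0]]
pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(truth, pred)
```

In this toy case the micro-averaged score stays high (8/9) while the macro-averaged score drops to 0.5, which illustrates why the macro average is the more sensitive of the two to rare labels.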

LDAML(H)-K refers to the LDAML framework, where $H$ is the base multi-label classifier used in LDAML, and $K=\sum_{i=1}^{p}K_{i}$ is the number of topics with the given topic number sequence $\{K_{1},K_{2},\ldots,K_{p}\}$ . Similarly, LDAML-IMB(H)-K refers to the LDAML-IMB framework, where H and K have the same meaning as before. The $\uparrow$ symbol indicates that a higher value means better performance. Table 2 reports the results of the first group of experiments, with the best result marked in bold. In addition, the win/loss row indicates the number of times our proposed frameworks are superior/inferior to the compared algorithm. We also show the runtime of each method in milliseconds. In this group, we chose the benchmark dataset CAL500, which contains 174 class labels. In the LDAML and LDAML-IMB frameworks, we chose the simplest multi-label algorithm, BR, as the base multi-label classifier.

Table 2
Experiment results with different topic numbers on CAL500 ( $\uparrow$ )

Algorithm	MI-F	MA-F	MA-AUC	RUNTIME
BR	0.3459	0.0818	0.5044	2808
LDAML (BR)-2	0.3661	0.1150	0.5113	2995
LDAML (BR)-5	0.4110	0.1591	0.5191	3078
LDAML (BR)-10	0.4217	0.1842	0.5245	3122
LDAML-IMB (BR)-2	0.4356	0.285	0.5453	5919
LDAML-IMB (BR)-5	0.4395	0.3005	0.5539	9965
LDAML-IMB (BR)-10	0.4177	0.3093	0.5639	15934
CC	0.3283	0.1208	0.5117	3095
LDAML (CC)-2	0.3583	0.1788	0.5148	3300
LDAML-IMB (CC)-2	0.4460	0.2911	0.5433	5903
ECC	0.4054	0.1898	0.5275	10884
LDAML (ECC)-2	0.4036	0.2032	0.5351	11838
LDAML-IMB (ECC)-2	0.4458	0.2898	0.5512	15396
CLR	0.3494	0.0859	0.5657	28558
LDAML (CLR)-2	0.3677	0.1199	0.5688	30657
LDAML-IMB (CLR)-2	0.4419	0.2898	0.5636	41829
RAkEL	0.3542	0.0894	0.5115	13326
LDAML (RAkEL)-2	0.3756	0.1198	0.5175	14257
LDAML-IMB (RAkEL)-2	0.4351	0.2766	0.5408	17399
MLkNN	0.3310	0.0806	0.5266	619
LDAML (MLkNN)-2	0.3670	0.1205	0.5407	325
LDAML-IMB (MLkNN)-2	0.4506	0.2890	0.5611	4421
COCOA	0.4465	0.2860	0.5683	230309
win/loss	6/0	6/0	6/0

From the result in Table 2, we observe the following:

Even the simplest multi-label algorithm, BR, can achieve better performance across all evaluation metrics by utilizing the LDAML framework.

By using the LDAML-IMB framework, we obtain significant increases in performance, especially when evaluating with the imbalance metrics macro-averaged F-measure and macro-AUC.

With increasing topic number $K$ , the performances of LDAML and LDAML-IMB increase overall. However, the rate of increase slows, which indicates that the performance will not increase indefinitely.

The LDAML results indicate that even though BR performs the worst, the LDAML framework can achieve similar or better performance than state-of-the-art algorithms by exploiting label correlations to enhance BR. More importantly, the only additional time cost is to compute the topic probability distribution matrix in a limited word space (label space). Thus, the time cost of LDAML is close to that of its base multi-label classifier, which implies that by using LDAML, we can achieve similar or even better performance than the state-of-the-art algorithms, while the time cost is similar to that of faster, naïve methods.

Table 3

Experiment results on Corel5K ( $\uparrow$ )

Algorithm	MI-F	MA-F	MA-AUC	RUNTIME
BR	0.2075	0.1813	0.5700	287702
LDAML (BR)-2	0.2186	0.1708	0.5655	288766
LDAML-IMB (BR)-2	0.2804	0.2463	0.7052	897333
CC	0.2233	0.1859	0.5815	255939
LDAML (CC)-2	0.2464	0.1991	0.5831	251379
LDAML-IMB (CC)-2	0.2924	0.2525	0.7077	853848
ECC	0.2673	0.2114	0.6387	974853
LDAML (ECC)-2	0.2687	0.2120	0.6280	1036669
LDAML-IMB (ECC)-2	0.2874	0.2569	0.7153	1475874
CLR	0.2167	0.1914	0.7702	4467850
LDAML (CLR)-2	0.2279	0.2149	0.7207	4524015
LDAML-IMB (CLR)-2	0.2835	0.2512	0.7488	5655144
RAkEL	0.2206	0.1918	0.6107	1469069
LDAML (RAkEL)-2	0.2411	0.1863	0.5907	1215810
LDAML-IMB (RAkEL)-2	0.2793	0.2557	0.6929	1857057
MLkNN	0.0280	0.0337	0.6303	32890
LDAML (MLkNN)-2	0.0515	0.0635	0.6315	33680
LDAML-IMB (MLkNN)-2	0.2806	0.2529	0.7318	358812
COCOA	0.2649	0.2497	0.7057	78273587
win/loss	6/0	6/0	5/1

Table 4

Experiment results on the dataset Emotions ( $\uparrow$ )

Algorithm	MI-F	MA-F	MA-AUC	RUNTIME
BR	0.6286	0.5854	0.7178	153
LDAML (BR)-2	0.6752	0.6462	0.7490	148
LDAML-IMB (BR)-2	0.6773	0.6717	0.8140	405
CC	0.6354	0.5783	0.7213	178
LDAML (CC)-2	0.6585	0.6170	0.7363	193
LDAML-IMB (CC)-2	0.6761	0.6716	0.8032	410
ECC	0.6897	0.6849	0.8118	384
LDAML (ECC)-2	0.6882	0.6781	0.7841	382
LDAML-IMB (ECC)-2	0.6931	0.6924	0.8257	537
CLR	0.6436	0.6238	0.8265	275
LDAML (CLR)-2	0.6751	0.6478	0.8036	318
LDAML-IMB (CLR)-2	0.6923	0.6895	0.8361	505
RAkEL	0.6821	0.6716	0.8026	714
LDAML (RAkEL)-2	0.6823	0.6617	0.7727	714
LDAML-IMB (RAkEL)-2	0.6899	0.6901	0.8222	914
MLkNN	0.6501	0.6071	0.8387	98
LDAML (MLkNN)-2	0.6606	0.6388	0.8019	104
LDAML-IMB (MLkNN)-2	0.7034	0.6975	0.8394	322
COCOA	0.6714	0.6711	0.8345	731
win/loss	6/0	6/0	6/0

Table 5

Experiment results on the dataset flags ( $\uparrow$ )

Algorithm	MI-F	MA-F	MA-AUC	RUNTIME
BR	0.7285	0.6157	0.6178	125
LDAML (BR)-2	0.7200	0.6591	0.6499	93
LDAML-IMB (BR)-2	0.7265	0.6562	0.6865	251
CC	0.6851	0.5527	0.5761	76
LDAML (CC)-2	0.7000	0.6069	0.6135	65
LDAML-IMB (CC)-2	0.7398	0.6798	0.6958	231
ECC	0.6913	0.6020	0.6522	162
LDAML (ECC)-2	0.6985	0.6122	0.6683	180
LDAML-IMB (ECC)-2	0.7262	0.6751	0.6965	265
CLR	0.7184	0.6093	0.6493	209
LDAML (CLR)-2	0.7277	0.6480	0.6638	200
LDAML-IMB (CLR)-2	0.7551	0.6965	0.6852	355
RAkEL	0.7378	0.6054	0.6709	1067
LDAML (RAkEL)-2	0.7140	0.5839	0.6499	1120
LDAML-IMB (RAkEL)-2	0.7333	0.6464	0.7047	859
MLkNN	0.7081	0.5094	0.6100	9
LDAML (MLkNN)-2	0.7143	0.5311	0.6691	8
LDAML-IMB (MLkNN)-2	0.7446	0.6836	0.7058	113
COCOA	0.7177	0.6266	0.6950	599
win/loss	4/2	6/0	6/0

The LDAML-IMB results show that the imbalance metrics, macro-averaged F-measure and macro-AUC, can be significantly improved; the other metrics also improve, with occasional exceptions. Although LDAML-IMB sacrifices efficiency to build $K$ additional multiclass classifiers, it performs better than LDAML while also being more suitable for the multi-label class imbalance problem. Furthermore, even building $K$ multiclass classifiers carries a much lower time cost than using most other state-of-the-art methods.

It is worth noting that performance may initially increase with the number of topics, but will eventually plateau or even decrease. This is because, as more topics are introduced into the feature space, the label correlation information carried by a new topic may overlap with that of a previous topic. Furthermore, the topic prediction is not completely accurate, so errors may be introduced along with the new information. Therefore, the LDAML and LDAML-IMB frameworks perform best on datasets with many label classes and correlations.

Tables 3–5 report the results of the second group of experiments, with the best result marked in bold. Again, the win/loss row indicates the number of times our frameworks are superior/inferior to the compared algorithm on each dataset. We also report the runtime of each method (ms). In this group of experiments, we fixed the number of topics in the LDAML framework at 2. We focus on comparing each multi-label algorithm $H$ with the frameworks LDAML (H) and LDAML-IMB (H). The results show that the performance of almost all methods is significantly improved by the LDAML and LDAML-IMB frameworks, and that nearly all of the best results come from our frameworks. In summary, with the label correlations introduced by LDAML and the class imbalance corrected by LDAML-IMB, our frameworks provide robust, efficient solutions to multi-label learning.
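As an illustration of how the win/loss row can be read, the sketch below recomputes it for the flags dataset from the scores in Table 5. The counting rule is an assumption: a base algorithm $H$ counts as a win when the better of LDAML (H) and LDAML-IMB (H) strictly exceeds $H$ on the metric, and as a loss when it falls strictly below. The text does not state the rule explicitly, and the names `TABLE5` and `win_loss` are hypothetical.

```python
from typing import Dict, Tuple

Scores = Tuple[float, float, float]  # (MI-F, MA-F, MA-AUC)
METRICS = ("MI-F", "MA-F", "MA-AUC")

# Scores copied from Table 5 (flags); COCOA is omitted because it has no framework variant.
TABLE5: Dict[str, Dict[str, Scores]] = {
    "BR":    {"base": (0.7285, 0.6157, 0.6178), "LDAML": (0.7200, 0.6591, 0.6499), "LDAML-IMB": (0.7265, 0.6562, 0.6865)},
    "CC":    {"base": (0.6851, 0.5527, 0.5761), "LDAML": (0.7000, 0.6069, 0.6135), "LDAML-IMB": (0.7398, 0.6798, 0.6958)},
    "ECC":   {"base": (0.6913, 0.6020, 0.6522), "LDAML": (0.6985, 0.6122, 0.6683), "LDAML-IMB": (0.7262, 0.6751, 0.6965)},
    "CLR":   {"base": (0.7184, 0.6093, 0.6493), "LDAML": (0.7277, 0.6480, 0.6638), "LDAML-IMB": (0.7551, 0.6965, 0.6852)},
    "RAkEL": {"base": (0.7378, 0.6054, 0.6709), "LDAML": (0.7140, 0.5839, 0.6499), "LDAML-IMB": (0.7333, 0.6464, 0.7047)},
    "MLkNN": {"base": (0.7081, 0.5094, 0.6100), "LDAML": (0.7143, 0.5311, 0.6691), "LDAML-IMB": (0.7446, 0.6836, 0.7058)},
}


def win_loss(table: Dict[str, Dict[str, Scores]], metric: int) -> Tuple[int, int]:
    """Count base algorithms that the best framework variant beats / fails to beat."""
    wins = losses = 0
    for rows in table.values():
        best = max(rows["LDAML"][metric], rows["LDAML-IMB"][metric])
        base = rows["base"][metric]
        if best > base:
            wins += 1
        elif best < base:
            losses += 1  # exact ties are counted as neither (assumed rule)
    return wins, losses


for i, name in enumerate(METRICS):
    w, l = win_loss(TABLE5, i)
    print(f"{name}: {w}/{l}")
# MI-F: 4/2
# MA-F: 6/0
# MA-AUC: 6/0
```

Under this assumed rule the output matches the win/loss row of Table 5; the two MI-F losses come from BR and RAkEL, where neither framework variant reaches the base score.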

We conclude our experiments as follows: 1) LDAML and LDAML-IMB are general frameworks that can be applied to almost any multi-label method. Adopting a state-of-the-art method as the base method in LDAML can push performance beyond the state of the art, while adopting a simple method as the base method can bring performance close to the state of the art at low time cost; 2) LDAML can be regarded as a special, simpler version of LDAML-IMB that ignores the class-imbalance problem. Compared with LDAML-IMB, LDAML yields a smaller improvement but also incurs a lower time cost, so the framework can be selected flexibly according to the actual situation; 3) LDAML-IMB achieves a significant improvement over LDAML by correcting the class imbalance, at the price of a higher time cost; 4) As the number of topics increases, the performance of our frameworks generally improves, with occasional exceptions. These results indicate that LDAML can provide robust and preferable solutions to multi-label problems.

5. Conclusion

In this paper, we proposed an efficient and effective framework, LDAML, for multi-label classification, which exploits label correlations to improve basic multi-label classifiers. Next, we proposed an external framework, LDAML-IMB, based on LDAML to correct the class imbalance problem. Extensive experiments across benchmark datasets show that these frameworks perform better than state-of-the-art algorithms, especially in terms of imbalance-specific metrics such as the macro-averaged F-measure and macro-AUC. Furthermore, we showed that different base multi-label methods in different frameworks (LDAML/LDAML-IMB) with different topic numbers improve performance to varying degrees at varying time costs. Overall, our frameworks can achieve performance comparable to that of most state-of-the-art methods, but at a much lower time cost.

Acknowledgments

This paper is supported by the National Key Research and Development Program of China (grant no. 2016YFB1001102), the National Natural Science Foundation of China (grant nos. 61375069, 61403156, 61502227), the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University and the Fundamental Research Funds for the Central Universities (grant nos. 020214380036, 020214380038, 020214380040).
