LAMB: A novel algorithm of label collaboration based multi-label learning

Abstract

Exploiting label correlation is crucially important in multi-label learning, where each instance is associated with multiple labels simultaneously. Multi-label learning is more complex than single-label learning because the labels tend to be correlated. Traditional multi-label learning algorithms learn an independent classifier for each label and apply ranking or thresholding to the classification results. Most existing methods take label correlation as prior knowledge; they have worked well, but they fail to make full use of label dependency. As a result, the real relationship among labels may not be correctly characterized and the final prediction is not explicitly correlated. To address these problems, we propose a novel high-order multi-label learning algorithm, Label collAboration based Multi-laBel learning (LAMB). For each label, LAMB utilizes collaboration between its own prediction and the predictions of the other labels. Extensive experiments on various datasets demonstrate that the proposed LAMB algorithm achieves superior performance over existing state-of-the-art algorithms. In addition, one real-world dataset of channelrhodopsin chimeras is assessed, which would be of great value as a pre-screen for membrane protein function.

Keywords

Multi-label, label correlation, collaboration, protein engineering

1. Introduction

In the real world, objects might have multiple semantic meanings. To express the semantics of objects explicitly, one direct solution is to assign a set of proper labels to each object. Multi-label learning is one of the major learning frameworks for dealing with real-world objects with various semantics, where each instance is represented by a feature vector and assigned multiple labels [29]. Multi-label learning has various application scenarios, such as tagging songs with a subset of emotions [22], image annotation [24], text classification [13, 18], information retrieval [7], video annotation [14] and bioinformatics [27].

A common approach to multi-label learning is to transform the problem into multiple single-label learning problems [23]. Because they treat labels independently, these methods fail to model the dependency between multiple labels. Exploiting label correlation [1] in multi-label learning has been an important and practical research topic. Previous research has shown that multi-label classification problems exhibit strong label co-occurrence dependencies [25]. For instance, blue sky and cloud usually appear together, while blue sky and smog almost never co-occur. Moreover, most of these methods either cannot model higher-order label correlation, or sacrifice computational efficiency to model a more complicated label relationship [19].

Figure 1.

An example of image annotation. The image can be annotated with labels: wind chime, baby’s cot, sofa and stuffed elephant.

Taking Fig. 1 as an example, at first glance we might notice the more obvious objects in the image, such as the baby’s cot, sofa, wind chime and elephant. Such a combination of objects hints at a living room, which in turn helps us recognize the elephant on the carpet as a stuffed animal rather than a real one. The recognition route from the baby’s cot to the living room, and then to the stuffed animal, relies on the label correlation of the predictions. Among previous work, the Classifier Chains (CC) algorithm [21] is a high-order approach that considers the relationship among labels, but it is affected by the order of the predicted labels. Therefore, making full use of the high-order dependencies among labels to improve classification accuracy while reducing computational complexity is extremely challenging. Previous methods mainly focus on exploiting a uni-directional relationship. For example, $R_{a\rightarrow b}$ represents the influence of label $a$ on label $b$, and such methods rarely consider the converse relation of label $b$ to label $a$ conditioned on $R_{a\rightarrow b}$.

To resolve the above challenges, we present a novel high-order multi-label learning algorithm, Label collAboration based Multi-laBel learning (LAMB). It not only exploits label correlations well, but also maintains acceptable computational complexity. Meanwhile, we propose a novel Classifier Reuse and Label-based Reweight (CRLR) algorithm to explore the interaction of one label with all other labels in LAMB. Each label has unique characteristics and can be seen as a domain expert that conveys essential domain knowledge to the other labels. On the one hand, in each boosting round, in addition to generating a base classifier for each label set from its own hypothesis space, CRLR tries to reuse the classifier generated for the other label set. The reuse process takes into account all trained classifiers from the label itself and from all other labels. On the other hand, CRLR adopts a label-based reweight mechanism: if the classifier of one label set makes mistakes when reused by the other label set, the corresponding weight is increased. As a result, the labels in multi-label learning are able to provide rich domain knowledge to each other and extract complementary information. As shown in Fig. 2, the LAMB algorithm trains $L$ classifier sets with the CRLR algorithm, each time decomposing the full label set of the training set into two subsets. Then, for each label, LAMB uses a voting scheme to reach a consensus among $L-1$ classifiers, one from each of $L-1$ classifier sets. An example of LAMB’s voting procedure with 6 labels is shown in Table 1. $\bm{f}_{k}=\{f_{k}^{j}|j=\{1,2,\cdots,6\}\&j\neq k\}$ denotes the $k$-th classifier set, learned by CRLR $k$, where $f_{k}^{j}$ is the classifier of the $j$-th label in the $k$-th classifier set. The symbol '/' indicates that there is no classifier for the current label. Moreover, $\textit{Thr}_{t}(\cdot)$ is a simple threshold function and we adopt $t=0.5$ here.

Table 1

Example of LAMB’s voting procedure of the $i$ -th instance

	$y_{i}^{1}$	$y_{i}^{2}$	$y_{i}^{3}$	$y_{i}^{4}$	$y_{i}^{5}$	$y_{i}^{6}$
CRLR 1: $\bm{f}_{1}$	/	0	1	1	0	0
CRLR 2: $\bm{f}_{2}$	0	/	0	1	0	0
CRLR 3: $\bm{f}_{3}$	1	0	/	1	0	0
CRLR 4: $\bm{f}_{4}$	0	0	1	/	0	0
CRLR 5: $\bm{f}_{5}$	0	0	1	1	/	0
CRLR 6: $\bm{f}_{6}$	1	0	1	0	0	/
$\bm{w}$	0.4	0.0	0.8	0.8	0.0	0.0
$\hat{\bm{Y}}_{i}=\textit{Thr}_{t=0.5}(\bm{w})$	0	0	1	1	0	0
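
The voting step of Table 1 can be reproduced with a short sketch. It assumes that the weight of a label is the fraction of the available classifiers that predict 1, which matches the values of $\bm{w}$ in the table, and that a weight equal to $t$ counts as positive (the table contains no such boundary case). The function name `vote` and the use of `None` for '/' are illustrative choices, not part of the original algorithm description.

```python
def vote(predictions, t=0.5):
    """Combine per-classifier-set 0/1 predictions into final labels.

    predictions[k][j] is the prediction of classifier set k for label j,
    or None where set k holds no classifier for that label ('/' in Table 1).
    """
    n_labels = len(predictions[0])
    weights = []
    for j in range(n_labels):
        votes = [row[j] for row in predictions if row[j] is not None]
        weights.append(sum(votes) / len(votes))
    # Assumption: a label is predicted positive when its weight reaches t.
    labels = [1 if w >= t else 0 for w in weights]
    return weights, labels


# Rows CRLR 1..6 of Table 1.
table1 = [
    [None, 0, 1, 1, 0, 0],
    [0, None, 0, 1, 0, 0],
    [1, 0, None, 1, 0, 0],
    [0, 0, 1, None, 0, 0],
    [0, 0, 1, 1, None, 0],
    [1, 0, 1, 0, 0, None],
]
weights, labels = vote(table1)
```

Here `weights` and `labels` correspond to the last two rows of Table 1.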

Figure 2.

The architecture of the LAMB algorithm. The red circles in the rectangle denote the classifiers produced by the CRLR algorithm. Each invocation of CRLR outputs one classifier set composed of $L-1$ classifiers. The black circles at the bottom denote the final classifiers voted by the $L-1$ classifier sets above.

The main contributions of this paper include:

• A novel algorithm, Label collAboration based Multi-laBel learning (LAMB), is proposed to learn global label correlation with the help of our proposed Classifier Reuse and Label-based Reweight (CRLR) algorithm.

• Without pre-determining the label order for prediction, LAMB makes the prediction for each label by combining the prediction of the label itself with the predictions of the other labels. The LAMB algorithm is more compact and models high-order label co-occurrence dependency more powerfully than other state-of-the-art multi-label learning algorithms.

• One real-world channelrhodopsin chimeras dataset is manually collected and adapted for the LAMB algorithm. Furthermore, the LAMB algorithm advances the prediction performance of eukaryotic expression and plasma membrane localization, which is significant for pre-screening membrane protein function.

The rest of the paper is organized as follows. We start with a brief review of related work. Then we formulate the problem and propose LAMB algorithm. Next, experimental results are reported, followed by the conclusion.

2. Related work

In this section, we briefly review the development of multi-label learning algorithms in terms of the order of label correlation [6].

As for first-order algorithms, the most straightforward way to deal with multi-label learning is to decompose the problem into multiple binary classification sub-problems. The Binary Relevance (BR) [23] algorithm is the simplest and most efficient solution to multi-label learning. Despite its simplicity, BR treats the labels as independent of each other.

Second-order algorithms take the pairwise relationship between labels into consideration, such as the ranking between relevant and irrelevant labels, or the interaction of paired labels [30]. Furthermore, LLSF [16] learns label-specific features for each label, under the assumption that each label is associated with only a subset of the original feature set, and that any two strongly correlated class labels share more features with each other than two uncorrelated or weakly correlated ones.

As for high-order algorithms, one of the most prominent is Classifier Chains (CC) [21], which considers correlation between labels. The performance of CC is seriously affected by the training order of the labels. To account for the effect of ordering, Ensemble of Classifier Chains (ECC) [21] is an ensemble framework of CC that is built with $n$ random permutations instead of a single classifier chain. Probabilistic Classifier Chains (PCC) [11] extends CC with a probabilistic framework, which looks at the problem from the point of view of risk minimization and Bayes optimal prediction. PCC estimates the entire joint distribution of labels, and it provides a formal setting that allows a more thorough analysis of multi-label classification in general and label dependence in particular. However, PCC suffers from a computational issue: inference (i.e., predicting the labels of an example) requires time exponential in the number of labels. Although the performance of CC is largely influenced by the chain ordering, the original method uses a random ordering. To cope with this problem, OOCC [10] is proposed to find a specific and more effective chain for each new instance to be classified. The original CC algorithm makes a greedy approximation; it is fast but tends to propagate errors along the chain. MCC [15] presents Monte Carlo schemes both for finding a good chain sequence and for performing efficient inference. However, both CC and ECC only make one label help other labels, rather than making the labels help each other. To overcome these problems, there also exist high-order approaches that exploit label correlation in the hypothesis space without relying on a label correlation matrix. For example, MAHR [17] is a boosting approach that exploits label correlation with a hypothesis reuse mechanism.
Specifically, DSML [20] is proposed to exploit the pairwise inter-set label relationship where an object is associated with only two labels. In addition to chain-rule-based models, [12, 9, 8] used graph neural networks to model label dependency and obtained state-of-the-art results. All of these methods relied on knowledge-based graphs built from label co-occurrence statistics.

3. Methodology

This section briefly presents the problem formulation, and then introduces the overall strategy of our proposed LAMB algorithm in detail.

3.1 Preliminary

First, we summarize the formal symbols used in this paper. In the following, bold characters denote vectors.

Let $\mathcal{X}=\mathcal{R}^{d}$ denote the $d$-dimensional input space.

$\mathcal{Y}=\{\mathcal{Y}^{k}|k\in\{1,2,\cdots,L\}\}$ denotes the label space where $\mathcal{Y}^{k}=\{-1,1\}$ .

In multi-label learning, the task is to learn a mapping function $\mathit{H}:\mathcal{X}\rightarrow\mathcal{Y}$ given the training set $\mathcal{D}$ with $N$ data samples, $\mathcal{D}=\{(\bm{X}_{i},\bm{Y}_{i})\}_{i=1}^{N}$. The $i$-th instance $(\bm{X}_{i},\bm{Y}_{i})$ contains a feature vector $\bm{X}_{i}\in\mathcal{X}$ and a label vector $\bm{Y}_{i}\in\mathcal{Y}$, where $\bm{Y}_{i}=[y_{i}^{1},y_{i}^{2},\cdots,y_{i}^{L}]$ and $y_{i}^{k}\in\mathcal{Y}^{k}$. The multi-label learning problem is illustrated in Fig. 3.

Figure 3.

Diagrammatic illustration of multi-label learning. Circles on the right part denote labels. In the training phase, labels are tagged with '1' if classified as positive, and left empty otherwise. In the testing phase, circles with the symbol '?' mean that the labels are unknown and we need to make predictions for them.

Furthermore, we define some notations used for testing phase. Suppose there is a testing dataset $\mathcal{T}=\{(\bm{X}_{i},\bm{Y}_{i})\}_{i=1}^{M}$ with $M$ data samples. We denote $\bm{Z}_{i}=[z_{i}^{1},z_{i}^{2},\cdots,z_{i}^{L}]$ as predicted labels of $\bm{X}_{i}$ in $\mathcal{T}$ . More specifically, given a test dataset $\mathcal{T}$ , the goal is to make the prediction vectors $\bm{Z}=\{\bm{Z}_{i}\}_{i=1}^{M}$ close to the ground-truth vectors $\bm{Y}=\{\bm{Y}_{i}\}_{i=1}^{M}$ .

3.2 LAMB algorithm

Based on the above problem formulation, we present the Label collAboration based Multi-laBel learning (LAMB) algorithm. In order to exploit the relationship between the $k$-th label and the other labels, we propose the Classifier Reuse and Label-based Reweight (CRLR) algorithm. The procedure of LAMB is summarized in Alg. 3.2. The LAMB algorithm trains $L$ classifier sets with the CRLR algorithm. The $k$-th classifier set contains $L-1$ classifiers, one for each label except the $k$-th. After that, for each label, LAMB uses a voting scheme to reach a consensus among $L-1$ classifiers, one from each of $L-1$ classifier sets.

Algorithm: LAMB
Input: $\mathcal{D}$ : original dataset; $L$ : number of labels
Output: $H=[h_{1},h_{2},\cdots,h_{k},\cdots,h_{L}]$ : classifiers of all labels
for $k=1:L$ do
    $\textit{Cur}\mathcal{Y}=\mathcal{Y}\setminus\mathcal{Y}_{k}$
    $\textit{Res}\mathcal{Y}=\mathcal{Y}_{k}$
    $\{f_{k}^{j}\}_{j=1,j\neq k}^{L}=$ CRLR($k$, $\textit{Cur}\mathcal{Y}$, $\textit{Res}\mathcal{Y}$)
end for
for $j=1:L$ do
    $h_{j}=$ Vote$(\{f_{k}^{j}|k=\{1,\cdots,L\}\&k\neq j\})$
end for
return $H$
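
The control flow of the LAMB procedure can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: the CRLR routine is passed in as a caller-supplied stand-in `crlr(k)` (the dataset and the two label subsets are assumed to be captured inside it), classifiers output 0/1 as in Table 1, and the names `lamb_train`, `make_voter` and `toy_crlr` are hypothetical.

```python
def lamb_train(n_labels, crlr):
    """Return one voted classifier per label.

    crlr(k) is a caller-supplied stand-in for the CRLR routine: it must
    return a dict {j: f_kj} with one classifier per label j != k, where
    each classifier maps a feature vector to 0 or 1.
    """
    classifier_sets = {k: crlr(k) for k in range(n_labels)}

    def make_voter(j):
        # Label j is voted on by the L-1 sets that hold a classifier for it.
        members = [classifier_sets[k][j] for k in range(n_labels) if k != j]

        def h_j(x, t=0.5):
            weight = sum(f(x) for f in members) / len(members)
            return 1 if weight >= t else 0

        return h_j

    return [make_voter(j) for j in range(n_labels)]


# Toy stand-in: every classifier set predicts label j as positive
# when feature j is positive, regardless of k.
def toy_crlr(k, n_labels=3):
    return {j: (lambda x, j=j: 1 if x[j] > 0 else 0)
            for j in range(n_labels) if j != k}


H = lamb_train(3, toy_crlr)
prediction = [h([0.5, -1.0, 2.0]) for h in H]
```

In a real run, `crlr(k)` would train the $k$-th classifier set on the training data; the toy stand-in only exercises the voting structure.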

Next, we introduce our proposed CRLR algorithm for the LAMB algorithm in detail. The procedure of CRLR is shown in Alg. 3.2 and the additional notations are listed as follows.

$\mathcal{D}_{C}^{0}=\{(\bm{X}_{i},\bm{\textit{CurY}}_{i})\}_{i=1}^{N}$ is the initial dataset including all instances in $\mathcal{D}$ , where $\bm{X}_{i}\in\mathcal{X}$ , $\bm{\textit{CurY}}_{i}\in\textit{Cur}\mathcal{Y}$ .

$\mathcal{D}_{R}^{0}=\{(\bm{X}_{i},\bm{\textit{ResY}}_{i})\}_{i=1}^{N}$ is the initial dataset including all instances in $\mathcal{D}$ , where $\bm{X}_{i}\in\mathcal{X}$ , $\bm{\textit{ResY}}_{i}\in\textit{Res}\mathcal{Y}$ .

$\mathcal{D}_{C}$ denotes the dataset sampled from $\mathcal{D}_{C}^{0}$ according to the sample distribution $(\bm{W}_{C})^{t}=\{(W_{C}^{i})^{t}\}_{i=1}^{N}$ at $t$ -th boosting round, where $(W_{C}^{i})^{t}$ denotes the weight for the $i$ -th instance in $\mathcal{D}_{C}^{0}$ .

$\mathcal{D}_{R}$ denotes the dataset sampled from $\mathcal{D}_{R}^{0}$ according to the sample distribution $(\bm{W}_{R})^{t}=\{(W_{R}^{i})^{t}\}_{i=1}^{N}$ at $t$ -th boosting round, where $(W_{R}^{i})^{t}$ denotes the weight for the $i$ -th instance in $\mathcal{D}_{R}^{0}$ .

The CRLR algorithm decomposes the original problem into two dependent classification problems in the boosting framework. A common advantage of ensembles is their well-known effect of generally increasing overall predictive performance. In the classical AdaBoost algorithm, an instance that the classifier has misclassified is emphasized by increasing its weight. The CRLR algorithm maintains two sample distributions, $(\bm{W}_{C})^{t}$ and $(\bm{W}_{R})^{t}$, in the boosting framework. At each boosting round, we first sample two datasets $\mathcal{D}_{C}$ and $\mathcal{D}_{R}$ with distributions $(\bm{W}_{C})^{t}$ and $(\bm{W}_{R})^{t}$, respectively.
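
The sampling step can be illustrated as follows. The text does not state the sample size or whether sampling is with replacement; the sketch assumes the usual boosting-by-resampling convention of drawing $N$ instances with replacement, and the helper name `resample` is hypothetical.

```python
import random


def resample(dataset, weights, rng):
    """Draw len(dataset) instances with replacement, proportionally to weights."""
    return rng.choices(dataset, weights=weights, k=len(dataset))


rng = random.Random(0)
data = ["x1", "x2", "x3", "x4"]
weights = [0.0, 0.1, 0.1, 0.8]  # instance weights at the current round
sample = resample(data, weights, rng)
```

Instances with larger weights are expected to appear more often in `sample`, and an instance with zero weight is never drawn.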

Algorithm: CRLR
Input: $k$ : $k$-th label in LAMB; $\textit{Cur}\mathcal{Y}$ : label space of all labels except the $k$-th label; $\textit{Res}\mathcal{Y}$ : label space of the $k$-th label; $T$ : number of boosting rounds; $\lambda$ : label-based reweight parameter
Output: $\bm{f}_{k}=\{f_{k}^{j}\}_{j=1,j\neq k}^{L}$ : classifiers of the label set in $\textit{Cur}\mathcal{Y}$
Initialize: $(W_{C}^{i})^{0}=(W_{R}^{i})^{0}=\frac{1}{N}$, $\mathcal{D}_{C}=\mathcal{D}_{C}^{0}$, $\mathcal{D}_{R}=\mathcal{D}_{R}^{0}$
for $t=1:T$ do
    if $t>1$ then
        Sample $\mathcal{D}_{C}$ from $\mathcal{D}_{C}^{0}$ according to $(\bm{W}_{C})^{t}$
        Sample $\mathcal{D}_{R}$ from $\mathcal{D}_{R}^{0}$ according to $(\bm{W}_{R})^{t}$
    end if
    Train classifiers $g_{1}$, $g_{2}$ and $g_{3}$ in order
    Compute $G_{C}^{t}(\cdot)$, $G_{R}^{t}(\cdot)$ with Eq. (1)
    Calculate $\alpha_{C}^{t}$ and $\alpha_{R}^{t}$ with Eq. (2)
    Update $(W_{C}^{i})^{t+1}$ and $(W_{R}^{i})^{t+1}$ with Eq. (7)
    Normalize $(W_{C}^{i})^{t+1}$ and $(W_{R}^{i})^{t+1}$
    if $A_{C}^{t}<A_{C}^{0}$ or $A_{R}^{t}<A_{R}^{0}$ then
        break
    end if
end for
for $j=1:L\ \&\ j\neq k$ do
    Compute $f_{k}^{j}$ with Eq. (8)
end for
return $\bm{f}_{k}$

Figure 4.

Overview of LAMB algorithm at $t$ -th boosting round. On the one hand, Classifier 1 is reused by Classifier 2 and Classifier 3. On the other hand, Classifier 2 is reused by Classifier 3. We repeat the procedure until reaching the stop criterion.

Classifier reuse

As illustrated in Fig. 4, we train three classifiers at each boosting round.

• Classifier 1: Given the training dataset $\mathcal{D}_{C}$, the task is to learn the mapping function $g_{1}:\mathcal{X}\rightarrow\textit{Cur}\mathcal{Y}$.

• Classifier 2: Classifier 1 is reused to make predictions for the label set in $\textit{Cur}\mathcal{Y}$ on $\mathcal{D}_{R}$. In other words, Classifier 1 serves as a domain expert to acquire the label set $\textit{Cur}\mathcal{Y}$. We have $\bm{Z}^{C}=g_{1}(\bm{X})$, $\bm{Z}^{C}\in\textit{Cur}\mathcal{Y}$ and $\bm{Z}^{C}=\{{\bm{Z}^{j}\}_{j=1,j\neq k}^{L}}$, where $\bm{Z}^{j}$ denotes the $j$-th label of all instances. Given the training dataset $\mathcal{D}_{R}$ and the predicted labels $\bm{Z}^{C}$, the task is to learn the mapping function $g_{2}:\mathcal{X}+\textit{Cur}\mathcal{Y}\rightarrow\textit{Res}\mathcal{Y}$.

• Classifier 3: Similarly, Classifier 1 is reused to make predictions for the label set in $\textit{Cur}\mathcal{Y}$. Then, Classifier 2 is reused to make predictions for the label set in $\textit{Res}\mathcal{Y}$. Classifier 2 serves as another domain expert, which helps ensure accurate annotation. We have $\bm{Z}^{R}=g_{2}([\bm{X},g_{1}(\bm{X})])$, $\bm{Z}^{R}\in\textit{Res}\mathcal{Y}$ and $\bm{Z}^{R}=\bm{Z}^{k}$. Given the training dataset $\mathcal{D}_{C}$ and the predicted label $\bm{Z}^{R}$, the task is to learn the mapping function $g_{3}:\mathcal{X}+\textit{Res}\mathcal{Y}\rightarrow\textit{Cur}\mathcal{Y}$.

Afterwards, we get classifier $G_{C}^{t}(\cdot)$ for label set in $\textit{Cur}\mathcal{Y}$ and classifier $G_{R}^{t}(\cdot)$ for label set in $\textit{Res}\mathcal{Y}$ at $t$ -th boosting round. $G_{R}^{t}(\cdot)=G_{k}^{t}(\cdot)$ and $G_{C}^{t}(\cdot)=\{G_{j}^{t}(\cdot)\}_{j=1,j\neq k}^{L}$ , where $G_{j}^{t}(\cdot)$ represents classifier for the $j$ -th label.

$\displaystyle G_{C}^{t}(\bm{X})=g_{3}([\bm{X},\bm{Z}^{R}]),\qquad G_{R}^{t}(\bm{X})=g_{2}([\bm{X},\bm{Z}^{C}])$ (1)

where $\bm{Z}^{R}=g_{2}([\bm{X},g_{1}(\bm{X})])$ and $\bm{Z}^{C}=g_{1}(\bm{X})$.
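
The chaining of the three classifiers in Eq. (1) can be sketched as below. The base learner is not fixed by the text above, so a toy 1-nearest-neighbour learner (`fit_1nn`, hypothetical) is used purely as a placeholder; any learner with the same fit-and-predict shape could be substituted. Feature vectors are augmented by simple concatenation, which is one reading of the "$+$" in the mapping signatures.

```python
def fit_1nn(X, Y):
    """Toy base learner: 1-nearest-neighbour on squared Euclidean distance."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return Y[dists.index(min(dists))]
    return predict


def train_round(X_C, cur_Y, X_R, res_Y, fit=fit_1nn):
    """Train g1, g2, g3 in order and return (G_C, G_R) as in Eq. (1).

    cur_Y[i] is a tuple of the L-1 labels in CurY; res_Y[i] is the single
    label in ResY.
    """
    g1 = fit(X_C, cur_Y)                                   # X -> CurY
    g2 = fit([list(x) + list(g1(x)) for x in X_R], res_Y)  # [X, Z^C] -> ResY

    def G_R(x):
        return g2(list(x) + list(g1(x)))

    g3 = fit([list(x) + [G_R(x)] for x in X_C], cur_Y)     # [X, Z^R] -> CurY

    def G_C(x):
        return g3(list(x) + [G_R(x)])

    return G_C, G_R


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur_Y = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
res_Y = [-1, -1, -1, 1]
G_C, G_R = train_round(X, cur_Y, X, res_Y)
```

In this toy run both sampled datasets are the full training set, as in the first boosting round where $\mathcal{D}_{C}=\mathcal{D}_{C}^{0}$ and $\mathcal{D}_{R}=\mathcal{D}_{R}^{0}$.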

$\alpha_{C}^{t}$ and $\alpha_{R}^{t}$ denote the weights of classifiers $G_{C}^{t}(\cdot)$ and $G_{R}^{t}(\cdot)$ in the final prediction, respectively. They are derived by bounding the overall training error.

$\displaystyle\alpha_{C}^{t}=\frac{1}{2}\log\frac{A_{C}^{t}}{1-A_{C}^{t}},\qquad\alpha_{R}^{t}=\frac{1}{2}\log\frac{A_{R}^{t}}{1-A_{R}^{t}}$ (2)

where $A_{C}^{t}$ and $A_{R}^{t}$ are subset accuracy [28] on the original dataset $\mathcal{D}_{C}^{0}$ and $\mathcal{D}_{R}^{0}$ at $t$ -th boosting round, respectively.
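
As a small numerical illustration of Eq. (2), the classifier weight can be computed from an accuracy value as follows. The clipping constant `eps` is an added safeguard for accuracies of exactly 0 or 1 and is not part of the original formulation; the function name is hypothetical.

```python
import math


def classifier_weight(accuracy, eps=1e-10):
    """alpha = 0.5 * log(A / (1 - A)), as in Eq. (2).

    The clipping by eps is an added safeguard (not from the paper) so that
    A = 0 or A = 1 does not cause a division by zero or log(0).
    """
    a = min(max(accuracy, eps), 1.0 - eps)
    return 0.5 * math.log(a / (1.0 - a))


alpha_even = classifier_weight(0.5)
alpha_good = classifier_weight(0.8)
alpha_poor = classifier_weight(0.2)
```

The weight is zero at an accuracy of 0.5, positive above it and negative below it, so a round with subset accuracy under 0.5 contributes a negative weight.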

Label-based reweight

We maintain two datasets $\mathcal{D}_{C}$ and $\mathcal{D}_{R}$, which are sampled with distribution weights $(\bm{W}_{C})^{t}$ and $(\bm{W}_{R})^{t}$. To better exploit label correlation, we update the distribution weights $(\bm{W}_{C})^{t}$ and $(\bm{W}_{R})^{t}$ from the following two aspects.

• Considering the mistakes made by its own classifier.

An instance that the classifier has misclassified is emphasized by assigning it a higher weight.

As for $\bm{X}_{i}$ in $\mathcal{D}_{C}$ , $(\theta_{C}^{i})^{t}$ increases the weight $(W_{C}^{i})^{t+1}$ when $\bm{X}_{i}$ is misclassified by classifier $G_{C}^{t}$ .

$\displaystyle(\theta_{C}^{i})^{t}=\exp\left(\alpha_{C}^{t}\cdot\mathbb{I}(\bm{\textit{CurY}}_{i}\not=G_{C}^{t}(\bm{X}_{i}))\right)$ (3)

where $\mathbb{I}(\cdot)$ denotes the indicator function which outputs 1 if $\cdot$ is true, 0 otherwise.

As for $\bm{X}_{i}$ in $\mathcal{D}_{R}$ , $(\theta_{R}^{i})^{t}$ increases the weight $(W_{R}^{i})^{t+1}$ when $\bm{X}_{i}$ is misclassified by classifier $G_{R}^{t}$ .

$\displaystyle(\theta_{R}^{i})^{t}=\exp\left(\alpha_{R}^{t}\cdot\mathbb{I}(\bm{\textit{ResY}}_{i}\not=G_{R}^{t}(\bm{X}_{i}))\right)$ (4)

• Considering the mistakes made by classifier reuse.

We denote by $\lambda$ the label-based reweight parameter, which adjusts the distribution of $\mathcal{D}_{C}$ with the help of classifier $G_{R}^{t}(\cdot)$ and adjusts the distribution of $\mathcal{D}_{R}$ with the help of classifier $G_{C}^{t}(\cdot)$. In other words, $\lambda$ is used to increase the weight of an instance on one label set if it is misclassified on the other label set. As a result, the classification of one label set gets additional help from the other label set.

As for $\bm{X}_{i}$ in $\mathcal{D}_{C}$ , $(\beta_{C}^{i})^{t}$ increases the weight $(W_{C}^{i})^{t+1}$ when $\bm{X}_{i}$ is misclassified by reused classifier $G_{R}^{t}$ .

$\displaystyle(\beta_{C}^{i})^{t}=\lambda^{\mathbb{I}(\bm{\textit{ResY}}_{i}\not=G_{R}^{t}(\bm{X}_{i}))}$ (5)

As for $\bm{X}_{i}$ in $\mathcal{D}_{R}$ , $(\beta_{R}^{i})^{t}$ increases the weight $(W_{R}^{i})^{t+1}$ when $\bm{X}_{i}$ is misclassified by reused classifier $G_{C}^{t}$ .

$\displaystyle(\beta_{R}^{i})^{t}=\lambda^{\mathbb{I}(\bm{\textit{CurY}}_{i}\not=G_{C}^{t}(\bm{X}_{i}))}$ (6)

In summary, the sample distribution weights $(W_{C}^{i})^{t+1}$ and $(W_{R}^{i})^{t+1}$ are updated according to:

$\displaystyle(W_{C}^{i})^{t+1}=(W_{C}^{i})^{t}\cdot(\theta_{C}^{i})^{t}\cdot(\beta_{C}^{i})^{t},\qquad(W_{R}^{i})^{t+1}=(W_{R}^{i})^{t}\cdot(\theta_{R}^{i})^{t}\cdot(\beta_{R}^{i})^{t}$ (7)

so the weight of an instance that keeps being misclassified can be amplified over the iterations.
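
The update of Eqs. (3) to (7), followed by the normalization step of the CRLR procedure, can be sketched for one label set as follows. The function and argument names are hypothetical, and the misclassification indicators are assumed to be precomputed as booleans.

```python
import math


def update_weights(w, own_wrong, other_wrong, alpha, lam):
    """One reweighting step for a single label set, following Eqs. (3)-(7).

    w           : current instance weights
    own_wrong   : own_wrong[i] is True if the set's own classifier
                  misclassifies instance i
    other_wrong : other_wrong[i] is True if the other label set's
                  classifier misclassifies instance i
    alpha       : weight of the set's own classifier, Eq. (2)
    lam         : label-based reweight parameter lambda
    """
    new_w = []
    for w_i, own, other in zip(w, own_wrong, other_wrong):
        theta = math.exp(alpha * (1 if own else 0))  # Eq. (3) / (4)
        beta = lam ** (1 if other else 0)            # Eq. (5) / (6)
        new_w.append(w_i * theta * beta)             # Eq. (7)
    total = sum(new_w)
    return [v / total for v in new_w]                # normalization step


w_next = update_weights(
    w=[0.25, 0.25, 0.25, 0.25],
    own_wrong=[False, True, False, True],
    other_wrong=[False, False, True, True],
    alpha=math.log(2.0),
    lam=3.0,
)
```

With these made-up values the unnormalized factors are 1, 2, 3 and 6, so the instance misclassified on both label sets receives the largest weight.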

To classify efficiently, we define a stop criterion for our algorithm. Boosting stops when either of the following holds:

• $A_{C}^{t}<A_{C}^{0}$ or $A_{R}^{t}<A_{R}^{0}$, in which case the classifier is considered too weak to achieve better performance in the boosting framework.

• The maximum number of boosting rounds $T$ is reached.

Finally, taking classifier $G_{j}^{t}(\cdot)$ in $G_{C}^{t}(\cdot)$ into consideration, labels are predicted for instance $\bm{X}$ in testing dataset according to:

$\displaystyle f_{k}^{j}(\bm{X})=\underset{\bm{\textit{ResY}}^{j}}{\textit{argmax}}\sum_{t=1}^{T}\alpha_{C}^{t}\cdot\mathbb{I}(G_{j}^{t}(\bm{X})=\bm{\textit{ResY}}^{j})$ (8)

where $j=\{1,\cdots,{L}\&j\neq k\}$ .
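
Eq. (8) is a weighted vote over boosting rounds, which can be illustrated for a single label of a single instance as below. The function name is hypothetical, the candidate label values $\{-1,1\}$ follow the label space defined in Section 3.1, and ties are resolved in favour of the first candidate, which is an assumption since the text does not specify tie-breaking.

```python
def weighted_vote(round_predictions, alphas, candidates=(-1, 1)):
    """Return the candidate label with the largest alpha-weighted support.

    round_predictions[t] is the prediction G_j^t(x) of boosting round t for
    one label of one instance; alphas[t] is the weight of that round.
    """
    def support(y):
        return sum(a for p, a in zip(round_predictions, alphas) if p == y)
    return max(candidates, key=support)


label_a = weighted_vote([1, -1, -1], [0.9, 0.3, 0.4])
label_b = weighted_vote([1, -1, -1], [0.2, 0.3, 0.4])
```

In the first call a single confident round outweighs two weaker ones, whereas in the second call the two rounds predicting $-1$ prevail.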

4. Experiments

In this section, we first describe the experimental datasets and experimental settings. We then present extensive experimental results and comparisons that demonstrate the superiority of our proposed LAMB algorithm. All experiments are run on a machine with a 3.2 GHz Intel Core i7 processor and 64 GB of main memory.

4.1 Dataset description

The experiments are conducted on three public multi-label datasets and one manually collected real-world dataset. Note that there are no noisy or missing values in these datasets. Details of the datasets are summarized in Table 2, where $N$ and $L$ denote the number of instances and labels in each dataset. LCard, LDen and LDiv show characteristics of the multi-label datasets and represent label cardinality, label density and label diversity, respectively.

Scene

A natural scene may contain multiple objects, so that the scene can be described by multiple labels (e.g., a field scene with a mountain in the background). The semantic Scene [3] classification dataset is a public multi-label dataset with 2407 instances, where each instance has 294 features and 6 labels.

Emotions

A piece of music may belong to more than one class. Emotions [22] is a publicly available dataset with 72 music features for 593 songs categorized into one or more out of 6 classes of emotions.

Yeast

Yeast Saccharomyces cerevisiae is one of the best studied organisms and Yeast [28] dataset is one of the most commonly used multi-label bioinformatic dataset. Each gene is described by the concatenation of micro-array expression data and phylogenetic profile. In Yeast gene functional classification dataset, there are 2417 genes each represented by a 103-dimensional feature vector and each gene is associated with 14 possible gene functional labels.

ChRs

A small subset of 243 ChR chimeras (0.2%) is chosen from a 118,098-variant ChR recombination library of three parent ChRs [2]. The construction of the ChRs dataset is described from the following two aspects.

As for the input, we run ProFET over each sequence in $\mathcal{D}$ to extract features, resulting in a vector of 1173 features. ProFET [6] has been used in various function prediction tasks. Most features capture statistically informative patterns, and the extracted features show excellent biological interpretability. In addition, different representations of the sequences and of the amino acid (AA) alphabet provide a compact, compressed set of features. These features can be divided into six categories: biophysical quantitative properties; letter-based features; local potential features; information-based statistics; AA scale-based features; and transformed CTD features [26].

As for the output, the 243 sequences are synthesized and their expression and localization are measured; these two measurements can be considered as two labels of membrane proteins. The biological experimental results (expression and localization), which are in the form of discrete values, need to be transformed into class labels, i.e. $1$ or $-1$. Eukaryotic expression of a sequence is labeled $1$ if it performs at least as well as the lowest performing parent, and $-1$ if it performs worse than the lowest performing parent. Because the lowest performing parent for expression and localization, CheRiff, is produced and localized in sufficient quantities for downstream functional studies, [3] considers this to be an appropriate threshold for $1$ vs $-1$ performance. Plasma membrane localization of a sequence is preprocessed in the same way as eukaryotic expression.
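
The labeling rule can be written as a one-line threshold. The sketch assumes that larger measured values mean better performance; the function name and the numbers are made up for illustration and are not taken from the ChRs data.

```python
def binarize(values, parent_values):
    """Map measured values to labels 1 / -1.

    A sequence is labeled 1 if its value is at least that of the lowest
    performing parent, and -1 otherwise.
    """
    threshold = min(parent_values)
    return [1 if v >= threshold else -1 for v in values]


# Made-up measurements, for illustration only.
labels = binarize(values=[0.2, 0.5, 0.9], parent_values=[0.5, 0.8, 1.1])
```

The same function would be applied separately to the expression and the localization measurements to obtain the two labels.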

Table 2
Statistics of datasets

Name	Domain	N	L	LCard	LDen	LDiv
Scene	Image	2407	6	1.074	0.179	15
Emotions	Music	593	6	1.869	0.311	27
Yeast	Biology	2417	14	4.237	0.303	198
ChRs	Biology	243	2	0.374	0.187	2
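
The three statistics in Table 2 can be computed from a label matrix as sketched below, using the standard definitions: label cardinality is the average number of positive labels per instance, label density is the cardinality divided by $L$, and label diversity is the number of distinct label sets. These definitions are consistent with the table (for example, $1.074/6\approx 0.179$ for Scene), but the function itself is an illustration on toy data.

```python
def label_statistics(Y):
    """Y is a list of label vectors with entries in {-1, 1}."""
    n, n_labels = len(Y), len(Y[0])
    lcard = sum(sum(1 for y in row if y == 1) for row in Y) / n
    lden = lcard / n_labels
    ldiv = len({tuple(row) for row in Y})
    return lcard, lden, ldiv


Y = [[1, -1, -1], [1, 1, -1], [1, -1, -1], [-1, -1, 1]]
lcard, lden, ldiv = label_statistics(Y)
```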

4.2 Comparing methods

We compare the high-order LAMB algorithm with the following multi-label algorithms:

• first-order algorithm: BR [23]
• second-order algorithm: LLSF [16]
• high-order algorithms: CC, ECC [21], MLKNN [28] and DSML [20] (for the ChRs dataset)

4.3 Evaluation metrics

$F_{1}$ score is one of the most popular metrics for evaluation of binary classification [4]. Three widely adopted evaluation criteria based on $F_{1}$ score, i.e., $\textit{MacroF}_{1}$ , $\textit{MicroF}_{1}$ and $\textit{ExampleF}_{1}$ are used to measure the performance of multi-label classification algorithms. To have a fair comparison, we employ five widely adopted standard metrics, i.e., HammingLoss, SubsetAcc, $\textit{MacroF}_{1}$ , $\textit{MicroF}_{1}$ and $\textit{ExampleF}_{1}$ which are listed as follows:

4.3.1. HammingLoss evaluates the fraction of misclassified instance-label pairs, i.e. a relevant label is missed or an irrelevant label is predicted.

$\displaystyle\textit{HammingLoss}=\frac{1}{M}\sum_{i=1}^{M}\frac{1}{L}|H(\bm{X}_{i})\triangle\bm{Y}_{i}|$ (9)

where $\triangle$ stands for the symmetric difference between two sets.

4.3.2. SubsetAcc evaluates the fraction of correctly classified examples, i.e. the predicted label set is identical to the ground-truth label set. Intuitively, subset accuracy can be regarded as a multi-label counterpart of the traditional accuracy metric, and tends to be overly strict especially when the size of label space (i.e. $L$ ) is large.

$\displaystyle\textit{SubsetAcc}=\frac{1}{M}\sum_{i=1}^{M}\mathbb{I}(H(\bm{X}_{i})=\bm{Y}_{i})$ (10)

where $\mathbb{I}(\cdot)$ denotes the indicator function which outputs 1 if $\cdot$ is true, 0 otherwise.
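
Eqs. (9) and (10) translate directly into code. The following sketch uses plain Python lists with labels in $\{-1,1\}$; the function names and the toy data are illustrative.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs predicted incorrectly, Eq. (9)."""
    n_labels = len(Y_true[0])
    wrong = sum(sum(1 for a, b in zip(t, p) if a != b)
                for t, p in zip(Y_true, Y_pred))
    return wrong / (len(Y_true) * n_labels)


def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose label vector is predicted exactly, Eq. (10)."""
    exact = sum(1 for t, p in zip(Y_true, Y_pred) if list(t) == list(p))
    return exact / len(Y_true)


Y_true = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [-1, 1, -1]]
Y_pred = [[1, -1, 1], [-1, 1, 1], [1, 1, -1], [1, -1, -1]]
hl = hamming_loss(Y_true, Y_pred)
acc = subset_accuracy(Y_true, Y_pred)
```

On this toy data 3 of the 12 instance-label pairs are wrong and 2 of the 4 instances are predicted exactly, which shows how a single wrong label already costs an instance its subset-accuracy credit.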

4.3.3. $\textit{MacroF}_{1}$ calculates metrics for each label, and finds their unweighted mean which is an arithmetic average of $F_{1}$ of $L$ labels. But $\textit{MacroF}_{1}$ does not take label imbalance into account.

$\displaystyle\textit{MacroF}_{1}=\frac{1}{L}\sum_{i=1}^{L}\frac{2TP_{i}}{(2TP_{i}+FP_{i}+FN_{i})}$ (11)

where $TP_{i}$, $FP_{i}$ and $FN_{i}$ denote the number of true positives, false positives and false negatives for the $i$-th label, respectively.

4.3.4. $\textit{MicroF}_{1}$ calculates metrics globally by counting the total true positives, false negatives and false positives, which can be considered as a weighted average of the $F_{1}$ score over the $L$ labels.

$\displaystyle\textit{MicroF}_{1}=\frac{\sum_{i=1}^{L}{2TP_{i}}}{\sum_{i=1}^{L}(2TP_{i}+FP_{i}+FN_{i})}$ (12)

4.3.5. $\textit{ExampleF}_{1}$ is an integrated version of $\textit{Precision}(H)$ and $\textit{Recall}(H)$ with balancing factor $\beta>0$. The most common choice is $\beta=1$, which leads to the harmonic mean of precision and recall.

$\displaystyle\textit{ExampleF}_{1}=\frac{2\cdot\textit{Precision}(H)\cdot\textit{Recall}(H)}{\textit{Precision}(H)+\textit{Recall}(H)}$ (13)
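
The three $F_{1}$ variants of Eqs. (11) to (13) can be sketched as follows. Two points are assumptions rather than statements from the text: $\textit{Precision}(H)$ and $\textit{Recall}(H)$ are taken to be the example-based precision and recall averaged over instances, and any $0/0$ ratio is scored as 0. All function names are hypothetical.

```python
def _counts(Y_true, Y_pred, j):
    tp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] != 1 and p[j] == 1)
    fn = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] != 1)
    return tp, fp, fn


def _ratio(num, den):
    # Assumption: a 0/0 ratio is scored as 0.
    return num / den if den else 0.0


def macro_f1(Y_true, Y_pred):
    n_labels = len(Y_true[0])
    scores = []
    for j in range(n_labels):
        tp, fp, fn = _counts(Y_true, Y_pred, j)
        scores.append(_ratio(2 * tp, 2 * tp + fp + fn))
    return sum(scores) / n_labels


def micro_f1(Y_true, Y_pred):
    totals = [_counts(Y_true, Y_pred, j) for j in range(len(Y_true[0]))]
    tp = sum(c[0] for c in totals)
    fp = sum(c[1] for c in totals)
    fn = sum(c[2] for c in totals)
    return _ratio(2 * tp, 2 * tp + fp + fn)


def example_f1(Y_true, Y_pred):
    m = len(Y_true)
    precision = recall = 0.0
    for t, p in zip(Y_true, Y_pred):
        hit = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        precision += _ratio(hit, sum(1 for b in p if b == 1)) / m
        recall += _ratio(hit, sum(1 for a in t if a == 1)) / m
    return _ratio(2 * precision * recall, precision + recall)


Y_true = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [-1, 1, -1]]
Y_pred = [[1, -1, 1], [-1, 1, 1], [1, 1, -1], [1, -1, -1]]
macro = macro_f1(Y_true, Y_pred)
micro = micro_f1(Y_true, Y_pred)
example = example_f1(Y_true, Y_pred)
```

On this toy data the per-label $F_{1}$ scores are 0.8, 0.5 and 1.0, so the macro average weights each label equally while the micro average pools the counts of all labels.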

4.4 Experimental results

For all these algorithms, we report the results obtained with the best-performing parameters in terms of classification performance. We perform 10-fold cross validation (CV) and report the average of the results over the folds.
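
For reference, a 10-fold split of the instance indices can be generated as sketched below. Whether the folds were shuffled or stratified is not stated, so the shuffling with a fixed seed is an assumption, and the function name is hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # assumption: folds are shuffled
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 243 is the number of instances in the ChRs dataset.
splits = list(k_fold_indices(243))
```

Each instance appears in exactly one test fold, and the reported score is the mean over the 10 folds.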

Tables 3 and 4 report the detailed experimental results on the three public datasets and the real-world channelrhodopsin chimeras dataset, respectively, where the best performance among all the algorithms is shown in boldface. For the ChRs dataset, CC$_{01}$ denotes the CC algorithm with the label chain set to $\{0,1\}$ and CC$_{10}$ denotes the CC algorithm with the label chain set to $\{1,0\}$. On the three public datasets in Table 3, LAMB achieves the highest SubsetAcc throughout and the best $F_{1}$-based scores on Scene and Emotions, while CAMEL attains the lowest HammingLoss on Scene and Yeast and the best $F_{1}$-based scores on Yeast, and BR attains the lowest HammingLoss on Emotions.

Table 3
Predictive performance (mean $\pm$ std) comparison on three public multi-label datasets. $\uparrow/\downarrow$ indicates that larger/smaller values of a criterion are better. The best performance on each dataset is bolded.

Algorithm	Evaluation metrics
	HammingLoss $\downarrow$	SubsetAcc $\uparrow$	$\textit{MacroF}_{1}\uparrow$	$\textit{MicroF}_{1}\uparrow$	$\textit{ExampleF}_{1}\uparrow$
Scene
BR	0.1044 $\pm$ 0.0088	0.5363 $\pm$ 0.0414	0.6924 $\pm$ 0.0254	0.6884 $\pm$ 0.0266	0.6267 $\pm$ 0.0338
LLSF	0.1055 $\pm$ 0.0056	0.4869 $\pm$ 0.0284	0.6444 $\pm$ 0.0269	0.6425 $\pm$ 0.0270	0.5363 $\pm$ 0.0336
CC	0.1048 $\pm$ 0.0103	0.6526 $\pm$ 0.0312	0.7125 $\pm$ 0.0280	0.7025 $\pm$ 0.0288	0.7051 $\pm$ 0.0298
ECC	0.0937 $\pm$ 0.0050	0.5958 $\pm$ 0.0158	0.7102 $\pm$ 0.0122	0.7050 $\pm$ 0.0155	0.6388 $\pm$ 0.0137
MLKNN	0.0892 $\pm$ 0.0077	0.6294 $\pm$ 0.0302	0.7429 $\pm$ 0.0224	0.7392 $\pm$ 0.0212	0.7098 $\pm$ 0.0243
CAMEL	0.0756 $\pm$ 0.0057	0.6464 $\pm$ 0.0241	0.7722 $\pm$ 0.0210	0.7631 $\pm$ 0.0191	0.6949 $\pm$ 0.0268
LAMB	0.0779 $\pm$ 0.0086	0.7146 $\pm$ 0.0259	0.7822 $\pm$ 0.0258	0.7772 $\pm$ 0.0248	0.7678 $\pm$ 0.0248
Emotions
BR	0.1990 $\pm$ 0.0193	0.2647 $\pm$ 0.0626	0.6181 $\pm$ 0.0345	0.6530 $\pm$ 0.0427	0.5937 $\pm$ 0.0579
LLSF	0.2072 $\pm$ 0.0143	0.2542 $\pm$ 0.0488	0.6159 $\pm$ 0.0325	0.6412 $\pm$ 0.0333	0.5939 $\pm$ 0.0390
CC	0.2158 $\pm$ 0.0293	0.2917 $\pm$ 0.0620	0.6140 $\pm$ 0.0598	0.6490 $\pm$ 0.0534	0.6226 $\pm$ 0.0565
ECC	0.2021 $\pm$ 0.0231	0.2832 $\pm$ 0.0417	0.6267 $\pm$ 0.0387	0.6474 $\pm$ 0.0383	0.5844 $\pm$ 0.0461
MLKNN	0.2783 $\pm$ 0.0219	0.2124 $\pm$ 0.0382	0.4960 $\pm$ 0.0333	0.5289 $\pm$ 0.0296	0.4945 $\pm$ 0.0257
CAMEL	0.2269 $\pm$ 0.0135	0.2073 $\pm$ 0.0335	0.5415 $\pm$ 0.0275	0.5662 $\pm$ 0.0222	0.4933 $\pm$ 0.0291
LAMB	0.2037 $\pm$ 0.0281	0.3070 $\pm$ 0.0487	0.6397 $\pm$ 0.0496	0.6671 $\pm$ 0.0440	0.6407 $\pm$ 0.0462
Yeast
BR	0.1992 $\pm$ 0.0063	0.1469 $\pm$ 0.0227	0.3231 $\pm$ 0.0077	0.6330 $\pm$ 0.0135	0.6101 $\pm$ 0.0163
LLSF	0.2953 $\pm$ 0.0061	0.0029 $\pm$ 0.0026	0.2449 $\pm$ 0.0053	0.2934 $\pm$ 0.0066	0.2473 $\pm$ 0.0079
CC	0.2127 $\pm$ 0.0084	0.1874 $\pm$ 0.0205	0.3421 $\pm$ 0.0142	0.6152 $\pm$ 0.0166	0.5857 $\pm$ 0.0168
ECC	0.2070 $\pm$ 0.0107	0.1601 $\pm$ 0.0267	0.3811 $\pm$ 0.0222	0.6234 $\pm$ 0.0235	0.5948 $\pm$ 0.0274
MLKNN	0.1986 $\pm$ 0.0104	0.1969 $\pm$ 0.0353	0.4073 $\pm$ 0.0221	0.6439 $\pm$ 0.0221	0.6157 $\pm$ 0.0271
CAMEL	0.1885 $\pm$ 0.0068	0.2011 $\pm$ 0.0180	0.4584 $\pm$ 0.0199	0.6584 $\pm$ 0.0120	0.6276 $\pm$ 0.0130
LAMB	0.1975 $\pm$ 0.0093	0.2089 $\pm$ 0.0215	0.4145 $\pm$ 0.0160	0.6459 $\pm$ 0.0148	0.6169 $\pm$ 0.0161
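
For readers who want to reproduce the five criteria reported in these tables, the sketch below computes them from binary label matrices using their standard definitions. It is an illustration only: the function names are hypothetical, the toy matrices are invented, and returning 0.0 for an F1 score whose denominator is zero is a convention chosen here, not something stated in the paper.

```python
def counts(true, pred):
    """Return (tp, fp, fn) for two equal-length 0/1 sequences."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); set to 0.0 here when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(Y_true, Y_pred):
    """Compute the five criteria for n x L lists of 0/1 labels."""
    n, L = len(Y_true), len(Y_true[0])
    pairs = list(zip(Y_true, Y_pred))
    # Fraction of individual label entries that are predicted wrongly.
    hamming = sum(t != p for yt, yp in pairs for t, p in zip(yt, yp)) / (n * L)
    # Fraction of instances whose whole label set is predicted exactly.
    subset_acc = sum(list(yt) == list(yp) for yt, yp in pairs) / n
    # F1 per instance (row), averaged over instances.
    example_f1 = sum(f1(*counts(yt, yp)) for yt, yp in pairs) / n
    # Counts per label (column).
    per_label = [counts(ct, cp) for ct, cp in zip(zip(*Y_true), zip(*Y_pred))]
    # Macro: average the per-label F1 values.
    macro_f1 = sum(f1(*c) for c in per_label) / L
    # Micro: pool the counts over all labels, then compute one F1.
    tp, fp, fn = (sum(c[i] for c in per_label) for i in range(3))
    micro_f1 = f1(tp, fp, fn)
    return {
        "HammingLoss": hamming,
        "SubsetAcc": subset_acc,
        "MacroF1": macro_f1,
        "MicroF1": micro_f1,
        "ExampleF1": example_f1,
    }


# Invented toy data: 3 instances, 3 labels.
Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
for name, value in multilabel_metrics(Y_true, Y_pred).items():
    print(f"{name}: {value:.4f}")
```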

Table 4

Predictive performance comparison on ChRs dataset. $\uparrow/\downarrow$ indicates that the larger/smaller the better of a criterion. The best results are in bold. N/A denotes not available

Algorithm	Evaluation metrics
	HammingLoss $\downarrow$	SubsetAcc $\uparrow$	MacroF ${}_{1}\uparrow$	MicroF ${}_{1}\uparrow$	ExampleF ${}_{1}\uparrow$
BR	0.1775 $\pm$ 0.0542	0.6573 $\pm$ 0.0990	0.2269 $\pm$ 0.0888	0.3574 $\pm$ 0.1602	0.1030 $\pm$ 0.0565
LLSF	0.2002 $\pm$ 0.0552	0.6036 $\pm$ 0.1037	0.0000 $\pm$ 0.0000	N/A	0.0000 $\pm$ 0.0000
CC ${}_{01}$	0.1702 $\pm$ 0.0425	0.6678 $\pm$ 0.0897	0.2276 $\pm$ 0.0913	0.3671 $\pm$ 0.1743	0.1075 $\pm$ 0.0691
CC ${}_{10}$	0.1746 $\pm$ 0.0417	0.6672 $\pm$ 0.0730	0.2252 $\pm$ 0.0855	0.3661 $\pm$ 0.1473	0.1073 $\pm$ 0.0560
ECC	0.1686 $\pm$ 0.0461	0.6712 $\pm$ 0.0903	0.2041 $\pm$ 0.1034	0.3192 $\pm$ 0.1612	0.1012 $\pm$ 0.0903
MLKNN	0.1750 $\pm$ 0.0255	0.6542 $\pm$ 0.0570	0.2465 $\pm$ 0.1382	0.3446 $\pm$ 0.1458	0.0987 $\pm$ 0.0491
DSML	0.1706 $\pm$ 0.0508	0.6670 $\pm$ 0.0944	0.2310 $\pm$ 0.0695	0.3615 $\pm$ 0.1232	0.0990 $\pm$ 0.0501
LAMB	0.1606 $\pm$ 0.0346	0.6872 $\pm$ 0.0673	0.2506 $\pm$ 0.0743	0.3943 $\pm$ 0.1249	0.1065 $\pm$ 0.0405

Table 5

Predictive performance of LAMB on Scene dataset with different numbers of boosting rounds $T$ and different values of the label-based reweight parameter $\lambda$. $\uparrow/\downarrow$ indicates that the larger/smaller the better of a criterion. The best results are in bold

Parameter	Evaluation metrics
	HammingLoss $\downarrow$	$\textit{SubsetAcc}\uparrow$	$\textit{MacroF}_{1}\uparrow$	$\textit{MicroF}_{1}\uparrow$	$\textit{ExampleF}_{1}\uparrow$
$T=1$
	0.0775 $\pm$ 0.0066	0.7046 $\pm$ 0.0241	0.7803 $\pm$ 0.0150	0.7739 $\pm$ 0.0188	0.7518 $\pm$ 0.0193
$T=2$
$\lambda=1.00$	0.0787 $\pm$ 0.0097	0.7133 $\pm$ 0.0344	0.7816 $\pm$ 0.0255	0.7768 $\pm$ 0.0267	0.7700 $\pm$ 0.0288
$\lambda=1.05$	0.0798 $\pm$ 0.0067	0.7121 $\pm$ 0.0243	0.7762 $\pm$ 0.0111	0.7745 $\pm$ 0.0196	0.7698 $\pm$ 0.0212
$\lambda=1.15$	0.0776 $\pm$ 0.0104	0.7133 $\pm$ 0.0365	0.7850 $\pm$ 0.0273	0.7811 $\pm$ 0.0278	0.7758 $\pm$ 0.0300
$\lambda=1.25$	0.0797 $\pm$ 0.0086	0.7121 $\pm$ 0.0293	0.7796 $\pm$ 0.0236	0.7746 $\pm$ 0.0254	0.7700 $\pm$ 0.0297
$\lambda=1.35$	0.0789 $\pm$ 0.0058	0.7129 $\pm$ 0.0115	0.7811 $\pm$ 0.0193	0.7772 $\pm$ 0.0178	0.7729 $\pm$ 0.0195
$T=3$
$\lambda=1.00$	0.0715 $\pm$ 0.0077	0.7270 $\pm$ 0.0192	0.8025 $\pm$ 0.0175	0.7950 $\pm$ 0.0218	0.7837 $\pm$ 0.0210
$\lambda=1.05$	0.0720 $\pm$ 0.0108	0.7250 $\pm$ 0.0339	0.8000 $\pm$ 0.0294	0.7936 $\pm$ 0.0303	0.7816 $\pm$ 0.0284
$\lambda=1.15$	0.0697 $\pm$ 0.0072	0.7354 $\pm$ 0.0277	0.8066 $\pm$ 0.0195	0.8001 $\pm$ 0.0209	0.7889 $\pm$ 0.0239
$\lambda=1.25$	0.0709 $\pm$ 0.0088	0.7325 $\pm$ 0.0291	0.8044 $\pm$ 0.0215	0.7972 $\pm$ 0.0239	0.7874 $\pm$ 0.0228
$\lambda=1.35$	0.0720 $\pm$ 0.0069	0.7237 $\pm$ 0.0304	0.7981 $\pm$ 0.0235	0.7921 $\pm$ 0.0227	0.7781 $\pm$ 0.0279

To investigate the influence of the number of boosting rounds $T$ and of the label-based reweight parameter $\lambda$ on the final prediction, we run experiments on the Scene dataset, varying one parameter while keeping the other fixed. The experimental results are shown in Table 5.

Influence of boosting round $T$

The number of boosting rounds $T$ is an important parameter of a boosting algorithm. The analysis in [5] reveals that a classifier should have low training error and a small number of boosting rounds in order to achieve good performance. Increasing the number of boosting rounds makes the classifier overly complex and may lead to overfitting.

As shown in Fig. 5, when the number of boosting rounds increases with $\lambda$ fixed, we obtain the best results on all metrics when the number of boosting rounds is 3. When the number of boosting rounds exceeds 3, the performance on all metrics declines slightly, which is in line with our intuition since LAMB is based on the boosting framework. We attribute this phenomenon to the fact that LAMB considers the interrelationship between the label set in $\textit{Cur}\mathcal{Y}$ and the label set in $\textit{Res}\mathcal{Y}$, and thus it can tolerate more complex models.

Influence of label-based reweight parameter $\lambda$

The reweight parameter $\lambda$ is used to exploit the correlation between the label set in $\textit{Cur}\mathcal{Y}$ and the label set in $\textit{Res}\mathcal{Y}$. As a result, the classifier on $\mathcal{D}_{C}$ receives additional information from the label set in $\textit{Res}\mathcal{Y}$, and the classifier on $\mathcal{D}_{R}$ receives additional information from the label set in $\textit{Cur}\mathcal{Y}$.

In Fig. 6, the dotted lines represent the performance without the label-based reweight mechanism, and the two solid lines lie above the dotted lines. LAMB with the label-based reweight mechanism performs better than LAMB without it, which validates the effectiveness of the mechanism. Different datasets may have different optimal $\lambda$. For the Scene dataset, $\lambda=1.15$ appears to be a suitable setting.
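
The choice of $\lambda=1.15$ can be checked directly against the $T=3$ rows of Table 5. The short sketch below copies the mean values from that table and selects, for each criterion, the $\lambda$ with the best mean; the variable and function names are ours and purely illustrative.

```python
# Mean values copied from the T = 3 rows of Table 5.
# lambda -> (HammingLoss, SubsetAcc, MacroF1, MicroF1, ExampleF1)
TABLE5_T3 = {
    1.00: (0.0715, 0.7270, 0.8025, 0.7950, 0.7837),
    1.05: (0.0720, 0.7250, 0.8000, 0.7936, 0.7816),
    1.15: (0.0697, 0.7354, 0.8066, 0.8001, 0.7889),
    1.25: (0.0709, 0.7325, 0.8044, 0.7972, 0.7874),
    1.35: (0.0720, 0.7237, 0.7981, 0.7921, 0.7781),
}

# (criterion name, True if larger is better)
CRITERIA = [
    ("HammingLoss", False),
    ("SubsetAcc", True),
    ("MacroF1", True),
    ("MicroF1", True),
    ("ExampleF1", True),
]


def best_lambda_per_criterion(table):
    """Return {criterion: lambda with the best mean value}."""
    best = {}
    for col, (name, larger_is_better) in enumerate(CRITERIA):
        pick = max if larger_is_better else min
        best[name] = pick(table, key=lambda lam: table[lam][col])
    return best


for name, lam in best_lambda_per_criterion(TABLE5_T3).items():
    print(f"{name}: best lambda = {lam}")
```

Note that this comparison uses the means only. The gaps between neighbouring $\lambda$ values in Table 5 are smaller than the reported standard deviations, so the selection should be read as indicative rather than statistically conclusive.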

Table 6

Computational complexity of LAMB compared with common algorithms

Algorithms	$\#$ models	Class labels/model	Examples/model
BR	$L$	$2$	$N$
PW	$\frac{L(L-1)}{2}$	$2$	$\leqslant N$
LC	$1$	$\min(N,2^{L}-1)$	$N$
LAMB	$L$	$2$	$N$
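
The entries of Table 6 can be evaluated for concrete problem sizes with a small helper. This is a sketch that mirrors the table; the function name `transformation_cost` and the example values of $L$ and $N$ are illustrative choices, not taken from the paper.

```python
def transformation_cost(method, L, N):
    """Return (#models, class labels per model, examples per model) as in Table 6.

    L is the number of labels and N the number of training examples.
    For PW the third entry is an upper bound (<= N).
    """
    if method in ("BR", "LAMB"):
        return L, 2, N
    if method == "PW":
        return L * (L - 1) // 2, 2, N
    if method == "LC":
        return 1, min(N, 2 ** L - 1), N
    raise ValueError(f"unknown method: {method}")


# Illustrative sizes: 14 labels and 2417 training examples.
for method in ("BR", "PW", "LC", "LAMB"):
    print(method, transformation_cost(method, L=14, N=2417))
```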

Figure 5.

Performance with respect to the number of boosting rounds on Scene dataset, with the label-based reweight parameter fixed at $\lambda=1.15$.

Figure 6.

Performance with respect to the label-based reweight parameter on Scene dataset. The dotted lines represent the performance when the number of boosting rounds is set to 1, without reweighting.

Complexity analysis

Table 6 shows the worst-case computational complexity of LAMB and other transformation algorithms, in terms of the number of single-label models involved in the transformation, and the class labels and examples associated with these models. BR has linear complexity $\mathcal{O}(L)$, where $L$ is the number of labels. Pairwise (PW) algorithms train a binary model for each pair of labels, which results in a set of pairwise preferences rather than a multi-label prediction. Although PW classification performs well in several domains, it has quadratic complexity $\mathcal{O}(L^{2})$. Label combination (LC), or label power-set, algorithms treat each label set as a single label in a single-label multi-class problem. LC can model label correlation, but incurs exponential complexity $\mathcal{O}(2^{L})$ in the worst case. Compared with the other transformation algorithms, BR has the lowest computational complexity. The computational complexity of BR is $\mathcal{O}(L\times f(d,N))$, where $f(d,N)$ is the complexity of the classification for $d$ attributes and $N$ data samples. The computational complexity of LAMB is computed as follows,

$\displaystyle \mathcal{O}(L\times f(d+1,N)+L\times f(d+L-1,N))=\mathcal{O}(L\times(d+1)\times f(1,N)+L\times(d+L-1)\times f(1,N))=\mathcal{O}(L\times L\times f(1,N)+2L\times d\times f(1,N))$ (14)

where the first equality assumes that $f$ grows linearly with the number of attributes, and the second term dominates as long as $L<d$. Consequently, the computational complexity of LAMB is $\mathcal{O}(L\times d\times f(1,N))$, which is identical to that of BR.
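
A small numeric check makes the constant factor hidden by the $\mathcal{O}$ notation visible. The sketch below counts cost in units of $f(1,N)$, under the linearity assumption used in Eq. (14); the function names and the values $L=6$, $d=294$ are illustrative assumptions.

```python
def br_cost(L, d):
    """Cost of BR in units of f(1, N): L classifiers on d attributes."""
    return L * d


def lamb_cost(L, d):
    """Cost from Eq. (14) in units of f(1, N).

    L classifiers on d + 1 attributes plus L classifiers on d + L - 1 attributes.
    """
    return L * (d + 1) + L * (d + L - 1)


L, d = 6, 294  # illustrative sizes with L < d
print("BR  :", br_cost(L, d))
print("LAMB:", lamb_cost(L, d))
# The expanded form L*L + 2*L*d from Eq. (14) gives the same number.
print("L*L + 2*L*d:", L * L + 2 * L * d)
print("ratio LAMB/BR:", round(lamb_cost(L, d) / br_cost(L, d), 3))
```

Under these assumptions LAMB costs roughly twice as much as BR, which is consistent with the two sharing the same asymptotic order when $L<d$.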

5. Conclusion and future work

Complex objects, such as images, can be represented with multi-label information, where the relationship among labels is very helpful for the final prediction. In this paper, we propose a novel high-order multi-label learning algorithm named Label collAboration based Multi-laBel learning (LAMB). For each label, LAMB combines the label's own prediction with the predictions of the other labels to form the final prediction, so that the labels collaborate with each other. Experimental studies on three public datasets validate the effectiveness of the proposed LAMB algorithm. We also assess one real-world channelrhodopsins chimeras dataset, which shows that LAMB can improve the prediction of eukaryotic expression and plasma membrane localization of channelrhodopsins. Furthermore, it provides new insights into understanding and designing membrane proteins. LAMB considers the global collaborative relationship between labels. However, different instances have different characteristics, and such a collaborative relationship may not be suitable for all instances. In the future, we plan to explore different collaborative relationships between labels for different instances when there exist noisy or missing labels in multi-label data.

Acknowledgments

This paper is supported by the National Key Research and Development Program of China (Grant No. 2018YFB1403400), the National Natural Science Foundation of China (Grant No. 61876080, Grant No. 62002137), the Key Research and Development Program of Jiangsu (Grant No. BE2019105), and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.

References

1. Bar, Visual objects in context, Nature Reviews Neuroscience 5(8) (2004), 617.

2. Bedbrook C.N., Yang K.K., Rice A.J., Gradinaru and Arnold F.H., Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization, PLoS Computational Biology 13(10) (2017), e1005786.

3. Boutell M.R., Luo, Shen and Brown C.M., Learning multi-label scene classification, Pattern Recognition 37(9) (2004), 1757–1771.

4. Dembczynski, Jachnik, Kotlowski, Waegeman and Hüllermeier, Optimizing the f-measure in multi-label classification: Plug-in rule approach versus structured loss minimization, in: International Conference on Machine Learning, pages 1130–1138, 2013.

5. Freund and Schapire R.E., A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55(1) (1997), 119–139.

6. Gibaja and Ventura, A tutorial on multilabel learning, ACM Computing Surveys (CSUR) 47(3) (2015), 52.

7. Gopal and Yang, Multilabel classification with meta-level features, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 315–322, ACM, 2010.

8. Chen, Lin, Hui, Chen and Wu, Knowledge-guided multi-label few-shot learning for general image recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

9. Chen, Hui and Lin, Learning semantic-specific graph representation for multi-label image recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 522–531, 2019.

10. da Silva P.N., Gonçalves E.C., Plastino and Freitas A.A., Distinct chains for different instances: An effective strategy for multi-label classifier chains, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 453–468, Springer, 2014.

11. Dembczynski, Cheng and Hüllermeier, Bayes optimal multilabel classification via probabilistic classifier chains, in: ICML, pages 279–286, 2010.

12. Jack, Arshdeep and Qi, Neural Message Passing for Multi-label Classification, in: ECML/PKDD (2), Vol. 11907 of Lecture Notes in Computer Science, pages 138–163, Springer, 2019.

13. Peng, Wang, Gong, Yang and He, Hierarchical taxonomy-aware and attentional graph capsule RCNNs for large-scale multi-label text classification, in: IEEE Transactions on Knowledge and Data Engineering, pages 2505–2519, 2021.

14. Ray, Wang, Tran, Wang, Feiszli, Torresani and Paluri, Scenes-objects-actions: A multi-task, multi-label video dataset, in: Proceedings of the European Conference on Computer Vision (ECCV), pages 635–651, 2018.

15. Read, Martino and Luengo, Efficient monte carlo methods for multi-dimensional learning with classifier chains, Pattern Recognition 47(3) (2014), 1535–1546.

16. Huang and Wu, Learning label specific features for multi-label classification, in: 2015 IEEE International Conference on Data Mining, pages 181–190, IEEE, 2015.

17. Huang S.-J. and Zhou Z.-H., Multi-label hypothesis reuse, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 525–533, ACM, 2012.

18. Ouyang and Zhou, Supervised topic models for multi-label classification, Neurocomputing 149 (2015), 811–819.

19. Zhao and Guo, Multi-label image classification with a probabilistic label enhancement model, in: UAI, Vol. 1, page 3, 2014.

20. Liu, Zhao, Huang S.-J., Jiang and Zhou Z.-H., Dual set multi-label learning, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

21. Read, Pfahringer, Holmes and Frank, Classifier chains for multi-label classification, Machine Learning 85(3) (2011), 333.

22. Trohidis, Tsoumakas, Kalliris and Vlahavas I.P., Multi-label classification of music into emotions, in: ISMIR, Vol. 8, pages 325–330, 2008.

23. Tsoumakas and Katakis, Multi-label classification: An overview, International Journal of Data Warehousing and Mining (IJDWM) 3(3) (2007), 1–13.

24. Wang, Yang, Mao, Huang and Xu, Cnn-rnn: A unified framework for multi-label image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2016.

25. Xue, Zhang, Fan and Lu, Correlative multi-label multi-instance image annotation, in: 2011 International Conference on Computer Vision, pages 651–658, IEEE, 2011.

26. You, Zhang, Xiong, Sun, Mamitsuka and Zhu, Golabeler: Improving sequence-based large-scale protein function prediction by learning to rank, Bioinformatics 34(14) (2018), 2465–2473.

27. Zhang M.-L. and Zhou Z.-H., Multilabel neural networks with applications to functional genomics and text categorization, IEEE Transactions on Knowledge and Data Engineering 18(10) (2006), 1338–1351.

28. Zhang M.-L. and Zhou Z.-H., Ml-knn: A lazy learning approach to multi-label learning, Pattern Recognition 40(7) (2007), 2038–2048.

29. Zhang M.-L. and Zhou Z.-H., A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26(8) (2013), 1819–1837.

30. Zhu and Gong, Multi-labelled classification using maximum entropy method, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 274–281, ACM, 2005.
