Cross-modality semantic guidance for multi-label image classification

Abstract

Multi-label image classification aims to predict the set of labels present in an image. Its key challenges are modeling label correlations and exploiting spatial information. Existing approaches mainly compute label correlations from label co-occurrence, so the result is easily affected by label noise and occasional co-occurrences. In addition, some works model the correlation between labels and spatial features, but they do not fully exploit the correlation among labels when modeling the spatial relationships among features. To address these issues, we propose a novel cross-modality semantic guidance-based framework for multi-label image classification, namely CMSG. First, we design a semantic-guided attention (SGA) module that applies the label correlation matrix to guide the learning of class-specific features, thereby implicitly modeling semantic correlations among labels. Second, we design a spatial-aware attention (SAA) module to extract high-level semantic-aware spatial features from the class-specific features produced by the SGA module. Experiments on three benchmark datasets demonstrate that the proposed method outperforms existing state-of-the-art algorithms on multi-label image classification.

Keywords

Multi-label image classification, label relation, cross-modality

1. Introduction

Multi-label image classification is a fundamental and challenging task in computer vision that aims to predict the presence of multiple objects in an image. Researchers have developed a series of methods for this task [29, 8, 14]. Multi-label image classification has broader applications than its single-label counterpart, such as medical diagnosis [2], remote sensing image classification [18], attribute recognition [39], scene understanding [27], and emotion recognition [22]. For example, in medical diagnosis, chest X-ray (CXR) imaging is one of the common screening techniques for chest diseases. A multi-label image classification model can automatically predict the possible diseases of a patient from CXR images [2], such as atelectasis, opacity, and consolidation. Because multi-label images contain multiple objects and abundant semantic information, traditional methods that convert a multi-label image classification problem into a series of single-label classification problems do not model label correlations, which can greatly degrade classification performance. Furthermore, accurately locating object regions in an image facilitates the extraction of spatial features corresponding to the categories. Therefore, modeling both the correlations among labels and the correlations between labels and object regions is important for improving classification performance.

With the emergence of graph neural networks, recent works [6, 17, 15] capture label correlations by propagating node messages via graph convolution networks. For example, ML-GCN [6] generates multiple label-specific classifiers via a graph convolution network. In these graph-based approaches, the label correlation matrix is typically obtained by counting the co-occurrence of label pairs in the training data, so model performance is affected by label noise and occasional co-occurrences. Subsequently, several works [20, 38] optimized the topology of the graph by learning multiple graph structures, but they construct label correlations only from label embeddings and ignore the visual features. In addition to modeling label correlations, researchers have addressed multi-label image classification by exploiting the spatial information of images. Some works [31, 5] model the spatial relationships among features, but without the guidance of label semantic information. Other works [37, 43] model the correlation between labels and spatial features, but they do not fully exploit the correlation among labels when modeling the spatial relationships among features.

To address the above issues, we propose a cross-modality semantic guidance (CMSG) based framework for multi-label image classification, which models the correlation between features corresponding to different labels and learns semantic-aware spatial features with the guidance of semantic label embeddings. Specifically, the proposed framework is composed of two critical modules. First, we design a semantic-guided attention (SGA) module to learn the class-specific feature representation for different labels. In the SGA module, the label correlation matrix, obtained as the cosine similarity between label embeddings, is used by the proposed multi-head masked attention to model the relationship among features corresponding to different labels. Second, we design a spatial-aware attention (SAA) module composed of multi-head cross-attention. SAA further extracts high-level semantic-aware spatial features from the class-specific features obtained by the SGA module. Extensive experiments on three benchmark multi-label datasets, MS-COCO 2014, VOC 2007, and VOC 2012, show that the proposed method CMSG achieves competitive performance on multi-label classification. The main contributions of this paper are summarized as follows:

• We propose a novel cross-modality semantic guidance-based framework for multi-label image classification, namely CMSG, which is composed of two critical modules.
• We design a semantic-guided attention (SGA) module composed of multi-head masked attention. SGA applies the label correlation matrix to guide the learning of class-specific features, so that the semantic correlation among labels is implicitly modeled.
• We further design a spatial-aware attention (SAA) module to extract high-level semantic-aware spatial features from the class-specific features obtained from the SGA module, so that the correlation between labels and spatial features is modeled.
• We evaluate our method on three benchmark datasets, MS-COCO 2014, VOC 2007, and VOC 2012, and demonstrate that it achieves competitive performance.

The rest of this paper is organized as follows. Section 2 reviews the previous works on multi-label learning. Section 3 introduces the proposed method CMSG in detail. Comparative experiment results and analyses are presented in Section 4. Finally, we conclude this paper in Section 5.

2. Related work

In recent years, various multi-label image classification approaches have been proposed. In the following subsections, we review these approaches from two aspects: label correlation modeling and spatial information modeling.

2.1 Label correlation-based approaches

Early works on multi-label learning decompose a multi-label classification task into multiple single-label classification tasks. For example, Boutell et al. [1] converted a multi-label classification problem into a set of independent binary classification problems, one per label. However, these methods ignore label correlations in multi-label images. In multi-label classification, labels are often correlated with each other, and modeling label correlations can significantly improve classification performance. Wang et al. [29] introduced recurrent neural networks (RNNs) to exploit label co-occurrence in the training data. Chen et al. [3] proposed an order-free RNN that predicts label sequences using a confidence-ranked LSTM rather than requiring a predefined label order. Unlike these sequential RNN-based approaches, other works use graph structures to represent label relations, and the success of graph neural networks has attracted growing interest. In [6], graph convolutional networks are adopted to propagate label representations and thereby model label correlations. In [4], SSGRL is proposed, which learns inter-label dependencies through a graph with gated recurrent neural networks. However, these works construct the graph structure directly by counting label co-occurrences in the training data, which may cause overfitting. In [36], ADD-GCN is proposed, which generates dynamic graphs with an attention mechanism to perform dynamic graph convolution. More recently, several approaches use the multi-head self-attention mechanism of the Transformer [28] to model label correlations. Lanchantin et al. [16] adopted the Transformer encoder to capture long-range label dependencies. S-MAT [33] employs masked attention to filter redundant label dependencies and enhance the robustness of the model.

2.2 Spatial information-based approaches

The spatial information of images plays a key role in multi-label image classification. With the development of object detection, many works use pre-trained object detection models to roughly localize multiple regions and then recognize each region with convolutional neural networks [25, 44, 7]. For example, Wei et al. [32] proposed HCP, which generates numerous proposals with object detection models and treats each proposal as a single-label classification task, but this method incurs a huge computational cost because of proposal generation. Wang et al. [31] introduced a spatial transformer network to extract features of regions of interest and predict the labels of each region sequentially using an LSTM. In [11], a two-stream framework, MCAR, is developed to identify multi-class objects from global to local. In addition to generating regions, it is also important to explore the association between the spatial information of images and semantic labels. Chen et al. [4] proposed a semantic decoupling module to capture the interaction between labels and visual features. Zhu et al. [43] aggregated visual features from spatial streams to semantic streams to update label semantic information. You et al. [37] introduced cross-modality attention to measure the importance of each location by computing the similarity between spatial features and semantic labels. Different from these methods, this paper adopts a cross-attention mechanism to learn spatial features with the guidance of the semantic information of labels, which is a simple yet effective method.

Figure 1.

The overall pipeline of our proposed framework CMSG.

3. Proposed framework

3.1 Overview

In this section, we will introduce the proposed framework CMSG, which consists of two main modules, i.e., the semantic-guided attention (SGA) module and the spatial-aware attention (SAA) module. The overall pipeline of the proposed framework is shown in Fig. 1.

Specifically, given an image, we first feed it into the backbone to generate the feature map $X_{0}\in R^{d\times h\times w}$ which is then transformed into $X_{C}\in R^{C\times h\times w}$ via a $1\times 1$ convolution layer. To model label correlations, we construct a semantic correlation matrix $M\in R^{C\times C}$ based on the label embeddings $E\in R^{C\times d^{\prime}}$ by calculating the cosine similarity. Then, the feature map $X_{C}$ and the semantic correlation matrix are fed into the SGA module to learn the class-specific feature map $X_{L}\in R^{C\times hw}$ which implicitly reflects semantic correlations. To extract the features from regions of interest and focus on semantic-aware regions, $X_{L}$ is first converted into $X_{S}\in R^{hw\times d^{\prime}}$ via a transpose operation and linear projection, and then is fed into the SAA module with the label embeddings $E$ to learn the semantic-aware spatial features $X_{M}\in R^{hw\times d^{\prime}}$ . Finally, we can obtain the label prediction $\hat{y}$ for the input image by a global max pooling (GMP) and a linear layer based on $X_{M}$ . In the following sections, details of each step of our proposed method CMSG will be introduced.

3.2 Feature extraction

Given an image $I\in R^{3\times H\times W}$ , we first feed it into the backbone to extract its feature map $X_{0}\in R^{d\times h\times w}$ , where $h$ , $w$ , and $d$ are the height, width, and number of channels of the feature map, respectively. To learn the semantic-aware class-specific features, the feature map $X_{0}$ is first converted to $X_{C}\in R^{C\times h\times w}$ via a $1\times 1$ convolution layer and further transformed into the flattened feature $X\in R^{C\times hw}$ , where $C$ is the number of labels. These operations can be defined as:

$\displaystyle X=R(f_{\textit{conv}}(X_{0})),$ (1)

where $f_{\textit{conv}}(\cdot)$ denotes a $1\times 1$ convolution operation and $R(\cdot)$ represents the reshape operation.
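
As a minimal sketch of Eq. (1) in plain Python: a $1\times 1$ convolution mixes channels independently at each spatial position, so it is written here as a per-position weighted sum over the $d$ input channels, followed by flattening the spatial axes. The function name `conv1x1_and_flatten`, the toy sizes, and the omission of a bias term are assumptions made for illustration and are not taken from the paper.

```python
def conv1x1_and_flatten(x0, weight):
    """Apply a 1x1 convolution and flatten the spatial axes.

    x0:     feature map as nested lists, shape [d][h][w]
    weight: 1x1 kernel as nested lists, shape [C][d] (bias omitted: an assumption)
    returns X with shape [C][h*w], as in Eq. (1)
    """
    d = len(x0)
    h, w = len(x0[0]), len(x0[0][0])
    out = []
    for row in weight:  # one output channel per label
        flat = []
        for i in range(h):
            for j in range(w):
                # Mix the d input channels at spatial position (i, j).
                flat.append(sum(row[k] * x0[k][i][j] for k in range(d)))
        out.append(flat)
    return out


# Toy example: d = 2 channels, h = w = 2, C = 3 labels (made-up values).
x0 = [[[1.0, 2.0], [3.0, 4.0]],
      [[0.5, 0.5], [0.5, 0.5]]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]]
X = conv1x1_and_flatten(x0, weight)
print(len(X), len(X[0]))  # 3 4
```

Each row of `X` is then the flattened response map of one label, which is the sequence element consumed by the attention layers.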

3.3 Semantic correlation matrix

In multi-label image classification, labels are often correlated with each other, and modeling label correlations can significantly improve the performance of classification. Many methods calculate the label correlation according to co-occurrence among labels. However, the result is easily affected by the label noise and occasional co-occurrences. In this paper, following [38], we construct a semantic correlation matrix by calculating the similarity between label embeddings which are trained on a large-scale unsupervised corpus that usually contains rich semantic information.

For a multi-label classification problem with $C$ categories, $E\in R^{C\times d^{\prime}}$ is the label word embedding, and $E_{i}\in R^{d^{\prime}}$ denotes the embedding vector for the $i$ -th label with length $d^{\prime}$ . $R\in R^{C\times C}$ indicates the symmetric label correlation matrix, and each element $R_{i,j}$ denotes the correlation between the $i$ -th and the $j$ -th labels. $R_{i,j}$ is calculated via the cosine similarity based on the label embedding vectors, which is defined as

$\displaystyle R_{i,j}=\frac{\sum_{k=1}^{d^{\prime}}E_{i,k}\times E_{j,k}}{\sqrt{\sum_{k=1}^{d^{\prime}}(E_{i,k})^{2}}\times\sqrt{\sum_{k=1}^{d^{\prime}}(E_{j,k})^{2}}}.$ (2)

Inspired by [6], we set a threshold $\tau$ and weight parameter $p$ to filter the noisy edges and further alleviate the over-smoothing problem by

$\displaystyle R_{i,j}^{\prime}=\left\{\begin{array}{ll}0,&\text{ if }R_{i,j}<\tau\\ 1,&\text{ if }R_{i,j}\geqslant\tau\end{array}\right.$ (3)

Consequently, we can get a re-weighted semantic correlation matrix $M$ by

$\displaystyle M_{i,j}=\left\{\begin{array}{ll}\frac{p}{\sum_{k=1,k\neq i}^{C}R_{i,k}^{\prime}},&\text{ if }i\neq j\\ 1-p,&\text{ if }i=j\end{array}\right.$ (4)
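
The construction in Eqs. (2)-(4) can be sketched in plain Python as follows. The embeddings, threshold, and weight below are made-up toy values. One point is an assumption of this sketch rather than something stated in the text: the off-diagonal weight of Eq. (4) is assigned only to pairs retained by Eq. (3), and filtered pairs keep a weight of 0, which is how the phrase "filter the noisy edges" is read here.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_correlation_matrix(embeddings, tau, p):
    """Build the re-weighted correlation matrix M from label embeddings."""
    C = len(embeddings)
    # Eq. (2): pairwise cosine similarity between label embeddings.
    R = [[cosine_similarity(embeddings[i], embeddings[j]) for j in range(C)]
         for i in range(C)]
    # Eq. (3): binarise with threshold tau.
    R_bin = [[1.0 if R[i][j] >= tau else 0.0 for j in range(C)]
             for i in range(C)]
    # Eq. (4): weight 1 - p on the diagonal, p shared among retained neighbours.
    M = [[0.0] * C for _ in range(C)]
    for i in range(C):
        neighbours = sum(R_bin[i][k] for k in range(C) if k != i)
        for j in range(C):
            if i == j:
                M[i][j] = 1.0 - p
            elif R_bin[i][j] == 1.0:  # assumption: filtered pairs stay at 0
                M[i][j] = p / neighbours
    return M


# Three toy 2-d "label embeddings" (hypothetical values, not GloVe vectors).
E = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
M = semantic_correlation_matrix(E, tau=0.5, p=0.2)
for row in M:
    print(row)
```

In this toy case labels 0 and 2 are orthogonal, so their pair is filtered, while label 1 is similar to both and splits its weight $p$ between them.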

3.4 Semantic-guided attention module

After obtaining the label semantic correlation matrix $M$, we use it to help the model learn high-level class-specific feature representations through the self-attention mechanism. When extracting class-specific features, we expect the attention weight between two labels to be larger when their correlation is stronger and smaller otherwise. However, attention weights learned from the feature representation alone, without the guidance of label semantic information, might be inaccurate. Consequently, we propose a semantic-guided attention (SGA) module composed of several layers of multi-head masked attention (MHMA), where the label semantic correlation matrix $M$ is used as a mask to guide the learning of feature representations in which the label semantic correlations are implicitly captured. The structure of an MHMA layer is shown in Fig. 2.

Figure 2.

The detailed illustration of multi-head masked attention (MHMA).

Concretely, the feature $X\in R^{C\times hw}$ obtained in the feature extraction stage is regarded as a sequence of $C$ feature vectors of dimension $hw$, which is used as the query $Q$, key $K$, and value $V$. The scaled dot-product masked attention over the three matrices is then defined as

$\displaystyle\text{Masked Attention}(Q,K,V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}+M\right)V,$ (5)

where $\text{Softmax}(QK^{T}/\sqrt{d_{k}}+M)$ can be considered as a weight matrix which is then utilized to calculate a weighted sum of the input $X$ .

Introducing the semantic correlation matrix into the self-attention scores allows the semantic relations among labels to participate in the computation of the attention weights. As a result, label semantic correlation is implicitly embedded into the features, and high-level class-specific feature representations can be learned.
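
A single-head version of Eq. (5) can be sketched in plain Python as below. The helper names and the toy inputs are assumptions for illustration; the learned projections are left out so that the effect of adding $M$ to the scaled scores is easy to see.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def masked_attention(Q, K, V, M):
    """Eq. (5): Softmax(Q K^T / sqrt(d_k) + M) V, with M added to the scores."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) + m for s, m in zip(s_row, m_row)])
               for s_row, m_row in zip(scores, M)]
    return matmul(weights, V), weights


# Toy input: C = 2 labels, hw = 2 positions; Q = K = V = X as in the text.
X = [[1.0, 0.0], [0.0, 1.0]]
M = [[0.8, 0.2], [0.2, 0.8]]          # hypothetical correlation matrix
zero = [[0.0, 0.0], [0.0, 0.0]]       # no semantic guidance, for comparison

out_masked, w_masked = masked_attention(X, X, X, M)
out_plain, w_plain = masked_attention(X, X, X, zero)
print(w_masked[0], w_plain[0])
```

With these values, the weight on each label's own feature is larger with $M$ than without it, because the larger entries of $M$ raise the corresponding scores before the softmax.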

MHMA is a multi-head extension of the above masked single attention mechanism, which is defined as

$\displaystyle\text{MHMA}(Q,K,V)=\text{Concat}(Z_{1},Z_{2},\ldots,Z_{H})W^{O},$ (6)

where $H$ is the number of heads, and $Z_{i}=\text{Masked Attention}(Q_{i},K_{i},V_{i})$ is the output of the $i$-th attention head, with $Q_{i}=QW_{i}^{Q}$, $K_{i}=KW_{i}^{K}$, and $V_{i}=VW_{i}^{V}$. Here $W_{i}^{Q}\in R^{D\times d_{k}}$, $W_{i}^{K}\in R^{D\times d_{k}}$, $W_{i}^{V}\in R^{D\times d_{v}}$, and $W^{O}\in R^{Hd_{v}\times D}$ are learnable parameters, $D$ is the dimension of the input feature vectors, and $d_{k}$ and $d_{v}$ are the dimensions of the keys and values.

Our proposed SGA is a multi-layer architecture, where each layer is composed of a multi-head attention mechanism and a Feed-Forward Network (FFN). Specifically, the $l$ -th layer of MHMA is updated as

$\displaystyle X_{L}^{(1)}=X_{L}^{l-1}+\text{MHMA}(X_{L}^{l-1},X_{L}^{l-1},X_{L}^{l-1}),$ (7)

$\displaystyle X_{L}^{l}=X_{L}^{(1)}+\text{FFN}(X_{L}^{(1)}),$ (8)

where $X_{L}^{l-1}$ is the feature updated in the previous layer, and $X_{L}^{0}=X$. The learned feature representation $X_{L}\in R^{C\times hw}$ is the output of the last layer of the SGA module.
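
The layer update in Eqs. (7) and (8) is a pair of residual connections, which the following plain-Python sketch makes explicit. The attention and FFN sub-layers are passed in as callables; the toy stand-ins used in the example are placeholders chosen so the arithmetic can be followed by hand, and they are neither MHMA nor the FFN of the paper.

```python
def add(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def attention_layer(x_prev, attention, ffn):
    # Eq. (7): residual connection around the self-attention sub-layer.
    x_mid = add(x_prev, attention(x_prev, x_prev, x_prev))
    # Eq. (8): residual connection around the feed-forward sub-layer.
    return add(x_mid, ffn(x_mid))


def stack(x, layers):
    """Apply the layers in order, starting from X^0 = x."""
    for attention, ffn in layers:
        x = attention_layer(x, attention, ffn)
    return x


# Hypothetical stand-ins: "attention" halves V, "FFN" returns zeros.
toy_attention = lambda q, k, v: [[0.5 * e for e in row] for row in v]
toy_ffn = lambda x: [[0.0 for _ in row] for row in x]

x0 = [[2.0, 4.0]]
x2 = stack(x0, [(toy_attention, toy_ffn)] * 2)  # two layers
print(x2)  # [[4.5, 9.0]]
```

The same skeleton applies to any number of layers, since each layer only needs the output of the previous one.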

3.5 Spatial-aware attention module

In order to fully consider the correlation between the label semantic information and spatial features, in this section, we design a spatial-aware attention (SAA) module to further extract the semantic-aware spatial features based on the class-specific feature $X_{L}\in R^{C\times hw}$ obtained by the SGA module. The SAA module is composed of several layers of multi-head cross-attention (MHCA) which can capture correlations between each location of the visual features and the semantic label embeddings. The structure of MHCA is shown in Fig. 3.

Figure 3.

The detailed illustration of the multi-head cross-attention (MHCA).

Specifically, the feature representation $X_{L}\in R^{C\times hw}$ is first reshaped along the spatial dimension and then transformed into the flattened spatial feature $X_{S}\in R^{hw\times d^{\prime}}$ with a linear layer. The operations can be expressed as

$\displaystyle X_{S}=f_{\textit{linear}}(R(X_{L})),$ (9)

where $f_{\textit{linear}}(\cdot)$ denotes the fully connected layer from dimension $C$ to $d^{\prime}$ .

The representation $X_{S}\in R^{hw\times d^{\prime}}$ can be considered as a sequence of $h\times w$ feature vectors of dimension $d^{\prime}$, one per pixel (position). First, we define the scaled dot-product cross-attention over three matrices as

$\displaystyle\text{Cross Attention}(Q,K,V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V.$ (10)

Next, to capture the correlations between each position of the visual features and the semantic label embeddings, $X_{S}\in R^{hw\times d^{\prime}}$ is treated as the query $Q$ and label embedding $E\in R^{C\times d^{\prime}}$ is treated as key $K$ and value $V$ . As a result, $\text{Softmax}(QK^{T}/\sqrt{d_{k}})\in R^{hw\times C}$ can be considered as a weight matrix, and each element of it indicates the importance of the position of an image to the semantic label. Then, it will be utilized to calculate a weighted sum over the label embedding $E$ to obtain the semantic-aware spatial feature representation $X_{M}\in R^{hw\times d^{\prime}}$ . Consequently, the visual feature and the semantic label embeddings are thoroughly and effectively fused, and the correlation between each position of the visual features and the semantic label embeddings is fully modeled.
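
The shapes involved in this cross-attention step can be checked with a small plain-Python sketch of Eq. (10), using $X_{S}$ as the query and $E$ as both key and value. The toy values and helper names are assumptions for illustration, and the learned projections are omitted.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def cross_attention(Q, K, V):
    """Eq. (10): Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Toy sizes: hw = 3 positions, C = 2 labels, d' = 2 (made-up values).
X_S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # query, shape [hw][d']
E = [[1.0, 0.0], [0.0, 1.0]]                 # key and value, shape [C][d']

X_M, W = cross_attention(X_S, E, E)
print(len(W), len(W[0]))      # hw x C weight matrix
print(len(X_M), len(X_M[0]))  # hw x d' output
```

Row $i$ of `W` tells how strongly position $i$ attends to each label, and the third position, which matches both toy labels equally, receives equal weights.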

The SAA module is composed of several layers of multi-head cross-attention (MHCA) which is a multi-head extension of the above single cross-attention mechanism. Specifically, we define it as

$\displaystyle\text{MHCA}(Q,K,V)=\text{Concat}(Z_{1},Z_{2},\ldots,Z_{H})W^{O},$ (11)

where $H$ is the number of heads, and $Z_{i}=\text{Cross Attention}(Q_{i},K_{i},V_{i})$ is the output of the $i$-th attention head, with $Q_{i}=QW_{i}^{Q}$, $K_{i}=KW_{i}^{K}$, and $V_{i}=VW_{i}^{V}$. Consequently, the $l$-th layer of SAA is updated as

$\displaystyle X_{M}^{(1)}=X_{M}^{l-1}+\text{MHCA}(X_{M}^{l-1},E,E),$ (12)

$\displaystyle X_{M}^{l}=X_{M}^{(1)}+\text{FFN}(X_{M}^{(1)}),$ (13)

where $X_{M}^{l-1}$ is the feature updated in the previous layer, and $X_{M}^{0}=X_{S}$. The learned feature representation $X_{M}\in R^{hw\times d^{\prime}}$ is the output of the last layer of the SAA module.

3.6 Prediction and Loss function

Given the feature representation $X_{M}$ learned by the SAA module for each image, the label prediction probability $p=\{p_{1},p_{2},\ldots,p_{C}\}$ is obtained via a pooling layer, a fully connected layer, and a sigmoid function:

$\displaystyle p=\sigma(f_{\textit{linear}}(\text{GMP}(R(X_{M})))),$ (14)

where $f_{\textit{linear}}(\cdot)$ denotes the fully connected layer from $d^{\prime}$ to $C$, $\text{GMP}(\cdot)$ denotes global max-pooling, and $\sigma(\cdot)$ denotes the sigmoid function.
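
Eq. (14) can be illustrated with the following plain-Python sketch. Global max-pooling is taken over the $hw$ positions for each of the $d^{\prime}$ channels; the weights, bias, and input values are made-up toy numbers, and the function name `predict` is an assumption of this sketch.

```python
import math


def predict(x_m, weight, bias):
    """GMP over positions, linear layer d' -> C, then sigmoid (Eq. (14)).

    x_m:    nested lists of shape [hw][d']
    weight: nested lists of shape [C][d']
    bias:   list of length C
    """
    d = len(x_m[0])
    # Global max-pooling: maximum over all positions for each channel.
    pooled = [max(row[k] for row in x_m) for k in range(d)]
    logits = [sum(w * v for w, v in zip(w_row, pooled)) + b
              for w_row, b in zip(weight, bias)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Toy example: hw = 2 positions, d' = 2 channels, C = 2 labels.
x_m = [[0.0, 1.0], [2.0, -1.0]]
weight = [[1.0, -2.0], [0.5, 0.5]]
bias = [0.0, 0.0]

p = predict(x_m, weight, bias)
labels = [p_c >= 0.6 for p_c in p]  # 0.6 is the threshold given in Section 4.3
print(p, labels)
```

Here the pooled vector is `[2.0, 1.0]`, the logits are `0.0` and `1.5`, and only the second label exceeds the threshold.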

In order to better solve the imbalance problem of sample positive and negative labels, in this paper, we adopt Asymmetric loss [26], which is defined as

$\displaystyle L=-\frac{1}{C}\sum_{c=1}^{C}\left(y_{c}(1-p_{c})^{\gamma_{+}}\log(p_{c})+(1-y_{c})\,p_{m}^{\gamma_{-}}\log(1-p_{m})\right),$ (15)

where $y=\{y_{1},y_{2},\ldots,y_{C}\}$ is the ground-truth label vector for an image, and $\gamma_{+}$ and $\gamma_{-}$ are the positive and negative focusing parameters, respectively. To filter very easy negative samples, we employ the shifted probability $p_{m}=\max(p_{c}-m,0)$, where the margin $m$ is a hyper-parameter.
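
As a minimal sketch of Eq. (15) for a single image, the loss can be written in plain Python as follows. The values passed for the focusing parameters and the margin are illustrative only, since this section does not state the values used in the experiments, and the small constant `eps` that keeps the logarithm finite is an assumption of the sketch.

```python
import math


def asymmetric_loss(p, y, gamma_pos, gamma_neg, margin, eps=1e-8):
    """Asymmetric loss of Eq. (15) for one image.

    p: predicted probabilities, y: 0/1 ground-truth labels (same length C).
    eps guards log(0); it is an assumption, not part of Eq. (15).
    """
    total = 0.0
    for p_c, y_c in zip(p, y):
        p_m = max(p_c - margin, 0.0)  # shifted probability for negatives
        pos = y_c * (1.0 - p_c) ** gamma_pos * math.log(max(p_c, eps))
        neg = (1 - y_c) * p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
        total += pos + neg
    return -total / len(p)


# Illustrative values only.
p = [0.9, 0.03, 0.7]
y = [1, 0, 0]
print(asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05))
```

In this example the second label is a negative whose probability lies below the margin, so its shifted probability is 0 and it contributes nothing to the loss. With both focusing parameters and the margin set to 0, the expression reduces to the mean binary cross-entropy.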

4. Experiments

4.1 Datasets

To verify the effectiveness of the proposed method CMSG, we use three widely used multi-label benchmark datasets: Pascal VOC 2007, Pascal VOC 2012, and MS-COCO 2014. These datasets are described below.

(1) The VOC2007 dataset [10] contains 9963 images with 20 categories. It is split into a trainval set of 5011 images and a test set of 4952 images.
(2) The VOC2012 dataset [10] has 22531 images with 20 categories, of which 11540 are used as the trainval set and 10991 as the test set.
(3) The MS-COCO 2014 dataset [19] has 82081 images in the train set and 40504 images in the validation set. It contains 80 categories with approximately 2.9 labels per image.

4.2 Evaluation metrics

For a fair comparison with state-of-the-art approaches, we adopt the widely used metrics, including the mean average precision (mAP) over all categories, the per-class precision (CP), recall (CR), and F1-measure (CF1), and the overall precision (OP), recall (OR), and F1-measure (OF1). In addition, to compare with other existing methods on the MS-COCO dataset, we also report the top-3 results for precision, recall, and F1-measure.
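
The per-class and overall metrics can be sketched in plain Python using their usual definitions: per-class precision and recall are averaged over categories, while the overall variants pool the counts of all categories first. The function name, the toy labels, and the convention of returning 0 when a denominator is 0 are assumptions of this sketch; mAP is not covered here because it requires ranked scores rather than binary predictions.

```python
def multilabel_prf(y_true, y_pred):
    """CP, CR, CF1, OP, OR, OF1 from 0/1 matrices of shape [images][C]."""
    num_classes = len(y_true[0])
    tp = [0] * num_classes      # correctly predicted images per class
    n_pred = [0] * num_classes  # predicted images per class
    n_true = [0] * num_classes  # ground-truth images per class
    for t_row, p_row in zip(y_true, y_pred):
        for c in range(num_classes):
            tp[c] += t_row[c] * p_row[c]
            n_pred[c] += p_row[c]
            n_true[c] += t_row[c]

    def ratio(num, den):
        return num / den if den else 0.0  # zero-division guard: an assumption

    def f1(prec, rec):
        return ratio(2 * prec * rec, prec + rec)

    cp = sum(ratio(tp[c], n_pred[c]) for c in range(num_classes)) / num_classes
    cr = sum(ratio(tp[c], n_true[c]) for c in range(num_classes)) / num_classes
    op = ratio(sum(tp), sum(n_pred))
    o_r = ratio(sum(tp), sum(n_true))
    return {"CP": cp, "CR": cr, "CF1": f1(cp, cr),
            "OP": op, "OR": o_r, "OF1": f1(op, o_r)}


# Toy example: 3 images, 2 categories (made-up labels).
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [1, 0]]
metrics = multilabel_prf(y_true, y_pred)
print(metrics)
```

In the toy case CP is 0.75 and OP is 2/3, which shows how the two averaging schemes can differ on the same predictions.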

4.3 Implementation details

All experiments are conducted on a Linux server equipped with two NVIDIA RTX 3090 GPUs. The proposed method CMSG is implemented in the deep learning framework PyTorch [21]. ResNet101 [12] pre-trained on ImageNet [9] is used as the backbone. The label embeddings are obtained from GloVe [23] trained on the Wikipedia dataset, and their dimension is set to 300. During training, we use the data augmentation method proposed in [30]. The input images are randomly cropped and resized to $448\times 448$ with random horizontal flips. The SGA module has 2 layers, each of which is a multi-head masked attention with 4 heads. The SAA module has 3 layers, each of which is a multi-head cross-attention with 4 heads. Stochastic gradient descent (SGD) is used as the optimizer, with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate is set to 0.00025 for the backbone and 0.005 for the other components. The prediction threshold is set to 0.6. Our model is trained for a maximum of 50 epochs with a batch size of 16, and the learning rate decays by a factor of 10 after 30 epochs.
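
The learning-rate schedule described above can be written as a small plain-Python function. The function name and the zero-based epoch indexing (epochs 0 to 29 use the initial rates, epochs 30 to 49 the decayed rates) are assumptions of this sketch; the paper only states that the rate decays by a factor of 10 after 30 epochs.

```python
def learning_rates(epoch, base_backbone=0.00025, base_other=0.005,
                   decay_epoch=30, factor=0.1):
    """Return (backbone_lr, other_lr) for a zero-based epoch index."""
    scale = factor if epoch >= decay_epoch else 1.0
    return base_backbone * scale, base_other * scale


for epoch in (0, 29, 30, 49):
    print(epoch, learning_rates(epoch))
```

Keeping the two rates in one function makes it explicit that the backbone and the remaining components share the same decay point while starting from different values.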

4.4 Experiment results

For the compared approaches, we use the experimental results reported in their respective publications. The experimental results for all the approaches are summarized in Tables 1–3, where the numbers in bold indicate the best performance and the underlined numbers indicate the second-best performance.

Table 1
Experimental results on PASCAL VOC 2007, and “*” indicates that ResNext-101 32X16d is used as the backbone

Methods	aero	bike	bird	boat	bottle	bus	car	cat	chair	cow	table	dog	horse	motor	person	plant	sheep	sofa	train	tv	mAP
CNN-RNN [29]	96.7	83.1	94.2	92.8	61.2	82.1	89.1	94.2	64.2	83.6	70.0	92.4	91.7	84.2	93.7	59.8	93.2	75.3	99.7	78.6	84.0
ResNet101 [12]	99.5	97.7	97.8	96.4	75.4	91.8	96.1	97.6	74.2	80.9	85.0	98.4	96.5	95.9	98.4	70.1	88.3	80.2	98.9	89.2	89.9
HCP [32]	98.6	97.1	98.0	95.6	75.3	94.7	95.8	97.3	73.1	90.2	80.0	97.3	96.1	94.9	96.3	78.3	94.7	76.2	97.9	91.5	90.9
ML-GCN [6]	99.5	98.5	98.6	98.1	80.8	94.6	97.2	98.2	82.3	95.7	86.4	98.2	98.7	96.7	99.0	84.7	96.7	84.3	98.9	98.7	94.0
SSGRL [4]	99.5	97.1	97.6	97.8	82.6	94.8	96.7	98.1	78.0	97.0	85.6	97.8	98.3	96.4	98.8	84.9	96.5	79.8	98.4	92.8	93.4
MCAR [11]	99.7	99.0	98.5	98.2	85.4	96.9	97.4	98.8	83.7	95.5	88.8	99.1	98.2	95.1	99.1	84.8	97.1	87.8	98.3	94.8	94.8
DAGAT [41]	99.4	97.3	98.2	98.2	80.5	95.3	97.3	97.6	83.1	94.6	86.5	98.4	98.5	95.8	98.8	93.6	97.7	82.8	98.5	93.9	93.8
SST [5]	99.8	98.6	98.9	98.4	85.5	94.7	97.9	98.6	83.0	96.8	85.7	98.8	98.9	95.7	99.1	85.4	96.2	84.3	99.1	95.0	94.5
CMSG	99.5	98.7	98.8	98.5	83.7	98.0	97.7	99.2	82.9	97.0	89.9	99.4	99.1	98.0	99.1	86.9	99.3	86.6	98.9	96.7	95.4
CMSG^*	99.9	99.4	99.2	99.2	89.2	98.7	98.5	99.5	88.7	98.5	93.1	99.8	99.5	99.3	99.2	90.3	99.4	91.4	99.5	96.6	96.9

Table 2

Experimental results on PASCAL VOC 2012, and “*” indicates that ResNext-101 32X16d is used as the backbone

Methods	aero	bike	bird	boat	bottle	bus	car	cat	chair	cow	table	dog	horse	motor	person	plant	sheep	sofa	train	tv	mAP
RMIC [13]	98.0	85.5	92.6	88.7	64.0	86.8	82.0	94.9	72.7	83.1	73.4	95.2	91.7	90.8	95.5	58.3	87.6	70.6	93.8	83.0	84.4
HCP [32]	99.1	92.8	97.4	94.4	79.9	93.6	89.9	98.2	78.2	94.9	79.8	97.8	97.0	93.8	96.4	74.3	94.7	71.9	96.7	88.6	90.5
RCP [30]	99.3	92.2	97.5	94.9	82.3	94.1	92.4	98.5	83.8	93.5	83.1	98.1	97.3	96.0	98.8	77.1	95.1	79.4	97.7	92.4	92.2
SSGRL [4]	99.5	95.1	97.4	96.4	85.8	94.5	93.7	98.9	87.6	96.3	84.6	98.9	98.6	96.2	98.7	82.2	98.2	84.2	98.1	93.5	93.9
DSDL [40]	99.4	95.3	97.6	95.7	83.5	94.8	93.9	98.5	85.7	94.5	83.8	98.4	97.7	95.9	98.5	80.6	95.7	82.3	98.2	93.2	93.2
CMSG	99.2	96.2	98.5	96.8	86.5	95.8	95.2	99.2	86.6	97.0	84.7	99.1	98.2	96.8	98.8	84.1	98.3	81.2	98.5	94.7	94.3
CMSG^*	99.7	98.6	98.8	98.7	90.9	96.2	96.2	99.4	91.7	99.1	90.0	99.6	99.7	98.3	99.2	87.2	99.3	87.9	99.5	96.0	96.3

1) Performance on the VOC2007 Dataset. The results of the proposed method and the state-of-the-art methods on the VOC2007 dataset are shown in Table 1. Our method achieves the best mAP and outperforms other methods in 13 out of 20 categories in terms of AP. Compared with the graph neural network-based methods SSGRL [4], DAGAT [41], and ML-GCN [6], the mAP of CMSG is 2.0%, 1.6%, and 1.4% higher, respectively. Compared with the Transformer-based method SST [5], the mAP is increased by 0.9%. Note that our method achieves an mAP of 96.9% when the ResNext-101 32X16d [34] network with a semi-weakly supervised pre-trained model on ImageNet [35] is used as the backbone.

2) Performance on the VOC2012 Dataset. The results of the proposed method and the state-of-the-art methods on the VOC2012 dataset are shown in Table 2. Similar to the results on VOC2007, CMSG achieves the best mAP and outperforms other methods in 16 out of 20 categories in terms of AP. CMSG achieves a high mAP of 96.3% when ResNext-101 32X16d [34] is used as the backbone.

3) Performance on the MS-COCO 2014 Dataset. The results of the proposed method and the state-of-the-art methods on the MS-COCO 2014 dataset are shown in Table 3. For a fair comparison, we use the results of the compared approaches reported in their respective publications for input images randomly cropped to $448\times 448$. Compared to the baseline ResNet101 [12], ML-GCN [6], CSRA [42], MCAR [11], and SST [5], the mAP of CMSG is 7.9%, 2.0%, 1.5%, 1.2%, and 0.8% higher, respectively. Notably, compared with CMA and MS-CMA [37], which also perform cross-modal attention, the mAP of CMSG is 1.6% and 1.2% higher, respectively. When ResNext-101 32X16d [34] is used as the backbone, CMSG achieves a high mAP of 87.9%, a significant margin over the compared approaches.

Table 3

Experimental results on MS-COCO, “*” indicates that ResNext-101 32X16d is used as the backbone, and “–” denotes that the result was not reported

Methods	All							Top-3
	mAP	CP	CR	CF1	OP	OR	OF1	CP	CR	CF1	OP	OR	OF1
CNN-RNN [29]	61.2	–	–	–	–	–	–	66.0	55.6	60.4	69.2	66.4	67.8
ResNet101 [12]	77.1	79.5	66.0	72.1	83.3	70.7	76.5	83.6	58.7	68.9	88.8	62.6	73.4
ML-GCN [6]	83.0	85.1	72.0	78.0	85.8	75.4	80.3	89.2	64.1	74.6	90.5	66.5	76.7
MCAR [11]	83.8	85.0	72.1	78.0	88.0	73.9	80.3	88.1	65.5	75.1	91.0	66.3	76.7
CMA [37]	83.4	83.4	72.9	77.8	86.8	76.3	80.9	86.7	64.9	74.3	90.9	67.2	77.2
MS-CMA [37]	83.8	82.9	74.4	78.4	84.4	77.9	81.0	88.2	65.0	74.9	90.2	67.4	77.1
MSRN [24]	83.4	86.5	71.5	78.3	86.1	75.5	80.4	84.5	72.9	78.3	84.3	76.8	80.4
CSRA [42]	83.5	84.1	72.5	77.9	85.6	75.7	80.3	88.5	64.2	74.4	90.4	66.4	76.5
SST [5]	84.2	86.1	72.1	78.5	87.2	75.4	80.8	89.8	64.1	74.8	91.5	66.4	76.9
CMSG	85.0	86.4	75.7	80.7	87.2	78.1	82.4	83.0	69.1	75.5	85.0	70.7	77.2
CMSG^*	87.9	89.6	78.7	83.8	90.3	80.1	84.9	87.8	71.4	78.7	88.4	72.1	79.4

According to these experimental results, the proposed method CMSG achieves the best mAP among all compared approaches on the three datasets. On MS-COCO, with the same ResNet101 backbone, it also obtains the best CR, CF1, OR, and OF1 over all labels (Table 3). These results demonstrate the effectiveness of cross-modality semantic guidance for multi-label image classification.

4.5 Ablation study

To demonstrate the effectiveness of each component of our proposed method, we conduct several ablation experiments on the VOC2007 and MS-COCO datasets. CMSG has two main modules, i.e., the semantic-guided attention (SGA) and spatial-aware attention (SAA) modules. To evaluate each module, we remove the modules one at a time and measure the effect on performance. Table 4 reports the mAP for the different combinations of these modules, where CMSG w/o SAA and CMSG w/o SGA denote CMSG without the SAA module and without the SGA module, respectively. CMSG w/o $M$ denotes CMSG without the semantic correlation matrix $M$ in Eq. (5).

As shown in Table 4, the baseline ResNet-101 without the semantic-guided attention (SGA) and spatial-aware attention (SAA) modules obtains an mAP of 92.4% on VOC2007 and 77.1% on MS-COCO. When only the SGA module is added (CMSG w/o SAA), the mAP is 1.8% and 4.4% higher than that of ResNet-101 on VOC2007 and MS-COCO, respectively. Similarly, when only the SAA module is added (CMSG w/o SGA), the mAP is 2.7% and 6.5% higher than that of ResNet-101 on VOC2007 and MS-COCO, respectively. This observation indicates that the SAA module plays a more important role than SGA in multi-label image classification for the proposed method. When both modules are used, CMSG achieves the highest performance on both datasets. Additionally, when we remove the semantic correlation matrix from the SGA module, the mAP is 0.2% and 0.3% lower than that of the full CMSG on VOC2007 and MS-COCO, respectively. This result shows the benefit of incorporating the label correlation matrix $M\in R^{C\times C}$ into the multi-head attention. Introducing $M$ into the multi-head self-attention scores makes the label correlations inferred from the feature representation more accurate. Accordingly, the semantic correlation among labels is implicitly incorporated into the model, which can learn a more effective and discriminative data representation and further improve performance. In conclusion, the module ablation experiments show that every module in CMSG contributes to its performance improvement.

Table 4
Experimental results of ablation study on VOC 2007 and MS-COCO 2014 datasets

Methods	VOC2007 mAP (%)	MS-COCO mAP (%)
ResNet-101	92.4	77.1
CMSG w/o SAA	94.2	81.5
CMSG w/o SGA	95.1	83.6
CMSG w/o $M$	95.2	84.7
CMSG	95.4	85.0
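
The gain attributable to each module can be recomputed directly from Table 4. The following is a minimal sketch of that arithmetic; the dictionary `TABLE4` and the helper `map_gain` are illustrative names introduced here, not part of the CMSG implementation.

```python
# mAP (%) values copied from Table 4; all names here are illustrative.
TABLE4 = {
    "ResNet-101":   {"VOC2007": 92.4, "MS-COCO": 77.1},
    "CMSG w/o SAA": {"VOC2007": 94.2, "MS-COCO": 81.5},
    "CMSG w/o SGA": {"VOC2007": 95.1, "MS-COCO": 83.6},
    "CMSG w/o M":   {"VOC2007": 95.2, "MS-COCO": 84.7},
    "CMSG":         {"VOC2007": 95.4, "MS-COCO": 85.0},
}


def map_gain(variant, reference, table=TABLE4):
    """Return the mAP difference (variant - reference) per dataset, in points."""
    return {
        dataset: round(table[variant][dataset] - table[reference][dataset], 1)
        for dataset in table[variant]
    }


if __name__ == "__main__":
    # Gain of each variant over the plain ResNet-101 baseline.
    for variant in ("CMSG w/o SAA", "CMSG w/o SGA", "CMSG"):
        print(variant, "vs ResNet-101:", map_gain(variant, "ResNet-101"))
    # Contribution of the semantic correlation matrix M.
    print("CMSG vs CMSG w/o M:", map_gain("CMSG", "CMSG w/o M"))
```

Differences are rounded to one decimal place because the table itself is reported at that precision.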

4.6 Parameter sensitivity analysis

The hyperparameters of CMSG studied here are the number of heads and the number of layers of the multi-head masked attention and cross-attention modules. To understand the impact of these hyperparameters on model performance, we conduct experiments on the VOC2007 dataset.

4.6.1 Effect of the number of attention heads

The multi-head attention mechanism can capture information from multiple perspectives. To investigate the impact of the number of attention heads $H$ on the performance of CMSG, we experiment with different values of $H$ on the VOC2007 dataset. In the multi-head attention mechanism, the embedding dimension must be divisible by the number of heads. Therefore, we choose $H$ from $\{1,2,4,7\}$ for the masked attention module and from $\{1,2,3,4,5,6\}$ for the cross-attention module. The experimental results are presented in Fig. 4. We observe that both too few and too many heads typically lead to lower performance, and that the best performance is achieved when $H$ is set to 4. Therefore, in this paper, $H$ is set to 4 for both masked attention and cross-attention in all experiments.
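
The divisibility constraint explains why the two modules have different candidate sets. The sketch below enumerates the admissible head counts for a given embedding dimension. The helper names and the dimensions 196 and 300 are assumptions chosen only because they reproduce the two candidate sets above; the actual embedding dimensions used by CMSG are not stated in this section.

```python
def valid_head_counts(embed_dim, max_heads):
    """List every head count H <= max_heads that divides embed_dim evenly."""
    if embed_dim <= 0 or max_heads <= 0:
        raise ValueError("embed_dim and max_heads must be positive")
    return [h for h in range(1, max_heads + 1) if embed_dim % h == 0]


def head_dim(embed_dim, num_heads):
    """Per-head dimension; fails if the embedding cannot be split evenly."""
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by num_heads={num_heads}"
        )
    return embed_dim // num_heads


if __name__ == "__main__":
    # Hypothetical dimensions, used purely for illustration.
    print(valid_head_counts(196, 7))  # candidate set for a 196-d embedding
    print(valid_head_counts(300, 6))  # candidate set for a 300-d embedding
    print(head_dim(300, 4))           # per-head width when H = 4
```

With $H=4$, each head operates on a quarter of the embedding, which is a middle ground between a single wide head and many narrow ones.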

Figure 4. Results of parameter sensitivity analysis w.r.t. the number of heads $H$ in the two attention modules.

4.6.2 Effect of the number of attention layers

To explore the impact of the number of layers $l$ of the multi-head attention mechanism on the performance of CMSG, we conduct experiments on the VOC2007 dataset. Specifically, we choose the number of layers $l$ of the masked attention and cross-attention modules from $\{1,2,3,4,5,6,7,8\}$, with the number of attention heads fixed at $H=4$. The experimental results are presented in Fig. 5. Smaller values of $l$ typically lead to lower performance because the feature representations are insufficiently learned, whereas larger values lead to lower performance because of excessive learning. The best performance is achieved when $l$ is set to 2 for the masked attention module and 3 for the cross-attention module.

Figure 5. Results of parameter sensitivity analysis w.r.t. the number of layers $l$ in the two attention modules.

5. Conclusion

In this paper, to exploit the relationship between labels and features and to capture semantic-aware spatial features, we propose CMSG, a cross-modality semantic guidance-based framework for multi-label image classification. The proposed framework is mainly composed of two multi-head attention modules. The semantic-guided attention (SGA) module uses the label correlation matrix to guide the features to implicitly capture semantic correlations through a multi-head masked attention mechanism. The spatial-aware attention (SAA) module uses a multi-head cross-attention mechanism to capture the correlations between semantic label embeddings and individual spatial locations, and thereby learns high-level semantic-aware spatial features. Experimental results demonstrate that CMSG achieves superior performance compared to state-of-the-art approaches. Furthermore, our results verify that effectively modeling both the correlations among labels and the correlations between labels and individual spatial locations can further improve the performance of multi-label image classification.

Acknowledgments

This work is supported by the Natural Science Foundation of China: 61806005, the University Synergy Innovation Program of Anhui Province: GXXT-2022-052 and GXXT-2020-012, the Outstanding Young Talents Support Program of Anhui Province: gxyqZD2022032, and the Natural Science Foundation of the Educational Commission of Anhui Province of China: KJ2021A0373.

