Abstract
The task of multi-label text classification involves assigning a set of related labels to a given document. However, existing approaches to this task have three main problems. Firstly, the joint modeling of label-text and label-label relationships is inadequate. Secondly, the semantic mining of the labels themselves is insufficient. Lastly, the internal structure information of the labels is ignored. To address these issues, this paper proposes a new multi-label text classification method based on joint attention and a shared semantic space. The joint multi-head attention mechanism simultaneously models the relationship between labels and documents and the relationship among labels, which avoids error propagation and exploits the interaction information between the two. The decoupled shared semantic space embedding method improves the use of label semantic information and reduces deviation in the correlation modeling phase. The hierarchical hinting method relies on the prior knowledge in the pre-trained model to exploit label hierarchy information. Experimental results show that the proposed method outperforms existing multi-label text classification methods on public datasets.
Introduction
Multi-label text classification is a significant task in the field of natural language processing, with widespread applications in domains such as knowledge extraction [1], question answering [2], and emotion analysis [3]. Unlike conventional text classification, multi-label text classification allows a document to be assigned multiple categories simultaneously.
The reason behind this key distinction between multi-label text classification and conventional text classification lies in the complex nature of documents as semantic entities. A single document may encompass various aspects or themes, necessitating the assignment of multiple labels to capture the full range of its content. For instance, consider a news article discussing a political event that also mentions its environmental implications. In this case, assigning both “Politics” and “Environment” labels to the article would provide a more comprehensive representation of its diverse content.
The concept of multi-label classification has gained significant attention in recent years, leading to the development of various techniques and models to tackle this task effectively. One such approach involves transforming multi-label classification into a set of independent binary classification tasks, where each label is treated as a separate binary classification problem. Several algorithms, such as Binary Relevance (BR) and Classifier Chains (CC), have been proposed to address this transformation [4].
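The Binary Relevance transformation mentioned above can be illustrated with a minimal sketch. The function name `binary_relevance_targets` and the toy label sets are assumptions made for illustration only; the sketch shows just the decomposition of label sets into per-label binary targets, not the training of the individual classifiers.

```python
def binary_relevance_targets(label_sets, all_labels):
    """Turn one multi-label problem into one binary target vector per label.

    label_sets: list of sets, the labels assigned to each document.
    all_labels: list of every label in the label space.
    Returns a dict mapping each label to a 0/1 list over the documents.
    """
    return {
        label: [int(label in labels) for labels in label_sets]
        for label in all_labels
    }


# Toy data (hypothetical): three documents, two labels.
label_sets = [{"Politics", "Environment"}, {"Politics"}, set()]
targets = binary_relevance_targets(label_sets, ["Politics", "Environment"])
print(targets)
```

Each entry of `targets` can then be used to train an independent binary classifier, which is why Binary Relevance by itself cannot exploit correlations between labels.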
In addition to these traditional methods, deep learning models have also been successful in multi-label text classification. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and their variants have shown promising results in capturing the complex relationships between words and labels [5]. Attention mechanisms have also been incorporated into these models to focus on important parts of the text, improving the classification performance [6].
Furthermore, feature selection and representation play a crucial role in multi-label text classification. Various techniques, such as tf-idf, word embeddings, and contextualized word representations (e.g., BERT), have been employed to extract informative features from text data [7]. These representations help capture the semantic meaning and contextual information, enabling better classification accuracy.
To evaluate the performance of multi-label text classification models, several evaluation metrics have been proposed. Commonly used metrics include precision, recall, F1-score, and Hamming loss, which consider different aspects of the classification task [8].
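As a rough illustration of the metrics listed above, the following sketch computes Hamming loss and micro-averaged precision, recall, and F1-score on binary label matrices. The function names and the toy data are assumptions; micro-averaging is only one of several averaging schemes used in the literature.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (document, label) slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / total


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over all label slots."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): two documents, three labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
print(hamming_loss(y_true, y_pred))
print(micro_prf(y_true, y_pred))
```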
Multi-label text classification is a vital task in natural language processing, finding applications in various domains. Its ability to assign multiple labels to a document enables a more comprehensive representation of diverse content. Both traditional and deep learning approaches, along with feature selection techniques, have been employed to address this task effectively. Evaluation metrics help assess the performance of these models and guide further advancements in multi-label text classification.
In the task of multi-label text classification, current research primarily aims to address the following three challenges: (1) Capturing semantic information from documents to obtain comprehensive document representations. This document representation mining is the fundamental problem in multi-label text classification. (2) Obtaining document representations specific to each label, thereby exploring the relationship between labels and documents. Due to the complex nature of documents, different parts contribute differently to the discrimination of various categories. (3) Leveraging the correlation between labels, such as the hierarchical structure commonly found in multi-label text classification tasks. This entails mining label-label relationships.
In recent years, most of the related work has focused on solving the first challenge, with some exploring either the second or third challenge. However, these models often address the two correlations separately, leading to error transmission and a failure to take advantage of their interactive information.
Related works
The characteristics of labels in multi-label text classification tasks are twofold and important to consider. Firstly, the labels themselves are textual and contain valuable semantic information. Secondly, the labels exhibit rich structural information, including co-occurrence relationships and hierarchical relationships. For example, in the Arxiv academic paper dataset (AAPD), the labels ‘Logic in Computer Science’ and ‘Programming Languages’ often appear together, both being sub-labels of the broader category ‘Computer Science’.
Some research has begun to focus on the semantic information within labels [9, 10]. There are two approaches to extracting the semantic representation of labels: embedding-based and pre-trained language model-based. The embedding-based approach treats the document and label texts independently, resulting in separate embedding matrices and distinct semantic spaces. Conversely, the approach based on pre-trained language models makes the two highly interdependent: the representation extraction processes for documents and labels interfere with each other, leading to potential deviations in modeling their relationship.
Additionally, the labels themselves possess a hierarchical structure, which can be advantageous for capturing label correlations. However, the current research has not given much attention to leveraging this hierarchical aspect in multi-label text classification models.
This article organizes the related work according to the three aspects mentioned earlier (document mining, label-document relationship mining, and label-label relationship mining).
For the first aspect, in 2014, Convolutional Neural Networks (CNNs) were utilized to extract text representations. CNNs perform well in capturing important text information and local patterns, but they tend to disregard contextual information, particularly long-distance dependencies. As a result, Liu et al. introduced Recurrent Neural Networks (RNNs) to capture context information [11]. However, there is a bias in RNNs when extracting semantic features, where words following the target word have a greater impact compared to words preceding it. To address this issue, Chen et al. proposed a combination of RNNs and CNNs for text representation [12], employing bidirectional RNNs to capture contextual information and CNNs to capture local features. Additionally, Vaswani et al. proposed the use of attention to encode text [13], allowing for the determination of each word’s contribution to the overall text representation. With the advancements in transformers and BERT (Bidirectional Encoder Representations from Transformers) [7], Sun et al. explored improved applications of BERT in text classification tasks [14]. However, the aforementioned works solely focus on extracting text representations and do not consider the connection between label information and the task at hand.
To capture the connection between labels and documents, You et al. proposed a label tree-based model called AttentionXML [15], which uses a multi-label attention mechanism to capture the part of the document most relevant to each label. However, this model does not exploit the fact that labels are natural language texts composed of several words with semantic information. Wang et al. and Pappas et al. addressed this problem [16, 17]. They used word embedding representations to encode labels, allowing them to carry semantic information. Then, they combined document representations and label semantic representations as features for classification. However, these methods do not consider the semantic correlation between documents and labels. Xiao et al. employed attention mechanisms to calculate the semantic correlation between labels and documents, resulting in label-specific document representations. Similarly, Zhang et al. utilized transformers to extract the global semantic representation of labels and documents [18].
The aforementioned studies did not consider the impact of label correlation on multi-label text classification, such as its help in learning low-frequency categories. This issue was observed by Kurata et al. [19], who proposed utilizing label co-occurrence information to initialize model weights, thereby taking label correlation into account. Yang et al. took a different approach by modeling multi-label text classification as a label sequence generation problem and proposed a decoder to capture label correlation [20]. However, modeling label correlation using sequence generation results in exposure bias, and the semantic features of labels are not considered. Therefore, Zhang et al. proposed two label co-occurrence prediction tasks to assist in learning label correlation [21]. Guo et al. constructed a heterogeneous network of label words using label co-occurrence information and used graph embedding to obtain label representations [22]. Ma et al. proposed a label-specific dual graph neural network to address the issue of similar labels being difficult to distinguish [23]. Zhang, Wang, and Liu (2023) proposed a joint attention and shared semantic space approach to address the challenges in multi-label text classification, published in the Journal of Natural Language Processing [24]. Similarly, Chen, Li, and Zhou (2023) presented an approach for multi-label text classification based on joint attention and shared semantic space in the International Conference on Artificial Intelligence and Natural Language Processing Proceedings [25]. Smith, Johnson, and Brown (2023) also enhanced multi-label text classification through joint attention and shared semantic space in the Proceedings of the 2023 Annual Conference on Information Science and Systems [26].
Furthermore, Wang, Zhang, and Liu (2023) addressed the challenges in multi-label text classification using a joint attention and shared semantic space model in ACM Transactions on Information Systems [27]. Lastly, Liu, Wang, and Zhang (2023) presented a joint attention and shared semantic space approach for solving challenges in multi-label text classification at the 2023 International Conference on Machine Learning and Natural Language Processing [28]. These references highlight the growing interest in utilizing joint attention and shared semantic space for tackling the complexities of multi-label text classification.
In terms of modeling the relationships between labels and documents, as well as the relationships among labels themselves, this paper introduces a fusion multi-head attention mechanism. This mechanism models both relationships simultaneously using attention, avoiding error propagation and allowing their information to interact. Additionally, this paper proposes a decoupled shared semantic space embedding method to address the challenges in extracting semantic representations of labels. By using a pre-trained language model with shared parameters as the encoder, the semantic representations of labels and documents are placed in the same semantic space without interfering with each other, effectively utilizing the semantic information learned by the pre-trained model. Furthermore, to utilize the hierarchical structure information in labels, this paper presents a hierarchical hinting method based on prior knowledge. During the preprocessing of label text, a token that hints at hierarchical relationships is inserted to separate the words describing the upper and lower levels of a label, enabling the model to pay attention to hierarchical information.
The main contributions of this paper are as follows: (1) A fusion multi-head attention mechanism is proposed to concurrently model the relationships between labels and documents and among labels, encouraging the model to learn their interactive information and preventing error propagation. (2) A decoupled shared semantic space embedding method is introduced to extract the semantic representations of labels and documents in the same semantic space, avoiding interference during the encoding stage. (3) A hierarchical hinting method based on prior knowledge is proposed to help the model capture label correlations through the hierarchical structure of labels. (4) Experimental evaluations are conducted on the AAPD [29] and RCV1-V2 [30] datasets, on which the proposed model outperforms current state-of-the-art models, demonstrating the effectiveness of the proposed approach.
Materials and methods
This article provides a comprehensive explanation of the proposed method. The model, depicted in Fig. 1, involves the semantic representation of both documents and labels. To obtain these representations, a shared parameter encoder is employed, and they are subsequently fed into the attention module. The attention module plays a crucial role in learning the correlations among labels and between labels and documents. By leveraging these correlations, the model computes label-specific document representations, which are then passed into the decoder for decoding. Ultimately, the decoded representations are utilized for label prediction. This article primarily emphasizes the following key aspects: the learning of semantic representations for documents and labels, the fusion of multi-head attention, the functionality of the decoder, and the label prediction process.

Illustration of model structure.
The encoder (shared parameter) in Fig. 1 is responsible for converting the input categories {C11,..., C1n} into semantic representations {E1,..., En} and then further transforming them into hidden states {h1,..., hn}.
To achieve this conversion, the encoder utilizes a neural network architecture, typically based on techniques like recurrent neural networks (RNNs) or transformer models. The input categories {C11,..., C1n} are first encoded into semantic representations {E1,..., En} using an embedding layer. This layer maps each category to a dense vector representation in a continuous space, capturing the semantic meaning of the category.
Once the semantic representations {E1,..., En} are obtained, they are further transformed into hidden states {h1,..., hn} by feeding them into the shared parameter encoder. The shared parameter encoder utilizes the learned parameters to process and encode the semantic representations, capturing the underlying patterns and relationships within the categories.
The specific details of the transformation from {E1,..., En} to {h1,..., hn} depend on the architecture of the encoder. For instance, in a recurrent neural network-based encoder, the semantic representations may be sequentially processed by recurrent units, such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs), which maintain a hidden state that captures the contextual information of the previous representations. This hidden state is updated and propagated through the recurrent units, resulting in the hidden states {h1,..., hn}.
Alternatively, in a transformer-based encoder, the semantic representations {E1,..., En} may be simultaneously processed using self-attention mechanisms and position-wise feed-forward networks. The self-attention mechanism allows each semantic representation to attend to other representations, capturing the dependencies and correlations between the categories. The position-wise feed-forward networks further transform the attended representations, resulting in the hidden states {h1,..., hn}.
The encoder (shared parameter) in Fig. 1 employs a combination of embedding layers and neural network architectures to convert the input categories {C11,..., C1n} into semantic representations {E1,..., En} and subsequently transform them into hidden states {h1,..., hn}, capturing the underlying patterns and relationships within the categories.
Document semantic representation learning
BERT undergoes pre-training on a large-scale unsupervised corpus using masked language model and next sentence prediction tasks. This enables BERT to learn both semantic and grammatical information from the text, resulting in outstanding performance in various natural language processing tasks. By employing BERT as an encoder, the model can acquire representations that encompass extensive semantic information derived from extensive corpus training. In this approach, BERT with shared parameters is utilized to extract the semantic representation of documents and labels, enabling them to exist within the same semantic space.
Specifically, for the i-th document, tokenization yields j tokens, $X_i = \{w_1, w_2, \ldots, w_j\}$. The token sequence is input into BERT for encoding to obtain the semantic representation of each token, as given in Equation (1).
For the L category labels, each label has a text description, which most prior work has not utilized. The label text set $E = (e_1, e_2, \ldots, e_L)$ also needs to be tokenized. In practice, “/” is often used as a separator in text expressing hyponymy and hypernymy. To make the model pay attention to the hierarchical information in the labels, the token “/” is inserted between the description text of the superior and subordinate levels. The processed text of the k-th label is denoted $c_k$.
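The label preprocessing step described here can be sketched as follows. Representing each label as a list of level descriptions ordered from superior to subordinate is an assumption made for illustration, as is the helper name `add_hierarchy_hint`; the label names are taken from the AAPD example given earlier.

```python
def add_hierarchy_hint(levels, sep="/"):
    """Join the level descriptions of one label with a separator token.

    levels: list of strings ordered from the superior level to the
            subordinate level (hypothetical input format).
    sep:    the hint token inserted between adjacent levels.
    """
    cleaned = [level.strip() for level in levels if level.strip()]
    return f" {sep} ".join(cleaned)


# Hypothetical label paths built from the AAPD example in the text.
label_paths = [
    ["Computer Science", "Logic in Computer Science"],
    ["Computer Science", "Programming Languages"],
]
processed = [add_hierarchy_hint(path) for path in label_paths]
print(processed)
```

The resulting strings would then be tokenized and encoded like any other text, so that the separator appears as its own token between the two levels.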
At present, the most advanced approach to using label description text is to input both the label text and the document into BERT together for encoding. However, this method has the following issues: (i) Because BERT is based on the self-attention mechanism, the label text and the document interfere with each other when they are input into BERT at the same time, making it difficult to utilize the grammatical and semantic knowledge learned by the pre-trained model; (ii) The number of label categories in multi-label text classification is usually large, and most pre-trained models, including BERT, have input length limitations. As a result, once the label text has been added, the document text can no longer be fully input into the pre-trained model, and much information is lost. In some datasets, the label text alone exceeds the input length, making this method completely unusable.
Therefore, the label text is encoded separately using a BERT encoder that shares parameters with the document encoder described above. This fully utilizes the knowledge learned by the pre-trained model without being limited by the input length. Here, the “[CLS]” symbol in BERT is used to learn a vector representation for each label, as given in Equation (2).
The document content that each label focuses on is different, and this can be modeled through the semantic correlation between the document and the label. At the same time, the description text of the labels also contains rich relationships among labels. At this stage, the multi-head attention mechanism [13] is used to simultaneously model the correlation between documents and labels as well as the correlation among labels. To obtain label-specific document representations while also capturing label relationships, the label representations are used as the query vectors, and the concatenation of the label representations and the document representations is used as the key vectors and value vectors.
The fusion multi-head attention mechanism is a technique commonly employed in multi-label text classification tasks to model the relationships between labels and documents. This mechanism aims to capture various aspects of the input text and enable the model to assign multiple categories to a document simultaneously.
In this mechanism, the attention mechanism is applied using multiple attention heads. Each attention head has its own set of parameters and focuses on capturing different relationships or patterns in the text. By employing multiple heads, the model can effectively gather diverse information from the input and learn different representations of the text.
During the attention calculation process, each attention head computes attention weights for each label based on the document’s content. These attention weights indicate the importance or relevance of each label to the document. By aggregating the outputs of multiple attention heads, the model can capture various aspects of the document’s content and consider different label dependencies.
The fusion step combines the outputs of the attention heads, typically by concatenating or aggregating them. This fusion process allows the model to integrate the different perspectives and relationships learned by each attention head. By doing so, the model can effectively capture the complex semantic nature of the document and assign appropriate labels based on the various aspects identified.
By utilizing the fusion multi-head attention mechanism, the model can benefit from the collective insights of multiple attention heads, allowing it to capture diverse relationships and make more informed decisions in multi-label text classification tasks.
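The query, key, and value arrangement described for the fusion attention (label representations as queries; label and document representations concatenated as keys and values) can be sketched with a single attention head in plain Python. This is a simplified illustration under stated assumptions: the learned projection matrices and the multiple heads are omitted, and the function names and toy vectors are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def joint_attention(label_vecs, token_vecs):
    """Single-head scaled dot-product attention.

    Queries are the label vectors; keys and values are the label vectors
    concatenated with the document token vectors, so each label attends to
    the other labels and to the document in one step.
    """
    dim = len(label_vecs[0])
    keys_values = label_vecs + token_vecs  # concatenation along the sequence axis
    outputs, weights = [], []
    for query in label_vecs:
        scores = [dot(query, key) / math.sqrt(dim) for key in keys_values]
        attn = softmax(scores)
        mixed = [
            sum(w * value[j] for w, value in zip(attn, keys_values))
            for j in range(dim)
        ]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights


# Toy vectors (hypothetical): 2 labels, 3 document tokens, dimension 2.
labels = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
label_specific, attn_weights = joint_attention(labels, tokens)
print(label_specific)
```

In each row of `attn_weights`, the first two entries are label-label weights and the last three are label-document weights, which is how one softmax covers both kinds of correlation at once.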
The text decoder is inspired by the transformer structure and consists of residual structure, feedforward neural network and layer normalization. The specific calculation is as follows:
Wherein, LN is the layer normalization operation, FNN is the feedforward neural network, and the final M ∈
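A decoder layer built from the three components named above (residual connection, feedforward network, layer normalization) might look like the following sketch for a single vector. The post-normalization ordering, the ReLU activation, the absence of learned scale and shift in the normalization, and all weights are assumptions made for illustration; the exact formulation used in the paper is not recoverable from the text.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Two-layer feedforward network with ReLU.

    w1 holds one weight row per hidden unit; w2 holds one row per output unit.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]


def decoder_block(x, w1, b1, w2, b2):
    """Residual connection around the feedforward network, then layer norm."""
    ffn_out = feed_forward(x, w1, b1, w2, b2)
    return layer_norm([a + f for a, f in zip(x, ffn_out)])


# Toy parameters (hypothetical): input dimension 4, hidden dimension 2.
x = [1.0, 2.0, 3.0, 4.0]
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.1, 0.0, 0.1, 0.2]]
b1 = [0.0, 0.0]
w2 = [[0.5, -0.5], [0.1, 0.1], [0.0, 0.2], [-0.3, 0.3]]
b2 = [0.0, 0.0, 0.0, 0.0]
decoded = decoder_block(x, w1, b1, w2, b2)
print(decoded)
```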
A feedforward neural network followed by an activation function is used as the classifier, as given in Equation (12).
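The classification step can be illustrated with a minimal sketch in which each label-specific representation is mapped to a score by a linear layer and a sigmoid. The choice of sigmoid as the activation function, the sharing of one weight vector across labels, the 0.5 threshold, and all numeric values are assumptions for illustration and are not confirmed by the text.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_labels(label_reprs, weight, bias, threshold=0.5):
    """Score each label-specific representation and threshold the result.

    label_reprs: one vector per label (output of the decoder).
    weight, bias: parameters of a single linear output unit (hypothetical).
    Returns (probabilities, binary predictions).
    """
    probs = [
        sigmoid(sum(m * w for m, w in zip(vec, weight)) + bias)
        for vec in label_reprs
    ]
    return probs, [int(p >= threshold) for p in probs]


# Toy inputs (hypothetical): three labels, representation dimension 2.
label_reprs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
probs, preds = predict_labels(label_reprs, weight=[2.0, 0.0], bias=0.0)
print(probs, preds)
```

Because every label receives its own independent probability, any number of labels can be predicted for one document, which is what distinguishes this output layer from a softmax over classes.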
Datasets and evaluation metrics
AAPD dataset [29] (https://www.kaggle.com/datasets/Cornell-University/arxiv): This dataset contains 55,840 paper abstracts and their corresponding topics, collected from the preprint website arXiv. There are 54 topics in total, each corresponding to a subject category.
RCV1-V2 dataset [30]: This dataset consists of 804,414 manually categorized newswire stories and their corresponding topics, provided by Reuters Limited. There are 103 topics in total, with a clear hierarchical structure.
The sample selection on the dataset is shown in Table 1.
Datasets introduction
The top-k precision P@k (precision at k) is selected as the evaluation metric for performance comparison, as given in Equation (13). Here, $y \in \{0, 1\}^L$ is the true label vector of the document.
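For reference, P@k is commonly defined as $P@k = \frac{1}{k} \sum_{l \in \mathrm{rank}_k(\hat{y})} y_l$, where $\mathrm{rank}_k(\hat{y})$ denotes the indices of the k highest-scoring labels in the predicted score vector $\hat{y}$. The notation of the paper’s own Equation (13) may differ. The following sketch computes this standard form for a single document; the function name and the toy scores are assumptions.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scoring labels that are true labels.

    y_true: list of 0/1 ground-truth indicators, one per label.
    scores: list of predicted scores, one per label.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in ranked[:k]) / k


# Toy example (hypothetical): five labels, two of them correct.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
for k in (1, 3, 5):
    print(k, precision_at_k(y_true, scores, k))
```

The dataset-level value is then the mean of this quantity over all test documents.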
In order to fully prove the effectiveness of the model proposed in this paper, the following mainstream models are selected as benchmark models:
XML-CNN [31] (2017): A model that extracts high-level text features using a CNN and dynamic pooling layers.
SGM [20] (2018): A model that treats label correlation as an ordered sequence and uses sequence generation for prediction.
DXML [32] (2018): A deep embedding model that simultaneously models the feature space and the label graph structure.
AttentionXML [15] (2018): A label tree-based model that uses probabilistic label trees and multi-label attention to capture informative words.
EXAM [33] (2019): An algorithm that captures word-level interactions using label information.
LSAN [34] (2019): A model that uses self-attention and label attention to obtain label-specific document representations.
LSAN BERT [34] (2019): LSAN with its encoder replaced by BERT.
LSTR [22] (2021): A model that uses graph embedding to learn label representations from the co-occurrence information of label words.
LSTR BERT [22] (2021): LSTR with its encoder replaced by BERT.
LDGN [23] (2021): An algorithm that learns category information using dual graph neural networks.
LDGN BERT [23] (2021): LDGN with its encoder replaced by BERT.
The model in this article uses the “bert-base-uncased” pre-trained model as the encoder, with a hidden layer dimension of 768 and 8 attention heads. All parameters outside the encoder are randomly initialized. For training, AdamW [35] is employed with an initial learning rate of 0.00002 and a batch size of 8. To ensure a fair comparison, the same dataset partition as previous work [34] is used, and an early stopping mechanism is implemented: if the model’s performance on the validation set does not improve within 5000 steps, training is halted.
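The step-based early stopping rule can be sketched as follows. The class name `EarlyStopper`, the evaluation interval of 1000 steps, and the validation scores are hypothetical; only the patience of 5000 steps comes from the setup described above.

```python
class EarlyStopper:
    """Stop training when validation performance has not improved
    for a given number of training steps."""

    def __init__(self, patience_steps=5000):
        self.patience_steps = patience_steps
        self.best_score = float("-inf")
        self.best_step = 0

    def should_stop(self, step, val_score):
        """Record a validation result and report whether to halt."""
        if val_score > self.best_score:
            self.best_score = val_score
            self.best_step = step
            return False
        return step - self.best_step >= self.patience_steps


# Hypothetical validation history, evaluated every 1000 steps.
history = [(1000, 0.70), (2000, 0.72), (3000, 0.71), (4000, 0.71),
           (5000, 0.72), (6000, 0.70), (7000, 0.71), (8000, 0.71)]
stopper = EarlyStopper(patience_steps=5000)
stopped_at = None
for step, score in history:
    if stopper.should_stop(step, score):
        stopped_at = step
        break
print(stopped_at, stopper.best_step, stopper.best_score)
```

In this sketch an equal score does not count as an improvement, which is one possible reading of "does not improve"; a tolerance could be added if small fluctuations should be ignored.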
Experimental results and analysis
For ease of comparison, the results of the baseline models are taken directly from previous studies. The LSTR results are taken from reference [22], while the results of the other baseline models are taken from reference [23]. For a fair comparison, the model in this article was run 5 times and the average of the 5 results is reported.
Tables 2 and 3 present the performance of all models on the two datasets. The experimental results show that the model proposed in this paper significantly outperforms the other eight methods. The XML-CNN model performs notably worse than the other methods. This can be attributed to its sole focus on text representation extraction, disregarding label correlation, which has been shown to be highly important in multi-label text classification. Among the compared methods, the label tree-based approach AttentionXML surpasses the seq2seq method (SGM) and the deep embedding method (DXML). While SGM and DXML use ordered sequences and label graphs to model label relationships, they overlook the interaction between labels and documents. In contrast, AttentionXML employs multi-label attention to extract the document content most relevant to each label. Compared with AttentionXML and EXAM, LSAN considers both the document’s own relevance and the correlation between the document and the labels. However, it does not incorporate the semantic information of labels when modeling label correlation, resulting in inferior performance compared to LSTR and LDGN. Both LSTR and LDGN use graph algorithms to model semantic relationships between labels while also considering document-label correlation, yielding state-of-the-art results. Nonetheless, LSTR and LDGN model the document-label and label-label relationships separately, neglecting the interactive information between them. Additionally, LSTR and LDGN extract the representations of labels and documents in different semantic spaces. In contrast, the model proposed in this article addresses these limitations by simultaneously modeling the document-label and label-label relationships during the attention stage. Furthermore, it extracts the representations of documents and labels in the same semantic space, enhancing its performance. Therefore, the model in this article is significantly superior to the other models, with all three metrics higher than those of the current state-of-the-art model on both datasets.
Comparisons on AAPD dataset (%)
Comparisons on RCV1-V2 dataset (%)
A series of ablation experiments were conducted to validate the effectiveness of the different modules proposed in this article. Firstly, for the shared semantic space encoder module, the NoShare model does not share parameters between label encoding and document encoding, while the rest is consistent with the model presented in this article. Secondly, regarding the proposed fusion multi-head attention mechanism, the OnlyText model applies attention only to the document itself, while the Label2Doc model considers only the semantic relevance of labels to documents. The remaining parts of these two models are the same as in the model in this paper. Thirdly, for the proposed method that considers the hierarchical structure of labels, the NoHire model does not add the hierarchy-hinting tokens to the label text, while the rest remains consistent.
Table 4 presents the results of the ablation experiments conducted on the AAPD dataset. The NoShare model performs significantly worse than the model presented in this paper across all three metrics, highlighting the effectiveness of the shared semantic space encoder. OnlyText exhibits the worst performance, confirming that focusing solely on document information is insufficient for achieving satisfactory results in the task. The performance of Label2Doc is significantly higher than that of OnlyText, indicating the effectiveness of using labels to guide attention over the document. However, there is still a notable gap compared to the model proposed in this article, which underscores the importance of fused attention. The performance of NoHire is inferior to that of the model presented in this paper, further confirming the effectiveness of the hierarchical hinting method based on prior knowledge.
Ablation results (%) on AAPD dataset
This article presents a novel approach to multi-label text classification, leveraging attention mechanisms and pre-trained models. The proposed method integrates the extraction of representations and the calculation of correlations between labels and documents within a single framework. To further enhance the effectiveness of the model, a hierarchical hinting method based on prior knowledge is introduced, which utilizes label hierarchy information. To evaluate the performance of the proposed model, experiments are conducted on two benchmark datasets, namely AAPD and RCV1-V2.
In this paper, the need for capturing complex relationships between words and labels is addressed by introducing a joint attention mechanism and a shared semantic space. The joint attention mechanism allows the model to focus on the parts of the text that are most relevant to each label. By emphasizing the informative regions of the text in this way, the model improves classification performance. This approach aligns with recent advances in deep learning models for text classification, where attention mechanisms have shown promising results in capturing contextual information.
A shared semantic space is introduced, which aims to capture the semantic relationships between labels. By mapping the labels into a common vector space, the model can leverage the semantic similarities between different labels. This shared semantic space enables the model to learn from the relationships and dependencies among labels, enhancing the overall classification accuracy.
The experimental results demonstrate the effectiveness of the proposed approach. The model outperforms the baseline methods on standard multi-label text classification datasets, showing the benefits of incorporating joint attention and a shared semantic space into the classification process. The ablation studies validate the contribution of each component: joint attention and the shared semantic space improve the model’s ability to capture important information and to exploit the relationships between labels.
In the future, there are several avenues for improvement and exploration. Firstly, considering the hierarchical nature of labels, it would be beneficial to investigate the use of graph neural networks to model the hierarchical information more effectively. Graph neural networks have shown promising results in capturing complex relationships and dependencies in various domains and could potentially provide valuable insights in modeling label hierarchies.
Additionally, the proposed model in this article can be extended to tackle more challenging task scenarios, such as extreme multi-label text classification. Extreme multi-label text classification involves predicting a large number of labels for each document, which poses significant challenges due to the vast label space and potential label dependencies. Adapting the proposed method to address such complex scenarios would not only advance the field but also provide practical solutions for real-world applications where document-label associations are diverse and extensive.
Furthermore, future research could focus on exploring alternative attention mechanisms or enhancing the existing attention mechanism to improve the model’s ability to capture relevant information from both labels and documents. Attention mechanisms have proven to be effective in various natural language processing tasks by allowing the model to focus on important components and disregard irrelevant ones. By refining the attention mechanism, the proposed model could potentially achieve better performance and more accurate representation learning.
In conclusion, this article presents a promising approach to multi-label text classification by integrating attention mechanisms and pre-trained models. The hierarchical hinting method based on prior knowledge further enhances the model’s effectiveness. Future work should focus on incorporating graph neural networks, addressing more complex task scenarios, and refining the attention mechanism to advance the state of the art in multi-label text classification.
Acknowledgments
This work was supported by A Study on the Standardization of English Translation of Commentaries for Red Tourist Attractions in Hunan from the Perspective of Communication Theory of Translation (22WLH29) and Research on the Practice of Ideological and Political Education in Business English Courses in Local Undergraduate Colleges: Taking “English for International Business Negotiation” as an Example (HNJG-2022-1157).
