Abstract
Graph Convolutional Networks (GCNs) are an effective tool for classification. In a text classification task, the corpus is represented as a word-document graph. However, existing methods predict document categories from the document nodes alone, neglecting the auxiliary role that word nodes can play in classification. To address this, this paper proposes a novel GCN structure based on Token-Task Learning (TTL) for text classification. Part-of-speech (POS) tagging or Named Entity Recognition (NER) is performed as an auxiliary task for text classification. By linking token classification with text classification, text category prediction can take into account the information carried by word nodes, thereby improving classification accuracy. In addition, this paper replaces ReLU in TextGCN with Mish to enhance the data-fitting capability of the GCN. Experiments on five text classification datasets show that the proposed method effectively improves the accuracy of text classification and outperforms the comparison methods on three of them.
Keywords
Introduction
Text classification and token classification are two related yet distinct tasks in natural language processing (NLP). Text classification (Kowsari et al., 2019) assigns a text document to a predefined category or label. The task may be binary, as in spam detection (deciding whether a text is spam or not), or multi-class, as in news classification and sentiment analysis. In text classification, the entire document is treated as a single input, and the model predicts the category to which the whole document belongs.
Token classification, by contrast, assigns a category to each individual word or token within a text, and therefore operates at a finer granularity than text classification. Prominent applications include named entity recognition (NER), which identifies entities such as person, place, and organization names, and part-of-speech (POS) tagging, which labels each word with its grammatical class. In token classification, the text is decomposed into tokens and the model assigns a label to each of them.
Most prior research on text classification has focused on improving model structures. This paper takes a different perspective and seeks to improve text classification through multi-task learning, which can be integrated into a variety of model structures as an auxiliary mechanism. The primary objective of this paper is therefore to improve text classification performance by using token classification as a supporting task when training text classification models.
Machine learning and deep learning are the primary methodologies for text classification (Mirończuk & Protasiewicz, 2018). Prevalent machine learning techniques in this domain include Support Vector Machine (SVM) (Philip et al., 2023), K-Nearest Neighbor (KNN) (Parlak & Uysal, 2023), and Random Forest (RF) (Shah et al., 2020). These methods achieve good results with minimal computational resources, but their efficacy diminishes on more intricate text classification tasks. Word embedding technology (Incitti et al., 2023) has enabled vector-based deep learning methods, bringing significant performance gains across NLP tasks (Kowsari et al., 2019). In recent years, Convolutional Neural Networks (CNN) (Soni et al., 2023) and Recurrent Neural Networks (RNN) (Chen et al., 2023) have been the predominant deep learning approaches to text classification. Because of its limited convolutional kernel size, a CNN mainly extracts local features from text. An RNN, in contrast, takes every token in the sequence into account and places greater emphasis on global features, but it is susceptible to the vanishing gradient problem. In 2019, Yao et al. applied Graph Convolutional Networks (GCN) to text classification in a model known as TextGCN (Yao et al., 2019). TextGCN organizes documents and words into a large heterogeneous graph, which enables the extraction of structural information, and it has demonstrated better text classification performance than CNN and RNN structures. This paper therefore adopts GCN rather than CNN or RNN.
However, existing work based on TextGCN uses only the document nodes in the heterogeneous graph for classification and overlooks the word nodes, which make up the majority of nodes. This paper aims to make full use of word node information to improve document classification by setting up both a token classification task and a document classification task.
Multi-Task Learning (MTL) trains multiple related tasks concurrently, aiming to exploit the relationships between them and improve the performance of each task (Zhang & Yang, 2021). Various forms of multi-task learning exist, such as Joint Learning, Learning to Learn, and Learning with Auxiliary Tasks (LAT). In LAT, the objective is to improve the main task by means of auxiliary tasks, which should ideally be closely related to the main task or otherwise benefit the overall learning process (Vafaeikia et al., 2020). Current LAT architectures fall into two categories: hard parameter sharing and soft parameter sharing. In soft parameter sharing, each task is trained with its own network, and the networks remain connected to one another; the focus is on identifying similarities among the parameters of the different networks (Lyle et al., 2021). In hard parameter sharing, the tasks share a common underlying model and use different top-level models. Hard parameter sharing therefore couples the tasks more tightly and shares more parameters than soft parameter sharing. It also typically yields a simpler model, because the tasks share a single backbone instead of maintaining separate networks, which improves interpretability and reduces the burden of training and deployment. For these reasons, this paper uses hard parameter sharing to strengthen the connection between token classification and document classification.
Building on these observations, and given the large number of idle word nodes in the TextGCN word-document graph, this study introduces an auxiliary task based on word node classification, with the aim of improving text classification accuracy through token classification. The key contributions of this paper are as follows.
To exploit the information embedded in word nodes, this study incorporates Named Entity Recognition (NER) and Part-of-Speech (POS) tagging as auxiliary tasks for text classification. To the best of our knowledge, this is the first work to perform text classification and Token-Task Learning (TTL) concurrently within a unified graph. Unlike traditional multi-task learning methods, the proposed method transmits information between tasks through the graph.
This paper investigates how effectively the POS tagging task and the NER task support text classification. A hard parameter-sharing architecture is devised to strengthen the link between TTL and text classification.
This paper enhances the GCN structure by adopting the Mish activation function. Mish improves data-fitting capability and facilitates more effective gradient propagation. It also improves the noise resistance of the GCN, reducing the interference caused by noisy words.
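As a point of reference for the activation function mentioned above, Mish is commonly defined as x · tanh(softplus(x)). The following minimal sketch compares it with ReLU on a few scalar inputs; the helper names are illustrative and are not taken from the authors' code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def relu(x: float) -> float:
    """ReLU activation: max(x, 0)."""
    return max(x, 0.0)


# ReLU maps every negative input to zero; Mish stays smooth and keeps
# a small negative response there.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

Because Mish is smooth and nonzero for negative inputs, gradients can still flow where ReLU would output exactly zero, which is one common explanation for the smoother training behavior attributed to it.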
GCN for Text Classification
Liu et al. (2020) introduce a text classification method that uses text graph tensors built from three distinct graph constructions (semantic, syntactic, and sequential) to bridge the gap among different types of graphs. Huang et al. (2019) propose a document-level text classification approach using Graph Convolutional Networks (GCN). The method builds an independent graph for each document to reduce computational cost, and it outperforms TextGCN. Dong et al. (2022) propose a text classification technique based on a Bidirectional Gated Recurrent Unit (BiGRU) (Fang et al., 2021) and GCN. It uses Word2vec (Mallik & Kumar, 2023) for word embeddings, a BiGRU for contextual information, and a GCN for spatial information. Xue et al. (2022) propose a text classification method that combines GCN with Bidirectional Long Short-Term Memory (BiLSTM). It integrates WordNet (Benarafa et al., 2023), BERT, and a BiLSTM with attention to extract contextual relationships, which are combined through residual concatenation for classification. For short text classification, Ye et al. (2020) propose a model based on GCN and BERT (Devlin et al., 2018). It uses a word-document-topic graph structure, with document topics derived by the Biterm Topic Model (BTM) (Huang et al., 2020). Word node features are fused with word features from BERT and fed into a BiLSTM to extract contextual semantics, and the merged features, together with the document node features, produce the final classification result. Lin et al. (2021) put forth a text classification model that combines BERT with GCN. The model initializes the GCN node vectors with BERT and trains GCN and BERT jointly to exploit the advantages of both.
Additionally, Wang et al. (2022) devise a GCN text classification method based on inductive graphs, in which the original dataset is statistically summarized into small graphs, yielding good classification results.
MTL for Text Classification
Zhao et al. (2020) introduce a generative multi-task learning approach for text classification that comprises a shared encoder, a multi-label classification decoder, and a hierarchical classification decoder. Yang and Shang (2019) propose a multi-task learning technique for text classification based on a bidirectional language model. It uses language modeling as an auxiliary task in the private component to extract task-specific features, and it adds a loss constraint based on a uniform label distribution in the shared component to encourage the learning of common features. Xiao et al. (2018) investigate capsule networks for text classification and propose a unified multi-task learning architecture whose task routing algorithm mitigates interference between tasks by clustering features for each task. Lu et al. (2020) present a hybrid representation-learning network for text classification. It consists of a BiGRU with an attention network module and a convolutional neural network module: the attention module learns private feature representations from the training texts, while the convolutional module learns global representations through sharing. Finally, Mao et al. (2021) introduce a multi-task learning method called BanditMTL, which is based on an adversarial multi-armed bandit framework and uses a mirror gradient ascent-descent algorithm (Qu et al., 2022) to regularize task variance.
Summary Analysis
Table 1 presents the advantages and limitations of related works.
Related Work Details.
Table 1 shows that the GCN-based text classification methods (Dong et al., 2022; Huang et al., 2019; Lin et al., 2021; Liu et al., 2020; Xue et al., 2022; Ye et al., 2020) improve upon TextGCN either by innovating in graph construction or by integrating other neural networks to strengthen feature extraction, and they achieve certain performance gains. However, none of these studies directs its innovation at the word nodes of the TextGCN text graph. This paper argues that the information carried by the large number of word nodes in the text graph can effectively improve text classification. The multi-task learning methods (Lu et al., 2020; Mao et al., 2021; Xiao et al., 2018; Yang & Shang, 2019; Zhao et al., 2020) improve text classification by constructing a multi-task framework, mainly by introducing CNNs or RNNs to build more complex networks. They, too, overlook the contribution that token-level tasks can make to text classification. In summary, this paper introduces token-level tasks, specifically part-of-speech tagging and named entity recognition, to improve text classification. What sets the approach apart is that the text classification and token classification tasks are performed within the same graph, which has not been done before. As a result, information from the token tasks propagates naturally to the text classification task through the graph structure.
Method Structure
The network structure of the proposed method is shown in Figure 1.

The Text Classification Structure Based on TTL and GCN.
As shown in Figure 1, words are represented by blue nodes and documents are represented by green nodes. Edges between words are represented in pink and edges between documents and words are represented in black. Both TTL and text classification methodologies exhibit a shared foundational graph convolutional layer. This layer, characterized by hard parameter sharing, serves to enhance the interlinkages between disparate tasks. The proposed method consists of two parts, which are data preprocessing and Token-Task Learning - Graph Convolutional Network (TTL-GCN). The overall algorithm of the proposed method is shown in Algorithm 1.
Data preprocessing closely follows Yao et al. (2019). Because the datasets used here carry no Part-of-Speech (POS) or entity tags, external tools were used to annotate them. Each document was first segmented into words. The Natural Language Toolkit (NLTK) (Bird & Edward, 2009) was then used for POS tagging, and a pre-trained BERT model from Hugging Face was used to annotate the entity information.
Because a word can take different POS and entity labels in different contexts, several labels may be assigned to the same word. In a GCN, however, each word node has only a single feature representation, so the model must perform either multi-label or single-label prediction for each token. An MTL architecture based on multi-label prediction was tried on the five datasets, but its performance was suboptimal. This study therefore retains only the most frequent POS tag and the most frequent entity label for each word.
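The label-reduction step can be sketched as a majority vote over all tagged occurrences of a word. This is an illustrative sketch only: the function name, the input format of (word, tag) pairs, and the tie-breaking rule are assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict


def majority_labels(tagged_tokens):
    """Map each word to its most frequent tag across the corpus.

    tagged_tokens: iterable of (word, tag) pairs from all documents.
    Ties go to the tag seen first, because Counter.most_common keeps
    first-encountered order among equal counts.
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# "book" is tagged as a noun twice and as a verb once, so it keeps "NN".
tagged = [("book", "NN"), ("book", "VB"), ("book", "NN"), ("flight", "NN")]
labels = majority_labels(tagged)
print(labels)
```

The same routine applies unchanged to entity labels, since it only assumes one tag per token occurrence.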
Word relationships were then established from context. A sliding window of length 20 was moved over each document, recording how often each word occurred within a window and how often pairs of words appeared together within the same window. The detailed process is shown in Figure 2.
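A minimal sketch of the window scan follows, assuming the TextGCN convention of counting, for each word and each unordered word pair, the number of sliding windows in which it appears. The function name and the handling of documents shorter than the window are assumptions.

```python
from collections import Counter
from itertools import combinations


def window_counts(docs, window_size=20):
    """Count sliding windows containing each word and each word pair.

    docs: list of token lists.
    Returns (n_windows, word_count, pair_count), where pair_count is keyed
    by alphabetically sorted (word_a, word_b) tuples.
    A document shorter than window_size contributes a single window.
    """
    n_windows = 0
    word_count = Counter()
    pair_count = Counter()
    for tokens in docs:
        if not tokens:
            continue
        for start in range(max(len(tokens) - window_size, 0) + 1):
            window = set(tokens[start:start + window_size])
            n_windows += 1
            word_count.update(window)
            pair_count.update(combinations(sorted(window), 2))
    return n_windows, word_count, pair_count


# With a window of 2, the document yields the windows {a, b} and {b, c}.
n_windows, word_count, pair_count = window_counts([["a", "b", "c"]], window_size=2)
print(n_windows, dict(word_count), dict(pair_count))
```

Counting each word at most once per window (via the set) keeps the counts interpretable as window frequencies, which is what a PMI-style weighting needs.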

The Building Graph Process.
Upon processing the dataset as illustrated in Figure 2, this paper derived the interconnections among words and the associations between documents and words. Subsequently, these connections will be assigned weights in accordance with equations (1), (2), and (3).
In equations (1), (2), and (3),
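Assuming the word-word weights follow the point-wise mutual information (PMI) weighting of TextGCN (Yao et al., 2019), on which the preprocessing is based, they would take the following form. This is a sketch under that assumption and not necessarily the authors' exact definition.

```latex
\begin{align*}
\mathrm{PMI}(i, j) &= \log \frac{p(i, j)}{p(i)\, p(j)}, \\
p(i, j) &= \frac{\#W(i, j)}{\#W}, \\
p(i) &= \frac{\#W(i)}{\#W}.
\end{align*}
```

Here #W is the total number of sliding windows in the corpus, #W(i) is the number of windows containing word i, and #W(i, j) is the number of windows containing both words i and j. In TextGCN, an edge is added between two words only when their PMI is positive.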
Ultimately, the amalgamation of document-to-word and word-to-word edges results in the creation of a comprehensive graph. This graph serves as a representation of the relationships among all documents and words within the corpus. The weights assigned to edges between documents and words are determined by Term Frequency-Inverse Document Frequency (TF-IDF) (Ramos, 2003), as illustrated in equations (4), (5), and (6).
In the aforementioned equations,
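The document-word weighting can be sketched as follows, assuming one common TF-IDF variant (term frequency normalized by document length, and an unsmoothed logarithmic inverse document frequency). The exact variant used in the paper may differ, and the function name is illustrative.

```python
import math
from collections import Counter


def tfidf_edges(docs):
    """Return {(doc_index, word): weight} with weight = tf * idf.

    tf  = occurrences of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    edges = {}
    for doc_index, tokens in enumerate(docs):
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            edges[(doc_index, word)] = tf * idf
    return edges


docs = [["graph", "text", "graph"], ["text", "label"]]
edges = tfidf_edges(docs)
for key, weight in sorted(edges.items()):
    print(key, round(weight, 4))
```

With this unsmoothed variant, a word that occurs in every document (here "text") receives weight zero, so it contributes nothing to the document-word edges.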

Word-document Graph.
After the word-document graph is constructed, it is processed by the TTL-GCN structure, as described in the next section.
Graph Convolutional Network
The Graph Convolutional Network proves instrumental in capturing spatial relationships among nodes and subsequently classifying them (Liu et al., 2023; Zhang et al., 2019). In recent years, scholars have increasingly acknowledged the applicability of GCN in text classification, rendering it a prevalent choice for such endeavors.
To streamline computational processes, all operations are executed through matrix multiplication, and the graph structure is retained and computed in the form of an adjacency matrix. The initial layer of the GCN involves node updates as delineated in equations (7) and (8).
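The base-layer update can be illustrated with a small dependency-free sketch. It assumes the usual GCN form, namely symmetric normalization of an adjacency matrix with self-loops followed by a linear map and the Mish activation; the toy graph, the weights, and the function names are hypothetical.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}.

    adj must already contain self-loops so that every degree is positive.
    """
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj_norm, features, weight):
    """One graph convolution: Mish(A_norm @ X @ W)."""
    hidden = matmul(matmul(adj_norm, features), weight)
    return [[mish(v) for v in row] for row in hidden]


# Toy graph: node 0 is a document, nodes 1 and 2 are words.
adj = [[1.0, 0.5, 0.5],
       [0.5, 1.0, 0.2],
       [0.5, 0.2, 1.0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot inputs
weight = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]  # hypothetical 3x2 weights
hidden = gcn_layer(normalize_adjacency(adj), features, weight)
print(hidden)
```

A practical implementation would use sparse matrices in a tensor library; the plain lists here only make the arithmetic of one layer explicit.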
Following the update of the base Graph Convolution Layer, each node accumulates information from its neighboring nodes, resulting in nodes possessing specific spatial characteristics and exhibiting clustering effects. Subsequently, the nodes at the base layer are input into the document layer and token layer. The update equations for these layers are presented in equations (9) and (10).
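One plausible form of the two task-specific layers, under the assumption that each is itself a graph convolution followed by a softmax, is sketched below. The symbols are illustrative and may differ from the authors' notation.

```latex
\begin{align*}
H &= \mathrm{Mish}\bigl(\tilde{A} X W_{0}\bigr), \\
Z_{\mathrm{doc}} &= \mathrm{softmax}\bigl(\tilde{A} H W_{\mathrm{doc}}\bigr), \\
Z_{\mathrm{tok}} &= \mathrm{softmax}\bigl(\tilde{A} H W_{\mathrm{tok}}\bigr).
\end{align*}
```

Here \tilde{A} denotes the normalized adjacency matrix, X the input node features, and H the output of the shared base layer. Only W_0 is shared between the tasks, while W_doc and W_tok are task-specific, which is what makes this a hard parameter-sharing arrangement.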
This paper employs a hard parameter-sharing framework for multi-task learning, with token classification as the auxiliary task and text classification as the primary task. Unlike traditional multi-task learning frameworks, the proposed method uses the graph to convey information between tasks. A strong link between TTL and text classification is established through the common graph convolutional layer. Even when the connection is not made explicit in the loss function, TTL still influences text classification performance through the hard parameter-sharing mechanism, which is a notable advantage of this approach. In this way, the latent information in the idle word nodes of TextGCN is used to transfer knowledge from the auxiliary task (token classification) to the main task (text classification). The loss function for the primary task of text classification is defined in equation (11).
The loss function for the token classification is defined as shown in equation (12).
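To illustrate how the two losses might be combined, the sketch below uses a weighted sum of two cross-entropy terms, with the weight written as `lam`. The weighted-sum form, the mean over labelled nodes, and all names are assumptions made for illustration, not the authors' stated formulation.

```python
import math


def cross_entropy(probs, labels, node_ids):
    """Mean negative log-likelihood over the given labelled nodes.

    probs:  {node_id: list of class probabilities}
    labels: {node_id: index of the true class}
    """
    return -sum(math.log(probs[i][labels[i]]) for i in node_ids) / len(node_ids)


def joint_loss(doc_probs, doc_labels, doc_ids, tok_probs, tok_labels, tok_ids, lam):
    """Return (total, doc_loss, tok_loss) with total = doc_loss + lam * tok_loss."""
    doc_loss = cross_entropy(doc_probs, doc_labels, doc_ids)
    tok_loss = cross_entropy(tok_probs, tok_labels, tok_ids)
    return doc_loss + lam * tok_loss, doc_loss, tok_loss


# Nodes 0 and 1 are documents; nodes 2 and 3 are words.
doc_probs = {0: [0.9, 0.1], 1: [0.2, 0.8]}
doc_labels = {0: 0, 1: 1}
tok_probs = {2: [0.5, 0.5], 3: [0.25, 0.75]}
tok_labels = {2: 0, 3: 1}
total, doc_loss, tok_loss = joint_loss(
    doc_probs, doc_labels, [0, 1], tok_probs, tok_labels, [2, 3], lam=0.5
)
print(round(total, 4), round(doc_loss, 4), round(tok_loss, 4))
```

Setting `lam` to zero recovers plain document classification, so the weight directly controls how strongly the auxiliary token task shapes the shared layer.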
This section experimentally verifies the effectiveness of the proposed method and compares it with existing methods.
Experimental Datasets
This paper selected MR (Bo & Lillian, 2005), R8 (Fabrizio, 2002), R52 (Fabrizio, 2002), 20NG (Moschitti & Basili, 2004), and Ohsumed (Joachims, 1998) as representative experimental datasets. Because the proposed method builds on TextGCN, the comparative experiments use the same five datasets as TextGCN so that the improvement can be assessed directly. Information about these datasets is presented in Table 2.
The Datasets Information.
As shown in Table 2, these datasets cover long texts, short texts, and sentiment analysis, and are therefore broadly representative of text classification. In addition, the number of word nodes is much larger than the number of document nodes, indicating that the word-document graph contains a large number of idle nodes. In particular, in 20NG, word nodes account for about 70% of all nodes.
The accuracy of classification tasks serves as the key experimental evaluation metric. The calculation method is illustrated in equation (14).
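For reference, the standard definition of classification accuracy over N test samples is given below; it is stated as the conventional form and is assumed to match the paper's equation.

```latex
\[
\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\bigl[\hat{y}_i = y_i\bigr]
\]
```

Here y_i is the true label of sample i, \hat{y}_i is the predicted label, and the indicator equals 1 when the two agree and 0 otherwise.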
The experimental parameters of the proposed method are largely consistent with those of TextGCN (Yao et al., 2019). The value of λ is not fixed; it is chosen for each dataset according to the resulting text classification performance. Training stops when the loss on the validation set no longer decreases.
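The stopping rule can be sketched as patience-based early stopping on the validation loss. The `step` callback, the `patience` value, and the epoch limit are hypothetical placeholders; the paper does not state them here.

```python
def train_with_early_stopping(step, max_epochs=200, patience=10):
    """Run `step(epoch) -> validation loss` until the loss stops improving.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs. Returns (best_epoch, best_loss).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        val_loss = step(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss


# Simulated validation losses: they improve for three epochs, then plateau.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
best_epoch, best_loss = train_with_early_stopping(
    lambda epoch: losses[epoch], max_epochs=len(losses), patience=3
)
print(best_epoch, best_loss)
```

In a real training loop, `step` would run one optimization epoch and evaluate the validation set, and the model weights from `best_epoch` would be kept for testing.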
Baselines
In this section, we will briefly introduce the baseline models of this paper.
Results
This section presents the experimental results together with a brief analysis. The test accuracy of each method is shown in Table 3.
Text Classification Accuracy on Test Set.
In Table 3, the proposed method achieves the best accuracy on three of the five text classification datasets. Notably, the GCN based on TTL outperforms the hybrid networks of Fang et al. (2021) and Ye et al. (2020). This is attributed to the token classification performed on the word nodes, which allows text classification to draw on POS or entity information and thereby improves accuracy. On the 20NG and MR datasets, however, the proposed method falls short of the comparison approach. This gap may stem from limitations of the proposed method in handling very long and very short texts. TextGCN also generally outperforms the RNN and CNN models. In text classification, GCN appears to avoid the vanishing gradient problem encountered by RNNs and to mitigate the over-focus on local information that is a common pitfall of CNNs. On simpler tasks such as R8 and R52, traditional machine learning methods outperform deep learning approaches; in particular, SVM achieves good classification accuracy. Conversely, the Transformer model underperforms, possibly because the limited size of the datasets constrains its capabilities. Overall, the proposed method improves text classification by combining GCN with token classification, while its handling of particular text lengths remains a potential weakness.
Subsequently, ablation experiments were conducted to substantiate the efficacy of the proposed methodology, as delineated in Table 4.
Results of Ablation Experiments.
In Table 4, the classification accuracy of TextGCN(Mish) surpasses that of TextGCN on all five datasets, which indicates that the Mish function improves the ability of the GCN to fit text. Furthermore, both TTL-GCN(POS) and TTL-GCN(NER) are more accurate than TextGCN(Mish), highlighting the positive influence of multi-task learning based on token classification on text classification. Together, these results support the efficacy of the proposed method. TTL-GCN(POS) performs better than TTL-GCN(NER), possibly because knowing the POS of a word contributes significantly to understanding the meaning of a text, particularly in sentences where a single word may have multiple meanings. In contrast, knowing whether a word is an entity does not notably improve the understanding of the text's meaning. Consequently, the improvement in text classification is more pronounced with POS tagging than with entity recognition. TTL-GCN(NER) also shows an unexpected performance decline on the Ohsumed dataset. This paper hypothesizes that the decline is linked to the suboptimal performance of NER on the Ohsumed dataset, as detailed in Table 4, which is likely to have adversely affected TTL-GCN(NER).
The text and token classification accuracy under different values of λ is shown in Table 5.
Classification Accuracy of Text and Token on Different λ.

The Relationship Between the Value of Accuracy Improvement and the Number of Word Nodes. The Vertical Coordinate is the Accuracy Rate of TTL-GCN Improvement in %. The Horizontal Coordinate is the Number of Word Nodes in Each Dataset.

The Loss and Accuracy of Text Classification on MR. Where doc_loss Refers to the Loss of the Text Classification Task and eval_loss Refers to the Validation Set Accuracy of the Text Classification Task.
In Table 5, the accuracy of both TTL-GCN(POS) and TTL-GCN(NER) shows a weak positive correlation with λ.
Figure 4 shows that the improvement brought by TTL based on word node classification is positively correlated with the number of word nodes. This suggests that involving more word nodes in token classification tends to benefit text classification. Nevertheless, an excessively large number of word nodes may have a detrimental effect: when the number of word nodes reaches 42,757, the accuracy improvement becomes marginal. The loss and accuracy curves for text classification on the MR dataset are shown in Figure 5.
In Figure 5, although MTL takes more time to reach the optimal loss and accuracy, both its loss curve and its accuracy curve are smoother than those of TextGCN. This suggests that replacing ReLU with Mish makes the training of the model smoother and improves the classification accuracy.
Text classification methods based on traditional TextGCN often neglect a considerable amount of valuable information associated with word nodes, thereby missing an opportunity to positively impact text classification outcomes. This paper introduces an innovative text classification approach that integrates Token-Task Learning with Graph Convolutional Networks. The proposed method effectively leverages the positive influence of word nodes by transforming them into a token classification task, thereby enhancing the training process for the overall text classification task and consequently improving accuracy.
Moreover, this paper adopts the Mish activation function in the Graph Convolutional Network, which also improves results. The proposed method achieves better classification performance than existing hybrid GCN-based networks. Furthermore, for the methods of Fang et al. (2021) and Ye et al. (2020), the accuracy of text classification could be further improved by incorporating the Token-Task Learning structure introduced here.
In summary, the proposed method offers a novel perspective for TextGCN-based text classification techniques, underscoring the significance of word node classification. This emphasis, in turn, leads to enhanced results in text classification.
A limitation of this work is that the performance of TTL-GCN can be influenced by the performance of the token task. When the token task is complex and challenging, text classification performance may suffer. Future work will focus on designing more advanced TextGCN structures and constructing a more reliable token task framework.
Footnotes
Acknowledgements
This work has been supported by Liaoning Provincial Science and Technology Department (No. 2022JH2/101300268) and the transportation department of Liaoning Province (No. 2023-360-17).
Funding
This work was supported by the Liaoning Provincial Science and Technology Department (No. 2022JH2/101300268); and in part by the transportation department of Liaoning Province (No. 2023-360-17).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
