A Novel Graph Convolutional Text Classification Based on Token-Task Learning

Abstract

Graph Convolutional Network (GCN) is an effective tool for classification prediction. In a text classification task, the text is constructed as a word-document graph. However, existing methods only make document category predictions based on document nodes in the word-document graph, neglecting the auxiliary role of word nodes in classification. Based on this, this paper proposes a novel GCN structure based on Token-Task Learning (TTL) for text classification. This paper performs part of speech (POS) tagging or Named Entity Recognition (NER) as auxiliary tasks for text classification. By establishing the relationship between token classification and text classification, text category prediction can take into account the information implied by word nodes, thereby enhancing the accuracy of text classification. In addition, this paper replaces Relu in TextGCN with Mish to enhance data fitting capability of GCN. The experiments are carried out on five text classification datasets, and the experimental results show that the proposed method effectively improves the accuracy of text classification while outperforming the comparison methods.

Keywords

graph convolutional network (GCN); text classification; token-task learning (TTL); part of speech (POS) tagging; named entity recognition (NER)

1. Introduction

Text classification and Token classification are two interrelated yet distinct tasks within the domain of natural language processing (NLP). Text classification (Kowsari et al., 2019) involves assigning a text document to a predefined category or label. The primary objective is to automatically ascertain the appropriate category for a given text through the utilization of algorithms or models. This task may manifest as a binary classification challenge, such as spam detection (determining whether a text is spam or non-spam), or as a multi-classification problem, where texts are categorized into multiple classes, as seen in applications like news classification and sentiment analysis. In text classification, the entire document is treated as a singular input, and the model’s role is to predict the category to which the entire document belongs.

Conversely, Token classification is concerned with categorizing each individual word or token within a text. This task operates at a more granular level than text classification, focusing on each discrete unit of text. Prominent applications encompass named entity recognition (NER), which involves identifying entities like person names, place names, and organization names, as well as part-of-speech (POS) tagging, where each word is tagged with its grammatical classification. In Token classification, the text undergoes decomposition into individual tokens, with each token being subjected to the classification process independently. Consequently, the model is required to sequentially process each word or tag in the text, assigning a specific category or label to each unit.

The majority of prior research on text classification has focused on enhancing model structures to improve overall text classification performance. However, this paper adopts an alternative perspective by seeking to enhance text classification performance through the application of multi-task learning. Multi-task learning can be seamlessly integrated into various model structures, serving as a valuable auxiliary mechanism in the broader context of text classification endeavors. Consequently, the primary objective of this paper is to enhance text classification performance by employing Token classification as a supportive component in the training process of text classification models.

Machine learning and deep learning serve as the primary methodologies for text classification (Mirończuk & Protasiewicz, 2018). Among the prevalent machine learning techniques in this domain are Support Vector Machine (SVM) (Philip et al., 2023), K-Nearest Neighbor (KNN) (Parlak & Uysal, 2023), and Random Forest (RF) (Shah et al., 2020). These methods exhibit commendable text classification results while demanding minimal computational resources. However, their efficacy diminishes when confronted with intricate text classification tasks. The emergence of word embedding technology (Incitti et al., 2023) has ushered in a transformative era for vector-based deep learning methods, resulting in significant performance enhancements across various Natural Language Processing tasks (Kowsari et al., 2019). In recent years, Convolutional Neural Network (CNN) (Soni et al., 2023) and Recurrent Neural Network (RNN) (Chen et al., 2023) have been the predominant deep learning approaches utilized in text classification. CNN primarily focuses on extracting local feature information from text, primarily due to the constraints imposed by its convolutional kernel size. In contrast, RNN considers the significance of each token within the text, placing greater emphasis on extracting global feature information. However, it is crucial to note that RNN is susceptible to the issue of gradient disappearance. In 2019, Yao et al. introduced a groundbreaking application of Graph Convolutional Networks (GCN) to text classification, known as TextGCN (Yao et al., 2019). TextGCN has demonstrated superior text classification performance compared to CNN and RNN structures. Therefore, in this paper, we opt to employ GCN instead of CNN or RNN. Furthermore, TextGCN organizes documents and words into a large heterogeneous graph, enabling the extraction of structural information. 
However, existing work based on TextGCN solely utilizes the document nodes in the heterogeneous graph for classification, overlooking the information of word nodes, which constitute the numerical majority. In this paper, we aim to fully leverage word node information to enhance document classification performance by establishing both a Token classification task and a document classification task.

Multi-Task Learning (MTL) entails the integration of multiple interconnected tasks to facilitate concurrent learning, aiming to unveil the relationships between these tasks and enhance the performance of each individual task (Zhang & Yang, 2021). Various forms of multi-task learning exist, such as Joint Learning, Learning to Learn, and Learning with Auxiliary Tasks (LAT), among others. In the context of LAT, the primary objective is to augment the performance of the main task by leveraging supplementary tasks. These auxiliary tasks should ideally exhibit close relevance to the main task or possess the capacity to enhance the overall learning process (Vafaeikia et al., 2020). Currently, predominant LAT architectures fall into two categories: hard parameter sharing and soft parameter sharing. Soft parameter sharing involves training multiple tasks using independent networks, while these networks maintain connections with each other. The focus of soft parameter sharing is on identifying similarities among parameters across different networks (Lyle et al., 2021). In contrast, hard parameter sharing implies that different tasks share a common underlying model and employ different top-level models. Hard parameter sharing exhibits stronger task relevance and higher parameter sharing than soft parameter sharing. This paper aims to strengthen the connection between Token classification and document classification using hard parameter sharing. Additionally, hard parameter sharing typically results in a simpler model structure due to fewer shared layers. This simplicity improves the interpretability of the model and reduces the burden of training and deployment. Therefore, in this paper, we opt to employ the hard parameter sharing model in the context of multi-task learning.

Building upon the aforementioned observations, and taking into account the presence of a substantial quantity of dormant word nodes within the word-document graph in TextGCN, the present study introduces an auxiliary task centered around word node classification. The envisaged outcome of this research is the augmentation of text classification accuracy through the incorporation of token classification. The ensuing section delineates the key contributions put forth by this paper.

– To harness the information embedded in word nodes, this study incorporates Named Entity Recognition (NER) and Part-of-Speech (POS) tagging as auxiliary tasks for text classification. This study is a first attempt to perform text classification and Token-Task Learning (TTL) concurrently within a single graph. The proposed method also differs from traditional multi-task learning methods in that it transmits information between tasks through the graph.
– This paper investigates the efficacy of the POS tagging task and the named entity recognition task in supporting text classification. A hard parameter-sharing architecture is used to strengthen the link between TTL and text classification.
– This paper enhances the structure of GCN by incorporating the Mish activation function. Mish improves the data fitting capability of the network and facilitates more effective gradient propagation. Mish also improves the noise resistance of GCN, reducing the interference caused by noisy words.

2. Literature Review

2.1. GCN for Text Classification

The scholarly literature (Liu et al., 2020) introduces a text classification methodology leveraging text graph tensors and three distinct compositions (semantic, syntactic, and sequential) to bridge the gap among various types of graphs. In a related vein, the work by Huang et al. (2019) proposes a document-level text classification approach using Graph Convolutional Networks (GCN). This method constructs each document as an independent graph to optimize computational costs, outperforming TextGCN in terms of performance. In a different approach, Dong et al. (2022) suggests a text classification technique relying on Bidirectional Gated Recurrent Unit (BiGRU) (Fang et al., 2021) and GCN. This method incorporates Word2vec (Mallik & Kumar, 2023) for word embedding representation, BiGRU for contextual information, and input GCN for extracting spatial data. Furthermore, Xue et al. (2022) advocates for a GCN with Bidirectional Long Short-Term Memory (BiLSTM) text classification method. This method integrates Wordnet (Benarafa et al., 2023), BERT, and Bidirectional LSTM with Attention to extract contextual relationships. The contextual relationships are combined through residual concatenation for classification purposes. In a concise text classification model, Ye et al. (2020) proposes a methodology based on GCN and BERT (Devlin et al., 2018). This approach utilizes a word-document-topic graph structure enabled by the Biterm Topic Model (BTM) (Huang et al., 2020) to derive document topics. Word node features are fused with word features from BERT and input to Bidirectional LSTM (BiLSTM) to extract contextual semantics. The merged features, along with document node features, contribute to the final classification results. Literature (Lin et al., 2021) puts forth a text classification model based on BERT with GCN. The model initializes the node vector of GCN using BERT and jointly trains both GCN and BERT to fully exploit the advantages of both models. 
Additionally, literature (Wang et al., 2022) devises a GCN text classification method based on inductive graphs. In this approach, the original dataset is statistically summarized into small graphs, resulting in commendable standalone classification results.

2.2. MTL for Text Classification

The referenced literature (Zhao et al., 2020) introduces a generative multi-task learning approach tailored for text classification and categorization. This method incorporates a shared encoder, a multi-label classification decoder, and a hierarchical classification decoder. Another source (Yang & Shang, 2019) proposes a bidirectional language model-based multi-task learning technique for text classification. This method utilizes language modeling as an auxiliary task within the private component to extract task-specific features. It also integrates a loss constraint via a uniform label distribution in the shared component to facilitate the learning of common features. Furthermore, capsule networks are investigated for text classification tasks in the literature (Xiao et al., 2018). A unified and effective multi-task learning architecture is proposed, employing a task routing algorithm to mitigate interference between tasks by clustering features for each specific task. In a separate investigation (Lu et al., 2020), a hybrid representation-learning network is presented for text classification tasks. This approach consists of two essential components: a BiGRU and an attention network module, complemented by a convolutional neural network module. The attention module enables the model to learn private feature representations from training texts, while the convolutional neural network module facilitates the learning of global representations through sharing. Finally, the literature references (Mao et al., 2021) introduce a novel multi-task learning method called BanditMTL. This method, based on an adversarial multi-armed bandit framework, employs a mirror gradient ascent-descent algorithm (Qu et al., 2022) to regularize task variance.

2.3. Summary Analysis

Table 1 presents the advantages and limitations of related works.

Table 1.
Related Work Details.

Work	Advantage	Limitation
Liu et al. (2020)	Three feature graphs, semantic, syntactic and sequential, are constructed	Ignored the role of word nodes
Huang et al. (2019)	Constructing graphs at the document level saves computational resources	Ignored the role of word nodes
Dong et al. (2022)	Merge BiGRU and GCN, combining the contextual and spatial information of the text	Ignored the role of word nodes
Xue et al. (2022)	Constructing graphs based on dependency syntax for graph embedding, combined with BERT and BiLSTM	Ignored the role of word nodes
Ye et al. (2020)	Combined BERT and GCN to construct a topic-word-document graph	Ignored the role of word nodes
Lin et al. (2021)	The first combination of BERT with GCN	Ignored the role of word nodes
Zhao et al. (2020)	Multitask Learning for Multi-Label Classification and Hierarchical Label Classification Based on BiLSTM	Lack of correlation between tasks
Yang and Shang (2019)	Based on BiLSTM, a multitask learning framework was constructed for language modeling and text classification tasks	Ignored tasks at the Token level
Xiao et al. (2018)	Constructed multiple text classification task frameworks based on capsule networks	Ignored tasks at the Token level
Lu et al. (2020)	Constructed a multi-task learning framework by integrating BiGRU and CNN	Ignored tasks at the Token level
Mao et al. (2021)	Joint Learning of Sentiment Analysis and Topic Text Classification Based on CNN	Ignored the role of word nodes

From Table 1, it can be observed that text classification (Dong et al., 2022; Huang et al., 2019; Lin et al., 2021; Liu et al., 2020; Xue et al., 2022; Ye et al., 2020) based on GCN improves upon TextGCN by innovating in graph construction or integrating other neural networks to enhance feature extraction capabilities, resulting in certain performance gains. However, none of the studies focus their innovations on the word nodes within the TextGCN text graph. This paper argues that leveraging the information from a large number of word nodes in the text graph can effectively enhance the performance of text classification tasks. Literature (Lu et al., 2020; Mao et al., 2021; Xiao et al., 2018; Yang & Shang, 2019; Zhao et al., 2020) improves text classification performance by constructing a multitask learning framework, focusing on introducing CNN or RNN to build complex networks, effectively boosting text classification performance. However, they also overlook the contribution of token-level tasks to text classification tasks. In summary, this paper introduces token-level tasks to improve text classification performance, specifically part-of-speech tagging and named entity recognition. What sets this approach apart is that both text classification and token classification tasks are performed within the same graph, which is unprecedented. The benefit of this approach is that information from token tasks can naturally propagate to the text classification task through the graph structure.

3. Proposed Methodology

3.1. Method Structure

The network structure of the proposed method is shown in Figure 1.

Figure 1.

The Text Classification Structure Based on TTL and GCN.

As shown in Figure 1, words are represented by blue nodes and documents are represented by green nodes. Edges between words are represented in pink and edges between documents and words are represented in black. Both TTL and text classification methodologies exhibit a shared foundational graph convolutional layer. This layer, characterized by hard parameter sharing, serves to enhance the interlinkages between disparate tasks. The proposed method consists of two parts, which are data preprocessing and Token-Task Learning - Graph Convolutional Network (TTL-GCN). The overall algorithm of the proposed method is shown in Algorithm 1.

3.2. Data Preprocessing

Our preprocessing closely follows that of TextGCN (Yao et al., 2019). Because the datasets used here do not provide Part-of-Speech (POS) or entity tags, we annotated the text with existing tools. We first segmented each document into words and then used the Natural Language Toolkit (NLTK) (Bird & Edward, 2009) for POS tagging. Entity labels were produced with a pre-trained BERT model from Hugging Face.

Given that each word may exhibit distinct POS and entity labels in various contexts, multiple POS and entity labels may be assigned to each word. However, in GCN, word nodes possess only a single feature expression. Consequently, our model engages in either multi-label or single-label prediction for each token. Despite initial experimentation with the MTL architecture based on multi-label prediction across five datasets, the performance was suboptimal. Consequently, this study retains only the most frequently occurring POS and entities for each word.
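The single-label reduction described above can be sketched as follows. This is a minimal illustration, assuming the tagger output is available as (word, tag) pairs; the function name `majority_tags` and the sample tags are hypothetical, and the tie-breaking rule (first tag seen wins) is an assumption, since the paper does not specify one.

```python
from collections import Counter, defaultdict

def majority_tags(tagged_tokens):
    """Map each word to the tag it receives most often across the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    # most_common(1) returns the highest-count tag; on ties it returns the
    # tag encountered first (an assumption, not specified in the paper).
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Hypothetical tagger output: "run" is tagged as a verb twice and a noun once.
tagged = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("fast", "RB")]
labels = majority_tags(tagged)
```

Each word node then carries exactly one target label, which matches the single feature vector that a word node has in the graph.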

Subsequent to this, this paper proceeded to establish word relationships based on context. Employing a window of length 20, this paper systematically scanned each document, recording the frequency of occurrence for individual words within the window. Furthermore, this paper documented the frequency of occurrence for adjacent word pairs within the same window. The detailed process is elucidated in Figure 2.

Figure 2.

The Building Graph Process.

Upon processing the dataset as illustrated in Figure 2, this paper derived the interconnections among words and the associations between documents and words. Subsequently, these connections will be assigned weights in accordance with equations (1), (2), and (3).

\begin{aligned} p (i) & = \frac{N_{i}}{N_{w}} \end{aligned}

(1)

\begin{aligned} p (i, j) & = \frac{N_{i j}}{N_{w}} \end{aligned}

(2)

\begin{aligned} \mathrm{PMI} (i, j) & = \log \frac{p (i, j)}{p (i) p (j)} \end{aligned}

(3)

In equations (1), (2), and (3), $N_{w}$ denotes the total number of sliding windows, $N_{i}$ is the number of sliding windows that contain term $i$, and $N_{i j}$ is the number of sliding windows in which terms $i$ and $j$ co-occur. The strength of the relationship between two terms is measured by Pointwise Mutual Information (PMI) (Bouma, 2009). A higher PMI value indicates a stronger association between the two terms. If the PMI value exceeds 0, the two words are deemed strongly associated, and an edge weighted by the PMI value is established between them.
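The word-to-word edge construction can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: it assumes PMI is taken in its logarithmic form (so that the threshold of 0 separates pairs that co-occur more often than chance from those that do not), that a word is counted at most once per window, and that a document shorter than the window forms a single window. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def sliding_windows(tokens, size):
    """All windows of `size` tokens; a short document is a single window."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def pmi_edges(documents, window_size=20):
    """Return {(word_i, word_j): PMI} for word pairs with PMI > 0."""
    windows = [w for doc in documents for w in sliding_windows(doc, window_size)]
    n_windows = len(windows)                 # N_w
    word_count = Counter()                   # N_i: windows containing word i
    pair_count = Counter()                   # N_ij: windows containing i and j
    for window in windows:
        vocab = sorted(set(window))          # count each word once per window
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    edges = {}
    for (wi, wj), n_ij in pair_count.items():
        p_ij = n_ij / n_windows
        p_i = word_count[wi] / n_windows
        p_j = word_count[wj] / n_windows
        pmi = math.log(p_ij / (p_i * p_j))
        if pmi > 0:                          # keep only positively associated pairs
            edges[(wi, wj)] = pmi
    return edges

# Toy corpus of four two-word documents (hypothetical data).
docs = [["graph", "neural"], ["graph", "neural"], ["text", "label"], ["graph", "text"]]
edges = pmi_edges(docs, window_size=2)
```

In this toy corpus the pair ("graph", "text") co-occurs less often than chance would predict, so its PMI is negative and no edge is created. Edge keys are alphabetically sorted pairs because the graph is undirected.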

Ultimately, the amalgamation of document-to-word and word-to-word edges results in the creation of a comprehensive graph. This graph serves as a representation of the relationships among all documents and words within the corpus. The weights assigned to edges between documents and words are determined by Term Frequency-Inverse Document Frequency (TF-IDF) (Ramos, 2003), as illustrated in equations (4), (5), and (6).

\begin{aligned} \mathrm{TF} & = \frac{N_{i}}{N_{d}} \end{aligned}

(4)

\begin{aligned} \mathrm{IDF} & = \log \left( \frac{N_{D}}{N_{i d}} \right) \end{aligned}

(5)

\begin{aligned} \mathrm{TF\text{-}IDF} & = \mathrm{TF} \times \mathrm{IDF} \end{aligned}

(6)

In the aforementioned equations, $N_{d}$ represents the total number of words in the current document, $N_{i}$ denotes the frequency of occurrence of the term $i$ in the current document, $N_{D}$ signifies the total number of documents in the corpus, and $N_{i d}$ stands for the count of documents containing the word $i$ (Li et al., 2023). TF-IDF is primarily employed for evaluating the significance of words in the context of document classification. A higher TF-IDF score indicates that a word carries greater importance within a given document. Subsequent to the completion of the aforementioned computational steps, this paper constructed a word-document graph, as depicted in Figure 3.
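Equations (4) to (6) translate directly into code. The sketch below is illustrative only; the function name `tfidf_edges` and the toy corpus are hypothetical, and the natural logarithm without smoothing is an assumption, since the paper does not state the base or any smoothing term.

```python
import math
from collections import Counter

def tfidf_edges(documents):
    """Return {(doc_index, word): TF-IDF weight} for document-word edges."""
    n_docs = len(documents)                  # N_D
    doc_freq = Counter()                     # N_id: documents containing word i
    for doc in documents:
        doc_freq.update(set(doc))
    edges = {}
    for d, doc in enumerate(documents):
        for word, n_i in Counter(doc).items():
            tf = n_i / len(doc)                        # equation (4)
            idf = math.log(n_docs / doc_freq[word])    # equation (5)
            edges[(d, word)] = tf * idf                # equation (6)
    return edges

# Toy corpus of two documents (hypothetical data).
docs = [["graph", "graph", "text"], ["text", "label"]]
weights = tfidf_edges(docs)
```

Under these assumptions a word that appears in every document, such as "text" here, has an IDF of 0 and therefore contributes a zero-weight edge.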

Figure 3.

Word-document Graph.

Following the construction of the word-document graph, this paper applies the TTL-GCN structure to the graph, as elaborated in the subsequent section.

3.3. TTL-GCN

3.3.1. Graph Convolutional Network

The Graph Convolutional Network proves instrumental in capturing spatial relationships among nodes and subsequently classifying them (Liu et al., 2023; Zhang et al., 2019). In recent years, scholars have increasingly acknowledged the applicability of GCN in text classification, rendering it a prevalent choice for such endeavors.

To streamline computational processes, all operations are executed through matrix multiplication, and the graph structure is retained and computed in the form of an adjacency matrix. The initial layer of the GCN involves node updates as delineated in equations (7) and (8).

\begin{aligned} \hat{A} & = D^{- 1 / 2} A D^{- 1 / 2} \end{aligned}

(7)

\begin{aligned} L^{(base)} & = \rho (\hat{A} X W_{o}) \end{aligned}

(8)

Here, $A$ denotes the adjacency matrix of the graph, $\rho$ represents the Mish function introduced by Misra (2019), $D$ signifies the degree matrix of $A$, $X$ denotes the node features, and $W_{o}$ stands for the weight matrix. To enhance the model’s data fitting capacity and noise resistance, the proposed methodology substitutes the Rectified Linear Unit (ReLU) function in Graph Convolutional Networks (GCN) with the Mish function.

Following the update of the base Graph Convolution Layer, each node accumulates information from its neighboring nodes, resulting in nodes possessing specific spatial characteristics and exhibiting clustering effects. Subsequently, the nodes at the base layer are input into the document layer and token layer. The update equations for these layers are presented in equations (9) and (10).

\begin{aligned} L^{(document)} & = \hat{A} L^{(base)} W_{o} \end{aligned}

(9)

\begin{aligned} L^{(token)} & = \hat{A} L^{(base)} W_{o} \end{aligned}

(10)

In equations (9) and (10), $L^{(base)}$ is updated to $L^{(document)}$ and $L^{(token)}$. The proposed method obtains the category prediction for each document from $L^{(document)}$ and the category prediction for each token from $L^{(token)}$. Thus, the proposed method obtains feature outputs in both directions from the same Graph Convolution Layer. The connection between the document layer and the token layer is then established through multi-task learning.
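The shared base layer and the two task heads can be sketched as a single forward pass. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: it uses the standard symmetric normalization $D^{-1/2} A D^{-1/2}$, it gives each head its own weight matrix (`w_doc`, `w_token`) even though equations (9) and (10) both write $W_{o}$, and it returns raw scores without a softmax. All function names, weights, and the three-node graph are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, with D the degree matrix."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in adj]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def mish(x):
    """Mish activation x * tanh(softplus(x)); large x is handled without overflow."""
    softplus = x if x > 20.0 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)

def ttl_gcn_forward(adj, features, w_base, w_doc, w_token):
    """Shared base layer followed by a document head and a token head."""
    a_hat = normalize_adjacency(adj)
    hidden = matmul(matmul(a_hat, features), w_base)
    base = [[mish(v) for v in row] for row in hidden]      # equation (8)
    doc_scores = matmul(matmul(a_hat, base), w_doc)        # equation (9)
    token_scores = matmul(matmul(a_hat, base), w_token)    # equation (10)
    return doc_scores, token_scores

# Hypothetical graph: node 0 is a document, nodes 1 and 2 are words.
adj = [[1.0, 0.5, 0.5],
       [0.5, 1.0, 0.2],
       [0.5, 0.2, 1.0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot inputs
w_base = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]   # shared by both tasks
w_doc = [[0.2, -0.1], [0.0, 0.3]]                 # 2 document classes (assumed)
w_token = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]    # 3 token tags (assumed)
doc_scores, token_scores = ttl_gcn_forward(adj, features, w_base, w_doc, w_token)
```

Both heads produce a row for every node. In training, only the document rows of `doc_scores` and the word rows of `token_scores` would enter their respective losses, while gradients from both tasks update the shared `w_base`.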

3.3.2. Multi-Task Learning

This paper employs a hard parameter sharing framework for multi-task learning, treating token classification as the auxiliary task and text classification as the primary task. Compared with traditional multi-task learning frameworks, the proposed method uses the graph to convey information between tasks. A strong link between TTL and text classification is established through the shared graph convolutional layer. Because of hard parameter sharing, TTL influences text classification even when the two tasks are not explicitly coupled through the loss function, which is a notable advantage of this approach. This paper leverages the latent information in the idle word nodes of TextGCN to transfer knowledge from the auxiliary task (token classification) to the main task (text classification). The loss function for the primary task of text classification is defined in equation (11).

\mathrm{loss}_{document} = - \frac{1}{N} \sum \sum_{d} y (d) \log p (d)

(11)

In equation (11), $d$ represents the document, $y (d)$ is the true label of the document, and $p (d)$ is the predicted value of the document.

The loss function for the token classification is defined as shown in equation (12).

\mathrm{loss}_{token} = - \frac{1}{N} \sum \sum_{w} y (w) \log p (w)

(12)

In equation (12), $w$ represents the word, $y (w)$ is the true label of the word, and $p (w)$ is the predicted value of the word. After that, this paper combines $\mathrm{loss}_{document}$ and $\mathrm{loss}_{token}$, and the total loss function is shown in equation (13).

\mathrm{Loss} = \lambda \times \mathrm{loss}_{document} + (1 - \lambda) \times \mathrm{loss}_{token}

(13)

In equation (13), this paper controls the training focus of the model by changing the value of $\lambda$, which ranges from 0 to 1.
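The weighted combination of the two losses can be sketched as follows. This is a hedged illustration, assuming that the predicted probabilities come from a softmax over raw scores, that labels are integer class indices (equivalent to one-hot $y$), and that each loss is averaged over the labeled nodes of its task. The function names and the sample scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(score_rows, labels):
    """Mean negative log-likelihood of the true class, as in equations (11) and (12)."""
    losses = [-math.log(softmax(row)[y]) for row, y in zip(score_rows, labels)]
    return sum(losses) / len(losses)

def ttl_loss(doc_scores, doc_labels, token_scores, token_labels, lam):
    """Equation (13): lam weights the document loss, (1 - lam) the token loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    loss_document = cross_entropy(doc_scores, doc_labels)
    loss_token = cross_entropy(token_scores, token_labels)
    return lam * loss_document + (1.0 - lam) * loss_token

# Hypothetical scores: one document with 2 classes, one word with 3 tags.
total = ttl_loss([[0.0, 0.0]], [0], [[0.0, 0.0, 0.0]], [1], lam=0.5)
```

Setting `lam` to 1 recovers plain document classification, while smaller values shift the training focus toward the token task.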

4. Experimental Results

This section experimentally verifies the effectiveness of the proposed method.

4.1. Experimental Datasets

This paper selected MR (Bo & Lillian, 2005), R8 (Fabrizio, 2002), R52 (Fabrizio, 2002), 20NG (Moschitti & Basili, 2004), and Ohsumed (Joachims, 1998) as representative experimental datasets in this field. Because the proposed method builds on TextGCN, the comparative experiments use the same five datasets as TextGCN, so that any change in performance can be compared directly. The information about these datasets is presented in Table 2.

Table 2.
The Datasets Information.

Dataset	Docs	Words	Training	Test	Avg Length
MR $^{a}$	10,662	18,764	7,108	3,554	20
R8 $^{b}$	7,674	7,688	5,485	2,189	65
20NG $^{c}$	18,846	42,757	11,314	7,532	221
R52 $^{d}$	9,100	8,892	6,532	2,568	69
Ohsumed $^{e}$	7,400	14,157	3,357	4,043	135

$^{a}$ MR: A two-classified sentiment analysis dataset. https://www.cs.cornell.edu/people/pabo/movie-review-data/.

$^{b}$ R8: A news text classification dataset with eight categories. https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.

$^{c}$ 20NG: A news text classification dataset with 20 categories. https://disi.unitn.it/moschitti/corpora.htm.

$^{d}$ R52: A news text classification dataset with 52 categories. https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.

$^{e}$ Ohsumed: A text classification dataset of medical journals with 23 categories. https://disi.unitn.it/moschitti/corpora.htm.

As shown in Table 2, the datasets cover long texts, short texts, and sentiment analysis, and are therefore broadly representative of the text classification domain. Word nodes account for roughly half or more of all nodes in every dataset, which means that a large share of the word-document graph consists of idle nodes. In 20NG in particular, word nodes account for about 70% of all nodes, so only about 30% of the nodes in this large text graph are used directly for document classification. The potential of word nodes for improving text classification accuracy is therefore considerable.
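The share of word nodes can be checked with a short calculation over the Docs and Words columns of Table 2. The sketch assumes that every document and every vocabulary word becomes exactly one node in the graph; the variable names are illustrative.

```python
# (documents, words) per dataset, copied from Table 2.
table2 = {
    "MR": (10662, 18764),
    "R8": (7674, 7688),
    "20NG": (18846, 42757),
    "R52": (9100, 8892),
    "Ohsumed": (7400, 14157),
}

# Fraction of graph nodes that are word nodes, assuming one node per
# document and one node per vocabulary word.
word_share = {name: words / (docs + words) for name, (docs, words) in table2.items()}

for name, share in word_share.items():
    print(f"{name}: {share:.1%} of nodes are word nodes")
```

For 20NG this gives 42,757 / (18,846 + 42,757), or roughly 69% of all nodes.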

4.2. Experimental Evaluation Index

The accuracy of classification tasks serves as the key experimental evaluation metric. The calculation method is illustrated in equation (14).

\begin{aligned} \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}

(14)

TP represents the count of correctly predicted positive classes, TN is the count of correctly predicted negative classes, FP stands for the count of negative classes predicted as positive, and FN refers to the count of positive classes predicted as negative.
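For single-label classification, the numerator of equation (14) is the number of correctly classified test documents and the denominator is the total number of test documents. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Three of the four hypothetical predictions match the true labels.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```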

4.3. Experimental Setting

The experimental parameters of the proposed method are largely consistent with those of TextGCN (Yao et al., 2019). The value of λ is not fixed; it is chosen for each dataset according to the text classification performance it yields. Training stops when the loss on the validation set no longer decreases.

4.4. Baselines

In this section, we will briefly introduce the baseline models of this paper.

Machine Learning. Machine learning methods have gained widespread popularity in the field of text classification. Our study focuses on three such methods, namely SVM, KNN, and RF, with their text features being initialized by TF-IDF.

Deep Learning. This paper also considered deep learning methods such as BiLSTM, BiGRU, CNN, FastText (Umer et al., 2023) and Transformer (Vaswani et al., 2017). For BiLSTM and BiGRU, this paper uses the last output as the classification token, which is then fed into a linear layer to obtain the prediction. The CNN model is based on TextCNN and uses convolution kernels of sizes (2, 3, 4). In Transformer, all output tokens are merged and fed into a linear layer for prediction. The word vectors for RNN, Transformer, and CNN are initialized using pre-trained GloVe (Pennington et al., 2014). FastText classification averages the word vectors obtained from training and uses a linear layer for prediction.

Recent Related Works. This paper selected TextGCN (Yao et al., 2019), BiGRU-GCN (Fang et al., 2021), and BiLSTM-GCN (Ye et al., 2020) as the methods for comparison. To evaluate the structural advantages and disadvantages of each method, the proposed method initialized the node features with one-hot encoding.

4.5. Results

This section presents the experimental results together with a brief analysis. The test accuracy of each method is shown in Table 3.

Table 3.
Text Classification Accuracy on Test Set (%).

Method	20NG	R8	R52	Oh	MR
SVM	83.49	96.85	93.12	63.06	75.51
KNN	67.81	88.16	85.41	56.56	70.86
RF	77.26	94.91	87.12	58.62	69.51
BiLSTM	73.11	96.52	90.32	49.63	77.44
BiGRU	73.51	96.58	91.26	49.23	76.88
CNN	82.29	95.55	87.49	58.72	77.66
Transformer	74.44	96.53	92.19	53.21	76.32
FastText	79.66	94.65	90.96	55.68	76.52
TextGCN (Yao et al., 2019)	86.38	96.85	93.52	68.46	76.06
BiGRU-GCN (Fang et al., 2021)	86.88	97.10	93.86	68.44	77.52
BiLSTM-GCN (Ye et al., 2020)	86.63	97.44	94.19	69.03	78.30
TTL-GCN	86.77	97.53	94.39	69.21	77.10

In Table 3, the proposed method achieves the highest accuracy on three of the five datasets (R8, R52, and Ohsumed). On these datasets, the TTL-based GCN outperforms the hybrid networks of Fang et al. (2021) and Ye et al. (2020). This paper attributes the result to the token classification task attached to the word nodes, which allows text classification to draw on POS or entity information. On 20NG and MR, however, the proposed method falls short of the best comparison method: BiGRU-GCN reaches 86.88% on 20NG against 86.77%, and BiLSTM-GCN reaches 78.30% on MR against 77.10%. Since 20NG has the longest documents and MR the shortest among the five datasets, this gap may stem from limitations of the proposed method in handling very long and very short texts.

Furthermore, TextGCN outperforms the RNN and CNN baselines on four of the five datasets (all except MR). In text classification, GCN appears to avoid the vanishing-gradient problem encountered by RNNs and to mitigate the over-focus on local information that is common in CNNs. On simpler tasks such as R8 and R52, traditional machine learning can exceed the deep learning baselines: SVM reaches 96.85% on R8 and 93.12% on R52, ahead of BiLSTM, BiGRU, CNN, Transformer, and FastText. Conversely, the Transformer model underperforms, possibly because the limited size of the datasets constrains its capabilities. Overall, the proposed method benefits from combining GCN with token classification, while its handling of very short and very long texts remains a potential weakness.

Subsequently, ablation experiments were conducted to substantiate the efficacy of the proposed methodology, as delineated in Table 4.

Table 4.

Results of Ablation Experiments (%).

Method	20NG	R8	R52	Oh	MR
TextGCN (Yao et al., 2019)	86.38	96.85	93.52	68.46	76.06
TextGCN(Mish)	86.52	97.08	93.73	68.86	76.28
TTL-GCN(POS)	86.77	97.21	94.39	69.21	77.10
TTL-GCN(NER)	86.58	97.53	94.28	68.84	76.93

In Table 4, the classification accuracy of TextGCN(Mish) surpasses that of TextGCN on all five datasets, which indicates that the Mish function improves the ability of GCN to fit text. Furthermore, TTL-GCN(POS) and TTL-GCN(NER) improve on TextGCN(Mish) in every case except TTL-GCN(NER) on Ohsumed, highlighting the positive influence of multi-task learning based on token classification on text classification. TTL-GCN(POS) outperforms TTL-GCN(NER) on four of the five datasets (all except R8). This may be because knowing the POS of a word contributes significantly to understanding the meaning of a text, particularly in sentences where a single word has multiple meanings, whereas knowing whether a word is an entity adds comparatively little. Consequently, the improvement in text classification is more pronounced with POS tagging than with entity recognition. Moreover, TTL-GCN(NER) shows a slight performance decline on the Ohsumed dataset relative to TextGCN(Mish) (68.84% versus 68.86%). This paper hypothesizes that the decline is linked to the suboptimal performance of NER on Ohsumed, as detailed in Table 5, which is likely to have adversely affected TTL-GCN(NER).
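For reference, the two activation functions compared in this ablation can be sketched in a few lines of Python. The code follows the standard definition of Mish given by Misra (2019), x · tanh(softplus(x)); it is a scalar illustration only and is not the GCN implementation used in the experiments.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish (Misra, 2019): x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

Unlike ReLU, Mish is smooth and returns small negative values for negative inputs, so gradients do not vanish entirely on that side.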

The text and token classification accuracy under different values of $λ$ are detailed in Table 5, where TTL-GCN(POS) and TTL-GCN(NER) denote MTL text classification methods based on POS tagging and NER, respectively. Meanwhile, GCN(POS) and GCN(NER) represent token classification methods for POS tagging and NER.

Table 5.

Classification Accuracy of Text and Token Tasks for Different Values of $λ$ (%).

Task	Method	Dataset	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9
Text classification	TTL-GCN(POS)	20NG	83.21	84.12	85.44	85.69	86.07	86.11	86.27	86.48	86.77
Text classification	TTL-GCN(POS)	R8	96.62	96.85	97.08	96.71	97.21	97.17	97.22	97.08	97.12
Text classification	TTL-GCN(POS)	R52	92.76	93.50	93.89	93.81	93.84	94.20	94.31	94.39	94.35
Text classification	TTL-GCN(POS)	Ohsumed	60.77	65.33	67.15	67.32	67.99	68.41	68.88	69.03	69.21
Text classification	TTL-GCN(POS)	MR	75.30	75.12	76.23	76.65	76.31	76.14	76.31	76.42	77.10
Text classification	TTL-GCN(NER)	20NG	82.92	84.26	85.66	85.71	86.44	86.32	86.58	86.49	86.54
Text classification	TTL-GCN(NER)	R8	97.12	97.12	97.17	97.53	97.26	97.20	97.21	97.17	97.21
Text classification	TTL-GCN(NER)	R52	93.46	94.04	93.96	94.20	94.08	94.12	94.28	93.93	94.20
Text classification	TTL-GCN(NER)	Ohsumed	66.16	67.82	68.14	67.92	67.85	68.39	68.84	68.71	68.76
Text classification	TTL-GCN(NER)	MR	74.09	75.04	75.97	75.44	76.62	76.11	76.06	76.65	76.93
Token classification	GCN(POS)	20NG	99.52	99.48	99.35	99.28	99.36	99.22	99.18	99.10	99.06
Token classification	GCN(POS)	R8	99.58	99.21	98.98	98.92	98.88	98.82	98.88	98.79	98.75
Token classification	GCN(POS)	R52	99.31	99.11	99.12	99.13	99.11	99.14	98.92	98.88	98.86
Token classification	GCN(POS)	Ohsumed	99.88	99.87	99.71	99.52	99.51	99.45	99.33	99.26	99.21
Token classification	GCN(POS)	MR	82.62	69.59	61.27	57.89	54.15	51.75	51.36	46.21	42.32
Token classification	GCN(NER)	20NG	93.64	93.52	93.42	93.38	93.46	93.56	92.88	93.68	92.66
Token classification	GCN(NER)	R8	71.27	71.48	67.77	60.93	63.54	68.99	72.47	65.04	74.44
Token classification	GCN(NER)	R52	92.83	91.85	92.85	93.12	93.25	92.71	92.52	93.38	92.28
Token classification	GCN(NER)	Ohsumed	63.87	64.00	62.26	62.85	63.03	64.03	65.84	65.85	60.35
Token classification	GCN(NER)	MR	98.25	98.19	98.25	98.31	98.31	98.26	98.29	98.21	98.34
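One way to quantify how the accuracies in Table 5 vary with $λ$ is the Pearson correlation coefficient. The sketch below applies it to two rows copied from Table 5, TTL-GCN(POS) on 20NG and GCN(POS) on MR. The choice of Pearson correlation and the helper name `pearson` are assumptions made for illustration; the paper does not state how its correlations were assessed.

```python
import math

LAMBDAS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Rows copied from Table 5 (accuracy in %).
TTL_GCN_POS_20NG = [83.21, 84.12, 85.44, 85.69, 86.07, 86.11, 86.27, 86.48, 86.77]
GCN_POS_MR = [82.62, 69.59, 61.27, 57.89, 54.15, 51.75, 51.36, 46.21, 42.32]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(f"text  (TTL-GCN(POS), 20NG): r = {pearson(LAMBDAS, TTL_GCN_POS_20NG):+.3f}")
print(f"token (GCN(POS), MR):       r = {pearson(LAMBDAS, GCN_POS_MR):+.3f}")
```

The first row increases monotonically with $λ$ and the second decreases monotonically, so the two coefficients have opposite signs.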

Figure 4.

The Relationship Between the Value of Accuracy Improvement and the Number of Word Nodes. The Vertical Coordinate is the Accuracy Rate of TTL-GCN Improvement in %. The Horizontal Coordinate is the Number of Word Nodes in Each Dataset.

Figure 5.

The Loss and Accuracy of Text Classification on MR. Where doc_loss Refers to the Loss of the Text Classification Task and eval_loss Refers to the Validation Set Accuracy of the Text Classification Task.

In Table 5, the text classification accuracy of both TTL-GCN(POS) and TTL-GCN(NER) shows a weak positive correlation with $λ$. According to equation (9), the larger $λ$ is, the smaller the role token classification plays in training the text classification task. This suggests that text classification gains the most when token classification is given only a small weight. For example, when $λ$ equals 0.9, TTL-GCN(POS) achieves its highest text classification accuracy on 20NG, Ohsumed, and MR. In the token classification task, POS tagging accuracy shows a strong negative correlation with $λ$. This is easy to understand: a smaller $λ$ places more emphasis on token classification in multi-task learning, which provides more training feedback and therefore higher token classification accuracy. In contrast, the classification accuracy of NER exhibits a positive correlation with $λ$. This may be attributed to the simplicity of NER, as excessive training feedback appears to have an adverse impact on its performance. The relationship between the accuracy improvement of the proposed method and the number of word nodes is illustrated in Figure 4.
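The role of $λ$ can be illustrated with a small sketch. It assumes that equation (9) is a convex combination of the two task losses, with weight $λ$ on text classification and $1 - λ$ on token classification; this form is inferred from the description above and may differ in detail from equation (9). The function name and loss values are hypothetical.

```python
def combined_loss(text_loss: float, token_loss: float, lam: float) -> float:
    """Assumed convex combination of the two task losses.

    lam close to 1 -> the text classification loss dominates;
    lam close to 0 -> the token classification loss dominates.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * text_loss + (1.0 - lam) * token_loss


# Hypothetical per-batch losses, for illustration only.
text_loss, token_loss = 0.40, 1.20
for lam in (0.1, 0.5, 0.9):
    total = combined_loss(text_loss, token_loss, lam)
    print(f"lambda={lam:.1f}  token weight={1.0 - lam:.1f}  total loss={total:.3f}")
```

Under this assumption, moving $λ$ from 0.1 to 0.9 reduces the weight of the token task from 0.9 to 0.1, which matches the observation that token classification accuracy for POS tagging falls as $λ$ grows.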

In Figure 4, the improvement that TTL obtains from word node classification shows a discernible positive correlation with the number of word nodes. This suggests that involving more word nodes in token classification tends to be more advantageous for text classification. Nevertheless, an excessively large number of word nodes may be detrimental to text classification. Specifically, when the number of word nodes reaches 42,757 (the 20NG dataset), the accuracy improvement becomes marginal. Subsequently, this paper presents the loss and accuracy curves for text classification on the MR dataset, as illustrated in Figure 5.
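The trend described for Figure 4 can be approximated from the reported numbers. The sketch below pairs the word counts in Table 2 with accuracy gains computed from Table 4. It assumes that "improvement" means the best TTL-GCN accuracy minus the TextGCN accuracy; the exact quantity plotted in Figure 4 may be defined differently.

```python
# Word-node counts from Table 2 and test accuracies (%) from Table 4.
WORD_NODES = {"R8": 7688, "R52": 8892, "Ohsumed": 14157, "MR": 18764, "20NG": 42757}
TEXTGCN = {"20NG": 86.38, "R8": 96.85, "R52": 93.52, "Ohsumed": 68.46, "MR": 76.06}
# Best of TTL-GCN(POS) and TTL-GCN(NER) per dataset.
TTL_GCN = {"20NG": 86.77, "R8": 97.53, "R52": 94.39, "Ohsumed": 69.21, "MR": 77.10}

# Assumption: improvement = best TTL-GCN accuracy minus TextGCN accuracy.
gains = {name: round(TTL_GCN[name] - TEXTGCN[name], 2) for name in WORD_NODES}

# List datasets in order of increasing number of word nodes.
for name in sorted(WORD_NODES, key=WORD_NODES.get):
    print(f"{name:8s} words={WORD_NODES[name]:6d}  gain={gains[name]:+.2f}")
```

Under this assumption, the gain is largest for MR (18,764 word nodes) and smallest for 20NG (42,757 word nodes), in line with the discussion above.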

In Figure 5, although MTL takes more time to reach the optimal loss and accuracy, both the loss curve and the accuracy curve are smoother than those of TextGCN. This shows that replacing ReLU with Mish makes the training of the model smoother and improves the classification accuracy.

5. Conclusion

Text classification methods based on traditional TextGCN often neglect a considerable amount of valuable information associated with word nodes, thereby missing an opportunity to positively impact text classification outcomes. This paper introduces an innovative text classification approach that integrates Token-Task Learning with Graph Convolutional Networks. The proposed method effectively leverages the positive influence of word nodes by transforming them into a token classification task, thereby enhancing the training process for the overall text classification task and consequently improving accuracy.

Moreover, this paper adopts the Mish activation function for Graph Convolutional Networks, which also yields consistent accuracy gains. The proposed method achieves better classification performance than existing hybrid GCN networks on three of the five datasets, without compromising their functionality. Furthermore, for the methods of Fang et al. (2021) and Ye et al. (2020), the accuracy of text classification could be further improved by incorporating the Token-Task Learning structure introduced in this paper.

In summary, the proposed method offers a novel perspective for TextGCN-based text classification techniques, underscoring the significance of word node classification. This emphasis, in turn, leads to enhanced results in text classification.

The limitation of this paper lies in the fact that the performance of TTL-GCN can be influenced by the performance of the token task. When the token task becomes complex and challenging, the performance of text classification may suffer. Future work will focus on designing more advanced TextGCN structures and constructing a more reliable token task framework.

Footnotes

Acknowledgements

This work has been supported by Liaoning Provincial Science and Technology Department (No. 2022JH2/101300268) and the transportation department of Liaoning Province (No. 2023-360-17).

ORCID iD

Xudong Song

Funding

This work was supported by the Liaoning Provincial Science and Technology Department (No. 2022JH2/101300268); and in part by the transportation department of Liaoning Province (No. 2023-360-17).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Benarafa, Benkhalifa, Akhloufi (2023). WordNet semantic relations based enhancement of KNN model for implicit aspect identification in sentiment analysis. International Journal of Computational Intelligence Systems, 16(1), 3.

2. Bird, Edward, et al. (2009). Natural language processing with Python. O’Reilly Media Inc.

3. Lillian (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL (pp. 115–124).

4. Bouma (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 30, 31–40.

5. Chen, Jin, Gerontitis, et al. (2023). Improved recurrent neural networks for text classification and dynamic Sylvester equation solving. Neural Processing Letters, 55(7), 8755–8784.

6. Devlin, Chang, M. W., Lee, et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

7. Dong, Yang, Cao (2022). A text classification model based on GCN and BiGRU fusion. In Proceedings of the 8th international conference on computing and artificial intelligence (pp. 318–322).

8. Fabrizio (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.

9. Fang, Shu, et al. (2021). Text classification model based on multi-head self-attention mechanism and BiGRU. In 2021 IEEE conference on telecommunications, optics and computer science (TOCS) (pp. 357–361). IEEE.

10. Huang et al. (2019). Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356.

11. Huang, Peng, et al. (2020). Improving biterm topic model with word embeddings. World Wide Web, 23(6), 3099–3124.

12. Incitti, Urli, Snidaro (2023). Beyond word embeddings: A survey. Information Fusion, 89, 418–436.

13. Joachims (1998). Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning (pp. 137–142). Berlin, Heidelberg: Springer Berlin Heidelberg.

14. Kowsari, Jafari Meimandi, Heidarysafa, et al. (2019). Text classification algorithms: A survey. Information, 10(4), 150.

15. Yan, Wang, et al. (2023). Text classification on heterogeneous information network via enhanced GCN and knowledge. Neural Computing and Applications, 35(20), 14911–14927.

16. Lin, Meng, Sun, et al. (2021). BertGCN: Transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727.

17. Liu, Guan, Yang, et al. (2023). Effective method for making Chinese word vector dynamic. Journal of Intelligent & Fuzzy Systems, 45(1), 941–952.

18. Liu, You, Zhang, et al. (2020). Tensor graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence (vol. 34, no. 05, pp. 8409–8416).

19. Gan, Yin, et al. (2020). Multi-task learning using a hybrid representation for text classification. Neural Computing and Applications, 32, 6467–6480.

20. Lyle, Rowland, Ostrovski, et al. (2021). On the effect of auxiliary tasks on representation dynamics. In International conference on artificial intelligence and statistics (pp. 1–9). PMLR.

21. Mallik, Kumar (2023). Word2Vec and LSTM based deep learning technique for context-free fake news detection. Multimedia Tools and Applications, 83(1), 919–940.

22. Mao, Wang, Liu, et al. (2021). BanditMTL: Bandit-based multi-task learning for text classification. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers) (pp. 5506–5516).

23. Mirończuk, M. M., Protasiewicz (2018). A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications, 106, 36–54.

24. Misra (2019). Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681.

25. Moschitti, Basili (2004). Complex linguistic features for text classification: A comprehensive study. In European conference on information retrieval (pp. 181–196). Berlin, Heidelberg: Springer Berlin Heidelberg.

26. Parlak, Uysal, A. K. (2023). A novel filter feature selection method for text classification: Extensive feature selector. Journal of Information Science, 49(1), 59–78.

27. Pennington, Socher, Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

28. Philip, Veerasekhar Reddy, Harshini, et al. (2023). A comparative study of text classification using selective machine learning algorithms. In 2023 7th International conference on intelligent computing and control systems (ICICCS) (pp. 482–484). IEEE.

29. Lyu, Chi, C. H. (2022). Multi-task learning framework for detecting hashtag hijack attack in mobile social networks. In 2022 IEEE 19th international conference on mobile ad hoc and smart systems (MASS) (pp. 90–98). IEEE.

30. Ramos (2003). Using TF-IDF to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning (vol. 242, no. 1, pp. 29–48).

31. Shah, Patel, Sanghvi, et al. (2020). A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augmented Human Research, 5, 1–16.

32. Soni, Chouhan, S. S., Rathore, S. S. (2023). TextConvoNet: A convolutional neural network based architecture for text classification. Applied Intelligence, 53(11), 14249–14268.

33. Umer, Imtiaz, Ahmad, et al. (2023). Impact of convolutional neural network and FastText embedding on text classification. Multimedia Tools and Applications, 82(4), 5569–5585.

34. Vafaeikia, Namdar, Khalvati (2020). A brief review of deep multi-task learning and auxiliary task learning. arXiv preprint arXiv:2007.01126.

35. Vaswani, Shazeer, Parmar (2017). Attention is all you need. In 31st international conference on neural information processing systems (NIPS) (pp. 6000–6010). ACM.

36. Wang, Han, S. C., Poon (2022). InducT-GCN: Inductive graph convolutional networks for text classification. In 2022 26th International conference on pattern recognition (ICPR) (pp. 1243–1249). IEEE.

37. Xiao, Zhang, Chen, et al. (2018). MCapsNET: Capsule network for text with multi-task learning. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 4565–4574).

38. Xue, Zhu, Wang, et al. (2022). The study on the text classification based on graph convolutional network and BiLSTM. In Proceedings of the 8th international conference on computing and artificial intelligence (pp. 323–331).

39. Yang, Shang (2019). Multi-task learning with bidirectional language models for text classification. In 2019 International joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.

40. Yao, Mao, Luo (2019). Graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence (vol. 33, no. 01, pp. 7370–7377).

41. Ye, Jiang, Liu, et al. (2020). Document and word representations generated by graph convolutional network and BERT for short text classification (pp. 2275–2281).

42. Zhang, Tong, et al. (2019). Graph convolutional networks: A comprehensive review. Computational Social Networks, 6(1), 1–23.

43. Zhang, Yang (2021). A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12), 5586–5609.

44. Zhao, Gao, Chen, et al. (2020). Generative multi-task learning for text classification. IEEE Access, 8, 86380–86387.


Table 2.

The Datasets Information.

Dataset	Docs	Words	Training	Test	Avg Length
MR $^{a}$	10,662	18,764	7,108	3,554	20
R8 $^{b}$	7,674	7,688	5,485	2,189	65
20NG $^{c}$	18,846	42,757	11,314	7,532	221
R52 $^{d}$	9,100	8,892	6,532	2,568	69
Ohsumed $^{e}$	7,400	14,157	3,357	4,043	135