Concept and dependencies enhanced graph convolutional networks for short text classification

Abstract

Short text classification is a special kind of text classification task in which the texts to be classified are generally short, typically yielding a sparse text representation that lacks rich semantic information. Given this shortcoming, scholars worldwide have explored improved short text classification methods based on deep learning. However, existing methods cannot effectively use concept knowledge and long-distance word dependencies. Therefore, approaching the problem from the perspective of text graph construction with graph neural networks, we propose concept and dependencies enhanced graph convolutional networks for short text classification. First, the co-occurrence relationship between words is obtained with a sliding window, the inclusion relationship between documents and words is obtained with TF-IDF, long-distance word dependencies are obtained with Stanford CoreNLP, and the association between concepts in the concept graph and entities in the text is obtained through Microsoft Concept Graph. Then, a text graph is constructed for the entire text corpus based on these four relationships. Finally, the text graph is input into a graph convolutional network, and the category of each document node is predicted after two layers of convolution. Experimental results demonstrate that our proposed method performs best overall on multiple classical English text classification datasets.

Keywords

Short text classification; Knowledge graph; Graph convolutional neural networks; Long-distance dependency; Building graph for text

1 Introduction

Text classification is a fundamental problem in natural language processing (NLP) with numerous applications such as sentiment analysis, spam detection, and extractive question answering [1–3]. With the widespread use of the mobile internet, a large amount of short text has emerged, and how to extract and use this information has piqued the interest of academia and industry. Short text classification assigns categories from a given classification system to texts such as forum/BBS posts, messages and replies, suggestions and feedback, mobile phone short messages, network notes, and instant chat records (MSN/QQ/POPO). Compared with long text, short text has less content: it usually consists of several to a dozen words, and even the longer cases are only about 100 words, so its features are not obvious. Short texts also often contain non-standard expressions. As a result, the ability of short text to describe information is weak and the corresponding text representation is sparse. Traditional short text classification methods are mainly based on machine learning [4–7] and suffer from severe feature sparsity and weak feature expression ability when processing short texts. With the development of deep learning, convolutional neural networks (CNN) [8, 9], recurrent neural networks (RNN) [10, 11], and the attention mechanism [12, 13] have also been applied to short text classification. However, CNN can generally only represent the semantic information of the context within a local window, and the attention mechanism mainly focuses on sequences of locally consecutive words, so these methods still cannot effectively solve the problem of short text feature sparsity. Graph neural networks (GNN) have also been used for short text classification, but there are few studies on GNN-based feature expansion for short text classification [14–17].

Feature expansion is an effective way to address short text feature sparsity. There are two commonly used feature expansion methods. The first is based on the content of the text itself. For instance, in 2019, Gao et al. [18] proposed a conditional random field regularized topic model for short texts, which aggregates short texts into pseudo-documents to address feature sparsity and uses random field regularization to encourage semantically related words to share the same topic assignment, thereby achieving semantic disambiguation. The dependency between words is also information extracted from the content of the short text. For short text, long-distance dependencies are especially significant and can compensate for semantic deficiency to a certain extent. The second is based on the introduction of external knowledge. For example, Sahami et al. [19] used web search results as feature expansion items for keywords to enrich the features of short texts. However, because feature expansion based on search engines relies heavily on the open interfaces of search engines, the results obtained in practical application scenarios are unstable, which makes the approach hard to apply.

Currently, many scholars combine knowledge graphs with graph neural networks to solve NLP tasks [20–22]. For example, Wang et al. [23] proposed a knowledge-aware GNN with label smoothness regularization for recommendation systems. This model combines a graph convolutional network (GCN) with a knowledge graph and aggregates neighborhood information with different weights using user-specific relational scoring functions. Ye et al. [24] proposed an improved GNN-based method that can reason over multiple paragraphs with entities and constructs entity graphs using relational facts in a knowledge graph to explicitly capture connections between entities. Zhou et al. [25] designed a GCN sentiment classification model based on syntax and knowledge. The model combines a syntactic dependency tree and a commonsense knowledge graph for aspect-level sentiment classification, and to enhance the ability of sentences to represent a given aspect, two strategies are developed to model the syntactic dependency tree and the commonsense knowledge graph, respectively. Here, syntax refers to the rules governing how different constituents are combined to form sentences, that is, the interrelationships between elements in sentence structures. However, there are few studies on combining knowledge graphs and GNN for short text classification [26]. To address sparse features, Yan et al. [27] construct a knowledge graph and inject the triples it contains into sentences as domain knowledge to realize feature expansion.

In order to solve the above problems, we propose a short text classification method based on graph convolutional networks with enhanced concept and dependencies (KDGCN). We use two feature expansion methods to address the feature sparsity of short texts. First, using the semantic information of the short text itself as an expansion item, Stanford CoreNLP is used to analyze the dependencies in the text, and the words involved in long-distance dependencies are selected to expand the text graph. At the same time, external knowledge is introduced for feature expansion: we extract concepts corresponding to short text features from the Microsoft Concept Graph to enrich the context features of short texts. The following related papers are referred to when calculating the experimental results [28, 29]. In summary, the main contributions of this study are as follows:

(1) On the basis of existing text classification models based on GNN, the relationship between concepts and words, the co-occurrence relation between words, the relation between documents and words, and long-distance word dependencies are introduced into the text graph construction process.

(2) A text graph containing multiple relationships is introduced into GCN and a novel short text classification model based on GCN with enhanced concept and dependencies is proposed.

(3) Experiments are conducted on several benchmark datasets, and the experimental results show that our method performs better overall than existing baseline models.

2 Related work

At present, typical text classification algorithms include those based on machine learning, deep learning, and GNN.

2.1 Common short text classification methods

Text classification algorithms mainly include support vector machine [30], naive Bayes [31], and other machine learning algorithms that ignore the natural order structure or context information in text data, making learning semantic information about vocabulary extremely difficult. Many scholars have conducted studies to improve the above problems. For example, Bouaziz et al. [32] combined data enrichment with semantics using a random forest classifier for short text classification to overcome the sparseness and deficiency of contextual information.

With the advancement of deep learning, relevant research on text classification algorithms based on deep learning has expanded. Chen et al. [33] applied CNN to text classification tasks for the first time. To solve the problem of sparse features of short texts, Wang et al. [34] proposed a framework that combines knowledge using CNN. This model first uses a taxonomy knowledge base to map entities in short texts to corresponding concepts through isA and isPropertyOf relations. Then, words and related concepts are combined in a pre-trained word vector space to obtain the short text embedding. Hao et al. [35] proposed a Chinese short text classification method based on mutual attention CNN, which integrates word-level and character-level features so that the model does not lose too much feature information. Chen et al. [36] proposed a deep short text classification method based on knowledge-driven attention. This model retrieves knowledge from external knowledge sources, treats conceptual information as a kind of knowledge, and integrates it into a deep neural network to enhance the semantic representation of short texts. To distinguish the importance of different knowledge, an attention mechanism is also introduced. Xu et al. [37] proposed a Dual Embeddings Convolutional Neural Network (DE-CNN). It first uses dual embeddings to extract concepts and contexts, then uses an attention layer to select context-relevant concepts from the extracted concepts, and finally merges the context-relevant concepts into the text representation, which is classified by a convolutional neural network.

2.2 Short text classification based graph convolutional networks

GNNs have gained popularity due to their powerful expressive ability, and they are also used to solve the problem of text classification [38–41]. GCN is one of the typical variants of GNN, proposed by Kipf et al. [42]. Defferrard et al. [43] first applied GCN to text classification tasks. In 2019, Yao et al. [44] improved the model proposed by Defferrard et al. and transformed the text classification problem into a node classification problem. First, a text graph with global relations between documents and words is constructed based on the entire text corpus, and then document nodes are classified by GCN. For short text classification tasks, Hu et al. [45] proposed a semi-supervised short text classification method based on heterogeneous graph attention networks. This model uses the heterogeneous information network framework to model short texts, which can incorporate any additional information and capture rich relationships between texts and that information. Tayal et al. [46] proposed a regularized GCN for short text classification that uses extra knowledge between text and labels to enhance short text information. Wang et al. [47] modelled the short text dataset as a hierarchical heterogeneous graph consisting of three types of graphs, which introduce more semantic and syntactic information, and then learned dynamically to facilitate effective label propagation.

2.3 Knowledge graph

The knowledge graph is a concept proposed by Google based on the semantic web, and it has been widely used in NLP tasks. Knowledge graphs improve the storage and acquisition of knowledge and make it possible to extract high-quality knowledge, which helps to address the sparsity of short text features. In 2016, Microsoft Research released Microsoft Concept Graph, which is expressed in the form of triples of instances, concepts, and relationships. The IsA relationship holds between concepts and instances; for example, the triple (Microsoft, company, IsA) indicates that Microsoft is a company. A short text may directly contain concept words from a knowledge graph, but it is more common for the text to include instance information related to these concepts, as in “It’s a very valuable film” and “More than anything else, kissing Jessica Stein injects freshness and spirit into the romantic comedy genre.” Both sentences correspond to the concept “film”, but the first sentence contains the concept word “film” directly, whereas the second sentence represents the concept “film” through the instance “kissing Jessica Stein”; in that case the concept is obtained by mapping the text information into Microsoft Concept Graph. By conceptualizing the text information in this way, we can obtain the extended words with the highest semantic relevance, which expands the feature representation of short texts and improves the accuracy of short text classification.

2.4 Graph convolutional networks

GCN is a multi-layer neural network that operates directly on a graph and induces the embedding vectors of nodes according to the attributes of their neighborhoods. The input of GCN consists of an N × D feature matrix H formed by N nodes and their features, and an N × N adjacency matrix A formed by the relationships between nodes. GCN also uses a degree matrix, which records how many nodes each node is connected to. In GCN, the features of the neighbors are summed directly. The aggregation equation of neighbor nodes is as follows: $H^{l + 1} = σ (\tilde{A} H^{l} W^{l})$ (1) where $\tilde{A}$ is the normalized A and $W^{l}$ is a randomly initialized parameter matrix.

In general, each layer of GCN multiplies the normalized adjacency matrix $\tilde{A}$ by the feature matrix $H^{l}$ to summarize the neighbor features of each vertex, and then multiplies the result by a parameter matrix $W^{l}$. The activation function σ then applies a nonlinear transformation to obtain the matrix $H^{l + 1}$ of aggregated neighbor features, so that the output of the last GCN layer contains not only the initial features of the nodes but also the topological features of the network.
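As a hedged illustration of Eq. (1), the following pure-Python sketch normalizes an adjacency matrix and applies one propagation step. The helper names (`normalize_adjacency`, `matmul`, `gcn_layer`) and the nested-list representation are assumptions made for readability; the self-loops and the symmetric normalization follow the convention stated later in Section 3.2, and a practical implementation would use a sparse tensor library instead.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square matrix given as nested lists."""
    n = len(adj)
    # Add self-loops so that each node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [[inv_sqrt[i] * a_hat[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gcn_layer(a_norm, h, w, activation=relu):
    """One propagation step: sigma(A_norm @ H @ W)."""
    return activation(matmul(matmul(a_norm, h), w))
```

Stacking two such calls, with a softmax in place of the second activation, gives the two-layer form used in Section 3.2.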

Existing methods based on machine learning and deep learning have significantly improved short text classification, but the combination of concept knowledge, word dependencies and GNN has not been considered to tackle short text classification tasks. Therefore, it is of great significance to explore GCN for short text classification based on concept and dependencies enhancement.

3 Our proposed method

The model structure of the proposed method is shown in Fig. 1. First, the text data are cleaned. Second, we obtain the entity information, dependency relationships, and word co-occurrence information in the text through TAGME, Stanford CoreNLP, and a sliding window, respectively, and then map the entities to Microsoft Concept Graph to obtain the corresponding concepts. Third, a text graph is constructed from the obtained information through the following four relationships: the word co-occurrence relationship, long-distance word dependencies, the association relationship between concepts and words, and the inclusion relationship between documents and words. Finally, the text graph is input into the two-layer GCN, and the text labels are obtained through training.

Fig. 1

KDGCN overall architecture diagram. “D” are document nodes, “W” are word nodes, “C” are concept nodes, and lines indicate edges between nodes.

3.1 Building text graph

We take concepts, documents, and words as nodes and establish edges between nodes based on four relationships: the relationship between concepts and words, word co-occurrence, the relation between documents and words, and long-distance word dependencies. We build a text graph G = (V, E) for the corpus, where V (|V| = n) and E represent the sets of nodes and edges, respectively. $X \in R^{n \times m}$ is the feature matrix, where n is the number of nodes, m is the dimension of the feature vectors, and $X_{v} \in R^{m}$ is the feature vector of node v. The feature matrix is defined as the identity matrix, so m = n and each node is represented by a one-hot vector.
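The graph assembly described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_text_graph`, the tagged-tuple node keys, and the dense nested-list adjacency are hypothetical, the graph is treated as undirected, and when two relationships connect the same pair of nodes the later edge simply overwrites the earlier one (the text does not say how such weights are combined).

```python
def build_text_graph(n_docs, words, concepts, edges):
    """Assemble nodes, a symmetric weighted adjacency matrix, and identity features.

    edges: iterable of (node_a, node_b, weight), where a node is a tagged tuple
    such as ("doc", 0), ("word", "film") or ("concept", "movie").
    """
    nodes = [("doc", i) for i in range(n_docs)]
    nodes += [("word", w) for w in words]
    nodes += [("concept", c) for c in concepts]
    index = {node: k for k, node in enumerate(nodes)}

    n = len(nodes)
    adj = [[0.0] * n for _ in range(n)]
    for a, b, weight in edges:
        i, j = index[a], index[b]
        adj[i][j] = adj[j][i] = weight  # undirected edge (assumption)

    # Identity feature matrix: every node is a one-hot vector.
    features = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return nodes, adj, features
```

For a real corpus the adjacency matrix is very sparse, so a sparse representation would be the more practical choice.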

3.1.1 Associated edges of document and word nodes

Edges between word nodes are constructed on the basis of word co-occurrence, where word co-occurrence information is obtained by a sliding window with a fixed size, and the weight between two words is calculated using pointwise mutual information (PMI). When the PMI value is negative, the semantic correlation between the words in the corpus is low or non-existent, so only edges with a positive PMI value are retained. The weight of an edge between node i and node j is defined as $PMI (i, j) = \log \frac{p (i, j)}{p (i) p (j)}$ (2)
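Eq. (2) can be illustrated with the short sketch below. It assumes, following the common Text-GCN convention rather than anything stated here, that p(i) and p(i, j) are estimated as the fraction of sliding windows containing word i, or both words i and j; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def sliding_windows(tokens, size):
    """Return all windows of `size` tokens; a short text yields a single window."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def pmi_edges(docs, window_size=20):
    """Map each word pair with positive PMI to its edge weight."""
    word_windows = Counter()   # number of windows containing a word
    pair_windows = Counter()   # number of windows containing both words
    total = 0
    for tokens in docs:
        for window in sliding_windows(tokens, window_size):
            total += 1
            unique = sorted(set(window))
            word_windows.update(unique)
            pair_windows.update(combinations(unique, 2))

    edges = {}
    for (wi, wj), n_ij in pair_windows.items():
        # log( p(i, j) / (p(i) p(j)) ) with p(.) = count / total
        pmi = math.log(n_ij * total / (word_windows[wi] * word_windows[wj]))
        if pmi > 0:  # keep only positively associated pairs
            edges[(wi, wj)] = pmi
    return edges
```

Pairs are stored with their words in sorted order, so each undirected edge appears once.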

If a word appears in a document, a linking edge is constructed between the document and word, and the weight of the edge between the document and word is calculated by term frequency-inverse document frequency.
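A minimal sketch of the document-word weighting is given below. The exact TF-IDF variant is not specified in the text, so this example assumes a length-normalized term frequency and an unsmoothed idf of log(N / df); the function name `tfidf_edges` is hypothetical.

```python
import math
from collections import Counter

def tfidf_edges(docs):
    """Map (document index, word) to a TF-IDF edge weight."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document

    edges = {}
    for d, tokens in enumerate(docs):
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            edges[(d, word)] = tf * idf
    return edges
```

With this unsmoothed variant, a word that occurs in every document receives weight 0; a smoothed idf would avoid that if such edges should be kept.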

We use Stanford CoreNLP, an NLP toolkit from Stanford University, to analyze the dependency relationships within each text, and word pairs with long-distance dependencies are selected from the obtained dependencies to expand the text graph. A long-distance word dependency is a relationship between two words in a sentence that depend on each other but are separated by many intervening words. The weights of edges constructed on the basis of long-distance word dependencies are calculated by PMI, and only edges with positive PMI are retained. The PMI calculation used here is as follows: ${PMI}_{rel} (w_{1}, w_{2}) = \log \frac{p ((w_{1}, w_{2}) | rel)}{p (w_{1} | rel) p (w_{2} | rel)}$ (3) where: $p ((w_{1}, w_{2}) | rel) = \frac{p (w_{1}, rel, w_{2})}{p (rel)}$ (4) The probabilities are calculated by maximum likelihood estimation: $p (w_{1}, rel, w_{2}) = \frac{Count (w_{1}, rel, w_{2})}{Count (*, *, *)}$ (5) $p (w_{1} | rel) = \frac{Count (w_{1}, rel, *)}{Count (*, rel, *)}$ (6) $p (w_{2} | rel) = \frac{Count (*, rel, w_{2})}{Count (*, rel, *)}$ (7) $p (rel) = \frac{Count (*, rel, *)}{Count (*, *, *)}$ (8) Substituting Eqs. (4)–(8) into Eq. (3) gives: ${PMI}_{rel} (w_{1}, w_{2}) = \log \frac{Count (w_{1}, rel, w_{2}) Count (*, rel, *)}{Count (w_{1}, rel, *) Count (*, rel, w_{2})}$ (9) where rel represents the long-distance dependency relationship between word w₁ and word w₂, and * represents any word or dependency relationship. The dependency structure corresponding to the example “Its rawness and vitality give it considerable punch” is shown in Fig. 2.
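The count-based form in Eq. (9) can be computed directly from a list of dependency triples, as in the sketch below. The triples are assumed to have been extracted beforehand by a dependency parser; the function name `dependency_pmi` and the `(w1, rel, w2)` tuple layout are illustrative assumptions, not part of the described system.

```python
import math
from collections import Counter

def dependency_pmi(triples):
    """Map each (w1, rel, w2) triple with positive PMI_rel to its weight.

    triples: list of (w1, rel, w2) tuples, one per observed dependency.
    """
    triple_count = Counter(triples)                          # Count(w1, rel, w2)
    first_rel = Counter((w1, rel) for w1, rel, _ in triples)  # Count(w1, rel, *)
    rel_second = Counter((rel, w2) for _, rel, w2 in triples) # Count(*, rel, w2)
    rel_count = Counter(rel for _, rel, _ in triples)         # Count(*, rel, *)

    weights = {}
    for (w1, rel, w2), n in triple_count.items():
        ratio = n * rel_count[rel] / (first_rel[(w1, rel)] * rel_second[(rel, w2)])
        pmi = math.log(ratio)
        if pmi > 0:  # only positive PMI edges are retained
            weights[(w1, rel, w2)] = pmi
    return weights
```

Note that Count(*, *, *) cancels out in Eq. (9), so the total number of triples is never needed.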

The most important parts of a sentence are the subject and the predicate. The subject is the center of the sentence, and the predicate describes the action or state of the subject. In the example in Fig. 2, the subject-predicate relationship points to the key content of the sentence; it is the relationship between the executor of an action and the action. In general, if a sentence has a predicate, it also has an object, which is the receiver of the action. For example, “punch” in the example sentence in Fig. 2 is crucial for the semantic expression of the sentence. The verb-object relationship is the relationship between the receiver of an action and the action. The juxtaposition relationship is the correlation between clauses or words that occur at the same time or in the same place; it can link different things related to each other, different aspects of the same thing, or different actions of the same subject. Based on numerous experiments, we select these three kinds of dependencies as the long-distance dependencies of the model, as shown in Table 1:

Fig. 2

Dependency graph structure.

Table 1

Long-distance dependency

Dependency relationship	Example
Nsubj	“..rawness..give..”
Obj	“..give..punch..”
Conj	“..rawness..vitality..”

3.1.2 The relationship between concepts and words

First, we identify entities in the document and map them to Wikipedia using the entity linking tool TAGME. For the word “cast” in the text, the linked entity and related information obtained through TAGME are shown in Table 2.

Table 2
Parameters returned by TAGME

Key	Value	Meaning
Spot	Cast	Entities that exist in the text
Start	15	The starting position of the original entity in the text
link_probability	0.006936665624380112	The probability of the original entity appearing in Wikipedia
Rho	0.003468332812190056	The confidence score that TAGME assigns to the annotation
End	19	The end position of the original entity concept in the text
Id	21501000	The numeric identifier of the entity
Title	Casting	The original entity in the text corresponds to the entity in Wikipedia

Concepts embody our knowledge of the kinds of things there are in the world. By tying our past experiences to our present interactions with the environment, they enable us to recognize and understand new objects and events. Concepts mainly refer to collections, categories, and types of objects and things, such as people and geography. We map the retained entities to different semantic concepts in Microsoft Concept Graph. The concepts corresponding to an entity and their mapping probabilities can be obtained by calling the API provided on the official website. Among the obtained concepts, each concept with a mapping probability greater than 0.5 is selected as a node of the text graph, an edge is established between the corresponding word and the concept, and the weight of the edge is the mapping probability. Microsoft Concept Graph provides six algorithms for calculating the degree of correlation between entities and concepts, namely P (c ∣ e), P (e ∣ c), mutual information, pointwise mutual information, normalized pointwise mutual information, and conceptual reasoning. For the word “Microsoft”, different algorithms return different concepts; for example, P (c ∣ e) returns “company”, “vendor”, and “client” with mapping probabilities of 0.611, 0.089, and 0.048, respectively. For each algorithm, the three concepts with the highest scores are shown in Table 3.

Table 3

The corresponding concept of “Microsoft” and its mapping probability

Score by P(c|e)		Score by MI		Score by P(e|c)		Score by NPMI		Score by PMI^K		Score by BLC
company	0.611	company	0.594	Vendor unique email file system	0.1	Vendor unique email file system	0.105	Vendor unique email file system	0.141	Company	0.313
vendor	0.089	vendor	0.106	Joint monopolist	0.1	Joint monopolist	0.103	Joint monopolist	0.123	Software company	0.151
client	0.048	Large company	0.051	Largest internet property	0.1	Operating system developer	0.102	Operating system developer	0.113	vendor	0.102

Following the example in Table 3, we compare the six calculation methods and find that the concepts and mapping probabilities calculated by P (c ∣ e) are more reasonable and accurate. Thus, we use P (c ∣ e) to calculate the degree of association between entities and concepts. P (c ∣ e) is the probability that concept c is the corresponding concept of an entity e.

$P (c ∣ e) = \frac{n (c, e)}{\sum_{e \in c_{j}} n (c_{j}, e)}$ (10) where n (c, e) represents the number of simultaneous occurrences of the entity e and concept c, and ∑_{e∈c_j}n (c_j, e) represents the sum of the number of simultaneous occurrences of the entity e and all concepts to which the entity belongs.
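Eq. (10) together with the 0.5 threshold from Section 3.1.2 can be sketched as below. The co-occurrence counts n(c, e) are assumed to be available locally as a nested dictionary; the function name `concept_edges` and this data layout are hypothetical, and the sketch does not call the Microsoft Concept Graph API.

```python
def concept_edges(cooccurrence, threshold=0.5):
    """Map (entity, concept) to P(c | e) for concepts above the threshold.

    cooccurrence: dict mapping entity -> {concept: n(c, e)}.
    """
    edges = {}
    for entity, counts in cooccurrence.items():
        total = sum(counts.values())  # sum of n(c_j, e) over all concepts of e
        if total == 0:
            continue
        for concept, n in counts.items():
            p = n / total
            if p > threshold:
                edges[(entity, concept)] = p
    return edges
```

Because the probabilities for one entity sum to 1, at most one concept per entity can exceed a threshold of 0.5, so each linked entity contributes at most one concept node under this setting.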

3.2 Model training

After constructing the text graph, we input it into a simple two-layer GCN. The first layer uses the ReLU activation function; its main role is to let each node receive and fuse information from its neighboring nodes, update its representation, and apply a nonlinear transformation to the node features. The node embeddings of the second layer have the same size as the label set and are fed into a softmax classifier. Let A denote the adjacency matrix and D the degree matrix, assuming that each node is connected to itself, and let X be the feature matrix of all n nodes. $Z = softmax (\tilde{A} ReLU (\tilde{A} X W_{0}) W_{1})$ (11) where $\tilde{A} = D^{- \frac{1}{2}} A D^{- \frac{1}{2}}$ (12) The loss function is defined as the cross-entropy error over all labeled documents, $L = - \sum_{d \in Y_{D}} \sum_{f = 1}^{F} Y_{df} \ln Z_{df}$ (13) where Y_D represents the index set of labeled documents, F represents the dimension of the output features, which is equal to the number of categories, and Y represents the label indicator matrix. The weight parameters W₀ and W₁ are learned by gradient descent. In Formula (11), $E_{1} = \tilde{A} X W_{0}$ contains the first-layer document and word embeddings, and $E_{2} = \tilde{A} ReLU (\tilde{A} X W_{0}) W_{1}$ contains the second-layer document and word embeddings. The two-layer GCN allows messages to be passed between nodes up to two steps apart. Thus, although there are no direct document-to-document edges in the text graph, the two-layer GCN allows information exchange between pairs of documents.
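The output stage and the loss in Eq. (13) can be illustrated as follows. This sketch assumes that the second-layer embeddings are already available as a nested list with one row per node, and that each labeled document has exactly one class, so the inner sum over f reduces to the log-probability of the true class. The function names are hypothetical.

```python
import math

def softmax_rows(logits):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in logits:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def labeled_cross_entropy(z, labels):
    """Cross-entropy summed over labeled document nodes only.

    z: softmax outputs, one row per graph node.
    labels: dict mapping a labeled node index to its class index;
            word, concept, and unlabeled document nodes are simply absent.
    """
    return -sum(math.log(z[d][f]) for d, f in labels.items())
```

Word and concept nodes still influence the loss indirectly, because their embeddings are mixed into the document rows by the two propagation steps.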

4 Experiments

4.1 Datasets

The experiment uses four public text classification datasets for model verification, including Movie Review (MR), Ohsumed, R8, and R52. The statistics of the datasets are summarized in Table 4.

Table 4
Summary statistics of datasets

Dataset	Training	Test	Classes	Words	Average Length
MR	7108	3554	2	18764	20.39
Ohsumed	3357	4043	23	14157	135.82
R8	5485	2189	8	7688	65.72
R52	6532	2568	52	8892	69.82

•MR is a binary sentiment classification film review dataset, including 5331 positive reviews and 5331 negative reviews. The training set includes 7108 documents, and the test set includes 3554 documents.

•Ohsumed comes from the medical database MEDLINE10, and its content is the title or abstract of medical journals. After removing documents with multiple labels, only 7400 documents belonging to one category are retained, including 3357 documents for the training set and 4043 documents for the test set.

•R8 is a subset of the Reuters 21578 dataset and includes 8 categories, including 5485 documents for the training set and 2189 documents for the test set.

•R52 is a subset of the Reuters 21578 dataset and includes 52 categories, with 6532 documents in the training set and 2568 documents in the test set.

4.2 Baselines

To evaluate the performance of the KDGCN model, we compare it with the following baseline models:

•CNN: CNN proposed by Kim et al. [33], uses convolution kernels of different sizes to conduct convolution and maximum pooling operations for word embedding to obtain text representation, which can better capture local correlation.

•Long short-term memory (LSTM) proposed by Liu et al. [48]: LSTM uses the last hidden state as the representation of the text, which alleviates the vanishing gradient problem to a certain extent. Bi-directional LSTM (Bi-LSTM) is a variant that processes the sequence in both directions and can therefore capture contextual information in sentences better than unidirectional LSTM.

•fastText proposed by Joulin et al. [49]: it is an open-source word vector and text classification tool of Facebook, providing a simple and efficient method for text classification and representation learning.

•Graph-CNN-C: a graph CNN model proposed by Defferrard et al. [43], which performs a convolution operation on word embedding similarity graphs, where the Chebyshev filter is used for the convolution operation.

•Text-GCN: the text classification model based on GCN proposed by Yao et al. [44], which first constructs a single large-scale text graph for the entire corpus and then inputs the text graph into GCN.

•SGC: A simplified graph convolution model proposed by Wu et al. [50], which is obtained by repeatedly removing the nonlinearity between GCN layers.

•BERT+LR: This model [51] uses BERT as the encoder to obtain a document representation and then uses the Logistic Regression as the classifier.

•BERT: Bidirectional encoder representation from transformers [52], which is a pre-trained language representation model.

4.3 Parameter settings

A two-layer GCN performs better than a single-layer GCN, whereas when more layers are stacked, the node features become over-smoothed, resulting in over-fitting. Therefore, the number of GCN layers is set to 2 in this paper. If the sliding window is too small, it cannot generate enough global co-occurrence information, while if it is too large, edges may be added between nodes that are not closely related. Therefore, the sliding window size is set to 20 in this paper. If the dropout rate is too small, the model easily over-fits, while if it is too large, important features are dropped, resulting in under-fitting. Therefore, the dropout rate is set to 0.5 in this paper. We randomly select 10% of the training set as the validation set. KDGCN is trained for up to 200 epochs, and training stops when the loss has not decreased for 10 consecutive epochs.
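The stopping rule described above can be sketched as a small training driver. This is an illustration under stated assumptions: `step` is a hypothetical callback that runs one epoch and returns the monitored loss (the text does not state whether this is the training or the validation loss), and "not reduced" is interpreted as no improvement over the best loss seen so far.

```python
def train_with_early_stopping(step, max_epochs=200, patience=10):
    """Run `step(epoch)` until the loss stops improving.

    Returns (last epoch run, best loss seen).
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch in range(1, max_epochs + 1):
        loss = step(epoch)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best
    return max_epochs, best
```

In practice one would usually also keep a copy of the model parameters from the best epoch, which this sketch omits.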

4.4 Experimental results

We use accuracy to evaluate the performance of the model. We run the proposed model 10 times and report the mean ± standard deviation, following the methods in [53, 54]. The results of the nine baseline models are taken from [44, 55]. The specific experimental results are shown in Table 5.

Table 5
Test Accuracy on document classification task

Model	R8	R52	Ohsumed	MR
CNN-rand	94.02	85.37	43.87	74.98
LSTM	96.09	90.48	51.10	77.33
Bi-LSTM	96.31	90.54	49.27	77.68
fastText	94.74	90.99	55.69	76.24
Graph-CNN-C	96.99	92.75	63.86	77.22
TextGCN	97.07	93.56	68.36	76.74
SGC	97.20	94.00	68.50	75.90
BERT+LR	96.26	91.49	54.65	81.07
BERT	97.26	96.26	68.74	85.88
KDGCN	97.12 ± 0.12	94.00 ± 0.14	69.11 ± 0.22	77.15 ± 0.09

Table 5 shows the test accuracy of each model. Experimental results show that the proposed method performs best overall, which demonstrates its effectiveness in short text classification. The experimental results of CNN are obtained through training with randomly initialized word embeddings. CNN can model continuous, short-distance semantics well; however, because the pooling layer of a CNN loses a lot of valuable information when processing context and ignores the association between local and global words, its performance on the four datasets is relatively poor. LSTM and Bi-LSTM were trained with pre-trained word embeddings. Both are improved RNNs that can handle the long-distance dependencies a plain RNN cannot, and they alleviate the sparsity of short texts to a certain extent, although some deficiencies remain. The fastText model, which was trained with bigrams, significantly reduces training time while its accuracy remains comparable to that of the deep learning models.

Graph-CNN-C operates convolutions over word embedding similarity graphs and performs well on the MR dataset, which indicates that the model can maintain grammatical and semantic relationships between words by constructing word similarity graphs and thereby provide additional information for the text. The text graph constructed by Text-GCN captures not only the relationship between documents and words but also the relationship between global and local words. Label information can be propagated across the entire graph through the nodes, so its results are better than those of the deep learning models above. However, it neither mines the latent information of short texts nor introduces external knowledge to address their feature sparsity. SGC performs better than Text-GCN on some datasets; we conjecture that this is because SGC has fewer parameters than Text-GCN and therefore suffers less from overfitting.

KDGCN and BERT are comparable on several datasets, but BERT performs markedly better on MR. MR is a binary sentiment classification dataset, and context information is very important for sentiment classification; BERT takes the context of each word into account when encoding text.

The proposed method extends the graph construction of existing text classification algorithms based on graph neural networks, introducing the semantic information implicit in the text itself and the concept information from the external Microsoft Concept Graph to expand the feature representation of short texts. As a result, the proposed method performs best on the whole. However, although we introduce concept knowledge into the text graph construction process, the current knowledge graph is still incomplete: some entities in the short texts have no corresponding concepts, resulting in low coverage of concept knowledge. Therefore, KDGCN does not achieve the best results on all datasets, but it performs better than Text-GCN.

The experimental results show that the proposed KDGCN model, which builds the text graph from the association relationship between concepts and words, the dependency relationship between distant words in sentences, the co-occurrence relationship between words, and the inclusion relationship between documents and words, can alleviate the problem of sparse features in short texts to a certain extent.
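The idea of building one text graph from four kinds of relationships can be sketched as follows. This is only an illustrative outline: the function name `build_text_graph`, the node names, the edge weights, and the choice to sum the weights of edges that appear in more than one relationship are all assumptions made for the example, not the weighting scheme defined in this paper.

```python
from collections import defaultdict


def build_text_graph(*edge_lists):
    """Merge weighted (u, v, weight) edge lists into one undirected adjacency map.

    Weights of edges that occur in several relationships are summed
    (an illustrative choice, not taken from the paper).
    """
    adj = defaultdict(dict)
    for edges in edge_lists:
        for u, v, w in edges:
            adj[u][v] = adj[u].get(v, 0.0) + w
            adj[v][u] = adj[v].get(u, 0.0) + w
    for node in list(adj):
        adj[node][node] = 1.0  # self-loop, as is common for GCN inputs
    return dict(adj)


# Toy edges with made-up weights; all node names are hypothetical.
doc_word = [("doc0", "heart", 0.8), ("doc0", "failure", 0.6)]  # inclusion
co_occurrence = [("heart", "failure", 0.5)]                    # sliding window
dependencies = [("heart", "failure", 1.0)]                     # word dependency
concept_word = [("concept:disease", "failure", 0.7)]           # concept link

graph = build_text_graph(doc_word, co_occurrence, dependencies, concept_word)
print(sorted(graph))
print(graph["heart"]["failure"])
```

Document, word, and concept nodes share one adjacency structure here, so a single graph convolution can pass information between all three node types.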

4.4.1 Ablation experiments

To verify the influence of features such as association relationship between words and concepts, long-distance word dependencies, word co-occurrence, and inclusion relationship between documents and words on the proposed model, we conducted ablation experiments regarding the four relationships using the four datasets, and the experimental results are shown in Table 6.

Table 6
Results of the ablation experiments (accuracy, %)

Setting	R8	R52	Ohsumed	MR
WC	97.03	93.48	68.43	76.75
WC+L	97.20	93.65	68.66	76.19
WC+C	97.26	93.85	69.02	76.96
WC+LC	97.12	94.00	69.11	77.15


• WC: uses word co-occurrence and the inclusion relationship between documents and words;

• WC+L: uses long-distance word dependencies, word co-occurrence, and the inclusion relationship between documents and words;

• WC+C: uses the association relationship between concepts and words, word co-occurrence, and the inclusion relationship between documents and words;

• WC+LC: uses the association relationship between concepts and words, long-distance word dependencies, word co-occurrence, and the inclusion relationship between documents and words.

Table 6 shows that when only concept information is added (WC+C), accuracy improves to varying degrees on all datasets relative to the WC baseline, which indicates that external knowledge can alleviate the feature sparsity of short texts to a certain extent. When only long-distance word dependencies are added (WC+L), performance on the MR dataset decreases because the texts in this dataset are generally short, with an average length of only 20.39, so adding long-distance word dependencies has little effect. For the R8 and R52 datasets, whose texts are relatively long, a certain degree of improvement can be seen. On most of the datasets, adding concept information and long-distance word dependencies simultaneously improves the results compared with adding only one type of knowledge. The experimental results therefore show that the proposed KDGCN model can capture richer semantic information by simultaneously using the relationship between concepts and words, the co-occurrence relationship between words, the inclusion relationship between documents and words, and the dependency relationship between distant words in sentences to build the graph.
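The comparison above can be made explicit by computing each variant's accuracy change relative to the WC setting. The short Python sketch below uses the values reported in Table 6; the dictionary layout and the helper name `gains_over_baseline` are illustrative choices.

```python
# Accuracy (%) values copied from Table 6.
TABLE6 = {
    "WC":    {"R8": 97.03, "R52": 93.48, "Ohsumed": 68.43, "MR": 76.75},
    "WC+L":  {"R8": 97.20, "R52": 93.65, "Ohsumed": 68.66, "MR": 76.19},
    "WC+C":  {"R8": 97.26, "R52": 93.85, "Ohsumed": 69.02, "MR": 76.96},
    "WC+LC": {"R8": 97.12, "R52": 94.00, "Ohsumed": 69.11, "MR": 77.15},
}


def gains_over_baseline(table, baseline="WC"):
    """Return the accuracy difference of every variant versus the baseline row."""
    base = table[baseline]
    return {
        variant: {ds: round(acc - base[ds], 2) for ds, acc in row.items()}
        for variant, row in table.items()
        if variant != baseline
    }


gains = gains_over_baseline(TABLE6)
for variant, row in gains.items():
    print(variant, row)
```

The only negative entry is WC+L on MR, which corresponds to the decrease discussed above for the shortest texts.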

4.4.2 Case study

To understand the importance of introducing concepts and long-distance dependencies for text classification, we annotate the entity information in sentences, as shown in Fig. 3, and the words with long-distance dependencies, as shown in Fig. 4. The highlighted words are closely related to the label, which explains the effectiveness of the proposed model for text classification tasks.

Fig. 3

Concept visualisation in Ohsumed.

Fig. 4

Long-distance dependency visualisation in Ohsumed.

4.4.3 Parameter sensitivity

To verify the influence of the number of GCN layers on the experimental results, we vary the number of layers on the four datasets while keeping the other parameters unchanged; the trends are shown in Fig. 5. Figure 5 shows that accuracy first increases and then decreases on all four datasets. Because GCN aggregates the features of neighbor nodes, stacking many layers makes the features of different nodes too smooth, resulting in over-fitting. Therefore, the best results are achieved when the number of stacked layers is 2.

Fig. 5

Accuracy with different numbers of GCN layers. The horizontal axis indicates the number of GCN layers and the vertical axis indicates the accuracy on the datasets.
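The smoothing effect of repeated neighbor aggregation can be illustrated with a small, self-contained Python sketch. It is a simplification under stated assumptions: it applies only the aggregation step with a row-normalized adjacency matrix plus self-loops, omits the learned weights and nonlinearities of a real GCN, and uses a hypothetical four-node path graph with scalar node features.

```python
def normalize_with_self_loops(adj):
    """Row-normalise A + I so each node averages itself and its neighbours."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return [[value / sum(row) for value in row] for row in a_hat]


def propagate(a_norm, feats):
    """One aggregation step: every node takes the weighted mean of its neighbourhood."""
    n = len(feats)
    return [sum(a_norm[i][k] * feats[k] for k in range(n)) for i in range(n)]


def spread(feats):
    """Difference between the largest and smallest node feature."""
    return max(feats) - min(feats)


# Hypothetical path graph 0 - 1 - 2 - 3 with one scalar feature per node.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
a_norm = normalize_with_self_loops(adj)

feats = [1.0, 0.0, 0.0, -1.0]
spreads = [spread(feats)]
for _ in range(6):
    feats = propagate(a_norm, feats)
    spreads.append(spread(feats))

print([round(s, 3) for s in spreads])
```

The spread between node features shrinks with every additional aggregation step, so nodes become harder to tell apart as more layers are stacked.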

Figure 6 shows that the accuracy rate increases first and then decreases with the size of the sliding window. When the sliding window size is increased to 20, the accuracy rate reaches the maximum. This shows that a sliding window that is too small cannot generate sufficient global word co-occurrence information, whereas a sliding window that is too large may cause edges to be formed between nodes not closely related.

Fig. 6

Accuracy with different sliding window sizes. The horizontal axis indicates the size of the sliding window and the vertical axis indicates the accuracy on the datasets.
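The effect of the window size on word co-occurrence edges can be sketched in Python as follows. The example only counts which word pairs share at least one window; it does not compute the edge weights used in this paper, and the sentence, the function name `cooccurrence_pairs`, and the window sizes are hypothetical.

```python
from itertools import combinations


def cooccurrence_pairs(tokens, window_size):
    """Collect unordered word pairs that share at least one sliding window."""
    if len(tokens) <= window_size:
        windows = [tokens]
    else:
        windows = [tokens[i:i + window_size]
                   for i in range(len(tokens) - window_size + 1)]
    pairs = set()
    for window in windows:
        # Sort so that (u, v) and (v, u) are stored as the same pair.
        for u, v in combinations(sorted(set(window)), 2):
            pairs.add((u, v))
    return pairs


# Hypothetical example sentence with eight distinct tokens.
tokens = "the patient was treated for acute heart failure".split()
for size in (2, 4, 8):
    print(size, len(cooccurrence_pairs(tokens, size)))
```

A small window links only adjacent words, whereas a window as long as the text links every pair of words, including pairs that are not closely related.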

5 Conclusions

In this study, we propose a concept and dependencies enhanced GCN model for short text classification. The model introduces semantic concepts from a knowledge graph and long-distance word dependencies into the text graph construction process. It simultaneously uses the relationship between words and concepts, long-distance word dependencies, word co-occurrence, and the inclusion relationship between documents and words of an entire text corpus to build a single text graph, which alleviates the feature sparsity of short texts to a certain extent. The experimental results show the effectiveness of the proposed method, but some problems remain. For example, although concept knowledge is introduced into the text graph construction process, the current knowledge graph is still incomplete, so some entities in the short texts have no corresponding concepts. In future work, we will therefore consider introducing the attribute information of entities into the text graph construction process.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 62176145).

Footnotes

https://stanfordnlp.github.io/CoreNLP/

http://sobigdata.d4science.org/group/tagme/

References

1. Miotto, Li, Kidd B.A. and Dudley J.T., Deep patient: An unsupervised representation to predict the future of patients from the electronic health records, Scientific Reports 6(1) (2016), 1–10.

2. Bakshi R.K., Kaur, Opinion mining and sentiment analysis, In: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), IEEE, 2016, pp. 452–455.

3. Wang A.H., Don’t follow me: Spam detection in Twitter, In: 2010 International Conference on Security and Cryptography (SECRYPT), IEEE, 2010, pp. 1–10.

4. Lewis D.D., Ringuette, A comparison of two learning algorithms for text categorization, In: Third Annual Symposium on Document Analysis and Information Retrieval, vol. 33, 1994, pp. 81–93.

5. Apté, Damerau and Weiss S.M., Automated learning of decision rules for text categorization, ACM Transactions on Information Systems (TOIS) 12(3) (1994), 233–251.

6. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys (CSUR) 34(1) (2002), 1–47.

7. Lee L.H., Wan C.H., Rajkumar et al, An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization, Appl Intell 37 (2012), 80–99.

8. Grefenstette, Blunsom et al, A convolutional neural network for modelling sentences, In: The 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, 2014.

9. Zhang, Zhao and LeCun, Character-level convolutional networks for text classification, Advances in Neural Information Processing Systems 28 (2015), 649–657.

10. Socher, Huval, Manning C.D., Ng A.Y., Semantic compositionality through recursive matrix-vector spaces, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1201–1211.

11. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

12. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez and Kaiser, Attention is all you need, Advances in Neural Information Processing Systems 2017 (2017), 5998–6008.

13. Yang, Yang, Dyer, He, Smola, Hovy, Hierarchical attention networks for document classification, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.

14. Lan, Li, Hu, Sun, Zhang, Knowledge Graph Integrated Graph Neural Networks for Chinese Medical Text Classification, 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 2021, pp. 682–687. doi:10.1109/BIBM52615.2021.9669286.

15. Liu, Pang, Li et al, Few-shot short-text classification with language representations and centroid similarity, Applied Intelligence (2022), 1–12.

16. Wang, Wang, Yao et al, Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification, arXiv e-prints, 2021.

17. Yang, Jin, Tao et al, Text classification based on graph neural network and dependency parsing, Computer Science 49(12) (2022), 8.

18. Gao, Peng, Wang, Zhang, Xie and Tian, Incorporating word embeddings into topic modeling of short text, Knowledge and Information Systems 61 (2019), 1123–1145.

19. Sahami, Heilman T.D., A web-based kernel function for measuring the similarity of short text snippets, In: Proceedings of the 15th International Conference on World Wide Web, Association for Computing Machinery, New York, NY, USA, 2006, pp. 377–386.

20. Meng, Wang, Zhang, Ouyang et al, Graph-based Chinese word sense disambiguation with multi-knowledge integration, Comput Mater Continua 61(1) (2019), 197–212.

21. Wang, Huang, Wang, Yuan, Liu, He et al, Learning intents behind interactions with knowledge graph for recommendation, In: Proceedings of the Web Conference 2021, 2021, pp. 878–887.

22. Yasunaga, Ren, Bosselut, Liang, Leskovec, QA-GNN: Reasoning with language models and knowledge graphs for question answering, arXiv preprint arXiv:2104.06378, 2021.

23. Wang, Zhang, Jure, Zhao, Li, Wang, Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems, The 25th ACM SIGKDD International Conference, 2019.

24. Lin, Liu, Sun, Multi-Paragraph Reasoning with Knowledge-enhanced Graph Neural Network, arXiv preprint arXiv:1911.02170, 2019.

25. Zhou, Huang, Hu and He, SK-GCN: Modeling syntax and knowledge via graph convolutional network for aspect-level sentiment classification, Knowledge-Based Systems 205(3) (2020), 106292.

26. Yan, Jian and Sun, SAKG-BERT: Enabling language representation with knowledge graphs for Chinese sentiment analysis, IEEE Access 9 (2021), 101695–101701.

27. Yan, Research on Short Text Classification Algorithm Based on Graph Neural Network and Fusion of External Features, Jilin University.

28. Abu Arqub, Adaptation of reproducing kernel algorithm for solving fuzzy Fredholm-Volterra integrodifferential equations, Neural Computing and Applications 28 (2017), 1591–1610.

29. Abu Arqub, Singh and Alhodaly, Adaptation of kernel functions-based approach with Atangana–Baleanu–Caputo distributed order derivative for solutions of fuzzy fractional Volterra and Fredholm integrodifferential equations, Mathematical Methods in the Applied Sciences 2021 (2021), 1–28.

30. Joachims, Text categorization with support vector machines: Learning with many relevant features, European Conference on Machine Learning, Springer, Berlin, Heidelberg, 1998, pp. 137–142.

31. McCallum and Nigam, A comparison of event models for naive Bayes text classification, AAAI-98 Workshop on Learning for Text Categorization 752(1) (1998), 41–48.

32. Bouaziz, Dartigues-Pallez, da Costa Pereira, Precioso, Lloret, Short text classification using semantic random forest, International Conference on Data Warehousing and Knowledge Discovery, Springer International Publishing, 2014, pp. 288–299.

33. Kim, Convolutional neural networks for sentence classification, In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746–1751.

34. Jin, Wang, Zhang, Yan, Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification, Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017.

35. Hao, Xu, Liang J.Y., Zhang B.W. and Yin X.C., Chinese short text classification with mutual-attention convolutional neural networks, ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19(5) (2020), 1–13.

36. Chen, Hu, Liu, Xiao and Jiang, Deep short text classification with knowledge powered attention, Proceedings of the AAAI Conference on Artificial Intelligence 33(01) (2019), 6252–6259.

37. Cai, Wu, Lei, Huang, Fung Leung et al., Incorporating context-relevant concepts into convolutional neural networks for short text classification, Neurocomputing 386 (2020), 42–53.

38. Liu, You, Zhang, Wu and Lv, Tensor graph convolutional networks for text classification, Proceedings of the AAAI Conference on Artificial Intelligence 34(05) (2020), 8409–8416.

39. Ding, Wang, Li, Liu, Be more with less: Hypergraph attention networks for inductive text classification, arXiv preprint arXiv:2011.00387 (2020).

40. Lin, Meng, Sun, Han, Kuang, Li, Wu, BertGCN: Transductive Text Classification by Combining GCN and BERT, arXiv preprint arXiv:2105.05727 (2021).

41. Huang, Ma, Li, Zhang, Wang, Text level graph neural network for text classification, arXiv preprint arXiv:1910.02356, 2019.

42. Kipf T.N., Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907.

43. Defferrard, Bresson, Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, In: NIPS, pp. 3844–3852.

44. Yao, Mao and Luo, Graph convolutional networks for text classification, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 7370–7377.

45. Linmei, Yang, Shi, Ji, Li, Heterogeneous graph attention networks for semi-supervised short text classification, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4821–4830.

46. Tayal, Rao, Agarwal, Jia, Subbian, Kumar, Regularized Graph Convolutional Networks for Short Text Classification, Proceedings of the 28th International Conference on Computational Linguistics: Industry Track, 2020, pp. 236–242.

47. Wang, Wang, Yao, Dou, Hierarchical heterogeneous graph representation learning for short text classification, In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3091–3101.

48. Liu, Qiu, Huang, Recurrent neural network for text classification with multi-task learning, In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 2873–2879.

49. Joulin, Grave É., Bojanowski, Mikolov, Bag of tricks for efficient text classification, In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 427–431.

50. Souza, Zhang, Fifty, Yu, Weinberger, Simplifying graph convolutional networks, In: International Conference on Machine Learning, PMLR, 2019, pp. 6861–6871.

51. Gao and Huang, A gating context-aware text classification model with BERT and graph convolutional networks, Journal of Intelligent and Fuzzy Systems 40(3) (2021), 4331–4343.

52. Devlin, Chang M.-W., Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, In: NAACL, Minneapolis, Minnesota, Association for Computational Linguistics, pp. 4171–4186.

53. Abu Arqub, Singh, Maayah and Alhodaly, Reproducing kernel approach for numerical solutions of fuzzy fractional initial value problems under the Mittag–Leffler kernel differential operator, Mathematical Methods in the Applied Sciences 2021 (2021), 1–22.

54. Alshammari, Al-Smadi, Abu Arqub, Hashim and Alias M.A., Residual series representation algorithm for solving fuzzy duffing oscillator equations, Symmetry 12 (2020), 572.

55. Huang Y.H., Chen Y.H., Chen Y.S., ConTextING: Granting Document-Wise Contextual Embeddings to Graph Neural Networks for Inductive Text Classification, In: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 1163–1168.


