Machine learning based software effort estimation using development-centric features for crowdsourcing platform

Abstract

Multi-label text classification is a method for categorizing textual data based on features extracted from the original textual information. When it comes to modelling the structural properties of text, the Graph Convolutional Network (GCN) has demonstrated outstanding performance. However, most existing graph-based models do not model the structure of a single text unit and do not consider the sequence information in each document (e.g., word order). To resolve these issues and fully utilize the text’s structural and sequential details, a text classification model called Sequential GCN with Multi-Head Attention (SGCN-MHA) is proposed in this paper. For each text, a separate text graph is constructed in which the nodes are the words of the text and the edges between nodes correspond to the word relations. The GCN is then used to extract the structural feature. To enable the word nodes in the document graph to hold contextual information, a BiLSTM is also applied to learn the sequential feature for each graph. Finally, the Multi-Head Attention mechanism is adopted to let these two features interact and then aggregate them, giving access to the critical information in the text. The effectiveness of our approach has been tested on two standard datasets through comparative and ablation experiments.

Keywords

Multi-label text classification, graph convolutional network, BiLSTM network, attention mechanism

1. Introduction

In today’s society, people are creating and acquiring more and more content on the Internet, with major platforms generating large amounts of textual data every moment, such as chat posts, movie reviews, product evaluations, and news on a wide range of topics. Extracting relevant information from these different types of textual data is a natural necessity in people’s daily lives. Text classification, as a common Natural Language Processing (NLP) technique, can play an important role in the processing and analysis of web text data. It can be applied to many domains, such as sentiment classification [1, 2], question answering [3, 4], recommender systems [5], literature organisation [6], disease diagnosis [7], and topic annotation [8].

The history of text classification technology can be divided into three stages. The first stage, before 1980, was limited to writing rules manually. Researchers classified text using hand-crafted matching rules and predicted the class of a text from the words that matched. However, manual methods have many drawbacks. Firstly, analysing textual data and subsequently maintaining the rules is labour-intensive, time-consuming, and costly, and can take several years of work by experts in the field. Secondly, a set of rules applies only to the dataset it was written for and does not transfer to other datasets. The second stage started in the 1980s, when researchers began using machine learning algorithms to implement automatic classification. Boreham [9] initially represented the words in the text as vectors and used clustering to classify them. Torsten [10] was the first to use naive Bayes classification and TF-IDF (Term Frequency-Inverse Document Frequency) to classify news text data. He later also proposed SVM (Support Vector Machines) [11] for this task and showed that it classifies better than naive Bayes and K-Nearest Neighbour (KNN) [12]. The third stage covers the last two decades, in which many neural network-based models, including the Convolutional Neural Network (CNN) [13] and the Recurrent Neural Network (RNN) [14], have been applied to text categorization. These approaches focus on regular data such as the sequential structure of text or the flat grid of an image. In real life, however, much data is not arranged sequentially or on a grid but forms more complex graph structures, such as social networks, regulatory networks, and biomolecular structures. This led to the emergence of an entirely new type of neural network: the Graph Neural Network (GNN) [15].

Since GNN has a natural advantage in preserving and modelling structural information, some researchers have started applying existing neural network models to graph structures for end-to-end graph modelling. The Graph Convolutional Network (GCN) is one of the most popular of these models. Kipf et al. [16] first used GCN for text classification, outperforming the conventional CNN model. Afterward, Yao et al. [17] proposed TextGCN, which uses words and documents as nodes to create a large, coherent graph of the whole corpus. However, this model, like many graph-based methods, fails to adequately account for local aspects of the text, such as key phrases and the word order information specific to each document.

In order to overcome the challenges mentioned above, a new model called Sequential GCN with Multi-Head Attention (SGCN-MHA) is proposed. After constructing a graph for each input document, GCN is used to extract the structural features, and BiLSTM is used to extract the contextual semantic relationships of the text in two directions. The Multi-Head Attention mechanism is employed to interact with the two features and assign weights to them in order to acquire the critical feature words while also maximizing the use of the two features in the classification process. Compared with other models, this approach overcomes the limitation of other models that may focus solely on one aspect (structural or sequential features) and provides a more holistic solution for document classification tasks that require considering both types of features simultaneously. Therefore, the problem of effectively leveraging both structural and sequential information in the classification process can be uniquely addressed by the SGCN-MHA model. In summary, this paper makes three important contributions:

The proposed model can learn different features from multiple aspects. It takes into account both the sequential semantics from BiLSTM and the structural semantics from GCN, thus enhancing the learned representation of the text.

To avoid the separation of these two semantics in the prediction process and to find the essential terms that are most relevant to the task, a Multi-Head Attention mechanism is used, which lets the two representations interact so that both are fully exploited and assigns different weights to different words for better classification results.

Two popular datasets are used to test the effectiveness of SGCN-MHA. The comparative and ablation experiments are conducted to show the better performance of our model.

The rest of the paper is organized as follows. Section 2 gives a synopsis of recent work on text classification methods, along with the state of the art of GCN-based text classification research. Section 3 explains the proposed text classification model SGCN-MHA. Section 4 describes the datasets and evaluation metrics, and then presents and discusses the experimental results. Finally, Section 5 presents the conclusions of this paper.

2. Related work

2.1 Text classification

Text classification is an essential area of Natural Language Processing (NLP) [18] and is widely used in recommendation systems, question-answering systems, sentiment analysis, and other areas. Text classification is the process of assigning texts to one or more groups based on predetermined labels and the characteristics of the texts. Depending on how many labels each text may receive, the task is either single-label or multi-label text classification. Single-label text classification assigns each text to exactly one label category. Multi-label text classification, by contrast, assigns each text to several label categories at once, which is more useful in practical contexts and better matches the characteristics of real-world objects.

Text classification technology has evolved from manual annotation to machine learning algorithms and then to deep learning techniques. Traditional methods, such as naive Bayes [19], K-Nearest Neighbor (KNN) [12], and Support Vector Machines (SVM) [20], rely on feature engineering to represent text, for example with Bag-of-words (BOW) [11], N-Gram [21], and Topic Model [22]. Researchers must specify the features manually and then use machine learning algorithms to train and predict. Given the vast amount of data and the computing resources involved, these classic machine learning algorithms can no longer meet the performance requirements of actual applications.

Due to deep learning’s rapid development, particularly in image and speech processing, artificial intelligence has made significant strides in recent years. Inspired by this, deep neural network-based techniques have been used extensively in text classification. Currently, the two most popular deep learning models are the Convolutional Neural Network (CNN) [13] and the Recurrent Neural Network (RNN) [14]. They are good at capturing local information and extracting a sequence’s most important information without changing the position of the input sequence. Kim et al. [23] introduced TextCNN in 2014, a text classification model with CNN as the processing framework, which includes convolutional, pooling, and output layers. The model achieved good classification results and became a new benchmark method for text classification at that time. Although CNN effectively extracts local information from the text, it does not perform well with long text. RNN has a set of memory cells, so it is suited to processing sequential data and is better than CNN at capturing sequence context. As each element of the sequence is received, it produces an output that depends on the memory cells and the current input. Although RNN is good at processing sequential data, it suffers from gradient instability when processing long sequences, which prevents it from capturing relationships between words that are far apart. Optimized RNN models, such as the Long Short-Term Memory network (LSTM) [24] and the Gated Recurrent Unit (GRU) [25], as well as BiLSTM and BiGRU, which are derived from them, mitigate the shortcomings of the general RNN to some extent. Deng et al. [26] proposed the ABLG-CNN model, combining BiLSTM and CNN to obtain contextual and important topic information. Teng et al. [27] integrated two methods, using BiGRU to extract contextual information from the text and CNN to extract local information. In addition, pre-trained language models such as BERT [28] have attracted increasing attention. BERT has been pre-trained on a sizable dataset and can be fine-tuned for use in NLP subtasks. Prabhu et al. [29] proposed a BERT-based active learning strategy for multi-label text classification. Liu et al. [30] used ALBERT, an improved BERT model, combined with CNN for text classification. These models, however, frequently ignore the co-occurrence of individual words and fail to fully use the text’s structural characteristics, focusing instead primarily on the order of the text.

2.2 Graph Convolutional Network

Recently, the Graph Neural Network (GNN) has also become popular because it has proven to be a powerful tool for solving text classification problems by considering the correlation between words and preserving the global structural information of graphs. By converting textual data into graph data, GNN captures the semantic relations between words and learns more effective representations of the nodes or of the whole network, and it has succeeded in several text categorization tasks. GNN can therefore directly handle complex structural data by prioritizing global structural information. Several researchers have adapted existing neural network models to graph structures to model graphs end-to-end, the most popular result being the Graph Convolutional Network (GCN). Bruna et al. [31] proposed the first GCN model in 2013, using the convolution theorem and graph theory to define graph convolution in the spectral domain. However, the original GCN model has the disadvantage of high time and space complexity. Kipf et al. [16] presented a simplified GCN model that significantly reduces the time and space complexity of parameterizing the convolution kernel in the spectral approach. They also applied it to text classification, where it outperformed the conventional CNN model.

There are two main approaches to building graphs for complicated corpora. The first builds one vast, unified text graph for the corpus, based on the relationships between the words and documents in the entire corpus. The TextGCN algorithm developed by Yao et al. [17] created a corpus-level graph with words and documents as nodes, which achieved good performance in the text classification task. SGC [32] and S2GC [33] used the same approach as TextGCN to construct the graph but proposed a different method of information propagation. TensorGCN [34] and TextGTL [35] built three graphs in different ways: a semantic-based graph, a syntactic-based graph, and a sequence-based graph. Each of these three graphs was based on the whole corpus and used the same propagation scheme as the GCN. Because these models build a large graph over the whole corpus, they consume significant memory. When new documents arrive, the entire graph structure has to be modified. In addition, these models cannot take the structural features of individual documents into consideration. The second approach builds a small individual graph for each document in the dataset, including graphs of semantic and syntactic correlations. Huang et al. [36] proposed a new graph-based model in which a graph is constructed for each input text using much smaller text windows rather than a single graph of the whole corpus. This approach drastically decreases the number of edges and the memory overhead while extracting additional local characteristics. Zhang et al. [37] proposed TextING, a graph-based text classification model that captures how each document’s words relate to one another in context and enables the inductive learning of new words. Although GCN-based models can learn the graph structure information of the text, such as syntactic and semantic parse trees, they pay too little attention to the order of the text.

With the growing popularity of pre-trained models, researchers have begun to combine pre-trained language models with the GCN. The TG-Transformer proposed by Zhang et al. [38] combined Transformer and GCN. Jeong et al. [39] made use of GCN and BERT to produce a document encoding and a context encoding for the recommendation task. However, this method cannot take into account interactions between features, which reduces the mapping possibilities. Lu et al. [40] proposed the VGCN-BERT model by combining BERT capabilities with a vocabulary GCN (VGCN). Lin et al. [41] proposed BertGCN, which embedded BERT into TextGCN. She et al. [42] processed the event text using GCN and BERT separately to get the respective representation vectors. However, these models, which use pre-trained models, are computationally expensive and require external resources that are not always accessible.

3. Proposed SGCN-MHA model

3.1 System overview

This section describes the architecture of SGCN-MHA in detail. The model architecture is shown in Fig. 1 and consists of four components, each corresponding to one stage of the model: graph construction, GCN layer, BiLSTM layer, and attention layer. In the graph construction stage, the raw data is pre-processed, and a separate graph is then constructed for each input document from the relationships between its words. The graph’s nodes stand for the individual words, while the edges show how those words are related. In the second stage, GCN extracts the structural feature by aggregating the neighbourhood features of every word node, so that word representations are learned from their neighbourhood structures. In the third stage, BiLSTM takes full account of the contextual information of each word in the document, enriching the representation of the word nodes in each document graph in both the forward and backward directions. The second stage thus captures the text’s structural features, while the third stage captures the documents’ sequential semantic content. Finally, in the attention stage, the Multi-Head Attention technique lets these two features interact and combines them on each graph before the contextual information is aggregated, which solves the feature interaction problem and concentrates on the text’s main ideas. The attention layer takes the outputs of GCN and BiLSTM as input and computes the final text representation from the resulting attention scores. The text’s labels can then be predicted from the learned representation. Further details are given below.

Table 1 summarizes significant notations for convenience of reference.

Table 1
Notations

Notation	Description
${G}$	Graph
${A}$	Adjacency matrix
${\hat{A}}$	Laplacian transform matrix
${X}$	Feature matrix
${D}$	Degree matrix
${\sigma}$	Activation function
${Q}$	Query matrix
${K}$	Key matrix
${V}$	Value matrix
${b}$	Bias

Figure 1.

The architecture of SGCN-MHA. (a) The input document, where ${w}_{{i}}$ represents a word in the text. (b) Graph construction and the feature matrix $X$ , which is the input of the feature extraction layer. (c) GCN-based structural feature extraction and BiLSTM-based sequential feature extraction. (d) MHA-based feature interaction and aggregation, after which the text is classified according to the final text representation.

3.2 Graph construction

An independent graph is created for every input document, with all words in the text represented as nodes and the relationships between words as edges. The text’s words can be represented as ${{T}=\{{n}_{0}\ldots{n}_{i}\ldots{n}_{l-1}\}}$ , where ${{n}_{{i}}}$ stands for the $i$-th word, a vector initialized by a ${d}$-dimensional word embedding that can be trained and updated in the model. The graph is denoted by ${{G=(N,E)}}$ , where ${N}$ and ${E}$ are defined as:

$\displaystyle{N}=\{{n}_{{i}}\mid i\in[0,l-1]\}$ (1)

$\displaystyle{E}=\{e_{ij}\mid i\in[0,l-1];j\in[i-k,i+k]\}$ (2)

In these equations, ${N}$ is the set of nodes in the graph, and ${E}$ is the set of edges between nodes. The letters ${l}$ and ${k}$ represent, respectively, the length of the text and the number of neighbouring words on each side of a word that are connected to it in the graph.

First, the raw data must be pre-processed using standard methods such as word cleaning. Unprocessed data may contain much useless content, such as website links and punctuation marks. This content has no practical relevance for many classification tasks and is usually discarded as noise. Moreover, stop words and words appearing fewer than five times in the document need to be removed. Stop words such as conjunctions, interjections, or sentence-final particles are present in large numbers in the text, do not contribute to text classification, and may even severely interfere with the classification task. Stop words should therefore be filtered out to improve the model’s performance on the classification task.

After data preprocessing, the co-occurrence information needs to be extracted in order to model the correlation between words in the document. To achieve this, a sliding window approach is employed: a fixed-size window of four words is moved over the text. For each window position, an edge is created between every pair of words in the window and added to the graph. Repeating this process for every window position results in a graph where each node represents a unique word and each edge represents the co-occurrence of two words within a fixed distance of each other. The graph is described by three important matrices, as shown in Fig. 2. The first is the adjacency matrix ${A}$ , representing the relationship between nodes. ${A}$ can be defined as ${A\in{R}^{|v|\times|v|}}$ , where ${v}$ represents the set of vertices of the graph. The second is the degree matrix ${D}$ , a diagonal matrix whose entries give each node’s degree, i.e., how many other nodes it connects to. The last is the feature matrix ${X}$ , which represents the features of the nodes. ${X}$ can be defined as ${X\in R^{n\times c}}$ , where ${n}$ stands for the number of nodes and ${c}$ for the size of the features.
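The sliding-window construction can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `build_cooccurrence_graph` and `degree_matrix` are hypothetical, edges are assumed to be unweighted and undirected, and a text shorter than the window is assumed to form a single window.

```python
from itertools import combinations


def build_cooccurrence_graph(tokens, window_size=4):
    """Build a per-document co-occurrence graph.

    Returns (vocab, adjacency): nodes are the unique words, and an
    undirected, unweighted edge links every pair of distinct words that
    share at least one sliding window (an assumption of this sketch).
    """
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    adjacency = [[0] * len(vocab) for _ in vocab]

    # A text shorter than the window is treated as a single window.
    num_windows = max(1, len(tokens) - window_size + 1)
    for start in range(num_windows):
        window = set(tokens[start:start + window_size])
        for u, v in combinations(window, 2):
            adjacency[index[u]][index[v]] = 1
            adjacency[index[v]][index[u]] = 1
    return vocab, adjacency


def degree_matrix(adjacency):
    """Diagonal matrix D whose entry D[i][i] is the degree of node i."""
    n = len(adjacency)
    return [[sum(adjacency[i]) if i == j else 0 for j in range(n)]
            for i in range(n)]


# Pre-processed tokens of the example sentence used in Fig. 2.
tokens = ["jack", "boy", "who", "loves", "drawing"]
vocab, A = build_cooccurrence_graph(tokens, window_size=4)
D = degree_matrix(A)
```

With a window of four words, "jack" and "drawing" never share a window in this example, so they are the only pair of nodes left unconnected.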

Figure 2.

An illustration of the constructed graph, adjacency matrix ${A}$ , and feature matrix ${X}$ . For example, in the sentence “Jack is a boy who loves drawing”, the text is pre-processed in such a way that nouns, adjectives and verbs such as “Jack”, “boy”, “who”, “loves”, “drawing” are selected as nodes of the graph, and “is”, “a” are not used as nodes of the graph as stop words, and “Jack” and “boy” are therefore connected by a single edge etc.

3.3 GCN layer

After constructing the graphs, the text co-occurrence graph ${G=(N,E)}$ is obtained, and GCN is then used to exploit its structure. GCN, which is commonly applied to node classification, graph classification, and edge prediction on graph data, learns the structure of the text network here: the neighbourhood aggregation operation follows the connections of the text network in order to obtain the embedded representation of the words. GCN needs to learn a mapping function that enables each node in the graph to combine its own properties with those of nearby nodes to generate a new representation of the node. Stacking multiple GCN layers is necessary in order to capture the higher-order neighbourhood information of word nodes. In detail, the GCN learning process is as follows. First, the current node features are transformed. Then the features of the neighbouring nodes are aggregated to obtain the node’s new features. Finally, nonlinearity is introduced by applying an activation function.

The formulas for GCN are shown in Eqs (3)–(5). GCN has two inputs: the adjacency matrix ${A}$ of graph ${G}$ , which represents the connection relationship between nodes, and the feature matrix ${X}$ of graph ${G}$ , which represents the features of the nodes. As the equations show, there are two steps to the GCN learning process. The first step is calculating the normalized Laplacian transform matrix ${\hat{A}}$ from the adjacency matrix ${A}$ , as detailed in Eqs (3) and (4). Adding the identity matrix ${I_{N}}$ to the adjacency matrix ${A}$ gives ${\tilde{A}}$ , i.e., the adjacency matrix with self-loops. ${\tilde{D}}$ is the degree matrix of ${\tilde{A}}$ , with ${\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}}$ . In the second step, the structural information is extracted from the feature matrix ${X}$ , as specified in Eq. (5). In this equation, ${y}$ is the index of the network layer, and ${H^{(y)}}$ represents the features of the $y$-th layer of the network. If ${y=0}$ , then ${H^{(0)}=X}$ , which means the input feature matrix is fed directly into the network. ${W^{(y)}}$ denotes the trainable weight matrix of that layer. After the message passing has been carried out through all the layers, the non-sequential semantic feature representation ${H_{0}^{G}}$ is obtained.

$\displaystyle\hat{A}=\tilde{D}^{-0.5}\tilde{A}\tilde{D}^{-0.5}$ (3)

$\displaystyle{\tilde{A}}=A+I_{N}$ (4)

$\displaystyle{H^{(y+1)}}=\sigma(\hat{A}H^{(y)}{W}^{(y)})$ (5)
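As a worked illustration of Eqs (3)–(5), the following sketch computes $\hat{A}$ for a three-node path graph and applies one GCN layer in plain Python. It is only a sketch under stated assumptions: ReLU is assumed for the activation $\sigma$ (the text does not name one), the helper names `matmul`, `normalize_adjacency`, and `gcn_layer` are hypothetical, and the weights are arbitrary illustrative values rather than trained parameters.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_adjacency(adjacency):
    """Eqs (3)-(4): A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}."""
    n = len(adjacency)
    a_tilde = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Self-loops guarantee every degree is at least 1, so no division by zero.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j]
             for j in range(n)] for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def gcn_layer(a_hat, h, w):
    """Eq. (5): H^(y+1) = sigma(A_hat H^(y) W^(y)), with sigma assumed ReLU."""
    return relu(matmul(matmul(a_hat, h), w))


# Path graph 0 - 1 - 2 with 2-dimensional node features.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W0 = [[1.0, -1.0],
      [0.5, 0.5]]  # illustrative weights, not trained values

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)  # H^(0) = X
```

Stacking further calls to `gcn_layer` with additional weight matrices corresponds to deeper message passing, where each layer lets a node see one more hop of its neighbourhood.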

3.4 BiLSTM layer

Word order is important for text classification, so it is necessary to learn both front-to-back and back-to-front information at the same time. For each text graph, a BiLSTM network is used to extract the sequence information, and the text word vectors are encoded into hidden vectors with contextual semantic information. BiLSTM is composed of a forward LSTM and a backward LSTM, so it is often used to model contextual information in the field of NLP.

The LSTM model has an adaptive gating mechanism that lets the LSTM unit decide how much of its previous state to keep and how much of the features extracted from the current input to remember. The LSTM block consists of a forget gate, an input gate, and an output gate, denoted as ${f_{t}}$ , ${i_{t}}$ , ${o_{t}}$ at time ${t}$ , respectively, in Eqs (6)–(8). The weight matrices of the three gates are denoted by ${w_{f}}$ , ${w_{i}}$ and ${w_{o}}$ , and their bias vectors by ${b_{f}}$ , ${b_{i}}$ and ${b_{o}}$ ; ${x_{t}}$ is the input at time ${t}$ and ${h_{t-1}}$ is the previous hidden state. The candidate state value of the memory cell at time ${t}$ , ${\tilde{c}_{t}}$ , is computed with the $\tanh$ function using the weight matrix ${w_{c}}$ and bias ${b_{c}}$ . ${c_{t}}$ is the state of the memory cell at time ${t}$ . The equations for ${\tilde{c}_{t}}$ and ${c_{t}}$ are shown in Eqs (9) and (10), respectively. Equation (10) updates the memory cell by combining the new candidate information admitted by the input gate with the part of the previous cell state that the forget gate retains. The output information is ultimately decided by the output gate, and Eq. (11) gives the current output ${h_{t}}$ . After a text of length ${n}$ has been processed, the resulting sequence of hidden states is ${\{{h}_{0},{h}_{1}\ldots{h}_{n-1}\}}$ .

$\displaystyle{f_{t}}=\sigma(w_{f}[x_{t},h_{t-1}]+b_{f})$ (6)

$\displaystyle{i_{t}}=\sigma(w_{i}[x_{t},h_{t-1}]+b_{i})$ (7)

$\displaystyle{o_{t}}=\sigma(w_{o}[x_{t},h_{t-1}]+b_{o})$ (8)

$\displaystyle{\tilde{c}_{t}}=\tanh(w_{c}[x_{t},h_{t-1}]+b_{c})$ (9)

$\displaystyle{c_{t}}=i_{t}*\tilde{c}_{t}+f_{t}*c_{t-1}$ (10)

$\displaystyle{h_{t}}=o_{t}*\tanh(c_{t})$ (11)
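A single LSTM time step following Eqs (6)–(11) can be written out directly. The sketch below is illustrative only: the `params` dictionary layout and the function names `affine` and `lstm_step` are assumptions made for this example, $\sigma$ is taken to be the logistic sigmoid, and the parameter values are arbitrary rather than trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, v, b):
    """Compute w @ v + b for a weight matrix w, vector v, and bias b."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi
            for row, bi in zip(w, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; returns (h_t, c_t)."""
    z = x_t + h_prev  # list concatenation: [x_t, h_{t-1}]
    f_t = [sigmoid(a) for a in affine(params["w_f"], z, params["b_f"])]  # Eq. (6)
    i_t = [sigmoid(a) for a in affine(params["w_i"], z, params["b_i"])]  # Eq. (7)
    o_t = [sigmoid(a) for a in affine(params["w_o"], z, params["b_o"])]  # Eq. (8)
    c_tilde = [math.tanh(a)
               for a in affine(params["w_c"], z, params["b_c"])]         # Eq. (9)
    c_t = [i * g + f * c
           for i, g, f, c in zip(i_t, c_tilde, f_t, c_prev)]             # Eq. (10)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                   # Eq. (11)
    return h_t, c_t


# Input size 2, hidden size 1; illustrative parameters, not trained values.
params = {
    "w_f": [[0.1, 0.2, 0.3]], "b_f": [0.0],
    "w_i": [[0.4, -0.1, 0.2]], "b_i": [0.0],
    "w_o": [[-0.3, 0.5, 0.1]], "b_o": [0.0],
    "w_c": [[0.2, 0.2, -0.4]], "b_c": [0.0],
}
h, c = [0.0], [0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
```

Because $h_{t}$ is an output gate value in $(0,1)$ times a $\tanh$ value in $(-1,1)$, every component of the hidden state stays strictly inside $(-1,1)$.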

The LSTM model processes a sequence in one direction only and therefore cannot use the context that follows a word, yet in a text classification task the outcome at the current time step is frequently linked to both the words that came before it and the words that came after it. In the BiLSTM model, by contrast, each text sequence is processed in two directions, front-to-back and back-to-front, and the contextual information of the sequence is extracted from both. Therefore, the sequence representation obtained by the BiLSTM model takes contextual semantic information into account, and its feature-rich representation is more suitable for text classification.

For each document-level graph, BiLSTM is used to obtain contextual information between words in order to learn each document’s word representation and then update the document’s corresponding feature matrix. For example, to encode “We love singing”, the forward network ${\mathrm{LSTM}_{\mathrm{L}}}$ reads “We”, “love”, “singing” in that order and produces the vectors ${\{h_{L0},h_{L1},h_{L2}\}}$ , and the backward network ${\mathrm{LSTM}_{\mathrm{R}}}$ reads the same words in reverse order, “singing”, “love”, “We”, and produces the vectors ${\{h_{R0},h_{R1},h_{R2}\}}$ . Finally, the forward and backward hidden vectors belonging to the same word are concatenated to obtain ${\{[h_{L0},h_{R2}],[h_{L1},h_{R1}],[h_{L2},h_{R0}]\}}$ , i.e., ${\{h_{0},h_{1},h_{2}\}}$ . After this step, the final text feature representation ${H_{0}^{B}}$ is obtained.
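The pairing of forward and backward states can be made concrete with a small sketch. To keep the indices easy to trace, a toy recurrent cell (a running sum, named `toy_step` here) stands in for the LSTM; the functions `run_direction` and `bidirectional` are hypothetical helpers, and the sketch assumes the indexing in which $h_{R0}$ is the state the backward network produces after reading the last word.

```python
def run_direction(inputs, step, h0):
    """Run a recurrent cell over the inputs and collect every hidden state."""
    states, h = [], h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(inputs, step_fwd, step_bwd, h0):
    """Concatenate, per word, the forward and backward hidden states."""
    forward = run_direction(inputs, step_fwd, h0)         # h_L0, h_L1, ...
    backward = run_direction(inputs[::-1], step_bwd, h0)  # h_R0 follows the LAST word
    n = len(inputs)
    # Word t pairs forward state t with backward state n-1-t.
    return [forward[t] + backward[n - 1 - t] for t in range(n)]


def toy_step(x, h):
    """Toy stand-in for an LSTM cell: the hidden state is a running sum."""
    return [h[0] + x]


# Three "words" encoded as the numbers 1, 2, 3.
merged = bidirectional([1, 2, 3], toy_step, toy_step, [0])
# merged[0] combines the forward state after word 0 with the backward
# state that has already seen words 2, 1, and 0.
```

In a real model, `toy_step` would be replaced by two separately parameterized LSTM cells, one per direction; the alignment logic stays the same.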

3.5 Attention layer

After the structural and sequential features of each graph have been extracted using the GCN and BiLSTM, respectively, the Multi-Head Attention mechanism is applied to each graph. The Multi-Head Attention mechanism makes it possible to represent a node that is affected by several dependencies: different heads can capture different dependencies in a sequence and learn different behaviours from the same attention mechanism. Because of this, even the precise meaning of a polysemous word can be determined through its various dependencies.

The primary process of the Multi-Head Attention mechanism can be divided into three steps. The first step is to identify the different subspaces for the query, key, and value mappings. The second step is to transform the query matrix ${Q}$ , the key matrix ${K}$ , and the value matrix ${V}$ with ${h}$ different sets of independently learned linear projections. Finally, in the third step, the ${h}$ groups of transformed queries, keys, and values are sent in parallel to the attention pooling, and the outputs of the heads are concatenated. The specific calculation process is shown in Eqs (12)–(14), where ${d}$ represents the dimension of the key.

The text representation obtained from GCN is defined as ${H_{0}^{G}}$ and that obtained from BiLSTM as ${H_{0}^{B}}$ . The query, key, and value matrices are then computed for ${H_{0}^{G}}$ and ${H_{0}^{B}}$ respectively, as shown in Fig. 3. The query matrix of ${H_{0}^{B}}$ together with the key and value matrices of ${H_{0}^{G}}$ are taken as the input of Eq. (13) to get ${H^{B}}$ , and the query matrix of ${H_{0}^{G}}$ together with the key and value matrices of ${H_{0}^{B}}$ are taken as the input of Eq. (13) to get ${H^{G}}$ . In this way, we get a GCN attention representation that depends on the output of the BiLSTM and a BiLSTM attention representation that depends on the output of the GCN.

$\displaystyle\textit{Attention}(Q,K,V)=\textit{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$ (12)

$\displaystyle{\textit{head}}_{i}=\textit{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})$ (13)

$\displaystyle\textit{MultiHead}(Q,K,V)=\textit{Concat}(\textit{head}_{1},\ldots,\textit{head}_{h})W$ (14)
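The cross-feature interaction described by Eqs (12)–(14) can be sketched as follows. This is a minimal plain-Python illustration, not the authors' code: the helper names are hypothetical, the projection matrices are arbitrary illustrative values, ${d}$ is taken to be the dimension of the projected keys of each head, and the two feature matrices are assumed to have one row per word.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Eq. (12): Softmax(Q K^T / sqrt(d)) V."""
    d = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head(q, k, v, heads, w_out):
    """Eqs (13)-(14); `heads` is a list of (W_q, W_k, W_v) triples."""
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    concat = [sum((head[i] for head in outputs), []) for i in range(len(q))]
    return matmul(concat, w_out)


# Two words, 2-dimensional features from each branch (illustrative values).
H0_G = [[1.0, 0.0], [0.0, 1.0]]   # GCN output
H0_B = [[0.5, 0.5], [1.0, -1.0]]  # BiLSTM output
first, second = [[1.0], [0.0]], [[0.0], [1.0]]
heads = [(first, first, first), (second, second, second)]
W_out = [[1.0, 0.0], [0.0, 1.0]]

# Queries from one branch attend over keys and values from the other.
H_B = multi_head(H0_B, H0_G, H0_G, heads, W_out)
H_G = multi_head(H0_G, H0_B, H0_B, heads, W_out)
```

Each head output is a weighted average of the value rows, so swapping which branch supplies the queries changes which words are emphasized without changing the output shape.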

Figure 3.

Attention layer.

The extracted features ${H^{G}}$ and ${H^{B}}$ are then aggregated by concatenation, as shown in Eq. (15). For a word ${w_{i}}$ in the sequence, the features obtained by BiLSTM and GCN are designated as ${{h}_{i}^{B}}$ and ${{h}_{i}^{G}}$ , respectively, and ${{h}_{i}}$ represents the aggregated feature. The final feature of the entire document ${i}$ is denoted as ${{H}_{i}}$ .

$\displaystyle{{h}_{i}}=\textit{Concat}({h}_{i}^{B},{h}_{i}^{G})$ (15)

After feature aggregation, a Softmax function is employed to predict the document’s label. As shown in Eq. (16), ${W}$ represents the matrix that maps the vector into the output space.

$\displaystyle{{y}_{i}}=\textit{Softmax}({W}{H}_{i}+{b})$ (16)

Moreover, to train the model, the cross-entropy loss between the actual label ${{T}_{\textit{actual}}}$ and the predicted label ${{T}_{\textit{predict}}}$ is used. Training minimizes this loss in order to improve the network’s performance. The cross-entropy loss is defined as follows:

$\displaystyle\textit{Loss}=-T_{\textit{actual}}\log T_{\textit{predict}}$ (17)
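The prediction and loss of Eqs (16) and (17) can be illustrated with a short sketch. The assumptions are stated plainly: the document feature ${H}_{i}$ is taken as a given vector (how it is pooled from the word features is not shown here), the loss is summed over the label dimensions, a small `eps` is added inside the logarithm for numerical safety, and all names and numeric values are illustrative.

```python
import math


def softmax(logits):
    """Softmax shifted by the maximum logit for numerical stability."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(w, h_doc, b):
    """Eq. (16): y = Softmax(W H + b)."""
    logits = [sum(wi * hi for wi, hi in zip(row, h_doc)) + bi
              for row, bi in zip(w, b)]
    return softmax(logits)


def cross_entropy(t_actual, t_predict, eps=1e-12):
    """Eq. (17), summed over the label dimensions (an assumption here)."""
    return -sum(t * math.log(p + eps) for t, p in zip(t_actual, t_predict))


# Three labels, 2-dimensional document feature; illustrative values only.
W = [[0.2, -0.1],
     [0.0, 0.3],
     [-0.4, 0.1]]
b = [0.0, 0.1, -0.1]
H_doc = [1.0, 2.0]

y = predict(W, H_doc, b)
loss = cross_entropy([0.0, 1.0, 0.0], y)
```

The loss shrinks toward zero as the probability assigned to the actual label approaches one, which is the quantity that training drives down.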

4. Experiments

This section contains a detailed description of the experiment datasets, evaluation indicators, comparative experimental results, and ablation experimental results. Firstly, comparative experiments are conducted on two datasets with SGCN-MHA and several benchmark models, and the text classification results are compared. Afterwards, ablation experiments on SGCN-MHA are performed to explore whether the individual parts of the model improve its effectiveness.

4.1 Datasets

Two publicly accessible datasets are chosen to test the SGCN-MHA in order to validate its effectiveness. The specific information on these two datasets is as follows:

RCV1-V2 [43]: RCV1-V2 (Reuters Corpus Volume I) is a Reuters dataset provided by Lewis et al. This dataset has more than 800,000 Reuters English news texts and the corresponding news categories. The dataset contains 103 news topics, and each text may have multiple topics. The task in the experiments is to predict the topics from the news content.

AAPD [44]: AAPD (Arxiv Academic Paper Dataset) is a dataset of Arxiv academic papers provided by Yang et al., which collects 55,840 papers in the field of computer science, including abstracts and corresponding keywords. Each paper abstract corresponds to multiple keywords. The dataset contains 54 keywords in total. The task in the experiments is to predict the keywords of a paper from its abstract.

The original text needs to be pre-processed because it contains a lot of noise, such as punctuation and stop words. Stop words are words such as pronouns, prepositions, and adverbs, which carry little meaning of their own and contribute little to classification. Removing them increases the density of keywords so that the text features can be extracted more accurately. The words are then standardized: all words are converted to lowercase, misspellings are corrected, and each word is reduced to its standard lexical form. Each dataset in this experiment is divided into a training set, a test set, and a validation set. Table 2 displays the datasets’ statistical information. For the RCV1-V2 dataset, a total of 804,414 individual graphs are constructed. For the AAPD dataset, a total of 55,840 individual graphs are constructed.
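The pre-processing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ pipeline: the stop-word list `STOP_WORDS` is a tiny hypothetical sample, and spelling correction and lemmatization are omitted because the text does not say which tools were used.

```python
import re

# Tiny illustrative stop-word list (hypothetical); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "it", "this"}

def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop words; returns a token list."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess("The Graph Convolutional Network is applied to the text, word by word!")
```

The resulting token list is what would serve as the word nodes when a graph is built for the document.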

Table 2
Statistics of datasets

Dataset	Total	Labels	Average length of text	Average number of labels
RCV1-V2	804414	103	123.94	3.24
AAPD	55840	54	163.42	2.41

4.2 Evaluation indicators

This research uses four evaluation indicators to evaluate the performance of SGCN-MHA: precision, recall, f1-score, and hamming loss. Precision is the proportion of the model’s positive predictions that are correct. Recall is the proportion of actual positives that the model predicts correctly. The f1-score is the harmonic mean of precision and recall and summarizes how well the classification model performs overall. Equations (18)–(20) give the formulas for precision, recall, and f1-score.

In these equations, ${TP}$ denotes positive samples predicted as the positive class by the model, ${TN}$ negative samples predicted as the negative class, ${FP}$ negative samples predicted as the positive class, and ${FN}$ positive samples predicted as the negative class. When evaluating multiple samples, we calculate the ${TP_{i}}$ , ${TN_{i}}$ , ${FP_{i}}$ and ${FN_{i}}$ of each sample ${i}$ and sum them to obtain ${TP_{\textit{total}}}$ , ${TN_{\textit{total}}}$ , ${FP_{\textit{total}}}$ , and ${FN_{\textit{total}}}$ , as in Eqs (21)–(24), where ${N}$ is the total number of samples to be evaluated.

$\displaystyle\textit{Precision}=\frac{TP}{TP+FP}$ (18)

$\displaystyle\textit{Recall}=\frac{TP}{TP+FN}$ (19)

$\displaystyle F1=2\times\left(\frac{\textit{Precision}\times\textit{Recall}}{\textit{Precision}+\textit{Recall}}\right)$ (20)

$\displaystyle{TP_{\textit{total}}}=TP_{1}+TP_{2}+\ldots+TP_{N}$ (21)

$\displaystyle{TN_{\textit{total}}}=TN_{1}+TN_{2}+\ldots+TN_{N}$ (22)

$\displaystyle{FP_{\textit{total}}}=FP_{1}+FP_{2}+\ldots+FP_{N}$ (23)

$\displaystyle{FN_{\textit{total}}}=FN_{1}+FN_{2}+\ldots+FN_{N}$ (24)
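A minimal sketch of Eqs (18)–(24) in plain Python follows. It assumes that labels are encoded as 0/1 vectors and that precision, recall, and f1-score are computed from the summed totals (micro-averaging); the function names and the toy data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for one sample given 0/1 label vectors."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def micro_scores(Y_true, Y_pred):
    """Sum per-sample counts (Eqs 21-24), then apply Eqs (18)-(20)."""
    totals = [0, 0, 0, 0]
    for y_true, y_pred in zip(Y_true, Y_pred):
        for k, count in enumerate(confusion_counts(y_true, y_pred)):
            totals[k] += count
    tp, tn, fp, fn = totals
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy data: 3 samples, 4 labels each.
Y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
Y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
precision, recall, f1 = micro_scores(Y_true, Y_pred)
```

In this toy case the totals are TP = 4, FP = 1, and FN = 1, so precision, recall, and f1-score all equal 0.8.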

The hamming loss measures the misclassification of samples on individual labels, such as relevant labels missing from the predicted label set or irrelevant labels appearing in it. The smaller the value of this metric, the better the model performance. If ${n_{\textit{labels}}}$ is the number of labels, ${n_{\textit{samples}}}$ is the number of samples, ${\hat{z}_{i,j}}$ is the predicted value for the $j$-th label of a particular sample ${i}$ , and ${z_{i,j}}$ is the corresponding true value, then the hamming loss is defined as:

$\displaystyle\textit{Hamming Loss}(z,\hat{z})=\frac{1}{n_{\textit{samples}}\times n_{\textit{labels}}}\sum_{i=0}^{n_{\textit{samples}}-1}\sum_{j=0}^{n_{\textit{labels}}-1}1(\hat{z}_{i,j}\neq z_{i,j})$ (25)
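Eq. (25) can be sketched directly in plain Python. The example assumes 0/1 label matrices with one row per sample; the toy data are hypothetical.

```python
def hamming_loss(Z_true, Z_pred):
    """Fraction of (sample, label) entries where prediction and truth differ, as in Eq. (25)."""
    n_samples = len(Z_true)
    n_labels = len(Z_true[0])
    mismatches = sum(
        1
        for z_row, z_hat_row in zip(Z_true, Z_pred)
        for z, z_hat in zip(z_row, z_hat_row)
        if z != z_hat
    )
    return mismatches / (n_samples * n_labels)

# Hypothetical toy data: 3 samples, 4 labels each, with 2 mismatching entries.
Z_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
Z_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
hl = hamming_loss(Z_true, Z_pred)
```

With 2 mismatches out of 12 entries, the hamming loss of this toy example is 2/12. A missed relevant label and a spurious irrelevant label are penalized equally.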

4.3 Comparative experiments

Four methods are used as baselines: LEAM, LSAN, SGM, and Seq2Set. Their results are then compared with those of our model, SGCN-MHA.

LEAM [45] embeds labels to produce a more discriminating representation of the content when classifying text.

LSAN [46] incorporates label information and uses an attention mechanism to capture semantic information between documents and different labels.

SGM [44] is a sequence generation model that employs an LSTM-based Seq2Seq model combined with an attention mechanism, while the decoding phase uses global embedding to obtain inter-tag dependencies.

Seq2Set [47] adds a set decoder to SGM to take advantage of the unordered nature of the set to reduce the impact of incorrect label ordering. Not only does it capture the relationship between labels, but it also reduces the dependence on label sequences.

On the RCV1-V2 and AAPD datasets, SGCN-MHA is compared with the baseline models on the text classification task. Tables 3 and 4 report each model’s precision, recall, f1-score, and hamming loss (HL). In the table headers, “ $+$ ” marks a metric for which a higher value is better, and “ $-$ ” marks a metric for which a lower value is better.

Table 3 shows that Seq2Set, which takes label correlation modeling into account, achieves the best recall, f1-score, and hamming loss among the benchmark methods on the RCV1-V2 dataset, while LSAN achieves the best precision. SGCN-MHA improves on Seq2Set in precision (0.906 vs. 0.900), recall (0.862 vs. 0.858), and f1-score (0.883 vs. 0.879), and it obtains the highest recall and f1-score of all models. Its precision remains below that of LSAN (0.913), and its hamming loss (0.0076) is slightly higher than that of Seq2Set (0.0073) and LSAN (0.0075). A possible reason for the gap in f1-score is that Seq2Set’s sequence-to-set architecture may not fully capture the complex features and sequential information available in text data. In contrast, SGCN-MHA combines semantic details with context-related information, which makes it better suited to capturing text associations and structure.

Table 4 shows that the label embedding method LSAN is the strongest baseline on the AAPD dataset in terms of precision, f1-score, and hamming loss. SGCN-MHA obtains the highest precision (0.779) and f1-score (0.709) of all models, while its recall (0.650) is lower than that of Seq2Set (0.674) and SGM (0.659) and its hamming loss (0.0244) is slightly higher than that of LSAN (0.0242). LSAN concentrates on label-specific document representations built with semantics and self-attention, and it may be limited in capturing complex semantic links. SGCN-MHA handles such relationships through the combination of structural features, sequential features, and attention, which may explain its higher f1-score.

Table 3
Experimental results of comparative experiments on RCV1-V2 dataset

Model	Precision ( $+$ )	Recall ( $+$ )	F1 ( $+$ )	HL ( $-$ )
LEAM	0.871	0.841	0.856	0.0090
LSAN	0.913	0.841	0.875	0.0075
SGM	0.887	0.850	0.869	0.0081
Seq2Set	0.900	0.858	0.879	0.0073
SGCN-MHA	0.906	0.862	0.883	0.0076

Table 4
Experimental results of comparative experiments on AAPD dataset

Model	Precision ( $+$ )	Recall ( $+$ )	F1 ( $+$ )	HL ( $-$ )
LEAM	0.765	0.596	0.670	0.0261
LSAN	0.777	0.646	0.706	0.0242
SGM	0.746	0.659	0.699	0.0251
Seq2Set	0.739	0.674	0.705	0.0247
SGCN-MHA	0.779	0.650	0.709	0.0244
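As a consistency check on Tables 3 and 4, the reported F1 column can be recomputed from the reported precision and recall with Eq. (20). The sketch below copies the values from the two tables; the helper names are hypothetical. Because the table entries are themselves rounded to three decimals, small gaps are expected.

```python
def f1_score(precision, recall):
    """Eq. (20): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from Table 3 (RCV1-V2) and Table 4 (AAPD).
rcv1_v2 = {
    "LEAM": (0.871, 0.841, 0.856),
    "LSAN": (0.913, 0.841, 0.875),
    "SGM": (0.887, 0.850, 0.869),
    "Seq2Set": (0.900, 0.858, 0.879),
    "SGCN-MHA": (0.906, 0.862, 0.883),
}
aapd = {
    "LEAM": (0.765, 0.596, 0.670),
    "LSAN": (0.777, 0.646, 0.706),
    "SGM": (0.746, 0.659, 0.699),
    "Seq2Set": (0.739, 0.674, 0.705),
    "SGCN-MHA": (0.779, 0.650, 0.709),
}

def max_f1_gap(table):
    """Largest absolute gap between recomputed and reported F1 in a table."""
    return max(abs(f1_score(p, r) - reported) for p, r, reported in table.values())

gap_rcv1 = max_f1_gap(rcv1_v2)
gap_aapd = max_f1_gap(aapd)
```

For both tables the largest gap stays at about 0.001 or less, which is compatible with rounding of the inputs.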

4.4 Ablation experiments

In order to verify whether GCN, BiLSTM, and the Multi-Head Attention mechanism contribute to the text classification task, three models are designed and compared with the original model on the RCV1-V2 and AAPD datasets, respectively.

GCN is introduced to extract the text’s structural features. To examine whether GCN improves the classification performance of SGCN-MHA, the original model is compared with the no-GCN model. Tables 5 and 6 show that performance degrades on every metric when the GCN is removed; for example, the f1-score falls from 0.883 to 0.874 on RCV1-V2 and from 0.709 to 0.695 on AAPD. This suggests that the GCN plays an important role because it captures the relations between the words of a text. By supplying this structural context to the word nodes, it improves the model’s representation of the document and therefore its classification decisions. Without it, the model loses this information and its classification ability declines.

BiLSTM is used to preserve the text’s sequence information. The original model is compared with the no-BiLSTM model to examine whether BiLSTM improves the classification performance. Tables 5 and 6 show that the no-BiLSTM model is less effective than the original model on both datasets, with the f1-score falling from 0.883 to 0.868 on RCV1-V2 and from 0.709 to 0.698 on AAPD. Because BiLSTM captures sequential features, it lets the model pick up fine-grained word-order details, which matters for tasks that depend on sequential information. These features also help the model to separate significant information from the rest of the input and so to make better classification decisions.

To determine whether the Multi-Head Attention mechanism improves the model’s classification performance, the original model is compared with a variant from which the mechanism is removed, referred to as no-MHA. Tables 5 and 6 show that the original model performs better than no-MHA, with an f1-score of 0.883 against 0.871 on RCV1-V2 and 0.709 against 0.699 on AAPD. The mechanism contributes in several ways: it enables dynamic interaction between the structural and sequential features, builds a fuller picture of the contextual details, and makes it easier to extract abstract features that capture complex patterns, which leads to more informative data representations.

Table 5
Experimental results of ablation experiments on RCV1-V2 dataset

Model	Precision ( $+$ )	Recall ( $+$ )	F1 ( $+$ )	HL ( $-$ )
SGCN-MHA	0.906	0.862	0.883	0.0076
no-GCN	0.894	0.854	0.874	0.0089
no-BiLSTM	0.888	0.849	0.868	0.0091
no-MHA	0.892	0.851	0.871	0.0082

Table 6
Experimental results of ablation experiments on AAPD dataset

Model	Precision ( $+$ )	Recall ( $+$ )	F1 ( $+$ )	HL ( $-$ )
SGCN-MHA	0.779	0.650	0.709	0.0244
no-GCN	0.767	0.635	0.695	0.0251
no-BiLSTM	0.770	0.639	0.698	0.0249
no-MHA	0.769	0.641	0.699	0.0254
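To make the ablation results easier to compare, the sketch below computes the drop in F1 for each ablated variant relative to the full model, using the F1 values copied from Tables 5 and 6. The dictionary layout and function name are hypothetical.

```python
# F1 values copied from Table 5 (RCV1-V2) and Table 6 (AAPD).
full_model_f1 = {"RCV1-V2": 0.883, "AAPD": 0.709}
ablated_f1 = {
    "no-GCN": {"RCV1-V2": 0.874, "AAPD": 0.695},
    "no-BiLSTM": {"RCV1-V2": 0.868, "AAPD": 0.698},
    "no-MHA": {"RCV1-V2": 0.871, "AAPD": 0.699},
}

def f1_drop(variant, dataset):
    """F1 of the full model minus F1 of the ablated variant, rounded to 3 decimals."""
    return round(full_model_f1[dataset] - ablated_f1[variant][dataset], 3)

drops = {
    variant: {dataset: f1_drop(variant, dataset) for dataset in full_model_f1}
    for variant in ablated_f1
}

# Variant whose removal costs the most F1 on each dataset.
largest_drop = {
    dataset: max(ablated_f1, key=lambda v: drops[v][dataset])
    for dataset in full_model_f1
}
```

On these reported numbers, removing BiLSTM costs the most F1 on RCV1-V2 (0.015) and removing GCN costs the most on AAPD (0.014). All three variants lose between 0.009 and 0.015 F1.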

5. Conclusion

In this work, a text classification model named SGCN-MHA, based on BiLSTM, GCN, and the Multi-Head Attention mechanism, is proposed to make fuller use of textual features. Each document is represented as an individual graph, and GCN extracts the structural features of the text. BiLSTM is introduced to extract sequential features in both the forward and backward directions in order to preserve the word order information in each document. To extract the important information from the text and avoid treating these features separately, the Multi-Head Attention mechanism is applied on each graph so that the structural and sequential features interact and are aggregated. The model is compared with other benchmark methods on two standard datasets, and the experimental results support its effectiveness. Ablation experiments are also performed and indicate that each component of the proposed model contributes to its performance.

References

1. Kalchbrenner, Grefenstette and Blunsom, A Convolutional Neural Network for Modelling Sentences, CoRR abs/1404.2188, 2014. http://arxiv.org/abs/1404.2188.

2. Liu, Qiu, Chen and Huang, Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents, 2015, pp. 2326–2335. doi: 10.18653/v1/D15-1280.

3. Maas A.L., Daly R.E., Pham P.T., Huang, Ng A.Y. and Potts, Learning Word Vectors for Sentiment Analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp. 142–150. https://aclanthology.org/P11-1015.

4. Tai K.S., Socher and Manning C.D., Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1556–1566. doi: 10.3115/v1/P15-1150. https://aclanthology.org/P15-1150.

5. Katakis I.M., Tsoumakas and Vlahavas I.P., Multilabel Text Classification for Automated Tag Suggestion, 2008.

6. Peng, You, Wang, Zhai, Mamitsuka and Zhu, DeepMeSH: Deep semantic representation for improving large-scale MeSH indexing, Bioinformatics 32 (2016), i70–i79. doi: 10.1093/bioinformatics/btw294.

7. Ding, Bhanushali and Liu, Deep Anomaly Detection on Attributed Networks, in: Proceedings of the 2019 SIAM International Conference on Data Mining (SDM), pp. 594–602. doi: 10.1137/1.9781611975673.67.

8. Wang and Manning, Baselines and Bigrams: Simple, Good Sentiment and Topic Classification, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Jeju Island, Korea, 2012, pp. 90–94. https://aclanthology.org/P12-2018.

9. Boreham and Niblett, Classification of legal texts by computer, Inf. Process. Manag 12 (1976), 125–132.

10. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, in: ICML, 1997.

11. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, in: ECML, 1998.

12. Tan, An effective refinement strategy for KNN text classifier, Expert Syst. Appl 30 (2006), 290–298.

13. LeCun, Boser, Denker J.S., Henderson, Howard R.E., Hubbard and Jackel L.D., Backpropagation applied to handwritten zip code recognition, Neural Computation 1(4) (1989), 541–551. doi: 10.1162/neco.1989.1.4.541.

14. Elman J.L., Finding structure in time, Cogn. Sci 14 (1990), 179–211.

15. Cai, Zheng V.W. and Chang K.C.-C., A comprehensive survey of graph embedding: Problems, techniques, and applications, IEEE Transactions on Knowledge and Data Engineering 30 (2018), 1616–1637.

16. Kipf and Welling, Semi-Supervised Classification with Graph Convolutional Networks, ArXiv abs/1609.02907, 2017.

17. Yao, Mao and Luo, Graph Convolutional Networks for Text Classification, in: AAAI, 2019.

18. Lodhi, Saunders, Shawe-Taylor, Cristianini and Watkins, Text classification using string kernels, J. Mach. Learn. Res 2 (2000), 419–444.

19. Androutsopoulos, Koutsias, Chandrinos K.V., Paliouras and Spyropoulos C.D., An evaluation of Naive Bayesian anti-spam filtering, ArXiv cs.CL/0006013, 2000.

20. Forman, BNS feature scaling: an improved representation over tf-idf for svm text classification, in: CIKM ’08, 2008.

21. Cavnar W.B. and Trenkle J.M., N-gram-based text categorization, 1994.

22. Tang, Meng, Nguyen, Mei and Zhang, Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis, in: ICML, 2014.

23. Kim, Convolutional Neural Networks for Sentence Classification, in: Conference on Empirical Methods in Natural Language Processing, 2014.

24. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9 (1997), 1735–1780.

25. Tarlow, Brockschmidt and Zemel R.S., Gated Graph Sequence Neural Networks, CoRR abs/1511.05493, 2016.

26. Deng, Cheng and Wang, Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification, Comput. Speech Lang 68 (2021), 101182.

27. Jinbao, Weiwei, Yidan, Qiaoxin, Chenyuan and Long, Text Classification Method Based on BiGRU-Attention and CNN Hybrid Model, in: 2021 4th International Conference on Artificial Intelligence and Pattern Recognition, 2021.

28. Devlin, Chang M.-W., Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: NAACL, 2019.

29. Prabhu, Mohamed and Misra, Multi-class Text Classification using BERT-based Active Learning, ArXiv abs/2104.14289, 2021.

30. Liu, Pang, Zhou and Yue, Research on multi-label text classification method based on tALBERT-CNN, International Journal of Computational Intelligence Systems 14 (2021).

31. Bruna, Zaremba, Szlam A.D. and LeCun, Spectral Networks and Locally Connected Networks on Graphs, CoRR abs/1312.6203, 2013.

32. Zhang, de Souza A.H., Fifty and Weinberger K.Q., Simplifying Graph Convolutional Networks, ArXiv abs/1902.07153, 2019.

33. Zhu and Koniusz, Simple Spectral Graph Convolution, in: ICLR, 2021.

34. Liu, You, Zhang and Lv, Tensor Graph Convolutional Networks for Text Classification, ArXiv abs/2001.05313, 2020.

35. Peng and Wang, TextGTL: Graph-based Transductive Learning for Semi-supervised Text Classification via Structure-Sensitive Interpolation, in: IJCAI, 2021.

36. Huang, Zhang and Wang, Text Level Graph Neural Network for Text Classification, in: EMNLP, 2019.

37. Zhang, Cui, Wen and Wang, Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks, ArXiv abs/2004.13826, 2020.

38. Zhang and Zhang, Text Graph Transformer for Document Classification, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 8322–8327. doi: 10.18653/v1/2020.emnlp-main.668. https://aclanthology.org/2020.emnlp-main.668.

39. Jeong, Jang, Shin, Park E.L. and Choi, A context-aware citation recommendation model with BERT and graph convolutional networks, Scientometrics, 2020, 1–16.

40. Lu, Du and Nie, VGCN-BERT: Augmenting BERT with graph embedding for text classification, Advances in Information Retrieval 12035 (2020), 369–382.

41. Lin, Meng, Sun, Han, Kuang and Wu, BertGCN: Transductive Text Classification by Combining GNN and BERT, ArXiv abs/2105.05727, 2021.

42. She, Chen and Chen, Joint learning with BERT-GCN and multi-attention for event text classification and event assignment, IEEE Access PP (2022), 1–1.

43. Lewis D.D., Yang, Rose T.G. and Li, RCV1: A new benchmark collection for text categorization research, J. Mach. Learn. Res 5 (2004), 361–397.

44. Yang, Sun and Wang, SGM: Sequence Generation Model for Multi-label Classification, in: COLING, 2018.

45. Wang, Zhang, Shen, Zhang, Henao and Carin, Joint Embedding of Words and Labels for Text Classification, in: ACL, 2018.

46. Xiao, Huang, Chen and Jing, Label-Specific Document Representation for Multi-Label Text Classification, in: Conference on Empirical Methods in Natural Language Processing, 2019.

47. Yang, Luo, Lin and Sun, A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification, in: ACL, 2019.

48. Vaswani, Shazeer N.M., Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention is All you Need, in: NIPS, 2017.

49. Guo, Zhang, Liu and Ma, Improving text classification with weighted word embeddings via a multi-channel TextCNN model, Neurocomputing 363 (2019), 366–374.

50. Bahdanau, Cho and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, CoRR abs/1409.0473, 2015.

51. Boutell M.R., Luo, Shen and Brown C.M., Learning multi-label scene classification, Pattern Recognit 37 (2004), 1757–1771.

52. Read, Pfahringer, Holmes and Frank, Classifier chains for multi-label classification, Machine Learning 85 (2011), 333–359.

53. Wang, Yang, Mao, Huang and Xu, CNN-RNN: A Unified Framework for Multi-label Image Classification, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2285–2294.

54. Gonçalves and Quaresma, A Preliminary Approach to the Multilabel Classification Problem of Portuguese Juridical Documents, in: EPIA, 2003.
