Abstract
Text classification is a fundamental task in Natural Language Processing (NLP). However, because semantic information is complex, extracting useful features becomes a critical issue. Unlike traditional methods, we propose a new model based on two parallel RNNs, which capture context information through an LSTM and a GRU simultaneously. Motivated by the siamese network, the proposed architecture generates an attention matrix by calculating the similarity between the two parallel context representations, which ensures the effectiveness of the extracted features and further improves classification results. We evaluate the proposed model on six text classification tasks. The experimental results show that the proposed ABLGCNN model converges faster and achieves higher accuracy than the other models.
Introduction
Following the trend of machine learning, deep learning has made great progress in natural language processing (NLP) tasks such as answer selection [26], document summarization [17], and text classification [30, 31]. The most popular deep learning architectures are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs perform well at feature extraction through convolution kernels, while RNNs are often used to capture flexible context information.
Within NLP tasks, the complexity of contextual information is a critical issue, so capturing useful semantic information has attracted many researchers. Exploiting the different characteristics of CNNs and RNNs, some researchers have combined the two classical models. For example, building on TextCNN [31], Lai et al. proposed the TextRCNN [24] network, which connects a Bi-LSTM to a CNN pooling layer. This design not only preserves the advantage of selecting discriminative features through the max-pooling layer, but also expands the range of captured contextual information through the recurrent structure.
Another popular breakthrough in feature extraction is the attention mechanism, first applied in NLP by Bahdanau et al. [7]. It achieved excellent performance in machine translation [8, 16] by measuring the contribution of different parts to the whole, which accords with human linguistic cognition. One representative work is ABCNN [27], which first incorporated attention into CNNs. To address the interdependence between two different sentences, ABCNN was built on two weight-sharing CNNs that model sentence pairs. Compared with the Bi-CNN described in [27], the results show that computing attention weights on the convolution output achieves considerable improvement in sentence-pair modeling.
It is worth mentioning that the basic Bi-CNN model in ABCNN was strongly motivated by the siamese architecture [32], which also inspires our work. The siamese architecture was first proposed for calculating the similarity between two patterns in order to verify signatures. Owing to its excellent performance in calculating similarities, this structure has been widely applied in computer vision, for example in object tracking [33], human identification [34], and face recognition [35]. In NLP, siamese networks have achieved satisfying performance in calculating semantic textual similarity [36, 37]. However, to the best of our knowledge, the siamese architecture has not been widely studied for text classification, which motivates this work.
Model
As shown in Fig. 1, the overall model consists of five parts: Input Layer, LSTM and GRU Layer, Attention Layer, Convolution and Attention-based Pooling Layer, and Output Layer. The details of each component are described in the following sections.

Fig. 1. ABLGCNN architecture.
In the example shown in Fig. 1, the input sentence consists of 5 words, so the sentence length is s = 5. Each word is represented as a d-dimensional vector pre-trained with Word2vec [25] or GloVe [19] embeddings, where d = 300. The sentence can therefore be represented as a feature map of dimension d × s.
LSTM and GRU layers
An RNN is a network computed over time: it performs the same calculation at each time step and relies on the result of the previous time step. Through its hidden layer, an RNN takes the order of words into account, which makes it well suited to NLP tasks. Long short-term memory (LSTM), a variant of the RNN first proposed by Hochreiter and Schmidhuber [22], mainly overcomes the vanishing gradient problem through its gate structure. To simplify the LSTM without degrading its performance, Cho et al. proposed the gated recurrent unit (GRU) [13], which adaptively captures dependencies at different time scales using recurrent blocks. Given a sequence S = {s_0, s_1, ..., s_l}, both the LSTM and the GRU need the result for the previous word vector to compute the current one.
Figure 2(a) shows the structural block of the LSTM model. It mainly consists of three gates and a storage unit. The three gates are the input gate, the forget gate and the output gate, which control the flow of data for input, storage and output respectively. The activation functions f and g in Fig. 2(a) represent the tanh and sigmoid functions respectively. In the LSTM model, given the input x_t at time t, we first consider the forget gate f_t, which controls how much of the storage unit from the previous moment is transmitted to the current moment.

Fig. 2. Illustration of the structures of LSTM and GRU.
In Equation (1), ⊕ denotes matrix concatenation, W_f and b_f are the forget gate weight matrix and bias respectively, h_{t-1} is the output of the previous state, and σ is the sigmoid function.
The next step is to consider the input gate, which is primarily responsible for the input of the sequence at the current state.
In Equations (2), (3) and (4), i_t is the first part of the calculation, which mainly determines how much information from the current input will be saved in the storage unit.
Finally, the output gate gets the corresponding output and passes it to the next moment.
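As an illustration, the three gates described above can be sketched in plain Python. The sketch assumes the standard LSTM formulation, in which each gate applies its own weight matrix and bias to the concatenation h_{t-1} ⊕ x_t; the names `lstm_step` and `params`, the weight layout and the toy dimensions are hypothetical and are not taken from the original implementation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W·v + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step (standard formulation, assumed here).

    params maps each of 'f', 'i', 'c', 'o' to a (W, b) pair that acts
    on the concatenation h_prev ⊕ x_t.
    """
    z = h_prev + x_t  # list concatenation plays the role of ⊕
    f_t = [sigmoid(v) for v in affine(*params["f"], z)]      # forget gate
    i_t = [sigmoid(v) for v in affine(*params["i"], z)]      # input gate
    c_hat = [math.tanh(v) for v in affine(*params["c"], z)]  # candidate storage
    # New storage unit: keep part of the old one, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(v) for v in affine(*params["o"], z)]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights set to zero every gate evaluates to 0.5, so the storage unit is simply halved at each step, which gives a quick sanity check of the data flow.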
The GRU model is shown in Fig. 2(b). The main difference from the LSTM model is that the GRU model does not have storage units. The calculations are as follows:
In Equations (7), (8), (9) and (10), ⊗ denotes element-wise multiplication of matrices. The reset gate r_t determines the number of units updated from the previous activation h_{t-1} of all units in the same layer, and the update gate z_t determines the number of units updated from its own activation. Finally, the GRU activation is generated from the previous activation and the current candidate activation.
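The GRU update can be sketched in the same way. This sketch assumes the common convention h_t = (1 - z_t) ⊗ h_{t-1} + z_t ⊗ h̃_t, where h̃_t is the candidate activation; some formulations swap the roles of z_t and 1 - z_t. The names `gru_step` and `params` are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W·v + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(x_t, h_prev, params):
    """One GRU time step (assumed convention, see the text above).

    params maps 'r', 'z' and 'h' to (W, b) pairs acting on a
    concatenation of a hidden-sized vector and x_t.
    """
    z_in = h_prev + x_t
    r_t = [sigmoid(v) for v in affine(*params["r"], z_in)]  # reset gate
    z_t = [sigmoid(v) for v in affine(*params["z"], z_in)]  # update gate
    # The reset gate scales the previous activation element-wise (⊗).
    gated = [r * h for r, h in zip(r_t, h_prev)]
    h_hat = [math.tanh(v) for v in affine(*params["h"], gated + x_t)]
    # Blend the previous activation with the candidate activation.
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_hat)]
```

Unlike the LSTM sketch, no separate storage unit is carried between steps; the hidden activation is the only state.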
As the main contribution of this paper, the d × s input feature map is fed to the LSTM and the GRU simultaneously to capture text information. As shown in Fig. 1, the outputs are denoted s_L and s_G respectively, where each column represents a neural unit. Note that we set the same number of neurons in the LSTM and the GRU so that their outputs have the same size, which makes the subsequent similarity calculation possible.
Adopting the idea from ABCNN-2 [27], we calculate the similarity between the outputs of the two parallel RNNs and then generate the attention matrix.
More formally, let
Note that Euclidean (·) is Euclidean distance, where
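A minimal sketch of the attention matrix follows. It assumes the match-score used in ABCNN [27], 1 / (1 + Euclidean(u, v)), applied to every pair of columns of s_L and s_G; the exact score used by the proposed model may differ, and the function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def column(M, j):
    """Column j of a matrix stored as a list of rows."""
    return [row[j] for row in M]

def attention_matrix(s_L, s_G):
    """A[i][j] = 1 / (1 + Euclidean(s_L[:, i], s_G[:, j])) (assumed score).

    s_L and s_G are matrices with the same number of rows; each column
    is one unit of the LSTM or GRU output.
    """
    n_L, n_G = len(s_L[0]), len(s_G[0])
    return [
        [1.0 / (1.0 + euclidean(column(s_L, i), column(s_G, j)))
         for j in range(n_G)]
        for i in range(n_L)
    ]
```

Identical columns score 1, and the score decays toward 0 as the two representations move apart, so every entry of the matrix lies in (0, 1].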
Since the LSTM and the GRU have captured text information, their feature vectors form a text matrix. To extract more meaningful information, the proposed model performs convolution and pooling operations. Following the attention layer, the convolution layer captures features through a convolution kernel with window size n × w. For example, a new feature c_m is generated from a window of words x_{m:m+w-1} as follows:
In Equation (12), b_c is a bias term and f(·) is a non-linear function such as tanh. The convolution kernel is applied to each possible word window in the sentence {x_{1:w}, x_{2:w+1}, ..., x_{s-w+1:s}} and generates a feature map as follows:
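The sliding-window convolution can be illustrated with a single kernel. The sketch assumes the usual form c_m = f(W · x_{m:m+w-1} + b_c), where the kernel W has one row per word in the window; the weight symbol W and the function name `conv_feature_map` are assumptions introduced for illustration.

```python
import math

def conv_feature_map(X, W, b_c, f=math.tanh):
    """Apply one convolution kernel to every word window of a sentence.

    X: list of s word vectors, each of length n.
    W: kernel with w rows of length n (one row per word in the window).
    Returns the feature map [c_1, ..., c_{s-w+1}].
    """
    w, s = len(W), len(X)
    feature_map = []
    for m in range(s - w + 1):
        window = X[m:m + w]
        # Sum of element-wise products between the kernel and the window.
        total = sum(W[k][d] * window[k][d]
                    for k in range(w) for d in range(len(W[k])))
        feature_map.append(f(total + b_c))
    return feature_map
```

A sentence of s words and a window of w words yields s - w + 1 features, one per window position.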
On top of the feature map, neural network models usually perform max pooling or average pooling. The proposed model applies an average pooling operation based on the attention mechanism.
In Equation (14),
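One possible reading of attention-based pooling is sketched below. It follows the ABCNN-2 scheme [27] as an assumption: each column of the convolution output receives a weight equal to the corresponding row sum of the attention matrix, and each pooling window outputs the weighted sum of its columns. Whether the proposed model normalizes these weights is not recoverable from the text, so the sketch leaves them unnormalized; all names are hypothetical.

```python
def attention_weights(A):
    """One weight per unit: the row-wise sums of the attention matrix."""
    return [sum(row) for row in A]

def attention_pool(F, a, w):
    """Attention-weighted pooling over windows of w consecutive columns.

    F: feature map stored as a list of columns (each a list of floats).
    a: one attention weight per column of F.
    Returns one pooled column per window position.
    """
    pooled = []
    for j in range(len(F) - w + 1):
        col = [0.0] * len(F[0])
        for k in range(j, j + w):
            for d, v in enumerate(F[k]):
                col[d] += a[k] * v  # weight each column by its attention
        pooled.append(col)
    return pooled
```

Columns that are similar to many columns of the parallel branch get large weights and therefore dominate the pooled output.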
As shown in Fig. 1, the fully connected layer follows the pooling layer. Its outputs are passed to a softmax classifier to predict the semantic relation label.
Finally, we minimize the categorical cross-entropy loss, calculated as follows:
In Equation (17), K is the number of target classes,
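The output layer and the loss can be sketched as follows, assuming the standard softmax and categorical cross-entropy definitions with one-hot targets, averaged over the mini-batch. The function names are hypothetical, and any regularization term is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over K class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, target_index):
    """Cross-entropy for a one-hot target: -log of the true class probability."""
    return -math.log(probs[target_index])

def batch_loss(batch_logits, targets):
    """Mean categorical cross-entropy over a mini-batch."""
    losses = [categorical_cross_entropy(softmax(logits), t)
              for logits, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)
```

With a one-hot target only the predicted probability of the true class contributes, so the sum over the K classes collapses to a single logarithm.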
We compare our model with others on various datasets. The summary statistics of datasets are shown in Table 1.
Table 1. Summary statistics for the datasets
C: number of target classes. L: average sentence length. M: maximum sentence length. Train/Test: train/test dataset size. |V|: vocabulary size. |V_pre|: number of words present in the pre-trained word embeddings.
Pre-training word embeddings on an unlabeled corpus enables better generalization from the training datasets [12]. To obtain more practical results, our experiments use the GloVe embeddings trained by [19] and the Word2Vec embeddings trained on Google News. Words not present in the pre-trained vocabulary are initialized by random sampling from a uniform distribution over [-0.1, 0.1]. Each word vector has 300 dimensions, and the vectors were trained with the continuous bag-of-words architecture [25].
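The embedding lookup with random initialization of unseen words can be sketched as follows. The function names, the dictionary-based storage and the fixed seed are assumptions made for illustration; only the uniform range [-0.1, 0.1] and the dimensionality of 300 come from the text.

```python
import random

def build_embedding_table(vocab, pretrained, dim=300, seed=0):
    """Map every vocabulary word to a vector of length dim.

    Words found in `pretrained` keep their pre-trained vector; all other
    words are sampled uniformly from [-0.1, 0.1].
    """
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

def sentence_matrix(words, table):
    """Represent a sentence as s word vectors of dimension d."""
    return [table[w] for w in words]
```

Stacking the vectors of a sentence in order gives the d × s feature map used as the model input.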
Hyper-parameter settings
We randomly select 10% of the data as the test set. The evaluation metric for the five datasets is accuracy; MR and MPQA are additionally evaluated by their loss values. The final hyper-parameter settings are as follows.
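A random 10% hold-out split can be sketched as below. The seed and the function name are assumptions; the text only specifies the proportion.

```python
import random

def train_test_split(samples, test_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the samples as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test
```

Fixing the seed keeps the split reproducible, so that all compared models are evaluated on the same held-out samples.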
The dimension of the GloVe and Word2Vec embeddings is 300. The LSTM [22], GRU [13], BLSTM [1] and RCNN [24] each use 128 hidden units. In TextCNN [31] and ABLGCNN (ours), we use 128 convolutional filters, each with window size (3, sentence length). The mini-batch size is 50, the learning rate of Adam is 0.01, and the L2 penalty coefficient is 10^-3. For regularization, we apply dropout [9] with a rate of 0.5 to the LSTM, GRU and convolution layers.
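For reference, the settings above can be collected in a single configuration object. The key names are hypothetical, and the L2 coefficient is read as 10^-3, which is an interpretation of the printed value.

```python
# Hyper-parameters as stated in the text; key names are illustrative.
HYPERPARAMS = {
    "embedding_dim": 300,
    "rnn_hidden_units": 128,   # LSTM, GRU, BLSTM and RCNN
    "conv_filters": 128,       # TextCNN and ABLGCNN
    "conv_window_height": 3,
    "batch_size": 50,
    "optimizer": "adam",
    "learning_rate": 0.01,
    "l2_coefficient": 1e-3,    # assumed reading of the printed value
    "dropout_rate": 0.5,       # LSTM, GRU and convolution layers
}

def l2_penalty(weights, coefficient=HYPERPARAMS["l2_coefficient"]):
    """L2 regularization term added to the loss for a flat list of weights."""
    return coefficient * sum(w * w for w in weights)
```

Keeping the values in one place makes it easier to run every compared model under the same settings.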
Results
Comparative analysis of test set accuracy
The experimental results for the GloVe and Word2vec embeddings are shown in Table 2 and Table 3 respectively. The comparison shows that the classification accuracies obtained with the two word embedding methods do not differ much.
Table 2. Results of our ABLGCNN model against other models (word embeddings are GloVe)
Table 3. Results of our ABLGCNN model against other models (word embeddings are Word2vec)
Comparing the accuracy of LSTM, LSTM-Att, BLSTM and BLSTM-Att shows that the attention mechanism clearly improves the classification results, which is consistent with its good performance in feature extraction.
Comparing TextCNN and RCNN shows that the accuracy of TextCNN is relatively low because of the simplicity of the model and its lack of consideration of context, while RCNN effectively improves performance by adding a recurrent network.
Compared with the other models, the proposed ABLGCNN model achieves the best classification accuracy. ABCNN already achieves considerable results, and our model adds the parallel RNN networks on top of it. Exploiting the advantage of RNNs in capturing context information, the attention matrix is generated from the calculated similarity, which further ensures the importance of the extracted features and increases the classification accuracy.
To analyze the results from a different aspect, we also conduct experiments on training set convergence. For a clearer comparison, we retain four contrast models on the MR and MPQA datasets. The loss values of all models are smoothed to avoid problems caused by oscillating loss. The number of iterations is 10000. To present the convergence clearly, we record one point every hundred iterations.
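The post-processing of the loss curves can be sketched as follows. The text does not name the smoothing method, so a trailing moving average with an arbitrary window is assumed here purely for illustration; the function names are hypothetical.

```python
def moving_average(values, window=5):
    """Trailing moving average (assumed smoothing method).

    The first few outputs average over fewer than `window` values.
    """
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

def sample_every(values, step=100):
    """Keep one point every `step` iterations (iterations counted from 1)."""
    return values[step - 1::step]
```

Applied to 10000 iterations with a step of 100, the sampling leaves 100 points per curve.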
As shown in Fig. 3, ABLGCNN converges quickly with a notably steep slope, while ABCNN converges slightly more slowly. Consistent with its disappointing accuracy, the convergence speed of the attention-based Bi-LSTM is quite poor, which further suggests that it is not suitable for short text classification. TextCNN shows unstable convergence in the four experiments. Combined with its unstable classification accuracy, we consider that this simple model is prone to over-fitting. RCNN, in contrast, shows a relatively good convergence speed, which further supports the effectiveness of adding a recurrent network.

Fig. 3. Loss value of the training set of MR and MPQA.
From the above two perspectives, the ABLGCNN model not only achieves high accuracy on the test set, but also converges quickly on the training set.
Conclusion
In this paper, we propose a new model, ABLGCNN, for short text classification. The main contribution is applying LSTM and GRU networks in parallel to capture context information and calculating attention weights from the parallel results to extract more feature information. The experimental results show that ABLGCNN outperforms the other models, with higher classification accuracy and faster convergence.
Although the new model shows its potential for short text classification, its large number of parameters makes it unsuitable for long text classification. An interesting possible direction is to connect multiple network models in parallel with the attention mechanism.
Acknowledgment
This research was supported by the National Natural Science Foundation of China (No. 61871234, No. 61302155).
