Abstract
As an important branch of Natural Language Processing (NLP), text classification has long been limited by the difficulty of extracting useful textual information and effective long-range associations. Thanks to the efforts of deep learning researchers, deep Convolutional Neural Networks (CNNs) have made remarkable achievements in Computer Vision, but their value in NLP tasks remains controversial. In this paper, we propose a novel deep CNN named Deep Pyramid Temporal Convolutional Network (DPTCN) for short text classification, which mainly consists of a concatenated embedding layer, causal convolutions, 1/2 max-pooling down-sampling and residual blocks. Our work was inspired by two well-designed models: the temporal convolutional network for sequence modeling and the deep pyramid CNN for text categorization. Their applicability and pertinence showed us how to build a model for a specific domain. In the experiments, we evaluate the proposed model on 7 datasets against 6 other models and analyze the impact of three different embedding methods. The results show that our work is a promising attempt to apply a word-level deep convolutional network to short text classification.
Introduction
With the rapid development of deep learning, new models keep emerging. As one of the classic deep learning models, Convolutional Neural Networks (CNNs) are practical and important because of their good feature extraction performance and low computation cost. In 1998, the proposal of LeNet-5 [37] marked a new phase for CNNs. LeCun introduced convolutional networks that covered the key components of CNNs: local receptive fields, weight replication and sub-sampling. Since then, many researchers have experimented with and improved CNNs and their variants, which can be classified into 7 categories [3]: spatial exploitation, depth, multi-path, width, feature-map exploitation, channel boosting and attention.
Up to now, CNNs have made considerable achievements in Computer Vision (CV), Speech Recognition and Natural Language Processing (NLP), and perform especially well in image tasks. For example, AlexNet [4] first applied deep CNNs to image classification, using fewer connections and parameters than similarly sized feedforward neural networks; Simonyan and Zisserman [19] proposed VGG, a deeper convolutional network with small (3×3) convolution filters in all layers, and demonstrated the improvement this structure brings. However, blindly deepening a network brings new problems, such as overfitting and increased computation cost. In view of this, Szegedy et al. [9] put forward GoogLeNet and a new neural network architecture named Inception, which transforms the network into a sparse structure. Another common problem for deep neural networks is the difficulty of training. To address it, ResNet [16] added shortcuts between layers to construct skip connections; DenseNet [12] made direct connections from any layer to all subsequent layers, which strengthened feature propagation, enhanced the features extracted by convolution and further pushed the limits of image classification.
As shown above, deepening network architectures has achieved milestone breakthroughs in CV tasks. However, no single model or improvement method fits all requirements. Unlike images, the semantic information in text is more complex. Thus, it is doubtful whether deep CNNs can push the limits of text classification. VDCNN [1], the first well-designed character-level deep CNN for NLP tasks, successfully showed the benefit of depth. However, the experiments in [26] showed that a shallow word-level CNN outperforms VDCNN at a faster computation speed, while [13] pointed out that a word-level deep CNN based on DenseNet only increased complexity without improving text classification performance compared to a shallow one. Is there any effective improvement for deep CNNs in text classification? The answer is yes. Johnson and Zhang [27] proposed a low-complexity word-level Deep Pyramid CNN (DPCNN), whose pyramid shape comes from repeatedly down-sampling the data; Wang et al. [33] added a multi-scale feature attention mechanism to DenseNet, which also applied deep CNNs well to text classification. These dedicated, well-targeted improvements for text classification inspired our proposal.
Convolutional kernels of different sizes enable CNNs to extract multi-scale features from local information, but the lack of sequential information and global features has always been a major drawback. To compensate for this limitation, Bai et al. [30] proposed the Temporal Convolutional Network (TCN) for sequence modeling tasks, consisting of causal convolutions, residual connections and dilated convolutions. Since then, many researchers have applied TCN to other deep learning tasks, such as action segmentation [8], interest recommendation [21], demand prediction [18] and speech separation [39]. In the area of text classification, Han et al. [14] extended TCN to two directions, considering both the natural and reverse orders of sentences; Zuo et al. [38] added an attention mechanism to Bi-TCN to enrich the extracted contextual multi-scale semantic information; Cao et al. [10] built an LSTM-TCN hybrid model combined with an attention mechanism.
Attracted by the ability of causal convolution to preserve context relationships and by the dedicated design of the deep network for text classification in DPCNN, we aim to put forward a novel deep architecture for short text classification that combines both advantages. The main contributions of this paper are twofold. The first is the proposal of the Deep Pyramid Temporal Convolutional Network (DPTCN), consisting of an embedding layer, causal convolutions, 1/2 max pooling, residual blocks and a classifier. To the best of our knowledge, it is the first deep CNN model designed specifically for short text classification. The experimental results provide strong evidence against the stereotype that deep CNNs are unsuited to NLP tasks. The second is the optimization of the embedding method. When re-examining each module to ensure the best performance of our proposal, we improved the embedding layer into a concatenated one consisting of word2vec [34] and region embedding [28]. The experimental results confirm the improvement brought by this embedding layer.
Related works
Dilated causal convolution
Dilated causal convolution was first proposed in WaveNet [5]; TCN [30] then applied it to a wider scope, sequence modeling. The special characteristic of causal convolution is that it preserves the original order of the data when modeling, which ensures that the prediction $p(x_{t+1} \mid x_1, \ldots, x_t)$ at time step t does not depend on any data from the future time steps $x_{t+1}, x_{t+2}, \ldots, x_T$, as illustrated by the two bottom layers on the left side of Fig. 1. Because a model with causal convolutions has no recurrent connections, its training is typically faster than that of recurrent neural networks [5].

Fig. 1. Illustrations of dilated causal convolutions (left) and deep pyramid CNN (right).
However, the modeling range of causal convolution depends on the kernel size. Capturing longer dependencies requires a deeper network. WaveNet therefore added dilated convolution, which has been successfully applied in signal processing [22, 24], image segmentation [11, 20] and many other tasks. A dilated convolution skips input steps according to its dilation factor and thereby realizes a coarse-grained convolution efficiently, as illustrated by the three upper layers on the left side of Fig. 1. Stacked dilated convolutions give the network a wide receptive field with only a few layers.
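The behaviour described above can be illustrated with a minimal pure-Python sketch. It assumes a single input channel and zero padding on the left, and the helper names `causal_conv1d` and `receptive_field` are hypothetical rather than taken from WaveNet or TCN; the aim is to show the indexing pattern, not to reproduce either implementation.

```python
from typing import List


def causal_conv1d(x: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Causal 1-D convolution over a single channel.

    output[t] is built only from x[t], x[t - dilation], x[t - 2 * dilation], ...
    Positions before the start of the sequence count as zero (left padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t - j * dilation  # look back in time, never ahead
            if idx >= 0:
                acc += weight * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input steps visible to one output of a stack of causal layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With kernel size 2 and dilations 1, 2, 4 and 8, `receptive_field` returns 16, so four stacked layers already cover 16 input steps, whereas four undilated layers of the same kernel size would cover only 5.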
To extract global information in text classification, DPCNN [27] proposed a down-sampling method named “1/2 pooling”, which performs max-pooling with size 3 and stride 2 after the convolution layer. As the model deepens, “1/2 pooling” keeps the number of feature maps unchanged and therefore avoids an increase in complexity. As shown on the right side of Fig. 1, after L rounds of down-sampling, semantic connections among words within a distance of $2^L$ can be represented, which strengthens the relationships between distant words and lets the model extract more global information.
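The effect of “1/2 pooling” on sequence length can be sketched in a few lines of plain Python. The sketch assumes a single feature map and truncated windows at the end of the sequence (so the length becomes the ceiling of n/2); the function names are illustrative and not taken from DPCNN.

```python
from typing import List


def half_max_pool(x: List[float], size: int = 3, stride: int = 2) -> List[float]:
    """Max-pool with window `size` and step `stride`.

    Windows that run past the end of the sequence are truncated, so the
    output length is ceil(len(x) / stride).
    """
    return [max(x[start:start + size]) for start in range(0, len(x), stride)]


def pooled_lengths(n: int, rounds: int) -> List[int]:
    """Sequence length before pooling and after each of `rounds` poolings."""
    lengths = [n]
    for _ in range(rounds):
        n = (n + 1) // 2  # ceil(n / 2), matching half_max_pool
        lengths.append(n)
    return lengths
```

For a sentence of 64 positions, `pooled_lengths(64, 3)` gives 64, 32, 16 and 8. Each halving doubles the distance in the original sentence that one later convolution step spans, which is the source of the $2^L$ growth described above.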
1/2 pooling causal convolution (Our proposal)
Both the dilated convolution in TCN and the “1/2 pooling” in DPCNN aim to expand the receptive field and extract more accurate global semantic information through sampling, which inspired our design. Causal convolution preserves the sequential order of the original data and thus promotes the extraction of context relationships, while DPCNN effectively improves text classification performance while keeping time complexity largely independent of model depth. Combining the advantages of both, we propose “1/2 pooling causal convolution”, as shown in Fig. 2.

Fig. 2. The overall architecture of DPTCN.
The proposed model mainly consists of five parts, as shown in Fig. 2. The whole architecture is described in detail as follows.
Input and embedding layer
Assume a sentence consists of n words, $S = \{w_1, w_2, \ldots, w_n\}$. After pre-trained word2vec [34] embedding, each word $w_i$ is represented as a d-dimensional vector $x_i$. The whole sentence can be represented as:
$X_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$
where ⊕ is the concatenation operator.
It is worth mentioning that DPCNN used region embedding to convert word vectors, a word embedding method its authors proposed earlier in another paper [28]. The authors did not state the scope of application, while all their experiments were conducted on long text datasets. In our preliminary experiments, we found that DPCNN with region embedding did not perform well in short text classification. Therefore, we returned to the traditional word embedding method word2vec and connected the two embeddings serially. As the embedding layer in Fig. 2 shows, after word2vec embedding, the d-dimensional vectors are passed to the region embedding layer, which converts small regions of data into feature vectors for use in the upper layers. The serial connection of the embedding layers generates a higher-dimensional feature.
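One possible reading of this serial connection is sketched below in plain Python. It assumes that the region embedding is a learned linear map applied to the concatenation of k consecutive word2vec vectors, with zero padding at the end of the sentence and zero vectors for unknown words; the names `lookup`, `region_embedding`, `W` and `b`, and the region size, are illustrative assumptions rather than details given in the text.

```python
from typing import Dict, List

Vector = List[float]


def lookup(words: List[str], table: Dict[str, Vector], dim: int) -> List[Vector]:
    """Stage 1: map each word to its pre-trained vector.

    Unknown words get a zero vector here to keep the sketch deterministic.
    """
    return [table.get(word, [0.0] * dim) for word in words]


def region_embedding(vectors: List[Vector], W: List[Vector], b: Vector, k: int) -> List[Vector]:
    """Stage 2: embed each k-word region as W @ concat(region) + b.

    W has one row per output feature and k * dim columns.
    """
    dim = len(vectors[0])
    padded = vectors + [[0.0] * dim for _ in range(k - 1)]  # pad the sentence end
    out = []
    for i in range(len(vectors)):
        region = [value for vec in padded[i:i + k] for value in vec]
        out.append([sum(w * r for w, r in zip(row, region)) + bias
                    for row, bias in zip(W, b)])
    return out
```

Under these assumptions the word-level vectors are never discarded: they are the input from which every region feature is computed, which is what distinguishes the serial connection from using region embedding alone.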
The embedding layer is followed by a stack of convolution blocks, whose design was strongly inspired by TCN. Considering the influence of word order on sentence meaning, all convolutional layers in our model are causal convolutions, like the bottom two layers on the left side of Fig. 1. The output of a causal convolution at time t is computed only from elements at time t and earlier in the previous layer. Compared to traditional convolution, this method effectively preserves the sequential information of the text.
Suppose the word vector matrix involved in the convolution operation at time i is $X_i$, the filter at the j-th layer is $K_j$, the bias vector is $b_i$ and the non-linear activation function is $f(\cdot)$; then the output at time i can be calculated as follows:
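A form consistent with the symbols just defined is given here as an assumption: the output symbol $C_i$ is borrowed from the pooling expression $C^* = \max\{C_i\}$ used later in this section, and $*$ denotes the causal convolution, so the exact notation may differ from the authors' equation.

```latex
C_i = f\left(K_j * X_i + b_i\right)
```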
To avoid the vanishing gradient problem, each causal convolution is followed by normalization and a ReLU non-linear activation function, as detailed in Section 3.4.
As CNNs are weak at representing long-range associations and extracting global information, down-sampling methods are commonly combined with CNNs to remedy this defect in sequence modeling. In this paper, we perform max-pooling with size 3 and stride 2 after the causal convolutions, as shown on the right side of Fig. 2. The max pooling proposed in [13] operates as $C^* = \max\{C_i\}$. After m max-pooling operations, the extracted features can be represented as
Interestingly, the authors of TCN combined dilated convolutions with causal convolutions to expand the convolutional range and achieve a wide receptive field, which functions similarly to down-sampling. However, the classification accuracy of TCN on short text datasets is not good (see Section 4), which is why we adopt a different sampling method in our proposal.
Shortcut connections
Apart from the down-sampling method, the range of the features extracted after sampling also depends on the depth of the model. To enable the training of a deep model, we add shortcut connections with identity mapping, adopted from [8]:

Fig. 3. Illustration of the residual block.
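The two shortcut forms contrasted in this section can be sketched in plain Python. The sketch assumes, following the usual residual formulation, that the “linear” form adds the input to the transformed output with no activation afterwards, while the alternative applies ReLU after the addition; `transform` stands in for the convolution stack, and all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def identity_shortcut(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Identity-mapping shortcut: x + F(x), with nothing applied after the sum."""
    fx = transform(x)
    return [a + b for a, b in zip(x, fx)]


def activated_shortcut(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Shortcut followed by a non-linear activation: relu(x + F(x))."""
    return relu(identity_shortcut(x, transform))
```

With the identity form, the input passes through the addition unchanged whenever the transform outputs zero, whereas the activated form clips any negative sum; this is one intuition for why the identity form is considered easier to optimize in deep stacks.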
Compared with the non-linear activation in Equation (5), the “linearity” of Equation (4) is preferable for its ease of optimization and reduced overfitting, as shown in the ablation experiments of [17]. In our model, the output after transformation
After all the above operations, the concatenation of the extracted features is denoted as O and passed to a softmax classifier to predict the label y from a discrete set of classes Y. The calculations are as follows:
We adopt the cross-entropy function as the loss function of the model, which is calculated as follows:
where m is the number of label categories,
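The classifier and loss can be illustrated with a small plain-Python sketch. It assumes the standard definitions of softmax and of cross-entropy against a one-hot label over m classes; the function names and the use of raw class scores as input are assumptions for illustration, not the authors' code.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn m class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Cross-entropy for one sample whose label is one-hot at `true_class`."""
    return -math.log(probs[true_class])
```

The predicted label is the index of the largest probability, and the loss shrinks toward zero as the probability assigned to the true class approaches 1.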
To verify the practicality and generality of the proposed model, the experiments are conducted on 7 different datasets, as shown in Table 1.
Table 1. Summary statistics of datasets
(C: number of label classes. L: average sentence length. M: maximum sentence length. Train/Test: train/test dataset size. V: vocabulary size).
MR: Movie reviews [7], a two-category dataset with one sentence per review, containing positive and negative reviews.
SST-1: Stanford Sentiment Treebank, an extension of MR from [29], with five labels: very negative, negative, neutral, positive and very positive.
SST-2: Same as SST-1 but with neutral reviews removed and binary labels.
Subj: Subjectivity dataset [6] classifying a sentence as being subjective or objective.
TREC: Question classification dataset [35]. The task involves classifying a question into six question types (abbreviation, description, entity, human, location, numeric value).
CR: Customer reviews of various products, such as cameras and MP3 players [23]. It is also a two-category dataset containing negative and positive reviews.
MPQA: Opinion polarity detection subtask of the MPQA dataset [15].
We randomly select 10% of each dataset as the testing set and apply 10-fold cross validation in the experiments. The evaluation metric for the datasets is accuracy, calculated as in Equation (9), where $y_i$ is the real label, $y^*$ is the estimated label and N is the dataset size.
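The evaluation protocol can be sketched as follows. The sketch assumes that accuracy is the fraction of samples whose estimated label equals the real label, and that each of the 10 folds holds out a disjoint 10% of the shuffled data; the function names and the fixed seed are illustrative choices.

```python
import random
from typing import List, Sequence, Tuple


def accuracy(real: Sequence[int], estimated: Sequence[int]) -> float:
    """Fraction of positions where the estimated label equals the real label."""
    if len(real) != len(estimated) or not real:
        raise ValueError("label sequences must be non-empty and of equal length")
    hits = sum(1 for y, y_star in zip(real, estimated) if y == y_star)
    return hits / len(real)


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits; each test fold holds about n / k samples."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        splits.append((train, test))
    return splits
```

The reported figure for a dataset would then be the mean of `accuracy` over the k test folds.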
All datasets are represented with pre-trained word2vec vectors [34], which were trained on 100 billion words from Google News. Words not present in the pre-trained vocabulary are initialized randomly. Each word vector has 300 dimensions and was trained with the continuous bag-of-words architecture. LSTM [31], Bi-LSTM [2] and RCNN [32] use 128 hidden units. In TCN, DPCNN, DPTCN and TextCNN [36], we use 128 convolutional filters with a kernel size of 3. The batch size is 50, the learning rate of Adam is 0.01, the L2 penalty coefficient is $10^{-3}$ and the dropout rate is 0.5.
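For reference, the settings listed above can be collected in one configuration object. The values are those stated in the text; the class and field names are hypothetical and chosen only for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the experimental setup (names are illustrative)."""
    embedding_dim: int = 300       # word2vec vector size
    rnn_hidden_units: int = 128    # LSTM, Bi-LSTM and RCNN
    conv_filters: int = 128        # TCN, DPCNN, DPTCN and TextCNN
    kernel_size: int = 3
    batch_size: int = 50
    learning_rate: float = 0.01    # Adam
    l2_coefficient: float = 1e-3
    dropout_rate: float = 0.5
```

Freezing the dataclass keeps one run's settings immutable, so every model in a comparison is guaranteed to read the same values.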
For an overall analysis of the proposed model, we conducted experiments on 7 datasets and compared it to 6 other models, especially TCN and DPCNN, as shown in Table 2. In addition, to better understand the impact of embedding, we applied three different embedding methods to DPCNN and our proposed model.
Table 2. Experimental results of classification accuracy (%)
Comparing the accuracy of TCN, LSTM and Bi-LSTM, the results do not show the superiority of causal convolution in TCN over canonical recurrent networks that [1] presents. This group of experiments corroborates, to some extent, the pertinence and limitation of a model, because TCN was not designed for text classification. Causal convolution nevertheless has great potential, thanks to its ability to preserve sequential information when applied properly, as our proposal verifies.
Comparing the accuracy of TextCNN with RCNN and DPCNN respectively, the shallow convolutional network performs poorly because it extracts only local textual information. RCNN takes context relationships into account by adding a recurrent network, and DPCNN extracts global information because its receptive field expands with network depth. Both results validate the importance of long-range textual associations for a text classification model.
In fact, the results of DPCNN shown in Table 2 and Fig. 4 were not obtained from the original model proposed in [27]. As mentioned in Section 3.1, when we ran DPCNN with the region embedding it originally adopted, the classification accuracy on short text datasets was poor, as shown in Table 3. Therefore, we replaced the embedding layer with word2vec and also combined the two embedding methods serially, as in the embedding layer shown in Fig. 2. The complete accuracy comparison of the three embedding methods in DPCNN and DPTCN is presented in Tables 3 and 4 respectively.

Fig. 4. Experimental results of time consumption (seconds).
Table 3. The accuracy of different embedding methods in DPCNN (%)
Table 4. The accuracy of different embedding methods in DPTCN (%)
From Tables 3 and 4, it is clear that classification accuracy increases noticeably with word2vec embedding compared to region embedding. Comparing word2vec with the concatenated embedding method, adding an embedding over small regions of data further improves the results on top of word2vec, which supports the effectiveness of the higher-dimensional feature representation. In terms of time consumption, word2vec embedding saves on average 6.42% and 6.87% over the 7 datasets in DPCNN and DPTCN respectively compared to region embedding, which is understandable because embedding small regions of data through convolution increases the computing cost. From another perspective, however, pre-trained word2vec embedding relies on an extra corpus, while region embedding does not.
From the above analysis, the proposed word-level deep CNN model is effective at improving accuracy in short text classification, through both the application of causal convolutions and the concatenated embedding method.
During the modeling process, there are various options for each module, including but not limited to the embedding method, the order of the shortcut connection and the down-sampling method. Each decision reflects careful and comprehensive consideration by the authors. In this paper, the primary target is to successfully apply a deep CNN model to short text classification and improve its accuracy; in real-life applications, however, factors such as maneuverability, device costs and time consumption must also be taken into account. Therefore, it is necessary to recognize the pertinence and limitation of any single model. DPTCN is a small but successful attempt to break the stereotype about deep CNNs in short text classification using causal convolutions and residual blocks. Meanwhile, the proposal of TCN opens a wider space and more possibilities for the development of CNNs, and it retains great potential for innovations in deep learning models and for applications in many fields beyond short text classification and NLP tasks.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (No. 61871234).
