Document classification using convolutional neural networks with small window sizes and latent semantic analysis

Abstract

A parsimonious convolutional neural network (CNN) for text document classification that replicates the ease of use and high classification performance of linear methods is presented. This new CNN architecture can leverage locally trained latent semantic analysis (LSA) word vectors. The architecture is based on parallel 1D convolutional layers with small window sizes, ranging from 1 to 5 words. To test the efficacy of the new CNN architecture, three balanced text datasets that are known to perform exceedingly well with linear classifiers were evaluated. Also, three additional imbalanced datasets were evaluated to gauge the robustness of the LSA vectors and small window sizes. The new CNN architecture consisting of 1 to 4-grams, coupled with LSA word vectors, exceeded the accuracy of all linear classifiers on balanced datasets with an average improvement of 0.73%. In four out of the total six datasets, the LSA word vectors provided a maximum classification performance on par with or better than word2vec vectors in CNNs. Furthermore, in four out of the six datasets, the new CNN architecture provided the highest classification performance. Thus, the new CNN architecture and LSA word vectors could be used as a baseline method for text classification tasks.

Keywords

Convolutional neural networks; document classification; latent semantic analysis; word embedding; word vectors

1. Introduction

Text classification is a classic problem where the objective is to assign a set of categories to documents. Studies in text classification vary from developing a sophisticated document feature representation [11,34] to implementing simple document representations in efficient classifiers [6]. A common approach in text classification is the bag-of-words representation with ngrams, where each document is represented by a vector of the words (or word sequences) that appear in it [22]. Ngrams are very simple to generate; however, they result in very large and sparse vectors. The sparsity of ngram vectors is due to the length of the text and the use of different words with the same meaning in the corpus vocabulary. Handling the sparsity and semantics arising from text document representations is a major challenge in text categorization [17,30].

Recent studies [13] have applied 1-dimensional convolutional layers (1D-CNNs) directly to one-hot vector representations of text documents obtained from bag-of-words. Typically this is not well-suited for convolution networks, wherein dense data is preferred. To mitigate this issue, an embedding layer prior to the convolutional layers is used to densify the one-hot vectors [3,14,18,32]. An embedding layer takes a large vocabulary and projects the full representation into smaller dimensional space [18]. CNNs with an embedding layer are similar to long short-term memory (LSTM) models [10], which also have an embedding layer when used in document sentiment analysis [7,8]. However, LSTMs are known to be difficult to train since they are sensitive to hyperparameters such as batch size, hidden dimensions, and learning rate [28].

Recently, word vectors obtained from an unsupervised neural network language model (word2vec [27]) have been successfully used as pretrained weights to initialize an embedding layer in CNNs for text classification [18]. The word2vec (w2v) word vectors are trained on 100 billion words of Google News. The goal of using w2v vectors with a CNN in Kim [18] was similar to that of Razavian et al. [29], who showed that for image classification, pretrained features from a different domain can be fine-tuned to domain-specific tasks. However, if the domain of the document classification task is very different from that of the pretrained vectors, classification accuracy may not reach its full potential, since the purpose of unsupervised pretraining is to provide relevant features that can improve accuracy [2].

Furthermore, there are limitations for training w2v models locally. First, w2v models require large datasets for training. For satisfactory performance, a minimum of 10 million words should be in the training corpora [1]. A curated corpus of such a large size may not always be available for a given task. For smaller datasets, LSA word vectors were found to provide better performance in semantic similarity tasks [1]. Second, there are a number of hyperparameters that need to be carefully chosen to obtain meaningful word vectors from the w2v model [24].

1.1. Related work

Recent studies have proposed various CNN architectures for document classification. One of the first studies to demonstrate the use of CNNs for document classification was Kim [18], which used pretrained w2v vectors for the embedding layer and window sizes of 3, 4, and 5 words. These window sizes represented tri, quad, and pentagrams in the CNN model [18]. The CNN based on the w2v vectors was effective on a number of document classification tasks. Johnson and Zhang [13,14] demonstrated that using pairs of window sizes chosen from 2, 3, 4, and 20 words on the one-hot vector representation (derived from bag-of-words) of documents could improve classification over the model by Kim [18]. However, Johnson and Zhang [13,14] did not use any pretrained vectors for their model. Zhang et al. [36] used combinations of window sizes of 3, 5, and 7, where the input data was embedded into character space (based on the alphabet) rather than word space in the corpus vocabulary. Conneau et al. [4] further increased the number of convolutional layers in the model from Zhang et al. [36], but limited the window size to only 3.

Although CNNs are effective in document classification tasks, their main drawback is that they can be designed using various architectures and include many hyperparameters. This may make their implementation difficult, since there is no clear architecture or guiding principle to follow. Also, the issue of hyperparameter tuning can be further compounded when the CNN is initialized using pretrained word vectors obtained from neural network models, such as w2v. The reason is that word vectors would need to be trained from scratch for a given task. As a result, there is no clear consensus for a baseline CNN architecture and the type of word vectors to be used for initializing the word embedding layer in a CNN.

1.2. Current study

Data transformed with ngrams (typically using uni, bi, and trigrams) and term-frequency inverse-document-frequency (TFIDF) weighting are still considered effective baseline models for many document classification tasks [33,36], especially when used with linear classifiers, such as logistic regression (LR), on smaller datasets with about 500K observations [36]. Furthermore, it has been shown that word vectors obtained from matrix factorization can produce better embeddings than neural network methods [23,24].

Thus, in this study, to address the difficulty of training CNNs for document classification, a new CNN architecture that uses parallel 1D convolutional layers with small window sizes, ranging from 1 to 5 words, is presented. These small window sizes are equivalent to using uni, bi, tri, quad, and pentagrams for linear classifiers. Also, locally trained word vectors obtained by LSA are used to initialize the weights in the embedding layer of the CNN models. The LSA vectors are easily obtained by singular value decomposition (SVD) on unigram and TFIDF weighted data [19].

This is the first study to (1) use LSA word vectors with CNNs and (2) propose small window sizes (specifically, combining uni and bigrams) in CNN text classification. The new CNN model has two advantages. First, it includes pretrained word embeddings without the onerous hyperparameter search and extensive training time of neural network methods, such as w2v. Second, it leverages the success of combining uni, bi, and trigrams in linear classifiers by applying a similar architecture within CNN models.

In the following sections we describe how to set up the LSA-based CNNs and demonstrate their effectiveness against baseline linear classifiers and against CNN models initialized with random and pretrained w2v vectors.

2. Materials and methods

Below, we describe the implementation of the baseline linear classifier, the new CNN architecture that leverages LSA, and the CNN model designed for w2v vectors. Document datasets for all models used the same preprocessing steps.

2.1. Datasets

The models were evaluated on three benchmark datasets (IMDB, AGNews, DBP) containing balanced class distributions. These three datasets were chosen because they were shown to have very high classification accuracies using linear classifiers on ngram transformed and TFIDF weighted data, even beating deep CNN architectures [33,36]. The AGNews and DBP datasets are from [36]. A summary of the datasets and the size of training and test data are presented in Table 1. Brief information regarding each of the datasets is given below:

IMDB Movie reviews: This dataset contains 25,000 training and 25,000 test samples, wherein the objective is to predict if a given review has either negative or positive sentiment [26].

AGNews: Antonio Gulli’s news (AGNews) article corpus contains 496,835 articles from more than 2000 different sources [9]. Only the four largest classes from this corpus are used, where each document is represented by the title and description fields. For each class, 30,000 training and 1900 testing samples are used.

DBP: DBpedia is an ontology dataset extracted from Wikipedia [21]. It contains 14 non-overlapping classes. Each class contains 40,000 training and 5000 testing samples. Each document is represented by the title and abstract of each Wikipedia article.

The following datasets (as listed in Table 1) were used to determine the effect of imbalanced class distribution on the best performing CNN architecture and word vector combinations:

HSI: Hate speech identification (HSI) [5] data contains 7274 non-offensive tweets and 2399 tweets with hate or offensive speech.

20News: 20 Newsgroups is a popular text dataset containing messages taken from online group discussions on Usenet and organized into 20 different topics [20]. The largest topic contains 999 messages and the smallest has 628 messages.

REUTERS: This dataset contains text documents from the 1987 Reuters newswire [25]. There are 74 different categories with the largest topic containing 3926 documents and the smallest topics containing a single document.

For all six datasets, punctuation, hypertext, and stopwords are removed, followed by a conversion to lowercase letters. The vocabulary is limited to 30K words.
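The preprocessing steps can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the regular expressions, the tiny `STOPWORDS` set, and the helper names `preprocess` and `build_vocab` are assumptions introduced here, since the text does not specify the stopword list or the exact definition of hypertext.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; the study does not say which list was used.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "and", "to", "in", "it", "this"}

def preprocess(text):
    """Strip hypertext and punctuation, lowercase, and drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs (assumed meaning of hypertext)
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove punctuation
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

def build_vocab(token_lists, max_size=30000):
    """Keep the max_size most frequent tokens (30K in the study)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [w for w, _ in counts.most_common(max_size)]

docs = ["This movie was <b>GREAT</b>, see http://example.com!", "The plot is great."]
tokens = [preprocess(d) for d in docs]
vocab = build_vocab(tokens, max_size=3)
```

Words that fall outside the retained vocabulary would then have to be dropped or mapped to a shared placeholder; the text does not say which option was used.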

Table 1
Summary of the text document datasets

Dataset	Classes	Train	Test
Balanced
IMDB	2	25,000	25,000
AGNews	4	120,000	7600
DBP	14	560,000	70,000
Imbalanced
HSI	2	4838	4835
20News	20	11,314	7532
REUTERS	74	7769	3008

2.2. Linear classifier with ngram and TFIDF

To construct the TFIDF weighted ${1, 2, 3}$ -ngram text representation, first a bag-of-words (BOW) transformation is performed. The BOW representation is constructed by using the frequency count of each word (unigram), bigram, and trigram. Then for the TFIDF weighting [15], the frequency count (term frequency) is multiplied by the inverse document frequency (IDF). The IDF of a term is the log of the total number of samples divided by the number of samples that contain the term. The ${1, 2, 3}$ -ngram vocabulary is limited to the top 30K most frequent terms, similar to [13]. The TFIDF representation was shown to generally perform better than the BOW representation [36]. Finally, the TFIDF weighted ${1, 2, 3}$ -ngram representation was used with multinomial logistic regression (LR) with $L_{2}$ regularization and the parameter $C = 1$ from the LIBLINEAR package [6].
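A minimal sketch of the TFIDF weighting of uni, bi, and trigram counts is given below. It assumes the common textbook form, in which the term frequency is multiplied by the log of the number of documents over the number of documents containing the term; library implementations often add smoothing or normalization, so their values may differ. The function names `ngrams` and `tfidf` are illustrative, and the logistic regression step is not shown.

```python
import math
from collections import Counter

def ngrams(tokens, sizes=(1, 2, 3)):
    """All contiguous n-grams of the requested sizes, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in sizes
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, sizes=(1, 2, 3), max_terms=30000):
    """Return (terms, rows); rows[j][t] is the TFIDF weight of term t in document j."""
    counts = [Counter(ngrams(d, sizes)) for d in docs]         # term frequencies
    total = Counter()
    for c in counts:
        total.update(c)
    terms = [t for t, _ in total.most_common(max_terms)]       # most frequent terms only
    n_docs = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in terms}  # document frequencies
    idf = {t: math.log(n_docs / df[t]) for t in terms}
    rows = [{t: c[t] * idf[t] for t in terms if t in c} for c in counts]
    return terms, rows

docs = [["great", "movie"], ["bad", "movie"], ["great", "great", "acting"]]
terms, rows = tfidf(docs)
```

In this toy corpus, "great" occurs twice in the third document and appears in two of the three documents, so its weight there is 2 · log(3/2).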

Fig. 1.

Schematic of the proposed model. In this example, a single sentence of a preprocessed movie review is input into a model where there is a vocabulary size of 5 in the dictionary. The word vector embedding layer reduces the dimensionality of the dictionary to 3. Then 3 parallel 1D-convolutions are applied, using 3 different filter sizes, on which $L_{2}$ regularization is performed. Max-over-time pooling is applied, which takes the single best feature per feature map. Finally, the review is classified with a score from 1 to 5.

2.3. CNN model

CNNs are feed-forward neural networks, where the features generated by the neural network layers are convolved with each other until a final classification is applied. Typically, the features are extracted from small 2D patches of an image. For instance, for a 32 × 32 grayscale image, 5 × 5 regions are extracted from which the features are learned. However, for text documents this 2D region is simplified down to 1D patches.

The core CNN architecture for the text document classification presented in this study employs a 1D-CNN model [3]. A sentence can be defined by a sequence of n concatenated k-dimensional word vectors, $x_{i} \in R^{k}$ ,

$$x_{1 : n} = x_{1} \oplus x_{2} \oplus \dots \oplus x_{n} . \tag{1}$$

The convolution of the words selected by a window of length h with a filter $w \in R^{h \times k}$ produces the feature $c_{i}$ given by the formula below,

$$c_{i} = f (w \cdot x_{i : i + h - 1} + b), \tag{2}$$

where $b \in R$ is the bias term and f is a nonlinear function such as the sigmoid function. Applying the filter $w$ to every window of h words in the sentence provides the feature map $c \in R^{n - h + 1}$ , where

$$c = [c_{1}, c_{2}, \dots, c_{n - h + 1}] . \tag{3}$$

Then max-over-time pooling is applied to the feature map, where the maximum activation value for the filter is obtained. This convolutional process is applied to all filters for a given window of words. The last step in the CNN model is a softmax layer, which is used to obtain the output class of the text document. A schematic workflow of a general word vector based CNN is shown in Fig. 1. In the subsections below, the details regarding the number of filters, window sizes, and word vectors are provided.
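The feature map of Eqs. (2) and (3) and the max-over-time pooling step can be illustrated with a short NumPy sketch for a single filter. The toy sizes, the random stand-in word vectors, and the use of ReLU for the nonlinearity f are assumptions made for the example; the function names are illustrative.

```python
import numpy as np

def conv1d_feature_map(x, w, b=0.0):
    """Slide a filter w of shape (h, k) over a sentence x of shape (n, k)."""
    n, k = x.shape
    h = w.shape[0]
    # One activation per window of h consecutive word vectors, as in Eq. (2).
    c = np.array([np.sum(w * x[i:i + h]) + b for i in range(n - h + 1)])
    return np.maximum(c, 0.0)  # ReLU chosen as f for this sketch

def max_over_time(c):
    """Keep the single largest activation of the feature map."""
    return c.max()

rng = np.random.default_rng(0)
n, k, h = 7, 3, 2              # toy sizes: 7 words, 3-dimensional vectors, bigram window
x = rng.normal(size=(n, k))    # stand-in word vectors
w = rng.normal(size=(h, k))    # one convolution filter
c = conv1d_feature_map(x, w, b=0.1)
feature = max_over_time(c)
```

The feature map has n - h + 1 = 6 entries here, and pooling reduces it to one scalar per filter regardless of the sentence length.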

2.4. Embedding layer

The word vectors in the CNN model are represented by the embedding layer $W_{E} \in R^{d \times v}$ , where d is the dimension of the embedding layer and v is the size of the vocabulary. The input text layer is a set of one-hot (binary) vectors representing the words used in documents from the vocabulary v (as in Fig. 1). The goal of the embedding layer is to map a discrete and sparse representation of words to continuous and dense values, which are best suited for CNNs. Therefore in the CNN model, the ngrams are modeled by a continuous feature space representation where similar words are located close to each other. This transformation provides the semantic understanding of words to 1D-CNN models. In the training of the CNN model, word vectors are fine-tuned to fit the classification task of the documents.
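As a small illustration of what the embedding layer computes, the sketch below shows that multiplying one-hot vectors by the embedding matrix is the same as looking up one column of $W_{E}$ per word. The sizes follow the example of Fig. 1 (vocabulary of 5, embedding dimension of 3); the random values and the word indices are placeholders chosen for the example.

```python
import numpy as np

v, d = 5, 3                                  # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_E = rng.uniform(-0.05, 0.05, size=(d, v))  # embedding matrix with one column per word

word_ids = np.array([2, 0, 4])               # a 3-word document as vocabulary indices
one_hot = np.eye(v)[word_ids]                # shape (3, v): one binary row per word

dense_by_product = one_hot @ W_E.T           # shape (3, d): project each one-hot row
dense_by_lookup = W_E[:, word_ids].T         # same result by indexing columns directly
```

In practice the lookup form is preferred, since it avoids materializing the sparse one-hot matrix.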

In this study, three types of word vectors are used to initialize the embedding matrix. First, as a baseline, random word vectors from $W_{rand} \in R^{d \times v}$ are used, which are initialized by the uniform distribution within the range [−0.05, 0.05]. Second, the LSA word vectors are formed by using SVD on unigram transformed and TFIDF weighted data. Specifically, using only the top d singular values, SVD is performed on $X \in R^{v \times n}$ , where v is the size of the vocabulary and n is the number of documents, giving

$$X_{d} = U_{d} S_{d} V_{d}^{T} . \tag{4}$$

Then the LSA word vectors are defined as the rows of

$$W_{LSA} = U_{d} S_{d} . \tag{5}$$

Also, following Levy et al. [24], the word vectors are row normalized, which has been shown to improve representative accuracy.
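The construction of the LSA word vectors in Eqs. (4) and (5), including the row normalization, can be sketched with NumPy as follows. The dense `numpy.linalg.svd` call and the random toy matrix are assumptions made to keep the example self-contained; for a 30K-word vocabulary a sparse truncated SVD routine would likely be needed instead.

```python
import numpy as np

def lsa_word_vectors(X, d):
    """X is a (v x n) term-document TFIDF matrix; returns a (v x d) matrix of word vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = U[:, :d] * s[:d]                         # U_d S_d: one row per word
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)          # row-normalize, guarding against zero rows

rng = np.random.default_rng(0)
X = rng.random((6, 10))                          # toy matrix: 6 terms, 10 documents
W_lsa = lsa_word_vectors(X, d=3)
```

Each row of `W_lsa` is a unit-length word vector. To initialize an embedding matrix laid out as d × v, the result would be transposed.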

Finally, the word2vec word vectors pretrained on 100 billion words of Google News are used to form $W_{w 2 v} \in R^{d \times v}$ . The word2vec model is a two-layer unsupervised neural network that produces vector representations of words, similar to LSA. In the model, the objective function maximizes the log probability of a context word ( $w_{O}$ ), given its input words ( $w_{I}$ ). The output is an n-dimensional vector for each word in the vocabulary that represents the weights of the hidden layer. Words that are similar in context to each other in the training corpus are located close to each other in the vector space representation. In Fig. 2, two different w2v model architectures are shown, the Continuous Bag of Words (CBOW) and Skip-gram models. The CBOW architecture predicts a word based on the surrounding context words. The Skip-gram architecture uses the current word to predict the surrounding words in a fixed-size window. For infrequent words, the Skip-gram architecture works better, whereas the CBOW model trains faster than the Skip-gram model.
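The difference between the two architectures is easiest to see in the training examples each one generates from a sentence. The sketch below only builds those examples and does not train a model; the function names and the window size of 1 are chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(input word, context word) pairs: the current word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window=2):
    """(context words, target word) pairs: the neighbours jointly predict the current word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

sentence = ["best", "movie", "ever", "made"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_pairs(sentence, window=1)
```

Skip-gram yields one example per (word, neighbour) pair, six in this case, whereas CBOW yields one example per word, four in this case. That difference in the number of examples is one intuition for why CBOW is the faster of the two.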

Fig. 2.

Training of word2vec word vectors. The two models of word2vec architectures are shown here.

The random word vectors in $W_{rand}$ serve as a baseline for testing the efficacy of initializing the embedding layer with the word vectors in $W_{LSA}$ and $W_{w 2 v}$ . For the subsequent CNN models presented below, the following naming convention is used, “{ngram range}- $W_{E}$ type-regularization-CNN”, where “{ngram range}” indicates the window sizes used on the word vectors, “ $W_{E}$ ” takes on either $W_{rand}$ , $W_{LSA}$ or $W_{w 2 v}$ , and “regularization” is either $L_{2}$ weight decay (wt) or dropout (dp) based. The “{ngram range}” and “regularization” are dependent on the model architectures defined below.

2.5. LSA-based CNN architecture

To successfully apply LSA in a CNN model, three different architectures were developed to take advantage of the LSA word vectors. Specifically, networks with ngram filter sizes $= {1, 2, 3}, {1, 2, 3, 4}$ , and ${1, 2, 3, 4, 5}$ are implemented. Each ngram filter size pertains to a separate and parallel convolutional layer in the model. The outputs of these layers are concatenated after the max-over-time pooling layer, prior to the softmax classification layer. The following are the parameters used for all three network architectures:

Filters: The number of filters is set to 128 for each of the ngram filters, which is a common size in convolutional network models [16].

Embedding: The embedding dimension size was set to 300 to match the word2vec vector dimensionality [27].

Regularization: The regularization was applied to each of the weights in the convolutional layers using an $L_{2}$ parameter weight decay of $10 e^{- 5}$ [12,13].

Activation: Rectified linear units (ReLU) were used as the activation function, as they have been shown to be sufficient in other studies [37] and also enable comparison to the w2v-based CNN model specified in Kim [18].

A summary of the architecture is shown for the ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN model in Fig. 3. The ${1, 2, 3}$ - $W_{LSA}$ -wt-CNN and ${1, 2, 3, 4, 5}$ - $W_{LSA}$ -wt-CNN models follow the same setup except for the addition or absence of filters.
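The data flow through the parallel convolutional branches can be sketched as a NumPy forward pass. This is not the implementation used in the study: the weights are random and untrained, the sizes are scaled down (the study uses a 30K vocabulary, 300 dimensions, and 128 filters per window size), and training, $L_{2}$ weight decay, and the initialization from $W_{LSA}$ are omitted. The function name `forward` and all variable names are illustrative.

```python
import numpy as np

def forward(word_ids, W_E, filters, biases, W_out, b_out):
    """Embedding -> parallel 1D convolutions -> max-over-time -> concatenation -> softmax."""
    x = W_E[word_ids]                                   # (n, d) embedded document
    pooled = []
    for w, b in zip(filters, biases):                   # one branch per window size
        h = w.shape[0]                                  # w has shape (h, d, n_filters)
        windows = np.stack([x[i:i + h] for i in range(len(x) - h + 1)])  # (n-h+1, h, d)
        c = np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b          # (n-h+1, n_filters)
        pooled.append(np.maximum(c, 0.0).max(axis=0))   # ReLU, then max-over-time
    z = np.concatenate(pooled) @ W_out + b_out          # concatenated features -> class scores
    e = np.exp(z - z.max())
    return e / e.sum()                                  # softmax probabilities

rng = np.random.default_rng(0)
v, d, n_filters, n_classes = 50, 8, 4, 2                # toy sizes, far smaller than the study's
window_sizes = (1, 2, 3, 4)
W_E = rng.normal(scale=0.1, size=(v, d))                # stored as one row per word here
filters = [rng.normal(scale=0.1, size=(h, d, n_filters)) for h in window_sizes]
biases = [np.zeros(n_filters) for _ in window_sizes]
W_out = rng.normal(scale=0.1, size=(len(window_sizes) * n_filters, n_classes))
b_out = np.zeros(n_classes)

probs = forward(rng.integers(0, v, size=12), W_E, filters, biases, W_out, b_out)
```

Changing `window_sizes` to `(1, 2, 3)` or `(1, 2, 3, 4, 5)` gives the other two variants; only the length of the concatenated feature vector changes.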

Fig. 3.

Sample CNN model architecture. The above setup shows the architecture and parameters for the top performing LSA-based CNN model.

2.6. W2V-based CNN architecture

The 1D-CNN architecture from Kim [18] was also used to compare against the CNN architecture developed to leverage LSA vectors. The CNN architecture developed for w2v vectors uses larger filter sizes since the word2vec vectors are designed to capture long distance relationships among words [27]. The w2v-based CNN model was shown to be an effective classifier compared to linear and other state-of-the-art classifiers such as RNNs and LSTMs [18]. For the w2v-based CNN architecture, the following parameters as given in [18] were used: filter sizes $= {3, 4, 5}$ ; embedding dimension size = 300; and dropout [31] of 0.3 used for regularization. Only the number of filters for each filter size was increased from 100 to 128, to enable fair comparison to the CNN model architecture developed for LSA word vectors.

2.7. Experimental setup

For all six datasets, the vocabulary size of the training corpus was limited to 30K words for the linear and CNN based models. To test the effectiveness of the LSA and w2v specific CNN architectures, the $W_{rand}$ , $W_{LSA}$ or $W_{w 2 v}$ word vectors were input into both types of architectures. The Adadelta optimizer [35] was used to train the models for 10 epochs as it was shown to reach convergence quickly [18].

For all balanced datasets, the traditional accuracy metric (ratio of true positive and negative documents to the number of all documents) was reported on the testing samples. On the imbalanced datasets, the best average performing LSA-based CNN architecture was used to compare against the w2v-based CNN architecture. Furthermore, on the imbalanced datasets the macro-averaged F-score (the unweighted F-score across all classes) was used to report the classification performance, since it is not skewed by the larger class sizes. The F-score itself is the harmonic mean between the recall and precision metrics.
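The macro-averaged F-score can be computed as in the sketch below, which uses toy labels chosen only to show how it differs from accuracy on imbalanced data. The function name `macro_f1` is illustrative, and assigning a score of zero to a class with no true or predicted positives is a convention assumed for the example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F-scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Imbalanced toy labels: 8 documents of class 0 and 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Here the accuracy is 0.9, while the macro-averaged F-score is about 0.80, because the one missed document in the small class lowers that class's F-score to 2/3 and each class counts equally in the average.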

3. Results

Table 2
Classification accuracy for text documents on the balanced datasets. Bold face indicates best accuracy for a dataset

Model	IMDB	AGNews	DBP	Mean^a
${1, 2, 3}$ -TFIDF- $L_{2}$ -LR^b	0.8942	0.9142	0.9797	0.9294
${1, 2, 3}$ - $W_{rand}$ -wt-CNN^c	0.8957	0.9201	0.9847	0.9335
${1, 2, 3}$ - $W_{w 2 v}$ -wt-CNN	0.8996	0.9201	0.9854	0.9350
${1, 2, 3}$ - $W_{LSA}$ -wt-CNN	0.9001	0.9209	0.9858	0.9356
${1, 2, 3, 4}$ - $W_{rand}$ -wt-CNN	0.8961	0.9195	0.9849	0.9335
${1, 2, 3, 4}$ - $W_{w 2 v}$ -wt-CNN	0.8989	0.9201	0.9859	0.9350
${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN	0.9010	0.9213	0.9862	0.9362
${1, 2, 3, 4, 5}$ - $W_{rand}$ -wt-CNN	0.8965	0.9201	0.9851	0.9339
${1, 2, 3, 4, 5}$ - $W_{w 2 v}$ -wt-CNN	0.8997	0.9214	0.9863	0.9358
${1, 2, 3, 4, 5}$ - $W_{LSA}$ -wt-CNN	0.9002	0.9209	0.9863	0.9358
${3, 4, 5}$ - $W_{rand}$ -dp-CNN	0.8886	0.9159	0.9842	0.9296
${3, 4, 5}$ - $W_{w 2 v}$ -dp-CNN	0.8965	0.9196	0.9859	0.9340
${3, 4, 5}$ - $W_{LSA}$ -dp-CNN	0.8980	0.9195	0.9855	0.9343

^a Mean is the average accuracy across the different datasets.

^b The logistic regression is performed with $L_{2}$ regularization on TFIDF weighted uni, bi, and trigrams.

^c The following naming convention is used, “{ngram range}- $W_{E}$ type-regularization-CNN”.

Table 2 presents the experimental results of the proposed CNN architecture for LSA word vectors against the baseline linear model and the CNN architecture designed for w2v word vectors on the balanced datasets. On average, the best performing CNN architecture and word vector type across all balanced datasets was the ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN model, with an average accuracy of 0.9362. Among the small-window (wt) architectures, the lowest average accuracy was 0.9335, obtained by both the ${1, 2, 3}$ - $W_{rand}$ -wt-CNN and ${1, 2, 3, 4}$ - $W_{rand}$ -wt-CNN models; the lowest average accuracy overall was 0.9296, for the ${3, 4, 5}$ - $W_{rand}$ -dp-CNN model.

On the IMDB dataset, the ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN model had the highest accuracy with 0.9010. The next best model was ${1, 2, 3, 4, 5}$ - $W_{LSA}$ -wt-CNN, which despite having an extra 5-gram filter had a lower accuracy of 0.9002. On AGNews, the ${1, 2, 3, 4, 5}$ - $W_{w 2 v}$ -wt-CNN achieved the highest accuracy with 0.9214, only barely higher than the ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN model, which achieved an accuracy of 0.9213. For the DBP dataset, the ${1, 2, 3, 4, 5}$ - $W_{LSA}$ -wt-CNN and ${1, 2, 3, 4, 5}$ - $W_{w 2 v}$ -wt-CNN tied for the highest accuracy at 0.9863. This was almost the same as the ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN, which had an accuracy of 0.9862.

For the IMDB and AGNews datasets, the w2v vectors reached their maximum accuracy using the ${1, 2, 3, 4, 5}$ - $W_{w 2 v}$ -wt-CNN model instead of the ${3, 4, 5}$ - $W_{w 2 v}$ -dp-CNN model, which was specifically designed for the w2v vectors [18]. Moreover, on the AGNews dataset, the $W_{w 2 v}$ word vectors tied for accuracy on the ${1, 2, 3}$ - $W_{w 2 v}$ -wt-CNN and ${1, 2, 3, 4}$ - $W_{w 2 v}$ -wt-CNN models, with a higher accuracy than the ${3, 4, 5}$ - $W_{w 2 v}$ -dp-CNN model. On the DBP dataset, the w2v vectors also reached their peak accuracy using the ${1, 2, 3, 4, 5}$ - $W_{w 2 v}$ -wt-CNN model.

When the ${3, 4, 5}$ - $W_{E}$ type-dp-CNN architecture was used, $W_{w 2 v}$ provided a higher maximum accuracy than $W_{LSA}$ and $W_{rand}$ on the AGNews and DBP datasets, but not on the IMDB dataset, suggesting that this architecture is w2v specific. Nonetheless, this architecture never achieved the maximum performance on the balanced datasets. Also, on all of the CNN architectures, $W_{rand}$ achieved a higher mean accuracy than the linear classifier model, although on the IMDB dataset the ${3, 4, 5}$ - $W_{rand}$ -dp-CNN model (0.8886) fell below the linear classifier (0.8942).

Fig. 4.

Classification accuracy of the CNN models on the balanced datasets. The effect of the different architectures and word vectors is shown here.

On the balanced datasets, an analysis of variance (ANOVA) with a 3 (embedding type) × 4 (filter architecture) random effects design showed that there was no significant interaction between the filter architecture and word vector used in the embedding layer (Fig. 4). However, there were significant differences among the filter architectures ( $F (3, 22) = 5.51$ , $p < 0.01$ ) and word vectors ( $F (2, 22) = 12.68$ , $p < 0.001$ ).

Specifically, Tukey post-hoc tests showed that the ${3, 4, 5}$ - $W_{E}$ type-dp-CNN architecture had a significantly lower mean accuracy ( $μ_{{3, 4, 5}} = 0.9326$ ) than the other architectures containing small window sizes ( $μ_{{1, 2, 3, 4, 5}} = 0.9352$ , $μ_{{1, 2, 3, 4}} = 0.9348$ , $μ_{{1, 2, 3}} = 0.9347$ ; all comparisons $p < 0.01$ ). With respect to the word vector used, there was no difference between the mean accuracy for $W_{LSA}$ ( $μ_{LSA} = 0.9355$ ) and $W_{w 2 v}$ ( $μ_{w 2 v} = 0.9350$ ) vectors. However, both were significantly greater than the mean accuracy for the random word vector across all balanced datasets ( $μ_{rand} = 0.9326$ ; $p < 0.001$ for all comparisons).
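The mean accuracies quoted above can be recomputed from the per-dataset values in Table 2. The sketch below computes only these marginal means, assuming each dataset and model is weighted equally; it does not reproduce the ANOVA or the Tukey tests. The dictionary `ACC` and the helper `marginal_mean` are names introduced for the example.

```python
# Per-dataset accuracies (IMDB, AGNews, DBP) copied from Table 2.
ACC = {
    ("{1,2,3}", "rand"): (0.8957, 0.9201, 0.9847),
    ("{1,2,3}", "w2v"): (0.8996, 0.9201, 0.9854),
    ("{1,2,3}", "LSA"): (0.9001, 0.9209, 0.9858),
    ("{1,2,3,4}", "rand"): (0.8961, 0.9195, 0.9849),
    ("{1,2,3,4}", "w2v"): (0.8989, 0.9201, 0.9859),
    ("{1,2,3,4}", "LSA"): (0.9010, 0.9213, 0.9862),
    ("{1,2,3,4,5}", "rand"): (0.8965, 0.9201, 0.9851),
    ("{1,2,3,4,5}", "w2v"): (0.8997, 0.9214, 0.9863),
    ("{1,2,3,4,5}", "LSA"): (0.9002, 0.9209, 0.9863),
    ("{3,4,5}", "rand"): (0.8886, 0.9159, 0.9842),
    ("{3,4,5}", "w2v"): (0.8965, 0.9196, 0.9859),
    ("{3,4,5}", "LSA"): (0.8980, 0.9195, 0.9855),
}

def marginal_mean(axis, level):
    """Mean accuracy over all cells matching `level` on `axis` (0 = filters, 1 = vectors)."""
    values = [a for key, accs in ACC.items() if key[axis] == level for a in accs]
    return sum(values) / len(values)

by_filter = {f: marginal_mean(0, f) for f in ("{1,2,3}", "{1,2,3,4}", "{1,2,3,4,5}", "{3,4,5}")}
by_vector = {e: marginal_mean(1, e) for e in ("rand", "w2v", "LSA")}
```

The recomputed means agree with the reported values to within about 0.0001; small differences in the last digit are expected because the table entries are themselves rounded.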

Table 3

Comparison of classification accuracies of the LSA-based CNN to character-based CNN models. Accuracy values of the character-based CNNs are taken from their respective studies. Bold face indicates best accuracy for a dataset

Model	AGNews	DBP
${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN	0.9213	0.9862
6- $W_{char}$ -CNN [36]	0.9145	0.9845
9- $W_{char}$ -CNN [4]	0.9083	0.9865
29- $W_{char}$ -CNN [4]	0.9133	0.9871

Since the ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN model had the highest mean accuracy, we compared it to the state-of-the-art character-based CNN classifiers (shown in Table 3) used in document classification. These CNN models use alphanumeric characters, rather than words, to embed the documents. The idea is to learn the semantic relationships of words in higher convolutional layers, similar to image classification tasks, where the input data to the CNNs are only pixels [4,36]. The character-based models mainly differ in the number of convolutional layers and the filter sizes in the CNN architecture. The character-based CNN by Zhang et al. [36] uses 6 convolutional layers with a filter size of 7 (on the first two convolutional layers) and 3 (on the remaining convolutional layers), and is designated as 6- $W_{char}$ -CNN. In Conneau et al. [4], either 9 convolutional layers or 29 convolutional layers with a filter size of 3 were used, which are designated as 9- $W_{char}$ -CNN and 29- $W_{char}$ -CNN, respectively. Compared to these models, our best performing ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN model had the highest accuracy on the AGNews dataset and the third highest accuracy on the DBPedia dataset. Comparison results for the IMDB dataset are not available since [4,36] did not use it for analysis.

Table 4

Comparison of F-score values for the new ${1, 2, 3, 4}$ - $W_{E}$ type-wt-CNN architecture to the ${3, 4, 5}$ - $W_{E}$ type-dp-CNN models. Bold face indicates best F-score for the model, embedding vector, and dataset

Model	HSI	20News	REUTERS
${1, 2, 3, 4}$ - $W_{rand}$ -wt-CNN	0.7883	0.7141	0.4762
${1, 2, 3, 4}$ - $W_{w 2 v}$ -wt-CNN	0.7936	0.7794	0.5756
${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN	0.7792	0.8309	0.5268
${3, 4, 5}$ - $W_{rand}$ -dp-CNN	0.7866	0.5232	0.4091
${3, 4, 5}$ - $W_{w 2 v}$ -dp-CNN	0.7873	0.8005	0.6008
${3, 4, 5}$ - $W_{LSA}$ -dp-CNN	0.7666	0.8352	0.5976

For the imbalanced datasets, the ${1, 2, 3, 4}$ - $W_{LSA}$ -wt-CNN model was compared against the ${3, 4, 5}$ - $W_{w 2 v}$ -dp-CNN architecture from [18] (as shown in Table 4). Also in this comparison, the three different types of word vectors (LSA, w2v, and random) were used in both models. The ${3, 4, 5}$ - $W_{E}$ type-dp-CNN architecture attained the highest F-score on two out of three datasets. On the 20News dataset it achieved an F-score of 0.8352 using LSA vectors, not the w2v vectors that it was specifically designed for. On the REUTERS dataset, the model achieved an F-score of 0.6008 using w2v vectors. For the HSI dataset, interestingly, the ${1, 2, 3, 4}$ - $W_{E}$ type-wt-CNN achieved the highest F-score of 0.7936, not with the LSA vectors, but with the w2v vectors.

An ANOVA with a 3 (embedding type) × 2 (filter architecture) random effects design showed that there was no significant interaction between the filter architecture and word vector used in the embedding layer across the imbalanced datasets. Nor was there a significant difference between the two filter architectures. However, there was a significant main effect of the word vectors ( $F (2, 10) = 5.41$ , $p < 0.05$ ), where Tukey post-hoc tests showed that both the LSA ( $μ_{LSA} = 0.7227$ ) and w2v ( $μ_{w 2 v} = 0.7229$ ) vectors performed better than the random ( $μ_{rand} = 0.6163$ , $p < 0.001$ ) word vectors.

4. Discussion

This study has shown that locally trained LSA word vectors can be used as an alternative to pretrained w2v word vectors in a CNN model for text document classification. In two balanced datasets (IMDB and DBP) and one imbalanced dataset (20News), the LSA vectors provided the maximum classification performance. In the AGNews balanced dataset, the highest classification accuracy for the LSA vectors was lower than that of the w2v vectors by a very small margin. Also, a novel architecture using small word window sizes for CNN classification of text documents was introduced. The new small windowed CNN architecture provided the maximum classification performance on all three balanced datasets (IMDB, AGNews, and DBP) and on one imbalanced dataset (HSI). The combination of the new CNN architecture and LSA word vectors was also shown to perform better than all of the character-based CNN models on the AGNews dataset and third best on the DBPedia dataset.

Most CNN architectures for document classification have used combinations of window sizes of 3, 5, and 7 [36]. Although the goal was to model short and long distance relationships among words [36], smaller window sizes of 1 or 2 words had not been analyzed before the current study. As shown in Johnson and Zhang [13], effective linear classification with ${1, 2, 3}$ -ngrams is heavily dependent on unigrams rather than bigrams and trigrams. The larger window sizes may be better at modeling semantic similarity tasks, but short distances may be better suited for classification. Also, it has been shown that combining uni and bigrams increases classification accuracy in linear models [33].

Thus, the success of the LSA-based CNN model architecture could be attributed to the use of the smaller filter sizes together with the LSA word vectors. Since the LSA word vectors are extracted using unigrams, LSA only encodes short-distance relationships among words. Consequently, filter sizes larger than quadgrams should not be able to capture meaningful semantic relationships among words. In contrast, the w2v word vectors perform better in a CNN with larger filter sizes, since w2v word vectors are trained with large window sizes ranging from 5 to 10 words in length [24]. The results also show that carefully tuning traditional methods, such as LSA word vectors, gives results equivalent to those of neural network methods, such as w2v, as shown in Levy et al. [23].

A limitation of linear models is that they cannot use ngram representations that are not present in the training dataset [13]. CNNs are able to find and use ngrams that are not wholly in the training set, which is why CNN-based ngram models perform better than linear ngram models. For instance, a CNN can learn a general trigram “best X-positive ever”, where X-positive represents any positive word. This general trigram can then be used to classify test documents, as long as their text follows the general form [13].
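The "best X-positive ever" example can be illustrated with a toy sketch. Everything below is hypothetical: the two-dimensional word vectors and the filter weights are set by hand so that the first axis marks positive sentiment, whereas a trained CNN would have to learn such weights from data.

```python
# Hand-made 2-d word vectors (hypothetical): axis 0 encodes positive sentiment.
vectors = {
    "best": [0.2, 1.0],
    "ever": [0.1, -1.0],
    "great": [1.0, 0.0],
    "wonderful": [0.9, 0.1],
    "awful": [-1.0, 0.0],
}

# One hand-set trigram filter: rewards "best", then a positive word, then "ever".
trigram_filter = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0]]


def filter_response(words):
    """Dot product between the trigram filter and the three word vectors."""
    return sum(
        w * x
        for filt_row, word in zip(trigram_filter, words)
        for w, x in zip(filt_row, vectors[word])
    )


# A linear ngram model only has features for trigrams seen during training.
seen_trigrams = {("best", "great", "ever")}

seen = ("best", "great", "ever")
unseen = ("best", "wonderful", "ever")  # never observed as a whole trigram
negative = ("best", "awful", "ever")

for trigram in (seen, unseen, negative):
    print(trigram, trigram in seen_trigrams, round(filter_response(trigram), 2))
```

The unseen trigram has no feature in the linear model's vocabulary, yet the filter responds to it almost as strongly as to the seen trigram, because "wonderful" lies close to "great" in the embedding space. The negative trigram produces a clearly weaker response.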

One limitation of the LSA-based CNN model and the related small word window architecture is that, in datasets with long documents (i.e., documents with many long sentences), the model may perform worse than models that take advantage of larger window sizes. In this case, CNN architectures based on w2v word vectors or character-based models may perform better, since they are more likely to capture long-distance semantic relationships among words.

Another limitation was demonstrated when the majority of the imbalanced datasets were better classified using the w2v-inspired CNN architecture, which has the larger window sizes. This could be because small window sizes are not able to capture the higher-level word relationships that could otherwise offset the lack of relationships obtained from the smaller document classes.

5. Conclusion

This study has shown that a CNN with small window sizes and LSA word vectors can be used as a baseline classifier for document classification. One reason is that the LSA word vectors can be more domain specific than pretrained w2v word vectors, and small window sizes can leverage the embedding of LSA vectors. Furthermore, LSA word vector dimensionality can be easily adjusted to any size, whereas the pretrained w2v word vectors are fixed at a dimensionality of 300 unless a costly pretraining process is undertaken. For future studies, we would like to analyze the effect of using different LSA word vector dimensions. We would also like to implement other types of traditional word vectors, which could have an effect on larger text datasets.
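The point about adjustable dimensionality can be sketched with a truncated singular value decomposition (SVD) of a unigram term-document matrix. This is only an illustration under stated assumptions: the four-document corpus is invented, raw counts are used in place of whatever term weighting the study applied, and `lsa_word_vectors` is a hypothetical helper built on NumPy's `numpy.linalg.svd`.

```python
import numpy as np

# Tiny invented corpus; a real corpus would be the training documents.
docs = [
    "the movie was great",
    "a great and wonderful film",
    "the film was awful",
    "an awful boring movie",
]
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

# Term-document matrix of raw unigram counts (rows: words, columns: documents).
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        counts[index[word], j] += 1.0


def lsa_word_vectors(matrix, k):
    """Return k-dimensional word vectors from a truncated SVD of the matrix."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    if k > len(s):
        raise ValueError("k cannot exceed min(number of words, number of documents)")
    # Scale the left singular vectors by their singular values.
    return u[:, :k] * s[:k]


# The same decomposition yields word vectors of any requested size.
vectors_2d = lsa_word_vectors(counts, 2)
vectors_3d = lsa_word_vectors(counts, 3)
print(vectors_2d.shape, vectors_3d.shape)
```

Changing the dimensionality only changes how many singular vectors are kept, so no retraining on an external corpus is needed. In this sketch the dimensionality is bounded by the smaller side of the term-document matrix.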

References

1. Altszyler, Sigman, Ribeiro and D.F. Slezak, Comparative study of LSA vs Word2vec embeddings in small corpora: A case study in dreams database, 2016, arXiv preprint arXiv:1610.01520.

2. Bengio, Courville and Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8) (2013), 1798–1828. doi:10.1109/TPAMI.2013.50.

3. Collobert, Weston, Bottou, Karlen, Kavukcuoglu and Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011), 2493–2537.

4. Conneau, Schwenk, LeCun and Barrault, Very deep convolutional networks for text classification, in: Long Papers – Continued, Vol. 1, Association for Computational Linguistics (ACL), 2017, pp. 1107–1116.

5. Davidson, Warmsley, Macy and Weber, Automated hate speech detection and the problem of offensive language, in: Eleventh International AAAI Conference on Web and Social Media, 2017.

6. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang and C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008), 1871–1874.

7. Graves and Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18(5) (2005), 602–610. doi:10.1016/j.neunet.2005.06.042.

8. Greff, R.K. Srivastava, Koutník, B.R. Steunebrink and Schmidhuber, LSTM: A search space odyssey, 2015, arXiv preprint arXiv:1503.04069.

9. Gulli, AG’s corpus of news articles, 2004, (Accessed on 05/06/2018). https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.

10. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780. doi:10.1162/neco.1997.9.8.1735.

11. Javed, Maruf and H.A. Babri, A two-stage Markov blanket based feature selection algorithm for text classification, Neurocomputing 157 (2015), 91–104. doi:10.1016/j.neucom.2015.01.031.

12. Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama and Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.

13. Johnson and Zhang, Effective use of word order for text categorization with convolutional neural networks, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Denver, Colorado, 2015, pp. 103–112.

14. Johnson and Zhang, Semi-supervised convolutional neural networks for text categorization via region embedding, in: Advances in Neural Information Processing Systems 28, Cortes, N.D. Lawrence, D.D. Lee, Sugiyama and Garnett, eds, Curran Associates, Inc., 2015, pp. 919–927.

15. K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 28(1) (1972), 11–21. doi:10.1108/eb026526.

16. Karpathy, CS231n Convolutional Neural Networks for Visual Recognition, 2014, (Accessed on 05/06/2018), http://cs231n.github.io/convolutional-networks/.

17. Kim, Howland and Park, Dimension reduction in text classification with support vector machines, in: Journal of Machine Learning Research, 2005, pp. 37–53.

18. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1746–1751. doi:10.3115/v1/D14-1181.

19. T.K. Landauer, D.S. McNamara, Dennis and Kintsch, Handbook of Latent Semantic Analysis, Psychology Press, 2013.

20. Lang, 20 Newsgroups, 2008, (Accessed on 05/06/2018). http://qwone.com/~jason/20Newsgroups/.

21. Lehmann, Isele, Jakob, Jentzsch, Kontokostas, P.N. Mendes, Hellmann, Morsey, Van Kleef, Auer et al., DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6(2) (2015), 167–195. doi:10.3233/SW-140134.

22. Leskovec, Rajaraman and J.D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014.

23. Levy and Goldberg, Neural word embedding as implicit matrix factorization, in: Advances in Neural Information Processing Systems 27, Ghahramani, Welling, Cortes, N.D. Lawrence and K.Q. Weinberger, eds, Curran Associates, Inc., 2014, pp. 2177–2185.

24. Levy, Goldberg and Dagan, Improving distributional similarity with lessons learned from word embeddings, Transactions of the Association for Computational Linguistics 3 (2015), 211–225. doi:10.1162/tacl_a_00134.

25. D.D. Lewis, Reuters-21578 Text Categorization Collection Data Set, 1997, (Accessed on 05/06/2018). http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.

26. A.L. Maas, R.E. Daly, P.T. Pham, Huang, A.Y. Ng and Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp. 142–150.

27. Mikolov, Sutskever, Chen, G.S. Corrado and Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

28. Pascanu, Mikolov and Bengio, On the difficulty of training recurrent neural networks, in: Proceedings of the 30th International Conference on International Conference on Machine Learning, Vol. 28, JMLR.org, 2013, pp. III–1310–III-1318. http://dl.acm.org/citation.cfm?id=3042817.3043083.

29. A.S. Razavian, Azizpour, Sullivan and Carlsson, CNN features off-the-shelf: An astounding baseline for recognition, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, IEEE, 2014, pp. 512–519. doi:10.1109/CVPRW.2014.131.

30. Silva, Lotrič, Ribeiro and Dobnikar, Distributed text classification with an ensemble kernel-based learning approach, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 40(3) (2010), 287–297.

31. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15(1) (2014), 1929–1958.

32. Wang, Xu, Tian, C.-L. Liu and Hao, Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification, Neurocomputing 174 (2016), 806–814. doi:10.1016/j.neucom.2015.09.096.

33. Wang and C.D. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, Association for Computational Linguistics, 2012, pp. 90–94.

34. A.J.J. Yepes, Plaza, Carrillo-de-Albornoz, J.G. Mork and A.R. Aronson, Feature engineering for MEDLINE citation categorization with MeSH, BMC Bioinformatics 16(1) (2015), 1. doi:10.1186/s12859-014-0430-y.

35. M.D. Zeiler, ADADELTA: An adaptive learning rate method, 2012, arXiv preprint arXiv:1212.5701.

36. Zhang, Zhao and LeCun, Character-level convolutional networks for text classification, in: Advances in Neural Information Processing Systems, 2015, pp. 649–657.

37. Zhang and Wallace, A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification, 2015, arXiv preprint arXiv:1510.03820.