Abstract
Sentiment analysis is a natural language processing task that is widely applied to texts extracted from social networks. The task consists of assigning one of the labels or classes positive, negative or neutral to a text. However, determining whether a piece of text extracted from social networks expresses a positive or a negative sentiment is difficult, because social media texts contain slang, typographical errors and cultural context. The shortcomings of traditional frequency-based feature extraction models such as bag of words or TF-IDF affect the accuracy of sentiment classification. To improve the precision in the sentiment classification task, it is possible to use natural language modelling methods that are able to learn contextual information from words. In this work, word embeddings such as Word2Vec, GloVe and Doc2VecC with different dimensions are used. The resulting word vectors are used to train recurrent neural networks such as LSTM, BiLSTM, GRU and BiGRU, to improve sentiment classification.
Introduction
Today, the web allows users to provide feedback about their likes or dislikes. Social networks have become the perfect place to extract information about any topic, due to the great variety of users who publish content every day. The study of attitudes or points of view has become an essential task in the analysis of a person’s behaviour, and these can be summarised as sentiments [14].
Sentiment analysis [1, 16] can be defined as the computational study of opinions, evaluations, and attitudes towards entities, individuals, problems, events, themes and their attributes. There are three levels of sentiment analysis [12]: document analysis, sentence analysis, and aspect/entity analysis.
This work is based on the document analysis level. The document analysis level is used when the main task is to find the general sentiment on a topic. This type of analysis assumes that the entire document expresses an opinion about a unique entity or feature.
At the level of sentence analysis, it is assumed that each of the sentences belonging to a text expresses opinions and feelings about a characteristic or entity.
At the entity level or aspect level [13], multiple features of an object are analysed. For example, suppose a customer buys a Nokia cell phone and notes that the quality of the cell phone’s camera is good, but the sound quality is quite bad. Aspect-level analysis is performed to analyse the individual features of the cell phone in this review.
Sentiment analysis methods can be divided into machine learning, lexicon-based, and hybrid methods [10].
The machine learning method comprises supervised learning, which is used when the data is properly labelled, and unsupervised learning, which is used when the data is not labelled and it is difficult to find any pattern for classifying the data.
Lexicon-based method: a lexicon is a collection of predefined words in which a polarity score is associated with each word [16].
This is the simplest approach to performing the sentiment classification task. This method uses a dictionary and performs a word comparison to classify that document.
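The dictionary lookup described above can be sketched as follows. This is a minimal illustration only: the lexicon, its polarity scores and the function name are hypothetical and are not taken from any lexicon used in the cited works.

```python
# Hypothetical toy lexicon: the words and polarity scores are illustrative only.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}


def lexicon_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every known word and map the total to a label."""
    tokens = text.lower().split()
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Words that are missing from the dictionary contribute nothing to the score, which is one reason why this approach struggles with slang and misspellings.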
The opinions or reviews that reside in social networks become valuable, because, currently, various organisations or businesses use traditional methods for sentiment analysis [1] to increase or improve the popularity of a product or a service. These traditional methods perform well when the semantics of the text is simple or the amount of data is limited. However, with the incorporation of different social networks, the amount of data increases every day and expressions through language on the Internet are abundant. It is difficult to analyse text quickly and efficiently using traditional methods.
On the other hand, scientists around the world constantly seek to develop new technologies or methods. The scientific community has begun to use deep learning due to its good performance [18].
Artificial neural networks (ANNs) are inspired by the functioning of neurons in the human brain but simplified. They are very efficient for complex problems, specifically for tasks such as classification, association and pattern recognition [11].
In related work, Wang et al. [2] compared the efficiency of different recurrent neural networks (RNNs) with the aim of improving the accuracy of sentiment analysis by using word embeddings with 100 dimensions. Goularas and Kamis [3] compared the performance of convolutional neural networks (CNN) and long short-term memory (LSTM) networks in classifying sentiment using 25-dimensional word embeddings.
Fazeel Abid et al. [4] also compared the performance of different RNNs, including combinations of recurrent and convolutional networks, using word embeddings with 300 dimensions. Minmin [20] proposed a variation on the Doc2Vec [21] model. With Doc2VecC, Minmin obtains an average of the word vectors of a document from which words have been randomly removed; this vector is then concatenated with the vectors obtained with Doc2Vec. Minmin reaches 88.4 using a word embedding with 100 dimensions and logistic regression.
However, none of the works mentioned above considers the impact that selecting different dimensions for the word vectors may have on the models. Based on this observation, the present work compares two algorithms for producing word vectors, Word2Vec and GloVe, with 100, 200, 300 and 350 dimensions. The resulting word vectors are used as features to train different RNN models.
This paper is structured as follows: Section 2 presents the materials and methods employed in this work. Section 3 describes the proposed methodology. Section 4 describes the results obtained and Section 5 concludes this paper.
Materials and methods
This section describes Word2Vec, GloVe and Doc2VecC, algorithms used in this work for word vector representation. It also presents the recurrent neural networks used in this work and the selected corpora.
Word vectors
Word vectors (or word embeddings) are widely used in natural language processing (NLP). They are a language modelling technique used to represent words or phrases as vectors of real values.
Machine learning algorithms and deep learning architectures are unable to process raw text, so the text needs to be modelled [19]; in other words, the features associated with the text must be numerical values in order to be used as inputs to a machine learning algorithm.
The word vectors can be classified into two main categories [17]. The first is frequency-based word vectors, where the meaning between the words is not stored and the words are considered as separate features; examples are bag of words, the TF-IDF measure and the co-occurrence matrix. The second is word embedding, where a vector represents a word (or a character); examples are Word2Vec, GloVe and Doc2VecC.
In this work, the word embeddings Word2Vec, GloVe and Doc2VecC are used.
Pre-trained word embedding
An embedding layer is used to train a neural network on text; however, its vectors are initialised randomly and then gradually improved during the training phase, with the gradient descent algorithm updating them at each step of backpropagation. This process is usually very expensive: it consumes a lot of training time, since the network has to find the best weights.
Pre-trained word embedding is an example of transfer learning. The main idea is to use publicly available embeddings or to obtain the word vectors separately by using word embedding algorithms like Word2Vec or GloVe.
Using these pre-trained embeddings has the advantage that the weights of the embedding layer of our neural network are not initialised randomly; instead, the pre-trained word vectors are set as initialisation weights. This method helps make training faster and improves the performance of NLP models.
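As an illustration of how pre-trained vectors can serve as initialisation weights, the sketch below builds a weight matrix for an embedding layer. The function name, the `word_index` vocabulary (assumed to map words to ids 1..V) and the random fallback for unknown words are assumptions of this example, not details reported in this work.

```python
import numpy as np


def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    """Return a (vocab_size + 1, dim) matrix; row 0 is reserved for padding.

    word_index: dict mapping each word to an integer id in 1..vocab_size (hypothetical).
    pretrained: dict mapping words to 1-D arrays of length `dim`.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector  # copy the pre-trained weights
        else:
            # Assumption: words without a pre-trained vector get small random values.
            matrix[idx] = rng.normal(0.0, 0.1, size=dim)
    return matrix
```

The resulting matrix can then be passed as the initial weights of an embedding layer in the deep learning framework of choice.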
Word2Vec
Word2Vec was developed by Tomas Mikolov et al. [5] at Google in 2013. Two models with a single-hidden-layer architecture are proposed to calculate vector representations of words from very large data sets in a very short time. These models are capable of capturing semantic and syntactic information from words. The proposed models are continuous bag of words (CBOW) and skip-gram.
GloVe
Pennington et al. [6] developed Global Vectors for Word Representation, or GloVe, at Stanford. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The training phase is performed using a co-occurrence matrix, and the resulting representations show linear substructures of the word vector space.
The word vectors obtained with this model preserve the relationships and similarities between words. The model uses a co-occurrence matrix built from the corpus, that is, a matrix that represents how often a pair of words appears together. Each element of the matrix is the count of the times a pair of words appears together.
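The co-occurrence counts described above can be sketched in a few lines. This is a simplified illustration, assuming a symmetric context window and plain counts; the GloVe implementation additionally weights each pair by the inverse of the distance between the two words, which is not reproduced here. The function name is hypothetical.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count how often each ordered pair (word, context word) appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts
```

The counter plays the role of a sparse matrix: a pair that never co-occurs simply has a count of zero.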
Doc2VecC
Document vector through corruption (Doc2VecC). Minmin [20] presented an interesting approach inspired by Doc2Vec [21] that includes a global context by capturing the semantic meaning of the document.
It consists of an input layer, a projection layer, and an output layer to predict the target word. The embeddings of neighbouring words provide a local context obtained through the Doc2Vec architecture, while the vector representation of the entire document serves as a global context. Unlike Doc2Vec, which directly learns a unique vector for each document, Doc2VecC represents each document as an average of the embeddings of words sampled at random from the document.
Furthermore, the authors choose to corrupt the original document by randomly deleting a significant part of the words, so that the document is represented by the average of the vectors of the remaining words only. This corruption mechanism accelerates training, since it significantly reduces the number of parameters to update in backpropagation.
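The corruption-and-average step can be illustrated with the sketch below. It is a simplified reading of the mechanism, not the authors' implementation: the function name, the corruption rate `q` and the rescaling of surviving words by 1 / (1 - q), which keeps the expected value equal to the plain average, are assumptions of this example.

```python
import numpy as np


def corrupted_document_vector(word_vectors, q=0.5, seed=0):
    """Average the word vectors of a document after randomly dropping words.

    word_vectors: array of shape (num_words, dim), one row per word occurrence.
    q: probability of removing each word (must be smaller than 1).
    """
    rng = np.random.default_rng(seed)
    word_vectors = np.asarray(word_vectors, dtype=float)
    keep = rng.random(word_vectors.shape[0]) >= q  # True for surviving words
    # Assumption: rescale survivors so the expectation matches the plain average.
    scaled = word_vectors * keep[:, None] / (1.0 - q)
    return scaled.mean(axis=0)
```

With `q=0.0` no word is dropped and the function reduces to the ordinary mean of the word vectors.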
Recurrent neural networks
Simple recurrent neural network
A simple recurrent neural network (SRNN) is a type of neural network where the output of the previous step is fed as input to the current step. The SRNN can be used to process sequence data, usually text. Multiple studies have shown that SRNN can be used successfully in natural language processing tasks [7].
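The recurrence can be written compactly as a loop in which the previous hidden state is fed back at each step. The sketch below assumes a tanh activation and the weight names `W_x`, `W_h` and `b`, which are illustrative choices rather than details given in this work.

```python
import numpy as np


def srnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over `inputs` of shape (steps, input_dim); return all hidden states."""
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        # The previous hidden state is fed back together with the current input.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)
```

For text, each row of `inputs` would be the word vector of one token of the review.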
However, this type of neural network suffers from the short-term memory problem, also known as the vanishing gradient problem. If a text is too long, the SRNN will have a hard time passing information from earlier steps to later steps. The problem arises when the gradient shrinks during backpropagation through time. If the value of the gradient becomes extremely small, the network stops learning from the earlier steps and, therefore, the SRNN forgets that information.
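A small numerical illustration of this effect, assuming a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t): by the chain rule, the derivative of the last state with respect to the initial state is a product of one factor per step, and each factor has magnitude below |w|. The function name and the example values are hypothetical.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain rule: the derivative of tanh is 1 - tanh^2
    return grad
```

Calling the function with the same weight on a sequence of 5 steps and on a sequence of 50 steps shows how quickly the contribution of the earliest input fades as the sequence grows.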
To solve the short-term memory problem some cells like long short-term memory (LSTM) and gated recurrent unit (GRU) were developed. These networks have internal mechanisms called gates that allow regulating the flow of information.
Long short-term memory
A long short-term memory (LSTM) network has a similar architecture to a simple recurrent neural network and also processes sequence data. The differences from the SRNN are the gates and the logical operations in the cell. These operations allow the LSTM to keep or forget information.
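The gate operations can be sketched as a single LSTM step using the standard formulation. The stacked weight layout (forget, input, output, candidate) and the parameter names are conventions chosen for this example, not details reported in this work.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*H, D), U: (4*H, H), b: (4*H,), stacked as
    [forget, input, output, candidate]."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:H])            # forget gate: what to erase from the cell state
    i = sigmoid(z[H:2 * H])        # input gate: what new information to store
    o = sigmoid(z[2 * H:3 * H])    # output gate: what to expose as hidden state
    g = np.tanh(z[3 * H:4 * H])    # candidate values
    c_t = f * c_prev + i * g       # keep part of the old memory, add new content
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

The additive update of the cell state `c_t` is what allows information to survive over many steps.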
Gated recurrent unit
The gated recurrent unit (GRU) network was proposed [8] as a simpler alternative to the LSTM. The network also has gates to regulate the flow of information. The GRU uses the hidden state to transfer information. This network has only two gates, a reset gate and an update gate. The GRU network is faster to train because it is composed of fewer operations than the LSTM network.
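For comparison, a single GRU step is sketched below. The stacked weight layout (update, reset, candidate) and the interpolation convention h_t = (1 - z) * h_{t-1} + z * candidate are assumptions of this example; some formulations swap the roles of z and 1 - z.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W, U, b):
    """One GRU step. W: (3*H, D), U: (3*H, H), b: (3*H,), stacked as
    [update, reset, candidate]."""
    H = h_prev.shape[0]
    Wx = W @ x_t + b
    z = sigmoid(Wx[0:H] + U[0:H] @ h_prev)           # update gate
    r = sigmoid(Wx[H:2 * H] + U[H:2 * H] @ h_prev)   # reset gate
    h_tilde = np.tanh(Wx[2 * H:] + U[2 * H:] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde
```

There is no separate cell state: the hidden state alone carries the information, which is why the step needs fewer operations than the LSTM.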
Bidirectional recurrent neural networks
Bidirectional recurrent neural networks are composed of two RNN layers: one layer that stores previous representations and another that helps to learn future representations. These networks improve the performance of the SRNN and of LSTM and GRU cells.
Bidirectional neural networks are combinations of forward and backward recurrent networks, providing the network with more context about the document and faster learning.
Large movie review corpora
Large Movie Review v1.0 by Maas et al. [9], obtained from the original Stanford AI repository, contains 50,000 reviews of movies in English. The corpora are uniformly divided into 25,000 reviews for the training set and 25,000 for the test set. Table 1 shows the label distribution in the training and test sets.
Corpora labels distribution
In the entire collection, a movie has no more than 30 reviews, because having multiple reviews of the same movie can produce correlated ratings. In addition, the training set and the test set contain a disjoint set of movies.
A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10.
The proposed methodology has been designed to solve the task of sentiment classification on corpora of movie reviews collected from IMDB. In this work, 0 represents a negative sentiment and 1 represents a positive sentiment.
The proposed methodology uses word embedding for feature extraction. This language modelling technique maps words to vectors of real numbers in order to represent words or phrases in a vector space.
The vectors obtained through this technique are a representation of a text in which words that have a similar meaning have a similar representation. In other words, these vectors represent words in a vector space, where the words that have some relation in the corpora are closer together.
After the word vectors of a document have been obtained, different classifiers based on recurrent neural networks are trained. After the training phase, the models are able to determine which sentiment (negative or positive) a document corresponds to, which makes it possible to establish which recurrent neural network model performs better. Figure 1 shows the architecture proposed for this work.

Architecture proposed.
The proposed methodology consists of three main modules: text preprocessing, feature extraction and sentiment classification.
Data preprocessing consists of a series of techniques that have the objective of correctly initialising the data that will serve as input for the feature extraction phase.
During the pre-processing phase, a method was developed in which all reviews were labelled with respect to their score. Label 0 was placed on negatives and 1 on positives.
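The labelling rule can be expressed as a small function. This sketch uses the score thresholds stated for the corpus (at most 4 for negative, at least 7 for positive); the function name and the `None` return value for scores outside those ranges are assumptions of this example.

```python
def label_from_score(score):
    """Map a review score out of 10 to 0 (negative) or 1 (positive).

    Returns None for scores of 5 or 6, which fall outside both ranges.
    """
    if score <= 4:
        return 0
    if score >= 7:
        return 1
    return None
```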
In this work, various natural language processing techniques were used to normalise the data, since the messages obtained from IMDB are informal, and generally contain words with special symbols, including emoticons. The following techniques were applied: Transform all the words to [...] Words like “okaaaaaaay” can be regularly found on social networks and may have some influence on the task of classification. In this work, [...]
Feature extraction
The objective of this stage is to find the best numerical representation of each of the input words in a document in order to construct a vector (a format readable by the model). To speed up the training phase of the classification models, three algorithms have been trained beforehand to construct the word embeddings from the corpora: GloVe, Word2Vec and Doc2VecC.
GloVe
In the present work, the open-source Stanford GloVe algorithm has been re-trained on the movie review corpora to construct vectors of 100, 200, 300 and 350 dimensions. The important parameters that were modified in the demo.sh file were: VECTOR_SIZE = 100, 200, 300 and 350; MAX_ITER = 30; WINDOW_SIZE = 5; CORPUS = imdb.txt.
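Once training has finished, the vectors have to be read back before they can be used as features. The sketch below assumes the plain-text output format in which each line holds a word followed by its space-separated components; the function name is hypothetical, and the lines are passed as an iterable so that the example does not depend on a particular file path.

```python
import numpy as np


def parse_glove_lines(lines):
    """Parse 'word v1 v2 ... vN' lines into a dict mapping words to float32 vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```

In practice the function would be called on an open file handle, for example `parse_glove_lines(open(path, encoding="utf-8"))`, where `path` points to the vectors produced by the training script.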
Word2Vec
In the present work, the gensim library was used to create the model. Some of the important parameters used in the model are the following: Size = 100, 200, 300 and 350 dimensions; Window = 5; Iterations = 30; sg = 0 (CBOW).
The CBOW training algorithm was selected because an exhaustive search of the literature indicated that this algorithm is more suitable for the characteristics of our corpora.
Skip-gram: works well with a small amount of training data and represents rare words or phrases well.
CBOW: faster to train than skip-gram, with slightly better accuracy for frequent words.
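The difference between the two training schemes lies in what is predicted from what. The sketch below only generates the training examples and does not train a model; the function name is hypothetical, and the `sg` flag mirrors the convention used above (0 for CBOW, 1 for skip-gram).

```python
def training_pairs(tokens, window=5, sg=0):
    """Return (input, target) examples for one tokenised sentence.

    sg=0 (CBOW): the context words jointly predict the centre word.
    sg=1 (skip-gram): the centre word predicts each context word separately.
    """
    pairs = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if not context:
            continue
        if sg == 0:
            pairs.append((tuple(context), centre))
        else:
            pairs.extend((centre, word) for word in context)
    return pairs
```

CBOW produces one example per position, whereas skip-gram produces one example per context word, which is consistent with CBOW being the faster of the two to train.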
Doc2VecC
The open-source Doc2VecC algorithm has been trained on the movie review corpora to construct vectors of 100 dimensions. The important parameters that were modified in the go.sh file were: Size = 100, 200, 300 and 350; Train algorithm = CBOW; Window = 10; Iterations = 20.
Sentiment classification
To evaluate performance on the classification task addressed in this work, five models were trained with recurrent neural networks, using an embedding layer. The neural networks SRNN, LSTM, BiLSTM, GRU, and BiGRU were used as models in the experiments to determine the best performance.
The architecture used to create the sentiment classification model consists of the following fully connected sequence layers (Figs. 2 and 3 show the architecture of the proposed models).

Model diagram using RNN.

Model diagram using Bi-RNN.
The resulting dimensions are: (num_words_review, embedding_dimension). The embedding layer is a weight matrix that has already been calculated by training the GloVe, Word2vec and Doc2VecC algorithms.
Bidirectional networks consist of two RNN layers, one layer that stores previous representations and another that helps to learn future representations; that is, they are able to analyse sentences forwards and backwards. Generally, these types of networks are built with LSTM and GRU cells. These networks propagate the input forwards and backwards through the RNN layers and then concatenate the final outputs.
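The forward and backward passes followed by concatenation can be sketched as below. For brevity the example uses a plain tanh recurrence instead of LSTM or GRU cells; the function names and the parameter layout are assumptions of this illustration.

```python
import numpy as np


def run_rnn(inputs, W_x, W_h):
    """Return the final hidden state of a simple tanh RNN over `inputs`."""
    hidden = np.zeros(W_h.shape[0])
    for x_t in inputs:
        hidden = np.tanh(W_x @ x_t + W_h @ hidden)
    return hidden


def bidirectional_encode(inputs, forward_params, backward_params):
    """Concatenate the final states of a forward pass and a pass over the reversed sequence."""
    h_forward = run_rnn(inputs, *forward_params)
    h_backward = run_rnn(inputs[::-1], *backward_params)
    return np.concatenate([h_forward, h_backward])
```

The output has twice the size of a single direction, so the layer that follows receives context from both ends of the review.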
In the first layer a rectified linear unit (ReLU) activation function is used: for all inputs less than or equal to 0 (e.g. –120, –6.7, –0.0344, 0) the output is 0, while for any positive input (e.g. 10, 15, 34) the value is retained. The last output layer contains a single neuron and a sigmoid activation function, which gives predictions with values between 0 and 1.
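Both activation functions are simple enough to state directly. The following sketch applies them with NumPy and can be checked against the example values given in the paragraph above.

```python
import numpy as np


def relu(x):
    """Return 0 for inputs less than or equal to 0 and the input itself otherwise."""
    return np.maximum(0.0, x)


def sigmoid(x):
    """Squash any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))
```

A sigmoid output above 0.5 can be read as a prediction of class 1 (positive) and an output below 0.5 as class 0 (negative); the 0.5 threshold is the usual convention and is an assumption here.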
The network is compiled with a binary cross-entropy loss function, which computes the loss for the two classes 0 and 1. The network is also compiled with an optimizer; Adam and Stochastic Gradient Descent (SGD) were the optimizers tested during the experiments. Each optimizer was used with different learning rates; values between 0.0001 and 0.0003 were tested.
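For reference, binary cross-entropy can be computed as in the sketch below. The clipping constant `eps` is an assumption added for numerical stability and is not a parameter reported in this work.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) is never evaluated.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(losses))
```

The loss decreases as the predicted probability of the correct class approaches 1, which is the quantity the optimizer minimises during training.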
All the parameters used in the models of this work are listed in Table 2.
Parameters used on the classifiers
This section presents the results obtained with the models proposed in Section 3. In order to determine which architecture performs better in the task of sentiment classification, different experiments have been carried out using recurrent neural networks such as SRNN, LSTM, GRU, BiLSTM and BiGRU in combination with Word2Vec and GloVe. Furthermore, experiments were performed using Doc2VecC and support vector machines (SVM).
In addition, the results of the related work mentioned in the introduction are compared against the results obtained in this work.
The following terminology is used for all tables in this section. The model column lists the recurrent neural network architectures used in each experiment. The dimension columns give the accuracy obtained by the model for each dimension (100, 200, 300 and 350) of the input word vectors used in each experiment.
GloVe
A comparison among the neural networks mentioned above was carried out on the large movie review corpora. The word vector dimensions used are 100, 200, 300 and 350. Each experiment was run for 100 epochs with a batch size of 250 and a learning rate of 0.0001, and the final result is the mean of five experiments.
The results of the sentiment classification are shown in Table 3. Both LSTM and GRU show better performance than RNN because they can detect long term dependencies.
Accuracy on RNN models using GloVe
However, during this comparison the LSTM with 350 dimensions in the word vector performs slightly better than the GRU. In this comparison, the performances obtained using 300 and 350 dimensions present a similar accuracy.
In the same way as for the GloVe algorithm, a comparison among the neural networks mentioned above was carried out on the large movie review corpora. The word vector dimensions used are 100, 200, 300 and 350. Each experiment was run for 100 epochs with a batch size of 250 and a learning rate of 0.0001, and the final result is the mean of five experiments.
The results of the sentiment classification using the embeddings obtained from the Word2Vec model are shown in Table 4. In this execution, the lowest performances are 75.74 and 76.46, obtained using 100 dimensions and bidirectional networks; however, as the number of dimensions in the word vectors increases, the accuracy reaches 83.23.
Accuracy on RNN models using word2Vec
The experiments on the large movie review corpora were carried out using Doc2VecC+SVM. Different kernels were used in the experiments, namely linear, RBF and sigmoid, and different coefficients were tested for the RBF and sigmoid kernels. Table 5 shows the accuracy obtained using an SVM. The best accuracy, 90.36, was obtained with an RBF kernel and a “scale” coefficient. With the linear kernel SVM, the accuracy obtained is 89.26, which is the highest accuracy after the RBF kernel.
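To make the kernel setting concrete, the sketch below computes an RBF kernel matrix. It assumes that the “scale” coefficient refers to the `gamma="scale"` option of scikit-learn's SVC, which sets gamma to 1 / (n_features * X.var()); this reading is an assumption, since the SVM library is not named in this work, and the function name is hypothetical.

```python
import numpy as np


def rbf_kernel_scale(X, Y=None):
    """RBF kernel exp(-gamma * ||x - y||^2) with gamma = 1 / (n_features * X.var())."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    gamma = 1.0 / (X.shape[1] * X.var())
    # Squared Euclidean distances between every row of X and every row of Y.
    sq_dists = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```

In this setting each row of `X` would be the Doc2VecC vector of one review, and the kernel value measures how similar two reviews are.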
Accuracy using doc2VecC and SVM
During the experimentation phase, two methods were proposed for frequency-based language modelling: bag of words+TF-IDF and n-grams of characters+TF-IDF with logistic regression and support vector machine as a baseline for purposes of comparison. For the baseline character n-gram technique, trigrams are used. Table 6 shows the results of the models using the proposed baseline techniques.
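The character trigram baseline with TF-IDF weighting can be sketched in pure Python. This is a minimal version assuming the textbook definition tf * log(N / df); common libraries apply smoothed variants, so their values would differ. The function names are hypothetical.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return every run of three consecutive characters in `text`."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def tfidf(documents, analyzer=char_trigrams):
    """Return one dict per document mapping each feature to tf * log(N / df)."""
    counts = [Counter(analyzer(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter(feature for c in counts for feature in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({f: (n / total) * math.log(n_docs / doc_freq[f]) for f, n in c.items()})
    return weights
```

Passing `analyzer=str.split` instead of `char_trigrams` gives the bag of words variant. In both cases each feature is weighted independently of its neighbours, which is why these representations are sparse and carry no contextual information.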
Base lines
Among the baselines, the highest performance in the sentiment classification was obtained using bag of words+TF-IDF and a support vector machine, reaching 71% accuracy.
Bag of words with TF-IDF and character trigrams with TF-IDF in combination with logistic regression and a support vector machine were proposed as baselines. However, when these language modelling methods are tested with the proposed models, good performance is not obtained; this is because these methods do not capture the syntactic and semantic content between the words, since they consider the words as separate features. These methods produce sparse word vectors.
The simple recurrent neural network in combination with Word2Vec or GloVe presents a similar performance to the baseline techniques. These results could be due to the fact that the simple RNN, unlike the LSTM and GRU networks, does not contain a memory unit.
In this experiment, we compared the performance of different recurrent network variants such as RNN, LSTM, GRU, and bidirectional networks. This paper also improves the efficiency of sentiment analysis through a fusion model that integrates Doc2VecC and an SVM with an RBF kernel.
Table 7 shows the comparison against the related work. The text underlined in grey shows the results obtained with the proposed models using bidirectional neural networks, Doc2VecC and SVM.
Comparison using GloVe, Word2Vec and Doc2VecC against the related work
The models using 300-dimensional and 350-dimensional word vectors perform better than the results presented by Fazeel Abid et al. [4], which is mainly due to two reasons. First, the word vectors were previously trained on the corpus itself, with fine-tuned Word2Vec and GloVe hyper-parameters such as the word count or the size of the word window. Second, although an attempt was made to replicate the model presented in the related work, it is possible that some hyper-parameter configurations of that model were not reported.
The results obtained using Doc2VecC and SVM with linear and RBF kernels are slightly higher than those obtained in the related work.
Through this experiment, we gained an understanding of the performance of different recurrent network variants and how to apply them in natural language processing.
This paper only selects the corpus of movie reviews for sentiment analysis. Other sentiment corpora will have their own specialised vocabulary and some unique words, and this difference in vocabulary will affect the sentiment orientation obtained from different text corpora.
To analyse the reviews from the corpus, this paper only proposes the use of different recurrent network variants such as RNN, LSTM, GRU, and bidirectional networks. One of the contributions of the paper is the use of Word2Vec, GloVe and Doc2VecC as natural language modelling techniques and the experimentation with dimensions of 100, 200, 300 and 350 in the word vectors used to train the recurrent neural network models. This experiment shows that bidirectional recurrent neural networks can achieve higher accuracy on the sentiment classification problem.
From the natural language processing carried out on the IMDB corpus for sentiment analysis, the following conclusions are drawn: RNNs and their variants perform better than the traditional perceptron, but computing power is restricted by hardware when more dimensions are used, since more resources are needed for training and running the models. Therefore, it is necessary to select appropriate models and algorithms based on data size, training cost, etc.; the neural network algorithms commonly used in deep learning can also greatly reduce training costs and improve accuracy.
In the future, we want to experiment with transformer models, a more recent technique that includes BERT, ALBERT and RoBERTa, in order to design a transformer-based deep learning model to improve accuracy in multiclass sentiment classification.
