Learning emotional word embeddings for sentiment analysis

Abstract

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.

Keywords

Sentiment analysis, word embedding, classification, representation learning

1 Introduction

Sentiment analysis or opinion mining is the computational study of people’s opinions, sentiments, emotions, appraisals, and attitudes towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [1]. Large amounts of emotional text data have been accumulated by social media platforms such as Twitter, Facebook, and WeChat and by shopping platforms such as Taobao, Jingdong, and Amazon. Analyzing the emotional tendency of these text data has both essential research significance and commercial value [2]. Moreover, sentiment analysis has grown to be one of the most active research areas in natural language processing (NLP). Existing research on sentiment analysis includes both supervised and unsupervised methods. In the supervised setting, studies have used many types of supervised machine learning methods (such as Support Vector Machines (SVM) and Maximum Entropy) and feature combinations. Unsupervised methods include various methods that exploit sentiment lexicons, grammatical analysis, and syntactic patterns. Several survey books and papers have been published that cover those early methods and applications extensively [3].

Distributed representation learning, as a new deep language analysis technology, has received increasing attention from researchers in the field of NLP. Many deep learning models in NLP need distributed representation results as input features [4]. Word embedding is a technique for distributed representation learning of language, which transforms words in a vocabulary to vectors of continuous real numbers. Therefore, the representation of words has become the basis for a wide range of tasks and research. In a vector space, it is easy to quantify the relationship between words by distance or angle. Thus, word representations (also known as word embeddings) have received increasing attention from researchers in the NLP field.

There are two main frameworks for training word representations: global matrix factorization methods, such as latent semantic analysis (LSA) [5, 6], and local neural network methods [7, 8], such as the word2vec model [9]. The former generates low-dimensional word representations by factorizing matrices that capture global statistical information about a corpus. The latter uses a neural network framework to train word representations that are good at making predictions within local context windows. In both frameworks, most word representations are learned from extensive collections of document texts; however, they ignore the sentiment information in documents. Nevertheless, in real-world datasets, documents such as commodity review data may contain rich emotional information. This situation is suboptimal because it means that the emotional information in the documents is not being used when discovering word representations. In this paper, we focus on embedding emotional information into word representations.

Recent efforts on sentiment detection take advantage of word representations embedded in semantic vector spaces [9, 10], which are learned based on neural networks or probabilistic models on large text corpora. The derived word embeddings have been shown to accurately capture the semantics and context of words. Simultaneously, using the resulting embeddings in a supervised classification setting (especially with neural network architectures) can improve the trained sentiment models, as shown by Le and Mikolov [11]. Severyn and Moschitti [12], Socher et al. [13] and Tang et al. [14] proposed learning sentiment-specific word representations by applying sentiment labels (positive and negative). The results from the word representation embedding algorithm show that in addition to capturing precise syntactic and semantic information, the word embeddings obtained from these algorithms demonstrate a linear structure particularly well suited to performing analogy tasks. More researchers have studied sentiment embedding from different perspectives. Fu et al. [15] propose an integrated sentiment embedding method to combine context and sentiment information using a dual-task learning algorithm to perform sentiment analysis. Sun et al. [16] address the problem of text containing semantic, syntactic, sentiment, and other information. Kaibi et al. [17] focus on the comparison of three commonly used word embedding techniques (Word2vec, Fasttext and Glove) on Twitter datasets for sentiment analysis. Mohamed et al. [18] propose an enhanced ensemble classifier framework based on a lexicon-based method, bag-of-words, and pre-trained word embeddings. Seyed et al. [19] propose a novel method, Improved Word Vectors (IWV), which increases the accuracy of pre-trained word embeddings in sentiment analysis. However, it remains a challenging problem to integrate emotional information into pre-trained word vectors and to represent document sentiment features.

In this paper, we propose an emotional word embedding (EWE) model for sentiment analysis. This method first uses pre-trained word vectors to represent document features with two different linear weighting methods (EWE (1) and EWE (delta - idf)). Then, the document vectors are used as input to the classification model to train the text sentiment classifier, which is based on a neural network. The emotional polarity of the text is propagated into the word vectors. Our experimental results on three kinds of real-world data sets demonstrate that the proposed EWE achieves superior performances on text sentiment prediction, text similarity calculation, and word emotional expression tasks.

The contributions of this paper can be summarized as follows.

We propose an emotional word embedding (EWE) model to encode both word level and emotion level sentiment information when learning sentiment-specific word embedding, which makes full use of existing sentiment lexicons.

We present two different linear weighting methods (EWE (1) and EWE (delta - idf)) that apply pre-trained word vectors to represent document features. The document vectors are used as input to the classification model to train the text sentiment classifier, which is based on a neural network. The emotional polarity of the text is propagated into the word vectors.

We conduct experiments on three standard sentiment classification benchmark datasets. The proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.

The remainder of this paper is organized as follows. In Section 2, we briefly summarize the related works on sentiment analysis and word representations. Section 3 describes the proposed EWE model for incorporating emotional information into the word representations. Section 4 presents an experimental evaluation of the proposed model’s performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks, and Section 5 concludes the paper.

2 Related Works

A vast amount of related work has been performed on word representations and sentiment analysis. In this section, we discuss the significant contributions from both areas.

2.1 Word vector representations

Word embedding is a popular method in NLP whose goal is to learn low-dimensional vector representations of words from large numbers of documents. This approach can capture both syntactic and semantic word relationships. The earliest vector representation of words was proposed by Harris [20] in 1954, who introduced a distributional structure in which the appearance of each word is related to that word in context. Subsequently, two main frameworks were developed for training word representations: local neural network methods and global matrix factorization methods.

Deep Neural Network models: After neural networks were developed, a variety of deep neural network models were subsequently proposed to learn word representations. For example, one popular word embedding model is C&W, which was presented by Collobert and Weston [21]. They utilized a convolutional neural network (CNN) architecture to obtain the word vectors and then used them in several semantic tasks. Yoshua et al. [7] proposed learning word representations with a neural network architecture for language modeling purposes. Later, higher quality models that are easier to train were introduced in [9]. The skip-gram and continuous bag-of-words (CBOW) models make use of shallow neural network architectures to either predict the context window for a given word (skip-gram) or to predict a word given a context window (CBOW). Since then, a large number of models have been proposed that are based on those two models, including works by Le and Mikolov [11], Severyn and Moschitti [12], and Socher et al. [13].

Moreover, a log-bilinear model was proposed by Botha and Blunsom [22], who exploited addition as a composition function to enhance word vector representations from morpheme vectors. Chen et al. [23] proposed a character-enhanced word embedding model. Yang and Sun [24] introduced a model to improve the learning of Chinese word embeddings by using semantic knowledge. The basic idea was to learn the word representations by considering the semantic information about words and their component characters when performing composition functions.

Matrix Factorization models: Some other studies have utilized large factorization matrices to generate low-dimensional word representations. These methods seek to capture global statistical information about words and their contexts from a corpus. When using global matrix factorization models, researchers usually have to choose a weighting scheme for matrix construction, a factorization method, and a context type. Regarding the weighting schemes used during matrix construction, Fritz [25] provided a detailed discussion of raw counts, binary counts, TF-IDF, and PMI. The factorization methods include the independent component correlation algorithm (ICA) [26], singular value decomposition (SVD) [27] and nonnegative matrix factorization (NMF) [28]. The main types of context include surrounding words. Pantel [29] tried various surrounding contexts, such as a left window, a right window, and the full context window. Others include patterns around words and documents. Typical models of these types include latent semantic indexing (LSI) [5] and latent Dirichlet allocation (LDA) [30]. Global matrix factorization methods directly utilize word co-occurrence statistics and succeed in capitalizing on the vast amount of repetition in text corpora.

2.2 Document-level sentiment analysis

Document-level sentiment classification is the most popular and extensively studied topic in the field of sentiment analysis. The goal of this task is to classify a document (e.g., a product review) as expressing positive or negative sentiment. Document sentiment classification considers each document as a whole and ignores details such as, for example, who is expressing the sentiment or which product aspects are involved. The assumptions behind document sentiment analysis are that each document expresses an opinion regarding a single entity and that a single sentiment holder expresses those sentiments.

According to [31], the approaches for document sentiment classification can be grouped into supervised and unsupervised approaches. Lexicon-based approaches are traditional approaches for sentiment analysis that use pre-compiled sentiment lexicons, containing different words and their polarity, to classify a given word into a positive or negative sentiment class. The studies [3], [32] provide a detailed description of these approaches. Stone et al. [33] started the task of sentiment analysis using the lexicon method in 1966. Later, different lexicons were proposed such as WordNet, WordNet-Affect, SenticNet, MPQA, and SentiWordNet [34]. Among supervised approaches, Dyer et al. [35] were the first to adopt a supervised machine learning method to address the sentiment classification problem. Many previous works showed that feature engineering plays a crucial role in sentiment analysis tasks. The widely used one-hot word representation often serves as a baseline in sentiment analysis. Furthermore, Richard et al. [36] incorporated numerous manual features to build a state-of-the-art system for performing sentiment analysis on Twitter posts. The current state-of-the-art approaches to sentiment analysis rely on embedding-based feature extraction and deep learning architectures [37–39]. These approaches represent words as a function of their context, which enables machine learning algorithms to generalize across words with similar contextual representations.

Deep learning is an emerging branch of machine learning that is inspired by artificial neural networks. Word embeddings are a type of word representation that aims to represent the meaning of words in the form of vectors; they serve as the first data processing layer in deep learning approaches [40]. Deep learning approaches to sentiment analysis rely heavily on the representations learned by word embedding methods. Fu et al. [15] propose an integrated sentiment embedding method to combine context and sentiment information using a dual-task learning algorithm to perform sentiment analysis. Sun et al. [16] address the problem of text containing semantic, syntactic, sentiment, and other information. Kaibi et al. [17] focus on the comparison of three commonly used word embedding techniques (Word2vec, Fasttext and Glove) on Twitter datasets for sentiment analysis. Mohamed et al. [18] propose an enhanced ensemble classifier framework based on a lexicon-based method, bag-of-words, and pre-trained word embeddings. Seyed et al. [19] propose a novel method, Improved Word Vectors (IWV), which increases the accuracy of pre-trained word embeddings in sentiment analysis.

In the main related works, most word representations are learned from large amounts of document texts, and they ignore the sentiment information in documents. In real-world datasets, documents such as commodity review data may contain rich emotional information. The situation is suboptimal because this sentiment information is ignored when discovering word representations. In this paper, we focus on how to incorporate this emotional information into word representations.

3 Emotional Word Embedding Model

In this section, we propose the Emotional Word Embedding model, named EWE. First, we design two weighting methods (EWE (1) and EWE (delta - idf)) to construct the document features. Then, the emotional polarity of the text is propagated back to the word vectors by a neural network classifier. In this way, we generate word embeddings that include emotional characteristics.

3.1 Problem description

It has been shown that unsupervised word vector learning methods can be used to estimate the probability distributions of words from a large-scale corpus [41]. Under these distributions, words with similar contexts obtain word vectors that are close in spatial distance. However, the word vectors obtained by this method do not reflect the emotional polarity of the words, as shown in Table 1.

Table 1
Similarity of word vectors for good vs bad and good vs great.

Words	Quality	Similarity
good	bad	66.9
good	great	37.1

In Table 1, although ‘bad’ and ‘good’ carry emotionally opposite sentiments, they are similar in both their usage scenarios and contexts; therefore, the resulting vectors are relatively close in space. In contrast, ‘great’ and ‘good’ have similar (both positive) emotional polarities. However, due to their contextual differences, the distance between the vectors of these two terms is relatively large. Therefore, the main research goal of this paper is to reduce the spatial distance between word vectors with the same emotional polarity and to increase the spatial distance between word vectors with opposite polarities. That is, we want the similarity between w₁ and w₂ to be as large as possible if they are synonyms and as small as possible if they are antonyms.

We wish to learn word representations that capture the emotional information of words while maintaining predictive power for supervised tasks. Given a collection of documents d₁, d₂, . . . , d_n with corresponding binary sentiments y₁, y₂, . . . , y_n, the goal is to learn a set of emotional word vectors that are spatially close to other vectors with the same emotional polarity and spatially distant from vectors with a different emotional polarity. The overall goal is to train a classifier that, when given a previously unseen document d, can accurately estimate the sentiment of the document. The notations used in this paper are shown in Table 2.

Table 2

The notations used in this paper

Notation	Description
d₁, d₂, . . . , d_n	A collection of documents.
y₁, y₂, . . . , y_n	The binary sentiments of the documents.
w_t	The word vector.
α_t	The word vector weighting to construct document vectors.
N	The number of negative samples.
P	The number of positive samples.
N_t	The w_t word frequency in the negative samples.
P_t	The w_t word frequency in the positive samples.
d_i	The feature vector of the i-th document.
H	The weight matrix of the hidden layer.
b₁	The bias matrix for the hidden layer.
J ()	The cost function of the training EWE model.

3.2 Emotional word embedding model

We propose the EWE model, which has a 3-layer neural network structure, to train a classifier via a given set of documents d₁, d₂, . . . , d_n and their corresponding emotion labels y₁, y₂, . . . , y_n. Moreover, the model can learn the emotional polarity of the word while completing the emotional classification. The principle of the EWE model is to learn the emotional polarities of the words from the emotional label of the document. Moreover, the model identifies the emotional polarity of a sentence through the emotional words in the document. The architecture of EWE is shown in Fig. 1.

Fig. 1

Architecture of the Emotional Word Embedding (EWE) learning method

Under the EWE model architecture, we set a linear weighting method to vectorize the document for each word vector w₁, w₂ . . . , w_m in the document. In this paper, we propose two methods to set the word vector weighting to construct document vectors.

$α_{t} = \begin{cases} 1 & \text{EWE (1)} \\ log \frac{| N |}{N_{t}} - log \frac{| P |}{P_{t}} & \text{EWE (delta - idf)} \end{cases}$ (1)

As shown in Formula 1, the weight for a word vector w_t can be represented as α_t. The first method of weighting is that all the word weights in the document are set to 1, corresponding to α_t = 1. We call this first weighting method EWE (1). The second weighting method is $α_{t} = log \frac{| N |}{N_{t}} - log \frac{| P |}{P_{t}}$ , where N is the number of negative samples, P is the number of positive samples, N_t is the w_t word frequency in the negative samples and P_t is the w_t word frequency in the positive samples. We call the second weighting method EWE (delta - idf) because it can be regarded as an improvement of the TF - IDF algorithm.
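The second weighting in Formula 1 can be illustrated with a minimal Python sketch. The function name `delta_idf_weights`, the token-list input format, and the additive `smoothing` term are assumptions introduced here for illustration (the smoothing only guards against a zero count for words that occur in a single class); N_t and P_t are taken to be raw occurrence counts, which is one possible reading of "word frequency".

```python
import math
from collections import Counter

def delta_idf_weights(docs, labels, smoothing=1.0):
    """Return {word: alpha_t} following the second case of Formula 1.

    docs      : list of token lists
    labels    : list of 0 (negative) / 1 (positive) sentiment labels
    smoothing : assumption, added to N_t and P_t to avoid division by zero
    """
    n_neg = sum(1 for y in labels if y == 0)  # |N|
    n_pos = sum(1 for y in labels if y == 1)  # |P|
    neg_freq, pos_freq = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_freq if y == 1 else neg_freq).update(tokens)
    vocab = set(neg_freq) | set(pos_freq)
    return {
        w: math.log(n_neg / (neg_freq[w] + smoothing))
        - math.log(n_pos / (pos_freq[w] + smoothing))
        for w in vocab
    }

# Toy usage: words seen mostly in positive documents receive positive weights.
weights = delta_idf_weights([["good", "movie"], ["bad", "movie"]], [1, 0])
```

Under this sketch, a word that is equally frequent in both classes gets a weight of zero and therefore does not contribute to the document vector.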

Based on the above two weighting methods, we perform weighted summation on each dimension to obtain the feature vector d_i of the i - th document. $d_{i} = \sum_{j = 1}^{k} α_{j} w_{j}$ (2)
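As an illustration of Formula 2, the following sketch builds a document vector as a weighted sum of word vectors. The name `document_vector`, the dictionary-based embedding lookup, and the choice to skip out-of-vocabulary tokens are assumptions made for this example only.

```python
def document_vector(tokens, embeddings, weights=None):
    """Weighted sum of word vectors (Formula 2).

    tokens     : list of words in the document
    embeddings : {word: list of floats}, e.g. pre-trained word vectors
    weights    : {word: alpha_t}; None corresponds to EWE (1), i.e. alpha_t = 1
    """
    dim = len(next(iter(embeddings.values())))
    d = [0.0] * dim
    for tok in tokens:
        if tok not in embeddings:
            continue  # assumption: out-of-vocabulary tokens are ignored
        alpha = 1.0 if weights is None else weights.get(tok, 0.0)
        for k, value in enumerate(embeddings[tok]):
            d[k] += alpha * value
    return d

# Toy usage with two-dimensional vectors.
toy_embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
d_uniform = document_vector(["good", "bad", "good"], toy_embeddings)
d_weighted = document_vector(["good", "bad"], toy_embeddings, {"good": 0.5, "bad": -1.0})
```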

For the training samples, the input of the neural network is x = [d₁, d₂, . . . , d_n]

The hidden layer is the same as in a common feedforward neural network and is followed by a fully connected layer. We select the tanh function in Formula 3 as the activation function of the hidden layer. $h = tanh (Hx + b_{1})$ (3) where H is the weight matrix of the hidden layer, and b₁ is the bias matrix for the hidden layer.

3.3 Training EWE

Logistic regression models are widely used in a large body of classification tasks, such as sentiment analysis. In this section, we assume that the probability model of interest is the logistic model. Under this assumption, the calculation of the output layer is performed with a logistic regression model: $P (y | w, θ) = sigmoid (Uh + b_{2})$ (4)

The parameters in the EWE model include the word vector w for each word and the neural network parameters H, U, b₁ and b₂. To learn the parameters, our paper defines the cost function J (H, U, b₁, b₂, w) as follows: $\begin{matrix} J (H, U, b_{1}, b_{2}, w) = \frac{1}{m} \sum_{i = 1}^{m} L ({\hat{y}}_{i}, y_{i}) \\ = - \frac{1}{m} \sum_{i = 1}^{m} [y_{i} log {\hat{y}}_{i} + (1 - y_{i}) log (1 - {\hat{y}}_{i})] \end{matrix}$ (5)

This optimization problem can now be written as a maximum likelihood estimation problem in which the goal is to find a set of parameters that minimizes the cost: $min J (H, U, b_{1}, b_{2}, w)$ (6)
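The forward computation in Formulas 3 and 4 and the cost in Formula 5 can be sketched with NumPy as follows. The array shapes (one document vector per row of `X`, `H` of shape hidden × K, `U` a vector, `b2` a scalar), the function names, and the clipping constant `eps` are assumptions chosen for this illustration rather than details given in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, H, b1, U, b2):
    """X: (m, K) document vectors; returns hidden activations and predictions."""
    hidden = np.tanh(X @ H.T + b1)    # Formula 3, applied row-wise
    y_hat = sigmoid(hidden @ U + b2)  # Formula 4
    return hidden, y_hat

def cross_entropy(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy (Formula 5); eps avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Toy usage: 4 documents, K = 3 dimensions, 2 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
H, b1 = rng.normal(size=(2, 3)), np.zeros(2)
U, b2 = rng.normal(size=2), 0.0
_, y_hat = forward(X, H, b1, U, b2)
cost = cross_entropy(y_hat, np.array([0.0, 1.0, 0.0, 1.0]))
```

With all parameters set to zero, every prediction equals 0.5 and the cost equals log 2, which is a convenient sanity check for an implementation of Formula 5.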

Therefore, we use minibatch gradient descent (MBGD) in our optimization method and use the backpropagation algorithm to calculate the gradient. The training process for the proposed EWE is shown in Algorithm 1.

Algorithm 1 Emotional Word Embedding Training Algorithm

Input: Document feature D = [d₁, d₂, . . . , d_n];

Output: Word vector W and Classification model;

Initialize the word vectors W with word2vec and the remaining parameters randomly;

for each t in iter(T): do

for each n in samples(N): do

Calculate the cost function value J (H, U, b₁, b₂, w);

Calculate the gradient and update the classifier parameters θ;

Calculate the gradient and update the word vectors W that appear in the sentence;

end for

end for

return Word vector W and Classification model;
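Algorithm 1 can be sketched in NumPy as shown below. This is only an illustrative implementation under stated assumptions: documents are given as lists of word indices, `alpha` holds one weight per vocabulary word (Formula 1), and the hidden size, learning rate, batch size, and random initialization scale are hypothetical values that the paper does not specify.

```python
import numpy as np

def doc_matrix(docs, idx, W, alpha):
    """Stack the weighted-sum document vectors (Formula 2) for documents idx."""
    return np.stack([(alpha[docs[i]][:, None] * W[docs[i]]).sum(axis=0) for i in idx])

def forward(X, H, b1, U, b2):
    h = np.tanh(X @ H.T + b1)                      # Formula 3
    y_hat = 1.0 / (1.0 + np.exp(-(h @ U + b2)))    # Formula 4
    return h, y_hat

def train_ewe(docs, labels, W, alpha, hidden=8, epochs=200, batch=2, lr=0.2, seed=0):
    """Mini-batch gradient descent that updates both the classifier and W."""
    rng = np.random.default_rng(seed)
    docs = [np.asarray(d, dtype=int) for d in docs]
    y = np.asarray(labels, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    W = np.array(W, dtype=float)                   # copy of the initial word vectors
    H = rng.normal(scale=0.1, size=(hidden, W.shape[1]))
    U = rng.normal(scale=0.1, size=hidden)
    b1, b2 = np.zeros(hidden), 0.0
    history = []
    for _ in range(epochs):
        order = rng.permutation(len(docs))
        for start in range(0, len(docs), batch):
            idx = order[start:start + batch]
            X = doc_matrix(docs, idx, W, alpha)
            h, y_hat = forward(X, H, b1, U, b2)
            # Backpropagation of the mean cross-entropy cost (Formula 5).
            dz2 = (y_hat - y[idx]) / len(idx)
            dz1 = np.outer(dz2, U) * (1.0 - h ** 2)
            dX = dz1 @ H
            # Update the classifier parameters.
            U -= lr * (h.T @ dz2)
            b2 -= lr * dz2.sum()
            H -= lr * (dz1.T @ X)
            b1 -= lr * dz1.sum(axis=0)
            # Update only the word vectors that appear in each document.
            for row, i in enumerate(idx):
                np.subtract.at(W, docs[i], lr * alpha[docs[i]][:, None] * dX[row])
        _, y_all = forward(doc_matrix(docs, range(len(docs)), W, alpha), H, b1, U, b2)
        y_all = np.clip(y_all, 1e-12, 1 - 1e-12)
        history.append(float(-np.mean(y * np.log(y_all) + (1 - y) * np.log(1 - y_all))))
    return W, (H, b1, U, b2), history

# Toy usage: word ids 0, 1 occur in positive documents and 2, 3 in negative ones.
toy_docs = [[0, 1], [0], [1], [2, 3], [2], [3]]
toy_labels = [1, 1, 1, 0, 0, 0]
W0 = np.random.default_rng(1).normal(size=(4, 5))  # stand-in for word2vec vectors
W_new, params, history = train_ewe(toy_docs, toy_labels, W0, np.ones(4))
```

The gradient with respect to a word vector is the gradient of its document vector scaled by the word's weight, which is how the document label reaches the individual words in this sketch.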

The time complexity of the learning algorithm is O (T · K · |N| · L), where T is the number of iterations until convergence, N is the number of negative samples, K is the dimension of the representation vectors, which means that we need to learn O (K) parameters for each word, and L is the maximum length of a document. Moreover, the space complexity of EWE is O (|W| · K), since we need to learn a K-dimensional vector for each word w ∈ W.

4 Experimental Results and Analysis

In this section, we demonstrate the efficacy and efficiency of the presented EWE (1) and EWE (delta - idf) models for learning emotional word representations. We evaluate the models empirically on three tasks: sentiment analysis, the similarity of the sentiment text, and sentiment word analysis. We perform the evaluation on three datasets: IMDB, Yelp, and Amazon. The implementation process of the EWE model is described in Fig. 2.

Fig. 2

The flow chart for the implementation process of the EWE model.

4.1 Datasets and preprocessing

We use three datasets to train EWE. The IMDB dataset contains 1,000 movie reviews, the Yelp dataset includes 1,000 food reviews, and the Amazon dataset contains 1,000 product reviews. All three datasets are balanced and consist of 500 positive-emotion samples and 500 negative-emotion samples. All three datasets can be downloaded from the UCI repository; the sample content in the datasets is shown in Table 3.

Table 3
Samples of IMDB, Yelp and Amazon

Dataset	Sample
IMDB	Actually, the graphics were good at the time.
Yelp	Service was very prompt.
Amazon	Excellent bluetooth headset.

Generally, English text data do not need word segmentation, nor is it necessary to consider the problem of coding conversion. However, in English, spelling is a crucial problem that must be considered. Because of the particular characteristics of the English language and grammatical considerations such as the grammatical case, singular and plural forms, and the grammatical person, words have different forms. If this aspect is not addressed, various forms of the same word may be mistakenly classified as different words, which greatly interferes with the results.

For the training datasets, this paper first uses the NLTK toolkit in Python to preprocess the English texts. The preprocessing operations include removing stop words, removing special symbols, converting uppercase to lowercase, lemmatization, and so on. Stop word removal drops words from the text that do not contribute relevant meaning. Lemmatization removes affixes from the words based on WordNet and extracts the base form of each word. An example is listed in Table 4.

Table 4

Effect of data preprocessing in emotional word embedding

Before processing	After processing
I bought this to use with my Kindle Fire and absolutely loved it!	buy use kindle fire absolutely love
This product is ideal for people like me whose ears are very sensitive.	product ideal people like whose ear sensitive.
This case seems well made.	case seem well make.
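The preprocessing steps described above can be sketched in plain Python as follows. The tiny stop-word set and the toy lemma dictionary are hypothetical stand-ins so that the example runs without downloading NLTK resources; in a setup like the one described in the paper, NLTK's stop-word list and `WordNetLemmatizer` could be passed in instead.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the list used in the paper.
STOP_WORDS = {"i", "this", "to", "with", "my", "and", "it", "is",
              "for", "me", "are", "very", "the", "a"}

def preprocess(text, lemmatize=lambda w: w, stop_words=STOP_WORDS):
    """Lowercase, strip special symbols, drop stop words, then lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in stop_words]
    return [lemmatize(t) for t in tokens]

# Toy lemma lookup (assumption) standing in for a WordNet-based lemmatizer.
TOY_LEMMAS = {"seems": "seem", "made": "make"}
tokens = preprocess("This case seems well made.",
                    lemmatize=lambda w: TOY_LEMMAS.get(w, w))
```

Keeping the lemmatizer as a parameter makes it easy to swap the toy lookup for a real one without changing the rest of the pipeline.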

4.2 Results for the sentiment analysis task

In this section, we first list the baselines that we compare with our method and describe the evaluation protocol; then, we provide an analysis of the results.

4.2.1 Baseline

To examine its effectiveness, we compared EWE against the following baselines:

Naive Bayes classifier (NB): This baseline applies the classic Naive Bayes classifier to bag-of-words features for sentiment classification. We use the implementation in the NLTK toolkit.

Two-step (TS) [43]: This baseline is presented to test the effectiveness of unsupervised embedding algorithms such as word2vec and LSA as features for document sentiment classification. TS performs sentiment analysis in two steps. First, it learns unigram word embeddings in an unsupervised fashion and then uses these word embeddings to obtain document embeddings from a weighted linear combination. Second, it learns a logistic regression classifier for sentiment analysis based on the obtained document embeddings.

Recursive neural tensor network (RNTN): Socher [44] proposed the recursive neural tensor network (RNTN), which learns compositionality from texts of varying lengths and learns the classification in a supervised fashion with fine-grained sentiment labels. In our paper, EWE is intended to perform binary classification. Therefore, we also apply RNTN in a binary classification framework.

Supervised word embeddings for sentiment analysis (SWESA) [43]: SWESA leverages document label information to learn vector representations of words from a modest corpus of text documents by solving an optimization problem that minimizes a cost function that considers both word embeddings and classification accuracy.

4.2.2 Evaluation

The performances are assessed based on accuracy on the test set, and the metrics are defined as follows:

Precision (PR) is the ratio of the number of true positives (TP) to the number of true positives (TP) + false positives (FP) $PR = \frac{TP}{TP + FP}$ (7)

Recall (RE) is the ratio of the number of true positives (TP) to the number of true positives (TP) + false negatives (FN) $RE = \frac{TP}{TP + FN}$ (8)

F-measure (F1) is the harmonic mean of Precision and Recall $F 1 = \frac{2 * PR * RE}{PR + RE}$ (9)

Area Under the Curve (AUC) is obtained by applying the trapezoidal rule to calculate the area under the ROC curve. The horizontal axis of the ROC curve is the false positive rate (FPR, as shown in Formula 11), and the vertical axis is the true positive rate (TPR, as shown in Formula 10). In a classification task, the higher the TPR and the lower the FPR, the better the classification effect. The ROC curve can be understood as the relationship between TPR and FPR under different threshold values. The area under the ROC curve is the AUC, which is often used to measure the performance (generalization ability) of machine learning algorithms on binary classification problems. The larger the AUC value, the better the classifier fits the data.

$TPR = \frac{TP}{TP + FN}$ (10)

$FPR = \frac{FP}{FP + TN}$ (11)
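For concreteness, the metrics in Formulas 7 to 11 can be computed with the short sketch below. The function names and the convention of returning 0 when a denominator is zero are assumptions for this example; the AUC is obtained by sweeping the decision threshold over the observed scores and applying the trapezoidal rule to the resulting (FPR, TPR) points.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    pr = tp / (tp + fp) if tp + fp else 0.0            # Formula 7
    re = tp / (tp + fn) if tp + fn else 0.0            # Formula 8
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0   # Formula 9
    return pr, re, f1

def roc_auc(y_true, scores):
    """Trapezoidal area under the ROC curve; needs both classes present."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        points.append((fp / neg, tp / pos))            # (FPR, TPR), Formulas 11 and 10
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy usage.
pr, re, f1 = precision_recall_f1([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```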

4.2.3 Results analysis

In this study, the experiments were performed using three data sets: IMDB, Yelp, and Amazon. We obtained average values by using 10-fold cross-validation. The number of iterations was determined experimentally, that is, when the change in the target function value was less than an absolute threshold (≤10^-5), the iteration is terminated.

We compare the results of the proposed model with those of the benchmark models with regard to text emotion classification tasks, as shown in Table 5 to Table 8.

Table 5
The result of text sentiment classification on IMDB

Models	PR	RE	F1
EWE (delta - idf)	0.82	0.82	0.82
EWE (1)	0.78	0.79	0.78
SWESA(LDA)	0.76	0.78	0.77
SWESA(word2vec)	0.77	0.78	0.77
TS(LDA)	0.7	0.72	0.71
TS(word2vec)	0.57	0.56	0.56
NB	0.76	0.75	0.75
RNTN	0.54	0.60	0.57

Table 6

The result of text sentiment classification on Yelp

Models	PR	RE	F1
EWE (delta - idf)	0.84	0.85	0.84
EWE (1)	0.81	0.82	0.81
SWESA(LDA)	0.78	0.76	0.77
SWESA(word2vec)	0.78	0.78	0.78
TS(LDA)	0.76	0.78	0.77
TS(word2vec)	0.65	0.68	0.66
NB	0.7	0.72	0.71
RNTN	0.51	0.52	0.51

Table 7

The result of text sentiment classification on Amazon

Models	PR	RE	F1
EWE (delta - idf)	0.86	0.88	0.87
EWE (1)	0.82	0.82	0.82
SWESA(LDA)	0.80	0.78	0.79
SWESA(word2vec)	0.80	0.82	0.81
TS(LDA)	0.77	0.78	0.77
TS(word2vec)	0.71	0.71	0.71
NB	0.73	0.70	0.71
RNTN	0.49	0.50	0.49

Table 8

AUC of text sentiment classification

Models	IMDB	Yelp	Amazon
EWE (delta - idf)	0.84	0.88	0.90
EWE (1)	0.85	0.89	0.91
SWESA(LDA)	0.81	0.88	0.88
SWESA(word2vec)	0.81	0.86	0.87
TS(LDA)	0.78	0.83	0.85
TS(word2vec)	0.59	0.69	0.77
NB	0.48	0.57	0.61

As the results show, compared with the other methods, the two EWE models achieve better results on the emotion classification task (PR, RE, and F1). Moreover, the EWE (delta - idf) method is the most effective because the pre-trained word vector contains a wealth of emotional information, and the use of the linear weighting method more effectively combines semantic and emotional polarity. The average weighting method performs a greater degree of correction to the word vector itself, which may destroy the contained semantics, resulting in a slight decline in the results.

4.3 Results regarding the similarity of sentiment texts

Another goal of this paper is to determine the similarity of emotional texts. This paper investigates the degree of similarity between sentences by calculating the similarity between text feature vectors. From the viewpoint of emotional similarity analysis, sentences highly similar to sentences with positive emotions should also contain positive emotions. In the same way, sentences highly similar to sentences with negative emotions should contain negative emotions.

In this paper, we use the cosine to calculate the similarity of two sentence vectors, because the cosine measures the difference between the directions of two vectors. Let $(x_{1}, x_{2}, \ldots, x_{N})$ and $(y_{1}, y_{2}, \ldots, y_{N})$ denote the feature vectors of two sentences, where $N$ is the dimension of the feature vector. The similarity is then $similarity = \frac{\sum_{i = 1}^{N} x_{i} y_{i}}{\sqrt{\sum_{i = 1}^{N} x_{i}^{2}} \sqrt{\sum_{i = 1}^{N} y_{i}^{2}}}$ (12)
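Equation (12) can be written directly as a short function. This is a minimal sketch using only the Python standard library; the name `cosine_similarity` and the handling of zero vectors are choices made here for illustration, not details given in the paper.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors, as in Eq. (12)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension N")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # The cosine is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Parallel vectors give 1, orthogonal vectors 0, opposite vectors -1.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Because the value depends only on direction, rescaling either sentence vector leaves the similarity unchanged.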

To evaluate the method's performance, the three sentences most similar to a query sentence are retrieved using the text features obtained by EWE (1), EWE (delta - idf), and Two-Step(word2vec).

Take the sentence “First off the reception sucks, I have never had more than 2 bars, ever.” as an example. This sentence clearly has negative emotional polarity, and the three sentences most similar to it under each method are shown in Table 9.

Table 9

Result of sentence similarity

Method	Similar sentence	Sentiment
EWE (1)	Top1. The first thing that happened was that the tracking was off.	neg
	Top2. But when I check voice mail at night, the keypad backlight turns off a few seconds into the first message, and then I’m lost.	neg
	Top3. I ordered this for sony Ericsson W810i but I think it only worked once (thats when I first used it).	neg
EWE (delta - idf)	Top1. The one big drawback of the MP3 player is the buttons on the phone’s front cover that let you pause and skip songs lock out after a few seconds.	neg
	Top2. The worst phone I’ve ever had... Only had it for a few months.	neg
	Top3. It lasts less than 30 minutes, if I actually try to use the phone. My wife has the same phone with the same problem.	neg
Two-Step(word2vec)	Top1. But it does get better reception and clarity than any phone I’ve had before.	pos
	Top2. None of the new ones have ever quite worked properly.	neg
	Top3. In the span of an hour, I had two people exclaim “Whoa - is that the new phone on TV?!?”	pos

Both weighting methods in EWE were used to compute the text vectors that serve as features for the similarity calculation, and the three sentences with the highest similarity were then retrieved. The emotional polarity of the sentences retrieved by the EWE methods is consistent with that of the query sentence. In contrast, the sentences retrieved by the two-step method do not distinguish emotional polarity well: two of its three results are positive. The text similarity calculation gives similar results on the other data sets. The EWE model can thus incorporate the emotional polarity of the document into the word vectors well, providing better features for sentiment analysis.
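The retrieval step behind Table 9 can be sketched as a ranking by cosine similarity. The following is a minimal sketch under stated assumptions: the two-dimensional vectors, sentence identifiers, and the function name `top_k_similar` are hypothetical toy values for illustration, not data or code from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors, as in Eq. (12)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def top_k_similar(
    query: Sequence[float],
    corpus: List[Tuple[str, Sequence[float], str]],
    k: int = 3,
) -> List[Tuple[float, str, str]]:
    """Return the k corpus entries most similar to the query vector.

    Each corpus entry is (sentence id, sentence vector, sentiment label).
    """
    scored = [(cosine(query, vec), sid, label) for sid, vec, label in corpus]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical toy sentence vectors; a real run would use EWE text vectors.
corpus = [
    ("s1", [0.9, 0.1], "neg"),
    ("s2", [-1.0, 0.3], "pos"),
    ("s3", [0.8, 0.4], "neg"),
    ("s4", [0.1, 1.0], "pos"),
]
query_vector = [1.0, 0.2]  # stands in for the vector of a negative query sentence

for score, sid, label in top_k_similar(query_vector, corpus, k=3):
    print(f"{sid}\t{label}\t{score:.3f}")
```

Comparing the labels of the retrieved sentences with the label of the query gives the polarity-consistency check reported in Table 9.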

4.4 Analysis of the results on sentiment words

In this paper, by calculating the similarities between word vectors for synonyms and antonyms, we investigate the emotional polarity contained in word vectors and then examine whether this training method can distinguish words with similar context but opposite sentiment polarity.

To investigate emotional polarity, we compare the word vectors obtained after training our model with the word vectors pre-trained by word2vec. The word similarity is calculated in the same way as the sentence similarity, and the angle reflects the similarity. Here, the similarity is treated as the cosine of an angle in two-dimensional space, and the corresponding angle is computed as $angle = \arccos (similarity)$ (13)
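Equation (13) can be checked against the word2vec columns of Table 10 with a few lines of code. This minimal sketch assumes that the similarities in Table 10 are cosine values scaled by 100 (for example, 66.9 stands for 0.669); the function name `similarity_to_angle` is illustrative.

```python
import math


def similarity_to_angle(similarity: float) -> float:
    """Angle in degrees whose cosine equals the similarity, as in Eq. (13)."""
    # Clip to [-1, 1] to guard against small floating-point overshoot.
    clipped = max(-1.0, min(1.0, similarity))
    return math.degrees(math.acos(clipped))


# word2vec similarities from Table 10, assumed to be scaled by 100.
for word, sim_percent in [("good", 66.9), ("excellent", 27.1)]:
    angle = similarity_to_angle(sim_percent / 100)
    print(f"bad vs {word}: {angle:.0f} degrees")
```

Under this scaling assumption, the computed angles round to the 48° and 74° reported in the table, and negative similarities map to angles above 90°.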

Table 10 presents an antonym comparison. Taking the word “bad” as an example, the words “good”, “great” and “excellent” all have sentiment polarity opposite to that of “bad”. The table shows that the similarity between “bad” and these three words is negative in the EWE (1) model, which is the opposite of the result obtained with the word2vec vectors and indicates that the terms are further apart in vector space.

Table 10

Antonyms of “bad”

bad	EWE (1)		EWE (delta - idf)		word2vec
	sim	angle	sim	angle	sim	angle
good	-58.8	126°	66	49°	66.9	48°
great	-78.3	141.6°	19.8	79°	17.3	80°
excellent	-64.2	130°	27.6	74°	27.1	74°

By treating the similarity as the cosine value of a 2D angle, a mapping diagram of the word vector in 2D space can be visualized:

Figure 3 shows that the word vectors trained by the EWE (1) model can reveal words with similar contexts but opposite emotional polarities because they are further apart in vector space.

Fig. 3

Similarity between the benchmark word “bad” and the words “good”, “excellent” and “great”.
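One plausible way to build such a 2D mapping is to place the benchmark word on the positive x-axis and every other word on the unit circle at the angle given by Eq. (13). The sketch below is an assumption about the construction, not the authors' plotting code, and the name `to_2d` is hypothetical.

```python
import math
from typing import Tuple


def to_2d(similarity: float) -> Tuple[float, float]:
    """Unit-circle point at the angle arccos(similarity) from the x-axis."""
    clipped = max(-1.0, min(1.0, similarity))
    angle = math.acos(clipped)
    return (math.cos(angle), math.sin(angle))


benchmark = (1.0, 0.0)  # the benchmark word, e.g. "bad", lies on the x-axis

# EWE (1) similarities from Table 10, assumed to be scaled by 100.
for word, sim_percent in [("good", -58.8), ("great", -78.3), ("excellent", -64.2)]:
    x, y = to_2d(sim_percent / 100)
    print(f"{word}: ({x:.3f}, {y:.3f})")
```

In this construction the x-coordinate equals the similarity itself, so words with negative similarity fall on the opposite side of the vertical axis from the benchmark word.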

Table 11 presents a synonym comparison. Taking the word “good” as an example, “great”, “excellent” and “nice” are close to “good” in positive emotional polarity. The similarity between “good” and these words is relatively high in the EWE (1) model. Combined with the antonym results, this indicates that the model has successfully transferred the emotional polarity of the text to the emotional words, which is reflected in the spatial distances between the word vectors.

Table 11

Synonyms of “good”

good	EWE (1)		EWE (delta - idf)		word2vec
	sim	angle	sim	angle	sim	angle
great	86.0	31°	37.9	68°	37.1	68°
excellent	81.7	35°	55.8	56°	57.6	55°
nice	83.5	33°	52.1	59°	51.7	59°

Figure 4 shows that training with the EWE (1) model brings the vectors of words with different contexts but similar emotional polarity closer together in vector space. However, the word vectors obtained by the EWE (delta - idf) method differ little from the original pre-trained word vectors. The reasons may be as follows. (1) In EWE (delta - idf), the emotional information of words is added at the level of the word weights, so that positive and negative emotion is already taken into account during classification; thus, the word vectors do not change much during training. (2) In EWE (1), the weight is initially set to 1; that is, the word vectors of each text are simply summed. The weights then carry no emotional information, so the emotion of the text is propagated into the word vectors more effectively during the emotional classification training process, which yields the emotional word vectors.

Fig. 4

Similarity among good and great, excellent, nice.
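The role of the weights in the explanation above can be illustrated with a single gradient step. The following is a minimal sketch, assuming a single logistic unit in place of the paper's neural network classifier and hypothetical toy values for the word vectors and classifier parameters; it is not the authors' implementation. With a document vector $d = \sum_{i} w_{i} v_{i}$, the gradient of the log loss with respect to word vector $v_{i}$ is $w_{i} (p - y) u$, so the update to each word vector is scaled by its weight.

```python
import math
from typing import List

Vector = List[float]


def doc_vector(word_vecs: List[Vector], weights: List[float]) -> Vector:
    """Weighted sum of word vectors; weights of 1 give the plain sum."""
    dim = len(word_vecs[0])
    return [sum(w * v[j] for w, v in zip(weights, word_vecs)) for j in range(dim)]


def predict(doc: Vector, u: Vector, b: float) -> float:
    """Probability of the positive class from a single logistic unit."""
    z = sum(ui * di for ui, di in zip(u, doc)) + b
    return 1.0 / (1.0 + math.exp(-z))


def update_word_vectors(word_vecs: List[Vector], weights: List[float],
                        u: Vector, b: float, label: int, lr: float) -> List[Vector]:
    """One gradient-descent step on the word vectors for the log loss."""
    error = predict(doc_vector(word_vecs, weights), u, b) - label  # dLoss/dz
    # dz/dv_i = w_i * u, so each word vector moves in proportion to its weight.
    return [[vj - lr * w * error * uj for vj, uj in zip(v, u)]
            for w, v in zip(weights, word_vecs)]


# Hypothetical toy values: two words, two dimensions, one positive document.
vecs = [[0.2, -0.1], [0.4, 0.3]]
u, b = [0.5, -0.5], 0.0
ones = [1.0, 1.0]  # the EWE (1) setting: every weight is 1

before = predict(doc_vector(vecs, ones), u, b)
new_vecs = update_word_vectors(vecs, ones, u, b, label=1, lr=0.1)
after = predict(doc_vector(new_vecs, ones), u, b)
print(f"P(positive) before: {before:.3f}, after: {after:.3f}")
```

In this simplified setting, the label of the document is pushed into the word vectors themselves, and a word with a smaller weight receives a proportionally smaller update.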

5 Conclusion

In this paper, a word vector training method for text emotion classification is proposed. First, word2vec is used to pre-train the word vectors, and the pre-trained word vectors are combined by two weighting methods to obtain a document feature vector, which is used as the input to the classifier. Then, the training samples are used to train the classifier, and the sentiment polarity of the text is transferred to the word vectors through gradient descent and back-propagation. The experimental results reveal that the emotional word vector accomplishes the following goals.

It achieves better results than the baseline models on the text emotion prediction task. That is, given a new, unseen sentence, the model can use the word vectors with sentiment polarity to perform the emotional classification.

In the sentence similarity calculation task, the emotional polarity of a sentence obtained by calculating sentence similarity remains consistent with the target.

In the word emotional comparison task, words with similar contexts but opposite emotional polarity can be regarded as antonyms. After training, the spatial distance between such words is greater, whereas words with different contexts but similar emotional polarities are spatially closer.

There are still several open problems that should be investigated further. (1) Our models only consider documents with explicit sentiment. In many cases, there are also many documents with implicit sentiment, and we will exploit such documents in word representation learning. (2) We may also explore a topical word embedding model for text analysis.

Footnotes

Acknowledgements

This research is supported by the National Natural Science Foundation of China (Grant No.62072288, 61303167, 61702306, U1931207), Shandong Provincial Natural Science Foundation, China (ZR2017BF015), SDUST Research Fund (2015TDJH102), the Humanities and Social Science Research Project of the Ministry of Education (18YJAZH017), the Taishan Scholar Program of Shandong Province (Grant No.ts20190936).

UCI Machine Learning Repository
