Abstract
Word embeddings have been successfully used in diverse Natural Language Processing tasks, including sentiment analysis and emotion classification, even though these embeddings do not contain any emotional or sentimental information. This article proposes a method to refine pre-trained embeddings with emotional and sentimental content. To this end, a Multi-output Neural Network is proposed to learn emotions and sentiments simultaneously. The resulting embeddings are tested in emotion classification and sentiment analysis tasks, showing an improvement over the pre-trained vectors and other state-of-the-art proposals for fine-grained emotion classification.
Introduction
In Natural Language Processing (NLP), word representation is essential for many tasks. Two types of word representations exist in NLP: the first one proposed was one-hot encoding, where each word is represented by its index in a vocabulary; the second one, introduced by [2], is a vector-based model (also known as embeddings) where each word is represented in an n-dimensional space as a vector of continuous numbers. In contrast to one-hot encoding, which does not capture any semantic information, embeddings encode similarities between words as the distance or angle between word vectors, capturing various lexico-semantic relations. Despite their advantages, embeddings do not carry any emotional information; however, this limitation has not reduced their usage in emotion classification and sentiment analysis tasks.
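The contrast between the two representations can be illustrated with a minimal sketch. The vocabulary and the dense vectors below are made-up toy values, not taken from any trained model; the sketch only shows that one-hot vectors of distinct words are always orthogonal, whereas dense vectors can place related words at a small angle.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vocabulary (assumption, for illustration only).
vocab = ["happy", "joyful", "table"]

def one_hot(word):
    """One-hot vector: 1.0 at the word's vocabulary index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Made-up dense vectors standing in for real embeddings.
toy_embeddings = {
    "happy": [0.9, 0.8, 0.1],
    "joyful": [0.8, 0.9, 0.2],
    "table": [-0.1, 0.2, 0.9],
}

one_hot_sim = cosine(one_hot("happy"), one_hot("joyful"))
dense_sim_related = cosine(toy_embeddings["happy"], toy_embeddings["joyful"])
dense_sim_unrelated = cosine(toy_embeddings["happy"], toy_embeddings["table"])
```

With these toy values, the one-hot similarity is zero for any pair of distinct words, while the dense vectors give a higher similarity to the related pair than to the unrelated one.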
Recent research in sentiment-emotion classification tasks focuses on deep learning methods using transformers. This paper instead focuses on using sentiment-emotion word embeddings to improve classification. To this end, a refinement of pre-trained word embeddings is proposed, using a Multi-output Neural Network trained on different sentiment-emotion lexicons 1 .
This paper is organized as follows: Section 2 reviews the state of the art; Section 3 describes the proposed model in detail; Section 4 describes the corpora used for comparison with the state of the art and presents the results; and Section 5 presents the conclusions and proposals for future work.
Related work
Two main approaches have been proposed in the state of the art to incorporate sentiment information in embeddings [15]: (1) learning sentiment word embeddings from scratch with a combination of supervised and unsupervised learning (described in 2.1); and (2) refining pre-trained word embeddings using sentiment and emotion lexicons as resources (described in 2.2).
For the first approach, [8] proposed two models: one to capture semantic similarities using a probabilistic model similar to Latent Dirichlet Allocation (LDA) to learn the word’s association strength with respect to each latent topic; the second model was used to capture word sentiment using logistic regression to predict the polarity of documents. The combination of these models maximizes the sum of the objective functions of the models. The authors tested their approach on the sentiment classification task using three corpora: the Polarity Dataset and the Subjectivity Dataset, both proposed by [17], and their own dataset collected from IMDB (Internet Movie Database, [10]). The results showed better accuracy for the semantic model than for the combined model in two datasets (Polarity and Subjectivity).
Later, [30] proposed SSWE (Sentiment Specific Word Embedding), using a balanced dataset of 10M tweets labeled as positive or negative. The authors developed a model that captures the sentiment information of sentences and the syntactic contexts of words using a specific loss (a linear combination of two hinge losses). To test their proposal, the authors used the dataset of the International Workshop on Semantic Evaluation 2013 (SemEval-2013) [16]. Compared to other algorithms evaluated on the same dataset, the proposal achieved the best classification accuracy.
Refining pre-trained word embeddings
For the second approach, [1] proposed Emotion Word Embeddings (EWE), an emotion-enriched word representation that refines GloVe embeddings [19]. The authors used a Long Short-Term Memory (LSTM) and a Neural Network (NN) with a hidden layer, with the embedding matrix E (initialized with GloVe) added to the input layer. The authors used the six basic emotions of [3] (anger, surprise, disgust, enjoyment, fear, and sadness) for the word-emotion vector; this vector was concatenated to each of the inputs of the LSTM, and the outputs were then passed to the NN for the classification of the emotions. The final representation of EWE is the embedding matrix E after the training. To measure the emotion similarity, the authors clustered emotionally similar words, using the formula proposed by [30] and the emotion lexicon DepecheMood [28]. EWE outperformed the other embeddings: GloVe, Word2Vec [9], and SSWE [30].
In [14], building on retrofitting [4] (a technique that uses relational information from semantic lexicons to encourage related words to have comparable vector representations, improving vector space representations), the authors proposed the counter-fitting method, which uses a loss function to inject antonymy and synonymy constraints into vector space representations. Several works have since used counter-fitting to generate emotional embeddings.
Afterwards, [24] used counter-fitting to generate emotional embeddings using the eight basic emotion categories proposed by [20] (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and the National Research Council Canada Word-Emotion Association Lexicon (NRC-EmoLex) [13]. The authors showed that applying counter-fitting to different pre-trained vectors improves emotional similarity, based on the emotion categorization of [26].
A similar approach was taken by [6], where the embeddings were obtained in two steps: first, the Valence, Arousal, and Dominance (VAD 2 ) [32] values were concatenated to the pre-trained embeddings; second, counter-fitting was applied. The authors used the International Survey on Emotion Antecedents and Reactions (ISEAR) [31] and Twitter Emotion Corpus (TEC) [11] datasets to test their approach, comparing the results with the pre-trained embeddings. The authors showed that concatenating VAD improves classification accuracy for all the pre-trained embeddings, whereas applying the second step yielded lower performance than the pre-trained embeddings.
In [15], the authors proposed Sentiment-Aware Word Embedding (SAWE) to refine GloVe by combining semantic and sentimental aspects of words. To obtain the sentiment aspect, the authors used a Feed-Forward Neural Network model to predict the polarity score (strongly positive, weakly positive, strongly negative, or weakly negative) of pre-trained word embeddings using the combination of two lexicons: the Extended version of Affective Norms of English Words (E-ANEW) [32] and Subjectivity clue [33]. After the training, the senti-embeddings were obtained from a linear combination of the input matrix and the output of the hidden layer. Finally, the authors used Principal Component Analysis (PCA) to reduce the dimensionality of the concatenation of the input pre-trained word vectors with the senti-embeddings. Using the Stanford Sentiment Treebank (SST) [27] and SemEval-2013 [16] datasets, the authors showed an improvement in accuracy with respect to pre-trained embeddings and other state-of-the-art proposals, including SSWE.
[25] proposed a method to generate affective embeddings using retrofitting and VAD (Valence, Arousal, and Dominance). The authors used a Multi-Layer Perceptron to learn a transformation function using a custom loss function, and applied this transformation function to pre-trained vectors to obtain the final emotional embeddings. The authors used the SST, SemEval-2017 (International Workshop on Semantic Evaluation 2017) [23], and Mustard++ [22] datasets to test their proposal, obtaining the best micro F1-score on all datasets compared to pre-trained embeddings and other state-of-the-art works, such as EWE and counter-fitting.
This work takes an approach similar to the state of the art: it proposes using a Multi-output Neural Network to learn emotional embeddings using a combination of lexicons and PCA.
Proposed work: refinement of pre-trained word embeddings
Multi-output Neural Networks are used to predict multiple outputs given an input, allowing diverse output data types. In Natural Language Processing, sub-fields of Multi-output Learning have been used for different applications: Document Categorization, Language Translation, and Named Entity Recognition, among others [34].
This paper proposes the use of a Multi-output Neural Network for the refinement of pre-trained word embeddings. The proposal consists of three steps: (1) Learn a transformation function, using a Multi-output Neural Network and a lexicon of around 20,000 words, that maps pre-trained embeddings to an intermediate representation, referred to as senti-embeddings. (2) Use the learned transformation function to obtain the senti-embedding representation of every word present in the pre-trained embeddings. (3) Concatenate the original pre-trained embeddings with the senti-embeddings to preserve part of the semantic information, and reduce the resulting vector with PCA to obtain the final embeddings with a dimension of 300.
Each part of the proposal is described in detail in the following subsections.
A Multi-output Neural Network was trained to simultaneously learn emotional content (represented by Valence, Arousal, and Dominance) and polarity content (negative, positive, and neutral). The purpose of this network was to learn a non-linear transformation function that generates VAD-polarity enriched embeddings. The following lexicons were used for the training: NRC-VAD 3 , NRC-EmoLex 4 , and the Subjectivity clue lexicon.
The network has two outputs: a regression output for predicting the VAD values and a multi-class classification output for predicting the polarity. A specific loss was used for each output, and the total loss is the sum of the individual losses per output. GloVe embeddings were used as input for the network 5 . Ten percent of the lexicon was held out for validation. The following subsections describe each part of the Multi-output Model in detail. Figure 1 shows the architecture of the proposed model.

Figure 1. Multi-output Neural Network architecture. The model receives as input 20,618 GloVe embeddings (for those words in the VAD lexicon). Each input vector has a VAD value and a polarity value at the output.
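The two-output structure can be sketched as a forward pass with one hidden branch per output and a total loss that sums the per-output losses. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, the tanh activation, the random weights, and the unweighted losses are all hypothetical choices made to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the paper feeds GloVe vectors as input.
EMB_DIM, HIDDEN = 8, 4

# One hidden layer per output branch (number and size of layers are assumptions).
W_vad_hidden = rng.normal(size=(EMB_DIM, HIDDEN))
W_pol_hidden = rng.normal(size=(EMB_DIM, HIDDEN))
W_vad_out = rng.normal(size=(HIDDEN, 3))  # regression head: V, A, D
W_pol_out = rng.normal(size=(HIDDEN, 3))  # classification head: 3 polarity classes

def forward(x):
    """Return (VAD predictions, polarity probabilities) for a batch x."""
    h_vad = np.tanh(x @ W_vad_hidden)
    h_pol = np.tanh(x @ W_pol_hidden)
    vad = h_vad @ W_vad_out  # linear activation on the regression output
    logits = h_pol @ W_pol_out
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    pol = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    return vad, pol

def total_loss(vad_pred, vad_gold, pol_pred, pol_gold):
    """Sum of the per-output losses (unweighted here for brevity)."""
    mse = np.mean((vad_gold - vad_pred) ** 2)
    cce = -np.mean(np.sum(pol_gold * np.log(pol_pred + 1e-12), axis=1))
    return mse + cce

x = rng.normal(size=(5, EMB_DIM))
vad_pred, pol_pred = forward(x)
loss = total_loss(vad_pred, np.zeros((5, 3)), pol_pred, np.eye(3)[[0, 1, 2, 0, 1]])
```

Keeping a separate hidden branch per output is what later allows the last hidden layers of both branches to be concatenated into a single intermediate representation.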
The initial size of the NRC-VAD lexicon is 19,971 words. To increase the number of words that have both a VAD value and a GloVe vector, any word present in GloVe whose lemma is present in VAD was added to the lexicon with the VAD value of its lemma. The final lexicon contains 20,618 words.
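The lemma-based expansion described above can be sketched as follows. The function name, the `lemmatize` callable, and the toy values are hypothetical; the source does not state which lemmatizer was used.

```python
def expand_vad_lexicon(vad_lexicon, glove_vocab, lemmatize):
    """Add GloVe words whose lemma is in the VAD lexicon, reusing the lemma's VAD value.

    vad_lexicon: dict mapping word -> (valence, arousal, dominance)
    glove_vocab: iterable of words that have a GloVe vector
    lemmatize:   callable mapping a word to its lemma (hypothetical helper)
    """
    expanded = dict(vad_lexicon)
    for word in glove_vocab:
        if word in expanded:
            continue  # already has its own VAD value
        lemma = lemmatize(word)
        if lemma in vad_lexicon:
            expanded[word] = vad_lexicon[lemma]
    return expanded

# Toy data with made-up VAD values, for illustration only.
toy_vad = {"run": (0.6, 0.7, 0.6)}
toy_vocab = ["run", "running", "chair"]
toy_lemmas = {"running": "run"}

expanded = expand_vad_lexicon(toy_vad, toy_vocab, lambda w: toy_lemmas.get(w, w))
```

In this toy case "running" inherits the value of "run", while "chair" is left out because neither it nor its lemma appears in the lexicon.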
The regression output is the VAD value of each input word, i.e., the output layer has a size of 3 (corresponding to each VAD value) and a linear activation. Similar to the approach of [25], each of the VAD values per word was weighted using the density-based weighting scheme (DenseWeight) [29], using an alpha value of 1. The values obtained were averaged to obtain a weighting per word. Weighted Mean Squared Error (WMSE), shown in Equation 1, was used. All the words in the VAD lexicon were used to create the dataset. The pre-trained GloVe embeddings were used as input, considering only those values present in VAD; for words in VAD but not in GloVe, a random uniform initialization was used.
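A plausible form of Equation 1, consistent with the definitions that follow and assuming $w_i$ denotes the DenseWeight-derived weight of sample $i$ (a symbol introduced here for illustration, not taken from the source), is:

```latex
\begin{equation}
\mathrm{WMSE} = \frac{1}{N} \sum_{i=1}^{N} w_i \left( y\_gold_i - y\_pred_i \right)^2
\end{equation}
```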
where $N$ is the number of samples; $y\_gold_i$ is the target value for sample $i$; and $y\_pred_i$ is the predicted value for sample $i$.
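A weighted mean squared error of this kind can be sketched in a few lines. The sketch assumes one weight per word and an average of the squared error over the three VAD dimensions; both the function name and that averaging choice are assumptions, and the weights are taken as given rather than computed with DenseWeight.

```python
def weighted_mse(y_gold, y_pred, weights):
    """Weighted MSE over words.

    y_gold, y_pred: lists of (V, A, D) triples, one per word
    weights:        list with one weight per word
    """
    total = 0.0
    for gold, pred, w in zip(y_gold, y_pred, weights):
        # Squared error averaged over the VAD dimensions (assumption).
        sq_err = sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)
        total += w * sq_err
    return total / len(y_gold)

# A higher weight makes the same error count more.
low = weighted_mse([(1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [1.0])
high = weighted_mse([(1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [3.0])
```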
The target of the multi-class classification output is obtained by combining the Subjectivity clue lexicon and NRC-EmoLex, using only those words that also exist in the VAD lexicon.
NRC-EmoLex
For the NRC-EmoLex lexicon, only two labels of the original dataset were used, negative and positive, plus a dimension for neutral words. Given that all words in NRC-EmoLex are present in NRC-VAD, only those words labeled as both positive and negative were discarded. The final number of words considered was 14,073; Table 1 shows the final distribution of words.
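The filtering described above can be sketched as a small labeling function. Treating words with neither flag set as neutral is an assumption about how the neutral dimension is filled; the function name is hypothetical.

```python
def emolex_polarity(positive, negative):
    """Map NRC-EmoLex positive/negative flags (0 or 1) to a polarity label.

    Returns None for words flagged as both positive and negative,
    which are discarded.
    """
    if positive and negative:
        return None  # ambiguous word: discarded
    if positive:
        return "positive"
    if negative:
        return "negative"
    return "neutral"  # assumption: no polarity flag means neutral

# Toy flags, for illustration only.
labels = [emolex_polarity(p, n) for p, n in [(1, 0), (0, 1), (0, 0), (1, 1)]]
kept = [label for label in labels if label is not None]
```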
Table 1. Distribution of words in NRC-EmoLex after removing words with both positive and negative values
Subjectivity clue
Originally, the Subjectivity clue lexicon contained 7,228 words; 2,781 of them were absent from VAD, so they were discarded. The final number of words considered was 4,447; Table 2 shows the final distribution of words.
Table 2. Distribution of words in the Subjectivity clue lexicon after removing words not present in VAD
The Subjectivity clue lexicon was represented in three dimensions according to the following rules:
Strongly_Positive and/or Weakly_Positive → Positive.
Strongly_Negative and/or Weakly_Negative → Negative.
Weakly_Positive and Weakly_Negative → Neutral.
Strongly_Positive and Weakly_Negative → Positive.
Strongly_Positive and Strongly_Negative → Neutral.
Weakly_Positive and Strongly_Negative → Negative.
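The rules above can be read as a comparison between the strongest positive and the strongest negative annotation of a word. The sketch below encodes that reading; the entry format, the `STRENGTH` table, and the function name are hypothetical, and combinations not listed in the rules (for example three or more annotations) are resolved by the same comparison, which is an assumption.

```python
# Hypothetical encoding of annotation strength.
STRENGTH = {"weak": 1, "strong": 2}

def subjectivity_polarity(entries):
    """Collapse all annotations of one word into a single polarity label.

    entries: list of (strength, polarity) pairs,
             e.g. [("strong", "positive"), ("weak", "negative")]
    """
    pos = max((STRENGTH[s] for s, p in entries if p == "positive"), default=0)
    neg = max((STRENGTH[s] for s, p in entries if p == "negative"), default=0)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # equal strength on both sides

examples = {
    "strong_pos_only": subjectivity_polarity([("strong", "positive")]),
    "weak_pos_weak_neg": subjectivity_polarity([("weak", "positive"), ("weak", "negative")]),
    "strong_pos_weak_neg": subjectivity_polarity([("strong", "positive"), ("weak", "negative")]),
    "strong_pos_strong_neg": subjectivity_polarity([("strong", "positive"), ("strong", "negative")]),
    "weak_pos_strong_neg": subjectivity_polarity([("weak", "positive"), ("strong", "negative")]),
}
```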
Combined polarity lexicon
A neutral value was assigned to words that exist in the VAD lexicon but not in the combined polarity lexicon. Table 3 shows the result of combining the words in NRC-EmoLex and the Subjectivity clue lexicon, and of combining these with the lemmas added to VAD and the words present only in VAD (combined polarity lexicon).
Table 3. Number of words in the combined polarity lexicon
Given that the class distribution of the combined polarity lexicon is imbalanced, the Weighted Categorical Cross-Entropy (WCCE) loss [5], described in Equation 3, was used. Equation 2 specifies the formula used to determine the weight of each class.
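A plausible form of Equation 2, reconstructed from the definitions that follow, is shown below. It matches the common "balanced" class-weighting heuristic; that the authors used exactly this form is an assumption.

```latex
\begin{equation}
weights_{class} = \frac{n\_samples}{n\_classes \cdot y_{class}}
\end{equation}
```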
where $weights_{class}$ is the corresponding weight per class; $n\_samples$ is the total number of samples; $n\_classes$ is the number of classes; and $y_{class}$ is the number of positive samples for class $class$.
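A plausible form of Equation 3, consistent with the definitions that follow, is shown below. It assumes that the inner sum runs over the classes, that $y\_gold_{m,k}$ and $y\_pred_{m,k}$ are the one-hot target and the predicted probability of class $k$ for sample $m$, and that the loss is averaged over the samples; these indexing and normalization choices are assumptions.

```latex
\begin{equation}
\mathrm{WCCE} = -\frac{1}{M} \sum_{m=1}^{M} \sum_{k} w_k \, y\_gold_{m,k} \, \log\left( y\_pred_{m,k} \right)
\end{equation}
```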
where $M$ is the number of samples; $k$ is the number of classes; $w_k$ is the weight for class $k$; $y\_gold_m$ is the target value for sample $m$; and $y\_pred_m$ is the predicted value for sample $m$.
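The class weighting and the weighted cross-entropy can be sketched together. The weight formula below is the "balanced" heuristic (total samples divided by the number of classes times the class count), which is an assumption about the exact form used; the function names and the toy label distribution are hypothetical.

```python
import math
from collections import Counter

def class_weights(labels):
    """Weight per class: n_samples / (n_classes * samples in that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

def weighted_cce(y_gold, y_pred, weights, eps=1e-12):
    """Weighted categorical cross-entropy averaged over samples.

    y_gold: list of gold class labels
    y_pred: list of dicts mapping label -> predicted probability
    """
    total = 0.0
    for gold, probs in zip(y_gold, y_pred):
        # With one-hot targets only the gold class contributes to the sum.
        total += -weights[gold] * math.log(probs[gold] + eps)
    return total / len(y_gold)

# Toy imbalanced distribution, for illustration only.
labels = ["neutral"] * 6 + ["positive"] * 2 + ["negative"] * 2
weights = class_weights(labels)

uniform = {"neutral": 1 / 3, "positive": 1 / 3, "negative": 1 / 3}
loss_rare = weighted_cce(["positive"], [uniform], weights)
loss_common = weighted_cce(["neutral"], [uniform], weights)
```

With these toy counts the minority classes receive a larger weight than the majority class, so the same prediction error costs more on a rare class.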
The Multi-output Neural Network was trained for 200 epochs to learn the mapping function. Once trained, the model was used to obtain the senti-embedding representation from the pre-trained embeddings: the output layers of the model were removed, and the last hidden layers of each output were concatenated. The pre-trained embeddings (GloVe) were then fed into the model, which produced the senti-embeddings at its output. The senti-embeddings were concatenated to the original pre-trained embeddings to preserve part of the semantic information.
The concatenated senti-embeddings and pre-trained embeddings are reduced to obtain the final representation of the embeddings with a dimension of 300. Given the vocabulary size (around 2M words), IncrementalPCA [18] was used for the reduction.
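The concatenation and reduction steps can be sketched with scikit-learn's `IncrementalPCA`, which is fitted batch by batch so that the full matrix never has to be decomposed at once. Everything else here is a stand-in: the random matrix replaces GloVe, `senti_embed` is a hypothetical placeholder for the trained network with its output layers removed, and the sizes are toy values rather than the 300 dimensions and roughly 2M words of the paper.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)

# Toy sizes (assumptions); the final dimension equals the input dimension.
VOCAB, EMB_DIM, SENTI_DIM, FINAL_DIM, BATCH = 200, 16, 8, 16, 50

pretrained = rng.normal(size=(VOCAB, EMB_DIM))  # stand-in for GloVe vectors

# Hypothetical placeholder for the trained model without its output layers.
W = rng.normal(size=(EMB_DIM, SENTI_DIM))

def senti_embed(x):
    return np.tanh(x @ W)

# Concatenate the original vectors with their senti-embeddings.
combined = np.hstack([pretrained, senti_embed(pretrained)])

# Fit the PCA incrementally, one batch of words at a time.
ipca = IncrementalPCA(n_components=FINAL_DIM)
for start in range(0, VOCAB, BATCH):
    ipca.partial_fit(combined[start:start + BATCH])

# Project every batch onto the learned components.
final = np.vstack(
    [ipca.transform(combined[start:start + BATCH]) for start in range(0, VOCAB, BATCH)]
)
```

Each batch passed to `partial_fit` must contain at least as many rows as the number of components, which is why the batch size here is larger than the final dimension.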
Results
The embeddings were tested on the tasks of emotion classification and sentiment analysis using the following corpora: SemEval-2017, SST-2, and ISEAR.
A Bi-LSTM was used to evaluate the generated embeddings. The same GloVe embeddings used as input for the training 5 were used as the baseline. The results of the classification are shown in Table 4.
Table 4. Classification results for Bi-LSTM and different embeddings. The results reported are the average micro F1-scores and standard deviation after ten runs. *This is an approximation based on the specifications of the article. The best results are shown in bold; the second-best results are underlined.
The proposal of this article did not improve the polarity representation; its results were similar to those of the state of the art.
For the SemEval-2017 dataset, all the embeddings had a similar performance. The proposal of this paper had the second-best average micro F1-score; the best was obtained by the embeddings of Shah et al. [25]. For binary classification, using SST-2, the proposal had the best performance on average and EWE had the second-best result; however, considering the standard deviation, the variation is minimal.
For ISEAR, the difference with respect to the other embeddings is larger, and this article’s proposal obtained the best result.
This paper proposed using a Multi-output Neural Network to create embeddings rich in emotion and polarity using a combination of lexicons and outputs.
For fine-grained classification, the proposal had the best representation of emotions; thus, the Multi-output Neural Network successfully captures emotional information. However, the polarity information requires improvement, since the results obtained for SemEval-2017 and SST-2 are almost identical to those shown by the state of the art.
For future work, Multi-output Neural Networks will continue to be explored to create embeddings that better represent sentiments and emotions. Using a balanced lexicon, which avoids over-representing a specific class, could improve the representation of sentiments.
Acknowledgment
The work was done with partial support from the Mexican Government through grant A1-S-47854 of CONACYT, Mexico, and grants 20232138, 20232080, 20231567, and 20231387 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided to them through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America Ph.D. Award.
Notes
2 Valence or Pleasure: the pleasantness of a stimulus; Arousal: the intensity of emotion provoked by a stimulus; Dominance: the degree of control exerted by a stimulus.
3 NRC-VAD: National Research Council Canada Valence, Arousal, and Dominance Lexicon
4 NRC-EmoLex: National Research Council Canada Word-Emotion Association Lexicon
