Abstract
This paper describes our proposal for Sentiment Analysis in Twitter for the Spanish language. The main characteristics of the system are the use of word embeddings specifically trained on tweets in Spanish and the use of self-attention mechanisms that make it possible to model sequences without convolutional or recurrent layers. These self-attention mechanisms are based on the encoders of the Transformer model. The results obtained on Task 1 of the TASS 2019 workshop, for all the Spanish variants proposed, support the correctness and adequacy of our proposal.
Introduction
Sentiment Analysis (SA) is one of the most studied Natural Language Processing (NLP) problems of the last decade. SA consists of determining whether the polarity of a document is positive, negative or neutral.
Initially, SA models were trained to deal with long and nearly normative text, typically reviews of products or services [13], [30]. As the use and influence of Twitter have grown in recent years, the NLP community has increased its efforts to address the peculiarities of the language used in this social network. There is still great interest in the study of Sentiment Analysis in the Twitter domain, as evidenced by the organization of different tracks devoted to this subject [18], [20].
Sentiment Analysis in Twitter presents some specific problems that do not occur in SA for normative text. On the one hand, there is a lack of context due to the limited length of tweets. On the other hand, as in other social networks, informal language is common on Twitter, including spelling errors, elongated words, emoticons, special terms, user mentions, etc.
The workshop "Sentiment Analysis at SEPLN" (TASS), organized within the framework of the International Conference of the Spanish Society for Natural Language Processing (SEPLN), has been the reference for the evaluation of SA systems in Twitter for the Spanish language since 2012.
In the last few years, the self-attention mechanisms of the Transformer Encoder [26] have proved to be a very effective way of computing representations and complex relationships. They have been successfully used in some SA problems related to product and service reviews in English [12], [1].
In this work, we propose the use of these multi-head self-attention mechanisms, on top of pre-trained Twitter word embeddings, to address the Sentiment Analysis problem in Spanish on Twitter. To evaluate the adequacy of our proposal, we carried out extensive experiments on Task 1 of the TASS 2019 workshop for several Spanish variants, where our system obtains very competitive results, being one of the best-ranked systems in the competition.
The rest of the article is structured as follows. Section 2 presents the state-of-the-art for Twitter Sentiment Analysis both for English and Spanish. In Section 3, a description of the task addressed in this work is presented. In Section 4, we describe the architecture of the proposed system. Section 5 summarizes the conducted experimental evaluation, the achieved results and a qualitative analysis of the performance of the self-attention heads. Finally, some conclusions are shown in Section 6.
Related work
Most works that have addressed the SA problem have used polarity lexicons in some way. The construction of these lexicons is another widely explored field of research. Polarity lexicons have usually been constructed for English [13], [30], but efforts have also been made to create lexicons for Spanish [19, 21]. However, their use has declined over time due to the increase in the quality of representations, typically based on word or sentence embeddings.
The SemEval workshop proposed several tasks related to Sentiment Analysis on Twitter from 2013 to 2017. In the last two editions [18, 20], many of the participating teams included state-of-the-art deep learning approaches in their systems. In this respect, SemEval has become the reference for Sentiment Analysis on Twitter for the English language.
In SA for Twitter in Spanish, the most relevant workshop is TASS, which has proposed different SA tasks focused on the Spanish language since 2012. An overview of the different tasks proposed, the participating teams, and the results obtained can be found in [4, 27–29].
Task 1 of TASS 2018 focused on Sentiment Analysis at tweet level. The corpus provided by the organizers was InterTASS 2.0, which includes the Spain (ES), Peru (PE) and Costa Rica (CR) Spanish variants. Moreover, the organizers proposed two subtasks: Subtask 1 for monolingual SA and Subtask 2 for multilingual SA. The systems presented by [8] and [5] were the most competitive on the three Spanish variants for almost all the tasks, while the system of [17] obtained the best results for the PE variant in the multilingual task.
In [8], the authors explore several deep learning architectures such as Deep Averaging Networks (DAN) [11], Attention Long Short Term Memory networks (Att-LSTM) [10] and Convolutional Neural Networks (CNN), along with different representations such as bag-of-words and Twitter word embeddings. In this case, the DAN system outperforms all the other participating systems in the ES variant.
Similarly, [5] also explored several deep learning architectures, such as CNN and LSTM trained on top of Wikipedia word embeddings, along with Support Vector Machines using a tweet representation based on word embeddings and several polarity statistics extracted from lexicons. Their LSTM and CNN systems were the first-ranked systems for the CR and PE variants respectively.
The system proposed in [17] was the most competitive for the PE variant on the multilingual subtask. It is based on a genetic algorithm (EvoMSA) that orchestrates other subsystems. These subsystems are B4MSA [24], used to tune input-related hyper-parameters such as the normalization and the representation, and the classifier EvoDAG [9].
However, recent advances and mechanisms that have improved the NLP state of the art have been published only for English SA. Meanwhile, in other languages such as Spanish and its variants, these advances are adopted slowly because they must be adapted to work correctly in these languages.
One of these recent improvements is the Transformer model proposed in [26] for machine translation. This architecture is based on multi-head self-attention, dispensing with convolutions and recurrences to learn relationships among words. The relationships captured by this kind of attention have been shown to be effective on English SA tasks [1, 12], outperforming other systems based on Bidirectional LSTM and CNN on corpora such as the Stanford Sentiment Treebank [22] and SenTube [25].
Task description
In order to validate our proposal for SA on Twitter in the Spanish language, we decided to participate in Task 1 of TASS 2019.
This task consists of assigning a global polarity to each tweet, choosing among four classes: N, NEU, NONE and P.
The organizers provided the InterTASS corpus, composed of tweets from 5 different Spanish-speaking countries: Spain (ES), Peru (PE), Costa Rica (CR), Uruguay (UY) and Mexico (MX). For each Spanish variant, 3 sets have been defined: training set (TR), development set (DV) and test set (TS). A system can only be trained and tested on a single Spanish variant at a time. Consequently, 5 different evaluations, one per Spanish variant, were proposed. Some statistics of the InterTASS corpus are shown in Table 1.
Number of tweets per class in all the sample sets of InterTASS for all the Spanish variants
The InterTASS corpus is unbalanced and there is a bias towards the N and P classes, except in the training and development sets of the PE variant, where the most frequent class is NONE. However, in the test set of this variant the class distribution differs, with N and P being the most frequent classes. Moreover, NEU is usually the least populated class in all Spanish variants.
Our system is based on the Transformer model [26]. Initially proposed for machine translation, the Transformer dispenses with convolutions and recurrences to learn long-range relationships. Instead, it relies on multi-head self-attention, where multiple attentions among the words of a sequence are computed in parallel to take into account different relationships among them. This reduces the computational complexity per layer (while also being more parallelizable) and the maximum path length of dependencies among words to $O(1)$.
Concretely, we use the encoder part of the Transformer model in order to extract vector representations that are useful to perform Sentiment Analysis. We denote this encoding part of the Transformer model as Transformer Encoder (TE). Figure 1 shows the representation of the proposed architecture for the addressed task.

System architecture based on the Transformer Encoder model.
The input of the model is a tweet $X = \{x_1, x_2, \ldots, x_T : x_i \in \{0, \ldots, V\}\}$, where $T$ is the maximum length of the tweet and $V$ is the vocabulary size. This tweet is passed through a $d$-dimensional pre-trained embedding layer, $E$, frozen during the training phase. Moreover, to consider positional information we also experimented with the sine and cosine functions proposed in [26].
This, encoded as
After the combination of the word embeddings with the positional information, dropout [23] was used to drop input words with a certain probability p in order to regularize the model. On top of these representations, N transformer encoders are applied, which rely on the multi-head scaled dot-product attention shown in Equations 2-4. These encoders are identical to those in [26], including the layer-normalized [2] residual connections.
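As an illustration of the two building blocks mentioned above, the following is a minimal pure-Python sketch of the sine and cosine positional encodings and of multi-head scaled dot-product self-attention as defined in [26]. It is not the implementation used in our system: the function names, the list-of-lists matrix representation and the explicit projection matrices (`W_q`, `W_k`, `W_v`, `W_o`) are assumptions made for readability, and the residual connections, layer normalization and position-wise feed-forward sublayer are omitted.

```python
import math

def positional_encoding(T, d):
    """Sine/cosine positional encodings of [26] as a T x d list of lists."""
    pe = [[0.0] * d for _ in range(T)]
    for pos in range(T):
        for i in range(0, d, 2):
            angle = pos / (10000 ** (i / d))
            pe[pos][i] = math.sin(angle)          # even dimensions
            if i + 1 < d:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions
    return pe

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Returns softmax(Q K^T / sqrt(d_k)) V and the T x T attention weights."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def multi_head_self_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head.
    W_o: (h * d_k) x d output projection. Both are hypothetical parameters."""
    outputs, attentions = [], []
    for W_q, W_k, W_v in heads:
        out, att = scaled_dot_product_attention(
            matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        outputs.append(out)
        attentions.append(att)
    # Concatenate the per-head outputs position by position, then project.
    concat = [sum((o[t] for o in outputs), []) for t in range(len(X))]
    return matmul(concat, W_o), attentions
```

The per-head attention matrices returned here are the kind of quantity inspected later in the analysis of the attention heads.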
The output of a single encoder, $S$, is computed as shown in Equation 8 for a given sample $X_0$.
Because a vector representation is required to train classifiers on top of these encoders, a global average pooling mechanism was applied on $S$. The resulting vector is used as input to a single-layer feed-forward network, whose output layer computes a probability distribution over the four classes of the task.
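The pooling and output steps can be sketched as follows. This is only an illustrative sketch: the hidden layer of the feed-forward network is omitted for brevity, and the parameters `W` and `b` of the output layer are hypothetical names.

```python
import math

def global_average_pooling(S):
    """Average the T encoder output vectors (a T x d matrix) into one d-dimensional vector."""
    T = len(S)
    return [sum(col) / T for col in zip(*S)]

def class_distribution(S, W, b):
    """Softmax output layer over the classes.
    W is a d x C weight matrix and b a list of C biases (hypothetical parameters)."""
    v = global_average_pooling(S)
    logits = [sum(v_i * W[i][c] for i, v_i in enumerate(v)) + b[c]
              for c in range(len(b))]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Averaging over positions yields a fixed-size vector regardless of the tweet length, which is what allows a standard classifier to be placed on top of the encoders.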
We use Adam as the update rule with lr = 0.001, $\beta_1$ = 0.9 and $\beta_2$ = 0.999, and Noam as the learning rate schedule [26] with 15 warmup_steps. Due to the class imbalance in all the Spanish variant subsets, weighted cross entropy is used as the loss function, considering the distribution of each class in the training set. Concretely, the weight of each class is the ratio between the frequency of the most frequent class and the frequency of that class.
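The class weighting and the learning rate schedule can be sketched as follows. The function names are assumptions, the class counts in the usage example are toy values rather than the InterTASS statistics, and the schedule is the formula given in [26]; how it is combined with the base learning rate of Adam is not specified here, so the sketch leaves that out.

```python
import math

def class_weights(counts):
    """Weight of each class: frequency of the most frequent class
    divided by the frequency of that class."""
    most_frequent = max(counts.values())
    return {c: most_frequent / n for c, n in counts.items()}

def weighted_cross_entropy(probs, gold, weights):
    """Cross entropy for one sample, scaled by the weight of its gold class."""
    return -weights[gold] * math.log(probs[gold])

def noam_lr(step, d_model, warmup_steps):
    """Noam schedule of [26]: linear warm-up followed by inverse square-root decay.
    `step` starts at 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Toy usage with made-up class counts.
weights = class_weights({"N": 400, "P": 300, "NONE": 200, "NEU": 100})
```

With this weighting the most frequent class always has weight 1, and errors on rarer classes are penalized proportionally more.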
In order to initialize the embedding layer of our system with a rich semantic representation for the words of the task, a 300-d skip-gram model [16] was trained on texts from the same domain as the task (Twitter). This model was trained on 87M tweets from several Spanish variants, downloaded by streaming over several months in 2017 in our laboratory.
Regarding preprocessing, we applied the same steps to all the data, both the tweets used to learn the Word2Vec embedding model and those provided by the organizers to train the systems. First, all the tweets are case-folded. Second, the tweets are tokenized with ToktokTokenizer from NLTK [3]. Third, user mentions, hashtags and URLs are replaced by three generic-class tokens (user, hashtag and url respectively). Finally, elongated tokens are shortened so that the same vowel appears at most twice consecutively in a token (e.g. jaaaa becomes jaa).
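The preprocessing steps above can be approximated with the following sketch. It is a simplification under stated assumptions: whitespace splitting stands in for the NLTK tokenizer so that the example has no external dependencies, the replacements are applied before splitting rather than after tokenization, and the regular expressions for mentions, hashtags and URLs are our own illustrative choices.

```python
import re

# Illustrative patterns (assumptions, not the exact rules used in the system).
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# Three or more repetitions of the same vowel, including accented vowels.
ELONGATED_VOWEL = re.compile(r"([aeiouáéíóú])\1{2,}")

def preprocess(tweet):
    text = tweet.lower()                       # case folding
    text = URL.sub("url", text)                # URLs first, since they may contain '#' or '@'
    text = MENTION.sub("user", text)
    text = HASHTAG.sub("hashtag", text)
    text = ELONGATED_VOWEL.sub(r"\1\1", text)  # keep at most two consecutive equal vowels
    return text.split()                        # whitespace tokenization (simplification)

tokens = preprocess("Jaaaa @Pepe mira https://t.co/abc #TASS")
```

Applying the same function to the tweets used for the embeddings and to the task data keeps the vocabulary of both consistent.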
Experimental work
In order to validate our proposal for Sentiment Analysis in Twitter and to select the best model to participate in the 2019 edition of the TASS competition, we carried out some experiments on the development set. To train the models, we fixed some hyper-parameters: batch_size = 32, $d_k = 64$, $d_{ff} = d$ and $T = 50$. Other hyper-parameters, namely p, warmup_steps and h, were set according to the results obtained in previous experiments: p = 0.7, warmup_steps = 5 epochs and h = 8.
Moreover, we compared our proposal, which is based on Transformer Encoders (TE), with other deep learning systems such as Deep Averaging Networks (DAN) [11] and Attention Long Short Term Memory Networks (Att-LSTM) [10], which are commonly used in related text classification tasks and were proposed by the teams that achieved the best results in the 2018 edition of TASS [5, 8].
We were interested in observing how the use of positional encodings and the number of encoders affect the results. Specifically, we trained different models with and without positional information (TE-Pos and TE-NoPos) and with 1 or 2 encoders. We tested all these combinations only on the ES variant, and the best two configurations were also applied to the remaining variants (PE, CR, UY, MX).
The results in terms of macro-F1 (MF1), macro-recall (MR), macro-precision (MP) and Accuracy (Acc) achieved by all the systems considered in the development phase for all the Spanish variants are shown in Table 2. It can be seen that the best Transformer Encoder models (1-TE-NoPos, 2-TE-NoPos) outperform the DAN and Att-LSTM approaches by a margin of ∼5 points in MF1. This is due to the large improvement in both MR (∼6 points) and MP (∼3 points).
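For reference, the four measures can be computed from gold and predicted labels as in the following sketch. It uses one common definition in which MF1 is the unweighted mean of the per-class F1 scores; this is an assumption, and the official evaluation script of the task may define macro-F1 differently (for example, from MP and MR directly).

```python
def macro_metrics(gold, pred, classes):
    """Macro-precision, macro-recall, macro-F1 (mean of per-class F1) and accuracy."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    return {"MP": sum(precisions) / n, "MR": sum(recalls) / n,
            "MF1": sum(f1s) / n, "Acc": accuracy}
```

Because every class contributes equally to the macro-averaged measures, poor performance on a small class such as NEU has a strong effect on MF1 even when accuracy is barely affected.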
Results on the development set for the different Spanish variants
The use of positional information in the TE approaches decreases system performance (1-TE-Pos versus 1-TE-NoPos and 2-TE-Pos versus 2-TE-NoPos). This seems to indicate that the positional information, represented by sine and cosine functions added to the word embeddings, is not useful for the classification. However, Att-LSTM, which captures positional information through its internal memory, obtains better results than the 1-TE-Pos and 2-TE-Pos approaches in almost all the metrics.
The 1-TE-NoPos model obtains better results, in terms of MR and MF1, than the 2-TE-NoPos model, outperforming it by ∼2 points in terms of MF1. This behavior is observed in almost all the variants, except in the MX variant, where both models obtain similar results in terms of MF1 and 2-TE-NoPos outperforms 1-TE-NoPos in terms of MR.
Table 3 shows the results, at class level, achieved by the best model (1-TE-NoPos) for all Spanish variants. In most cases, the results obtained for the N and P classes are better than those obtained for the other classes, except in the PE variant, where the NONE class obtains the best results. For all Spanish variants, as expected, the most difficult class is NEU, because this class corresponds to tweets that combine positive and negative sentiments.
Results at class level for the 1-TE-NoPos model and all Spanish variants on the development set
In order to study in detail the behavior of our best system (1-TE-NoPos), we computed the confusion matrix for the ES variant, shown in Table 4. Note that the NEU class is highly confused with the N and P classes. This seems to indicate that our model detects the presence of sentiment (positive or negative), but is not able to detect when both sentiments are present together. In addition, it can be observed that the N and P classes are confused with each other.
Confusion matrix (1-TE-NoPos) on the ES variant development set
In light of the results of the development phase, we decided to use the 1-TE-NoPos system to participate in the TASS 2019 competition. Table 5 shows the official results for all Spanish variants and the position of our system (ranked using the F1 measure) in each variant [6]. As can be seen, our system is ranked first for the ES and MX variants and second for the CR, PE and UY variants.
Official results and ranking of our system on the TASS 2019 competition
It is worth highlighting that, although our system was optimized for the ES variant, it behaved reasonably well on the remaining variants.
With the aim of understanding the proposed model, we have analyzed the behavior of the self-attention mechanisms.
A competitive SA system should be able to combine several aspects to determine the polarity of a tweet. Among others, some of these aspects are the polarity of the words and the presence of sentiment modifiers such as polarity shifters or reversers in the tweets. We hypothesize that the attention heads of our system should capture some of these aspects.
In order to determine which heads react to these aspects, we computed the average attention that each word receives from each head, considering all the occurrences of the word in a given sample set. We formalized this computation in Algorithm 1.
The development set of the ES variant is used to verify that the model generalizes and captures interesting relationships even in samples that it has never seen.
From the set of samples χ with vocabulary
The columns of this attention matrix are averaged to obtain
Once α is computed, it is possible to observe if some heads are capable of taking into account some properties at word level that are necessary to determine the sentiment of a tweet.
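The computation of the average attention received by each word from each head can be sketched as follows. This is an illustrative reading of the procedure rather than a transcription of Algorithm 1: the data layout (`attentions[n][h]` as the attention matrix of head `h` for sample `n`, with rows as attending positions and columns as attended positions, and no padding positions) and the function name are assumptions.

```python
from collections import defaultdict

def average_attention_per_word(samples, attentions, num_heads):
    """samples: list of token lists, one per tweet.
    attentions[n][h]: T x T attention matrix of head h for sample n.
    Returns alpha[word] = list with the mean attention received from each head."""
    sums = defaultdict(lambda: [0.0] * num_heads)
    counts = defaultdict(int)
    for tokens, heads in zip(samples, attentions):
        T = len(tokens)
        for j, word in enumerate(tokens):
            counts[word] += 1
            for h in range(num_heads):
                # Attention received by position j: average of column j of the matrix.
                received = sum(heads[h][i][j] for i in range(T)) / T
                sums[word][h] += received
    # Average over all the occurrences of each word in the sample set.
    return {w: [s / counts[w] for s in sums[w]] for w in sums}
```

The resulting dictionary plays the role of α: looking up a word gives one value per head, which is the kind of profile plotted for individual words.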
Figure 2 shows the attention of all heads (from 1 to 8) for 6 words with high polarity, extracted from the ElHuyar lexicon [21]. The first row of Figure 2 shows the attention per head for three words with positive polarity (best, wonderful and cool), and the second row corresponds to three words with negative polarity (worst, horrible and shit). It can be observed that attention heads 4 and 5 react with high intensity when the polarity is negative and positive respectively. Moreover, head 4 does not react when the polarity is positive, and the same behavior is observed for head 5 when the polarity is negative. Furthermore, heads 6 and 7 seem to attend to the negative words and not to the positive ones, while head 3 reacts more intensely to positive words than to negative ones.

Attentions for several words that contain sentiment.
We extended the study to all words in the vocabulary that appear in the ElHuyar polarity lexicon [21]. Figure 3 shows average attentions per head for positive and negative words. It can be seen that the negative words receive higher attention than the positive ones. In particular, head 4 reacts more to negative words than head 5 reacts to positive words.

Sum of attentions for all the attention heads on the words of ElHuyar.
To confirm that heads 4 and 5 are able to detect the polarity of words, we designed a classifier that uses only the attention of heads 4 and 5 ($\alpha_{w4}$ and $\alpha_{w5}$) to determine the polarity of each word $w$ of the vocabulary.
We attempted to address Task 1 of TASS 2019 for the ES variant using only the information of heads 4 and 5 and the ElHuyar lexicon. To do this, we designed a classifier based on the sum of the polarities of the words. The classifier works as follows: if the sample does not contain any word with polarity, its class is NONE; if the sample contains the same number of positive and negative words, its class is NEU; otherwise, the class of the sample is P or N depending on whether positive or negative words are more numerous.
This classifier can be computed directly from any polarity lexicon (e.g. ElHuyar). However, to use heads 4 and 5 of our system, we need a mechanism to discretize the polarity of each word based on the outputs of both heads. In our case, we obtain a probability distribution over the P and N classes by applying a softmax function to the outputs of the two heads. To discretize this distribution, we used a threshold ε = 0.165, set experimentally. This classifier, SumPolClassifier, is defined in Algorithm 2.
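A minimal sketch of this classifier is given below. The decision rule over word counts follows the description above, but the way the threshold is applied to the softmax output is an assumption made for illustration (a word is given a polarity only when one probability exceeds the other by more than ε); the function names and the toy lexicon in the usage example are also hypothetical.

```python
import math

def word_polarity_from_heads(alpha_neg, alpha_pos, eps=0.165):
    """Hypothetical discretization: softmax over the attentions of the negative
    head (head 4) and the positive head (head 5), followed by a threshold."""
    m = max(alpha_neg, alpha_pos)
    e_neg, e_pos = math.exp(alpha_neg - m), math.exp(alpha_pos - m)
    p_neg, p_pos = e_neg / (e_neg + e_pos), e_pos / (e_neg + e_pos)
    if p_pos - p_neg > eps:
        return "P"
    if p_neg - p_pos > eps:
        return "N"
    return None  # the word is treated as having no polarity

def sum_pol_classifier(tokens, polarity_of):
    """polarity_of maps a word to "P", "N" or None (from a lexicon or from the heads)."""
    labels = [polarity_of(t) for t in tokens]
    n_pos, n_neg = labels.count("P"), labels.count("N")
    if n_pos == 0 and n_neg == 0:
        return "NONE"
    if n_pos == n_neg:
        return "NEU"
    return "P" if n_pos > n_neg else "N"

# Toy usage with a made-up two-word lexicon.
toy_lexicon = {"bueno": "P", "malo": "N"}
label = sum_pol_classifier(["muy", "bueno"], toy_lexicon.get)
```

Passing the polarity source as a function makes it possible to run the same decision rule with either a lexicon lookup or the discretized head attentions.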
In order to use the SumPolClassifier with the ElHuyar lexicon, p(N|w) and p(P|w) are obtained directly from the lexicon. Table 6 shows the results of SumPolClassifier applied to the development set of the ES variant of Task 1 of TASS 2019, both with heads 4 and 5 and with the ElHuyar lexicon. The results in terms of macro-F1 are similar for both approaches. Both systems classify the NEU, NONE and P classes similarly. However, recall on the N class is significantly lower with heads 4 and 5 than with ElHuyar, although precision is higher.
Results of SumPolClassifier both using the heads 4 and 5, and ElHuyar lexicon on the development set.
Finally, we studied how attention heads react to words that are supposed to be polarity shifters or polarity reversers. Figure 4 shows average attentions per head for eight of these words. The words in the first row (not, never, neither and anybody) are polarity reversers and the words in the second row (very, nothing, forever and something) are polarity shifters.

Attention per head on polarity reversers and shifters.
It can be seen that head 1 reacts to all the shifters and reversers. This head does not react to positive or negative words (see Figures 2 and 3). In addition, heads 4 and 5 do not react to shifters or reversers, because these words do not have polarity per se. However, the attention values for head 1 are not particularly high, except in the case of not and forever. These results seem to indicate that, although head 1 reacts fairly well to common shifters and reversers, it is necessary to reinforce the attention dedicated to this type of word.
It is also remarkable that all the polarity reversers and the polarity shifter nothing, all of them with negative inertia, are attended by head 7, which was related to negative polarity as previously discussed.
Conclusions
In this work, we have presented a proposal for Sentiment Analysis in Twitter for the Spanish language. Our proposal is based on the use of word embeddings trained on tweets in Spanish and on the Transformer Encoder architecture. This architecture relies only on self-attention mechanisms to ease the learning of relationships among words, without using convolutional or recurrent mechanisms.
We have tested our system on Task 1 of the 2019 edition of the TASS workshop, for which the organizers provided 5 subsets corresponding to 5 Spanish variants. Although the hyper-parameters of the model were tuned considering only the ES variant, our system was ranked first or second on all the Spanish variants.
These results encouraged us to perform a thorough study of how the self-attention heads capture the information required to perform Sentiment Analysis. We have detected some heads directly related to positive and negative words and another that reacts to polarity shifters and reversers.
