Abstract
The present study deals with the detection of negative emotions in informal short texts (tweets). Our work takes advantage of several features of social networks, particularly their availability and the confidence they offer users in expressing their emotions. The corpus of tweets was manually marked with emotions and was balanced: it had 3,000 tweets for each of Ekman's four negative emotions and 3,000 neutral tweets (15,000 tweets in total). The objective of the present study was to apply automatic learning to classify tweets into two categories (sad versus neutral tweets) or five categories (the four negative emotions and neutral tweets). Different features were evaluated by changing the types of elements (words or lemmas), sizes (uni-, bi-, tri-, unibi-, unibitrigrams, among others), and values (term frequency or term frequency-inverse document frequency). Sadness was detected with F1 = 0.962. The F1 for the five-category task (neutral tweets and the four negative emotions) was 0.664, which is relatively high given the difficulty of the task (random baseline = 0.2 for five categories). The present results were obtained from experiments conducted on a balanced textual corpus for the first time and were better than those of state-of-the-art methods.
Introduction
Emotion is a complex mental phenomenon; thus, there is currently no consensus on its definition as each definition focuses on different specific features of emotion. For example, Ekman and Davidson define an emotion as a reaction by a human being arising from a real or imagined perception of something, be it an object, a person, a place, or a memory. 2 Another more recent definition specifies that emotion is any mental experience with high intensity and hedonic content relating to pleasure or displeasure. 3 For the present study, it was important to have emotions reflected in several different mental states, which can be associated with certain commonly accepted concepts, such as sadness, anger, and others. Thus, the use of emotions in the present study was mainly practical as labels for texts that express these emotions. Fortunately for us, users trust social networks and write their texts reflecting their emotions, such as fear or sadness. In this way, many texts (tweets) are available and their corresponding emotion labels are known, which provides us with an abundance of marked data (corpus) for automatic processing.
Currently, computers help us solve various problems related to linguistics and psychology. Computers can even perform better than human beings, especially when we want to identify patterns in a large amount of data: a human may not see these patterns, whereas a computer can detect them. However, we need to apply a formal model for the detection of patterns and look for common patterns in different objects (i.e., we evaluate the similarity of the objects of our study). The formal model most widely used in computing for comparing objects (evaluating their similarity) is the vector space model. The vector space model represents objects as vectors of the values of their features (i.e., objects have features, features have values, and each object is represented as a vector of these values). 1 This model not only reflects reality (the features characterize objects) but also leaves the choice of the feature set to the researcher, so various sets of features can be chosen. Thus, our task is to choose the feature set and value types that are best suited for our problem.
In the present work, our main research question is as follows: what features are best suited for the automatic detection of negative emotions? We answer this question by conducting experiments with the data from our corpus when the tweets (texts) are marked with the emotions that they express. The experiments are conducted by applying the standard machine learning algorithm to these data whereby the computer decides how good the chosen features (vector space model) are. The general procedure for this decision-making approach is to first train the computer model on the existing data and then assess if the correct category (emotion of the tweet in our case) can be predicted from a small section of the test data. If we have high precision, then we have good features and they will also function well on the new data. We conducted experiments with various feature sets.
No single set of features is uniquely correct for representing our objects (in this case, tweets), which greatly hinders the task; therefore, we have to empirically select the best sets of features and their corresponding values to perform the detection of negative emotions. To date, unigrams (sequences of textual elements with a size n = 1) have provided the best results.4–7 However, the construction of new feature sets could allow us to obtain better results than those obtained with unigrams.
Our hypothesis is that the use of our newly combined features will improve the automatic detection of tweets with negative emotions and neutral tweets. These combined features should allow for the introduction of more information into the methods of automatic learning by taking into account different types, values, and sizes. In addition, the size of the set of features used could be delimited (e.g., testing with 1,000 or 1,500 of the most frequently combined features per category). We tried two experimental scenarios: for two and for five categories. Using two categories, a single negative emotion (sadness vs. neutral tweets) should automatically be detected, whereas in the case of five categories, automatic detection of the four negative emotions and neutral tweets should be performed at the same time. In this manner, we provide a new method for advancing the detection of negative emotions in tweets.
The current study was restricted to the automatic determination of negative emotional states expressed by authors at the time of writing informal short texts (tweets). Tweets are entries in the online service Twitter, which is very widely used. Its users publish ∼58 million tweets every day. 8 Herein, a corpus of tweets written in Spanish was compiled, but the methodology can be applied to any other corpus or language. In our corpus, tweets from 2014 to 2015 had a maximum size of 140 characters. It is important to note, however, that the use of tweets involves dealing with more cumbersome aspects, such as informality and increased expressiveness, due to the popularization of metadata and use of multiple conventions. 9 One example of this is the use of hashtags (words preceded by #) to express emotions. 4
Ekman's Six Basic Emotions
There are several possible ways to categorize emotions and choose what the basic or primary emotions are. For example, Plutchik 10 suggested using eight basic emotions grouped on a positive or negative basis: “joy-sadness,” “anger-fear,” “trust-disgust,” and “surprise-anticipation.” However, Ekman 11 proposed that certain patterns constitute the six basic emotions. He did not include “anticipation” and “trust” and used the term “happiness” instead of “joy.” Ekman's six emotions are as follows:
Sadness: a negative emotion that appears after perception of loss, rejection, or abuse (e.g., the death of a loved one causes feelings of sadness).
Fear: a negative emotion that leads people to seek protection (e.g., seeing a wild animal can elicit feelings of fear, causing one to flee or hide).
Anger: a negative emotion that manifests itself when a person physically or emotionally harms others (e.g., a child hits another and the affected child reacts to stop the threat, experiencing anger).
Disgust: a negative emotion that appears when a person considers something unpleasant (e.g., a vegetarian hearing talk about steaks will experience disgust).
Surprise: a neutral emotion experienced when something happens suddenly (e.g., winning the lottery can cause the feeling of surprise).
Happiness: a positive emotion that gives people a sense of well-being and security (e.g., happiness can be felt when someone prepares a favorite food).
Currently, Ekman's six emotions have gained the highest degree of acceptance within the psychological literature 12 and are commonly used in computational linguistics. 1 In the present work, we focused on the automatic detection of the negative emotions (sadness, fear, anger, and disgust), with special emphasis on sadness.
Related Findings
The information in Twitter is not only written in English; a very large part is also written in Spanish. The work presented here has not been previously conducted in Spanish due to several challenges. First, a new or different data source must be built, which requires a lot of time and effort, especially considering the lack of linguistic resources in Spanish. 5 Specifically, there are no large collections for the emotions we chose to work with in the present study. Sidorov et al. 7 established that 3,000 marked tweets are a sufficient number for conducting specific tasks similar to detection of emotions. Although collections of tweets do exist, most are either of insufficient size (fewer than 3,000 marked tweets) or do not contain expressions for each emotion that we chose to consider herein. Another challenge is the detection of certain emotions, which has caused most previous works to use unbalanced resources. Mohammad 4 used an unbalanced corpus that had 761 tweets marked with the emotion disgust, 1,555 tweets with anger, among others. Blázquez et al. 13 used an unbalanced corpus that only had 2 percent of tweets marked with the emotions surprise and anger combined, among others. Thus, a balanced yet sufficiently large corpus of tweets (or other texts) manually marked with emotions, which could serve as a gold standard for the task of automatic detection of emotions, does not yet exist even for the English language.
In the automatic learning field, the problem of category imbalance in the number of training examples for each category is addressed by balancing the corpus. We constructed our balanced corpus in Spanish as is usually done in computational linguistics. Each tweet is checked by humans to confirm that it reflects the appropriate emotion in relation to the hashtag content. For example, if a tweet has the hashtag #triste (#sad), then that tweet is expected to express sadness, which is verified manually. Another challenge is that tweets have many nonstandard elements that hinder automatic processing. Thus, specific preprocessing is needed. For this, we developed a dictionary with 670 elements, including informal words and phrases. For automatic classification, features must be selected. Until now, the best results have been obtained using unigrams.4–7 Therefore, new sets of features must be built that allow us to obtain better results than with unigrams. In the current work, we demonstrate the superior performance of n-grams and combinations of them.
Construction of the Balanced Corpus
In total, our balanced corpus had 15,000 tweets in Spanish (downloaded over 5 months in 2014–2015): tweets that express the negative emotions identified by Ekman 11 and neutral tweets (without any emotion). Of them, 3,000 were tweets marked with the emotion sadness, 3,000 were tweets marked with anger, and so on, according to previous work performed by Sidorov et al. 7; the agreement between our three evaluators was very good (0.816). 14 This resource (corpus) is available online.
We retrieved tweets using Python, the tweepy library, 15 and our list of emotional hashtags. The recovered tweets had unique identifiers, texts, creation dates, and URLs (optional). An emotional hashtag was defined as an emotionally charged word preceded by the “#” symbol. While Mohammad 4 and Blázquez et al. 13 used fewer than 24 emotional hashtags each to recover emotions, we employed 75 elements linked to Ekman's negative emotions. We included adjectives, synonyms, singulars, plurals, and different grammatical genders, among others. We also used specific hashtags (e.g., #mesaca [#makes_me_mad]) so that we could retrieve tweets that reflect the emotion anger. We chose specific hashtags after retrieving 1,855,000 tweets containing any hashtag over the course of a week and semi-automatically analyzing them by obtaining their occurrences and selecting those with a certain emotional load, among others. We collected neutral tweets using the hashtag #noticias (#news) because, after reviewing some examples, we found that they were stories of events reported with the greatest possible objectivity. It is important to mention that the present work focused on detecting negative emotions expressed by authors at the time of writing. An example of a neutral tweet is:
“#noticias descubren la molécula vital para formar agua en estrellas moribundas: un trabajo del grupo de astrof … (#news a molecule is discovered vital for forming water in dying stars: a work of a group of astrophysicists …)”
Finally, we applied filters, such as elimination of re-tweets and exclusion of tweets with fewer than five official words found in the Spanish dictionary provided by FreeLing (i.e., removing tweets with limited context). This dictionary has 669,291 forms corresponding to more than 76,000 lemma-parts-of-speech combinations. 16
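The two filters described above can be illustrated with a minimal Python sketch. It is not the authors' code: the small `DICTIONARY` set is a hypothetical stand-in for the FreeLing dictionary, recognizing a re-tweet by its "RT @" prefix is an assumption, and so is the choice to skip hashtags, mentions, and URLs when counting dictionary words.

```python
import string

# Hypothetical stand-in for the FreeLing Spanish dictionary (assumption:
# the real dictionary would be loaded from FreeLing's data files).
DICTIONARY = {"esta", "noche", "lo", "que", "hago", "es",
              "mi", "vida", "me", "haces", "falta"}

MIN_DICTIONARY_WORDS = 5  # threshold stated in the text


def is_retweet(text):
    """Assumption: a re-tweet is recognised by the conventional 'RT @' prefix."""
    return text.lstrip().lower().startswith("rt @")


def count_dictionary_words(text, dictionary):
    """Count tokens found in the dictionary, skipping hashtags, mentions, URLs."""
    count = 0
    for raw in text.lower().split():
        if raw.startswith(("#", "@", "http")):
            continue
        word = raw.strip(string.punctuation + "¡¿…")
        if word in dictionary:
            count += 1
    return count


def keep_tweet(text, dictionary=DICTIONARY):
    """Keep a tweet only if it is not a re-tweet and has enough dictionary words."""
    if is_retweet(text):
        return False
    return count_dictionary_words(text, dictionary) >= MIN_DICTIONARY_WORDS


print(keep_tweet("esta noche lo unico que hago es recordarte #triste"))
print(keep_tweet("mi vida #triste"))
```

The first call keeps the tweet because six of its tokens are in the toy dictionary; the second discards it because only two are.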
Methodology for Detecting Negative Emotions
We defined the following stages in the detection of negative emotions in tweets in Spanish:
Construction of the vector space model
The vector space model is widely used in computer science and makes it possible to compare objects in a formal way. The best way to describe objects (e.g., tweets) is to represent them using features and their values. 1 Because there is no single correct set of features, the task is to select the best set of features and their values.
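As a minimal sketch of how texts become vectors of feature values, the following Python code builds term-frequency (TF) and TF-IDF vectors over a toy vocabulary. The IDF variant used here, log(N / df), is an assumption chosen for simplicity; the exact weighting applied by the tool used in the experiments may differ, and the two toy documents are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Raw term-frequency (TF) values: how often each feature occurs in one text."""
    return Counter(tokens)


def inverse_document_frequencies(corpus):
    """IDF per feature, using the plain log(N / df) variant (an assumption)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def vectorize(tokens, vocabulary, idf=None):
    """Represent one text as a vector of TF values, or TF-IDF values if idf is given."""
    tf = term_frequencies(tokens)
    if idf is None:
        return [float(tf[term]) for term in vocabulary]
    return [tf[term] * idf.get(term, 0.0) for term in vocabulary]


# Toy corpus of two tokenized texts (illustrative only).
corpus = [["noche", "triste", "triste"], ["noche", "noticias"]]
vocabulary = sorted({term for tokens in corpus for term in tokens})
idf = inverse_document_frequencies(corpus)

tf_vector = vectorize(corpus[0], vocabulary)
tfidf_vector = vectorize(corpus[0], vocabulary, idf)
print(vocabulary, tf_vector, tfidf_vector)
```

With TF-IDF, a feature that occurs in every text (here "noche") receives a weight of zero, whereas a feature concentrated in few texts is weighted up.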
Preprocessing of the balanced corpus
The preprocessed version of our corpus is available upon request. The preprocessing steps are as follows:
Informal words and phrases were automatically replaced using our online dictionary. Changing them to official words found in the Spanish dictionary provided by FreeLing 16 prevented their elimination, improving the expression of negative emotions in tweets in Spanish. Our dictionary consists of 670 elements; some elements have an emotional load and others appear frequently in the balanced corpus. Examples include dep to descanse en paz (rest in peace), ex to pareja anterior (previous partner), mala leche to mal humor (bad mood), hdp to hijo de puta (son of a bitch), pelotudo to ingenuo (silly person), msj to mensaje (message), and tv to televisión (television), among others.
We replaced diminutives, augmentatives, and superlatives with their simple forms by human assessment. Since these are not present in the Spanish dictionary provided by FreeLing, 16 their treatment allows us to resolve some errors, such as forms written with misspellings or alterations.17,18 For example, muchisimo, muchiisisimo, and muchísimo (very much) are changed to mucho (much).
We automatically removed hashtags and URLs, among others.
To construct our features, we excluded punctuation (including emojis) and stop-words using the Spanish language list from the Natural Language Toolkit. 19 Stop-words are grammatical words such as articles, prepositions, and pronouns, among others.
Below, we present an example of an original tweet marked with the emotion sadness as well as its corresponding preprocessed text.
Original tweet: “esta noche lo unico que hago es recordarte … mi vida me haces muchisima falta/: #hermana #triste”
Preprocessed tweet: “esta noche único hago recordarte vida haces mucha falta”
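The preprocessing steps can be sketched in Python as follows. This is an illustrative approximation, not the authors' pipeline: `INFORMAL_TO_STANDARD` holds a few entries mentioned in the text plus two hypothetical ones ("unico" and "muchisima") added so that the example tweet can be reproduced, and `STOP_WORDS` is a tiny hand-picked set rather than the Natural Language Toolkit list.

```python
import re

# Tiny illustrative excerpts. The authors' dictionary has 670 entries and the
# stop-word list comes from NLTK; both structures below are hypothetical stand-ins.
INFORMAL_TO_STANDARD = {
    "dep": "descanse en paz",
    "msj": "mensaje",
    "tv": "televisión",
    "unico": "único",       # hypothetical entry (missing accent)
    "muchisima": "mucha",   # hypothetical entry (superlative to simple form)
}
STOP_WORDS = {"lo", "que", "es", "mi", "me", "en", "la", "el", "de"}


def preprocess(tweet):
    """Return the list of normalized tokens for one tweet."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"#\w+", " ", text)          # remove hashtags
    # Keep alphabetic tokens only, which drops punctuation, digits, and emojis.
    tokens = re.findall(r"[^\W\d_]+", text)
    normalized = []
    for token in tokens:
        # Replace informal forms; a replacement may expand to several words.
        normalized.extend(INFORMAL_TO_STANDARD.get(token, token).split())
    return [token for token in normalized if token not in STOP_WORDS]


original = ("esta noche lo unico que hago es recordarte … "
            "mi vida me haces muchisima falta/: #hermana #triste")
print(" ".join(preprocess(original)))
```

Stop-words are removed after replacement so that function words introduced by an expansion (such as "en" in "descanse en paz") are filtered as well; this ordering is a design choice of the sketch, not something stated in the text.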
Extraction of traditional and combined features
An n-gram is a sequence of textual elements (words, lemmas, grammatical labels, among others) as they appear in the text. In our case, n indicates the number of elements to be taken. With respect to the types of features, we selected words and lemmas. Lemmas were found using the official Spanish dictionary provided by FreeLing, 16 and words were obtained from the tweets and converted to lowercase. We also extracted n-grams of different sizes (uni-, bi-, and trigrams). Similarly, we defined a combination of unigrams with bigrams that we call unibigrams; another combination of bigrams with trigrams (bitrigrams) was also used. Unibitrigrams, a combination of unigrams with bigrams and trigrams, were also extracted.
We constructed combined features using the Waikato Environment for Knowledge Analysis (WEKA) tool. 20 This tool has a filter called StringToWordVector, which converts each character string or tweet into a set of features. The feature set is determined by the first filtered batch (usually training data). Using this filter, we applied the NGramTokenizer to separate the texts into n-grams, specifying their maximum (x) and minimum (y) sizes. For example, unibitrigrams were extracted specifying that x = 3 and y = 1. To illustrate the feature extraction step, we used the sentence “I am feeling blue about his relationship with his girlfriend.” Table 1 shows the traditional features extracted for this example, taking into account different sizes and types (if a word and its lemma were the same, diagonals were omitted). From this example, we constructed the following combined features:
Unibigrams of the words: “feeling_blue,” “blue_relationship,” “relationship_girlfriend,” “feeling,” “blue,” “relationship,” and “girlfriend.”
Bitrigrams of the words: “feeling_blue_relationship,” “blue_relationship_girlfriend,” “feeling_blue,” “blue_relationship,” and “relationship_girlfriend.”
Unibitrigrams of the words: “feeling_blue_relationship,” “blue_relationship_girlfriend,” “feeling_blue,” “blue_relationship,” “relationship_girlfriend,” “feeling,” “blue,” “relationship,” and “girlfriend.”
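The combined features listed above can be reproduced with a short Python sketch. The function name `extract_ngrams` is hypothetical; it simply collects every n-gram whose size lies between a minimum and a maximum, which is the behavior the text describes for the tokenizer settings (e.g., minimum 1 and maximum 3 for unibitrigrams). The token list is the example sentence after stop-word removal.

```python
def extract_ngrams(tokens, min_n, max_n):
    """Return all n-grams of sizes min_n..max_n, joined with underscores."""
    features = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append("_".join(tokens[start:start + n]))
    return features


# Content words of the example sentence used in the text.
tokens = ["feeling", "blue", "relationship", "girlfriend"]

unibigrams = extract_ngrams(tokens, 1, 2)     # unigrams + bigrams
bitrigrams = extract_ngrams(tokens, 2, 3)     # bigrams + trigrams
unibitrigrams = extract_ngrams(tokens, 1, 3)  # unigrams + bigrams + trigrams

print(unibigrams)
print(bitrigrams)
print(unibitrigrams)
```

For four tokens this yields 7 unibigrams, 5 bitrigrams, and 9 unibitrigrams, matching the lists given in the text.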
Table 1. Example of Extracted Sentence Features Expressing Negative Emotion
Implementation of automatic learning
In computational linguistics, the naive Bayes (NB) algorithm is commonly used as the baseline, and support vector machine (SVM) is the most generally applied method. SVM has been used in related works that obtain the best results.4,7 However, the Bayesian method with a multinomial distribution (MNB) and decision trees (J48) have recently given good or stable results in Spanish.6,7,13 Consequently, we implemented these four methods (NB, MNB, SVM, and J48) with different features.
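To make the MNB method more concrete, the following is a minimal pure-Python sketch of a multinomial naive Bayes classifier over term-frequency features with add-one (Laplace) smoothing. It is not the WEKA implementation used in the experiments; the class name, the smoothing choice, and the toy training data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial naive Bayes over token counts, with add-one smoothing."""

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.vocabulary = {feature for doc in documents for feature in doc}
        self.class_counts = Counter(labels)
        self.total_docs = len(labels)
        self.feature_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.feature_counts[label].update(doc)
        return self

    def _log_score(self, doc, label):
        # log P(class) + sum over features of log P(feature | class)
        score = math.log(self.class_counts[label] / self.total_docs)
        counts = self.feature_counts[label]
        denominator = sum(counts.values()) + len(self.vocabulary)
        for feature in doc:
            if feature in self.vocabulary:  # unseen features are ignored
                score += math.log((counts[feature] + 1) / denominator)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda label: self._log_score(doc, label))


# Toy training data (invented for illustration only).
train_docs = [
    ["triste", "noche", "falta"],
    ["extraño", "triste"],
    ["noticias", "molécula", "agua"],
    ["noticias", "estrellas"],
]
train_labels = ["sadness", "sadness", "neutral", "neutral"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["triste", "falta"]))
print(model.predict(["noticias", "agua"]))
```

Because the model multiplies per-feature probabilities, n-gram features such as unibigrams can be passed in exactly like single words: each n-gram is just another token in the document list.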
Selection of baselines
A baseline is a simple method commonly used in the state of the art to solve a given task. Our first baseline was the NB algorithm for unigrams, and our second was SVM.
Evaluation of results
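The measures used for evaluation were accuracy, precision, recall, and the harmonic mean of precision and recall, called F1:
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
where
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$
and TP, FP, and FN are the numbers of true positives, false positives, and false negatives for a given category, respectively.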
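As a minimal sketch, these measures can be computed per category from counts of true positives, false positives, and false negatives, using their standard definitions. The function names and the toy labels are hypothetical, and the way per-category scores are averaged in the five-category experiments is not specified in the text, so no averaging is shown.

```python
def precision_recall_f1(gold, predicted, category):
    """Standard precision, recall, and F1 for one category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(gold, predicted):
    """Fraction of texts whose predicted category equals the gold category."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy labels (invented for illustration only).
gold = ["sad", "sad", "sad", "neutral", "neutral"]
predicted = ["sad", "sad", "neutral", "neutral", "sad"]

print(precision_recall_f1(gold, predicted, "sad"))
print(accuracy(gold, predicted))
```

In this toy example the category "sad" has two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3, and accuracy is 3/5.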
Experiments
Testing with two categories
To detect sad versus neutral tweets, we constructed 1,000 of the most frequent features per category (2,000 in total). We hypothesized that the use of our newly combined features would improve the automatic detection of tweets with negative emotions and neutral tweets. These combined features should allow introduction of more information to methods of automatic learning, taking into account different types, values, and sizes. The size of the set of features used could be delimited (e.g., testing with 1,000 or 1,500 of the most frequently combined features per category). Using two categories, a single negative emotion (sad versus neutral tweets) could automatically be detected, whereas using five categories would enable automatic detection of the four negative emotions and neutral tweets at the same time. If our hypothesis is correct, it would be useful to discover an automatic learning method that yields better results than SVM in this specific task. Trigrams were omitted because they yielded the lowest results. In this case, the results using unibitrigrams are very similar to those using unigrams and unibigrams. Moreover, results using NB, MNB, and SVM were quite similar. The best results were obtained using unibigrams with different values: F1 = 0.962 and 0.958 using MNB and unigrams with term frequency-inverse document frequency (TF-IDF) values and F1 = 0.960 using MNB. In this case, there was only a very small difference between unigrams (traditional method) and combinations of features (Table 2).
Table 2. F1 Results from Detection of the Emotion Sadness Using Different Automatic Learning Methods and Sizes and Values of Word Features
The results for F1 obtained in the detection of the emotion sadness. Numbers in bold highlight the best F1 results.
NB, naive Bayes; TF, term frequency; TF-IDF, term frequency-inverse document frequency; MNB, naive Bayes with a multinomial distribution; SVM, support vector machine; J48, decision trees.
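The selection of the most frequent features per category, used in these experiments, can be sketched in Python as follows. The function name is hypothetical, and taking the union of the per-category lists is an assumption: the text does not say how a feature that is frequent in more than one category is handled, so the resulting set may be smaller than k times the number of categories.

```python
from collections import Counter


def top_features_per_category(documents, labels, k):
    """Return the union of the k most frequent features of each category."""
    counts = {}
    for doc, label in zip(documents, labels):
        counts.setdefault(label, Counter()).update(doc)
    selected = set()
    for counter in counts.values():
        selected.update(feature for feature, _ in counter.most_common(k))
    return selected


# Toy tokenized tweets (invented for illustration only).
documents = [
    ["triste", "triste", "noche"],
    ["triste", "falta", "falta"],
    ["noticias", "noticias", "agua"],
]
labels = ["sadness", "sadness", "neutral"]

print(sorted(top_features_per_category(documents, labels, 1)))
print(sorted(top_features_per_category(documents, labels, 2)))
```

In the experiments k would be 1,000 (or 1,500), and the documents would contain the extracted n-gram features rather than single words.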
Testing with five categories
For all negative categories and neutral tweets, 1,000 of the most frequent features per category (5,000 in total) were extracted. We found that the results were worse using 1,500 of the most frequent features per category. Tables 3, 4, and 5 show the results after detection of all negative categories and neutral tweets. As with two-category testing, we hypothesized that the use of newly combined features would improve automatic detection of tweets with negative emotions and neutral tweets. We exceeded the baseline values, and our best results using four negative emotions and neutral tweets were F1 = 0.664 and 0.663 using MNB with unibigrams or unibitrigrams with TF values. The baselines were F1 = 0.500 and 0.606 using NB or SVM with unigrams, respectively. Our best results are relatively high given that the classification task itself is difficult (random baseline = 0.2 for five categories). Thus, using the combination of features, and not just unigrams, is recommended. Furthermore, most cases showed significant differences between the types of features (lemmas or words), with the F1 results varying between 0.30 and 0.55. Therefore, using words and not lemmas is recommended.
Table 3. F1 Accuracy Results from Detection of All Negative Emotions and Neutral Tweets Using Different Automatic Learning Methods and Sizes and Values of Lemma Features
Numbers in bold indicate the best F1 results.
Table 4. F1 Accuracy Results from Detection of All Negative Emotions and Neutral Tweets Using Different Automatic Learning Methods and Sizes and Values of Word Features
Numbers in bold indicate the best F1 results.
Table 5. Number of Tweets Representing Each Emotion Classified in Each Category by Multinomial Distribution with Unibigrams Using Four Negative Emotions and Neutral Tweets
No_emo, no emotion.
Conclusions
The present study focused on automatic detection of negative emotions in short informal text messages called tweets using machine learning and automatic text analysis techniques. We considered the selection of the best feature set for this task (construction of the vector space model) and its automatic empirical evaluation. Various methods of automatic learning were evaluated using traditional and novel combined features (unibigrams, unibitrigrams, among others).
Our hypothesis was that the use of our newly combined features would improve the automatic detection of tweets with negative emotions and neutral tweets. The hypothesis was supported: we found that unibigrams or unibitrigrams with TF values are the best features for the automatic detection of various negative emotions (detection of the correct category of tweets with four negative emotions and neutral tweets). In addition, we showed that limiting the feature set to the 1,000 most frequent combined features per category also improves our results for this task. Note that for the situation with two categories (sad vs. neutral tweets), no significant improvement was observed, but the state-of-the-art methods already obtained an almost perfect result (F1 = 0.962, which is very close to the perfect performance, F1 = 1).
To the best of our knowledge, this is the first study to evaluate the effectiveness of using these combined features on a corpus of Twitter messages. Although our experiments were conducted on tweets written in Spanish, our methodology is language independent. One of the other interesting discoveries is that the MNB classifier is the best method for this task since it generated better results than the SVM, which is the method that has been used in related works obtaining the best results to date.6,15
When we considered two categories for classification (sad vs. neutral tweets), there were no relevant differences in the results related to the use of combined features. In this case, the correct classification rate was very high (F1 = 0.962). Using five categories (four negative emotions and neutral tweets), unibigrams and unibitrigrams of words obtained the best correct classification rate (F1 = 0.664) with our corpus. Furthermore, we created a dictionary with 670 elements (including informal words and phrases) to normalize tweets, which can complement SentiStrength for Spanish. 21
In the future, we could expand our set of negative emotions to include others, such as boredom and frustration, which could result in interesting discoveries for the detection of negative emotions in tweets. 22 Moreover, based on previous work, we could include emojis and URLs as part of the combined features to improve the detection of negative emotions. 23
Author Disclosure Statement
No competing financial interests exist.
