Abstract
Real-time messaging and opinion sharing have made social media websites valuable sources of many kinds of information, which opens the door to many kinds of analysis. Sentiment analysis, one of the most important of these, is attracting increasing interest. However, research in this field still faces challenges. Mainstream sentiment analysis research on social media websites and microblogs exploits only the textual content of the posts. This makes the analysis hard because microblog posts are short and noisy. However, posts come with rich context that can be exploited for sentiment analysis. To use context as an auxiliary source, some recent papers model the context of the target post through its reply/retweet thread. We claim that multiple sequential contexts can be used jointly in a unified model. In this article, we propose a context-aware multi-thread hierarchical long short-term memory (MHLSTM) that jointly models different kinds of contexts, such as tweep, hashtag and reply, besides the content of the target post. Experimental evaluations on a real-world Twitter data set demonstrate that our proposed model can outperform some strong baseline models by 28.39% in terms of relative error reduction.
1. Introduction
Social media websites have become very popular for sharing ideas and opinions about everything. On social media, people share their opinions about different topics, discuss daily issues and express their sentiments about products they use [1]. In 2019, 2.95 billion users used social media websites, and on Twitter, one of the most famous social networks, people published 511,200 tweets each minute in July 2019. In fact, the growth of the Internet has enabled opinion sharing on a wide range of topics. This huge amount of up-to-date data has brought opportunities for performing different kinds of analysis, including sentiment analysis about hot topics [2,3].
Sentiment analysis, also called opinion mining [4], analyses opinions, sentiments, appraisals, attitudes and emotions in different types of data including video, audio and, mostly, text [5]. The ultimate goal of this analysis is to understand the emotional orientation towards entities including products, organisations, individuals, events, issues or topics [4]. Furthermore, it is sometimes more useful to extract the sentiment towards specific aspects of an entity, such as the weight of a laptop. In order to infer the sentiment polarity of a unit, researchers have used predefined lexicons, machine learning or hybrid methods [6–8]. Results of sentiment analysis can benefit different groups: companies become aware of customers’ sentiment about their products and brands [1,9], governments can be informed of people’s opinions about their policies [10] and potential buyers can decide more efficiently, based on the opinions expressed by other people who have already used those products [11].
As the usage of social media increases, analysing its data gains more attention, too. In particular, growing attention to social media data has been reported in the field of sentiment analysis [4,6,12]. However, sentiment analysis on social media websites such as Twitter [13] has some special challenges, including the limited length of tweets [6,8,14] and the dependency of a tweet’s meaning on other related tweets such as reply tweets [12,15]. These challenges make it hard to infer the sentiment using only the content of the tweet itself (dubbed the target tweet). In response to these challenges, some researchers have suggested exploiting the contexts of the target tweet. These contexts include the tweets of the friends of the user who wrote the target tweet (dubbed the tweep) [10,16], related tweets through conversations [12,14] or tweets of tweeps similar to the tweep of the target tweet [6]. In this line of research, different methods have been suggested. Some recent papers suggested using recurrent neural networks (RNNs) to model the retweet/reply thread of related tweets, because tweets are published sequentially [12,15]. However, there are other kinds of sequential contexts that can be useful in determining the sentiment of the target tweet.
Sentiment consistency indicates that the tweets posted by a person tend to have similar sentiment [6,17]. Note that the most recent tweets of a tweep are more related to the target tweet than earlier ones. For example, consider this target tweet and its tweep context:
Target tweet:
‘I need to learn my lesson. I can’t be happy, I can’t be content. I don’t deserve to be happy, I don’t deserve to be content. I deserve nothing and I will always have nothing. I wish things were different but I don’t believe things will ever change. #Depression’.
Tweep context:
‘Mental illness runs in my family on my mother’s side. She has had issues as well as her mother and father and her grandmother also had issues. I was dealt a losing hand from the beginning. My mental health has degraded over the last 5 years to the point that I can’t function’.
‘Just when you think you are having kind of an ok day something comes along and ruins you and makes things worse than they have been. Drinking doesn’t help the situation, but it at least it makes you feel a little better. #Depression #bills #Sadness #sad’.
‘I have brief moments where I feel better about life in general. I wouldn’t call it happiness, but more moments of contentment. I wish it lasted longer than the fleeting time it happens. #mentalillness #mentalhealth #MentalHealthAwareness #sadness’.
‘I miss days when things were simple. It was only a few years ago. Looking back on it and I feel like I was happier then. Now there is nothing but stress and worry and dread and hopelessness. #sad #Depression #thoughts #mentalhealth #mentalillness’.
‘When your mental health is a full time job, there’s no vacation’.
Because of positive words such as happy, content and deserve, purely textual models such as the support-vector machine (SVM) and long short-term memory (LSTM) misclassify the target tweet as positive instead of negative. However, our proposed model, which effectively utilises the tweep context, can infer a negative state for the tweep and make a better estimate for the target tweet.
Furthermore, tweets with common hashtags tend to have similar sentiments. This is because people tend to have similar sentiments towards a topic [6], which provides another context for the target tweet. In this article, we propose a multi-thread hierarchical long short-term memory (MHLSTM) network that uses different context types of the target tweet in order to determine its sentiment polarity. The proposed architecture is hierarchical in the sense that it considers word-based modelling of the tweets’ content at one level and tweet-based modelling of the thread at a higher level. We evaluate our proposed model using a real-world data set and show that it can outperform some strong baselines and state-of-the-art models. The main contributions of this article are as follows:
We propose MHLSTM, which can jointly model an arbitrary number of different sequential contexts, such as previous tweets that the target tweet replies to. This model considers the order of tokens in the tweet and the order of context tweets in each of the threads in a hierarchical structure.
We model the tweets of a person as a time-ordered sequence that ends with the target tweet, and utilise his or her previous tweets to enrich the content of the target tweet.
We model tweets which have a common hashtag with the target tweet as a time-ordered sequence, and utilise previous tweets to enrich the target tweet representation.
The rest of this article is organised as follows. In section 2, we provide a literature review of social media sentiment analysis and the use of deep learning methods in the field. Each subsection in section 2 ends with the contextualised models. Section 3 introduces the baselines and the proposed model. Section 4 describes the data and the experimental setup, and section 5 presents the results. Finally, section 6 concludes the article and outlines future work.
2. Related work
2.1. Shallow learning in social media sentiment analysis
The surge of social media websites and the simplicity of their usage provide a unique source of opinionated information for different types of analysis, including sentiment analysis. Mainstream research in sentiment analysis on social media has concentrated on the content of the posts [6]. In this line, researchers have used different pre-processing techniques, features and classifiers. Kouloumpis et al. [18] used n-grams, lexicons, part-of-speech and micro-blogging features such as the presence of emoticons and abbreviations. Go et al. [9] used different classification methods including Naive Bayes, maximum entropy and SVM on the text of tweets. Basile and Novielli [19] used SVM with features such as bigrams, number of hashtags, lexicon and embeddings of tweet words. Angiani et al. [20] investigated the impact of different pre-processing techniques on Twitter sentiment analysis [21]. Their results showed that, among different pre-processing methods, stemming is the most influential for improving Twitter sentiment classification.
As posts in social media are not isolated, the meaning of a typical post is not completely understood without its context [8]. Therefore, another line of research used different kinds of related posts as contexts. Hu et al. [14] proposed sociological approach to handling noisy and short texts (SANT), a system that jointly minimises the difference between textual prediction and labels, as well as the difference between context-based predictions and labels. They used tweep and friends as contexts. Mi et al. [22] proposed the microblog sentiment analysis using user similarity and social relations (MSA-USSR) method which used the tweep similarity and interaction-based social relations and achieved better performance than SANT. Vanzo et al. [23] modelled the sentiment classification problem as a sequential classification task over thread of tweets and used SVMhmm as their classifier. Zou et al. [6] proposed a method based on the sentiment consistency [17] and emotional contagion. Emotional contagion [24] indicates that similar people tend to have similar opinions. So, they combined social and topical contexts to improve sentiment classification.
2.2. Deep learning in social media sentiment analysis
In recent years, deep learning methods have reached remarkable results in many natural language processing tasks, including sentiment analysis [8,12]. Based on the idea that general embeddings (such as word2vec) of some opposite words such as ‘good’ and ‘bad’ are similar because of their similar contexts, some works proposed other word embedding techniques which are capable of capturing the sentiment orientation of words, too [25–27]. Wang et al. [28] used LSTM for Twitter sentiment classification and showed that this model has a better understanding of negative compositions. Since the initial word representations greatly affect sentiment classification, Rouvier and Favre [29] combined lexical, part-of-speech and sentiment embeddings to initialise input representations. They used a convolutional neural network (CNN) on top of these representations. Based on the same idea, Severyn and Moschitti [30] proposed to pre-train embeddings in two stages. First, they trained embeddings on 50 million tweets; then, to enrich representations with sentimental clues, they fine-tuned them with 10 million tweets that were labelled in a distant supervision manner. Also, Cliche [31] used millions of tweets to train word embeddings and then refined them using distantly supervised tweets. He used these embeddings in bidirectional long short-term memory (BiLSTM) and CNN networks. Majumder et al. [2] utilised the correlation between sentiment classification and sarcasm detection and proposed a multitask system that jointly predicts the sentiment of a tweet and whether it is sarcastic.
Besides using the contents of the target tweet, a few researchers have examined different types of contexts to improve the sentiment classification results of deep learning methods; however, this research is at an early stage and further work is still required. In this line, Croce et al. [32] proposed a system called Unitor that injected sentiment information into a CNN using polarity lexicons and, in order to pre-train the network, used tweets related by conversation or hashtag instead of arbitrary tweets. Since Twitter provides the features of reposting a tweet, called retweeting [33], and replying to a tweet, Huang et al. [12] used reply/retweet as the context and proposed the hierarchical long short-term memory (HLSTM) for tweet classification. In the first level of this model, a representation of each tweet was formed, and the representation of the thread was obtained in the second level. Feng et al. [15] proposed a similar system called context attention–based long short-term memory (CA-LSTM), but instead of using the last state of the LSTM, they equipped each level of the model with an attention mechanism. The attention mechanism takes a weighted average of all the LSTM states, resulting in a better representation. These last two methods model just one context in a sequential manner and ignore the other related sequential contexts.
3. Baselines and the proposed method
In this section, we introduce the architecture of our proposed method and some baselines and the state-of-the-art models.
3.1. SVM
First, we train different types of linear SVM including SVM(1), SVM(1–3), context-aware support-vector machine (CASVM) and SVM(GloVe). The first three models use bag-of-words (BOW) representations of the tweet text as features. This is a popular baseline for text classification [4,34]. SVM(1) uses only unigrams as vocabulary while SVM(1–3) uses unigrams, bigrams and trigrams in its vocabulary. In order to test the ability of SVM to benefit from contexts, in CASVM, we concatenate the target tweet’s tokens with its contexts’ tokens and feed them to SVM(1–3). We train different variants of this model including CASVM(user context), CASVM(hashtag context), CASVM(reply context) and CASVM(all contexts). Finally, SVM(GloVe) uses the average of the GloVe word embeddings [35] of the target tweet’s tokens as the tweet’s representation.
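The feature construction behind SVM(1), SVM(1–3) and CASVM can be illustrated with a minimal sketch. It only builds the bag-of-n-grams counts that would be handed to a linear SVM; the function names, the whitespace tokenisation and the choice to append context tokens after the target tokens are assumptions made for illustration, not details taken from the experiments.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of length n_min..n_max as strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def bow_features(target_tokens, context_tweets=(), n_max=3):
    """Bag-of-n-grams counts; passing context tweets gives a CASVM-style input."""
    tokens = list(target_tokens)
    for tweet in context_tweets:
        # Assumption: context tokens are simply appended to the target tokens.
        tokens.extend(tweet)
    return Counter(ngrams(tokens, 1, n_max))


# Toy example (hypothetical tweets).
target = "i am not happy".split()
context = ["bad day".split()]

svm_1 = bow_features(target, n_max=1)   # unigrams only
svm_1_3 = bow_features(target)          # unigrams, bigrams and trigrams
casvm = bow_features(target, context)   # target tokens plus context tokens
```

Only the (1–3) variant contains the bigram "not happy", which is the kind of negation cue a unigram vocabulary cannot represent. Note that plain concatenation also creates n-grams that span the boundary between the target tweet and its context.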
3.2. LSTM
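RNN is a special kind of neural network model adapted to sequential data such as text. These models (RNN and its variants) have proved their efficiency in the sentiment analysis task [34]. At each time step, the RNN linearly combines the input features of that time step with the current state of the system and then applies a non-linearity to the result [36]:

$$h_t = f(W x_t + U h_{t-1} + b)$$

where $x_t$ is the input at time step $t$, $h_{t-1}$ is the state carried over from the previous step, $W$, $U$ and $b$ are learned parameters and $f$ is a non-linear function such as tanh.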
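The recurrence described above can be sketched in a few lines. This is a plain tanh RNN rather than an LSTM cell (the gating of the LSTM is not shown), and the parameter names W, U and b as well as the toy values are assumptions chosen for illustration.

```python
import math


def rnn_step(x_t, h_prev, W, U, b):
    """One vanilla RNN step: h_t = tanh(W x_t + U h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t


def rnn_encode(inputs, W, U, b):
    """Run the step over a sequence, starting from a zero state."""
    h = [0.0] * len(b)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W, U, b)
        states.append(h)
    return states


# Toy parameters (hypothetical values, 2-dimensional input and state).
W = [[0.5, -0.5], [0.1, 0.2]]
U = [[0.0, 0.3], [0.3, 0.0]]
b = [0.0, 0.1]
states = rnn_encode([[1.0, 0.0], [0.0, 1.0]], W, U, b)
```

The last element of `states` summarises the whole sequence, which is the representation a last-state model would pass on to a classifier.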
3.3. BiLSTM
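While RNN considers only information from the previous states, the bidirectional recurrent neural network (BRNN) [38] considers information from the next states, too. In fact, BRNN combines two RNNs running in opposite directions in order to use forward and backward information:

$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the states at time step $t$ of the RNNs that read the sequence forwards and backwards, respectively, and $[\cdot\,;\cdot]$ denotes their combination, typically concatenation.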
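A minimal sketch of the bidirectional idea follows. It uses a scalar tanh RNN as a stand-in for each direction and pairs the two states at every position; the helper names and weights are hypothetical, and a BiLSTM would replace `run_rnn` with an LSTM.

```python
import math


def run_rnn(inputs, w_in, w_rec):
    """Scalar-state tanh RNN; returns one state per input, in input order."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def run_birnn(inputs, fwd_params, bwd_params):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs, *fwd_params)
    # Read the reversed sequence, then flip back so indices line up.
    backward = run_rnn(inputs[::-1], *bwd_params)[::-1]
    return [(f, b) for f, b in zip(forward, backward)]


xs = [1.0, 0.0, -1.0]
states = run_birnn(xs, fwd_params=(1.0, 0.5), bwd_params=(1.0, 0.5))
```

At the first position the forward state has seen only the first input, while the backward state has already read the rest of the sequence, which is how later tokens can influence the representation of earlier ones.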
3.4. HLSTM and CA-LSTM
Text data exhibit sequential structure at multiple levels including characters, words, sentences, tweets and related tweets. These levels constitute a natural hierarchy for representing the meaning of different fragments of text [39,40]. Consider a thread of tweets in which each tweet is related to the previous ones, and each word in each tweet is related to the other words in the same tweet. To model this phenomenon, Huang et al. [12] proposed HLSTM, which uses a word-level LSTM that receives the words as input and represents the tweet by the last state of this first-level LSTM. It then uses that representation as input to the second-level LSTM. In this line, Feng et al. [15] showed that the attention mechanism could improve the model performance, so they incorporated attention into both levels of the model. After obtaining the tweet representations, the model feeds each tweet representation to the second-level LSTM, which produces an overall representation of the thread. The output of this level is then used for classification.
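The difference between the two pooling strategies, taking the last state (HLSTM) versus an attention-weighted average of all states (CA-LSTM), can be sketched as follows. The dot-product scoring against a query vector is an assumption made for illustration; the scoring function actually used in CA-LSTM is not specified here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def last_state(states):
    """HLSTM-style pooling: keep only the final recurrent state."""
    return states[-1]


def attention_pool(states, query):
    """Attention pooling: weighted average of all states.

    Assumption: scores are dot products between each state and `query`.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return pooled, weights


# Toy recurrent states for a three-token tweet (hypothetical values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(states, query=[0.0, 0.0])
```

With a zero query all scores are equal, so the weights are uniform and the pooled vector is the plain mean of the states; a learned query shifts the weight towards the states that matter most for the prediction.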
3.5. Proposed model: MHLSTM
Several kinds of tweets are related to the target tweet, and these data can potentially be used to improve the quality of sentiment analysis. Furthermore, since each tweet has a publication time, related tweets constitute a sequence. As we mentioned earlier, both HLSTM and CA-LSTM represent each tweet based on one context. However, there are other related contexts which can be utilised in a similar way, such as the past tweets of the tweep who tweeted the target tweet. It has been shown that the tweets of the same tweep are more likely to have similar sentiments than random tweets [14]. Also, sentiment towards the same topic can be consistent to some degree [12]; for example, during disasters or pandemics such as the coronavirus outbreak, it is quite possible to see more negative tweets about that event. So, we propose a multi-thread context-aware LSTM that can utilise different kinds of related sequences. Our proposed system contains two levels of LSTM in each sequence, which model tweets and threads, respectively. The system then combines the representations of the different threads. Figure 1 shows the overall architecture of the proposed model. The architecture is structured by threads. In each thread, we have a sequence of tweets ordered by time that ends with the target tweet. Inputs to the model are tweet tokens. There is a word-level LSTM for each thread, which builds a representation of each tweet from its tokens. A tweet-level LSTM then reads these tweet representations in time order to build the representation of the thread, and the representations of all threads are combined to classify the target tweet.
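The data flow of the multi-thread hierarchy can be illustrated with a small sketch. It is not the MHLSTM implementation: the encoder is an element-wise tanh recurrence standing in for an LSTM, the same fixed weights are reused in every thread, the thread representations are combined by concatenation, and the embeddings are toy values. All of these are assumptions made to keep the example short.

```python
import math


def encode(vectors, w_in=0.5, w_rec=0.5):
    """Stand-in sequence encoder (element-wise tanh RNN, NOT an LSTM).

    Returns the final state after reading `vectors` in order.
    """
    h = [0.0] * len(vectors[0])
    for v in vectors:
        h = [math.tanh(w_in * x + w_rec * s) for x, s in zip(v, h)]
    return h


def encode_thread(thread, embed):
    """Level 1: tokens -> tweet vector. Level 2: tweet vectors -> thread vector."""
    tweet_vecs = [encode([embed[tok] for tok in tweet]) for tweet in thread]
    return encode(tweet_vecs)


def multi_thread_features(threads, embed):
    """Encode every context thread and concatenate the thread vectors."""
    features = []
    for name in sorted(threads):  # fixed order keeps the feature layout stable
        features.extend(encode_thread(threads[name], embed))
    return features


# Toy 2-dimensional embeddings and tweets (hypothetical values).
embed = {"sad": [-1.0, 0.0], "day": [0.0, 0.5], "again": [0.0, -0.5]}
target = ["sad", "again"]
threads = {
    "tweep": [["sad", "day"], target],  # earlier tweet by the same tweep
    "hashtag": [["day"], target],       # earlier tweet with the same hashtag
    "reply": [target],                  # no earlier tweet available
}
features = multi_thread_features(threads, embed)
```

Every thread is time-ordered and ends with the target tweet, so a thread without context degenerates to the target tweet alone. The concatenated vector would then be passed to a classification layer.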

Figure 1. Architecture of the multi-thread hierarchical LSTM (MHLSTM).
4. Experimental results and discussion
4.1. Data and experimental setup
In order to evaluate the proposed model, we use Twitter data. It has been reported that Twitter is the main social media data source for sentiment analysis [4] and context-aware sentiment analysis [8]. For annotating data, we use a popular annotation approach for Twitter sentiment analysis called distant supervision [9,25,43–46], also known as indirect crowdsourcing [7]. In this line, researchers use sentimental hashtags and emoticons to label tweets as positive or negative. Mohammad [43] conducted two experiments which showed that the hashtag-based labels were consistent with the human annotations [7], and Janssens et al. [45] compared the valence of hashtag labels and human labels and indicated a high degree of agreement between them. Based on these previous research works, we used #happy, #excellent, #happiness, #sadness, #sad and #frustrated to collect target tweets using the official Twitter application programming interface (API) on 14 April 2019. We extracted 22,437 target tweets, including 11,370 (~51%) positive and 11,067 (~49%) negative ones. The searched hashtags themselves were then removed in order to prevent data leakage. We then gathered three kinds of contexts for each target tweet (up to five previous tweets). These include the tweets that the target tweet replies to, previous tweets that include the first hashtag of the target tweet, and the previous tweets of the tweep of the target tweet. Table 1 shows the statistics of the data set. It is clear that a high percentage of tweets have the tweep and the hashtag contexts, and about one-fifth of the tweets have the reply context.
Table 1. Data set statistics.
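The labelling and context-gathering steps can be sketched as follows. The grouping of the six hashtags into positive and negative sets follows their evident polarity, while the handling of tweets with conflicting hashtags, the dictionary layout of a tweet and the function names are assumptions made for illustration; no Twitter API call is shown.

```python
POSITIVE = {"#happy", "#excellent", "#happiness"}
NEGATIVE = {"#sadness", "#sad", "#frustrated"}
LABEL_TAGS = POSITIVE | NEGATIVE


def distant_label(tokens):
    """Label a tweet from its hashtags and strip the label hashtags.

    Assumption: tweets with no label hashtag, or with hashtags of both
    polarities, are left unlabelled (None) and returned unchanged.
    """
    tags = {t.lower() for t in tokens if t.startswith("#")}
    has_pos, has_neg = bool(tags & POSITIVE), bool(tags & NEGATIVE)
    if has_pos == has_neg:
        return None, tokens
    cleaned = [t for t in tokens if t.lower() not in LABEL_TAGS]
    return ("positive" if has_pos else "negative"), cleaned


def previous_tweets(target, candidates, max_len=5):
    """Time-ordered thread: up to `max_len` earlier tweets, then the target."""
    earlier = [t for t in candidates if t["time"] < target["time"]]
    earlier.sort(key=lambda t: t["time"])
    return earlier[-max_len:] + [target]


# Toy data (hypothetical tweets; `time` is any sortable timestamp).
label, cleaned = distant_label("Best day ever #happy #sunday".split())
timeline = [{"id": i, "time": i} for i in range(1, 9)]
thread = previous_tweets({"id": 9, "time": 9}, timeline)
```

Removing the label hashtags before training matters because they would otherwise reveal the label directly. The same `previous_tweets` helper could be applied to the tweep timeline, to tweets sharing the first hashtag, or to the reply chain.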
We implemented all of the models using TensorFlow [47], except SVM. For SVM, we used the scikit-learn library [48]. The proposed model was implemented in two settings. In the first, word vectors were initialised randomly; in the second, they were initialised with the pre-trained GloVe word embeddings [35], and the weights of these vectors were fine-tuned during training to better capture sentiment clues. We used GloVe instead of word2vec [49] because the results of a recent paper show that GloVe outperforms word2vec in Twitter sentiment analysis [50]. We chose the best hyperparameters for each model experimentally. Table 2 shows the hyperparameter values we used for our model. We used dropout [51] for regularisation with a keep rate of 0.5 on the output of the LSTMs in both levels, and we used the Adam optimisation method [52] as the optimiser. We compared our proposed model with some baselines and state-of-the-art models including SVM, LSTM, BiLSTM, HLSTM and CA-LSTM.
Table 2. Hyperparameters of the model.
RNN: recurrent neural network.
5. Results
In this section, we present our experimental results and compare different models and approaches. In each of the following tables, the best model is shown in bold. Table 3 shows the results for the models that use neither context nor pre-trained word embeddings (dubbed baselines). As a linear model, SVM performs quite well on our data set and makes a powerful baseline. This may be why SVM has been reported as the most frequent classification method in the sentiment analysis literature [4]. SVM(1–3) provides the best performance. This model, which is equipped with unigrams, bigrams and trigrams, performs better than SVM(1), probably because it can handle negation and sentimental phrases. We tested other feature combinations, such as unigrams to five-grams, but we did not get any improvement. As the second baseline, LSTM works slightly better than SVM(1), but not as well as SVM(1–3). This is because deep models are data-hungry, which means they need a lot of data to learn well [8].
Table 3. Experimental results for simple baselines.
SVM: support-vector machine; LSTM: long short-term memory; BiLSTM: bidirectional long short-term memory.
Table 4 shows the experimental results for context-aware models. For SVM(1–3), we concatenated all the context tweets to the target tweet and considered the result as a single sample. This kind of context usage has no positive impact on SVM, because it does not consider the position of tokens. Also, context tokens may add noise to the model. This result has been reported in previous research works, too [15]. Among the different contexts, hashtag has the most negative effect on SVM, because people may have different sentimental tendencies towards the same topic, and the SVM cannot handle this. The reply context has the least negative effect, probably because only a small fraction of tweets (about 20%) have this context.
Table 4. Experimental results for context-aware models.
CASVM: context-aware support-vector machine; HLSTM: hierarchical long short-term memory; CA-LSTM: context attention–based long short-term memory; MHLSTM: multi-thread hierarchical long short-term memory.
CA-LSTM [15], with two levels of attention on the reply thread, builds an appropriate representation of the tweets and the reply thread, so it performs better than HLSTM [12]. Our proposed model (MHLSTM), which utilises three kinds of contexts, performs quite well and decreases the error rate by 17.72%.
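Relative error reduction, the measure behind figures such as 17.72%, compares the error of a model with the error of a reference model. A minimal sketch follows, assuming that the error is taken as one minus the evaluation score (for example accuracy or macro F); the scores in the example are hypothetical and are not results from the tables.

```python
def relative_error_reduction(baseline_score, model_score):
    """Fraction of the baseline's error that the model removes.

    Assumption: error = 1 - score, with scores in the range [0, 1).
    """
    baseline_error = 1.0 - baseline_score
    model_error = 1.0 - model_score
    return (baseline_error - model_error) / baseline_error


# Hypothetical scores: a baseline at 0.80 and a model at 0.85.
rer = relative_error_reduction(baseline_score=0.80, model_score=0.85)
```

In this toy case the error drops from 0.20 to 0.15, a relative reduction of about 25%, although the absolute gain is only five points.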
In another experiment, we initialised word embeddings with pre-trained GloVe word vectors and fine-tuned them along with the other parameters. The reason for fine-tuning is that some words with opposite sentiments, such as good and bad, have similar general embeddings because of their similar contexts [25,26,50]. Table 5 shows the results of this experiment. Comparing this table with Table 4 reveals the benefit of initialising neural models with pre-trained vectors. This observation is in line with previous research [15]. SVM(GloVe) does not perform well, because averaging in a low-dimensional space can lead to a lot of information loss, as reported in previous research works, too [40,53]. BiLSTM, a powerful model for sentiment analysis [34], performs better than LSTM. CA-LSTM performs better than all the other previous models, because it utilises the reply context and efficiently combines this information with the target tweet representation. Finally, the proposed MHLSTM performs better than all the other methods and decreases the error rate by 28.39%.
Table 5. Experimental results for models with pre-trained embeddings.
LSTM: long short-term memory; BiLSTM: bidirectional long short-term memory; HLSTM: hierarchical long short-term memory; CA-LSTM: context attention–based long short-term memory; MHLSTM: multi-thread hierarchical long short-term memory.
Figure 2 shows the final comparison of different models including baselines, state-of-the-art models and the proposed MHLSTM model. It is obvious that our proposed model performs significantly better than all the other models.

Figure 2. Experimental results (macro F).
In this section, we study the difference in context length between the samples that MHLSTM predicts correctly and those it predicts incorrectly. Table 6 shows all of the context types and their combinations. The number of contexts differs significantly between correctly and incorrectly labelled samples.
Table 6. Difference in context length between samples labelled correctly and incorrectly by MHLSTM.
6. Conclusion
In this article, we proposed the MHLSTM model, which uses attention weights to incorporate preceding tweets from different context types for multi-thread context-aware sentiment classification. We modelled different kinds of contexts as separate threads and utilised a hierarchical LSTM to jointly model tweets and threads. Our proposed model captures the long-distance dependency in each context and classifies target tweets based on that. Experimental results on a real-world Twitter data set demonstrated that our model improves classification results compared with the other baselines and some of the state-of-the-art models. Future work includes exploiting other kinds of sequential contexts of the target tweet, such as the tweets of the tweep’s friends, inspired by the sentiment contagion theory [54]. Different kinds of context utilisation can be investigated, too. Other directions are using the contextual long short-term memory (CLSTM) [55], which combines contexts in the core architecture of the LSTM, and using the context representation as the initial state of the LSTM, instead of a zero-initialised state.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
