Abstract
Sentiment analysis deals with classifying the opinions in text. Twitter is the most popular microblogging platform in social media, with hundreds of millions of tweets posted every day, and a considerable number of tweets contain opinions. The goal of this paper is to classify the polarity of tweets into positive and negative classes using dynamic sentiment lexicons based on the frequencies of words in each class. We extract five meta-level features from the generated sentiment lexicons and classify the text based on them. We also incorporate some previously known lexicon-based and corpus-based features. The proposed method is assessed on six datasets; it outperforms previously published results in accuracy on four datasets and in f-measure on three. Because the method generates sentiment lexicons dynamically, the generated lexicons can capture changes in the meanings of words. The accuracy of our method on four datasets and its f-measure on three datasets are higher than 85%.
Introduction
Social media users generate massive amounts of text. They share their feelings about different subjects and entities, such as products, politicians, and corporations [1]. There are numerous media outlets in which people can express themselves. One of those is Twitter, in which the posts are called tweets and should not exceed 140 characters. Twitter has about 320 million active users and about 500 million tweets per day.
As an increasing number of users share their views online, microblogging websites are a valuable source of opinions [2]. The opinions people share on microblogging platforms can be used for marketing of products [3], social studies [2], world news [4], and prediction of future events [5]. The user-generated content on the web enables information technologies to benefit from the diversity of the knowledge the users provide [6]. Hence, companies are eager to mine Twitter to find out about people’s opinions [7].
The automatic analysis of opinions expressed in text is a discipline called opinion mining or sentiment analysis. This field is about the sentiments, opinions and emotions people express about various subjects, such as products, organizations, companies and famous people [1]. Sentiment analysis deals with opinionated text such as reviews [8, 9], blog posts [10, 11], and news [12].
The limitation on tweet length makes opinion mining in Twitter most similar to sentence-level opinion mining [7]. The special culture and language of Twitter users is a challenge for sentiment analysis in Twitter [13]. Some studies [14–16] treat sentiment analysis as a three-class classification problem with positive, negative and neutral classes. There are other related fields in sentiment analysis, such as emotion mining and strength detection [17–20].
In this research, we create sentiment lexicons from the datasets themselves, and then make use of these lexicons to classify text. Our focus in this paper is on polarity classification of subjective tweets, in which we classify text into positive and negative classes. One of the challenges of sentiment analysis in Twitter is the slang that is widely used in tweets. Slang reflects the different backgrounds of Twitter users [13]. Our method addresses this problem by creating dynamic sentiment lexicons from all of the words present in the datasets; some existing sentiment lexicons, such as AFINN [21] and Bing Liu’s lexicon [1], do not contain most slang words. Also, the shortness of tweets makes them difficult to evaluate using existing lexicons [22]. Malformed words are also prevalent in Twitter, which is another shortcoming of using existing lexicons [23].
An optimal general-purpose lexicon for all domains cannot be created, because the polarity of words is sensitive to the domain [24]. Our method generates dynamic lexicons from the training datasets, and then extracts five meta-level features from tweets using these lexicons. We classify the text using a machine learning technique, SMO [25]. We divide the dataset into training and test datasets, and the lexicons are built from the training data only. Using these lexicons, the values of the meta-level features are calculated for both training and test data.
Our contributions in this paper are as follows: we propose a novel approach for generating sentiment lexicons and assigning numerical scores to words present in text, based on the frequencies of words in positive and negative text; these lexicons are used to calculate meta-level features; we show that the features of these lexicons can be used alongside the features of other lexicons to improve their accuracy and f-measure.
The paper is organized as follows: In Section 2, we explore the related work in the field of polarity classification. We describe our method in Section 3. In Section 4, the results are presented and discussed. Finally, in Section 5, we conclude the paper.
Related work
We describe previous work on sentiment analysis in Twitter as well as work that focuses on sentiment lexicons. We also explore fuzzy sentiment analysis and web resources.
Sentiment analysis in Twitter
Tweets are 140 characters or less, and hence, are usually straightforward. They are considered a great resource for sentiment analysis [26].
Records in the Twitter datasets, i.e. tweets, should be labeled for classification. It is difficult to label a huge number of records manually, and some works benefit from emoticons for labeling tweets [27–29]. The authors of [26] argue that using emoticons generates noise in the labeled data, since a positive emoticon does not necessarily make a tweet positive.
Go et al. [28] used emoticons to create an automatically labeled dataset of 1,600,000 tweets. Liu et al. [29] created another dataset and labeled the records using both emoticons and manual labeling.
To classify the sentiment of tweets, Gonçalves et al. [30] proposed a combination of sentiment analysis methods. Agarwal et al. [3] used POS-specific features and a tree kernel for feature engineering for sentiment analysis in Twitter. Zhang et al. [31] combined lexicon-based and learning-based methods for sentiment classification of tweets. Mohammad et al. [32] proposed two SVM classifiers: one for sentiment classification of tweets, and one for sentiment classification of a term within a tweet. Hu et al. [33] used networked data, and investigated whether social relations can be useful in the detection of sentiments in tweets. Saif et al. [34] added semantic features as additional features for sentiment classification of tweets. Bravo-Marquez et al. [35] used meta-level features based on sentiment lexicons for classification of tweets. Kaewpitakkun et al. [36] used a hybrid approach that incorporated sentiment lexicons and machine learning techniques. Saif et al. [37] proposed a lexicon adaptation method that considers the context in which the words are used. Coletta et al. [38] combined classification and clustering for classification of tweets. Da Silva et al. [13] used an ensemble of classifiers for sentiment classification of tweets. Speriosu et al. [39] used label propagation with a maximum entropy classifier on Twitter datasets. Carvalho et al. [40] used word co-occurrences in a statistical and evolutionary model for sentiment classification of tweets. Lu [41] used a semi-supervised approach and incorporated microblog-microblog relations. Saif et al. [42] captured patterns of words whose contextual semantics are similar. Baecchi et al. [43] incorporated a multimodal approach. Hu et al. [44] used emotional signals, such as emoticons or product ratings, in sentiment classification. Saif et al. [45] incorporated a lexicon-based method. Keshavarz and Saniee [46] proposed a genetic algorithm approach for polarity classification of microblogs, in which they defined and solved an optimization problem. Another work based on optimization algorithms is [47], which incorporates cuckoo search. Rout et al. [48] used n-gram and part-of-speech features to analyze social media texts from the sentiment and emotion points of view. The problem of tweet sentiment analysis in Spanish has been explored in [49] using convolutional neural networks.
Web resources
Sentiment analysis works on text gathered from the web, i.e. a web resource. Numerous web resource features used in web mining, such as n-grams, phrases, terms, hypernyms, document categories, and named entities, are cited in [50]. Incorporating text features in web mining is addressed in [51]. Features such as terms, keywords and phrases are used in [52]. Linguistic features are incorporated in [53] as well. Adding further aspects to web resources, such as geographical information, is explored in [54].
Fuzzy sentiment analysis
Several research papers have studied the effect of combining fuzzy logic and sentiment analysis. Fuzzy logic is used for modeling polarity in [55], under the assumption that a text can be highly or mildly positive or negative. The problem studied in [56] is building a fuzzy product ontology for aspect-based opinion mining. In [57], the reviews are divided into groups ranging from very positive to very negative. A fuzzy lexicon is used in [58], in which the degree of positivity or negativity of words and reviews is decided by fuzzy sets. A neuro-fuzzy model is incorporated for sentiment analysis in [59]. WordNet is explored in [60], where the membership of nearly 8000 words in fuzzy categories of sentiment is calculated. Propagation in social networks is explored using fuzzy sets for better sentiment analysis in [61]. In [62], the authors argue that there is an inherent vagueness in the definition of positivity, objectivity and negativity, and try to address the issue using fuzzy sets. Finally, the problem of creating a fuzzy domain sentiment ontology is studied in [63].
Sentiment lexicons
One of the main approaches for sentiment analysis is to use sentiment lexicons. A sentiment lexicon is a set of sentiment words or phrases with assigned scores. Since our work here is to generate dynamic lexicons, we describe existing well-known lexicons. Bradley and Lang [65] proposed the ANEW lexicon, which stands for Affective Norms for English Words. This lexicon was introduced before the rise of microblogging. Nielsen [21] proposed a new lexicon inspired by ANEW, and named it AFINN. Bing Liu’s lexicon was created by Bing Liu [1] and contains positive and negative words. The words of EmoLex (NRC-emotion) [66] are labelled by their polarity and emotion. The NRC-hashtag lexicon was created for the SemEval task by the NRC-Canada team [32]. The OpinionFinder lexicon is based on the Multi-Perspective Question-Answering dataset (MPQA) and was proposed by Wilson et al. [67]. The Sentiment140 lexicon was also created by the NRC-Canada team. Another lexicon is SentiWordNet, which was first proposed by Esuli and Sebastiani [68]. SentiWordNet 3.0 was created by Baccianella et al. [69].
When using these lexicons, one should consider challenges such as intensification and negation [70].
One shortcoming of these lexicons is that they rate some words as positive (or negative) without considering that these words may have different meanings in social media [46]. Also, the coverage of some of these lexicons on tweets, such as AFINN [21], Bing Liu’s lexicon [1] and OpinionFinder [67], is low.
Frequency-based sentiment analysis (FBSA)
We propose an algorithm, FBSA, that first generates a sentiment lexicon and then uses the generated lexicon for the classification task. In our method, a sentiment lexicon is generated in the training phase of the algorithm on the training dataset, and this lexicon is used for classification of records in the test dataset. We use the 10-fold cross-validation scheme for our method: we build sentiment lexicons based on the 9 parts that form the training set, create a model on the training data, and test the model on the test data. In the proposed method, all of the words are considered for lexicon generation, because tweets are short and every word can be decisive in classification [46]. A salient advantage of our algorithm is that it does not omit any words in the datasets, even stop-words.
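The fold structure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `k_fold_indices` is hypothetical, and the interleaved fold assignment (no shuffling or stratification) is an assumption made to keep the example short.

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    ASSUMPTION: records are assigned to folds in an interleaved fashion;
    a real experiment would likely shuffle or stratify first.
    """
    folds = [list(range(i, n_records, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        # The training set is the union of the other k - 1 folds.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

In each iteration, the lexicon would be generated from the records in `train_idx` only, so that no information from the held-out fold leaks into the scores.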
The proposed method
FBSA first calculates the sentiment score for each word, based on term frequencies of words in the training dataset. Suppose we have $p$ datasets, $D_1$ to $D_p$, that contain a number of tweets and labels showing if they are positive or negative. We split a dataset into training and test datasets according to the 10-fold cross-validation mechanism. For each word $w_j$ in the training dataset $D_i$, we define two cumulative values; positive and negative frequencies:
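The two frequencies can be written out as below. This is a reconstruction from the surrounding description and from the worked example later in this section (occurrences are counted with repetition), not a verbatim copy of the paper's equations. The symbols $D_i^{+}$ and $D_i^{-}$ (the positive and negative tweets of $D_i$) and $\mathrm{tf}(w_j, t)$ (the number of times $w_j$ occurs in tweet $t$) are notation introduced here.

```latex
\[
\mathrm{freq}^{+}(w_j, D_i) = \sum_{t \in D_i^{+}} \mathrm{tf}(w_j, t),
\qquad
\mathrm{freq}^{-}(w_j, D_i) = \sum_{t \in D_i^{-}} \mathrm{tf}(w_j, t)
\]
```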
The dataset $D_i$ is the training dataset, and thus it does not include records from the test dataset. Comparing $\mathrm{freq}^{+}$ and $\mathrm{freq}^{-}$ is only meaningful when the numbers of positive and negative records are equal. We therefore use a coefficient based on the numbers of records in the positive and negative classes, and calculate the normalized frequency using Equation (3):
$n_P(i)$ and $n_N(i)$ represent the number of positive and negative records in $D_i$, respectively.
We then calculate the sentiment score of each word in the lexicon using Equation (4):
Equation (4) provides a sentiment score between –1 and +1 for each word. If the score is near 0, it means that word is more objective than subjective. Scores near +1 indicate the positivity of the word, and words with scores near –1 are more negative.
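The lexicon generation step can be sketched in a few lines of Python. The exact forms of Equations (3) and (4) are not reproduced in this text, so both formulas in the sketch are assumptions: the negative frequency is rescaled by the ratio of positive to negative records, and the score is a normalized difference. They were chosen only because they match the stated properties (a score between –1 and +1, near 0 for words that are equally frequent in both classes). The function name `build_lexicon`, the label strings and the whitespace tokenizer are hypothetical as well.

```python
from collections import Counter

def build_lexicon(tweets, labels):
    """Build a word -> score lexicon from labeled training tweets.

    tweets: list of strings; labels: list of "pos"/"neg" (same length).
    ASSUMPTION: the normalization and score formulas below are illustrative
    stand-ins, not necessarily the paper's Equations (3) and (4).
    """
    freq_pos, freq_neg = Counter(), Counter()
    n_pos = n_neg = 0
    for text, label in zip(tweets, labels):
        tokens = text.lower().split()  # hypothetical whitespace tokenizer
        if label == "pos":
            freq_pos.update(tokens)  # counts every occurrence, repeats included
            n_pos += 1
        else:
            freq_neg.update(tokens)
            n_neg += 1
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative tweet")

    ratio = n_pos / n_neg  # assumed class-imbalance coefficient
    lexicon = {}
    for word in set(freq_pos) | set(freq_neg):
        pos = freq_pos[word]
        nneg = freq_neg[word] * ratio  # assumed normalized negative frequency
        lexicon[word] = (pos - nneg) / (pos + nneg)  # always within [-1, +1]
    return lexicon
```

No word is filtered out, so stop-words receive scores like any other word, in line with the design choice described earlier in this section.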
For example, if the word “love” appears 28 times in positive tweets (e.g. in 28 separate tweets) and 3 times in negative tweets (e.g. twice in one negative tweet and once in another negative tweet) in dataset number 1, then $\mathrm{freq}^{+}(\text{“love”}, D_1) = 28$ and $\mathrm{freq}^{-}(\text{“love”}, D_1) = 3$. However, in our example, the training dataset is imbalanced and contains 182 positive and 177 negative tweets. Therefore, we compute the normalized frequency for tackling the imbalance issue: nfreq _ (“ love”
We introduce five meta-level features to do the classification task on the datasets. The meta-features are as follows:
FPos: Sum of Score for positive words in tweet
FNeg: Sum of Score for negative words in tweet
PWords: Number of positive words in the record based on Score
NWords: Number of negative words in the record based on Score
Score: Sum of Score for all the words in tweet
In this paper, when we refer to an FBSA-generated sentiment lexicon on a dataset, we mean a sentiment lexicon generated from the whole dataset. The exception is classification, for which the lexicons are generated from the training datasets only.
The core of the feature vector is made of the aforementioned features. An example of computing the features for a certain record is as follows. Consider a very simple lexicon, created by the FBSA method, which is demonstrated in Table 1.
A simple lexicon created using FBSA
Now, each record in the dataset (each tweet) should be transformed into a feature vector. Assume a record, consisting of the following text: “It is good”. Its feature vector is calculated as shown in Table 2.
The feature values for the record “It is good”
Here, each record is converted into five features. Then, a model is built based on the training data, and applied on the test data. The lexicon is created solely based on the training data, and it is used to calculate features for both training and test data.
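The conversion of a record into the five features can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: a word is treated as positive when its score is above zero and negative when it is below zero, words absent from the lexicon are ignored, and the scores in `toy_lexicon` are made up for the example (they are not the values of Table 1).

```python
def extract_features(text, lexicon):
    """Return the five meta-level features for one tweet.

    ASSUMPTIONS: a word is "positive" if its score is > 0 and "negative"
    if its score is < 0; words missing from the lexicon contribute nothing.
    """
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    positives = [s for s in scores if s > 0]
    negatives = [s for s in scores if s < 0]
    return {
        "FPos": sum(positives),    # sum of scores of positive words
        "FNeg": sum(negatives),    # sum of scores of negative words
        "PWords": len(positives),  # number of positive words
        "NWords": len(negatives),  # number of negative words
        "Score": sum(scores),      # sum of scores of all words
    }

toy_lexicon = {"it": 0.1, "is": -0.05, "good": 0.9}  # made-up scores
features = extract_features("It is good", toy_lexicon)
```

The same function would be applied to training and test records alike, always with a lexicon built from the training data.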
Since the datasets are from Twitter, words may be accompanied by a hashtag (#), which is used to indicate the keywords of a tweet and to emphasize a word. It is interesting to know whether there is a significant difference between words with and without hashtags. For that reason, a word with a hashtag and the same word without a hashtag are treated as two different words.
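One simple way to keep hashtagged and plain words apart is to let the tokenizer keep the leading `#` attached to its word. The regular expression and the function name `tokenize` below are assumptions made for illustration; the paper does not describe its tokenizer.

```python
import re

# An optional '#' followed by one or more word characters.
TOKEN_RE = re.compile(r"#?\w+")

def tokenize(text):
    """Lowercase a tweet and split it, keeping a leading '#' attached to its word."""
    return TOKEN_RE.findall(text.lower())
```

With this tokenizer, `love` and `#love` become separate lexicon entries and can receive different scores.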
We also incorporate additional features from five sentiment lexicons to improve the accuracy and f-measure. Moreover, incorporating additional features can be used to check if our lexicons complement other lexicons by improving their accuracy.
For each additional sentiment lexicon, we add two meta-level features to our five features. These meta-level features are presented in Table 3 [26]. The lexicons are chosen with a focus on diversity. Some of the lexicons are manually created, such as Bing Liu’s lexicon, while NRC-hashtag is created automatically. Bing Liu’s lexicon and OpinionFinder group words into positive and negative ones, whereas the scores in Sentiment140 have three decimal places. We also add bigram features, which are term frequencies of unigrams and bigrams.
Features based on lexicons for classification of sentiment
We use the SMO implementation of WEKA 3.6 as our classifier. It should be noted that the HCR dataset defines its training and test datasets beforehand, and thus no cross-validation is applied on this dataset.
In this section, the datasets are introduced, and then the results of running our method are reported and discussed. Only the subjective text in the datasets (positive and negative) is considered here.
Datasets
We run FBSA on six datasets of tweets [28, 72]. These datasets are widely used for sentiment analysis: Sanders, OMD and Strict OMD [72], HCR [39], STS-Test [28], and STS-Gold [71]. The Obama-McCain Debate (OMD) dataset consists of 3238 tweets [72], of which only the tweets with a majority vote are considered. The strict version of OMD consists of the tweets on which the votes are unanimous. The Healthcare Reform (HCR) dataset consists of a training dataset with 621 tweets and a test dataset with 665 tweets.
The Stanford dataset consists of a training dataset with 1.6 million tweets, and a test dataset (STS-Test) with 177 negative and 182 positive tweets that are manually labeled [28]. Several studies, such as [26], consider only the STS-Test dataset. Here we also use the test portion of the Stanford dataset. In addition, we compare our results with methods that use STS-Train for training.
Table 4 gives an overview of these datasets.
The negative, positive and total tweets in each dataset
A program in C# 2015 was written to implement the algorithm. Since it only calculates frequencies of words and meta-level features, it runs very fast. The runtimes of feature generation on the different datasets are shown in Table 5.
Runtime of feature generation in datasets (in seconds)
We used 10-fold cross-validation for our algorithm in the datasets, and report the average value for each of the measures.
We compare the results of our algorithm (Tables 6 to 11) with state-of-the-art methods, such as [13, 26, 41], and baseline methods, such as [21]. The baseline results are based on the meta-level features explained in Table 3. The FBSA method uses FPos, FNeg, PWords, NWords and Score as meta-level features. We then incrementally add additional features, such as meta-level features from other lexicons and bigrams. We performed the Wilcoxon Signed-Ranks test, a non-parametric statistical test, to compare our results with other methods over the datasets for accuracy and f-measure (Tables 12 and 13, respectively). Tables 14 and 15 show the results of the same test for comparing the FBSA method (without additional features) with bigrams and static lexicons. The results show that using only the five core FBSA features outperforms static lexicons, and is competitive with bigrams. However, classification based on bigram features is very time-consuming compared to classification based on the FBSA features.
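For readers unfamiliar with the test, the rank sums at the core of the Wilcoxon Signed-Ranks test can be computed as in the sketch below. It covers only the statistic, not the p-value or critical values, and it is not the authors' code; the function name `wilcoxon_statistic` is hypothetical. Zero differences are dropped and tied absolute differences share their average rank, which is a common convention.

```python
def wilcoxon_statistic(xs, ys):
    """Return (W_plus, W_minus) for paired samples xs and ys.

    Zero differences are discarded and tied absolute differences share
    the average of their ranks.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus
```

In this setting, `xs` and `ys` would hold the per-dataset accuracy (or f-measure) values of two methods. The smaller of the two rank sums is then usually compared against a critical value for the number of non-zero differences.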
Results of FBSA on HCR dataset (%)
Results of FBSA on OMD dataset (%)
Results of FBSA on SOMD Dataset (%)
Results of FBSA on STS dataset (%)
Results of FBSA on STS-Gold dataset (%)
Results of FBSA on Sanders dataset (%)
Wilcoxon Signed-Ranks test on accuracy values
Wilcoxon Signed-Ranks test on f-measure values
Wilcoxon Signed-Ranks test on accuracy values for FBSA without additional features
Wilcoxon Signed-Ranks test on f-measure values for FBSA without additional features
The comparisons have these implications: (i) the five core features extracted using the FBSA method (the FBSA row) generally outperform the combination of all of the static lexicons. This means that a dynamic lexicon, which can be created very fast, can outperform not only single static lexicons, but the combination of static lexicons as well. This can be attributed to the dynamic lexicons being domain-specific, which calls for the creation of dynamic lexicons; and (ii) the dynamic lexicons are useful additions to the existing lexicons, and significantly improve their accuracy values.
In one of the experiments, we used balanced training datasets. To balance them, we applied sampling with replacement on the training datasets. The weights of SMO show that the FPos and Score features carry more weight in classifying tweets than the other features.
The intersection of all the lexicons is shown in Fig. 1. It shows the number of words shared between the lexicons generated by FBSA and the static lexicons. The coverage of AFINN and Bing Liu’s lexicon on the words of the Sanders and HCR datasets is less than 10 percent. The high accuracy and f-measure results show that there can be implicit sentiment words. For example, the score of the word “for” is usually mildly positive. It can be inferred that when “for” is used in a sentence, the sentence tends to be positive. This effect is shown in [46] as well. However, numerous research papers omit this word as a stop-word. Saif et al. [73] and Keshavarz and Saniee [46] did not omit the stop-words.
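Balancing by sampling with replacement can be sketched as below. The text only states that sampling with replacement was applied, so oversampling the smaller class up to the size of the larger one is an assumption, as are the function name `balance_by_oversampling` and the fixed seed.

```python
import random

def balance_by_oversampling(records, labels, seed=0):
    """Resample the smaller class with replacement until classes are equal.

    ASSUMPTION: oversampling the minority class is one possible reading of
    "sampling with replacement"; the original procedure may differ.
    """
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = rng.choices(items, k=target - len(items))  # draws with replacement
        balanced.extend((record, label) for record in items + extra)
    rng.shuffle(balanced)
    return balanced
```

Only the training portion would be resampled; the test data keeps its original class distribution.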
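Coverage figures of this kind can be computed with a few set operations. The sketch assumes that coverage means the fraction of distinct dataset words that also appear in the lexicon; the paper does not spell out its definition, and the function name `coverage` is hypothetical.

```python
def coverage(lexicon_words, dataset_words):
    """Fraction of distinct dataset words that appear in the lexicon.

    ASSUMPTION: coverage is measured over the distinct vocabulary of the
    dataset, not over token occurrences.
    """
    vocabulary = set(dataset_words)
    if not vocabulary:
        return 0.0
    return len(vocabulary & set(lexicon_words)) / len(vocabulary)
```

A lexicon generated from the dataset itself covers its training vocabulary completely by construction, which is the contrast this paragraph draws with the static lexicons.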

Intersection of words in lexicons.
The concordance of lexicons is shown in Fig. 2. It shows how well the sentiment lexicons agree on the sentiment direction of their common words. A bolder color indicates higher concordance.
The effect of using hashtags alongside words can be seen in Table 16. This table shows the scores of words with and without hashtags. Based on these scores, it is better to treat words with and without hashtags as different words.
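A pairwise concordance value of this kind could be computed as in the following sketch. It assumes that concordance is the share of common words whose scores have the same sign in both lexicons; this definition and the function name `concordance` are assumptions, since the text does not give the exact measure.

```python
def concordance(lexicon_a, lexicon_b):
    """Share of common words whose scores have the same sign in both lexicons.

    Returns None when the lexicons share no words.
    ASSUMPTION: agreement is judged on the sign of the score only.
    """
    common = set(lexicon_a) & set(lexicon_b)
    if not common:
        return None

    def sign(x):
        return (x > 0) - (x < 0)

    agree = sum(1 for w in common if sign(lexicon_a[w]) == sign(lexicon_b[w]))
    return agree / len(common)
```

Computing this value for every pair of lexicons gives a symmetric matrix that can be drawn as a heat map.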

Concordance of lexicons over common words.
The score of words with and without hashtags
One particular problem is that the polarity of a word can depend on the context. For example, the word “mouse” may be a negative word in a hotel review, but it can be an objective word in the computer context. This is a problem that most domain-dependent works suffer from. However, using domain-specific lexicons can be useful. For example, a hotel manager can create lexicons based on hotel reviews, in which the word “mouse” gets a negative score. This lexicon can then be used for hotel reviews.
In this paper, we performed sentiment analysis on Twitter. We considered the frequencies of words in positive and negative records, and then calculated sentiment scores for them. Then, five meta-level features based on these lexicons were extracted.
We have also incorporated other lexicon-based and corpus-based features to improve the accuracy. This shows that the lexicons we have created, existing lexicons, and bigrams, complement each other. Our results outperform other previous methods in three of six datasets in f-measure.
The advantages of the proposed method are as follows: (i) desirable runtime, in which the lexicons are built and the features are calculated very fast; (ii) high accuracy and f-measure, especially when used alongside other lexicon features; (iii) an insight about words; and (iv) the ability to use different classifiers.
As future work, we want to explore low-power algorithms for sentiment analysis on social media, for applications such as mood detection and stress monitoring. Since the runtime of the algorithm is very low, it is suitable for integration into small devices that can benefit from sentiment analysis, for example on text that users type on their smartphones. Smartphones are capable of collecting contextual information [74], and individuals use them to access their social media [75]. Low-power algorithms are a requirement for these devices [76], so another direction for future work is to create a low-power algorithm that can accurately extract lexicons and operate on users’ smartphones. This can have several applications, ranging from mood detection and stress monitoring [77] to quantified-self systems.
The fast runtime of the algorithm also makes it possible to consider the timing aspect. Because of the dynamic nature of social media, the meanings of words may change over time, for example during events that cause peaks in Twitter activity. The important role of timing in social media has been addressed in [79, 80] as well. The lexicons can be updated very quickly during such events [46].
