Abstract
With the rapid proliferation of online blogging and micro-blogging websites, millions of text posts are generated and made available online every day. Utilizing this rich data channel could facilitate educated purchasing of items, discovering trends and public tendencies regarding various products available in the market, discovering the political inclination of societies prior to a national election, etc. Over the last decade, Sentiment Analysis (SA) has received increased attention from many researchers as a method for addressing topics such as these. This paper focuses on SA using sentiment features and patterns. We propose three sentiment polarity detection methods, two unsupervised and one supervised, which we compare with two baseline methods: a state-of-the-art Support Vector Machine (SVM) classifier trained on a unigram bag-of-words model, and the unsupervised SentiStrength [38] algorithm. Our experiments show that our polarity detection methods are highly effective and outperform the aforementioned baselines in most cases.
Introduction
The rise of online social media in recent years has changed the way people communicate when they share thoughts and opinions. Micro-blogging websites such as Twitter have gained increased popularity, and thus novel and rich data channels have formed. Every day, a huge amount of informal and formal subjective text is made available online through these channels. The knowledge captured from these texts could be employed for scientific surveys from a social or political perspective [27]. On the one hand, companies and product owners who aim to improve their products/services may strongly benefit from the rich feedback [14], [36]. On the other hand, customers could learn how positively or negatively other users regard different features of products/services and react accordingly. Furthermore, applications such as rating movies based on online movie reviews [31] could emerge only by making use of these data.
Sentiment Analysis (SA) is the process of extracting the polarity of people’s subjective opinions from plain natural language texts [3]. An SA system takes as input a set of documents with unknown polarity and returns as output the opinions expressed in those documents and their predicted polarity. This gives both customers and companies easy access to public opinion regarding certain items/products.
There has been a great amount of previous research on various methods that use web technology to maximize the benefits of customers as well as companies in the marketplace [2]. Likewise, this study pursues the same goal by performing polarity detection on three datasets, two of which are related to consumer products. For this purpose, we use two data sets presented and used in references [1] and [17] as well as our own Twitter data set, in which each tweet was manually annotated by three human judges.
The three data sets used in this research work contain texts ranging from very informal posts such as tweets to more formal ones such as Amazon reviews.
Twitter messages, like many other texts posted on the blogosphere, are mostly informal. Because of the anomalistic nature of informal text, analyzing or processing it is often more challenging than processing formal text. The main difference between processing formal and informal text lies in data preprocessing: formal text often needs less of it. Informal text, however, often contains emoticons, slang, bad grammar, sarcasm, and non-standard words that are absent from dictionaries. In this paper, we propose various ways of handling both formal and informal texts, and compare the presented methods against two baseline methods in a benchmark.
This article is an extended version of the workshop paper published in the proceedings of Web Intelligence and Intelligent Agent Technologies [3]. The further contributions of this study are:
Extended sentiment feature set for the supervised classifier.
Handling both informal texts such as tweets and formal texts such as Amazon reviews.
Usage of three datasets for a more comprehensive evaluation of our proposed methods.
Further investigation of the state-of-the-art in SA.
Extended experiments, supporting the claim that our supervised method presented in Section 5 could serve as a domain-independent classifier in the context of targeted consumer-based products.
Comparison of our polarity detection methods against the SentiStrength [38] algorithm.
The remainder of this article is organized as follows: Section 2 is devoted to previous related work. It investigates various supervised and unsupervised approaches for sentiment polarity detection of texts. In addition, it provides a review of concept-level polarity detection and SA of Twitter. Sections 3 and 4 present two new unsupervised polarity detection algorithms named UPD and UTOPD. Section 5 proposes a supervised polarity detection method along with its components. Subsequently, Section 6 tests and evaluates the performance of our presented classifiers on the three mentioned data sets. The findings are discussed in Section 7. Finally, Section 8 concludes this article and gives insight into future work.
Related work
According to Pang and Lee [30], SA may be commonly described by two main tasks: First, “identifying whether a given textual entity is subjective or objective” and second, “identifying polarity of subjective texts”. In this paper we only focus on the second task and therefore only address respective related work.
Polarity detection of subjective texts has been extensively studied in the literature. It is commonly known as classifying subjective opinions into two classes of positive or negative polarity and in some cases, another neutral class. Researchers have applied polarity detection to a wide range of domains, e.g., movie reviews [31], product reviews [10], [36], [14], and news and blogs [16], [5].
Although a wide range of methods has been applied to sentiment polarity detection in text, in this paper we distinguish between supervised learning and unsupervised learning approaches. In addition, we survey the most recent concept-level methods of polarity detection as a separate category. In summary, the three categories that are briefly surveyed in this section are as follows:
Supervised methods: Machine learning methods which define a classification function based on labeled training data.
Unsupervised methods: Find underlying patterns upon which unlabeled data could be classified.
Concept-level methods: These methods rely on world knowledge such as web ontologies or networks of concepts.
In the following, Subsections 2.1, 2.2, and 2.3, each of these categories is briefly reviewed. Finally, due to the fact that Twitter is one of the richest text data channels, and two of the three datasets used in this article are obtained from Twitter, Subsection 2.4 describes some previous related work on Twitter data.
Supervised methods
Supervised methods are machine learning approaches in which a classifier is trained using already labeled training data. These methods were the first methods proposed in the field of SA [31].
The two main parts of every supervised method are its feature set and the classifier that the feature set is fed to. In the following, we briefly survey some previous work in this domain. Then, in Subsection 2.1.1, we give some insights into the most common features and classifiers. Furthermore, some deficits of supervised polarity detection methods are discussed in Subsection 2.1.2.
Melville et al. [24] propose a so-called “unified framework” that allows background lexical information, in terms of word-class associations, to be used in a supervised approach. The information is refined for specific domains using training data related to those domains. Using a multinomial Naïve Bayes (NB) classifier, their experiments show that incorporating lexical knowledge improves classification performance. Additionally, they state that their method achieved a higher performance than approaches using either lexical knowledge or training data alone.
Another supervised approach is presented by Ng et al. [26]. They adopt two discriminatory sentiment lexicons to label unlabeled data. By making use of two discriminatory-word lexicons (one negative and the other positive), pseudo-documents containing all the words of the chosen lexicon are created. Then, they compute the cosine similarity between these pseudo-documents and the unlabeled documents. Based on the cosine similarity, a document is assigned either a positive or a negative sentiment label. Finally, the labels are employed to train an NB classifier.
Gamon [13] utilizes a linear Support Vector Machine (SVM) classifier and trains it on large feature vectors consisting of various n-grams and linguistic features, e.g. Part of Speech (POS). He shows that his method is effective in polarity detection of noisy customer feedback data.
Xia et al. [42] present a method which extracts word relation features and combines them with conventional text classification features such as unigrams and bigrams for the task of sentiment classification. For this purpose, they compare various combinations of the aforementioned features with NB and SVM classifiers.
Li et al. [21] propose a machine learning method to incorporate contextual valence shifters into a document-level sentiment classification system. First, a feature selection method is adopted to automatically generate the training data for a binary classifier on polarity shifting detection of sentences. Then, by using the obtained binary classifier, each document in the original polarity classification training data is split into two partitions: polarity-shifted and polarity-unshifted. They are used to train two base classifiers respectively for further classifier combination. Both of our proposed unsupervised methods as well as our supervised target-oriented classifier also take into account contextual valence shifters while performing document-level sentiment classification. However, our approach differs substantially from that proposed in [21].
Ng et al. [26] utilize linguistic knowledge for automatic identification and classification of reviews. First, they train an SVM classifier on the top unigrams feature set as a baseline classifier. Then they examine the effect of distinct variations of the feature set on classification accuracy by including bigrams and trigrams, dependency relations, and polarity information of adjectives in the feature set, and finally by discarding objective materials from it.
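To make the pseudo-document idea concrete, the following is a minimal Python sketch of cosine-similarity labeling. It is an illustration only and not the implementation of [26]: the function names, the toy lexicons, the term-count weighting, and the rule that ties go to the positive class are all assumptions introduced here.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pseudo_label(tokens, positive_lexicon, negative_lexicon):
    """Label a document by its similarity to two lexicon pseudo-documents."""
    doc = Counter(tokens)
    pos_sim = cosine(doc, Counter(positive_lexicon))
    neg_sim = cosine(doc, Counter(negative_lexicon))
    # Assumption: ties are resolved in favour of the positive class.
    return "positive" if pos_sim >= neg_sim else "negative"

# Hypothetical toy lexicons, for illustration only.
POSITIVE = ["great", "good"]
NEGATIVE = ["bad", "poor"]
label = pseudo_label("great great phone".split(), POSITIVE, NEGATIVE)
```

The labels produced this way would then serve as noisy training labels for a classifier such as NB.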
Common features and classifiers for supervised polarity detection
Common Features:
The most commonly used features for the task of sentiment polarity detection are as follows:
N-Grams:
The most prominent features in the context of SA, and of text processing in general, are n-gram models (most commonly unigrams, bigrams and trigrams). Unigrams of a document are merely the terms present in that document. The unigram model can be computed either as the number of times that a term occurs in a document (i.e. unigram term frequency model) or merely on the basis of the presence of that term in the document (i.e. unigram bag-of-words model).
In addition, bigrams and trigrams have also been used in the literature (cf. [31], [26]). Bigrams are occurrences of a certain term followed by another specific term. Likewise, trigrams are occurrences of a certain bigram followed by a certain term. These features are beneficial in contexts where local dependencies between terms need to be captured. Pang et al. [31] tested several feature set combinations on a movie reviews data set, including a combination of unigrams and bigrams, and concluded that a unigram bag-of-words feature set could outperform other feature sets. Due to these results, the unigram bag-of-words model is commonly used in the literature as the best feature vector model and is considered a baseline in the context of SA.
Ng et al. [26] used a different movie review dataset than Pang and colleagues [31]. It consists of 2,000 documents, 1,000 of which have a positive and 1,000 a negative sentiment. In their work, they illustrated that by adding bigrams and trigrams to their unigram feature vector, accuracy went up by nearly 2% over the bag-of-words baseline when using an SVM classifier.
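The difference between the term frequency model and the bag-of-words (presence) model, and the construction of bigrams and trigrams, can be illustrated with a short Python sketch. The function names and the whitespace tokenization are assumptions made for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequency(tokens, n=1):
    """Term frequency model: how often each n-gram occurs."""
    return Counter(ngrams(tokens, n))

def bag_of_words(tokens, n=1):
    """Presence model: 1 for every n-gram that occurs at least once."""
    return {gram: 1 for gram in set(ngrams(tokens, n))}

tokens = "the phone is good , really good".split()
unigram_tf = term_frequency(tokens)       # ("good",) is counted twice
unigram_bow = bag_of_words(tokens)        # ("good",) is only marked present
bigram_tf = term_frequency(tokens, n=2)   # e.g. ("really", "good")
```

Under the presence model, repeated words carry no extra weight, which is the variant used as the unigram baseline throughout this article.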
TF/IDF Measure:
TF/IDF (Term Frequency/Inverse Document Frequency) is a classical term weighting measure which is used in text information retrieval. It may also be used as a feature to train machine learning classifiers. As in traditional text classification, these features have proven to be highly effective for sentiment classification as well [22].
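As an illustration, one common variant of this weighting multiplies the raw term count by the logarithm of the inverse document frequency. The sketch below assumes this particular variant (many others exist, e.g. with smoothing or length normalization), and the function name is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight = tf(term, doc) * log(N / df(term)), one common TF/IDF variant.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["great", "phone"], ["bad", "phone"], ["great", "great", "screen"]]
weights = tf_idf(docs)
```

A term that occurs in every document receives a weight of zero under this variant, so it does not help to separate the classes.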
Part of Speech (POS) Tags:
POS tags are another type of feature that has been used in the literature (e.g. in [31], [13]). However, combining unigrams and POS tags in one feature set decreases classification accuracy compared with a feature vector consisting only of unigrams (cf. [31]).
Word Position in Document:
In some previous research, word positions in documents were adopted as features. Pang et al. [31] were among the first researchers to use these features in the domain of SA.
Sentiment/Polar features:
Sentiment features are essential for training a machine learning classifier. In most cases they are generated by making use of a sentiment lexicon. Agarwal et al. [1], for instance, adopt a sentiment lexicon for counting the number of occurrences of positive and negative words in tweets. The resulting two measures are used along with other features to train a classifier. The supervised approach proposed in this article also incorporates such features.
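The two count-based features mentioned above can be sketched in a few lines of Python. The toy lexicon below is hypothetical and only serves the illustration; it is not the lexicon used in [1] or in our system.

```python
def polarity_counts(tokens, positive_words, negative_words):
    """Count positive and negative lexicon hits in a tokenized text."""
    n_pos = sum(1 for t in tokens if t.lower() in positive_words)
    n_neg = sum(1 for t in tokens if t.lower() in negative_words)
    return n_pos, n_neg

# Hypothetical toy lexicon, for illustration only.
POSITIVE_WORDS = {"love", "great"}
NEGATIVE_WORDS = {"hate"}

tweet = "I love this great phone but hate the battery".split()
features = polarity_counts(tweet, POSITIVE_WORDS, NEGATIVE_WORDS)
```

The resulting pair of counts would be appended to the other features of a document before training.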
Most Common Classifiers:
Popular classifiers used for conventional text classification problems, such as NB and SVM as stated in Liu et al. [23], are also commonly used for SA. The most widely used supervised algorithms in the context of SA are SVM, NB, and Maximum Entropy (MaxEnt) classifiers. In the work of Pang et al. [31], all three of these machine learning approaches were applied and tested on a movie reviews data set. Their findings showed that an SVM trained on a unigram bag-of-words feature set outperforms all other approaches presented in their work. Go et al. [13] also tested all three above-mentioned classifiers on a large corpus of short informal texts (i.e. tweets). They reported that their findings mirror those of [31]: a unigram-based SVM performs better than a unigram-based NB classifier, which in turn performs better than a unigram-based MaxEnt classifier in terms of classification accuracy. In addition, our earlier research [3] showed the same results by testing the three aforementioned classifiers on a smaller Twitter dataset.
Deficits of supervised methods
One drawback of supervised approaches is their inherent domain dependency. Read [33] shows that standard machine learning approaches for opinion polarity detection are both domain-dependent (dataset topic) and temporally dependent (datasets collected in different periods on an annual basis). Another drawback is that training classifiers on huge feature sets (i.e. consisting of unigrams or other conventional text processing features) requires long training sessions, and the large number of features often entails high computational complexity and memory consumption.
Unsupervised methods
Unsupervised methods are mostly lexicon-based methods. They employ a dictionary in which each sentiment bearing word is associated with either a sentiment score or a set of sentiment bearing seed words. These methods use different functions to calculate the overall sentiment of each text snippet in a document and thus calculate an overall sentiment score for the entire document. Paltoglou et al. [29] show that an unsupervised lexicon-based method could outperform supervised methods, namely, NB and MaxEnt classifiers on a few datasets for polarity detection. Accordingly, they conclude that unsupervised methods can outperform supervised methods. However, they do not report a comparison of their unsupervised method against an SVM unigram baseline.
Methods for automatic sentiment lexicon generation are generally divided into two categories: corpus-based methods and dictionary-based methods.
Approaches of the first category, also referred to as Semantic Orientation from Association (SOA), use a set of positive and negative sentiment seed words to find other sentiment words in the corpus [40]. Dictionary-based methods usually start with a set of sentiment-labeled seed words and utilize various algorithms to propagate sentiment labels to a synonym and antonym network of the labeled seed words. However, a lexicon could also be built manually. In the following, as an example of corpus-based automatically built lexicons, we first explain the steps of SOA in [40] and then briefly review dictionary-based methods. In addition, we name some of the manually built lexicons.
Examples of automatically and manually built lexicons
Automatically built sentiment lexicons
Consequently, the semantic orientation of the extracted phrases is estimated based on the Point-wise Mutual Information (PMI) formula presented below:
In this formula, p stands for a given phrase,
By consulting a search engine (i.e. AltaVista in their case) and by making use of Formula 1, Turney calculates the probability of co-occurrence of the extracted phrases with a small set of positive and negative seed words. In this way he estimates the probability of a phrase being positive or negative.
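As an illustration of PMI-based semantic orientation, the following Python sketch uses the widely cited hit-count formulation for a single positive and a single negative seed word, i.e. SO(p) = PMI(p, positive seed) - PMI(p, negative seed). This is an assumption about the general form and may differ in detail from Formula 1; the function name, the smoothing constant, and all hit counts are made up for the example.

```python
import math

def semantic_orientation(hits_near_pos, hits_near_neg,
                         hits_pos, hits_neg, smoothing=0.01):
    """SO(p) = PMI(p, pos_seed) - PMI(p, neg_seed), written via hit counts.

    hits_near_pos / hits_near_neg: hits for the phrase co-occurring with
    the positive / negative seed word.
    hits_pos / hits_neg: hits for the seed words on their own.
    The smoothing term avoids division by zero for unseen co-occurrences.
    """
    return math.log2(((hits_near_pos + smoothing) * hits_neg) /
                     ((hits_near_neg + smoothing) * hits_pos))

# Made-up hit counts for a hypothetical phrase.
so_positive_phrase = semantic_orientation(200, 10, 1000, 1000)
so_negative_phrase = semantic_orientation(10, 200, 1000, 1000)
```

A positive value indicates that the phrase co-occurs more strongly with the positive seed word, a negative value the opposite.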
Patterns of POS tags for extracting two consecutive words (Turney, 2002)
Dictionary-based:
Dictionary-based methods rely on a dictionary with synonyms and antonyms attached to each word entry. The most prominent example is WordNet [25], one of the most commonly used dictionaries in this area. Dictionary-based methods usually start with a set of seed words with known positive and negative polarity and propagate the known sentiment labels to synonyms and antonyms of each seed word. This process continues iteratively until it cannot progress further because all words are labeled. This so-called lexical graph propagation method was first introduced by Hu and Liu [17]. Similar to Hu and Liu, [7] adopted the distance between two nodes in a lexical graph as a measure of the relatedness of a given word to positive and negative seed words. In addition, the method proposed by Kamps et al. [18] included a neutral seed set alongside the positive and negative seed sets, which resulted in faster and more accurate convergence.
There exist a number of manually built sentiment lexicons. One example is the SentiStrength lexicon [38], which reportedly has a high word matching accuracy on informal texts because the stemming rules that it uses were derived from the social networking website MySpace. Other examples are the LIWC lexicon (Linguistic Inquiry and Word Count) [32] and SentiWordNet [11]. Another lexicon, ANEW [12], allows the polarity of words to be calculated by two measures, valence and arousal. Finally, the General Inquirer lexicon [35] is another popular lexicon that should be mentioned.
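The propagation idea can be sketched as a breadth-first traversal of a synonym/antonym graph. The sketch below is a simplified illustration and not the exact procedure of any of the cited works: the function name, the toy graph, and the rule that the first label assigned to a word is kept are assumptions.

```python
from collections import deque

def propagate_polarity(seeds, synonyms, antonyms):
    """Breadth-first propagation of +1/-1 labels over a lexical graph.

    seeds:    {word: +1 or -1}
    synonyms: {word: iterable of synonyms}  (label is kept)
    antonyms: {word: iterable of antonyms}  (label is flipped)
    """
    labels = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        neighbours = [(s, labels[word]) for s in synonyms.get(word, ())]
        neighbours += [(a, -labels[word]) for a in antonyms.get(word, ())]
        for other, label in neighbours:
            if other not in labels:   # assumption: the first label wins
                labels[other] = label
                queue.append(other)
    return labels

# Hypothetical toy graph, for illustration only.
SEEDS = {"good": 1, "bad": -1}
SYNONYMS = {"good": ["fine"], "bad": ["poor"], "fine": ["decent"]}
ANTONYMS = {"fine": ["awful"]}
lexicon = propagate_polarity(SEEDS, SYNONYMS, ANTONYMS)
```

The loop stops once no unlabeled word is reachable from the seeds, which corresponds to the termination condition described above.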
Deficits of unsupervised methods
Although unsupervised methods have a number of advantages over supervised methods, they also come with some drawbacks. One drawback is that their performance relies primarily on the completeness and quality of the sentiment lexicon they use. Even if the lexicon were very complete, the unsupervised algorithm would still produce many falsely matched words. For instance, most popular existing unsupervised methods, such as SentiStrength [38], would classify the sentence “A looks like B” as positive due to the presence of the word “like”, although in this context “like” is a neutral word. Another drawback is that the sentiment lexicon has to be maintained continuously, i.e. it has to be updated as language changes over time. A final drawback is domain dependency: it requires either individual specialized lexicons, which result in higher management costs, or a general lexicon, which affects the classification accuracy in each domain.
Concept-level sentiment polarity detection
Concept-based methods are inspired by the notion that analyzing texts merely on the word level might not convey their actual meaning, and that a better understanding of a text fragment could be achieved if it is represented in terms of real-world concepts. Approaches relying on large semantic knowledge bases step away from blindly using keywords and word co-occurrence counts, as stated by reference [8], and instead rely on the implicit meaning/features associated with natural language concepts. Concept-based approaches are capable of detecting subtly expressed sentiments. In this subsection we present a brief review of such methods and their fundamentals.
Concept-based approaches may rely on a semantic ontology or a specific type of lexical resource. One such lexical resource built specially for concept-based approaches, is the SenticNet [9]. At the time of writing this article, the last stable version of SenticNet contains 13,291 single-word or multi-word concepts with their corresponding sentiment polarity score ranging from ‘
Poria et al. [34] strive to construct an enhanced version of SenticNet by automatically merging it with WordNet Affect (WNA). That is, they try to map concepts present in SenticNet to emotions present in WNA. Furthermore, they feed features obtained from various combinations of methods to NB, multi-layer perceptron, and SVM classifiers. Their experimental results show that their method of enhancing SenticNet improves polarity detection accuracy.
Tsai et al. [39] describe a two-step method for concept-level sentiment analysis. In the first step, they use iterative regression to assign a sentiment value to each concept in ConceptNet (i.e. a semantic network of common sense knowledge). That is, they create a graph structure of ConceptNet and, by using nodes with known values, construct a regression model to predict the values of the other nodes. In the subsequent step, they use the assigned sentiment values as starting values for a random-walk method with in-link normalization. This part of their method spreads sentiment values to neighboring nodes.
Sentiment analysis of Twitter data
Analysis of Twitter data has been the focus of much recent research in the domain of SA. Pak and Paroubek [28] introduced a method to collect a micro-blogging data set based on the appearance of happy or sad emoticons in messages, which serve as noisy labels. They used the collected corpus to train a multinomial NB classifier that uses n-grams and POS tags as features.
O’Connor et al. [27] showed that there is a correlation between sentiment measures computed using word frequencies in tweets and both consumer confidence polls and political polls. Hence, they illustrated that the inclination of the public towards different entities could be examined by analyzing tweets. Lai [20] measured presidential performance over a certain time period by extracting general public sentiment from Twitter. For this purpose he used the SentiStrength lexicon [37]. As already mentioned, Agarwal et al. [1] adopt a set of sentiment features as well as some non-sentiment features to process and analyze a manually annotated data set of tweets. Wang et al. [41] proposed a graph-based hashtag approach to classifying the sentiment of Twitter posts. Kouloumpis et al. [19] used linguistic features along with other features that capture information about the informal and anomalistic language used in micro-blogs.
Most existing polarity detection systems, however, perform polarity detection without defining a target word that their SA is directed at. In many real-world problems, especially in the domain of consumer products, a target item has to be compared with its competitors. Thus, performing target-oriented SA is crucial. Nevertheless, a unigram-based model would detect the same sentiment polarity for both the target of SA and its rivals that co-occur in a certain document.
In this article we extend our previous work on sentiment features [3], [4], and present an improved set of sentiment features to perform (optionally target-oriented) polarity detection on three datasets. We show that even with a very small set of features, the unigram model (which is commonly considered a baseline for sentiment polarity analysis) is outperformed. We also show that our proposed methods outperform the SentiStrength [38] algorithm.
An Unsupervised Polarity Detection (UPD) algorithm
In this section we describe an unsupervised algorithm for SA which was already presented in our previous work [3], [4]. However, this study does a more comprehensive analysis of the proposed algorithm by examining its performance and comparing it with that of the SentiStrength algorithm on three datasets. For the sake of simplicity of referring to this algorithm, we name it “UPD”. This algorithm initially performs some preprocessing steps as follows:
@username is replaced with “ATUSER”.
URLs are removed.
“#word” is replaced with “word”.
Slang terms (abbreviations) are replaced by their full-phrase equivalents. A manually built slang dictionary is used for this purpose.
Each document (or tweet) is split into smaller text snippets based on “.”, “!” and “?”. We therefore define a text snippet as a number of words that occur between two punctuation marks.
We first replace slang terms with their equivalents using a slang dictionary. To build it, we manually collected slang entries from as many online resources as we could find and added them to the slang dictionary of SentiStrength [38]. We also modified the SentiStrength lexicon itself by adding new sentiment words. Most of the modifications consist of enlarging the slang dictionary and adding sentiment words that were missing from SentiStrength although their exact antonyms were present. For instance, the word “useful” is present in the SentiStrength lexicon, while its immediate antonym “unuseful” is not. Although this word might not be included in some standard dictionaries, it does appear in many web language dictionaries, which matters because sentiment analysis is mostly performed on web texts. In the evaluation section of this article we present the effect of these modifications by comparing the two lexicons.
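The preprocessing steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not our actual implementation: the regular expressions, the function name, and the two-entry slang dictionary are hypothetical, and URLs are removed first so that the dots they contain do not trigger snippet splits.

```python
import re

# Hypothetical two-entry slang dictionary; the real dictionary is far larger.
SLANG = {"gr8": "great", "idk": "i do not know"}

def preprocess(text, slang=SLANG):
    """Apply the preprocessing steps and return a list of text snippets."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "ATUSER", text)               # mask user mentions
    text = re.sub(r"#(\w+)", r"\1", text)                # "#word" -> "word"
    # Replace slang word by word, ignoring case and attached punctuation.
    text = re.sub(r"\w+",
                  lambda m: slang.get(m.group(0).lower(), m.group(0)),
                  text)
    # Split into snippets at ".", "!" and "?" and drop empty pieces.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

snippets = preprocess("@bob my #iPhone is gr8! idk why. http://t.co/x")
```

For the example tweet, the result consists of two snippets: one for the statement about the phone and one for the expanded slang phrase.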
In the second step, we used in two separate attempts both the standard SentiStrength lexicon [38], and our Modified Lexicon (ML) for tagging all sentiment-bearing words in documents with their corresponding sentiment scores. Likewise, according to a list of emoticons, we tagged happy emoticons with a sentiment score of ‘
Once all the words in a document are tagged either by their score or by their type, we handle occurrences of intensifiers, diminishers, and negations. First, we intensify the strength of a sentiment-bearing word that appears after an intensifier by the score of the respective intensifier word. Analogously, in the case of diminishers, we weaken the strength of a sentiment-bearing word that appears after a diminisher word by the strength of that diminisher (in SentiStrength all diminishers have a score of ‘
As the last step, in order to compute a sentiment score for an entire document (or tweet), we aggregate the word scores. We define the decision threshold for classifying documents as ‘0’. That is, if the overall sentiment score of a document is less than or equal to ‘0’ it is classified as negative, and otherwise if the score is greater than ‘0’ it is classified as positive.
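The scoring and thresholding steps can be illustrated with a simplified Python sketch. Several details are assumptions made only for this illustration: the toy lexicon entries and scores are hypothetical, word scores are aggregated by summation, a modifier affects the next sentiment-bearing word even if neutral words intervene, and a negation is modeled as flipping the sign of that word's score.

```python
# Hypothetical toy entries; real scores come from the sentiment lexicon.
SENTIMENT = {"good": 2, "love": 3, "bad": -2, "awful": -4}
INTENSIFIERS = {"very": 1, "extremely": 2}
DIMINISHERS = {"slightly": 1}
NEGATIONS = {"not", "never", "no"}

def upd_score(tokens):
    """Sum word scores after applying intensifiers, diminishers, negations."""
    total = 0
    modifier = 0      # pending intensifier (+) or diminisher (-) strength
    negate = False    # pending negation
    for tok in (t.lower() for t in tokens):
        if tok in INTENSIFIERS:
            modifier += INTENSIFIERS[tok]
        elif tok in DIMINISHERS:
            modifier -= DIMINISHERS[tok]
        elif tok in NEGATIONS:
            negate = True
        elif tok in SENTIMENT:
            score = SENTIMENT[tok]
            sign = 1 if score > 0 else -1
            # Strengthen or weaken the magnitude, but never past zero.
            score = sign * max(abs(score) + modifier, 0)
            if negate:            # assumption: negation flips the polarity
                score = -score
            total += score
            modifier, negate = 0, False
    return total

def classify(tokens):
    """Decision threshold of 0: scores <= 0 are negative, > 0 positive."""
    return "positive" if upd_score(tokens) > 0 else "negative"
```

Note that, with the threshold at ‘0’, a document without any sentiment-bearing word receives a score of ‘0’ and is therefore classified as negative.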
An Unsupervised Target-Oriented Polarity Detection algorithm (UTOPD)
For the sake of simplicity of referring to this approach we name it UTOPD. In this approach, we use the same preprocessing and word tagging steps mentioned for the UPD algorithm explained in Section 3.
In addition to all the tagged words in each document, we tag the target of SA with “TARGET”. The target of SA is an entity, described by one or more words in a document, whose sentiment in that document our classifier has to determine. To achieve this, the entity name (e.g. iPhone 5) can be given to UTOPD as an input query. The classifier then searches for the target entity in a given document using regular expressions and thereby detects the target of sentiment.
Then, for each word in a document, if the word has a sentiment score above ‘0’ in the SentiStrength lexicon, it is tagged with “Pos” (i.e. positive), and if the word has a sentiment score below ‘0’, it is tagged with “Neg” (i.e. negative). Likewise, the negation words are tagged with “Negation”. Then we use the following formula to compute an overall sentiment score for the entire document:
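Target tagging with regular expressions can be sketched as follows. The function name and the matching rules (case-insensitive matching, flexible whitespace between the query words, and word boundaries around the query) are assumptions made for this illustration; word boundaries presuppose that the query starts and ends with a letter or digit.

```python
import re

def tag_target(text, target):
    """Replace every occurrence of the query entity with TARGET."""
    # Escape each query word and allow flexible whitespace between them.
    parts = [re.escape(part) for part in target.split()]
    pattern = r"\b" + r"\s+".join(parts) + r"\b"
    return re.sub(pattern, "TARGET", text, flags=re.IGNORECASE)

tagged = tag_target("My iphone  5 beats the Galaxy", "iPhone 5")
```

With the word boundaries in place, a query such as “iPhone 5” does not match inside “iPhone 5s”, so a rival product with a similar name is not mistaken for the target.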
As presented by Formula 2, the main idea of this algorithm is to count the number of features that are more likely to occur in positive sentences in the numerator and count the number of features that are more likely to occur in negative sentences in the denominator. According to our experiments, this method could robustly rank documents and differentiate between sentences with positive patterns and negative patterns instantly. By making use of this method we could rank documents according to their underlying sentiment patterns. Compared with the unsupervised UPD algorithm, this method could better assign similar ranks to documents with similar sentiment patterns. Our experimental results, evaluating the entire presented system, are discussed in Section 6.
This section describes our new target-oriented supervised polarity detection system. It consists of three major modules: a preprocessing module, a lexicon-based sentiment feature generator module and finally a Machine Learning module.
The preprocessing module preprocesses the tweets and prepares them according to the needs of the subsequent analysis. The lexicon-based feature generator module tags sentiment-bearing words with their sentiment score derived from a sentiment lexicon, and further tags valence shifters either with their score or, in the case of negations, with their type. Subsequently, it produces a set of features. In the final step, the machine learning module, which is a linear SVM classifier, takes the feature set provided by the sentiment feature generator module and performs a classification task. In the following subsections each module is further elaborated.
The preprocessing module
This module includes the same preprocessing steps that were described for the unsupervised algorithm in Section 3. In addition to those steps, the target word towards which our SA is oriented (the target of sentiment) is replaced by “TARGET”. This target word can be given to the system as an input query word: the word(s) given as the query is/are searched for in a given document using regular expressions, and the target of sentiment is thus tagged in the document. Providing a query word as the target of sentiment is optional. In cases where no target word is given as input, the system performs SA without a target.
Sentiment feature generator module
This module first tags all the words either by their polarity score or by type, exactly as described for our unsupervised algorithm. That is, it uses the SentiStrength lexicon [13] to tag all the sentiment-bearing words, intensifiers, diminishers, and emoticons with their corresponding sentiment score. Furthermore, it tags negation words with “NEGATE”. If a word is not sentiment-bearing, its sentiment score is ‘0’. Subsequently, the set of sentiment features presented in Table 2 is computed. In addition, we include the overall document sentiment score computed by the unsupervised method presented in the previous section as one of the features in our feature set.
In summary, the outcome of this module is a set of sentiment features.
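The tagging step can be sketched in Python as follows. The lexicon entries, their scores, the negation list, and the feature names are hypothetical placeholders: they are not taken from the SentiStrength lexicon or from Table 2, and intensifiers, diminishers, and emoticons are left out for brevity.

```python
# Toy lexicon: entries and scores are illustrative only.
TOY_LEXICON = {"love": 3, "good": 2, "sad": -3, "hate": -4}
NEGATIONS = {"not", "no", "never", "don't"}

def tag_tokens(tokens):
    """Pair each token with its lexicon score, 'NEGATE', or 0."""
    tagged = []
    for tok in tokens:
        key = tok.lower()
        if key in NEGATIONS:
            tagged.append((tok, "NEGATE"))
        else:
            # Words that are not sentiment-bearing receive the score 0.
            tagged.append((tok, TOY_LEXICON.get(key, 0)))
    return tagged

def count_features(tagged):
    """Derive a few simple count features from the tagged tokens."""
    scores = [s for _, s in tagged if s != "NEGATE"]
    return {
        "n_positive_words": sum(1 for s in scores if s > 0),
        "n_negative_words": sum(1 for s in scores if s < 0),
        "n_negations": sum(1 for _, s in tagged if s == "NEGATE"),
    }

tagged = tag_tokens(["I", "don't", "love", "sad", "movies"])
print(tagged)
print(count_features(tagged))
```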
Our primary goal in extracting features is to capture sequences of sentiment-relevant words that reflect a document’s sentiment change. We strive to capture the underlying sentence patterns (in terms of sentiment change) as features in our feature set. Additionally, we define features that represent the neighborhood of the target of sentiment. Table 2 presents this feature set.
Table 2. Features used in our system
“If I don’t get an iPhone for Christmas, I would be sad.”
In the above example, it could be seen that the actual polarity of the sentence regarding “iPhone” is positive, whereas, the sentence only contains words and patterns that usually occur in a negative context. The statement “don’t get an iPhone” is negative and therefore sets
The machine learning module
The machine learning module is a linear SVM that takes the feature set described in the previous subsection as input and classifies the tweets (or documents) into separate classes. The parameters of the SVM classifier were optimized experimentally on an independent dataset and are as follows: the L2 distance (i.e. Euclidean distance) is used for the calculation of the error term and loss; the tolerance for the stopping criterion is set to 0.0001; C is set to 1.0; class weight is set to ‘none’; and other implementation-dependent parameters are left at their default values.
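For illustration, the stated parameters can be mapped onto a linear SVM implementation as in the sketch below. The use of scikit-learn's `LinearSVC` is an assumption, since the text does not name the library. Reading “L2 distance for the error term and loss” as an L2 penalty with squared hinge loss is likewise only one possible interpretation. The toy feature vectors are invented.

```python
from sklearn.svm import LinearSVC

# Toy feature vectors: [n_positive_words, n_negative_words, document_score].
# Feature names and values are illustrative, not the paper's Table 2.
X_train = [[2, 0, 3], [3, 1, 2], [1, 0, 1], [0, 2, -3], [1, 3, -2], [0, 1, -1]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# tol, C, and class_weight follow the values stated in the text;
# penalty and loss reflect the assumed reading of "L2".
clf = LinearSVC(penalty="l2", loss="squared_hinge", tol=1e-4, C=1.0,
                class_weight=None)
clf.fit(X_train, y_train)

predictions = clf.predict([[2, 0, 2], [0, 2, -2]])
print(predictions)
```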
Evaluation
Following the common trend, we use an SVM classifier trained on a unigram bag-of-words model as the baseline for evaluating supervised classifiers. Moreover, we use the unsupervised SentiStrength algorithm [38] as the baseline for evaluating unsupervised sentiment classifiers. In the following subsections, we assess these two methods and explain why we chose them as baselines. In Subsection 6.1 we describe the three datasets used for our evaluation, in Subsection 6.2 we present the evaluation metrics, and in Subsection 6.3 we present our experimental results.
Datasets
For our evaluation, we use three datasets: two consist of Twitter data, and one consists of sentences derived from Amazon consumer product reviews.
DFKI Twitter dataset
The challenges posed by this dataset are the presence of a target of sentiment and its competitors, the use of sarcastic expressions, and the informality of the texts.
This dataset is a Twitter dataset generated for our previous publications [3] and [4], with the difference that re-tweets that duplicate other tweets in the dataset have been removed. Our improved dataset consists of 920 tweets labeled by a group of 22 human annotators, of which 460 have a positive polarity and 460 have a negative polarity. We gathered the dataset by querying the Twitter API and spotting occurrences of the word “iPhone”. To comply with the gold standard, each tweet went through a three-person voting process. That is, out of a collection of random tweets all containing the word “iPhone”, each tweet was read and labeled (positive, negative, or neutral/irrelevant) by three people. A tweet was accepted only if at least two of the three people agreed on the same class label. For preparing the dataset, each participant was given a set of rules for data annotation. The rules stated that even if a tweet expressed a positive sentiment regarding “iPhone” by making use of negative sentiment words, that tweet should be labeled as positive, and vice versa.
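The three-person voting rule can be written down as a short Python sketch. The function name `majority_label` and the label strings are hypothetical; only the rule itself (accept a tweet if at least two of three annotators agree) comes from the text.

```python
from collections import Counter

def majority_label(votes, min_agreement=2):
    """Return the label shared by at least `min_agreement` annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None

print(majority_label(["positive", "positive", "negative"]))
print(majority_label(["positive", "negative", "neutral"]))
```

A return value of `None` marks a tweet on which no two annotators agreed, which would then be rejected.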
Free-style Twitter dataset
This dataset is a Twitter dataset previously used by Agarwal et al. [1] for sentiment analysis; we acquired it by contacting the authors. The dataset, which we used for binary sentiment classification, contains 1,709 tweets in the positive class and another 1,709 tweets in the negative class. According to [1], this dataset contains 23,837 English words, while the total number of tokens in the dataset is 79,152. This means that a considerable portion of the dataset consists of non-English words, which our own investigation of the dataset confirms. Consequently, comparing a machine learning classifier trained on a unigram model against an unsupervised lexicon-based classifier would not be fair on this dataset, since our lexicon-based classifiers only cover English. Nevertheless, a comparison of various lexicon-based approaches on this dataset is a valid experiment.
Moreover, this dataset contains no target words for sentiment classification. Thus, our target-oriented supervised sentiment classifier would be challenged to perform polarity detection with no specified target of sentiment.
Amazon reviews dataset
The Amazon reviews dataset that we used for our experiments is part of a publicly available dataset previously described in [17] for the task of sentiment classification. This dataset contains sentences derived from Amazon reviews about five electronic consumer products. It is unbalanced (i.e. the number of reviews differs strongly across the five products). Moreover, the results presented by Hu and Liu [17] on this dataset are hardly comparable with those of other methods, because they measure the classification accuracy of their method only on opinion sentences according to their own definition. By their definition, a sentence is called an opinion sentence “if the sentence contains one or more product features and one or more opinion words”. In addition, an opinion word is restricted to an adjective in their definition.
Because they have not labeled the opinion sentences in their dataset according to this definition, it would be very hard to reproduce their results. However, they have annotated the opinion sentences that contain one or more product features with their respective polarity labels. By investigating these sentences, we found that the labeled sentences amount to 634 in the positive class and 1,094 in the negative class over all five products.
In order to create a balanced dataset for our experiments, we obtained 630 random sentences from the positive and 630 random sentences from the negative class of the original dataset. Therefore, this dataset contains 1,260 sentences in total.
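The balancing step amounts to drawing the same number of random sentences from each class. The sketch below assumes sampling without replacement and uses a fixed seed for reproducibility; both choices, as well as the function name and the placeholder sentences, are assumptions rather than details given in the text.

```python
import random

def balanced_sample(positives, negatives, n_per_class, seed=0):
    """Draw n_per_class items without replacement from each class."""
    rng = random.Random(seed)
    pos = rng.sample(positives, n_per_class)
    neg = rng.sample(negatives, n_per_class)
    return [(s, "positive") for s in pos] + [(s, "negative") for s in neg]

# Placeholder sentences standing in for the 634 positive and 1,094 negative ones.
positives = [f"pos sentence {i}" for i in range(634)]
negatives = [f"neg sentence {i}" for i in range(1094)]
dataset = balanced_sample(positives, negatives, 630)
print(len(dataset))
```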
Analogous to the free-style Twitter dataset, this dataset contains no target words for sentiment classification. Thus, our target-oriented supervised sentiment classifier would be challenged to perform polarity detection with no specified target of sentiment.
Evaluation metrics
We evaluate the methods presented in this paper by using accuracy on Single Class Accuracy (SCA), Overall Accuracy (OA), and
where
In addition to the above measures, we evaluate our various methods based on their Receiver Operating Characteristic (ROC) curves.
Furthermore, we applied ten-fold cross-validation when testing all of the supervised classifiers.
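A ten-fold cross-validation loop can be sketched in plain Python as follows. The stratified fold assignment, the helper names, and the toy threshold classifier are assumptions for illustration. Overall accuracy is computed here as the fraction of correctly classified items, which is the usual reading of the term and not necessarily the exact formula used in the paper.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with classes spread evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def cross_validate(features, labels, train_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, return overall accuracy."""
    correct = 0
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct += sum(predict(features[i]) == labels[i] for i in test_idx)
    return correct / len(labels)

def train_threshold(train_x, train_y):
    """Toy 'classifier': positive if the single score feature is above zero."""
    return lambda x: "pos" if x[0] > 0 else "neg"

features = [[1]] * 20 + [[-1]] * 20
labels = ["pos"] * 20 + ["neg"] * 20
print(cross_validate(features, labels, train_threshold))
```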
Experimental results
In this subsection we present the results of our experiments. First, for the sake of completeness we present the result of our experiment published previously in [3], which shows why we preferred an SVM classifier trained on a unigram model (i.e. with both emoticons and stop words included) as a baseline classifier for supervised methods.
In [3] we tested the effect of presence or absence of emoticons and stop words in the unigram feature set to train and test SVM, NB, as well as MaxEnt classifiers. For all three classifiers, the data was preprocessed in the same way as the steps described in Section 3. In [3] we performed this experiment in order to define a baseline for evaluating our supervised method.

Figure 1. Overall accuracy of MaxEnt, Naïve Bayes, and SVM when testing the effect of the presence or absence of emoticons and stop words in the bag-of-words feature set.
Figure 1 illustrates the results of this experiment. The classification accuracy results reported in the figure are all based on the overall accuracy measure. The accuracy results were computed using a dataset obtained in the same way as the dataset described in Subsection 6.1.1, with 470 tweets in the positive class and 470 tweets in the negative class.
In this figure, “N-Em, N-SW” stands for no emoticons and no stop words included in the feature set. Similarly, “W-Em, N-SW” means with emoticons included, but no stop words. The other two labels can be read in the same way.
According to Fig. 1, we may conclude that:
An SVM trained on a unigram feature set including both emoticons and stop words outperforms all other combinations. Consequently, we define it as one of the baselines for our experiments.
In [3] we reported that SVM and MaxEnt classifiers are sensitive to noise, i.e. to the inclusion of emoticons in the training feature vector. As illustrated by the two experiments in which stop words are removed, our findings mirror the results of Go et al. [15]. However, as the other two experiments including stop words suggest, the SVM classifier achieves better classification accuracy when emoticons are included in the feature vector. In general, our classification results mirror those of Go et al. [15] by showing that the SVM performs better than NB and that NB in turn performs better than MaxEnt.
The NB classifier always remains robust against noise and emoticons.
The MaxEnt classifier always shows the same behavior as the SVM, however, with lower classification accuracy.
In addition to the SVM unigram model, we use the SentiStrength [38] classifier as a second baseline for evaluating the unsupervised classifiers presented in this paper. Since SentiStrength is one of the best available unsupervised methods [38] and is usable as a baseline approach [6], we also tested this classifier against our methods. We use the SentiStrength classifier by scraping the website affiliated with [38], which provides an online version of this algorithm. In our experiments we found some malfunctions in the online version of SentiStrength. These malfunctions were mostly due to a lack of preprocessing for handling various types of text. As an example, at the time of writing this article, SentiStrength was not able to classify texts starting with a hash tag. We tried to classify the phrase “#love you” and found that, instead of a positive and a negative output score, it returns “E” and “in”. There are further cases in which SentiStrength malfunctions because of insufficient preprocessing.
Table 3. Performance of Various Classifiers on the DFKI Twitter Dataset (Subsection 6.1.1)
Table 4. Performance of Various Classifiers on the Free-Style Twitter Dataset (Subsection 6.1.2)
Table 5. Performance of Various Classifiers on the Amazon Reviews Dataset (Subsection 6.1.3)
Since SentiStrength was unable to read, and hence classify, some texts, we applied the same preprocessing steps used for our two unsupervised algorithms before feeding the data to the classifier. In this way, we made sure that SentiStrength makes no mistakes of this type and returns valid scores for each input text.
Having defined the SVM unigram model and the SentiStrength as baselines for our experiments, we now compare our proposed methods against them in a benchmark.
As described in Section 3, we built an enhanced version of the SentiStrength lexicon to be used by our classifiers. However, to make a fair benchmark for comparing the performance of our two unsupervised methods presented in Sections 4 and 5 against the SentiStrength classifier, we evaluate these methods both when they use our ML and when they use only the standard SentiStrength lexicon. The results of our benchmarks on the three datasets described in Section 6.1 are presented in Tables 3, 4, and 5, respectively. For each benchmark we compared the performance of our target-oriented supervised approach with our two unsupervised algorithms, both with our ML and with the original SentiStrength lexicon. In addition, the SentiStrength algorithm itself and an SVM trained on a unigram feature vector were included in the benchmarks.
As Tables 3 and 5 show, our supervised method outperforms the SVM unigram baseline by a significant margin based on our used evaluation metrics, i.e. overall classification accuracy and
Furthermore, Tables 3, 4, and 5 show that both of our unsupervised algorithms, UPD, and UTOPD outperform the SentiStrength algorithm in terms of classification accuracy and overall
Additionally, the results presented in the three tables clearly show that our ML is superior to the SentiStrength lexicon on all three datasets.
Moreover, for further detailed evaluation of the various presented methods, Figures 2, 3, and 4 show the ROC diagrams (true positive rate against false positive rate) of those methods on the DFKI Twitter dataset, free-style Twitter dataset, and the Amazon reviews dataset respectively. For assessing accuracy of each classifier, we additionally measured the Area Under the Curve (AUC) for each method. In the figures presenting the ROC diagrams, the “_ML” suffixes at the end of the names of our unsupervised methods mean “using ML” and the suffix “SSL” means “using Standard SentiStrength Lexicon”.
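As a reminder of what the AUC measures, the sketch below computes it through the rank-based (Mann-Whitney) formulation, i.e. the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. This is a generic illustration and not necessarily the procedure used to produce the reported values; the scores in the example are invented.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 3 of 4 pairs ranked correctly: 0.75
```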

Figure 2. ROC Curves Showing the Performance of Various Tested Methods on the DFKI Twitter Dataset.

Figure 3. ROC Curves Showing the Performance of Various Tested Methods, Except the SVM Unigram Model, on the Free-Style Twitter Dataset (Agarwal et al. [1]).

Figure 4. ROC Curves Showing the Performance of Various Tested Methods on the Amazon Reviews Dataset.
Given our experimental results we finally summarize that:
When comparing our supervised classifier against the SVM unigram baseline, the results suggest that it outperforms the SVM unigram baseline in terms of classification accuracy,
When comparing our two unsupervised methods against the unsupervised SentiStrength classifier, the results suggest that whenever either the UPD algorithm or UTOPD algorithm used our ML, they both outperformed the SentiStrength in terms of overall accuracy,
Additionally, when comparing the UPD and the UTOPD using the standard SentiStrength lexicon against SentiStrength, they both outperform the SentiStrength algorithm on the DFKI Twitter and the Amazon reviews dataset based on overall accuracy,
Analogously, when comparing the SentiStrength against our UPD algorithm, when it uses the standard SentiStrength lexicon, the results indicate that the UPD algorithm outperforms the SentiStrength algorithm on the DFKI Twitter dataset and the Amazon reviews dataset based on overall accuracy,
When comparing UPD and UTOPD, we observe that UTOPD performs better than UPD in all of our benchmark tests. UTOPD is one of the first unsupervised target-oriented polarity detection algorithms in the field of SA. In addition, it outperforms the two other unsupervised SA classifiers on all three datasets.
As a final evaluation of our supervised SA system, we discuss the domain independence of this classifier. Because the type of features used to train the supervised SA system reduces the dimensionality of words (e.g. the feature “Number of Positive Words” maps the words “love” and “like” to the same class), we initially believed that, after training, our system reaches a certain level of domain independence. For example, our system could be trained on data from the electronic consumer products domain with a predefined target word and then be used to classify data from another domain (e.g. food). That is, if we train our supervised classifier on data regarding the target “iPhone”, we could test it on a dataset that specifies the word “chocolate” as the target of sentiment. We refer the interested reader to our earlier paper [4], where we trained the classifier on Twitter data about “iPhone” and tested it on Twitter data about “chocolate”. In summary, we gathered a small test set of 100 tweets (50 positive and 50 negative) in which the word “chocolate” was the target of sentiment. This dataset was acquired in the same way as the DFKI Twitter dataset described in Subsection 6.1.1, with the difference that the tweets were labeled by only one human annotator. Our supervised classifier classified 84% of the tweets correctly. Thus, this experiment showed that our system can use training data from one domain to classify tweets in another domain, as long as they express sentiment about an explicit target word. However, when a word is positive in one domain and negative in another, domain-dependent lexicons are needed for each domain, and the current system cannot solve this problem.
There are some remarks regarding our proposed supervised and unsupervised methods that we should address, which are:
All the features proposed in our supervised system require a very short time to be computed.
The feature set used by the supervised method is very small, which means the system has a lower time and memory complexity as compared with a supervised method trained on a unigram feature vector. However, at the same time it outperforms unigram feature sets in terms of classification accuracy, and
By using the ft
Our methods strongly rely on the completeness of the sentiment lexicon, which is the main source for identifying sentiment words. That said, any kind of SA system relies on some kind of knowledge resource, i.e. training data in the case of supervised n-gram-based methods, or a sentiment lexicon or knowledge repository in the case of unsupervised methods.
Unlike the unigram feature model, our method would not be dramatically impacted if the similarity between training and testing data fluctuates. That is because we capture underlying sentiment patterns of documents rather than the exact words.
Handling negations has been researched before in the literature [31]. Here we also proposed our own way of handling them. We did not define a static window size. Instead, a negation was applied whenever a sentiment word or the target of sentiment was encountered, and in our methods the effect of a negation was bound to a single text snippet. In order to deal with a negation word expressing negative sentiment regarding a target word (e.g. “No iPhone for me!!”), we introduced feature ft
The UTOPD algorithm proves to be a promising approach in its class, as it is able to distinguish between two sentences that would receive the same sentiment scores from UPD or SentiStrength. This is done only by taking into account the sentiment features present in a document. However, UTOPD was primarily designed as a ranking algorithm used as part of a sentiment summarization system to rank and rate documents.
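The window-free negation handling mentioned in the remarks above can be illustrated with the following Python sketch. It is a simplified reading, not the authors' implementation: the toy lexicon, the negation list, the choice of punctuation as snippet boundary, the sign flip applied to a negated sentiment word, and the `target_negated` flag are all assumptions made for illustration.

```python
import re

TOY_LEXICON = {"love": 3, "good": 2, "sad": -3, "bad": -2}  # illustrative scores
NEGATIONS = {"no", "not", "never", "don't"}

def apply_negations(text, target="TARGET"):
    """Return (adjusted_scores, target_negated) for a preprocessed text."""
    adjusted, target_negated = [], False
    # A snippet ends at punctuation; a pending negation never crosses it.
    for snippet in re.split(r"[.,;!?]+", text):
        pending = False
        for tok in snippet.split():
            word = tok.lower()
            if word in NEGATIONS:
                pending = True
            elif tok == target and pending:
                # Negation reaches the target before any sentiment word.
                target_negated, pending = True, False
            elif word in TOY_LEXICON:
                score = TOY_LEXICON[word]
                # No fixed window: the negation applies to the next sentiment word.
                adjusted.append(-score if pending else score)
                pending = False
    return adjusted, target_negated

print(apply_negations("this is not very good, but I love it"))
print(apply_negations("No TARGET for me!!"))
```

Because the pending negation is reset at every snippet boundary, a negation in one clause cannot affect sentiment words in the next one.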
Conclusion
In this paper, we introduced a supervised method that combines a sentiment lexicon with a machine learning classifier for polarity detection of opinionated texts, and evaluated it on three different datasets. We showed that our proposed target-oriented supervised method outperforms the SVM unigram baseline in the majority of our conducted experiments.
Moreover, we presented two unsupervised algorithms, compared them against the SentiStrength baseline, and showed that both of these methods outperform the SentiStrength algorithm within our settings.
Based on our observations, we believe that, in the context of SA, moving towards sentiment features rather than conventional text processing features is a promising direction.
Future work includes the creation of a context-sensitive and self-learning sentiment lexicon that covers a broader range of sentiment words than standard sentiment lexicons. This would reduce or eliminate the effort of maintaining the lexicon. In this way, our lexicon-based methods could potentially reach their maximum performance.
In addition, the detection of sarcasm, as well as finding more features that could semantically interpret a document, would be further enhancements of this study.
