Building Normalized SentiMI to enhance semi-supervised sentiment analysis

Abstract

Sentiment analysis and polarity detection is a type of text classification in which natural language opinion is analyzed and classified as either positive or negative. Classifying text into sentiment labels is difficult because opinions expressed in natural language may contain abbreviations, slang, sarcasm, irony and/or idioms. The proposed research focuses on the use of SentiWordNet 3.0 as a labeled corpus for training purposes. We present a complete framework based on a dictionary named Normalized SentiMI (nSentiMI), which is created by calculating point-wise mutual information for each term/part-of-speech pair extracted from SentiWordNet. The proposed framework is applied to a dataset of 50,000 movie reviews to identify the value of a weight factor α and is then evaluated on an unseen test dataset of 2000 movie reviews. Comparison with state-of-the-art techniques also confirms the superiority of the proposed approach.

Keywords

SentiWordNet, mutual information, sentiment analysis, social media, text mining, movie reviews

1 Introduction

Sentiment analysis concerns the computational treatment of subjective information in tweets, blog posts and comments. It categorizes text as expressing positive, negative or neutral sentiment using various approaches. A large amount of subjective text is available on the internet, such as product reviews, messages in discussion forums and opinions about particular products. Therefore, sentiment classification and opinion mining are potentially useful for many applications such as search engines, market search and recommender systems [1].

Kennedy and Inkpen [2] proposed that SentiWordNet is one common resource that can be used to detect sentiment in text. It is a publicly available dictionary of opinionated terms that contains opinion information for terms extracted from the WordNet database. This resource is very useful for opinion mining and sentiment analysis tasks. Another benefit of SentiWordNet is that it contains sentiment information for the English language, which replaces the manual tagging and driving procedures for lexicons [3]. Moreover, because SentiWordNet is built using a semi-supervised method, it can easily be updated for future enhancements of WordNet and for other languages that use similar lexicons. Different versions of SentiWordNet are available online; for example, SentiWordNet 3.0 is an improved version of SentiWordNet 1.0 by Esuli and Sebastiani [4], which has become a lexical resource for sentiment classification and is used in a large number of research projects worldwide [5].

According to Ye, Zhang and Law [6] sentiment analysis can be performed using both supervised and unsupervised learning approaches. Supervised learning techniques utilize sources of features and labeled data in order to determine different aspects of text. There are many supervised learning algorithms used for sentiment analysis and polarity classification such as k Nearest Neighbors, Naïve Bayes, Support Vector Machines and Rule based Classification [7]. Unsupervised learning techniques do not use any labeled data or training corpus and utilize the information contained within given text in order to determine the sentiment or polarity. Some of the techniques used for unsupervised or semi-supervised classification are based on resources like bag of words, emoticons list and lexicons [7].

Multiple preprocessing techniques are involved in improving the accuracy of sentiment classification of text. Yang and Pedersen [8] suggested that feature pruning is one of the preprocessing methods that performs dimensionality reduction and improves text categorization accuracy. Several methods are used for feature pruning, such as Document Frequency Thresholding, Mutual Information, Information Gain, Chi-Square and Term Strength. Each of them removes unnecessary terms from a text based on corpus statistics and retains the information that can improve text categorization performance [8].

This research is focused on English language data. A lexical resource named ‘Normalized SentiMI’ is built by calculating Point-wise Mutual Information using SentiWordNet as the training dataset. The training dataset is analyzed in order to find the optimum α values, which help to raise the accuracy level on the test dataset. A feature set consisting of adjectives, verbs and adverbs is used because it achieved the highest accuracy when compared with other combinations. The proposed technique has obtained a highest performance improvement of 30.5% in accuracy, 12.9% in sensitivity, 40.3% in specificity and 36.2% in F-Measure.

The rest of the paper is organized as follows: Section 2 describes the related work, while Section 3 presents the proposed technique. Details about the datasets are presented in Section 4. Section 5 presents the results and discussion. Finally, the conclusion and future work are given in Section 6.

2 Literature review

Yang and Pedersen [8] compared different feature selection methods used for text categorization. Five term-goodness criteria, namely Document Frequency (DF), Information Gain (IG), Mutual Information (MI), the Chi-Square test and Term Strength (TS), are used to eliminate irrelevant information. The evaluation of results indicates that IG and Chi-Square are the most effective methods for removing irrelevant information without eliminating necessary categorizing terms, and it is concluded that DF, IG and Chi-Square scores are strongly correlated for particular terms. Kazemzadeh, Lee and Narayanan [9] presented two models to represent the meaning of emotion words. The first model interprets emotion words as three-dimensional IT2 FS (interval type-2 fuzzy sets), where the three dimensions are valence, activation, and dominance. The second model is based on the open-ended set of questions from the game of emotion twenty questions (EMO20Q), where the meaning of emotion words is represented by the answers to these questions. The experimental results indicate that the second model is more useful for dealing with large emotion vocabularies.

Wilson, Wiebe and Hoffmann [10] presented word and phrase level sentiment analysis, whereas Zhang et al. [11] performed document level sentiment classification. Turney [12] also investigated a method for document level sentiment analysis in which adjectives are used to calculate the weight of a document. If an adjective occurs with the word “excellent” then it is classified as positive, and if it occurs with “poor” then it is classified as negative. Read [13] proposed a method for sentiment analysis using emoticons such as :-( (sad face) and :-) (happy face) and created a training set to perform sentiment analysis. Naïve Bayes and SVM classifiers are used to perform classification, and experimental results indicate high classification accuracy on the basis of emoticon analysis.

Keefe and Koprinska [14] analyzed various feature selection and feature weighting methods used for sentiment analysis. The paper evaluated a range of methods in combination with Naïve Bayes and Support Vector Machine classifiers. The feature selection methods include Categorical Proportional Difference (PD), SWN Subjectivity Scores (SWNSS) and SWN Proportional Difference (SWNPD). The feature weighting methods are Feature Frequency (FF), Feature Presence (FP), Term Frequency-Inverse Document Frequency, SWN Word Score Groups (SWN-SG) and SWN Word Polarity Groups (SWN-PG). Each classifier is evaluated against the different feature selection and weighting methods. The experimental results indicate that the combination of PD, FP and SVM gives the best performance. Verma and Bhattacharyya [15] proposed a word level sentiment analysis for feature vectors of a given document. SVMs are used to classify a document as positive or negative. Information Gain is used to calculate the weights of features, and vectors whose Information Gain is less than a given threshold are pruned. The experimental results indicate high sentiment classification accuracy.

Reyes and Rosso [16] performed sentiment analysis and classification of text that contains irony. The proposed model utilizes three conceptual layers along with eight different textual features to automatically categorize ironic data. The results of the proposed framework are evaluated by a human annotator and show the complexity of automatically detecting irony in a dataset. Barnden et al. [17] presented sentiment analysis of figurative language on Twitter using different participant systems. The research focused on classification of tweets containing irony and metaphors. The evaluation is performed in order to determine the degree to which conventional sentiment analysis can handle creative language. A fine-grained sentiment score is assigned by each participant system to each tweet and is then compared with the weighted average score provided by human annotators. The highest evaluation score is assigned to the system that is consistently closest to the gold standard. Blitzer, Dredze and Pereira [18] focused on domain adaptation for sentiment analysis. Their research focuses on reducing the relative error by modifying the structural correspondence learning (SCL) algorithm. Moreover, it also showed a method to correct structural misalignments by using a small amount of domain data. The research can be extended by adding ranking of features with more realistic issues. Dredze, Crammer and Pereira [19] introduced confidence-weighted linear classifiers that add a confidence information parameter to linear classification for sentiment analysis. The proposed algorithm focused on PA (passive-aggressive) updates for linear classifiers. The parameter confidence is modeled with a Gaussian distribution, and the online updates revise the parameter estimates and reduce the distribution’s variance.

Pang, Lee and Vaithyanathan [20] addressed the problem of classifying documents by overall sentiment, such as positive or negative. Three machine learning methods, namely Naïve Bayes, Support Vector Machine and Maximum Entropy, are employed. Multiple feature types such as unigrams, bigrams, POS tags attached to each word and the position of a word within the text are analyzed. The analysis of results indicates that unigram presence information turned out to be the most effective. The next step in that work would be the identification of features indicating whether sentences are on-topic.

There are several challenges in sentiment analysis and polarity classification tasks that make existing and emerging applications so interesting. First, Word Sense Disambiguation is a classical NLP (natural language processing) problem. A word that is considered positive in one situation may be considered negative in another. For example, the word “long” can have both a positive and a negative meaning in a sentence. If a customer says “laptop’s battery life is long”, it is a positive opinion. On the other hand, if the customer says “the startup time of laptop is long”, it is a negative opinion [21]. These differences mean that an opinion system trained to gather opinions on one type of product or product feature may not perform very well on another. The second challenge is the problem of sudden deviation from positive to negative polarity. For example, “The show has a great cast, superb story and spectacular pictures, the director has managed to make a mess of the whole thing”. The third challenge is the handling of negation. If negations are not managed properly, they can completely mislead the classifier. For example, “Not only do I not approve Supernova 7200, but also hesitate to call it a phone” contains the positive sentiment word “approve”, but the statement is considered negative because of the negations. Another challenge arises because people do not always express their opinions in the same way. Most traditional text processing relies on the fact that small differences between two pieces of text do not change the meaning very much. However, in polarity classification “The show was great” is very different from “The show was not great” [22]. Another challenge in sentiment analysis is to detect the pragmatics of user opinion, which may change the sentiment completely. Pragmatics is a field in which the context of a given statement or word is studied in order to understand the polarity. For example, “I just finished watching THE DESTROY” gives a positive sentiment whereas “That completely destroyed me” is a negative sentiment. The identification of the entity is another challenge in sentiment analysis and polarity classification. A statement may have multiple entities associated with it, and it is necessary to find the entity towards which the opinion is directed. For example, “Sony is better than Samsung” is positive for Sony but negative for Samsung.

Although an extensive amount of research is being performed on sentiment analysis and polarity classification, there is still room for improvement. There is a need for a more precise classification method that can classify data with high accuracy and performance. It is observed that unsupervised or semi-supervised approaches have low accuracy whereas supervised methodologies show high performance. The problem with supervised approaches is that labeled data is not easily available. Therefore, high-performing semi-supervised approaches are needed. This research presents a semi-supervised framework that eliminates the need for tagged instances by using SentiWordNet as a labeled corpus, while achieving performance comparable to that of supervised approaches.

3 The proposed approach

The proposed approach combines semi-supervised and unsupervised classification techniques to perform sentiment analysis and opinion mining. The proposed framework can be divided into two basic modules: Pre-Processing and Classification.

3.1 Pre-Processing

The first step in pre-processing involves gathering data from online repositories. We have used online movie reviews datasets for the evaluation of the proposed research. Pre-Processing is then applied on the datasets in order to remove noise, inconsistency and incomplete information from data. Data must be preprocessed in order to achieve quality results from any data or text mining task. We have used following tasks for pre-processing of online movie reviews data.

Filter only English language data.

URLs are removed as they do not contribute to sentiment classification of any dataset because they do not contain any valuable information for informal text.

The question words like what, why, who, when etc. do not contribute to sentiment analysis and polarity detection for any sentence and they are removed.

Special characters such as ; . [ ] { } ( ) / ’ # @ are removed from the text in order to avoid discrepancies.

Stemming and lemmatization are then performed in order to further refine the dataset [23].

Stop words are the most commonly used English words that do not contain any sentiment information, for instance ‘is’, ‘which’ and ‘the’; they are removed from the dataset.

Spelling correction is applied so that we have a better chance of matching the words to the proposed nSentiMI. We have used JSpell 1 for spelling correction. JSpell is a spell checking API that does basically three things. First, it parses the text that is typed into a page and finds each individual word. Second, each word is checked against a list of correct words. Third, if no match can be found then the spell checker looks for suggestions that are similar in sound and structure to the word that is incorrect.

We have replaced slang terms with complete words in order to perform efficient sentiment scoring and classification. Slang dictionary 2 is used to look up each slang term and its definition, and the term is then replaced with the complete word or phrase.
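The rule-based steps in this list can be illustrated with a minimal Python sketch. It covers only URL removal, special-character removal, slang expansion, and question-word and stop-word filtering; language filtering, stemming, lemmatization and spelling correction are left out. The word lists and the function name `preprocess` are hypothetical placeholders, not the resources used in this work.

```python
import re

# Hypothetical, abbreviated word lists (assumptions for illustration only).
QUESTION_WORDS = {"what", "why", "who", "when"}
STOP_WORDS = {"is", "which", "the"}
SLANG = {"gr8": "great", "luv": "love"}  # stand-in for a slang dictionary

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(review):
    """Return the cleaned tokens of one review as a list of strings."""
    text = URL_RE.sub(" ", review.lower())   # drop URLs first, while intact
    text = SPECIAL_RE.sub(" ", text)         # drop special characters
    # Expand slang; an expansion may itself be a multi-word phrase.
    expanded = " ".join(SLANG.get(tok, tok) for tok in text.split()).split()
    return [t for t in expanded
            if t not in QUESTION_WORDS and t not in STOP_WORDS]


print(preprocess("Why is the movie gr8? See http://example.com/x"))
# ['movie', 'great', 'see']
```

URLs are removed before special characters so that the URL pattern still matches; the remaining order of steps is a choice made for this sketch.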

3.2 Classification

The classification process involves two steps. First, part-of-speech tagging is performed for each term, and then these tagged sets are used for sentiment classification using the proposed technique. The experiment execution stages are shown in Fig. 1. Each of these classification steps is explained as follows:

3.2.1 POS tagging and Tokenization

We have used the Java-based Stanford POS tagger 3 to tag each movie review; it returns the part-of-speech for each term, which is used to further refine polarity detection. The tagged dataset is then passed on for tokenization, where each term and its part-of-speech are extracted.

3.2.2 Feature selection

The forward feature pruning method removes irrelevant features from the text and reduces the original feature set. Moreover, classification accuracy is increased while the running time of the learning algorithm is decreased. There are multiple ways to perform feature pruning, such as pruning based on adjectives, nouns or adverbs. Turney [12] used adjectives for sentiment analysis and classification. Jain, et al. [24] and Lin, et al. [25] also used adjectives to perform sentiment calculation. Saggion, et al. [26] achieved 41% accuracy using a combination of adjectives and adverbs for sentiment calculation.

We have performed feature pruning on the movie reviews dataset after tokenization and removed all words which are tagged as nouns. We have used the combination of adjectives, verbs and adverbs as the feature set. Features are identified on the basis of the POS tag given by the Stanford POS tagger. Feature selection is carried out by evaluating our algorithm for all the SentiWordNet POS tags.

3.2.3 Proposed sentiment dictionary ‘nSentiMI’

The main focus of the proposed research is the development of a sentiment dictionary. SentiWordNet 3.0 (SWN) terms are used to calculate mutual information for each term/part-of-speech pair. A sample from SWN is given in Table 1. The usage-based ranking is also determined from SWN glosses.

It is clear from Table 2 that part of speech (POS) can be defined by ‘a’, ‘v’, ‘r’, and ‘n’ where ‘a’ is adjective, ‘v’ is verb, ‘r’ is adverb and ‘n’ is defined as noun. The sentiment score is categorized into three different types namely positive, negative and objective. The positive score is represented by ‘PosScore’ whereas negative score is defined as ‘NegScore’. The objective ‘ObjScore’ is calculated by the following equation: $ObjScore = 1 - (PosScore + NegScore)$ (1)

Formulas 2 and 3 are provided by SentiWordNet3.0 4 to approximate the sentiment value of a word with a label. The following equation is used to calculate the synset score: $SynsetScore = PosScore - NegScore$ (2)

The weight of synsets is determined by the usage rank, which indicates the sense in which a word is most commonly used. The following equation is used to determine the final score of a given term: $Score = \sum_{rk = 1}^{n} \frac{SynsetScore(rk)}{rk}$ (3) where ‘rk’ is the usage rank of the given term extracted from SWN and n is the number of usage ranks listed for it.

Let us suppose that we want to calculate the score for term “unable” with POS “a” according to formula 3. It has 3 usage ranks listed in SWN. Therefore, n = 3. The positive and negative scores extracted from SWN are given as follows:

For each rank, the synset score is calculated according to formula 2, which results in −0.75, −0.375 and −0.125 for ranks 1, 2 and 3 respectively. Now, using formula 3:

Iteration 1 (for rank 1) → −0.75/1 = −0.75 → Score = −0.75

Iteration 2 (for rank 2) → −0.375/2 = −0.1875 → Score = −0.9375

Iteration 3 (for rank 3) → −0.125/3 ≈ −0.0417 → Score ≈ −0.979

Hence, according to formula 3, the final score calculated for term “unable” and POS “a” is approximately −0.979.
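The rank-weighted sum of formula 3 can be checked with a short Python sketch. The function name `swn_term_score` is a hypothetical helper; the three synset scores are the ones quoted above for “unable” with POS “a”. Summed without intermediate rounding, the three terms give roughly −0.979; rounding each term before adding can shift the last digit.

```python
def swn_term_score(synset_scores):
    """Formula 3: sum of SynsetScore(rk) / rk over usage ranks rk = 1..n.

    synset_scores lists the synset scores in usage-rank order.
    """
    return sum(score / rank
               for rank, score in enumerate(synset_scores, start=1))


# Synset scores of "unable" (POS "a") for ranks 1, 2 and 3
unable_score = swn_term_score([-0.75, -0.375, -0.125])
print(round(unable_score, 3))  # -0.979
```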

Each gloss from SWN is considered a training document, and equation (2) is used to calculate the synset score. If this score is zero, the document is labeled as ‘objective’; otherwise it is labeled as ‘subjective’. Subjective synsets are further divided into positive and negative classes according to the synset score. If a synset score is greater than zero, it is labeled as a positive synset, whereas a negative label is assigned when the synset score is less than zero. The proposed research uses 117659 training samples, of which 7282 samples are classified as positive and 7224 samples are labeled as negative on the basis of their synset scores. The rest of the samples are categorized as objective.

For each unique Synset#POS combination, the number of positive and negative samples is counted. Objective terms and nouns are ignored in further processing because they do not provide any notable information. The following formula is used to measure the mutual information MI for each Synset#POS combination: $MI(t, l) = \log_{2} \frac{A \times N}{(A + B) \times (A + C)}$ (4) where N denotes the number of labeled samples, A is the co-occurrence count of term t with label l, B is the count of term t occurring without label l, and C is the count of samples with label l that do not contain the term t. We have calculated the mutual information for 19280 features using this method. Table 3 shows a sample from the proposed nSentiMI dictionary, where ‘cntPos’ and ‘cntNeg’ are defined as ‘A’ for positive and negative documents respectively. The mutual information for the positive label is represented as ‘miPos’ whereas the mutual information for the negative label is represented as ‘miNeg’. Equation (4) is used to calculate ‘miPos’ and ‘miNeg’ for each term.

The classification of each term is performed using the nSentiMI dictionary for the data obtained in the first step. We have uploaded nSentiMI on labarchives so that it is publicly available online 5 .
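Equation (4) can be illustrated with a minimal Python sketch. The function name `pointwise_mi` and the toy counts are assumptions for illustration. Returning 0.0 when A is zero is also an assumption of this sketch, since the logarithm is undefined in that case and the text does not state how it is handled.

```python
import math


def pointwise_mi(a, b, c, n):
    """Equation (4): MI(t, l) = log2(A * N / ((A + B) * (A + C))).

    a: samples containing term t with label l
    b: samples containing term t without label l
    c: samples with label l that do not contain t
    n: total number of labeled samples
    """
    if a == 0:
        return 0.0  # assumption: undefined log treated as "no information"
    return math.log2((a * n) / ((a + b) * (a + c)))


# Toy counts (hypothetical): t appears only in samples labeled l
print(pointwise_mi(a=4, b=0, c=4, n=16))    # 1.0
# Toy counts (hypothetical): t is independent of l
print(pointwise_mi(a=5, b=5, c=45, n=100))  # 0.0
```

A positive value indicates that the term co-occurs with the label more often than independence would predict, and a negative value indicates the opposite.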

3.2.4 nSentiMI based classification

For each term encountered in the movie review, the Mutual Information scores are extracted from nSentiMI based on the POS tag. Any term not found in nSentiMI is assigned a zero score which means that it has no effect on the outcome. nSentiMI based classification is performed using the technique proposed by Lin et al. [25]. According to this technique, the mutual information based sentiment score for positive and negative labels are calculated using the Equations (5) and (6) respectively. $nSentiMI (+) = (α \times miPos) + (1 - α) \times (- miNeg)$ (5) $nSentiMI (-) = (α \times miNeg) + (1 - α) \times (- miPos)$ (6) where miPos and miNeg are Mutual Information for positive and negative labels extracted from nSentiMI. α is the parameter used to weigh contributions of nSentiMI (+) and nSentiMI (−). The values of α range from 0 to 1. The final sentiment score of the term is obtained using Equation (7)
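Equations (5) and (6) can be written as a small Python function. The name `nsentimi_scores` and the example MI values are hypothetical; only the two formulas come from the text.

```python
def nsentimi_scores(mi_pos, mi_neg, alpha):
    """Return (nSentiMI(+), nSentiMI(-)) as in Equations (5) and (6)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    plus = alpha * mi_pos + (1 - alpha) * (-mi_neg)
    minus = alpha * mi_neg + (1 - alpha) * (-mi_pos)
    return plus, minus


# Hypothetical MI values for one term/POS pair
print(nsentimi_scores(mi_pos=1.0, mi_neg=-0.5, alpha=0.5))  # (0.75, -0.75)
```

At α = 0.5 the two scores are exact negatives of each other, because both reduce to half the difference between miPos and miNeg. Other values of α weight the two MI values unequally.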

$SentiMI\ Term\ Score = \begin{cases} |SentiMI(+)| & \text{if } SentiMI(+) > 0 \\ 0 & \text{if } SentiMI(+) = |SentiMI(-)| \\ -|SentiMI(-)| & \text{if } SentiMI(+) < |SentiMI(-)| \end{cases}$ (7) where SentiMI(+) and SentiMI(−) denote the scores obtained from Equations (5) and (6).

Z-score normalization is applied to get the Normalized SentiMI Term Score. The z-score is a dimensionless quantity that gives the number of standard deviations by which an observation lies above the mean. A positive z-score indicates that a datum is above the mean whereas a negative value indicates that a datum is below the mean [27]. Z-score normalization is commonly regarded as a good normalization method because the range between the minimum and maximum remains preserved and the dispersion of the series, such as its standard deviation or variance, is taken into account. The z-score linearly transforms the data in such a way that the mean value of the transformed data equals 0 while their standard deviation equals 1. The transformed values themselves do not lie in a particular interval [28]. The following formula is used for the z-score transformation: $Z = \frac{x - μ}{σ}$ where μ represents the mean value of the data and σ is the standard deviation of the data. Z represents the distance between a value and the data mean, and it can be positive or negative depending on whether the value lies above or below the mean.

The document sentiment orientation is obtained by adding all the Normalized SentiMI Term Scores. A positive label is assigned to the document if the overall score is greater than zero. Similarly, a negative label is assigned for a final document score of less than zero.
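The normalization and document-level summation can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the text: the z-score is computed over all term scores in the dictionary, the population standard deviation is used, dictionary keys take a `term#pos` form, and unknown terms contribute zero after normalization. The raw scores are hypothetical.

```python
from statistics import mean, pstdev


def zscore_normalize(term_scores):
    """Map {term: raw score} to {term: (score - mean) / std}."""
    values = list(term_scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {term: 0.0 for term in term_scores}
    return {term: (s - mu) / sigma for term, s in term_scores.items()}


def classify_document(tokens, normalized):
    """Sum normalized term scores; unknown terms contribute zero."""
    total = sum(normalized.get(tok, 0.0) for tok in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return None  # a total of exactly zero is left unlabeled here


# Hypothetical raw term scores
raw = {"good#a": 2.0, "bad#a": -2.0, "watch#v": 0.0}
norm = zscore_normalize(raw)
print(classify_document(["good#a", "watch#v", "unknown#a"], norm))  # positive
print(classify_document(["bad#a"], norm))                           # negative
```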

4 Dataset description

We have used a publicly available movie reviews dataset 6 to find the value of the weight factor α. The dataset contains 50,000 movie reviews, which are divided into 25,000 positive and 25,000 negative movie reviews. The classifier is tested using an unseen dataset of online movie reviews 7 containing a total of 2000 movie reviews. The test dataset is divided into 10 datasets with 200 movie reviews each in order to avoid bias in the test set. Each dataset contains 100 positive and 100 negative movie reviews in order to maintain the class distribution. An overview of the datasets is given in Table 4. The sensitivity, specificity and f-measure values are calculated using the training set in order to determine appropriate values of α to be used in Equations (5) and (6). It is observed from the results that there is a tradeoff between sensitivity and specificity for the calculated values. High specificity results are obtained for α values ranging from 0.0 to 0.3 whereas the sensitivity results dominate for α ranging from 0.4 to 0.7. The accuracy level is low for α values in the range of 0.8 to 1.0, where sensitivity and specificity agree. The best result for sensitivity is achieved when α is 0.5 and the best result for specificity when α is 0.1.

5 Results and discussion

The proposed framework is evaluated against the test set in order to determine its efficiency and performance against other techniques. Accuracy, sensitivity, specificity and f-measure are used to evaluate the performance of the proposed technique. Accuracy is defined as the proportion of correctly classified instances among the total number of instances. The following formula is used to calculate the accuracy of the proposed framework: $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ where TP, TN, FP, FN are true positives, true negatives, false positives, and false negatives respectively.

Sensitivity is defined as the true positive rate and measures the proportion of actual positives that are correctly identified, whereas specificity is the true negative rate and measures the proportion of actual negatives that are correctly identified. Mathematically: $Sensitivity = \frac{TP}{TP + FN}$ $Specificity = \frac{TN}{TN + FP}$

F-Measure is computed as the harmonic mean of sensitivity and specificity. $F\text{-}Measure = \frac{2 \times Sensitivity \times Specificity}{Sensitivity + Specificity}$
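The four measures can be computed from a confusion matrix as in the following Python sketch. The function name `evaluation_metrics` and the example counts are hypothetical. The F-Measure used here combines sensitivity and specificity, which differs from the more common F1 score built from precision and recall.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and F-Measure as defined above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": 2 * sensitivity * specificity
                     / (sensitivity + specificity),
    }


# Hypothetical counts for one partition of 100 positive and 100 negative reviews
m = evaluation_metrics(tp=87, tn=92, fp=8, fn=13)
print({name: round(value, 3) for name, value in m.items()})
# {'accuracy': 0.895, 'sensitivity': 0.87, 'specificity': 0.92, 'f_measure': 0.894}
```

On a balanced partition such as this one, accuracy equals the arithmetic mean of sensitivity and specificity, while the F-Measure is their harmonic mean and is therefore never larger.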

The training dataset is used to determine the value of α that is to be used in Equations (5) and (6). The results of sensitivity, specificity and f-measure are shown in Fig. 2 for different values of α. The analysis of results indicates that there is a tradeoff between sensitivity and the α value for each dataset. It is also observed that there is a tradeoff between the sensitivity and specificity of the proposed algorithm. Specificity results are high for α values ranging from 0.0 to 0.3 whereas the sensitivity results dominate for α ranging from 0.4 to 0.7. Sensitivity and specificity agree for α values in the range of 0.8 to 1.0; however, the overall accuracy level is low. The best result for sensitivity is achieved when α is 0.5 and the best result for specificity when α is 0.1.

The observations about the values of α in the training dataset also hold true for the test dataset. It is seen from the results that as long as the value of α remains low, the sensitivity value is high, around 90%, whereas when the value of α keeps increasing, the sensitivity value goes down. Alternatively, when the value of α is low, the value of specificity is low. The specificity of the proposed framework increases with the increase of α. Therefore, it can be concluded from the analysis that specificity is directly proportional to α whereas sensitivity is inversely proportional to α. The f-measure is not affected by the value of α; it remains around the center for each dataset at every level of α. The proposed nSentiMI based classifier is evaluated using α = 0.5 to detect positive instances and α = 0.1 to detect negative instances. Sensitivity, specificity and f-measure are calculated for each of the test datasets.

The sensitivity, specificity and f-measure results of the proposed algorithm are calculated for the test datasets. The 10-fold cross-validated average accuracy, sensitivity, specificity and f-measure are 89.1%, 86.5%, 91.7% and 89% respectively.

We applied proposed sentiment classification framework on multiple combinations of POS tags from sentiment dictionary. First of all, only adjectives are used to compute sentiment results. A combination of adjectives and nouns is then used as features and results are obtained. Adjectives are then used with verbs to compute the results. In a similar manner results are computed for a combination of adjectives and adverbs as features. Finally, we have used combination of adjectives, verbs, and adverbs as features and results are computed. Accuracy, sensitivity, specificity and f-measure are calculated for all these combinations. The comparison of results indicates that the combination of adjectives, verbs and adverbs achieved highest accuracy. The comparison of accuracy, sensitivity, specificity and f-measure for multiple combinations is shown in Table 5.

Lin et al. [25] used different values of α to classify a document into positive and negative sentiment, and showed that the performance results can be raised considerably in this way. Based on the training observations, we have also used different levels of α for classification of the dataset. A positive document is detected using α = 0.5 and a negative document is detected using α = 0.1. Figure 3 shows the evaluation of each instance for different levels of α. If the instance is classified as positive with α = 0.5, then a positive label is assigned; otherwise it is evaluated with α = 0.1. If this results in a negative class, the test instance is labeled as negative. If the test instance is still not labeled, it is assigned the positive class. The classification process ends here.

Since no specific polarity score list is available, we have used SentiWordNet to obtain the polarity of individual words. SentiWordNet is a lexical resource that is used for opinion mining and sentiment analysis [29]. Each synset in SentiWordNet is assigned three sentiment scores, namely positivity, negativity and objectivity. SentiWordNet 3.0 is an enhanced version of SentiWordNet 1.0 and is publicly available for research purposes.

We have compared the proposed framework (nSentiMI) with the performance of SWN, IG and Chi-Sq based classifiers. SentiWordNet based classification is performed with the help of the classifier provided on the SentiWordNet website 8 . IG and Chi-Sq based classifiers are developed by applying the proposed framework with Information Gain and Chi-Square statistics, respectively, in the same manner as Point-wise Mutual Information is applied for the nSentiMI based classifier.
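The two-stage decision described in this paragraph can be sketched in Python. The argument `label_at_alpha` is a hypothetical callable that scores the document with a given α and returns "positive", "negative" or None; how that label is produced is outside the scope of this sketch.

```python
def classify_two_stage(label_at_alpha):
    """Two-stage labeling: try alpha = 0.5 for positives, then 0.1 for negatives."""
    if label_at_alpha(0.5) == "positive":
        return "positive"
    if label_at_alpha(0.1) == "negative":
        return "negative"
    return "positive"  # instances still unlabeled default to positive


# Stub outcomes for one hypothetical review at the two alpha levels
outcomes = {0.5: "negative", 0.1: "negative"}
print(classify_two_stage(outcomes.get))  # negative
```

Passing the scorer as a callable keeps the decision logic independent of how the document score is calculated.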

The importance of an attribute (X) with respect to the class attribute (Y) can also be determined using Information Gain (IG). Formally, the information gain of an attribute/feature X with respect to a class attribute Y is the reduction in uncertainty about the value of Y when we know the value of X [15]. Information Gain is calculated from the feature’s contribution to decreasing the overall entropy. The following formula is used to determine the information gain for a given feature: $InfoGain(Y; X) = Entropy(Y) - Entropy(Y | X)$ where X and Y are discrete variables taking values in {x₁, x₂, x₃, …, x_m} and {y₁, y₂, y₃, …, y_m} respectively. $Entropy(Y) = -\sum_{i = 1}^{m} p_{i} \log_{2}(p_{i})$ where m represents the number of classes, such as m = 2 for binary classification, and p_i denotes the probability that a random instance in partition D belongs to class C_i. If we partition (classify) the instances in D on some feature attribute A {a₁, …, a_v}, D will split into v partitions {D₁, …, D_v}. $Entropy(Y | X) = \sum_{j = 1}^{v} \frac{|D_{j}|}{|D|} Info(D_{j})$ where |D_j|/|D| is the weight of the jth partition and Info(D_j) is the entropy of partition D_j. Features with high Information Gain reduce the uncertainty about the class the most. We select the features ranked highest by information gain score. Classifying the instances using those ranked features decreases the overall entropy.
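The entropy and information gain formulas can be illustrated with a minimal Python sketch. The function names and the four-document toy corpus are hypothetical; the feature is encoded as 1 when a word is present in a document and 0 otherwise. Empty input is not handled.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(Y) = -sum_i p_i * log2(p_i) over the class distribution."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """InfoGain(Y; X) = Entropy(Y) - sum_j |D_j|/|D| * Info(D_j)."""
    partitions = {}
    for x, y in zip(feature_values, labels):
        partitions.setdefault(x, []).append(y)
    total = len(labels)
    conditional = sum(len(part) / total * entropy(part)
                      for part in partitions.values())
    return entropy(labels) - conditional


labels = ["pos", "pos", "neg", "neg"]
print(information_gain([1, 1, 0, 0], labels))  # 1.0 (feature separates classes)
print(information_gain([1, 0, 1, 0], labels))  # 0.0 (feature is uninformative)
```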

The chi-squared statistic (Chi-Sq) is used to determine the association between a word feature and its associated class. Chi-Sq, a common statistical test, measures the divergence from the distribution expected (i.e. the resultant partition) under the assumption that the feature occurrence is perfectly independent of the class value [30]. The following formula is used to determine Chi-Sq [31]: $Chi\text{-}Sq(t, c_i) = \frac{N \times (AD - BE)^{2}}{(A + E) \times (B + D) \times (A + B) \times (E + D)}$ where A is the number of times t and $c_i$ co-occur; B is the number of times t occurs without $c_i$; E is the number of times $c_i$ occurs without t; D is the number of times neither $c_i$ nor t occurs; and N is the total number of documents in the corpus. The Chi-Sq statistic is zero if t and $c_i$ are independent.
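As a minimal sketch of the Chi-Sq formula, the function below takes the four counts A, B, E and D and derives N as their sum. The function name `chi_square`, the handling of an empty margin and the example counts are our assumptions for illustration.

```python
def chi_square(a, b, e, d):
    """Chi-Sq(t, c_i) from the 2x2 counts A, B, E, D defined in the text.

    a: t and c_i co-occur        b: t occurs without c_i
    e: c_i occurs without t      d: neither t nor c_i occurs
    """
    n = a + b + e + d  # N, the total number of documents
    denominator = (a + e) * (b + d) * (a + b) * (e + d)
    if denominator == 0:
        # Assumption: a table with an empty margin carries no evidence
        # of association, so we return 0 instead of dividing by zero.
        return 0.0
    return n * (a * d - b * e) ** 2 / denominator


# Hypothetical counts over 40 and 100 documents respectively.
score_independent = chi_square(10, 10, 10, 10)  # t is spread evenly
score_associated = chi_square(40, 10, 10, 40)   # t mostly occurs with c_i
```

With evenly spread counts the term AD - BE vanishes and the score is zero, matching the independence property stated above; the skewed counts give a score of 36.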

Table 6 compares the 10-fold cross-validated average accuracy, sensitivity, specificity and f-measure of the proposed framework with those of the other techniques. The comparison shows that the proposed nSentiMI achieves higher accuracy than the other techniques for all partitions of the dataset. The proposed nSentiMI achieves an accuracy improvement of 23.4%, 26% and 30.45% over the SWN, IG and Chi-Sq techniques, respectively.

The results indicate that the Chi-Sq technique achieved a highest sensitivity of 85%, whereas the proposed nSentiMI shows a consistent sensitivity level. nSentiMI achieved an average sensitivity of 86.5%, SWN obtained 73.6%, and IG and Chi-Sq achieved 74.90% and 77.3%, respectively. The proposed nSentiMI therefore has an average sensitivity improvement of 12.9%, 11.6% and 9.2% over SWN, IG and Chi-Sq, respectively.

The specificity comparison shows that the proposed nSentiMI achieves higher specificity than the other techniques for all the datasets. nSentiMI achieved an average specificity of 91.7%, SWN obtained 57.80%, and IG and Chi-Sq obtained 51.40% and 40.3%, respectively. The proposed nSentiMI therefore has an average specificity improvement of 33.9%, 40.3% and 51.4% over SWN, IG and Chi-Sq, respectively.

The comparison of results indicates that the proposed nSentiMI achieves the highest f-measure values for all the datasets. It achieved an average f-measure of 89%, whereas SWN achieved 64.6%, and IG and Chi-Sq achieved 60.7% and 52.7%, respectively. nSentiMI therefore has an f-measure improvement of 24.4%, 28.3% and 36.2% over SWN, IG and Chi-Sq, respectively.
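For reference, the four measures compared above can be computed from the counts of a binary confusion matrix as sketched below. We assume the standard definitions, with positive reviews as the positive class and the f-measure taken as the harmonic mean of precision and sensitivity; the function name and the counts are hypothetical and are not the experimental values of Table 6.

```python
def classification_measures(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and f-measure from binary counts.

    tp/fn: positive reviews classified correctly/incorrectly
    tn/fp: negative reviews classified correctly/incorrectly
    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical fold with 100 positive and 100 negative reviews.
measures = classification_measures(tp=90, fn=10, tn=80, fp=20)
```

Reporting sensitivity and specificity separately, as done above, shows whether an accuracy gain comes from the positive class, the negative class, or both.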

The reason for the improvement is that normalized SentiMI uses optimum alpha values identified from the training dataset, which help to raise the accuracy on the test dataset. Moreover, the proposed nSentiMI uses a combination of adjectives, verbs and adverbs as its feature set, since this combination achieved the highest accuracy among the combinations compared. Values are normalized so that the coefficients associated with each variable scale appropriately to adjust for the disparity in variable sizes. For example, if one value is on average 100 times larger than another, the proposed nSentiMI framework behaves better when the two variables are normalized/standardized to be approximately equivalent in scale.
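The effect of the scale disparity can be illustrated with a small sketch. It uses z-score standardization purely as an assumed stand-in, since this section does not restate the exact normalization applied in nSentiMI; the function name `standardize` and the values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant variable carries no spread, map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Two hypothetical variables, the second 100 times larger on average.
small_scores = [0.1, 0.2, 0.3, 0.4]
large_scores = [10.0, 20.0, 30.0, 40.0]

z_small = standardize(small_scores)
z_large = standardize(large_scores)
```

After rescaling, the two variables take approximately the same values, so neither dominates a weighted combination merely because of its original magnitude.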

The proposed technique is also compared with state-of-the-art techniques in order to show its significance. Table 7 compares the accuracy of the proposed nSentiMI with that of other techniques on the Cornell Movie Review (MR) dataset. The reported accuracies are the best that were achieved during experimentation. Each technique uses either supervised or unsupervised data mining algorithms for sentiment classification of the movie review dataset. The comparison shows that the proposed technique attains the highest accuracy for sentiment classification and polarity detection. The average accuracy achieved by the proposed nSentiMI is 89.10%, which is much higher than the accuracy of the state-of-the-art techniques.

6 Conclusions and future work

The main challenge in the field of sentiment analysis is the availability of a tagged corpus, which is the basic requirement of any supervised or semi-supervised algorithm. This research uses SentiWordNet as the training corpus to generate a new lexical resource, nSentiMI, which is built by calculating Point-wise Mutual Information over SentiWordNet. We have proposed a complete framework, based on nSentiMI, which alleviates the need for a tagged corpus for every domain. To the best of our knowledge, SentiWordNet has not been explored before in this regard. The proposed research focuses on providing a sentiment analysis framework that takes the word knowledge available in social media text and transforms it into a more informative form. Moreover, we have proposed an algorithm to find an appropriate value of the weight factor α, which is used to enhance the performance of nSentiMI, by analyzing a publicly available dataset of 50,000 movie reviews. The proposed framework is tested on another publicly available movie review dataset consisting of 2,000 reviews. The proposed framework outperforms the other techniques, as shown in the results and discussion. The proposed nSentiMI dictionary is used for sentiment classification in conjunction with SentiWordNet 3.0. The synset scores and part-of-speech tagging information are obtained using the nSentiMI dictionary, which helps in classifying a term as positive or negative. POS tagging is performed using the Stanford POS tagger. The main focus of the proposed research is on improving sentiment classification accuracy using the proposed nSentiMI. The desired results are achieved by increasing the specificity level while keeping the sensitivity level high. The proposed nSentiMI achieved a best accuracy of 93%, and its highest accuracy improvement over the other techniques is 30.5%. A comparison with other state-of-the-art techniques on the same dataset is also presented, which verifies the significance of our proposed approach.

In future, we plan to analyze nSentiMI with supervised learning approaches such as Support Vector Machines, Naïve Bayes and Neural Networks. We will attempt to employ other techniques for subjectivity scoring and to use other social media datasets. We will also address the issue of negation of terms. More pre-processing steps can be incorporated in order to further enhance the classification accuracy. Moreover, scoring functions that incorporate the frequency of use of a synset can also be used.

Footnotes

1

2

3

4

5

6

7

http://www.cs.cornell.edu/People/pabo/movie-review-data/

8

References

1. Ohana, Tierney (2009). Sentiment classification of reviews using SentiWordNet. 9th IT&T Conference, Dublin Institute of Technology, Dublin, Ireland.

2. Kennedy, Inkpen (2006). Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence 22, 110-125.

3. Pang, Lee (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1-135.

4. Esuli, Sebastiani (2006). Sentiwordnet: A publicly available lexical resource for opinion mining. 5th Conference on Language Resources and Evaluation.

5. Paltoglou (2014). Sentiment analysis on social media. Online collaborative action. Springer.

6. Zhang, Law (2009). Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Systems with Applications 36(3, Part 2), 6527-6535. Elsevier.

7. Han, Kamber (2006). Data Mining: Concepts and Techniques. Second edition. Morgan Kaufmann.

8. Yang, Pedersen (1997). A comparative study on feature selection in text categorization. 14th ICML.

9. Kazemzadeh, Lee, Narayanan (2013). Fuzzy logic models for the meaning of emotion words. IEEE Computational Intelligence Magazine 8(2), 34-49.

10. Wilson, Wiebe, Hoffmann (2009). Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics 35(3), 399-403.

11. Zhang, Zeng, Wang, Zuo (2009). Sentiment analysis of Chinese document: From sentence to document level. Journal of the Association for Information Science and Technology 60(12), 2474-2487.

12. Turney (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. 40th Annual Meeting of ACL, 417-424. Philadelphia.

13. Read (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment classification. ACL Student Research Workshop, 43-48.

14. Keefe, Koprinska (2009). Feature Selection and Weighting Methods in Sentiment Analysis. 14th Australian Document Computing Symposium.

15. Verma, Bhattacharyya (2008). Incorporating Semantic Knowledge for Sentiment Analysis. 6th International Conference on Natural Language Processing, India.

16. Reyes, Rosso (2014). On the difficulty of automatically detecting irony: Beyond a simple case of negation. Knowledge and Information Systems 40(3), 595-614.

17. Barnden, Reyes, Shutova, Rosso, Veale (2015). Sentiment analysis of figurative language in Twitter. SemEval-task.

18. Blitzer, Dredze, Pereira (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. 45th ACL, 187-205.

19. Dredze, Crammer, Pereira (2008). Confidence-weighted linear classification. 25th International Conference on Machine Learning, ACM, 264-271.

20. Pang, Lee, Vaithyanathan (2002). Thumbs up?: Sentiment classification using machine learning techniques. ACL-02 Conference on Empirical Methods in Natural Language Processing 10, 79-86.

21. Abbasi, France, Zhang, Chen (2011). Selecting attributes for sentiment classification using feature relation networks. IEEE Transactions on Knowledge and Data Engineering 23(3), 447-462.

22. Wiegand, Balahur (2010). A Survey on the Role of Negation in Sentiment Analysis. Workshop on Negation and Speculation in Natural Language Processing.

23. Jivani (2011). A comparative study of Stemming algorithms. International Journal of Computer Technology and Applications 2(6), 1930-1938.

24. Jain, Pandey (2013). Analysis and implementation of sentiment classification using lexical POS markers. International Journal of Computing, Comm and Networking 2(1), 36-40.

25. Lin, Zhang, Wang, Zhou (2012). An information theoretic approach to sentiment polarity classification. 2nd Joint WICOW/AIRWeb Workshop on Web Quality, 5-40. ACM.

26. Saggion, Funk (2010). Interpreting SentiWordNet for opinion classification. 7th LREC, 1129-1133.

27. Larsen, Marx (2000). An Introduction to Mathematical Statistics and Its Applications. Third Edition. ISBN 0-13-922303-7, 282.

28. Carroll (2002). Statistics Made Simple for School Leaders. Rowman & Littlefield. Retrieved 7 June 2009.

29. Yang, Bhattacharya, Srinivasan (2012). Lexical and Machine Learning Approaches Toward Online Reputation Management. CLEF (Online Working Notes/Labs/Workshop).

30. Tan, Zhang (2008). An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications 34(4), 2622-2629.

31. Sharma, Dey (2012). Performance Investigation of Feature Selection Methods and Sentiment Lexicons for Sentiment Analysis. Special Issue of International Journal of Computer Applications (0975-8887), ACCTHPCA, 15-20.

32. Socher, Pennington, Huang, Manning (2011). Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions. EMNLP'11, 151-161.

33. Zhou (2011). Self-Training from Labeled Features for Sentiment Analysis. Information Processing and Management.

34. Verma, Bhattacharyya (2008). Incorporating Semantic Knowledge for Sentiment Analysis. Proceedings of ICON-2008: 6th International Conference on Natural Language Processing, India.