Abstract
Anomaly detection in sentiment mining refers to detecting users’ abnormal sentiment patterns in a large collection of sentiment data. The detected anomalies may be due to rapid sentiment changes that are hidden in a huge amount of text. Anomalies in sentiment data sources are a major factor affecting the performance of sentiment classification methods. Thus, analyzing sentiment data to identify abnormal sentiment patterns in a timely manner is a valuable topic of research. This work analyzes how anomaly detection and elimination can aid sentiment classification and hence enhance sentiment mining. The paper proposes a model that combines the proposed anomaly detection method with a meta-classification method to detect and eliminate anomalies and to classify users’ sentiments. It also identifies the optimum percentage of data to be eliminated as anomalies after detection, so that sentiment classification can be performed effectively on movie review data. The results demonstrate the capabilities of the proposed method and offer better insight into this area of research.
Introduction
Anomaly detection in sentiment classification is the task of finding unusual sentiment patterns in a sentiment data set. An anomaly is a data sample that deviates so much from other samples as to arouse the suspicion that it was generated by a different mechanism [1, 29]. Users’ sentiment content obtained from shared media data contains valuable information for decision making. Online reviews on social media platforms play an influential role in new users’ decisions [30, 33]. This influence encourages activities such as the spreading of fake reviews, and detecting these fake reviews is important for improving user experience. Fake reviews may give rise to anomalies through sudden sentiment changes, and the consequences are severe if such anomalies are ignored. Many applications today need methodical data analysis to eliminate anomalies and ensure system consistency [2, 29]. There are three basic approaches to anomaly detection: model-based, proximity-based, and density-based. Eliminating anomalies from a dataset is an essential task from which most machine learning algorithms can benefit. A dataset with fewer anomalies allows more accurate modeling. Anomaly detection is therefore an important technique for data cleaning.
Eliminating anomalies from an input dataset is a difficult task, because anomalies that remain degrade the final model. Different approaches to anomaly detection have already been employed in various data mining tasks such as classification and clustering [5, 28]. In recent years, the concept of anomaly ensembles has also been applied to anomaly detection in order to build reliable and diverse classification models [16, 31]. These issues have been addressed for many real-world datasets, but far less in the sentiment classification domain [5, 23–25]. This motivates a detailed analysis that brings these techniques together on a sentiment dataset.
This paper is organized as follows: Section 2 presents a review of earlier work. The methodology employed for sentiment classification research is presented in Section 3. Section 4 details the various analyses carried out for anomaly elimination. The results and discussion are provided in Section 5. Section 6 concludes the paper.
Review of literature
The methods used for anomaly detection are mostly unsupervised and non-parametric. A wide variety of techniques for anomaly detection [3] have been proposed, including classification, statistical, information-theoretic, clustering, and spectral approaches [1, 22]. A common family of techniques for unsupervised anomaly detection is based on nearest neighbors. This approach rests on the assumption that an outlying data instance lies at a comparatively larger distance from its nearest neighbors than a normal instance does [2, 11]. Another popular approach is the Local Outlier Factor (LOF) algorithm, which assigns anomaly scores depending on both the reachability of a data point and the relative density of its neighborhood [2, 29]. Neural network techniques have the benefit of producing output that is straightforward to interpret. They can handle diverse types of features and are also able to handle noise in the data [10, 28]. The anomaly detection techniques discussed in the literature are mainly dictionary, text, neural network, time series, statistic, rule, and rank based [5, 14]. Similar to an ensemble of classifiers, an ensemble anomaly detection method is a combination of accurate and preferably diverse anomaly detection methods [8, 32].
To address the limitations and challenges noted in the literature above, this work focuses on improving present techniques for anomaly detection in sentiment analysis of movie review data [30, 33]. This paper proposes a subspace-based anomaly detection method that can efficiently characterize the local distribution of objects and detect anomalies hidden in subspaces of the data. Experiments on movie review data show that the method outperforms the competing Local Outlier Factor (LOF) method by detecting anomalies in subspaces. The main idea behind this work is to find anomalies in lower-dimensional subspaces of the dataset rather than in the full-dimensional dataset.
Compared with existing anomaly detection research, the contributions of this work are as follows: (1) A meta subspace-based framework is proposed to detect anomalies; it can efficiently describe the local distribution of objects and detect anomalies hidden in subspaces of the data. (2) A methodology is employed for the empirical evaluation of the effects of different levels of outlier elimination on the complexity of the sentiment dataset. (3) The proposed meta subspace anomaly detection method is compared with the LOF method for anomaly detection on a challenging movie review sentiment dataset.
Methodology
In this work, an anomaly detection method based on subspaces is proposed (Fig. 1). The performance of the proposed algorithm is validated on the benchmark movie review dataset. First, an anomaly score is computed for each sample in the review dataset using the proposed anomaly detection method. The samples are then sorted by anomaly score. The literature reviewed offers no established standard for choosing the number of samples to be eliminated as anomalies after anomaly detection. To find the optimum number of samples to eliminate, an analysis is carried out by varying the percentage of samples removed as anomalies (2%, 4%, 6%, 8%, and 10%).

Fig. 1. Methodology design.
For each level of elimination, a meta-classifier is employed to assess the performance of anomaly elimination. Based on the performance measure, the optimum number of samples to be eliminated as anomalies for the proposed anomaly detection method is identified.
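The elimination step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `eliminate_top_anomalies`, the rounding rule for converting a percentage into a sample count, and the toy scores are all assumptions made for the example. For the 2000-review dataset used later, removing the top 6% would correspond to 120 reviews.

```python
def eliminate_top_anomalies(samples, scores, percent):
    """Drop the `percent`% of samples with the highest anomaly scores.

    Returns the kept samples (original order) and the removed indices.
    The rounding rule is an assumption; the paper does not specify one.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    n_remove = int(round(len(samples) * percent / 100.0))
    # Rank indices from most to least anomalous.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    removed = set(ranked[:n_remove])
    kept = [s for i, s in enumerate(samples) if i not in removed]
    return kept, sorted(removed)


# Toy usage with made-up scores: one reduced dataset per elimination level.
samples = [f"review_{i}" for i in range(50)]
scores = [1.0 + 0.01 * i for i in range(50)]
for percent in (2, 4, 6, 8, 10):
    kept, removed = eliminate_top_anomalies(samples, scores, percent)
    print(percent, len(kept), removed)
```

Each reduced dataset would then be passed to the classifiers, and the elimination level with the best performance measure taken as the optimum.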
In this research, a subspace-based anomaly detection method is proposed. The proposed method employs the LOF outlier detection method as its base method, and the results obtained for the proposed method are compared with those of LOF used on its own.
Local outlier factor (LOF)
The LOF algorithm is an unsupervised anomaly detection method that computes the local density deviation of a given data instance with respect to its neighbors. It treats as anomalies the samples that have a substantially lower density than their neighbors. The number of neighbors k is normally chosen to be greater than the minimum number of objects a group has to contain, so that other objects can be local anomalies relative to this group, and smaller than the maximum number of nearby objects that can potentially be local anomalies [5, 25]. The steps involved in LOF outlier detection are as follows:
1. Calculate the distance between every pair of data points.
2. For every data point o, find the distance between o and its k-th nearest neighbor.
3. For every data point o, determine the k-distance neighborhood N_k(o).
4. Calculate the local reachability density of every data point o.
5. Calculate LOF_k(o) for every data point o.
6. Sort the data points by LOF_k(o).
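For reference, the quantities computed in the steps above are usually defined as shown below. These are the standard definitions from the LOF literature rather than formulas quoted from this paper; the symbols d(o, p) for the distance between two points and d_k(o) for the k-distance are notation assumed here.

```latex
\begin{align*}
d_k(o) &= \text{distance from } o \text{ to its } k\text{-th nearest neighbor},\\
N_k(o) &= \{\, p \in D \setminus \{o\} : d(o,p) \le d_k(o) \,\},\\
\text{reach-dist}_k(o,p) &= \max\{\, d_k(p),\; d(o,p) \,\},\\
\text{lrd}_k(o) &= \left( \frac{1}{|N_k(o)|} \sum_{p \in N_k(o)} \text{reach-dist}_k(o,p) \right)^{-1},\\
\text{LOF}_k(o) &= \frac{1}{|N_k(o)|} \sum_{p \in N_k(o)} \frac{\text{lrd}_k(p)}{\text{lrd}_k(o)}.
\end{align*}
```

A score close to 1 indicates that a point is about as dense as its neighbors, while a score well above 1 indicates a point in a sparser region than its neighbors.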
Proposed anomaly detection method
The steps involved in the proposed anomaly detection method are shown in Fig. 2. Initially, a group of features of random size k is chosen from dataset D. These random feature subsets yield a lower-dimensional representation D_t of the dataset D. D_t is obtained by bootstrap sampling without replacement, which produces a subsample S(t) of size k, so the resulting data consists of subsamples of the actual dataset D. The proposed method uses S(t) to estimate the density for each instance. Because the density is estimated on a different group of neighbors (k value) each time, the outcome is more accurate. LOF is strongly influenced by the relative density of an instance’s neighborhood. The relative density computation is therefore performed iteratively on random sets of instances, which can provide diverse and possibly complementary results. Finally, the sets of anomaly scores from all iterations are combined to generate the final set of anomaly scores, giving a single score for each instance. The final score is thus the collective result of the various iterations of LOF on multiple subspaces of the actual dataset D.

Fig. 2. Proposed anomaly detection method.
The pseudo code of the proposed method is given in Fig. 3. In Fig. 3, the local reachability distance (LRD) is the estimated distance at which a point can be found by its neighbors: if a neighbor were to reach out a distance of LRD in any direction, it would most likely find that point. reachDistance is the maximum of the distance to the k-th nearest neighbor of the point and the Manhattan distance between the point and its neighbor.

Fig. 3. Pseudo code of the proposed anomaly detection method.
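Since the pseudo code itself is a figure, the following is a hedged sketch of the general idea only: run LOF with Manhattan distance on several random feature subspaces and combine the per-instance scores. It is not the authors' algorithm. The function names, the subspace size range (between half and all but one of the features), the averaging of scores, and the small constant guarding against zero reachability are assumptions, and the instance subsampling S(t) described in the text is omitted for brevity.

```python
import random


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def lof_scores(points, k):
    """Standard LOF scores computed with Manhattan distance."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[manhattan(p, q) for q in points] for p in points]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # Points tied at the k-distance are included in the neighborhood.
        neigh.append([j for d, j in others if d <= kd])
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neigh[i]]
        # 1e-10 avoids division by zero for duplicated points (assumption).
        lrd.append(1.0 / (sum(reach) / len(reach) + 1e-10))
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]


def subspace_lof(points, k, n_rounds=10, seed=0):
    """Average LOF scores over random feature subspaces (hypothetical helper)."""
    rng = random.Random(seed)
    n_features = len(points[0])
    low = max(1, n_features // 2)
    high = max(low, n_features - 1)
    totals = [0.0] * len(points)
    for _ in range(n_rounds):
        size = rng.randint(low, high)
        features = rng.sample(range(n_features), size)
        projected = [[p[f] for f in features] for p in points]
        for i, score in enumerate(lof_scores(projected, k)):
            totals[i] += score
    return [t / n_rounds for t in totals]


# Toy data: six clustered points and one obvious outlier (last row).
data = [
    [0.0, 0.1, 0.2, 0.3], [0.1, 0.2, 0.3, 0.0], [0.2, 0.3, 0.0, 0.1],
    [0.3, 0.0, 0.1, 0.2], [0.15, 0.25, 0.05, 0.35], [0.05, 0.15, 0.35, 0.25],
    [5.0, 5.0, 5.0, 5.0],
]
scores = subspace_lof(data, k=3)
print(scores.index(max(scores)))  # index of the most anomalous point
```

Averaging is only one way to combine the per-subspace scores; taking the maximum or a rank-based combination would be equally plausible readings of the description.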
The main focus of this work is to perform anomaly detection and elimination and thereby enhance sentiment classification. For the classification task after anomaly elimination, naive Bayes, support vector machine, decision tree, and meta-classifiers are employed [16]. The main assumption behind the naive Bayes model is that each feature is conditionally independent of all other features given the class. This assumption allows the likelihood to be written as a simple product and helps the naive Bayes model generalize well in practice. A decision tree builds a classification model in the form of a tree structure. It is commonly used to build a tree-based model that predicts the class or value of a target variable by learning decision rules inferred from the training data.
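The product form implied by the conditional independence assumption can be written out as follows, where c denotes a class (positive or negative sentiment) and x_1, ..., x_n the features of a review; these symbols are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
P(c \mid x_1, \dots, x_n) &\propto P(c) \prod_{i=1}^{n} P(x_i \mid c),\\
\hat{c} &= \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c).
\end{align*}
```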
A support vector machine (SVM) is employed to obtain the best separating boundary. It searches for the points closest to the boundary, which are called support vectors [17]. SVM finds the hyperplane in the attribute space that best separates two classes, with the largest margin between the classes. Meta-learning helps to improve the performance of machine learning classifiers by combining several individual models, which allows better classification performance than an individual model. Meta-learning methods combine several classifiers into one meta-model in order to decrease variance. Bagging is a meta-learning method that creates separate subsets of the training dataset and builds a classifier for each subset; the results of these multiple classifiers are then combined. Bagged SVM is the meta-classifier used in this research for sentiment classification. The default parameters available in the Weka tool are used for all classifiers.
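The bagging idea can be illustrated with a small self-contained sketch. To keep the example free of external libraries, a nearest-centroid classifier stands in for the SVM base learner, so this shows only the bagging mechanism and not the Bagged SVM configuration used in the study. All function names and the choice of 11 base models are assumptions made for the example.

```python
import random
from collections import Counter


def train_centroid(X, y):
    """Toy base learner: one mean vector per class (stand-in for an SVM)."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for d, value in enumerate(row):
            acc[d] += value
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict_centroid(model, row):
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda c: sq_dist(model[c]))


def train_bagging(X, y, n_models=11, seed=0):
    """Train one base learner per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_centroid([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, row):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_centroid(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Toy usage with two well-separated classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = ["neg"] * 4 + ["pos"] * 4
models = train_bagging(X, y)
print(predict_bagging(models, [0.5, 0.5]), predict_bagging(models, [5.5, 5.5]))
```

Because each base learner sees a different resample, their individual errors tend to differ, and voting reduces the variance of the combined prediction.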
Anomaly elimination
In this work, two different techniques are used for detecting anomalies, and their performance is evaluated. LOF and the proposed subspace-based method are applied to the movie review data set. Both approaches assign each object a score that indicates the degree to which it is an anomaly. A single benchmark dataset was used throughout the study: the movie review dataset by Pang and Lee [16], one of the first widely available sentiment analysis datasets. It contains 1000 positive and 1000 negative movie reviews from IMDB. It has two categories, positive and negative sentiment, and no neutral reviews.
Anomaly detection is performed on subsets of the movie review dataset, which increases the probability of finding anomalies that are hidden or would otherwise be missed. The proposed approach computes anomaly scores in lower-dimensional subspaces of each level of noise-inserted data. The term subspaces, as used in this work, refers to subsets of the original data variables. After outlier detection with the proposed method, each sample of the original movie review dataset has an outlier score, and five new datasets were created by eliminating the top k anomalies (k = 2%, 4%, 6%, 8%, and 10% of the original dataset). Thus, five anomaly-eliminated versions of the movie review dataset were generated to investigate the influence of varying levels of outliers. The classification methods are applied to each outlier-eliminated dataset. The proposed technique has several advantages over existing anomaly detection methods. It is less susceptible to masking effects from irrelevant features. It also reduces the difficulty of detecting anomalies in a high-dimensional feature space by detecting them in multiple lower-dimensional subspaces.
LOF is used as a baseline against which to compare the results of the proposed anomaly detection method. The parameter settings for the two techniques are as follows. The number of neighbors k is the same in LOF and in the proposed anomaly detection method. The results obtained with LOF can vary drastically depending on the choice of k; here k = 4. The other parameter, the size s of the data subsamples, was set to 10.
The results of the anomaly detection methods are presented as histograms. An outlier in a histogram is an observation that lies outside the overall pattern of the distribution. The blue histogram shows the outlier scores for positive class instances, and the red histogram shows the outlier scores for negative class instances. The histograms in Figs. 4 and 5 show the relative frequency of each anomaly (outlier) score. Because of a few distinctly large outlier values, the histograms are somewhat skewed to the right (positively skewed).

Fig. 4. Histogram of LOF method outlier scores.

Fig. 5. Histogram of proposed detection method outlier scores.
In Figs. 4 and 5, points at the far left or far right of the histogram are anomalies. In Fig. 4, which shows LOF anomaly detection for the movie review dataset, most instances fall in the anomaly score range of 0.94 to 1.09 for both positive and negative classes, with the peak at an anomaly score of 1. No gap is observable for outlier scores below 1.09, but beyond 1.09 the scores become increasingly spread out, with gaps between them. This means that, as the interpoint distance increases, more and more data points become neighbors until the outlier score reaches 1.09, after which the density starts decreasing. Normal instances therefore have outlier scores below 1.09, while outliers have scores well above 1.09. Both positive and negative class instances appear among the highest outlier scores, so instances of both classes are identified as outliers. The instances with the top k outlier scores can be selected as outliers from among these highest-scoring instances.
Figure 5 shows the histogram of outlier scores obtained with the proposed anomaly detection method on the movie review dataset. The anomaly scores are distributed between 0.01 and 0.15, and most instances have anomaly scores of around 0.03–0.09 for both positive and negative classes. Among the anomaly scores calculated, the top anomalies are to be identified from anomaly scores greater than 1.08. Once these larger outlier values are eliminated, the histogram is more or less symmetric. It can also be seen that fewer instances receive the highest anomaly scores with LOF (Fig. 4) than with the proposed outlier detection method (Fig. 5).
Machine learning offers various methods for evaluating the performance of learning algorithms. After applying LOF and the proposed anomaly detection method to the movie review dataset, sentiment classification is carried out after eliminating the instances with the top k outlier scores. The value of k cannot be fixed in advance, because no method has been established in the literature as the standard for choosing the number of anomalies to eliminate. To identify the optimum value of k for the movie review dataset, several k values are analyzed (k = 2%, 4%, 6%, 8%, and 10% of the original dataset). After removal of the instances with the top k% of anomaly scores, the number of instances in each dataset is reduced accordingly. To compare the performance of LOF and the proposed anomaly detection method, the four classifiers discussed in Section 3.2 are used: Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), and the meta-classifier. These classifiers are widely used in the sentiment classification literature. Several evaluation metrics are used to validate the performance of the classification methods with anomaly detection. Under 10-fold cross-validation, the metrics reported are accuracy, precision, recall, and F-score. Table 1 describes the k values used in the analysis. The precision of the classifiers for both outlier detection methods at the various top k% outlier elimination levels is shown in Table 2. For all classifiers, the precision on the movie review dataset before removing anomalies (k = 0%) is lower than the precision obtained after removal of anomalies. This indicates that removing anomalies from a dataset yields a considerable increase in classifier performance.
Regarding the effect of the percentage of anomalies removed, with both LOF and the proposed anomaly detection method all classifiers show improved results for top k% values of 2, 4, 6, and 8 compared with the original review dataset without anomaly removal (k = 0%). However, the performance of all classifiers drops suddenly at k = 10%.
Table 1. Description of k values
Table 2. Precision of classifiers (%)
Of the two anomaly detection algorithms, the proposed algorithm yields better precision than LOF for all classifiers. Among the classifiers, the meta-classifier achieves the highest precision with the proposed outlier detection method for all top k% values (k = 2, 4, 6, 8, 10) of the anomaly-removed dataset. Regarding the number of data samples removed as anomalies (k%), the precision of all classifiers for both anomaly detection methods increases gradually through top 2% and 4% anomaly removal and reaches a maximum at top 6%. For every combination of classifier and anomaly detection method, precision starts to degrade beyond 6% (k = 8% and 10%). Thus the optimum value of k, i.e., the percentage of samples to be removed as anomalies, is identified as the top 6% based on precision. The combination of the meta-classifier with the proposed anomaly detection algorithm gives a precision of 91.8% for the top 6% anomaly-removed dataset. The recall of the classifiers for both outlier detection methods at the various top k% outlier elimination levels is shown in Table 3. For all classifiers, recall before removing anomalies is lower than recall after removal of anomalies (k = 2, 4, 6, 8, 10) on the movie review dataset. With both LOF and the proposed anomaly detection method, all classifiers exhibit improved recall for k values of 2, 4, 6, and 8 compared with the original review dataset without anomaly removal (k = 0). It is also evident from the tabulated results that the recall of all classifiers degrades suddenly at k = 10%.
Of the two anomaly detection methods, the proposed method yields higher recall than LOF for all classifiers. Among the classifiers used in this work, the meta-classifier achieves the highest recall with the proposed outlier detection method for all k% values (k = 2, 4, 6, 8, 10) of the anomaly-removed dataset. Regarding the number of data samples removed as anomalies (k%), the recall of all classifiers for both anomaly detection methods increases gradually through 2% and 4% anomaly removal and reaches a maximum at top 6%. For every combination of classifier and anomaly detection method, recall starts to degrade beyond 6% (k = 8% and 10%). Thus the optimum value of k, i.e., the percentage of samples to be removed as anomalies, is identified as the top 6% based on recall for the movie review dataset. The combination of the meta-classifier with the proposed anomaly detection method gives a recall of 93.8% for the top 6% anomaly-removed dataset.
Table 3. Recall of classifiers
The F-score is the combined measure of precision and recall (Table 4), so the observations made for precision and recall hold here as well. Among the anomaly detection methods, the proposed method outperforms LOF for all classifiers. Among the classifiers, the meta-classifier performs better than the others for both anomaly detection methods. Among the top k% anomaly elimination values, top 6% elimination gives the highest F-score (93.50%), compared with k values of 2%, 4%, 8%, and 10%. Accuracy is measured for all classifiers (Table 5) with the proposed anomaly detection method at the optimum value k = 6% identified from the results in Tables 2 to 4. Tables 2–4 show that the proposed anomaly detection method performs better than LOF for all classification methods employed.
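As a worked check of how the F-score combines precision and recall, the short sketch below computes the harmonic mean for the meta-classifier figures reported in this paper (precision 93.2% and recall 93.8% with the proposed method; 92.3% and 93.5% with LOF). The helper names are chosen for this example only.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_score(93.2, 93.8), 1))  # proposed method, meta-classifier -> 93.5
print(round(f_score(92.3, 93.5), 1))  # LOF, meta-classifier -> 92.9
```

Both values agree with the F-scores reported for the two methods, which confirms that the reported precision, recall, and F-score figures are mutually consistent.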
Table 4. F-score of classifiers
Table 5. Accuracy of classifiers
Table 5 also shows that all classifiers employed achieve higher classification accuracy than the accuracy reported in the literature. This provides evidence of the effect of anomaly elimination in enhancing the performance of a sentiment classification model. In general, the results show that, in terms of accuracy, the meta-classifier performs best with the proposed anomaly detection method on the dataset with the optimum top 6% of anomalies eliminated.
Thus, identifying the outliers in sentiment data reduces the complexity of a model and makes it easier to interpret. It also improves the accuracy of the model if the right subset is chosen, and thereby reduces overfitting. It is evident from Table 5 that, with the proposed anomaly detection method, the meta-classifier gives a better precision of 93.2%, recall of 93.8%, and F-score of 93.5% than the other classification methods. With the existing LOF anomaly detection method, the meta-classifier gives a better precision of 92.3%, recall of 93.5%, and F-score of 92.9% than the other classification methods. For both anomaly detection methods, the highest performance metrics are obtained on the movie review dataset with the top 6% of data identified as outliers. This suggests that, if less than 6% of the data is eliminated, the influence of noise remains large and affects classifier performance. Conversely, if more than 6% of the data is removed, non-outlier samples may be eliminated as anomalies. Among the classifiers used, the meta-classifier dominates in terms of accuracy, precision, recall, and F-score (precision of 93.2%, recall of 93.8%, F-score of 93.5%). In terms of outlier detection methods, the proposed anomaly detection method performs better than LOF for all classification methods employed. All classifiers used with the proposed anomaly detection method also show improved classification performance compared with the performance measures obtained with the LOF anomaly detection method.
In this paper, a study of users’ sentiment classification and abnormal sentiment detection on social media is carried out. The applicability of the proposed method is explored using movie review data as a case study. A subspace-based outlier detection method is proposed for anomaly detection. This work also aims at identifying the optimum number of anomalies to be eliminated for the review dataset used. The results show that the performance obtained with the proposed anomaly detection method (precision of 93.2%, recall of 93.8%, F-score of 93.5%) is higher than that of the baseline method (precision of 92.3%, recall of 93.5%, F-score of 92.9%). The experiment also shows that the optimum percentage of anomalies to be eliminated for the movie review dataset is the top 6% of the original data based on anomaly score. It is also noted that the performance of the classifiers deteriorates as the percentage of anomalies eliminated increases above 6%. The proposed anomaly elimination method and the meta-classifier employed to detect users’ abnormal sentiments prove to be very effective in terms of all performance measures for the movie review dataset.
A consideration for further research is the possibility of using the proposed method not only with LOF but on top of different anomaly detection algorithms, which could improve the behavior of unsupervised anomaly detection methods. The experiment was carried out on the movie review dataset, and the results show that review sentiment can be accurately classified by the proposed method. The proposed model can be used in other sentiment analysis domains such as tweets, Facebook, WhatsApp, and newsgroups.
