Abstract
Sentiment analysis, also known as opinion mining, is an area that has grown considerably over the last decade. It attempts to determine people’s feelings, opinions, and emotions, among other things, about something or someone. To do this, natural language processing techniques and machine learning algorithms are used.
This article discusses the problem of extracting sentiment and opinions from a collection of reviews of scientific articles submitted to an international conference on computing held in northern Chile.
The first aim of this analysis is to automatically determine the orientation of a review and contrast this with the assessment made by the reviewer of the article. This would allow scientists to characterize and compare reviews crosswise and more objectively support the overall assessment of a scientific article.
A hybrid approach that combines an unsupervised machine learning algorithm with techniques from natural language processing is proposed to analyze reviews. This method uses part-of-speech (POS) tagging to obtain the syntactic structure of a sentence. This syntactic structure, along with the use of dictionaries, allows determining the semantic orientation of the review through a scoring algorithm.
A set of experiments was conducted to evaluate the capability and performance of the proposed approaches relative to a baseline, using standard metrics, such as accuracy, precision, recall, and the
Introduction
Opinions are central to almost all human activities because they are a key influence on people’s behavior. Each time a decision needs to be made, humans look for others’ opinions. In the real world, enterprises and organizations seek to know public opinion about their products and services. In turn, customers want to know others’ opinion about a certain product before buying it. In the past, people looked for opinions from their friends and family, while organizations made polls or organized focus groups. Nevertheless, with the sudden growth of social networks such as Twitter and Facebook, individuals and organizations use data provided by these means to support their decision-making process. The field of sentiment analysis, also called opinion mining, emerged in this context.
Sentiment analysis is a relatively recent area in the field of data mining. There are different techniques for extracting, processing, and seeking objective data in texts. Nevertheless, texts also contain subjective components that are of interest. These components, which include opinions, sentiments, and emotions, among others, are the focus of sentiment analysis.
Sentiment analysis includes a large number of tasks such as sentiment extraction and classification, subjectivity detection, opinion summarization, and opinion spam detection, among others. To perform these tasks accurately, it is necessary to face several challenges, particularly the formalization of the meaning of an opinion. For this purpose, a series of formalisms and mathematical representations to express opinions have been developed.
Sentiment analysis is an area with great development opportunities, particularly due to the huge growth of data available in the web, for example, in blogs, social networks, and forums, among others. One of the applications of opinion mining is product or service assessment by analyzing users’ opinions or reviews. This application is highly important for organizations because it allows discovering what people think and say about a certain trademark [20].
One application area where opinion mining techniques have not yet been applied is the reviewing process of scientific articles. Peer review is the main quality control mechanism for most scientific communities. It involves reviewing each paper in order to give authors suggestions for correcting and improving it, and to state whether it can be published or must be rejected [5]. As with sentiment analysis in industry, opinion mining can be used to analyze the orientation of scientific paper reviews. This paper shows the application of sentiment analysis to a data set consisting of paper peer reviews.
The domain of scientific paper reviews presents some major challenges, such as:
Usually classes are unbalanced, because there is a strong bias towards negative opinions. Different reviews usually vary in terms of the number of assessments. Normally, there is not a clear correlation between the number of positive and negative opinions with the final evaluation made by reviewers.
All these issues make this domain a challenge for opinion mining and sentiment analysis purposes.
Specifically, anonymous reviews taken from an international conference have been used as a data set. This conference is an academic/business event of informatics and computer engineering. Authors submitted their papers through EasyChair. The papers could be written in Spanish, English or Portuguese. A double blind review scheme was used to prevent biases during the evaluation of the different articles. An international reviewing committee was in charge of the evaluation of each paper. The papers were distributed among the reviewers according to their affinity to the corresponding research area. The reviewers evaluated the submitted papers and provided their comments and evaluations in Spanish and in some cases in English.
This paper aims to present the implementation of sentiment analysis methods in the area of scientific paper reviews as a proof of concept for future applications. The techniques used include a naïve Bayes (NB) classifier, a classifier built on support vector machines (SVM), an unsupervised classifier in the form of a scoring algorithm based on Part-Of-Speech tagging [20] and keyword matching, and finally a hybrid method using both the scoring algorithm and SVM.
The remaining part of this document is organized as follows: Section 2 shows papers related to this study. Section 3 describes the materials and methods used, including a description of data, tools, and processing. Also, the four implemented methods are described in detail: NB, SVM, the scoring algorithm and the hybrid method based on scoring and SVM (called HS-SVM). In particular, this section details the implemented algorithms and optimal parametrization of the methods discussed. In addition, the evaluation and assessment of these classifiers is detailed. Section 4 shows the main results and their discussion. Finally, Section 5 deals with the conclusions and possible future work.
Sentiment analysis
Opinion mining systems development poses many challenges. First, it is necessary to identify text content. This is not an easy task due to the nature of language, which contains a great deal of semantic subtleties not present in other types of data. Second, sentiments must be classified in one way or another and thus determine their orientation. There are different ways to address this problem [26].
An opinion may be simply defined as a positive or negative sentiment, a viewpoint, an emotion or an appreciation about something or someone. In mathematical terms, an opinion is defined as a quintuple (
Apart from sentiment and opinion, subjectivity and emotion are two other important related concepts in the area of opinion mining. A subjective sentence may express a personal sensation, a viewpoint or a belief; however, it does not necessarily involve a sentiment. A good classification of subjectivity may ensure a better sentiment classification [31], and this process can even be considered more complex than distinguishing positive, negative or neutral sentiments. On the other hand, an emotion may be considered as an expression of an individual’s own subjective thoughts. Emotions are closely related to sentiments. In fact, the way the strength of an opinion is measured is associated with the intensity of certain emotions such as love, hate, surprise, anger, and sadness, among others.
Then, the objective of opinion mining is discovering all opinions in a quintuple (
Sentiment classification has traditionally been done in two ways: with supervised methods and with unsupervised methods based on semantics. The success of these techniques depends mainly on the appropriate extraction of the set of features used to detect sentiments. The most widely used supervised techniques are support vector machines (SVM) and the naïve Bayes (NB) classifier [32]. Machine learning solutions involve building classifiers from a collection of documents, where each text can be represented as a bag of words [47, 27]. It is also common to apply stemming and stop word elimination. In general, classifiers that behave well in the domain where they are trained do not show the same behavior in another domain, since they are highly dependent on the training data used [1]. Most of the literature is dedicated to domain-specific solutions, and while there is much work towards cross-domain opinion mining, most solutions are domain dependent [15]. This article focuses on the domain of scientific paper reviews.
Unsupervised semantics-based methods use dictionaries in which different types of words are classified according to their semantic orientation [46]. Unlike traditional machine learning methods, semantics-based unsupervised methods are less dependent on their domain, although their performance may vary from one domain to another. There are two important sub-categories to mention: dictionary-based and corpus-based. The dictionary-based technique uses a set of initial terms, usually collected manually. This set grows by looking up synonyms and antonyms. An example of this type of dictionary is WordNet, which was used for developing SentiWordNet [2]. The main drawback of this type of approach is its inability to handle the specific orientations of a domain and context. The corpus-based technique emerged with the purpose of providing dictionaries for a specific domain. These dictionaries result from a set of opinion seeds that grows through the search for related words by means of statistical or semantic techniques such as Latent Semantic Analysis (LSA), or simply by the frequency of occurrence of words within the collection of documents used [36].
Authors in [22] present a refined characterization of sentiment analysis techniques, including machine learning (supervised and unsupervised algorithms) and lexicon-based approaches (dictionary-based and corpus-based methods). In this review, supervised methods used for sentiment analysis include decision trees, support vector machines, neural networks, and methods based on probability, such as naive Bayes, Bayesian networks and maximum entropy.
A series of related papers is discussed below. Since there are no applications in the same domain, the domain of reviews or entity critique (e.g. films, hotels, products) is used as a reference since they are the closest among possible applications. This study is partially based on the work proposed by the authors in [49], where an opinion classification system of film reviews in Spanish is shown, using dependency parsing and POS tagging.
Table 1 shows results from different studies on polarity detection, starting with the seminal work from Pang et al. [27]. These results are shown with the purpose of providing a reference framework to evaluate the results obtained. Table 1 focuses mostly on binary classification. Not all the papers shown in the table will be discussed, only those pertinent to our specific work. The strategy used is shown in the Approach column. It can be based on machine learning (ML), lexicon (L) or it may be hybrid (H). The application area is shown in the Domain column. Most work is done on film critiques or Twitter. The values in the Results column are given in terms of general accuracy, unless otherwise stated. The best results obtained for each paper are shown. If the work involves tests on different data sets or with different numbers of classes, results are reported separated by a slash (/) in the same order. The information in the table was obtained from the systematic reviews in [32, 40]. The first review deals with opinion mining as a whole, while the second one focuses on deep learning, a machine learning branch with different applications in opinion mining.
Results obtained in related works
An effective sentiment analysis requires not only considering words individually, but also the linguistic construction of the sentence analyzed since it may totally change the sentiment expressed. The usual way of facing these constructions is by defining a heuristic. Authors in [27] work on film critiques and use a simple heuristic assuming that the negation scope includes words between the negator and the first punctuation after the negative term. Authors in [39] use data generated from the POS tagging process to identify the negation scope.
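The punctuation-bounded negation heuristic described above can be illustrated with a minimal sketch. The `NOT_` prefix, the negator list and the function name `mark_negation_scope` are assumptions made for this illustration; they are not taken from the implementations in [27] or [39].

```python
import re

# Illustrative negator list and punctuation set (assumptions, not from the cited works).
NEGATORS = {"no", "nunca", "not", "never"}
PUNCTUATION = set(".,;:!?")


def mark_negation_scope(tokens):
    """Prefix NOT_ to every token between a negator and the next punctuation mark."""
    marked = []
    in_scope = False
    for token in tokens:
        if token in PUNCTUATION:
            in_scope = False          # punctuation closes the negation scope
            marked.append(token)
        elif token.lower() in NEGATORS:
            in_scope = True           # the negator itself is left unmarked
            marked.append(token)
        else:
            marked.append("NOT_" + token if in_scope else token)
    return marked


tokens = re.findall(r"\w+|[.,;:!?]", "I did not like the plot, but the cast is great.")
print(mark_negation_scope(tokens))
```

In this sketch, only the words between the negator and the comma are marked, so the second clause keeps its original tokens.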
Apart from linguistic aspects, sentiment analysis must take into account the quality of the text analyzed. Furthermore, people make spelling and grammar mistakes. Some incorrectly written words were found during data processing. To solve these problems, spelling correctors may be used.
An important aspect in opinion mining is detecting sarcasm and irony. This is a complex task in this research field, particularly because of the lack of agreement among researchers on how sarcasm and irony should be formally defined [33]. Another aspect is detecting spam or unreliable opinions that may distort analyses [20]. Unlike other kinds of reviews, research paper reviews do not exhibit these phenomena, so they were not included in our analysis.
Research in the opinion mining area has grown considerably in the last decade, though most work focuses on texts written in English. While sentiment analysis in Spanish does not differ in essence from sentiment analysis in English, fewer tools and libraries are available for Spanish, which generally makes the implementation of sentiment analysis in Spanish more complex. Additionally, the Spanish language is less structured, compact and technical than English, which makes its semantic analysis difficult. Only a small percentage of the research work deals with the Spanish language.
Lexicon and grammar differences between Spanish and English may have an impact on the performance of systems trained for a certain language. Categorizing an opinion as positive, negative or neutral seems a simple task; however, it is really complex, particularly when opinions are written in different languages. Authors in [4] have studied the impact of English, German, and French particularities.
Some opinion mining studies focus on the Spanish language. One of the most relevant is proposed in [7]. It uses a semantics-based model defining a collection of dictionaries to calculate sentiments. Another study recently proposed in [48] describes an opinion mining system that classifies the orientation of Spanish texts taken from Twitter, according to an analysis of natural language, obtaining the syntactic sentence structure.
While works of sentiment analysis centered in movie reviews and product reviews are common in the literature, it must be mentioned that these domains of application are quite different from scientific paper reviews. An important difference is that peer reviews of research articles are an occluded genre (i.e. the documents are not publicly available) [14], contrary to movie reviews and product reviews that are intended for the general public.
Another key difference is the vocabulary used, which tends to be formal because of the scientific background of the domain. In addition, reviewers tend to respect the rules of orthography and grammar, which facilitates the analysis in comparison with other kinds of reviews. In general, the main difference is the expected level of formality found throughout the text.
Furthermore, the interpretation of a paper review can be a difficult task because of the conflicting signals contained in the text [13]. Also, reviews contain requests for changes in the form of directions, suggestions, clarification requests and recommendations. Early career researchers tend to be more affected by this, since they lack the experience to adequately interpret the reviewers’ comments [25].
Finally, it is important to remark that no publications using scientific paper reviews as a work domain for sentiment analysis have been found in the literature, except for our previous work [17] which in fact is a previous and shorter version of this extended article. So, this proposal for applying sentiment analysis is a novel contribution to this domain.
Attribute description for the paper reviews data set
One of the common problems in scientific paper reviews is that the scores provided by reviewers can be inconsistent with what is written in the review. In particular, some reviewers are too strict: the review text contains few critiques, suggesting that the paper should be accepted, yet the reviewer’s score indicates that it was rejected. The opposite problem also occurs, that is, a reviewer makes substantive critiques while indicating that the paper must be accepted.
Concerning the problem above, consistency evaluation between the written review and the reviewers’ score is proposed as a practical application of sentiment classification. For these reasons, the classifier used in this study was trained according to manual data tagging, not the reviewer’s original classification. This allows revising the consistency between what the review states and what the reviewer says about the paper acceptance or rejection.
In this context, conducting a longitudinal evaluation of the consistency between the review and each reviewer’s acceptance is proposed as future work. This evaluation must be done while keeping anonymity and giving each reviewer a numerical identifier so as to avoid revealing their true identity.
This work would allow classifying reviewers between strict (i.e., the score is always more negative than the review’s critique) and non-strict (i.e., the score is always more positive than the review’s critique). This classification can be applied in such a way that reviewers may be distributed equitably, thus guaranteeing that a good paper will not be rejected because reviewers are too strict and a poor paper will not be accepted because reviewers are not very strict.
The current system is used as a proof of concept, showing that it is possible to use automatic sentiment classification methods to determine review orientations. Certainly, the classification provided by the system is not expected to be consistent with the results given by the reviewers themselves. In fact, this is the consistency to be determined.
Materials and methods
Research data
The data set consists of paper reviews sent to an international conference in Spanish.
The data set used in this study can be found in
Empty reviews and reviews in English are not considered in the analysis. Table 3 shows a basic statistics summary concerning word count and number of sentences for the reviews in the data set.
Review data set statistics
Figure 1 shows the data distribution in terms of the classifications assigned by the authors when reviewing the content of each review; note that the data set is skewed. Figure 2 shows the data distribution in terms of the classifications assigned by the original reviewers. The distribution of the original scores is more uniform than that of the revised scores. This difference is assumed to come from a discrepancy between the way the paper is evaluated and the way the review is written by the original reviewer.
The study focuses on classifying reviews according to the scale determined by the authors. Original evaluations will be used as complements for evaluating the consistency between the classification inferred from the text and the one assigned by the reviewer.
Table 4 shows the relationship between the original evaluation scale and the orientation scale. There is a slight bias toward negative classes in the orientations when compared to the evaluations. It is also clear that the perceived orientation of the review is not completely aligned with the real evaluation. The accuracy of human evaluators is measured as the rate at which they predict the real evaluation correctly. Considering all five classes of the problem, the accuracy rate is 36.65%. On the other hand, the accuracy rate is 65.45% if a ternary classification approach is taken.
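How the two accuracy rates relate can be shown with a short sketch that computes accuracy from a confusion matrix and then merges five classes into three. The matrix `toy` below is invented for illustration and does not reproduce Table 4; the mapping of the two negative classes and the two positive classes onto single classes is also an assumption about how the ternary case is built.

```python
def accuracy(matrix):
    """Fraction of items on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / float(total)


def collapse(matrix, mapping, n_groups):
    """Merge rows and columns of a confusion matrix according to mapping."""
    merged = [[0] * n_groups for _ in range(n_groups)]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            merged[mapping[i]][mapping[j]] += count
    return merged


# Hypothetical five-class matrix (rows: evaluation, columns: orientation).
toy = [
    [5, 3, 1, 0, 0],
    [2, 6, 3, 1, 0],
    [1, 2, 4, 2, 1],
    [0, 1, 3, 5, 2],
    [0, 0, 1, 2, 5],
]

# Assumed grouping: classes 0-1 -> negative, 2 -> neutral, 3-4 -> positive.
ternary = collapse(toy, mapping=[0, 0, 1, 2, 2], n_groups=3)

print("five classes: %.2f" % accuracy(toy))
print("three classes: %.2f" % accuracy(ternary))
```

Merging classes can only keep or raise the accuracy, because every item that was on the diagonal stays on it and confusions between merged classes become correct.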
Confusion matrix of evaluation (rows) vs orientation (columns)
Distribution of review qualifications (revised score).
Distribution of review qualifications (original score).
The accuracy rates given in the previous paragraph serve as a reference. Indeed they can be seen as a baseline with which to compare the different results obtained with different techniques. In fact, it can be observed that for the five classes case the different methods have a similar behaviour, while there is a clear difference in favor of human classifiers in the ternary case.
The following tools were used for developing an opinion classification system and making sentiment analysis:
Python programming language, version 2.7.
Scikit-learn library, with its classifier implementations and evaluation methods [28].
Stanford POS Tagger library, particularly its model for processing text in Spanish [44]. This model uses the tag set proposed by the EAGLES group [18] to tag the words in each sentence.
SentiWordNet 3.0 lexical ontology, containing semantic orientations and synonym sets in English [2]. A Spanish-translated version available in [29] was used. Some words and their translations were added to the original set because it was incomplete.
Dictionaries specifying the semantics of the words. They were constructed by manually reviewing the data set and finding words that fit each category. The following dictionaries were considered:
Positive words, e.g. “bueno” (good) and “innovador” (innovative).
Negative words, e.g. “malo” (bad/wrong) and “deficiente” (deficient).
Adversative words, e.g. “pero” (but).
Amplifier words, e.g. “muy” (very).
Mitigator words, e.g. “poco” (few).
Suggestion words, e.g. “sugiero” (I suggest) and “corregir” (to correct).
Negation words, e.g. “no” and “nunca” (never).
A list of compound expressions that must be fused before processing the text, such as “sin embargo” (“nevertheless”, “nonetheless”, “however”), which is treated as the single token “sin_embargo”.
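The fusion of compound expressions can be sketched as a simple substitution step. The function name `fuse_compounds` and the regular-expression approach are assumptions made for illustration; only the “sin embargo” example comes from the text.

```python
import re


def fuse_compounds(text, expressions):
    """Replace each multi-word expression with a single underscore-joined token."""
    # Longer expressions are handled first so that they are not split by shorter ones.
    for expression in sorted(expressions, key=len, reverse=True):
        words = expression.split()
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b"
        # The replacement uses the lower-case dictionary form of the expression.
        text = re.sub(pattern, "_".join(words), text, flags=re.IGNORECASE)
    return text


sentence = "El tema es relevante, sin embargo la redacción es deficiente."
print(fuse_compounds(sentence, ["sin embargo"]))
```

After this step a whitespace tokenizer sees “sin_embargo” as one token, so it can be looked up in a dictionary like any single word.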
Methods used in opinion mining are related to data extraction and preprocessing, natural language processing, and machine learning, which play a fundamental role in the task of determining the orientation of an opinion. A learning task may be divided into two broad approaches: supervised learning, in which classes are provided in the data, and unsupervised learning, in which classes are unknown and the learning algorithm needs to generate class values automatically. The supervised methods used were naïve Bayes [3] and support vector machines [20]. For the unsupervised learning task, an approach based on part-of-speech tagging and keyword matching was used. Furthermore, a hybrid approach [32] which combines both supervised and unsupervised methods is proposed.
Deep learning methods have not been tested due to the small size of the data set. While deep learning methods perform well in sentiment analysis [40], the number of parameters that must be estimated for them to work well is too large for the amount of data present in this data set. Enlarging the data set is difficult because scientific reviews are an occluded genre [14], so getting access to more data is not easy. Gathering more reviews, and consequently applying deep learning methods to this data set, has been left for future work.
Figure 3 shows the high level architecture of the implemented system with the purpose of showing the general logic flow. Paper reviews are represented in a structured format using JSON. As part of the preprocessing step, the raw data has been checked manually and corrections have been applied where needed. After reading the corrected data, another preprocessing step is needed before constructing the supervised and unsupervised classifiers. All the classifiers generate a report in text format that can be viewed by the final user.
High level diagram of the implemented methods.
The NB classifier assumes that all attributes are conditionally independent given the class, but this assumption rarely holds in practice. For example, the words in a document are not independent of one another. Despite this, researchers have shown that this method generates good models [20].
As for SVM, this approach has a sound theoretical basis and has been shown empirically to be the most accurate classifier for text documents [20]. The classifier implemented by the Python scikit-learn library [28], based on the libsvm implementation [9], was used. In particular, a linear kernel was used because it rendered better results than the other kernels available in the library. The optimal classifier parametrization was obtained via empirical tests. The optimal parameter
For SVM, an output coding based on error correction codes [10] was used. This method is implemented in sklearn libraries and its performance was better than the one vs. all approach used by default for the implementation [28], obtaining a 10% improvement in terms of the average metric
In both cases, the training of the classifiers was done by splitting the data set into a training set and a testing set with a 70% and 30% proportion, respectively.
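The training setup described above (a 70%/30% split and a linear-kernel SVM wrapped in an error-correcting output code scheme) can be sketched with scikit-learn. The synthetic data produced by `make_classification` stands in for the vectorized reviews, and the values of `C`, `code_size` and `random_state` are placeholders rather than the parametrization found by the authors.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

# Synthetic five-class data standing in for the vectorized reviews (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=5, random_state=0)

# 70% of the data for training and 30% for testing, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Linear-kernel SVC (libsvm-based) wrapped in an error-correcting output code.
# C and code_size are placeholder values, not the ones tuned in the study.
ecoc_svm = OutputCodeClassifier(estimator=SVC(kernel="linear", C=1.0),
                                code_size=2.0, random_state=0)
ecoc_svm.fit(X_train, y_train)

predictions = ecoc_svm.predict(X_test)
print("accuracy: %.3f" % accuracy_score(y_test, predictions))
```

Stratifying the split is a design choice added here because the classes are unbalanced; the source does not state whether the original split was stratified.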
Unsupervised methods: Part-Of-Speech tagging
Once the text is separated into tokens, the next step is usually a morphosyntactic analysis to identify the characteristics of each token, for example, its grammatical category. This analysis is known as Part-Of-Speech (POS) tagging.
The method uses a text in a given language as input and, through the application of its internal POS tagging model, assigns a grammatical category to the words in a sentence, for example, verb and adjective, among others. In addition, each category has its own characteristics, for example, in Spanish verbs are characterized by tense and type of subject, which are not applicable to nouns.
The complexity of this task depends on the target language to be analyzed. For example, Spanish is more complex as to verb conjugation and implicit subjects. To apply this technique, preprocessing stemming is omitted because it may prevent obtaining the correct grammar structure.
POS tagging poses two main challenges: the first one is word ambiguity, which depends on the context of the sentence analyzed; the second one is assigning a grammatical category to a word that is unknown to the system. To solve both problems, the context around the word in a sentence is typically considered and the most probable category is selected. Grammatical categories have a relevant property: a word can replace a token of the same grammatical category without affecting the sentence grammatically [34].
Most tools for determining grammatical category only work in English; as a result, it becomes necessary to find a POS tagging library that can handle Spanish. The Stanford Log-linear Part-Of-Speech Tagger [45] library was used. This library reads a text and assigns a grammatical category to each word. It is implemented in Java (version 8) and provides models for six different languages, including Spanish.
Data preprocessing
Before classifying a text, it is necessary to process it. First, punctuation is standardized so that writing rules are respected; for example, “The writing is awful,but the form is correct.” becomes “The writing is awful, but the form is correct.”, with a space after the comma. Once this is done, the text is tokenized, separating it into sentences (according to the use of periods) and each sentence into words. Further preprocessing depends on the classifier used.
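The two steps above, punctuation standardization and tokenization, can be sketched with regular expressions. The function names and the exact patterns are assumptions for illustration; the source only specifies that a missing space after punctuation is restored and that sentences are split on periods.

```python
import re


def standardize_punctuation(text):
    """Insert a missing space after , ; : ! ? when a letter follows directly."""
    return re.sub(r"([,;:!?])(?=[^\W\d_])", r"\1 ", text)


def tokenize(text):
    """Split text into sentences on periods, then each sentence into tokens."""
    sentences = [part.strip() for part in text.split(".") if part.strip()]
    # Words and punctuation marks are kept as separate tokens.
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences]


raw = "The writing is awful,but the form is correct."
clean = standardize_punctuation(raw)
print(clean)
print(tokenize(clean))
```

Restricting the lookahead to letters leaves numbers such as “1,000” untouched, which is a choice made for this sketch.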
In the case of NB, punctuation marks and Spanish stopwords are eliminated because they do not provide useful information for this classifier. A TF-IDF scheme is applied to the input text, and this representation is the input of the Bayes classifier. Similarly, in the case of SVM, punctuation marks and Spanish stopwords are eliminated. A TF-IDF scheme is applied to the input text; then, singular value decomposition (SVD) is applied, keeping the 100 largest singular values, and this representation is the input of the SVM. SVD is applied in order to reduce dimensionality: even though SVM is not sensitive to high dimensionality, this reduction lowers the computational cost of the method.
In the case of POS tagging, neither punctuation marks nor stopwords are eliminated because they contain useful data for the classifier (for example, negation). The text is then fed into the Stanford POS Tagger in order to identify its syntactic structure. Finally, every word in the document is checked against the dictionaries, and the words found are marked with additional tags. This list of tokens and their associated tags is the input of the unsupervised classifier.
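The two supervised preprocessing chains can be sketched as scikit-learn pipelines. The toy corpus, the labels and the short stopword list are invented for illustration, `MultinomialNB` is an assumed choice of naïve Bayes variant, and `n_components=3` replaces the 100 components reported in the text only because the toy vocabulary is tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Abbreviated, hypothetical stopword list; a full Spanish list would be used in practice.
SPANISH_STOPWORDS = ["el", "la", "los", "las", "de", "es", "un", "una", "y", "en", "está"]

# Toy reviews and labels (1 = positive, 0 = negative), invented for this sketch.
docs = [
    "El artículo es innovador y está bien escrito.",
    "La propuesta es buena y los resultados son claros.",
    "Un trabajo excelente, con una metodología sólida.",
    "El artículo está mal escrito y es deficiente.",
    "Los experimentos son pobres y la redacción es confusa.",
    "La metodología es incorrecta y faltan referencias.",
]
labels = [1, 1, 1, 0, 0, 0]

# NB chain: TF-IDF features go straight into the classifier.
# The default tokenizer of TfidfVectorizer already discards punctuation.
nb_model = make_pipeline(
    TfidfVectorizer(stop_words=SPANISH_STOPWORDS),
    MultinomialNB(),
)

# SVM chain: TF-IDF, then truncated SVD to reduce dimensionality, then a linear SVM.
svm_model = make_pipeline(
    TfidfVectorizer(stop_words=SPANISH_STOPWORDS),
    TruncatedSVD(n_components=3, random_state=0),  # 100 in the setup described above
    SVC(kernel="linear"),
)

nb_model.fit(docs, labels)
svm_model.fit(docs, labels)
print(nb_model.predict(docs))
print(svm_model.predict(docs))
```

`TruncatedSVD` is used here because it operates directly on the sparse TF-IDF matrix; whether the original implementation used this class is not stated in the source.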
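The dictionary lookup over tagged tokens can be sketched as follows. The dictionary contents are limited to the example words given earlier in the text, the function name `add_dictionary_tags` is hypothetical, and the POS labels in the example are simplified placeholders rather than the EAGLES tags produced by the tagger.

```python
# Dictionaries restricted to the example words mentioned in the text.
DICTIONARIES = {
    "positive": {"bueno", "innovador"},
    "negative": {"malo", "deficiente"},
    "adversative": {"pero"},
    "amplifier": {"muy"},
    "mitigator": {"poco"},
    "suggestion": {"sugiero", "corregir"},
    "negation": {"no", "nunca"},
}


def add_dictionary_tags(tagged_tokens, dictionaries=DICTIONARIES):
    """Attach to each (word, pos_tag) pair the names of the dictionaries containing it."""
    enriched = []
    for word, pos_tag in tagged_tokens:
        extra = sorted(name for name, words in dictionaries.items()
                       if word.lower() in words)
        enriched.append((word, pos_tag, extra))
    return enriched


# Placeholder POS labels; the real tagger output would be used instead.
tagged = [("no", "ADV"), ("es", "VERB"), ("bueno", "ADJ")]
print(add_dictionary_tags(tagged))
```

Keeping the POS tag next to the dictionary tags lets later rules use both kinds of information for each token.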
Scoring algorithm
To evaluate a review, Algorithm 3.5 is applied to each sentence and then the average over all the sentences in the review is calculated.
The value produced by Algorithm 3.5 provides the semantic orientation of the review in terms of a continuous numeric scale. This result must be discretized to obtain the classification in the corresponding classes.
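The averaging and discretization steps can be sketched as follows. The cut points `[-0.1, 0.1]` and the three class labels are hypothetical values chosen for illustration; they are not the thresholds used in the study.

```python
import bisect


def review_score(sentence_scores):
    """Average the per-sentence scores of a review (0.0 for an empty review)."""
    if not sentence_scores:
        return 0.0
    return sum(sentence_scores) / float(len(sentence_scores))


def discretize(score, thresholds, labels):
    """Map a continuous score to a class using ascending cut points.

    len(labels) must equal len(thresholds) + 1.
    """
    return labels[bisect.bisect_right(thresholds, score)]


# Hypothetical cut points and labels for a ternary classification.
THRESHOLDS = [-0.1, 0.1]
LABELS = ["negative", "neutral", "positive"]

score = review_score([1.25, -0.2, -0.5])
print(score, discretize(score, THRESHOLDS, LABELS))
```

The same `discretize` function covers binary or five-class schemes by passing one or four cut points with the matching number of labels.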
The binary classification method (classes “
The algorithm was implemented by following a rule-based scheme, according to the semantic characteristics of words. In particular, a dictionary-based approach combined with a series of heuristics was used. These heuristics consist of rules that define the effect of each type of word on the semantic orientation of a sentence.
First, each word is analyzed and tagged according to its morphosyntactic characteristics (POS tagging). In addition, the dictionaries mentioned previously were used to add other tags to each word. The dictionaries are listed below; they were used to specify the effect of each word on the semantic orientation of the sentence. In particular, the general effect on the sentence is calculated according to a series of pre-established rules, depending on the word found and its semantic orientation. The strategy used in each case is similar to the one used in [49], though without using dependency parsing.
Scoring Algorithm
Input: TokenList, a list of tokens in a sentence; PosBias, an additional weight factor for positive words; NegBias, an additional weight factor for negative words.
Output: TotalScore, the semantic orientation value for the sentence.
ScoreSentence: TotalScore ... for each Token token in TokenList: Tags ...
Word lists
For example, in the sentence “el artículo es innovador” (“the article is innovative”) the word “innovative” would be in this dictionary, as it is a positive word, and this sentence would have a positive semantic orientation.
For example, in the sentence “el artículo está mal escrito” (“the article is badly written”) the word “badly” would be in this dictionary, as it is a negative word, and this sentence would have a negative semantic orientation.
For example, in the sentence “el artículo está muy bien escrito” (“the article is very well written”) the word “very” would be in this dictionary, as it has the effect of intensifying the effect of the next word. So if the word “well” added 0.5 to the semantic orientation, after using the intensification factor it would now add 1.25. This sentence would in turn have a very positive semantic orientation. This is implemented by using the value ModFactor as can be seen in Algorithm 1, in this case, the value would be 2.5.
For example, in the sentence “el artículo tiene pocos errores” (“the article has few mistakes”) the word “few” would be in this dictionary, as it mitigates the effect of the next word. So if the word “error” subtracted 0.5 from the semantic orientation, after applying the mitigation factor it would subtract only 0.2. This sentence would in turn have a slightly negative semantic orientation. This is implemented through the value ModFactor, as can be seen in Algorithm 1; in this case the value would be 2.5 (note that 1/2.5
For example, in the sentence “el artículo no es bueno” (“the article is not good”) the word “no” would be in this dictionary, as it has the effect of reversing the orientation. So if the word “good” added 0.5 to the semantic orientation, after using the negation factor it would now subtract 0.5. This sentence would in turn have a negative semantic orientation. The negation is implemented through the boolean value Inverted in Algorithm 1.
For example, in the sentence “la estructura está bien, pero tiene problemas de contenido” (“the structure is good, but the content has problems”) the word “pero” would be in this dictionary, as it is an adversative conjunction. So if the word “good” added 0.5 to the semantic orientation and the word “problems” subtracted 0.5 from it, then after considering the adversative clause the word “good” would add only 0.25, and the whole sentence would have a semantic orientation of
Usually, reviews that suggest direct rejection tend to use discourse units with the function of negative evaluation, while reviews that suggest a major revision of the article use discourse units with the function of recommendation [35]. Based on this, the score of a recommendation, while slightly negative in the sense that it implies that the paper must be improved, has a lower impact than a direct negative evaluation. The suitable empirical value was found to be
For example, in the sentence “sugiero mejorar la estructura” (“I suggest improving the structure”) the word “suggest” would be in this dictionary, as it implies a suggestion and something that must be improved. So this sentence would now have a semantic orientation of
Heuristics
If a If a
Finally, if a word is not included in any dictionary (the word lists, not the ontology), it is assumed to have no effect in this domain, so its score is set to 0.
The heuristics above could be refined further. Nevertheless, the results obtained with them are satisfactory, since they improve on the baseline performance obtained by our classifiers without heuristics.
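The dictionary lookups and heuristics described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny word lists, the function name `score_sentence`, the regular-expression tokenizer, the multiplicative use of PosBias and NegBias, and the value `ADVERSATIVE_FACTOR = 0.5` (inferred from the “pero” example) are assumptions made for illustration. Only the ModFactor value of 2.5 and the Inverted flag are taken from the text.

```python
import re

# Hypothetical, deliberately tiny dictionaries (base scores follow the examples).
POSITIVE = {"innovador": 0.5, "bien": 0.5, "bueno": 0.5}
NEGATIVE = {"mal": -0.5, "errores": -0.5, "problemas": -0.5}
INTENSIFIERS = {"muy"}
MITIGATORS = {"pocos"}
NEGATIONS = {"no"}
ADVERSATIVES = {"pero"}

MOD_FACTOR = 2.5          # value quoted in the text
ADVERSATIVE_FACTOR = 0.5  # assumption: score before "pero" is halved


def tokenize(sentence):
    """Lower-case the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def score_sentence(tokens, pos_bias=1.0, neg_bias=1.0):
    """Return a continuous semantic orientation score for one sentence."""
    total = 0.0
    modifier = 1.0    # pending intensification / mitigation factor
    inverted = False  # pending negation
    for word in tokens:
        if word in INTENSIFIERS:
            modifier *= MOD_FACTOR
        elif word in MITIGATORS:
            modifier /= MOD_FACTOR
        elif word in NEGATIONS:
            inverted = not inverted
        elif word in ADVERSATIVES:
            # Down-weight everything accumulated before the adversative.
            total *= ADVERSATIVE_FACTOR
            modifier, inverted = 1.0, False
        elif word in POSITIVE or word in NEGATIVE:
            if word in POSITIVE:
                value = POSITIVE[word] * pos_bias
            else:
                value = NEGATIVE[word] * neg_bias
            value *= modifier
            if inverted:
                value = -value
            total += value
            # Pending modifiers are consumed by the first sentiment word.
            modifier, inverted = 1.0, False
        # Words found in no dictionary contribute 0.
    return total
```

Under these assumptions, `score_sentence(tokenize("el artículo está muy bien escrito"))` gives 1.25, `score_sentence(tokenize("el artículo tiene pocos errores"))` gives -0.2, and `score_sentence(tokenize("el artículo no es bueno"))` gives -0.5, matching the worked examples in the word lists.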
Algorithm 3.5 produces continuous values that can be positive or negative. The objective, however, is to obtain the semantic orientation in terms of the classes defined above. Thus, Algorithms 2–4 must be used for binary, ternary and five-point classification, respectively. For this purpose, the parameter values (DoublePositiveThreshold, DoubleNegativeThreshold, NegativeThreshold and PositiveThreshold) were obtained by Monte Carlo simulation, testing a series of value ranges between
Score-based Binary Classification
Input: Score, the value given by the scoring algorithm (Algorithm 3.5).
Output: Class, positive or negative.
Function: BinaryScoreClassification(Score)
Hybrid Scoring Support Vector Machine components and flow.
Score-based Ternary Classification
Input: Score, the value given by the scoring algorithm (Algorithm 3.5).
Output: Class, positive, negative or neutral.
Function: MulticlassScoreClassification(Score)
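The score-based classification step and the Monte Carlo threshold search can be sketched as follows. This is an illustrative reading rather than the published listings: the comparison directions, the default binary cutoff of 0, the function names, the search range arguments `low` and `high`, and the use of accuracy as the selection criterion are all assumptions. Only the four threshold names come from the text.

```python
import random


def classify_binary(score, threshold=0.0):
    """Map a continuous score to positive/negative (cutoff is an assumption)."""
    return "positive" if score >= threshold else "negative"


def classify_ternary(score, negative_threshold, positive_threshold):
    """Map a score to positive, negative or neutral."""
    if score >= positive_threshold:
        return "positive"
    if score <= negative_threshold:
        return "negative"
    return "neutral"


def classify_five_point(score, double_negative_threshold, negative_threshold,
                        positive_threshold, double_positive_threshold):
    """Map a score to one of five classes, checking the extremes first."""
    if score >= double_positive_threshold:
        return "very positive"
    if score >= positive_threshold:
        return "positive"
    if score <= double_negative_threshold:
        return "very negative"
    if score <= negative_threshold:
        return "negative"
    return "neutral"


def search_ternary_thresholds(scores, labels, low, high, trials=1000, seed=0):
    """Randomly sample threshold pairs and keep the most accurate one."""
    rng = random.Random(seed)
    best_thresholds, best_accuracy = None, -1.0
    for _ in range(trials):
        # Sort the two draws so that negative_threshold <= positive_threshold.
        neg_t, pos_t = sorted(rng.uniform(low, high) for _ in range(2))
        hits = sum(classify_ternary(s, neg_t, pos_t) == y
                   for s, y in zip(scores, labels))
        accuracy = hits / len(scores)
        if accuracy > best_accuracy:
            best_thresholds, best_accuracy = (neg_t, pos_t), accuracy
    return best_thresholds, best_accuracy
```

The same random search extends to the five-point case by drawing four values per trial and sorting them before assigning them to the four thresholds.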
Another method, based on the scoring algorithm and support vector machines, is proposed for classification in this domain. The method has been named Hybrid Scoring Support Vector Machine (HS-SVM). It is a hybrid method of sentiment analysis because it combines a supervised classifier (SVM) with an unsupervised classifier (the scoring algorithm proposed in the previous section) to obtain the final class. The preprocessing steps for this new method are the same as those used for the original classifiers. Figure 4 shows the proposed method’s components and flow.
The score works as a new feature in the SVM’s input data, and the SVM is then trained with this additional feature. The proposed approach combines the information provided by the scoring algorithm and its associated components with the flexibility of the SVM. However, it has a higher computational cost, since it requires both running the scoring algorithm and training the SVM classifier. Nevertheless, since the data set for this application is sufficiently small, this drawback has no significant effect.
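The feature augmentation at the core of HS-SVM can be sketched in a few lines. The helper name `add_score_feature` and the plain list-of-lists representation are assumptions made for illustration; the text only states that the score is added as a new input feature before the SVM is trained.

```python
def add_score_feature(feature_rows, scores):
    """Append each review's score as one extra column of its feature vector.

    feature_rows: one list of numeric features (e.g. TF-IDF weights) per review.
    scores: one scoring-algorithm value per review, in the same order.
    """
    if len(feature_rows) != len(scores):
        raise ValueError("feature_rows and scores must have the same length")
    return [list(row) + [float(score)] for row, score in zip(feature_rows, scores)]
```

The augmented rows can then be passed to any SVM implementation in place of the original feature matrix. Since the score may lie on a different scale from the other features, rescaling it before training may be worth considering.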
Aspect evaluator
Reviewer comments can have different functions, and they can be more directed towards the technical content, the general readability or the structural aspect of the paper itself [14]. So while there are many aspects that could be evaluated, for example the opinion of the reviewer on the validity of the claims in the article or the discussion itself, it is simpler to evaluate textual aspects such as the format or writing rather than the content itself, since the latter requires certain knowledge of the domain of the reviewed article. Given this, a list of five important aspects considered when reviewing a paper was constructed. The evaluated aspects are listed below:
References Format Structure Writing
Evaluation consists of looking for references to these aspects (or their synonyms) in a sentence. A score is assigned to each sentence using Algorithm 3.5. The search for synonyms uses SentiWordNet synonym sets, or synsets [2].
Aspect Evaluator Algorithm
Input: TokenList, a list of tokens in a sentence.
Output: AspectScores, an array with 4 positions, one for each of the basic aspects defined: Writing, Format, References and Structure.
Function: AspectScoreSentence (returns AspectScores)
A vector containing the score of each aspect is initialized to zero. As the algorithm evaluates the sentence tokens, POS tags are used to check whether the current token is an adjective, a verb or a noun. These three tags were considered because an adjective or a verb may implicitly correspond to one aspect (e.g., “do not refer” or “well written”). Tokens carrying one of these tags are checked to see whether they match one of the aspects defined in the list. If all previous conditions apply, the current sentence score is added to the score of the associated aspect.
If an adversative clause is found, the current accumulated score is saved and a new accumulator is initialized because the use of these expressions marks the beginning of a different semantic orientation and the accumulation of previous values may affect the accuracy of results. The algorithm then continues its calculations using the new accumulator. Once the algorithm finishes the analysis of the sentence, the final score is the sum of the old accumulator and the new accumulator.
In the final implementation, the scoring and aspect evaluation algorithms were considered as one function, for the sake of simplicity.
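The aspect evaluation described above can be sketched as follows. The sketch makes several assumptions: tokens arrive already POS-tagged as `(word, tag)` pairs using Universal POS tag names, the small hand-written term sets in `ASPECT_TERMS` stand in for the SentiWordNet synsets, and the sentence score is supplied by the caller rather than computed inside the same function.

```python
# Hypothetical term sets standing in for the synsets of each aspect.
ASPECT_TERMS = {
    "writing": {"escrito", "redacción", "escritura"},
    "format": {"formato"},
    "references": {"referencias", "bibliografía"},
    "structure": {"estructura"},
}

# Only adjectives, verbs and nouns are checked against the aspects.
CONTENT_TAGS = {"ADJ", "VERB", "NOUN"}


def score_aspects(tagged_tokens, sentence_score, aspect_terms=ASPECT_TERMS):
    """Add the sentence score to every aspect that the sentence mentions."""
    scores = {aspect: 0.0 for aspect in aspect_terms}
    for word, tag in tagged_tokens:
        if tag not in CONTENT_TAGS:
            continue
        for aspect, terms in aspect_terms.items():
            if word.lower() in terms:
                scores[aspect] += sentence_score
    return scores
```

Note that in this reading an aspect mentioned twice in the same sentence receives the sentence score twice; adding it at most once per sentence would be an equally plausible interpretation.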
This section shows the results obtained with the implemented methods. First, the results from the orientation classification task are discussed, followed by the results of the evaluation classification task. Then, the results obtained from the aspect evaluator are provided.
To evaluate the classifiers, standard machine learning and pattern recognition metrics for classification problems are applied. In particular, we use accuracy, precision, recall and the
Evaluation metrics are provided as an average over each class, along with the corresponding standard deviation considering 10 replications, except in the case of the scoring algorithm, which is evaluated over all the data set and always provides the same result since it is deterministic (results only depend on parameters).
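The reporting scheme described above (class-averaged metrics, then a mean and standard deviation over replications) can be sketched as follows. The function names are hypothetical, only accuracy, precision and recall are computed, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def macro_metrics(y_true, y_pred):
    """Accuracy plus precision and recall averaged over the classes."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # A class that is never predicted (or never present) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy,
            "precision": mean(precisions),
            "recall": mean(recalls)}


def summarize(replications):
    """Mean and sample standard deviation of each metric over replications."""
    names = list(replications[0].keys())
    return {name: (mean(r[name] for r in replications),
                   stdev(r[name] for r in replications))
            for name in names}
```

For a deterministic method such as the scoring algorithm, a single call to `macro_metrics` over the whole data set is enough, since repeated runs would give identical values.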
Orientation classification
The results provided here originate from using the methods to classify the orientation of each review (i.e. the perceived evaluation). Table 5 shows the classification results for binary classification, Table 6 shows the results for ternary classification and finally Table 7 shows the results for the 5-point scale classification.
In the binary case, performance is similar to the results reported in other studies (as shown in Table 1). The best average performance is obtained with the scoring algorithm, followed by HS-SVM, pure SVM and NB.
The amount of data available for the binary classification case is smaller than for the multiclass case because the neutral reviews of the data set are not used. One of the main problems in comparison with other studies is the scarce amount of data available. Much better performance may be expected with a larger number of instances.
In the case of ternary classification, average performance decreases in all metrics. This performance reduction is due to the greater classification complexity inherent to a problem with more classes. If the classifier were to work as a random selection it would only have 33.3% probability of predicting correctly. So, in comparison to that baseline, the classifiers still have a good quality. However, it is interesting to note that in this case, the best results are obtained with the HS-SVM classifier, which now surpasses the scoring algorithm itself.
Classification results for orientation in the binary case
Classification results for orientation in the ternary case
Classification results for orientation in the 5-point scale case
In the case of the 5-point scale classification, the scoring algorithm is slightly better than the supervised methods and the HS-SVM approach surpasses all the other methods in this case, just as it did in the ternary case. According to these results, the use of this hybrid approach has better classification performance in the multiclass case, while in the binary case it is only slightly behind the scoring algorithm. In this sense, this method is considered to be more robust in relation to an increase in the number of classes.
The scoring algorithm (and the classifiers in general) had problems classifying very negative reviews. In particular, if the lower threshold used by the scoring algorithm classification is increased, examples of the very negative class can be classified correctly; however, some negative examples are then classified incorrectly.
One of the main issues that may affect classification results for the supervised case is that these classifiers do not take into account text structure. They only consider the appearance of words according to the TF-IDF scheme described in the data preprocessing section.
The poor performance of SVM on this multiclass data set may be due to the fact that this classifier is highly sensitive to class imbalance [23], and, as Fig. 1 shows, this data set is highly skewed. In this sense, the results obtained by SVM on this data set may not be reliable.
Better results could be obtained with the scoring algorithm by improving the heuristics used or by applying dependency parsing [49]. Nevertheless, results are considered satisfactory, since in all the metrics this method surpasses the other approaches.
The performance improvement with respect to the pure SVM approach is consistent in all the cases. The method works by adding more information to SVM, basically facilitating the classification process. SVM is helped by the heuristics defined for the scoring algorithm.
This method could also be combined with the results obtained for the aspects of each review. In this approach, the use of the scoring algorithm and aspect evaluation could be considered as an additional preprocessing stage. This stage would have the function of calculating additional text characteristics to facilitate the classification process by supervised methods.
This combined approach may be used for generalizations in other opinion mining cases. It would be interesting to evaluate if similar improvements may be made in other domains. Certainly, it would be necessary to adapt and modify scoring algorithms and aspect evaluation, and probably obtain a new set of optimal parameters.
Adding a hierarchical classification approach may improve results, by first filtering neutral reviews, then applying binary classification, and later applying an approach on positive and negative sets to separate very negative/positive examples from those only negative/positive.
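The hierarchical scheme suggested above can be sketched as a cascade of three decisions. This is a hypothetical illustration of the idea, not something evaluated in this work: the function name and the three stage predicates (`is_neutral`, `is_positive`, `is_extreme`) are assumptions, and each could be backed by any of the classifiers discussed.

```python
def classify_hierarchically(review, is_neutral, is_positive, is_extreme):
    """Cascade: filter neutral reviews, pick a polarity, then its intensity.

    is_neutral(review) -> bool
    is_positive(review) -> bool, only consulted for non-neutral reviews
    is_extreme(review, polarity) -> bool, separates "very" from plain classes
    """
    if is_neutral(review):
        return "neutral"
    polarity = "positive" if is_positive(review) else "negative"
    if is_extreme(review, polarity):
        return "very " + polarity
    return polarity
```

For instance, with reviews represented only by their score, the three stages could be simple threshold tests such as `lambda s: abs(s) < 0.2`, `lambda s: s > 0` and `lambda s, polarity: abs(s) > 1.0`, where the cutoff values are purely illustrative.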
The results provided here are obtained from executing the methods to classify the evaluation of each review (i.e. the original score given by the reviewers). Table 8 shows the classification results for binary case, Table 9 shows the results for the ternary case and Table 10 shows the results for the 5-point scale classification.
Classification results for evaluation in the binary case
Classification results for evaluation in the ternary case
Classification results for evaluation in the 5-point scale case
In general, maximum possible performance decreases. Although the obtained results are still acceptable since they are better than a random selection, they show that properly classifying the instances is more complex if the original scores provided by each reviewer are used instead of the orientation scores. This discrepancy results from the fact that reviewers do not usually provide scores agreeing with what is actually written in the review.
It is important to note that the parametrization of the scoring algorithm was not adjusted, retaining the original one designed for orientation classification. While this reduces classification accuracy and all associated metrics, this method is still competitive with the baseline methods (NB and SVM), and even those are still surpassed by the scoring algorithm classification in the binary case.
On the other hand, HS-SVM obtains the best results in comparison to the other methods. This stems from the flexibility provided by its SVM component, while at the same time benefiting from all the information provided by the scoring method. In general, according to the results of these experiments, HS-SVM surpasses the other methods, both in the evaluation classification task and in the orientation classification task.
Table 11 summarizes the results for each aspect and also shows the distribution of the aspects with respect to their orientation (positive, negative or neutral). The results shown correspond to average values, rounded to three decimal places.
Summary of results for aspect evaluation
Evaluating correctness is more complex here because there is no previous tagging of these scores. Based on the results obtained, there are more positive than negative evaluations in almost all the defined aspects. The average of the values obtained is positive in all aspects, except for the one concerning structure. However, the majority of the reviews are considered neutral towards most of these aspects. A neutral value is usually the result of the review not mentioning the aspect at all. Considering this, it must be noted that references are the most mentioned aspect according to these results.
A manual data review shows that the behavior observed may be due to the fact that one of the main problems arises in aspects referring to the structure of the papers evaluated. In addition, several reviews consider writing and discourse as good, even if the content or other aspects are negatively characterized. A graph with average aspect values is shown in Fig. 5 to illustrate the differences between the aspects.
Aspect average bar graph.
The format aspect is the least mentioned one in comparison with the other ones when considering the number of zero scores. The aspect most commented by reviewers is references. This makes sense because it is in agreement with the logical demands of a scientific paper, where the validity of the content is generally more important than the format itself.
On average, the results obtained for the set of papers are positive; however, the approach used is far from optimal because there is no mechanism to obtain the paper aspects automatically. So, there certainly are interesting elements which were not considered. Nonetheless, the aspects defined may include the main evaluation criteria when reviewing a paper, without considering the content and its contribution.
The implemented classifiers could also be extended. In particular, the scores of each aspect could be used as additional input for the classifier. Although they are calculated by following the same scheme as the general score, they could provide more information to the classifier, as an extension to what was done in the HS-SVM method.
Finally, the methods implemented here could be applied in similar sentiment analysis domains, such as other kinds of reviews (e.g., movies, hotels or products). However, this would entail adapting some of the dictionaries used in Algorithm 3.5. For example, positive and negative words may vary from one domain to another, but adversative clauses should remain the same.
This article has studied the application of sentiment analysis techniques in the domain of paper reviews. Specifically, it has applied supervised methods (NB and SVM), an unsupervised method (the scoring algorithm) and a hybrid approach (HS-SVM) in the classification of 382 (non-empty Spanish) reviews of research papers of an international conference.
The best performance is obtained with binary classification, corresponding to the simplest version of the problem studied. Performance gradually decreases as more classes are added (such as the neutral one or those corresponding to extreme values). In this sense, the HS-SVM method is more robust than the others in relation to the number of classes.
One of the most interesting results is the improvement obtained by combining the scoring algorithm and SVM. Basically, the score gives additional information to the SVM to facilitate the classification. Future work could deal with the extension and generalization of this method, also including the scores obtained for the aspects so as to further improve performance. By adding new semantic information (e.g. the score) to traditional machine learning methods, an improvement in sentiment classification results is expected compared with a pure method.
In the future, the performance of the algorithm that obtains the scores of each aspect must be evaluated. Its results were analyzed by observing those obtained in each review and the general average, but there is no specific metric as in the other methods evaluated. To better evaluate these results, it is necessary to have the tags for each aspect. These would have to be obtained manually by analyzing each review, although the weakness of such manual tagging is its subjectivity. So, automatic forms of generating tags for each aspect could be explored.
With respect to possible modifications of the models, one of the factors that could be considered in future work is individual reviewer bias (i.e. the reviewer may have a tendency to evaluate the papers lower or higher than the mean). In order to account for this bias, the current model would need to be modified. Also, another aspect that could be studied is an adequate handling of multi-lingual reviews, as well as the search of an appropriate parametrization in this case.
Concerning the experimental results, it is necessary to enlarge the list of features with more lexico-grammatical features, so that classifiers perform better and improved classification results are acquired. Also, expanding the data set with more reviews would be useful in future research, since the current data set is too small to apply some techniques that require more data to perform well.
As to the applicability of the proposal, future work could deal with the longitudinal evaluation of consistency between the review and the acceptance or rejection of the paper by each reviewer. This may allow a better evaluation of papers since it would be possible to recognize whether a reviewer is strict or not. Finally, since there are no other papers using scientific paper reviews as an application domain, the proposal in this study is a contribution and innovation for the field of sentiment analysis and opinion mining.
