Abstract
This paper presents an application of sentiment analysis to customer feedback data in the area of heavy equipment repair services. Sentiment analysis is used as part of a framework for a text mining-based Customer Loyalty Improvement Recommender System (CLIRS). To provide business users of the system with accurate predictions of customer satisfaction metrics, the original opinion mining algorithm needed to be improved. The paper presents the background of the proposed approach, the current techniques used to mine text data, and existing applications of sentiment analysis. We propose an aspect-based, taxonomy-driven approach for customized sentiment analysis. The contribution of this paper is the implementation and evaluation of the proposed methods, which improve the accuracy and coverage of the opinion mining algorithm. The improvements are illustrated with examples from the customer dataset that the algorithm now covers. Applying the proposed methods increased the algorithm’s accuracy from 92% to 96% and its coverage from 36% to 48%. This research is an attempt to handle well-known issues in natural language processing that are currently not handled by text mining algorithms, such as ambiguity and context, opinionated verbs and nouns, and subject recognition from pronouns. This is significant because the proposed techniques are generalizable to any application that uses a sentiment analysis algorithm.
Problem description
Sentiment analysis (opinion mining) is the field of study that analyzes people’s opinions, sentiments, evaluations, attitudes, and emotions from written language. It has become one of the most active research areas in natural language processing (NLP) and is widely studied in data mining, web mining and text mining. Sentiment analysis systems are being applied in almost every business and social domain. Opinions are central to almost all human activities and are key influencers of our behaviors. As J. Ellen Foster said in 1893: “Sentiment is the mightiest force in civilization…”.

Web-based CLIRS system – step 1. Choosing a company to analyze recommendations for improving its customers’ loyalty. The data of a chosen entity is merged with the data of its semantic neighbors.
Sentiment-analysis technologies have many potential applications. In recent years sentiment analysis applications have spread to almost every possible domain, from consumer products, services, healthcare, and financial services to social events and political elections.
Sentiment analysis is usually part of a larger text analytics framework. Within our research, text mining and sentiment analysis are part of a recommender system for improving customer loyalty.
This research is conducted on a customer dataset shared with our research group by a consulting company that specializes in improving the customer satisfaction and loyalty of its clients. Its clients are companies that provide repair services for heavy equipment. The dataset contains 400,000 records, where each record represents a structured questionnaire collected in a telephone survey of a randomly chosen customer of the client company. The heterogeneity of the data reflects different types of surveys, such as service, parts, rentals, etc. Additionally, the dataset describes 34 different companies scattered geographically across the United States and Canada (Fig. 1). The records also contain information about the company, details of the service or product being assessed by the survey, information about the surveyed customer, and survey details. The dataset contains mostly numerical scores for each question (called “benchmarks”), but also free-form text comments from the customers. Each survey record is labeled with the Net Promoter Score (NPS) status of the surveyed customer (Promoter/Passive/Detractor). Promoters are loyal customers of a company, who are likely to recommend it to others. Detractors, on the other hand, are dissatisfied customers, who “detract” from the company’s reputation and are unlikely to recommend it. There is also a group of customers called Passives, who are satisfied but unenthusiastic about recommending the company to others. The business goal is to increase the NPS metric, which is correlated with the revenue growth of a company. The NPS1
NPS®, Net Promoter® and Net Promoter® Score are registered trademarks of Satmetrix Systems, Inc., Bain and Company and Fred Reichheld.
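The sentence that explains the NPS is cut off above, so the following is only a minimal sketch based on the standard definition of the metric: the percentage of Promoters minus the percentage of Detractors. The status labels follow the dataset description; the function name and the sample data are hypothetical.

```python
from collections import Counter


def net_promoter_score(statuses):
    """Return NPS in percentage points: % Promoters minus % Detractors.

    Assumes the standard NPS definition; Passives count only in the total.
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no survey records")
    return 100.0 * (counts["Promoter"] - counts["Detractor"]) / total


# Hypothetical sample: 6 Promoters, 2 Passives, 2 Detractors.
sample = ["Promoter"] * 6 + ["Passive"] * 2 + ["Detractor"] * 2
nps = net_promoter_score(sample)
```

Under this definition, moving a customer from Detractor to Promoter raises the score twice as much as moving a Passive to Promoter, which is why the action rules discussed later target the Detractor to Promoter transition.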
The recommender system proposed in our research is built from the knowledge discovered through the data and text mining techniques ([39]). It utilizes data visualization techniques to improve the presentation of the recommendations and interaction with the system ([47]). Our unique approach quantifies the predicted NPS impact of the recommendations (Fig. 2). This quantification is based on the algorithm that calculates the NPS impact from the statistical characteristics of the extracted rules: confidence and support, and from the meta-actions that trigger these rules.

Web-based CLIRS system – step 2. The system allows for interaction with the user to adjust the feasibility of different business changes. In step 3 the system generates recommendations and ranks them according to the predicted impact on NPS and feasibility.
A new version of the recommender system is proposed, which is based entirely on text data: the free-form comments in which customers express their opinions. The rationale behind this work is that the business expects customer feedback to be primarily unstructured in the future. Our solution to this problem involves transforming the unstructured text data into a structured form using text mining techniques. Once the text has been transformed into a structured form, we apply the previous numerical approach to generate recommendations. Therefore, the critical element of the method is the transformation procedure based on sentiment analysis.
Sentiment analysis involves analyzing the subjective part of the text for its polarity, that is, whether it denotes a positive or a negative opinion. Analyzing the opinion can be performed at three levels: (1) document-level – extracting the overall sentiment of an entire comment; (2) sentence-level – sentiment detected for each sentence of a comment; or (3) aspect-level – sentiment analysis in reference to certain aspects or features of the product/service, such as price or staff ([14]).
Document-level sentiment analysis
Document-level sentiment analysis is mostly conducted with supervised learning techniques (classification), but some unsupervised methods are also applied ([10]). Sentiment classification can be formulated as a classification problem with two decision classes: positive and negative. Existing supervised methods, such as Naive Bayes and Support Vector Machines (SVM), can be readily applied to sentiment classification. The earliest work on automatic sentiment classification at the document level classified movie reviews from IMDB into two classes, positive and negative ([35]). Features used for classification can be terms and their frequencies, part-of-speech tags, opinion words, syntactic dependencies, and negation ([34]). Besides the binary prediction of sentiment, there were also models predicting the rating scores of reviews ([33]). In that case, the problem was formulated as a regression problem, as the rating scores are ordinal, and solved using SVM regression. There were also unsupervised methods proposed for sentiment classification: Turney et al. performed the classification based on some fixed syntactic patterns that are likely to be used to express opinions ([54]). The syntactic patterns were composed of part-of-speech tags.
Sentence-level sentiment analysis
At the sentence level, each sentence in the document is analyzed and classified as either positive or negative. The methods are similar to those used in document-level sentiment analysis. Sentence-level sentiment analysis can use rules based on the clauses of a sentence ([26]). Sentiment classification does not try to find the concrete features that were commented on. Therefore, its granularity of analysis is different from that of aspect-based sentiment analysis.
Aspect-based sentiment analysis
Opinions extracted at the document or the sentence level often do not provide the details of the sentiment. These details are needed by applications that require opinions on certain aspects or features of the object (what exactly people liked and did not like). Aspect-level sentiment analysis performs a finer-grained analysis. It is based on the idea that an opinion consists of a sentiment (positive or negative) and a target of the opinion. It helps to understand the sentiment problem better and to address mixed opinions, such as: “Although the service is not that great, I still love this restaurant”. This sentence has a positive tone overall, but with respect to the service aspect it is negative.
The major tasks in aspect-based sentiment analysis are: (1) aspect extraction (feature identification); (2) recognition of polarity towards a given aspect (positive/negative/neutral); (3) producing a structured summary of opinions about aspects, which can be further used for qualitative and quantitative analyses ([10]). The text mining process for web reviews involving aspect-based sentiment analysis and summarization is considered a pioneering work on feature-based opinion summarization ([19]). Three subtasks of generating feature-based summaries are defined: (1) identifying features of the product; (2) identifying opinionated sentences in reviews; (3) producing summaries. The system crawls the web for customer reviews and stores them in a database. Then, it extracts the most frequent features on which people expressed their opinion, using part-of-speech tagging. Association rule mining, based on the Apriori algorithm, is used to extract frequent itemsets as explicit product features. Frequent itemsets are itemsets whose support is at least equal to the minimum support ([9]). Secondly, opinion words are extracted using the resulting frequent features. Semantic orientations of the opinion words are determined based on WordNet and a dictionary of positive/negative words. The opinion sentences are identified as those that contain one or more feature words, as well as opinion words describing these features. Lastly, the orientation of each opinion sentence is identified and a final summary is produced. An opinion aggregation function is applied to determine the final orientation of the opinion on each object feature in the sentence. Our approach also handles negations and ‘but-clauses’.
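To make the notion of minimum support concrete, the sketch below runs a small level-wise (Apriori-style) search over sets of nouns, one set per review sentence. It is an illustration under stated assumptions, not the implementation used in ([19]): support is taken as the fraction of sentences containing the itemset, and the sample sentences are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    result = {}
    k = 1
    while current:
        frequent = {c: s for c in current if (s := support(c)) >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-candidates and keep only those
        # whose every k-subset is itself frequent (Apriori pruning).
        k += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


# Hypothetical noun sets extracted from four review sentences.
sentences = [
    {"battery", "screen"},
    {"battery", "price"},
    {"battery", "screen", "price"},
    {"screen"},
]
features = frequent_itemsets(sentences, min_support=0.5)
```

With a minimum support of 0.5, the pair "screen" and "price" is dropped because it occurs in only one of the four sentences, and the three-item candidate is pruned because one of its pairs is not frequent.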
Aspect extraction
When domain knowledge is not available from the experts directly to build a domain-specific aspect dictionary, aspects need to be automatically extracted as the first step. There are four main approaches for the aspect extraction: (1) extraction based on frequent nouns and noun phrases; (2) extraction by exploiting opinion and target relations (syntactical relations); (3) supervised learning; (4) topic modeling/unsupervised learning. There exists a variety of methods for aspect extraction, such as word n-grams, bi-grams, word cluster, casting, POS tagging, parse dependencies, relations, and punctuation marks. Supervised learning techniques include Hidden Markov Models (HMM) and Conditional Random Fields (CRF). Topic modeling is an unsupervised learning method that assumes each document consists of a mixture of topics and each topic is a probability distribution over words. There were two basic models proposed: pLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet allocation) ([4,18]). In another study, an unsupervised information extraction system called OPINE was developed ([36]). OPINE first extracts noun phrases from reviews and retains those with a frequency greater than an experimentally set threshold. Another work proposed a term extraction technique based on heuristics and selection algorithms. A multi-knowledge based approach for movie reviews and summarization was proposed ([62]). The method used the keyword list and dependency relation templates together to mine explicit feature-opinion pairs ([59]). Another method based on syntactical dependency relations was presented for extracting the product feature and identifying opinions that are associated with the product features in each sentence ([46]). Firstly, parsing and dependency analysis are performed as a pre-processing step. The reviews are parsed using the Stanford parser, resulting in a dependency tree ([11]).
While parsing the sentence, noun phrases are identified as product feature candidates using linguistic patterns. Then, for each product feature candidate in every dependency parse tree, related opinion words are searched for, amongst adjectives and verbs. A set of candidate feature-opinion pairs is generated and then the probabilistic-based model, based on maximum entropy is used to predict the relevance of each feature-opinion relation. Additionally, the authors proposed using the product ontology to resolve the problem of incompatible terminology – different customers referring to the same product features using different terminology. The ontology contains encoded semantic information and provides a source of shared and precisely defined terms.
After aspect extraction, an optional step is to group them into synonymous categories, where each category represents a unique aspect. For example, “call quality” and “voice quality” refer to the same aspect for phones. The first method to handle this topic was based on several similarity metrics, defined with string similarity, synonyms, and lexical distances measured using WordNet ([6]). A more sophisticated approach used publicly available hierarchies/taxonomies of products and the actual product reviews to generate the ultimate aspect hierarchies ([60]). In another work, a semi-supervised learning method was proposed to group aspect expressions into user-specified aspect categories ([61]).
The sentiment strength
The most important indicators of sentiments are sentiment words (opinion words). A list of such words is called a sentiment lexicon (opinion lexicon). Opinion lexicons are resources that associate words with a sentiment orientation. Over the years, researchers have designed numerous algorithms to compile such lexicons. These lexicons can be used not only for polarity detection but also for further supervised expansions of lexicons ([5]). Hu and Liu used a semantically labeled list of adjectives ([19]). This list was expanded with some nouns by Liu et al. ([24]). It consists of two lists: one has positive entries (2,003 entries in total) and another contains negative words (4,782 entries in total). The most popular sentiment dictionary is SentiWordNet, built on WordNet, a lexical database for the English language ([15]). In SentiWordNet, each sense of a word is assigned a pair of positive and negative polarity scores ([2]). Each entry in SentiWordNet comprises all possible parts of speech the word can appear in, all the meanings (senses) corresponding to each part of speech, and a pair of polarity scores associated with each sense. There are 28,431 sentiment-bearing entries (out of a total of 86,994 WordNet terms). The values of the polarity scores are on a continuous scale ranging between 0 and 1. The default algorithm, described on the SentiWordNet website, calculates an overall polarity for each sense of a word as the positive score minus the negative score. In the next step, it calculates a weighted sum of the overall polarities of all senses of the word, with the weights defined by the ranks of the senses. The polarity scores in SentiWordNet were generated automatically using a semi-supervised method ([13]). AFINN is a strength-oriented lexicon with positive words (564 in total) scored from 1 to 5 and negative words (964) scored from
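The default SentiWordNet aggregation described above can be sketched as follows. The text says only that the weights are defined by the sense ranks; this sketch assumes inverse-rank weights normalized to sum to one, so that the most common sense (rank 1) counts most. The function name and the sample scores are hypothetical and are not taken from the lexicon.

```python
def word_polarity(senses):
    """Aggregate per-sense scores into a single polarity for a word.

    `senses` is a list of (rank, positive_score, negative_score) tuples,
    with rank 1 denoting the most common sense.
    Assumption: each sense is weighted by 1/rank and the weights are
    normalized to sum to 1.
    """
    weighted = sum((pos - neg) / rank for rank, pos, neg in senses)
    norm = sum(1.0 / rank for rank, _, _ in senses)
    return weighted / norm if norm else 0.0


# Hypothetical word with a clearly positive first sense and a mildly
# negative second sense.
example_senses = [(1, 0.75, 0.0), (2, 0.0, 0.25)]
score = word_polarity(example_senses)
```

Because each per-sense polarity lies between -1 and 1 and the weights are normalized, the aggregated score stays in the same interval.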
Summary generation
In most sentiment analysis applications, there is a need to study the opinions of many people due to the subjective nature of opinions. Some form of summary is needed. Therefore, summarization is usually the next step after detecting aspect/opinion words and calculating polarity. It involves two tasks: (1) for each feature, associated opinion sentences are divided into positive and negative “buckets”, based on the calculated polarity. Optionally, the number of positive/negative comments about a particular feature can be displayed; (2) features are ranked according to the frequency of their appearances in the reviews ([19]). For the purposes of review summarization, a variety of visualization methods is often deployed. In its very basic form, for example, Amazon displays an average rating and several reviews next to it. Mousing over the stars brings up a histogram of reviewer ratings annotated with counts for the 5-star reviews, 4-star reviews, etc. A variety of other techniques for sentiment visualization were proposed, such as bar charts, rose plots, box plots, and two-dimensional visualizations ([17,24,29]).
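The two summarization tasks above can be sketched in a few lines. This is an illustrative sketch, assuming that opinion sentences have already been paired with a feature and a signed polarity; the function name and the sample sentences are hypothetical.

```python
from collections import defaultdict


def summarize(opinions):
    """Bucket opinion sentences per feature and rank features by frequency.

    `opinions` is an iterable of (feature, polarity, sentence) tuples,
    where polarity > 0 is positive and polarity < 0 is negative.
    Returns a list of (feature, buckets) pairs, most frequent feature first.
    """
    buckets = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinions:
        if polarity > 0:
            buckets[feature]["positive"].append(sentence)
        elif polarity < 0:
            buckets[feature]["negative"].append(sentence)
    return sorted(
        buckets.items(),
        key=lambda item: len(item[1]["positive"]) + len(item[1]["negative"]),
        reverse=True,
    )


# Hypothetical opinion sentences.
summary = summarize([
    ("price", -1, "too expensive"),
    ("staff", 1, "friendly staff"),
    ("price", -1, "the charge was too high"),
    ("price", 1, "fair price"),
])
```

The counts of positive and negative comments for each feature are simply the lengths of the two buckets.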
The tools for sentiment analysis
Edward Abbey once said: “Sentiment without action is the ruin of the soul”. Indeed, it is important to discuss real-world applications of opinion mining/sentiment analysis. Growing demand for text analytics tools has raised the profile of specialized vendors such as Attensity, OdinText, Clarabridge, and Kana, which perform trended and basic root-cause analysis of customers’ comments ([8,32,51]). SAS Institute, IBM SPSS, SAP (Insight) and Tibco (Insightful) offer tools for analyzing text for predictive insights. Lexalytics, Nstein and Teragram, a division of SAS, offer text mining specialized for sentiment analysis ([43,49,50]). There are also solutions, such as Verint Systems, that attempt to recognize the importance of issues based on voice audio recordings and volume analysis ([55]). Rosetta Stone is a solution using IBM SPSS text analytics software to analyze answers to open-ended questions in surveys of current and potential customers ([37]). It uses the resulting insights to drive decisions on advertising, marketing, product development, and strategic planning, as well as to identify the strengths and weaknesses of products. Choice Hotels and Gaylord Hotels both applied text analytics software from Clarabridge to quickly extract sentiment from thousands of customer satisfaction surveys gathered each day ([8]). The software recognizes positive and negative comments and associates them with specific hotel locations, facilities, service, rooms, and employee shifts. The feedback results in an immediate customer service response (through calls or letters) to acknowledge and apologize for the problems. More importantly, the system allows chain and facility managers to track trends in order to spot problems and best practices.
Besides business, opinions are of substantial significance in politics. An important application is understanding what voters are thinking ([22,30]).
Sentiment-analysis and opinion-mining systems also have an important role as enabling technologies for other systems. One such application is an augmentation of recommendation systems. Question answering is another area where sentiment proves to be useful.
Quantifying the economic impact of sentiment
Reviews influence both the purchasing decisions of other customers, who read the reviews, as well as decisions of product manufacturers regarding product-development, marketing, and advertising. However, the subjective perception of “the influenced” and the reality might differ. Therefore, a key element is to understand the real economic impact of sentiment expressed in surveys and reviews. The results of such analysis can be used by companies to estimate how much effort and resources should be allocated to address the issues.
There have already been economics studies conducted to find out whether the polarity has a measurable, significant influence on customers ([44,45]). The most common approach is to use hedonic regression to analyze the value and the significance of different item features to a function, such as a measure of utility to the customer, using historical data ([42]). Specific economic functions under examination include revenue, revenue growth, stock trading volume, etc. Another approach attempts to assign a “dollar value” to various adjective-noun pairs, adverb-verb pairs, and similar lexical configurations ([16]). It is important to note that different subsegments of the consumer population may react differently. Additionally, in some studies, positive ratings have an effect but negative ones do not, and in other studies, the opposite effect is seen. However, in most studies, a positive correlation effect is observed between survey polarity and economic effect, and the correlation is statistically significant ([1,3,7,12,25,27,53]).
Methodology
This section presents our approach for customer satisfaction improvement, which is based on mining customer data for actionable knowledge. We also propose a method for transforming the text data into a structured form of a data table, using opinion mining. This approach illustrates how text data can be mined for actionable knowledge, using the sentiment analysis approach. We also propose quantifying the effects of changes in the sentiment on the NPS metric.
Actionable knowledge discovery
The customer dataset is mined for actionable knowledge using algorithms for action rule discovery. The action rule concept was first proposed by Ras and Wieczorkowska in 2000, and has since been investigated further in application areas such as business, healthcare, and automatic music indexing and retrieval ([39–41,48,57]). Action rules offer a new approach in machine learning, which solves problems that traditional methods, such as classification or association rules, cannot handle. The purpose is to analyze data to improve the understanding of it and to seek specific actions (recommendations) that enhance the decision-making process. An action shows a way of controlling or changing some of the values of the attributes for a given set of objects to achieve desired results.
Action rules
An action rule is defined as a rule that describes a transition that may occur within objects from one state to another, with respect to the decision attribute, as defined by the user ([40]). The decision attribute is a distinguished attribute, while the rest of the attributes are partitioned into stable and flexible attributes.
In logic nomenclature, action rule is defined as a term:
Formally defined, an action rule is built from atomic action sets. Atomic action term is an expression
By action sets, we mean the smallest collection of sets such that:
If t is an atomic action term, then t is an action set. If If t is a candidate action set and for any two atomic actions
By an action rule, we mean any expression
Meta actions
Our approach uses the concept of meta-actions. Meta-actions are the triggers used for activating action rules and making them effective. The concept of a meta-action was initially proposed in Wang et al. and later defined in Ras et al. ([38,52,56]). Meta-actions are understood as higher-level actions. While an action rule is understood as a set of atomic actions that need to be made to achieve the expected result, meta-actions are the actions that need to be executed in order to trigger the corresponding atomic actions. The relations between meta-actions and the changes of the attribute values they trigger can be modeled using either an influence matrix or an ontology. In our domain, we assume that one atomic action can be triggered by more than one meta-action. A set of meta-actions triggers an action rule that consists of atomic actions covered by these meta-actions. Also, some action rules can be invoked by more than one set of meta-actions. The goal is to select a set of meta-actions (M) that triggers a larger number of actions and thereby brings a greater effect in terms of NPS improvement. The effect of applying M is defined as the product of its support and confidence: $Effect(M) = sup(M) \cdot conf(M)$.
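The relation between meta-actions, atomic actions and action rules can be illustrated with a small sketch. The influence matrix, the rules and all numbers below are hypothetical. How the support and confidence of a set of meta-actions are aggregated from the rules it triggers is not stated in this section; the sketch assumes that the support is the sum of the supports of the triggered rules and the confidence is their support-weighted average.

```python
# Hypothetical influence matrix: meta-action -> atomic actions it triggers.
# An atomic action is written as (attribute, value_before, value_after).
INFLUENCE = {
    "train technicians": {("Technician Knowledge", -1, 1)},
    "review pricing": {("Price Competitiveness", -1, 2)},
    "add quality checks": {("Service Quality", -2, -1)},
}

# Hypothetical action rules with their support and confidence.
RULES = [
    {"actions": {("Technician Knowledge", -1, 1)},
     "support": 40, "confidence": 0.8},
    {"actions": {("Service Quality", -2, -1), ("Price Competitiveness", -1, 2)},
     "support": 25, "confidence": 0.9},
]


def triggered_rules(meta_actions, influence=INFLUENCE, rules=RULES):
    """Rules whose atomic actions are all covered by the given meta-actions."""
    covered = set()
    for meta_action in meta_actions:
        covered |= influence[meta_action]
    return [rule for rule in rules if rule["actions"] <= covered]


def effect(meta_actions, influence=INFLUENCE, rules=RULES):
    """Effect of a set of meta-actions: support times confidence.

    Assumption: support is summed over triggered rules and confidence is
    the support-weighted mean of their confidences.
    """
    fired = triggered_rules(meta_actions, influence, rules)
    support = sum(rule["support"] for rule in fired)
    if support == 0:
        return 0.0
    confidence = sum(r["support"] * r["confidence"] for r in fired) / support
    return support * confidence


effect_training = effect({"train technicians"})
effect_pricing_only = effect({"review pricing"})
effect_all = effect(set(INFLUENCE))
```

In this example, reviewing pricing alone triggers no rule, because the second rule also requires a change in service quality; only the combination with the quality checks covers all of its atomic actions.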
Action rules for sentiment
The first task in applying action rules to sentiment analysis is transforming the text data into a structured form. A data table is built by applying the transformation algorithm based on text mining. In such a table, each row represents a comment given in a survey. Each column represents an aspect of the service/product that is relevant for sentiment analysis. These aspects were defined by the business domain experts and encoded into a domain dictionary. Examples of such aspects in the considered domain of equipment repair are “Service Quality”, “Technician Knowledge”, “Staff Attitude”, etc. A value in a given row and column represents the sentiment value extracted for the aspect given by the column from the text comment given by the row. In other words, each survey is represented as a vector of numbers, where each number represents the polarity of the sentiment towards the aspect given by its position in the vector. The sentiment value can be positive or negative. Additionally, the sentiment can be neutral, strong or very strong. Therefore, the values in the table are on a discrete scale in the range
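A minimal sketch of this transformation is given below. It assumes that the aspect-level sentiment has already been extracted for each comment. The aspect names come from the text; the survey IDs, the sentiment values and the use of `None` for aspects that a comment does not mention are assumptions made for illustration.

```python
# Aspect columns named in the text; a real domain dictionary has more.
ASPECTS = ["Service Quality", "Technician Knowledge", "Staff Attitude"]


def to_row(aspect_sentiments, aspects=ASPECTS):
    """Turn {aspect: polarity} for one comment into a fixed-length vector.

    Aspects that the comment does not mention are stored as None
    (an assumption; the text does not say how missing values are encoded).
    """
    return [aspect_sentiments.get(aspect) for aspect in aspects]


# Hypothetical extraction results keyed by survey ID.
extracted = {
    101: {"Service Quality": 2, "Staff Attitude": -1},
    102: {"Technician Knowledge": 1},
}
table = {survey_id: to_row(values) for survey_id, values in extracted.items()}
```

Each row of `table` has one position per aspect, so the result can be mined with the same rule discovery algorithms as the numerical benchmark scores.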

Sample action rule for the aspect-based sentiment
The attributes of atomic actions in the rule represent aspects in sentiment analysis. The interpretation of the rule is that a certain action in sentiment towards a precise aspect needs to be undertaken, in order to change the customer from the detractor to the promoter status. In the example in Listing 1, if the sentiment towards service quality changes from very negative to negative, towards technician knowledge – from negative to positive, and towards price competitiveness – from negative to very positive, then a customer changes from being a detractor to being a promoter of a company.
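The rule interpreted above can be encoded as a small data structure. The sketch below is an illustration only: the numeric coding (very negative = -2, negative = -1, positive = 1, very positive = 2), the field names and the sample record are assumptions, not the notation of the listing.

```python
# Assumed numeric coding of sentiment levels.
VERY_NEGATIVE, NEGATIVE, POSITIVE, VERY_POSITIVE = -2, -1, 1, 2

# The sample rule: each aspect maps to (value before, value after).
RULE = {
    "changes": {
        "Service Quality": (VERY_NEGATIVE, NEGATIVE),
        "Technician Knowledge": (NEGATIVE, POSITIVE),
        "Price Competitiveness": (NEGATIVE, VERY_POSITIVE),
    },
    "decision": ("Detractor", "Promoter"),
}


def applies_to(rule, record):
    """True if the record matches the starting state of the rule."""
    return record["NPS Status"] == rule["decision"][0] and all(
        record.get(aspect) == before
        for aspect, (before, _) in rule["changes"].items()
    )


def apply_rule(rule, record):
    """Return the state the rule predicts once its atomic actions are made."""
    updated = dict(record)
    for aspect, (_, after) in rule["changes"].items():
        updated[aspect] = after
    updated["NPS Status"] = rule["decision"][1]
    return updated


# Hypothetical survey record in the starting state of the rule.
record = {
    "Service Quality": VERY_NEGATIVE,
    "Technician Knowledge": NEGATIVE,
    "Price Competitiveness": NEGATIVE,
    "NPS Status": "Detractor",
}
predicted = apply_rule(RULE, record) if applies_to(RULE, record) else record
```

The number of records for which `applies_to` holds corresponds to the objects that support the left-hand side of the rule.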
In the third step of our method, the extracted patterns are incorporated into the knowledge base of the recommender system. Having incorporated the rules, the recommender system will quantify the expected impact of the recommendations on NPS.
We adopted the following procedure for sentiment analysis and text summarization ([21]):
(1) identifying opinion sentences and their orientation with localization; (2) summarizing each opinion sentence using discovered dependency templates; (3) summarizing opinions based on identified feature words; (4) generating meta-actions with regard to given suggestions.
The process of mining customers’ comments uses sentiment analysis, text summarization and feature identification based on guided folksonomy. The domain-specific dictionaries were built with the business domain experts. Our approach results in generating suggestions (sets of meta-actions), which are the basis of the built recommender system. Our approach for aspect-based sentiment mining follows the schema described earlier ([19]). We use aspect-based sentiment analysis, where we extract an opinion that consists of a sentiment (positive or negative) and a target of the opinion, that is, a specific aspect or feature of the object. Therefore, our approach offers a more detailed analysis than the most widely adopted document-level or sentence-level approaches to sentiment analysis.
Opinion identification
The first step in our algorithm is identifying an opinion sentence, based on the occurrence of an opinion word. We use a dictionary (list) of positive and negative words (adjectives). Context (localization) is also considered in the algorithm, by using additional context dictionaries, specific to our domain. For example, in the comment “the charge was too high”, “high” is recognized, according to the standard adjective lists, as neither positive nor negative. However, this comment still presents an insightful opinion about discontent about pricing. Therefore, “high” is added to the context list of pricing as a negative.
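The lookup described above can be sketched as follows. The word lists are tiny hypothetical stand-ins for the real dictionaries; only the entry for “high” in the pricing context comes from the text. Tokenization by whitespace is a simplification.

```python
# Hypothetical general-purpose opinion word lists.
POSITIVE = {"good", "great", "friendly"}
NEGATIVE = {"bad", "slow", "rude"}

# Context (localization) dictionary: aspect -> {word: polarity}.
# "high" is the example from the text; "expensive" is hypothetical.
CONTEXT = {"pricing": {"high": -1, "expensive": -1}}


def opinion_words(sentence):
    """Return (word, polarity, context_aspect) for each opinion word found.

    The context aspect is None when the word comes from the general lists.
    """
    found = []
    for token in sentence.lower().split():
        word = token.strip(".,!?;:\"'")
        if word in POSITIVE:
            found.append((word, 1, None))
        elif word in NEGATIVE:
            found.append((word, -1, None))
        else:
            for aspect, words in CONTEXT.items():
                if word in words:
                    found.append((word, words[word], aspect))
    return found


def is_opinion_sentence(sentence):
    """A sentence is opinionated if it contains at least one opinion word."""
    return bool(opinion_words(sentence))


charge_opinions = opinion_words("The charge was too high.")
```

Without the context dictionary, the sentence about the charge would not be recognized as opinionated at all, which is the coverage gap that the localization step addresses.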

Web-based CLIRS system – step 3. The user can analyze fine-grained comments that were used by the system to generate the recommendations. The comments are summarized with regard to the aspect (e.g. “service done correctly”) and divided into positive and negative.
In the next step, the sentences identified as opinionated in the first step are aggregated into segments. Feature-opinion pairs are generated based on grammatical dependencies between features and opinion words. We use the Stanford NLP library for the recognition of the grammatical dependencies ([11]). A dependency relationship describes a grammatical relation between a governor word and a dependent word in a sentence. There is wide coverage of different dependency templates (about 50 defined dependencies). Therefore, we are able to detect the most frequently occurring syntactical relations associated with opinion words. Additionally, we added code to our algorithm that detects negation and ‘but’-clauses.
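The pairing of features and opinion words can be illustrated on dependency triples that have already been produced by a parser. The sketch below does not call the Stanford library; it takes hand-written (relation, governor, dependent) triples as input and handles only three relations from the Stanford typed dependencies (`amod`, `nsubj` and `neg`). The lexicon is hypothetical, and ‘but’-clauses are not covered.

```python
# Hypothetical opinion lexicon: word -> polarity.
OPINION = {"great": 1, "good": 1, "high": -1, "slow": -1}


def feature_opinion_pairs(dependencies, lexicon=OPINION):
    """Extract (feature, opinion_word, polarity) from dependency triples.

    Each triple is (relation, governor, dependent):
      amod(noun, adjective)    e.g. "great technician"
      nsubj(adjective, noun)   e.g. "the charge was high"
      neg(word, negation)      flips the polarity of the governed word
    """
    negated = {gov for rel, gov, _ in dependencies if rel == "neg"}
    pairs = []
    for rel, gov, dep in dependencies:
        if rel == "amod" and dep in lexicon:
            feature, opinion = gov, dep
        elif rel == "nsubj" and gov in lexicon:
            feature, opinion = dep, gov
        else:
            continue
        polarity = lexicon[opinion]
        if opinion in negated:
            polarity = -polarity
        pairs.append((feature, opinion, polarity))
    return pairs


# "The service was not great": triples written by hand for illustration.
negated_pairs = feature_opinion_pairs([
    ("det", "service", "the"),
    ("nsubj", "great", "service"),
    ("cop", "great", "was"),
    ("neg", "great", "not"),
])
```

Working on the dependency relations rather than on word adjacency lets the negation attach to the opinion word even when other words stand between them.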
Aspect identification
Having extracted segments in the previous step, the feature words of aspects are identified using a supervised pattern mining method ([23]). The aspect identification step uses the part-of-speech (POS) tags produced by the NLP library together with the aspect dictionaries.
Segment clustering
Opinion summarization is used in many sentiment analysis works to generate a final review summary of the discovered features and opinions and to rank them according to their appearances in the reviews ([19]). In our approach, we remove redundant extracted segments and cluster the remaining segments into different classes. The feature clustering is based on the domain-specific dictionary of seed words/phrases. To assign a segment to the corresponding class, the list of seed words is checked for the feature word of the segment or for the base form of that word.
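The clustering step can be sketched as a dictionary lookup. The seed words below are hypothetical, and the `base_form` helper is only a rough stand-in for a real lemmatizer.

```python
# Hypothetical seed-word dictionary: class name -> seed words.
SEEDS = {
    "Technician Knowledge": {"technician", "mechanic", "knowledge"},
    "Price Competitiveness": {"price", "charge", "invoice"},
}


def base_form(word):
    """Rough stand-in for a lemmatizer: lowercase and strip a plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def cluster(segments, seeds=SEEDS):
    """Assign (feature_word, segment_text) pairs to seed-word classes.

    Exact duplicates are dropped first; segments whose feature word
    matches no seed word are left unassigned.
    """
    clusters = {name: [] for name in seeds}
    seen = set()
    for feature, text in segments:
        key = (feature.lower(), text.lower())
        if key in seen:
            continue
        seen.add(key)
        for name, words in seeds.items():
            if feature.lower() in words or base_form(feature) in words:
                clusters[name].append(text)
                break
    return clusters


clusters = cluster([
    ("technicians", "technicians were knowledgeable"),
    ("charge", "charge was too high"),
    ("charge", "charge was too high"),
    ("weather", "weather was bad"),
])
```

The plural “technicians” is matched through its base form, the repeated segment about the charge is kept only once, and the segment about the weather is not assigned to any class.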
Generating the sets of meta actions
To generate meta-actions, each feature class is divided into several subclasses. Each subclass is related to a specific aspect of that feature. These aspects have been defined in the domain-specific dictionaries. Correspondingly, the last step of the algorithm is generating sets of meta-actions. These are the actual output recommendations provided to the business users. In our approach, we also display the comments that correspond to the meta-actions (Fig. 3). This allows for a fine-grained analysis of the problem. Additionally, we use visualization techniques in the form of an expandable table to display positive and negative opinions for each aspect. Each comment is additionally annotated with its survey ID. Therefore, the comment can be traced back to the specific survey and its context. The negative opinions displayed in our visualization provide insight into issues that need to be addressed, and the positive ones provide insight into practices that should be reinforced.
The improvements in the sentiment mining algorithm
The goal of improving sentiment mining is to provide a better performance of the text-mining based recommender system. This subsection describes, in more detail, the proposed methods and the changes introduced to the original algorithm of opinion mining. The following strategies are proposed in our research to improve the sentiment mining algorithm in terms of accuracy and coverage:
Adding opinion dictionaries
The original algorithm was based on the adjective list only. We added new dictionaries for sentiment: SentiWordNet and AFINN, described in the previous section. By using additional dictionaries we expect to increase coverage of the algorithm, in particular, the coverage of opinion words.
Using nouns and verbs as opinion words
Traditionally, in most opinion mining algorithms, adjectives and adverbs are used as opinion words. However, there are also comments in which verbs and nouns are the indicators of the sentiment. We added new dictionaries, SentiWordNet and AFINN, which contain not only opinionated adjectives but also verbs and nouns with an assigned polarity score. We added code to the algorithm that handles nouns and verbs as opinion words. One potential problem with this approach is that if a word is first detected as an opinion word, it will not be considered further in the algorithm as an aspect word. The alternative strategy is to change the sequence of steps in the algorithm so that the feature word is detected first and the opinion word second.
Increasing the sentiment polarity scale
As in the original algorithm, we adopt a dictionary-based sentiment analysis for both the opinion and aspect recognition. However, the new algorithm needs to detect not only the polarity of sentiment (negative or positive) but also its strength. Expanding the sentiment scale from
AFINN contains 564 positive and 964 negative words ([31]). Each word in the AFINN list is assigned a discrete value in the range:
Experimental results
Test data
The experiments were conducted on the subset of data from 2016 with about 80,000 records. In the built recommender system we use the column Notes for Promoter Score as a primary source of text data. An alternative strategy to maximize the text content available is to concatenate all text data available for each record. The text from all the “Notes” columns can be merged for that purpose – Interviewer Notes, Resolution Notes, General Notes and Notes Benchmark for each benchmark. Even then, about 6,000 records (7.5%) have no associated text comments. To compare the machine versus human sentiment recognition we have identified a representative subset of 70 text comments. We have chosen comments from 35 customers identified as Promoters and 35 comments from Detractors of one company. Each record was manually annotated in terms of a sentiment value towards each aspect that was mentioned in a column.
The tested algorithm
The procedure that tests the modified opinion mining algorithm in the experimental setup consists of the following steps:
1. Preparing the file with text comments (XLS or XLSX). Each row represents a text comment from the “Notes for Promoter Score” column of the original dataset. Alternatively, the text can be a concatenation of all the “Notes” columns in a survey.
2. The file with text is preprocessed – file reader iterates through rows in the spreadsheet.
3. Processing the current comment, which may contain many sentences.
4. Processing the current sentence using Stanford Parser Treebank Language Pack ([11]).
5. Tagging the words in the current sentence with the Part-of-Speech labels using Stanford POS tagger ([11]).
6. Creating dependency list – grammatical dependency relations based on predefined templates in Stanford package, GrammaticalStructureFactory ([11]).
7. Identifying opinion words in a sentence, using: opinion word lists (Hu&Liu / AFINN / SentiWordNet), negations list (“not”, “neither”, etc.), conjunctive words lists (“and”, “but”, “therefore”, etc.), strong words list (“really”, “very”, “much”, etc.), strong positive (“best”, “great”, “excellent”, etc.) and strong negative words lists (“worst”):
   (a) Check if the current word is a negation (if yes, set index for negation word).
   (b) Check if the current word is conjunction (if yes, set index for conjunction word).
   (c) Check if the current word is a strong opinion word (if yes, set index for a strong word).
   (d) Check if a word is in the strong positive/strong negative list set the polarity accordingly to 2 or
   (e) Check if the current word is present in positive/negative adjective list:
      (i) Set polarity accordingly to the list:
      (ii) Consider cases of negation, comparative forms of adjectives and conjunction to change the original polarity.
      (iii) If a valid strong opinion word is found in relation to the adjective, increase polarity strength to
   (f) If a word is not found in adjective lists, look for its presence in AFINN dictionary:
      (i) If a word is found in AFINN, retrieve its polarity and use mapping to
      (ii) Consider cases of negation, comparative forms of adjectives and the conjunction to change the original polarity.
      (iii) If a valid strong opinion word found in relation to the adjective, increase polarity strength to
   (g) If a word is not found in adjective lists nor in AFINN, look for it in SentiWordNet dictionary:
      (i) If a word is found in SentiWordNet, the case for adjectives and adverbs: Retrieve the polarity from the SentiWordNet dictionary considering the word POS tag, and use mapping function to convert the continuous numbers from
      (ii) Consider cases of negation, comparative forms of adjectives and adverbs and conjunction to change the original polarity.
      (iii) If a valid strong opinion word is found in relation to the adjective, increase polarity strength to
      (iv) The case for verbs and nouns as opinion words: retrieve the polarity from the SentiWordNet using the POS tag and map to the scale. For verbs, find its base form – and look in the dictionary for all its possible base forms. Consider cases of negation, conjunction and strength words to change the original polarity.
8. Opinion sentence summarization – finding words related to the found opinion word, using dependency relations.
9. Finding a feature keyword related to the opinion word based on the previous step’s results.
10. Feature aggregation – assigning the segment to the feature category.
11. Summarizing results – grouping segments by orientation, feature classes.
12. Generating meta-actions from oriented segments.
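The dictionary lookup order in the opinion-word step can be illustrated with a minimal Python sketch. It is not the implementation used in CLIRS: the word lists are toy stand-ins, the -2..2 target scale and both mapping thresholds are assumptions, and a two-token window replaces the dependency relations used by the actual algorithm to find negations and strength words. All function and variable names are hypothetical.

```python
from typing import List, Optional, Tuple

# Toy stand-ins for the real resources; every entry below is illustrative.
STRONG_POSITIVE = {"best", "great", "excellent"}
STRONG_NEGATIVE = {"worst"}
ADJECTIVES = {"friendly": 1, "good": 1, "slow": -1, "poor": -1}
AFINN = {"outstanding": 5, "problem": -2}  # signed integer valence
SENTIWORDNET = {("helpful", "a"): 0.6, ("fix", "v"): 0.3, ("delay", "n"): -0.4}
NEGATIONS = {"not", "neither"}
INTENSIFIERS = {"really", "very", "much"}


def to_scale(value: float, strong_from: float) -> int:
    """Map a signed score to -2..2; the threshold is an assumption."""
    if value == 0:
        return 0
    sign = 1 if value > 0 else -1
    return sign * (2 if abs(value) >= strong_from else 1)


def lookup(word: str, pos: str) -> Optional[int]:
    """Return the polarity from the first resource that contains the word."""
    if word in STRONG_POSITIVE:
        return 2
    if word in STRONG_NEGATIVE:
        return -2  # assumed mirror of the strong positive value
    if word in ADJECTIVES:
        return ADJECTIVES[word]
    if word in AFINN:
        return to_scale(AFINN[word], strong_from=3)
    if (word, pos) in SENTIWORDNET:
        return to_scale(SENTIWORDNET[(word, pos)], strong_from=0.5)
    return None


def polarity(tokens: List[Tuple[str, str]], i: int) -> Optional[int]:
    """Polarity of token i, adjusted by strength words and negations nearby."""
    word, pos = tokens[i]
    base = lookup(word.lower(), pos)
    if base is None or base == 0:
        return base
    # Simplification: look at the two preceding tokens instead of dependencies.
    window = {w.lower() for w, _ in tokens[max(0, i - 2):i]}
    if window & INTENSIFIERS:
        base = 2 if base > 0 else -2
    if window & NEGATIONS:
        base = -base
    return base
```

Checking the resources in a fixed order means that a word present in several dictionaries always takes its polarity from the first one that contains it.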
Evaluation metrics
The metrics of our primary interest with regard to evaluating the sentiment analysis algorithm are:
Accuracy – the number of correctly recognized and classified opinions divided by the number of all opinions extracted.
Coverage – the number of opinions extracted divided by the total number of comments (in this experiment, 70 comments).
Weighted measure – a measure combining the two previous measures, calculated as 0.5 ∗ Accuracy + 0.5 ∗ Coverage. The assumption is that these two metrics are equally important in the overall assessment of the sentiment analysis algorithm.
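As a worked illustration of these definitions, the following Python sketch computes the three metrics. The function names are hypothetical. The example counts (10 correct out of 14 extracted, over 70 comments) are the figures reported in the Results subsection for the original adjective-based algorithm.

```python
def accuracy(correct: int, extracted: int) -> float:
    """Share of extracted opinions that were classified correctly."""
    return correct / extracted if extracted else 0.0


def coverage(extracted: int, total: int) -> float:
    """Share of comments (or opinions) for which an opinion was extracted."""
    return extracted / total if total else 0.0


def weighted(acc: float, cov: float, w: float = 0.5) -> float:
    """Equal-weight combination of accuracy and coverage by default."""
    return w * acc + (1 - w) * cov


acc = accuracy(correct=10, extracted=14)
cov = coverage(extracted=14, total=70)
score = weighted(acc, cov)
print(f"accuracy={acc:.0%} coverage={cov:.0%} weighted={score:.0%}")
```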
We assume that a human can recognize sentiment with 100% accuracy and can recognize every existing sentiment (maximal coverage). The maximal coverage for the dataset is not necessarily 100%, as not all the comments are opinionated or contain an opinion about any of the aspects. In the first test case, a comment is considered covered if at least one opinionated aspect in the comment was recognized (coverage per comment). In the second test case, coverage is measured by dividing by the actual number of opinionated aspects (coverage per opinion).
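The difference between the two coverage variants can be shown on a toy example. The sketch below assumes that the manual annotation and the algorithm output are both available as a mapping from comment ID to the set of opinionated aspects. This data layout and the function names are illustrative and are not taken from the system.

```python
from typing import Dict, Set

Annotations = Dict[int, Set[str]]


def coverage_per_comment(gold: Annotations, found: Annotations) -> float:
    """A comment counts as covered if at least one annotated aspect was found."""
    covered = sum(1 for cid, aspects in gold.items() if aspects & found.get(cid, set()))
    return covered / len(gold) if gold else 0.0


def coverage_per_opinion(gold: Annotations, found: Annotations) -> float:
    """Share of all annotated opinionated aspects that were found."""
    total = sum(len(aspects) for aspects in gold.values())
    hits = sum(len(aspects & found.get(cid, set())) for cid, aspects in gold.items())
    return hits / total if total else 0.0


# Hypothetical annotations: comment 3 is not opinionated at all.
gold = {1: {"technician", "price"}, 2: {"service"}, 3: set(), 4: {"parts", "service"}}
found = {1: {"technician"}, 2: set(), 3: set(), 4: {"parts", "service"}}

print(coverage_per_comment(gold, found), coverage_per_opinion(gold, found))
```

In this toy data, two of the four comments are covered, while three of the five annotated opinions are recognized, so the two variants generally give different values for the same output.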
Test cases
The base case is the human sentiment recognition (Hum). The original algorithm for sentiment analysis is based on adjective lists only (Adj). The third case tested is based on the SentiWordNet dictionary only. This test case is further divided into subcases with and without using nouns and verbs as opinion words (S+V/NN and S−V/NN, respectively). The fourth test case is based on the AFINN dictionary only (AFINN). Each case was tested in isolation, using each dictionary separately. Then, a combined strategy involving all three dictionaries was designed and tested (All). The combined strategy is described in more detail in the previous subsection.
Results
In our test data, 62 out of 70 comments were opinionated. Therefore, the coverage of the base case (Hum) is 89% (see Table 1). In Hum the sentiment is recognized with 100% accuracy. The original algorithm (Adj) covered only 20% of the comments (Cov -b): it failed to recognize the sentiment in 48 out of the 62 opinionated comments. The accuracy of the original algorithm was 71% (Acc -b), as it correctly recognized 10 out of the 14 comments it covered.
Table 1 compares the results for different approaches to sentiment mining. It shows metrics before (-b) and after (-a) introducing additional modifications to the algorithm. The changes in the algorithm and dictionaries resulted in covering twice as many comments as before (Cov -a). The accuracy improved from 71% to 93% for Adj (Acc -a). When comparing different dictionaries, AFINN proved to have the worst coverage, but it was very accurate. Adj and S+V/NN have similar coverage (39%), but the latter has worse accuracy. Therefore, its weighted metric is 60% vs 66% for Adj. S−V/NN covered less (31%) but was more accurate than the version with verbs and nouns as opinion words (95% versus 81%). Based on these results, a final strategy was adopted for the combined approach. The current word is checked first in adjective lists, secondly in AFINN, and thirdly in SentiWordNet. Lastly, the strategy checks for verbs and nouns as opinion words (see subsection “The tested algorithm” for the detailed description). The combined strategy resulted in 43% coverage and 83% accuracy. The coverage increased, but at the cost of accuracy.
Table 2 presents results comparing different approaches, with metrics calculated based on the number of opinionated aspects rather than at the comment level. The combined approach (All) resulted in 38% coverage and 95% accuracy in recognizing all opinionated aspects across all comments. As previously, among the single-dictionary approaches the adjective list resulted in the highest coverage (36%), with 92% accuracy. The second best, in terms of coverage, was S+V/NN (33%). At the same time, this approach was worse in accuracy than Adj (88% vs 92%), S−V/NN (96%) and AFINN (95%).
In the third experimental setup, additional adjustments were introduced to both the dictionary and the algorithm. Table 3 presents the final results and compares the accomplished machine sentiment recognition with human recognition (Hum). By introducing modifications, the coverage increased to 57%, as calculated per comment, and to 48%, as calculated per opinion. However, the accuracy decreased to 88%, as calculated per comment. On the other hand, when calculating per opinion, it increased to 96%. Overall, the algorithm was improved significantly. The weighted metric increased from 63% to 72%, when measuring per comment, and from 67% to 72%, when measuring per opinion. Nevertheless, there is still a gap versus human recognition, so there is room for further improvements (see the next section “Discussion”). Another conclusion from the experiments is that it is quite challenging to improve both coverage and accuracy at the same time. Usually, when improvements bring higher coverage, this is at the cost of accuracy, and vice versa. In general, we observed improvement in relation to the initial algorithm: the accuracy was improved from 92% to 96% and coverage from 36% to 48%.
Table 1. Comparing the accuracy and coverage for the sentiment analysis using different approaches – metrics calculated per comment
Table 2. Comparing the accuracy and coverage for the sentiment analysis using different approaches – metrics calculated per opinion
Table 3. The results of the final combined strategy to sentiment analysis, compared to human recognition
Other experiments evaluating the improvements were related to (1) measuring the sparsity of the data table built from the opinion mining (Table 4 and Table 5) and (2) action rule mining (Table 6). These two aspects are important with regard to the built text-based recommender system. The changes introduced to the sentiment algorithm were primarily aimed at decreasing the sparsity of the opinion table, which results from the transformation based on sentiment mining. This data table is further used for action rule mining, which is in turn used to generate recommendations for customer loyalty improvement. Therefore, higher accuracy and coverage of the opinion mining result in a greater predicted impact of the recommendations on NPS.
Table 4. The sparsity of opinion table before and after modifications of the opinion mining algorithm – the case for the dataset of company 16
The sparsity of the opinion table
The sparsity of the opinion table is reported as the number of cells with non-NULL values divided by the total number of cells, so a higher percentage means a less sparse table. A non-NULL value means that sentiment towards an aspect (the column of the cell) was recognized by the algorithm applied to the comment given in a survey (the row of the cell). Initially, the resulting opinion table was very sparse, with about 1% of values present. After modifications to the sentiment analysis algorithm, the share of values present rose to about 3%. Tables 4 and 5 present the details of the comparison before (-b) and after (-a) the modifications, for client companies 16 and 3, respectively. The actual names of the companies are confidential. Sparsity (Sp) is calculated by dividing by the number of rows in the table. The total sparsity (SpT) is calculated by dividing by the total number of cells in the table. The percentage value in the tables is interpreted as the relative occurrence of the extracted opinions. Tables present details for each sentiment value (in range
The results show that the sparsity of opinion tables for both cases (company 3 and company 16) was reduced significantly, about three times versus the initial sparsity.
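The two sparsity figures can be sketched in Python as follows. The opinion table is assumed to be a list of rows (surveys) with one cell per aspect and None for a missing sentiment. Computing Sp per aspect column is one possible reading of "dividing by the number of rows" and is an assumption, as are the function names and the toy data.

```python
from typing import List, Optional

OpinionTable = List[List[Optional[int]]]


def total_fill(table: OpinionTable) -> float:
    """SpT: non-NULL cells divided by the total number of cells."""
    cells = [cell for row in table for cell in row]
    return sum(cell is not None for cell in cells) / len(cells) if cells else 0.0


def column_fill(table: OpinionTable, col: int) -> float:
    """Sp for one aspect (assumed reading): non-NULL cells in the column divided by the number of rows."""
    return sum(row[col] is not None for row in table) / len(table) if table else 0.0


# Hypothetical table: 4 surveys (rows) x 3 aspects (columns).
table = [
    [None, 2, None],
    [None, None, None],
    [-1, None, None],
    [None, 1, None],
]

print(total_fill(table), column_fill(table, 1))
```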
Table 5. Sparsity of opinion table before and after modifications of the opinion mining algorithm – the case for the dataset of company 3
Table 6. Coverage of action rules extracted in the previous approach (-ben) and in the new approach from the opinion table, before and after introducing the modifications to the sentiment mining algorithm. Coverage, times of rule extraction, and the total number of rules were measured for Client3 and Client16 datasets
To test the quality of the actionable knowledge extracted from the opinion tables, the coverage of the extracted patterns was measured. The coverage of action rules is understood here as the number of distinct customers in the dataset matched by the extracted rules. Several scenarios were tested (Table 6). The main goal of this experiment was to compare the coverage of rules obtained from numerical data tables with question scores (benchmark tables, -ben) with the coverage obtained in the new approach, which uses a data table resulting from the transformation of the unstructured data. The second goal was to compare the coverage before (-before) and after (-after) introducing the proposed changes to the sentiment mining algorithm. The test cases also differentiate between using and not using the default (0) sentiment value in the opinion table (-with0/-no0). When counting the default neutral sentiment values (-with0), the initial coverage of action rules was 10.43% and 8.2%, for company 3 and 16, respectively. Without the default sentiment value, the initial coverage of the rules was 7.83% and 4.92%, respectively, for these companies. These numbers were significantly lower than in the previous approach (-ben), where action rules covered 85.2% of customers for both companies. After introducing modifications to the opinion mining algorithm, the coverage of the action rules increased to 14.8% (-no0)/15.7% (-with0) and 13.1%/16.4% for company 3 and company 16, respectively. This is considered a significant improvement, although the coverage is still much lower than in the case of benchmark tables.
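The coverage measure defined above can be illustrated with a short Python sketch. It makes simplifying assumptions: each rule is reduced to the conditions a survey record must satisfy (aspect mapped to required sentiment value), the measure is expressed as a share of all distinct customers, and the records and rules are invented for illustration. The actual action rules also describe the recommended change of values, which is omitted here.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[object]]
Rule = Dict[str, int]


def matches(record: Record, rule: Rule) -> bool:
    """A record matches if it has the required value for every aspect in the rule."""
    return all(record.get(aspect) == value for aspect, value in rule.items())


def rule_coverage(records: List[Record], rules: List[Rule]) -> float:
    """Distinct customers matched by at least one rule, divided by all distinct customers."""
    customers = {r["customer"] for r in records}
    matched = {r["customer"] for r in records if any(matches(r, rule) for rule in rules)}
    return len(matched) / len(customers) if customers else 0.0


# Hypothetical opinion-table rows; customer C1 answered two surveys.
records = [
    {"customer": "C1", "technician": -1, "service": None},
    {"customer": "C1", "technician": -1, "service": 1},
    {"customer": "C2", "technician": 2, "service": None},
    {"customer": "C3", "technician": None, "service": -2},
    {"customer": "C4", "technician": None, "service": None},
]
rules = [{"technician": -1}, {"service": -2}]

print(rule_coverage(records, rules))
```

Counting distinct customers rather than matched rows keeps a customer with several surveys from inflating the coverage.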
Discussion
This section discusses the examples of text covered by our approach, identifies remaining challenges and concludes with the main results within this study and future work. First, the enhancements introduced to the sentiment mining algorithm are discussed with examples of text from the tested dataset.
Verbs and nouns as opinion words
Using verbs and nouns as opinion words, based on polarity calculated in SentiWordNet, resulted in covering more comments and patterns. For example, the following comment was covered: “He stated the
The sentiment strength
Another important change was adding the detection of the strength of the sentiment. It was implemented with: (1) introducing dictionaries that assign the scale of opinion polarity to words; (2) detecting words that “strengthen” the polarity of words – for example “very”, “really”, etc. Examples of the comments with strong sentiment recognized by our algorithm are printed below:
“Chuck stated that he wasn’t there when they did it, but the mechanic who was stated that the technician did “Mike stated the field mechanic was very knowledgable and a
“He stated that the technician was
→ technician very friendly: 2 → did great job: 2 → handled difficult situation: “Paul said that the service man, Chris, was “He stated Charlie is a
Dealing with ambiguity and context
Some words cause ambiguity and depend heavily on the context. For example, initially, the word “good” was assigned to the category “Technician Attitude”, in the context of the “Technician” feature (e.g. “good mechanics”). However, after a more careful semantic analysis of the comment, it was concluded that the opinion holder meant rather “Technician Knowledge”. Therefore, the word “good” was reassigned to this category in the context of “technician”. Another example from the analyzed test dataset was: “He stated the
Two opinion words for one aspect
The experiments revealed cases, where two opinion words were used to describe one feature, e.g. “
Expanding dictionaries for aspects
In the proposed approach, dictionary-based feature recognition is used. It means that the detection of feature words related to the discovered opinion words is based on the predefined libraries of seed words for features and more fine-grained aspects of the features. These were developed manually by looking through a large sample of comments. However, they are scalable and can be expanded as the uncovered examples are identified. For example, a common word occurring in comments is “experience”: “He stated this past
Dealing with pronouns
Another problem is that the opinion is often expressed in relation to pronouns: “they”, “he”, etc. For example, the sentence: “
Consistency of aspect dictionaries
Other shortcomings of the algorithm were identified and resolved. For example, each keyword in the feature category must correspond to an aspect category. Otherwise, not all opinionated segments will result in generating recommendations. For example, for the segment printed below, no corresponding meta-actions were generated: subSegments features after aggregation: (stated machine new=machine,service, having condition fixed=„ they already having=having„ fixed on unit=fixed,service)
segment orientation:stated machine new: 1.
The reason was that the seed word “machine” for the “service” feature was not defined in any aspect of the service. The same applied to the segment “Stated that they provide fast service”: it results in the opinionated segment “provide fast service: 1”, but not in any meta-action, as “fast” was not defined as a seed word for the category “Service Timeliness”. Last but not least, it was observed that there are opinions about parts in the service surveys, for example: “He stated
Adjusting sentiment dictionaries
Experiments revealed that sentiment scores are not accurately calculated in all cases. For example, the SentiWordNet dictionary uses a semi-supervised method to label words with sentiment. The word “good” has a calculated weighted polarity score of 0.63 (which is mapped to 2 in our scale), while the word “great” has a score of 0.25, which after mapping corresponds to 1. Therefore, “good” ends up being more strongly positive than “great”. Such inconsistencies are resolved by adjusting the sentiment dictionary manually. In our system, we changed the entry in the dictionary for the word “good”: the PosScore was changed from 0.75 to 0.25. A similar case was the adverb “well” (its mapped sentiment based on SentiWordNet was 2).
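The effect of such a manual adjustment can be sketched in Python. The two weighted scores are the ones quoted above. The mapping threshold of 0.5 and the adjusted weighted score for “good” are assumptions, since the text reports only the change of the PosScore entry and not the resulting weighted score. The names are hypothetical.

```python
from typing import Dict, Optional

# Weighted polarity scores as quoted in the text.
SWN_WEIGHTED: Dict[str, float] = {"good": 0.63, "great": 0.25}

# Assumed weighted score for "good" after the manual dictionary edit.
ADJUSTED: Dict[str, float] = {"good": 0.25}


def to_scale(score: float, strong_from: float = 0.5) -> int:
    """Map a signed score to -2..2; the 0.5 threshold is an assumption."""
    if score == 0:
        return 0
    sign = 1 if score > 0 else -1
    return sign * (2 if abs(score) >= strong_from else 1)


def polarity(word: str, overrides: Optional[Dict[str, float]] = None) -> Optional[int]:
    """Mapped polarity of a word, with manual overrides taking precedence."""
    score = (overrides or {}).get(word, SWN_WEIGHTED.get(word))
    return None if score is None else to_scale(score)


print(polarity("good"), polarity("great"), polarity("good", ADJUSTED))
```

Keeping the corrections in a separate override table, rather than editing the dictionary file in place, would make the manual changes easy to review and to revert.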
Remaining challenges
Although sentiment words are important for the opinion mining algorithms, building systems based only on them will not give good results. Natural language recognition is much more complex, and involves the following issues:
Words can have opposite orientations in different application domains.
A sentiment word might not express an opinion in question (interrogative) sentences and conditional sentences (e.g. “Can you tell me which Sony camera is good?”, “If I can find a good camera in the shop, I will buy it”).
Sarcastic sentences are hard to deal with.
Sentences without sentiment words can also imply opinions (e.g. “This dishwasher uses a lot of water”).
Sentiment analysis, like any NLP task, must handle coreference, negation, disambiguation and comparative sentences. Despite the introduced changes and improvements, the coverage is still unsatisfactory. Correspondingly, the coverage of the action rules extracted from the table built on opinion mining still leaves room for improvement. This section discusses the identified challenges, illustrated with examples from the customer dataset that are poorly handled by the algorithm.
Non-standard opinion patterns
Generally, all the opinions that do not represent the “opinion word + feature aspect” pattern are not detected. This is a significant group of comments that were not recognized by the algorithm. Examples of such comments, currently not handled by the algorithm, are:
Expressing opinions by describing the situation/incident (“storytelling”) without using actual opinion words – humans can infer the sentiment from the story.
Implicit opinions – similar to the case above, but more general. For example, humans can use objective words like numbers to express implicit opinions.
Complex and comparative sentences – resulting from the limitations of the syntactical dependencies recognition.
An opinion and an aspect in one word, for example, “repaired”.
Using opinion words for a purpose other than expressing opinions, for example for expressing the desired state or expectations.
The examples of comments from our dataset that proved problematic for the algorithm are discussed below.
Complex and comparative sentences
The sentiment recognition is limited to the syntactical dependencies that can be recognized. The experiments revealed that some syntactical relations are not recognized, especially in complex and comparative sentences. For example, in the sentence “…he
Implicit opinions
Implicit opinions can be expressed without using explicit opinion words. Examples of implicit opinions in the dataset are: “He stated that they had to make two trips and
Feature and opinion in one word
Currently, the algorithm does not handle situations when the same word is used to describe the feature as well as opinion. As soon as the word is flagged as an opinion word, other words are searched for dependencies with this word. Therefore, sample comments, such as: “He said they found a number of issues and
Another problem is that sometimes opinions are expressed with a single noun (without using opinion words) – as an answer to the question asked to customers to receive additional comments: “What made you Detractor?” or “What could have been improved?”. As a result, the frequent pattern of a comment is an actual answer: “Because of +
Opinion words in different context
Opinion words can be used for purposes other than expressing a sentiment towards certain aspects. For example: “Bob said at $117 an hour for service he
Ambiguity
The algorithm handles ambiguity related to different aspects poorly in some cases. For example, in “Kurt again stated it was the not being kept in the loop, and
Handling misspellings
As the text comments were transcribed from telephone conversations with the customers, misspellings are quite common in the dataset: about 7% of the comments contained misspellings. This is especially problematic when a feature/aspect or opinion word is misspelled. Because the approach is dictionary-based, the comments below are not recognized as opinionated, owing to misspellings in the opinion words:
“Mike stated the field mechanic was very “Kevin said they know what they are doing and they are “Eric stated that he would like better “AJ stated that they are “Jeff stated that their
This results in lower coverage than it would be with the correctly spelled opinion words. The proposed method is to add a pre-processing step that handles misspellings. Such an approach can be based on dictionaries of commonly misspelled words.
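A minimal Python sketch of such a pre-processing step is shown below. The correction table is a hypothetical example of a dictionary of commonly misspelled words, not a resource used in the system. Only whole words are replaced, and the capitalization of the first letter is preserved.

```python
import re

# Hypothetical dictionary of commonly misspelled words (lowercase keys).
CORRECTIONS = {
    "knowledgable": "knowledgeable",
    "curteous": "courteous",
    "responce": "response",
}


def normalize_spelling(comment: str) -> str:
    """Replace known misspellings before the comment is tagged and parsed."""

    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        corrected = CORRECTIONS.get(word.lower())
        if corrected is None:
            return word
        # Keep an initial capital letter if the original word had one.
        return corrected.capitalize() if word[0].isupper() else corrected

    return re.sub(r"[A-Za-z]+", fix, comment)


print(normalize_spelling("The field mechanic was very knowledgable."))
```

A fixed lookup table corrects only the misspellings it lists, so it would need to be extended as new variants are found in the comments.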
Phrases, idiomatic expressions and phrasal verbs
Another category of cases poorly handled by the algorithm is text with phrases, common slang expressions, idiomatic expressions, and phrasal verbs. Examples from the tested dataset, not handled by the algorithm in opinion detection, include:
“He stated they were right there
“Rob stated they
“Jeff stated that they were able to
“Chris stated did work
Dealing with proper nouns and entity recognition
As mentioned previously, pronouns are often used in relation to opinion words. Besides pronouns, also names of technicians are used, for example: “Paul said that the service man,
Conclusions
This paper presented a number of methods introduced to improve the accuracy and coverage of the sentiment analysis. The proposed methods have been evaluated and illustrated with examples of the real comments covered by the algorithm. This study resulted in improving the sentiment algorithm from 92% to 96% in its accuracy, and from 36% to 48% in its coverage. In the future, we plan to conduct experiments measuring other metrics of interest, such as Mean Absolute Error (MAE) or F-measure. Most importantly, the introduced methods brought significant improvement for the text mining-based recommender system CLIRS. After comparing and analyzing results with different sentiment dictionaries, the final strategy for the sentiment analysis algorithm was incorporated into the CLIRS framework. The improvements were measured with the sparsity of the sentiment table and the coverage of action rules extracted from this table. The sparsity was reduced about three times, and the coverage of action rules roughly doubled. Correspondingly, the quality of the recommendations of the built text mining-based system improved. However, there is still room for improvement when compared with the human capabilities of sentiment recognition. The remaining challenges in NLP were described. New methods for improvement were identified in the discussion and will provide a basis for future work within this research.
