A rule-based approach for aspect extraction in Hindi reviews

Abstract

The rapid growth of technology and of the population has made millions of people active participants on social networking forums. The experiences these participants share on different websites are highly useful, not only to customers making decisions but also to companies seeking to sustain their businesses. Sentiment analysis is an automated process for analyzing public opinion on a given topic. Identifying the targets of users' opinions in text is referred to as the aspect extraction task, and it is a crucial part of sentiment analysis. The proposed system is a rule-based approach to extracting aspect terms from reviews. A sequence of patterns is created based on the dependency relations between a target and its nearby words. The rules are evaluated on a benchmark dataset for Hindi shared by Akhtar et al., 2016. The results show that the proposed approach significantly improves aspect extraction over the baseline approach reported on the same dataset.

Keywords

Sentiment analysis (SA), aspect term, sequential pattern, dependency parser (DP), part of speech (POS)

1 Introduction

In the last few years, the growth of social networking sites and the availability of extensive data have created a platform for the public to discuss their ideas and opinions about different entities. Research shows that progress in digitization has enabled users to analyze public opinion about products or services rather than relying on the opinion of a handful of people such as family, relatives or friends. Sentiment analysis (SA) [1, 2] is an automated process that analyzes a given sentence, finds the opinion in it and classifies that opinion as positive, negative or neutral. SA has powerful applications in many areas, such as identifying customer attitudes towards businesses and organizations, tracking trends in product or service purchases based on online reviews, polling public opinion during election campaigns and predicting the stock market for investors. Of the three levels of SA (document level, sentence level and aspect level), the most important is the aspect level. A single review sentence can express different opinions about different aspect terms. For example, consider the sentence “ |” (“The battery life and display screen of this phone are very good but the voice quality is not good”). The possible aspect terms are (battery life), (display screen) and (voice quality), for the opinion words (good) and (not good). Aspect-level SA identifies the aspects related to each product or topic and then classifies the opinion towards each individual aspect. The first subtask, identifying the aspects, is called aspect term extraction [3], and the proposed work focuses on this subtask. In aspect term extraction it is always necessary to identify the relation between the aspect term and the words associated with it. The proposed work explores the strength of part of speech (POS) patterns and dependency relations to frame pattern sequences based on the relationship between an aspect and its immediately related words.

The paper is organized as follows: Section 2 reviews work related to aspect term extraction; Section 3 presents the details of the proposed system; Section 4 gives the corpus details; Section 5 presents the experimental results and analysis; and Section 6 presents conclusions and future work.

2 Literature survey

The development of technology and wider internet reach have increased the number of web users. In response to demand, many websites now provide information in Hindi on travel, entertainment, music, electronic devices and so on. With this large amount of information available as opinions or views, analyzing user views to improve the quality and service of a product becomes a valuable task. Sentiment analysis techniques can be applied to analyse the opinion related to each aspect term of an entity. Most sentiment analysis research has focused on the English language, and Hindi has been less explored because of the inadequacy of the annotated data and tools required for Hindi text sentiment classification. Earlier research on sentiment analysis focused mainly on identifying opinion words and performing document-level or sentence-level sentiment analysis based on the polarity of those words. The focus was on using SentiWordNet or building opinion-based dictionaries, which opened a way towards sentiment classification [4, 5] for Indian languages. This section describes some of the prominent works on sentiment analysis for the Hindi language.

Most of the classification approaches in sentiment analysis have been based on SentiWordNet [6]. Das and Bandyopadhyay were the first to work on the Bengali language. The authors developed SentiWordNet for three Indian languages: Hindi, Telugu and Bengali. The approaches used were SentiWordNet based, corpus based, dictionary based and, finally, an online game based approach to create and validate the developed SentiWordNet [7]. For sentiment classification, a hybrid approach [8] that combines in-language, machine translation and resource based approaches has been tried. The work reported accuracies of 60.31% and 78.14% for Hindi SentiWordNet based and in-language sentiment analysis of Hindi documents, respectively. A graph-based method [9] that generates a subjectivity lexicon for Hindi, consisting of adjectives and adverbs with polarity scores taken from Hindi WordNet, is another approach in the field. Its authors reported 70.4% accuracy on the classification of reviews. WordNet also plays an important role in cross-lingual sentiment analysis, in which a training corpus in one language is used to predict sentiment in another test language and machine translation bridges the two languages. Pushpak Bhattacharyya and team attempted a similar approach [10] and used word senses as features of a supervised classifier. They reported accuracies of 72% and 84% for Hindi and Marathi sentiment classification, respectively. Other lines of research contributed an annotated dataset for Hindi movie reviews and an improved HindiSentiWordNet (HSWN) with more opinion words [11]. The combined approach to improving HSWN, incorporating negation and discourse handling, achieved an accuracy of 80.21% on the Hindi movie dataset. Another improvement [12] to HindiSentiWordNet added synonyms of opinion words and applied the approach to the hotel and movie domains. The work reported a sentiment classification accuracy of 77% for the movie domain and 88% for the hotel domain. A different approach contributed scalable lexicons [13] that categorize words as positive, negative, neutral or ambiguous and ensure ample coverage of all emotions. Because sufficient web resources and the tools required for Hindi text classification are not available, very little work has been done on supervised and semi-supervised sentiment classification for Hindi.

Akhtar et al. [14] built a benchmark dataset for aspect-level sentiment analysis. They aggregated data from 12 domains and manually annotated each review with aspect term, aspect category and aspect polarity. They used a conditional random field (CRF) with features such as POS tags and prefix and suffix information for aspect term extraction, and a support vector machine (SVM) for the sentiment classification task. The results show an average F-measure of 41.07% for aspect term extraction and an accuracy of 54.05% for sentiment classification. In [15], document-level and aspect-level sentiment classification of Hindi movie reviews was tried. For document-level sentiment analysis, the work considered different combinations of adjectives, adverbs and verbs. Each review contains positive and negative opinions about different aspects of an item, so finding the document-level sentiment polarity is not an easy task. For aspect-level sentiment analysis, aspects are represented as vectors, opinion words are extracted for each aspect, and a SentiWordNet based approach is applied to determine the sentiment of the opinion words. In recent work, deep learning techniques have shown good results on sentiment analysis tasks. Garg et al. tried a neural network approach [16] that first applies a convolutional neural network for feature extraction, then applies a multi-objective genetic algorithm (MOGA) framework to select optimized feature sets, and finally uses an SVM with a non-linear kernel for sentiment classification. They used datasets from different domains (Twitter and product reviews) in English and Hindi. A CRF combined with a Bi-LSTM network was tried for the aspect extraction task by Hetal and team [17]. The authors claimed to have used novel features in the CRF model, and the sequence labelling model reported an improvement of almost two points in F-measure over the state of the art.

Aspect-level work in Hindi is minimal compared to English. Most of the research on Hindi is lexicon based. Recently, supervised methods and deep learning techniques have brought attention to the aspect-level sentiment classification task. Although many works on sentiment analysis for Hindi have begun, the unavailability of Hindi corpora, word sense disambiguation, word order, morphological variation, spelling variation, and the lack of POS tagging tools and sentiment-based dictionaries are some of the issues faced in the sentiment analysis task. The conclusion is that fine-grained analysis of aspect terms in Hindi reviews is at a very early stage. The proposed work focuses on a rule-based approach applied to sequence patterns. The idea of building rules from sequential patterns is based on the work of Cheah and team [18], in which the system extracted sequences of patterns based on the relation between aspect and opinion words. Building on their work, this work creates sequential patterns and rules with the help of POS patterns and dependency relations to extract aspect terms from review sentences. The following section discusses the proposed methodology, with details on sequence patterns and rule generation.

3 Proposed methodology

The goal of the work is to extract single-word and multi-word aspect terms for the given product review domains. The methodology carries out the task in three steps: (1) pre-processing, (2) identifying patterns from review sentences, and (3) building rules based on the extracted sequential patterns. Figure 1 presents the flow of the proposed approach, which has a pre-processing module, a pattern extraction module and a rule creation module.

Fig. 1

Schematic diagram of the proposed approach for aspect term extraction.

3.1 Pre-processing module

In the pre-processing stage, the collected review data goes through removal of special characters and spaces, tokenization, POS tagging and dependency parsing. The Hindi dependency parser is used for POS tagging and dependency parsing. Fig. 2 shows sample parser output for a review sentence. To simplify the pattern extraction and rule generation modules, the output of the dependency parser is stored in suitable data structures, as depicted in Table 1. For ease of understanding and implementation, the labels p0, p1, p2, p3, p4 and p5 represent the fields of the parser output: p0 denotes the word id (the position of the word in the sentence), p1 the word (the original word in the sentence), p2 the lemma (the root form of the original word), p3 the POS tag, such as NN (noun), NNP (proper noun), JJ (adjective), VM (verb) or PSP (postposition relation), p4 the parent id (the id of the word on which the current word depends), and p5 the dependency label (the Hindi karaka cases in notation such as k1, k2, k3, etc.). Details about Hindi Karaka Rachana can be found in TreeBanks for Indian Languages [19].
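The six fields can be held in a small record type. The following is a minimal sketch rather than the authors' implementation: it assumes the parser output is available as one tab-separated line per token with the fields in the order p0 to p5. The `Token` class, the `parse_sentence` helper, the transliterated sample words and the sample dependency labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One row of parser output, using the field labels p0..p5 from the text."""
    p0: int  # word id (position of the word in the sentence)
    p1: str  # surface word
    p2: str  # lemma
    p3: str  # POS tag, e.g. NN, JJ, VM, PSP
    p4: int  # parent id
    p5: str  # dependency label, e.g. k1, k2


def parse_sentence(lines: List[str]) -> List[Token]:
    """Parse tab-separated rows (assumed layout: p0..p5) into Token records."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        p0, p1, p2, p3, p4, p5 = line.split("\t")
        tokens.append(Token(int(p0), p1, p2, p3, int(p4), p5))
    return tokens


# Hypothetical, transliterated sample rows (not taken from the dataset).
sample = [
    "1\tcamera\tcamera\tNN\t3\tk1",
    "2\taccha\taccha\tJJ\t3\tk1s",
    "3\thai\thai\tVM\t0\tmain",
]
tokens = parse_sentence(sample)
print([(t.p1, t.p3) for t in tokens])
```

Keeping the labels p0 to p5 as attribute names makes it straightforward to relate later pattern and rule code back to the field descriptions above.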

Fig. 2

Output of Hindi Dependency Parser.

3.2 Pattern extraction module

Patterns are generated from the frequent relations between POS tags and dependency labels. They are based on the associations of a noun with an opinion word, a noun, a verb, a postposition relation, a preposition, a pronoun, a quantifier, a demonstrative pronoun and so on. In building patterns, it is necessary to find the head word and its dependent words in a phrase. Using the POS tags and the relations between elements, consecutive sequences of three or four words are chosen for pattern creation. A total of 18 patterns are generated, and Table 2 shows the sequence patterns by POS tag.

Table 1
Rule Field Details of the Parsed Output

3.3 Rule creation module

This section shows how the rules are constructed from the patterns chosen from review sentences. Using POS tags and dependency labels, a total of 18 rules are defined. These rules are categorized into 11 classes. For each class, Table 3 shows the associated pattern, the expected aspect term for each rule and an example of each rule. In the discussion of each rule a sequence of symbols is used, in which w_{i-1} and w_{i-2} denote the previous words and w_{i+1} and w_{i+2} the next words relative to the head word w_i. The rules first extract the head word, and the other words are then extracted based on the head word. Each class is explained below.

Table 2
Details of sequence patterns

Sl.No	First_word	Second_word	Third_word	Fourth_word
1	NN	JJ	VM	-
2	JJ	NN	PSP	-
3	JJ	NN	NN	PSP/VM/JJ
4	NN	NN	VM/PSP	-
5	PSP/JJ	NN	VM	-
6	NN	CC	NN	-
7	PSP	NN	PSP	-
8	PSP	NN	NN	PSP
9	PSP	NN	NN	VM/JJ
10	PSP	XC	NN/NNP	PSP
11	NN	PRP	JJ/VM	-
12	QF	NN	PSP/RP/PRP	-
13	NN	QF	NN/VM	-
14	NN	QC	NN/JJ	-
15	QC	NN	PSP/VM	-
16	XC	NN/NNP	RP/VM/PSP	-
17	NN	RP	JJ/VM/RP	-
18	DEM	NN	PSP	-
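Matching a sentence against the sequence patterns of Table 2 amounts to sliding a window over its POS tags. The sketch below illustrates this under stated assumptions: only four of the 18 rows are encoded, each slot is modelled as a set of allowed tags, dependency labels are ignored, and the names `PATTERNS` and `find_patterns` as well as the sample tag sequence are hypothetical.

```python
from typing import List, Tuple

# A few rows of Table 2; each slot lists the POS tags allowed at that position.
PATTERNS = {
    1: [{"NN"}, {"JJ"}, {"VM"}],
    2: [{"JJ"}, {"NN"}, {"PSP"}],
    4: [{"NN"}, {"NN"}, {"VM", "PSP"}],
    18: [{"DEM"}, {"NN"}, {"PSP"}],
}


def find_patterns(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (pattern number, start index) for every window that matches."""
    hits = []
    for number, slots in PATTERNS.items():
        width = len(slots)
        for start in range(len(tags) - width + 1):
            window = tags[start:start + width]
            if all(tag in allowed for tag, allowed in zip(window, slots)):
                hits.append((number, start))
    return hits


# Hypothetical tag sequence for a short review sentence.
tags = ["DEM", "NN", "PSP", "NN", "JJ", "VM"]
print(find_patterns(tags))  # [(1, 3), (18, 0)]
```

Representing alternatives such as VM/PSP as a set keeps one table row as one entry, so the remaining rows could be added without changing the matching loop.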

Table 3

Classes derived from rules framed

Class 1. Adjective association:

If a noun or noun phrase in a sentence is directly associated with an adjective (the opinion word), then the noun or noun phrase before or after the adjective is the aspect term. A description of this rule for adjective association is shown in Table 3. Two more rules are written under this class.
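A simplified reading of the adjective association can be sketched as a check on the neighbours of each noun. This is an illustration only: it treats "directly associated" as immediate adjacency in the tag sequence, handles single nouns rather than noun phrases, and does not consult dependency labels. The function name and the transliterated example are hypothetical.

```python
from typing import List


def adjective_association(words: List[str], tags: List[str]) -> List[str]:
    """Return nouns that have an adjective immediately before or after them."""
    aspects = []
    for i, tag in enumerate(tags):
        if tag != "NN":
            continue
        prev_tag = tags[i - 1] if i > 0 else None
        next_tag = tags[i + 1] if i + 1 < len(tags) else None
        if prev_tag == "JJ" or next_tag == "JJ":
            aspects.append(words[i])
    return aspects


# Hypothetical transliterated sentence: "camera accha hai" (the camera is good)
words = ["camera", "accha", "hai"]
tags = ["NN", "JJ", "VM"]
print(adjective_association(words, tags))  # ['camera']
```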

Class 2. Noun association:

There are many cases where noun phrases tend to be aspect terms. If a noun phrase in a sentence is followed by a verb, an adjective or a PSP relation, then the noun phrase is extracted as an aspect term.

Class 3. Verb association (main verb VM):

If a noun in a sentence is directly associated with a main verb, then the associated noun is an aspect term.

Class 4. Conjunction association:

If two nouns in a sentence are joined by a conjunction and one of them is identified as an aspect, then the other should also be marked as an aspect term.
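The conjunction rule propagates an aspect label from one noun to the other across an NN CC NN sequence. A minimal sketch follows, assuming aspects are tracked as token positions and that the association is an immediate NN CC NN adjacency; the function name and example are hypothetical.

```python
from typing import List, Set


def propagate_over_conjunction(tags: List[str], aspects: Set[int]) -> Set[int]:
    """Extend a set of aspect positions across NN CC NN sequences."""
    result = set(aspects)
    changed = True
    while changed:  # repeat so chains such as "A and B and C" are fully covered
        changed = False
        for i in range(1, len(tags) - 1):
            if tags[i] != "CC":
                continue
            left, right = i - 1, i + 1
            if tags[left] == "NN" and tags[right] == "NN":
                if left in result and right not in result:
                    result.add(right)
                    changed = True
                elif right in result and left not in result:
                    result.add(left)
                    changed = True
    return result


# Hypothetical sentence: "camera aur battery acche hain" (camera and battery are good)
tags = ["NN", "CC", "NN", "JJ", "VM"]
# Position 2 ("battery") is assumed to have been found by another rule.
print(sorted(propagate_over_conjunction(tags, {2})))  # [0, 2]
```

Because the rule only fires when one of the two nouns is already an aspect, it is best applied after the other rules have produced their candidates.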

Class 5. Sambandh karak association:

In the case of ‘sambandh karak’, if a ‘sambandh karak’ or PSP relation like or or occurs on both sides of a noun or noun phrase, then that noun or noun phrase is the aspect term. Table 3 depicts one of these rules. Three more rules are written under this class.

Class 6. Pronoun association:

If there is a pronoun such as , , (his) or (her) or (they) or (you) or (her) or (its), (mine), then the nearest noun is taken as the aspect term. In many reviews people use pronouns in association with an opinion word. To identify the aspect term when a pronoun is associated with a noun, the rule looks for the adjective or verb that follows the pronoun.

Class 7. Quantifier association ( /a lot, /some, /some):

All such quantifiers are marked as QF. If there is a quantifier, the associated noun is taken as an aspect term, and which noun is chosen depends on the position of the quantifier. Two rules are formed under this class and are depicted in Table 3.

Class 8. Cardinal association:

If a numeral or number represents a quantity of a noun, then the associated noun is taken as the aspect term. Two rules are formed under this class.

Class 9. Compound noun association:

In Hindi, a compound noun is a series of two or more nouns that are not separated by any postposition relation such as /ka, /ki or /ko. To extract the aspect term, this rule takes two consecutive nouns as a compound noun.
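Joining two consecutive nouns can be sketched as a single left-to-right pass over the tags. The code below is an assumption-laden illustration: it counts both NN and NNP as nouns, joins exactly two tokens per compound as the rule states, and uses a hypothetical function name and transliterated example.

```python
from typing import List

NOUN_TAGS = {"NN", "NNP"}  # assumed set of noun tags for this sketch


def compound_nouns(words: List[str], tags: List[str]) -> List[str]:
    """Join each pair of adjacent nouns into one candidate aspect term."""
    terms = []
    i = 0
    while i < len(words) - 1:
        if tags[i] in NOUN_TAGS and tags[i + 1] in NOUN_TAGS:
            terms.append(words[i] + " " + words[i + 1])
            i += 2  # skip past the pair so tokens are not reused
        else:
            i += 1
    return terms


# Hypothetical transliterated sentence: "battery life acchi hai"
words = ["battery", "life", "acchi", "hai"]
tags = ["NN", "NN", "JJ", "VM"]
print(compound_nouns(words, tags))  # ['battery life']
```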

Class 10. Particle (RP) association:

In Hindi, expressions like /to, /bhi, /se, etc. are marked as RP. If a noun is associated with one of these particles, then that noun is a potential aspect term.

Class 11. Demonstrative (DEM) association:

Demonstrative pronouns are usually used to identify a place or thing. In Hindi, if a pronoun from (/these, /this, /this/those) is followed by a noun, then that noun is considered an aspect term.

4 Corpus details

The proposed work was evaluated on the Hindi corpus created by Akhtar et al., 2016. This is an aspect-annotated dataset and is publicly available. The corpus covers twelve domains with 5417 review sentences. The review sentences have been manually annotated with aspect term, aspect term polarity and aspect term category, with polarities categorized as positive, negative, conflict or neutral. Using features such as suffix and prefix, POS tag and chunk, Akhtar et al. applied CRF and SVM approaches to aspect term extraction and sentiment classification. The proposed approach considers only five domains for evaluation. Each domain contains positive, negative, neutral and conflict polarity sentences. Table 4 shows the number of review sentences and the total number of aspect terms for the five domains.

Table 4
Corpus Statistics

Domain Name	# of Labeled sentences	# of Aspects
Mobile Apps	229	164
Camera	150	183
Television	135	144
Speakers	47	48
Mobile	1141	1416
Overall	1702	1955

5 Experimental results and analysis

The rules categorized under the 11 classes in Section 3.3 were analysed for applicability in all the domains under consideration, and Table 5 gives the number of rules applicable to each domain. After applying the rules described in Section 3.3, the results of the proposed approach for the five domains are compared with the precision and recall shared by Akhtar et al., 2016 [14], who reported results on 12 domains. Their model is a supervised machine learning approach based on CRF for the aspect term extraction task. They used POS tag information, suffix and prefix, chunk information and context features for the aspect extraction and sentiment classification tasks. In comparison to Akhtar et al., 2016, the proposed approach concentrates on the aspect extraction task. In this system, POS tag and dependency relation features play a significant role in building patterns and rules. The results of the proposed approach are evaluated on five domains. Table 6 compares the results of the existing baseline approach with those of the proposed approach on the same dataset. The evaluation shows how well the predicted results match the actual labelled data, measured by precision and recall percentages for extracting aspects from review sentences.
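Precision and recall for aspect extraction can be computed by comparing the extracted terms with the annotated terms sentence by sentence. The sketch below assumes exact string matching and micro-averaging over all sentences of a domain; the paper does not state its matching criterion, so both are assumptions, and the function name and toy data are hypothetical.

```python
from typing import List, Set, Tuple


def precision_recall(predicted: List[Set[str]], gold: List[Set[str]]) -> Tuple[float, float]:
    """Percent precision and recall over aligned per-sentence aspect sets."""
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = 100.0 * true_pos / n_pred if n_pred else 0.0
    recall = 100.0 * true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Toy data for two sentences (not taken from the corpus).
predicted = [{"battery life", "screen"}, {"camera"}]
gold = [{"battery life", "screen", "voice quality"}, {"camera", "flash"}]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=100.00 recall=60.00
```

In this toy case every extracted term is correct but two annotated terms are missed, which mirrors the high-precision, lower-recall profile typical of rule-based extraction.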

Table 5
Rules Applicability

Sl.No	Domain Name	Total No of Rules
1	Mobile Apps	13
2	Camera	12
3	Television	13
4	Speakers	11
5	Mobile	18

Table 6

Result analysis for baseline and rule-based approach

	Results from Baseline approach		Results from Proposed approach
Domain	Precision	Recall	Precision	Recall
Mobile-Apps	50.00	18.00	80.00	70.75
Camera	60.00	31.76	88.57	64.14
Television	75.60	42.46	89.33	60.91
Speakers	83.33	22.72	77.78	58.33
Mobile	67.48	44.42	80.42	46.00
Overall	67.28	31.87	83.22	60.03

The comparative analysis in Fig. 3 shows that the proposed rule-based method significantly improves on the existing baseline model in the aspect extraction task. The baseline model reported an overall precision of 61.96% and recall of 30.72% over the datasets of all 12 domains for aspect term retrieval.

Fig. 3

Comparison of Proposed and Baseline approach.

The comparison in Table 6 covers only the five domains on which the proposed method was evaluated. The model shows consistently high precision across all of these domains. The proposed method achieves an overall precision of 83.22% and recall of 60.03%, which is almost 16 and 28 points higher in precision and recall, respectively, than the method of Akhtar et al., 2016 for aspect extraction. This improvement in precision and recall shows that the rule-based approach can be applied effectively to the aspect retrieval task.
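The "Overall" row of Table 6 matches the unweighted mean of the five per-domain rows; this is an observation from the numbers rather than something the text states. The following sketch reproduces that arithmetic and the reported point differences, with the values copied from Table 6 and hypothetical variable names.

```python
from typing import Dict, Tuple

# (precision, recall) per domain, copied from Table 6.
baseline: Dict[str, Tuple[float, float]] = {
    "Mobile-Apps": (50.00, 18.00),
    "Camera": (60.00, 31.76),
    "Television": (75.60, 42.46),
    "Speakers": (83.33, 22.72),
    "Mobile": (67.48, 44.42),
}
proposed: Dict[str, Tuple[float, float]] = {
    "Mobile-Apps": (80.00, 70.75),
    "Camera": (88.57, 64.14),
    "Television": (89.33, 60.91),
    "Speakers": (77.78, 58.33),
    "Mobile": (80.42, 46.00),
}


def macro_average(rows: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Unweighted mean of precision and recall over the domains."""
    n = len(rows)
    precision = sum(p for p, _ in rows.values()) / n
    recall = sum(r for _, r in rows.values()) / n
    return precision, recall


base_p, base_r = macro_average(baseline)
prop_p, prop_r = macro_average(proposed)
print(f"baseline: P={base_p:.2f} R={base_r:.2f}")  # P=67.28 R=31.87
print(f"proposed: P={prop_p:.2f} R={prop_r:.2f}")  # P=83.22 R=60.03
print(f"gain:     P=+{prop_p - base_p:.2f} R=+{prop_r - base_r:.2f}")  # +15.94, +28.15
```

An unweighted mean gives the small Speakers domain the same influence as the much larger Mobile domain, which is worth keeping in mind when reading the overall figures.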

Error analysis

Error analysis was carried out on the obtained results. The errors observed are discussed below.

In some cases the annotated aspect does not appear in the sentence in exactly the annotated form. For example, in the sentence “ “ (Lyricist Gaurav Solanki also gave accurate expression in his songs), the annotated aspect is “ /song” while the aspect extracted using the rules is “ /songs.”

In many cases parsing and POS tagging are incorrect because of limitations of the POS tagger, which makes aspect term extraction difficult. For example, the name or surname of a person, which is actually a proper noun, may be tagged as a noun. For the sentence “ ”, the annotated aspect is “/performance” while the aspects extracted using the rules are [ (Gitanjali), (performance)].

6 Conclusion

The main objective of the proposed work is to extract aspect terms from review sentences using a rule-based approach. A Hindi dependency parser is used to find word relations in sentences. Since the part of speech (POS) tags and the dependency labels are not fully accurate for Hindi sentences, errors arise while processing the dependency labels of words. This in turn affects the creation of rules for review sentences. Another limitation of the work lies in constructing the rules: because sentence structures in Hindi are very dissimilar to one another, it is not possible to build exhaustive rules from a group of sentences. Even with these limitations, the proposed approach, which discovers patterns and builds rules based on them, performs satisfactorily against the machine learning approach.

Even though the model has proved effective in different electronic domains, it needs to be tested on other, more varied domains. Future work includes the subproblem of sentiment classification for every identified aspect term, as well as exploring neural network-based approaches to aspect term extraction and extending them to sentiment classification with minimal supervision.

References

1. Pang and Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval 2(1–2) (2008), 1–135.

2. Venugopalan and Gupta, Exploring sentiment analysis on twitter data, Eighth International Conference on Contemporary Computing (IC3), IEEE (2015), 241–427.

3. Venugopalan and Gupta, An Unsupervised Hierarchical Rule Based Model for Aspect Term Extraction Augmented with Pruning Strategies, Third International Conference on Computing and Network Communications, Procedia Computer Science (2020), 22–31.

4. Mulatkar, Sentiment classification in Hindi, International Journal of Scientific and Technology Research 3(5) (2014).

5. Venugopalan and Gupta, Sentiment classification for Hindi tweets in a constrained environment augmented using tweet specific features, International Conference on Mining Intelligence and Knowledge Exploration, Springer (2015), 664–670.

6. Das and Bandyopadhyay, SentiWordNet for Indian Languages, Proceedings of the 8th Workshop on Asian Language Resources (2010), 56–63.

7. Das and Bandyopadhyay, Dr Sentiment Creates SentiWordNet(s) for Indian Languages Involving Internet Population, Proceedings of Indo-wordnet workshop (2010).

8. Joshi, Balamurali A.R. and Bhattacharyya, A Fallback Strategy for Sentiment Analysis in Hindi: a Case Study, Proceedings of the 8th ICON (2010).

9. Bakliwal, Arora and Varma, Hindi subjective lexicon: A lexical resource for Hindi polarity classification, Proceedings of the Eighth International Conference on Language Resources and Evaluation (2012), 1189–1196.

10. Balamurali A.R., Joshi and Bhattacharyya, Cross-Lingual Sentiment Analysis for Indian Languages using Linked WordNets, Proceedings of COLING (2012), 73–82.

11. Mittal, Chouhan B.A.G., et al., Sentiment Analysis of Hindi Review based on Negation and Discourse Relation, Proceedings of International Joint Conference on Natural Language Processing (2013), 45–50.

12. Mishra, Venugopalan and Gupta, Context specific Lexicon for Hindi reviews, Procedia Computer Science 93 (2016), 554–563.

13. Garg and Lobiyal D.K., Hindi EmotionNet: A Scalable Emotion Lexicon for Sentiment Classification of Hindi Text, ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19(4) (2020), 1–35.

14. Akhtar Md.S., Ekbal and Bhattacharyya, Aspect based Sentiment Analysis in Hindi: Resource Creation and Evaluation, International Conference on Language Resources and Evaluation (LREC) (2016), 2703–2709.

15. Singh and Piryani, Sentiment Analysis of Movie Reviews: A new feature-based heuristic for aspect-level sentiment classification, International Multi-Conference on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s) (2013), 712–717.

16. Garg and Buttar P.K., Aspect based sentiment analysis of Hindi text review, International Journal of Advanced Research in Computer Science 8(7) (2017), 831–836.

17. Hetal and Attar, Extracting Aspect Terms using CRF and Bi-LSTM Models, International Conference on Computational Intelligence and Data Science (2020), 2486–2495.

18. Rana T.A. and Cheah Y.-N., Sequential Patterns-Based Rules for Aspect-Based Sentiment Analysis, Advanced Science Letters (2018), 1370–1374.

19. Bharati, Sharma D.M., Husain, Bai, Begam and Sangal, AnnCorra: TreeBanks for Indian Languages, Guidelines for Annotating Hindi TreeBank (version 2.5) (2012).
