Sentiment analysis in Nepali: Exploring machine learning and lexicon-based approaches

Abstract

In recent times, sentiment analysis research has achieved tremendous impetus on English textual data, however, a very less amount of research has been focused on Nepali textual data. This work is focused towards Nepali textual data. We have explored machine learning approaches and proposed a lexicon-based approach using linguistic features and lexical resources to perform sentiment analysis for tweets written in Nepali language. This lexicon-based approach, first pre-process the tweet, locate the opinion-oriented features and then compute the sentiment polarity of tweet. We have investigated both conventional machine learning models (Multinomial Naïve Bayes (NB), Decision Tree, Support Vector Machine (SVM) and logistic regression) and deep learning models (Convolution Neural Network (CNN), Long Short-Term Memory (LSTM) and CNN-LSTM) for sentiment analysis of Nepali text. These machine learning models and lexicon-based approach have been evaluated on tweet dataset related to Nepal Earthquake 2015 and Nepal blockade 2015. Lexicon based approach has outperformed than conventional machine learning models. Deep learning models have outperformed than conventional machine learning models and lexicon-based approach. We have also created Nepali SentiWordNet and Nepali SenticNet sentiment lexicon from existing English language resources as by-product.

Keywords

Lexicon-based sentiment analysis Nepali language Twitter sentiment analysis Nepali SentiWordNet Nepali SenticNet deep learning sentiment analysis

1 Introduction

Sentiment analysis is a natural language processing (NLP) task, it identifies the orientation of opinion in a piece of text using linguistic features to classify the text into either positive or negative class. In recent time, sentiment analysis has been very useful task in various domain. Various organization exploits sentiment analysis application to analyze the customer attitudes towards their product/services which helps them in order to take informed decision.

With the growth of user-generated textual content on the internet, the sentiment analysis research has achieved a tremendous momentum. Majority of research has focused on English text and quite less research has been carried out on Nepali language, possibly due to scarcity of accurate linguistic resources (such as stemmer, lemmatizer, POS tagger) to perform sentiment analysis.

Nepali is an Indo-Aryan language of the sub-part of Eastern Pahari, spoken by approximately 45 million people around the globe [1]. It is an official language of Nepal. Nepali is free word order language as compared to English, which increases the complexity in handling user-generated content. The scarcity of linguistic resources for Nepali language creates more challenges ranging from the creation, collection and generation of lexical resources and datasets. Lack of NLP resources increases the complexities of existing classification methods for Nepali text. In this paper, our work is focused towards these challenges. We have explored machine learning approaches and proposed a lexicon-based approach which utilizes linguistic features and lexical resource to identify the orientation of opinion features to classify the piece of text into either positive or negative class. Some Nepali sentiment lexicon (such as Nepali SentiWordNet, Nepali SenticNet) has been generated from existing English language resources (such as SentiWordNet, SenticNet) as by product. The lexicon-based approach utilizes these lexicons in addition to NRC emotion lexicon [21–22] to compute the sentiment polarity of a piece of text. We have studied both conventional machine learning models (Multinomial Naïve Bayes (NB), Decision Tree, Support Vector Machine (SVM) and logistic regression) and Deep Learning models (Convolution Neural Network (CNN), Long Short-Term Memory (LSTM) and CNN-LSTM) for sentiment analysis of Nepali text. We have also evaluated machine learning models and lexicon-based approach on tweet dataset related to Nepal Earthquake 2015 and Nepal blockade 2015. The tweets are generally about relief and shelter situation in earthquake. It is observed that lexicon-based approach achieved better precision, recall and accuracy performance than conventional machine learning models. Deep learning models have outperformed than conventional machine learning model and lexicon-based approach.

The rest of the paper is organized as follows: Section 2 presents the NLP research in Nepali language. Section 3 explains the machine learning-based approaches for sentiment analysis of Nepali texts. Section 4 presents the linguistic based approach for sentiment analysis of Nepali texts. Section 5 describes the experimental analysis. Section 6 presents the conclusion.

2 NLP Research in Nepali language

Sentiment analysis is an important area of NLP with comprehensive and growing literature. The majority of published literature is concentrated towards the English language, however, very less amount of work has been done for other scarce resource languages (SRL). SRL are languages for which availability of NLP tools (such as POS tagger, lemmatizer, and annotated corpora) are limited and under the developing phase. Nepali language is one such SRL.

NLP research in Nepal has started in year 2005 with the launch of Nepali Spell checker [2] and English to Nepali machine translation project named “Dobhase” [3]. Madan Puraskar Pustakalya in collaboration with Kathmandu University has taken this initiation to create the NLP resources for Nepali text. Nepali National corpus (NNC) of 14 million words has been constructed and annotated for both spoken and written data [1]. This corpus also has parallel speech corpus (English-Nepali and Nepali-English) as additional resource. The morphological analyzer has been implemented for determining the morpheme(s) of a specified derived or inflected word [3, 4]. The Universal networking language (UNL) Nepali de-converter which generates Nepali language and represents meaning sentence by sentence in logical form has been proposed in [5]. Hardie has constructed a collocation-based method for classification of Nepali postpositions [6]. For analyzing the structure of Nepali grammar, a system named Nepal Grammar Checker has been developed [7]. Parts of speech (POS) tagger for Nepali text has been built using support vector machine (SVM) [8] and first-order hidden Markov model [9]. To extract a specific information (such as names, places, time) from Nepali text, a Named Entity Recognition (NER) has been developed using support vector machine and semi-hybrid approach [10, 11].

Thakur and Singh have augmented a lexicon pooled with Naïve Bayes classifier for classification of Nepali news stories [12]. Gupta and Bal have built bootstrap approach for detecting sentiment in Nepali text [13]. Regmi et al. utilizes supervised machine learning classifiers for analysis of facts and opinion of Nepali text [14]. They also conducted a comparative study of classifiers. Some studies [12 –15] are concentrated on classification of Nepali news articles dataset. Thapa and Bal have also applied supervised machine learning classifiers for sentiment analysis of books and movie reviews written in Nepali [16]. Machine learning approaches: Naïve Bayes and SVM have been used for mobile sms spam filtering [17].

In this paper, a lexicon-based approach has been implemented using linguistic formulation and lexical resources to identify an orientation of opinion features from a tweet and then compute the sentiment polarity of a tweet. We have also explored machine learning based approached for sentiment analysis on Nepali text. To our best knowledge, this is the first work that focused on tweets. We have also conducted a comparative study of different lexicon.

3 Machine learning-based approaches for sentiment analysis of Nepali texts

3.1 Conventional machine learning models

We have studied the most popular supervised machine learning classifiers for sentiment analysis on Nepali texts such as Multinomial Naïve Bayes (NB), Decision Tree, Support Vector Machine (SVM) and Logistic Regression. We have considered all these machine learning classifiers as baselines for a lexicon-based approach. Figure 1 shows the block diagram for the supervised machine learning classifiers. During the training phase, a feature extractor module extracted the Tweet Features as listed in Table 1 and converted each tweet input value to a Feature Matrix. Then, we have computed TFIDF of unigram, bigram, trigram after pre-processing the tweet by removing non-Nepali text, hashtags, URLs and repeated punctuation marks. We have added TFIDF of unigram, bigram, and trigram as a feature into a Feature Matrix. These features matrices and labels are given as input to the machine learning classifier to generate the model. The selection of Tweet Features are derived from previous studies [18 –20]. During the prediction phase, the same feature extractor module extracts the Tweet Features and TFIDF unigram, bigram trigram from an unseen tweet and converts each tweet input to feature matrix. These matrices are given as input to the classifier model to predict the labels for unseen tweets. The scikit-learn python library has been used for implementation of Multinomial NB, Decision Tree, SVM, and logistic regression. The scikit-learn is a python library which provides machine learning classifiers to perform NLP tasks. All machine learning classifiers experiments are run with tenfold cross-validation.

Fig. 1

Block diagram for supervised machine learning approaches.

Table 1

Tweet features

No. of URLS, No. of words, No. of stopwords, No. of slang words, No. of hashtags, No. of user mentions, Tweet length (No. of characters), No. of digits, No. of unique characters, No. of special characters, Likes count, Retweet count, No. of nouns, No. of adjectives, No. of prepositions, No. of pronouns, No. of verbs, No. of adverbs, No. of conjunctions

3.2 Deep learning models

3.2.1 Convolution neural network (CNN)

Recently, CNNs have achieved remarkable results on sentence classification task [25 –29]. It converts the words of sentence into vector that can be used as input as shown in Fig. 2. Figure 2 shows the architecture of CNN for sentiment analysis. This model is inspired by Zhang & Wallace [30]. The model has three regions: R1 for bigram, R2 for trigram and R3 for four gram, each region has two filters. Filters are used to perform convolutions on input tweet matrix and create variable length feature maps. Max pooling layer is applied over each feature map to get univariate feature vector. Next is to concatenate all six univariate vectors to a single vector. The softmax layer gets single vector as input and classify the sentence. Table 2 shows CNN model configuration description.

Fig. 2

CNN architecture for sentiment analysis.

Table 2

CNN model configuration description

Description	Values
Input word vector	(1) Pretrained word2vec Fast text model^2,3
	(2) Trained word2vecmodel with 0.148 Millions Nepali Tweets (We have downloaded in June 2018).
Filter region size	1, 2, 3, 4
Activation Function	Rectified Linear Unit (ReLU)
Max Pooling	1-max pooling layer
Dropout	0.5
Loss	Binary cross entropy
Optimizer	Adam
Dense	1
Batch size	32
Epoch	5

²https://fasttext.cc/docs/en/crawl-vectors.html. ³https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ne.300.bin.gz.

3.2.2 Long short term memory (LSTM) model

LSTM are widely used for sentiment analysis task [31 –34]. LSTM network was developed to overcome the drawback of Recurrent Neural Network (RNN) that is vanishing gradient and exploding gradient. In LSTM model, gates are used to control the memorizing process. Figure 3 illustrates the LSTM architecture for sentiment analysis. It comprises of six layers: Tokenize Layer, Embedding Layer, LSTM layer, Fully connected layer, Sigmoid activation layer and output layer.

Tokenize Layer: This layer is compulsory step for converting words into token i.e. integer.

Embedding Layer: In this layer, word tokens are converted into embedding of specific size.

LSTM Layer: It can be explained by hidden state and number of layers.

Fully Connected Layer: It maps LSTM layer output to required output size.

Sigmoid Activation Layer: It converts all output values between 0 and 1.

Output: The final output of the network is sigmoid layer output.

Fig. 3

LSTM architecture for sentiment analysis.

Table 3 shows the LSTM model configuration parameters description.

Table 3

LSTM model configuration description

Description	Values
Input word vector	(1) Pretrained word2vec Fast text model
	(2) Trained word2vecmodel with 0.148 Millions Nepali Tweets (We have downloaded in June 2018).
LSTM output	300
Dropout	0.5
Loss	Binary cross entropy
Optimizer	Adam
Dense	1
Batch size	32
Epoch	5

3.3 Convolution Neural Network – Long Short Term Memory (CNN-LSTM) model

Generally, CNN may be incapable to extract long distance dependency but it can identify local information from sentences. This limitation of CNN can be addressed by LSTM through sequentially modelling texts around sentences. Figure 4 shows the CNN-LSTM architecture for sentiment analysis of Nepali texts. This model is inspired by Jin Wang et al. [35]. CNN-LSTM model has been widely used for sentiment analysis tasks [36 –38].This model comprises of eight layer.

Tokenize Layer: This layer is compulsory step for converting words into token i.e. integer.

Embedding Layer: In this layer, word tokens are converted into embedding of specific size.

Convolution Layer: It perform convolutions on input tweet matrix and create variable length feature maps

Max-pooling layer: It subsample the output of convolution layer by removing the non-maximal values.

LSTM Layer: It will utilize the ordering of features to learn the sequence of input.

Fully Connected Layer: It maps LSTM layer output to required output size.

Sigmoid Activation Layer: It converts all output values between 0 and 1.

Output: The final output of the network is sigmoid layer output.

Fig. 4

CNN-LSTM architecture for sentiment analysis.

Table 4 shows CNN-LSTM model configuration parameters description.

Table 4

CNN-LSTM model configuration description

Description	Values
Input word vector	(1) Pretrained word2vec Fast text model
	(2) Trained word2vecmodel with 0.148 Millions Nepali Tweets (We have downloaded in June 2018).
Filter region size	4
Activation Function	Rectified Linear Unit (ReLU)
Max Pooling	1-max pooling layer
LSTM output	300
Dropout	0.5
Loss	Binary cross entropy
Optimizer	Adam
Dense	1
Batch size	32
Epoch	5

We have used have used Keras 4 , Genism 5 and TensorFlow 6 python libraries for the implementation of all three deep learning models. Before applying any model on tweet dataset, we have preprocessed the tweet by removing non-Nepali text, hashtags, URLs and repeated punctuation marks. Moreover, we have run all deep learning models with tenfold cross validation.

4 Linguistic-based approach for sentiment analysis of Nepali texts

4.1 Methodological approach

As we have investigated in the last section, machine learning approaches for the sentiment analysis of Nepali text. The conventional machine learning classifiers prior requirement is labelled training set to perform the sentiment analysis. In order to remove this requirement, we have designed lexicon-based approach. The lexicon-based approach for sentiment analysis is focused towards the assumption that the contextual sentiment orientation of any document, sentence or aspect is the aggregate of the sentiment score of each word or phrase or feature (such as adjective, adverb adjective combination, or adverb adjective adverb verb combination) present in that specified text. The pre-requisite for lexicon-based approach are lexical resources, parts of speech (POS) tagger, and a good method for computing contextual sentiment score of a feature. However, Nepali is a SRL and there are no standard POS tagger and lexical resources available for Nepali language. Therefore, we have used dictionary-based approach to identify POS from text. To detect the opinion feature from a piece of text, we have created a sentiment lexicon with the help of dictionaries: Nepali to Nepali Dictionary and Nepali to English dictionary; and existing English sentiment lexicons such as SentiWordNet and SenticNet. Figure 2 shows the block diagram of proposed work. The approach takes tweet as an input, pre-process the tweet by removing non-Nepali text, hashtags, URLs and repeated punctuation marks, extract the feature using Nepali to Nepali dictionary and compute the sentiment polarity of a tweet with the help of Nepali sentiment lexicons (Nepali SentiWordNet, Nepali SenticNet, Domain Specific Sentiment Lexicon) and NRC Emotion Lexicon as shown in block diagram. NRC emotion lexicon is created by Saif Mohammad [21, 22] for the English language. It is a collection of words and their association with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). They have also released this lexicon in other languages including the Nepali language. So we have used the Nepali version of NRC Emotion lexicon for the lexicon-based approach. This lexicon has 12 field: 1. English Word, 2. Nepali Word 3. Anger, 4. Fear, 5. Anticipation, 6. Trust, 7. Surprise, 8. Sadness, 9. Joy, 10. Disgust, 11. Positive, 12. Negative. Out of these 12 fields, we have used four fields: English word, Nepali Word, Positive and Negative fields. The value of positive and negative field is either 0 or 1. The details about the creation of Nepali Sentiment lexicons is discussed in next Section 4.2.

4.1.1 Algorithm for computation of sentiment polarity

This section explains the algorithm for the computation of sentiment polarity. The pre-processed tweets are fed as input to the computation of sentiment polarity algorithm.

Fig. 5

Block diagram for Lexicon-based approach.

The algorithm first tokenizes each pre-processed tweet and compute unigram, bigram and trigram through findNGrams function to create a feature vector. The next step is to identify the opinion feature from the feature vector and then compute the sentiment polarity with the help of created lexicons (Nepali SentiWordNet, Nepali SenticNet, DSSL) and NRC emotion lexicon. We are using Python Pandas library.

In the pseudo-code, score_df is a pandas data frame which will store word (W), frequency (F), sentiment (S), and parts of speech (POS). In the pseudo-code, getSentimentScore is a function which will take word and POS as input and return sentiment score of the input word. The pseudo-code of the algorithm is given below:

Algorithm 1: Sentiment Polarity Computation
1: for each t ∈ tweetsdo
2: score_df = panda.DataFrame(‘W’:[], ‘F’:[], ‘S’:[], ‘POS’:[])
3: words = t.split(“ ”)
4: #compute unigram, bigram, and trigram
5: feature = findNGrams(words, min = 1, max = 4)
6: for each w ∈ featuredo
7: if(wexists in score_df) then
8: score_df [w][‘F’] = score_df [w][‘F’]+1
9: continue
10: end if
11: temp = w.split(“ ”)
12: if(len(temp)==1)then # w is unigram
13: Lookup for POS of w in NND
14: if((POS_w is ADJ) or (POS_w is ADV) or (POS_w is VB) or (POS_w is NN))then
15: Lookup w in Sentiment Lexicon
16: SS_w = getSentimentScore(w,POS_w)
17: end if
18: else if(len(temp)==2)then # w is bigram
19: w₁ = temp[0], w₂ = temp[1]
20: if (w₁ exists in score_df and w₂ exists in score_df)then
21: if (POS_w1 is ADV and POS_w2 is ADJ)then
22: score_df [w₁][‘F’] = score_df [w₁][‘F’] - 1
23: score_df [w₂][‘F’] = score_df [w][‘F’] - 1
24: if (SS_w2 _> 0) then SS_w2 = SS_w2 ₊ 1
25: else if(SS_w2 _< 0) then SS_w2 = SS_w2-1 end if
26: end if
27: end if
28: else if (w exists in NRC_Lexicon)then
29: SS_w = NRC_Lexicon[w][SS]
30: else if (w exists in DSSL) then
31: SS_w =DSSL[w][SS]
32: end if
33: if (SS_w !=0)then
34: score_df = score_df.append({‘W’: w, ‘F’: 1, ‘S’: SS_w, ‘POS’:POS_w})
35: end if
36: end for
37: TS_t = sum(score_df[‘F’] * score_df[‘S’])
38: if(TS_t > 0) then t_polarity=“positive”
39: else if(TS_t < 0) then t_polarity=“negative”end if
40: end for

4.2 Creation of lexical resources

4.2.1 Creation of dataset

There are no appropriate annotated datasets, and other lexical resources available to perform sentiment analysis on the Nepali language. Therefore, we have built a dataset of 600 positive and 600 negative tweets. For the creation of dataset, first, we have downloaded the tweets regarding Nepal Earthquake 2015 (duration: 25-04-2015 to 25-05-2015) and Nepal Blockade 2015 (duration: 15-09-2015 to 31-12-2015) from a Twitter (Downloaded on 30-03-2018). The Nepal Earthquake occurred on 25-04-2015 with magnitude 7.8. According to the report, more than 8702 people died in that earthquake. People have suffered from the scarcity of critical resources such as medicines, water, epidemics, and other life-threatening diseases. The Nepali Blockade 2015 started from 23-05-2015, it was an economic and humanitarian crisis. It has affected people of Nepal and its economy very severely. We have crawled around 4200 tweets related to Nepal Earthquake and 800 related to Nepal Blockade. These 5000 tweets were given to three annotators (graduate students) for labelling. The annotators were asked to label each tweet with sentiment polarity identified in the tweet text. The annotators have identified some tweets which contain English text in between the Nepali text (For example: “ Chinese , ! #BackOfIndia”) and some tweets contain English words written in Nepali text (For example: “). We have removed such tweets. The quality of annotation has been measured by two standard agreement parameters: Inter-Indexer Consistency (IIC) [23] and Cohen’s Kappa [24]. Out of 5000 tweets, 600 tweets are found to be positive and 600 as Negative. Table 5 presents the dataset description which we have used for evaluation purpose.

Table 5
Dataset description with annotation results

Dataset Total Tweets #Positive tweets #Negative tweets IIC Cohen’s kappa

Earthquake 1200 600 600 92.88% 85.38%

Blockade

Tweets

(D)

Dataset	Total Tweets	#Positive tweets	#Negative tweets	IIC	Cohen’s kappa
Earthquake	1200	600	600	92.88%	85.38%
Blockade
Tweets
(D)

4.2.2 Generation of sentiment lexicons

The sentiment lexicons have been extensively used for the designing of sentiment analysis approaches. Sentiment lexicon is a lexical sentiment dictionary where for each word there is a sentiment score representing positive, negative and neutral behavior of word. Many sentiment lexicons have been created/generated for the English language but not for the Nepali language. As best of our knowledge there is no sentiment lexicon publicly available for the Nepali language. We have created three sentiment lexicons: 1. Nepali SentiWordNet, 2. Nepali SenticNet, and 3. Domain Specific Sentiment Lexicon (DSSL) using Nepali to Nepali dictionary (NND), Nepali to English dictionary (NED), SentiWordNet and SenticNet. The name of NND is “Nepali Brihat Shabdkosh” 7 . NND contains 7126 words with Nepali parts-of-speech (POS) and its meaning in Nepali. With the help of Nepali language experts, all Nepali POS has been converted to equivalent English POS. NED is extracted from Android APP named “Nepali English Dictionary v1” 8 . NED comprises of 40126 words with equivalent meaning in English but it doesn’t contain POS of words. SentiWordNet is a publicly available lexical tool for opining mining. In SentiWordNet each WordNet synset has three scores: positive, negative and objective score. In order to retrieve the sentiment score of a particular word from SentiWordNet, a weighted average scheme has been used. This score varies between –1 to +1. SenticNet is concept oriented sentiment analysis tool, and it is also publicly available. In SenticNet, for each word, the sentiment score lies between –1 to +1.

For the creation of Nepali SentiWordNet, NND, NED and English SentiWordNet have been used. Nepali SentiWordNet lexicon contains eight fields: (1) Word in Nepali, (2) Nepali POS, (3) Equivalent English POS, (4) meaning in Nepali, (5) Meaning in English, (6) Word found in SentiWordNet, (7) SentiWordNet POS (8) sentiment strength. The following steps have been applied for building a Nepali SentiWordNet:

Algorithm 2: Creation of Nepali SentiwordNet
1: NepaliSentiWordNet=[]
2: for each word ∈ NNDdo
3: Lookup meanings of word in NED
4: for each m ∈ meaningsdo
5: Retrieve SS_m from SentiWordNet
6: ifSS_m !=0then
7: NepaliSentiWordNet.add(word, SS_m)
8: end if
9: end for
10: end for

In pseudo-code, NepaliSentiWordNet is pandas data frame which contains eight field about which we have mentioned in the above paragraph.

Creation of Nepali SenticNet process is similar to the process of creation of Nepali SentiWordNet. Each entry in Nepali SenticNet contains seven fields instead of eight because English SenticNet does not contain the POS of word/concept. Seven fields of Nepali SenticNet are: (1) Word in Nepali, (2) Nepali POS, (3) Equivalent English POS, (4) meaning in Nepali, (5) Meaning in English, (6) Word found in SentiWordNet, (7) sentiment strength. The following steps have been applied for building a Nepali SenticNet:

Algorithm 3: Creation of Nepali SenticNet
1: NepaliSenticNet = []
2: for each word ∈ NNDdo
3: Lookup meanings of word in NED
4: for each m ∈ meaningsdo
5: Retrieve SS_m from SenticNet
6: ifSS_m !=0then
7: NepaliSenticNet.add(word, SS_m)
8: end if
9: end for
10: end for

DSSL has been created for Earthquake and blockade domains. The objective behind the creation of this lexicon is to take into account the maximum number of phrases and slang words used in the Nepali language to portray sentiment with respect to Earthquake and blockade domains. This dictionary has been created from 4200 Nepali tweets regarding the earthquake and 800 Nepali tweets regarding blockade. These tweets were given to three different students who speak and write Nepali and were asked to extract positive, negative and slang words used in these tweets. Each entry in DSSL has two fields: (1) Nepali Word (2) Sentiment Polarity. The sentiment polarity for positive word is 1 and –1 for negative word and slang word. Table 6 presents lexicon details.

Table 6
Lexicon details

Name Total words

NRC-Emotion Lexicon 9322

Nepali to Nepali Dictionary (NND) 7621

Nepali to English Dictionary (NED) 40175

Domain Specific Sentiment Lexicon (DSSL) 579

Nepali SentiWordNet 1882

Nepali SenticNet 2414

Name	Total words
NRC-Emotion Lexicon	9322
Nepali to Nepali Dictionary (NND)	7621
Nepali to English Dictionary (NED)	40175
Domain Specific Sentiment Lexicon (DSSL)	579
Nepali SentiWordNet	1882
Nepali SenticNet	2414

5 Results

The performance of all models has been evaluated through standard performance measures: Accuracy, Precision, Recall and F-measure. The macro-averaging method has been applied for the calculation of precision and recall. All machine learning models and lexicon-based approach have been evaluated on tweet dataset D. Table 7 shows the results of conventional machine learning models and deep learning models. From Table 7, it is observed that the majority of approaches have obtained nearly 60% accuracy and SVM with features (TFIDF (Unigram+Bi-gram+Trigram)+Tweet Features) has achieved maximum accuracy of 63.25% which is better among all the machine learning approaches.

Table 7
Results of machine learning approaches on dataset D

Conventional machine learning models result

Model Features Confusion matrix Accuracy Precision Recall F-measure

Predicted

P N

Multinomial TFIDF Unigram Actual P 367 233 62.91% 0.639 0.647 0.643

Naïve Bayes +Tweet Features N 212 388

TFIDF (Unigram+Bigram) Actual P 373 227 62.91% 0.642 0.637 0.639

+Tweet Features N 218 382

TFIDF (Unigram+Bigram+Trigram) Actual P 373 227 62.91% 0.642 0.637 0.639

+Tweet Features N 218 382

Decision Tree TFIDF Unigram Actual P 362 238 55.91% 0.570 0.515 0.541

+Tweet Features N 291 309

TFIDF (Unigram+Bigram) Actual P 362 238 56.00% 0.567 0.517 0.541

+Tweet Features N 290 310

TFIDF (Unigram+Bigram+Trigram) Actual P 374 226 57.25% 0.581 0.522 0.549

+Tweet Features N 287 313

SVM TFIDF Unigram Actual P 351 249 63.00% 0.625 0.675 0.649

+Tweet Features N 195 405

TFIDF (Unigram+Bigram) Actual P 349 251 63.08% 0.626 0.680 0.652

+Tweet Features N 192 408

TFIDF (Unigram+Bigram+Trigram) Actual P 350 250 63.25% 0.627 0.682 0.653

+Tweet Features N 191 409

N 191 409

Logistic Regression TFIDF Unigram Actual P 326 274 59.33% 0.584 0.643 0.612

(C = 100), penalty=‘l2’ +Tweet Features N 214 386

TFIDF (Unigram+Bigram) Actual P 325 275 59.58% 0.586 0.650 0.616

+Tweet Features N 210 390

TFIDF (Unigram+Bigram+Trigram) Actual P 325 275 59.50% 0.585 0.648 0.615

+Tweet Features N 211 389

Deep learning models result

Model Confusion matrix Accuracy Precision Recall F-measure

Predicted

P N

CNN+Pretrained word2vec Actual P 420 180 69.5% 0.703 0.690 0.696

N 186 414

CNN+Trained word2vec Actual P 412 188 68.5% 0.688 0.683 0.686

N 190 410

LSTM+Pretrained word2vec Actual P 419 181 69.0% 0.698 0.682 0.689

N 191 409

LSTM+Trained word2vec Actual P 420 180 69.0% 0.702 0.680 0.691

N 192 408

CNN_LSTM+Pretrained word2vec Actual P 432 168 66.2% 0.692 0.603 0.645

N 238 362

CNN_LSTM+Trained word2vec Actual P 387 213 67.9% 0.679 0.715 0.6964

N 171 429

	Conventional machine learning models result
Multinomial	TFIDF Unigram	Actual	P	367	233	62.91%	0.639	0.647	0.643
Naïve Bayes	+Tweet Features		N	212	388
	TFIDF (Unigram+Bigram)	Actual	P	373	227	62.91%	0.642	0.637	0.639
	+Tweet Features		N	218	382
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	373	227	62.91%	0.642	0.637	0.639
	+Tweet Features		N	218	382
Decision Tree	TFIDF Unigram	Actual	P	362	238	55.91%	0.570	0.515	0.541
	+Tweet Features		N	291	309
	TFIDF (Unigram+Bigram)	Actual	P	362	238	56.00%	0.567	0.517	0.541
	+Tweet Features		N	290	310
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	374	226	57.25%	0.581	0.522	0.549
	+Tweet Features		N	287	313
SVM	TFIDF Unigram	Actual	P	351	249	63.00%	0.625	0.675	0.649
	+Tweet Features		N	195	405
	TFIDF (Unigram+Bigram)	Actual	P	349	251	63.08%	0.626	0.680	0.652
	+Tweet Features		N	192	408
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	350	250	63.25%	0.627	0.682	0.653
	+Tweet Features		N	191	409
			N	191	409
Logistic Regression	TFIDF Unigram	Actual	P	326	274	59.33%	0.584	0.643	0.612
(C = 100), penalty=‘l2’	+Tweet Features		N	214	386
	TFIDF (Unigram+Bigram)	Actual	P	325	275	59.58%	0.586	0.650	0.616
	+Tweet Features		N	210	390
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	325	275	59.50%	0.585	0.648	0.615
	+Tweet Features		N	211	389
Deep learning models result
Model	Confusion matrix	Accuracy	Precision	Recall	F-measure
			Predicted
				P	N
CNN+Pretrained word2vec	Actual	P	420	180	69.5%	0.703	0.690	0.696
			N	186	414
CNN+Trained word2vec	Actual	P	412	188	68.5%	0.688	0.683	0.686
			N	190	410
LSTM+Pretrained word2vec	Actual	P	419	181	69.0%	0.698	0.682	0.689
			N	191	409
LSTM+Trained word2vec	Actual	P	420	180	69.0%	0.702	0.680	0.691
			N	192	408
CNN_LSTM+Pretrained word2vec	Actual	P	432	168	66.2%	0.692	0.603	0.645
			N	238	362
CNN_LSTM+Trained word2vec	Actual	P	387	213	67.9%	0.679	0.715	0.6964
			N	171	429

From Table 7 it is also observed that the CNN+pretrained word2vec model has achieved better results than all other models. Moreover, deep learning models have outperformed than Conventional machine learning models.

The lexicon approach has been evaluated on each of the four lexicons: 1. Nepali SentiWordNet, 2. Nepali SenticNet, 3. NRC Emotion lexicon and 4. Domain specific lexicon independently and then on the combination of two and three lexicons as shown in Table 8. It is observed from Table 8 that Nepali SentiWordNet and Nepali SenticNet achieves 19% and 17.8% precision respectively because of the presence of limited words related to the earthquake and blockade domain. NRC Emotion lexicon and DSSL have outperformed than all other lexicons. NRC emotion lexicon has more words than Nepali SenticNet and Nepali SentiWordNet whereas DSSL dictionary is created with objective to take into account maximum number of phrases and slang words related to earthquake and blockade domain. The combination of two lexicon has been tried to take the advantage of each lexicon. It is also observed from Table 8 that combination of NRC emotion lexicon and DSSL have outperformed than all other combinations of two lexicon. We have also evaluated the approach on combination of three lexicon. From Table 8 it is seen that combination of NRC Emotion lexicon, DSSL, and Nepali SentiWordNet achieved slightly better results than the combination of NRC Emotion and DSSL. This combination has also performed better than conventional machine learning models. Moreover, it is observed that deep learning models have outperformed than conventional machine learning model and lexicon-based approach.

Table 8

Lexicon-based results of dataset D

Lexicon	Confusion matrix				Accuracy	Precision	Recall	F-measure
		Predicted
			P	N
Nepali SentiWordNet	Actual	P	133	467	19.00%	0.188	0.190	0.189
		N	505	95
NRC Emotion Lexicon	Actual	P	343	257	45.90%	0.457	0.459	0.458
		N	392	208
Nepali SenticNet	Actual	P	101	499	17.80%	0.178	0.178	0.178
		N	487	113
DSSL (Domain Specific Sentiment Lexicon)	Actual	P	227	373	43.66%	0.436	0.437	0.436
		N	303	297
Nepali SentiWorNet+NRC Emotion	Actual	P	363	237	49.16%	0.491	0.492	0.491
		N	373	227
Nepali SenticNet+NRC Emotion	Actual	P	345	255	48.83%	0.488	0.483	0.488
		N	359	241
Nepali SentiWordNet+DSSL	Actual	P	293	307	51.25%	0.513	0.513	0.513
		N	278	322
Nepali SenticNet+DSSL	Actual	P	273	327	50.33%	0.503	0.503	0.503
		N	269	331
NRC Emotion Lexicon+DSD	Actual	P	421	179	66.83%	0.669	0.668	0.689
		N	219	381
Nepali SentiWordNet+NRC Emotion Lexicon+DSSL	Actual	P	432	168	68.50%	0.686	0.685	0.685
		N	210	390
Nepali SenticNet+NRC Emotion Lexicon+DSSL	Actual	P	412	188	66.75%	0.668	0.668	0.668
		N	211	389

6 Conclusion

In this paper we have explored machine learning approaches and proposed a lexicon-based approach using linguistic features and lexical resources to perform sentiment analysis for tweets written in Nepali language. Nepali is a SRL language and there are no standard POS tagger and lexical resources such as Nepali sentiment lexicons available to perform sentiment analysis on the Nepali language. Therefore, we have used dictionary-based approach to identify POS from the text. We have also created Nepali sentiment lexicons with the help of Nepali to Nepal Dictionary, Nepali to English dictionary and existing publicly available English sentiment lexicons such as SentiWordNet and SenticNet. Furthermore, we have created three sentiment lexicon: Nepali SentiWordNet, Nepali SenticNet, and DSSL as by-product of this work.

The lexicon-based approach takes a tweet as input, pre-process the tweet, identify the opinion feature from the text and compute the sentiment polarity using sentiment lexicon. We have investigated both conventional machine learning models (Multinomial Naïve Bayes (NB), Decision Tree, Support Vector Machine (SVM) and logistic regression) and deep learning models (Convolution Neural Network (CNN), Long Short-Term memory (LSTM) and CNN-LSTM) for sentiment analysis of Nepali text. The machine learning models and lexicon-based approach have been evaluated on Nepal Earthquake 2015 and Nepal Blockade 2015 tweets.

It has been observed that combination of three lexicons: NRC Emotion lexicon, DSSL, and Nepali SentiWordNet have achieved better result than the independent lexicon and other combination of lexicons. Also, this combination has outperformed than Conventional machine learning Models. Deep learning models have outperformed than conventional machine learning model and lexicon-based approach. We believe that the performance of lexicon-based approach can be improved further by increasing the size of lexicons. The conventional machine learning classifiers result can also be enhanced by increasing the training size of data. Similarly, the deep learning models result can be improved by increasing the training size of data and by applying transfer learning approaches. Moreover, the lexicon-based approach is fully scalable and adaptable to any domain. Also, this approach can also be extended for other SRL. The only effort required will be to construct lexical resources for that SRL.

Footnotes

References

Yadava

Y.P.

, Hardie

, Lohani

R.R.

, Regmi

B.N.

, Gurung

, McEnery

, Allwood

and Hall

, Construction and annotation of a corpus of contemporary Nepali, Corpora3(2) (2008), 213–225.

Bal

B.K.

, Shrestha

and Pustakalaya.

M.P.

, Nepali Spellchecker, PAN Localization Working Papers2007 (2004), 316–318.

Bal

B. K.

and Shrestha

, A morphological analyzer and a stemmer for Nepali. PAN localization, Working Papers2007 (2004), 324–31.

Sitaula

, A hybrid algorithm for stemming of Nepali text, Intelligent Information Management5.04 (2013), 136.

Bista

S.K.

and Keshari

, UNL Nepali Deconverter. International-CALIBER 2005, Kochi, India, available at: http://ir.inflibnet.ac.in/bitstream/1944/499/1/05cali_8.pdf.

Hardie

, A collocation-based approach to Nepali postpositions, Corpus Linguistics and Linguistic Theory4(1) (2008).

Bal

B.K.

, Shrestha

and Pustakalaya

M.P.

, Architectural and system design of the Nepali grammar checker, PAN Localization Working Paper, 2007.

Shahi

T.B.

, Dhamala

T.N.

and Balami

, Support vector machines based part of speech tagging for Nepali text, International Journal of Computer Applications70(24) (2013), 38–42.

Paul

, Purkayastha

B.S.

and Sarkar

, Hidden Markov Model based Part of Speech Tagging for Nepali language, 2015 International Symposium on Advanced Computing and Communication (ISACC), Sep. 2015.

10.

Bam

S.B.

and Shahi

T.B.

, Named Entity recognition for Nepali text using support vector machines, Intelligent Information Management06(02) (2014), 21–29.

11.

Dey

, Paul

and Purkayastha

B.S.

, Named entity recognition for nepali language: A semi hybrid approach, International Journal of Engineering and Innovative Technology (IJEIT)3 (2014), 21–25.

12.

Thakur

S.K.

and Singh

V.K.

, A lexicon pool augmented naive bayes classifier for Nepali text, 2014 Seventh International Conference on Contemporary Computing (IC3), Aug. 2014.

13.

Gupta

C.P.

and Bal

B.K.

, Detecting sentiment in Nepali texts: A bootstrap approach for sentiment analysis of texts in the Nepali language, 2015 International Conference on Cognitive Computing and Information Processing (CCIP), Mar. 2015.

14.

Regmi

, Bal

B.K.

and Kultsova

, Analyzing facts and opinions in Nepali subjective texts, 2017 8th International Conference on Information, Intelligence, Systems & Applications (IISA), Aug. 2017.

15.

Shahi

T.B.

and Pant

A.K.

, Nepali news classification using Naïve Bayes, Support Vector Machines and Neural Networks, 2018 International Conference on Communication information and Computing Technology (ICCICT), Feb. 2018.

16.

Thapa

L.B.R.

and Bal

B.K.

, Classifying sentiments in Nepali subjective texts, 2016 7th International Conference on Information, Intelligence, Systems & Applications (IISA), Jul. 2016.

17.

Shahi

T.B.

and Yadav

, Mobile SMS spam filtering for Nepali text using Naive Bayesian and support vector machine, International Journal of Intelligence Science04(01) (2014), 24–28.

18.

Duan

, Jiang

, Qin

, Zhou

and Shum

H.Y.

, An empirical study on learning to rank of tweets. Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, 295–303.

19.

McCreadie

and Macdonald

, “Relevance in microblogs: Enhancing tweet retrieval using hyperlinked documents.” Proceedings of the 10th conference on open research areas in information retrieval. Le Centre de Hautes Etudes Internationales D’informatique Documentaire, 2013, 189–196.

20.

Vosecky

, Leung

K.W.-T.

and Ng

, Searching for quality microblog posts: Filtering and ranking based on content analysis and implicit links,” Lecture Notes in Computer Science 2012, 397–413.

21.

Mohammad

S.M.

and Turney

P.D.

, Crowdsourcing a word-emotion association Lexicon, Computational Intelligence29(3) (2012), 436–465.

22.

Mohammad

S.M.

, From once upon a time to happily ever after: Tracking emotions in mail and books, Decision Support Systems53(4) (2012), 730–741.

23.

Rolling

, Indexing consistency, quality and efficiency, Information Processing & Management17(2) (1981), 69–76.

24.

Byrt

, How good is that agreement?, Epidemiology7(5) (1996), 561.

25.

Kim

, Convolutional Neural Networks for Sentence Classification, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

26.

Iyyer

, Manjunatha

, Boyd-Graber

and Daumé III

, Deep Unordered Composition Rivals Syntactic Methods for Text Classification, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.

27.

Goldberg

, A primer on neural network models for natural language processing, Journal of Artificial Intelligence Research57 (2016), 345–420.

28.

Wang

, Xu

, Liu

, Zhang

, Wang

and Hao

, Semantic Clustering and Convolutional Neural Network for Short Text Categorization, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015.

29.

Kalchbrenner

, Grefenstette

and Blunsom

, “A Convolutional Neural Network for Modelling Sentences,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014.

30.

Zhang

and Wallace

, A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification.” arXiv preprint arXiv:1510.03820 (2015).

31.

Hochreiter

and Schmidhuber

, Long short-term memory, Neural Computation9(8) (1997), 1735–1780.

32.

Ruder

, Ghaffari

and Breslin

J.G.

, A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.

33.

Tang

, Qin

, Feng

and Liu

, Target-dependent sentiment classification with long short term memory. arXiv preprint arXiv:1512.01100 (2015).

34.

Tang

, Qin

, Feng

, Liu

, Effective LSTMs for target-dependent sentiment classification. arXiv preprint arXiv:1512.01100 (2015).

35.

Wang

, Yu

L.-C.

, Lai

K.R.

and Zhang

, Dimensional Sentiment Analysis Using a Regional CNN-LSTM Model,” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016.

36.

Wang

, Jiang

and Luo

, Combination of convolutional and recurrent neural network for sentiment analysis of short texts. Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical papers. 2016.

37.

Zhang

, Yuan

, Wang

and Zhang

, YNU-HPCC at EmoInt-2017: Using a CNN-LSTM Model for Sentiment Intensity Prediction, Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2017.

38.

Yenter

and Verma

, Deep CNN-LSTM with combined kernels from multiple branches for IMDb review sentiment analysis, 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), Oct. 2017.

	Conventional machine learning models result
Model	Features	Confusion matrix				Accuracy	Precision	Recall	F-measure
			Predicted
				P	N
Multinomial	TFIDF Unigram	Actual	P	367	233	62.91%	0.639	0.647	0.643
Naïve Bayes	+Tweet Features		N	212	388
	TFIDF (Unigram+Bigram)	Actual	P	373	227	62.91%	0.642	0.637	0.639
	+Tweet Features		N	218	382
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	373	227	62.91%	0.642	0.637	0.639
	+Tweet Features		N	218	382
Decision Tree	TFIDF Unigram	Actual	P	362	238	55.91%	0.570	0.515	0.541
	+Tweet Features		N	291	309
	TFIDF (Unigram+Bigram)	Actual	P	362	238	56.00%	0.567	0.517	0.541
	+Tweet Features		N	290	310
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	374	226	57.25%	0.581	0.522	0.549
	+Tweet Features		N	287	313
SVM	TFIDF Unigram	Actual	P	351	249	63.00%	0.625	0.675	0.649
	+Tweet Features		N	195	405
	TFIDF (Unigram+Bigram)	Actual	P	349	251	63.08%	0.626	0.680	0.652
	+Tweet Features		N	192	408
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	350	250	63.25%	0.627	0.682	0.653
	+Tweet Features		N	191	409
			N	191	409
Logistic Regression	TFIDF Unigram	Actual	P	326	274	59.33%	0.584	0.643	0.612
(C = 100), penalty=‘l2’	+Tweet Features		N	214	386
	TFIDF (Unigram+Bigram)	Actual	P	325	275	59.58%	0.586	0.650	0.616
	+Tweet Features		N	210	390
	TFIDF (Unigram+Bigram+Trigram)	Actual	P	325	275	59.50%	0.585	0.648	0.615
	+Tweet Features		N	211	389
Deep learning models result
Model		Confusion matrix				Accuracy	Precision	Recall	F-measure
			Predicted
				P	N
CNN+Pretrained word2vec		Actual	P	420	180	69.5%	0.703	0.690	0.696
			N	186	414
CNN+Trained word2vec		Actual	P	412	188	68.5%	0.688	0.683	0.686
			N	190	410
LSTM+Pretrained word2vec		Actual	P	419	181	69.0%	0.698	0.682	0.689
			N	191	409
LSTM+Trained word2vec		Actual	P	420	180	69.0%	0.702	0.680	0.691
			N	192	408
CNN_LSTM+Pretrained word2vec		Actual	P	432	168	66.2%	0.692	0.603	0.645
			N	238	362
CNN_LSTM+Trained word2vec		Actual	P	387	213	67.9%	0.679	0.715	0.6964
			N	171	429

Sentiment analysis in Nepali: Exploring machine learning and lexicon-based approaches

Abstract

Keywords

1 Introduction

2 NLP Research in Nepali language

3 Machine learning-based approaches for sentiment analysis of Nepali texts

3.1 Conventional machine learning models

3.2.1 Convolution neural network (CNN)

4.1 Methodological approach

4.1.1 Algorithm for computation of sentiment polarity

4.2.1 Creation of dataset

Table 5 Dataset description with annotation results Dataset Total Tweets #Positive tweets #Negative tweets IIC Cohen’s kappa Earthquake 1200 600 600 92.88% 85.38% Blockade Tweets (D)

Table 6 Lexicon details Name Total words NRC-Emotion Lexicon 9322 Nepali to Nepali Dictionary (NND) 7621 Nepali to English Dictionary (NED) 40175 Domain Specific Sentiment Lexicon (DSSL) 579 Nepali SentiWordNet 1882 Nepali SenticNet 2414

Footnotes

References

Table 5
Dataset description with annotation results

Dataset Total Tweets #Positive tweets #Negative tweets IIC Cohen’s kappa

Earthquake 1200 600 600 92.88% 85.38%

Blockade

Tweets

(D)

Table 6
Lexicon details

Name Total words

NRC-Emotion Lexicon 9322

Nepali to Nepali Dictionary (NND) 7621

Nepali to English Dictionary (NED) 40175

Domain Specific Sentiment Lexicon (DSSL) 579

Nepali SentiWordNet 1882

Nepali SenticNet 2414