Sentiment analysis of Kazakh text and their polarity

Abstract

Sentiment analysis is one of the most important and interesting tasks in natural language processing. A number of resources and tools have been developed for sentiment analysis of English, Turkish, Russian and other languages. Until now, however, no data or tools were available for sentiment analysis in Kazakh. A dictionary of Kazakh sentiment words was created during this study. In this work, we describe a rule-based method for sentiment analysis of texts in the Kazakh language that uses a dictionary of emotional words and is based on morphological rules and an ontological model. We studied texts in Kazakh and determined the parts of speech that define the mood of a text. Based on these studies, a number of phrase patterns were identified as determining text polarity. This paper is an extended version of the paper published in [in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNCS, 2017, pp. 669–677]. In addition to the original material, the paper includes additional rules for determining sentiment on a 5-point scale.

Keywords

Sentiment analysis, Kazakh language, rule-based method, morphological rules, ontology

1. Introduction

Sentiment analysis, also called opinion mining, is one of the fastest growing areas of natural language processing. It is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [14]. Sentiment analysis is a major topic for companies and enterprises that are interested in identifying opportunities within a new market. Emotions and opinions play a significant role in people’s everyday life and their decision-making process, and sentiment analysis tools have been widely accepted in the commercial and social fields. The number of blogs, reviews, forums and social network pages on the web is growing by the day. Manual processing of such a large amount of data is impossible, so different linguistic and machine learning methods are used. Sentiment analysis has been applied at various levels, from the whole-text level down to the sentence, phrase and aspect levels.

For English, plenty of resources and systems for sentiment analysis of texts have been developed by now [14,16]. Research on sentiment analysis is also being conducted for Russian [7–9], Turkish [3,11,28], Spanish [21], Arabic [15,20] and other languages. For Spanish, [21] proposed an approach to subjectivity detection on Twitter micro-texts that exploits the structured information of the social network framework. For Arabic, [20] proposed a semantic approach to discovering user attitudes and business insights from Arabic social media by building an Arabic Sentiment Ontology that contains groups of words expressing different sentiments in different dialects.

Fig. 1. Sentiment analysis methods.

Sentiment analysis of texts written in the Kazakh language has been studied little. There are some works on sentiment analysis for language pairs that include Kazakh and Russian [1,19]. [19] described modern approaches to sentiment analysis of news articles in Kazakh and Russian using deep recurrent neural networks. That research shows that good results can be achieved even without knowledge of the linguistic features of a particular language. In [1], a deep neural network model that uses bilingual word embeddings to solve the sentiment classification problem for a given pair of languages was proposed. The authors apply this approach to two corpora of two different language pairs: English–Russian and Russian–Kazakh. They show how to train a classifier in one language and predict in another. This approach achieves 73% accuracy for English and 74% accuracy for Russian. For Kazakh sentiment analysis, the authors propose a baseline method that achieves 60% accuracy, and a method for learning bilingual embeddings from a large unlabeled corpus using bilingual word pairs [1].

Computers are beginning to acquire the ability to recognize emotions. In 1995, Rosalind W. Picard [17] reported on key issues in “affective computing”, that is, computing that relates to, arises from, or influences emotions. Since then, a lot of research has been carried out, and many studies address emotion recognition from texts. For example, [5] suggest an approach to emotion recognition using web-based similarity and also propose an emotion ranking model based on semantic proximity measures, e.g. confidence, PMI and PMING.

Today, there are a lot of mobile devices around the world, such as smartphones, tablets and cameras, as well as PCs. New applications for posting audio and video and for chatting are released every day. Accordingly, the amount of text, audio and video information is increasing, and extracting emotion from text, image, audio and video information is becoming an important task.

Emotion is extracted not only from texts, but also from audio and video content [4,18] and from images [13]. Such applications can be used in areas such as social media marketing, brand positioning, and election and financial prediction.

This work can be considered an introduction to, and an attempt at, applying the linguistic approach to sentiment analysis of texts written in the Kazakh language. The paper therefore describes the rule-based methods used in sentiment analysis and the approaches used to determine the sentiment of Kazakh sentences by formalizing morphological rules.

2. The main methods used in sentiment analysis

According to [14], automatic analysis of the sentiment of natural language texts is carried out using two groups of methods: machine learning methods and lexicon-based methods (Fig. 1).

Sentiment analysis methods based on machine learning are “trained” on a collection of pre-labeled texts. These methods include the support vector machine (SVM), logistic regression, the naive Bayes classifier, maximum entropy, k-nearest neighbors (k-NN) and others.

Lexicon-based methods usually use morphological analysis and specially designed sentiment dictionaries of words and phrases, as well as sets of linguistic rules and corpora [22].

3. Determining the sentiment of phrases in Kazakh language

The sentiment of sentences in the Kazakh language is determined by classifying texts into five classes on the scale [−2..2]: very negative (−2), negative (−1), neutral (0), positive (1) and very positive (2). For this purpose, a dictionary of emotional Kazakh words was developed, which is used in determining the polarity of the text. The dictionary was created manually and annotated with polarity on the 5-point scale [−2..2]. It contains about 11,000 emotional words and phrases [23]. In the Kazakh language, the polarity of a phrase is carried by the following parts of speech: noun, adjective, verb and adverb. The morphological rules for the parts of speech involved in determining the polarity of a sentence are then formalized: words and/or phrases that contain evaluative words are extracted from the sentence. The overall sentiment of the text is evaluated from the polarity of its sentences and phrases.

Adjectives mainly determine the sentiment polarity of the text, while the noun plays the role of the aspect (object) under discussion. From the extracted phrases we can determine the polarity of the whole text. The polarity of evaluative words may depend on the context and subject area. The polarity can also be changed or intensified by adverbs, verbs and conjunctions. The following phrase patterns can be used to define the polarity:

[NOUN] + [VERB]

[NOUN] + [VERB] + [Negation]

[ADJECTIVE] + [NOUN]

[ADJECTIVE] + [Negation] + [NOUN]

[ADJECTIVE] + [VERB]

[ADJECTIVE] + [VERB] + [Negation]

[Not ADJECTIVE] + [VERB]

[Not ADJECTIVE] + [VERB] + [Negation]

[ADVERB] + [ADJECTIVE]

[ADVERB] + [NOUN]
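The phrase patterns above can be matched against a part-of-speech-tagged sentence with a simple scan. The following is a minimal sketch, not the authors' implementation: the tag names (`ADJ`, `NEG`, and so on), the `PATTERNS` table and the function `extract_phrases` are assumptions introduced for illustration, the tagging itself is assumed to come from a morphological analyzer, and the two [Not ADJECTIVE] patterns are left out because their exact interpretation is not specified here.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag); tags are assumed to come from a morphological analyzer

# Phrase templates from the list above, written as tuples of (hypothetical) tag names.
# "NEG" stands for a negation word such as emes/zhok.
PATTERNS = [
    ("NOUN", "VERB"),
    ("NOUN", "VERB", "NEG"),
    ("ADJ", "NOUN"),
    ("ADJ", "NEG", "NOUN"),
    ("ADJ", "VERB"),
    ("ADJ", "VERB", "NEG"),
    ("ADV", "ADJ"),
    ("ADV", "NOUN"),
]


def extract_phrases(tokens: List[Token]) -> List[List[Token]]:
    """Scan left to right and collect non-overlapping matches, longest template first."""
    ordered = sorted(PATTERNS, key=len, reverse=True)
    phrases = []
    i = 0
    while i < len(tokens):
        for pattern in ordered:
            window = tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                phrases.append(window)
                i += len(pattern)
                break
        else:
            # No template starts at this position; move on by one token.
            i += 1
    return phrases


if __name__ == "__main__":
    sentence = [("zhaman", "ADJ"), ("emes", "NEG"), ("kino", "NOUN")]
    print(extract_phrases(sentence))
```

Trying the longer templates first keeps a negation word attached to the phrase it modifies instead of splitting it off as a separate two-word match.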

To determine morphological features, we used the work described in [26], in which we explained how semantic hyper-graphs are used to describe ontological models of the morphological rules of the Kazakh language. A morphological analyzer was built on the basis of that work. This analyzer is used to extract morphological information from texts.

There is a growing demand for information systems oriented towards the interpretation of human language. These systems are designed to understand the intentions and opinions of the author with minimal human intervention. In the article [10], entitled Considerations on Ontologies Construction, the authors identify the challenge of interpreting heterogeneous information with automated tools and analyze the possibilities of using ontologies to resolve these issues. Combining an ontological model with natural language rules can improve the performance of sentiment analysis. Ontologies can also be used to solve various other tasks [2,6].

Ontology is a powerful and widely used tool to model relationships between objects belonging to various subject fields. In the context of computer and information sciences, an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application.

This formalism defines an ontology O as a triple $(V, R, K)$, where V is a set of classes of a given subject field, R is a set of relationships between these classes, and K is a set of attributes in the domain [12].

Fig. 2 presents part of the ontology model for determining the sentiment of phrases in the Kazakh language. For example, a collocation can consist of an adjective and a noun (or a verb). In the ontology formalism this is written as O(adjective, has_polarity, adjective with positive (negative) orientation); O(collocation, has_polarity, sentiment).

Fig. 2. Ontology for determining the sentiment of the collocation.

The ontology has allowed us to present phrases with sentiment in a model form for further use in OWL and RDF Schema. In addition, it allows semantic queries in the SPARQL language to be constructed based on the rules presented in Section 4.
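To make the statement form O(subject, relation, object) concrete, the fragment of the ontology mentioned above can be held as a small list of statements and filtered by pattern, which is roughly what a single SPARQL triple pattern does. This is only a sketch in plain Python under stated assumptions: the class `Statement`, the function `query`, the relation name `consists_of` and the object names `positive_orientation` and `negative_orientation` are hypothetical, and no OWL or RDF tooling is used.

```python
from typing import List, NamedTuple, Optional


class Statement(NamedTuple):
    subject: str
    relation: str
    obj: str


# Statements paraphrased from the text; "consists_of" is a hypothetical relation name.
STATEMENTS = [
    Statement("collocation", "consists_of", "adjective"),
    Statement("collocation", "consists_of", "noun"),
    Statement("collocation", "consists_of", "verb"),
    Statement("adjective", "has_polarity", "positive_orientation"),
    Statement("adjective", "has_polarity", "negative_orientation"),
    Statement("collocation", "has_polarity", "sentiment"),
]


def query(subject: Optional[str] = None,
          relation: Optional[str] = None,
          obj: Optional[str] = None) -> List[Statement]:
    """Return every statement that agrees with all fields that are not None."""
    return [
        s for s in STATEMENTS
        if (subject is None or s.subject == subject)
        and (relation is None or s.relation == relation)
        and (obj is None or s.obj == obj)
    ]


if __name__ == "__main__":
    # Which classes carry a polarity?
    for statement in query(relation="has_polarity"):
        print(statement.subject, "->", statement.obj)
```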

4. Rules used to define the sentiment

We formalized the rules for defining the sentiment of phrases in the Kazakh language and describe them using production rules. For this, we introduce the following meta notations (Table 1).

Table 1
Meta notations

Notation	Definition
ω	Set of words of language
L	Set of sentences of language
A	Set of adjectives
N	Set of noun words
V	Set of verbs
$V^{- 1}$	Set of verbs with negative form
D	Set of superlative and comparative adverbs
$sent$	Predicate of sentiment
¬	Operation of negation words emes/zhok (not/no)
·	Operation of concatenation
$α, β, γ, \dots, ζ, ξ$	Any string of words, variables

The production rules for determining the sentiment of adjective and noun phrases are given below in the sequent form $\frac{A}{B}$, where A is the antecedent and B is the consequent. Some rules were described in [27]. Here we present further rules for determining the sentiment of phrases.

If the extracted word is a positive noun, and the next word is a neutral verb, then the sentiment of this phrase is positive. $\frac{ω \in L,\ ω = ζ \cdot α \cdot β \cdot ξ,\ α \in N,\ sent(α) = 1,\ β \in V,\ sent(β) = 0}{sent(ω) = 1}$

Example: toi (1) boldy (0) (there was a wedding).

If the negation word “emes” (not) stands between a negative adjective and a noun, then the sentiment of this phrase becomes positive. $\frac{ω \in L,\ ω = ζ \cdot α \cdot β \cdot γ \cdot ξ,\ α \in A,\ sent(α) = -1,\ β = \neg,\ γ \in N}{sent(ω) = 1}$

Example: zhaman (negative adjective) emes (negation) kino (noun) (the movie is not bad).

If the found word is an adjective with positive polarity and the next word after it is a neutral verb, then the polarity of this phrase is positive. $\frac{ω \in L,\ ω = ζ \cdot α \cdot β \cdot ξ,\ α \in A,\ sent(α) = 1,\ β \in V,\ sent(β) = 0}{sent(ω) = 1}$

Example: zhaksy (positive adjective) isteidi (verb) (works well).

If a noun is followed by a verb, then the word coming after the verb should be checked. If the noun is positive and a negation word (emes/zhok (not/no)) follows the verb, then the sentiment of this phrase is negative. $\frac{ω \in L,\ ω = ζ \cdot α \cdot β \cdot γ \cdot ξ,\ α \in N,\ sent(α) = 1,\ β \in V,\ sent(β) = 0,\ γ = \neg}{sent(ω) = -1}$

Example: adilettilik (positive noun) ornagan (verb) emes/zhok (negation) (there is no justice).
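The two negation rules given so far (negation between an adjective and a noun, and negation after a noun and a verb) can be sketched as a single function over a three-word phrase. This is an illustrative sketch rather than the implemented system: the token layout `(word, tag, polarity)`, the tag names and the function `negation_rule` are assumptions, the neutral polarity assigned to kino is assumed, and, following the prose description, a negative adjective is taken to have polarity −1.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str, int]  # (word, tag, dictionary polarity in -2..2)

NEGATION_WORDS = {"emes", "zhok"}  # negation words named in the text


def negation_rule(phrase: List[Token]) -> Optional[int]:
    """Return the phrase polarity if one of the two negation rules fires, else None."""
    if len(phrase) != 3:
        return None
    (_, tag1, pol1), (word2, tag2, pol2), (word3, tag3, _) = phrase
    # Negative adjective + negation + noun -> positive ("not bad").
    if tag1 == "ADJ" and pol1 == -1 and word2 in NEGATION_WORDS and tag3 == "NOUN":
        return 1
    # Positive noun + neutral verb + negation -> negative ("there is no justice").
    if tag1 == "NOUN" and pol1 == 1 and tag2 == "VERB" and pol2 == 0 and word3 in NEGATION_WORDS:
        return -1
    return None


if __name__ == "__main__":
    print(negation_rule([("zhaman", "ADJ", -1), ("emes", "NEG", 0), ("kino", "NOUN", 0)]))
    print(negation_rule([("adilettilik", "NOUN", 1), ("ornagan", "VERB", 0), ("zhok", "NEG", 0)]))
```

Returning `None` when no rule applies lets a caller fall back to other rules instead of silently treating the phrase as neutral.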

If the adjective is very positive and the next word is a neutral noun, then the sentiment of this phrase is very positive. $\frac{ω \in L,\ ω = ζ \cdot α \cdot β \cdot ξ,\ α \in A,\ sent(α) = 2,\ β \in N,\ sent(β) = 0}{sent(ω) = 2}$

Example: ardaqty (very positive adjective) ana (noun) (honorable mother).

If the adjective is very negative and the next word is a negative noun, then the sentiment of this phrase is very negative. $\frac{ω \in L,\ ω = ζ \cdot α \cdot β \cdot ξ,\ α \in A,\ sent(α) = -2,\ β \in N,\ sent(β) = -1}{sent(ω) = -2}$

Example: qatygez (very negative adjective) terrorist (negative noun) (violent terrorist).

Fig. 3. Program fragment. Example for collocations [adjective] + [verb], [adjective] + [noun], [adverb] + [adjective].

If a superlative or comparative adverb comes before a noun with negative sentiment, then the sentiment of this phrase is very negative. $\frac{ω \in L,\ ω = ζ \cdot α \cdot β \cdot ξ,\ α \in D,\ β \in N,\ sent(β) = -1}{sent(ω) = -2}$

Example: nagyz (adverb) shaitan (negative noun) (real devil).
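The two-word rules above share one shape: a condition on each word and a resulting polarity. One way to sketch them is as a table of rules that is scanned in order. This is an assumption-laden illustration, not the program shown in Fig. 3: the names `Rule`, `RULES`, `matches` and `apply_rules` are hypothetical, the tag `ADV` stands in for membership in the set D, and the polarity 0 given to the adverb in the usage example is a placeholder that the rule ignores.

```python
from typing import List, NamedTuple, Optional, Tuple

Token = Tuple[str, str, int]              # (word, tag, dictionary polarity)
Condition = Tuple[str, Optional[int]]     # (required tag, required polarity or None for "any")


class Rule(NamedTuple):
    first: Condition
    second: Condition
    result: int


# Antecedents and consequents of the two-word rules stated in the text.
RULES = [
    Rule(("NOUN", 1), ("VERB", 0), 1),       # positive noun + neutral verb
    Rule(("ADJ", 1), ("VERB", 0), 1),        # positive adjective + neutral verb
    Rule(("ADJ", 2), ("NOUN", 0), 2),        # very positive adjective + neutral noun
    Rule(("ADJ", -2), ("NOUN", -1), -2),     # very negative adjective + negative noun
    Rule(("ADV", None), ("NOUN", -1), -2),   # adverb from D + negative noun
]


def matches(token: Token, condition: Condition) -> bool:
    _, tag, polarity = token
    required_tag, required_polarity = condition
    return tag == required_tag and (required_polarity is None or polarity == required_polarity)


def apply_rules(phrase: List[Token]) -> Optional[int]:
    """Return the consequent of the first rule whose antecedent holds, else None."""
    if len(phrase) != 2:
        return None
    for rule in RULES:
        if matches(phrase[0], rule.first) and matches(phrase[1], rule.second):
            return rule.result
    return None


if __name__ == "__main__":
    print(apply_rules([("ardaqty", "ADJ", 2), ("ana", "NOUN", 0)]))
    print(apply_rules([("nagyz", "ADV", 0), ("shaitan", "NOUN", -1)]))
```

Keeping the rules in a table means a new rule is added as one more row rather than one more branch in the code.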

The polarity of the whole text is defined as the arithmetic mean of the polarity values of its lexical units (sentences), obtained with the rules for their combination: $sent(L) = \frac{\sum_{i = 1}^{n} sent_{i}(ω)}{n}$, where $sent_{i}(ω)$ is the polarity of the i-th unit and n is the number of units.
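The averaging step is straightforward to sketch. In the fragment below the function name `text_polarity` is hypothetical, and returning 0.0 (neutral) for a text with no scored units is an assumption, since that case is not covered by the formula.

```python
from typing import Sequence


def text_polarity(unit_polarities: Sequence[int]) -> float:
    """Arithmetic mean of the polarities of the lexical units of a text."""
    if not unit_polarities:
        # Assumption: a text without scored units is treated as neutral.
        return 0.0
    return sum(unit_polarities) / len(unit_polarities)


if __name__ == "__main__":
    # Four units scored 1, -1, 2 and 2 give (1 - 1 + 2 + 2) / 4 = 1.0.
    print(text_polarity([1, -1, 2, 2]))
```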

Simple and extended fuzzy evaluation models are described in [24]. That work proposes a fuzzy inference technique that estimates the overall level of a hotel on the basis of a criterion calculated for each aspect. The criterion is measured in percent and is derived from fuzzy subjective estimates of hotel services expressed in Kazakh.

5. Results

The implemented system is based on the rules described above for the Kazakh language. A fragment of the implemented program is given in Fig. 3.

Explanation of the examples in Fig. 3: (the meal is not tasty), (too spicy), (highly respected), (bad influences), (have clear objectives), (main idea is a place without nuclear)

In addition, the sentiment may also depend on the conjunctions between words or sentences:

if there are connecting conjunctions, the sentiment does not change;

if dividing or adversative conjunctions come between words or sentences, then the semantic orientation of the sentence changes to the opposite. For example, bul kampit tatti, birak katty eken (this candy is tasty, but hard).
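One possible reading of the two conjunction rules is sketched below: the polarity accumulated before the conjunction is kept for a connecting conjunction and negated for an adversative one. The function `apply_conjunction` and both word sets are assumptions made for illustration; only birak (but) comes from the example above, while zhane (and) is added here as a sample connecting conjunction.

```python
ADVERSATIVE = {"birak"}  # "but", taken from the example in the text
CONNECTING = {"zhane"}   # "and"; added for illustration, not taken from the text


def apply_conjunction(polarity_before: int, conjunction: str) -> int:
    """Flip the orientation after an adversative conjunction, otherwise keep it."""
    if conjunction in ADVERSATIVE:
        return -polarity_before
    # Connecting (and unknown) conjunctions leave the sentiment unchanged.
    return polarity_before


if __name__ == "__main__":
    # "bul kampit tatti" is positive (1); "birak" reverses the orientation.
    print(apply_conjunction(1, "birak"))
```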

Table 2 compares the methods implemented for the Kazakh language.

Table 2
Results

Method	Accuracy
Long Short-Term Memory (LSTM) [19]	86.3%
Deep learning model for bilingual sentiment [1]	60%
Rule-based method (our method)	83%

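As can be seen, the rule-based method reaches 83% accuracy, which is close to the 86.3% of the LSTM model and well above the 60% of the bilingual deep learning baseline.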

6. Conclusion

In this work, we reviewed ontology-based sentiment analysis of Kazakh phrases. The ontology has allowed us to present phrases with sentiment in a model form for further use in OWL and RDF Schema. In addition, the ontology allowed semantic queries to be constructed in the SPARQL language. The queries are based on the formal rules that determine the sentiment of phrases in the Kazakh language.

In the future, we also plan to apply this method to defining the sentiment of sentences and texts in the Kazakh language. For this purpose, we will consider conjunctions (and, or, but, etc.) and apply different logics to formalize them. We will also expand the ontological model to determine the sentiment polarity of texts in the Kazakh language.

This work describes the first attempts to classify the sentiment of texts as very positive, positive, neutral, negative or very negative. In the future, we plan not only to classify texts by polarity, but also to extract the emotion of the author of the text using psychological models (Ekman, Plutchik).

References

1. Y.B. Abdullin and V.V. Ivanov, Deep learning model for bilingual sentiment classification of short texts, Scientific and Technical Journal of Information Technologies, Mechanics and Optics 17(1) (2017), 129–136. doi:10.17586/2226-1494-2017-17-1-129-136.

2. Afsharchi, Denzinger and Far, Enhancing communication with groups of agents using learned non-unanimous ontology concepts, Web Intelligence Journal, IOS Press 7(2) (2009), 107–121.

3. Akba, Uçan, E.A. Sezer and Sever, Assessment of feature selection metrics for sentiment analyses: Turkish movie reviews, in: Proceedings of the 8th European Conference on Data Mining, 2014, pp. 180–184.

4. Arunnehru and Kalaiselvi Geetha, Automatic human emotion recognition in surveillance video, in: Studies in Computational Intelligence, Vol. 660, 2017, pp. 321–342.

5. Biondi, Franzoni, Li and Milani, Web-based similarity for emotion recognition in web objects, in: Proceedings – 9th IEEE/ACM International Conference on Utility and Cloud Computing, UCC, 2016, pp. 327–332.

6. Cao, Ch. Zhang and Liu, Ontology-based integration of business intelligence, Web Intelligence Journal, IOS Press 4(3) (2006), 313–325.

7. Chetviorkin, Braslavskiy and Loukachevitch, Sentiment analysis track at ROMIP 2011, in: Proceedings of International Conference Dialog – 2012, Vol. 2, pp. 1–14.

8. Chetviorkin and Loukachevitch, Extraction of Russian sentiment lexicon for product meta-domain, in: Proceedings of COLING, 2012, pp. 593–610.

9. Chetviorkin and Loukachevitch, Sentiment analysis track at ROMIP 2012, in: Proceedings of International Conference Dialog – 2013, Vol. 2, pp. 40–50.

10. Cicortas, Iordan and Fortis, Considerations on construction ontologies, Journal Annals Computer Science Series 1 (2009), 79–88.

11. Eryiğit, Çetin, Yanık, Temel and Çiçekli, TURKSENT: A sentiment annotation tool for social media, in: Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, ACL, Sofia, Bulgaria, 2013.

12. T.R. Gruber, Toward principles for the design of ontologies used for knowledge sharing, International Journal Human–Computer Studies 43(5–6) (1995), 907–928. doi:10.1006/ijhc.1995.1081.

13. Jiang, A.T.S. Ho, Cheheb, Al-Maadeed, Al-Maadeed and Bouridane, Emotion recognition from scrambled facial images via many graph embedding, Pattern Recognition 67 (2017), 245–251. doi:10.1016/j.patcog.2017.02.003.

14. Liu, Sentiment Analysis and Opinion Mining, Morgan & Claypool Publishers, 2012.

15. Mohammad, Salameh and Kiritchenko, Sentiment lexicons for Arabic social media, in: Proceedings of the 10th edition of the Language Resources and Evaluation Conference, Portorož, Slovenia, 2016.

16. Pang and Lee, Opinion mining and sentiment analysis, in: Foundations and Trends in Information Retrieval, Now Publishers, 2008.

17. R.W. Picard, Affective computing, MIT Media Laboratory Perceptual Computing Section Technical Report No. 321, Media Lab. Massachusetts Institute of Technology, Cambridge Univ., 1995.

18. Poria, Chaturvedi, Cambria and Hussain, Convolutional MKL based multimodal emotion recognition and sentiment analysis, in: Proceedings – IEEE International Conference on Data Mining, ICDM, 2017, pp. 439–448, art. no. 7837868.

19. N.S. Sakenovich and A.S. Zharmagambetov, On one approach of solving sentiment analysis task for Kazakh and Russian languages using deep learning, in: Computational Collective Intelligence. ICCCI 2016, Lecture Notes in Computer Science, Vol. 9876, 2016.

20. Samir and Ibrahim, Semantic sentiment analysis in Arabic social media, Journal of King Saud University – Computer and Information Sciences 29(2) (2016), 229–233.

21. Sixto, Almeida and López-de-Ipiña, An approach to subjectivity detection on Twitter using the structured information, in: Computational Collective Intelligence. ICCCI 2016, Lecture Notes in Computer Science, Vol. 9875, Springer, Cham, 2016.

22. Taboada, Brooke, Tofiloski, Voll and Stede, Lexicon-based methods for sentiment analysis, Computational Linguistics 37(2) (2011), 267–307. doi:10.1162/COLI_a_00049.

23. Yergesh, Identifying the tonality of texts in the Kazakh language on the basis of the dictionary of emotional lexis, in: V International Conference on Computer Processing of Turkic Languages “TurkLang 2017”. Conference Proceedings, in 2 volumes, T 1, Publisher of the Academy of Sciences of the Republic of Tatarstan, Kazan, 2017, pp. 62–67.

24. Yergesh, Bekmanova and Sharipbay, Sentiment analysis on the hotel reviews in the Kazakh language, in: Proc. International Conference on Computer Science and Engineering (UBMK), 2017, pp. 790–794.

25. Yergesh, Bekmanova, Sharipbay and Yergesh, Ontology-based sentiment analysis of Kazakh sentences, in: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNCS, Vol. 10406, 2017, pp. 669–677.

26. Yergesh, Mukanova, Sharipbay, Bekmanova and Razakhova, Semantic hyper-graph based representation of nouns in the Kazakh language, Computacion y Sistemas 18(3) (2014), 627–635.

27. Yergesh, Sharipbay, Bekmanova and Lipnitskii, Sentiment analysis of Kazakh phrases based on morphological rules, Journal of Kyrgyz State Technical University Named after I. Razzakov. Theoretical and Applied Scientific Technical Journal 38(2) (2016), 39–42.

28. Yıldırım, Çetin, Eryiğit and Temel, The impact of NLP on Turkish sentiment analysis, in: Proceedings of the TURKLANG’14 International Conference on Turkic Language Processing, Istanbul, 2014.
