A mixed approach of statistical weighting method and unsupervised method to improve Uyghur sentiment classification

Abstract

Considering the scarcity of Uyghur sentiment resources, this paper proposes a new combined unsupervised sentiment classification method for Uyghur text that requires no labeled corpora. In the first part, a Uyghur sentiment dictionary, UYSentiDict, was adopted to classify the sentences. For the sentiment vocabulary matching, both the original word and its stem were matched, and the influence of sentence patterns, negation words, and degree adverbs was considered as well. Based on different thresholds, the sentences with higher sentiment values were selected from the lexicon-based classification results as a pseudo-labeled dataset. In the second part, a machine learning classifier learned sentiment characteristics from the pseudo-labeled dataset and then classified the remaining data. The experiments show that the proposed method has good classification efficiency on Uyghur sentiment corpora from four different fields, and some results were better than those of a machine learning classifier. Moreover, the method is not restricted by the field of the data and does not need a training corpus annotated in advance, so it can effectively alleviate the resource shortage problem in Uyghur sentiment classification.

Keywords

Sentiment dictionary, machine learning classification, unsupervised sentiment classification, lexicon-based classification, Uyghur language

1. Introduction

Text sentiment analysis has become an active research topic in the field of NLP. Among its tasks, the positive and negative sentiment classification of network data has attracted researchers’ attention. In 2002, Pang [1] was the first to use a machine learning classifier to classify the sentiment of documents. Since then, supervised machine learning classification methods have been widely applied. Supervised machine learning classification uses manually labeled datasets for training, and when sufficient and correctly labeled training data are available in a field, this method can achieve satisfactory classification results.

Network data covers a wide range of fields. Ensuring sufficient annotated corpora are available in certain fields to meet various classification requirements can be challenging. Although the lexicon-based sentiment classification method is domain independent and does not require an annotated corpus, some sentiment vocabularies express different sentiment polarities in different fields or different contexts.

Domestic and foreign researchers have conducted in-depth research on the sentiment classification of English and Chinese. Large-scale sentiment corpora and sentiment dictionaries have been created [2, 3, 4, 5]. Many satisfactory sentiment classification models and algorithms have emerged, and some have been used in practical applications.

Compared with the study of English and Chinese sentiment classification, the research on Uyghur text sentiment classification has lagged behind. No lexical resources or labeled corpora are publicly available for Uyghur sentiment classification. For languages lacking sentiment resources, some researchers have combined dictionaries and machine learning classifiers. Li [6] and Melville et al. [7] combined dictionaries and a small amount of annotated text to train classifiers. He [8] and Qiu et al. [9] used a sentiment dictionary to make initial judgments on the tendency of the text, used this result to train a new classifier, and then revised the initial results. Such methods do not rely on a large amount of annotated data and can be used in sentiment classification tasks in any field. Therefore, these methods are suitable for the sentiment classification of resource-poor languages, such as Uyghur.

Inspired by previous work [8, 9], we first calculated each sentence’s sentiment score based on the Uyghur sentiment dictionary (UYSentiDict).¹

¹ Code and data for the first step are available at https://github.com/ErpanY/Uy_SO-PMI.

Next, the sentence sentiment score was corrected for the influence of sentence patterns, negation words, and degree adverbs on the sentiment of the sentences.²

² Code and data for the second step are available at https://github.com/ErpanY/Uy_DictSentiClassify.

Finally, according to the sentiment score of each sentence, the sample data with higher classification confidence were selected as the pseudo-annotation training set, and the machine learning classifier was trained on the pseudo-annotation training set to perform sentiment classification of the remaining data.³

³ The code, corpus, and experimental results of the final step are available at https://github.com/ErpanY/Uy_SentiClassify ByML.

The method applied in this paper trains the machine learning classifier on the sentiment classification results of the lexicon-based classifier. Therefore, the result of the lexicon-based classifier directly affects the classification result of the subsequent machine learning classifier. In lexicon-based classification, degree adverbs, negation words, and sentence patterns are the key factors that influence the tendency of sentences. Although some of these factors were considered in the literature on Uyghur sentiment classification, they were each assessed on a corpus from a single area, and their overall impact on different corpora has not been systematically studied. Therefore, taking into account the above factors, this research aimed to improve the classification efficiency of the lexicon-based classifier as much as possible.

The main contributions of this article are as follows:

• We systematically studied the influence of modifiers, such as turning conjunctions, progressive conjunctions, degree adverbs, and negation words, on the sentiment polarity of Uyghur sentences, and we propose some optimization algorithms. To classify a sentence, we first determined the sentence pattern, and then found the sentiment word and the additional modifiers, such as degree adverbs and negation words, within a certain window size before and after the sentiment word. The sentiment polarity of sentences was graded according to different weight rules.

• The traditional dictionary scoring method directly sums the sentiment values of the positive and negative words in the sentence. Although this method is simple, it yields lower accuracy. Uyghur is a morphologically rich agglutinative language in which words are composed of stems and affixes. Some sentiment vocabulary entries are whole words, and some are stems. Therefore, to standardize the scoring algorithm, we set a weight factor for word stems, determined through comparison experiments. In the process of sentence sentiment scoring, we considered the original word as well as the stem.

• The method proposed in this paper is universal for the sentiment classification of resource-poor languages and is domain independent. First, by using the domain independence of the dictionary method, the pseudo-annotated data of the target domain were obtained, and then the domain knowledge was learned from the pseudo-annotated data by a machine learning classifier. Finally, the remaining data that could not be classified by the dictionary method were classified, thereby combining the strengths of the above two methods. This method does not need the corpus in the target area to be annotated in advance, so it effectively addresses the resource shortage problem in sentiment classification for minority languages such as Uyghur.

2. Related work

Sentiment classification methods fall into two general groups: machine learning methods [1] and lexicon-based methods [10, 11]. At present, machine learning is the most common approach to sentiment classification. When a sufficient amount of labelled training corpora is available, this approach can achieve better classification performance. However, it has a serious “domain-dependent” problem [12]. That is, classifiers trained on annotated sample sets in a certain domain perform well only on test sets in the same field. When switching to other fields, especially if the difference in distribution between the target domain and the source domain is large, the performance of the algorithm is greatly reduced. This problem can be solved by labelling a sufficiently large dataset in the target domain.

With the rapid development of the Internet, a large amount of text data must be processed every day. Labelling this data quickly is challenging, because data labelling is laborious and costly. In reality, unlabelled data always outnumber labelled data. Therefore, research on sentiment classification has been focusing on semi-supervised learning with unlabelled samples and on unsupervised learning methods without labelled samples. Dasgupta and Ng [13] first selected the easy-to-categorize review texts based on the spectral clustering method, then applied active learning to tag the indistinguishable texts, and finally built an entire semi-supervised learning system by means of transfer learning. Goldberg and Zhu [14] solved the problem of predicting the polarity scores of comments with a graph-based semi-supervised learning model. Li et al. [15] started with a personal view of the evaluation text and used a cooperative training method to perform semi-supervised sentiment classification. Xue et al. [16] proposed a semi-supervised sentiment classification method that integrates social network knowledge.

For unsupervised sentiment classification, Turney’s work [11] was pioneering. First, phrases matching certain patterns are extracted from the document; then the sentiment tendency of each phrase is calculated from the PMI (Pointwise Mutual Information) value between the phrase and the seed words. Finally, the polarity of the document is calculated. Subsequent studies [17, 18, 19] used sentiment lexicons to perform unsupervised sentiment classification. This domain-independent method does not require any tagged corpora and, based on a small number of seed words, can achieve better classification results. However, this method also has some shortcomings. Although the sentiment lexicon is not tied to a domain as a whole, some sentiment words may express different sentiment polarities in different fields or different contexts. For example, “unpredictable” expresses positive feelings when used in the field of movie reviews, and expresses negative feelings when used in the automotive field to evaluate automotive performance. In the field of digital cameras, “long” expresses a positive tendency in the phrase “long power supply time”, whereas in the phrase “long focusing time” the tendency of “long” is negative.
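For reference, the semantic orientation that Turney [11] computes for a phrase is commonly written as follows, where “excellent” and “poor” are the seed words used in that work and $p(\cdot)$ denotes a probability estimated from occurrence and co-occurrence counts:

$\displaystyle \textit{PMI}(w_{1},w_{2})=\log_{2}\frac{p(w_{1},w_{2})}{p(w_{1})\,p(w_{2})},\qquad SO(\textit{phrase})=\textit{PMI}(\textit{phrase},\textit{``excellent''})-\textit{PMI}(\textit{phrase},\textit{``poor''})$

A phrase with $SO>0$ leans positive and one with $SO<0$ leans negative; the polarity of a document is then taken from the average $SO$ of its extracted phrases.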

Considering the performance limitations of unsupervised classification methods based on sentiment lexicons, another group of scholars proposed unsupervised sentiment classification algorithms that combine a sentiment lexicon with a machine learning classifier. Some of these algorithms combine lexicon-based methods with machine learning methods that are trained on a small amount of labelled data [6, 7, 20]. This approach was shown to improve on the classification results obtained by either the machine learning classifier or the lexicon-based classifier used in isolation. Ten et al. [21, 22] divided the sentiment classification process into two steps: first, the sentiment dictionary is used to make initial judgments on the tendencies of the review text, then this result is used to generate a new classifier. Finally, the statistical properties of the corpora in the domain are used to accomplish the classification task [8, 9, 21, 22].

In the field of Uyghur text sentiment classification, some researchers have explored sentiment classification based on sentiment lexicons [23, 24, 25]. In Yusuf and Hamdulla [23], the sentiment words and phrases were manually extracted from sentences based on the sentiment characteristics of Uyghur sentences, and a Uyghur sentiment lexicon containing 379 words was constructed. Then, a keyword matching algorithm performed four-category sentiment classification on a Uyghur sentiment corpus containing 873 sentences. The work in [24] proposed a Uyghur sentiment word set based on automatic annotation, and completed eight-category sentiment classification for sentences. During the classification process, the influence of turning conjunctions and negations on the sentiment tendency of sentences was analysed. For turning conjunctions, the first half of the sentence was masked. For negation words, the tendency of the sentence was reversed whenever a negation word appeared in the sentence. Nian et al. [25] translated HowNet (a Chinese sentiment dictionary), NTUSD (the National Taiwan University Sentiment Dictionary), and the dictionary published by the Dalian University of Technology into Uyghur, and expanded the result with the Uyghur Synonymous Dictionary, constructing a Uyghur sentiment lexicon containing 9342 words. With the help of negation words, degree adverbs, transition adjuncts, and other modifiers, sentiment classification on the self-built sentiment corpus was accomplished. However, these prior studies did not describe in detail how the modifiers were handled.

3. Unsupervised sentiment classification method combined with lexicon and machine learning

The unsupervised sentiment classification method in this paper is divided into two phases. The first phase selects the pseudo-labelled training data. In this phase, the sentiment score of each sentence in the dataset to be classified is calculated using a sentiment dictionary, and then the part of the data classified with a higher confidence level is selected as a pseudo-labelled dataset. The second phase is the classifier learning phase, in which a classifier is trained on the pseudo-labelled dataset obtained in the first phase. The remaining first-phase data with lower confidence are then classified by this classifier to obtain the final classification result. The classification framework is shown in Fig. 1.
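The hand-over between the two phases can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name `split_by_confidence`, the symmetric absolute threshold, and the 1/0 label encoding are all assumptions made for the example.

```python
from typing import List, Tuple

def split_by_confidence(
    scored: List[Tuple[str, float]], threshold: float
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Split lexicon-scored sentences into pseudo-labelled and remaining sets.

    `scored` holds (sentence, sentiment_score) pairs from the lexicon phase.
    A sentence whose |score| reaches `threshold` (a hypothetical, tunable
    value) is kept with label 1 (positive) or 0 (negative); every other
    sentence is left for the second-phase classifier.
    """
    pseudo_labelled, remaining = [], []
    for sentence, score in scored:
        if score != 0 and abs(score) >= threshold:
            pseudo_labelled.append((sentence, 1 if score > 0 else 0))
        else:
            remaining.append(sentence)
    return pseudo_labelled, remaining

# Toy scores standing in for the output of the lexicon-based phase.
scored = [("s1", 3.0), ("s2", -4.0), ("s3", 0.5), ("s4", 0.0)]
train, rest = split_by_confidence(scored, threshold=2.0)
```

Here `train` would be used to fit any text classifier, which is then applied to the sentences in `rest`.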

Figure 1.

Process diagram of the classification framework.

3.1 Lexicon-based sentiment classification phase

3.1.1 Construction of Uyghur sentiment lexicon

Lexicon-based sentiment classification methods need a cross-domain sentiment dictionary with wide coverage. No public sentiment dictionaries are available in Uyghur.

To quickly create a Uyghur sentiment lexicon, we followed previously established methods [26, 27]. We selected positive and negative sentiment words and positive and negative evaluative words from two widely used Chinese dictionaries, HowNet [5] and NTUSD [28] (the latter collected and compiled by National Taiwan University), and translated them into Uyghur with the machine translation tool called the Tilmach Chinese-Uyghur Bilingual Dictionary (provided by the multilingual information technology laboratory of Xinjiang). Then, we created a basic Uyghur sentiment lexicon from the translated words after manual filtering. Translated words that retained their original sentiment tendencies were kept, and words that lost or changed their sentiment tendencies were removed from the translated word list.

To further expand the Uyghur sentiment lexicon, adjectives, interjections, modal words, verbs, and nouns were extracted from the Practical Dictionary of Uyghur and Chinese [29] as a candidate sentiment vocabulary. The sentiment polarity of the words was annotated by two professional students, inconsistent annotations were resolved through discussion, and the annotated words were added to the vocabulary list of the basic sentiment lexicon. The final Uyghur sentiment lexicon was named UYSentiDict. UYSentiDict also includes phrases, idioms, negation words, degree adverbs, and conjunctions, as shown in Table 1.

Table 1
Uyghur sentiment lexicon glossary

	Positive	Negative	Total
Words	3044	5551	8595
Phrase	870	2414	3284
Idiom	214	236	450
Negations			9
Degree adverb			129
Conjunctions			9

3.1.2 Classification process based on sentiment dictionary

The first step was to calculate the sentiment score of each sentence in the dataset to be classified according to the sentiment lexicon. If the sentiment score of the sentence was greater than zero, then the sentence was positive; if it was less than zero, it was negative. If the score was equal to zero, then the sentence type could not be determined. During the classification process, based on the grammatical characteristics of the Uyghur language, the impact of conjunctions, degree adverbs, and negations on the sentence sentiment was comprehensively considered.

3.1.2.1. Sentiment vocabulary

In most cases, the sentiment tendencies expressed in texts are reflected by sentiment words, so sentiment words are one of the important bases of sentiment tendency judgment.

Uyghur is a morphologically rich language; Uyghur words are made up of stems and suffixes. The sentiment tendency of some words is expressed by the stem, and that of others is expressed only after suffixes are added to the stem. Using the “Uyghur lexical analyser” (provided by the multilingual information technology laboratory of Xinjiang), we conducted word segmentation and stemming on the sentiment corpus. We first looked up sentiment words in the original word sequence of each sentence. If the sentiment could not be determined from the original word sequence, then the stem of each word in the sentence was checked. The following is a sentence processed by the Uyghur lexical analyser.

“ $u[T=P][S=u]\textit{bolsa}[T=V][S=\textit{bol}]\textit{bir}[T=M][S=\textit{bir}]\textit{ichi}[T=N][S=\textit{ichi}]\textit{tar}[T=A][S=\textit{tar}]\textit{adem}[T=N][S=\textit{adem}]<\textit{EOS}>$ ” (Translation: This is a narrow-minded person.)

where “ $T=$ ” introduces the part-of-speech tag ( $P$ means pronoun, $V$ means verb, $M$ means numeral, $N$ means noun, and $A$ means adjective), the text outside the square brackets is the original form of the word, “ $S=$ ” indicates the stem of the word, and $<\textit{EOS}>$ is the sentence end tag.
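A short sketch of how such analyser output could be read into (word, tag, stem) triples is given below. The regular expression and the names `Token` and `parse_analysed` are illustrative assumptions; the actual output format of the lexical analyser may differ in details such as token separators.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str  # original (surface) form of the word
    pos: str   # part-of-speech tag, e.g. "V"
    stem: str  # stem reported by the analyser

# One token looks like word[T=TAG][S=stem]; whitespace between tokens is optional.
_TOKEN_RE = re.compile(r"\s*([^\[\]\s<>]+)\[T=([^\]]+)\]\[S=([^\]]+)\]")

def parse_analysed(line: str) -> List[Token]:
    """Parse one analysed sentence, dropping the <EOS> end tag."""
    line = line.replace("<EOS>", "")
    return [Token(*m.groups()) for m in _TOKEN_RE.finditer(line)]

sample = ("u[T=P][S=u]bolsa[T=V][S=bol]bir[T=M][S=bir]"
          "ichi[T=N][S=ichi]tar[T=A][S=tar]adem[T=N][S=adem]<EOS>")
tokens = parse_analysed(sample)
```

Keeping both `word` and `stem` on each token lets the scoring step try the original word first and fall back to the stem.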

In Nian et al. [25] and Wiegand et al. [30], the sentiment value of positive words was designed to be 1, and that of negative words was $-$ 1. Elsewhere in the literature [31], 1 was assigned to positive words and $-$ 2 for negative words. Through comparison, the sentiment value of the positive words in this paper was finally determined to be 1, and the negative words $-$ 2. Considering that the sentiment expressed in the stem is always weaker than the original word, this paper designed a sentiment weight factor for the stem and calculated the stem sentiment value according to Eq. (1).

$\displaystyle SO(S)=\sum_{i=0}^{n}f(\omega,x_{i},b),\quad f(\omega,x,b)=\left\{\begin{array}{ll}1-\omega\cdot b,&x=1\\ -2+\omega\cdot b,&x=-2\\ 0,&x=0\end{array}\right.\quad b\in\{0,1\}$ (1)

where $\omega$ represents the stem weight factor, $x_{i}$ represents the sentiment value of the ith word in the sentence $S$ (positive equals 1, negative equals $-$ 2, non-sentiment words equal 0), and $b$ is the matching coefficient. When the original word in sentence $S$ matches a word in the dictionary, $b$ equals 0. If the stem of a word in sentence $S$ matches a word in the dictionary, $b$ equals 1.
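Eq. (1) can be sketched in code as follows. The value `omega=0.5`, the toy lexicon, and the inflected form “hurunlar” are assumptions chosen only to make the example concrete; the stem weight actually used in the paper is determined by comparison experiments.

```python
from typing import Dict, List, Tuple

POSITIVE, NEGATIVE = 1.0, -2.0  # word-level sentiment values stated in the text

def word_value(word: str, stem: str, lexicon: Dict[str, float], omega: float) -> float:
    """Return f(omega, x, b) from Eq. (1) for a single token."""
    if word in lexicon:      # the original word matches the dictionary: b = 0
        x, b = lexicon[word], 0
    elif stem in lexicon:    # only the stem matches the dictionary: b = 1
        x, b = lexicon[stem], 1
    else:
        return 0.0           # non-sentiment word: x = 0
    if x > 0:
        return 1.0 - omega * b
    return -2.0 + omega * b

def sentence_score(tokens: List[Tuple[str, str]], lexicon: Dict[str, float],
                   omega: float = 0.5) -> float:
    """Sum f over all (word, stem) pairs of a sentence, i.e. SO(S)."""
    return sum(word_value(w, s, lexicon, omega) for w, s in tokens)

lexicon = {"chirayliq": POSITIVE, "hurun": NEGATIVE}          # toy entries
tokens = [("chirayliq", "chirayliq"), ("hurunlar", "hurun")]  # hypothetical input
score = sentence_score(tokens, lexicon, omega=0.5)  # 1.0 + (-2.0 + 0.5) = -0.5
```

A stem-only match therefore contributes a value closer to zero than a match on the original word, for both polarities.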

3.1.2.2. Special sentence pattern

In Uyghur, for a sentence containing a turning conjunction, the clause after the turning conjunction often expresses the true sentiment of the speaker, whereas the clause before it does not reflect the speaker’s true intention and can be ignored. For example:

“u naxayiti chirayliq, lékin bek hurun”. (Translation: She is beautiful, but very lazy).

Although the positive term “chirayliq” appeared in the first half of the sentence, the turning conjunctive “lékin” was used, emphasizing the negative term “hurun” in the latter half of the sentence. This article summarizes a total of seven Uyghur textual turning conjunctions from the book “The reference grammar of modern Uygur Language” [32], as follows:

“biraq, lékin, emma, halbuki, epsuski, epsus, shughinisi” (Translation: but, however, yet, unfortunately)

By analyzing the sentiment corpus in this paper, we found that colloquial turning conjunctions such as “emmaze” and “lékinze” were also present in the corpus. Therefore, we expanded the dictionary of turning conjunctions to contain the above nine turning conjunctions. Through in-depth analysis of the Uyghur sentiment corpora, we also improved the turning conjunction processing method used in the literature [24]. The sentence sentiment score calculation rules for turning conjunctions are summarized in Table 2. The rule takes the turning conjunction as the center and divides the sentence $S_{i}$ into two parts, $S_{i1}$ and $S_{i2}$ . According to the tendencies of $S_{i1}$ and $S_{i2}$ , the final tendency of the sentence $S_{i}$ was determined, and a weight coefficient was designed. According to the calculation rules in Table 2, the final sentiment score of $S_{i}$ was adjusted accordingly.

Table 2

Sentence sentiment score calculation rules with turning conjunctions

	Sentiment tendency
Tendency of $S_{i1}$	pos	pos	pos	neg	neg	neg	Unknown	Unknown	Unknown
Tendency of $S_{i2}$	pos	neg	Unknown	pos	neg	Unknown	pos	neg	Unknown
Tendency of $S_{i}$	pos	neg	neg	pos	neg	pos	pos	neg	Unknown
Weight coefficient	0.5	2	1	2	0.5	1	1	1	0

The sentiment scoring process and the value adjusting process of the sentence containing turning conjunctions are shown as Algorithms 1 and 2, respectively.

Algorithm 1. Sentence sentiment scoring process with turning conjunctions.
$\textit{if }(SO(S_{i2})!=0)$
$\textit{then }SO(S_{i})=SO(S_{i2})$
$\textit{else if }(SO(S_{i2})==0\&\&SO(S_{i1})!=0)$
$\textit{then }SO(S_{i})=SO(S_{i1})*(-1)$
$\textit{else }SO(S_{i})=0$

In Algorithm 1, if the sentiment tendency of the latter half of the sentence, $S_{i2}$ , was positive or negative, then the sentiment tendency of the entire sentence was equal to that of $S_{i2}$ . If the tendency of $S_{i2}$ was uncertain and the tendency of the first half of the sentence, $S_{i1}$ , was positive or negative, the sentiment tendency of the entire sentence was opposite to that of $S_{i1}$ . If neither condition was satisfied, the tendency of the sentence $S_{i}$ could not be determined.

Algorithm 2. Sentence sentiment score adjustment method with turning conjunctions.
$\textit{if }(SO(S_{i1})==0\textit{ and }SO(S_{i2})==0)$
$\textit{then }SO(S_{i})=0$
$\textit{else if }((SO(S_{i1})>0\&\&SO(S_{i2})>0)\textit{ or}$
$(SO(S_{i1})<0\&\&SO(S_{i2})<0))$
$\textit{then }SO(S_{i})=SO(S_{i})*0.5$
$\textit{else if }((SO(S_{i1})>0\&\&SO(S_{i2})<0)\textit{ or}$
$(SO(S_{i1})<0\&\&SO(S_{i2})>0))$
$\textit{then }SO(S_{i})=SO(S_{i})*2$

In Algorithm 2, if the sentiment tendency of $S_{i1}$ and $S_{i2}$ were both uncertain, then the sentence $S_{i}$ had a sentiment value of 0. If $S_{i1}$ and $S_{i2}$ had the same sentiment tendency, then the sentence did not conform to the turning conjunction rule, so the sentiment value of $S_{i}$ was multiplied by 0.5. If $S_{i1}$ and $S_{i2}$ had opposite sentiment tendencies, the sentence $S_{i}$ was more likely to have a sentiment tendency equal to that of $S_{i2}$ , so $S_{i}$ ’s sentiment value was multiplied by 2.
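Algorithms 1 and 2 can be combined into a single function, sketched below. The function name `turning_score` and the use of a sign product to test agreement are choices made for this illustration; the branch outcomes follow the rules in Table 2.

```python
def turning_score(so_first: float, so_second: float) -> float:
    """Score a sentence that is split at a turning conjunction.

    so_first  is SO(S_i1), the clause before the conjunction.
    so_second is SO(S_i2), the clause after the conjunction.
    """
    # Algorithm 1: the second clause decides; otherwise invert the first clause.
    if so_second != 0:
        score = so_second
    elif so_first != 0:
        score = -so_first
    else:
        return 0.0
    # Algorithm 2: adjust the score by the agreement between the two clauses.
    agreement = so_first * so_second
    if agreement > 0:      # same polarity: weight 0.5
        score *= 0.5
    elif agreement < 0:    # opposite polarity: weight 2
        score *= 2
    return score           # one clause unknown: weight 1

# "u naxayiti chirayliq, lékin bek hurun": positive clause, then negative clause.
example = turning_score(1.0, -2.0)
```

With the word values used in this paper (1 for positive, $-$ 2 for negative) and before any degree adverb weighting, the example sentence is scored from its second clause and doubled.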

Progressive conjunctions also affect the sentiment of a sentence. Consider the sentence “He is not only good at learning, but also very friendly”. It contains two positive words, “good” and “friendly”, but the sentiment in the latter part of the sentence is stronger. Such progressive sentences are also found in Uyghur texts. The two most commonly used progressive conjunctions in Uyghur are:

“…la qalmastin …/…belki …”, “…qalmay …/…yene …” (Translation: Not only …but also …)

For example, there are two sentences in the corpus:

“siz yaxshi fotugraf bolupla qalmastin yene usta shair ikensiz jumu”. (Translation: You are not only a good photographer, but a good poet.) “ular eneniwiy naxshilarni orunlapla qalmastin belki yene gheripning nurghun muzikilirini orundaydu”. (Translation: They not only played traditional music but also played a lot of Western music.)

Both sentences express positive sentiments. If the progressive sentence pattern is not taken into account, the progressive conjunction “la qalmastin” will be regarded as a negation word because it contains the negation suffix “ma”, and the first half of the sentence will be turned into a negative sentiment. The sentiment score of the entire sentence then becomes 0, so it is impossible to determine its category, as shown in the following example:

$\displaystyle SO(S)=SO(\textit{``yaxshi''})\times SO(\textit{``qalmastin''})+SO(\textit{``usta''})=1\times(-1)+1=0$

According to the stipulations in Liu et al. [33], progressive conjunctions were used as the segmentation point of the sentence to divide the sentence into two parts. The first half of the sentence’s sentiment weight was designed to be 1, and the latter half of the sentence’s sentiment weight was designed to be 1.5.
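As a worked illustration, assume that the two weighted halves are simply summed (the text gives the weights but not the combination rule) and that both sentiment words match the dictionary as original words. The first example sentence would then be scored as:

$\displaystyle SO(S)=1\times SO(\textit{``yaxshi''})+1.5\times SO(\textit{``usta''})=1\times 1+1.5\times 1=2.5$

Under this reading the sentence is classified as positive instead of receiving the undetermined score of 0.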

3.1.2.3. Degree adverbs

Degree adverbs play an important role in sentiment analysis. For verbs and adjectives, degree adverbs are modifiers that strengthen or weaken the meaning of words. The combination of degree adverbs and sentiment words expresses the degree of sentiment in the text more clearly. When a degree adverb modifies a sentiment word, the intensity of the sentiment word changes, becoming stronger or weaker than before the modification. For example:

① bu shéxir neqeder güzel he. (This is such a beautiful poem.) ② ademni héjep bizar qildi. (It’s so annoying.) ③ bügün sel hérip qaptimen. (Today I’m a bit tired.)

In ① and ② above, the degree adverbs “neqeder” and “héjep” modify the positive term “güzel” and the negative term “bizar”, respectively, expressing a more intense sentiment. In example ③, “sel” modifies the negative term “hérip” and weakens the original sentiment. The relevant literature expresses this strengthening and weakening with different weight values.

We drew on the weights of Uyghur degree adverbs proposed previously [25, 30] and combined them with the definition of Uyghur adverbs in the literature [32], dividing the degree adverbs into three levels: high, medium, and low. We also defined the weight of each level. Table 3 shows examples of the 129 Uyghur degree adverbs collected in this paper and the corresponding weights.

Table 3

Examples of Uyghur degree adverbs

Level	Weight	Degree adverb example	Amount
High	2.5	bek, intayin, ajayip …	74
Medium	1.5	esla, nisbeten, eynen …	36
Low	0.5	kichikkine, sel, azraq …	19

Degree adverbs in Uyghur usually appear in front of the words they modify. The method proposed in this paper uses a sliding window of length three for degree adverbs, searching the three words in front of the sentiment word. The optimal window size was determined by comparison experiments. If a degree adverb was present, the sentiment score of the sentiment word it modified was multiplied by the corresponding weight value.
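The window search can be sketched as follows. The dictionary holds only the example adverbs listed in Table 3, and taking the nearest preceding adverb when several fall inside the window is an assumption of this sketch, since the text does not state how multiple adverbs are handled.

```python
from typing import Dict, List

# Level weights from Table 3, attached to the example adverbs listed there.
DEGREE_WEIGHTS: Dict[str, float] = {
    "bek": 2.5, "intayin": 2.5, "ajayip": 2.5,
    "esla": 1.5, "nisbeten": 1.5, "eynen": 1.5,
    "kichikkine": 0.5, "sel": 0.5, "azraq": 0.5,
}

def apply_degree(words: List[str], idx: int, score: float, window: int = 3) -> float:
    """Scale the score of the sentiment word at `idx` by a preceding degree adverb."""
    start = max(0, idx - window)
    # Scan from the nearest preceding word outwards and use the first adverb found.
    for w in reversed(words[start:idx]):
        if w in DEGREE_WEIGHTS:
            return score * DEGREE_WEIGHTS[w]
    return score

words = ["bügün", "sel", "hérip", "qaptimen"]
adjusted = apply_degree(words, idx=2, score=-2.0)  # "sel" weakens "hérip"
```

Adverbs that lie more than three words before the sentiment word are ignored by this window.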

3.1.2.4. Negations

Negations are common linguistic phenomena that affect sentiment tendencies. When a negation word modifies a positive sentiment word, the original positive sentiment expressed will be transformed into negative, and vice versa. The negation category of modern Uyghur consists of simple negation words, derived negation words, and negation configuration morphemes [34].

• Simple Negation Words. The simple negation words in modern Uyghur texts include “yaq”, “emes”, “yoq”, and so on, which mean “no”, “not”, and “none”, respectively. Among them, “yaq” is generally a negative response to a non-questionable sentence, and “yoq” indicates that something does not exist or has disappeared. Neither can reverse the sentiment information expressed in a sentence. “emes” is generally used to negate some of the qualitative states represented by adjectives, and in most cases reverses the tendency of the sentiment words that precede it.

• Derived Negation Words. Adding the prefix “bet …bi …na …” before some nouns, adjectives, and verb roots, or appending the suffix “siz” after the roots, forms a negative derivative whose meaning is opposite to that of the original word. We did not handle derived negation words separately in this paper; since such words are limited in number in Uyghur, we incorporated all derived negation words selected from the Uyghur Detailed Explanation Dictionary (1999 edition, Xinjiang People’s Publishing House) into the Uyghur text sentiment dictionary we created.

• Negation Morpheme. The negation morpheme in Uyghur is “ma/me”, which becomes “mi” when the vowel weakens. It is added to the end of a verb stem to indicate negation, such as: “yaz $+$ ma $=$ yazma” (do not write).

In Uyghur, the negation word appears after the word to be modified. In short, the only negation components that can form a negation sentence in Uyghur are the negation morpheme “ma/me” and the negation word “emes” [35]. Therefore, we only considered the influence of these two negation components on the sentiment tendency of sentences.

We designed a sliding window of length five for negation words, with the optimum window size determined by comparison experiments, and checked whether a negation component appeared in the sentiment word itself or in the four words following it. Then the following steps were used to handle negation components that modify the sentiment tendency of sentences.

• Step 1. Check the part-of-speech tag of the sentiment word. If it is a verb and carries the negation morpheme suffix “ma/me/mi”, the sentiment score of the sentiment word is multiplied by $-$ 1.

• Step 2. Check whether the negation word “emes”, or a word carrying the negation morpheme suffix “ma/me/mi”, is present among the four words after the sentiment word. If so, the sentiment score of the sentiment word is multiplied by $-$ 1.

(a) If the negation word “emes” is connected to the suffix “mu” to become “emesmu”, then it has no reversal effect on the sentiment tendency of sentences. Therefore, the stem “emes” of “emesmu” is not treated as a negation word. For example:

Bu shéirmu xéli qamliship qaptu emesmu. (This poem is not bad.)

(b) If one of the four words after the sentiment word has a negation morpheme suffix such as “ma/me/mi”, the word is judged to be a negation word, but the following suffixes containing “ma/me/mi” should be excluded: “men, miz, midu, maq, mek, miki, mamdu, memdu, mamsen, memsen, tima, miken, ptime, mighay, migey, misi, ghiymidi, migey, masmidi”.

• Step 3. If two consecutive negation words occur before and after a sentiment word in a sentence, then the sentiment of the sentence is not modified.

• Step 4. If no sentiment vocabulary is present in the sentence, but there are multiple negation words, then the sentence is more likely to express a negative sentiment. Therefore, in this paper, when more than three negation words were present in such a sentence, the sentence was directly judged as negative, and the sentiment score assigned was $-$ 2.
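The steps above can be sketched roughly as follows. Several simplifications are assumptions of this sketch and not part of the described method: a suffix is taken to be the part of the word after its stem, a negation suffix is detected only when it starts with “ma/me/mi”, the part-of-speech check of Step 1 is omitted, and Step 3 is generalised to a parity rule (an even number of negators cancels out).

```python
from typing import List, Tuple

NEG_MORPHEMES = ("ma", "me", "mi")
# Suffixes listed in the text as non-negating although they contain ma/me/mi.
EXCLUDED_SUFFIXES = {
    "men", "miz", "midu", "maq", "mek", "miki", "mamdu", "memdu", "mamsen",
    "memsen", "tima", "miken", "ptime", "mighay", "migey", "misi",
    "ghiymidi", "masmidi",
}

def is_negator(word: str, stem: str) -> bool:
    """True for "emes" or for a word whose suffix starts with a negation morpheme."""
    if word == "emes":
        return True  # "emesmu" is deliberately not matched (case (a) in Step 2)
    suffix = word[len(stem):] if word.startswith(stem) else ""
    return suffix.startswith(NEG_MORPHEMES) and suffix not in EXCLUDED_SUFFIXES

def apply_negation(tokens: List[Tuple[str, str]], idx: int, score: float,
                   window: int = 5) -> float:
    """Flip `score` if an odd number of negators occurs in the sentiment word
    at `idx` and the following `window - 1` words (Steps 1 to 3)."""
    count = sum(is_negator(w, s) for w, s in tokens[idx:idx + window])
    return -score if count % 2 == 1 else score

def fallback_score(tokens: List[Tuple[str, str]], has_sentiment_word: bool) -> float:
    """Step 4: more than three negators and no sentiment word gives -2."""
    negators = sum(is_negator(w, s) for w, s in tokens)
    return -2.0 if (not has_sentiment_word and negators > 3) else 0.0

# Toy (word, stem) pairs: the positive word "chirayliq" followed by "emes".
tokens = [("u", "u"), ("chirayliq", "chirayliq"), ("emes", "emes")]
flipped = apply_negation(tokens, idx=1, score=1.0)
```

A production version would need the analyser's real suffix segmentation; splitting on the stem alone is only a rough stand-in.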

To evaluate the influence of the above modifiers on the sentence sentiment classification results, the following four lexicon-based (LB) sentiment classifiers were designed.

•

Classifier LB ${}_{\text{basic}}$ uses the sentimental values of the positive and negative words that appear in the sentence to obtain the sentiment tendency of the sentence.

•

Classifier LB ${}_{\text{but}}$ considers the influence of the conjunctions on the sentiment tendency of the sentence while considering the positive and negative vocabulary.

•

Classifier LB ${}_{\text{deg}}$ considers the influence of degree adverbs based on LB ${}_{\text{but}}$ .

•

Classifier LB ${}_{\text{no}}$ considers the influence of negation words based on LB ${}_{\text{deg}}$ .

Table 4 summarizes the attributes of the different LB classifiers.

Table 4

Attributes of different lexicon-based (LB) classifiers

Property	LB ${}_{\text{basic}}$	LB ${}_{\text{but}}$	LB ${}_{\text{deg}}$	LB ${}_{\text{no}}$
Positive and negative words	$\surd$	$\surd$	$\surd$	$\surd$
Conjunctions		$\surd$	$\surd$	$\surd$
Degree adverbs			$\surd$	$\surd$
Negation words				$\surd$

The LB ${}_{\text{no}}$ classifier that considers all modifiers was used as an example to illustrate the sentence sentiment classification process based on the sentiment dictionary. The process description is outlined in Algorithm 3.

Algorithm 3. Sentence sentiment score calculation algorithm $SO(S_{i})$ based on LB ${}_{\text{no}}$
Input:
Test Corpora: D (Comments: D1, Movie: D2, Literary: D3, Microblog: D4)
Uyghur Sentiment Dictionary:
UySentiDict (Positive dictionary: PosDic, Weight ${}^{+}$ $=$ 1 or Weight ${}^{+}$ $=$ 1 $-$ $\omega$
Negative dictionary: NegDic, Weight ${}^{-}$ $=$ $-$ 2 or Weight ${}^{-}$ $=$ $-$ 2 $+$ $\omega$ )
Conjunctions Dictionary: ButWords
Degree Adverb Dictionary: IntensifierDic (Degree Adverb Weight, $p=$ {0.5, 1.5, 2.5})
Negation word dictionary: NegationDic
Term Weight: Weight $\in$ [0 $\sim$ 2]
POS tagger: POS {“adjective: A”, “noun: N”, “verb: V”, “adverb: D”, …};
Output:
Corresponding to each sentence $S_{i}$ in the corpus $D_{i}$ , compute the sentiment tendency score $SO(S_{i})$ ;
Process:
1: For $S_{i}\in S\textit{ in }D\textit{ do}$ :
2: $SO(S_{i})=0$ ;
3: Use the Uyghur lexical analyzer to divide sentence $S_{i}$ into word set $W_{si}$ ;
4: if (ButWords not in $S_{i}$ ) then
5: for $w_{j}\in W_{si}\textit{ do}$
6: if ( $w_{j}\in\text{PosDic}$ || $w_{j}\in\text{NegDic}$ ) then
7: $SO(w_{j})=\text{Weight}^{+}+\text{Weight}^{-}$ ;
8: if (( $w_{j-1}$ or $w_{j-2}$ or $w_{j-3}$ ) $\in$ IntensifierDic) then
9: $SO(w_{j})=p*SO(w_{j})$ ;
10: for ( $x=0;x\leqslant 4;x++$ )
11: if (getPOS ( $w_{j+x}$ ) $==$ “V” && $w_{j+x}$ $\in$ NegationDic) then
12: $SO(w_{j})=SO(w_{j})*(-1)$ ;
13: end for
14: $SO(S_{i})+=SO(w_{j})$ ;
15: end for
16: else if (ButWords in $S_{i}$ ) then
17: $S_{i}=S_{i1}+\textit{butword}+S_{i2}$ ;
18: $SO(S_{i})=SO(S_{i2})$ ; (Specific calculations such as algorithms 1, 2)
19: return $SO(S_{i})$ ;
20: end for

The basic steps of this algorithm are as follows:

•

Step 1. Load the basic resources: the Uyghur sentiment dictionary, degree adverb dictionary, negation word dictionary, and conjunction dictionary;

•

Step 2. Use the Uyghur lexical analyzer to cut each sentence $S_{i}$ in corpus $D_{i}$ into word sequence $W_{si}$ , and determine the part of speech (POS) and stem information of each word;

•

Step 3. If a turning conjunction is in the sentence, then the sentence is divided into two parts, $S_{i1}$ and $S_{i2}$ with the turning conjunction at the center, and the sentence’s sentiment tendency is calculated according to Algorithms 1 and 2;

•

Step 4. Calculate the score $SO(S_{i})$ of the sentence $S_{i}$ according to the following steps:

(a)

According to the sentiment dictionaries PosDic and NegDic, each word $w_{j}$ in the sentence $S_{i}$ is checked to determine whether it is a sentiment word, and its sentiment value $SO(w_{j})$ is determined. When $w_{j}$ matches as the original word, Weight ${}^{+}$ is set to 1 and Weight ${}^{-}$ is set to $-$ 2. When $w_{j}$ matches only as a stem, the weight is calculated according to Eq. (1). The value of $\omega$ was determined by comparison experiments;

(b)

According to the degree adverb dictionary IntensifierDic, check whether there is a degree adverb among the three words in front of the sentiment word $w_{j}$ . If so, $SO(w_{j})$ is multiplied by the degree adverb weight $p$ ;

(c)

Check whether a verb occurs among the sentiment word $w_{j}$ and the following four words. If so, the negation word dictionary NegationDic is used to determine whether the verb is a negation verb. If the number of negation words found is odd, $SO(w_{j})$ is multiplied by $-$ 1;

(d)

The $SO(S_{i})$ of the sentence $S_{i}$ is obtained by summing the $SO(w_{j})$ of each sentiment word $w_{j}$ in the sentence $S_{i}$ . If $SO(S_{i})$ is greater than zero, the sentence is labeled positive. If it is less than zero, the sentence is labeled negative, and if it is equal to zero, the type cannot be determined.

•

Step 5. Return the sentiment score $SO(S_{i})$ of the sentence $S_{i}$ .
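The scoring steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Lexicon` container, the function names, and the English stand-in words are hypothetical; conjunction handling (Algorithms 1 and 2) and the POS check on negation words are omitted; only the nearest degree adverb is applied; and the default `omega=0.43` is the stem weight factor adopted later in the experiments.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (original word, stem)


@dataclass
class Lexicon:
    pos_dic: Set[str]
    neg_dic: Set[str]
    intensifiers: Dict[str, float]  # degree adverb -> p in {0.5, 1.5, 2.5}
    negations: Set[str]


def word_score(surface: str, stem: str, lex: Lexicon, omega: float) -> float:
    # Step 4(a): original-word match first, then stem match with reduced weight.
    if surface in lex.pos_dic:
        return 1.0
    if surface in lex.neg_dic:
        return -2.0
    if stem in lex.pos_dic:
        return 1.0 - omega
    if stem in lex.neg_dic:
        return -2.0 + omega
    return 0.0


def sentence_score(tokens: List[Token], lex: Lexicon, omega: float = 0.43,
                   deg_window: int = 3, neg_window: int = 4) -> float:
    total = 0.0
    for j, (surface, stem) in enumerate(tokens):
        score = word_score(surface, stem, lex, omega)
        if score == 0.0:
            continue
        # Step 4(b): nearest degree adverb among the preceding words.
        for prev, _ in reversed(tokens[max(0, j - deg_window):j]):
            if prev in lex.intensifiers:
                score *= lex.intensifiers[prev]
                break
        # Step 4(c): an odd number of negations in w_j and the following words.
        window = tokens[j:j + 1 + neg_window]
        n_neg = sum(1 for w, _ in window if w in lex.negations)
        if n_neg % 2 == 1:
            score = -score
        total += score  # Step 4(d)
    return total


def label(score: float) -> str:
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "undetermined"


# English stand-ins for dictionary entries, for illustration only.
lex = Lexicon({"good"}, {"bad"}, {"very": 2.5}, {"not"})
tokens = [("very", "very"), ("goods", "good"), ("not", "not")]
print(label(sentence_score(tokens, lex)))  # negative
```

Keeping the window sizes as parameters makes it easy to rerun the sketch with the window sizes selected in Section 4.2.1.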

The sentence $S_{i}$ ’s sentiment tendency score calculation process is summarized in Eq. (2).

$\displaystyle SO(S)=\sum_{i=0}^{i=n}\left(f(\omega,x_{i},b)\times y_{i+m}\times\prod_{p=0}^{p=3}z_{i+p}\right),m\in\{1,2,3\}$ (2)

where $m$ denotes the degree adverb window index, $y_{i+m}$ denotes whether the $m$ th word preceding the sentiment word $x_{i}$ in the sentence $S$ is a degree adverb, the range of $y$ is (not degree adverb $=$ 1, adverbs of weakening degree $=$ 0.5, adverbs of general degree $=$ 1.5, adverbs of enhancement $=$ 2.5), $p$ denotes the index of the negation word window, $z_{i+p}$ denotes whether the $p$ th word behind the sentiment word $x_{i}$ in the sentence is a negation word, and the range of $z$ is (negation $=-$ 1, not negation $=$ 1).
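As a worked instance of Eq. (2), consider a hypothetical sentence containing a single positive sentiment word that is matched only by its stem. Assume $\omega=0.43$ (the value adopted in the later experiments), so that the stem weight is $1-\omega$ ; assume also that the word is preceded by an adverb of enhancement ( $y=2.5$ ) and that exactly one of the four $z$ factors marks a negation. Under these assumptions:

$SO(S)=(1-0.43)\times 2.5\times\bigl(1\times(-1)\times 1\times 1\bigr)=-1.425$

Because the score is below zero, the sentence would be labeled negative.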

3.2 Combined module

The basic aim of the unsupervised, lexicon-based classifier proposed in this paper was to construct a domain-independent and powerful Uyghur lexicon-based classifier, and then to train a machine learning classifier on the corpora labeled by it. For the machine learning classifier to achieve a better classification effect, the lexicon-based classifier must ensure the accuracy of the labeled corpus. The experiment was divided into two steps. The first step was a classification module based on the sentiment dictionary. The data labeled in this step were divided into two groups. The first group contained the pseudo-labeled data, in which the sentiment score of every sentence was higher than the specified threshold value. The second group contained the data still to be classified, consisting of sentences with lower sentiment scores and sentences whose tendency could not be determined. The second step used the pseudo-labeled data obtained in the first step to train the machine learning classifier and then used the second group as test data to evaluate the classification effect.
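The split into pseudo-labeled data and remaining data can be sketched as below. The function name and the data layout (a list of sentence and score pairs) are assumptions made for illustration; the threshold is applied to the absolute value of the score.

```python
from typing import List, Tuple


def split_by_threshold(
    scored: List[Tuple[str, float]], threshold: float
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split lexicon-scored sentences into pseudo-labeled and remaining data."""
    pseudo_labeled, remaining = [], []
    for sentence, score in scored:
        if abs(score) > threshold:
            # Confident lexicon decision: keep it as a pseudo label.
            tag = "positive" if score > 0 else "negative"
            pseudo_labeled.append((sentence, tag))
        else:
            # Low or zero score: leave it for the machine learning classifier.
            remaining.append(sentence)
    return pseudo_labeled, remaining


scored = [("s1", 2.5), ("s2", -3.14), ("s3", 0.57), ("s4", 0.0)]
train, test = split_by_threshold(scored, threshold=1.0)
print(train)  # [('s1', 'positive'), ('s2', 'negative')]
print(test)   # ['s3', 's4']
```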

Since BNB (Bernoulli Naive Bayes) and SVM (Support Vector Machine) classifiers perform well in text classification and sentiment classification [1], these two classifiers were selected as the second-step machine learning classifiers. The basic features, such as Unigrams, Bigrams, DictWords, and PBPs (Part-of-Speech based phrases) [36, 37], the combined features of Unigrams and Bigrams, and the combined features of Unigrams and PBPs, were then extracted from the pseudo-labeled data. Based on the feature selection method MI (Mutual Information) and the feature weighting method Tf-Idf (Term frequency-Inverse document frequency), the most relevant and most discriminative features were selected. Next, the BNB and SVM classifiers classified the data left unclassified in the first step.
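To make the second step concrete, the following is a minimal, dependency-free sketch of a Bernoulli Naive Bayes classifier over unigram presence features, trained on pseudo-labeled sentences. It is an assumption-laden illustration rather than the setup used in the paper: MI feature selection, Tf-Idf weighting, the other feature types, and the SVM classifier are omitted, and the function names and English toy data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Model = Tuple[List[str], Dict[str, float], Dict[str, Dict[str, float]]]


def train_bnb(docs: List[List[str]], labels: List[str]) -> Model:
    vocab = sorted({w for d in docs for w in d})
    classes = sorted(set(labels))
    n_docs = {c: 0 for c in classes}
    doc_freq = {c: defaultdict(int) for c in classes}
    for doc, y in zip(docs, labels):
        n_docs[y] += 1
        for w in set(doc):  # presence only, not counts
            doc_freq[y][w] += 1
    log_prior = {c: math.log(n_docs[c] / len(docs)) for c in classes}
    # Laplace-smoothed P(word present | class)
    cond = {
        c: {w: (doc_freq[c][w] + 1) / (n_docs[c] + 2) for w in vocab}
        for c in classes
    }
    return vocab, log_prior, cond


def predict_bnb(model: Model, doc: List[str]) -> str:
    vocab, log_prior, cond = model
    present = set(doc)
    best_class, best_lp = "", -math.inf
    for c in log_prior:
        lp = log_prior[c]
        for w in vocab:
            p = cond[c][w]
            # Bernoulli model: absent words also contribute evidence.
            lp += math.log(p if w in present else 1.0 - p)
        if lp > best_lp:
            best_class, best_lp = c, lp
    return best_class


# Toy pseudo-labeled data with English stand-in tokens.
docs = [["good", "film"], ["great", "film"], ["bad", "film"], ["awful", "plot"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_bnb(docs, labels)
print(predict_bnb(model, ["good"]))  # positive
print(predict_bnb(model, ["bad"]))   # negative
```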

4. Experiments and analysis

4.1 Experimental dataset

To verify the effectiveness of the proposed method, we designed sentiment classification experiments on corpora from four different fields. The four sentiment datasets used in this paper are described in Table 5.

Table 5
Overview of datasets

Dataset	Positive corpus	Negative corpus	Total
Commentary corpus	4843	5019	9862
Movie corpus	215	446	661
Microblog corpus	717	1110	1827
Literary corpus	240	344	584

Figure 2.

Influence of stem weights on the classification results in LB ${}_{\text{basic}}$ .

Figure 3.

Influence of stem weights on the classification results in LB ${}_{\text{but}}$ .

4.2 Analysis of classification effect based on sentiment lexicon classifier

4.2.1 Influence of stem weight factor, sliding window size of degree adverb, and negation word on classification results

Uyghur is a morphologically complex language; stems connect different suffixes to produce different words. In this paper, when looking for a sentiment word from a sentence that matched the sentiment dictionary, we first looked at the original words. If a match was not found, we looked for the stems of each word in the sentence. In this paper, sentiment values of 1 and $-$ 2 were designed for positive and negative sentiment vocabulary, respectively. The sentiment value of the stem was calculated according to Eq. (1).

In Uyghur, degree adverbs generally appear in front of the modified words, and negation words appear behind them. To determine the maximum influence range of these words on the sentiment tendency of sentences, we designed a sliding window of length three for degree adverbs and searched for degree adverbs among the three words preceding the sentiment word. We designed a sliding window of size five for the negation words, meaning we looked for negation elements in the sentiment word itself and the four words that followed.

Figures 2–5 describe the effect of the stem weight factor (from 0.01 to 0.99), the degree adverb sliding window size, and the negation word sliding window size on the sentiment classification results for the different corpora and the different LB classifiers. In Figs 2 and 3, only the effect of the stem weight factor $\omega$ on the classification results is considered because these classifiers do not involve degree adverbs or negation words.

In Fig. 4, the best weight factor range and degree adverb window size were determined by 99 (stem weight factor) $\times$ 3 (degree adverb window size) $=$ 297 crossover experiments.
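The size of this parameter grid can be reproduced with a short sketch. The `evaluate` callable is a hypothetical stand-in for running the LB ${}_{\text{deg}}$ classifier on a corpus and returning its accuracy; the toy scoring function exists only so that the example runs.

```python
from itertools import product
from typing import Callable, Tuple

omegas = [k / 100 for k in range(1, 100)]  # stem weight factors 0.01 ... 0.99
deg_windows = [1, 2, 3]                    # degree adverb window sizes

grid = list(product(omegas, deg_windows))
print(len(grid))  # 297


def grid_search(evaluate: Callable[[float, int], float]) -> Tuple[float, int]:
    """Return the (omega, deg_window) pair with the highest score."""
    return max(grid, key=lambda cfg: evaluate(*cfg))


# Toy scoring function, used only to demonstrate the call.
best = grid_search(lambda omega, window: -abs(omega - 0.43) - abs(window - 1))
print(best)  # (0.43, 1)
```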

In Fig. 5, the best weight factor range, degree adverb window size, and negation word window size were determined by 99 (stem weight factor) $\times$ 3 (degree adverb window size) $\times$ 5 (negation word window size) $=$ 1485 crossover experiments. Figure 5 shows the effect of changes in stem weights and negation word windows on the classification results when the degree adverb’s sliding window size is set to 1. The best results were obtained when the degree adverb window size was 1.

Figure 4.

Influence of stem weight and degree adverb window size on the classification result in LB ${}_{\text{deg}}$ .

Table 6

Optimal values for different parameters

LB	$\omega$	DegIndex	NegIndex	Accuracy
Commentary corpus
LB ${}_{\text{basic}}$	0.41–0.49	–	–	81.011
LB ${}_{\text{but}}$	0.34–0.49	–	–	81.247
LB ${}_{\text{deg}}$	0.39–0.43	1	–	82.170
LB ${}_{\text{no}}$	0.34–0.43	1	1	82.930
Literary corpus
LB ${}_{\text{basic}}$	0.34–0.49	–	–	86.644
LB ${}_{\text{but}}$	0.34–0.49	–	–	86.815
LB ${}_{\text{deg}}$	0.43–0.49	2	–	86.986
LB ${}_{\text{no}}$	0.43–0.49	2	0	86.815
Movie corpus
LB ${}_{\text{basic}}$	0.34–0.49	–	–	79.274
LB ${}_{\text{but}}$	0.34–0.49	–	–	79.274
LB ${}_{\text{deg}}$	0.21–0.24	2	–	81.846
LB ${}_{\text{no}}$	0.12–0.24	2	0	81.997
Microblog corpus
LB ${}_{\text{basic}}$	0.51–0.66	–	–	84.387
LB ${}_{\text{but}}$	0.51–0.66	–	–	84.332
LB ${}_{\text{deg}}$	0.51–0.55	1	–	85.596
LB ${}_{\text{no}}$	0.54–0.55	1	1	85.871

Figure 5.

Influence of stem weights and negation word window sizes on the classification results in LB ${}_{\text{no}}$ .

Based on the above experimental results, the optimal weight factor range $\omega$ , degree adverb window size (DegIndex), negation word window size (NegIndex), and classification accuracy (Accuracy) of the different lexicon-based classifiers on the different corpora were determined and are listed in Table 6.

From the experimental results in Table 6, the optimal stem weight factor ranges share a common point of 0.43 across the different LB classifiers and corpora, except for some results on the Microblog corpus and Movie corpus. Therefore, $\omega$ was set to 0.43 in the following experiments. In the Commentary corpus and Microblog corpus, LB ${}_{\text{deg}}$ and LB ${}_{\text{no}}$ had the highest classification accuracy when both the degree adverb window size and the negation word window size were set to one. In the Literary corpus and Movie corpus, LB ${}_{\text{deg}}$ and LB ${}_{\text{no}}$ achieved the highest classification accuracy with a degree adverb window size of two and a negation word window size of zero. Taking corpus size and accuracy into account, the optimal window sizes for degree adverbs and negation words were specified as one (DegIndex $=$ 1, NegIndex $=$ 1) in the following experiments.
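The choice of 0.43 can be checked against the ranges in Table 6 with a small sketch that counts, for each candidate value, how many of the sixteen optimal ranges contain it. The variable names are illustrative, and the ranges are stored in hundredths to avoid floating-point comparisons.

```python
# Optimal omega ranges from Table 6, in hundredths (41 means 0.41).
# Order within each corpus: LB_basic, LB_but, LB_deg, LB_no.
RANGES = {
    "Commentary": [(41, 49), (34, 49), (39, 43), (34, 43)],
    "Literary": [(34, 49), (34, 49), (43, 49), (43, 49)],
    "Movie": [(34, 49), (34, 49), (21, 24), (12, 24)],
    "Microblog": [(51, 66), (51, 66), (51, 55), (54, 55)],
}


def coverage(omega: int) -> int:
    """Number of (corpus, classifier) ranges that contain omega."""
    return sum(lo <= omega <= hi for rs in RANGES.values() for lo, hi in rs)


best = max(range(1, 100), key=coverage)
print(best / 100, coverage(best))  # 0.43 10
```

The value 0.43 falls inside 10 of the 16 ranges; the six exceptions are the LB ${}_{\text{deg}}$ and LB ${}_{\text{no}}$ ranges on the Movie corpus and all four ranges on the Microblog corpus.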

4.2.2 Influence of sentiment vocabulary matching methods on classification results in different LB classifiers

To compare the effect of different sentiment word matching methods on the classification results of the LB classifiers, we used the above four LB classifiers to perform classification experiments on the different corpora. Tables 7–10 provide the classification results. In the four tables, the second column finds the sentiment vocabulary only from the original word, the third column finds the sentiment vocabulary from the stem, and the fourth column finds the sentiment vocabulary from the original word first, then from the stem. In the fourth column, the weights of the positive and negative stems are the same as for the original word, at 1 and $-$ 2, respectively. The fifth column also finds the sentiment word from the original word first, and then from the stem, but here the positive stem weight is (1 $-$ 0.43), and the negative stem weight is ( $-$ 2 $+$ 0.43).
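The four matching methods compared in Tables 7–10 can be sketched as one function with a mode switch. The mode names and the function itself are hypothetical; the weights for the stem-only mode are assumed to be 1 and $-$ 2, since the text does not state them; the English words are stand-ins for dictionary entries.

```python
from typing import Optional, Set

MODES = ("original", "stem", "original+stem", "original+stem_weighted")


def match_score(surface: str, stem: str, pos_dic: Set[str], neg_dic: Set[str],
                mode: str, omega: float = 0.43) -> float:
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")

    def lookup(word: str, pos_w: float, neg_w: float) -> Optional[float]:
        if word in pos_dic:
            return pos_w
        if word in neg_dic:
            return neg_w
        return None

    if mode == "stem":
        score = lookup(stem, 1.0, -2.0)
    else:
        score = lookup(surface, 1.0, -2.0)  # original word first
        if score is None and mode == "original+stem":
            score = lookup(stem, 1.0, -2.0)  # same weights as the original
        elif score is None and mode == "original+stem_weighted":
            score = lookup(stem, 1.0 - omega, -2.0 + omega)  # reduced weights
    return 0.0 if score is None else score


pos_dic, neg_dic = {"good"}, {"bad"}
for mode in MODES:
    print(mode, match_score("goods", "good", pos_dic, neg_dic, mode))
```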

Table 7
Classification accuracy of different LB classifiers on Commentary corpus

LB classifier	Original	Stem	Original and stem	Original and stem ${}_{\text{Weight}}$
LB ${}_{\text{basic}}$	72.835	77.891	78.927	81.011
LB ${}_{\text{but}}$	73.060	78.219	79.276	81.247
LB ${}_{\text{deg}}$	73.738	79.543	80.672	82.170
LB ${}_{\text{no}}$	74.559	80.241	81.360	82.930
LB ${}_{\text{avg}}$	73.548	78.974	80.059	81.840

Table 8

Classification accuracy of different LB classifiers on Movie corpus

LB classifier	Original	Stem	Original and stem	Original and stem ${}_{\text{Weight}}$
LB ${}_{\text{basic}}$	69.138	75.643	77.458	79.274
LB ${}_{\text{but}}$	69.138	75.643	77.458	79.274
LB ${}_{\text{deg}}$	70.045	78.064	80.332	81.392
LB ${}_{\text{no}}$	68.230	76.551	78.971	79.879
LB ${}_{\text{avg}}$	69.138	76.475	78.555	79.955

Table 9

Classification accuracy of different LB classifiers on Microblog corpus

LB classifier	Original	Stem	Original and stem	Original and stem ${}_{\text{Weight}}$
LB ${}_{\text{basic}}$	71.358	80.869	82.133	84.112
LB ${}_{\text{but}}$	71.578	80.979	82.243	84.222
LB ${}_{\text{deg}}$	72.183	82.628	84.002	85.322
LB ${}_{\text{no}}$	73.337	83.178	84.497	85.761
LB ${}_{\text{avg}}$	72.114	81.914	83.219	84.854

From the above experimental results, for the different LB classifiers, the sentiment word matching method Original and stem ${}_{\text{Weight}}$ achieved the best classification results. Compared with original word matching, this method increased the average accuracy by 8.292 percentage points on the Commentary corpus, 17.424 on the Literary corpus, 10.817 on the Movie corpus, and 12.740 on the Microblog corpus. This shows that matching both the original word and the stem clearly improves classification accuracy in Uyghur sentiment classification.
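The reported gains are absolute differences between the LB ${}_{\text{avg}}$ rows of Tables 7–10, which the following short sketch reproduces (the dictionary layout is only illustrative):

```python
# LB_avg accuracy: (Original, Original and stem with weight), Tables 7-10.
LB_AVG = {
    "Commentary": (73.548, 81.840),
    "Literary": (69.135, 86.559),
    "Movie": (69.138, 79.955),
    "Microblog": (72.114, 84.854),
}

gains = {
    corpus: round(weighted - original, 3)
    for corpus, (original, weighted) in LB_AVG.items()
}
print(gains)
```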

4.3 Analysis of classification results after combining lexicon-based classifier and machine learning classifier

4.3.1 Classification results of machine learning classifiers on pseudo-labeled datasets of different scales

The greater the absolute value of a sentence’s sentiment score, the stronger the sentiment bias of the sentence and the more accurate the classification result. In this paper, the second-step machine learning classifier was trained on the pseudo-labeled data obtained from the first-step LB classifier. To obtain a more accurate training corpus, different sentiment thresholds were designed, and pseudo-labeled data were selected according to the classification accuracy under different thresholds and the coverage of correct sentences. From the above experimental results (Tables 7–10), LB ${}_{\text{no}}$ had the best classification effect among the four LB classifiers on the Commentary and Microblog corpora, the two largest corpora. Because the purpose of this step was to select the best LB classifier and thereby obtain the most accurate pseudo-labeled data, the LB ${}_{\text{no}}$ classifier was selected.

Table 10
Classification accuracy of different LB classifiers on Literary corpus

LB classifier	Original	Stem	Original and stem	Original and stem ${}_{\text{Weight}}$
LB ${}_{\text{basic}}$	69.178	80.651	81.849	86.644
LB ${}_{\text{but}}$	69.007	80.993	82.021	86.815
LB ${}_{\text{deg}}$	69.007	80.993	82.534	86.644
LB ${}_{\text{no}}$	69.349	81.507	82.877	86.131
LB ${}_{\text{avg}}$	69.135	81.036	82.320	86.559

Figure 6 plots, for each of the four sentiment corpora, the classification accuracy of the LB ${}_{\text{no}}$ classifier on the sentences whose sentiment score is greater than the specified threshold, together with the proportion of the total corpus made up of those correctly classified sentences.

Figure 6.

Classification accuracy and correct sentence coverage under different sentiment thresholds in the LB ${}_{\text{no}}$ classifier.

Table 11

Classification results when the training set uses different pseudo-labeled data scales on the Commentary corpus

Sentiment threshold	$|SO(s_{i})|$ $>$ 0.5		$|SO(s_{i})|$ $>$ 1		$|SO(s_{i})|$ $>$ 1.5		$|SO(s_{i})|$ $>$ 2
	Origin	Stem	Origin	Stem	Origin	Stem	Origin	Stem
LB classifier average classification accuracy: LBavg $=$ 81.840
BNB
Unigram	72.191	73.876	78.393	80.357	78.146	80.829	80.271	82.693 ${}^{+}$
Bigram	59.059	64.396	62.798	66.101	63.268	66.537	64.952	67.908
DictWords	67.308	75.296	74.695	77.497	75.921	78.236	79.296	80.351
PBPs	49.972	54.124	53.797	57.933	58.869	62.290	60.645	64.187
Uni $+$ bi	78.552	80.638	78.944	81.209	79.941	81.817	80.979	83.598 ${}^{+}$
Uni $+$ PBPs	76.914	79.552	79.813	80.061	79.586	82.911 ${}^{+}$	80.509	82.883 ${}^{+}$
SVM
Unigram	70.014	72.472	77.649	80.208	78.098	80.220	79.586	81.907 ${}^{+}$
Bigram	57.022	60.534	61.310	65.030	62.122	65.512	62.429	66.071
DictWords	67.308	74.012	74.424	76.100	74.600	76.406	78.108	78.178
PBPs	52.700	57.811	56.386	60.368	57.890	63.457	60.567	63.833
Uni $+$ bi	78.856	81.288	79.434	81.740	79.925	82.021 ${}^{+}$	80.907	83.070 ${}^{+}$
Uni $+$ PBPs	73.283	79.464	74.801	79.893	77.323	82.478 ${}^{+}$	78.296	82.941 ${}^{+}$

Table 12

Classification results when the training set uses different pseudo-labeled data scales on the Literary corpus

Sentiment threshold	$|SO(s_{i})|$ $>$ 0.5		$|SO(s_{i})|$ $>$ 1		$|SO(s_{i})|$ $>$ 1.5		$|SO(s_{i})|$ $>$ 2
	Origin	Stem	Origin	Stem	Origin	Stem	Origin	Stem
LB classifier average classification accuracy: LBavg $=$ 86.559
BNB
Unigram	80.282	81.646	75.333	74.219	71.028	68.707	67.460	69.841
Bigram	85.000	85.443	80.921	82.000	69.932	73.364	65.110	66.022
DictWords	84.112	84.868	78.261	78.472	74.026	76.442	73.158	71.632
PBPs	84.177	80.380	71.094	73.438	68.707	70.748	64.917	67.403
Uni $+$ bi	85.915	86.076	82.000	82.667	75.234	76.636	70.635	71.825
Uni $+$ PBPs	82.911	80.380	72.656	72.656	69.728	70.408	65.470	67.956
SVM
Unigram	79.114	70.423	63.156	75.391	69.388	71.028	65.746	70.238
Bigram	86.620 ${}^{+}$	88.028 ${}^{+}$	83.323	82.667	75.234	77.570	71.429	74.603
Dictwords	85.047	86.765 ${}^{+}$	80.000	84.028	75.974	77.404	72.619	74.206
PBPs	84.810	81.646	73.047	74.219	68.367	61.905	66.022	67.127
Uni $+$ bi	86.076	84.810	82.119	81.333	74.766	76.168	70.635	72.619
Uni $+$ PBPs	84.177	81.013	72.656	74.609	64.626	73.469	62.983	73.204

Table 13

Classification results when the training set uses different pseudo-labeled data scales on the Movie corpus

Sentiment threshold	$|SO(s_{i})|$ $>$ 0.5		$|SO(s_{i})|$ $>$ 1		$|SO(s_{i})|$ $>$ 1.5		$|SO(s_{i})|$ $>$ 2
	Origin	Stem	Origin	Stem	Origin	Stem	Origin	Stem
LB classifier average classification accuracy: LBavg $=$ 79.955
BNB
Unigram	82.555 ${}^{+}$	79.128	80.482 ${}^{+}$	77.108	80.610 ${}^{+}$	75.216	78.624	78.072
Bigram	87.797 ${}^{+}$	88.136 ${}^{+}$	83.387 ${}^{+}$	84.026 ${}^{+}$	82.421 ${}^{+}$	79.521	79.115	78.870
Dictwords	89.785 ${}^{+}$	90.984 ${}^{+}$	86.207 ${}^{+}$	89.313 ${}^{+}$	87.111 ${}^{+}$	85.473 ${}^{+}$	83.746 ${}^{+}$	83.853 ${}^{+}$
PBPs	84.735 ${}^{+}$	85.981 ${}^{+}$	82.651 ${}^{+}$	82.892 ${}^{+}$	81.917 ${}^{+}$	81.699 ${}^{+}$	77.883	75.803
Uni $+$ bi	86.441 ${}^{+}$	85.358 ${}^{+}$	83.706 ${}^{+}$	83.387 ${}^{+}$	82.789 ${}^{+}$	82.709 ${}^{+}$	81.572 ${}^{+}$	79.773
Uni $+$ PBPs	85.047 ${}^{+}$	85.047 ${}^{+}$	82.892 ${}^{+}$	82.651 ${}^{+}$	82.135 ${}^{+}$	80.828 ${}^{+}$	80.529 ${}^{+}$	79.773
SVM
Unigram	82.712 ${}^{+}$	83.051 ${}^{+}$	80.964 ${}^{+}$	80.192 ${}^{+}$	79.956 ${}^{+}$	79.827	78.378	78.378
Bigram	88.475 ${}^{+}$	86.293 ${}^{+}$	84.345 ${}^{+}$	84.733 ${}^{+}$	82.997 ${}^{+}$	82.997 ${}^{+}$	79.361	79.607
Dictwords	91.398 ${}^{+}$	93.089 ${}^{+}$	89.163 ${}^{+}$	88.931 ${}^{+}$	85.886 ${}^{+}$	87.162 ${}^{+}$	85.159 ${}^{+}$	85.552 ${}^{+}$
PBPs	86.604 ${}^{+}$	86.293 ${}^{+}$	83.855 ${}^{+}$	82.892 ${}^{+}$	82.571 ${}^{+}$	82.571 ${}^{+}$	78.639	79.206
Uni $+$ bi	88.514 ${}^{+}$	88.136 ${}^{+}$	84.345 ${}^{+}$	84.026 ${}^{+}$	84.150 ${}^{+}$	83.285 ${}^{+}$	83.538 ${}^{+}$	81.327 ${}^{+}$
Uni $+$ PBPs	83.178 ${}^{+}$	85.047 ${}^{+}$	82.892 ${}^{+}$	82.892 ${}^{+}$	82.789 ${}^{+}$	81.481 ${}^{+}$	81.285 ${}^{+}$	78.828

Table 14

Classification results when the training set uses different pseudo-labeled data scales on the Microblog corpus

Sentiment threshold	$|SO(s_{i})|$ $>$ 0.5		$|SO(s_{i})|$ $>$ 1		$|SO(s_{i})|$ $>$ 1.5		$|SO(s_{i})|$ $>$ 2
	Origin	Stem	Origin	Stem	Origin	Stem	Origin	Stem
LB classifier average classification accuracy: LBavg $=$ 84.854
BNB
Unigram	85.219 ${}^{+}$	71.363	74.705	76.016	76.103	77.180	76.435	77.806
Bigram	88.859 ${}^{+}$	87.529 ${}^{+}$	83.834	76.147	81.168	73.843	75.491	72.494
Dictwords	83.289	61.562	80.502	68.285	79.661	71.776	82.569	70.472
PBPs	88.222 ${}^{+}$	86.143 ${}^{+}$	76.540	76.016	75.027	66.523	73.608	73.693
Uni $+$ bi	88.329 ${}^{+}$	88.064 ${}^{+}$	84.065	82.679	82.486	80.791	80.603	79.863
Uni $+$ PBPs	87.529 ${}^{+}$	85.450 ${}^{+}$	78.506	77.720	77.072	78.794	77.806	78.320
SVM
Unigram	85.450 ${}^{+}$	74.271	79.677	74.836	80.226	79.096	81.258	81.520
Bigram	90.186 ${}^{+}$	87.529 ${}^{+}$	84.527	82.679	81.544	81.356	76.409	77.720
Dictwords	88.594 ${}^{+}$	87.798 ${}^{+}$	84.065	82.910	83.051	82.863	82.831	81.782
PBPs	85.681 ${}^{+}$	84.758	77.851	75.885	74.812	74.812	74.122	74.036
Uni $+$ bi	89.390 ${}^{+}$	88.859 ${}^{+}$	83.834	83.372	82.298	80.038	80.865	83.224
Uni $+$ PBPs	86.836 ${}^{+}$	86.143 ${}^{+}$	80.472	83.486	81.163	83.746	82.177	85.176 ${}^{+}$

Figure 6 shows that as the sentiment score threshold increased, the classification accuracy slowly increased, whereas the proportion of correctly classified sentences in the total corpus gradually decreased. For example, in the Commentary corpus, when the sentiment score was greater than the threshold of 1.2, the classification accuracy was close to 95%, and correctly classified sentences made up 82% of the total corpus. To obtain higher classification efficiency and accuracy, the pseudo-labeled data should be as large as possible. To determine the best pseudo-labeled data limit, for each sentiment score threshold we selected equal numbers of positive and negative sentences from the classification results as the pseudo-labeled training data. Based on the BNB and SVM classifiers, the basic features (Unigrams, Bigrams, DictWords, and PBPs) and the combined features (Unigrams with Bigrams, and Unigrams with PBPs) were used to train the classifier. Then, we classified the remaining data, whose sentiment scores fell below the specified threshold. Tables 11–14 provide the classification results when using pseudo-labeled datasets of different scales on the different corpora. The threshold applies to the absolute value of the sentence sentiment score, i.e., the score of a positive sentence is greater than or equal to the specified threshold, and the score of a negative sentence is less than or equal to the negative of the specified threshold. The “ $+$ ” superscript on an accuracy value indicates that the result of the machine learning classifier is higher than the average classification accuracy of the LB classifiers (LBavg) listed in the same table, and bold indicates the highest value.
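The two quantities discussed for Fig. 6 can be computed as in the sketch below. The function name and the toy scores are hypothetical; accuracy is measured over the sentences selected by the threshold, and coverage is the share of the whole corpus made up of correctly classified selected sentences.

```python
from typing import List, Tuple


def accuracy_and_coverage(
    scores: List[float], gold: List[str], threshold: float
) -> Tuple[float, float]:
    """Accuracy among sentences with |score| > threshold, and the share of
    the whole corpus that those correctly classified sentences make up."""
    selected = [(s, g) for s, g in zip(scores, gold) if abs(s) > threshold]
    # A selected sentence is correct when the sign of its score matches the label.
    correct = sum(1 for s, g in selected if (s > 0) == (g == "positive"))
    accuracy = correct / len(selected) if selected else 0.0
    coverage = correct / len(scores) if scores else 0.0
    return accuracy, coverage


# Toy data for illustration only.
scores = [2.5, -3.0, 0.57, -1.0, 1.5]
gold = ["positive", "negative", "negative", "negative", "negative"]
acc, cov = accuracy_and_coverage(scores, gold, threshold=1.0)
print(round(acc, 3), round(cov, 3))  # 0.667 0.4
```

Raising the threshold typically keeps fewer but more reliable sentences, which is the trade-off between accuracy and coverage described above.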

From the experimental results (Tables 11–14), for some features and some thresholds, the final classification result improved considerably after combining the LB classifier and the machine learning classifier. Without any labeled corpora, a high-quality training corpus was obtained using an LB classification method. On this basis, we trained the machine learning classifier and obtained a higher classification accuracy than the LB classifier.

4.3.2 Comparison of classification results of this paper with other classification methods

Table 15 describes the results of supervised learning methods such as SVM, BNB, CNN (Convolutional Neural Networks), RNN (Recurrent Neural Network), and RNN $+$ LSTM (Long Short-Term Memory)⁴ with unigrams as features. To attain the optimal classification accuracy, we used the cross-validation method 10 times on the four sentiment corpora proposed in this paper.

⁴ Code for the neural network experiments can be downloaded from https://github.com/ErpanY/Uy_SentiClassifyByDeep Learning.

Table 15

Comparison of the experimental results for the four corpora on the supervised classifiers, the LB classifier, and the unsupervised classifier proposed in this article

Corpus	RNN $+$ LSTM	RNN	CNN	SVM	BNB	UYSentiDict		Unsupervised classifier
						Traditional	Improved
Commentary	81.34	78.80	84.77	85.76	84.89	72.835	82.93	83.60
Literary	72.43	65.22	72.46	69.34	64.03	69.178	86.13	88.03
Movie	80.18	73.01	79.94	83.38	75.79	69.138	79.88	93.09
Microblog	85.31	79.56	84.62	88.07	84.67	71.358	85.76	90.19

From Table 15, on the Commentary corpus, the best classification accuracy of the combined method proposed in this paper was 83.60%, which is 0.67 percentage points higher than the lexicon-based method but lower than the supervised SVM, BNB, and CNN classifiers. On the other three corpora, the classification results of the proposed method were better than those of both the LB classifiers and the machine learning classifiers. This shows that the machine learning classifier trained on the classification results of the lexicon-based classifier achieved better classification accuracy than the lexicon-based classification method alone. On the smaller corpora, the combined method was even better than the supervised machine learning methods. In general, the combined method is well suited to sentiment classification tasks in resource-poor languages such as Uyghur.

5. Conclusion

Considering the difficulty of obtaining labeled samples, this paper improves the existing lexicon-based classification method and combines it with a machine learning classifier. First, the Uyghur sentiment corpus is classified using “UYSentiDict”. In the matching of sentiment vocabulary, the matching object is extended from the original word to the stem, and the influence of grammatical rules (turning conjunctions, progressive conjunctions, degree adverbs, and negation words) on the sentiment tendency of sentences is fully considered. Then, the machine learning classifier is trained on the pseudo-labeled datasets selected from the results of the lexicon-based classifier, and the remaining corpora are classified using a set of optimal features. Because the proposed method is not constrained by domain and does not need pre-labeled training data, it can address the resource scarcity problem in Uyghur sentiment classification.

In this paper, we discussed an unsupervised sentiment classification method. In the classification process, we considered only two sentiment tendencies, positive and negative, whereas emotional texts contain many kinds of emotions, such as happiness, sadness, surprise, and fear. Therefore, in future work, we will study more emotional tendencies in Uyghur texts.

Nowadays, in the field of sentiment classification, researchers use deep learning methods to obtain satisfactory classification results. The prominent advantage of sentiment classification methods based on deep learning is that they do not need a large labeled corpus and do not need human participation in feature selection. However, a common problem of deep learning methods is that word vectors are constructed according to the context of the vocabulary without considering sentiment information, which results in similar word vectors for words with similar contexts but opposite sentiment polarity. In future work, we will study deep learning based sentiment classification in depth and bring the advantages of traditional machine learning methods into the deep learning classification process, so as to further improve the efficiency of Uyghur text sentiment classification.

Acknowledgments

This work was supported by Foundation of National Program on Key Basic Research Project of China (2014CB340506); National Natural Science Foundation of China (61363063, 61662076).

References

1. Pang, Lee and Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques, in: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 2002, pp. 79–86.

2. Dave, Lawrence and Pennock D.M., Mining the peanut gallery: Opinion extraction and semantic classification of product reviews, in: International Conference on World Wide Web, 2003, pp. 519–528.

3. Maas A.L., Daly R.E., Pham P.T., Huang, A.Y. and Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 2011, pp. 142–150.

4. Esuli and Sebastiani, SentiWordNet: A publicly available lexical resource for opinion mining, in: Proceedings of the 5th Conference on Language Resources and Evaluation, 2006, pp. 417–422.

5. Mou and Du, Sentiment classification of Chinese movie reviews in micro-blog based on context, in: IEEE International Conference on Cloud Computing and Big Data Analysis, 2016, pp. 313–318.

6. Zhang and Sindhwani, A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge, in: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL 2009) and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 244–252.

7. Melville, Gryc and Lawrence R.D., Sentiment analysis of blogs by combining lexical knowledge with text classification, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 1275–1284.

8. and Zhou, Self-training from labeled features for sentiment analysis, Information Processing & Management 47(4) (2011), 606–616.

9. Qiu, Zhang and Zhao, SELC: a self-supervised model for sentiment classification, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 929–936.

10. Wiebe, Learning subjective adjectives from corpora, in: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, 2000, pp. 735–740.

11. Turney P.D., Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2002, pp. 417–424.

12. Aue and Gamon, Customizing sentiment classifiers to new domains: a case study, in: Proceedings of the Recent Advances in Natural Language Processing (RANLP), Borovets, 2005.

13. Dasgupta and Ng, Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification, in: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL 2009) and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 701–709.

14. Goldberg A.B. and Zhu, Seeing stars when there aren’t many stars: graph-based semi-supervised learning for sentiment categorization, in: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, 2006, pp. 45–52.

15. Huang C.R., Zhou and Lee S.Y.M., Employing personal/impersonal views in supervised and semi-supervised sentiment classification, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2010, pp. 414–423.

16. Xue and Wang, Semi-supervised sentiment classification with social network, Acta Scientiarum Naturalium Universitatis Pekinensis 50(1) (2014), 61–66.

17. Kennedy and Inkpen, Sentiment classification of movie reviews using contextual valence shifters, Computational Intelligence 22(2) (2006), 110–125.

18. Taboada, Brooke, Tofiloski, Voll and Stede, Lexicon-based methods for sentiment analysis, Computational Linguistics 37(2) (2011), 267–307.

19. Zagibalov and Carroll, Unsupervised classification of sentiment and objectivity in Chinese text, in: Proceedings of the Third International Joint Conference on Natural Language Processing, 2008, pp. 304–311.

20. Andreevskaia and Bergler, When specialists and generalists work together: Overcoming domain dependence in sentiment tagging, in: Proceedings of Association for Computational Linguistics-08: HLT, 2008, pp. 290–298.

21. Tan, Wang and Cheng, Combining learn-based and lexicon-based techniques for sentiment detection without using labeled examples, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 743–744.

22. Tan, Tang and Cheng, A novel scheme for domain transfer problem in the context of sentiment analysis, in: Proceedings of the CIKM, 2007, pp. 979–982.

23. Yusuf and Hamdulla, The sentiment classification of Uyghur sentences based on sentiment dictionary, Computer Knowledge and Technology 10(10) (2014), 2371–2374.

24. Huang, Tian S.W. and Feng G.J., Sentence sentiment analysis based on Uyghur sentiment word, Computer Engineering 38(9) (2012), 183–185.

25. Nian, Liu R.L., Marhaba and Fan Z.K., Analysis of the sentence tendency in Uighur language, Computer Systems & Applications 25(7) (2016), 171–175.

26. Mageed M.A. and Diab, Toward building a large-scale Arabic sentiment lexicon, in: Proceedings of the 6th International Global WordNet Conference, 2012, pp. 18–22.

27. Steinberger, Ebrahim, Ehrmann, Hurriyetoglu, Kabadjov, Lenkova, Steinberger, Tanev, Vázquez and Zavarella, Creating sentiment dictionaries via triangulation, Decision Support Systems 53(4) (2012), 689–694.

28. Chen, Huang and Chen, NTUSD-Fin: A Market Sentiment Dictionary for Financial Social Media Data Applications, in: Proceedings of the First Financial Narrative Processing Workshop, 2018.

29. Chen, Practical Dictionary of Uygur and Chinese, Xinjiang University Press, 1995.

30. Wiegand, Klenner and Klakow, Bootstrapping polarity classifiers with rule-based classification, Language Resources & Evaluation 47(4) (2013), 1049–1088.

31. Zhang, Research on Sentiment Classification Methods of Web Review Texts, Ph.D. Dissertation, College of Computer Science of Chongqing University, 2015.

32. Litip, The Reference Grammar of Modern Uygur Language, China Social Sciences Press, 2012.

33. Liu Y.J., S.G., S.M. and Su, Classification of Chinese texts sentiment based on semantic and conjunction, Journal of Sichuan University Natural Science Edition 52 (2015), 57–62.

34. and Teng, An analysis of statements of negation in the modern Uygur language, Language & Translation 2 (2001), 11–13.

35. Muzapar, A Study on the Transformation between Affirmative Sentences and Negative Sentences of Modern Uighur, Ph.D. Dissertation, Xinjiang University, 2003.

36. Turhuntay, Feature selection and machine learning algorithms for Uyghur text sentiment classification, Boletín Técnico 55(13) (2017), 56–66.

37. Turhuntay and Slamu, Uyghur text sentiment classification based on bi-tagged features, Journal of Chinese Information Processing 32(8) (2018), 80–90.

¹ Code and data for the first step are available at https://github.com/ErpanY/Uy_SO-PMI.

⁴ Code for the neural network experiment can be downloaded from https://github.com/ErpanY/Uy_SentiClassifyByDeep Learning.