Aara’ – a system for mining the polarity of Saudi public opinion through e-newspaper comments

Abstract

Aara’ is a system for mining opinion polarity from the pool of comments that readers write anonymously in the online editions of Saudi newspapers. We use a naive Bayes classifier with a revised n-gram approach to extract the polarity of public opinion, which is expressed in Arabic, classifying it into four categories. For training, we manually marked the comments as belonging to one of the categories. All the words in the documents of the training set were removed except those with explicit connotations. After training, the words designated as vocabulary were classified into one of the categories. Our system carries out polarity classification over informal colloquial Arabic that is unstructured and contains a reasonable proportion of spelling errors. Testing our system showed a macro-averaged precision of 86.5% and a macro-averaged F-score of 84.5%. The accuracy of the system is 82%.

Keywords

Arabic NLP; colloquial Arabic; naive Bayes; public sentiment; revised n-gram

1. Introduction

Certain events in history, such as the Arab Spring, may take the world by surprise. Such events do not come out of nowhere, but the symptoms may have been so minuscule that most people did not pay attention to them. The consequences of some of these events are far reaching, and it would be useful to be able to predict a future crisis ahead of time. The virtual world provides a treasure trove for pundits seeking to predict the next big event. These days millions of web surfers express their opinions about any topic through forums, blogs, social networks and the many online editions of newspapers that enable their visitors to record comments on specific news. Although often associated with sentiment analysis, opinion mining is a new discipline that has recently attracted increased attention. It is an area related to natural language processing and text mining, with the objective of identifying opinions and thoughts expressed in natural language. The majority of the work in this area is devoted to English, with very little in other Latin-based languages. Despite the fact that Arabic is one of the top 10 most used languages on the internet,¹ it lags behind in many NLP applications. Following the recent Arab Spring there has been a surge of work devoted to opinion mining in Arabic.

One of the geopolitically important countries in the Arab world is the Kingdom of Saudi Arabia. It is a vast country that is sparsely populated and has one of the largest known oil reserves in the world. Moreover, two of the three most sacred sites for Muslims are in Saudi Arabia. Given its importance, little is known about the country and its population, as there are no official or private organizations that conduct public surveys or polls. Recently the state’s grip on the media has been relaxed and people are allowed to post written comments in most of the online editions of local newspapers. This provides a golden opportunity to peek into the local people’s minds and we suppose it to be an excellent alternative to organized polling. In this work we introduce Aara’ (آراء), Arabic for ‘opinions’, a system to mine the general public opinion through the comments readers leave anonymously in online local newspapers. We picked two of the most widely circulated newspapers that are printed in Riyadh, the capital of the kingdom and its most populous city (population 5.25 million). We observed that certain news tends to generate hundreds and maybe thousands of online comments in a 24-hour period. One such example is the news regarding the switching of the weekend. Thursday and Friday were the official weekend in Saudi Arabia until a royal decree in June 2013 switched it to Friday and Saturday. When a local newspaper posted the news, ‘The Shura Council agrees to “study” changing the weekend’,² there were 196 different comments in the span of 24 hours. Table 1 lists a sample of the comments.

Table 1.

Sample of comments by readers along with English translation by the authors. For those who can read Arabic, some of the comments may not be clear as they are written in local dialect.

We must change the break to become Friday and Saturday instead of being cut off from the world on Thursday, causing losses in billions	يجب تعديل الاجازة لتصبح الجمعة والسبت بدل من الانقطاع عن العالم يوم الخميس مما يسبب خسائر بالمليارات

We do not want change, Thursday and Friday are better	لا مانبي التغيير خميس وجمعه احلي

Having studied abroad and experienced both, Thursday and Friday break is more enjoyable … I hate Saturday as it is more like the Jew’s holiday. And those that talk about bank closing should note that banks in Japan close even before we start our morning stock market session and in US it opens after ours are closed, so the argument is baseless :)	عشت التجربتين بحكم الدراسة في الخارج، عطلة الخميس والجمعة أمتع بكثييير..وأكره عطلة السبت لأنها تشابه عطلة اليهود الجدير بالذكر اللي يتكلمون عن إغلاق البنوك إن البنوك في اليابان تغلق قبل ما يبدأ التداول عندنا وفي أمريكا تفتح بعد ما يغلق التداول عندنا فكلامهم مردود عليهم :)

Arabic is a Semitic language which surprisingly predates Islam. This is confirmed by the discovery of many pre-Islamic Arabic inscriptions dating from the second to fourth centuries CE [1]. Arabic can be classified into Classical and Modern. Classical Arabic represents the pure language spoken by the Arabs, whereas Modern Standard Arabic (MSA) is an evolving variety of Arabic with constant borrowing and innovations to meet modern challenges [2]. The Arabic orthographic system uses small diacritical markings to represent the short vowels (a, i, u). These markings, which are placed either above or below the letter, are used to clarify the sense and meaning of the word. For example, علم could mean any of science, flag or taught depending on the diacritical markings. MSA is characterized by the absence of diacritics, which are reserved for the rare cases when it is not possible to disambiguate the meaning through the context [3]. Arabic is the native language of over 330 million speakers [4]. Most Arabs can read and understand MSA with few or no problems. That is why the printed media are typically written in MSA; however, in informal settings, such as blogs or comments in e-newspapers, people tend to write using a combination of MSA and colloquial Arabic. Colloquial, or dialectical, Arabic differs from one region to another; each region has its own vocabulary, phonology, syntax and morphology rules [5]. Unfortunately, these rules are not written and there are no available dictionaries for their vocabularies [6]. The range of dialects that Arabic includes is much more varied than the range of dialects typically associated with European languages such as English and French. This allows for multiple spellings of a single word within the same dialectical group.

Owing to its large geographic area, there are several dialectical groups within Saudi Arabia. Chief among them are the Hejazi dialect in the western region, the Nejdi in the central region, and the Sharqiyya in the eastern region of the country. This work deals with the Arabic Nejdi dialect. As we just mentioned, this dialect is commonly used in the central region of Saudi Arabia, including the capital Riyadh. Overall there are about 10 million people who speak this dialect.³ In this work we had to deal with many challenges, including the high possibility of written comments being a textual mix of MSA and dialectical Arabic, and the lack of a unified spelling for dialectical words, making it likely to have more than one spelling for a single word. To this we may add the fact that people often write and submit the comments without re-editing or checking for errors, so we cannot discard the possibility of spelling errors. To accomplish our task, we used a naive Bayes classifier with a revised n-gram model to extract the public opinion, classifying it into four categories: strongly positive, positive, negative and strongly negative.

To the best of the authors’ knowledge, there is no study specifically in Arabic that makes use of readers’ comments in e-newspapers to probe for the public sentiment, nor of sentiment analysis in Arabic that goes beyond binary (positive and negative) classification. We believe that this is an untapped rich source to mine for the public sentiment given that news stories cover all genres: politics, religion, sports, business, etc. The two local newspapers we used for this study appear in printed and electronic editions. The latter is free and allows for anonymous comments from the readers. Indeed Twitter is very popular in Saudi Arabia and makes for a good source to probe for public sentiment. However, the ability to comment on news stories pre-dates Twitter, which was introduced into the kingdom in late 2010/early 2011. A certain age group dominates Twitter users while we believe that commenting on news stories is more uniformly distributed among all age groups. The archives of these two e-newspapers keep their news stories and all the written comments accessible to the public for free. One of them has faithfully retained all the comments in their news archive since January 2005. This makes for a great study in shifts of public sentiment over certain events.

This paper is organized as follows. Section 2 briefly describes related works in the area of automatic public opinion extraction. Section 3 details the major components of our system. In Section 4 we introduce the experiments carried out to evaluate the performance of our system and the results along with their interpretation. The conclusion and future work direction are provided in Section 5.

2. Related work

The basic task in sentiment analysis is classifying the polarity of a given text. Some of the early works include Pang et al. [7], who applied machine learning techniques to determine whether a movie review is positive or negative. Another early work is that of Turney [8], who used unsupervised learning to classify product reviews. The recent turmoil in many parts of the world has fuelled research to answer questions about what and where the next big event will be. This is evident from the number of papers tackling various aspects of opinion mining and/or sentiment analysis that appeared in 2012 and the first half of 2013, as compared with earlier years.

Abu-Jbara et al. [9] used NLP techniques to analyse debates in Arabic, identifying subgroups with opposing opinions. Hamouda and Akaichi [10] investigated the positive and negative sentiments of Tunisian Facebook users from a dataset collected during the Tunisian revolution. The authors reported an accuracy of 75.31% using a combination of unigram and bigram and Support Vector Machine (SVM) classifiers. The precision was 60% (89.2%) for negative (positive) polarity respectively, while recall was 83.3% (71.2%) for negative (positive) polarity, respectively. Repeating the same experiment using naive Bayes (NB) classifier lowered accuracy to 74.05%.

Mohammad et al. [11] worked on sentiment analysis of tweets written in English language. The objective was to classify the tweet as being positive, negative or neutral. Narr et al. [12] proposed a scheme to analyse the sentiment of a tweet written in any language. They tested their system on tweets in four languages: English, German, French and Portuguese. The best accuracy of 81.3% was reported for English while the lowest accuracy of 64.9% was for Portuguese.

Elarnaoty et al. [13] tackled the problem of mining for the opinion holder in Arabic text. For this the authors used a combination of semi-supervised pattern classification and conditional random fields. They were able to achieve a precision of 85.52% and a recall of 39.49%. Al-Subaihin [14] and Al-Subaihin et al. [15] presented a sentiment analysis tool for opinions pertaining to restaurant reviews. The interesting part of the work was the author’s novel approach to using human computation to handle colloquial Arabic. The author was able to achieve a precision of 60.5%. Abdul-Mageed et al. [16] presented SAMAR, a system for subjectivity and sentiment analysis for Arabic social media genres. For their study, the authors used different datasets: chats, tweets, Wikipedia talk pages and forums. Each dataset was divided into 80/10/10% for training/development/testing respectively. Employing different pre-processing, their best accuracy was 81.36% (from forum dataset). The F-score for positive sentiment ranged between 49.41% (tweet dataset) and 88.64% (forum dataset), and for the negative sentiment the F-score ranged between 48.1% (forum dataset) and 77.78% (Wikipedia talk pages).

Danowski [17] developed a network-based method which he called the ‘Semantic Network Analyzer’ to quantify sentiment in Taleban propaganda materials that originate from Afghani and Pakistani sources over a period of five years. According to the author, the Taleban content generally showed evidence of system flourishing. Glass and Colbaugh [18] presented two methods for estimating social media sentiment. Both methods rely on text classification which models the data as a bipartite graph of documents and words. The system was used to estimate regional public opinion regarding the 2009 Jakarta hotel bombing and the 2011 Egyptian revolution.

Rushdi-Saleh et al. [19] presented an Opinion Corpus for Arabic that contains 500 movie reviews collected from a variety of web pages and blogs. The reviews were equally split, 250 positive and 250 negative. The authors used a word-based n-gram model (n = 1–3) and both NB and SVM to determine the polarity of a review. The system was tested using 10-fold cross-validation and both TF-IDF and TF for a weighting scheme. The best results were achieved using a trigram model, TF-IDF and SVM classifier with a precision of 87.4%, a recall of 95.2% and an accuracy of 90.6%. Using NB classifier instead lowered the accuracy to 89%. Almas and Ahmad [20] proposed a language-informed framework for financial news analysis in English, Arabic and Urdu. The authors resorted to some statistical scheme to extract terms, collocations and n-gram. Some steps were taken for pattern generalization and pruning. For evaluation they experimented using the top 10 keywords for each language and achieved a precision over 90%; however, the recall was low (ranged between 8.6 and 22.2%).

Silva et al. [21] used an SVM classifier for classification of opinions related to Portuguese political actors. They experimented with several feature sets, for example, bag-of-words, n-gram, and Parts of Speech. The authors reported that for several possible feature combinations the precision rate was over 90%. Froelich et al. [22] presented a case study of the use of text mining to evaluate citizen comments on public issues; they used word counting to obtain a list of most frequent terms and they built a dictionary to include different words with the same meaning. After the terms were generated, many tools in the system were used to give better information about the problem. One of these tools is text categorization, which clusters the comments into suggested groups with each cluster or group containing a number of related comments. Overall the authors achieved a precision of over 90% with an F-score of over 80% for classifying the news polarity.

3. Our proposed system

In this paper we introduce Aara’, a system for mining public opinion, specifically Saudi popular opinion. Aara’ uses an NB classifier together with the revised n-gram algorithm, which further improves the classifier’s performance. Once the system is trained, it can be supplied with the full set of comments pertaining to specific news and the system will provide the percentage of comments in each category. The main components of the system are shown in Figure 1.

Figure 1.

The general architecture of Aara’.

3.1. Preparing the material for the training session

We manually compiled a set of 815 comments gathered from two online editions of local newspapers, Alriyadh⁴ and Aljazirah.⁵ Comments were picked from different news genres (politics, sports, editorial, health, religion, science, etc.). These comments were manually classified into four different categories/polarities: strongly positive (موافق بشدة), positive (موافق), negative (معارض) and strongly negative (معارض بشدة). When collecting the comments, we kept those that directly reflected the writer’s positive or negative attitude toward the issue being discussed in the news. Next we filtered out comments that did not have words bearing an explicit positive or negative connotation. Finally the comments were split into two disjoint sets, a training set (620 comments), which was used to train the system, and a testing set (195 comments), which was used to validate and test the system. We pre-processed the training set prior to training the classifier. Pre-processing was restricted to removing prepositions, markings, numerals and other Arabic particles. We also removed all words that did not have any explicit overtone. Table 2 shows some sample comments that were used for the training.

Table 2.

Sample comments used for the training. The underlined words are those with explicit overtone. All the remaining words (in grey) will be removed as they do not have any connotation.

Comment	Polarity
الحمد لله استطاع شباب و شعب مصر العظيم الانتصار على الظلم و الطغيان و رموز الفساد جميعاً و ننتظر مصر فى ثوبها الجديد ومبارك لمصر و العالم العربى و تحية لروح شهداء مصر العظم الذين ضحوا بارواحهم لنجاح الثورة	Positive
امنا بالله و صدقنا هالنمو وش خططكم لتطوير هالمطار اللي يتردى يوم ورى يوم ؟ و هل فيه خطة لفتح الصالة اللي لها 30 سنى مقفلة لمواجهة هالنمو اللي تدّعونه؟!	Negative
مع أحترامي أفشل طيران شفته في حياتي, خاصة في مواسم الحج :( فشلتونا ياناس !	Strongly negative

3.2. Training module

In general the classifiers can be grouped into supervised, semi-supervised and unsupervised classifiers. The NB classifier belongs to the first group and is a major part of our system. The NB classifier assumes that the probability of word occurrence is independent of its position within text. Let $V$ be the set of all polarities {positive, negative, strongly positive, strongly negative}. The essence of an NB classifier is to assign to a new instance having attribute values (a₁, a₂, …, a_n) the polarity that has the highest probability among the others. Given a set of examples and their target values (polarity) that belong to one of the target value classes, a new instance can be classified using

v_{NB} = \arg\max_{v_{j} \in V} P (v_{j}) \prod_{i} P (a_{i} | v_{j})

(1)

where $v_{NB}$ is the polarity of the new instance, $P (v_{j})$ is the probability of target $v_{j}$ with $v_{j} \in V,$ and $P (a_{i} | v_{j})$ is the probability of observing attribute $a_{i}$ given target $v_{j} .$
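The decision rule in Equation (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `classify`, the layout of `table` (word to per-class probability) and all numbers are hypothetical, and logarithms are summed instead of multiplying probabilities, which yields the same arg max while avoiding underflow.

```python
import math

def classify(words, priors, table):
    """Return the polarity v_j that maximizes P(v_j) * prod_i P(a_i | v_j)."""
    # Words that are not in the vocabulary table are skipped.
    known = [w for w in words if w in table]
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        # Summing logs is equivalent to multiplying the probabilities.
        score = math.log(prior) + sum(math.log(table[w][label]) for w in known)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy numbers, for illustration only.
priors = {"positive": 0.5, "negative": 0.5}
table = {
    "great": {"positive": 0.6, "negative": 0.1},
    "awful": {"positive": 0.1, "negative": 0.7},
}
print(classify(["great", "service"], priors, table))  # positive
```

With four polarities the only change is that `priors` and each row of `table` carry four entries instead of two.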

Our NB-based training module is listed in Figure 2. The main task of this module is to train the system from a set of labelled comments with predefined classes. This is accomplished in two steps:

Vocabulary building. A table is created which includes all the distinct keywords that we manually compiled for the training set (see the previous section). We will refer to this table as $Γ .$

Computing probabilities. $P (v_{j})$ refers to the probability of class $v_{j},$ where $v_{j} \in$ {positive, negative, strongly positive, strongly negative}; and $P (w_{k} | v_{j})$ is the probability that the word $w_{k}$ occurs in a comment whose polarity is $v_{j} .$

Figure 2.

Naive Bayes-based training module. The training set has been pre-processed. We removed all words with no explicit connotations.
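The two training steps, vocabulary building and computing probabilities, might look roughly as follows in Python. This is a sketch under stated assumptions and not a transcription of Figure 2: the text does not give the estimator, so add-one (Laplace) smoothing is assumed here, and the function name `train` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train(labelled_comments):
    """labelled_comments: list of (keywords, polarity) pairs, already pre-processed."""
    # Step 1: vocabulary building (the table of distinct keywords).
    vocabulary = {w for words, _ in labelled_comments for w in words}

    # Step 2: computing probabilities.
    doc_counts = Counter(label for _, label in labelled_comments)
    word_counts = defaultdict(Counter)
    for words, label in labelled_comments:
        word_counts[label].update(words)

    total = len(labelled_comments)
    priors = {label: n / total for label, n in doc_counts.items()}
    cond = {}
    for label in doc_counts:
        n_words = sum(word_counts[label].values())
        # ASSUMPTION: add-one smoothing, so that unseen (word, class) pairs
        # do not receive probability zero.
        cond[label] = {
            w: (word_counts[label][w] + 1) / (n_words + len(vocabulary))
            for w in vocabulary
        }
    return vocabulary, priors, cond

# Hypothetical toy training set.
data = [(["great", "good"], "positive"), (["bad"], "negative")]
vocabulary, priors, cond = train(data)
print(priors["positive"], cond["positive"]["good"])  # 0.5 0.4
```

Each word can then be assigned to the class in which its conditional probability is highest, as the text describes.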

Recall that the training set has been pre-processed. We removed all the words leaving only those with some kind of explicit connotation. At the end of the training session each word is assigned a probability and it is accordingly classified into one of four categories: strongly positive, positive, negative and strongly negative overtones. For example after the training, the word عملاق belongs to the target class ‘strongly positive’ with probability 0.0016, and belongs to the classes ‘strongly negative’, ‘positive’ and ‘negative’ with probabilities 0.008, 0.005 and 0.005, respectively. In this case the highest probability is 0.0016, so this word will be classified as being strongly positive. Any word that had an equal probability in every target class was filtered out. There are a total of 1118 words in the vocabulary: 389 words with positive, 598 words with negative, 63 with strongly positive and 68 with strongly negative connotations. The list of words in the vocabulary is available upon request.

3.3. Comment classification module

This module returns the estimated polarity of the target comment. The NB-based classifier (Figure 3) searches for each word $a_{i}$ that occurs in the target comment (the one to be classified) in the vocabulary table $Γ .$ If found, the classifier uses Equation (1) and the probabilities that were computed in the training module to estimate the polarity of the target comment.

Figure 3.

Naive Bayes-based comment classifier module.

What complicates the process is that most nouns and verbs in Arabic carry prefixes. The definite article (ال) is always attached to nouns, and many conjunctions and prepositions are also attached as prefixes to nouns and verbs. This hinders the retrieval of morphological variants of words [23, 24]. The next example illustrates one of the challenges. Let us assume the word نجاح (success) was in the vocabulary table $Γ$ following the training process. If we happen to have a comment with the word بنجاح (with success) or وبنجاح (and with success), the classifier will fail to recognize the attached prefixes and will end up discarding both words as they are not in the vocabulary table. The conventional approach in Arabic would be to use a light stemmer. Stemming is not an easy task and requires a predefined list of prefixes and suffixes. For our system we avoided stemming and instead went for a revised n-gram approach. In fact the revised n-gram also helped us with many spelling mistakes we encountered in the comments.

3.4. Revised n-gram

The pure n-gram model can be used to compute the similarity of two strings through counting the number of common n-grams they have. The n-gram similarity coefficient $δ_{n}$ for two words $a$ and $b$ is defined as

δ_{n} (a, b) = \frac{\text{\# similar n-grams in } a \text{ and } b}{\text{\# unique n-grams in the union of } a \text{ and } b} .

(2)

When computing the similarity coefficient, the pure n-gram approach does not consider the order of the n-gram in the target word [25]. This means a higher probability of the matching score between two strings even though they may not share the same concept [23], see Figure 4a. The revised n-gram approach helps to overcome this problem.

Figure 4.

The bigram similarity measure between the Arabic words التحالفات (the alliances) and الفاتح (the conqueror). (a) Using a pure bigram the similarity coefficient is 6/7 ≈ 85.71%, and (b) with a revised bigram it is 2/7 ≈ 28.57%. The latter is favoured as the two words do not belong to the same meaning class.
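The pure n-gram coefficient can be sketched in a few lines of Python. The counting convention is an assumption inferred from the 6/7 value quoted in the Figure 4 caption: every n-gram position of the longer word that occurs anywhere in the shorter word counts as similar, and the denominator is the number of distinct n-grams across both words. The function names are hypothetical.

```python
def ngrams(word, n=2):
    """All n-grams of `word`, in order of appearance."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def pure_similarity(a, b, n=2):
    """Order-insensitive n-gram similarity between two words."""
    if len(a) < len(b):
        a, b = b, a  # make `a` the longer word
    grams_a, grams_b = ngrams(a, n), set(ngrams(b, n))
    union = set(grams_a) | grams_b
    if not union:
        return 0.0
    # ASSUMPTION: count positions in the longer word whose n-gram appears
    # anywhere in the shorter word (position is ignored).
    similar = sum(1 for gram in grams_a if gram in grams_b)
    return similar / len(union)

print(pure_similarity("التحالفات", "الفاتح"))  # 6/7, about 0.857
```

Because position is ignored, the two unrelated words receive a high score, which is the weakness the revised approach is meant to address.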

Ahmad and Nürnberger [23] proposed a language-independent approach for conflation that does not require a prior knowledge of the language or the predefined rules. The revised n-gram approach is applied for cases $n \geq 2 .$ Let $a$ and $b$ be the words to be compared. Assume without loss of generality that $| a | \geq | b | .$ The similarity score $S$ for an n-gram of size $n$ and an odd-numbered window of size $m$ is given by:

S_{n, m} (a, b) = \frac{\sum_{i = 1}^{| a | - n + 1} \sum_{j = - (m - 1) / 2}^{(m - 1) / 2} g (a [i : n], b [i + j : n])}{\text{\# unique n-grams in the union of } a \text{ and } b}

(3)

where $g (α, β) = 1$ if $α = β$ and zero otherwise; and $a [i : ℓ]$ denotes a substring of word $a$ of length $ℓ$ starting from position $i .$ Where the length $ℓ \leq 0$ , the substring will be empty. The revised n-gram insists that the order of the n-grams be maintained when comparing for the similarities between the words (Figure 4b). This feature is very practical for Arabic nouns and verbs which are heavily affixed (prefix and suffix). It is also helpful in our task of classifying the comments into one of the four categories. In many cases the form of the word affects the classification of the comment. For example, the word مبروك leads to ‘positive’ classification, whereas مبروووك leads to ‘strongly positive’ classification (Figure 5). Table 3 shows some of the words and their forms which affect the classification.

Figure 5.

Using a revised bigram for the similarity measure of two forms of the same word leading to different classifications. Here the similarity score is 80%.
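A position-aware variant in the spirit of Equation (3) can be sketched as follows. Two points are assumptions rather than statements from the text: the sum runs over every n-gram position of the longer word, starting from the first, and the window size is set to m = 5. These choices are made because they reproduce the 2/7 and 80% scores quoted in the captions of Figures 4 and 5. Function names are hypothetical.

```python
def ngrams(word, n=2):
    """All n-grams of `word`, in order of appearance."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def revised_similarity(a, b, n=2, m=5):
    """n-gram similarity that only matches n-grams at nearby positions."""
    if len(a) < len(b):
        a, b = b, a  # make `a` the longer word
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = set(grams_a) | set(grams_b)
    if not union:
        return 0.0
    half = (m - 1) // 2  # ASSUMPTION: odd window of size m = 5
    matches = 0
    for i, gram in enumerate(grams_a):
        # Compare with the n-grams of b at positions i-half .. i+half.
        for k in range(i - half, i + half + 1):
            if 0 <= k < len(grams_b) and grams_b[k] == gram:
                matches += 1
    return matches / len(union)

print(revised_similarity("التحالفات", "الفاتح"))  # 2/7, about 0.286
print(revised_similarity("مبروك", "مبروووك"))      # 4/5 = 0.8
```

The elongated form scores 0.8 against its base form, so it clears a 70% threshold, while the unrelated pair from Figure 4 falls well below it.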

Table 3.

Sample list of words whose different forms lead to different classifications

First form	Classification	Second form	Classification
مبروك	Positive	مبروووك	Strongly positive
شكرا	Positive	شكرررا	Strongly positive
كفو	Positive	كفووو	Strongly positive
فشل	Negative	أفشل	Strongly negative
دمران	Negative	دمرااانه	Strongly negative
القهر	Negative	القهههر	Strongly negative

Spelling errors were a common problem in this work. Most of the comments we came across contained some sort of typo. We identified three groups of spelling errors. In the first group we have errors owing to the proximity in the sound of the pairs of letters ض and ظ; ت and ة; and س and ص, for example, احتظنهم, which is correctly spelled احتضنهم. In the second group we have typos owing to mixing up the short (diacritical marking) and long vowels, for example, موظاعفة, whose correct spelling is مظاعفة. In the last group we have words with the letter ء, hamza. In Arabic, the letter hamza is written as ء, أ, ؤ or ئ. There is a complex set of rules that dictates how the letter hamza is written. For example, هؤلاء was misspelled as هئولاء. In the training set, all the misspelled words were manually corrected. However, in the testing set we did not edit the comments, so all the misspellings were present. Some of the more common misspellings were handled with a simple hash table. For others, the revised n-gram was able to correctly relate the misspelled word to its correct form [26].

Using the revised n-gram in the text classification phase of our system improved its ability to classify the comments. Each word $w$ in the comment to be classified was searched for in the vocabulary table $Γ;$ where it was not in the table, the revised n-gram approach was used to determine whether the table contained words similar to $w .$ We considered two words as belonging to the same concept if their similarity score was at least 70%. This threshold was set by trial and error. We noticed that lowering the threshold below 70% resulted in many unrelated words being treated as belonging to the same concept, whereas raising it above 70% caused some words with the same meaning to be ignored.
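The lookup order described above (hash table of common misspellings, then exact match, then similarity fallback at the 70% threshold) can be sketched as below. The function name `find_entry`, its parameters and the rule of picking the highest-scoring entry when several qualify are assumptions made for illustration; the text does not say how competing matches are resolved. The `similarity` argument stands for any word-similarity function returning a score in [0, 1], such as a revised bigram measure.

```python
def find_entry(word, table, similarity, corrections=None, threshold=0.70):
    """Return the vocabulary entry to use for `word`, or None if none qualifies."""
    # 1. Common misspellings are mapped to their correct form first.
    word = (corrections or {}).get(word, word)
    # 2. Exact match in the vocabulary table.
    if word in table:
        return word
    # 3. Otherwise fall back to the similarity measure.
    # ASSUMPTION: among entries reaching the threshold, keep the best-scoring one.
    best, best_score = None, threshold
    for entry in table:
        score = similarity(word, entry)
        if score >= best_score:
            best, best_score = entry, score
    return best
```

A word for which `find_entry` returns `None` contributes nothing to the classification of the comment.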

As we said earlier, the training set was pre-processed, which included removing all the words with no explicit connotation. It was possible to just ignore words with a neutral overtone and leave them in the comment. However, there are two reasons why we decided against this. First, leaving these words in would prolong the training process. This is because the theoretical time complexity of the NB classifier is $O (nm),$ where $n$ is the number of training examples and $m$ is the number of words in the training data. Second, they introduce noise into the classification. Consider the following scenario. Assume $C$ is a comment in the training set, and $u$ is one of the words in $C$ with no explicit overtone. The classifier will not find $u$ in the vocabulary table but it will find another word, say $w,$ which is highly similar to $u .$ So now the classifier will use the class probability of $w$ in classifying the comment $C .$ This may sway the overall classification of $C .$

4. Results and discussion

We implemented Aara’ using the Java programming language and Microsoft Access. We manually collected a total of 815 comments and these were split into two disjoint sets. The ‘training set’ had 620 comments that were used to train the system through two phases – building vocabulary and computing probabilities (Figure 6); and the ‘testing set’ comprised 195 comments, of which 39% were labelled as positive, 36% as negative, 17% as strongly positive and the rest as strongly negative. Figure 7 shows a screen shot of the testing window.

Figure 6.

A screen shot of the training window showing the vocabulary and their corresponding probabilities.

Figure 7.

A sample screen shot of the testing window showing individual testing set comment along with Aara’s auto classification and our own manual classification.

Four measures are used to evaluate the system: precision $(P),$ recall $(R),$ F-score (F) and accuracy $(Acc) .$ These are defined as [27]:

\begin{matrix} P = \frac{TP}{TP + FP} \\ R = \frac{TP}{TP + FN} \\ F = \frac{2 PR}{P + R} = \frac{2 TP}{2 TP + FP + FN} \\ Acc = \frac{TP + TN}{TP + FP + TN + FN} \end{matrix}

(4)

where TP (true positive) is the number of comments correctly classified as belonging to class $v_{j};$ FP (false positive) is the number of comments incorrectly classified as belonging to class $v_{j};$ FN (false negative) is the number of comments that are not classified as belonging to class $v_{j}$ but should have been; and TN (true negative) is the number of correctly classified comments as not belonging to class $v_{j} .$ The class $v_{j} \in$ {positive, negative, strongly positive, strongly negative}. The precision is a measure of exactness, that is, what percentage of comments classified as positive are actually such. The recall is a measure of completeness, that is, what percentage of positive comments is classified as such. It is worth noting that the values of precision, recall and F-score range between 0 and 1. Table 4 shows the result of evaluating our system on the testing set. The accuracy of the system is 82%.

Table 4.

Summary of evaluating the testing set (195 comments) over the four categories

	Positive	Negative	Strongly positive	Strongly negative
P	0.887	0.964	0.789	0.818
R	0.840	0.768	0.882	0.818
F	0.863	0.855	0.833	0.818
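The per-class measures in Equation (4) can be computed one class at a time, treating that class as the positive outcome and every other class as negative. The sketch below uses hypothetical label lists and a hypothetical function name; it is only meant to make the definitions of TP, FP and FN concrete.

```python
def per_class_scores(true_labels, predicted_labels, target):
    """Precision, recall and F-score for one class (one-vs-rest)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f_score

# Hypothetical labels, for illustration only.
truth = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "negative", "negative"]
print(per_class_scores(truth, predicted, "positive"))  # P = 1.0, R = 0.5, F = 2/3
```

As a consistency check on Table 4, the F row follows from the P and R rows: for the positive class, 2 × 0.887 × 0.840 / (0.887 + 0.840) ≈ 0.863.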

One clear reason for the misclassification of some comments is the existence of new words in the testing set that were missing from the vocabulary table. This can be resolved with a larger training set, which should minimize the possibility of unseen words in the testing set. The best precision of 96.4% is in the negative category and we can attribute it to the large number of negative words in our vocabulary: about 53.5% of the words in the vocabulary have a negative connotation. The lower recall in the negative category is attributed to the nature of Arabic sentences. In English, negation is often expressed by a prefix, for example, un- or im-, as in ‘unclean’ or ‘imperfect’. This is not the case with Arabic, where negation is expressed by a separate term preceding the word, for example, نظيف (clean) vs غير نظيف (not clean). The classifier was trained on a set of words that were in the training data. Each word was treated as independent, and so the classifier cannot handle negation. One way to solve this problem is to hyphenate (or join) the negation with the word, for example, غير_نظيف or غيرنظيف, in the pre-processing stage of the training set.
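The suggested pre-processing step, joining a negation term to the word that follows it, could be sketched as below. This illustrates the proposal only and is not part of the evaluated system; the function name is hypothetical and the default negation list contains just the particle used in the example above, so a real list would need to be compiled separately.

```python
def join_negations(tokens, negations=("غير",), joiner="_"):
    """Merge each negation term with the token that follows it."""
    joined = []
    i = 0
    while i < len(tokens):
        if tokens[i] in negations and i + 1 < len(tokens):
            # The negated word becomes a single vocabulary item.
            joined.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined

# English stand-in tokens, for readability.
print(join_negations(["not", "clean", "room"], negations=("not",)))
# ['not_clean', 'room']
```

After this step the classifier would see the negated word as a vocabulary item distinct from the plain word, so each can be assigned its own class probabilities.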

When multiple class labels are involved, as in our case, then averaging the evaluation measures can give a view on the general results. Two names are used to refer to averaged results: micro-averaged and macro-averaged results. Let $P_{micro} (P_{macro})$ be micro (macro)-averaged precisions, and similarly $R_{micro} (R_{macro})$ for recalls. The micro-averaged measures can be defined as follows [28]. Let $V$ be the set of all classes (positive, negative, strongly positive, strongly negative), then:

\begin{matrix} P_{micro} = & \frac{\sum_{v \in V} T P_{v}}{\sum_{v \in V} (T P_{v} + F P_{v})} \\ R_{micro} = & \frac{\sum_{v \in V} T P_{v}}{\sum_{v \in V} (T P_{v} + F N_{v})} \end{matrix}

(5)

where $T P_{v}$ is the number of true positives for class $v$, and $F P_{v}$ $(F N_{v})$ is the number of false positives (negatives) for class $v$. The macro-averaged measures are defined as follows [28]:

\begin{matrix} P_{macro} = & \frac{1}{| V |} \sum_{v \in V} P_{v} \\ R_{macro} = & \frac{1}{| V |} \sum_{v \in V} R_{v} \end{matrix}

(6)

where $P_{v}$ $(R_{v})$ is the precision (recall) for class $v$, and $| V |$ is the number of classes, which in our case is 4. Applying these averaging measures to our results we get: P_micro = 88.6%, R_micro = 82%, P_macro = 86.5% and R_macro = 82.7%. The corresponding F-score is simply the harmonic mean of the respective precision and recall, which gives F_micro = 85.2% and F_macro = 84.5%.
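
The averaging in Equations (5) and (6) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names and the `(tp, fp, fn)` tuple layout are assumptions. The macro-averages are recomputed from the per-class values in Table 4; the micro-averages cannot be recomputed here because the per-class counts are not listed, so the reported P_micro and R_micro are used directly for the F-score.

```python
def micro_average(counts):
    """Eq. (5): pool the per-class counts, then compute precision and recall.

    counts holds one (tp, fp, fn) tuple per class.
    """
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return tp / (tp + fp), tp / (tp + fn)


def macro_average(per_class):
    """Eq. (6): unweighted mean of the per-class precision and recall."""
    n = len(per_class)
    p_macro = sum(p for p, _ in per_class) / n
    r_macro = sum(r for _, r in per_class) / n
    return p_macro, r_macro


def harmonic_mean(p, r):
    """F-score of a precision/recall pair."""
    return 2 * p * r / (p + r)


# (precision, recall) per class, copied from Table 4.
TABLE_4 = [(0.887, 0.840), (0.964, 0.768), (0.789, 0.882), (0.818, 0.818)]

p_macro, r_macro = macro_average(TABLE_4)
f_macro = harmonic_mean(p_macro, r_macro)

# Micro-averaged F from the reported P_micro and R_micro.
f_micro = harmonic_mean(0.886, 0.82)

print(f"P_macro={p_macro:.4f} R_macro={r_macro:.4f} "
      f"F_macro={f_macro:.4f} F_micro={f_micro:.4f}")
```

Run on the Table 4 values, the sketch gives macro-averages consistent with the reported 86.5%, 82.7% and 84.5%, and an F_micro consistent with the reported 85.2%.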

All the systems we reviewed support either binary (positive and negative) or ternary (positive, negative and neutral) classes, whereas ours supports four categories. A direct comparison would therefore be unfair; however, for the sake of completeness we list the performance of the systems we reviewed that explicitly support Arabic. Where the authors used both SVM and NB classifiers, we list the performance of the NB classifier, as this is the one we used. Some authors also report measures for each class separately, for example one precision for the positive class and another for the negative class. To simplify the comparison, we use macro-averaged measures. Micro-averaged measures could not be computed, as they require information that is not available. Table 5 summarizes the performance of the different systems.

Table 5.

Performance summary of different systems. Blank entries mark information unavailable. Note that all the systems (except ours) handle either two or three classes. All the performance measures are taken from the literature

	Acc	$P_{macro}$	$R_{macro}$	$F_{macro}$
Our system	82%	86.5%	82.7%	84.5%
[10]	74.05%	74.05%	74.55%	74.3%
[14]		60.5%^a	59%	59.74%
[16]	81.36%			83.21%^b
[19]	89%	85.25%	94.8%	89.77%
[20]		90%	15.3%	26.15%

^a This is the best precision the author reported.

^b This is the average of the F-scores for the positive and negative classes.

From Table 5 we can say that the only system that has a better accuracy than ours is that by Rushdi-Saleh et al. [19], although our system has a slightly better precision. There are two reasons why it has a better performance: fewer classes and a smaller vocabulary. It is worth noting that the aforementioned system was trained to handle movie reviews as either good or bad, whereas our system had to classify the comments into one of four classes. Unlike opinions expressed in comments to any news genre, reviews of movies tend to have a limited vocabulary. That may explain why it has a better accuracy. Given that our system handles four classes, we believe its performance is very competitive.

5. Conclusion and future work

Aara’ was developed to extract the polarity of Saudi public opinion automatically using a naive Bayes classifier and the revised bigram approach. As input, the system uses the comments that readers write anonymously in the online editions of local newspapers. Our system classifies each comment into one of four categories: strongly positive, positive, negative and strongly negative. This is the first system for Arabic that classifies text into more than three classes. We trained the system on Nejdi Arabic (the colloquial variety mostly used in the central Arabian Peninsula). Among the challenges was handling unstructured text written in local slang with plenty of typos. On the test data the system achieved an accuracy of 82%. The precision ranged between 78.9% (for strongly positive comments) and 96.4% (for negative comments), while the recall ranged between 76.8% (negative comments) and 88.2% (strongly positive comments). The corresponding F-score was 85.2% (micro-averaged) and 84.5% (macro-averaged). Our system is flexible and can easily be adapted to work with any set of Arabic comments.

For future work, we need to handle negation terms properly, which should improve the accuracy of the system. We also need to filter out comments that are irrelevant to the news story, classifying only those that explicitly express an opinion on what is in the article. A long-term goal is to output a brief summary of what the readers are saying in their comments. Using the archives of news stories and their associated comments, we can track shifts in public sentiment over time pertaining to certain events.

Footnotes

Funding

This work was supported by a special fund in the Research Centre of the College of Computer and Information Sciences at King Saud University.

References

1. Al-Azami. The History of the Qur’anic Text: From revelation to compilation, 2nd edn. Sherwood Park, Alberta: Al-Qalam Publishing, 2011, pp. 123–129.

2. Farghaly, Shaalan. Arabic natural language processing: challenges and solutions. ACM Transactions on Asian Language Information Processing 2009; 8(4): 1–22.

3. Azmi, Almajed. A survey of automatic Arabic diacritization techniques. Natural Language Engineering 2013, doi: 10.1017/S1351324913000284

4. CIA. World fact book. Washington DC: Central Intelligence Agency, 2008.

5. Habash, Rambow. MAGEAD: A morphological analyzer and generator for the Arabic dialects. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the Association for Computational Linguistics, Sydney, 2006, pp. 681–688.

6. Mustafa, AbdAlla, Suleman. Current Approaches in Arabic IR: A survey. In: The 11th international conference on Asia-Pacific Digital Libraries (ICADL 2008), Bali, 2008.

7. Pang, Lee, Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings conference empirical methods in natural language processing (EMNLP), 2002, pp. 79–86.

8. Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of Association for Computational Linguistics, 2002, pp. 417–424.

9. Abu-Jbara, King, Diab, Radev. Identifying opinion subgroups in Arabic online discussions. In: Proceedings of Association for Computational Linguistics (short paper), 2013.

10. Hamouda, Akaichi. Social networks’ text mining for sentiment classification: The case of Facebook’ statuses updates in the ‘Arabic Spring’ era. International Journal Application or Innovation in Engineering and Management 2013; 2(5): 470–478.

11. Mohammad, Kiritchenko, Zhu. NRC-Canada: Building the state-of-the-art in sentiment analysis of Tweets. In: The 7th International Workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, GA, June 2013.

12. Narr, Hulfenhaus, Albayrak. Language-independent Twitter sentiment analysis. In: KDML workshop on knowledge discovery, data mining and machine learning 2012 (KDML-2012), Dortmund, September 2012.

13. Elarnaoty, AbdelRahman, Fahmy. A machine learning approach for opinion holder extraction in Arabic language. International Journal Artificial Intelligence and Applications (IJAIA) 2012; 3(2): 45–63.

14. Al-Subaihin. Sentiment analysis of modern Arabic in new media (using human-based computing). Unpublished MSc Project Report. King Saud University, Riyadh, Saudi Arabia, 2012.

15. Al-Subaihin, Al-Khalifa, Al-Salman. A proposed sentiment analysis tool for modern Arabic using human computation. In: The 13th international conference information integration and web-based app and services (iiWAS2011), December, 2011.

16. Abdul-Mageed, Kubler, Diab. SAMAR: A system for subjectivity and sentiment analysis of Arabic social media. In: Proceedings of the 3rd Workshop Comput Approach Subjectivity Sentiment Analysis, Jeju, Republic of Korea, July 2012, pp. 19–28.

17. Danowski. Sentiment network analysis of Taleban and RFE/RL open-source content about Afghanistan. In: 2012 European intelligence and security informatics conference, 2012; doi: 10.1109/EISIC.2012.54

18. Glass, Colbaugh. Estimating the sentiment of social media content for security informatics applications. Security Informatics 2012; 1(3); doi: 10.1186/2190-8532-1-3

19. Rushdi-Saleh, Martin-Valdivia, Urena-Lopez, Perea-Ortega. OCA: Opinion corpus for Arabic. Journal of the American Society for Information Science and Technology 2011; 62(10): 2045–2054.

20. Almas, Ahmad. A note on extracting ‘sentiments’ in financial news in English, Arabic and Urdu. In: The Second Workshop on Computational Approaches to Arabic Script-based Languages, LSA 2007, Linguistic Institute, Stanford University, July 2007.

21. Silva, Carvalho, Sarmento, Oliveira, Magalhães. The design of OPTIMISM, an opinion mining system for Portuguese politics. In: New trends in artificial intelligence: Proceedings of EPIA 2009 – 14th Portuguese conference on artificial intelligence, Universidade de Aveiro, 2009, pp. 565–576.

22. Froelich, Ananyan, Olson. The use of text mining to analyze public input. White paper Megaputer Intelligence, 2008.

23. Ahmad, Nürnberger. N-Grams conflation approach for Arabic text processing. In: Proceedings International Workshop on Improving Non English Web Searching (iNEWS ‘07), Amsterdam, Netherlands, 2007, pp. 39–46.

24. Ahmad, Nürnberger. Evaluation of N-gram conflation approaches for Arabic text retrieval. Journal American Society for Information Science and Technology 2009; 60(7): 1448–1465.

25. Khaltar B-O, Fujii, Ishikawa. Extracting loanwords from Mongolian corpora and producing a Japanese–Mongolian bilingual dictionary. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of ACL, Sydney, 2006, pp. 657–664.

26. Ahmad, De Luca, Nürnberger. Revised N-gram based automatic spelling correction tool to improve retrieval effectiveness. Research Journal Computer Science and Computer Engineering with Applications (Polibits) 2009; 40: 39–48.

27. Han, Kamber, Pei. Data mining: Concepts and techniques, 3rd edn. Morgan Kaufmann: Waltham, MA, 2012, pp. 364–368.

28. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys 2002; 34(1): 1–47.