Abstract
Sentiment analysis (SA) aims to extract users’ opinions automatically from their posts and comments. Almost all prior works have used machine learning algorithms. Recently, deep learning approaches have shown promising performance in SA research. However, deep learning is data-hungry and requires large datasets to learn, so more time is needed for data annotation. In this research, we propose a semiautomatic approach using Naïve Bayes (NB) to annotate a new dataset in order to reduce the human effort and time spent on the annotation process. We created a dataset for training and testing the NB classifier by collecting Saudi dialect tweets; the classifier achieved an accuracy of 83%. The trained semiautomatic model was then used to annotate a new dataset, which was in turn used to train and test deep learning classifiers for Saudi dialect SA. The three deep learning classifiers tested in this research were convolutional neural network (CNN), long short-term memory (LSTM) and bidirectional long short-term memory (Bi-LSTM). Support vector machine (SVM) was used as the baseline for comparison. Overall, the performance of the deep learning classifiers exceeded that of SVM: CNN reported the highest performance, followed by Bi-LSTM and then LSTM. The proposed semiautomatic annotation approach is usable and promising for speeding up the annotation process and saving time and effort.
1. Introduction
The Arabic language has been ranked as the fourth most frequently used language on the Internet [1]. It is also widely used in many communities; it is the official language of 22 countries, with more than 400 million speakers worldwide [2]. Nowadays, social media, such as Twitter and Facebook, are popular tools used by people to express their opinions. However, there seems to be a shortage of research relative to Arabic sentiment analysis (ASA) compared with sentiment analysis (SA) on other languages, such as English. SA offers a method to analyse people’s comments and reviews on several topics, which are available on websites and social media networks, to understand their attitudes and perceptions. It aims to automatically classify users’ comments, tweets, or reviews into, for example, positive, negative, or neutral. It plays an essential role in extracting meaningful knowledge from people’s reviews or comments to determine their opinions about different events, products or topics [3].
Compared with English SA, ASA faces many challenges. Arabic has a complicated morphology [4–6], and its characteristic features, mainly its derivations and inflections, make morphological analysis difficult. This morphological richness also makes Arabic one of the most interesting languages to investigate. Similar to other languages, Arabic has different dialects spoken in different regions and countries throughout the Arab world. There are many morphological differences between modern standard Arabic (MSA) and Arabic dialects. MSA is used in education, news and formal events, to name a few. For almost all Arabs, it is not the native language; their native tongues are dialects that vary from one community to another and from one geographical region and country to another. Arabic dialects use some slang words and expressions that do not have equivalents in MSA. In both MSA and Arabic dialects, a word can have meanings or connotations that differ from one linguistic variety to another [7]. Each dialect has its own vocabulary, grammatical rules and morphology [8], and the differences between MSA and the dialects affect morphology, word order and vocabulary [9]. Although the majority of Arabic native speakers use Arabic dialects, the literature focuses on MSA rather than Arabic dialects.
Few SA studies have been conducted on Arabic compared with other languages, such as English [10]. Almost all prior works have used machine learning algorithms. Recently, the deep learning approach has shown promising performance for SA on English [11], Thai [12], Persian [13] and Tamil [14]. However, deep learning is data-hungry and requires large datasets to learn, so data annotation takes a long time. Annotation is a highly important task, usually done manually, that provides the SA classifier with the examples it is trained on in order to reach reasonable accuracy. There is a need for a tool to help in the annotation process, as manual annotation consumes time and effort, especially for larger datasets. Annotation is also challenging because it requires human annotators with particular characteristics, such as dialect expertise and the time and effort needed to complete the process manually and accurately. Annotators therefore cannot be recruited at random, because the quality of their work directly affects the output of the classifiers.
In this study, we propose a semiautomatic annotation model to annotate Saudi dialect tweets using Naïve Bayes (NB). Saudi Twitter users are the most active among more than 2.6 million users in Arab countries [15], so Twitter is a significant source of data in our study. The Saudi tweets generated per month constitute the highest volume in Arab countries, and the average number of Saudi tweets generated per day is around 8.8 million [15]. This study focuses on Saudi dialect text on Twitter for SA purposes. Twitter allows its users to have nicknames and use anonymous names so that they can hide their real identities and express their opinions freely in short sentences. This article describes the semiautomatic approach we used to annotate the data. The main goal of the proposed annotation model is to reduce the effort and time that humans usually spend on manual annotation. The model classifies the new dataset into positive, negative and neutral, and it asks the annotator to confirm the classification result by accepting or modifying it. The annotated data produced by the semiautomatic model are used to train and test the following deep learning classifiers: convolutional neural network (CNN), long short-term memory (LSTM) and bidirectional long short-term memory (Bi-LSTM). The support vector machine (SVM) is also used as a baseline for comparison. This study reports and discusses the results of applying these classifiers to Saudi dialect SA.
The rest of the article is organised as follows. Section 2 summarises related work, and Section 3 introduces the collection and preparation of the Saudi dialect dataset. Section 4 presents the proposed semiautomatic annotation model, and Section 5 provides the experimental study and the discussion of the results. Section 6 concludes the article and presents an overview of our future work.
2. Related work
Studies of semiautomatic annotation have produced tools that allow several users to annotate data with the corresponding dialect, part of speech and morphosyntactic information. Benajiba and Diab [16] presented a web-based tool for the semiautomatic annotation of four Arabic dialects: Egyptian, Iraqi, Levantine and Moroccan. Their model used a graphical user interface (GUI) that allowed many annotators to annotate the data. Alshutayri and Atwell [17] implemented an online game tailored to dialectal annotation. They used a GUI to annotate a given text as one of the following dialects: Gulf, Iraqi, Egyptian, Levantine and North African. Alosaimy and Atwell [18] proposed another web-based semiautomatic model that targeted the morphosyntactic annotation of Arabic. Samih et al. [19] proposed a similar web-based model for the part-of-speech annotation of the Moroccan dialect. These studies show the role that web-based tools play in enabling people to perform manual annotation.
Machine learning classifiers, such as SVM and NB, have shown interesting results with ASA. Duwairi [20] classified 100 Arabic documents using NB, k-nearest-neighbours (KNN) and distance-based classifiers and showed that NB outperformed the other classifiers. Al-Kabi and Al-Sinjilawi [21] carried out a comparative study on the performance of six methods to classify Arabic documents; their experimental results show that NB outperformed the other methods. Azmi and Alzanin [22] applied an NB classifier to 815 comments collected from Saudi e-newspapers and annotated manually into four classes: strongly positive, positive, negative and strongly negative. Their experiments reached 85% accuracy. Duwairi and El-Orfali [23] applied three machine learning classifiers. They collected 164 positive and 136 negative reviews, and they also used a benchmark dataset of MSA [24], which has 500 positive and negative movie reviews. In their research, NB achieved the highest accuracy of 96.6%. In contrast, Joachims [25] applied five machine learning classifiers on two different datasets. The first dataset consisted of 9603 training and 3299 testing documents. The second one consisted of 10,000 training and 10,000 testing documents. The author’s experimental results showed that SVM achieved the best performance compared with the four other classifiers. Likewise, Rushdi-Saleh et al. [24] created a balanced Arabic dataset of 500 movie reviews (250 positive and 250 negative) collected from several web pages and blogs. By applying NB and SVM, they concluded that SVM outperformed NB with an accuracy of 90%. Shoukry and Rafea [26] also applied NB and SVM to a dataset of Egyptian dialect and MSA, which consisted of 500 positive and 500 negative tweets. Their final results showed that SVM achieved the highest accuracy with more than 72%. Abdulla et al. [27] applied lexicon-based and some machine learning classifiers. They created a dataset of 2000 Arabic tweets written in MSA and the Jordanian dialect, which were annotated manually with positive or negative labels. They showed that SVM had the highest accuracy of 87.2% among the lexicon-based and other machine learning classifiers. Hassan et al. [28] applied SA using machine learning classifiers, such as SVM, NB and logistic regression. The dataset consisted of 6388 tweets in English and other languages. They concluded that SVM outperformed the other classifiers with an accuracy of 89%. These experimental results pave the way for promising ASA in which machine learning is applied to reach the best accuracy possible.
Studies of SA are well documented; it is also well acknowledged that the application of the deep learning approach brought about breakthrough performance that exceeded that of baseline machine learning classifiers. For instance, Kim [11] introduced a CNN classifier using seven English benchmark datasets and recorded a performance accuracy of 89.6%. Similarly, Vateekul and Koomsubha [12] proposed the first Thai SA of deep learning applied on Twitter data. They extracted 11,000 positive and 11,000 negative tweets to perform the study. They proposed the application of two classifiers: LSTM and CNN. The results showed that the two classifiers outperformed various well-known machine learning classifiers with an accuracy of 75.35%. Likewise, Dahou et al. [29] applied CNN on five Arabic benchmark datasets that had MSA and dialectal Arabic. These benchmark datasets were composed of large Arabic book reviews (LABR), which had 63,000 book reviews [30]; Arabic sentiment tweets dataset (ASTD), which had 10,000 tweets [31]; the gold-standard (GS) dataset, which had 8868 tweets [32]; ArTwitter, which had 2000 tweets [27]; and a dataset created by ElSahar and El-Beltagy, which had 33 K reviews [33]. The obtained results showed a remarkable performance with an accuracy of 91.7%. In a similar way, Al-Azani and El-Alfy [34] explored CNN and LSTM in terms of applying ASA on Arabic benchmark datasets (i.e. ASTD and ArTwitter). The experimental results concluded that LSTM outperformed CNN with an accuracy of 87.27%. Oussous et al. [35] applied CNN and LSTM to test their capability on ASA. They utilised two Arabic datasets consisting of Arabic and dialectal Arabic, which were annotated into two sets: 1000 positive and 1000 negative. They applied different machine learning classifiers for the sake of comparison, such as NB, SVM and maximum entropy (ME). 
They reported that deep learning classifiers outperformed machine learning classifiers, and CNN achieved better performance than LSTM with an accuracy of 96%. Overall, deep learning studies have yielded results that outperform those of baseline machine learning classifiers when SA is applied to large datasets. Yet manual annotation remains a challenging task, as it requires human expertise, effort and time to be accomplished efficiently. Thus, there is a need for efficient annotation tools to create the large datasets that deep learning approaches require.
This research aims to use the baseline machine learning algorithm, namely NB, in order to apply the semiautomatic model for the purpose of annotating Saudi tweets into positive, negative or neutral. Several thresholds are tested to annotate the new dataset using the semiautomatic model. In addition, as a way to apply SA through deep learning classifiers, we utilised the produced datasets across the different thresholds. We ultimately used the SVM classifier to compare its performance with the deep learning classifiers.
3. Saudi dialect dataset
3.1. Dataset collection and preprocessing
Twitter was the source of our data. We collected Saudi dialect tweets in different domains, such as sports, politics and social issues, using the Twitter API. Preprocessing took place after data collection: we removed user mentions, hashtags, URLs, emojis, symbols, numbers, punctuation marks, non-Arabic letters, diacritics and repeated letters. We also normalised the data by converting variant forms of some letters to a single form. Table 1 shows a sample of the data deleted from the tweets in the preprocessing steps.
Examples of data removed from the collected tweets.
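The preprocessing steps listed in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the normalisation map (alef variants to bare alef, teh marbuta to heh, alef maksura to yeh) is a convention commonly seen in Arabic NLP and is assumed here, the rule of collapsing runs of three or more identical characters is likewise an assumption, and the function name `preprocess` is hypothetical.

```python
import re

# ASSUMED normalisation map (a common Arabic NLP convention, not confirmed by the study).
NORMALISE = {
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # alef maksura -> yeh
}

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\S+")
# Short-vowel marks, dagger alef and tatweel; deleted outright so words are not split.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Anything that is neither a basic Arabic letter nor whitespace
# (digits, punctuation, emojis, Latin letters, symbols).
NON_ARABIC = re.compile(r"[^\u0621-\u063A\u0641-\u064A\s]")
# ASSUMPTION: a run of three or more identical characters is an elongation.
REPEATS = re.compile(r"(.)\1{2,}")


def preprocess(tweet: str) -> str:
    """Clean one tweet following the removal and normalisation steps described above."""
    text = URL.sub(" ", tweet)
    text = MENTION_OR_HASHTAG.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    text = REPEATS.sub(r"\1", text)
    text = "".join(NORMALISE.get(ch, ch) for ch in text)
    return " ".join(text.split())  # collapse leftover whitespace
```

The order of the steps matters in this sketch: URLs, mentions and hashtags are removed while their Latin characters and symbols are still present, and diacritics are deleted before the non-Arabic filter so that a vowel mark inside a word does not turn into a space.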
3.2. Manual annotation
The main goal here is to assign each tweet manually to one sentiment class. The annotator assigned each tweet to one of four classes: positive, negative, neutral or confused. The fourth class was added so that confusing tweets could be identified; these tweets were eliminated from the dataset at the end of the annotation process. It took three native Saudi speakers about 7 months to complete the manual annotation of 11,425 Saudi tweets, which were categorised as positive (4479), negative (3192) and neutral (3754). Table 2 shows samples of the collected dataset.
Samples of positive, negative, and neutral tweets from the Saudi dataset.
4. A Semiautomatic annotation approach
Annotation is a highly important task that is usually done manually to provide the SA classifier with the examples it is trained on in order to reach reasonable accuracy. As shown in the previous section, after data collection and preprocessing, we spent around 7 months manually annotating tweets. As an alternative, this section proposes a semiautomatic approach that uses a machine learning algorithm to annotate Saudi tweets as positive, negative or neutral. The annotation process is illustrated in Figure 1. The main goal of the proposed annotation model is to reduce human effort and time in the annotation process. Research on SA has recently increased because of the ever-increasing need for automatic SA tools, and machine learning classifiers have recorded strong results in the literature, as mentioned in Section 2. Therefore, the proposed model uses the machine learning approach to perform the annotation process.

The proposed semiautomatic annotation model.
As illustrated in Figure 1, new tweets are the input to the proposed model, and it is assumed that these tweets have been cleaned and preprocessed for further analysis. Once a new tweet is received by the automatic annotator component, it is classified by a machine learning classifier into one of three classes: positive, negative or neutral. The assigned class is accepted (i.e. pass, as shown in the figure) if its probability score exceeds a certain threshold, and the tweet is added to the annotated tweets. If the probability score is less than the threshold (i.e. fail, as shown in the figure), the tweet is transferred to a human annotator for manual annotation and is then added to the annotated tweets. Thus, at the end of the cycle, the set of annotated tweets is a combination of manual and automatic annotations (i.e. the annotation process is semiautomatic). Several thresholds are tested in this research. The annotated dataset is then used as input to other classifiers, such as deep learning classifiers, to apply SA, as shown in Figure 1.
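The pass/fail routing described above can be sketched as follows. The names `route`, `annotate`, `classify` and `ask_human` are hypothetical, and the sketch assumes probabilities on a 0 to 1 scale (so a threshold written as 70 in the text corresponds to 0.70) and treats a score equal to the threshold as a pass.

```python
from typing import Callable, Dict, List, Tuple


def route(probabilities: Dict[str, float], threshold: float) -> Tuple[str, bool]:
    """Return the most probable class and whether it passes the threshold."""
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label] >= threshold


def annotate(
    tweets: List[str],
    classify: Callable[[str], Dict[str, float]],  # hypothetical: tweet -> class probabilities
    threshold: float,
    ask_human: Callable[[str, str], str],  # hypothetical: (tweet, suggested label) -> final label
) -> Tuple[List[Tuple[str, str]], int]:
    """Annotate tweets semiautomatically; also count how many needed a human."""
    annotated = []
    manual = 0
    for tweet in tweets:
        label, accepted = route(classify(tweet), threshold)
        if not accepted:
            # Fail branch: the annotator accepts or modifies the suggested class.
            label = ask_human(tweet, label)
            manual += 1
        annotated.append((tweet, label))
    return annotated, manual
```

Returning the number of manually handled tweets alongside the annotations makes it straightforward to report the manual effort for a given threshold.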
In this work, an NB classifier is adopted to automatically annotate 3495 tweets. NB is used because of its simplicity and because it applies Bayes’ theorem to assign a probability score to each class. It calculates the posterior probability of each class from the conditional probabilities of the features in the text [36]. The class assigned by the NB classifier is accepted, and the tweet is added to the annotated tweets, provided that its probability score exceeds a certain threshold. If the probability score is less than the threshold, the tweet is transferred to a human for manual annotation. The main feature of this model is therefore that the human annotator is asked to confirm only the results whose probability is less than the threshold, whereas results whose probability is greater than the threshold are accepted automatically without human annotation.
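To make the probability score concrete, the following is a minimal sketch of a multinomial NB classifier that returns a posterior probability for each class. The text does not state which NB variant, features or smoothing were used, so the whitespace tokenisation, the add-one (Laplace) smoothing and the class name `TinyNaiveBayes` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens (illustrative assumption)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, doc):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.split():
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)  # log likelihood
            log_scores[label] = score
        # Normalise in a numerically stable way to obtain posterior probabilities.
        top = max(log_scores.values())
        unnormalised = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(unnormalised.values())
        return {k: v / z for k, v in unnormalised.items()}
```

The largest value in the returned dictionary is the score that would be compared against the annotation threshold.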
5. Experimental study
5.1. Evaluation measures
Two measures are used to evaluate the performance of our proposed model: effort and accuracy. Effort measures the time spent on manual annotation. For the sake of simplicity, it is assumed that the time of annotation for each tweet is equal, so the time spent for manual annotation is directly proportional to the number of tweets. In this experiment, effort is measured as

$$\text{Effort} = \frac{m}{N} \times 100\% \qquad (1)$$
where m is the number of tweets rejected by the NB classifier because of their probability scores being less than the given threshold (i.e. fail as in Figure 1), and N is the total number of tweets, namely, 3495 in our experiments. It is important to emphasise that choosing a certain value for the threshold will directly affect the value of m and thus the required effort of manual annotation. A high value of the threshold will end up with a high value of m (i.e. a large number of tweets requiring manual annotation) and vice versa.
The second metric is the accuracy, determined by the number of correct classifications made by the NB classifier. It is calculated as

$$\text{Accuracy}_{\text{NB}} = \frac{r}{n} \times 100\% \qquad (2)$$
where r is the number of tweets correctly classified by the NB classifier, and n is the total number of tweets that have passed the given threshold, as illustrated in Figure 1 (these tweets are a subset of the N tweets, so n ≤ N = 3495). Note that equation (2) performs a similar task to equation (3), but it works only on the tweets that passed the given threshold. In equation (2), r is equivalent to TP + TN, and n is equivalent to TP + FP + TN + FN. The performance of the other classifiers (SVM and deep learning) is calculated as

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100\% \qquad (3)$$
where TP refers to the total of correct positive classifications, TN refers to the total of correct negative classifications, FP refers to the total of incorrect positive classifications and FN refers to the total of incorrect negative classifications.
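As a quick check on the definitions in this subsection, the three measures can be written as small Python functions. The function names are hypothetical, and the results are expressed as percentages on the assumption that this matches how effort and accuracy are reported in the experiments.

```python
def effort(m: int, N: int) -> float:
    """Share of tweets sent to manual annotation, in percent."""
    return 100.0 * m / N


def annotation_accuracy(r: int, n: int) -> float:
    """Accuracy of the NB classifier on the n tweets that passed the threshold, in percent."""
    if n == 0:
        raise ValueError("no tweets passed the threshold")
    return 100.0 * r / n


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy from confusion-matrix counts, in percent."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


# Roughly 30 of the 3495 tweets going to manual annotation gives an effort just under 1%.
print(round(effort(30, 3495), 2))
```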
5.2. Experiments and results
This section presents the experiments performed in this research. The first experiment was our previous research that recorded a promising result in Saudi dialect SA using a deep learning approach. The second experiment shows the semiautomatic annotation model proposed in this research. The third experiment applies three deep learning classifiers and SVM to perform SA using the resulting dataset from the semiautomatic model. This section also shows the analysis and discussion of the results.
5.2.1. Experiment 1
In the first experiment [37], we collected around 60,000 Saudi dialect tweets in different domains using the Twitter API, from which we extracted 32,063 tweets to create the dataset. That research described the dataset collection, annotation and preprocessing. The collected dataset was manually annotated. At the beginning of this process, we removed duplicate tweets. Then, we conducted data annotation using crowdsourcing to classify the tweets into two sets consisting of 17,707 positive and 14,356 negative tweets. Preprocessing took place by removing user mentions, hashtags, URLs, emojis, symbols, numbers, punctuation marks, non-Arabic letters, diacritics and repeated letters. We also normalised the data by converting variant forms of some letters to a single form. We then applied the continuous bag-of-words (CBOW) model proposed by Mikolov et al. [38] to the dataset to learn the vector representations of the words in an unsupervised way. CBOW learns the vector representation of a word by predicting it from the surrounding words in the training sentences [39]. Finally, we applied SA using the deep learning approach.
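In the CBOW formulation of Mikolov et al., the network is trained to predict a centre word from the words around it. The sketch below only shows how such (context, target) training examples can be built from a tokenised sentence; it does not train any vectors. The function name `cbow_pairs` and the window size of 2 are assumptions for illustration, since the text does not report the window used.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, target word) training examples for CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # a one-word sentence has no context to learn from
            pairs.append((context, target))
    return pairs
```

For the tokens `["a", "b", "c"]` and `window=1`, the examples are `(["b"], "a")`, `(["a", "c"], "b")` and `(["b"], "c")`.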
We applied two deep learning classifiers, LSTM and Bi-LSTM, to explore and investigate the capability of deep learning to achieve high performance in Saudi dialect SA using the dataset of 32,063 tweets. The results showed that the deep learning classifiers (Bi-LSTM and LSTM) achieved better accuracy than the well-known SVM machine learning classifier, which was used for the sake of comparison, as shown in Figure 2. In this experiment, the annotation process was a challenging task, because finding native speakers who have the time to annotate large datasets accurately is not always easy. This motivated us to look for an easy, fast and accurate way to accomplish it. We came up with the idea of a semiautomatic annotation model, in which a model trained on an already annotated dataset helps to annotate new data without having to perform the annotation manually from scratch.

Accuracy of the classifiers: LSTM, Bi-LSTM and SVM.
5.2.2. Experiment 2
The NB classifier was trained on an original dataset that was collected, preprocessed and manually annotated using the steps mentioned above in Section 3. The performance of the trained NB classifier using 10-fold cross validation is presented in Table 3. As presented in Section 4, we applied NB to automatically annotate the new tweets.
Performance of the NB.
NB: Naïve Bayes.
There is a need for a new dataset in order to apply the semiautomatic model for annotating the new data. The experiments in this section are carried out on a new dataset consisting of 3495 Saudi tweets (i.e. 1012 positive, 823 negative and 1660 neutral). After the training and testing of the NB classifier using the original dataset, we fed the model with the new dataset. The semiautomatic model automatically classified tweets whose probability scores were higher than the threshold into positive, negative or neutral. Tweets that had probability scores less than the threshold were transferred to the annotator to accept or modify their classes. In these experiments, we examined the effect of several values of the threshold on annotation accuracy and manual effort. The results are shown in Figure 3.

Effect of different threshold values on annotation accuracy and manual effort.
As shown in Figure 3, there is a direct relationship between the threshold, the achieved accuracy and the required effort: the higher the threshold, the greater the accuracy and the effort, and vice versa. In other words, there is a compromise between the effort (i.e. manual annotation) and the achieved accuracy. In our experiment, the threshold represents the minimum probability at which a classification made by NB is accepted. Annotation is fully automated when n equals N, which means that all the classified tweets passed the threshold; in this case, no human effort is required. This is nearly reached by choosing a low threshold, such as 40, which reduced the required effort to 1%: all the tweets were annotated automatically, except for around 30 that were transferred for manual annotation. However, although a low threshold reduces the required effort, it injects inaccurately annotated tweets into the dataset (i.e. 34% of the tweets were misclassified when the threshold was 40). In our proposed solution, the trade-off between accuracy and effort is governed by the value of the threshold. A manual effort of 25% leads to an annotation accuracy of 74% when the threshold is 60. Raising the threshold to 70 doubles the manual effort from 25% to 50% (i.e. an additional 25% of effort) but improves the accuracy only slightly, to 76%. At a threshold of 85, 77% of effort raises the annotation accuracy to 86%.
Overall, a compromise between the accuracy of automatic annotation and the effort for manual annotation is required. In some cases, we may sacrifice accuracy to save time and effort. This may happen when we lack sufficient resources for annotation (i.e. few human annotators) or simply because we do not have enough time to do the manual annotation.
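The trade-off discussed above can be explored by sweeping the threshold over the classifier's outputs on a labelled set. The sketch below assumes each prediction is available as a `(true_label, predicted_label, score)` triple with the score in percent, and that a score equal to the threshold passes; the function name `sweep` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def sweep(
    predictions: List[Tuple[str, str, float]],
    thresholds: List[float],
) -> List[Dict[str, Optional[float]]]:
    """For each threshold, report manual effort and accuracy of the accepted tweets."""
    N = len(predictions)
    rows = []
    for t in thresholds:
        passed = [(true, pred) for true, pred, score in predictions if score >= t]
        m = N - len(passed)  # tweets transferred to the human annotator
        r = sum(1 for true, pred in passed if true == pred)
        rows.append({
            "threshold": t,
            "effort": 100.0 * m / N,
            "accuracy": 100.0 * r / len(passed) if passed else None,
        })
    return rows
```

Plotting the `effort` and `accuracy` columns against `threshold` would give a chart of the same kind as Figure 3, from which a threshold can be chosen to suit the available annotation time.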
5.2.3. Experiment 3
This experiment aims to explore and investigate the capability of deep learning to enhance Saudi dialect SA. It provides an empirical comparison between deep learning classifiers in order to find the best classifier for Saudi dialect SA. Saudi dialect SA was therefore applied using the datasets produced by the semiautomatic annotation model explained in Figure 1, across different thresholds, with SVM as a baseline classifier against which the deep learning classifiers were evaluated. We applied the semiautomatic model to annotate the new dataset consisting of 3495 Saudi dialect tweets across different thresholds. The model automatically annotated the tweets whose annotation probability was higher than or equal to the threshold, and the remaining tweets were transferred to manual annotation because their annotation probability was less than the threshold. The dataset was then transformed into vectors using the Word2vec model [38] to feed the deep learning classifiers. Recent SA research has used learned word vectors as features for deep learning without feature engineering [40], and combining a deep learning classifier with a Word2vec neural network model has been reported to increase the performance of the classifier [41]. Accordingly, we applied Word2vec to learn the vector representations of the words.
The experiment conducted in this section utilised two families of deep learning classifiers: recurrent neural network (RNN) and CNN. Recently, both have shown great performance in natural language processing (NLP) applications. An RNN takes a sequential input, where the current state depends on the previous state and the current input [42]. Two RNN variants are used here: LSTM [43] and Bi-LSTM [44]. LSTM was created as an improved RNN architecture developed to handle long-term dependencies [43,45]. However, the problem of vanishing gradients that occurs on RNN is because the input at

The architecture of LSTM and Bi-LSTM.
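For reference, a commonly used formulation of the LSTM cell (the variant with a forget gate) can be written as follows. The notation is an assumption introduced here, not taken from the figure: $x_t$ is the input word vector at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication, and $W$, $U$, $b$ the learned weights and biases.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

A Bi-LSTM runs one such cell over the sequence from left to right and a second one from right to left, and typically concatenates the two hidden states at each step, $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$, so that each position sees both its preceding and its following context.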
The CNN model has achieved state-of-the-art results in different applications. Initially, CNN recorded breakthrough results in image recognition [46] and handwriting recognition [47], and later in NLP tasks such as semantic parsing [48], search query retrieval [49] and SA, as shown in Section 2. A CNN has two main types of layer, convolution and pooling, with each convolution layer followed by a pooling layer. The convolution layer applies a filter to a fixed-size window of the input to extract features [11,41]. The pooling layer applies the maximum operation to the output of the preceding convolution layer. The output of the pooling layer is the input to the final fully connected layer, which determines the output class. Figure 5 shows the architecture of the CNN.

The convolutional neural network architecture [11].
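The convolution and max-pooling steps described above can be illustrated with a single filter in plain Python. This is a toy sketch rather than the Keras model used in the experiments: the function name `conv1d_max_pool`, the ReLU non-linearity and the example numbers are assumptions chosen for illustration.

```python
from typing import List, Tuple


def conv1d_max_pool(
    sentence: List[List[float]],  # one word vector per token
    filt: List[List[float]],      # h weight vectors: a filter spanning h consecutive words
    bias: float = 0.0,
) -> Tuple[float, List[float]]:
    """Slide one filter over the sentence, then keep the maximum feature."""
    h = len(filt)
    if len(sentence) < h:
        raise ValueError("sentence is shorter than the filter window")
    feature_map = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        z = bias + sum(
            w * x
            for weights, vector in zip(filt, window)
            for w, x in zip(weights, vector)
        )
        feature_map.append(max(0.0, z))  # ReLU (assumed non-linearity)
    return max(feature_map), feature_map  # max-over-time pooling


pooled, features = conv1d_max_pool(
    [[1, 0], [0, 1], [1, 1]],  # three tokens with 2-dimensional vectors
    [[1, 1], [1, -1]],         # a filter covering two consecutive tokens
)
```

In a full model, many filters of several window sizes would each produce one pooled value, and the resulting vector would feed the fully connected layer that outputs the class.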
The deep learning classifiers were built using Keras and TensorFlow. For the sake of comparison, we used SVM as the baseline to examine the performance of deep learning. In the experiment, we examined seven thresholds to test the effect of the threshold chosen for semiautomatic annotation on the achieved accuracy. We repeated the experiment using the following thresholds: 100% (i.e. manual annotation), 90%, 80%, 70%, 60%, 50% and 40%. Figure 6 shows the accuracies of the proposed classifiers, LSTM, Bi-LSTM, CNN and SVM, at each threshold. At the 100% threshold, LSTM, Bi-LSTM, CNN and SVM achieved 86%, 86.3%, 87% and 84.2%, respectively. At the 90% threshold, the accuracies decreased slightly for all the classifiers: 83.9% for LSTM, 84.01% for Bi-LSTM, 84.12% for CNN and 82.8% for SVM. At the 80% threshold, we observed a further slight decrease in accuracy for each classifier: 80% for LSTM, 80.12% for Bi-LSTM, 80.5% for CNN and 78.7% for SVM. At the 70% threshold, the accuracy decreased significantly for each classifier: 74% for LSTM, 75.3% for Bi-LSTM, 76% for CNN and 72.2% for SVM. This decrease may have been due to the quality of annotation, as some data were probably incorrectly annotated by the semiautomatic model. Another noticeable decrease in accuracy was observed at the 60% threshold: 67% for LSTM, 68% for Bi-LSTM, 68.5% for CNN and 65% for SVM. Although the decrease in accuracy at this threshold is significant, much of the manual annotation effort is saved. The accuracies at the 50% threshold were as follows: 63.6% for LSTM, 65% for Bi-LSTM, 65.5% for CNN and 62% for SVM. At the last threshold of 40%, the accuracies were the lowest for all the classifiers: 63% for LSTM, 64.3% for Bi-LSTM, 65% for CNN and 61% for SVM. Overall, the deep learning classifiers outperformed SVM, especially the CNN, which achieved the highest accuracy of 87%. The results also showed that the accuracies of all the classifiers decreased as the share of automatic annotation increased.

Accuracy of the classifiers at different annotation thresholds.
Figure 7 gives a clearer view of the effect of the selected threshold values on the classifiers’ performance. There was a sharp decrease in the accuracy of all the classifiers when we set the threshold to 70, and the same effect occurred at 60. After that, the decrease became similar in size to that of the transition from manual annotation to 90. For thresholds less than 80 (i.e. 70, 60, 50 and 40), the performance of almost all classifiers was affected in a similar way. In order to reach a compromise between accuracy and effort, our recommendation is to set the annotation threshold to at least 70. In doing so, almost half of the required effort will be saved (i.e. 49.7%, as shown in Figure 3), as the Saudi tweets will be annotated automatically through our proposed model with an acceptable level of accuracy (i.e. 75.64% for CNN, as shown in Figure 6). If the user wants a higher accuracy of annotation, the threshold could be set to 80 or 90. Several factors must be considered in setting the annotation threshold, such as the time available for manual annotation or the sensitivity of the application. If we use SA to evaluate some healthcare applications, we might require a higher level of annotation accuracy, but for the investigation of customers’ satisfaction with a new product, a level of accuracy around 70% may be acceptable.

Figure 7. Trend of the classifiers' accuracy over different values of annotation thresholds.
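The trend discussed above can be checked directly from the accuracies quoted in the text. The following is a minimal Python sketch, not part of the original experiments: it tabulates the reported accuracies, computes the accuracy lost at each step to the next lower threshold, and finds the lowest threshold that still meets a required accuracy. The helper names (`stepwise_drops`, `largest_drop`, `lowest_threshold_meeting`) are illustrative assumptions, and the CNN value at the 70% threshold is taken as 76%, as listed with the per-threshold results.

```python
# Accuracies (%) as quoted in the text, ordered by annotation threshold.
THRESHOLDS = [100, 90, 80, 70, 60, 50, 40]
ACCURACY = {
    "LSTM":    [86.0, 83.9, 80.0, 74.0, 67.0, 63.6, 63.0],
    "Bi-LSTM": [86.3, 84.01, 80.12, 75.3, 68.0, 65.0, 64.3],
    "CNN":     [87.0, 84.12, 80.5, 76.0, 68.5, 65.5, 65.0],
    "SVM":     [84.2, 82.8, 78.7, 72.2, 65.0, 62.0, 61.0],
}


def stepwise_drops(values):
    """Accuracy lost at each transition to the next lower threshold."""
    return [round(prev - curr, 2) for prev, curr in zip(values, values[1:])]


def largest_drop(name):
    """Return (from_threshold, to_threshold, drop) for the steepest single step."""
    drops = stepwise_drops(ACCURACY[name])
    i = max(range(len(drops)), key=drops.__getitem__)
    return THRESHOLDS[i], THRESHOLDS[i + 1], drops[i]


def lowest_threshold_meeting(name, min_accuracy):
    """Lowest threshold whose reported accuracy is at least min_accuracy, or None."""
    ok = [t for t, acc in zip(THRESHOLDS, ACCURACY[name]) if acc >= min_accuracy]
    return min(ok) if ok else None


if __name__ == "__main__":
    for name in ACCURACY:
        print(name, stepwise_drops(ACCURACY[name]), largest_drop(name))
    print("CNN, at least 75%:", lowest_threshold_meeting("CNN", 75.0))
```

Under these figures, the steepest single step for every classifier is the transition from the 70% to the 60% threshold, and 70% is the lowest threshold at which CNN stays above 75%, which is in line with the recommendation to keep the threshold at 70% or higher.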
As the results of the above-mentioned experiments indicate, Saudi dialect SA using deep learning classifiers outperformed the baseline machine learning classifier. CNN achieved the highest accuracies, whereas SVM obtained the lowest ones across the different thresholds. The accuracies achieved by Bi-LSTM were higher than those achieved by LSTM, which were in turn higher than those achieved by SVM. These results suggest that the approach is usable in practice.
To sum up, in this work, the proposed semiautomatic annotation model used the trained NB classifier to annotate the new dataset, as shown in Figure 1. The model is proposed for the purpose of minimising the human effort and time spent on manual annotation. Seven thresholds were tested to explore what accuracy the classifiers could reasonably achieve at each one: 100% (i.e. manual annotation), 90%, 80%, 70%, 60%, 50% and 40%. The resulting datasets at each threshold were used as input to the deep learning classifiers and SVM to perform Saudi dialect SA. As observed in Figures 6 and 7, the accuracy of every classifier was highest at the highest threshold and lowest at the lowest threshold. As expected, the best results were achieved at the 100% threshold: 86% for LSTM, 86.3% for Bi-LSTM, 87% for CNN and 84.2% for SVM. These experiments lead to the conclusion that Saudi dialect SA using deep learning classifiers provides promising results that are better than those of the baseline machine learning classifier. The final results indicate that CNN gave the greatest improvement in Saudi dialect SA, followed by Bi-LSTM and LSTM, respectively.
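The role of the annotation threshold can be illustrated with a short Python sketch. It assumes that the threshold is a cut-off on the class probability predicted by the NB classifier, which this section does not spell out: a tweet whose most probable label reaches the threshold is labelled automatically, and the rest are passed to human annotators. The function names and the toy probabilities are hypothetical and are not taken from the experiments.

```python
from typing import Dict, List, Tuple


def route_by_confidence(
    probabilities: List[Dict[str, float]], threshold: float
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Split items into auto-labelled (index, label) pairs and indices left for manual annotation.

    Assumption: `threshold` is a cut-off in [0, 1] on the classifier's top class probability.
    """
    # The paper treats the 100% threshold as fully manual annotation.
    if threshold >= 1.0:
        return [], list(range(len(probabilities)))

    auto: List[Tuple[int, str]] = []
    manual: List[int] = []
    for i, probs in enumerate(probabilities):
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            auto.append((i, label))   # confident enough: keep the predicted label
        else:
            manual.append(i)          # too uncertain: send to a human annotator
    return auto, manual


def effort_saved(n_auto: int, n_total: int) -> float:
    """Percentage of items that did not need manual annotation."""
    return 100.0 * n_auto / n_total if n_total else 0.0


if __name__ == "__main__":
    # Hypothetical class probabilities for four tweets.
    toy = [
        {"positive": 0.95, "negative": 0.05},
        {"positive": 0.55, "negative": 0.45},
        {"positive": 0.20, "negative": 0.80},
        {"positive": 0.72, "negative": 0.28},
    ]
    for t in (1.0, 0.9, 0.7):
        auto, manual = route_by_confidence(toy, t)
        print(t, auto, manual, effort_saved(len(auto), len(toy)))
```

Lowering the threshold moves more tweets into the automatically labelled set, which saves effort but admits less certain labels. This matches the trade-off between saved effort and classifier accuracy observed across the seven thresholds.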
6. Conclusion and future work
In this research, we applied the NB classifier to perform semiautomatic annotation. The main goal of the proposed annotation model is to minimise the human effort and time spent on manual annotation. We collected, preprocessed and manually annotated 11,425 Saudi tweets, which were then used to train and test the NB classifier. The trained NB classifier was then used to classify a new dataset consisting of 3,495 Saudi dialect tweets; seven threshold values were investigated: 100%, 90%, 80%, 70%, 60%, 50% and 40%. To perform SA, the deep learning classifiers (i.e. LSTM, Bi-LSTM and CNN) and SVM were trained and tested on the datasets produced by the semiautomatic model at every threshold. CNN outperformed the other classifiers at all seven thresholds, with the highest accuracy of 87% at the first threshold. SVM was applied as a baseline to determine how its performance differs from that of the deep learning classifiers; it proved to be the least accurate classifier across all thresholds. The semiautomatic approach presented in this work achieved promising and applicable results that can facilitate future annotation work.
For future work, we plan to further examine the proposed model using a larger new dataset. We also plan to implement a web-based annotation tool based on the proposed semiautomatic model, because it would allow many users to annotate the texts.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: The authors extend their appreciation to the Deanship of Scientific Research at King Saud University for funding this work.
