Abstract
Sentiment analysis has become an active research area in both academia and industry owing to the rapid growth in the number of reviews and recommendations published online. Several techniques have been proposed to analyse and classify these reviews and recommendations. Recently, deep neural networks have shown promising results in sentiment analysis. The growing number of Arab users on the Internet, along with the increasing volume of published Arabic reviews and comments, has encouraged researchers to apply deep learning to analyse them. This article presents a comprehensive overview of research works that have used the deep learning approach for Arabic sentiment analysis.
1. Introduction
Sentiment analysis is the area of study concerned with analysing opinions, emotions, evaluations or attitudes towards an entity such as an event, a product, a service, a news item, an individual, an organisation or an issue. This area is discussed in the literature under different names, for example, opinion mining, review mining, sentiment mining and subjectivity analysis [1]. Sentiment analysis primarily aims at detecting sentiments articulated in texts and classifying them as positive (favourable), negative (unfavourable) or neutral opinions towards a topic or an issue. It is also concerned with sarcasm detection and emotion analysis. Recently, organisations and companies have shown considerable interest in detecting and examining opinions within text documents such as web pages, news articles, comments, reviews and blogs instead of conducting surveys. Thus, sentiment analysis can offer enormous opportunities for various applications such as decision making, risk management, marketing analysis and detection of rumours [2].
This article focuses on sentiments expressed in the Arabic language because of the growing population of Internet users who use Arabic, estimated at about 5% of worldwide users [3]. In the last few years, Arabic has also been considered one of the fastest growing languages on the web. The morphological complexity and the lexical ambiguity of the Arabic language can pose a challenge when working with it. Another problem is that the Arabic language has three different varieties: classical Arabic (CA), modern standard Arabic (MSA) and dialectal Arabic (DA). Throughout this article, we aim to provide a review of the use of the deep learning (DL) approach to analyse sentiments expressed in Arabic text. The remainder of the article is organised as follows. Section 2 gives a concise background on the techniques applied for sentiment analysis, DL and word embedding. In Section 3, we explore the sentiment analysis task at its different levels and summarise the DL models that have been applied to Arabic sentiment analysis. Finally, the conclusion is presented in Section 4.
2. Background
The identification of positive and negative opinions is not an easy task; it requires information retrieval, linguistic knowledge, natural language processing and a profound comprehension of the textual context [4]. Various techniques have been used in the literature to solve the task of sentiment analysis, which can be classified into three classes: (1) machine learning (ML)-based, (2) lexicon-based and (3) hybrid techniques [5], as shown in Figure 1. Techniques based on ML utilise either traditional ML algorithms or DL algorithms. The traditional ML algorithms can be supervised using algorithms such as naive Bayes, Bayesian network, neural network, decision tree and support vector machine (SVM) [6–9] or unsupervised using clustering algorithms such as K-means [10]. Lexicon-based techniques use two approaches: the dictionary-based approach [11,12] and the corpus-based approach which uses either semantic [13] or statistical methods [14]. Hybrid techniques combine both ML algorithms and lexicon-based methods [15–18].

Figure 1. Sentiment classification techniques.
2.1. DL
Recently, DL has attracted considerable interest from researchers owing to its impressive performance in areas such as computer vision, genomics, speech and handwriting recognition, drug discovery and, more recently, natural language processing (NLP), including sentiment analysis applications [19]. DL is a subfield of ML that uses multilayer neural networks, in which each of the stacked layers learns representations of the data at a different level of abstraction.
The accuracy of both ML- and lexicon-based techniques depends heavily on manually crafted features, which can become obsolete over time. In contrast, DL models do not require a predefined set of features [20]; they accept raw data as input and extract features automatically. Moreover, in the lexicon-based approach, the lexicon must be maintained and updated regularly, which can be a tedious and time-consuming task. It is worth mentioning that there are few lexicons for the Arabic language compared with the number of lexicons for the English language. In addition, traditional ML techniques need adjustments when the model makes inaccurate predictions, whereas DL models do not require this step. For these reasons, DL is a suitable fit for sentiment analysis applications. Several deep neural network architectures, such as the convolutional neural network (CNN), recurrent neural network (RNN), recursive neural network (RecNN), long short-term memory (LSTM) and memory network (MemNN), have been used to solve the sentiment analysis task [21,22].
2.2. Word embedding
After the breakthroughs made by DL models in many fields, especially NLP, these models have been applied extensively in sentiment analysis research. Most DL models require input data in the form of numerical vectors that represent words and sentences. Thus, they rely on language modelling techniques, known as word embedding, that map words to vectors of real numbers. Word embedding maps semantic meaning to a geometric space using either matrix factorisation [23,24] or neural networks [25].
In the literature, word2vec [26] and Global Vectors (GloVe) [24] are two widely used word embedding models, both of which learn word vectors from co-occurrence information. As a consequence, words are represented in a much lower dimensional vector space, which enables DL models to relate words that have similar semantic properties [20,26]. Words that have a similar meaning are represented by vectors that are close to each other, as shown in Figure 2. word2vec is widely used in the sentiment analysis task, as shown in the next section. There are two model architectures of word2vec: Skip-gram (SG) and continuous bag of words (CBOW). The CBOW model predicts the current (target) word from its context words, that is, the preceding and following words within a predefined window size, as illustrated in Figure 3(a). In contrast, the SG architecture predicts the surrounding words given the target word, as shown in Figure 3(b).

Figure 2. Word embedding.
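The idea that words with similar meanings lie close together in the embedding space can be checked numerically with cosine similarity. The following is a minimal sketch, not taken from any of the reviewed works: the three 3-dimensional vectors in `toy_embeddings` are hypothetical values invented for illustration, whereas real word2vec or GloVe vectors are learned from a corpus and usually have a few hundred dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented purely for illustration.
toy_embeddings = {
    "great": [0.9, 0.1, 0.3],
    "excellent": [0.8, 0.2, 0.4],
    "car": [-0.7, 0.9, 0.1],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_related = cosine_similarity(toy_embeddings["great"], toy_embeddings["excellent"])
sim_unrelated = cosine_similarity(toy_embeddings["great"], toy_embeddings["car"])

print(f"great vs excellent: {sim_related:.3f}")
print(f"great vs car:       {sim_unrelated:.3f}")
```

With these toy values, the pair of near-synonyms scores close to 1, while the unrelated pair scores much lower.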

Figure 3. word2vec architectures: (a) continuous bag of words (CBOW) and (b) Skip-gram (SG).
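The difference between the two word2vec architectures is easiest to see in the training examples they consume. The sketch below only builds those examples from a tokenised sentence; it does not train any network. The function names `cbow_pairs` and `skipgram_pairs` and the example sentence are illustrative choices, not part of word2vec itself.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) examples, the form CBOW learns from."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # preceding words in the window
        right = tokens[i + 1:i + 1 + window]     # following words in the window
        context = left + right
        if context:
            yield context, target


def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word) examples, the form Skip-gram learns from."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word


sentence = "the service was very good".split()
cbow_examples = list(cbow_pairs(sentence, window=1))
sg_examples = list(skipgram_pairs(sentence, window=1))

for context, target in cbow_examples:
    print("CBOW:", context, "->", target)
for target, word in sg_examples:
    print("SG:  ", target, "->", word)
```

CBOW receives all context words at once and predicts a single target, whereas SG produces one example per (target, context word) pair, so the same sentence yields more SG examples than CBOW examples.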
2.3. Datasets
The number of published Arabic datasets that can be used for sentiment analysis is limited compared with the datasets available for other languages. In the following, the best-known and most widely used Arabic datasets are presented. The first dataset is the Opinion Corpus for Arabic (OCA) [27], which is a fairly small dataset. It is a balanced dataset that contains 500 movie reviews split equally between positive and negative classes. The Large Scale Arabic Book Reviews Dataset (LABR) [28] is also a well-known dataset that consists of book reviews rated by users on a scale of 1 to 5. It has more than 63,000 Arabic reviews collected from www.goodreads.com, which makes it one of the largest Arabic datasets and therefore one of the most suitable for sentiment analysis models based on DL techniques. AWATIF [29] is a multi-genre dataset that contains around 5382 labelled sentences written in MSA. The Arabic Treebank (ATB) [30] is an Arabic dataset in which three newswire corpora were analysed, with more than 500,000 annotated word tokens. In addition, the Arabic Sentiment Tweets Dataset (ASTD) [31] contains around 10,000 labelled tweets divided into positive, negative, mixed and objective classes. Another tweets dataset is ArTwitter [32], with 2000 tweets equally divided between positive and negative classes. There is also the Human Annotated Arabic Dataset (HAAD) [33], which is designated for aspect-level sentiment analysis. It is a subset of LABR with 2838 reviews and 1296 different aspect expressions.
3. Sentiment analysis
Studies have considered three main levels of granularity in sentiment analysis: document level, sentence level and entity/aspect level [34]. In the following subsections, these different levels are presented.
3.1. Document level
At this level, the problem is to determine whether the entire content of the document expresses a negative or positive opinion [35]. The analysis assumes that the sentiments expressed in each document concern a single object, so documents that discuss multiple entities are not considered in this type of analysis. The document can be, for example, an article, a review or a blog. Many studies have explored the use of deep neural networks to classify sentiments expressed at the document level [36–41].
Usually, when we address analysis at the document level, we are referring to documents that consist of more than 10–15 sentences. Unfortunately, no work has been done on Arabic text at this level.
3.2. Sentence level
The second level that researchers have investigated in sentiment analysis is the sentence level. At this level, individual sentences are classified by taking into consideration that not all sentences are opinionated. A subjectivity classification is first performed to find whether the sentence is opinionated or not. Then the resulting opinionated sentence can be analysed to see whether it is positive, neutral or negative [42]. To differentiate between document- and sentence-level analyses, we should know that in documents we may encounter different contradicting sentiments, while sentences usually express a single sentiment.
Al-Sallab et al. [43] introduced the first application of DL models to Arabic text for sentiment analysis. They explored four different DL models: deep neural network (DNN), combined deep belief network (DBN), deep autoencoder (DAE) and recursive autoencoder (RAE). In the DNN, DBN and DAE models, they used bag of words and the ArSenL lexicon [44] to generate the feature vector. The results show that the RAE model outperformed the other three models. Al-Sallab et al. [45] proposed enhancements to the RAE model of Al-Sallab et al. [43] to address challenges that arise with Arabic text. Morphological tokenisation was proposed to overcome overfitting and the morphological complexity of Arabic text. The model was tested on three different datasets and achieved an average accuracy of 80%. The same authors [45] participated in SemEval-2017 task 4 (message polarity) with their RAE model and achieved an accuracy of 41% [46].
Baly et al. [47] applied the recursive neural tensor network (RNTN) model proposed in Socher et al. [48] to Arabic text. The model was trained on a sentiment treebank called ARSENTB, a morphologically enriched treebank created by the authors. As input to the RNTN model, they used word2vec embeddings trained with the CBOW model on the QALB corpus, which contains about 550,000 comments. Testing showed that this model could achieve an accuracy of up to 80%. However, using parse trees like treebanks can lower the accuracy of the model. In a later work [49], the same authors tested the model proposed in Baly et al. [47] on a different dataset, ASTD [31], which consists of 10,006 tweets. The RNTN was trained twice: first using lemmas, where each lemma represents a set of words that have the same meaning and differ only in inflectional morphology, and then using raw words. The RNTN trained on lemmas showed better accuracy.
Another work was performed by Alayba et al. [50] to analyse opinions about health services. They gathered their dataset from Twitter hashtags and ended up with 2026 tweets. They compared two DL models, namely, DNN and CNN, with word2vec embedding. The CNN model had an accuracy of 90%, while the DNN model achieved an accuracy of 85%. In this study, the CNN model was trained on a very small dataset. Later, the authors proposed another model in Alayba et al. [51] to overcome the limitation of training a CNN on a small dataset. They trained a combined CNN and lexicon model on top of word2vec embeddings constructed from a large corpus acquired from multiple Arabic journals. The accuracy of their model increased from 90% to 92%. However, neither model addressed the negation problem.
Abdelhade et al. [52] proposed a model based on the DNN algorithm. The authors used eight layers in the model to classify Arabic tweets. The sentiment of each tweet is given by extracting the sentiment words from the tweet using a lexicon and then summing their polarities. Although the model showed good performance, its performance was sensitive to the dataset used, and negation was not considered.
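The lexicon-based scoring step described above, and the negation limitation noted for it, can be illustrated with a short sketch. This is not the implementation of Abdelhade et al. [52]: the lexicon entries, their scores, the thresholds in `label_from_score` and the English example sentences are all hypothetical and chosen only to show the mechanism.

```python
# Hypothetical polarity lexicon; entries and scores are invented for illustration.
toy_lexicon = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}


def lexicon_score(tokens, lexicon):
    """Sum the polarity of every token that appears in the lexicon."""
    return sum(lexicon.get(token, 0.0) for token in tokens)


def label_from_score(score):
    """Map a summed polarity to a label (thresholds at zero are an assumption)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweet = "the staff were excellent but the food was bad".split()
score = lexicon_score(tweet, toy_lexicon)
print(score, label_from_score(score))

# Plain summation ignores negation: "not good" still counts "good" as positive.
negated = "the service was not good".split()
negated_label = label_from_score(lexicon_score(negated, toy_lexicon))
print(negated_label)
```

The second example is labelled positive even though the sentence is negative, which is the kind of error that arises when negation is not handled.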
Dahou et al. [53] examined two word embedding models, CBOW and SG, using a corpus of 3.4 billion Arabic words selected from 10 billion words collected by crawling web pages. The SegPhrase framework [54] was used to generate better word embeddings. Then, to classify sentiments, a CNN-based model was trained using the previously trained word embeddings.
In Al-Azani and El-Alfy [55], five architectures were investigated to analyse Arabic tweets: CNN, CNN-LSTM, simple LSTM, stacked LSTM and combined LSTM. They employed dynamic and static CBOW and SG word embeddings to train the models. Experimental results showed that the combined LSTM model trained with dynamic CBOW outperformed the other models.
In SemEval-2017 (task 4, subtask A), a model that uses the DL approach was proposed [56] to classify a set of tweets. The model combines three convolutional recurrent neural networks (CRNNs), each with a different input. The first CRNN takes as input out-of-domain embeddings from the Arabic Wikipedia. The second CRNN takes in-domain embeddings, for example, from a dataset of tweets, and the word polarities serve as input to the third network. The outputs of the three CRNNs are then concatenated and used as input to a multilayer perceptron (MLP). The accuracy achieved by the model is 50.8%.
In Alayba et al. [57], the authors explored the benefits of combining CNN and LSTM networks in analysing Arabic sentiments. Their model consists of an input layer in which each word is embedded into a vector of size 100, followed by a convolutional layer with a filter size of 20. A max-pooling layer then downsizes the features using a max function and is followed by a dropout layer. The output of the dropout layer is fed into the LSTM layer, and a fully connected layer takes the output of the LSTM layer as input. Finally, a sigmoid function is used to classify each input. The model was tested on four different datasets with three diverse sentiment levels, and the results were promising.
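To make the layer order of this CNN-LSTM stack concrete, the sketch below traces how the shape of one input changes from layer to layer. Only the embedding size of 100 comes from the description above. Everything else is an assumption made for illustration: the sequence length, kernel width, pooling size, number of LSTM units and single output unit are hypothetical, and the 'filter size of 20' is read here as 20 convolution filters, which is one possible interpretation.

```python
def conv1d_output_length(seq_len, kernel_size, stride=1):
    """Length of a sequence after a 1D convolution without padding."""
    return (seq_len - kernel_size) // stride + 1


def max_pool_output_length(seq_len, pool_size):
    """Length after non-overlapping max pooling (stride equal to pool size)."""
    return seq_len // pool_size


EMBEDDING_DIM = 100  # stated in the text
SEQ_LEN = 50         # assumption: number of tokens per input
KERNEL_SIZE = 3      # assumption: width of each convolution kernel
NUM_FILTERS = 20     # assumption: reading "filter size of 20" as 20 filters
POOL_SIZE = 2        # assumption
LSTM_UNITS = 64      # assumption
OUTPUT_UNITS = 1     # assumption: one sigmoid unit

conv_len = conv1d_output_length(SEQ_LEN, KERNEL_SIZE)
pool_len = max_pool_output_length(conv_len, POOL_SIZE)

shapes = [
    ("embedding", (SEQ_LEN, EMBEDDING_DIM)),
    ("conv1d", (conv_len, NUM_FILTERS)),
    ("max_pool", (pool_len, NUM_FILTERS)),
    ("dropout", (pool_len, NUM_FILTERS)),  # dropout does not change the shape
    ("lstm", (LSTM_UNITS,)),               # last hidden state only
    ("dense_sigmoid", (OUTPUT_UNITS,)),
]

for name, shape in shapes:
    print(f"{name:14s} {shape}")
```

Placing the convolution and pooling before the LSTM shortens the sequence the LSTM has to process, which is one common motivation for this ordering.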
Heikal et al. [58] introduced two DL models: a CNN model and an LSTM model. The CNN model is constructed from three CNN layers that all take the same word embedding as input. The outputs of those layers are concatenated and fed into a fully connected layer followed by a dropout layer and finally a softmax function. The LSTM model, in contrast, is constructed from a bidirectional LSTM in which the final output of each layer is concatenated and passed to a fully connected layer followed by a dropout layer, and finally a softmax function determines the class of the input text. From those two models, the authors built an ensemble model that uses soft voting to determine the sentiment class of the text. The experimental results show that the ensemble model outperforms the two individual models.
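Soft voting, as used by the ensemble above, averages the class probabilities produced by each model and picks the class with the highest average. The sketch below shows the mechanism only. The probability values, the class order in `CLASSES` and the equal weighting of the two models are assumptions for illustration and are not taken from Heikal et al. [58].

```python
def soft_vote(probabilities_per_model):
    """Average class-probability vectors; return (winning index, averaged vector)."""
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    averaged = [
        sum(probs[c] for probs in probabilities_per_model) / n_models
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return best, averaged


CLASSES = ["negative", "neutral", "positive"]  # assumed class order

# Made-up softmax outputs of the two models for the same input text.
cnn_probs = [0.50, 0.10, 0.40]
lstm_probs = [0.20, 0.10, 0.70]

index, averaged = soft_vote([cnn_probs, lstm_probs])
predicted = CLASSES[index]
print(predicted, [round(p, 2) for p in averaged])
```

In this toy case the CNN alone would choose the negative class, but the more confident LSTM prediction shifts the averaged result to positive, which is how soft voting differs from counting one vote per model.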
In SemEval-2018 (task 1), Abdullah and Shaikh [59] introduced a model with a dense network and an LSTM network to predict the intensity of the emotions expressed in tweets. Their model produces three sets of predictions, which are combined to obtain the final output. The first prediction results from a dense network that takes Arabic tweets translated to English as input, with the feature vector formed using doc2vec and the AffectiveTweets package [60]. The second prediction also comes from a dense network, where the input is both the Arabic tweets and the translated Arabic tweets, and the feature vector is formed using word2vec and the AffectiveTweets package. The third prediction comes from the LSTM network with Arabic tweets as input, where the feature vector is padded word2vec in which each word is represented by a 300-dimensional vector. Both networks use a sigmoid function to predict the sentiment of the tweets. The final output of the model is the average of the three predictions, which is a real number between 0 and 1. The model performance is measured by the Spearman correlation score, on which it achieved 0.773.
Abdullah et al. [61] presented the Sentiment and Emotion Detection in Arabic Text (SEDAT) model, which consists of two submodels. The first submodel takes as input Arabic tweets and their English translations, which are represented by various sets of features including doc2vec, AffectiveTweets and DeepMoji [62]. Those features are passed to a fully connected layer followed by a sigmoid layer. The second submodel takes Arabic tweets as input, represents them by word embeddings and passes those embeddings to a CNN layer, then to an LSTM layer, followed by a sigmoid layer. The final output of the model is calculated by averaging the outputs of the two submodels.
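The two numerical steps in this description, averaging the three predictions and scoring the result with the Spearman correlation, can be sketched as follows. The prediction and gold values are made up for illustration, and the helper names are hypothetical. The Spearman score is computed here as the Pearson correlation of the ranks, with tied values sharing the mean of their ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up intensity predictions of the three networks for four tweets.
pred_a = [0.2, 0.6, 0.9, 0.4]
pred_b = [0.3, 0.5, 0.7, 0.5]
pred_c = [0.1, 0.7, 0.8, 0.3]
final = [(a + b + c) / 3 for a, b, c in zip(pred_a, pred_b, pred_c)]

gold = [0.25, 0.45, 0.95, 0.50]  # made-up gold intensities
rho = spearman(final, gold)
print(f"Spearman correlation: {rho:.3f}")
```

Because Spearman correlation depends only on the ordering of the scores, it rewards a model for ranking tweets by intensity correctly even when the absolute values differ from the gold labels.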
A hybrid incremental learning model for Arabic sentiment analysis was proposed by Elshakankery and Ahmed [63]. Their model uses two different ML classifiers and one DL classifier which is an RNN. The input to the network is a set of 16 different weights calculated from three different lexicons that form the feature vector. The model includes a lexicon semi-automatic update mechanism that updates the lexicon and therefore the words’ weights. After updating the lexicon using the ArTwitter dataset, the model’s best accuracy is equal to 85%.
Four DL networks were explored in Lulu and Elnagar [64] with Egyptian, Levantine, Gulf and Iraqi dialect sentences to predict sentiment polarity. LSTM, CNN, bidirectional long short-term memory (BLSTM) and convolutional long short-term memory (CLSTM) were tested with a subset of the Arabic Online Commentary (AOC) dataset [65]. The LSTM scored the best accuracy result among other classifiers with 71.4% for all three chosen dialects combined.
The effect of using different pre-trained word embeddings with the LSTM network was explored in Alwehaibi and Roy [66]. Three Arabic pre-trained word embeddings were used in the experiment: Arabic-news [67], AraFT [68] and AraVec [69]. AraVec and Arabic-news were trained using word2vec with the CBOW architecture, while FastText with SG was used to train AraFT. An LSTM network with a softmax layer as the final layer was used in the experiment. AraFT achieved the best results among the pre-trained word embeddings, with 93.5% accuracy.
Many previous works did not explicitly declare the sentiment analysis level that they were using in their work. Hence, we assume that since their models are trained by or tested on microblogs or short-text datasets, they are more likely to be a sentence-level analysis than a document-level analysis.
3.3. Entity/aspect level
The aspect-level analysis is finer grained compared with the document- and sentence-level analyses. At this level, the task is to classify sentiments expressed about a target aspect of an entity. As an example, in the sentence ‘the screen resolution is very good, but the size is very small’ if the aspect that we targeted is the ‘screen resolution’, the expressed opinion is positive. However, there is a negative opinion if the targeted aspect is the ‘size’. Therefore, this task consists of three phases: entity extraction, aspect extraction and aspect sentiment classification [42].
Ruder et al. [70] addressed the aspect-based sentiment analysis task by developing a hierarchical bidirectional long short-term memory (H-LSTM) model. Their model consists of stacked bidirectional LSTMs in which, at every time step, the outputs of the bidirectional layers are concatenated and fed as input to the final layer along with the aspect vector. The final layer is a softmax layer that gives the probability distribution for each sentence. They tested their model on datasets in different languages, including Arabic, where they achieved an accuracy of 82%. Later, the same team developed a CNN-based model [71] as their entry to SemEval-2016 for aspect-based sentiment analysis. The CNN layers take the word embeddings of sentences together with the aspect vector as input. Again, they tested their model on different languages, including Arabic, for which their system achieved an accuracy of 82%.
Al-Smadi et al. [72] addressed aspect-based sentiment analysis for Arabic hotel reviews. Their dataset consisted of 2291 Arabic reviews that were provided for the 2016 Semantic Evaluation workshop [73]. The dataset was prepared using the AraNLP [74] and MADAMIRA [75] tools, which were used to extract semantic, syntactic and morphological features that the authors believed would improve their results. They implemented their RNN-based solution using the Deeplearning4j framework [76]. The network consisted of five hidden layers, and the results show that their proposed system achieved an accuracy of 87%.
In SemEval-2017 task 4, two systems that use the DL approach were proposed [56,77] for Arabic sentiment analysis. In El-Beltagy et al. [77], for topic-based message classification (subtask B), the authors implemented three independent classifiers: CNN, MLP and logistic regression. The final classification of each tweet is determined by voting among the three classifiers. In González et al. [56], the same model described earlier in Section 3.2 was used. Experimental results revealed that the former model outperformed the latter. Unfortunately, neither work incorporates the topic (target) information into its model. Thus, they analysed the text without considering the position of the target in the context or the association between the target and its surrounding context.
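Voting among three independent classifiers can be sketched as a simple majority vote over their predicted labels. The text does not state how the votes are weighted or how ties are resolved, so the sketch assumes one unweighted vote per classifier and breaks ties in favour of the label seen first. The predicted labels are made up for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label encountered first."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]


# Made-up predictions of the three classifiers for one tweet.
votes = {
    "cnn": "positive",
    "mlp": "negative",
    "logistic_regression": "positive",
}

final_label = majority_vote(list(votes.values()))
print(final_label)
```

Unlike soft voting, this scheme ignores how confident each classifier is and counts only the predicted labels.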
Al-Smadi et al. [78] proposed a DL model to analyse sentiment expressed at the aspect level. The proposed model is based on the LSTM network where the input to the model is the text embeddings along with the aspect embedding. The output of the LSTM layer is then passed to a hidden layer to compute the attention weight vector a and the feature vector of the sentence and the target aspect r. Then vectors a and r are used with a softmax layer to predict the sentiment expressed towards the aspect. The model was tested on the Arabic Hotels reviews dataset [79] and achieved an 82.6% accuracy.
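The role of the attention weight vector a and the sentence representation r can be illustrated with a generic attention step: relevance scores are normalised with a softmax to give a, and r is the weighted sum of the LSTM outputs. This is a simplified sketch under stated assumptions and not the exact formulation of Al-Smadi et al. [78]. In particular, the relevance scores are supplied directly as made-up numbers, whereas the model described above computes them in a hidden layer from the LSTM outputs and the aspect embedding.

```python
import math


def softmax(scores):
    """Normalise scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, scores):
    """Return attention weights a and the weighted sum r of the hidden states."""
    a = softmax(scores)
    dim = len(hidden_states[0])
    r = [sum(w * h[d] for w, h in zip(a, hidden_states)) for d in range(dim)]
    return a, r


# Made-up 2-dimensional LSTM outputs for a three-word sentence.
hidden_states = [[0.1, 0.0], [0.9, 0.4], [0.2, 0.8]]
# Made-up relevance of each word to the target aspect.
scores = [0.0, 2.0, 0.0]

a, r = attend(hidden_states, scores)
print("a =", [round(w, 3) for w in a])
print("r =", [round(x, 3) for x in r])
```

The word with the highest relevance score dominates r, so the representation passed to the softmax classifier is focused on the part of the sentence that concerns the target aspect.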
A summary of the DL Arabic sentiment analysis models that have been proposed is presented in Table 1.
Table 1. Summary of the proposed deep learning–based models for Arabic sentiment analysis.
MSA: modern standard Arabic; DNN: deep neural network; LSTM: long short-term memory; CNN: convolutional neural network; MAR: macro average recall; ASTD: Arabic Sentiment Tweets Dataset; SG: Skip-gram; RAE: recursive autoencoder; ATB: Arabic treebank; NLM: neural language model; RNN: recurrent neural network; CBOW: continuous bag of words; RNTN: recursive neural tensor network; MLP: multilayer perceptron; CRNN: convolutional recurrent neural network; AOC: Arabic Online Commentary; H-LSTM: hierarchical bidirectional long short-term memory.
Spearman correlation score.
4. Conclusion
Recently, analysing sentiments and opinions using DL has attracted the attention of many researchers. This article presented the models that have been proposed to solve the problem of Arabic sentiment analysis using DL. Research on Arabic sentiment analysis using DL is still in its early stages compared with other languages such as English. With the rapid advances in DL research, we expect a significant number of models to be proposed for Arabic sentiment analysis using different DL algorithms, since several directions remain to be explored.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project is funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under Grant No. (DG1440-37-612). The authors, therefore, gratefully acknowledge the DSR technical and financial support.
