Combining BERT with TCN-BiGRU for enhancing Arabic aspect category detection

Abstract

Aspect-based sentiment analysis (ABSA) is a challenging sentiment analysis task that aims at extracting the discussed aspects and identifying the sentiment corresponding to each aspect. It comprises three main tasks: aspect term extraction, aspect category detection (ACD), and aspect sentiment classification. Most Arabic ABSA research has relied on rule-based or machine learning-based methods, with little attention to deep learning techniques. Moreover, most existing Arabic deep learning models are initialized using context-free word embedding models, which cannot handle polysemy. This paper addresses these limitations by exploiting contextualized embeddings from pre-trained language models, specifically the BERT model. In addition, we combine BERT with a temporal convolutional network and a bidirectional gated recurrent unit network in order to enhance the extracted semantic and contextual features. The evaluation results show that the proposed method outperforms the baseline and other models, achieving an F1-score of 84.58% on the Arabic ACD task. Furthermore, a set of methods is examined to handle the class imbalance in the dataset. Data augmentation based on back-translation proves effective, improving on the initial results by more than 3% in terms of F1-score.

Keywords

Aspect-based sentiment analysis, aspect category detection, deep learning, BERT, data augmentation, Arabic language

1 Introduction

Sentiment analysis, also known as opinion mining, automates the extraction of opinions, thoughts, and attitudes towards services, products, events, or public issues. Therefore, it has become one of the most active natural language processing (NLP) tasks. Moreover, opinion mining has been widely applied in various domains, such as healthcare [1], business [2], social networks [3], and education [4].

Sentiment analysis can be performed on three main levels: document level, sentence level, and aspect level. The first two levels identify the sentiment polarity of the whole document or sentence, which is not always useful, as different sentiments can be expressed towards multiple aspects in the same text. Unlike these two levels, the aspect level, also referred to as aspect-based sentiment analysis (ABSA), enables the extraction of the discussed aspects and the identification of the sentiment polarities expressed towards each aspect.

ABSA can be divided into three main tasks: aspect term extraction, aspect sentiment classification, and aspect category detection (ACD). The first task aims at extracting the discussed aspect word that literally appears in the text, whereas the second task identifies the corresponding sentiment polarity to each aspect. For the ACD task, the purpose is to detect the discussed aspect category given a pre-defined list of categories. The aspect category is a coarse-grained aspect that does not necessarily appear as a term in the sentence. In this paper, only the ACD task is performed.

Arabic is one of the most spoken languages in the world, with more than 440 million speakers [5]. It is ranked as the 4th most used language on the internet, with internet user growth of 9348.0% over the last twenty years. Thus, Arabic sentiment analysis has attracted increasing attention from the research community in recent years. However, most existing Arabic sentiment analysis papers have tackled the document or sentence levels, with little attention to the aspect level [6].

In addition, the majority of the existing Arabic ABSA models were implemented using rule-based or machine-learning-based methods, whereas deep learning techniques remain under-explored in this area [7]. Besides, most of the proposed deep learning methods are initialized using context-free word embedding models, which fail to capture polysemy. Therefore, this study aims at overcoming these limitations by providing a deep learning model based on BERT contextualized embeddings. Additionally, BERT is combined with a temporal convolutional network (TCN) and a bidirectional gated recurrent unit network (BiGRU) to further enhance the extracted semantic and contextual features. Finally, a set of methods is examined to handle the class imbalance problem in the evaluated dataset. The main contributions of this paper are:

Implementing a deep learning-based model to handle the ACD task without the need for external linguistic resources or tedious feature-engineering tasks.

Combining BERT with a TCN-BiGRU model to handle the ACD task. To our knowledge, this is the first use of this combination for this task.

Investigating a set of methods to overcome the problem of the imbalanced dataset. To our knowledge, this is the first work to address this issue for this task in Arabic.

The rest of this paper is structured as follows: Related work to Arabic ACD is overviewed in Section 2. Section 3 introduces the research methodology. Section 4 describes the dataset and comparison models. Section 5 provides the experimental results. The class imbalance issue is discussed in Section 6. Finally, Section 7 concludes the paper and provides future work directions.

2 Literature review

Arabic ABSA is attracting much less attention from researchers compared to English. Moreover, work on Arabic ACD is limited compared to other tasks, namely aspect term extraction and aspect sentiment polarity classification [8]. This section overviews Arabic ACD methods implemented based on deep learning models.

To the best of our knowledge, the first deep learning-based models that performed the ACD task in Arabic were those of Ruder and Ghaffari [9] and Tamchyna and Veselovská [10]. Both models were submitted to the SemEval 2016 task 5 workshop [11]. The former is a multi-label model implemented using a convolutional neural network (CNN), where vector representations were randomly initialized for the Arabic language. A threshold was fixed to predict the category labels of each sentence. The proposed model was ranked first in this task for Arabic by achieving an F1-score of 52.1%. The latter is implemented as a binary classifier for each aspect category. A long short-term memory (LSTM) network was used as an encoding layer to capture long-term dependencies in the data, followed by a logistic regression classifier for label prediction. The model obtained an F1-score of 47.3%.

An attempt to enhance the previous results was introduced in Al-Dabet and Tedmori [12]. The authors proposed a combination of CNN with a variant of LSTM, called independent long short-term memory network (IndyLSTM), to handle the ACD task. They used a binary relevance (BR) classification approach, in which the proposed model was decomposed into a set of independent binary classifiers to train each category independently. The experimental results achieved an F1-score of 58.1%.

A recent study in Bensoltane and Zaki [8] has achieved enhanced results compared to previous models. The authors proposed a deep learning model based on BiGRU to handle the Arabic ACD task. Besides, different word embedding models were examined to initialize vector representations, namely word-level, character level, domain-specific, and contextualized embeddings. The best results were achieved using BERT contextual embeddings with an overall enhancement of more than 7% compared to the previous model of Al-Dabet and Tedmori [12].

Another enhanced model based on BERT was introduced in Bensoltane and Zaki [13]. The authors investigated two methods of implementing a BERT model for the ACD task: fine-tuning and feature extraction. The first method fine-tuned the BERT model using a linear layer for multi-class classification. The second method extracted word embeddings from BERT while freezing its parameters; a CNN-based model was then trained for the classification task. The proposed models were evaluated on an Arabic news dataset. The evaluation results showed the effectiveness of fine-tuning compared to feature extraction (F1-score = 82.8% vs. 77.5%), especially in low-resource settings.

In this study, an improved model of the previous method is presented. The proposed model enhances BERT contextual and semantic features by integrating a TCN-BiGRU model with the BERT fine-tuned model. Besides, different methods are investigated to overcome the problem of class imbalance in the used dataset.

3 Methodology

3.1 Task description

Given a sentence S and a set of pre-defined aspect categories (C₁, C₂, . . . ,C_n), the aim of this task is to assign one or more categories to S. It is worth mentioning that the dataset used in this study assigns only one category to each sentence, so the task can be handled as a multi-class classification problem.

3.2 Model overview

The proposed model combines BERT with TCN-BiGRU along with a global pooling mechanism to perform the ACD task. First, a BERT layer is used to provide contextualized word embeddings. Then, a TCN layer is applied to extract temporal features, followed by a BiGRU layer to further capture long-term sequence dependencies from both the left and right sides. Next, global maximum and global average pooling layers are applied simultaneously to the output of the BiGRU layer to reduce the dimensionality while capturing meaningful features, and their outputs are concatenated. Finally, the output of the concatenation layer is fed into a fully connected layer for category label prediction. Figure 1 illustrates the overall architecture of the proposed model. A detailed description of each layer is provided in the following subsections.
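To make the data flow concrete, the following sketch traces the tensor shape produced by each stage for a single sentence. It is only an illustration under stated assumptions: the sizes are taken from the hyper-parameters reported in Section 4.2, while the choices that the TCN and BiGRU return the full sequence and that the BiGRU concatenates its two directions are assumptions not spelled out in the text.

```python
# Shape walk-through of the pipeline for one input sentence.
# Sizes come from Section 4.2; the "return full sequence" behaviour of the
# TCN and BiGRU layers is an assumption made for this illustration.

MAX_LEN = 128        # max sequence length
BERT_HIDDEN = 768    # hidden size of the base AraBERT model
TCN_FILTERS = 128    # TCN number of filters
GRU_UNITS = 100      # GRU hidden units per direction
NUM_CLASSES = 4      # Parties, Peace, Plans, Results


def trace_shapes():
    """Return (stage name, output shape) pairs, in pipeline order."""
    return [
        ("bert", (MAX_LEN, BERT_HIDDEN)),          # contextual embeddings
        ("tcn", (MAX_LEN, TCN_FILTERS)),           # temporal features
        ("bigru", (MAX_LEN, 2 * GRU_UNITS)),       # forward + backward states
        ("global_max_pool", (2 * GRU_UNITS,)),     # max over time
        ("global_avg_pool", (2 * GRU_UNITS,)),     # mean over time
        ("concat", (4 * GRU_UNITS,)),              # both pooled vectors
        ("softmax", (NUM_CLASSES,)),               # category probabilities
    ]


for name, shape in trace_shapes():
    print(f"{name:16s} {shape}")
```

Under these assumptions, the classifier receives a 400-dimensional vector regardless of the sentence length, which is the practical effect of the global pooling step.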

Fig. 1

Overall architecture of the proposed model.

3.2.1 BERT layer

BERT [14] is a deeply bidirectional and pre-trained language representation model implemented based on transformers. It has achieved enhanced results in many NLP tasks, such as text classification [15], sentiment analysis [16], and named entity recognition [17]. BERT was pre-trained on two unsupervised tasks: masked language modeling (MLM) and next sentence prediction (NSP). For the MLM task, 15% of tokens are randomly masked, and the model is then trained to predict the hidden tokens. The second task helps the model understand the relationship between two sentences.

Unlike traditional word embedding models like GloVe [18] and Word2Vec [19], BERT provides different vector representations for the same word depending on the context in which it appears. Table 1 shows an example of a sentence with different meanings of the same Arabic word. It can be noticed that BERT provides different vector values for this word based on the context in which it occurs. It is noteworthy that, in this study, BERT is fine-tuned on the downstream task in order to release its true power [20].

Table 1
An example of a sentence with different meanings of the same word

Before feeding the input sequence into the BERT model, the special tokens [CLS] and [SEP] are added to the beginning and end of each input text, respectively. Each input token is fed into token embedding, segment embedding, and position embedding layers. The token embedding layer converts each token into a vector representation of a fixed dimension. The segment embedding layer distinguishes between two sentences in an input sentence pair. In this study, we use only one input sentence. Thus, 0 is assigned to all the tokens of the sentence. The position embedding layer encodes the position of each token in the sequence. These three embeddings are summed element-wise to generate a single representation that is fed into BERT's encoder layers.
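The input preparation described above can be sketched as follows. This is a simplified illustration rather than the actual AraBERT preprocessing: the tokens are hypothetical placeholders, real BERT models use WordPiece sub-word tokenization, and `build_bert_inputs` and `sum_embeddings` are names introduced here only for the example.

```python
def build_bert_inputs(tokens, max_len=8):
    """Wrap tokens with [CLS]/[SEP] and build segment and position ids."""
    wrapped = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    segment_ids = [0] * len(wrapped)          # single sentence: all zeros
    position_ids = list(range(len(wrapped)))  # 0, 1, 2, ...
    return wrapped, segment_ids, position_ids


def sum_embeddings(token_vec, segment_vec, position_vec):
    """Element-wise sum of the three embedding vectors of one token."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


# Hypothetical three-token sentence.
wrapped, segments, positions = build_bert_inputs(["w1", "w2", "w3"])
print(wrapped)    # ['[CLS]', 'w1', 'w2', 'w3', '[SEP]']
print(segments)   # [0, 0, 0, 0, 0]
print(positions)  # [0, 1, 2, 3, 4]
```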

BERT uses a transformer encoder, which adopts a self-attention mechanism to capture the relationships between words in the context by computing the attention between every pair of tokens in the sequence. The attention is computed using three matrices derived from the input word vectors: the query matrix Q, the key matrix K, and the value matrix V, as illustrated in the following formula: $Attention (Q, K, V) = softmax \left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$ (1)

where d_k denotes the dimension of the key vectors, and softmax converts the scaled scores into a vector of probabilities that serve as weights on the values. The final output is the weighted sum of the value vectors.
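As an illustration of Eq. (1), the following pure-Python sketch computes scaled dot-product attention for small lists of row vectors. It is a didactic sketch only: the function names are introduced here, and it omits the learned projections and the multi-head mechanism used in the actual transformer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


# A query that is equally similar to both keys averages the two values.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```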

Additionally, the transformer uses multi-head attention with h attention heads in parallel to capture information from different representation subspaces. More details can be found in Vaswani and Shazeer [21].

3.2.2 TCN layer

TCN is a variation of CNN that combines causal convolution and dilated convolution. Its backpropagation path differs from the temporal direction of the sequence, which helps it avoid the exploding and vanishing gradient problems of RNNs. TCN also allows parallel computation and enables controlling the sequence memory length by changing the size of the receptive field. Readers can refer to the original TCN paper [22] for more details. The main components of TCN are:

Causal convolution: TCN employs causal convolutions, where the output at the current step depends only on the current and past inputs. This means that there is no information leakage from future to past. Figure 2 illustrates a causal convolution.

Dilated convolution: Dilated convolutions are used to obtain longer historical information with a larger and more flexible receptive field. They can capture long-term dependencies by allowing a larger part of the input data to contribute to the output [23]. The formula of a receptive field is: $RF = (k - 1) * d + 1$ (2)

where k is the kernel size, and d is the dilation factor. A TCN receptive field can be increased by choosing a larger kernel size or increasing the dilation factor.
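A small worked example of Eq. (2) is given below. The single-layer function follows the formula directly; the stacked version relies on the standard property that each additional layer with kernel size k and dilation d widens the receptive field by (k - 1) * d. The dilation schedule (1, 2, 4) is an assumption chosen for illustration and is not reported in the paper.

```python
def receptive_field(kernel_size, dilation):
    """Receptive field of one dilated convolution layer, Eq. (2)."""
    return (kernel_size - 1) * dilation + 1


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated layers, one layer per dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


k = 3  # TCN kernel size used in the experiments
for d in (1, 2, 4):
    print(f"k={k}, d={d}: RF={receptive_field(k, d)}")

# Assumed dilation schedule, for illustration only.
print("stacked (1, 2, 4):", stacked_receptive_field(k, [1, 2, 4]))
```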

Fig. 2

A causal convolution with two layers and kernel size=3.

For a one-dimensional input sequence X and a filter $f: \{0, \dots, k - 1\} \to \mathbb{R}$, the dilated convolution on an element s of the input sequence is computed as follows: $F (s) = \sum_{i = 0}^{k - 1} f (i) \cdot X_{s - d \cdot i}$ (3)

where $s - d \cdot i$ points in the direction of the past: it is the index of the element located $d \cdot i$ positions before s.
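Eq. (3) can be illustrated with a few lines of Python. This sketch assumes that positions before the start of the sequence are treated as zeros (left zero-padding), a common convention that the text does not specify; the function name is introduced here for the example.

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_i f[i] * x[s - d*i] for every position s.

    Indices that fall before the start of the sequence contribute zero
    (assumed left zero-padding).
    """
    k = len(f)
    out = []
    for s in range(len(x)):
        total = 0.0
        for i in range(k):
            idx = s - d * i          # look d*i steps into the past
            if idx >= 0:
                total += f[i] * x[idx]
        out.append(total)
    return out


# Kernel size 3, dilation 2: each output sees positions s, s-2, and s-4.
print(dilated_causal_conv([1, 2, 3, 4, 5], [1, 1, 1], d=2))
```

Because only indices at or before s are read, changing a later element of the input never alters an earlier output, which is the causality property described above.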

Residual connections: The residual module is used to mitigate the exploding gradient, vanishing gradient, and degradation problems that arise in deep neural networks. A residual block contains two layers of dilated causal convolution (a causal convolution with a dilation factor, as shown in Fig. 3) together with nonlinear activation functions, as illustrated in Fig. 4. The weights of each convolution kernel are normalized for regularization. Besides, a dropout layer is added after each dilated causal convolution to avoid over-fitting.

Fig. 3

A dilated causal convolution with two layers, kernel size=2, and dilation=2.

Fig. 4

TCN residual block.

The output of the residual block is computed as follows: $o = Activation (X + F (X))$ (4)

where X is the input, $F (X)$ denotes the operation of the two-layer dilated convolution, and Activation(.) is the activation function, which is the rectified linear unit (ReLU).

A 1×1 convolution is added to ensure that the input and output have the same dimensions and can be added together.

3.2.3 BiGRU layer

Gated recurrent unit (GRU) [24] is a type of recurrent neural network (RNN), an artificial neural network adapted to handle sequential data. Since RNNs suffer from the vanishing gradient problem, LSTM and GRU were introduced to overcome this shortcoming. GRU has a simpler structure than LSTM: it combines the input and forget gates of LSTM into a single update gate and merges the hidden and cell states into a single hidden state. The architecture of GRU is illustrated in Fig. 5. The update gate z_t decides how much of the past information is passed along to the next state. The reset gate r_t determines how much of the past information to neglect. The detailed calculation formulas are as follows: $z_{t} = σ (W_{zx} x_{t} + U_{zh} h_{t - 1})$ (5) $r_{t} = σ (W_{rx} x_{t} + U_{rh} h_{t - 1})$ (6) ${\tilde{h}}_{t} = \tanh (W_{cx} x_{t} + r_{t} ⊙ U_{ch} h_{t - 1})$ (7) $h_{t} = (1 - z_{t}) ⊙ {\tilde{h}}_{t} + z_{t} ⊙ h_{t - 1}$ (8)

Fig. 5

Architecture of GRU.

Where σ is the sigmoid function, ⊙ denotes the element-wise product, and W and U are the weight matrices to be learned.
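For intuition, the GRU update of Eqs. (5) to (8) can be written for the scalar case (one input feature, one hidden unit), where the final step interpolates between the candidate state and the previous state. This is a didactic sketch: the weights are plain numbers stored in a dictionary whose keys mirror the symbols above, and bias terms are omitted as in the equations.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x_t, h_prev, p):
    """One scalar GRU step; p maps weight names to numbers."""
    z = sigmoid(p["W_zx"] * x_t + p["U_zh"] * h_prev)              # update gate
    r = sigmoid(p["W_rx"] * x_t + p["U_rh"] * h_prev)              # reset gate
    h_cand = math.tanh(p["W_cx"] * x_t + r * p["U_ch"] * h_prev)   # candidate state
    return (1.0 - z) * h_cand + z * h_prev                         # new hidden state


# Illustrative weights (arbitrary values, not learned).
params = {"W_zx": 0.5, "U_zh": -0.3, "W_rx": 0.8, "U_rh": 0.1, "W_cx": 1.0, "U_ch": 0.7}
h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, params)
    print(round(h, 4))
```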

Since TCN cannot see future data, a BiGRU layer is used to enable handling data in both directions and provide complete contextual information. Indeed, the BiGRU structure is composed of a forward GRU and a backward GRU, which capture past and future information, respectively. The calculation formulas are: $\overset{⇀}{h_{t}} = GRU (x_{t}, \overset{⇀}{h_{t - 1}})$ (9) $\overset{↼}{h_{t}} = GRU (x_{t}, \overset{↼}{h_{t + 1}})$ (10)

The hidden layer output h_t of BiGRU at time t is the concatenation of the forward and backward states: $h_{t} = [\overset{⇀}{h_{t}}, \overset{↼}{h_{t}}]$ (11)

Finally, the output of BiGRU is: $h = \{h_{1}, h_{2}, \dots, h_{n}\}$ (12)
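The bidirectional scheme of Eqs. (9) to (12) can be sketched independently of the recurrent cell itself. In the example below, `step` is a hypothetical stand-in for a GRU cell (any function mapping an input and a previous state to a new state); a running sum is used purely to make the two directions easy to follow.

```python
def run_bidirectional(xs, step, h0=0.0):
    """Run `step(x_t, h_prev)` in both directions and pair the states per time step."""
    forward, h = [], h0
    for x in xs:                      # left to right, Eq. (9)
        h = step(x, h)
        forward.append(h)

    backward, h = [], h0
    for x in reversed(xs):            # right to left, Eq. (10)
        h = step(x, h)
        backward.append(h)
    backward.reverse()                # align backward states with index t

    # Eq. (11): concatenate both directions at every time step.
    return [(f, b) for f, b in zip(forward, backward)]


# Stand-in cell: a running sum, so each pair shows (sum so far, sum from here on).
print(run_bidirectional([1, 2, 3], lambda x, h: x + h))
```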

3.2.4 Global pooling mechanism

The output of the BiGRU layer is simultaneously fed into a global maximum pooling layer and a global average pooling layer. The former retrieves the maximum value of each feature over the BiGRU outputs, whereas the latter retrieves each feature's average value. Finally, the concatenation of both pooled vectors is passed to the final output layer. The calculation formulas are as follows: $V_{avg} = g_{avg} (h)$ (13) $V_{\max} = g_{\max} (h)$ (14) $V_{conc} = concat (V_{avg}, V_{\max})$ (15)

Where g_avg and g_max are the global average and global max pooling operations, respectively. concat is a concatenation operation.
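Eqs. (13) to (15) amount to reducing the time axis twice and joining the results, as the following sketch shows. The input is a list of time steps, each holding one value per feature; the function name is introduced here for illustration.

```python
def global_pool_concat(h):
    """Return [average of each feature..., maximum of each feature...] over time."""
    n_features = len(h[0])
    v_avg = [sum(step[j] for step in h) / len(h) for j in range(n_features)]
    v_max = [max(step[j] for step in h) for j in range(n_features)]
    return v_avg + v_max   # concatenation, Eq. (15)


# Two time steps with two features each.
print(global_pool_concat([[1.0, 4.0], [3.0, 2.0]]))  # [2.0, 3.0, 3.0, 4.0]
```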

3.2.5 Output layer

The output layer consists of a fully connected layer with a SoftMax activation function for predicting the category label as follows: $\hat{y} = softmax (W V_{conc} + b)$ (16)

Where $\hat{y}$ denotes the prediction probabilities, W is a trainable weight matrix, and b is a bias term.

3.2.6 Model training

The objective of the training process is to minimize the cross-entropy between the predicted and true labels. The categorical cross-entropy is used as the loss function since the ACD task in this study is a multi-class classification problem. The calculation formula is: $L (\hat{y}, y) = - \sum_{i = 1}^{N} \sum_{k = 1}^{C} y_{i}^{k} \log ({\hat{y}}_{i}^{k})$ (17)

where $y_{i}^{k}$ denotes the ground-truth label of sample i for category k; ${\hat{y}}_{i}^{k}$ denotes the corresponding predicted probability; N is the number of training samples and C is the number of categories.
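The output layer and the loss can be checked on a toy example. The sketch below implements the softmax and the summed categorical cross-entropy as written in the formula above; note that deep learning libraries typically report the mean over the batch rather than the sum, and the small `eps` term is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Sum over samples and categories of -y * log(y_hat)."""
    loss = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            loss -= y * math.log(p + eps)
    return loss


# One sample whose true category is the first of four.
y_true = [[1, 0, 0, 0]]
y_pred = [[0.5, 0.25, 0.125, 0.125]]
print(categorical_cross_entropy(y_true, y_pred))  # equals -log(0.5)
```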

4 Experiments

4.1 Dataset

The dataset used in this study was provided by Al-Ayyoub and Al-Sarhan [25]. News posts and comments about the Gaza Attacks in 2014 were collected from Facebook to examine the effect of the news on readers. The dataset was manually annotated following the SemEval 2014 task 4 [26] annotation guidelines. The discussed aspect terms, aspect categories, and the corresponding sentiment polarities were annotated for each post. The dataset is composed of 2265 posts written in modern standard Arabic (MSA). Four pre-defined categories are considered: Results, Plans, Parties, and Peace. The distribution of posts per aspect category is illustrated in Table 2. Besides, an example of the Arabic news dataset schema is illustrated in Fig. 6.

Table 2
Distribution of posts per category

Aspect category	Number of posts
Parties	141
Peace	350
Plans	813
Results	961

Fig. 6

Example of the Arabic news dataset schema.

4.2 Experimental settings

The development language is Python 3.6, and the models were implemented using the Keras and TensorFlow libraries. The base version of the AraBERT model [27] is used in this study; it consists of 12 transformer layers with 12 self-attention heads and a hidden size of 768. The hyper-parameters adopted in our experiments are shown in Table 3.

Table 3
Experimental hyper-parameters

Hyper-parameter	Value
TCN number of filters	128
TCN kernel size	3
TCN activation (residual block)	ReLU
GRU hidden units	100
Loss function	Categorical cross-entropy
Optimizer	Adam
Batch size	32
Max sequence length	128
Learning rate	2e-5

4.3 Experimental metrics

To compare our results to the baseline and prior models, precision, recall, and F1-score are computed as follows: $Precision = \frac{TP}{TP + FP}$ (18) $Recall = \frac{TP}{TP + FN}$ (19) $F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (20)

where TP denotes the number of correct categories retrieved from the test set, FP is the number of irrelevant categories identified in the same dataset, and FN denotes the number of relevant aspect categories that have not been detected.
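The three metrics can be computed from raw counts as in the sketch below. The counts in the usage example are hypothetical and chosen only to show the arithmetic; the guards against empty denominators are a design choice made here, not something specified in the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one category: 3 correct, 2 spurious, 8 missed.
p, r, f1 = precision_recall_f1(tp=3, fp=2, fn=8)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.6000 R=0.2727 F1=0.3750
```

The example also shows how a category with low recall drags the F1-score down even when its precision is moderate.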

4.4 Comparison models

To evaluate the performance of our model, we compare it with the following models:

Baseline [25]: This model follows the SemEval 2014 task 4 baseline. A test sentence t is assigned the most frequent category among the K training sentences most similar to t. The similarity between two sentences was computed using the Dice similarity coefficient.

S-BERT [13]: The BERT model is fine-tuned using a fully connected layer with a SoftMax activation function to handle the ACD task.

AraVec-TCN-BiGRU: 300-dimensional Ara-Vec model [28] trained on a Twitter dataset is used to initialize the word embeddings of the TCN-BiGRU model.

BERT-CNN-BiGRU: The same as the proposed model but using a traditional CNN layer instead of TCN.

BERT-TCN-BiLSTM: The same as the proposed model. However, a BiLSTM layer is added on top of TCN instead of BiGRU.
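The nearest-neighbour baseline listed first above can be sketched in a few lines. This is an illustration under assumptions: sentences are represented as sets of distinct tokens, the value of K and the English placeholder tokens are hypothetical, and the function names are introduced here.

```python
from collections import Counter


def dice(a_tokens, b_tokens):
    """Dice coefficient between the sets of distinct tokens of two sentences."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def knn_category(test_tokens, train, k=3):
    """Most frequent category among the k training sentences most similar to the test one.

    `train` is a list of (tokens, category) pairs.
    """
    ranked = sorted(train, key=lambda item: dice(test_tokens, item[0]), reverse=True)
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set.
train = [
    (["peace", "talks"], "Peace"),
    (["peace", "deal"], "Peace"),
    (["attack", "plan"], "Plans"),
]
print(knn_category(["peace", "plan"], train, k=3))  # Peace
```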

5 Results and discussion

5.1 Experimental analysis

The experimental results are illustrated in Table 4. They show that the proposed model outperforms the baseline model by more than 19 percentage points. Additionally, our model has achieved better results than S-BERT (84.58% vs. 82.8%). This shows that combining the fine-tuned BERT model with more powerful layers yields enhanced results, confirming the findings of previous studies [13, 29]. On the other hand, the AraVec-TCN-BiGRU model scores more than 8 percentage points lower than the proposed model, which indicates the effectiveness of the semantic features extracted by the BERT model. Additionally, BERT splits unknown words into known sub-words, which addresses the out-of-vocabulary (OOV) issue that is challenging for NLP tasks, particularly in morphologically rich languages such as Arabic. Moreover, our model has achieved better results than the BERT-CNN-BiGRU model, indicating that TCN is better at modeling text sequences than a traditional CNN model. Besides, replacing BiGRU with a BiLSTM layer has lowered the overall performance. This suggests that GRU is better at handling small-scale datasets than LSTM, which is the case for the existing Arabic ABSA corpora. This result is in line with the findings of previous studies [30].

Table 4
Main experimental results. The results with “†” are retrieved from original papers, and the best results are marked in bold

Model	Precision	Recall	F1-score
Baseline [25]	64.9% ^†	64.9% ^†	64.9% ^†
S-BERT [13]	82.8% ^†	82.8% ^†	82.8% ^†
AraVec-TCN-BiGRU	75.77%	75.77%	75.77%
BERT-CNN-BiGRU	83.63%	83.26%	83.44%
BERT-TCN-BiLSTM	84.07%	83.7%	83.89%
BERT-TCN-BiGRU	84.58%	84.58%	84.58%

5.2 Ablation study

An ablation study is conducted in order to verify the effectiveness of each component of our model. The results of this study are illustrated in Table 5. We first remove the global pooling layers from the proposed model, which leads to a performance drop. This indicates the usefulness of the pooling layers in capturing the most prominent features from the BiGRU layer. Besides, removing the TCN layer lowers the overall performance by more than 1% in terms of F1-score, indicating the effectiveness of the temporal features extracted by TCN. On the other hand, removing the BiGRU layer affects the performance of the model more than removing the TCN or pooling layers, which shows the importance of capturing long-term dependencies from both the left and right sides for the ACD task.

Table 5
Results of ablation study

Model	Precision	Recall	F1-score
BERT-TCN-BiGRU w/o pooling	84.44%	83.7%	84.07%
BERT-TCN	82.37%	82.37%	82.37%
BERT-BiGRU	83.25%	83.25%	83.25%
BERT-TCN-BiGRU	84.58%	84.58%	84.58%

5.3 Training time

A comparative experiment on training time is conducted for the following models: BERT-BiGRU, BERT-TCN-BiLSTM, BERT-TCN-BiGRU, and BERT-CNN-BiGRU. All the models were trained on an NVIDIA Tesla K80. The experimental results are illustrated in Table 6. It can be seen that BERT-TCN-BiGRU trains faster than the BERT-BiGRU model, suggesting that TCN has accelerated the training process of the proposed model. Besides, our model takes less time to train than BERT-TCN-BiLSTM, which can be explained by the simplified structure of GRU compared to LSTM. Finally, BERT-TCN-BiGRU trains faster than BERT-CNN-BiGRU, indicating that TCN reduces the training cost more than a traditional CNN model.

Table 6
Average training time per epoch

Model	Number of trainable parameters	Training time (Seconds)
BERT-BiGRU	136,213,244	223
BERT-TCN-BiLSTM	136,312,164	224
BERT-CNN-BiGRU	135,626,620	204
BERT-TCN-BiGRU	136,266,364	171

6 Class imbalance

The imbalanced data problem refers to the situation in which the number of samples from one or more categories greatly exceeds the number of samples from the remaining categories. Models trained on an imbalanced dataset generally tend to misclassify instances belonging to smaller classes more often than those belonging to larger ones. Based on the distribution illustrated in Fig. 7, it can be noticed that the dataset used in this study is imbalanced: the Parties and Peace categories have far fewer samples than the other classes. As the confusion matrix in Fig. 8 shows, our proposed model cannot learn enough information to distinguish the Parties category. Therefore, this section investigates different techniques to overcome the shortcomings of the imbalanced dataset. Results achieved using different methods are illustrated in Table 7. Besides, precision, recall, and F1-score for each category are shown in Table 8. Note that these techniques are applied to the training dataset only.

Fig. 7

Distribution size of the evaluated dataset.

Fig. 8

Confusion matrix of the proposed model on the original dataset.

Table 7

Results using different approaches to handle the class imbalance problem

Method	Precision	Recall	F1-score
None	84.58%	84.58%	84.58%
Oversampling	84.14%	84.14%	84.14%
Undersampling	78.66%	77.97%	78.31%
Random masking augmentation	85.9%	85.9%	85.9%
Back-translation	88.11%	88.11%	88.11%

Table 8

Precision, recall, and F1-score for each category

Method	Category	Precision	Recall	F1-score
None	Parties	60.0%	27.27%	37.5%
	Peace	96.55%	93.33%	94.91%
	Plans	83.14%	83.14%	83.14%
	Results	83.65%	89.7%	86.57%
Oversampling	Parties	100%	36.36%	53.33%
	Peace	81.82%	90.0%	85.72%
	Plans	79.59%	87.64%	83.42%
	Results	89.13%	84.54%	86.77%
Undersampling	Parties	33.33%	63.64%	43.75%
	Peace	80.0%	93.33%	86.15%
	Plans	72.83%	75.28%	74.03%
	Results	86.07%	70.1%	77.27%
Random masking augmentation	Parties	80.0%	36.36%	50.0%
	Peace	87.1%	90.0%	88.53%
	Plans	85.55%	86.52%	86.03%
	Results	86.14%	89.69%	87.88%
Back-translation	Parties	75.0%	81.82%	78.25%
	Peace	96.67%	96.67%	96.67%
	Plans	87.95%	82.02%	84.88%
	Results	87.25%	91.75%	89.44%

6.1 Sampling

The sampling method is a widely used data analysis technique that aims to adjust the class distribution of a dataset. We can distinguish two main techniques of sampling: oversampling and undersampling methods.

6.1.1 Oversampling

The oversampling technique re-balances the categories in the dataset by randomly duplicating samples from minority classes [31, 32]. After applying this technique, the new distribution of the training dataset is illustrated in Fig. 9, which shows a balanced distribution between the examples of each category. However, our model achieves lower results on this new data, as shown in Table 7 and Fig. 10. A likely explanation is that duplicating original data has led to over-fitting on the minority class training instances.
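Random oversampling as described above can be sketched as follows. This is a minimal illustration, assuming that every class is topped up to the size of the largest class; the function name, the seed, and the toy data are introduced here and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes match the largest one.

    Intended for the training split only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))   # exact duplicate of an existing sample
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: five samples of class A, two of class B.
X = ["a1", "a2", "a3", "a4", "a5", "b1", "b2"]
y = ["A"] * 5 + ["B"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Since the added items are exact copies, the model sees the same minority sentences repeatedly, which is consistent with the over-fitting explanation given above.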

Fig. 9

Distribution size of training dataset based on oversampling.

Fig. 10

Confusion matrix of the proposed model based on oversampling.

6.1.2 Undersampling

This method removes samples from the majority categories to reduce the class imbalance [33, 34]. Figure 11 shows the distribution of the training data after undersampling. Nevertheless, as illustrated in Table 7 and Fig. 12, the performance of the proposed model drops significantly when using this dataset. A likely explanation is that removing samples has led to information loss, which is especially harmful given the data scarcity in this study.
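A corresponding sketch of random undersampling is shown below. It assumes that every class is reduced to the size of the smallest class, which is one common variant; the exact target distribution used in the experiments is not stated in the text, and the names and toy data are illustrative.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class.

    Intended for the training split only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for cls in counts:
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for s in rng.sample(pool, target):   # sampling without replacement
            out_samples.append(s)
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: five samples of class A, two of class B.
X = ["a1", "a2", "a3", "a4", "a5", "b1", "b2"]
y = ["A"] * 5 + ["B"] * 2
X_small, y_small = random_undersample(X, y)
print(Counter(y_small), len(X_small))
```

In this toy case, three of the five majority samples are discarded, which illustrates the information loss discussed above.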

Fig. 11

Distribution size of training dataset based on undersampling.

Fig. 12

Confusion matrix of the proposed model based on undersampling.

6.2 Data augmentation

Data augmentation refers to a set of techniques that artificially increase the amount of data by generating new data points from the existing dataset. This paper examines two data augmentation methods, namely random masking augmentation and back-translation, to increase the number of examples of minority classes.

6.2.1 Random masking augmentation

This method relies on a masked word prediction task. Words are randomly masked in each sentence, and the BERT model makes predictions for the masked words. Unlike context-independent word embedding models, BERT predicts possible values depending on the context of the sentence. Only one word is masked at a time in our setup; thus, this technique is applied in series, masking a different word each time. Table 9 shows examples of sentences generated using this method. It can be seen that different variants of the same sentence can be obtained by randomly masking different words. The performance of the proposed model using the augmented dataset is illustrated in Table 7 and Fig. 13. This technique has enabled the proposed model to correctly predict more minority-class samples (F1-score of 50.0% vs. 37.5% for the Parties category), hence improving the overall performance of the proposed model.
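The serial masking procedure can be sketched as follows. The masked-word predictor is abstracted as a callable, `fill_mask`, which is a hypothetical stand-in for a BERT masked language model; the function name, the number of variants, and the seed are assumptions made for this illustration.

```python
import random


def masked_variants(tokens, fill_mask, n_variants=2, seed=0, mask_token="[MASK]"):
    """Create sentence variants by masking one randomly chosen word at a time.

    `fill_mask(masked_tokens, index)` must return a replacement word for the
    masked position; here it stands in for a BERT masked-LM prediction.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(tokens)), min(n_variants, len(tokens)))
    variants = []
    for i in positions:
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        new_word = fill_mask(masked, i)
        variants.append(tokens[:i] + [new_word] + tokens[i + 1:])
    return variants


# Dummy predictor used only to make the example runnable.
def dummy_fill(masked_tokens, index):
    return "NEW"


for variant in masked_variants(["w1", "w2", "w3", "w4"], dummy_fill):
    print(variant)
```

Each variant differs from the original sentence in exactly one position, so the category label of the original sentence can be reused for the generated one.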

Table 9
Examples of random masking generated sentences

Fig. 13

Confusion matrix of the proposed model based on random masking augmentation.

6.2.2 Back-translation

This technique first translates sentences into another language and then translates them back into the original language [35]. This helps in generating new sentences whose wording differs from the original ones while preserving the original context and meaning. We applied this method several times to examples of the minority classes in our dataset using different pivot languages (i.e., English, French, Spanish, and Dutch). Table 10 shows some examples of the obtained sentences. It can be noticed that different variations can be obtained from the same sentence by using different pivot languages. As illustrated in Table 7 and Fig. 14, the back-translation technique has enabled the proposed model to achieve the best results compared to the other evaluated techniques. Besides, the F1-score of the Parties class has improved significantly in comparison to the initial results (78.25% vs. 37.5%). This suggests that back-translation improves the diversity and the quality of the data, thus enhancing the learning process and reducing over-fitting.
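The round-trip procedure can be sketched as below. The translation step is abstracted as a callable, `translate(text, source, target)`, which is a hypothetical placeholder for whatever machine translation service is used (the paper does not name one); dropping duplicates and unchanged sentences is a design choice made here.

```python
def back_translate(sentence, translate, pivots=("en", "fr", "es", "nl"), source="ar"):
    """Return the distinct round-trip variants of a sentence, one pivot language at a time.

    `translate(text, source, target)` is a placeholder for a translation system.
    """
    variants = []
    for pivot in pivots:
        forward = translate(sentence, source, pivot)   # Arabic -> pivot language
        back = translate(forward, pivot, source)       # pivot language -> Arabic
        if back != sentence and back not in variants:
            variants.append(back)
    return variants


# Dummy translator that only tags the text, to make the example runnable.
def dummy_translate(text, source, target):
    return f"{text}|{target}"


print(back_translate("s", dummy_translate))
```

With a real translation system, each pivot language tends to produce a slightly different paraphrase, which is the source of the diversity reported above.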

Table 10
Examples of back-translated sentences

Fig. 14

Confusion matrix of the proposed model based on back-translation.
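
The round trip through several pivot languages described in this subsection could be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' pipeline: `back_translate`, the `translate(text, source_lang, target_lang)` callable, and the language codes are hypothetical, and any machine translation service or model could stand behind `translate`.

```python
from typing import Callable, Iterable, List

# Hypothetical translator type: translate(text, source_lang, target_lang).
Translator = Callable[[str, str, str], str]

# Pivot languages named in the text: English, French, Spanish, Dutch.
PIVOT_LANGUAGES = ("en", "fr", "es", "nl")


def back_translate(
    sentence: str,
    translate: Translator,
    source_lang: str = "ar",
    pivots: Iterable[str] = PIVOT_LANGUAGES,
) -> List[str]:
    """Return distinct paraphrases, attempting one round trip per pivot."""
    paraphrases: List[str] = []
    for pivot in pivots:
        forward = translate(sentence, source_lang, pivot)
        backward = translate(forward, pivot, source_lang)
        # Keep only new wordings; identical round trips add no diversity.
        if backward != sentence and backward not in paraphrases:
            paraphrases.append(backward)
    return paraphrases
```

Filtering out round trips that reproduce the input, or that repeat an earlier paraphrase, is a design choice of this sketch; it avoids turning augmentation into plain duplication of minority-class samples.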

7 Conclusion

This study proposes an enhanced deep learning model for the Arabic ACD task. The proposed model first extracts contextualized word embeddings using BERT and then applies a TCN-BiGRU model to enhance the extracted contextual and semantic features. Extensive experiments conducted on an Arabic reference dataset have shown the effectiveness of the proposed method compared to the baseline and related work models. The proposed model achieved an overall improvement of more than 19% over the baseline model. In addition, an ablation study was conducted to assess the contribution of each component of our model. Furthermore, a set of methods was examined to handle the class imbalance problem in the dataset used. In particular, data augmentation based on back-translation proved effective in generating diverse, high-quality data, improving the initial results by more than 3% in terms of F1-score.

Future work includes adapting the proposed model to other ABSA tasks, namely aspect term extraction and aspect sentiment classification. We also intend to evaluate our model on datasets in languages other than Arabic, particularly English. Furthermore, this paper has investigated only some data-level methods for handling class imbalance; other techniques remain to be examined, such as paraphrasing and text generation using the GPT-2 model [36]. Moreover, we intend to address this problem at the algorithm level using methods such as cost-sensitive learning and one-class classification.
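
As a rough illustration of what cost-sensitive learning could look like on this dataset, the sketch below derives per-class loss weights from the post counts in Table 2 using the common inverse-frequency heuristic w_c = N / (K · n_c). This is an assumption made for illustration, not a method evaluated in this paper; the function name `balanced_class_weights` is hypothetical, and the counts describe the whole dataset rather than a specific training split.

```python
from typing import Dict

# Post counts per aspect category, taken from Table 2.
CATEGORY_COUNTS = {"Parties": 141, "Peace": 350, "Plans": 813, "Results": 961}


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by N / (K * n_c), so rarer classes weigh more."""
    total = sum(counts.values())  # N: total number of posts
    n_classes = len(counts)       # K: number of categories
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(CATEGORY_COUNTS)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Such weights would typically multiply each sample's contribution to the categorical cross-entropy loss, so that errors on the Parties class cost more than errors on the Results class.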

Acknowledgment

The authors would like to thank Dr Kamal Sbiri for proofreading the paper.

References

1. Abualigah, Alfar H.E., Shehab and Hussein A.M.A., Sentiment analysis in healthcare: a brief review. Recent Advances in NLP: The Case of Arabic Language, (2020), p. 129–141.

2. Ahmed A.A.A., Agarwal, Kurniawan I.G.A., Anantadjaya S.P.D. and Krishnan, Business boosting through sentiment analysis using Artificial Intelligence approach. International Journal of System Assurance Engineering and Management, (2022).

3. Munshi, Arvindhan and Thirunavukkarasu, Random Forest Application of Twitter Data Sentiment Analysis in Online Social Network Prediction. Emerging Technologies for Healthcare: Internet of Things and Deep Learning Models, (2021), p. 299–313.

4. Baragash and Aldowah, Sentiment analysis in higher education: a systematic mapping review. in Journal of Physics: Conference Series, 2021, IOP Publishing.

5. Muaad A.Y., Al-antari M.A., Lee and Davanagere H.J., A Novel Deep Learning ArCAR System for Arabic Text Recognition with Character-Level Representation. Computer Sciences & Mathematics Forum 2(1) (2022), 14.

6. Oueslati, Cambria, HajHmida M.B. and Ounelli, A review of sentiment analysis research in Arabic language. Future Generation Computer Systems 112 (2020), 408–430.

7. Bensoltane and Zaki, Aspect-based sentiment analysis: an overview in the use of Arabic language. Artificial Intelligence Review, (2022).

8. Bensoltane and Zaki, Comparing word embedding models for Arabic aspect category detection using a deep learning-based approach. in E3S Web of Conferences, 2021, EDP Sciences.

9. Ruder, Ghaffari and Breslin J.G., Insight-1 at semeval-2016 task 5: Deep learning for multilingual aspect-based sentiment analysis. arXiv preprint arXiv:1609.02748, (2016).

10. Tamchyna and Veselovská, UFAL at SemEval-2016 Task 5: Recurrent Neural Networks for Sentence Classification. in SemEval 2016. 2016. Association for Computational Linguistics.

11. Pontiki, Galanis, Papageorgiou, Androutsopoulos, Manandhar, Al-Smadi, Al-Ayyoub, Zhao, Qin and De Clercq, Semeval-2016 task 5: Aspect based sentiment analysis. in International workshop on semantic evaluation, 2016.

12. Al-Dabet, Tedmori and Al-Smadi, Enhancing Arabic aspect-based sentiment analysis using deep learning models. Computer Speech & Language 69 (2021), 101224.

13. Bensoltane and Zaki, Towards Arabic aspect-based sentiment analysis: a transfer learning-based approach. Social Network Analysis and Mining 12(1) (2021), 7.

14. Devlin, Chang M.-W., Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, Association for Computational Linguistics.

15. Zheng and Yang, A new method of improving BERT for text classification. in International Conference on Intelligent Science and Big Data Engineering, 2019, Springer.

16. Gao, Feng, Song and Wu, Target-dependent sentiment classification with BERT. IEEE Access 7 (2019), 154290–154299.

17. Alsaaran and Alrabiah, Classical Arabic named entity recognition using variant deep neural network architectures and BERT. IEEE Access 9 (2021), 91537–91547.

18. Pennington, Socher and Manning C.D., Glove: Global vectors for word representation. in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014.

19. Mikolov, Chen, Corrado and Dean, Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013).

20. Song, Wang, Jiang, Liu and Rao, Attentional encoder network for targeted sentiment classification. arXiv preprint arXiv:1902.09314, (2019).

21. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).

22. Bai, Kolter J.Z. and Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, (2018).

23. Huang and Hain, Improving audio anomalies recognition using temporal convolutional attention networks. in ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, IEEE.

24. Chung, Gulcehre, Cho and Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, (2014).

25. Al-Ayyoub, Al-Sarhan, Al-So’ud, Al-Smadi and Jararweh, Framework for Affective News Analysis of Arabic News: 2014 Gaza Attacks Case Study. J. UCS 23(3) (2017), 327–352.

26. Pontiki, Galanis, Pavlopoulos, Papageorgiou, Androutsopoulos and Manandhar, SemEval-2014 Task 4: Aspect Based Sentiment Analysis. in COLING 2014, 2014.

27. Antoun, Baly and Hajj, AraBERT: Transformer-based model for Arabic language understanding. in Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, 2020.

28. Soliman A.B., Eissa and El-Beltagy S.R., AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP. Procedia Computer Science 117 (2017), 256–265.

29. Li, Bing, Zhang and Lam, Exploiting BERT for end-to-end aspect-based sentiment analysis. arXiv preprint arXiv:1910.00883, (2019).

30. Wang, Huang, Li, Zhou and Jiang, Refined global word embeddings based on sentiment concept for sentiment analysis. IEEE Access 9 (2021), 37075–37085.

31. Suh, Yu, Mo, Song and Kim, A comparison of oversampling methods on imbalanced topic classification of Korean news articles. Journal of Cognitive Science 18(4) (2017), 391–437.

32. Padurariu and Breaban M.E., Dealing with data imbalance in text classification. Procedia Computer Science 159 (2019), 736–745.

33. Prusa, Khoshgoftaar T.M., Dittman D.J. and Napolitano, Using random undersampling to alleviate class imbalance on tweet sentiment data. in 2015 IEEE international conference on information reuse and integration, 2015, IEEE.

34. Rathpisey and Adji T.B., Handling imbalance issue in hate speech classification using sampling-based methods. in 2019 5th International Conference on Science in Information Technology (ICSITech), 2019, IEEE.

35. Yu A.W., Dohan, Luong M.-T., Zhao, Chen, Norouzi and Le Q.V., QANet: Combining local convolution with global self-attention for reading comprehension. in International Conference on Learning Representations (ICLR), 2018.

36. Radford, Wu, Child, Luan, Amodei and Sutskever, Language models are unsupervised multitask learners. OpenAI Blog 1(8) (2019), 9.

Table 1
An example of a sentence with different meanings of the word “”

Table 2
Distribution of posts per category

Aspect category    Number of posts
Parties            141
Peace              350
Plans              813
Results            961

Table 3
Experimental hyper-parameters

Hyper-parameter                    Value
TCN number of filters              128
TCN kernel size                    3
TCN activation (residual block)    ReLU
GRU hidden units                   100
Loss function                      Categorical cross-entropy
Optimizer                          Adam
Batch size                         32
Max sequence length                128
Learning rate                      2e-5

Table 5
Results of ablation study

Model                         Precision    Recall    F1-score
BERT-TCN-BiGRU w/o pooling    84.44%       83.70%    84.07%
BERT-TCN                      82.37%       82.37%    82.37%
BERT-BiGRU                    83.25%       83.25%    83.25%
BERT-TCN-BiGRU                84.58%       84.58%    84.58%

Table 6
Average training time per epoch

Model              Number of trainable parameters    Training time (seconds)
BERT-BiGRU         136,213,244                       223
BERT-TCN-BiLSTM    136,312,164                       224
BERT-CNN-BiGRU     135,626,620                       204
BERT-TCN-BiGRU     136,266,364                       171
