SE4SA: a deep syntactical contextualized text representation learning approach for sentiment analysis

Abstract

Recently, many pre-trained text embedding models have been applied to extract latent features from texts and have achieved remarkable performance in various downstream sentiment analysis tasks. However, these pre-trained text embedding models are limited in their capability to preserve the syntactical structure as well as the global long-range dependencies between words. Thus, they might fail to recognize the relevant syntactical features of words that serve as valuable evidence for analyzing sentiment aspects. To overcome these limitations, we propose a novel deep semantic contextual embedding technique for sentiment analysis, called SE4SA. SE4SA is a multi-level text embedding model which jointly exploits the long-range syntactical and sequential representations of texts. The resulting rich semantic textual representations support a better understanding of the sentiment aspects of a given text corpus, thereby yielding better performance on the sentiment analysis task. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed SE4SA model in comparison with recent state-of-the-art models.

Keywords

Sentiment analysis, GCN, BERT, Attention, Masked language model

1 Introduction

In recent years, along with the rapid growth and popularization of the Internet, e-commerce has become the mainstream channel for customers to purchase products and services. As a primitive task of the natural language processing (NLP) domain, sentiment analysis [1–4] is the automatic computational evaluation of customers’ opinions and attitudes towards the products and services they have purchased. Sentiment analysis not only helps organizations and companies on e-commerce platforms to improve the quality of their products/services and customer satisfaction, but also provides useful references for customers about the goods they want. In general, automatic sentiment analysis systems are designed to identify the sentiment polarities of aspects related to products/services which are explicitly presented in customers’ reviews/comments. The sentiment tendencies of consumers towards products/services may vary, but they can generally be categorized as positive, negative or neutral. Thus, the sentiment analysis problem is formulated as a sentiment polarity prediction/classification task over a given text (reviews, comments, etc.). Traditionally, sentiment classification is regarded as a special case of text classification. When sentiment analysis is treated as a classification task, textual representation plays an important role, since it must capture the original information conveyed by the words/phrases/sentences of a given document. In the past, sentiment tendency evaluation problems were mainly tackled by applying hand-crafted textual feature engineering techniques, such as bag-of-words (BOW) based representations (word frequency, TF-IDF, etc.), together with off-the-shelf classification algorithms to categorize the polarities of users’ opinions.
However, hand-crafted feature engineering-based models are limited by the sparsity of short texts and are unable to preserve the complex semantic relationships between words. Among early attempts to overcome the shortcomings of the BOW-based textual representation approach, neural network-based word embedding techniques, such as Word2Vec [1] and GloVe [2], were proposed to generate dense and semantically rich embedding vectors for word representation. These learnt word embedding vectors are then combined with non-neural classification algorithms, such as Logistic Regression (LR) and Support Vector Machine (SVM), to obtain better results in the sentiment classification task. More recently, the emergence of deep learning has provided powerful techniques for the textual representation learning problem and produced state-of-the-art performance in primitive NLP tasks such as sentiment analysis. Text representation learning approaches combined with advanced deep neural network-based architectures, such as the Convolutional Neural Network (CNN) [3–5] and Long Short-Term Memory (LSTM) [6, 7], have improved the performance of the sentiment analysis task. However, deep neural network-based models also have drawbacks in sufficiently capturing the surrounding local context of words in separate sentences in order to categorize the emotional polarity of the given sentences. Recently, with the rise of sequence-to-sequence (seq2seq) [8] and attention-based transformer [9] models in textual representation learning, many pre-trained text embedding techniques, such as ELMo [10], GPT [11] and BERT [12], have been used to efficiently capture rich semantic contextual information from large-scale text corpora, thereby achieving significant performance in downstream NLP problems. As in the traditional approach, sentiment analysis is basically considered as a text classification problem.
Thus, many attention-based pre-trained language models [10, 12] have been applied to encode the contextual relationships of words/sentences and to explore the latent feature representations for sentiment polarity categorization at the decoding steps.

1.1 Recent achievements & existing limitations

Recently, with the emergence of advanced deep neural network architectures in the NLP domain, such as seq2seq [12], the attention mechanism [13] and the transformer [14–16], notable models have been proposed to deal with multiple challenging sentiment analysis (SA) tasks. Despite the great achievements of previous deep neural network/transformer-based models, they neglect the integration of extra linguistic knowledge as well as the evaluation of structural syntactical relationships during the textual embedding process for the sentiment analysis task. Most recently designed pre-trained models for sentiment analysis, such as BERT-PT [17], ABSA-BERT [18] and DAPT/TAPT [19], mainly focus on modelling the semantic sequential relatedness between contextual words and the sentiment aspects of sentences. Hence, they are insufficient to preserve the global syntactical dependencies of given texts. Moreover, sequence-based textual embedding models can only preserve multi-word latent features as consecutive terms, through the recurrent neural network-based learning operation over sentences. Thus, when identifying sentiment polarity, they are inadequate to determine emotional expressions conveyed by multiple words/compound words that are not adjacent to each other. Therefore, these transformer-based models are limited in their performance on complex SA tasks such as aspect-based sentiment analysis.

Recently, Ke, P. et al. proposed SentiLARE [20], a BERT-based sentiment-aware model for effectively acquiring emotional polarities from texts. SentiLARE utilizes multiple linguistic pre-knowledge resources to improve the performance of sentiment analysis with custom label-aware masked text representation learning processes. However, SentiLARE is also a sequence-based textual embedding approach, which fails to fully capture the global contextual and syntactical structures of given texts.

1.2 Our motivations & contributions

To overcome the limitations listed above, we propose a novel pre-trained syntactical textual embedding model called SE4SA, which jointly learns the sequential and global syntactical representations of a given text corpus. First, to acquire the structural syntactic latent features of words in each sentence, we apply a self-attention-based mechanism to capture the co-reference relationships between words. The achieved latent co-reference relatedness embedding vectors are then fused into the latent syntactical word-sentence/document relationship representations by using a multi-layered GCN-based encoder [17]. In our approach, the multi-layered GCN architecture encodes the syntactical relationships between words and a given document, which are represented as a grammatical dependency tree. By referring to the syntactical dependency tree of each document parsed via Stanford CoreNLP, the GCN encoder is capable of preserving the words that are syntactically relevant to the different target sentiment aspects, and of exploiting the long-range dependency relations between words which are not next to each other. The final syntactical representations of words in each document are then fed into a label-aware masked BERT-based encoder, where they are jointly integrated with the latent continuous representations of words. To acquire the sequential semantic representations of words in a document, we apply the pre-trained BERT encoder with the masked language mechanism. In this step, which is mainly inherited from the previous study of Ke, P. et al. on the SentiLARE model, we incorporate the external SentiWordNet [18] lexicon into the sequential textual embedding process as the pre-knowledge source for the sentiment-oriented representation learning task.
In order to deal with the sentiment analysis task in different types of documents, including short documents (containing only one sentence) and long documents (a few sentences or a complete paragraph/microblog), we use a Bi-LSTM encoder at the output layer of the masked language BERT-based embedding architecture to learn the overall representation of each document.

Figure 1 illustrates the overall architecture and main components of our proposed SE4SA model. In summary, the contributions of this paper are threefold:

First, we propose a novel approach for preserving the co-reference and syntactical dependency relationships between the words of each document. We name this textual embedding strategy the CoSynEmb process. For the co-reference relationships of words, we apply a self-attention-based neural mechanism on the context-dependent co-reference relatedness between words/phrases to efficiently capture the co-referencing representations. The co-referencing representations achieved in this step are then merged into the structural syntactical representations of words, which are preserved from the dependency tree/graph of a given document by using a multi-layered GCN-based encoder. To handle the word representation merging task, we use a custom non-linear fusion function to map the co-referencing embedding vectors of words into the long-range syntactical dependency latent representation space. After the merging process, we obtain the final unified, semantically rich embedding vectors of all words in a given document, which are then used as the inputs for the label-aware masked BERT-based encoder in the next process.

Secondly, to adapt the word embedding vectors achieved in the CoSynEmb process to the task-specific sentiment analysis, we integrate them with the emotional senses taken from the SentiWordNet dictionary as the external pre-knowledge resource. By doing this, we inject latent linguistic knowledge features into the given word representations, which makes it possible to derive the global sentiment polarities of a whole document. These enriched sentiment-aware word embedding vectors are then fed into a BERT-based embedding architecture to directly capture the sequential latent representations of words with the masked label-aware mechanism. Next, we apply a Bi-LSTM encoder at the output layer to the sentence embedding vectors produced by the BERT-based architecture in order to combine them into a final unified sentiment-aware representation of a given document. At the last stage, to handle the sentiment categorization task, we use a fully connected multi-layer perceptron (MLP) with a softmax normalization layer at the end.

Finally, to evaluate the effectiveness of our proposed SE4SA model, we conduct extensive experiments on multiple benchmark datasets, including Stanford Sentiment Treebank (SST), Amazon Reviews (AR), Movie Reviews (MR), IMDb and Yelp. The experimental results demonstrate that our proposed SE4SA model outperforms recent state-of-the-art sentiment analysis baselines.

Fig. 1
Illustration of the overall architecture of our proposed SE4SA model.

The main differences between our proposed SE4SA and recent baselines. Recent well-known BERT-based SA models, such as BERT-PT [13], TransBERT [19] and SentiBERT [20], mainly focus on capturing the rich contextual information of texts to facilitate multiple downstream SA tasks, such as sentence-level and various aspect-based sentiment analysis tasks. Although these transformer-based techniques have demonstrated remarkable performance in multiple SA tasks, they still lack a thorough evaluation of the long-range syntactical relationships between words, which can help to achieve a much higher quality of textual representations and better fine-tuning for improving the performance of SA tasks. Inspired by recent studies [21, 22] on the use of graph-based neural networks, such as GCN [17], in text analysis and mining, we employ the GCN with textual syntactical and co-referencing relationship extraction to efficiently capture semantically rich representations of texts. These rich structural representations are then integrated with the pre-trained BERT model, which is fine-tuned for handling sentence-level and aspect-based SA tasks. In general, the combination of textual graph structural (via GCN) and rich contextual (via pre-trained BERT) text embedding in our proposed SE4SA model not only enriches the semantic information of the learnt textual representations but also significantly improves the performance of multiple downstream SA tasks. This combination is the main difference between our proposed SE4SA model and contemporary BERT-based SA models.

The remainder of this paper is organized as follows. In the second section, we briefly present recent studies in the sentiment analysis domain and discuss the pros and cons of each model. In the third section, we formally describe the background concepts and notations used in this paper. In the fourth section, we present detailed descriptions of the main ideas, methodology and implementation of the proposed SE4SA model. In the fifth section, we present our extensive experiments on multiple benchmark datasets, together with studies of the comparative baselines, discussions of the experimental results and the parameter sensitivity of our proposed SE4SA model. In the last section, we conclude our achievements in this paper and highlight some possible improvements for future work.

2 Related works

In this literature review section, we present recent studies related to the traditional, deep learning-based and advanced pre-trained text embedding approaches for the sentiment analysis task.

2.1 Traditional approach for sentiment analysis

As a primitive task in the NLP domain, sentiment analysis/classification has been widely studied for decades due to its potential applications in multiple disciplines. Several language-specific analysis methods have been proposed to model the sentiment of various types of text corpora, such as product reviews on e-commerce platforms and comments/microblogs on social networks. Traditional sentiment analysis models can be categorized into two main trends. The first trend is the opinion-aware language technique, which mainly focuses on identifying the emotional aspects of the words occurring in a given document to fulfill the sentiment classification task. The models [19, 20] in this trend mainly rely on expert knowledge/lexical resources (e.g., SentiWordNet [18, 21]) for sentiment references. The second trend concentrates on developing textual analysis methods (e.g., BOW, n-grams, etc.) to learn the representations of given documents, which are then categorized by binary-class/multi-class classification algorithms [22, 23]. Although the traditional methods of both the lexicon-based and textual analysis-based approaches can automatically extract sentiment aspects from given text corpora, they often rely on an expert-based/hand-crafted feature engineering process to ensure the quality of the model’s outputs. To reduce the dependence on pre-knowledge resources and the high effort of manual feature engineering, deep learning-based models have been proposed.

2.2 Deep learning-based approach for sentiment analysis

Deep learning-based models for the sentiment analysis task share a common advantage: they require little or no expert knowledge intervention to automatically extract latent emotional aspects from texts and improve the performance of sentiment classification. In existing deep learning-based techniques, the sentiment analysis task is frequently formulated as a three-way prediction problem, in which trained sentiment classifiers predict a given document as positive, neutral or negative. The sentiment classifiers, which may be off-the-shelf classification algorithms (e.g., SVM, LR) or neural network-based mechanisms (e.g., MLP), are fed with the latent embedding vectors of given documents produced by common deep neural network-based architectures, such as the recurrent neural network (RNN) (e.g., GRU, LSTM, Bi-LSTM, etc.) and the convolutional neural network (CNN) [28].

2.2.1 RNN-based approach for SA task

RNN is considered as the most popular deep neural architecture in multiple tasks of the NLP domain, including sentiment classification, where it is used to preserve the sequential relations of words in a given document in order to identify the emotional polarity. Among early attempts, notable works include the proposals of Chen, T. et al. in Bi-LSTM-CRF [28] and Balikas, G. et al. in Bi-LSTM + Multitask [29], which modified the Bi-LSTM-based architecture to facilitate the sequential representation learning of texts and thereby improve the performance of the SA task. Recently, Ma, Y. proposed a novel LSTM-based model called SenticLSTM [7], which combines the LSTM-based textual encoder with a recurrent additive network to directly detect the target-dependent sentiment aspects for the aspect-based polarity classification task. Also within the RNN-based approach, Wen, S. et al. proposed the MLSTM model [30], a memristor-based LSTM architecture that accelerates the text embedding process for the sentiment analysis task, similar to recent works [31, 32]. However, most RNN-based SA models are still limited in their capability to capture rich contextual information from texts to facilitate the sentiment classification-driven training process.

2.2.2 CNN-based approach for SA task

In recent years, there have been notable efforts to use CNN-based architectures to assist the representation learning process for better textual understanding and fine-tuning on multiple complex SA tasks. One example is the well-known study of Santos D. et al. [3], which applies a deep convolutional neural network to jointly explore word-based and sentence-based latent representations for the sentiment analysis of short texts. Similarly, Jianqiang Z. et al. [4] used a multi-layered CNN-based encoder to learn the contextual semantic features and the co-occurrence relationships between words. The latent representations of texts learnt via the CNN-based encoder are then used to identify the sentiment polarity. Building on the successes of previous CNN-based techniques, Fan, C. et al. [8] proposed a memory-based CNN text representation learning method to preserve the latent features of both words and multi-word expressions in texts for the aspect-based sentiment analysis task. To specifically target emotional expression words in texts, Hyun, D. et al. [36] proposed a novel target-dependent CNN-based textual representation learning technique to capture the distance relationships between the target emotional words and their surrounding contextual words. In addition, the recent well-known works of Abid, F. et al. [37] and Piryani, R. et al. [38] proposed an integration of the CNN and LSTM architectures to assist the sentiment-aware textual embedding process, which helps to improve the accuracy of the sentiment analysis task.

2.3 Sequential pre-trained language model for sentiment analysis

With the rise of Seq2Seq [11] and the attention-based transformer [12] in textual representation, several advanced sequential text embedding baselines have been proposed, such as ELMo [13], GPT [14], ULMFit [39] and BERT [15]. These sequential pre-trained language models have demonstrated significant performance in acquiring rich semantic contextual representations of texts, which in turn improve the accuracy of multiple NLP tasks, including sentiment classification. Among the common pre-trained language models, BERT [15] is considered as the most popular one due to its flexibility in various NLP-based pre-training tasks, including the masked language mechanism for the general text classification task. An early work is the BERT-PT model of Xu, H. et al. [16], which applies the pre-trained BERT-based textual embedding technique to benefit aspect-based sentiment classification. The success of the BERT-PT model has demonstrated the potential of pre-trained language models in the sentiment analysis task. Similarly, TransBERT [22] integrates supervised transferable knowledge into the fine-tuning process of BERT to handle multiple NLP tasks. The recent well-known study of Gururangan, S. et al. [18] showed the possibility of using a unified pre-trained language model for multiple NLP-specific tasks via different pre/post-training processes. Following the pre/post-training strategy, recently proposed models such as DomBERT [40] and SentiBERT [23] have integrated the training process with relevant domains and a multilevel attention-based mechanism to improve the performance of aspect-based sentiment analysis. To enrich the semantic meanings of word embedding vectors for the task-specific sentiment analysis, Ke, P. et al. [19] recently proposed the SentiLARE model, which integrates a sentiment-related linguistic pre-knowledge lexicon, such as SentiWordNet, with the pre-trained BERT-based textual embedding mechanism to benefit a wide range of primitive tasks in the sentiment analysis area.

However, a major limitation of the proposed sentiment-driven pre-trained language models is that they mostly focus on preserving the continuous latent representations of a given document to identify the sentiment polarities of words in different sequential contexts. Thus, they might be unable to capture the latent syntactical structure of the overall word-document relations, which makes it possible to recognize globally related contextual words as important clues for identifying sentiment aspects. Building mainly on the existing works on pre-trained language embedding models for the sentiment analysis task, in this paper we propose a combination of GCN-based and masked language BERT-based encoding mechanisms to jointly learn the sequential and syntactical dependency structures of texts for dealing with a wide range of downstream tasks in the sentiment analysis domain.

3 Preliminaries & problem formulation

In this section, we briefly present the background concepts of BERT textual embedding, the masked language model and the GCN architecture. The ultimate goal of the proposed SE4SA model is to learn the representations of given documents for identifying sentiment polarities. In more detail, our proposed SE4SA is a textual embedding (definition 1) model which is designed for task-specific sentiment analysis. Considering the sentiment analysis task as a textual classification problem (definition 2), the sentiment-aware representations of documents learnt via text embedding techniques are used to train a classifier to predict the distribution of sentiment polarities. In previous studies on textual embedding for sentiment analysis, sequential text embedding (definition 3) methods, such as BERT [12], are mainly used to capture the contextualized vector representation of each word in a given sentence/document. In fact, BERT is one of the important innovations in the recent advances of contextualized representation learning. For sentiment analysis, pre-trained BERT can be adapted through different fine-tuning approaches to meet the specific architectures of different end tasks, such as general sentiment classification, sentiment aspect extraction, etc. This advantage of BERT minimizes the requirement for prior human/expert knowledge in the data modelling process.

Definition 1. General Text Embedding: A text embedding technique is defined as a mapping function f_emb (.) which transforms a given word (𝓌)/sentence (𝓈)/document (𝒹) into a d-dimensional vector, denoted as: $f_{emb} (𝓌 | 𝓈 | 𝒹) \to ({\vec{e}}^{𝓌} | {\vec{e}}^{𝓈} | {\vec{e}}^{𝒹}) \in ℝ^{1 \times d}$.

Definition 2. Sentiment Analysis/Classification: Extracting information about the sentiments behind texts can be modelled as a traditional classification problem over a given document (𝒹) of length (n), as: $W_{𝒹} = \{𝓌_{i}\}_{i = 1}^{n}$, and a set of sentiment polarities/classes, as: C = { positive, negative, neutral }. A specified text embedding method is applied to learn the contextual representation of the given document (𝒹), denoted as: $f_{emb} (𝒹) \to {\vec{e}}^{𝒹} \in ℝ^{1 \times d}$. Then, a classification model/mechanism, as a mapping function f_class, is applied to predict the proper sentiment polarity/class (c ∈ C) of the given document (𝒹), denoted as: $f_{class} ({\vec{e}}^{𝒹}) \to c$.
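
As an illustration of Definition 2, the two mappings can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the SE4SA model: f_emb is replaced by a plain average of word vectors, f_class by a single linear layer followed by softmax, and the word vectors, the weights W and the bias b are hypothetical, untrained values chosen only to make the example run.

```python
import math

CLASSES = ["positive", "negative", "neutral"]


def f_emb(word_vectors):
    """Toy document embedding: average of the word vectors (an assumption)."""
    n = len(word_vectors)
    d = len(word_vectors[0])
    return [sum(vec[j] for vec in word_vectors) / n for j in range(d)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def f_class(doc_vec, weights, bias):
    """Linear layer + softmax; returns the predicted class and probabilities."""
    scores = [sum(w * x for w, x in zip(row, doc_vec)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


# Hypothetical 2-dimensional word vectors of a 3-word document.
doc = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.0]]
# Hypothetical, untrained parameters: one weight row per class.
W = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0, 0.0]

e_d = f_emb(doc)                  # document vector of size d
label, probs = f_class(e_d, W, b)
print(label, [round(p, 3) for p in probs])
```

In a trained system the parameters would be learnt from labelled documents and f_emb would be a learnt encoder; only the overall shape of the mapping, from document to vector to class, is meant to carry over.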

Definition 3. Sequential Text Embedding: Given a sequence of words (in a sentence/document) of length (n), denoted as: W = { 𝓌₁, 𝓌₂, …, 𝓌_n }, a textual embedding technique, as a mapping function f_seq_emb (.), is designed to learn the representation of the whole given sequence as a set of hidden state vectors, denoted as: $X = \{\vec{h_{1}}, \vec{h_{2}}, \dots, \vec{h_{n}}\} \in ℝ^{n \times d}$, or: $f_{seq\_emb} (W) \to X = \{\vec{h_{1}}, \vec{h_{2}}, \dots, \vec{h_{n}}\}$. The learnt representation matrix, $X = \{\vec{h_{t}}\}_{t = 1}^{n}$, carries rich semantic contextual information about the given sequence.

Definition 4. Graph Convolutional Network (GCN) [17]: GCN is a recent state-of-the-art approach for graph/network representation learning. It is considered as an adaptation of the deep convolutional neural network (CNN) architecture to graph data and has been widely studied recently due to its simplicity. A GCN-based encoder applies the principles of neighborhood aggregation and spectral graph propagation to effectively encode the global information of graph-based data. Normally, a GCN architecture is defined as a multi-layered neural network, in which the t-th layer is generally defined as: $H^{[t + 1]} = f_{act} (\hat{A} H^{[t]} W^{[t]})$, where $W^{[t]}$, $H^{[t]}$ and $\hat{A}$ are the layer’s weight parameter matrix, the hidden state matrix (one row per node) and the normalized adjacency matrix, respectively, and f_act (.) is a general activation function, which is the ReLU (.) function in the original work of Kipf T. et al. [17].
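
A single propagation step of Definition 4 can be sketched as follows. The sketch assumes the row-wise convention of Kipf and Welling, in which each row of H is a node embedding, so the update is written as ReLU(Â H W) with Â = D^{-1/2} (A + I) D^{-1/2}. The helper names (matmul, normalize_adjacency, gcn_layer), the 3-node path graph and the weight values are illustrative assumptions; plain Python lists are used only to keep the example free of dependencies.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, i.e. A with self-loops, symmetrically normalized."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_hat, h, w):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical path graph 0 - 1 - 2 with 2-dimensional node features.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[1.0, -1.0], [0.5, 0.5]]  # untrained, illustrative weights

a_hat = normalize_adjacency(A)
h_next = gcn_layer(a_hat, H0, W0)
print(h_next)
```

Stacking several such layers lets information travel along multi-hop paths of the graph, which is the property that the long-range dependency modelling relies on.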

However, most recent pre-trained models for sentiment analysis share a common limitation: they are unable to fully exploit the long-range multi-word relationships as well as the overall syntactical information structure within a whole document. Thus, sequential pre-trained models might fail to determine important sentiment aspects which are expressed by multiple words/compound words that do not occur contiguously. To overcome this limitation, we propose to use GCN (definition 4) to exploit the long-range dependency relationships between words. GCN is commonly used for embedding multiple kinds of unstructured data, including texts in the form of graph-based structures. Table 1 shows the common notations used in this paper.

Table 1
List of used notations & descriptions

Notation	Description
𝓌, 𝓈, ℊ and 𝒹	A word, a sentence, a word co-referencing span and a document, respectively.
W and D	The sets of unique words/vocabulary and documents, respectively.
X^n×d	An embedding matrix of size n × d, or: $X \in ℝ^{n \times d}$.
${\vec{e}}^{𝓌}$ , ${\vec{e}}^{𝓈}$ , ${\vec{e}}^{ℊ}$ and ${\vec{e}}^{𝒹}$	The embedding vectors of a word, a sentence, a word co-referencing span and a document, respectively.
$\vec{h}$	A hidden state vector.
d and h	The dimensionality of the embedding vector and of the hidden state vector (i.e., the number of neural cells) of RNN-based architectures (LSTM/Bi-LSTM), respectively.
f_emb (.)	A specific embedding function.
f_act (.)	A specific activation function of a neural network-based architecture.
G = (V, E)	A graph-based structure, where (V) and (E) are the sets of nodes and edges, respectively.
A and $\hat{A}$	The adjacency matrix and normalized adjacency matrix of a given graph-based structure/network, respectively.

4 Methodology

In this section, we formally introduce our proposed SE4SA model, which is a sentiment-aware masked text embedding model. SE4SA jointly learns the semantic sequential and structural syntactical representations of documents for sentiment polarity identification. First, each document is passed through the CoSynEmb-based embedding mechanism to learn both the co-referencing and syntactical contextual relationships between words, by using the self-attention-based mechanism together with the multi-layered GCN-based dependency text graph encoder. Then, the resulting CoSynEmb-based embedding matrix of words is fed into a masked pre-trained BERT encoder to learn the sequential representations of all words in a given document at the local context level. Through separate masked label-aware pre/post-training steps, we obtain the unified embedding vectors of words, which are then accumulated into a final representation of the given document by a Bi-LSTM encoder at the output layer. Finally, having obtained the final representation of a target document, we feed it into a fully connected MLP layer to conduct the sentiment classification.

4.1 CoSynEmb: co-referencing and syntactical text representation learning

4.1.1 Textual syntactical structure representation learning via GCN

Given an n-word document, denoted as: 𝒹 = { 𝓌₁, 𝓌₂, …, 𝓌_n }, we first apply the pre-trained Word2Vec model [1] to learn the local contextual representations of words, denoted as: $f_{word 2 vec} (𝒹) \to X_{𝒹}^{word 2 vec}$ , where $X_{𝒹}^{word 2 vec} \in ℝ^{n \times d}$ is the low d-dimensional embedding matrix of the given document (𝒹), and each (t^th) row is the embedding vector of the t^th unique word.

Syntactical text graph construction. To address the limitations of previous sequence-based text embedding approaches, we use a multi-layered graph convolutional network architecture to learn the syntactical relationships between the words of a given document (𝒹) over its syntactical dependency tree. As an initial step, we use the Stanford CoreNLP [33] tool to parse the grammatical dependency tree of the document (𝒹) as a graph-based structure, denoted as: G_𝒹 = (V_𝒹, E_𝒹), where V_𝒹 is the set of unique words and E_𝒹 is the set of dependency relationships between the word nodes in V_𝒹.
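The graph construction step can be illustrated with a minimal sketch. The sentence, the edge list and the function name `build_adjacency` below are hypothetical, and treating dependency edges as undirected is an assumption of this sketch rather than something stated in the text; in practice the (head, dependent) pairs would come from the dependency parser.

```python
def build_adjacency(num_words, dependency_edges):
    """Build a 0/1 adjacency matrix A from (head, dependent) index pairs."""
    adjacency = [[0] * num_words for _ in range(num_words)]
    for head, dependent in dependency_edges:
        adjacency[head][dependent] = 1
        adjacency[dependent][head] = 1  # assumption: edges are treated as undirected
    return adjacency


# Hypothetical parse of a four-word review; indices refer to positions in `words`.
words = ["the", "food", "was", "great"]
edges = [(1, 0), (3, 1), (3, 2)]  # (food, the), (great, food), (great, was)
A = build_adjacency(len(words), edges)
```

Each row of `A` then corresponds to one word node in V_𝒹, and each non-zero entry to one dependency relationship in E_𝒹.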

Word representation learning via pre-trained Word2Vec vs. BERT. In our approach, we mainly use the pre-trained Word2Vec model (the Google-News-300 version released by Google), which has been trained on large-scale text corpora, to efficiently obtain rich contextual representations of the words within each document (𝒹). These word embedding vectors are later used as the initial node features of the syntactical text graph constructed for each document. We do not directly use the pre-trained BERT model to obtain the word representations of each input document (𝒹), for two reasons. First, pre-trained BERT is a sentence-level textual embedding approach in which the contextual information of word embedding vectors is captured locally within the sentences where the words occur, so it cannot be directly applied to the document-level global structural representation learning process via GCN. Moreover, most pre-trained BERT versions are mainly used for task-driven fine-tuning purposes, such as classification, so the word representations produced by pre-trained BERT cannot be readily used for general document-level contextual representation learning with separate word-based semantic evaluations.

Then, a k-layered GCN architecture is used to capture the syntactical representation of the word nodes in a given dependency text graph G_𝒹 (as illustrated in Fig. 2). Taking the initial word embedding vectors of $X_{𝒹}^{word 2 vec}$ as the node feature matrix (as shown in Equation 1a), the representation of each word node is updated through (k) steps of GCN-based propagation learning, using the graph-based spectral convolution operation with a normalization factor (as shown in Equation 1b):

$H_{𝒹}^{[1]} = f_{act} (W_{𝒹}^{[0]} . X_{𝒹}^{word 2 vec} . {\hat{A}}_{𝒹} + b_{𝒹}^{[0]})$ (1a)

$H_{𝒹}^{[t + 1]} = f_{act} (W_{𝒹}^{[t]} . H_{𝒹}^{[t]} . {\hat{A}}_{𝒹} + b_{𝒹}^{[t]})$ (1b)

Fig. 2

The illustration of syntactical structure representation learning via GCN for each document.

Where:

f_act (.), is the activation function of a given GCN-based architecture, in this case we used the ReLU (.) function.

H^[t], W^[t] and b^[t], are the hidden state, weighting parameter and bias matrices at the t^th layer of a given GCN-based architecture.

${\hat{A}}_{𝒹}$ , is the normalized adjacency matrix of a given dependency text graph (G_𝒹) of the document (𝒹), where: ${\hat{A}}_{𝒹} = {\tilde{D_{𝒹}}}^{- \frac{1}{2}} \tilde{A_{𝒹}} {\tilde{D_{𝒹}}}^{- \frac{1}{2}}$ with $\tilde{A_{𝒹}} = A_{𝒹} + I_{𝒹}$ and $\tilde{D_{𝒹}} = diag (\sum_{j} {\tilde{A_{𝒹}}}_{ij})$ . Here, I_𝒹, $\tilde{A_{𝒹}}$ and $\tilde{D_{𝒹}}$ are the identity matrix, the adjacency matrix with self-connections and the degree matrix of $\tilde{A_{𝒹}}$ , respectively.
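The symmetric normalization defined above can be sketched in a few lines of plain Python. The function name `normalize_adjacency` and the three-node chain graph are illustrative assumptions; the arithmetic follows the definition of the normalized adjacency matrix term by term.

```python
import math


def normalize_adjacency(adjacency):
    """Return D~^(-1/2) (A + I) D~^(-1/2) for a square 0/1 adjacency matrix."""
    n = len(adjacency)
    # A~ = A + I: add a self-connection to every node.
    a_tilde = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    # D~_ii = sum_j A~_ij (always >= 1 because of the self-connection).
    degree = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical three-node chain: 0 - 1 - 2
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
A_hat = normalize_adjacency(A)
```

Because every node receives a self-connection, no degree is zero and the division is always defined.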

After the propagation learning process, at the k^th layer of the given GCN-based architecture we can achieve the syntactical representation of words in a given document, as last hidden state matrix, denoted as: $X_{𝓌, 𝒹}^{syn} = H_{𝒹}^{[k]} \in ℝ^{n \times d}$ .
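As a rough illustration of the k-step propagation, the sketch below stacks GCN layers with ReLU activation. It is written under stated assumptions: it uses the conventional ordering Â·H·W (node features as rows), which is dimensionally consistent with an n × d feature matrix, and the helper names and toy values are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(matrix):
    return [[max(0.0, value) for value in row] for row in matrix]


def gcn_forward(a_hat, features, weights, biases):
    """Apply len(weights) GCN layers: H <- ReLU(A_hat . H . W + b)."""
    hidden = features
    for w, b in zip(weights, biases):
        propagated = matmul(matmul(a_hat, hidden), w)
        hidden = relu([[value + b[j] for j, value in enumerate(row)]
                       for row in propagated])
    return hidden


# Toy example: 2 nodes, 2 features, 2 layers with identity weights (assumed values).
a_hat = [[0.5, 0.5], [0.5, 0.5]]
x = [[1.0, 2.0], [3.0, -4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
x_syn = gcn_forward(a_hat, x, [identity, identity], [[0.0, 0.0], [0.0, 0.0]])
```

The last hidden state returned by `gcn_forward` plays the role of the syntactical representation matrix of the document.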

4.1.2 Co-referencing relationship representation learning

Next, to learn the co-referencing relationships between words in a given document (𝒹), we apply a self-attention mechanism with a Bi-LSTM encoder to learn the contextual co-referencing spans from co-referencing word clusters. To extract the co-referencing relatedness of words, we use the syntactic co-reference parser of the Stanford CoreNLP [33] library. As in the GCN-based syntactical structure representation learning of Section 4.1.1, we take the word embedding vectors of the given document (𝒹) as the inputs of the Bi-LSTM architecture (as illustrated in Fig. 3). To learn the sequential representations of words over each co-referencing span, denoted as: (ℊ), we encode the set of continuous word embedding vectors, as: ${\vec{e}}^{𝓌}$ , with the Bi-LSTM architecture and take the concatenated hidden vector outputs as the final embedding vector of each co-referencing span, denoted as: ${\vec{e}}^{ℊ}$ . For each t^th word (with corresponding embedding vector: ${\vec{e}}^{𝓌_{t}}$ ) in a given i^th span (ℊ_i), denoted as: 𝓌_t ∈ W_ℊi, we obtain the hidden states for both directions at each time-step (t) as follows (as shown in Equation 2):

$\begin{matrix} f_{[α]}^{[t]} = σ (U_{f} [{\vec{e}}^{𝓌_{t}}, h_{[+ α, α]}^{[t]}] + b_{f}) \\ ℴ_{[α]}^{[t]} = σ (U_{ℴ} [{\vec{e}}^{𝓌_{t}}, h_{[+ α, α]}^{[t]}] + b_{ℴ}) \\ \tilde{𝒸}_{[α]}^{[t]} = \tanh (U_{𝒸} [{\vec{e}}^{𝓌_{t}}, h_{[+ α, α]}^{[t]}] + b_{𝒸}) \\ 𝒸_{[α]}^{[t]} = f_{[α]}^{[t]} \circ \tilde{𝒸}_{[α]}^{[t]} + (1 - f_{[α]}^{[t]}) \circ 𝒸_{[α]}^{[t - 1]} \\ h_{[α]}^{[t]} = ℴ_{[α]}^{[t]} \circ \tanh (𝒸_{[α]}^{[t]}) \end{matrix}$ (2)

Fig. 3

The illustration of co-referencing relationship representation learning via Bi-LSTM with attention based mechanism for each document.

Where:

σ (.), is the sigmoid function.

${\vec{e}}^{𝓌_{t}}$ , is the word embedding vector of a given i^th span (ℊ_i), at a specific time-step (t).

U and b, are the weighting parameter and bias matrices of a given Bi-LSTM architecture.
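A single recurrent step of Equation 2 for one direction can be sketched as follows. This is a simplified illustration: it assumes that the hidden state fed into the gates is the one from the previous time-step (as in a standard LSTM), and the parameter dictionary keys and toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Compute weights . vector + bias for a list-of-lists weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def cell_step(e_t, h_prev, c_prev, params):
    """One step of the gated cell in Equation 2 for a single direction."""
    z = e_t + h_prev  # list concatenation [e_t, h_prev]
    f = [sigmoid(v) for v in affine(params["U_f"], params["b_f"], z)]
    o = [sigmoid(v) for v in affine(params["U_o"], params["b_o"], z)]
    c_tilde = [math.tanh(v) for v in affine(params["U_c"], params["b_c"], z)]
    # Cell update as written in Equation 2: f gates the candidate, (1 - f) the old cell.
    c = [fi * ci + (1.0 - fi) * cp for fi, ci, cp in zip(f, c_tilde, c_prev)]
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h, c


# Toy setup: 2-dimensional word vector, 1 hidden unit, all-zero parameters (assumed).
zero_params = {name: [[0.0, 0.0, 0.0]] for name in ("U_f", "U_o", "U_c")}
zero_params.update({name: [0.0] for name in ("b_f", "b_o", "b_c")})
h_t, c_t = cell_step([1.0, -1.0], [0.0], [2.0], zero_params)
```

Running the same step over a span from left to right and from right to left, then concatenating the two final hidden states, gives the bidirectional span encoding described in the text.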

The use of a Bi-LSTM encoder enables us to fully capture the contextual information of the co-referencing relationships between words, covering both the surrounding external structure and the internal structure of each span (ℊ_i). Then, we combine the co-referencing relatedness representation learning process with a self-attention mechanism to align it with the syntactic structure of the words occurring inside each span. The overall task-specific co-referencing relation embedding task with the attention-based mechanism is formulated as follows (as shown in Equations 3a and 3b):

$\begin{matrix} h_{[+ α]}^{ℊ_{i}} = LSTM (W_{ℊ_{i}}, + α) \\ h_{[- α]}^{ℊ_{i}} = LSTM (W_{ℊ_{i}}, - α) \\ {\vec{\tilde{e}}}^{ℊ_{i}} = [h_{[+ α]}^{ℊ_{i}}, h_{[- α]}^{ℊ_{i}}] \end{matrix}$ (3a)

$\begin{matrix} γ^{ℊ_{i}} = softmax (Z_{γ} . Linear ({\vec{\tilde{e}}}^{ℊ_{i}})) \\ λ^{ℊ_{i}} = Dropout (W_{γ}) . γ^{ℊ_{i}} \\ {\vec{e}}^{ℊ_{i}} = \prod_{𝓌_{t} \in ℊ_{i}} λ^{ℊ_{i}} . {\vec{e}}^{𝓌_{t}} \end{matrix}$ (3b)

Where:

${\vec{\tilde{e}}}^{ℊ_{i}}$ , is the unified sequential representation of a given span (ℊ_i) which is the concatenated output hidden states of the previous Bi-LSTM architecture (Equation 2).

Z_γ and W_γ, are the weighting parameter matrices of a given self-attention-based mechanism for aligning the span representation with the embedding vectors of the words occurring in it.

Finally, to efficiently integrate the obtained co-referencing relationship representation with the separate word embedding vectors at the global document context level, we apply the average pooling strategy to softly combine the latent features extracted from the different representations into a single unified embedding space. Let $G_{𝒹} = \{ℊ_{i}\}_{i = 1}^{m}$ be the set of (m) existing co-referencing spans in a given document (𝒹). We update the embedding vector of each word with the contextual co-referencing information via the average pooling strategy as follows (as shown in Equation 4):

${\vec{e}}^{𝓌_{t}} \leftarrow AvgPool ({\vec{e}}^{𝓌_{t}}, {\vec{e}}^{ℊ_{i}}), \; with \; \forall ℊ_{i} \in G_{𝒹} \; and \; \forall 𝓌_{t} \in ℊ_{i}$ (4)

Where:

$G_{𝒹}$ , is the set of existing co-referencing spans in a given document (𝒹).

${\vec{e}}^{𝓌}$ and ${\vec{e}}^{ℊ}$ , are the embedding vectors of a given word and its associated co-referencing span, respectively.

From these enriched word embedding vectors with the co-referencing relationship contextual information, we form a new word embedding matrix, denoted as: $X_{𝓌, 𝒹}^{coref} \in ℝ^{n \times d}$ .
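The update of Equation 4 can be sketched as below. Interpreting AvgPool as the element-wise mean of the word vector and the span vector (both of dimension d) is an assumption of this sketch, and the function names and toy vectors are hypothetical.

```python
def avg_pool(word_vec, span_vec):
    """Element-wise mean of a word vector and a span vector of equal length."""
    return [(w + s) / 2.0 for w, s in zip(word_vec, span_vec)]


def update_with_coref(word_vectors, spans, span_vectors):
    """Apply Equation 4 to every word that belongs to a co-referencing span.

    spans maps a span id to the positions of its words;
    span_vectors maps a span id to that span's embedding vector.
    """
    updated = [list(vec) for vec in word_vectors]  # copy; words outside spans stay as-is
    for span_id, positions in spans.items():
        for t in positions:
            updated[t] = avg_pool(updated[t], span_vectors[span_id])
    return updated


# Toy document with three words; words 0 and 2 share one co-referencing span.
word_vectors = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
x_coref = update_with_coref(word_vectors, {0: [0, 2]}, {0: [3.0, 1.0]})
```

Words that do not occur in any span keep their original embedding, so the output matrix has the same n × d shape as the input.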

4.1.3 Syntactical and co-referencing relationship embedding fusion

At this stage, we have obtained two separate embedding matrices of the words in a document (𝒹): the syntactic structure-based embedding matrix, denoted as: $X_{𝓌, 𝒹}^{syn} \in ℝ^{n \times d}$ (Section 4.1.1), and the co-referencing relationship-based embedding matrix, denoted as: $X_{𝓌, 𝒹}^{coref} \in ℝ^{n \times d}$ (Section 4.1.2). These two embedding matrices carry different contextual information about the words in the given document (𝒹). To effectively merge them into a single unified embedding space for later use in the sequential learning steps via the pre-trained BERT model, we define a custom embedding fusion mechanism with a non-linear function, as: f_nl (.). Our fusion mechanism is formulated as a fully-connected MLP architecture as follows (Equation 5):

$\begin{matrix} f_{nl} (\{{\vec{e}}^{𝓌}\}_{{\vec{e}}^{𝓌} \in \{X_{𝓌, 𝒹}^{syn}, X_{𝓌, 𝒹}^{coref}\}}) \\ = σ (\sum_{{\vec{e}}^{𝓌} \in \{X_{𝓌, 𝒹}^{syn}, X_{𝓌, 𝒹}^{coref}\}} U^{nl} σ (W^{nl} . {\vec{e}}^{𝓌} + b^{nl})) \end{matrix}$ (5)

Where:

σ (.), is the defined non-linear function for our proposed word embedding fusion mechanism –in this case is the sigmoid function.

U^nl, W^nl and b^nl, are the weighting parameter and bias matrices of a given fusion function.
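For a single word, the fusion of Equation 5 can be sketched as follows. The sketch assumes that one set of parameters (W^nl, b^nl, U^nl) is shared by both views, as the equation suggests; the function name `fuse` and the toy parameter values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def fuse(e_syn, e_coref, w_nl, b_nl, u_nl):
    """Equation 5 for one word: sigmoid(sum over both views of U . sigmoid(W . e + b))."""
    total = [0.0] * len(u_nl)
    for e in (e_syn, e_coref):
        hidden = [sigmoid(v) for v in affine(w_nl, b_nl, e)]
        projected = [sum(u * h for u, h in zip(row, hidden)) for row in u_nl]
        total = [t + p for t, p in zip(total, projected)]
    return [sigmoid(t) for t in total]


# Toy parameters (assumed): zero W and b, identity U, embedding size 2.
w_nl = [[0.0, 0.0], [0.0, 0.0]]
b_nl = [0.0, 0.0]
u_nl = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse([0.3, -0.2], [1.5, 0.7], w_nl, b_nl, u_nl)
```

Applying `fuse` row by row to the two embedding matrices yields one unified n × d matrix that can be passed to the next stage.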

The purpose of using a non-linear fusion mechanism to merge different types of word embedding vectors into a single unified representation space is to flexibly model the complex syntactical and co-referencing relationships between words. To optimize the trainable parameters of our fusion function, we apply the standard stochastic gradient descent (SGD) strategy with the pre-defined learning rate η and the gradients calculated for each parameter with respect to the given word embedding vectors, as: $U_{update}^{nl} = \frac{\partial {\vec{e}}^{𝓌}}{\partial U^{nl}}$ , $W_{update}^{nl} = \frac{\partial {\vec{e}}^{𝓌}}{\partial W^{nl}}$ and $b_{update}^{nl} = \frac{\partial {\vec{e}}^{𝓌}}{\partial b^{nl}}$ .

4.2 Sentiment-aware text representation learning with pre-trained BERT

Starting from the unified word embedding vectors obtained above, we follow the approach of Ke, P. et al. in the SentiLARE model [16] and inject prior knowledge into each word embedding vector via the SentiWordNet [18] lexicon. In the first step, for each word in a given document (𝒹), we apply the Stanford CoreNLP [31] tool to extract the part-of-speech (POS) label. We then obtain the sentiment polarity label of each word in the given document (𝒹) by calculating the sentiment polarity score between each i^th word (𝓌_i) and its labelled POS tag p_i over the different sentiment context-gloss senses ( $S G$ ), denoted as: sp (〈𝓌_i|p_i〉, 𝒹). In more detail, to calculate the contextual sentiment-aware similarity score between a specific word-POS pair 〈𝓌_i|p_i〉 and an existing (k^th) sense $S G_{i}^{k}$ taken from the SentiWordNet lexicon, we simultaneously evaluate the sense ranking and the contextual lexical gloss similarity to compute the weight of each sense for the given word-POS pair 〈𝓌_i|p_i〉 as follows (as shown in Equation 6):

$\begin{matrix} sp (〈 𝓌_{i} | p_{i} 〉, 𝒹) = \sum_{k = 1}^{| S G_{i} |} ψ_{i}^{k} ({pos}_{i}^{k} - {neg}_{i}^{k}) \\ ψ_{i}^{k} = softmax (λ . cosine (f_{Avg - CoSynEmb} (𝒹), f_{Avg - CoSynEmb} (S G_{i}^{k}))) \end{matrix}$ (6)

Where:

f_Avg-CoSynEmb (.), is the average embedding vector of a given text, using the proposed CoSynEmb-based strategy as a mapping function (described in Section 4.1).

$S G_{i}^{k}$ , is the textual gloss content of a specific (k^th) sense of a given word-POS pair, as: 〈𝓌_i|p_i〉.

${pos}_{i}^{k}$ and ${neg}_{i}^{k}$ , are the positive and negative score of the given (k^th) sense which is referred from the SentiWordNet lexicon, respectively.

λ, is the normalization parameter for approximating the impact of each sense among all existing senses in $S G_{i}$ , calculated as: $λ = \frac{1}{| S G_{i} |}$ .
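The score of Equation 6 can be sketched in plain Python as below. The sketch assumes that the softmax is taken over the senses of one word-POS pair and that the averaged embeddings of the document and of each gloss are already available as vectors; the function names and the toy senses are hypothetical, not real SentiWordNet entries.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def polarity_score(doc_vec, senses):
    """Equation 6. senses is a list of (gloss_vec, pos_score, neg_score) tuples."""
    lam = 1.0 / len(senses)  # lambda = 1 / |SG_i|
    weights = softmax([lam * cosine(doc_vec, gloss) for gloss, _, _ in senses])
    return sum(w * (pos - neg) for w, (_, pos, neg) in zip(weights, senses))


# Toy word with two equally similar senses of opposite polarity (assumed values).
doc_vec = [1.0, 0.0]
senses = [([1.0, 0.0], 0.75, 0.25), ([1.0, 0.0], 0.0, 0.5)]
score = polarity_score(doc_vec, senses)
```

A sense whose gloss is closer to the document context receives a larger weight, so its positive and negative scores contribute more to the final polarity score.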

After identifying the sentiment polarity score of each word-POS pair 〈𝓌_i|p_i〉 in a given document (𝒹), we derive the sentiment label ℓ_i of each pair from its surrounding contextual representation (sp (〈𝓌_i|p_i〉, 𝒹) > 0: positive; sp (〈𝓌_i|p_i〉, 𝒹) = 0: neutral; otherwise: negative), and we use a pre-trained BERT-based architecture with the masked language mechanism. We define the training set of our BERT-based sequential embedding model as: $T = \{〈 {\vec{e}}^{𝓌_{i}} | p_{i} | ℓ_{i} 〉\}_{i = 1}^{n}$ . Then, the sequential representation of each sentence (𝓈_j) in a given document (𝒹) with (z) sentences is obtained as follows (as shown in Equation 7):

$\{{\vec{e}}^{𝓈_{j}}\}_{j = 1}^{z} = BERT (T = \{〈 {\vec{e}}^{𝓌_{i}} | p_{i} | ℓ_{i} 〉\}_{i = 1}^{n})$ (7)

Where:

${\vec{e}}^{𝓌_{i}}$ , is the embedding representation of (i^th) word which is achieved by the CoSynEmb-based embedding strategy.

BERT (.), is the pre-trained BERT-based sentence-level embedding model with several masked word positions.

${\vec{e}}^{𝓈_{j}}$ , is the sentence-level embedding vector of the (j^th) sentence of a given document (𝒹), taken from the hidden state of the BERT-based architecture at each time-step.
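The construction of the training set T from the polarity scores can be sketched as follows. The thresholding rule is the one stated in the text (positive above zero, neutral at zero, negative otherwise); the function names, the POS tag strings and the toy values are hypothetical.

```python
def sentiment_label(score):
    """Map a polarity score sp to a word-level sentiment label."""
    if score > 0:
        return "positive"
    if score == 0:
        return "neutral"
    return "negative"


def build_training_tuples(embeddings, pos_tags, scores):
    """Build T = {<e_i | p_i | l_i>} from word embeddings, POS tags and scores."""
    return [(e, p, sentiment_label(s))
            for e, p, s in zip(embeddings, pos_tags, scores)]


# Toy three-word document (assumed embeddings, tags and scores).
T = build_training_tuples(
    embeddings=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    pos_tags=["NN", "VBD", "JJ"],
    scores=[0.0, -0.25, 0.5],
)
```

Each tuple carries exactly the three pieces of information that the masked training objective later asks the encoder to recover for a masked position.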

Following the previous work of Ke, P. et al. [16], we use the sentiment-aware masked language training strategy, configured so that the BERT-based encoder predicts the masked word (𝓌) together with its associated POS tag (p) and its assigned sentiment label (ℓ). Hence, the loss function with the desired learning objective for the output layer of the pre-trained BERT model is defined as follows (as shown in Equation 8).

$\begin{matrix} L^{BERT} = - \sum_{i = 1}^{n} f_{j} (𝓌_{i}) . (logProb ({\vec{e}}^{𝓌_{i}} | T, ℓ_{i}) \\ + logProb (p_{i} | T, ℓ_{i}) + logProb (ℓ_{i} | T, ℓ_{i})) \end{matrix}$ (8)

Where:

f_j (.), is the indicator function which returns the value [1] if the (i^th) word is masked, and [0] otherwise.

logProb (. | .), is the logarithmic probability of given masked word which is calculated upon the corresponding sentence-level hidden state.
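The masked objective can be illustrated with a short sketch that sums the negative log-probabilities over masked positions only. The per-position probabilities are assumed to be given (in practice they would come from the encoder's output heads), and the function name `masked_loss` and the toy numbers are hypothetical.

```python
import math


def masked_loss(is_masked, p_word, p_pos, p_label):
    """Negative log-likelihood of word, POS tag and sentiment label at masked positions."""
    loss = 0.0
    for masked, pw, pp, pl in zip(is_masked, p_word, p_pos, p_label):
        if masked:  # indicator function: only masked positions contribute
            loss -= math.log(pw) + math.log(pp) + math.log(pl)
    return loss


# Toy sequence of two positions; only the first one is masked (assumed values).
loss = masked_loss(
    is_masked=[1, 0],
    p_word=[0.5, 0.1],
    p_pos=[0.5, 0.1],
    p_label=[0.5, 0.1],
)
```

Unmasked positions contribute nothing, so the loss only rewards the encoder for recovering the information that was hidden from it.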

After the model training process, we obtain a set of sentence-level embedding vectors of a given document (𝒹), which carry the sequential contextual information of words. Then, to produce the final representation of the document (𝒹), as: ${\vec{e}}^{𝒹}$ , we use another Bi-LSTM encoder to map the embedding vectors of sentences, as: $X_{𝓈, 𝒹}^{BERT} = \{{\vec{e}}^{𝓈_{j}}\}_{j = 1}^{z}$ , into a unified representation space, as shown in Equation 9:

$\begin{matrix} h_{[+ α]}^{𝓈} = LSTM (X_{𝓈, 𝒹}^{BERT}, + α) \\ h_{[- α]}^{𝓈} = LSTM (X_{𝓈, 𝒹}^{BERT}, - α) \\ {\vec{e}}^{𝒹} = [h_{[+ α]}^{𝓈}, h_{[- α]}^{𝓈}] \end{matrix}$ (9)

Where:

${\vec{e}}^{𝒹}$ , is the final embedding vector of a given document (𝒹).

$X_{𝓈, 𝒹}^{BERT}$ and α, are the set of BERT-based sentence embedding vectors and the direction of the model's parameters in the given Bi-LSTM encoder, respectively.

Finally, after obtaining the final rich-semantic sentiment-aware representation of a given document (𝒹), as: ${\vec{e}}^{𝒹}$ , we feed it into a fully-connected MLP architecture with a softmax normalization function at the output layer to conduct the document-level sentiment polarity categorization task, as a mapping function: $f_{class} ({\vec{e}}^{𝒹}) \to c$ , c ∈ C, where C is the set of sentiment classes. The sentiment classification task is formulated as follows (as shown in Equation 10a):

$\begin{matrix} f_{class} ({\vec{e}}^{𝒹}) = prob (c | 𝒹) \\ = softmax (M^{class} . {\vec{e}}^{𝒹} + b^{class}) \end{matrix}$ (10a)

$L^{class} = - \sum_{〈 𝒹, c 〉 \in D} logProb (c | 𝒹) + λ {∥ Θ ∥}_{F}$ (10b)

Where:

M^class and b^class, are the weighting parameter and bias matrices of a given MLP architecture.

$D$ , is the training set of data tuples 〈𝒹, c〉, each containing a document's content and its assigned sentiment label.

To train the sentiment classification model, we apply the standard SGD with the cross-entropy loss and the L2-regularization strategy (generally formulated in Equation 10b).
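The output layer and its loss (Equations 10a and 10b) can be sketched as below. The sketch assumes that the regularized parameter set Θ contains only the weight matrix M^class, and the function names, class indices and toy values are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(doc_vec, m_class, b_class):
    """Equation 10a: class probabilities softmax(M . e + b)."""
    logits = [sum(m * x for m, x in zip(row, doc_vec)) + b
              for row, b in zip(m_class, b_class)]
    return softmax(logits)


def classification_loss(batch, m_class, b_class, l2_coeff):
    """Equation 10b: negative log-likelihood plus a Frobenius-norm penalty on M."""
    nll = -sum(math.log(classify(vec, m_class, b_class)[label])
               for vec, label in batch)
    frobenius = math.sqrt(sum(m * m for row in m_class for m in row))
    return nll + l2_coeff * frobenius


# Toy setup: 3 sentiment classes, 2-dimensional document vectors, zero parameters.
m_class = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_class = [0.0, 0.0, 0.0]
probs = classify([0.4, -0.1], m_class, b_class)
loss = classification_loss([([0.4, -0.1], 2)], m_class, b_class, l2_coeff=0.01)
```

With all-zero parameters the classifier is uniform over the classes, so the loss of a single example equals log(3); training lowers it by moving probability mass toward the labelled class.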

5 Experiments & discussions

In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed SE4SA model in comparison with recent state-of-the-art baselines for sentiment analysis.

5.1 Experimental dataset usage & settings

5.1.1 Dataset usage descriptions

Our experiments are conducted on six benchmark datasets, including: Stanford Sentiment Treebank (SST), Amazon Reviews (AR), Movie Reviews (MR), IMDb, Yelp and SemEval2014/2016. The detailed information of each dataset is given below:

Stanford Sentiment Treebank (SST): is a well-known labelled and parsed dataset for the sentiment analysis task. This dataset includes fine-grained sentiment labels for over 215 K phrases, which are parsed by the Stanford NLP toolkit into the syntactical parse trees of more than 11 K sentences/documents. The SST dataset is considered a challenging dataset for multiple NLP tasks which require thorough, comprehensive analysis as well as effective modelling of compositional effects by the designed language model. This dataset can be downloaded at the official website of the Stanford NLP group 1 .

Amazon Reviews (AR): is a large-scale dataset which contains about 34 M user reviews of more than 2.4 M products, collected from the Amazon e-commerce platform. The length of the user reviews in this dataset varies (from one to multiple sentences). The AR dataset can be freely obtained at the official website of the Stanford SNAP group 2 .

Movie Review (MR): is a classical dataset for sentiment analysis and recommendation tasks. This dataset contains about 10 K user’s reviews on movies. Each user’s review is labelled with its source review’s sentiment categories, and the corresponding sentiment label (positive or negative). In this MR dataset, we equally have approximately 5 K of positive and negative reviews. The MR dataset is available at this repository 3 .

Yelp: belongs to a well-known Yelp Challenge Dataset which contains 1.6 M reviews of 366 K users on 61 K local businesses/organizations. This dataset can be downloaded at the official website of Yelp dataset 4 .

IMDb: similar to MR dataset, the IMDb is also a movie review dataset which contains about 50 K user’s reviews which are categorized as positive and negative. This dataset is split into two parts, training and testing with 25 K reviews for each part. This dataset can be directly downloaded at this repository 5 .

SemEval2014 (Task 4) [43]: is a well-known dataset for aspect-level sentiment analysis task. In this dataset, we selected two categories: laptop (Laptop14) and restaurant (Resta14) for the aspect-based sentiment analysis. This dataset can be directly downloaded at this repository 6 .

SemEval2016 (Task 5) [44]: similar to SemEval2014 dataset, in this dataset we also selected the restaurant domain (Resta16) for the experiments in aspect-based sentiment analysis. This dataset can be downloaded at this repository 7 .

Sentimental aspect-based datasets descriptions.

Laptop14 (SemEval2014): belongs to the well-known aspect-based sentiment analysis challenge of SemEval2014 (Task 4) [43], which is a subset of the dataset released in [45] by Ganu, G. et al. This dataset contains customers' reviews of laptop-related products and is originally divided into training (3045) and testing (800) sets. This dataset is used to evaluate the ATE-based and ATSC-based tasks, similar to previous experimental studies [20, 46].

Resta14 (SemEval2014): also belongs to the SemEval2014 (Task 4) challenge [43]. This dataset contains 3841 customers' reviews of restaurant-related products/services, which are originally split into two sets: 3041 for training and 800 for testing. We used this dataset to evaluate our proposed SE4SA model against the comparative baselines on the ATE-based, ATSC-based, ACD-based and ACSC-based tasks [20, 46].

Resta16 (SemEval2016): is included in the more recent SemEval2016 (Task 5) challenge [44] released by Pontiki, Maria, et al. and, similar to Resta14 (SemEval2014), contains customers' reviews of products/services in the restaurant domain (2 K for training and 676 for testing). We used this dataset to evaluate the ACD-based and ACSC-based tasks, similar to recent empirical studies [20, 46].

For the textual contents in each dataset, we applied some simple text pre-processing steps, such as special character removal, word stemming and tokenization, sentence segmentation, etc. For complex textual processing tasks, such as extracting the POS tag of each word and constructing the dependency tree of text documents, we mainly used the Stanford CoreNLP library 8 . Tables 2 and 3 show the general statistics, after the pre-processing steps, of the datasets used for our document/sentence-level and aspect-level experiments, respectively.

Table 2
General statistics of datasets which are used for document/sentence-level experiments

Dataset	Dataset usage			Avg. document length	Sentiment class
	Training	Testing	Validation
SST	8,544	1,101	2,210	21.62	5
AR	1,589,221	338,000	682,000	168.63	5
MR	8,534	1,078	1,050	23.37	2
Yelp	564,000	78,000	62,000	156.72	5
IMDb	22,500	2,500	25,000	281.35	2

Table 3

General statistics of datasets which are used for aspect-level experiments

Dataset	Sentences/documents		Terms		Categories		Sentiment class
	Training	Testing	Training	Testing	Training	Testing
Laptop14	3,045	800	2,358	654	–	–	3
Resta14	3,041	800	3,693	1,134	3,711	1,025	3
Resta16	2,000	676	–	–	2,507	859	3

5.1.2 Evaluation metric & experimental settings

To evaluate the performance of each sentiment analysis model, all experimental outputs are evaluated by the accuracy (Acc.) and F1 metrics. For all datasets in our experiments, we applied the out-of-the-box pre-trained Word2Vec 300-dimensional embedding model 9 to initialize and obtain the word embedding vectors of each document. For the pre-trained BERT implementation, we used the original large and uncased version 10 for the initial setup and fine-tuning processes. Table 4 shows the general parameter settings of our proposed SE4SA which are used in all experiments of our paper.

Table 4
General settings for our proposed SE4SA model

Parameter	Value
Number of GCN layers for the CoSynEmb-based textual embedding strategy	5
Number of LSTM-based cells for Bi-LSTM architectures	256
Number of training epochs	300
Training batch size of the MLP classification layer training process	32
The general learning rate (η) for SGD of all neural network-based architectures	0.001
The general coefficient of L2-regularization of the MLP classification layer	10⁵

5.1.3 Comparative baselines

To demonstrate the effectiveness of our proposed SE4SA model, we also implemented several well-known sentiment analysis baselines for comparative studies, which are:

General pre-trained BERT [12]: is the original BERT-based textual representation learning approach of Devlin et al. [12]. BERT is a sentence-level text representation technique which learns the sequential representations of words/sentences at the local context level. For the experiments in this paper, we implemented the large/uncased pre-trained BERT version with the masked mechanism for the sentiment classification task.

BERT-PT [13]: is an early study of applying pre-trained BERT to the sentiment analysis task. In the BERT-PT model, Xu, H. et al. proposed sentiment-specific pre/post-training approaches on the masked language pre-trained BERT encoding mechanism to improve the performance of the BERT-based fine-tuning procedure for the sentiment analysis task.

TransBERT [28]: is a proposal for integrating transferable supervised knowledge with the pre-trained BERT-based textual representation learning process. TransBERT is able not only to transfer supervised knowledge of a given language source from large-scale unlabeled training data but also to effectively convey various kinds of semantic relatedness of texts for multiple NLP tasks, including sentiment classification.

SentiBERT [30]: is also a well-known variant of the pre-trained BERT model which is designed for the sentiment analysis task. SentiBERT incorporates attention-based contextual latent word embedding vectors with the binary constituency parse tree to effectively preserve the semantic latent representations of sentiment aspects in a given sentence/document.

ASGCN [34]: is a GCN-based text representation learning approach for capturing the syntactical structure relationships between words in texts. In the ASGCN model, Zhang, C. proposed using a multi-layered GCN architecture to encode the long-range dependency relationships between words over the parsed dependency tree. However, ASGCN is still limited in how thoroughly it evaluates the sequential relationships between words/sentences in texts.

SentiLARE [16]: is the main competitor of our proposed SE4SA model. In the SentiLARE model, Ke, P. et al. proposed integrating an external knowledge source with the pre-trained BERT model to improve the performance of multiple downstream tasks in the sentiment analysis domain. In the original implementation of the SentiLARE model, Ke, P. et al. [16] used SentiWordNet as the main knowledge source for the masked language BERT-based pre/post-training procedures. Extensive experiments on multiple standard datasets demonstrated that SentiLARE outperforms recent sentiment analysis baselines.

For the general configurations of the models listed above, we applied the same settings which are presented in Table 4. For the model-specific parameters, we used the configurations described in the original works, under which these models achieved their highest accuracy performance.

5.2 Experimental outputs & discussions

5.2.1 Sentence/document-level sentiment classification task

We first evaluated the performance of the different sentiment analysis models on the sentence/document-level task. Tables 5 and 6 show the experimental outputs of the different techniques in terms of the Precision (P), Recall (R) and F1 standard metrics on the benchmark datasets. The experimental results show that our proposed SE4SA model performs better than recent state-of-the-art baselines on the sentence/document-level sentiment classification task, achieving the highest F1 score on all five datasets and thus indicating the effectiveness of our proposals in this paper. In more detail, our proposed SE4SA model outperforms the general pre-trained BERT, TransBERT, BERT-PT and SentiBERT by 14.12%, 12.55%, 8.55% and 6.46% on average across all benchmark datasets, respectively. Against our main competitors, ASGCN and SentiLARE, our proposed SE4SA also achieves slightly better performance, by approximately 4.71% and 1.66%, respectively.

Table 5
Experimental outputs for the sentence/document-level sentiment analysis task via different models in terms of Precision (P), Recall (R) and F1 metrics within SST, AR and MR datasets

Model	SST			AR			MR
	P	R	F1	P	R	F1	P	R	F1
BERT	0.47821	0.48611	0.48213	0.56892	0.58248	0.57562	0.82021	0.85829	0.83882
BERT-PT	0.51921	0.50559	0.51231	0.61891	0.65771	0.63772	0.83871	0.86759	0.85291
TransBERT	0.46872	0.47559	0.47213	0.60811	0.63072	0.61921	0.85671	0.82803	0.84213
SentiBERT	0.52867	0.53300	0.53083	0.65081	0.66682	0.65872	0.86972	0.87612	0.87291
ASGCN	0.53871	0.60112	0.56821	0.63781	0.62083	0.62921	0.88432	0.87415	0.87921
SentiLARE	0.55821	0.59622	0.57659	0.67891	0.68538	0.68213	0.90551	0.93252	0.91882
SE4SA	0.58681	0.57752	0.58213	0.72091	0.70356	0.71213	0.92678	0.92568	0.92623

Table 6

Experimental outputs for the sentence/document-level sentiment analysis task via different models in terms of Precision (P), Recall (R) and F1 metrics within Yelp and IMDb datasets

Model	Yelp			IMDb
	P	R	F1	P	R	F1
BERT	0.65781	0.62718	0.64213	0.91891	0.90544	0.91213
BERT-PT	0.66781	0.67651	0.67213	0.90921	0.93357	0.92123
TransBERT	0.64871	0.61637	0.63213	0.89671	0.90887	0.90275
SentiBERT	0.67819	0.64681	0.66213	0.92781	0.95689	0.94213
ASGCN	0.67891	0.70587	0.69213	0.96781	0.95076	0.95921
SentiLARE	0.69921	0.68802	0.69357	0.97819	0.95941	0.96871
SE4SA	0.71781	0.68712	0.70213	0.98982	0.97278	0.98123
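The F1 values reported in Tables 5 and 6 appear consistent with the usual harmonic mean of precision and recall. The short sketch below recomputes it for two rows of Table 5; the function name `f1_score` is illustrative, and the check is a reading aid rather than part of the original evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from Table 5 (SST columns).
bert_sst_f1 = f1_score(0.47821, 0.48611)
se4sa_sst_f1 = f1_score(0.58681, 0.57752)
```

Both recomputed values agree with the tabulated F1 scores (0.48213 and 0.58213) to within rounding.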

5.2.2 Aspect-level sentiment classification task

For the experiments related to aspect-level sentiment analysis, we conducted four main downstream subtasks: aspect term extraction (ATE), aspect term sentiment classification (ATSC), aspect category detection (ACD) and aspect category sentiment classification (ACSC). Table 7 shows the general statistics of the SemEval2014 (Laptop14 and Resta14) and SemEval2016 (Resta16) datasets used for these subtasks with the different textual embedding methods.

Table 7
General dataset statistics for different aspect-level sentiment analysis subtasks with different models

Dataset	ATE Task			ATSC Task			ACD Task			ACSC Task
	Train	Test	Val	Train	Test	Val	Train	Test	Val	Train	Test	Val
Laptop14	1,338	150	422	2,163	150	638	–	–	–	–	–	–
Resta14	1,871	150	606	3,452	150	1,120	2,891	150	800	3,366	150	973
Resta16	–	–	–	–	–	–	1,850	150	676	2,150	150	751

Tables 8 and 9 present the experimental results for the different aspect-level sentiment analysis subtasks in terms of the accuracy and F1 metrics. The experimental outputs show that our proposed SE4SA model outperforms recent text embedding baselines on multiple aspect-level sentiment analysis subtasks, including ATE, ATSC, ACD and ACSC. For the aspect-specific term analysis subtasks (ATE and ATSC), our proposed SE4SA outperforms the general pre-trained BERT, BERT-PT, TransBERT and SentiBERT by about 8.74%, 8.22%, 4.55% and 2.94%, respectively, in terms of the F1 metric on both the Laptop14 and Resta14 datasets. For the category-aspect subtasks (ACD and ACSC), our proposed model also achieves better performance, by about 12.73% (general pre-trained BERT), 12.59% (BERT-PT), 7.45% (TransBERT) and 9.32% (SentiBERT). Compared with our main competitors in this paper, ASGCN and SentiLARE, the proposed SE4SA also performs better, by about 3.77% / 3.72% and 2.16% / 1.72% (term-specific / category-specific tasks), respectively, in terms of the F1 metric on all datasets.

Table 8

Experimental outputs for aspect-level sentiment analysis subtasks: ATE and ATSC via different models in terms of Accuracy (Acc), Precision (P), Recall (R) and F1 metrics

Model	ATE
	Laptop14				Resta14
	Acc	P	R	F1	Acc	P	R	F1
BERT	0.91023	0.78921	0.80742	0.79821	0.89812	0.83627	0.806992	0.82137
BERT-PT	0.86782	0.77621	0.74603	0.76082	0.91671	0.85672	0.824362	0.84023
TransBERT	0.87812	0.81672	0.78439	0.80023	0.92821	0.85722	0.858201	0.85771
SentiBERT	0.89281	0.80732	0.81953	0.81338	0.94092	0.86672	0.867701	0.86721
ASGCN	0.92012	0.77821	0.80135	0.78961	0.94921	0.86627	0.864153	0.86521
SentiLARE	0.94082	0.81678	0.83617	0.82636	0.95213	0.89671	0.87697	0.88673
SE4SA	0.95782	0.84671	0.81998	0.83313	0.96078	0.90561	0.893018	0.89927
ATSC
BERT	0.75821	0.68782	0.676533	0.68213	0.85672	0.74672	0.73148	0.73902
BERT-PT	0.78093	0.70821	0.688488	0.69821	0.87784	0.76891	0.74392	0.75621
TransBERT	0.81921	0.72819	0.70367	0.71572	0.83921	0.76672	0.81287	0.78912
SentiBERT	0.86092	0.76567	0.736533	0.75082	0.82081	0.79871	0.76369	0.78081
ASGCN	0.84672	0.74672	0.732457	0.73952	0.83019	0.80782	0.77681	0.79201
SentiLARE	0.85921	0.73561	0.744868	0.74021	0.82823	0.79881	0.76821	0.78321
SE4SA	0.86723	0.78921	0.744513	0.76621	0.84029	0.81982	0.79675	0.80812
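As a reading aid for the P, R and F1 columns, the sketch below recomputes F1 as the harmonic mean of precision and recall for one row of Table 8. That the reported F1 values follow this standard definition is an assumption, checked here only for the row shown; `f1_score` is an illustrative helper, not a function from the original implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# BERT on the ATE subtask, Laptop14: P, R and reported F1 from Table 8.
p, r, reported_f1 = 0.78921, 0.80742, 0.79821
recomputed_f1 = f1_score(p, r)
print(f"recomputed F1 = {recomputed_f1:.5f}, reported F1 = {reported_f1:.5f}")
```

The same check can be applied to any other row of Tables 8 and 9 by substituting its P and R values.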

Table 9

Experimental outputs for aspect-level sentiment analysis subtasks: ACD and ACSC via different models in terms of Accuracy (Acc), Precision (P), Recall (R) and F1 metrics

Model	ACD
	Resta14				Resta16
	Acc	P	R	F1	Acc	P	R	F1
BERT	0.94678	0.91278	0.90566	0.90921	0.78921	0.67891	0.69982	0.68921
BERT-PT	0.92689	0.88978	0.87555	0.88261	0.77921	0.72897	0.70974	0.71923
TransBERT	0.95782	0.90821	0.92539	0.91672	0.80384	0.73891	0.73452	0.73671
SentiBERT	0.91586	0.86781	0.85001	0.85882	0.82481	0.75867	0.72276	0.74028
ASGCN	0.94031	0.90781	0.87829	0.89281	0.86821	0.80818	0.80806	0.80812
SentiLARE	0.97761	0.92891	0.89207	0.91012	0.88921	0.81789	0.80282	0.81029
SE4SA	0.98912	0.94081	0.93743	0.93912	0.89821	0.83891	0.81910	0.82889
	ACSC
BERT	0.82921	0.76891	0.76951	0.76921	0.80213	0.72881	0.71797	0.72335
BERT-PT	0.84056	0.79821	0.76668	0.78213	0.81682	0.71182	0.70982	0.71082
TransBERT	0.88623	0.81281	0.82770	0.82019	0.82923	0.77621	0.76233	0.76921
SentiBERT	0.89023	0.83878	0.83964	0.83921	0.81023	0.75678	0.74179	0.74921
ASGCN	0.92662	0.87892	0.87471	0.87681	0.85291	0.78672	0.77874	0.78271
SentiLARE	0.93081	0.89671	0.87599	0.88623	0.89672	0.81927	0.81843	0.81885
SE4SA	0.95671	0.89892	0.88698	0.89291	0.90589	0.82989	0.81781	0.82381

Overall, the experimental results on both sentence/document-level and aspect-level sentiment analysis tasks demonstrate the effectiveness of our proposed SE4SA model and indicate the potential of combining sequential and syntactical structure representation learning in a textual embedding approach for sentiment analysis.

5.2.3 Studies on model’s parameter sensitivity

In this section, we present empirical studies on the sensitivity of our proposed SE4SA model to its setup parameters, including the dimensionality of the embedding vector (d), the number of LSTM-based cells in the Bi-LSTM encoders (h), the number of training epochs and the number of layers of the GCN-based syntactical encoder in our CoSynEmb-based text embedding strategy. For these experiments, we selected two large-scale datasets, Amazon Reviews (AR) and Yelp, each of which has more than 500 K documents. SE4SA is run on the sentiment analysis task on these two datasets with different values of the evaluated parameter while the other parameters are kept fixed.

As shown by the experimental outputs in Figs. 4 and 5, our proposed SE4SA model is quite insensitive to the embedding dimensionality and to the number of LSTM-based cells used in the Bi-LSTM architectures. In more detail, for datasets of different sizes, the value of (d) needs to be greater than about 200 to achieve the highest performance in terms of F1 on both the AR and Yelp datasets. Similarly, we varied the value of (h) from 10 to 300 and observed the changes in the performance of our proposed model. The experimental outputs show that SE4SA is insensitive to the number of LSTM-based cells: once (h) exceeds 250, the performance of the model becomes stable on both the AR and Yelp datasets. From these outputs we can assume that the values of (d) and (h) should be aligned with the size of the evaluated dataset in order to sufficiently capture latent features from texts. Figure 6 shows our extended experimental studies on the number of training epochs and the number of GCN-based layers configured for our SE4SA model. The model needs a different number of epochs to converge on each dataset: more than 250 for AR and more than 200 for Yelp. This indicates that our model needs more training epochs to reach convergence on a larger dataset. In contrast, increasing the number of GCN-based layers used for the syntactical embedding shows the opposite behavior. As shown in Fig. 6, our proposed model achieves its highest performance when the number of GCN-based layers is in the range [5, 6] for the AR dataset and [3, 5] for the Yelp dataset. For both datasets, increasing this parameter beyond 6 significantly degrades the overall performance of SE4SA in terms of F1. Thus, the SE4SA model is quite sensitive to the number of GCN-based layers used in the CoSynEmb-based word embedding strategy.
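The protocol described above (vary one parameter, keep the others fixed) can be sketched as a one-at-a-time sweep. This is only an illustration under stated assumptions: the base setting, the grids (apart from the 10 to 300 range of h) and the function `placeholder_evaluate` are hypothetical, and the placeholder returns a dummy value rather than training SE4SA.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def one_at_a_time_sweep(
    base: Config,
    grids: Dict[str, List[int]],
    evaluate: Callable[[Config], float],
) -> Dict[str, List[Tuple[int, float]]]:
    """Vary each parameter over its grid while keeping the others at `base`."""
    results: Dict[str, List[Tuple[int, float]]] = {}
    for name, values in grids.items():
        curve = []
        for value in values:
            config = dict(base)   # copy so the base setting is never mutated
            config[name] = value  # change only the parameter under study
            curve.append((value, evaluate(config)))
        results[name] = curve
    return results


# Hypothetical base setting and grids (assumptions for illustration);
# only the range of h, from 10 to 300, is taken from the text.
BASE = {"d": 300, "h": 250, "epochs": 250, "gcn_layers": 5}
GRIDS = {
    "d": [50, 100, 200, 300],
    "h": [10, 50, 100, 200, 250, 300],
    "epochs": [50, 100, 150, 200, 250, 300],
    "gcn_layers": [1, 2, 3, 4, 5, 6, 7, 8],
}


def placeholder_evaluate(config: Config) -> float:
    """Stand-in for 'train the model with `config` and return its F1'.

    It returns a dummy constant; replace it with a real training and
    evaluation routine to obtain meaningful curves.
    """
    return 0.0


curves = one_at_a_time_sweep(BASE, GRIDS, placeholder_evaluate)
for name, curve in curves.items():
    print(name, [value for value, _ in curve])
```

Each entry of `curves` then corresponds to one sensitivity curve of the kind plotted in Figs. 4 to 6, with the swept parameter on the horizontal axis.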

Fig. 4

The effect of the dimensionality of the embedding vector (d) on the overall accuracy performance of our proposed SE4SA model.

Fig. 5

The effect of the number of LSTM-based cells (h) of the Bi-LSTM encoder on the overall accuracy performance of our proposed SE4SA model.

Fig. 6

The influence of the number of training epochs and of GCN-based layers on the overall accuracy performance of our proposed SE4SA model.

6 Conclusions & future works

In this paper, we present a novel approach, called SE4SA, which integrates a graph convolutional network with a pre-trained BERT model to jointly capture the sequential semantic and syntactical structural representations of text, thereby improving the accuracy of multiple downstream tasks in sentiment analysis. In SE4SA, we first propose a novel GCN-based textual embedding technique, called CoSynEmb, which preserves both the global syntactical structure and the contextual co-referencing relationships of words in a given document. The semantically rich CoSynEmb-based word representations are then fed into a pre-trained BERT-based sentence-level encoder with a masked language mechanism to effectively learn continuous representations of words and sentences. Inspired by previous models, SE4SA can also be integrated with external sentiment lexicons such as SentiWordNet to strengthen its ability to capture sentiment polarity through pre-training tasks with the BERT-based architecture. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed model in comparison with recent state-of-the-art baselines. In future work, we intend to integrate the semantic representations of texts with other information sources, such as social network structure, in order to further improve the performance of sentiment analysis.

Declarations

This study was funded by Thu Dau Mot University, Binh Duong, Vietnam.

Footnotes

Acknowledgments

This research is funded by Thu Dau Mot University, Binh Duong, Vietnam.


SST dataset:

AR dataset:

MR dataset:

Yelp Dataset Challenge:

IMDb dataset:

SemEval2014 (Task 4) dataset:

SemEval2016 (Task 5) dataset:

Stanford CoreNLP library:

Pre-trained Word2Vec (300-dimensional):

Pre-trained BERT:
