Lexical attention and aspect-oriented graph convolutional networks for aspect-based sentiment analysis

Abstract

Aspect-based sentiment analysis aims to predict the sentiment polarity of each aspect mentioned in a text. Previous work has used Graph Convolutional Networks (GCN) to encode syntactic dependencies and thereby exploit syntactic information, but, because of the complexity of language and the diversity of aspects, these models tend to confuse the opinion words that belong to different aspects. Moreover, the effect of a word’s lexical property (its part of speech) on the sentiment polarity of an aspect has not been considered in previous studies. In this paper, we propose a lexical attention and aspect-oriented GCN to address both problems. First, we construct an aspect-oriented dependency-parsed tree by analyzing and pruning the dependency-parsed tree of the sentence. We then use a lexical attention mechanism to focus on the features of words whose lexical properties play a key role in determining sentiment polarity. Finally, we extract the aspect-oriented, lexically weighted features with a GCN. Extensive experimental results on three benchmark datasets demonstrate the effectiveness of our approach.

Keywords

Sentiment analysis, GCN, lexical attention, dependency parsing

1 Introduction

Traditional sentiment analysis focuses on inferring the sentiment polarity of an entire sentence. In contrast, aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that identifies the sentiment polarity (e.g., negative, neutral, or positive) of a given aspect. A sentence may contain several aspects, each of which may have a different sentiment polarity. ABSA has many practical applications. Applied to product reviews, for instance, it can extract users’ attitudes towards different aspects of a product, providing manufacturers with a more granular reference for improving their products. An aspect is usually an entity. For example, in the sentence “I love Windows 7 which is a vast improvement over Vista.”, the aspects are “Windows 7” and “Vista”, whose sentiment polarities are positive and negative, respectively.

Intuitively, the key to the task is to relate each aspect to its opinion words in order to infer the sentiment polarity of the aspect. In recent years, deep learning has been widely used in ABSA tasks, especially Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN). Tang et al. [1] used two LSTMs to extract the features on the left and right of the aspect and fed the concatenated hidden vectors of the two LSTMs into a softmax classifier. Xue et al. [2], on the other hand, used a CNN. The attention mechanism is also widely used in ABSA tasks: Tang et al.; Wang et al.; Li et al.; Fan et al.; Du et al. [3–7] all used attention to perform aspect sentiment classification. However, these methods largely ignore the grammatical structure of the sentence and tend to use non-opinion words as features for judging the sentiment polarity of an aspect. For this reason, graph neural networks have recently been widely used in ABSA tasks: researchers use the dependency-parsed tree of a sentence to obtain its adjacency matrix and then extract features with a GCN [8]. Bai et al. [9] designed a relational graph attention network that integrates typed syntactic dependency information.

Most graph neural network-based approaches to aspect sentiment analysis encode the dependency-parsed tree of the entire sentence, and few researchers have converted it into an aspect-oriented dependency-parsed tree. In addition, since existing dependency parsers are not aspect-oriented, they tend to confuse the opinion words of different aspects when a sentence contains two or more aspects. Consider the sentence “The price is so expensive but the service is timely and abundant.” Its dependency-parsed tree is shown in Figure 1, where the subscript of each word is its lexical property. The dependency tree incorrectly connects the aspect word “price” with the opinion words of the aspect word “service”. Furthermore, to our knowledge, the lexical properties of words have not been used as a basis for judging aspect sentiment polarity in previous studies.

Fig. 1

The dependency-parsed tree for “The price is so expensive but the service is timely and abundant.”

By analyzing the lexical properties of the words in the datasets, we found that most of the opinion words of aspects are verbs, adverbs or adjectives. Table 1 shows example sentences and the lexical property of each word. To further support the claim that the opinion words that decide the sentiment polarity of an aspect consist mainly of verbs, adverbs and adjectives, i.e., of the tags {JJ, JJR, JJS, RB, RBR, RBS, VB, VBD, VBG, VBN, VBP, VBZ}, we randomly selected 100 data items from each of the Restaurant and Laptop datasets [10] and counted the lexical properties of the words that make up the opinion words. The results are shown in Table 2. Although a few other lexical properties occur among the opinion words, verbs, adverbs and adjectives make up the large majority. We therefore infer that the opinion words that determine the sentiment polarity of an aspect consist mainly of verbs, adverbs and adjectives.

Table 1

Examples of sentences and their corresponding lexical properties

Sentence	The mushroom was rather over cooked and dried but the chicken was fine .
Pos	‘DT’ ‘NN’ ‘VBD’ ‘RB’ ‘RB’ ‘VBN’ ‘CC’ ‘VBN’ ‘CC’ ‘DT’ ‘NN’ ‘VBD’ ‘JJ’ ‘.’
Sentence	But the staff was so horrible to us.
Pos	‘CC’ ‘DT’ ‘NN’ ‘VBD’ ‘RB’ ‘JJ’ ‘IN’ ‘PRP’ ‘.’
Sentence	It is easy to use, has great screen quality, and every so light weight.
Pos	‘PRP’ ‘VBZ’ ‘JJ’ ‘TO’ ‘VB’ ‘,’ ‘VBZ’ ‘JJ’ ‘NN’ ‘NN’ ‘,’ ‘CC’ ‘DT’ ‘RB’ ‘JJ’ ‘NN’ ‘.’
Sentence	Nice atmosphere, the service was very pleasant and the desert was good.
Pos	‘JJ’ ‘NN’ ‘,’ ‘DT’ ‘NN’ ‘VBD’ ‘RB’ ‘JJ’ ‘CC’ ‘DT’ ‘NN’ ‘VBD’ ‘JJ’ ‘.’
Sentence	Its fast, easy to use and it looks great.
Pos	‘PRP’ ‘JJ’ ‘,’ ‘JJ’ ‘TO’ ‘VB’ ‘CC’ ‘PRP’ ‘VBZ’ ‘JJ’ ‘.’
Sentence	I love the keyboard and the screen.
Pos	‘PRP’ ‘VBP’ ‘DT’ ‘NN’ ‘CC’ ‘DT’ ‘NN’ ‘.’

Table 2

The number of lexical occurrences of opinion words in 100 sentences extracted from the Restaurant and Laptop datasets, respectively

	JJ	JJR	JJS	RB	RBR	RBS	VB	VBD	VBG	VBN	VBP	VBZ	DT	NN	NNS	IN	TO
Restaurant	133	2	3	70	10	5	4	40	12	6	2	59	1	2	0	1	0
Laptop	121	4	2	43	2	1	20	17	2	23	3	8	2	3	1	1	2
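The share of adjective, adverb and verb tags among the opinion words can be checked directly from the counts in Table 2. The short sketch below does only that arithmetic; the names `TAGS`, `COUNTS`, `MAJOR` and `major_share` are illustrative and not part of the original work.

```python
# Counts copied from Table 2 (lexical properties of opinion words in
# 100 sampled sentences per dataset), in the column order of the table.
TAGS = ["JJ", "JJR", "JJS", "RB", "RBR", "RBS", "VB", "VBD", "VBG",
        "VBN", "VBP", "VBZ", "DT", "NN", "NNS", "IN", "TO"]
COUNTS = {
    "Restaurant": [133, 2, 3, 70, 10, 5, 4, 40, 12, 6, 2, 59, 1, 2, 0, 1, 0],
    "Laptop": [121, 4, 2, 43, 2, 1, 20, 17, 2, 23, 3, 8, 2, 3, 1, 1, 2],
}
# Adjective, adverb and verb tags.
MAJOR = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS",
         "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def major_share(dataset: str) -> float:
    """Fraction of counted opinion-word tags that are adjectives, adverbs or verbs."""
    counts = COUNTS[dataset]
    major = sum(c for tag, c in zip(TAGS, counts) if tag in MAJOR)
    return major / sum(counts)


for name in COUNTS:
    print(f"{name}: {major_share(name):.3f}")
# Restaurant: 0.989  (346 of 350)
# Laptop: 0.965      (246 of 255)
```

In both samples, more than 96% of the counted tags fall into the adjective, adverb and verb group.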

In this paper, we propose a lexical attention and aspect-oriented GCN for aspect-based sentiment analysis. First, we use the Biaffine Parser [11] to obtain the dependency-parsed tree of each sentence together with the lexical property of each word. We then restructure the original dependency-parsed tree into an aspect-oriented dependency-parsed tree, which relates aspects to opinion terms more directly and focuses on the connection between an aspect and its potential opinion words. The lexical attention mechanism focuses on verbs, adjectives and adverbs, and the GCN encodes the new dependency-parsed tree with lexical attention. Finally, the global features and aspect-oriented features are fused using a multi-head self-attention mechanism. We name this model LA-GCN, and extensive experimental results on three benchmark datasets demonstrate its effectiveness.

The contributions of this work include:

We propose a new aspect-oriented parse tree that reduces the impact of non-opinion words on aspects by pruning nodes.

We propose lexical attention mechanisms to focus on features that play a key role in sentiment polarity judgments.

We conduct ablation experiments on LA-GCN to assess the significance and effectiveness of each part of its architecture.

2 Related work

In recent years, a variety of approaches have been proposed for the ABSA task, and most researchers learn features with deep learning methods. In this section, we focus on work related to aspect-level sentiment analysis from the perspective of deep learning.

Since neural network-based approaches do not require hand-crafted features and can learn semantic information about the text through training, they have received increasing attention from scholars studying natural language processing, including the ABSA task. Dong et al. [12] proposed the Adaptive Recurrent Neural Network (AdaRNN) for target-dependent Twitter sentiment classification. AdaRNN adaptively propagates the sentiments of words to aspects based on the context and the syntactic relationships between them. Tang et al. [1] proposed two models, TD-LSTM and TC-LSTM, to capture the relationship between aspects and contextual features. TD-LSTM uses two LSTMs to model the aspect with its left context and the aspect with its right context, respectively, and the hidden vectors of the two LSTMs are concatenated and fed into a softmax classifier. TC-LSTM concatenates the average of the aspect word vectors with the word vector of each word in the sentence and then performs the same operation as TD-LSTM. Nguyen et al.; Wang et al.; Ma et al. [13–15] all used RNNs for the ABSA task. Cheng et al. [16] employed a multi-attention mechanism to capture sentiment features that are far from the aspect, resulting in greater robustness to irrelevant information. Xue et al. [2] proposed a model based on a CNN and a gating mechanism.

Since the introduction of the Transformer [17] in 2017, the Transformer-based pre-trained models GPT [18], GPT-2 [19], GPT-3 [20], BERT [21], and XLNet [22] have been widely used in natural language processing tasks. Among them, BERT has become a research hotspot for ABSA and has achieved good results. Xu et al. [23] explored a novel post-training method on BERT to improve its fine-tuning performance on Review Reading Comprehension (RRC), which can also be adapted to aspect-based sentiment analysis. Song et al. [24] proposed an attentional encoder network (AEN) that eschews recurrence and uses attention-based encoders to model the relationship between the context and the aspect, and also applied a pre-trained BERT to this task. BERT was also used in the Category Name Embedding network (CNE-net) proposed by Dai et al. [25]. LCF-BERT [26] is also a BERT-based post-training method.

In recent years, GCNs combined with dependency trees have proven effective in ABSA tasks. Zhang et al.; Sun et al.; Chen et al. [8, 28] proposed to build a GCN on the dependency tree of a sentence to exploit syntactic information and word dependencies. Tang et al. [29] proposed a dependency graph enhanced dual-transformer network (DGEDT). Liang et al. [30] leveraged syntactic knowledge (dependencies and their types) with a dependency-embedded graph convolutional network (DREGCN). However, these approaches usually ignore the lexical properties of words and do not construct an aspect-oriented dependency tree.

3 Preliminary

3.1 Dependency parsing

The syntactic structure of a sentence can be revealed by dependency parsing. The result of dependency parsing the sentence “The mushroom was rather over cooked and dried but the chicken was fine.” is shown in Figure 2, where the label under each word indicates its lexical property. The dependency tree is not rooted at the aspect. Moreover, when a sentence contains two or more aspects, general dependency parsers tend to confuse the connections between the opinion words of different aspects. We observed that most of the false connections run from the root to the node farthest from the root. In this sentence, the connection cooked→fine is incorrect in the aspect sentiment analysis scenario.

Fig. 2

The dependency-parsed tree for “The mushroom was rather over cooked and dried but the chicken was fine.”

3.2 Dependency pruning

Based on the above observations, we propose a method for constructing aspect-oriented dependency trees. By pruning the incorrect connections and reshaping the original dependency tree, only the nodes that are linked to that aspect as well as the lexical properties are retained.

Algorithm 1 describes the pruning process. Given an aspect $A = \{w_{i}^{a}, w_{i + 1}^{a}, \dots, w_{m}^{a}\}$ and a sentence $T = \{w_{1}^{t}, w_{2}^{t}, \dots, w_{n}^{t}\}$, a dependency parse of T yields a dependency-parsed tree D, dependency relations R and the lexical categories of the words LA. Here num-aspect is the number of aspects, r_Qj denotes a dependency between node Q and node j, and, for a relation r, r[1] is the position of the node to which a node is connected and r[2] is the position of that node itself. The dependency tree is pruned in three main steps. First, for sentences with more than one aspect, we traverse the original dependency tree D, obtain the position of the root node, find the node N that is connected to the root and has the largest relative distance from it, and remove its connection to the root. Second, we traverse the resulting dependency tree with the aspect as the new root node Q: every node that is associated with Q is connected directly to Q and keeps its lexicality, while the lexicality of nodes not related to Q is set to non. Third, we return the final dependency tree $\hat{D}$ and the new lexical properties $\hat{LA}$.

Algorithm 1

Dependency Pruning

Require: aspect $A = \{w_{i}^{a}, w_{i + 1}^{a}, \dots, w_{m}^{a}\}$, sentence $T = \{w_{1}^{t}, w_{2}^{t}, \dots, w_{n}^{t}\}$, dependency-parsed tree D, dependency relations R, lexical properties LA and num-aspect

Ensure: aspect-oriented dependency tree $\hat{D}$ and new lexical properties $\hat{LA}$

if num-aspect > 1 then
    get the position of the root node in D
    max-distance = 0
    for r in R do
        if r[1] == root then
            distance = abs(r[2] - r[1])
            if distance > max-distance then
                max-distance = distance
                N = r[2]
            end if
        end if
    end for
    delete r_rootN
    construct the root Q for $\hat{D}$
    for j = 1 to n do
        if $w_{j}^{t}$ and Q have a dependency then
            $w_{j}^{t} \overset{r_{Qj}}{\leftarrow} Q$
            $\hat{LA}$[j] = LA[j]
        else
            $\hat{LA}$[j] = non
        end if
    end for
else
    $\hat{D}$ = D
    $\hat{LA}$ = LA
end if
return $\hat{D}$, $\hat{LA}$

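The pruning procedure of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors’ implementation. It assumes that relations are given as (head, dependent) pairs of 1-based token positions, that a word “has a dependency with Q” when it is still reachable from an aspect token after the root’s longest edge is removed, and that Q is placed at the first aspect token. The function name `prune_for_aspect` and the toy parse are hypothetical.

```python
from collections import defaultdict, deque


def prune_for_aspect(relations, root, aspect, pos_tags, num_aspects):
    """Return (edges from the aspect root Q, new lexical properties).

    relations: list of (head, dependent) 1-based positions (assumed format).
    aspect: (start, end) inclusive 1-based positions of the aspect.
    """
    if num_aspects <= 1:
        # Single-aspect sentences keep the original tree and tags.
        return list(relations), list(pos_tags)

    # Step 1: remove the root's edge to its most distant dependent.
    kept = list(relations)
    root_edges = [(h, d) for h, d in relations if h == root]
    if root_edges:
        kept.remove(max(root_edges, key=lambda e: abs(e[1] - e[0])))

    # Step 2: find every token still connected to the aspect
    # (edges are treated as undirected for this search).
    graph = defaultdict(set)
    for h, d in kept:
        graph[h].add(d)
        graph[d].add(h)
    start, end = aspect
    aspect_positions = set(range(start, end + 1))
    seen = set(aspect_positions)
    queue = deque(aspect_positions)
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)

    # Connect Q directly to each related token; unrelated tokens get "non".
    edges = [(start, j) for j in sorted(seen - aspect_positions)]
    new_tags = [pos_tags[j - 1] if j in seen else "non"
                for j in range(1, len(pos_tags) + 1)]
    return edges, new_tags


# Hypothetical toy parse of "price is expensive but service is timely".
tags = ["NN", "VBZ", "JJ", "CC", "NN", "VBZ", "JJ"]
rels = [(3, 1), (3, 2), (3, 7), (7, 4), (7, 5), (7, 6)]  # root is token 3
print(prune_for_aspect(rels, root=3, aspect=(1, 1), pos_tags=tags, num_aspects=2))
# ([(1, 2), (1, 3)], ['NN', 'VBZ', 'JJ', 'non', 'non', 'non', 'non'])
```

In the toy parse, removing the edge from the root (token 3) to its farthest dependent (token 7) separates the words around “price” from those around “service”, which is the effect the pruning step is meant to achieve.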

The results of pruning the sentence “The mushroom was rather over cooked and dried but the chicken was fine.” are shown in Figure 3 and Figure 4.

Fig. 3

Results after pruning of aspect “mushroom”.

Fig. 4

Results after pruning of aspect “chicken”.

Pruning of the original dependency tree allowed the model to focus more quickly and accurately on the relationship between aspects and opinion words. The lexical properties of the retained words are also used as important factors in determining the sentiment polarity of aspects.

4 Proposed methodology

We are given a context sequence $T = \{w_{1}^{t}, w_{2}^{t}, \dots, w_{n}^{t}\}$ and an aspect sequence $A = \{w_{i}^{a}, w_{i + 1}^{a}, \dots, w_{m}^{a}\}$, where A is a subsequence of T. The goal of the model is to infer the sentiment polarity P expressed towards A in T, with P ∈ {Positive, Neutral, Negative}.

The network architecture of our proposed LA-GCN is shown in Figure 5. The network consists of an embedding layer, a lexical attention layer, a graph convolutional network layer, a feature fusion layer, and an output layer. For a given input text, we first use BERT as the encoder to extract hidden contextual representations. The lexical attention layer then applies a mask operation in order to focus on opinion words, and the GCN aggregates the feature vectors of the nodes adjacent to the aspect. Finally, the feature fusion layer fuses the local features and the global features between the aspect and the context.

Fig. 5

Overall architecture of LA-GCN. $\{e_{1}^{l}, e_{2}^{l}, \dots, e_{n}^{l}\}$ represents the local feature vectors. $\{e_{1}^{g}, e_{2}^{g}, \dots, e_{n}^{g}\}$ represents the global feature vectors. MH Self-Attention: Multi-Head Self-Attention.

4.1 Embedding layer

The embedding layer uses pre-trained BERT to generate word vectors for the sequences. The input format for obtaining global features is [CLS]+T+[SEP]+A+[SEP]. The format of the input for obtaining aspect-oriented local features is [CLS]+T+[SEP].

For the input of local context and global context representations T^l and T^g, respectively, we have

$e^{l} = {BERT}^{l} (T^{l})$ (1)

$e^{g} = {BERT}^{g} (T^{g})$ (2)

BERT^l and BERT^g are the corresponding BERT-shared layer modeling for local context and global context, respectively. The global feature e^g represents the semantic relationship between the aspect and the contextual sentence. The aspect-oriented local feature e^l represents the interaction between the context and the aspect.
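The two input formats can be illustrated with plain strings. The sketch below only assembles the text with the special tokens written out; in practice a BERT tokenizer would insert these tokens itself. The helper name `build_inputs` is hypothetical.

```python
def build_inputs(sentence: str, aspect: str) -> dict:
    """Assemble the local and global input texts described in Section 4.1."""
    return {
        # Aspect-oriented local features: [CLS] + T + [SEP]
        "local": f"[CLS] {sentence} [SEP]",
        # Global features: [CLS] + T + [SEP] + A + [SEP]
        "global": f"[CLS] {sentence} [SEP] {aspect} [SEP]",
    }


inputs = build_inputs("But the staff was so horrible to us.", "staff")
print(inputs["local"])   # [CLS] But the staff was so horrible to us. [SEP]
print(inputs["global"])  # [CLS] But the staff was so horrible to us. [SEP] staff [SEP]
```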

4.2 Lexical attention layer

The lexical attention layer focuses on the features of words that play a key role in sentiment polarity judgments. Since most opinion words are adjectives, verbs, or adverbs, the set of lexical properties to focus on is M = {JJ, JJR, JJS, RB, RBR, RBS, VB, VBD, VBG, VBN, VBP, VBZ}. We classify the lexicality of a node into {major, other, non}: “major” means that the node has a dependency on the aspect and its lexical property belongs to M; “other” means that the node has a dependency on the aspect but its lexical property does not belong to M; and “non” means that the node has no dependency on the aspect.

The lexical attention layer assigns weights to the local features e^l according to the lexicality obtained by dependency pruning. If the lexicality is “major”, the features are fully preserved. If the lexicality is “other”, the features are attenuated in proportion to their relative distance from the aspect, so that the layer focuses mainly on features that are closer to the aspect. Features with lexicality “non” are masked, i.e., set to zero vectors.

Given the local feature e^l and the aspect A, the output O^la of the lexical attention layer is computed as:

$v_{j}^{w} = \begin{cases} E, & \hat{LA}_{j} = \text{major} \ \text{or} \ i \leq j \leq m \\ \left(1 - \frac{SRD_{j}}{n}\right) \cdot E, & \hat{LA}_{j} = \text{other} \\ O, & \hat{LA}_{j} = \text{non} \end{cases}$ (3)

$W = \{v_{1}^{w}, v_{2}^{w}, \dots, v_{n}^{w}\}$ (4)

$O^{la} = e^{l} \cdot W$ (5)

where $v_{j}^{w}$ is the mask vector for position j, E ∈ R^h is an all-ones vector, h is the length of the local feature vector, SRD_j is the relative distance from the word at position j to the aspect, i and m are the start and end positions of the aspect, $\hat{LA}_{j}$ is the lexicality of the word at position j, n is the sentence length, and O is an all-zero vector. W denotes the weight mask matrix, O^la is the output of the lexical attention layer, and “·” denotes the dot product operation of vectors.
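The weighting rule of Eqs. (3) to (5) can be sketched with plain Python lists. The sketch makes two assumptions that the text does not fix: SRD_j is taken to be the token distance from position j to the nearest aspect token, and the per-position mask vector is represented by a single scalar that multiplies every component of the feature vector. The names `MAJOR_TAGS`, `srd`, `lexical_weights` and `apply_lexical_attention` are illustrative.

```python
MAJOR_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS",
              "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def srd(j: int, start: int, end: int) -> int:
    """Assumed relative distance: tokens between position j and the aspect span."""
    if j < start:
        return start - j
    if j > end:
        return j - end
    return 0


def lexical_weights(pruned_tags, aspect):
    """One scalar weight per token, following the three cases of Eq. (3)."""
    n = len(pruned_tags)
    start, end = aspect  # 1-based, inclusive
    weights = []
    for j, tag in enumerate(pruned_tags, start=1):
        if start <= j <= end or tag in MAJOR_TAGS:
            weights.append(1.0)                          # "major" or aspect token
        elif tag == "non":
            weights.append(0.0)                          # masked
        else:
            weights.append(1.0 - srd(j, start, end) / n)  # "other"
    return weights


def apply_lexical_attention(features, weights):
    """Scale each token's feature vector by its weight (Eq. 5)."""
    return [[w * x for x in row] for row, w in zip(features, weights)]


# Pruned tags for a hypothetical 7-token sentence whose aspect is token 5.
tags = ["non", "non", "non", "CC", "NN", "VBZ", "JJ"]
weights = lexical_weights(tags, aspect=(5, 5))
print([round(w, 3) for w in weights])
# [0.0, 0.0, 0.0, 0.857, 1.0, 1.0, 1.0]
```

The conjunction at position 4 is related to the aspect but is not in M, so it is attenuated to 1 - 1/7, while the words unrelated to the aspect are masked out.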

4.3 Graph convolutional network layer

The GCN aggregates the feature vectors of neighboring nodes and propagates the information of a node to its first-order neighbors. The graph convolutional network layer takes the lexically weighted features and passes them through the GCN to obtain new features. A dependency tree with n nodes can be represented by an n×n adjacency matrix. We generate the adjacency matrix adj from the dependencies of the pruned tree, treating every dependency as undirected (adj_ij = adj_ji = 1) and adding a self-loop for each node (adj_ii = 1). The GCN layer then convolves the features of neighboring nodes to obtain new node features:

$\tilde{h}_{i}^{l + 1} = \sum_{j = 1}^{n} adj_{ij} W^{l + 1} g_{j}^{l}$ (6)

$h_{i}^{l + 1} = ReLU\left(\frac{\tilde{h}_{i}^{l + 1}}{d_{i} + 1} + b^{l + 1}\right)$ (7)

where $g_{j}^{l} \in R^{2 d_{h}}$ is the j-th token’s representation evolved from the preceding GCN layer, $h_{i}^{l + 1} \in R^{2 d_{h}}$ is the output of the current GCN layer, and $d_{i} = \sum_{j = 1}^{n} adj_{ij}$ is the degree of the i-th token in the tree. The weights $W^{l + 1}$ and bias $b^{l + 1}$ are trainable parameters.
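A single GCN layer following Eqs. (6) and (7) can be written in a few lines of dependency-free Python. This is a minimal sketch for small inputs, assuming undirected edges plus self-loops in the adjacency matrix and taking d_i as the row sum of adj, as defined above. The function names `build_adjacency` and `gcn_layer` are illustrative.

```python
def build_adjacency(n, edges):
    """n x n adjacency matrix with self-loops; edges use 1-based positions."""
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0                 # self-loop
    for a, b in edges:
        adj[a - 1][b - 1] = 1.0         # undirected edge
        adj[b - 1][a - 1] = 1.0
    return adj


def gcn_layer(adj, g, weight, bias):
    """g: n x d_in features, weight: d_in x d_out, bias: d_out."""
    n, d_out = len(adj), len(bias)
    # Linear transform of every node: W g_j
    transformed = [
        [sum(g_j[k] * weight[k][c] for k in range(len(g_j))) for c in range(d_out)]
        for g_j in g
    ]
    out = []
    for i in range(n):
        degree = sum(adj[i])            # d_i = sum_j adj_ij
        row = []
        for c in range(d_out):
            aggregated = sum(adj[i][j] * transformed[j][c] for j in range(n))
            # Normalise by (d_i + 1), add the bias, apply ReLU.
            row.append(max(0.0, aggregated / (degree + 1.0) + bias[c]))
        out.append(row)
    return out


# Three tokens; the aspect (token 1) is connected to tokens 2 and 3.
adj = build_adjacency(3, [(1, 2), (1, 3)])
h = gcn_layer(adj, g=[[1.0], [2.0], [3.0]], weight=[[1.0]], bias=[0.0])
print(h[0], h[1])  # [1.5] [1.0]
```

Token 1 aggregates all three nodes (sum 6, divided by 3 + 1), while token 2 only sees itself and token 1 (sum 3, divided by 2 + 1).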

4.4 Multi-Head self-attention

The attention function used in this paper is Scaled Dot-Product Attention, and the attention score is calculated as follows:

$Attention (Q, K, V) = Softmax \left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) \cdot V$ (8)

$Q, K, V = f (o)$ (9)

$f (o) = \begin{cases} Q = o \cdot w^{q} \\ K = o \cdot w^{k} \\ V = o \cdot w^{v} \end{cases}$ (10)

where o is the input word vector representation; Q, K and V are obtained by multiplying o by the respective weight matrices $w^{q} \in R^{d_{h} \times d_{q}}$, $w^{k} \in R^{d_{h} \times d_{k}}$ and $w^{v} \in R^{d_{h} \times d_{v}}$; d_h is the hidden layer dimension; and h is the number of attention heads, which is set to 6 in this paper.

Let H_i be the representation learned by the i-th self-attention head. Then:

$H_{i} = Attention (o \cdot w_{i}^{q}, o \cdot w_{i}^{k}, o \cdot w_{i}^{v})$ (11)

$MHSA (o) = Tanh ([H_{1}; H_{2}; \dots; H_{h}] \cdot w^{mhsa})$ (12)

where “;” denotes vector concatenation and $w^{mhsa} \in R^{h d_{v} \times d_{h}}$ is a learnable parameter.
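The multi-head self-attention of Eqs. (8) to (12) can be sketched without external libraries for small matrices. The sketch uses the standard scaled dot-product form Softmax(QK^T / sqrt(d_k))V and assumes d_q = d_k. The helper names `matmul`, `softmax_rows`, `attention` and `mhsa` are illustrative, and the identity weights in the usage example are chosen only to make the result easy to verify.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention (Eq. 8)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def mhsa(o, heads, w_out):
    """heads: list of (w_q, w_k, w_v) per head; w_out: (h * d_v) x d_h (Eqs. 11-12)."""
    outputs = [attention(matmul(o, wq), matmul(o, wk), matmul(o, wv))
               for wq, wk, wv in heads]
    # Concatenate the heads token by token, project, then apply tanh.
    concat = [[x for head in outputs for x in head[t]] for t in range(len(o))]
    return [[math.tanh(x) for x in row] for row in matmul(concat, w_out)]


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
result = mhsa(tokens, heads=[(identity, identity, identity)], w_out=identity)
print([[round(x, 3) for x in row] for row in result])
```

With identity projections, each token attends more strongly to itself than to the other token, so the largest entry of each output row sits on the diagonal.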

4.5 Feature fusion layer

The Feature Fusion Layer (FFL) is designed to interactively learn the aspect-oriented local features and the global features between aspects and contexts. If only local information is considered, some useful information is overlooked, so the MHSA operation is performed after concatenating the local features with the global features:

$O^{lg} = [O^{gcn}; e^{g}]$ (13)

$O^{all} = MHSA (O^{lg})$ (14)

where O^gcn is the output of the graph convolutional network layer, e^g is the global feature obtained by encoding the text pair consisting of the context sentence T and the aspect term A with pre-trained BERT, and “;” denotes vector concatenation.

4.6 Output layer

In the output layer, the output vector O^all of the feature fusion layer is passed through a fully connected layer to obtain the vector $\tilde{O}$. The sentiment polarity is then predicted by the Softmax layer:

$\tilde{O} = O^{all} \cdot W + b$ (15)

$y = Softmax (\tilde{O}), \quad y_{k} = \frac{\exp (\tilde{O}_{k})}{\sum_{c = 1}^{C} \exp (\tilde{O}_{c})}$ (16)

where C is the number of sentiment polarity classes, y is the sentiment polarity distribution predicted by the model, and $W \in R^{1 \times d_{h}}$ and $b \in R^{d_{h}}$ are learnable parameters.

4.7 Model training

We use the L2-regularized cross-entropy loss to train the parameters of the LA-GCN model. The loss function is defined as follows:

$L = -\sum_{i = 1}^{C} \hat{y}_{i} \log y_{i} + \lambda \sum_{\theta \in \Theta} \theta^{2}$ (17)

where C is the number of classes, $\hat{y}_{i}$ and $y_{i}$ are the ground-truth label and the predicted probability for class i, λ is the L2 regularization parameter, and Θ is the parameter set of the model. L2 regularization is effective in suppressing overfitting. In addition, we add a dropout layer to further reduce overfitting.
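As a worked illustration of the Softmax output and the regularized loss, the sketch below evaluates the loss for a single example with a one-hot ground-truth label, for which the cross-entropy term reduces to the negative log-probability of the gold class. The function names `softmax` and `regularized_loss` and all numbers are illustrative, not taken from the experiments.

```python
import math


def softmax(logits):
    """Softmax over class logits, shifted by the maximum for stability."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_loss(logits, gold_index, params, lam):
    """Cross-entropy for a one-hot label plus an L2 penalty on the parameters."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[gold_index])
    l2_penalty = lam * sum(p * p for p in params)
    return cross_entropy + l2_penalty


# Three classes with equal logits give probability 1/3 each, so the
# cross-entropy is ln(3); the toy parameters add 0.1 * (1 + 4) = 0.5.
loss = regularized_loss([0.0, 0.0, 0.0], gold_index=1, params=[1.0, 2.0], lam=0.1)
print(round(loss, 4))  # 1.5986
```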

5 Experiments

5.1 Dataset

We conducted experiments on three datasets: the Restaurant and Laptop review datasets from SemEval-2014 Task 4 [10], and the ACL 14 Twitter dataset collected by Dong et al. [12]. All aspects in these datasets are labeled with one of three sentiment polarities: positive, neutral, or negative. Statistics for the three datasets are shown in Table 3.

Table 3
Statistics of the three datasets

Dataset	Positive		Neutral		Negative	
	train	test	train	test	train	test
Restaurant	2164	728	637	196	807	196
Twitter	1561	173	3127	346	1560	173
Laptop	994	341	464	169	870	128

5.2 Parameter settings

We use the English version of BERT-base, which contains 12 hidden layers with 768 hidden units each. The vocabulary size of BERT is 30,522. We use Adam [31] as the optimizer, with an initial learning rate of 2×10^-5 and an L2 regularization coefficient of 1×10^-5. Batch shuffling is applied to the training set, and the batch size of all models is set to 16. We train our model for up to 10 epochs and repeat each experiment 10 times with random initialization. The dependency-parsed tree is obtained using the Biaffine Parser.

5.3 Model comparisons

To fully evaluate the performance of the LA-GCN model, we evaluated it on the three datasets and compared it with several baseline models, including:

TD-LSTM [1] uses two LSTMs to model the aspect with its left context and the aspect with its right context, respectively, and then concatenates the last hidden vectors of the two LSTMs and feeds them into a softmax classifier.

ATAE-LSTM [4] splices the embedding of each word with the embedding of the aspect at the embedding level to get a representation of the word associated with the aspect. The final representation is then obtained and classified using LSTM and attention.

MGAN [6] uses fine-grained and coarse-grained attention to capture word-level interactions between aspects and sentences.

BERT-PT [23] explores a novel post-training method on the popular language model BERT to improve BERT’s fine-tuning performance on Review Reading Comprehension (RRC), which can also be adapted to aspect-based sentiment analysis.

AEN-BERT [24] eschews recurrence and uses attention-based encoders to model the relationship between the context and the target.

LCF-BERT [26] proposes a Local Context Focusing (LCF) mechanism for aspect-based sentiment classification based on Multiple Head Self-Attention (MHSA), which focuses on contextual local features. The mechanism utilizes contextual feature dynamic masking (CDM) and contextual feature dynamic weighting (CDW) layers to focus more on local contextual words.

ASGCN [8] extracts aspect-related features from the data by performing dependency parsing on the data and then using a GCN.

TD-GAT [32] is a Target-Dependent Graph Attention Network for aspect-level sentiment classification that explicitly exploits the dependency relationships between words. It uses a dependency tree to propagate sentiment features directly from the syntactic context of the aspect target.

R-GAT [9] considered dependency labeling information and proposed a new relational graph attention network which integrates typed syntactic dependency information.

5.4 Results and analysis

5.4.1 Main results

Table 4 shows the overall performance of all models. LA-GCN achieves the highest accuracy and F1 among the compared models on all three datasets. Compared with the graph neural network-based ASGCN, TD-GAT and R-GAT, the overall performance of our model is substantially improved, which illustrates the effectiveness of the lexical attention and aspect-oriented GCN design. By assigning higher weights to potential opinion words through the lexical attention mechanism, the aspect-oriented GCN aggregates the features of words that have grammatical connections to the aspect, thus improving the overall performance of the model. The improvement is larger on the Restaurant and Laptop datasets and limited on the Twitter dataset, because the sentences in the Twitter dataset are less grammatical, which limits the benefit of syntax-based components. Furthermore, the BERT-based baseline models surpass most of the other ABSA models, which shows the usefulness of BERT-based pre-trained models for aspect-based sentiment analysis. Combining BERT with our proposed model improves the overall performance further.

Table 4
Model comparison results (%). Results for models that we reproduced by following the method described in the corresponding paper are marked with #.

Model	Restaurant		Twitter		Laptop	
	Acc	F1	Acc	F1	Acc	F1
TD-LSTM	75.63	-	70.8	69	68.13	-
ATAE-LSTM	77.2	-	-	-	68.7	-
MGAN	81.25	71.94	72.54	70.81	75.39	72.47
BERT-PT	84.95	76.96	-	-	78.07	75.08
AEN-BERT#	82.23	70.72	73.72	70.68	78.99	71.03
LCF-BERT-CDW#	85.91	79.12	76.2	75.15	80.21	76.20
LCF-BERT-CDM#	85.80	79.05	75.45	74.81	79.63	75.25
ASGCN#	80.89	72.17	72.35	70.55	76.05	71.17
TD-GAT	80.35	76.13	72.68	71.15	74.13	72.01
RGAT	83.55	75.99	75.36	74.15	78.02	74.00
LA-GCN	87.34	82.20	76.59	75.38	81.85	78.32

5.4.2 Ablation study

To further examine the contribution made by each component of LA-GCN to the overall performance, an ablation study was performed on LA-GCN. The results are shown in Table 5.

Table 5
Ablation study results (%). “w/o” means “without”

Model	Restaurant		Twitter		Laptop	
	Acc	F1	Acc	F1	Acc	F1
LA-GCN	87.34	82.20	76.59	75.38	81.85	78.32
LA-GCN w/o DP	85.80	79.50	76.37	74.58	80.88	77.95
LA-GCN w/o LA	86.07	80.03	75.87	74.08	80.72	77.04
LA-GCN w/o GCN	86.62	80.33	76.22	75.00	81.38	77.78
LA-GCN w/o FFL	85.39	77.41	74.81	73.84	79.80	75.61

We first investigate the impact of the dependency pruning (DP) mechanism on the overall model. When the model is trained on the unpruned dependency-parsed tree, performance drops on the Restaurant and Laptop datasets, most visibly on Restaurant (accuracy 87.34 → 85.80, F1 82.20 → 79.50). The main reason is that both the lexical attention and the aspect-oriented GCN are built on the dependency-parsed tree. If the dependency tree is not pruned, the lexical attention mechanism is likely to give high weight to opinion words from other aspects, and the GCN also aggregates this wrong information when extracting features, which degrades the performance of the model. The impact on Twitter accuracy is smaller in comparison (76.59 → 76.37), mainly because the sentences in the Twitter dataset are less grammatical.

Next, we investigate the effect of the lexical attention layer (LAL). The role of the lexical attention mechanism is to focus on the features of words that play a key role in sentiment polarity judgments. The overall performance of the model without LAL is lower than that of LA-GCN, which indicates that LAL is important for the LA-GCN design and that word lexicality is useful for sentiment classification.

We then investigate the impact of the graph convolutional network (GCN) layer, which extracts new features from the output of the lexical attention layer. Removing it lowers performance on all three datasets, but by less than removing the lexical attention layer. Because the lexical attention layer has already captured the opinion word features, and the GCN layer only performs further feature extraction on top of them, its overall contribution to the model is smaller than that of the lexical attention layer.
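For readers less familiar with graph convolution, a single layer can be sketched in plain Python as neighbor averaging followed by a linear projection and ReLU. The mean aggregation with self-loops used here is one common normalization and is an assumption; the function name `gcn_layer` and the toy inputs are likewise illustrative, and the propagation rule used by LA-GCN is the one given in the model section.

```python
def gcn_layer(adjacency, features, weight):
    """One graph-convolution step: ReLU(mean over {i and its neighbors} of H, times W).

    adjacency: n x n matrix of 0/1 entries (assumed symmetric, no self-loops needed).
    features:  n x d matrix of node features H.
    weight:    d x o projection matrix W.
    """
    n = len(adjacency)
    in_dim = len(features[0])
    out_dim = len(weight[0])
    output = []
    for i in range(n):
        # Neighbors of node i plus a self-loop.
        nbrs = [j for j in range(n) if adjacency[i][j] or j == i]
        # Mean of the neighbor feature vectors (row-normalized aggregation).
        agg = [sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(in_dim)]
        # Linear projection followed by ReLU.
        row = [max(0.0, sum(agg[k] * weight[k][o] for k in range(in_dim)))
               for o in range(out_dim)]
        output.append(row)
    return output


# Toy path graph 0 - 1 - 2 with 2-dimensional features and an identity projection.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(gcn_layer(adj, feats, identity))
```

Each node's new representation mixes in its neighbors' features, which is why the layer operates on whatever the preceding layer has already emphasized.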

Finally, we investigate the effect of the feature fusion layer (FFL). In this setting, we classify using only the aspect-oriented local features and do not fuse the global features. Although the results are considerably worse than those of the full LA-GCN, the performance on the Restaurant and Laptop datasets is still much improved compared to ASGCN, TD-GAT, and RGAT, which are graph neural network-based models. On the one hand, this illustrates the effectiveness of the lexical attention and aspect-oriented GCN design. On the other hand, it directly illustrates the importance of the feature fusion layer for the model.
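The drops discussed in this ablation study can be recomputed directly from the reported scores. The short script below is only a convenience sketch: the numbers are copied from the ablation table, while the names `ablation`, `columns`, and `drops` are introduced here for illustration.

```python
# Scores copied from the ablation table, in the order given by `columns`.
columns = ("Restaurant Acc", "Restaurant F1", "Twitter Acc",
           "Twitter F1", "Laptop Acc", "Laptop F1")
ablation = {
    "LA-GCN":         (87.34, 82.20, 76.59, 75.38, 81.85, 78.32),
    "LA-GCN w/o DP":  (85.80, 79.50, 76.37, 74.58, 80.88, 77.95),
    "LA-GCN w/o LA":  (86.07, 80.03, 75.87, 74.08, 80.72, 77.04),
    "LA-GCN w/o GCN": (86.62, 80.33, 76.22, 75.00, 81.38, 77.78),
    "LA-GCN w/o FFL": (85.39, 77.41, 74.81, 73.84, 79.80, 75.61),
}


def drops(results, baseline="LA-GCN"):
    """Absolute drop of each ablated variant relative to the full model."""
    base = results[baseline]
    return {
        name: tuple(round(b - v, 2) for b, v in zip(base, values))
        for name, values in results.items()
        if name != baseline
    }


for name, delta in drops(ablation).items():
    print(name, dict(zip(columns, delta)))
```

Removing the FFL gives the largest drop in every column (for example, 4.79 F1 points on Restaurant), and removing the GCN layer gives a smaller drop than removing the lexical attention layer in every column, consistent with the discussion above.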

5.4.3 Case analysis

In this section, we select a sample for a case study. The results of dependency pruning on the sample are illustrated in Figs. 6 and 7; LA-GCN outputs correct predictions for both aspects of the sample, “food” and “perks”. Figs. 8 and 9 visualize the lexical attention process for each of the two aspects.

Fig. 6

Results of dependency pruning and LA-GCN prediction for the aspect “food”.

Fig. 7

Results of dependency pruning and LA-GCN prediction for the aspect “perks”.

Fig. 8

The process of lexical attention to aspect “food”. Darker cell color indicates higher attention value and white indicates that the feature will be masked.

Fig. 9

The process of lexical attention to aspect “perks”. Darker cell color indicates higher attention value and white indicates that the feature will be masked.

Dependency pruning reduces linking errors and strengthens the connection between aspects and opinion words. LA-GCN introduces a lexical attention mechanism that uses grammatical information to focus on the features that play a key role in judging an aspect's sentiment polarity, while suppressing interference from opinion words that belong to other aspects. This allows for better prediction of aspect sentiment polarity.

6 Conclusion

In this paper, we propose an approach based on lexical attention and an aspect-oriented GCN. First, we prune the dependency-parsed tree and reconstruct an aspect-oriented dependency-parsed tree that better links aspects and opinion words. We then introduce a lexical attention mechanism to focus on potential opinion words that have dependency relationships with the aspect, and afterwards aggregate the feature vectors of aspect-adjacent nodes with a GCN. After obtaining the aspect-oriented local features, to avoid missing useful information, we concatenate the global features with the local features, fuse them through a multi-head self-attention mechanism, and finally perform classification. This method not only attends directly to the features of the opinion words but also reduces the errors introduced by the parser. We also conducted an ablation study to verify the contribution of each layer of LA-GCN to the model. Experimental results on three public datasets show that the method is able to better associate aspects with opinion words, which significantly improves the performance of the model.

Acknowledgements

This work is supported by the Science & Technology project (41008114, 41011215, and 41014117).

Footnotes

Available at

References

1. Tang, Qin, Feng and Liu, Effective LSTMs for target-dependent sentiment classification, arXiv preprint arXiv:1512.01100 (2015).

2. Xue and Li, Aspect based sentiment analysis with gated convolutional networks, arXiv preprint arXiv:1805.07043 (2018).

3. Tang, Qin and Liu, Aspect level sentiment classification with deep memory network, arXiv preprint arXiv:1605.08900 (2016).

4. Wang, Huang, Zhu and Zhao, Attention-based LSTM for aspect-level sentiment classification, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, (2016), pp. 606–615.

5. Li, Guo and Mei, Deep memory networks for attitude identification, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, (2017), pp. 671–680.

6. Fan, Feng and Zhao, Multi-grained attention network for aspect-level sentiment classification, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, (2018), pp. 3433–3442.

7. Du and Sun, Capsule network with interactive attention for aspect-level sentiment classification, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), (2019), pp. 5492–5501.

8. Zhang, Li and Song, Aspect-based sentiment classification with aspect-specific graph convolutional networks, arXiv preprint arXiv:1909.03477 (2019).

9. Bai, Liu and Zhang, Exploiting Typed Syntactic Dependencies for Targeted Sentiment Classification Using Graph Attention Neural Network, arXiv preprint arXiv:2002.09685 (2020).

10. Pontiki, Papageorgiou, Galanis, Androutsopoulos, Pavlopoulos and Manandhar, SemEval-2014 Task 4: Aspect Based Sentiment Analysis, SemEval 2014 (2014), 27.

11. Dozat and Manning C.D., Deep Biaffine Attention for Neural Dependency Parsing (2016).

12. Dong, Wei, Tan, Tang, Zhou and Xu, Adaptive recursive neural network for target-dependent Twitter sentiment classification, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), (2014), pp. 49–54.

13. Nguyen T.H. and Shirai, PhraseRNN: Phrase recursive neural network for aspect-based sentiment analysis, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, (2015), pp. 2509–2514.

14. Wang, Pan S.J., Dahlmeier and Xiao, Recursive neural conditional random fields for aspect-based sentiment analysis, arXiv preprint arXiv:1603.06679 (2016).

15. Ma, Peng and Cambria, Targeted Aspect-Based Sentiment Analysis via Embedding Commonsense Knowledge into an Attentive LSTM, in: AAAI, (2018), pp. 5876–5883.

16. Chen, Sun, Bing and Yang, Recurrent attention network on memory for aspect sentiment analysis, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, (2017), pp. 452–461.

17. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017), 5998–6008.

18. Radford, Narasimhan, Salimans and Sutskever, Improving language understanding by generative pretraining (2018).

19. Radford, Wu, Child, Luan, Amodei and Sutskever, Language models are unsupervised multitask learners, OpenAI blog 1(8) (2019), 9.

20. Brown T.B., Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).

21. Devlin, Chang M.-W., Lee and Toutanova, BERT: Pretraining of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

22. Yang, Dai, Yang, Carbonell, Salakhutdinov R.R. and Le Q.V., XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems, (2019), pp. 5753–5763.

23. Xu, Liu, Shu and Yu P.S., BERT post-training for review reading comprehension and aspect-based sentiment analysis, arXiv preprint arXiv:1904.02232 (2019).

24. Song, Wang, Jiang, Liu and Rao, Attentional encoder network for targeted sentiment classification, arXiv preprint arXiv:1902.09314 (2019).

25. Dai, Peng, Chen and Ding, A Multi-Task Incremental Learning Framework with Category Name Embedding for Aspect-Category Sentiment Analysis, arXiv preprint arXiv:2010.02784 (2020).

26. Zeng, Yang, Xu, Zhou and Han, LCF: A local context focus mechanism for aspect-based sentiment classification, Applied Sciences 9(16) (2019), 3389.

27. Sun, Zhang, Mensah, Mao and Liu, Aspect-level sentiment analysis via convolution over dependency tree, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), (2019), pp. 5683–5692.

28. Chen, Teng and Zhang, Inducing Target-specific Latent Structures for Aspect Sentiment Classification, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), (2020), pp. 5596–5607.

29. Tang, Ji, Li and Zhou, Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (2020), pp. 6578–6588.

30. Liang, Meng, Zhang, Xu, Chen and Zhou, A Dependency Syntactic Knowledge Augmented Interactive Architecture for End-to-End Aspect-based Sentiment Analysis, arXiv preprint arXiv:2004.01951 (2020).

31. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

32. Huang and Carley K.M., Syntax-aware aspect level sentiment classification with graph attention networks, arXiv preprint arXiv:1909.02606 (2019).

Table 3
Statistics of the three datasets

Dataset	Positive train	Positive test	Neutral train	Neutral test	Negative train	Negative test
Restaurant	2164	728	637	196	807	196
Twitter	1561	173	3127	346	1560	173
Laptop	994	341	464	169	870	128