Abstract
Aspect-based sentiment analysis (ABSA) contains three subtasks: aspect term extraction, opinion term extraction, and aspect-level sentiment classification. To make full use of the relationships among the three subtasks, some recent studies have successfully used a unified framework to solve the ABSA problem. However, these studies have not yet integrated domain knowledge into the model. Inspired by post-training, we propose a joint model (RACL-BERT-PT). This model combines BERT-PT, a pre-trained model enriched with domain knowledge, with the unified joint training framework RACL. The experimental results show that our model achieves better results than previous work on three public datasets.
Introduction
The aspect-based sentiment analysis (ABSA) task was first proposed by [1]. It is a fine-grained sentiment analysis whose purpose is to determine the sentiment polarity of one or more objects in a single sentence. A specific example of this task, from a review of a restaurant, "The
To add domain knowledge to the pre-trained model, and to address the problem that small datasets are insufficient for pre-training, some researchers have proposed a post-training method based on the BERT model. Specifically, they propose a novel joint post-training technique that takes BERT's pre-trained weights as the initialization for basic language understanding and adapts BERT with both domain knowledge and task knowledge. This technique leverages two sources of knowledge [2]: unsupervised domain reviews and supervised (yet out-of-domain) MRC data, where the former enhances domain-awareness and the latter strengthens task-awareness. Here MRC refers to machine reading comprehension and RRC refers to review reading comprehension. The RRC dataset is constructed for a task they proposed, in which the goal is to find appropriate answers to questions.
The unified training framework does not integrate domain knowledge, and BERT with post-training does not consider the correlation among the three subtasks. We therefore combine the two methods to make full use of both domain knowledge and the information interaction between tasks, improving the accuracy of aspect sentiment classification. Our contributions are as follows: (1) we use the relationships among the ABSA subtasks together with a pre-trained model enriched with domain knowledge to achieve the best results on the AE, OE, and SC tasks; (2) we show that domain knowledge benefits the AE, OE, and SC tasks.
Related work
Aspect extraction is an important task in sentiment analysis [1] and has many applications [3, 4]. We review and compare recent methods and present the motivation for our method.
Existing approaches to integrating knowledge follow three main strategies: using knowledge graph representation vectors as feature input, designing a new pre-training task, and adding additional modules.
Among them, the work of [2] is representative: it proposes a post-training method that incorporates domain knowledge into the pre-trained model and achieves the best results on two public ABSA datasets.
Our work combines the above methods: we use a pre-trained model that incorporates knowledge [2] within a unified framework [28] to improve results on the ABSA task.
Methodology
Task definition
Given a sentence $S = w_1, \dots, w_i, \dots, w_n$, we aim to solve the three subtasks AE, OE, and SC. Specifically, each subtask is formulated as a sequence labeling problem.
Model architecture
As shown in Fig. 1, our model mainly consists of two parts: RACL and BERT-PT. This section introduces these two core modules and how they are combined.

Fig. 1. Interactive relations among subtasks in ABSA.

Fig. 2. BERT-PT model, where [CLS] is a dummy token not used for RRC and [SEP] is intended to separate q and d.
The training process is as follows. We first obtain the hidden representation $h = \mathrm{BERT}(x) \in \mathbb{R}^{r_h \times |x|}$, where $r_h$ is the size of the hidden dimension and $|x|$ is the length of the input sequence $x = ([CLS], q_1, \dots, q_m, [SEP], d_1, \dots, d_n, [SEP])$; [CLS] is a dummy token not used for RRC, and [SEP] is intended to separate $q$ and $d$. The hidden representation is then passed to two separate dense layers, each followed by a softmax function, which produce $l_1$ and $l_2$. Finally, the output is a span over the positions in $d$ (after the first [SEP] token of the input), indicated by two pointers (indexes) $s$ and $e$ computed from $l_1$ and $l_2$. The specific mathematical expressions of $l_1$ and $l_2$ are:
$$l_1 = \mathrm{Softmax}(W_1 \cdot h + b_1), \qquad l_2 = \mathrm{Softmax}(W_2 \cdot h + b_2)$$
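The span prediction step can be illustrated with a minimal sketch in plain Python. The text only states that $s$ and $e$ are computed from $l_1$ and $l_2$; the decoding rule used here (maximize $l_1[s] \cdot l_2[e]$ with $s \le e$ inside the document part) is an assumption, as are the function names and the toy hidden states.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_distribution(h, w, b):
    """Softmax(w . h + b); h is given as |x| column vectors of size r_h."""
    logits = [sum(wk * hk for wk, hk in zip(w, column)) + b for column in h]
    return softmax(logits)

def predict_span(l1, l2, doc_start, doc_end):
    """Return (s, e) maximizing l1[s] * l2[e] with doc_start <= s <= e < doc_end.

    Assumed decoding rule, not taken from the paper.
    """
    best_score, best_span = -1.0, None
    for s in range(doc_start, doc_end):
        for e in range(s, doc_end):
            score = l1[s] * l2[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    return best_span

# Toy input: x = [CLS] q1 [SEP] d1 d2 d3 [SEP], hidden size r_h = 2.
h = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.0],
     [0.2, 0.1], [3.0, 0.3], [0.1, 2.5], [0.0, 0.0]]
w1, b1 = [1.0, 0.0], 0.0  # start pointer reads the first hidden feature
w2, b2 = [0.0, 1.0], 0.0  # end pointer reads the second hidden feature

l1 = pointer_distribution(h, w1, b1)
l2 = pointer_distribution(h, w2, b2)
span = predict_span(l1, l2, doc_start=3, doc_end=6)  # positions of d1..d3
```

In this toy case the start pointer peaks at d2 and the end pointer at d3, so the predicted answer span covers those two document tokens.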
Specifically, two pre-training objectives are used: masked language model (MLM) and next sentence prediction (NSP). The training objectives are as follows:
The evaluation metrics are exact match (EM) and F1. EM requires the predicted answer string to exactly match the annotated answer span. The F1 score is the average of the F1 scores of the individual answers. Table 1 shows that BERT-PT is more effective than the baseline method on reading comprehension tasks, which shows the benefit of having two kinds of knowledge. In the experimental part, we further verify the advantages of BERT-PT on the AE, OE, and SC tasks in combination with this method.
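As an illustration of the two metrics, the following sketch computes EM and a token-overlap F1 averaged over answers. The normalization (lowercasing and whitespace tokenization) and the token-level definition of F1 are assumptions in the style of common reading comprehension evaluation; the paper does not spell them out, and the example strings are invented.

```python
from collections import Counter

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall for one answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(pairs):
    """Average EM and F1 over (prediction, reference) pairs."""
    n = len(pairs)
    em = sum(exact_match(p, r) for p, r in pairs) / n
    f1 = sum(token_f1(p, r) for p, r in pairs) / n
    return em, f1

# Invented examples: one partial match and one exact match.
pairs = [("the battery life", "battery life"),
         ("great screen", "great screen")]
em, f1 = evaluate(pairs)
```

The first pair is not an exact match but still earns partial F1 credit, which is why the two metrics are usually reported together.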
As shown in Fig. 3, a single RACL layer contains three modules, each designed for the corresponding subtask. As shown in Fig. 4, each module takes as input the shared BERT-PT or BERT representation from the underlying sentence encoding module. The three modules encode their respective task-oriented features and then carry out relation-aware collaborative learning through the four propagation relations R1, ..., R4. Finally, the three modules predict the corresponding label sequences $Y^A$, $Y^O$, and $Y^S$. However, the AE, OE, and SC modules in a single-layer RACL model can only extract low-level linguistic features, so, like [28], we stack RACL into multiple layers to obtain higher-level semantic features.

Fig. 3. RACL model.

Fig. 4. BERT-PT for RACL input.
To extract the private features of the subtasks, we use the CNN proposed by [30] as the encoder function F. For the AE and OE tasks, we directly extract the local features $X^A$ and $X^O$ (as shown in formulas 5 and 6, where $d_c$ and $d_e$ represent the word dimensions and $n$ represents the number of words). For the SC task, we also need to consider the semantic information of the context. Therefore, we use an attention mechanism to model the semantic relationship between the query and the context features. The formula is as follows:
where $d^{s}_{i,j}$ represents the strength of dependence between the i-th query and the j-th context word, and
In addition to the shared and private features, exploiting all the relationships between subtasks can further improve performance. Specifically, we describe how the relations R1, R2, R3, and R4 are used.
R1 represents the relationship between AE and OE: the two tasks can help each other. For example, to describe food, we use a word such as "delicious" rather than a word such as "elegance", and vice versa. To model R1, we use the semantic relationship between AE and OE to exchange useful information. For AE and OE, the semantic relationship between words is defined as follows:
where $ao_{i,j}$ represents the interaction between AE and OE, and
For a word $w_i$ in OE, we perform a weighted summation over all words in AE according to their semantic relations and obtain the useful clues $X^{A2O}$ from AE. Conversely, for a word $w_i$ in AE, we use the same method to obtain the clues $X^{O2A}$. Then we concatenate the AE-oriented feature $X^A$ and the OE-oriented feature $X^O$ with the corresponding clues $X^{O2A}$ and $X^{A2O}$, respectively, to form the final task feature representations, and feed them to a fully connected layer to predict the labels. The formula is as follows:
where $W^A$ ($W^O$) $\in \mathbb{R}^{3 \times 2d_c}$ is a transformation matrix and $Y^A$ ($Y^O$) is the predicted tag sequence of AE (OE).
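The exchange of clues in R1 can be sketched as follows in plain Python. This is only an illustration under stated assumptions: the relation weights are computed as a softmax over dot products of L2-normalized features, a word does not attend to its own position, and all feature values and matrices are invented toy numbers with $n = 3$ and $d_c = 2$. The function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def relation_weights(queries, keys):
    """Row i holds the weights of query word i over all key words j != i.

    Assumes at least two words in the sentence.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    weights = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj))
                  for j, kj in enumerate(k) if j != i]
        row = softmax(scores)
        row.insert(i, 0.0)  # assumed mask: no weight on the word itself
        weights.append(row)
    return weights

def weighted_sum(weights, values):
    """Clue for word i = sum_j weights[i][j] * values[j]."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]

def predict_tags(features, clues, W):
    """Softmax(W [x ; clue]) per word, with W of shape 3 x 2*d_c."""
    tags = []
    for x, clue in zip(features, clues):
        z = x + clue  # concatenation of feature and clue
        tags.append(softmax([sum(w * v for w, v in zip(row, z)) for row in W]))
    return tags

X_A = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # AE-oriented features
X_O = [[0.1, 0.7], [0.6, 0.3], [0.4, 0.4]]  # OE-oriented features
W_A = [[0.5, -0.2, 0.1, 0.3], [-0.3, 0.4, 0.2, -0.1], [0.1, 0.1, -0.4, 0.2]]
W_O = [[0.2, 0.1, -0.3, 0.4], [0.3, -0.1, 0.2, 0.1], [-0.2, 0.3, 0.1, -0.1]]

M_O2A = relation_weights(X_A, X_O)  # AE words attend to OE words
M_A2O = relation_weights(X_O, X_A)  # OE words attend to AE words
X_O2A = weighted_sum(M_O2A, X_O)    # clues passed from OE to AE
X_A2O = weighted_sum(M_A2O, X_A)    # clues passed from AE to OE
Y_A = predict_tags(X_A, X_O2A, W_A)
Y_O = predict_tags(X_O, X_A2O, W_O)
```

Each row of `Y_A` and `Y_O` is a distribution over the three tags (O, B, I) for one word, so the predicted tag is the index of the largest entry.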
R2 is the triadic relation between SC and R1. The key to the SC task is to determine the dependencies between aspects and their contexts. For example, when predicting the sentiment polarity of food, the context words "sufficient" and "delicious" play an important role, so R1 can help the SC task. We use $M^{O2A}$ as the representation of R1, and the specific operation of R2 is defined as:
$M^{O2A}$ reflects the dependence between aspect terms and context from the extraction perspective, while $M^{ctx}$ reflects it from the classification perspective. In theory, replacing $M^{O2A}$ with $M^{A2O}$ should have the same effect.
R3 represents the binary relationship between SC and OE. Opinion terms usually express the sentiment polarity; for example, "too delicious" is usually a positive signal. The SC task should therefore pay more attention to the opinion terms extracted by the OE task. Similar to R2, R3 is defined as follows:
where P represents the probability value that we give.
By doing this, opinion terms obtain a greater weight in the sentiment prediction matrix. Then we recalculate the feature $X^S$ of SC from formula 10, as in formula 21, and use the concatenation of $H$ and $X^S$ as the final feature of SC for the final sentiment polarity prediction:
where $W^S \in \mathbb{R}^{3 \times 2d_h}$ is a transformation matrix and $Y^S \in \mathbb{R}^{3 \times n}$ is the predicted tag sequence of SC.
R4 represents the binary relationship between SC and AE. Only aspect words have a corresponding sentiment polarity. For example, in food reviews, "food" and "environment" usually carry a specific sentiment polarity. This shows that the AE task is helpful for the supervised training of the SC task. Therefore, we directly use the label $Y^A$ from AE to improve the labeling process in SC, specifically as follows:
where
As shown in Fig. 4, we use the BERT-PT sentence encoding as the input of the first RACL module. Because a single RACL module may only extract low-level semantic features, we stack RACL modules: the features extracted by the first-layer RACL module are used as the input of the second-layer RACL module, and so on up to layer L. Finally, the outputs of all layers are average-pooled to give the prediction result.
where $T \in \{AE, OE, SC\}$ represents a specific subtask and $L$ is the number of layers. In actual experiments, we also use 6 layers ($L = 6$).
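The average pooling over stacked layers can be sketched as below. The sketch assumes that each layer outputs, for every word, a probability distribution over the labels of subtask $T$, and that the final prediction is the element-wise mean over layers; the function names and the two-layer toy outputs are illustrative only.

```python
def average_layers(layer_outputs):
    """Element-wise mean over layers.

    layer_outputs[l][i][c] is the probability of class c for word i
    predicted by layer l.
    """
    num_layers = len(layer_outputs)
    num_words = len(layer_outputs[0])
    num_classes = len(layer_outputs[0][0])
    return [[sum(layer_outputs[l][i][c] for l in range(num_layers)) / num_layers
             for c in range(num_classes)]
            for i in range(num_words)]

def argmax_tags(probs):
    """Index of the most probable class for each word."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

# Toy outputs of two stacked layers for a two-word sentence.
layer_1 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
layer_2 = [[0.4, 0.5, 0.1], [0.1, 0.7, 0.2]]

averaged = average_layers([layer_1, layer_2])
tags = argmax_tags(averaged)
```

Note that the second layer alone would label the first word with class 1, while the averaged prediction keeps class 0, which shows how pooling smooths disagreements between layers.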
Training procedure
The pipeline from the input layer to the last layer is listed in Table 2. After generating the tag sequences $Y^A$, $Y^O$, and $Y^S$ for a sentence $S_e$ (an example is shown in Table 3), we compute the cross-entropy loss of each subtask:
where $T \in \{A, O, S\}$ denotes the subtask, $N$ is the length of $S_e$, $J$ is the category of labels,
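A per-subtask cross-entropy loss of this kind can be sketched as follows. The exact formula is not reproduced here, so this is the standard definition summed over the $N$ words and $J$ label categories; whether the loss is summed or averaged, and combining the three subtasks by simple addition, are assumptions. The toy distributions are invented.

```python
import math

def cross_entropy(gold, predicted, eps=1e-12):
    """Sum over words i and labels j of -gold[i][j] * log(predicted[i][j])."""
    return -sum(g * math.log(p + eps)
                for gold_row, pred_row in zip(gold, predicted)
                for g, p in zip(gold_row, pred_row))

# Toy sentence with N = 2 words and J = 3 label categories.
gold_A = [[1, 0, 0], [0, 1, 0]]                    # one-hot gold tags
pred_A = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]    # predicted distributions

loss_A = cross_entropy(gold_A, pred_A)

# Assumed combination: total loss as the sum of the three subtask losses.
# The OE and SC losses reuse the AE toy values purely for illustration.
total_loss = loss_A + cross_entropy(gold_A, pred_A) + cross_entropy(gold_A, pred_A)
```

For the toy values, the loss equals $-\log 0.5 - \log 0.25 = \log 8$, since only the probability assigned to the gold label of each word contributes.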
Table 1. RRC results in EM (Exact Match) and F1.
Table 2. Pipeline of the model.
Table 3. Input example.
1 $Y^A$ contains the aspect term tag sequence, where 0=O, 1=B, 2=I.
2 $Y^O$ contains the opinion term tag sequence, where 0=O, 1=B, 2=I.
3 $Y^S$ contains the sentiment tag sequence, where 0=background, 1=positive, 2=negative, 3=neutral.
* Token is the word form after BERT tokenization.
Datasets
Table 4. Datasets.
To show more details of the datasets, we randomly selected a real example, as shown in Table 3. We first add the special tokens [CLS] and [SEP] to the sentence, then tokenize it with BERT's own vocabulary, and finally use the result as the input of the model, as shown in Fig. 4. The three tag sequences $\{Y^A, Y^O, Y^S\}$ are used for fine-tuning the model.
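The construction of one training example can be sketched as below. The sentence and its tags are invented for illustration and are not the example from Table 3. The sketch also assumes that every word is already a single vocabulary token (no subword splitting) and that the special tokens receive tag 0 in all three sequences; `build_example` is a hypothetical helper, not part of any library.

```python
SPECIAL_TAG = 0  # assumed tag for [CLS] and [SEP] in all three sequences

def build_example(words, y_a, y_o, y_s):
    """Wrap a tokenized sentence with [CLS]/[SEP] and pad the tag sequences."""
    if not (len(words) == len(y_a) == len(y_o) == len(y_s)):
        raise ValueError("words and tag sequences must have the same length")

    def pad(tags):
        return [SPECIAL_TAG] + list(tags) + [SPECIAL_TAG]

    return {
        "tokens": ["[CLS]"] + list(words) + ["[SEP]"],
        "Y_A": pad(y_a),  # aspect tags: 0=O, 1=B, 2=I
        "Y_O": pad(y_o),  # opinion tags: 0=O, 1=B, 2=I
        "Y_S": pad(y_s),  # sentiment tags: 0=background, 1=positive, ...
    }

example = build_example(
    ["the", "food", "is", "delicious"],
    y_a=[0, 1, 0, 0],  # "food" begins an aspect term
    y_o=[0, 0, 0, 1],  # "delicious" begins an opinion term
    y_s=[0, 1, 0, 0],  # the aspect "food" is positive
)
```

With a real subword tokenizer, a word may be split into several pieces, in which case the tags would also have to be aligned to the pieces; that step is omitted here.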
The model that achieves the smallest loss on the development set is evaluated on the test set. We use the F1 score to measure the performance on each subtask; the specific results are shown in Table 5. To verify the effectiveness of our method, we compare it with the following models.
Comparison of different methods
Table 5. Comparison of different methods.
* For a fair comparison, we separate the GloVe-based methods (M1), the BERT-based methods (M2), and the BERT-PT-based methods (M3). The best scores are shown in bold, with "-" indicating that the method does not include the OE subtask.
The results reported for all models are the average over 10 runs. Notably, our model's results are very stable and hardly fluctuate across the ten runs. As shown in Table 5, we compare three groups of methods: M1 is based on GloVe, M2 on BERT-Large, and M3 on BERT-PT.
First, in the M1 group, RACL-GloVe outperforms all baselines on the overall metric ABSA-F1, surpassing them by 2.12%, 2.92%, and 2.40% on the three datasets, respectively. Second, in the M2 group, RACL-BERT-Large surpasses the methods based on the same type of encoder on ABSA-F1, with gains of 4.70%, 1.67%, and 3.76% on the three datasets, respectively. This shows that jointly training all subtasks and comprehensively modeling their interactions improves performance on the ABSA task. Finally, the M3 method, which uses the pre-trained model BERT-PT with domain knowledge, achieves the best results on ABSA-F1. This shows that a pre-trained model with domain knowledge has advantages in aspect-based sentiment classification.
Analysis
From Fig. 5, we can see that when K=1 the model's performance is inferior because each extracted feature covers only a single word; when K=3 or K=5, the size of the extracted features is more reasonable. When K exceeds 5, the model's performance deteriorates because irrelevant words are included during feature extraction. In Fig. 6, we can see that increasing the number of RACL layers to 5 yields better performance, but beyond 5 layers the performance does not change much, while more layers mean more parameters and therefore a higher cost.

Fig. 5. Effects of K.

Fig. 6. Effects of L.
Cross-validation results
From the experimental results, we can see that under cross-validation the results of the three models on the three tasks decline slightly. For example, the ABSA-F1 of our method decreased by 0.5 and 1 on Res14 and Lap14, respectively, but rose by 0.5 on Res15.
In this paper, we propose a joint model, RACL-BERT-PT, which combines the relation-aware collaborative learning framework RACL with the pre-trained model BERT-PT, which carries domain knowledge. Experiments on three real-world datasets demonstrate that our RACL-BERT-PT model outperforms the state-of-the-art pipeline and unified baselines on the complete ABSA task. To demonstrate the generality of the model, in future work we will study it on a broader range of datasets.
Acknowledgments
We thank the anonymous reviewers for their valuable comments. This research was funded by the National Natural Science Foundation of China (grant number No.U1603115), the National Key Research and Development Program of China (grant number No.2017YFBO504203), the Science and Technology Planning Project of Sichuan Province (grant number No.18SXHZ0054), and the National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data (grant number PSRPC:No.XJ201810101).
