Multi-task learning model for aspect term extraction and aspect polarity classification based on dual-labels

Abstract

Aspect-based sentiment analysis (ABSA) is an active and important task in natural language processing. It consists of two subtasks: aspect term extraction (ATE) and aspect polarity classification (APC). Previous research has generally studied the two subtasks independently and designed separate neural network models for ATE and APC. However, many of these models integrate various manual features, which consumes considerable computing resources and labor. Moreover, the quality of the ATE results affects the performance of APC. This paper proposes a multi-task learning model based on dual auxiliary labels for ATE and APC. General IOB labels and sentimental IOB labels are used together to solve both tasks efficiently without any manual features. Experiments are conducted on two widely used ABSA benchmark datasets from SemEval-2014. The results show that the proposed model is effective and efficient on both ATE and APC compared with the main baseline models.

Keywords

Multi-task learning, aspect term extraction, aspect polarity classification, sentiment classification

1 Introduction

Aspect-based sentiment analysis (ABSA) is an important, fine-grained sentiment analysis task in natural language processing. It includes aspect term extraction (ATE) and aspect polarity classification (APC). ATE aims to identify and extract aspect terms from sentences, while APC aims to classify the sentiment polarity of the extracted aspect terms. For example, in the review sentence “The camera shooting function of this mobile phone is very powerful, but the function of endurance is poor”, ATE identifies the aspect terms “camera shooting function” and “function of endurance”, and APC classifies their polarities as positive and negative, respectively. Most previous studies carried out ATE and APC independently, which breaks the correlation between the two tasks. ATE is the upstream task of APC, so the quality of the ATE results directly influences the accuracy of APC. Previous studies also require substantial resources to construct efficient ATE and APC models, and the internal relationship between ATE and APC has not been fully explored. In view of this, this paper proposes the Multi-task Learning Model for Aspect Term Extraction and Aspect Polarity Classification Based on Dual-Labels (MTL-ATEPC, also called MTL). By adopting general IOB labels and sentimental IOB labels to enhance the feature input, the model treats ATE and APC as a multi-task joint labeling problem and performs both tasks at the same time.

Figure 1 is the main block diagram of the MTL models for the ATEPC task. At the bottom of the diagram is the input layer. Before training starts, the input sequence is annotated and tokenized. The embedding module, which is essential for most NLP models, transforms the annotated text into vectors that the model can process. Most models use only word embeddings; this paper additionally uses character embeddings. The main network is the core of the model and consists of a bidirectional LSTM network and a CRF layer for extracting and learning text features. The ATEPC layer extracts aspect terms and also infers their polarity, while the ATE layer is only responsible for extracting aspect terms.

Fig. 1

The main block diagram of the model.

The article is organized into five sections. Section 1 introduces the research problem; Section 2 discusses related work; Section 3 describes the methodology of the model; Section 4 reports the experiments that evaluate the models; and Section 5 concludes.

The contributions of this paper are highlighted as follows:

This paper is the first study to conduct multi-task learning for ATE and APC by using Dual-Labels where MTL deals with ATE and APC simultaneously with the help of general IOB labels and sentimental IOB labels.

A new weighted fusion loss function is used in this paper, which focuses on the optimization of the multi-task model for ATE and APC.

The experimental results on two benchmark datasets of SemEval-2014 verify MTL. Compared with other state-of-the-art models, the proposed model is strongly competitive in both ATE and APC.

2 Related work

2.1 Aspect term extraction

Since the ATE task is of great significance for the APC task, it has attracted increasing attention. Research on ATE falls into rule-based methods, traditional machine learning methods, and deep learning methods.

Rule-based methods usually extract the aspect terms of a review by formulating rules. Hu et al. [1] first showed that explicit, high-frequency aspect terms can be extracted from a corpus with association rules, and that implicit aspects can be detected by calculating the minimum distance between aspect terms and opinion words. Qiu et al. [2] used syntax-tree rules between opinion words and aspect terms to extend an initial opinion dictionary and extract aspect terms. Rule-based methods depend heavily on predetermined rules and are restricted to specific application domains.

For traditional machine learning-based methods, ATE is regarded as a sequence labeling task. The Hidden Markov Model (HMM) and Conditional Random Field (CRF) [3, 4] have been the most commonly used approaches. Jin et al. [5] proposed a lexicalized HMM that integrates linguistic features such as part of speech (POS) and contextual information. Jacob et al. [6] extracted aspect terms by feeding the token, POS, short dependency path, word distance, and opinion sentences as features into a CRF. The experiments were conducted on four datasets: movies, web services, cars, and cameras. Chernyshevich et al. [7] combined rich lexical, syntactic, and statistical features with a CRF to extract aspect terms in two specific domains. Traditional machine learning methods depend heavily on tedious manual feature selection and ignore the interaction between aspect terms and opinion words.

Recent research indicates that deep learning methods are good at capturing and processing token features and perform well in aspect term extraction, so deep learning-based aspect term extraction has become popular. Wang et al. [8] proposed a joint model, combining a recursive neural network with a CRF, that extracts explicit aspect terms and opinion words at the same time. Manual features can also be added to this model, which further improves its extraction ability. Without any supplementary supervision, Xu et al. [9] used general-purpose embeddings and domain-specific embeddings as the input of a convolutional neural network and obtained competitive results. Luo et al. [10] proposed a bidirectional dependency tree network that extracts dependency-tree representations from sentences and combines them with a bidirectional long short-term memory (LSTM) network and a CRF. The model thus captures both tree-structured and sequential information, which benefits aspect term extraction.

2.2 Aspect polarity classification

APC aims to recognize the sentiment polarity of the extracted aspect terms. For three-way classification, the polarity of each aspect term is positive, negative, or neutral. In early work, APC was treated as a standard text classification problem, and sentiment classifiers were built with traditional machine learning and feature engineering. These methods require a large number of manual features and additional resources, and they do not account for the fact that a sentence can contain multiple aspect terms with different polarities.

Neural network-based models have achieved great improvement in aspect polarity classification. In general, recurrent neural networks (RNNs) are better suited to the APC task than traditional machine learning methods. Tang et al. [11] proposed two target-dependent long short-term memory (LSTM) models that integrate aspect terms into the model; experimental results on two benchmark datasets show that they outperform the standard LSTM-based model. Tai et al. [12] applied the Tree-LSTM network architecture to identify aspect polarity, and the experimental results reveal that Tree-LSTM performs better than LSTM. The attention mechanism is also commonly applied in the APC task. Wang et al. [13] proposed an attention-based LSTM model. When different aspect terms are input into the model, the attention mechanism calculates an attention weight for each token of the sentence, and the model focuses on the tokens with the highest weights. Ma et al. [14] proposed the interactive attention network, which deploys two attention layers to learn the relationship between context and aspect terms interactively and to generate representations of both. Convolutional neural networks (CNNs) are also applied in the APC task. Xue et al. [15] used a gated convolutional neural network to selectively output sentiment features; this model can be trained in parallel. Xing et al. [16] proposed an attention-based convolutional neural network in which incorporating the aspect terms into the model markedly improves aspect polarity classification.

2.3 Multi-task aspects-based sentiment analysis

Aspect term extraction is the upstream task of aspect polarity classification, and its results directly influence the accuracy of aspect polarity classification. Consequently, it is necessary to establish a multi-task learning model for both tasks. Nguyen et al. [17] proposed a joint model in which general IOB labels are merged with sentimental labels, so that the ATE and APC tasks can be tackled in parallel. Wang et al. [18] proposed a multi-task neural learning framework that deals with the ATE and APC tasks simultaneously. This model deploys an attention mechanism to learn the joint representation of aspect terms and sentiment polarities. Ma et al. [19] designed a hierarchical gated RNN architecture for learning the abstract features of ATE and APC. The impact of IOB labels and sentimental labels is taken into consideration in this network.

3 Methodology

3.1 Task definition

Suppose that [w₁, w₂, …, w_n] represents a sentence consisting of n words, and that the character sequence of a word is [c₁, c₂, …, c_m], where m is the length of the word. The general IOB labels are [B_asp, I_asp, O], where asp denotes an aspect. The sentimental IOB labels applied in this paper are [B_pos, I_pos, B_neg, I_neg, B_neu, I_neu, O], where pos, neg, and neu denote the positive, negative, and neutral sentiment of the aspect, respectively. With the proposed MTL models, the ATE and APC tasks are carried out simultaneously.
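To make the two label sets concrete, the following minimal sketch derives both label sequences for one sentence from annotated aspect spans. The function name `dual_labels`, the span format, and the example sentence are illustrative assumptions and are not taken from the paper's data pipeline.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, polarity),
# with polarity in {"pos", "neg", "neu"}.
Span = Tuple[int, int, str]


def dual_labels(n_tokens: int, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Return (general IOB labels, sentimental IOB labels) for one sentence."""
    general = ["O"] * n_tokens
    sentimental = ["O"] * n_tokens
    for start, end, polarity in spans:
        for i in range(start, end):
            prefix = "B" if i == start else "I"
            general[i] = f"{prefix}_asp"            # used by the ATE output
            sentimental[i] = f"{prefix}_{polarity}"  # used by the ATEPC output
    return general, sentimental


tokens = "The battery life is great but the screen is dim".split()
spans = [(1, 3, "pos"), (7, 8, "neg")]  # "battery life" and "screen"
general, sentimental = dual_labels(len(tokens), spans)
for tok, g, s in zip(tokens, general, sentimental):
    print(f"{tok:10s} {g:6s} {s}")
```

Both sequences mark the same token positions; the sentimental sequence only replaces the `asp` suffix with the polarity of the aspect.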

3.2 Long short-term memory network

LSTM is an advanced RNN architecture that has been widely adopted in a variety of NLP tasks. Huang et al. [20] applied LSTM networks to sequence tagging tasks. LSTMs are effective at capturing semantic features and at retaining historical information. Owing to their long-term memory mechanism, LSTM networks can make full use of long-range features.

A typical implementation of an LSTM cell is as follows:

$f_{t} = σ_{g} (W_{f} x_{t} + U_{f} h_{t - 1} + b_{f})$ (1)

$i_{t} = σ_{g} (W_{i} x_{t} + U_{i} h_{t - 1} + b_{i})$ (2)

$o_{t} = σ_{g} (W_{o} x_{t} + U_{o} h_{t - 1} + b_{o})$ (3)

$\tilde{c} = σ (W_{c} x_{t} + U_{c} h_{t - 1} + b_{c})$ (4)

$c_{t} = f_{t} \circ c_{t - 1} + i_{t} \circ \tilde{c}$ (5)

$h_{t} = o_{t} \circ σ_{h} (c_{t})$ (6)

where f_t is the forget gate, i_t is the input gate, and o_t is the output gate. $\tilde{c}$ is the new content memory and c_t is the context vector. σ_g is the logistic sigmoid activation function used by the gates; in the standard LSTM, the activations σ in Eq. (4) and σ_h in Eq. (6) are the hyperbolic tangent. ∘ is the element-wise product. W_f and b_f are the weights for f_t, W_i and b_i are the weights for i_t, and W_o and b_o are the weights for o_t. All of the vectors above share the same dimension, consistent with the embedding dimension.
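Equations (1)-(6) can be traced step by step with a small pure-Python sketch of a single LSTM step. The helper names (`lstm_step`, `affine`, `zero_params`) are illustrative, and the use of tanh in Eqs. (4) and (6) is an assumption based on the standard LSTM formulation rather than on a statement in the paper.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def affine(p: Dict[str, list], name: str, x: Vector, h_prev: Vector) -> Vector:
    """Compute W_name x + U_name h_prev + b_name."""
    wx = matvec(p["W_" + name], x)
    uh = matvec(p["U_" + name], h_prev)
    return [a + b + c for a, b, c in zip(wx, uh, p["b_" + name])]


def lstm_step(p: Dict[str, list], x: Vector, h_prev: Vector,
              c_prev: Vector) -> Tuple[Vector, Vector]:
    """One time step; returns (h_t, c_t)."""
    f = [sigmoid(v) for v in affine(p, "f", x, h_prev)]            # Eq. (1)
    i = [sigmoid(v) for v in affine(p, "i", x, h_prev)]            # Eq. (2)
    o = [sigmoid(v) for v in affine(p, "o", x, h_prev)]            # Eq. (3)
    c_tilde = [math.tanh(v) for v in affine(p, "c", x, h_prev)]    # Eq. (4), tanh assumed
    c = [ft * cp + it * ct
         for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]         # Eq. (5)
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]               # Eq. (6), tanh assumed
    return h, c


def zero_params(d_in: int, d_hid: int) -> Dict[str, list]:
    """All-zero parameters, convenient for checking the arithmetic by hand."""
    p: Dict[str, list] = {}
    for name in "fioc":
        p["W_" + name] = [[0.0] * d_in for _ in range(d_hid)]
        p["U_" + name] = [[0.0] * d_hid for _ in range(d_hid)]
        p["b_" + name] = [0.0] * d_hid
    return p
```

With all-zero parameters every gate equals 0.5 and the candidate memory is 0, so the cell state is simply halved at each step, which gives an easy sanity check of Eqs. (5) and (6).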

Traditional LSTM networks are unidirectional, which weakens their ability to extract features. In recent years, bidirectional LSTMs [20, 21] have become more popular than traditional RNN architectures, since they can learn features in both the forward and backward directions. By using a bidirectional LSTM, the model proposed in this paper achieves better performance.

3.3 Word embedding

Word embedding is indispensable for most NLP tasks; it maps words into a vector space. This paper adopts Word2Vec [22] word embeddings pre-trained on a large corpus. Given a word embedding matrix $E_{w} \in ℝ^{d_{w} \times | V_{w} |}$, each word in a sentence is represented as a low-dimensional vector $e_{w} \in ℝ^{d_{w}}$, where d_w is the word embedding dimension. Similarly, given a character embedding matrix $E_{c} \in ℝ^{d_{c} \times | V_{c} |}$, each character in a word is represented as a low-dimensional vector $e_{c} \in ℝ^{d_{c}}$, where d_c is the character embedding dimension. |V_w| and |V_c| are the vocabulary sizes of the word embedding and character embedding, respectively.

3.4 Conditional random field

Conditional Random Field (CRF) is widely applied in named entity recognition (NER) and other sequence tagging tasks, where it usually improves performance. The CRF layer is connected to the output of the LSTM: it accepts the hidden-state representations from the LSTM and predicts the labels at the sentence level, whereas an LSTM alone generally predicts labels at the token level.

3.5 Model architecture

The proposed design is called MTL. This paper proposes four variants of MTL, whose architectures are presented in Figs. 2, 3, 4, and 5. All of the MTL models share a similar network architecture, including a word embedding layer, bidirectional LSTM (BiLSTM) layers, and CRF layers. The left side of each MTL model is a CRF with sentimental IOB labels (Sentiment-CRF, S-CRF), and the right side is a CRF with general IOB labels (General-CRF, G-CRF). The four architectures differ as follows:

Fig. 2

The architecture of MTL-1.

Fig. 3

The architecture of MTL-2.

Fig. 4

The architecture of MTL-3.

Fig. 5

The architecture of MTL-4.

MTL-1: As shown in Fig. 2, S-CRF and G-CRF share both the embedding layer and the BiLSTM. In this variant, the hidden-state representation from the single BiLSTM is delivered to both S-CRF and G-CRF.

MTL-2: As shown in Fig. 3, S-CRF and G-CRF share the embedding layer but not the BiLSTM. The output of the embedding layer is fed into two different BiLSTM networks, whose hidden outputs are fed into S-CRF and G-CRF, respectively.

MTL-3: As shown in Fig. 4, S-CRF and G-CRF do not share the embedding layer but share the BiLSTM. The two embedding layers are fed into the same BiLSTM network, which produces two different hidden-state representations that are fed into S-CRF and G-CRF, respectively.

MTL-4: As shown in Fig. 5, S-CRF and G-CRF share neither the embedding layer nor the BiLSTM. The two embedding layers are fed into two different BiLSTM networks, which produce different hidden-state representations that are fed into S-CRF and G-CRF, respectively.

All of the MTL models share a similar architecture; they differ only in which components are shared between the two branches. S-CRF, on the left side of MTL-1, outputs sentimental IOB labels that are used for the joint task of ATE and APC (ATEPC). G-CRF, on the right side of MTL-1, outputs general IOB labels that are used for the ATE task.
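The four variants differ only in two sharing decisions, which can be summarized as a small lookup. The sketch below is a schematic, not the paper's implementation; the class name `Variant` and the module names (`emb`, `emb_S`, `bilstm_G`, and so on) are hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Variant:
    share_embedding: bool  # do S-CRF and G-CRF read the same embedding layer?
    share_bilstm: bool     # do S-CRF and G-CRF read the same BiLSTM?


VARIANTS: Dict[str, Variant] = {
    "MTL-1": Variant(share_embedding=True, share_bilstm=True),
    "MTL-2": Variant(share_embedding=True, share_bilstm=False),
    "MTL-3": Variant(share_embedding=False, share_bilstm=True),
    "MTL-4": Variant(share_embedding=False, share_bilstm=False),
}


def branch_modules(name: str) -> Dict[str, Tuple[str, str]]:
    """Map each CRF branch to the (embedding, BiLSTM) modules that feed it."""
    v = VARIANTS[name]
    emb = ("emb", "emb") if v.share_embedding else ("emb_S", "emb_G")
    rnn = ("bilstm", "bilstm") if v.share_bilstm else ("bilstm_S", "bilstm_G")
    return {"S-CRF": (emb[0], rnn[0]), "G-CRF": (emb[1], rnn[1])}


for variant_name in VARIANTS:
    print(variant_name, branch_modules(variant_name))
```

Identical module names in the two branches indicate shared parameters, so MTL-1 has the fewest parameters and MTL-4 the most.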

For MTL-1, the embedding layer delivers the concatenation of the word embedding and the character representation. The character representation is the hidden state delivered by a character-level BiLSTM encoder, h_c, which is calculated as follows:

$x_{m} = E_{c} ⊢ c_{m}$ (7)

$\overrightarrow{h_{c}} = \overrightarrow{LSTM_{c}} (x_{m})$ (8)

$\overleftarrow{h_{c}} = \overleftarrow{LSTM_{c}} (x_{m})$ (9)

$h_{c} = [\overrightarrow{h_{c}}, \overleftarrow{h_{c}}]$ (10)

“⊢” denotes the embedding lookup, and “,” denotes vector concatenation. $\overrightarrow{h_{c}}$ and $\overleftarrow{h_{c}}$ are the hidden states output by the forward and backward LSTM encoders. The final output of the character BiLSTM, h_c, is taken as the character representation x_c. x_c and the word embedding x_w are concatenated to obtain the joint representation x_cw:

$x_{w} = E_{w} ⊢ w_{n}$ (11)

$x_{cw} = [x_{c}, x_{w}]$ (12)

Feeding x_cw into the BiLSTM yields h_cw, which is calculated as follows:

$\overrightarrow{h_{cw}} = \overrightarrow{LSTM_{w}} (x_{cw})$ (13)

$\overleftarrow{h_{cw}} = \overleftarrow{LSTM_{w}} (x_{cw})$ (14)

$h_{cw} = [\overrightarrow{h_{cw}}, \overleftarrow{h_{cw}}]$ (15)

Finally, by feeding h_cw into both S-CRF and G-CRF in MTL-1, ATEPC and ATE are performed at the same time. This yields two predicted label sequences: the sentimental IOB labels and the general IOB labels. Suppose the sequence delivered to the CRF layers is X = [x₁, x₂, …, x_T] and the label sequence is Y = [y₁, y₂, …, y_T]. The score of the sequence, score(X, Y), is calculated as follows:

$score (X, Y) = \sum_{t = 0}^{T} A_{y_{t}, y_{t + 1}} + \sum_{t = 1}^{T} Z_{t, y_{t}}$ (16)

A is the matrix of transition scores, and A_ij represents the score of a transition from tag i to tag j. Z is the input of the CRF, that is, h_cw, the output of the BiLSTM, and Z_(t,i) represents the score of tag i for the t-th token. During training, the negative log-likelihood of the correct label sequence is minimized:

$p (Y | X) = \frac{e^{score (X, Y)}}{\sum_{\hat{Y} \in Y_{X}} e^{score (X, \hat{Y})}}$ (17)

${loss}_{sent / asp} = - \sum_{(X, Y)} \log p (Y | X)$ (18)
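The following sketch shows how Eqs. (16)-(18) can be evaluated for one sentence in pure Python. It assumes that y_0 and y_{T+1} in Eq. (16) are dedicated start and end tags stored as two extra rows and columns of A, which is a common convention but is not stated in the text; all function names are illustrative.

```python
import math
from itertools import product
from typing import List, Sequence

Matrix = List[List[float]]


def logsumexp(xs: Sequence[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(A: Matrix, Z: Matrix, y: Sequence[int], start: int, end: int) -> float:
    """Eq. (16): transition scores along start -> y_1 ... y_T -> end plus emission scores."""
    path = [start] + list(y) + [end]
    transitions = sum(A[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(Z[t][tag] for t, tag in enumerate(y))
    return transitions + emissions


def log_partition(A: Matrix, Z: Matrix, start: int, end: int) -> float:
    """Log of the denominator of Eq. (17), computed with the forward algorithm."""
    K = len(Z[0])
    alpha = [A[start][j] + Z[0][j] for j in range(K)]
    for t in range(1, len(Z)):
        alpha = [logsumexp([alpha[i] + A[i][j] for i in range(K)]) + Z[t][j]
                 for j in range(K)]
    return logsumexp([alpha[j] + A[j][end] for j in range(K)])


def brute_force_log_partition(A: Matrix, Z: Matrix, start: int, end: int) -> float:
    """Same quantity by enumerating every label sequence; only feasible for tiny inputs."""
    K = len(Z[0])
    return logsumexp([sequence_score(A, Z, y, start, end)
                      for y in product(range(K), repeat=len(Z))])


def neg_log_likelihood(A: Matrix, Z: Matrix, y: Sequence[int], start: int, end: int) -> float:
    """One term of Eq. (18): -log p(Y | X) for a single sentence."""
    return log_partition(A, Z, start, end) - sequence_score(A, Z, y, start, end)
```

The forward algorithm computes the normalizer in O(T·K²) time, whereas direct enumeration of Y_X grows as K^T; the brute-force function is included only to check the recursion on small examples.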

Y_X is the set of all possible label sequences for X; p (Y|X) is the probability of the label sequence Y given X. MTL has two label outputs. In this paper, the loss functions of ATEPC and ATE are combined as a weighted sum:

$losses = λ {loss}_{asp} + (1 - λ) {loss}_{sent}$ (19)

loss_asp is the loss of the ATE task; loss_sent is the loss of the ATEPC task; λ ∈ (0, 1) is the weight coefficient between ATE and ATEPC. During training, the quantity minimized is losses.
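Equation (19) amounts to a one-line convex combination of the two CRF losses. The sketch below is a minimal illustration; the function name `fusion_loss` and the explicit range check are assumptions added for clarity.

```python
def fusion_loss(loss_asp: float, loss_sent: float, lam: float) -> float:
    """Eq. (19): lam weights the ATE loss, (1 - lam) weights the ATEPC loss."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return lam * loss_asp + (1.0 - lam) * loss_sent


# Example: ATE loss 2.0, ATEPC loss 4.0, lam = 0.25
print(fusion_loss(2.0, 4.0, 0.25))
```

Because the weights sum to one, λ trades the two tasks off against each other without changing the overall scale of the loss.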

4 Experiment

4.1 Datasets and experiment settings

To verify the performance of MTL, experiments are conducted on the two benchmark datasets of SemEval-2014 Task 4. These datasets contain reviews about restaurants and laptops. The statistics of the datasets are shown in Table 1.

Table 1

Statistics of the SemEval-2014 datasets.

Dataset	Sentences			Terms
	Train	Test	Total	Train	Test	Total
Restaurant	3041	800	3841	3693	1134	4827
Laptop	3045	800	3845	2358	654	3012

This paper uses the custom word embedding W2V [17], which is based on Word2Vec and was trained on corpora from different domains. The word embedding dimension is 150. The model is trained with the Adam optimizer with a learning rate of 0.001 and a batch size of 5; the number of hidden units in the BiLSTM layer is 200. The character embedding has dimension 100 and is randomly initialized. The LSTM applied to the character embeddings has 100 hidden units. To prevent over-fitting, dropout with a value of 0.5 is used.
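For reference, the settings above can be collected in a single configuration object. The class and field names are illustrative; the values are those stated in the text, and whether the hidden-unit counts are per direction is not specified in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    word_embedding_dim: int = 150      # W2V, pre-trained
    char_embedding_dim: int = 100      # randomly initialized
    char_lstm_hidden_units: int = 100  # LSTM over character embeddings
    bilstm_hidden_units: int = 200     # main BiLSTM layer
    optimizer: str = "adam"
    learning_rate: float = 0.001
    batch_size: int = 5
    dropout: float = 0.5


CONFIG = TrainingConfig()
print(CONFIG)
```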

4.2 Baselines

To better evaluate the MTL variants, this paper selects several strong models for comparison. The experimental results in Section 4.3 show that our models are competitive with them.

DCU: Various manual features are used as the input of an SVM. It was the best model in subtask 2 (Aspect Term Polarity) of SemEval-2014 Task 4 (ABSA).

Memnet [24]: It uses a deep memory network for APC and designs different attention strategies. It uses content and position information to learn context weights and the text representation.

RAM [25]: It proposes a recurrent attention network on memory. It captures features from multiple attention passes and then predicts the final sentiment polarity.

WDEmb [26]: It is an unsupervised ATE model that combines word embeddings and dependency path embeddings. It performs ATE by feeding the combined embedding into a CRF.

RNCRF+F, RNCRF-O+F [8]: These models integrate recursive neural networks and a CRF into a unified framework and jointly extract explicit aspect terms and opinion terms with the help of manual features.

CMLA [27]: It is a multi-task model that couples multi-layer attentions for the co-extraction of aspect terms and opinion terms. One attention is used for extracting aspect terms, and the other is used for extracting opinion terms.

MNN-1, MNN-2 [18]: These models are a multi-task joint framework for simultaneous APC and ATE, which uses an attention mechanism to learn the joint representation of ATE and APC.

MATEPC [17]: It is a multi-task learning model for simultaneous ATE and APC. It combines IOB labels with the sentimental labels of aspect terms into a new label set for the model.

4.3 Experiment analysis

In this paper, the macro F1 score is adopted as the evaluation criterion for ATE, while the accuracy is the criterion for APC. Table 2 is the accuracy of the baseline models for APC, and Table 3 is the F1 score of the baseline models for ATE.
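As an illustration of how an F1 score can be obtained from predicted label sequences, the sketch below extracts spans from IOB tags and scores exact span matches. This is an assumption about the evaluation protocol: the paper reports macro F1 but does not spell out the matching or averaging rule, so the pooled exact-match span F1 shown here is only one common convention. The function names are illustrative.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, type)


def extract_spans(tags: Iterable[str]) -> Set[Span]:
    """Collect spans from tags such as B_asp, I_asp, B_pos, I_neg, and O."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        prefix, _, suffix = tag.partition("_")
        continues = prefix == "I" and start is not None and suffix == kind
        if continues:
            continue
        if start is not None:
            spans.add((start, i, kind))
        # Only a B tag opens a new span; a stray I tag is ignored.
        start, kind = (i, suffix) if prefix == "B" else (None, None)
    return spans


def span_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]]) -> float:
    """Exact-match span F1 with counts pooled over all sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Applied to general IOB tags, this scores ATE; applied to sentimental IOB tags, a span counts as correct only if both its boundaries and its polarity match, which corresponds to the joint ATEPC setting.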

Table 2

The accuracy of the baseline models for APC.

Models	Restaurant (%)	Laptop (%)
DCU	80.95	70.49
Memnet	80.95	72.37
RAM	80.23	74.49
MNN-1	77.99	71.94
MNN-2	79.91	72.85
MATEPC	81.17	69.38

Table 3

The F1 score of the baseline models for ATE.

Models	Restaurant (%)	Laptop (%)
WDEmb	84.97	75.16
RNCRF-O+F	84.25	77.26
RNCRF+F	84.93	78.42
CMLA	85.29	77.8
MNN-1	85.01	77.37
MNN-2	85.84	79.91
MATEPC	85.77	79.64

Tables 2 and 3 show that the multi-task joint models with sentimental IOB labels, MATEPC and MNN, achieve high accuracy and F1 scores on the Restaurant and Laptop datasets. In particular, the accuracy of MATEPC on the Restaurant dataset reaches 81.17%, which exceeds all other baselines. Moreover, the F1 of MNN-2 on both the Restaurant and Laptop datasets exceeds that of all other baselines. These results illustrate that sentimental IOB labels bring a clear benefit to MATEPC and MNN and are well suited to solving the ATE and APC tasks.

Tables 4 and 5 present the experimental results of MTL on the Restaurant and Laptop datasets. ATEPC : ATE corresponds to (1 - λ) : λ in Eq. (19). F1_ATE is the F1 score of ATE from the G-CRF branch (the right side of MTL); F1_ATEPC is the F1 score of ATEPC from the S-CRF branch (the left side of MTL); Accuracy is the polarity classification (APC) accuracy of the MTL model. For each of the four models, Tables 4 and 5 list the best F1_ATE, F1_ATEPC, and Accuracy together with the ATEPC : ATE setting at which each was obtained.

Table 4

The experimental results of the MTL models on the Restaurant dataset. “ATEPC : ATE” represents (1 - λ) : λ in Eq. (19).

Model	Restaurant
	ATEPC : ATE	Metrics	Value	Comparison model	Metrics	Value	improvement (%)
MTL-1	0.2:0.8	F1_ATE	86.21	MNN-2	F1	85.84	0.37
	0.4:0.6	F1_ATEPC	86.46				0.62
	0.6:0.4	Accuracy	81.84	MATEPC	Accuracy	81.17	0.67
MTL-2	0.5:0.5	F1_ATE	86.1	MNN-2	F1	85.84	0.26
	0.4:0.6	F1_ATEPC	85.79				–
	0.7:0.3	Accuracy	81.73	MATEPC	Accuracy	81.17	0.56
MTL-3	0.1:0.9	F1_ATE	86.46	MNN-2	F1	85.84	0.62
	0.1:0.9	F1_ATEPC	86.3				0.46
	0.6:0.4	Accuracy	81.82	MATEPC	Accuracy	81.17	0.65
MTL-4	0.5:0.5	F1_ATE	86.88	MNN-2	F1	85.84	1.04
	0.1:0.9	F1_ATEPC	85.72				–
	0.9:0.1	Accuracy	82.33	MATEPC	Accuracy	81.17	1.16

Table 5

The results of the MTL models on the Laptop dataset. “ATEPC : ATE” represents (1 - λ) : λ in Eq. (19).

Model	Laptop
	ATEPC : ATE	Metrics	Value	Comparison model	Metrics	Value	Improvement(%)
MTL-1	0.3:0.7	F1_ATE	80.66	MNN-2	F1	79.91	0.75
	0.1:0.9	F1_ATEPC	81.44				1.53
	0.8:0.2	Accuracy	76.09	RAM	Accuracy	74.49	1.6
MTL-2	0.6:0.4	F1_ATE	80.06	MNN-2	F1	79.91	0.15
	0.1:0.9	F1_ATEPC	80.75				0.84
	0.7:0.3	Accuracy	74.29	RAM	Accuracy	74.49	–
MTL-3	0.4:0.6	F1_ATE	80.96	MNN-2	F1	79.91	1.05
	0.4:0.6	F1_ATEPC	80.62				0.71
	0.9:0.1	Accuracy	73.08	RAM	Accuracy	74.49	–
MTL-4	0.8:0.2	F1_ATE	80.76	MNN-2	F1	79.91	0.85
	0.2:0.8	F1_ATEPC	80.34				0.43
	0.4:0.6	Accuracy	75.2	RAM	Accuracy	74.49	0.71

As Table 4 shows, on the Restaurant dataset MTL-1 obtains its best F1_ATE, F1_ATEPC, and Accuracy when ATEPC : ATE is set to 0.2:0.8, 0.4:0.6, and 0.6:0.4, respectively. Compared with the F1 of MNN-2, F1_ATE and F1_ATEPC increase by 0.37 and 0.62 percentage points; compared with MATEPC, Accuracy increases by 0.67 points. For MTL-2, F1_ATE, F1_ATEPC, and Accuracy reach their maxima at 0.5:0.5, 0.4:0.6, and 0.7:0.3. F1_ATE and Accuracy increase by 0.26 and 0.56 points over the F1 of MNN-2 and the Accuracy of MATEPC, respectively, while F1_ATEPC (85.79%) does not exceed MNN-2. For MTL-3, the best settings are 0.1:0.9, 0.1:0.9, and 0.6:0.4. F1_ATE and F1_ATEPC increase by 0.62 and 0.46 points over the F1 of MNN-2, and Accuracy increases by 0.65 points over MATEPC. For MTL-4, the best settings are 0.5:0.5, 0.1:0.9, and 0.9:0.1. F1_ATE and Accuracy increase by 1.04 and 1.16 points over the F1 of MNN-2 and the Accuracy of MATEPC, respectively, while F1_ATEPC (85.72%) does not exceed MNN-2.

Table 5 reports the results on the Laptop dataset, from which we examine the best ATEPC : ATE settings and the performance of the MTL models. For MTL-1, the best settings are 0.3:0.7, 0.1:0.9, and 0.8:0.2. Its F1_ATE and F1_ATEPC increase by 0.75 and 1.53 percentage points over the F1 of MNN-2, and its Accuracy increases by 1.60 points over RAM. For MTL-2, F1_ATE, F1_ATEPC, and Accuracy peak at 0.6:0.4, 0.1:0.9, and 0.7:0.3; F1_ATE and F1_ATEPC increase by 0.15 and 0.84 points over MNN-2, while Accuracy (74.29%) falls slightly below RAM (74.49%). For MTL-3, the best settings are 0.4:0.6, 0.4:0.6, and 0.9:0.1; F1_ATE and F1_ATEPC increase by 1.05 and 0.71 points over MNN-2, while Accuracy (73.08%) is below RAM. For MTL-4, the best settings are 0.8:0.2, 0.2:0.8, and 0.4:0.6; F1_ATE and F1_ATEPC increase by 0.85 and 0.43 points over MNN-2, and Accuracy increases by 0.71 points over RAM.

The results in Tables 4 and 5 show that the four MTL models perform consistently on ATE. On both ATE and APC, the best MTL models clearly outperform the single-task models that rely on manual features, which illustrates that MTL models are competitive even without manual features. Compared with MATEPC and MNN, which use only sentimental IOB labels, the MTL models based on both general and sentimental IOB labels obtain the best results on ATE and, for the best variant on each dataset, on APC. This suggests that general IOB labels provide extra features to the MTL models and that these features improve their performance. Meanwhile, the MTL models use dual labels (general IOB labels and sentimental IOB labels) to enhance the feature input, which provides more features for both the ATE and ATEPC branches. Moreover, the ATE and ATEPC branches reinforce each other and improve the performance of the MTL models on ATE and APC.

Table 6 presents the results of the MTL models on F1_ATE, F1_ATEPC, and Accuracy in a highly balanced state, that is, when a single ATEPC : ATE setting yields relatively large values for all three metrics at once. As Table 6 shows, MTL-1 attains the highest F1_ATEPC on both the Restaurant and Laptop datasets, and MTL-4 attains the highest F1_ATE and Accuracy on both datasets. The MTL models are thus competitive on both the F1 of ATE and the accuracy of APC. The results show that, without manual features and with lower resource consumption, MTL with general IOB labels and sentimental IOB labels is simple and efficient and can tackle ATE and APC at the same time.

Table 6

The results of the MTL models in a highly balanced state. “ATEPC : ATE” represents (1 - λ) : λ in Eq. (19).

Model	Restaurant (%)				Laptop (%)
	ATEPC : ATE	F1_ATE	F1_ATEPC	Accuracy	ATEPC : ATE	F1_ATE	F1_ATEPC	Accuracy
MTL-1	0.6:0.4	86.02	86.35	81.84	0.7:0.3	80.51	80.46	73.64
MTL-2	0.7:0.3	85.68	85.14	81.73	0.5:0.5	79.15	80.22	72.9
MTL-3	0.6:0.4	85.57	86.19	81.82	0.6:0.4	80.27	80.39	73.03
MTL-4	0.9:0.1	86.17	85.37	82.33	0.8:0.2	80.76	79.27	74.46

4.4 The impact of character embedding

To verify the effectiveness of character embedding, a comparative experiment without character embedding as model input was designed, and the impact of character embedding on F1_ATE, F1_ATEPC, and Accuracy was analyzed.

Figures 6 and 7 are line charts of the F1_ATE of the MTL-1 model on the Restaurant and Laptop datasets. They show that, for the same ATEPC : ATE ratio, the F1_ATE values with character embedding are generally larger than those without character embedding.

Fig. 6

F1_ATE of MTL-1 on Restaurant dataset.

Fig. 7

F1_ATE of MTL-1 on Laptop dataset.

Figures 8 and 9 are line charts of the F1_ATEPC of the MTL-1 model on the Restaurant and Laptop datasets. They show that, for the same ATEPC : ATE ratio, the F1_ATEPC values with character embedding are generally larger than those without character embedding.

Fig. 8

F1_ATEPC of MTL-1 on Restaurant dataset.

Fig. 9

F1_ATEPC of MTL-1 on Laptop dataset.

Figures 10 and 11 are line charts of the Accuracy of MTL-1 on the Restaurant and Laptop datasets. They show that character embedding has little effect on the accuracy of the model: for the same ATEPC : ATE ratio, the accuracy differs little whether or not character embedding is used.

Fig. 10

The accuracy of MTL-1 on Restaurant dataset.

Fig. 11

The accuracy of MTL-1 on Laptop dataset.

The above experiments analyze the impact of character embedding on F1_ATE, F1_ATEPC, and Accuracy for MTL-1 only. The results of MTL-2, MTL-3, and MTL-4 on the Restaurant and Laptop datasets are similar to those of MTL-1, so the details are omitted. The results in Figs. 6-11 indicate that character embedding increases F1_ATE and F1_ATEPC but has little impact on Accuracy. Character embedding provides morphological features of characters, relationships between characters and words, and regularities in the character composition of aspect terms, all of which help aspect term extraction. However, these features have little effect on aspect polarity classification.
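
The difference between F1_ATE and F1_ATEPC can be made concrete with a small sketch. It assumes span-level exact matching: a predicted aspect term counts for F1_ATE when its boundaries match a gold term, and for F1_ATEPC only when its boundaries and its polarity both match. The tag names (`B-ASP`, `B-POS`, `B-NEG`), the helper functions, and the example tags are hypothetical and are not taken from the paper's evaluation script.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect labeled spans from an IOB2 tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match F1 over two sets of labeled spans."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def to_general(tags: List[str]) -> List[str]:
    """Map sentimental IOB2 tags to general ones by dropping the polarity."""
    return [t if t == "O" else t[:2] + "ASP" for t in tags]


# Hypothetical ten-token sentence with two aspect terms.
gold = ["O", "B-POS", "O", "O", "O", "O", "B-NEG", "I-NEG", "O", "O"]
# The prediction finds both terms but gets the second polarity wrong.
pred = ["O", "B-POS", "O", "O", "O", "O", "B-POS", "I-POS", "O", "O"]

f1_ate = span_f1(extract_spans(to_general(gold)), extract_spans(to_general(pred)))
f1_atepc = span_f1(extract_spans(gold), extract_spans(pred))
print(f"F1_ATE = {f1_ate:.2f}, F1_ATEPC = {f1_atepc:.2f}")  # F1_ATE = 1.00, F1_ATEPC = 0.50
```

In this toy case the boundaries of both terms are correct, so F1_ATE is unaffected by the polarity error, while F1_ATEPC is halved. This is consistent with the observation that features which mainly help locate aspect terms can raise both F1 scores without changing polarity accuracy.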

5 Conclusion

This paper proposes a multi-task learning model (MTL) that tackles ATE and APC simultaneously. The experimental results on the Restaurant and Laptop datasets show that general IOB labels supply extra features that improve the performance of MTL, and that using general IOB labels together with sentimental IOB labels enriches the feature input of MTL. Furthermore, the ATE and ATEPC tasks in MTL reinforce each other. Compared with the main baseline models, MTL achieves state-of-the-art performance on ATE and APC. The MTL based on general IOB labels and sentimental IOB labels is more competitive than the compared single-task and multi-task models because it handles ATE and APC simultaneously and efficiently, without manual features and with lower resource consumption. In future work, the attention mechanism could be used to improve the performance of the multi-task learning model for ATE and APC. In addition, regularizing the new fusion weighted loss function is a potential research issue.

Footnotes

6

Thanks to the anonymous reviewers and the scholars who helped us. This research is supported by the Innovation Project of Graduate School of South China Normal University and funded by National Natural Science Foundation of China, Multi-modal Brain-Computer Interface and Its Application in Patients with Consciousness Disorder, Project approval number: 61876067.

This paper adopts “IOB2”, a commonly used tagging scheme for sequence labeling. “I”, “O”, and “B” mean inside, outside, and begin, respectively.

The datasets can be found at:



Table 1

Statistical tables of SemEval-2014 datasets.

Dataset	Sentences			Terms
	Train	Test	Total	Train	Test	Total
Restaurant	3041	800	3841	3693	1134	4827
Laptop	3045	800	3845	2358	654	3012

Table 2

The Accuracy of the baseline models for APC.

Models	Restaurant (%)	Laptop (%)
DCU	80.95	70.49
Memnet	80.95	72.37
RAM	80.23	74.49
MNN-1	77.99	71.94
MNN-2	79.91	72.85
MATEPC	81.17	69.38