Customer sentiment recognition in conversation based on bidirectional LSTM and self-attention mechanism

Abstract

In the e-commerce environment, conversations between customers and businesses contain a great deal of useful information about customer sentiment. Mining this information allows customer sentiment to be identified reliably, which helps in accurately identifying customer needs and improving customer satisfaction. For conversational sentiment analysis, most existing approaches take contextual information into account. Building on this, we focus on the degree of association between utterances, which captures the overall, useful sentiment information in a conversation more effectively. To this end, we propose a hybrid model to recognize customer sentiment in conversation. The model obtains utterance vectors carrying sentiment information through Sentiment Knowledge Enhanced Pre-training (SKEP), uses a bidirectional long short-term memory network (BiLSTM) to generate contextual semantic information, and then applies the self-attention mechanism to weight the degree of association between utterances and obtain customer sentiment information. Experimental results on the JD Dialog dataset show that our model recognizes customer sentiment in customer service conversations more accurately than the baseline models.

Keywords

Customer sentiment recognition; bidirectional long short-term memory network; self-attention mechanism; sentiment knowledge enhanced pre-training

1 Introduction

Recently, with the development of artificial intelligence, conversation systems have been gradually moving towards practical application. Among them, customer service chatbots are widely used in online business environments, such as Amazon’s Alexa [1], JD.com’s JIMI [2], and Alibaba’s AliMe [3]. The content of a customer service chatbot usually comes from a corpus of web-based knowledge [4], so it can answer accurately enough and save the company’s labor costs. However, emotional intelligence remains its weak point. Emotion is an indispensable element of human communication. When customers make inquiries, the chatbot sometimes fails to identify customer sentiment accurately, which may result in low customer satisfaction and negative reviews and, in turn, a decline in corporate performance. Consequently, accurately recognizing customer sentiment in conversations has become a vital way to shape a positive corporate image and improve the competitiveness of enterprises.

Recognizing customer sentiment in conversations requires considering an essential aspect, namely contextual information, which can enhance, weaken, or reverse the original sentiment of the current utterance [5]. Take the conversation in Fig. 1 as an example: considered in isolation, the fourth utterance expresses a neutral sentiment. Considered in context, however, it expresses a negative sentiment, which can be inferred from the third utterance indicating that the order was placed a long time ago. To take contextual information into account in conversational sentiment analysis, most researchers have used recurrent neural networks (RNN) [6] or long short-term memory (LSTM) [7]. Although these two methods capture contextual semantic information effectively, they do so only in the forward direction, which limits more fine-grained sentiment classification such as the five-category task used in this paper (very negative, negative, neutral, positive, and very positive). To identify fine-grained sentiment accurately, it is essential to capture bidirectional semantic information [8], which gives a more thorough understanding of the interaction between contextual sentiment words, degree words, and negation words. In addition, existing studies ignore the degree of association between utterances. Focusing on this degree of association helps to detect customer sentiment accurately because previous utterances influence the sentiment tendency of the current utterance to different degrees. For instance, compared with the first three utterances, the content of the fourth to sixth utterances has a greater impact on the sentiment of the seventh utterance. Thus, analyzing the sentiment of the seventh utterance should pay particular attention to its association with the fourth to sixth utterances.

Fig. 1

Illustration of a customer service conversation.

Conversational sentiment analysis differs from single-sentence sentiment analysis: it needs to consider not only the context of the current utterance but also the degree to which historical utterances influence the emotion of the current utterance. Models for single-sentence sentiment analysis, such as SVM [9] and Random Forest [10], pay no attention to contextual information or to this degree of influence. Recent research on sentiment analysis of customer service conversations models the speaker, contextual information, or multimodal aspects, for example DSAGCN [11], BiERU [12], and Affect-GCN [13]. In addition, most dialogue sentiment analysis models use BERT pre-trained models, which cannot extract rich emotional information such as aspect-sentiment pairs, sentiment words, and word polarity, and this affects the accuracy of emotion recognition. Because extracting sentiment knowledge during pre-training is challenging, the main objective of our study is to design a sentiment analysis model that extracts sentiment knowledge, contextual information, and the degree of correlation between utterances.

In this paper, to accurately recognize customer sentiment in customer service conversations by capturing context information from two directions and focusing on the degree of association between utterances, we propose a sentiment analysis approach based on the self-attention mechanism and bidirectional long short-term memory network (BiLSTM). As the first step of the method, we use a pre-training model to encode each utterance in conversations and extract the sentiment feature from utterances. After contextual sentiment information and bidirectional semantic dependencies between utterances are captured by BiLSTM, we apply the self-attention mechanism to discover the degree of association between utterances in conversations.

The main contributions of this paper can be summarized as follows.

We propose a self-attention-based BiLSTM model to recognize customer sentiment in conversation, which can effectively combine contextual information and the degree of association between utterances.

The SKEP model is used to generate the utterance representations with sentiment information.

The experimental results demonstrate that our model outperforms the baseline models for sentiment analysis on the JD Dialog dataset.

The remainder of this paper is organized as follows: Section 2 introduces related work on conversational sentiment analysis and on sentiment analysis with self-attention-based BiLSTM models. Section 3 presents the proposed model. Section 4 reports the comparative experiments and analyzes the results. Finally, Section 5 presents the conclusion and future prospects.

2 Related work

The related work mainly covers conversational sentiment analysis and sentiment analysis with self-attention-based BiLSTM models, which are described as follows.

2.1 Conversational sentiment analysis

Sentiment analysis is one of the most meaningful tasks in the field of NLP and has attracted great attention from academia [12]. It is also known as sentiment recognition [14] or sentiment classification [15, 16]. In the extant literature, conversational sentiment analysis research has mainly focused on speech sentiment recognition [17], textual sentiment analysis [18], image sentiment analysis [19], and multimodal sentiment analysis [20]. Recently, deep learning techniques such as capsule networks, convolutional neural networks (CNN), graph convolutional networks (GCN), and recurrent neural networks (RNN) have been applied to sentiment recognition. Singh et al. [21] proposed a novel capsule-network-based approach to analyze textual conversational sentiment. Sun et al. [4] proposed a hybrid model to detect abnormal sentiment in conversation, in which a CNN-LSTM method is used to identify users’ emotions. Ghosal et al. [22] proposed the Dialogue Graph Convolutional Network, which models inter-party and self-party dependency to improve context understanding for utterance-level emotion detection in conversations. Huddar et al. [23] utilized an RNN to capture the interlocutor state and the contextual state between utterances. Another line of research, which is related to our work, uses bidirectional LSTM for text sentiment analysis. Bidirectional long short-term memory (BiLSTM) can obtain contextual sentiment sequences by combining forward and backward hidden layers and can capture the connections between sentiment aspect words and their context words [16]. However, bidirectional LSTM has rarely been applied to conversational sentiment analysis and has mainly been applied to community reviews [24].

2.2 Sentiment analysis with self-attention based BiLSTM model

In sentiment analysis research, many researchers have used deep learning to capture contextual information, which can improve the accuracy of sentiment recognition. Majumder et al. [20] proposed a multimodal sentiment analysis method using context modeling, which mainly uses a recurrent neural network (RNN) to extract context-aware utterance features for three modalities. Shenoy and Sardana [25] proposed an end-to-end RNN architecture to model four factors that affect the sentiment of an utterance, i.e., the context of the conversation, the interlocutor state, the interlocutor intent, and the previous states and emotions of a particular participant in the conversation. In addition, LSTM-based models can also extract contextual features from utterances [7]. Yang et al. [26] proposed a method that models context, entity, and aspect memory for sentiment analysis of posts on social media platforms, in which the contextual information is captured by an LSTM model. However, a plain LSTM struggles to analyze sentiment when multiple types of context exist. To solve this problem, Keramatfar et al. [27] proposed a multithread hierarchical long short-term memory network to extract the different context types of a tweet, which helps to recognize its sentiment polarity. Tang et al. [28] proposed a method for target-dependent sentiment classification, which mainly uses target-dependent long short-term memory to capture contextual information and model the relatedness of a target word to its context words. A unidirectional LSTM can capture contextual information only in the forward direction, whereas a bidirectional LSTM obtains contextual information in both the forward and backward directions. Basiri et al. [29] utilized two independent bidirectional LSTM and GRU layers to extract both past and future contexts by combining the forward and backward hidden layers, then used the attention mechanism to focus on important parts, and finally performed sentiment analysis on the output of convolutional and pooling layers that reduce the dimensionality of the feature space. Lv et al. [30] applied a bidirectional long short-term memory network to extract the context, then used the self-attention mechanism and the encoder-decoder attention mechanism to capture the correlation between context words and the correlation between context words and aspect words, respectively.

In addition, attention mechanisms are often regarded as an effective way to recognize sentiment accurately by focusing on important information. Bahdanau et al. [31] were the first to apply attention mechanisms to the NLP field. Huddar et al. [32] proposed a novel method for multimodal sentiment analysis in conversation, which mainly uses an RNN to capture the interlocutor state and the contextual state between utterances and then uses a pair-wise attention mechanism to focus on the important information of the modalities before fusion. Huang et al. [32] proposed a model that integrates emotional intelligence (EI) and the attention mechanism for sentiment analysis: it first uses EI to improve the feature learning ability of the LSTM network and then applies an attention mechanism based on high-level abstraction to adaptively adjust the weights of the hidden text representation. Huddar et al. [33] proposed a model to address feature alignment between modalities in sentiment analysis, which uses BiLSTM to capture the contextual information among the words of an utterance and between nearby utterances and then applies an attention model based on weighted pooling to select the important features within each modality and the importance of each modality. However, these attention-based deep learning approaches ignore the influence of each utterance on the entire text. To solve this problem, Wang et al. [34] proposed a sentence-to-sentence attention network based on multi-head self-attention, which can focus on the importance of each sentence to the complete text. Gan et al. [35] proposed a sentiment analysis method that considers multi-channel features, which uses local attention and global attention to weight the output features of each channel and the fused features of all channels, respectively.

Inspired by, but differing from, the aforementioned studies, this paper proposes a conversational sentiment analysis method based on the self-attention mechanism and BiLSTM. The novelty of this approach is that it takes into account both contextual information and the degree of association between utterances. Specifically, BiLSTM is used to extract contextual semantic features for the current utterance, and the self-attention mechanism assigns different weights according to the degree of influence between utterances.

3 Methodology

The architecture of our model is shown in Fig. 2. The model mainly contains three steps. First, we regard each utterance in the dialogue dataset as the basic unit and form a sequence of utterances. The sentiment features are extracted from the dialogue text with the SKEP method, and each utterance vector with sentiment features is then used as the input of the BiLSTM. Second, the BiLSTM processes the utterance vectors by combining a forward and a backward LSTM to capture bidirectional semantic dependencies and extract the hidden information in different aspects of the utterances. Finally, the self-attention mechanism is applied to extract the dependence among all utterances in the conversation. A softmax classifier is then applied to the output of the self-attention layer to classify the emotion of each utterance into one of five sentiment classes, i.e., very negative, negative, neutral, positive, and very positive. The proposed model is described in three parts: the SKEP pre-training model, BiLSTM, and the self-attention mechanism.

Fig. 2

The framework of BiLSTM with self-attention model.

3.1 SKEP

Existing conversational sentiment analysis work ignores sentiment knowledge, i.e., sentiment words and aspect-sentiment pairs, during pre-training. The importance of sentiment knowledge has been verified on different tasks, i.e., sentence-level sentiment classification, aspect-level sentiment classification, and opinion extraction. Sentiment words play a key role in sentiment analysis, and many studies have focused on extracting them. These studies also analyze word polarity; for example, traditional dictionary-based models use word polarity to analyze the sentiment of text. Aspect-sentiment pairs reveal more information than sentiment words alone because they clearly indicate which aspect an emotion refers to. Therefore, we use SKEP to extract sentiment words, word polarity, and aspect-sentiment pairs and thereby generate utterance representations that are more suitable for customer sentiment analysis in conversations.

Sentiment words are recognized by the Pointwise Mutual Information (PMI) method together with sentiment seed words. After the sentiment words are mined, aspect-sentiment pairs are extracted by simple constraints. An aspect-sentiment pair consists of an aspect and its corresponding sentiment word, so a sentiment word together with its closest noun is considered an aspect-sentiment pair. The pairs are recognized under the constraint $\| d_{s} - d_{a} \| \leq 3$, i.e., the distance between the sentiment word and the aspect word is at most three tokens [36]. The process of SKEP is shown in Fig. 3.
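The distance constraint can be illustrated with a small sketch. It assumes that the sentence is already tokenized and that the token positions of the sentiment words and of the candidate aspect nouns are known; the function name `pair_aspects` and the example sentence are hypothetical, and the actual SKEP implementation may differ.

```python
def pair_aspects(sentiment_pos, aspect_pos, max_dist=3):
    """Pair each sentiment word with its closest aspect word.

    sentiment_pos, aspect_pos: token indices (assumed to be given).
    A pair is kept only if |d_s - d_a| <= max_dist.
    Returns a list of (aspect_index, sentiment_index) tuples.
    """
    pairs = []
    for s in sentiment_pos:
        # Candidate aspects that satisfy the distance constraint.
        near = [a for a in aspect_pos if abs(s - a) <= max_dist]
        if near:
            closest = min(near, key=lambda a: abs(s - a))
            pairs.append((closest, s))
    return pairs


# Hypothetical example sentence (not taken from the dataset).
tokens = ["the", "delivery", "was", "fast", "and", "the", "price", "is", "fair"]
sentiment_pos = [3, 8]  # "fast", "fair"
aspect_pos = [1, 6]     # "delivery", "price"
pairs = pair_aspects(sentiment_pos, aspect_pos)
print([(tokens[a], tokens[s]) for a, s in pairs])
```

For this sentence the sketch pairs "delivery" with "fast" and "price" with "fair"; a sentiment word with no noun within three tokens yields no pair.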

Fig. 3

Example of SKEP model.

SKEP masks a sequence by three masking strategies:

Aspect-sentiment Pair Masking. At most two aspect-sentiment pairs in a sentence are randomly selected and masked.

Sentiment Word Masking. Sentiment words are masked, and the number of tokens replaced by MASK in this step cannot exceed 10% of the total number of tokens in the current sentence.

Common Token Masking. If the sentiment words masked in the second step account for less than 10% of the tokens, common tokens are masked to fill the remainder of the 10% budget.
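The three masking strategies can be sketched as a position-selection routine. This is an illustrative sketch only: the function name `choose_mask_positions`, its arguments, and the way the 10% budget is rounded down are assumptions, and the aspect-sentiment pairs and sentiment word positions are assumed to have been extracted beforehand.

```python
import random


def choose_mask_positions(n_tokens, pairs, sentiment_words, rng,
                          max_pairs=2, budget_ratio=0.10):
    """Return (pair_mask, sentiment_mask, common_mask) as sorted index lists.

    pairs: list of (aspect_idx, sentiment_idx) tuples.
    sentiment_words: token indices of all sentiment words.
    rng: a random.Random instance, so that runs are reproducible.
    """
    # Step 1: mask at most `max_pairs` randomly chosen aspect-sentiment pairs.
    chosen = rng.sample(pairs, min(max_pairs, len(pairs)))
    pair_mask = sorted({idx for pair in chosen for idx in pair})

    # Step 2: mask sentiment words not covered by step 1, capped at the budget.
    budget = int(n_tokens * budget_ratio)
    free_sentiment = [i for i in sentiment_words if i not in pair_mask]
    rng.shuffle(free_sentiment)
    sentiment_mask = sorted(free_sentiment[:budget])

    # Step 3: fill what is left of the budget with common (non-sentiment) tokens.
    remaining = budget - len(sentiment_mask)
    taken = set(pair_mask) | set(sentiment_words)
    pool = [i for i in range(n_tokens) if i not in taken]
    common_mask = sorted(rng.sample(pool, min(remaining, len(pool))))
    return pair_mask, sentiment_mask, common_mask


rng = random.Random(0)
pair_mask, sentiment_mask, common_mask = choose_mask_positions(
    n_tokens=40,
    pairs=[(1, 3), (6, 8), (12, 14)],
    sentiment_words=[3, 8, 14, 20],
    rng=rng,
)
print(pair_mask, sentiment_mask, common_mask)
```

With 40 tokens the budget for steps 2 and 3 is four tokens, so the sentiment words left over after pair masking are masked first and common tokens fill the rest.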

SKEP uses the sentiment masking mechanism to recognize the sentiment information of an input text and generates a corrupted version by masking this information [36]. SKEP then recovers the sentiment information with three sentiment objectives, i.e., sentiment word prediction L_sw, word polarity prediction L_wp, and aspect-sentiment pair prediction L_ap, which are calculated as follows.

The objective optimization function L_m of the SKEP model is as follows: $\begin{matrix} L_{m} = L_{sw} + L_{wp} + L_{ap} \end{matrix}$ (1)

The sentiment word objective L_sw is calculated as follows:

$\begin{matrix} L_{sw} = - \sum_{i = 1}^{n} m_{i} \times y_{i}^{sw} \log {\hat{y}}_{i}^{sw} \end{matrix}$ (2)

$\begin{matrix} {\hat{y}}_{i}^{sw} = \mathrm{softmax} ({\tilde{x}}_{i} W^{sw} + b^{sw}) \end{matrix}$ (3)

The word polarity objective L_wp is calculated as follows:

$\begin{matrix} L_{wp} = - \sum_{i = 1}^{n} m_{i} \times y_{i}^{wp} \log {\hat{y}}_{i}^{wp} \end{matrix}$ (4)

$\begin{matrix} {\hat{y}}_{i}^{wp} = \mathrm{softmax} ({\tilde{x}}_{i} W^{wp} + b^{wp}) \end{matrix}$ (5)

The aspect-sentiment pair objective L_ap is calculated as follows:

$\begin{matrix} L_{ap} = - \sum_{a = 1}^{A} y_{a} \log {\hat{y}}_{a} \end{matrix}$ (6)

$\begin{matrix} {\hat{y}}_{a} = \mathrm{sigmoid} ({\tilde{x}}_{1} W^{ap} + b^{ap}) \end{matrix}$ (7)

where W and b are the parameters in the pre-training process; ${\tilde{x}}_{i}$ is the encoder output vector at position i; ${\hat{y}}_{i}^{sw}$ , ${\hat{y}}_{i}^{wp}$ and ${\hat{y}}_{a}$ are normalized probability vectors; y_a is the sparse vector representation of a target aspect-sentiment pair; and m_i = 1 if the ith word is a sentiment word and m_i = 0 otherwise. L_sw is therefore not computed for every token but only at the positions of sentiment words; the positions of non-sentiment words do not participate in the calculation. L_wp calculates the loss of polarity prediction.
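The role of the mask m_i in Eqs. (2) and (4) can be made concrete with a short sketch. It assumes one-hot targets, so the inner product of y_i and log ŷ_i reduces to the log-probability of the true class, and it takes the per-token scores as given; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_prediction_loss(logits, targets, mask):
    """L = -sum_i m_i * log(y_hat_i[target_i]), as in Eqs. (2) and (4).

    logits:  one score vector per token (x_i W + b is assumed to be applied already).
    targets: index of the true word (or polarity class) at each position.
    mask:    m_i = 1 at sentiment-word positions, 0 elsewhere.
    """
    loss = 0.0
    for scores, target, m in zip(logits, targets, mask):
        if m:  # non-sentiment positions contribute nothing
            loss -= math.log(softmax(scores)[target])
    return loss


# Three tokens, two classes; only the middle token is a sentiment word.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
loss = masked_prediction_loss(logits, targets=[0, 0, 1], mask=[0, 1, 0])
print(round(loss, 4))
```

Only the middle position enters the sum, so changing the scores of the other two tokens leaves the loss unchanged.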

We use SKEP to obtain the emotional information of conversations and provide it to the whole model in the form of emotional context vectors. Finally, SKEP generates the utterance representations with sentiment information, which serve as the input vectors to the BiLSTM.

3.2 BiLSTM

BiLSTM is an extension of LSTM, which effectively alleviates some problems of RNN, such as long-distance dependence, vanishing gradients, and exploding gradients. It is composed of a forward LSTM and a backward LSTM; hence, BiLSTM can capture contextual semantic information more precisely. An LSTM consists of the input gate i, the forget gate f, the output gate o, and the cell memory state C. These modules are computed as follows.

$\begin{matrix} f_{t} = σ (W_{f} [h_{t - 1}, x_{t}] + b_{f}) \end{matrix}$ (8)

$\begin{matrix} i_{t} = σ (W_{i} [h_{t - 1}, x_{t}] + b_{i}) \end{matrix}$ (9)

$\begin{matrix} o_{t} = σ (W_{o} [h_{t - 1}, x_{t}] + b_{o}) \end{matrix}$ (10)

$\begin{matrix} {\tilde{C}}_{t} = \tanh (W_{C} [h_{t - 1}, x_{t}] + b_{C}) \end{matrix}$ (11)

$\begin{matrix} C_{t} = f_{t} ⊙ C_{t - 1} + i_{t} ⊙ {\tilde{C}}_{t} \end{matrix}$ (12)

$\begin{matrix} h_{t} = o_{t} ⊙ \tanh (C_{t}) \end{matrix}$ (13)

where h_t is the output vector of the hidden layer at position t; σ is the sigmoid function; ⊙ denotes element-wise multiplication; x_t is the input vector of the LSTM; W_i, W_f, W_o, and W_C ∈ R^{d_h×d_k} are the weight matrices; and b_i, b_f, b_o, and b_C ∈ R^{d_h} are the biases.

Traditional LSTM models can capture semantic information in the forward direction but ignore the backward context information [24]. Therefore, we use the BiLSTM model to capture bidirectional semantic dependencies. By combining a forward and a backward LSTM, the BiLSTM builds contextual sentiment sequences from the features of the utterance vectors of the JD Dialog dataset. The forward LSTM reads the sequence from the first utterance to the last, and the backward LSTM reads it from the last to the first: $\begin{matrix} {\vec{C}}_{t}, {\vec{h}}_{t} = g^{LSTM} ({\vec{C}}_{t - 1}, {\vec{h}}_{t - 1}, x_{t}) \end{matrix}$ (14) $\begin{matrix} {\overset{\leftarrow}{C}}_{t}, {\overset{\leftarrow}{h}}_{t} = g^{LSTM} ({\overset{\leftarrow}{C}}_{t + 1}, {\overset{\leftarrow}{h}}_{t + 1}, x_{t}) \end{matrix}$ (15)

where the arrows denote the forward and backward directions, so that the cell states C_t and hidden states h_t take both the forward and the backward context into account. At the tth position, the output is $h_{t} = {\vec{h}}_{t} \oplus {\overset{\leftarrow}{h}}_{t}$ , which is the concatenation of the hidden layer output vectors of the forward and backward LSTM. Finally, the output of the BiLSTM is H = {h₁, h₂, …, h_n}.
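As an illustration of Eqs. (8)-(15), the following sketch implements one LSTM step and a bidirectional pass in plain Python. The parameter names (`Wf`, `bf`, and so on), the random initialization, and the toy dimensions are assumptions made for the example; a practical model would use a deep learning framework and the cell size of 128 reported in Section 4.1.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, v):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(p, h_prev, c_prev, x):
    """One LSTM step following Eqs. (8)-(13)."""
    z = h_prev + x  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(p["Wf"], p["bf"], z)]
    i = [sigmoid(v) for v in affine(p["Wi"], p["bi"], z)]
    o = [sigmoid(v) for v in affine(p["Wo"], p["bo"], z)]
    c_tilde = [math.tanh(v) for v in affine(p["WC"], p["bC"], z)]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def run_lstm(p, xs, d_h):
    """Run an LSTM over xs from left to right, starting from zero states."""
    h, c = [0.0] * d_h, [0.0] * d_h
    outputs = []
    for x in xs:
        h, c = lstm_step(p, h, c, x)
        outputs.append(h)
    return outputs


def bilstm(p_fwd, p_bwd, xs, d_h):
    """Concatenate forward and backward hidden states at each position."""
    fwd = run_lstm(p_fwd, xs, d_h)
    # The backward LSTM reads the reversed sequence; re-reverse to align positions.
    bwd = run_lstm(p_bwd, xs[::-1], d_h)[::-1]
    return [hf + hb for hf, hb in zip(fwd, bwd)]


def init_params(d_in, d_h, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    p = {}
    for gate in ("f", "i", "o", "C"):
        p["W" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(d_h + d_in)]
                         for _ in range(d_h)]
        p["b" + gate] = [0.0] * d_h
    return p


rng = random.Random(0)
d_in, d_h = 3, 2
xs = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(4)]  # 4 utterance vectors
p_fwd, p_bwd = init_params(d_in, d_h, rng), init_params(d_in, d_h, rng)
H = bilstm(p_fwd, p_bwd, xs, d_h)
print(len(H), len(H[0]))
```

Each output vector has dimension 2·d_h because the forward and backward hidden states are concatenated, which matches the definition of h_t in the text.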

3.3 Self-attention mechanism and output

Although BiLSTM can build contextual sentiment sequences from conversations, it cannot pick out the important information in the context. To improve the accuracy of sentiment analysis, the self-attention mechanism is introduced; it assigns different weights to the associations between utterances according to the degree of influence of each utterance on the other utterances and focuses on the sentiment features in the utterances [37]. The self-attention mechanism can be formally expressed as follows:

$\begin{matrix} g_{i} = \sum_{j \neq i} α_{i, j} h_{j} \end{matrix}$ (16)

$\begin{matrix} α_{i, j} = \frac{e^{score (h_{i}, h_{j})}}{\sum_{k} e^{score (h_{i}, h_{k})}} \end{matrix}$ (17)

$\begin{matrix} score (h_{i}, h_{j}) = V_{a}^{T} \tanh (w_{a} [h_{i} \oplus h_{j}]) \end{matrix}$ (18)

where g_i is the weighted feature vector, α_i,j is the self-attention weight, h_j is the output vector of the BiLSTM at the jth position, w_a is a parameter matrix and $V_{a}^{T}$ is the transpose of a parameter vector, both learned during training, and score (h_i, h_j), the correlation of the utterance pair (h_i, h_j), is computed by an MLP.
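Eqs. (16)-(18) can be sketched as follows. The sketch assumes that the softmax in Eq. (17) is normalized over the same utterances that enter the sum in Eq. (16), i.e. all j ≠ i, which the equations do not state explicitly; the function name and the toy inputs are hypothetical.

```python
import math


def additive_self_attention(H, W_a, v_a):
    """Return (G, A): weighted vectors g_i and attention weights alpha_{i,j}.

    H:   list of utterance vectors h_1..h_n (n >= 2), e.g. BiLSTM outputs.
    W_a: parameter matrix (list of rows) applied to the concatenation [h_i ; h_j].
    v_a: parameter vector; score = v_a^T tanh(W_a [h_i ; h_j]).
    """
    n, dim = len(H), len(H[0])
    G, A = [], []
    for i in range(n):
        scores = {}
        for j in range(n):
            if j == i:
                continue  # Eq. (16) sums over j != i
            z = H[i] + H[j]  # concatenation h_i (+) h_j
            hidden = [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W_a]
            scores[j] = sum(v * h for v, h in zip(v_a, hidden))
        top = max(scores.values())
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        alpha = {j: e / total for j, e in exps.items()}
        G.append([sum(alpha[j] * H[j][k] for j in alpha) for k in range(dim)])
        A.append(alpha)
    return G, A


# Toy example: three 2-dimensional utterance vectors and a 2 x 4 matrix W_a.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_a = [[0.5, -0.5, 0.25, 0.0], [0.0, 0.3, -0.2, 0.1]]
v_a = [1.0, -1.0]
G, A = additive_self_attention(H, W_a, v_a)
print(A[2])
```

Each row of weights sums to one, and an utterance whose score with h_i is higher contributes more to g_i, which is how the model expresses a stronger association between two utterances.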

Finally, a softmax layer is used to classify the result: $\begin{matrix} P = \mathrm{softmax} (W_{s} G + b_{s}) \end{matrix}$ (19) where G = {g₁, g₂, …, g_n}; W_s and b_s are the parameters of the softmax layer. The aim of model training is to reduce the cross-entropy error between the target distribution and the predicted distribution [38]. Hence, the following loss function is used for model training: $\begin{matrix} loss = - \sum_{i} \sum_{k} y_{i}^{k} \log p_{i}^{k} + λ \| θ \|^{2} \end{matrix}$ (20) where i is the index of the sentence, k is the index of the class, y is the actual category, p is the predicted probability of the category, λ is the L2-regularization coefficient, and θ is the parameter set. In this paper, we use the back-propagation algorithm to train the model.
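The loss in Eq. (20) can be written out directly. The sketch below assumes one-hot targets, so the inner sum over classes reduces to the log-probability of the true class, and it treats the parameter set θ as a flat list of numbers; the function name and example values are hypothetical.

```python
import math


def cross_entropy_l2(probs, labels, params, lam):
    """Eq. (20): -sum_i log p_i[y_i] + lam * ||theta||^2 for one-hot targets.

    probs:  predicted class distributions, one list per utterance.
    labels: index of the true class for each utterance.
    params: flat list of model parameters (theta).
    lam:    L2-regularization coefficient (lambda).
    """
    cross_entropy = -sum(math.log(p[y]) for p, y in zip(probs, labels))
    l2_penalty = lam * sum(w * w for w in params)
    return cross_entropy + l2_penalty


probs = [[0.5, 0.5], [0.25, 0.75]]
loss = cross_entropy_l2(probs, labels=[0, 1], params=[1.0, 2.0], lam=0.1)
print(round(loss, 4))
```

The first term penalizes low probability on the true class, and the second term grows with the squared magnitude of the parameters.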

4 Experiments

4.1 Dataset and parameter settings

Our experiments are conducted on the JD Dialog dataset, which is obtained from the JD Dialogue Challenge. To the best of our knowledge, the JD Dialog dataset, which comes from the e-commerce field, has rarely been used to analyze customers’ sentiment in conversation. We first preprocess the dataset by cleaning it and removing noise and irrelevant content. Each utterance is then annotated with one of five sentiment labels: very negative, negative, neutral, positive, and very positive.

The specifications in the annotation process are as follows:

If the sentence contains very obvious negative emotional words, or contains two or more negative emotional words, the emotional tendency of the sentence is marked as very negative, as in Example 1:

Customer: It’s been several times, you are too perfunctory. (very negative)

If the statement expresses dissatisfaction with the goods or services without significant negative emotions, the emotional tendency of the statement is marked as negative, as in Example 2:

Customer: Why hasn’t my order shipped yet? (negative)

If the statement is just a simple consultation without any emotional color, it is marked as neutral, as in Example 3:

Customer: Hello, are you there? (neutral)

If the sentence expresses gratitude or contains one or two positive emotional words, label the emotional tendency of the sentence as positive, as in Example 4:

Agent: Hello, thank you for coming, how can I help you? (positive)

If the sentence contains obvious positive emotional words, or contains two or more positive emotional words, the emotional content of the sentence is labeled as very positive, as in Example 5:

Customer: The quality of the goods is good, I will give you good reviews. (very positive)

When manually annotating each sentence in a dialogue, three annotators independently label its emotional tendency. If two of them disagree on the label, an expert makes the final decision. The annotation results show that the agreement between two annotators is above 89% and the agreement among all three is above 83%. The Kappa coefficient was calculated for each pair of annotators, and the mean Kappa coefficient was greater than 0.75. Based on these indicators, the annotation is consistent and its quality meets the experimental standards.
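The pairwise agreement statistic can be sketched as follows, assuming that the reported Kappa is Cohen's kappa computed for every pair of annotators and then averaged; the text does not name the variant, and the label sequences below are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences.

    Undefined (division by zero) if both annotators use a single identical label.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two annotators' label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Invented labels for four utterances from three annotators.
ann1 = ["negative", "neutral", "positive", "neutral"]
ann2 = ["negative", "neutral", "neutral", "neutral"]
ann3 = ["negative", "neutral", "positive", "neutral"]
print(round(cohen_kappa(ann1, ann2), 3), round(mean_pairwise_kappa([ann1, ann2, ann3]), 3))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters here because the neutral class dominates the dataset.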

After the emotional tendency of each utterance is labeled, the dataset is divided into a training set, a validation set, and a test set in a ratio of approximately 7:1:2, and the dataset is statistically analyzed. Table 1 shows the distribution of the five sentiment labels over the dataset. The statistics in Table 1 show that neutral and positive utterances account for a relatively large proportion of the dataset, while negative utterances are rarer. Neutral utterances are frequent mainly because much of each conversation consists of consultation, and positive utterances are frequent because the customer service agents largely use positive language, which is consistent with everyday communication. In addition, utterances with very negative and very positive emotional tendency account for only 2.10% and 1.44% of the whole sample, respectively, which introduces label imbalance into the sentiment analysis task in the customer service dialogue scenario. The dataset contains 70616 utterances in total; the average length of a customer service utterance is 20.65, and the average length of a customer utterance is 11.61.

Table 1
Information of the dataset

Sentiment	Train	Dev	Test	Total (proportion)
Very negative	1038	163	282	1483 (2.1%)
Negative	3466	613	1019	5098 (7.22%)
Neutral	31625	3953	8345	43923 (62.2%)
Positive	12794	2482	3819	19095 (27.04%)
Very positive	722	102	193	1017 (1.44%)
Total utterances	49645	7313	13658	70616
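The totals and proportions in Table 1 can be recomputed from the per-class counts with a few lines. The counts are copied from the table; the variable names are illustrative.

```python
# Utterance counts per class copied from Table 1: (train, dev, test).
counts = {
    "very negative": (1038, 163, 282),
    "negative": (3466, 613, 1019),
    "neutral": (31625, 3953, 8345),
    "positive": (12794, 2482, 3819),
    "very positive": (722, 102, 193),
}

class_totals = {label: sum(c) for label, c in counts.items()}
grand_total = sum(class_totals.values())

# Share of each class in the whole dataset, in percent.
class_share = {label: round(100 * t / grand_total, 2) for label, t in class_totals.items()}

# Size and share of the train, dev, and test splits.
split_sizes = [sum(c[i] for c in counts.values()) for i in range(3)]
split_share = [round(100 * s / grand_total, 1) for s in split_sizes]

print(grand_total, class_share, split_sizes, split_share)
```

The two extreme classes together make up less than 4% of the utterances, which is the label imbalance discussed above.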

In this experiment, the maximum utterance length for the JD Dialog dataset is set to 196. We trained the model for different numbers of epochs and observed the loss value; the result is shown in Fig. 4. After 16 epochs, the loss value no longer improves significantly, so the number of epochs is set to 16. In addition, the size of the LSTM cell state is 128, the dropout rate is 0.5, the learning rate is e^-3, and the batch size is 8. The model is optimized with Adam. Because the emotional tendency classification in this experiment is a multi-class problem with an unbalanced label distribution, the cross-entropy loss is chosen as the training loss function.

Fig. 4

The loss value of training and validation.

4.2 Evaluation metrics

This paper uses Precision (Pr), Recall (Re), F1-measure (F1), and Macro-F1 to evaluate the performance of the model. These metrics are widely used in sentiment analysis tasks and are calculated as follows:

$\begin{matrix} \Pr = \frac{TP}{TP + FP} \end{matrix}$ (21)

$\begin{matrix} Re = \frac{TP}{TP + FN} \end{matrix}$ (22)

$\begin{matrix} F 1 = \frac{2 \times \Pr \times Re}{\Pr + Re} \end{matrix}$ (23)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives for a class. Macro-F1 is obtained by averaging the F1 values of all classes and is used to evaluate the overall performance of the model.
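A minimal sketch of these metrics is given below. The function names and the example counts are illustrative; the second example averages the five per-class F1 values reported for our model in Table 2.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from Eqs. (21)-(23).

    Returns 0.0 for a metric whose denominator is zero.
    """
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1


def macro_f1(f1_scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1_scores) / len(f1_scores)


# Invented counts for one class: 8 true positives, 2 false positives, 8 false negatives.
pr, re, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(pr, 2), round(re, 2), round(f1, 4))

# Per-class F1 values of "Our model" in Table 2.
print(round(macro_f1([73.17, 74.11, 90.83, 87.67, 81.08]), 2))
```

Because Macro-F1 weights all classes equally, the rare very negative and very positive classes influence it as much as the dominant neutral class.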

4.3 Baselines

To evaluate the performance of our model, we implement the following baseline models for sentiment recognition in conversation.

BERT [39], where a BERT model is used to construct the utterance representations.

SKEP [36], where the SKEP model embeds sentiment information, combining sentiment words, word polarity, and aspect-sentiment pairs, into pre-trained utterance representations.

BERT-BiLSTM [40], a model for English textual dialogue that extracts BERT embeddings and feeds them into a BiLSTM layer, whose output is then used for sentiment classification.

SKEP-BiLSTM, where BERT is replaced with SKEP in BERT-BiLSTM.

BiERU [12], which uses LSTM and CNN to extract context information.

DSAGCN [11], which employs a self-attention mechanism to capture the most effective words in the emotional semantics of the dialogue context and constructs multimodal sentiment relationship graphs based on speaker relationships.

4.4 Experimental results and discussions

We compare the performance of our model with the baselines on the JD Dialog dataset. Each model is run five times, and the average scores of the evaluation metrics are recorded. As Table 2 shows, our model achieves the highest Macro-F1 (81.37) among all models, which demonstrates that it recognizes sentiment better by considering the contextual information and the degree of association between utterances.

Table 2
Comparison of our model and other baseline models

Models	JD Dialog dataset
	Very negative			Negative			Neutral			Positive			Very positive			Average
	Pr	Re	F1	Pr	Re	F1	Pr	Re	F1	Pr	Re	F1	Pr	Re	F1	Macro-F1
BERT	60.76	73.56	66.55	63.28	71.48	67.13	82.71	94.17	88.07	78.06	87.52	82.52	77.27	79.12	78.18	76.49
SKEP	64.84	68.95	66.83	60.83	70.78	65.43	84.93	92.91	88.74	77.69	89.53	83.19	79.92	78.69	79.30	76.70
BERT-BiLSTM	73.16	70.05	71.57	69.07	73.15	71.05	85.79	94.31	89.85	79.86	85.51	82.59	78.11	80.34	79.21	78.85
SKEP-BiLSTM	68.57	72.01	70.25	70.84	68.83	69.82	87.09	95.21	90.97	80.19	91.31	85.39	78.03	81.34	79.65	79.22
BiERU	71.31	70.36	70.83	70.65	72.10	71.37	87.39	92.18	89.72	81.98	90.72	86.13	79.17	78.75	78.96	79.40
DSAGCN	74.07	72.78	73.42	76.14	71.19	73.58	86.41	92.28	89.25	82.57	89.12	85.72	80.58	79.57	80.07	80.41
Our model	75.17	71.27	73.17	77.22	71.24	74.11	87.16	94.82	90.83	84.07	91.59	87.67	82.01	80.17	81.08	81.37

Regarding the pre-trained models, SKEP-BiLSTM performs better than BERT-BiLSTM on sentiment recognition in conversation, which indicates the effectiveness of using SKEP to encode each utterance in the conversation. In addition, BERT-BiLSTM and SKEP-BiLSTM are both superior to the single BERT and SKEP models, which indicates that a conversation is a continuous process and that sentiment recognition for the current utterance depends on the historical conversation. Furthermore, the F1 values of the negative and positive labels increase much more than those of the other sentiment labels after the context features are considered, which conforms to the actual situation: without contextual information, utterances with very negative or very positive sentiment have obvious sentiment features and are easier to recognize than utterances with negative or positive sentiment. Thus, when the context is considered, the BERT-BiLSTM and SKEP-BiLSTM models improve most noticeably in recognizing negative and positive sentiment.

To find out whether paying attention to the degree of association between utterances is beneficial for analyzing sentiment in conversation, we compared our model with the SKEP-BiLSTM model. The macro-F1 value of our model is 2.15 points higher than that of SKEP-BiLSTM (81.37 vs. 79.22), which indicates the effectiveness of applying the self-attention mechanism on top of the contextual information to focus on the degree of association between utterances.
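The reported gap can be reproduced from the results table. The following Python sketch assumes that each table row lists precision, recall and F1 for five sentiment classes followed by macro-F1, so every third value is a per-class F1. The helper name `macro_f1` is ours and does not come from the paper.

```python
# Per-class F1 values copied from the results table (every third value in a row).
# Assumption: each row holds (P, R, F1) for five classes, then macro-F1.
F1_SKEP_BILSTM = [70.25, 69.82, 90.97, 85.39, 79.65]
F1_OUR_MODEL = [73.17, 74.11, 90.83, 87.67, 81.08]


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores, rounded to two decimals."""
    return round(sum(per_class_f1) / len(per_class_f1), 2)


baseline = macro_f1(F1_SKEP_BILSTM)
ours = macro_f1(F1_OUR_MODEL)
gain = round(ours - baseline, 2)
print(baseline, ours, gain)
```

Under this reading of the columns, the unweighted means match the macro-F1 column of the table for both models, and their difference is the 2.15 points quoted above.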

Our model also outperforms the existing models BiERU and DSAGCN, which shows the importance of jointly considering sentiment knowledge, context information, and the degree of association between utterances.

To further evaluate how well our proposed model recognizes conversational sentiment, we selected two cases for analysis. Figure 5 illustrates two dialogue examples with the predictions of SKEP, SKEP-BiLSTM, and our model. The figure shows the following: 1) In example 1, about a “refund”, both SKEP-BiLSTM and our model give correct predictions, while SKEP gives a wrong prediction. This is because both SKEP-BiLSTM and our model capture contextual information, whereas SKEP focuses only on the current utterance. 2) In example 2, about an exchange of goods, the last utterance expresses dissatisfaction with the quality of the goods after the exchange. Neither SKEP nor SKEP-BiLSTM predicts negative sentiment, while our model still predicts correctly. The main reason is that our model not only captures contextual information but also focuses on the degree of association between utterances. Specifically, compared with the twelfth and thirteenth utterances, the content of the eleventh utterance has a greater impact on the sentiment of the last utterance, so the self-attention mechanism assigns a larger weight to the relation between the eleventh and last utterances. Hence, our model recognizes sentiment more accurately than SKEP and SKEP-BiLSTM. These results show that SKEP, BiLSTM, and self-attention are each necessary and complement one another: they extract sentiment knowledge, context information, and the degree of association between utterances, respectively, all of which benefit dialogue sentiment analysis.
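To make the role of the attention weights concrete, the sketch below computes scaled dot-product self-attention weights for the last utterance over a few toy utterance vectors. It is an illustration under assumptions: the vectors are invented, the function name `attention_weights` is hypothetical, and the exact attention formulation used in the paper may differ.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query vector and all key vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy contextual vectors (hypothetical values) standing in for BiLSTM outputs
# of the eleventh, twelfth, thirteenth and last utterances of a conversation.
utterances = [
    [0.9, 0.8, 0.1, 0.0],  # eleventh: closely related to the last utterance
    [0.1, 0.0, 0.7, 0.2],  # twelfth
    [0.0, 0.2, 0.1, 0.9],  # thirteenth
    [1.0, 0.7, 0.0, 0.1],  # last utterance (used as the query)
]

# Weights the last utterance assigns to each earlier utterance.
weights = attention_weights(utterances[-1], utterances[:-1])
print([round(w, 3) for w in weights])
```

With these toy inputs the eleventh utterance receives the largest weight, which mirrors the behaviour described for example 2: the utterance most associated with the last one contributes most to its representation.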

Fig. 5

Illustration of examples from the JD Dialog dataset with their sentiment predicted by different approaches.

5 Conclusion

In this paper, we proposed an efficient model to accurately recognize customer sentiment in customer service conversations. The model consists of SKEP, BiLSTM, and a self-attention mechanism. First, SKEP is used to generate utterance representations that embed sentiment information. Then, BiLSTM captures contextual information from the utterance representations. Finally, the self-attention mechanism is applied to focus on the degree of association between utterances. To verify the performance of our model, we selected four baseline models for the same task and evaluated all models on the JD Dialog dataset, which contains 6305 customer service conversations (i.e., 70616 utterances). The experimental results show that our model recognizes customer sentiment better than the baseline models.

For future work, the following two research directions merit exploration: 1) considering more factors that affect the accuracy of sentiment recognition in conversations, such as topics, commonsense knowledge, and interactions between interlocutors; 2) extending our experiments to additional application domains and datasets in other languages to verify the effectiveness of our model.

Acknowledgments

The authors would like to thank all anonymous reviewers for their valuable comments and suggestions which have significantly improved the quality and presentation of this paper.

References

1. Chung, Iorga, Voas and Lee, Alexa, Can I Trust You? in Computer 50(9) (2017), 100–104. https://doi.org/10.1109/MC.2017.3571053

2. Zhu, Case II (Part A): JIMI’s Growth Path: Artificial Intelligence Has Redefined the Customer Service of JD.Com, in Emerging Champions in the Digital Economy, Springer: Singapore, 2019, pp. 91–103. https://doi.org/10.1007/978-981-13-2628-8_3

3. […], et al., Alime assist: An intelligent assistant for creating an innovative e-commerce experience, in Proc. of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 2495–2498. https://doi.org/10.1145/3132847.3133169

4. Sun, Zhang and Li, Dynamic emotion modelling and anomaly detection in conversation based on emotional transition tensor, Information Fusion 46 (2019), 11–22. https://doi.org/10.1016/j.inffus.2018.04.001

5. Huang, et al., Emotion Detection for Conversations Based on Reinforcement Learning Framework, in IEEE MultiMedia 28(2) (2021), 76–85.

6. Majid, Santoso H.A., Conversations Sentiment and Intent Categorization Using Context RNN for Emotion Recognition, in 7th Int. Conf. on Advanced Computing and Communication Systems (ICACCS), 2021, pp. 46–50. https://doi.org/10.1109/icaccs51430.2021.9441740

7. Poria, Cambria, Hazarika, Majumder, Zadeh, Morency, Context-dependent sentiment analysis in user-generated videos, in Proc. of the 55th annual meeting of the association for computational linguistics, Vancouver, Canada, 2017, pp. 873–883. https://doi.org/10.18653/v1/P17-1081

8. Graves and Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18(5) (2005), 602–610. https://doi.org/10.1016/j.neunet.2005.06.042

9. Jain and Kashyap K.L., Ensemble hybrid model for Hindi COVID-19 text classification with metaheuristic optimization algorithm, Multimedia Tools and Applications 82(11) (2023), 16839–16859.

10. Jain and Kashyap K.L., Multilayer hybrid ensemble machine learning model for analysis of Covid-19 vaccine sentiments, Fuzzy Systems 43(5) (2022), 6307–6319.

11. Shou, Meng, Ai, Yang and Li, Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis, Neurocomputing 501 (2022), 629–639.

12. […], Shao, Ji and Cambria, BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis, Neurocomputing 467 (2022), 73–82. https://doi.org/10.1016/j.neucom.2021.09.057

13. Firdaus, Singh G.V., Ekbal, Bhattacharyya, Affect-GCN: a multimodal graph convolutional network for multi-emotion with intensity recognition and sentiment analysis in dialogues, Multimedia Tools and Applications 2023.

14. Bhangdia, Bhansali, Chaudhari, Chandnani, Dhore M.L., Speech Emotion Recognition and Sentiment Analysis based Therapist Bot, in Third Int. Conf. on Inventive Research in Computing Applications (ICIRCA), 2021, pp. 96–101. https://doi.org/10.1109/icirca51532.2021.9544671

15. Wang, Wang, Sun, Li, Liu, Si, Zhang, Zhou, Sentiment Classification in Customer Service Dialogue with Topic-Aware Multi-Task Learning, in Proc. of the AAAI Conf. on Artificial Intelligence, 2020, pp. 9177–9184. https://doi.org/10.1609/aaai.v34i05.6454

16. […], Li, Ling, Ding and Shen, Sentiment classification using attention mechanism and bidirectional long short-term memory network, Applied Soft Computing 112 (2021), 107792. https://doi.org/10.1016/j.asoc.2021.107792

17. Xie, Liang, Huang, Zou and Schuller, Speech Emotion Classification Using Attention-Based LSTM, in IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(11) (2019), 1675–1685. https://doi.org/10.1109/taslp.2019.2925934

18. Peng, Fang, Xie and Zhou, Topic-enhanced emotional conversation generation with attention mechanism, Knowledge-Based Systems 163 (2019), 429–437. https://doi.org/10.1016/j.knosys.2018.09.006

19. You, Luo, Jin, Yang, Robust image sentiment analysis using progressively trained and domain transferred deep networks, in Twenty-ninth AAAI conf. on artificial intelligence, North America, 2015.

20. Majumder, Hazarika, Gelbukh, Cambria and Poria, Multimodal sentiment analysis using hierarchical fusion with context modeling, Knowledge-Based Systems 161 (2018), 124–133. https://doi.org/10.1016/j.knosys.2018.07.041

21. Singh, Panwar, Choudhary, Ojha, Textual Conversational Sentiment Analysis in Deep Learning using capsule network, in 2021 Int. Conf. on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), 2021, pp. 1–6.

22. Ghosal, Majumder, Poria, Chhaya, Gelbukh, DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation, in Proc. of the 2019 Conf. on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 154–164.

23. Huddar M.G., Sannakki S.S. and Rajpurohit V.S., Attention-based multi-modal sentiment analysis and emotion detection in conversation using RNN, International Journal of Interactive Multimedia and Artificial Intelligence 6(6) (2021), 112+. https://doi.org/10.9781/ijimai.2020.07.004

24. Wei, Liao, Yang, Wang and Zhao, BiLSTM with Multi-Polarity Orthogonal Attention for Implicit Sentiment Analysis, Neurocomputing 383 (2020), 165–173. https://doi.org/10.1016/j.neucom.2019.11.054

25. Shenoy, Sardana, Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation, in Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), Association for Computational Linguistics, Seattle, USA, 2020, pp. 19–28. https://doi.org/10.18653/v1/2020.challengehml-1.3

26. Yang, Yang, Wang and Xie, Multi-Entity Aspect-Based Sentiment Analysis With Context, Entity and Aspect Memory, in Proc. of the AAAI Conf. on Artificial Intelligence 32(1) (2018). https://doi.org/10.1609/aaai.v32i1.12059

27. Keramatfar, Amirkhani, Jalaly Bidgoly, Multi-thread hierarchical deep model for context-aware sentiment analysis, Journal of Information Science 2021. https://doi.org/10.1177/0165551521990617

28. Tang, Qin, Feng, Liu, Effective LSTMs for Target-Dependent Sentiment Classification, in Proc. of COLING 2016, the 26th Int. Conf. on Computational Linguistics: Technical Papers, Osaka, Japan, 2016, pp. 3298–3307. https://aclanthology.org/C16-1311

29. Basiri M.E., Nemati, Abdar, Cambria and Acharya U.R., ABCDM: An Attention-based Bidirectional CNN-RNN Deep Model for sentiment analysis, Future Generation Computer Systems 115 (2021), 279–294. https://doi.org/10.1016/j.future.2020.08.005

30. […], Wei, Cao, Peng, Niu, Yu and Wang, Aspect-level sentiment analysis using context and aspect memory network, Neurocomputing 428 (2021), 195–205. https://doi.org/10.1016/j.neucom.2020.11.049

31. Bahdanau, Cho K.H., Bengio, Neural machine translation by jointly learning to align and translate, in 3rd Int. Conf. on Learning Representations, 2015.

32. Huang, Li, Yuan, Zhang and Qiao, Attention-Emotion-Enhanced Convolutional LSTM for Sentiment Analysis, in IEEE Transactions on Neural Networks and Learning Systems 33(9) (2022), 4332–4345. https://doi.org/10.1109/TNNLS.2021.3056664

33. Huddar M.G., Sannakki S.S. and Rajpurohit V.S., Attention-based word-level contextual feature extraction and cross-modality fusion for sentiment analysis and emotion classification, International Journal of Intelligent Engineering Informatics 8(1) (2020), 1–18.

34. Wang, Li and Hou, S2SAN: A sentence-to-sentence attention network for sentiment analysis of online reviews, Decision Support Systems 149 (2021), 113603. https://doi.org/10.1016/j.dss.2021.113603

35. Gan, Feng and Zhang, Scalable multi-channel dilated CNN–BiLSTM model with attention mechanism for Chinese textual sentiment analysis, Future Generation Computer Systems 118 (2021), 297–309.

36. Tian, Gao, Xiao, Liu, He, Wu, Wang, Wu, SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis, in Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 4067–4076. https://doi.org/10.18653/v1/2020.acl-main.374

37. […], Qi, Tang and Yu, Bidirectional LSTM with self-attention mechanism and multi-channel features for sentiment classification, Neurocomputing 387 (2020), 63–77. https://doi.org/10.1016/j.neucom.2020.01.006

38. Shuang, Ren, Yang, Li and Loo, AELA-DLSTMs: Attention-Enabled and Location-Aware Double LSTMs for aspect-level sentiment classification, Neurocomputing 334 (2019), 25–34. https://doi.org/10.1016/j.neucom.2018.11.084

39. Devlin, Chang M.W., Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in Proc. of NAACL-2019, 2019, pp. 4171–4186. https://arxiv.org/pdf/1810.04805.pdf

40. Al-Omari, Abdullah M.A., Shaikh, EmoDet2: Emotion Detection in English Textual Dialogue using BERT and BiLSTM Models, in 11th Int. Conf. on Information and Communication Systems (ICICS), 2020, pp. 226–232. https://doi.org/10.1109/ICICS49469.2020.239539