Abstract
In multi-turn dialogue generation, the dialogue context has been shown to have an important influence on the reasoning behind the next turn. In a multi-turn dialogue between two people, each response should be reasonable given the relevant context. However, the widely used hierarchical recurrent encoder-decoder model and the more recent model that detects relevant contexts with self-attention face the same problem: the generated response may not match the identity of the current speaker, which we call role ambiguity. In this paper, we propose a new model, named RoRePo, to tackle this problem by detecting role information and relative position information. Firstly, as a part of the decoder input, we add a role embedding to identify different speakers. Secondly, we incorporate a self-attention mechanism with relative position representations into dialogue context understanding. In addition, our model architecture uses latent variables to generate more diverse responses. Experimental results on the DailyDialog and DSTC7_AVSD datasets show that our proposed model advances multi-turn dialogue generation.
Introduction
Multi-turn dialogue generation, which aims to generate a context-dependent response from the previous several turns, has always been a difficult task. Recently, it has been receiving more and more attention, and dialogue systems are used in a widening set of applications, ranging from e-commerce customer service to the medical field. Multi-turn dialogue generation is a typical sequence-to-sequence (Seq2Seq) learning task in Natural Language Processing. The mainstream approaches to Seq2Seq learning use a deep recurrent neural network (RNN) with an attention mechanism [16, 17] as the basic building block. The hierarchical recurrent encoder-decoder (HRED) [1] model and its variants, such as the hierarchical latent variable encoder-decoder (VHRED) [2] model, are widely used in this area. In the encoder part, they deploy a word-level RNN to map each utterance to an utterance vector, and then deploy a context-level RNN to process the utterance vectors. In the decoder phase, another RNN is used to generate a response based on the output of the context-level RNN, with the objective of maximizing the averaged likelihood.
Although the HRED network can capture long temporal dependencies in multi-turn conversations, the response usually depends on some relevant contexts rather than on all contextual information. To tackle this problem, many researchers try to detect and use the relevant contexts for multi-turn dialogue generation by introducing the traditional attention mechanism [3, 4], but they ignore the position bias problem [18]. Zhang et al. [5] propose a new model, namely ReCoSa, which detects the relevant contexts with a self-attention network. They use the self-attention mechanism to measure the relevance between the response and each utterance in the context, based on the fact that self-attention is superior in capturing long-distance dependencies [6].
However, it is worth noting that the response should be not only contextual but also appropriate to the speaker's identity. Table 1 gives an example selected from the DailyDialog dataset [13]. In this case, although we can easily infer that A is the driver and B is the passenger, if the model does not distinguish between the utterances of role A and role B in the context, the generated response is likely not to match the identity of the current speaker. We call this problem role ambiguity. Another common problem is that current conversation generation models lack common-sense knowledge, which is critical for developing conversations. The ReCoSa model can only use the limited range of knowledge mined from the limited training corpus, leading to a limited performance improvement. Therefore, introducing common-sense knowledge and distinguishing between different roles are still challenging problems in multi-turn dialogue generation.
The “Role ambiguity” problem
In this paper, we propose a new model, namely RoRePo, which aims to detect role and relative position information. Our model follows the encoder-decoder architecture. In the encoder part, we use the cross-attention mechanism to measure the relevance between the response and each utterance in the context. In addition, we use the relation-aware self-attention mechanism to obtain the context and response representations, which alleviates the position bias problem and extends the self-attention mechanism to efficiently consider the relative position representations of utterances in the context. The motivation comes from the work of Shaw et al. [10], which incorporates relative position representations into the self-attention mechanism of the Transformer. Furthermore, inspired by the segment embedding of Bidirectional Encoder Representations from Transformers (BERT) [9], we train different role embeddings for different speakers in the context and use them as an additional input of the Gated Recurrent Unit (GRU) decoder [14]. To introduce common-sense knowledge, we initialize the word embedding with the BERT embedding, which is equivalent to implicitly expanding the training corpus, and discuss its effect. Finally, we deploy a decoder based on the related context to generate the response.
In the experiments, we use two public multi-turn open-domain English datasets to evaluate the proposed model, i.e. DailyDialog [13] and DSTC7_AVSD [15], the Audio Visual Scene-Aware Dialog (AVSD) Challenge at the 7th Dialog System Technology Challenges (DSTC7). The results show that our model outperforms several strong comparison models in relevance and diversity.
The main contributions of this paper are as follows. (1) The proposed RoRePo model provides additional information to solve the role ambiguity problem, and the human evaluation results show that it is less prone to role ambiguity. (2) The relation-aware self-attention layer, which incorporates relative position representations into self-attention, is deployed for the first time to encode utterance-level relative position information in the dialogue context, which is essential for logical multi-turn dialogue. (3) To approximately introduce common-sense knowledge, we use a pre-trained model to initialize the word embedding and discuss the effectiveness of an input embedding with richer semantics. (4) The general structure of RoRePo is closest to a combination of ReCoSa and VHRED: we fuse the encoder with the self-attention mechanism and the decoder with the latent variable, which has been shown to improve relevance and diversity at the same time. Our model improves the Bilingual Evaluation Understudy (BLEU) and Distinct scores compared with several well-known baseline methods.
Dialogue generation, specifically multi-turn dialogue generation, has attracted increasing scientific attention in recent years. Deep-learning-based approaches, represented by Serban et al. [1], are experiencing explosive growth, and they combine recurrent units with the attention mechanism. Limited by its ability to encode long-term dependencies, the traditional attention mechanism tends to focus on the adjacent context sentences, which is known as the attentive position bias problem. The vanilla RNN model may perform poorly in multi-turn dialogue because clues to the response may be remote. Serban et al. [1] therefore proposed the HRED model, which first encodes the dependency within each sentence and then encodes the dependency between the context sentences. This was followed up by Serban et al. [2], who proposed the VHRED model, introducing a latent variable to improve the diversity of the generated responses. Xing et al. [4] proposed the HRAN model, which introduces the traditional attention mechanism into HRED. The current state is used for attention weight calculation in both the word-level RNN and the sentence-level RNN, but the attentive position bias problem remains. Moreover, it is not appropriate to treat all contexts indiscriminately, because the response is usually relevant to only some of the previous contexts. Tian et al. [3] proposed the WSeq model, which calculates the attention weights from cosine similarity scores between the response embedding and each context embedding. However, the WSeq model does not consider semantic matching. Recognizing the shortcomings of HRED and WSeq, Zhang et al. [5] proposed a new model, namely ReCoSa, which can focus on the relevant contexts at both long and short distances, and which considers both semantic matching and surface matching of embeddings between the response and each utterance in the context.
The ReCoSa model successfully extracts the relevant contexts for response generation by using self-attention, but it pays little attention to the relative position information between contextual sentences.
In addition, the ReCoSa model is largely limited by its lack of common-sense knowledge. Our model fine-tunes pre-trained semantic vectors from a pre-trained model to approximately introduce external knowledge. Recently, there have been many high-quality results in the study of introducing common-sense knowledge. Specifically, Zhou et al. [7] proposed a Chinese multi-domain knowledge-driven conversation dataset, KdConv. They applied it to several benchmark models and saw significant performance improvements, which illustrates the importance of external knowledge to dialogue generation. For the same reason, to enrich the semantic information of the post, Zhang et al. [21] combined a Graph Neural Network (GNN) with their proposed Flow Attention mechanism to encode more distant relevant concepts in the knowledge graph. Guan et al. [22] post-trained their model on knowledge bases for the story generation task, after pre-training on a large-scale corpus.
RoRePo model
In this section, we describe in detail the components of the proposed RoRePo model, with the architecture shown in Fig. 1. Our model consists of an utterance representation encoder, a context-level relation-aware self-attention layer, a role embedding layer, a latent variable layer and a decoder module for generation process.

The architecture of the RoRePo.
Consider a dialogue consisting of a sequence of $n$ utterances, $u = (u_0, u_1, \ldots, u_{n-1})$, where each utterance $u_i = (w_{u_i,0}, \ldots, w_{u_i,n_{u_i}-1})$ contains a sequence of $n_{u_i}$ word tokens. For utterance $u_i$, at time step $j$, given the hidden state $h_{j-1}$ and the word embedding $e_{u_i,j}$, the encoder generates a hidden state $h_j$. The task of the utterance representation encoder can be defined as follows: given the word embedding sequence $U_i = (e_{u_i,0}, \ldots, e_{u_i,n_{u_i}-1})$ of each context utterance $u_i$, generate a sequence of hidden states $h_{u_i} = (h_0, \ldots, h_{n_{u_i}-1})$ and take the utterance representation $C_i$ to be $h_{n_{u_i}-1}$, where $U_i$ is initialized by the weighted sum of the last four Transformer layers' output of BERT. The $e_{u_i}$ is computed as follows:
Devlin et al. [9] pretrained deep bidirectional transformers on a large-scale Wikipedia corpus. Taking $u_i$ as an input, we use one of the publicly released pre-trained BERT models, namely BERT-base, to generate a pre-trained semantic word embedding $U_i$. We believe that this can implicitly introduce common-sense knowledge through a pre-trained distribution over the semantic space. Therefore, given $U = (U_0, \ldots, U_{n-1})$, we finally get a sequence of $n$ utterance representations, $C = (C_0, \ldots, C_{n-1})$.
The self-attention mechanism has the advantage of capturing long-distance dependency information, which alleviates the attentive position bias problem. Shaw et al. [10] proposed an extension of the self-attention mechanism that considers the pairwise relationships between the inputs and has been shown to be effective at modeling relative position information. They added a relative position representation to the Transformer module, which improved performance on machine translation. In contrast to their work, we apply the relative position representation to the context encoder module, believing that the relative position information between context sentences is valuable in the dialogue generation task.
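As a rough illustration of the initialization described above, the weighted sum of the last four layer outputs for a single token can be sketched as follows. The function name `weighted_layer_sum`, the softmax normalization of the mixing weights, and the toy vectors are assumptions made for illustration; the text does not specify how the weights are obtained.

```python
import math


def weighted_layer_sum(layer_outputs, raw_weights):
    """Combine per-layer token vectors into one embedding.

    layer_outputs: list of equally sized vectors, one per layer
                   (here: the last four Transformer layers).
    raw_weights:   one unnormalized scalar per layer (assumed learnable).
    """
    if len(layer_outputs) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    # Softmax normalization is an assumption, not taken from the paper.
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_outputs[0])
    return [
        sum(weights[k] * layer_outputs[k][d] for k in range(len(weights)))
        for d in range(dim)
    ]


# Toy example: four 3-dimensional layer outputs for one token.
layers = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [5.0, 4.0, 2.0], [7.0, 4.0, 2.0]]
embedding = weighted_layer_sum(layers, [0.0, 0.0, 0.0, 0.0])
```

With equal raw weights the result is simply the average of the four layers; unequal weights would let training emphasize some layers over others.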
In this paper, we adopt the relation-aware self-attention mechanism. We define $n$ as the turn size of the dialogue context and $d_x$ as the embedding size of the utterance representation. Given the query
To obtain the context attention representation $O_c$, we take $x = (x_0, \ldots, x_{n-1})$ to be $\{(C_0 + PE_0), \ldots, (C_{n-1} + PE_{n-1})\}$ and feed the matrix of $n$ input embedding vectors $x$ as the query, key, and value matrices, where $C_i$ is the utterance representation and $PE_i$ is the position embedding of the $i$-th utterance in the context. The context attention representation $O_c$ can be calculated as follows:
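To make the mechanism concrete, here is a minimal single-head sketch of relation-aware self-attention in the formulation of Shaw et al. [10], where a learned vector for the clipped relative offset $j - i$ is added to each key and each value. It is only an illustration under stated assumptions: the linear projections and the multi-head split are omitted, and the names `relation_aware_attention`, `rel_k`, `rel_v` and `max_dist` are hypothetical rather than taken from the RoRePo implementation.

```python
import math


def clip(offset, k):
    """Clip a relative offset j - i to [-k, k] and shift it to [0, 2k]."""
    return max(-k, min(k, offset)) + k


def relation_aware_attention(q, k, v, rel_k, rel_v, max_dist):
    """Single-head relation-aware self-attention over n utterance vectors.

    q, k, v:      n vectors of size d (assumed already linearly projected).
    rel_k, rel_v: 2 * max_dist + 1 relative position vectors of size d.
    """
    n, d = len(q), len(q[0])
    outputs, weights = [], []
    for i in range(n):
        # Compatibility score between turn i and every turn j.
        scores = []
        for j in range(n):
            a_k = rel_k[clip(j - i, max_dist)]
            scores.append(
                sum(q[i][t] * (k[j][t] + a_k[t]) for t in range(d)) / math.sqrt(d)
            )
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Weighted sum of values, each shifted by its relative position vector.
        out = [0.0] * d
        for j in range(n):
            a_v = rel_v[clip(j - i, max_dist)]
            for t in range(d):
                out[t] += alpha[j] * (v[j][t] + a_v[t])
        outputs.append(out)
        weights.append(alpha)
    return outputs, weights


# Toy context: three turns with 2-dimensional representations.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero_rel = [[0.0, 0.0] for _ in range(3)]  # max_dist = 1 -> 3 offsets
outputs, weights = relation_aware_attention(x, x, x, zero_rel, zero_rel, 1)
```

With all relative position vectors set to zero, the sketch reduces to ordinary scaled dot-product self-attention; non-zero vectors let the score between two turns depend on how far apart they are.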
In order to solve the role ambiguity problem, we design a role embedding layer. We add an additional embedding, the role embedding $RE_i$, to distinguish different speakers during the decoding of the dialogue response. We initialize two GRU modules, one for the even-round sentences and another for the odd-round sentences in the context. Suppose the two speakers carry identifiable information, e.g. driver and passenger. Our idea is that the identity semantics of a speaker can be mined by interpreting his or her historical utterances. As the two speakers alternate in the conversation, in general we take the even-round utterances in the context, i.e., $u_0$, $u_2$ and so on, as the historical utterances of speaker 1. In the same way, we take the odd-round utterances, i.e., $u_1$, $u_3$ and so on, as the historical utterances of speaker 2.
The bottom right of Fig. 1 visualizes how the role embedding layer works. In the illustrated context, the current round $i$ is an even round, so we need the even-round utterances as historical utterances to generate $RE_i$. In general, the role embedding is calculated as follows for different rounds $i$:
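The parity-based selection of historical utterances described above can be sketched in a few lines. This covers only the selection step that decides which utterances feed the role embedding, not the GRU encoding itself; the function name `role_history` and the placeholder utterance strings are hypothetical.

```python
def role_history(context, current_round):
    """Return the earlier utterances spoken by the speaker of `current_round`.

    Two speakers alternate, so utterances whose index has the same parity
    as `current_round` belong to the same speaker.
    """
    parity = current_round % 2
    return [u for idx, u in enumerate(context[:current_round]) if idx % 2 == parity]


context = ["u0", "u1", "u2", "u3"]
history_even = role_history(context, 4)  # response at round 4 -> speaker 1
history_odd = role_history(context, 3)   # response at round 3 -> speaker 2
```

In this sketch, the even-round history would be passed to one GRU and the odd-round history to the other, matching the two GRU modules mentioned in the text.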
Motivated by the discussion of Serban et al. [2], we augment the decoder with a latent variable, which can help the decoder generate more informative responses. Following their approach, we sample our latent variable $l$ from a multivariate normal distribution $N(\mu, \Sigma)$ with mean
Given the pre-trained semantic vectors $r = (w_{r,0}, \ldots, w_{r,m-1})$ of response $Y$, we use another relation-aware self-attention module to transform each pre-trained embedding $w_{r,i}$ and its position embedding $PE_i$ to obtain the response attention representation. The relation-aware multi-head self-attention module takes the matrix of response vectors $R = \{R_0, \ldots, R_{m-1}\}$ as the query, key and value matrices, where $R_i = (w_{r,i} + PE_i)$. Then the response attention representation $O_R$ is computed as follows:
During training, the latent variable layer takes the context attention representation $O_c$ and the response attention representation $O_R$ as input. They are concatenated and fed into a two-layer feed-forward network with the tanh activation function. We define $\mu_{posterior}$ by applying a linear transformation to the output of the feed-forward network, and we define $\Sigma_{posterior}$ by applying a different linear transformation to the same output, followed by a softplus function. With the mean $\mu_{posterior}$ and the covariance matrix $\Sigma_{posterior}$, we get the latent variable $l$ by Equation (15). It is worth noting that, during inference, we calculate $l$ from the prior distribution, i.e. $\mu_{prior}$ and $\Sigma_{prior}$, without considering $O_R$. With the aim of approximating the posterior distribution, i.e. $\mu_{posterior}$ and $\Sigma_{posterior}$, during the training process, we follow Serban et al. [2], who applied the KL divergence between the prior and posterior distributions. More details can be found in their paper.
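The sampling and the KL term can be illustrated with a small sketch. It assumes a diagonal covariance matrix, the reparameterized form $l = \mu + \sqrt{\sigma^2}\,\epsilon$, and the direction KL(posterior || prior); these are assumptions made for illustration, and the function names and toy numbers are hypothetical.

```python
import math
import random


def softplus(x):
    """Numerically stable softplus, keeps variances positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_latent(mu, var, rng):
    """Draw l = mu + sqrt(var) * eps with eps ~ N(0, I) (diagonal covariance)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


rng = random.Random(0)
# Toy posterior and prior parameters; variances come from a softplus.
mu_post, var_post = [0.5, -0.2], [softplus(0.0), softplus(1.0)]
mu_prior, var_prior = [0.0, 0.0], [1.0, 1.0]
latent = sample_latent(mu_post, var_post, rng)
kl = kl_diag_gaussians(mu_post, var_post, mu_prior, var_prior)
```

During training the latent variable would be sampled from the posterior parameters while the KL term pulls the prior and posterior together; at inference only the prior parameters are available.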
Given $l$, the model next uses the teacher forcing strategy to generate the target sequence tokens $w_r = (w_{r,1}, \ldots, w_{r,m})$ during training. To obtain the context-response attention weights, we feed the decoder hidden state $h$ as the query, and the context attention representation $O_c$ as the key and value. The output is denoted as $CE$, where
The results are calculated by the following algorithm flow, applying equations and structure used in this paper:
In this section, we conduct our experiments on publicly available datasets and compare the proposed RoRePo model against some baseline methods.
Datasets
We use two public multi-turn open-domain English corpora, DailyDialog [13], and DSTC7_AVSD [15], as datasets in our experiment.
DailyDialog is a human-written multi-turn open-domain English corpus, covering various topics about our daily life. We download the data from a public access link. The DailyDialog data are separated into training/validation/test sets with 76052/7069/6740 conversations. The DSTC7_AVSD data consist of conversations between two speakers about objects and events, and are separated into training/validation/test sets with 76590/17870/1710 conversations. In both cases, the conversations are randomly shuffled before they are split. Sentences in both datasets are preprocessed to begin with one of two special tokens, <user0> and <user1>, which denote the two different speakers. The special token '_eou_' is used to separate multiple sentences in the context. More features of the datasets are presented in Table 2.
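The preprocessing convention can be pictured with a small sketch. The exact whitespace layout, the function name `format_context` and the two toy utterances are assumptions; only the speaker tokens and the `_eou_` separator come from the description above.

```python
EOU = "_eou_"


def format_context(utterances):
    """Prefix each turn with its speaker token and join turns with _eou_."""
    tagged = [f"<user{idx % 2}> {text}" for idx, text in enumerate(utterances)]
    return f" {EOU} ".join(tagged)


# Toy two-turn context (not taken from either dataset).
line = format_context(["Where to?", "The airport, please."])
```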
Key features of the DailyDialog and DSTC7_AVSD
The encoder and decoder RNNs in our model and in the comparison models are set as 1-layer GRUs with 512 hidden neurons. All the RNN model parameters are initialized with Xavier normal initialization. The self-attention layer is set to be a 3-layer multi-head attention module with 512 hidden neurons and 8 heads, based on several experiments as shown in Figs. 2 and 3. The hidden size of the latent variable is set as 100. We apply gradient clipping and the teacher forcing [20] strategy. Adam [19] is used for optimization. We use PyTorch to build our model. The model is trained to minimize the negative log-likelihood loss. We run all the models on a Tesla T4 GPU.

The BLEU scores of different head numbers.

The scores of different layer numbers.
We use both automatic evaluation metrics, including BLEU-1∼4, ROUGE, PPL and Distinct-1∼2, and human judgments to evaluate the generation task.
BLEU [11], a metric for evaluating generated sequences, compares the dialogue-generated text with the reference text. BLEU-n is a score of n-gram matching for a particular order, where n-gram refers to a fragment of n consecutive words in a sequence. BLEU has been used extensively in the evaluation of generative dialogue systems. Also, it is an important metric for evaluating the relevance of the generated response to contexts in the multi-turn dialogue task.
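The n-gram matching behind BLEU-n can be illustrated with the clipped (modified) n-gram precision for a single candidate and a single reference. This is only the core ingredient: the full BLEU score additionally combines several orders and applies a brevity penalty, which the sketch omits. The function names and the toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token fragments of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total


candidate = "the the the cat".split()
reference = "the cat sat".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

Clipping prevents a response that repeats a common word from receiving full credit for every repetition.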
ROUGE, another metric for evaluating generated text, is commonly used for machine translation and text summarization evaluation. Unlike BLEU, it is primarily recall-based.
PPL, short for perplexity, is used to measure the performance of a language model. After the language model distribution is learned from the training corpus, perplexity is calculated on the test set. The lower the perplexity, the better the language model predicts the test data.
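As a minimal sketch, perplexity can be computed as the exponential of the average negative log-likelihood per token. The function name and the toy probabilities are hypothetical, and natural logarithms are assumed.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every test token.
ppl = perplexity([math.log(0.25)] * 4)
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of about 4, which matches the intuition of perplexity as an effective branching factor.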
Distinct [12], a main indicator for evaluating the diversity of generated responses, focuses on the number of distinct n-grams in a sentence, thus penalizing sentences with many repeated words. The most commonly used are distinct-1 and distinct-2.
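A minimal sketch of distinct-n is given below. It assumes the score is computed over a whole set of tokenized responses and normalized by the total number of n-grams, which is one common convention; the function name and the toy responses are hypothetical.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to all n-grams over tokenized responses."""
    all_ngrams = [
        tuple(tokens[i:i + n])
        for tokens in responses
        for i in range(len(tokens) - n + 1)
    ]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


responses = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
d1 = distinct_n(responses, 1)
d2 = distinct_n(responses, 2)
```

Responses that reuse the same phrases lower both scores, which is why the metric is used as a proxy for diversity.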
Human evaluation is another important method for analyzing natural language generation. The human evaluation used here is a manual comparison of the responses generated by different models from the same contexts.
Comparison models
To evaluate our model comprehensively, we compared RoRePo with the following models:
Analysis and results
In this section, we describe the experimental results on two public datasets.
Our model outperforms the above baselines in terms of the BLEU, ROUGE and Distinct measures. For DailyDialog, compared with the ReCoSa model, our model improves BLEU-1∼4, ROUGE and Distinct-1∼2 by 1.0/0.86/0.91/0.92, 0.61 and 0.29/1.92, respectively. For DSTC7_AVSD, our model outperforms the compared models on BLEU-1∼4 and Distinct-1∼2 by at least 0.87/0.25/0.17/0.14 and 0.27/1.59, respectively. The improvement on indicators of relevance and diversity indicates that our model design is effective. Experimental results are shown in Tables 3 and 4. The abbreviations RSA, LV, RE and BE used in several tables stand for the relation-aware self-attention mechanism, the latent variable, the role embedding and the BERT embedding, respectively.
The automatic evaluation results (%) on DailyDialog. The optimal performance value for each metric is indicated in bold
The automatic evaluation results (%) on DSTC7_AVSD
We also performed ablation experiments that isolate the impact of the role embedding, the relative position representations, the latent variable and the BERT embedding, in order to explore the effect of our model design. We elaborate on this in sections 5.1 to 5.4.
We performed an ablation experiment on the role embedding, and the results are shown in Table 5. Including the role embedding yields only a small improvement in the automatic evaluation metrics, because the role embedding is designed to solve the role ambiguity problem, which these metrics do not measure directly. Currently, its effect can only be assessed through manual evaluation of two-person conversations that carry role information, and we provide a sample analysis in Table 6.
The results (%) of ablation experiment on Role Embedding for DailyDialog
The generated responses from different models based on three contexts in two cases chosen from DailyDialog
The base model with role embedding achieves increments of 0.51, 0.69, 0.64 and 0.53 in BLEU-1 to BLEU-4, respectively, which suggests that introducing the role embedding layer enhances the semantic encoding capability for the dialogue context. It also reflects that there are semantic differences between the odd-round sentences and the even-round sentences. Our experimental results show that our model can make use of these semantic differences to explore different role semantics.
As shown in Table 6, we carried out two case studies by human evaluation on the DailyDialog dataset. The left-hand side of the table analyzes an example of a conversation between a driver and a passenger. Context1 and context3 are the speaking history of speaker1, and context2 is the speaking history of speaker2. From the semantics of the three contexts in Example 1, it can be easily inferred that speaker1 and speaker2 are the passenger and the driver, respectively. The next turn belongs to the driver. However, the baseline Seq2Seq and HRED models ask about ticket prices and travel time, respectively, which, as described in Section 1, are not appropriate utterances for a driver but would suit a passenger. Compared with these models, our proposed model gives a response, either a statement or a question, that is more consistent with the speaker's identity. Furthermore, in the generation process, our base model with role embedding generates a response with the same semantics as context2; a potential reason is that we control the role semantics so that they lean toward context2.
The right-hand side of the table analyzes an example of a conversation between a doctor and a patient's family. Example 2 is a typical example of the "one-to-many" nature of dialogue replies. The phrase "come upstairs" in the reference is difficult to fit during training. Our base model did not perform well, but with the addition of role embedding, the quality of the responses improved.
We performed an ablation experiment on the relation-aware self-attention mechanism, and the results are shown in Table 7. The base model with the relative position perception mechanism achieves increments of 0.11, 0.21, 0.10 and 0.04 in BLEU-1 to BLEU-4, respectively. This indicates that it is necessary to pay attention to the relationship between sentences in the dialogue context.
The results(%) of ablation experiment on Relation-aware Self-attention for DailyDialog
In the heat maps shown in Figs. 4 and 5, a row indicates the current turn's attention to all turns in the context. For example, as shown in Fig. 5, the attention score assigned to turn0 by turn2 is 0.18 in the third self-attention layer. The lighter the color, the more attention is allocated; conversely, the darker the color, the less attention is allocated.

The visualization of the weight of triple context-level self-attention layers for Example 3 in Table 8 before considering the relative position information between utterances in the context.

The visualization of the weight of triple self-attention layers for Example 3 in Table 8 after considering the relative position information.
The weight allocation of the self-attention mechanism is quite different before and after considering the relative position information between utterances. We focus on comparing the allocation of attention weights in the last layer, i.e. the third self-attention layer.
As can be seen from Figs. 4 and 5, after considering the relative position information between utterances, turn2 significantly increases its attention to turn0, from 0.07 in Fig. 4 to 0.18 in Fig. 5, because the word 'later' in turn0 corresponds to the word 'now' in turn2, as shown in Table 8, which constitutes a logical pair of time relations. In addition, turn2 significantly increases its attention to turn5 and turn6, from 0.14 and 0.05 in Fig. 4 to 0.24 and 0.18 in Fig. 5, because turn5 again gives a negative answer to turn2's question and turn6 asks "Not even for an hour?", whose query object is "Watching TV", which is closely related to turn2.
The generated responses from different models on DailyDialog
The self-attention weight visualization results suggest that the relation-aware self-attention mechanism has a stronger ability to encode relevant information and dialogue logic over a long distance.
For the DailyDialog dataset, the highest distinct scores of our model are 2.93 and 14.80 on Distinct-1 and Distinct-2, respectively, as shown in Table 3, which are higher than those of the baseline models by at least 0.29 and 1.92.
Moreover, we performed an ablation experiment on the latent variable, and the results are shown in Table 9. The base model with the latent variable achieves increments of 0.05 and 0.1 in Distinct-1 and Distinct-2, respectively. This indicates that the latent variable is helpful for generating more diverse dialogue replies.
The results (%) of ablation experiment on latent variable for DailyDialog
We also evaluated the average length of the responses generated by our model and the baseline models, as is shown in Fig. 6. The x-coordinate shows different datasets and the y-coordinate shows the average response length.

The average length of responses generated by our model and baselines.
The average response lengths of our model are 12.284 and 9.195 on DailyDialog and DSTC7_AVSD respectively, and they are 0.703 and 0.141 higher than the ReCoSa model.
Although our model has the best diversity performance, with the highest distinct values and the longest average sentence length, there is still a big gap compared with the ground-truth data of the test set in Table 2. The differences are 2.816 and 0.905 on DailyDialog and DSTC7_AVSD, respectively.
This paper proposes a new dialogue generation model. We use a role embedding layer to address the role ambiguity problem and a relation-aware self-attention mechanism to represent the relative position relationship between utterances in the context. Moreover, we combine the latent variable and the BERT embedding to diversify our dialogue responses. Our model improves the BLEU and Distinct scores compared with several well-known baseline methods.
However, the present study lacks an automatic evaluation metric for role ambiguity, which can currently be assessed only through human judgment. In addition, our model lacks the ability to explicitly introduce knowledge from an external knowledge base. On the one hand, our future work will focus on improving the role embedding by plugging a role label classification task into our model and training it with multi-task learning. On the other hand, we will introduce common-sense knowledge by using external knowledge as a supplement to enhance the semantic representation of the utterances in the context.
Acknowledgments
Thanks to GMFTBY at Beijing Institute of Technology for help with the code. Our research is funded by the National Natural Science Foundation of China, Multi-model Brain-Computer Interface and Its Application in Patients with Consciousness Disorder, project approval number: 61876067, the Guangdong General Colleges and University Special Projects in Key Areas of Artificial Intelligence of China, Research and Application of Key Techniques of Sentiment analysis, project number: 2019KZDZX1033, the Guangdong Provincial Key Laboratory of Cyber-Physical Systems, project number: 2020B1212060069 and the National & Local Joint Engineering Research Center of Intelligent Manufacturing Cyber-Physical Systems project.
