Research on discourse role recognition in task-oriented collaborative dialogue

Abstract

Task-oriented collaborative dialogues have become an indispensable form of communication in daily work and learning, in which participants exchange ideas and share information to advance goals. Because individuals have limited attention spans, it is crucial to automatically analyze participants’ contributions and understand these dialogues. In this paper, seven Discourse Role (DR) labels are designed to describe the different roles that discourses play in collaborative dialogues for goal achievement. We collected about 11K discourses from a publicly available dialogue corpus and annotated them with DR tags to construct a dataset named MRDR (Meeting Recorder Discourse Role). In addition, this paper proposes a novel hierarchical model, STTAHM (Speaker Turn and Topic-Aware Hierarchical Model), for Discourse Role classification. The model perceives speaker turn and dialogue topic and can effectively capture the local and global semantic information of each discourse. Experimental results show that the proposed method is effective on the constructed dataset, with Discourse Role classification accuracy reaching 86.99%.

Keywords

Task-oriented collaborative dialogue; discourse role; dataset; speaker turn; topic-aware

1 Introduction

With the proliferation of automatic speech recognition systems, recordings of task-oriented collaborative dialogues are increasing, which creates an excellent opportunity to improve transparency and productivity. However, faced with such an unprecedented number of conversations and the limited attention span of individuals, how can we quickly analyze participants’ contributions and understand these dialogues? Mercer et al. [1] proposed three dialogue modes: argumentative, cumulative, and exploratory dialogue. They categorized discourse mainly on the linguistic and psychological levels, focusing on how people use language to think together when solving problems. Goodman et al. [2] studied the roles that speakers play during collaborative dialogues and categorized them into 12 speaker roles, such as initiator, evaluator, and suggestor. These roles are mainly designed to identify problems arising in the discussion and give appropriate guidance. Stone et al. [3] provide predefined relationships between discourses in a collaborative dialogue, such as question and answer, contrast, and continuation, which explicitly show the flow of information and the interactions between discourses. However, current analysis of collaborative dialogues mainly operates at the micro level, focusing on the discourse itself or on the relationships between discourses, while there are relatively few studies at the macro level. This paper focuses on the role that each discourse plays in collaborative dialogues for goal achievement and places the research focus on the representation of macro dialogue structure.

In task-oriented collaborative dialogues, each discourse plays a specific Discourse Role while expressing particular content. The following features characterize the discourse of this type of dialogue: (1) Goal-oriented: the discourse usually revolves around a task or a problem. The communication between the participants is oriented to the realization of common goals. (2) Information sharing: discourse involves the sharing and exchanging of information. Participants express themselves through discourse to facilitate information sharing and shared understanding. (3) Cooperation and coordination: discourse is cooperative and coordinated. Participants cooperate and coordinate with each other to achieve common goals. (4) Feedback and evaluation: discourse includes feedback and evaluation. Participants give feedback on the views of others to facilitate consensus and decision-making. (5) Context-sensitive: discourse is context-sensitive. Participants need to consider the content of previous conversations to better understand and respond to questions or requests. Based on these characteristics, this paper designs seven Discourse Role labels for describing the different roles that discourses play in collaborative dialogues for goal achievement. The categories and definitions of Discourse Roles are detailed in Section 3.2. A sample task-oriented collaborative dialogue is shown in Table 1, where the label of each discourse defines its Discourse Role. These Discourse Roles are related to the facilitation and coordination of the task in which the group is involved. Accurately identifying Discourse Roles provides an objective basis for analyzing the specific contribution of each participant in the conversation. The three main challenges addressed in this paper are:

Table 1
Example of task-oriented collaborative dialogue

Speaker	Discourse	DR Labels
mn015	how long would it take to add another node on the observatory and play around with it?	Initiating New Topic (INT)
me012	another node on what?	Seeking Clarification (SC)
me003	uh well it depends on how many things it’s linked to.	Provide Information (PI)
mn015	let’s just say make it really simple. if we create something that for example would be um so some things can be landmarks in your sense. but they can never be entered. so for example a statue.	Interpretive Inference (II)
me012	good point.	Evaluation (E)
me003	right.	Passively Accepting (PA)
me012	actually no.	Non-Exploration (NE)

First, the datasets with annotations are limited. Although task-oriented collaborative dialogues have become an indispensable form of communication in our daily work and study, almost no annotated corpora are dedicated to detecting discourse roles in such dialogues.

Second, most existing approaches treat spoken dialogues as analogous to written texts, ignoring explicit modeling of speaker turns and dialogue topics. In a dialogue, a change in speaker turn often coincides with a shift in Discourse Role. For example, in Table 1, given that the Discourse Role of speaker mn015 is Interpretive Inference, if the speaker turn changes, i.e., the next discourse comes from me012, then the corresponding Discourse Role is likely to be Evaluation; however, if the speaker turn does not change, i.e., the speaker remains mn015, then the next Discourse Role is still likely to be Interpretive Inference. In addition, different dialogue topics may lead to different distributions of discourse roles. For example, under the topic "Technical Support," the discourse roles tend to be Seeking Clarification and Provide Information, while under the topic "Problem-Solving," they tend to be Interpretive Inference and Evaluation. This paper therefore hypothesizes that modeling speaker turn and dialogue topic as additional contextual information can effectively support discourse role classification. However, unlike in well-structured texts, speaker turn and dialogue topic change as the dialogue progresses, which poses a challenge for discourse role recognition.

Third, unlike text categorization tasks that deal with each discourse individually, discourse role recognition relies on contextual information between discourses. For example, Initiating New Topic may appear as declarative or interrogative sentences, whose discourse features are less obvious. To accurately recognize an Initiating New Topic, it is necessary to determine whether its semantics are related to the previous dialogue. This means that when performing discourse role recognition, the characteristics of individual discourse should be considered, and the context and dynamics of the whole dialogue need to be understood.

To address these challenges, we collect task-oriented dialogue data from an existing dialogue corpus and annotate it. In this paper, discourse role classification is treated as a sequence labeling task. Following [5], a large pre-trained language model, RoBERTa [6], is first used to obtain discourse embeddings. To equip the model with the ability to perceive speaker turn and dialogue topic, two additional embedding layers are introduced to encode speaker turn and dialogue topic information, respectively. These layers generate embedding vectors of the same size as the discourse embeddings, which are then summed with the discourse embeddings. Next, the augmented discourse embeddings are fed into a bidirectional long short-term memory network (BiLSTM) to better capture the contextual information of each discourse. The discourse representations output by the BiLSTM are used for discourse role recognition. The main contributions of this paper are as follows:

1) A dataset called MRDR (Meeting Recorder Discourse Role) was constructed. About 11K discourses were collected from the dialogue corpus and annotated with DR tags. This dataset provides a powerful resource for the task of discourse role classification.

2) A novel hierarchical model, STTAHM (Speaker Turn and Topic-Aware Hierarchical Model), is proposed for discourse role classification. The model is equipped with the ability to perceive speaker turn and dialogue topic and effectively capture both local and global semantic information of discourse.

3) Comparative analysis shows that STTAHM achieves state-of-the-art performance on MRDR. Furthermore, extensive ablation studies confirm the effectiveness of each module of STTAHM.

2 Related work

2.1 Dialogue mode

The dialogue mode was first proposed in the field of education. In general, informal learning cannot be thoroughly compared to collaborative learning in an educational environment. However, there are similarities in how meaning is formed and recognized, especially when people engage in task-oriented collaborative dialogue. Mercer et al. [1] considered how speakers are currently talking and thinking and proposed three modes of social thinking: argumentative dialogue, cumulative dialogue, and exploratory dialogue. Ferguson et al. [7] explored a method for detecting exploratory dialogue in online synchronous text chat. They manually identified a list of cue phrases indicating the presence of exploratory dialogue. Ferguson et al. [8] used exploratory dialogue to indicate ongoing learning. They developed a self-training framework based on previous work that uses discourse and topic features for classification by integrating cue phrase matching and k-nearest neighbor classification. Ekman et al. [9] developed a framework for encoding exploratory dialogue based on Mercer’s research, including five categories: challenge, evaluation, extension, reasoning, providing information, and community work. The exploratory elements of the whole dialogue were analyzed through qualitative content analysis, focusing on how people use language to think together when solving problems. Based on a coding framework for dialogue modes, this paper defines categories of discourse roles. The focus is on discourse’s role in collaborative dialogues for goal achievement to analyze the participant’s contributions to the dialogue. In addition, discourse roles can reflect changes in the way information is conveyed and the way speakers interact with each other in a dialogue, thus contributing to the understanding of the conversation.

2.2 Speaker role

Most existing work enhances the ability of language models to understand dialogues mainly by encoding speaker roles. Chi et al. [4] proposed a role-based contextual model that can independently consider the roles of different speakers based on various speaking patterns in multi-person dialogues. Similarly, Chen et al. [10] proposed an attention-based network that utilizes temporal information and speaker roles to improve spoken language comprehension. Zhu et al. [11] trained a fixed-length vector for each speaker role and appended it to the embedding of each speaker turn to improve the performance of the summarization model. Song et al. [12] used a single-layer neural network to convert speaker roles into vector representations and attached them to a word encoder, which helps to better distinguish patient and doctor discourse. Unlike methods based on speaker roles, the method proposed in this paper focuses on speaker turn and is independent of the speaker’s identity information. Thus, it remains valid when the speakers in a dialogue do not have formal roles, which allows the model to adapt better to different types of dialogue scenarios.

2.3 Dialogue act

Dialogue Act (DA) is a semantic label associated with each discourse in a dialogue that indicates the speaker’s intention, e.g., question, backchannel, or stated opinion. Recently, many studies have approached DA classification as a sequence labeling task to make full use of contextual information between discourses. Chen et al. [13] framed Dialogue Act recognition as capturing hierarchically rich discourse representations and generalized richer CRF-attentive structural dependencies without giving up end-to-end training. Li et al. [14] proposed a dual-attention hierarchical recurrent neural network for DA classification. Their model captures information about both DAs and topics, as well as information about their interactions. Raheja and Tetreault [15] utilize a context-aware self-attention mechanism and a hierarchical RNN. Shang et al. [16] use a BiLSTM-CRF architecture that considers speaker information for DA classification. Colombo et al. [17] introduced a seq2seq model customized for DA classification using a hierarchical encoder, a novel guided attention mechanism, and beam search applied to training and inference. Malhotra et al. [18] proposed a Transformer-based architecture with novel speaker- and time-aware contextual learning for classifying dialogue acts in mental health counseling dialogues. Similar to the above work, this paper treats discourse role classification as a sequence labeling task. The difference is that the above approaches rely on more complex and specialized models to capture speaker interaction or conversation topic information, which inevitably introduces many additional parameters to train. In this paper, speaker turn information and dialogue topic information are encoded through two additional embedding layers. This approach requires only negligible modifications to the recurrent model and introduces O(1) space complexity.

3 Corpora construction

The Discourse Role dataset, called MRDR, is presented in this section. Overall, 11K utterances are annotated with seven well-designed Discourse Role labels. The rest of the section provides detailed information on data collection, annotation schemes, annotation process, data preprocessing, and data statistics.

3.1 Data collection

Since the object of study in this paper is task-oriented collaborative dialogue, large meeting dialogue corpora were consulted to construct the Discourse Role dataset. Ultimately, the MRDA corpus [19] was chosen as the source of the original dialogue data, because MRDA is a multi-party, task-oriented, collaborative conversation corpus with a large number of discussions on a wide range of topics. In addition, the corpus is tagged with dialogue acts, which can be used as auxiliary information when annotating the Discourse Role labels. The MRDA dataset was constructed by annotating dialogue acts on the ICSI Meeting Corpus [20]. The ICSI Meeting Corpus was recorded from real face-to-face meetings covering various specialized areas, such as computer science and linguistics, which gives authenticity and diversity to the conversations in the corpus. It consists of 75 meetings of approximately one hour each. There are 53 unique speakers in the corpus, with an average of about six speakers per meeting. On average, an ICSI meeting transcript contains 10,189 words and 464 turns. Since the transcripts are generated by an ASR system, the transcript error rate of ICSI is 37% [21]. This paper selects eight meetings to be annotated with Discourse Role labels.

3.2 Dataset annotation scheme

Seven discourse role labels were designed by analyzing data from real collaborative dialogues. These labels aim to describe the different roles of discourse for goal achievement. The specific descriptions are as follows:

Initiating New Topics (INT): This label is used for discourse that introduces a new topic related to the task in a dialogue. It serves as a guide to the dialogue.

Interpretive Inference (II): This label is used for discourse that draws new conclusions from existing information or fills in missing information through logic or reasoning. The keywords in this category are "I think," "because," "the reason is," "the next step," "but if," "probably," and other phrases that involve deeper thinking. It promotes deeper analysis and a common understanding of the dialogue.

Provide Information (PI): This label is used for discourse that provides additional resources or background information. It serves to expand the resources of the discussion and support decision-making.

Evaluation (E): This label is used for discourse that evaluates statements or information. It can drive the dialogue in a better direction and facilitate quality decision-making and consensus-building.

Seeking Clarification (SC): This label is used for discourse that requests clarification. It ensures the accuracy and completeness of information and facilitates deeper discussion.

Passive Acceptance (PA): This label is used for discourse that expresses simple approval and concurrence. It helps to establish an atmosphere of effective communication and cooperation. However, it can also lead to dull dialogue that lacks stimulation of thought.

Non-Exploratory (NE): This label is used for discourse that asserts or counter-asserts. It may lead to superficiality in the discussion process and cause negative emotions in the dialogue. Examples are "Yes, it is." and "No, it is not."

3.3 Annotation process

The ICSI meeting corpus contains dialogue summaries, topic descriptions, and dialogue act information. To improve the annotation quality of the dataset, the annotators first read the summary of each dialogue to be annotated in order to understand its content. During annotation, the Discourse Role labels were assigned with the help of the conversation topic descriptions and dialogue act labels. The annotators were five graduate students majoring in computer technology.

At first, the basic unit of Discourse Role annotation was set as the turn (i.e., everything a speaker says before the speaker changes). However, during annotation it was found that some meeting speakers spoke at length, and their Discourse Roles changed within a single long turn. For example, the early part of a turn may serve as Interpretive Inference (II), while the later part shifts to Seeking Clarification (SC). Therefore, the annotators segmented each speaker’s speech according to the semantics of the discourse and took one complete expression of the speaker as the annotation unit.

To ensure a shared understanding of the task and the annotation scheme, a sample of the dataset was drawn, and each annotator was asked to annotate it according to a prepared set of guidelines. All annotators then discussed the results to ensure consistency. After several rounds of annotation and discussion, the annotators proceeded to annotate the entire dataset. After annotation, a Kappa score for MRDR was calculated to measure inter-annotator agreement, yielding a score of 0.643.
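The text does not state which Kappa statistic was computed. With five annotators, one common choice is Fleiss' kappa; the following is a minimal sketch under that assumption, using toy ratings rather than MRDR data. The function name `fleiss_kappa` is illustrative and not part of any released tooling.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of per-item label lists, e.g. [["II", "II", "PA"], ...]
    categories: list of all possible labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # counts[i][c] = number of raters who assigned category c to item i
    counts = [Counter(item) for item in ratings]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    p_obs = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    p_exp = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_exp == 1.0:
        return 1.0  # every rating falls in one category; agreement is trivially perfect
    return (p_obs - p_exp) / (1.0 - p_exp)


# Toy example with three raters and three discourses (not MRDR data).
toy = [["II", "II", "II"], ["PA", "PA", "PA"], ["II", "PA", "II"]]
kappa = fleiss_kappa(toy, ["II", "PA"])
```

If a pairwise statistic such as Cohen's kappa was used instead, the computation would differ, so the sketch should be read only as one plausible way to obtain a single agreement score for multiple annotators.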

3.4 Data pre-processing

To ensure the dataset’s quality and eliminate interfering factors in the discourse role identification task, the following processing measures were taken: (1) Discourses that were not fully transcribed in the dialogue were deleted. (2) All the discourses of one complete formulation of each speaker were concatenated as the unit of the sample, i.e., discourse. (3) Unique numerical codes were assigned to each discourse role in increasing order, and then unique discourse role labels, i.e., DR, were assigned to each discourse. (4) To facilitate the chunking of dialogues, each meeting is considered a complete dialogue segment, and the dialogues are coded with a numerical serial number in increasing order, i.e., dialogue_id. The dataset is processed accurately and orderly through these steps, providing a reliable basis for subsequent analysis and modeling.

3.5 Statistics

In total, there are 11064 discourses in MRDR. Taking one complete expression of the speaker as the annotation unit, a total of 6021 samples were annotated. Compared to other dialogue datasets, the dialogues in MRDR are long, with an average dialogue length of about 1500 utterances. With one complete expression of the speaker as the annotation unit, the average dialogue length is about 660 annotation units, and the average discourse length is 15 words.

Table 2 shows the distribution of Discourse Role labels in the MRDR dataset. Interpretive Inference (II) accounts for the largest proportion, followed by Passively Accepting (PA), while Non-Exploration (NE) accounts for the smallest. This distribution may arise because interpretation and inference are crucial for problem-solving and consensus-building in task-oriented dialogue, while acceptance of others’ viewpoints and information helps to establish an atmosphere of effective communication and cooperation. Non-Exploration (NE) is less frequent because non-exploratory discussion may reduce the engagement and depth of the whole dialogue. It is worth noting that the distribution of Discourse Roles in the MRDR dataset is uneven, with specific Discourse Roles being used more frequently than others in task-oriented collaborative conversations. This reflects the natural variation in participants’ use of Discourse Roles in collaborative dialogue. At the same time, the skewed distribution presents a challenge for Discourse Role identification.

Table 2
Distribution of Discourse Role labels in MRDR

DR	Number	Proportion (%)
INT	342	5.68
II	2616	43.447
PI	288	4.783
E	310	5.148
SC	439	7.291
PA	1898	31.523
NE	128	2.126
Total	6021	100
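The proportions in Table 2 follow directly from the counts. The short check below recomputes them and adds a simple max-to-min ratio as one way to quantify the skew; the ratio is an illustrative measure, not one reported in the paper. Recomputed values can differ from the table in the last digit because of rounding.

```python
# Label counts taken from Table 2.
counts = {"INT": 342, "II": 2616, "PI": 288, "E": 310, "SC": 439, "PA": 1898, "NE": 128}

total = sum(counts.values())
proportions = {label: 100.0 * n / total for label, n in counts.items()}

# Ratio between the most and least frequent label, one simple imbalance measure.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, pct in proportions.items():
    print(f"{label:>3} {counts[label]:>5} {pct:6.3f}")
```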

4 Method

The general structure of the STTAHM model proposed in this paper is shown in Fig. 1. In this example, the dialogue consists of five discourses covering two different topics, and the topic changes at the fourth discourse. The STTAHM model is a hierarchical encoder tagger. Given a dialogue containing a series of discourses, it proceeds in two stages. (1) Discourse-level encoding: discourse embeddings are obtained using a large pre-trained language model, RoBERTa, and speaker turn information and dialogue topic information are integrated into the discourse embeddings. (2) Dialogue-level encoding: the enhanced discourse embeddings are fed into a BiLSTM to capture the contextual information of each discourse. Finally, the obtained discourse representations are used for Discourse Role recognition. In this section, each component of the model is described in detail.

Fig. 1

General structure of the STTAHM model.

4.1 Task definition

Given a corpus $D = \{(C_{n}, Y_{n}, T_{n}, S_{n})\}_{n = 1}^{N}$ of N dialogues, $C_{n} = \langle u_{l}^{n} \rangle_{l = 1}^{L}$ is a dialogue containing a sequence of L discourses, $T_{n} = \langle t_{b}^{n} \rangle_{b = 1}^{B}$ are the topic labels covering B dialogue topics, $S_{n} = \langle s_{l}^{n} \rangle_{l = 1}^{L}$ are the speaker turn labels, and $Y_{n} = \langle y_{l}^{n} \rangle_{l = 1}^{L}$ are the corresponding Discourse Role labels. Given a dialogue $C_{k}$ with its speaker turn labels $S_{k}$ and topic labels $T_{k}$, the goal of the STTAHM model is to assign the correct Discourse Role label in $Y_{k}$ to each discourse in $C_{k}$.
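As a rough illustration of this setup, one dialogue and its aligned label sequences could be held in a small container such as the one below. The class name `Dialogue` and its field names are hypothetical, and the sketch assumes that topic labels have already been expanded to one id per discourse, as the topic modeling step in Section 4.2.1 describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dialogue:
    """One dialogue C_n with its aligned label sequences (hypothetical container)."""

    discourses: List[str]     # u_1 ... u_L
    speaker_turns: List[int]  # s_1 ... s_L, each 0 or 1
    topic_ids: List[int]      # topic index of each discourse
    roles: List[int]          # y_1 ... y_L, Discourse Role ids

    def __post_init__(self) -> None:
        lengths = {len(self.discourses), len(self.speaker_turns),
                   len(self.topic_ids), len(self.roles)}
        if len(lengths) != 1:
            raise ValueError("all sequences must have one entry per discourse")
        if any(s not in (0, 1) for s in self.speaker_turns):
            raise ValueError("speaker turn labels must be 0 or 1")

    def __len__(self) -> int:
        return len(self.discourses)


example = Dialogue(
    discourses=["another node on what?",
                "uh well it depends on how many things it's linked to."],
    speaker_turns=[0, 1],
    topic_ids=[0, 0],
    roles=[4, 2],  # illustrative ids only; the actual id assignment is not given
)
```

Keeping the four sequences aligned by position is what lets the task be treated as sequence labeling: position l in every list refers to the same discourse.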

4.2 Hierarchical dialogue encoder

This paper uses a hierarchical structure to model the relationships between discourses in a dialogue. Each discourse is first encoded independently, and then the encoded discourse sequence is fed into a recurrent neural network to generate a context-aware discourse representation.

4.2.1 Discourse-level coding

In this paper, a pre-trained language model is used to encode each discourse independently, because the encoded representations of pre-trained language models have been shown to improve performance on several NLP tasks. Specifically, RoBERTa is used as the discourse encoder. It is based on the popular BERT [22] with an enhanced training procedure and has been shown to outperform BERT. The RoBERTa tokenizer lowercases and tokenizes each discourse and then inserts a special [CLS] token at the beginning of the tokenized sequence. The token sequence $[\text{[CLS]}, W_{k,1}, W_{k,2}, \ldots, W_{k,N_{k}}]$ is fed into the Transformer [23] encoder, which is initialized with the RoBERTa pre-trained weights. The RoBERTa model is further fine-tuned on the Discourse Role classification task.

$r(u) = \mathrm{RoBERTa}(u)$ (1)

where u is a given discourse in a dialogue and r(u) is the discourse embedding obtained from the pre-trained language model RoBERTa.

Speaker turn modeling. Since alternation between speakers provides significant information for discourse role recognition, it is crucial to notify the model when the speaker turn changes. To model speaker turns in dialogues, this paper encodes the speaker turn of each discourse with a 0/1 tag. Specifically, the speaker turn of the first discourse in a conversation is labeled 0, and subsequent discourses are encoded according to whether the speaker changes: if the speaker changes, the label flips (from 0 to 1 or from 1 to 0); if the speaker does not change, the label stays the same. This encoding is produced automatically by a script. For example, the original speaker sequence <0, 1, 1, 2, 2, 1, 3> is relabeled as <0, 1, 1, 0, 0, 1, 0>. That is, only two embeddings, one per turn label, are used regardless of who is speaking, and they are independent of the dialogue content. These two embeddings are trained over all speakers in the training set as learnable parameters in the optimization process. They are the same size as the discourse embeddings and are generated by the speaker turn embedding layer with the speaker turn labels as input. This simplifies the modeling of speaker interaction, since the number of speakers may differ from dialogue to dialogue. The idea is inspired by [24], in which the authors used the speaker’s emotion shift information as a guide, fused it with the semantic information of the discourse, and experimentally verified that emotion shift information helps to improve the accuracy of emotion recognition.
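A minimal sketch of the relabeling step, assuming the turn label flips between 0 and 1 each time the speaker differs from the previous discourse and is carried over otherwise. The function name `speaker_turn_labels` is illustrative; the authors' script is not shown in the paper.

```python
from typing import Hashable, List, Sequence


def speaker_turn_labels(speakers: Sequence[Hashable]) -> List[int]:
    """Map a speaker-id sequence to 0/1 turn labels.

    Assumed scheme: the first discourse gets 0, and the label flips
    each time the speaker differs from the previous discourse.
    """
    labels: List[int] = []
    current = 0
    for i, speaker in enumerate(speakers):
        if i > 0 and speaker != speakers[i - 1]:
            current = 1 - current  # speaker changed: flip the turn label
        labels.append(current)
    return labels


labels = speaker_turn_labels([0, 1, 1, 2, 2, 1, 3])
table1 = speaker_turn_labels(
    ["mn015", "me012", "me003", "mn015", "me012", "me003", "me012"]
)
```

Because only the change of speaker matters, the same function works for any number of speakers and for string ids such as those in Table 1.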

Inspired by the Transformer [23], which sums positional embeddings with token embeddings in sequence representations, this paper sums the speaker turn embedding with the discourse embedding r(u).

$g(u, s) = r(u) + e(s), \quad s \in \{0, 1\}$ (2)

where u is the given discourse, s is the speaker turn, e(s) is the speaker turn embedding, and g(u, s) is the speaker turn-aware discourse embedding.

Topic modeling. Because the distribution of discourse roles differs across dialogue topics, it is crucial to model the dialogue topic as additional contextual information. This paper assigns a unique numerical code to each topic in increasing order and then adds the corresponding dialogue topic label to each discourse. For example, if a dialogue of 8 discourses contains 2 topics and the topic changes at the fifth discourse, the dialogue topic sequence is <0, 0, 0, 0, 1, 1, 1, 1>. The topic labels are fed into the topic embedding layer to obtain topic embeddings of the same size as the discourse embeddings, and these embeddings are learnable parameters in the optimization process. Finally, the topic embeddings are summed with the speaker turn-aware discourse embeddings g(u, s).

$g(u, s, t) = g(u, s) + f(t)$ (3)

where t is the topic encoding, f(t) is the topic embedding, and g(u, s, t) is the speaker turn and topic-aware discourse embedding.
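The summations in Eqs. (2) and (3) can be illustrated with plain lookup tables. This is a toy sketch: the randomly initialised tables stand in for the learnable embedding layers, the vector `r_u` is a placeholder for a RoBERTa discourse embedding, and the helper names `make_table` and `augment` are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def make_table(num_rows: int, dim: int, rng: random.Random) -> List[Vector]:
    """Randomly initialised lookup table standing in for a learnable embedding layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def augment(r_u: Vector, s: int, t: int,
            turn_table: List[Vector], topic_table: List[Vector]) -> Vector:
    """Return g(u, s, t) = r(u) + e(s) + f(t) by element-wise summation."""
    e_s, f_t = turn_table[s], topic_table[t]
    if not (len(r_u) == len(e_s) == len(f_t)):
        raise ValueError("all embeddings must share the discourse embedding size")
    return [r + e + f for r, e, f in zip(r_u, e_s, f_t)]


rng = random.Random(0)
dim = 4                                # toy size; Table 3 lists a hidden size of 768
turn_table = make_table(2, dim, rng)   # one row per speaker turn label (0/1)
topic_table = make_table(3, dim, rng)  # assumed: three topics in this toy dialogue
r_u = [0.5, -0.2, 0.1, 0.0]            # placeholder for a RoBERTa discourse embedding
g = augment(r_u, s=1, t=2, turn_table=turn_table, topic_table=topic_table)
```

Summation keeps the augmented embedding at the same size as r(u), so the downstream encoder needs no change in input dimension; the only added parameters are the rows of the two tables.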

Such a design integrates speaker turn information and topic information with the discourse, which enhances the model’s ability to perceive speaker turn and dialogue topic. We also tried concatenating the discourse embedding with the speaker turn embedding and the topic embedding, which performed worse than summation.

4.2.2 Dialogue-level coding

Since a dialogue text is a continuous sequence of discourses, there is often a strong correlation between the preceding and following discourses, which is crucial for Discourse Role recognition. Therefore, we use a BiLSTM to capture the contextual information of each discourse. Given the enhanced discourse embeddings $\langle g(u_{t}, s_{t}, t_{t}) \rangle_{t = 1}^{n}$ in dialogue C, we use the BiLSTM to model the context of each discourse:

$\langle q(u_{t}, s_{t}, t_{t}) \rangle_{t = 1}^{n} = \mathrm{BiLSTM}\left(\langle g(u_{t}, s_{t}, t_{t}) \rangle_{t = 1}^{n}\right)$ (4)

where $\langle q(u_{t}, s_{t}, t_{t}) \rangle_{t = 1}^{n}$ are the contextualized speaker turn and topic-aware discourse embeddings taken from the hidden states of the BiLSTM. Specifically, the enhanced sequence of discourse embeddings is fed into two LSTMs, one in forward order and one in backward order, to capture the bidirectional semantic information of the context. Each discourse thus corresponds to two LSTM hidden state representations, $\overrightarrow{h_{i}}$ and $\overleftarrow{h_{i}}$. The two representations are concatenated to obtain $h_{i} \in \mathbb{R}^{2 d_{lstm}}$, where $d_{lstm}$ is the hidden state size of each LSTM.

4.3 Discourse role classification

Finally, a linear layer is used to predict the Discourse Role y_k of each discourse.

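$y_{k} = W^{e} h_{k} + b^{e}$ (5)

where $h_{k}$ is the concatenated BiLSTM representation of the k-th discourse, $W^{e} \in \mathbb{R}^{E \times H}$ and $b^{e} \in \mathbb{R}^{E}$ are trainable parameters, H is the dimension of $h_{k}$, and E is the number of Discourse Role categories.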

This layer is optimized using cross-entropy loss.

$L_{cls} = - \frac{1}{\sum_{i = 1}^{C} K_{i}} \sum_{i = 1}^{C} \sum_{j = 1}^{K_{i}} \log P_{i, j}[y_{i, j}]$ (6)

where C is the number of dialogues, $K_{i}$ is the number of discourses in dialogue i, $P_{i, j}$ is the probability distribution over Discourse Role labels for discourse j in dialogue i, and $y_{i, j}$ is the true Discourse Role label.
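Equation (6) is defined over probabilities, whereas Eq. (5) produces unnormalized scores. The sketch below assumes the usual softmax between the two, which the text does not state explicitly, and computes the loss as the mean negative log-likelihood over every discourse in every dialogue. Function names are illustrative.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one discourse's class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def dialogue_cross_entropy(logits: List[List[List[float]]],
                           labels: List[List[int]]) -> float:
    """Mean negative log-likelihood over all discourses of all dialogues (Eq. 6).

    logits[i][j] holds the class scores of discourse j in dialogue i;
    labels[i][j] is the index of its true Discourse Role.
    """
    total_nll, n_discourses = 0.0, 0
    for dialogue_logits, dialogue_labels in zip(logits, labels):
        for scores, gold in zip(dialogue_logits, dialogue_labels):
            total_nll -= log_softmax(scores)[gold]
            n_discourses += 1
    return total_nll / n_discourses


# Two toy dialogues with uniform scores over E = 7 roles.
toy_logits = [[[0.0] * 7, [0.0] * 7], [[0.0] * 7]]
toy_labels = [[0, 3], [6]]
loss = dialogue_cross_entropy(toy_logits, toy_labels)
```

Note that the normalizer is the total number of discourses, so long dialogues contribute proportionally more terms than short ones. With uniform scores, the loss equals log E, which is a convenient sanity check at initialization.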

5 Experiments

5.1 Implementation details

This paper implements the proposed model using the PyTorch framework, and the classification cross-entropy loss is optimized using the Adam optimizer. The complete list of hyperparameters is given in Table 3.

Table 3
Training hyperparameters

Hyperparameter	Value
Batch Size	4
Hidden Size	768
Layers_RoBERTa	12
Layers_BiLSTM	1
Dropout	0.1
Learning Rate	1e-4
Epochs	50
nfinetune	1

Since the dialogues in MRDR are much longer (up to 1053 times), dialogues are partitioned into fixed-length dialogue blocks to avoid running out of memory during training. As shown in Fig. 2, a dialogue of length 10 is sliced into three dialogue blocks with a block size of 4, where each block serves as one data point. Chunking is not required for validation or testing, because the computational graph, which accounts for much of the GPU memory consumption, only needs to be maintained during training.

Fig. 2

Example of dialogue blocking.
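The blocking step can be sketched as below. This is a minimal illustration under stated assumptions: blocks are taken to be consecutive and non-overlapping, and the final block simply holds the remainder without padding; the paper does not spell out either detail. The helper name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(dialogue, block_size):
    """Slice a dialogue (a list of discourses) into consecutive blocks.

    Every block holds `block_size` items except possibly the last one,
    which holds whatever remains (assumption: no padding, no overlap).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [dialogue[i:i + block_size] for i in range(0, len(dialogue), block_size)]


# A toy dialogue of length 10, as in the Fig. 2 example.
dialogue = [f"u{t}" for t in range(1, 11)]
blocks = split_into_blocks(dialogue, block_size=4)
print([len(block) for block in blocks])  # [4, 4, 2]
```

Under these assumptions the dialogue of length 10 yields three blocks, and concatenating the blocks restores the original order of discourses.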

Keeping the other hyperparameters constant, Table 4 shows the results of using different block sizes on MRDR. As the block size increases from 4 to 16, the accuracy improves from 84.51% to 86.99%, since the RNN can utilize more contextual information. Beyond this point, a further increase in the block size reduces performance (86.62% at 32 and 86.41% at 64). A possible explanation is that the RNN suffers from vanishing and exploding gradients and forgets long-term dependencies. Partitioning long dialogues into smaller blocks therefore improves performance on the Discourse Role classification task, and the block_size parameter is set to 16.

Table 4

Accuracy (%) of using different block sizes on MRDR

Block_size	Accuracy
4	84.51
8	86.80
16	86.99
32	86.62
64	86.41

5.2 Baselines

TextRNN: TextRNN is a Recurrent Neural Network (RNN) based text classification model that models variable-length sequences. TextRNN transforms each word of a text into a vector and then inputs all word vectors into the RNN for sequence modeling.

TextRNN_Att [26]: TextRNN_Att is a model that adds an attention mechanism to TextRNN. At each time step, TextRNN_Att obtains the weight of each word by calculating the inner product of each word vector with a weight vector and then sums all word vectors weighted to obtain the output vector for that time step.

Transformers [23]: Transformers is a model based on a self-attention mechanism. It learns the relationships between different positions in a sequence by computing self-attention over all positions of the input sequence.

FastText [27]: FastText is a text classification model based on the bag-of-words model, which obtains a bag-of-words representation of a text by applying word segmentation and n-gram extraction. FastText then performs classification using a hierarchical softmax to map the text into predefined categories.

TextCNN [28]: TextCNN is a convolutional neural network (CNN)-based text classification model that uses convolutional kernels of different sizes to capture information of different lengths in the text, integrates all the features through a pooling layer, and finally outputs the classification results.

TextRCNN [29]: TextRCNN (Recurrent Convolutional Neural Network) is a deep learning model specifically for text classification tasks. It combines the advantages of RNN and CNN to capture both time series information and spatial information in text data.

Bi_LSTM_CRF [30]: The method constructs a hierarchical bi-directional LSTM as the basic unit and a conditional random field as the top layer to accomplish the task of dialogue act recognition.

LSTM_Softmax [31]: The method applies a deep LSTM structure to classify dialogue acts through softmax operations. The authors claim that word embedding, dropout, weight decay, and several LSTM layers significantly impact the final performance.

CASA [15]: This is a context-aware attention-based system for classifying dialogue acts. It uses RNNs at the dialogue and discourse levels and computes context-aware self-attention before the final classification.

RoBERTa [6]: We use RoBERTa as a baseline in this work due to its strong performance on various benchmarks. RoBERTa is similar to BERT: it is an encoder-only language model trained in an unsupervised manner on a large amount of unlabeled data using a masked language modeling objective.

5.3 Experimental results

In the experiments, the dataset was divided into training, validation, and test sets in a 70:20:10 ratio. To measure the performance of the STTATM model and the baseline models, the experiments in this paper use Precision (P), Recall (R), F1-measure (F1), and Accuracy (ACC) as evaluation metrics. The speaker turn embedding and topic embedding proposed in this paper can be used in other embedding-based discourse classification methods. Because the code of the baseline models does not use these two embeddings, they are not added to those baselines. To validate the STTATM model more effectively, this paper additionally applies the speaker turn and topic embeddings to the RoBERTa model.

Table 5 shows the results of the comparison experiments. The STTATM model achieves the best results on MRDR, with an accuracy of 86.99%. Except for the RoBERTa model, none of the baseline models reaches 80% accuracy. One possible reason for this gap is that RoBERTa learns rich contextual information from large-scale text through pre-training and can therefore capture rich semantic representations of discourse. In addition, RoBERTa performs well with a small amount of labeled data relative to traditional models because it can transfer pre-trained knowledge. Meanwhile, the RoBERTa model's accuracy improves with the addition of speaker turn and dialogue topic information (from 84.20% to 85.51%), demonstrating the effectiveness of modeling speaker turn and dialogue topic in the discourse role recognition task. It is worth noting that the accuracy of the RoBERTa_{speaker+topic} model is still lower than that of the STTATM model. Discourse role recognition relies on long-range contextual information. RoBERTa is a Transformer-based model whose self-attention mechanism captures semantic relationships between words. Although it can consider a certain amount of context during pre-training, its self-attention mechanism may be limited when dealing with long sequences. In contrast, BiLSTM is a recurrent neural network that processes a sequence one time step at a time, gradually accumulating and passing on information, and thus has advantages in dealing with long-distance dependencies. By combining the two, the model can understand and model context at different levels, improving its overall understanding of the sequence.

Table 5

Experimental results of the comparison models

Model	ACC
TextRNN	78.81
TextRNN_Att	77.46
Transformers	71.39
FastText	79.29
TextCNN	78.81
TextRCNN	78.71
Bi_LSTM_CRF	77.64
LSTM_Softmax	78.26
CASA	74.07
RoBERTa	84.20
RoBERTa_speaker	85.32
RoBERTa_topic	85.36
RoBERTa_{speaker+topic}	85.51
STTATM	86.99

Table 6 shows the performance of the STTATM model for each discourse role label. II, E, SC, PA, and NE are recognized comparatively well, with accuracies of 89.30%, 71.74%, 89.77%, 93.47%, and 92.86%, respectively. These discourse roles either appear frequently in the dataset or have apparent discourse features that the model can capture effectively. However, the accuracies for PI and INT are only 51.52% and 48.94%, respectively. These two types of discourse roles appear less frequently in the dataset, and their discourse forms are diverse, which may make it difficult for the model to learn their discourse features thoroughly. In summary, the STTATM model performs well on most discourse roles. However, there is still room to improve performance on the rare roles with less distinctive discourse features, and more training data or more complex model structures are needed to improve the recognition of these discourse roles.

Table 6

Labels classification results for STTATM

DR	P	R	F1	ACC
INT	79.31	48.94	60.53	48.94
II	87.58	89.30	88.43	89.30
PI	47.22	51.52	49.28	51.52
E	62.26	71.74	66.67	71.74
SC	91.86	89.77	90.80	89.77
PA	96.20	93.47	94.81	93.47
NE	52.00	92.86	66.67	92.86
Total	73.78	76.80	73.88	86.99
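As a quick arithmetic check on Table 6, the sketch below recomputes each label's F1 from its precision and recall and averages the per-label scores without weighting. The P, R, and F1 values in the Total row match such a macro average, although the paper does not state the averaging scheme, so this reading is an assumption. The helper names `f1_score` and `macro_average` are hypothetical.

```python
# Per-label precision, recall and F1 (in %) copied from Table 6.
table6 = {
    "INT": (79.31, 48.94, 60.53),
    "II": (87.58, 89.30, 88.43),
    "PI": (47.22, 51.52, 49.28),
    "E": (62.26, 71.74, 66.67),
    "SC": (91.86, 89.77, 90.80),
    "PA": (96.20, 93.47, 94.81),
    "NE": (52.00, 92.86, 66.67),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(rows):
    """Unweighted mean of each column over all labels, rounded to 2 decimals."""
    n = len(rows)
    return tuple(round(sum(row[k] for row in rows) / n, 2) for k in range(3))


macro_p, macro_r, macro_f1 = macro_average(list(table6.values()))
print(macro_p, macro_r, macro_f1)
```

Because the average is unweighted, the rare labels INT and PI pull the macro F1 well below the overall accuracy of 86.99%, which is dominated by the frequent labels.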

5.4 Error analysis

Figure 3 reports the confusion matrix of STTATM. Two pairs with significant error rates (≥ 40%) can be observed: INT: II (44.7%) and PI: II (42.4%). The STTATM model tends to incorrectly predict INT and PI as II. This error can be attributed to the diversity of discourse, i.e., INT and PI are often presented in different discourse forms. For example, "I think we should......" and "because someone already has......" belong to INT and PI, respectively, but they are easily confused with II because of the presence of "I think" and "because," which are keywords of the II discourse role. Another possible reason is that II makes up a larger share of the dataset, so the model may be more inclined to predict the more frequent categories and less effective at predicting the rare ones. For the remaining cases, the error rate can be considered nominal. This suggests that STTATM can be further improved with a more balanced dataset.

Fig. 3

Confusion matrix for STTATM.
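The error rates quoted above can be read as row-normalized confusion-matrix entries, i.e., the share of discourses with a given gold label that receive a particular wrong prediction. The sketch below illustrates that computation; the row-normalization reading is an assumption, the function names are hypothetical, and the counts are invented toy values, not the paper's data.

```python
def misclassification_rates(confusion):
    """Row-normalise a confusion matrix given as {gold: {predicted: count}}.

    Returns {(gold, predicted): rate} for every off-diagonal cell with a
    nonzero count, where rate = count / number of items with that gold label.
    """
    rates = {}
    for gold, row in confusion.items():
        total = sum(row.values())
        if total == 0:
            continue
        for predicted, count in row.items():
            if predicted != gold and count > 0:
                rates[(gold, predicted)] = count / total
    return rates


def significant_errors(confusion, threshold=0.40):
    """Keep only the confusions whose rate reaches the threshold."""
    rates = misclassification_rates(confusion)
    return {pair: rate for pair, rate in rates.items() if rate >= threshold}


# Hypothetical toy counts for three labels (NOT taken from the paper).
toy_confusion = {
    "INT": {"INT": 5, "II": 4, "PI": 1},
    "II": {"INT": 1, "II": 18, "PI": 1},
    "PI": {"INT": 0, "II": 5, "PI": 5},
}
print(significant_errors(toy_confusion))
```

In this toy matrix only the INT-to-II and PI-to-II confusions pass the 40% threshold, mirroring the pattern reported for Fig. 3.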

5.5 Ablation experiments

Ablation experiments are conducted to verify the effectiveness of the speaker turn embedding and the dialogue topic embedding in the STTATM model. The experimental results are shown in Table 7, where "¬speaker" indicates that no speaker turn embedding is added and "¬topic" indicates that no topic embedding is added. The STTATM model improves on the STTATM_{¬speaker+¬topic} model in precision, F1, and accuracy, indicating the effectiveness of modeling speaker turn and dialogue topic in the discourse role classification task. In multi-turn dialogues, discourse roles usually change dynamically, and the speaker turn embedding can help the model capture this dynamism. In addition, there is a correlation between dialogue topics and discourse roles, and the dialogue topic embedding can help the model capture this correlation. Meanwhile, the STTATM_¬speaker model outperforms the STTATM_¬topic model in all metrics. It can therefore be concluded that dialogue topic information is more effective for discourse role modeling than speaker turn information. Dialogue topics provide a broader context that helps the model understand the overall context of the current dialogue. Speaker turn information is essential for understanding the alternation between speakers, but it provides only local information, so its contribution as auxiliary information is more limited.

Table 7

Comparison results of ablation experiments

Model	P	R	F1	ACC
STTATM_{¬speaker+¬topic}	67.86	79.86	72.11	84.68
STTATM_¬speaker	69.89	77.70	73.51	85.62
STTATM_¬topic	68.25	76.67	72.24	84.87
STTATM	73.78	76.80	73.88	86.99
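To make the notion of speaker turn information concrete, the sketch below derives a turn label for each discourse from the sequence of speakers. It is only an illustration under an assumption: it uses a binary label that flips whenever the speaker changes, in the spirit of the speaker turn modeling work cited as [5], and it is not confirmed to be the exact scheme used in STTATM. The function name `speaker_turn_labels` and the speaker ids are hypothetical.

```python
def speaker_turn_labels(speakers):
    """Map a sequence of speaker ids to binary turn labels.

    The label flips whenever the speaker differs from the previous one,
    so it marks turn changes without encoding who is speaking.
    """
    labels = []
    current = 0
    for i, speaker in enumerate(speakers):
        if i > 0 and speaker != speakers[i - 1]:
            current = 1 - current
        labels.append(current)
    return labels


# Three hypothetical speakers; consecutive discourses by one speaker share a label.
print(speaker_turn_labels(["A", "A", "B", "C", "C", "A"]))  # [0, 0, 1, 0, 0, 1]
```

Such a label depends only on where the speaker changes, not on who is speaking, which matches the paper's remark that speaker turn modeling is independent of speaker identity. A label of this kind would index a small embedding table whose vector is added to the discourse embedding.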

6 Conclusion

In this work, a new dataset called MRDR is developed to provide a platform for Discourse Role classification in task-oriented collaborative dialogues. In addition, this paper proposes the STTATM model, a hierarchical model for Discourse Role classification. The model encodes discourse at both the discourse level and the dialogue level, which models the relationships between discourses in a dialogue and effectively captures the discourse's local and global semantic information. Meanwhile, the two global additive embedding vectors enable the model to perceive speaker turn and dialogue topic. This technique requires only negligible modifications to the recurrent model and introduces O(1) space complexity. The proposed approach models speaker turns independently of speaker identity, which allows the model to adapt better to different types of dialogue. A limitation is that the dataset used in the experiments is small and unevenly distributed across the discourse role categories, which affects the final performance of the model. Future work will therefore focus on expanding the corpus for the different discourse role categories to build a larger, more evenly distributed, high-quality dataset. Further optimization of the model will also be explored to improve discourse role recognition performance.

Discourse role identification has a variety of practical application scenarios. By identifying discourse roles, it is possible to understand the contributions of individual participants in task-oriented meetings, which helps leaders manage conversations and optimize collaboration efficiently. In addition, categorizing discourse roles helps to analyze speakers' characteristics. For example, participants who tend toward "Interpretive Inference" may have stronger analytical and reasoning abilities, and those who tend toward "Initiating New Topic" may have organizational and guiding abilities. Knowing these characteristics allows for better coordination of cooperation and task assignment, thus increasing teamwork efficiency.

Funding acknowledgment

National Natural Science Foundation of China (62166041).

References

1. Mercer, Littleton, Dialogue and the development of children's thinking: A sociocultural approach, Routledge, 2007.

2. Goodman, Hitzeman, Linton, et al., Towards intelligent agents for collaborative learning: recognizing the roles of dialogue participants, In: International Conference on User Modeling. Berlin, Heidelberg: Springer Berlin Heidelberg, (2003), pp. 363–367.

3. Stone, Stojnic, Lepore, Situated discourses and discourse relations, In: Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013)-Short Papers. (2013), pp. 390–396.

4. Chi T.C., Chen P.C., Su S.Y., et al., Speaker role contextual modeling for language understanding and dialogue policy learning. arXiv preprint arXiv:1710.00164, 2017.

5. …, Tavabi, Lerman, et al., Speaker turn modeling for dialogue act classification. arXiv preprint arXiv:2109.05056, 2021.

6. Liu, Ott, Goyal, et al., RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

7. Ferguson, Buckingham Shum, Learning analytics to identify exploratory dialogue within synchronous text chat, In: Proceedings of the 1st International Conference on Learning Analytics and Knowledge (2011), pp. 99–103.

8. Ferguson, Wei, He, et al., An evaluation of learning analytics to identify exploratory dialogue in online discussions, In: Proceedings of the Third International Conference on Learning Analytics and Knowledge (2013), pp. 85–93.

9. Ekman, Unpacking social media dialogues to find productive dialogue. In: ICERI2022 Proceedings. IATED (2022), pp. 1432–1440.

10. Chen P.C., Chi T.C., Su S.Y., et al., Dynamic time-aware attention to speaker roles and contexts for spoken language understanding, In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, (2017), pp. 554–560.

11. Zhu, Xu, Zeng, et al., A hierarchical network for abstractive meeting summarization with cross-domain pretraining. arXiv preprint arXiv:2004.02016, 2020.

12. Song, Tian, Wang, et al., Summarizing medical conversations via identifying important utterances. In: Proceedings of the 28th International Conference on Computational Linguistics (2020), pp. 717–729.

13. Chen, Yang, Zhao, et al., Dialogue act recognition via CRF-attentive structured network, In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018), pp. 225–234.

14. …, Lin, Collinson, et al., A dual-attention hierarchical recurrent neural network for dialogue act classification. arXiv preprint arXiv:1810.09154, 2018.

15. Raheja, Tetreault, Dialogue act classification with context-aware self-attention. arXiv preprint arXiv:1904.02594, 2019.

16. Shang, Tixier A.J.P., Vazirgiannis, et al., Speaker-change-aware CRF for dialogue act classification. arXiv preprint arXiv:2004.02913, 2020.

17. Colombo, Chapuis, Manica, et al., Guiding attention in sequence-to-sequence models for dialogue act prediction, In: Proceedings of the AAAI Conference on Artificial Intelligence 34(5) (2020), pp. 7594–7601.

18. Malhotra, Waheed, Srivastava, et al., Speaker and time-aware joint contextual learning for dialogue-act classification in counselling conversations, In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (2022), pp. 735–745.

19. Shriberg, Dhillon, Bhagat, et al., The ICSI meeting recorder dialog act (MRDA) corpus. In: Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004 (2004), pp. 97–100.

20. Janin, Baron, Edwards, et al., The ICSI meeting corpus, In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP'03). IEEE. 1 (2003), pp. I–I.

21. Shang, Ding, Zhang, et al., Unsupervised abstractive meeting summarization with multi-sentence compression and budgeted submodular maximization. arXiv preprint arXiv:1805.05271, 2018.

22. Devlin, Chang M.W., Lee, et al., BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

23. Gao, Cao, Guan, et al., Emotion recognition in conversations with emotion shift detection based on multi-task learning, Knowledge-Based Systems 248 (2022), 108861.

24. Vaswani, Shazeer, Parmar, et al., Attention is all you need, Advances in Neural Information Processing Systems 30, 2017.

25. Liu, Qiu, Huang, Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.

26. Zhou, Shi, Tian, et al., Attention-based bidirectional long short-term memory networks for relation classification, In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2016), pp. 207–212.

27. Joulin, Grave, Bojanowski, et al., Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.

28. Chen, Convolutional neural network for sentence classification, University of Waterloo, 2015.

29. Lai, Xu, Liu, et al., Recurrent convolutional neural networks for text classification, In: Proceedings of the AAAI Conference on Artificial Intelligence 29(1) (2015).

30. Kumar, Agarwal, Dasgupta, et al., Dialogue act sequence labeling using hierarchical encoder with CRF, Proceedings of the AAAI Conference on Artificial Intelligence 32(1), 2018.

31. Khanpour, Guntakandla, Nielsen, Dialogue act classification in domain-independent conversations using a deep recurrent neural network, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016), 2012–2021.
