Joint intent detection and slot filling with wheel-graph attention networks

Abstract

Intent detection and slot filling are two important tasks in a spoken language understanding (SLU) system. To model these two tasks at the same time, many joint models based on deep neural networks have been proposed recently and have achieved excellent results. In addition, graph neural networks have achieved good results in the field of vision. We therefore combine these two advantages and propose a new joint model with a wheel-graph attention network (Wheel-GAT), which is able to model interrelated connections directly for single intent detection and slot filling. To construct a graph structure for utterances, we create intent nodes, slot nodes, and directed edges. Intent nodes provide utterance-level semantic information for slot filling, while slot nodes provide local keyword information for intent detection. The two tasks promote each other and are trained end-to-end at the same time. Experiments show that our proposed approach is superior to multiple baselines on the ATIS and SNIPS datasets. We also demonstrate that using the Bidirectional Encoder Representations from Transformers (BERT) model further boosts the performance of the SLU task.

Keywords

Spoken language understanding, graph neural network, attention mechanism, joint learning

1 Introduction

Apart from the speech components, a task-oriented dialog system mainly consists of four parts: spoken language understanding (SLU), dialogue state tracking (DST), dialogue policy optimization (DPO), and natural language generation (NLG) [1]. As the first part, the SLU module directly affects the performance of the whole dialog system. Its main responsibility is to take the user utterance as input and perform three tasks: domain determination, intent detection, and slot filling [2]. The first two tasks identify the domain and intent of the utterance; both are classification problems and are usually modeled together. The third is generally considered a sequence tagging problem [3]. For instance, Table 1 shows the utterance “play techno on lastfm”, randomly sampled from the SNIPS dataset and annotated in the BIO format. Each word in the utterance corresponds to a slot tag, and the whole utterance corresponds to a specific intent.

Table 1
A sample utterance with its intent label (PlayMusic) and slot labels in BIO annotation format

Sentence	play	techno	on	lastfm
Slots	O	B-genre	O	B-service
Intent	PlayMusic
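To make the BIO format concrete, the following minimal sketch groups the tagged words of Table 1 into slot values. The helper name `bio_to_slots` is hypothetical and is not part of the released implementation; the sketch assumes that a stray I- tag without a matching B- tag is simply ignored.

```python
def bio_to_slots(words, tags):
    """Group BIO-tagged words into (slot_name, phrase) pairs."""
    slots, current_name, current_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            # An I- tag continues the open span of the same slot type.
            current_words.append(word)
        else:
            # "O" (or a mismatched I- tag) closes the open span, if any.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots


words = "play techno on lastfm".split()
tags = ["O", "B-genre", "O", "B-service"]
print(bio_to_slots(words, tags))
# [('genre', 'techno'), ('service', 'lastfm')]
```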

In early research, slot filling and intent detection were usually modeled separately, an approach known as the pipeline method. Intent detection is treated as an utterance classification problem that predicts an intent label, and it can be modeled with traditional classifiers, including logistic regression, support vector machines (SVM) [4], AdaBoost [5], or recurrent neural networks (RNN) [6]. The slot filling task can be framed as a sequence labeling problem. At present, the two best-performing methods are conditional random fields (CRF) [7] and RNNs [8]. However, traditional machine learning methods require hand-crafted features, and RNNs often suffer from vanishing gradients when processing long sentences.

Because these two tasks often occur together and are related to each other, the trend has been to model them jointly, which has produced a series of joint models [9–12]. However, these models capture the relationship between the two tasks only implicitly, through a joint loss function. For example, [2] proposed an RNN-LSTM model that does not establish an explicit relationship between intent and slots. Considering that correctly identifying the intent of a sentence helps the slot filling task, [13, 14], and [15] designed gate/mask mechanisms that integrate the intent information into the slot vector, which can filter out some non-entity slots. [16] adopts token-level intent detection in a stack-propagation framework, using the output vector of the intent directly as input to slot filling. Recently, some researchers have also begun to use slot information to predict the intent tag, explicitly constructing a bi-directional interrelated relationship between the two tasks. [17] proposed a capsule-based neural network model, which constructs intent capsules, slot capsules, and word capsules, and accomplishes slot filling and intent detection via a dynamic routing-by-agreement schema. [18] proposed the SF-ID network, which uses an ID subnet and an SF subnet to establish direct connections between the two tasks so that they promote each other. This approach trains iteratively and requires the number of iterations to be set manually, so it is difficult to optimize.

We apply the proposed approach to the single-intent ATIS and SNIPS public datasets from [19] and [13], respectively. Our experimental results show that our approach outperforms multiple baseline models. We further verify that using pre-trained BERT representations [20] can greatly improve performance. The main contributions of this paper can be summarized as follows: (1) We propose a novel joint model with a wheel-graph attention network, which is able to model interrelated connections directly for single intent detection and slot filling tasks. (2) We explicitly establish the interrelated mechanism among intent nodes and slot nodes in an utterance with a graph attention network (GAT) structure. (3) The graph structure learns the weights of the edges between intent and slot nodes, which makes our joint model more interpretable. (4) Our method is shown to be more effective on two popular datasets. (5) We investigate and explain the performance improvement obtained by introducing pre-trained BERT into SLU tasks.

For easy reproduction, the source code of our implementation is publicly available at https://github.com/gumowangfei/WheelGraph-SLU.

2 Related works

In this section, we introduce related research on SLU and GNNs in detail.

2.1 Spoken language understanding

Separate Model Intent detection and slot filling are modeled separately. Intent detection is framed as a text classification problem that predicts an intent tag. The traditional approach employs n-grams as features together with generic entities, such as address and date [12]. This type of approach is limited by the dimensionality of the input space. Another popular approach is to train machine learning models on labeled training data, such as support vector machines (SVM) and AdaBoost [4, 5]. Methods based on deep neural networks, such as deep belief networks (DBNs) and RNNs, show excellent performance [21, 22]. Slot filling is often regarded as a sequence tagging task that predicts a slot tag for each word. The traditional method is based on the conditional random field (CRF) architecture, which performs strongly on sequence labeling tasks [7]. Another popular approach is CRF-free sequential labeling. [8] introduced the LSTM architecture for this task and obtained a marginal improvement over RNNs. [23] and [24] introduced the self-attention mechanism for slot filling.

Implicit Joint Model Recently, researchers have modeled intent detection and slot filling together and proposed many joint methods to eliminate the error propagation caused by pipeline approaches. All these models connect the two tasks only implicitly, through shared parameters and joint loss functions. [2] designed a popular bidirectional RNN-LSTM architecture for joint modeling of intent detection and slot filling, in which the LSTM uses three kinds of gates (forget gate, input gate, and output gate) to mitigate vanishing gradients when modeling long sentences. Another SLU model is based on joint modeling with a bidirectional GRU (BiGRU) and max pooling: intent prediction uses max pooling to capture the global features of the sentence, and slot prediction adds a conditional random field (CRF) that models label transitions to obtain the globally optimal label sequence. [10] proposed an attention-based neural network model for joint intent detection and slot filling, and further showed that adding the attention mechanism to the encoder-decoder framework lets the model learn encoder-decoder alignment information and focus on important information, which is more effective for sequence labeling tasks. All these models outperform pipeline models through the mutual enhancement between the two tasks. However, they remain implicit joint models based on a joint loss function.

Explicit Unidirectional Related Joint Model In recent years, some works have explored explicit unidirectional joint models. In addition to sharing the encoder layer, these models use intent information to enhance slot prediction. Most gating/masking mechanisms use intent prediction information to select words with entity constraints in the sentence, which benefits the prediction of slot sequence tags. [15] proposes a novel gated self-attention model, which uses intent information for slot prediction for the first time, so as to make full use of the semantic correlation between intent and slots. [13] proposed a model based on intent attention, slot attention, and a gating mechanism. The gate focuses on learning the relationship between the intent and slot attention vectors, and obtains better prediction results through global optimization. [25] introduced a multi-head attention mechanism to encode feature vectors and uses a mask selection mechanism to explicitly model the relationship between intent detection and slot filling. [16] performs token-level intent detection in the Stack-Propagation framework. It outputs multiple intent predictions, which are passed to the subsequent slot prediction to better incorporate intent information. The final intent prediction is obtained by voting.

Explicit Interrelated Joint Model Given the close correlation between the two tasks, recent studies have begun to explore explicitly interrelated joint models that make fuller use of their interrelationship. [26] proposed a dual model architecture with decoders that applies a BiLSTM to the input sequence for each task. First, the BiLSTM learns long-range sentence information; the bidirectional hidden vector generated at each time step is then fed into a unidirectional LSTM, whose last hidden output is used for intent prediction. Similarly, the slot filling network uses the hidden output vector generated at each time step of the unidirectional LSTM to predict slot labels. [18] designed the SF-ID network, in which an ID subnet and an SF subnet exchange information between the two tasks so that they promote each other. The SF subnet introduces a correlation factor to apply intent information to the slot filling task, while the ID subnet also calculates a correlation factor to use slot information for the intent detection task. These correlation factors are similar to gating mechanisms. The two subnets are trained with an iterative mechanism. [17] proposed a novel bidirectional interrelated approach to joint intent recognition and slot filling based on capsule networks. It establishes an explicit hierarchical relationship among words, intent, and slots, with a clear structure and better interpretability. When the network is updated, the slot information learned through the routing-by-agreement schema enhances the prediction of the intent label, and a re-routing schema improves the prediction of the slot labels with the learned intent information.

2.2 Graph neural networks

Applying graph neural networks (GNNs) has recently become a popular approach in social network analysis [27], knowledge graphs [28], urban computing, and many other research areas [29, 30]. GNNs are well suited to modeling non-Euclidean data, whereas traditional neural network methods can only deal with regular data. [31] proposed a simplified graph neural network model, called the graph convolutional network (GCN). A GCN is a multi-layer neural network that relies on the graph structure to determine update weights. It operates directly on the graph and computes each node's embedding vector from its neighboring nodes.

Unlike the joint models discussed above, our proposed approach explicitly establishes direct connections among intent nodes and slot nodes with GAT [29], which uses weighted neighbor features with feature-dependent and structure-free normalization, in the style of attention. GAT improves on GCN in how it assigns weights to neighbors: the weights are learned through multi-head attention rather than relying solely on the graph structure, as GCN's update weights do. Analogous to multiple channels in a ConvNet [32], GAT introduces multi-head attention [33], which increases the number of model parameters but improves the learning ability of the model and lets it learn more types of attention features. Unlike other models [17, 18], our model does not need the number of iterations to be set during training. We have also established a wheel-graph structure to better learn context-aware information in an utterance.

3 Proposed approaches

In this section, we introduce our wheel-graph attention model for SLU tasks. The detailed network architecture of the entire model is shown in Figure 1. First, we show how to use a text encoder to encode an utterance and obtain the knowledge shared between the two tasks. Second, we introduce the graph attention network (GAT), which uses weighted neighbor features with feature-dependent and structure-free normalization, in the style of attention. Next, the wheel-graph attention network performs interrelated fusion learning over the intent node and slot nodes. Finally, intent detection and slot filling are optimized simultaneously via an end-to-end joint learning schema.

Fig. 1

The overall architecture of the proposed model based on Wheel-Graph attention networks.

3.1 Text encoder

Word Embedding: Given a sequence of words, we first convert each word into an embedding vector e_t, and the sequence is represented as [e₁, . . . , e_T], where T is the number of words in the utterance.

Affine Transformation: We perform an affine transformation on the embedding sequence, which is a data standardization method.

$x_{t} = W e_{t} + b$ (1) where W and b are trainable weights and biases.

Two-Layer BiGRU: Recurrent neural networks (RNNs), an extension of traditional feed-forward neural networks, are difficult to train to capture long-term dependencies because the gradients often either explode or vanish. Therefore, more sophisticated recurrent units with gating were created. Two influential methods are long short-term memory (LSTM) [34] and the gated recurrent unit (GRU) [35]. Like the LSTM unit, the GRU has gating units that regulate the flow of information within the unit. However, it has no separate memory cell and has fewer parameters. For this reason, we use the GRU in this work.

$r_{t} = σ (W_{r} x_{t} + U_{r} h_{t - 1})$ (2)

$z_{t} = σ (W_{z} x_{t} + U_{z} h_{t - 1})$ (3)

${\tilde{h}}_{t} = \tanh (W x_{t} + r_{t} ⊙ (U h_{t - 1}))$ (4)

$h_{t} = (1 - z_{t}) ⊙ h_{t - 1} + z_{t} ⊙ {\tilde{h}}_{t}$ (5)

where x_t is the input at time t, r_t and z_t are the reset gate and update gate respectively, σ is the sigmoid function, ⊙ is element-wise multiplication, and W and U are weight matrices. When the reset gate is off (r_t close to 0), it effectively makes the unit act as if it were reading the first symbol of an input sequence, allowing it to forget the previously computed state. For simplicity, the above equations are abbreviated as h_t = GRU (x_t, h_{t-1}).
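As an illustration of Eqs. (2)-(5), a single GRU step can be sketched in plain Python. This is a didactic sketch rather than the PyTorch implementation used in the experiments; the function names and the parameter dictionary layout are assumptions, and bias terms are omitted as in the equations above.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h_prev, p):
    """One GRU update; p holds the matrices W_r, U_r, W_z, U_z, W, U."""
    # Eq. (2): reset gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    # Eq. (3): update gate
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    # Eq. (4): candidate state, with the reset gate applied to U h_{t-1}
    uh = matvec(p["U"], h_prev)
    h_tilde = [math.tanh(a + ri * u) for a, ri, u in zip(matvec(p["W"], x), r, uh)]
    # Eq. (5): interpolate between the previous state and the candidate
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


# Sanity check with all-zero weights (1-dimensional input and state):
# both gates equal 0.5 and the candidate is 0, so the state is halved.
zero = {name: [[0.0]] for name in ("W_r", "U_r", "W_z", "U_z", "W", "U")}
print(gru_step([1.0], [0.4], zero))  # [0.2]
```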

To take into account both past and future information, we use a two-layer bidirectional GRU (BiGRU) to learn the utterance representations at each time step. The BiGRU, a modification of the GRU, consists of a forward and a backward GRU. The layer reads the affine-transformed vectors [x₁, . . . , x_T] and generates T hidden states by concatenating the forward and backward hidden vectors of the BiGRU:

${\vec{h}}_{t} = \vec{GRU} (x_{t}, {\vec{h}}_{t - 1})$ (6)

${\overset{\leftarrow}{h}}_{t} = \overset{\leftarrow}{\mathrm{GRU}} (x_{t}, {\overset{\leftarrow}{h}}_{t - 1})$ (7)

${\overset{\leftrightarrow}{h}}_{t} = [{\vec{h}}_{t}, {\overset{\leftarrow}{h}}_{t}]$ (8) where ${\vec{h}}_{t}$ is the hidden state of forward pass in BiGRU, ${\overset{\leftarrow}{h}}_{t}$ is the hidden state of backward pass in BiGRU and ${\overset{\leftrightarrow}{h}}_{t}$ is the concatenation of the forward and backward hidden states at time t.

In summary, to get more fine-grained sequence information, we use a two-layer BiGRU to encode input information. The representation is defined as:

${\overset{\leftrightarrow}{h}}_{t} = \mathrm{BiGRU} (\mathrm{BiGRU} (x_{t}))$ (9)
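The bidirectional wiring of Eqs. (6)-(9) can be sketched independently of the recurrent unit itself. In the sketch below, `toy_step` is a hypothetical stand-in for a GRU step (a running sum), chosen only so that the data flow is easy to follow by hand; all function names are assumptions.

```python
def run_direction(step, inputs, h0):
    """Apply a recurrent step over the inputs and collect every hidden state."""
    states, h = [], h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd, step_bwd, inputs, h0):
    """Eqs. (6)-(8): concatenate forward and backward states per time step."""
    fwd = run_direction(step_fwd, inputs, h0)
    # The backward pass reads the sequence in reverse; its states are
    # reversed again so that index t lines up with word t.
    bwd = run_direction(step_bwd, inputs[::-1], h0)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def toy_step(x, h):
    """Stand-in for a GRU step: element-wise running sum (illustration only)."""
    return [xi + hi for xi, hi in zip(x, h)]


xs = [[1.0], [2.0], [3.0]]
layer1 = bidirectional(toy_step, toy_step, xs, [0.0])
print(layer1)  # [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]

# Eq. (9): the second layer consumes the concatenated output of the first.
layer2 = bidirectional(toy_step, toy_step, layer1, [0.0, 0.0])
```

Each output vector is twice as wide as the state of a single direction, which matches the concatenation in Eq. (8).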

3.2 Graph attention network

The graph attention network (GAT) [29] is a variant of the graph neural network [36] and is an important module in our proposed approach. It propagates the intent or slot information from a one-hop neighborhood. Given a dependency graph with N nodes, where each node is associated with a local hidden vector x, a GAT layer computes each node representation by aggregating the hidden state vectors of its neighborhood.

GAT exploits the attention mechanism as a substitute for the statically normalized convolution operation. Below are the equations to compute the node embedding $h_{i}^{(l + 1)}$ of layer l + 1 from the embeddings of layer l.

$z_{i}^{(l)} = W^{(l)} h_{i}^{(l)}$ (10)

$e_{ij}^{(l)} = f ({\vec{a}}^{(l)^{T}} (z_{i}^{(l)} ∥ z_{j}^{(l)}))$ (11)

$α_{ij}^{(l)} = \frac{\exp (e_{ij}^{(l)})}{\sum_{k \in N (i)} \exp (e_{ik}^{(l)})}$ (12)

$h_{i}^{(l + 1)} = σ (\sum_{j \in N (i)} α_{ij}^{(l)} z_{j}^{(l)})$ (13)

where W^(l) is a linear transformation weight matrix of input states, ∥ represents vector concatenation, ${\vec{a}}^{(l)}$ is the attention context vector learned in the process of training, and ·^T represents transposition. f (·) is a LeakyReLU non-linear activation function [37]. N (i) is the set of neighbor nodes of node i. σ is an activation function such as tanh. For simplification, the above equations are abbreviated as h^(l+1) = GAT (h^(l)).
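Eqs. (10)-(13) can be traced with a small single-head sketch in plain Python. It is an illustration under stated assumptions, not the DGL-based implementation: the function names are hypothetical, the LeakyReLU negative slope of 0.2 is an assumed value, σ is taken to be tanh, and every node is assumed to have at least one neighbor.

```python
import math


def leaky_relu(a, slope=0.2):  # slope value is an assumption
    return a if a > 0 else slope * a


def gat_layer(h, neighbors, W, a):
    """Single-head GAT layer.

    h: list of node vectors; neighbors[i]: source nodes j that node i
    attends to; W: weight matrix (list of rows); a: attention vector whose
    length is twice the output dimension.
    """
    # Eq. (10): linear transformation of every node state
    z = [[sum(w * x for w, x in zip(row, hi)) for row in W] for hi in h]
    out = []
    for i, nbrs in enumerate(neighbors):
        # Eq. (11): unnormalized score for each edge j -> i
        e = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j]))) for j in nbrs]
        # Eq. (12): softmax over the neighborhood (shifted for stability)
        m = max(e)
        weights = [math.exp(v - m) for v in e]
        total = sum(weights)
        alpha = [w / total for w in weights]
        # Eq. (13): attention-weighted sum followed by the activation
        agg = [sum(al * z[j][k] for al, j in zip(alpha, nbrs)) for k in range(len(z[i]))]
        out.append([math.tanh(v) for v in agg])
    return out


# With a zero attention vector every neighbor receives the same weight,
# so each node becomes tanh of the mean of its neighbors' features.
h = [[1.0], [3.0]]
out = gat_layer(h, neighbors=[[0, 1], [0, 1]], W=[[1.0]], a=[0.0, 0.0])
```

A non-zero attention vector makes the weights depend on the node features themselves, which is the property that distinguishes GAT from a fixed, structure-based normalization.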

3.3 Wheel-graph attention network

In the SLU task, there is a strong correlation between intent detection and slot filling. To make full use of the correlation between intent and slot, we constructed a wheel-graph structure. In Figure 1, this wheel-graph structure contains an intent node and slot nodes.

For the node representation, we use the output of the previous two-layer BiGRU, and the formula is expressed as:

$h_{0}^{I} = \max_{t = 1}^{T} {\overset{\leftrightarrow}{h}}_{t}$ (14) where T is the number of words in the sentence, and the max function is an element-wise function. We use $h_{0}^{I}$ as the representation of the intent node and ${\overset{\leftrightarrow}{h}}_{t}$ as the representation of the slot nodes.
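The element-wise maximum in Eq. (14) amounts to max pooling over time: for each feature dimension, the largest value across all word positions is kept. A minimal sketch, with a hypothetical helper name and made-up hidden states:

```python
def max_pool(hidden_states):
    """Element-wise maximum over the time axis (Eq. 14)."""
    return [max(column) for column in zip(*hidden_states)]


# Three time steps with two features each (illustrative values only).
states = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]
print(max_pool(states))  # [0.5, 0.9]
```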

For the edges, we created a bidirectional connection between the intent node and each slot node. To make better use of the context information of the utterance, we also created bidirectional connections between adjacent slot nodes and connected the head and tail of the utterance to form a loop.
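The edge construction described above can be sketched as a list of directed (source, destination) pairs. The function name and its boolean flags are hypothetical; the sketch assumes that node 0 is the intent node, that nodes 1 to T are the slot nodes in word order, that neighboring slot nodes are linked along the rim of the wheel, and that no self-loops are added.

```python
def wheel_graph_edges(num_words, intent_to_slot=True, slot_to_intent=True, head_tail=True):
    """Directed edges of the wheel graph; node 0 = intent, 1..T = slot nodes."""
    edges = []
    # Spokes between the intent node (hub) and every slot node.
    for t in range(1, num_words + 1):
        if intent_to_slot:
            edges.append((0, t))
        if slot_to_intent:
            edges.append((t, 0))
    # Bidirectional edges between neighboring slot nodes.
    for t in range(1, num_words):
        edges.append((t, t + 1))
        edges.append((t + 1, t))
    # Close the loop between the last and the first word. For utterances of
    # one or two words this edge would be a self-loop or a duplicate.
    if head_tail and num_words > 2:
        edges.append((num_words, 1))
        edges.append((1, num_words))
    return edges


edges = wheel_graph_edges(4)  # e.g. "play techno on lastfm"
print(len(edges))  # 16
```

Switching off one of the flags removes the corresponding family of edges, which is one way to picture graph variants with fewer connections.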

All in all, the feedforward process of our proposed wheel-graph attention network can be written as follows:

$h_{m} = [h_{0}^{I}, {\overset{\leftrightarrow}{h}}_{t}]$ (15)

$h_{m}^{(l + 1)} = \mathrm{GRU} (\mathrm{GAT} (h_{m}^{(l)}), h_{m}^{(l)})$ (16)

$h^{I}, h_{t}^{S} = h_{0}^{(l + 1)}, h_{1 : m}^{(l + 1)}$ (17)

where m ∈ {0, 1, …, T} indexes the graph nodes, h^I is the hidden state output of the intent, and $h_{t}^{S}$ is the hidden state output of the slots.

3.4 Joint intent detection and slot filling

The last layer is the output layer. We adopt a joint learning method. The softmax function is applied to a linear transformation of the output representations to give the probability distribution y^I over the intent labels and the distribution $y_{t}^{S}$ over the slot labels of the t-th word. Formally,

$y^{I} = \mathrm{softmax} (W^{I} h^{I} + b^{I})$ (18)

$y_{t}^{S} = \mathrm{softmax} (W^{S} h_{t}^{S} + b^{S})$ (19)

$o^{I} = \mathrm{argmax} (y^{I})$ (20)

$o_{t}^{S} = \mathrm{argmax} (y_{t}^{S})$ (21)

where W^I and W^S are trainable parameters of the model, and b^I and b^S are bias vectors. o^I and $o_{t}^{S}$ are the predicted output labels for the intent and slot tasks, respectively.
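The output layer of Eqs. (18)-(21) is a linear map followed by softmax and argmax. The sketch below shows the intent branch only; the slot branch applies the same computation to each $h_{t}^{S}$. The function names, the weights, and the two-label inventory are illustrative assumptions, not learned values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W, b, h):
    """Compute W h + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


def predict(W, b, h, labels):
    """Eqs. (18) and (20): label distribution and its argmax."""
    probs = softmax(linear(W, b, h))
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Toy intent head with two labels and a 2-dimensional intent state.
label, probs = predict(
    W=[[1.0, 0.0], [0.0, 1.0]],
    b=[0.0, 0.0],
    h=[2.0, 0.5],
    labels=["PlayMusic", "GetWeather"],
)
print(label)  # PlayMusic
```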

We then define the loss function for our model. We use ${\hat{y}}^{I}$ and ${\hat{y}}^{S}$ to represent the ground truth labels for intent and slots.

The loss function for intent is the cross-entropy:

$L_{1} = - \sum_{i = 1}^{n_{I}} {\hat{y}}^{i, I} \log (y^{i, I})$ (22)

Similarly, the loss function for a slot label sequence is formulated as:

$L_{2} = - \sum_{t = 1}^{T} \sum_{i = 1}^{n_{S}} {\hat{y}}_{t}^{i, S} \log (y_{t}^{i, S})$ (23) where n_I is the total number of intent label types, n_S is the total number of slot label types, and T is the number of words in the whole sentence.

The training objective of the model is to minimize a joint loss function:

$L_{θ} = α L_{1} + (1 - α) L_{2}$ (24) where α is a weight factor that balances the two tasks.
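For one-hot ground truth labels, the sums over label types in Eqs. (22) and (23) reduce to the negative log-probability of the gold label, which makes the joint loss of Eq. (24) easy to sketch. The function names are hypothetical; the default α = 0.1 is the value reported in the implementation details.

```python
import math


def cross_entropy(probs, gold_index):
    """Cross-entropy against a one-hot target: -log p(gold)."""
    return -math.log(probs[gold_index])


def joint_loss(intent_probs, intent_gold, slot_probs, slot_gold, alpha=0.1):
    """Eq. (24): alpha * L1 + (1 - alpha) * L2 for one utterance."""
    l1 = cross_entropy(intent_probs, intent_gold)                             # Eq. (22)
    l2 = sum(cross_entropy(p, g) for p, g in zip(slot_probs, slot_gold))      # Eq. (23)
    return alpha * l1 + (1 - alpha) * l2


# Toy utterance with two words, two intent labels and two slot labels.
loss = joint_loss(
    intent_probs=[0.5, 0.5], intent_gold=0,
    slot_probs=[[0.9, 0.1], [0.2, 0.8]], slot_gold=[0, 1],
)
```

With α = 0.1 the slot loss dominates the objective, which is plausible given that it sums over all T words while the intent loss contributes a single term per utterance.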

4 Experiments

In this section, we describe our experimental setup and report our experimental results.

4.1 Experimental setup

For the experiments, we adopt two popular datasets, ATIS [38] and SNIPS [19]; the latter was collected from the Snips personal voice assistant in 2018. Both are public single-intent datasets that are widely used as benchmarks in SLU research. Compared to the single-domain ATIS dataset, SNIPS is more complicated, mainly due to its intent diversity and large vocabulary. Both datasets used in our paper follow the same format and partition as in [16]. The statistics of the two datasets are shown in Table 2.

Table 2
Datasets overview

Datasets	ATIS	SNIPS
# Training set	4,478	13,084
# Validation set	500	700
# Test set	893	700
# Intents	21	7
# Slots	120	72
Vocabulary Size	722	11,241
Avg. Length	11.28	9.05

In order to verify the effectiveness of our approach, we compare it with the following baseline approaches. It is worth noting that the metrics of some approaches are obtained directly from [16].

Joint Seq applies an RNN-LSTM architecture for slot filling, and the last hidden state of LSTM is used to predict the intent of the utterance [2].

Attention BiRNN adopts an attention-based RNN model for joint intent detection and slot filling. Slot label dependencies are modeled in the forward RNN. A max-pooling over time on the hidden state vectors is used to perform the intent classification [11].

Slot-Gated Full Atten. utilizes a slot-gated mechanism that focuses on learning the relationship between intent and slot attention vectors. The intent attention context vector is used for the intent classification [13].

Self-Attentive Model first makes use of self-attention to generate a context-aware representation of the embedding. Then a bidirectional recurrent layer takes as input the embeddings and context-aware vectors to produce hidden states. Finally, it exploits the intent-augmented gating mechanism to match the slot label [15].

Bi-Model is an RNN-based semantic frame parsing network that performs the intent detection and slot filling tasks jointly, considering their cross-impact on each other by using two correlated bidirectional LSTMs [26].

SF-ID Network is a novel bi-directional interrelated model for joint intent detection and slot filling. It contains an entirely new iteration mechanism inside the SF-ID network to enhance the bi-directional interrelated connections [18].

CAPSULE-NLU introduces a capsule-based neural network model with a dynamic routing-by-agreement schema to accomplish intent detection and slot filling tasks. The output representations of IntentCaps and SlotCaps are used for intent detection and slot filling, respectively [17].

Stack-Propagation adopts a Stack-Propagation framework, which directly uses the intent information as input for slot filling and performs token-level intent detection to further alleviate error propagation [16].

4.2 Implementation details

In our experiments, the dimensionality of the word embeddings is 1024 for both the ATIS and SNIPS datasets. All model weights are initialized from a uniform distribution. The number of hidden units of the BiGRU encoder is set to 512. The number of layers of the GAT model is set to 1. The dimensionality of the graph node representation is set to 1024. The weight factor α is set to 0.1. We use the Adam optimizer [39] with an initial learning rate of 10^-3, and L2 weight decay is set to 10^-6. The model is trained on all the training data with a mini-batch size of 64. To help our model generalize well, the maximum norm for gradient clipping is set to 1.0. We also apply dropout with a ratio of 0.2 to reduce overfitting.

We implemented our model using PyTorch and DGL on a Linux machine with Quadro P5000 GPUs. For all experiments, we select the model that performs best on the validation set and evaluate it on the test set.

4.3 Experimental results

Following Qin et al. [16], we use three evaluation metrics for the SLU tasks. For intent detection, we use accuracy. For slot filling, we use the F1-score, which combines precision and recall and therefore evaluates the model better than either alone. In addition, to evaluate overall performance at the sentence level, we use sentence accuracy, that is, the percentage of samples in the corpus for which both the intent and all slots are predicted correctly. Table 3 shows the experimental results of the proposed model on the single-intent ATIS and SNIPS datasets.
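The three metrics can be sketched as follows. The sketch assumes that slot F1 is computed at the span level, where a predicted slot counts as correct only if its boundaries and type both match the gold annotation; this is the common convention for BIO-tagged data, but the text above does not specify it. The function names and the second toy example are hypothetical.

```python
def spans(tags):
    """Return the set of (start, end, slot_name) chunks in a BIO sequence."""
    found, start, name = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and name == tag[2:]
        if not continues:
            if name is not None:
                found.add((start, i, name))
            start, name = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return found


def evaluate(examples):
    """examples: list of (gold_intent, pred_intent, gold_tags, pred_tags)."""
    tp = n_pred = n_gold = intent_ok = sent_ok = 0
    for gold_intent, pred_intent, gold_tags, pred_tags in examples:
        g, p = spans(gold_tags), spans(pred_tags)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
        intent_ok += gold_intent == pred_intent
        # A sentence counts only if the intent and every slot tag are right.
        sent_ok += gold_intent == pred_intent and list(gold_tags) == list(pred_tags)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n = len(examples)
    return {"slot_f1": f1, "intent_acc": intent_ok / n, "sentence_acc": sent_ok / n}


examples = [
    ("PlayMusic", "PlayMusic", ["O", "B-genre", "O", "B-service"], ["O", "B-genre", "O", "B-service"]),
    ("GetWeather", "GetWeather", ["O", "B-city"], ["O", "O"]),  # one missed slot
]
scores = evaluate(examples)
```

In this toy case both intents are correct, two of the three gold slots are recovered with no false positives, and only the first sentence is fully correct, so sentence accuracy is the strictest of the three numbers.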

Table 3
Comparison of different methods with Wheel-GAT on the ATIS and SNIPS datasets. Metrics marked with ^* indicate that the improvement of our approach over all baselines is statistically significant with p < 0.05 under a t-test

Model	ATIS Slot (F1)	ATIS Intent (Acc)	ATIS Sentence (Acc)	SNIPS Slot (F1)	SNIPS Intent (Acc)	SNIPS Sentence (Acc)
Joint Seq [2]	94.3	92.6	80.7	87.3	96.9	73.2
Attention BiRNN [11]	94.2	91.1	78.9	87.8	96.7	74.1
Slot-Gated Full Atten. [13]	94.8	93.6	82.2	88.8	97.0	75.5
Self-Attentive Model [15]	95.1	96.8	82.2	90.0	97.5	81.0
Bi-Model [26]	95.5	96.4	85.7	93.5	97.2	83.8
SF-ID Network [18]	95.6	96.6	86.0	90.5	97.0	78.4
CAPSULE-NLU [17]	95.2	95.0	83.4	91.8	97.3	80.9
Stack-propagation [16]	95.9	96.9	86.5	94.2	98.0	86.9
Wheel-GAT	96.0^*	97.5^*	87.2^*	94.8^*	98.4^*	87.4^*

We note that the results of unidirectional related joint models are better than implicit joint models like Joint Seq [2] and Attention BiRNN [11], and the results of interrelated joint models are better than unidirectional related joint models like Slot-Gated Full Atten. [13] and Self-Attentive Model [15]. That is likely due to the strong correlation between the two tasks. The intent representations apply slot information to intent detection task while the slot representations use intent information in slot filling task. The bi-directional interrelated model helps the two tasks to promote each other mutually.

We also find that our graph-based Wheel-GAT model performs better than the best prior joint model, the Stack-Propagation framework. On the ATIS dataset, our model achieves a 0.6% improvement on Intent (Acc), a 0.1% improvement on Slot (F1-score), and a 0.7% improvement on Sentence (Acc). On the SNIPS dataset, our model achieves a 0.4% improvement on Intent (Acc), a 0.6% improvement on Slot (F1-score), and a 0.5% improvement on Sentence (Acc). This indicates the effectiveness of our Wheel-GAT model. In previously proposed models, the iteration mechanism, which requires the number of iterations to be set manually, is inflexible during training, and token-level intent detection increases the output load when the utterance is very long. In contrast, our model employs a graph-based attention network, which uses weighted neighbor features with feature-dependent and structure-free normalization, in the style of attention, and directly uses the explicit intent and slot information, which further helps capture the relationship between the two tasks and improves SLU performance.

4.4 Ablation study

In this section, to further examine how much each component of Wheel-GAT contributes to performance, we perform an ablation study on our model. An ablation study is a general method for evaluating whether and how each part of a model contributes to the full model. We ablate four important components and evaluate the resulting variants in this experiment. Note that all the variants are based on the joint learning method with the joint loss.

Wheel-GAT w/o intent → slot, where no directed edge connection is added from the intent node to the slot node. The intent information is not explicitly applied to the slot filling task on the graph layer.

Wheel-GAT w/o slot → intent, where no directed edge connection is added from the slot node to the intent node. The slot information is not explicitly used for the intent detection task on the graph layer.

Wheel-GAT w/o head ↔ tail, where no bidirectional edge connection is used between the head and tail word nodes of the utterance. The slot nodes then form an open chain rather than a loop.

Wheel-GAT w/o GAT, where no graph attention mechanism is performed in our model. The message propagation is computed via GCN instead of GAT. GCN introduces the statically normalized convolution operation as a substitute for the attention mechanism.

Table 4 shows the joint learning performance of the ablated models on the ATIS and SNIPS datasets. We find that all variants of our model perform well based on our graph structure except Wheel-GAT w/o GAT. As listed in the table, all components contribute to both the intent detection and slot filling tasks.

Table 4
Ablation study on the ATIS and SNIPS datasets. → denotes a directed edge between the intent node and a slot node, in the direction written (intent → slot or slot → intent). ↔ denotes the bidirectional edge connecting the head and tail word nodes in an utterance

Model	ATIS Slot (F1)	ATIS Intent (Acc)	ATIS Sentence (Acc)	SNIPS Slot (F1)	SNIPS Intent (Acc)	SNIPS Sentence (Acc)
Wheel-GAT	96.0	97.5	87.2	94.8	98.4	87.4
Wheel-GAT w/o intent → slot	95.5	97.1	86.9	93.5	98.0	85.7
Wheel-GAT w/o slot → intent	95.4	96.8	86.6	93.9	97.9	85.8
Wheel-GAT w/o head ↔ tail	95.6	97.0	86.9	94.0	97.6	85.8
Wheel-GAT w/o GAT	95.0	96.2	84.3	90.8	96.7	77.6

If we remove the intent → slot edge from the full model, the slot F1 drops by 0.5% and 1.3% on ATIS and SNIPS, respectively. Similarly, if we remove the slot → intent edge, the intent accuracy drops by 0.7% and 0.5%, respectively. These results suggest that intent information and slot information reinforce each other. The added edges do improve performance, which is consistent with the findings of previous work [13, 18].

If we remove the head ↔ tail edge from the full model, the slot F1-score drops by 0.4% on ATIS and by 0.8% on SNIPS. We attribute this to the fact that the head ↔ tail structure can better model context-aware information in an utterance.
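The drops quoted in this ablation are absolute differences (in percentage points) between the full model and each variant in Table 4. As a small worked check, the sketch below recomputes them from the table values. The dictionary layout, the column names, and the helper `drops` are illustrative choices, not part of the original experiments.

```python
# Scores copied from Table 4, in the order:
# ATIS slot F1, ATIS intent acc, ATIS sentence acc,
# SNIPS slot F1, SNIPS intent acc, SNIPS sentence acc.
COLUMNS = (
    "ATIS slot F1", "ATIS intent acc", "ATIS sentence acc",
    "SNIPS slot F1", "SNIPS intent acc", "SNIPS sentence acc",
)

TABLE4 = {
    "Wheel-GAT": (96.0, 97.5, 87.2, 94.8, 98.4, 87.4),
    "w/o intent -> slot": (95.5, 97.1, 86.9, 93.5, 98.0, 85.7),
    "w/o slot -> intent": (95.4, 96.8, 86.6, 93.9, 97.9, 85.8),
    "w/o head <-> tail": (95.6, 97.0, 86.9, 94.0, 97.6, 85.8),
    "w/o GAT": (95.0, 96.2, 84.3, 90.8, 96.7, 77.6),
}


def drops(variant: str) -> dict:
    """Absolute drop of each metric relative to the full Wheel-GAT model."""
    full = TABLE4["Wheel-GAT"]
    ablated = TABLE4[variant]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {c: round(f - a, 1) for c, f, a in zip(COLUMNS, full, ablated)}


for name in TABLE4:
    if name != "Wheel-GAT":
        print(name, drops(name))
```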

To verify the effectiveness of the attention mechanism, we remove the GAT and use GCN instead. For GCN, a graph convolution operation produces the normalized sum of the node features of the neighbors. The results show that, on ATIS and SNIPS respectively, the intent accuracy drops by 1.3% and 1.7%, the slot F1 drops by 1.0% and 4.0%, and the sentence accuracy drops by 2.9% and 9.8%. We attribute this to the fact that GAT weights neighbor features with a feature-dependent and structure-free normalization, in the style of attention.
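To make the contrast concrete, the sketch below aggregates the same neighbors in both styles. It is a simplified illustration under stated assumptions, not the model's actual layer: the GCN side uses a plain mean as a stand-in for a structure-based normalization, the GAT side follows the commonly used form (a LeakyReLU score over the concatenated target and neighbor features, followed by a softmax), the learned projection matrix is omitted, and the vector `a` and all feature values are made up.

```python
import math
from typing import List, Sequence

Vector = List[float]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gcn_aggregate(neighbors: Sequence[Vector]) -> Vector:
    """Static weights: every neighbour contributes 1/len(neighbors)."""
    k = len(neighbors)
    dim = len(neighbors[0])
    return [sum(h[d] for h in neighbors) / k for d in range(dim)]


def gat_weights(target: Vector, neighbors: Sequence[Vector],
                a: Vector) -> List[float]:
    """Feature-dependent weights: softmax over LeakyReLU(a . [target || h])."""
    scores = [
        leaky_relu(sum(w * x for w, x in zip(a, target + h)))
        for h in neighbors
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gat_aggregate(target: Vector, neighbors: Sequence[Vector],
                  a: Vector) -> Vector:
    alphas = gat_weights(target, neighbors, a)
    dim = len(neighbors[0])
    return [sum(al * h[d] for al, h in zip(alphas, neighbors))
            for d in range(dim)]


# Made-up 2-d features for one target node and three neighbours.
target = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
a = [0.0, 0.0, 2.0, -2.0]  # hypothetical attention vector

gcn_out = gcn_aggregate(neighbors)
alphas = gat_weights(target, neighbors, a)
gat_out = gat_aggregate(target, neighbors, a)
print(gcn_out, alphas, gat_out)
```

With the static mean, all three neighbors count equally. With the attention scores, the neighbor whose features match the scoring vector receives the largest weight, which is the behavior the ablation credits for the gap between GAT and GCN.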

4.5 Visualization of wheel-graph attention layer

In this section, to better understand what the wheel-graph attention structure has learned, we visualize the attention weights on the slot → intent edges and on each slot node, as shown in Figure 2.

Fig. 2

The central node is the intent token, and slot tokens are enclosed by *. The darker the color of an edge, the more relevant the two corresponding nodes are, and the more information is aggregated from the features of the source node.

Based on the utterance “play signe anderson chant music that is newest”, the intent “PlayMusic”, and the slot labels “O B-artist I-artist B-music_item O O O B-sort”, we can see in Figure 2a that the attention weights focus on the correct slots, which indicates that our wheel-graph attention layer learns to incorporate specific slot information into the intent node. In addition, Figure 2b shows that more specific intent token information is passed into the slot nodes, which achieves fine-grained intent information integration to guide token-level slot prediction. These observations suggest that our model has learned meaningful weight values. Therefore, the node information of the intent and the slots can be transmitted more effectively through the attention weights in our proposed wheel-graph attention interaction layer, which promotes the performance of both tasks at the same time. The graph structure also makes both tasks easier to interpret.
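As a rough illustration of how such weights can be inspected without a plotting library, the sketch below prints one bar per token. The helper `render_attention` is hypothetical, and the weight values are made-up placeholders chosen only to resemble the pattern described above. They are not the values behind Figure 2.

```python
from typing import List, Sequence


def render_attention(tokens: Sequence[str], weights: Sequence[float],
                     width: int = 40) -> List[str]:
    """Return one text line per token with a bar proportional to its weight."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    pad = max(len(t) for t in tokens)
    lines = []
    for tok, w in zip(tokens, weights):
        bar = "#" * int(round(w * width))
        lines.append(f"{tok.ljust(pad)} {w:.2f} {bar}")
    return lines


tokens = "play signe anderson chant music that is newest".split()
# Placeholder slot -> intent attention weights (sum to 1); NOT measured values.
weights = [0.05, 0.20, 0.20, 0.25, 0.05, 0.03, 0.02, 0.20]

lines = render_attention(tokens, weights)
for line in lines:
    print(line)
```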

4.6 Effect of BERT

In this section, we also experiment with replacing the embedding layer with a pre-trained BERT-based [20] model, which is fine-tuned to boost SLU task performance. All other components are kept the same as in our model.

From Table 5, it can be seen that the Stack-Propagation + BERT [16] joint approach achieves stronger overall results than the model without BERT, which indicates the effectiveness of a powerful pre-trained model in SLU tasks. We attribute this to the fact that pre-trained models can provide rich semantic features, which help improve the performance of SLU tasks. Wheel-GAT + BERT outperforms Stack-Propagation + BERT on both datasets. This is likely because we adopt explicit interaction between intent detection and slot filling. These results support the design of our proposed model.

Table 5
SLU experimental results of BERT-based models on the ATIS and SNIPS datasets

Model	ATIS Slot (F1)	ATIS Intent (Acc)	ATIS Sentence (Acc)	SNIPS Slot (F1)	SNIPS Intent (Acc)	SNIPS Sentence (Acc)
Wheel-GAT	96.0	97.5	87.2	94.8	98.4	87.4
BERT SLU [25]	96.1	97.5	88.2	97.0	98.6	92.8
Stack-Propagation + BERT [16]	96.1	97.5	88.6	97.0	99.0	92.9
Wheel-GAT + BERT	96.5	98.0	90.2	97.4	99.3	93.6

5 Conclusion and future work

In this paper, we apply a graph network to SLU tasks and propose a new wheel-graph attention network (Wheel-GAT) model, which provides a bidirectional interrelated mechanism for the intent detection and slot filling tasks. Compared with previous models, our model explicitly models both tasks: the intent node and the slot nodes are connected by explicit bidirectional interrelated edges. This graph propagation mechanism can better learn the weights of the associated edges. It provides fine-grained semantic information integration for token-level slot filling to predict slot labels correctly, and it provides specific slot information integration for sentence-level intent detection to predict the intent label correctly. Experimental results show that the bidirectional interrelated model helps the two tasks promote each other. Although this explicit way of constructing bidirectional interrelated relationships increases the complexity of the model, the performance of the whole model is improved compared with the implicit method.

We discuss the network details of the proposed model and compare it experimentally with several benchmark models to verify its effectiveness. Specifically, we first conduct experiments on the ATIS and SNIPS single-intent datasets. The experimental results show that our model outperforms all baseline methods on all evaluation metrics. Then, to further investigate the effectiveness of the Wheel-GAT components that relate intent detection and slot filling, we report the ablation test results in Table 4. Since our model uses an attention mechanism, we also visualize and analyze the attention weights on the slot → intent edges and on each slot node. Finally, we discuss and analyze the effect of adding a pre-trained BERT model to SLU tasks. The results show that the proposed model achieves state-of-the-art performance.

Our plans for future work can be summarized as follows: (1) We plan to increase the scale of our dataset and explore the efficacy of combining external knowledge with our proposed model. (2) We plan to collect multi-intent datasets and extend our proposed model to them to explore its adaptive capabilities. (3) We plan to introduce reinforcement learning on the basis of our proposed model, and use the reward mechanism of reinforcement learning to improve the performance of the model. (4) Intent detection and slot filling are usually used together, and a prediction error in either task has a great impact on subsequent dialogue state tracking (DST). How to improve the accuracy of the two tasks while ensuring a stable improvement of the overall evaluation metric (sentence accuracy) still needs to be further explored.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61876043, National Natural Science Foundation of Guangdong Province under Grant No. 2018A030313868, and Major Industry-University Research Project of Guangdong Province under Grant No. 2016B010108004. The corresponding author of this paper is Bi Zeng.

References

1. Zhang, Takanobu, Zhu, Huang, Zhu, Recent advances and challenges in task-oriented dialog systems, Science China Technological Sciences 2020, pp. 1–17.

2. Hakkani-Tür, Tür, Celikyilmaz, Chen Y.-N., Gao, Deng, Wang Y.-Y., Multi-domain joint semantic frame parsing using bi-directional rnn-lstm, in Interspeech 2016, 715–719.

3. Sarikaya, Hinton G.E. and Deoras, Application of deep belief networks for natural language understanding, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(4) (2014), 778–784.

4. Haffner, Tur and Wright J.H., Optimizing svms for complex call classification, in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03), IEEE 1 (2003) I–I.

5. Schapire R.E. and Singer, Boostexter: A boosting-based system for text categorization, Machine Learning 39(2–3) (2000), 135–168.

6. Lai, Xu, Liu, Zhao, Recurrent convolutional neural networks for text classification, in Twenty-ninth AAAI conference on artificial intelligence, 2015.

7. Raymond, Riccardi, Generative and discriminative algorithms for spoken language understanding, in Eighth Annual Conference of the International Speech Communication Association, 2007.

8. Yao, Peng, Zhang, Yu, Zweig, Shi, Spoken language understanding using long short-term memory neural networks, in 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 189–194, IEEE, 2014.

9. Guo, Tur, Yih W.-t., Zweig, Joint semantic utterance classification and slot filling with recursive neural networks, in 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 554–559, IEEE, 2014.

10. Liu and Lane, Attention-based recurrent neural network models for joint intent detection and slot filling, Interspeech 2016 (2016), 685–689.

11. Liu, Lane, Joint online spoken language understanding and language modeling with recurrent neural networks, in Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 22–30, 2016.

12. Zhang and Wang, A joint model of intent determination and slot filling for spoken language understanding, in IJCAI 16(2016) (2016), 2993–2999.

13. Goo C.-W., Gao, Hsu Y.-K., Huo C.-L., Chen T.-C., Hsu K.-W., Chen Y.-N., Slot-gated modeling for joint slot filling and intent prediction, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 753–757, 2018.

14. Chen, Zeng, Lou, A self-attention joint model for spoken language understanding in situational dialog applications, arXiv preprint arXiv:1905.11393, 2019.

15. Li, Qi, A self-attentive model with gate mechanism for spoken language understanding, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3824–3833, 2018.

16. Qin, Che, Li, Wen, Liu, A stack-propagation framework with token-level intent detection for spoken language understanding, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2078–2087, 2019.

17. Zhang, Li, Du, Fan, Philip S.Y., Joint slot filling and intent detection via capsule neural networks, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5259–5267, 2019.

18. Haihong, Niu, Chen, Song, A novel bidirectional interrelated model for joint intent detection and slot filling, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5467–5471, 2019.

19. Coucke, Saade, Ball, Bluche, Caulier, Leroy, Doumouro, Gisselbrecht, Caltagirone, Lavril, et al., Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces, arXiv preprint arXiv:1805.10190, 2018.

20. Devlin, Chang M.-W., Lee, Toutanova, Bert: Pretraining of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.

21. Ravuri, Stolcke, Recurrent neural network and lstm models for lexical utterance classification, in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

22. Deoras, Sarikaya, Deep belief network based semantic taggers for spoken language understanding, in Interspeech, pp. 2713–2717, 2013.

23. Shen, Jiang, Zhou, Pan, Long, Zhang, Disan: Directional self-attention network for rnn/cnn-free language understanding, 2018.

24. Tan, Wang, Xie, Chen, Shi, Deep semantic role labeling with self-attention, in AAAI, 2018.

25. Chen, Zhuo, Wang, Bert for joint intent classification and slot filling, arXiv preprint arXiv:1902.10909, 2019.

26. Wang, Shen, Jin, A bi-model based rnn semantic frame parsing model for intent detection and slot filling, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 309–314, 2018.

27. Hamilton, Ying, Leskovec, Inductive representation learning on large graphs, in Advances in neural information processing systems, pp. 1024–1034, 2017.

28. Hamaguchi, Oiwa, Shimbo, Matsumoto, Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach, in Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1802–1808, 2017.

29. Veličković, Cucurull, Casanova, Romero, Liò, Bengio, Graph attention networks, in International Conference on Learning Representations, 2018.

30. Huang, Ma, Li, Zhang, Houfeng, Text level graph neural network for text classification, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3435–3441, 2019.

31. Kipf T.N., Welling, Semi-supervised classification with graph convolutional networks, in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, 2017.

32. Krizhevsky, Sutskever, Hinton G.E., Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems, pp. 1097–1105, 2012.

33. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser, Polosukhin, Attention is all you need, in Advances in neural information processing systems, pp. 5998–6008, 2017.

34. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

35. Cho, van Merriënboer, Bahdanau, Bengio, On the properties of neural machine translation: Encoder-decoder approaches, in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111, 2014.

36. Scarselli, Gori, Tsoi A.C., Hagenbuchner and Monfardini, The graph neural network model, IEEE Transactions on Neural Networks 20(1) (2008), 61–80.

37. Maas A.L., Hannun A.Y. and Ng A.Y., Rectifier nonlinearities improve neural network acoustic models, in Proc. icml 30 (2013), 3.

38. Hemphill C.T., Godfrey J.J., Doddington G.R., The atis spoken language systems pilot corpus, in Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990, 1990.

39. Kingma D.P., Ba, Adam: A method for stochastic optimization, in ICLR (Poster), 2015.

Table 1
A sample example contains: intent label PlayMusic and slot label (BIO annotation format)

Sentence	play	techno	on	lastfm
Slots	O	B-genre	O	B-service
Intent	PlayMusic

Table 2
Datasets overview

Datasets	ATIS	SNIPS
# Training set	4,478	13,084
# Validation set	500	700
# Test set	893	700
# Intents	21	7
# Slots	120	72
Vocabulary Size	722	11,241
Avg. Length	11.28	9.05