Multiway dynamic mask attention networks for natural language inference

Abstract

Attention mechanisms are widely used in NLP tasks and perform strongly at modeling local and global dependencies. The directional self-attention network (DiSAN) is competitive on various datasets, but it does not consider the reverse information of a sentence. In this paper, we propose the Multiway Dynamic Mask Attention Network (MDMAN). The model has two modules: a dynamic mask selector and a multi-attention encoder. The dynamic mask selector uses reinforcement learning to choose high-quality reverse information and feeds it to the multi-attention encoder. The multi-attention encoder uses four attention functions to match the words of the same sentence at different token levels, then combines the information from all functions to obtain the final representation. Experiments on two publicly available NLI datasets show that MDMAN achieves a significant improvement over DiSAN.

Keywords

Natural language processing, attention mechanism, reinforcement learning, natural language inference

1. Introduction

Natural language inference (NLI), also known as recognizing textual entailment (RTE), plays a significant role in natural language processing. Given a pair of sentences (premise and hypothesis), the goal of NLI is to determine whether the hypothesis can be reasonably inferred from the premise [1]. There are three types of relation in NLI: Entailment (the hypothesis can be inferred), Contradiction (the hypothesis cannot be true), and Neutral (irrelevant). A few examples are shown in Table 1. Recently, large annotated datasets such as SNLI [2] and MultiNLI [3] have made it possible to apply deep learning methods to the NLI task. These methods can be roughly divided into two groups. The first group consists of sentence encoding-based models, in which each sentence is encoded into a fixed-size vector independently and the two vectors are then used to predict the degree of matching; examples are the Bi-LSTM max-pooling network of Conneau et al. [4] and the Stack-augmented Parser-Interpreter Neural Network (SPINN) of Bowman et al. [5]. The second group uses interactive features between the two sentences to model their relation directly; examples are the densely-connected recurrent and co-attentive network (DRCN) of Kim et al. and the enhanced sequential inference model (ESIM) of Chen et al.

Recent work has found that models with attention mechanisms achieve state-of-the-art performance in natural language inference [8], question answering [9], and sentiment analysis. Recently, Tay et al. proposed a multicast attention network for natural language processing, in which the attention mechanisms are treated as feature extractors [13]. The model uses multiple attention mechanisms to extract word-level features at different granularities and can therefore obtain better hidden features of a sentence.

Table 1
Examples of relations between a premise and a hypothesis (E: entailment, C: contradiction, N: neutral, P: premise, H: hypothesis)

P	Two men on bicycles competing in a race.
H	People are riding bikes.	E
	Men are riding bicycles on the streets.	C
	A few people are catching fish.	N

In parallel, some neural networks composed only of attention, especially self-attention, outperform traditional convolutional [10] or recurrent neural networks [11] on NLP tasks thanks to their highly parallelizable computation. Examples are the Transformer [12] and DiSAN [15], which further demonstrate the effectiveness of attention mechanisms in capturing contextual dependencies. Recently, Shen et al. presented an attention-based model that gives strong results on various tasks [15]; in particular, it achieved a state-of-the-art result on SNLI. In this model, the temporal order of a sentence is encoded by a forward mask and a backward mask. However, the reverse information of the sentence is not considered at all when the model encodes the sentence with a directional mask, which inevitably leads to a loss of information. Moreover, the model simply concatenates the sentence information from multiple attentions. The resulting high-dimensional vector bloats the subsequent layers and increases their parameter cost.

In this paper, to tackle these limitations, we propose multiway dynamic mask attention networks consisting of two modules: a dynamic mask selector and a multi-attention encoder. This design brings an additional problem: we have no explicit supervision for the dynamic mask selector, and the mask it produces is a discrete variable, which leads to a non-differentiable objective function. The dynamic mask selector has two properties. First, it works by trial-and-error search: it adjusts the mask and obtains feedback (a reward) for the selected mask. Second, the feedback is delayed: it is only available from the multi-attention encoder once the mask selection process is complete. We therefore address this challenge by casting the dynamic mask selection task as a reinforcement learning problem. The dynamic mask selector uses the current sentence features to adjust the directional mask, which allows the multi-attention encoder to obtain effective reverse information when it uses masks to encode temporal order. Consider the example in Table 1. The premise "Two men on bicycles competing in a race" entails the hypothesis "People are riding bikes" but contradicts the hypothesis "Men are riding bicycles on the streets". The vectors of these sentences are very similar, which makes it difficult to distinguish between them with a single attention mechanism. To address this issue, we combine two different self-attention functions with the dynamic masks generated by the dynamic mask selector, giving four different attentions that capture information at different token levels. We then use an attention fusion layer, which avoids the parameter cost that a larger number of attentions would otherwise impose on the subsequent network layers while retaining useful information.

Our contributions in this work are as follows:

• We propose a multiway dynamic mask attention network, which consists of a dynamic mask selector and a multi-attention encoder. This formalization allows our model to utilize different word-level features to obtain better sentence information.

• We formulate the dynamic mask selector as a reinforcement learning problem. The model performs the selection of the mask without ground truth labels but just with a weak supervision signal from the multi-attention encoder.

2. Related work

Recently, the development of large-scale annotated datasets [2] and deep learning algorithms has led to significant advances in modeling sentences. Researchers have studied essential NLP representation techniques on SNLI, such as attention [16], memory [17], and grammatical structures [18]. Conventional models use a sentence encoder to encode each sentence into a sentence representation and then determine the relationship from these independent representations with a neural network classifier [19]. These sentence-encoding-based methods make it easy to extract sentence representations, which can be transferred to other natural language tasks [20]. Wang and Jiang proposed match-LSTM for natural language inference, which matches the current word in the hypothesis with the representation of the premise through the attention mechanism [21]. Wang et al. proposed BiMPM, which achieves strong results in recognizing textual entailment and answer sentence selection by matching two sentences bilaterally with attention mechanisms from multiple perspectives [22]. Tan et al. used the representations obtained by different attentions to enhance the feature representation, avoiding the information loss caused by simply using max-pooling and mean-pooling [23].

Attention mechanisms have recently been applied successfully in natural language processing. Seo et al. proposed the Bi-Directional Attention Flow (BIDAF) model [24], which uses a bi-directional attention mechanism to obtain a context representation. Vaswani et al. proposed a multi-head attention model, which uses multi-head dot-product attention to represent each word in the sentence. Shen et al. constructed a fully attention-based sentence encoder [15]. They proposed a multi-dimensional self-attention mechanism that computes attention for each feature dimension and adds a directional mask to the attention logits, so that words in a specific direction of the sentence are masked out. A fusion gate then determines the extent to which the attention result is reflected in the output. In our study, we build a model on the multi-dimensional self-attention mechanism that uses dynamic masks to obtain reverse information and uses multiple attentions to focus on different features at the sentence level.

3. Background

Self-attention is a special case of the attention mechanism that models the dependencies between tokens of the same sequence. Recently, various self-attention mechanisms have been developed and applied to different tasks. In this paper, we use directional self-attention [15] and directional dot attention (dot attention [25] with a mask). In both attentions, the dependency of $x_{i}$ on another token $x_{j}$ is computed by a score function $f\left({x_{i},x_{j}}\right)$ . A softmax function then transforms the scores $f\left({x_{i},x_{j}}\right)$ into a probability distribution $p$ . The process is summarized by the following equations.

$\displaystyle{f_{1}\left({x_{i},x_{j}}\right)=\alpha\tanh\left({\left[{W^{1}x_{i}+W^{2}x_{j}+b}\right]/\alpha}\right)+M_{ij}}$
$\displaystyle{f_{2}\left({x_{i},x_{j}}\right)=v^{t}\tanh\left({W\left({x_{i}\odot x_{j}}\right)}\right)+M_{ij}}$ (1)
$\displaystyle{p_{i}^{j}=\textit{softmax}\left({f\left({x_{i},x_{j}}\right)}\right)}$

where $f_{1}\left({x_{i},x_{j}}\right)$ is the directional self-attention, $f_{2}\left({x_{i},x_{j}}\right)$ is the directional dot attention, and $M_{ij}\in\left\{{-\infty,0}\right\}$ is the mask from the current word $x_{i}$ to another word $x_{j}$ . To see why a mask can encode directional information, note that when $M_{ij}=-\infty$ we have $f\left({x_{i},x_{j}}\right)=-\infty$ , so the result after the softmax is zero, that is, $p_{i}^{j}=0$ , indicating that there is no attention of $x_{j}$ to $x_{i}$ on the feature. Otherwise, $p_{i}^{j}>0$ , which means that attention of $x_{j}$ to $x_{i}$ exists on the feature.

The final output of the attention, S, is the expectation of sampling a token according to this categorical distribution, i.e.,

$\displaystyle\mbox{S}=\mathop{\sum}\limits_{i=1}^{n}p_{i}^{j}x_{i}$ (2)
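As an illustration of Eqs (1) and (2), the following Python sketch adds a mask to a matrix of precomputed scores, normalizes over $i$ for each $j$, and forms the weighted sum. It is a simplified sketch, not the authors' implementation: it assumes one scalar score per token pair rather than one score per feature, and it replaces $-\infty$ with a large negative constant (`NEG_INF`, an assumption) so that a fully masked column yields a uniform distribution rather than an undefined one. The function name and the toy inputs are hypothetical.

```python
import math

# Assumption: a large finite value stands in for -infinity to keep the softmax defined.
NEG_INF = -1e30


def masked_attention(scores, mask, x):
    """Return s with s[j] = sum_i p[i][j] * x[i], where p is a softmax over i.

    scores[i][j] is the unmasked score f(x_i, x_j); mask[i][j] is 0 or NEG_INF.
    """
    n, dim = len(x), len(x[0])
    out = []
    for j in range(n):
        logits = [scores[i][j] + mask[i][j] for i in range(n)]
        top = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        probs = [w / total for w in weights]
        out.append([sum(probs[i] * x[i][d] for i in range(n)) for d in range(dim)])
    return out


# Toy example: three tokens, all scores equal, forward mask (open where i < j).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [[0.0] * 3 for _ in range(3)]
forward = [[0.0 if i < j else NEG_INF for j in range(3)] for i in range(3)]
s = masked_attention(scores, forward, x)
```

With equal scores, the output for the third token is the average of the first two token vectors, since only positions with $i<j$ are unmasked.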

4. Proposed model

The overall architecture of our model is shown in Fig. 1. Our model consists of two key modules: a dynamic mask selector and a multi-attention encoder. The dynamic mask selector dynamically generates an effective mask based on the current sentence features to improve the ability of the multi-attention encoder to obtain useful information. The multi-attention encoder is mainly composed of two independent components, a sentence encoder and a classifier. In the multi-attention encoder, we follow the conventional neural network structure for natural language inference [26]. First, the input sentences, premise and hypothesis, are each encoded as a vector, u and v respectively, by the sentence encoder without using any information from the other sentence. The classifier then generates a probability for each of the three classes.

Figure 1.

Overall architecture. The dynamic mask selector generates an effective mask according to a policy function, and the mask is then used to train a better multi-attention encoder. The dynamic mask selector updates its parameters with a reward calculated from the multi-attention encoder.

Figure 2.

The overall process of the dynamic mask: (1) initialize the forward and backward masks; (2) the mask generated by the dynamic mask selector based on the sentence features, and (3) the final forward and backward masks.

4.1 Dynamic mask selector

The Directional Self-Attention Network uses a forward mask and a backward mask to encode the temporal order of sentences based solely on the attention mechanism. The model achieved very strong results. However, its masks are fixed: the model can only encode sentences in a specific direction and cannot dynamically generate a more effective mask based on the sentence features. In this paper, the dynamic mask selector uses the current sentence features to dynamically select important reverse information on top of the directional mask.

For the training of the dynamic mask selector, there are no ground-truth labels to indicate whether a token should be selected, and the mask is a discrete random variable, which leads to a non-differentiable objective function. We therefore use the policy gradient method of reinforcement learning. The dynamic mask selector follows a policy to decide on an action (selecting the token or not) at each state (consisting of ${x}_{i}$ and the current sentence), generates a corresponding reverse mask for the sentence from the actions, and then receives a reward from the multi-attention encoder. The dynamic mask selector uses the reward to update the action function. Optimizing the action function lets it generate effective reverse masks.

In the dynamic mask selector, the forward mask and the backward mask are first initialized according to Eqs (3) and (4). The mask after initialization is shown in Fig. 2(1).

$\displaystyle M_{ij}^{fw}=\left\{{{\begin{array}[]{ll}{0,}&{i<j}\\ {-\infty,}&\textit{otherwise}\\ \end{array}}}\right.$ (3)
$\displaystyle M_{ij}^{bw}=\left\{{{\begin{array}[]{ll}{0,}&{i>j}\\ {-\infty,}&\textit{otherwise}\\ \end{array}}}\right.$ (4)
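The initialization in Eqs (3) and (4) can be sketched in a few lines of Python. This is only an illustration with a hypothetical function name; masks are stored as nested lists with 0-based indices, which does not change the conditions $i<j$ and $i>j$.

```python
NEG_INF = float("-inf")


def directional_masks(n):
    """Build the n x n forward and backward masks of Eqs (3) and (4)."""
    # Forward mask: 0 where i < j, -inf otherwise (the diagonal is masked too).
    fw = [[0.0 if i < j else NEG_INF for j in range(n)] for i in range(n)]
    # Backward mask: 0 where i > j, -inf otherwise.
    bw = [[0.0 if i > j else NEG_INF for j in range(n)] for i in range(n)]
    return fw, bw


fw, bw = directional_masks(3)
```

For a three-token sentence the forward mask is open only above the diagonal and the backward mask only below it, so each mask exposes exactly one direction.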

The state $s_{i}$ is the input of the dynamic mask selector. We represent the state $s_{i}$ as a continuous real-valued vector $F\left({s_{i}}\right)$ , which encodes the following information: 1) the vector representation of the i-th word of the current sentence, and 2) the vector representation of the current sentence, which is the average of the vector representations of all its words.

$\displaystyle F\left({s_{i}}\right)=\left[{x_{i};\textit{pooling}\left(x\right)}\right]$ (5)

In the current sentence $x$ , each word $x_{i}$ has an associated action function ${a}_{i}$ to indicate whether the word is selected to generate a reverse mask. We adopt a logistic function as the policy function:

$\displaystyle a_{i}=\pi\left({a_{i}{|}x}\right)=\sigma\left({W{\ast}F\left({s_{i}}\right)+b}\right)$ (6)

where $F\left({s_{i}}\right)$ is the state feature vector, $W$ , $b$ are the learnable parameters, and ${\sigma}\left(\cdot\right)$ is the sigmoid function.

Then, we use the action function $a$ to generate two equal-length sequence vectors $r^{j}=\left[{r_{1}^{j},\ldots,r_{n}^{j}}\right]$ , $j=\left({h,d}\right)$ , where $r_{i}=1$ means that $x_{i}$ is selected and $r_{i}=0$ means that $x_{i}$ is discarded. In our experiments, we found that a probability threshold of 0.7 works well. We then use $r^{h}$ and $r^{d}$ to generate an $n\times n$ mask, i.e.,

$\displaystyle r_{i}^{j}=\left\{{{\begin{array}[]{ll}{1,}&{a_{i}^{j}>0.7}\\ {0,}&{a_{i}^{j}\leq 0.7}\\ \end{array}}}\right.$ (7)
$\displaystyle M_{ij}^{r}=\left\{{{\begin{array}[]{ll}{0,}&{r_{i}^{h}=r_{i}^{d}=1\ \&\ i\neq j}\\ {-\infty,}&\textit{otherwise}\\ \end{array}}}\right.$ (8)

Finally, the resulting mask $M_{ij}^{r}$ is applied to the initialized mask to generate the final forward mask and backward mask. The overall process is shown in Fig. 2.
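The selection steps of Eqs (5) to (8) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: all function names are hypothetical, $W$ is taken to be a single weight vector so the policy outputs one probability per token, Eq. (8) is implemented exactly as written, and "applying" the reverse mask to a directional mask is assumed to mean opening every position that the reverse mask opens (an elementwise maximum), which the text does not spell out.

```python
import math

NEG_INF = float("-inf")
THRESHOLD = 0.7  # selection threshold reported in the text


def state_features(x, i):
    """Eq. (5): concatenate token i with the mean-pooled sentence vector."""
    n = len(x)
    mean = [sum(tok[d] for tok in x) / n for d in range(len(x[0]))]
    return x[i] + mean


def policy(w, b, state):
    """Eq. (6): logistic policy giving the selection probability for one state."""
    z = sum(wk * sk for wk, sk in zip(w, state)) + b
    return 1.0 / (1.0 + math.exp(-z))


def select(actions):
    """Eq. (7): threshold the action probabilities into 0/1 selections."""
    return [1 if a > THRESHOLD else 0 for a in actions]


def reverse_mask(r_h, r_d):
    """Eq. (8) as written: open (i, j) when r_h[i] = r_d[i] = 1 and i != j."""
    n = len(r_h)
    return [[0.0 if (r_h[i] == 1 and r_d[i] == 1 and i != j) else NEG_INF
             for j in range(n)] for i in range(n)]


def apply_reverse_mask(directional, reverse):
    """Assumption: a position is open if either mask opens it."""
    return [[max(d, r) for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(directional, reverse)]


# Toy example with hand-picked action probabilities for a 3-token sentence.
r_h = select([0.1, 0.3, 0.9])
r_d = select([0.2, 0.8, 0.75])
fw = [[0.0 if i < j else NEG_INF for j in range(3)] for i in range(3)]
final_fw = apply_reverse_mask(fw, reverse_mask(r_h, r_d))
```

In this toy case only the third token is selected by both sequences, so the final forward mask keeps its original open positions and additionally opens the third row towards the two earlier tokens.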

4.2 Multi-attention encoder

Sentence Encoder. The sentence encoder is mainly composed of the word embedding layer, the dynamic mask multi-attention layer, the attention fusion layer, and the classifier. The overall structure is shown in Fig. 3.

Figure 3.

The overall structure of the sentence encoder.

Word Embedding Layer. In the word embedding layer, the input sentences (premise and hypothesis) are transformed into dense representations $t=\left[{t_{1},t_{2},\ldots,t_{n}}\right]$ using pre-trained GloVe vectors [27]. We then use a fully connected layer to produce a sequence of hidden states $x=\left[{x_{1},x_{2},\ldots,x_{n}}\right]$ ,

$\displaystyle x={\sigma}\left({W^{\left(s\right)}t+b^{\left(s\right)}}\right)$ (9)

where $t\in R^{d_{e}\times n}$ , $x\in R^{d_{s}\times n}$ , $W^{\left(s\right)}$ and $b^{\left(s\right)}$ are the learnable parameters, ${\sigma}\left(\cdot\right)$ is an activation function (ELU).

Dynamic Mask Multi-Attention Layer. In the dynamic mask multi-attention layer, we design multiple attention functions to obtain information at different token levels. We combine self-attention and dot attention with a dynamic mask to generate four different attention mechanisms: forward dynamic self-attention, backward dynamic self-attention, forward dynamic dot attention, and backward dynamic dot attention. Each uses its own scoring function to compute a score for each feature of each token. We first generate vector representations $s^{k}$ for all elements of the input sequence with the four kinds of attention, and then use the fusion gate to generate the final context-aware vector representation $u^{k}$ for the sentence.

$\displaystyle u_{i}^{k}=F_{k}\left({x_{i},x_{j},W_{k}}\right)$ (10)

where $F_{k}\left(\cdot\right)$ is the $k$-th attention mechanism together with its multi-dimensional fusion gate, and $k=(cf,cb,df,db)$ denotes forward dynamic self-attention, backward dynamic self-attention, forward dynamic dot attention, and backward dynamic dot attention, respectively.

Forward dynamic self-attention:

$\displaystyle{f_{cf}\left({x_{i},x_{j}}\right)=\alpha\tanh\left({\left[{W_{cf}^{1}x_{i}+W_{cf}^{2}x_{j}+b_{cf}}\right]/\alpha}\right)+M_{ij}^{fw}}$
$\displaystyle{p_{i}^{j}=\exp\left({f_{cf}\left({x_{i},x_{j}}\right)}\right)\left/\mathop{\sum}\limits_{i^{\prime}=1}^{N}\exp\left({f_{cf}\left({x_{i^{\prime}},x_{j}}\right)}\right)\right.}$ (11)
$\displaystyle{s_{j}^{cf}=\mathop{\sum}\limits_{i=1}^{N}p_{i}^{j}x_{i}}$

Backward dynamic self-attention:

$\displaystyle{f_{cb}\left({x_{i},x_{j}}\right)=\alpha\tanh\left({\left[{W_{cb}^{1}x_{i}+W_{cb}^{2}x_{j}+b_{cb}}\right]/\alpha}\right)+M_{ij}^{bw}}$
$\displaystyle{p_{i}^{j}=\exp\left({f_{cb}\left({x_{i},x_{j}}\right)}\right)\left/\mathop{\sum}\limits_{i^{\prime}=1}^{N}\exp\left({f_{cb}\left({x_{i^{\prime}},x_{j}}\right)}\right)\right.}$ (12)
$\displaystyle{s_{j}^{cb}=\mathop{\sum}\limits_{i=1}^{N}p_{i}^{j}x_{i}}$

Forward dynamic dot attention:

$\displaystyle{f_{df}\left({x_{i},x_{j}}\right)=v^{t}\tanh\left({W_{df}\left({x_{i}\odot x_{j}}\right)}\right)+M_{ij}^{fw}}$
$\displaystyle{p_{i}^{j}=\exp\left({f_{df}\left({x_{i},x_{j}}\right)}\right)\left/\mathop{\sum}\limits_{i^{\prime}=1}^{N}\exp\left({f_{df}\left({x_{i^{\prime}},x_{j}}\right)}\right)\right.}$ (13)
$\displaystyle{s_{j}^{df}=\mathop{\sum}\limits_{i=1}^{N}p_{i}^{j}x_{i}}$

Backward dynamic dot attention:

$\displaystyle{f_{db}\left({x_{i},x_{j}}\right)=v^{t}\tanh\left({W_{db}\left({x_{i}\odot x_{j}}\right)}\right)+M_{ij}^{bw}}$
$\displaystyle{p_{i}^{j}=\exp\left({f_{db}\left({x_{i},x_{j}}\right)}\right)\left/\mathop{\sum}\limits_{i^{\prime}=1}^{N}\exp\left({f_{db}\left({x_{i^{\prime}},x_{j}}\right)}\right)\right.}$ (14)
$\displaystyle{s_{j}^{db}=\mathop{\sum}\limits_{i=1}^{N}p_{i}^{j}x_{i}}$

where $\alpha$ is a scalar, which we always set to $\alpha=5$ to obtain stable output, and $W_{k}$ and $b_{k}$ are the learnable parameters.

Then, we combine the attention-generated s and the input sentence vector $x$ to generate the context aware vector representation by using multi-dimensional fusion gates.

$\displaystyle{G=\textit{sigmoid}\left({W^{1}s^{k}+W^{2}x+b}\right)}$ (15)
$\displaystyle{u^{k}=Gx+\left({1-G}\right)s^{k}}$

where, $W^{1}$ and $W^{2}$ are the learnable parameters, $k=(cf,cb,df,db)$ .
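To make the gating in Eq. (15) concrete, here is a small Python sketch for a single token. It is an illustration only: the helper names are hypothetical, the gate is applied elementwise (one gate value per feature dimension, which is how we read "multi-dimensional"), and plain lists stand in for tensors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def fusion_gate(W1, W2, b, s, x):
    """Eq. (15): G = sigmoid(W1 s + W2 x + b), u = G * x + (1 - G) * s."""
    g = [sigmoid(p + q + c) for p, q, c in zip(matvec(W1, s), matvec(W2, x), b)]
    u = [gi * xi + (1.0 - gi) * si for gi, xi, si in zip(g, x, s)]
    return g, u


# Toy example: zero weights give G = 0.5, so u is the midpoint of x and s.
zeros = [[0.0, 0.0], [0.0, 0.0]]
g, u = fusion_gate(zeros, zeros, [0.0, 0.0], s=[0.0, 2.0], x=[2.0, 4.0])
```

A gate value near 1 keeps the input $x$, while a value near 0 keeps the attention output $s^{k}$.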

Attention Fusion Layer. In the attention fusion layer, we aggregate the matching information from Dynamic Mask Multi-Attention functions to generate the resulting representation.

$\displaystyle{c_{k}=v^{t}\tanh\left({Wu_{t}^{k}+b}\right),\left({k=cf,cb,df,db}\right)}$
$\displaystyle{p_{k}^{c}=\exp\left({c_{k}}\right)\left/\mathop{\sum}\limits_{k^{\prime}=\left({cf,cb,df,db}\right)}\exp\left({c_{k^{\prime}}}\right)\right.}$ (16)
$\displaystyle{s_{i}^{f}=\mathop{\sum}\limits_{k}p_{k}^{c}u_{i}^{k}}$
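The softmax-weighted combination in Eq. (16) can be sketched as below. This is a hedged illustration rather than the authors' implementation: the scores $c_{k}$ are passed in directly instead of being computed from $v$, $W$ and $b$, one scalar score per attention type is assumed, and the function and variable names are hypothetical.

```python
import math

KINDS = ("cf", "cb", "df", "db")


def fuse(scores, reps):
    """Weight the four attention outputs by a softmax over their scores.

    scores[k] is the scalar c_k; reps[k] is the vector u_i^k for one token.
    """
    top = max(scores[k] for k in KINDS)  # stabilise the softmax
    exps = {k: math.exp(scores[k] - top) for k in KINDS}
    total = sum(exps.values())
    probs = {k: exps[k] / total for k in KINDS}
    dim = len(reps[KINDS[0]])
    fused = [sum(probs[k] * reps[k][d] for k in KINDS) for d in range(dim)]
    return probs, fused


# Toy example: equal scores give each attention a weight of 0.25.
scores = {"cf": 0.0, "cb": 0.0, "df": 0.0, "db": 0.0}
reps = {"cf": [4.0, 0.0], "cb": [0.0, 4.0], "df": [4.0, 4.0], "db": [0.0, 0.0]}
probs, fused = fuse(scores, reps)
```

The fused vector has the same dimensionality as a single attention output, whereas concatenating the four outputs would quadruple it.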

4.3 Classifier

We follow the standard procedure [5]. Given the output encodings u (premise) and v (hypothesis), the relation between the two sentences is represented by the concatenation of u, v, u − v, and u $\odot$ v. This representation is fed into a 300D ReLU layer and then into a 3-unit output layer with softmax, which computes a probability distribution over the three relation types.
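The feature construction described above can be sketched in Python as follows. The sketch covers only the concatenation and the final softmax; the 300D ReLU layer is omitted, and the function names are hypothetical.

```python
import math


def relation_features(u, v):
    """Concatenate u, v, u - v and the elementwise product of u and v."""
    diff = [a - b for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod


def softmax(logits):
    """Turn the 3 output logits into a probability distribution."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


features = relation_features([1.0, 2.0], [3.0, 4.0])
class_probs = softmax([0.0, 0.0, 0.0])
```

For encodings of dimension $d$, the feature vector has dimension $4d$.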

5. Experiments

In this section, we conduct experiments on the NLI task with two datasets, SNLI and MultiNLI. Experimental results show that our model outperforms our baseline and other competing approaches.

5.1 Training details

We implement our model with the TensorFlow [28] framework and train our model on a single Nvidia TITAN X. We use the 300D GloVe 6B pre-trained vectors to initialize the word embeddings without any fine-tuning. Out-of-vocabulary words are randomly initialized from a uniform distribution. All weight matrices are initialized with Glorot initialization [29], and the biases are initialized to 0. We set the dropout to 0.73, the initial learning rate to 0.5, the decay rate to 0.999, and the batch size to 64. We use Adam as the optimizer. The number of hidden units $d_{s}$ is set to 300. Because the model cannot provide accurate reward feedback to the dynamic mask selector at the beginning of training, the mask is not updated at first: mask updates start at step 20000, after which the mask is updated every 200 steps.
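The mask update schedule can be expressed as a small predicate. This is a sketch with hypothetical names; it assumes that the first update happens exactly at step 20000 and that later updates are aligned to that step, which the text does not state explicitly.

```python
MASK_START_STEP = 20000   # first step at which the mask is updated
MASK_UPDATE_EVERY = 200   # update interval after the start step


def should_update_mask(step):
    """Return True on training steps at which the dynamic mask is refreshed."""
    if step < MASK_START_STEP:
        return False
    return (step - MASK_START_STEP) % MASK_UPDATE_EVERY == 0
```

Under these assumptions the mask is refreshed at steps 20000, 20200, 20400, and so on.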

In the classification setting, we use cross-entropy loss plus L2 regularization penalty as the loss, i.e.,

$\displaystyle L\left({\theta_{s}}\right)=-\sum_{i}{y}^{\prime}_{i}\log(y_{i})+\lambda||\theta_{s}||^{2}$ (17)

where, $y_{i}$ is the prediction result and $y^{\prime}_{i}$ is the ground truth, $\theta_{s}$ contains parameters for dynamic mask multi-attention layer and classification layer, $\lambda$ is the penalty weight and its value is 0.01.
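As a worked example of Eq. (17), the sketch below computes the loss for one training example. It is illustrative only: the function name is hypothetical, the parameters are flattened into a single list, and terms with a zero ground-truth entry are skipped because they contribute nothing to the cross-entropy.

```python
import math


def classification_loss(y_true, y_pred, params, lam=0.01):
    """Cross-entropy plus an L2 penalty, following Eq. (17)."""
    cross_entropy = -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
    l2_penalty = sum(w * w for w in params)
    return cross_entropy + lam * l2_penalty


# One-hot label for the second class, predicted with probability 0.5.
loss = classification_loss([0, 1, 0], [0.2, 0.5, 0.3], params=[1.0, -2.0])
```

Here the cross-entropy term is $\log 2$ and the penalty term is $0.01\times 5=0.05$.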

In the dynamic mask selector, there is no label indicating whether a token should be selected, so we use the policy gradient method to train it. The overall goal of the dynamic mask selector is to obtain an effective reverse mask while selecting as few reverse positions as possible. We obtain a loss value from the multi-attention encoder, which serves as the delayed reward for training the dynamic mask selector, and we include in the reward $R$ a penalty that limits the number of selected reverse positions, i.e.,

$\displaystyle R=\mathop{\sum}\limits_{i}{y}^{\prime}_{i}\log(y_{i})-\beta\left(\mathop{\sum}\limits_{i}r_{i}\textit{len}(r)\right)$ (18)

where $\beta$ is the penalty weight, set to 0.01 in our experiments. We aim to maximize the expected total reward. More formally, our objective function is defined as

$\displaystyle L\left({\theta_{r}}\right)=\mathop{\sum}\nolimits\log\pi(a|x,\theta)R$ (19)
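The objective in Eq. (19) has the form of a REINFORCE-style policy gradient objective. The sketch below evaluates it for one sentence under two assumptions that the text does not confirm: each token's 0/1 selection is scored with a Bernoulli likelihood under the policy probability of Eq. (6), and the reward $R$ is supplied as a precomputed number. All names are hypothetical.

```python
import math


def log_prob_of_selection(probs, selected):
    """Sum of log pi(r_i | x), assuming independent Bernoulli selections."""
    return sum(math.log(p) if r == 1 else math.log(1.0 - p)
               for p, r in zip(probs, selected))


def selector_objective(probs, selected, reward):
    """Eq. (19): log-probability of the chosen actions weighted by the reward."""
    return log_prob_of_selection(probs, selected) * reward


# Toy example: two tokens, one selected, with a (negative) delayed reward.
objective = selector_objective([0.5, 0.5], [1, 0], reward=-2.0)
```

Maximizing this quantity by gradient ascent on the policy parameters raises the probability of selections that led to a high reward and lowers it for selections that led to a low reward.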

Table 2

Experimental results for different methods on SNLI

Model name	$\left\|\theta\right\|$	Train acc (%)	Test acc (%)
300D NSE encoders [31]	3.0 m	86.2	84.6
600D deep gated attn. [32]	11 m	90.5	85.5
300D DiSAN [15]	2.4 m	91.1	85.6
1200D distance-based self-attention network [33]	4.7 m	89.6	86.3
2400D multiple DSA [8]	7.0 m	89.0	87.4
DISAN with dynamic mask	3.4 m	88.7	85.9
Directional multiway attention network	4.5 m	89.2	86.1
Our multiway dynamic mask attention network	5.0 m	90.3	86.5

5.2 SNLI results

For thorough comparison, in addition to the neural networks proposed in the previous NLI work, we implement two extra neural network baselines to compare with MDMAN. These neural networks help us to analyze the improvement contributed by each part of MDMAN.

Dynamic Mask self-attention Network: DISAN with Dynamic Mask.

Directional Multiway attention Network: Applying a directional mask on a multi-attention mechanism.

Table 2 shows the results on SNLI. Our model outperforms our baseline model DiSAN, raising test accuracy from 85.6% to 86.5%. Comparing the two additional baselines shows that adding dynamic masks to DiSAN improves accuracy (85.9% vs. 85.6%), and that combining multiple attentions with dynamic masks improves it further (86.5%).

5.3 MultiNLI results

We applied our model to the MultiNLI dataset without additional parameter tuning and compared it with the baseline model and other models. The results are shown in Table 3. We obtained our matched-test and mismatched-test accuracies by submitting our test predictions to the Kaggle open evaluation platform. According to Table 3, our matched-test and mismatched-test accuracies are 3.7 and 1.8 percentage points higher than those of the Directional Self-Attention Network (74.7% vs. 71.0% and 73.2% vs. 71.4%), a clear improvement over our baseline model.

Table 3
Experimental results for different methods on MultiNLI

Model name	Matched test acc (%)	Mismatched test acc (%)
CBOW [3]	64.8	64.5
DiSAN [15]	71.0	71.4
BiLSTM $+$ inner-attention [34]	72.1	72.1
Gated BiLSTM [32]	73.5	73.6
Distance-based self-attention network [33]	74.1	72.9
SS BiLSTM [26]	74.6	73.6
Our multiway dynamic mask attention network	74.7	73.2

5.4 Case study

To take a closer look at the dependencies within a sentence that MDMAN captures, we visualize the attention probabilities or alignment scores as heatmaps. To compare with our baseline model, we focus on the dynamic forward and backward self-attention (Eqs (11) and (12)) and the forward/backward fusion gates F (Eq. (15)). Note that we only examine dependencies at the word level.

We select a sentence from SNLI test set as an example and visualize its result value. The sentence is “A girl playing a violin along with a group of people”.

Figure 4.

Attention probability in dynamic forward/backward self-attention.

As shown in Fig. 4, as in directional self-attention, semantically important words such as nouns and verbs usually receive a lot of attention in dynamic mask attention, whereas stop words (am, is, are, etc.) receive less. Globally important words, e.g., girl and violin, receive large attention from the other words. Unlike directional self-attention, dynamic self-attention also attends to semantically important words in the reverse direction and usually ignores semantically unimportant ones, such as stop words. As a result, dynamic attention can obtain important reverse information.

Fig. 5 shows the values of the fusion gate F (Eq. (15)). This fusion gate combines the input h and the output of dynamic masked self-attention to generate a sentence vector. If the value of F is very small, the gate tends to select the output of dynamic masked self-attention instead of the input h. The gate values for words that carry little meaning, such as stop words, tend to be very small: because such words cannot contribute important information themselves, the gate adds their semantic relations with other words to make their representations more meaningful. Compared with the fusion gate of directional self-attention, our gate values are smaller for such words and larger for semantically important words, which indicates that our model makes better use of the context and may help it understand the sentence better.

Figure 5.

(a) and (b) are the fusion gates F in dynamic forward/backward self-attention; (c) and (d) are the fusion gates F in directional forward/backward self-attention.

6. Conclusion

In this paper, we propose a multiway dynamic mask model. Building on DiSAN, we use dynamic masks to obtain important reverse information and use several attention functions to attend to sentences at different word-level granularities. By using an attention fusion layer instead of simple vector concatenation, we obtain a better sentence vector while reducing the burden on the subsequent network layers, which improves our experimental results.

In future research, we will further study attention-based models and aim for better performance on other, more challenging tasks such as question answering and reading comprehension.

Acknowledgments

This work is supported by the Zhejiang Province Technology Project (No. 2020C03105).

References

1. Sun et al., Recognizing Text Entailment via Bidirectional LSTM Model with Inner-Attention, 10363 (2017), 448–457.

2. Bowman S.R., Angeli, Potts and Manning C.D., A large annotated corpus for learning natural language inference, CoRR abs/1508.05326, 2015.

3. Williams, Nangia and Bowman S.R., A broad-coverage challenge corpus for sentence understanding through inference, CoRR abs/1704.05426, 2017.

4. Conneau, Kiela, Schwenk, Barrault and Bordes, Supervised learning of universal sentence representations from natural language inference data, CoRR abs/1705.02364, 2017.

5. Bowman S.R., Gauthier, Rastogi, Gupta, Manning C.D. and Potts, A fast unified model for parsing and sentence understanding, arXiv preprint arXiv:1603.06021, 2016.

6. Kim, Kang and Kwak, Semantic sentence matching with densely-connected recurrent and co-attentive information, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 6586–6593.

7. Liu, Chen and Gao, Multi-Task Deep Neural Networks for Natural Language Understanding, ACL, 2019, 4487–4496.

8. Yoon, Lee and Lee, Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding, CoRR abs/1808.07383, 2018.

9. Devlin, Chang, Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018, 4171–4186.

10. Dong, Huang, Yang and Yan, More is less: A more complicated network with less inference complexity, CVPR, 2017, 1895–1903.

11. Aarne Talman, Anssi Yli-Jyrä, Jörg Tiedemann, Natural Language Inference with Hierarchical BiLSTM Max Pooling Architecture, CoRR abs/1808.08762, 2018.

12. Vaswani, Shazeer, Parmar et al., Attention is all you need, Advances in neural information processing systems, 2017, 5998–6008.

13. Yang, Yao, Sun and Xu, Exploiting the complementary strengths of multi-layer CNN features for image retrieval, Neurocomputing 237 (2017), 235–241.

14. Yin, Schütze, Xiang and Zhou, ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs, 4 (2016), 259–272.

15. Shen, Zhou, Long, Jiang, Pan and Zhang, DiSAN: Directional self-attention network for RNN/CNN-free language understanding, 2018, 245–254.

16. Parikh A.P., Täckström, Das and Uszkoreit, A Decomposable Attention Model for Natural Language Inference, EMNLP, 2016, 2249–2255.

17. Hudson D.A. and Manning C.D., Compositional Attention Networks for Machine Reasoning, ICLR (Poster), 2018.

18. Chen, Zhu, Ling, Wei, Jiang and Inkpen, Enhanced LSTM for natural language inference, 2017, 1657–1668.

19. Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi, Bidirectional Attention Flow for Machine Comprehension, ICLR (Poster), 2017.

20. Tan, Dos Santos, Xiang and Zhou, Improved representation learning for question answer matching, 2016, 464–473.

21. Wang and Jiang, Learning Natural Language Inference with LSTM, 2016, 1442–1451.

22. Wang, Hamza and Florian, Bilateral Multi-Perspective Matching for Natural Language Sentences, 2017, 4144–4150.

23. Tan, Wei, Wang and Zhou, Multiway Attention Networks for Modeling Sentence Pairs, 2018, 4411–4417.

24. Seo M.J., Kembhavi, Farhadi and Hajishirzi, Bidirectional Attention Flow for Machine Comprehension, 2017.

25. Wang and Jiang, A Compare-Aggregate Model for Matching Text Sequences, 2017.

26. Nie and Bansal, Shortcut-Stacked Sentence Encoders for Multi-Domain Inference, 2017, 41–45.

27. Pennington, Socher and Manning, GloVe: Global vectors for word representation, 2014, 1532–1543.

28. Abadi, Agarwal, Barham, Brevdo, Chen and Citro, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, CoRR abs/1603.04467, 2016.

29. Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010, 249–256.

30. Mou, Men, Zhang, Yan and Jin, Natural Language Inference by Tree-Based Convolution and Heuristic Matching, 2016.

31. and Munkhdalai, Neural Semantic Encoders, 2017, 397–407.

32. Chen, Zhu, Ling, Wei, Jiang and Inkpen, Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference, 2017, 36–40.

33. and Cho, Distance-based Self-Attention Network for Natural Language Inference, CoRR abs/1712.02047, 2017.

34. Balazs J.A., Marrese-Taylor, Loyola and Matsuo, Refining Raw Sentence Representations for Textual Entailment Recognition via Attention, 2017, 51–55.
