Guided attention mechanism: Training network more efficiently

Abstract

The attention mechanism is now widely applied across many areas of natural language processing (NLP), and a variety of deep neural networks based on it have been introduced and developed. However, a major problem with these networks is their long training time, because they still have to form attention values from scratch during training. In this paper, we propose an auxiliary method called the Guided Attention Mechanism (GAM), which uses prior knowledge to guide the network in forming attention values for NLP tasks, thereby shortening the training time and making the attention values more accurate. We design two processes for generating prior knowledge, based on a regularization method and a deep learning method respectively. The prior knowledge is then used to guide the attention values of the original network in terms of values and angles. The experimental results show that, compared with the original network, the classification accuracy of the network using GAM improves by about 2% and the training time is reduced by 5∼9%.

Keywords

Guided attention mechanism, natural language processing, prior knowledge

1 Introduction

Deep neural networks have achieved tremendous success in image tasks such as target classification [1] and object detection [2]. When introducing them into various natural language processing (NLP) tasks, researchers found that, unlike pixels, which are treated as independent, words in a text have a significant sequential relationship. Combining words in different orders gives the text different semantics, which makes the fully connected neural network unsuitable for text processing. To overcome this problem, early NLP work mainly relied on recurrent neural networks (RNNs) [3]. RNNs learn the positional relationship between inputs through a cyclic structure and have been applied to many fields, such as the classification of brain signals [4] and Takagi-Sugeno fuzzy models [5]. However, as the text length increases, more redundant information is brought into the network, which also leads to vanishing and exploding gradients. Subsequently, two more complex structures, Long Short-Term Memory (LSTM) [6] and the Gated Recurrent Unit (GRU) [7], were applied to RNNs. They select and filter information by adding gate structures to the RNN’s recurrent unit. In addition to more complex network structures, embedded representations of words have also been widely used and studied. The idea is to train a dedicated deep learning network [8, 9] on a large corpus so that words that are close in meaning lie closer together in the embedding space, thereby providing semantic knowledge that accelerates the network’s understanding of the input text. Although this method can reflect the semantic relationship between words, it cannot show the influence of different words on the text.

As application scenarios become more diverse, texts become more complicated. Researchers endeavor to make models automatically focus on the words that have a decisive effect and capture the important semantic information in a sentence, which has led to the widespread use of the attention mechanism. The purpose of this mechanism is to calculate a set of weights for the input text that indicate the importance of the word at each position. The initial attention mechanism was based on RNNs; it was first applied to machine translation [10] and achieved the best results at that time. Since then, many NLP tasks, such as question answering [11], text classification [12], and aspect-level sentiment classification [13], have incorporated the attention mechanism into their models and achieved significant improvements. However, restricted by the recurrent structure of the RNN, the input is hard to process in parallel: the attention value at the current time step cannot be calculated until the output of the previous time step has been generated, which increases the time for forward propagation and training. The convolutional attention mechanism [14] and the self-attention mechanism [15] have been proposed to alleviate this problem. The former uses a convolutional network to extract features and calculate attention values, while the latter directly compares words at different positions of the input text to obtain attention values. Among them, BERT [16] and GPT [17], which are based on the self-attention mechanism, have advanced the state of the art for many NLP tasks.

Changing and adjusting the structure of the attention mechanism can give the network stronger performance. During training, however, one of the main obstacles is that attention networks have to form attention values from scratch on their own, which increases the difficulty of learning. To this end, some methods in the image field attempt to introduce prior knowledge to generate attention values for the input data. Examples include using the image to guide the attention values of the text [18], using audio information to guide the attention values of the image [19], and letting the image and its attributes guide each other's attention values [20].

In this paper, the guided attention mechanism (GAM), a general method of guiding attention in the NLP field, is proposed to speed up the formation of a network's attention values and improve their accuracy. The process is shown in Fig. 1. Unlike [18–20], where the prior knowledge is independent of the input data and is used directly to generate the attention values, GAM first calculates a set of guiding values (GVs), which serve as prior knowledge, for each text. Then, through some mathematical operations, new attention values are generated from the attention values of the taught network with the help of the GVs; here, the taught network refers to any network that generates attention values for its input data. Finally, the new attention values replace the old ones and are guided by the GVs again in the next iteration.

Fig. 1

The process framework of GAM.

In summary, our contributions are three-fold:

A process for generating GVs, based on a new text classification network that we designed, is proposed. A regularized process based on a public emotional dictionary is also described, in which preprocessing operations are applied to the dictionary to obtain GVs. The two processes are compared experimentally, and their advantages, disadvantages, and application scenarios are discussed;

Two methods for using GVs to guide attention values were designed. One of them combines some widely used activation functions. The other applies the new loss function we defined;

We present a complete and general process for guiding attention values in the NLP field and verify the effectiveness of GAM on six tasks.

The remainder of the paper is organized as follows. In Section 2, the related work is discussed. Section 3 introduces two processes for generating GVs, and each step is described in detail. In Section 4, two methods for guiding the attention values of the taught network based on the GVs are provided, and the basic ideas of each method are illustrated mathematically. The datasets used in the experiments and the parameters of the experimental models are listed in Section 5. The experimental results are discussed and analyzed in Section 6. Section 7 discusses and concludes the paper.

2 Related work

In most kinds of NLP tasks, to obtain the important semantic information, the network is expected to have a focus when learning text, which is often referred to as attention. When the input sentence is x and the target output is y, the forward propagation process of the model with attention mechanism can be abbreviated as formula (1), where g (x) is the attention values calculated for input x: $y = f (x, g (x))$ (1)

At present, considerable research has been devoted to designing g (·). According to the structure of the network, the existing attention mechanisms can be divided into the recurrent attention mechanism (RAM) [10], the convolutional attention mechanism (CAM) [14] and the self-attention mechanism (SAM) [15]. These three attention mechanisms change the optimization objective of the model from maximizing the conditional probability p (y|x) to maximizing p (y|c), where c is the context vector, which can be expressed as c = z (x, g (x)). Among them, RAM and SAM are widely used in many NLP tasks [21].

RAM is built on an RNN, and the attention values have to be recalculated continuously during the recurrence. Taking machine translation as an example, its structure is shown in Fig. 2(a).

Fig. 2

Two attention mechanisms.

The input vectors x₁, x₂, …, x_{T_x} generate the annotation sequence h₁, h₂, …, h_{T_x} through a layer of bidirectional RNN. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i^th word of the input sequence. The context vector c_i is computed as a weighted sum of the annotations h_j in formula (2):

$c_{i} = \sum_{j = 1}^{T_{x}} α_{ij} h_{j}$ (2)

α_ij is the attention value corresponding to the annotation h_j at time i. It is calculated by formula (3), where s_{i-1} is the state at the previous time step and F (·) is a fully connected network with a single output node that uses w_α and b_α as its weight and bias.

$\begin{matrix} α_{ij} = \frac{\exp (F (s_{i - 1}, h_{j}))}{\sum_{k = 1}^{T_{x}} \exp (F (s_{i - 1}, h_{k}))}, \\ F (s_{i - 1}, h_{j}) = [s_{i - 1}, h_{j}] w_{α} + b_{α} \end{matrix}$ (3)
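Formulas (2) and (3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the toy vectors and the weights `w_alpha` and `b_alpha` are hypothetical values chosen for the example, and F (·) is taken to be a single linear unit applied to the concatenation [s_{i-1}, h_j].

```python
import math

def attention_weights(s_prev, annotations, w_alpha, b_alpha):
    """Formula (3): softmax over F(s_{i-1}, h_j) = [s_{i-1}, h_j] w_alpha + b_alpha."""
    scores = []
    for h_j in annotations:
        concat = s_prev + h_j  # list concatenation, i.e. [s_{i-1}, h_j]
        scores.append(sum(c * w for c, w in zip(concat, w_alpha)) + b_alpha)
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alphas, annotations):
    """Formula (2): c_i = sum_j alpha_ij * h_j."""
    dim = len(annotations[0])
    return [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]

# Toy example (hypothetical numbers): three annotations of dimension 2.
s_prev = [0.1, -0.2]
annotations = [[0.5, 0.1], [0.0, 0.3], [-0.4, 0.8]]
w_alpha = [0.3, -0.1, 0.7, 0.2]
b_alpha = 0.0

alphas = attention_weights(s_prev, annotations, w_alpha, b_alpha)
c_i = context_vector(alphas, annotations)
```

The attention values are positive and sum to 1, and the context vector has the same dimension as a single annotation.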

RAM assigns a weight value to the output information of each moment. These weight values are constantly adjusted during back propagation so that the model gradually learns where to pay attention. Owing to the good performance of this mechanism, many NLP fields have developed rapidly. In machine translation [10], encoding the source sentence into a fixed-length vector becomes the bottleneck of the encoder-decoder architecture. The phrase-based translation system built by incorporating the attention mechanism is superior to all the networks at that time. In question answering [22], using simple entity detection and anonymisation algorithms to construct datasets and combining attention mechanisms allows the model to answer complex questions with minimal prior knowledge of language structure. To enable the model to learn the hierarchical structure of a document, the hierarchical attention network [23], which applies the attention mechanism at both the word and sentence levels, was proposed. Experiments show that this method is significantly better than previous methods in document classification.

Restricted by the structure of the model, RAM cannot process its input in parallel, which leads to longer training time. Although optimization schemes such as factorization tricks [24] and conditional computation [25] can improve computational performance, this limitation still exists. SAM was proposed to solve this problem.

SAM relates information at different positions of the input sequence to compute a representation of the sequence. Its structure is shown in Fig. 2(b). The entire operation can be expressed as:

$\mathrm{self\_attention} (Q, K, V) = \mathrm{softmax} (\frac{QK^{T}}{\sqrt{d_{k}}}) V$ (4)

The inputs are queries (Q) and keys (K) with dimension d_k, and values (V) with dimension d_v. From the queries and keys, the attention values applied to the values are obtained as $\mathrm{softmax} (QK^{T} / \sqrt{d_{k}})$. When the three inputs in formula (4) satisfy Q = K = V, the result is the self-attention value.
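Formula (4) can be illustrated with a small pure-Python sketch. The helper names and the toy input `X` are assumptions made for the example; rows of `Q`, `K` and `V` are positions in the sequence, and the self-attention case is obtained by passing the same matrix three times.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Formula (4): softmax(Q K^T / sqrt(d_k)) V, with one row per position."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [[sum(w * v[d] for w, v in zip(row, V)) for d in range(d_v)]
              for row in weights]
    return output, weights

# Toy input (hypothetical): three positions with dimension 2, and Q = K = V.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(X, X, X)
```

Each row of `weights` is one set of attention values over all positions, and no position has to wait for the output of a previous one.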

SAM avoids the additional cost of cyclic input by feeding all the information into the network at once. This mechanism also underlies some new network structures, including the Transformer [15]. The BERT [16] network, based on the multi-layer Transformer, overcomes the long-term dependence problem caused by long text and has been shown to obtain new state-of-the-art results on eleven NLP tasks. A further limitation is that large unlabeled corpora with abundant information cannot be used directly in model training; GPT [17], built on the Transformer, effectively addresses this by incorporating unsupervised learning.

The above two attention mechanisms give the model better performance by means of different calculation methods. However, in the initial iterations, owing to the random initialization of the parameters, the direction of attention changes constantly, which slows the convergence of the model. To solve this problem and shorten the training time of the model, this paper proposes the Guided Attention Mechanism (GAM), which provides accurate directional guidance for the model and ensures that attention values can still be adjusted according to the actual data.

3 Generating guiding attention values

Currently, in the NLP field, research on adding prior knowledge before network training mostly focuses on the vectorized representation of words [26, 27], and little work has been devoted to providing prior knowledge that helps the network form attention values. In this section, we describe two processes for generating guiding values (GVs). They use a regularized method (Sec. 3.1) and a deep learning method (Sec. 3.2) respectively to generate GVs (i.e. prior knowledge) for each text input into the network. Their effects are shown in Fig. 3(a) and Fig. 3(b) (the darker the color, the larger the corresponding value).

Fig. 3

Visual representation of the guiding values.

The GVs generated by the two processes need to show the taught network which areas of the text it should focus on, without being detailed down to every word, so as to preserve the fault tolerance and independent learning ability of the taught network.

3.1 Regularized method based on emotional dictionary

When reading an article, people’s eyes are often attracted by words or short sentences with rich potential meanings, whereas they pay little attention to adverbs, prepositions and words that are insipid or merely used to conform to grammatical norms. These potential meanings include, but are not limited to, the author’s emotional richness, subjective consciousness, and the foreshadowing of the following text. Since it is very difficult to describe subjective consciousness and foreshadowing numerically with existing means, the emotional richness of words is used in our experiment as the auxiliary tool of the regularization method to simulate human reading behavior.

SentiWordNet [28], an emotional dictionary published for opinion mining, has wide coverage and a high degree of detail. For example, the word ‘good’ is listed in the dictionary with 4 meanings as a noun, 21 meanings as an adjective and 2 meanings as an adverb, with three scores for each meaning: positivity, negativity and objectivity.

In this study, SentiWordNet3.0 [29], the latest edition of this series of emotional dictionaries, was selected. Each word in the dictionary is pre-processed to calculate its average emotional score, which expresses the emotional richness of the word, as in formula (5):

$\mathrm{emotion\_score} = \mathrm{abs} (\frac{\sum_{i = 1}^{n} (\mathrm{pos\_score}_{i} - \mathrm{neg\_score}_{i})}{n})$ (5)

Here n is the number of meanings of a word in the dictionary, and pos_score_i and neg_score_i are the positive and negative scores given by the dictionary for the i^th meaning.

As expressed in formula (6), the text is traversed first. If a word is included in the dictionary, its emotional score is taken; otherwise the value is set to 0. After the replacement is completed, all the values are normalized to obtain the GVs g₁, g₂, …, g_Tx.

$g_{i} = \frac{e_{i}}{\sum_{j = 1}^{Tx} e_{j}}, \quad e_{i} = \begin{cases} \mathrm{emotion\_score}_{word_{i}} & \text{if } word_{i} \text{ is in the dictionary} \\ 0 & \text{otherwise} \end{cases}$ (6)
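The dictionary-based process of formulas (5) and (6) can be sketched as follows. The function names and the toy lexicon are hypothetical: the scores below are illustrative numbers, not actual SentiWordNet entries, and returning all zeros when no word is found in the dictionary is an assumption, since that case is not specified in the text.

```python
def emotion_score(senses):
    """Formula (5): absolute value of the mean (pos_score - neg_score) over all senses."""
    n = len(senses)
    return abs(sum(pos - neg for pos, neg in senses) / n)

def dictionary_gvs(words, lexicon):
    """Formula (6): look up each word, use 0 for unknown words, then normalize."""
    e = [emotion_score(lexicon[w]) if w in lexicon else 0.0 for w in words]
    total = sum(e)
    if total == 0.0:
        # Assumption: no emotional word was found, so there is nothing to normalize.
        return [0.0] * len(words)
    return [v / total for v in e]

# Hypothetical lexicon: word -> list of (pos_score, neg_score), one pair per sense.
toy_lexicon = {
    "good": [(0.75, 0.0), (0.5, 0.0)],
    "boring": [(0.0, 0.5)],
}

gvs = dictionary_gvs(["a", "good", "but", "boring", "film"], toy_lexicon)
```

Words missing from the lexicon receive a guiding value of 0, and the remaining values sum to 1.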

Part of the effect of this method is shown in Fig. 3(a). It can be seen that not only adjectives but also some nouns and verbs in the sentences have high emotional scores, which supports the usability and generalization of SentiWordNet3.0.

The main advantage of using an emotional dictionary to generate GVs is its low computational complexity and fast computation. However, limited by the size of the dictionary, words that carry rich emotions but do not appear in the dictionary cannot be given an emotional score. In addition, the emotional score of the same word does not differ between sentences, because this method does not take the semantics of the sentence into account. Therefore, the longer the input text and the more words it contains, the better the method works.

3.2 Guiding network based on deep learning

Different from the regularized method, the main idea of this section is to train a new classification model and use it to generate a corresponding emotional value for each word in the text, and then calculate GVs based on these emotional values. The model used to generate the emotional values is referred to as the guiding network in this paper, and its structure is shown in Fig. 4.

Fig. 4

Guiding network.

Before training the guiding network, we select the Yelp reviews full star dataset [30] as the training corpus, which is constructed by randomly taking 130000 training samples and 10000 testing samples for each review star from 1 to 5. Each word in the samples is initialized as a 100 × 1 dimensional vector x by Glove100 [9]. The values in the weight and bias matrices used in forward propagation are initialized by random sampling from a uniform distribution on [- 1, 1].

In forward propagation, the input x₁, x₂, …, x_Tx ∈ R^100×Tx first passes through a fully connected layer and an LSTM layer; their outputs then become the inputs to another fully connected layer and a reverse LSTM layer. Here, the forward LSTM and the reverse LSTM do not receive their inputs at the same time, and their input tensors are different. Therefore, the structure shown by the dashed box in Fig. 4 is called an asynchronous bidirectional LSTM layer, where LSTM, with its long-term memory and selective forgetting ability, is chosen to extract deep features of the text. Since the input to each layer contains the output of a fully connected layer, more of the original information can be retained than in the existing bidirectional LSTM layer. Following the experimental results in [31] on the selection of the number of LSTM output nodes, the number of output nodes of the forward LSTM layer, the reverse LSTM layer, and the fully connected layers in this structure is set to 100. The calculation of the asynchronous bidirectional LSTM layer is shown in formula (7), where w₁ ∈ R^100×100 and w₂ ∈ R^200×100 are the weight matrices of the two fully connected layers, b₁ ∈ R^Tx×100 and b₂ ∈ R^Tx×100 are the bias matrices, and l₄₁, l₄₂, …, l_4Tx ∈ R^200×Tx represents the final output of this structure. The relationship between the reverse LSTM layer and the forward LSTM layer can be expressed as LSTM_Reverse (l₃₁, l₃₂, …, l_3Tx) = LSTM (l_3Tx, …, l₃₂, l₃₁).

$\begin{matrix} [l_{11}, l_{12}, \dots, l_{1 Tx}]^{T} = [x_{1}, x_{2}, \dots, x_{Tx}]^{T} w_{1} + b_{1} \\ l_{21}, l_{22}, \dots, l_{2 Tx} = \mathrm{Concat} ([l_{11}, l_{12}, \dots, l_{1 Tx}], \mathrm{LSTM} (l_{11}, l_{12}, \dots, l_{1 Tx})) \\ [l_{31}, l_{32}, \dots, l_{3 Tx}]^{T} = [l_{21}, l_{22}, \dots, l_{2 Tx}]^{T} w_{2} + b_{2} \\ l_{41}, l_{42}, \dots, l_{4 Tx} = \mathrm{Concat} ([l_{31}, l_{32}, \dots, l_{3 Tx}], \mathrm{LSTM\_Reverse} (l_{31}, l_{32}, \dots, l_{3 Tx})) \end{matrix}$ (7)

Considering the limited ability of a single asynchronous bidirectional LSTM layer, a 3-layer stacked structure (where the dimension of the weight w₁ in the last two layers is changed to 200 × 100) is adopted in the guiding network to extract the deep semantic features c₁, c₂, …, c_Tx ∈ R^200×Tx. Then, the value at each position of c₁, c₂, …, c_Tx is passed into a fully connected layer F and a sigmoid (·) activation layer S, as shown in formula (8), where c_i ∈ R^200×1 is the input at the current position, and w_fi ∈ R^200×1 and b_fi ∈ R^1×1 are the weight and bias of the F layer at that position. The weights of the F layers are not shared between positions. The final outputs s₁, s₂, …, s_Tx ∈ R^1×Tx take values in (0, 1).

$s_{i} = \mathrm{sigmoid} (c_{i}^{T} w_{fi} + b_{fi}), \quad i = 1, 2, \dots, Tx$ (8)

We call s₁, s₂, …, s_Tx the emotional values extracted from x₁, x₂, …, x_Tx. Subsequently, s₁, s₂, …, s_Tx are sent as input to a multi-classifier, which operates as follows:

$c_{1}, c_{2}, \dots, c_{class\_number} = [s_{1}, s_{2}, \dots, s_{Tx}] w_{c} + b_{c}$ (9)

$p_{j} = \frac{e^{c_{j}}}{\sum_{i = 1}^{class\_number} e^{c_{i}}}, \quad j = 1, 2, \dots, class\_number$ (10)
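The last stage of the guiding network, formulas (8) to (10), can be sketched as below. This is only an illustration under assumptions: the function names and all numbers are hypothetical, the feature dimension is reduced from 200 to 2 to keep the example small, and the classifier scores of formula (9) are called `logits` in the code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def emotional_values(C, W_f, b_f):
    """Formula (8): position i has its own weight vector W_f[i] and bias b_f[i]."""
    return [sigmoid(sum(c * w for c, w in zip(c_i, w_i)) + b_i)
            for c_i, w_i, b_i in zip(C, W_f, b_f)]

def classify(s, W_c, b_c):
    """Formulas (9)-(10): linear layer over s_1..s_Tx followed by softmax."""
    n_classes = len(b_c)
    logits = [sum(s[i] * W_c[i][j] for i in range(len(s))) + b_c[j]
              for j in range(n_classes)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example (hypothetical): Tx = 3 positions, feature dimension 2, 2 classes.
C = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.4]]
W_f = [[0.5, 0.5], [1.0, -1.0], [0.3, 0.3]]   # one weight vector per position
b_f = [0.0, 0.1, -0.1]
W_c = [[0.2, -0.2], [1.0, -1.0], [0.1, 0.3]]  # shape Tx x class_number
b_c = [0.0, 0.0]

s = emotional_values(C, W_f, b_f)
p = classify(s, W_c, b_c)
```

Because the classifier only sees one scalar per position, each s_i has to carry whatever that position contributes to the decision.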

Here class_number represents the number of categories of the classification, w_c ∈ R^{Tx×class_number} is the weight matrix, b_c ∈ R^{1×class_number} is the bias matrix, and p₁, p₂, …, p_{class_number} ∈ R^{1×class_number} is the final output of the guiding network.

We divide texts into two categories according to whether they have a significant emotional tendency and use them to train the network. Texts with obvious emotional tendencies often contain words with rich emotions, which play a crucial role in the final decision of the network. As the classification accuracy on the two types of texts increases, the network gradually acquires the ability to distinguish emotionally rich words from other words. When a text is input into the trained network, the corresponding values s₁, s₂, …, s_Tx fluctuate because of the differences in emotional richness between words, and the fluctuations between emotionally rich words and other words are large. The GVs are based on these fluctuations.

Samples labeled with 1 and 5 stars in the Yelp dataset are grouped as texts with significant sentimental tendencies, and the remaining samples are grouped into another class, so class_number is 2. The cross-entropy error function is chosen as the loss function, the batch size is set to 32, and AdamOptimizer is used as the network optimizer.

We randomly select 80% of each class of text to form the training dataset, and the remaining data form the test dataset. Each word in the text is initialized to a word vector by using Glove100. Subsequently, the training data is fed into the guiding network for iterative training, and the test data is fed into the network to test the model performance after each iteration. The accuracy of the network on the training data and the test data as a function of the number of iterations is shown in Fig. 5. It can be seen from Fig. 5 that the model achieves its highest accuracy of 76.2% on the test data after the 9th iteration, after which the accuracy decreases due to overfitting. Therefore, the training is ended after the completion of the 9th iteration.

Fig. 5

Accuracy varies with the number of iterations.

When the training is complete, we feed the text for which GVs are needed into the network and take out the corresponding s₁, s₂, …, s_Tx. As shown in formula (11), s₁, s₂, …, s_Tx are input to the function max_N (·). max_N (s₁, s₂, …, s_Tx) first subtracts adjacent items and takes the absolute values to generate |s_Tx - s₁|, |s₁ - s₂|, …, |s_Tx-1 - s_Tx|, which quantify the magnitude of the fluctuation. It then selects the N largest of these differences (N depends on the length of the sequence) and returns the value on the right side of the minus sign in each of these N items. These N values form a new sequence, which is the output of max_N (s₁, s₂, …, s_Tx). If s_i exists in this sequence, it is retained; otherwise it is set to 0. Finally, the sequence is normalized to obtain the GVs g₁, g₂, …, g_Tx. Part of the effect of this method is shown in Fig. 3(b).

$\begin{matrix} g_{i} = \frac{q_{i}}{\sum_{j = 1}^{Tx} q_{j}}, \\ q_{i} = \begin{cases} s_{i} & \text{if } s_{i} \text{ in } \mathrm{max\_N} (s_{1}, s_{2}, \dots, s_{Tx}) \\ 0 & \text{if } s_{i} \text{ not in } \mathrm{max\_N} (s_{1}, s_{2}, \dots, s_{Tx}) \end{cases} \end{matrix}$ (11)
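The max_N (·) selection and the normalization of formula (11) can be sketched as follows. The function names and the toy sequence are hypothetical. One assumption is made explicit: membership of s_i in the output of max_N (·) is tracked by position rather than by value, so that two positions with equal values are not confused; ties between equal differences are broken in favor of the earlier position.

```python
def max_n_positions(s, n):
    """Positions of the right-hand items of the N largest |s_{i-1} - s_i|.

    For i = 0 the index i - 1 wraps around to the last item, which
    reproduces the |s_Tx - s_1| term of the description.
    """
    tx = len(s)
    fluct = [abs(s[i - 1] - s[i]) for i in range(tx)]
    ranked = sorted(range(tx), key=lambda i: fluct[i], reverse=True)
    return set(ranked[:n])

def network_gvs(s, n):
    """Formula (11): keep s_i at the selected positions, zero the rest, normalize."""
    keep = max_n_positions(s, n)
    q = [s[i] if i in keep else 0.0 for i in range(len(s))]
    total = sum(q)
    # Assumption: if everything is zero, the sequence is returned unnormalized.
    return [v / total for v in q] if total > 0 else q

# Toy emotional values (hypothetical): one clear peak at the third position.
s = [0.10, 0.12, 0.90, 0.15, 0.14]
gvs = network_gvs(s, 2)
```

In this toy sequence the two largest fluctuations are the jump into the peak and the drop after it, so the third and fourth positions are retained and the rest are set to 0.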

The main advantage of the guiding network is that it generalizes well: for each input text, GVs can be generated according to its semantics. However, to ensure that the network has good feature extraction capabilities, a large training corpus is needed. In addition, because of its complicated structure, processing a single sentence takes longer than with an emotional dictionary. Therefore, this method is more suitable for short input texts; for long texts, the training data for the network may need to be re-selected.

4 Methods of guiding attention

In forward propagation, almost all networks that use the attention mechanism calculate one or more sets of weight values t₁, t₂, …, t_Tx for the input sequence. By normalizing these weights, the attention values α₁, α₂, …, α_Tx are obtained, and they are constantly adjusted during back propagation. Based on the guiding values g₁, g₂, …, g_Tx described in Section 3, we design two methods of guiding attention, in terms of value (Sec. 4.1) and angle (Sec. 4.2), which aim to make the taught network form attention quickly after being guided by the GVs.

4.1 Guide the values

The flow of the method of guiding the numerical value is shown in Fig. 6, and the mathematical description is as follows:

$α_{1}, α_{2}, \dots, α_{Tx} = \mathrm{softmax} (\mathrm{relu} (\tanh (t_{1}, t_{2}, \dots, t_{Tx}) + g_{1}, g_{2}, \dots, g_{Tx}))$ (12)

Fig. 6

Method of guiding numerical values.

The weight values t₁, t₂, …, t_Tx ∈ (- ∞ , + ∞) calculated by the taught network are mapped into the range (- 1, 1) by the tanh(·) function. With g₁, g₂, …, g_Tx ∈ (0, 1), summing the two sequences yields h₁, h₂, …, h_Tx. These operations serve the following two purposes:

The final attention values of the taught network will be based on GVs, and GVs will exert influence throughout the learning process;

Where the GVs are wrong, the taught network can weaken their influence by adding negative values. For areas that the GVs have not noticed, the taught network can still attend to them by adding positive values.

The negative values in h₁, h₂, …, h_Tx correspond to the parts that the taught network does not want to attend to or that have been overcorrected. These parts should not be taken into account when calculating the final attention values. Therefore, the relu (·) function is chosen to set the negative values to zero, and the final attention values α₁, α₂, …, α_Tx are then obtained by normalizing the sequence with the softmax(·) function.

The above calculation shows that the attention values formed by the taught network are the result of modifying and refining the GVs, which avoids the need to adjust the attention values by a large margin. Throughout the guidance process no additional parameters are introduced; only some non-linear activation functions are used to transform the values, so there is no additional time consumption.
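A minimal sketch of formula (12) in plain Python is given below. The function name and the toy values of `t` and `g` are hypothetical and only serve to show how the three activation functions interact.

```python
import math

def guide_values(t, g):
    """Formula (12): alpha = softmax(relu(tanh(t) + g))."""
    h = [math.tanh(t_i) + g_i for t_i, g_i in zip(t, g)]  # tanh(t) lies in (-1, 1)
    r = [max(0.0, h_i) for h_i in h]                      # relu sets negatives to 0
    m = max(r)                                            # for numerical stability
    exps = [math.exp(v - m) for v in r]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example (hypothetical): the GVs highlight positions 1 and 3, while the
# taught network strongly prefers position 2 and rejects position 0.
t = [-2.0, 0.0, 1.5, -0.3]
g = [0.0, 0.6, 0.0, 0.4]
alpha = guide_values(t, g)
```

In this example the taught network raises position 2, which the GVs ignored, above the guided positions, and partly cancels the guidance at position 3. Note that a position set to zero by relu (·) still receives a small non-zero weight after softmax(·), since exp(0) = 1.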

4.2 Guide the angle

A vector of any dimension can be represented in a vector space. The guiding values g₁, g₂, …, g_Tx and the attention values α₁, α₂, …, α_Tx of the taught network are represented in the vector space as shown in Fig. 7(a).

Fig. 7

Spatial representation of the vector.

In vector space, the cosine of the angle θ is used to measure the difference between two vectors. It is expressed as formula (13):

$\cos θ = \frac{α_{1} g_{1} + α_{2} g_{2} + \dots + α_{Tx} g_{Tx}}{\sqrt{α_{1}^{2} + α_{2}^{2} + \dots + α_{Tx}^{2}} \cdot \sqrt{g_{1}^{2} + g_{2}^{2} + \dots + g_{Tx}^{2}}}$ (13)

When the two vectors are dissimilar, cos θ tends to 0 and θ is larger; as shown in Fig. 7(b), the two vectors appear more separated in space. When the two vectors are more similar, cos θ tends to 1 and θ is smaller; as shown in Fig. 7(c), the two vectors appear closer in space.

In the process of guiding, we hope that the attention values of the taught network can quickly approach the GVs while maintaining a certain angle between the two, to ensure the independence of the taught network. This requires that cos θ be increased to a limited extent during training. For this reason, the loss function of the network is revised as formula (14).

$Loss_{new} = Loss_{original} + ω (1 - \cos θ)$ (14)

The new loss function adds ω (1 - cosθ) to the original, where ω ∈ [0, + ∞) is the penalty coefficient, which is a constant.

To minimize the loss function, the taught network reduces the value of (1 - cosθ) when optimizing its parameters, which leads to an increase of cos θ. In space, this means that the attention vector of the taught network moves closer to the GVs vector. By adjusting the penalty coefficient ω, the attention vector can be prevented from over-fitting to the GVs vector. When ω increases, the term ω (1 - cosθ) contributes more to the loss, so the network is penalized more heavily for a large angle and drives (1 - cosθ) to a smaller value. When ω decreases, the penalty is weakened and (1 - cosθ) is allowed to remain larger. After several attempts, we recommend choosing ω within [0.001, 0.01].

The method of guiding the angle forms attention values quickly by narrowing the angle between the two vectors. The difference from the method in Section 4.1 is that this method is intended to bring the attention values of the taught network close to the GVs, rather than to modify the GVs. By adding the penalty coefficient ω, the size of the angle between the two vectors becomes controllable.
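Formulas (13) and (14) can be sketched in plain Python as follows. The function names are hypothetical, the default ω = 0.008 is the value used in the experiments of Section 5.2, and returning cos θ = 0 for an all-zero vector is an assumption, since that case is not covered by formula (13).

```python
import math

def cos_theta(alpha, g):
    """Formula (13): cosine of the angle between attention values and GVs."""
    dot = sum(a * b for a, b in zip(alpha, g))
    norm_a = math.sqrt(sum(a * a for a in alpha))
    norm_g = math.sqrt(sum(b * b for b in g))
    if norm_a == 0.0 or norm_g == 0.0:
        return 0.0  # assumption: treat an all-zero vector as maximally dissimilar
    return dot / (norm_a * norm_g)

def guided_loss(loss_original, alpha, g, omega=0.008):
    """Formula (14): Loss_new = Loss_original + omega * (1 - cos(theta))."""
    return loss_original + omega * (1.0 - cos_theta(alpha, g))

# Toy example (hypothetical): attention that ignores the guided position
# is penalized more than attention that agrees with the GVs.
g = [0.0, 1.0, 0.0]
loss_far = guided_loss(1.0, [1.0, 0.0, 0.0], g)
loss_near = guided_loss(1.0, [0.1, 0.8, 0.1], g)
```

In an actual network the same expression would be built from differentiable tensor operations so that the penalty term takes part in back propagation; the plain-Python version only shows the arithmetic.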

5 Experiments

To verify the effectiveness and generalization of GAM, six public datasets were selected in this experiment, and the classification task was performed on them.

5.1 Datasets

The following six public datasets were selected for comparison experiments. The information of each dataset is shown in Table 1. Columns 5 and 6 of the table represent the number of words contained in the datasets and the value of N (mentioned in Section 3.2) chosen for the datasets. In this experiment, Glove100 [9] trained on Wikipedia data was used to initialize the word vector of each word. To reduce the influence of chance on the results, each dataset was tested with five-fold cross validation.

Table 1
Distribution of each dataset

Data	Classes	Average sentence length	Dataset size	Value of N
MR	2	20	10662	8
SST1	5	18	11855	7
SST2	2	19	9613	7
Subj	2	23	10000	8
TREC	6	10	5952	3
MPQA	2	3	10606	2

MR: Movie review sentences, whose classification involves positive/negative comments.

SST1 [33]: an extension of MR but with fine-grained labels (very positive, positive, neutral, negative, very negative).

SST2: Same as SST1 but with neutral reviews removed and binary labels.

Subj [34]: Subjectivity dataset where the task is to classify a sentence as being subjective or objective.

TREC [35]: TREC question dataset, where the task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.).

MPQA [36]: The MPQA Opinion Corpus contains news articles from a wide variety of news sources manually annotated for opinions and other private states (beliefs, emotions, sentiments, speculations, etc.).

5.2 Experimental network and parameter settings

To confirm the validity and generalization of GAM, we selected one widely used and representative network each from the recurrent attention mechanism and the self-attention mechanism. The recurrent attention mechanism is represented by the bidirectional LSTM combined with the attention mechanism (Bilstm+attention) [10], and the self-attention mechanism by the Transformer [15]. The structure of each network is consistent with that in the original literature. All LSTMs have 300 hidden units, and the word vectors have dimension 100 and are fine-tuned during training to improve classification performance.

Before the experiment, we generated GVs for the six datasets with the two processes described in Section 3. We use dga and lga to denote the GVs generated with the emotional dictionary and with the deep learning network, respectively. Each of the two selected networks is combined with both types of GVs and both guiding methods proposed in Section 4, abbreviated as VG (Section 4.1) and AG (Section 4.2). In what follows, a name such as Transformer_VG_dga denotes the Transformer guided by the VG method with dga GVs. Apart from their structures, all networks share the same training parameters: the batch size is 32, the optimizer is AdamOptimizer, and the penalty coefficient ω is 0.008. The same training procedure and the same datasets are used for all networks.
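
The naming scheme and the shared settings can be written down as a small experiment grid. The sketch below only enumerates names and collects the hyperparameters quoted in this section; the constant names and the dictionary layout are illustrative assumptions, not part of the original implementation.

```python
from itertools import product

BASE_NETWORKS = ["Bilstm+attention", "Transformer"]
GUIDING_METHODS = ["VG", "AG"]  # guide the values (Section 4.1), guide the angles (Section 4.2)
GV_SOURCES = ["dga", "lga"]     # emotional dictionary, deep learning network

# Settings quoted from the text; the optimizer is kept as a plain string and
# the hidden-unit count applies only to the LSTM-based network.
SHARED_SETTINGS = {
    "batch_size": 32,
    "optimizer": "AdamOptimizer",
    "omega": 0.008,
    "lstm_hidden_units": 300,
    "word_vector_dim": 100,
}

def variant_names():
    """Return the names of the guided networks, e.g. Transformer_VG_dga."""
    return [
        f"{net}_{method}_{gv}"
        for net, method, gv in product(BASE_NETWORKS, GUIDING_METHODS, GV_SOURCES)
    ]

names = variant_names()
print(len(names), names)
```

Together with the two unguided baselines, the eight names give the ten networks that are trained under identical settings.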

6 Results

Table 2 shows the performance of the eight new networks using GAM and of the two original networks on the six datasets. The classification accuracy of several high-performance networks and methods from the literature is also listed for comparison.

Table 2
Performance of networks (classification accuracy, with training time in parentheses)

Model	MR	SST1	SST2	Subj	TREC	MPQA
Bilstm+attention	80.25%	48.44%	83.00%	93.05%	87.75%	89.49%
	(234s)	(185s)	(145s)	(366s)	(102s)	(129s)
Bilstm+attention_VG_dga	82.11%	48.17%	83.68%	93.40%	89.00%	89.16%
	(220s)	(171s)	(142s)	(355s)	(100s)	(81s)
Bilstm+attention_VG_lga	83.10%	48.85%	83.15%	94.80%	89.59%	90.99%
	(222s)	(176s)	(144s)	(350s)	(82s)	(115s)
Bilstm+attention_AG_dga	81.95%	48.82%	83.05%	93.25%	89.59%	89.49%
	(223s)	(177s)	(147s)	(356s)	(106s)	(86s)
Bilstm+attention_AG_lga	83.18%	48.88%	84.10%	94.60%	88.83%	89.86%
	(233s)	(183s)	(144s)	(373s)	(112s)	(123s)
Transformer	74.12%	45.81%	78.52%	93.10%	87.89%	89.11%
	(182s)	(151s)	(125s)	(334s)	(102s)	(135s)
Transformer_VG_dga	76.50%	44.46%	80.03%	93.40%	88.85%	88.88%
	(170s)	(136s)	(118s)	(274s)	(93s)	(97s)
Transformer_VG_lga	77.17%	56.18%	88.56%	94.25%	89.91%	88.93%
	(176s)	(141s)	(116s)	(273s)	(91s)	(127s)
Transformer_AG_dga	75.43%	46.15%	80.29%	93.10%	88.41%	88.92%
	(170s)	(142s)	(115s)	(283s)	(101s)	(137s)
Transformer_AG_lga	76.86%	56.68%	84.60%	93.75%	89.05%	89.54%
	(183s)	(152s)	(125s)	(282s)	(89s)	(110s)
Sentiment Lexicon Enhanced Attention [37]	82.90%	48.90%	–	–	–	–
Directional Self-Attention [38]	–	51.70%	–	94.20%	94.20%	90.10%
Convolutional Attention [39]	82.80%	50.00%	87.70%	94.10%	–	–
Domain-Adaptive Network [40]	81.10%	–	87.60%	–	–	–
Bilstm-2Dpooling [41]	81.50%	50.50%	88.30%	93.70%	94.80%	–
Bilstm-2DCNN [41]	82.30%	52.40%	89.50%	94.00%	96.10%	–
LSTM-RNN [42]	–	49.90%	88.00%	–	–	–
MV-RNN [43]	79.00%	44.40%	82.90%	–	–	–
CNN-rand [44]	76.10%	45.00%	82.70%	89.60%	91.20%	83.40%
DCNN [45]	–	48.50%	86.80%	–	93.00%	–
MV-CNN [46]	–	49.60%	89.40%	93.90%	–	–
MG-CNN [47]	–	48.01%	–	87.47%	94.87%	–
DAN [48]	–	48.20%	86.80%	–	–	–

6.1 Impact on accuracy

As can be seen from Table 2, GAM improves the classification accuracy of the networks to varying degrees, and the guided networks achieve the best results on four of the six datasets (MR, SST1, Subj and MPQA). With GAM, the basic attention mechanism outperforms some attention methods with more complex structures. Compared with Bilstm+attention, the Transformer generates more attention values during forward propagation, so GAM has a greater impact on it. This is reflected on the SST1 and SST2 datasets, where the performance of the Transformer improves more markedly.

Averaged over the results in Table 2, VG and AG increased the classification accuracy of the original networks by 1.51% and 2.81%, respectively. Networks using lga are on average about 1.37% more accurate than networks using dga.
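
As a worked example of how such gains are obtained, the sketch below compares a single guided network with its baseline using the accuracies in Table 2. It covers only one pair of rows; the averages quoted in the text aggregate several variants, and the exact averaging protocol is not stated, so the helper names and the simple mean are assumptions.

```python
# Accuracies (%) copied from Table 2, in the order MR, SST1, SST2, Subj, TREC, MPQA.
DATASETS = ["MR", "SST1", "SST2", "Subj", "TREC", "MPQA"]
BASELINE = [80.25, 48.44, 83.00, 93.05, 87.75, 89.49]  # Bilstm+attention
GUIDED = [83.18, 48.88, 84.10, 94.60, 88.83, 89.86]    # Bilstm+attention_AG_lga

def accuracy_gains(baseline, guided):
    """Per-dataset gain in percentage points (guided minus baseline)."""
    return [round(g - b, 2) for b, g in zip(baseline, guided)]

def mean_gain(baseline, guided):
    """Unweighted mean of the per-dataset gains."""
    gains = accuracy_gains(baseline, guided)
    return sum(gains) / len(gains)

gains = dict(zip(DATASETS, accuracy_gains(BASELINE, GUIDED)))
print(gains)
print(mean_gain(BASELINE, GUIDED))
```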

Figure 8 visualizes the attention values produced by the networks on a text (the darker the color, the larger the corresponding value). Bilstm+attention and Bilstm+attention_AG_lga are selected for comparison. The attention values in (b) are more accurate than those in (a), because under the guidance of GAM the network obtains more prior knowledge and uses it as a basis for forming attention values.

Fig. 8

Visual representation of the attention values.

6.2 Impact on time

Table 2 also lists, in parentheses, the training time each of the ten experimental networks needed to reach its reported accuracy. In most cases, the training time of the improved network is shorter than that of the original network. On average, the introduction of GAM shortened the training time by 5.87% for Bilstm+attention and by 8.75% for the Transformer. Since the forward propagation of the Transformer requires more attention values than that of Bilstm+attention, GAM has a greater impact on it and reduces its training time more. The results also show that the degree of improvement brought by GAM varies across datasets.
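
The relative reduction for one guided network can be computed from the times in Table 2 as sketched below. Only the Transformer and Transformer_VG_dga rows are used; how the per-dataset figures are aggregated into the averages quoted above is not specified, so the unweighted mean here is an assumption.

```python
# Training times (seconds) copied from Table 2, in the order MR, SST1, SST2, Subj, TREC, MPQA.
DATASETS = ["MR", "SST1", "SST2", "Subj", "TREC", "MPQA"]
BASE_SECONDS = [182, 151, 125, 334, 102, 135]   # Transformer
GUIDED_SECONDS = [170, 136, 118, 274, 93, 97]   # Transformer_VG_dga

def relative_reduction(base, guided):
    """Training-time reduction as a percentage of the baseline time."""
    return 100.0 * (base - guided) / base

reductions = {
    name: relative_reduction(base, guided)
    for name, base, guided in zip(DATASETS, BASE_SECONDS, GUIDED_SECONDS)
}
mean_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
print(f"mean: {mean_reduction:.2f}%")
```

A negative value would indicate that the guided network trained more slowly than its baseline on that dataset.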

To understand the role of GAM in the training process, Fig. 9 plots the accuracy of the models on the MR and TREC datasets against training time.

Fig. 9

Accuracy curve over time.

In Fig. 9, the panels for the same dataset are distinguished by the way the GVs were generated. Panels (a) and (b) show the performance of each network on the MR dataset, and panels (c) and (d) show the performance on the TREC dataset. Comparing the panels, the following points are worth mentioning:

GAM can make the network approach the optimal accuracy faster and help the network achieve high performance in a shorter time;

VG operates directly on the values, which makes the effect of the GVs more obvious in the early stage of training. AG, in contrast, needs some time to bring the attention values close to the GVs in terms of angle, so it takes longer;

The use of lga has a greater impact on the performance of Transformer, while the difference between lga and dga on Bilstm+attention network is not obvious.
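
The contrast between guiding values and guiding angles can be pictured with two stand-in penalties. The sketch below is not the formulation of Section 4: it assumes, purely for illustration, that a value-style penalty is the squared distance to the GVs, that an angle-style penalty is one minus the cosine similarity, and that either is added to the task loss with the weight ω. All function names are hypothetical.

```python
import math

OMEGA = 0.008  # penalty coefficient quoted in Section 5.2

def value_penalty(attention, guide):
    """Hypothetical value-style penalty: squared distance to the GVs."""
    return sum((a - g) ** 2 for a, g in zip(attention, guide))

def angle_penalty(attention, guide):
    """Hypothetical angle-style penalty: one minus the cosine similarity."""
    dot = sum(a * g for a, g in zip(attention, guide))
    norm_a = math.sqrt(sum(a * a for a in attention))
    norm_g = math.sqrt(sum(g * g for g in guide))
    return 1.0 - dot / (norm_a * norm_g)

def guided_loss(task_loss, penalty, omega=OMEGA):
    """Assumed combination: task loss plus the weighted guidance penalty."""
    return task_loss + omega * penalty

guide = [0.1, 0.6, 0.3]
scaled = [0.05, 0.3, 0.15]  # same direction as the GVs, half the magnitude
shifted = [0.6, 0.1, 0.3]   # same magnitudes, different direction

for name, attention in [("scaled", scaled), ("shifted", shifted)]:
    print(name, value_penalty(attention, guide), angle_penalty(attention, guide))
```

Under these assumptions the angle-style penalty ignores the scale of the attention vector and reacts only to its direction, while the value-style penalty pulls every entry toward the corresponding GV from the first update.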

7 Conclusion and discussion

In this work, two processes for generating guiding values (GVs) are designed, based on the emotional dictionary and on the deep learning network, respectively. We also propose two methods that use the GVs to guide the network’s attention values in terms of values and angles. The experimental results show that the Guided Attention Mechanism (GAM) can improve the performance of attention networks. On the one hand, compared with the original network, the improved network converges to a higher accuracy, and the GVs generated by both processes have a positive effect on the networks. On the other hand, guiding the angle improves classification accuracy more than guiding the numerical value, while the latter is more advantageous in shortening the training time. Moreover, the results show that the more often attention values are generated during the forward propagation of the guided network, the more significant the effect of GAM is.

Benefiting from its simple and efficient operation, GAM generalizes well and enables networks to form attention values quickly and accurately during training. However, the effect of this mechanism is sensitive to the quality of the GVs. When the input text is more complicated, a larger emotional dictionary and more training corpora are needed to generate the GVs.

Future research will focus on using GAM in more tasks and exploring more ways to generate GVs for different needs.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants: Methodologies for Understanding Big Data and Knowledge Discovery (61836016).

References

1. Krizhevsky, Sutskever and Hinton G.E., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (2012), 1097–1105.

2. Girshick, Donahue, Darrell and Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), 580–587.

3. Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).

4. Rubio J.d.J., Ricardo Cruz, Elias, Ochoa, Balcazar and Aguilar, ANFIS system for classification of brain signals, Journal of Intelligent & Fuzzy Systems 37(3) (2019), 1–9.

5. de Jesús Rubio, Lughofer, Meda-Campaña J.A., Páramo L.A., Novoa J.F. and Pacheco, Neural network updating via argument Kalman filter for modeling of Takagi-Sugeno fuzzy models, Journal of Intelligent & Fuzzy Systems 35(2) (2018), 1–12.

6. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9 (1997), 1735–1780.

7. Cho, Van Merriënboer, Bahdanau and Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259 (2014).

8. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems (2013), 3111–3119.

9. Pennington, Socher and Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), 1532–1543.

10. Bahdanau, Cho and Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).

11. Sukhbaatar, Weston and Fergus, End-to-end memory networks, Advances in Neural Information Processing Systems (2015), 2440–2448.

12. and Huang, Text classification research with attention-based recurrent neural networks, International Journal of Computers Communications & Control 13 (2018), 50–61.

13. Wang, Huang, Zhu and Zhao, Attention-based LSTM for aspect-level sentiment classification, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016), 606–615.

14. Woo, Park, Lee J.-Y. and So Kweon, CBAM: Convolutional block attention module, Proceedings of the European Conference on Computer Vision (ECCV) (2018), 3–19.

15. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017), 5998–6008.

16. Devlin, Chang M.-W., Lee and Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

17. Radford, Narasimhan, Salimans and Sutskever, Improving language understanding by generative pre-training, URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/languageunderstandingpaper.pdf (2018).

18. , Li and Pang, Fusion-Attention Network for person search with free-form natural language, Pattern Recognition Letters 116 (2018), 205–211.

19. Liu, Geng, Ling and Cheung Y.-m., Attention guided deep audio-face fusion for efficient speaker naming, Pattern Recognition 88 (2019), 557–568.

20. , He, Wang and Yang, Image-attribute reciprocally guided attention network for pedestrian attribute recognition, Pattern Recognition Letters 120 (2019), 89–95.

21. , An introductory survey on attention mechanisms in NLP problems, Proceedings of SAI Intelligent Systems Conference (2019), 432–448.

22. Hermann K.M., Kocisky, Grefenstette, Espeholt, Kay, Suleyman and Blunsom, Teaching machines to read and comprehend, Advances in Neural Information Processing Systems (2015), 1693–1701.

23. Yang, Yang, Dyer, He, Smola and Hovy, Hierarchical attention networks for document classification, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016), 1480–1489.

24. Kuchaiev and Ginsburg, Factorization tricks for LSTM networks, arXiv preprint arXiv:1703.10722 (2017).

25. Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton and Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, arXiv preprint arXiv:1701.06538 (2017).

26. Choi, Cho and Bengio, Context-dependent word representation for neural machine translation, Computer Speech & Language 45 (2017), 149–160.

27. Peters M.E., Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).

28. Esuli and Sebastiani, SentiWordNet: A publicly available lexical resource for opinion mining, LREC 6 (2006), 417–422.

29. Baccianella, Esuli and Sebastiani, SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining, LREC 10 (2010), 2200–2204.

30. Zhang, Zhao and LeCun, Character-level convolutional networks for text classification, Advances in Neural Information Processing Systems (2015), 649–657.

31. Lee J.Y. and Dernoncourt, Sequential short-text classification with recurrent and convolutional neural networks, arXiv preprint arXiv:1603.03827 (2016).

32. Pang and Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (2005), 115–124.

33. Socher, Perelygin, Wu, Chuang, Manning C.D., Ng and Potts, Recursive deep models for semantic compositionality over a sentiment treebank, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013), 1631–1642.

34. Pang and Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (2004), 271.

35. and Roth, Learning question classifiers, Proceedings of the 19th International Conference on Computational Linguistics 1 (2002), 1–7.

36. Wiebe, Wilson and Cardie, Annotating expressions of opinions and emotions in language, Language Resources and Evaluation 39 (2005), 165–210.

37. Lei, Yang and Yang, Sentiment lexicon enhanced attention-based LSTM for sentiment classification, Thirty-Second AAAI Conference on Artificial Intelligence (2018).

38. Shen, Zhou, Long, Jiang, Pan and Zhang, DiSAN: Directional self-attention network for RNN/CNN-free language understanding, Thirty-Second AAAI Conference on Artificial Intelligence (2018).

39. , Gui, Xu and He, A convolutional attention model for text classification, National CCF Conference on Natural Language Processing and Chinese Computing (2017), 183–195.

40. , Guo, Zhang, Gu and Yang, Imbalanced text sentiment classification using universal and domain-specific knowledge, Knowledge-Based Systems 160 (2018), 1–15.

41. Zhou, Qi, Zheng, Xu, Bao and Xu, Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling, arXiv preprint arXiv:1611.06639 (2016).

42. and Zuidema, Compositional distributional semantics with long short term memory, arXiv preprint arXiv:1503.02510 (2015).

43. Socher, Huval, Manning C.D. and Ng A.Y., Semantic compositionality through recursive matrix-vector spaces, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2012), 1201–1211.

44. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882 (2014).

45. Kalchbrenner, Grefenstette and Blunsom, A convolutional neural network for modelling sentences, arXiv preprint arXiv:1404.2188 (2014).

46. Yin and Schütze, Multichannel variable-size convolution for sentence classification, arXiv preprint arXiv:1603.04513 (2016).

47. Zhang, Roller and Wallace, MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification, arXiv preprint arXiv:1603.00968 (2016).

48. Iyyer, Manjunatha, Boyd-Graber and Daumé III, Deep unordered composition rivals syntactic methods for text classification, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing 1 (2015), 1681–1691.
