Extracting relational facts based on hybrid Syntax-Guided transformer and pointer network

Abstract

Joint extraction of entities and relations from unstructured text is an essential step in constructing a knowledge base. However, relational facts in these texts are often complicated, where most of them contain overlapping triplets, making the joint extraction task still challenging. This paper proposes a novel Sequence-to-Sequence (Seq2Seq) framework to handle the overlapping issue, which models the triplet extraction as a sequence generation task. Specifically, a unique cascade structure is proposed to connect transformer and pointer network to extract entities and relations jointly. By this means, sequences can be generated in triplet-level and it speeds up the decoding process. Besides, a syntax-guided encoder is applied to integrate the sentence’s syntax structure into the transformer encoder explicitly, which helps the encoder pay more accurate attention to the syntax-related words. Extensive experiments were conducted on three public datasets, named NYT24, NYT29, and WebNLG, and the results show the validity of this model by comparing with various baselines. In addition, a pre-trained BERT model is also employed as the encoder. Then it comes up to excellent performance that the F1 scores on the three datasets surpass the strongest baseline by 5.7%, 5.6%, and 4.4%.

Keywords

Information extraction sequence to sequence transformer network pointer network syntax-guided attention network

1 Introduction

Joint extraction of entities and relations is an essential task for information extraction and plays a vital role in many applications, like knowledge base population [1 –3] and clinical information extraction [4, 5]. This task demands that entity pairs and the corresponding relations be extracted from the unstructured text simultaneously. For example, given the sentence ‘Obama was born in Honolulu’, it is aimed at extracting a relational triplet ‘(Obama, place_of_birth, Honolulu)’ from the sentence. Generally, ‘Obama’ is referenced as the subject, ‘Honolulu’ is referenced as the object, and ‘place_of_birth’ is the relationship between the subject and object.

Early studies implemented joint extraction mainly by pipeline methods, which split the joint extraction into two separate sub-tasks: Named Entity Recognition (NER) and Relation classification (RC). Pipeline methods made the joint extraction easy to conduct, but it introduced an error propagation issue [6]. Besides, the relevance between the two sub-tasks have not been exploited. One improved way is to apply the joint learning method, in which a parameter-sharing layer is designed to construct the bridge between the NER and RC [7 –9]. However, after identifying all the entities, the joint learning method is also demanded to classify the relations between each entity pair; thus the interaction between the two sub-tasks is not fully exploited. Besides, the overlapping issue is yet to be fully handled by the joint learning method.

The joint extraction becomes challenging, as one entity has multiple relationships with others in the overlapping scenarios. Some examples are presented in Figure 1, where sentences are divided into three categories: Normal, Entity Pair Overlapping (EPO), and Single Entity Overlapping (SEO). In the normal category, an entity has only one relation with the other. The EPO category represents that an entity pair has more than one relations. In the SEO category, an entity has relations with multiple other entities. The overlapping issue makes the joint extraction task arduous for many joint learning methods. It requires the model to enumerate over all possible entity pairs, which leads to a heavy computational burden, since N entities will cause roughly N² pairs. Furthermore, many entity pairs have no internal relation. Then ‘None’ labels are assigned to these entity pairs during the training process, making the neural model difficult to learn the real relationship between the entity pairs.

Fig. 1

Some examples of the overlapping triplets in the real world.

Recently, many studies apply Seq2Seq architecture to address the overlapping issue [10 –13]. In these methods, the joint extraction is modeled as a triplet sequence generation task. In this way, an entity or relation can be generated or copied several times when needed to participate in different triplets. Thus, the Seq2Seq methods are neither plagued with overlapping relations nor with excessive computations. However, the following limitations also exist in previous studies: (1) the relational triplets are generated on entity-level or word-level, thus requiring at least three steps to generate a triplet, which is time-consuming, and the dependencies on triplet-level is not explored; (2) Existing methods are based on recurrent neural networks (RNNs), which have own limitations, low efficiency in training and hard to model the long-dependency.

In this paper, a transformer-based Seq2Seq model is proposed to address the issues mentioned above. Specifically, the transformer network [14] is adopted as Encoder-Decoder architecture to generate the relation sequences and a pointer network [15] based on a multi-headed attention mechanism is further designed to identify the related entity pairs. In the Encoder-Decoder network, the proposed model extends the transformer’s self-attention mechanism with syntax-guided constrain to capture syntax-related parts with the concerned words. The sentence’s syntax tree works as clues and is integrated into the encoder explicitly. In the pointer network, the multi-headed layer’s attention scores are treated as a pointer to identify the related entities. Lastly, a unique cascading structure is designed to connect the transformer and pointer network.

To verify the validity of the proposed method, extensive experiments were conducted on three public datasets, named NYT24, NYT29, and WebNLG, as previous studies [10 –13]. Evaluation results demonstrate that the proposed model has its superiority in effectiveness and efficiency compared with baselines. Further experiments by introducing a pre-trained BERT [16] as an encoder are also conducted. With the help of the pertained-model, a significant performance gain over the transformer encoder is made.

In summary, the main contributions of this paper are as follows:

A novel cascade framework is proposed to extract entities and relations jointly from sentences, which generates triplet-level-based sequences and improves the parallelization capability.

A syntax-guided network is proposed to capture the syntactic information of the input sentence explicitly, which facilitates the encoder to pay more attention to the important words in the input sentences.

Experimental results on the three public datasets show that the proposed model achieves competitive performance compared with baselines.

This paper proceeds as follows: (2) section 2 summarizes related work in this field; (3) section 3 gives a formulate definition of the task; (4) section 4 presents our approach and the detail of each component; (5) experiment and discussion are depicted in section 5; (6) section 6 gives a conclusion and future work.

2 Related work

Over the past few decades, many efforts have been devoted to the joint extraction task. The existing methods can be divided into three categories: pipeline method, joint learning method, and Seq2Seq method.

2.1 Pipeline method

Traditional studies solved the joint extraction task with the pipeline method, which divided it into NER and RC. Early methods utilized feature-based approaches to solve the two sub-tasks [17 –19]. These methods require a large amount of feature engineering efforts, and the extracted features are sparse. Recently, many neural networks were proposed for NER and RC. Lample et al. [20] introduced a neural network for NER, which utilized a bi-directional Long Short-term Memory (BiLSTM) network to encode the sentence, then decoded the sequence labels based on a conditional random filed layer (CRF). To capture the features from both word-level and character-level, some studies apply a hybrid model of BiLSTM and Convolutional neural network (CNN) for NER, where the character-level feature is captured by the CNN model [21, 22]. For the RC task, Zeng et al. [23] utilized a CNN model to extract lexical and sentence level features, then the features are fed into a softmax layer to predict the relationship between the entities. Liu et al. [24] proposed a novel end-to-end multi-level semantic representation enhancement network for RC, which can enhance the semantic representation of entities from word, phrase, and context level. The neural-network-based methods are more robust and achieve better performance than the feature-based method. As the neural network’s ability to extract deep continuous features. Despite the success of the neural-network-based methods, the pipeline model also suffers from apparent issues. Firstly, the pipeline model may introduce an error propagation problem [6]. Secondly, the interaction between the two sub-tasks is not exploited.

2.2 Joint learning method

To address the issues in pipeline methods, joint learning models have been proposed. Some are also feature-based [6 , 26], which heavily rely on feature engineering. In recent years, neural-network-based methods have been investigated. The neural-network-based methods can be divided into different categories according to how they model the interaction between the two sub-tasks. The first category is the parameter sharing method, in which the input sentence is encoded by a parameter sharing encoder and then passed into two separate decoders for NER and RC [7 –9]. The parameters sharing methods also have to pipeline the detected entities into the relation classifier. Therefore, the triplet-level dependency of the two tasks is not fully exploited. The second category is the tagging-based method. These methods encode entities and relations into the same space by designing a unique tagging scheme, then generate the entities and relations in one decoder [27, 28]. Besides, some studies utilized Graph Convolution Network (GCN) [29] or Reinforcement learning (RL) [31, 32] methods to tackle this task. The graph-based method model the interaction between entities and relations via two-phase GCN. The RL-based method applies hierarchical reinforcement learning (HRL) to enhance the interaction between NER and RC. These methods solve the problem of overlapping relations to some extent. Nevertheless, there also have some issues, such as over redundant computation and high computational complexity.

2.3 Seq2seq method

Recently, many Seq2Seq methods were investigated [10 –12] to address the joint extraction task. These methods treat the joint extraction as a triplet sequence generation task. Thus the entities or relations can be generated repeatedly. Although the Seq2Seq model has shown its superiority in addressing the overlapping issue, existing methods suffer from some limitations. The CopyRE [10] cannot extract the entire words of the entities, hindering its usage in a real-world application. The CopyMTL [11] needs to introduce an additional NER module to identify the entity boundaries of the triplet. Besides, existing models generate the triplet sequence on word-level, thus lost the triplet-level dependency. What’s more, existing Seq2Seq methods are all RNN-based that cannot be computed in parallel and are still weak when capturing the long-dependence of two farther apart entities.

3 Problem statement

For a given sentence X = [w₁, w₂, . . . , w_n] and a predefined relation set R = [r₁, r₂, . . . , r_n], the goal of joint extraction is to predict all the triplets correctly. A triplet is defined as (s, r, o), where r ∈ R and s, o ∈ X, s ≠ o. In fact, s and o often contain multi-words. In this study, the joint extraction is modeled as a triplet generation task, and we apply a Seq2Seq model to generate the triplet one-by-one until ending with a special stop symbol ‘EOS’. From a probabilistic view, the triplet sequence generation is equal to predict the probability p_θ (s, r, o) that the relational facts hold. Formally, for a given annotated sentence x_j from the training dataset D and some potentially overlapping triplets T_j (s, r, o) in x_j, it aims at maximizing the data likelihood of the training dataset D, which is formalized as follows:

$\begin{matrix} p_{θ} (T | D) \\ = \prod_{j = 1}^{D} [\prod_{(s, r, o) \in T_{j}} p_{θ} ((s, r, o) | x_{j})] \end{matrix}$ (1) where (s, r, o) ∈ T_j denotes a triplet of the triplet set in a sentence x_j.

According to the rule-chain of probability, Eq. 1 can be rewritten as follows: $\begin{matrix} p_{θ} (T | D) \\ = \prod_{j = 1}^{D} [\prod_{r \in R} p_{θ} (r | x_{j}) \prod_{(s, o) \in T_{j} | r} p_{θ} ((s, o) | r, x_{j})] \end{matrix}$ (2) where R is the set of all possible relations, (s, o) ∈ T_j|r is a (r, o) pair in the triples led by relation r in T_j.

Eq. (2) implies that the model can first predict the relations from the sentence and then extract the subject entity and object entity simultaneously when given the relation.

This formulation provides many advantages. Firstly, the relationship is detected first, then the entities are viewed as arguments for each relation, and be recognized by an object tagger p_θ ((s, o) |r, x_j), as opposed to classifying relations for (subject, object) pairs. Secondly, by utilizing a transformer model, the relation detection and entity identification can be completed in one decoding step. This process can be seen as extracting a triplet from triplet-level. Thus, the interactions between the relations and entities are better modeled. Thirdly, the overlapping issue can be solved in design, as multiple triples may share the same entities in this method.

4 Proposed model

The overall architecture of the proposed model is illustrated in Figure 2, which contains three components. On the left, a syntax-guided network encodes the input sentence into the hidden representation, representing the semantic information of the sentences. In the middle, the hidden representation and the embedding of relational queries are fed into a relation decoder to compute each relation’s probability. On the right, a pointer network receives the output of the sentence encoder and relation decoder to predict the span of the subject and object entities.

Fig. 2

Main architecture of the proposed model. The model is composed of three parts: Syntax-Guided Encoder, Relation decoder and Pointer network.

4.1 Syntax-guided encoder

This study utilizes two kinds of encoders to extract deep features from the input sentence: transformer encoder [14] and pre-trained model BERT [16]. As the architecture of BERT is mainly based on multi-layers of bidirectional transformer block, this subsection only gives a description of the transformer encoder.

4.1.1 Token embedding

For an sentence x with n tokens x = [w₁, w₂, . . . , w_n], the first step is to convert the input tokens into continuous vectors C = [c₁, c₂, . . . , c_n], where c_i is the embedding of i-th token. In this work, the token embedding is the sum of the word embedding and the position embedding, where the word embedding consists of two parts: word-level embedding and character-level embedding. The character-level embedding can help the model to capture the semantic information of words better; hence it is used in many previous studies [22 , 35]. In this study, a 1DCNN network with Max-Pool is applied to get the character-level embedding as prior work [35]. The positional embedding is the same as the vanilla transformer, which is sinusoidal based. For the BERT-encoder, there also contains an additional segment embedding.

4.1.2 Sentence encoding based on syn-encoder

Previous studies [36, 37] have shown that the transformer-based model is able to capture certain syntactic information implicitly by learning from sufficient examples. However, there is still a gap between the syntactic structures implicitly learned and the explicit syntax structures created by human experts, which is useful prior knowledge for many NLP (natural language processing) downstream tasks. In this study, we also argue that the prior syntax information is essential for the joint extract task, as the facts in a sentence are often determined by some keywords that have syntax dependency in the syntax tree. For example, for the case described in the first section, there is syntax dependency among the keywords ‘Obama’, ‘born’ and ‘Honolulu’ (can be viewed in Fig. 4), and these information is enough to extract the fact ‘<Obama, place _ of _ birth, Honolulu>’. Conversely, noise may be introduced if the model pays attention to all the words in the sentence.

To capture the sentence’s syntax structure, this paper proposes a syntax-guided encoder (Syn-Encoder) to incorporate syntactic information, as shown in Figure 3. Firstly, the input sentence is encoded by the vanilla transformer; then, the output from the vanilla is fed into a syntax-guided attention network to get the syntax-guided representation. Secondly, the outputs from the vanilla transformer and the syntax-guided attention network are aggregated together to represent the final representation. With the guidance of the syntax dependency information, the encoder makes the word only attend to the syntactic-related words in the sentence.

Fig. 3

The syntax-guided encoder is composed of two components: Transformer encoder and syntax-guided attention network.

The vanilla transformer is stacked by N identical layers, and each layer contains two sub-networks: a multi-headed attention layer and a fully connected feed-forward network (FFN). The multi-headed attention layer applies multiple heads to learn features from different sub-spaces, where each head is based on a self-attention mechanism. In the self-attention mechanism, each word’s representation is a weighted sum of other words in the sentence, and the weight matrix is often referenced as the attention matrix. The vanilla transformer adopts the scaled dot-product attention to generate the attention matrix. Particularly, the input embedding is first projected into query, key, and value vectors by different linear layers. Then, the transformer computes the dot products of the query vector with every key vector and divided each by $\sqrt{d_{k}}$ to get the weight matrix, where d_k is equal to the size of the hidden dimension divides by the number of heads. Finally, a softmax function is applied to the weight matrix to get the attention map. After that, the obtained attention is multiplied on the value vectors to get a representation T. This procedure is formalized as Eq. 3:

$\begin{array}{l} T = Attention (Q_{T}, K_{T}, V_{T}) \\ = softmax (\frac{Q_{T} {(K_{T})}^{⊤}}{\sqrt{d_{k}}}) V_{T} \end{array}$ (3) where Q_T, K_T and V_T represents the query, key and value vectors.

The multi-headed attention utilizes multiple learned heads to obtain features from different aspects. Then the output from different heads is concatenated together and once again projected to get the updated representation.

$\begin{matrix} {\tilde{M}}_{T} = MultiHead (Q_{T}, K_{T}, V_{T}) \\ = W_{T} {[T_{0}^{*}, T_{1}^{*}, . . ., T_{m}^{*}]}^{⊤} \end{matrix}$ (4) where W_T is a training parameter, and m is the number of the heads.

The updated representation is then passed to a residual connection and normalization layer to obtain a new representation.

$M_{T} = LN ({\tilde{M}}_{T} + Q_{T})$ (5)

The second layer FFN contains two linear transformations with a Rectified Linear Unit (ReLU). The output of the second layer is also fed to a residual connection and normalization layer to compute the output.

$\tilde{H} = \max (M_{T} W_{1} + b_{1}) W_{2} + b_{2}$ (6) $H = LN (\tilde{H} + M_{T})$ (7) where W₁, b₁ ∈ R^{d_model×d_ff} and W₂, b₂ ∈ R^{d_ff×d_model} are the parameters of the linear transformations. d_ff = 2048 is the dimension of the inner-layer, and d_model equals to the dimension of the input and output. If the pre-trained model is utilizing to encode the sentence, H represents the output from the last layer of BERT model.

To fully exploit the syntactic information of the input sentence, the output of the vanilla transformer is further fed into a syntax-guided attention network to fusion the prior syntax knowledge. In the syntax-guided network, a masked matrix M is adopted into the self-attention operation. Specifically, for an input sentence with n words s = (w₁, w₂, . . . , w_n). The sentence is first parsed to get a syntax dependency tree. Then the masked matrix M with n × n elements is constructed according to the ancestor nodes p_j for each word in the syntax tree. In the masked matrix M, each element m_ij represents the dependency of the word i to word j. If the word j is the ancestor of the word i, then the value of the element is set to 1, otherwise 0. The description of the process is presented in Eq. (8). $m_{ij} = {\begin{matrix} 1, & if j \in p_{j}, or i = j \\ 0, & otherwise \end{matrix}$ (8)

Figure 4 gives an example of how to generate the masked matrix M. It can be seen that if the element m_ij = 1, it means that the word j is the ancestor of the word i.

Fig. 4

An example of the generation of the masked matrix based on the dependency tree.

When the masked matrix M is built, a syntax-guided multi-headed attention is computed on the output of the vanilla transformer H. The output H is first projected into new query (Q^′), key (K^′), and value (V^′) vectors through linear layer. Then the syntax-guide attention matrix is obtained by conducting a dot-product to query-key pairs with the dependency mask M. Finally, a softmax layer is applied to the weight matrix to compute the attention matrix, denoted as A^′, which is defined as Eq. (9).

$A^{'} = softmax (\frac{(Q^{'} (K^{'})^{T})}{\sqrt{d_{k}}}) M$ (9)

The syntax-guided representation is obtained by multiply the attention matrix A^′ to the value vectors, which is shown as Eq. (10):

$H_{i}^{'} = A_{i}^{'} V^{'}$ (10)

A multi-headed attention mechanism is also applied in the syntax-guided network, and the representation is the concatenation from each head. $H^{'} = MultiHead (Q^{'}, K^{'}, V^{'})$ (11)

Finally, the output is the sum of the vanilla transformer and the syntax-guided network, which is computed as follows: $H^{syn} = α H + (1 - α) H^{'}$ (12)

We control the contribution of the two part by a hyper-parameter α in Eq. (12), and it equals to 0.5 in this work.

4.2 Relation decoder

The relation decoder is also stacked by N identical layers, and each layer consists of three sub-layers: masked multi-headed attention, encoder-decoder attention, and FFN layer. There also has a residual connection and normalization around each of the sub-layers. The decoder’s input embedding is the sum of word embedding and position embedding of the relation queries. In the masked multi-head attention, a masking operation is implemented to prevent the rightward information flow to the decoder, same as the vanilla transformer. The detail of this process is formalized as Eq. (13):

$H^{m} = MultiHead (h_{1}^{m}, . . ., h_{i}^{m}, . . . h_{h}^{m})$ (13) where $h_{i}^{m}$ is defined as follows" $h_{i}^{m} = Attention (Q_{i}^{t}, K_{i}^{t}, V_{i}^{t})$ (14) where $Q_{t}^{i}, K_{t}^{i}, V_{t}^{i}$ are the query, key, and value vectors of the i -th head, which are projection of target words embedding by different linear layer. A residual connection and normalization layer are also following this layer. To simplify, this subsection does not describe it again.

The second layer is an encoder-decoder attention layer, which performs multi-headed attention over the encoder layer’s output. The input for this layer also contains three vectors, where the query vector equals to the output of the first sub-layer H_m, the key and value are the output of the syntax-guided layer H. The process is as Eq. (15):

$h^{ed} = MultiHead (h_{1}^{ed}, . . ., h_{i}^{ed}, . . . h_{h}^{ed}),$ (15) where $h_{i}^{ed}$ is defined as follows: $h_{i}^{ed} = Attention (H_{i}^{m}, H_{i}^{syn}, H_{i}^{syn})$ (16)

Then the encoder-decoder representations are also fed into the normalization and FFN layers, same as the encoding layer.

$\tilde{H^{ed}} = LN (h^{ed} + H^{m})$ (17) $H^{ed} = LN (FFN (\tilde{H^{ed}}) + \tilde{H^{ed}})$ (18)

Lastly, the output of the decoder layer is feed into a linear layer and then through a softmax to get the probability of the relation p_r, as defined in Eq. (19):

$p_{r} = softmax (H^{ed} W_{o} + b_{o})$ (19) where W_o, b_o ∈ R^d_model×r are the trained parameters, d_model is the dimension of the output of the decoder and r is the number of the relations.

4.3 Pointer network

The subject and object entity are the span of the input sentence. Thus, the proposed model designs two pointer-networks to identify the start and end position of the two entities. As described in section 3, the subject and object can be seen as arguments for the relation, then the span of the entities is determined by the relation sequence and the input sentence. Therefore, a multi-headed attention network is utilized as the pointer network to recognize the entities. In the multi-headed attention layer, the key and value are the output of the syntax-guided encoder H, and the query is the output of relation decoder H^ed. In this way, each relation makes attention to all the input tokens, and the attention scores can be treated as the pointer to select the related entities. For each headed, the attention matrix is calculated as Eq. (20): $A_{p} = softmax (\frac{((W_{Q})^{⊤} H^{ed}) ((W_{K})^{⊤} H)^{⊤}}{\sqrt{d_{k}}})$ (20) where W_Q and W_K are the training parameters.

As each entity has a start and end position, the multi-headed is split into two parts. The first part is used to determine the beginning location, and the second half is used to determine the end location. Specifically, the start position probability p_s is computed by the sum of attention scores from head 1 to n/2, and the end position probability p_e is calculated by the sum of the attention scores from the head n/2 +1 to n. The detail is presented in Eq. (21) and (22):

$p_{s} = sum (A_{1}^{p}, A_{2}^{p}, . . ., A_{\frac{n}{2}}^{p})$ (21)

$p_{e} = sum (A_{\frac{n}{2} + 1}^{p}, A_{\frac{n}{2} + 1}^{p}, . . ., A_{n}^{p})$ (22) where n is the total number of the multi-headed.

The object decoder is same as the subject decoder, which utilizes the multi-headed attention scores as the pointer network. As the object is determined by both the relation and the object, then the query vector is the sum of the relation decoder and the subject decoder, and the key and value vectors are equal to the output of the syntax-guided encoder H.

4.4 Training and inference

During the training, it aims at minimizing the sum of the negative log-likelihood of the relation classification and the four-pointer locations. The training objective is defined as Eq. (23):

$J_{θ} = - \frac{1}{B \times T} \sum_{b = 1}^{B} \sum_{t = 1}^{T} [(log (r_{t}) + log (e 1_{s}) + log (e 1_{e}) + log (e 2_{s}) + log (e 2_{e}))]$ (23)

where r, e1_s, e1_e, e2_s, and e2_e are the softmax score of the corresponding relation label, start, and end location of the object and subject entities. B and T stand for the batch size and the maximum number of the triplets in each sentence.

During the inference, the greedy search algorithm is utilized to generate relation sequences.

5 Experiment

5.1 Datasets

Experiments were conducted on three public datasets, NYT24, NYT29, and WebNLG, as previous studies [10 –13]. The NYT24 and NYT29 are originally produced by distant supervision method from the New York Times [30]. The NYT24 dataset was first produced by X. Zeng et al. [10], which contains 24 types of relations. The NYT29 dataset was first released by R. Takanobu et al. [31], which includes 29 types of relations. WebNLG was proposed by Garden et al. [39] for Natural Language Generation task, having 246 predefined relations types. This study adopts the dataset released by T. Nayak et al. [13] for NYT24 and NYT29. For the WebNLG dataset, the dataset pre-processed by Zeng et al. [10] is utilized. The statistics of the three datasets is illustrated in Table 1.

Table 1
Statistics of the three public datasets

categories NYT24 NYT29 WebGNL

Train Test Train Test Train Test

sentence 56,195 5,000 63,306 4,006 5,019 703

relations 24 24 29 29 246 246

Normal 37,013 3,266 53,444 2,951 1,596 246

EPO 9,782 978 8,379 715 227 26

SEO 14,735 1,297 9,862 757 3,406 457

categories	NYT24	NYT29	WebGNL
sentence	56,195	5,000	63,306	4,006	5,019	703
relations	24	24	29	29	246	246
Normal	37,013	3,266	53,444	2,951	1,596	246
EPO	9,782	978	8,379	715	227	26
SEO	14,735	1,297	9,862	757	3,406	457

5.2 Implementation details

In this study, two different encoders are implemented, and the hyper-parameters are different for them. Tables 2 and 3 record the hyper-parameters for the transformer-based and BERT-based models.

Table 2
Hyper-parameters for the transformer-based model

Parameters Value

Word embedding d_w 300

Char embedding d_c 50

Kernel size of the 1DCNN for char embedding c_k 3

Number of the filters for the 1DCNN model c_w 212

Positional embedding d_p 512

Max length of the character for each word l_c 10

Number of the layers for the encoder N_e 4

Number of the layers for Syn-Encoder N_s 4

Number of the layers for the relation decoder N_r 4

Number of the layers for the pointer decoder N_p 1

Number of the heads h 8

Batch size B 64

Learning rate lr 1e-5

Parameters	Value
Word embedding d_w	300
Char embedding d_c	50
Kernel size of the 1DCNN for char embedding c_k	3
Number of the filters for the 1DCNN model c_w	212
Positional embedding d_p	512
Max length of the character for each word l_c	10
Number of the layers for the encoder N_e	4
Number of the layers for Syn-Encoder N_s	4
Number of the layers for the relation decoder N_r	4
Number of the layers for the pointer decoder N_p	1
Number of the heads h	8
Batch size B	64
Learning rate lr	1e-5

Table 3

Hyper-parameters for the BERT-based model

Parameters	Value
Word embedding d_w	768
Positional embedding d_p	768
Bidirectional Transformer blocks for BERT N_b	12
Number of the layers for Syn-Encoder N_s	4
Number of the layers for the relation decoder N_r	4
Number of the layers for the pointer decoder N_p	1
Number of the heads h	12
Batch size B	8
Learning rate lr	1e-5

In the transformer-based model, the word embedding and char embedding dimensions are 300 and 50, respectively. The number of the filter for the 1DCNN model is set to 212, and the kernel size equals 3. The max length of the characters is set to 10. The transformer’s encoder and decoder are stacked by N = 4 identical layers, where each has 8 multi-headeds with the hidden dimension equals to 512. The pointer network layer contains one identical layer. In the BERT-based model, the BERT-Base-Cased is chosen for NYT24, WebNLG, and BERT-Base-Uncased is selected for NYT29. Both of them have 12 bidirectional Transformer blocks, and the size of the hidden state is 768. The number of the headed is 12 for the BERT-based model. We use the Stanford CoreNLP tool [40] to parse the sentence and get the dependency information.

During the training, the transformer-based model is optimized by an Adam [40] optimizer with a learning rate of 1e-5. The BERT-based model is fine-tuned with a learning rate of 1e-5 by the BERTAdam. The warm-up strategy is applied during the training and the warm-up steps is 8000. The training process lasts 150 epochs with the batch size of 64 for the transformer-based model and 8 for the BERT-based model. We also adopt the early stop strategy to train the model. If the F1-scores does not rise up for ten consecutive epochs, we stop the training process.

The network is implemented with PyTorch framework [42]. All the experiments were conducted on the Ubuntu server equipped with GeForce GTX 1080Ti GPU.

5.3 Baselines and metrics

To verify the validity of the proposed model, several strong baselines are chosen as compared methods, including the following methods.

CopyRE is an RNN-based Seq2Seq model with a copy mechanism. The CopyRE-one uses one decoder to generate the outputs, and CopyRE-mul utilizes multiple decoders to generate the results.

CraphR [29] is a graph-based network that learns the interaction between entities and relations through GCN. A two-phase GCN network (GraphR-1p,GraphR-2p) is implemented to extract the entites and relations from the plain text simultaneously.

CopyMTL [11] (CopyMTL-one and CopyMTL-mul) is a further improvement on the CopyR model. CopyMTL proposes a multi-task learning framework with a copy mechanism to generate the output sequence and predict the multi-token entities.

HRL [31] is composed by a Hierarchical Framework with a two-level reinforcement learning (RL) process for relation extraction, where the high-level RL process is used to detect the relation indication, and the low-level RL process is used to detect the corresponding entities [31].

Seq2Seq+RL [12] applies the reinforcement learning method to control the extraction order of the triplets, which can generate the relational facts freely.

Encoder-Decoder model [13], WDec and PtDec are RNN-based encoder-decoder architecture for joint extraction task. WDec extracts one word at each time step, and PtDec generates a tuple at every time step.

As some models cannot handle the multi-tokens entities, some previous work, such as CopyRE [10], GraphR [29], and Seq2Seq+RL [12], use a partial matching metric, where a triplet is regarded as correct only the relation and the header of subject and object entities are correct. Other models, such as HRL [31], WDec [13], and PtDec [13], use a strict matching mechanism. A triplet is regarded as correct only the relation and the corresponding full named entities are both correct. For a fair comparison, a strict matching metric is adopted in this study. Although the partial matching based methods are not strict, the results of these methods are also listed. The widely used metrics precision (P), recall (R), and F1 scores (F1) are reported in the experimental results. The definition of the three metrics is as follows: $precision = \frac{TP}{TP + FP}$ (24) $recall = \frac{TP}{TP + FN}$ (25) $F 1 = 2 * \frac{precision * recall}{precision + recall}$ (26) Where TP represents the number of triplets that are extracted correctly and FP represents the number of triplets that are extracted incorrectly. FN is the number of triplets that are not extracted.

In subsequent sections, this papers uses STP to refer to the transformer-based model and BSTP to the BERT-based model.

5.4 Experimental results

5.4.1 Overall evaluation results

The experimental results are presented in Table 4. From the results, it can be observed that the proposed method outperforms all the baselines among the three datasets. In these baselines, the WDec and Ptdec achieve a significant performance improvement than others, reflecting that the Seq2Seq method effectively handles the overlapping triplets. Compared with the WDec and Ptrdec, our model achieves a better performance. For the dataset NYT24, STP achieves the F1 score of 82.7%, which is 1.2% and 4.3% higher than the WDec and PtDec. For the NYT29, the F1 scores of STP are increased by 1.6% and 2.5% relative to the WDec and PtDec, and the F1 score reaches to 69.8%. For the last dataset, STP achieves the F1 score of 71.2%, which is 1.5% and 0.7% higher than the WDec and PtDec. The pre-trained BERT is also adopted as the encoder, and the results are recorded in the last lines. With the help of the pre-trained model, the proposed model obtains the best F1 scores on the three datasets, and the performance outperformsthe strongest baseline by 5.7%, 5.6%, and 4.4%, respectively. The performance improvements of the BERT-based highlights the importance of prior knowledge in the pre-trained language model.

Table 4
Results of different methods on NYT and WebNLG datasets. ‘p’ represents the model uses a partial matching metric; ‘s’ stands for the model adopt the strict matching metric

Method NYT24 NYT29 WebNLG

P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%)

CopyRE-one [10] (p) 61.2 53.0 57.1 - - - 32.2 28.9 30.5

CopyRE-mul [10] (p) 61.0 56.6 58.7 56.9 45.2 50.4 37.7 36.4 37.1

GraphR-1p [29] (p) 62.9 57.3 60.0 - - - 42.3 39.2 40.7

GraphR-2p [29] (p) 63.9 60.0 61.9 - - - 44.7 41.1 42.9

HRL [31] (s) 78.1 77.1 77.6 69.2 60.1 64.3

Seq2Seq+RL ([12]) (p) 77.9 67.2 72.1 - - - 63.3 59.9 61.6

CopyMTL-One ([11]) (s) 72.7 69.2 70.9 - - - 63.3 59.9 61.6

CopyMTL-Mul ([11]) (s) 75.7 68.7 72.0 - - - 60.9 61.1 61.0

WDec ([13]) (s) 87.6 76.2 81.5 77.7 60.8 68.2 90.6 56.7 69.7

PtDec ([13]) (s) 81.8 75.3 78.4 73.2 62.4 67.3 72.3 68.9 70.5

STP (s) 84.9 80.5 82.7 74.7 63.9 69.8 71.5 70.8 71.2

BSTP (s) 87.9 86.2 87.2 78.7 69.7 73.8 75.1 74.6 74.9

Method	NYT24	NYT29	WebNLG
CopyRE-one [10] (p)	61.2	53.0	57.1	-	-	-	32.2	28.9	30.5
CopyRE-mul [10] (p)	61.0	56.6	58.7	56.9	45.2	50.4	37.7	36.4	37.1
GraphR-1p [29] (p)	62.9	57.3	60.0	-	-	-	42.3	39.2	40.7
GraphR-2p [29] (p)	63.9	60.0	61.9	-	-	-	44.7	41.1	42.9
HRL [31] (s)	78.1	77.1	77.6	69.2	60.1	64.3
Seq2Seq+RL ([12]) (p)	77.9	67.2	72.1	-	-	-	63.3	59.9	61.6
CopyMTL-One ([11]) (s)	72.7	69.2	70.9	-	-	-	63.3	59.9	61.6
CopyMTL-Mul ([11]) (s)	75.7	68.7	72.0	-	-	-	60.9	61.1	61.0
WDec ([13]) (s)	87.6	76.2	81.5	77.7	60.8	68.2	90.6	56.7	69.7
PtDec ([13]) (s)	81.8	75.3	78.4	73.2	62.4	67.3	72.3	68.9	70.5
STP (s)	84.9	80.5	82.7	74.7	63.9	69.8	71.5	70.8	71.2
BSTP (s)	87.9	86.2	87.2	78.7	69.7	73.8	75.1	74.6	74.9

5.4.2 Evaluation for different types of triplets

To further study the capabilities of the proposed methods for extraction of the overlapping triplets, this subsection conducts an extended experiment on different types of triplets. As the NYT24 and NYT29 have a similar set of relations and text, the NYT24 and WebNLG are chosen as the experimental datasets. The NYT24 and WebNLG dataset are split into three sub-categories: Normal, EPO, and SEO. Then, evaluations are conducted on different types of sub-datasets, and the results are presented in Figure 5. As is illustrated in Figure 5, most baselines on the types of Normal, EPO, and SEO show a decreasing trend. It reflects that the difficulty is increasing over the extraction of triplets from the EPO and SPO categories. In contrast, our model attains strong performance over the baselines, especially for the more challenging patterns SEO and EPO. On the dataset NYT24, BSTP achieves the best F1 scores on the three types of patterns. For the dataset WebNLG, BSTP still has the best results in the Normal and SEO class. Although Seq2seqRl has achieved better results in the EPO category on WebNLG, BSTP is generally superior to all other models.

Fig. 5

F1 scores by different overlapping patterns on the NYT24 and WebNLG.

5.4.3 Evaluation for different numbers of triplets

This subsection further validates the proposed method’s capability to extract triplets from the sentence with a different number of triplets. We split the NYT24 and WebNLG dataset into five categories according to the number of the triplets in the sentence. Then experiments were conducted on each category. Figure 6 gives the F1 scores of the models on NYT24, and Figure 7 records the F1 scores achieved on WebNLG. From the results, it can be clearly seen that the F1 scores of the first two models (MulDecoder and Graph_2st) show a declining trend as the number of triplets increased. For the next four models, the F1 scores show a tendency to rise and then decrease. This indicates the Seq2Seq model has its superiority to handle the sentence containing multiple triplets, as the entities and relations can be repeated several times when needed to participate in different triplets. In NYT24, the proposed model achieves the best F1 score when the number of triplets is 4. In WebNLG, the best F1 score is obtained when the number of triplets is 2. For each category, BSTP achieves competitive performance compared with all the baselines on NYT24 and WebNLG, showing its superiority to extract triplets from complicated scenarios.

Fig. 6

Comparison on the NYT24 dataset for extracting different number of triplets.

Fig. 7

Comparison on the WebNLG dataset for extracting different number of triplets.

5.5 Analysis and discussion

5.5.1 Performance analysis

This subsection gives a deep analysis of the results achieved on three datasets. Based on the results from Table 2, it can be seen that there is a significant gap between the performance on NYT24 with other two datasets. It reflects the extracting of triplets on NYT29, and WebNLG is more complicated than on NYT24. For the NYT29, the overlapping triplets on the training dataset are about 28%. In contrast, the training data of NYT24 contains a large number of overlapping triplets, accounting for more than 43%, making the model trained on NYT24 are more flexible than on NYT29. The dataset WebNLG contains 246 types of relations, almost 10 times the size of NYT24. Besides, the WebNLG dataset contains fewer training samples, making the performance of the models on WebNLG worse than that on NYT24.

5.5.2 Ablation study

To verify the syntax information is helpful for the joint extraction task. Ablation experiments are conducted, and the results are given in Table 5. As is observed, the syntax-guided model achieves better performance than the origin transformer-based and BERT-based models with the help of the syntax structure. It demonstrates that the syntax structure helps the model to obtain a better representation of the input sentences.

Table 5
Ablation experiment of the syntax-guided networks on the three datasets

Method NYT24 NYT29 WebNLG

P (%) R (%) F1 (%) P (%) R (%) F1 (%) P (%) R (%) F1 (%)

Transformer+Pointer 84.5 80.1 81.8 74.2 63.3 67.8 69.8 69.1 69.3

Syn-Transformer+Pointer 84.9 80.5 82.7 74.4 63.6 68.8 71.5 70.8 71.2

BERT-Transformer-Pointer 87.6 85.4 86.3 78.2 68.8 72.8 74.00 72.00 72.40

BERT-syn-sformer+Pointer 87.9 86.2 87.2 78.7 69.7 73.8 75.10 74.60 74.90

Method	NYT24	NYT29	WebNLG
Transformer+Pointer	84.5	80.1	81.8	74.2	63.3	67.8	69.8	69.1	69.3
Syn-Transformer+Pointer	84.9	80.5	82.7	74.4	63.6	68.8	71.5	70.8	71.2
BERT-Transformer-Pointer	87.6	85.4	86.3	78.2	68.8	72.8	74.00	72.00	72.40
BERT-syn-sformer+Pointer	87.9	86.2	87.2	78.7	69.7	73.8	75.10	74.60	74.90

5.5.3 Computational cost

Table 6 reports the model size and inference speed of our method compared with two competitive methods: WDec and PtDec. The inference speed is measured by the number of batches processed per second during inference. From the results, it can be observed that WDec is the lowest in the decoding process. Because WDec generates the triplets on entity-level or word-level rather than triplet-level. It requires more decoding steps to produce a triplet than the other three models. Comparing the model size of PTdec and STP, STP is larger than PTdec, but it decodes faster than PTdec. The reason is that STP produces the relations and entity pairs in parallel, while PTdec still needs to determine the entity boundary before decoding the relation. BSTP is the largest due to the use of the pre-trained model, but it even decodes faster than the WDec model.

Table 6
Comparison of the model size and decoding speed (Bat/s refers to the number of batches processed per second)

Method NYT24 NYT29 WebNLG

Model size Testing speed Model size Testing speed Model size Testing speed

WDec 27M 1.85 Bat/s 26M 1.91 Bat/s 16M 4.62 Bat/s

PtDec 24M 6.58 Bat/s 23M 6.15 Bat/s 13M 9.16 Bat/s

STP 48M 7.49 Bat/s 48M 7.35 Bat/s 37M 9.82 Bat/S

BSTP 154M 3.03 Bat/s 154M 2.73 Bat/s 140M 5.02 Bat/s

5.5.4 Visualization of attention map

This subsection visualizes the attention map of the encoder-decoder multi-headed attention layer, and some examples are illustrated in Figure 8. In the attention map, the vertical axis represents the relationship category contained in the sentence, and the horizontal axis represents the attention weight gained by each word in the sentence when extracting the entities corresponding to each relationship. In the first case, the model gives more attention to the words ‘See’ and ‘Louis’ when extracting the first fact (See, deathplace, Louis). For the second fact (See, birthplace, Dallas), the model makes attention to the important words, such as ‘See’, ‘born’, ‘Dallas’, and so on. In the second case, the entity pair (Wal - Mard, SamWalton) contains multiple relations. The model pays accurate attention to the keywords ‘Wal-Mart’, ‘Sam Walton’ and extract the triplets correctly. In the third case, the model give accurate attention weight to the words ‘Japan’ and ‘Tokyo’, and extracted multiple triplets from the sentence correctly. It can be observed that there are no clear clues in the third case to indicate the entity pair ‘(Japan, Tokyo)’ contains the relationship ‘capital’. Our method still accurately predict the fact (Japan, capital, Tokyo). The reason for this is that the pre-trained model BERT contains large prior knowledge which helps the model to recognize this facts.

Fig. 8

Visualization of the attention map for the multi-headed attention layer.

5.5.5 Error analysis

An error analysis is conducted in this section to further explore the factors that affect the model’s performance. This subsection analyzes the model performance in predicting different elements of the triplets, and the results are presented in Table 7. E1 and E2 represent the model predicts the subject entity or object entity correctly. (E1, R) stands for the subject and relation is predicted correctly. (R,E2) indicates the model correctly predicts the relation and object. (E1, E2) means that both the subject and object are extracted correctly. (E1, R, E2) acts for a triplet is extracted correctly.

Table 7
Performance of the model on predicting different elements of the triplets

Method NYT24 NYT29 WebNLG

P(%) R(%) F1(%) P(%) R(%) F1(%) P(%) R(%) F1(%)

E1 94.65 92.82 93.72 87.21 76.49 81.50 95.79 96.52 96.15

E2 94.13 92.31 93.21 87.48 76.73 81.76 93.22 93.92 93.57

R 95.64 93.79 94.81 89.62 78.61 83.76 90.89 91.58 91.24

(E1, R) 91.66 89.89 90.76 82.99 72.79 77.55 82.23 82.86 82.54

(R, E2) 90.50 88.75 89.62 82.25 72.14 76.86 77.33 77.92 77.63

(E1, E2) 90.41 88.66 89.52 81.12 71.15 75.81 86.00 86.65 86.32

(E1, R, E2) 87.90 86.20 87.00 78.40 69.50 73.70 75.10 74.60 74.90

Method	NYT24	NYT29	WebNLG
E1	94.65	92.82	93.72	87.21	76.49	81.50	95.79	96.52	96.15
E2	94.13	92.31	93.21	87.48	76.73	81.76	93.22	93.92	93.57
R	95.64	93.79	94.81	89.62	78.61	83.76	90.89	91.58	91.24
(E1, R)	91.66	89.89	90.76	82.99	72.79	77.55	82.23	82.86	82.54
(R, E2)	90.50	88.75	89.62	82.25	72.14	76.86	77.33	77.92	77.63
(E1, E2)	90.41	88.66	89.52	81.12	71.15	75.81	86.00	86.65	86.32
(E1, R, E2)	87.90	86.20	87.00	78.40	69.50	73.70	75.10	74.60	74.90

In the dataset NYT24 and NYT29, the performance on R is higher than E1 and E2, demonstrating that extracting relations is easier than extracting the entities. It also finds that there is an apparent gap between the F1-score on E1/E2/R and (E1, R)/(R, E2)/(E1, E2). It reflects that many triplets are only correctly identified for one element. The performance of (E1, R, E2) is lower than the (E1, R)/(R, E2)/(E1, E2), implicating that most of the entities and relations are correctly identified, but some of them fail to form a valid triplet.

In the dataset WebNLG, it can be seen that the performance of E1/E2 is comparatively larger than R. It shows that extracting relations is more complicated than extracting entities on the WebNLG dataset. The reason is that WebNLG contains more relation types than NYT24 and NYT29. What’s more, it can be observed that the performance of (E1, R)/(R, E2)/(E1, R, E2) is significantly reduced compared with the E1/E2/R/(E1, E2). It shows that misidentifying the relations will bring more performance degradation than misidentifying the entities.

6 Conclusion and future work

In this paper, a novel Seq2Seq architecture is presented to handle the overlapping issue in the joint extraction task, which combines transformer and pointer networks to extract entities and relations. Unlike traditional Seq2Seq methods, the new model generates the triplets on triplet-level; thus, it can capture the triplet-level dependencies and speed up the decoding process. Furthermore, a syntax-guided network is introduced to explicitly incorporate the sentence’s syntactic information into the encoder, helping the model pay more attention to the important words of the sentence. Extensive experiments were conducted on the NYT and WebNLG datasets to evaluate the proposed method. The experimental results show that our proposed model works well with the overlapping triplets in complex scenarios and outperforms other baselines.

In this study, we utilize an autoregressive approaches to generate the triplet one by one in a certain order. In the future, we would like to apply a non-autoregressive parallel decoding method, which can directly output the final set of triples in one shot. Further, we would like to promote our research’s practical application in knowledge base construction and web information extraction.

Footnotes

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61871278, the Sichuan Science and Technology Program (no. 2018HH0143).

References

Komninos

and Manandhar

, Feature-rich networks for knowledge base completion, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vancouver, Canada: Association for Computational Linguistics (2017), 324–329. [Online]. Available: https://www.aclweb.org/anthology/P17-2051.

Oramas

, Espinosa-Anke

, Sordo

, Saggion

and Serra

, Information extraction for knowledge base construction in the music domain, Data Knowledge Engineering 106 (2016), 70–83. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0169023X16300416.

Martinez-Rodriguez

J.L.

, Lopez-Arevalo

and Rios-Alvarado

A.B.

, Openie-based approach for knowledge graph construction from text, Expert Systems with Applications 113 (2018), 339–355. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417418304329.

Wang

, Wang

, Rastegar-Mojarad

, Moon

, Shen

, Afzal

, Liu

, Zeng

, Mehrabi

, Sohn

and Liu

, Clinical information extraction applications: A literature review, Journal of Biomedical Informatics 77 (2018), 34–49. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1532046417302563.

Zhang

, Zhang

, Ren

, Qiu

, Ma

and Sun

, Extracting comprehensive clinical information for breast cancer using deep learning methods, International Journal of Medical Informatics 132 (2019), 103985. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1386505619310068.

and Ji

, Incremental joint extraction of entity mentions and relations, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland: Association for Computa tional Linguistics, (2014), 402–412. [Online]. Available: https://www.aclweb.org/anthology/P14-1038.

Zheng

, Hao

, Lu

, Bao

, Xu

, Hao

and Xu

, Joint entity and relation extraction based on a hybrid neural network, Neurocomputing 257 (2017), 59–66. machine Learning and Signal Processing for Big Multimedia Analysis. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231217301613.

Miwa

and Bansal

, End-to-end relation extraction using LSTMs on sequences and tree structures, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics (2016), 1105–1116. [Online]. Available: https://www.aclweb.org/anthology/P16-1105.

Bekoulis

, Deleu

, Demeester

and Develder

, Joint entity recognition and relation extraction as a multi-head selection problem, Expert Systems with Applications 114 (2018), 34–45. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S095741741830455X.

10.

Zeng

, Zeng

, He

, Liu

and Zhao

, Extracting relational facts by an end-to-end neural model with copy mechanism, in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics (2018), 506–514. [Online]. Available: https://www.aclweb.org/anthology/P18-1047.

11.

Zeng

, Zhang

and Liu

, Copymtl: Copy mechanism for joint extraction of entities and relations with multi-task learning, in Proceedings of The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) (2020).

12.

Zeng

, He

, Zeng

, Liu

and Zhao

, Learning the extraction order of multiple relational facts in a sentence with reinforcement learning, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, (2019), 367–377. [Online]. Available: https://www.aclweb.org/anthology/D19-1035.

13.

Nayak

and Ng

H.T.

, Effective modeling of encoderdecoder architecture for joint entity and relation extraction, in Proceedings of The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) (2020).

14.

Vaswani

, Shazeer

, Parmar

, Uszkoreit

, Jones

, Gomez

A.N.

, Kaiser

L.u.

and Polosukhin

, Attention is all you need, in Advances in Neural Information Processing Systems 30, I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett, Eds. Curran Associates, Inc., (2017), 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

15.

Vinyals

, Fortunato

and Jaitly

, Pointer networks, in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’15. Cambridge, MA, USA: MIT Press, (2015), 2692–2700.

16.

Devlin

, Chang

M.-W.

, Lee

and Toutanova

, BERT: Pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics (2019), 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423.

17.

Roth

and Yih

W.-t.

, Global inference for entity and relation identification via a linear programming formulation, Introduction to Statistical Relational Learning (2007), 553–580.

18.

Chan

Y.S.

and Roth

, Exploiting syntactico-semantic structures for relation extraction, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, (2011), 551–560. [Online]. Available: https://www.aclweb.org/anthology/P11-1056.

19.

Zhou

, Su

, Zhang

and Zhang

, Exploring various knowledge in relation extraction, in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). Ann Arbor, Michigan: Association for Computational Linguistics (2005), 427–434. [Online]. Available: https://www.aclweb.org/anthology/P05-1053.

20.

Lample

, Ballesteros

, Subramanian

, Kawakami

and Dyer

, Neural architectures for named entity recognition, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics (2016), 260–270. [Online]. Available: https://www.aclweb.org/anthology/N16-1030.

21.

Chiu

J.P.

and Nichols

, Named entity recognition with bidirectional LSTM-CNNs, Transactions of the Association for Computational Linguistics 4 (2016), 357–370. [Online]. Available: https://www.aclweb.org/anthology/Q16-1026.

22.

S.-H.

, Kim

, Min

and Kim

, Improving lstm crfs using character-based compositions for korean named entity recognition, Computer Speech Language 54 (2019), 106–121. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0885230817300852.

23.

Zeng

, Liu

, Lai

, Zhou

and Zhao

, Relation classification via convolutional deep neural network, in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin, Ireland: Dublin City University and Association for Computational Linguistics (2014), 2335–2344. [Online]. Available: https://www.aclweb.org/anthology/C14-1220.

24.

Liu

, Yang

and He

, Multi-level semantic representation enhancement network for relationship extraction, Neurocomputing 403 (2020), 282–293. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231220306238.

25.

Miwa

and Sasaki

, Modeling joint entity and relation extraction with table representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics (2014), 1858–1869. [Online]. Available: https://www.aclweb.org/anthology/D14-1200.

26.

Ren

, Wu

, He

, Qu

, Voss

C.R.

, Ji

, Abdelzaher

T.F.

and Han

, Cotype: Joint extraction of typed entities and relations with knowledge bases, in Proceedings of the 26th International Conference on World Wide Web, ser. WWW ’17. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee (2017), 1015–1024. [Online]. Available: https://doi.org/10.1145/3038912.3052708.

27.

Zheng

, Wang

, Bao

, Hao

, Zhou

and Xu

, Joint extraction of entities and relations based on a novel tagging scheme, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics (2017), 1227–1236. [Online]. Available: https://www.aclweb.org/anthology/P17-1113.

28.

Wang

, Zhang

, Che

and Liu

, Joint extraction of entities and relations based on a novel graph scheme, in IJCAI (2018), 4461–4467.

29.

T.-J.

, Li

P.-H.

and Ma

W.-Y.

, GraphRel: Modeling text as relational graphs for joint entity and relation extraction, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics (2019), 1409–1418. [Online]. Available: https://www.aclweb.org/anthology/P19-1136.

30.

Mintz

, Bills

, Snow

and Jurafsky

, Distant supervision for relation extraction without labeled data, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics (2009), 1003–1011. [Online]. Available: https://www.aclweb.org/anthology/P09-1113.

31.

Takanobu

, Zhang

, Liu

and Huang

, A hierarchical framework for relation extraction with reinforcement learning, in Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 7072–7079.

32.

Kingma

D.P.

and Ba

, Adam: A method for stochastic optimization, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds. (2015). [Online]. Available: http://arxiv.org/abs/1412.6980.

33.

Feng

, Zhang

, Hao

and Chen

, Joint extraction of entities and relations using reinforcement learning and deep learning, Computational Intelligence and Neuroscience 2017 (2017), 7 643 065–7 643 065

34.

Liang

, Xu

and Zhao

, Combining word-level and character-level representations for relation classification of informal text, in Proceedings of the 2nd Workshop on Representation Learning for NLP. Vancouver, Canada: Association for Computational Linguistics (2017), 43–47. [Online]. Available: https://www.aclweb.org/anthology/W17-2606.

35.

Zhou

, Liu

, Luo

, Chen

, Qing

and He

, Joint entity and relation extraction based on reinforcement learning, IEEE Access 7 (2019), 125688–125699.

36.

John

and Christopher D.

, A Structural Probe for Finding Syntax in Word Representations, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, Jun. 2019. [Online]. Available: https://www.aclweb.org/anthology/N19-1419.

37.

Ganesh

, Benoít

and Djamá

, What Does BERT Learn about the Structure of Language? in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, Jul. (2019). [Online]. Available: https://www.aclweb.org/anthology/P19-1356.

38.

J.L.

, Kiros

J.R.

and Hinton

G.E.

, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).

39.

Gardent

, Shimorina

, Narayan

and Perez-Beltrachini

, Creating Training Corpora for NLG Micro-Planning, in 55th annual meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, (2017). [Online]. Available: https://hal.inria.fr/hal-01623744.

40.

Manning

, Surdeanu

, Bauer

, Finkel

, Bethard

and McClosky

, The Stanford CoreNLP natural language processing toolkit, in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, Maryland: Association for Computational Linguistics (2014), 55–60. [Online]. Available: https://www.aclweb.org/anthology/P14-5010.

41.

Kingma

D.P.

and Ba

, Adam: A method for stochastic optimization, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, (2015), Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980.

42.

Paszke

, Gross

, Massa

, Lerer

, Bradbury

, Chanan

, Killeen

, Lin

, Gimelshein

, Antiga

, et al., Pytorch: An imperative style, high-performance deep learning library, in Advances in neural information processing systems (2019), 8026–8037.

Extracting relational facts based on hybrid Syntax-Guided transformer and pointer network

Abstract

Keywords

1 Introduction

2.1 Pipeline method

2.2 Joint learning method

2.3 Seq2seq method

3 Problem statement

4.1.1 Token embedding

4.1.2 Sentence encoding based on syn-encoder

5.1 Datasets

Table 1 Statistics of the three public datasets categories NYT24 NYT29 WebGNL Train Test Train Test Train Test sentence 56,195 5,000 63,306 4,006 5,019 703 relations 24 24 29 29 246 246 Normal 37,013 3,266 53,444 2,951 1,596 246 EPO 9,782 978 8,379 715 227 26 SEO 14,735 1,297 9,862 757 3,406 457

5.4.1 Overall evaluation results

5.5.1 Performance analysis

5.5.2 Ablation study

Footnotes

Acknowledgments

References