A multiple head selection joint entity-relation extraction model

Abstract

Entity extraction involves several complex problems, such as nested entities, entity boundary recognition, context ambiguity, and multi-instance entity recognition. Entity nesting is an important challenge in relation extraction, and its main cause is that the boundary information between entities is unclear. To solve the entity nesting problem at the fragment level, while preserving the relationship between fragments with the same characteristics and improving efficiency, we propose a new fragment annotation method. Building on the traditional fragment annotation method and combining it with the pointer annotation method, we design an annotation method of "ergodic enumeration + group mapping". On the basis of this method, we design an entity extraction model: the Span-Extraction Based Entity Extraction Model (LMA). We validate the model on the English datasets New York Times (NYT) and WebNLG, where it shows significant improvements in F1 score over the baseline models and effectively alleviates the above problems.

Keywords

Entity extraction, relation extraction, nested entity, context ambiguity

1 Introduction

Relational Triple Extraction (RTE) [1], also known as joint entity relationship extraction, is a classic task in the field of information extraction. It aims to extract structured relational triples (Subject, Relationship, Object) from text in order to build knowledge graphs. Entity relationship extraction can be divided into pipeline extraction and joint extraction. The pipeline model divides the task into two subtasks: entity recognition is carried out first, and relationship classification is then performed given the recognized entities and the sentence. The pipeline model is easy to implement and highly flexible, but it handles entity overlap poorly and is prone to exposure bias. Exposure bias refers to the phenomenon where, during training, each input is conditioned on the true label from the real sample, whereas during testing the input is conditioned on the output of the previous step. Entity overlap refers to different relationship triples sharing some identical entities, as shown in Table 1. Joint entity and relationship extraction models entities and relationships in a single step, in contrast to the pipeline approach, where entity recognition and relationship extraction are done separately. Joint extraction exploits the latent connection between the two tasks, which alleviates error accumulation to some extent. However, entity overlap and exposure bias remain challenging. To address these issues, this paper proposes a joint relationship extraction model that leverages partial segment tagging and deals effectively with entity overlap and exposure bias.

Table 1
Examples of entity overlap

Entity Overlap Type	Text	Triplets
Normal	Yao Ming was born in China.	(Yao Ming, birthplace, China)
SEO (Single Entity Overlap)	Yao Ming was born in China and graduated from Shanghai Jiaotong University.	(Yao Ming, birthplace, China) (Yao Ming, graduated from, Shanghai Jiaotong University)
EPO (Entity Pair Overlap)	San Francisco is a major city in California.	(California, contains, San Francisco) (California, major city, San Francisco)

2 Related work

Entity relationship extraction can be divided into two main categories based on the model structure: pipeline extraction and joint entity and relationship extraction. Pipeline extraction primarily leverages recurrent neural networks (RNN), convolutional neural networks (CNN), and other neural network structures. Socher [2] was the first to apply RNN to relational extraction models, effectively addressing the challenge of capturing the meaning of long phrases. Zeng [3] pioneered the use of CNN for extracting word and sentence-level features in the context of relationship extraction. Nguyen [4] proposed a convolutional neural network (CNN)-based model that employs multiple window-size convolution kernels to effectively capture implicit features in text. This approach reduces the reliance on external toolkits and sequence-to-sequence methods, potentially leading to better performance and increased efficiency. Wu [5] proposed a relationship extraction model based on the attention mechanism at the neuron block level. Greff [6] proposed an LSTM based method for relation extraction, which is based on the shortest path of the syntactic dependency analysis tree. It integrates word vectors, part of speech, syntax, and other features for relational classification. Zhang [7] used the BiLSTM model to extract relations by combining the information before and after the current word. Har [8] proposed a method for entity recognition and relationship classification, which leverages part-of-speech and syntactic dependency features in the input layer to extract relationships between entities.

While the pipeline method for entity and relationship extraction is straightforward, it has some inherent limitations. (1) Error accumulation: errors in entity extraction further affect the results of relationship extraction. (2) Entity redundancy: relationship classification requires the pre-extracted entities to be matched in pairs, so when a sentence has multiple entity pairs, multiple <sentence, e1, e2> inputs need to be constructed and classified separately. (3) Lack of interaction: ignoring the internal connections and dependencies between the two tasks limits extraction performance. In recent studies on pipeline models, Qin [9] applied reinforcement learning to extract entities and relationships separately. Although such pipeline models eventually produced relatively good results, they ignored the interdependence between entities and relationships. To address this issue, Zhong and Chen [10] proposed a pipeline method that incorporates entity information into relationship extraction. However, this method still suffers from error propagation, so better solutions need to be explored. Early pipeline work failed to capture the implicit correlation between the two subtasks, which made these methods highly susceptible to error propagation. To address these issues, recent research has primarily focused on joint entity and relationship extraction. Joint extraction models can be classified into two main categories: feature-based models and end-to-end joint extraction models.
Feature-based models [11] introduce a relatively complex feature engineering process and rely heavily on natural language processing tools and laborious manual work for feature extraction. Sequence-to-Sequence (Seq2Seq) is a general end-to-end sequence learning method, primarily based on the encoder-decoder architecture. Zheng [12] introduced a unified tagging scheme and transformed the relationship extraction problem into a sequence tagging problem. To address the issue of overlapping triplets, Zeng [13] applied a Seq2Seq model with a copy mechanism. Although these models are effective to a degree, most of them still cannot handle complex scenarios in which a sentence contains multiple overlapping relationship triplets. Ye [14] proposed generative models that view a triple as a token sequence. Wei [15], Yuan [16], Zheng [17] and Li [18] decompose RTE into different subtasks. However, those methods learn the interaction between subtasks solely through input sharing, which can lead to cascading errors.

3 Model

LMA consists of an encoding layer and a decoding layer, as shown in Figure 1. We use a BERT pre-trained model [19] with single-character tokens for word unit encoding. The pre-trained model outputs word and sentence vectors that contain a large amount of external semantic information. For fragment encoding, we combine the word and sentence vectors generated by the pre-trained model with a fragment embedding method to generate initial fragment vectors. We then use a long short-term memory network and a multi-head self-attention mechanism [20] to extract fragment features, which enables the direct construction of fragment semantic vectors. The encoding layer uses fragment markers and different strategies to map the positional relationships between tokens and their corresponding fragments. The decoding layer uses a multi-head selection mechanism and enumerates the relationships between fragment pairs to decode the entity relationship triplets, thereby avoiding exposure bias, relationship overlaps, and error accumulation.

Fig. 1

The framework of LMA.

3.1 Fragment extraction method

At present, mainstream pre-trained models encode sentences at the level of word elements (tokens). As a result, labeling models based on word markers can only predict the label of a single word element. During recognition, the results for multiple word elements must be combined, which can lead to error accumulation and makes nested entities difficult to resolve. In this paper, we address the nesting problem by enumerating fragments, which ensures that all candidate entities appear among the enumerated fragments. We also design a fragment marking method that combines the position indices of the first and last word elements of a fragment into an index tuple that marks the fragment. The main idea of fragment tagging is to use this index tuple to annotate each fragment with its location index in the fragment embedding matrix. For instance, consider the sentence "San Francisco is a major city in California". Assuming that tokens are set based on terms and the maximum length of the sliding window is 7, each token in the text is read in turn as a starting point. The token "San" is marked as (0, 0), "San Francisco" is marked as (0, 1), "San Francisco is" is marked as (0, 2), and so on. When the fragment length reaches the maximum window length, the current round of extraction is complete. Before the next round, the "group mapping" task for the corresponding group is performed, followed by the parameter settings for the next round. Meanwhile, the tagging order of fragments is mapped using one of three strategies: the fragment same start mapping strategy, the fragment same end mapping strategy, and the fragment same length mapping strategy. The fragment marker diagram is shown in Figure 2.
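The enumeration described above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and zero-based indices; the function name `enumerate_fragments` is hypothetical and is not taken from the authors' implementation.

```python
def enumerate_fragments(tokens, max_window):
    """Return (start, end) index tuples for every fragment of at most max_window tokens."""
    fragments = []
    for start in range(len(tokens)):
        # The right border of the window grows until the maximum window
        # length or the end of the text is reached.
        last_end = min(start + max_window, len(tokens))
        for end in range(start, last_end):
            fragments.append((start, end))
    return fragments


tokens = "San Francisco is a major city in California".split()
fragments = enumerate_fragments(tokens, max_window=7)

# The first three fragments are "San", "San Francisco", "San Francisco is".
for start, end in fragments[:3]:
    print((start, end), " ".join(tokens[start:end + 1]))
```

Because every start position is paired with every admissible end position, a nested entity and the longer entity that contains it both appear as separate candidates.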

Fig. 2

Span Marking Diagram

3.1.1 Fragment mapping method

The location index of a fragment in the fragment embedding matrix depends on the fragment mapping strategy. The mapping strategy for the same start point of a fragment (SSF): fix the start point of the text fragment and vary its length by moving the end point, that is, the right border of the sliding window, until the maximum window length is reached; fragments with the same start point are grouped together. The mapping strategy for the same length of a fragment (SLF): fix the window length and traverse the text, then adjust the window length and traverse the text again, until the maximum window length is reached; fragments with identical lengths are extracted and grouped together. The mapping strategy for the same end point of a fragment (SEF): fix the end point of the text fragment and vary its length by moving the start point, that is, the left border of the sliding window, until the maximum window length is reached; fragments with the same end point are grouped together.
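As a rough illustration, the three strategies can be viewed as three orderings of the same set of (start, end) tuples. The Python sketch below makes that concrete; the function names are hypothetical, and the order of fragments inside each SEF group (the left border moving leftwards from the end point) is an assumption, since the text does not fix it.

```python
def ssf_order(n, max_window):
    """Same start point: group by start, grow the right border."""
    return [(s, e) for s in range(n) for e in range(s, min(s + max_window, n))]


def slf_order(n, max_window):
    """Same length: traverse the text once per window length."""
    return [(s, s + length - 1)
            for length in range(1, min(max_window, n) + 1)
            for s in range(n - length + 1)]


def sef_order(n, max_window):
    """Same end point: group by end, move the left border leftwards (assumed order)."""
    return [(s, e) for e in range(n) for s in range(e, max(e - max_window, -1), -1)]


n, max_window = 4, 2
ssf, slf, sef = ssf_order(n, max_window), slf_order(n, max_window), sef_order(n, max_window)

# All strategies enumerate the same fragments; only the position index differs.
print(sorted(ssf) == sorted(slf) == sorted(sef))
print(ssf.index((1, 2)), slf.index((1, 2)), sef.index((1, 2)))
```

In this toy setting the fragment (1, 2) receives a different position index under each strategy, which is the property the mapping strategies are meant to control.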

3.1.2 Group mapping method

Before proceeding to the next round of enumeration, we first perform the "group mapping" task for the current group. The purpose of this task is to build an index in the matrix for the extracted fragments, following the order in which the groups were extracted, and to preserve the common features of a group of fragments through the mapping logic. We take SSF as an example. We first package all the fragments extracted in this round that share the start coordinate 0 into a group and then fill them into the matrix sequentially. For example, as shown in Figure 3, the first fragment "San" in the first group has the fragment coordinate (0,0) and is mapped to the cell of the two-dimensional matrix whose start coordinate is 0 and whose end coordinate is 0. Similarly, all indices of the first group are filled in horizontally. After the mapping is completed, the parameters for the next round are set: the left side of the window (Start) moves one position to the right, the right side of the window (End) moves back to the same coordinate as the left side, and the next round begins. Each new round is mapped to a new row of the matrix, so that a line break separates different groups. When the last fragment, with coordinate (6,6), is extracted, its start and end coordinates are equal and the end of the text has been reached, so the group is packaged and mapped. Once all groups have been mapped, all rounds end and the fragment extraction task is complete. Because each row in the matrix represents one group, this structure preserves the shared characteristics of fragments with the same starting point.
Finally, we flatten the two-dimensional matrix into a one-dimensional list according to the index order, and this one-dimensional list is the output of the fragment extraction, which will serve as the input for the entity classification task.
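A minimal sketch of group mapping under SSF, followed by the flattening step, might look as follows. It is an illustration under stated assumptions rather than the authors' code: each matrix row is a plain Python list holding one group, and the names `group_mapping_ssf` and `flatten` are hypothetical.

```python
def group_mapping_ssf(n, max_window):
    """Build the two-dimensional structure: one row per group of same-start fragments."""
    matrix = []
    for start in range(n):
        # One round: fix Start, move End to the right, then package the group.
        row = [(start, end) for end in range(start, min(start + max_window, n))]
        matrix.append(row)
    return matrix


def flatten(matrix):
    """Flatten the rows in index order and record each fragment's position index."""
    flat = [fragment for row in matrix for fragment in row]
    index_of = {fragment: k for k, fragment in enumerate(flat)}
    return flat, index_of


matrix = group_mapping_ssf(n=4, max_window=3)
flat, index_of = flatten(matrix)

print(matrix[0])         # the group sharing start coordinate 0
print(index_of[(1, 1)])  # position index of the first fragment of the second group
```

The dictionary `index_of` plays the role of the lookup from an index tuple to its position in the one-dimensional list that is passed on to entity classification.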

Fig. 3

Three different mapping strategies

3.2 Entity classification principle

First, the one-dimensional list output by the group mapping method is fed into the entity classification layer. As shown in Figure 4, the same fragment has different index values depending on the mapping strategy used. We then process the input fragment list using an LSTM and a multi-head self-attention mechanism. The result is passed through a sigmoid activation function to classify each fragment against every entity type, constructing a two-dimensional "entity fragment-entity type" matrix in which each fragment is scored for each type. The fragment feature matrix for SLF and SEF is computed in the same way as for SSF. However, because the mapping strategies differ, the feature matrices also differ, which lets the model attend to fragment features from different perspectives and adapt to different semantic information needs.
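The "entity fragment-entity type" matrix can be sketched in a few lines of Python. The example assumes that per-fragment logits are already available from the feature extractor and that a fixed threshold of 0.5 decides membership; the threshold, the entity type names, and the function names are illustrative assumptions, not details given in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify_fragments(logits, entity_types, threshold=0.5):
    """Score every fragment against every entity type independently."""
    # Rows are fragments, columns are entity types.
    matrix = [[sigmoid(z) for z in row] for row in logits]
    labels = [[t for t, p in zip(entity_types, row) if p > threshold]
              for row in matrix]
    return matrix, labels


entity_types = ["LOC", "PER"]           # hypothetical type inventory
logits = [[2.0, -3.0],                  # fragment 0: likely a location
          [-1.0, -2.0]]                 # fragment 1: not an entity
matrix, labels = classify_fragments(logits, entity_types)
print(labels)
```

Because each cell is scored independently, a fragment may receive no type at all, which is how non-entity fragments are filtered out in this sketch.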

Fig. 4

Example entity classification layer

3.3 Coding layer

3.3.1 Token representation

First, the words are encoded based on the vocabulary of the selected BERT model. Then, the index of each combined fragment in the fragment sequence is determined using the designed fragment marker and the corresponding fragment mapping strategy. The word units are then input, and the contextual semantic information of each word unit and the overall semantic information of the sentence are extracted. The text input is expressed as S = [s₁, s₂, s₃, . . . , s_n], and the vector after BERT encoding is shown in Formula (1).

$H, \bar{H} = BERT (S)$ (1)

where H = [h₁, h₂, h₃, . . . , h_n] represents the word element vectors generated after each word element is BERT encoded, and $\bar{H}$ ∈ R^d represents the sentence vector generated after the entire text is BERT encoded. Here n is the number of word elements and d is the dimension of the BERT hidden state.

3.3.2 Fragment encoding

Fragment vectors are constructed by average pooling the word unit vectors that constitute a fragment and then concatenating the result with the sentence vector. Let H_{i:j} = { h_i, h_{i+1}, …, h_j }, H_{i:j} ∈ R^{(j-i+1)×d}, be the word vector matrix composed of all word units in the window of length w between the head token at position index i and the tail token at position index j. The fragment semantic vector is calculated as shown in Formulas (2)-(5).

$k = D ((i, j)), 0 < i \leq j \leq n, 0 \leq k < m$ (2)

$H_{k}^{'} = Meanpool (H_{i:j})$ (3)

$x_{k}^{'} = concat (H_{k}^{'}, \bar{H})$ (4)

$x_{k} = Tanh (Linear (x_{k}^{'}))$ (5)

where $H_{k}^{'} \in R^{d}$, $x_{k}^{'} \in R^{2d}$ and $x_{k} \in R^{d}$. Here m is the number of fragments in the fragment embedding matrix, k is the position index in the fragment embedding matrix of the fragment delimited by the first word element position index i and the last word element position index j, and x_k is the corresponding fragment vector. All initial fragment vectors are constructed through Formulas (2)-(5) and then arranged into the fragment embedding matrix according to the fragment markers and the mapping strategy. X = { x₁, x₂, x₃, …, x_m }, X ∈ R^{m×d}, represents the fragment embedding matrix composed of all fragment vectors.
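Formulas (3)-(5) can be traced on a toy example. The following sketch uses plain Python lists instead of a tensor library, zero-based indices, and hand-picked weights for the linear layer; the weights, dimensions, and function names are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def mean_pool(vectors):
    """Average the word unit vectors of a fragment, Formula (3)."""
    d = len(vectors[0])
    return [sum(v[c] for v in vectors) / len(vectors) for c in range(d)]


def linear(x, weight, bias):
    """Affine map from 2d to d dimensions."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def encode_fragment(H, H_bar, i, j, weight, bias):
    pooled = mean_pool(H[i:j + 1])        # Formula (3)
    x_prime = pooled + H_bar              # Formula (4): concatenation, length 2d
    return [math.tanh(v) for v in linear(x_prime, weight, bias)]  # Formula (5)


# Toy inputs with d = 2: three word unit vectors and one sentence vector.
H = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
H_bar = [0.5, -0.5]
weight = [[1.0, 0.0, 0.0, 0.0],           # hypothetical 2 x 4 weight matrix
          [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]

x_k = encode_fragment(H, H_bar, i=0, j=1, weight=weight, bias=bias)
print(x_k)
```

For the fragment covering positions 0 to 1, the pooled vector is [2.0, 1.0], the concatenated vector has length 4, and the output again has dimension d = 2.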

3.3.3 Fragment feature extraction

To enhance the extraction of deep fragment features and promote interaction among fragments, the LMA model incorporates both an LSTM [21] and a multi-head self-attention mechanism. By controlling the input content and the memory cell content through the input gate, output gate, and forget gate, the LSTM builds a memory of past input information. This gating effectively alleviates problems such as vanishing and exploding gradients and enables the LSTM to capture context and dependencies between fragments, improving overall model performance. The attention mechanism selectively focuses on important information in the text. Multi-head self-attention is a variant of the attention mechanism in which Q (query), K (key), and V (value) are equal. Multiple queries extract distinct sets of information from the input in parallel, the results are concatenated, and attention is applied jointly to information from different feature subspaces at various positions. This enables the model to capture a more comprehensive range of fragment feature information. P = { p₁, p₂, p₃, …, p_m }, P ∈ R^{m×d}, represents the fragment vectors after LSTM encoding, as shown in Formula (6).

$P = LSTM (X)$ (6)

3.4 Decoding layer

At the decoding layer, we use the multi-head selection mechanism, which aims to carry out single-step multi-head relationship extraction, so as to extract all existing entity relationships and overcome problems such as relationship overlap and exposure bias.
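One way to picture single-step multi-head selection is as a score tensor indexed by (head fragment, tail fragment, relation), from which every cell above a threshold yields a triple. The sketch below follows that reading; the score values, the 0.5 threshold, and the function name `decode_triples` are assumptions for illustration and are not specified in the paper.

```python
def decode_triples(fragments, relations, scores, threshold=0.5):
    """Emit every (subject, relation, object) whose score exceeds the threshold."""
    triples = []
    for i, subject in enumerate(fragments):
        for j, obj in enumerate(fragments):
            for k, relation in enumerate(relations):
                # Each (i, j, k) cell is judged independently, so one entity
                # pair can carry several relations at once.
                if scores[i][j][k] > threshold:
                    triples.append((subject, relation, obj))
    return triples


fragments = ["California", "San Francisco"]
relations = ["contains", "major city"]
# scores[i][j][k]: hypothetical probabilities for fragment pair (i, j) and relation k.
scores = [
    [[0.1, 0.1], [0.9, 0.8]],
    [[0.1, 0.1], [0.1, 0.1]],
]
triples = decode_triples(fragments, relations, scores)
print(triples)
```

Since all cells are decoded in one pass and no prediction is conditioned on an earlier one, the EPO example from Table 1 is recovered as two triples over the same entity pair.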

3.4.1 Adversarial training

To improve the robustness and generalization ability of the model, we add adversarial training. Adversarial training applies perturbations to the original input samples and then trains on the resulting adversarial samples. In general, adding adversarial training on top of conventional training further improves results. In our comparison, the F1 value with adversarial training was higher than the F1 value without it. The adversarial training objective and the perturbation are shown in Formulas (7) [23] and (8) [24].

$\min_{\theta} E_{(x, y) \sim D} [\max_{r_{adv} \in S} L (\theta, x + r_{adv}, y)]$ (7)

$r_{adv} = \epsilon g / \| g \|_{2}$ (8)
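Formula (8) can be evaluated directly once a gradient vector g is available. The sketch below assumes that g is the gradient of the loss with respect to the input embedding, as in the fast gradient method; the paper does not define g explicitly, so this reading, the function name, and the handling of a zero gradient are assumptions.

```python
import math


def fgm_perturbation(grad, epsilon):
    """Scale the gradient to L2 norm epsilon, following Formula (8)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No direction to move in; return a zero perturbation (assumed behavior).
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


grad = [3.0, 4.0]                 # toy gradient with L2 norm 5
r_adv = fgm_perturbation(grad, epsilon=1.0)
x = [0.2, -0.1]                   # toy input embedding
x_adv = [xi + ri for xi, ri in zip(x, r_adv)]
print(r_adv, x_adv)
```

The perturbed input x + r_adv is then used for an additional training step, which corresponds to the inner maximization in Formula (7) being approximated by a single gradient direction.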

3.5 Loss function

This paper adopts the multi-class cross-entropy loss function [25], shown in Formula (9). The loss function for fragment multi-head selection is defined in Formula (10).

$Loss = \sum_{t = 1}^{m} \sum_{k = 1}^{g} - \log P_{i} (Label = i_{k} \mid t_{i})$ (9)

${Loss}_{re} = \sum_{i = 1}^{m} \sum_{j = 1}^{m} \sum_{k = 1}^{l} - log P_{r} (Head = t_{i})$ (10)

4 Experiments

4.1 Dataset

The English datasets NYT [26] and WebNLG [27] were used in the experiments. The NYT dataset, which contains 24 predefined relations, was generated through distant supervision from New York Times articles. WebNLG, initially created for natural language generation tasks, was adapted by Zeng et al. for relational triple extraction and defines 171 relations. The basic statistics of the two datasets are shown in Table 2.

Table 2
Datasets basic information table

Statistical Attributes	NYT Train	NYT Test	WebNLG Train	WebNLG Test
Relation	24	24	171	171
Sentence	56195	5000	5019	703
Triplet	112936	10142	17120	2208

4.2 Text analysis

Texts can be divided into three categories according to entity overlap type: Normal, Entity Pair Overlap (EPO), and Single Entity Overlap (SEO). The number of texts of each type is shown in Table 3.

Table 3
Table of the number of relationship overlapping types of text

	Number of texts
Dataset	Normal	SEO	EPO
NYT (Train)	41109	8918	15606
NYT (Test)	3266	1297	978
WebNLG (Train)	3575	3675	1424
WebNLG (Test)	246	457	26
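The three overlap categories can be assigned mechanically from the triples of a sentence. The sketch below is one plausible rule set, stated as an assumption: two triples over the same unordered entity pair count as EPO, two triples that share exactly one entity count as SEO, and a sentence with neither pattern is Normal. A sentence may receive both SEO and EPO under this rule. The function name is hypothetical.

```python
from itertools import combinations


def overlap_types(triples):
    """Return the set of overlap types present among a sentence's triples."""
    types = set()
    for (s1, _, o1), (s2, _, o2) in combinations(triples, 2):
        pair1, pair2 = {s1, o1}, {s2, o2}
        if pair1 == pair2:
            types.add("EPO")      # same entity pair, different relations
        elif pair1 & pair2:
            types.add("SEO")      # exactly one shared entity
    return types or {"Normal"}


normal = [("Yao Ming", "birthplace", "China")]
seo = [("Yao Ming", "birthplace", "China"),
       ("Yao Ming", "graduated from", "Shanghai Jiaotong University")]
epo = [("California", "contains", "San Francisco"),
       ("California", "major city", "San Francisco")]

print(overlap_types(normal), overlap_types(seo), overlap_types(epo))
```

Applied to the examples of Table 1, the rule reproduces the labels given there.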

5 Setting

In this study, we ran our deep learning experiments on a single NVIDIA TITAN XP graphics card with 12 GB of running memory under the CentOS 7.9 operating system. The programming language was Python 3.7, and the deep learning framework was PyTorch 1.7. During model training, we set the maximum window length to 16, the maximum word element length to 128, and the word element vector dimension to 768. The batch size was set to 4, the learning rate to 0.00001, and the number of epochs to 100. These parameters were chosen after careful experimentation to achieve optimal results.

6 Evaluation

In this paper, we adopt precision (the ratio of true positive samples to the total number of samples predicted as positive by the model), recall (the ratio of true positive samples to the total number of actual positive samples), and F1 score (the harmonic mean of precision and recall) as metrics for evaluating the experimental results, and we apply strict criteria. The evaluation formulas are shown in Formulas (11)-(13).

$Precision = \frac{TP}{TP + FP}$ (11)

$Recall = \frac{TP}{TP + FN}$ (12)

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (13)
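Formulas (11)-(13) can be computed over sets of triples as in the following sketch. It assumes that "strict criteria" means a predicted triple counts as a true positive only if subject, relation, and object all match a gold triple exactly; this interpretation and the function name are assumptions.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over exact-match triples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)    # triples that match exactly
    fp = len(predicted - gold)    # predicted but not in the gold set
    fn = len(gold - predicted)    # gold triples that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = [("Yao Ming", "birthplace", "China"),
        ("Yao Ming", "graduated from", "Shanghai Jiaotong University")]
predicted = [("Yao Ming", "birthplace", "China"),
             ("Yao Ming", "graduated from", "Shanghai")]   # wrong object boundary

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```

In this toy case one of two predictions is correct and one of two gold triples is found, so all three metrics equal 0.5; the partially correct object earns no credit under exact matching.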

7 Experimental Results and Analysis

To verify the effectiveness of the model, we compared LMA, using the mapping strategy for the same starting point of a fragment, with seven baseline models. We designed three groups of experiments: a validation experiment, a hyperparameter experiment, and an ablation experiment.

7.1 Validation Experiment

For the validation experiment, we selected two datasets and compared LMA with seven baseline models: (a) Novel Tagging [12], which transforms the two tasks of relation extraction and entity extraction into a unified sequence annotation task; (b) CasRel [15], which employs a cascade binary tagging framework; (c) TP-Linker [28], a single-stage joint extraction model that links token pairs through annotation; (d) PRGC [17], a joint relationship extraction framework based on predicted relationships and global correspondences; (e) TDEER [29], a joint entity relationship extraction model based on a translating decoding mechanism; (f) GRTE [30], an iterative model that enhances the learning of global features, which comprise the correlation between token pairs and the correlation between relationships; and (g) EmRel [31], which explicitly integrates relation representations into the model. We reproduced the models from the original papers and obtained their results in our experimental environment. The experimental results are shown in Table 4. The LMA method proposed in this paper achieves high precision, recall, and F1 score. In comparison, Novel Tagging fails to overcome the challenges of overlapping and nested entities, resulting in a lower recall. CasRel suffers from error propagation and exposure bias due to its pipeline-style extraction, in which errors and omissions in subject extraction directly affect the subsequent relationship prediction and object extraction. TP-Linker has high annotation complexity, involves numerous redundant operations and information, and has low decoding efficiency.
Similarly, PRGC follows the pipeline extraction mode, so the accuracy of its final triplet extraction depends on both its relationship judgment and its entity extraction components. In contrast, LMA decodes entities and relationships in one step, which avoids exposure bias by design. Moreover, on NYT the model's precision and recall are close (90.1 and 92.2), which suggests that it is robust. LMA also performs well on WebNLG, which has a large number of relationships and a small amount of data, indicating that the model has good generalization and transfer ability and is suitable for various application scenarios. The experimental results indicate that the LMA method proposed in this paper can effectively alleviate the problems of entity nesting and exposure bias.

Table 4
Comparative experimental results table

Model	NYT P	NYT R	NYT F1	WebNLG P	WebNLG R	WebNLG F1
Novel Tagging	32.8	30.6	31.7	52.5	19.3	28.3
CasRel	85.1	84.5	84.8	88.8	87.2	88.0
TP-Linker	85.2	87.8	86.5	84.3	81.6	83.0
PRGC	88.9	87.5	88.2	88.3	86.8	87.5
TDEER	89.9	87.4	88.6	89.9	90.4	90.1
GRTE	85.4	83.2	84.2	89.4	56.2	69.0
EmRel	90.0	90.7	90.3	87.9	85.3	86.6
LMA	90.1	92.2	91.1	91.0	82.2	86.4

7.2 Hyperparameter Experiment

7.2.1 Experimental Results of Different Window Lengths under Three Strategies

To investigate how the maximum window length affects the experimental results under different fragment index mapping strategies, we used the three mapping strategies: "start" denotes the mapping strategy for fragments with the same starting point, "end" the mapping strategy for fragments with the same ending point, and "length" the mapping strategy for fragments with the same length. We conducted experiments with different maximum window lengths, and the results are shown in Table 5.

Table 5
Table of experimental results of different window lengths under three strategies

Dataset	Window size	Start P	Start R	Start F1	Length P	Length R	Length F1	End P	End R	End F1
NYT	6	84.8	27.4	41.5	86.4	28.1	42.4	85.6	27.7	41.9
NYT	7	85.6	19.4	62.6	87.4	50.3	63.9	86.5	49.9	63.3
NYT	8	85.9	63.3	72.9	88.3	64.4	74.5	87.1	63.5	73.4
NYT	9	86.1	75.1	80.2	88.7	77.2	82.5	81.7	68.6	74.6
NYT	10	86.1	81.2	83.7	87.9	82.1	84.9	81.6	74.6	77.9
NYT	11	86.6	84.6	85.6	88.4	86.8	87.6	87.0	87.7	84.8
NYT	12	86.8	86.6	86.5	89.3	87.7	88.5	82.6	78.3	80.4
NYT	13	87.2	87.4	87.3	87.9	87.4	87.6	78.9	78.5	78.8
NYT	14	87.5	87.7	87.5	86.3	86.5	86.4	80.8	78.6	79.7
NYT	15	87.6	87.8	87.7	84.1	84.6	84.4	82.8	80.3	81.5
NYT	16	90.1	92.2	91.1	88.7	85.3	87.0	84.2	76.3	80.0
WebNLG	6	85.1	32.1	46.2	78.4	27.2	40.4	43.0	8.1	13.7
WebNLG	7	87.5	48.8	62.5	81.0	40.5	54.0	50.7	12.2	19.6
WebNLG	8	86.9	57.6	69.2	81.4	57.4	69.9	54.8	13.5	21.6
WebNLG	9	86.7	71.4	78.3	87.5	68.1	76.4	50.5	18.1	26.6
WebNLG	10	87.0	81.9	84.3	89.7	80.1	84.6	76.6	49.5	60.2
WebNLG	11	87.0	83.6	85.3	89.4	78.2	83.4	58.2	19.7	29.4
WebNLG	12	87.1	84.9	86.0	88.6	78.0	82.9	51.9	21.0	30.0
WebNLG	13	87.4	87.0	87.3	88.7	71.7	79.2	52.3	20.1	29.0
WebNLG	14	88.5	87.4	88.0	79.6	66.7	72.6	49.6	23.4	28.6
WebNLG	15	91.0	82.2	86.4	89.7	79.5	84.2	60.8	40.3	48.5
WebNLG	16	88.7	87.5	88.1	74.4	59.3	66.0	51.7	19.8	28.6

7.3 Ablation Experiment

7.3.1 Discussion of the ability of the multi-head selection mechanism

This section investigates the impact of removing the multi-head selection mechanism, and the experimental results are presented in Table 6. Removing the multi-head selection mechanism led to decreases of 2.7 and 1.2 points in the F1 score of the model on the NYT and WebNLG datasets, respectively. This indicates that the multi-head attention mechanism's fragment feature extraction is effective for the model. The main reason is that using LSTM alone for fragment encoding can only reinforce the interaction and dependency relationships between fragments and cannot extract deep fragment feature information. Overall, our findings demonstrate the importance of the multi-head attention mechanism in enhancing model performance.

Table 6
Ablation experimental results table

Dataset	Multiple head selection	Prec.	Rec.	F1
NYT	YES	90.1	92.2	91.1
NYT	NO	89.8	86.6	88.4
WebNLG	YES	91.0	82.2	86.4
WebNLG	NO	89.7	81.1	85.2

8 Conclusion

This article proposes a new method for text annotation and mapping and, on this basis, establishes a joint extraction model. The model uses fragment tagging and fragment embedding methods to construct fragment vector representations, extracts fragment features through a long short-term memory network and a multi-head self-attention mechanism, and implements single-step relationship extraction through fragment label classification. The model aims to solve complex extraction problems such as nested entities, entity boundary recognition, context ambiguity, and multi-instance entity recognition in entity recognition tasks. Compared with various baseline models, it achieves a clear advantage in F1 value on the English relation extraction dataset NYT, and competitive results on WebNLG, but there is still room for exploration. For example, the construction of fragment vector representations is somewhat rough, so the next focus is on how to construct fragment vectors more accurately and efficiently. How to better match the subjects and objects of the same relationship and solve the problem of entity overlap is also a future research direction.

References

1. Xun, You and Yang, A review of relation extraction, New Technology of Library Information Service (2013).

2. Feng, Banerjee and Choi, Characterizing stylistic elements in syntactic structure, In: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2013.

3. Liu, Ramanath, Sadeh and Smith N.A., A step towards usable privacy policy: Automatic alignment of privacy statements, In: International Conference on Computational Linguistics, 2014.

4. Nguyen T.H. and Grishman, Relation extraction: Perspective from convolutional neural networks, In: Workshop on Vector Space Modeling for Natural Language Processing, 2015.

5. Wutianhao, A dongbieke: Extraction of LSTM relationships based on neuronal block level attention mechanism, Computer Application Research (S02) (2020), 76–79.

6. Greff, Srivastava R.K., Koutník, Steunebrink B.R. and Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems 28(10) (2016), 2222–2232.

7. Shu, Zheng, Hu and Ming, Bidirectional long short-term memory networks for relation classification, 2015.

8. Golshan P.N., Dashti, Azizi and Safari, A study of recent contributions on information extraction, 2018.

9. Qin, Xu and Wang W.Y., Robust distant supervision relation extraction via deep reinforcement learning, 2018.

10. Zhong and Chen, A frustratingly easy approach for joint entity and relation extraction, 2020.

11. Li and Ji, Incremental joint extraction of entity mentions and relations, In: Meeting of the Association for Computational Linguistics, 2014.

12. Zheng, Wang, Bao, Hao, Zhou and Xu, Joint extraction of entities and relations based on a novel tagging scheme, 2017.

13. Zeng, Zeng, He, Kang and Zhao, Extracting relational facts by an end-to-end neural model with copy mechanism, In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.

14. Ye, Zhang, Deng, Chen, Tan, Huang and Chen, Contrastive triple extraction with generative transformer, In: National Conference on Artificial Intelligence, 2021.

15. Wei, Su, Wang, Tian and Chang, A novel cascade binary tagging framework for relational triple extraction, In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.

16. Yuan, Zhou, Pan, Zhu and Guo, A relation-specific attention network for joint entity and relation extraction, In: Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI-20), 2020.

17. Zheng, Wen, Chen, Yang, Zhang, Qin, Xu and Zheng, PRGC: Potential relation and global correspondence based joint relational triple extraction, 2021.

18. Li, Luo, Dong, Yang, Luan and He, TDEER: An efficient translating decoding schema for joint extraction of entities and relations, In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8055–8064.

19. Devlin, Chang M.W., Lee and Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

20. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention is all you need, arXiv (2017).

21. Peng, Wei, Tian, Qi and Bo, Attention-based bidirectional long short-term memory networks for relation classification, In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016.

22. Sutskever, Vinyals and Le Q.V., Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems (2014).

23. Madry, Makelov, Schmidt, Tsipras and Vladu, Towards deep learning models resistant to adversarial attacks, 2017.

24. Chin W.S., Zhuang, Juan Y.C. and Lin C.J., A fast parallel stochastic gradient method for matrix factorization in shared memory systems, ACM Transactions on Intelligent Systems and Technology 6(1) (2015), 1–24.

25. Takeru, Shin-Ichi, Shin and Masanori, Virtual adversarial training: A regularization method for supervised and semi-supervised learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1.

26. Riedel, Yao and McCallum A.K., Modeling relations and their mentions without labeled text, In: Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, Proceedings, Part III, 2010.

27. Gardent, Shimorina, Narayan and Perez-Beltrachini, Creating training corpora for NLG microplanning, In: Meeting of the Association for Computational Linguistics, 2017.

28. Wang, Yu, Zhang, Liu and Sun, TPLinker: Single-stage joint extraction of entities and relations through token pair linking, 2020.

29. Li, Luo, Dong, Yang, Luan and He, TDEER: An efficient translating decoding schema for joint extraction of entities and relations, In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Association for Computational Linguistics, pp. 8055–8064.

30. Ren, Zhang, Yin, Zhao, Liu, Li and Liu, A novel global feature-oriented relational triple extraction model based on table filling, Association for Computational Linguistics (2021).

31. Xu, Wang, Lyu, Shi, Zhu, Gao and Mao, EmRel: Joint representation of entities and embedded relations for multi-triple extraction, In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics.

Table 2
Datasets basic information

	NYT		WEBNLG
Statistical Attributes	Train	Test	Train	Test
Relation	24	24	171	171
Sentence	56195	5000	5019	703
Triplet	112936	10142	17120	2208

	NYT			WEBNLG
MODEL	P	R	F1	P	R	F1
Novel Tagging	32.8	30.6	31.7	52.5	19.3	28.3
CasRel	85.1	84.5	84.8	88.8	87.2	88.0
TP-Linker	85.2	87.8	86.5	84.3	81.6	83.0
PRGC	88.9	87.5	88.2	88.3	86.8	87.5
TDEER	89.9	87.4	88.6	89.9	90.4	90.1
GRTE	85.4	83.2	84.2	89.4	56.2	69.0
EmRel	90.0	90.7	90.3	87.9	85.3	86.6
LMA	90.1	92.2	91.1	91.0	82.2	86.4



Table 3
Number of texts of each relationship overlapping type

Dataset	Normal	SEO	EOP
NYT(Train)	41109	8918	15606
NYT(Test)	3266	1297	978
WEBNLG(Train)	3575	3675	1424
WEBNLG(Test)	246	457	26
