A relational extraction approach based on multiple embedding representations and multi-head self-attention

Abstract

Word segmentation errors and polysemy are common problems in Chinese relation extraction. Character-based input avoids segmentation errors, but recovering the word-level information of a sentence then usually requires a dictionary or an external knowledge base, which costs considerable manpower and time. To address these problems, this article takes characters as input and uses multiple embedding models to jointly form the character vector sequence. Features containing character information are obtained through BiLSTM and attention layers. Because convolutional neural networks are good at extracting local features, features containing word information are obtained through multi-kernel convolutional layers and multi-head self-attention layers. Finally, a gating mechanism fuses the two kinds of features. The model was tested on the public SanWen data set and our own cultural-travel data set, obtaining F1 values of 61.22% and 60.26% respectively. The experimental results show that the method achieves good relation extraction performance without word segmentation tools, dictionaries, or external knowledge bases, and outperforms most commonly used models.

Keywords

Chinese relation extraction; multiple embedded representations; multi-head self-attention; gating mechanism

1 Introduction

The field of culture and tourism holds a large amount of data and knowledge. The data have been widely used for user preference inference, intelligent recommendation, and similar tasks, but the knowledge has been largely neglected because it receives little attention and is poorly managed. As a descriptive framework for structured knowledge, a knowledge graph carries rich linguistic information and supports powerful reasoning, so it can manage knowledge better [1]. Constructing a cultural and tourism knowledge graph therefore helps to assess tourism consumption trends, the current development of the industry, and so on. Relation extraction on the cultural-travel dataset is one of the key steps in constructing this knowledge graph.

Chinese relation extraction methods can be classified into two types by input granularity. The first is based on character input, where sentences are represented by character embeddings; its disadvantage is that word-level information cannot be fully utilised. The second is based on word input, which often suffers from word segmentation errors that reduce extraction accuracy [30]. For example, consider a sentence meaning “I’ll be travelling home in the next couple of days, so I’ll try the authentic Chongqing market hotpot again before I leave”. The correct segmentation of the phrase “Chongqing market hotpot” is “Chongqing / market / hotpot”, but a segmenter may instead produce “Chongqing Municipality / well / hotpot”. Character-based input avoids this kind of error.

Polysemy is another common problem in Chinese relation extraction. For example, in a sentence meaning “I’m going to a nearby park today where there are lots of cuckoos/azaleas”, the same Chinese word can denote a bird (cuckoo) or a flower (azalea). To alleviate polysemy, most existing methods construct dictionaries or introduce external knowledge bases as extended word information; however, doing so requires a lot of manpower and time and is relatively inefficient.

How to express sentence information adequately is also a concern for relation extraction. Most current feature combination methods use direct concatenation [2–4] or summation [5, 6], but these approaches do not work well for combining features that express different information.

To address the above situation, this paper proposes a relationship extraction method based on multiple embedding representations and multi-head self-attention. The main contributions of this paper are as follows:

The use of character embedding eliminates the need for a word segmentation tool and avoids the impact of word segmentation errors.

Multiple embedding models jointly represent the character vectors, which enhances the meaning of characters and the relationships between them. Because convolutional neural networks are good at extracting local features, the character vector sequence is fed into a convolutional neural network with multiple kernels of different sizes to extract local features of the sentence. Each local feature contains the information of several characters and can be regarded as a word vector, which enhances the meaning of words and the relationships between them. This alleviates polysemy without dictionaries or external knowledge bases as extended word information, and so avoids their cost in manpower and time.

The multi-head self-attention mechanism, through multi-head computation and multiple representations, attends to different representation subspaces of the input. It captures richer feature information, improves robustness, addresses the long-range dependency problem [7], and further mitigates polysemy.

The gating mechanism adaptively adjusts the weight of each model output during training, making the fused feature vector better suited to the task. Features containing character information and word information are fused with the gating mechanism so that the semantic information of the sentence is fully expressed.

The model was evaluated on the publicly available SanWen dataset and the Cultural-travel dataset. It obtained higher F1 values than most commonly used models and is portable.

The rest of the paper is organised as follows. An overview of the current state of research on relational extraction is given in Section 2. A detailed description of the proposed research methodology is given in Section 3. In Section 4 a series of experiments on the proposed methodology are conducted and analysed. In Section 5 some conclusions are drawn as well as an outlook for future work.

2 Related work

The research process of relational extraction methods can be broadly summarised as pattern matching based methods, feature based and kernel based methods, and deep learning based methods.

Pattern matching based methods extract inter-entity relations using manually constructed rules or dictionaries. For example, Fundel et al. [8] proposed a rule-based approach that first generates a dependency parse tree and then extracts relations using a knowledge base or rule base. The extracted relations are then checked for plausibility; for example, a relation preceded by a negation word is discarded. Nebhi [9] proposed an approach that uses a grammar parser together with the DBpedia dataset to perform entity-relation extraction; rules for extracting patterns were written using 188 training articles sourced from Quero News, and French news was used for evaluation. Relations extracted with such rules have high accuracy. However, this kind of approach only supports specific domains, is poorly portable, and relies on rules that are difficult to write and take a long time to construct.

Feature-based approaches require selecting a large number of features, such as lexical, syntactic, and entity features, then combining certain features and choosing a classifier for relation classification. Kambhatla [10] formed vectors from a variety of text features and achieved good results with a maximum entropy model. Che Wanxiang et al. [11] used Winnow and Support Vector Machines on the ACE2004 dataset and obtained F1 values of 73.08% and 73.27% respectively. Xin Huang et al. [12] combined lexical, syntactic, and entity features and achieved an F1 value of 72.77% on the ACE2005 dataset, while Lixin Gan et al. [13] combined two major features, syntactic relations and dependencies, for this task. All of these methods achieve good results, but the features are selected manually.

Cristianini et al. [14] showed that a combination of a finite number of feature terms does not express semantic relations well and that the performance of feature combinations is limited, which motivates kernel-based approaches. Their advantage is that they do not use feature vectors directly. Their disadvantage is that the kernel function must be designed manually, becomes exceptionally complex as performance requirements increase, and is slow and performs poorly on large-scale training data. Zelenko et al. [15] combined three manually designed kernel functions with Support Vector Machines and the Voted Perceptron learning algorithm to extract person-affiliation and organisation-location relations from unstructured natural language text, with good results. Culotta et al. [16] used extended kernels with support vector machines to extract relations. Bunescu et al. [17] proposed a new relation extraction scheme based on the observation that the information required to assert a relation between two named entities in the same sentence can usually be captured by the shortest path between the two entities in the dependency graph. Their experiments showed that the shortest-path dependency kernel outperforms the method based on the dependency tree kernel.

Instead of manually pre-selecting features, deep learning based methods use vector combinations to represent complex semantic information. The general process of deep learning based relation extraction is as follows: the semantic information of the input sentence is first represented as vectors, the vectors are then fed into a feature extraction model, and a classifier finally outputs the relation. Convolutional neural networks (CNN) [18], recurrent neural networks (RNN) [19], long short-term memory networks (LSTM) [20], and their extensions have been widely used in this task with good results. Liu et al. [21] were the first to apply CNNs to relation extraction. Instead of using word embeddings as the input layer, their method encodes the input using a synonym dictionary and then uses a CNN without a pooling layer to learn semantic features automatically. On the ACE2005 dataset, the method outperformed the then state-of-the-art kernel-based method by 9%. Zeng et al. [22] added a max pooling layer to the model of Liu et al. [21] and introduced positional embeddings to represent the distance between words and entities, combining word embeddings and positional embeddings as the input; this approach became widely used. Nguyen et al. [23] built on the method of Zeng et al. [22] by using a larger number of convolutional filters, and their experiments show that using multiple filters improves CNN performance. Zhang et al. [24] used an RNN instead of a CNN and showed that it performs well on relation extraction. Zhang et al. [25] proposed a bidirectional long short-term memory network for relation extraction, which achieved state-of-the-art performance on the SemEval-2010 dataset; they also used external lexical resources (e.g., WordNet) to obtain more features and achieved better results. In addition, Zhou et al. [26] used the BLSTM-ATT model for relation extraction, where the attention mechanism reduces the effect of noise and strengthens the role of keywords.

Mintz et al. [27] proposed the first distant supervision based approach, which trains supervised models without manually labelled data and was a major breakthrough in relation extraction. However, the method had many problems when first proposed, the biggest being that an automatically labelled sentence does not necessarily express the corresponding relation. To address this, Zeng et al. [28] proposed the PCNN model, which mitigates mislabelling as well as error propagation and accumulation. Lin et al. [29] built sentence-level attention over multiple instances, assigning high weights to sentence features that express the relation and low weights to those that do not. The method can train on distantly supervised data while reducing the impact of annotation errors, and it learns from the features of multiple sentences to avoid omitting features.

Word segmentation errors and polysemy are common difficulties in Chinese relation extraction, which has received far less research attention and made less progress than English relation extraction. To avoid or mitigate these problems, more and more scholars use character input to avoid segmentation errors and then introduce dictionaries or external knowledge bases as extended word information. Li et al. [30] proposed the MG-Lattice model in 2019, which combines character input with an external linguistic knowledge base; Xu et al. [31] introduced a network model that merges word and character inputs and uses BiGRU instead of BiLSTM; Zhang et al. [32] proposed the concept of entity meaning as external linguistic knowledge to enhance sentence information; and Kong et al. [33] proposed an adaptive approach that incorporates word information at the embedding layer, using a dictionary to merge all words matched by each character into the character-based model. To mitigate polysemy, all of these models use dictionaries or external knowledge bases as extended word information, but constructing them requires a lot of time and manpower. In this paper, because different embedding models have different training details, we use multiple embedding models to jointly represent the character vectors. We then exploit the ability of convolutional neural networks to extract local features and treat those local features as extended word information. This alleviates polysemy while avoiding word segmentation and without constructing a dictionary or an external knowledge base.

3 Methods

For a given Chinese sentence and two labeled entities, the task of relation extraction is to extract the semantic relation between the two entities. The overall framework of the proposed method is shown in Fig. 1 and consists of three parts: Part 1 obtains features containing the character information of the sentence, Part 2 obtains features containing the word information of the sentence, and Part 3 fuses the features.

In Part 1, character embedding represents each character in the sentence as a vector, which avoids word segmentation errors; the character vector sequence is then fed into a network composed of a BiLSTM layer and an attention layer to obtain character-level features.

In Part 2, multiple embedding models jointly form the character vector sequence, which enhances the meaning of characters and the relationships between them. Because convolutional neural networks are good at extracting local features, a convolutional neural network with multiple kernel sizes is used to obtain local features of the sentence. Each local feature contains the information of several characters and can be regarded as a word vector, so this step enhances the meaning of words and the relationships between them and eases the polysemy problem. The word vector sequence is then sent to a network composed of a multi-head self-attention layer and a max pooling layer to obtain word-level features.

Part 3 fuses the character-level and word-level features so that the semantic information of the sentence is fully expressed. Finally, the fused feature vector is passed to the softmax layer, which produces a probability distribution over all relation categories; the category with the highest probability is selected as the predicted relation.

Fig. 1

General framework.

3.1 Part 1: Obtain features containing character information

This part is summarised in Algorithm 1 and described in detail below.

Algorithm 1 Part 1: Obtain features containing character information

Input: {v₁, v₂, . . . , v_l}, where $v_{i} = [c_{i}; o^{j_{i}^{e 1}}; o^{j_{i}^{e 2}}]$

Output: Character-level feature vector Y

for each v_i∈ { v₁, v₂, . . . , v_l } do

${bls}_{i} = \vec{lstm} (v_{i}, \vec{{bls}_{i - 1}}) \oplus \overset{\leftarrow}{lstm} (v_{i}, \overset{\leftarrow}{{bls}_{i + 1}})$

BLS ={ bls_i } , i ∈ { 1, . . . , l }

end for

Finally, $\{{bls}_{i}\}_{i = 1}^{l}$ are concatenated to form the output BLS of the BiLSTM layer.

ρ = softmax (W^T tanh (BLS))

Y = W^bls tanh (ρ^T BLS)

return Y

3.1.1 Inputs

The input to this part of the network model is a sentence vector S = { v₁, v₂, . . . , v_l } ∈ R^{l×a} with a = m + 2n, where l is the length of the sentence, m is the character vector dimension, and n is the position vector dimension. Here v_i is the vector representation of the i-th character, consisting of the character vector c_i and the two position vectors $o^{j_{i}^{e 1}}, o^{j_{i}^{e 2}}$ , and is denoted $v_{i} = [c_{i}; o^{j_{i}^{e 1}}; o^{j_{i}^{e 2}}]$ . The character vector c_i is obtained from a trained FastText model [34], which converts each character into its vector form. The position vectors $o^{j_{i}^{e 1}}, o^{j_{i}^{e 2}} \in R^{n}$ represent the relative distance between the current character and each of the two entities. The relative distance $j_{i}^{e 1}$ is computed as follows:

$j_{i}^{e 1} = \begin{cases} i - s^{e 1}, & i < s^{e 1} \\ 0, & s^{e 1} \leq i \leq d^{e 1} \\ i - d^{e 1}, & i > d^{e 1} \end{cases}$ (1)

where $j_{i}^{e 1}$ denotes the relative distance of the i-th character from entity e1, and s^e1 and d^e1 denote the start and end positions of entity e1. $j_{i}^{e 2}$ is computed in the same way as $j_{i}^{e 1}$ .
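The piecewise definition in Eq. (1) can be sketched in a few lines of Python. The function names and the toy entity span below are hypothetical and serve only to illustrate the rule; indices are 0-based here, which is an assumption of the sketch.

```python
def relative_position(i, start, end):
    """Relative distance of character index i from an entity spanning [start, end]."""
    if i < start:
        return i - start  # negative: the character lies before the entity
    if i > end:
        return i - end    # positive: the character lies after the entity
    return 0              # the character lies inside the entity


def position_features(length, entity_span):
    """Relative distances of every character in a sentence to one entity."""
    start, end = entity_span
    return [relative_position(i, start, end) for i in range(length)]


# Hypothetical 8-character sentence with entity e1 at indices 2..3.
distances = position_features(8, (2, 3))
print(distances)
```

Each distance would then be mapped to an n-dimensional position vector through a lookup table, which this sketch leaves out.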

3.1.2 BiLSTM layer and attention layer

The sentence vector S = { v₁, v₂, . . . , v_l } ∈ R^{l×a} is fed into a network consisting of a BiLSTM layer and an attention layer, as shown in Fig. 2. The BiLSTM captures context from the whole sentence in both directions, and the attention mechanism assigns different weights to different characters, so combining the two enhances the semantic information and yields a higher-quality feature vector Y ∈ R^{d_out} (d_out is the number of relation categories) that contains the character information of the sentence. The input first passes through the BiLSTM layer, which consists of a forward LSTM and a backward LSTM with the same structure. Each LSTM consists of a series of recurrently connected neurons (indicated by yellow circles in the figure). Each neuron consists of an input gate i_i, a forget gate f_i, an output gate o_i, an input x_i, a cell state c_i, a candidate cell state $\tilde{c_{i}}$ , and a hidden state h_i. The input gate controls the information entering the neuron, the forget gate selects the information to be forgotten, and the output gate determines which information is passed to the hidden state at the next step. Each character vector passes through all gates to generate a hidden state vector. The calculation is as follows:

$f_{i} = σ (W_{x_{f}} x_{i} + W_{h_{f}} h_{i - 1} + W_{c_{f}} c_{i - 1} + b_{f})$ (2)

$i_{i} = σ (W_{x_{i}} x_{i} + W_{h_{i}} h_{i - 1} + W_{c_{i}} c_{i - 1} + b_{i})$ (3)

$\tilde{c_{i}} = tanh (W_{x_{c}} x_{i} + W_{h_{c}} h_{i - 1} + W_{c_{c}} c_{i - 1} + b_{c})$ (4)

$c_{i} = f_{i} c_{i - 1} + i_{i} \tilde{c_{i}}$ (5)

$o_{i} = σ (W_{x_{o}} x_{i} + W_{h_{o}} h_{i - 1} + W_{c_{o}} c_{i - 1} + b_{o})$ (6)

$h_{i} = o_{i} tanh (c_{i})$ (7)

Here $W_{x_{f}}, W_{h_{f}}, W_{c_{f}}$ and $b_{f}$ are the weight matrices and bias of the forget gate, which determine how much of the previous cell state is forgotten; $W_{x_{i}}, W_{h_{i}}, W_{c_{i}}$ and $b_{i}$ are the weight matrices and bias of the input gate, which determine how much new information is memorised; $W_{x_{c}}, W_{h_{c}}, W_{c_{c}}$ and $b_{c}$ are the weight matrices and bias of the candidate cell state, i.e. the newly learned content; and $W_{x_{o}}, W_{h_{o}}, W_{c_{o}}$ and $b_{o}$ are the weight matrices and bias of the output gate. Each character learns a forward hidden state vector $\vec{h_{i}}$ and a backward hidden state vector $\overset{\leftarrow}{h_{i}}$ , which together form the hidden state vector bls_i (indicated by a blue circle in the figure). It is calculated as follows:

${bls}_{i} = [\vec{h_{i}} \oplus \overset{\leftarrow}{h_{i}}]$ (8)
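As a rough illustration of Eqs. (2)–(8), the following pure-Python sketch runs an LSTM whose gates also see the previous cell state, once in each direction, and concatenates the two hidden states per position. The function names, the random initialisation and the toy sizes are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w_x, w_h, w_c, b, x, h, c):
    """Compute W_x x + W_h h + W_c c + b row by row."""
    return [
        sum(w * v for w, v in zip(w_x[r], x))
        + sum(w * v for w, v in zip(w_h[r], h))
        + sum(w * v for w, v in zip(w_c[r], c))
        + b[r]
        for r in range(len(b))
    ]


def lstm_step(params, x, h_prev, c_prev):
    """One step of Eqs. (2)-(7); every gate also sees the previous cell state."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev, c_prev)]
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev, c_prev)]
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev, c_prev)]
    c = [fk * cp + ik * ct for fk, cp, ik, ct in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev, c_prev)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(params, sequence, g):
    """Run one direction over the sequence, starting from zero states."""
    h, c = [0.0] * g, [0.0] * g
    states = []
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
        states.append(h)
    return states


def bilstm(fwd, bwd, sequence, g):
    """Eq. (8): concatenate forward and backward hidden states per position."""
    forward = run_lstm(fwd, sequence, g)
    backward = run_lstm(bwd, sequence[::-1], g)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


def init_params(a, g, rng):
    """Hypothetical small random initialisation for one LSTM direction."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: (mat(g, a), mat(g, g), mat(g, g), [0.0] * g) for name in "fico"}


rng = random.Random(0)
l, a, g = 4, 3, 2  # toy sentence length, input size, number of LSTM units
sentence = [[rng.uniform(-1, 1) for _ in range(a)] for _ in range(l)]
BLS = bilstm(init_params(a, g, rng), init_params(a, g, rng), sentence, g)
print(len(BLS), len(BLS[0]))  # l rows, each of width 2g
```

In practice a deep learning framework would supply the LSTM; the sketch only makes the data flow and the shape l × 2g of BLS explicit.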

Finally, a new matrix BLS = [bls₁, bls₂, . . . , bls_l] ∈ R^{l×2g} is output, where g is the number of LSTM units; this feature contains the contextual information. It then passes through the attention layer, which is calculated as follows:

$ρ = softmax (W^{T} \tanh (BLS))$ (9)

$Y = W^{bls} \tanh (ρ^{T} BLS)$ (10)

where W ∈ R^{2g} is a randomly initialised vector and W^T is its transpose; ρ is the attention probability vector over BLS and ρ^T is its transpose; W^bls ∈ R^{d_out×2g} is a linear mapping matrix, where d_out is the number of relation types; and Y is the feature vector containing the character information, with dimension d_out.
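A minimal sketch of the attention pooling in Eqs. (9) and (10) is given below, assuming BLS is stored as l rows of width 2g so that the score of each position is the dot product of W with the tanh of that row. The toy values and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(bls, w, w_bls):
    """bls: l rows of width 2g; w: length 2g; w_bls: d_out rows of width 2g."""
    # Eq. (9): one score per position, normalised into attention weights rho.
    scores = [sum(wj * math.tanh(hj) for wj, hj in zip(w, row)) for row in bls]
    rho = softmax(scores)
    # Eq. (10): attention-weighted sum of the rows, tanh, then linear mapping.
    context = [sum(r * row[j] for r, row in zip(rho, bls)) for j in range(len(w))]
    y = [sum(m * math.tanh(cj) for m, cj in zip(out_row, context)) for out_row in w_bls]
    return rho, y


# Toy example with l = 3, 2g = 4 and d_out = 2.
bls = [[0.2, -0.1, 0.4, 0.0], [0.9, 0.3, -0.2, 0.5], [-0.3, 0.1, 0.1, 0.2]]
w = [0.5, -0.5, 0.25, 1.0]
w_bls = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
rho, Y = attention_pool(bls, w, w_bls)
print(rho, Y)
```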

Fig. 2

Obtain features containing character information.

3.2 Part 2: Obtain features containing word information

This part is summarised in Algorithm 2 and described in detail below.

Algorithm 2 Part 2: Obtain features containing word information

Input: S ={ v₁, v₂, . . . , v_l }, where $v_{i} = [v_{i}^{1}; v_{i}^{2}]$ , $v_{i}^{1} = [c_{i}^{1}; o_{1}^{j_{i}^{e 1}}; o_{1}^{j_{i}^{e 2}}]$ , $v_{i}^{2} = [c_{i}^{2}; o_{2}^{j_{i}^{e 1}}; o_{2}^{j_{i}^{e 2}}]$

Output: Word-level feature vector U

for t = 1 to k do

$w_{t} = f (H_{t}^{p} \cdot S_{i : j} + b_{t}^{p})$

W ={ w_t } , t ∈ { 1, . . . , k }

end for

X ={ x₁, x₂, . . . , x_l } ← W = { w₁, w₂, . . . , w_k }

${head}_{i} = Att ({XT}_{i}^{Q}, {XT}_{i}^{K}, {XT}_{i}^{V})$ , where

$Att (Q_{i}, K_{i}, V_{i}) = Softmax (\frac{Q_{i} K_{i}^{T}}{\sqrt{d}}) V_{i}$

$A = [Concat ({head}_{1}, {head}_{2}, \dots, {head}_{h})] T_{i}^{O}$

U = Maxpooling (A)

return U

3.2.1 Inputs

The input to this part differs from the input used in Part 1. Since different embedding models have different training details, using multiple trained embedding models to jointly represent the character vectors can enhance the meaning of the characters and the relationships between them. For Chinese corpora, both Word2vec [35] and FastText can make good use of character-level information to generate embedding vectors. Word2vec takes into account the information within the local context window, while FastText can also handle out-of-vocabulary words, which are common in Chinese corpora. Therefore, Word2vec and FastText are chosen as the vector representation models. The character vector $c_{i}^{1}$ trained by FastText and the two position vectors $o_{1}^{j_{i}^{e 1}}, o_{1}^{j_{i}^{e 2}}$ form the vector representation $v_{i}^{1}$ of the i-th character; the character vector $c_{i}^{2}$ trained by Word2vec and the position vectors $o_{2}^{j_{i}^{e 1}}, o_{2}^{j_{i}^{e 2}}$ form the vector representation $v_{i}^{2}$ of the i-th character. Finally, $v_{i}^{1}$ and $v_{i}^{2}$ are concatenated into $v_{i} = [v_{i}^{1}; v_{i}^{2}] \in R^{2 m + 4 n}$ as the vector representation of the i-th character, where, as in Section 3.1.1, m is the dimension of the character vector and n is the dimension of the position vector.
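The joint representation can be illustrated with plain list concatenation. In this sketch the embedding values are made up, and the names `char_dim` and `pos_dim` are hypothetical stand-ins for the character and position vector dimensions; in practice the character vectors would come from trained FastText and Word2vec models.

```python
def char_representation(c1, c2, pos1, pos2):
    """Join the two per-model views of one character: v_i = [v_i^1; v_i^2]."""
    v1 = c1 + pos1[0] + pos1[1]  # FastText view: character vector + two position vectors
    v2 = c2 + pos2[0] + pos2[1]  # Word2vec view: character vector + two position vectors
    return v1 + v2


char_dim, pos_dim = 4, 2              # toy dimensions
c1 = [0.1] * char_dim                 # stand-in for a FastText character vector
c2 = [0.2] * char_dim                 # stand-in for a Word2vec character vector
pos1 = ([0.0] * pos_dim, [1.0] * pos_dim)
pos2 = ([0.5] * pos_dim, [0.5] * pos_dim)

v = char_representation(c1, c2, pos1, pos2)
print(len(v))  # equals 2 * char_dim + 4 * pos_dim
```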

3.2.2 Multi-kernel convolution layer

Simply using feature vectors obtained from character embeddings as input cannot fully express the meaning of sentences. Therefore character embeddings are taken as input and a convolutional neural network with multiple different convolution kernel sizes is used to extract local features. It can be observed from Fig. 3 that the output of the convolutional neural network includes multiple character information, so the extracted local features can be regarded as a word vector. The input represented by multiple embedding models enhances the meaning of the characters and the relationship between the characters, and the word vector is obtained through this layer which in turn enhances the meaning of the words and the relationship between the words. The method is as follows: the input sequence is fed into a convolutional neural network with multiple convolutional kernels of different sizes, and the word vector w_t is obtained each time, with t denoting the tth convolutional kernel. The calculation formula is as follows:

Fig. 3

Convolutional neural network with convolutional kernel size 3.

$w_{t} = f (H_{t}^{p} \cdot S_{i : j} + b_{t}^{p})$ (11)

where 1 ≤ i ≤ l - (j - i + 1), S_i:j denotes the sequence of vectors from position i to position j in the sentence S, t indexes the convolution kernel, l is the fixed sentence length, and the kernel size is j - i + 1. $H_{t}^{p}$ and $b_{t}^{p}$ are the parameter matrix and bias term of the t-th convolution kernel, p denotes its padding value, which is set so that the output of every kernel has length l, and f denotes the activation function. Figure 4 shows the input sequence being fed into convolutional neural networks with different kernel sizes, which finally yields the vector sequence W = { w₁, w₂, . . . , w_k } ∈ R^{k×l} (k denotes the total number of convolution kernels).
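The multi-kernel convolution of Eq. (11) can be sketched as follows. The sketch assumes zero padding split between the two ends of the sentence and ReLU as the activation f; neither choice is stated in the text, and the function names and toy kernels are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def conv1d_same(sentence, kernel, bias, activation):
    """Slide one kernel over the sentence, zero-padding so the output has length l."""
    size = len(kernel)
    dim = len(sentence[0])
    left = (size - 1) // 2
    right = size - 1 - left
    zero = [0.0] * dim
    padded = [zero] * left + sentence + [zero] * right
    out = []
    for i in range(len(sentence)):
        window = padded[i:i + size]
        z = sum(h * s for h_row, s_row in zip(kernel, window) for h, s in zip(h_row, s_row))
        out.append(activation(z + bias))
    return out


def multi_kernel_features(sentence, kernels, biases):
    """Return W (k rows of length l) and its transpose X (l rows of length k)."""
    W = [conv1d_same(sentence, H, b, relu) for H, b in zip(kernels, biases)]
    X = [list(column) for column in zip(*W)]
    return W, X


# Toy sentence with l = 5 characters of dimension 2, and kernels of size 2, 3 and 4.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
kernels = [[[1.0, 1.0] for _ in range(size)] for size in (2, 3, 4)]
biases = [0.0, 0.0, 0.0]
W, X = multi_kernel_features(sentence, kernels, biases)
print(W[1])  # output of the size-3 kernel, one value per character position
```

Each row of X then collects, for one character position, the responses of all kernels, which is the word-vector view used by the following layer.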

Fig. 4

Local features obtained by convolutional layers.

3.2.3 Multi-head self-attention layer and Maxpool layer

The vector sequence W = { w₁, w₂, . . . , w_k } ∈ R^{k×l} obtained from the convolutional layer is first reshaped into the vector sequence X = { x₁, x₂, . . . , x_l } ∈ R^{l×k} required by the model, which is then fed into a network consisting of the multi-head self-attention layer and the Maxpool layer. The multi-head self-attention mechanism, through multi-head computation and multiple representations, attends to different representation subspaces of the input and captures richer feature information, which further mitigates polysemy on top of the multiple input vector representations. The computation of the multi-head self-attention layer is shown in Fig. 5. First, in the linear transformation layer, the input sequence X is transformed by the linear transformation matrices $T_{i}^{Q}, T_{i}^{K}, T_{i}^{V}$ into the query matrix Q_i, key matrix K_i and value matrix V_i, each in R^{l×(k/h)} (i indicates that the parameter belongs to head_i, and h is the number of heads). Then the Softmax function is applied in the scaled dot-product layer to obtain the attention distribution. Finally, the outputs of the heads are concatenated, and the weighted sentence representation A ∈ R^{l×d_out} (d_out is the number of relation categories) is obtained through the linear transformation matrix $T_{i}^{O} \in R^{k \times d_{out}}$ . The formulas are as follows:

${head}_{i} = Att ({XT}_{i}^{Q}, {XT}_{i}^{K}, {XT}_{i}^{V})$ (12)

$A = [a_{1}, a_{2}, \dots, a_{l}]$ (13)

$= [Concat ({head}_{1}, {head}_{2}, \dots, {head}_{h})] T_{i}^{O}$ (14)

where $Att (Q_{i}, K_{i}, V_{i}) = Softmax (\frac{Q_{i} K_{i}^{T}}{\sqrt{d}}) V_{i}$ and d is the dimension of the key matrix K_i. The weighted sentence representation A = [a₁, a₂, . . . , a_l] ∈ R^{l×d_out} is then fed into the Maxpool layer, which selects the most salient features to form the word feature vector U ∈ R^{d_out} that represents the word information.
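Eqs. (12)–(14) and the subsequent max pooling can be sketched in pure Python as below. The random matrices, the toy sizes and all function names are assumptions for illustration; a single output matrix shared by all heads is used for the final projection.

```python
import math
import random


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(x, heads, t_out):
    """Eq. (12) per head, then concatenation and output projection, Eq. (14)."""
    outputs = [attention(matmul(x, t_q), matmul(x, t_k), matmul(x, t_v))
               for t_q, t_k, t_v in heads]
    concat = [[value for head in outputs for value in head[r]] for r in range(len(x))]
    return matmul(concat, t_out)


def max_pool(a):
    """Keep the largest value of each output dimension across all positions."""
    return [max(column) for column in zip(*a)]


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


l, k, h, d_out = 4, 6, 2, 3  # toy sizes: positions, kernels, heads, relation types
X = rand_matrix(l, k)
heads = [(rand_matrix(k, k // h), rand_matrix(k, k // h), rand_matrix(k, k // h))
         for _ in range(h)]
T_O = rand_matrix(k, d_out)

A = multi_head_self_attention(X, heads, T_O)
U = max_pool(A)
print(len(A), len(A[0]), len(U))
```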

Fig. 5

Obtaining features containing information about sentence words.

3.3 Part 3: Feature fusion

This part is summarised in Algorithm 3 and described in detail below.

Algorithm 3 Part 3: Feature fusion

Input: Character-level feature vector Y, Word-level feature vector U

Output: Fused feature vector O

O_gate = σ (W_gate [Y ; U] + b_gate)

O = O_gate ⊙ Y + (1 - O_gate) ⊙ U

return O

3.3.1 Feature fusion based on gating mechanisms

The feature vector Y ∈ R^{d_out} representing character information and the feature vector U ∈ R^{d_out} representing word information are fused using a gating mechanism. The gate adaptively adjusts the weight of each model output during training, so that the different models compensate for each other's deficiencies, the fused feature vector better suits the task, and overall performance improves.

As shown in Fig. 6, the Sigmoid function (denoted σ) is used as the gating function. The features Y and U are concatenated as input to obtain the gating tensor O_gate, which is learnable through the parameters W_gate ∈ R^{2d_out} and b_gate. The gating tensor then weights the two features to give the fused feature vector O ∈ R^{d_out} (⊙ denotes element-wise multiplication). The formulas are as follows:

$O_{gate} = σ (W_{gate} [Y; U] + b_{gate})$ (15)

$O = O_{gate} ⊙ Y + (1 - O_{gate}) ⊙ U$ (16)

Finally, the fusion result O ∈ R^{d_out} is input to the Softmax layer for relationship prediction.
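A minimal sketch of the gated fusion in Eqs. (15) and (16) follows. It takes the stated shape W_gate ∈ R^{2d_out} literally, so the gate is a single scalar applied to every dimension; a matrix-valued W_gate would instead give one gate per dimension. The function name and toy values are hypothetical.

```python
import math


def gated_fusion(y, u, w_gate, b_gate):
    """Fuse character-level features y and word-level features u with a sigmoid gate."""
    # Eq. (15): the gate is computed from the concatenation [Y; U].
    z = sum(w * v for w, v in zip(w_gate, y + u)) + b_gate
    gate = 1.0 / (1.0 + math.exp(-z))
    # Eq. (16): convex combination of the two feature vectors.
    fused = [gate * yi + (1.0 - gate) * ui for yi, ui in zip(y, u)]
    return gate, fused


# With zero gate parameters the gate is 0.5, so the fusion is a plain average.
Y = [1.0, 3.0]
U = [3.0, 1.0]
gate, O = gated_fusion(Y, U, [0.0, 0.0, 0.0, 0.0], 0.0)
print(gate, O)
```

Because the gate lies between 0 and 1, each fused value always lies between the corresponding values of Y and U.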

Fig. 6

Feature fusion using gating mechanism.

3.4 Relationship prediction

The feature vector O obtained in Section 3.3 is input into the softmax classifier to calculate the probability of each category. For each sentence, the category with the largest probability is the predicted relation. The calculation is as follows:

$p (y) = softmax (O)$ (17)

$\hat{y} = \underset{y}{argmax} p (y)$ (18)

where p (y)_i is the probability that the relation category is i, and $\hat{y}$ is the index of the category with the highest probability in the distribution. We use the cross-entropy function to measure the difference between the predictions and the true labels:

$J (θ) = - \frac{1}{C} \sum_{i = 1}^{C} p_{i} \log (y_{i}) + λ {∥ θ ∥}_{F}^{2}$ (19)

where C is the number of sentences in the data set, p_i is the one-hot vector of the true relationship of the i-th sentence, y_i is the vector of category probabilities estimated by the softmax function, λ is the L2 regularization hyperparameter, and θ represents all parameters in the model. The gap between predicted and actual labels is reduced by minimizing the loss function.
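The prediction and loss in Eqs. (17) to (19) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the softmax is applied directly to the score vector as written in Eq. (17), and the model parameters θ are represented as a flat list of numbers.

```python
import math


def softmax(scores):
    """Eq. (17): turn a score vector into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Eq. (18): index of the most probable category, plus the distribution."""
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


def loss(batch_scores, true_labels, params, lam):
    """Eq. (19): mean cross-entropy plus lam times the squared L2 norm.

    Because p_i is one-hot, the inner product p_i . log(y_i) reduces to
    the log-probability assigned to the true category of sentence i.
    """
    ce = 0.0
    for scores, label in zip(batch_scores, true_labels):
        ce -= math.log(softmax(scores)[label])
    ce /= len(batch_scores)
    return ce + lam * sum(p * p for p in params)
```

For a single sentence with two equally scored categories and no regularization, the loss is log 2, the cross-entropy of a uniform guess.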

4 Experiments

In this section, we evaluate the experimental model on the publicly available SanWen dataset and the Cultural-travel dataset produced by obtaining review data from China’s Ctrip travel website, and conduct a series of ablation experiments. This section consists of the following parts:

The data set, assessment indicators and experimental parameter settings are described.

Overall comparison of models and comparative experiments.

4.1 Description of the data set, assessment indicators and experimental parameter settings

4.1.1 Data sets

The datasets used in this experiment are the publicly available SanWen dataset and the Cultural-travel dataset produced by obtaining review data from the Ctrip travel website in China.

The SanWen dataset contains 837 Chinese documents, of which 17,227 sentences are used for training, 2,220 sentences for testing, and 1,793 sentences for validation. There are 9 kinds of relations. In addition, it includes an extra class for representing relations that do not belong to any of the 9 main relations.

The Cultural-travel dataset follows the annotation standards of the public SanWen dataset [36] to ensure the comparability and interpretability of our annotated data. It comes from review data on China’s Ctrip travel website, part of which was obtained through web crawling and part of which was provided by relevant cultural and tourism bureaus. After studying and organizing these reviews, we divided the text into sentences and removed special characters and stop words, which are of little use for the relationship extraction task. Finally, 1822 pieces of data were screened out and 5 relationships were summarized. Relationship descriptions and some examples are shown in Table 1. The number of instances of each relationship is shown in Table 2, of which 1440 are used as training data. To annotate the filtered sentences, we surveyed Chinese annotation tools and chose the Elf Annotation Assistant software, which is simple to operate and supports Chinese relational annotation well: once the relationships are defined, sentences can be annotated quickly. The annotations were then exported in JSON format.

Table 1
Description of relationships

Relationship	Description	Example	Entity 1	Entity 2
(Located)	Describe the spatial relationship between entities.	(The White Tower is located in a remote mountainous area in Nanchong.)	(The White Tower)	(Nanchong)
(Own)	Indicate an ownership or control relationship between entities.	(There are many parks worth visiting in Lizhuang Ancient Town.)	(Lizhuang Ancient Town)	(Park)
(Nearby)	Describe the spatial proximity relationship between entities.	(Next to Tagong Temple is the beautiful pagoda.)	(Tagong Temple)	(Pagoda)
(Feature)	Represent special attributes or characteristic relationships between entities.	(Qingcheng Mountain is famous for Taoism.)	(Qingcheng Mountain)	(Taoism)
(History)	Describe the temporal relationship between entities.	(Li Bai, the immortal poet, traveled to Shu in 731 AD.)	(Li Bai)	(Shu)

Table 2

Number of relationships

(Number)	(Relationship)	(Quantity)
1	(Located)	425
2	(Own)	397
3	(Nearby)	363
4	(Feature)	324
5	(History)	315

4.1.2 Assessment of indicators

Precision, Recall, and the F1 value, which is the harmonic mean of Precision and Recall, are common metrics for evaluating model performance. They are calculated as follows:

$Precision = \frac{TP}{TP + FP}$ (20)

$Recall = \frac{TP}{TP + FN}$ (21)

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \times 100\%$ (22)

where TP is the number of samples predicted positive that are actually positive, FP is the number predicted positive that are actually negative, and FN is the number predicted negative that are actually positive. This experiment uses F1 values to evaluate the performance of each model.
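Eqs. (20) to (22) translate directly into code. The sketch below uses hypothetical function names and returns fractions rather than percentages; how the counts are aggregated across relation classes (micro or macro averaging) is not specified in the text, so the functions simply take one set of counts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Eqs. (20)-(22) from raw counts; returns fractions in [0, 1].

    Empty denominators are mapped to 0.0 to avoid division by zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_score(precision, recall)
    return precision, recall, f1


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a consistency check, `f1_score(59.73, 62.79)` rounds to 61.22, which matches the Precision, Recall and F1 reported for the proposed model on SanWen.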

4.1.3 Experimental parameters

The hyperparameter settings for this experiment are shown in Table 3. The key parameters were selected according to evaluation results on the validation set; the remaining parameters had little effect on overall performance and were set empirically.

Table 3
Hyperparameter settings

Hyperparameter	Value	Hyperparameter	Value
Character embedding size	50	Input dropout	0.5
Position embedding size	5	LSTM dropout	0.5
Filter size	2,3,4,5	Out dropout	0.5
The number of convolutions	256	Learning rate	1.0
State size of BiLSTM	256	Decay rate	0.9
Batch size	10	Epoch	100
The number of heads of attention	3	Max_len	100

The dimension of the character embedding vector is 50, the dimension of the position embedding vector is 5, the hidden dimension of the BiLSTM layer is 256, the number of convolution filters per kernel size is 256, the convolution kernel sizes are 2 to 5, and the number of multi-head self-attention heads is 3. The AdaDelta algorithm [37] is used to optimise the network, with an initial learning rate of 1.0 and a decay rate of 0.9. Dropout is used to mitigate overfitting, with a rate of 0.5 on the input embedding, the BiLSTM layer, and the output layer. The batch size is 10, the maximum sentence length is 100, and the number of training epochs is 100.
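For readers unfamiliar with AdaDelta [37], the following sketch shows one update step in plain Python. It assumes that the reported decay rate of 0.9 corresponds to AdaDelta's averaging coefficient ρ; the value of ε, the function names, and the toy quadratic objective are illustrative choices, not settings taken from the experiments.

```python
import math


def init_state(n: int) -> dict:
    """Running averages of squared gradients and squared updates."""
    return {"eg2": [0.0] * n, "edx2": [0.0] * n}


def adadelta_step(x, grad, state, lr=1.0, rho=0.9, eps=1e-6):
    """Apply one AdaDelta update to parameters x and return the new values."""
    new_x = []
    for i, (xi, gi) in enumerate(zip(x, grad)):
        # Exponential moving average of squared gradients
        state["eg2"][i] = rho * state["eg2"][i] + (1 - rho) * gi * gi
        # Step scaled by RMS of past updates over RMS of gradients
        delta = (-math.sqrt(state["edx2"][i] + eps)
                 / math.sqrt(state["eg2"][i] + eps) * gi)
        # Exponential moving average of squared updates
        state["edx2"][i] = rho * state["edx2"][i] + (1 - rho) * delta * delta
        new_x.append(xi + lr * delta)
    return new_x


def minimise_quadratic(steps: int, start: float = 3.0):
    """Toy usage: minimise f(x) = x^2 and return the trajectory of x."""
    x, state = [start], init_state(1)
    history = [x[0]]
    for _ in range(steps):
        x = adadelta_step(x, [2.0 * x[0]], state)  # gradient of x^2 is 2x
        history.append(x[0])
    return history
```

The step size is derived from the ratio of the two running averages, which is why the method works with a nominal learning rate of 1.0.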

4.2 Overall comparison of models and comparative experiments

This part first compares the model with a series of existing common models to verify its effectiveness. It then compares using multiple embedding models to obtain the character vector sequence with using a single embedding model, to verify that multiple embeddings mitigate the problem of word polysemy and improve relation extraction. Next, it verifies the usefulness of the multi-head self-attention mechanism for the relation extraction task. Finally, a comparison experiment illustrates the importance of character information and word information, and shows that a gating mechanism expresses sentence information more adequately.

4.2.1 Overall comparison of models

The experiment compares the proposed model with the following baseline models on the SanWen dataset and the Cultural-travel dataset: Bi-LSTM [25], a BiLSTM model for relation extraction; Attention Bi-LSTM [26], which applies an attention mechanism after the Bi-LSTM to extract relations; CNN [38], which uses a convolutional neural network to extract word- and sentence-level features; Attention CNN [39], an attention-based convolutional neural network that uses information such as word embeddings and positional embeddings for relation extraction; Att-Pooling-CNN [40], which applies the attention mechanism to the input sequence and to the pooling layer to learn the attention of parts of the input sentence to the two entities and to the target category, respectively; Att-BLSTM (Latent entity typing) [41], an entity-aware attention mechanism with latent entity typing for relation extraction; and Att-BLSTM+C-Att-BLSTM [32], which introduces the concept of entity sense to provide more information and improve the accuracy of relation extraction. The experimental data are shown in Tables 4 and 5 and Fig. 7. The results show that our proposed model performs best among these commonly used and representative models on both the SanWen dataset and the Cultural-travel dataset (F1 of 61.22% and 60.26%, against 58.43% and 57.98% for the strongest baseline), illustrating that our method is effective and portable on the Chinese relation extraction problem even without word segmentation tools, a dictionary, or an external knowledge base. The F1 values obtained on the larger SanWen dataset and the smaller Cultural-travel dataset differ only slightly, indicating that our model has a degree of robustness and generalization ability when processing datasets of different sizes.

Table 4
Results of different RE models on the SanWen dataset

Number	Model	Feature set	Precision/%	Recall/%	F1/%
1	BiLSTM	Character embedding + position embedding, NER, WordNet	54.52	58.87	56.61
2	Attention BiLSTM	Character embedding + position embedding	54.23	60.29	57.10
3	CNN	Character embedding + position embedding, NER, WordNet	53.41	58.17	55.69
4	Attention CNN	Character embedding + position embedding	54.69	58.35	56.46
5	Att-Pooling-CNN	Character embedding + position embedding	54.46	59.13	56.70
6	Att-BLSTM (Latent entity typing)	Character embedding + position embedding, Latent entity typing	55.31	60.55	57.81
7	Att-BLSTM+C-Att-BLSTM	Character embedding + position embedding, Entity sense	57.64	59.24	58.43
8	Our model	Character embedding + position embedding, Multiple embeddings	59.73	62.79	61.22

Table 5

Results of different RE models on the Cultural-travel dataset

Number	Model	Feature set	Precision/%	Recall/%	F1/%
1	BiLSTM	Character embedding + position embedding	54.29	57.50	55.85
2	Attention BiLSTM	Character embedding + position embedding	54.37	59.39	56.77
3	CNN	Character embedding + position embedding	52.96	57.42	55.10
4	Attention CNN	Character embedding + position embedding	53.82	58.41	56.02
5	Att-Pooling-CNN	Character embedding + position embedding	54.55	58.66	56.53
6	Att-BLSTM (Latent entity typing)	Character embedding + position embedding, Latent entity typing	55.43	58.90	57.11
7	Att-BLSTM+C-Att-BLSTM	Character embedding + position embedding, Entity sense	57.17	58.81	57.98
8	Our model	Character embedding + position embedding, Multiple embeddings	59.04	61.53	60.26

Fig. 7

F1 values of different RE models on the SanWen and cultural-travel datasets.

4.2.2 Validating the role of multiple embedded representations

On Chinese corpora, both Word2vec and FastText can make good use of character-level information to generate embedding vectors. FastText can handle unregistered (out-of-vocabulary) words [34], which are a common problem in Chinese corpora, and Word2vec takes into account the information within the local context window [35]. Therefore, Word2vec and FastText are chosen as the embedding models for the Chinese corpus. To verify the usefulness of multiple embedding representations, the vector representations generated by Word2vec alone, by FastText alone, and by the two jointly are each used as input to the proposed model. As can be seen from Table 6, on both the SanWen dataset and the Cultural-travel dataset, the F1 values obtained with the joint representation are higher than those obtained with a single embedding model (61.22% against 59.74% and 60.76% on SanWen). This suggests that multiple embedding representations enrich the meanings of words and the relationships between them, which alleviates the polysemy problem to a certain extent and improves relation extraction.
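One simple way to let two embedding models jointly represent a character is to concatenate their vectors. The sketch below illustrates that idea only; this section does not specify how the Word2vec and FastText vectors are combined, so concatenation, the zero-vector fallback for missing characters, the function name, and the toy lookup tables are all assumptions made for illustration.

```python
def joint_embedding(chars, table_a, table_b, dim_a, dim_b):
    """Concatenate two per-character embeddings into one vector sequence.

    table_a and table_b map a character to a list of floats; a character
    missing from a table falls back to a zero vector of that table's size.
    """
    sequence = []
    for ch in chars:
        vec_a = table_a.get(ch, [0.0] * dim_a)
        vec_b = table_b.get(ch, [0.0] * dim_b)
        sequence.append(list(vec_a) + list(vec_b))
    return sequence


# Toy lookup tables standing in for trained Word2vec and FastText models.
toy_word2vec = {"重": [0.1, 0.2]}
toy_fasttext = {"重": [1.0, 1.1, 1.2], "庆": [2.0, 2.1, 2.2]}
vectors = joint_embedding("重庆", toy_word2vec, toy_fasttext, 2, 3)
```

Each character then carries information from both models, and a character unknown to one model still receives a non-trivial representation from the other.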

Table 6
Validating the role of multiple embedded representations

Input	SanWen(F1%)	Cultural-travel(F1%)
Word2vec	59.74	59.46
FastText	60.76	60.04
Word2vec+FastText	61.22	60.26

4.2.3 Validating the role of multi-head self-attention

To verify the role of the multi-head self-attention mechanism, the models CNN-Maxpool and CNN-Attention-Maxpool are each substituted for C-Mutihead_Att-Maxpool (part 2 of the overall framework, which extracts word features). As can be seen from Table 7, on both the SanWen dataset and the Cultural-travel dataset, the word feature extraction model based on multi-head self-attention obtains a higher F1 value than the model without attention or with a single attention mechanism. This shows that multi-head self-attention contributes more to relation extraction performance. The reason is that the multi-head self-attention mechanism computes several heads in parallel, each producing its own representation, which captures richer feature information and further alleviates the long-distance dependency problem.
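The difference between a single attention head and several heads can be seen in a stripped-down sketch of multi-head self-attention. It is a simplification rather than the model's layer: the learned query, key, value and output projections are omitted, each head just attends over its own slice of the feature dimension, and the function names are hypothetical. The feature dimension is assumed to be divisible by the number of heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with Q = K = V = seq."""
    d = len(seq[0])
    output = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in seq]
        weights = softmax(scores)
        output.append([sum(w * value[j] for w, value in zip(weights, seq))
                       for j in range(d)])
    return output


def multi_head_self_attention(seq, num_heads):
    """Split features into num_heads slices, attend per slice, concatenate."""
    d = len(seq[0])
    if d % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    head_dim = d // num_heads
    heads = [
        self_attention([row[h * head_dim:(h + 1) * head_dim] for row in seq])
        for h in range(num_heads)
    ]
    # Concatenate the head outputs position by position
    return [sum((head[t] for head in heads), []) for t in range(len(seq))]
```

Because every head computes its own attention weights, different heads can focus on different positions of the same sentence, which is the "multiple representations" property referred to above.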

Table 7
Validating the role of multi-head self-attention

Model	SanWen(F1%)	Cultural-travel(F1%)
CNN-Maxpool	58.50	57.25
CNN-Attention-Maxpool	60.64	59.43
C-Mutihead_Att-Maxpool	61.22	60.26

4.2.4 Validating the importance of character features and word features and the role of gating mechanisms

The model BLSTM-Att is part 1 of the overall framework and extracts character features; the model C-Mutihead_Att-Maxpool is part 2 of the overall framework and extracts word features; Concatenation, Addition and the Gating mechanism are three methods of combining the two features. As can be seen from Table 8, on both the SanWen and Cultural-travel datasets, combining the two features with either the Concatenation method or the Addition method always yields a higher F1 value than using a single feature, indicating that both character information and word information are important for expressing the meaning of a sentence. The F1 values obtained with Concatenation and Addition are lower than the F1 value obtained with the gating mechanism, indicating that gated feature fusion expresses sentence information better. This is because the gating mechanism adaptively adjusts the weight of each model's output during training, allowing the different types of models to work together and thereby improving overall performance.

Table 8
Validating the role of different feature information and gating mechanisms

Model	Feature Fusion Method	SanWen(F1%)	Cultural-travel(F1%)
BLSTM-Att	–	58.63	58.27
C-Mutihead_Att-Maxpool	–	57.86	57.10
BLSTM-Att and C-Mutihead_Att-Maxpool	Concatenation	60.52	59.52
	Addition	60.12	59.43
	Gating mechanism	61.22	60.26

5 Conclusion

In this paper, we propose a relationship extraction method based on multiple embedding representations and multi-head self-attention. Using character embedding eliminates the need for a word segmentation tool and avoids the impact of segmentation errors. Using multiple trained embedding models to jointly represent character vectors enriches the meanings of characters and the relationships between them; a convolutional neural network with multiple kernel sizes then enriches the meanings of words and their relationships. Neither step requires a dictionary or an external knowledge base to extend the word information, which avoids spending excessive manpower and time on mitigating polysemy. The multiple representation capability of the multi-head self-attention mechanism further mitigates word polysemy, and the gating mechanism adaptively adjusts the weights of the outputs of the individual models during training to adequately represent sentence semantics. The limitation of this paper is that the current dataset is relatively small and there is some sample imbalance between categories, which may lead to a gap between precision and recall. In future work, we can explore the method's complexity, or build a more comprehensive dataset to further explore the importance of features such as strokes and pronunciation in expressing Chinese semantic information, and seek to build new neural network models to further improve performance.

Acknowledgments

This work is supported by the National Key Research and Development Plan of China, Key Project of Cyberspace Security Governance (No. 2022YFB3103103).

References

1. …, Pan, Cambria, et al. A survey on knowledge graphs: Representation, acquisition, and applications[J], IEEE Transactions on Neural Networks and Learning Systems 33(2) (2021), 494–514.
2. Chen, Wang, Yang, Qing, Huang and Chen, A multi-channel deep neural network for relation extraction, IEEE Access 8 (2020), 13195–13203.
3. Cheng, Yao, Xiang, Zhang, Tang and Zhong, Text sentiment orientation analysis based on multi-channel CNN and bidirectional GRU with attention mechanism, IEEE Access 8 (2020), 134964–134975. https://dx-doi-org.web.bisu.edu.cn/10.1109/ACCESS.2020.3005823
4. …, Mou, Li, Chen, Peng, Jin, Classifying relations via long short term memory networks along shortest dependency paths, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1785–1794. https://dx-doi-org.web.bisu.edu.cn/10.18653/v1/D15-1206
5. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, pp. 1746–1751. https://dx-doi-org.web.bisu.edu.cn/10.3115/v1/D14-1181
6. Ruder, Ghaffari, Breslin J.G., Character-level and multi-channel convolutional neural networks for large-scale authorship attribution, 2016, arXiv preprint arXiv:1609.06686.
7. Vaswani, Shazeer, Parmar, et al. Attention is all you need[J], Advances in Neural Information Processing Systems 30 (2017).
8. Fundel, Küffner and Zimmer, RelEx–Relation extraction using dependency parse trees[J], Bioinformatics 23(3) (2007), 365–371.
9. Nebhi, A rule-based relation extraction system using DBpedia and syntactic parsing[C], Proceedings of the NLP-DBPEDIA-2013 Workshop co-located with the 12th International Semantic Web Conference (ISWC 2013), 2013.
10. Kambhatla, Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction[C], Proceedings of the ACL Interactive Poster and Demonstration Sessions, 2004, 178–181.
11. Che W.X., Liu, Li, Automatic entity relation extraction[J], Journal of Chinese Information Processing, 2005.
12. Huang, Zhu Q.M. and Qian L.H., Chinese entity relationship extraction based on feature combination[J], Microelectronics and Computers 2010(4), 198–200.
13. Gan X.L., Wan C.X. and Liu D.X., Chinese entity relationship extraction based on syntactic and semantic features[J], Computer Research and Development 53(2) (2016), 284–302.
14. Cristianini, Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods[M], Cambridge University Press, 2000.
15. Zelenko, Aone and Richardella, Kernel methods for relation extraction[J], Journal of Machine Learning Research 3(Feb) (2003), 1083–1106.
16. Culotta, Sorensen, Dependency tree kernels for relation extraction[C], Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004, 423–429.
17. Bunescu, Mooney, A shortest path dependency kernel for relation extraction[C], Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005, 724–731.
18. …, Li, Wang, Li and Zhong, A comprehensive exploration of semantic relation extraction via pre-trained CNNs, Knowl-Based Syst 194 (2020), 105488. https://doi.org/10.1016/j.knosys.2020.105488
19. Zhang, Wang, Relation classification via recurrent neural network, 2015, arXiv preprint arXiv:1508.01006.
20. Wei, Xu, Hu, Entity relationship extraction based on bi-LSTM and attention mechanism, in: 2021 2nd International Conference on Artificial Intelligence and Information Systems, 2021, pp. 1–5. https://dx-doi-org.web.bisu.edu.cn/10.1145/3469213.3470701
21. Liu, Sun, Chao, Che, Convolution neural network for relation extraction, in: International Conference on Advanced Data Mining and Applications, 2013, pp. 231–242. https://dx-doi-org.web.bisu.edu.cn/10.1007/978-3-642-53917-6-21
22. Zeng, Liu, Lai, Zhou, Zhao, Relation classification via convolutional deep neural network, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, 2014, pp. 2335–2344.
23. Nguyen T.H., Grishman, Relation extraction: Perspective from convolutional neural networks, in: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015, pp. 39–48. https://dx-doi-org.web.bisu.edu.cn/10.3115/v1/W15-1506
24. Zhang, Wang, Relation classification via recurrent neural network, 2015, arXiv preprint arXiv:1508.01006.
25. Zhang, Zheng, Hu, Yang, Bidirectional long short-term memory networks for relation classification, in: Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, 2015, pp. 73–78.
26. Zhou, Shi, Tian, Qi, Li, Hao, Xu, Attention-based bidirectional long short-term memory networks for relation classification, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 207–212. https://dx-doi-org.web.bisu.edu.cn/10.18653/v1/P162034
27. Mintz, Bills, Snow, et al. Distant supervision for relation extraction without labeled data[C], Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, 1003–1011.
28. Zeng, Liu, Chen, et al. Distant supervision for relation extraction via piecewise convolutional neural networks[C], Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, 1753–1762.
29. Lin, Shen, Liu, et al. Neural relation extraction with selective attention over instances[C], Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, 2124–2133.
30. …, Ding, Liu, et al. Chinese relation extraction with multi-grained information and external linguistic knowledge[C], Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 4377–4386.
31. …, Yuan L.P. and Zhong, Chinese relation extraction using lattice GRU[C], 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), IEEE, 1 (2020), 1188–1192.
32. Zhang, Hao, Tang, et al. A multi-feature fusion model for Chinese relation extraction with entity sense[J], Knowledge-Based Systems 206 (2020), 106348.
33. Kong, Liu, Wei, et al. Chinese relation extraction using extend softword[J], IEEE Access 9 (2021), 110299–110308.
34. Bojanowski, Grave, Joulin, et al. Enriching word vectors with subword information[J], Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
35. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems (2013), 3111–3119.
36. …, Wen, Sun, et al. A discourse-level named entity recognition and relation extraction dataset for Chinese literature text[J], arXiv preprint arXiv:1711.07010, 2017.
37. Zeiler M.D., Adadelta: an adaptive learning rate method[J], arXiv preprint arXiv:1212.5701, 2012.
38. Zeng, Liu, Lai, Zhou, Zhao, Relation classification via convolutional deep neural network, 2014.
39. Shen, Huang X.J., Attention-based convolutional neural network for semantic relation extraction[C], Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, 2526–2536.
40. Wang, Cao, De Melo, et al. Relation classification via multi-level attention CNNs[C], Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, 1298–1307.
41. Lee, Seo and Choi Y.S., Semantic relation classification via bidirectional LSTM networks with entity-aware attention using latent entity typing[J], Symmetry 11(6) (2019), 785.
