Abstract
Word segmentation errors and polysemy are common problems in Chinese relation extraction. Character-based input avoids segmentation errors, but obtaining the word information of a sentence then usually requires a dictionary or an external knowledge base, which costs considerable manpower and time. To address these problems, this article takes characters as input and uses multiple embedding models to jointly form a character vector sequence. Features containing character information are obtained through BiLSTM and attention layers. Because convolutional neural networks are good at extracting local features, features containing word information are obtained through multi-kernel convolutional layers and multi-head self-attention layers. Finally, a gating mechanism fuses the two kinds of features. The model was tested on the public SanWen data set and our own cultural-travel data set, and obtained F1 values of 61.22% and 60.26% respectively. Experimental results show that our method achieves good relation extraction performance without word segmentation tools, dictionaries or external knowledge bases, and outperforms most commonly used models.
Keywords
Introduction
The field of culture and tourism holds a large amount of data and knowledge. The data have been widely used for user preference inference, intelligent recommendation and similar tasks, but the knowledge has been largely neglected because it receives little attention and knowledge management is weak. As a descriptive framework for structured knowledge, a knowledge graph carries rich linguistic information and strong reasoning ability, and can therefore manage knowledge better [1]. Constructing a cultural and tourism knowledge graph thus helps to assess tourism consumption trends, the current development of the industry and so on. Relation extraction on the cultural-travel dataset is one of the key steps in constructing this knowledge graph. Chinese relation extraction methods can be classified into two types by input granularity. The first is based on character input, where sentences are represented by character embeddings, with the disadvantage that word-level information cannot be fully utilised. The second is based on word input, which often suffers from word segmentation errors that reduce the accuracy of extraction [30]. For example, take a sentence meaning “I’ll be travelling home in the next couple of days, so I’ll try the authentic Chongqing market hotpot again before I leave”. Here the phrase “Chongqing market hotpot” should be segmented as (Chongqing)/(market)/(hotpot), but a segmenter may instead produce (Chongqing Municipality)/(well)/(hotpot). Character-based input avoids this kind of error. Polysemy is another common problem in Chinese relation extraction. For example, in a sentence meaning “I’m going to a nearby park today where there are lots of cuckoos/azaleas”, the same Chinese word can denote a bird (cuckoo) or a flower (azalea). To alleviate polysemy, most existing methods construct dictionaries or introduce external knowledge bases as extended word information; however, this requires a lot of manpower and time and is relatively inefficient. How to adequately express sentence information is also a concern for relation extraction. Most current feature combination methods use direct concatenation [2–4] or summation [5, 6], but these approaches do not work well for combining features that express different information.
To address the above situation, this paper proposes a relation extraction method based on multiple embedding representations and multi-head self-attention. The main contributions of this paper are as follows: (1) Character embedding eliminates the need for a word segmentation tool and avoids the impact of segmentation errors. (2) Using multiple embedding models to jointly represent character vectors enhances the meaning of characters and the relationships between them. Because convolutional neural networks are good at extracting local features, the character vector sequence is fed into a convolutional neural network with multiple kernels of different sizes to extract the local features of a sentence. Each local feature contains the information of several characters and can be regarded as a word vector, which enhances the meaning of words and the relationships between them and alleviates polysemy without dictionaries or external knowledge bases, saving manpower and time. (3) The multi-head self-attention mechanism, which introduces multi-head computation and multiple representations, attends to different representation subspaces of the input, captures richer feature information, improves robustness, addresses the long-distance dependency problem [7], and further mitigates polysemy. (4) The features containing character information and word information are fused using a gating mechanism, which adaptively adjusts the weight of each model output during training, so that the fused feature vector suits the task and more fully expresses the semantic information of the sentence. (5) Experiments on the publicly available SanWen dataset and the Cultural-travel dataset show that the model obtains higher F1 values than most commonly used models and is portable.
The rest of the paper is organised as follows. An overview of the current state of research on relational extraction is given in Section 2. A detailed description of the proposed research methodology is given in Section 3. In Section 4 a series of experiments on the proposed methodology are conducted and analysed. In Section 5 some conclusions are drawn as well as an outlook for future work.
Related work
The research process of relational extraction methods can be broadly summarised as pattern matching based methods, feature based and kernel based methods, and deep learning based methods.
Pattern-matching-based relation extraction methods extract inter-entity relationships using manually constructed rules or dictionaries. For example, K. Fundel et al. [8] proposed a rule-based approach that first generates a dependency parse tree and then extracts relationships using a knowledge base or rule base. The extracted relationships are then checked for reasonableness; for example, a relationship preceded by a negation word is discarded. Kamel Nebhi [9] proposed an approach that uses a grammar parser to perform entity-relationship extraction with the DBPedia dataset: rules for extracting patterns were written using 188 training articles sourced from Quero News, and French news was used for evaluation. Relations extracted with knowledge-base rules have high accuracy. However, this kind of approach only supports specific domains, is poorly portable, and its rules are difficult to write and take a long time to construct.
Feature-based approaches require selecting a large number of features, such as lexical, syntactic and entity features, then combining certain features and choosing a classifier to classify relations. These features are selected manually, and a combination of a limited number of feature items does not express semantic relations well. Kernel-based methods do not use feature vectors directly, but they require manual design of the kernel function; as performance requirements increase, the kernel function becomes exceptionally complex, and training on large-scale data is slow and performs poorly. Kambhatla [10] used a variety of text features to form vectors and achieved good results with a maximum entropy model. Che Wanxiang et al. [11] used Winnow and Support Vector Machines on the ACE2004 dataset and obtained F1 values of 73.08% and 73.27% respectively. Xin Huang et al. [12] combined lexical, syntactic and entity features and achieved an F1 value of 72.77% on the ACE2005 dataset, while Lixin Gan et al. [13] combined two major features, syntactic relations and dependencies, for this task. All of the above methods achieve good results, but the features are selected manually. Cristianini et al. [14] showed that a combination of a finite number of feature terms does not express semantic relations well and that the performance of feature combinations is limited, which motivated kernel-function-based approaches. Zelenko et al. [15] combined three manually designed kernel functions with Support Vector Machines and the Voted Perceptron learning algorithm to extract person-affiliation and organisation-location relations from unstructured natural language text, with good results. Culotta et al. [16] used an extended kernel with support vector machines to extract relationships, and Bunescu et al. [17] proposed a new relation extraction scheme based on the observation that the information required to assert a relationship between two named entities in the same sentence is usually captured by the shortest path between the two entities in the dependency graph. Their experiments showed that the shortest-path dependency kernel outperforms the method based on the dependency tree kernel.
Instead of manually pre-selecting features, deep-learning-based methods use vector combinations to represent complex semantic information. The general process of relation extraction under deep learning can be summarised as follows: first the semantic information of the input sentence is represented using vectors, then the vectors are fed into a feature extraction model, and finally the relations are output by a classifier. Convolutional neural networks (CNN) [18], recurrent neural networks (RNN) [19], long short-term memory networks (LSTM) [20] and their extensions have been widely used in this task with good results. Liu et al. [21] were the first to apply CNNs to relation extraction. Instead of using word embeddings as the input layer, the method encodes the input using a synonym dictionary and then uses a CNN without a pooling layer to learn semantic features automatically. On the ACE2005 dataset, the method outperformed the then state-of-the-art kernel-based method by 9%. Zeng et al. [22] then added a max pooling layer to the model of Liu et al. [21] and introduced positional embeddings to represent the distance between words and entities, combining word embeddings and positional embeddings as the input; this approach was widely adopted. Nguyen et al. [23] built on the method of Zeng et al. [22] by using more convolutional filters, and their experiments show that using multiple filters improves CNN performance. Zhang et al. [24] used an RNN instead of a CNN as the relation extraction model, and the results showed that the RNN performs well on the task. Zhang et al. [25] proposed a bidirectional long short-term memory network for relation extraction, and the experimental results show that the model achieves state-of-the-art performance on the SemEval-2010 dataset.
They also used external lexical resources (e.g., WordNet) to obtain more features and achieved better results. In addition, Zhou et al. [26] used the BLSTM-ATT model for relation extraction, where the attention mechanism reduces the effect of noise and strengthens the role of keywords. Mintz et al. [27] proposed the first distant-supervision-based approach, which trains supervised models without manually labelled data and was a major breakthrough in relation extraction. However, the method had several problems when first proposed, the biggest being that the automatically labelled sentences do not necessarily express the corresponding relations. To address this, Zeng et al. [28] proposed the PCNN model, which alleviates mislabelling and the propagation and accumulation of errors. Lin et al. [29] built sentence-level attention over multiple instances to learn features, assigning a high weight to sentence features that express the relation and a low weight to those that do not. The method can train the model on data obtained by distant supervision without being dominated by the resulting annotation errors, and learning from the features of multiple sentences avoids omitting features.
Word segmentation errors and polysemy are common difficulties in Chinese relation extraction, which has received far less research attention and made less progress than English relation extraction. To avoid or mitigate these problems, more and more scholars use character input to avoid segmentation errors and then introduce dictionaries or external knowledge bases as extended word information. Li et al. [30] proposed the MG-Lattice model in 2019, which combines character input with an external linguistic knowledge base for relation extraction. Xu et al. [31] introduced a network model that merges word and character inputs and uses BiGRU instead of BiLSTM. Zhang et al. [32] proposed the concept of entity meaning as external linguistic knowledge to enhance sentence information. Kong et al. [33] proposed an adaptive approach that incorporates word information at the embedding layer, using a dictionary to merge all the words matched by each character into the character-based model. To mitigate polysemy, all of the above models use dictionaries or external knowledge bases as extended word information, but constructing them requires a lot of time and manpower. In this paper, considering that different embedding models have different training details, we use multiple embedding models to jointly represent the character vectors, and then exploit the ability of convolutional neural networks to extract local features, using those local features as extended word information. This alleviates polysemy while avoiding word segmentation and without constructing a dictionary or an external knowledge base.
Methods
For a given Chinese sentence and two labelled entities, the task of relation extraction is to extract the semantic relationship between the two entities. The overall framework of the proposed method is shown in Fig. 1 and consists of three parts: Part 1 obtains features containing the character information of the sentence, Part 2 obtains features containing the word information of the sentence, and Part 3 fuses the features. In Part 1, character embedding represents each character in the sentence as a vector to avoid word segmentation errors, and the character vector sequence is then fed into a network composed of a BiLSTM layer and an attention layer to obtain character-level features. Part 2 uses multiple embedding models to jointly form the character vector sequence, which enhances the meaning of characters and the relationships between them. Because convolutional neural networks are good at extracting local features, a convolutional neural network with multiple kernel sizes is used to obtain the local features of the sentence; each local feature contains the information of several characters and can be regarded as a word vector. This in turn enhances the meaning of words and the relationships between them, easing the problem of polysemy. The word vector sequence is then sent to a network composed of a multi-head self-attention layer and a max pooling layer to obtain word-level features. Part 3 fuses the character-level and word-level features so as to fully express the semantic information of the sentence. Finally, the fused feature vector is passed to the softmax layer, which produces a probability distribution over all relation categories; the category with the highest probability is selected as the predicted relation.

Fig. 1. General framework.
This part is summarised in Algorithm 1 and described in detail below. The key steps of Algorithm 1 are:
$BLS = \{ bls_i \},\ i \in \{1, \dots, l\}$
$\rho = \mathrm{softmax}(W^{T} \tanh(BLS))$
$Y = W_{bls} \tanh(\rho^{T} BLS)$
Inputs
The input to this part of the network model is a sentence vector $S = \{v_1, v_2, \dots, v_l\} \in \mathbb{R}^{l \times a}$ with $a = m + 2n$, where $l$ is the length of the sentence, $m$ is the character vector dimension and $n$ is the position vector dimension. Each $v_i$ is the vector representation of the $i$-th character, consisting of the character vector $c_i$ and the two position vectors.
The sentence vector $S = \{v_1, v_2, \dots, v_l\} \in \mathbb{R}^{l \times a}$ is fed into the network model consisting of the BiLSTM layer and the attention layer, as shown in Fig. 2. The BiLSTM ensures that the extracted features are global and complete, and the attention mechanism assigns different weights to different words, so the combination of the two enhances the semantic information and yields a higher-quality feature vector $Y \in \mathbb{R}^{d_{out}}$ ($d_{out}$ is equal to the number of relation categories) that contains the character information of the sentence. The input first passes through the BiLSTM layer, which consists of a forward LSTM and a backward LSTM with the same structure. Each LSTM consists of a series of recursively connected neurons (indicated by yellow circles in the figure). Each neuron consists of an input gate $i_i$, a forget gate $f_i$, an output gate $o_i$, an input $x_i$, a cell state $c_i$ and a temporary cell state.
Here $W_{xf}$, $W_{hf}$, $W_{cf}$ and $b_f$ are the weight matrices and bias of the forget gate, which control how much is forgotten; $W_{xi}$, $W_{hi}$, $W_{ci}$ and $b_i$ are those of the memory gate, which control how much is memorised; $W_{xc}$, $W_{hc}$, $W_{cc}$ and $b_c$ are those for the newly learned content in the memory gate; and $W_{xo}$, $W_{ho}$, $W_{co}$ and $b_o$ are those of the output gate. Each character learns a forward hidden state vector and a backward hidden state vector.
Finally, a new matrix is output, expressed as $BLS = [bls_1, bls_2, \dots, bls_l] \in \mathbb{R}^{l \times 2g}$, where $g$ is the number of LSTM units; this feature contains the contextual information. The matrix then passes through the attention layer, which is computed as follows:
$\rho = \mathrm{softmax}(W^{T} \tanh(BLS))$
$Y = W_{bls} \tanh(\rho^{T} BLS)$
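As a minimal sketch of the attention step, the computation of $\rho$ and $Y$ can be traced in plain Python. The function name `attention_pool`, the toy numbers, and the reading of $W$ as a vector of length $2g$ applied to each row of $BLS$ are assumptions made for illustration; they are not details confirmed by the method description.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(bls, w, w_bls):
    """bls: l rows of length 2g; w: length 2g; w_bls: d_out rows of length 2g."""
    # rho = softmax(W^T tanh(BLS)): one weight per character position
    scores = [sum(wj * math.tanh(hj) for wj, hj in zip(w, row)) for row in bls]
    rho = softmax(scores)
    # rho^T BLS: attention-weighted sum of the BiLSTM outputs
    dim = len(bls[0])
    pooled = [sum(r * row[j] for r, row in zip(rho, bls)) for j in range(dim)]
    # Y = W_bls tanh(rho^T BLS)
    squashed = [math.tanh(p) for p in pooled]
    y = [sum(a * b for a, b in zip(w_row, squashed)) for w_row in w_bls]
    return rho, y

# Toy example (assumed values): l = 3 characters, 2g = 2, d_out = 2
bls = [[0.1, 0.2], [0.9, -0.4], [0.3, 0.3]]
rho, y = attention_pool(bls, w=[1.0, 0.5], w_bls=[[1.0, 0.0], [0.0, 1.0]])
```

The weights in `rho` sum to one, so `y` is driven mostly by the positions that score highest under `w`.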

Fig. 2. Obtaining features containing character information.
This part is summarised in Algorithm 2 and described in detail below. The key steps of Algorithm 2 are:
$W = \{ w_t \},\ t \in \{1, \dots, k\}$
$X = \{x_1, x_2, \dots, x_l\} \leftarrow W = \{w_1, w_2, \dots, w_k\}$
$U = \mathrm{Maxpooling}(A)$
Inputs
The input to this part differs from the input used to obtain features containing character information. Since different embedding models have different training details, using multiple trained embedding models to jointly represent the character vectors can enhance the meaning of the characters and the relationships between them. For Chinese corpora, both Word2vec [35] and FastText can make good use of character-level information to generate embedding vectors. Word2vec takes into account the information within the local context window, while FastText can also handle unregistered (out-of-vocabulary) words, which are a common problem in Chinese corpora. Therefore, Word2vec and FastText are chosen as the vector representation models for the Chinese corpus. The character vector is jointly formed from the representations produced by the two models.
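The joint representation can be illustrated with a small sketch. It assumes that the two embeddings of a character are combined by concatenation, which the text does not state explicitly. The lookup tables are random stand-ins for pretrained Word2vec and FastText vectors, and the names `make_table`, `joint_char_vector` and `embed_sentence` are hypothetical.

```python
import random

def make_table(vocab, dim, seed):
    """Hypothetical stand-in for a pretrained embedding lookup table."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in vocab}

def joint_char_vector(ch, w2v_table, ft_table):
    """Concatenate the two embeddings of one character (assumed combination)."""
    return w2v_table[ch] + ft_table[ch]

def embed_sentence(chars, w2v_table, ft_table):
    """Map a character sequence to a sequence of joint vectors."""
    return [joint_char_vector(ch, w2v_table, ft_table) for ch in chars]

vocab = ["c1", "c2", "c3"]                # placeholder character ids
w2v = make_table(vocab, dim=4, seed=0)    # stands in for Word2vec vectors
ft = make_table(vocab, dim=4, seed=1)     # stands in for FastText vectors
sentence = embed_sentence(["c1", "c3", "c2"], w2v, ft)
```

Under this assumption each character vector has the combined dimension of the two models, here 4 + 4 = 8.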
Multi-kernel convolution layer
Feature vectors obtained from character embeddings alone cannot fully express the meaning of a sentence. Therefore, the character embeddings are taken as input and a convolutional neural network with multiple kernel sizes is used to extract local features. As Fig. 3 shows, each output of the convolutional neural network covers the information of several characters, so the extracted local features can be regarded as word vectors. The input represented by multiple embedding models enhances the meaning of the characters and the relationships between them, and the word vectors obtained through this layer in turn enhance the meaning of the words and the relationships between them. Specifically, the input sequence is fed into a convolutional neural network with multiple kernels of different sizes, and each kernel produces a word vector $w_t$, where $t$ denotes the $t$-th convolutional kernel. The calculation formula is as follows:

Fig. 3. Convolutional neural network with convolutional kernel size 3.
where $1 \le i \le l - (j - i + 1)$, $S_{i:j}$ denotes the sequence of vectors from $i$ to $j$ in the sentence $S$, $t$ denotes the $t$-th convolutional kernel, $l$ is a fixed sentence length, and the kernel size is $j - i + 1$.

Fig. 4. Local features obtained by convolutional layers.
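The multi-kernel convolution described in this subsection can be sketched in plain Python. Several choices below are assumptions rather than details from the text: the tail of the sentence is zero-padded so that every kernel returns one value per position (giving $k$ rows of length $l$), the activation is ReLU, and each kernel has a single filter. The function names are hypothetical.

```python
def conv1d_same(seq, kernel, bias=0.0):
    """Slide one kernel (size x dim weights) over seq, zero-padding the tail
    so the output keeps the sentence length l."""
    size, dim = len(kernel), len(seq[0])
    padded = seq + [[0.0] * dim for _ in range(size - 1)]
    out = []
    for i in range(len(seq)):
        window = padded[i:i + size]
        total = bias + sum(kernel[r][c] * window[r][c]
                           for r in range(size) for c in range(dim))
        out.append(max(0.0, total))  # ReLU activation, assumed
    return out

def multi_kernel_conv(seq, kernels):
    """Return W as k rows of length l, one row per kernel."""
    return [conv1d_same(seq, kernel) for kernel in kernels]

def transpose(rows):
    """Turn the k x l matrix W into the l x k sequence X."""
    return [list(col) for col in zip(*rows)]

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]   # l = 4, a = 2
kernels = [
    [[1.0, 1.0], [1.0, 1.0]],                 # kernel size 2
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],     # kernel size 3
]
W = multi_kernel_conv(seq, kernels)   # k x l
X = transpose(W)                      # l x k
```

Each entry of `W` summarises a window of two or three neighbouring characters, which is the sense in which a local feature acts as a word vector.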
The vector sequence $W = \{w_1, w_2, \dots, w_k\} \in \mathbb{R}^{k \times l}$ obtained from the convolutional layer is first converted into the shape $X = \{x_1, x_2, \dots, x_l\} \in \mathbb{R}^{l \times k}$ required by the model, and is then fed into the network model consisting of the multi-head self-attention layer and the max pooling layer. The multi-head self-attention mechanism, which introduces multi-head computation and multiple representations, attends to different representation subspaces of the input and captures richer feature information. Polysemy is thus further mitigated on top of the multiple input vector representations. The computational process of the multi-head self-attention layer is shown in Fig. 5. First, the input sequence $X$ is transformed at the linear transformation layer into the query matrix $Q_i$, key matrix $K_i$ and value matrix $V_i$, each in $\mathbb{R}^{l \times (k/h)}$ (the subscript $i$ indicates that the parameter belongs to head $i$, and $h$ is the number of heads), by means of linear transformation matrices.

Fig. 5. Obtaining features containing information about sentence words.
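A compact sketch of the multi-head self-attention layer followed by max pooling is given below. It assumes standard scaled dot-product attention within each head (scaling by the square root of $k/h$), concatenation of the head outputs into $A$ with no further output projection, and max pooling over sentence positions. None of these details is spelled out in the text, and the mapping from the pooled vector to dimension $d_{out}$ is not shown. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise numerically stable softmax."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(x, wq, wk, wv):
    """One head: softmax(Q K^T / sqrt(k/h)) V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def multi_head_self_attention(x, heads):
    """heads: list of (wq, wk, wv) triples, each matrix of shape k x (k/h)."""
    outputs = [attention_head(x, wq, wk, wv) for wq, wk, wv in heads]
    # Concatenate the head outputs position by position: l x k
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]

def max_pool(a):
    """Max over sentence positions for every feature column."""
    return [max(col) for col in zip(*a)]

# Toy example (assumed values): l = 3, k = 4, h = 2
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.1], [1.0, 1.0, 0.0, 0.3]]
p1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # head 1 reads features 0-1
p2 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # head 2 reads features 2-3
A = multi_head_self_attention(x, [(p1, p1, p1), (p2, p2, p2)])
U = max_pool(A)
```

Because each head works on its own projection of `x`, different heads can weight the same positions differently, which is the property the text relies on for capturing richer feature information.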
This part is summarised in Algorithm 3 and described in detail below. The key steps of Algorithm 3 are:
$O_{gate} = \sigma(W_{gate}[Y; U] + b_{gate})$
$O = O_{gate} \odot Y + (1 - O_{gate}) \odot U$
Feature fusion based on gating mechanisms
The feature vector $Y \in \mathbb{R}^{d_{out}}$ representing character information and the feature vector $U \in \mathbb{R}^{d_{out}}$ representing word information are fused using a gating mechanism. The gating mechanism adaptively adjusts the weights of each model's output during training, so that the different types of models work together and compensate for each other's deficiencies, making the fused feature vector more suitable for the task and thus improving overall performance.
As shown in Fig. 6, the sigmoid function (denoted $\sigma$) is used as the gating function. The features $Y$ and $U$ are concatenated as input to obtain the gating tensor $O_{gate}$, a learnable tensor determined by the parameters $W_{gate} \in \mathbb{R}^{2d_{out}}$ and $b_{gate}$. The gating tensor is then multiplied with the two features to obtain the fused feature vector $O \in \mathbb{R}^{d_{out}}$ ($\odot$ denotes element-wise multiplication). The formulas are as follows:
$O_{gate} = \sigma(W_{gate}[Y; U] + b_{gate})$
$O = O_{gate} \odot Y + (1 - O_{gate}) \odot U$

Fig. 6. Feature fusion using gating mechanism.
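The gated fusion can be sketched as follows. One assumption is made loudly here: `w_gate` is taken to be a matrix with $d_{out}$ rows of length $2d_{out}$, so that the gate has one value per feature and the element-wise products are well defined. The text gives the parameter as lying in $\mathbb{R}^{2d_{out}}$, which would instead yield a single scalar gate. The function name `gated_fusion` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function used as the gate."""
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(y, u, w_gate, b_gate):
    """O_gate = sigmoid(W_gate [Y; U] + b_gate); O = O_gate*Y + (1 - O_gate)*U."""
    joined = y + u  # concatenation [Y; U]
    gate = [sigmoid(sum(w * v for w, v in zip(row, joined)) + b)
            for row, b in zip(w_gate, b_gate)]
    fused = [g * yi + (1.0 - g) * ui for g, yi, ui in zip(gate, y, u)]
    return gate, fused

# With zero weights the gate is 0.5 everywhere and O is the plain average.
gate, fused = gated_fusion([1.0, 3.0], [3.0, 1.0],
                           [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
```

A gate value near 1 keeps the character-level feature, a value near 0 keeps the word-level feature, and training is free to choose a different balance for each dimension.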
The feature vector $O$ obtained in Section 3.3 is fed into the softmax classifier to calculate the probability of each category. For each sentence, the category with the largest probability is the predicted relation. It is calculated as follows:
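Under the assumption that the softmax acts directly on the components of $O$, with no additional linear layer (plausible because $O \in \mathbb{R}^{d_{out}}$ already has one component per relation category, but not confirmed by the text), the standard form would be:

```latex
p(y)_i = \frac{\exp(O_i)}{\sum_{j=1}^{d_{out}} \exp(O_j)},
\qquad
\hat{y} = \arg\max_{i} \, p(y)_i
```

The summation index $j$ and the symbol $\hat{y}$ for the predicted category are introduced here for illustration only.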
where $p(y)_i$ represents the probability that the relation category is $i$.
In this section, we evaluate the model on the publicly available SanWen dataset and on the Cultural-travel dataset, which was built from review data on China’s Ctrip travel website, and conduct a series of ablation experiments. The section first describes the data sets, evaluation metrics and experimental parameter settings, and then presents the overall comparison of models and the comparative experiments.
Description of the data set, assessment indicators and experimental parameter settings
Data sets
The datasets used in this experiment are the publicly available SanWen dataset and the Cultural-travel dataset produced by obtaining review data from the Ctrip travel website in China.
The SanWen dataset contains 837 Chinese documents, of which 17,227 sentences are used for training, 2,220 sentences for testing, and 1,793 sentences for validation. There are 9 kinds of relations. In addition, it includes an extra class for representing relations that do not belong to any of the 9 main relations.
The Cultural-travel dataset follows the annotation standard of the public SanWen dataset [36] to ensure that our annotated data are comparable and interpretable. It comes from review data on China’s Ctrip travel website, part of which was obtained through web crawling and part of which was provided by the relevant cultural and tourism bureaus. After studying and organising the review data, the text was divided into sentences, and special characters and stop words, which are of little use for relation extraction, were removed. Finally, 1822 data items were selected and 5 relationships were summarised. Relationship descriptions and some examples are shown in Table 1. The number of instances of each relationship is shown in Table 2; 1440 of them are used as training data. To annotate the filtered sentences, we surveyed Chinese annotation tools and chose the Elf Annotation Assistant software, which is simple to operate and supports Chinese relation annotation well. After the relationships are defined, sentences can be annotated quickly, and the result is exported in JSON format.
Table 1. Description of relationships
Table 2. Number of relationships
Precision, Recall and the F1 value, which is the harmonic mean of Precision and Recall, are common metrics used to evaluate model performance. They are calculated as follows:
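As a brief illustration, the three metrics can be computed from counts of true positives, false positives and false negatives using their standard definitions. The function name and the example counts are hypothetical, and the text does not state whether scores are micro- or macro-averaged across relation categories, so the sketch covers a single set of counts only.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts: 6 correct predictions, 2 spurious, 4 missed
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
```

The guards against empty denominators return 0.0 when a category has no predictions or no gold instances, which is one common convention among several.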
The hyperparameter settings for this experiment are shown in Table 3. These parameters were selected according to evaluation results on the validation set; the remaining parameters had little effect on the overall performance of the model and were set empirically.
Table 3. Hyperparameter settings
The dimension of the character embedding vector is 50, the dimension of the position embedding vector is 5, the hidden dimension of the BiLSTM layer is 256, the number of convolutions per filter is 256, the convolution kernel sizes range from 2 to 5, and the number of heads in the multi-head self-attention layer is 3. The AdaDelta algorithm [37] is used as the optimiser, with an initial learning rate of 1.0 and a decay rate of 0.9. Dropout is used to mitigate overfitting, with a rate of 0.5 on the input embedding, the BiLSTM layer and the output layer, and the model is trained for 100 epochs.
This part first compares the model with a series of common existing models to verify its effectiveness. It then compares input vectors built with multiple embedding models against input vectors built with a single embedding model, to verify that multiple embedding models mitigate polysemy and improve relation extraction. Next, it verifies the usefulness of the multi-head self-attention mechanism for relation extraction. Finally, a comparison experiment illustrates the importance of both character information and word information, and shows that the gating mechanism expresses sentence information more adequately.
Overall comparison of models
The experiment compares the proposed model with the following baseline models on the SanWen dataset and the Cultural-travel dataset: Bi-LSTM [25], a BiLSTM model for relation extraction; Attention Bi-LSTM [26], which applies an attention mechanism after the Bi-LSTM; CNN [38], which uses a convolutional neural network to extract word- and sentence-level features; Attention CNN [39], an attention-based convolutional neural network that uses information such as word embeddings and positional embeddings; Att-Pooling-CNN [40], which applies attention both to the input sequence and to the pooling layer, to learn the attention of parts of the input sentence to the two entities and to the target category, respectively; Att-BLSTM (Latent entity typing) [41], an entity-aware attention mechanism with latent entity typing; and Att-BLSTM+C-Att-BLSTM [32], which introduces the concept of entity meaning to provide more information and improve accuracy. The experimental data are shown in Tables 4 and 5 and Fig. 7. The results show that, compared with commonly used and representative models, our proposed model performs best on both datasets, illustrating that our method is effective and portable for Chinese relation extraction even without word segmentation tools, dictionaries or external knowledge bases. The F1 values obtained on the larger SanWen dataset and the smaller Cultural-travel dataset differ little, indicating that our model has a certain robustness and generalisation ability across data sets of different sizes.
Table 4. Results of different RE models on the SanWen dataset
Table 5. Results of different RE models on the Cultural-travel dataset

Fig. 7. F1 values of different RE models on the SanWen and Cultural-travel datasets.
For Chinese corpora, both Word2vec and FastText can make good use of character-level information to generate embedding vectors. Word2vec takes into account the information within the local context window [35], while FastText can also handle unregistered (out-of-vocabulary) words [34], which are a common problem in Chinese corpora. Therefore, Word2vec and FastText are chosen as the embedding representation models for the Chinese corpus. To verify the usefulness of multiple embedding representations, the vectors generated by Word2vec alone, by FastText alone, and jointly by the two are each used as input to the proposed model. As Table 6 shows, on both the SanWen dataset and the Cultural-travel dataset the F1 values obtained with the joint representation are higher than those obtained with a single embedding model. This suggests that multiple embedding representations enhance the meanings of the words and the relationships between them, which alleviates polysemy to a certain extent and improves relation extraction.
Validating the role of multiple embedded representations
To verify the role of the multi-head attention mechanism, the models CNN-Maxpool and CNN-Attention-Maxpool are each used in place of C-Mutihead_Att-Maxpool (part 2 of the overall framework, which extracts word features). As can be seen from Table 7, on both the SanWen dataset and the Cultural-travel dataset, the F1 value obtained by the word feature extraction model based on the multi-head attention mechanism is higher than that of the model without an attention mechanism and that of the model with a single attention mechanism. This shows that the multi-head self-attention mechanism contributes more to relation extraction performance. A likely reason is that the multi-head self-attention mechanism computes attention in several heads and thus in several representation subspaces, which captures richer feature information and helps alleviate the long-range dependency problem.
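The word feature pipeline compared above (multi-kernel convolution, multi-head self-attention, max pooling) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the convolution uses fixed averaging windows instead of learned filters, each attention head uses Q = K = V without learned projections, and the kernel sizes 2 and 3 and the head count 2 are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per character position


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_conv(x: Matrix, k: int) -> Matrix:
    """Width-k window over the sequence, zero-padded to keep its length.

    Fixed averaging weights stand in for learned convolution filters.
    """
    n, d = len(x), len(x[0])
    left = (k - 1) // 2
    padded = [[0.0] * d] * left + x + [[0.0] * d] * (k - 1 - left)
    return [[sum(padded[t + i][j] for i in range(k)) / k for j in range(d)]
            for t in range(n)]


def attention_head(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def multi_head_self_attention(x: Matrix, num_heads: int) -> Matrix:
    """Split the feature dimension into heads, attend per head, re-concatenate."""
    d = len(x[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    size = d // num_heads
    heads = [attention_head([row[h * size:(h + 1) * size] for row in x])
             for h in range(num_heads)]
    return [sum((head[t] for head in heads), []) for t in range(len(x))]


def max_pool(x: Matrix) -> List[float]:
    """Max over the sequence dimension, one value per feature."""
    return [max(col) for col in zip(*x)]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy character vectors
# Concatenate the outputs of the two kernel sizes along the feature axis.
features = [a + b for a, b in zip(mean_conv(x, 2), mean_conv(x, 3))]
attended = multi_head_self_attention(features, num_heads=2)
pooled = max_pool(attended)  # fixed-size word-level feature vector
```

Dropping the `multi_head_self_attention` call gives the CNN-Maxpool variant, and calling it with `num_heads=1` corresponds loosely to the single-attention variant, which is the kind of substitution the ablation describes.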
Validating the role of multi-head self-attention
The model BLSTM-Att is part 1 of the overall framework and is used to extract character features; the model C-Mutihead_Att-Maxpool is part 2 of the overall framework and is used to extract word features; Concatenation, Addition and the Gating mechanism are three ways of combining the two features. As can be seen from Table 8, on both the SanWen dataset and the Cultural-travel dataset, the F1 value obtained by combining the two features, whether by Concatenation or by Addition, is always higher than that obtained with a single feature, indicating that both character information and word information are important for expressing the meaning of a sentence. The F1 values obtained with Concatenation and Addition are lower than the F1 value obtained with the gating mechanism, indicating that gated feature fusion expresses sentence information better. This is because the gating mechanism can adaptively adjust the weight of each model's output during training, allowing the different models to work together and thereby improving overall performance.
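The three combination methods compared above can be contrasted in a short sketch. The gate form used here, g = sigmoid(W[c; w] + b) followed by g * c + (1 - g) * w, is a common formulation and is an assumption; the exact gate equation is not given in this section. The names `gate_fuse`, `weights` and `bias` are hypothetical, and in a real model the gate parameters would be learned during training.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def concat_fuse(c: Vector, w: Vector) -> Vector:
    """Concatenation: keep both feature vectors side by side."""
    return c + w


def add_fuse(c: Vector, w: Vector) -> Vector:
    """Addition: element-wise sum with fixed, equal weights."""
    return [a + b for a, b in zip(c, w)]


def gate_fuse(c: Vector, w: Vector,
              weights: List[Vector], bias: Vector) -> Vector:
    """Gating (assumed form): g = sigmoid(W [c; w] + b), out = g*c + (1-g)*w."""
    joint = c + w
    gate = [sigmoid(sum(wi * xi for wi, xi in zip(row, joint)) + b)
            for row, b in zip(weights, bias)]
    return [g * a + (1.0 - g) * b for g, a, b in zip(gate, c, w)]


char_feat = [1.0, 2.0]  # toy output of the character branch
word_feat = [3.0, 0.0]  # toy output of the word branch

# With all-zero gate parameters the gate is 0.5 everywhere, so the
# result is the plain average of the two feature vectors.
zero_w = [[0.0] * 4 for _ in range(2)]
fused = gate_fuse(char_feat, word_feat, zero_w, [0.0, 0.0])
```

Unlike concatenation and addition, the gate produces a per-dimension mixing weight that depends on the input, which is one way to read the statement that the weights are adjusted adaptively.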
Validating the role of different feature information and gating mechanisms
In this paper, we propose a relation extraction method based on multiple embedding representations and multi-head self-attention. Using character embeddings eliminates the need for a word segmentation tool and avoids the impact of word segmentation errors. Using multiple trained embedding models to jointly represent character vectors enhances the meanings of characters and the relationships between characters; a convolutional neural network with multiple kernel sizes then builds on these vectors to enhance the meanings of words and their relationships. This removes the need to use a dictionary or introduce an external knowledge base to extend word information, and so avoids spending excessive manpower and time on mitigating word polysemy. Word polysemy is further mitigated by the multiple-representation capability of the multi-head self-attention mechanism, and the gating mechanism adaptively adjusts the weights of the outputs of the individual models during training so that the semantic information of the sentence is adequately represented. The limitation of this paper is that the current dataset is relatively small and there is a certain sample imbalance between categories, which may lead to a gap between accuracy and recall. In future work, we can explore the method from the perspective of complexity, build a more comprehensive dataset to further examine the importance of features such as strokes and pronunciation in expressing Chinese semantic information, and seek new neural network models to further improve performance.
Footnotes
Acknowledgments
This work is supported by the National Key Research and Development Plan of China, Key Project of Cyberspace Security Governance (No. 2022YFB3103103).
