Abstract
In the medical field, Named Entity Recognition (NER) plays a crucial role in extracting information from electronic medical records and medical texts. To address the problems of long-distance entities, entity confusion, and difficult boundary division in the Chinese electronic medical record NER task, we propose a Chinese electronic medical record NER method based on the multi-head attention mechanism and character-word fusion. The method uses a new joint character-word feature representation based on the pre-trained model BERT and a self-constructed domain dictionary, which divides entity boundaries more accurately and reduces the impact of out-of-vocabulary words. On top of the BiLSTM-CRF model, a multi-head attention mechanism is then introduced to learn the dependencies between distant entities and entity information in different semantic spaces, which effectively improves the performance of the model. Experiments show that our model outperforms the baselines: its F1 value on the Chinese electronic medical record dataset reaches 95.22%, which is 2.67% higher than that of the baseline model.
Introduction
With the rise of machine learning, Natural Language Processing (NLP) has emerged as an area of research aimed at enabling machines to communicate efficiently with humans. Named entity recognition, as a fundamental task in natural language processing, is the basis for other downstream tasks (e.g., information retrieval and relation extraction) [1]. As a product of the development of intelligent medicine, Electronic Medical Records (EMR) can not only speed up digital medical services but also promote the development of medicine to a certain extent. Electronic medical records are the primary source of information for clinicians to understand patient care, but in most cases the information in these records is unstructured, which makes statistical analysis difficult [2, 3]. Electronic medical records hold an enormous amount of medical research data, including patient symptoms, reasons for medication use, and patient treatment outcomes [4]. Therefore, recognizing entities in electronic medical records to extract meaningful medical entity information can bring significant progress to biomedical research.
In healthcare, NER plays a key role in extracting key information from medical data, which then serves as input for downstream tasks [5, 6]. At present, medical named entity recognition faces several main problems: long medical entity nouns, difficult boundary division of medical entity nouns, confusing entity types, and entity nesting. To address long entity recognition and boundary division, Tan CQ et al. [7] proposed a boundary-enhanced span classification neural network model. The model detects entities in different subsequences with a span-based method and predicts entity types for the corresponding position intervals. In addition, they add a boundary detection task to predict which words are entity boundaries. This method has achieved excellent results on public datasets. For the problem that sequential constraints and single-input modeling cannot adequately learn global-scale information, Ying L et al. [8] proposed a model enhanced with hierarchical context representations. The method uses sentence-level and document-level representations, applies a different method at each level to learn the contribution of each word, and finally fuses the hierarchical context representations with the hidden state information to identify entities. To address entity nesting and entity segmentation errors, Yue Z et al. [9] proposed a Lattice-LSTM model for Chinese NER. The model encodes the input character sequence together with all potential words that match it, making full use of word information to select the most relevant characters and words from the sentence and obtain better results. However, Lattice-LSTM has a complex architecture, which limits its application in many real-time NER domains, especially Chinese clinical electronic medical record NER.
Although existing methods have achieved excellent results in NER, they mostly target English. Moreover, Chinese electronic medical records have a special structure, confusing entity types, and entity boundaries that are difficult to identify, all of which make entity recognition more difficult. To address these problems, we propose a named entity recognition model for Chinese electronic medical records based on the multi-head attention mechanism and character-word combination, in which several neural networks cooperate to solve the complex problem of named entity recognition in Chinese electronic medical records. The major contributions of this paper are:
(1) A joint character-word vector approach combines character-based and word-based information to learn contextual semantics. We use the dictionary for character segmentation and word segmentation to reduce the impact of unknown words and enhance the recognition performance of the model.
(2) The model incorporates a multi-head attention mechanism, which learns relevant information and captures global information in different subspaces. This makes up for the limitations of BiLSTM and addresses the problems of long-distance entities and confusing entities in electronic medical records, further enhancing the effectiveness of the model.
(3) A self-constructed medical domain dictionary further avoids entity recognition errors caused by segmentation boundary errors. The proposed model achieves better results on the electronic medical record dataset, where the F1 value reaches 95.22%, an improvement of 1% to 2.67% over the baseline models.
Related work
As a basis for the development of the biomedical field, the medical named entity recognition task has attracted great attention at home and abroad. Many scholars have worked to improve the accuracy of medical named entity recognition, and their publications have promoted the development of the field. How to improve the recognition performance of biomedical named entity recognition models remains a hot research topic. As with named entity recognition in the general domain, approaches to medical named entity recognition can be divided into three major categories: medical dictionary-based approaches, machine learning-based approaches, and deep learning-based approaches.
Medical dictionary-based approach
The medical dictionary-based approach mainly achieves medical entity tagging by exact string matching against a medical entity dictionary. Medical practitioners and medical researchers usually produce highly specialized medical dictionaries, such as the Unified Medical Language System (UMLS) [10], which contains many specialized entities in the medical field and is widely used for medical NER research. Lu et al. [11] achieved good recognition performance by using UMLS for dictionary matching on the NCBI disease dataset. Peregrine et al. [12] expanded the number of named entities in the dictionary with a spelling variation recognition algorithm and a keyword recognition algorithm. Demner-Fushman et al. [13] used a search engine approach to speed up dictionary lookup and achieved good recognition efficiency on the corresponding medical dataset. Dictionary-based named entity recognition has been continuously improved, but it still suffers from low recognition accuracy because of the limitations of the dictionary itself. Many researchers have focused on massively expanding the dictionary with external resources and have improved the F1 value to some extent. However, because the set of medical entities keeps growing and external resources generate "noisy" matches, new methods are needed to replace the medical dictionary approach.
Machine learning-based approach
The machine learning-based approach uses medical entities (e.g., drugs and disease names) as a set of features and applies machine learning algorithms to the entity tagging process under a given tagging strategy, aiming to ensure the correctness and completeness of the recognized medical entities. Commonly used machine learning models in the medical field include SVM and CRF. Jon et al. [14] used a CRF model in the drug recognition task of the I2B2 evaluation and reached an F1 value of 85.7%, which is much higher than the dictionary-based approach. Abbas et al. [15] combined the particle swarm optimization (PSO) algorithm and Bayesian decision as a classifier and obtained an F1 value of 88% on BioCreative chemical named entities, which is higher than the previous model. In addition, some researchers have studied specific medical NER problems. Jianbo et al. [16] took 400 discharge records from Peking Union Medical College Hospital, defined four types of entities (clinical problems, procedures, examinations, and medications), used word and segmentation information as the feature representation, and obtained an F1 value of 93.51% with SSVM. For clinical records of traditional Chinese medicine, Wang et al. [17] used the CRF method to identify symptom entities and achieved good recognition results. Machine learning algorithms improve substantially over dictionary-based methods, but because language characteristics differ, migrating a model between domains causes significant performance problems. Machine learning methods therefore still require a sizeable tagged corpus to train the model, which demands a lot of manual work, and their recognition accuracy still has considerable room for improvement.
Deep learning-based approach
The deep learning-based medical named entity recognition method addresses the research bottleneck of traditional machine learning methods in the medical field, and deep learning methods are often used in the medical field to extract information. Wu et al. [18] used a deep learning model (DNN) to identify clinical entities in electronic medical records and achieved an F1 value of 92.8%. Gridach et al. [19] combined CRF and BiLSTM to identify biological entities and achieved an F1 value of 89.5% on the BioCreative II GM dataset. Lample et al. [20] used the word embedding model (word2vec) to characterize word features, and the F1 value on the English dataset reached 83.4%. For Chinese electronic medical records, Zhang et al. [21] encoded the character input while fusing word position information, additionally introduced an attention mechanism, and achieved better results. Xue et al. [22] introduced BERT, a pre-trained language model published by Google, into joint learning, which improved the feature representation of the shared parameter layer. Dan Li et al. [23] designed a joint model of long short-term memory networks and conditional random fields and introduced the BERT model, combined with a part-head-aware approach, to effectively enhance the extraction of named entities in the medical domain. Although deep learning has become the mainstream method for named entity recognition, deep learning methods also suffer from computational complexity, high data quality requirements, and poor interpretability. These problems have to be solved before deep learning methods can be deployed in practice.
Methodology
The overall structure of the model is shown in Fig 1. The model contains an Embedding layer, a BiLSTM layer, a Multi-head attention layer, and a CRF layer. The Embedding layer mainly fuses character vectors and word vectors. First, we use the pre-trained BERT model to generate the character vector representation of the preprocessed EMR dataset. Then, the word vector representation is generated by slicing the input sequence with the forward maximum matching (FMM) algorithm over a self-constructed domain dictionary. Finally, the character vector and word vector are fused into a joint character-word embedding vector. The BiLSTM layer is used for feature extraction, and a multi-head attention layer is introduced after the BiLSTM layer to obtain dependencies between long-distance entities and the semantic information of entities from different perspectives. In the end, the optimal annotation sequence is generated by the CRF layer.

Chinese electronic medical record NER model (BWBAC).
In named entity recognition tasks, transforming the text vocabulary into low-dimensional dense vectors can lead to better semantic representations, and such vectors have replaced the traditional One-Hot representation [24]. For Chinese, the performance of word-based NER models depends heavily on the results of Chinese word segmentation, while character-based NER models lose a certain amount of information because words are split into separate characters. Therefore, a character-word fusion recognition model is used to handle the complexity and diversity of Chinese electronic medical records.
Character embedding
The text character vector is first generated with the pre-trained model BERT, which transforms each character in the input sequence into a vector and then performs feature extraction in context to obtain rich semantic information. We use the Chinese pre-trained model Chinese-BERT-wwm for character vector embedding, which uses whole-word masking during training to improve the accuracy of language learning. Whole-word masking for Chinese uses a word segmentation tool to mask all the Chinese characters that make up the same word, and the model learns to produce the correct output through training. Because each input token of BERT is a character, the character vector embedding is obtained after training.
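The whole-word masking idea described above can be illustrated with a minimal sketch. It assumes the text has already been split into words by some segmentation tool; the function name `whole_word_mask` and the parameters `mask_prob` and `mask_token` are hypothetical and chosen only for illustration. The sketch also omits details of real BERT pre-training, such as occasionally keeping or randomly replacing a selected token.

```python
import random


def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask every character of each selected word (illustrative sketch).

    words: list of already-segmented words (strings of Chinese characters).
    Returns (tokens, labels): tokens is the character sequence with masks,
    labels holds the original character at masked positions, else None.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for word in words:
        # The decision is made once per word, so a word is never half-masked.
        masked = rng.random() < mask_prob
        for ch in word:
            tokens.append(mask_token if masked else ch)
            labels.append(ch if masked else None)
    return tokens, labels
```

The point of the sketch is the granularity of the decision: masking is chosen per word, while the model input stays character-level.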
The BERT model generates its input vector by combining three vectors: the token embedding is the character-level vector input to BERT, the segment embedding represents the sentence a token belongs to, and the positional encoding represents the position of the token. We denote the transformer block as Trans(T), where T denotes the input vector. The specific operation is:
The word-based Chinese NER model relies heavily on the accuracy of Chinese word segmentation, and directly training the neural network model on distributed word representations may expose the overall model to noise. Inspired by the ideas of Lample [20], Shi Jia [25], and Ningjie Lu [26], the forward maximum matching (FMM) algorithm is used to segment the input sequence. FMM is chosen because it segments words according to a self-constructed segmentation dictionary rather than the dictionary of a segmentation tool, which reduces the effect of noise when delineating word boundaries in Chinese electronic medical records.
After a Chinese electronic medical record sequence is segmented by the FMM algorithm, the result is a mixed character-word sequence: every segmented word exists in the custom dictionary, and any text not covered by the dictionary is split into individual Chinese characters. Based on this, this paper constructs two lookup tables for retrieving character-level and word-level vector information, respectively. Take the word vector lookup as an example. For a word table $D$ of vocabulary size $v$ in which each word $w_i$ has a vector of dimension $d$, $D \in \mathbb{R}^{d \times v}$, each column of $D$ is the vector of one word, and the word vector $e_{w_i}$ of $w_i$ is obtained by looking it up in $D$. A vector $u_i$ of dimension $v$ serves as the index vector of the word $w_i$ in $D$: the entry of $u_i$ at the position of $w_i$ in $D$ is 1, and all other entries are 0.
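A minimal sketch of forward maximum matching, under stated assumptions, may help make the segmentation step concrete. The function name `fmm_segment`, the `max_len` parameter, and the toy dictionary in the usage note are illustrative and not taken from the paper; the dictionary is assumed to be a plain set of strings.

```python
def fmm_segment(text, dictionary, max_len=None):
    """Forward maximum matching over a set of dictionary words (sketch).

    At each position the longest dictionary word starting there is taken;
    if no multi-character word matches, a single character is emitted.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No dictionary word starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens
```

For example, with the toy dictionary `{"冠心病", "冠心", "心绞痛"}`, the input `"患者冠心病史"` is segmented into `["患", "者", "冠心病", "史"]`: the longer entry `"冠心病"` wins over `"冠心"`, and the uncovered characters remain single-character tokens.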
Suppose the input sequence $s = [w_1, w_2, \ldots, w_n]$ consists of $n$ words in total. After the word vector of each word is obtained, the sequence can be represented by a matrix $E$ formed by stitching together the $n$ word vectors.
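The table lookup can be sketched as a product of the table with a one-hot index vector. This sketch assumes the convention $D \in \mathbb{R}^{d \times v}$ with one word per column; the helper names `one_hot`, `embed`, and `embed_sequence`, and the `vocab` mapping from token to column index, are hypothetical.

```python
def one_hot(index, size):
    """Index vector u: 1.0 at the word's position, 0.0 elsewhere."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def embed(table, index_vector):
    """Compute D @ u for a d x v table (list of d rows) and a length-v vector."""
    return [sum(d_ij * u_j for d_ij, u_j in zip(row, index_vector))
            for row in table]


def embed_sequence(table, vocab, tokens):
    """Look up every token; returns n vectors of dimension d.

    The result is stored as a list of n vectors, i.e. the transpose of a
    d x n matrix E.
    """
    v = len(vocab)
    return [embed(table, one_hot(vocab[tok], v)) for tok in tokens]
```

In practice the product is replaced by direct column indexing, which gives the same vector without building the one-hot vector.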
After the vector representations of words are obtained, the character vector representations of the characters segmented by FMM are obtained by the same table lookup. The final complete vector containing word information is therefore a sequence $m_{w_i}$ that mixes the character vectors and word vectors obtained by table lookup, corresponding to a two-dimensional matrix in $\mathbb{R}^{d \times n}$. The generated vector $m_{w_i}$ is spliced with the character vector $T_{w_i}$ generated by the BERT model to represent semantic feature information at different granularities. The fusion of the word vectors generated by FMM and the character vectors generated by BERT is shown in Fig 2.

Character-word fusion diagram.
Finally, the two distributed vectors are fused as in Eq. (4). The fused vector contains both character information and word information, which fuses the features while keeping the computation simple.
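One possible reading of the splicing step is plain concatenation of each character's BERT vector with the lookup vector of the FMM token that contains that character. The sketch below illustrates only that reading; the alignment rule, the function name `fuse`, and the argument layout are assumptions and are not taken from Eq. (4).

```python
def fuse(char_vectors, tokens, token_vectors):
    """Concatenate per-character vectors with their token's vector (sketch).

    char_vectors: one vector per character of the sentence (e.g. from BERT).
    tokens: FMM output, a list of words and single characters.
    token_vectors: one lookup vector per FMM token.
    Assumption: every character inherits the vector of its containing token.
    """
    if sum(len(tok) for tok in tokens) != len(char_vectors):
        raise ValueError("tokens must cover exactly the characters given")
    fused = []
    pos = 0
    for tok, tok_vec in zip(tokens, token_vectors):
        for _ in tok:
            fused.append(list(char_vectors[pos]) + list(tok_vec))
            pos += 1
    return fused
```

With this layout the fused sequence keeps one vector per character, so it can be fed to a character-level BiLSTM without changing the label alignment.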
The BiLSTM layer performs feature extraction on the sequences using the output vectors from the character-word fusion. The model input vector is a character-word fusion vector, and in order to mitigate its overfitting, dropout is set before inputting it to the BiLSTM. For a joint embedding vector $X_t$ at any position $t$ in the input sequence, the LSTM will combine $X_t$ and the state $h_{t-1}$ at the previous moment to calculate the hidden state $h_t$ at the current moment. BiLSTM can effectively memorize the context information by setting two independent hidden layers, and finally calculate the forward representation
Although BiLSTM is able to capture long-distance dependent information, there is a possibility that valid information may disappear after long-distance transmission resulting in incomplete identification of entities. Attention mechanisms can train word-to-word similarity to capture long-distance dependent information more easily. We learn the contextual relationship between different characters through the attention mechanism. Different weights are assigned according to the importance of characters for learning the dependency between entities, which is helpful to solve the long-distance dependency problem caused by the excessive length of named entities in electronic medical records.
The output matrix $h$ of the BiLSTM layer is transformed into three input matrices, each of dimension $d_k$, by three different mapping operations. Here $h$ denotes the current input, and $W_q$, $W_k$, $W_v$ are three mutually independent parameter matrices. Multiplying $h$ by these parameter matrices yields the Query ($Q$), Key ($K$), and Value ($V$) matrices. The formula for calculating attention can then be expressed as:
First, the similarity between $Q$ and $K$ is calculated. The similarity result is then mapped into the range (0, 1) by the softmax function, where
As shown in Fig 3, the multi-head attention mechanism searches the input for the required information in parallel and learns different aspects of that information. A single attention function can find only one obvious association for a token, while multi-head attention can find as many associations as possible. Each head takes a split of the tensor output by the BiLSTM layer and receives its own set of $Q$, $K$, $V$ for the attention computation, so each head sees only a partial representation of each input token. $Q$, $K$, and $V$ are independently mapped $h$ times by different parameter matrices and then fed to $h$ parallel heads that execute the attention function. Finally, the results of the $h$ heads are combined and linearly mapped once more to obtain the final output $AT = [at_1, at_2, \ldots, at_n]$. The operation process is shown as follows.

Multi-head attention mechanism.
In Eqs. (7)-(8),
Finally, the multi-head attention layer assigns weights to the feature vector h output from the BiLSTM layer. The common output feature vector of the BiLSTM layer and the multi-head attention layer is calculated as shown in Eq. (9).
The multi-head attention mechanism enables each attention to learn different characteristic parts of each token, thus equalizing the bias of a single attention and better distinguishing easily confused medical entities.
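The attention computation described in this section can be sketched in plain Python. The sketch assumes the standard scaled dot-product form, in which the $Q$-$K$ similarities are divided by $\sqrt{d_k}$ before the softmax, and the standard multi-head layout with per-head projection matrices and a final output projection. The helper names (`matmul`, `attention`, `multi_head_attention`) and the argument `w_o` are illustrative and not taken from the paper's equations.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax; outputs lie in (0, 1) and sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # similarity between every query and key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(hidden, heads, w_o):
    """hidden: n x d BiLSTM output; heads: list of (w_q, w_k, w_v); w_o: output map."""
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = attention(matmul(hidden, w_q),
                           matmul(hidden, w_k),
                           matmul(hidden, w_v))
        outputs.append(out)
    # Concatenate the heads position by position, then map once more.
    concat = [sum((out[t] for out in outputs), []) for t in range(len(hidden))]
    return matmul(concat, w_o)
```

Each head works in its own projected subspace, so two heads can assign different weights to the same pair of tokens; the final projection `w_o` mixes the per-head results back into one vector per token.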
We decode the hidden state sequence $C = [c_1, c_2, \ldots, c_n]$ output from the attention layer to predict the final annotation sequence. If the input sentence $X = [x_1, x_2, \ldots, x_n]$ is given and the predicted sequence of labels is $Y = [y_1, y_2, \ldots, y_n]$, the score probability is calculated as follows.
$Y_X$ denotes the set of all possible label sequences for the sequence $X$. Finally, the globally optimal label sequence is predicted by the Viterbi algorithm.
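Viterbi decoding over CRF scores can be sketched as follows. The sketch assumes the usual linear-chain scoring, where a label sequence's score is the sum of per-position emission scores and label-to-label transition scores; the function name `viterbi` and the matrix layout are illustrative, and start and end transitions are omitted for brevity.

```python
def viterbi(emissions, transitions):
    """Find the highest-scoring label path (sketch).

    emissions: n x k scores, emissions[t][j] for label j at position t.
    transitions: k x k scores, transitions[i][j] for moving from label i to j.
    Returns (best_path, best_score).
    """
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, n):
        new_score, ptr = [], []
        for j in range(k):
            # Best previous label for ending at label j at position t.
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    last = max(range(k), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

The transition scores are what let the decoder rule out label sequences that are locally plausible but globally inconsistent, for example a strongly penalized move from one label to another.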
Dataset
This paper uses the publicly available CCKS 2017 electronic medical record dataset, whose document categories include general items, medical history characteristics, treatment procedures, and discharges. Private patient information has been removed from the dataset, and the named entities involved fall into five major types: body parts, diseases and diagnoses, examinations and tests, signs and symptoms, and treatments. The experiment divides the dataset into a training set, a validation set, and a test set, and uses the BIOES labeling strategy. The label is connected to the entity type with a hyphen "-", where B indicates the beginning of an entity, such as "B-Disease and Diagnosis"; I indicates the middle part of an entity, such as "I-Disease and Diagnosis"; E indicates the end of an entity, such as "E-Disease and Diagnosis"; O indicates a non-entity part, such as "Child"; and S indicates an entity consisting of a single character.
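The BIOES scheme can be illustrated by converting entity spans into per-character tags. The function name `spans_to_bioes` and the span format `(start, end_exclusive, entity_type)` are assumptions made for this sketch, and the short type names in the usage note are placeholders rather than the dataset's label strings.

```python
def spans_to_bioes(length, spans):
    """Convert entity spans to BIOES tags for a sequence of `length` characters.

    spans: iterable of (start, end_exclusive, entity_type), non-overlapping.
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if end - start == 1:
            # A one-character entity gets the single tag S.
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags
```

For a six-character sentence with a three-character disease entity at positions 2 to 4 and a one-character symptom entity at position 5, `spans_to_bioes(6, [(2, 5, "Disease"), (5, 6, "Symptom")])` returns `["O", "O", "B-Disease", "I-Disease", "E-Disease", "S-Symptom"]`.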
To acquire character and word embeddings, we construct a dictionary D. The medical named entity recognition task differs from the general domain in that its word composition and language characteristics are highly specialized. To match the language style of Chinese electronic medical records, it is desirable to extract keywords directly from medical texts when constructing D. The major sources for constructing dictionary D in this paper are the following.
(1) Entity names of different entity types crawled from well-known Chinese medical websites.
(2) The Baidu and Sogou thesauri and some terms from ICD-10.
The initial dictionary is formed by filtering and merging the above resources, and dictionary D is obtained from the initial dictionary after de-duplication, de-noising, and verification by professional organizations. In the processed dictionary, the entities are divided into five categories: body parts, examinations and tests, signs and symptoms, treatments, and diseases and diagnoses, corresponding to the category dictionaries $D_b$, $D_c$, $D_s$, $D_t$, $D_d$.
Example of dictionary entries
In this paper, we evaluate the model using the Precision, Recall, and F1 (F-measure) values that are the standard criteria for named entity recognition. With TP denoting the number of correctly identified entities, FP denoting the number of entities that exist but are incorrectly identified, and OP denoting the number of entities that are not identified, the specific evaluation process is shown in Eqs. (12)-(14).
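The three metrics can be sketched with the standard entity-level definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. In this sketch the count `fn` (gold entities that were not recovered) plays the role the text gives to OP; the function name and the representation of entities as `(start, end, type)` tuples are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 from two collections of entities.

    Entities are hashable items such as (start, end, type); an entity counts
    as correct only if it appears identically in both collections.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # correctly identified entities
    fp = len(predicted - gold)   # predicted but wrong
    fn = len(gold - predicted)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a consistency check on the harmonic-mean definition, a precision of 94.09% and a recall of 93.95% give an F1 of about 94.02%.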
The parameter configuration of the neural network model is shown in Table 2. The character vector dimension and the word vector dimension are both set to 200.
Neural network parameter settings
Batch size refers to the amount of data read in a single training step, BERT max input length refers to the maximum length of the BERT input, Char embedding size and Word embedding size refer to the dimensionality of the character vector and word vector respectively, LSTM Hidden unit refers to the size of the LSTM hidden layer, Head num refers to the number of heads in the multi-head attention mechanism, Learning rate refers to the weight update step, Optimizer refers to the model optimizer, and Epoch specifies the number of training iterations over the dataset.
During the experiment, it was found that the change of certain parameters had a more obvious effect on the model effectiveness. Therefore, the experiment firstly explores the effect of different parameters of the model settings, and secondly explores the research topic of this paper by comparing the experimental results of different models.
Parameter adjustment
Different datasets require different numbers of training epochs, and a suitable number allows the loss to converge to a stable value. Choosing the best number of epochs allows the model to achieve the best results while reducing training time. To explore the effect of the number of epochs on the experiment, Fig 4 plots the evaluation metrics of the proposed model against the epoch. As the figure shows, the proposed model converges quickly in the first 6 epochs, after which its performance grows slowly as the number of epochs increases. After about 18 epochs the performance fluctuates only slightly, and the final number of epochs is set to 35 to avoid unnecessary computation time.

Model performance with epoch.
Increasing the number of hidden layers can improve the performance of the model, but it also complicates the network, leading to longer training time and overfitting. We reduce the error by choosing an appropriate number of hidden layer nodes: too few hidden layer nodes cause poor training results, and too many result in overfitting. To improve the model performance, the model is trained with different numbers of hidden layer nodes and learning rates, as shown in Table 3. The results of these sets of experiments show that both the number of hidden layer nodes and the learning rate affect the experimental results. From the data in the table, the model achieves the best recognition results when the number of hidden layer nodes is 200 and the learning rate is 0.003.
Model recognition results for different training parameters
Each layer of BERT captures different information, and BERT gradually learns deeper semantics from lower to higher layers. Modeling long-range dependencies in BERT requires more layers, and the number of BERT layers affects both the results and the training time. In this paper, the proposed model is compared across reduced numbers of BERT layers by taking the output of the model at different layers, as shown in Table 4. As the table shows, the time required for model training keeps increasing as the number of BERT layers increases. On the same dataset, the F1 value of the model keeps increasing as the number of BERT layers grows to 8, where it reaches its maximum value of 91.11%, and it levels off as the number of layers continues to increase to 12.
Experimental results of different layers of BERT model
We compare the recognition results between different models to verify the effectiveness of our proposed improvements to the models. The models for comparison are BWBAC, which introduces character-word fusion and multi-head attention mechanism, BWBC (BERT + Word + BiLSTM + CRF), which only introduces character-word fusion, BBAC (BERT + BiLSTM + Multi-head attention + CRF), which only adds multi-head attention mechanism, and BBC (BERT + BiLSTM + CRF), a baseline model for named entity recognition. In order to make the model achieve the optimal performance, the experimental parameters are selected from the optimal combination of parameters obtained from the above experiments. The identification results of the four models are shown in Table 5:
Comparison of model identification results
Comparing the experimental results, the BWBC model achieves a precision of 94.09%, a recall of 93.95%, and an F1 value of 94.02%, which is an F1 improvement of 1.47% over the BBC model. This shows that text enhancement by fusing character-word vectors through a self-constructed dictionary can improve the recognition rate of entities. The experiments also show that using a dictionary in a closed domain improves how well the text is learned. The BBAC model achieves a precision of 93.82%, a recall of 94.74%, and an F1 value of 94.27%, which is an F1 improvement of 1.72% over BBC, indicating that introducing the multi-head attention mechanism improves the overall performance of a BERT-based model trained with fixed parameters. Finally, among the four models, the BWBAC model achieves the best recognition rate, which shows that introducing character-word fusion vectors and the multi-head attention mechanism improves the overall performance of the model, and that the self-constructed dictionary enhances the delineation of entity boundaries and strengthens lexical understanding. In the experiment, the proposed model reaches an F1 value of 95.22% together with the highest precision and recall, which supports the approach of this paper.
Finally, to further validate the improvement of the proposed model, we compare its results with those of previous studies on CCKS2017 task 2. The methods and results of these studies on the CCKS2017 dataset are listed in Table 6. Hu et al. [30] proposed a hybrid system that combines four methods (Rule, CRF, RNN, and RNN with features) and adds a voting mechanism at the end. Wang et al. [31] fused dictionaries into deep neural networks to solve the problem that some rare entities cannot be recognized. Qiu et al. [32] proposed a residual convolutional neural network with conditional random fields. From the data in the table, our method performs better than the other methods, which supports the improvements proposed in this paper for the medical text NER task.
Studies related to CCKS2017 electronic medical record dataset
The Chinese electronic medical record named entity recognition task is a key part of Chinese medical information extraction. In response to the difficulties that arise in Chinese electronic medical record entity recognition, this paper proposes a deep learning model combining the multi-head attention mechanism and character-word fusion. The model enhances text learning by using the pre-trained language model BERT and a self-built dictionary to fuse character and word information, and introduces a multi-head attention mechanism to learn semantic information from different perspectives and obtain dependencies between long entities. The experiments show that the model can accurately extract entities from complex electronic medical record texts. The improvements address the problems of long-distance entities, easily confused entities, and entity boundaries that are difficult to divide, all of which arise in named entity recognition. Compared with all baseline models, our model achieves the best results in recognizing named entities in Chinese electronic medical records.
While the model has achieved excellent results, it also has flaws. The generalization ability of the model has room for improvement. In addition, the fusion of character and word information contains redundant information. In future work, we will continue to investigate multi-granularity information models that reduce redundancy and few-shot learning to improve model performance, and we will also study the generalization of the model and improve its interpretability on a variety of medical datasets.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (62073123), the Open Project of the Key Laboratory of Food Information Processing and Control of Ministry of Education (KFJJ-2020-109), Key scientific research projects of higher education institutions in Henan Province (21A520008), and Henan Province Science and Technology Research Project (212102210058).
