Abstract
Named entity recognition is a fundamental task in natural language processing. Biomedical named entities are extremely numerous, their naming rules are not uniform, and their word formation is complex, all of which make biomedical named entity recognition difficult. Traditional machine learning algorithms rely heavily on manually engineered features, and the quality of those features directly affects recognition accuracy. In the biomedical domain, the cost of manually engineering features and annotating data sets is enormous. In recent years, deep learning methods that do not rely on hand-crafted features have made great progress in many domains. This paper proposes a Glove-BLSTM-CRF model for biomedical named entity recognition. First, the Glove model is used to train word vectors that carry semantic features, and a BLSTM is used to train word vectors that carry character-level morphological features. The two are concatenated as the final representation of each word and fed into a BLSTM-CRF deep learning model to recognize the entity categories. Experimental results show that, without relying on any hand-crafted features or rules, the model achieves a better result on the JNLPBA 2004 biomedical named entity recognition task than the models it is compared with, reaching an F1 value of 75.62%.
Keywords
Introduction
Named entity recognition is one of the most basic tasks in natural language processing. In the biomedical domain, it means recognizing proteins, genes, DNA, RNA, cells and other entities in the literature. Downstream tasks such as protein-protein interaction extraction and disease-gene relationship extraction all rely on biomedical named entity recognition.
At present, there are four kinds of methods for biomedical named entity recognition: dictionary-based methods, rule-based methods, statistical machine learning methods, and neural network methods. The number of biomedical named entities has reached the millions, new entities appear constantly, and naming rules are not uniform. Dictionaries and training data cannot provide sufficient information, which makes the accuracy of dictionary-based and rule-based methods difficult to improve.
The statistical machine learning methods commonly used in biomedical named entity recognition include the Hidden Markov Model (HMM), the Maximum Entropy model (ME), the Support Vector Machine (SVM), and the Conditional Random Field (CRF). These methods rely heavily on hand-crafted feature design. Hand-crafted features and domain knowledge can improve the performance of a model, but they also reduce its robustness and generalization ability.
Compared with statistical machine learning methods, deep learning methods based on neural networks generalize better and depend less on hand-crafted features. In recent years, deep learning has made great progress in many areas of natural language processing. Some researchers have fed pre-trained word vectors into deep neural network models to recognize biomedical entities and achieved good results. The quality of the word vectors has a direct impact on the results of the model, and finding better word vector representations has long been an active research topic.
Related work
In statistical machine learning, Settles [1] presented a framework for biomedical named entity recognition using Conditional Random Fields with a variety of traditional and novel features. Wang et al. [2] verified the CRF-based Gimli method for biomedical named entity recognition. GuoDong and Jian [3] combined rich domain knowledge with hand-crafted features for biomedical named entity recognition. Liao and Wu [4] built a skip-chain CRF model for biomedical named entity recognition that can take sufficient account of biomedical information with long-distance dependencies.
In deep neural networks, Yao et al. [5] first used a neural network to train word vectors on unlabeled biological texts and then built a multi-layer neural network to train the model. Li et al. [6] built a bidirectional LSTM network for biomedical named entity recognition. Li et al. [7] first used a CNN to learn the character-level features of each word, combined them with word vectors trained on a large-scale background corpus, and fed the result into a BLSTM-CRF deep neural network, obtaining the best results at that time.
Glove [8] (Global Vectors for Word Representation) builds a word co-occurrence matrix over the entire corpus using a fixed-size word window, and derives a word vector space from it by dimensionality reduction. Pennington et al. [8] showed that word vectors trained by Glove outperform those trained by skip-gram and CBOW, and that the larger the corpus, the better the Glove word vectors perform. Owing to the particularity of the biomedical domain, some biomedical entities occur with relatively low frequency and new entities keep emerging, so word vectors trained by Glove do not represent some biomedical entities well. Therefore, this paper proposes to use a bidirectional LSTM [9] to train character-level vectors of words and to combine them with the Glove word vectors to obtain a better word vector representation. We feed the character vectors of a word into the LSTM network and take the last hidden state as the vector representation of the word. Because LSTM networks perform well on sequence data, we assume that this representation captures the character structure of a word well, and we verify the assumption experimentally. A bidirectional LSTM captures character structure information in both the forward and backward directions and therefore represents the character structure of a word better. The combined word vectors are fed into the BLSTM network for training, and the score of each word for each tag is obtained through a fully connected layer. Finally, the CRF model generates the globally optimal tag sequence sentence by sentence.
The Glove-character BLSTM-BLSTM-CRF model
Overall structure of model
Figure 1 illustrates the overall structure of the Glove-character BLSTM-BLSTM-CRF model. First, word vectors for all distinct words in the corpus are trained with the Glove model, and a 100-dimensional character vector table is randomly initialized. For each word of an input sentence, the Glove word vector table is queried to obtain a 300-dimensional Glove word vector, and the character vector table is queried to generate, through the BLSTM network, a 200-dimensional word vector carrying character feature information. The two are concatenated into a 500-dimensional word vector. The combined word vectors are fed into the BLSTM network for training, and the score of each word for each tag is obtained through a fully connected layer. Finally, the CRF model generates the globally optimal tag sequence sentence by sentence.
The Glove-character BLSTM-BLSTM-CRF model.
The Glove [8] (Global Vectors for Word Representation) model uses an unsupervised algorithm to obtain vector representations of words. Methods for learning such representations fall mainly into two families: global matrix factorization methods (such as latent semantic analysis (LSA) [10]) and local context window methods (such as the skip-gram model of Mikolov et al. [11]). The Glove model combines the advantages of both. It counts word co-occurrences within a fixed-size word window, accumulates them into a global co-occurrence matrix over the entire corpus, and fits that matrix with a specific weighted least squares model. The resulting vector space has meaningful substructure. Pennington et al. [8] showed that word vectors trained by Glove outperform those trained by skip-gram and CBOW, and that the larger the corpus, the better the Glove word vectors perform.
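As a rough illustration of the counting step, the sketch below accumulates window-based co-occurrence counts and defines the weighting function of the weighted least squares objective, $J = \sum_{i,j} f(X_{ij}) (w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$. The function names and the toy sentence are hypothetical. The 1/d distance weighting and the values x_max = 100 and alpha = 0.75 are the defaults reported by Pennington et al. [8], and it is an assumption that they apply to the vectors used here.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric co-occurrence counts X[(word, context)] over a corpus."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                # A pair that is d words apart contributes 1/d to the count.
                distance = i - j
                counts[(word, tokens[j])] += 1.0 / distance
                counts[(tokens[j], word)] += 1.0 / distance
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) of the Glove objective: damps rare pairs, caps frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# Toy corpus of one tokenized sentence (made up for illustration).
counts = cooccurrence_counts([["IL-2", "gene", "expression", "increased"]], window=2)
print(counts[("gene", "expression")], counts[("IL-2", "expression")])
```

Adjacent words receive a full count, while words two positions apart receive half a count; pairs farther apart than the window never enter the matrix.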
This paper uses the 300-dimensional word vectors released by Stanford University, trained on 840B tokens [12]. That vector space contains 2.2 million distinct words. The data set used in this paper contains 25,103 distinct words, of which only 16,361 appear in the vector space; the missing words are represented by a 300-dimensional zero vector. We also trained word vectors on the data set of this paper itself, so that all 25,103 words are represented by a 300-dimensional word vector. Both sets of word vectors were fed into the BLSTM-CRF model for comparison. Across multiple experiments, the F1 value obtained with the vectors trained on our data set was about 1.5% lower than that obtained with the pre-trained Stanford vectors, which shows the advantage of Glove word vectors trained on a large corpus. If word vectors were trained on a large corpus of biomedical literature, some further performance improvement in biomedical NLP might be achievable. Because training on such a large corpus is demanding, this is put forward here only as an idea.
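The zero-vector fallback for words missing from the pre-trained vectors can be sketched as follows. This is a minimal illustration, not the authors' code: `build_embedding_table`, the toy vocabulary and the toy vectors are all hypothetical.

```python
def build_embedding_table(vocab, pretrained, dim=300):
    """Map every vocabulary word to a vector, using zeros for words absent from `pretrained`."""
    table, found = {}, 0
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            vector = [0.0] * dim  # out-of-vocabulary word: 300-dimensional zero vector
        else:
            found += 1
        table[word] = vector
    return table, found

# Toy stand-ins for the real data: the words and vector values are made up.
pretrained = {"protein": [0.1] * 300, "cell": [0.2] * 300}
vocab = ["protein", "cell", "NF-kappaB"]
table, found = build_embedding_table(vocab, pretrained)
print(found, len(vocab) - found)

# Coverage implied by the counts reported in the text.
coverage = 16361 / 25103
print(f"{coverage:.1%}")
```

With the counts reported above, roughly 65% of the distinct words are covered, so 8,742 words fall back to the zero vector and carry no semantic information at the word level.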
BLSTM networks
An LSTM network cell.
The Long Short-Term Memory (LSTM) network [9] is an RNN variant that mitigates the vanishing gradient problem of standard RNNs. Compared with RNNs that have only a single hidden state, the LSTM has more parameters, which give better control over which memories are stored and which are discarded at a given time step. Figure 2 illustrates a single LSTM cell, which is implemented as follows:
A bidirectional LSTM computes two separate hidden layer representations, one from the forward sequence and one from the backward sequence, and combines the two as the final hidden layer representation. It can therefore make better use of context information.
Figure 3 shows the flow of the character BLSTM. A character vector table containing all distinct characters of the data set is randomly initialized. For an input sentence, the character vector table is queried for each word to obtain a two-dimensional tensor, which is fed to the BLSTM network.
The last hidden states of the forward LSTM and the backward LSTM are concatenated as the character-level vector of the word. We assume that this vector represents the character structure of the word well.
The word vector trained by Glove is concatenated with the character-level vector trained by the BLSTM network to obtain a better word vector representation. Concretely, the 300-dimensional Glove word vector and the 200-dimensional character-level word vector trained by the BLSTM are joined to form the final 500-dimensional word vector representation.
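The cell update is commonly written as below. This is the standard LSTM formulation without peephole connections, given here as an assumption; the exact variant used in this paper may differ. Here $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication. The weight matrices $W$, $U$ and the biases $b$ are illustrative names.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate $f_t$ decides how much of the previous cell state is kept, and the input gate $i_t$ decides how much of the new candidate $\tilde{c}_t$ is written.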
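The dimensions involved can be traced with a small pure-Python sketch. It assumes 100 hidden units per LSTM direction (inferred from the 200-dimensional character-level vector, not stated in the text) and uses untrained random weights and a zero placeholder for the Glove vector; all function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_lstm(input_dim, hidden_dim, rng):
    """Randomly initialize one LSTM direction (illustrative, untrained weights)."""
    rows = 4 * hidden_dim            # input, forget, output gates and candidate cell
    cols = input_dim + hidden_dim    # each row acts on the concatenation [x; h]
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W": weights, "b": [0.0] * rows, "n": hidden_dim}

def lstm_last_hidden(params, inputs):
    """Run the LSTM over a list of input vectors and return the last hidden state."""
    n = params["n"]
    h, c = [0.0] * n, [0.0] * n
    for x in inputs:
        xh = x + h
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(params["W"], params["b"])]
        i = [sigmoid(v) for v in z[:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        o = [sigmoid(v) for v in z[2 * n:3 * n]]
        g = [math.tanh(v) for v in z[3 * n:]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h

def char_level_vector(word, char_table, fwd, bwd):
    """Concatenate the last hidden states of the forward and backward passes."""
    chars = [char_table[ch] for ch in word]
    return lstm_last_hidden(fwd, chars) + lstm_last_hidden(bwd, chars[::-1])

rng = random.Random(0)
CHAR_DIM, CHAR_HIDDEN, GLOVE_DIM = 100, 100, 300
word = "IL-2"
char_table = {ch: [rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
              for ch in sorted(set(word))}
fwd = init_lstm(CHAR_DIM, CHAR_HIDDEN, rng)
bwd = init_lstm(CHAR_DIM, CHAR_HIDDEN, rng)

char_vec = char_level_vector(word, char_table, fwd, bwd)
glove_vec = [0.0] * GLOVE_DIM      # placeholder for a real Glove lookup
final_vec = glove_vec + char_vec   # 300 + 200 = 500 dimensions
print(len(char_vec), len(final_vec))
```

Because the character-level half is computed from characters alone, a word that is missing from the Glove table still receives a non-zero 200-dimensional representation.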
The flow diagram of character BLSTM.
Biomedical named entity recognition can be regarded as a sequence tagging problem, which a linear-CRF handles well (its structure is shown in Fig. 4). By introducing state transition probabilities, the CRF takes the relationship between adjacent tags into account and so obtains an optimal predicted tag sequence. Connecting a CRF layer after the BLSTM network makes it possible to predict tags more accurately, because both past and future input features are used. The conditional probability defined by a linear-CRF is:
where:
For the input sentence sequence
During the training we optimize the parameters by maximizing the log-likelihood function
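In most BLSTM-CRF taggers these quantities take the form sketched below. This is an assumed standard formulation, not necessarily the exact notation of this paper: $X = (x_1, \dots, x_n)$ is the input sentence, $y = (y_1, \dots, y_n)$ a tag sequence, $P_{i, y_i}$ the score of tag $y_i$ for the $i$-th word as output by the fully connected layer, $A_{y_i, y_{i+1}}$ the transition score between adjacent tags, and $Y_X$ the set of all possible tag sequences for $X$.

```latex
\begin{aligned}
s(X, y) &= \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} \\
p(y \mid X) &= \frac{\exp\big(s(X, y)\big)}{\sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big)} \\
\log p(y \mid X) &= s(X, y) - \log \sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big) \\
y^{*} &= \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})
\end{aligned}
```

Training maximizes $\log p(y \mid X)$ for the gold tag sequence, and prediction returns $y^{*}$, the highest-scoring sequence.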
A linear-CRF network.
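Finding the globally optimal tag sequence for a sentence is usually done with Viterbi decoding. The sketch below assumes per-word tag scores (`emissions`) and a tag-to-tag score matrix (`transitions`), without start or end transitions; both names and the toy scores are hypothetical and not taken from this paper.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][k]   : score of tag k for the t-th word
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for ending in tag j at this position.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    best_last = max(range(num_tags), key=lambda j: scores[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]

# Toy example with tags 0 = O, 1 = B, 2 = I; the move O -> I is heavily penalized.
transitions = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, -1.0, 1.0],   # from B
    [0.0, 0.0, 0.5],    # from I
]
emissions = [
    [1.0, 0.8, 0.0],    # word 1: O scores slightly higher than B
    [0.0, 0.0, 1.0],    # word 2: I scores highest
]
path, score = viterbi_decode(emissions, transitions)
print(path)  # [1, 2], i.e. B followed by I
```

Choosing the best tag for each word independently would give O followed by I, an invalid sequence. The transition scores steer the decoder to B followed by I instead, which is the effect the CRF layer adds on top of the BLSTM scores.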
Data
The data set used in this paper is the JNLPBA 2004 shared task data set [13]. The training set consists of 2000 biomedical literature abstracts, and the test set consists of 404 abstracts. The task recognizes five types of biomedical named entities: protein, DNA, RNA, cell-line and cell-type.
Experimental environment and training parameters
The models in this paper were all run in a Python 3.6 and TensorFlow 1.14 environment.
The JNLPBA 2004 shared task training set is shuffled and divided into a training set and a validation set at a ratio of 8:2. The model is selected according to its F1 value on the validation set, and the F1 value on the test set is used as the final criterion for the model. The data set is annotated with the IOBES tagging scheme.
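The two preprocessing steps can be sketched as follows. The sketch assumes that the source annotations are in IOB2 form (B- and I- prefixes) and that splitting is done at the level of whole abstracts; the function names, the seed and the placeholder abstract identifiers are hypothetical.

```python
import random

def split_train_dev(items, train_ratio=0.8, seed=0):
    """Shuffle the items and split them into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def iob2_to_iobes(tags):
    """Convert one sentence of IOB2 tags to the IOBES scheme."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # does the entity extend to the next token?
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:
            converted.append(("I-" if continues else "E-") + label)
    return converted

abstracts = [f"abstract_{i}" for i in range(2000)]  # placeholders for the 2000 abstracts
train, dev = split_train_dev(abstracts)
print(len(train), len(dev))

iob2 = ["B-protein", "I-protein", "I-protein", "O", "B-DNA", "O"]
print(iob2_to_iobes(iob2))
```

With 2000 abstracts, an 8:2 split yields 1600 for training and 400 for validation. In IOBES, the last token of a multi-token entity is marked E- and a single-token entity is marked S-, which gives the tagger explicit boundary information.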
The training parameters of the models in this paper are shown in Table 1.
Training parameters of the model
This paper analyzes the contribution of each component to the experimental results by splitting the model into parts. Each experiment was run 5 times and the middle (median) result was taken. The test set results were evaluated using the official evaluation script of the JNLPBA 2004 shared task. The experimental results are shown in Table 2, which makes the impact of each part of the model on the results directly visible.
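The F1 value reported here is an entity-level measure. The sketch below shows exact-match precision, recall and F1 over IOBES tags for a single sequence; it is an illustration of the metric only, not the official JNLPBA 2004 evaluation script, and the function names and toy tags are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, label) spans in an IOBES tag sequence."""
    entities, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            entities.add((i, i, label))
            start = None
        elif prefix == "B":
            start = (i, label)
        elif prefix == "E" and start is not None and start[1] == label:
            entities.add((start[0], i, label))
            start = None
        elif prefix == "I" and start is not None and start[1] == label:
            continue  # still inside the current entity
        else:
            start = None  # malformed sequence: drop the open span
    return entities

def precision_recall_f1(gold_tags, pred_tags):
    """Exact-match entity-level precision, recall and F1."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

gold = ["B-protein", "E-protein", "O", "S-DNA"]
pred = ["B-protein", "E-protein", "O", "O"]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, round(f1, 4))
```

In the toy example the prediction finds one of the two gold entities and proposes no wrong ones, so precision is 1.0, recall is 0.5 and F1 is 2/3. An entity counts as correct only if both its boundaries and its type match.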
Comparison of the split variants of the model in this paper
Table 3 compares the model of this paper with other advanced models of recent years on the JNLPBA 2004 data set.
Comparison of the model of this paper with other advanced models
Among methods based on hand-crafted features, Settles [1] achieved an F1 value of 69.5% on the JNLPBA 2004 data set using conditional random fields and rich feature sets. Wang et al. [2] verified the CRF-based Gimli method and achieved an F1 value of 72.23%. GuoDong and Jian [3] achieved an F1 value of 72.55% using a CRF that combines rich domain knowledge with hand-crafted features. Liao and Wu [4] constructed a skip-chain CRF model, which can take sufficient account of biomedical information with long-distance dependencies, and achieved an F1 value of 73.20%. Tang et al. [14] used a CRF for biomedical entity recognition, adding different word vector features on top of basic hand-crafted features, and achieved an F1 value of 71.39%. Chang et al. [15] incorporated word embeddings generated from an unlabeled corpus into a CRF system for biomedical entity recognition. To further improve performance, they applied a post-processing algorithm after the named entity recognition step and achieved an F1 value of 71.85%.
Among methods based on deep neural networks, Yao et al. [5] first used a neural network to generate word vectors on unlabeled biomedical texts and then built a multi-layer neural network, obtaining an F1 value of 71.01% on the JNLPBA 2004 data set. Li et al. [6] achieved an F1 value of 72.76% using a bidirectional long short-term memory network. Li et al. [7] first used a CNN to learn the character-level features of words, combined them with word vectors trained on a large-scale background corpus, and fed the result into a BLSTM-CRF deep neural network. They achieved an F1 value of 74.4%, the best result at the time.
This paper introduces the Glove-character BLSTM word vector representation, constructs the BLSTM-CRF deep learning model, and achieves a better result, with an F1 value of 75.62%.
Owing to the particularity of the biomedical domain, some biomedical entities occur with relatively low frequency and new entities keep emerging, so word vectors trained by Glove do not represent some biomedical entities well. Therefore, this paper proposes to use a BLSTM to train character-level vectors of words and to combine them with Glove word vectors in order to obtain a better word vector representation. The experimental results show that the character-level word vectors trained by the BLSTM can indeed make up for the deficiency of Glove word vectors to some extent. In this paper, Glove word vectors are applied to the biomedical domain for the first time, and a BLSTM is used to train word vectors with character-level morphological features. The two are concatenated as the final vector representation of each word and fed to the BLSTM-CRF network. We achieve an F1 value of 75.62% on the JNLPBA 2004 data set.
Next, building on entity recognition, we hope to extract information that is useful to biologists from the massive biomedical literature, such as protein-protein interactions and gene-disease relationships.
Acknowledgments
This work was supported by the Inner Mongolia Autonomous Region Mongolian language information special support project of China under Grant No. MW-2018-MGYWXXH-202 and by the Natural Science Foundation of Inner Mongolia of China under Grant No. 2020MS06012.
