Abstract
Knowledge graphs have gradually become one of the core drivers of progress in the Internet and AI in recent years, yet there is currently no standard knowledge graph in the field of agriculture. Named Entity Recognition (NER), an important step in constructing knowledge graphs, has become a hot topic in both academia and industry. Building on the Bidirectional Long Short-Term Memory network (Bi-LSTM) and Conditional Random Field (CRF) model, we introduce ensemble learning and implement a named entity recognition model, ELER. Our model achieves good results on the CoNLL2003 data set: in the best experimental results, precision and F1 improve by 1.37% and 0.7%, respectively, over the BiLSTM-CRF model. In addition, our model achieves an F1 score of 91% on the agricultural data set AgriNER2018, which demonstrates the validity of the ELER model for small agricultural data sets and lays a foundation for the construction of agricultural knowledge graphs.
Introduction
Knowledge acquisition [1] is a core step of knowledge graph construction, which aims to extract useful information from massive text data. Named entity recognition [2] is a classical problem in knowledge acquisition: identifying information with a special meaning or a clear referent in text.
Named entities generally refer to entities that have a specific meaning or referent in the text, and usually include names of people, places and organizations, dates, times, and proper nouns. NER systems extract these entities from unstructured input text, and can identify further categories of entities on demand, such as product names, models and prices. The concept of an entity is open-ended; the entities in the agricultural data set used in this article are the descriptors of the “agricultural thesaurus” [3].
NER methods can be roughly divided into two groups: rule-based methods and machine-learning methods. In the early stage of named entity recognition research, rule-based methods dominated [4]. However, most rule-based entity recognition systems are constructed manually by linguistic experts, which makes them expensive to build and difficult to port to other domains.
Machine learning approaches to NER mainly include feature-based methods and neural network-based methods [5]. A variety of machine learning models have been used for named entity recognition, including language models [6], hidden Markov models [7, 8], maximum entropy models, error-driven learning, decision trees, DL-CoTrain and CoBoost.
Although NER methods based on traditional features have achieved high performance, they rely on hand-crafted features and existing natural language processing tools, so their scalability is poor and many effective features must be designed or mined manually. In recent years, with the development of deep learning, neural network-based methods have been widely used and have achieved good results [9, 10, 11, 12].
The BiLSTM-CRF model was proposed by Huang et al. [13], who constructed a variety of neural network models to solve sequence labeling problems in natural language processing. Since then, the BiLSTM-CRF model has been widely used in the field of named entity recognition. However, most of these models cannot capture semantic information across sentences. Building on the work of Huang et al., we introduce ensemble learning to address this problem and improve the F1 value of the BiLSTM-CRF model on the CoNLL2003 data set. In addition, we use the agricultural data set AgriNER2018, constructed by our team, to evaluate the ELER model and obtain strong results, which verify the validity of the model on a small agricultural data set and are of significance to the construction of agricultural knowledge graphs.
The rest of this paper is structured as follows. In Overview, we introduce the research status of named entity recognition and the theoretical basis of the models involved in ELER. The structure and training details of the ELER model are then discussed in Proposed Method. Thereafter, we present the results of our experiments in Experiments. Finally, Conclusion summarizes the paper.
Overview
LSTM
A Recurrent Neural Network (RNN) can improve the performance of a neural network by modeling sequential dependence, but because of its training algorithm the model may suffer from exploding or vanishing gradients during training. Long Short-Term Memory (LSTM) [14] is one of the most popular and effective ways to reduce this problem. The LSTM replaces the “sigmoid” or “tanh” hidden units with memory cells, whose inputs and outputs are controlled by gates that regulate the flow of information to the hidden neurons and preserve features extracted at previous time steps. A typical LSTM cell consists of an input node, an input gate, a forget gate and an output gate, as well as a cell activation component, as shown in Fig. 1.
Fig. 1. A typical LSTM cell.
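For reference, the components of an LSTM cell named above can be written as the following update equations. This is one common formulation without peephole connections; the symbols (weight matrices W, biases b, input x_t, hidden state h_t, cell state c_t) are assumptions chosen for illustration and are not taken from the source.

```latex
\begin{aligned}
g_t &= \tanh\left(W_{gx} x_t + W_{gh} h_{t-1} + b_g\right)  && \text{(input node)} \\
i_t &= \sigma\left(W_{ix} x_t + W_{ih} h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_{fx} x_t + W_{fh} h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_{ox} x_t + W_{oh} h_{t-1} + b_o\right) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t                    && \text{(cell state)} \\
h_t &= o_t \odot \tanh\left(c_t\right)                      && \text{(hidden state)}
\end{aligned}
```

Here \sigma is the logistic sigmoid and \odot denotes element-wise multiplication. The additive update of c_t is what lets gradients flow across many time steps.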
In 2003, Hammerton et al. [15] applied LSTM to named entity recognition, which was the first application of neural networks to NER. In 2011, Collobert et al. [16] proposed a window-based deep neural network model (CNN-CRF), whose performance exceeded that of all previous traditional algorithms. Others have also used convolutional neural networks for NER. Yao C et al. [17] adopted a CNN model for named entity recognition to optimize online medical guidance. Chiu et al. [18] proposed combining BiLSTM with CNN, using the BiLSTM model to break the limitation of a fixed window size and the CNN to obtain explicit character-level features such as prefixes and suffixes, thus improving named entity recognition. Zheng et al. [19] proposed a hybrid neural network model to extract entities and their relations; this model combines a bidirectional encoder-decoder LSTM module for entity extraction with a CNN module for relation classification, performing joint extraction without any manually constructed features.
Conditional Random Field (CRF) is a discriminative probabilistic model, characterized by the assumption that the output random variables constitute a Markov random field. CRF is defined as follows:
Assuming that
BiLSTM-CRF
The BiLSTM-CRF model combines a bidirectional LSTM network with a CRF layer. When a sentence is input into BiLSTM-CRF, each unit is represented by a vector composed of self-embedding or word embedding, and all embeddings are tuned during training. As shown in Fig. 2, the word embedding vectors are first input into the BiLSTM layer. One LSTM processes the sequence from start to finish in the forward time direction, and the other LSTM processes the sequence from finish to start in the negative time direction. Compared with LSTM-CRF, the BiLSTM-CRF model can use not only the input features before the current time step t but also future input features, which improves labeling accuracy. The CRF model has transition features that take into account the sequential dependencies between output labels. Connecting a CRF layer after the BiLSTM layer as the output layer therefore allows past and future tags to be used effectively when predicting the current tag. This solves the problem of a common softmax output layer, which treats the tags as independent of each other and does not consider the correlation among outputs.
Fig. 2. BiLSTM-CRF model.
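The difference between a CRF output layer and an independent softmax layer can be illustrated with a small decoding sketch. The code below is a plain-Python Viterbi decoder over per-token scores (standing in for BiLSTM outputs) and tag-to-tag transition scores. The function name, the toy scores and the three-tag set are hypothetical values chosen for illustration, not the authors' implementation.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   tags: List[str]) -> List[str]:
    """Return the tag sequence with the highest total score.

    emissions[t][tag]      : score assigned to `tag` at position t
    transitions[prev][tag] : score of moving from tag `prev` to `tag`
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] = previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][tag])
            scores[tag] = best[-1][prev] + transitions[prev][tag] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example (made-up scores): the transition O -> I-PER is heavily penalized.
TAGS = ["O", "B-PER", "I-PER"]
transitions = {p: {q: 0.0 for q in TAGS} for p in TAGS}
transitions["O"]["I-PER"] = -10.0
transitions["B-PER"]["I-PER"] = 0.5
emissions = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.0},
    {"O": 0.8, "B-PER": 0.0, "I-PER": 1.0},
]

independent = [max(TAGS, key=lambda tag: e[tag]) for e in emissions]
joint = viterbi_decode(emissions, transitions, TAGS)
```

In this toy case the per-token argmax yields the invalid sequence O, I-PER, whereas joint decoding with transition scores yields B-PER, I-PER.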
The BiLSTM-CRF model has been widely used in the field of named entity recognition since 2015. Lee et al. [20] combined BiLSTM-CRF and CNN to perform named entity recognition by building a BiLSTM-CNNs-CRF model. Wu et al. [21] added an instance-based transfer algorithm and, using the BiLSTM-CRF model, transferred knowledge from the source domain to the target domain through weight generation and sample selection, effectively addressing the limited ability of deep learning to learn from a small amount of data. Chen et al. [22] proposed a divide-and-conquer method to improve sentiment analysis by classifying sentence types using BiLSTM-CRF and CNN. Luo et al. [23] introduced an attention mechanism on top of BiLSTM-CRF to conduct document-level chemical NER; this model achieved better performance with less feature engineering. Gridach [24] proposed a novel neural network structure based on BiLSTM and CRF, which can acquire knowledge from word- and character-level representations automatically, eliminating most of the feature engineering needed for biomedical named entity recognition. Li et al. [25] used a CNN to extract character-level representations, BiLSTM-CRF for biomedical entity recognition, and BiLSTM-RNN for relation classification. Other examples include [26, 27].
Ensemble learning, also known as a multi-classifier system, completes learning tasks by building and combining multiple learners. The system produces several individual learners, which are usually generated from training data by existing learning algorithms, and then combines them with a certain strategy to obtain a strong learner. Previous work suggests that ensemble learning can be used to improve the performance of named entity recognition tools [28]. There are two main problems to be solved in ensemble learning: how to obtain several individual learners and how to choose the combination strategy. In this paper, the bagging [29] algorithm is adopted to generate individual learners. In this method there are few dependencies between individual learners, so they can be trained in parallel. The voting method is then adopted as the combination strategy: a voting classifier is built from these individual learners, and the class with the most votes is taken as the final prediction.
Proposed method
ELER
At present, common BiLSTM-CRF models construct the model input through sentence sampling, without considering the context information across sentences. In this paper, we propose an entity recognition model, ELER, based on ensemble learning. In addition to the sentence sampling method, we adopt a truncated sequence sampling method, which cuts the training data into segments of a fixed length. All segments therefore have the same length and need no padding, and because a segment can span sentence boundaries, the model can learn context information between sentences. In the ELER model, we use the bagging algorithm to generate individual learners, and then use voting as the combination strategy to integrate the sentence sampling method and the truncated sequence sampling method.
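The two sampling methods can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are (word, tag) pairs, the truncated sampler concatenates all sentences into one stream, and any trailing remainder shorter than the window is dropped. The source does not say how the remainder is handled, and the function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag)


def sentence_samples(sentences: List[List[Token]]) -> List[List[Token]]:
    """Sentence sampling: one training sample per sentence (lengths vary)."""
    return [list(sentence) for sentence in sentences]


def truncated_samples(sentences: List[List[Token]], length: int) -> List[List[Token]]:
    """Truncated sequence sampling: cut the token stream into fixed-length windows.

    Windows ignore sentence boundaries, so a window may contain the end of one
    sentence and the start of the next. Assumption: a trailing remainder that
    is shorter than `length` is dropped, so no padding is needed.
    """
    stream = [token for sentence in sentences for token in sentence]
    usable = len(stream) - len(stream) % length
    return [stream[i:i + length] for i in range(0, usable, length)]


# Toy corpus: two sentences of 3 and 4 tokens.
corpus = [
    [("a", "O"), ("b", "O"), ("c", "O")],
    [("d", "O"), ("e", "O"), ("f", "O"), ("g", "O")],
]
windows = truncated_samples(corpus, 2)
```

With a window length of 2, the second window contains the last token of the first sentence and the first token of the second, which is how cross-sentence context reaches the model.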
The basic idea of the ELER model is as follows:
1. Generate three inputs for training: one through the sampling method with the sentence as the unit, and the other two through the truncated sampling method with different lengths.
2. Obtain three different individual learners by training a BiLSTM-CRF model on each of the three inputs.
3. Use the trained individual learners to predict results.
4. Vote on the results predicted by the three individual learners to obtain the prediction of the ELER model, and then calculate the precision, recall and F1.
The structure of this model is shown in Fig. 3.
Fig. 3. ELER model.
Fig. 4. The process of ELER execution.
The basic idea of voting in the ELER model is as follows:
1. Compare the prediction results of the three learners pairwise.
2. Preserve the entities that have the same offset and category in a common collection entity_all.
3. Discard all entities that appear only once.
4. Keep the entities in entity_all unique.
The specific algorithm process is shown in Fig. 4.
According to the above voting strategy, the ELER model retains as many correctly predicted entities as possible while keeping the total number of predicted entities small, which helps improve the precision, recall and F1 value of entity recognition.
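The voting strategy described above can be sketched in a few lines. This is an illustrative sketch only: entities are assumed to be represented as (start offset, end offset, category) tuples, and the function name and toy predictions are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, category)


def vote(predictions: Iterable[Iterable[Entity]]) -> Set[Entity]:
    """Keep every entity predicted by at least two learners.

    Each pair of learners is compared, and entities with the same offset and
    category are collected in entity_all. An entity predicted by only one
    learner never appears in any pairwise intersection, and the set keeps
    the collected entities unique.
    """
    entity_all: Set[Entity] = set()
    for first, second in combinations([set(p) for p in predictions], 2):
        entity_all |= first & second
    return entity_all


# Toy predictions from three learners (made-up values).
pred_sentence = {(0, 2, "ORG"), (5, 6, "PER")}
pred_len30 = {(0, 2, "ORG"), (8, 9, "LOC"), (11, 12, "ORG")}
pred_len20 = {(5, 6, "PER"), (8, 9, "LOC"), (11, 12, "MISC")}

final = vote([pred_sentence, pred_len30, pred_len20])
```

In the toy data, the span (11, 12) is predicted twice but with different categories, so it is discarded just like an entity predicted only once.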
We preprocess numbers and upper- and lower-case letters. The specific strategies are as follows:
1. Change all uppercase letters into lowercase letters (to reduce data sparseness).
2. Represent every number as a 300-dimensional zero vector.
3. When a word is present in the word embedding set, represent it directly by its 300-dimensional vector from that set.
4. When a word is not present in the word embedding set, represent it as a 300-dimensional zero vector.
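The preprocessing rules above amount to a small lookup function. The sketch below assumes the embedding set is available as a dictionary from lowercase words to 300-dimensional lists, and it uses a simple regular expression to decide what counts as a number. Both the regular expression and the function name are assumptions, since the source does not define them.

```python
import re
from typing import Dict, List

DIM = 300
# Assumption: a "number" is an optionally signed digit string that may contain
# decimal points or thousands separators, e.g. "1996", "3.14", "1,000".
NUMBER = re.compile(r"^[+-]?\d+([.,]\d+)*$")


def embed(word: str, embeddings: Dict[str, List[float]]) -> List[float]:
    """Map a token to its input vector following the four rules above."""
    token = word.lower()                 # rule 1: lowercase everything
    if NUMBER.match(token):              # rule 2: numbers become zero vectors
        return [0.0] * DIM
    if token in embeddings:              # rule 3: known words use their vector
        return list(embeddings[token])
    return [0.0] * DIM                   # rule 4: unknown words become zero vectors


# Toy embedding set with a single entry (made-up values).
toy_embeddings = {"reuters": [0.1] * DIM}
```

For example, `embed("Reuters", toy_embeddings)` returns the stored vector because the word is lowercased before lookup, while `embed("1996", toy_embeddings)` and an out-of-vocabulary word both return zero vectors.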
The ELER model is obtained through ensemble learning over three BiLSTM-CRF models. Training the ELER model therefore amounts to training the three BiLSTM-CRF models separately. The training process of BiLSTM-CRF is as follows:
For each training batch, we first run the forward pass of the BiLSTM-CRF model, which includes the forward pass of the forward LSTM (from the beginning of the sequence to the end, in the forward time direction) and the forward pass of the backward LSTM (from the end of the sequence to the beginning, in the negative time direction), and output score matrix
Objective function
The ELER model maximizes the sentence-level log-likelihood. The main idea is to take the tag-to-tag transition probabilities into account when searching for the best tag path. First of all, a tag transformation matrix
Let
The log-likelihood objective function and its gradient can be computed by dynamic programming. For the given network output score
In order to evaluate the effect of entity recognition, this paper uses Precision, Recall and F1, which are used by most existing recognition models. Precision and Recall are defined below:
Among them,
Experimental data set
In our work, GloVe [30], the public word embedding set trained by Stanford, is used to represent the input of English data set (download address:
The experiments in this paper are conducted on the standard data set CoNLL2003 [31] and the agricultural data set AgriNER2018. The first data set is extracted from Reuters news data, in which the named entity recognition data consists of eight files covering German and English. The English data set of CoNLL2003 mainly marks four types of entities: location, organization, person and miscellaneous.
The task of entity recognition is to give each word in a sentence an entity tag. A single entity may consist of multiple words. The IOB labeling method is used in this article. The letter I stands for inside, a word that is part of an entity but is not its first word. The letter O stands for outside, which covers all words that are not contained in any entity. The letter B stands for beginning, the first word of an entity. If a word is the first word of an entity, it is labeled B-label. If a word is part of an entity but not the first word, it is labeled I-label, and all other words are labeled O.
All data files in CoNLL2003 contain one word per line, with blank lines representing sentence boundaries. At the end of each line is a label indicating whether the current word is an entity. Here is an example sentence from the CoNLL2003 dataset:
As you can see, each row of data contains four parts: word (the first column), POS (part of speech) tag (the second column), chunk tag (the third column), and NER tag (the fourth column). The named entities in the dataset are non-recursive and non-overlapping.
The numbers of sentences, tokens and entities in each file of CoNLL2003 are shown in Table 1.
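The IOB scheme described above can be turned back into entity spans with a short routine. The sketch below assumes the strict variant in which every entity starts with a B- tag, as the paragraph describes. The function name and the example tags are hypothetical.

```python
from typing import List, Sequence, Tuple


def extract_entities(tags: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Convert IOB tags into (start, end, label) spans, with `end` exclusive.

    An entity starts at a B-label tag and continues over the following
    I-label tags that carry the same label. Stray I- tags that do not
    follow a matching B- or I- tag are ignored.
    """
    entities = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


example_tags = ["B-ORG", "I-ORG", "O", "B-PER", "O"]
spans = extract_entities(example_tags)
```

Here `spans` contains a two-word ORG entity at positions 0 to 1 and a one-word PER entity at position 3. Spans of this form, an offset plus a category, are also a convenient unit for comparing the predictions of several models.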
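A file in this four-column layout can be read with a few lines of code. The sketch below assumes whitespace-separated columns and treats blank lines as sentence boundaries. It also skips the document-separator lines beginning with -DOCSTART- that appear in the distributed files. The sample rows are made up for illustration and are not quoted from the data set.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NER tag)


def read_conll(lines: Iterable[str]) -> List[List[Row]]:
    """Group four-column rows into sentences separated by blank lines."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:                    # a boundary closes the open sentence
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:                            # file may not end with a blank line
        sentences.append(current)
    return sentences


# Illustrative rows only (not taken from CoNLL2003).
SAMPLE = """\
Reuters NNP B-NP B-ORG
reported VBD B-VP O
from IN B-PP O
New NNP B-NP B-LOC
York NNP I-NP I-LOC
. . O O

Prices NNS B-NP O
rose VBD B-VP O
. . O O
"""

parsed = read_conll(SAMPLE.splitlines())
```

Each parsed sentence is a list of (word, POS, chunk, NER) tuples, so the NER column can be taken with `[row[3] for row in sentence]`.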
Table 1. Number of sentences, tokens and entities in CoNLL2003
Since there is no publicly available agricultural entity recognition data set, we constructed the data set AgriNER2018 using the agricultural thesaurus and supplemented it with web crawlers. In the experiments, we selected three entity categories for labeling: sediment, soil forming process and soil layer. In order to extend the positive samples of the entity labeling data, we treat the subcategories of these three types of entities as belonging to the same category as their parent. Taking sediment as an example, in order to construct a sediment entity recognition data set, we crawl all the text of the Baidu Wikipedia pages for sediment and all its sub-entities, such as avalanches, moraines and residues. The crawled contents include the abstracts and all text under the directory titles.
The specific steps to construct dataset AgriNER2018 from Baidu Wikipedia are as follows:
1. Sort out the URLs of the web pages to be crawled. URLs of Baidu Wikipedia have a fixed format.
2. Write the web crawler and run it.
3. Data cleaning. Clean up the data crawled in the previous step, remove unwanted data, and save the data as a column.
4. Data annotation. The annotation method used in the AgriNER2018 data set is similar to that used in CoNLL2003: “SED” represents sediment (deposits), “TOPSOIL” represents soil layer and “SOILFORMING” represents soil-forming process.
AgriNER2018 is divided into a training set and a test set. The numbers of sentences, tokens and entities in each set are shown in Table 2.
Table 2. Number of sentences, tokens and entities in AgriNER2018
The ELER network uses the BPTT algorithm to update its parameters, with Adam as the optimization algorithm and a gradient clipping value clip of 5.0. The sampling lengths of the two truncated sequence sampling stages are 20 and 30, respectively. The number of LSTM layers number_layers is 300, the input dimension of cells input_size is 300, the learning rate lr is 0.001, epoch is 100, batch_size is 128, and the word vector dimension is 300.
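For reproducibility, the settings listed above can be collected in a single configuration object. The sketch below copies the values exactly as stated in the text; the class name and the field names other than clip, number_layers, input_size, lr and batch_size are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ElerConfig:
    """Training settings as stated in the text (values copied verbatim)."""
    optimizer: str = "adam"                        # parameters updated via BPTT
    clip: float = 5.0                              # gradient clipping value
    truncation_lengths: Tuple[int, int] = (20, 30) # truncated sampling lengths
    number_layers: int = 300                       # as stated in the text
    input_size: int = 300                          # input dimension of cells
    lr: float = 0.001                              # learning rate
    epochs: int = 100
    batch_size: int = 128
    embedding_dim: int = 300                       # word vector dimension


config = ElerConfig()
```

A frozen dataclass keeps the three individual learners on identical settings, so that only the sampling method differs between them.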
Results
The entity recognition results of all models on the CoNLL2003 data set are shown in Tables 3 and 4, and the results for AgriNER2018 are shown in Tables 5 and 6. Among them, the BiLSTM-CRF-Sen model is a reproduction of the model proposed by Huang et al. [13]. The experimental results are as follows.
Table 3. Prediction numbers of different models on CoNLL2003
Table 4. Performance comparison of different models on CoNLL2003
Table 5. Prediction numbers of different models on AgriNER2018
Table 6. Performance comparison of different models on AgriNER2018
In Tables 3 and 5, the second, third and fourth columns respectively represent the number of correctly predicted entities, the total number of predicted entities, and the number of real entities in each named entity recognition experiment. As we can see, after voting ELER obtains the largest number of correctly predicted entities while its total number of predicted entities remains small. In Tables 4 and 6, the last row gives the best results obtained by the ELER model in the NER experiments. According to formulas (2) and (3), precision and recall trade off against each other, and this can also be seen in these tables. The experimental results in Tables 3 and 4 show that the ELER model achieves better results than any individual model: on CoNLL2003, precision increases by 1.37% and the F1 value by 0.7%. The results in Tables 5 and 6 show that our model can be applied well to entity recognition on small agricultural data sets.
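The relationship between the three counts (correctly predicted, predicted, real) and the reported scores can be sketched as follows, assuming the standard entity-level definitions: precision is the share of predicted entities that are correct, recall is the share of real entities that are found, and F1 is their harmonic mean. The function name and the example counts are made up and do not come from the tables.

```python
from typing import Tuple


def precision_recall_f1(correct: int, predicted: int, gold: int) -> Tuple[float, float, float]:
    """Entity-level scores from three counts.

    correct   : number of correctly predicted entities
    predicted : total number of predicted entities
    gold      : number of real entities in the data
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 correct out of 10 predicted, with 16 real entities.
p, r, f1 = precision_recall_f1(8, 10, 16)
```

With these counts, precision is 0.8 and recall is 0.5. Predicting fewer entities raises precision only if the number of correct predictions does not fall as well, which is why a voting rule that discards mostly incorrect entities can improve both precision and F1.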
In this paper, an entity recognition model, ELER, is designed based on the BiLSTM-CRF model; it uses ensemble learning to combine several weak learners into one strong learner. In the construction of the ELER model, the truncated sequence sampling method is combined with the common sentence sampling method to avoid the loss of context information between sentences that occurs with sentence sampling alone. Experimental results on the standard data set CoNLL2003 and the agricultural data set AgriNER2018 show that the ELER model gives better predictions than the BiLSTM-CRF-Sen, BiLSTM-CRF-Batch30 and BiLSTM-CRF-Batch20 models, including higher Precision and F1 values. This indicates that the ELER model can effectively improve the recognition performance of the BiLSTM-CRF model and can be applied well to small agricultural data sets, which is of significance to the construction of agricultural knowledge graphs.
Acknowledgments
Our work has been fully supported by the National Key Research and Development Program (2017YFD0301506), the Natural Science Foundation of Hunan Province, China (Grant No. 2019JJ40133), National Natural Science Foundation of China (Grant No. 61972146), Double first-class Construction Project of Hunan Agricultural University, China (Grant No. SYL2019077), as well as the Key Research and Development Program of Hunan Province, China (Grant No. 2017NK2381).
