Abstract
Named entity recognition (NER) is a crucial technology that is widely used in many application scenarios, including information extraction, information retrieval, text summarization, and machine translation in AI-based smart communication and networking systems. As NER attracts more and more attention, it has gradually become an independent and important research field. Currently, most NER models require their hyper-parameters to be adjusted manually, which is not only time-consuming and laborious but also prone to getting stuck in a local optimum. To deal with this problem, this paper proposes a machine learning-guided model for NER in which the hyper-parameters are adjusted automatically to improve performance. Specifically, the proposed model is implemented using bidirectional encoder representations from transformers (BERT) and a conditional random field (CRF). Meanwhile, the collaborative computing paradigm is fused into the model by using particle swarm optimization (PSO) to search automatically and collaboratively for the best hyper-parameter values. The experimental results demonstrate the satisfactory performance of our proposed model.
Introduction
With the wide application of electronic texts on the Internet, massive and complicated information has brought severe challenges to information acquisition. People therefore urgently need automated tools to help them mine knowledge from such huge amounts of information. Technologies such as information extraction, information retrieval, text summarization and machine translation have emerged against this background, and named entity recognition (NER) is a significant issue among them. The goal of NER is to identify entities with special meanings in text, such as dates, locations, persons, organizations and other proper nouns, and to add the corresponding labeling information to those entities, so as to facilitate the follow-up work of information extraction [1]. As NER attracts more and more attention, it has gradually become an independent and important research field. However, NER also faces the following challenges.
First, since named entities are diverse, complex and random, it is difficult to clearly define and classify entity types. Second, the length of a named entity is uncertain. For example, the length of institution names varies greatly: the abbreviations of some entities have only two characters, whereas the full names of others can reach dozens of characters. Third, real industrial applications lack large-scale knowledge bases like Wikipedia, so it is very challenging to acquire a considerable amount of annotated data.
To deal with the above problems, a number of supervised, semi-supervised and unsupervised machine learning algorithms have been proposed [2, 3, 4]. To further enhance the accuracy of NER, we develop a novel model that combines a popular deep learning algorithm with a stochastic optimization technique based on population cooperation. It is worth mentioning that, to ensure that the learning model fits a specific dataset satisfactorily, many important hyper-parameters need to be adjusted when carrying out the NER task. Traditional NER models require a lot of time for manual adjustment of these hyper-parameters. With our proposed model, this adjustment is performed automatically by a search strategy.
Generally speaking, collaborative computing is a very efficient architecture in which individuals work together toward the same goal. In other words, collaborative computing allows different individuals in a decentralized state to collaborate and share information in order to fulfill an assignment together. On the one hand, collaboration and communication between individuals in different fields are promoted more effectively; on the other hand, the work quality and efficiency of the whole group are greatly improved. Particle swarm optimization (PSO), an evolutionary computing technology [5], solves stochastic optimization problems through exactly this kind of information sharing and cooperation between the particles of a group. Therefore, by introducing collaborative computing into our model, the PSO algorithm is used to automatically find the optimal hyper-parameter values in a collaborative manner.
Specifically, we present a novel model for NER that is made up of bidirectional encoder representations from transformers (BERT), a conditional random field (CRF) and PSO. This model is called BERT-CRF-PSO, and it contains three sub-modules: the BERT model, the CRF layer and the PSO algorithm. The BERT model takes advantage of the encoder architecture of the transformer to obtain semantic vectors, and combines the masked language model (MLM) and next sentence prediction (NSP) objectives to capture word-level and sentence-level representations, respectively, so as to achieve truly contextual prediction [6]. The CRF layer makes effective use of contextual labels when predicting the current label, which yields more precise conditional probabilities for named entities [7]. Concurrently, we integrate the PSO algorithm into the BERT-CRF model to automatically fine-tune some important hyper-parameters. In short, the BERT model supplies pre-trained representations of the input text, the CRF layer adds appropriate constraints to the final predicted labels, and the PSO algorithm is responsible for fine-tuning the hyper-parameters. Although these three modules play different roles, they cooperate with each other to accomplish NER tasks together. With the help of this model, the performance of NER is expected to improve accordingly.
The contributions of this paper can be summarized as the following two aspects.
First, we introduce a collaborative computing strategy to NER, so that named entities are extracted effectively through information sharing and mutual cooperation. Second, we combine the PSO algorithm with the BERT-CRF model to automatically fine-tune some important hyper-parameters, which makes it convenient to search for their global optimal values and thereby improves the performance of the entire model on NER.
The remainder of this paper is arranged as follows. Section 2 reviews work related to NER. Section 3 introduces the proposed BERT-CRF-PSO model in detail. Section 4 presents comparative experiments with different models on different Chinese datasets. Finally, Section 5 summarizes our work.
NER has always been a central issue in the field of NLP. The earliest methods were based on rules and vocabularies, followed by methods based on traditional machine learning. In recent years, methods based on deep learning have become particularly popular and widely used. The development trend of NER is shown in Fig. 1. The rest of this section analyzes some related algorithms and models for NER.
Fig. 1. The research progress of NER.
The basic idea of rule and vocabulary-based methods is to select features, e.g., keywords, statistical information, punctuation, position words and direction words, to construct specific rules and vocabularies manually, and to perform NER with techniques such as string matching and pattern matching.
Combining heuristic thinking and artificial rules, Xie et al. developed a model for extracting named entities from unstructured text [8]. However, the proposed method is not easy to extend to other entity types or datasets, resulting in inability to adapt to changes in data. Akhondi et al. constructed a system that could automatically identify chemicals in text data by using dictionaries and grammar [9]. Specifically, they introduced vocabulary resources that could provide chemical structure information, the LeadMine tool based on grammar recognition, and regular expressions into the task of NER for chemicals. In the end, the
In short, this kind of method depends heavily on rules made manually by linguistic experts, each of which is given a certain weight. When rules conflict, the rule with the highest weight is selected to determine the type of the named entity. Hence, such a method achieves satisfactory results only when the established rules accurately reflect the language characteristics of a certain field. In general, this type of method has the following shortcomings.
First, it relies too heavily on specific text styles, domains and languages, so formulating rules is time-consuming, it is difficult to cover all languages, and the rules are particularly prone to errors. Second, for different fields and systems, linguistic experts need to rewrite the rules, resulting in poor portability of the system. Third, there are other issues, such as a long system construction period and high cost.
It is for the above reasons that the traditional machine learning-based methods have been developed.
In the field of traditional machine learning, NER is usually regarded as a sequence tagging task. A sequence tagging model is usually learned from a large corpus so that each position of a sentence can be tagged appropriately. For this task, several algorithms based on traditional machine learning have been adopted, mainly support vector machine (SVM) [12], CRF [7], maximum entropy (ME) [13] and hidden Markov model (HMM) [14].
The advantages and disadvantages of the above four traditional machine learning algorithms are as follows.
SVM can address high-dimensional and nonlinear problems, but it is difficult to apply to large-scale training samples and to multi-class problems. CRF can produce a flexible, globally optimal labeling model for NER. Unfortunately, it converges slowly, and its training time is expensive. ME has the merits of good versatility and compact structure. However, ME needs to perform normalization calculations on the data, which incurs relatively large additional overhead. HMM is combined with the Viterbi algorithm to recognize named entities in a sequence [15], so its training and recognition speeds are fast. However, because HMM is memoryless, it cannot make good use of contextual information, which results in poor recognition performance.
Sobhana et al. used CRF to implement an NER system for geological text [16]. The system can use the contextual information of words to predict the characteristics of various named entities. The training set used by the researchers contains more than 200,000 words in total and covers 17 types of labels. Finally, the overall
Although traditional machine learning-based methods are trained on manually labeled corpora, they do not demand a large amount of linguistic knowledge, and their requirements for linguistic experts are not so high. Hence, they can be built in a shorter time than rule and vocabulary-based methods. In addition, systems implemented with these methods have better portability. However, they place high demands on feature selection. Table 1 shows some commonly used features in traditional machine learning-based methods. When we use those methods to accomplish NLP tasks, we need to choose from the original data a series of features that affect the target task and add them to the feature vector. For these reasons, many scholars have begun to explore NER methods based on deep learning.
Table 1. Commonly used features in traditional machine learning-based NER
Deep learning-based computing paradigms have enabled us to process massive amounts of data efficiently [20, 21]. Accordingly, deep learning-based NER methods have attracted more attention. Compared with the previous two types of methods, learning models constructed in this way have the following benefits [22, 23, 24].
First, with the help of nonlinear activation functions, such models can learn latent knowledge from the original dataset that better reflects the characteristics of the text. Second, these models are typically end-to-end, which implies that they avoid error propagation between modules and obtain satisfactory experimental results. Third, a large number of experiments have shown that these models are very good at solving sequence tagging problems because they do not rely on feature engineering and domain knowledge.
Chiu et al. designed a novel neural network structure combining bi-directional long short-term memory (Bi-LSTM) and convolutional neural network (CNN) [25]. This structure is end-to-end and does not require too much feature engineering. Most importantly, it can automatically detect word-level and character-level features. For medical NER, Xu et al. designed a model using BiLSTM-CRF [26], which could learn the information features from a given dataset. Experiments on the NCBI disease corpus as one of the evaluation benchmark datasets showed that their method could achieve the
The method proposed in this paper is also a deep learning method, but unlike the methods above, it fine-tunes the hyper-parameters of the model automatically by introducing the PSO algorithm and the idea of collaborative computing. In this way, the disadvantages of manual hyper-parameter tuning, which is time-consuming and prone to falling into a local optimum, are avoided, making it easier to use the deep learning model for entity extraction.
The overall framework of our model is demonstrated in Fig. 2, where “Text” refers to the input sentence of the model, [CLS] is a special symbol that is added at the beginning of each input sentence, i.e., the first token of each input sentence is [CLS],
Fig. 2. The architecture of our proposed model.
From Fig. 2, we can see that this model is an end-to-end model (i.e., a deep learning model), which obtains the target output directly from the original input data. Therefore, it does not rely on the features listed in Table 1. In addition, as can be seen from Fig. 2, the model contains three sub-modules: the BERT model, the CRF layer and the PSO algorithm. The BERT model uses the encoder architecture of the transformer to acquire semantic vectors, and combines MLM and NSP to capture word-level and sentence-level representations, respectively, so as to achieve truly contextual prediction. The CRF layer learns constraints from the training data and applies them to the final predicted labels to make sure that the label sequence is legal. With the constraints learned by the CRF layer, the number of invalid predicted label sequences can be significantly reduced. Simultaneously, we integrate the PSO algorithm into the BERT-CRF model to automatically fine-tune some important hyper-parameters. These three sub-modules cooperate with each other and share information to accomplish NER tasks together. In this section, we present the principles of these three sub-modules in detail.
The goal of the BERT model is to take advantage of large-scale unlabeled corpus for training, so as to gain a representation of the text that contains rich semantic information. In order to achieve this goal, BERT adopts the encoder architecture in the transformer to extract features, and combines the self-attention mechanism to make each word possess global semantic information, to obtain the best contextual representation of each word. In addition, BERT uses two methods, including MLM and NSP, to capture word-level and sentence-level representations, respectively. Next, we introduce in detail the internal mechanism of BERT and other models it uses.
The structure of BERT
The transformer includes two parts: an encoder and a decoder. BERT adopts only the encoder, and its structure is deeper than that of the transformer: the transformer's encoder contains 6 encoder blocks, whereas BERT-base contains 12 and BERT-large contains 24. The encoder structure of the transformer is shown in Fig. 3 [29]. As Fig. 3 shows, the transformer encoder consists of 6 identical encoder modules, each of which contains two sub-layers. The first is a multi-head attention mechanism, and the second is a feed-forward neural network. Each sub-layer is wrapped in a residual connection followed by layer normalization, namely Add&Norm. In the multi-head attention mechanism, the model computes several attention heads in parallel. Each head has its own role and focuses on different information in the input, and finally the outputs of all heads are concatenated. Moreover, BERT uses only the encoder of the transformer because BERT is a pre-training model. That is, rather than targeting one concrete NLP task, it is trained as a language model. At the same time, since BERT does not employ the transformer's decoder, it avoids the additional attention operations that the decoder requires.
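To make the multi-head idea concrete, the following is a minimal sketch in plain Python. It assumes the standard scaled dot-product attention of the transformer and, for brevity, omits the learned query, key, value and output projections, so it is a simplification rather than BERT's actual implementation. The function names are illustrative and do not come from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention over lists of equal-length vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_head_self_attention(x, num_heads):
    # Assumption: heads are formed by slicing the feature dimension;
    # the learned projection matrices of a real transformer are omitted.
    dim = len(x[0])
    assert dim % num_heads == 0
    size = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * size:(h + 1) * size] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the per-head outputs for every position.
    return [sum((heads[h][i] for h in range(num_heads)), [])
            for i in range(len(x))]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_self_attention(x, num_heads=2))
```

Each output row has the same width as the input because the two half-width head outputs are concatenated, which mirrors the splicing step described above.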
Fig. 3. The encoder structure of transformer.
The input of BERT can contain a sentence pair (sentence A and sentence B), or it can be a single sentence. Taking a single sentence A “she likes traveling” as an example, the input embedding of BERT is shown in Fig. 4. In Fig. 4, token embedding refers to the word embedding obtained through training. Segment embedding is used to distinguish which sentence each word belongs to (for example,
Furthermore, BERT applies the WordPiece method to construct its dictionary. WordPiece divides words into sub-words: if a word is not found in the dictionary, it is split into sub-words one by one. For example, “traveling” in Fig. 4 becomes “travel” and “#ing”. If no corresponding token is found for a certain sub-word, it is marked as unknown ([UNK]). Constructing the dictionary with WordPiece not only captures the root information of words effectively, but also reduces the dictionary size to a certain extent.
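The splitting procedure can be illustrated with a short sketch of greedy longest-match-first tokenization. It assumes the conventions of the released BERT vocabularies, where a continuation piece is prefixed with `##` (written “#ing” in the text) and a word that cannot be fully segmented is mapped to `[UNK]` as a whole. The tiny vocabulary is hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-words."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No sub-word matched: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary for the example sentence "she likes traveling".
vocab = {"she", "likes", "travel", "##ing"}
print(wordpiece_tokenize("traveling", vocab))
```

Because frequent stems and suffixes are shared across many words, a vocabulary of sub-words can stay much smaller than a vocabulary of full word forms.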
Fig. 4. The input embedding of BERT.
Generally speaking, a conventional language model can only be trained from left to right or from right to left, whereas the BERT model is deep and bidirectional. BERT therefore chooses the MLM objective to overcome the limitation of using only one-way information. MLM randomly masks or replaces words in a sentence and then lets the model predict the masked or replaced parts from the context. BERT randomly masks 15% of the tokens in the dataset during training and predicts only these masked tokens. In effect, MLM learns the relationships between words. Through this method, the BERT model can use more contextual information about words for prediction.
An obvious flaw of this approach is that the [MASK] token never appears in the fine-tuning stage, which leads to a mismatch between the pre-training and fine-tuning phases. Therefore, for the 15% of tokens selected for masking, BERT adopts the following three strategies to alleviate this problem:
In 80% of the cases, the word is replaced with the [MASK] token. For example, “I like apple” becomes “I like [MASK]”. In 10% of the cases, the word is replaced with a random word, so that “apple” in “I like apple” is replaced by a randomly chosen word from the vocabulary. In the remaining 10% of the cases, the word is left unchanged, so that “I like apple” stays “I like apple”.
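The 80%/10%/10% rule can be sketched as follows. This is a simplified illustration: it selects each token independently with probability 0.15 instead of selecting an exact 15% share, and the function name, the toy vocabulary and the use of `None` for unselected positions are assumptions made here, not details from the paper.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); labels are None where no
    prediction is required."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special symbols are never selected for masking.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok            # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"   # 80%: replace with the mask symbol
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the token unchanged
    return masked, labels

tokens = ["[CLS]", "I", "like", "apple", "[SEP]"]
toy_vocab = ["I", "like", "apple", "banana", "play"]
print(mask_tokens(tokens, toy_vocab, rng=random.Random(0)))
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on seeing `[MASK]` to know where a prediction is needed, which reduces the mismatch with fine-tuning.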
Traditional language models cannot directly capture the information between sentences. To solve this problem, BERT introduces the NSP task, whose purpose is to let the model better understand the relationship between sentences. The training corpus for this task is generated by extracting sentence pairs (for example, sentence A and sentence B) from the dataset, where B is the actual next sentence of A with a probability of 50%, and B is a random sentence from the dataset (i.e., not the next sentence of A) with a probability of 50%.
Here are two examples:
Input
I often go to play [MASK] [SEP] (B)
Label
Input
I like to eat [MASK] [SEP] (B)
Label
Here, [CLS] and [SEP] are two special symbols: the output at the [CLS] position is used for classification, and [SEP] is used to separate non-contiguous token sequences.
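The construction of NSP training pairs can be sketched as below. The label names `IsNext` and `NotNext` follow the original BERT paper and are an assumption here, since the labels are not legible in the examples above. Whitespace tokenization and the function name are also illustrative simplifications.

```python
import random

def make_nsp_examples(sentences, rng=None):
    """Build (tokens, label) pairs: B follows A half of the time,
    otherwise B is a random other sentence from the corpus."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw a random B")
    rng = rng or random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            b, label = sentences[i + 1], "IsNext"
        else:
            # Exclude A itself and its true successor from the candidates.
            candidates = [s for j, s in enumerate(sentences)
                          if j not in (i, i + 1)]
            b, label = rng.choice(candidates), "NotNext"
        tokens = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
        examples.append((tokens, label))
    return examples

corpus = ["I often go to play basketball", "it keeps me healthy",
          "I like to eat fruit", "the weather is nice"]
for tokens, label in make_nsp_examples(corpus, rng=random.Random(1)):
    print(label, tokens)
```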
In short, through the above important modules or mechanisms, BERT can simultaneously use the context-related information of the current word for feature extraction and use the dynamic context-related information to adjust the word vector accordingly. In addition, if we make use of the pre-trained BERT model to achieve downstream tasks, we only need to load it as the word embedding layer of the current task, and then build other structures after the BERT model for our own tasks, without much modification or optimization of the code.
The design of CRF layer
NER is essentially a sequence tagging task, i.e., each word or character is assigned an appropriate label according to its context. Hence, to make our classifier perform better, we can try to exploit the labels of adjacent positions when tagging the dataset. This is very difficult for traditional classifiers. Fortunately, CRF is well suited to capturing contextual information, because a significant feature of CRF is that it uses a log-linear model to obtain the joint probability of a label sequence. In this way, it is easy to make effective use of contextual labels when predicting the current label. More importantly, CRF is a representative sequence tagging algorithm. This means that for a given input sentence sequence
In a sentence, the label of the first word should be “O” or “B-”, not “I-”. The legal label pattern that CRF layer can learn should be “B- The tag sequence such as “O I-
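Constraints of this kind can be written down explicitly as a validity check on a label sequence. The sketch below assumes the standard BIO convention, in which an `I-` label may only follow a `B-` or `I-` label of the same entity type. The entity type names are hypothetical. Note that a CRF layer learns such patterns as soft preferences through its transition scores, whereas this function applies them as hard rules.

```python
def is_valid_bio(labels):
    """Check whether a label sequence obeys the BIO constraints."""
    prev = "O"  # treat the position before the sentence as outside
    for label in labels:
        if label.startswith("I-"):
            entity_type = label[2:]
            # "I-X" is legal only directly after "B-X" or "I-X".
            if prev not in ("B-" + entity_type, "I-" + entity_type):
                return False
        prev = label
    return True

print(is_valid_bio(["B-PER", "I-PER", "O"]))   # legal pattern
print(is_valid_bio(["O", "I-PER"]))            # "I-" cannot follow "O"
```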
Figure 5 shows the structure of CRF [7]. From this figure, we can see that two kinds of feature functions are involved in the CRF layer: the state feature function and the transition feature function. The state feature function is used to calculate the state score, and the transition feature function is used to calculate the transition score. The former considers only which entity label the character at the current position can take, while the latter considers which combinations of entity labels the character at the current position and its adjacent positions can have.
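How the two scores interact can be sketched with the usual linear-chain formulation, in which the score of a label sequence is the sum of its state scores and transition scores and the best sequence is found with the Viterbi algorithm. This is an assumption about the form of the model, start and stop transitions are omitted, and the numbers in the example are made up for illustration.

```python
def path_score(emissions, transitions, path):
    """Sum of state scores and transition scores along one label path."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path and its score."""
    num_labels = len(emissions[0])
    best = list(emissions[0])      # best score of a path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for cur in range(num_labels):
            prev = max(range(num_labels),
                       key=lambda p: best[p] + transitions[p][cur])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][cur]
                            + emissions[t][cur])
        best = new_best
        backpointers.append(pointers)
    last = max(range(num_labels), key=lambda i: best[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Labels: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalised.
emissions = [[2.0, 1.0, 0.0], [0.0, 0.4, 1.5]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi_decode(emissions, transitions))
```

In this toy case the state scores alone would pick O followed by I, but the transition penalty makes B followed by I the best path, which is how the CRF layer suppresses illegal label sequences.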
Fig. 5. The structure of CRF.
Suppose
where
Nowadays, CRF has been applied in some sequence tagging tasks, e.g., NER, part-of-speech tagging and word segmentation [30, 31, 32].
The PSO algorithm is an evolutionary computing technology developed by Kennedy and Eberhart in 1995 [5]. It is derived from the simulation of a simplified social model and is inspired by observations of the activities of birds. Through collaboration and information sharing between individuals, the population evolves from disorder to order in the solution space of the problem and finally acquires a global optimal solution. The core idea of PSO is therefore to simulate the predation behavior of a flock of birds: each bird evaluates whether the position it has found is the best so far and shares the best solution with the entire flock.
At the beginning, the PSO algorithm initializes a random solution for each particle. It uses a fitness value to measure the quality of a solution and searches iteratively for the optimal solution, which is consistent with the idea of the simulated annealing (SA) algorithm [33]. More importantly, the PSO algorithm is simpler than the genetic algorithm (GA) [34], because it does not have the complicated “crossover” and “mutation” operations of GA. The advantages of the PSO algorithm are therefore that it converges quickly, is easy to implement and has few parameters to tune. Nowadays, the PSO algorithm is widely used in fields such as image processing and data mining [35, 36].
Finding the hyper-parameter values with which the model performs best on a specific dataset is an optimization problem, and PSO is well suited to handling optimization problems. In PSO, the position of a particle in the search space corresponds to a candidate solution of the problem. Every particle has three features: position, speed, and fitness. In this paper, position represents the value of the hyper-parameters of the BERT model, speed indicates the direction of hyper-parameter changes, optimization function denotes the BERT-CRF model, and fitness can be evaluated by
First, PSO uses random functions to initialize the particle swarm so that each particle has an initial position and speed. Second, PSO uses the optimization function to initialize the fitness of the particle swarm. Then, the initial fitness value of each particle is taken as its individual optimal value, and the corresponding position is taken as its individual optimal position. At the same time, the best fitness value among all particles is regarded as the global optimal value, and the position corresponding to this best fitness value is regarded as the global optimal position of the swarm. In the process of iteratively searching for the optimal solution, each particle updates its position and speed by tracking two “extremums”, that is, the historical optimal solution of the particle itself, expressed by
where
According to Eq. (2), we know that for the updating strategy of particle velocity, we can divide it into three parts.
The first one represents the impact of the particle’s current speed on the particle flight, and this part provides the particle’s flight dynamics in the searching space. The second one is individual cognitive behavior, which refers to the individual experience of particles. In simple words, this is a vertical search. A sample itself has many generations, and each generation has a hyper-parameter. These hyper-parameters must have an optimal value. Therefore, The third part is group cognition behavior, which represents the influence of group experience on the flight trajectory of particles, prompting particles to move toward the best position found by the group. Therefore,
Furthermore, the parameter optimization by introducing PSO is shown in Algorithm 3.3.
Algorithm 3.3: The process of parameter optimization realized by introducing PSO. Set the fitness:
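Since the steps of the algorithm are not fully legible here, the following is a minimal sketch of the procedure under stated assumptions: it uses the textbook PSO velocity update with an inertia term, an individual cognition term and a group cognition term, and it replaces the expensive BERT-CRF training and evaluation with a hypothetical toy fitness function. The parameter values (`inertia`, `c1`, `c2`, swarm size) are illustrative and are not the settings used in the paper.

```python
import random

def pso_maximize(fitness, bounds, num_particles=10, iterations=30,
                 inertia=0.7, c1=1.5, c2=1.5, rng=None):
    """Search the box given by `bounds` for a position with high fitness."""
    rng = rng or random.Random()
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(num_particles)]
    velocities = [[0.1 * rng.uniform(-(hi - lo), hi - lo) for lo, hi in bounds]
                  for _ in range(num_particles)]
    # Initial positions double as the individual bests.
    personal_best = [list(p) for p in positions]
    personal_fit = [fitness(p) for p in positions]
    g = max(range(num_particles), key=lambda i: personal_fit[i])
    global_best, global_fit = list(personal_best[g]), personal_fit[g]

    for _ in range(iterations):
        for i in range(num_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]                           # current speed
                    + c1 * r1 * (personal_best[i][d] - positions[i][d])  # individual cognition
                    + c2 * r2 * (global_best[d] - positions[i][d]))      # group cognition
                lo, hi = bounds[d]
                # Keep the particle inside the hyper-parameter range.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            fit = fitness(positions[i])
            if fit > personal_fit[i]:
                personal_best[i], personal_fit[i] = list(positions[i]), fit
                if fit > global_fit:
                    global_best, global_fit = list(positions[i]), fit
    return global_best, global_fit

def toy_fitness(position):
    # Hypothetical stand-in for "train BERT-CRF and score it on validation data".
    return -sum((x - 0.3) ** 2 for x in position)

print(pso_maximize(toy_fitness, [(0.0, 1.0), (0.0, 1.0)], rng=random.Random(0)))
```

In the setting of this paper, each fitness evaluation would correspond to one training run of the BERT-CRF model with the hyper-parameter values encoded by the particle's position, so the swarm size and the number of iterations directly determine the computational cost.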
The experiments are conducted on three different Chinese datasets in a Python 3.7.10 environment on a computer running the Ubuntu 19.10 operating system. In particular, to evaluate the performance of our proposed model, we compare it with several traditional and popular models.
Datasets and data preprocessing
A total of three Chinese datasets are utilized in our experiment. The first is the Boson dataset,1
More specifically, the two datasets Boson and People’s Daily are benchmark datasets commonly used in NER tasks. The inquiry data of the steel industry comes from the historical inquiry information of customers on a steel e-commerce platform, so it contains a lot of dirty data, such as noise and missing values. To ensure the reliability of the data, we use mathematical models and expert databases to clean it. In this process, the mathematical model is used to extract useful and important attribute values from the original data, and the expert database is used to verify the data. Through the combination of mathematical models and expert databases, we obtain a relatively reliable NER corpus related to the steel industry. Furthermore, we adopt the BIO annotation method mentioned in Section 3.2 to annotate the named entities of all datasets.
Evaluating whether a named entity is correctly identified mainly involves two aspects: whether the type of the entity is correctly labeled and whether the boundary of the entity is correctly recognized. The former is often called type, and the latter is often called text. There are accordingly three main kinds of errors: the entity type is labeled correctly (the main entity words and part-of-speech labels contained in the named entity are correctly tagged) but the entity boundary is recognized incorrectly; the entity boundary is recognized correctly but the entity type is labeled incorrectly; and both the entity type and the entity boundary are wrong. Therefore, defining whether a named entity is correctly recognized is not simple. Different systems and fields have different focuses, and their definitions may differ. In any case, according to the combination of the true class of a sample and the class predicted by the classifier, the results are divided into the four situations in Table 2 [37], which are defined as follows.
TP (True Positive): entities in the test set that are correctly identified by the model. FP (False Positive): entities identified by the model that do not exist in the test set. TN (True Negative): items that are not entities in the test set and that the model also recognizes as non-entities. FN (False Negative): entities in the test set that are not correctly identified by the model.
Table 2. Confusion matrix of classification results
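The performance of NER for a specific system and field is generally evaluated by calculating the corresponding Precision, Recall, and F1-score.
Precision (P) is the proportion of the entities identified by the model that are correct. Recall (R) is the proportion of the entities in the test set that are correctly identified by the model.
In addition, they can be expressed by the following formula.
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times P \times R}{P + R}$$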
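As an illustration of how these quantities follow from the TP, FP and FN counts defined above, the sketch below extracts entities from BIO label sequences and computes Precision, Recall and their harmonic mean. It assumes a strict criterion in which an entity counts as correct only if both its boundary and its type match, and it assumes that the third metric is the F1-score. The function names and entity types are hypothetical.

```python
def extract_entities(labels):
    """Return a set of (start, end, type) spans from a BIO label sequence."""
    entities, start, etype = set(), None, None
    # A trailing "O" closes an entity that runs to the end of the sentence.
    for i, label in enumerate(list(labels) + ["O"]):
        inside = label.startswith("I-") and etype == label[2:]
        if start is not None and not inside:
            entities.add((start, i, etype))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return entities

def precision_recall_f1(gold_labels, pred_labels):
    gold = extract_entities(gold_labels)
    pred = extract_entities(pred_labels)
    tp = len(gold & pred)   # boundary and type both correct
    fp = len(pred - gold)   # predicted but not in the gold annotation
    fn = len(gold - pred)   # annotated but missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(precision_recall_f1(gold, pred))
```

In this example the person entity is matched exactly, while the last entity has the correct boundary but the wrong type, so it counts once as a false positive and once as a false negative.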
We use eight existing models (i.e., HMM [14], CRF [7], CNN-CRF [38], BiRNN-CRF [39], BiLSTM [40], BiLSTM-CRF [26], BERT-CRF [41], BERT-BiLSTM-CRF [28]) and our BERT-CRF-PSO to implement experiments on three Chinese datasets.
The authors of the BERT model mentioned in their paper that “for fine-tuning, most model hyper-parameters are the same as in the phase of pre-training, in addition to
It can be seen from the above that the values of these three hyper-parameters will greatly affect the performance of the model. Therefore, we choose these three hyper-parameters as the optimization objects in our experiment.
To give our model better generalization ability, we adopt some strategies to avoid overfitting. For example, we use dropout to randomly drop neurons during learning, so that a different sub-network is trained at each step. At the same time, we use a warmup strategy for the learning rate, that is, we select a smaller learning rate at the beginning of training and switch to the preset learning rate after a period of training. Additionally, to better evaluate the generalization ability of the model, we divide each dataset into a training set, a validation set and a test set according to a certain ratio. The training set is used for parameter learning, the validation set is used to evaluate the hyper-parameters, and the test set is used to evaluate the generalization ability. Therefore, all results reported in our experiments are those on the test set.
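The warmup strategy and the three-way data split can be sketched as follows. The linear shape of the warmup and the 8:1:1 split ratio are assumptions chosen for illustration, since the text specifies neither the warmup schedule nor the ratio, and the function names are hypothetical.

```python
import random

def warmup_learning_rate(step, base_lr, warmup_steps):
    """Linearly increase the learning rate, then hold the preset value."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps

def split_dataset(samples, train_ratio=0.8, valid_ratio=0.1, seed=0):
    """Shuffle and split samples into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_valid = int(len(shuffled) * valid_ratio)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

print([warmup_learning_rate(s, 1.0, 4) for s in range(6)])
train, valid, test = split_dataset(range(10))
print(len(train), len(valid), len(test))
```

Only the validation portion would be used when scoring candidate hyper-parameter values, so that the test set remains untouched until the final evaluation.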
Additionally, we conduct a comparative experiment of fine-tuning two hyper-parameters (i.e.,
The comparison results of the nine models we used on three different Chinese datasets are shown in Table 3. Neural network-based models are randomly initialized resulting in different experimental results from run to run, so all results are averaged over three runs. In addition, the last column of Table 3 (
Table 3. Comparison results of different models on different Chinese datasets
(a) represents fine-tuning two hyper-parameters, (b) represents fine-tuning three hyper-parameters.
Compared with other existing models, our proposed model introduces the PSO algorithm, which enables the particle swarm to automatically fine-tune the hyper-parameters within a given search space (i.e., hyper-parameter range) to find their optimal values. In this way, the performance of the model on a specific dataset can reach the global optimum within that search space. It is worth noting that the hyper-parameters of the other models are set manually based on the results of multiple experiments and past experience. The experimental results in Table 3 imply that manual adjustment of hyper-parameters is not only time-consuming and laborious, but also prone to falling into a local optimum, whereas the model we propose can obtain the global optimal solution through cooperation and information sharing among the particles. The results in Table 3 confirm the superiority of our model. Meanwhile, the last column of Table 3 shows that our proposed model has the smallest standard deviation on the different datasets, which indicates that it is the most stable of the compared models. Moreover, if we increase the number of hyper-parameters that are fine-tuned in Algorithm 3.3, the performance improvement of our model becomes more apparent, as confirmed by the comparison of BERT-CRF-PSO (a) and BERT-CRF-PSO (b) in Table 3. Therefore, we conclude that the BERT-CRF-PSO model can assist us in automatically finding the global optimal hyper-parameters for a specific dataset, thereby significantly improving the performance of entity extraction.
By introducing the idea of collaborative computing, we present a novel model called BERT-CRF-PSO to complete the NER task. Then, three modules, including BERT model, CRF layer and PSO algorithm, are incorporated into our model. Here, the BERT model is used to pre-train the language model to acquire the corresponding word vector, the CRF layer learns constraints from the training data and adds these constraints to the final predicted label to make sure that they are legal and correct, and the PSO algorithm is in charge of fine-tuning the hyper-parameters in a given search space to automatically find the optimal value of the hyper-parameter in a cooperative way. Although these three modules play different roles, they share information and cooperate with each other to accomplish NER tasks together. Therefore, through our architecture, the experimental results for different datasets can reach the global optimum. The experimental results in Table 3 verify the effectiveness of our model and improve the
Acknowledgments
This work was supported in part by the Beijing Natural Science Foundation under Grants 19L2029 and M21032, in part by the National Natural Science Foundation of China under Grants 81961138010 and U1836106, in part by the Scientific and Technological Innovation Foundation of Foshan under Grants BK20BF010 and BK21BF001, and in part by the Fundamental Research Funds for the University of Science and Technology Beijing under Grant FRF-BD-19-012A.
Conflict of interest
The authors declare no conflict of interest.
