Collaborative optimization with PSO for named entity recognition-based applications

Abstract

Named entity recognition (NER) is a crucial technology that is widely used in many application scenarios, including information extraction, information retrieval, text summarization, and machine translation, in support of AI-based smart communication and networking systems. As NER has attracted increasing attention, it has gradually become an independent and important research field. Currently, most NER models require their hyper-parameters to be tuned manually, which is not only time-consuming and laborious but also prone to settling on a locally optimal configuration. To deal with this problem, this paper proposes a machine learning-guided model for NER in which the hyper-parameters are adjusted automatically to improve performance. Specifically, the proposed model is implemented using bi-directional encoder representation from transformers (BERT) and a conditional random field (CRF). The collaborative computing paradigm is also incorporated into the model: particle swarm optimization (PSO) is used to search automatically, in a collaborative way, for the best values of the hyper-parameters. The experimental results demonstrate the satisfactory performance of the proposed model.

Keywords

Named entity recognition (NER); particle swarm optimization (PSO); bi-directional encoder representation from transformers (BERT); collaborative optimization

1. Introduction

With the wide spread of electronic texts on the Internet, the massive and complicated information available poses severe challenges to information acquisition. People therefore urgently need automated tools to help them mine knowledge from such huge amounts of information. Technologies such as information extraction, information retrieval, text summarization and machine translation have emerged against this background. Named entity recognition (NER) is a significant component of these technologies. The goal of NER is to identify entities with special meanings in text, such as dates, locations, persons, organizations and other proper nouns, and to attach the corresponding labels to those entities, in order to facilitate subsequent information extraction [1]. As NER has attracted increasing attention, it has gradually become an independent and important research field. However, NER still faces the following challenges.

•
Since named entities have the characteristics of diversity, complexity, and randomness, it is difficult for people to clearly define and classify their entity types.
•
The length of the named entity is uncertain. For example, for entities such as institutions, their length varies greatly. The abbreviations of some entities only have two characters. However, the full names of some entities can reach dozens of characters.
•
In real industrial applications, large-scale knowledge bases like Wikipedia are lacking. Therefore, it is very challenging to acquire a considerable amount of annotated data.

To deal with the above problems, a number of supervised, semi-supervised and unsupervised machine learning algorithms have been proposed [2, 3, 4]. To further enhance the accuracy of NER, we develop a novel model that combines a popular deep learning algorithm with a stochastic optimization technique based on population cooperation. It is worth mentioning that, to ensure that a learning model fits a specific dataset well, many important hyper-parameters of the model need to be adjusted when carrying out the NER task. Traditional NER models require a lot of time for adjusting these hyper-parameters manually. With our proposed model, this operation can instead be performed automatically by a search strategy.

Generally speaking, collaborative computing is a very efficient architecture in which individuals work together in a collaborative manner to achieve the same goal. In other words, collaborative computing allows different individuals in a decentralized state to collaborate and share information in order to fulfill an assignment together. On the one hand, collaboration and communication between individuals in different fields are promoted more effectively; on the other hand, the work quality and efficiency of the whole group are greatly improved. Particle swarm optimization (PSO), an evolutionary computing technology [5], fits this paradigm: it solves stochastic optimization problems through information sharing and cooperation between the particles in a group. Therefore, by introducing collaborative computing into our model, the PSO algorithm is used to automatically find the optimal values of the hyper-parameters in a collaborative manner.

Specifically, we present a novel model for NER that is made up of bi-directional encoder representation from transformers (BERT), a conditional random field (CRF) and PSO. This model is called BERT-CRF-PSO. It contains three sub-modules: the BERT model, the CRF layer and the PSO algorithm. The BERT model takes advantage of the encoder architecture of the transformer to obtain semantic vectors, and combines the masked language model (MLM) and next sentence prediction (NSP) objectives to capture word-level and sentence-level representations, respectively, so as to achieve truly contextual prediction [6]. The CRF layer makes effective use of contextual labels to predict the current label, yielding more precise conditional probabilities for named entity predictions [7]. In addition, we integrate the PSO algorithm into the BERT-CRF model to automatically fine-tune some important hyper-parameters. In short, the BERT model provides pre-trained representations of the text, the CRF layer adds appropriate constraints to the final predicted labels, and the PSO algorithm is responsible for tuning the hyper-parameters. Although these three modules play different roles, they cooperate with each other to accomplish NER tasks together. With the help of this model, the performance of NER is expected to improve accordingly.

The contributions of this paper can be summarized as the following two aspects.

•
By introducing a collaborative computing strategy into NER, named entities can be extracted effectively through information sharing and mutual cooperation.
•
By combining the PSO algorithm with the BERT-CRF model, some important hyper-parameters are fine-tuned automatically, which makes it convenient to search for their globally optimal values. In this way, the performance of the entire model on NER is improved.

The remainder of this paper is arranged as follows. In Section 2, related work on NER is presented. In Section 3, we introduce the proposed BERT-CRF-PSO model in detail. In Section 4, comparative experiments with different models on different Chinese datasets are shown. Finally, our work is summarized in Section 5.

2. Background

NER has always been a central issue in the field of natural language processing (NLP). The earliest methods were based on rules and vocabularies, followed by methods based on traditional machine learning. In recent years, methods based on deep learning have become particularly popular and widely used. The development trend of NER is diagrammed in Fig. 1. The following subsections analyze some related algorithms and models for NER.

Figure 1.

The research progress of NER.

2.1 The rule and vocabulary-based NER

The basic idea of rule and vocabulary-based methods is to manually construct specific rules and vocabularies from selected features, e.g., keywords, statistical information, punctuation, position words and direction words, and then to realize NER by means such as string matching and pattern matching.

Combining heuristic thinking and handcrafted rules, Xie et al. developed a model for extracting named entities from unstructured text [8]. However, the proposed method is not easy to extend to other entity types or datasets, so it cannot adapt to changes in the data. Akhondi et al. constructed a system that automatically identifies chemicals in text by using dictionaries and grammars [9]. Specifically, they brought vocabulary resources that provide chemical structure information, the grammar-based LeadMine tool, and regular expressions into the task of NER for chemicals. The resulting system reached an $F_{1}$ -Score ( $F_{1}$ ) of 77.80%, better than any single system considered by the authors. Farmakiotou et al. implemented a NER system for Greek text by introducing manually built vocabulary resources [10], and the test results on a Greek corpus of financial news were satisfactory. In detail, the system automatically extracted three types of named entities from the corpus, namely organization, person, and location, for which it achieved $F_{1}$ values of 86.90%, 81.60% and 82.40%, respectively. Collins et al. developed a method called DL-CoTrain [11]. The method has three basic steps: 1) manually predefining a set of seed rules, 2) repeatedly performing unsupervised training on this rule set to obtain more rules, and 3) using the final rule set to extract named entities. The accuracy of this method exceeded 91% for each of the three categories of organization, person, and location.

In short, this type of method depends heavily on rules written manually by linguistic experts, with each rule given a certain weight. When rules conflict, the rule with the highest weight is selected to determine the type of the named entity. Hence, such methods achieve satisfactory results only when the established rules accurately reflect the language characteristics of a given field. In general, this type of method has the following shortcomings.

•
It relies too heavily on specific text styles, domains and languages, so formulating rules is time-consuming and it is difficult to cover all languages. In addition, it is particularly prone to errors.
•
For different fields and systems, linguistic experts need to rewrite the rules, resulting in poor portability of the system.
•
There are other issues, e.g., a long system construction period and high cost.

It is for the above reasons that traditional machine learning-based methods have been developed.

2.2 The traditional machine learning-based NER

In the field of traditional machine learning, NER is usually regarded as a sequence tagging task, in which a tagging model is learned from a large corpus and then used to assign an appropriate tag to each position of a sentence. For this task, several algorithms based on traditional machine learning have been adopted, mainly support vector machine (SVM) [12], CRF [7], maximum entropy (ME) [13] and hidden Markov model (HMM) [14].

The advantages and disadvantages of the above four traditional machine learning algorithms are as follows.

•
SVM can address high-dimensional and nonlinear problems, but it is difficult to train on large-scale samples and difficult to extend to multi-class problems.
•
CRF can produce a flexible, globally optimal labeling model for NER. Unfortunately, it converges slowly, and training is time-consuming.
•
ME has the merits of good versatility and compact structure. However, ME needs to perform normalization calculations on the data, resulting in relatively large additional overhead.
•
HMM is combined with the Viterbi algorithm to recognize named entities in a sequence [15], so its training and recognition speeds are fast. However, because the HMM is memoryless, it cannot make good use of contextual information, which results in a poor recognition effect.

Sobhana et al. used CRF to implement a NER system for geological text [16]. The system uses the contextual information of words to predict the characteristics of various named entities. The training set used by the researchers contains more than 200,000 words and covers 17 types of labels, for which the system reached an overall $F_{1}$ of 75.80%. Chieu and Ng developed an English named entity recognizer using ME [17]. They used ME to exploit the global information of the document, and the best results obtained on MUC-6 and MUC-7 were 93.27% and 87.24%, respectively. Ekbal and Bandyopadhyay utilized SVM to build a NER system for Hindi and Bengali [18]. Their experiments demonstrated precision, recall and $F_{1}$ of 74.34%, 80.23% and 77.17% for Hindi and 80.12%, 88.61% and 84.15% for Bengali. Zhou and Su proposed an HMM-based chunk tagger to implement NER tasks [19]. The entity types recognized by this model include names, times, and numerical quantities. They modified the traditional HMM formulation to integrate more information. In the English NER tasks of MUC-6 and MUC-7, the $F_{1}$ reached 96.60% and 94.10%, respectively.

Although the traditional machine learning-based methods require a manually labeled corpus for training, they do not demand a large amount of linguistic knowledge, and the requirements on linguistic experts are lower. Hence, systems can be built in a shorter time than with the rule and vocabulary-based methods, and they are more portable. However, these methods place high demands on feature selection. Table 1 shows some features commonly used in traditional machine learning-based methods. When we use those methods to accomplish NLP tasks, we need to choose from the original data a series of features that affect the target task and add them to the feature vector. For these reasons, many scholars have begun to explore NER methods based on deep learning.

Table 1
Commonly used features in traditional machine learning-based NER

Feature category	Features
Domain knowledge	Lexicons, existing NER tools
Context	Windows, conjunctions
Linguistic	Lemmatization, stemming, chunking, syntactic parsing
Orthographic	Capitalization, symbols
Morphological	n-gram character, n-gram word, suffixes, and prefixes

2.3 The deep learning-based NER

Deep learning-based computing paradigms have enabled us to process massive amounts of data efficiently [20, 21], and deep learning-based NER methods have accordingly attracted more attention. Compared with the previous two types of method, the learning models constructed in this way have the following benefits [22, 23, 24].

•
With the help of nonlinear activation functions, they can learn from the original dataset latent knowledge that better reflects the characteristics of the text.
•
These learning models are typically end-to-end. This implies that they avoid the problem of error propagation between modules and can obtain satisfactory experimental results.
•
A large number of experiments have shown that these learning models are very good at solving sequence tagging problems, because they do not rely on feature engineering or domain knowledge.

Chiu et al. designed a novel neural network structure combining bi-directional long short-term memory (BiLSTM) and convolutional neural network (CNN) [25]. This structure is end-to-end and requires little feature engineering. Most importantly, it can automatically detect word-level and character-level features. For medical NER, Xu et al. designed a BiLSTM-CRF model [26] that learns informative features from a given dataset. Experiments on the NCBI disease corpus, one of the benchmark datasets for evaluation, showed that their method achieved an $F_{1}$ of 80.22%, better than many widely used baseline methods. Ma and Hovy developed a novel neural network scheme called BiLSTM-CNN-CRF [27], which automatically benefits from both word-level and character-level representations. The scheme achieved strong results on the datasets of two sequence tagging tasks and can be applied to a wider range of sequence tagging tasks. Jiang et al. used the BERT-BiLSTM-CRF model to realize NER on Chinese electronic medical records [28]. In their model, the BERT module generates word vectors based on context information, and the BiLSTM network is combined with the CRF layer to further train on the word vectors. The experimental results on the CCKS 2017 dataset show that the BERT-BiLSTM-CRF model achieves better performance than other baseline models.

Although the method proposed in this paper is also a deep learning method, it differs from the above methods in that the hyper-parameters of the model are fine-tuned automatically by introducing the PSO algorithm and the idea of collaborative computing. In this way, the drawbacks of manual hyper-parameter tuning, namely that it is time-consuming and prone to falling into a local optimum, are alleviated, which makes the deep learning model easier to use for entity extraction.

3. Methodology

The overall framework of our model is demonstrated in Fig. 2, where “Text” refers to the input sentence of the model, [CLS] is a special symbol that is added at the beginning of each input sentence, i.e., the first token of each input sentence is [CLS], $\textit{Tok}_{i}$ indicates the $i$ -th character, $N$ represents the number of characters in the input, ${E}_{i}$ means the $i$ -th input embedding, ${C}$ is the representation vector used for subsequent classification tasks, ${T}_{i}$ represents the contextual representation of the $i$ -th token, and $\textit{Tag}_{i}$ is the label of the $i$ -th character. Here, $i$ is restricted to $[1,N]$ .

Figure 2.

The architecture of our proposed model.

From Fig. 2, we can see that this model is an end-to-end model (i.e., a deep learning model), which obtains the target output directly from the original input data. Therefore, it does not rely on the features listed in Table 1. In addition, as can be seen from Fig. 2, the model contains three sub-modules: the BERT model, the CRF layer and the PSO algorithm. The BERT model uses the encoder architecture of the transformer to acquire semantic vectors, and combines the MLM and NSP objectives to capture word-level and sentence-level representations, respectively, so as to achieve truly contextual prediction. The CRF layer learns constraints from the training data and applies them to the final predicted labels to make sure that the label sequences are legal. With the constraints learned by the CRF layer, the number of invalid predicted label sequences can be significantly reduced. In addition, we integrate the PSO algorithm into the BERT-CRF model to automatically fine-tune some important hyper-parameters. These three sub-modules cooperate with each other and share information to accomplish NER tasks together. In this section, we present the principles of these three sub-modules in detail.

3.1 The description of BERT model

The goal of the BERT model is to take advantage of large-scale unlabeled corpus for training, so as to gain a representation of the text that contains rich semantic information. In order to achieve this goal, BERT adopts the encoder architecture in the transformer to extract features, and combines the self-attention mechanism to make each word possess global semantic information, to obtain the best contextual representation of each word. In addition, BERT uses two methods, including MLM and NSP, to capture word-level and sentence-level representations, respectively. Next, we introduce in detail the internal mechanism of BERT and other models it uses.

3.1.1 The structure of BERT

The transformer consists of two parts: an encoder and a decoder. BERT adopts only the encoder of the transformer, but its structure is deeper: the transformer’s encoder contains 6 encoder blocks, whereas BERT-base contains 12 and BERT-large contains 24. The encoder structure of the transformer is shown in Fig. 3 [29]. As Fig. 3 shows, the encoder consists of 6 identical encoder modules, each of which contains two sub-layers. The first is a multi-head attention mechanism, and the second is a feed-forward neural network. Each sub-layer is wrapped in a residual connection followed by layer normalization, namely Add&Norm. It is worth mentioning that multi-head attention means that the model computes several attention functions in parallel. Each attention head has its own role and focuses on different information in the input, and finally the outputs of all heads are concatenated. Moreover, BERT uses only the encoder of the transformer because BERT is a pre-training model: rather than being built for one concrete NLP task, it is trained with language modeling objectives. At the same time, since BERT does not employ the transformer’s decoder, it also avoids the extra attention operations that the decoder requires.

Figure 3.

The encoder structure of transformer.

3.1.2 The input embedding of BERT

The input of BERT can be a sentence pair (sentence A and sentence B) or a single sentence. Taking a single sentence A “she likes traveling” as an example, the input embedding of BERT is shown in Fig. 4. In Fig. 4, token embedding refers to the word embedding obtained through training. Segment embedding is used to indicate which sentence each token belongs to (for example, ${E}_{A}$ indicates sentence A). In other words, the role of segment embedding is to let the model tell the two sentences apart: every token of the first sentence is assigned segment 0 and every token of the second sentence is assigned segment 1, so that the model can judge where each sentence starts and ends. Position embedding is a learned encoding of the position of each token. Finally, token embedding, segment embedding and position embedding are added together to obtain the input embedding of BERT.
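The element-wise sum of the three embeddings can be illustrated with a few lines of Python. This is only a toy sketch: the embedding size, the random lookup tables (standing in for learned parameters), the six-entry vocabulary and the helper `make_table` are assumptions made for the example and are not taken from BERT itself.

```python
import random

rng = random.Random(0)
DIM = 4  # toy embedding size (an assumption; BERT-base uses 768)

def make_table(rows, dim=DIM):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

# Hypothetical toy vocabulary for the single sentence A.
vocab = {"[CLS]": 0, "she": 1, "likes": 2, "travel": 3, "##ing": 4, "[SEP]": 5}
token_table = make_table(len(vocab))
segment_table = make_table(2)    # row 0: sentence A, row 1: sentence B
position_table = make_table(16)  # one row per position, toy maximum length

tokens = ["[CLS]", "she", "likes", "travel", "##ing", "[SEP]"]
segment_ids = [0] * len(tokens)  # single sentence A, so every segment id is 0

# Input embedding = token embedding + segment embedding + position embedding.
input_embedding = [
    [t + s + p for t, s, p in zip(token_table[vocab[tok]],
                                  segment_table[seg],
                                  position_table[pos])]
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids))
]
```

Each row of `input_embedding` is the vector fed to the first encoder block for the token at that position.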

Furthermore, BERT applies the WordPiece method to construct its dictionary. WordPiece divides words into sub-words: if a word is not found in the dictionary, it is split into sub-words piece by piece. For example, “traveling” in Fig. 4 becomes “travel” and “#ing”. If no corresponding token is found for a certain sub-word, the sub-word is directly marked as [unknown]. Constructing the dictionary with WordPiece not only captures the root information of words effectively, but also reduces the dictionary size to a certain extent.
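The splitting step can be sketched as a greedy longest-match-first lookup. The snippet below is a minimal illustration and not the tokenizer shipped with BERT: the function name `wordpiece_split`, the four-entry vocabulary, the `##` prefix for continuation pieces (the convention of the released BERT vocabularies) and the `[UNK]` symbol are assumptions of the example. In this sketch a word is mapped to the unknown symbol as a whole as soon as one of its pieces cannot be matched.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary for the running example.
toy_vocab = {"she", "likes", "travel", "##ing"}
tokens = [piece
          for word in "she likes traveling".split()
          for piece in wordpiece_split(word, toy_vocab)]
```

With this toy vocabulary, "traveling" is not found as a whole, so the longest matching prefix "travel" is taken first and the remainder is matched as a continuation piece.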

Figure 4.

The input embedding of BERT.

3.1.3 The MLM model

Generally speaking, a conventional language model can only be trained from left to right or from right to left, whereas BERT is a multi-layer model trained bidirectionally. BERT therefore adopts the MLM objective to overcome the limitation of using information from only one direction. MLM randomly masks or replaces some of the words in a sentence and then lets the model predict the masked or replaced words from their context. Concretely, BERT randomly selects 15% of the tokens in the dataset during training and predicts only these tokens. In effect, MLM learns the relationships between words. Through this method, the BERT model can draw on more contextual information about words for prediction.

An obvious flaw of this approach is that the [MASK] token never appears in the fine-tuning stage, which leads to a mismatch between the pre-training and fine-tuning phases. Therefore, for the 15% of tokens selected for masking, BERT applies the following three treatments to alleviate this problem:

•
In 80% of the cases, the word is replaced with the [MASK] mark. For example, I like apple $\rightarrow$ I like [MASK].
•
In 10% of the cases, the word is replaced with a random word. For example, I like apple $\rightarrow$ I like banana.
•
In the other 10% of the cases, the word is left unchanged. For example, I like apple $\rightarrow$ I like apple.
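The 80%/10%/10% rule above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each token is selected independently with probability 0.15 (rather than selecting a fixed 15% share per sequence), and the function name `mask_tokens`, the toy vocabulary and the repeated toy sentence are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                          # about 85% of tokens are left alone
        labels[i] = token                     # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = ["I", "like", "apple", "banana", "pear"]   # hypothetical toy vocabulary
sentence = ["I", "like", "apple"] * 200            # toy input, repeated for statistics
corrupted, labels = mask_tokens(sentence, vocab, rng)
```

Only positions with a non-empty label contribute to the MLM loss; all other positions pass through unchanged.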

3.1.4 Next sentence prediction

Traditional language models cannot directly capture the information between sentences. To solve this problem, BERT introduces the NSP task, whose purpose is to let the model better understand the relationship between sentences. The training corpus for this task can be generated by extracting sentence pairs (for example, sentence A and sentence B) from the dataset, where B is the actual next sentence of A with 50% probability and a random sentence from the dataset (i.e., not the next sentence of A) with 50% probability.

Here are two examples:

Input $=$ [CLS] My favorite sport is [MASK] [SEP] (A)

I often go to play [MASK] [SEP] (B)

Label $=$ B is the next sentence of A

Input $=$ [CLS] My favorite sport is [MASK] [SEP] (A)

I like to eat [MASK] [SEP] (B)

Label $=$ B is not the next sentence of A

Here, [CLS] and [SEP] are two special symbols: the output at the former is used for classification, and the latter is used to separate non-contiguous token sequences.
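The construction of NSP training pairs described above can be sketched as follows. The function name `make_nsp_pairs`, whitespace tokenization and the four-sentence corpus are assumptions of this example, and masking is omitted to keep the sketch short.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build ([CLS] A [SEP] B [SEP], is_next) examples from an ordered sentence list."""
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            b, is_next = sentences[i + 1], True   # B is the true next sentence
        else:
            # B is a random sentence that is neither A nor A's real successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            b, is_next = rng.choice(candidates), False
        tokens = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
        pairs.append((tokens, is_next))
    return pairs

# Hypothetical toy corpus; at least three sentences are needed for negative sampling.
corpus = [
    "my favorite sport is tennis",
    "i often go to play tennis",
    "i like to eat apples",
    "apples are sweet",
]
pairs = make_nsp_pairs(corpus, random.Random(0))
```

Each pair carries a Boolean label that the classifier on top of the [CLS] output is trained to predict.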

In short, through the above important modules or mechanisms, BERT can simultaneously use the context-related information of the current word for feature extraction and use the dynamic context-related information to adjust the word vector accordingly. In addition, if we make use of the pre-trained BERT model to achieve downstream tasks, we only need to load it as the word embedding layer of the current task, and then build other structures after the BERT model for our own tasks, without much modification or optimization of the code.

3.2 The design of CRF layer

NER is essentially a sequence tagging task, i.e., the task of assigning an appropriate label to each word or character according to its context. Hence, to make our classifier perform better, we can try to exploit the labels of adjacent positions when tagging the data. This is very difficult for traditional classifiers. Fortunately, CRF is very suitable for exploiting contextual information, because a significant feature of CRF is that it uses a log-linear function to model the probability of the whole label sequence given the input. In this way, the contextual labels can be used effectively to predict the current label. More importantly, CRF is a representative sequence tagging algorithm: for a given input sentence sequence $\textbf{X}=\{x_{1},x_{2},\ldots,x_{n-1},x_{n}\}$ , the model outputs the corresponding sequence $\textbf{Y}=\{y_{1},y_{2},\ldots,y_{n-1},y_{n}\}$ . When implementing the sequence tagging task, the CRF layer can learn some constraints from the training dataset and apply them to the final predicted labels to make sure that the label sequences are legal. For example, suppose the dataset uses the BIO (beginning, inside, outside) tagging scheme, in which the first token of an entity is marked as B-[type], the tokens in the other positions of the entity are marked as I-[type], and a token that does not belong to any entity is marked as O. Then the constraints learned by the CRF layer are as follows.

•
In a sentence, the label of the first word should be “O” or “B-”, not “I-”.
•
The legal label pattern that the CRF layer can learn is “B- $\text{type}_{1}$ I- $\text{type}_{2}$ $\ldots$ I- $\text{type}_{n}$ ”, where $\text{type}_{1}$ , $\text{type}_{2}$ , $\ldots$ , $\text{type}_{n}$ must all be the same entity type. For example, “B-Person I-Person I-Person” is a correct tag sequence, whereas “B-Location I-Person I-Person” is an incorrect one.
•
A tag sequence such as “O I- $\ldots$ ” is incorrect, because an entity must start with “B-”. A correct tag sequence is “O B- I-”.
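The three constraints can be written down as a small validity check. Note that this is only an illustration of the rules: a CRF layer learns them as soft transition scores from data rather than as hard-coded conditions, and the function name `is_valid_bio` is hypothetical.

```python
def is_valid_bio(tags):
    """Check a tag sequence against the BIO constraints listed above."""
    previous = "O"  # treat the sentence start like an outside tag
    for tag in tags:
        if tag.startswith("I-"):
            # I-type must continue an entity of the same type (B-type or I-type),
            # so it may follow neither "O" nor a tag of a different type.
            if previous == "O" or previous[2:] != tag[2:]:
                return False
        previous = tag
    return True

examples = {
    "correct": ["B-Person", "I-Person", "I-Person"],
    "type mismatch": ["B-Location", "I-Person", "I-Person"],
    "inside after outside": ["O", "I-Person"],
}
results = {name: is_valid_bio(tags) for name, tags in examples.items()}
```

Treating the sentence start as "O" covers the first constraint, since a leading "I-" tag is then rejected by the same test as an "I-" tag after "O".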

Figure 5 shows the structure of CRF [7]. From this figure, we can see that two kinds of feature functions are involved in the CRF layer: the state feature function and the transition feature function. The state feature function is used to calculate the state score, and the transition feature function is used to calculate the transition score. The former only considers which entity label the character at the current position can take, while the latter considers which combinations of entity labels the characters at the current position and its adjacent positions can have.

Figure 5.
The structure of CRF.

Suppose $\textbf{X}=\{x_{1},x_{2},\ldots,x_{n-1},x_{n}\}$ is the input sentence sequence, and $\textbf{Y}=\{y_{1},y_{2},\ldots,y_{n-1},y_{n}\}$ is the corresponding target label sequence. Let $P(\textbf{Y}|\textbf{X})$ denote the linear-chain conditional probability, which can be expressed by Eq. (1).

$\displaystyle P(\textbf{Y}|\textbf{X})=\frac{1}{Z}\exp\left(\sum_{i=1}^{n-1}\sum_{k=1}^{K}\lambda_{k}t_{k}(y_{i+1},y_{i},\textbf{X},i)+\sum_{i=1}^{n}\sum_{l=1}^{L}u_{l}s_{l}(y_{i},\textbf{X},i)\right),$ (1)

where $i$ denotes the position of the current node, $K$ refers to the total number of transition feature functions, $L$ represents the total number of state feature functions, $Z$ is a normalization factor, and $u_{l}$ and $\lambda_{k}$ are weights. In addition, $s_{l}$ and $t_{k}$ are the state feature function and the transition feature function, respectively.
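To make Eq. (1) concrete, the sketch below evaluates it for a three-token toy sentence. All numbers are hypothetical: `transition[a][b]` stands for the weighted sum $\sum_{k}\lambda_{k}t_{k}$ of a move from label `a` to label `b`, and `state[i][y]` stands for $\sum_{l}u_{l}s_{l}$ at position $i$. The normalization factor $Z$ is computed by brute-force enumeration, which is only feasible for toy sizes; practical implementations use dynamic programming instead.

```python
import itertools
import math

LABELS = ["O", "B-Person", "I-Person"]

# Hypothetical lumped scores (assumptions of this example, not learned values).
transition = {
    "O":        {"O": 0.5, "B-Person": 0.2, "I-Person": -4.0},
    "B-Person": {"O": 0.0, "B-Person": -1.0, "I-Person": 1.5},
    "I-Person": {"O": 0.3, "B-Person": -1.0, "I-Person": 0.8},
}
state = [
    {"O": 0.1, "B-Person": 1.2, "I-Person": -2.0},
    {"O": 0.2, "B-Person": 0.1, "I-Person": 1.0},
    {"O": 1.5, "B-Person": 0.0, "I-Person": 0.1},
]

def sequence_score(labels):
    """Exponent of Eq. (1): state scores plus transition scores."""
    score = sum(state[i][y] for i, y in enumerate(labels))
    score += sum(transition[a][b] for a, b in zip(labels, labels[1:]))
    return score

def all_sequences():
    """Every possible label sequence for the toy sentence."""
    return itertools.product(LABELS, repeat=len(state))

def probability(labels):
    """P(Y|X), with Z obtained by summing over all label sequences."""
    z = sum(math.exp(sequence_score(y)) for y in all_sequences())
    return math.exp(sequence_score(labels)) / z

best = max(all_sequences(), key=sequence_score)
```

Because of the strongly negative score for the transition from "O" to "I-Person", label sequences that violate the BIO constraints receive very small probabilities.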

Nowadays, CRF has been applied in a range of sequence tagging tasks, e.g., NER, part-of-speech tagging and word segmentation [30, 31, 32].

3.3 The PSO-based collaborative optimization

The PSO algorithm is an evolutionary computing technology developed by Kennedy and Eberhart in 1995 [5]. It is derived from the simulation of a simplified social model and is inspired by observations of the activities of bird flocks. Through collaboration and information sharing between individuals, the population evolves from disorder to order in the solution space of the problem and finally acquires an optimal solution. The core idea of PSO is thus to simulate the foraging behavior of a flock of birds: through mutual cooperation, each bird evaluates whether the position it has found is the best so far and shares the best position with the entire flock.

At the beginning, the PSO algorithm initializes a random solution for each particle. It uses a fitness value to measure the quality of a solution and searches for the optimal solution iteratively, which is consistent with the idea of the simulated annealing (SA) algorithm [33]. More importantly, the PSO algorithm is simpler than the genetic algorithm (GA) [34], as it has none of the complicated “crossover” and “mutation” operations of the GA. The advantages of the PSO algorithm are therefore that it converges quickly, is easy to implement and does not require many parameters to be tuned. Nowadays, the PSO algorithm has been widely used in fields such as image processing and data mining [35, 36].

Finding the values of the hyper-parameters with which the model performs best on a specific dataset is in fact an optimization problem, and PSO is well suited to handling optimization problems. In PSO, the search space of the particles corresponds to the solution space of the problem, and every particle has three attributes: position, speed, and fitness. In this paper, the position represents the values of the hyper-parameters of the BERT model, the speed indicates the direction and magnitude of the hyper-parameter changes, the optimization function is the BERT-CRF model, and the fitness is evaluated by $F_{1}$ . We generally use the three indicators of Precision, Recall and $F_{1}$ to measure the performance of a learning model. Precision and Recall tend to constrain each other: raising Precision usually lowers Recall, and vice versa. $F_{1}$ is the harmonic mean of Precision and Recall, which weights the two equally and therefore reflects the overall performance comprehensively. Therefore, we choose $F_{1}$ as the fitness of the particles: the higher the $F_{1}$ , the better the solution found by the particle.
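As a small illustration of the fitness measure, the sketch below computes Precision, Recall and $F_{1}$ from sets of gold and predicted entities, with $F_{1}$ taken as the harmonic mean of the other two. Representing an entity as a (start, end, type) tuple, counting only exact matches as correct, and the function name `precision_recall_f1` are assumptions of this example; the evaluation script used in the experiments may differ.

```python
def precision_recall_f1(true_entities, predicted_entities):
    """Entity-level Precision, Recall and F1 with exact matching."""
    true_set, pred_set = set(true_entities), set(predicted_entities)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical gold and predicted entities as (start, end, type) tuples.
gold = [(0, 2, "Person"), (5, 7, "Location"), (9, 10, "Organization")]
pred = [(0, 2, "Person"), (5, 7, "Organization")]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

In this toy case one of the two predictions is correct and one of the three gold entities is found, so Precision is 1/2, Recall is 1/3 and $F_{1}$ is 0.4.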

First, PSO uses random functions to initialize the particle swarm so that each particle has an initial position and speed. Second, PSO uses the optimization function to compute the initial fitness of each particle. Then, the initial fitness value of each particle is taken as its individual optimal value, and the corresponding position is taken as its individual optimal position. At the same time, the best fitness value among all particles is regarded as the global optimal value, and the position corresponding to this value is regarded as the global optimal position of the swarm. In the process of iteratively searching for the optimal solution, each particle updates its position and speed by tracking two “extremums”: the historical optimal solution of the particle itself, denoted by $\overline{\rm pbest}$ , and the historical optimal solution of the particle swarm, denoted by $\overline{\rm gbest}$ . With these two optimal solutions, the particles update their speed and position according to Eqs (2) and (3). Suppose the information of the $i$ -th particle is recorded by a $D$ -dimensional vector, where $D$ is the number of hyper-parameters that need to be adjusted. The position can then be described as $\textbf{X}_{i}=(x_{i1},x_{i2},\ldots,x_{id},\ldots,x_{iD})^{\top}$ , and the speed can be described as $\textbf{V}_{i}=(v_{i1},v_{i2},\ldots,v_{id},\ldots,v_{iD})^{\top}$ . Then, we have:

$\displaystyle v_{id}^{k+1}=w\times v_{id}^{k}+c_{1}\times{\rm rand}_{1}^{k}\times(\overline{\rm pbest}_{id}^{k}-x_{id}^{k})+c_{2}\times{\rm rand}_{2}^{k}\times(\overline{\rm gbest}_{d}^{k}-x_{id}^{k}),$ (2)

$\displaystyle x_{id}^{k+1}=x_{id}^{k}+v_{id}^{k+1},$ (3)

where $v_{id}^{k}\,(1\leqslant d\leqslant D)$ refers to the speed of the $d$ -th hyper-parameter of the $i$ -th particle during the $k$ -th iteration, and $x_{id}^{k}$ indicates the position of the $d$ -th dimension of the $i$ -th particle during the $k$ -th iteration. $w$ is the inertia factor: the larger the $w$ , the stronger the global search ability and the weaker the local search ability. $c_{1}$ and $c_{2}$ are learning factors, which balance the convergence speed against the search effect. ${\rm rand}_{1}$ and ${\rm rand}_{2}$ are random values in $[0,1]$ . $\overline{\rm pbest}_{id}$ is the position of the individual extreme point of the $i$ -th particle in the $d$ -th dimension, and $\overline{\rm gbest}_{d}$ is the position of the global extreme point of the particle swarm in the $d$ -th dimension. Finally, the position and speed of each particle are restricted to a given range to prevent the particle swarm from leaving the search space.

According to Eq. (2), the update of the particle speed can be divided into three parts.

•

The first part represents the influence of the particle’s current speed on its flight. It acts as inertia and keeps the particle moving through the search space.

•

The second part is individual cognitive behavior, which reflects the particle’s own experience. In simple words, this is a vertical search: each particle passes through many iterations (generations) and takes a different hyper-parameter value in each of them. One of these values is the best that the particle has found so far, and $\overline{\rm pbest}$ records this historical optimum of the particle itself.

•

The third part is group cognitive behavior, which represents the influence of group experience on the flight trajectory of the particles and prompts them to move toward the best position found by the group. Therefore, $\overline{\rm gbest}$ is the historical optimum of all particles (i.e., of the population).
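
As an illustration of the three parts, consider a single dimension (say ${\rm batch\_size}$ ) with the settings $w=0.9$ and $c_{1}=c_{2}=2$ that are used in Section 4.3. The remaining numbers are hypothetical and chosen only for this worked example: $x_{id}^{k}=16$ , $v_{id}^{k}=1$ , $\overline{\rm pbest}_{id}^{k}=20$ , $\overline{\rm gbest}_{d}^{k}=24$ , ${\rm rand}_{1}^{k}=0.5$ and ${\rm rand}_{2}^{k}=0.25$ . Eqs (2) and (3) then give

$\displaystyle v_{id}^{k+1}=\underbrace{0.9\times 1}_{\text{inertia}}+\underbrace{2\times 0.5\times(20-16)}_{\text{individual cognition}}+\underbrace{2\times 0.25\times(24-16)}_{\text{group cognition}}=0.9+4+4=8.9,$

$\displaystyle x_{id}^{k+1}=16+8.9=24.9.$

In this example the particle is pulled toward both best positions and ends slightly beyond $\overline{\rm gbest}_{d}^{k}$ . The new position still lies inside the range $[8,32]$ used for ${\rm batch\_size}$ in Section 4.3, so no clipping would be needed.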

Furthermore, the parameter optimization by introducing PSO is shown in Algorithm 3.3.

Algorithm 3.3: The process of parameter optimization realized by introducing PSO.
1: Set the fitness: $F_{1}$ ;
2: Set the position vector (BERT model hyper-parameters): $[x_{1},x_{2},\ldots,x_{D}]$ ;
3: Set the maximum number of iterations: $k_{\rm max}$ ;
4: Initialize the number of particles $n$ and utilize random functions to initialize the position and speed of each particle;
5: Use the optimization function to initialize the $F_{1}$ of each particle;
6: Initialize the optimal parameter $\overline{\rm pbest}$ of each particle, and the optimal parameter $\overline{\rm gbest}$ of the particle swarm according to $F_{1}$ ;
7: Initialize the iteration number $k=1$ ;
8: Make use of Eqs (2) and (3) to update the position and speed of each particle;
9: Update the $F_{1}$ ;
10: Update the historically optimal parameters of each particle;
11: Update the globally optimal parameters of the population;
12: $k=k+1$ ;
13: if $k>k_{\rm max}$ then
14: Output the optimal position vector and corresponding $F_{1}$ values;
15: else
16: Go back to Step 8;
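
The loop of Algorithm 3.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pso`, the swarm size, the speed limit `v_frac`, and the toy fitness `toy_f1` (a cheap, hypothetical stand-in for training BERT-CRF and measuring $F_{1}$ on the validation set) are all introduced here for illustration only.

```python
import random


def pso(fitness, bounds, n_particles=10, k_max=30,
        w=0.9, c1=2.0, c2=2.0, v_frac=0.2, seed=0):
    """Maximize `fitness` over the box `bounds` using Eqs (2) and (3)."""
    rng = random.Random(seed)
    # Assumed speed limit: a fraction of each hyper-parameter's range.
    v_max = [v_frac * (hi - lo) for lo, hi in bounds]

    # Steps 4-6: random positions and speeds, initial fitness, pbest, gbest.
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[rng.uniform(-vm, vm) for vm in v_max] for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(k_max):  # Steps 7-16: perform k_max update rounds
        # Step 8: update speed (Eq. 2) and position (Eq. 3), clamped to range.
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                v = (w * vs[i][d]
                     + c1 * r1 * (pbest[i][d] - xs[i][d])
                     + c2 * r2 * (gbest[d] - xs[i][d]))
                vs[i][d] = max(-v_max[d], min(v_max[d], v))
                xs[i][d] = max(lo, min(hi, xs[i][d] + vs[i][d]))
        # Steps 9-11: re-evaluate fitness, then update pbest and gbest.
        for i in range(n_particles):
            f = fitness(xs[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], f
        g = max(range(n_particles), key=lambda i: pbest_fit[i])
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    return gbest, gbest_fit


def toy_f1(x):
    """Hypothetical smooth stand-in for validation F1; peaks at (16, 20, 5e-5)."""
    batch_size, training_epoch, learning_rate = x
    return (1.0 - ((batch_size - 16) / 24) ** 2
            - ((training_epoch - 20) / 20) ** 2
            - ((learning_rate - 5e-5) / 9e-5) ** 2)


bounds = [(8, 32), (10, 30), (1e-5, 1e-4)]
best, best_f1 = pso(toy_f1, bounds)
print(best, best_f1)
```

In a real run, `fitness` would train the model with the hyper-parameters encoded in the position and return its $F_{1}$ , so each evaluation is expensive and the swarm size and $k_{\rm max}$ have to be chosen with the training budget in mind.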

4. Experimental results and discussion

The experiments are conducted on three different Chinese datasets in a Python 3.7.10 environment on a computer running the Ubuntu 19.10 operating system. To evaluate the performance of our proposed model, we compare it with several traditional and popular models.

4.1 Datasets and data preprocessing

A total of three Chinese datasets are utilized in our experiment. The first is the Boson dataset,¹ which consists of six types of named entities, namely location, person, time, product, organization and company. The second is the People’s Daily dataset,² which has three types of named entities, namely person, institution and organization. The last is the inquiry data of the steel industry. It contains 744 records covering eight named entity classes: place of production, surface structure, specification, grade, weight, thickness, variety, and surface treatment. The Boson and People’s Daily datasets are both freely available online, whereas the inquiry data of the steel industry is the historical query information of customers on the steel e-commerce platform. To keep the amount of data in the different datasets roughly the same, we select 1000 records from each of the Boson and People’s Daily datasets.

¹ http://static.bosonnlp.com/dev/resource.

² http://www.ling.lancs.ac.uk/corplang/pdcorpus/pdcorpus.html.

More specifically, the Boson and People’s Daily datasets are benchmark datasets commonly used in NER tasks. The inquiry data of the steel industry comes from the historical inquiry information of customers on the steel e-commerce platform, so it contains a lot of dirty data, such as noise and missing values. To ensure the reliability of the data, we use mathematical models and expert databases to clean it: the mathematical model extracts useful and important attribute values from the original data, and the expert database is used to verify them. Through this combination, we obtain a relatively reliable NER corpus for the steel industry. Furthermore, we adopt the BIO annotation method mentioned in Section 3.2 to annotate the named entities of all datasets.

4.2 Evaluation system for NER

Evaluating whether a named entity is correctly identified mainly includes two aspects: whether the type of the entity is correctly labeled and whether the boundary of the entity is correctly recognized. The former is often called type, and the latter is often called text. Accordingly, there are three main kinds of errors: (i) the entity type is correctly labeled (the main entity words and part-of-speech labels contained in the named entity are correctly tagged) but the entity boundary is recognized incorrectly; (ii) the entity boundary is recognized correctly but the entity type is labeled incorrectly; and (iii) both the entity type and the entity boundary are incorrect. Therefore, defining whether a named entity is correctly recognized is not simple: different systems and fields have different focuses, and their definitions may differ. In any case, according to the combination of the true class of a sample and the class predicted by the classifier, every prediction falls into one of the four situations in Table 2 [37], which are defined as follows.

•
TP (True Positive): entities correctly identified by the model in the test set.
•
FP (False Positive): entities identified by the model that do not exist in the test set.
•
TN (True Negative): items that are not entities in the test set and that the model also recognizes as non-entities.
•
FN (False Negative): entities in the test set that are not correctly identified by the model.

Table 2
Confusion matrix of classification results

Real category	Predicted category
	Positive	Negative
Positive	TP	FN
Negative	FP	TN

The performance of NER for a specific system and field is generally evaluated by calculating the corresponding Precision, Recall, and $F_{1}$ . The definitions of them are:

•
Precision ( $P$ ): the percentage of the entities identified by the model that are correctly classified.
•
Recall ( $R$ ): the percentage of the entities in the test set that are correctly identified by the model.
•
$F_{1}$ : the harmonic mean of Precision and Recall, which weights the two equally.

They can be expressed by the following formulas.

$\displaystyle P=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},$ (4)

$\displaystyle R=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},$ (5)

$\displaystyle F_{1}=\frac{2\times P\times R}{P+R}.$ (6)

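
Eqs (4)–(6) can be sketched in Python as follows. The snippet assumes an exact-match criterion, in which a predicted entity counts as a TP only if both its type and its boundary agree with the annotation; as noted above, other systems may define correctness differently. The function name `prf1` and the `(type, start, end)` tuples are hypothetical and serve only as an illustration.

```python
def prf1(gold, pred):
    """Entity-level Precision, Recall and F1 following Eqs (4)-(6)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)   # predicted entities that match an annotation
    fp = len(pred - gold)   # predicted entities with no matching annotation
    fn = len(gold - pred)   # annotated entities the model missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical annotations and predictions as (type, start, end) spans.
gold = [("PER", 0, 2), ("ORG", 5, 9), ("LOC", 12, 14)]
pred = [("PER", 0, 2), ("ORG", 5, 8), ("LOC", 12, 14), ("LOC", 20, 22)]
p, r, f1 = prf1(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Here the ORG entity has a wrong boundary, so it contributes one FP and one FN, giving TP = 2, FP = 2 and FN = 1. TN does not appear in any of the three formulas.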
4.3 Experimental comparison

We compare eight existing models (i.e., HMM [14], CRF [7], CNN-CRF [38], BiRNN-CRF [39], BiLSTM [40], BiLSTM-CRF [26], BERT-CRF [41], BERT-BiLSTM-CRF [28]) with our BERT-CRF-PSO on the three Chinese datasets.

The authors of the BERT model note in their paper that, for fine-tuning, most model hyper-parameters are the same as in pre-training, with the exception of ${\rm batch\_size}$ , ${\rm training\_epoch}$ and ${\rm learning\_rate}$ [6]. These three hyper-parameters are detailed below.

•
${\rm\mathbf{batch\_size}}$ : It represents the number of samples used in each training step of the model. If ${\rm batch\_size}$ is too small, too few data samples are input to the network at a time and their statistics are not representative, which makes it difficult for the network to converge. If ${\rm batch\_size}$ is too large, the model is prone to falling into a local optimum, which reduces the accuracy.
•
${\rm\mathbf{training\_epoch}}$ : It means the number of times the model is trained on the entire training set. If ${\rm training\_epoch}$ is too small, the model may underfit, that is, fit the data poorly. If ${\rm training\_epoch}$ is too large, the model may overfit and generalize poorly.
•
${\rm\mathbf{learning\_rate}}$ : It refers to the step size with which the model parameters are updated. If ${\rm learning\_rate}$ is too small, the loss function changes very slowly, which greatly increases the time the network needs to converge and makes it easy to be trapped in a local minimum. If ${\rm learning\_rate}$ is too large, the updates may overshoot the optimal point, resulting in low accuracy.

It can be seen from the above that the values of these three hyper-parameters will greatly affect the performance of the model. Therefore, we choose these three hyper-parameters as the optimization objects in our experiment.

To give our model better generalization ability, we adopt several strategies to avoid overfitting. For example, we use the dropout method to randomly deactivate neurons during training, so that a different sub-network is trained at each step. At the same time, we use the warmup strategy for the learning rate, that is, we start training with a smaller learning rate and switch to the preset learning rate after a period of training. Additionally, in order to better evaluate the generalization ability of the model, we divide each dataset into a training set, a validation set and a test set according to a certain ratio. The training set is used for parameter learning, the validation set is used to evaluate the hyper-parameters, and the test set is used to evaluate the generalization ability. All results reported in our experiments are those on the test set.

Additionally, we conduct a comparative experiment of fine-tuning two hyper-parameters (i.e., ${\rm batch\_size}$ , ${\rm training\_epoch}$ ) and three hyper-parameters (i.e., ${\rm batch\_size}$ , ${\rm training\_epoch}$ and ${\rm learning\_rate}$ ) to verify the performance of our proposed method. Based on our experience, in Eq. (2), $w$ is set to 0.9, and both $c_{1}$ and $c_{2}$ are set to 2. The three hyper-parameters of the BERT model that are adjusted through the PSO algorithm are searched in the ranges ${\rm batch\_size}\in[8,32]$ , ${\rm training\_epoch}\in[10,30]$ and ${\rm learning\_rate}\in[1\times 10^{-5},1\times 10^{-4}]$ .
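
The search space above can be written down as a small configuration, together with a helper that turns a particle position into concrete hyper-parameter values. The paper does not state how real-valued positions are mapped to the integer-valued ${\rm batch\_size}$ and ${\rm training\_epoch}$ ; the rounding used below is an assumption, and the names `SEARCH_SPACE`, `INTEGER_VALUED` and `decode` are hypothetical.

```python
# Ranges taken from the text; everything else is an illustrative assumption.
SEARCH_SPACE = {
    "batch_size": (8, 32),
    "training_epoch": (10, 30),
    "learning_rate": (1e-5, 1e-4),
}
INTEGER_VALUED = {"batch_size", "training_epoch"}


def decode(position):
    """Map a particle position (one float per hyper-parameter) to a config."""
    config = {}
    for value, (name, (lo, hi)) in zip(position, SEARCH_SPACE.items()):
        value = min(max(value, lo), hi)  # keep the particle inside its range
        # Assumption: integer hyper-parameters are rounded to the nearest int.
        config[name] = int(round(value)) if name in INTEGER_VALUED else value
    return config


print(decode([24.9, 9.2, 3.3e-5]))
```

For the two-hyper-parameter variant, the same idea applies with ${\rm learning\_rate}$ removed from the dictionary and held at a fixed value.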

The comparison results of the nine models on the three Chinese datasets are shown in Table 3. Neural network-based models are randomly initialized, so their results differ from run to run; all results are therefore averaged over three runs. In addition, the last column of Table 3 ( $F_{1}$ -std) gives the standard deviation of $F_{1}$ over the three runs, multiplied by 100 for the convenience of observation and comparison. HMM and CRF are not neural network-based models, so they have no standard deviation, which is denoted by “ $-$ ” in Table 3.
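
The aggregation over runs can be sketched as follows. The three $F_{1}$ values are made up for illustration, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption, since the text does not say which one is reported.

```python
import statistics

# Hypothetical F1 scores (as fractions) from three runs of one model.
f1_runs = [0.8442, 0.8450, 0.8455]

f1_mean = statistics.mean(f1_runs)
# Assumption: population standard deviation, scaled by 100 as in Table 3.
f1_std_x100 = 100 * statistics.pstdev(f1_runs)

print(f"F1 = {100 * f1_mean:.2f}%, F1-std = {f1_std_x100:.2f}")
```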

Table 3
Comparison results of different models on different Chinese datasets

Dataset	Model	Precision	Recall	$F_{1}$	$F_{1}$ -std
Boson	HMM	53.05%	54.94%	53.98%	–
	CRF	78.52%	70.58%	74.34%	–
	CNN-CRF	39.78%	30.78%	34.70%	2.68
	BiRNN-CRF	58.39%	55.05%	56.66%	0.42
	BiLSTM	40.70%	45.41%	42.92%	4.78
	BiLSTM-CRF	59.58%	64.22%	61.81%	0.58
	BERT-CRF	82.97%	85.22%	84.08%	0.43
	BERT-BiLSTM-CRF	82.78%	85.37%	84.05%	0.32
	BERT-CRF-PSO(a)	83.33%	85.67%	84.49%	0.07
	BERT-CRF-PSO(b)	83.92%	85.87%	84.88%	0.23
People’s Daily	HMM	67.14%	70.88%	68.96%	–
	CRF	85.61%	76.45%	80.77%	–
	CNN-CRF	60.94%	67.24%	63.92%	0.54
	BiRNN-CRF	75.69%	74.87%	75.27%	0.55
	BiLSTM	54.23%	67.31%	60.05%	4.30
	BiLSTM-CRF	65.67%	70.45%	67.97%	0.43
	BERT-CRF	93.37%	96.47%	94.89%	0.57
	BERT-BiLSTM-CRF	93.43%	96.34%	94.86%	1.05
	BERT-CRF-PSO(a)	93.68%	97.29%	95.45%	0.22
	BERT-CRF-PSO(b)	95.69%	98.22%	96.94%	0.08
Inquiry data of the steel industry	HMM	78.12%	81.12%	79.59%	–
	CRF	91.20%	88.64%	89.90%	–
	CNN-CRF	75.00%	76.93%	75.95%	1.63
	BiRNN-CRF	88.25%	87.39%	87.82%	0.79
	BiLSTM	79.42%	85.00%	82.12%	0.85
	BiLSTM-CRF	84.36%	86.43%	85.38%	0.77
	BERT-CRF	91.58%	93.42%	92.49%	0.60
	BERT-BiLSTM-CRF	91.83%	93.72%	92.76%	0.48
	BERT-CRF-PSO(a)	92.66%	94.21%	93.43%	0.42
	BERT-CRF-PSO(b)	92.73%	94.45%	93.58%	0.23

(a) represents fine-tuning two hyper-parameters, (b) represents fine-tuning three hyper-parameters.

Compared with the other existing models, our proposed model introduces the PSO algorithm, which enables the particle swarm to automatically fine-tune the hyper-parameters in a given search space (i.e., the hyper-parameter ranges) in search of their optimal values, so that the performance of the model on a specific dataset approaches the optimum within that space. It is worth noting that the hyper-parameters of the other models are manually set based on the results of multiple experiments and past experience, which is time-consuming and laborious and can easily get stuck in a local optimum. In contrast, the particles in our model search for the global optimal solution through cooperation and information sharing. The results in Table 3 support this: BERT-CRF-PSO (b) obtains the highest $F_{1}$ on all three datasets, exceeding BERT-CRF by 0.80, 2.05 and 1.09 percentage points on the Boson, People’s Daily and steel industry datasets, respectively. Meanwhile, the last column of Table 3 shows that the two BERT-CRF-PSO variants have the smallest standard deviations on every dataset, which indicates that our proposed model is more stable than the other models. Moreover, increasing the number of hyper-parameters that are fine-tuned in Algorithm 3.3 further improves the performance, as the comparison between BERT-CRF-PSO (a) and BERT-CRF-PSO (b) in Table 3 shows. Therefore, we conclude that the BERT-CRF-PSO model can automatically search for well-performing hyper-parameters for a specific dataset, thereby improving the performance of entity extraction.

5. Conclusions

By introducing the idea of collaborative computing, we present a novel model called BERT-CRF-PSO for the NER task. Three modules, namely the BERT model, the CRF layer and the PSO algorithm, are incorporated into our model. The BERT model serves as the pre-trained language model that provides the corresponding word vectors; the CRF layer learns constraints from the training data and applies them to the final predicted labels to make sure that the label sequences are legal; and the PSO algorithm fine-tunes the hyper-parameters in a given search space to automatically search for their optimal values in a cooperative way. Although these three modules play different roles, they share information and cooperate with each other to accomplish the NER task. The experimental results in Table 3 verify the effectiveness of our model: it achieves the highest $F_{1}$ among the compared models on all three datasets.

Acknowledgments

This work was supported in part by the Beijing Natural Science Foundation under Grants 19L2029 and M21032, in part by the National Natural Science Foundation of China under Grants 81961138010 and U1836106, in part by the Scientific and Technological Innovation Foundation of Foshan under Grants BK20BF010 and BK21BF001, and in part by the Fundamental Research Funds for the University of Science and Technology Beijing under Grant FRF-BD-19-012A.

Portions of this paper were presented at the International Conference on Collaborative Computing: Networking, Applications and Worksharing in 2021 [42].

Conflict of interest

The authors declare no conflict of interest.

References

1. Nadeau and Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30(24) (2007), 3–26.

2. Zhang and Elhadad, Unsupervised biomedical named entity recognition: Experiments with clinical and biological texts, Journal of Biomedical Informatics 46(6) (2013), 1088–1098.

3. Nasar, S.W. Jaffry and M.K. Malik, Named entity recognition and relation extraction: State-of-the-art, ACM Computing Surveys 54(1) (2021), 1–39.

4. He and Sun, A unified model for cross-domain and semi-supervised named entity recognition in chinese social media, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, AAAI Press, San Francisco, USA, 2017.

5. Kennedy and Eberhart, Particle Swarm Optimization, in: Proceedings of the International Conference on Neural Networks, IEEE, Perth, Australia, 1995, pp. 1942–1948.

6. Devlin, M.-W. Chang, Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 4171–4186.

7. Lafferty, McCallum and F.C. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in: Proceedings of the 18th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, USA, 2001, pp. 282–289.

8. Xie, Liu, Jia, Luan and Sun, Representation Learning of Knowledge Graphs with Entity Descriptions, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI Press, Phoenix, Arizona, USA, 2016, pp. 2659–2665.

9. S.A. Akhondi, K.M. Hettne, Van Der Horst, E.M. Van Mulligen and J.A. Kors, Recognition of chemical entities: Combining dictionary-based and grammar-based approaches, Journal of Cheminformatics 7(1) (2015), 1–11.

10. Farmakiotou, Karkaletsis, Koutsias, Sigletos, C.D. Spyropoulos and Stamatopoulos, Rule-based Named Entity Recognition for Greek Financial Texts, in: Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries, Citeseer, 2000, pp. 75–78.

11. Collins and Singer, Unsupervised Models for Named Entity Classification, in: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Association for Computational Linguistics, MD, USA, 1999, pp. 100–110.

12. W.S. Noble, What is a support vector machine, Nature Biotechnology 24(12) (2006), 1565–1567.

13. S.K. Saha, Sarkar and Mitra, Feature selection techniques for maximum entropy based biomedical named entity recognition, Journal of Biomedical Informatics 42(5) (2009), 905–911.

14. S.R. Eddy, What is a hidden markov model, Nature Biotechnology 22(10) (2004), 1315–1316.

15. G.D. Forney, The viterbi algorithm, Proceedings of the IEEE 61(3) (1973), 268–278.

16. Sobhana, Mitra and Ghosh, Conditional random field based named entity recognition in geological text, International Journal of Computer Applications 1(3) (2010), 143–147.

17. H.L. Chieu and H.T. Ng, Named Entity Recognition: a Maximum Entropy Approach using Global Information, in: Proceedings of the 19th International Conference on Computational Linguistics, Howard International House and Academia Sinica, Taipei, Taiwan, 2002.

18. Ekbal and Bandyopadhyay, Named entity recognition using support vector machine: A language independent approach, International Journal of Electrical, Computer, and Systems Engineering 4(2) (2010), 155–170.

19. Zhou and Su, Named Entity Recognition using an HMM-based Chunk Tagger, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, USA, 2002, pp. 473–480.

20. Luo, Sun, Wang, Zhao, J.-H. Wang and Zhang, Short-term wind speed forecasting via stacked extreme learning machine with generalized correntropy, IEEE Transactions on Industrial Informatics 14(11) (2018), 4963–4971.

21. Luo, Chen, Yang and Li, Ophthalmic disease detection via deep learning with a novel mixture loss function, IEEE Journal of Biomedical and Health Informatics 25(9) (2021), 3332–3339.

22. Lample, Ballesteros, Subramanian, Kawakami and Dyer, Neural Architectures for Named Entity Recognition, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, CA, USA, 2016, pp. 260–270.

23. Žukov-Gregorič, Bachrach and Coope, Named Entity Recognition with Parallel Recurrent Neural Networks, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 69–74.

24. J.T. Zhou, Zhang, Jin, Peng, Xiao and Cao, RoSeq: Robust sequence labeling, IEEE Transactions on Neural Networks and Learning Systems 31(7) (2019), 2304–2314.

25. Chiu and Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs, Transactions of the Association for Computational Linguistics 4 (2016), 357–370.

26. Zhou, Hao and Liu, A Bidirectional LSTM and Conditional Random Fields Approach to Medical Named Entity Recognition, in: Proceedings of the International Conference on Advanced Intelligent Systems and Informatics, Springer, Cairo, Egypt, 2017, pp. 355–365.

27. Ma and Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1064–1074.

28. Zhang, Jiang, Zhao, Hou, Liu and Zhang, A BERT-BiLSTM-CRF Model for Chinese Electronic Medical Records Named Entity Recognition, in: Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World, IEEE, Xiangtan, China, 2019, pp. 166–169.

29. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A.N. Gomez, Kaiser and Polosukhin, Attention is all You Need, in: Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Neural Information Processing Systems Foundation, Long Beach, CA, USA, 2017, pp. 5999–6009.

30. Liu, Zhang, Che, Liu and Wu, Domain Adaptation for CRF-based Chinese Word Segmentation using Free Annotations, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Doha, Qatar, 2014, pp. 864–874.

31. Constant and Sigogne, MWU-Aware Part-of-Speech Tagging with a CRF Model and Lexical Resources, in: Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World, Association for Computational Linguistics, Oregon, USA, 2011, pp. 49–56.

32. Khabsa and C.L. Giles, Chemical entity extraction using CRF and an ensemble of extractors, Journal of Cheminformatics 7(1) (2015), 1–9.

33. Steinbrunn, Moerkotte and Kemper, Heuristic and randomized optimization for the join ordering problem, The VLDB Journal 6(3) (1997), 191–208.

34. J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, USA, 1975.

35. Djemame, Batouche, Oulhadj and Siarry, Solving reverse emergence with quantum PSO application to image processing, Soft Computing 23(16) (2019), 6921–6935.

36. Lipare, D.R. Edla and Dharavath, Fuzzy rule generation using modified PSO for clustering in wireless sensor networks, IEEE Transactions on Green Communications and Networking 5(2) (2021), 846–857.

37. S.V. Stehman, Selecting and interpreting measures of thematic classification accuracy, Remote Sensing of Environment 62(1) (1997), 77–89.

38. Knobelreiter, Reinbacher, Shekhovtsov and Pock, End-to-end training of hybrid CNN-CRF models for stereo, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Honolulu, USA, 2017, pp. 2339–2348.

39. Shao, Hardmeier, Tiedemann and Nivre, Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing, Asian Federation of Natural Language Processing, Taipei, Taiwan, 2017, pp. 173–183.

40. Zeng, Yang, Feng, Wang and Zhao, A Convolution BiLSTM Neural Network Model for Chinese Event Extraction, in: Natural Language Understanding and Intelligent Applications, Lecture Notes in Computer Science, Vol. 10102, Springer, Kunming, China, 2016, pp. 275–287.

41. Liu, Wang and Xu, LTP: A New Active Learning Strategy for Bert-CRF Based Named Entity Recognition, CoRR abs/2001.02524, 2020.

42. Peng, Luo, Shen, Huang and Chen, A Collaborative Optimization-Guided Entity Extraction Scheme, in: Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing, Springer, 2021, pp. 190–205.
