ELCA: Enhanced boundary location for Chinese named entity recognition via contextual association

Abstract

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that supports other tasks such as text understanding, information retrieval, and question answering. In recent years, combining character-word structure with dictionary information has been shown to be effective for Chinese NER. As a representative hybrid model, Lattice-LSTM has obtained better benchmark results on several publicly available Chinese NER datasets. However, Lattice-LSTM does not address long-distance entities or the detection of several entities that share the same character. In addition, ambiguous entity boundary information reduces the accuracy of embedding-based NER. This paper proposes ELCA: Enhanced Boundary Location for Chinese Named Entity Recognition via Contextual Association, a method that addresses long-distance dependent entities by using sentence-level position information and uses adaptive word convolution to handle several entities sharing the same character. ELCA achieves state-of-the-art results in Chinese Word Segmentation and Chinese NER.

Keywords

Nested Chinese NER Lattice-LSTM NLP

1. Introduction

Named Entity Recognition (NER) is a widely used technology for information extraction in Natural Language Processing (NLP) [27]. NER aims to accurately identify and classify named entities (e.g., names of people, organizations, and locations) within unstructured text. NER therefore plays a crucial role in real-world NLP because it provides accurate named entities for downstream applications such as text understanding [12], information retrieval [30], text clustering [5], question answering [16], machine translation [32], and knowledge base construction [2]. Since sentences in Chinese are not naturally segmented, Chinese NER for social media text is more challenging than English NER [17, 21]. To segment Chinese sentences, most existing Chinese NER models for social media text use existing Chinese Word Segmentation (CWS) systems [10, 40]. However, because the same characters in Chinese may have multiple parts of speech and represent different entities, CWS systems can segment sentences incorrectly. As a consequence, Chinese NER for social media text may suffer from inaccurate entity boundary detection and entity category prediction. To address these issues, some studies have worked on enhancing the information between characters [22, 25]. For instance, “BMES” [45] is a lattice structure that effectively uses word information to avoid the propagation of word segmentation errors. [25] point out that characters always outperform words in the framework of deep learning. In Chinese NER for social media text, however, words often contain more important information than characters. Therefore, the inability to use lexical information effectively is the fatal flaw of character-based NER methods.

Figure 1.

Sentence-level and word-level position attention information.

A drawback of purely character-based NER methods is that word information is not fully exploited. Zhang and Yang [45] presented Lattice-LSTM, which combines character-based NER with a lexicon of words. When a character matches many words in the dictionary, Lattice-LSTM keeps all matched words and lets the downstream NER model decide which word to apply instead of selecting one heuristically. However, the LSTM structure processes a temporal sequence step by step, resulting in significant computational cost. Peng et al. [29] proposed combining word information with adaptive word convolution to improve lexicon boundary detection. The graph attention mechanism in PGAT [39] computes separate attention over the four token dimensions of a character (B, M, E, S) to make effective use of lexical information. However, this approach does not take the location of characters into account, which leads to unclear boundaries in Chinese text. Consider, for example, “货拉拉拉拉布拉多吗? (Cargo Lara Lara Labrador?)”. Position selective attention can distinguish the five occurrences of “拉 (La)” in this sentence and thereby sharpen the entity boundaries of “货拉拉 (Cargo Lala)” and “拉布拉多 (Labrador)”.

In recent years, to speed up computation, FLAT [19] proposed a Transformer-based method for Chinese NER on social media text that encodes characters together with their positions. Jia [13] proposed adaptive convolution over characters and words, which strengthens the role of words. However, this approach does not consider word position or the fact that sentence position information also affects NER results. Jin [15] proposed an implicit Transformer-based encoding of relative position information to enhance the boundary information of entities. This method, however, does not handle entities with long-distance dependencies or entities repeated within sentences. As Fig. 1 shows, the two entities “小米 (Xiaomi)” and “小米粥 (Xiaomi porridge)” share the two Chinese characters “小米 (Xiaomi)”, which poses a challenge for Chinese NER on social media text.

In this paper, we propose Enhanced boundary Location for Chinese NER via Contextual Association (ELCA) for nested Chinese NER on social media text to overcome the above problems. ELCA addresses long entities through adaptive convolution between characters. In detail, we link the index of the current character to its context by finding the three characters that precede it and the three characters that follow it. In contrast to conventional word segmentation methods, this paper employs adaptive word convolution to establish connections between the current character and its contextual characters. This approach sharpens the delineation of Chinese entities, allows more precise identification of the most appropriate entity, and thereby improves the recognition rate of NER. It ensures that, once the current character is found not to form an entity with either the previous or the next character, processing continues with the next character so that no long entity is missed. ELCA combines sentence-level positional information with inter-character positional information to handle identical characters that represent different entities. Unlike typical attention mechanisms, the position selective attention discussed in this paper employs a flat structure for encoding position information. This flattened position encoding has the potential to enhance the accuracy of Chinese NER. We maintain two position indices (character and sentence); the index values indicate whether the current entity also appears in other sentences, which helps confirm the current entity.

In summary, the contributions of this paper are as follows:

• In this study, we have enhanced the embeddings of entities in the text by incorporating their absolute positional information. The boundaries of embedded entities are now primarily determined by positional encodings. Consequently, our approach has yielded superior results in NER for social media texts.

• Contextual fragments of characters of different lengths are encoded to dynamically generate word-level representations of location-specific characters. As a result, our approach is able to capture useful word-level semantic information while reducing the impact of segmentation error cascades.

• The experimental results show that ELCA performs significantly better in both NER and CWS datasets in terms of characteristics compared to the baseline using character information. Specifically, the performance on OntoNotes is improved by 2.3%.

2. Related work

Several studies have employed character-based word representations obtained from end-to-end neural models, rather than relying solely on word-level representations, as the primary input [8]. [9] shows that the accuracy and efficiency of constructing cross-lingual bilingual mappings are low when information about entities in the corpus is unclear. Character-based representations capture additional information about the words and help improve the performance of NER. In rule-based NER approaches, manual rules based on domain-specific gazetteers, syntactic-lexical patterns, and pre-processed synonym dictionaries are utilized. These rules aid in identifying protein mentions, putative genes, and other relevant entities [18]. Tran [36] proposed a neural NER model that extracts word features from word embeddings and character-level recurrent neural networks (RNNs). The model incorporates stacked residual LSTM and trainable bias decoding techniques to enhance its performance. To capture informative morphological representations from word character sequences, Vzukov et al. [47] employed a deep bidirectional GRU. The resulting character-level representations are then concatenated with word embeddings to form the final word representation. In summary, researchers have utilized character-based word representations, rule-based techniques, and the combination of character and word embeddings to improve the accuracy and effectiveness of NER systems. These approaches enhance the models’ ability to capture morphological information and identify specific entities within text.

Although co-training might improve the validity of the word segmentation, the NER module still had no specific measures to avoid segmentation errors [31]. The above existing methods suffered the potential issue of error propagation and confusion of entity boundaries. To address the challenges mentioned above, researchers have employed automatic dictionary construction techniques for Chinese named entity recognition (NER). These dictionaries are pretrained on large automatically segmented text and contain vocabulary with both boundary and semantic information [32]. Boundary information is provided by the dictionary itself, while semantic information is obtained through pre-trained word embeddings [1, 26].

Some approaches have explored performing Chinese NER directly at the character level, which has proven to be effective in empirical studies [22]. Ma et al. [25] addressed the computational efficiency challenges associated with utilizing word lexicons in Chinese NER. Stanislawek et al. [33] proposed encoding the lattice into a graph neural network (GNN), such as the Lexicon-based Graph Network (LGN) [33]. Additionally, the Flat-Lattice Transformer (FLAT) [20] introduced four codes to the NER process by considering the head and tail positions of characters within words, albeit with increased computational costs. BERT, with its bi-directional transformer architecture, swiftly incorporates character information into the NER process. Enhanced character embeddings have been achieved by incorporating the ALBERT pre-training language model and Multi-word Information (MWI) [21]. While this paper provides a comprehensive overview of NER techniques, recent advances in the field extend beyond its scope, for instance multi-modal NER [14] and NER in the financial domain [43].

In summary, scholars have utilized automatic dictionary construction, character-level NER, graph neural networks, and enhanced character embeddings to overcome the challenges in Chinese NER. These techniques leverage boundary and semantic information, address computational efficiency concerns, and incorporate character-level details into the models.

3. Methodology

Table 1
Datasets

	Dataset	Type	Train	Dev	Test
NER	Weibo	Sent	1.4k	0.27k	0.27k
		Char	73.8k	14.5k	14.8k
	Ontonotes	Sent	15.7k	4.3k	4.3k
		Char	491.9k	200.5k	208.1k
	MSRA	Sent	46.4k	–	4.4k
		Char	2169.9k	–	172.6k
	Resume	Sent	3.8k	0.46k	0.48k
		Char	124.1k	13.9k	15.1k
CWS	PKU	Sent	19.1k	–	1.9k
		Char	1826k	–	173k
	MSR	Sent	86.9k	–	4.0k
		Char	4050k	–	184k
	CTB6	Sent	23k	2k	3k
		Char	1056k	100k	134k

Figure 2.

The architecture of our model.

Figure 3.

ELCA implementation details.

In this section, we present ELCA and outline its implementation steps. As shown in Fig. 2, ELCA consists of two main parts: position selective attention and contextual association. More details are shown in Fig. 3. First, ELCA obtains the vector representation of each character through character embedding, and the vector representation of the position information through position selective attention. ELCA then feeds the vector representation of the character’s position information into the add-normalize layer of BERT, where it takes part in the self-attention computation, thus enhancing the entity boundaries formed by the entity position information. The position information and the original character embedding are then passed to the contextual association step, which computes the contextual association of the text characters; the result is fed into BERT to compute accurate boundary information. In the position selective attention stage, we use positional coding to construct character-level and sentence-level positional information that represents entity information in the text more accurately. In the contextual association stage, we use adaptive word convolution and contextual association of characters to find and enhance the entity boundary information, which further improves NER accuracy. Our framework consists of two main steps: pre-training and fine-tuning.

During the pre-training phase, the ELCA model is trained on unlabeled datasets using various pre-training tasks. This allows the model to learn meaningful representations from the data. In the fine-tuning phase of ELCA, the model initializes the pre-trained parameters and then fine-tunes them using sentence pair positional selective attention. This attention mechanism helps the model focus on relevant information while considering the position of each word in the sentence. It’s important to note that each word has its own separate fine-tuned model, although they all share the same pre-training parameters. This allows the model to capture word-specific characteristics during fine-tuning. To illustrate the concepts discussed, we will use a quiz example, depicted in Fig. 4, as a running example throughout this section.

As for sequence encoding, we use convolutional operations as our basic encoding unit. Chinese text in everyday use usually has no standardized syntax or grammar and presents semantics in a fragmented form, for example, 来到杨过曾经生活过的地方, 小龙女动情地说: “我也想过过过儿过过的生活” (Coming to the place where Yang had lived, Little Loong Girl said emotionally: “I also want to live through the life of the child over.”).

Figure 4.

An illustration of our pre-training framework in BERT.

3.1 Preliminary: BERT

BERT [6] (Bidirectional Encoder Representations from Transformers) is a pre-trained language model comprising a stack of multi-head self-attention layers and fully connected layers. For each head in the $l$-th multi-head self-attention layer, the output matrix $\mathbf{H}^{\text{out},l}=\{\mathbf{h}_{1}^{\text{out},l},\mathbf{h}_{2}^{\text{out},l},\ldots,\mathbf{h}_{n}^{\text{out},l}\}\in\mathbb{R}^{n\times d_{k}}$ satisfies:

$\displaystyle\mathbf{h}_{i}^{\text{out},l}=\sum_{j=1}^{n}\left(\frac{\exp\alpha_{ij}^{l}}{\sum_{j^{\prime}}\exp\alpha_{ij^{\prime}}^{l}}\mathbf{h}_{j}^{\mathrm{in},l}\mathbf{W}^{v,l}\right),$

$\displaystyle\alpha_{ij}^{l}=\frac{1}{\sqrt{2d_{k}}}(\mathbf{h}_{i}^{\mathrm{in},l}\mathbf{W}^{q,l})(\mathbf{h}_{j}^{\mathrm{in},l}\mathbf{W}^{k,l})^{T},$ (1)

where $\mathbf{H}^{\mathrm{in},l}=\{\mathbf{h}_{1}^{\mathrm{in},l},\mathbf{h}_{2}^{\mathrm{in},l},\ldots,\mathbf{h}_{n}^{\mathrm{in},l}\}\in\mathbb{R}^{n\times d_{h}}$ is the input matrix, and $\mathbf{W}^{q,l},\mathbf{W}^{k,l},\mathbf{W}^{v,l}\in\mathbb{R}^{d_{h}\times d_{k}}$ are learnable parameters. $n$ and $d_{h}$ are the sequence length and hidden size, and the attention size is $d_{k}=d_{h}/n_{h}$, where $n_{h}$ is the number of attention heads.
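
As a minimal sketch of Eq. (1), the following pure-Python code computes one attention head on a toy input. The helper names (`matmul`, `softmax`, `attention_head`) and all numeric values are illustrative assumptions, not part of the ELCA implementation; only the formula itself, including the $1/\sqrt{2d_{k}}$ scaling, is taken from the text.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(h_in, w_q, w_k, w_v):
    """One self-attention head following Eq. (1).

    h_in: n x d_h input matrix; w_q, w_k, w_v: d_h x d_k projections.
    Returns the n x d_k output matrix and the n x n attention weights.
    """
    d_k = len(w_q[0])
    q, k, v = matmul(h_in, w_q), matmul(h_in, w_k), matmul(h_in, w_v)
    scale = 1.0 / math.sqrt(2 * d_k)  # scaling factor as written in Eq. (1)
    weights = []
    for q_i in q:
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        weights.append(softmax(scores))  # normalised alpha_{ij} for row i
    return matmul(weights, v), weights

# Toy input: n = 3 tokens, d_h = 4, d_k = 2 (all values are made up).
H_IN = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
W_V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
H_OUT, ALPHA = attention_head(H_IN, W_Q, W_K, W_V)
```

Each row of `ALPHA` sums to one, and `H_OUT` has one $d_{k}$-dimensional row per input token.
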
3.2 Step 1: Position selective attention

The original BERT model was not specifically designed to handle lattice diagrams, which encode sequences of characters as well as nested and overlapping words from different partitions. Incorporating position information from the lattice diagram accurately into the interaction between symbols is not straightforward. To address this, we propose an expansion of the position embedding of the attention plane and introduce position selective attention.

In this section, we present the fundamental architecture of the Transformer and leverage its encoder, known as the BERT block, for optimizing the NER task. The BERT block comprises a self-attention network and a feedforward network (FFN). Each layer in the network includes residual connections and normalization. The FFN is a position-wise multi-layer perceptron that performs nonlinear transformations. The Transformer conducts self-attention over the sequence individually for each of $H$ attention heads and then concatenates the results of the $H$ heads. For simplicity, we omit the head index in the following formula. The calculation for each head is given in Eq. (2):

$\displaystyle\operatorname{Att}(\mathbf{A},\mathbf{V})=\operatorname{softmax}(\mathbf{A})\mathbf{V},$

$\displaystyle\mathbf{A}_{ij}=\left(\frac{\mathbf{Q}_{i}\mathbf{K}_{j}^{\mathrm{T}}}{\sqrt{d_{\text{head}}}}\right),$ (2)

$\displaystyle[\mathbf{Q},\mathbf{K},\mathbf{V}]=E_{x}\left[\mathbf{W}_{q},\mathbf{W}_{k},\mathbf{W}_{v}\right],$

where $E$ is the token embedding lookup table or the output of the last Transformer layer. $\mathbf{W}_{q},\mathbf{W}_{k},\mathbf{W}_{v}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{head}}}$ are learnable parameters, $d_{\text{model}}=H\times d_{\text{head}}$, and $d_{\text{head}}$ is the dimension of each head.

In Fig. 4, the distance-lattice exhibits spans of various lengths. Given a sentence $s=\{c_{1},c_{2},\ldots,c_{n}\}$ as a character sequence, each character is assigned a pre-defined tag. To capture the interaction between spans, we propose absolute position encoding for spans within the sentence. As shown in the table part of Fig. 5, the instance can be segmented to obtain candidate segmentations. We determine entity boundaries from the positional information of characters and sentences. When considering two spans, $x_{i}$ and $x_{j}$, within the lattice, there are three possible relationships: intersection, inclusion, and irrelevance. These relationships depend on their respective advantages and disadvantages. Since we cannot directly encode these relationships, we utilize dense vectors to represent them. By incorporating the absolute position of spans within the sentence and the absolute position of the sentence within the document, we can perform continuous transformations. This approach allows us not only to calculate the relationship between the two kinds of position information but also to express additional detail, such as the distance between characters and the relative position of characters within words. Let $\textit{distance}[i]$ denote the character position of span $x_{i}$, and $\textit{sentence}[i]$ represent the sentence position of span $x_{i}$. The position distances can then be calculated as follows:

$\displaystyle d_{ij}^{(\textit{character})}=\textit{distance}[i]-\textit{distance}[j],$ (3)

$\displaystyle d_{ij}^{(\textit{sentence})}=\textit{sentence}[i]-\textit{sentence}[j],$ (4)

$\displaystyle d_{ij}^{(s-c)}=\textit{character}[i]_{\textit{sentence}[i]}-\textit{character}[j]_{\textit{sentence}[j]},$ (5)

Figure 5.

Create the candidate position embedding.

In the calculation above, $d_{ij}^{(\textit{character})}$ represents the distance between characters, while $d_{ij}^{(\textit{sentence})}$ and $d_{ij}^{(s-c)}$ denote the distance between sentence positions and the distance between sentence-indexed character positions, respectively. These distances are determined in a similar manner. To convert and encode the absolute position information, a simple nonlinear transformation with a Rectified Linear Unit (ReLU) activation is applied to the encodings of the three distances. This transformation turns the distance values into representations that can be used in subsequent processing steps.
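
Before turning to the encoding in Eq. (6), the three distances of Eqs. (3)-(5) can be illustrated with a short sketch. It rests on assumptions that the text does not spell out: indices are 0-based, $\textit{distance}[i]$ is read as the position of a character in the whole input, and $\textit{character}[i]_{\textit{sentence}[i]}$ is read as the position of a character inside its own sentence. The function names are hypothetical.

```python
def position_indices(sentences):
    """Return (distance, sentence, in-sentence character) indices per character.

    `sentences` is a list of strings; all indices are 0-based (an assumption).
    """
    distance, sentence, character = [], [], []
    offset = 0
    for s_idx, text in enumerate(sentences):
        for c_idx, _ in enumerate(text):
            distance.append(offset + c_idx)  # position in the whole input
            sentence.append(s_idx)           # index of the containing sentence
            character.append(c_idx)          # position inside that sentence
        offset += len(text)
    return distance, sentence, character

def pairwise_distances(sentences):
    """Build the three n x n distance matrices of Eqs. (3)-(5)."""
    distance, sentence, character = position_indices(sentences)
    n = len(distance)
    d_char = [[distance[i] - distance[j] for j in range(n)] for i in range(n)]
    d_sent = [[sentence[i] - sentence[j] for j in range(n)] for i in range(n)]
    d_sc = [[character[i] - character[j] for j in range(n)] for i in range(n)]
    return d_char, d_sent, d_sc

# The two entities from Fig. 1, treated here as two short sentences.
D_CHAR, D_SENT, D_SC = pairwise_distances(["小米", "小米粥"])
```

In this toy input the two occurrences of “小” are two characters apart overall and lie in different sentences, yet occupy the same in-sentence position, which is the kind of signal the three distances are meant to separate.
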

$\displaystyle R_{ij}=\operatorname{ReLU}(W_{r}(\mathbf{P}_{d_{ij}^{(\textit{character})}}\oplus\mathbf{P}_{d_{ij}^{(\textit{sentence})}}\oplus\mathbf{P}_{d_{ij}^{(s-c)}}))$ (6)

where $W_{r}$ is a learnable parameter, $\oplus$ denotes the concatenation operator, and $\mathbf{P}_{d}$ is calculated as in [37]:

$\displaystyle\mathbf{p}_{d}^{(2k)}=\sin(d/10000^{2k/d_{\text{model}}}),$

$\displaystyle\mathbf{p}_{d}^{(2k+1)}=\cos(d/10000^{2k/d_{\text{model}}}),$ (7)

where $d$ can be $d_{ij}^{(\textit{character})}$, $d_{ij}^{(\textit{sentence})}$, or $d_{ij}^{(s-c)}$, and $k$ denotes the dimension index of the absolute position encoding. We then define a variant of the self-attention mechanism that incorporates the position encoding of the spans between characters, as follows:

$\displaystyle\mathbf{A}_{i,j}^{*}=\mathbf{W}_{q}^{\top}\mathbf{E}_{x_{i}}^{\top}\mathbf{E}_{x_{j}}\mathbf{W}_{k,E}+\mathbf{W}_{q}^{\top}\mathbf{E}_{x_{i}}^{\top}\mathbf{R}_{ij}\mathbf{W}_{k,R}+\mathbf{u}^{\top}\mathbf{E}_{x_{j}}\mathbf{W}_{k,E}+\mathbf{v}^{\top}\mathbf{R}_{ij}\mathbf{W}_{k,R}$ (8)

where $\mathbf{W}_{q},\mathbf{W}_{k,R},\mathbf{W}_{k,E}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{head}}}$ and $\mathbf{u},\mathbf{v}\in\mathbb{R}^{d_{\text{head}}}$ are learnable parameters. We then replace $\mathbf{A}$ with $\mathbf{A}^{*}$ in Eq. (2). The remaining computation is the same as in the vanilla Transformer.
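
The chain from distances to the position-aware score in Eqs. (6)-(8) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the weight matrices are stored already transposed (one row per output dimension) so that a plain matrix-vector product applies, and every numeric value and function name is hypothetical.

```python
import math

def sinusoid(d, d_model):
    """Sinusoidal encoding of a (possibly negative) distance d, as in Eq. (7)."""
    p = [0.0] * d_model
    for k in range(d_model // 2):
        angle = d / (10000 ** (2 * k / d_model))
        p[2 * k] = math.sin(angle)
        p[2 * k + 1] = math.cos(angle)
    return p

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def fuse_positions(d_char, d_sent, d_sc, w_r, d_model):
    """Eq. (6): ReLU of a linear map over the three concatenated encodings."""
    concat = sinusoid(d_char, d_model) + sinusoid(d_sent, d_model) + sinusoid(d_sc, d_model)
    return [max(0.0, v) for v in matvec(w_r, concat)]

def score(e_i, e_j, r_ij, w_q, w_ke, w_kr, u, v):
    """Eq. (8): content-content, content-position, and two bias terms."""
    q_i = matvec(w_q, e_i)    # query projection of token i
    k_e = matvec(w_ke, e_j)   # key projection of token j's embedding
    k_r = matvec(w_kr, r_ij)  # key projection of the fused position encoding
    return dot(q_i, k_e) + dot(q_i, k_r) + dot(u, k_e) + dot(v, k_r)

# Toy sizes and values (all made up): d_model = 4, d_head = 2.
D_MODEL = 4
W_R = [[0.1 * ((r + c) % 3 - 1) for c in range(3 * D_MODEL)] for r in range(D_MODEL)]
W_Q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_KE = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
W_KR = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
E_I, E_J = [1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]
U, V = [0.1, 0.1], [0.2, 0.2]

R_IJ = fuse_positions(2, 1, 0, W_R, D_MODEL)
A_STAR = score(E_I, E_J, R_IJ, W_Q, W_KE, W_KR, U, V)
```

When the position encoding and both bias vectors are zero, `score` reduces to the ordinary query-key product, which makes the role of the three additional terms in Eq. (8) easy to inspect in isolation.
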

3.3 Step 2: Contextual association

In this stage, we aim to encode the word-level semantics based on the positional choices of each character and the sentence position. This encoding process is essential for capturing the complete meaning of each word. For each character $c_{i}$, we need to encode the segmentation that includes $c_{i}$ into its word-level representation. However, this presents a challenge because the length of the segmentation and the position of characters within it are unpredictable, and a single type of segmentation encoding structure cannot accommodate this variability. Therefore, we propose adaptive word convolution in this paper. When $c_{i}$ is the $k$-th character of the word $w$, there are two ways to form subwords: the left subword and the right subword. These subwords can be expressed as follows:

$\displaystyle w_{m:m+h-1}$ (9)

$\displaystyle\Leftrightarrow\textit{subw}_{m:i}\oplus\textit{subw}_{i:m+h-1}$ (10)

$\displaystyle\Leftrightarrow\textit{subw}_{(i-k):i}\oplus\textit{subw}_{i:(i+h-1-k)},$ (11)

where $1\leqslant m\leqslant n$, $1\leqslant h\leqslant 4$, $m\leqslant i\leqslant m+h$, $0\leqslant k<h$, and $\oplus$ denotes the join operation. Fig. 6 tabulates these subwords to illustrate the equations above. For example, “李 (Li)” is the first character (i.e., $k=0$) of the word “李明 (Liming)” (i.e., $i=m=1$ and $h=2$), so we define the left subword as $\textit{subw}_{1:1}$ and the right subword as $\textit{subw}_{1:2}$ to express the word $w_{1:2}$, which then serves as the word-level representation for the character “李 (Li)”. We ignore $\textit{subw}_{1:1}$ because it is contained in $\textit{subw}_{1:2}$.
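
The decomposition in Eqs. (9)-(11) can be checked on the “李明” example with a small sketch. The function name is hypothetical, indices are 1-based and inclusive as in the equations, and the sketch assumes that $c_{i}$ lies inside the word, i.e. $m\leqslant i\leqslant m+h-1$.

```python
def split_word_at(sentence, m, h, i):
    """Split the word w_{m:m+h-1} at character i into left and right subwords.

    Indices are 1-based and inclusive, as in Eqs. (9)-(11).
    """
    if not (1 <= m and m + h - 1 <= len(sentence) and m <= i <= m + h - 1):
        raise ValueError("character i must lie inside the word w_{m:m+h-1}")
    k = i - m                              # offset of c_i inside the word
    left = sentence[i - k - 1:i]           # subw_{(i-k):i}
    right = sentence[i - 1:i + h - 1 - k]  # subw_{i:(i+h-1-k)}
    return k, left, right

# "李" is the first character of "李明": i = m = 1, h = 2.
K_FIRST, LEFT_FIRST, RIGHT_FIRST = split_word_at("李明", m=1, h=2, i=1)
# "明" is the second character of the same word: i = 2.
K_SECOND, LEFT_SECOND, RIGHT_SECOND = split_word_at("李明", m=1, h=2, i=2)
```

For the first character the left subword is the character itself and the right subword is the whole word; for the last character the roles are reversed.
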

To automatically match subwords, we construct a feature map $\bm{F}$ of size $n\times 7$ by applying convolution operations of different sizes to the features in both directions, as in Eq. (12):

$\displaystyle\bm{F}=\left[\begin{array}{cccc}\overleftarrow{sw_{1}^{3}}&\overleftarrow{sw_{2}^{3}}&\ldots&\overleftarrow{sw_{n}^{3}}\\ \overleftarrow{sw_{1}^{2}}&\overleftarrow{sw_{2}^{2}}&\ldots&\overleftarrow{sw_{n}^{2}}\\ \overleftarrow{sw_{1}^{1}}&\overleftarrow{sw_{2}^{1}}&\ldots&\overleftarrow{sw_{n}^{1}}\\ sw_{1}^{0}&sw_{2}^{0}&\ldots&sw_{n}^{0}\\ \overrightarrow{sw_{1}^{1}}&\overrightarrow{sw_{2}^{1}}&\ldots&\overrightarrow{sw_{n}^{1}}\\ \overrightarrow{sw_{1}^{2}}&\overrightarrow{sw_{2}^{2}}&\ldots&\overrightarrow{sw_{n}^{2}}\\ \overrightarrow{sw_{1}^{3}}&\overrightarrow{sw_{2}^{3}}&\ldots&\overrightarrow{sw_{n}^{3}}\end{array}\right]$ (12)

To compute the contextual associations of characters, we use $\overleftarrow{sw_{i}^{k}}$ and $\overrightarrow{sw_{i}^{k}}$ to compute the adaptive word convolution for the current word, where $\bm{z}_{i}$ represents the embedding representation of the current character:

$\displaystyle\overleftarrow{sw_{i}^{k}}=\operatorname{ReLU}(\bm{W}_{k}^{(s)}[\bm{z}_{i-k},\ldots,\bm{z}_{i}]+b_{k}^{(s)})$ (13)

$\displaystyle\overrightarrow{sw_{i}^{k}}=\operatorname{ReLU}(\bm{W}_{k}^{(s^{\prime})}[\bm{z}_{i},\ldots,\bm{z}_{i+k}]+b_{k}^{(s^{\prime})})$ (14)

$\displaystyle\bm{z}_{i}=\bm{c}_{i}^{(e)}+\bm{W}^{(v)}\bm{v}_{i}$ (15)

Figure 6.

Tabulation of subwords. The red vertical lines identify correct word segmentations. The ✓ marks the subwords that fit each character.

where $\bm{W}^{(v)}\in\mathbb{R}^{d_{v}}$, $\overrightarrow{\cdot}$ indicates a window sliding forward, and $\overleftarrow{\cdot}$ indicates a window sliding backward.
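
A minimal sketch of how the feature map of Eq. (12) could be assembled from the windows of Eqs. (13) and (14) is given below. Several choices are assumptions made for illustration only: the learned kernels $\bm{W}_{k}^{(s)}$ and $\bm{W}_{k}^{(s^{\prime})}$ are replaced by a hypothetical averaging kernel, the biases are zero, windows that run past the sentence edge are zero-padded, and the embeddings $\bm{z}_{i}$ of Eq. (15) are taken as given.

```python
def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]

def conv_window(window, weight, bias):
    """ReLU(W [z_a, ..., z_b] + b) over the concatenated window, Eqs. (13)-(14)."""
    flat = [x for z in window for x in z]
    return relu([sum(w * x for w, x in zip(row, flat)) + b for row, b in zip(weight, bias)])

def mean_kernel(dim, width):
    """Hypothetical stand-in for a learned kernel: averages `width` vectors."""
    return [[(1.0 / width if c % dim == r else 0.0) for c in range(dim * width)]
            for r in range(dim)]

def feature_map(z, max_k=3):
    """Return F with one row per character and 2 * max_k + 1 subword features."""
    n, dim = len(z), len(z[0])
    pad = [0.0] * dim
    bias = [0.0] * dim

    def at(idx):
        # Zero padding outside the sentence (an assumption).
        return z[idx] if 0 <= idx < n else pad

    rows = []
    for i in range(n):
        # Backward windows [z_{i-k}, ..., z_i] for k = 3, 2, 1.
        left = [conv_window([at(j) for j in range(i - k, i + 1)],
                            mean_kernel(dim, k + 1), bias)
                for k in range(max_k, 0, -1)]
        # The character itself (k = 0).
        centre = conv_window([z[i]], mean_kernel(dim, 1), bias)
        # Forward windows [z_i, ..., z_{i+k}] for k = 1, 2, 3.
        right = [conv_window([at(j) for j in range(i, i + k + 1)],
                             mean_kernel(dim, k + 1), bias)
                 for k in range(1, max_k + 1)]
        rows.append(left + [centre] + right)
    return rows

# Toy embeddings for a four-character sentence (values are made up).
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
F = feature_map(Z)
```

Each character thus receives seven candidate subword features: three backward windows, the character itself, and three forward windows.
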

As Eq. (12) shows, an adaptive selection over the valid subwords can pick the candidate character information learned in step 1, after which the word-level representation $\bm{w}_{i}$ of each character can be determined. In detail,

$\displaystyle\bm{w}_{i}=\sum_{f=0}^{6}\alpha_{if}\bm{F}_{i,f},$ (16)

$\displaystyle\alpha_{if}=\frac{\exp(\mathrm{g}(\bm{F}_{i,f},\bm{v}_{i}))}{\sum_{f^{\prime}=0}^{6}\exp(\mathrm{g}(\bm{F}_{i,f^{\prime}},\bm{v}_{i}))},$ (17)

$\displaystyle\mathrm{g}(\bm{F}_{i,f},\bm{v}_{i})=\operatorname{ReLU}(\bm{W}^{(\alpha)}[\bm{F}_{i,f}+\bm{W}^{(v)}\bm{v}_{i}]).$ (18)

In calculating Eq. (16), based on the properties of Chinese, we simply traverse $f$ from 0 to 6 (no more than 6 entities in Chinese). After steps 1 and 2, we obtain meaningful word-level semantic information from location information and avoid the cascade of segmentation errors.
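
The weighting in Eqs. (16)-(18) amounts to a softmax over the seven subword features of one character. The sketch below assumes shapes that the text leaves open: $\bm{W}^{(v)}$ is treated as a $d\times d_{v}$ projection and $\bm{W}^{(\alpha)}$ as a length-$d$ scoring vector that yields a scalar. The function name and all values are hypothetical.

```python
import math

def word_level_representation(f_i, v_i, w_alpha, w_v):
    """Combine the subword features of one character, Eqs. (16)-(18).

    f_i: list of 7 feature vectors of dimension d
    v_i: candidate-information vector of dimension d_v
    w_v: d x d_v projection (assumed shape); w_alpha: length-d scoring vector
    """
    proj = [sum(w * x for w, x in zip(row, v_i)) for row in w_v]  # W^(v) v_i
    scores = []
    for feat in f_i:
        mixed = [a + b for a, b in zip(feat, proj)]               # F_{i,f} + W^(v) v_i
        raw = sum(w * x for w, x in zip(w_alpha, mixed))
        scores.append(max(0.0, raw))                              # Eq. (18)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                             # Eq. (17)
    dim = len(f_i[0])
    w_i = [sum(alpha[f] * f_i[f][d] for f in range(len(f_i)))     # Eq. (16)
           for d in range(dim)]
    return w_i, alpha

# Toy input: seven identical features, so the weights should be uniform.
F_I = [[1.0, 2.0] for _ in range(7)]
V_I = [0.5]
W_V = [[1.0], [0.0]]
W_ALPHA = [0.3, -0.1]
W_I, ALPHA_I = word_level_representation(F_I, V_I, W_ALPHA, W_V)
```

With identical features the attention weights are uniform and the output equals the shared feature, which gives a quick sanity check of the normalisation.
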

4. Evaluation

4.1 Experimental settings

Our experimental settings follow the protocols of [34], including the tested datasets and the evaluation metrics (P, R, F1).
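
For reference, the three metrics can be computed as in the sketch below. It assumes the common entity-level convention in which a predicted entity counts as correct only if its boundaries and type both match a gold entity exactly; the text does not state the matching criterion, and the spans are made-up examples.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall and F1 over (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # exact boundary and type match
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical spans: one of three predictions matches one of two gold entities.
GOLD = [(0, 3, "GPE"), (3, 8, "GPE")]
PRED = [(0, 3, "GPE"), (3, 7, "GPE"), (9, 11, "PER")]
P, R, F1 = prf1(GOLD, PRED)
```

Here one of three predictions is correct and one of two gold entities is found, so precision is 1/3, recall is 1/2, and F1 is their harmonic mean, 0.4.
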

• In terms of implementation details:
– The character embeddings are pretrained using word2vec on the original microblog text with a dimension of 100.
– For the basic BiLSTM $+$ CRF model, we use a hidden state size of 200, as it is a bidirectional LSTM.
– The basic CNNs $+$ CRF model employs 100 filters with window lengths of 2, 3, 4, and 5.
– Other parameters are adjusted accordingly, with a learning rate of 0.001 and a dropout rate of 0.5.
– The validation set is created by randomly selecting 20% of the training set.
– We train each model for up to 120 epochs using the Adam optimizer, stopping if the validation loss does not decrease for 20 consecutive epochs.
– Additionally, for our specific implementation, we set the learning rate to 2e-5 and the dimensions $d_{e}$ and $d_{p}$ to 100, while $d_{v}$ is set to 25.
– We have experimented with different options and found these settings to be the most reasonable.

• Hyperparameter selection. We set different hyperparameters for the different Chinese NER datasets. Table 2 lists the settings for MSRA and Ontonotes; because these two datasets are relatively large, we use a single tested set of hyperparameters. Table 3 lists the ranges used in a random search over hyperparameters, such as learning rate and batch size, for Weibo and Resume.
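
The stopping rule stated in the implementation details (at most 120 epochs, stop after 20 consecutive epochs without a lower validation loss) can be sketched as follows. The function name is hypothetical, and the list of losses stands in for values that a real training loop would compute after each epoch.

```python
def train_with_early_stopping(validation_losses, max_epochs=120, patience=20):
    """Return the 1-based epoch at which training stops and the best loss.

    `validation_losses` is a hypothetical stand-in for the per-epoch
    validation loss of a real training loop.
    """
    best = float("inf")
    epochs_without_improvement = 0
    epoch = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epoch, best

# The loss improves for two epochs and then plateaus.
STOP_EPOCH, BEST_LOSS = train_with_early_stopping([1.0, 0.9] + [0.95] * 30)
```

In this toy run the loss last improves at epoch 2, so training stops 20 epochs later.
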

During the experiments, we also examine the $d_{\textit{AdaptiveWord}}$ parameter. Chinese entity words are rarely longer than 8 characters. As the number of adaptive words grows, the span of information obtained by ELCA becomes longer, which raises the computational overhead of the model while having almost no effect on its accuracy. In this paper, we therefore set the context length to 3 characters on each side, so that together with the current character a window covers 7 characters. Likewise, because Chinese entity words rarely exceed 8 characters, we encode positions over a length of 8 characters. For sentences, since ELCA mainly focuses on Chinese short-text datasets, we set the sentence length to 2 sentences, which covers most of the data; lengths above 2 bring considerable computational overhead without improving performance. The chosen hyperparameters are shown in Tables 2 and 3.

Table 2
Hyperparameters of Ontonotes and MSRA

Hyperparameter	Value
Batch size	10
lr	2e-5
-decay	0.5
-momentum	0.9
$d_{\textit{AdaptiveWord}}$	3
$d_{\textit{characters}}$	8
$d_{\textit{sentences}}$	2
FFN size	480
Embed dropout	0.5
Output dropout	0.3
Warmup	10 (epoch)

Table 3
Hyperparameters of Weibo and Resume

Hyperparameter	Value
Batch size	[10, 20]
lr	[2e-5, 8e-4]
-decay	0.5
-momentum	0.9
$d_{\textit{AdaptiveWord}}$	[3, 7]
$d_{\textit{characters}}$	[7, 13]
$d_{\textit{sentences}}$	[3, 5]
FFN size	480
Embed dropout	0.5
Output dropout	0.3
Warmup	10 (epoch)

Table 4
Performance on OntoNotes. A model followed by (LSTM) (e.g., Proposed (LSTM)) indicates that its sequence modeling layer is LSTM-based.

Input	Models	P	R	F1
Gold seg	Yang et al., 2016 [40]	65.59	71.84	68.57
	Yang et al., 2016^[40]	72.98	80.15	76.40
	Che et al., 2013^[3]	77.71	72.51	75.02
	Wang et al., 2013^*[38]	76.43	72.32	74.32
	Word-based (LSTM)	76.66	63.60	69.52
	$+$ char $+$ bichar	78.62	73.13	75.77
Auto seg	Word-based (LSTM)	72.84	59.72	65.63
	$+$ char $+$ bichar	73.36	70.12	71.70
No seg	Char-based (LSTM)	68.79	60.35	64.30
	$+$ bichar $+$ softword	74.36	69.43	71.89
	$+$ ExSoftword	69.90	66.46	68.13
	$+$ bichar $+$ ExSoftword Lattice-LSTM	73.80	71.05	72.40
	LR-CNN	76.35	71.56	73.88
	SoftLexicon (LSTM)	76.40	72.60	74.45
	SoftLexicon (LSTM)	77.28	74.07	75.64
	$+$ bichar	77.13	75.22	76.16
	BERT-Tagger	76.01	79.96	77.93
	BERT $+$ LSTM $+$ CRF	81.99	81.65	81.82
	SoftLexicon (LSTM) $+$ BERT	83.41	82.21	82.81
No seg	ELCA	85.08	83.74	85.12

Table 5
Performance on Resume.

Models	P	R	F1
Word-based (LSTM)	93.72	93.44	93.58
$+$ char $+$ bichar	94.07	94.42	94.24
Char-based (LSTM)	93.66	93.31	93.48
$+$ bichar $+$ softword	94.53	94.29	94.41
$+$ ExSoftword	95.29	94.42	94.85
$+$ bichar $+$ ExSoftword	96.14	94.72	95.43
Lattice-LSTM	94.81	94.11	94.46
LR-CNN [8]	95.37	94.84	95.11
SoftLexicon (LSTM)	95.30	95.77	95.53
SoftLexicon (LSTM) $+$ bichar	95.71	95.77	95.74
BERT-Tagger	94.87	96.50	95.68
BERT $+$ LSTM $+$ CRF	95.75	95.28	95.51
SoftLexicon (LSTM) $+$ BERT	96.08	96.13	96.11
BL-BTC [15]	96.3	97.46	96.88
ELCA	96.75	97.21	96.98

Table 6
Performance on Weibo. NE, NM and Overall denote F1 scores for named entities, nominal entities (excluding named entities) and both, respectively.

Models	NE	NM	Overall
Peng and Dredze, 2015 [28]	51.96	61.05	56.05
Peng and Dredze, 2016 [29]	55.28	62.97	58.99
He and Sun, 2017a [10]	50.60	59.32	54.82
He and Sun, 2017b [11]	54.50	62.17	58.23
Char-based (LSTM)	46.11	55.29	52.77
$+$ bichar $+$ softword	50.55	60.11	56.75
$+$ ExSoftword	44.65	55.19	52.42
$+$ bichar $+$ ExSoftword	58.93	53.38	56.02
Lattice-LSTM	53.04	62.25	58.79
LR-CNN [8]	57.14	66.67	59.92
SoftLexicon (LSTM)	59.08	62.22	61.42
SoftLexicon (LSTM) $+$ bichar	58.12	64.20	59.81
BERT-Tagger	65.77	62.05	63.80
BERT $+$ LSTM $+$ CRF	69.65	64.62	67.33
SoftLexicon (LSTM) $+$ BERT	70.94	67.02	70.50
ELCA	73.25	69.75	72.23

Table 7
Performance on MSRA

Models	P	R	F1
Chen et al. 2006 [4]	91.22	81.71	86.20
Zhang et al. 2006 [44]	92.20	90.18	91.18
Zhou et al. 2013 [46]	91.86	88.75	90.28
Lu et al. 2016 [23]	–	–	87.94
Dong et al. 2016 [7]	91.28	90.62	90.95
Char-based (LSTM)	90.74	86.96	88.81
$+$ bichar $+$ softword	92.97	90.80	91.87
$+$ ExSoftword	90.77	87.23	88.97
$+$ bichar $+$ ExSoftword	93.21	91.57	92.38
Lattice-LSTM	93.57	92.79	93.18
LR-CNN [8]	94.50	92.93	93.71
SoftLexicon (LSTM)	94.63	92.70	93.66
SoftLexicon (LSTM) $+$ bichar	94.73	93.40	94.06
BERT-Tagger	93.40	94.12	93.76
BERT $+$ LSTM $+$ CRF	95.06	94.61	94.83
SoftLexicon (LSTM) $+$ BERT	95.75	95.10	95.42
ELCA	96.14	95.31	95.33

Table 8
Performance on CWS

Model	PKU	MSR	CTB6
Yang et al. (2017) [41]	95.00	96.80	95.40
Ma et al. (2018) [24]	96.10	97.40	96.70
Yang et al. (2019) [42]	95.80	97.80	96.10
Qiu et al. (2020)	96.41	98.05	96.99
Tian et al. (2020c) (with BERT) [35]	96.51	98.28	97.16
Tian et al. (2020c) (with ZEN) [35]	96.53	98.40	97.25
BERT	96.25	97.94	96.98
BERT $+$ Word	96.55	98.41	97.25
ERNIE	96.33	98.17	97.02
ZEN	96.36	98.36	97.13
ELCA	96.73	98.42	97.44

Table 9
Chinese NER and CWS case

#1 Example of Chinese NER
Sentence (truncated)	内蒙古呼伦贝尔盟 (Hulunbuir League, Inner Mongolia)
Matched Words	内蒙，内蒙古，内蒙古呼伦贝尔，蒙古，呼伦，呼伦贝尔，呼伦贝尔盟，贝尔
	Inner Mongolia, Inner Mongolia, Inner Mongolia Hulunbuir, Mongolia, Hulun, Hulunbuir, Hulunbuir League, Buir
Characters	内	蒙	古	呼	伦	贝	尔	盟
Gold Labels	B-GPE	I-GPE	E-GPE	B-GPE	I-GPE	I-GPE	I-GPE	E-GPE
BERT	B-GPE	I-GPE	I-GPE	I-GPE	I-GPE	I-GPE	I-GPE	E-GPE
BERT $+$ Word	B-GPE	I-GPE	E-GPE	B-ORG	I-ORG	I-ORG	I-ORG	E-ORG
ELCA	B-GPE	I-GPE	E-GPE	B-GPE	I-GPE	I-GPE	I-GPE	E-GPE

#2 Example of Chinese CWS
Sentence (truncated)	乱七八糟的关系 (Messy Relationship)
Matched Words	乱七八糟，七八，八糟，关系
	Mess, Seven and Eight, Bad News, Relationship
Characters	乱	七	八	糟	的	关	系
Gold Labels	B-ADJ	I-ADJ	I-ADJ	E-ADJ	S-PART	B-NOUN	E-NOUN
BERT	B-ADJ	I-NUM	I-NUM	E-ADJ	S-PART	B-NOUN	E-NOUN
BERT $+$ Word	B-ADJ	I-NUM	I-NUM	E-ADJ	S-PART	B-NOUN	E-NOUN
ELCA	B-ADJ	I-ADJ	I-ADJ	E-ADJ	S-PART	B-NOUN	E-NOUN

4.2 Overall performance

Batch size	10
lr	2e-5
-decay	0.5
-momentum	0.9
$d_{\textit{AdaptiveWord}}$	3
$d_{\textit{characters}}$	8
$d_{\textit{sentences}}$	2
FFN size	480
Embed dropout	0.5
Output dropout	0.3
Warmup	10 (epoch)

Batch size	[10, 20]
lr	[2e-5, 8e-4]
-decay	0.5
-momentum	0.9
$d_{\textit{AdaptiveWord}}$	[3, 7]
$d_{\textit{characters}}$	[7, 13]
$d_{\textit{sentences}}$	[3, 5]
FFN size	480
Embed dropout	0.5
Output dropout	0.3
Warmup	10 (epoch)

As shown in Tables 4–7, ELCA outperforms the baseline models and the other lexicon-based models on four Chinese NER datasets. ELCA outperforms Lattice-LSTM [45] by 3.62 in average F1 score and improves on the state-of-the-art SoftLexicon (LSTM) [25] by 0.86 in average F1 score. The gain from the lexicon is less pronounced on the small datasets than on the large ones, probably because the BERT features already benefit from absolute position information.
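The precision (P), recall (R) and F1 values in the result tables are tied together by the standard harmonic-mean definition of F1. As a minimal sketch, the relation can be checked against two rows of the MSRA results (Table 7), read as precision, recall and F1; the function name `f1_score` is ours and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the MSRA results table.
rows = {
    "Chen et al. 2006 [4]": (91.22, 81.71, 86.20),
    "Zhang et al. 2006 [44]": (92.20, 90.18, 91.18),
}

for name, (p, r, reported_f1) in rows.items():
    computed = f1_score(p, r)
    print(f"{name}: computed F1 = {computed:.2f}, reported F1 = {reported_f1:.2f}")
```

Because F1 is a harmonic mean, it always lies between the precision and the recall it is computed from, which is a quick sanity check when reading any row of these tables.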

4.3 Effectiveness study

Tables 4–7¹ show the performance of our method against the compared baselines. In this experiment, we mainly compare against lexicon-based models and a single bidirectional LSTM.

¹ In Tables 4–7, we use the lexicon model as the baseline.

• OntoNotes. Table 4 shows the results on the OntoNotes dataset, where gold word segmentation is available for both the training set and the test set. The methods in the “Gold seg” and “Auto seg” groups are all word-based: the former take gold word segmentation as input, and the latter take automatic word segmentation produced by a segmenter trained on the OntoNotes training data. The methods in the “No seg” group, including ELCA, are character-based. Two observations follow from this table. First, ELCA achieves a competitive F1 score regardless of which word segmentation the compared methods use. Second, for the same model, the “No seg” setting often gives poorer results than the “seg” settings, yet ELCA remains highly competitive.

• Resume/MSRA. To put the results in context, we also select the best statistical models on these datasets. These models use a wealth of hand-crafted features [44], character embedding features, radical features, cross-domain data, and semi-supervised data. The results are shown in Tables 5 and 7. From the tables, ELCA obtains a recognition rate higher than that of the best of these lexicon models.

• Weibo. Table 6 shows the results on Weibo. Because Weibo is Chinese social media data, it contains a large amount of conversational text as well as long statements. Long statements provide more sentence-level information and are therefore better suited to ELCA. However, conversational text is prone to missing entities, so this advantage does not always translate into better results.

• CWS. To assess the accuracy of word segmentation, we employed four commonly used CWS datasets and compared the results against several baselines (Table 8). The experiments show that SPW-BERT improves by roughly 1.5 percent.

• Case Study. Table 9 shows a Chinese NER example from the OntoNotes dataset. BERT alone cannot determine the boundary of the entity, whereas BERT $+$ Word and ELCA both segment the sentence correctly. However, BERT $+$ Word cannot accurately predict the type of the long entity 呼伦贝尔盟 (Hulunbuir League). Because ELCA takes both inter-sentence and inter-word features into account, it is better able to capture the more complex entities provided by BERT and the lexicon.

• Speed. As a fundamental NLP tool, NER needs to train quickly. Because of the structural limits of its LSTM, Lattice-LSTM [45] trains with a batch size of 1 and is therefore slow to train. Our model overcomes this limitation. For a fair comparison, we run 20 epochs of training, development, and testing, and report the average time in Table 10. ELCA achieves a 6–9x speedup over Lattice-LSTM.
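The case study in Table 9 is easiest to read once the per-character tags are turned into entity spans. The following is a minimal sketch of such a decoder for B/I/E/S-style tags, applied to the label rows of the NER example. The function name `decode_spans` and the strict handling of malformed sequences (a span is dropped if its type changes midway) are our assumptions, not something the paper specifies.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start index, inclusive end index, entity type)


def decode_spans(tags: List[str]) -> List[Span]:
    """Convert B/I/E/S tags (plus "O") into entity spans."""
    spans: List[Span] = []
    start, cur_type = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, cur_type = None, None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "S":  # single-character entity
            spans.append((i, i, label))
            start, cur_type = None, None
        elif prefix == "B":  # open a new span
            start, cur_type = i, label
        elif prefix == "I":  # continue only if the type is unchanged
            if start is None or label != cur_type:
                start, cur_type = None, None
        elif prefix == "E":  # close the span if it is well formed
            if start is not None and label == cur_type:
                spans.append((start, i, label))
            start, cur_type = None, None
    return spans


chars = list("内蒙古呼伦贝尔盟")
gold = "B-GPE I-GPE E-GPE B-GPE I-GPE I-GPE I-GPE E-GPE".split()
bert = "B-GPE I-GPE I-GPE I-GPE I-GPE I-GPE I-GPE E-GPE".split()
bert_word = "B-GPE I-GPE E-GPE B-ORG I-ORG I-ORG I-ORG E-ORG".split()

for name, tags in [("Gold", gold), ("BERT", bert), ("BERT + Word", bert_word)]:
    entities = [("".join(chars[s:e + 1]), t) for s, e, t in decode_spans(tags)]
    print(name, entities)
```

Decoded this way, the BERT row yields a single GPE span covering all eight characters, while the BERT $+$ Word row yields the two gold boundaries but labels the second span ORG, which is the behaviour described in the case study.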

4.4 Ablation study

We conduct ablation experiments to verify the effectiveness of our contributions. First, we carry out the main analysis on the small datasets Weibo and Resume. Second, we run the different ablation settings on four Chinese datasets, as shown in Table 11.

Table 10
The performance of models in training and testing time. Time is measured in seconds. Lattice means the Lattice-LSTM

Data	Ours (s)	Lattice (s)	Speedup
WeiboNER-ALL	89	608	6.8x
MSRA	2341	19477	8.7x

Table 11

An ablation study of the proposed model

	OntoNotes	MSRA	Weibo	Resume
SoftLexicon (LSTM)	75.64	93.66	61.42	95.53
Only Position Attention Information	81.47	93.11	67.32	95.93
Only Adaptive Word Convolution	83.74	94.86	69.93	96.41

Figure 7.

F1-value against the sentence length.

Only Sentence-level Position Attention Information.

In our experiments, position information covers both the position of characters and the position of sentences. As shown in Table 11, when we use only sentence-level position information, the results are about 6 F1 points higher than those of SoftLexicon (LSTM) on OntoNotes and Weibo. When we use only the character-level information, the F1 value is reduced by about 5%.
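To make the comparison with SoftLexicon (LSTM) concrete, the per-dataset differences can be read off Table 11 directly. The sketch below only restates the F1 values from that table and subtracts the baseline row; the variable and function names are illustrative and not part of the paper.

```python
# F1 values copied from Table 11 (ablation study).
datasets = ["OntoNotes", "MSRA", "Weibo", "Resume"]
baseline = [75.64, 93.66, 61.42, 95.53]  # SoftLexicon (LSTM)
variants = {
    "Only Position Attention Information": [81.47, 93.11, 67.32, 95.93],
    "Only Adaptive Word Convolution": [83.74, 94.86, 69.93, 96.41],
}


def f1_deltas(scores, reference):
    """Absolute F1 difference (in points) of each score over the reference."""
    return [round(s - r, 2) for s, r in zip(scores, reference)]


for name, scores in variants.items():
    deltas = dict(zip(datasets, f1_deltas(scores, baseline)))
    print(name, deltas)
```

With these numbers, the position-attention variant is 5.83 points above the baseline on OntoNotes, 5.90 on Weibo and 0.40 on Resume, and 0.55 points below it on MSRA.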

Only Adaptive Word Convolution.

In this part of the ablation experiment, we remove the Sentence-level Position Attention Information and measure the effect of Adaptive Word Convolution alone on the F1 score.

Sentence Length.

Fig. 7 shows how the F1 values of the baselines and ELCA change with sentence length on the OntoNotes dataset. Such a curve expresses each model's performance as a function of sentence length. The shorter the sentence, the smaller the impact on performance. We speculate that, since sentence complexity increases with sentence length, longer sentences pose a greater challenge for NER.
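A curve of this kind can be produced by bucketing test sentences by length and computing F1 within each bucket. The sketch below assumes that per-sentence counts of correct, predicted and gold entities are already available. The record layout, the bucket edges and the toy numbers are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Tuple

# (sentence length in characters, correct entities, predicted entities, gold entities)
Record = Tuple[int, int, int, int]


def f1_by_length(records: Iterable[Record], edges: List[int]) -> Dict[str, float]:
    """Micro-averaged F1 (in percent) per sentence-length bucket.

    `edges` are ascending bucket boundaries, e.g. [20, 40] gives the
    buckets "<20", "20-39" and ">=40".
    """
    labels = [f"<{edges[0]}"]
    labels += [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
    labels.append(f">={edges[-1]}")

    totals = [[0, 0, 0] for _ in labels]  # correct, predicted, gold per bucket
    for length, correct, predicted, gold in records:
        bucket = bisect_right(edges, length)
        totals[bucket][0] += correct
        totals[bucket][1] += predicted
        totals[bucket][2] += gold

    result: Dict[str, float] = {}
    for label, (correct, predicted, gold) in zip(labels, totals):
        # 2 * correct / (predicted + gold) is algebraically equal to 2PR / (P + R).
        denom = predicted + gold
        result[label] = 100.0 * 2 * correct / denom if denom else 0.0
    return result


# Toy records, for illustration only.
toy_records = [(12, 2, 2, 2), (18, 1, 1, 2), (35, 3, 4, 4), (60, 2, 4, 5)]
toy_f1 = f1_by_length(toy_records, edges=[20, 40])
for bucket, value in toy_f1.items():
    print(f"{bucket}: F1 = {value:.2f}")
```

Pooling the counts within a bucket before computing F1 (micro-averaging) keeps buckets with few entities from being dominated by a single sentence; averaging per-sentence F1 values would be an alternative design with different behaviour on sparse buckets.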

5. Conclusion

This paper proposes ELCA, which combines Sentence-level Position Attention Information and Adaptive Word Convolution for Chinese NER. Using absolute position information and adaptive word convolution, ELCA addresses long-distance dependent entities in Chinese NER that would otherwise go undetected. Adaptive word convolution also improves the modeling of relationships between characters. Beyond improving boundary word information, ELCA provides dynamic word-level representations for characters at specified positions by encoding segments of varying lengths. We believe this direction has a promising research future. In the future, we plan to extend ELCA to additional Named Entity Recognition (NER) tasks in low-resource languages. Our goal is to improve the efficiency of low-resource models by developing a unified NER model that can be applied to a range of low-resource languages, thereby reducing the computational overhead of building a separate NER system for each language.

Acknowledgments

The work described in this paper is funded by the National Natural Science Foundation of China under Grant Nos. 61772210 and U1911201, Guangdong Province Universities Pearl River Scholar Funded Scheme (2018) and the Project of Science and Technology in Guangzhou in China under Grant No. 202007040006.

References

1. Bengio, Ducharme, Vincent and Janvin, A neural probabilistic language model, The Journal of Machine Learning Research 3 (2003), 1137–1155.

2. Bhatia and Pinto, Automated construction of knowledge-bases for safety critical applications: Challenges and opportunities, volume 2846 of CEUR Workshop Proceedings, 2021.

3. Che, Wang, C.D. Manning and Liu, Named entity recognition with bilingual constraints, The ACL, 2013, 52–62.

4. Chen, Peng, Shan and Sun, Chinese named entity recognition with conditional probabilistic models, ACL, 2006, 173–176.

5. Cui, Zhong and Bai, A new Chinese text clustering algorithm based on wrd and improved k-means, Intelligent Data Analysis (Preprint) (2023), 1–16.

6. Devlin, Chang, Lee and Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, ACL, 2019, 4171–4186.

7. Dong, Zhang, Zong and Hattori, Character-based LSTM-CRF with radical-level features for Chinese named entity recognition, Lecture Notes in Computer Science, 2016, 239–250.

8. Gui, Zhang, Zhao, Y.-G. Jiang and Huang, CNN-based Chinese NER with lexicon rethinking, in: IJCAI, 2019, pp. 4982–4988.

9. The parallel corpus for information extraction based on natural language processing and machine translation, Expert Systems 36(5) (2019), e12349.

10. He and Sun, F-score driven max margin neural network for named entity recognition in Chinese social media, ACL, 2017, 713–718.

11. He and Sun, A unified model for cross-domain and semi-supervised named entity recognition in Chinese social media, AAAI Press, 2017, 3216–3222.

12. Ivgi, Shaham and Berant, Efficient long-text understanding with short-text models, Transactions of the Association for Computational Linguistics 11 (2023), 284–299.

13. Jia, Shi, Yang and Zhang, Entity enhanced BERT pre-training for Chinese NER, in: EMNLP, ACL, 2020, pp. 6384–6396.

14. Jia, Shen, Liao and Chen, MNER-QG: An end-to-end MRC framework for multimodal named entity recognition with query grounding, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 8032–8040.

15. Jin and Zhao, A hybrid transformer approach for Chinese NER with features augmentation, Expert Systems with Applications 209 (2022), 118385.

16. Lee, Hwang and Jang, Fine-grained named entity recognition and relation extraction for question answering, in: SIGIR, ACM, 2007, pp. 799–800.

17. Wang, S.C. Hui, Liao, Zhu and Huang, A segment enhanced span-based model for nested named entity recognition, Neurocomputing 465 (2021), 26–37.

18. Sun and Han, A survey on deep learning for named entity recognition, IEEE Transactions on Knowledge and Data Engineering, 2020.

19. Yan, Qiu and Huang, FLAT: Chinese NER using flat-lattice transformer, ACL, 2020, 6836–6842.

20. Yan, Qiu and Huang, FLAT: Chinese NER using flat-lattice transformer, in: Proceedings of the 58th Annual Meeting of the ACL, Online, ACL, July 2020, pp. 6836–6842.

21. Liu and Zhong, Chinese named entity recognition based on bi-directional quasi-recurrent neural networks improved with BERT: new method to solve Chinese NER, in: International Conference on Innovation in Artificial Intelligence, 2021, pp. 15–19.

22. Liu and Song, An encoding strategy based word-character LSTM for Chinese NER, in: NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), ACL, 2019, pp. 2379–2389.

23. Lu and Zhang, Multi-prototype Chinese character embedding, European Language Resources Association (ELRA), 2016.

24. Ma, Ganchev and Weiss, State-of-the-art Chinese word segmentation with Bi-LSTMs, in: EMNLP, 2018, pp. 4902–4908.

25. Peng, Zhang, Wei and Huang, Simplify the usage of lexicon in Chinese NER, ACL, 2020, 5951–5960.

26. Mikolov, Sutskever, Chen, G.S. Corrado and Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

27. Nadeau and Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30(1) (2007), 3–26.

28. Peng and Dredze, Named entity recognition for Chinese social media with jointly trained embeddings, ACL, 2015, 548–554.

29. Peng and Dredze, Improving named entity recognition for Chinese social media with word segmentation representation learning, ACL, 2016.

30. Petkova and W.B. Croft, Proximity-based document representation for named entity retrieval, ACM, CIKM, 2007, 731–740.

31. Qin, Chen, Cai, Liu and Jin, Long short-term memory with activation on gradient, Neural Networks 164 (2023), 135–145.

32. Singh, Kumar and Chana, Improving neural machine translation for low-resource Indian languages using rule-based feature extraction, Neural Comput. Appl. 33(4) (2021), 1103–1122.

33. Stanislawek, Wróblewska, Wójcicka, Ziembicki and Biecek, Named entity recognition – is there a glass ceiling? in: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, ACL, Nov. 2019.

34. Sui, Chen, Liu, Zhao and Liu, Leverage lexical knowledge for Chinese named entity recognition via collaborative graph network, in: EMNLP-IJCNLP, Nov. 2019.

35. Tian, Song, Xia, Zhang and Wang, Improving Chinese word segmentation with wordhood memory networks, in: ACL, 2020, pp. 8274–8285.

36. Tran, MacKinlay and A.J. Yepes, Named entity recognition with stack residual LSTM and trainable bias decoding, arXiv preprint arXiv:1706.07598, 2017.

37. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A.N. Gomez, Kaiser and Polosukhin, Attention is all you need, in: NIPS, 2017, pp. 5998–6008.

38. Wang, Che and C.D. Manning, Effective bilingual constraints for semi-supervised learning of named entity recognizers, AAAI Press, 2013.

39. Wang and Chen, Polymorphic graph attention network for Chinese NER, Expert Systems with Applications, 2022, 117467.

40. Yang, Teng and Zhang, Combining discrete and neural features for sequence labeling, volume 9623 of Lecture Notes in Computer Science, Springer, 2016, 140–154.

41. Yang, Zhang and Dong, Neural word segmentation with rich pretraining, in: R. Barzilay and M. Kan, editors, ACL, 2017, pp. 839–849.

42. Yang, Zhang and Liang, Subword encoding in lattice LSTM for Chinese word segmentation, in: NAACL-HLT, 2019, pp. 2720–2725.

43. Zhang, Wang, Liu and Zhang, Chinese named entity recognition method for the finance domain based on enhanced features and pretrained language models, Information Sciences 625 (2023), 385–400.

44. Zhang, Qin, Wen and Wang, Word segmentation and named entity recognition for SIGHAN bakeoff3, ACL, 2006, 158–161.

45. Zhang and Yang, Chinese NER using lattice LSTM, ACL, 2018, 1554–1564.

46. Zhou and Zhang, Chinese named entity recognition via joint identification and categorization, Chinese Journal of Electronics 22(2) (2013), 225–230.

47. Žukov-Gregorič, Bachrach and Coope, Named entity recognition with parallel recurrent neural networks, in: ACL, 2018, pp. 69–74.

