An improved neural network for domain adaptive Chinese word segmentation

Abstract

Various methods have recently been proposed to address the weak domain adaptability of Chinese word segmentation (CWS) models based on neural networks. However, although some of these improved models achieve high segmentation accuracy in a specific domain, they need to be retrained when applied to another. After rethinking domain adaptability, we suggest two criteria for measuring it: segmentation accuracy and universality. Taking these two criteria into consideration, an improved neural-based CWS model is proposed, which incorporates a common lexicon and unlabeled data into BERT. To make the most of the lexicon, a new method is proposed to construct the lexicon-based feature vector. In addition, domain-specific words can be effectively extracted by pre-training a language model on the unlabeled data. Finally, a GRU-like gate structure is used to integrate the lexicon-based feature vector and the language model into BERT. Experiments on five different domains show that the resulting model achieves higher F1 values and OOV recall than BiLSTM-based baselines, indicating strong domain adaptability.

Keywords

Chinese word segmentation, domain adaptability, neural network

1. Introduction

For Chinese, a word rather than a single character should be the smallest semantic unit, since a character alone is often not enough to express the meaning it carries in the text. Hence, CWS is the first step in many Chinese natural language processing tasks, and its importance is self-evident.

In recent years, great success has been achieved in segmentation accuracy with the use of neural-based CWS models [1, 2, 3, 12]. Compared with CWS models based on traditional machine learning, neural-based CWS models are superior in utilizing long-distance contextual information and avoid manual feature engineering [1, 2]. However, such models still suffer from weak domain adaptability. Specifically, the test set contains many domain-specific words that are not available in the training set. In addition, when the training and test sets belong to different domains, the same word may appear in different contexts, which in turn leads to different segmentation [4, 5]. In the following, the domains of the training and test sets are referred to as the source and target domain, respectively.

It has been pointed out that out-of-vocabulary (OOV) words cannot be well recognized by changing the neural architecture alone, without additional resources [6]. Lexicons and unlabeled data contain many domain-specific words, and the latter also reflects the contextual information of the target domain, so both can be used as additional resources. Various methods have been proposed to integrate lexicons and unlabeled data into neural-based CWS models [5, 7, 8, 23, 24]. These methods do improve segmentation accuracy in the target domain. However, some of these models can only be applied to a specific domain and need to be retrained when applied to another [8, 9]. Some universal models do exist [5, 7], but most researchers focus only on cross-domain segmentation, which studies transfer learning from the source domain to a specific target domain from the perspective of segmentation accuracy. This ignores the universality of the model, that is, whether the model can be applied to various target domains after training.

Based on the above analysis, we hold that domain adaptability means that the model can achieve outstanding segmentation accuracy in different domains without retraining. Therefore, two aspects, segmentation accuracy and universality, should be considered simultaneously when it comes to domain adaptability.

In this paper, we propose a novel neural-based CWS model which incorporates both common lexicons and unlabeled data derived from the target domain into BERT [10] to improve the domain adaptability of CWS models. With the consideration of the segmentation accuracy and universality, we adopt BERT instead of BiLSTM for CWS. Since BERT is pre-trained on a large-scale unlabeled corpus, it is, of course, suitable for more domains than BiLSTM. The pre-trained parameters of BERT are fine-tuned using the annotation set of the source domain to obtain the BERT based CWS model (BERT-CWS). In order to further enhance the domain adaptability, a common lexicon is used to design a feature vector for each character in the sequence. The vector can represent the position of a character in the word containing it, which is beneficial for word segmentation. Finally, by incorporating the vectors into BERT-CWS, we realize the fusion of BERT and lexicons and get the model BERT-DICT-CWS.

It is almost impossible for the common lexicons to include all domain-specific words. To deal with this problem, we use the target domain unlabeled data to train a language model which can reflect the compactness between characters, so as to capture domain-specific words not included in lexicons. However, because the unlabeled data comes from the target domain, after incorporating the language model into BERT-DICT-CWS, the final model BERT-DICT-LM-CWS is almost equivalent to BERT-DICT-CWS in terms of universality. The improvement of BERT-DICT-LM-CWS is that its segmentation accuracy is higher than BERT-DICT-CWS in a specific domain. Experiments show that the domain adaptability of BERT-DICT-LM-CWS is very strong.

2. Related work

To improve the domain adaptability of neural-based CWS models, our final model uses BERT to replace the basic BiLSTM, and we propose a new method to use lexicons and unlabeled data. This section analyzes some of the work related to the use of BERT, lexicons and unlabeled data from the perspective of segmentation accuracy and universality.

[11] proposed the Transformer, which is based solely on the self-attention mechanism. Compared with the recurrent neural network (RNN), the Transformer is parallelizable and superior in training speed [11]. [10] pre-trained BERT, which is based on the Transformer, on a large corpus and then fine-tuned the parameters of the pre-trained model with the annotated dataset for a specific task. Since the corpus used in pre-training was derived from various domains, BERT has wide universality.

[2] used an idiom lexicon to replace the idioms matched in the training set with a specific tag before training the LSTM-based CWS model. [13, 21] selected several words randomly from lexicons to generate pseudo-labeled data to compensate for the lack of training data. To provide valuable information about different aspects, [7] constructed an 8-dimensional feature vector based on lexicons for each character in the sequence and used BiLSTM to integrate it. [22] incorporated the unlabeled data and lexicon into model training as indirect supervision.

[14] obtained numeric statistical features from target domain unlabeled data and incorporated them into LSTM. [15] proposed a general semi-supervised approach for adding pre-trained context embeddings, obtained from a bidirectional language model pre-trained on a large amount of unlabeled data, to sequence labeling tasks. [8] used a gate mechanism to combine BiLSTM with the bidirectional language model trained on the target domain unlabeled data. Among them, [8, 14] did not improve the universality of their models because they only used unlabeled data derived from a specific target domain.

In order to improve the universality of our model, we incorporate lexicons (not including target domain lexicon) into BERT-CWS. In addition, to make up for the shortcoming that the common lexicon cannot contain all of domain-related vocabularies, and further improve the segmentation accuracy of our model, we incorporate the target domain unlabeled data into BERT-DICT-CWS.

3. Model introduction

This section first introduces how to adapt BERT to CWS. Then, a new approach to constructing the lexicon-based feature vector is proposed, which incorporates useful lexicon information into BERT-CWS to obtain the model BERT-DICT-CWS and improves the domain adaptability of BERT-CWS in terms of both universality and segmentation accuracy. Finally, without reducing the universality of BERT-DICT-CWS, we incorporate a language model pre-trained on the target domain unlabeled data into BERT-DICT-CWS, so that the segmentation accuracy of BERT-DICT-LM-CWS is further improved in a specific target domain.

3.1 BERT-CWS

CWS is generally treated as a sequence labeling task. Specifically, each character in the sequence is marked with a tag from {B, M, E, S}, where B (Begin), M (Middle) and E (End) represent the first, middle and last character of a word respectively, and S (Single) represents a word composed of a single character. Our BERT-CWS model is also based on the idea of sequence labeling. We choose BERT because it is based on the bidirectional Transformer, which is constructed entirely on self-attention and multi-head attention. For a single character, the Transformer calculates its attention with all other characters in the sequence. The resulting character vector embodies its dependence on surrounding characters, which is beneficial for word segmentation. [10] used the masked language model (Masked LM) objective to pre-train BERT on a large-scale corpus. Hence, the character vector obtained by the pre-trained BERT model reflects its context very well.
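The B/M/E/S scheme described above can be illustrated with a minimal Python sketch. The function names `words_to_tags` and `tags_to_words` and the sample sentence are illustrative assumptions, not part of the original model code.

```python
def words_to_tags(words):
    """Map a list of segmented words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends on B or M
        words.append(current)
    return words


words = ["中华民族", "的", "产品"]
tags = words_to_tags(words)
print(tags)  # ['B', 'M', 'M', 'E', 'S', 'B', 'E']
print(tags_to_words("".join(words), tags))
```

Decoding closes a word at every E or S tag, so a predicted tag sequence can always be turned back into a segmentation even if it is not perfectly well-formed.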

Figure 1.

BERT-DICT-CWS. Trm means Transformer block.

The BERT-CWS model is shown in the left dotted box of Fig. 1. In the word embedding layer, BERT uses three kinds of embeddings to represent a single character. First, each character looks up its corresponding embedding from the WordPiece embeddings [16]. Next, because the Transformer cannot reflect order information by itself, BERT adds position information for each character. Finally, it also incorporates the embedding of the sentence in which the character is located. The formula of the word embedding for the $i$th character is as follows:

$\displaystyle E_{i}=E^{\textit{token}}_{i}\oplus E^{\textit{position}}_{i}\oplus E^{\textit{segment}}_{i}.$ (1)

where $E^{\textit{token}}_{i}$, $E^{\textit{position}}_{i}$ and $E^{\textit{segment}}_{i}$ denote the WordPiece embedding, the position embedding and the sentence embedding of the $i$th character, respectively.

Through the bidirectional Transformer module, we can get the output of BERT.

$\displaystyle{\overrightarrow{h}}^{\textit{BERT}}_{i}=\textit{Transformer}\left(E_{i},{\overrightarrow{h}}^{\textit{BERT}}_{i-1}\right),\ \ \ \ \forall i\in\left[2,n\right],\ {\overrightarrow{h}}^{\textit{BERT}}_{1}=E_{1}.$ (2)

$\displaystyle{\overleftarrow{h}}^{\textit{BERT}}_{i}=\textit{Transformer}\left(E_{i},{\overleftarrow{h}}^{\textit{BERT}}_{i+1}\right),\ \ \ \ \forall i\in\left[1,n-1\right],\ {\overleftarrow{h}}^{\textit{BERT}}_{n}=E_{n}.$ (3)

$\displaystyle h^{\textit{BERT}}_{i}=\left[{\overrightarrow{h}}^{\textit{BERT}}_{i};{\overleftarrow{h}}^{\textit{BERT}}_{i}\right].$ (4)

where ${\overrightarrow{h}}^{\textit{BERT}}_{i}$ is the forward Transformer hidden state and ${\overleftarrow{h}}^{\textit{BERT}}_{i}$ is the backward Transformer hidden state; $h^{\textit{BERT}}_{i}$ is the concatenation of ${\overrightarrow{h}}^{\textit{BERT}}_{i}$ and ${\overleftarrow{h}}^{\textit{BERT}}_{i}$.

Finally, $h^{\textit{BERT}}_{i}$ is normalized by the softmax nonlinear mapping layer, so that BERT can be applied to the sequence labeling task.

$\displaystyle{\hat{y}}_{i}=\textit{softmax}\left(Wh^{\textit{BERT}}_{i}+b\right).$ (5)

where ${\hat{y}}_{i}$ represents the prediction probabilities, $W$ is the weight matrix, and $b$ is the bias.

Given the truth labels $y_{1},y_{2},\ldots,y_{n}$ , where $y_{i}$ is represented by one-hot vector, the cross-entropy loss function can be formulated as follows.

$\displaystyle\textit{Loss}\left(Y,\hat{Y}\right)=-\frac{1}{n}\sum^{n}_{i=1}y^{T}_{i}\log{\hat{y}}_{i}.$ (6)

where $Y=\left\{y_{1},y_{2},\ldots,y_{n}\right\}$ and $\hat{Y}=\{{\hat{y}}_{1},{\hat{y}}_{2},\ldots,{\hat{y}}_{n}\}$ .

By minimizing the cross-entropy loss function, the model implements backpropagation and updates the parameters of the model, making BERT suitable for CWS.
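As a small numeric illustration of Eqs. (5) and (6), the following pure-Python sketch applies softmax to raw tag scores and averages the cross-entropy over a sequence. The logits are hypothetical values standing in for $Wh^{\textit{BERT}}_{i}+b$; they are not taken from the trained model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores (Eq. 5)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_labels, predictions):
    """Average of -y_i^T log(y_hat_i) over the n characters (Eq. 6)."""
    n = len(one_hot_labels)
    loss = 0.0
    for y, y_hat in zip(one_hot_labels, predictions):
        loss -= sum(y_k * math.log(p_k) for y_k, p_k in zip(y, y_hat))
    return loss / n


# Hypothetical logits for a two-character sequence, tag order (B, M, E, S).
logits = [[2.0, 0.1, 0.3, 0.5], [0.2, 0.1, 2.5, 0.4]]
y_hat = [softmax(row) for row in logits]
y = [[1, 0, 0, 0], [0, 0, 1, 0]]  # gold tags: B, E
print(cross_entropy(y, y_hat))
```

Because each $y_{i}$ is one-hot, only the log-probability of the gold tag contributes to the loss at each position.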

3.2 Usage of lexicons

Lexicons contain some words that do not exist in the source domain training set but do exist in the target domain test set. For example, “柴胡” (Bupleurum) is unlikely to appear in the source domain, but it does appear in both lexicons and the medical domain. Therefore, lexicons contribute to improving domain adaptability. For an input sequence, we construct a lexicon-based feature vector for each character in it. Because they are derived from word matches, these feature vectors only capture the local dependencies of the characters, so they are passed to a BiLSTM to obtain long-distance dependencies. Finally, the output of the BiLSTM is concatenated with the output of BERT to obtain the model BERT-DICT-CWS.

The position of a character in a word can be obtained from lexicons, but this position is sometimes not fixed in a sequence. For example, in the sequence “产品质量”, “品” is the last character of the word “产品” but also the first character of the word “品质”. In order to express the possibility of a character being tagged B, M, E or S in a neural network, we propose the following method.

For an input sequence $C=\left(c_{1},c_{2},\ldots,c_{n}\right)$, we use a four-dimensional feature vector $f_{i}$ to represent its corresponding character $c_{i}$. This yields the feature sequence $F=\left(f_{1},f_{2},\ldots,f_{n}\right)$. The four dimensions of $f_{i}$ correspond to B, M, E and S, respectively. For example, $f_{i2}$ represents the second dimension of $f_{i}$, and its value represents the probability that $c_{i}$ is marked as M.

For each character $c_{i}$ , the specific algorithm for obtaining its feature vector $f_{i}$ is as follows:

Algorithm: getting the corresponding feature vector $f_{i}$ for $c_{i}$

Input: sequence $C=\left(c_{1},c_{2},\ldots,c_{n}\right)$, character $c_{i}$, lexicon

Output: $f_{i}$

Initialization: flag $=0$, $f_{i}=(0,0,0,0)$

Calculate the n-grams for $c_{i}$, as shown in Table 1.

For word in n-grams do
    if word in lexicon then
        flag $=1$, index $=$ the position of $c_{i}$ in word
        if index $=0$ then $f_{i1}$ += 1
        elif index $=$ len(word) $-1$ then $f_{i3}$ += 1
        else $f_{i2}$ += 1
    else continue

if flag $=0$ then $f_{i4}$ += 1

$f_{i}=\frac{(f_{i1}+0.1,f_{i2}+0.1,f_{i3}+0.1,f_{i4}+0.1)}{f_{i1}+0.1+f_{i2}+0.1+f_{i3}+0.1+f_{i4}+0.1}$

Return $f_{i}$

End

To account for out-of-vocabulary words and the deficiencies of string matching, a smoothing constant of 0.1 is added to each dimension before normalization.

The BERT-DICT-CWS model is shown in Fig. 1. Here, the lexicon-based feature vector is passed to a BiLSTM because the lexicon information only captures local dependencies, while the BiLSTM can capture the long-distance dependencies of a character, so that the representation vector of a single character better expresses its position within the word. The output formula of the BiLSTM is as follows.

$\displaystyle h^{\textit{DICT}}_{i}=\textit{BiLSTM}\left(f_{i},{\overrightarrow{h}}^{\textit{DICT}}_{i-1},{\overleftarrow{h}}^{\textit{DICT}}_{i+1};\theta\right).$ (7)

where $f_{i}$ is the feature vector for $c_{i}$, ${\overrightarrow{h}}^{\textit{DICT}}_{i-1}$ is the forward hidden layer state at position $i-1$, ${\overleftarrow{h}}^{\textit{DICT}}_{i+1}$ is the backward hidden layer state at position $i+1$, and $\theta$ denotes the parameters of the BiLSTM.
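The feature-vector algorithm and its 0.1 smoothing can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the `smooth` parameter and the three-word toy lexicon are illustrative, and the n-gram enumeration follows the 2-gram to 5-gram templates of Table 1.

```python
def ngrams_containing(chars, i, max_n=5):
    """Yield (word, index of chars[i] in word) for every 2..max_n-gram covering position i."""
    for n in range(2, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(chars):
                yield chars[start:start + n], i - start


def lexicon_feature(chars, i, lexicon, smooth=0.1):
    """Return the smoothed, normalized (B, M, E, S) vector for chars[i]."""
    counts = [0, 0, 0, 0]
    matched = False
    for word, index in ngrams_containing(chars, i):
        if word not in lexicon:
            continue
        matched = True
        if index == 0:
            counts[0] += 1          # c_i begins a lexicon word
        elif index == len(word) - 1:
            counts[2] += 1          # c_i ends a lexicon word
        else:
            counts[1] += 1          # c_i is inside a lexicon word
    if not matched:
        counts[3] += 1              # no lexicon word covers c_i
    total = sum(counts) + 4 * smooth
    return [(c + smooth) / total for c in counts]


lexicon = {"产品", "品质", "质量"}  # toy lexicon for illustration only
print(lexicon_feature("产品质量", 1, lexicon))
```

For “品” in “产品质量”, the toy lexicon matches “产品” (as E) and “品质” (as B), so the B and E dimensions each receive $1.1/2.4$ while M and S each receive $0.1/2.4$, reflecting the ambiguity discussed above.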

We concatenate the output of BERT and BiLSTM. The formula is as follows.

$\displaystyle h_{i}=h^{\textit{BERT}}_{i}\oplus h^{\textit{DICT}}_{i}.$ (8)

The following softmax normalization process and the cross-entropy loss function are the same as before.

Table 1

n-grams of $c_{i}$

Type	Template
2-gram	$c_{i-1}c_{i}\ ,\ c_{i}c_{i+1}$
3-gram	$c_{i-2}c_{i-1}c_{i}\ ,\ c_{i-1}c_{i}c_{i+1}\ ,\ c_{i}c_{i+1}c_{i+2}$
4-gram	$c_{i-3}c_{i-2}c_{i-1}c_{i}\ ,\ c_{i-2}c_{i-1}c_{i}c_{i+1}\ ,\ c_{i-1}c_{i}c_{i+1}c_{i+2}\ ,\ c_{i}c_{i+1}c_{i+2}c_{i+3}$
5-gram	$c_{i-4}c_{i-3}c_{i-2}c_{i-1}c_{i}\ ,\ c_{i-3}c_{i-2}c_{i-1}c_{i}c_{i+1}\ ,\ \ldots\ ,\ c_{i}c_{i+1}c_{i+2}c_{i+3}c_{i+4}$

Figure 2.

BERT-DICT-LM-CWS. F_LSTM means forward LSTM.

3.3 Usage of unlabeled data

Some specialized domains are poorly covered by the corpus on which BERT was pre-trained, and most of their domain vocabulary does not exist in the common lexicon. In this case, the unlabeled dataset of the target domain is used to pre-train a language model to improve the word segmentation accuracy in that domain.

The language model uses the previous characters in the sequence to predict the probability of the next character. For a sequence $C=(c_{1},c_{2},\ldots,c_{n})$, the forward language model predicts $c_{i}$ from $c_{1},c_{2},\ldots,c_{i-1}$, that is, it computes $\mathrm{P}\left(c_{i}|c_{1},c_{2},\ldots,c_{i-1}\right)$, and the backward language model computes $\mathrm{P}\left(c_{i}|c_{i+1},c_{i+2},\ldots,c_{n}\right)$. This probability embodies the associativity between characters. For example, in “中华民族”, P(“民” | “中华”) is smaller than P(“族” | “中华民”) because many more characters can follow “中华” than “中华民”. The more possible continuations there are, the smaller the probability P and the looser the association between the characters. In short, the language model can reflect the associativity between characters, which is very useful for capturing domain-specific words and making up for the deficiency of lexicons.
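The intuition behind this probability can be illustrated with a simple count-based estimate. Note that this is a deliberate simplification: the model in this paper uses forward and backward LSTM language models, not counts, and the four-sentence corpus and the function name `next_char_probability` are hypothetical.

```python
from collections import Counter


def next_char_probability(corpus, prefix, nxt):
    """Estimate P(nxt | prefix) by counting what follows `prefix` in the corpus."""
    followers = Counter()
    for sentence in corpus:
        start = sentence.find(prefix)
        while start != -1:
            end = start + len(prefix)
            if end < len(sentence):
                followers[sentence[end]] += 1
            start = sentence.find(prefix, start + 1)
    total = sum(followers.values())
    return followers[nxt] / total if total else 0.0


# Toy unlabeled corpus, for illustration only.
corpus = ["中华民族", "中华文化", "中华民族精神", "中华料理"]
print(next_char_probability(corpus, "中华", "民"))    # 2 of 4 continuations
print(next_char_probability(corpus, "中华民", "族"))  # 2 of 2 continuations
```

In this toy corpus “中华” has three distinct continuations while “中华民” has only one, so the second probability is higher, which is the signal used to detect that “民族” binds tightly to its left context.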

The neural network structure of the language model can be seen in the right dotted box of Fig. 2. For the character $c_{i}$ , we use the input representation of BERT to get its word embedding. Then through the forward and backward LSTM, the hidden states ${\overrightarrow{h}}^{\textit{LM}}_{i}$ and ${\overleftarrow{h}}^{\textit{LM}}_{i}$ can be obtained respectively. Finally, a softmax non-linear layer outputs the prediction probability ${\hat{y}}^{f}_{i}$ and ${\hat{y}}^{b}_{i}$ for the next character.

After incorporating a language model pre-trained on the target domain unlabeled data, our final model BERT-DICT-LM-CWS is shown in Fig. 2.

The difficulty in designing this model is how to handle the relationship between the outputs of BERT, the lexicon module and the language model, i.e. $h^{\textit{BERT}}$, $h^{\textit{DICT}}$ and $h^{\textit{LM}}$ ($h^{\textit{LM}}={\overrightarrow{h}}^{\textit{LM}}\oplus{\overleftarrow{h}}^{\textit{LM}}$). Since the language model is incorporated to make up for the deficiency of the lexicon, we concatenate $h^{\textit{DICT}}$ and $h^{\textit{LM}}$, and get the result $h^{\textit{DICT}+\textit{LM}}$.

Concatenation is good at retaining the original information of each part, but it has the disadvantage that, at each time-step, the weight of a particular part for the current sequence cannot be highlighted. In addition, repeated concatenation leads to excessive dimensionality. Following [8], we use a GRU-like gate mechanism to handle the relationship between $h^{\textit{DICT}+\textit{LM}}$ and $h^{\textit{BERT}}$.

$\displaystyle z=\sigma\left(U_{z}h^{\textit{DICT}+\textit{LM}}+W_{z}h^{\textit{BERT}}+b_{z}\right).$ (9)

$\displaystyle r=\sigma\left(U_{r}h^{\textit{DICT}+\textit{LM}}+W_{r}h^{\textit{BERT}}+b_{r}\right).$ (10)

$\displaystyle\tilde{h}=\textit{tanh}\left(U_{h}h^{\textit{DICT}+\textit{LM}}+W_{h}(r\odot h^{\textit{BERT}})+b_{h}\right).$ (11)

$\displaystyle h=\left(1-z\right)\odot h^{\textit{BERT}}+z\odot\tilde{h}.$ (12)

where $z$ and $r$ represent the update and reset gate respectively.

Finally, the obtained $h$ is normalized by softmax to predict the label for each character.
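The gate of Eqs. (9) to (12) can be sketched for a single time-step in pure Python. This is a minimal sketch, not the training code: the vector sizes, the random initialization in `random_params` and the dictionary layout of the parameters are assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_fusion(h_aux, h_bert, p):
    """Fuse h^{DICT+LM} (h_aux) with h^{BERT} following Eqs. (9)-(12)."""
    # Eq. (9): update gate z
    z = [sigmoid(u + w + b) for u, w, b in
         zip(matvec(p["U_z"], h_aux), matvec(p["W_z"], h_bert), p["b_z"])]
    # Eq. (10): reset gate r
    r = [sigmoid(u + w + b) for u, w, b in
         zip(matvec(p["U_r"], h_aux), matvec(p["W_r"], h_bert), p["b_r"])]
    # Eq. (11): candidate state built from the reset-gated BERT vector
    gated = [r_k * h_k for r_k, h_k in zip(r, h_bert)]
    h_tilde = [math.tanh(u + w + b) for u, w, b in
               zip(matvec(p["U_h"], h_aux), matvec(p["W_h"], gated), p["b_h"])]
    # Eq. (12): interpolate between the BERT vector and the candidate
    return [(1 - z_k) * h_k + z_k * t_k for z_k, h_k, t_k in zip(z, h_bert, h_tilde)]


def random_params(d_bert, d_aux, seed=0):
    """Hypothetical small random parameters; real values are learned in training."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for name in ("z", "r", "h"):
        p["U_" + name] = mat(d_bert, d_aux)
        p["W_" + name] = mat(d_bert, d_bert)
        p["b_" + name] = [0.0] * d_bert
    return p


h_bert = [0.2, -0.1, 0.4]  # toy 3-dimensional BERT output
h_aux = [0.5, 0.3]         # toy 2-dimensional h^{DICT+LM}
print(gate_fusion(h_aux, h_bert, random_params(3, 2)))
```

The fused vector keeps the dimensionality of $h^{\textit{BERT}}$, which is why the gate avoids the dimension growth that repeated concatenation would cause.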

4. Experiments

4.1 Datasets

We use Chinese Treebank 5.0 (CTB5) as the source domain labeled training set. Zhuxian (a Chinese novel) and four other self-made datasets are used as the target domain test sets. Zhuxian was annotated by [17]. In accordance with the method of [18], we have produced development and test sets in four different domains, including literature, medicine, computer and finance. The literary dataset comes from “A Dream of Red Mansions” and “Journey to the West”. The medical dataset comes from “Hu Xishu talks about typhoid fever” and “Chinese Pharmacy”. The computer dataset comes from HowNet. The financial dataset comes from “Modern Monetary Theory”.

In addition, the lexicon we use is a common lexicon which comes from jieba. The unlabeled datasets come from the corresponding field of the test set. For example, when testing Zhuxian, we will use its remaining data, excluding the development and test sets, as the unlabeled dataset.

Finally, to test the performance of the proposed model in the same domain, we also use Chinese Treebank 6.0 (CTB6), as well as PKU and MSR from SIGHAN 2005 [25].

Table 2
Datasets

Datasets		CTB5	ZX	Lit	Med	Com	Fin	PKU	MSR	CTB6
Train	#sentence	18.1K	CTB5’s training set					19.1K	86.9K	25.5K
	#word	49.4K						1.11M	2.37M	0.70M
Dev	#sentence	$-$	0.8k	0.8k	0.6k	0.5k	0.5k	$-$	$-$	$-$
	#word	$-$	20.4k	19.2k	14.0k	18.7k	15.2k	$-$	$-$	$-$
Test	#sentence	$-$	1.4k	1k	0.7k	0.7k	0.5k	1.9K	4.0K	2.8K
	#word	$-$	34.4k	24.3k	13.5k	23.5k	14.4k	0.10M	0.18M	0.13M

4.2 Configurations

In the experiment, we mainly trained three models: BERT-CWS, BERT-DICT-CWS and BERT-DICT-LM-CWS. BERT-CWS and BERT-DICT-CWS can be used to test different target domains after training on CTB5. However, the BERT-DICT-LM-CWS model needs to be retrained in different target domains because it requires the target domain unlabeled data.

Table 3 shows some of the important hyperparameters of these three models. In order to fine-tune the parameters of the pre-trained BERT model, we set a relatively small initial learning rate, which decays linearly after each training batch. Too small a learning rate can lead to overfitting, so we apply dropout to prevent this problem. In addition, the model uses Adam [19] as the optimizer.

Table 3
Hyperparameters

Parameters	Values	Parameters	Values
BERT_embedding_size	128	Initial learning rate	2e-5
DICT_feature_size	4	Dropout rate	0.1
BERT_hidden_size	768	Train_batch_size	32
DICT_BiLSTM_hidden_size	128	Dev_batch_size	8
LM_BiLSTM_hidden_size	300	Test_batch_size	8
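The linear learning-rate decay mentioned above can be sketched as follows, starting from the initial learning rate of 2e-5 in Table 3. The total number of steps and the choice to decay all the way to zero are assumptions; the paper does not report either.

```python
def linear_decay(initial_lr, step, total_steps):
    """Learning rate after `step` batches, falling linearly to 0 at `total_steps`."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_lr * remaining


total_steps = 1000  # hypothetical; the number of training steps is not reported
for step in (0, 250, 500, 1000):
    print(step, linear_decay(2e-5, step, total_steps))
```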

4.3 Results and discussion

Table 4 shows the F1 values of different neural-based CWS models on eight datasets. The cross-domain segmentation results are shown in the second to sixth columns, and the last three columns give the in-domain F1 values. The bold models in this table are retrained in each target domain. The following analysis focuses on the domain adaptability of our model.

Table 4
F1 values of different neural-based CWS models

Models	Source	CTB5
	Target	ZX	Lit	Med	Fin	Com	PKU	MSR	CTB6
BiLSTM		89.05	83.42	85.61	92.70	92.91	95.05	96.50	95.16
BiLSTM $+$ CRF		89.44	83.71	86.04	93.01	93.11	95.10	97.0	95.60
Zhang et al. [17]		88.34	$-$	$-$	$-$	$-$	$-$	$-$	$-$
Liu et al. [20]		90.63	$-$	$-$	$-$	$-$	$-$	$-$	$-$
Zhang et al. [5]		90.66	$-$	$-$	$-$	$-$	$-$	$-$	$-$
Yang et al. [26]		$-$	$-$	$-$	$-$	$-$	96.30	97.50	96.20
Zhang et al. [7]		91.10	85.32	88.37	95.10	95.07	96.20	97.60	96.10
Zhao et al. [8] BiLSTM $+$ LM		90.60	84.81	87.16	94.40	94.86	95.39	96.84	95.57
BERT-CWS		91.43	85.80	88.58	95.80	96.01	96.46	98.37	96.76
BERT-DICT-CWS		91.84	86.27	89.25	96.38	96.47	96.82	98.58	97.08
BERT-DICT-LM-CWS		92.25	86.81	89.68	96.89	96.96	97.34	98.75	97.62
BERT-DICT-Com-CWS		91.81	86.27	89.35	96.29	96.52	96.71	98.53	97.02

In the first block, we present two benchmark models for CWS, BiLSTM and BiLSTM $+$ CRF. They do not use additional resources such as lexicons or unlabeled data. Since the conditional random field (CRF) considers the dependencies between adjacent tags (for example, neither “B” nor “S” should appear directly after “B”), the segmentation accuracy of BiLSTM $+$ CRF is slightly better than that of BiLSTM.

In the second block, we present models using additional resources. [17] used external lexicons and unlabeled target domain data to improve domain adaptation for joint CWS and POS. [20] used partially-labeled data derived from target domain to train a CRF model. This approach can solve the problem of the insufficient training set, but the universality of the model cannot be enhanced. [5] attempted to solve the domain adaptability problem based on the semantic similarity between the target domain unlabeled data and the source domain labeled data. [8] used a language model pre-trained on unlabeled data to obtain co-occurrence information for words. They only considered the segmentation accuracy in a specific domain without considering the universality of the model. Because of this, it is necessary to retrain the model of [8] in different domains. [7] constructed an 8-dimensional lexicon-based feature vector and used two methods to integrate it into the neural network. Here, we just use the method of concatenation.

In the last block, we present the three models designed in this paper and a BERT-DICT-Com-CWS model. As can be seen from Table 4, the segmentation result of BERT-CWS is better than those of BiLSTM and BiLSTM $+$ CRF. This is mainly because BERT has been pre-trained on a large-scale corpus. The BERT-DICT-CWS model is better than BERT-CWS in all five domains, indicating that our method of integrating lexicon information into BERT-CWS can improve domain adaptability. The F1 values of BERT-DICT-LM-CWS are higher than those of BERT-DICT-CWS in all five domains. However, since retraining is required in each domain, this alone does not show that the domain adaptability of BERT-DICT-LM-CWS is strong. In order to study its universality, this experiment also gives the F1 values of BERT-DICT-Com-CWS. It is a special case of BERT-DICT-LM-CWS in which only the unlabeled data of the computer domain is incorporated into BERT-DICT-CWS, and it is not retrained for other domains. As can be seen, the segmentation accuracy of BERT-DICT-Com-CWS is comparable to that of BERT-DICT-CWS and has an advantage in the computer domain. This shows that the universality of BERT-DICT-LM-CWS is comparable to that of BERT-DICT-CWS, while its word segmentation accuracy in a specific domain is higher. In conclusion, the domain adaptability of our final model BERT-DICT-LM-CWS is very strong.

The in-domain segmentation performance of our model is analyzed as follows. In-domain means that the source and target domain data are derived from the same domain. [26] built a modular segmentation model and pre-trained the most important submodule using rich external sources. [7, 8] used lexicons and unlabeled data, respectively. The last three columns in Table 4 show that the F1 values of these three models are higher than those of the benchmark model BiLSTM, which demonstrates the usefulness of external resources. BERT is also pre-trained on large-scale datasets, which leads to the high performance of BERT-CWS. After using the scheme of this paper to integrate the lexicons and unlabeled data into BERT, the F1 value is further improved, which shows that our scheme is effective.

4.4 Analysis

Words that have not appeared in the labeled data are called out-of-vocabulary (OOV) words. Accurately identifying OOV words is difficult in cross-domain segmentation. The OOV recall can reflect the strength of domain adaptability to a certain degree. Therefore, Table 5 shows the OOV recall of BiLSTM and of the three models proposed in this paper in five different domains. It can be seen that BERT-CWS has a much higher OOV recall than BiLSTM. After incorporating the lexicon and unlabeled data, the OOV recall is further improved.

Table 5
OOV recall

Model	ZX	Lit	Med	Fin	Com
BiLSTM	74.00	70.21	72.15	82.41	81.23
BERT-CWS	87.73	81.13	83.61	90.15	89.01
BERT-DICT-CWS	88.65	81.88	84.51	90.85	90.26
BERT-DICT-LM-CWS	88.92	82.41	84.78	91.11	90.48
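For readers who want to reproduce this kind of evaluation, the following Python sketch computes word-level F1 and OOV recall by comparing character spans. It assumes the common definition in which a gold word counts as recalled only if the prediction contains exactly the same span; the paper does not state which scoring script was used, and the function names and the example sentence are illustrative.

```python
def spans(words):
    """Turn a word list into a list of (start, end) character spans."""
    out, start = [], 0
    for word in words:
        out.append((start, start + len(word)))
        start += len(word)
    return out


def f1_and_oov_recall(gold_words, pred_words, train_vocab):
    """Return (F1, OOV recall) for one sentence given the training vocabulary."""
    gold, pred = spans(gold_words), set(spans(pred_words))
    correct = sum(1 for s in gold if s in pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    # OOV words are gold words that never occur in the labeled training data.
    oov = [s for s, w in zip(gold, gold_words) if w not in train_vocab]
    oov_hit = sum(1 for s in oov if s in pred)
    oov_recall = oov_hit / len(oov) if oov else 0.0
    return f1, oov_recall


gold = ["柴胡", "可以", "入药"]
pred = ["柴", "胡", "可以", "入药"]   # the OOV word was split incorrectly
train_vocab = {"可以", "入药"}
print(f1_and_oov_recall(gold, pred, train_vocab))
```

In this example two of three gold words are recovered, but the only OOV word, “柴胡”, is missed, so the OOV recall is zero even though the F1 value is well above zero. This is why OOV recall is reported separately from F1.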

The main drawback of the proposed model is its large number of parameters. This leads to a longer training time, and word segmentation on the test set is about three times slower than with a normal BiLSTM-based CWS model. In order to limit the complexity of the model, a four-dimensional lexicon-based feature vector is constructed. Compared with the eight-dimensional lexicon-based feature vector proposed by [7], our method uses fewer dimensions and is superior in training speed.

To compare the two methods of constructing a lexicon-based feature vector proposed by [7] and us, respectively, we replace the feature vector part of Zhang’s model with ours without changing other parts. B_4 is used to represent our model, and B_8 is the model of Zhang. As can be seen from Fig. 3, these two feature vectors have almost the same effects. When the hidden layer dimension is small, our method will have a slight advantage. With the increase of dimension, the segmentation accuracy of both methods will decrease, and our method drops faster. In addition, since our feature vector uses only four dimensions, it will be faster in training speed.

Figure 3.

Comparison of two lexicon-based feature vectors. The horizontal axis represents the hidden layer dimension and the vertical axis represents the F1 value.

5. Conclusion

This paper analyzes the domain adaptability of CWS models based on neural networks and holds that the strength of domain adaptability should be measured in terms of both segmentation accuracy and universality. Based on this view, we propose to combine the BERT model with lexicons and target domain unlabeled data to improve domain adaptability. Experiments show that the fusion of BERT-CWS and a lexicon can improve both the segmentation accuracy and the universality of BERT-CWS. Moreover, incorporating the target domain unlabeled data into BERT-DICT-CWS improves the segmentation accuracy in a specific domain, but cannot improve the universality of BERT-DICT-CWS. In summary, our final BERT-DICT-LM-CWS model has very strong domain adaptability.

Acknowledgments

This work is supported by Zhejiang Provincial Technical Plan Project (No. 2020C03105).

References

1. Zheng, Chen and Xu, Deep learning for Chinese word segmentation and POS tagging, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013, pp. 647–657.

2. Chen, Qiu, Zhu et al., Long short-term memory neural networks for Chinese word segmentation, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 1197–1206.

3. Yao and Huang, Bi-directional LSTM recurrent neural network for Chinese word segmentation, International Conference on Neural Information Processing. Springer, Cham. 2016, pp. 345–353.

4. Liu and Zhang, Unsupervised domain adaptation for joint segmentation and POS-tagging, Proceedings of COLING 2012: Posters. 2012, pp. 745–754.

5. Zhang, Miao et al., Addressing domain adaptation for Chinese word segmentation with instances-based transfer learning, Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham. 2018, pp. 24–36.

6. Ganchev and Weiss, State-of-the-art Chinese word segmentation with Bi-LSTMs, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, pp. 4902–4908.

7. Zhang, Liu and Fu, Neural networks incorporating lexicons for Chinese word segmentation, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. 2018, pp. 5682–5689.

8. Zhao, Zhang, Wang et al., Neural networks incorporating unlabeled and partially-labeled data for cross-domain Chinese word segmentation, IJCAI. 2018, pp. 4602–4608.

9. Shao, Zheng, Yang et al., Domain-specific Chinese word segmentation based on bi-directional long-short term memory model, IEEE Access 7 (2019), 12993–13002.

10. Devlin, Chang M.W., Lee et al., BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR. (2018), abs/1810.04805.

11. Vaswani, Shazeer, Parmar et al., Attention is all you need, Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

12. Wang and Guo, Learning Chinese word segmentation based on bidirectional GRU-CRF and CNN network model, IJTHI 15 (2019), 47–62.

13. Liu et al., Neural Chinese word segmentation with lexicon knowledge, CCF International Conference on Natural Language Processing and Chinese Computing. Springer, Cham. 2018, pp. 80–91.

14. Zheng, Che, Guo et al., Enhancing LSTM-based word segmentation using unlabeled data, Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, 2017, pp. 60–70.

15. Peters M.E., Ammar, Bhagavatula et al., Semi-supervised sequence tagging with bidirectional language models, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2017, pp. 1756–1765.

16. Schuster, Chen et al., Google’s neural machine translation system: Bridging the gap between human and machine translation, CoRR. (2016), abs/1609.08144.

17. Zhang, Che et al., Type-supervised domain adaptation for joint segmentation and POS-tagging, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 588–597.

18. Jiang, Huang and Liu, Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging: a case study, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 1, 2009, pp. 522–530.

19. Kingma D.P. and Ba, Adam: A method for stochastic optimization, ICLR, arXiv preprint arXiv:1412.6980 (2014).

20. Liu, Zhang, Che et al., Domain adaptation for CRF-based Chinese word segmentation using free annotations, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 864–874.

21. Liu et al., Neural Chinese word segmentation with dictionary, Neurocomputing 338 (2019), 46–54.

22. Liu et al., Neural Chinese word segmentation with lexicon and unlabeled data via posterior regularization, CoRR. (2019), abs/1905.01963.

23. Bao et al., Neural domain adaptation for Chinese word segmentation, IALP (2017), 131–134.

24. Zhang et al., Improving cross-domain Chinese word segmentation with word embeddings, CoRR. (2019), abs/1903.01698.

25. Emerson, The Second International Chinese Word Segmentation Bakeoff, Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, SIGHAN@IJCNLP, 2005.

26. Yang, Zhang and Dong, Neural word segmentation with rich pretraining, ACL 1 (2017), 839–849.
