Abstract
Semantic matching is one of the critical technologies for intelligent customer service. Since Bidirectional Encoder Representations from Transformers (BERT) was proposed, fine-tuning a large-scale pre-trained language model has become a general method for text semantic matching. In practical applications, however, the accuracy of the BERT model is limited by the size of the pre-training corpus and by the proper nouns of the target domain. To solve this problem, we propose a knowledge enhancement method that masks the input based on a domain dictionary. First, we use keyword matching to recognize and mask domain words in the model input. Second, we use self-supervised learning to inject knowledge of the target domain into the BERT model. Third, we fine-tune the BERT model on the public datasets LCQMC and BQboost. Finally, we test the model’s performance on a financial company’s user data. The experimental results show that with our method and BQboost, accuracy increases by 12.12% on average in practical applications.
Introduction
With the development and application of Internet technologies such as big data and artificial intelligence, the traditional lending model of commercial banks has undergone disruptive changes. Intelligent customer service based on artificial intelligence can divert traffic, reduce labor costs, and improve operational efficiency and user experience [7–9]. Text semantic matching, which estimates the semantic similarity between a source text and a target text [3,22], is the key technology of intelligent customer service. In loan Q&A, the semantic matching task is to calculate the similarity between a user question and the standard questions in the knowledge base. However, user questions are not standardized, and the names of the many products and services of large enterprises are uncommon, which complicates semantic matching [20].
Text semantic matching methods are currently divided into traditional methods and deep learning methods. Traditional methods, such as TF-IDF, BM25, and VSM, are statistical and use the degree of lexical overlap to calculate the literal similarity between two texts [13–15,18]. Deep learning has developed rapidly in recent years and is widely used in natural language processing [1,21,23,24], gradually replacing traditional methods. In 2013, Mikolov et al. [12] proposed the Word2vec model, which builds distributed word vectors to replace one-hot encoding and thereby solves the problem of sparse encoding. However, the Word2vec model does not solve the problem of polysemy. In 2017, Vaswani et al. [19] proposed the Transformer, a structure based on attention. The Transformer can process all words in parallel and addresses the problem of polysemy with positional encoding. In 2018, Devlin et al. [4] proposed Bidirectional Encoder Representations from Transformers (BERT), a pre-trained representation model based on a large-scale corpus that stacks the encoder part of the Transformer. Devlin et al. [5] also described the Masked LM and Next Sentence Prediction pre-training tasks and the fine-tuning procedure that adapts BERT to downstream tasks. Fine-tuned BERT achieves state-of-the-art results in many NLP tasks, demonstrating the great potential of the fine-tuning method.
Since BERT was proposed, fine-tuning a large-scale pre-trained language model has become a general method for text semantic matching. However, owing to the implicit meaning and the size of the pre-training corpus, the scenarios in which the BERT model applies are limited. To improve the performance of BERT in the target domain, Sun et al. [17] used within-task, in-domain, and cross-domain corpora to perform incremental pre-training on BERT. The results show that incremental pre-training can improve the model’s performance in the target domain.
The above improvement to the pre-training phase shows its superiority on multiple open datasets. However, limited by the quality and quantity of the corpus in the target domain, the BERT model still lacks domain knowledge, which results in low accuracy in precise semantic matching. To solve these problems, we propose a knowledge enhancement method based on a domain dictionary mask. First, in pre-training, the corpus related to the target domain is used to train the BERT model with random and dictionary masking. Second, the general-domain dataset LCQMC [10] is used to fine-tune the model and to test the influence of the pre-training strategy on model performance. Finally, after its domain-specific vocabulary is replaced, the professional-domain dataset BQ [2] is used to fine-tune the model again to optimize its performance in downstream tasks. Experimental results show that the enhanced BERT improves on the original BERT by an average of 12.12% on downstream tasks.
The remainder of the paper is organized as follows: Section 1 describes the model structure and the knowledge enhancement method in detail. Section 2 presents the results and discussion. Finally, Section 3 concludes the paper.
Knowledge enhancement of domain dictionary mask BERT
Since BERT was proposed, a new pre-train and fine-tune paradigm has emerged in natural language processing. A model pre-trained on a large-scale corpus can effectively alleviate the problem of insufficient corpus. The downstream task then needs only a small corpus, with which continued fine-tuning may improve the model’s performance [24]. The pre-train and fine-tune procedure is shown in equations (1)–(2).
In equations (1)–(2), pretrain is a function that represents the pre-training strategy, and fine_tuning is a function that describes the fine-tuning process. Pre-training on the domain-specific dataset A yields model_pretrain. Fine-tuning model_pretrain on the downstream task dataset B yields the final model, model_final.
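Based on the description above, equations (1)–(2) can plausibly be written in the following form. This is a reconstruction from the surrounding text rather than the original typesetting; the symbol for the original BERT model is an assumed name, while the function and dataset names follow the text.

```latex
\begin{align}
\mathrm{model}_{\mathrm{pretrain}} &= \mathrm{pretrain}\left(\mathrm{model}_{\mathrm{BERT}},\, A\right) \tag{1} \\
\mathrm{model}_{\mathrm{final}} &= \mathrm{fine\_tuning}\left(\mathrm{model}_{\mathrm{pretrain}},\, B\right) \tag{2}
\end{align}
```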

Fig. 1. The model structure of incremental pre-training and fine-tuning on domain corpus.
The model structure of incremental pre-training and fine-tuning based on domain corpus is shown in Fig. 1. The model includes three parts: pre-training, fine-tuning and downstream task testing.
Pre-training: inject the domain corpus into the model and adjust the network parameters so that the model adapts to the tasks of the target domain. In the pre-training stage, the input corpus is first copied, and the mask is then generated in a random or predefined way. Finally, the original and masked corpora are input to the model for self-supervised training. The model predicts the masked content from its context and thereby learns the knowledge of the target domain.
Fine-tuning: adapt the model to a specific downstream task through training. First, the input is built in the form of a sentence pair and a label. Second, the Transformer encoder lets the model attend to both sentences of the pair simultaneously and then predict their semantic similarity. Finally, the network parameters are adjusted according to the loss, which is calculated from the label and the predicted result. These steps yield a model adapted to the text semantic matching task.
Downstream task test: after the parameters have been adjusted by the two modules above, this module tests the model’s effect in the practical application of the target domain.
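The sentence-pair input described for the fine-tuning stage can be sketched as follows. This is a minimal illustration that assumes BERT's standard `[CLS]`/`[SEP]` packing with segment ids; the function name `pack_pair` and the example questions are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"  # BERT's standard special tokens


def pack_pair(tokens_a: List[str], tokens_b: List[str]) -> Tuple[List[str], List[int]]:
    """Pack two tokenized sentences into one BERT-style input sequence."""
    tokens = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    # Segment 0 covers [CLS], sentence A, and its [SEP];
    # segment 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Hypothetical example: a user question paired with a standard question.
user_q = ["how", "to", "repay", "early"]
standard_q = ["early", "repayment", "method"]
tokens, segment_ids = pack_pair(user_q, standard_q)
label = 1  # assumed convention: 1 = same meaning, 0 = different meaning
```

Because both sentences share one sequence, self-attention in the encoder can relate every token of the user question to every token of the standard question.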
Mask self-supervised training
In the target domain, owing to the high variety and proportion of proper nouns, we propose a method that combines random and domain vocabulary masks in pre-training. In our method, the original and masked corpora are input to the model for self-supervised learning, which enhances the model’s ability to model vital information in downstream tasks. Combining random and domain dictionary masks maximizes the knowledge learned from the pre-training dataset.
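The combination of the two masks can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the order in which the masks are combined, the longest-match rule, and the function names are all hypothetical, and the sketch omits BERT's usual replacement of some selected tokens with random or unchanged tokens. The paper does not state a masking rate here; callers pass it explicitly.

```python
import random
from typing import List, Set, Tuple

MASK = "[MASK]"


def dictionary_mask(tokens: List[str], domain_terms: Set[Tuple[str, ...]]) -> List[str]:
    """Mask every token span that exactly matches a domain-dictionary term."""
    masked = list(tokens)
    max_len = max((len(term) for term in domain_terms), default=0)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in domain_terms:
                masked[i:i + n] = [MASK] * n
                i += n
                break
        else:
            i += 1  # no dictionary term starts here
    return masked


def random_mask(tokens: List[str], rate: float, rng: random.Random) -> List[str]:
    """Replace each token with [MASK] independently with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def combined_mask(tokens: List[str], domain_terms: Set[Tuple[str, ...]],
                  rate: float, rng: random.Random) -> List[str]:
    """Apply the dictionary mask, then randomly mask the remaining tokens."""
    masked = dictionary_mask(tokens, domain_terms)
    return [tok if tok == MASK else (MASK if rng.random() < rate else tok)
            for tok in masked]


# Hypothetical example sentence and dictionary entry.
sentence = ["how", "to", "apply", "for", "credit", "loan"]
terms = {("credit", "loan")}
dict_masked = dictionary_mask(sentence, terms)
```

For self-supervised training, each masked sequence would be paired with its original sequence, so that the model learns to predict the masked domain terms from their context.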
Random mask method
In the data preprocessing stage, for a given sentence sequence with length T,
Domain dictionary mask method

Fig. 2. Domain dictionary mask method.
The domain dictionary masking method is shown in Fig. 2. Using the method of keyword matching, for the sentence sequence
In equation (3),
In the fine-tuning stage, to address the insufficient corpus in the target domain, we propose training the model on public datasets similar to the target-domain dataset. In this paper, the target domain is loan Q&A for a financial company, so we choose the public dataset BQ, which is similar to the target domain, as the source-domain dataset. To address the problem that many words in the target domain do not appear in the source domain, we preprocess the dataset with manual screening, as shown in Table 1. First, we select the sentences in the source domain that contain proper nouns. Then, we replace the proper nouns with conceptual words. In this way, we reduce the number of proper nouns and improve how well the data generalizes across similar domains.
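The replacement step can be sketched as follows. The paper performs this screening manually; the sketch only approximates it with a lookup table, and the mapping entries, function names, and example sentence are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from source-domain proper nouns to conceptual words.
CONCEPT_MAP: Dict[str, str] = {
    "ProductA": "loan product",
    "BankX": "bank",
}


def contains_proper_noun(sentence: str, concept_map: Dict[str, str]) -> bool:
    """Return True if the sentence mentions any listed proper noun."""
    return any(name in sentence for name in concept_map)


def generalize(sentence: str, concept_map: Dict[str, str]) -> str:
    """Replace each listed proper noun with its conceptual word."""
    # Replace longer names first so a short name never overwrites
    # part of a longer one.
    for name in sorted(concept_map, key=len, reverse=True):
        sentence = sentence.replace(name, concept_map[name])
    return sentence


def generalize_pairs(pairs: List[Tuple[str, str, int]],
                     concept_map: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Generalize both sentences of each (sentence_a, sentence_b, label) pair."""
    return [(generalize(a, concept_map), generalize(b, concept_map), label)
            for a, b, label in pairs]


example = "How do I repay ProductA at BankX?"
generalized = generalize(example, CONCEPT_MAP)
```

Sentences without a listed proper noun pass through unchanged, so the same function can be applied to the whole dataset.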
Dataset preprocessing
Experimental data
The experiment covers two types of tasks and six datasets. The tasks are text classification and semantic matching. The datasets are shown in Table 2. LCQMC is a general-domain dataset. BQ, Ours_pre, Ours_dev, and Ours_test are professional-domain datasets.
Experimental setup and evaluation criteria
The experimental environment is shown in Table 3. The random seed is set to 7. We use AdamW as the model optimizer and set the gradient decay factor to 0.9, the squared-gradient decay factor to 0.99, the initial learning rate as
Dataset introduction
Experimental environment
We use accuracy as the evaluation criterion for the text classification and semantic matching tasks. The metrics are defined in equations (4)–(5). Equation (4) calculates the accuracy of the binary text classification task, and equation (5) calculates the accuracy of the semantic similarity task.
In equation (4), TP is the number of question pairs whose label is 1 and whose model prediction is also 1; FN is the number of question pairs whose label is 0 and whose model prediction is also 0 (that is, the correctly predicted negative pairs); All is the number of all test samples.
In equation (5), All is the number of all test samples; M is the number of correctly matched questions. m inputs and n questions in the question base form
The experimental dataset is divided into four parts: a pre-training set, a training set, a validation set, and a test set. The Ours_pre dataset is used for model pre-training, the training sets of the LCQMC and BQ datasets are used for model fine-tuning, and Ours_dev and Ours_test are used to verify the performance improvement of the model in downstream tasks.
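The two accuracy measures can be sketched as follows. This is an illustration under assumptions: the binary accuracy counts pairs whose predicted label equals the gold label, and the matching accuracy is read as a Top-k measure (the results section reports Top1 and Top5), where an input counts as correct if its gold standard question is among the k highest-scoring questions. The function names and example values are hypothetical.

```python
from typing import Sequence


def binary_accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Share of question pairs whose predicted label (0 or 1) equals the gold label."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and equally long")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def top_k_accuracy(scores: Sequence[Sequence[float]], gold: Sequence[int], k: int) -> float:
    """Share of inputs whose gold question is among the k highest-scoring questions.

    scores[i][j] is the similarity between input i and standard question j;
    gold[i] is the index of the correct standard question for input i.
    """
    if len(scores) != len(gold) or len(gold) == 0:
        raise ValueError("scores and gold must be non-empty and equally long")
    hits = 0
    for row, target in zip(scores, gold):
        # Rank question indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(gold)


# Hypothetical values for illustration only.
acc = binary_accuracy([1, 0, 1, 0], [1, 0, 0, 0])
sim = [[0.9, 0.1, 0.3], [0.2, 0.5, 0.4], [0.1, 0.3, 0.2]]
top1 = top_k_accuracy(sim, [0, 2, 0], k=1)
top2 = top_k_accuracy(sim, [0, 2, 0], k=2)
```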
Mask experiment
As shown in Table 4, BERT denotes the original BERT model fine-tuned directly on LCQMC. +Random, +Dict, and +Random_Dict denote the original BERT incrementally pre-trained on the Ours_pre dataset with the random mask method, the domain dictionary mask method, and the combination of the two, respectively, and then fine-tuned on the LCQMC dataset.
The experimental results show that using the Ours_pre dataset in the pre-training stage damages the generalization of the BERT model: the accuracy on the LCQMC test set is reduced by 0.18%–0.99%. In the downstream tasks, however, the BERT model pre-trained with the +Dict method has the lowest accuracy on the BQ validation set, while its accuracy on the Ours_dev and Ours_test datasets is better than that of +Random. This shows that enhancing BERT with a domain dictionary reduces performance on downstream tasks in domains whose professional terms differ greatly, and improves the ability to model the crucial information of downstream tasks in the target domain. +Random_Dict, which combines the +Random and +Dict methods, performs best on the Ours_dev and Ours_test datasets, improving accuracy by 5.26%, 4.38%, and 10.77%, respectively, compared with BERT. This shows that combining random and dictionary masks makes the most of the domain knowledge in the pre-training corpus.
Continuous training of corpus in similar domain
As shown in Table 5, +BQ and +BQboost denote models that start from +Random_Dict and are further fine-tuned on the BQ and BQboost datasets, respectively. For +BQ, the Top1 and Top5 accuracy on the Ours_test dataset decrease by 9.43% and 1.01%, respectively. This indicates that the deviation between the target domain and a similar domain affects performance on the downstream task. The accuracy of +BQboost on the Ours_dev and Ours_test datasets is better than that of +Random_Dict and +BQ. This indicates that cleaning the professional-domain dataset before fine-tuning can improve the model’s performance in downstream tasks of similar domains.
The practical application of the BERT model faces several problems, such as an insufficient pre-training corpus, a large number of proper nouns in the target domain, and colloquial expressions. We propose a knowledge enhancement method based on a domain dictionary mask to address these problems. Random masking and domain dictionary masking are used to mask the target-domain corpus. After masking, the corpus is input to the model for self-supervised learning, which enhances the model’s ability to handle professional-domain terms in practical applications. After pre-training, the model is fine-tuned on the general-domain dataset LCQMC and the domain dataset BQboost. Finally, the user data of a financial company are used for testing. Experimental results show that the knowledge-enhanced BERT improves on the original BERT by 12.12% on average in downstream tasks.
Our study uses only a single corpus from a similar domain to fine-tune the model. Our future work will use multiple datasets from different source domains and comprehensively consider the semantic information hidden in BERT’s low-level and high-level layers [6] to obtain better results.
Acknowledgements
The study is funded by the Guangdong Science and Technology Plan Project (2021B1212100004, 2019B010139001), Guangzhou Science and Technology Plan Project (201902020016), and Guangdong Natural Science Fund Project (2021A1515011243).
Conflict of interest
None to report.
