Abstract
Due to the polysemy and complexity of the Chinese language, Chinese machine reading comprehension has always been a challenging task. To improve the semantic understanding and robustness of Chinese machine reading comprehension models, we propose a model that utilizes adversarial training algorithms and Permuted Language Model (PERT). Firstly, we employ the PERT pre-training model to embed paragraphs and questions into vector space to obtain corresponding sequential representations. Secondly, we use a multi-head self-attention mechanism to extract key textual information from the sequence and employ a Bi-GRU network to semantically fuse the output feature vectors, aiming to learn deep semantic representations in the text. Finally, we introduce perturbations into the model training process. We achieve this by utilizing adversarial training algorithms such as Fast Gradient Method (FGM) and Projected Gradient Descent (PGD). These algorithms generate adversarial samples to enhance the model’s robustness and stability when facing diverse inputs.
We conducted comparative experiments on the publicly available Chinese reading comprehension datasets CMRC2018 and DRCD. The experimental results show that our proposed model has achieved significant improvements in both EM and F1-Score compared to the baseline model. To validate the model’s generalization and robustness, we utilized ChatGPT to construct a scientific dataset that includes a large number of domain-specific terms, sentences with mixed Chinese and English, and complex comprehension tasks. Our model also performed remarkably well on the self-built dataset. In conclusion, the proposed model not only effectively enhances the understanding of semantic information in Chinese text but also demonstrates a certain level of generalization capability.
Keywords
Introduction
In recent years, natural language processing (NLP), as a subfield of artificial intelligence, has experienced significant development. Machine Reading Comprehension (MRC) is an important task within NLP, aiming to enable machines to read and comprehend unstructured text and answer corresponding questions based on the text content. This task holds significant value in practical applications, such as intelligent customer service, information retrieval, and intelligent question-answering systems.
MRC tasks themselves are challenging, requiring models to handle various language phenomena and understand text semantics and logical relationships. Deep learning-based methods for MRC task modeling have evolved alongside the development of deep learning techniques. Early commonly used methods involved encoding contextual semantics using Recurrent Neural Networks (RNNs) to capture long-term dependencies [1–5]. However, due to the issue of vanishing gradients, RNN networks struggle to effectively retain long-term dependency information, thereby impacting the model’s understanding of long texts and introducing result biases. With the emergence of various RNN variants (e.g., LSTM, GRU) and the continuous development of pre-training models (BERT [6], RoBERTa [7], GPT [8], XLNet [9]), a mainstream approach to enhance the text reading, analysis, and summarization capabilities of machine reading comprehension is through fine-tuning pre-training models like BERT to accomplish downstream tasks (Feng et al. [10] and Cui et al. [11]). This approach leverages the language knowledge already acquired by the BERT model, enabling exceptional performance on specific tasks [12].
Although the aforementioned models and methods have achieved success in MRC tasks, several challenges persist:
(1) Most current models are trained specifically for the English language, which poses numerous challenges when processing Chinese text. In machine reading comprehension (MRC) tasks, encoding the entire sentence is crucial to assist models in understanding semantic information. However, the MLM pre-training method used by the BERT model is less suitable for Chinese encoding tasks and may generate noise tokens irrelevant to the answers, resulting in reduced model performance and diminished semantic understanding abilities (Schlegel et al. [13], Yu et al. [14]). Additionally, the Chinese language itself possesses unique linguistic structures and grammar rules, further complicating Chinese Natural Language Processing (NLP). Therefore, it is essential to adopt pre-training models specifically tailored for Chinese and corresponding pre-processing methods to address these issues.
(2) Existing MRC models have certain limitations in feature extraction and fusion. For example, the QAnet model [15] has two flaws. Firstly, the model utilizes Convolutional Neural Networks (CNNs) for text feature extraction. While CNNs excel in handling simple factual text questions, their performance may be limited when dealing with complex problems that require multi-step reasoning or deep logical inference. Secondly, in terms of feature fusion, commonly used methods in these models include concatenation, weighted summation, and dot product. These methods can increase the model’s dimensionality and computational cost. Additionally, the weak correlation between features can result in suboptimal feature fusion between questions and contexts.
(3) Liu et al. [16] and Yang et al. [17] have pointed out in their research that current pre-training models are vulnerable to adversarial attacks.
These attacks can significantly affect model performance, and sometimes even lead to incorrect answer outputs. A common attack strategy is to add text to the input sequence that is semantically similar to the answer, which misleads the model. For example, in a reading comprehension task with the question “Where did John go in January?”, if the passage “John went to Beijing in January.” is changed to “John went to Beijing, the capital of China, in January 2023.”, the model could incorrectly select “China” as the answer. In addition, an attacker could exploit this vulnerability to build more sophisticated attacks that further degrade the performance of the model. This suggests that the impact of adversarial attacks needs to be considered during model design, and that countermeasures should be implemented to enhance the robustness and generalization of the model.
The contributions of this paper are summarized as follows:
(1) To enhance the semantic understanding capability of Chinese machine reading comprehension, we utilize the Chinese Permuted Language Model (PERT). PERT, which uses permuted language modeling instead of traditional Masked Language Modeling (MLM), enables us to obtain more effective vector representations of the paragraph and question.
(2) The proposed model improves on the original PERT model. In the feature extraction phase, we utilize a multi-head self-attention mechanism to extract valuable and relevant information from the input sequence. In the feature fusion phase, we employ a Bidirectional Gated Recurrent Unit (Bi-GRU) to semantically fuse the paragraph and question information, obtaining a more comprehensive and accurate semantic representation.
(3) To address the vulnerability of current pre-trained models to adversarial attacks, we utilize two adversarial training algorithms, namely Fast Gradient Method (FGM) and Projected Gradient Descent (PGD), to enhance the robustness and generalization capability of the model.
(4) The model was evaluated on Chinese extractive reading comprehension datasets (CMRC2018 and DRCD) as well as a self-built dataset constructed from scientific metadata. The proposed model outperformed existing baseline pre-trained models and demonstrated a certain level of generalization ability.
Related work
Traditional deep learning-based MRC
With the development of deep learning and the continuous increase in datasets, neural networks have shown superior performance in MRC tasks compared to traditional rule-based and machine learning methods, and have gradually become mainstream in research. As shown in Fig. 1, the following is a general architecture for machine reading comprehension.

Machine reading comprehension architecture.
Compared to traditional tasks such as word segmentation, named entity recognition, and syntactic analysis, Machine Reading Comprehension (MRC) typically requires more contextual information and deeper semantic understanding, thus necessitating the extensive use of techniques such as text representation, retrieval, reference resolution, and reasoning [18]. To accurately predict answers, researchers have conducted extensive exploration to extract relevance between context and questions. Initially, one-hot encoding or methods like Word2Vec [19] were used to convert input words into fixed-length vectors. These vectors were then passed into CNNs, RNNs, Transformers, and other network models to extract features from context and questions. This allowed for the capture of semantic information at multiple levels [15, 20–22]. Subsequently, features were fused using attention mechanisms or similar approaches, and answer prediction was performed based on various tasks such as open-ended questioning [23, 24]. In the past few years, neural network-based MRC models have achieved good results through training on large-scale corpora. However, existing neural network-based MRC models still face certain challenges and limitations. For example, they perform poorly when handling long texts and complex semantic relationships. Additionally, MRC requires improved reasoning and inference capabilities to tackle complex logical and inferential questions. As research progresses, new techniques have been introduced to MRC, such as pre-training models and transfer learning, aiming to enhance model performance.
In the field of natural language processing, pre-trained neural encoders have become one of the mainstream research directions. Early word vector models like Word2Vec encoded words into low-dimensional spaces and captured the correlations between words, but they were limited in effectively mining contextual information. Consequently, more advanced pre-trained neural encoders emerged. BERT (Bidirectional Encoder Representations from Transformers) is one of the most successful pre-trained neural encoders to date. It employs bidirectional context modeling and unsupervised pre-training, allowing it to better capture semantic and syntactic information in language. BERT has demonstrated state-of-the-art performance in tasks such as Machine Reading Comprehension (MRC) and other natural language tasks. However, true reading comprehension requires not only language understanding but also knowledge that supports complex reasoning. To address this, many researchers have proposed various methods to enhance the capabilities of the BERT model. For example, Gong et al. [25] introduced an approach that predicts answers based on knowledge graph representation and context. Fan et al. [26] proposed a method that combines BERT with an LSTM-based retrieval-inference module, enabling the model to possess semantic understanding and reasoning abilities. Yang et al. [27] utilized structured knowledge bases to enhance the model’s performance, while Wang et al. [28] employed Bi-LSTM and attention mechanism modules to effectively fuse text and questions, improving the model’s performance. Sun et al. [29] used the FGM adversarial training algorithm to enhance the model’s robustness. Despite its achievements, the BERT model, during the pre-training process, requires predicting randomly masked words using the Masked Language Model (MLM) task. However, in downstream tasks, the model needs to directly model and analyze complete texts. 
This requirement differs significantly from the MLM task and can potentially result in suboptimal performance in MRC tasks. In contrast, the Permuted Language Model (PERT) utilizes a novel pre-training structure to address this issue. This structure replaces the MLM masking task with a word order prediction task, alleviating the discrepancy between pre-training and fine-tuning tasks and improving the model’s generalization and performance on downstream tasks.
Neural network-based Chinese machine reading comprehension models
In recent years, Chinese reading comprehension models have received increasing attention, and with the development of deep learning technology, more and more MRC methods with complex structures have emerged [30]. These models have performed well on multiple datasets, providing more accurate and precise answers to real-world applications. However, due to the complexity and flexibility of the Chinese language, Chinese machine reading comprehension models have always been a challenging task. Firstly, compared with English, Chinese has more ambiguity phenomena. These include homophones, homographs, and heteronyms. These phenomena make it extremely difficult for models to understand sentences and passages accurately. Secondly, the context of the passage is crucial for understanding the problem and providing accurate answers [31]. Chinese articles often use omission structures, which require models to have good anaphora resolution and context reasoning ability. When processing long texts, the model needs to process a large amount of information and make accurate predictions under limited computing resources. In addition, traditional training methods may cause overfitting of the model, affecting its performance on unseen data.
To address these challenges, researchers have adopted various strategies and technologies. For example, introducing pre-trained language models and combining them with reading comprehension tasks to improve the model’s language understanding ability. In addition, some research work combines different deep learning techniques, further promoting the development of Chinese reading comprehension. Table 1 lists some classic neural network-based Chinese machine reading comprehension models. By summarizing, we found that these models use classical deep learning methods in feature embedding, feature extraction, and feature fusion, but there are also some limitations. Feature embedding. Existing Chinese models use many classic embedding methods. However, traditional word embedding methods such as Word2vec and ELMO often struggle to overcome polysemy and contextual understanding problems. Although the BERT model can partially solve these problems, the MLM pre-training method is not very suitable for Chinese reading comprehension. Feature extraction and fusion. Currently, most models use traditional neural network methods in the feature extraction stage, and have not yet applied the multi-head self-attention mechanism in the Transformer structure to extract features. At the same time, in terms of feature fusion, most models still use classical attention mechanisms. Although this method performs well in processing local information, it lacks comprehensive modeling ability for global text. This means that the model may not fully utilize the contextual information in the text and may lead to decreased performance when handling complex tasks. Robustness of the model. There is almost no research considering the robustness and generalization of Chinese reading comprehension models.
To address the aforementioned issues in Chinese MRC, this paper proposes a machine reading comprehension model that utilizes adversarial training algorithms and the Permuted Language Model (PERT). The model employs the Chinese PERT model in the feature embedding layer to enhance the model’s reasoning ability. For feature extraction and fusion, it utilizes the multi-head self-attention mechanism and a Bi-GRU network to learn deep semantic representations in the text. Lastly, adversarial training with the FGM and PGD algorithms is incorporated during model training to introduce perturbations and improve the model’s robustness and generalization.
Neural network-based Chinese machine reading comprehension models
We propose an extractive machine reading comprehension model. Given a paragraph $P = \{p_1, p_2, \dots, p_n\}$ containing $n$ tokens and a question $Q = \{q_1, q_2, \dots, q_m\}$ containing $m$ tokens, the goal is to predict an answer $A$, which is a contiguous span in the passage, $A = \{a_i, \dots, a_j\}$, where $i$ and $j$ represent the boundaries of the answer. As shown in Fig. 2, the model can be divided into an embedding layer, a feature extraction layer, a feature fusion layer, and an answer prediction layer. During training, each of the two adversarial training algorithms is used in turn to train the model.

An overview of our model. $e_p$, $e_s$ and $e_t$ respectively represent position embedding, segment embedding, and token embedding. $r$ is the output of the multi-head self-attention mechanism. The model employs adversarial training.
PERT is a Chinese pre-trained language model. It takes a pair of sequences as input and permutes part of the word order, for example turning a sentence meaning “I eat apples” into “I apples eat”. The new, permuted sequence pair is then fed into the model to predict the original positions of the words. By permuting the sentences, contextual representation information is established. This approach helps alleviate the discrepancy between pre-training tasks and machine reading comprehension tasks, thereby improving model performance.
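The permutation step can be illustrated with a minimal sketch. It is not PERT's actual pre-processing code: the function name `permute_tokens`, the `ratio` default, and the use of `-1` to mark untouched positions are all assumptions made for illustration. The sketch shuffles a random subset of positions and records, for each shuffled position, the original index of the token that now sits there.

```python
import random


def permute_tokens(tokens, ratio=0.15, seed=0):
    """Shuffle a random subset of token positions (illustrative only).

    Returns (permuted, targets). targets[i] is the original index of the
    token now at position i, or -1 if position i was not selected
    (a hypothetical "ignore in the loss" marker).
    """
    rng = random.Random(seed)
    n = len(tokens)
    # At least two positions are needed for a swap; never more than n.
    k = min(n, max(2, round(n * ratio)))
    chosen = sorted(rng.sample(range(n), k))
    shuffled = chosen[:]
    rng.shuffle(shuffled)

    permuted = list(tokens)
    targets = [-1] * n
    for dst, src in zip(chosen, shuffled):
        permuted[dst] = tokens[src]  # token from position src moves to dst
        targets[dst] = src           # the model should recover src
    return permuted, targets


if __name__ == "__main__":
    print(permute_tokens(["I", "eat", "red", "apples"], ratio=0.5))
```

A position-prediction head would then be trained to output `targets[i]` at every selected position, which is one plausible reading of "predict the original positions of the words".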
In our approach using the PERT model, we encode the two sequences, paragraph P and question Q, to obtain embedding vectors $E_p = \mathrm{PERT}(P)$ and $E_q = \mathrm{PERT}(Q)$.
Feature extraction
In the feature extraction stage, the multi-head self-attention mechanism has advantages over traditional neural network models like CNN and RNN in processing sequential data. Firstly, it can consider the global information of the entire input sequence simultaneously, rather than being limited to local receptive fields or processed step-by-step. This allows the model to capture a broader context, enabling better understanding of long-range dependencies and global structures. Secondly, the multi-head self-attention mechanism enables the model to learn features at different granularities. By introducing multiple attention heads, each head can focus on different ranges of content in the input sequence, resulting in richer and more diverse feature representations. This helps enhance the model’s representation ability, enabling it to better capture important features and information in the input sequence. Additionally, the multi-head self-attention mechanism also has the advantage of parallel computation, allowing for efficient computations. Unlike RNN, each head in the multi-head self-attention mechanism can be computed independently, enabling parallel processing of the input sequence and speeding up computation. This is particularly important when dealing with long sequences or large-scale data, as it improves the efficiency of model training and inference. As shown in Fig. 3, an attention head can associate the question information with the paragraph information to obtain a more accurate representation of the answer.

Multi-head self-attention mechanism.
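Specifically, $E_p$ and $E_q$ are fed into the multi-head self-attention mechanism. The input feature matrix is transformed by linear mapping matrices to obtain the Q, K, and V matrices. Then, the dot product of Q and K is calculated and normalized by softmax to obtain attention scores. Finally, the scores are used to form a weighted sum of V, giving the context vector for each word. In the standard scaled dot-product form, with $X$ denoting the input feature matrix, this is:
$$Q_i = XW_i^{Q}, \qquad K_i = XW_i^{K}, \qquad V_i = XW_i^{V}$$
$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right)V_i$$
$$r = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}$$
where $d_k$ is the dimension of the key vectors, $h$ is the number of attention heads, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are trainable projection matrices, and $r$ is the output of the multi-head self-attention mechanism.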
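The Q/K/V computation described in this subsection can be sketched in a few lines of dependency-free Python. This is a toy illustration assuming the standard scaled dot-product attention, not the authors' implementation: the helper names are hypothetical, matrices are plain nested lists, and the final output projection that normally follows the concatenation of heads is omitted for brevity.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(r, col)) for col in zip(*b)] for r in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(q, k, v):
    """Scaled dot-product attention; returns (context vectors, weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads):
    """heads: list of (w_q, w_k, w_v) projection matrices, one per head."""
    outputs = [
        attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))[0]
        for wq, wk, wv in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    return [sum((o[i] for o in outputs), []) for i in range(len(x))]
```

Each row of `weights` sums to one, so every context vector is a convex combination of the value vectors. That is what lets a question token draw on any paragraph token regardless of distance.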
In the feature fusion layer, a common approach is to use two independent GRU networks to separately encode the semantic information of the question and the text, and then use them as inputs to the model. However, this method ignores the interaction between the question and the text, leading to the loss of some important semantic information. To address this issue, we adopt the Bi-GRU to integrate the contextual and question feature information. Bi-GRU is a bidirectional recurrent neural network consisting of two GRUs with opposite directions. It can effectively handle long sequential data and avoid the problems of vanishing or exploding gradients. In Bi-GRU, the update gate and the reset gate are two important gating mechanisms that regulate the flow of information and influence the update of hidden states at each time step. Additionally, there are two important states: the candidate hidden state and the hidden state. The candidate hidden state is a temporary state computed based on the input and the previous hidden state, and it is updated at each time step. The hidden state is the final state computed based on the candidate hidden state, the update gate, and the previous hidden state. The entire process is illustrated in Fig. 4.

Bi-GRU network.
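In the model, we incorporate a multi-head self-attention mechanism to capture global contextual information and enrich semantic representations. Additionally, we employ a Bi-GRU network to further integrate these features. The Bi-GRU network processes the input sequence in both forward and backward directions, obtaining forward states and backward states at each time step. These states are then concatenated to form the final representation, enhancing the predictive ability of the model. Specifically, the dimensions of the forward and backward states are set to 384. With $z_t$ and $r_t$ denoting the update gate and reset gate respectively, the GRU computation in its standard form (written in one common sign convention for the update gate) is:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
$$h_t^{\mathrm{bi}} = [\overrightarrow{h_t} \, ; \overleftarrow{h_t}]$$
where $x_t$ is the $t$-th element of the input sequence, $h_t$ is the hidden state at time step $t$, $\tilde{h}_t$ is the candidate hidden state, $W$, $U$, $b$ are the trainable parameters of the model, $\sigma$ is the sigmoid function, $\odot$ represents element-wise multiplication, $\tanh$ is the hyperbolic tangent function, and $[\cdot \, ; \cdot]$ represents concatenation of the forward and backward hidden states.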
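To make the gating concrete, the following toy sketch runs a bidirectional GRU with a hidden size of 1 (the model above uses 384 per direction). It assumes the standard GRU update h_t = z_t * h_{t-1} + (1 - z_t) * h~_t. The parameter names in the dictionary are hypothetical, and scalars stand in for the weight matrices.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One GRU step with scalar input and scalar hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    return z * h_prev + (1.0 - z) * h_tilde


def bi_gru(xs, p_fwd, p_bwd):
    """Return [(forward_state, backward_state), ...], one pair per time step."""
    fwd, h = [], 0.0
    for x in xs:
        h = gru_step(x, h, p_fwd)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):
        h = gru_step(x, h, p_bwd)
        bwd.append(h)
    bwd.reverse()  # align backward states with the original time order
    return list(zip(fwd, bwd))
```

Each pair plays the role of the concatenated forward and backward state at one time step: the forward value has seen everything up to that position and the backward value everything after it.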
When performing answer prediction, MRC models typically proceed in three steps. Firstly, the representation obtained after the context-question interaction is passed through fully connected layers. Secondly, a softmax activation is applied to produce, over the positions of the context, one probability distribution for the start position and another for the end position of the answer. Lastly, based on these two distributions, the model selects the answer span with the highest probability as the final answer. More specifically, the representation vector $h_t$ is generated by the Bi-GRU network for the context and question. This vector is then fed into a fully connected network for linear transformation. Finally, the softmax function is applied to compute the probability distributions for the answer’s start and end positions.
Here FC represents the fully connected layer, $a$ represents the vector produced by the fully connected transformation, and $A_{start}$ and $A_{end}$ represent the start and end positions of the answer predicted by the model.
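The final span-selection step can be sketched as follows. This is an illustrative decoding routine rather than the authors' code: it assumes the span score is the product of the start and end probabilities, that the start must not come after the end, and that spans are capped by a hypothetical `max_len` parameter.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end, score) for the most probable valid span."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, -1.0)
    for i, ps in enumerate(p_start):
        # Only consider end positions at or after the start position.
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best
```

Enforcing `start <= end` matters in practice: taking the two argmax positions independently can yield an end position that precedes the start, which does not correspond to any span in the paragraph.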
Adversarial training is an effective method to improve the robustness of deep learning models. In this method, the input data is first perturbed so as to maximize the loss, placing the model in the worst-case scenario. Then, the model parameters are updated by minimizing the loss on the resulting adversarial examples. In natural language processing tasks, perturbations are commonly generated at the word embedding layer. Specifically, in each training iteration, the model randomly selects a certain proportion of samples from the input data and converts them into continuous vector representations through the embedding layer. Then, targeted adversarial perturbations are added to these vector representations, generating adversarial examples. These adversarial examples are then added to the training set for further training, as shown in Fig. 5. Adversarial training can be generally expressed as follows:
$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim D}\left[\max_{\Delta x \in \Omega} L(x + \Delta x, y; \theta)\right]$$
where $D$ represents the distribution of input samples, $x$ represents the input sample, $y$ represents the label, $\theta$ represents the model parameters, $L$ represents the loss function, $\Delta x$ represents the applied perturbation, and $\Omega$ represents the perturbation space.

Illustration of adding adversarial samples to the embedding layer.
In order to enhance the model’s performance in handling unknown knowledge, we employ adversarial training algorithms to introduce adversarial samples into the embedding layer. This approach involves incorporating input samples with perturbations, challenging the model’s learning capabilities and improving its adaptability to new circumstances. The Fast Gradient Method (FGM) and Projected Gradient Descent (PGD) are two commonly used algorithms in adversarial training, which have demonstrated good performance in the MRC task. FGM is a simple yet effective adversarial attack method. It generates adversarial examples by applying small perturbations to input data along the gradient direction. Specifically, for a given input sample, the method involves computing the gradient of the loss function with respect to the input data, taking the sign of the gradient, multiplying it by a small perturbation value, and adding the perturbation to the original input data to create an adversarial example. The specific formula is as follows:
$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(f_\theta(x), y)\right)$$
Note that this sign-based step is the form usually called the Fast Gradient Sign Method. FGM as commonly applied to text embeddings instead scales the gradient $g = \nabla_x L(f_\theta(x), y)$ by its $L_2$ norm, giving the perturbation $\epsilon \cdot g / \lVert g \rVert_2$.
PGD is a more powerful adversarial attack method that builds upon FGM by introducing an iterative process and a projection step. In PGD, the generation of adversarial examples is an iterative process where each step involves making small adjustments to the input data along the gradient direction of the loss function while ensuring that the adjusted data points remain within an allowable range. Through multiple iterations, PGD can produce more challenging adversarial examples that are harder for models to defend against or detect compared to those generated by FGM. The specific formula is as follows:
$$x_{t+1} = \Pi_{x + \Omega}\left(x_t + \alpha \cdot \mathrm{sign}\left(\nabla_x L(f_\theta(x_t), y)\right)\right)$$
where $\epsilon$ is the perturbation hyperparameter, $x$ represents the input sample, $\nabla_x$ represents the gradient of the loss function $L$ with respect to input $x$, $f_\theta$ represents the neural network function, $y$ represents the label corresponding to $x$, and $x_t$ represents $x$ at iteration $t$. In the PGD update, $\alpha$ is the step size and $\Pi_{x+\Omega}$ projects the perturbed point back onto the allowed region of radius $\epsilon$ around $x$; these two symbols follow the usual statement of PGD and are assumed here.
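The difference between the single-step and iterative attacks can be seen on a toy problem. The sketch below is an assumption-laden illustration, not the training code used in the experiments: a linear logistic model stands in for the network so the input gradient can be written by hand, the perturbation uses the L2-normalised gradient (the variant common for embedding-level FGM) rather than the sign of the gradient, and the names `alpha` and `steps` are hypothetical.

```python
import math


def loss_and_grad(x, y, w):
    """Logistic loss log(1 + exp(-y * w.x)) and its gradient w.r.t. x (y is +1 or -1)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))
    return loss, [coeff * wi for wi in w]


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def fgm(x, y, w, eps):
    """Single step of size eps along the normalised gradient."""
    _, g = loss_and_grad(x, y, w)
    n = l2_norm(g) or 1.0
    return [xi + eps * gi / n for xi, gi in zip(x, g)]


def pgd(x, y, w, eps, alpha, steps):
    """Several steps of size alpha, projected back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        _, g = loss_and_grad(x_adv, y, w)
        n = l2_norm(g) or 1.0
        x_adv = [xi + alpha * gi / n for xi, gi in zip(x_adv, g)]
        delta = [a - b for a, b in zip(x_adv, x)]
        d = l2_norm(delta)
        if d > eps:  # projection step
            x_adv = [b + eps * di / d for b, di in zip(x, delta)]
    return x_adv
```

In adversarial training, the loss on the perturbed input is added to the loss on the clean input before the parameter update. For a linear model the two attacks end at the same point, while for a deep network the iterative search of PGD generally finds a higher-loss perturbation at the cost of extra forward and backward passes.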
In this study, we conducted experiments to evaluate the performance of these two algorithms.
Experimental environment and parameters
The experiments were run on Linux with an NVIDIA Tesla V100 GPU, CUDA 10.1, and PyTorch 1.7.1 as the deep learning framework; the Adam optimizer was used for network optimization. Dropout was employed to prevent overfitting, and cross-entropy was used as the loss function. The pre-trained model used was PERT. A warm-up phase with a length of 0.11 times the dataset was applied at the start of training, and the CLS token representation was used as the textual vector output. The main hyperparameters of the model are shown in Table 2.
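A warm-up phase is typically implemented as a learning-rate schedule. The sketch below assumes a linear ramp over the first 11% of training steps followed by a linear decay to zero. The decay shape, the function name `lr_at_step`, and the interpretation of the 0.11 factor as a fraction of total steps are assumptions, since the text does not specify them.

```python
def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.11):
    """Learning rate at a given step: linear warm-up, then linear decay (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    return base_lr * max(0.0, (total_steps - step) / remaining)
```

Starting with a small learning rate keeps the randomly initialised task layers from disrupting the pre-trained weights in the first few updates.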
Hyperparameter setting
In our experiments, we used two public datasets, CMRC2018 and DRCD, as well as our self-built dataset of scientific metadata (Section 5.3). CMRC2018 is a Chinese extractive machine reading comprehension dataset. It consists of nearly 20,000 real questions annotated by experts on Wikipedia paragraphs. DRCD is a Traditional Chinese question-answering dataset that provides data in the format of paragraphs and questions. It is compiled from 2,108 Wikipedia articles and contains 10,014 paragraphs and more than 30,000 questions. An example of the dataset is shown in Table 3.
Example of dataset
Both CMRC2018 and DRCD are evaluated with two indicators commonly used in NLP tasks: Exact Match (EM) and F1-score (F1). EM measures whether the predicted answer matches the standard answer exactly. Specifically, if the predicted answer of the model is exactly the same as the standard answer, it is recorded as 1; otherwise, it is recorded as 0. Although EM provides a strict standard for evaluating the correctness of predictions, it does not capture partial or close matches, where the prediction overlaps with the standard answer and is sometimes acceptable. Therefore, the F1-score, which combines the precision and recall of the predicted answer with respect to the standard answer, is used to assess the accuracy and completeness of the model’s predictions together.
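The two metrics can be sketched as below. This is a simplified character-level version for illustration only, not the official CMRC2018 or DRCD evaluation script, which handles punctuation, mixed Chinese-English text, and multiple reference answers differently. F1 is computed as 2PR / (P + R), where P and R are the precision and recall of the overlapping characters.

```python
from collections import Counter


def exact_match(pred, gold):
    """1 if the prediction equals the reference after trimming whitespace, else 0."""
    return int(pred.strip() == gold.strip())


def f1_score(pred, gold):
    """Character-overlap F1 between prediction and reference."""
    p_chars, g_chars = list(pred.strip()), list(gold.strip())
    common = sum((Counter(p_chars) & Counter(g_chars)).values())
    if common == 0:
        return 0.0
    precision = common / len(p_chars)
    recall = common / len(g_chars)
    return 2 * precision * recall / (precision + recall)
```

For instance, a prediction that contains the two-character reference answer plus one extra character scores 0 on EM but 0.8 on F1 (precision 2/3, recall 1), which is the kind of near miss that EM alone would hide.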
Baseline model
Currently, the pre-training-fine-tuning method has become one of the mainstream methods in the field of machine reading comprehension (MRC). To evaluate the effectiveness of the proposed method in the MRC task, this paper chooses a traditional neural network model and some representative pre-training models for fine-tuning and performing performance comparison analysis, as shown in Table 4.
Baseline model
Ablation analysis
To verify the contribution of each module to the performance improvement of the model, ablation experiments were conducted on the CMRC2018 and DRCD datasets in this study. The muatt represents the multi-head self-attention mechanism, and the bigru represents the bidirectional GRU model. The experimental results are shown in Tables 5 and 6.
Ablations on the CMRC2018 dev set (%)
Ablations on the DRCD dev set (%)
As shown in Table 5, using the multi-head self-attention mechanism improves F1 by 1.18%; adding bigru improves F1 by 1.58%; and adding both modules improves F1 by 1.97%. Adversarial training with FGM and PGD improves F1 by 3.23% and 3.49%, respectively.
As shown in Table 6, our proposed model exhibits certain generalization abilities, with an F1-score that is 1.74% higher than the PERT model when using the FGM algorithm.
The perturbation radius size in adversarial training is a significant hyperparameter that directly impacts the model’s robustness and generalization ability. To investigate the influence of different perturbation radii on the Chinese reading comprehension task model, we conducted experiments with different perturbation radii using FGM and PGD algorithms and evaluated them to determine the appropriate perturbation radius for this task. Our proposed model was utilized, with perturbation radii set to 0.1, 0.3, 0.5, 1, and 2, respectively. Further experimental details are available in Tables 7 and 8. Subsequent ablation experiments further demonstrated the enhancement of model performance upon the incorporation of adversarial training algorithms.
FGM experiments with different epsilon (%)
PGD experiments with different epsilon (%)
Through the experimental results, we found that the optimal values for the perturbation radius for the FGM algorithm are 0.1 and 2, while for the PGD algorithm, the optimal values are 0.3 and 0.5. This means that the appropriate value for the perturbation radius in Chinese reading comprehension tasks cannot be simply determined through trial and error. Finding the optimal value requires considering multiple factors, including algorithm type, dataset characteristics, task requirements, etc. For example, for a certain dataset, setting the perturbation radius too large may cause the model to make incorrect predictions on correct data, which would affect the model’s performance. Additionally, using adversarial training algorithms will increase the training cost and time of the model, and also require fine-tuning of hyperparameters to achieve optimal performance on different tasks and datasets. Therefore, in practical applications, we should choose the appropriate algorithm and parameters based on specific needs and circumstances to achieve effective adversarial training.
To assess the efficacy of the proposed method, comparative experiments were conducted on two benchmark datasets, CMRC2018 and DRCD. The experimental results, averaged over multiple trials, are presented in Tables 9 and 10, respectively.
Results on the CMRC2018 dev set (%)
Results on the DRCD dev set (%)
Tables 9 and 10 show that the pre-trained models outperform the traditional QANet model on both the CMRC2018 and DRCD datasets. In addition, our proposed model improves on both EM and F1-score compared with the original pre-training and fine-tuning method. Specifically, the EM values of our model on the CMRC2018 and DRCD datasets increased by 5.02% and 9.1%, respectively. We also compared our method with other relatively recent Chinese pre-trained models, and the results confirm the effectiveness of our method.
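For readers unfamiliar with the two metrics, the sketch below illustrates how EM and F1 can be computed for Chinese answers at the character level. It is a simplified illustration under our own assumptions (whitespace removal, lowercasing, bag-of-characters overlap); the official evaluation scripts of the benchmarks may tokenize and match differently, so this code is not expected to reproduce the reported numbers exactly.

```python
from collections import Counter


def normalize(text):
    """Remove whitespace and lowercase Latin characters (assumed normalization)."""
    return "".join(text.split()).lower()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def char_f1(prediction, reference):
    """Character-level F1 based on multiset overlap of characters."""
    pred = list(normalize(prediction))
    ref = list(normalize(reference))
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Average EM and F1 in percent; each question may have several
    acceptable reference answers, and the best score per question is kept."""
    em_total = 0.0
    f1_total = 0.0
    for pred, refs in zip(predictions, references):
        em_total += max(exact_match(pred, r) for r in refs)
        f1_total += max(char_f1(pred, r) for r in refs)
    n = len(predictions)
    return 100.0 * em_total / n, 100.0 * f1_total / n


em, f1 = evaluate(["长江"], [["长江流域"]])
print(em, round(f1, 2))
```

A prediction that covers only part of the reference answer receives no EM credit but partial F1 credit, which is why the two metrics are reported together.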
To test the reading comprehension and generalization capabilities of the proposed model, we conducted comparative experiments on a self-built scientific dataset. These experiments help us assess the effectiveness and feasibility of the model in real-world scenarios and gain deeper insight into its conceptual understanding and application capabilities in the scientific domain, as well as its potential in other application domains.
Personally constructed dataset
Publicly accessible Chinese metadata of scientific datasets is currently limited, which constrains the depth and breadth of scientific research. Therefore, in this study, we used the Requests library and the Scrapy framework in Python to retrieve over 30,000 data records containing descriptions and other metadata from the Chinese Earth System Science Data Center 1 . We then preprocessed these data, including data cleaning and deduplication.
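The cleaning and deduplication step can be sketched as follows. This is an illustrative example only: the field name `description`, the specific cleaning rules (stripping leftover HTML tags and collapsing whitespace), and the hash-based duplicate check are assumptions made for the sketch, not a record of the exact preprocessing we applied.

```python
import hashlib
import re


def clean_description(text):
    """Strip leftover HTML tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def deduplicate(records, key="description"):
    """Keep the first record for each distinct cleaned description.

    Records whose description is empty after cleaning are dropped.
    The field name `key` is a hypothetical choice for this sketch.
    """
    seen = set()
    unique = []
    for record in records:
        cleaned = clean_description(record.get(key, ""))
        if not cleaned:
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append({**record, key: cleaned})
    return unique


# Hypothetical records used only to exercise the functions.
records = [
    {"title": "A", "description": "<p>降水  数据集</p>"},
    {"title": "B", "description": "降水 数据集"},
    {"title": "C", "description": "   "},
]
print(deduplicate(records))
```

Hashing the cleaned text rather than the raw text means that records differing only in markup or spacing are treated as duplicates.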
Based on the descriptions, we constructed a scientific-domain machine reading comprehension dataset similar in format to CMRC2018 and annotated one thousand questions using ChatGPT 2 and manual correction. ChatGPT is a large neural network model based on deep learning that uses natural language processing techniques to achieve powerful text generation and comprehension capabilities. When annotating the self-built dataset, ChatGPT can infer and judge based on existing knowledge and models. It can analyze text content, understand context, and provide reasonable answers, which helps control the quality of the annotated data and the accuracy and consistency of the results [39]. After ChatGPT generated the self-built dataset, two graduate students cross-checked it, mainly verifying and correcting the generated “question,” “answer,” and “answer tag position” fields. Generated samples are shown in Table 11.
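Part of the manual check on the “answer tag position” can be automated. The sketch below assumes a SQuAD-style layout with `context`, `qas`, `answers`, `text`, and `answer_start` fields, which is our understanding of the CMRC2018 format; the identifiers and the sample question are hypothetical and serve only as an illustration.

```python
def check_answer_spans(paragraph):
    """Return (question_id, hint) pairs for answers whose recorded start
    offset does not point at the answer text inside the context."""
    context = paragraph["context"]
    problems = []
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            text = answer["text"]
            if context[start:start + len(text)] != text:
                found = context.find(text)
                if found >= 0:
                    hint = f"text occurs at {found}"
                else:
                    hint = "text not in context"
                problems.append((qa["id"], hint))
    return problems


# Hypothetical sample built from the example sentence discussed in the text.
context = (
    "This dataset utilizes Himawari-8 infrared brightness temperature data "
    "to retrieve quantitative precipitation in the Yangtze River Basin "
    "with a resolution of 1 hour/25 km."
)
answer = "1 hour/25 km"
sample = {
    "id": "SCI_0",
    "context": context,
    "qas": [
        {
            "id": "SCI_0_QUERY_0",
            "question": "What is the resolution of the retrieved precipitation?",
            "answers": [{"text": answer, "answer_start": context.index(answer)}],
        }
    ],
}
print(check_answer_spans(sample))
```

A check of this kind only confirms that offsets and answer strings are consistent; whether the question is sensible and the answer is correct still requires human review.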
The sample in Table 11 shows that the metadata of scientific datasets has distinct characteristics compared with general reading comprehension datasets. Firstly, the paragraph descriptions contain a large number of domain-specific terms. Terms such as “infrared,” “spatial patterns,” and “GPM IMERG Final” have specific meanings and usages in the scientific domain. Secondly, the sentences mix Chinese and English. For example, the sentence “this dataset utilizes Himawari-8 infrared brightness temperature data to retrieve quantitative precipitation in the Yangtze River Basin with a resolution of 1 hour/25 km” uses both Chinese and English within the same sentence. Thirdly, the sentences are strongly interdependent and form logical units, including a dataset overview, similar datasets, fine-grained feature descriptions of the dataset, data value descriptions, and the open-access characteristics of the dataset. From the perspective of MRC tasks, each logical unit can be considered a subtask that needs to be understood. Therefore, understanding paragraphs in scientific datasets is relatively challenging for models.
Example of scientific dataset
In the aforementioned experiments, QANet, a traditional neural network, did not perform satisfactorily. Therefore, in this experiment, we compared only pre-trained models: we fine-tuned popular pre-trained models and compared them with our proposed model. The results are shown in Table 12.
Results on the scientific dataset (%)
In the results presented in Table 12, our proposed model outperformed the BERT model. We used a pre-trained BERT model (trained on a combination of Wikipedia and book corpora) and fine-tuned it on our dataset as the baseline model. The BERT model utilizes self-attention mechanisms to implicitly learn global interactions between words. However, the Masked Language Modeling (MLM) task may be less suitable for reading comprehension. This mismatch in task requirements could be one of the reasons for the performance degradation of BERT in reading comprehension tasks. On the other hand, the PERT pre-training model, which uses the word order prediction task, exhibits better adaptation to complex reasoning tasks when applied to downstream tasks. Additionally, the results indicate that the model has broad applicability and a certain level of generalization ability. This implies that the model is not only suitable for general domain datasets but also has the potential to be used in the scientific data domain and across multiple domains.
Discussion
In this paper, we propose a Chinese machine reading comprehension (MRC) model that incorporates adversarial training and evaluate it on the general-domain CMRC2018 and DRCD datasets. The model achieves satisfactory results on both datasets, demonstrating its effectiveness in understanding and answering questions about Chinese texts. Furthermore, we conduct applied experiments in the scientific data field to evaluate the model’s performance on texts from a different domain. Our findings indicate that the model retains good accuracy and robustness even when the domains differ. This suggests that the model has potential for wide-ranging application and can handle MRC tasks in multiple domains.
Regarding the three questions raised above, we find the following. 1) Chinese permuted language models are an effective means of significantly improving the performance of MRC models on Chinese reading comprehension tasks. 2) Transformer models with multi-head self-attention mechanisms and Bi-GRU (Bidirectional Gated Recurrent Unit) models are well suited to capturing sequential information at the feature extraction and fusion levels. Through self-attention and bidirectional recurrent structures, these models consider both the global and local context of the input text and thus learn and encode semantic information effectively. 3) Adversarial training is an effective approach to improving model generalization and robustness, particularly on scientific datasets for Chinese MRC tasks. By adding adversarial examples during training, it makes the model more robust to noise and exceptional cases, which can improve its generalization performance on unseen data.
Future work
Compared to existing related work, our model in this paper has significant advantages. By incorporating adversarial training, our model becomes more robust and achieves satisfactory performance on different datasets. However, we also acknowledge some potential areas for improvement. Despite conducting generalization experiments in the scientific data field, our model may still encounter difficulties in handling domain-specific terminology and knowledge. To further enhance the performance of our model, our future research will focus on improving input tokenization and leveraging external knowledge to enhance the model’s understanding and application of domain-specific background knowledge. This will contribute to achieving better results in a wider range of application scenarios. Additionally, we plan to apply the model to intelligent question answering and retrieval tasks.
Conclusion
Based on the PERT pre-training model, we have constructed a machine reading comprehension model that uses adversarial training techniques. The model uses a multi-head self-attention mechanism for feature extraction and combines semantic features through a Bi-GRU network to overcome the challenges of semantic fusion and robustness in Chinese machine reading comprehension tasks. We conducted experiments on the CMRC2018 and DRCD datasets. The results show significant improvements in the EM and F1 metrics over the Chinese baseline model. Furthermore, our model also performs well on a self-constructed and challenging scientific dataset.
Footnotes
Acknowledgments
Author contributions
Jianping Liu and Xintao Chu wrote the main manuscript text, Jian Wang translated the full text, Meng Wang prepared figures, and Yingfei Wang prepared tables. All authors reviewed the manuscript.
Funding
This work was supported in part by the Starting Project of Scientific Research in the North Minzu University titled “Research of Information Retrieval Model Based on the Decision Process” under Grant 2020KYQD37; in part by the Key Research and Development Program for Talent Introduction of Ningxia Province, China, titled “Research on Key Technologies of Scientific Data Retrieval in the Context of Open Science” under Grant 2022BSB03044; and in part by the Natural Science Foundation Project of Ningxia Province, China, titled “User-Oriented Multi-Criteria Relevance Ranking Algorithm and Its Application” under Grant 2021AAC03205.
