Abstract
Although deep learning models perform strongly, they are still easily deceived by adversarial samples. Some methods for generating adversarial samples are slow, which is problematic for adversarial training, and existing adversarial training methods struggle to adapt to the dynamic nature of the model, so designing an efficient adversarial training method remains challenging. In this paper, we propose an adversarial training method built on two components: AGFAT, an improved adversarial sample generation method for adversarial training, and AGFAT-DAT, an improved dynamic adversarial training method. AGFAT uses a word frequency-based approach to identify significant words and filter replacement candidates, and uses an efficient semantic constraint module to reduce the time of adversarial sample generation. AGFAT-DAT is a dynamic adversarial training approach that repeatedly attacks the model after each round of adversarial training and uses the newly generated adversarial samples for the next round. Experiments show that the proposed method significantly reduces the generation time of adversarial samples, and that the adversarially trained model can also effectively defend against other types of word-level adversarial attacks.
Introduction
Pre-trained models based on deep neural networks, such as Bert [1] and Roberta [2], have been very successful in many fields, including computer vision and natural language processing, and have greatly advanced tasks such as email classification and sentiment analysis [3]. However, deep neural networks have been shown to be vulnerable to carefully crafted attacks that interfere with the model’s judgment by adding perturbations to the original samples [4]. In computer vision, these perturbations are usually pixel modifications. In natural language processing, text is discrete, yet it is also threatened by adversarial samples: adding imperceptibly small perturbations to clean text can interfere with the model’s correct judgment of the text [5–10].
One way to enable the model to defend effectively against adversarial attacks is adversarial training. Adversarial training adds perturbations to the original sample $X$ to generate an adversarial sample $X_{adv}$ and then trains the model on an augmented set that mixes $X$ and $X_{adv}$, improving the model’s tolerance to adversarial samples. Adversarial training can improve the robustness and generalizability of the model to a certain extent, but the improvement is limited by the quality and diversity of the adversarial samples [11]. Most existing algorithms for generating adversarial samples treat generation as a combinatorial optimization problem solved by heuristic search, iterating repeatedly to find the optimal adversarial sample. This iterative search is complex and expensive: generating a single adversarial sample may take thousands of searches and model queries, and this high cost greatly hinders the practical use of adversarial samples for adversarial training. Moreover, existing adversarial training methods attack the model once, generate adversarial samples, and mix them with clean samples to train the model. This is difficult to adapt to the dynamically changing nature of the model, so the improvement in robustness is always limited. Dynamically generating adversarial samples during training continuously subjects the model to different types of adversarial attacks and improves its robustness and generalization ability, but it significantly increases the training time and computational cost. Therefore, it is important to find an efficient, low-cost adversarial sample generation method and to investigate efficient adversarial training methods that improve model robustness.
In this paper, we propose to improve adversarial training by using a simpler and faster adversarial sample generation method together with dynamic training. The proposed AGFAT-DAT adversarial training method generates adversarial samples by dynamically attacking the adversarially trained Bert and Roberta models and uses the generated adversarial samples to train the model again, improving its robustness. The attack method AGFAT determines important words based on word frequency, filters the set of candidate words obtained from Counter-fitting word embeddings, and uses a more efficient semantic constraint module to generate adversarial samples. Experiments show that adversarial training has little impact on classification accuracy on clean datasets while significantly improving the model’s ability to resist word-level adversarial attacks, and that AGFAT can successfully interfere with the accurate classification of samples by the pre-trained model. The contributions of this paper are as follows:
(1) We propose AGFAT, a word frequency-based adversarial sample generation method that generates fluent, semantics-preserving adversarial samples and successfully attacks the model with less time cost.
(2) We propose AGFAT-DAT, an improved dynamic adversarial training method that improves the robustness and generalization of the model while keeping the impact on classification accuracy on clean datasets small.
(3) After adversarial training, the model can effectively resist other word-level adversarial attacks.
This paper is structured as follows: Section 2 introduces the current status of research on adversarial attack and adversarial training in recent years; Section 3 discusses the theoretical part of AGFAT; Section 4 discusses the theoretical part of AGFAT-DAT; Section 5 presents the experimental details and results of this paper; and Section 6 is the conclusion section.
Related works
Adversarial attack
In the field of NLP, adversarial attacks can be divided into three categories: character-level attacks, word-level attacks, and sentence-level attacks [12]. Character-level attacks can preserve semantics relatively intact [13]; Deepwordbug [14], for example, is a character-level black-box attack that uses several functions to compute word importance scores. However, such attacks can be detected by spell checkers [15]. Sentence-level attacks usually insert a sentence or a paraphrase somewhere in the text, but they have certain drawbacks [16, 17]. Compared to character-level and sentence-level attacks, word-level adversarial attacks largely preserve properties of the original sentence such as its semantics and syntax, so the research in this paper is directed at word-level adversarial attacks.
The process of generating word-level adversarial samples is divided into three main parts: (1) selecting the replaced words; (2) searching for the best perturbation to construct candidate sets (e.g., synonyms); and (3) filtering out samples that do not meet the requirements to generate adversarial samples (semantic constraints, grammatical correctness, etc.). The adversarial attack can be seen as a combined optimization of these three parts. Some theories based on greedy algorithms have been proposed one after another: Ren [8] et al. proposed a probabilistic weighted word salience word-level adversarial attack (PWWS) to determine word importance by word salience; Jin [6] et al. proposed Textfooler, which differs from PWWS in that word importance and synonyms are selected in different ways; Textbugger [7] contains both word-level and character-level multiple attacks. Bert-attack [5] predicts words for replacement based on the Mask-language-model and identifies important words based on the magnitude of the change in the model’s predicted confidence in the real label of the sentence.
Meanwhile, Morris [18] and Herel [19] et al. show that some word-level adversarial attacks produce adversarial samples with syntactic and semantic problems. An unavoidable step in word-level adversarial attacks is constructing synonym candidate sets; previous researchers have proposed word embeddings such as Glove [20] and Counter-fitting [17] to construct suitable synonym sets for words. If the constructed candidate sets are large and a greedy approach tries the candidate words one by one for each input sample, the time cost is very high.
Based on this, we propose to use a word frequency-based approach to speed up the process of determining word importance. In addition, SPE [21] is used to generate sentence representations, and sentence similarity is then calculated using cosine similarity. SPE provides a more efficient semantic constraint: the vectors it outputs preserve the original semantics to the greatest possible extent, and it speeds up adversarial sample generation.
Adversarial defense theory
One existing approach to defending against adversarial attacks is detection and recovery, which are generally performed together. Zhou [22], Xu [23], Shen [24] et al. propose methods that first detect possible perturbations in the input and then recover them. Mozes [25] and Bao [26] et al. propose detection defenses that use the frequency of the words in a sentence to determine whether the input is a clean sample or an adversarial sample. However, one drawback of the detection-recovery approach is that it is strongly limited by the detection performance of the model [27].
Another common approach to improving model robustness is adversarial training, which trains the network with both clean and adversarial samples. It consists of two steps: generating adversarial samples by attacking the target model, and then fine-tuning the model on the dataset augmented with these adversarial samples. Goodfellow et al. proposed FGSM to generate adversarial samples, and FGM [28] extended it to adversarial training in the textual domain. The robustness gained from adversarial training also depends heavily on the quality of the adversarial samples.
Many adversarial training methods were subsequently proposed. Chen [29] et al. proposed FreeLB, which reduces the training period while updating the gradient of model parameters, and Madry [30] et al. proposed PGD adversarial training, which generates adversarial samples and iterates training continuously to improve model robustness. Liu [31] et al. proposed a computationally less expensive adversarial training framework, A2T, which also generates adversarial samples based on synonyms and then uses them for adversarial training; Chen [32] et al. extended traditional adversarial training to fine-grained feature-level adversarial training; Miao et al. [33] proposed an adversarial learning framework that introduces contrastive learning into adversarial training; Berant et al. [34] proposed a discrete adversarial training approach to enhance the robustness of the model.
In addition, Yoo [35] et al. found that previous methods for generating adversarial samples did not sufficiently take into account the factor of attack time consumption, while adversarial training often requires the fast generation of adversarial samples. At the same time, the models attacked in the existing adversarial sample generation process are models trained with clean data only, and using only the adversarial samples generated by attacking the models trained with clean data for adversarial training has very limited improvement on the robustness of the models.
In this paper, we improve adversarial training in two aspects. Firstly, word frequency and the SPE sentence encoder are used to reduce the time loss of adversarial sample generation; secondly, the improved dynamic adversarial training is used to further improve the robustness of the model.
AGFAT
Cheaper adversarial sample generation method
A key aspect of using adversarial samples for adversarial training is the time cost of adversarial sample generation. Most previous studies have demonstrated the importance of computing word importance rankings to help locate the words that most affect the classification decision. One example is the deletion-based word ranking approach proposed in Textfooler, which removes the words in a sentence one at a time, feeds each shortened sentence to the model, and uses the change in confidence score to identify important words. Although the deletion-based approach is very effective, for long texts it requires hundreds of forward passes just to determine the important words, which takes a large amount of time and makes it unsuitable for adversarial training. A gradient-based approach determines significant words faster, but forward and backward propagation still requires significant computational resources and time, and it is not applicable to black-box scenarios.
Based on this, this paper proposes a word frequency-based adversarial sample generation method. First, the word importance ranking is determined from the frequency of words in the whole dataset, which saves time in identifying important words: after the frequencies of the words in a sentence are looked up, stop words are removed and the remaining words are sorted in descending order of frequency to form the word importance ranking. Second, the counter-fitting word embedding is used to search for synonyms and build a candidate set of replacement words for these words. After the candidate set is obtained, its words are reordered by word frequency and low-frequency words are tried first as replacement words, which reduces the number of queries to the model. Finally, the SPE semantic constraint module is used to eliminate low-quality adversarial samples. For tasks such as SST-2, where the dataset is small and the average text length is short, word frequencies computed from the dataset are unreliable, so the frequency dictionary provided by the FrequencyWords repository (footnote 1) is used to determine the important words for this type of dataset.
Attack steps
Under the black-box condition, the attacker does not know the specific parameters of the model and can only query the confidence scores and predicted labels given by the target model. Consider a dataset $D = \{X, Y\}$ consisting of $N$ sentences, where $X = \{X_1, X_2, \ldots, X_N\}$ and the corresponding labels are $Y = \{Y_1, Y_2, \ldots, Y_N\}$, and a pre-trained model $F: X \to Y$ that maps the text $X$ into the label space $Y$. For a given sentence $X_i \in D$, an effective adversarial sample $X_i^{adv}$ should change the model’s prediction while remaining semantically close to the original: $F(X_i^{adv}) \neq F(X_i)$ and $\mathrm{Sim}(X_i, X_i^{adv}) \geq \omega$.
Here, Sim is the similarity function that measures the semantic similarity between the adversarial sample and the original sample, and ω is the sentence similarity threshold: a sample is retained if its similarity is above the threshold and discarded if it is below. The proposed AGFAT algorithm is shown in Algorithm 1, which consists of three main steps:
Algorithm 1: AGFAT
Input: Clean example set D = {X_i, Y_i}; target model F; sentence similarity function Sim(·); word similarity function CosSim(·); sentence similarity threshold ω; λ ∈ {0, 1}; word embeddings word_emb over the vocabulary Vocab.
Output: Adversarial example set D_adv
1: Initialization: D_adv ← {}
2: Shuffle D
3: …
4: Obtain word importance by word frequency and sort in descending order for each word x_j in X_i.
5: Build a set W: filter out the stop words and take the top m% words.
6: …
7: Build a set S by extracting the top N synonyms using CosSim(x_j_emb, word_emb) for each word in Vocab.
8: Build a set w_{x_j} by selecting M low-frequency words from the set S and arrange them in ascending order of frequency.
9: w_{x_j} ← POSFilter(w_{x_j})
10-14: …
15: D_adv ← …
16: …
17: return D_adv
18: …
19: jump to Line 3
20-24: …
25: return D_adv
Given a sentence $X_i = \{x_1, x_2, \ldots, x_j\}$ containing $j$ words, this paper improves efficiency by using a word frequency-based approach to identify significant words: the words in the sentence are ranked in descending order of word frequency, and the resulting sequence is used as the word importance ranking. To avoid damaging the grammar, stop words such as “a” and “the” are removed using the NLTK and spaCy libraries. To reduce the search time, only some of the important words are taken, and the set $W$ of words to be replaced is generated from them.
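The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: tokenization is plain whitespace splitting, the stop-word list is a tiny hard-coded stand-in for the NLTK and spaCy lists, and `top_ratio` is a hypothetical parameter standing for the m% cut-off in Algorithm 1.

```python
from collections import Counter

# Tiny illustrative stop-word list (assumption); the paper uses NLTK and spaCy.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "it"}


def build_frequency_table(corpus):
    """Count how often each lower-cased token occurs in the whole dataset."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    return counts


def rank_words_by_frequency(sentence, freq, top_ratio=0.4):
    """Return the set W: non-stop words sorted by descending corpus
    frequency, truncated to the top `top_ratio` share (at least one word)."""
    tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
    unique = list(dict.fromkeys(tokens))  # deduplicate, keep first-seen order
    ranked = sorted(unique, key=lambda w: freq.get(w, 0), reverse=True)
    keep = max(1, round(len(ranked) * top_ratio)) if ranked else 0
    return ranked[:keep]


corpus = [
    "the movie is a great movie",
    "the plot of the movie is dull",
    "great acting and a great plot",
]
freq = build_frequency_table(corpus)
W_half = rank_words_by_frequency("the movie is dull", freq, top_ratio=0.5)
W_all = rank_words_by_frequency("the movie is dull", freq, top_ratio=1.0)
```

No model query is needed in this step, which is where the time saving over deletion-based ranking comes from.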
After the set of words to be replaced is obtained in step 1, the next step is to select suitable replacement words for them. To identify replacement words, the words are represented with counter-fitting word embeddings, which are specially designed for synonym recognition [17]. First, a synonym set $S$ is constructed for each significant word $x_j$, with the semantic similarity between two words measured by the cosine similarity of their word embeddings. Then $M$ words with high similarity and the same part of speech are selected from $S$ and arranged in ascending order of word frequency; a word that is not in the word frequency dictionary is assigned frequency 0 and placed at the top of the set. This yields the final synonym candidate set $w_{x_j}$ for word $x_j$.
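A small sketch of this candidate selection, following the order of lines 7 to 9 of Algorithm 1, is given below. The toy embeddings, the part-of-speech dictionary `pos`, and the parameter names `top_n` and `m` are assumptions made for illustration; a real run would load counter-fitting vectors and use a POS tagger.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_set(word, emb, pos, freq, top_n=25, m=5):
    """Build the candidate set for `word`."""
    # Set S: the top-N nearest neighbours in embedding space.
    synonyms = sorted(
        (w for w in emb if w != word),
        key=lambda w: cos_sim(emb[word], emb[w]),
        reverse=True,
    )[:top_n]
    # Keep the M lowest-frequency words in ascending frequency order.
    # Words missing from the frequency dictionary count as frequency 0,
    # so they end up at the front.
    low_freq = sorted(synonyms, key=lambda w: freq.get(w, 0))[:m]
    # POS filter: keep only candidates with the same tag as the original word.
    return [w for w in low_freq if pos.get(w) == pos.get(word)]


# Toy data (assumption): 2-d vectors instead of counter-fitting embeddings.
emb = {
    "good": [1.0, 0.0],
    "great": [0.9, 0.1],
    "fine": [0.8, 0.3],
    "well": [0.95, 0.05],
    "bad": [-1.0, 0.0],
}
pos = {"good": "ADJ", "great": "ADJ", "fine": "ADJ", "well": "ADV", "bad": "ADJ"}
freq = {"great": 50, "well": 80, "bad": 40}  # "fine" is missing, so frequency 0

candidates = candidate_set("good", emb, pos, freq, top_n=3, m=3)
```

Trying low-frequency candidates first means fewer substitutions have to be sent to the target model before one changes the prediction.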
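After obtaining the synonym candidate set in step 2, the generated adversarial sample is checked against the semantic constraint: it is retained only if its similarity to the original sample is above the threshold ω.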
Some recent studies have used Bert-score [36], Universal-Sentence-Encoders (USE) [37], and DistilBERT [38] to constrain the semantic gap between the adversarial and original samples, with USE being the most commonly used. USE preserves semantics to a greater extent, but it may occupy a large amount of GPU memory and take longer training time. DistilBERT requires about 10 times less GPU memory [31] and fewer operations than USE, but the distillation technique may leave the model unable to capture the details of a sentence effectively. If a generated adversarial sample cannot guarantee that the semantics and syntax of the sample do not change substantially, it is a failed sample, so an efficient method is needed to constrain the adversarial samples. The time cost of adversarial training is certainly an important issue, but the quality of the adversarial samples is also an important factor affecting the robustness of the model after adversarial training.
Based on this, this paper uses an efficient and robust sentence representation method, SPE [21], to encode the original and adversarial samples into high-dimensional vectors and uses their cosine similarity score as an approximation of semantic similarity. SPE not only checks whether the semantics of the generated text are maintained after the adversarial attack but can also further remove the influence of antonyms, and its use further accelerates the whole process of adversarial sample generation. The purpose of this paper is to find a balance between “speed” and “quality” when generating adversarial samples for adversarial training.
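The thresholding logic of the semantic constraint can be sketched as below. The encoder here is a bag-of-words stand-in and is not SPE; `toy_encode`, `cosine`, and `passes_constraint` are hypothetical names used only to show how a cosine score is compared against the threshold ω.

```python
import math
from collections import Counter


def toy_encode(sentence):
    """Stand-in sentence encoder (assumption): bag-of-words counts.
    A real pipeline would return the SPE sentence vector instead."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def passes_constraint(original, adversarial, omega, encode=toy_encode):
    """Keep the adversarial sample only if its similarity reaches omega."""
    return cosine(encode(original), encode(adversarial)) >= omega


original = "the film is good fun"
adversarial = "the film is fine fun"
kept_loose = passes_constraint(original, adversarial, omega=0.6)
kept_strict = passes_constraint(original, adversarial, omega=0.9)
```

With four of five tokens shared, the toy similarity is about 0.8, so the sample passes at ω = 0.6 and is rejected at ω = 0.9. A higher ω trades attack success for sample quality.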
AGFAT-DAT
Dynamic training strategies
Adversarial training in computer vision usually generates adversarial samples in each mini-batch. This is difficult to do in natural language processing, first because adversarial attacks require other components (e.g., sentence encoders), and second because, as stated by Yoo [31] et al., the Bert and Roberta models require a large amount of memory in training, which makes it difficult to run adversarial attacks within the same mini-batch as adversarial training. Therefore, this paper generates adversarial samples before each epoch and then trains the model on the generated samples mixed with clean samples, maximizing GPU utilization.
In this paper, we propose an improved dynamic adversarial training strategy, AGFAT-DAT, with the procedure shown in Fig. 1. The process is as follows. The model is first fine-tuned on a clean dataset. Assuming that the model is adversarially trained $N$ times, at the $i$-th round the model $M_{i-1}$ is attacked with the AGFAT algorithm on a random sample of the dataset, and the generated adversarial samples are mixed with clean samples to train the model adversarially, yielding the model $M_i$. The advantage is that the model attacked in each round is the model produced by the previous round of adversarial training, so dynamic adversarial training can improve robustness compared with existing methods that attack a static model.

Fig. 1. Schematic diagram of one round of AGFAT-DAT adversarial training.
The whole training process is as follows. First, the pre-trained model is fine-tuned using clean samples. Second, to avoid repeating attacks, the dataset is randomly shuffled before each attack and the set of adversarial samples is generated using the AGFAT algorithm. The generated adversarial samples are then mixed with clean samples to train the model, and the model is saved; before the next round of adversarial training, this most recently trained model is attacked. The process is iterated for α rounds of training.
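The control flow of this loop can be sketched in a few lines. This is only an outline under assumptions: `attack` and `train` are hypothetical callables supplied by the caller (standing in for AGFAT and for a fine-tuning routine), and `rounds` and `sample_size` are illustrative parameter names for the number of iterations and the size of the random sample attacked per round.

```python
import random


def dynamic_adversarial_training(model, clean_data, attack, train,
                                 rounds, sample_size, seed=0):
    """Attack the latest model, mix the results with clean data, retrain.

    attack(model, examples) -> list of adversarial examples
    train(model, examples)  -> updated model
    """
    rng = random.Random(seed)
    data = list(clean_data)
    for _ in range(rounds):
        rng.shuffle(data)                    # avoid attacking the same samples each round
        subset = data[:sample_size]
        adversarial = attack(model, subset)  # attacks the previous round's model
        model = train(model, data + adversarial)  # produces the next model
    return model


# Toy stand-ins (assumptions) so the loop can be run end to end.
attacked_models = []


def toy_attack(model, examples):
    attacked_models.append(model)
    return [(text + " [adv]", label) for text, label in examples]


def toy_train(model, examples):
    return model + 1  # the "model" is just a round counter here


clean = [("a", 0), ("b", 1), ("c", 0)]
final_model = dynamic_adversarial_training(
    0, clean, toy_attack, toy_train, rounds=3, sample_size=2
)
```

The point of the sketch is the data dependency: each call to `attack` receives the model returned by the previous call to `train`, rather than the original fine-tuned model.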
For the training dataset $D = \{X, Y\}$, $X = \{X_1, X_2, \ldots, X_N\}$ are the input samples, $Y_i$ is the true label corresponding to $X_i$, $N$ is the number of samples, and $C$ is the number of label categories. The samples are sent to the encoder to generate the feature representations $h_i$. The cross-entropy loss used for classification is given in Equation (3).
The final training objective is given in Equation (4), where $D'$ is the mixed set of clean and adversarial samples; the objective is to minimize the classification loss on the true labels of both the original and adversarial samples.
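For reference, Equations (3) and (4) can be written in a standard form that is consistent with the definitions above. This is a sketch under assumptions rather than the authors' exact notation: the predicted class probability $p(c \mid h_i)$, the model parameters $\theta$, and the indicator form of the label are introduced here for illustration.

```latex
% Cross-entropy loss over N samples and C classes (assumed standard form)
\begin{equation}
\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}
  \mathbb{1}[Y_i = c]\,\log p(c \mid h_i) \tag{3}
\end{equation}

% Training objective on the mixed set D' of clean and adversarial samples
\begin{equation}
\min_{\theta}\; \mathbb{E}_{(X, Y)\sim D'}
  \big[\mathcal{L}_{CE}\big(F_{\theta}(X),\, Y\big)\big] \tag{4}
\end{equation}
```

Under this reading, adversarial samples keep the label of the clean sample they were derived from, so the same loss is applied to both kinds of input.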
Experimental setup
In this paper, IMDB, MR, SST-2, CR and AG are selected as the datasets for the experiments, and the summary information of the datasets is shown in Table 1. The official Bert_base and Roberta_base models were used directly in order to better utilize the knowledge of the pre-trained models. The word-level adversarial attacks Textfooler [6], PWWS [8], and BAE [10] were used as comparison baselines to test the success rate of the attacks; the adversarial training methods A2T [31] and FGM [4] were used as baseline methods to compare the robustness of AGFAT-DAT. All experimental results in this paper were obtained under the same parameter settings.
Table 1. Summary information of the datasets
IMDB: a sentiment analysis dataset with the task of predicting the sentiment (positive or negative) of movie reviews.
MR: a sentence-level sentiment classification dataset of positive and negative movie reviews.
SST-2: a phrase-level binary sentiment classification dataset built from fine-grained sentiment labels on movie reviews.
CR: a dataset of customer reviews, where each sample is marked as positive or negative.
AG: a sentence-level news classification dataset with four topic categories: world, sports, business, and science/technology.
When fine-tuning the model, the batch size was set to 32 and the AdamW optimizer [21] was used with a learning rate of 2e-5 and a weight decay of 0.01. The model was fine-tuned for 10 epochs on clean samples and then trained for 3 epochs on the mixed sample set. The perturbation ratio was set to 10% for the IMDB dataset and 40% for the SST-2, MR, and AG datasets. For the AG dataset, the model was fine-tuned on 40,000 data points taken from the dataset. To save training time, the similarity matrix was computed in advance and loaded; the similarity threshold ω was set to 0.6 for the MR and SST-2 datasets and 0.8 for the IMDB and AG datasets; λ was set to 0.2; and the number of synonyms was 25.
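For convenience, the settings listed above can be collected in one place. The dictionary layout and key names below are assumptions made for illustration; only the values come from the text, and no values are given for datasets the text does not mention.

```python
# Fine-tuning settings shared by all datasets (values from the text).
FINE_TUNE = {
    "batch_size": 32,
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "clean_epochs": 10,   # fine-tuning on clean samples
    "mixed_epochs": 3,    # training on the clean + adversarial mix
}

# Attack settings shared by all datasets.
ATTACK_COMMON = {"lambda": 0.2, "num_synonyms": 25}

# Per-dataset attack settings: perturbation ratio and similarity threshold.
ATTACK_PER_DATASET = {
    "IMDB": {"perturbation_ratio": 0.10, "omega": 0.8},
    "SST-2": {"perturbation_ratio": 0.40, "omega": 0.6},
    "MR": {"perturbation_ratio": 0.40, "omega": 0.6},
    "AG": {"perturbation_ratio": 0.40, "omega": 0.8},
}


def attack_settings(dataset):
    """Merge the shared and per-dataset attack settings for one dataset."""
    return {**ATTACK_COMMON, **ATTACK_PER_DATASET[dataset]}


imdb_settings = attack_settings("IMDB")
```

The longer-text datasets (IMDB, AG) use the stricter threshold of 0.8, while the short-text datasets (MR, SST-2) use 0.6.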
Results of adversarial robustness

Fig. 2. Effect of the size of λ on the adversarial training results: classification accuracy of the model on the original test set (Ori-Set), the adversarial sample set generated using Textfooler (T-Adv-Set), and the adversarial sample set generated using AGFAT (A-Adv-Set).
Table 2. Comparison of running time, number of queries, and perturbation ratio. The results are averaged over two experiments. All experiments were run on an NVIDIA RTX 2080 Ti.
The experimental results show that the word frequency-based approach (WF) accelerates adversarial sample generation. With WF, Textfooler generates adversarial samples about 18% faster on the SST-2 dataset, while AGFAT reduces the time cost by about 30% compared to Textfooler, and by about 34% compared to Textfooler on the MR dataset. WF not only reduces the time cost but also reduces the average number of queries to the model.
Table 3. Attack success rates. A.S.% denotes the attack success rate, calculated as shown in Equation 5; Δ% denotes the percentage change between the attack success rates of different training methods. The results are averaged over three experiments.
Table 3 shows the performance of the model on adversarial samples when it is attacked before and after adversarial training with four methods: AGFAT, Textfooler, PWWS, and BAE. AGFAT uses stronger semantic constraints to reject low-quality adversarial samples, which lowers its attack success rate, but the purpose of this paper is to generate adversarial samples quickly for adversarial training. As the table shows, adversarial training with samples from a given attack significantly reduces the success rate of that same attack, by up to 50.3% for the Bert_base model and 65.8% for the Roberta_base model. It also reduces the success rate of the other baseline attacks: by up to 26.0% for Textfooler, up to 33.1% for PWWS, and up to 32.1% for BAE. This demonstrates that AGFAT-DAT can effectively improve the model’s ability to resist word-level adversarial attacks.
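One common way to compute an attack success rate is sketched below. It is an assumption about the form of Equation 5, which is not reproduced here: the rate is taken as the share of originally correctly classified samples whose prediction becomes wrong after the attack, and samples the model already misclassifies are skipped. The function name is hypothetical.

```python
def attack_success_rate(original_preds, adversarial_preds, labels):
    """Percentage of correctly classified originals that the attack flips."""
    attacked = 0
    succeeded = 0
    for orig, adv, gold in zip(original_preds, adversarial_preds, labels):
        if orig != gold:
            continue  # already misclassified: skipped, not counted as attacked
        attacked += 1
        if adv != gold:
            succeeded += 1
    return 100.0 * succeeded / attacked if attacked else 0.0


labels = [1, 0, 1, 1, 0]
original_preds = [1, 0, 1, 0, 0]      # the fourth sample is already wrong
adversarial_preds = [0, 0, 1, 0, 1]   # the first and fifth samples are flipped
rate = attack_success_rate(original_preds, adversarial_preds, labels)
```

In this toy case four samples are attacked and two are flipped, giving a rate of 50%. A lower rate after adversarial training indicates a more robust model.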
Table 4. Examples of adversarial samples generated on the SST-2 dataset by using AGFAT-DAT to attack the Bert_base model. Modified words are shown in bold.
Table 5. Classification accuracy on the sets of adversarial samples generated using TextFooler and AGFAT. Orig.Acc. is the classification accuracy on the original clean samples, and Adv.Acc. is the classification accuracy of the model on the set of adversarial samples.
From Table 5, we can see that adversarial training significantly improves the robustness of the model. The gradient optimization-based adversarial training method FGM provides weaker defense against word-level adversarial attacks, but it still improves the classification accuracy of the model on adversarial samples. For the MR dataset, the classification accuracy on the adversarial samples after adversarial training with A2T improves to 28.6% and 46.8%, respectively. In comparison, AGFAT-DAT adversarial training improves robustness significantly: the classification accuracy improves to 66.5%, 68.9% and 50.2% on the sets of adversarial samples generated using AGFAT, which also shows that adversarial training can effectively resist the adversarial samples used for adversarial training. For the MR, SST-2 and IMDB datasets, the classification accuracy on the adversarial samples generated using Textfooler improves to 35.3%, 36.7% and 35.2%. Overall, the model adversarially trained with AGFAT-DAT is more robust than models trained with the baseline approaches.
This section verifies whether adversarial training with the AGFAT-DAT method affects the model’s classification accuracy on the original test set. Because adversarial training can also be regarded as a data augmentation method that affects the generalization of the model, the classification accuracy of the model before and after adversarial training is also tested on the test set of the CR dataset. Here, the Bert_base model was fine-tuned for 10 epochs on the training set, and 500 adversarial samples were generated as the adversarial sample set by attacking the Bert_base and Roberta_base models without fine-tuning. The specific results are shown in Table 6.
Table 6. Classification accuracy (%) of the model before and after adversarial training for the original training set, the adversarial sample set, and the CR test set. Orig.acc. is the classification accuracy on the original test set, and Adv.acc. is the classification accuracy on the adversarial sample set generated using AGFAT for the test set.
From the results, it can be seen that adversarial training with the adversarial samples generated by AGFAT not only significantly improves the robustness of the models but also improves, to varying degrees, their classification accuracy on the test set of the CR dataset. Although adversarial training reduces the classification accuracy of the Bert_base and Roberta_base models on the IMDB and SST-2 test sets, the reduction is small. In addition, the models adversarially trained with AGFAT-DAT on all three datasets achieve higher classification accuracy on the CR test set, indicating that adversarial training with the adversarial samples generated by AGFAT can improve the generalization of the models.
This section verifies the effectiveness of the dynamic adversarial training method by comparing, on the set of adversarial samples, the classification accuracy of the Bert_base model without adversarial training, the model trained with ordinary adversarial training (AT), and the model trained with dynamic adversarial training (DAT). For AT, 2000 adversarial samples are used for adversarial training, and another 500 adversarial samples not involved in training are used for testing. The results are shown in Table 7. The table shows that either adversarial training method improves the robustness of the model. Compared with ordinary adversarial training, dynamic adversarial training improves the classification accuracy on the adversarial samples generated by Textfooler by 12.2%, 8.6% and 24.4%, respectively, and the classification accuracy of the model after dynamic adversarial training with AGFAT improves by 11.8%, 13.8% and 16.2% on the adversarial samples. This also shows that dynamic adversarial training improves the robustness of the model compared with ordinary adversarial training.
Table 7. Classification accuracy (%) of the model after using ordinary adversarial training and dynamic adversarial training on the set of adversarial samples
In the Bert_base and Roberta_base models, for the SST-2 dataset, the adversarial samples were generated using Textfooler and AGFAT, and the [CLS] features were used as the embedding representations of the input samples, and the distances between the [CLS] features of the original and adversarial samples were measured, as shown in Fig. 3. From the figure, we can see that the model after adversarial training can reduce the distance between the original and adversarial samples more significantly, which also shows that the whole process of adversarial training is to map the original and adversarial samples to a very close distance to improve the robustness of the model.

Fig. 3. L2 distance between the [CLS] features of the adversarial sample and the original sample. Base_model represents the model after fine-tuning, and AT-model represents the model after adversarial training; TF and AG denote adversarial samples generated using Textfooler and AGFAT.
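The distance measurement can be sketched as follows, assuming the [CLS] features have already been extracted as plain lists of floats, one vector per sample. The function names and the toy vectors are hypothetical; in practice the vectors would come from the encoder's [CLS] output.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_cls_distance(original_feats, adversarial_feats):
    """Average L2 distance over paired original/adversarial [CLS] features."""
    pairs = list(zip(original_feats, adversarial_feats))
    return sum(l2_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d features (assumption); real [CLS] vectors are much larger.
original_feats = [[0.0, 0.0], [1.0, 1.0]]
adversarial_feats = [[3.0, 4.0], [1.0, 1.0]]
avg_distance = mean_cls_distance(original_feats, adversarial_feats)
```

A smaller average distance for the adversarially trained model means it maps an original sample and its adversarial counterpart to nearby representations.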
In addition, to check whether the adversarially trained model can accurately capture the important information features in the sentence, the attention scores of each word and [CLS] feature in the sentence were calculated. The Roberta_base model was fine-tuned beforehand using the IMDB training set to calculate the L2 distance between features, and the example sentences were taken from the sample utterances in Table 4, and the results are shown in Fig. 4. From the results, we can see that the model after the adversarial training pays more attention to the words expressing negative emotions, such as "fear", in the sentences; on the contrary, the original Roberta model does not pay much attention to this information. Therefore, adversarial training can improve the sensitivity of the model to semantic changes and help it focus on key information with greater attention, thus classifying the adversarial samples correctly.

Fig. 4. Visualization of the attention plot. Roberta-AT represents the model after adversarial training, with darker colors representing higher scores.
Conclusion
In this paper, an improved dynamic adversarial training strategy, AGFAT-DAT, is proposed. It generates adversarial samples for the next round of adversarial training by continuously attacking the model produced by the previous round, improving the model’s ability to defend against word-level adversarial attacks. The core of AGFAT-DAT is AGFAT, a fast and efficient word frequency-based adversarial sample generation method designed to generate adversarial samples quickly for adversarial training. Extensive experiments demonstrate that AGFAT has a lower time cost than the baseline methods and can attack the model faster, and that models adversarially trained with AGFAT-DAT are more robust and generalize better than those trained with the baseline methods.
Acknowledgments
This work was supported by the Natural Science Foundation of China under Grant 61562065 and the Inner Mongolia Natural Science Foundation Project under Grants 2019MS06001 and 2023MS06012.
Footnotes
1 https://github.com/hermitdave/FrequencyWords
