Abstract
Test-time augmentation (TTA) is a widely adopted technique in computer vision that improves prediction performance by aggregating a model's predictions over multiple augmented test samples, without additional training or hyperparameter tuning. While previous research has demonstrated the effectiveness of TTA in visual tasks, its application to natural language processing (NLP) tasks remains challenging because of complexities such as varying text lengths, the discrete nature of word elements, and missing word elements. These factors make it difficult for the standard TTA method to preserve label invariance for augmented text samples. This paper therefore proposes a novel TTA technique called Defy, which combines a nearest-neighbor anomaly detection algorithm and an adaptive weighting network architecture with a bidirectional KL divergence regularization term between the original sample and the aggregated sample, encouraging the model to make more consistent and reliable predictions across augmented samples. In addition, through comparison with Defy, the paper examines how common TTA methods may impair the semantic meaning of the text during augmentation, shifting the model's predictions from correct to corrupted. Extensive experiments demonstrate that Defy consistently outperforms existing TTA methods on various text classification tasks and brings consistent improvements across different mainstream models.
Keywords
Introduction
Pre-trained language models (PLMs) have achieved remarkable success in various natural language processing (NLP) tasks [13, 47]. Nevertheless, most existing work focuses on improving the performance of PLMs during training, through models with more parameters, improved network structures [13, 45], various data augmentation methods [44, 58], and adversarial training [27]. The robustness of PLMs during inference is often overlooked.
Typically, data augmentation is applied before or during model training, but recent studies have shown that it can also be applied during model inference [29, 48], which is known as test-time augmentation (TTA). TTA is a tried-and-true method for improving the final prediction accuracy and robustness of language models by aggregating augmented samples. It does not rely on any hyper-parameter tuning and can improve the robustness of the model to test data perturbations without additional training. TTA has been widely used in computer vision tasks [2, 52], where it has proven effective in improving model accuracy and robustness. However, the application of TTA in NLP is still in its infancy, and several obstacles remain. In computer vision, the most commonly used data augmentation methods, such as rotation, rescaling, and translation, largely preserve labels and convey the crucial visual information about the described objects or scenes. In contrast, the augmentation methods used in text processing tasks, such as synonym replacement, random word deletion, and word position exchange, often alter the semantic and grammatical structure of the text. This makes it challenging to select effective augmented samples for aggregation and difficult to ensure that the labels of aggregated samples are not damaged, which is crucial for the accuracy and stability of the model.
Another obstacle is how to more effectively aggregate multiple augmented data samples to improve the robustness of the model. The standard TTA method simply averages the prediction results of all augmented samples, which can bring some performance improvements, but it ignores the fact that not every augmented sample can bring value gains.
Simply averaging the probability predictions of all augmented samples often leads to significant prediction biases due to the influence of outlier augmented samples. Although large-scale pre-trained language models have been shown to have some ability to recognize out-of-distribution data [15], their accuracy is easily disrupted by slight perturbations, which leads to a drop in performance [1]. Therefore, how to calibrate and filter outliers in the augmented samples [8, 17], reassign sample weights, and minimize the risk of damaging model performance through anomalous augmented samples is a challenging problem.
In this work, we focus on how to improve the robustness of TTA aggregation while avoiding overemphasizing the influence of anomalous augmented samples. [50] focuses on the selection of data augmentation methods during testing, whereas our focus lies in enhancing test-time aggregation, so we do not discuss augmentation selection here. We present the related work in Section 2, provide the problem definition in Section 3, and formally propose our methodology in Section 4. Furthermore, we conduct extensive ablation experiments on various factors that may affect TTA net gains. Our method adds a regularization term, based on the bidirectional KL divergence between the distribution of the original samples and that of the aggregated samples, to the conventional cross-entropy loss. This promotes consistency in predictions across different augmented samples and adapts better to non-uniform weights among augmented samples. By incorporating anomaly Nearest Neighbor (ANN) detection to calibrate abnormal augmentations, our method can effectively filter out abnormal augmented samples and minimize the risk of uncertainty amplification caused by different augmentation methods. The method can be applied "plug-and-play" to any existing base model without any hyper-parameter tuning, so it works seamlessly with other techniques that enhance model robustness.
Related work
Test-time augmentation
TTA is a technique applied to a trained model during testing: multiple augmented samples are generated for each original sample, and the average prediction over the augmented samples is used as the aggregated result to improve the final output of the model. Although data augmentation is typically applied during model training, it can also be used during prediction. TTA has been widely demonstrated to enhance model accuracy and robustness [24], address distribution shift [60], and defend against adversarial attacks [40]. Researchers have proposed various TTA methods. [22] proposed an instance-aware TTA algorithm based on a loss predictor that dynamically selects samples for TTA. [39] used the Mixup algorithm [59] to mix the input with other random clean augmented samples, thereby improving the model's ability to withstand adversarial attacks after TTA. [31] introduced a greedy search strategy to select samples for augmentation. [31] discussed the impact of various basic text data augmentation strategies on model accuracy in text classification applications.
Anomaly detection
Anomaly detection involves identifying unexpected behavior in data that leads to anomalies, deviations, or outliers, and assigning uncertainty (anomaly) scores to them. It is also commonly used to calibrate the uncertainty of anomalous data [10]. Traditional anomaly detection techniques are mainly based on statistical methods, while machine learning-based techniques further employ self-supervised, unsupervised, or semi-supervised methods, such as One-Class SVM [47] and Local Outlier Factor (LOF), which use self-supervised and contrastive learning to achieve better performance in detecting neighboring samples [16]. Deep learning-based techniques, on the other hand, use neural networks for anomaly detection, such as unsupervised LSTM-based anomaly detection [38], BERT-based anomaly detection [12], and adaptive threshold-based streaming anomaly detection [43].
Ensemble selection
Ensemble learning is a machine learning strategy that makes overall decisions based on the prediction results of multiple models [5]. It can improve the overall performance and robustness of the model, but the cost is that a large amount of computational resources are required to train these models. Although TTA is often used in conjunction with a single model, it is meaningful to consider TTA as an aggregation of different models. This is because during TTA, each augmented data sample is equivalent to a new test sample, and when multiple test results based on these samples are aggregated, it is approximating a local ensemble learning process. A typical example is that [9] proposed to select the optimal clustering results from multiple clustering results for aggregation.
Problem formulation
Test Time Augmentation (TTA) is a technique used to improve the performance of a model during the inference phase. It involves creating multiple versions of each test sample using data augmentation techniques, and then averaging the model’s predictions for each version to produce the final prediction. This method is based on the assumption that the model may recognize different features in the different augmented versions of a sample, leading to a more robust prediction.
Let us denote the original input sample as x and the model's prediction function as f. In the usual scenario without TTA, the prediction for the sample is f(x).
The TTA technique applies a set of augmentation transformations $T = \{t_1, t_2, \dots, t_m\}$ to the sample x, creating a set of augmented samples $\{t_1(x), t_2(x), \dots, t_m(x)\}$. The model predicts each augmented sample, giving the prediction set $P_{tta} = \{f(t_1(x)), f(t_2(x)), \dots, f(t_m(x))\}$.
The final prediction $p_{tta}$ for the original sample x is then obtained by averaging the predictions in $P_{tta}$. This can be formally expressed as:
$$p_{tta} = \frac{1}{m} \sum_{i=1}^{m} f(t_i(x))$$
The problem of TTA thus lies in selecting the appropriate set of augmentation transformations T and in determining how to aggregate the set of predictions P to produce the final prediction. Different choices for T and the aggregation method can lead to different performance characteristics for the TTA technique.
One common challenge is deciding which augmentation techniques to include in T. Not all techniques may be beneficial for all types of data or models, and some may even harm the model’s performance if they introduce too much noise or distort the data in a way that the model cannot handle.
Another challenge is how to aggregate the set of predictions Ptta to produce the final prediction. It is necessary to assign appropriate weights to different data augmentation samples to eliminate the negative impact of certain samples.
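The difference between plain averaging and weighted aggregation can be illustrated with a minimal sketch. The names `standard_tta`, `weighted_tta`, and `toy_model` are hypothetical and introduced only for illustration; the toy classifier is an assumption and is not one of the models used in this paper.

```python
import numpy as np

def standard_tta(f, x, transforms):
    """Average the class-probability vectors f(t(x)) over all transforms t."""
    preds = np.stack([f(t(x)) for t in transforms])  # shape (m, C)
    return preds.mean(axis=0)

def weighted_tta(f, x, transforms, weights):
    """Aggregate the same predictions with non-uniform, normalised weights."""
    preds = np.stack([f(t(x)) for t in transforms])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # weights sum to 1
    return w @ preds

def toy_model(text):
    # Hypothetical stand-in classifier: [p(negative), p(positive)].
    # It relies entirely on the presence of the word "nowhere".
    if "nowhere" in text.split():
        return np.array([0.9, 0.1])
    return np.array([0.2, 0.8])

transforms = [
    lambda s: s,                           # identity
    lambda s: s.lower(),                   # harmless perturbation
    lambda s: s.replace("nowhere ", ""),   # deletion that flips the meaning
]
x = "the story and characters are nowhere near gripping enough"

p_avg = standard_tta(toy_model, x, transforms)
p_w = weighted_tta(toy_model, x, transforms, weights=[1.0, 1.0, 0.0])
```

In this toy setting the uniform average is pulled toward the wrong class by the third augmentation, while a weighting that assigns it zero weight recovers the prediction of the unperturbed input.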
Despite these challenges, TTA has been shown to be a valuable technique that can significantly improve model performance, making it a worthwhile area of study and experimentation.
As shown in Fig. 3.1, when using the standard TTA method to generate augmented samples for aggregated prediction, it is prone to cause the final prediction results of the model to deviate from the ground truth. This is because the prediction results of the augmented samples may not all bring net value-added benefits, so we need to effectively measure, screen and allocate weights based on the intrinsic characteristics of the augmented samples to minimize the risk of damage caused by abnormal augmented samples. If we simply use the standard TTA method to average the prediction results of all augmented samples, the final prediction results are easily affected by abnormal augmented samples, leading to deviation from the ground truth.

Visualization results of IG sampling of samples from the SST-5 dataset after different data augmentations (see Section 6.2). Different colors indicate different levels of importance, with darker colors indicating greater importance.
Therefore, we further attempt to use Integrated Gradients (IG) [41] to help us analyze and observe which augmented samples have a negative impact on the aggregated prediction.
Figure 3.1 shows the importance of different words in sentences obtained by different data augmentation methods. Specifically, the input sample is "the story and characters are nowhere near gripping enough" with the true label "negative". We observe that the RWD method, which involves word deletion, produces the augmented sample "story characters are near gripping enough." This removes the crucial feature "nowhere" and results in the model predicting the sample as "very positive", which is inconsistent with the original label. The RPI operation, in turn, inserted a question mark ("?") at the end of the sentence, transforming it from a declarative into an interrogative sentence. Such an alteration may enrich the semantics of the sentence during training. However, the standard TTA method only employs a simple averaging operation, which can neither reduce the weight of anomalous samples nor remove them. As a result, the correct label of the original sample is compromised. Therefore, it is necessary to identify anomalous samples and allocate varying weights to the effective augmented samples.
To overcome the limitations of TTA, we introduce a nearest-neighbor anomaly detection algorithm and a network architecture with learnable weights to improve the prediction accuracy and robustness of deep learning models. Specifically, our approach involves two steps (see Sections 4.2 and 4.3): (1) We use a k-NN anomaly detection algorithm to select the most anomalous samples from the augmented samples. (2) We use a sub-network architecture with learnable weights to aggregate the predictions of the remaining valuable augmented samples. In this section, we introduce the details of our method.
Data augmentation on test time
We first consider
Then we can consider inputting the augmented samples
where
Now, by the results of 6.1, we select the HNSW64 (64 represents the degree of nodes in the graph) algorithm based on graph retrieval method to screen out anomalous samples from the augmented sample set
Then the remaining samples in the augmented sample set
To aggregate the diversity of augmented samples while avoiding excessive out-of-distribution samples, we propose a distance-based selection approach. The approach then takes the average of these distance values to obtain the mean of all elements in the distance matrix, denoted as
Algorithm 1
1: x ← 1
2-9: …
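A minimal sketch of distance-based filtering in the spirit of the description above. It is not the authors' Algorithm 1: the exhaustive L2 search stands in for the HNSW64 index, and the rule "drop a sample whose mean distance to its k nearest neighbours exceeds the mean of the whole distance matrix" is an assumption made for illustration. The function name `knn_filter` and the parameter `k` are hypothetical.

```python
import numpy as np

def knn_filter(preds, k=2):
    """Return a boolean mask over augmented samples (True = keep).

    preds: array of shape (m, C), one probability vector per augmented sample.
    Assumes k < m. Brute-force L2 search is used here instead of a graph index.
    """
    preds = np.asarray(preds, dtype=float)
    diff = preds[:, None, :] - preds[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (m, m) pairwise L2 distances
    threshold = dist.mean()                    # mean of all matrix elements
    # Sort each row; column 0 is the zero distance of a sample to itself.
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]
    score = nearest.mean(axis=1)               # anomaly score per sample
    return score <= threshold

preds = [
    [0.90, 0.10],
    [0.85, 0.15],
    [0.88, 0.12],
    [0.92, 0.08],
    [0.10, 0.90],   # outlier: far from every other prediction
]
mask = knn_filter(preds, k=2)
kept = np.asarray(preds)[mask]
```

Here the first four predictions form a tight cluster and are kept, while the last one has no close neighbour and is filtered out before aggregation.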
After obtaining the predicted probability distribution matrix
In some cases, only using the normalized values obtained from sigmoid function to assign weights may lead to over-reliance on certain augmented samples, resulting in an excessive dependence on the features of these samples while aggregating prediction results. To address this issue, we propose a new method for mixing weights. Specifically, multiply
The next step is to take the average along dimension C of
Then we take the average of the maximum values of each row of
Finally, we use α to control the weight of the prediction results based on the original samples and the aggregated prediction results based on the augmented samples. The final prediction results are obtained by mixing the two prediction results according to the following formula:
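As a hedged illustration of such a mixing step, the sketch below assumes a convex combination in which α multiplies the original prediction and 1 − α the aggregated one. Which term α is attached to, and how α and the per-sample scores are computed, are assumptions here; `aggregate`, `mix`, and `scores` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aggregate(preds, scores):
    """Weight each augmented prediction by a sigmoid-squashed score.

    preds:  (m, C) probability vectors of the retained augmented samples.
    scores: (m,) raw scores, e.g. produced by a small learnable sub-network
            (assumed here; they are plain inputs in this sketch).
    """
    w = sigmoid(np.asarray(scores, dtype=float))
    w = w / w.sum()                      # normalise so the weights sum to 1
    return w @ np.asarray(preds, dtype=float)

def mix(p0, p_tta, alpha):
    """Convex combination of the original and the aggregated prediction."""
    return alpha * np.asarray(p0, dtype=float) + (1.0 - alpha) * np.asarray(p_tta, dtype=float)

p_tta = aggregate([[0.8, 0.2], [0.6, 0.4]], scores=[0.0, 0.0])
p_final = mix([0.9, 0.1], p_tta, alpha=0.5)
```

Because both inputs are probability vectors and the combination is convex, the mixed output remains a valid probability distribution for any α in [0, 1].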
Considering that the aggregated prediction ptta and the prediction p0 based on the original sample are similar but slightly different outputs of the same model, the situation resembles one in which two different models give two different predictions for the same sample. Therefore, in addition to minimizing the cross-entropy loss (Eq. 11) between the model's prediction and the ground truth, we add a measure of the similarity between the original sample and the augmented sample to the objective function. Consequently, it contains two optimization objectives:
The first objective is to minimize the bidirectional Kullback-Leibler (KL) divergence
The second objective is to minimize the cross-entropy loss
The final loss function of our method is as below:
Viewed from a different perspective, formula 12 adds a regularization term to the cross-entropy loss function [56], which is equivalent to adding a penalty term. This enhances the robustness and adaptability of the model to augmented samples, while also reducing sensitivity to certain perturbations and the likelihood of overfitting. This regularization can better normalize the predictions of the model, reducing the inconsistency between the predictions ptta based on the aggregated samples and the predictions p0 based on the original samples. It achieves a transition from "model fusion" to "weight fusion," which is more conducive to the stability and reliability of the final predictions and is also more consistent with the core idea of soft voting.
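The two objectives can be sketched numerically as follows. This is an illustration under stated assumptions, not the paper's exact formula 12: the 1/2 factor in the bidirectional KL term, the trade-off coefficient `lam`, and the choice of applying the cross-entropy to the final mixed prediction are all assumptions, and the function names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids log(0)

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    p = np.clip(np.asarray(p, dtype=float), EPS, 1.0)
    q = np.clip(np.asarray(q, dtype=float), EPS, 1.0)
    return float(np.sum(p * np.log(p / q)))

def bidirectional_kl(p, q):
    """Symmetrised KL: average of KL(p || q) and KL(q || p) (assumed form)."""
    return 0.5 * (kl(p, q) + kl(q, p))

def cross_entropy(p, label):
    """Negative log-probability assigned to the ground-truth class."""
    return float(-np.log(max(float(p[label]), EPS)))

def regularized_loss(p0, p_tta, p_final, label, lam=1.0):
    """Cross-entropy plus a consistency penalty between p0 and p_tta."""
    return cross_entropy(p_final, label) + lam * bidirectional_kl(p0, p_tta)

p0 = np.array([0.8, 0.2])
p_tta = np.array([0.6, 0.4])
loss_consistent = regularized_loss(p0, p0, p0, label=0)
loss_inconsistent = regularized_loss(p0, p_tta, p0, label=0)
```

When the original and aggregated predictions agree, the penalty vanishes and only the cross-entropy remains; any disagreement between them increases the loss.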
Therefore, the pseudocode of our method is shown in Algorithm 2.
Algorithm 2
2: Get the augmented samples …
3: Get the predictions …
4: Feed …
5: Filter out and get the remaining augmented samples …
6: Update remaining samples weight …
7: Apply Beta distribution β to …
8: Get the aggregated prediction ptta by Eq. 9
10: Compute loss …
11: Update the model weights by …
Baselines and models
All experiments were conducted with the pre-trained DistilBERT-base-uncased [46], BERT-base-uncased [7], and RoBERTa-base [28] networks from the Hugging Face Transformers library, as used by [61, 26], fine-tuned on MRPC, RTE, SST-2 [51], SST-5 [49], SUBJ [6], Trec-Fine, Trec-Coarse [26], and AG-News [61]. These datasets were sourced from either Hugging Face Datasets or their official websites. During test-time adaptation, the baseline refers to a non-adaptive model. Apart from this baseline, we compare with the following typical methods to verify the effectiveness of Defy: (1) Original (Raw Prediction): directly use the prediction of the original sample without TTA; (2) Standard TTA: average the logits across all augmented samples [25], and this is the standard practice in
Implementation details
All our experiments follow the hyperparameter settings provided by each dataset's official website. We use the model training code provided by Hugging Face Transformers [55]. At test time, we adopt Adam [23] with an initial learning rate of 2e-5, a batch size of 32, and a maximum sequence length of 128. All experiments were conducted on a single NVIDIA RTX A6000 GPU, and we ran each experiment with 3 different random seeds.
Experimental results
Overall results
Here we describe and discuss the results summarized in Table 1 and Fig. 6.1. The results obtained by Defy are compared with the performance of the standard TTA, Max TTA, and baseline models (without any type of TTA method). Table 1 displays a detailed comparison between Defy and the standard TTA method in terms of predicted results on eight benchmark datasets and three benchmark models. In the total of 24 experiments, our method showed a leading advantage in almost all of them.
Overall results. The data augmentation method used by TTA is RWSR + RWI + RWS + RWD. BERT, DistilBERT, and RoBERTa correspond to BERT-base-uncased, DistilBERT-base-uncased, and RoBERTa-base, respectively. Bold characters represent the best results.

Figure 6.1 shows the performance of the standard TTA method across different datasets. Although there is a slight performance improvement on some individual datasets, the standard TTA method is unstable on most datasets and cannot significantly improve performance. On specific datasets, it may even decrease performance. This indicates that simply averaging the model's predictions is not a reasonable approach, at least for text classification tasks, because uncertain out-of-distribution augmented samples are averaged into the final results. In addition, the study in [14] showed that the Max method can improve the model's ability to recognize out-of-domain augmented samples, yet introducing this method into TTA to improve text classification accuracy was not very effective. This is because selecting the maximum predicted probability among the augmented samples is the most volatile of all TTA methods: it miscalibrates the predictions on almost all datasets and greatly affects the results.
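The volatility of the Max method can be seen in a small numeric sketch. The reading of Max TTA used here (return the prediction of the single augmented sample with the highest confidence) is an assumption about the baseline, and the function names and probability values are illustrative only.

```python
import numpy as np

def mean_tta(preds):
    """Standard TTA: average the probability vectors."""
    return np.asarray(preds, dtype=float).mean(axis=0)

def max_tta(preds):
    """Max TTA (assumed reading): keep the single most confident prediction."""
    preds = np.asarray(preds, dtype=float)
    most_confident = preds.max(axis=1).argmax()
    return preds[most_confident]

# Three augmentations agree on class 0; one overconfident outlier says class 1.
preds = [
    [0.80, 0.20],
    [0.70, 0.30],
    [0.70, 0.30],
    [0.05, 0.95],
]
label_mean = int(mean_tta(preds).argmax())
label_max = int(max_tta(preds).argmax())
```

In this example the average still favours class 0, whereas the Max rule follows the single overconfident outlier, which illustrates why one anomalous augmentation is enough to flip its output.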

The performance of different TTA methods on different datasets, using RWSR + RWI + RWS + RWD for data augmentation with an augmentation magnitude of 32. Here, the standard deviation is taken over the BERT, RoBERTa, and DistilBERT models.
In this study, we present a novel TTA method for effectively aggregating multiple non-deterministic augmented samples, which ultimately leads to improved model performance. Specifically, it reduces the impact of uncertain samples during the aggregation of augmented samples at test time, thereby enhancing the robustness of the model's predictions. Fig. 6 yields some additional observations, indicating that the effectiveness of TTA methods varies across datasets and depends on the model's abilities. Interestingly, for models with strong abilities, the effect of the TTA method may not be significant, resulting in limited net gains. This is evidenced by the narrowing of benefits observed for the BERT-base-uncased and RoBERTa-base models on the SUBJ dataset, as outlined in Subsection 6.3.

The impact of different data augmentation methods on TTA net gains on BERT, DistilBERT, and RoBERTa. The standard deviation in the figure represents different augmentation magnitudes.
There are two factors that affect TTA net gains: the data augmentation methods and the aggregation method for augmented samples. We discussed the design of aggregation methods in detail earlier. Regarding augmentation, [53] found that neural networks have a high error tolerance for the low-frequency components of text, as these components usually correspond to the overall semantics and features of the text. Therefore, processing the low-frequency components will not have a significant impact on the model's predictions. Conversely, neural networks have a relatively low tolerance for errors in the high-frequency components of text, as these components typically correspond to the details and local features of the text, and even slight changes may cause larger prediction biases.
For these reasons, we focus on the most representative and widely used data augmentation methods:
Random Punctuation Insertion (RPI) [21]: insert specific punctuation marks at a randomly selected position within the range of the text sequence length with a probability of 0.3.
Random Word Synonym Replacement (RWSR) [34]: randomly select N words in the sentence that are not stop words, and then randomly choose possible synonyms to replace them with a probability of 0.1.
Random Word Insertion (RWI) [57]: randomly select a non-stop word in the sentence and insert a synonym of that word at a random position in the sentence with a probability of 0.1.
Random Word Swap (RWS) [30]: randomly swap the positions of two words in the sentence with a probability of 0.1.
Random Word Deletion (RWD) [3]: randomly delete words from the sentence with a probability of 0.1.
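Three of these operations need no external resources and can be sketched directly. The sketch below is one possible reading of the descriptions above, not the implementations used in the experiments: the punctuation set, the per-position interpretation of the probabilities, and the function names are assumptions. RWSR and RWI are omitted because they require a synonym resource.

```python
import random

PUNCTUATION = [".", ",", "!", "?", ";", ":"]  # assumed punctuation set

def rpi(words, p=0.3, rng=None):
    """Random Punctuation Insertion: before each word, insert a mark with prob. p."""
    rng = rng or random.Random()
    out = []
    for w in words:
        if rng.random() < p:
            out.append(rng.choice(PUNCTUATION))
        out.append(w)
    return out

def rws(words, p=0.1, rng=None):
    """Random Word Swap: with prob. p, swap two distinct positions."""
    rng = rng or random.Random()
    out = list(words)
    if len(out) >= 2 and rng.random() < p:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def rwd(words, p=0.1, rng=None):
    """Random Word Deletion: drop each word independently with prob. p."""
    rng = rng or random.Random()
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]  # never return an empty sentence

sentence = "the story and characters are nowhere near gripping enough".split()
augmented = [op(sentence, rng=random.Random(0)) for op in (rpi, rws, rwd)]
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible, which matters when the same augmented set is reused across TTA methods for comparison.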
Although more complex data augmentation methods have been proposed, for example methods based on reinforcement learning [44], they may bring more uncertainty and additional computational burden.

The impact of sample augmentation magnitude on TTA. The data augmentation method is RWSR + RWI + RWS + RWD, and the results are averaged over all datasets and models.
Table 2 shows the six data augmentation policies used in our experiments. Specifically, random punctuation insertion is denoted as RPI, random word synonym replacement as RWSR, random word insertion as RWI, random word swap as RWS, and random word deletion as RWD. Finally, the combination of the four fundamental data augmentation methods is denoted as RWSR + RWI + RWS + RWD. There is evidence that combining multiple data augmentation methods can improve the robustness of the augmented model at test time [29]. It is worth noting that the combination of various data augmentation methods, such as RWSR + RWI + RWS + RWD, is similar to what is discussed in [29]. RPI modifies the semantics of the text only slightly and outperforms RWSR + RWI + RWS + RWD during training, as mentioned in [21], so we choose RPI as another basic data augmentation method.
Different data augmentation policies
We first analyze the results of different data augmentation methods in Fig. 6.2. We observe that RWSR + RWI + RWS + RWD is the best augmentation strategy for our method on BERT, regardless of the dataset, but this does not hold for every base model: each base model has a different best-performing augmentation strategy. Nonetheless, the results of RWSR + RWI + RWS + RWD remain close to those of the two best-performing strategies on all three base models. We also note that RWI achieves unexpectedly good results on the RoBERTa model, while RWD performs poorly on all three base models. This indicates that different models depend on different data augmentation methods, which is one of the factors to consider when designing data augmentation methods.
For the standard TTA method, no data augmentation method yields a significant performance improvement on the three models, and RWSR causes a large variance. Although most data augmentation methods damage model performance under Max TTA, the RPI augmentation method shows a significantly lower variance on the three models.
Next, we analyze the results in Fig. 5 to explore the impact of different data augmentation magnitudes on TTA. The main observation is that our method Defy is the only method that shows a linear increase in model performance as the number of augmentations increases. In contrast, the standard TTA method further increases the variance of model performance as the number of augmentations increases, while the Max TTA method further reduces model performance.

The impact of different data augmentation methods on the accuracy of different pre-training models. Here, the average accuracy across all datasets is taken, using RWSR + RWI + RWS + RWD for data augmentation with an augmentation magnitude of 32. The standard deviation represents different datasets.
In the previous section, we observed that different models have different dependencies on data augmentation methods, so in this section, we further explore whether different models will have an impact on different TTA methods. To verify this, we compared the performance of the representative models, BERT-base and RoBERTa-base, as well as a lightweight model, DistilBERT-base, which has only 60% of the parameters of BERT-base.
We analyze the results of different models in Fig. 6. The first observation is that our method Defy is the only method that consistently improves the performance of different models. In contrast, the standard TTA method and the Max method damage model performance to varying degrees. On DistilBERT in particular, the standard TTA method even causes significant fluctuations in model performance. Although RoBERTa is a more robust model, it is not able to reduce the damage of the Max method to model performance.
We now focus on the best results obtained on each specific dataset, as shown in Table 1. Although our method, Defy, achieves the best results on all datasets, its performance differs significantly across models for certain datasets. For instance, the RTE dataset shows lower benefits on BERT and DistilBERT but significantly higher benefits on RoBERTa. When a baseline model has already reached a high performance level, the improvement from TTA can be reduced, as in the case of MRPC, where RoBERTa's baseline performance has already reached 86.43%, resulting in almost no benefit from TTA. However, for BERT and DistilBERT, with baseline performances of 81.57% and 79.83%, respectively, our method shows an average benefit of almost 0.4%.

The impact of different k-NN algorithms on the performance of our method with anomaly augmented samples. Here, we compared different models and averaged the results across all datasets, using RWSR + RWI + RWS + RWD for data augmentation with an augmentation magnitude of 32.
In the previous subsection, we observed that Defy (Ours) was the best TTA method across different benchmark models, despite variations in performance on specific datasets. Given this observation, this section further explores the effectiveness of the anomaly detection methods employed in our approach. Simple methods commonly used in NLP tasks, such as Euclidean distance, often fail to capture the full potential of tensors and are sensitive to feature distribution [42]. Additionally, these methods are not well suited to zero or sparse vectors and are prone to misclassifying anomalous samples. To address these issues, this subsection reports experiments with several advanced k-NN methods, including Annoy-Angular, Flat-L2, HNSW64-L2, HNSWFlat, and HNSWSQ [19]. Through these experiments, we aim to find an anomaly detection method that is universally applicable across different datasets and models. Annoy [4] uses random projection and a tree structure with hyperplane partitioning based on the angular distance metric. Flat-L2 [36] uses a linear scan to search for nearest neighbors based on the Euclidean distance metric. HNSW64-L2 [32] uses a graph-based approach with a hierarchical navigable small world (HNSW) structure, where 64 is the number of links per node and L2 denotes the Euclidean distance metric. HNSWFlat [33] uses only a graph-based approach with HNSW. HNSWSQ [18] uses a graph-based approach with HNSW combined with scalar quantization.
The results are shown in Fig. 7. In the experiments, Defy was run with different k-NN algorithms across three models, with data augmentation through RWSR + RWI + RWS + RWD and 32 repetitions. Notably, the performance of the HNSW-based methods (HNSW64-L2, HNSWFlat, and HNSWSQ) was almost identical to that of Flat. Specifically, the median performance of HNSWSQ was relatively high on the BERT model. Although Annoy-Angular achieved slightly better overall performance than other methods on BERT, it often did not surpass the HNSW-based methods, and its performance was not always stable on specific datasets, such as RTE. Additionally, Flat-Cos consistently underperformed the other methods across all models. This may be attributed to the sensitivity of cosine similarity to the angles between data points. When the probability prediction distributions $p_i$ of the model are close to each other, their cosine similarity may be high, resulting in inaccurate distance estimates.

The percentage of label changes for different datasets using different TTA methods. Here, a corrected label means that the model predicted inaccurately but TTA aggregated correctly, and a corrupted label means that the model predicted accurately but TTA aggregated inaccurately.
Figure 8 portrays the standard TTA method, Max TTA method, and our method, concerning the proportions of label corruption and correction across eight datasets. Label corruption denotes the number of instances where the model predicted accurately but aggregated inaccurately through TTA, while label correction represents the number of cases where the model predicted inaccurately but aggregated correctly using TTA. We notice that the standard TTA method consistently has a higher proportion of label corruption than correction across most datasets. The Max TTA method exhibits an abnormally high proportion of label corruption, reaching 0.07 on the Trec-Fine dataset. Conversely, our proposed approach consistently has a higher proportion of label correction than corruption. To identify the data augmentation samples leading to incorrect predicted labels, we conduct a thorough investigation of the data augmentation samples responsible for label corruption.
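The two proportions defined above can be computed directly from three label arrays. This is a minimal sketch; the function name `label_change_rates` and the toy labels are illustrative and do not come from the experiments.

```python
import numpy as np

def label_change_rates(y_true, y_base, y_tta):
    """Return (corrected, corrupted) proportions over the test set.

    corrected: base prediction wrong, TTA-aggregated prediction right.
    corrupted: base prediction right, TTA-aggregated prediction wrong.
    """
    y_true, y_base, y_tta = (np.asarray(a) for a in (y_true, y_base, y_tta))
    base_ok = y_base == y_true
    tta_ok = y_tta == y_true
    corrected = float(np.mean(~base_ok & tta_ok))
    corrupted = float(np.mean(base_ok & ~tta_ok))
    return corrected, corrupted

y_true = [0, 1, 1, 0]
y_base = [0, 0, 1, 0]   # sample 1 is wrong without TTA
y_tta = [0, 1, 0, 0]    # TTA fixes sample 1 but breaks sample 2
corrected, corrupted = label_change_rates(y_true, y_base, y_tta)
```

The net accuracy change of a TTA method equals the corrected proportion minus the corrupted proportion, so a method helps only when the former exceeds the latter.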
In this paper, we present a novel TTA method, Defy, which overcomes the limitations of the standard TTA method in the NLP field by leveraging a k-NN-based anomaly detection algorithm and a weight adaptation mechanism to adapt effectively to the characteristics of different datasets and models. Unlike other robustness methods, Defy does not interfere with the training process of the backbone network and can be used in conjunction with other robustness methods to further improve model performance. Furthermore, the approach offers excellent plug-and-play capabilities, enabling easy integration into existing models. Moving forward, attention should be directed toward the following two challenges: (1) how to selectively apply TTA only to those samples that need to be corrected, rather than to the entire test set, in order to minimize time and computational costs; and (2) how to design more reasonable and effective data augmentation methods, especially for test time.
Acknowledgments
This work was funded by the Ten Thousand Talent Plans for Young Top-notch Talents of Yunnan Province (Project No. YNWR-QNBJ-2018-351). The authors declare that there are no conflicts of interest associated with this funding.
Data availability statement
The benchmark datasets and models utilized in this study were previously introduced in Section 5.1. These datasets are openly accessible and can be retrieved through the links provided in that particular section.
Statements and declarations
No potential conflict of interest was reported by the authors.
