Abstract
Deep learning based bearing fault diagnosis methods are developing rapidly. However, it is difficult to obtain sufficient and comprehensive fault data in industrial applications, and changes in vibration signals caused by varying machine operating conditions further reduce model accuracy. Together, limited data and frequently changing operating conditions can seriously degrade the effectiveness of deep learning methods. To tackle these challenges, this paper presents a novel Transformer model named the Differential Window Transformer (Dwin Transformer), which employs a new differential window self-attention mechanism. The model also introduces a hierarchical structure and a new patch merging method to further improve performance. Furthermore, a new fault diagnosis model for limited training data, DT-ACGAN, is proposed, which combines the Auxiliary Classifier Generative Adversarial Network (ACGAN) with the Dwin Transformer. DT-ACGAN generates high-quality synthetic samples to facilitate training with limited data, significantly improving diagnostic capability. By combining the Dwin Transformer with a GAN, the proposed model addresses the dual challenges of limited data and variable working conditions, and it achieves higher diagnostic accuracy and better generalization than existing models under both. Cross-domain comparative tests are conducted on the Case Western Reserve University (CWRU) dataset and the Jiangnan University (JNU) dataset. With limited data, the proposed method achieves average accuracies 11.3% and 3.76% higher than existing methods on the two datasets, respectively.
Introduction
As technology develops and the demand for machinery grows, mechanical structures become increasingly complex. Rolling bearings are widely used and play a crucial role in rotating machinery owing to their high efficiency and easy installation. The health of the rolling bearings directly affects the condition of the mechanical equipment: bearing failure can cause equipment failure, significant economic loss, or even casualties. Therefore, monitoring and diagnosing the health of rolling bearings through their vibration signals is essential for ensuring stable production [1–3].
Models based on deep learning have end-to-end capabilities and are widely used in various industrial scenarios [4–6]. Convolutional Neural Networks (CNNs) are widely studied in bearing fault diagnosis due to their precise diagnostic capabilities [7, 8]. Xu et al. [9] proposed a Collaborative Fusion Convolutional Neural Network (CFCNN), which considers the intrinsic correlations and the distribution gap between different signals. Ruan et al. [10] proposed a physics-guided CNN (PGCNN), which uses signal analysis to guide the choice of CNN parameters, including the input size and the convolution kernel size. However, CNNs have difficulty capturing the global features of the data, which can lead to overfitting. Overfitting intensifies when the training data is limited, which can significantly degrade the model’s performance.
Compared to CNNs, Long Short-Term Memory (LSTM) networks exhibit robust capabilities in extracting global features and capturing temporal relationships in the data [11]. Fan et al. [12] proposed a method based on a One-Layer Wide Convolutional and Long Short-Term Memory network (1LWCNNLSTM), which has better load generalization capability and noise immunity for vibration data from complex working environments. However, LSTM has a large computational load and performs poorly on very long sequences, so it is ill-suited to long data sequences or to applications with strict real-time requirements. Another model that can effectively extract global features is the Transformer, which uses multi-head self-attention and parallel training to capture global features and long-range information in time-series signals. However, the Transformer suffers from the high computational cost of the global attention mechanism and from limited local feature extraction capability. To tackle this issue, Liu et al. [13] enhanced the Vision Transformer (ViT) and proposed the Shifted Window Transformer (Swin Transformer). The improvements include a hierarchical structure built on the Transformer and a window self-attention mechanism that replaces global self-attention. Moreover, the shifted window mechanism recovers the relationships between different windows that window attention would otherwise lose. However, this model performs poorly when confronted with noisy training samples and varying working conditions, because its architecture is very deep and the shifted window mechanism cannot effectively capture the features of vibration signals.
Most of the studies mentioned above require substantial and comprehensive training data, and the performance of these models declines notably when training samples are limited. Currently, the prevalent ways to improve model performance with limited samples are data augmentation and few-shot learning. The Generative Adversarial Network (GAN) [14] is a typical data augmentation method that can generate data closely approximating the real data distribution. Fu et al. [15] proposed a fault diagnosis method combining a GAN with a stacked denoising auto-encoder (SDAE), which boosted the SDAE’s fault diagnosis performance on limited training data by augmenting the real measured data. Huang et al. [16] proposed an improved label-noise robust auxiliary classifier generative adversarial network (rAC-GAN) driven by limited data, which learns shallow fault features in advance from real bearing fault data to obtain input signals. Consequently, the model’s performance improved notably with limited data. Data augmentation can improve a model’s performance with limited data by increasing the sample count. However, these methods cannot adapt to changing working conditions. Moreover, inappropriate data augmentation can lead to overfitting, which yields good performance on the training data but unreliable performance on real data [17].
Presently, the predominant technique in few-shot learning is metric-based meta-learning. The fundamental idea of meta-learning is to "learn to learn": meta-learning algorithms train models across multiple tasks to acquire versatile learning strategies, which enable the model to adapt quickly with limited data. Classic methods include Siamese Networks [18], Matching Networks [19], and Prototypical Networks [20], among others. Zhang et al. [21] proposed a Siamese CNN by combining a CNN with the Siamese network, which achieved significant performance improvement with limited samples. Wang et al. [22] proposed a Feature Space Metric-based Meta-learning Model (FSM3) by extending Matching Networks and Prototypical Networks. The model diagnoses faults by using sample attributes and cluster similarity data. However, Siamese Networks and Matching Networks demand high-quality training data. In addition, models that rely on an attention mechanism may be difficult to train to convergence in some cases.
It is worth noting that a lack of samples and frequent changes in working conditions often coexist in actual production. However, existing models usually address only one of these problems and rarely handle both well at the same time [23]. Since limited data and frequently changing operating conditions can seriously affect the effectiveness of deep learning models, a model that addresses both challenges simultaneously is necessary to improve the safety of actual production activities.
To address this gap, this paper enhances the Swin Transformer and proposes a new fault diagnosis model called the Differential Window Transformer (Dwin Transformer). The model uses a novel window attention mechanism and feature fusion method designed around the data distribution characteristics of one-dimensional vibration signals. In addition, a hierarchical structure enhances the feature extraction ability of the model. We then replace the discriminator of ACGAN with the Dwin Transformer to improve diagnostic ability under limited fault samples, which yields an Auxiliary Classifier Generative Adversarial Network based on the Dwin Transformer (DT-ACGAN). The model can accurately extract features, complete the classification task under limited samples, and effectively deal with changing working conditions. The contributions of this study are as follows:
(1) A new differential window multi-head self-attention mechanism is proposed, and a new Transformer named the Dwin Transformer is built on it. The model significantly reduces computational effort and enhances generalization to vibration signals under variable working conditions.
(2) The paper explores the possible combinations of the Dwin Transformer and GAN and identifies the most suitable one. DT-ACGAN is obtained by substituting the Dwin Transformer for the discriminator. This notably improves the performance of the discriminator, thereby enhancing the quality of the generated samples and the diagnostic capabilities of the model.
(3) DT-ACGAN retains excellent fault diagnosis performance with limited data and inherits the remarkable cross-domain generalization ability of the Dwin Transformer. It therefore addresses the two challenges of limited training data and different working conditions simultaneously.
The rest of this paper is organized as follows. Section 2 details the methods related to DT-ACGAN. Section 3 describes the architecture of the DT-ACGAN model and clarifies its principle. In Section 4, the model is comprehensively analyzed through experiments on the Case Western Reserve University (CWRU) dataset [24] and the Jiangnan University (JNU) dataset [25], both of which include different working conditions. Section 5 presents the conclusions, highlights the cross-domain generalization ability of the proposed model with limited data, and discusses future research work.
Methodology
Swin Transformer
The Swin Transformer model consists of 3 parts: Patch partition, Patch merging, and Swin Transformer block. The model is described in detail below.
Overall architecture
Figure 1 illustrates the complete architecture of the Swin Transformer. The input data is first divided into patches of uniform size in Patch Partition, and each patch is flattened in the channel direction. Then, the processed data is sent to the Swin Transformer block for computation. The output is then sent to the patch merging layer. The data is further expanded in the channel dimension while reducing its width and height to create a hierarchy and reduce the amount of data. The data is then fed to the second Swin Transformer block, and the final output is obtained after two iterations.

Structure of the Swin Transformer.
The Swin Transformer block is the core component of the Swin Transformer. It introduces a window self-attention mechanism that significantly reduces the amount of computation: the feature map is divided into multiple windows, and self-attention is computed independently within each window. The self-attention computation within each window follows the standard procedure of [26].
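The steps themselves are not reproduced in this text. As a reminder, and assuming [26] refers to standard scaled dot-product attention, the computation inside one window can be written as below. The symbols $X$ (the tokens of one window), $W^{Q}, W^{K}, W^{V}$ (learned projection matrices) and $d_k$ (the key dimension) are notation chosen here, not taken from the original.

```latex
\begin{aligned}
Q &= X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}, \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V .
\end{aligned}
```

The Swin Transformer [13] additionally adds a learned relative position bias inside the softmax; it is omitted from this sketch.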

The illustration of the shifted window approach for computing self-attention.
After the windows are shifted, patches that previously fell into different windows are grouped into the same window, so connections between them are re-established. This solves the problem of exchanging information between different windows [13].
GANs expand the dataset by generating synthetic data [14]. The generator’s goal is to generate data similar to the real data, while the discriminator acts as a classifier that distinguishes real data from generated data. The two networks iteratively compete with each other to enhance their capabilities until they reach equilibrium. In the ideal state, the generator produces data that matches the distribution of the real data, and the discriminator cannot distinguish between real and synthetic data. The objective function T(D, G) of the GAN is the minimax objective defined in [14].
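For reference, the standard minimax objective of [14] is written out below using the document's symbol $T(D, G)$. The notation $p_{\mathrm{data}}$ for the real data distribution and $p_{z}$ for the noise prior is assumed here.

```latex
\min_{G}\max_{D}\; T(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```

The discriminator $D$ maximizes both terms, while the generator $G$ minimizes the second term by making $D(G(z))$ approach 1.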
The Auxiliary Classifier GAN (ACGAN) enables the discriminator to determine both the authenticity and the category of an input sample [27]. To achieve this, a Softmax layer is introduced into the discriminator to compute the probability of each class. Furthermore, ACGAN uses a new objective function, which consists of two parts [27]: a source (real versus fake) loss and a class loss.
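The two parts are not reproduced in this text. In the usual ACGAN formulation of [27] they are the log-likelihood of the correct source, $L_S$, and the log-likelihood of the correct class, $L_C$. The symbols $S$, $C$, $X_{\mathrm{real}}$ and $X_{\mathrm{fake}}$ follow that common notation and are assumed here.

```latex
\begin{aligned}
L_S &= \mathbb{E}\!\left[\log P(S = \mathrm{real} \mid X_{\mathrm{real}})\right]
     + \mathbb{E}\!\left[\log P(S = \mathrm{fake} \mid X_{\mathrm{fake}})\right], \\
L_C &= \mathbb{E}\!\left[\log P(C = c \mid X_{\mathrm{real}})\right]
     + \mathbb{E}\!\left[\log P(C = c \mid X_{\mathrm{fake}})\right].
\end{aligned}
```

In that formulation the discriminator is trained to maximize $L_S + L_C$, and the generator is trained to maximize $L_C - L_S$.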
Dwin Transformer
We first normalize the vibration signal and fold it into two dimensions, and then convert the data into images to match the structure of the proposed model [28]. It is worth noting that the converted image differs from a typical image: each pixel correlates more strongly with its left and right neighbors than with the pixels above and below it, because horizontally adjacent pixels are consecutive samples of the signal. Given this property, we apply a rectangular window partitioning mode to Window Multi-Head Self-Attention (W-MSA) and adopt Differential Window Multi-Head Self-Attention (DW-MSA) to facilitate information exchange between windows. Figure 3 shows the structure of DW-MSA, in which two windows of different sizes are used for self-attention to minimize the information loss caused by window attention without affecting computation speed.
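The preprocessing and rectangular partitioning described above can be illustrated with a minimal pure-Python sketch. Several details are assumptions made only for illustration and are not stated in the paper: min-max normalization, folding a 2500-point sample into a 50×50 grid, and the two window shapes (2×10 and 2×25). The function names are hypothetical.

```python
def normalize(signal):
    """Min-max scale a 1-D signal to [0, 1] (the scaling scheme is an assumption)."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    return [(v - lo) / span for v in signal]


def fold(signal, width):
    """Fold a 1-D signal row by row into a 2-D grid with `width` columns."""
    if len(signal) % width != 0:
        raise ValueError("signal length must be a multiple of width")
    return [signal[i:i + width] for i in range(0, len(signal), width)]


def partition_windows(grid, win_h, win_w):
    """Split a 2-D grid into non-overlapping win_h x win_w windows.

    Each window is returned as a flat, row-major list of its values.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % win_h or cols % win_w:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for r in range(0, rows, win_h):
        for c in range(0, cols, win_w):
            windows.append([grid[r + dr][c + dc]
                            for dr in range(win_h)
                            for dc in range(win_w)])
    return windows


# Hypothetical 2500-point sample folded into a 50 x 50 "image".
signal = [float(i) for i in range(2500)]
image = fold(normalize(signal), width=50)

# Two rectangular window shapes, wider than tall, as two attention branches
# with different window sizes would see the same feature map.
narrow = partition_windows(image, win_h=2, win_w=10)
wide = partition_windows(image, win_h=2, win_w=25)
```

Windows that are wider than they are tall keep more consecutive signal samples inside one attention window, which matches the left-right correlation noted above.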

The illustration of the different window approaches for computing self-attention.
Inspired by CNNs, the model adopts a hierarchical structure, which is more conducive to feature extraction. The model consists of two blocks, each containing two multi-head self-attention mechanisms with different window sizes. The model’s structure is shown in Fig. 4.

Structure of the Dwin Transformer.
We set the patch size to 1×4 to match the input format of our data and propose a patch merging method to reduce the data volume. The proposed method groups the pixels that occupy the same relative position within each unit and then merges these groups along the channel dimension. Figure 5 demonstrates the specific steps of our implementation.
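The merging rule can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the unit size (`unit_h`, `unit_w`) is a free parameter here because the paper's exact unit shape is not given in this text, and any linear projection applied after merging is omitted. The function name `patch_merge` is hypothetical.

```python
def patch_merge(fmap, unit_h, unit_w):
    """Merge each unit_h x unit_w unit of a feature map into one pixel.

    fmap[r][c] is a list of C channel values. The result has size
    (H / unit_h) x (W / unit_w), and every output pixel has
    C * unit_h * unit_w channels.
    """
    rows, cols = len(fmap), len(fmap[0])
    if rows % unit_h or cols % unit_w:
        raise ValueError("feature map size must be divisible by the unit size")
    merged = []
    for r in range(0, rows, unit_h):
        out_row = []
        for c in range(0, cols, unit_w):
            channels = []
            # Visit relative positions in a fixed order, so channel block k
            # always holds the pixel at the k-th relative position of a unit.
            for dr in range(unit_h):
                for dc in range(unit_w):
                    channels.extend(fmap[r + dr][c + dc])
            out_row.append(channels)
        merged.append(out_row)
    return merged


# Toy 2 x 4 feature map with a single channel per pixel.
toy = [[[0], [1], [2], [3]],
       [[4], [5], [6], [7]]]
merged_1x2 = patch_merge(toy, unit_h=1, unit_w=2)  # halves the width only
merged_2x2 = patch_merge(toy, unit_h=2, unit_w=2)  # halves width and height
```

Merging reduces the spatial resolution while moving the removed positions into the channel dimension, so no values are discarded at this step.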

Patch Merging.
The Dwin Transformer Block is the core part of the Dwin Transformer model, and its structure is shown in Fig. 6.

Two Successive Dwin Transformer Blocks.
The computation of the Dwin Transformer block follows the formulation of [13].
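The equation itself is not reproduced in this text. The block below restates the two-block formulation of [13], with the second attention layer written as DW-MSA. That substitution for the shifted-window attention of [13] is an assumption based on the description of DW-MSA above, and the symbols $z^{l}$ (output of block $l$), $\hat{z}^{l}$ (output of its attention sub-layer), LN (layer normalization) and MLP are the notation of [13].

```latex
\begin{aligned}
\hat{z}^{l}   &= \text{W-MSA}\!\left(\mathrm{LN}\!\left(z^{l-1}\right)\right) + z^{l-1}, \\
z^{l}         &= \mathrm{MLP}\!\left(\mathrm{LN}\!\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{DW-MSA}\!\left(\mathrm{LN}\!\left(z^{l}\right)\right) + z^{l}, \\
z^{l+1}       &= \mathrm{MLP}\!\left(\mathrm{LN}\!\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}.
\end{aligned}
```

Each sub-layer is preceded by layer normalization and wrapped in a residual connection.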
Although the Dwin Transformer has excellent cross-domain generalization ability, it typically requires a large amount of training data. Without sufficient training samples, using the Dwin Transformer for fault diagnosis may lead to degraded performance and a failure to demonstrate its cross-domain generalization ability. A feasible solution to this problem is to use a GAN to increase the number of training samples.
Based on this idea, we propose an improved model by replacing ACGAN’s discriminator with the Dwin Transformer. This modification enhances the recognition and classification capabilities of the discriminator, thereby improving the network’s overall performance in fault diagnosis tasks. The increase in sample size dramatically enhances the Transformer’s capabilities. At the same time, we introduce Differentiable Augmentation to make the distribution of the generated data more consistent with that of the real data [29]. The improved model can effectively utilize limited training samples to improve fault diagnosis performance. The model structure is shown in Fig. 7. The pseudocode of DT-ACGAN is provided as Algorithm 3, in which the discriminator is the Dwin Transformer.

Structure of the DT-ACGAN model.
This improved model architecture not only solves the problem of insufficient training samples but also fully utilizes the cross-domain generalization ability of the Dwin Transformer. We generated more training samples by combining GANs and Transformer, which improved the diagnostic accuracy and generalization ability of the Transformer model. Moreover, this method balances the discriminator and generator’s capabilities, improving the performance of the entire fault diagnosis system.
In summary, we significantly improved the diagnostic ability of the model by using the Dwin Transformer and integrating GANs to expand the training samples. This method solves the challenge of limited training samples and inherits the advantages of the Dwin Transformer in cross-domain generalization. The construction process of the model is summarized as follows:
1. Divide the original signal into datasets corresponding to different operating conditions.
2. Perform overlapping sampling using sliding windows on the dataset.
3. Initialize the model parameters.
4. Input Gaussian noise into the generator to generate synthetic samples.
5. Merge the synthetic and real samples to create a mixed dataset and feed it into the discriminator.
6. Iteratively adjust the parameters through forward and backward propagation until the diagnostic accuracy on the validation dataset satisfies the practical requirements.
7. Validate the trained model using the test dataset and evaluate its diagnostic capability.
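Step 2 of the list above, overlapping sampling with a sliding window, can be sketched as follows. The sample length of 2500 matches the experiments reported later, while the stride of 500 and the function name `sliding_window_samples` are assumptions made for illustration.

```python
def sliding_window_samples(signal, length, stride):
    """Cut a 1-D signal into fixed-length samples using a sliding window.

    Consecutive samples overlap by (length - stride) points when
    stride < length. A trailing remainder shorter than `length` is dropped.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    last_start = len(signal) - length
    return [signal[s:s + length] for s in range(0, last_start + 1, stride)]


# Hypothetical recording of 10000 points, cut into 2500-point samples.
recording = list(range(10000))
samples = sliding_window_samples(recording, length=2500, stride=500)
```

A smaller stride yields more samples from the same recording at the cost of stronger overlap between neighboring samples.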
Case Study 1: CWRU dataset
Description of experimental data
PyTorch was used as the machine learning framework in this study. The experiments were run on a computer with 32 GB of RAM, an NVIDIA GeForce RTX 3090 GPU, an Intel i7-13700K CPU, and the Windows 11 operating system. The dataset is from the CWRU Bearing Data Center.
The experimental equipment is shown in Fig. 8. The bearing used is an SKF 6205, and the bearing damage was introduced using electro-discharge machining (EDM).

Motor driving mechanical system used by CWRU.
The entire dataset is divided into four subsets, A, B, C, and D. Datasets A, B, and C contain data recorded under loads of 1 HP, 2 HP, and 3 HP, respectively, and dataset D includes all data. Each data sample has a length of 2500 points. Table 1 summarizes the dataset used in this article.
Description of the CWRU dataset
Comparative experiments were conducted to evaluate the cross-domain generalization ability of the Dwin Transformer. Each model is trained on one dataset and then tested on data from another dataset; for example, A → B refers to training on dataset A and testing on dataset B. All models are trained for 20 epochs. The comparison results of the generalization experiments are shown in Fig. 9.

Migration and generalization performance comparison of different models.
The figure shows the performance of the various models. The Dwin Transformer achieves the highest accuracy, which confirms that it exhibits excellent cross-domain generalization ability when sufficient training samples are provided. It is worth mentioning that the Swin Transformer is prone to overfitting on vibration signals, which results in low test set accuracy. The Dwin Transformer outperforms the Swin Transformer in processing one-dimensional vibration signals, which indicates that our improvements to the Swin Transformer are effective.
To illustrate how the window attention mechanism reduces computational complexity, we compared the running times of the Vision Transformer, Swin Transformer, and Dwin Transformer. The results are shown in Table 2. They show that W-MSA effectively reduces computational complexity compared to global MSA. Furthermore, the structure of the Dwin Transformer is simpler and better suited to one-dimensional vibration signals. Therefore, it has lower computational complexity and better fault diagnosis performance.
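The source of this saving can be made explicit with the complexity estimates given for global and window attention in the Swin Transformer paper [13]. The symbols are notation assumed here: $h \times w$ is the feature map size in patches, $C$ the channel dimension, and $M$ the side of a square window. The last line extends the same count to a rectangular $M_h \times M_w$ window and is a derivation made for this sketch, not a result from the paper.

```latex
\begin{aligned}
\Omega(\text{MSA})   &= 4hwC^{2} + 2(hw)^{2}C, \\
\Omega(\text{W-MSA}) &= 4hwC^{2} + 2M^{2}hwC, \\
\Omega(\text{W-MSA with } M_h \times M_w \text{ windows}) &= 4hwC^{2} + 2M_{h}M_{w}\,hwC .
\end{aligned}
```

Global attention grows quadratically with the number of patches $hw$, whereas window attention grows linearly once the window size is fixed.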
The running times of different models
To verify the effectiveness of the proposed model, we conducted comparative experiments in which the Dwin Transformer replaces the generator, the discriminator, or both in ACGAN. The experimental results are shown in Table 3.
Comparison of different structures of ACGAN with 60 samples
Table 3 indicates that the model performs best when only the discriminator is replaced. We believe this is because an imbalance between the discriminator and the generator hinders training when vibration signals are used as samples, and introducing the Dwin Transformer as the discriminator effectively corrects this imbalance.
In order to comprehensively evaluate the fault diagnosis ability of the proposed model under limited training samples, we conducted comparative experiments using the same experimental setup described earlier. The experimental results are depicted in Fig. 10.

Comparison of diagnostic results for different samples.
The figure indicates that the proposed model has a significant advantage when the sample size is small. The proposed model therefore reduces the dependence on the number of training samples and can effectively handle rolling bearing fault diagnosis with limited samples. Next, a comparative experiment was designed to evaluate the cross-domain generalization ability of the proposed model under limited samples. The training set for this experiment contains 60 samples. All models are trained on one dataset and, after training is completed, tested on data from the other datasets. The experimental results are presented in Fig. 11.

Migration and generalization performance comparison of different models with 60 samples.
The results in the figure indicate that, although the training samples are limited, the proposed model achieves the highest diagnostic accuracy and exhibits excellent cross-domain generalization performance. To make the evaluation more convincing, we analyze one task in depth, using precision, recall, and F1 score to evaluate the performance of the proposed model. Given the relatively poor performance on the cross-domain task B→C, we chose it as our case study. Tables 4–6 report the precision, recall, and F1 score for this task, respectively. We use a confusion matrix to visualize the model’s classification of each fault category, as depicted in Fig. 12. To ensure the accuracy and stability of the test results, we reselected 2000 new samples from the dataset as the test dataset. The experimental results show that the proposed model performs poorly on the third type but exhibits excellent performance on the others. It must be emphasized that task B→C represents the worst case for the proposed model, which performs better on the other tasks. Even so, its advantage over the other models is pronounced: DT-ACGAN outperforms all of them.
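For readers who want to reproduce the three metrics from a confusion matrix such as the one in Fig. 12, a minimal pure-Python sketch is given below. The row and column convention (rows are true classes, columns are predicted classes), the function name `per_class_metrics`, and the toy matrix are assumptions for illustration; the numbers are not results from the paper.

```python
def per_class_metrics(confusion):
    """Return a (precision, recall, f1) tuple for every class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Toy two-class confusion matrix (illustrative values only).
toy_confusion = [[8, 2],
                 [1, 9]]
toy_metrics = per_class_metrics(toy_confusion)
```

Precision penalizes false alarms for a class, recall penalizes missed faults of that class, and the F1 score is their harmonic mean.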

Confusion matrix of DT-ACGAN with 60 samples.
Precision (%) comparison for cross-domain task B→C with 60 samples
Recall (%) comparison for cross-domain task B→C with 60 samples
F1 score (%) comparison for cross-domain task B→C with 60 samples
Case Study 2: JNU dataset
Description of experimental data
To further evaluate the stability and performance of the model, we conducted additional comparative experiments using the JNU bearing fault dataset. The experimental setup used to collect this dataset is illustrated in Fig. 13.

The acquisition equipment of JNU dataset.
The dataset is divided into four subsets, A, B, C, and D. Datasets A, B, and C contain data collected at different speeds, while dataset D includes all data. The bearing model used in the dataset is ER-12K, and the sampling frequency is 50 kHz. The experimental setup is the same as in Case Study 1. Each data sample has a length of 2500 points. Table 7 summarizes the dataset used in this article.
Description of the JNU dataset
To evaluate the fault diagnosis ability of the proposed model under limited data, we randomly selected 80, 240, 480, 720, 1000, and 4000 samples from dataset D and used the same comparison models as in Case Study 1. The experimental results are depicted in Fig. 14.

Comparison of diagnostic results for different samples.
The results in the figure clearly indicate that all models perform worse on the JNU dataset than on the CWRU dataset, which suggests that the JNU dataset is more complex than CWRU. Nevertheless, the proposed model still achieves the best results. To further demonstrate the cross-domain generalization ability of the proposed model under limited data, we conducted experiments using the same approach as in Case Study 1. The experimental results are illustrated in Fig. 15. The figure indicates that our method consistently performs best across all scenarios.

Migration and generalization performance comparison of different models with 80 samples.
Furthermore, we selected the task C→B for detailed analysis for the same reasons as in Case Study 1. Tables 8–10 present the precision, recall, and F1 score for this task. The data in these tables clearly demonstrate the excellent cross-domain generalization ability of the proposed model under limited data.
Precision (%) comparison for cross-domain task C→B with 80 samples
Recall (%) comparison for cross-domain task C→B with 80 samples
F1 score (%) comparison for cross-domain task C→B with 80 samples
Gaussian white noise with different Signal-to-Noise Ratios (SNRs) was added to the samples in the CWRU dataset to evaluate the noise resistance of the proposed model. The SNR is the ratio of signal power to noise power and is typically expressed in decibels (dB). The smaller the SNR, the stronger the noise; an SNR of 0 dB indicates that the signal power and noise power are equal. This experiment used samples with SNR values of 0 dB, 2 dB, 4 dB, 6 dB, and 8 dB to simulate noisy vibration signals. The experimental results are shown in Fig. 16.
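The noise injection described above can be sketched in pure Python using the definition SNR(dB) = 10 log10(P_signal / P_noise). The function names, the sine test signal, and the fixed random seed are assumptions made for illustration; the paper does not specify its implementation.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng=None):
    """Add zero-mean white Gaussian noise so the expected SNR equals snr_db."""
    if rng is None:
        rng = random.Random()
    signal_power = sum(v * v for v in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean signal and its noisy version."""
    signal_power = sum(v * v for v in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical clean signal standing in for a vibration sample.
clean = [math.sin(0.01 * i) for i in range(20000)]
noisy_by_snr = {snr: add_gaussian_noise(clean, snr, random.Random(0))
                for snr in (0, 2, 4, 6, 8)}
```

Because the noise is random, the SNR measured on a finite sample only approximates the target value, and the approximation tightens as the sample gets longer.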

Results of different models in a noise environment with 60 samples.
The figure indicates that DT-ACGAN consistently outperforms the other models under limited sample conditions. Although this result does not necessarily show that the proposed model has the strongest anti-interference ability, it demonstrates excellent diagnostic ability in noisy environments.
The two major challenges faced by vibration-based rolling bearing fault diagnosis are limited data and changes in operating conditions. This article proposes a novel Transformer-based GAN to address these issues. We first design a new window self-attention mechanism and then build the Dwin Transformer on it to effectively extract the features of vibration signals; the Dwin Transformer adapts effectively to changes in operating conditions. In addition, we combine the Dwin Transformer with ACGAN to obtain DT-ACGAN, which addresses the challenge of limited data. The experimental results show that our method generalizes better than existing approaches in limited-data and cross-domain tasks.
However, the proposed model still has some limitations. For example, its input uses only one type of signal, the time-domain vibration signal. We believe that multimodal fusion is the future direction for rolling bearing fault diagnosis, and we will combine different forms and types of signals in future research.
