Residual SwinV2 transformer coordinate attention network for image super resolution

Abstract

Swin Transformers have been designed and used in various image super-resolution (SR) applications. One recent image restoration method is RSTCANet, which combines the Swin Transformer with Channel Attention. However, for image channels that carry little useful information or mostly noise, Channel Attention cannot automatically learn that these channels are insignificant. Instead, it tries to enhance their expression capability by adjusting the weights, which may lead to excessive focus on noise while neglecting more essential features. In this paper, we propose a new image SR method, RSVTCANet, based on an extension of Swin2SR. Specifically, to effectively gather global information for the image channels, we modify the Residual SwinV2 Transformer blocks in Swin2SR by introducing coordinate attention for every two successive SwinV2 Transformer Layers (S2TL) and replacing Multi-head Self-Attention (MSA) with Efficient Multi-head Self-Attention version 2 (EMSAv2). The resulting residual SwinV2 Transformer coordinate attention blocks (RSVTCABs) are employed for feature extraction. Additionally, to improve the generalization of RSVTCANet during training, we apply an optimized RandAugment for data augmentation on the training dataset. Extensive experimental results show that RSVTCANet outperforms recent image SR methods in terms of visual quality and measures such as PSNR and SSIM.

Keywords

Coordinate Attention; efficient multi-head self-attention version 2; RandAugment; image super-resolution; SwinV2 transformer

1. Introduction

Image super-resolution (SR) [15] technology uses deep learning algorithms to reconstruct low-resolution (LR) images into their corresponding high-resolution (HR) counterparts. This technique holds significant importance in restoring old and blurry photographs and is currently a popular research direction with promising prospects. Over the years, deep learning-based image SR technology has undergone substantial development. Initially, the field was dominated by Convolutional Neural Network (CNN) and Generative Adversarial Network (GAN)-based [24,28,35,36,39,40] models, starting with the pioneering work of SRCNN [12]. However, the emergence of the Transformer [38] has disrupted this landscape. With its inherent self-attention mechanism capable of capturing global contextual information in images, the Transformer is more suitable for image SR tasks than GAN-based models.

Consequently, a myriad of Transformer-based image SR models have been proposed successively. The initial Transformer-based image SR model, IPT [5], holds particular significance. It showcased the potential of incorporating the Transformer architecture in image SR tasks. Subsequently, building upon the remarkable performance of the Swin Transformer [30] in various domains, Liang et al. took the lead in applying it to image SR. They successfully introduced SwinIR [27], the first image SR model based on the Swin Transformer. A notable breakthrough achieved by SwinIR is introducing a sliding window mechanism, enabling information transmission between adjacent windows. This advancement marks a significant milestone for the Transformer series in image SR.

Taking inspiration from SwinIR, numerous image SR models based on the Swin Transformer have been proposed [4,6–9,25,46,51]. For example, Choi et al. proposed Swin2SR [10] by replacing the Swin Transformer with Swin Transformer V2 [29]. Zhang et al. introduced ART [47] by incorporating Dense Attention and Sparse Attention [34] into SwinIR. Moreover, Transformer-based techniques for image SR are extensively utilized across different domains [20,21,53,54]. In medical imaging, for instance, multi-layer perceptrons favor global information over local details and LR inputs contain abundant low-frequency data, both of which impede the efficacy of shifted-window self-attention mechanisms. To tackle this challenge, Lu et al. [31] introduced a novel asymmetric convolutional Swin Transformer layer, building upon SwinMR [18], that adeptly integrates global and local information from neighboring windows or pixels. In terrain modeling, CNN- and GAN-based SR techniques struggle to restore terrain details and to create an HR Digital Elevation Model (DEM) that retains coordinate information. In response, Li et al. [26] introduced an enhanced DEM SR Transformer (DSRT) network. Tailored for large-scale DEM SR, this network explicitly considers the continuity of geographic information. In recent years, notable progress has also been made in image SR techniques applied to natural and text images. For example, in digital wallpaper image enhancement, existing methods have struggled to preserve intricate details within text regions while enhancing the overall visual appeal. Huang et al. [43] introduced a model named Real Text-SwinIR to tackle this challenge. It integrates a plug-and-play attention-based approach known as Learned Text Loss (LTL), ensuring the clarity of graphics while effectively preserving well-defined text structures.

Furthermore, Xing et al. introduced channel attention [50] to the Swin Transformer and effectively combined the two, resulting in a superior network structure known as RSTCANet. However, the channel attention mechanism calculates the importance of each channel using average pooling operations and applies the resulting weights to the feature maps. This averaging operation may underestimate or overlook crucial channels and fail to capture subtle differences. Moreover, spatial correlations can exist between different positions of the feature maps, but traditional channel attention focuses solely on inter-channel relationships and disregards spatial relationships. To address these issues, this paper introduces coordinate attention [16] to the image SR network. The main contributions of this paper are as follows:

- Based on a comprehensive analysis of the issues associated with the channel attention, we propose a novel image SR method called RSVTCANet.

- We design a novel residual SwinV2 Transformer coordinate attention block (RSVTCAB) that combines efficient multi-head self-attention version 2 (EMSAv2) [49] and coordinate attention to activate more pixels for better reconstruction.

- We utilize RandAugment [11] for data augmentation on the training set, which effectively enhances the generalization and final performance of RSVTCANet.

The main objective of this paper is to propose an image SR model called RSVTCANet. The remaining parts of this paper are outlined as follows: Section 2 provides a detailed overview of related works in the field of image SR, including Transformer and Swin Transformer. Section 3 presents the methodology of this study. The experimental section of this paper, including ablation studies and classical image SR, is covered in Section 4. Section 5 concludes the article with a summary and overview of the study.

2. Related work

2.1. Transformer

Traditional RNNs and related algorithms can only compute sequentially, from left to right or from right to left. This mechanism limits the model’s parallel capability and leads to information loss during sequential computation. To address this issue, Vaswani et al. proposed the Transformer, a model composed entirely of attention mechanisms and distinct from traditional CNNs and RNNs [38]. The basic structure of the Transformer consists of two main parts: the encoder and the decoder, each comprising six blocks. The primary function of the encoder is to extract features from the input, providing adequate semantic information for the decoding process. The decoder, in turn, uses the results from the encoder and the previous predictions to output the next result in the sequence.

Figure 1 illustrates the detailed internal structure of the transformer, showcasing the encoder block on the left and the decoder block on the right. The model’s input comprises two components: Input Embedding and Positional Encoding. The embedding layer transforms input data, such as text, into vector representations that the model can process to convey the information encapsulated in the original data. Positional Encoding, on the other hand, provides the model with details regarding the sequential order of the current time step. The red section denotes Multi-Head Attention, which encompasses multiple instances of Self-Attention. Each encoder and decoder block incorporates a Multi-Head Attention mechanism. This design choice enables each attention mechanism to specialize in different aspects of the vocabulary, effectively balancing biases that could arise from a sole attention mechanism. It facilitates the expression of multiple interpretations of word meanings and ultimately enhances the model’s performance [38].

Fig. 1.

The transformer - model architecture.

Additionally, the decoder block includes Masked Multi-Head Attention. In the transformer, masking serves two primary purposes: masking out uninformative padding regions and information from the “future”. While masking in the encoder primarily fulfills the first purpose, it simultaneously serves both goals in the decoder. The yellow section represents Add & Norm. The term “Add” refers to residual connections, which prevent network degradation, whereas “Norm” pertains to Layer Normalization, which normalizes the activation values of each layer. The model’s output is obtained by passing the result of the decoder through linear and softmax operations. The primary purpose of the Linear layer is to perform a linear transformation on the output of the previous step to achieve the desired output dimension [38].

2.2. Swin transformer

Traditional vision transformers [14] process features at a single scale, typically downsampled by a factor of 16, which makes them unsuitable for dense prediction tasks. Moreover, self-attention is computed over the entire image, so the computational complexity grows quadratically with the image size, which can be prohibitive. To address these challenges, Liu et al. proposed the swin transformer [30]. Unlike the vision transformer, the swin transformer computes self-attention within small local windows, significantly reducing computational complexity. By shifting the window partition between successive layers, neighboring windows can interact, providing cross-window connections between consecutive layers and effectively achieving global modeling. Figure 2 illustrates the overall network architecture of the swin transformer. The patch partition step divides the input image (H × W × 3) into non-overlapping patches of size 4 × 4, where H and W denote the height and width of the input image. These patches are flattened in the channel direction, changing the shape from H × W × 3 to (H/4) × (W/4) × 48. The linear embedding layer then applies a linear transformation to the channel data of each patch, projecting the dimension from 48 to C, so that the shape becomes (H/4) × (W/4) × C. The swin transformer block is responsible for feature extraction and comprises two types of attention mechanisms, W-MSA and SW-MSA, which are used sequentially in pairs. Therefore, the number of swin transformer blocks in each stage is always even. Stage 1 consists of the linear embedding layer followed by swin transformer blocks, whereas each remaining stage begins with a patch merging layer. This layer performs downsampling, halving the resolution and doubling the channel number. The resulting hierarchical design builds feature maps of different sizes and effectively reduces computational overhead [30].
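The shape bookkeeping described above can be traced with a short sketch. It assumes the standard Swin design in which patch merging halves the spatial resolution and doubles the channel number; the function name `swin_stage_shapes`, the default `embed_dim=96`, and the 224 × 224 input are illustrative choices, not values taken from this paper.

```python
def swin_stage_shapes(height, width, embed_dim=96, num_stages=4, patch_size=4):
    """Return (name, (H, W, C)) for each step of a Swin-style hierarchy."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by the patch size")

    # Patch partition: every 4 x 4 x 3 patch is flattened into one 48-dim token.
    h, w = height // patch_size, width // patch_size
    shapes = [("patch partition", (h, w, patch_size * patch_size * 3))]

    # Stage 1: linear embedding projects 48 -> C; the resolution is unchanged.
    c = embed_dim
    shapes.append(("stage 1", (h, w, c)))

    # Stages 2..n: patch merging halves H and W and doubles C (assumed convention).
    for stage in range(2, num_stages + 1):
        if h % 2 or w % 2:
            raise ValueError("feature map is not divisible by 2 for patch merging")
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((f"stage {stage}", (h, w, c)))
    return shapes


# Illustrative 224 x 224 input with C = 96 (hypothetical values).
for name, shape in swin_stage_shapes(224, 224):
    print(name, shape)
```

Each stage trades spatial resolution for channel width, which is what keeps the cost of window attention manageable at deeper stages.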

Fig. 2.

(a) The architecture of a Swin Transformer; (b) two successive Swin Transformer blocks.

2.3. Coordinate attention

In recent years, incorporating attention modules into network architectures has emerged as one of the most commonly used methods for improving image SR models. However, while attention modules significantly enhance model performance, most existing approaches focus solely on modeling the relationships between channels to weigh each channel, thereby overlooking positional information, which plays a vital role in generating spatially selective feature maps. To address this limitation, Hou et al. [16] proposed a novel “coordinate attention” module that simultaneously considers channel relationships and positional information. The coordinate attention module captures information across channels and integrates direction awareness and position sensitivity, enabling the model to localize and identify target regions more accurately. To encode precise positional information together with channel relationships and long-range dependencies, the module is structured into two main steps: coordinate information embedding and coordinate attention generation. Refer to Fig. 3 for a visual representation of this module.

Fig. 3.

Basic structure of the coordinate attention.

The first step is coordinate information embedding. Global pooling is commonly used in channel attention to encode spatial information into channel descriptors, but it discards positional information. To avoid the loss caused by 2D global pooling, coordinate attention decomposes the pooling into two parallel 1D feature encodings, which efficiently integrates spatial coordinate information. Specifically, for the input x, pooling kernels of size $(H, 1)$ and $(1, W)$ are applied along the horizontal and vertical coordinate directions, respectively, to encode each channel. Consequently, the outputs of the c-th channel at height h and at width w are given by Eq. (1) and Eq. (2) [16], respectively. $\begin{aligned} Z_{c}^{h}(h) &= \frac{1}{W} \sum_{0 \le i < W} x_{c}(h, i) \qquad (1) \\ Z_{c}^{w}(w) &= \frac{1}{H} \sum_{0 \le j < H} x_{c}(j, w) \qquad (2) \end{aligned}$

Equation (1) and Eq. (2) perform feature aggregation along the two spatial directions, yielding a pair of direction-aware feature maps. These transformations enable the attention module to capture long-range dependencies along one spatial direction while preserving precise positional information along the other. Consequently, this facilitates more accurate localization of the target of interest within the network. The coordinate information embedding corresponds to the X Avg Pool and Y Avg Pool depicted in Fig. 3.
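As a small worked illustration of Eq. (1) and Eq. (2), the sketch below pools a toy feature map stored as nested lists with layout [C][H][W]. The function name `coordinate_embedding` and the toy values are assumptions made for this example only.

```python
def coordinate_embedding(x):
    """Directional average pooling of a [C][H][W] nested list.

    Returns (z_h, z_w): z_h has shape [C][H] (Eq. 1, mean over the width),
    z_w has shape [C][W] (Eq. 2, mean over the height).
    """
    z_h, z_w = [], []
    for channel in x:
        height, width = len(channel), len(channel[0])
        # Eq. (1): for each row h, average the W values in that row.
        z_h.append([sum(row) / width for row in channel])
        # Eq. (2): for each column w, average the H values in that column.
        z_w.append([
            sum(channel[j][w] for j in range(height)) / height
            for w in range(width)
        ])
    return z_h, z_w


# One channel with H = 2 rows and W = 3 columns (toy values).
x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]]
z_h, z_w = coordinate_embedding(x)
print(z_h)  # [[2.0, 5.0]]
print(z_w)  # [[2.5, 3.5, 4.5]]
```

Unlike a single global average (3.5 for this channel), the two outputs still record which row and which column each statistic came from.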

The next step is coordinate attention generation, which fully utilizes positional information to accurately locate the regions of interest and effectively capture the relationships between channels. Specifically, the two feature maps, $Z^{h}$ and $Z^{w}$, generated by Eq. (1) and Eq. (2), are concatenated. Subsequently, a 1 × 1 convolution, represented by $F_{1}$ in Eq. (3) [16], is applied to the concatenated feature maps. In this context, $\delta$ denotes a non-linear activation function. The resulting $f \in \mathbb{R}^{C / r \times (H + W)}$ is the intermediate feature map containing spatial information in the horizontal and vertical directions. The parameter r denotes the reduction ratio. $f = \delta (F_{1} ([Z^{h}, Z^{w}])) \qquad (3)$

Next, the tensor f is split along the spatial dimension into two separate tensors, $f^{h} \in \mathbb{R}^{C / r \times H}$ and $f^{w} \in \mathbb{R}^{C / r \times W}$. Then, two 1 × 1 convolutions ($F_{h}$ and $F_{w}$) transform the channel dimensions of the feature maps $f^{h}$ and $f^{w}$ to match the input x, resulting in Eq. (4) and Eq. (5), where $\sigma$ is the sigmoid function. $\begin{aligned} g^{h} &= \sigma (F_{h} (f^{h})) \qquad (4) \\ g^{w} &= \sigma (F_{w} (f^{w})) \qquad (5) \end{aligned}$

Finally, by expanding $g^{h}$ and $g^{w}$ and using them as attention weights, the output of the coordinate attention can be expressed as Eq. (6) [16]. $y_{c} (i, j) = x_{c} (i, j) \times g_{c}^{h} (i) \times g_{c}^{w} (j) \qquad (6)$
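Eqs. (3) to (6) can be traced end to end with a minimal pure-Python sketch. It is an illustration under stated assumptions rather than the implementation of [16]: ReLU stands in for the non-linearity $\delta$, normalization layers are omitted, biases are dropped, and the weight matrices `w1`, `w_h`, `w_w` as well as the toy tensors are hypothetical.

```python
import math


def conv1x1(weight, features):
    """1x1 convolution without bias: weight is [C_out][C_in], features is [C_in][L]."""
    length = len(features[0])
    return [
        [sum(w * features[c][p] for c, w in enumerate(row)) for p in range(length)]
        for row in weight
    ]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def coordinate_attention(x, z_h, z_w, w1, w_h, w_w):
    """Apply Eqs. (3)-(6) to x ([C][H][W]) given pooled maps z_h ([C][H]) and z_w ([C][W])."""
    height = len(z_h[0])
    # Eq. (3): concatenate along the spatial axis, 1x1 conv, non-linearity
    # (ReLU is an assumed stand-in for delta).
    concat = [row_h + row_w for row_h, row_w in zip(z_h, z_w)]
    f = [[max(0.0, v) for v in row] for row in conv1x1(w1, concat)]
    # Split f back into its height part and width part.
    f_h = [row[:height] for row in f]
    f_w = [row[height:] for row in f]
    # Eqs. (4)-(5): restore C channels and squash to (0, 1) gates.
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, f_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, f_w)]
    # Eq. (6): reweight every position by its row gate and its column gate.
    return [
        [[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(len(x[c][i]))]
         for i in range(len(x[c]))]
        for c in range(len(x))
    ]


# Toy input: C = 2, H = 2, W = 3, reduction ratio r = 2 (so C / r = 1).
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0]]]
z_h = [[2.0, 5.0], [1.0 / 3.0, 4.0 / 3.0]]    # row means of x
z_w = [[2.5, 3.5, 4.5], [1.0, 0.5, 1.0]]      # column means of x
w1 = [[0.5, -0.5]]                            # hypothetical F_1 weights (C/r x C)
w_zero = [[0.0], [0.0]]                       # zero F_h / F_w weights give gates of 0.5
y = coordinate_attention(x, z_h, z_w, w1, w_zero, w_zero)
print(y[0][1][2])  # 6.0 * 0.5 * 0.5 = 1.5
```

With zero weights in the last two convolutions every gate equals 0.5, so the output is exactly a quarter of the input; trained weights would instead produce row- and column-specific gates.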

2.4. EMSAv2

The transformer’s core lies in the Multi-head Self-Attention (MSA) [38], as illustrated in Fig. 4. However, MSA encounters two primary challenges. Firstly, the computation of MSA scales quadratically with the channel dimension $d_{m}$ or the token length n, resulting in significant computational complexity. Secondly, each head in MSA handles only a subset of the embedding dimensions, potentially impacting the network’s performance. To tackle these issues, Zhang et al. proposed EMSAv2 [48,49], a simpler, faster, and more robust multi-scale structure for visual recognition that builds upon MSA. Compared with the original MSA, EMSAv2 incorporates two additional structures: Pixel-Shuffle [37] and DWConv [17]. The precise positions of these two structures are depicted in Fig. 4. The specific working principle is as follows:

Fig. 4.

Comparison of MSA and EMSAv2.

Given a 1D input token $x \in \mathbb{R}^{n \times d_{m}}$, where n is the token length and $d_{m}$ is the channel dimension, EMSAv2 first applies a linear projection to x to obtain the queries: $Q = x W_{q} + b_{q}$, where $W_{q}$ and $b_{q}$ are the weights and biases of the projection. Q is then split into k groups (i.e., k heads), giving $Q \in \mathbb{R}^{k \times n \times d_{k}}$, where $d_{k} = d_{m} / k$ is the dimension of each head. To conserve memory, x is reshaped to its 2D form and downsampled by a depth-wise convolution that reduces its height and width. The downsampled output $x^{'}$ is reshaped back to 1D and layer-normalized, and the keys K and values V are obtained from $x^{'}$ in the same way as Q. Before the final output, a Pixel-Shuffle module (Up) is added to reconstruct the medium- and high-frequency information lost in the downsampling operation. Therefore, the final output of EMSAv2 can be represented as [49]: $EMSAv2 (Q, K, V) = Softmax \left( \frac{Q K^{T}}{\sqrt{d_{k}}} \right) V + Up (V)$
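The first term of the EMSAv2 output is ordinary scaled dot-product attention, which the following sketch computes for one head in pure Python. The $Up(V)$ branch (Pixel-Shuffle and DWConv) is deliberately left out, so this is only a partial illustration; the function names and toy tensors are assumptions. The keys and values may have fewer rows than the queries, mirroring the downsampled $x^{'}$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_term(q, k, v):
    """Softmax(Q K^T / sqrt(d_k)) V for one head.

    q: [n][d_k], k: [m][d_k], v: [m][d_v]; m may be smaller than n
    when keys and values come from downsampled tokens.
    """
    d_k = len(q[0])
    output = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        output.append([
            sum(wt * v_row[j] for wt, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return output


# A zero query attends uniformly, so the output is the mean of the value rows.
q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [0.0, 4.0]]
print(attention_term(q, k, v))  # [[1.0, 2.0]]
```

Because the score matrix has n × m entries, shrinking m through downsampling is what reduces the memory footprint compared with standard MSA.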

2.5. RandAugment

Data augmentation [2,44] is commonly employed in deep learning to enhance model generalization. RandAugment, proposed by Cubuk et al. [11], is an automatic data augmentation algorithm. It enriches the dataset using only two parameters, N and M. Specifically, it randomly selects N operations from a set of 14 predefined functions (Identity, AutoContrast, Equalize, Rotate, Solarize, Color, Posterize, Contrast, Brightness, Sharpness, ShearX, ShearY, TranslateX, TranslateY) and applies them to the images in the dataset with an intensity level of M. As the values of N and M increase, the regularization strength also increases [11]. Figure 5 illustrates several examples of images processed using RandAugment.
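The sampling step of RandAugment can be sketched without any image library. The snippet below only selects the operations and attaches the shared magnitude; it does not apply them to pixels. The function name `sample_policy` is hypothetical, and sampling with replacement and equal probability is an assumption about the procedure.

```python
import random

# The 14 operations listed in the text.
OPERATIONS = [
    "Identity", "AutoContrast", "Equalize", "Rotate", "Solarize", "Color",
    "Posterize", "Contrast", "Brightness", "Sharpness", "ShearX", "ShearY",
    "TranslateX", "TranslateY",
]


def sample_policy(num_ops, magnitude, rng):
    """Pick `num_ops` operations uniformly at random (with replacement, assumed)
    and pair each with the shared magnitude M."""
    return [(rng.choice(OPERATIONS), magnitude) for _ in range(num_ops)]


# N = 2 operations at magnitude M = 10; the seed is only for reproducibility.
rng = random.Random(0)
policy = sample_policy(num_ops=2, magnitude=10, rng=rng)
print(policy)
```

Because a fresh policy is drawn for every image, the whole search space is controlled by just the two integers N and M.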

Fig. 5.

Several good results for images in DIV2K [1] processed using the RandAugment algorithm.

Figure 5 shows that lighting adjustments were applied to image (a), significantly enhancing the visibility of the previously subtle green leaves along the edges of the original image. Image (b) used color enhancement to deepen the red and blue components, giving the entire porcelain object a more vibrant and three-dimensional appearance. Image (c) underwent simultaneous adjustments in brightness and color, resulting in an overall increase in brightness and distinctive color variations across different regions of the image, thus creating a more layered visual effect. Image (d) underwent style processing, rendering the snake’s skin darker and giving the sand a more granular texture. Finally, image (e) intensified the almost imperceptible underwater fine lines of the original image, making them discernible to the naked eye.

3. Method

3.1. Network architecture

3.1.1. The overall structure

The overall network structure of the proposed model is illustrated in Fig. 6. Similar to Swin2SR, it primarily consists of the shallow feature extraction module, the deep feature extraction module, and the image reconstruction module. The shallow feature extraction module comprises a single 3 × 3 convolutional layer. Its primary purpose is to preprocess the input images by converting the channel number to match the required features for the deep feature extraction module. The image reconstruction module combines convolution and up-sampling with specific structures determined by the intended application. These structures include classical image SR, lightweight image SR, real-world image SR, and image JPEG Compression Artifact Reduction. The classical image SR structure involves up-sampling [37] and two 3 × 3 convolutional layers. Additionally, there is a long skip connection between the shallow feature extraction module and the deep feature extraction module, directly connecting to the image reconstruction module. We introduce a novel basic block called RSVTCAB for the deep feature extraction module, which incorporates coordinate attention and EMSAv2. The enhanced overall network structure is referred to as RSVTCANet.

Fig. 6.

Residual SwinV2 Transformer Coordinate Attention Network (RSVTCANet) and Residual SwinV2 Transformer Coordinate Attention Block (RSVTCAB).

3.1.2. Residual SwinV2 Transformer coordinate attention block

Figure 6 shows that the deep feature extraction module primarily consists of N RSVTCABs and a 3 × 3 convolutional layer. Each RSVTCAB consists of K S2TLs and a 3 × 3 convolutional layer, with a long skip connection connecting the head and tail of the RSVTCAB. We have made improvements both internally and externally to the S2TL. Externally, we introduce a Coordinate Attention Block for every two consecutive S2TLs and connect it with residual connections. Therefore, in each RSVTCAB, the number of Coordinate Attention Blocks is half the number of S2TLs. Internally, we replace the original MSA in the S2TL with EMSAv2. The revised S2TL consists of EMSAv2 and an MLP, each followed by a normalization layer and wrapped in a residual connection.

3.2. Loss function

We utilize the L1 loss [23] as the loss function to train our model. It can be expressed as: $L (\Theta) = \frac{1}{N} \sum_{i = 1}^{N} \left\| \mathrm{RSVTCANet} (I_{M}^{i}) - I_{GT}^{i} \right\|_{1}$ Here, N represents the number of images in the training set, $\Theta$ represents the set of learnable parameters of RSVTCANet, $I_{M}^{i}$ denotes the i-th input LR image, $I_{GT}^{i}$ denotes the corresponding original HR image, and $\| \cdot \|_{1}$ is the $\ell_{1}$ norm.
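The loss can be checked by hand on tiny inputs. The sketch below follows the formula literally: it sums absolute differences within each image pair and averages over the N pairs. Images are represented as flat lists of pixel values, and the function name `l1_loss` is an assumption; deep learning frameworks commonly also average over pixels, which only rescales the value.

```python
def l1_loss(predictions, targets):
    """Mean over image pairs of the l1 norm of (prediction - target).

    predictions, targets: lists of equally sized flat pixel lists.
    """
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need the same non-zero number of predictions and targets")
    total = 0.0
    for pred, gt in zip(predictions, targets):
        if len(pred) != len(gt):
            raise ValueError("each prediction must match its target in size")
        # l1 norm of the difference for one image pair.
        total += sum(abs(p - g) for p, g in zip(pred, gt))
    return total / len(predictions)


# Two toy "images" with two pixels each: per-image norms are 2 and 3.
sr_outputs = [[1.0, 2.0], [3.0, 4.0]]
hr_targets = [[1.0, 0.0], [0.0, 4.0]]
print(l1_loss(sr_outputs, hr_targets))  # (2 + 3) / 2 = 2.5
```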

3.3. Improved RandAugment for image preprocess

Before formal training, we preprocess the images in the training dataset using the RandAugment. For each image, we apply two augmentation operations randomly selected with equal probability from a set of 14 functions, enhancing the image with an intensity of 10 (N = 2, M = 10). However, during the augmentation process, we encountered challenges such as blurring or excessive noise in particular images, leading to a decline in image quality compared to their pre-augmentation state. Some images even suffered from structural damage. Figure 7 presents examples of inadequately enhanced images.

Fig. 7.

Several bad results for images in DIV2K [1] processed using the RandAugment.

Figure 7 illustrates how these issues degrade image quality. In image (a), the stone is excessively illuminated, resulting in a nearly transparent white appearance and the loss of all texture details. Image (b) shows blurry and noisy processing of the window and staircase, completely distorting their original appearance. The most severe destruction is evident in image (c), where the resulting image differs entirely from the original in content and structure. Image (d) introduces unknown and discomforting irregular light spots. Excessive darkening in image (e) has rendered the grass and water details unrecognizable compared to the original image. To ensure that poorly augmented images do not harm the model’s performance, we selectively retain the augmented images that yield better results and use them to replace the original images. Conversely, images with poor augmentation effects are excluded from the training dataset and are not replaced.

4. Experiments

4.1. Datasets

For training, we use sub-images (480 × 480 for the ×4 SR task, 240 × 240 for the ×8 SR task) generated by cropping the DIV2K [1] dataset, giving 32,592 pairs of cropped HR and LR images. The LR images were obtained by downscaling the HR images by a factor of 4 or 8 using MATLAB’s bicubic interpolation. For testing, we evaluate our method on six benchmark datasets: Set5 [3], Set14 [45], BSD100 [32], Urban100 [19], General100 [13], and Manga109 [33]. The experimental results are reported as PSNR and SSIM [41] values computed on the Y channel of the YCbCr space.
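A minimal sketch of PSNR on the Y channel is given below. It assumes the ITU-R BT.601 luma conversion for 8-bit RGB that is commonly used in SR evaluation (the convention of MATLAB's `rgb2ycbcr`); the text does not state which conversion or border cropping was used, so these details, the function names, and the toy pixels are assumptions.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma for 8-bit RGB values in [0, 255]; the result lies in [16, 235]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(image_a, image_b, peak=255.0):
    """PSNR between two images given as equal-length lists of (r, g, b) pixels."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and of equal size")
    squared_error = 0.0
    for (r1, g1, b1), (r2, g2, b2) in zip(image_a, image_b):
        diff = rgb_to_y(r1, g1, b1) - rgb_to_y(r2, g2, b2)
        squared_error += diff * diff
    mse = squared_error / len(image_a)
    if mse == 0.0:
        return math.inf  # identical luma channels
    return 10.0 * math.log10(peak * peak / mse)


# Toy two-pixel images that differ slightly in the second pixel.
reference = [(10, 20, 30), (200, 100, 50)]
restored = [(10, 20, 30), (198, 101, 50)]
print(psnr_y(reference, restored))
```

Evaluating on luma alone ignores chroma errors, which is the usual convention for comparing SR methods on these benchmarks.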

4.2. Implementation details

We set the batch size to 4 and the training HR patch size to 48. The window size, channel number, and reduction factor are set to 8, 180, and 2 by default, respectively. We use the Adam optimizer [22] with $\beta_{1} = 0.9$ and $\beta_{2} = 0.99$ to minimize the L1 loss for RSVTCANet. Training runs for a total of 500000 iterations, and the learning rate is halved at 250000, 400000, 450000, and 475000 iterations.
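The step schedule described above can be written as a small function. The base rate of $1 \times 10^{-4}$ is taken from the setting selected later in the ablation study, and the convention that the halving takes effect exactly at each milestone iteration is an assumption; the function name `learning_rate` is hypothetical.

```python
from bisect import bisect_right

# Iterations at which the learning rate is halved.
MILESTONES = (250_000, 400_000, 450_000, 475_000)


def learning_rate(iteration, base_lr=1e-4, milestones=MILESTONES, gamma=0.5):
    """Learning rate at a given iteration for a multi-step halving schedule."""
    # Count how many milestones have been reached (milestone itself included).
    passed = bisect_right(milestones, iteration)
    return base_lr * gamma ** passed


for it in (0, 250_000, 400_000, 450_000, 475_000, 499_999):
    print(it, learning_rate(it))
```

Under these assumptions the rate ends at one sixteenth of its initial value for the last 25000 iterations.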

Our implementation is built on BasicSR. Our experiments were conducted using the PyTorch framework. The training was performed on four GPUs with NVIDIA GeForce GTX 1080 Ti for the ablation study and one GPU with NVIDIA A100-SXM-80GB for the classical image SR task. The tests were conducted on a CPU with an Intel(R) Core(TM) i3-8130U clocked at 2.2 GHz.

4.3. Ablation study

We conducted an ablation study before performing the classical image SR tasks to validate the effectiveness of the three proposed improvement methods for RSVTCANet. The ablation study consisted of three groups of experiments corresponding to this paper’s proposed improvement methods: coordinate attention, EMSAv2, and RandAugment.

We conduct ablation studies to investigate the importance of the different network components, designing the corresponding networks with comparable numbers of parameters. As shown in Table 1, we present six networks that differ in the improvement methods applied. The V1+Channel+MSA-based model (ID:1) reduces to a network whose architecture is similar to that of RSTCANet. The SwinV2 Transformer-based network (ID:2) obtains better performance than the SwinV1 Transformer-based network (ID:1). This suggests that scaled cosine attention in the SwinV2 Transformer produces gentler attention weights than dot-product attention in the SwinV1 Transformer, thereby improving the accuracy of the trained model.

To verify whether the RandAugment can genuinely enhance the model’s generalization during the training process, we conducted a comparative analysis between two scenarios: one involved processing the DIV2K datasets using the RandAugment, while the other did not. The model that incorporated the RandAugment was named RSVTCANet+. The comparative results are presented in Table 1. As we can see, our model using RandAugment (ID:6) achieves dramatically enhanced performance. It further verifies that RandAugment plays an essential role in improving the model’s generalization ability during training and its final performance.

Finally, comparing experimental groups 2 and 3 (ID:2 vs. ID:3) or experimental groups 4 and 5 (ID:4 vs. ID:5) shows that the model using the EMSAv2 structure achieves better performance than the model using the MSA structure. This indicates that Pixel-Shuffle and DWConv in EMSAv2 each contribute to improving model performance.

Table 1
Comparison of different network architectures for ×4 SR on the six benchmark datasets. The best results are shown in red. The higher the SSIM, the better.

Table 2

Ablation study on the number of RSVTCABs in RSVTCANet and the number of S2TLs in one RSVTCAB for ×4 SR, together with the parameter settings of the model variants. N and K denote the number of RSVTCABs in RSVTCANet and the number of S2TLs in one RSVTCAB, respectively. The best results are shown in red. The higher the PSNR and SSIM, the better.

4.3.1. Impact of different RSVTCABs in the RSVTCANet and different S2TLs in one RSVTCAB

From the combined model alone, it cannot be determined conclusively whether Coordinate Attention and EMSAv2 each have a positive effect: one module could have a negative effect that is masked by the positive effect of the other. We therefore conducted an ablation study on RSVTCAB with only Coordinate Attention included. The study was divided into six distinct scenarios by varying the learning rate ($2 \times 10^{-4}$ / $1 \times 10^{-4}$ / $0.5 \times 10^{-4}$), the number of RSVTCABs (2/4/6) in the deep feature extraction module, and the number of S2TLs (6/8) in each RSVTCAB.

Table 2 presents the results of the ablation study on the number of RSVTCABs in RSVTCANet and the number of S2TLs in one RSVTCAB. In the table, N represents the number of RSVTCABs in the deep feature extraction module, while K represents the number of S2TLs in each RSVTCAB. According to Table 2, when the learning rate is $1 \times 10^{-4}$ and N is 4, the highest PSNR and SSIM values are obtained on the Set14, BSD100, and Manga109 datasets. When the learning rate is $1 \times 10^{-4}$ and N is 6, the highest PSNR values are achieved on the Set5, Urban100, and General100 datasets. For the BSD100 and Urban100 datasets, the highest SSIM values are obtained when the learning rate is $1 \times 10^{-4}$, N is 4, and K is 8. Overall, our model performs better with a learning rate of $1 \times 10^{-4}$ than with the other two learning rates. Although 6 of the 12 metrics evaluated across the six benchmark datasets reach their highest values with a learning rate of $1 \times 10^{-4}$ and N of 6, Classic Image SR involves multiple auxiliary tasks, so we chose a configuration with a relatively smaller parameter count to reduce the overall training time. Therefore, we set the learning rate, the number of RSVTCABs, and the number of S2TLs in each RSVTCAB to $1 \times 10^{-4}$, 4, and 6, respectively.

4.3.2. Impact of window size in RSVTCANet

EMSAv2 provides an efficient and effective way to enlarge the window size. To investigate the impact of different window sizes on model performance, we conduct an ablation study and report the results in Table 3. In this study, we do not use RandAugment in RSVTCANet and set the window size to $8 \times 8$, $16 \times 16$, and $24 \times 24$, respectively, to observe the performance difference. The results show that a larger window size yields better performance.
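To make the cost of enlarging the window concrete, the sketch below counts windows, tokens per window, and attention-score entries for the three window sizes. It assumes plain (non-downsampled) window attention on a 48 × 48 feature map; the map size and the function name `window_attention_stats` are illustrative choices, not measurements from the experiments.

```python
def window_attention_stats(height, width, window_size):
    """Return (num_windows, tokens_per_window, score_entries) for window attention."""
    if height % window_size or width % window_size:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window_size) * (width // window_size)
    tokens_per_window = window_size * window_size
    # Each window holds a tokens x tokens attention-score matrix (per head).
    score_entries = num_windows * tokens_per_window ** 2
    return num_windows, tokens_per_window, score_entries


for size in (8, 16, 24):
    print(size, window_attention_stats(48, 48, size))
# 8  -> (36, 64, 147456)
# 16 -> (9, 256, 589824)
# 24 -> (4, 576, 1327104)
```

The number of score entries equals H × W × (window size)², so it grows with the square of the window side, which is why an efficient attention variant matters when the window is enlarged.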

Table 3
Ablation study on the window size in RSVTCANet for ×4 SR. w/o denotes without. Larger windows result in better performance. The best results are shown in red. The higher the PSNR and SSIM, the better.

4.3.3. Impact of kernel size of EMSAv2

We introduced EMSAv2 in Section 2.4; it aims to encode more local information without adding much computation. To explore which kernel size yields the best performance, we try $3 \times 3$, $5 \times 5$, and $7 \times 7$ depth-wise convolutions and report the results in Table 4. Because the depth-wise convolution has little effect on the number of parameters, we do not list parameter counts in the table. The $3 \times 3$ depth-wise convolution leads to the best results.

Table 4
Ablation study on the DWConv of EMSAv2 for ×4 SR. The results on the six benchmark datasets show that the 3 × 3 depth-wise convolution yields the best results. The best performance is shown in red. The higher the PSNR and SSIM, the better.

4.3.4. Impact of the number of coordinate attention blocks in one RSVTCAB

We designed four variant structures based on RSVTCANet, as shown in Fig. 8, to examine how the number of coordinate attention blocks in one RSVTCAB influences the model: RSVTCANet itself and three other variants. In the RSVTCAB of RSVTCANet-CA1, there is only one coordinate attention block, whose input is the input of the RSVTCAB; the attention it generates is multiplied by the features produced by the sixth S2TL of the RSVTCAB. RSVTCANet-CA2 has two coordinate attention blocks, whose attention is multiplied by the features produced by the third and sixth S2TLs of the RSVTCAB, respectively. In RSVTCANet-CA6, the RSVTCAB has one coordinate attention block for each S2TL.

Fig. 8.

Four variant structures with different numbers of coordinate attention blocks in one RSVTCAB are designed based on RSVTCANet.
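The placement of coordinate attention in the four variants can be summarized with a short sketch. It assumes six S2TLs per RSVTCAB with the coordinate attention blocks spaced evenly; the helper name `ca_positions` is hypothetical, and the sketch only lists which S2TL outputs are re-weighted, not how the attention itself is computed.

```python
def ca_positions(num_layers, num_ca_blocks):
    """1-based indices of the S2TLs whose output is multiplied by coordinate attention."""
    if num_ca_blocks < 1 or num_layers % num_ca_blocks:
        raise ValueError("the blocks must divide the layers evenly")
    step = num_layers // num_ca_blocks
    return [step * i for i in range(1, num_ca_blocks + 1)]


# Number of coordinate attention blocks per RSVTCAB in each variant.
VARIANTS = {
    "RSVTCANet-CA1": 1,
    "RSVTCANet-CA2": 2,
    "RSVTCANet": 3,
    "RSVTCANet-CA6": 6,
}

for name, blocks in VARIANTS.items():
    print(f"{name}: attention applied after S2TL {ca_positions(6, blocks)}")
```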

A comparison with RSVTCANet-CA1 in Table 5 shows that using one coordinate attention block for every three successive S2TLs (RSVTCANet-CA2) or for every two successive S2TLs (RSVTCANet) improves the performance of the RSVTCAB. However, with six coordinate attention blocks in the RSVTCAB (RSVTCANet-CA6), performance tends to degrade. This can be attributed to the shifted window partition mechanism of two consecutive S2TLs. To overcome the lack of cross-window connections in window-based self-attention, the Swin Transformer [30] introduced a shifted window partition between two successive transformer blocks, which SwinV2 retains. When coordinate attention is learned for each S2TL individually, the connections across windows are disregarded. In contrast, applying coordinate attention to every two consecutive S2TLs has a positive effect and complements the shifted window partition strategy.
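The role of the pair of layers can be illustrated with a toy example of the shifted window partition. The sketch below labels every position of an 8 × 8 grid with the window it belongs to, applies a cyclic shift of half a window, and re-partitions. The grid size, window size, and function names are assumptions for illustration, and the attention masks that Swin-style models apply to wrapped-around regions are omitted.

```python
def window_ids(size, window):
    """Label each position of a size x size grid with the index of its window."""
    per_row = size // window
    return [[(r // window) * per_row + (c // window) for c in range(size)]
            for r in range(size)]


def cyclic_shift(grid, shift):
    """Roll rows and columns towards the origin by `shift` positions."""
    n = len(grid)
    return [[grid[(r + shift) % n][(c + shift) % n] for c in range(n)]
            for r in range(n)]


def window_contents(grid, window):
    """For each window of the grid, the set of labels found inside it."""
    n = len(grid)
    return [{grid[r][c] for r in range(r0, r0 + window) for c in range(c0, c0 + window)}
            for r0 in range(0, n, window) for c0 in range(0, n, window)]


regular = window_ids(8, 4)
shifted = cyclic_shift(regular, 2)  # shift by half the window size
print(window_contents(regular, 4))  # each window sees only its own label
print(window_contents(shifted, 4))  # each window now mixes all four labels
```

In this toy setting, the second layer of a pair is the one that mixes tokens from different windows of the first layer, which is consistent with treating two successive S2TLs as the unit after which coordinate attention is applied.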

Table 5

Ablation study on the number of coordinate attention blocks in one RSVTCAB for ×4 SR. The best performance is shown in red. The higher the SSIM, the better.

4.4. Classical image super-resolution

To demonstrate the performance of RSVTCANet in classical image SR tasks, we compared it with the following models: SwinIR, Swin2SR, RSTCANet [42], ART, and SRFormer [52]. We reproduced all models using the same settings in a consistent environment to ensure experimental rigor. Since all models share a similar overall network structure, we made uniform adjustments based on the results of the ablation study: we set the number of basic blocks in the deep feature extraction module to 4, the number of STLs in each basic block to 6, and the learning rate to $1 \times 10^{-4}$.

Table 6
Quantitative comparison (average PSNR/SSIM) of our RSVTCANet with recent classical image SR methods on six benchmark datasets. The best performance is shown in red. The higher the PSNR and SSIM, the better.

4.4.1. Quantitative comparison

Table 6 presents the quantitative comparison on classical image SR tasks. For ×4 SR, RSVTCANet achieves the best performance on almost all six benchmark datasets. Specifically, its PSNR is 0.04 dB higher than that of ART on the Set5 and General100 datasets, and 0.11 dB higher than that of Swin2SR on the Manga109 dataset. The improvement becomes larger still when RandAugment is used to augment the DIV2K dataset (RSVTCANet+).

Fig. 9.

Qualitative comparison of our RSVTCANet with recent classical image SR methods for the ×4 SR task. For each example, our RSVTCANet can restore the structures and details better than other methods.

Unlike most works on image SR, we also conducted an additional ×8 SR task to validate the versatility of RSVTCANet across upscaling factors. As presented in Table 6, RSVTCANet achieved the best values for 10 of the 12 metrics measured across the six benchmark datasets. This demonstrates that RSVTCANet performs well not only in ×4 SR tasks but also in ×8 SR tasks. Specifically, the PSNR of RSVTCANet exceeds that of Swin2SR by 0.03 dB on Set14, General100, and Manga109, and by 0.04 dB on Urban100. We also found that RSVTCANet+ did not surpass RSVTCANet on this task. One possible explanation is that the RandAugment algorithm is unsuitable for ×8 SR tasks. These results support the effectiveness and efficiency of RSVTCANet.
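Because the gains above are reported in decibels, it helps to recall how PSNR relates to the mean squared error (MSE). The sketch below is a minimal, generic PSNR computation on flat lists of pixel values. The 8-bit peak value of 255 and the function names are assumptions, and the sketch does not reproduce the evaluation protocol behind Table 6 (for example, colour channel selection or border cropping).

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel sequences."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


print(psnr([10, 20, 30, 40], [12, 18, 33, 39]))
print(mse_ratio_for_gain(0.04))  # MSE factor corresponding to a 0.04 dB gain
```

Under this definition, a PSNR gain of 0.04 dB corresponds to an MSE that is a little under 1% lower.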

4.4.2. Qualitative analysis

Figure 9 presents the qualitative comparison between the proposed model and the other models. In “ppt3” from Set14, although none of the models can perfectly restore the English words, the letters generated by RSVTCANet+ are the clearest. Each letter is distinctly separated, whereas the letters generated by the other models blend together. In “img002” from Urban100, RSVTCANet+ generates window grids with the most pronounced hierarchical structure. Similarly, in “ARMS” from Manga109, the lines produced by RSVTCANet+ are also the clearest. In summary, the images generated by the proposed model exhibit richer texture details than those of the other models.

5. Conclusion

This paper proposes RSVTCANet, a SwinV2 Transformer-based image restoration model built from residual SwinV2 Transformer coordinate attention blocks (RSVTCABs). Our model combines coordinate attention and efficient multi-head self-attention version 2 (EMSAv2) to activate more pixels for HR reconstruction. Furthermore, we propose an optimized data augmentation method based on RandAugment, which enhances both the generalization during training and the overall performance of RSVTCANet. Quantitative and qualitative results demonstrate that RSVTCANet achieves superior performance in image reconstruction and generates richer texture details in the resulting images. In the future, we aim to apply RSVTCANet to other image restoration tasks, such as real-world image SR, image denoising, and JPEG compression artifact reduction. Finally, we hope that RSVTCANet can serve as a useful tool for research in SR model design.

References

1. Agustsson and Timofte, Ntire 2017 challenge on single image super-resolution: Dataset and study, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 126–135.

2. Atienza, Data augmentation for scene text recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1561–1570.

3. Bevilacqua, Roumy, Guillemot and M.L. Alberi-Morel, Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012).

4. Cao, Liang, Zhang, Li, Zhang, Wang and L.V. Gool, Reference-based image super-resolution with deformable attention transformer, in: European Conference on Computer Vision, Springer, 2022, pp. 325–342.

5. Chen, Wang, Guo, Xu, Deng, Liu, Ma, Xu, Xu and Gao, Pre-trained image processing transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12299–12310.

6. Chen, Li, Liu, Li, Tang and Chen, SwinFSR: Stereo image super-resolution using SwinIR and frequency domain knowledge, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1764–1774.

7. Chen, Wang, Zhou, Qiao and Dong, Activating more pixels in image super-resolution transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22367–22377.

8. Chen, Zhang, Gu, Kong, Yuan et al., Cross aggregation transformer for image restoration, Advances in Neural Information Processing Systems 35 (2022), 25478–25490.

9. Choi, Lee and Yang, N-gram in swin transformers for efficient lightweight image super-resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2071–2081.

10. M.V. Conde, U.-J. Choi, Burchi and Timofte, Swin2SR: Swinv2 transformer for compressed image super-resolution and restoration, in: European Conference on Computer Vision, Springer, 2022, pp. 669–687.

11. E.D. Cubuk, Zoph, Shlens and Q.V. Le, Randaugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.

12. Dong, C.C. Loy, He and Tang, Image super-resolution using deep convolutional networks, IEEE transactions on pattern analysis and machine intelligence 38(2) (2015), 295–307. doi:10.1109/TPAMI.2015.2439281.

13. Dong, C.C. Loy and Tang, Accelerating the super-resolution convolutional neural network, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, the Netherlands, October 11–14, 2016, Proceedings, Part II 14, Springer, Amsterdam, The Netherlands, 2016, pp. 391–407.

14. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly et al., An image is worth $16 \times 16$ words: Transformers for image recognition at scale, 2020. arXiv preprint arXiv:2010.11929.

15. Glasner, Bagon and Irani, Super-resolution from a single image, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 349–356. doi:10.1109/ICCV.2009.5459271.

16. Hou, Zhou and Feng, Coordinate attention for efficient mobile network design, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13713–13722.

17. A.G. Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto and Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017. arXiv preprint arXiv:1704.04861.

18. Huang, Fang, Wu, Gao, Li, Del Ser, Xia and Yang, Swin transformer for fast MRI, Neurocomputing 493 (2022), 281–304. doi:10.1016/j.neucom.2022.04.051.

19. J.-B. Huang, Singh and Ahuja, Single image super-resolution from transformed self-exemplars, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197–5206.

20. Huang, Chen and Xu, Dtsr: Detail-enhanced transformer for image super-resolution, The Visual Computer (2023), 1–18.

21. R.-Y. Ju, C.-C. Chen, J.-S. Chiang, Y.-S. Lin and W.-H. Chen, Resolution enhancement processing on low quality images using swin transformer based on interval dense connection strategy, Multimedia Tools and Applications (2023), 1–17.

22. D.P. Kingma and Ba, Adam: A method for stochastic optimization, 2014. arXiv preprint arXiv:1412.6980.

23. W.-S. Lai, J.-B. Huang, Ahuja and M.-H. Yang, Deep Laplacian pyramid networks for fast and accurate super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 624–632.

24. Ledig, Theis, Huszár, Caballero, Cunningham, Acosta, Aitken, Tejani, Totz, Wang et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.

25. Li, Fan, Xiang, Demandolx, Ranjan, Timofte and Van Gool, Efficient and explicit modelling of image hierarchies for image restoration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18278–18289.

26. Li, Zhu, Yao, Yue, Á.F. García-Fernández, E.G. Lim and Levers, A large scale Digital Elevation Model super-resolution Transformer, International Journal of Applied Earth Observation and Geoinformation 124 (2023), 103496. doi:10.1016/j.jag.2023.103496.

27. Liang, Cao, Sun, Zhang, Van Gool and Timofte, Swinir: Image restoration using swin transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1833–1844.

28. Liang, Zeng and Zhang, Details or artifacts: A locally discriminative learning approach to realistic image super-resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5657–5666.

29. Liu, Hu, Lin, Yao, Xie, Wei, Ning, Cao, Zhang, Dong et al., Swin transformer v2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009–12019.

30. Liu, Lin, Cao, Hu, Wei, Zhang, Lin and Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.

31. Lu, Jiang, Tian, Gu, Lu, Yang, Gong, Han, Jiang and Zhang, Asymmetric convolution Swin Transformer for medical image super-resolution, Alexandria Engineering Journal 85 (2023), 177–184. doi:10.1016/j.aej.2023.11.044.

32. Martin, Fowlkes, Tal and Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, IEEE, 2001, pp. 416–423. doi:10.1109/ICCV.2001.937655.

33. Matsui, Ito, Aramaki, Fujimoto, Ogawa, Yamasaki and Aizawa, Sketch-based manga retrieval using manga109 dataset, Multimedia Tools and Applications 76 (2017), 21811–21838. doi:10.1007/s11042-016-4020-z.

34. Mei, Fan and Zhou, Image super-resolution with non-local sparse attention, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3517–3526.

35. Mirza and Osindero, Conditional generative adversarial nets, 2014. arXiv preprint arXiv:1411.1784.

36. S.H. Park, Y.S. Moon and N.I. Cho, Perception-oriented single image super-resolution using optimal objective estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1725–1735.

37. Shi, Caballero, Huszár, Totz, A.P. Aitken, Bishop, Rueckert and Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.

38. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A.N. Gomez, Ł. Kaiser and Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).

39. Wang, Xie, Dong and Shan, Real-esrgan: Training real-world blind super-resolution with pure synthetic data, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1905–1914.

40. Wang, Yu, Wu, Gu, Liu, Dong, Qiao and Change Loy, Esrgan: Enhanced super-resolution generative adversarial networks, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

41. Wang, A.C. Bovik, H.R. Sheikh and E.P. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE transactions on image processing 13(4) (2004), 600–612. doi:10.1109/TIP.2003.819861.

42. Xing and Egiazarian, Residual swin transformer channel attention network for image demosaicing, in: 2022 10th European Workshop on Visual Information Processing (EUVIP), IEEE, 2022, pp. 1–6.

43. Xue, Zhou, Zhang, Shao, Wei and Wang, Rt-swinir: An improved digital wallchart image super-resolution with attention-based learned text loss, The Visual Computer 39(8) (2023), 3467–3479. doi:10.1007/s00371-023-03017-3.

44. Yoo, Ahn and K.-A. Sohn, Rethinking data augmentation for image super-resolution: A Comprehensive analysis and a new strategy, 2020. arXiv preprint arXiv:2004.00448.

45. Zeyde, Elad and Protter, On single image scale-up using sparse-representations, in: Curves and Surfaces: 7th International Conference, Avignon, France, June 24–30, 2010, Revised Selected Papers 7, Springer, 2012, pp. 711–730. doi:10.1007/978-3-642-27413-8_47.

46. Zhang, Huang, Liu, Wang and Jin, Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution, 2022. arXiv preprint arXiv:2208.11247.

47. Zhang, Gu, Zhang, Kong and Yuan, Accurate image restoration with attention retractable transformer, 2022. arXiv preprint arXiv:2210.01427.

48. Zhang and Y.-B. Yang, Rest: An efficient transformer for visual recognition, Advances in neural information processing systems 34 (2021), 15475–15485.

49. Zhang and Y.-B. Yang, Rest v2: Simpler, faster and stronger, Advances in Neural Information Processing Systems 35 (2022), 36440–36452.

50. Zhang, Li, Wang, Zhong and Fu, Image super-resolution using very deep residual channel attention networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.

51. Zheng, Zhu, Shi and Weng, Efficient mixed transformer for single image super-resolution, 2023. arXiv preprint arXiv:2305.11403.

52. Zhou, Li, C.-L. Guo, Bai, M.-M. Cheng and Hou, SRFormer: Permuted self-attention for single image super-resolution, 2023. arXiv preprint arXiv:2303.09735.

53. Zhu, Wang, Xie and Xu, Multiview latent space learning with progressively fine-tuned deep features for unsupervised domain adaptation, Information Sciences (2024), 120223. doi:10.1016/j.ins.2024.120223.

54. Zhu, Sun, Wang and Zhu, A double transformer residual super-resolution network for cross-resolution person re-identification, The Egyptian Journal of Remote Sensing and Space Science 26(3) (2023), 768–776. doi:10.1016/j.ejrs.2023.07.015.

