A novel dual-granularity lightweight transformer for vision tasks

Abstract

Transformer-based networks have driven significant progress in visual tasks. However, the widespread adoption of Vision Transformers (ViT) is limited by their high computational and parameter requirements, which make them less feasible for resource-constrained mobile and edge computing devices. Moreover, existing lightweight ViTs exhibit limitations in capturing features at different granularities, extracting local features efficiently, and incorporating the inductive bias inherent in convolutional neural networks. These limitations hurt overall performance to some extent. To address them, we propose an efficient ViT called Dual-Granularity Former (DGFormer). DGFormer introduces two modules: Dual-Granularity Attention (DG Attention) and Efficient Feed-Forward Network (Efficient FFN). In our experiments, on the ImageNet image recognition task, DGFormer surpasses lightweight models such as PVTv2-B0 and Swin Transformer by 2.3% in Top1 accuracy. On the COCO object detection task, under the RetinaNet detection framework, DGFormer outperforms PVTv2-B0 and Swin Transformer by 0.5% and 2.4% in average precision (AP), respectively. Similarly, under the Mask R-CNN detection framework, DGFormer improves AP by 0.4% and 1.8% over PVTv2-B0 and Swin Transformer, respectively. On the ADE20K semantic segmentation task, DGFormer improves mean Intersection over Union (mIoU) by 2.0% and 2.5% over PVTv2-B0 and Swin Transformer, respectively. The code is open-source and available at: https://github.com/ISCLab-Bistu/DGFormer.git.

Keywords

Transformer, object detection, classification, semantic segmentation

1. Introduction

Transformer-based networks have achieved remarkable success in various computer vision tasks and gained significant attention. Models such as Swin Transformer [1], PVT (Pyramid Vision Transformer) [2] and MobileViT [3] have demonstrated superior performance compared to convolutional neural networks (CNNs) such as ResNet [4], VGG [5], MobileNet [6] and DenseNet [7]. ViT [8], as a pioneering model, segments input images into equal-sized patches and treats them as tokens for self-attention computation within the Transformer model. This approach allows ViT to extract global features from the input images. Furthermore, ViT partially overcomes some of the limitations faced by CNNs in processing large-scale images, offering performance on par with CNNs across a range of tasks. Additionally, several improved Transformer-based models, such as DeiT [9], T2T-ViT [10], and CaiT [11], build upon ViT with additional modules to improve performance. However, these models often carry higher computational and parameter requirements, which slow inference, increase memory consumption, and reduce efficiency. These factors make them unsuitable for deployment on resource-limited edge computing devices.

On the other hand, continuous innovation in lightweight CNN models has achieved significant breakthroughs on edge computing devices and has driven the development of visual applications. For instance, [12] proposed the lightweight DSUNet, which provides an efficient solution for lane detection and path prediction tasks in autonomous driving through end-to-end learning. [13] implemented a lightweight real-time road surface state recognition algorithm on the low-power embedded device NVIDIA Jetson AGX Xavier, advancing the field of intelligent transportation. In addition, [62, 63] have also made significant contributions to intelligent transportation systems. [60] proposed a lightweight printed circuit board (PCB) defect detection model (light-PDD), addressing the redundant parameters and slow inference speed of existing methods. [61] proposed a new outlier detection method using the fuzzy C-means (FCM) algorithm, contributing significantly to defect detection in industrial applications. [56] proposed Consistency-Dependence Guided Knowledge Distillation (CDKD), which aims to make CNN models more lightweight while maintaining fast inference speed and high detection accuracy for object detection in remote sensing (RS) images. [59] proposed a tensor decomposition and knowledge distillation-based network (TDKD-Net) for low-altitude aerial (LAA) object detection, which reduces redundant parameters while maintaining performance. These methods provide valuable insights for deploying models on resource-constrained edge computing devices. However, such achievements remain somewhat out of reach for applications based on Transformer models: because of their large parameter and computational requirements, Transformer-based models are challenging to deploy on resource-constrained edge computing devices. Therefore, it is necessary to propose Transformer models that are more lightweight but still deliver outstanding performance.

In this paper, we introduce convolutional operations into the structure of ViT to leverage the advantages of attention mechanisms [8] in global modeling and the efficient modeling of local features by convolutions. This combination results in a lightweight and versatile backbone network that efficiently captures features at different granularities. We name this network Dual-Granularity Transformer (DGFormer), and its structure is illustrated in Fig. 1. Specifically, DGFormer consists of two novel modules: Dual-Granularity Attention (DG Attention) and Efficient Feed-Forward Network (Efficient FFN). The DG Attention is responsible for efficiently extracting two different granularities of feature information and modeling global features, while Efficient FFN is designed to efficiently extract local feature information. Both modules possess the inductive bias [14] capability. DGFormer adopts a standard four-stage design [15], and its parameter size is similar to existing lightweight networks such as MobileNetV2 [16] and PVTv2-B0 [17]. The main contributions of this paper can be summarized in three aspects:

We propose Dual-Granularity Transformer (DGFormer), which is a lightweight Transformer-based model that incorporates convolutional operations to efficiently model features across different granularities. This design enables DGFormer to achieve competitive performance while maintaining an extremely low parameter count and FLOPs (floating-point operations).

We propose two lightweight and efficient modules (DG Attention and Efficient FFN) to address the limitations of lightweight Transformer-based models. The DG Attention module extracts global feature information and models features at diverse granularities, while the Efficient FFN module efficiently captures local feature information by leveraging the inductive bias of convolutions [14]. Together, these modules alleviate the limitations above and significantly enhance the performance of lightweight Transformer-based models.

We evaluate DGFormer on multiple vision tasks and compare it with popular lightweight models. DGFormer outperforms these models across all tasks, even with fewer parameters and FLOPs. In ImageNet image classification [18], DGFormer surpasses PVTv2-B0 and Swin Transformer by 2.3% in Top1 accuracy. In object detection on COCO [19, 20], with the RetinaNet detector, DGFormer surpasses PVTv2-B0 and Swin Transformer by 0.5% and 2.4% in AP, respectively. With the Mask R-CNN detector, DGFormer surpasses PVTv2-B0 and Swin Transformer by 0.4% and 1.8% in AP, respectively. In semantic segmentation on ADE20K [21], DGFormer surpasses PVTv2-B0 and Swin Transformer by 2.0% and 2.5% in mIoU, respectively. These results highlight the strong performance of DGFormer across various vision tasks.

2. Related work

2.1 Light-weight vision-transformer

The Transformer has achieved tremendous success in natural language processing (NLP) [22], and Vision Transformer (ViT) has carried this success over to computer vision. However, the multi-head self-attention mechanism (MSA) [8] in ViT requires processing a large number of image patches, leading to high parameter counts and computational complexity. Therefore, several approaches have been proposed to make the MSA more lightweight. One such approach is the Spatial Reduction Attention (SRA) mechanism introduced in PVT [2], which compresses the spatial scales of keys ($K$) and values ($V$) before computing self-attention, thereby reducing the computational and parameter costs. Additionally, the Swin Transformer [1] and Swin Transformer v2 [23] adopt a window-based method that divides the image into windows and performs multi-head self-attention within local sliding windows, significantly reducing the computational burden. CSWin [24] introduces the Cross-Shape Window self-attention mechanism, which utilizes cross-shaped windows to perform parallel self-attention computations, achieving powerful modeling capabilities while limiting computational costs. On the other hand, lightweight ViT models have also been designed from the perspective of lightweight structure design. For example, MobileViT [3] effectively combines the strengths of Transformer and CNN, achieving good performance with fewer parameters and computations. Mobileformer [25] is another lightweight model that combines MobileNet [6] and Transformer, leveraging fewer tokens to learn global prior knowledge and achieve lower computational costs. Symmetric Former (SFormer) [57] combines CNN with Transformer in a comprehensive manner and introduces lightweight convolutional modules to replace redundant structures in ViT. Lite Vision Transformer (LVT) [26] is also a lightweight ViT model that adopts lightweight structure design. It introduces Convolutional Self-Attention (CSA) and Recursive Atrous Self-Attention (RASA) to efficiently model low-level and high-level features, respectively. These approaches aim to reduce the computational and parameter costs of ViT while maintaining strong modeling capabilities.

Although these existing methods have made ViT models lighter and reduced redundancy to some extent, they have not consistently demonstrated strong performance across the three visual tasks considered here (object detection, classification, and semantic segmentation). For example, SFormer [57] is a lightweight ViT model that performs well in image classification and semantic segmentation but shows some weaknesses in object detection. Conversely, although LVT [26] performs well across the three visual tasks, it is not sufficiently lightweight and still exhibits higher FLOPs and computational complexity. In contrast, our proposed DGFormer is not only lightweight, with a lower parameter count and FLOPs, but also incorporates convolutional inductive bias and multi-granularity feature information. By maintaining lightweight design principles, DGFormer demonstrates strong performance across all three visual tasks, outperforming the existing methods.

2.2 Convolution with vision-transformer

Due to the lack of the inherent inductive bias of convolutional neural networks (CNNs) [14], ViT models may slightly underperform lightweight CNNs in certain cases. Some researchers have found that incorporating convolutions into ViT can improve model stability and performance. These approaches can be categorized into three types. The first type involves adding convolutional stems to ViT models to introduce the inductive bias. For example, works like CMT [27, 28] add convolutional stems to ViT models to enhance their performance. Similarly, [58] proposed the Deconv-Transformer (DecT) network, where they introduced convolutional operations at the beginning of the ViT model and achieved exceptional performance. The second type involves integrating CNNs into the multi-head self-attention mechanism [8]. Works like CVPT [29] and CoaT [30] fuse convolutional and self-attention position encodings, while ConViT [31] introduces the Gated Position Self-Attention (GPSA) to incorporate soft convolutional inductive bias. CvT [32] modifies the multi-head attention in Transformers and replaces linear projection with depth-wise separable convolutions to further enhance model expressiveness. The third type involves adding CNN structures within the multi-head self-attention mechanism. Approaches such as CoAtNet [33] and LocalViT [34] enhance feature extraction capabilities by adding convolutions within the MSA. These methods often stack convolutional layers and self-attention layers alternately to introduce the inductive bias, thereby enhancing model performance. By incorporating convolutional elements into ViT models, these approaches aim to leverage the strengths of both Transformers and CNNs, resulting in improved stability and performance.

While these methods introduce convolutions into ViT models to leverage the strengths of both Transformers and CNNs, they often result in large model parameter sizes and computational complexity due to the lack of lightweight design principles. Additionally, these approaches typically focus on incorporating a single type of convolutional element. For example, CMT [27] adds convolutional stems to ViT structures to introduce the inductive bias. Similarly, CvT [32] replaces linear projection in the MSA with convolutions to incorporate the desired inductive bias. Although these approaches improve model performance to some extent, they fail to consider the significance of model parameter sizes and FLOPs. In contrast, our proposed DGFormer maintains a lightweight design while introducing convolutions in different parts of the model to incorporate the desired inductive bias. For instance, DGFormer adds a convolutional stem at the beginning of the backbone network and integrates convolutional structures within the MSA and feed-forward network (FFN) components. This novel approach achieves competitive performance while ensuring lightweightness, offering a unique and effective solution.

3. Method

3.1 Overall architecture

DGFormer is a hierarchical model that leverages the strengths of both CNN and Transformer architectures. By combining the local perception capabilities of CNNs with the global context and long-range dependency modeling of Transformers, DGFormer achieves a comprehensive and efficient representation for various visual tasks. Its structure, as shown in Fig. 1, consists of four stages. “Stage 1” comprises a convolutional stem and two Dual-Granularity Former Blocks. “Stage 2”, “Stage 3”, and “Stage 4” each consist of a Patch Embed module and two Dual-Granularity Former Blocks. Each stage includes an Absolute Position Embedding (APE) [35] and a Relative Position Embedding (PEG) [36] to encode the spatial structure of the image. Each Dual-Granularity Former Block consists of a Dual-Granularity Attention module and an Efficient Feed-Forward Network (Efficient FFN) module. The specific details are described in Sections 3.2 to 3.4. The convolutional stem consists of stacked convolution modules, which efficiently extract local features and reduce the image size. The Patch Embed module downsamples the feature map and divides it into fixed-size image blocks, which are then fed into the Dual-Granularity Former Block as a sequence. By utilizing the convolutional stem and Patch Embed modules, DGFormer generates feature maps of four different sizes, $H/4\times W/4\times 32$, $H/8\times W/8\times 64$, $H/16\times W/16\times 160$ and $H/32\times W/32\times 256$, from an input image of size $H\times W\times 3$. This is consistent with the output feature map sizes of existing classic CNN models such as VGG [5] and ResNet [4]. Consequently, DGFormer can directly replace the backbone network in existing visual task models like RetinaNet [37], Mask R-CNN [38] and Semantic FPN [39].
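The four feature-map sizes listed above follow directly from the per-stage strides (4, 8, 16, 32) and channel widths (32, 64, 160, 256). The short sketch below only reproduces that arithmetic; the function name `stage_shapes` is hypothetical, and the integer division assumes the input height and width are multiples of 32.

```python
def stage_shapes(height, width,
                 strides=(4, 8, 16, 32),
                 channels=(32, 64, 160, 256)):
    """Return the (H_i, W_i, C_i) feature-map shape produced by each stage.

    Strides and channel widths are taken from the text; exact division is
    an assumption (height and width are multiples of 32).
    """
    shapes = []
    for stride, ch in zip(strides, channels):
        shapes.append((height // stride, width // stride, ch))
    return shapes


# Example: the 224 x 224 input used for ImageNet evaluation.
shapes_224 = stage_shapes(224, 224)
```

For a 224 × 224 input this gives spatial sizes of 56, 28, 14 and 7, the same pyramid that detection and segmentation heads expect from a ResNet-style backbone.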

Figure 1.

Dual-Granularity Former (DGFormer). $H$, $W$ and $C$ represent the height, width and channel dimension of the image, respectively. $H_{1}=H/r$, $H_{2}=2H/r$, $W_{1}=W/r$ and $W_{2}=2W/r$, where $r$ represents the reduction factor. $Sr_{1}$ and $Sr_{2}$ are functions used for downsampling feature maps at different rates. PW-Conv denotes pointwise convolution, DW-Conv denotes depthwise convolution, MSA denotes multi-head self-attention, LayerNorm and LN denote Layer Normalization [8].

3.2 Dual-granularity former block

The Dual-Granularity Former Block, as shown in Fig. 2, consists of two LayerNorm (Layer Normalization) [8] layers, a DG Attention module, and an Efficient FFN module. Specifically, the input feature sequence is passed through the two branches of a residual structure [40]. One branch undergoes Layer Normalization to expedite model convergence and then extracts dual-granularity features using the DG Attention module. The other branch is a shortcut that is added to the output of the DG Attention module:

$\displaystyle{\rm{\bf X}}={\rm{\bf X}}+\text{DGA}(\text{LayerNorm}({\rm{\bf X}}))$ (1)

where $\text{LayerNorm}(\cdot)$ represents the operation of Layer Normalization and $\text{DGA}(\cdot)$ represents the operation of the DG Attention module.

The output ${\rm{\bf X}}$ is then fed into another residual structure. One branch undergoes Layer Normalization and is subjected to a non-linear transformation through the Efficient FFN module to enhance the model’s expressive power. The other branch is a shortcut that is added to the output of the Efficient FFN module:

$\displaystyle{\rm{\bf X}}={\rm{\bf X}}+\text{EFFN}(\text{LayerNorm}({\rm{\bf X}}))$ (2)

where $\text{EFFN}(\cdot)$ represents the operation of the Efficient FFN module. The DG Attention module and the Efficient FFN module are described in detail later in this paper.
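Equations (1) and (2) describe a pre-norm residual block, which can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the feature sequence is a nested list of shape $n\times d$, `layer_norm` omits the learnable scale and shift, and `dga` and `effn` are hypothetical callables standing in for the DG Attention and Efficient FFN modules.

```python
import math


def layer_norm(tokens, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance over channels.

    Assumption: no learnable scale/shift, to keep the sketch short.
    """
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def add(a, b):
    """Element-wise sum of two n x d nested lists (the shortcut connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def dual_granularity_block(x, dga, effn):
    """Apply Eq. (1) followed by Eq. (2) to the feature sequence x."""
    x = add(x, dga(layer_norm(x)))    # Eq. (1): X = X + DGA(LayerNorm(X))
    x = add(x, effn(layer_norm(x)))   # Eq. (2): X = X + EFFN(LayerNorm(X))
    return x
```

Because both sub-modules sit on a residual branch, a module that outputs zeros leaves the sequence unchanged, which is a convenient sanity check when wiring up a real implementation.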

Figure 2.

Dual-Granularity Former Block. ${\rm{\bf X}}$ represents the feature sequence, $n$ denotes the number of the patches, $d$ represents the dimension of the feature sequence.

3.3 Dual-granularity attention module

As a crucial component of the Transformer, the multi-head self-attention mechanism directly influences the performance of the model. Some studies (such as MViT [41], MViTv2 [42] and CrossViT [43]) have shown that integrating multi-scale feature information with the Transformer model can effectively improve its performance. However, the inclusion of multi-scale network structures increases the model’s parameter count and complexity, making it cumbersome and unsuitable for designing lightweight models. We believe that the increased complexity associated with multi-scale feature information primarily stems from the computation of multiple feature maps in the multi-head self-attention mechanism. To address this and achieve multi-granularity feature modeling in lightweight Transformer models, we propose the Dual-Granularity Attention module, as illustrated in Fig. 3. In the Dual-Granularity Attention module, we integrate the spatially downsampled dual-granularity feature information into the keys ($K$) and values ($V$) and then perform the multi-head self-attention computation. This equips the Transformer model with the capability to model multi-granularity feature information while maintaining lightweight design principles.

Figure 3.

Dual-Granularity Attention (DG Attention) module.

As shown in Fig. 3, the Dual-Granularity Attention module primarily consists of two branches, Granularity 1 and Granularity 2, designed to efficiently model features at different granularities. Inspired by PVT, we downsample $K$ and $V$, but use max pooling instead of convolution. This reduces the parameter count and improves efficiency, while highlighting salient features in the feature maps. Specifically, the input feature sequence ${\rm{\bf X}}_{\textit{Seq}}\in\mathbb{R}^{HW\times C}$ is processed by two branches. One branch applies a weight matrix ${\rm{\bf W}}_{Q}$ to obtain $Q$ (Query):

$\displaystyle Q={\rm{\bf X}}_{\textit{Seq}}\cdot{\rm{\bf W}}_{Q}$ (3)

where $Q\in\mathbb{R}^{HW\times C}$ and ${\rm{\bf W}}_{Q}\in\mathbb{R}^{C\times C}$. The other branch first applies a transformation function $\text{Seq2Img}(\cdot)$ to convert the input feature sequence ${\rm{\bf X}}_{\textit{Seq}}$ into a feature map ${\rm{\bf X}}_{\textit{Img}}$:

$\displaystyle{\rm{\bf X}}_{\textit{Img}}=\text{Seq2Img}({\rm{\bf X}}_{\textit{Seq}})$ (4)

where ${\rm{\bf X}}_{\textit{Img}}\in\mathbb{R}^{H\times W\times C}$ and $\text{Seq2Img}(\cdot)$ is a Reshape operation that takes an input of size $\mathbb{R}^{HW\times kC}$ and reshapes it to an output of size $\mathbb{R}^{H\times W\times kC}$. Next, the feature map ${\rm{\bf X}}_{\textit{Img}}$ is fed separately into two downsampling branches, Granularity 1 and Granularity 2. Each branch applies a max pooling operation ($Sr_{1}$ or $Sr_{2}$) and Layer Normalization (LN) [8], followed by the transformation function $\text{Img2Seq}(\cdot)$, resulting in the feature sequences ${\rm{\bf X}}_{\textit{Seq}1}$ and ${\rm{\bf X}}_{\textit{Seq}2}$, respectively:

$\displaystyle{\rm{\bf X}}_{\textit{Seq}1}=\text{Img2Seq}(\text{LN}(Sr_{1}({\rm{\bf X}}_{\textit{Img}})))$ (5)

$\displaystyle{\rm{\bf X}}_{\textit{Seq}2}=\text{Img2Seq}(\text{LN}(Sr_{2}({\rm{\bf X}}_{\textit{Img}})))$ (6)

where ${\rm{\bf X}}_{\textit{Seq}1}\in\mathbb{R}^{H_{1}W_{1}\times C}$, ${\rm{\bf X}}_{\textit{Seq}2}\in\mathbb{R}^{H_{2}W_{2}\times C}$ and $\text{Img2Seq}(\cdot)$ is a Reshape operation that takes an input of size $\mathbb{R}^{H\times W\times kC}$ and reshapes it to an output of size $\mathbb{R}^{HW\times kC}$. Then, the feature sequences ${\rm{\bf X}}_{\textit{Seq}1}$ and ${\rm{\bf X}}_{\textit{Seq}2}$ with different granularities are concatenated using the concatenation function $\text{Concat}(\cdot)$, resulting in a fused feature sequence ${\rm{\bf X}}_{\textit{Seq}}^{\prime}$:

$\displaystyle{\rm{\bf X}}_{\textit{Seq}}^{\prime}=\text{Concat}({\rm{\bf X}}_{\textit{Seq}1},{\rm{\bf X}}_{\textit{Seq}2})$ (7)

where ${\rm{\bf X}}_{\textit{Seq}}^{\prime}\in\mathbb{R}^{(H_{1}W_{1}+H_{2}W_{2})\times C}$. Subsequently, the weight matrices ${\rm{\bf W}}_{K}$ and ${\rm{\bf W}}_{V}$ are applied to obtain $K$ (Key) and $V$ (Value):

$\displaystyle K={\rm{\bf X}}_{\textit{Seq}}^{\prime}\cdot{\rm{\bf W}}_{K}$ (8)

$\displaystyle V={\rm{\bf X}}_{\textit{Seq}}^{\prime}\cdot{\rm{\bf W}}_{V}$ (9)

where $K\in\mathbb{R}^{(H_{1}W_{1}+H_{2}W_{2})\times C}$, $V\in\mathbb{R}^{(H_{1}W_{1}+H_{2}W_{2})\times C}$, ${\rm{\bf W}}_{K}\in\mathbb{R}^{C\times C}$ and ${\rm{\bf W}}_{V}\in\mathbb{R}^{C\times C}$. Finally, we perform multi-head self-attention on the obtained $Q$, $K$ and $V$ to obtain the final output feature sequence ${\rm{\bf X}}_{\textit{Attention}}$:

$\displaystyle{\rm{\bf X}}_{\textit{Attention}}=\text{MSA}(Q,K,V)$ (10)

$\displaystyle\text{MSA}(Q,K,V)=\text{Concat}(\text{head}_{0},\ldots,\text{head}_{i})\cdot{\rm{\bf W}}_{O}$ (11)

$\displaystyle\text{head}_{i}=\text{Softmax}\left(\frac{Q_{i}\cdot K_{i}^{T}}{\sqrt{C/N_{\textit{head}}}}\right)V_{i}$ (12)

where ${\rm{\bf X}}_{\textit{Attention}}\in\mathbb{R}^{HW\times C}$, $\text{MSA}(\cdot)$ represents the operation of the MSA, ${\rm{\bf W}}_{O}\in\mathbb{R}^{C\times C}$ represents the output weight matrix, $\text{Concat}(\cdot)$ represents the concatenation operation, $\text{head}_{i}$ represents the $i$-th head of the MSA, $\{Q_{i}\in\mathbb{R}^{HW\times C_{i}},K_{i}\in\mathbb{R}^{(H_{1}W_{1}+H_{2}W_{2})\times C_{i}},V_{i}\in\mathbb{R}^{(H_{1}W_{1}+H_{2}W_{2})\times C_{i}}\}$ represent the Query, Key and Value of $\text{head}_{i}$, $C_{i}=C/N_{\textit{head}}$ represents the dimension of each head and $N_{\textit{head}}$ represents the number of heads in the attention layer.
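To make the data flow of Eqs. (3) to (12) concrete, the following is a minimal plain-Python sketch under several assumptions that the text does not fix: $Sr_{1}$ and $Sr_{2}$ are taken to be non-overlapping max pooling with kernel and stride $r$ and $r/2$ (so that $H_{1}=H/r$ and $H_{2}=2H/r$), LN has no learnable parameters, biases are omitted, and all function and argument names are hypothetical. It illustrates the shapes and the order of operations rather than the released code.

```python
import math


def layer_norm(tokens, eps=1e-5):
    """Per-token normalization over channels (no learnable affine; assumption)."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def matmul(a, b):
    """(n x m) @ (m x p) for nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def max_pool(img, k):
    """Non-overlapping k x k max pooling on an H x W x C nested list."""
    H, W, C = len(img), len(img[0]), len(img[0][0])
    return [[[max(img[i * k + a][j * k + b][c] for a in range(k) for b in range(k))
              for c in range(C)]
             for j in range(W // k)]
            for i in range(H // k)]


def kv_token_count(H, W, r):
    """Number of key/value tokens: H1*W1 + H2*W2 with H1=H/r, H2=2H/r."""
    return (H // r) * (W // r) + (2 * H // r) * (2 * W // r)


def dg_attention(x_seq, H, W, r, w_q, w_k, w_v, w_o, n_head):
    """Dual-granularity attention on an (H*W) x C sequence; r must be even."""
    C = len(x_seq[0])
    q = matmul(x_seq, w_q)                                   # Eq. (3)
    img = [x_seq[i * W:(i + 1) * W] for i in range(H)]       # Eq. (4): Seq2Img
    fused = []
    for k in (r, r // 2):                                    # Granularity 1 and 2
        pooled = max_pool(img, k)                            # Sr_1 / Sr_2
        seq = [tok for row in pooled for tok in row]         # Img2Seq
        fused += layer_norm(seq)                             # Eqs. (5)-(7)
    k_mat, v_mat = matmul(fused, w_k), matmul(fused, w_v)    # Eqs. (8)-(9)
    d = C // n_head                                          # per-head dimension
    heads = []
    for h in range(n_head):
        sl = slice(h * d, (h + 1) * d)
        qh = [t[sl] for t in q]
        kh = [t[sl] for t in k_mat]
        vh = [t[sl] for t in v_mat]
        scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in kh]
                  for qi in qh]
        heads.append(matmul([softmax(s) for s in scores], vh))   # Eq. (12)
    concat = [sum((heads[h][i] for h in range(n_head)), []) for i in range(len(q))]
    return matmul(concat, w_o)                               # Eqs. (10)-(11)
```

Under these assumptions the key/value sequence has $H_{1}W_{1}+H_{2}W_{2}=5HW/r^{2}$ tokens instead of $HW$, so the attention score matrix shrinks from $HW\times HW$ to $HW\times 5HW/r^{2}$, and the pooling itself adds no parameters.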

3.4 Efficient feed-forward network module

While the multi-head self-attention mechanism captures the correlations among all positions in the generated feature sequence, the original Feed-Forward Network (FFN) does not fully exploit this information. Additionally, although the FFN can learn complex linear relationships between different positions and effectively extract local features, we believe that its feature representation capacity can be further improved. To address these limitations, we introduce the SE (Squeeze-and-Excitation) attention mechanism [44] and propose a lightweight and efficient version of the FFN called Efficient Feed-Forward Network (Efficient FFN). By incorporating the SE attention mechanism, the Efficient FFN focuses more on important channels within the input features, thereby enhancing the feature representation capacity and further improving the model’s performance. This improvement is particularly effective for tasks that require modeling global features, where the SE attention significantly improves the model’s performance.

Figure 4.

Structure of efficient feed-forward network (efficient FFN).

As shown in Fig. 4, the Efficient FFN module primarily consists of two components: an efficient feature extraction part and a channel attention weighting part. The efficient feature extraction part begins by applying Layer Normalization (LN) [8] to the input feature sequence ${\rm{\bf X}}_{in}\in\mathbb{R}^{HW\times C}$ and reshaping the result into a feature map ${\rm{\bf X}}_{LN+\text{Seq2Img}}\in\mathbb{R}^{H\times W\times C}$:

$\displaystyle{\rm{\bf X}}_{LN+\text{Seq2Img}}=\text{Seq2Img}(\text{LN}({\rm{\bf X}}_{in}))$ (13)

Then, efficient feature extraction is performed using the Efficient Convolutions module (EConvs). Specifically, EConvs is composed of stacked PointWise Convolution (PWConv), DepthWise Convolution (DWConv), and PWConv layers. The first PWConv expands the channel dimension of the feature map to a higher dimension ($4C$), followed by efficient feature extraction using DWConv. Finally, the second PWConv reduces the channel dimension of the feature map back to its original size ($C$), resulting in ${\rm{\bf X}}_{\textit{EConvs}}\in\mathbb{R}^{H\times W\times C}$:

$\displaystyle{\rm{\bf X}}_{\textit{EConvs}}=\textit{PWConv}(\textit{DWConv}(\textit{PWConv}({\rm{\bf X}}_{LN+\text{Seq2Img}})))$ (14)

The channel attention weighting part applies SE (Squeeze-and-Excitation) attention to the feature map ${\rm{\bf X}}_{\textit{EConvs}}$, resulting in a weighted feature map ${\rm{\bf X}}_{SE}\in\mathbb{R}^{H\times W\times C}$:

$\displaystyle{\rm{\bf X}}_{SE}=\textit{SEAttention}({\rm{\bf X}}_{\textit{EConvs}})$ (15)

This operation assigns weights to the channels of the feature map based on their importance. Next, ${\rm{\bf X}}_{SE}$ is transformed into a feature sequence and combined with the input feature sequence (${\rm{\bf X}}_{in}$) using a residual structure. This combination preserves more cross-layer information. As a result, a feature sequence with channel-specific weights, ${\rm{\bf X}}_{\textit{out}}\in\mathbb{R}^{HW\times C}$, is obtained:

$\displaystyle{\rm{\bf X}}_{\textit{out}}=\text{Img2Seq}({\rm{\bf X}}_{SE})+{\rm{\bf X}}_{in}$ (16)

This structure helps Efficient FFN to effectively integrate global and local features when handling complex sequence models, providing significant advantages in tasks that require modeling global information. Furthermore, it introduces the inductive bias inherent in convolutional operations, further enhancing the model’s ability to extract local features. Consequently, this mitigates to some extent the limitations faced by Transformer-based models, which lack this inductive bias.
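The pipeline of Eqs. (13) to (16) can be sketched as follows. Several details are assumptions made only for illustration, since the text does not specify them: the depthwise kernel is taken to be 3 × 3 with zero padding, the SE branch uses the common squeeze (global average pooling), ReLU bottleneck and sigmoid gate, no activation is inserted between the convolutions (Eq. (14) shows none), and biases are omitted. All function and argument names are hypothetical.

```python
import math


def layer_norm(tokens, eps=1e-5):
    """Per-token normalization over channels (no learnable affine; assumption)."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def pw_conv(img, weight):
    """Pointwise (1x1) convolution; weight has shape C_in x C_out."""
    c_out = len(weight[0])
    return [[[sum(v * weight[c][o] for c, v in enumerate(px)) for o in range(c_out)]
             for px in row] for row in img]


def dw_conv3x3(img, kernels):
    """Depthwise 3x3 convolution with zero padding; kernels[c] is a 3x3 list."""
    H, W, C = len(img), len(img[0]), len(img[0][0])
    out = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for c in range(C):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        if 0 <= y < H and 0 <= x < W:
                            acc += img[y][x][c] * kernels[c][di + 1][dj + 1]
                out[i][j][c] = acc
    return out


def se_attention(img, w_down, w_up):
    """Squeeze-and-Excitation: reweight channels by a gate computed from their means."""
    H, W, C = len(img), len(img[0]), len(img[0][0])
    squeeze = [sum(img[i][j][c] for i in range(H) for j in range(W)) / (H * W)
               for c in range(C)]
    hidden = [max(0.0, sum(squeeze[c] * w_down[c][h] for c in range(C)))
              for h in range(len(w_down[0]))]
    gate = [1.0 / (1.0 + math.exp(-sum(hidden[h] * w_up[h][c]
                                       for h in range(len(hidden)))))
            for c in range(C)]
    return [[[px[c] * gate[c] for c in range(C)] for px in row] for row in img]


def efficient_ffn(x_in, H, W, w_expand, dw_kernels, w_reduce, se_down, se_up):
    """Efficient FFN on an (H*W) x C sequence, following Eqs. (13)-(16)."""
    normed = layer_norm(x_in)
    img = [normed[i * W:(i + 1) * W] for i in range(H)]              # Eq. (13)
    feat = pw_conv(dw_conv3x3(pw_conv(img, w_expand), dw_kernels),
                   w_reduce)                                         # Eq. (14)
    weighted = se_attention(feat, se_down, se_up)                    # Eq. (15)
    seq = [px for row in weighted for px in row]                     # Img2Seq
    return [[a + b for a, b in zip(s, x)] for s, x in zip(seq, x_in)]  # Eq. (16)
```

Here `w_expand` has shape $C\times 4C$ and `w_reduce` has shape $4C\times C$, matching the expansion and reduction described above; the depthwise step mixes spatial neighbors within each channel, which is where the convolutional inductive bias enters.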

4. Experiments

In this section, we explore the effectiveness of DGFormer by conducting experiments on three different vision tasks: image classification, object detection, and semantic segmentation. Our objective is to evaluate the performance of DGFormer by comparing it with existing lightweight models in these tasks.

4.1 Evaluation metrics

We evaluate model size by the number of parameters (#Param) and the number of floating-point operations (FLOPs) performed by the model (including both the backbone and output layers). Image classification, object detection, and semantic segmentation are evaluated using Top1, AP (Average Precision), and mIoU (mean Intersection over Union), respectively. Top1 measures accuracy as the ratio of correctly predicted samples (those whose highest-probability category matches the label) to the total number of samples. AP is reported in six forms: AP_50:5:95, AP₅₀, AP₇₅, AP_S, AP_M and AP_L. AP₅₀ and AP₇₅ are the AP values computed with the IoU (Intersection over Union) threshold between predicted objects and labels set at 0.5 and 0.75, respectively; the threshold decides which predictions count as correct. AP_50:5:95 is the mean of the AP values computed separately at IoU thresholds ranging from 0.5 to 0.95 with a stride of 0.05. AP_S, AP_M and AP_L represent the AP calculated specifically for small, medium, and large objects, respectively. mIoU is computed at the pixel level: for each class, the intersection between the predicted and labeled regions is divided by their union, and the results are averaged over classes.
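The basic quantities behind these metrics are easy to state in code. The sketch below is illustrative only: the function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and segmentation maps are assumed to be flattened lists of class indices. It does not compute AP itself, which additionally requires ranking detections by confidence and integrating a precision-recall curve.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for s, y in zip(scores, labels):
        pred = max(range(len(s)), key=s.__getitem__)
        correct += int(pred == y)
    return correct / len(labels)


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred, target, num_classes):
    """Mean over classes of per-class pixel IoU; classes absent from both are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# The ten IoU thresholds averaged in AP_50:5:95: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

A detection counts as correct at a given threshold when its `box_iou` with a ground-truth box of the same class reaches that threshold, which is why AP₇₅ is stricter than AP₅₀.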

4.2 Image classification

Settings

The image classification experiments were performed on the ImageNet-1K dataset [18], which consists of 1,000 categories, 1.28 million training images and 50,000 validation images. For fair comparison, all models were trained on the training set, and the Top1 accuracy on the validation set was reported. We followed the data augmentation approach of PVT, which included random cropping, random horizontal flipping [45], label-smoothing regularization [46], mixup [47], CutMix [48], and random erasing [49]. For optimization, we employed the AdamW optimizer [50] with a momentum of 0.9, a mini-batch size of 256, and a weight decay of $5\times 10^{-2}$. The initial learning rate was set to $1\times 10^{-3}$ and decreased following the cosine schedule. All models were trained for 300 epochs from scratch on 2 NVIDIA RTX 4090 GPUs. For benchmarking, we applied a center crop on the validation set, where a $224\times 224$ patch was cropped to evaluate the classification accuracy. These procedures ensured standardized training and evaluation, while the data augmentation contributed to improved performance and robustness of the image classification models.
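As an illustration of the cosine schedule mentioned above, the following sketch computes the learning rate at a given epoch from the stated initial value of $1\times 10^{-3}$ over 300 epochs. The absence of warmup, the per-epoch granularity and the final learning rate of zero are assumptions, since the text does not specify them; the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=300, base_lr=1e-3, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch.

    Assumptions: no warmup phase, and the rate decays to min_lr (0 here)
    at the final epoch.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at the initial value, passes through half of it at epoch 150, and reaches zero at epoch 300, decaying slowly at both ends and fastest in the middle.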

Results

Table 1 presents the performance of DGFormer on ImageNet classification. Among ViT models with fewer than 6M parameters, DGFormer ties with Swin [1] and QuadTree-A-b0 [51] for the fewest parameters (3.4M). At the same time, DGFormer achieves the highest Top1 accuracy (72.8%), surpassing Swin and PVTv2-B0 by 2.3%, QuadTree-A-b0 by 1.9%, QuadTree-B-b0 [51] by 0.8%, T2T-ViT-7 [10] by 1.1%, and DeiT-Tiny/16 [9] by 0.6%. DGFormer also exhibits similar advantages over CNN models: it has 0.1M fewer parameters than MobileNetv2 but achieves a 0.9% improvement in Top1 accuracy. Overall, DGFormer demonstrates strong performance in image classification among lightweight models.

To validate the performance of the DG Attention module, we compared the DGFormer-FFN model with other Transformer models that use the same FFN module. Compared to PVTv2-B0, the results in Table 1 indicate that DGFormer-FFN achieves a 1.8% improvement in Top1 accuracy while requiring 0.1 fewer GFLOPs and 0.3M fewer parameters. Compared to Swin, QuadTree-A-b0, and QuadTree-B-b0, it also achieves higher Top1 accuracy with fewer or similar parameters and GFLOPs. This suggests that the DG Attention module is more computationally efficient than the other attention modules while maintaining competitive accuracy.

To validate the performance of the Efficient FFN module, we further compared DGFormer-FFN with DGFormer (which uses the Efficient FFN module). The results in Table 1 show that DGFormer achieves 0.5% higher Top1 accuracy while requiring 0.1 fewer GFLOPs. This indicates that the Efficient FFN module improves accuracy over the original FFN module at a lower computational cost.
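The margins quoted in this subsection can be rechecked directly from the values reported in Table 1. The sketch below is only a bookkeeping aid: the dictionary copies the table's numbers, and the helper name `top1_margin` is hypothetical.

```python
# (parameters in M, GFLOPs, Top1 accuracy in %) copied from Table 1.
TABLE1 = {
    "Swin": (3.4, 0.4, 70.5),
    "DGFormer": (3.4, 0.4, 72.8),
    "DGFormer-FFN": (3.4, 0.5, 72.3),
    "QuadTree-A-b0": (3.4, 0.6, 70.9),
    "MobileNetv2": (3.5, 0.3, 71.9),
    "QuadTree-B-b0": (3.5, 0.7, 72.0),
    "PVTv2-B0": (3.7, 0.6, 70.5),
    "T2T-ViT-7": (4.3, 1.1, 71.7),
    "DeiT-Tiny/16": (5.7, 1.3, 72.2),
}


def top1_margin(model: str, baseline: str) -> float:
    """Top1 accuracy difference (model minus baseline), in percentage points."""
    # Round to one decimal to avoid floating-point noise such as 2.2999999.
    return round(TABLE1[model][2] - TABLE1[baseline][2], 1)


if __name__ == "__main__":
    for name in TABLE1:
        if name != "DGFormer":
            print(f"DGFormer vs. {name}: +{top1_margin('DGFormer', name)}")
```

The differences are percentage points of Top1 accuracy, which is how the "%" margins in the text should be read.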

Table 1
Image classification performance on the ImageNet validation set. “#Param” refers to the number of parameters; “GFLOPs” refers to giga floating-point operations, calculated at an input scale of 224 $\times$ 224; “Swin” refers to the Swin Transformer [57]; and “DGFormer-FFN” refers to DGFormer using the original FFN module

Method	#Param (M)	GFLOPs	Top1 acc (%)
Swin	3.4	0.4	70.5
DGFormer (our)	3.4	0.4	72.8
DGFormer-FFN (our)	3.4	0.5	72.3
QuadTree-A-b0	3.4	0.6	70.9
MobileNetv2	3.5	0.3	71.9
QuadTree-B-b0	3.5	0.7	72.0
PVTv2-B0	3.7	0.6	70.5
T2T-ViT-7	4.3	1.1	71.7
DeiT-Tiny/16	5.7	1.3	72.2

Table 2
Object detection performance on COCO val2017. Each metric is the mean of 5 experiments. “#Param” refers to the number of parameters; “GFLOPs” refers to giga floating-point operations, calculated at an input scale of 800 $\times$ 1333; and “Swin” refers to the Swin Transformer [57]

Detector	Backbone	#Param (M)	GFLOPs	AP_50:5:95	AP_50	AP_75	AP_S	AP_M	AP_L
RetinaNet	DGFormer (our)	12.6	167.2	37.6	58.0	40.1	22.6	40.7	49.7
RetinaNet	Swin	12.7	168.6	35.2	54.3	37.4	23.0	38.2	45.9
RetinaNet	PVTv2-B0	13.0	186.1	37.1	57.1	39.5	23.1	40.4	49.7
Mask R-CNN	DGFormer (our)	23.1	183.1	38.6	60.5	41.8	24.0	41.8	50.0
Mask R-CNN	Swin	23.2	171.5	36.8	58.9	39.0	23.1	39.4	47.2
Mask R-CNN	PVTv2-B0	23.5	188.7	38.2	60.5	40.7	22.9	40.9	49.6

4.3 Object detection

Settings

Object detection experiments were performed on the challenging COCO benchmark [19]. All models were trained on COCO train2017 (118k images) and evaluated on COCO val2017 (5k images). The DGFormer backbone was evaluated in combination with two standard detectors, RetinaNet [37] and Mask R-CNN [38], and compared against PVTv2-B0 and Swin Transformer. The backbone was initialized with weights pre-trained on ImageNet, while the newly added layers were initialized using Xavier initialization [52]. Our models were trained using the AdamW optimizer [50] with an initial learning rate of $1 \times 10^{-4}$ and a batch size of 16 distributed across 4 A100 GPUs. Following common practice [37, 38, 53], we adopted the 1 $\times$ training schedule (12 epochs) for all detection models. During training, the input images were resized so that the shorter side was 800 pixels while the longer side did not exceed 1,333 pixels. During testing, the shorter side of the input image was fixed to 800 pixels.
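The resizing rule described above (shorter side of 800 pixels, longer side capped at 1,333 pixels, aspect ratio preserved) can be sketched as follows. This is an illustration rather than the authors' pipeline: the function name `rescale_size` and the round-half-up convention are assumptions.

```python
def rescale_size(width: int, height: int,
                 short_side: int = 800, max_long_side: int = 1333) -> tuple:
    """Return (new_width, new_height) with the aspect ratio preserved.

    The scale is the largest factor for which the shorter side is at most
    `short_side` and the longer side is at most `max_long_side`.
    """
    scale = min(short_side / min(width, height),
                max_long_side / max(width, height))
    # Round half up (assumed convention) to get integer pixel sizes.
    return int(width * scale + 0.5), int(height * scale + 0.5)


if __name__ == "__main__":
    print(rescale_size(640, 480))    # shorter side reaches 800
    print(rescale_size(1000, 300))   # longer side is capped at 1333
```

For a 640 $\times$ 480 image the shorter-side constraint is the binding one, whereas for a very elongated 1000 $\times$ 300 image the cap on the longer side determines the scale, so the shorter side ends up below 800 pixels.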

Results

As shown in Table 2, with both RetinaNet and Mask R-CNN, Swin Transformer yields the lowest AP_50:5:95, while DGFormer achieves the highest. Specifically, under the RetinaNet detection framework, DGFormer has the fewest parameters (12.6M vs. 12.7M and 13.0M) and the lowest GFLOPs (167.2 vs. 168.6 and 186.1), yet it outperforms PVTv2-B0 by 0.5% (37.6 vs. 37.1) in terms of average precision (AP_50:5:95) and surpasses Swin Transformer by 2.4% (37.6 vs. 35.2). DGFormer also achieves the highest AP_M (40.7) and matches PVTv2-B0 on AP_L (49.7), although its AP_S (22.6) is slightly below that of Swin Transformer (23.0) and PVTv2-B0 (23.1). Under the Mask R-CNN detection framework, DGFormer shows a similar advantage. It has the fewest parameters (23.1M vs. 23.2M and 23.5M) and the second-lowest GFLOPs (183.1 vs. 171.5 and 188.7), and it outperforms PVTv2-B0 by 0.4% (38.6 vs. 38.2) in terms of AP_50:5:95 and surpasses Swin Transformer by 1.8% (38.6 vs. 36.8). Here DGFormer also achieves the highest AP_S, AP_M, and AP_L, indicating its effectiveness in capturing objects of all scales. These experimental results demonstrate the effectiveness of our model.

Table 3
Semantic segmentation performance of different backbones on the ADE20K validation set. “#Param” refers to the number of parameters; “GFLOPs” refers to giga floating-point operations, calculated at an input scale of 512 $\times$ 512; and “Swin” refers to the Swin Transformer [57]

Method	Encoder	#Param (M)	GFLOPs	mIoU (%)
Semantic FPN	Swin	6.8	24.0	36.7
Semantic FPN	DGFormer (our)	7.2	23.5	39.2
Semantic FPN	PVTv2-B0	7.6	25.0	37.2

4.4 Semantic segmentation

Settings

We chose the ADE20K dataset [21], a challenging scene parsing dataset, to analyze the performance of our models in semantic segmentation. ADE20K consists of 150 fine-grained semantic categories, with 20,210, 2,000, and 3,352 images allocated for training, validation, and testing, respectively. Our experiments were conducted using the Semantic FPN framework [39] and the mmsegmentation [54] codebase, ensuring a fair comparison within this framework. We initialized the DGFormer backbone with weights pre-trained on ImageNet [18], while the newly added layers were initialized using Xavier initialization [52]. Our model was optimized using the AdamW optimizer [50] with an initial learning rate of $1 \times 10^{-4}$. Following common practice [39, 55], we trained our models for 80k iterations with a batch size of 16 on 2 NVIDIA RTX 4090 GPUs. The learning rate was decayed using a polynomial decay schedule with a power of 0.9. During training, the images were randomly resized and cropped to 512 $\times$ 512 pixels. During testing, the images were rescaled so that the shorter side was 512 pixels while maintaining the aspect ratio.
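The polynomial decay schedule described above can be written out as a short sketch. The base rate ($1 \times 10^{-4}$), the power (0.9), and the 80k iterations come from the text; the per-iteration update and the final rate of zero are assumptions, and the function name `poly_lr` is hypothetical.

```python
def poly_lr(iteration: int, max_iters: int = 80_000,
            base_lr: float = 1e-4, power: float = 0.9,
            min_lr: float = 0.0) -> float:
    """Polynomially decayed learning rate at a given training iteration.

    Assumptions (not stated in the paper): the rate is updated every
    iteration and decays to min_lr = 0 at max_iters.
    """
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(iteration / max_iters, 0.0), 1.0)
    return min_lr + (base_lr - min_lr) * (1.0 - progress) ** power


if __name__ == "__main__":
    for it in (0, 20_000, 40_000, 60_000, 80_000):
        print(it, f"{poly_lr(it):.3e}")
```

With a power below 1, the rate falls slightly more slowly than a linear ramp for most of training and then drops quickly near the end.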

Results

As shown in Table 3, when Semantic FPN is used for semantic segmentation, DGFormer clearly outperforms the other two backbones. PVTv2-B0 has the highest number of parameters and GFLOPs, yet its performance is weaker than that of DGFormer, and Swin Transformer exhibits the weakest performance. Specifically, DGFormer has the second-lowest number of parameters (7.2M vs. 6.8M and 7.6M) and the lowest GFLOPs (23.5 vs. 24.0 and 25.0), yet it outperforms PVTv2-B0 by 2.0% (39.2 vs. 37.2) in terms of mIoU and surpasses Swin Transformer by 2.5% (39.2 vs. 36.7). We believe that the strong performance of DGFormer on ADE20K can be attributed to its Dual-Granularity Attention mechanism, which efficiently models feature information at different granularities and thereby helps DGFormer in semantic segmentation tasks that require a high level of detail in feature representation.
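For readers unfamiliar with the metric, mIoU can be illustrated with a minimal sketch of the standard definition: per-class intersection over union, averaged over classes. This is not the evaluation code used for Table 3; benchmark toolkits typically accumulate intersections and unions over the whole validation set before averaging, and the function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes: int) -> float:
    """Mean Intersection over Union for two flat label sequences.

    `pred` and `target` are equal-length sequences of integer class labels
    (for example, flattened segmentation maps).
    """
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in at least one of the two.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    # Class 0: IoU = 1/2; class 1: IoU = 2/3; mean = 7/12.
    print(f"{mean_iou(pred, target, num_classes=2):.4f}")
```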

5. Conclusion

Through our research on lightweight Transformer models, we identified three primary limitations: insufficient capability to extract features at different granularities, weak capacity for local feature extraction, and the lack of the inductive biases inherent in convolutional operations. To mitigate these limitations, we introduce DGFormer, a lightweight yet powerful Transformer backbone for three vision tasks (image classification, object detection, and semantic segmentation). DGFormer is equipped with two new modules: the Dual-Granularity Attention (DG Attention) module and the Efficient Feed-Forward Network (Efficient FFN) module. The DG Attention module is designed to boost the global modeling capacity of the model, enabling it to extract features at different granularities, whereas the Efficient FFN module focuses on the efficient extraction of local features. Notably, both modules incorporate inductive biases, addressing a major limitation of existing lightweight Transformer models. Our experimental evaluation across three visual tasks indicates that DGFormer outperforms established models such as PVTv2 and Swin Transformer, in most settings with fewer or comparable parameters and FLOPs, demonstrating the model’s efficiency. In future work, we will focus on further improving the performance of DGFormer to make it more practical and efficient in real-world applications. We believe that DGFormer has particular value for edge computing devices, where its capabilities can be effectively leveraged. We also hope that our work will serve as a benchmark and a valuable reference for future research on lightweight vision transformers and promote further development in this field.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (NSFC) (grant no. U21A6003) and the Program of Promoting the Development of University-Diligence Talents (grant no. 5112111145).

Conflict of interest

All authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Liu, Lin, Cao, Wei, Zhang, Lin and Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
2. Wang, Xie, Fan D.P., Song, Liang, Luo and Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.
3. Mehta and Rastegari, Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer, arXiv preprint arXiv:2110.02178, 2021.
4. Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
5. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
6. Howard A.G., Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto and Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017.
7. Huang, Liu, Van der Maaten and Weinberger K.Q., Condensenet: An efficient densenet using learned group convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761.
8. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold and Gelly, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:11929, 2020.
9. Touvron, Cord, Douze, Massa, Sablayrolles and Jégou, Training data-efficient image transformers and distillation through attention, in: International Conference on Machine Learning, PMLR, 2021, July, pp. 10347–10357.
10. Yuan, Chen, Wang, Shi, Jiang Z.H., Tay F.E., Feng and Yan, Tokens-to-token vit: Training vision transformers from scratch on imagenet, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 558–567.
11. Touvron, Cord, Sablayrolles, Synnaeve and Jégou, Going deeper with image transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 32–42.
12. Lee D.H. and Liu J.L., End-to-end deep learning of lane detection and path prediction for real-time autonomous driving, Signal, Image and Video Processing 17(1) (2023), 199–205.
13. Gin, Wang, Cheng and Fang, Road surface state recognition using deep convolution network on the low-power-consumption embedded device, Microprocessors and Microsystems 96 (2023), 104740.
14. Xiao, Singh, Mintun, Darrell, Dollár and Girshick, Early convolutions help transformers see better, Advances in Neural Information Processing Systems 34 (2021), 30392–30400.
15. Chu, Tian, Zhang, Wang, Wei, Xia and Shen, Conditional positional encodings for vision transformers, arXiv preprint arXiv:2102.10882, 2021.
16. Sandler, Howard, Zhu, Zhmoginov and Chen L.C., Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
17. Wang, Xie, Fan D.P., Song, Liang, Luo and Shao, PVT v2: Improved baselines with Pyramid Vision Transformer, Computational Visual Media 8(3) (2022), 415–424.
18. Deng, Dong, Socher, Li and Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, June, pp. 248–255.
19. Lin T.Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick C.L., Microsoft coco: Common objects in context, in: Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, Springer International Publishing, pp. 740–755.
20. Everingham, Van Gool, Williams C.K., Winn and Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88 (2010), 303–338.
21. Zhou, Zhao, Puig, Fidler, Barriuso and Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
22. Beltagy, Peters M.E. and Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150, 2020.
23. Liu, Lin, Yao, Xie, Wei, Ning, Cao, Zhang, Dong, Wei and Guo, Swin transformer v2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009–12019.
24. Dong, Bao, Chen, Zhang, Yuan, Chen and Guo, Cswin transformer: A general vision transformer backbone with cross-shaped windows, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12124–12134.
25. Chen, Dai, Chen, Liu, Dong, Yuan and Liu, Mobile-former: Bridging mobilenet and transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270–5279.
26. Yang, Wang, Zhang, Wei, Lin and Yuille, Lite vision transformer with enhanced self-attention, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11998–12008.
27. Guo, Han, Tang, Chen, Wang and Xu, Cmt: Convolutional neural networks meet vision transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12175–12185.
28. Xiao, Singh, Mintun, Darrell, Dollár and Girshick, Early convolutions help transformers see better, Advances in Neural Information Processing Systems 34 (2021), 30392–30400.
29. Chu, Tian, Zhang, Wang, Wei, Xia and Shen, Conditional positional encodings for vision transformers, arXiv preprint arXiv:2102.10882, 2021.
30. Chang and Tu, Co-scale conv-attentional image transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9981–9990.
31. d’Ascoli, Touvron, Leavitt M.L., Morcos A.S., Biroli and Sagun, Convit: Improving vision transformers with soft convolutional inductive biases, in: International Conference on Machine Learning, PMLR, 2021, July, pp. 2286–2296.
32. Xiao, Codella, Liu, Dai, Yuan and Zhang, Cvt: Introducing convolutions to vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22–31.
33. Dai, Liu, Le and Tan, Coatnet: Marrying convolution and attention for all data sizes, Advances in Neural Information Processing Systems 34 (2021), 3965–3977.
34. Zhang, Cao, Timofte and Van Gool, Localvit: Bringing locality to vision transformers, arXiv preprint arXiv:2104.05707, 2021.
35. Wang, Shang, Lioma, Jiang, Yang, Liu and Simonsen J.G., On position embeddings in bert, in: International Conference on Learning Representations, 2020, October.
36. Chu, Tian, Wang, Zhang, Ren, Wei, Xia and Shen, Twins: Revisiting the design of spatial attention in vision transformers, Advances in Neural Information Processing Systems 34 (2021), 9355–9366.
37. Lin T.Y., Goyal, Girshick and Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
38. Gkioxari, Dollár and Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
39. Kirillov, Girshick and Dollár, Panoptic feature pyramid networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
40. Sandler, Howard, Zhu, Zhmoginov and Chen L.C., Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
41. Fan, Xiong, Mangalam, Yan, Malik and Feichtenhofer, Multiscale vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6824–6835.
42. C.Y., Fan, Mangalam, Xiong, Malik and Feichtenhofer, Mvitv2: Improved multiscale vision transformers for classification and detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4804–4814.
43. Chen C.F.R., Fan and Panda, Crossvit: Cross-attention multi-scale vision transformer for image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 357–366.
44. Shen and Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
45. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
46. Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
47. Zhang, Cisse, Dauphin Y.N. and Lopez-Paz, mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412, 2017.
48. Yun, Han, S.J., Chun, Choe and Yoo, Cutmix: Regularization strategy to train strong classifiers with localizable features, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
49. Zhong, Zheng, Kang and Yang, Random erasing data augmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34(07), 2020, pp. 13001–13008.
50. Loshchilov and Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, 2017.
51. Tang, Zhang, Zhu and Tan, Quadtree attention for vision transformers, arXiv preprint arXiv:2201.02767, 2022.
52. Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010, March, pp. 249–256.
53. Chen, Wang, Pang, Cao, Xiong et al., MMDetection: Open mmlab detection toolbox and benchmark, arXiv preprint arXiv:1906.07155, 2019.
54. Chen and Lin, MMSegmentation, 2020. https://github.com/openmmlab/mmsegmentation.
55. Chen L.C., Papandreou, Kokkinos, Murphy and Yuille A.L., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected crfs, IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4) (2017), 834–848.
56. Chen, Lin, Polat, Alhudhaif and Alenezi, Consistency-and dependence-guided knowledge distillation for object detection in remote sensing images, Expert Systems with Applications 229 (2023), 120519.
57. Liang, Tang, Liu and You, A lightweight vision transformer with symmetric modules for vision tasks, Intelligent Data Analysis, (Preprint), 1–17.
58. Lin, Yao, Chen, Alhudhaif and Alenezi, Deconv-transformer (DecT): A histopathological image classification model for breast cancer based on color deconvolution and transformer architecture, Information Sciences 608 (2022), 1093–1112.
59. Zeng and Luo, A novel tensor decomposition-based efficient detector for low-altitude aerial objects with knowledge distillation scheme, IEEE/CAA Journal of Automatica Sinica 11(2) (2024), 1–15.
60. Tang, Wang, Zhang and Zeng, A lightweight surface defect detection framework combined with dual-domain attention mechanism, Expert Systems with Applications 238 (2024), 121726.
61. Fang, Wang, Liu, Lauria, Zeng, Prieto, Sikström and Liu, A new particle swarm optimization algorithm for outlier detection: industrial data clustering in wire arc additive manufacturing, IEEE Transactions on Automation Science and Engineering, 2022.
62. Lin, Luo and Xu, HRST-LR: A Hessian Regularization Spatio-Temporal Low Rank Algorithm for Traffic Data Imputation, IEEE Transactions on Intelligent Transportation Systems, 2023.
63. Chen, Lin, Liu, Yang, Zhang and Xu, NT-DPTC: A non-negative temporal dimension preserved tensor completion model for missing traffic data imputation, Information Sciences 653 (2024), 119797.
