A lightweight vision transformer with symmetric modules for vision tasks

Abstract

Transformer-based networks have demonstrated powerful performance in various vision tasks. However, these networks are heavyweight and cannot be readily deployed on edge computing (mobile) devices. Although lightweight transformer networks have emerged, several problems remain, i.e., weak feature extraction ability, feature redundancy, and lack of convolutional inductive bias. To address these three problems, we propose a lightweight vision transformer (Symmetric Former, SFormer), which contains two novel modules (Symmetric Block and Symmetric FFN). Specifically, we design Symmetric Block to expand the feature capacity inside the module and enhance the long-range modeling capability of the attention mechanism. To increase the compactness of the model and introduce inductive bias, we use cheap convolutional operations to design Symmetric FFN. We compare SFormer with existing lightweight transformers on several vision tasks. Remarkably, on the image recognition task of ImageNet [13], SFormer gains 1.2% and 1.6% accuracy improvements compared to PVTv2-b0 and Swin Transformer, respectively. On the semantic segmentation task of ADE20K [64], SFormer delivers performance improvements of 0.2% and 0.7% compared to PVTv2-b0 and Swin Transformer, respectively. On the Cityscapes dataset [11], SFormer delivers performance improvements of 2.5% and 4.2% compared to PVTv2-b0 and Swin Transformer, respectively. The code is open-source and available at: https://github.com/ISCLab-Bistu/Symmetric_Former.git.

Keywords

Transformer, edge computing, classification, object detection, semantic segmentation

1. Introduction

Transformer-based architectures, e.g., ViT (Vision Transformer) [16], Swin Transformer [35], Swin Transformer v2 [36], PVT (Pyramid Vision Transformer) [51] and MobileViT [38], have achieved remarkable success, demonstrating highly competitive performance compared to CNNs (Convolutional Neural Networks), e.g., VGG [41], ResNet [21], ResNeXt [56] and RepVGG [14], in a variety of vision tasks. Among these architectures, ViT [16] is the pioneer and the most popular transformer model. Typically, ViT splits the image into a sequence of non-overlapping patches and then uses MSA (multi-head self-attention) [53] to learn representations between the patches. ViT-based models [35] can reach SOTA (state-of-the-art) performance by increasing the model size (e.g., the number of attention heads and the length of tokens). However, these performance gains come at the cost of reduced inference speed and increased memory footprint. Many vision applications, e.g., video surveillance, unmanned vehicles and drones, require models to run in real time on resource-limited edge computing devices. Therefore, ViT-based models for such applications should be lightweight and fast.

Currently, lightweight CNNs run successfully on edge computing devices and drive many real-world vision applications. For example, [5] implemented real-time semantic segmentation for a vehicle autopilot application on the Xilinx ZCU102, and [39] implemented real-time object detection for an autonomous motion planning application for UAVs (Unmanned Aerial Vehicles) on the Intel Movidius Myriad X VPU. However, ViT-based networks still struggle on edge devices due to their large number of parameters and FLOPs (floating point operations) [28]. In contrast to lightweight CNNs, ViT-based models are heavyweight (e.g., ViT-B/16 [16] vs. MobileNetv2 [40]: 86 million vs. 3.4 million parameters) and harder to optimize [55]. The computational complexity of MSA is quadratic in the number of patches. With the same parameter size, ViT-based models tend to have higher FLOPs and consume more memory than CNNs (e.g., PVTv2-B0 vs. MobileNetv2: 0.7 vs. 0.3 GFLOPs).

Moreover, the token length (feature capacity) limits the feature extraction ability of the model: transformer performance decreases dramatically as the token embedding length decreases [52, 35, 15], e.g., the token length of PVTv2-B0 is half that of PVTv2-B1 and its Top-1 accuracy is 8.2% lower. When the model is reduced to a size suitable for mobile devices, the Transformer model shows a performance disadvantage compared to the convolutional model (e.g., PVTv2-B0 [52] vs. MobileNetv2: 3.7 vs. 3.5 million parameters, 70.5% vs. 72.0% Top-1 accuracy). Besides, the MLP used by the Transformer FFN is likely to generate a large number of similar, redundant features due to its simple spatial structure and similar initial values [20]. Finally, since ViT lacks the inductive bias inherent in convolution, ViT-based models converge slowly and are sensitive to the choice of hyperparameters [55].

In this paper, we introduce convolution into ViT and combine the advantages of attention mechanisms (input-adaptive weighting and global processing [58]) and convolution (inductive bias) to construct a lightweight, compact and general ViT backbone. To achieve this goal, we propose the Symmetric Former (SFormer), whose structure is shown in Fig. 1. Specifically, SFormer mainly consists of a novel attention module (Symmetric Block) and a novel FFN (Feed-Forward Network) module (Symmetric FFN), designed to pursue performance and compactness. These two modules model correlations between tokens and between features within tokens, respectively. SFormer follows a standard four-stage structure and has a parameter size similar to that of existing lightweight networks, such as MobileNetV2 and PVTv2-B0. The contributions of this paper are three-fold:

(1) We summarize and analyze three shortcomings of existing lightweight ViTs, i.e., weak feature extraction ability, feature redundancy, and lack of convolutional inductive bias.
(2) We propose a novel lightweight and compact FFN module (Symmetric FFN) and an enhanced attention module (Symmetric Block) to solve these problems. Symmetric Block adopts the idea of the inverted residual block [40], which expands the channel capacity inside the module and enhances the long-range modeling capability of the attention mechanism. Symmetric FFN employs cheap convolutional operations, which introduce convolutional inductive bias and reduce redundant features.
(3) We propose a novel ViT backbone network, Symmetric Former (SFormer), by combining Symmetric Block and Symmetric FFN. Experiments on various vision tasks demonstrate the superiority of SFormer over PVTv2-b0 and Swin Transformer. Notably, for image recognition on the ImageNet [13] dataset, SFormer achieves a 1.2% and 1.6% improvement in accuracy compared to PVTv2-b0 and Swin Transformer, respectively. On the semantic segmentation task of ADE20K [64], SFormer delivers performance improvements of 0.2% and 0.7% compared to PVTv2-b0 and Swin Transformer, respectively. On the Cityscapes dataset [11], SFormer delivers performance improvements of 2.5% and 4.2% compared to PVTv2-b0 and Swin Transformer, respectively.

2. Related work

Light vision transformer

The Transformer has been very successful in both NLP and computer vision. ViT was the first vision transformer to successfully apply the NLP transformer [6] architecture to image recognition tasks with excellent performance. Subsequently, many researchers designed enhanced ViT models [17, 50] and achieved performance comparable to convolutional neural networks. With the development of ViT, researchers have achieved SOTA results on several important vision tasks, e.g., [44] proposed a novel Transformer object detection framework, TSP-RCNN, that achieved SOTA on the COCO dataset; [43] proposed Segmenter, a novel Transformer semantic segmentation framework, which achieved SOTA on the ADE20K and Pascal Context scene understanding datasets; [4] proposed a novel Transformer video understanding framework, ViViT, which achieved SOTA on multiple video classification datasets, e.g., Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time. Since the computation of MSA grows quadratically with the number of tokens, many works have constructed light and efficient vision transformers by building MSAs with near-linear computational complexity.
For example, [51] proposed spatial-reduction attention (SRA), which achieves linear computational complexity by compressing the tokens; [24] introduced depth-separable convolution into the attention mechanism to suppress its spatial communication and reduce the computational complexity; [47] designed Quadtree Attention, which limits the self-attention receptive field of different regions based on the intensity of attention; [26] improved information exchange by introducing a token shuffle operation between cross-window attention mechanisms; meanwhile, [35] designed window-based MSA, which restricts the receptive field of the self-attention mechanism; [27, 9, 49, 6, 61] approximated the full attention matrix by linearizing the SoftMax attention, which can be accelerated by first computing the product of keys and values; [31] employed learned inducing points of fixed size to compute attention with input tokens and reduce the computation to linear complexity. In contrast, this paper takes the existing attention mechanism and proposes two more compact and powerful modules.

Convolution and vision-transformer

Recent work [52] shows the advantages of combining convolution and Transformer. We group these works into three categories. The first category fuses convolution and self-attention mechanisms. [58] proposed Convolutional Self-Attention (CSA) and designed a new ViT backbone network; [12] introduced two ways to combine convolution and attention mechanisms and replaced the self-attention mechanisms in ViT; [57] proposed a new self-attention mechanism that introduces convolution for local enhancement, called Locally Enhanced Self-Attention (LESA); [38] designed the Transformers as Convolutions Module and combined it with MobileNet to design a lightweight ViT backbone network. The second category introduces CNNs into the MSA module or the FFN module. [10] introduced a CNN into the position embedding to enhance the generalization capability of the model; [33, 52] inserted a convolution operation into the FFN module to enhance the local feature representation capability of the model; [54] introduced a CNN into the MSA module as the input projection of the self-attention mechanism. The third category inserts CNNs into the ViT architecture. [19] inserted several convolutional layers in front of ViT, which introduces convolutional inductive bias into the model; [38] inserted several Inverted Blocks before ViT to introduce convolutional inductive bias and construct a low-latency and lightweight Transformer backbone; [7] added a CNN branch next to ViT and designed an interaction bridge. Our proposed SFormer deeply combines CNN with Transformer and designs lightweight convolutional modules to replace the redundant structure in ViT.

Cheap Operations in CNN

Constrained by GPU computing resources, [30] proposed group convolution, which reduces computation by limiting the channel exchanges of convolutional layers. However, the performance of convolutional neural networks is severely degraded because group convolution restricts channel communication. To solve this problem, [62] proposed the channel shuffle operation and applied it between two layers of group convolution to improve the channel information exchange. Subsequently, [25] decomposed the traditional 3 $\times$ 3 convolution into two convolutional layers, i.e., pointwise convolution and depth-separable convolution, to reduce the computational complexity of the model. [20] found that the output of pointwise convolution has a large number of redundant features and proposed Ghost Block, which reduces the output layer of pointwise convolution and generates more eigenfeatures by depth-separable convolution. Besides, [59] proposed the Asymmetrical bottleneck to reduce the number of parameters of the model while maintaining performance through a feature reuse operation. In contrast to these approaches, we apply cheap operations to a lightweight vision transformer that serves as the backbone of generic models.

3. Method

The architecture of our proposed Symmetric Former is shown in Fig. 1. To be specific, Symmetric Former consists of four stages, of which “Stage1” consists of a Patch Partition module and two Symmetric Former Blocks. Each Symmetric Former Block contains a Symmetric Block and a Symmetric FFN, described in Sections 3.1 and 3.3, respectively. “Stage2”, “Stage3” and “Stage4” are each composed of a Patch Merge module and several Symmetric Former Blocks. “Stage1” first employs Patch Partition to divide the input image into $H/4\times W/4$ patches, each of which represents a 4 $\times$ 4 image chunk. The subsequent two Symmetric Former Blocks model the channel associations within the patches and the spatial associations between the patches. To construct the standard four-stage structure [35], “Stage2”, “Stage3” and “Stage4” first gradually downsample the feature maps using the Patch Merge module. The Patch Merge module concatenates four adjacent tokens and compresses the token length by a factor of two using an MLP (Multilayer Perceptron). The output feature map sizes of the three Patch Merge modules are $H/8\times W/8$ , $H/16\times W/16$ and $H/32\times W/32$ , respectively. The Patch Merge modules of “Stage2”, “Stage3” and “Stage4” are followed by several Symmetric Former Blocks, which serve the same purpose as the Symmetric Former Blocks in “Stage1”. The output feature map sizes of the four stages are consistent with those of typical CNN architectures (e.g., VGG, ResNet, etc.). Consequently, Symmetric Former can directly replace the backbone network of existing visual task models (e.g., RetinaNet [8], Semantic FPN [29], etc.).
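The stage-wise resolutions described above can be tabulated with a few lines of Python. This is only a sketch under stated assumptions: the function name `stage_output_sizes` is hypothetical, the base token length of 32 is taken from the "#C" column of the configuration table, and the token length is assumed to double at every Patch Merge (four concatenated tokens, i.e. $4C$, compressed by a factor of two to $2C$).

```python
def stage_output_sizes(height, width, embed_dim=32, num_stages=4):
    """Return (H_i, W_i, C_i) for each stage of a four-stage pyramid backbone.

    Assumption: the token length doubles at each Patch Merge (4C -> 2C).
    """
    sizes = []
    stride = 4            # Patch Partition: each token covers a 4x4 image chunk
    channels = embed_dim  # token length after Patch Partition
    for _ in range(num_stages):
        sizes.append((height // stride, width // stride, channels))
        stride *= 2       # Patch Merge halves each spatial dimension
        channels *= 2     # assumed: 4 merged tokens (4C) compressed to 2C
    return sizes


for i, (h, w, c) in enumerate(stage_output_sizes(224, 224), start=1):
    print(f"Stage{i}: {h} x {w} tokens, token length {c}")
# Stage1: 56 x 56 tokens, token length 32
# Stage2: 28 x 28 tokens, token length 64
# Stage3: 14 x 14 tokens, token length 128
# Stage4: 7 x 7 tokens, token length 256
```

For a 224 $\times$ 224 input, the four spatial sizes match the $H/4$, $H/8$, $H/16$ and $H/32$ pyramid used by typical CNN backbones.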

3.1 Symmetric block

Inspired by Inverted Residual Block and MobileViT, we propose Symmetric Block, as shown in Fig. 2. Symmetric Block improves the context modeling capability of MSA in lightweight ViT networks by increasing the length of the token embedding and expanding the channel capacity within the module. Symmetric Block contains two branches, i.e., the Global Representation branch and the Local Representation branch, and a Fusion module. Specifically, the input of the Global Representation branch is regarded as short tokens. These tokens are normalized by Layer Normalization and projected into three long tokens by three MLP layers. The outputs of Layer Normalization and the three MLP layers are denoted as ${\rm{\bf\hat{Z}}}_{LN}=LN({\rm{\bf Z}}_{in})$ , ${\rm{\bf\hat{Z}}}_{Q}={\rm{\bf\hat{Z}}}_{LN}\cdot{\rm{\bf W}}_{Q}$ , ${\rm{\bf\hat{Z}}}_{K}={\rm{\bf\hat{Z}}}_{LN}\cdot{\rm{\bf W}}_{K}$ and ${\rm{\bf\hat{Z}}}_{V}={\rm{\bf\hat{Z}}}_{LN}\cdot{\rm{\bf W}}_{V}$ , respectively, where $\{{\rm{\bf W}}_{Q},{\rm{\bf W}}_{K},{\rm{\bf W}}_{V}\}\in{\rm R}^{C\times kC}$ are the weights of the three MLP layers, ${\rm{\bf Z}}_{in}\in{\rm R}^{HW\times C}$ is the input of the Symmetric Block, $H$ and $W$ are the dimensions of the feature maps and $C$ is the token length. Then, the long-range context of the long tokens is modeled by an attention mechanism to generate features with global information. Notably, inspired by [15], LePE (Locally-enhanced Positional Encoding) is introduced as the position embedding. Eqs (1) and (2) give the output of LePE and the output of the attention mechanism, respectively.

$\displaystyle{\rm{\bf\hat{Z}}}_{\textit{LePE}}=\text{Img2Seq}(\text{Seq2Img}({\rm{\bf\hat{Z}}}_{V})\odot{\rm{\bf W}}_{\textit{LeV}})$ (1)

$\displaystyle{\rm{\bf\hat{Z}}}_{\textit{Attention}}=\text{MSA}({\rm{\bf\hat{Z}}}_{Q},{\rm{\bf\hat{Z}}}_{K},{\rm{\bf\hat{Z}}}_{V})+{\rm{\bf\hat{Z}}}_{\textit{LePE}}$ (2)

Figure 1.

Symmetric Former (SFormer). $H$ and $W$ represent the height and width of the image, respectively, $r$ denotes the expansion factor, $C$ represents the token length, PW-Conv denotes pointwise convolution, DW-Conv denotes depth-separable convolution, GConv denotes 3 $\times$ 3 group convolution and LN denotes Layer Normalization [16].

where MSA is the window-based self-attention mechanism [35], Seq2Img and Img2Seq are reshape operations (Seq2Img reshapes its input from ${\rm R}^{HW\times kC}$ to ${\rm R}^{H\times W\times kC}$ and Img2Seq reshapes its input from ${\rm R}^{H\times W\times kC}$ to ${\rm R}^{HW\times kC}$ ), $\odot$ denotes the convolution operation and ${\rm{\bf W}}_{LeV}\in{\rm R}^{3\times 3\times kC}$ is the kernel of the depth-separable convolution.

To further represent the local spatial information of the tokens, the Local Representation branch applies a depth-separable convolution to the input. Then, an MLP fuses the Global Representation branch with the Local Representation branch and projects the result back to short tokens. Finally, the input and output are connected by a residual structure. The outputs of the depth-separable convolution and of Symmetric Block are represented by Eqs (3) and (4), respectively.

$\displaystyle{\rm{\bf\hat{Z}}}_{\textit{Local}}=\text{Img2Seq}(\text{Seq2Img}({\rm{\bf Z}}_{in})\odot{\rm{\bf W}}_{\textit{Local}})$ (3)

$\displaystyle{\rm{\bf\hat{Z}}}_{\textit{Out}}=[{\rm{\bf\hat{Z}}}_{\textit{Attention}},{\rm{\bf\hat{Z}}}_{\textit{Local}}]^{T}\cdot{\rm{\bf W}}_{\textit{Proj}}+{\rm{\bf Z}}_{in}$ (4)

where ${\rm{\bf W}}_{\textit{Local}}\in{\rm R}^{3\times 3\times C}$ is the kernel of the depth-separable convolution and ${\rm{\bf W}}_{\textit{Proj}}\in{\rm R}^{(k+1)C\times C}$ is the weight of MLP.
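The weight shapes given for Eqs (1) to (4) allow a rough parameter count of one Symmetric Block. The sketch below is only bookkeeping over those shapes: the helper name `symmetric_block_params` is hypothetical, biases, Layer Normalization parameters and any MSA-internal weights are ignored, and the values $C=32$ and $k=2$ are illustrative assumptions rather than the configuration reported in the paper.

```python
def symmetric_block_params(C, k):
    """Count the weights named in Eqs (1)-(4); biases and LN are ignored."""
    counts = {
        "W_Q, W_K, W_V": 3 * C * (k * C),  # three projections, each C x kC
        "W_LeV": 3 * 3 * (k * C),          # depth-separable 3x3 kernel on kC channels
        "W_Local": 3 * 3 * C,              # depth-separable 3x3 kernel on C channels
        "W_Proj": (k + 1) * C * C,         # fuses kC + C channels back to C
    }
    counts["total"] = sum(counts.values())
    return counts


# Illustrative values only (C and k are assumptions, not reported settings).
for name, n in symmetric_block_params(C=32, k=2).items():
    print(f"{name}: {n}")
# W_Q, W_K, W_V: 6144
# W_LeV: 576
# W_Local: 288
# W_Proj: 3072
# total: 10080
```

Under these assumptions the two depth-separable kernels contribute less than 10% of the weights, so the cost of the block is dominated by the token projections.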

Figure 2.

Symmetric Block. $H$ and $W$ are the dimensions of the feature maps, $C$ is the token length.

3.2 Limitations of FFN

The original FFN (shown in Fig. 3a) was proposed by ViT. It models the channels with two pointwise convolutions (MLP) and connects the input to the output using a residual. Specifically, the first convolution expands the channel capacity $r$ times, while the second convolution models the channels and reduces them $r$ times. [33] argues that the original FFN lacks local spatial modeling capability. Inspired by Inverted Residual Block, [33] introduces convolutional inductive bias by inserting a depth-separable convolution between the two pointwise convolutions of the FFN. However, this improved FFN with two pointwise convolutions is still heavy and redundant.

To address this problem, [20] proposed the Ghost Module, whose structure is shown in Fig. 3b. Ghost Module halves the channel expansion ratio and reduces redundant features by generating more eigenfeatures through depth-separable convolution. Ghost Module can effectively enhance the compactness and local spatial modeling capability of the model. However, [59] argues that the second pointwise convolution is responsible for expressivity and should not be oversimplified, whereas Ghost Module compresses the second pointwise convolution by half. The ideal improved module should be lightweight and compact while maintaining performance.

3.3 Symmetric FFN

Inspired by [20, 59], we propose Symmetric FFN, a lightweight and compact FFN module. The structure of Symmetric FFN is shown in Fig. 3c. We can see that the output channels of the hidden layer of Symmetric FFN are symmetrical, which are $r/2\cdot C$ , $r C$ , $r C$ and $r/2\cdot C$ respectively. Notably, inspired by [32], we add an additional residual to the Symmetric FFN, which connects two hidden layers with output channels of $r/2\cdot C$ . This dual residual structure can effectively improve the gradient flow and smooth the loss surfaces during training.

Symmetric FFN consists of two parts, Channel Expansion and Modeling. Channel Expansion first widens the channel capacity by pointwise convolution to $r/2$ times the input features. To reduce the redundancy of the feature maps generated by pointwise convolution, a subsequent depth-separable convolution models the local spatial structure of the feature maps to generate more eigenfeatures. The output of the pointwise convolution is concatenated with the eigenfeatures generated by the depth-separable convolution, so that the number of output channels is $r$ times the number of input features. To simplify the output equations of Channel Expansion, we define the following three equations: ${\rm{\bf\hat{X}}}=\text{Seq2Img}({\rm{\bf X}}_{in})$ , ${\rm{\bf\hat{X}}}_{PW1}=\delta({\rm{\bf\hat{X}}}\odot{\rm{\bf W}}_{PW1})$ and ${\rm{\bf\hat{X}}}_{DW}=\delta({\rm{\bf\hat{X}}}_{PW1}\odot{\rm{\bf W}}_{DW})$ . ${\rm{\bf X}}_{in}\in{\rm R}^{HW\times C}$ is the input feature map, $H$ , $W$ and $C$ are the height, width and number of channels of the input feature map, respectively, ${\rm{\bf W}}_{PW1}\in{\rm R}^{C\times r/2\cdot C}$ is the kernel of the pointwise convolution, ${\rm{\bf W}}_{DW}\in{\rm R}^{3\times 3\times r/2\cdot C}$ is the kernel of the depth-separable convolution and $\delta$ is the GeLU [23] activation function. The output of Channel Expansion is represented by Eq. (5):

$\displaystyle{\rm{\bf\hat{X}}}_{E}=[{\rm{\bf\hat{X}}}_{PW1},{\rm{\bf\hat{X}}}_{DW}]^{T}$ (5)

Modeling contains a group convolution and a pointwise convolution. Specifically, to further reduce the number of parameters and the computational complexity of the model, the channels of the input feature map are reduced by a 3 $\times$ 3 group convolution. This group convolution divides the $rC$ channels of the input feature map into $r/2\cdot C$ groups, each containing two channels, and reduces the number of channels by a factor of two to $r/2\cdot C$ . The channels of the feature map are then reduced to $C$ by a pointwise convolution.

Figure 3.

Structure of FFN, Ghost Module and Symmetric FFN. c is the number of channels of the current feature map and $C$ is the number of channels of the input feature map.

Notably, the output of Channel Expansion is rearranged by a Channel-Shuffle [62] module in order to maintain the consistency of the input channels of the group convolution. We define the output of the Channel-Shuffle module as ${\rm{\bf\hat{X}}}_{EC}=\textit{ChannelShuffle}({\rm{\bf\hat{X}}}_{E})$ . The channel orders of ${\rm{\bf\hat{X}}}_{E}$ and ${\rm{\bf\hat{X}}}_{EC}$ are shown in Eqs (6) and (7), respectively.

$\displaystyle{\rm{\bf\hat{X}}}_{E}\buildrel\Delta\over{=}[{\rm{\bf\hat{X}}}_{PW1}^{1},{\rm{\bf\hat{X}}}_{PW1}^{2},\ldots,{\rm{\bf\hat{X}}}_{PW1}^{r/2\cdot C},{\rm{\bf\hat{X}}}_{DW}^{1},{\rm{\bf\hat{X}}}_{DW}^{2},\ldots,{\rm{\bf\hat{X}}}_{DW}^{r/2\cdot C}]$ (6)

$\displaystyle{\rm{\bf\hat{X}}}_{EC}\buildrel\Delta\over{=}[{\rm{\bf\hat{X}}}_{PW1}^{1},{\rm{\bf\hat{X}}}_{DW}^{1},{\rm{\bf\hat{X}}}_{PW1}^{2},{\rm{\bf\hat{X}}}_{DW}^{2},\ldots,{\rm{\bf\hat{X}}}_{PW1}^{r/2\cdot C},{\rm{\bf\hat{X}}}_{DW}^{r/2\cdot C}]$ (7)
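The reordering from Eq. (6) to Eq. (7) can be reproduced on a list of channel labels. This is a minimal sketch, assuming the usual channel shuffle formulation (view the channels as a `groups` by `per_group` grid, transpose, flatten); the function name `channel_shuffle` and the toy size of three channels per half are illustrative choices, not taken from the released code.

```python
def channel_shuffle(channels, groups):
    """Interleave `groups` equally sized contiguous blocks of channels."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Equivalent to reshaping to (groups, per_group), transposing, flattening.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


half = 3  # stands in for r/2 * C; kept tiny for readability
x_e = [f"PW1^{i}" for i in range(1, half + 1)] + \
      [f"DW^{i}" for i in range(1, half + 1)]   # channel order of Eq. (6)
x_ec = channel_shuffle(x_e, groups=2)           # channel order of Eq. (7)
print(x_ec)
# ['PW1^1', 'DW^1', 'PW1^2', 'DW^2', 'PW1^3', 'DW^3']
```

After the shuffle, every consecutive pair of channels holds one pointwise channel and its depth-separable counterpart, which is the pairing a two-channel group convolution consumes.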

We can find (as shown by the two dashed lines in Fig. 3c) that the two channels of each group of the group convolution come from ${\rm{\bf\hat{X}}}_{PW1}$ , which mainly contains non-local information and from ${\rm{\bf\hat{X}}}_{DW}$ , which mainly contains local spatial information. Group convolution can combine these two channels with different scale information and reconstruct their local spatial features. The outputs of the group convolution and the depth-separable convolution are summed and form the residual structure. The output of each channel and the overall output of the residual structure are represented by Eqs (8) and (9):

$\displaystyle{\rm{\bf\hat{X}}}_{GC}^{l}={\rm{\bf\hat{X}}}_{DW}^{l}+[{\rm{\bf\hat{X}}}_{PW1}^{l},{\rm{\bf\hat{X}}}_{DW}^{l}]^{T}\odot{\rm{\bf W}}_{GC}^{l}$ (8)

$\displaystyle{\rm{\bf\hat{X}}}_{GC}={\rm{\bf\hat{X}}}_{DW}+{\rm{\bf\hat{X}}}_{EC}\odot{\rm{\bf W}}_{GC}$ (9)

where $l$ indexes the $l$ -th channel of the feature map and ${\rm{\bf W}}_{GC}\in{\rm R}^{3\times 3\times rC\times 2}$ is the kernel of the group convolution. According to Eq. (8), both the residual and the group convolution benefit from the Channel-Shuffle operation: Channel-Shuffle aligns the channels of the depth-separable convolution, the group convolution and the residual structure. Finally, Modeling models the channels with the second pointwise convolution and connects the result to the input of Symmetric FFN through the second residual. Mathematically, the output of Symmetric FFN is represented by Eq. (10):

$\displaystyle{\rm{\bf\hat{X}}}_{\textit{out}}=\text{Img2Seq}({\rm{\bf\hat{X}}}+{\rm{\bf\hat{X}}}_{GC}\odot{\rm{\bf W}}_{PW2})$ (10)

where ${\rm{\bf W}}_{PW2}\in{\rm R}^{r/2\cdot C\times C}$ is the kernel of the second pointwise convolution.
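To make the compactness argument concrete, the weights of the original FFN and of Symmetric FFN can be counted from the layer descriptions above. This is a rough sketch, not the authors' accounting: the function names are hypothetical, biases and normalization are ignored, and the group convolution is counted as $r/2\cdot C$ groups that each map two input channels to one output channel with a 3 $\times$ 3 kernel, which is how Eq. (8) reads. A different reading of the kernel shape of ${\rm{\bf W}}_{GC}$ would change that one term.

```python
def ffn_params(C, r):
    """Original FFN: two pointwise layers, C -> rC -> C (biases ignored)."""
    return C * (r * C) + (r * C) * C


def symmetric_ffn_params(C, r):
    """Symmetric FFN weights under the assumptions stated in the lead-in."""
    hidden = (r * C) // 2
    pw1 = C * hidden           # pointwise: C -> r/2 * C
    dw = 3 * 3 * hidden        # depth-separable 3x3 on r/2 * C channels
    gconv = 3 * 3 * 2 * hidden # assumed: one 3x3x2 kernel per output channel
    pw2 = hidden * C           # pointwise: r/2 * C -> C
    return pw1 + dw + gconv + pw2


C, r = 32, 4  # first-stage token length and expand ratio from the config table
base = ffn_params(C, r)
sym = symmetric_ffn_params(C, r)
print(base, sym, round(sym / base, 3))
# 8192 5824 0.711
```

Under these assumptions the saving comes from the two pointwise layers, whose combined cost drops from $2rC^2$ to $rC^2$; the added 3 $\times$ 3 kernels scale only linearly in $C$, so the relative saving grows in later stages where $C$ is larger.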

4. Experiments

The experiments include three types of vision tasks: image classification, object detection and semantic segmentation. To evaluate the performance of SFormer, we compare SFormer with existing lightweight models.

4.1 Evaluation metrics

We evaluate the model size using #Param and FLOPs. #Param is the number of parameters of the entire model (including the backbone and output layers). FLOPs is the number of floating point operations of the entire model (excluding the loss function). The three vision tasks, image classification, object detection and semantic segmentation, are evaluated using Top1, AP (Average Precision) and mIoU (Mean Intersection over Union), respectively. Top1 is the number of samples with correct predictions (the category with the highest output probability matches the label) divided by the number of all samples. AP is reported in six variants, i.e., AP ${}_{50:5:95}$ , AP ${}_{50}$ , AP ${}_{75}$ , AP ${}_{S}$ , AP ${}_{M}$ and AP ${}_{L}$ . AP ${}_{50}$ and AP ${}_{75}$ are the AP [34] values computed with the IoU (Intersection over Union) threshold between prediction and label set to 0.5 and 0.75, respectively. AP ${}_{50:5:95}$ is the mean of the AP values computed at IoU thresholds from 0.5 to 0.95 with a stride of 0.05. AP ${}_{S}$ , AP ${}_{M}$ and AP ${}_{L}$ are the APs computed on small, medium and large objects, respectively. mIoU is computed at the pixel level: for each class, the intersection between predictions and labels is divided by their union, and the result is averaged over classes.
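As an illustration of the two simplest metrics, the following sketch computes Top1 accuracy and mIoU on toy data. The function names, the flattened-list representation of segmentation maps, and the choice to skip classes that appear in neither prediction nor label are assumptions made for this example; benchmark toolkits may handle such edge cases differently.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


def mean_iou(pred, target, num_classes):
    """Mean over classes of |pred AND target| / |pred OR target| (pixel level)."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:  # assumed: classes absent from both maps are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy data, made up for illustration.
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
labels = [1, 0, 1, 1]
print(top1_accuracy(scores, labels))  # 0.75

pred = [0, 0, 1, 1, 2, 2]    # flattened predicted class per pixel
target = [0, 1, 1, 1, 2, 0]  # flattened ground-truth class per pixel
print(round(mean_iou(pred, target, num_classes=3), 3))  # 0.5
```

In the segmentation example the per-class IoUs are 1/3, 2/3 and 1/2, whose mean is 0.5.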

4.2 Image classification

Settings

The image classification experiments were performed on the ImageNet-1K dataset [21]. It contains 1,000 categories, 1.28 million training images and 50,000 validation images. All models were trained on the training set for a fair comparison. These models were then tested on the validation set and Top1 accuracy was used as the evaluation metric. We followed the data augmentation approach of Swin Transformer, which includes random cropping, random horizontal flipping [46], label smoothing regularization [45], blending and random erasing [63]. During training, we used the AdamW optimizer [37] with a momentum of 0.9, a mini-batch size of 512 and a weight decay of 5 $\times$ 10 ${}^{-2}$ to optimize the model. The initial learning rate was set to 1 $\times$ 10 ${}^{-3}$ and decayed according to the cosine policy. All models were trained from scratch for 300 epochs on 2 Nvidia RTX 3090 GPUs. For evaluation, we center-cropped 224 $\times$ 224 patches from the validation images to measure the classification accuracy.
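The cosine decay policy mentioned above can be sketched as follows. The initial learning rate (1 $\times$ 10 ${}^{-3}$) and the 300 epochs come from the settings; the per-epoch granularity, the absence of a warm-up phase and the minimum learning rate of zero are assumptions of this sketch, since the text does not specify them.

```python
import math


def cosine_lr(epoch, total_epochs=300, base_lr=1e-3, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch.

    Assumptions: per-epoch updates, no warm-up, decay to `min_lr`.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 75, 150, 225, 300):
    print(epoch, f"{cosine_lr(epoch):.6f}")
# 0 0.001000
# 75 0.000854
# 150 0.000500
# 225 0.000146
# 300 0.000000
```

The schedule decays slowly at first, fastest around the midpoint and slowly again near the end of training.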

Results

Table 1 shows the image classification performance. Among the ViT models with fewer than 4M parameters, SFormer has the same number of parameters as Swin and QuadTree-A-b0 but achieves the highest Top1 accuracy (1.6% higher than Swin, 1.2% higher than QuadTree-A-b0, 0.1% higher than QuadTree-B-b0 and 1.6% higher than PVTv2-B0). It is worth noting that SFormer also holds an advantage over the CNN model: it has 0.1M fewer parameters than MobileNetv2 but gains a 0.2% improvement in Top1. To validate our improvements on the attention module, we further compare SFormer-FFN with other Transformer models with a similar number of parameters. As shown in the bottom part of Table 1, SFormer-FFN has the lowest number of parameters and computation but the highest Top1 accuracy compared to the other models.

4.3 Object detection

Settings

In this paper, object detection experiments are performed on the COCO and Pascal VOC datasets. For the COCO dataset, all models were trained on COCO train 2017 (118k images) and evaluated on COCO val 2017 (5k images). In the training phase, we randomly resized the short side of the input image within the range [480, 800], with the long side fixed at 1333. During the test phase, the resolution of the input image was fixed at 800 $\times$ 1333. For the Pascal VOC dataset, all models were trained on VOC Train 0712 and evaluated on VOC Val 07. The resolution of the input image was fixed at 600 $\times$ 1000. We used mmdetection [2] as the codebase. We initialized the SFormer backbone with the weights pre-trained on ImageNet. We used the same settings as Swin Transformer [35] and trained the model for 12 epochs using the AdamW optimizer with a batch size of 16 and an initial learning rate of 1 $\times$ 10 ${}^{-4}$ . AP (average precision) was used to evaluate our method.

Table 1
Image classification performance on the ImageNet validation set. “#P” refers to millions of parameters, “#G” refers to GFLOPs (Giga Floating Point Operations), calculated with an input scale of 224 $\times$ 224, and “Swin” refers to the Swin Transformer configured according to Table 2. SFormer-FFN refers to the SFormer using the FFN Module

Method	#P	#G	Top1 (%)
Swin	3.4	0.4	70.5
SFormer (our)	3.4	0.5	72.1
QuadTree-A-b0	3.4	0.6	70.9
MobileNetv2	3.5	0.3	71.9
QuadTree-B-b0	3.5	0.7	72
PVTv2-B0	3.7	0.6	70.5
SFormer-FFN (our)	4.3	0.6	73.3
T2T-ViT-7 [6]	4.3	1.1	71.7
PiT-Ti [24]	4.9	0.7	73
DeiT-Tiny/16 [48]	5.7	1.3	72.2

Table 2

Configurations of SFormer and compressed Swin Transformer [47]. “#C” is the embedding length of Patch Embedding, “#Blocks” is the number of blocks for each “stage”, “#heads” is the heads number of MSA for each “stage”, r is the expand ratio of FFN and Symmetric FFN

Models	#C	#Blocks	#heads	r
SFormer	32	2,2,6,2	2,2,5,8	4
Swin Transformer	32	2,2,6,2	2,2,5,8	4

To analyze the existence of statistically significant differences in the essential performance metrics obtained by the different models, statistical analyses were performed in this paper. We performed one-way analysis of variance (ANOVA) and Tukey post-hoc tests. Since the COCO dataset and Pascal VOC dataset have been divided into training set, validation set and test set, we trained and validated the model five times, and used the results of those five experiments for statistical analysis. For COCO dataset, we used the AP ${}_{50}$ metric for the analysis, and for Pascal VOC dataset, we used the mAP metric for the analysis.
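For readers unfamiliar with the test, the $F$ statistic of a one-way ANOVA can be computed directly from its definition (between-group mean square divided by within-group mean square). The sketch below uses made-up numbers chosen for easy arithmetic, not the paper's measurements, and the function name `one_way_anova_f` is hypothetical. It does not compute the $P$-value or the Tukey post-hoc test, for which a statistics package would normally be used.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Made-up scores for three hypothetical models, three runs each.
runs = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_b, df_w = one_way_anova_f(runs)
print(round(f_value, 3), df_b, df_w)  # 13.0 2 6
```

A large $F$ value means the differences between group means are large relative to the run-to-run variation within each group, which is the situation the repeated runs are meant to detect.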

Results

We mainly compared our method with PVTv2 and Swin Transformer on the RetinaNet [8] and Mask RCNN [22] detection frameworks. Table 3 shows the results of COCO Val 2017 object detection. The overall trend is that Swin has the weakest performance and PVTv2-B0 has the strongest. Specifically, with the same number of parameters, SFormer achieves higher performance in both the RetinaNet and Mask RCNN detection frameworks than Swin Transformer, which uses the same attention mechanism. This result demonstrates that our two improved modules effectively enhance the modeling and representation capabilities of the model. Compared to PVTv2-B0, SFormer has 0.3M fewer parameters but still maintains competitive performance.

Table 3

Results on the COCO dataset. Each metric is the mean of 5 experiments. “#G” refers to GFLOPs (Giga Floating Point Operations), calculated with an input scale of 800 $\times$ 1333

Framework	Backbone	#P	#G	AP ${}_{50:5:95}$	AP ${}_{50}$	AP ${}_{75}$	AP ${}_{S}$	AP ${}_{M}$	AP ${}_{L}$
RetinaNet	Swin	12.7	168.6	35.2	54.3	37.4	23.0	38.2	45.9
RetinaNet	SFormer (our)	12.7	169.7	36.2	56.3	38.6	23.3	39.4	47.5
RetinaNet	PVTv2-B0	13.0	186.1	37.1	57.1	39.5	23.1	40.4	49.7
Mask RCNN	Swin	23.2	171.5	36.8	58.9	39.0	23.1	39.4	47.2
Mask RCNN	SFormer (our)	23.2	172.6	37.5	59.5	39.4	23.4	39.9	48.1
Mask RCNN	PVTv2-B0	23.5	188.7	38.2	60.5	40.7	22.9	40.9	49.6

We argue that the window-based self-attention mechanism used by SFormer limits the information exchange between sliding windows, which affects the fine-grained detection of large objects and of objects that span several windows. Because of this restricted information exchange, the model struggles to precisely predict object boundaries that lie outside the current window. In Table 3, we can see that the AP ${}_{S}$ (small objects) of SFormer exceeds that of PVTv2-B0 on both the RetinaNet and Mask RCNN frameworks, but the AP ${}_{50}$ , AP ${}_{M}$ and AP ${}_{L}$ of SFormer are lower than those of PVTv2-B0.
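
The size-dependent pattern described above can be read directly off Table 3. The short sketch below copies the relevant columns and prints SFormer minus PVTv2-B0 for each metric; the dictionary layout and the helper name `gaps` are illustrative choices, not part of the paper's code.

```python
# Values copied from Table 3 (COCO val 2017); keys are (framework, backbone).
TABLE3 = {
    ("RetinaNet", "SFormer"): {"AP50": 56.3, "APS": 23.3, "APM": 39.4, "APL": 47.5},
    ("RetinaNet", "PVTv2-B0"): {"AP50": 57.1, "APS": 23.1, "APM": 40.4, "APL": 49.7},
    ("Mask RCNN", "SFormer"): {"AP50": 59.5, "APS": 23.4, "APM": 39.9, "APL": 48.1},
    ("Mask RCNN", "PVTv2-B0"): {"AP50": 60.5, "APS": 22.9, "APM": 40.9, "APL": 49.6},
}


def gaps(framework):
    """Return SFormer minus PVTv2-B0 for each metric, rounded to one decimal."""
    ours = TABLE3[(framework, "SFormer")]
    ref = TABLE3[(framework, "PVTv2-B0")]
    return {metric: round(ours[metric] - ref[metric], 1) for metric in ours}


for fw in ("RetinaNet", "Mask RCNN"):
    print(fw, gaps(fw))
```

Only the small-object gap is positive under both frameworks, and the deficit is largest for large objects under RetinaNet.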

ANOVA results for the COCO dataset are shown in Table 4. For the AP ${}_{50}$ metric on the COCO dataset, statistically significant differences are seen between the models ( $P=$ 1 $\times$ 10 ${}^{-20}$ ). Post hoc comparisons using the Tukey HSD test show significant differences between models that use the same framework with different backbone networks. Notably, the $P$ -value between SFormer ${}^{+M}$ and Swin ${}^{+M}$ is larger ( $P=$ 0.031), possibly because their performance is close, but it is still below 0.05, so the null hypothesis of equal means is rejected. Besides, comparisons between models using the same backbone network under different frameworks also show significant differences.

Table 4

ANOVA results of COCO dataset. “ $+$ R” represents that the model adopts the RetinaNet framework. “ $+$ M” represents that the model adopts the Mask RCNN framework

Method	Mean (AP ${}_{50}$ )	SD
Swin ${}^{+R}$	54.299	0.313
SFormer ${}^{+R}$	56.276	0.179
PVTv2-B0 ${}^{+R}$	57.144	0.267
Swin ${}^{+M}$	58.906	0.364
SFormer ${}^{+M}$	59.548	0.263
PVTv2-B0 ${}^{+M}$	60.509	0.213
ANOVA $F$ -value	284.607 (1 $\times$ 10 ${}^{-20}$ )
Tukey HSD group comparisons	Mean difference	$P$ -value
SFormer ${}^{+R}$ VS Swin ${}^{+R}$	$-$ 1.977	0.001
PVTv2-B0 ${}^{+R}$ VS Swin ${}^{+R}$	$-$ 2.844	0.001
PVTv2-B0 ${}^{+R}$ VS SFormer ${}^{+R}$	$-$ 0.868	0.002
SFormer ${}^{+M}$ VS Swin ${}^{+M}$	$-$ 0.642	0.031
PVTv2-B0 ${}^{+M}$ VS Swin ${}^{+M}$	$-$ 1.603	0.001
PVTv2-B0 ${}^{+M}$ VS SFormer ${}^{+M}$	$-$ 0.961	0.001
Swin ${}^{+M}$ VS Swin ${}^{+R}$	$-$ 4.607	0.001
PVTv2-B0 ${}^{+M}$ VS PVTv2-B0 ${}^{+R}$	$-$ 3.365	0.001
SFormer ${}^{+M}$ VS SFormer ${}^{+R}$	$-$ 3.272	0.001
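
As a consistency check on Table 4, the $F$ -value can be approximately rebuilt from the reported means and SDs alone. The sketch below assumes equal group sizes of five runs (as stated in the text) and that the SD column is a population standard deviation (divisor $n$ rather than $n-1$). The second point is an assumption, not something the paper states; it is chosen because it brings the result close to the reported 284.607. The function name `f_from_summary` is hypothetical.

```python
# (mean AP50, SD) per model, copied from Table 4.
TABLE4 = {
    "Swin+R": (54.299, 0.313),
    "SFormer+R": (56.276, 0.179),
    "PVTv2-B0+R": (57.144, 0.267),
    "Swin+M": (58.906, 0.364),
    "SFormer+M": (59.548, 0.263),
    "PVTv2-B0+M": (60.509, 0.213),
}
N_RUNS = 5  # five training/validation runs per model


def f_from_summary(groups, n, population_sd=True):
    """One-way ANOVA F statistic from per-group means and SDs (equal group size n)."""
    k = len(groups)
    means = [mean for mean, _ in groups.values()]
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    # Squared deviations inside one group: n*SD^2 for a population SD,
    # (n-1)*SD^2 for a sample SD.
    scale = n if population_sd else n - 1
    ss_within = sum(scale * sd ** 2 for _, sd in groups.values())
    ms_within = ss_within / (k * (n - 1))
    return ms_between / ms_within


f_value = f_from_summary(TABLE4, N_RUNS)
print(f"reconstructed F = {f_value:.1f} (Table 4 reports 284.607)")
```

Because the table entries are rounded to three decimals, the reconstruction is not expected to match exactly. The same function can be applied to the other ANOVA tables by swapping the dictionary.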

The results of Pascal VOC dataset are shown in Table 5. In general, the relative performance of the three models is consistent with that on COCO. SFormer and Swin have the lowest number of parameters, and the FLOPs of Swin are slightly lower. PVTv2-B0 has the largest number of parameters and the highest FLOPs. SFormer outperforms Swin by 3.3 and 2.8 mAP points on the RetinaNet and Mask RCNN frameworks, respectively (76.2 vs. 72.9 and 78.6 vs. 75.8 in Table 5), while its performance remains lower than that of PVTv2-B0.

Table 6 presents the results of ANOVA for Pascal VOC dataset. As with the ANOVA results for COCO dataset, the intra-group variance is much smaller than the inter-group variance, and the difference is statistically significant ( $P=$ 1 $\times$ 10 ${}^{-16}$ ). The Tukey HSD test shows significant differences both between models that share a framework and between models that share a backbone.

Table 5

Results of Pascal VOC dataset. “#G” was calculated with an input scale of 600 $\times$ 1000

Framework	Backbone	#P	#G	mAP
RetinaNet	Swin	11.5	79.5	72.9
RetinaNet	SFormer (our)	11.5	80.1	76.2
RetinaNet	PVTv2-B0	11.8	85.0	78.5
Mask RCNN	Swin	22.9	118.6	75.8
Mask RCNN	SFormer (our)	22.9	120.7	78.6
Mask RCNN	PVTv2-B0	23.2	125.6	79.7

Table 6

ANOVA results of Pascal VOC dataset

Method	Mean (mAP)	SD
Swin ${}^{+R}$	72.846	0.206
SFormer ${}^{+R}$	76.197	0.498
PVTv2-B0 ${}^{+R}$	78.456	0.433
Swin ${}^{+M}$	75.786	0.643
SFormer ${}^{+M}$	78.598	0.442
PVTv2-B0 ${}^{+M}$	79.724	0.289
ANOVA $F$ -value	128.765 (1 $\times$ 10 ${}^{-16}$ )
Tukey HSD group comparisons	Mean difference	$P$ -value
SFormer ${}^{+R}$ VS Swin ${}^{+R}$	$-$ 3.352	0.001
PVTv2-B0 ${}^{+R}$ VS Swin ${}^{+R}$	$-$ 5.610	0.001
PVTv2-B0 ${}^{+R}$ VS SFormer ${}^{+R}$	$-$ 2.259	0.001
SFormer ${}^{+M}$ VS Swin ${}^{+M}$	$-$ 2.811	0.001
PVTv2-B0 ${}^{+M}$ VS Swin ${}^{+M}$	$-$ 3.938	0.001
PVTv2-B0 ${}^{+M}$ VS SFormer ${}^{+M}$	$-$ 1.127	0.016
Swin ${}^{+M}$ VS Swin ${}^{+R}$	$-$ 2.941	0.001
PVTv2-B0 ${}^{+M}$ VS PVTv2-B0 ${}^{+R}$	$-$ 1.268	0.006
SFormer ${}^{+M}$ VS SFormer ${}^{+R}$	$-$ 2.400	0.001

4.4 Semantic segmentation

Settings

We chose the ADE20K and cityscapes datasets to measure the performance of semantic segmentation. We used the Semantic FPN [29] framework and the mmsegmentation [1] codebase. All models were trained under this framework for fair comparison. In the training phase, the encoder was initialized with the weights pre-trained on ImageNet and the newly added layers were initialized with Xavier [18]. We optimized our model with AdamW with an initial learning rate of 1 $\times$ 10 ${}^{-4}$ . We trained our model on 2 Nvidia RTX 3090 GPUs in mini-batches of 16 for 80k iterations. The learning rate was decayed according to a polynomial decay schedule with a power of 0.9. For ADE20K dataset, we randomly resized and cropped the image to 512 $\times$ 512 for training and rescaled it to have a short edge of 512 pixels during testing. For cityscapes dataset, we randomly resized and cropped the image to 512 $\times$ 1024 for training and rescaled it to have a short edge of 512 pixels during testing. Since the ADE20K and cityscapes datasets are already divided into training, validation and test sets, we trained and validated each model five times and used the results of the five runs for one-way ANOVA [42] and Tukey post hoc tests [3].
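
The learning-rate schedule described above can be sketched in a few lines. The defaults follow the stated settings (initial rate 1e-4, 80k iterations, power 0.9); the floor `min_lr=0.0` and the function name `poly_lr` are assumptions made for illustration, since the text does not give a minimum learning rate.

```python
def poly_lr(step, base_lr=1e-4, max_steps=80_000, power=0.9, min_lr=0.0):
    """Polynomial decay: (base_lr - min_lr) * (1 - step/max_steps)**power + min_lr."""
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    remaining = 1.0 - step / max_steps  # fraction of training still to run
    return (base_lr - min_lr) * remaining ** power + min_lr


# Print the learning rate at a few points of the 80k-iteration schedule.
for step in (0, 20_000, 40_000, 60_000, 80_000):
    print(f"iteration {step:>6}: lr = {poly_lr(step):.3e}")
```

With a power below 1 the rate decays slightly slower than linearly for most of training and drops to the floor only at the final iteration.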

Results

The results of ADE20K dataset are shown in Table 7. SFormer shows an advantage in semantic segmentation over the other two models. In particular, SFormer improves mIoU by 0.7% over Swin Transformer, which uses the same attention mechanism. We attribute this improvement to the introduction of convolution in the FFN module and the enhancement of the attention module. Unlike in the object detection task, SFormer also provides a 0.2% improvement in mIoU over PVTv2-B0 on this task. We argue that sliding windows affect the semantic segmentation task and the object detection task differently. Semantic segmentation aims to predict the semantics of each pixel, while object detection aims to predict the boundaries of each object. In object detection, the model cannot predict the boundaries and center of a cross-window object from the features within a single window. In semantic segmentation, the semantics (class) of each pixel within a sliding window remains predictable even if the feature information is limited by the window.

Table 7

Results of ADE20K semantic segmentation. “#G” was calculated with an input scale of 512 $\times$ 512

Method	Encoder	#P	#G	mIoU (%)
Semantic FPN	Swin	6.8	24.0	36.7
Semantic FPN	SFormer (our)	6.8	24.3	37.4
Semantic FPN	PVTv2-B0	7.6	25.0	37.2

The results of ANOVA for ADE20K dataset are presented in Table 8. Overall, although the performance differences between the models are quite small, they are statistically significant because the models are stable (SD $<$ 0.2). The $F$ -value is high ( $F=$ 20.971) and the difference is significant ( $P<$ 0.05). A further analysis with the Tukey HSD test shows significant differences between every pair of models: the $P$ -values for all comparisons are below 0.05. The $P$ -value for the comparison of PVTv2-B0 and SFormer is relatively high because the mIoU of the two models is similar.

Table 8

ANOVA results of ADE20K semantic segmentation. “ $+$ S” represents that the model adopts the Semantic FPN framework

Method	Mean (mIoU)	SD
Swin ${}^{+S}$	36.638	0.142
SFormer ${}^{+S}$	37.264	0.184
PVTv2-B0 ${}^{+S}$	36.987	0.046
ANOVA $F$ -value	20.971 (1 $\times$ 10 ${}^{-5}$ )
Tukey HSD group comparisons	Mean difference	$P$ -value
SFormer ${}^{+S}$ VS Swin ${}^{+S}$	$-$ 0.626	0.001
PVTv2-B0 ${}^{+S}$ VS Swin ${}^{+S}$	$-$ 0.349	0.009
PVTv2-B0 ${}^{+S}$ VS SFormer ${}^{+S}$	0.277	0.036

The results of cityscapes dataset are shown in Table 9. In general, the ranking of the three models is the same as on the ADE20K dataset, but the performance gaps are larger. SFormer shares the lowest number of parameters with Swin and achieves the strongest performance. PVTv2-B0 has the highest number of parameters and the highest FLOPs, yet its performance is weaker than that of SFormer, while Swin has the weakest performance.

Table 9

Results of cityscapes semantic segmentation. “#G” was calculated with an input scale of 512 $\times$ 1024

Method	Encoder	#P	#G	mIoU (%)
Semantic FPN	Swin	7.3	47.9	68.7
Semantic FPN	SFormer (our)	7.3	48.6	72.9
Semantic FPN	PVTv2-B0	7.9	52.2	70.4

Table 10 lists ANOVA results for the cityscapes dataset. Compared to ADE20K dataset, the three models are markedly less stable on the cityscapes dataset and their SD increases substantially: the SD of Swin, SFormer and PVTv2-B0 increases by 0.638, 0.157 and 1.036, respectively. ANOVA shows that the differences between the models also increase and are statistically significant. The Tukey HSD test leads to the same conclusion: the absolute mean differences between the models are all larger than 1.5 and the $P$ -values are all less than 0.05.
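
The SD increases quoted above follow from the SD columns of Table 8 and Table 10. A minimal check, with the values copied from those tables and a hypothetical helper name `sd_increase`:

```python
# Standard deviation of mIoU over five runs, copied from Table 8 and Table 10.
SD_ADE20K = {"Swin": 0.142, "SFormer": 0.184, "PVTv2-B0": 0.046}
SD_CITYSCAPES = {"Swin": 0.780, "SFormer": 0.341, "PVTv2-B0": 1.082}


def sd_increase(before, after):
    """Per-model change in SD, rounded to the three decimals used in the tables."""
    return {name: round(after[name] - before[name], 3) for name in before}


print(sd_increase(SD_ADE20K, SD_CITYSCAPES))
```

SFormer shows the smallest increase of the three models, which is consistent with it being the most stable model on the cityscapes dataset.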

Table 10

ANOVA results of cityscapes semantic segmentation

Method	Mean (mIoU)	SD
Swin ${}^{+S}$	68.457	0.780
SFormer ${}^{+S}$	73.033	0.341
PVTv2-B0 ${}^{+S}$	70.289	1.082
ANOVA $F$ -value	33.519 (1 $\times$ 10 ${}^{-6}$ )
Tukey HSD group comparisons	Mean difference	$P$ -value
SFormer ${}^{+S}$ VS Swin ${}^{+S}$	$-$ 4.576	0.001
PVTv2-B0 ${}^{+S}$ VS Swin ${}^{+S}$	$-$ 1.830	0.017
PVTv2-B0 ${}^{+S}$ VS SFormer ${}^{+S}$	2.745	0.001

5. Discussion

Although SFormer achieves the best performance on the classification and semantic segmentation tasks compared to models of the same size, it performs worse than PVTv2 on the object detection task. From Table 3, we can see that SFormer is stronger than PVTv2 for small objects, but weaker for medium and large objects. We believe that the main reason is that the window-based self-attention mechanism used by SFormer limits the information exchange between sliding windows, which affects the fine-grained detection of large objects and of objects that span several windows. Because of this restricted information exchange, it is difficult for the model to precisely predict object boundaries that lie outside the current window.
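
The geometric side of this argument can be illustrated with a toy calculation: the larger an object is on the feature map, the more non-overlapping attention windows it is split across. The sketch below is purely illustrative. The window size of 7 is the default of the original Swin Transformer and is assumed here, since this section does not state the window size used by SFormer; the function name `windows_touched` and the example boxes are hypothetical.

```python
import math


def windows_touched(box, window=7):
    """Count the non-overlapping windows that a box overlaps.

    box is (x0, y0, x1, y1) in feature-map cells, half-open on the right and bottom.
    """
    x0, y0, x1, y1 = box
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must have positive width and height")
    cols = math.ceil(x1 / window) - x0 // window
    rows = math.ceil(y1 / window) - y0 // window
    return rows * cols


small_inside = (1, 1, 4, 4)     # small object inside one window
small_on_edge = (5, 5, 9, 9)    # small object straddling a window corner
large = (0, 0, 28, 21)          # large object covering many windows

for name, box in [("small_inside", small_inside),
                  ("small_on_edge", small_on_edge),
                  ("large", large)]:
    print(name, windows_touched(box))
```

A small object usually falls into one window or a handful of windows, whereas a large object is always divided across many, so its boundary information has to be assembled from several attention windows.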

We believe that the following approaches may address this problem: (1) Replacing some of the attention mechanisms, for instance, alternating between window-based attention and spatial-reduction attention; (2) Adjusting the number of attention modules in each stage. Window-based attention only starts to slide windows and exchange spatial information at the second attention module of each block, so adjusting the number of attention modules per stage and optimizing the model structure may improve performance; (3) Adopting spatial convolution to model the input and output of the attention mechanism, which introduces more convolutional inductive bias and achieves spatial information exchange through local spatial modeling.
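
To make approach (1) concrete, the sketch below lays out one possible assignment of attention types to blocks, cycling between the two kinds within each stage. It is a hypothetical layout for discussion only and has not been evaluated in this paper; the per-stage depths are the "#Blocks" values from the configuration table, while the names `attention_plan`, `"window"` and `"spatial_reduction"` are made up for the example.

```python
DEPTHS = (2, 2, 6, 2)  # number of blocks per stage, from the "#Blocks" column


def attention_plan(depths, kinds=("window", "spatial_reduction")):
    """Assign an attention type to every block, cycling through kinds in each stage."""
    return [[kinds[i % len(kinds)] for i in range(depth)] for depth in depths]


for stage, plan in enumerate(attention_plan(DEPTHS), start=1):
    print(f"stage {stage}: {plan}")
```

Changing `DEPTHS` or the order of `kinds` gives the variants implied by approach (2), for example starting each stage with the global, spatial-reduction attention.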

6. Conclusion

We investigate three shortcomings of lightweight Transformers, i.e., weak feature extraction ability, feature redundancy and lack of convolutional inductive bias, and design two modules to address them: Symmetric Block and Symmetric FFN. Symmetric Block mainly enhances the modeling ability of the model, while Symmetric FFN mainly increases the compactness of the model and introduces inductive bias. Extensive experiments on different tasks, i.e., image classification and semantic segmentation, demonstrate that the proposed SFormer is stronger than PVTv2 and Swin Transformer with a comparable number of parameters. Since SFormer performs weaker than PVTv2 on the object detection task, we analyze the reasons for this phenomenon in the Discussion and suggest possible solutions for future work. We hope that the proposed SFormer can be deployed on edge computing devices and speed up transformer-based vision tasks in the future.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (NSFC) (grant no. U21A6003), the Program of Promoting the Development of University-Diligence Talents (grant no. 5112111145).

References

1. open-mmlab, https://github.com/open-mmlab/mmsegmentation, 2022.

2. open-mmlab, https://github.com/open-mmlab/mmdetection, 2022.

3. Abdi and Williams, Tukey’s honestly significant difference (HSD) test, Encyclopedia of Research Design 3 (2010), 1–5.

4. Arnab, Dehghani, Heigold, Sun, Lučić and Schmid, Vivit: A video vision transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846.

5. Bai, Zhao and Huang, A Near Sensor Edge Computing System for Point Cloud Semantic Segmentation, in: 2022 IEEE International Symposium on Circuits and Systems (ISCAS), 2022, pp. 1818–1822.

6. Beltagy, Peters M.E. and Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:05150, 2020, PP.

7. Chen, Dai, Chen, Liu, Dong, Yuan and Liu, Mobile-former: Bridging mobilenet and transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270–5279.

8. Cheng and Yu, RetinaNet With Difference Channel Attention and Adaptively Spatial Feature Fusion for Steel Surface Defect Detection, IEEE Transactions on Instrumentation and Measurement 70 (2021), 1–11.

9. Choromanski, Likhosherstov, Dohan, Song, Gane, Sarlos, Hawkins, Davis, Mohiuddin and Kaiser, Rethinking attention with performers, arXiv preprint arXiv:14794, 2020, PP.

10. Chu, Tian, Zhang, Wang, Wei, Xia and Shen, Conditional positional encodings for vision transformers, arXiv preprint arXiv:10882, 2021.

11. Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth and Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.

12. Dai, Liu, Le Q.V. and Tan, Coatnet: Marrying convolution and attention for all data sizes, Advances in Neural Information Processing Systems 34 (2021), 3965–3977.

13. Deng, Dong, Socher, Li L.-J. and Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

14. Ding, Zhang, Han, Ding and Sun, Repvgg: Making vgg-style convnets great again, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733–13742.

15. Dong, Bao, Chen, Zhang, Yuan, Chen and Guo, Cswin transformer: A general vision transformer backbone with cross-shaped windows, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12124–12134.

16. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold and Gelly, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:11929, 2020, PP.

17. Fan, Xiong, Mangalam, Yan, Malik and Feichtenhofer, Multiscale vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6824–6835.

18. Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

19. Graham, El-Nouby, Touvron, Stock, Joulin, Jégou and Douze, Levit: a vision transformer in convnet’s clothing for faster inference, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12259–12269.

20. Han, Wang, Tian, Guo and Xu, Ghostnet: More features from cheap operations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1580–1589.

21. Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

22. Gkioxari, Dollár and Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.

23. Hendrycks and Gimpel, Gaussian error linear units (gelus), arXiv preprint arXiv:08415, 2016, PP.

24. Heo, Yun, Han, Chun, Choe and Oh S.J., Rethinking spatial dimensions of vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11936–11945.

25. Howard A.G., Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto and Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:04861, 2017, PP.

26. Huang, Ben, Luo, Cheng and Fu, Shuffle transformer: Rethinking spatial shuffle for vision transformer, arXiv preprint arXiv:03650, 2021, PP.

27. Katharopoulos, Vyas, Pappas and Fleuret, Transformers are rnns: Fast autoregressive transformers with linear attention, in: International Conference on Machine Learning, 2020, pp. 5156–5165.

28. Khan, Naseer, Hayat, Zamir S.W., Khan F.S. and Shah, Transformers in vision: A survey, ACM Computing Surveys 54 (2022), 1–41.

29. Kirillov, Girshick and Dollár, Panoptic feature pyramid networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.

30. Krizhevsky, Sutskever and Hinton, Imagenet classification with deep convolutional neural networks, Communications of the ACM 60 (2017), 84–90.

31. Lee, Kim, Kosiorek, Choi and Teh Y.W., Set transformer: A framework for attention-based permutation-invariant neural networks, in: International Conference on Machine Learning, 2019, pp. 3744–3753.

32. Taylor, Studer and Goldstein, Visualizing the loss landscape of neural nets, Advances in neural information processing systems, 2018, 31.

33. Zhang, Cao, Timofte and Van Gool, Localvit: Bringing locality to vision transformers, arXiv preprint arXiv:05707, 2021, PP.

34. Lin T.-Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick C.L., Microsoft coco: Common objects in context, in: Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, 2014, pp. 740–755.

35. Liu, Lin, Cao, Wei, Zhang, Lin and Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.

36. Liu, Lin, Yao, Xie, Wei, Ning, Cao, Zhang and Dong, Swin transformer v2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009–12019.

37. Loshchilov and Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:05101, 2017, PP.

38. Mehta and Rastegari, Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer, arXiv preprint arXiv:02178, 2021, PP.

39. Sandino, Maire, Caccetta, Sanderson and Gonzalez, Drone-based autonomous motion planning system for outdoor environments under object detection uncertainty, Remote Sensing 13 (2021), 4481.

40. Sandler, Howard, Zhu, Zhmoginov and Chen L.-C., Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.

41. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:11929, 2014, PP.

42. Ståhle and Wold, Analysis of variance (ANOVA), Chemometrics and Intelligent Laboratory Systems 6 (1989), 259–272.

43. Strudel, Garcia, Laptev and Schmid, Segmenter: Transformer for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7262–7272.

44. Sun, Cao, Yang and Kitani K.M., Rethinking transformer-based set prediction for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3611–3620.

45. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

46. Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.

47. Tang, Zhang, Zhu and Tan, Quadtree attention for vision transformers, arXiv preprint arXiv:02767, 2022, PP.

48. Touvron, Cord, Douze, Massa, Sablayrolles and Jégou, Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning, 2021, pp. 10347–10357.

49. Wang, Li B.Z., Khabsa, Fang and Ma, Linformer: Self-attention with linear complexity, arXiv preprint arXiv:04768, 2020, PP.

50. Wang, Gao, Sun and Hu, A Closer Look at Self-supervised Lightweight Vision Transformers, arXiv preprint arXiv:14443, 2022, PP.

51. Wang, Xie, Fan D.-P., Song, Liang, Luo and Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.

52. Wang, Xie, Fan D.-P., Song, Liang, Luo and Shao, PVT v2: Improved baselines with Pyramid Vision Transformer, Computational Visual Media 8 (2022), 415–424.

53. Wang, Girshick, Gupta and He, Non-local neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.

54. Xiao, Codella, Liu, Dai, Yuan and Zhang, Cvt: Introducing convolutions to vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22–31.

55. Xiao, Singh, Mintun, Darrell, Dollár and Girshick, Early convolutions help transformers see better, Advances in Neural Information Processing Systems 34 (2021), 30392–30400.

56. Xie, Girshick, Dollár and He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.

57. Yang, Qiao, Kortylewski and Yuille, Locally Enhanced Self-Attention: Combining Self-Attention and Convolution as Local and Context Terms, arXiv preprint arXiv:05637, 2021.

58. Yang, Wang, Zhang, Wei, Lin and Yuille, Lite vision transformer with enhanced self-attention, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11998–12008.

59. Yang, Shen and Zhao, AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2339–2348.

60. Yuan, Chen, Wang, Shi, Jiang Z.-H., Tay F.E., Feng and Yan, Tokens-to-token vit: Training vision transformers from scratch on imagenet, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 558–567.

61. Zaheer, Guruganesh, Dubey K.A., Ainslie, Alberti, Ontanon, Pham, Ravula, Wang and Yang, Big bird: Transformers for longer sequences, in: Advances in Neural Information Processing Systems, 2020, pp. 17283–17297.

62. Zhang, Zhou, Lin and Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.

63. Zhong, Zheng, Kang and Yang, Random erasing data augmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 13001–13008.

64. Zhou, Zhao, Puig, Fidler, Barriuso and Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
