Cross-scale sampling transformer for semantic image segmentation

Abstract

In increasingly complex scenes, multi-scale information fusion becomes ever more critical for semantic image segmentation. Various methods have been proposed to model multi-scale information, such as local-to-global fusion, but they fall short as scenes grow more varied and image resolution increases. This paper proposes the Cross-Scale Sampling Transformer (CSSFormer). Unlike previous methods, it sparsely samples the features of every scale at one time and fuses them with all other features. Specifically, the Channel Information Augmentation module first enhances the query features, highlighting the part of the response that corresponds to the sampling points. Next, the Multi-Scale Feature Enhancement module performs a one-time fusion of full-scale features, so that each feature obtains information from the features of the other scales. In addition, the Cross-Scale Fusion module performs cross-scale fusion of the query feature and the full-scale feature. Together, these three modules constitute CSSFormer. We evaluate CSSFormer on four challenging semantic segmentation benchmarks, PASCAL Context, ADE20K, COCO-Stuff 10K, and Cityscapes, achieving 59.95%, 55.48%, 50.92%, and 84.72% mIoU, respectively, and outperforming the state of the art.

Keywords

Multi-scale fusion, Segmentation, Transformer

1 Introduction

Image semantic segmentation is a challenging task with many applications in reality, e.g., human-computer interaction [20], augmented reality [1], and driverless technology [18]. It is mainly used for the per-pixel classification of images. Since Long et al. developed Fully Convolutional Networks [42], semantic segmentation has attracted more and more attention.

However, the per-pixel classification of natural images usually faces three difficulties: (1) An image contains many categories, so similar categories are easily confused and misclassified. For example, ADE20K [67] has 150 categories, and COCO-Stuff 10K [3] has 182. This requires the model to have global modeling capability. (2) Boundary segmentation is often rough. At high resolution, boundary regions occupy very few pixels and are not easily segmented. This requires the model to handle boundaries accurately [2, 50]. (3) Large and small objects often appear in the same image. This requires the model to be capable of multi-scale modeling.

In the shallow layers of the network, a feature map usually contains more detailed information. It can distinguish small objects well, but large objects are distinguished poorly because of the limited receptive field. In the deep layers, small objects are difficult to distinguish because the feature map has been downsampled many times; however, the feature map usually contains rich semantic information and distinguishes large objects well. Therefore, many works [8, 69] explore integrating multi-scale features and modeling multi-scale context information. The strong ability of CNNs to extract local information is used for local modeling of a feature map at a specific scale and for fine processing of boundaries.

As shown in Fig. 1(a), the red pixel merges all the information in a small window, but it is limited by the receptive field and fails to solve problems 1 and 3. The DeepLab series [5, 8] uses dilated convolution to enlarge the receptive field, which helps with problem 1. As shown in Fig. 1(b), the red pixel integrates some of the surrounding pixels and expands the receptive field, thus improving the efficiency of calculation. The emergence of Non-local Networks [51] makes global modeling a reality. As shown in Fig. 1(c), the red pixel integrates the information of the whole feature map. However, this is unfriendly to problem 2 and increases the consumption of computing resources. Currently, many works explore how to fuse multi-scale features and demonstrate that modeling multi-scale context is very beneficial for semantic segmentation. Panoptic Feature Pyramid Networks (Semantic-FPN) [29] is a typical representative, which adopts a top-down pathway to gradually inject strong semantics into the bottom layers. As shown in Fig. 1(d), the red pixel is gradually fused from the features in the window of the previous scale but is limited by a small receptive field. In this way, the rich details of the shallow layers cannot be transferred to the deep features in a bottom-up manner (Gated Fully Fusion [32] is a particular case). The scheme in Fig. 1(e) is currently a strong solution to problems 1, 2, and 3. Typical representatives include Feature Pyramid Transformer [63] and Asymmetric Non-local Neural Networks [69]. The red pixel integrates features of multiple scales at one time and has a global receptive field. However, these methods all resample feature maps of different scales to the same size and lose scale-level information. Their processing of boundary details is rough, and they significantly increase the amount of computation.

Fig. 1

Comparison with the existing methods. The red dot represents the query feature. (a) Single-scale feature fusion only fuses some features around the red dot. (b) Single-scale expansion feature fusion expands the area of local feature fusion. (c) Single-scale global feature fusion fuses global features. (d) Multi-scale local feature fusion fuses multi-scale features layer by layer in only one local area. (e) Multi-scale global feature fusion fuses features of multiple scales at one time with a global receptive field, after resampling them to the same size. (f) Ours implements a sparse sampling strategy and fuses information of all scales at one time to achieve a global receptive field.

Drawing on this previous experience, we propose the Cross-Scale Sampling Transformer (CSSFormer) to solve the above problems in a unified way. As shown in Fig. 1(f), the red pixel samples the feature maps of all scales at one time to generate global sampling points, and these sampling points allow the red pixel to shift and align with the boundary. Our approach is computationally friendly and can significantly reduce carbon emissions. Furthermore, we evaluate our CSSFormer on four challenging semantic segmentation benchmarks, including PASCAL Context [44], ADE20K [67], COCO-Stuff 10K [3], and Cityscapes [13], achieving 59.95%, 55.48%, 50.92%, and 84.72% mIoU, respectively, outperforming the SOTA methods. Our main contributions include:

We propose a Channel Information Augmentation module to enhance the query features, highlight the response to sampling points, and assign weights to different channels.

We propose a Multi-Scale Feature Enhancement module for sparse sampling and fusion of full-scale features.

We propose a Cross-Scale Fusion module for the cross-scale fusion of query feature and full-scale feature to obtain information of all scales at one time.

Based on the above three modules, CSSFormer outperforms the state of the art on four challenging benchmarks: PASCAL Context [44], ADE20K [67], COCO-Stuff 10K [3], and Cityscapes [13].

2 Related work

This section mainly describes related work from three aspects, including Boundary handling, Multi-scale feature fusion, and Transformer in Semantic Segmentation.

2.1 Boundary handling

Much previous work focuses on exploring semantic boundaries [40, 60]. These methods are often applied to high-resolution images but are ineffective for context modeling and multi-scale fusion; when there are many categories, they easily produce classification errors. Seman-flow [31] applies optical flow to semantic segmentation and proposes a pixel deviation alignment module to calibrate the offset pixels. FaPN [25] uses a spatial position alignment module to calibrate the pixel position deviation introduced by upsampling. Fekri et al. [17] propose a highly accurate method for bark texture classification based on improved local ternary patterns (ILTP). Unlike previous work, our method lets the pixels at edge positions combine with context information and undergo effective spatial displacement, so that boundary positions are aligned naturally.

2.2 Multi-scale Features Fusion

Feature maps of different scales carry different information. Shallow features have more details, while deep features carry richer semantic information. The advantage of multi-scale feature fusion is that features can fuse information of different scales, enrich both semantic and detail features, and effectively distinguish objects of different sizes in an image. In object detection, Feature Pyramid Networks (FPN) [38] first propose a multi-scale fusion strategy that fuses multi-scale features gradually. Semantic-FPN [29] and SETR-MLA [66] extend FPN and adopt a top-down fusion mode to fuse multi-scale features. Building on this top-down fusion, ZigZagNet [35] proposes top-down and bottom-up propagations to aggregate multi-scale features. In contrast, the DeepLab series [5, 8] introduces dilated convolution and fuses multi-scale features via concatenation along the channel dimension. Zhao et al. [65] introduce the Pyramid Pooling Module for context modeling. Zhu et al. [69] use a non-local approach to fuse features from two different scales. Lin et al. [36, 37] use a sparse sampling strategy to fuse multi-scale features. Zhang et al. [63] propose the Feature Pyramid Transformer for multi-scale feature fusion, which resizes the feature maps of all scales to the same scale and uses a Grounding Transformer to fuse the features, abandoning the traditional top-down and bottom-up pathways. Unlike previous methods, we explore how to select the features of all scales at one time for global modeling while retaining scale-level information (i.e., fusion without resampling to the same size), which is computationally friendly.

2.3 Transformer in Semantic Segmentation

Encouraged by the Transformer's great success in natural language processing, Transformers have also made great strides in image recognition. Vision Transformer (ViT) [16] is the first model to apply the Transformer to image classification and achieve state-of-the-art results. Swin Transformer [41] uses a pyramid structure and proposes shifted-window-based self-attention, performing well in both classification and downstream tasks. Recently, Transformers have received more and more attention in semantic segmentation [11, 66]. Since the Transformer naturally has a global receptive field, it is well suited to segmentation tasks. SETR [66] is the first semantic segmentation model using ViT as the backbone, with a Transformer-based encoder and a CNN-based decoder. Ranftl et al. [45] also use ViT as the backbone and progressively fuse features with convolutions to obtain full-resolution predictions. Many works are beginning to apply Transformers to the decoder architecture. Xie et al. [57] use the top features extracted from the encoder and randomly generate K learnable token embeddings for transparent object segmentation. Nevertheless, the Transformer is less suited to semantic segmentation when sequences are long. Therefore, SegFormer [58] uses a Transformer-based encoder to extract features and a lightweight MLP decoder to predict pixel by pixel. FTN [53] and Pale Transformer [54] use grouping strategies to fuse context features. Because the Transformer scales poorly with sequence length, Segmenter [48] uses the top features extracted from the encoder and generates k learnable class tokens that are fed into the Transformer for segmentation. Cheng et al. [11] build a unified framework for instance and semantic segmentation by introducing mask predictions. Although the Transformer is capable of global modeling, its self-attention is limited by large-resolution input images. We therefore use a sparse sampling strategy to sample from multi-scale features, with far fewer sampling points than the length of the feature sequence, and then use cross-attention to conduct cross-scale global modeling.

3 Method

We first describe our framework, the Cross-Scale Sampling Transformer (CSSFormer). We then present the Channel Information Augmentation module (CIA), which assigns different weights to different channels according to their importance to emphasize important features and gives each channel a global receptive field. Next, we present the Multi-Scale Feature Enhancement module (MSFE), which combines multi-scale features to generate offsets that align each sampling point on each scale and uses a Transformer to fuse multi-scale information. Finally, we introduce the Cross-Scale Fusion module (CSF), which differs from the MSFE module in that it fuses the information of all scales for each query feature at once.

3.1 Framework

The overall framework of our CSSFormer is shown in Fig. 2. It consists of a Channel Information Augmentation module (CIA), a Multi-Scale Feature Enhancement module (MSFE), and a Cross-Scale Fusion module (CSF). Swin Transformer [41] has recently shown impressive modeling capabilities in downstream tasks, so we choose it as the backbone of our method. Given an input image $I \in \mathbb{R}^{3 \times H \times W}$, we first use the backbone to obtain a multi-scale feature set and apply a top-down fusion approach similar to FPN [38] to deliver strong semantics to the bottom layers. This has little computational overhead and results in pyramid features $\{F_{i} \in \mathbb{R}^{C \times \frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}}}\}_{i = 1}^{4}$ with richer semantics, where $i$, $C$, $H$, and $W$ denote the stage index of the backbone, the channel number, the height, and the width, respectively. Then, we combine the obtained multi-scale features and use the Multi-Scale Feature Enhancement module to correct the deviation of the sampling points of the multi-scale features and to fuse them, obtaining a multi-scale feature sequence with more robust semantics and more accurate spatial positions. The Channel Information Augmentation module is then used to enhance each query feature, enrich its semantics, and adaptively assign different weights to the channels. Next, the query and multi-scale features are fed into the Cross-Scale Fusion module, where cross-attention performs the final cross-scale fusion. Once each query feature holds the feature information of all scales, it also has a global receptive field. The spatial positions of the sampling points of each query feature are corrected to obtain the final multi-scale feature set. Finally, a top-down pathway is adopted to fuse the multi-scale feature set and produce the output.
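To make the pyramid sizes concrete, the following minimal sketch computes the shape of each $F_{i}$ and the total number of positions when all four levels are flattened. The helper names `pyramid_shapes` and `full_scale_length` are hypothetical; C = 256 and the 512 × 512 crop are the values given in the implementation details, and the input size is assumed to be divisible by 32.

```python
def pyramid_shapes(height, width, channels=256, num_levels=4):
    """Return (C, H_i, W_i) for F_1..F_4, where H_i = H / 2**(i + 1)."""
    shapes = []
    for i in range(1, num_levels + 1):
        stride = 2 ** (i + 1)  # strides 4, 8, 16, 32
        # Assumption: height and width are divisible by the largest stride.
        shapes.append((channels, height // stride, width // stride))
    return shapes


def full_scale_length(shapes):
    """Number of positions when all levels are flattened and stacked."""
    return sum(h * w for _, h, w in shapes)


shapes = pyramid_shapes(512, 512)
print(shapes)                     # [(256, 128, 128), (256, 64, 64), (256, 32, 32), (256, 16, 16)]
print(full_scale_length(shapes))  # 21760
```

The flattened length grows with the square of the input side, which is why attending densely over all scales becomes expensive at high resolution.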

Fig. 2

Overall architecture of our Cross-Scale Sampling Transformer (CSSFormer). "CIA," "MSFE," and "CSF" indicate the Channel Information Augmentation module, the Multi-Scale Feature Enhancement module, and the Cross-Scale Fusion module, respectively.

3.2 Channel Information Augmentation module

To facilitate the subsequent recalibration of query feature sampling points, the details of the query features need to be highlighted. We want to adaptively adjust the importance of each channel relative to the global context features so that the calibration can be carried out dynamically with the sampling points. We find that the richer the receptive field of the feature map is, the more it benefits the scaling of channel weights and the stronger the responsiveness of the sampling points. To this end, we design a Channel Information Augmentation module to re-model the query feature.

The data flow diagram of the entire Channel Information Augmentation module is shown in Fig. 3. Specifically, we use adaptive average pooling to obtain feature maps of four different sizes for each scale feature $\{F_{i} \in \mathbb{R}^{C \times \frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}}}\}_{i = 1}^{4}$. This process is formulated as follows:

Fig. 3

Channel Information Augmentation module. "POOL" means global average pooling. "CONV" represents the 1×1 convolution layer. Φ means upsampling operation. φ means 3×3 convolution layer and global average pooling. "MLP" consists of several fully connected layers and a sigmoid function.

$g_{i, j} = \mathrm{AvgPool}_{j} (F_{i}), \quad j = 1, \dots, 4,$ (1)

In addition, a 1×1 convolution aligns the number of channels, yielding features with different receptive fields, which are then upsampled to the size of the query feature for fusion.

$g_{i, j}^{'} = Conv (g_{i, j}),$ (2)

$F_{i}^{'} = concatenate (F_{i}, Φ (g_{i, j}^{'})),$ (3)

For the fused feature $F_{i}^{'}$, we obtain a global feature vector through a 3×3 convolution layer and global average pooling. This process is formulated as follows:

$α_{i} = φ (F_{i}^{'}) .$ (4)

Next, for the global feature vector $α_{i}$, we use an MLP module (i.e., two 1 × 1 conv layers followed by a sigmoid activation function, where the input and output dimensions of the two 1 × 1 conv layers change as C -> C/2 and C/2 -> C) to predict the importance of each channel. This process is formulated as follows:

$α_{i}^{'} = MLP (α_{i}) .$ (5)

To apply the importance of each channel to the feature map, we multiply $α_{i}^{'}$ with $F_{i}$ channel-wise and add $F_{i}$ through a residual connection to get the final output. This process is formulated as follows:

$\hat{F_{i}^{'}} = α_{i}^{'} \times F_{i} + F_{i} .$ (6)

It is worth mentioning that our Channel Information Augmentation module is inspired by SENet [24] and PSPNet [65]. The difference is that we scale the channel weights more effectively by enlarging the receptive field of the feature map, so that the weights correspond to the sampling points and perceive more distant areas, and we use residual connections to prevent the channel information from being overly squeezed and excited. In this way, local details are highlighted more after the CIA module is applied. The query feature is re-modeled with only a little computation, which benefits the subsequent multi-scale information fusion and the adjustment of edge detail features.
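The channel reweighting in Eqs. (4) to (6) can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the multi-size pooling branch of Eqs. (1) to (3) and the 3×3 convolution inside φ are omitted (an assumption made to keep the example short), and the function names and the weight matrices `w_reduce` and `w_expand` are hypothetical.

```python
import math


def global_average_pool(feature):
    """feature: list of C channels, each an H x W nested list -> list of C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def linear(weights, vector):
    """Matrix-vector product; on a pooled 1 x 1 map a 1x1 conv reduces to this."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_reweight(feature, w_reduce, w_expand):
    """Eqs. (4)-(6), with phi simplified to pooling only (assumption)."""
    alpha = global_average_pool(feature)                  # Eq. (4) without the 3x3 conv
    hidden = linear(w_reduce, alpha)                      # C -> C/2
    logits = linear(w_expand, hidden)                     # C/2 -> C
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid, Eq. (5)
    # Eq. (6): alpha' * F + F, so each channel is scaled by (1 + gate).
    return [[[g * x + x for x in row] for row in ch] for g, ch in zip(gates, feature)]


# Toy input: C = 2 channels of size 2 x 2; zero weights give gates of exactly 0.5.
feature = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
out = channel_reweight(feature, w_reduce=[[0.0, 0.0]], w_expand=[[0.0], [0.0]])
print(out[0][0])  # [3.0, 6.0]
```

Because of the residual term, a gate can only scale a channel between 1× and 2× of its original value, so no channel is suppressed to zero.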

3.3 Multi-Scale Feature Enhancement module

We find that simply superimposing multi-scale information is not friendly to the query feature of each scale, which cannot effectively gain information from it. Therefore, the MSFE module (as shown in Fig. 4) is proposed to enhance multi-scale information, mainly through the interaction of multi-scale features, fine-tuning of pixel positions in the spatial domain, and one-time fusion of multi-scale information. We find that fusing the information of all scales at one time densely would require a considerable amount of computation and cause a significant GPU memory overhead, making training difficult. Inspired by Deformable DETR [68], we adopt a sparse sampling strategy to solve this problem (as shown in Fig. 5).

Fig. 4

Multi-Scale Feature Enhancement module. Sampling offset requires only one fully connected layer to generate.

Fig. 5

Pyramid Feature Sampling Alignment(PFSA). The pyramid features are sampled by sampling points and aligned by sampling offset.

Specifically, we flatten the multi-scale features $\{F_{i} \in \mathbb{R}^{C \times \frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}}}\}_{i = 1}^{4}$ and simply stack them together to obtain a full-scale feature sequence, denoted as $\hat{F} \in \mathbb{R}^{C \times \sum_{i = 1}^{4} (\frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}})}$. Similar to the traditional self-attention mechanism, we map the query and value through two fully connected layers, except that the keys are discarded. The attention weights are obtained directly from the query, reducing the Transformer's quadratic complexity to linear. Specifically:

$A = Softmax (\sum_{i = 1}^{k} {Weight}_{i}) \times {Value}_{(C_{s} + ▵_{i})} .$ (7)

Here, k is the number of sparse sampling pixels, Weight_i is the attention weight for the i-th sampled key element, C_s denotes the coordinates of the sampling pixels, and ▵_i denotes the sampling offsets; values at the offset positions C_s + ▵_i are obtained by interpolation. The sampling offsets are generated by a single fully connected layer.
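A minimal single-channel sketch of the sparse sampling step is given below. It reads Eq. (7) in the Deformable DETR style, that is, a softmax over the k weights followed by a weighted sum of the k sampled values; this reading, the bilinear interpolation, the border clamping, and all function names are assumptions for illustration. In the module itself the offsets and weight logits would come from fully connected layers applied to the query, whereas here they are passed in directly.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Interpolate an H x W map (nested lists) at fractional (y, x), clamped to the border."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_sample_attention(value_map, reference, offsets, weight_logits):
    """Aggregate k values sampled at reference + offset, weighted by softmax(logits)."""
    weights = softmax(weight_logits)
    cy, cx = reference
    samples = [bilinear_sample(value_map, cy + dy, cx + dx) for dy, dx in offsets]
    return sum(w * s for w, s in zip(weights, samples))


value_map = [[0.0, 1.0], [2.0, 3.0]]
result = sparse_sample_attention(value_map, (0.0, 0.0), [(0.5, 0.5), (0.0, 0.0)], [0.0, 0.0])
print(result)  # 0.75
```

Each query touches only k sampled positions per scale instead of every position in the flattened sequence. For a 512 × 512 crop, the four levels hold 128² + 64² + 32² + 16² = 21,760 positions, against k = 16 samples per scale in the reported setting.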

3.4 Cross-Scale Fusion module

The importance of effectively modeling multi-scale information has been demonstrated in [8, 69]. As shown in Fig. 1, we hope to avoid the position offset caused by upsampling in the top-down fusion process [29, 45] and to merge the full-scale features obtained from the MSFE module at one time. As shown in Fig. 6, we propose our CSF module, which calibrates the spatial position of pixels and obtains the full-scale information under the guidance of multi-scale information. For each scale feature, we carry out Multi-Head Self-Attention on $F_{i}$, which highlights features with strong semantics and benefits the subsequent generation of sampling offsets. In addition, we set up a learnable matrix $α \in \mathbb{R}^{1 \times C}$, which dynamically scales the attention weights as the network learns, preventing them from becoming too large or too small. In particular:

Fig. 6

Cross-Scale Fusion module. "RP" means reference point. Value is the output from the MSFE module.

Fig. 7

Visual analysis.

$\hat{A} = α \times Softmax (\frac{q k^{T}}{\sqrt{d_{head}}}) v,$ (8)

The query feature is obtained in this way, while the value feature comes from the MSFE module. Formula (7) is then rewritten as:

$Attention = Softmax (\sum_{i = 1}^{k} {Weight}_{qi}) \times A_{(C_{s} + ▵_{i})} .$ (9)

It is worth mentioning that both Weight_qi and ▵_i are generated from the query feature, so that the multi-scale information is feature-aligned through the query feature. $A$ is the output from formula (7).
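As an illustration of Eq. (8), the sketch below applies scaled dot-product self-attention and then multiplies the result by a learnable scale α. It uses a single head, so $d_{head}$ equals the channel count, and it applies α channel-wise to the attention output; both choices are assumptions, since the text only states that $α \in ℝ^{1 \times C}$. The function name is hypothetical.

```python
import math


def scaled_self_attention(q, k, v, alpha):
    """Eq. (8)-style attention: alpha * softmax(q k^T / sqrt(d)) v.

    q, k, v: lists of N vectors of length d; alpha: list of d per-channel scales.
    """
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Weighted sum of the value vectors, channel by channel.
        mixed = [sum(p * vj[c] for p, vj in zip(probs, v)) for c in range(d)]
        # Learnable per-channel scale applied to the attention output.
        out.append([a * x for a, x in zip(alpha, mixed)])
    return out


tokens = [[0.0, 0.0], [0.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_self_attention(tokens, tokens, values, alpha=[1.0, 0.5]))  # [[2.0, 1.5], [2.0, 1.5]]
```

The output of this step would serve as the query feature from which the weights and offsets of Eq. (9) are generated, with the values sampled from the MSFE output.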

3.5 Loss Function

A multi-task loss function of AUX and CE is used to optimize the model parameters jointly. In particular, the loss function of AUX is defined as:

$L_{AUX} = \frac{1}{H \times W} \sum L_{ce} (F_{3}^{K \times H \times W}, G T) .$ (10)

Here, we use Swin Transformer as the backbone and apply $L_{AUX}$ to the third-stage feature. $L_{ce}$ and $G T$ denote the cross-entropy loss and the ground truth, respectively. $F_{3}$ denotes the feature map of the encoder's third stage after 16× upsampling. For the final per-pixel classification, we define the loss function $L_{CE}$ as:

$L_{CE} = \frac{1}{H \times W} \sum L_{ce} (O, G T) .$ (11)

Here, $O$ denotes the output feature upsampled eight times to the size of the original image.

Finally, we formulate the multi-task loss function $L$ as:

$L = λ L_{AUX} + L_{CE} .$ (12)

We empirically set λ = 0.4 by default. The model parameters are learned jointly through backpropagation with this joint loss function.
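The joint objective of Eqs. (10) to (12) can be sketched as below. The example assumes that both the auxiliary and the final logits have already been upsampled to H × W, and it ignores details such as ignore-index handling that segmentation codebases usually apply. The function names are hypothetical.

```python
import math


def pixel_cross_entropy(logits, label):
    """Cross entropy for one pixel: -log softmax(logits)[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mean_cross_entropy(logit_map, label_map):
    """Average the per-pixel loss over an H x W map, as in Eqs. (10) and (11)."""
    losses = [pixel_cross_entropy(logit_map[y][x], label_map[y][x])
              for y in range(len(label_map)) for x in range(len(label_map[0]))]
    return sum(losses) / len(losses)


def joint_loss(aux_logits, main_logits, labels, lam=0.4):
    """Eq. (12): L = lambda * L_AUX + L_CE."""
    return lam * mean_cross_entropy(aux_logits, labels) + mean_cross_entropy(main_logits, labels)


# One pixel, two classes, uninformative logits: each term equals log(2).
aux = [[[0.0, 0.0]]]
main = [[[0.0, 0.0]]]
labels = [[0]]
print(abs(joint_loss(aux, main, labels) - 1.4 * math.log(2)) < 1e-12)  # True
```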

4 Experiments

We first introduce the datasets and implementation details. Then, we compare our method with recent state-of-the-art methods on four challenging semantic segmentation benchmarks. Finally, extensive ablation studies are conducted to evaluate the effectiveness of our approach.

4.1 Datasets

PASCAL Context [44] is an extension of the PASCAL VOC 2010 detection challenge. It contains 4998 and 5105 images for training and validation, respectively. Following previous works, we evaluate on the 60 most frequent classes (59 categories plus background).

ADE20K [67] is a challenging benchmark, including 150 categories and diverse scenes with 1,038 image-level labels, split into 20000 and 2000 images for training and validation.

COCO-Stuff 10K [3] is a significant scene parsing benchmark with 9000 training images and 1000 testing images with 182 categories (80 objects and 91 stuff).

Cityscapes [13] carefully annotates 19 object categories of urban landscape images. It contains 5K finely annotated images, split into 2975 and 500 for training and validation.

4.2 Implementation details

Vision Transformers have recently become prominent in computer vision, and the community is gradually replacing CNN backbones with Transformers. Since Swin Transformer [41] has a pyramid structure with low computation overhead, we choose it as our backbone. The channel number C of the features $F_{i}$ is set to 256. We select k=16 sampling points for the feature map of each scale. Both the MSFE module and the CSF module have a depth of 2, and their head number is set to 12. We follow the default settings (e.g., data augmentation and training schedule) of the public codebase mmsegmentation [12]. We optimize our models using AdamW [43] with a batch size of 16 (8 for Cityscapes) and adopt a polynomial learning rate decay schedule. Momentum and weight decay are set to 0.9 and 0.01, respectively, for all the experiments on the four datasets. We set the initial learning rate to 0.00006 on ADE20K and Cityscapes and 0.00002 on PASCAL Context and COCO-Stuff 10K. The total iterations are set to 160k, 60k, 80k, and 80k for ADE20K, COCO-Stuff 10K, Cityscapes, and PASCAL Context, respectively. During training, data augmentation in all the experiments consists of three steps: (i) random horizontal flipping, (ii) random resizing with a ratio between 0.5 and 2, and (iii) random cropping (512×512, 512×512, 768×768, and 480×480 for ADE20K, COCO-Stuff 10K, Cityscapes, and PASCAL Context, respectively). For a fair comparison, we apply synchronized BN to the decoder and the auxiliary loss head (encoder stage 3), and we do not adopt widely used tricks such as the OHEM [47] loss, different learning rates (LR) for the backbone and the decoder head, or a class-balanced loss in model training. During inference, we use the default settings of mmsegmentation [12]. Specifically, the input image is first scaled to a uniform size. Multi-scale scaling with factors (0.5, 0.75, 1.0, 1.25, 1.5, 1.75) and random horizontal flipping are then applied to the image. A sliding window is adopted for testing (e.g., 512 × 512 for ADE20K).
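For reference, a polynomial learning rate decay schedule can be sketched as follows. The text does not state the exponent, the final learning rate, or any warmup, so `power=1.0`, `min_lr=0.0`, and the absence of warmup are assumptions, and the function name `poly_lr` is hypothetical.

```python
def poly_lr(iteration, max_iters, base_lr, power=1.0, min_lr=0.0):
    """Polynomial decay from base_lr down to min_lr over max_iters iterations."""
    progress = min(iteration, max_iters) / max_iters
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


# ADE20K setting from the text: initial learning rate 0.00006, 160k iterations.
for it in (0, 80_000, 160_000):
    print(it, poly_lr(it, 160_000, 6e-5))
```

With `power=1.0` the schedule is a straight line from the initial rate to `min_lr`; larger exponents decay faster early in training.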

4.3 Comparisons with the State-of-the-art Methods

Results on ADE20K. Table 1 shows the results on the ADE20K dataset, where we compare CSSFormer with state-of-the-art methods. Among previous works, ISNet [28] achieves 47.55% mIoU with a CNN backbone, and SegFormer [58] and MCIBI [27] achieve 51.80% and 50.80% mIoU, respectively, with Transformer backbones. Our CSSFormer achieves 53.12% mIoU with single-scale (SS) inference. With multi-scale inference, our method achieves a new state of the art of 54.02% mIoU, +2.22 mIoU higher than SegFormer (54.02 vs. 51.80). When training with larger images (640 × 640), our method is +1.88 mIoU higher than Segmenter [48] (55.48 vs. 53.60). To be fair, we did not use any tricks in our experiments.

Table 1
State-of-the-art comparison on the ADE20K dataset. SS: Single-scale inference. MS: Multi-scale inference. † means the resolution of the image is 640 × 640, otherwise 512 × 512.

Method	Backbone	mIoU(SS)	mIoU(MS)
FCN [42]	ResNet-101	39.91	41.40
EncNet [64]	ResNet-101	42.61	44.65
OCRNet [61]	HRNet-W48	43.25	44.88
CCNet [26]	ResNet-101	43.71	45.04
ANN [69]	ResNet-101	-	45.24
PSPNet [65]	ResNet-269	44.39	45.35
FPT [63]	ResNet-101	-	45.90
DeepLabV3+ [9]	ResNet-101	45.47	46.35
DMNet [21]	ResNet-101	45.42	46.76
ISNet [28]	ResNeSt-101	-	47.55
SETR [66]	ViT-Large	48.64	50.28
DPT [45]	ViT-Hybrid	-	49.02
MCIBI [27]	ViT-Large	-	50.80
SegFormer [58]	MiT-B5	51.00	51.80
CSSFormer (ours)	Swin-Large	53.12	54.02
Swin-UperNet [41]†	Swin-Large	52.10	53.50
Segmenter [48]†	ViT-Large	51.80	53.60
CSSFormer (ours)†	Swin-Large	53.84	55.48
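The mIoU metric reported in these comparisons is the per-class intersection over union averaged over classes. The following sketch computes it for flat lists of predicted and ground-truth labels. It is a simplified illustration: benchmark tools accumulate the counts over the whole dataset and handle ignored labels, which is omitted here, and the function name is hypothetical.

```python
def mean_iou(predictions, targets, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(predictions, targets):
        if p == t:
            # A correct pixel counts once in both the intersection and the union.
            intersection[p] += 1
            union[p] += 1
        else:
            # A wrong pixel enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```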

Results on PASCAL Context. Table 2 compares the segmentation results on PASCAL Context. GINet [55] with a ResNet-101 backbone achieves 54.90% mIoU, and SR [62] with an HRNet-w48 backbone achieves 55.70% mIoU. With a Transformer backbone, the previous best method, UperNet, achieves 57.29% mIoU. Our CSSFormer achieves a superior mIoU of 59.95%, which is +2.66 mIoU higher than UperNet (59.95 vs. 57.29).

Table 2

Comparison with the state-of-the-art approaches on the PASCAL Context dataset. MS: Multi-scale inference.

Method	Backbone	mIoU(MS)
PSPNet [65]	ResNet-269	47.80
DeepLabV3+ [9]	ResNet-101	48.47
DANet [19]	ResNet-101	52.60
ANN [69]	ResNet-101	52.80
EMANet [33]	ResNet-101	53.10
SVCNet [14]	ResNet-101	53.20
ACNet [15]	ResNet-101	54.10
GFFNet [32]	ResNet-101	54.20
Efficientfcn [39]	ResNet-101	54.30
APCNet [22]	ResNet-101	54.70
OCRNet [61]	ResNet-101	54.80
RecoNet [10]	ResNet-101	54.80
GINet [55]	ResNet-101	54.90
SR [62]	HRNet-w48	55.70
SETR [66]	ViT-L	55.83
Swin-UperNet [52]	Swin-L	57.29
CSSFormer (ours)	Swin-L	59.95

Results on COCO-Stuff 10K. The state-of-the-art results on the COCO-Stuff 10K dataset are shown in Table 3. ISNet with ResNeSt-101 as the backbone achieves 42.08% mIoU. MCIBI and UperNet achieve 44.89% and 47.71% mIoU, respectively. Our CSSFormer achieves a superior mIoU of 50.92% with multi-scale inference.

Table 3

Comparison with the state-of-the-art approaches on the COCO-Stuff 10K dataset. MS: Multi-scale inference.

Method	Backbone	mIoU(MS)
PSPNet [65]	ResNet-101	38.86
OCRNet [61]	ResNet-101	39.50
DANet [19]	ResNet-101	39.70
SVCNet [14]	ResNet-101	39.60
MaskFormer [11]	ResNet-101	39.80
EMANet [33]	ResNet-101	39.90
SpyGR [30]	ResNet-101	39.90
ACNet [15]	ResNet-101	40.10
GINet [55]	ResNet-101	40.60
OCRNet [61]	HRNetV2-W48	40.50
RegionContrast [23]	ResNet-101	40.70
RecoNet [10]	ResNet-101	41.50
ISNet [28]	ResNeSt-101	42.08
MCIBI [27]	ViT-L	44.89
Swin-UperNet [52]	Swin-L	47.71
CSSFormer (ours)	Swin-L	50.92

Results on Cityscapes. Table 4 shows the comparative results on the Cityscapes validation set. Our CSSFormer achieves a new state of the art of 83.78% mIoU(SS), +4.44% higher than SETR [66] (83.78 vs. 79.34); with multi-scale testing, our method is +2.57 mIoU(MS) higher than SETR (84.72 vs. 82.15). Compared with SegFormer trained at a higher resolution (1024×1024, much larger than our 768×768 crop size), CSSFormer is still +1.38% mIoU(SS) higher (83.78 vs. 82.40); with multi-scale testing, it is +0.72 mIoU(MS) higher than SegFormer (84.72 vs. 84.00).

Table 4

Comparison with the state-of-the-art methods on the Cityscapes validation set. “SS” and “MS” indicate single-scale inference and multi-scale inference, respectively. “†” means the input resolution is 1024 × 1024, otherwise 768 × 768.

Method	Backbone	mIoU(SS)	mIoU(MS)
EncNet [64]	ResNet-101	76.10	76.97
PSPNet [65]	ResNet-101	78.87	80.04
GCNet [4]	ResNet-101	79.18	80.71
DNLNet [59]	ResNet-101	79.41	80.68
CCNet [26]	ResNet-101	79.45	80.66
DANet [19]	ResNet-101	80.47	82.02
ANN [69]	ResNet-101	-	81.30
MaskFormer [11]	ResNet-101	-	81.40
OCRNet [61]	HRNet-w48	80.70	81.87
ISNet [28]	ResNeSt-101	81.10	-
MCIBI [27]	ResNet-50	81.14	-
FPT [63]	ResNet-101	81.70	-
SR [62]	HRNet-w48	82.10	-
Segmenter [48]	DeiT-B	79.00	80.60
Segmenter [48]	ViT-L	-	81.30
SETR-PUP [66]	ViT-L	79.34	82.15
SegFormer [58]†	MiT-B5	82.40	84.00
CSSFormer (ours)†	Swin-L	83.78	84.72

4.4 Ablation study

4.4.1 Comparisons of different decoders

To remove the influence of different backbones and compare only the decoders, we use mmsegmentation as the framework, the same training strategy, and Swin Transformer [41] as the backbone on the PASCAL Context dataset to fairly compare various recently popular methods. The results are shown in Table 7. With Swin-Tiny as the backbone, our CSSFormer is +2.30% mIoU higher than the best of the other methods in the single-scale test (50.61 vs. 48.31). With Swin-Small, Base, and Large as the backbone, CSSFormer is +1.86%, +1.68%, and +1.90% mIoU higher than the best of the other methods, respectively (53.78 vs. 51.92, 54.48 vs. 52.80, 58.87 vs. 56.97).

Table 7
Comparison of different decoders on the PASCAL Context validation set. We implement all the methods in the table; all models are trained on the PASCAL Context dataset for 80K iterations with batch size 16 and crop size 480×480. All methods are evaluated using mean IoU (%) under the single-scale test protocol. FLOPs and Params are measured for the decoder only, excluding the encoder. We use Swin as the backbone; T, S, B, and L denote the Tiny, Small, Base, and Large models, respectively. For example, {2, 2, 6, 2} is the number of layers in each stage and $\{96 \times 2^{i}\}_{i=0}^{3}$ is the number of channels in each stage; the other rows are read in the same way.

Backbone	MLP			CNN	Transformer
	Segformer	Semantic FPN	DPT	SETR-MLA	Upernet	GFF	Segmentor	Trans2Seg	FPT	ours
(GFLOPs, Params(M))	(18, 3)	(112, 27)	(97, 17)	(25, 6)	(187, 37)	(85, 17)	(17, 64)	(66, 17)	(414, 92)	(96, 54)
T-{2, 2, 6, 2}-$\{96 \times 2^{i}\}_{i=0}^{3}$	46.63	46.72	46.79	46.77	45.06	48.04	48.56	47.42	48.31	50.61
S-{2, 2, 18, 2}-$\{96 \times 2^{i}\}_{i=0}^{3}$	50.40	50.49	50.47	50.48	51.67	51.07	48.35	50.17	51.92	53.78
B-{2, 2, 18, 2}-$\{128 \times 2^{i}\}_{i=0}^{3}$	51.04	51.48	51.55	51.51	52.52	51.89	48.84	50.80	52.80	54.48
L-{2, 2, 18, 2}-$\{192 \times 2^{i}\}_{i=0}^{3}$	56.15	56.78	56.41	56.62	56.87	56.49	52.59	55.51	56.97	58.87
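
The backbone labels in Table 7 can be expanded mechanically. The sketch below lists the per-stage depths and channel widths implied by each label, assuming four stages whose width doubles at every stage, as the notation indicates. The names `SWIN_VARIANTS` and `stage_channels` are hypothetical and chosen for this illustration only.

```python
# Depths and base channel widths as written in the row labels of Table 7.
SWIN_VARIANTS = {
    "T": {"depths": (2, 2, 6, 2), "base_channels": 96},
    "S": {"depths": (2, 2, 18, 2), "base_channels": 96},
    "B": {"depths": (2, 2, 18, 2), "base_channels": 128},
    "L": {"depths": (2, 2, 18, 2), "base_channels": 192},
}


def stage_channels(base_channels, num_stages=4):
    """Expand {C * 2^i} for i = 0 .. num_stages - 1 into an explicit list."""
    return [base_channels * 2 ** i for i in range(num_stages)]


if __name__ == "__main__":
    for name, cfg in SWIN_VARIANTS.items():
        print(name, cfg["depths"], stage_channels(cfg["base_channels"]))
```

For instance, the Tiny label expands to depths (2, 2, 6, 2) with channels [96, 192, 384, 768].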

4.4.2 Performance of each module

In Table 5, we evaluate the contribution of each module in CSSFormer on the PASCAL Context dataset. All ablation experiments are performed at a crop size of 480 × 480. With both the CIA module and the MSFE-CSF module, CSSFormer is +3.86% mIoU higher than the baseline in the single-scale test, at a cost of 42.3 additional GFLOPs. The CIA module alone is +0.91% mIoU higher than the baseline while adding only 2.4 GFLOPs. The MSFE-CSF module alone is +2.97% mIoU higher than the baseline while adding 39.9 GFLOPs.

Table 5
Ablation study on PASCAL Context testing set. The input size is 480 × 480.

Baseline	CIA	MSFE-CSF	FLOPs	Params	mIoU
√	-	-	54.0G	33M	46.75
√	√	-	56.4G	37M	47.66
√	-	√	93.9G	50M	49.72
√	√	√	96.3G	54M	50.61
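
The gains quoted for each module follow directly from the rows of Table 5. The short script below recomputes them as a sanity check; the dictionary keys and the helper `deltas` are names chosen for this illustration only.

```python
# Rows of Table 5: decoder cost and single-scale mIoU on PASCAL Context.
ABLATION = {
    "baseline": {"gflops": 54.0, "params_m": 33, "miou": 46.75},
    "+CIA": {"gflops": 56.4, "params_m": 37, "miou": 47.66},
    "+MSFE-CSF": {"gflops": 93.9, "params_m": 50, "miou": 49.72},
    "+CIA+MSFE-CSF": {"gflops": 96.3, "params_m": 54, "miou": 50.61},
}


def deltas(table, base_key="baseline"):
    """Return the change of every configuration relative to the baseline row."""
    base = table[base_key]
    result = {}
    for name, row in table.items():
        if name == base_key:
            continue
        result[name] = {
            "d_miou": round(row["miou"] - base["miou"], 2),
            "d_gflops": round(row["gflops"] - base["gflops"], 1),
            "d_params_m": row["params_m"] - base["params_m"],
        }
    return result


if __name__ == "__main__":
    for name, change in deltas(ABLATION).items():
        print(name, change)
```

With these numbers the two single-module gains sum to 3.88 points, close to the 3.86 points obtained with both modules, which suggests that the two modules are roughly complementary.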

4.4.3 Different number of sampling points

We conduct an ablation study on PASCAL Context on the number of sampling points, as shown in Table 6. Performance first increases with the number of sampling points, peaking at 50.61% mIoU with 16 points, and then decreases. The best number of sampling points may vary from dataset to dataset; however, for convenience we use 16 sampling points on all four datasets.

Table 6
Comparison of different numbers of sampling points with Swin-Tiny on the PASCAL Context testing set.

Sampling points	mIoU
4	49.12
16	50.61
32	50.02
64	49.55
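
To make the notion of "K sampling points per query" concrete, the sketch below bilinearly samples a single-channel feature map at K fractional locations around a reference point and combines the samples with given weights. This is a generic illustration in the spirit of deformable-attention-style sampling, not the authors' implementation: in a learned module the offsets and weights would be predicted from the query, whereas here they are hard-coded, and all function names are hypothetical.

```python
import math


def bilinear_sample(feature, x, y):
    """Bilinearly interpolate a 2D list `feature[row][col]` at fractional (x, y).

    Coordinates are clamped to the feature map borders.
    """
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom


def sparse_sample(feature, reference, offsets, weights):
    """Weighted sum of K bilinear samples taken at reference + offset."""
    ref_x, ref_y = reference
    return sum(
        w * bilinear_sample(feature, ref_x + dx, ref_y + dy)
        for (dx, dy), w in zip(offsets, weights)
    )


if __name__ == "__main__":
    feature = [[0.0, 1.0], [2.0, 3.0]]
    # K = 4 points for brevity; the experiments above use 16.
    offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
    weights = [0.25] * 4
    print(sparse_sample(feature, (0.5, 0.5), offsets, weights))  # 1.5
```

The cost of such an operator grows with K rather than with the number of pixels, which is why the number of sampling points is a natural hyperparameter to ablate.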

4.4.4 Visualization

We briefly visualize the results of several of the most popular models together with ours, and our model still shows some advantages.

5 Conclusion

This paper proposes the Cross-Scale Sampling Transformer (CSSFormer). We are the first to propose sparsely sampling each scale feature at one time and fusing it with all other features, which differs from all previous methods. Specifically, the Channel Information Augmentation module enhances the query features, highlighting part of the response to the sampling points and enhancing image features. Next, the Multi-Scale Feature Enhancement module performs a one-time fusion of full-scale features, so that each feature obtains information from the other scale features. In addition, the Cross-Scale Fusion module performs cross-scale fusion of the query features and the full-scale features. Together, these three modules constitute CSSFormer. Extensive experiments demonstrate the effectiveness of our method.

Acknowledgments

This study is partially supported by the National Natural Science Foundation of China under Grant U2003208 and the Key R&D Project of Xinjiang Uygur Autonomous Region (2021B01002).

References

1. Alhaija H.A., Mustikovela S.K., Mescheder, Geiger and Rother, Augmented reality meets deep learning for car instance segmentation in urban scenes, In British Machine Vision Conference 1 (2017), 2.
2. Bertasius, Shi and Torresani, Semantic segmentation with boundary neural fields, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3602–3610.
3. Caesar, Uijlings and Ferrari, Cocostuff: Thing and stuff classes in context, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1209–1218.
4. Cao, Xu, Lin, Wei and Hu, Gcnet: Non-local networks meet squeeze-excitation networks and beyond, In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
5. Chen L.-C., Papandreou, Schroff and Adam, Rethinking atrous convolution for semantic image segmentation, CoRR, abs/1706.05587, 2017.
6. Chen L.-C., Barron J.T., Papandreou, Murphy and Yuille A.L., Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4545–4554.
7. Chen L.-C., Papandreou, Kokkinos, Murphy and Yuille A.L., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4) (2017), 834–848.
8. Chen L.-C., Papandreou, Kokkinos, Murphy and Yuille A.L.
9. Chen L.-C., Zhu, Papandreou, Schroff and Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
10. Chen, Zhu, Sun, He, Li, Shen and Yu, Tensor low-rank reconstruction for semantic segmentation, In European Conference on Computer Vision, Springer, 2020, pp. 52–69.
11. Cheng, Schwing A.G. and Kirillov, Per-pixel classification is not all you need for semantic segmentation, (2021).
12. MMSegmentation Contributors, MM Segmentation: Openmmlab semantic segmentation toolbox and benchmark, https://github.com/open-mmlab/mmsegmentation, 2020.
13. Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth and Schiele, The cityscapes dataset for semantic urban scene understanding, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
14. Ding, Jiang, Shuai, Liu A.Q. and Wang, Semantic correlation promoted shape-variant context for segmentation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8885–8894.
15. Ding, Guo, Ding and Han, Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1911–1920.
16. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit and Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR (2021).
17. Fekri-Ershad, Bark texture classification using improved local ternary patterns and multilayer neural network, Expert Systems with Applications 158 (2020), 113509.
18. Feng, Haase-Schütz, Rosenbaum, Hertlein, Glaeser, Timm, Wiesbeck and Dietmayer, Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges, IEEE Transactions on Intelligent Transportation Systems (2020).
19. […], Liu, Tian, Li, Bao, Fang and Lu, Dual attention network for scene segmentation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
20. Harders and Szekely, Enhancing human-computer interaction in medical segmentation, Proceedings of the IEEE 91(9) (2003), 1430–1442.
21. […], Deng and Qiao, Dynamic multiscale filters for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3562–3572.
22. […], Deng, Zhou, Wang and Qiao, Adaptive pyramid context network for semantic segmentation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7519–7528.
23. […], Cui and Wang, Region-aware contrastive learning for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16291–16301.
24. […], Shen and Sun, Squeeze-and-excitation networks, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
25. Huang, Lu, Cheng and He, Fapn: Feature-aligned pyramid network for dense image prediction, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 864–873.
26. Huang, Wang, Huang, Wei and Liu, Ccnet: Criss-cross attention for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 603–612.
27. Jin, Gong, Yu, Chu, Wang and Shao, Mining contextual information beyond image for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7231–7241.
28. Jin, Liu, Chu and Yu, Isnet: Integrate image-level and semantic-level context for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7189–7198.
29. Kirillov, Girshick, He and Dollár, Panoptic feature pyramid networks, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
30. […], Yang, Zhao, Shen, Lin and Liu, Spatial pyramid based graph reasoning for semantic segmentation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8950–8959.
31. […], You, Zhu, Zhao, Yang, Tan and Tong, Semantic flow for fast and accurate scene parsing, In European Conference on Computer Vision, Springer, 2020, pp. 775–793.
32. […], Zhao, Han, Tong, Tan and Yang, Gated fully fusion for semantic segmentation, In AAAI, 2020.
33. […], Zhong, Wu, Yang, Lin and Liu, Expectation-maximization attention networks for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9167–9176.
34. Lin, Ji, Lischinski, Cohen-Or and Huang, Multi-scale context intertwining for semantic segmentation, In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 603–619.
35. Lin, Shen, Ji, Lischinski, Cohen-Or and Huang, Zigzagnet: Fusing top-down and bottom up context for object segmentation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7490–7499.
36. Lin, Liang, He, Zheng, Tian and Chen, Structtoken: Rethinking semantic segmentation with structural prior, arXiv e-prints (2022), arXiv–2203.
37. Lin, Wu, Tian and Guo, Feature selective transformer for semantic image segmentation, arXiv preprint arXiv:2203.14124 (2022).
38. Lin T.-Y., Dollár, Girshick, He, Hariharan and Belongie, Feature pyramid networks for object detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
39. Liu, He, Zhang, Ren J.S. and Li, Efficientfcn: Holistically-guided decoding for semantic segmentation, In European Conference on Computer Vision, Springer, 2020, pp. 1–17.
40. Liu, Cheng M.-M., Hu, Wang and Bai, Richer convolutional features for edge detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3000–3009.
41. Liu, Lin, Cao, Hu, Wei, Zhang, Lin and Guo, Swin transformer: Hierarchical vision transformer using shifted windows, International Conference on Computer Vision (ICCV), 2021.
42. Long, Shelhamer and Darrell, Fully convolutional networks for semantic segmentation, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
43. Loshchilov and Hutter, Decoupled weight decay regularization, In International Conference on Learning Representations, 2018.
44. Mottaghi, Chen, Liu, Cho N.-G., Lee S.-W., Fidler, Urtasun and Yuille, The role of context for object detection and semantic segmentation in the wild, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 891–898.
45. Ranftl, Bochkovskiy and Koltun, Vision transformers for dense prediction, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12179–12188.
46. Ruan, Liu, Huang, Wei and Zhao, Devil in the details: Towards accurate single and multiple human parsing, In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 4814–4821.
47. Shrivastava, Gupta and Girshick, Training region-based object detectors with online hard example mining, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
48. Strudel, Garcia, Laptev and Schmid, Segmenter: Transformer for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 7262–7272.
49. Sun, Crawford, Tang and Milenković, Simultaneous optimization of both node and edge conservation in network alignment via wave, In International Workshop on Algorithms in Bioinformatics, Springer, 2015, pp. 16–39.
50. Takikawa, Acuna, Jampani and Fidler, Gated-scnn: Gated shape cnns for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5229–5238.
51. Wang, Girshick, Gupta and He, Non-local neural networks, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
52. […], Wu, Lin, Tian and Guo, Fully transformer networks for semantic image segmentation, arXiv preprint arXiv:2106.04108 (2021).
53. […], Wu, Lin, Tian and Guo, Fully transformer networks for semantic image segmentation, arXiv preprint arXiv:2106.04108 (2021).
54. […], Wu, Tan and Guo, Pale transformer: A general vision transformer backbone with pale-shaped attention, arXiv preprint arXiv:2112.14000 (2021).
55. […], Lu, Zhu, Zhang, Wu, Ma and Guo, Ginet: Graph interaction network for scene parsing, In European Conference on Computer Vision, Springer, 2020, pp. 34–51.
56. Xiao, Liu, Zhou, Jiang and Sun, Unified perceptual parsing for scene understanding, In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 418–434.
57. Xie, Wang, Sun, Xu, Liang and Luo, Segmenting transparent objects in the wild with transformer, In Z.-H. Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, Main Track, 2021, pp. 1194–1200.
58. Xie, Wang, Yu, Anandkumar, Alvarez J.M. and Luo, Segformer: Simple and efficient design for semantic segmentation with transformers, arXiv preprint arXiv:2105.15203 (2021).
59. Yin, Yao, Cao, Li, Zhang, Lin and Hu, Disentangled non-local neural networks, In European Conference on Computer Vision, Springer, 2020, pp. 191–207.
60. […], Feng, Liu M.-Y. and Ramalingam, Casenet: Deep category-aware semantic edge detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5964–5973.
61. Yuan, Chen and Wang, Object-contextual representations for semantic segmentation, In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, Proceedings, Part VI 16, Springer, 2020, pp. 173–190.
62. Zhang, Zhang, Tang, Hua X.-S. and Sun, Self-regulation for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6953–6963.
63. Zhang, Zhang, Tang, Wang, Hua and Sun, Feature pyramid transformer, In European Conference on Computer Vision, Springer, 2020, pp. 323–339.
64. Zhang, Dana, Shi, Zhang, Wang, Tyagi and Agrawal, Context encoding for semantic segmentation, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
65. Zhao, Shi, Qi, Wang and Jia, Pyramid scene parsing network, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
66. Zheng, Lu, Zhao, Zhu, Luo, Wang, Fu, Feng, Xiang, Torr P.H.S. and Zhang, Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, In CVPR, 2021.
67. Zhou, Zhao, Puig, Xiao, Fidler, Barriuso and Torralba, Semantic understanding of scenes through the ade20k dataset, International Journal of Computer Vision 127(3) (2019), 302–321.
68. Zhu, Su, Lu, Li, Wang and Dai, Deformable detr: Deformable transformers for end-to-end object detection, In International Conference on Learning Representations, 2020.
69. Zhu, Xu, Bai, Huang and Bai, Asymmetric non-local neural networks for semantic segmentation, In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 593–602.
