mSwinUNet: A multi-modal U-shaped swin transformer for supervised change detection

Abstract

Convolutional neural networks (CNNs) have received significant attention for change detection (CD) on multimodal remote sensing images, but they struggle to capture global cues because convolution operations are local. In contrast, the transformer can learn global semantic information by dividing the input image into patches, adding position encodings, and applying the self-attention mechanism. Motivated by this, we propose mSwinUNet, a novel end-to-end multi-modal model with a swin-transformer-based, U-shaped siamese network architecture for supervised CD using Sentinel-1 Synthetic Aperture Radar (SAR) and Sentinel-2 Multispectral Imager (MSI) data. mSwinUNet consists of a multi-modal encoder with a difference module, a bottleneck, and a fused decoder, all of which are based on the swin transformer. First, tokenized multi-modal bitemporal image patches are fed into multiple Siamese encoder branches to extract multi-level multi-modal difference feature maps in parallel. Subsequently, the last-level multi-modal difference maps are fused in the bottleneck to generate the smallest-scale change map. Then, the hierarchical decoder uses patch expansion and fusion operations to fuse the multi-scale difference and change maps, recovering the details of the change information. Finally, the last patch expansion and a linear projection output the final change map, which has the same spatial resolution as the input image. Extensive experiments show that mSwinUNet outperforms several state-of-the-art multi-modal CD methods on the OSCD dataset and the corresponding Sentinel-1 SAR data.

Keywords

Change detection (CD), multi-modal, siamese network, swin transformer, remote sensing image

1 Introduction

Change detection (CD) is widely recognized as a critical research topic in the field of intelligent interpretation of remote sensing images [15, 21]. It aims to identify the changed areas between bi-temporal remotely sensed images in earth observation. Such changed areas are important cues for various remote sensing applications, such as land use investigation [6], ecological environment monitoring [13], resource management [20], and natural disaster assessment [2]. However, the extraction of changed areas from bi-temporal images still relies mainly on manual digitization with GIS technology [24]. Accordingly, how to identify the changed areas automatically and accurately has attracted increasing attention [17, 31].

Nowadays, deep learning from computer vision is increasingly applied in remote sensing image analysis. Supervised CD can be formulated as an image classification or segmentation problem. Thus, deep convolutional neural networks (ConvNets) have also been adopted for the CD task. For optical image CD, Zhan et al. [28] proposed a deep siamese convolutional network and obtained good performance. Unfortunately, the network has a large number of parameters and is difficult to train end-to-end, because the feature maps from each convolutional layer have the same size as the input image and a k-nearest neighbor (k-NN) approach [25] was introduced. Daudt et al. [7] presented fully convolutional siamese networks (FCNNs) to perform CD. This network, which is built on an encoder-decoder architecture, reduces the number of network parameters and can be trained end-to-end from scratch. Much follow-up work [5, 22] is devoted to stacking convolutional layers, dilated convolutions, and channel/spatial attention in the network architecture to capture global details. To further enhance the semantic feature representation extracted from remote sensing images, an advanced transformer [4] has been introduced into a ConvNet encoder. However, this encoder is not suitable for multispectral remote sensing images (which include more than three bands), because its backbone is a ResNet that relies on weights pre-trained on a large RGB image dataset. Bandara et al. [1] brought a pure transformer into the CD domain and proposed a hierarchical transformer encoder with a lightweight MLP decoder, which can be trained on multispectral remote sensing image datasets. Furthermore, Zhang et al. [29] introduced the swin transformer block [19] as the basic unit and designed SwinSUNet with a Siamese U-shaped structure, which can better extract change information. Since then, the family of transformer variants [14, 27] has been widely introduced for change detection tasks.

However, the above-mentioned works do not consider the fusion of multi-modal remote sensing images collected from multiple sensors, thus missing the opportunity to use the variety of available remote sensing images as supplementary information. Ebel et al. [9] proposed a multi-modal siamese architecture that ingests both Optical and SAR data and processes the multi-modal information. Compared to the optical baseline, however, the improvement in CD accuracy is still very small. To further promote the fusion of SAR and Optical data at the decision level with deep learning, Hafner et al. [11, 12] proposed the dual stream U-Net (DS_UNet), which processes the SAR and Optical image pairs in parallel and then fuses the change maps extracted from both sensors at the final decision stage. Nevertheless, this method still stacks convolutional layers in the network architecture. To the best of our knowledge, no research has yet been done on advanced transformers for multi-modal CD.

Motivated by multi-modal fusion [9, 12] and advanced transformer technology [3], we propose a novel end-to-end multi-modal model with a Swin-transformer-based, U-shaped siamese network architecture, mSwinUNet for short. On the one hand, mSwinUNet aims to solve supervised multi-modal CD, which is distinct from traditional mono-modal CD tasks. It receives a pair of multi-modal bitemporal images as input, such as Optical and SAR bitemporal images. On the other hand, in contrast to CNN-based CD networks, mSwinUNet is based on the swin transformer. Therefore, mSwinUNet can not only extract long-term global information in space-time, but also simultaneously leverage the supplementary information among multi-modal remote sensing images.

mSwinUNet is composed of a multi-modal encoder with a difference module, a bottleneck, and a decoder used for multi-modal fusion. They are all built using the Swin Transformer block. The multi-modal encoder is specifically designed to process multi-modal bitemporal input images by using multiple Siamese encoder branches. These branches split the input bitemporal images into non-overlapping image patches using patch partition and linear embedding. The image patches are treated as tokens and fed into the hierarchical Swin transformer, which comprises patch merging and Swin transformer blocks for learning multi-scale bitemporal features. Then, the difference maps between these bitemporal features are computed by the difference module. The last-level multi-modal difference maps are fused through concatenation and linear projection in the bottleneck. Next, the decoder up-samples the extracted change map using patch expanding. The up-sampled features are fused with the multi-scale and multi-modal difference maps at each stage of the hierarchical Swin transformer. Finally, the decoder applies the last patch expanding and a linear projection to restore the change map to the spatial resolution of the input images and performs CD prediction. We train and validate our proposed mSwinUNet on multi-modal satellite observations comprising the Onera Satellite CD (OSCD) dataset and the corresponding Sentinel-1 SAR data, in which each multi-modal bitemporal remote sensing image pair is labeled with one binary change map. Extensive experiments demonstrate that our proposed mSwinUNet improves the extraction of spatio-temporal change maps and integrates the rich complementary information from disparate sensors, thereby achieving better multi-modal CD performance than its counterparts.

The main contributions are summarized as follows:

(1) We propose mSwinUNet, a novel end-to-end multi-modal siamese network that uses a U-shaped architecture and the swin transformer. To the best of our knowledge, mSwinUNet is the first transformer network designed purely for supervised multi-modal CD. It can therefore both leverage supplementary information from multi-modal remote sensing images and better extract their long-term global spatiotemporal features.

(2) Instead of concatenating the multi-scale bitemporal features from the encoder branches, we use the absolute value of their element-wise subtraction to generate the difference maps. Our experiments indicate that this approach yields better results.

(3) The effectiveness and efficiency of the proposed mSwinUNet have been extensively validated through experiments conducted on real-world multi-modal satellite observations, including the OSCD dataset and corresponding Sentinel-1 SAR data. Our method outperforms several recent multi-modal CD methods in terms of accuracy and efficiency, as demonstrated by the experimental results.

The remainder of this paper is organized as follows. Section 2 describes in detail our proposed mSwinUNet architecture for the multi-modal CD task. The dataset, evaluation metrics, baselines, experimental settings, and performance comparison are reported in Section 3. Finally, conclusions are drawn in Section 4.

2 Proposed mSwinUNet network architecture

2.1 Architecture overview

The overall network architecture of our proposed mSwinUNet is depicted in Fig. 1. As the figure shows, mSwinUNet follows a structure similar to that of SiamUNet [9], utilizing the Siamese network architecture. It comprises two encoder branches, each focusing on a distinct modality, and a decoder that fuses the multi-modal and multi-scale features. The main difference between the two models is that the basic element and difference module of SiamUNet [9] are convolution and concatenation, respectively, whereas those of mSwinUNet are the Swin transformer block [29] and the element-wise absolute value of the subtraction (Sub&Abs). As a result, mSwinUNet has a stronger ability to extract global information and change maps in space-time.

Fig. 1

The multi-modal network architecture of our proposed mSwinUNet for supervised CD. The network consists of two encoder branches with difference modules for the Optical and SAR modalities, a bottleneck, and a decoder, all of which are based on the swin transformer block. The pre-change and post-change optical images are bi-temporal images obtained from Sentinel-2 Multispectral Imager data, whereas the pre-change and post-change SAR images are bi-temporal images derived from Sentinel-1 Synthetic Aperture Radar data. The binary change map is the label.

mSwinUNet consists of two encoder branches with difference modules, a bottleneck, and a decoder. The two branches, composed of Swin transformer blocks at different levels, process the optical and SAR bi-temporal images in parallel and generate hierarchical multi-scale features, ranging from high-resolution fine-grained features to low-resolution coarse-grained ones. The difference modules compute the difference maps of these hierarchical multi-scale features for each modality. The bottleneck module mainly uses concatenation and linear projection to fuse the difference maps at the final stage of the multi-modal encoder branches. The decoder is mainly responsible for fusing the difference maps computed by the difference modules of the multi-modal encoder branches, which helps to offset the loss of spatial information caused by down-sampling. Finally, the decoder uses the last patch expanding layer to perform 4× up-sampling, restoring the change maps to the input resolution (W×H). A linear projection is then applied to the up-sampled change maps to produce pixel-level change detection predictions. Each part is described in detail below.

2.2 Swin transformer block

The main building block of our proposed mSwinUNet is the swin transformer block [19], which is built on the shifted window partitioning strategy. The basic unit comprises two consecutive swin transformer blocks, as illustrated in Fig. 2. Each swin transformer block contains LayerNorm (LN) layers, a multi-head self-attention module, residual connections, and a multilayer perceptron (MLP) with GELU non-linearity. The first block uses window-based multi-head self-attention (W-MSA) and the second uses shifted-window-based multi-head self-attention (SW-MSA); the two modules differ in their window partitioning, which is employed to minimize the effect of window locality. The two successive swin transformer blocks of the l-th layer in the mSwinUNet network can be formulated as:

Fig. 2

Two successive swin transformer blocks of the l-th layer in mSwinUNet.

$\begin{matrix} {\hat{F}}^{l} = \text{W-MSA} (LN (F^{l - 1})) + F^{l - 1}, \\ {\ddot{F}}^{l} = MLP (LN ({\hat{F}}^{l})) + {\hat{F}}^{l}, \\ {\bar{F}}^{l} = \text{SW-MSA} (LN ({\ddot{F}}^{l})) + {\ddot{F}}^{l}, \\ {\tilde{F}}^{l} = MLP (LN ({\bar{F}}^{l})) + {\bar{F}}^{l} \end{matrix}$ (1)

where $F^{l - 1}$ denotes the input from the (l-1)-th layer in mSwinUNet; ${\hat{F}}^{l}$ represents the output of the W-MSA module, whereas ${\bar{F}}^{l}$ denotes the output of the SW-MSA module; ${\ddot{F}}^{l}$ and ${\tilde{F}}^{l}$ denote the respective outputs of the MLP modules in the two successive swin transformer blocks; W-MSA and SW-MSA refer to the window-based multi-head self-attention module and the shifted window-based multi-head self-attention module, respectively.
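The residual wiring of the two successive blocks can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the attention and MLP sub-layers are passed in as hypothetical callables (`w_msa`, `sw_msa`, `mlp1`, `mlp2`), and the layer normalization omits the learned scale and shift. Each step normalizes the output of the preceding step and adds that same output back as the residual.

```python
from typing import Callable, List

Tokens = List[List[float]]            # one feature vector per token
Layer = Callable[[Tokens], Tokens]    # stand-in type for W-MSA, SW-MSA, MLP


def layer_norm(tokens: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in tok])
    return out


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise residual addition."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def two_successive_blocks(f_prev: Tokens, w_msa: Layer, sw_msa: Layer,
                          mlp1: Layer, mlp2: Layer) -> Tokens:
    """Pre-norm residual wiring of a W-MSA block followed by an SW-MSA block."""
    f_hat = add(w_msa(layer_norm(f_prev)), f_prev)    # window attention
    f_ddot = add(mlp1(layer_norm(f_hat)), f_hat)      # first MLP
    f_bar = add(sw_msa(layer_norm(f_ddot)), f_ddot)   # shifted-window attention
    f_tilde = add(mlp2(layer_norm(f_bar)), f_bar)     # second MLP
    return f_tilde
```

Because every sub-layer sits on a residual path, replacing all four callables with a function that returns zeros makes the pair of blocks an identity mapping, which is a convenient sanity check for the wiring.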

2.3 Multi-modal Swin transformer encoder

Patch Embedding. This module transforms the inputs into sequence embeddings and mainly consists of patch partition and linear embedding. The input $X_{m}^{t}$, the original remote sensing image of modality m at time t, is split into non-overlapping patches by patch partition. These patches are treated as “tokens”, and their raw-valued features come from the original pixel values of the input $X_{m}^{t}$. Then, a linear embedding projects the raw-valued features to an arbitrary dimension (denoted as C). The patch partition and linear embedding can be implemented by a 2-D convolution whose kernel and stride sizes are both equal to the patch size. The formula is as follows:

$\begin{matrix} F_{m}^{t_0} = Conv2D_{k \times k} (X_{m}^{t}), \\ m \in \{optical, sar\}, \\ t \in \{pre, post\} \end{matrix}$ (2)

where Conv2D is a 2-D convolution operation, k is the kernel size, m denotes the Optical or SAR modality of the remote sensing image, and $F_{m}^{t_0}$ is the output of the patch embedding module applied to $X_{m}^{t}$.
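To make the equivalence concrete, the following sketch performs patch partition followed by a linear embedding on nested Python lists; it produces what a convolution with kernel size and stride both equal to k would compute. The helper names (`patch_partition`, `linear_embedding`), the row-major flattening order, and the omission of a bias term are assumptions made for illustration.

```python
from typing import List

Image = List[List[List[float]]]  # image[row][col][band]


def patch_partition(image: Image, k: int) -> List[List[float]]:
    """Split an H x W x B image into non-overlapping k x k patches.

    Each patch is flattened (row-major) into a vector of k*k*B raw pixel values.
    """
    height, width = len(image), len(image[0])
    assert height % k == 0 and width % k == 0, "H and W must be divisible by k"
    patches = []
    for top in range(0, height, k):
        for left in range(0, width, k):
            patch = []
            for r in range(top, top + k):
                for c in range(left, left + k):
                    patch.extend(image[r][c])
            patches.append(patch)
    return patches


def linear_embedding(patches: List[List[float]],
                     weight: List[List[float]]) -> List[List[float]]:
    """Project every flattened patch with a weight of shape C x (k*k*B)."""
    return [[sum(w * v for w, v in zip(w_row, patch)) for w_row in weight]
            for patch in patches]


# A 4 x 4 single-band image with pixel values 0..15 and patch size k = 2.
image = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
patches = patch_partition(image, k=2)
# Embedding dimension C = 1 with averaging weights, for easy inspection.
tokens = linear_embedding(patches, weight=[[0.25, 0.25, 0.25, 0.25]])
```

With a 4×4 input and k = 2, the sketch yields four tokens, one per 2×2 patch, so the spatial resolution is reduced by a factor of k in each direction.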

Patch Merging. This module down-samples the input patch resolution and increases its channel dimension. Specifically, according to their row and column positions, the elements of the input patch are sampled at intervals of 2 and divided into 4 parts. The 4 parts are then concatenated together. Finally, a linear layer projects the concatenated features to 2× the dimension of the input patch. This process can be formulated as:

$\begin{matrix} L = [{{\tilde{f}}_{0 + 2 * i, 0 + 2 * j}^{m}}_{i, j = 0}^{\frac{M^{t_l}}{2}, \frac{N^{t_l}}{2}}, {{\tilde{f}}_{1 + 2 * i, 0 + 2 * j}^{m}}_{i, j = 0}^{\frac{M^{t_l}}{2}, \frac{N^{t_l}}{2}}, \\ {{\tilde{f}}_{0 + 2 * i, 1 + 2 * j}^{m}}_{i, j = 0}^{\frac{M^{t_l}}{2}, \frac{N^{t_l}}{2}}, {{\tilde{f}}_{1 + 2 * i, 1 + 2 * j}^{m}}_{i, j = 0}^{\frac{M^{t_l}}{2}, \frac{N^{t_l}}{2}}] \end{matrix}$ (3)

$F_{m}^{t_l} = Linear (4 C, 2 C) (L)$ (4)

where $\tilde{f}$ denotes the elements of the patch ${\tilde{F}}_{m}^{t_l}$ produced by the l-th swin transformer blocks of modality m at time t, [] is the concatenation operation, L is an intermediate variable, Linear () is a linear projection, and $F_{m}^{t_l}$ is the output of the patch merging.
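A minimal sketch of patch merging on nested lists is given below. It follows the textual description: the four interleaved sub-grids are concatenated along the channel axis (giving 4× the input channels) and then projected to 2× the input channels. The function name and the weight layout (output rows by input columns) are assumptions, and no normalization layer is included.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # feature[row][col][channel]


def patch_merging(feature: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """Halve the spatial resolution and double the channel dimension.

    feature: H x W x C map with even H and W.
    weight:  projection matrix of shape (2C) x (4C).
    """
    height, width = len(feature), len(feature[0])
    assert height % 2 == 0 and width % 2 == 0, "H and W must be even"
    merged = []
    for i in range(height // 2):
        row = []
        for j in range(width // 2):
            # Concatenate the 2 x 2 neighbourhood in the order of Eq. (3):
            # offsets (0, 0), (1, 0), (0, 1), (1, 1).
            concat = (feature[2 * i][2 * j] + feature[2 * i + 1][2 * j]
                      + feature[2 * i][2 * j + 1] + feature[2 * i + 1][2 * j + 1])
            row.append([sum(w * v for w, v in zip(w_row, concat))
                        for w_row in weight])
        merged.append(row)
    return merged


# 2 x 2 map with C = 1; the weight keeps the first and last concatenated values.
feature = [[[1.0], [2.0]],
           [[3.0], [4.0]]]
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
merged = patch_merging(feature, weight)
```

In this toy case the 2×2×1 input becomes a 1×1×2 output, which shows the trade of spatial resolution for channel depth.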

Difference Module. We use the element-wise absolute value of the subtraction as the difference module to compute the difference between the multi-level features of the pre-change and post-change images from the hierarchical swin transformer encoder. The formula is given as follows:

$F_{m}^{diff_l} = | F_{m}^{pre_l} - F_{m}^{post_l} |$ (5)

where $F_{m}^{pre_l}$ and $F_{m}^{post_l}$ denote the l-level features extracted by the l-th swin transformer blocks from the pre-change and post-change images, respectively, and $F_{m}^{diff_l}$ is the element-wise absolute value of the subtraction between $F_{m}^{pre_l}$ and $F_{m}^{post_l}$.
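Equation (5) can be illustrated with a short sketch on nested lists (the function name `abs_difference` is chosen here for illustration):

```python
from typing import List

FeatureMap = List[List[List[float]]]  # feature[row][col][channel]


def abs_difference(pre: FeatureMap, post: FeatureMap) -> FeatureMap:
    """Element-wise |pre - post| for two feature maps of identical shape."""
    return [[[abs(a - b) for a, b in zip(p_vec, q_vec)]
             for p_vec, q_vec in zip(p_row, q_row)]
            for p_row, q_row in zip(pre, post)]


pre = [[[1.0, 5.0], [2.0, 2.0]]]
post = [[[4.0, 3.0], [2.0, 0.5]]]
diff = abs_difference(pre, post)
```

Two properties follow directly from the definition: the result does not depend on the order of the two inputs, and it has the same number of channels as each input, whereas concatenation would double the channel count.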

2.4 Bottleneck

The bottleneck is constructed from multi-modal swin transformer blocks to relearn the features down-sampled by the previous patch merging. In the bottleneck, the main task is to fuse and up-sample the multi-modal difference maps from the last layers of the two encoder branches. This procedure is implemented through a concatenation operation and a linear projection layer as follows:

${\tilde{F}}^{0} = Linear (C, C) ([F_{optical}^{diff_{last}}, F_{sar}^{diff_{last}}])$ (6)

where $F_{optical}^{diff_{last}}$ and $F_{sar}^{diff_{last}}$ denote the difference maps extracted by the last layer of the optical and SAR encoder branches, respectively. Here, the value of last is 4; ${\tilde{F}}^{0}$ is the final output of the two encoder branches and also the initial input of the decoder.
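The concatenation-and-projection fusion can be sketched as follows. This is an illustrative reading under stated assumptions: the two difference maps are concatenated along the channel axis, and the projection weight (laid out as output rows by input columns) takes the full concatenated vector as input. The function name `concat_and_project` is hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # feature[row][col][channel]


def concat_and_project(maps: List[FeatureMap],
                       weight: List[List[float]]) -> FeatureMap:
    """Concatenate feature maps channel-wise, then apply a linear projection.

    maps:   feature maps sharing the same H x W.
    weight: matrix of shape out_dim x (sum of input channel counts).
    """
    height, width = len(maps[0]), len(maps[0][0])
    fused = []
    for i in range(height):
        row = []
        for j in range(width):
            concat = []
            for fmap in maps:
                concat.extend(fmap[i][j])
            row.append([sum(w * v for w, v in zip(w_row, concat))
                        for w_row in weight])
        fused.append(row)
    return fused


# 1 x 1 difference maps with two channels per modality.
diff_optical = [[[1.0, 2.0]]]
diff_sar = [[[3.0, 4.0]]]
# Example weight that averages matching channels of the two modalities.
weight = [[0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5]]
fused = concat_and_project([diff_optical, diff_sar], weight)
```

Because the projection is learned, the network is free to weight the optical and SAR channels differently at every output channel; the averaging weight above is only one possible setting.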

2.5 Fused decoder

Patch Expanding. This is the inverse operation of patch merging: it up-samples the input patch resolution and reduces its channel dimension. Specifically, a linear layer is first applied to the input patch to increase the feature dimension to 2× the original dimension. Then, a rearrange operation reshapes the feature maps to 2× the input resolution and accordingly reduces the channel dimension to a quarter of the expanded dimension, i.e., half of the original dimension. This procedure is formulated as follows:

$L = Linear (C, 2 C) ({\tilde{F}}^{l - 1})$ (7)

$F^{up_l} = Rearrange (2 H, 2 W, \frac{C}{2}) (L)$ (8)

where ${\tilde{F}}^{l - 1}$ is the output of the previous layer in the decoder and the input of this layer, Rearrange () is the rearrange operation that reshapes the feature map, and $F^{up_l}$ is the result up-sampled by the patch expanding of the l-th layer.
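A sketch of patch expanding on nested lists is shown below. The linear layer doubles the channels from C to 2C, and the rearrange step distributes those 2C values over a 2×2 spatial block; counting elements, each of the four new positions then holds C/2 channels. The order in which chunks are assigned to the four positions is an assumption of this sketch, as is the function name.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # feature[row][col][channel]


def patch_expanding(feature: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """Double the spatial resolution of an H x W x C map.

    weight: matrix of shape (2C) x C used by the channel-expanding linear layer.
    Returns a 2H x 2W x (C/2) map.
    """
    height, width = len(feature), len(feature[0])
    channels = len(feature[0][0])
    assert channels % 2 == 0 and len(weight) == 2 * channels
    c_out = channels // 2
    out = [[[] for _ in range(2 * width)] for _ in range(2 * height)]
    for i in range(height):
        for j in range(width):
            expanded = [sum(w * v for w, v in zip(w_row, feature[i][j]))
                        for w_row in weight]
            # Rearrange: split the 2C values into four chunks of C/2 and
            # place them on the 2 x 2 block (assumed row-major chunk order).
            for p1 in range(2):
                for p2 in range(2):
                    start = (2 * p1 + p2) * c_out
                    out[2 * i + p1][2 * j + p2] = expanded[start:start + c_out]
    return out


feature = [[[1.0, 2.0]]]                      # 1 x 1 map with C = 2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
upsampled = patch_expanding(feature, weight)  # 2 x 2 map with C/2 = 1
```

The total number of values is preserved by the rearrange step (H·W·2C before, 2H·2W·C/2 after), so no information is discarded after the linear expansion.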

Concatenation & Fusion. The multi-scale feature difference maps from the multi-modal encoder and the up-sampled features by patch expanding are concatenated and fused through a linear projection layer as follows:

$L = [F_{optical}^{diff_{(last - l)}}, F_{sar}^{diff_{(last - l)}}, F^{up_l}]$ (9)

$F^{l} = Linear (C, 2 C) (L)$ (10)

where $F_{optical}^{diff_{(last - l)}}$ and $F_{sar}^{diff_{(last - l)}}$ are the difference feature maps generated by the (last-l)-th encoder layer, which corresponds to the l-th decoder layer; $F^{l}$ is the fused result of the difference maps and the up-sampled change map in the l-th decoder layer.

Upsampling & Prediction. We use the last patch expanding layer to up-sample the fused feature maps by a factor of 4 to the size H×W. Finally, the up-sampled feature is processed through a linear projection to predict the pixel-level change mask CM with a resolution of H×W×2, where the two channels correspond to change and no-change. This linear projection layer can be formulated as follows:

$CM = Linear (C, 2) (F^{up_{last}})$ (11)

where $F^{up_{last}}$ is the result up-sampled by the last patch expanding layer, and CM is the final predicted binary image.
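The prediction head can be sketched as a per-pixel linear projection from C channels to two class scores. Converting the scores into a binary mask by taking the larger of the two, and treating index 1 as the "change" class, are assumptions of this sketch; the function name is hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # feature[row][col][channel]


def predict_change_mask(feature: FeatureMap, weight: List[List[float]],
                        bias: List[float]) -> Tuple[FeatureMap, List[List[int]]]:
    """Project each pixel from C channels to 2 logits and derive a binary mask.

    weight: 2 x C matrix, bias: 2 values.
    Returns (logits of shape H x W x 2, mask of shape H x W with 1 = change).
    """
    logits = [[[sum(w * v for w, v in zip(w_row, pixel)) + b
                for w_row, b in zip(weight, bias)]
               for pixel in row]
              for row in feature]
    # Assumed convention: index 0 = no-change, index 1 = change.
    mask = [[1 if pix[1] > pix[0] else 0 for pix in row] for row in logits]
    return logits, mask


feature = [[[1.0, 0.0], [0.0, 1.0]]]          # 1 x 2 map with C = 2
weight = [[1.0, 0.0], [0.0, 1.0]]
logits, mask = predict_change_mask(feature, weight, bias=[0.0, 0.0])
```

During training, the two-channel logits would be passed to the loss function directly; the hard mask is only needed for evaluation and visualization.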

3 Experimental results and analysis

3.1 Dataset and evaluation metrics

To validate the proposed mSwinUNet, we performed experiments on the publicly available Sentinel-2 Onera Satellite CD (OSCD) dataset introduced by Daudt et al. [8], along with the corresponding Sentinel-1 SAR image data published in [12]. The OSCD dataset comprises 24 pairs of bitemporal multispectral images captured by Sentinel-2 satellites over different cities across the world. Each Sentinel-2 image in a bitemporal pair has 13 bands. The corresponding Sentinel-1 SAR images are collected from different orbits and are preprocessed by speckle noise removal, pixel-wise mean and normalization, and image resampling so that they have the same resolution as the OSCD dataset. The Sentinel-1 SAR image has only 1 band. Each multi-modal bitemporal image pair is labeled with one pixel-wise ground-truth change map. Following the official split, we divide the multi-modal dataset into two parts: 14 sites for training and 10 sites for evaluation.

Overall Accuracy (OA), Precision, Recall, and F1-score (F1) were used to evaluate the performance of the proposed mSwinUNet on multi-modal change detection tasks from different perspectives, in comparison to state-of-the-art methods. In the change detection task, OA is the proportion of correctly predicted changed and unchanged pixels among all pixels, Precision is the proportion of correctly predicted changed pixels among all pixels predicted as changed, Recall is the proportion of correctly predicted changed pixels among the actual changed pixels, and F1 is the harmonic mean of Precision and Recall, offering a balanced evaluation of the model’s performance. All values range from 0 to 1; a higher Precision indicates fewer false detections, and a higher Recall indicates fewer missed detections. F1 and OA serve as overall evaluation metrics for the prediction results, with larger values indicating better predictions. The formulas are as follows:

$OA = \frac{TP + TN}{TP + FP + FN + TN}$ (12)

$Precision = \frac{TP}{TP + FP}$ (13)

$Recall = \frac{TP}{TP + FN}$ (14)

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (15)

where TP, TN, FP, and FN are the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively.
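Equations (12)-(15) translate directly into a small helper. The sketch below is illustrative; the function names and the convention of returning 0 when a denominator is zero are assumptions.

```python
from typing import Dict


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, Eq. (15)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cd_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute OA, Precision, Recall, and F1 from pixel-level confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "OA": (tp + tn) / total if total else 0.0,
        "Precision": precision,
        "Recall": recall,
        "F1": f1_from_pr(precision, recall),
    }


# Toy example: 100 pixels, of which 18 actually changed.
scores = cd_metrics(tp=8, tn=80, fp=2, fn=10)

# Consistency check with the reported mSwinUNet Precision and Recall.
f1_check = round(f1_from_pr(0.640, 0.565), 3)
```

In the toy example, OA is 0.88 while F1 is only about 0.57, which illustrates why F1 is the more informative metric when changed pixels are a small minority. Applying Eq. (15) to the Precision (0.640) and Recall (0.565) reported for mSwinUNet reproduces the reported F1 of 0.600.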

3.2 Baselines

To demonstrate the effectiveness of our proposed mSwinUNet model, we compare it with the following state-of-the-art baselines:

UNet [23] is a mono-modal network model. To apply it to multi-modal data, we integrate the Sentinel-1 (S1) and Sentinel-2 (S2) observations into a single input tensor and process them jointly.

SiamUNet [9] is the first multi-modal Siamese network that ingests both optical and SAR data, and processes the multi-modal information using a convolutional encoder-decoder architecture.

DS_UNet [12] is a dual-stream UNet model that can extract informative change features from both optical and SAR image pairs simultaneously. These change features are then fused in the final decision stage.

3.3 Experimental settings

All experiments were conducted using PyTorch 1.11.0 on a 64-bit CentOS 7.9.200 server equipped with an AMD EPYC 7702 64-core CPU, 128 GB of RAM, and an NVIDIA GA100 GPU. To improve the quality of the dataset, we applied data augmentation with random strategies, including random flipping, random cropping, and random 90-degree rotation. All trainable parameters of the baselines and our model were initialized randomly. We chose AdamW as the optimizer, with a weight decay of 0.01 and beta values of (0.9, 0.999), and used the Cross-Entropy (CE) loss for training. With a batch size of 2, we kept the learning rate fixed at 0.001 and experimented with learning rate decay values of 0.25, 0.3, and 0.5. Training was terminated if no improvement was observed after 500 epochs. Each experiment was repeated ten times to reduce random fluctuations, and the average results are reported.
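The stopping rule can be sketched as a simple patience counter. The class name `EarlyStopping` is hypothetical, and the monitored quantity is an assumption: the text does not state which score is tracked, so the sketch accepts any score for which larger is better (for example, a validation F1).

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 500) -> None:
        self.patience = patience
        self.best = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, score: float) -> bool:
        """Record the score of one epoch; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Small patience for demonstration; the experiments use 500 epochs.
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(s) for s in [0.10, 0.20, 0.20, 0.15, 0.10]]
```

In this demonstration the score stops improving after the second epoch, so the stopper signals termination on the fifth epoch, three epochs after the best score was recorded.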

3.4 Performance comparison on difference module

To verify the effectiveness of the element-wise absolute value of the subtraction (Sub&Abs) of the two feature maps as the difference module in each encoder branch of mSwinUNet, we compared it with the mainstream difference modules of Convolution [1] and Concatenation [9, 29]. The numerical results are summarized in Table 1. Sub&Abs performs best on all four metrics; for example, its F1-score of 0.600 exceeds those of Concatenation (0.537) and Convolution (0.398). This shows that Sub&Abs has a more powerful ability to compute difference feature maps in the multi-modal CD task.

Table 1
Performance comparison on difference module

Model	OA	F1	Precision	Recall
+Convolution	0.959	0.398	0.553	0.315
+Concatenation	0.963	0.537	0.588	0.494
+Sub&Abs	0.967	0.600	0.640	0.565

3.5 Performance comparison on modality combination

We verify the performance of our proposed mSwinUNet under different modality combinations. Table 2 summarizes the quantitative results, from which we draw the following findings: (1) In the mono-modality case, the Optical modality achieves a higher OA and Precision than the SAR modality (0.956 vs. 0.948 and 0.482 vs. 0.336, respectively). This is because the Sentinel-2 multispectral image has 13 bands, which provide more feature information. (2) The combination of the Optical and SAR modalities outperforms either mono-modality alone on all four metrics. This is because mSwinUNet is able to make full use of the complementary information between the Optical and SAR modalities.

Table 2
Performance comparison with modality combination

Modal	OA	F1	Precision	Recall
SAR	0.948	0.409	0.336	0.260
Optical	0.956	0.334	0.482	0.256
SAR+Optical	0.967	0.600	0.640	0.565

3.6 Performance comparison among models

Table 3 summarizes the performance comparison between our model and the baselines. (1) The UNet [23] model achieves an OA of 0.948, an F1-score of 0.304, a Precision of 0.366, and a Recall of 0.260, which is the worst overall performance among the compared models. This is attributed to the fact that the Optical and SAR modalities are simply combined into a single input to the standard UNet. (2) SiamUNet [9] achieves a higher OA, F1-score, and Precision than UNet [23]. This is because it is the first multi-modal Siamese architecture based on UNet that processes Optical and SAR data separately with multi-modal encoders and fuses the extracted features at multiple decoder depths. (3) DS_UNet [12] outperforms SiamUNet [9], because DS_UNet further fuses the SAR and optical features at the decision level. (4) Our proposed mSwinUNet achieves the best results on all four metrics. This confirms the effectiveness of mSwinUNet for multi-modal remote sensing CD: the swin transformer block can extract more descriptive features, and our network architecture has a stronger ability to fuse the Optical and SAR features.

Table 3
Performance comparison between our proposed method and several state-of-the-art baselines

Model	OA	F1	Precision	Recall
UNet [23]	0.948	0.304	0.366	0.260
SiamUNet [9]	0.956	0.334	0.482	0.256
DS_UNet [12]	0.966	0.577	0.630	0.531
mSwinUNet	0.967	0.600	0.640	0.565

Figure 3 provides a qualitative visual comparison of the baselines and our proposed mSwinUNet on the test images. Rows 1-2 show the Sentinel-2 Optical images captured at times t1 and t2, respectively, while rows 3-4 show the Sentinel-1 SAR images captured at the same times. The fifth row displays the corresponding ground-truth change maps. Rows 6 to 9 show the visual results of the baselines and mSwinUNet, in which the performance gradually improves. These qualitative results are consistent with the quantitative results in Table 3 and show the superiority of our proposed mSwinUNet model over the baselines.

Fig. 3

Visualization results from UNet [23], SiamUNet [9], DS_UNet [12], and our proposed mSwinUNet model on the OSCD test dataset and the corresponding SAR image test dataset. For ease of viewing, different colors are used: white indicates true positives, green indicates false negatives, magenta indicates false positives, and black denotes true negatives.

4 Conclusion

This paper presented mSwinUNet, a novel multi-modal siamese network architecture that uses the Swin Transformer to further improve change detection performance on multi-modal bi-temporal remote sensing imagery. It comprises three key components: a multi-modal encoder, a decoder, and a bottleneck for fusing the multi-modal difference information. mSwinUNet can both better extract long-term global spatiotemporal features and leverage supplementary information from multi-modal remote sensing images, leading to improved accuracy in multi-modal CD tasks. Extensive experiments indicate that mSwinUNet outperforms several state-of-the-art baselines on the Sentinel-2 optical dataset and the corresponding Sentinel-1 SAR image dataset.

Inevitably, our work has some limitations, which provide directions for future work: (1) Currently, the input images of the mSwinUNet model are Optical and SAR bi-temporal images. In future work, we plan to incorporate more modalities of remote sensing images to evaluate the performance of the multi-modal siamese network (mSwinUNet) on CD tasks, as the number of encoders can increase with the number of available modalities. (2) mSwinUNet is applied to supervised multi-modal change detection tasks. We next plan to develop multi-modal deep learning models with a swin-transformer-based, U-shaped siamese network architecture for unsupervised and weakly supervised CD tasks, with the goal of reducing the labeling burden of remote sensing images.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 62271361 and 52271366, and the Department of Science and Technology, Hubei Provincial People’s Government under Grant 2021CFB513.

References

1. Bandara W.G.C. and Patel V.M., A transformer-based siamese network for change detection, arXiv preprint arXiv:2201.01293, 2022.

2. Brunner, Lemoine and Bruzzone, Earthquake damage assessment of buildings using VHR optical and SAR imagery, IEEE Transactions on Geoscience and Remote Sensing 48(5) (2010), 2403–2420.

3. Cao, Wang, Chen, Jiang, Zhang, Tian and Wang, Swin-Unet: Unet-like pure transformer for medical image segmentation, arXiv preprint arXiv:2105.05537, 2021.

4. Chen, Qi and Shi, Remote sensing image change detection with transformers, IEEE Transactions on Geoscience and Remote Sensing 60 (2021), 1–14.

5. Chen, Yuan, Peng, Chen, Huang, Zhu, Liu and Li, DASNet: Dual attentive fully convolutional siamese networks for change detection in high-resolution satellite images, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2020), 1194–1206.

6. Chen, He, Wang and Xiao, The influence of polarimetric parameters and an object-based approach on land cover classification in coastal wetlands, Remote Sensing 6(12) (2014), 12575–12592.

7. Daudt R.C., Le Saux and Boulch, Fully convolutional siamese networks for change detection, In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 4063–4067. IEEE, 2018.

8. Daudt R.C., Le Saux, Boulch and Gousseau, Urban change detection for multispectral earth observation using convolutional neural networks, In IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, pages 2115–2118. IEEE, 2018.

9. Ebel, Saha and Zhu X.X., Fusing multi-modal data for supervised change detection, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 43 (2021), 243–249.

10. Fang, Li, Shao and Li, SNUNet-CD: A densely connected siamese network for change detection of VHR images, IEEE Geoscience and Remote Sensing Letters 19 (2021), 1–5.

11. Hafner, Multi-Modal Deep Learning with Sentinel-1 and Sentinel-2 Data for Urban Mapping and Change Detection, PhD thesis, KTH Royal Institute of Technology, 2022.

12. Hafner, Nascetti, Azizpour and Ban, Sentinel-1 and Sentinel-2 data fusion for urban change detection using a dual stream U-Net, IEEE Geoscience and Remote Sensing Letters 19 (2021), 1–5.

13. Kennedy R.E., Townsend P.A., Gross J.E., Cohen W.B., Bolstad, Wang and Adams, Remote sensing change detection tools for natural resource managers: Understanding concepts and tradeoffs in the design of landscape monitoring projects, Remote Sensing of Environment 113(7) (2009), 1382–1396.

14. Liang, Chen, Liu, Zhu, Geng and Zhang, Change detection for high-resolution remote sensing images based on a UNet-like siamese-structured transformer network, Sensors & Materials 35 (2023).

15. Liu, Jiao, Tang, Yang, Ma and Hou, Local restricted convolutional neural network for change detection in polarimetric SAR images, IEEE Transactions on Neural Networks and Learning Systems 30(3) (2018), 818–833.

16. Liu, Shi, Li and Chai, Learning token-aligned representations with multimodel transformers for different-resolution change detection, IEEE Transactions on Geoscience and Remote Sensing 60 (2022), 1–13.

17. Liu, Pang, Zhan, Zhang and Yang, Building change detection for remote sensing images using a dual-task constrained deep siamese convolutional network model, IEEE Geoscience and Remote Sensing Letters 18(5) (2020), 811–815.

18. Liu, Li, Mercier, He and Pan, Change detection in heterogenous remote sensing images via homogeneous pixel transformation, IEEE Transactions on Image Processing 27(4) (2017), 1822–1834.

19. Liu, Lin, Cao, Hu, Wei, Zhang, Lin and Guo, Swin transformer: Hierarchical vision transformer using shifted windows, In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

20. Lv Z.Y., Shi, Zhang and Benediktsson J.A., Landslide inventory mapping from bitemporal high-resolution remote sensing images using change detection and multiscale segmentation, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11(5) (2018), 1520–1532.

21. Malila W.A., Change vector analysis: An approach for detecting forest changes with Landsat, In LARS symposia, page 385, 1980.

22. Peng, Zhong, Li and Li, Optical remote sensing image change detection based on attention mechanism and image difference, IEEE Transactions on Geoscience and Remote Sensing 59(9) (2020), 7296–7307.

23. Ronneberger, Fischer and Brox, U-Net: Convolutional networks for biomedical image segmentation, In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

24. Sofina and Ehlers, Building change detection using high resolution remotely sensed data and GIS, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9(8) (2016), 3430–3438.

25. Touazi and Bouchaffra, A k-nearest neighbor approach to improve change detection from remote sensing: Application to optical aerial images, In 2015 15th International Conference on Intelligent Systems Design and Applications (ISDA), pages 98–103. IEEE, 2015.

26. Wang, Chen, Guo, Yi and Li, STCD: Efficient siamese transformers-based change detection method for remote sensing images, Geo-spatial Information Science, pages 1–20, 2023.

27. Yan, Wan and Zhang, Fully transformer network for change detection of remote sensing images, In Proceedings of the Asian Conference on Computer Vision, pages 1691–1708, 2022.

28. Zhan, Fu, Yan, Sun, Wang and Qiu, Change detection based on deep siamese convolutional network for optical aerial images, IEEE Geoscience and Remote Sensing Letters 14(10) (2017), 1845–1849.

29. Zhang, Wang, Cheng and Li, SwinSUNet: Pure transformer network for remote sensing image change detection, IEEE Transactions on Geoscience and Remote Sensing 60 (2022), 1–13.

30. Zhang and Shi, A feature difference convolutional neural network-based change detection method, IEEE Transactions on Geoscience and Remote Sensing 58(10) (2020), 7232–7246.

31. Zhang, Xu, Chen, Yan and Sun, Triplet-based semantic relation learning for aerial remote sensing image change detection, IEEE Geoscience and Remote Sensing Letters 16(2) (2018), 266–270.


Table 1
Performance comparison on difference module

Model            OA     F1     Precision  Recall
+Convolution     0.959  0.398  0.553      0.315
+Concatenation   0.963  0.537  0.588      0.494
+Abs&Sub         0.967  0.600  0.640      0.565

Table 2
Performance comparison with modality combination

Modal            OA     F1     Precision  Recall
SAR              0.948  0.409  0.336      0.260
Optical          0.956  0.334  0.482      0.256
SAR+Optical      0.967  0.600  0.640      0.565

Table 3
Performance comparison between our proposed method and several state-of-the-art baselines

Model            OA     F1     Precision  Recall
UNet [23]        0.948  0.304  0.366      0.260
SiamUNet [9]     0.956  0.334  0.482      0.256
DS_UNet [12]     0.966  0.577  0.630      0.531
mSwinUNet        0.967  0.600  0.640      0.565
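
As a reading aid for the F1 column, the sketch below recomputes F1 from the precision and recall values reported in Table 3. It assumes the standard definition of F1 as the harmonic mean of precision and recall. The names f1_from_pr, TABLE3 and check_table are hypothetical and are not taken from the authors' code. Because the inputs are rounded to three decimals, the recomputed values are expected to match the reported ones only approximately.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from Table 3.
TABLE3 = {
    "UNet": (0.366, 0.260, 0.304),
    "SiamUNet": (0.482, 0.256, 0.334),
    "DS_UNet": (0.630, 0.531, 0.577),
    "mSwinUNet": (0.640, 0.565, 0.600),
}


def check_table(rows, tol=0.002):
    """Map each model to (recomputed F1, reported F1, agrees within tol)."""
    result = {}
    for model, (precision, recall, reported) in rows.items():
        recomputed = f1_from_pr(precision, recall)
        result[model] = (recomputed, reported, abs(recomputed - reported) <= tol)
    return result


if __name__ == "__main__":
    for model, (recomputed, reported, ok) in check_table(TABLE3).items():
        print(f"{model:10s} recomputed={recomputed:.3f} reported={reported:.3f} ok={ok}")
```

The tolerance of 0.002 is an arbitrary choice that absorbs rounding of the three-decimal inputs; a tighter value would need the unrounded precision and recall.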