Abstract
Change detection in synthetic aperture radar (SAR) images is an important part of remote sensing (RS) image analysis. Contemporary researchers have concentrated on the spatial and deep-layer semantic information while giving little attention to the extraction of multidimensional and shallow-layer feature representations. Furthermore, change detection relies on patch-wise training and pixel-to-pixel prediction while the accuracy of change detection is sensitive to the introduction of edge noise and the availability of original position information. To address these challenges, we propose a new neural network structure that enables spatial-frequency-temporal feature extraction through end-to-end training for change detection between SAR images from two different points in time. Our method uses image patches fed into three parallel network structures: a densely connected convolutional neural network (CNN), a frequency domain processing network based on a discrete cosine transform (DCT), and a recurrent neural network (RNN). Multi-dimensional feature representations alleviate speckle noise and provide comprehensive consideration of semantic information. We also propose an ensemble multi-region-channel module (MRCM) to emphasize the central region of each feature map, with the most critical information in each channel employed for binary classification. We validate our proposed method on four benchmark SAR datasets. Experimental results demonstrate the competitive performance of our method.
Abbreviations
The abbreviations in this paper are as follows:
RS: Remote Sensing
SAR: Synthetic Aperture Radar
SFTNet: Spatial-Frequency-Temporal Feature Extraction Network
MRCM: Multi-Region-Channel Module
CNN: Convolutional Neural Network
FC: Fully Connected Layer
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
GRU: Gated Recurrent Unit
DCT: Discrete Cosine Transform
DI: Difference Image
FCM: Fuzzy C-Means Algorithm [31]
CAM: Channel Attention Module [36]
MLP: Multilayer Perceptron
FP: False Positives
FN: False Negatives
OE: Overall Error
PCC: Percentage of Correct Classification
KC: Kappa Coefficient
PCANet: Principal Component Analysis Network [21]
DBN: Deep Belief Network [24]
NR-ELM: Neighborhood-based Ratio and Extreme Learning Machine [25]
CWNN: Convolutional-Wavelet Neural Networks [26]
Ms-CapsNet: Multiscale Capsule Network [27]
SAFNet: Siamese Adaptive Fusion Network [28]
Introduction
Remote sensing (RS) image change detection is a technology for identifying changed regions in images of the same scene at different times. RS change detection is commonly used when analyzing synthetic aperture radar (SAR) data, multispectral data, and hyperspectral data. The recent and rapid growth in SAR sensor abilities has led scholars to pay more attention to SAR remote sensing images, which are not easily affected by light and atmospheric conditions and offer all-weather and all-day use along with multiple band and polarization features. SAR sensors have extensive uses in many fields, such as mapping of environment and natural resources [1, 2], urban studies [3–5], and natural disaster assessment [6, 7]. However, the imaging device of the radar sensor is affected by wave interference, resulting in large amounts of speckle noise and the loss of original surface information in typical SAR images [8]. Effectively suppressing noise while obtaining high-precision detection results is still a challenging task.
Based on the available remote sensing image information, change detection algorithms use supervised, semi-supervised, and unsupervised approaches to enable the recognition of pixels in changed regions. The supervised method requires artificially labeled pixels for changed and unchanged regions to train the classifier models [9, 10]. The semi-supervised method uses a mix of manually labeled samples and unlabeled data to train the classifier models [11, 12]. Supervised and semi-supervised methods are more efficient and accurate than unsupervised methods because they feed labeled data into the model for training. Nonetheless, the difficulty of manually labeling data for current needs means that unsupervised change detection methods have been widely adopted [13–15].
Change detection algorithms generally employ three main steps: 1) image pre-processing; 2) difference image (DI) production; and 3) classification of the DI to find changed and unchanged pixels. Image pre-processing includes radiometric correction, geometric registration, and noise reduction. Representative difference features help obtain high-precision change detection results. Subtraction, log-ratio, and normal difference methods have become the mainstream algorithms for constructing DIs. The third step, classification, has received significant research interest. Thresholding [16, 17] and clustering [18, 19] have been broadly studied for recognizing changed and unchanged areas using a DI. Among them, the fuzzy c-means (FCM) algorithm obtains good performance and has been extended by other related methods [20]. Nevertheless, these approaches extract little of the rich feature information available in the images, and the classification models still have room to improve.
SAR image change detection results have improved remarkably in conjunction with the rapid growth of deep neural network research. Principal component analysis (PCA) has been used to establish the parameters of convolution filters and to retain the hierarchical architecture of a traditional convolutional neural network (CNN). The resulting principal component analysis network (PCANet) is a simple deep learning network that requires fewer training samples than CNNs [21]. Li et al. proposed a PCANet guided by context-aware saliency detection to extract training samples available in SAR images [22]. A CNN using supervised learning has been applied to learn SAR image features [23]. A deep belief network (DBN) was employed to obtain spatial characteristics in SAR images [24]. These studies still have two significant issues. The first is the lack of mutual reinforcement between multidimensional and shallow-layer features. Most of the models have been constructed on the basis of spatial and deep-layer semantic information and rarely involve multidimensional feature fusion. The second is the inability to suppress edge noise and exploit original position information. Patch-wise features are widely considered a good basis for binary classification, but the noisy features introduced by each patch and the loss of original image features are difficult to handle.
Neighborhood-based ratio (NR) and extreme learning machine (ELM) methods have also been applied in SAR image change detection to reduce the loss of information in the DI [25]. One researcher presented a convolutional-wavelet neural network (CWNN) [26] to reduce speckle noise effectively. Gao et al. introduced a multiscale capsule network (Ms-CapsNet) for aggregating spatial features from different positions [27]. Another researcher presented a Siamese Adaptive Fusion Network (SAFNet) [28] to avoid error gradient accumulation and achieve operations among various scales of feature maps for change detection. Li et al. imported deep translation results to a supervised change detection network for optical and SAR images [29]. However, developing a robust model to detect changes in SAR images that can address the preceding challenges is a non-trivial problem.
In this paper, we propose a Spatial-Frequency-Temporal Feature Extraction Network (SFTNet) that jointly employs a spatial domain processing module containing a densely connected CNN, a frequency domain processing module based on a discrete cosine transform (DCT), and a temporal domain processing module with a recurrent neural network (RNN) for SAR image change detection. The extracted features are integrated into the same network structure as three branches for inference. This method mitigates the influence of noise by taking advantage of the feature representations. We further propose an ensemble multi-region-channel module (MRCM) that emphasizes the central region of each patch and enables most of the contextual information in each channel to be used for efficient changed pixel detection.
The contributions of this work are as follows. To the best of our knowledge, ours is the first approach to use an end-to-end SFTNet to implement multi-dimensional feature extraction for SAR image change detection. Our MRCM exploits multi-level semantic features, alleviating speckle noise and providing comprehensive consideration of contextual information. On four real SAR datasets, the experimental results of our proposed method are superior to those of other state-of-the-art (SOTA) methods.
We organize our paper as follows. We introduce SAR image change detection in Section 1 and then introduce each component of our solution in Section 2, with detailed structures of SFTNet given in part 2.2. Section 3 describes our implementation results and analysis. Section 4 presents the parameters and architectures that affect the accuracy of the model. Finally, Section 5 contains our conclusions and future prospects for our work.
Materials and methods
As described in the introduction, a common strategy used to study change detection is to produce a difference image (DI) from bitemporal images (i.e., images taken at two different times) with the DI operator. Our method then uses an appropriate classification model to accurately locate changed and unchanged pixels, labelled as “1” and “0”, respectively, to generate the final change detection map. The flow of this approach is illustrated in Fig. 1. Overall, it consists of three main steps: preclassification, training of the SFTNet model, and final generation of the change map.
Preclassification

Fig. 1. The procedure of our proposed approach.
The main purpose of preclassification is to recognize samples from the DI that have higher probabilities of being correctly classified into unchanged or changed classes. Subtraction and log-ratio are two mainstream DI operators frequently applied for SAR image change detection [30]. The log-ratio method used here is expressed as
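As a minimal sketch of the log-ratio operator, the snippet below assumes the commonly used form I_d = |log(I2 / I1)| applied pixel by pixel. The offset `eps`, the function names, and the min-max rescaling to [0, 255] are illustrative assumptions and are not taken from Equation (1).

```python
import math


def log_ratio_di(i1, i2, eps=1.0):
    """Pixel-wise |log((i2 + eps) / (i1 + eps))| for two equally sized images.

    eps is a hypothetical offset that keeps the logarithm finite at zero pixels.
    """
    return [
        [abs(math.log((b + eps) / (a + eps))) for a, b in zip(row1, row2)]
        for row1, row2 in zip(i1, i2)
    ]


def rescale_to_0_255(di):
    """Min-max rescale a difference image to the range [0, 255]."""
    flat = [v for row in di for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant image
    return [[255.0 * (v - lo) / span for v in row] for row in di]
```

Because of the absolute value, swapping the two acquisition dates gives the same difference image, and the ratio form turns multiplicative speckle into an additive term in the log domain.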
In prior work [31], Gabor features were extracted and the fuzzy c-means (FCM) algorithm was first used to divide the log-ratio image into three separate clusters: unchanged, changed, and intermediate classes denoted as $w_u$, $w_c$, and $w_i$, respectively. We employ these clusters according to Equations (2) through (4).
In Equations (2) through (4), m ∈ [1, + ∞) denotes the fuzziness degree,
Since pixels belonging to $w_u$ and $w_c$ have a higher probability of being unchanged or changed, only 10% of samples in $w_u$ and $w_c$ from each image patch are randomly selected to be further classified. All pixels in $w_i$ are training data for SFTNet.
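The clustering step can be illustrated with a plain fuzzy c-means loop on scalar difference-image values. This is a sketch of standard FCM only, not of the Gabor-feature procedure in [31]; the function name, the evenly spaced initial centers, and the fixed iteration count are assumptions, and the sketch requires a fuzziness degree m > 1.

```python
def fcm_1d(values, n_clusters=3, m=2.0, n_iter=50):
    """Standard fuzzy c-means on scalar values (requires m > 1).

    Returns (centers, memberships); memberships[k][c] is the degree to which
    values[k] belongs to cluster c, and each row sums to 1.
    """
    lo, hi = min(values), max(values)
    # Assumed initialisation: centers spread evenly over the value range.
    centers = [lo + (hi - lo) * (c + 0.5) / n_clusters for c in range(n_clusters)]
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update from the distances to the current centers.
        memberships = []
        for x in values:
            dist = [abs(x - c) + 1e-12 for c in centers]
            memberships.append([
                1.0 / sum((dist[c] / dist[j]) ** power for j in range(n_clusters))
                for c in range(n_clusters)
            ])
        # Center update: membership-weighted mean of the values.
        centers = [
            sum((u[c] ** m) * x for u, x in zip(memberships, values))
            / sum(u[c] ** m for u in memberships)
            for c in range(n_clusters)
        ]
    return centers, memberships
```

With three clusters ordered by center value, the lowest and highest clusters would play the roles of the unchanged and changed classes, and the middle cluster the intermediate class.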
The framework of the proposed method is illustrated in Fig. 2. The feature extraction network consists of spatial feature extraction, frequency feature extraction, and temporal feature extraction. The spatial feature extraction branch consists of a densely connected CNN and MRCM. The densely connected CNN works well to capture spatial semantic features, and the MRCM enhances the use of central location features and contextual information in each channel along with essential feature representations. The frequency feature extraction branch uses a frequency domain processing network that employs DCT and a selection strategy. The critical reshaped DCT coefficients are fed to a fully connected (FC) layer for inference as features. The temporal feature extraction branch adopts an RNN to analyze temporal dependence in SAR images effectively. The entire process works in an unsupervised manner.

Fig. 2. Illustration of the Spatial-Frequency-Temporal Network (SFTNet). The network has three main parts: spatial feature extraction, frequency feature extraction, and temporal feature extraction.
We now demonstrate how to achieve mutual reinforcement from all extracted features when performing further binary classification. In particular, the input patches are fusions of the original images and the DI, with a size of 3 × r × r. We set r = 7 in our work. Each input patch is fed into the SFTNet for training and contains feature information of the input images $I_1$ and $I_2$ and the difference image $I_d$. The ultimate goal is to generate a binary change detection map through the final classification.
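A 3 × r × r input patch can be assembled as sketched below, with the two original images and the DI stacked as three channels around a centre pixel. The function name and the edge-replication rule at image borders are assumptions; the text does not state how border pixels are handled.

```python
def extract_patch(i1, i2, i_d, row, col, r=7):
    """Return a 3 x r x r patch centred on (row, col).

    Channels are ordered (I1, I2, DI). Pixels outside the image are filled by
    replicating the nearest edge pixel, which is an assumed padding rule.
    """
    half = r // 2
    height, width = len(i1), len(i1[0])

    def crop(img):
        return [
            [
                img[min(max(row + dr, 0), height - 1)][min(max(col + dc, 0), width - 1)]
                for dc in range(-half, half + 1)
            ]
            for dr in range(-half, half + 1)
        ]

    return [crop(i1), crop(i2), crop(i_d)]
```

The centre element of each channel is the pixel being classified, so the prediction for the patch is assigned to position (row, col) of the change map.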
Inspired by the effective use of multiscale features by UNet++ [33], we combine a densely connected CNN with MRCM for spatial feature extraction. UNet++ is an architecture that takes advantage of an efficient ensemble of U-Nets of varying depths that share an encoder and yields a highly flexible feature fusion scheme as the skip connections aggregate features of varying semantic scales in the decoder sub-networks. The combination of CNN and MRCM abandons pixels around edge areas to make use of central contextual information and fuses feature representations from different semantic levels and spatial positions.
The original UNet++ was not suitable for our work even though it has been proven to be effective in image segmentation. If the image patches are sent to the network directly, the change information in some small parts of the SAR images would be lost after sampling by the layer structure. To address this problem, we reduce the number of sampling layers and revise the output nodes of UNet++ to obtain the densely connected CNN. Figure 3 shows the complete flow of the densely connected CNN. Each input patch is fed into the network and down-sampled. Each constituent block is then restored to its original size by up-sampling with a sub-decoder. Skip-connections transmit fine-grained spatial features to two sub-decoders, effectively applying shallow localization information in the deeper layers. In addition, to further use these features, the outputs of the blocks in the shallow sub-decoder are combined with the blocks of the same size in the deeper sub-decoder.

Fig. 3. Illustration of the densely connected CNN. Part (a) is the backbone of the densely connected CNN. Six cubes denote the same convolution blocks, which are connected through the operations of down-sampling, up-sampling, and skip-connection. The blue arrows indicate feature representations after training of the encoder and sub-decoders, and these parameters are input into the multi-region-channel module (MRCM). Part (b) is the specific structure of a convolutional block.
In Fig. 3(a), each block $B_{i,j}$ denotes a convolution block. Inspired by the residual unit structure that contains convolution, max pooling, and batch normalization operations along with activation functions [34], we consider a convolution block defined by
We use $b_{i,j}$ to denote the output of block $B_{i,j}$, where i is the down-sampling layer in the encoder and j is the convolution layer of the dense block along the skip connection. The stack of feature maps is computed by
As shown in Fig. 4, the MRCM is designed to capture crucial feature information between different channels automatically. Given an input node

Fig. 4. Graphic model of the multi-region-channel module (MRCM). Input $B_{0,j}$ contains three blocks $B_{0,0}$, $B_{0,1}$, and $B_{0,2}$, from which we obtain three feature maps $V_s^1$, $V_s^2$, and $V_s^3$ by element-wise summation. These feature representations are fed into two channel attention modules (CAM) for the final output. The picture in the upper right corner is the detailed structure of CAM.
After the final 3 × 3 convolution operation for feature maps, we obtain the three branches of the feature map: Zo′, Zhc′, and Zvc′. These features are merged for next calculation using
We also introduce a channel attention module (CAM) [36] and propose an ensemble expansion module in deep supervision. Three blocks of spatial fused features $V_s^1$, $V_s^2$, and $V_s^3$ are summed to be a fused feature map, and we connect it to a CAM for extracting the intra-block relations. Simultaneously, the three blocks of spatial fused features $V_s^1$, $V_s^2$, and $V_s^3$ are directly concatenated and passed to another CAM. The output spatial feature is obtained by a sequence of steps.
Specifically, the MRCM is implemented by Equations (11) through (14).
In these equations, σ(·) indicates the sigmoid activation function,
Extraction of spatial features from the DI is greatly affected by speckle noise in SAR images, which causes information loss and accuracy degradation and makes it difficult to build a robust model. Recently, some researchers have pointed out that performing feature extraction in the frequency domain helps suppress speckle noise and improves accuracy [37]. Inspired by one proposed method in the frequency domain [38], we introduce frequency domain information as one of the network branches. Combining frequency feature extraction, which uses DCT with a selection strategy, with the conventional spatial down-sampling approach achieves higher accuracy.
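For readers unfamiliar with the transform, a two-dimensional DCT of a single-channel block can be sketched as follows. The snippet assumes the orthonormal DCT-II, which is a common choice and not necessarily the variant used in this branch; the function names are illustrative, and the channel selection strategy is not reproduced.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)
        ))
    return out


def dct_2d(block):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols[v][u] holds the coefficient for vertical frequency u and horizontal
    # frequency v, so transpose back to [u][v] order.
    return [list(line) for line in zip(*cols)]
```

In the returned array, entry [0][0] is the DC term, and higher indices correspond to higher spatial frequencies, which is where isolated speckle-like fluctuations tend to concentrate.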
In Fig. 5, the length of the reshaped factor is set to 3 × 8 × 8 by a bilinear interpolation operation. To further capture the most significant DCT coefficients from the transform map, we employ a channel selection strategy by calculating

Fig. 5. An overview of the frequency feature extraction. After the discrete cosine transform (DCT), the input is transformed to the feature maps composed of DCT factors, and then the channel selection strategy provides the input to the reshape operation for selection of critical features.

Fig. 6. Illustration of the fully connected RNN, LSTM, and GRU. $V_t^{input}$ denotes the feature vectors, which are transformed into input sequences as $V_T$. $h_t$ indicates the recurrent hidden state of RNN units. In LSTM, $o_t$, $f_t$, $i_t$, $g_t$, and $c_t$ are the output gates, forget gates, input gates, cell gates, and memory cells, respectively. In GRU, the reset gates and update gates are represented by $r_t$ and $z_t$, respectively. $n_t$ is the candidate activation at the present time step.
The final frequency feature V frequency is obtained by element-wise multiplication:
Feedforward neural network structures such as CNNs perform well under the assumption that all inputs are uncorrelated with each other. In the change detection task, however, it is good practice to model the correlation across the time sequence. Unlike CNNs, RNNs have a recurrent hidden state whose activation at each time step depends on the activation at the previous time step. Because RNNs can handle dependent, sequential inputs between times t1 and t2, and their internal state (memory) can process variable-length input sequences, we use three types of RNN architectures: a fully connected RNN, a long short-term memory (LSTM), and a gated recurrent unit (GRU). GRU is the principal architecture in our proposed model. Figure 6 shows the structure of the fully connected RNN, LSTM, and GRU. We can summarize the functionality of each as follows.
1) Fully Connected RNN: Given an input patch of size 3 × r × r that is reshaped to feature vectors $V_t^{input}$ in the time domain, the fully connected RNN updates its recurrent hidden state $h_t$ according to
2) LSTM: The LSTM architecture uses memory cells to store information, making it better at exploiting long range dependences in the sequences [39]. Among the many variants from the original version, we use the following for calculating the activation of LSTM units:
The output gates $o_t$ are updated by
The cell state $c_t$ at time t is updated by
In these equations, $W_{ii}$ and $W_{hi}$ are input-input coefficient matrices and hidden-input weight matrices, respectively. $W_{if}$ and $W_{hf}$ are input-forget coefficient matrices and hidden-forget weight matrices, respectively. $W_{ig}$ and $W_{hg}$ are input-cell coefficient matrices and hidden-cell weight matrices, respectively. The b terms indicate the related biases.
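For reference, a widely used LSTM formulation that matches the weight names above is sketched below. It is an assumption that the variant used here coincides with it. The output-gate weights $W_{io}$ and $W_{ho}$ and the split bias terms are named by analogy and are not listed in the text; $x_t$ stands for the input vector $V_t^{input}$, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}\right) \\
f_t &= \sigma\left(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}\right) \\
g_t &= \tanh\left(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}\right) \\
o_t &= \sigma\left(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```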
3) GRU: The GRU is similar to an LSTM with a forget gate, but with fewer parameters [40]. Because it has no output gate, it exposes its whole state at each time step. We use the following implementation for calculating the activation of GRU units:
The update gate $z_t$ is calculated by
The computation of candidate activation is similar to the fully connected RNN (cf. [18]) given by
Reset gate $r_t$ is updated by
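A single GRU time step can be sketched in plain Python as follows. The snippet assumes a widely used GRU formulation in which the reset gate scales the hidden contribution to the candidate activation; the parameter names (`W_ir`, `b_hn`, and so on) and the dictionary layout are hypothetical, and the variant used in this work may differ in detail.

```python
import math


def _sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def _affine(weight, vec, bias):
    """Matrix-vector product plus bias, with the matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def gru_step(x, h, params):
    """One GRU update of hidden state h for input x.

    params is a dict with hypothetical keys W_ir, W_iz, W_in (input weights),
    W_hr, W_hz, W_hn (hidden weights), and the matching biases b_*.
    """
    r = [_sigmoid(a + b) for a, b in zip(
        _affine(params["W_ir"], x, params["b_ir"]),
        _affine(params["W_hr"], h, params["b_hr"]))]          # reset gate
    z = [_sigmoid(a + b) for a, b in zip(
        _affine(params["W_iz"], x, params["b_iz"]),
        _affine(params["W_hz"], h, params["b_hz"]))]          # update gate
    hidden_part = _affine(params["W_hn"], h, params["b_hn"])
    n = [math.tanh(a + ri * b) for a, ri, b in zip(
        _affine(params["W_in"], x, params["b_in"]), r, hidden_part)]  # candidate
    # Blend the candidate activation with the previous hidden state.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
```

For a bitemporal pair, `gru_step` would be called twice, first with the feature vector for time t1 and then with the one for t2, carrying the returned hidden state from the first call into the second.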
As shown in Fig. 2, $V_s$, $V_f$, and $V_t$ are the outputs of the three feature extraction structures. These outputs are merged directly by concatenating the spatial, frequency, and temporal features. After mapping the feature representations to the sample space, the final softmax layer calculates the probability of unchanged versus changed to generate the change map.
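The fusion and classification step can be sketched as a concatenation followed by one linear layer and a softmax. The single linear layer, the function names, and the class order (unchanged, changed) are assumptions made for illustration; the actual mapping to the sample space may use more layers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_fused(v_s, v_f, v_t, weight, bias):
    """Concatenate the three branch outputs and return class probabilities.

    weight has one row per class (assumed order: unchanged, changed), and each
    row has len(v_s) + len(v_f) + len(v_t) entries.
    """
    fused = list(v_s) + list(v_f) + list(v_t)
    logits = [sum(w * f for w, f in zip(row, fused)) + b for row, b in zip(weight, bias)]
    return softmax(logits)
```

The pixel at the centre of the patch is then labelled with the class of highest probability.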
In most change detection tasks on bitemporal SAR images, the numbers of unchanged and changed pixels are imbalanced, which weakens the performance of the model. Inspired by the application of focal loss in object detection for balancing the effect of positive and negative samples [41], we design a loss function that adds the focal loss to the cross-entropy loss.
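A per-sample version of such a combined loss can be sketched as below. The unweighted sum, the values gamma = 2 and alpha = 0.25 (taken from the focal loss literature), and the use of one alpha for both classes are assumptions; the exact weighting in our loss is not reproduced here.

```python
import math


def combined_loss(p_changed, label, gamma=2.0, alpha=0.25, eps=1e-12):
    """Cross-entropy plus focal loss for one pixel.

    p_changed is the predicted probability of the changed class and label is
    1 (changed) or 0 (unchanged). gamma and alpha are assumed hyperparameters.
    """
    # Probability assigned to the true class, clamped to avoid log(0).
    p_true = p_changed if label == 1 else 1.0 - p_changed
    p_true = min(max(p_true, eps), 1.0)
    cross_entropy = -math.log(p_true)
    # The focal term down-weights pixels that are already classified well.
    focal = alpha * (1.0 - p_true) ** gamma * cross_entropy
    return cross_entropy + focal
```

For a confidently correct pixel the focal term is close to zero, so hard pixels dominate the focal part of the loss while the cross-entropy part keeps a gradient for all pixels.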
SAR dataset and evaluation metric
For this paper, we evaluated our method using four sample sets of bitemporal SAR images [25]. Table 1 provides a summary of the images in the sample datasets.
Table 1. Characteristics of the SAR datasets
The first dataset, named Ottawa, was acquired over the city of Ottawa by the Radarsat-2 SAR sensor in July and August 1997. The images, with a spatial resolution of 10×10 m, are 350×290 pixels and show areas that were once flooded. The dataset was provided by Defence Research and Development Canada (DRDC).
The second dataset contains images around flooded areas of Bern, including Thun, Bern, and the airport that were captured by the European Remote Sensing (ERS)-2 satellite SAR sensor in April and May 1999. Each image is 301×301 pixels with a spatial resolution of 30×30 m. Images of the Aare valley between Bern and Thun were selected to recognize the flooded regions.
The San Francisco dataset was acquired by the European Remote Sensing (ERS)-2 satellite SAR sensor and contains images from the city of San Francisco taken in August 2003 and May 2004. Each image is 256×256 pixels with a spatial resolution of 25×25 m. These images were provided by the European Space Agency (ESA).
The Yellow River dataset contains images of a region of the Yellow River around Dongying City, Shandong Province, China from Radarsat-2 taken in June 2008 and June 2009. Each image is 257×289 pixels with an 8×8 m resolution. Furthermore, two datasets with different noise levels were retained, with sizes of 257×289 and 400×300 pixels, respectively.
In addition, each sample dataset has a reference image as the ground truth, which accurately marks unchanged and changed regions. We employ the reference images to verify our model. Figures 7–10 show the four SAR datasets and their change detection maps.

Fig. 7. Ottawa dataset: (a) July 1997; (b) August 1997; (c) The change detection map.

Fig. 8. Bern dataset: (a) April 1999; (b) May 1999; (c) The change detection map.

Fig. 9. San Francisco dataset: (a) August 2003; (b) May 2004; (c) The change detection map.

Fig. 10. Yellow River dataset: (a) June 2008; (b) June 2009; (c) The change detection map.
To evaluate the accuracy of the proposed model, we use common indices for quantitative analysis: false positives (FP), false negatives (FN), overall error (OE), percentage of correct classification (PCC), and kappa coefficient (KC) [42]. OE is the sum of incorrectly classified pixels: OE = FP + FN.
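These indices can be computed from a predicted change map and the reference map as sketched below. The snippet assumes the usual definitions PCC = (TP + TN) / N and KC = (PCC - PRE) / (1 - PRE), where PRE is the agreement expected by chance; the function name and the flat-list input format are illustrative.

```python
def change_detection_metrics(pred, truth):
    """Compute FP, FN, OE, PCC, and KC from flat lists of 0/1 labels."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    n = tp + fp + fn + tn
    oe = fp + fn
    pcc = (tp + tn) / n
    # Chance agreement from the marginal totals of the two maps.
    pre = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    # Convention chosen here: report KC = 1 when both maps contain one class only.
    kc = (pcc - pre) / (1.0 - pre) if pre != 1.0 else 1.0
    return {"FP": fp, "FN": fn, "OE": oe, "PCC": pcc, "KC": kc}
```

KC is the stricter index on imbalanced maps: a detector that labels every pixel as unchanged can reach a high PCC but has a KC of zero.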
In our experiments, the difference images for the SAR dataset were generated using the log-ratio operator explained in Equation (1). To compare the effects of different DI operators, as shown in Fig. 11, we introduced the subtraction and normal difference methods:

Fig. 11. The difference images generated from three operators. For better processing, we normalized them to the range [0, 255]. The operator in the first row is subtraction; the operators in the second and third rows are the normal difference and log-ratio operators, respectively. (a) Ottawa, (b) Bern, (c) San Francisco, and (d) Yellow River.
To demonstrate the benefits of our proposed method, we compared it to the following state-of-the-art baselines: PCANet [21], DBN [24], NR-ELM [25], CWNN [26], Ms-CapsNet [27], and SAFNet [28]. We used 10000 pixels as training samples, with 7000 changed pixels and 3000 unchanged pixels. On the Bern dataset, we used 1500 pixels as training samples, with 1050 changed pixels and 450 unchanged pixels. We trained the network for 50 epochs using a batch size of 128. Pre-processing techniques were also applied before training the network, and we set the parameters for PCANet and DBN to match those used with our method. The other experiments used the default parameters given in the original papers.
The visualized results of the different change detection methods on the datasets are shown in Fig. 12. For both the San Francisco and Yellow River datasets, Ms-CapsNet produced more erroneous recognitions. Our proposed SFTNet identified changed regions more accurately than the other methods. Tables 2 through 5 display the quantitative results of all methods on the different datasets.

Fig. 12. The visualized comparisons of different change detection methods on the Ottawa (first row), Bern (second row), San Francisco (third row), and Yellow River datasets (last row): (a) Ground truth image. (b) Result from PCANet [21]. (c) Result from DBN [24]. (d) Result from NR-ELM [25]. (e) Result from CWNN [26]. (f) Result from Ms-CapsNet [27]. (g) Result from SAFNet [28]. (h) Result from the proposed SFTNet.
Table 2. The change detection results of different methods on the Ottawa dataset
For the Ottawa dataset in Table 2 (the first row of Fig. 12), CWNN and DBN obtained high FP and FN values, respectively. The proposed SFTNet achieved the highest KC value of all the methods.
For the Bern dataset in Table 3 (the second row of Fig. 12), Ms-CapsNet and PCANet produced final change maps that contain some noisy results and miss many changed regions. Our SFTNet attained a KC value that was 14.42% and 16.09% higher than CWNN and Ms-CapsNet, respectively.
Ms-CapsNet incorrectly generated the largest number of changed areas on the third row in Fig. 12 while our SFTNet made extensive use of the discriminative information between the changed and unchanged pixels on the San Francisco dataset. SFTNet’s KC value was 70.03% higher than Ms-CapsNet as shown in Table 4.
Table 3. The change detection results of different methods on the Bern dataset
Table 4. The change detection results of different methods on the San Francisco dataset
The Yellow River dataset has much stronger speckle noise. As shown in Fig. 12, Ms-CapsNet introduced many small noise areas when generating the final change map. SFTNet performed well even with the noise. As shown in Table 5, SFTNet achieved KC values that were 17.28%, 7.39%, and 23.54% higher than those from NR-ELM, CWNN, and Ms-CapsNet, respectively.
Table 5. The change detection results of different methods on the Yellow River dataset
Time cost is an important factor for the effective application of deep learning-based methods in change detection tasks. Table 6 shows the time consumption of our proposed SFTNet along with the other methods. NR-ELM, Ms-CapsNet, DBN, and PCANet required less time because of their relatively simple models. However, compared with CWNN and SAFNet, SFTNet exhibited high efficiency. We also highlight that SFTNet deliberately trains on a small number of samples through the feature extraction network, which ensures model performance while controlling the computing time.
Table 6. The time cost of different change detection methods
Our proposed model has three main branches: spatial feature extraction, frequency feature extraction, and temporal feature extraction. We designed alternative methods by removing one of these branches and using a different loss function to verify the effectiveness of the model. The densely connected CNN refers to the modified UNet++ without the MRCM. Table 7 shows the necessity of each part of SFTNet, with each branch affecting the performance of the model.
Table 7. The ablation study of SFTNet, with performance evaluated by PCC
The experimental results in Parts 3.3 and 3.4 show that the proposed method not only was efficient at feature extraction but also distinguished changed pixels better than other methods. We now discuss the parameters and architectures that affect the accuracy of the model.
Selection of the patch size
To evaluate the correlation between the patch size and the performance of our method, we set r = 5, 7, 9, 11, 13, and 15, in turn. Figure 13 shows the relationship curve between r and PCC values on the SAR datasets. The PCC values increase first and then stabilize, which shows that the contextual information is critical for SAR image change detection. The end of the curve makes it evident that a larger patch size helps to decrease the computational cost but may introduce noise that affects the results.

Fig. 13. The relationship between patch size r and PCC values on the SAR datasets.
We assessed three types of RNN architectures in the temporal feature extraction branch of SFTNet for comparative analysis. The results and statistics of these architectures, including fully connected RNN, LSTM, and GRU on the SAR dataset are shown below.
In Fig. 14, according to the mean accuracy of the four SAR datasets, GRU and LSTM architectures achieve the first- and second-best results, respectively. The fully connected RNN achieves the lowest accuracy.

Fig. 14. The variation of the PCC values for different RNN architectures on the SAR dataset.
In this section, we examine the performance of three different DI operators during preclassification for comparative analysis. The log-ratio operator has the best performance, as shown in Fig. 15. The normal difference operator was second best on the Ottawa, Bern, and Yellow River datasets. The subtraction operator performs worst. Due to the influence of speckle noise, the subtraction operator incorrectly detects small changed regions. Its worst result was on the Yellow River dataset, at 71.13%.

Fig. 15. The variation of the PCC values for different DI operators on the SAR datasets.
In this paper, we present a method for performing SAR image change detection with higher accuracy than existing methods. Our SFTNet alleviates speckle noise and comprehensively considers semantic information when detecting changes by using an ensemble MRCM that captures multidimensional fusion features using three parallel network branches. The inclusion of an improved UNet++ and a new ensemble module extracts crucial spatial feature information between different channels. We use DCT and a selection strategy to capture the frequency features. An RNN architecture obtains the temporal feature relationship. Experimental results on four datasets demonstrate the superior performance of SFTNet compared to existing state-of-the-art methods.
SAR sensor applications typically involve high-resolution images of large areas over long periods of time. Our SFTNet manages noise well and is suitable for change detection in SAR images. In future work, we plan to enhance our structure and verify that our method works well in other scenarios.
