A spatial-frequency-temporal feature extraction network for change detection in synthetic aperture radar images

Abstract

Change detection in synthetic aperture radar (SAR) images is an important part of remote sensing (RS) image analysis. Contemporary researchers have concentrated on the spatial and deep-layer semantic information while giving little attention to the extraction of multidimensional and shallow-layer feature representations. Furthermore, change detection relies on patch-wise training and pixel-to-pixel prediction while the accuracy of change detection is sensitive to the introduction of edge noise and the availability of original position information. To address these challenges, we propose a new neural network structure that enables spatial-frequency-temporal feature extraction through end-to-end training for change detection between SAR images from two different points in time. Our method uses image patches fed into three parallel network structures: a densely connected convolutional neural network (CNN), a frequency domain processing network based on a discrete cosine transform (DCT), and a recurrent neural network (RNN). Multi-dimensional feature representations alleviate speckle noise and provide comprehensive consideration of semantic information. We also propose an ensemble multi-region-channel module (MRCM) to emphasize the central region of each feature map, with the most critical information in each channel employed for binary classification. We validate our proposed method on four benchmark SAR datasets. Experimental results demonstrate the competitive performance of our method.

Keywords

Change detection, SFTNet, feature extraction, synthetic aperture radar (SAR) images, deep learning, neural network

Abbreviations

The abbreviations in this paper are as follows:

RS: Remote Sensing

SAR: Synthetic Aperture Radar

SFTNet: Spatial-Frequency-Temporal feature extraction Network

MRCM: Multi-Region-Channel Module

CNN: Convolutional Neural Network

FC: Fully Connected Layer

RNN: Recurrent Neural Network

LSTM: Long Short-Term Memory

GRU: Gated Recurrent Unit

DCT: Discrete Cosine Transform

DI: Difference Image

FCM: Fuzzy C-Means Algorithm [31]

CAM: Channel Attention Module [36]

MLP: Multilayer Perceptron

FP: False Positives

FN: False Negatives

OE: Overall Error

PCC: Percentage of Correct Classification

Kappa Coefficient

PCANet: Principal Component Analysis Network [21]

DBN: Deep Belief Network [24]

NR-ELM: Neighborhood-based Ratio and Extreme Learning Machine [25]

CWNN: Convolutional-Wavelet Neural Networks [26]

Ms-CapsNet: Multiscale Capsule Network [27]

SAFNet: Siamese Adaptive Fusion Network [28]

1 Introduction

Remote sensing (RS) image change detection is a technology for identifying changed regions in images of the same scene at different times. RS change detection is commonly used when analyzing synthetic aperture radar (SAR) data, multispectral data, and hyperspectral data. The recent and rapid growth in SAR sensor capabilities has led scholars to pay more attention to SAR remote sensing images, which are not easily affected by light and atmospheric conditions and offer all-weather and all-day use along with multiple band and polarization features. SAR sensors have extensive uses in many fields, such as mapping of the environment and natural resources [1, 2], urban studies [3–5], and natural disaster assessment [6, 7]. However, the imaging device of the radar sensor is affected by wave interference, resulting in large amounts of speckle noise and the loss of original surface information in typical SAR images [8]. Effectively suppressing noise while obtaining high-precision detection results is still a challenging task.

Based on the available remote sensing image information, change detection algorithms use supervised, semi-supervised, and unsupervised approaches to enable the recognition of pixels in changed regions. The supervised method requires manually labeled pixels for changed and unchanged regions to train the classifier models [9, 10]. The semi-supervised method uses a mix of manually labeled samples and unlabeled data to train the classifier models [11, 12]. Supervised and semi-supervised methods are more efficient and accurate than unsupervised methods because they feed labeled data into the model for training. Nonetheless, because manually labeling data is difficult, unsupervised change detection methods have been widely adopted [13–15].

Change detection algorithms generally employ three main steps: 1) image pre-processing; 2) difference image (DI) production; and 3) classification of the DI to find changed and unchanged pixels. Image pre-processing includes radiometric correction, geometric registration, and noise reduction. Representative difference features help obtain high-precision change detection results. Subtraction, log-ratio, and normal difference methods have become the mainstream algorithms for constructing DIs. The third step, classification, has received significant research interest. Thresholding [16, 17] and clustering [18, 19] have been broadly studied for recognizing changed and unchanged areas using a DI. Among them, the fuzzy c-means (FCM) algorithm obtains good performance and has been extended by other related methods [20]. Nevertheless, these approaches extract few of the rich features available in the image information, and the classification models still have room to improve.

SAR image change detection results have improved remarkably in conjunction with the rapid growth of deep neural network research. Principal component analysis (PCA) has been used to establish the parameters of convolution filters while retaining the hierarchical architecture of a traditional convolutional neural network (CNN). The resulting network is a simple deep learning model that requires fewer training samples than CNNs [21]. Li et al. proposed a principal component analysis network (PCANet) guided by context-aware saliency detection to extract training samples available in SAR images [22]. A CNN using supervised learning has been applied to learn SAR image features [23]. A deep belief network (DBN) was employed to obtain spatial characteristics in SAR images [24]. These studies still have two significant issues. The first is the lack of mutual reinforcement between multidimensional and shallow-layer features. Most of the models have been constructed on the basis of spatial and deep-layer semantic information and rarely involve multidimensional feature fusion. The second is the inability to suppress edge noise and exploit original position information. Patch-wise features are widely considered a good basis for binary classification, but the noisy features introduced by each patch and the loss of original image features are difficult to handle.

Neighborhood-based ratio (NR) and extreme learning machine (ELM) methods have also been applied in SAR image change detection to reduce the loss of information in the DI [25]. One researcher presented a convolutional-wavelet neural network (CWNN) [26] to reduce speckle noise effectively. Gao et al. introduced a multiscale capsule network (Ms-CapsNet) for aggregating spatial features from different positions [27]. Another researcher presented a Siamese Adaptive Fusion Network (SAFNet) [28] to avoid error gradient accumulation and achieve operations among various scales of feature maps for change detection. Li et al. imported deep translation results to a supervised change detection network for optical and SAR images [29]. However, developing a robust model to detect changes in SAR images that can address the preceding challenges is a non-trivial problem.

In this paper, we propose a Spatial-Frequency-Temporal Feature Extraction Network (SFTNet) that jointly employs a spatial domain processing module containing a densely connected CNN, a frequency domain processing module based on a discrete cosine transform (DCT), and a temporal domain processing module with a recurrent neural network (RNN) for SAR image change detection. The extracted features are integrated into the same network structure as three branches for inference. This method mitigates the influence of noise by taking advantage of the feature representations. We further propose an ensemble multi-region-channel module (MRCM) that emphasizes the central region of each patch and enables most of the contextual information in each channel to be used for efficient changed pixel detection.

The contributions of this work are as follows.

To the best of our knowledge, ours is the first approach to use an end-to-end SFTNet to implement multi-dimensional feature extraction for SAR image change detection.

Our MRCM exploits multi-level semantic features, alleviating speckle noise and providing comprehensive consideration of contextual information.

The experimental results of our proposed method are superior to other state-of-the-art (SOTA) methods when using four real SAR datasets.

We organize our paper as follows. We introduce SAR image change detection in Section 1 and then introduce each component of our solution in Section 2, with detailed structures of SFTNet given in Section 2.2. Section 3 describes our implementation results and analysis. Section 4 presents the parameters and architectures that affect the accuracy of the model. Finally, Section 5 contains our conclusions and future prospects for our work.

2 Materials and methods

As described in the introduction, a common strategy used to study change detection is to produce a difference image (DI) from bitemporal images (i.e., images taken at two different times) with the DI operator. Our method then uses an appropriate classification model to accurately locate changed and unchanged pixels, labelled as “1” or “0”, to generate the final change detection map. The flow of this approach is illustrated in Fig. 1. Overall, it consists of three main steps: preclassification, training of the SFTNet model, and final generation of the change map.

2.1 Preclassification

Fig. 1

The procedure of our proposed approach.

The main purpose of preclassification is to recognize samples from the DI that have higher probabilities of being correctly classified into unchanged or changed classes. Subtraction and log-ratio are two mainstream DI operators frequently applied for SAR image change detection [30]. The log-ratio method used here is expressed as

$I_{d} = | \log_{10} (\frac{I_{2} + 1}{I_{1} + 1}) |,$ (1) where I₁ and I₂ are two SAR images covering the same geographical region at different times, and I_d is the resulting log-ratio DI.
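As a minimal sketch of Equation (1), the log-ratio DI can be computed element-wise with NumPy. The function name `log_ratio_di` is hypothetical, and the sketch assumes the two images are co-registered arrays of non-negative intensities with the same shape.

```python
import numpy as np

def log_ratio_di(i1: np.ndarray, i2: np.ndarray) -> np.ndarray:
    """Log-ratio difference image of Equation (1) for two co-registered images."""
    i1 = np.asarray(i1, dtype=np.float64)
    i2 = np.asarray(i2, dtype=np.float64)
    # The +1 offset keeps the ratio finite where a pixel intensity is 0.
    return np.abs(np.log10((i2 + 1.0) / (i1 + 1.0)))

# Toy 2 x 2 example: pixels whose intensity changes by a factor of 10 get a DI value of 1.
i1 = np.array([[0, 9], [99, 9]])
i2 = np.array([[0, 99], [9, 9]])
di = log_ratio_di(i1, i2)
```

Because of the absolute value, an increase and a decrease by the same ratio produce the same DI value, so the DI marks where a change occurred but not its direction.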

In prior work [31], Gabor features were extracted and the fuzzy c-means (FCM) algorithm was first used to divide the log-ratio image into three separate clusters: unchanged, changed, and intermediate classes denoted as w_u, w_c, and w_i, respectively. We obtain these clusters by minimizing the objective in Equation (2) subject to the constraints in Equations (3) and (4).

$J_{m} (U, V) = \sum_{i = 1}^{c} \sum_{j = 1}^{MN} u_{ij}^{m} \| x_{j} - v_{i} \|^{2}$ (2)

$\text{s.t.} \quad u_{ij} \in [0, 1], \quad \sum_{i = 1}^{c} u_{ij} = 1 \quad \forall j$ (3)

$0 < \sum_{j = 1}^{MN} u_{ij} < MN \quad \forall i$ (4)

In Equations (2) through (4), m ∈ [1, +∞) denotes the fuzziness degree, $U = [u_{ij}]_{c \times MN}$ represents a partition matrix with u_ij being the membership grade of the j^th pixel in cluster i, and $V = [v_{1}, v_{2}, v_{3}]$ denotes the vector of cluster centroids. $J_{m}$ in Equation (2) can be iteratively optimized by alternately updating u_ij and v_i until convergence [32]. The centroid of the i^th cluster is thus calculated by

$v_{i}^{(t + 1)} = \frac{\sum_{j = 1}^{MN} {(u_{ij}^{(t)})}^{m} x_{j}}{\sum_{j = 1}^{MN} {(u_{ij}^{(t)})}^{m}},$ (5) where the partition matrix $U^{(0)}$ is initialized by setting c = 3, m = 2, and t = 0. The membership grade u_ij is updated by

$u_{ij}^{(t + 1)} = \frac{\| x_{j} - v_{i}^{(t + 1)} \|^{- 2 / (m - 1)}}{\sum_{r = 1}^{c} \| x_{j} - v_{r}^{(t + 1)} \|^{- 2 / (m - 1)}},$ (6) where the updates in Equations (5) and (6) are repeated with t = t + 1 until convergence. label(·) denotes the assignment of the pixels into the classes $\{w_{u}^{1}, w_{i}, w_{c}^{1}\}$ by

$label (Y_{d}^{l \in Ω_{p}}) = \begin{cases} w_{u}^{1}, & p = \arg\min_{i = 1, 2, 3} M_{Ω_{i}} \\ w_{c}^{1}, & p = \arg\max_{i = 1, 2, 3} M_{Ω_{i}} \\ w_{i}, & \text{otherwise} \end{cases},$ (7) where Ω_i (i = 1, 2, 3) indicates the three clusters distinguished by the highest degree of membership of each pixel from $U$, and $M_{Ω_{i}} = (1 / | Ω_{i} |) \sum_{l \in Ω_{i}} Y_{d}^{l}$ denotes the average value (i.e., the mean) of Y_d in cluster Ω_i.
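The alternating updates of Equations (5) and (6) and the labeling rule of Equation (7) can be sketched as follows. This is an illustrative simplification rather than the exact procedure of [31]: it clusters raw DI intensities instead of Gabor features, and it starts from evenly spaced centroids (an assumption made here so that the result is deterministic) rather than from an initial partition matrix. All function names are hypothetical.

```python
import numpy as np

def memberships(x, v, m):
    """Equation (6): membership of every sample in every cluster."""
    dist = np.abs(x[None, :] - v[:, None]) + 1e-12  # guard against zero distance
    w = dist ** (-2.0 / (m - 1.0))
    return w / w.sum(axis=0, keepdims=True)         # columns sum to 1, Equation (3)

def fcm(x, c=3, m=2.0, n_iter=100, tol=1e-6):
    """Fuzzy c-means on a 1-D sample vector x; returns (U, v)."""
    v = np.linspace(x.min(), x.max(), c)            # assumed deterministic start
    u = memberships(x, v, m)
    for _ in range(n_iter):
        um = u ** m
        v = (um @ x) / um.sum(axis=1)               # Equation (5)
        u_new = memberships(x, v, m)
        converged = np.max(np.abs(u_new - u)) < tol
        u = u_new
        if converged:
            break
    return u, v

def preclassify(di):
    """Equation (7): 0 = unchanged, 1 = changed, 2 = intermediate."""
    x = np.asarray(di, dtype=np.float64).ravel()
    u, _ = fcm(x)
    hard = u.argmax(axis=0)                         # cluster with the highest membership
    means = np.array([x[hard == i].mean() if np.any(hard == i) else np.nan
                      for i in range(3)])
    labels = np.full(x.size, 2)
    labels[hard == np.nanargmin(means)] = 0         # lowest mean DI value: unchanged
    labels[hard == np.nanargmax(means)] = 1         # highest mean DI value: changed
    return labels.reshape(np.shape(di))

di = np.array([0.0, 0.02, 0.04, 0.5, 0.52, 0.96, 0.98, 1.0])
labels = preclassify(di)
```

On this toy vector, the three low values are labeled unchanged, the three high values changed, and the two middle values fall into the intermediate class.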

Since pixels belonging to w_u and w_c have a higher probability of being truly unchanged or changed, only 10% of the samples in w_u and w_c from each image patch are randomly selected as training data for SFTNet. All pixels in w_i are then further classified by the trained network.

2.2 SFTNet model

The framework of the proposed method is illustrated in Fig. 2. The feature extraction network consists of spatial feature extraction, frequency feature extraction, and temporal feature extraction. The spatial feature extraction branch consists of a densely connected CNN and MRCM. The densely connected CNN works well to capture spatial semantics features, and the MRCM enhances the use of central location features and contextual information in each channel along with essential feature representations. The frequency feature extraction branch uses a frequency domain processing network that employs DCT and a selection strategy. The critical reshaped DCT coefficients are fed to a fully connected (FC) layer for inference as features. The temporal feature extraction branch adopts an RNN to analyze temporal dependence in SAR images effectively. The entire process works in an unsupervised manner.

Fig. 2

Illustration of the Spatial-Frequency-Temporal Network (SFTNet). The network has three main parts: spatial feature extraction, frequency feature extraction, and temporal feature extraction.

We now demonstrate how to achieve mutual reinforcement from all extracted features when performing further binary classification. In particular, the input patches are fusions of the original images and the DI, with a size of 3 × r × r. We set r = 7 in our work. Each input patch is fed into the SFTNet as the sample for one pixel and contains feature information of the input images I₁ and I₂ and the difference image I_d. The ultimate goal is to generate a binary change detection map through the final classification.
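The construction of the 3 × r × r input patches can be sketched as below. The function name `extract_patches` is hypothetical, and the reflect padding at the image border is an assumption, since the border handling is not specified here.

```python
import numpy as np

def extract_patches(i1, i2, di, r=7):
    """Return an array of shape (H*W, 3, r, r) with one patch centered on each pixel."""
    pad = r // 2
    stack = np.stack([i1, i2, di]).astype(np.float32)        # (3, H, W)
    # Reflect padding (assumed) so that border pixels also get a full r x r neighborhood.
    padded = np.pad(stack, ((0, 0), (pad, pad), (pad, pad)), mode="reflect")
    h, w = np.shape(i1)
    patches = np.empty((h * w, 3, r, r), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            patches[y * w + x] = padded[:, y:y + r, x:x + r]
    return patches

i1 = np.arange(25.0).reshape(5, 5)
i2 = i1 + 1.0
di = 0.5 * i1
patches = extract_patches(i1, i2, di)
```

The center element of each patch is the pixel being classified, so the three channels at position (r // 2, r // 2) reproduce I₁, I₂, and I_d at that pixel.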

2.2.1 Spatial feature extraction

Inspired by the effective use of multiscale features by UNet++ [33], we combine a densely connected CNN with MRCM for spatial feature extraction. UNet++ is an architecture that takes advantage of an efficient ensemble of U-Nets of varying depths that share an encoder and yields a highly flexible feature fusion scheme as the skip connections aggregate features of varying semantic scales in the decoder sub-networks. The combination of CNN and MRCM discards pixels around edge areas to make use of central contextual information and fuses feature representations from different semantic levels and spatial positions.

The original UNet++ was not suitable for our work even though it has been proven to be effective in image segmentation. If the image patches are sent to the network directly, the changed information of some small parts of SAR images would be lost after sampling by the layer structure. To address this problem, we reduce the number of sampling layers and revise the output nodes of UNet++ to form the densely connected CNN. Figure 3 shows the complete flow of the densely connected CNN. Each input patch is fed into the network and down-sampled. Each constituent block is then restored to its original size by up-sampling with a sub-decoder. Skip-connections transmit fine-grained spatial features to two sub-decoders, effectively applying shallow localization information in the deeper layers. In addition, to further use these features, the outputs of the blocks in the shallow sub-decoder are combined with the blocks of the same size in the deeper sub-decoder.

Fig. 3

Illustration of the densely connected CNN. Part (a) is the backbone of the densely connected CNN. The six cubes denote identical convolution blocks, which are connected through down-sampling, up-sampling, and skip-connection operations. The blue arrows indicate the feature representations produced by the encoder and sub-decoders after training, which are input into the multi-region-channel module (MRCM). Part (b) is the specific structure of a convolution block.

In Fig. 3(a), each block B^i,j denotes a convolution block. Inspired by the residual unit structure that contains convolution, max pooling, and batch normalization operations along with activation functions [34], we consider a convolution block defined by

$y = F (x, \{W_{i}\}) + x,$ (8) where x and y are the input and output vectors of the block considered. The function F(x, {W_i}) denotes the stacked nonlinear layers to be learned. For the example in Fig. 3(b) that has two layers, F = W₂σ(W₁x), with σ representing ReLU [35]. To simplify the notation, F(·) omits the biases. The operation F(·) + x is performed by a skip-connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y)).
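To make the order of operations in Equation (8) concrete, the following sketch implements the two-layer case F = W₂σ(W₁x) with dense matrices standing in for the convolutions. This is an assumption made for brevity; batch normalization and pooling are omitted, and the function names are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def residual_block(x, w1, w2):
    """Two-layer residual unit: sigma(F(x) + x) with F = W2 sigma(W1 x), biases omitted."""
    f = w2 @ relu(w1 @ x)   # stacked nonlinear layers F(x, {W_i})
    y = f + x               # skip connection with element-wise addition, Equation (8)
    return relu(y)          # second nonlinearity applied after the addition

x = np.array([1.0, -2.0])
out_identity = residual_block(x, np.eye(2), np.eye(2))          # F(x) = relu(x)
out_zero = residual_block(x, np.zeros((2, 2)), np.zeros((2, 2)))  # F(x) = 0
```

With zero weights the block reduces to σ(x), which illustrates why the skip connection lets shallow information pass through even when F contributes nothing.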

We use b^i,j to denote the output of block B^i,j where i is the down-sampling layer in the encoder and j is the convolution layer of the dense block along the skip connection. The stack of feature maps is computed by

$b^{i, j} = \begin{cases} H (D (b^{i - 1, j})), & j = 0 \\ H ([{[b^{i, k}]}_{k = 0}^{j - 1}, U (b^{i + 1, j - 1})]), & j > 0 \end{cases},$ (9) where $H (\cdot)$ denotes the convolution block operation. We use 16 feature map channels initially. $D (\cdot)$ denotes the down-sampling operation implemented by a 2 × 2 max pooling layer. $U (\cdot)$ denotes the up-sampling operation implemented by a transposed convolution. The operator [·] denotes the concatenation operation, with blocks concatenated in the channel dimension. In Fig. 3, blocks at level j = 0 receive a single input from the preceding encoder layer, whereas blocks at level j > 0 receive j + 1 inputs: the first j inputs are the outputs of the previous blocks along the same skip connection, and the (j + 1)^th input is the up-sampled output from the lower skip connection.

As shown in Fig. 4, the MRCM is designed to capture crucial feature information between different channels automatically. Given an input node $B^{0, j} \in ℝ^{16 \times r \times r} (j = 0, 1, 2)$ , the input block is fed into different convolution layers to generate three feature maps: $Z_{o^{'}} \in ℝ^{C \times r \times r}$ , $Z_{{hc}^{'}} \in ℝ^{C \times r \times r}$ , and $Z_{{vc}^{'}} \in ℝ^{C \times r \times r}$ . These three branches constitute Z and have a size of $\frac{C}{3} \times r \times r$ according to the channel C. Z_o indicates the overall zone of feature representations and is used to retain the original contextual information. Z_hc contains the horizontal central zone of the feature representations. We keep the horizontal middle zone features and set the other elements to 0 to remove the pixels of the top and bottom rows. Z_vc contains the vertical central zone of the feature representations. We keep the vertical middle zone features and set the other elements to 0 to remove the pixels of the left and right columns. We set r = 7 and C = 15 in this implementation to pay attention to the center zone.
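The zeroing of rows and columns that defines Z_hc and Z_vc can be sketched with binary masks. The number of rows or columns removed on each side is not stated above, so the parameter `margin` below is hypothetical, as are the function names. For simplicity the sketch applies the masks to a single tensor rather than to the outputs of separate convolution layers.

```python
import numpy as np

def central_zone_masks(r=7, margin=1):
    """Binary masks that keep the horizontal or the vertical middle zone of an r x r map."""
    hc = np.zeros((r, r))
    hc[margin:r - margin, :] = 1.0   # zero out the top and bottom rows
    vc = np.zeros((r, r))
    vc[:, margin:r - margin] = 1.0   # zero out the left and right columns
    return hc, vc

def split_zones(z, margin=1):
    """Return (Z_o, Z_hc, Z_vc) for a feature map z of shape (C, r, r)."""
    hc, vc = central_zone_masks(z.shape[-1], margin)
    return z, z * hc, z * vc         # masks broadcast over the channel dimension

z = np.ones((5, 7, 7))
z_o, z_hc, z_vc = split_zones(z)
```

The overall zone Z_o is left untouched, so the original contextual information is retained alongside the two center-weighted views.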

Fig. 4

Graphic model of the multi-region-channel module (MRCM). Input B^0,j contains three blocks B^0,0, B^0,1, and B^0,2, from which we obtain three feature maps $V_{s_{1}}$, $V_{s_{2}}$, and $V_{s_{3}}$ by element-wise summation. These feature representations are fed into two channel attention modules (CAM) for the final output. The picture in the upper right corner is the detailed structure of CAM.

After the final 3 × 3 convolution operation for feature maps, we obtain the three branches of the feature map: Z_o′, Z_hc′, and Z_vc′. These features are merged for next calculation using

$V_{s_{k}} = Z_{o^{'}} + Z_{{hc}^{'}} + Z_{{vc}^{'}},$ (10) where $V_{s_{k}} \in ℝ^{\frac{C}{3} \times r \times r} (k = 1, 2, 3)$ indicates the spatial fused features from the block B^0,j, with $V_{s_{1}}$, $V_{s_{2}}$, and $V_{s_{3}}$ transformed into a 5 × 7 × 7 shape.

We also introduce a channel attention module (CAM) [36] and propose an ensemble expansion module in deep supervision. The three blocks of spatial fused features $V_{s_{1}}$, $V_{s_{2}}$, and $V_{s_{3}}$ are summed into a fused feature map, which we connect to a CAM for extracting the intra-block relations. Simultaneously, the same three blocks are directly concatenated and fed to another CAM. The output spatial feature is obtained by a sequence of steps.

Specifically, the MRCM is implemented by Equations (11) through (14).

$M_{CAM} (f) = σ (W_{1} (W_{0} (f_{avg})) + W_{1} (W_{0} (f_{max})))$ (11)

$V_{fusion} = V_{s_{1}} + V_{s_{2}} + V_{s_{3}}$ (12)

$V_{ensemble} = [V_{s_{1}}, V_{s_{2}}, V_{s_{3}}]$ (13)

$\begin{matrix} ECAM (V_{fusion}, V_{ensemble}) \\ = ({repeat}_{(3)} (M_{CAM} (V_{fusion})) \\ \oplus V_{ensemble}) \otimes M_{CAM} (V_{ensemble}) \end{matrix}$ (14)

In these equations, σ(·) indicates the sigmoid activation function, $W_{0} \in ℝ^{C / r \times C}$ and $W_{1} \in ℝ^{C \times C / r}$ are weights of the multilayer perceptron (MLP). The operator [·] denotes the concatenation of three feature blocks, the repeat_(n)(·) function denotes the operation of copying a feature map and concatenating it in the channel dimension n times. ⊕ and ⊗ are element-wise addition and multiplication, respectively.
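The shapes involved in Equations (11) through (14) can be checked with a short NumPy sketch. It follows the equations as written, with global average and max pooling over the spatial dimensions to obtain f_avg and f_max (our reading of the CAM in [36]). The function names, the hidden sizes of the two MLPs, and the zero weights used in the example are assumptions chosen only to make the output easy to verify.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cam(f, w0, w1):
    """Equation (11): channel attention weights for a feature map f of shape (C, r, r)."""
    f_avg = f.mean(axis=(1, 2))                      # global average pooling
    f_max = f.max(axis=(1, 2))                       # global max pooling
    return sigmoid(w1 @ (w0 @ f_avg) + w1 @ (w0 @ f_max))

def ecam(vs, fusion_weights, ensemble_weights):
    """Equations (12)-(14) for three fused feature blocks vs[k] of shape (C/3, r, r)."""
    v_fusion = vs[0] + vs[1] + vs[2]                 # Equation (12), element-wise sum
    v_ensemble = np.concatenate(vs, axis=0)          # Equation (13), channel concatenation
    m_fusion = cam(v_fusion, *fusion_weights)        # shape (C/3,)
    m_ensemble = cam(v_ensemble, *ensemble_weights)  # shape (C,)
    repeated = np.tile(m_fusion, 3)                  # repeat_(3): copy along the channels
    return (repeated[:, None, None] + v_ensemble) * m_ensemble[:, None, None]

vs = [np.full((5, 7, 7), k) for k in (1.0, 2.0, 3.0)]
fusion_weights = (np.zeros((2, 5)), np.zeros((5, 2)))      # hypothetical hidden size 2
ensemble_weights = (np.zeros((4, 15)), np.zeros((15, 4)))  # hypothetical hidden size 4
out = ecam(vs, fusion_weights, ensemble_weights)
```

With all MLP weights set to zero, both attention vectors equal 0.5, so every output element is 0.5 × (0.5 + input), which makes the broadcasting in Equation (14) easy to inspect.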

2.2.2 Frequency feature extraction

Extraction of spatial features from the DI is greatly affected by speckle noise in SAR images, which causes information loss and accuracy degradation and makes it difficult to build a robust model. Recently, some researchers have pointed out that performing feature extraction in the frequency domain helps suppress speckle noise and improves accuracy [37]. Inspired by one proposed method in the frequency domain [38], we introduce frequency domain information as one of the network branches. The combination of frequency feature extraction using DCT with a selection strategy and the conventional spatial down-sampling approach achieves higher accuracy.

Fig. 5

An overview of the frequency feature extraction. After the discrete cosine transform (DCT), the input is transformed into feature maps composed of DCT factors, and then the channel selection strategy provides the input to the reshape operation for selection of critical features.

In Fig. 5, the size of the reshaped factor is set to 3 × 8 × 8 by a bilinear interpolation operation. To further capture the most significant DCT coefficients from the transform map, we employ a channel selection strategy with two linear transformations, a linear information vector and an attention control gate, calculated by

$V_{f_{1}} = W^{l} r + b^{l},$ (15) where $V_{f_{1}}$ denotes the linear information vector, and W^l and b^l are the weight matrix and bias in the linear transformation, respectively. The attention control gate $V_{f_{2}}$ applies the same form of transformation as $V_{f_{1}}$, followed by a sigmoid:

$V_{f_{2}} = σ (W^{a} r + b^{a}),$ (16) where σ(·) denotes the sigmoid function. W^a and b^a are the weight matrix and bias in the linear transformation, respectively.

The final frequency feature V_frequency is obtained by element-wise multiplication:

$V_{frequency} = V_{f_{1}} ⊙ V_{f_{2}},$ (17) where ⊙ denotes the element-wise multiplication.

Fig. 6

Illustration of the fully connected RNN, LSTM, and GRU. $V_{t_{input}}$ denotes the feature vectors, which are transformed into input sequences as V^T. h_t indicates the recurrent hidden state of RNN units. In LSTM, o_t, f_t, i_t, g_t, and c_t are the output gates, forget gates, input gates, cell gates, and memory cells, respectively. In GRU, the reset gates and update gates are represented by r_t and z_t, respectively. n_t is the candidate activation at the present time step.
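The frequency branch of Equations (15) through (17) can be sketched as a DCT followed by a gated linear unit. In this sketch, r is taken to be the flattened 3 × 8 × 8 DCT coefficients, which is our reading of the reshape step. The orthonormal DCT-II normalization, the output width `n_out`, the random weights, and all function names are assumptions for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis of size n x n (the normalization is an assumption)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0] /= np.sqrt(2.0)
    return d

def dct2(block):
    """2-D DCT of one square channel."""
    d = dct_matrix(block.shape[0])
    return d @ block @ d.T

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def frequency_feature(r, w_l, b_l, w_a, b_a):
    v_f1 = w_l @ r + b_l             # Equation (15): linear information vector
    v_f2 = sigmoid(w_a @ r + b_a)    # Equation (16): attention control gate
    return v_f1 * v_f2               # Equation (17): element-wise product

rng = np.random.default_rng(0)
patch = rng.random((3, 8, 8))                        # stand-in for a resized input patch
r = np.stack([dct2(ch) for ch in patch]).ravel()     # 3 * 8 * 8 = 192 coefficients
n_out = 16                                           # hypothetical feature width
w_l = rng.standard_normal((n_out, r.size)) * 0.01
w_a = rng.standard_normal((n_out, r.size)) * 0.01
v_frequency = frequency_feature(r, w_l, np.zeros(n_out), w_a, np.zeros(n_out))
```

Because the gate lies in (0, 1), it scales each component of the linear vector and thereby acts as a soft selection over the DCT-derived features.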

2.2.3 Temporal feature extraction

Feedforward neural network structures such as CNNs offer great performance under the assumption that all inputs are uncorrelated with each other. Nevertheless, it is good practice to determine the correlation from processing time sequences in the change detection task. Unlike CNNs, RNNs have a recursive implicit state where the activation at each time step depends on the activation at the previous time step. Considering that RNNs are capable of handling dependent and sequential inputs between times t₁ and t₂ and that their internal state (memory) can process variable-length sequences of inputs, we use three types of RNN architectures: a fully connected RNN, a long short-term memory (LSTM), and a gated recurrent unit (GRU). GRU is the principal architecture in our proposed model. Figure 6 shows the structure of the fully connected RNN, LSTM, and GRU. We can summarize the functionality of each as follows.

1) Fully Connected RNN: Given an input patch of size 3 × r × r that is reshaped to feature vectors $V_{t_{input}}$ in the time domain, the fully connected RNN updates its recurrent hidden state h_t according to

$h_{t} = \begin{cases} 0, & \text{if } t = 0 \\ φ (U h_{t - 1} + W V_{t_{input}}), & \text{otherwise} \end{cases},$ (18) where φ(·) is a nonlinear activation function, such as the logistic sigmoid function or the hyperbolic tangent (tanh) function. U and W denote the coefficient matrices for the activation of the recurrent hidden units h_t-1 at the previous time step and for the input feature vectors $V_{t_{input}}$ at the present time step, respectively.

2) LSTM: The LSTM architecture uses memory cells to store information, making it better at exploiting long range dependences in the sequences [39]. Among the many variants from the original version, we use the following for calculating the activation of LSTM units:

$h_{t} = o_{t} \tanh (c_{t}),$ (19) where h_t denotes the hidden state at time t, o_t represents the output gates that control the amount of memory content exposure, and c_t indicates the cell state at time t. tanh(·) is the hyperbolic tangent function.

The output gates o_t are updated by

$o_{t} = σ (W_{io} V_{t_{input}} + b_{io} + W_{ho} h_{t - 1} + b_{ho}),$ (20) where σ(·) denotes the sigmoid function, $V_{t_{input}}$ is the input at time t, and h_t-1 denotes the hidden state at time t - 1. W_io and W_ho are input-output coefficient matrices and hidden-output weight matrices, respectively. The b terms denote the related biases.

The cell state c_t at time t is updated by

$c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ g_{t},$ (21) where ⊙ denotes the Hadamard product and c_t-1 represents the cell state at time t - 1. We calculate i_t, f_t, and g_t as the input, forget, and cell gates, respectively, using Equations (22) through (24):

$i_{t} = σ (W_{ii} V_{t_{input}} + b_{ii} + W_{hi} h_{t - 1} + b_{hi}),$ (22)

$f_{t} = σ (W_{if} V_{t_{input}} + b_{if} + W_{hf} h_{t - 1} + b_{hf}),$ (23)

$g_{t} = \tanh (W_{ig} V_{t_{input}} + b_{ig} + W_{hg} h_{t - 1} + b_{hg}) .$ (24)

In these equations, W_ii and W_hi are input-input coefficient matrices and hidden-input weight matrices, respectively. W_if and W_hf are input-forget coefficient matrices and hidden-forget weight matrices, respectively. W_ig and W_hg are input-cell coefficient matrices and hidden-cell weight matrices, respectively. b terms indicate related bias.
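One LSTM time step following Equations (19) through (24) can be sketched as below. The parameter dictionary, its key names, and the helper `init_params` are hypothetical conveniences. The example zeroes all parameters so that the gate values are known in advance (sigmoid gates equal 0.5 and the cell gate equals 0).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_in, n_hid, gates, rng):
    """Small random weights and zero biases for each gate name in `gates`."""
    p = {}
    for g in gates:
        p[f"W_i{g}"] = rng.standard_normal((n_hid, n_in)) * 0.1
        p[f"W_h{g}"] = rng.standard_normal((n_hid, n_hid)) * 0.1
        p[f"b_i{g}"] = np.zeros(n_hid)
        p[f"b_h{g}"] = np.zeros(n_hid)
    return p

def gate(p, g, x, h_prev):
    """Shared pre-activation W_i* x + b_i* + W_h* h_{t-1} + b_h* of Equations (20), (22)-(24)."""
    return p[f"W_i{g}"] @ x + p[f"b_i{g}"] + p[f"W_h{g}"] @ h_prev + p[f"b_h{g}"]

def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(gate(p, "i", x, h_prev))   # Equation (22), input gate
    f = sigmoid(gate(p, "f", x, h_prev))   # Equation (23), forget gate
    g = np.tanh(gate(p, "g", x, h_prev))   # Equation (24), cell gate
    o = sigmoid(gate(p, "o", x, h_prev))   # Equation (20), output gate
    c = f * c_prev + i * g                 # Equation (21)
    h = o * np.tanh(c)                     # Equation (19)
    return h, c

rng = np.random.default_rng(0)
params = init_params(4, 3, "ifgo", rng)
zero_params = {k: np.zeros_like(v) for k, v in params.items()}
h, c = lstm_step(np.ones(4), np.zeros(3), np.array([2.0, -2.0, 0.0]), zero_params)
```

With zeroed parameters the forget gate halves the previous cell state and nothing new is written, so c = 0.5 · c_{t-1} and h = 0.5 · tanh(c).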

3) GRU: The GRU is similar to an LSTM with a forget gate, but with fewer parameters [40]. It directly shows whole cell state values at each time step due to the lack of output gates. We use the following implementation for calculating the activation of GRU units:

$h_{t} = (1 - z_{t}) * n_{t} + z_{t} * h_{t - 1},$ (25) where z_t denotes the update gates that control update contents, and the recurrent hidden state h_t at time t is a linear interpolation operation between the hidden state h_t-1 at time t - 1 and the candidate activation n_t.

The update gate z_t is calculated by

$z_{t} = σ (W_{iz} V_{t_{input}} + b_{iz} + W_{hz} h_{t - 1} + b_{hz}),$ (26) where σ(·) denotes the sigmoid function. W_iz and W_hz are the input-update weight matrix and hidden-update coefficient matrix, respectively. b terms indicate related bias.

The computation of the candidate activation is similar to that of the fully connected RNN (cf. Equation (18)) and is given by

$n_{t} = \tanh (W_{in} V_{t_{input}} + b_{in} + r_{t} * (W_{hn} h_{t - 1} + b_{hn})),$ (27) where tanh(·) denotes the hyperbolic tangent function and r_t represents the reset gate. When r_t = 0, the reset gate stays off: the units only receive the current input $V_{t_{input}}$ and ignore the information of the computed state at time t - 1. When r_t = 1, the reset gate stays on: the units keep the activation of the previous recurrent layer. W_in and W_hn are the input-candidate weight matrix and hidden-candidate coefficient matrix, respectively.

Reset gate r_t is updated by

$r_{t} = σ (W_{ir} V_{t_{input}} + b_{ir} + W_{hr} h_{t - 1} + b_{hr}),$ (28) where W_ir and W_hr are the input-reset weight matrix and hidden-reset coefficient matrix, respectively.
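The GRU update of Equations (25) through (28) can be sketched in the same way. The parameter dictionary and its key names are hypothetical. The example first uses zeroed parameters, for which z_t = 0.5 and n_t = 0, and then runs two time steps with small random weights as a stand-in for a bitemporal input sequence.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU time step; p maps hypothetical names such as "W_ir" to arrays."""
    r = sigmoid(p["W_ir"] @ x + p["b_ir"] + p["W_hr"] @ h_prev + p["b_hr"])     # Equation (28)
    z = sigmoid(p["W_iz"] @ x + p["b_iz"] + p["W_hz"] @ h_prev + p["b_hz"])     # Equation (26)
    n = np.tanh(p["W_in"] @ x + p["b_in"] + r * (p["W_hn"] @ h_prev + p["b_hn"]))  # Equation (27)
    return (1.0 - z) * n + z * h_prev                                           # Equation (25)

def make_params(n_in, n_hid, rng=None):
    """Zero parameters when rng is None, otherwise small random weights with zero biases."""
    p = {}
    for g in "rzn":
        p[f"W_i{g}"] = np.zeros((n_hid, n_in)) if rng is None else rng.standard_normal((n_hid, n_in)) * 0.1
        p[f"W_h{g}"] = np.zeros((n_hid, n_hid)) if rng is None else rng.standard_normal((n_hid, n_hid)) * 0.1
        p[f"b_i{g}"] = np.zeros(n_hid)
        p[f"b_h{g}"] = np.zeros(n_hid)
    return p

# Zero parameters: the new state is half of the previous state.
h_half = gru_step(np.ones(4), np.array([2.0, -4.0, 0.0]), make_params(4, 3))

# Two time steps, starting from h_0 = 0.
params = make_params(4, 3, np.random.default_rng(0))
h_seq = np.zeros(3)
for x_t in (np.ones(4), 2.0 * np.ones(4)):
    h_seq = gru_step(x_t, h_seq, params)
```

Since h_t is an interpolation between h_{t-1} and a tanh output, the state started from zero stays inside (-1, 1).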

2.3 Change map

As shown in Fig. 2, V_s, V_f, and V_t are the output features of the three feature extraction structures. These features are directly merged by concatenating the spatial, frequency, and temporal information. After mapping the feature representations to the sample space, the final softmax layer calculates the probability of unchanged versus changed to generate the change map $\hat{Y}$ :

$\hat{Y} = \mathrm{softmax} (FC ([V_{s}, V_{f}, V_{t}])),$ (29) where softmax(·) denotes the softmax layer operation, FC(·) denotes the FC layer operation, and [·] indicates the concatenation of the three feature branches. The change map $\hat{Y}$ contains the classified pixels in Ω_i, with unchanged pixels labeled as “0” and changed pixels labeled as “1”.
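The fusion and classification step of Equation (29) can be sketched for a single pixel as follows. The function names, the feature lengths, and the FC weights are hypothetical; the weights in the example are chosen so that the class probabilities can be verified by hand.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract the maximum for numerical stability
    return e / e.sum()

def classify(v_s, v_f, v_t, w_fc, b_fc):
    """Concatenate the three branch features, apply one FC layer, then softmax."""
    fused = np.concatenate([np.ravel(v_s), np.ravel(v_f), np.ravel(v_t)])
    probs = softmax(w_fc @ fused + b_fc)
    return probs, int(np.argmax(probs))   # label 0 = unchanged, 1 = changed

v_s, v_f, v_t = np.ones(2), np.ones(3), np.ones(1)
w_fc = np.zeros((2, 6))                   # zero weights: the logits equal the biases
b_fc = np.array([0.0, np.log(3.0)])
probs, label = classify(v_s, v_f, v_t, w_fc, b_fc)
```

With logits 0 and ln 3, the softmax yields probabilities 0.25 and 0.75, so the pixel is labeled as changed.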

2.4 Loss function

In most change detection tasks for bitemporal SAR images, the data are imbalanced between unchanged and changed pixels, which weakens the performance of the model. Inspired by the application of focal loss in object detection for balancing the effect of positive and negative samples [41], we design a loss function that combines the cross-entropy loss and the focal loss by addition:

$L = L_{ce} + L_{focal},$ (30) with

$L_{ce} = - \frac{1}{N} \sum_{k = 1}^{N} W [C] \cdot \log (\frac{\exp (\hat{Y} (k) [C])}{\sum_{l = 0}^{1} \exp (\hat{Y} (k) [l])}),$ (31) where N denotes the total number of pixels in the target image $\hat{Y}$ , W [·] indicates a manual rescaling weight given to each class, and C is the class label of a pixel, with 0 for unchanged and 1 for changed. $\hat{Y} (k)$ denotes a point in $\hat{Y}$ and contains two values, one per class. Simultaneously, the change map $\hat{Y}$ is used to calculate the focal loss as follows:

$L_{focal} = - α_{k} (1 - \hat{Y} (k))^{γ} log (\hat{Y} (k)),$ (32) where α_k ∈ [0, 1] is a weighting factor to balance the importance of positive/negative examples. γ ⩾ 0 denotes a tunable focusing parameter. We set α_k = 0.25 and γ = 2 in our work.
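A minimal sketch of the combined loss in Equations (30) through (32) is given below. It makes several assumptions that the equations leave open: the cross-entropy term uses the conventional negative log-likelihood sign, the focal term is averaged over the pixels, the probability in the focal term is the softmax probability of the true class, and α is a single constant for both classes. The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def combined_loss(logits, targets, class_weights, alpha=0.25, gamma=2.0):
    """L = L_ce + L_focal for logits of shape (N, 2) and integer targets in {0, 1}."""
    logp = log_softmax(logits)
    logp_t = logp[np.arange(len(targets)), targets]       # log-probability of the true class
    p_t = np.exp(logp_t)
    l_ce = np.mean(-class_weights[targets] * logp_t)      # weighted cross-entropy
    l_focal = np.mean(-alpha * (1.0 - p_t) ** gamma * logp_t)  # focal term, averaged (assumed)
    return l_ce + l_focal

targets = np.array([0, 1])
weights = np.array([1.0, 1.0])
loss_uniform = combined_loss(np.zeros((2, 2)), targets, weights)
loss_confident = combined_loss(np.array([[5.0, -5.0], [-5.0, 5.0]]), targets, weights)
```

For uniform predictions the true-class probability is 0.5, so the loss equals ln 2 + 0.25 · 0.5² · ln 2 = 1.0625 · ln 2. The factor (1 − p)^γ shrinks the focal term for confidently correct pixels, which is how the majority unchanged class is prevented from dominating training.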

3 Implementation results and analysis

3.1 SAR dataset and evaluation metric

For this paper, we evaluated our method using four sample sets of bitemporal SAR images [25]. Table 1 provides a summary of the images in the sample datasets.

Table 1
Characteristics of the SAR datasets

Dataset	Dates of the two images	Size (pixels)	Spatial resolution (m)	Location	Sensor
Ottawa	July 1997, August 1997	350×290	10	Ottawa, Canada	Radarsat-2
Bern	April 1999, May 1999	301×301	30	Regions near the city of Bern, Switzerland	European Remote Sensing (ERS)-2 satellite
San Francisco	August 2003, May 2004	256×256	25	San Francisco, California, USA	European Remote Sensing (ERS)-2 satellite
Yellow River	June 2008, June 2009	257×289	8	Dongying, Shandong Province, China	Radarsat-2

The first dataset, named Ottawa, was acquired over the city of Ottawa by the Radarsat-2 SAR sensor in July and August 1997. The images are 350×290 pixels with a spatial resolution of 10×10 m and show areas that were once flooded. The dataset was provided by Defence Research and Development Canada (DRDC).

The second dataset contains images of flooded areas around Bern, including Thun, Bern, and the airport, that were captured by the European Remote Sensing (ERS)-2 satellite SAR sensor in April and May 1999. Each image is 301×301 pixels with a spatial resolution of 30×30 m. Images of the Aare valley between Bern and Thun were selected to recognize the flooded regions.

The San Francisco dataset was acquired by the European Remote Sensing (ERS)-2 satellite SAR sensor and contains images from the city of San Francisco taken in August 2003 and May 2004. Each image is 256×256 pixels with a spatial resolution of 25×25 m. These images were provided by the European Space Agency (ESA).

The Yellow River dataset contains images of a region of the Yellow River around Dongying City, Shandong Province, China, taken by Radarsat-2 in June 2008 and June 2009. Each image is 257×289 pixels with an 8×8 m resolution. Furthermore, two versions of this dataset with different noise levels are available, with sizes of 257×289 and 400×300 pixels, respectively.

In addition, each sample dataset has a reference image as the ground truth, which accurately marks the unchanged and changed regions. We use the reference images to verify our model. Figures 7–10 show the four SAR datasets and their change detection maps.

Fig. 7

Ottawa dataset: (a) July 1997; (b) August 1997; (c) The change detection map.

Fig. 8

Bern dataset: (a) April 1999; (b) May 1999; (c) The change detection map.

Fig. 9

San Francisco dataset: (a) August 2003; (b) May 2004; (c) The change detection map.

Fig. 10

Yellow River dataset: (a) June 2008; (b) June 2009; (c) The change detection map.

To evaluate the accuracy of the proposed model, we use common indices for quantitative analysis: false positives (FP), false negatives (FN), overall error (OE), percentage of correct classification (PCC), and kappa coefficient (KC) [42]. OE is the sum of incorrectly classified pixels:

$OE = FP + FN,$ (33) where FP is the number of unchanged pixels incorrectly classified as changed and FN is the number of changed pixels incorrectly classified as unchanged. PCC is the percentage of correctly classified pixels:

$PCC = \left( 1 - \frac{OE}{N} \right) \times 100\%,$ (34) where N indicates the total number of pixels in the target image. KC measures the accuracy of the classification model while accounting for the balance between FP and FN, and is more indicative of the perceived accuracy of the change detection results than the other indices:

$KC = \frac{PCC - PRE}{1 - PRE} \times 100\%,$ (35) where

$PRE = \frac{(N_{c} + FP - FN) \cdot N_{c} + (N_{u} - FP + FN) \cdot N_{u}}{N^{2}},$ (36) and N_c and N_u denote the total number of pixels in the changed and unchanged clusters, respectively.
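The indices in Equations (33)–(36) can be computed directly from a predicted change map and its reference image. The sketch below is illustrative: the function name and the flat 0/1 list inputs are assumptions, and PCC and KC are returned as fractions, to be multiplied by 100 to obtain percentages.

```python
def change_detection_metrics(pred, truth):
    """Compute FP, FN, OE, PCC, PRE and KC for two flat 0/1 sequences.

    pred:  predicted labels (1 = changed, 0 = unchanged)
    truth: reference labels from the ground-truth image
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    n = len(truth)
    # FP: unchanged pixels predicted as changed; FN: changed pixels missed.
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    n_c = sum(1 for t in truth if t == 1)   # changed pixels in the reference
    n_u = n - n_c                           # unchanged pixels in the reference
    oe = fp + fn
    pcc = 1.0 - oe / n
    pre = ((n_c + fp - fn) * n_c + (n_u - fp + fn) * n_u) / (n * n)
    # KC is undefined when PRE equals 1 (only one class present and predicted).
    kc = (pcc - pre) / (1.0 - pre) if pre != 1.0 else float("nan")
    return {"FP": fp, "FN": fn, "OE": oe, "PCC": pcc, "PRE": pre, "KC": kc}
```

Because PRE grows with class imbalance, a map with very few changed pixels can have a high PCC and still a much lower KC.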

3.2 Difference image

In our experiments, the difference images for the SAR dataset were generated using the log-ratio operator explained in Equation (1). To compare the effects of different DI operators, as shown in Fig. 11, we introduced the subtraction and normal difference methods:

$I_{d_{s}} = | I_{2} - I_{1} |,$ (37)

$I_{d_{n}} = \left| \frac{I_{2} - I_{1}}{I_{2} + I_{1}} \right|,$ (38) where I₁ and I₂ are two SAR images covering the same geographical region, captured at different times. $I_{d_{s}}$ and $I_{d_{n}}$ are the difference images produced by the subtraction DI operator and the normal difference DI operator, respectively.

Fig. 11

The difference images generated by the three operators. For better processing, we normalized them to the range [0, 255]. The first row shows the subtraction operator, and the second and third rows show the normal difference and log-ratio operators, respectively. (a) Ottawa, (b) Bern, (c) San Francisco, and (d) Yellow River.
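The three DI operators compared here can be sketched as follows. The log-ratio form |log(I₂/I₁)| is assumed, since Equation (1) lies outside this section, and the small constant eps (added to avoid division by zero and log of zero), the function names, and the nested-list image representation are illustrative choices that do not come from the paper.

```python
import math


def difference_images(img1, img2, eps=1e-6):
    """Return (subtraction, normal difference, log-ratio) difference images.

    img1, img2: same-sized 2-D lists of non-negative intensities (I1 and I2)
    eps: assumed stabiliser, not part of Equations (37) and (38)
    """
    sub, nd, lr = [], [], []
    for row1, row2 in zip(img1, img2):
        sub.append([abs(b - a) for a, b in zip(row1, row2)])
        nd.append([abs(b - a) / (a + b + eps) for a, b in zip(row1, row2)])
        lr.append([abs(math.log((b + eps) / (a + eps))) for a, b in zip(row1, row2)])
    return sub, nd, lr


def normalize_to_255(di):
    """Linearly rescale a difference image to the range [0, 255]."""
    lo = min(min(row) for row in di)
    hi = max(max(row) for row in di)
    span = hi - lo
    if span == 0:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in di]
    return [[255.0 * (v - lo) / span for v in row] for row in di]
```

The subtraction operator responds to absolute intensity differences, whereas the two ratio-based operators respond to relative differences, which is the usual motivation for preferring them under multiplicative speckle.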

3.3 Performance of the proposed model

To demonstrate the benefits of our proposed method, we compared it to the following state-of-the-art baselines: PCANet [21], DBN [24], NR-ELM [25], CWNN [26], Ms-CapsNet [27], and SAFNet [28]. We used 10000 pixels as training samples, with 7000 changed pixels and 3000 unchanged pixels. On the Bern dataset, we used 1500 pixels as training samples, with 1050 changed pixels and 450 unchanged pixels. We trained the network for 50 epochs with a batch size of 128. Preprocessing was also applied before training the network, and we set the parameters of PCANet and DBN to match those used with our method. The other baselines used the default parameters given in their original papers.

The visualized results of the different change detection methods on the datasets are shown in Fig. 12. On both the San Francisco and Yellow River datasets, Ms-CapsNet produced more erroneous detections. Our proposed SFTNet identified changed regions more accurately than the other methods. Tables 2 through 5 report the quantitative results of all methods on the different datasets.

Fig. 12

The visualized comparisons of different change detection methods on the Ottawa (first row), Bern (second row), San Francisco (third row), and Yellow River datasets (last row): (a) Ground truth image. (b) Result from PCANet [21]. (c) Result from DBN [24]. (d) Result from NR-ELM [25]. (e) Result from CWNN [26]. (f) Result from Ms-CapsNet [27]. (g) Result from SAFNet [28]. (h) Result from the proposed SFTNet.

Table 2

The change detection results of different methods on the Ottawa dataset

Method	Results on the Ottawa dataset
	FP	FN	OE	PCC (%)	KC (%)
PCANet [21]	550	1790	2340	97.69	97.89
DBN [24]	218	2920	3138	96.91	97.14
NR-ELM [25]	668	1138	1806	98.22	98.06
CWNN [26]	1291	434	1725	98.30	93.75
Ms-CapsNet [27]	1040	900	1940	98.09	93.46
SAFNet [28]	882	534	1416	98.60	94.81
Proposed SFTNet	613	967	1580	98.44	98.58

For the Ottawa dataset in Table 2 (the first row of Fig. 12), CWNN and DBN obtained high FP and FN values, respectively. The proposed SFTNet achieved the highest KC value of all the methods.

For the Bern dataset in Table 3 (the second row of Fig. 12), Ms-CapsNet and PCANet produced final change maps that contain some noise and miss many changed regions. Our SFTNet attained a KC value that was 14.42 and 16.09 percentage points higher than those of CWNN and Ms-CapsNet, respectively.

On the San Francisco dataset (the third row of Fig. 12), Ms-CapsNet incorrectly labeled the largest number of areas as changed, while our SFTNet made extensive use of the discriminative information between the changed and unchanged pixels. As shown in Table 4, SFTNet's KC value was 70.03 percentage points higher than that of Ms-CapsNet.

Table 3

The change detection results of different methods on the Bern dataset

Method	Results on the Bern dataset
	FP	FN	OE	PCC (%)	KC (%)
PCANet [21]	8	584	592	99.35	99.39
DBN [24]	13	549	562	99.38	99.42
NR-ELM [25]	156	195	351	99.61	99.57
CWNN [26]	85	230	315	99.65	85.28
Ms-CapsNet [27]	205	176	381	99.58	83.61
SAFNet [28]	265	102	367	99.59	84.96
Proposed SFTNet	87	207	294	99.68	99.70

Table 4

The change detection results of different methods on the San Francisco dataset

Method	Results on the San Francisco dataset
	FP	FN	OE	PCC (%)	KC (%)
PCANet [21]	271	703	974	98.51	98.68
DBN [24]	95	1050	1145	98.25	98.44
NR-ELM [25]	324	469	793	98.79	98.66
CWNN [26]	437	295	732	98.89	91.70
Ms-CapsNet [27]	18199	122	18321	72.04	29.04
SAFNet [28]	668	332	1000	98.47	88.87
Proposed SFTNet	293	397	690	98.95	99.07

The Yellow River dataset has much stronger speckle noise. As shown in Fig. 12, Ms-CapsNet introduced many small noise areas when generating the final change map. SFTNet performed well even with the noise. As shown in Table 5, SFTNet achieved KC values that were 17.28, 7.39, and 23.54 percentage points higher than those of NR-ELM, CWNN, and Ms-CapsNet, respectively.

Table 5

The change detection results of different methods on the Yellow River dataset

Method	Results on the Yellow River dataset
	FP	FN	OE	PCC (%)	KC (%)
PCANet [21]	595	2971	3566	95.20	95.45
DBN [24]	547	3305	3852	94.81	95.07
NR-ELM [25]	571	3663	4234	94.30	78.86
CWNN [26]	712	1694	2406	96.76	88.75
Ms-CapsNet [27]	5848	1763	7611	89.75	72.60
SAFNet [28]	9	6414	6423	91.35	64.15
Proposed SFTNet	544	2493	3037	95.91	96.14

3.4 Runtime comparisons

Time cost is an important factor for the effective application of deep learning-based methods in change detection tasks. Table 6 shows the time consumption of our proposed SFTNet along with the other methods. NR-ELM, Ms-CapsNet, DBN, and PCANet required less time because of their relatively simple models. Compared with CWNN and SAFNet, however, SFTNet was considerably faster. We also highlight that SFTNet deliberately trains on a small number of samples through the feature extraction network, which preserves model performance while controlling the computing time.

Table 6
The time cost of different change detection methods

Method	Time cost (in seconds)
	Ottawa	Bern	San Francisco	Yellow River
PCANet [21]	58.60	88.91	53.86	87.74
DBN [24]	46.49	63.24	42.46	49.51
NR-ELM [25]	27.37	5.12	3.42	24.88
CWNN [26]	1250.15	748.85	1244.26	2478.66
Ms-CapsNet [27]	35.10	32.03	24.80	27.53
SAFNet [28]	412.26	396.11	271.31	324.93
Proposed SFTNet	118.94	167.68	111.02	183.27

3.5 Model ablation study

Our proposed model has three main branches: spatial feature extraction, frequency feature extraction, and temporal feature extraction. To verify the effectiveness of the model, we designed alternative models by removing one of these branches or one of the two loss terms. The densely connected CNN refers to the modified UNet++ without the MRCM. Table 7 shows the necessity of each part of SFTNet, with each branch affecting the performance of the model.

Table 7
The ablation study of SFTNet, with performance evaluated by PCC

Method	PCC (%)
	Ottawa	Bern	San Francisco	Yellow River
densely connected CNN	98.33	99.65	98.55	95.28
w/o spatial feature extraction	98.34	99.64	98.74	95.32
w/o frequency feature extraction	98.55	99.43	98.77	95.50
w/o temporal feature extraction	98.42	99.55	98.69	95.52
w/o focal loss	95.04	91.90	83.67	74.43
w/o cross-entropy loss	94.87	86.98	84.72	69.35
proposed SFTNet	98.44	99.68	98.95	95.91

4 Discussion

The experimental results in Sections 3.3 and 3.4 show that the proposed method was not only efficient at feature extraction but also distinguished changed pixels better than the other methods. We now discuss the parameters and architectures that affect the accuracy of the model.

4.1 Selection of the patch size

To evaluate the correlation between the patch size and the performance of our method, we set r = 5, 7, 9, 11, 13, and 15 in turn. Figure 13 shows the relationship between r and the PCC values on the SAR datasets. The PCC values first increase and then stabilize, which shows that contextual information is critical for SAR image change detection. The end of the curve makes it evident that a larger patch size helps to decrease the computational cost but may introduce noise that affects the results.

Fig. 13

The relationship between patch size r and PCC values on the SAR datasets.
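As an illustration of what the patch size r controls, the sketch below cuts the r×r neighborhood centered on one pixel. The mirror padding at the image border and the function name are assumptions; this section does not state how border pixels are handled.

```python
def extract_patch(image, row, col, r):
    """Return the r x r patch centred on (row, col) of a 2-D list image.

    r must be odd so that the patch has a well-defined centre pixel.
    Pixels outside the image are filled by mirroring (assumed padding).
    """
    if r % 2 == 0:
        raise ValueError("patch size r must be odd")
    height, width = len(image), len(image[0])
    half = r // 2

    def reflect(i, size):
        # Mirror an out-of-range index back into [0, size - 1]
        # without repeating the edge pixel: -1 -> 1, size -> size - 2.
        if size == 1:
            return 0
        period = 2 * (size - 1)
        i = abs(i) % period
        return i if i < size else period - i

    return [
        [image[reflect(row + dr, height)][reflect(col + dc, width)]
         for dc in range(-half, half + 1)]
        for dr in range(-half, half + 1)
    ]
```

A larger r gives the classifier more context around each pixel, but each patch then also contains more neighboring pixels, including any speckle they carry.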

4.2 Analysis of the RNN architectures

We assessed three types of RNN architectures in the temporal feature extraction branch of SFTNet for comparative analysis. The results and statistics of these architectures, including fully connected RNN, LSTM, and GRU on the SAR dataset are shown below.

In Fig. 14, according to the mean accuracy of the four SAR datasets, GRU and LSTM architectures achieve the first- and second-best results, respectively. The fully connected RNN achieves the lowest accuracy.

Fig. 14

The variation of the PCC values for different RNN architectures on the SAR dataset.

4.3 The difference image operator type

In this section, we compare the performance of three DI operators during preclassification. As shown in Fig. 15, the log-ratio operator performs best. The normal difference operator was second best on the Ottawa, Bern, and Yellow River datasets. The subtraction operator performs worst: under the influence of speckle noise, it incorrectly detects small changed regions. Its worst result was on the Yellow River dataset, with a PCC of 71.13%.

Fig. 15

The variation of the PCC values for different DI operators on the SAR datasets.

5 Conclusions

In this paper, we present a method for SAR image change detection that achieves higher accuracy than existing methods. Our SFTNet alleviates speckle noise and comprehensively considers semantic information when detecting changes by using an ensemble MRCM that captures multidimensional fusion features using three parallel network branches. An improved UNet++ and a new ensemble module extract crucial spatial feature information between different channels. We use a DCT and a selection step to capture the frequency features. An RNN architecture obtains the temporal feature relationship. Experimental results on four datasets demonstrate the superior performance of SFTNet compared to existing state-of-the-art methods.

SAR sensor applications typically involve high-resolution images of large areas over long periods of time. Our SFTNet manages noise well and is suitable for change detection in SAR images. In future work, we plan to enhance our structure and verify that our method works well in other scenarios.

References

1. Zhang, Hu and Brown G.S., Automatic Surface Water Mapping Using Polarimetric SAR Data for Long-Term Change Detection, Water 12(3) (2020), 872. doi: 10.3390/w12030872.

2. Zhao, Ling and Li, An Iterative Feedback-Based Change Detection Algorithm for Flood Mapping in SAR Images, IEEE Geoscience and Remote Sensing Letters 16(2) (2019), 231–235. doi: 10.1109/lgrs.2018.2871849.

3. Gong, Zhang, Su and Liu, Coupled Dictionary Learning for Change Detection From Multisource Data, IEEE Transactions on Geoscience and Remote Sensing 54(12) (2016), 7077–7091. doi: 10.1109/tgrs.2016.2594952.

4. Xuedong, Wenxi and Shuguang, Urban Change Detection in TerraSAR Image Using the Difference Method and SAR Coherence Coefficient, Journal of Engineering Science and Technology Review 11(3) (2018), 18–23. doi: 10.25103/jestr.113.03.

5. […], Wang, Zhang and Wu, Urban Building Change Detection in SAR Images Using Combined Differential Image and Residual U-Net Network, Remote Sensing 11(9) (2019), 1091. doi: 10.3390/rs11091091.

6. Huang and Jin, Rapid Flood Mapping and Evaluation with a Supervised Classifier and Change Detection in Shouguang Using Sentinel-1 SAR and Sentinel-2 Optical Data, Remote Sensing 12(13) (2020), 2073. doi: 10.3390/rs12132073.

7. Park S.-E. and Jung Y.T., Detection of Earthquake-Induced Building Damages Using Polarimetric SAR Data, Remote Sensing 12(1) (2020), 137. doi: 10.3390/rs12010137.

8. Wang, Chen J.-W., Jiao and Wang, How Can Despeckling and Structural Features Benefit to Change Detection on Bitemporal SAR Images? Remote Sensing 11(4) (2019), 421. doi: 10.3390/rs11040421.

9. Wang, Zhang, Chen, Jiao and Wang, Imbalanced Learning-Based Automatic SAR Images Change Detection by Morphologically Supervised PCA-Net, IEEE Geoscience and Remote Sensing Letters 16(4) (2019), 554–558. doi: 10.1109/lgrs.2018.2878420.

10. […], Li, Wang and Liang, Delineation of Radar Glacier Zones in the Antarctic Peninsula Using Polarimetric SAR, Water 12(9) (2020), 2620. doi: 10.3390/w12092620.

11. Yang, Liu, Gao and Feng, Extreme Self-Paced Learning Machine for On-Orbit SAR Images Change Detection, IEEE Access 7 (2019), 116413–116423. doi: 10.1109/access.2019.2934983.

12. Zhang, Lu and Li, A Coarse-to-Fine Semi-Supervised Change Detection for Multispectral Images, IEEE Transactions on Geoscience and Remote Sensing 56(6) (2018), 3587–3599. doi: 10.1109/tgrs.2018.2802785.

13. Xue, Lei, Jia, Wang, Chen and Nandi A.K., Unsupervised Change Detection Using Multiscale and Multiresolution Gaussian-Mixture-Model Guided by Saliency Enhancement, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021), 1796–1809. doi: 10.1109/jstars.2020.3046838.

14. Pirrone, Bovolo and Bruzzone, A Novel Framework Based on Polarimetric Change Vectors for Unsupervised Multiclass Change Detection in Dual-Pol Intensity SAR Images, IEEE Transactions on Geoscience and Remote Sensing 58(7) (2020), 4780–4795. doi: 10.1109/tgrs.2020.2966865.

15. […], Zhou, An Unsupervised Framework for Change Detection in Remote Sensing Images, 2021 IEEE 21st International Conference on Communication Technology (ICCT), 2021, pp. 1112–1116. doi: 10.1109/ICCT52962.2021.9658043.

16. […], Liu, Li, Jiao, Lu and Marturi, Application of Data Driven Optimization for Change Detection in Synthetic Aperture Radar Images, IEEE Access 8 (2020), 11426–11436. doi: 10.1109/access.2019.2962622.

17. Chen, Huang and Gao, Small-Target Detection between SAR Images Based on Statistical Modeling of Log-Ratio Operator, Sensors 19(6) (2019), 1431. doi: 10.3390/s19061431.

18. Liu, Jia, Yang and Kasabov N.K., SAR Image Change Detection Based on Mathematical Morphology and the K-Means Clustering Algorithm, IEEE Access 7 (2019), 43970–43978. doi: 10.1109/access.2019.2908282.

19. Wang, Zhao and Chen, A framework of spatiotemporal fuzzy clustering for land-cover change detection using SAR time series, International Journal of Remote Sensing 38(2) (2016), 450–466. doi: 10.1080/01431161.2016.1268736.

20. Yan, Shi, Pan, Zhang and Wang, Unsupervised change detection in SAR images based on frequency difference and a modified fuzzy c-means clustering, International Journal of Remote Sensing 39(10) (2018), 3055–3075. doi: 10.1080/01431161.2018.1434325.

21. Chan T.-H., Jia, Gao, Lu, Zeng and Ma, PCANet: A Simple Deep Learning Baseline for Image Classification? IEEE Transactions on Image Processing 24(12) (2015), 5017–5032. doi: 10.1109/tip.2015.2475625.

22. […], Li, Zhang, Wu, Song and An, SAR Image Change Detection Using PCANet Guided by Saliency Detection, IEEE Geoscience and Remote Sensing Letters 16(3) (2019), 402–406. doi: 10.1109/lgrs.2018.2876616.

23. […], Peng, Chen, Jiao, Zhou and Shang, A Deep Learning Method for Change Detection in Synthetic Aperture Radar Images, IEEE Transactions on Geoscience and Remote Sensing 57(8) (2019), 5751–5763. doi: 10.1109/tgrs.2019.2901945.

24. Samadi, Akbarizadeh and Kaabi, Change detection in SAR images using deep belief network: a new training approach based on morphological images, IET Image Processing 13(12) (2019), 2255–2264. doi: 10.1049/iet-ipr.2018.6248.

25. Gao, Dong, Li, Xu and Xie, Change detection from synthetic aperture radar images based on neighborhood-based ratio and extreme learning machine, Journal of Applied Remote Sensing 10(4) (2016), 046019. doi: 10.1117/1.jrs.10.046019.

26. Gao, Wang, Gao, Dong and Wang, Sea Ice Change Detection in SAR Images Based on Convolutional-Wavelet Neural Networks, IEEE Geoscience and Remote Sensing Letters 16(8) (2019), 1240–1244. doi: 10.1109/lgrs.2019.2895656.

27. Gao, Gao, Dong and Li H.-C., SAR Image Change Detection Based on Multiscale Capsule Network, IEEE Geoscience and Remote Sensing Letters 18(3) (2021), 484–488. doi: 10.1109/lgrs.2020.2977838.

28. Gao, Gao, Dong, Du and Li H.-C., Synthetic Aperture Radar Image Change Detection via Siamese Adaptive Fusion Network, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021), 10748–10760. doi: 10.1109/JSTARS.2021.3120381.

29. […] X.-H., Du Z.-S., Huang Y.-Y. and Tan Z.-Y., A deep translation (GAN) based change detection network for optical and SAR remote sensing images, ISPRS Journal of Photogrammetry and Remote Sensing 179 (2021), 14–34. doi: 10.1016/j.isprsjprs.2021.07.007.

30. Zheng, Jiao, Liu, Zhang, Hou and Wang, Unsupervised saliency-guided SAR image change detection, Pattern Recognition 61 (2017), 309–326. doi: 10.1016/j.patcog.2016.07.040.

31. Celik, Longbotham and Emery W.J., Gabor Feature Based Unsupervised Change Detection of Multitemporal SAR Images Based on Two-Level Clustering, IEEE Geoscience and Remote Sensing Letters 12(12) (2015), 2458–2462. doi: 10.1109/lgrs.2015.2484220.

32. Bezdek J.C., Pattern Recognition with Fuzzy Objective Function Algorithms, SIAM Review 25(3) (1983), 442–442. doi: 10.1137/1025116.

33. Zhou, Siddiquee M.M.R., Tajbakhsh and Liang, UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation, IEEE Transactions on Medical Imaging 39(6) (2020), 1856–1867. doi: 10.1109/tmi.2019.2959609.

34. […], Zhang, Ren and Sun, Identity Mappings in Deep Residual Networks, ECCV 2016 9908 (2016), 630–645. doi: 10.1007/978-3-319-46493-0_38.

35. Nair V. and Hinton G.E., Rectified Linear Units Improve Restricted Boltzmann Machines, Proceedings of the 27th International Conference on Machine Learning 27 (2010), 807–814. doi: 10.5555/3104322.3104425.

36. Woo, Park, Lee J.-Y. and Kweon I.S., CBAM: Convolutional Block Attention Module, Computer Vision – ECCV 2018 11211 (2018), 3–19. doi: 10.1007/978-3-030-01234-2_1.

37. […] C.-Y., Zaheer, Hu, Manmatha, Smola A.J. and Krahenbuhl, Compressed Video Action Recognition, IEEE Conference on Computer Vision and Pattern Recognition. doi: 10.1109/cvpr.2018.00631.

38. […], Qin, Sun, Wang, Chen Y.-K. and Ren, Learning in the Frequency Domain, IEEE Conference on Computer Vision and Pattern Recognition, 2020. doi: 10.1109/cvpr42600.2020.00181.

39. Hochreiter and Schmidhuber, Long Short-Term Memory, Neural Computation 9(8) (1997), 1735–1780. doi: 10.1162/neco.1997.9.8.1735.

40. Cho, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Computation and Language (cs.CL), 2014. doi: 10.3115/v1/D14-1179.

41. Lin T.-Y., Goyal, Girshick, He and Dollar, Focal Loss for Dense Object Detection, Computer Science, 2017. doi: 10.1109/iccv.2017.324.

42. Gong, Su, Jia and Chen, Fuzzy Clustering With a Modified MRF Energy Function for Change Detection in Synthetic Aperture Radar Images, IEEE Transactions on Fuzzy Systems 22(1) (2014), 98–109. doi: 10.1109/tfuzz.2013.2249072.