Cloud-Graph: A feature interaction graph convolutional network for remote sensing image cloud detection

Abstract

Convolutional neural networks (CNNs) have made significant progress in cloud detection for remote sensing images thanks to their powerful feature representation capabilities. Existing methods typically aggregate low-level features, which carry detail, with high-level features, which carry semantics, so that both contribute to the accurate detection of cloud regions. However, CNNs remain limited in their ability to reason about the relationships between features and to model context. To overcome this problem, this paper designs a novel feature interaction graph convolutional network that extends the feature fusion process of CNNs from Euclidean space to non-Euclidean space. The algorithm consists of three main components: remote sensing image feature extraction, feature interaction graph reasoning, and high-resolution feature recovery. A feature interaction graph reasoning (FIGR) module lets low-level and high-level features interact fully, and a residual graph convolutional network then infers the higher-order relationships between the features. The model effectively alleviates the semantic gap in the feature fusion process, allowing the aggregated features to combine valuable detail and semantic information. The algorithm is designed to better detect clouds in remote sensing images where cloud shape, size, and thickness vary widely and where cloud and snow coexist. Validated on the publicly available 38-Cloud and SPARCS datasets and on the paper's own higher-resolution Landsat-8 cloud detection dataset, the proposed method achieves competitive performance under different evaluation metrics. Code is available at https://github.com/HaiLei-Fly/CloudGraph.

Keywords

Remote sensing image; cloud detection; feature interaction; graph convolutional networks; image segmentation; interpretability

1 Introduction

Optical remote sensing imagery plays an important role in weather forecasting, environmental monitoring, agriculture, and the military [1]. However, about 66% of the Earth's surface is covered by clouds [2], which degrades downstream processing such as classification and segmentation and may mislead the analysis of remote sensing images. Therefore, accurate identification of cloud regions is of great importance for the application of optical remote sensing images.

Over the years, researchers have proposed a large number of cloud detection methods for remote sensing images, which can be broadly classified into two categories: threshold-based methods and statistics-based methods. The best-known threshold-based cloud detection algorithms are function of mask (Fmask) [3] and automated cloud-cover assessment (ACCA) [4], which require constant optimization of the choice of thresholds. Statistics-based methods use statistical learning algorithms to classify pixels based on hand-crafted features designed for image texture, color, and geometry [5]. Because remote sensing scenes are so diverse, these methods have limited practical application scenarios. Preview images of remote sensing scenes contain only the R, G, and B bands and allow fast cloud detection; researchers have already used them for cloud detection tasks, and methods such as CDNet [6] and CDNetV2 [7] have achieved good results. To further improve cloud detection accuracy, this paper designs more powerful feature descriptions and detection networks.

With the development of deep learning-based image segmentation algorithms, convolutional neural networks (CNNs) have made significant progress in cloud detection thanks to their powerful representation learning capabilities. Most existing cloud detection models use symmetric or asymmetric encoder-decoder structures, and methods such as Cloud-Net [8] and Cloud-Net+ [9] have achieved good detection results. Because clouds vary in shape, size, and sparsity and appear against complex backgrounds, the remote sensing image features extracted by convolutional networks need to be fully utilized to improve cloud detection accuracy. Hou et al. [10] experimentally demonstrated that low-level features extracted by CNNs contain more spatial detail, such as texture, boundaries, and spatial structure, whereas high-level features contain rich semantic information. Pu et al. [11] proposed a high-precision cloud detection network that uses DenseNet as the feature extraction backbone, combined with a global self-attention module and a spatial pyramid pooling module to extract deep semantic features. Peng et al. [12] first explored the relationship between receptive field size and cloud detection performance, introducing the theoretical receptive field (TRF) and the effective receptive field (ERF) to measure the receptive field size of different networks and performing cloud detection on Landsat-8 OLI data. However, simply fusing the two kinds of features with a CNN is not sufficient to learn the higher-order relationships between them, and the fusion process introduces noisy features that reduce the detection accuracy of cloud regions.

Graph convolutional neural networks (GCNs) [13] are an emerging network architecture that can efficiently process graph-structured data by modeling the relationships between graph vertices. By incorporating graph computation into a deep learning framework, they have recently revolutionized deep representation learning and have benefited many computer vision tasks such as action recognition [14], text detection [15], image segmentation [16] and hyperspectral image classification [17]. Performing the multi-level feature fusion process on non-Euclidean data can alleviate the semantic gap problem. Thus, GCNs can be naturally used for remote sensing image cloud detection tasks to accomplish feature fusion and context modeling.

In this paper, we make the first attempt to build a model based on GCNs, i.e. a feature interaction graph convolutional network, for remote sensing image cloud detection. We first extract multi-scale image features using a CNN-based encoder and generate a feature node graph based on feature region similarity. The proposed feature interaction graph reasoning (FIGR) module then lets the low-level and high-level feature node graphs interact fully in order to reason about the higher-order relationships between features. FIGR has a well-designed interaction architecture that enhances the interaction and fusion between features. Unlike existing CNN-based approaches that simply fuse low-level and high-level features from the encoder, FIGR fuses them by reasoning about the complementary relationships between detail and semantic features, so that the valuable information in both sets of features is represented and mutually enhanced. Finally, the high-resolution feature recovery (HRFR) component further fuses the enhanced low-level and high-level features to output the cloud regions predicted by the network. The effectiveness of Cloud-Graph is demonstrated through extensive experiments on the self-built CHLandsat-8 dataset and the open-source 38-Cloud and SPARCS datasets, using different evaluation metrics to assess the proposed algorithm and the comparison algorithms. Furthermore, the proposed algorithm is efficient, running at approximately 50 FPS on a single NVIDIA RTX 3060 GPU. More specifically, the main contributions of this paper can be summarized as follows:

A new graph-based cloud detection method for remote sensing images is proposed, which for the first time attempts to use graph-based techniques for cloud detection. It uses residual graph convolution to infer higher-order relationships between low-level and high-level features, to better model contextual features, and to accurately detect cloud regions.

A well-designed feature interaction graph reasoning (FIGR) module is presented, which effectively alleviates the problem of semantic gaps in the feature fusion process, enabling aggregated features to fuse valuable details and semantic information while filtering out noisy features.

A new high-resolution dataset for remote sensing imagery cloud detection is established, which includes 64 full scenes collected by Landsat-8 satellites from different regions of China from January to December 2021, as well as manually labeled cloud Masks. This open source dataset can help researchers to train and evaluate cloud detection algorithms and to promote research on cloud detection.

2 Related works

2.1 Remote sensing image cloud detection

Over the years, researchers have proposed many cloud detection methods for remote sensing images. Among them, the threshold method is relatively mature. Zhu et al. [3] proposed Fmask, which uses a decision tree to label each pixel as cloud or non-cloud; in each branch of the decision tree, a decision is made based on a threshold function. Irish et al. [4] designed the automated cloud-cover assessment (ACCA) algorithm, which also constructs a decision tree for cloud detection. Although threshold-based cloud detection methods have achieved some success, most do not generalize well and require continuous optimization of the choice of thresholds.

In recent years, cloud detection methods based on convolutional neural networks (CNNs) have been gradually proposed. Inspired by semantic segmentation, cloud detection based on fully convolutional neural networks (FCNs) [18] has achieved remarkable results. Zeng et al. [19] used FCN-8s directly for Landsat-8 satellite image cloud detection, but the segmentation was not accurate. Some recent work has mainly used encoder-decoder structures, such as UNet [20] and SEGNet [21], as architectures for cloud detection. Mohajerani et al. [8] used local and global features from the whole scene to design the end-to-end cloud detection network Cloud-Net. Yang et al. [6] proposed the cloud detection neural network CDNet, with an encoder-decoder structure, a feature pyramid module, and a boundary refinement module. To further investigate cloud detection in remote sensing images with cloud-snow coexistence, Guo et al. [7] proposed a new cloud detection neural network, CDNetV2, designing an adaptive feature fusion module and a high-level semantic information guided flow module. Hu et al. [22] proposed a deep learning model CDUNet for cloud detection, refining the division boundary of cloud layers to obtain their spatial. He et al. [23] proposed a lightweight network (DABNet) to achieve high-accuracy detection of complex clouds. Lu et al. [24] proposed a two-branch model consisting of a Transformer and a convolutional network that extract the semantic and spatial detail information of images, respectively, to address false and missed detections. Zhang et al. [25] proposed a cloud detection framework combining CNN and Transformer to achieve high-precision cloud detection in optical remote sensing images. However, existing CNN-based cloud detection methods still suffer from semantic gaps in the feature fusion process and do not model context well, leading to false and missed detections when predicting cloud regions.

2.2 Multi-level feature integration

Several works on dense prediction tasks demonstrate that features from multiple layers help generate better predictions [26]. Zhao et al. [27] proposed a pyramidal feature attention network to enhance high-level semantic features and low-level spatial structure features. Zhang et al. [28] proposed a feature aggregation framework that integrates multi-level CNN features at different resolutions. Pang et al. [29] proposed an aggregate interaction module to integrate features from adjacent layers. Ma et al. [30] proposed a pyramidal feature shrinkage network, which aggregates neighboring feature nodes in pairs by shrinking layer by layer. Similarly, researchers use multi-layer feature fusion strategies for remote sensing image cloud detection tasks. Wang et al. [31] proposed the cloud detection network ABNet, in which a full-scale feature fusion module optimizes features and recovers spatial information by integrating features at each scale, and a boundary point prediction module further corrects cloud boundary information by classifying cloud boundary points. Guo et al. [32] proposed a lightweight deep learning cloud detection framework with a multi-feature fusion strategy: it extracts learnable hand-crafted and convolutional features from the visible and near-infrared bands, uses a lightweight fully convolutional neural network, ClouDet, with a dilated convolution module to extract multi-scale contextual information, and progressively recovers segmentation results of the same size as the input image. Zhao et al. [33] designed a new cloud detection network, DMNet, which contains a dense feature enhancement module (DFEM) and a multi-scale context fusion spatial attention module (MCFSAM) for cloud detection on GF-1 WFV data. These CNN-based multi-level feature fusion strategies aim to alleviate the semantic gap between feature levels and to better aggregate features. However, due to the inherent shortcomings of CNNs, multi-level feature fusion is still problematic: it introduces a large number of redundant features into the fusion process while failing to fully fuse valuable features.

2.3 Graph convolutional networks

Graph Neural Networks (GNNs) are models that capture graph dependencies by passing messages between the nodes of a graph; unlike standard neural networks, GNNs can represent information from a node's neighborhood at arbitrary depth [34]. Graph Convolutional Neural Networks (GCNs) are variants of GNNs that aim to extend convolution to the graph domain. In recent years, various GCN-based models have been proposed for different applications. Typical applications in computer vision include 3D pose estimation [35], zero-shot learning [36], and point cloud classification and segmentation [37]. Luo et al. [38] introduced cascaded graph models to exploit multi-scale, cross-modal information for salient object detection. Zhang et al. [39] proposed a new adaptive graph convolutional network with attention graph clustering for co-saliency detection. Hong et al. [40] developed miniGCNs for hyperspectral image classification. Zhao et al. [41] proposed graph feature pyramid networks to enhance multi-scale features from convolutional feature pyramid networks for object detection. Zhai et al. [42] designed a mutual graph learning model that extends the traditional mutual learning idea to the graph domain for camouflaged object detection. Wu et al. [43] introduced a bidirectional graph reasoning network that embeds graph structure into a conventional panoptic segmentation network to mine the intra- and inter-modal relationships between foreground and background classes. For remote sensing image cloud detection, this paper designs a graph-based feature interaction graph reasoning module that reasons jointly over low-level and high-level features and can learn better image feature representations, overcoming several challenges of CNN-based methods.

3 Method

At the heart of Cloud-Graph is the use of the feature interaction graph reasoning (FIGR) module to reason about the relationships between low-level and high-level features in order to fully fuse detailed and semantic features. In this section, an overview of the feature interaction graph convolutional network architecture proposed in this paper is given first. Then, the construction process of the FIGR module is described in detail. Finally, the implementation details of the model are presented.

3.1 Problem formulation

The remote sensing image cloud detection task is to predict a cloud region image $y \in Y$ given an input image $x \in X$. The input space $X$ is the RGB space of the image and the target space $Y$ has only one category, i.e. cloud region. In the method proposed in this paper, the graph-based model is defined as a function $f_{θ}: x \to y$ with parameters θ, and the goal is to design a model that makes full use of the image features extracted by the convolutional network to learn powerful representations, so that this mapping can be performed more accurately.

3.2 Network overview

The Cloud-Graph network architecture is shown in Fig. 1 and consists of three main components: image feature extraction (IFE), feature interaction graph reasoning (FIGR), and high-resolution feature recovery (HRFR).

Fig. 1

The overall architecture of the Cloud-Graph network. The network consists of three main components: image feature extraction, feature interaction graph reasoning, and feature recovery.

IFE: This paper uses the ResNet-50 backbone network to extract image multi-scale features. Given an input image $I \in ℝ^{H \times W \times 3}$ , the multi-scale features are decoupled into low-level features containing detail information $F_{L} \in ℝ^{h \times w \times c}$ and high-level features containing semantic information $F_{H} \in ℝ^{h \times w \times c}$ .

FIGR: This paper begins by mapping F_L and F_H to non-Euclidean space via the graph projection operation f_Gproj (·), generating a low-level feature graph $G_{L} = (V_{L}, ɛ_{L})$ and a high-level feature graph $G_{H} = (V_{H}, ɛ_{H})$. As shown in Fig. 2, pixels with similar features form similar feature regions, and each node in $V$ aggregates the features of its pixels to form the node features $Z \in ℝ^{c \times | V |}$ of the graph $G$. Based on Z, this paper measures the distance between nodes and calculates the adjacency matrix, which represents the edges ɛ. Then, f_FI (·) of the FIGR module is used to capture the higher-order dependencies between $G_{L}$ and $G_{H}$, fully fusing the detail and semantic features to obtain the low-level feature graph nodes $V_{L}^{'}$ and the high-level feature graph nodes $V_{H}^{'}$ after feature interaction. Graph inference is then performed using the residual graph convolution network to obtain the enhanced higher-order graph representations $G_{L}^{'}$ and $G_{H}^{'}$. Finally, $G_{L}^{'}$ and $G_{H}^{'}$ are projected by f_Rproj (·) back to the original Euclidean space to obtain ${\tilde{F}}_{L}$ and ${\tilde{F}}_{H}$.

Fig. 2

Graph node composition. Node generation principle: similar feature areas converge into graph nodes.

HRFR: This paper fuses ${\tilde{F}}_{L}$, ${\tilde{F}}_{H}$, F_L and F_H using element-wise summation and feature cascade operations, and recovers high-resolution feature maps through bilinear interpolation. At the same time, the recovery process is deeply supervised to improve the accuracy of the recovered prediction.

3.3 Feature interaction graph convolutional network

This section details image feature extraction (IFE), feature interaction graph reasoning (FIGR) and high-resolution feature recovery (HRFR).

Image feature extraction (IFE). f_IFE (·) takes the RGB preview image of a remote sensing scene as input and generates the low-level and high-level feature maps, i.e. F_L and F_H. Specifically, given the input image $I \in ℝ^{H \times W \times 3}$, the image features F_i are extracted using ResNet-50:

$F_{i} = f_{IFE} (I; θ_{IFE})$ (1)

where $F_{i} \in ℝ^{h \times w \times c}$ has a spatial resolution of h × w and c channels, and i ∈ [1, 5]. In this paper, the features F_i are further decoupled to obtain the low-level detail features F_L and the high-level semantic features F_H:

$\left\{ \begin{matrix} F_{L} = C (F_{1}, Conv (U (F_{2}))) \\ F_{H} = C (F_{3}, Conv (U (F_{4})), Conv (U (F_{5}))) \end{matrix} \right.$ (2)

where $F_{L} \in ℝ^{h \times w \times c}$ and $F_{H} \in ℝ^{h \times w \times c}$ have h × w spatial resolution and c channels, C (·) denotes the feature cascade (channel-wise concatenation) operation, Conv (·) denotes a convolution operation, and U (·) denotes an up-sampling operation.
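The decoupling in Equation (2) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the toy feature shapes, the nearest-neighbour `upsample_nearest`, and the 1 × 1 `conv1x1` with random weights are all assumptions chosen only to show how up-sampling, convolution, and channel-wise concatenation compose.

```python
import numpy as np

def upsample_nearest(feat, size):
    """Nearest-neighbour resize of an (h, w, c) map to (size, size, c). Stand-in for U(.)."""
    h, w, _ = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[rows][:, cols]

def conv1x1(feat, weight):
    """1x1 convolution as a per-pixel linear map over channels. Stand-in for Conv(.)."""
    return feat @ weight

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the five backbone stages F_1..F_5 (toy sizes).
F1 = rng.standard_normal((32, 32, 4))
F2 = rng.standard_normal((16, 16, 8))
F3 = rng.standard_normal((8, 8, 16))
F4 = rng.standard_normal((4, 4, 32))
F5 = rng.standard_normal((2, 2, 64))

# F_L = C(F_1, Conv(U(F_2))): bring F_2 to F_1's resolution and channel count.
F_L = np.concatenate(
    [F1, conv1x1(upsample_nearest(F2, 32), rng.standard_normal((8, 4)))], axis=-1)

# F_H = C(F_3, Conv(U(F_4)), Conv(U(F_5))): bring F_4 and F_5 to F_3's resolution.
F_H = np.concatenate(
    [F3,
     conv1x1(upsample_nearest(F4, 8), rng.standard_normal((32, 16))),
     conv1x1(upsample_nearest(F5, 8), rng.standard_normal((64, 16)))], axis=-1)

print(F_L.shape, F_H.shape)
```

In this sketch each group is resized to the resolution of its first member; the text states that F_L and F_H end up with the same h × w × c size, so the actual network presumably applies further resizing that is not shown here.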

Feature interaction graph reasoning (FIGR). FIGR aims to fully interact low-level features with high-level features to reason about the higher-order relationships between the two sets of features and to alleviate the semantic gap problem in the feature fusion process. It consists of four operations: (1) graph projection f_Gproj (·) (2) feature interaction f_FI (·) (3) graph inference f_GR (·) (4) graph reprojection f_Rproj (·).

(1) Graph projection f_Gproj (·). In this paper, pixels with similar features are clustered into the same node, i.e. similar feature node aggregation, as shown in Fig. 2. Following [44], f_Gproj (·) is parameterized by $W \in ℝ^{c \times | V |}$ and $Σ \in ℝ^{c \times | V |}$, while the number of nodes $| V |$ is pre-specified. Each column $w_{k} \in ℝ^{c}$ of W specifies an anchor point for vertex k, and $x_{ij} \in ℝ^{c}$ denotes the c-dimensional feature at pixel (i, j). The soft assignment $q_{ij}^{k}$ of the feature vector x_ij to w_k is calculated as:

$q_{ij}^{k} = \frac{\exp (- {∥ (x_{ij} - w_{k}) / σ_{k} ∥}_{2}^{2} / 2)}{\sum_{k'} \exp (- {∥ (x_{ij} - w_{k'}) / σ_{k'} ∥}_{2}^{2} / 2)}$ (3)

where σ_k is the k-th column of Σ, constrained to (0, 1) by a sigmoid function, and / denotes element-wise division. This formula calculates the weighted Euclidean distance between x_ij and w_k and uses the softmax function to obtain the soft assignment. We denote by $Q_{L} \in ℝ^{(h \times w) \times | V |}$ and $Q_{H} \in ℝ^{(h \times w) \times | V |}$ the soft assignment matrices from pixels to vertices:

$\left\{ \begin{matrix} Q_{L} = [q_{L, ij}]_{ij = 0}^{(h \times w) - 1} = [[q_{L, ij}^{k}]_{k = 0}^{| V | - 1}]_{ij = 0}^{(h \times w) - 1} \\ Q_{H} = [q_{H, ij}]_{ij = 0}^{(h \times w) - 1} = [[q_{H, ij}^{k}]_{k = 0}^{| V | - 1}]_{ij = 0}^{(h \times w) - 1} \end{matrix} \right.$ (4)

The graph node features $Z \in ℝ^{c \times | V |}$ are then calculated, with the features of each graph node given by:

$z_{k} = \frac{z_{k}^{'}}{{∥ z_{k}^{'} ∥}_{2}}, \quad z_{k}^{'} = \frac{1}{\sum_{ij} q_{ij}^{k}} \sum_{ij} q_{ij}^{k} (x_{ij} - w_{k}) / σ_{k}$ (5)

where $z_{k}^{'}$ is the weighted average of the residuals between the feature vectors x_ij and the vertex parameter w_k. $z_{k}^{'}$ is further L2-normalized to obtain the feature vector z_k of vertex k, and z_k forms the k-th column of the feature matrix Z. The graph adjacency matrix $A \in ℝ^{| V | \times | V |}$ is:

$A = f_{norm} (Z^{T} \times Z)$ (6)

where f_norm (·) denotes the normalization operation.
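Equations (3), (5) and (6) can be traced with a minimal NumPy sketch. It is an illustration under assumptions rather than the paper's code: pixels are flattened to an (n, c) matrix, the anchors are initialised from the first few pixels, and f_norm is left out because its exact form is not specified in the text. The function name `graph_projection` is hypothetical.

```python
import numpy as np

def graph_projection(X, W, Sigma):
    """X: (n, c) pixel features; W: (c, k) anchors; Sigma: (c, k) scales in (0, 1).

    Returns the soft assignment Q (n, k), node features Z (c, k),
    and the unnormalised adjacency Z^T Z (k, k).
    """
    # Scaled residuals (x_ij - w_k) / sigma_k for every pixel/node pair: (n, k, c).
    R = (X[:, None, :] - W.T[None, :, :]) / Sigma.T[None, :, :]
    # Eq. (3): softmax over nodes of the negative half squared distance.
    logits = -0.5 * np.sum(R ** 2, axis=2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    Q = np.exp(logits)
    Q /= Q.sum(axis=1, keepdims=True)
    # Eq. (5): assignment-weighted mean residual per node, then L2 normalisation.
    Z_raw = np.einsum("nk,nkc->kc", Q, R) / Q.sum(axis=0)[:, None]
    Z = (Z_raw / np.linalg.norm(Z_raw, axis=1, keepdims=True)).T
    # Eq. (6) without f_norm (assumption: normalisation omitted in this sketch).
    A = Z.T @ Z
    return Q, Z, A

rng = np.random.default_rng(0)
n, c, k = 100, 4, 8                      # toy sizes; the paper uses |V| = 8 nodes
X = rng.standard_normal((n, c))
W = X[:k].T.copy()                       # assumption: anchors initialised from pixels
Sigma = 1.0 / (1.0 + np.exp(-rng.standard_normal((c, k))))  # sigmoid keeps sigma in (0, 1)
Q, Z, A = graph_projection(X, W, Sigma)
print(Q.shape, Z.shape, A.shape)
```

Because every column of Z has unit length, the diagonal of Z^T Z is 1 and the off-diagonal entries are cosine similarities between node features.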

(2) Feature interaction f_FI (·). f_FI (·) models the interaction between the low-level and high-level feature graphs, guiding inter-graph message passing from $G_{L}$ to $G_{H}$ and from $G_{H}$ to $G_{L}$. Inspired by the work of Fu et al. [45], this paper uses an attention mechanism to compute inter-graph dependencies. When the high-level feature graph passes messages to the low-level feature graph, as shown in Fig. 3, $G_{H}$ is transformed into the key graph nodes $V_{H}^{K}$ and the value graph nodes $V_{H}^{V}$, and $G_{L}$ into the query graph nodes $V_{L}^{Q}$. The adjacency matrix A_H→L is then calculated as follows:

$A_{H \to L} = f_{norm} ((V_{H}^{K})^{T} \times V_{L}^{Q})$ (7)

Fig. 3

Feature interaction model. High-level feature graphs pass messages to low-level feature graphs.

When the low-level feature graph passes messages to the high-level feature graph, as shown in Fig. 4, $G_{L}$ is transformed into the key graph nodes $V_{L}^{K}$ and the value graph nodes $V_{L}^{V}$, and $G_{H}$ into the query graph nodes $V_{H}^{Q}$. The adjacency matrix A_L→H is then calculated as follows:

$A_{L \to H} = f_{norm} ((V_{L}^{K})^{T} \times V_{H}^{Q})$ (8)

Fig. 4

Feature interaction model. Low-level feature graphs pass messages to high-level feature graphs.

After that, the paper completes the transfer of semantic information from $G_{H}$ to $G_{L}$ and of detail information from $G_{L}$ to $G_{H}$ in the following way:

$\left\{ \begin{matrix} G_{L}^{'} = f_{FI} (V_{L}^{Q}, V_{H}^{K}, V_{H}^{V}) = Softmax (A_{H \to L}) \times V_{H}^{V} + G_{L} \\ G_{H}^{'} = f_{FI} (V_{H}^{Q}, V_{L}^{K}, V_{L}^{V}) = Softmax (A_{L \to H}) \times V_{L}^{V} + G_{H} \end{matrix} \right.$ (9)

where $G_{L}^{'}$ and $G_{H}^{'}$ denote the low-level feature graph nodes and the high-level feature graph nodes after feature interaction, respectively.
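The inter-graph message passing of Equations (7) to (9) can be sketched as cross-attention between two sets of node features. The sketch below makes several assumptions that are not in the text: nodes are stored as rows so that the matrix products compose, the query/key/value transforms are plain linear maps (`Wq`, `Wk`, `Wv`), the softmax is taken over source nodes, and f_norm is folded into that softmax. The name `pass_messages` is hypothetical.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pass_messages(G_dst, G_src, Wq, Wk, Wv):
    """Send messages from graph G_src to graph G_dst (nodes as rows, shape (n, c))."""
    Q = G_dst @ Wq                       # query nodes of the receiving graph
    K = G_src @ Wk                       # key nodes of the sending graph
    V = G_src @ Wv                       # value nodes of the sending graph
    attn = softmax(Q @ K.T, axis=1)      # (n_dst, n_src): weight of each source node
    return attn @ V + G_dst              # aggregated messages plus residual, as in Eq. (9)

rng = np.random.default_rng(0)
n_nodes, c = 8, 16
G_L = rng.standard_normal((n_nodes, c))  # low-level graph nodes (toy values)
G_H = rng.standard_normal((n_nodes, c))  # high-level graph nodes (toy values)
# Weights are shared between the two directions only to keep the sketch short.
Wq, Wk, Wv = (0.1 * rng.standard_normal((c, c)) for _ in range(3))

G_L_new = pass_messages(G_L, G_H, Wq, Wk, Wv)   # semantics flow from G_H to G_L
G_H_new = pass_messages(G_H, G_L, Wq, Wk, Wv)   # details flow from G_L to G_H
print(G_L_new.shape, G_H_new.shape)
```

Both updates read the original G_L and G_H, matching Equation (9), where each enhanced graph is its own input plus a message aggregated from the other graph.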

(3) Graph reasoning f_GR (·). After the feature interaction is completed, intra-graph inference is performed on $G_{L}^{'}$ and $G_{H}^{'}$ to obtain an enhanced graph representation. This paper improves the graph convolution [46] to complete the higher-order relation inference over features:

$\left\{ \begin{matrix} G_{H}^{l + 1} = f_{GR} (G_{H}^{'}) = σ (A_{H \to L} G_{H}^{'} W_{H}^{l}) + G_{H}^{l} \\ G_{L}^{l + 1} = f_{GR} (G_{L}^{'}) = σ (A_{L \to H} G_{L}^{'} W_{L}^{l}) + G_{L}^{l} \end{matrix} \right.$ (10)

where σ (·) is the sigmoid activation function and l indexes the graph convolution layers; the number of graph convolution layers in this paper is 4. $W_{H}^{l}$ and $W_{L}^{l}$ denote the learnable parameters of the graph convolution layers.
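The residual graph convolution of Equation (10) can be sketched in a few lines of NumPy. This is a simplified reading, assuming that layer l takes the output of layer l − 1 as its input, that nodes are stored as rows of an (n, c) matrix, and that a uniform adjacency matrix stands in for the learned one; `residual_gcn` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_gcn(G, A, weights):
    """Apply len(weights) residual graph convolution layers.

    G: (n, c) node features, A: (n, n) adjacency, weights: list of (c, c) matrices.
    Each layer computes sigmoid(A @ G @ W) and adds the layer input back.
    """
    for W in weights:
        G = sigmoid(A @ G @ W) + G
    return G

rng = np.random.default_rng(0)
n_nodes, c, n_layers = 8, 16, 4          # 8 nodes and 4 layers follow the paper
G0 = rng.standard_normal((n_nodes, c))
A = np.full((n_nodes, n_nodes), 1.0 / n_nodes)   # assumption: uniform toy adjacency
weights = [0.1 * rng.standard_normal((c, c)) for _ in range(n_layers)]

G4 = residual_gcn(G0, A, weights)
print(G4.shape)
```

The residual term keeps the original node features available at every layer, so the stacked graph convolutions refine the representation rather than replace it.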

(4) Graph reprojection f_Rproj (·). The enhanced feature graphs $G_{L}^{'}$ and $G_{H}^{'}$ are each mapped back to the original Euclidean space, and the graph reprojection is calculated as follows:

$\left\{ \begin{matrix} {\tilde{F}}_{H} = f_{Rproj} (G_{H}^{'}) = Q_{H} (G_{H}^{'})^{T} + F_{H} \\ {\tilde{F}}_{L} = f_{Rproj} (G_{L}^{'}) = Q_{L} (G_{L}^{'})^{T} + F_{L} \end{matrix} \right.$ (11)

where ${\tilde{F}}_{H} \in ℝ^{h \times w \times c}$ and ${\tilde{F}}_{L} \in ℝ^{h \times w \times c}$ are the high-level and low-level features enhanced by feature interaction graph inference, respectively.
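As a shape check on Equation (11), the following sketch reprojects node features back to the pixel grid. It assumes that pixels are flattened in row-major order and that the node features are stored as a (c, |V|) matrix, as in the definition of Z; the toy values and the name `reproject` are illustrative only.

```python
import numpy as np

def reproject(Q, G, F):
    """Eq. (11): distribute node features back to pixels and add the input features.

    Q: (h*w, k) soft assignment, G: (c, k) node features, F: (h, w, c) input features.
    """
    h, w, c = F.shape
    return (Q @ G.T).reshape(h, w, c) + F

# Toy example: 2x2 pixels, 3 channels, 2 nodes, hard (one-hot) assignments.
G = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])              # column k holds the features of node k
Q = np.array([[1.0, 0.0],                # pixel (0, 0) belongs to node 0
              [0.0, 1.0],                # pixel (0, 1) belongs to node 1
              [0.0, 1.0],                # pixel (1, 0) belongs to node 1
              [1.0, 0.0]])               # pixel (1, 1) belongs to node 0
F = np.zeros((2, 2, 3))

F_tilde = reproject(Q, G, F)
print(F_tilde[0, 0], F_tilde[0, 1])
```

With one-hot assignments every pixel simply receives the feature vector of its node; with the soft assignments of Equation (3) it receives a weighted mixture of all node features.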

High-resolution feature recovery (HRFR). f_HRFR (·) takes as input the features fused by feature cascading and element-wise addition, and outputs the cloud regions predicted by the network:

$Out = f_{HRFR} ({\tilde{F}}_{LL}, {\tilde{F}}_{LH}, {\tilde{F}}_{HH})$ (12)

where ${\tilde{F}}_{LL}$ denotes the fusion of ${\tilde{F}}_{L}$ and F_L, ${\tilde{F}}_{LH}$ denotes the fusion of ${\tilde{F}}_{L}$ and ${\tilde{F}}_{H}$, and ${\tilde{F}}_{HH}$ denotes the fusion of ${\tilde{F}}_{H}$ and F_H. Specifically, the fusion is as follows:

$\left\{ \begin{matrix} {\tilde{F}}_{LL} = Conv (C ({\tilde{F}}_{L}, Conv (D (F_{L})))) \\ {\tilde{F}}_{LH} = Conv (U ({\tilde{F}}_{L} + {\tilde{F}}_{H})) \\ {\tilde{F}}_{HH} = Conv (C (Conv (U (F_{H})), {\tilde{F}}_{H})) \end{matrix} \right.$ (13)

where Conv (·) denotes a convolution operation, C (·) denotes a feature cascade operation, U (·) denotes feature up-sampling, and D (·) denotes feature down-sampling.

3.4 Implementation details

In this paper, ResNet-50 [48], which was pre-trained on ImageNet [47], was used as the backbone, and the input image size was 352 × 352 × 3. Both the low-level feature and high-level feature sizes of the IFE module output were 60 × 60 × 512, and then graph node construction was performed. Due to the limitation of computational resources, the number of both low-level feature graph nodes and high-level feature graph nodes is set to 8 in this paper.

The training loss in this paper is defined as the sum of all supervised losses:

$Loss = \sum_{k = 1}^{K} l_{bce}^{k}$ (14)

where $l_{bce}^{k}$ is the binary cross-entropy (BCE) loss of the k-th supervised output. This paper applies deep supervision to the network with K = 4.

The BCE loss [49], shown in Equation (15), is the most widely used loss in binary classification and segmentation:

$l_{bce} = - \sum_{(r, c)} [G (r, c) \log (S (r, c)) + (1 - G (r, c)) \log (1 - S (r, c))]$ (15)

where G (r, c) is the mask label of the pixel (r, c) and S (r, c) is the predicted probability that the pixel belongs to a cloud region.
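A small numerical sketch of Equations (14) and (15) follows. It sums the BCE over pixels, as written in Equation (15), and over K = 4 supervised outputs; the clipping constant `eps` and the toy 2 × 2 arrays are assumptions added only to keep the logarithm finite and the example checkable by hand.

```python
import numpy as np

def bce_loss(S, G, eps=1e-7):
    """Eq. (15): binary cross-entropy summed over all pixels.

    S: predicted cloud probabilities in [0, 1]; G: binary mask of the same shape.
    """
    S = np.clip(S, eps, 1.0 - eps)       # avoid log(0); not part of Eq. (15)
    return float(-np.sum(G * np.log(S) + (1.0 - G) * np.log(1.0 - S)))

def deep_supervision_loss(predictions, G):
    """Eq. (14): sum of the BCE losses of all supervised outputs."""
    return sum(bce_loss(S, G) for S in predictions)

G = np.array([[1.0, 0.0],
              [0.0, 1.0]])               # toy reference mask
uncertain = np.full((2, 2), 0.5)         # a prediction that is 0.5 everywhere
predictions = [uncertain] * 4            # K = 4 supervised outputs

total = deep_supervision_loss(predictions, G)
print(total)                             # 4 outputs x 4 pixels x ln 2
```

A prediction of 0.5 costs ln 2 per pixel regardless of the label, which makes this case easy to verify by hand: 4 outputs × 4 pixels × ln 2 ≈ 11.09.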

4 Experimental results and analysis

4.1 Experimental setup

The proposed network is implemented in PyTorch. For training, the backbone weights are initialized from the pre-trained ResNet-50, and the remaining convolutional layers and modules are initialized randomly. The network is optimized with Adam, using an initial learning rate of 5e-5 and a weight decay of 0.0005; the learning rate is divided by 10 every 10 epochs. Training runs for 40 epochs with a batch size of 8. The models are trained on an NVIDIA Tesla V100 GPU. During testing, all models are executed on an NVIDIA RTX 3060 GPU with 12 GB of memory.
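The step decay described above can be written out explicitly. The sketch below is plain Python and assumes zero-based epoch indexing; the function name `learning_rate` is hypothetical. In PyTorch, the same schedule is what `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.1` produces.

```python
import math

def learning_rate(epoch, base_lr=5e-5, step=10, gamma=0.1):
    """Learning rate at a given zero-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at the start of each 10-epoch stage of the 40-epoch run.
schedule = [learning_rate(e) for e in (0, 10, 20, 30)]
print(schedule)
```

Over the 40 training epochs the learning rate therefore takes four values, falling from 5e-5 in the first ten epochs to about 5e-8 in the last ten.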

4.2 Dataset

To evaluate the performance of the proposed network, experiments are conducted on three typical optical remote sensing image cloud detection datasets. These are the open-sourced 38-Cloud dataset, the SPARCS dataset, and the cloud detection dataset (CHLandsat-8) collected and self-built from Landsat-8 satellite data. The CHLandsat-8 dataset is more challenging as it has a higher spatial resolution and more representative land cover types.

The 38-Cloud dataset [8] consists of 38 Landsat-8 scenes, with 18 scenes in the training set and 20 scenes in the test set. The scene images were cropped to 384 × 384 patches, giving 8400 images in the training set and 9201 images in the test set.

The SPARCS dataset [50] consists of 1000 × 1000 × 3 patches extracted from 80 complete Landsat-8 scenes and is commonly used for training and testing remote sensing image cloud detection algorithms.

The CHLandsat-8 dataset includes 64 full scenes of high-resolution remote sensing images collected by Landsat-8 satellites over different regions of China from January 2021 to December 2021. CHLandsat-8 is probably the first remote sensing image cloud detection dataset built from Landsat-8 imagery of China. Moreover, the scenes in CHLandsat-8 are more complex and diverse than those in previous datasets, making high-accuracy cloud detection more challenging. The dataset covers the Northwest, North, Qinghai-Tibet, and South regions of China and contains a variety of land cover types including urban, snow and ice, grassland, mountain, forest, ocean, and desert. The dataset consists of preview images of the remote sensing scenes, each approximately 8000 × 8000 × 3 in size. In addition, the reference cloud Masks for the dataset have been annotated and are available online. The open-sourced CHLandsat-8 dataset is intended to help promote research on cloud detection (https://github.com/HaiLei-Fly/CHLandsat8/).

In the process of building the dataset, experts manually marked the position of the clouds in each image pixel by pixel. The reference Mask was created by marking cloud pixels with 1 and background pixels with 0. The reference Masks were iteratively checked and corrected to ensure the accuracy of the labels. In this work, 44 images are randomly selected from the dataset as the training set, named CHLandsat-8-TR, and the remaining 20 images are used as the test set, named CHLandsat-8-TE.

Due to the limited memory of the GPU, the 3 datasets with different scene images are cropped to the size of 352 × 352 × 3. The details of the dataset are shown in Table 1.

Table 1
Dataset details

Dataset	Scenes	Images	Train/Test
CHLandsat-8-TR	44	22616	Train
CHLandsat-8-TE	20	10080	Test
38-Cloud-Test	20	10906	Test
SPARCS	80	720	Test

4.3 Evaluation indicators

The proposed algorithm is evaluated through some widely used comprehensive metrics, including (1) maximum F-measure (MaxFm), (2) mean absolute error (MAE), (3) weighted F-measure (WFm), (4) average F-measure (AvgFm), (5) S-measure (Sm), and (6) E-measure (Em).

F-measure (F_β) is an overall performance metric that combines precision and recall [51]: $F_{\beta} = \frac{(1 + \beta^{2}) \cdot precision \cdot recall}{\beta^{2} \cdot precision + recall}$ (16) where β² is set to 0.3 to weight precision more heavily than recall. The maximum F-measure and the average F-measure are calculated for evaluation in this work [52].
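As an illustration of Eq. (16), the following sketch computes the F-measure of a soft prediction at a sweep of binarization thresholds and then takes the maximum and the mean. The use of 256 evenly spaced thresholds, the averaging over all of them, and the function name `f_measure_curve` are assumptions made for this example; the evaluation protocol actually used in the experiments is not specified in this section.

```python
import numpy as np


def f_measure_curve(pred, mask, beta2=0.3, num_thresholds=256, eps=1e-8):
    """F-measure of a soft prediction at evenly spaced thresholds in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    mask = np.asarray(mask) > 0.5
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    scores = np.zeros(num_thresholds)
    for k, t in enumerate(thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, mask).sum()
        precision = tp / (binary.sum() + eps)
        recall = tp / (mask.sum() + eps)
        # Eq. (16), with eps guarding against division by zero.
        scores[k] = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return scores


pred = np.array([[0.9, 0.8], [0.3, 0.1]])  # predicted cloud probabilities
mask = np.array([[1, 1], [0, 0]])          # reference Mask
scores = f_measure_curve(pred, mask)
max_fm, avg_fm = scores.max(), scores.mean()
print(max_fm, avg_fm)
```

In this toy case any threshold between 0.3 and 0.8 separates cloud from background perfectly, so the maximum F-measure is close to 1 while the average is lower.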

MAE is used as a complement to F-measure to calculate, pixel by pixel, the mean absolute error between the predicted cloud area and the cloud Mask [53]: $MAE = \frac{1}{W \times H} \sum_{i = 1}^{W} \sum_{j = 1}^{H} | G(i, j) - S(i, j) |$ (17) where $S \in [0, 1]^{W \times H}$ is the predicted cloud area image and $G \in [0, 1]^{W \times H}$ is the cloud Mask image.
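Eq. (17) translates directly into a few lines of NumPy. The sketch below is only an illustration; `mean_absolute_error` is a hypothetical helper name, and both inputs are assumed to be arrays of the same shape with values in [0, 1].

```python
import numpy as np


def mean_absolute_error(pred, mask) -> float:
    """Pixel-wise mean absolute error between prediction S and Mask G, Eq. (17)."""
    pred = np.asarray(pred, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    if pred.shape != mask.shape:
        raise ValueError("prediction and Mask must have the same shape")
    return float(np.abs(mask - pred).mean())


pred = [[0.9, 0.8], [0.3, 0.1]]
mask = [[1, 1], [0, 0]]
# (0.1 + 0.2 + 0.3 + 0.1) / 4, i.e. approximately 0.175
print(mean_absolute_error(pred, mask))
```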

The F-measure is further extended by the weighted F-measure [54], which extends the four fundamental quantities TP, TN, FP, and FN to real values and assigns different weights ω to different errors at different locations, taking neighborhood information into account: $F_{\beta}^{\omega} = \frac{(1 + \beta^{2}) \cdot precision^{\omega} \cdot recall^{\omega}}{\beta^{2} \cdot precision^{\omega} + recall^{\omega}}$ (18)

S-measure [55] is more sensitive to foreground structural information than F-measure; it takes into account both region-aware structural similarity S_r and object-aware structural similarity S_o: $S_{m} = \alpha \cdot S_{o} + (1 - \alpha) \cdot S_{r}$ (19) where α is set to 0.5.

E-measure considers both the global mean and local pixel matching of an image and is defined as [56]: $E_{m} = \frac{1}{W \times H} \sum_{c = 1}^{W} \sum_{r = 1}^{H} \Phi(r, c)$ (20) where Φ is the enhanced alignment matrix, and (H, W) and (r, c) denote the image (height, width) and the pixel coordinates, respectively.
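To make Eq. (20) concrete, the sketch below builds an enhanced alignment matrix Φ in the way we understand it to be defined in [56]: both maps are centered by their global means, an alignment term is computed per pixel, and the result is mapped to [0, 1] by a quadratic function. The binarization threshold of 0.5, the small constant `eps`, and the function name `e_measure` are assumptions for illustration and are not taken from this paper.

```python
import numpy as np


def e_measure(pred, mask, threshold=0.5, eps=1e-8) -> float:
    """E-measure of a binarized prediction against a binary Mask, Eq. (20)."""
    fm = (np.asarray(pred) >= threshold).astype(np.float64)
    gt = (np.asarray(mask) > 0.5).astype(np.float64)
    if gt.sum() == 0:            # Mask contains no cloud pixels
        enhanced = 1.0 - fm
    elif gt.sum() == gt.size:    # Mask contains only cloud pixels
        enhanced = fm
    else:
        # Center both maps by their global means (global statistics).
        a_fm = fm - fm.mean()
        a_gt = gt - gt.mean()
        # Per-pixel alignment in [-1, 1] (local matching).
        align = 2.0 * a_gt * a_fm / (a_gt ** 2 + a_fm ** 2 + eps)
        # Quadratic mapping to [0, 1] gives the enhanced alignment matrix.
        enhanced = (align + 1.0) ** 2 / 4.0
    return float(enhanced.mean())


mask = np.array([[1, 1], [0, 0]])
print(e_measure(mask, mask))      # close to 1 for a perfect prediction
print(e_measure(1 - mask, mask))  # close to 0 for an inverted prediction
```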

4.4 Comparative experiments

The experimental results are compared quantitatively and qualitatively with existing SOTA methods, including FCN8s [18], UNet [20], PSPNet [57], SEGNet [21], GFRNet [58], CDNet [6], CDNetV2 [7], Cloud-Net [8], and ClouDet [32]. For a fair comparison, all experimental results are obtained in the same experimental environment using the authors' open-sourced code, with the CHLandsat-8-TR dataset used for training and the 3 test datasets used for testing.

1) Quantitative experiments

To evaluate the quality of the segmented cloud regions, quantitative comparison experiments are performed. Table 2 shows the results trained on the CHLandsat-8-TR dataset and evaluated on the CHLandsat-8-TE dataset. The evaluation metrics in the table include MaxFm, MAE, WFm, AvgFm, Sm, and Em. The top three results for each metric are marked in red, blue, and green, respectively. The comparison shows that Cloud-Graph achieves the best performance on all six evaluation metrics, indicating that it captures more valuable features from the input images and generates cloud regions closer to the manually annotated Mask than the comparison algorithms. This supports the effectiveness of feature interaction graph reasoning (FIGR) and suggests that Cloud-Graph alleviates the semantic divide problem in the cross-level feature fusion process.

Table 2
Comparison of quantitative results with SOTA methods on the CHLandsat-8-TE dataset

Model	CHLandsat-8-TE (20)
	MAE	MaxFm	AvgFm	WFm	Sm	Em
FCN8s	0.106	0.874	0.858	0.784	0.742	0.827
UNet	0.113	0.862	0.826	0.745	0.729	0.789
PSPNet	0.097	0.879	0.869	0.799	0.767	0.858
SEGNet	0.102	0.874	0.850	0.778	0.754	0.824
GFRNet	0.123	0.852	0.827	0.736	0.716	0.779
Cloud-Net	0.101	0.875	0.846	0.764	0.736	0.795
ClouDet	0.095	0.884	0.870	0.796	0.764	0.828
CDNet	0.129	0.848	0.814	0.722	0.709	0.763
CDNetV2	0.125	0.842	0.823	0.735	0.714	0.790
Cloud-Graph	0.069	0.892	0.888	0.854	0.829	0.921

Table 3 shows the results trained on the CHLandsat-8-TR dataset and evaluated on the 38-Cloud-Test dataset, with the same evaluation metrics as in Table 2. The top three results for each metric are marked in red, blue, and green, respectively. Cloud-Graph does not achieve the best performance on the MaxFm and AvgFm metrics, trailing ClouDet by 0.021 in MaxFm and by 0.012 in AvgFm, but it outperforms all comparison algorithms on the other four evaluation metrics.

Table 3

Comparison of quantitative results with SOTA methods on the 38-Cloud-Test dataset

Model	38-Cloud-Test (20)
	MAE	MaxFm	AvgFm	WFm	Sm	Em
FCN8s	0.069	0.857	0.833	0.768	0.768	0.817
UNet	0.064	0.879	0.862	0.797	0.785	0.856
PSPNet	0.065	0.831	0.823	0.759	0.777	0.889
SEGNet	0.056	0.857	0.850	0.800	0.806	0.912
GFRNet	0.079	0.843	0.824	0.751	0.754	0.817
Cloud-Net	0.055	0.890	0.878	0.761	0.798	0.870
ClouDet	0.052	0.896	0.882	0.819	0.824	0.899
CDNet	0.106	0.835	0.823	0.738	0.727	0.793
CDNetV2	0.108	0.817	0.805	0.718	0.721	0.781
Cloud-Graph	0.035	0.875	0.870	0.839	0.853	0.945

The results of Cloud-Graph and the comparison algorithms trained on the CHLandsat-8-TR dataset and evaluated on the SPARCS dataset are shown in Table 4. The top three results for each metric are marked in red, blue, and green, respectively. Cloud-Graph ties with ClouDet on the MAE metric and outperforms all comparison algorithms on the other five evaluation metrics.

Table 4

Comparison of quantitative results with SOTA methods on the SPARCS dataset

Model	SPARCS (80)
	MAE	MaxFm	AvgFm	WFm	Sm	Em
FCN8s	0.143	0.464	0.386	0.307	0.517	0.456
UNet	0.131	0.527	0.451	0.365	0.542	0.506
PSPNet	0.126	0.543	0.480	0.376	0.541	0.550
SEGNet	0.110	0.631	0.555	0.470	0.592	0.595
GFRNet	0.131	0.516	0.444	0.363	0.547	0.512
Cloud-Net	0.121	0.547	0.462	0.380	0.553	0.517
ClouDet	0.105	0.566	0.502	0.452	0.581	0.554
CDNet	0.116	0.616	0.546	0.459	0.592	0.595
CDNetV2	0.122	0.587	0.514	0.425	0.570	0.560
Cloud-Graph	0.105	0.645	0.582	0.490	0.606	0.627

2) Qualitative experiments

This paper presents qualitative comparison results between Cloud-Graph and the comparison algorithms FCN8s, UNet, PSPNet, SEGNet, GFRNet, CDNet, and CDNetV2 in order to verify the performance of the proposed method visually. The qualitative results show that Cloud-Graph is able to handle different types of cloud regions and generate accurate cloud detection results.

In order to demonstrate that Cloud-Graph can handle remote sensing images of different scenes, this paper selects grassland, desert, plateau, forest, and snow scenes from the CHLandsat-8-TE dataset. The cloud detection results of Cloud-Graph and the comparison algorithms are shown in Fig. 5. The CNN-based methods suffer from serious missed and false detections when facing remote sensing images of different scenes, and the method in this paper alleviates this problem to a certain extent.

Fig. 5

Comparison of visual results in CHLandsat-8-TE dataset with SOTA methods. Images from different scenes of the CHLandsat-8-TE dataset: Grassland Scene, Desert Scene, Plateau Scene, Forest Scene, and Snow Scene.

In order to further verify the cloud detection capability of the proposed method, simple, cloudy, partly cloudy, thin-cloud, ice and snow, and ocean scenes are selected from the 38-Cloud-Test dataset. The cloud detection results in Fig. 6 show that the proposed method can detect cloud regions in remote sensing images of different scenes, especially the ice and snow scene, and that it alleviates the missed and false detections of the comparison methods.

Fig. 6

Comparison of visual results in 38-Cloud-Test dataset with SOTA methods. Images from different scenes of the 38-Cloud-Test dataset: Simple Scene, Cloudy Scene, Partly Cloudy Scene, Thin-Cloud Scene, Ice and Snow Scene, and Ocean Scene.

For the SPARCS dataset, this paper selects the thin-cloud scene, simple scene, complex scene, snow scene, cloudy scene, cloud shadows scene, confusing background, and partly cloudy scene. The cloud detection results for Cloud-Graph and the comparison algorithms are shown in Fig. 7.

Fig. 7

Comparison of visual results in SPARCS dataset with SOTA methods. Images from the different scenes of the SPARCS dataset: Thin-Cloud Scene, Simple Scene, Complex Scene, Snow Scene, Cloudy Scene, Cloud Shadows Scene, Confusing Background, and Partly Cloudy Scene.

Cloud-Graph’s cloud detection results on the three datasets compare favorably with those of the comparison algorithms. The aim of this work is to make low-level and high-level features interact fully. The low-level features of remote sensing images contain rich cloud details but also retain snow and ice background noise, whereas the high-level features are rich in semantic information and can roughly locate cloud regions. When both kinds of features are aggregated through the FIGR module, much of the snow and ice background noise is not learned by the network. After the features are recovered to high resolution, the predicted cloud regions are more accurate, with fewer false positive and false negative results.

Thin clouds and cirrus clouds are also difficult to detect in remote sensing images because of their high transparency. Desert scene and forest scene images contain abundant thin and cirrus clouds, and an algorithm is susceptible to interference from the land cover type when performing cloud detection, ultimately identifying clouds as non-clouds. Cloud-Graph performs comparatively well in these scenes, which we again attribute to the FIGR module: low-level feature nodes interact fully with high-level feature nodes, their relationships are inferred through the GCN, and coarse cloud regions are fused with fine cloud features, so that the algorithm identifies clouds more accurately.
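The idea of letting low-level and high-level feature nodes exchange information through a residual graph convolution can be sketched as follows. This is not the FIGR module itself: it uses the generic propagation rule of [13] with a skip connection, and the toy graph, feature dimension, and function names are hypothetical choices made only for illustration.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2}, the symmetric normalization of [13]."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degrees are >= 1
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def residual_gcn_layer(x: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph convolution with a skip connection: X + ReLU(A_norm X W)."""
    propagated = normalize_adjacency(adj) @ x @ weight
    return x + np.maximum(propagated, 0.0)


# Hypothetical toy graph: nodes 0-1 stand for low-level features and
# nodes 2-3 for high-level features; each low-level node is linked to
# each high-level node so that the two levels can exchange information.
adj = np.array(
    [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 0, 0]],
    dtype=np.float64,
)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))             # one 8-dimensional feature per node
weight = 0.1 * rng.standard_normal((8, 8))  # square so the skip connection fits
out = residual_gcn_layer(x, adj, weight)
print(out.shape)  # (4, 8)
```

The skip connection keeps each node's original feature and adds only what is aggregated from its neighbors, which is one common way to stack graph convolutions without washing out node-specific detail.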

4.5 Ablation experiments

As the main contribution of this paper is the feature interaction graph reasoning (FIGR) module, the network structure ablation experiment only compares the network trained with and without the FIGR module. The designed algorithm, Cloud-Graph, uses ResNet-50 as the backbone network, and the network performance is evaluated using the MaxFm, MAE, and Sm metrics, with all other configurations of the algorithm remaining the same. The comparison results on the 3 test datasets are shown in Table 5 and demonstrate the effectiveness of the designed FIGR module.

Table 5
Results of the proposed network structure ablation experiment

Setting	CHLandsat-8-TE (20)			38-Cloud-Test (20)			SPARCS (80)
	MAE	MaxFm	Sm	MAE	MaxFm	Sm	MAE	MaxFm	Sm
Backbone	0.112	0.871	0.743	0.069	0.863	0.781	0.121	0.576	0.564
+FIGR	0.069	0.892	0.829	0.035	0.875	0.853	0.105	0.645	0.606
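The size of the FIGR contribution can be read off Table 5 by subtracting the backbone row from the +FIGR row. The short script below does this arithmetic with the values copied from the table; the variable names are arbitrary.

```python
# Values copied from Table 5, ordered as (MAE, MaxFm, Sm) for each test set.
backbone = {
    "CHLandsat-8-TE": (0.112, 0.871, 0.743),
    "38-Cloud-Test": (0.069, 0.863, 0.781),
    "SPARCS": (0.121, 0.576, 0.564),
}
with_figr = {
    "CHLandsat-8-TE": (0.069, 0.892, 0.829),
    "38-Cloud-Test": (0.035, 0.875, 0.853),
    "SPARCS": (0.105, 0.645, 0.606),
}
metrics = ("MAE", "MaxFm", "Sm")

# Change per metric when FIGR is added (negative MAE change = lower error).
delta = {
    name: {m: round(f - b, 3) for m, b, f in zip(metrics, backbone[name], with_figr[name])}
    for name in backbone
}
for name, change in delta.items():
    print(name, change)
```

On CHLandsat-8-TE, for example, adding FIGR lowers MAE by 0.043 and raises MaxFm by 0.021 and Sm by 0.086; on all three test sets MAE decreases while MaxFm and Sm increase.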

4.6 Robustness experiments

To verify the robustness of the designed Cloud-Graph algorithm, the network is trained under the supervision of a weighted cross-entropy loss. The experiment varies the balance parameters α and β for positive and negative samples, and the MAE and MaxFm evaluation metrics are used for quantitative comparison. The experimental results are shown in Table 6.
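A weighted cross-entropy loss with separate balance parameters can be sketched as below. We assume here that α scales the term for cloud (positive) pixels and β the term for background (negative) pixels, and that the loss is averaged over all pixels; the exact form used for training is defined elsewhere in the paper and may differ. The function name `weighted_bce` is hypothetical.

```python
import numpy as np


def weighted_bce(pred, mask, alpha=1.0, beta=1.0, eps=1e-7) -> float:
    """Weighted binary cross-entropy averaged over all pixels.

    Assumption: alpha weights positive (cloud) pixels and beta weights
    negative (background) pixels.
    """
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    mask = np.asarray(mask, dtype=np.float64)
    loss = -(alpha * mask * np.log(pred) + beta * (1.0 - mask) * np.log(1.0 - pred))
    return float(loss.mean())


pred = np.array([0.9, 0.2])  # predicted cloud probabilities
mask = np.array([1.0, 0.0])  # reference labels
base = weighted_bce(pred, mask)                    # alpha = beta = 1.0
heavier_negatives = weighted_bce(pred, mask, beta=1.2)
print(base, heavier_negatives)
```

With α = β = 1.0 the expression reduces to the ordinary binary cross-entropy; raising β penalizes errors on background pixels more strongly, and raising α does the same for cloud pixels.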

Table 6
Experimental results on the robustness of the proposed network

Setting	CHLandsat-8-TE (20)
	MAE	MaxFm
α = 1.0, β = 1.0	0.069	0.892
α = 1.0, β = 1.1	0.067	0.896
α = 1.0, β = 1.2	0.065	0.895
α = 1.1, β = 1.0	0.070	0.893
α = 1.2, β = 1.0	0.068	0.891

As can be seen from Table 6, changes in the balance parameters α and β affect the final cloud detection results of the network, but the overall fluctuations in the evaluation metrics are small (MAE ranges from 0.065 to 0.070 and MaxFm from 0.891 to 0.896), so the Cloud-Graph algorithm is robust to these settings.

4.7 Interpretive experiments

To explain the effectiveness of the designed network more intuitively, this paper conducts interpretability experiments through feature visualization. As shown in Fig. 8, F_L denotes the visualization of low-level features containing detailed information, F_H denotes the visualization of high-level features containing semantic information, and ${\tilde{F}}_{L}$ and ${\tilde{F}}_{H}$ denote the visualizations of the low-level and high-level features, respectively, after they pass through the feature interaction graph reasoning (FIGR) module.

Fig. 8

Feature visualization results. Comparison of feature maps of low-level and high-level features with and without feature interaction graph reasoning (FIGR) module.

Comparing F_L with ${\tilde{F}}_{L}$ and F_H with ${\tilde{F}}_{H}$ in the figure shows that the feature interaction graph reasoning (FIGR) module reasons about the relationship between low-level and high-level features so as to fully fuse detailed and semantic features. The proposed network mitigates the inherent shortcomings of CNNs and alleviates the semantic divide in multi-level feature fusion, allowing the network to learn more image features that contribute to cloud detection.

5 Conclusion

In this paper, a novel feature interaction graph convolutional network for remote sensing image cloud detection is proposed, in which the feature interaction graph reasoning (FIGR) module mines valuable complementary information in low-level and high-level features and alleviates the semantic divide problem in the feature fusion process. Extensive experiments show that the proposed model helps overcome the inherent shortcomings of CNNs in cloud detection tasks and detects cloud regions more accurately. We believe that the FIGR module designed in this paper can be beneficial for other related computer vision tasks, such as image segmentation. In future work, we will further improve the inference speed of Cloud-Graph and deploy the algorithm on smart hardware for practical applications.

References

1. Long, Shi, Tang and Zhang, Single remote sensing image dehazing, IEEE Geosci Remote Sens Lett 11(1) (2014), 59–63.

2. Zhang, Rossow W.B., Lacis A.A., Oinas and Mishchenko M.I., Calculation of radiative fluxes from the surface to top of atmosphere based on ISCCP and other global data sets: Refinements of the radiative transfer model and the input data, J Geophys Res Atmos 109(D19) (2004), 1–27.

3. Zhu, Wang and Woodcock C.E., Improvement and Expansion of the Fmask Algorithm: Cloud, Cloud Shadow, and Snow Detection for Landsats 4–7, 8, and Sentinel 2 Images, Rem Sen of Env 159 (2015), 269–277.

4. Irish P.R., Barker J.L., Goward S.N., et al., Characterization of the Landsat-7 ETM+ automated cloud-cover assessment (ACCA) algorithm, Photogrammetric Engineering & Remote Sensing 72(10) (2006), 1179–1188.

5. Zhan, Wang, Shi, Cheng, Yao and Sun, Distinguishing cloud and snow in satellite images via deep convolutional network, IEEE Geosci Remote Sens Lett 14(10) (2017), 1785–1789.

6. Yang, Guo, Yue, Liu, Hu and Li, CDnet: CNN-based cloud detection for remote sensing imagery, IEEE Trans Geosci Remote Sens 57(8) (2019), 6195–6211.

7. Guo, Yang, Yue, Tan, Hou and Li, CDnetV2: CNN-based cloud detection for remote sensing imagery with cloud-snow coexistence, IEEE Trans Geosci Remote Sens 59(1) (2020), 700–713.

8. Mohajerani and Saeedi, Cloud-net: An end-to-end cloud detection algorithm for landsat 8 imagery, in Proc IEEE Int Geosci Remote Sens Symp, 2019, pp. 1029–1032.

9. Mohajerani and Saeedi, Cloud and cloud shadow segmentation for remote sensing imagery via filtered jaccard loss function and parametric augmentation, IEEE J-STARS 14 (2021), 4254–4266.

10. Hou, Cheng, Xu, et al., Deeply supervised salient object detection with short connections, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jul. 2017, pp. 3203–3212.

11. , et al., Optical Remote Sensing Image Cloud Detection with Self-Attention and Spatial Pyramid Pooling Fusion, Remote Sensing 14(17) (2022), 4312.

12. Peng, et al., Understanding the Role of Receptive Field of Convolutional Neural Network for Cloud Detection in Landsat 8 OLI Imagery, IEEE Transactions on Geoscience and Remote Sensing 60 (2022), 1–17.

13. Kipf and Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907, 2016.

14. Yan, Xiong and Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, in Proc IEEE Conf Thirty-second AAAI conference on artificial intelligence, 2018.

15. Zhang, Zhu, Hou, et al., Deep relational reasoning graph network for arbitrary shape text detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2020, pp. 9699–9708.

16. Wang, Lu, Shen, Crandall D.J. and Shao, Zero-shot video object segmentation via attentive graph neural networks, in Proc IEEE Conf International Conference on Computer Vision (ICCV), 2019, pp. 9236–9245.

17. Qin, Shang, Tian, Wang, Zhang and Yan, Tang, Spectral–spatial graph convolutional networks for semi-supervised hyperspectral image classification, IEEE Geosci Remote Sens Lett 16(2) (2019), 241–245.

18. Long, Shelhamer and Darrell, Fully convolutional networks for semantic segmentation, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2015, pp. 3431–3440.

19. Zeng, Yang and Deng, Cloud segmentation of remote sensing images on Landsat-8 by deep learning, in Proc 2nd Int Conf Big Data Res (ICBDR), 2018, pp. 174–177.

20. Ronneberger, Fischer and Brox, U-net: Convolutional networks for biomedical image segmentation, in Proc Int Conf Med Image Comput Comput-Assist Intervent, Berlin, Germany: Springer, 2015, pp. 234–241.

21. Badrinarayanan, Kendall and Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans Pattern Anal Mach Intell 39(12) (2017), 2481–2495.

22. , Zhang and Xia, CDUNet: Cloud Detection UNet for Remote Sensing Imagery, Remote Sensing 13(22) (2021), 4533.

23. , Sun, Yan, et al., DABNet: Deformable contextual and boundary-weighted network for cloud detection in remote sensing images, IEEE Trans Geosci and Remote Sens 60 (2021), 1–16.

24. , et al., Dual-branch network for cloud and cloud shadow segmentation, IEEE Trans Geosci Remote Sens 60 (2022), 1–12.

25. Zhang, et al., Cloudformer: Supplementary aggregation feature and mask-classification network for cloud detection, Applied Sciences 12(7) (2022), 3221.

26. Zhang, Dai, Lu, He and Wang, A bi-directional message passing model for salient object detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2018, pp. 1741–1750.

27. Zhao and Wu, Pyramid feature attention network for saliency detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jul. 2019, pp. 3085–3094.

28. Zhang, Wang, Qi, et al., Progressive attention guided recurrent network for salient object detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2018, pp. 714–722.

29. Pang, Zhao, Zhang, et al., Multi-scale interactive network for salient object detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2020, pp. 9413–9422.

30. , Xia and Li, Pyramidal feature shrinking for salient object detection, in Proc IEEE Conf Thirty-second AAAI conference on artificial intelligence, 2021, pp. 2311–2318.

31. Wang and Shi, An All-Scale Feature Fusion Network With Boundary Point Prediction for Cloud Detection, IEEE Geosci Remote Sens Lett 19 (2021), 1–5.

32. Guo, Bai and Qin, ClouDet: A Dilated Separable CNN-Based Cloud Detection Framework for Remote Sensing Imagery, IEEE J-STARS 14 (2021), 9743–9755.

33. Zhao, et al., Detail-Aware Multiscale Context Fusion Network for Cloud Detection, IEEE Geosci Remote Sens Lett 19 (2022), 1–5.

34. Zhou, Cui, Zhang, Yang, Liu and Sun, Graph neural networks: A review of methods and applications, arXiv preprint arXiv:1812.08434, 2018.

35. Cai, Ge, Liu, Cai, et al., Exploiting spatial temporal relationships for 3d pose estimation via graph convolutional networks, in Proc IEEE Conf International Conference on Computer Vision (ICCV), 2019, pp. 2272–2281.

36. Xie G.S., Liu, Zhu, Zhao, et al., Region graph embedding network for zero-shot learning, in Proc Springer. European conference on computer vision (ECCV), 2020, pp. 562–580.

37. Wang, Sun, Liu, et al., Dynamic graph cnn for learning on point clouds, ACM Transactions on Graphics 38(5) (2019), 1–12.

38. Luo, Li, Jiao, et al., Cascade graph neural networks for rgb-d salient object detection, in Proc Springer. European conference on computer vision (ECCV), 2020, pp. 346–364.

39. Zhang, Li, Shen, et al., Adaptive graph convolutional network with attention graph clustering for co-saliency detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2020, pp. 9050–9059.

40. Hong, Gao, Yao, et al., Graph convolutional networks for hyperspectral image classification, IEEE Trans Geosci Remote Sens 59(7) 5966–5978, Apr. 2020.

41. Zhao, Ge and Yu, GraphFPN: Graph feature pyramid network for object detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2021, pp. 2763–2772.

42. Zhai, Li, Chen, et al., Mutual graph learning for camouflaged object detection, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2021, pp. 12997–13007.

43. , Zhang, Gao, Deng, et al., Bidirectional graph reasoning network for panoptic segmentation, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jun. 2020, pp. 9080–9089.

44. and Gupta, Beyond grids: Learning graph representations for visual recognition, Advances in Neural Information Processing Systems, 2018, pp. 31.

45. , Liu, Tian, Li, et al., Dual attention network for scene segmentation, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jul. 2019, pp. 3146–3154.

46. , Muller, Thabet, et al., Deepgcns: Can gcns go as deep as cnns? in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jul. 2019, pp. 9267–9276.

47. Krizhevsky, Sutskever and Hinton G.E., Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 2012, pp. 25.

48. , Zhang, Ren and Sun, Deep residual learning for image recognition, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), 2016, pp. 770–778.

49. De Boer P.T., Kroese D.P., Mannor, et al., A tutorial on the cross-entropy method, Annals of Operations Research 134(1) (2005), 19–67.

50. Hughes M.J. and Hayes D.J., Automated detection of cloud and cloud shadow in single-date Landsat imagery using neural networks and spatial post-processing, Remote Sensing 6(6) (2014), 4907–4926.

51. Achanta, Hemami, Estrada and Süsstrunk, Frequency-tuned salient region detection, in IEEE Conf Comput Vis Pattern Recog, 2009, pp. 1597–1604.

52. Wang, Lai, Fu, et al., Salient object detection in the deep learning era: An in-depth survey, IEEE Trans Pattern Anal Mach Intell, 2021.

53. , Fan D.-P., Ji G.-P., Zhao, Shen and Zhu, Siamese network for rgb-d salient object detection and beyond, IEEE Trans Pattern Anal Mach Intell, 2021.

54. Margolin, Zelnik-Manor and Tal, How to evaluate foreground maps? in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), 2014, pp. 248–255.

55. Fan D.P., Cheng M.M., Liu, Li and Borji, Structure-measure: A New Way to Evaluate Foreground Maps, in Int Conf Comput Vis, 2017, pp. 4548–4557.

56. Fan D.P., Gong, Cao, Ren, Cheng M.M. and Borji, Enhanced-alignment Measure for Binary Foreground Map Evaluation, in Int Joint Conf Artif Intell, 2018, pp. 698–704.

57. Zhao, Shi, Qi, Wang and Jia, Pyramid scene parsing network, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jul. 2017, pp. 2881–2890.

58. Amirul Islam, Rochan, et al., Gated feedback refinement network for dense image labeling, in Proc IEEE Conf Comput Vis Pattern Recognit (CVPR), Jul. 2017, pp. 3751–3759.
