Abstract
Skin lesion segmentation from dermatoscopic images is essential for the diagnosis of skin cancer. However, it remains a challenging task due to the ambiguity of the skin lesions, the irregular shape of the lesions and the presence of various interfering factors. In this paper, we propose a novel Ambiguous Context Enhanced Attention Network (ACEANet) based on the classical encoder-decoder architecture, which segments a variety of lesions accurately, reliably and efficiently. Specifically, a novel Ambiguous Context Enhanced Attention module is embedded in the skip connection to enhance the ambiguous boundary information. A Dilated Gated Fusion block is employed at the end of the encoding phase, which effectively reduces the loss of spatial location information caused by continuous downsampling. In addition, we propose a novel Cascading Global Context Attention module to fuse the features generated by the encoder with the features generated by the decoder of the corresponding layer. To verify the effectiveness and advantages of the proposed network, we performed comparative experiments on the ISIC2018 and PH2 datasets. Experimental results demonstrate that the proposed model achieves superior segmentation performance for skin lesions.
Introduction
Skin cancer is one of the most common cancers worldwide, and its incidence is increasing. Melanoma has a 92% incidence rate and a 60% mortality rate among skin cancers, according to a report by the American Cancer Society [1]. Although skin cancer is highly lethal, death can be avoided if the tumour is detected and removed at an early stage. Dermoscopy is widely used as a non-invasive diagnostic tool for the analysis of skin diseases [2]. Segmenting lesions in dermoscopic images yields the exact area of the lesion, which greatly facilitates the diagnosis of skin diseases and is clinically important. However, diagnosis through dermoscopic images is a tedious and time-consuming task. Therefore, automated computer-aided diagnostic (CAD) systems have been introduced to improve the accuracy and efficiency of diagnosis. In these CAD systems, one of the most critical steps is the segmentation of the lesion from the dermoscopic image, so accurate skin lesion segmentation is crucial for dermoscopic image analysis and dermatological diagnosis. However, there is a lot of noise around the lesion in dermoscopic images, such as hair, blood vessels and air bubbles, and lesions vary greatly from one image to another [3]. Hence, accurate segmentation of skin lesions is still a challenging task. Traditional methods of skin lesion segmentation rely heavily on hand-crafted features and careful image pre-processing, resulting in poor performance in practical applications [4, 5, 6, 7].
In recent years, Deep Convolutional Neural Network (DCNN) models have turned skin lesion segmentation into a pixel-level classification problem with remarkable success [8, 9, 10, 11, 12]. Well-known examples are the Fully Convolutional Network (FCN) [13] and its variant U-Net [14]. FCN converts a classification convolutional neural network into a segmentation network by replacing fully connected layers with convolutional layers. This enables end-to-end training, removes the need for sliding-window patches and allows whole images as input. U-Net differs from other fully convolutional networks in that it introduces skip connections, which pass information directly between the encoder and decoder, greatly enriching the contextual information and achieving better results on a large number of medical image segmentation datasets. However, U-Net directly fuses the features of the encoder with the features of the decoder of the corresponding layer, so the channel information of the feature map becomes coarse, which eventually leads to imprecise pixel segmentation at the edges of the lesion area. Especially for highly noisy dermoscopic images, this direct fusion in the skip connection may embed erroneous, noisy feature information in the channels and degrade the segmentation results. Moreover, the continuous pooling in the U-Net encoder decreases the spatial resolution of the feature map and causes a loss of spatial location information.
Similar to the encoder-decoder architecture of U-Net, Dai et al. [15] proposed a novel Multi-scale Residual Encoding and Decoding network for skin lesion segmentation, which mines the semantic information of the feature map by using multi-scale residuals in the encoding and decoding paths, and introduced a new pooling method (Soft Pooling) for down-sampling that retains more useful information and yields more effective segmentation results. In CA-Net [16], a joint spatial attention module is designed to make the network focus more on foreground regions, and a novel channel attention module is proposed to adaptively recalibrate the feature responses between channels and highlight the most relevant feature channels. In BA-Net [17], a multi-task learning module is designed for joint learning of target mask segmentation and lesion boundary detection, and an interactive attention (IA) module bridges the two tasks so that they exchange complementary information, effectively exploiting the boundary information and providing a strong cue for better segmentation prediction. Xie et al. [18] proposed a Mutual Bootstrapping Model, which combines dermoscopic segmentation and classification into one model: the mask generated by segmentation assists classification and the classification results assist segmentation, so that the two tasks complement each other and the performance of the model is improved. Cao et al. [19] proposed a Global and Local Inter-Pixel Correlations Learning Network for skin lesion segmentation that captures non-local contextual information at different levels and further explores global pixel-level relations to handle large variance in shape and size. Gu et al. [20] proposed a deep edge network with boundary information for automatic skin lesion segmentation and designed an entirety-center-edge (ECE) loss function that further optimizes the boundary details on top of the segmentation results. In WNet [21], two encoder-decoder networks are connected in series to form a “W” shaped model, and the encoders and decoders directly complement each other’s information, effectively enhancing the performance of the model. Hassanat [22] makes use of several color spaces to classify pixels as either objects of interest or non-objects using artificial neural networks.
Attention mechanisms have been introduced into deep learning as an effective means of enhancing features and are widely used in computer vision tasks. ATTU-Net [23], based on U-Net, incorporates an attention mechanism in the skip connection, which reinforces crucial feature information and leads to detailed processing of local target information for segmentation. Hu et al. [24] propose an attention synergy module that integrates spatial and channel information and introduce a weighted binary cross-entropy loss function to emphasize foreground lesion regions. DANet [25] addresses the scene segmentation task by capturing rich contextual dependencies with a self-attention mechanism, applying spatial and channel attention to gather information across the feature map. In SA-Net [26], an efficient Shuffle Attention (SA) module is proposed based on channel attention and spatial attention, which effectively integrates contextual information. Wu et al. [27] designed a Feature Adaptive Transformation Network, which effectively captures global contextual information and resolves long-distance dependencies by integrating features from a transformer branch and a convolution branch in the encoding stage. He et al. [28] propose a fully transformer network for skin lesion analysis and design a Spatial Pyramid Transformer that introduces spatial pyramid pooling into multi-head attention, greatly reducing computational cost. SENet [29] introduces a novel and efficient module called Squeeze and Excitation, which explores the relationships between channels. Although the above works can perform automatic skin lesion segmentation, they do not process lesion boundaries precisely enough, and their performance may degrade when the skin lesion is disturbed by noise such as hairs and bubbles.
In this paper, we propose an Ambiguous Context Enhanced Attention Network for skin lesion segmentation based on the encoder-decoder architecture. The Ambiguous Context Enhanced Attention module (ACEA) and the Cascading Global Context Attention module (CGCA) are designed for the skip connection between the encoder and the decoder. The ACEA uses a self-attention mechanism to fuse the feature information generated in the encoding stage with the coarse segmentation map output in the decoding stage, guiding the model to enhance the ambiguous boundary information. The CGCA module extracts features with convolutions of different sizes to capture information comprehensively, and passes the feature information generated by the encoder to the corresponding decoder layer so that the decoder can restore the feature map more accurately during upsampling. To reduce the loss of spatial location information due to successive down-sampling in the encoder, we introduce a Dilated Gated Fusion block (DFG) at the end of the encoding phase, which effectively mitigates the loss of spatial location information without changing the feature map size or number of channels, and preserves local and global semantic information in the corresponding regions. To evaluate the performance of the proposed model, we conducted experiments on the ISIC2018 dataset and the PH2 dataset. Our model obtains competitive results in four metrics: Dice score, Jaccard coefficient, Accuracy and Recall. The visualisation results show that our model segments skin lesions accurately. In summary, the contributions of this article are as follows:
Based on the encoder-decoder architecture, we propose an Ambiguous Context Enhanced Attention Network for skin lesion segmentation, in which a pre-trained ResNet34 network serves as the encoder and the coarse segmentation results output by each decoder layer are optimised through deep supervision.
We propose a novel Ambiguous Context Enhanced Attention module that combines feature information from the encoder with the coarse segmentation map generated by the decoder using self-attention. This module enhances the boundary features of the lesion and guides the model in accurately segmenting the skin lesion.
At the end of the encoding phase, we incorporate a Dilated Gated Fusion block. This block captures multi-scale information from various receptive fields and effectively mitigates the loss of spatial location information caused by repeated down-sampling.
In the skip connection, we introduce a Cascading Global Context Attention module. This module uses receptive fields of different sizes to extract feature information from the encoder and transmit it to the corresponding decoder layer. By incorporating features with abundant semantic information into the decoder, it enhances the accuracy of feature map restoration.
The overall structure of the proposed model.
The overall structure of the model is shown in Fig. 1, where the encoder uses a pre-trained ResNet34 [30] network. In each layer of the encoder, the feature map goes through two layers of 3
Ambiguous Context Enhanced Attention
In general, the edges of lesions are not smooth, and the pixels in the transition region between the lesion and normal tissue have similar colour and texture characteristics. As a result, lesion boundaries are often ambiguous areas that are difficult to segment in medical images, and false negatives often occur during prediction. Specifically, the predicted probability of lesion boundary pixels in the final segmentation result is usually close to 0.5, so it is difficult to classify them reliably as background or lesion.
To process the pixels at the boundary of the lesion, we output a portion of the coarse segmentation map in the decoding stage and optimise the pixels at the boundary of the lesion using deep supervision. Meanwhile, we design a novel Ambiguous Context Enhanced Attention module, which is based on a self-attention mechanism to enhance the boundary information of lesions by fusing the feature information generated in the encoding phase with the coarse segmentation map.
The structure of the Ambiguous Context Enhanced Attention module is shown in Fig. 2, where F is the feature map generated by the encoder and M is the coarse segmentation map generated by the decoder. The foreground region, background region and ambiguous boundary region of M, respectively
where “max” takes the maximum value and “abs” takes the absolute value. We use the “max” and “abs” operations to calculate the foreground region, the background region and the ambiguous boundary region. Since the probability value of the ambiguous region is close to 0.5 in the predicted probability map, we use 0.5 as the bound when calculating the foreground and the background. The foreground, the background and the ambiguous boundary maps are then concatenated along the channel dimension.
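The region split described above can be illustrated with a small sketch. The exact equations are not reproduced in this text, so the three formulas below are one plausible reading of "max and abs with 0.5 as the bound", not necessarily the authors' definitions. The function name `split_regions` and the nested-list representation are hypothetical and chosen only to keep the example dependency-free.

```python
def split_regions(prob_map, bound=0.5):
    """Split a probability map into foreground, background and ambiguous maps.

    Assumed formulas (an illustration, not taken from the paper):
      foreground = max(p - bound, 0)      # large where p is well above 0.5
      background = max(bound - p, 0)      # large where p is well below 0.5
      ambiguous  = bound - abs(p - bound) # peaks where p is close to 0.5
    """
    foreground = [[max(p - bound, 0.0) for p in row] for row in prob_map]
    background = [[max(bound - p, 0.0) for p in row] for row in prob_map]
    ambiguous = [[bound - abs(p - bound) for p in row] for row in prob_map]
    # Stack the three maps along a leading "channel" axis.
    return [foreground, background, ambiguous]


# A 2x2 coarse segmentation map M with confident and uncertain pixels.
coarse_map = [[0.9, 0.5], [0.1, 0.45]]
fg, bg, amb = split_regions(coarse_map)
```

Under these assumed formulas, a pixel with probability exactly 0.5 contributes nothing to the foreground or background maps and receives the largest ambiguous response, which matches the intuition that boundary pixels sit near 0.5.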
We calculate the representative vectors of
where
where
Illustration of the Ambiguous Context Enhanced Attention module.
During encoding, the number of channels of the feature map keeps increasing, and these channels contain rich semantic information. At the same time, continuous down-sampling keeps decreasing the resolution of the feature map, resulting in the loss of spatial location information. To solve this problem, we propose the Dilated Gated Fusion block, which captures the spatial information of the feature map at different scales through atrous convolutions with different dilation rates, passing more comprehensive and richer semantic information to the decoder.
Figure 3 shows the structure of the Dilated Gated Fusion block. Firstly, the feature maps are randomly divided into four groups, each containing 128 channels. These groups of feature maps are then subjected to atrous convolution with different dilation rates: 1, 2, 5, and 7. Next, the four groups of feature maps are concatenated along the channel dimension to obtain F’, and F’ is passed through global average pooling and two fully connected layers to learn the global feature information. Finally, a weight matrix is obtained after a sigmoid function, and the weight matrix is multiplied with F’ to enhance the spatial location information of the feature map.
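To make the effect of the four dilation rates concrete, the sketch below computes the span covered by a single atrous kernel using the standard relation k_eff = k + (k - 1)(d - 1). The kernel size of 3 is an assumption, since the text does not state it. The dilation rates and the 128 channels per group come from the description above.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered along one axis by a dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


KERNEL_SIZE = 3                # assumption: kernel size is not given in the text
DILATION_RATES = [1, 2, 5, 7]  # from the text
CHANNELS_PER_GROUP = 128       # from the text

# Span of each branch, keyed by its dilation rate.
spans = {d: effective_kernel_size(KERNEL_SIZE, d) for d in DILATION_RATES}
# Channel count of F' after concatenating the four groups.
total_channels = CHANNELS_PER_GROUP * len(DILATION_RATES)

print(spans)           # {1: 3, 2: 5, 5: 11, 7: 15}
print(total_channels)  # 512
```

With a 3x3 kernel, the four branches would cover spans of 3, 5, 11 and 15 pixels without adding parameters, which is how the block gathers multi-scale context while keeping the feature map size and channel count unchanged.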
Illustration of the Dilated Gated Fusion block.
The skip connection links the output of an encoder layer to the input of its corresponding decoder layer, passing the feature information of that layer across the network. Adding skip connections between the encoder and decoder preserves the detailed features of the input data. These connections allow the model to retain local detail without missing important global information, and they help the decoder restore the original feature map more accurately, leading to improved segmentation accuracy. However, if the feature information generated by the encoder were connected directly to the decoder of the corresponding layer, there would be a large semantic gap in the channel direction and hence an imbalance of feature information across channels. Therefore, to better control the feature information in the skip connection, we designed a Cascading Global Context Attention module. In this module, the feature information generated by the encoder is processed by convolutions with different dilation rates and then fused with the features of the decoder, mining the channel information and effectively conveying more detailed information and spatial contextual features to the decoder.
The structure of the CGCA is shown in Fig. 4. Firstly, the feature map
Illustration of the Cascading Global Context Attention module.
The loss function plays an important role in the training of the model. To reduce overfitting and to obtain fast convergence, we adopt the Dice loss [32] and the binary cross-entropy loss [33] as the overall loss function. The Dice loss is robust to class imbalance and has been extensively validated in different medical image segmentation tasks. It can be written as:
where i indexes the i-th pixel in the probability distribution map.
At the same time, skin lesion segmentation is a binary classification problem, so we also use the binary cross-entropy loss:
where
Finally, our overall loss function combines the binary cross-entropy loss and the Dice loss:
where
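As a rough illustration of how a Dice loss and a binary cross-entropy loss can be combined, the following sketch operates on flattened lists of predicted probabilities and binary labels. It is not the authors' implementation: the smoothing constant `smooth`, the clamping constant `eps`, and the weights `w_bce` and `w_dice` are hypothetical choices, and the paper's own weighting is not recoverable from this text.

```python
import math


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap) / (sum of both maps), with smoothing."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def combined_loss(probs, targets, w_bce=1.0, w_dice=1.0):
    """Weighted sum of both losses; the weights here are illustrative only."""
    return w_bce * bce_loss(probs, targets) + w_dice * dice_loss(probs, targets)


labels = [1.0, 0.0]
confident = combined_loss([0.9, 0.1], labels)
uncertain = combined_loss([0.6, 0.4], labels)
```

In this toy case the confident prediction yields a lower combined loss than the uncertain one, since both terms penalise probabilities that sit close to 0.5 on correctly ordered pixels.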
Our method is implemented in PyTorch and run on an Intel(R) Xeon(R) Gold 5218 CPU@2.30 GHz and an NVIDIA Quadro RTX 6000 GPU (24G) under Ubuntu Server 16.04. We train for 200 epochs with a batch size of 16. We use Adam as the overall optimizer, where betas
Datasets
We used two public dermoscopic image datasets to train the proposed segmentation model, namely the ISIC2018 dataset [34] and the PH2 dataset [35]. The two datasets are summarized in Table 1. The ISIC2018 dataset was provided by the International Skin Imaging Collaboration (ISIC) and contains a total of 2594 RGB images (20.0% melanoma, 72.0% nevus and 8.0% seborrheic keratosis) with image sizes ranging from 540
For data pre-processing on both datasets, all images are resampled to 256
We strictly separate the datasets into distinct subsets for training, validation, and testing. This ensures that the models are trained solely on the training set and evaluated independently on the unseen test set. When evaluating the performance of ACEANet on the test set, we do not apply any form of data augmentation. The test images remain in their original form, so the model’s performance is assessed solely on its ability to segment real-world, unaltered images.
Distribution of the ISIC 2018 dataset and PH2 dataset
To quantitatively assess the segmentation ability of the proposed network, we used four commonly used metrics for skin lesion segmentation evaluation: Dice score (Dice), Jaccard coefficient (Jac), Accuracy (Acc) and Recall. The Dice score is defined as the ratio of twice the intersection of the predicted and true segmentations to the sum of the sizes of the two segmentations.
The Dice score ranges from 0 to 1, with 1 indicating a perfect overlap between the predicted and true segmentations. Jaccard coefficient measures the similarity between the predicted segmentation and the ground truth by calculating the ratio of the intersection to the union of the two segmentations. Similar to the Dice score, the Jaccard coefficient also ranges from 0 to 1, with 1 indicating a perfect overlap between the predicted and true segmentations. Accuracy quantifies the percentage of correctly classified pixels or voxels in the segmentation result. The Accuracy metric ranges from 0 to 1, with 1 indicating a perfect segmentation where all pixels or voxels are correctly classified. Recall measures the ability of a segmentation algorithm to correctly identify all positive instances (pixels or voxels) belonging to the target region in the ground truth. The recall metric ranges from 0 to 1, with 1 indicating a perfect recall where all positive instances are correctly identified. Mathematically, they are defined by the following:
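\[
\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}, \qquad \mathrm{Jac} = \frac{TP}{TP + FP + FN},
\]
\[
\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},
\]
where TP is the number of pixels correctly segmented as lesion regions, TN is the number of pixels correctly segmented as non-lesion regions, FP is the number of pixels incorrectly segmented as lesion regions, and FN is the number of pixels incorrectly segmented as non-lesion regions.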
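The four metrics can be computed directly from the pixel counts TP, TN, FP and FN. The sketch below uses the standard definitions on flattened binary masks. The function names and the convention of returning 0.0 for an empty denominator are choices made for this illustration and are not taken from the paper.

```python
def confusion_counts(pred, label):
    """Count TP, TN, FP, FN for two flattened binary masks (1 = lesion)."""
    tp = tn = fp = fn = 0
    for p, y in zip(pred, label):
        if p == 1 and y == 1:
            tp += 1
        elif p == 0 and y == 0:
            tn += 1
        elif p == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    # Convention for this sketch only: an empty denominator yields 0.0.
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(pred, label):
    """Dice, Jaccard, Accuracy and Recall from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(pred, label)
    return {
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "jac": _ratio(tp, tp + fp + fn),
        "acc": _ratio(tp + tn, tp + tn + fp + fn),
        "recall": _ratio(tp, tp + fn),
    }


pred = [1, 1, 0, 0, 1, 0]
label = [1, 0, 0, 0, 1, 1]
metrics = segmentation_metrics(pred, label)
```

For this toy example TP = 2, TN = 2, FP = 1 and FN = 1, giving Dice = 2/3, Jac = 1/2, Acc = 2/3 and Recall = 2/3. The values also satisfy the identity Jac = Dice / (2 - Dice), which explains why the two overlap metrics always rank models in the same order.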
Comparison results on the ISIC2018 dataset. Those with “–” indicate that the corresponding metric results are not provided
A comparison of our method with other state-of-the-art models on the ISIC2018 dataset is shown in Table 2. Our model achieves competitive results on all metrics and the highest results on Dice, Jac and Acc, with 92.16%, 86.46% and 96.75% respectively. Swin-Unet achieves the best result on the Recall metric, but its results on the other three metrics are lower than those of our proposed model. FAT-Net achieves good results on Acc and Recall, but lags well behind other models on Dice and Jac. In contrast, our model shows no large gap to other models on any metric, which indicates its stability. Compared with encoder-decoder models such as U-Net, ATTU-Net and AS-Net, our model performs better on all metrics. Our model adds multiple skip connections between the encoder and decoder that fuse features at different levels. This multi-scale feature fusion helps the model capture features at different scales and exploit information from different levels, making the segmentation results more accurate.
Comparison results on the PH2 dataset. Those with “–” indicate that the corresponding metric results are not provided
Segmentation results of different models on the ISIC2018 dataset. (a) input, (b) label, (c) ours, (d) ATTUNet, (e) Ms Red, (f) UNet.
A visual comparison of our method with other models on the ISIC2018 dataset is shown in Fig. 5, where the green area shows the segmentation result of each model. As can be seen from Fig. 5, our method segments the area of skin lesions more accurately than the other models. Our model compensates for the ambiguous lesion boundary pixels and optimizes the coarse segmentation map by deep supervision so that the final output is closer to the label. The proposed model pays more attention to important features by introducing an attention mechanism that assigns different weights to different image areas. This allows the model to better capture details and edges in the image, improving the accuracy of segmentation. The segmentation results of U-Net and ATTU-Net show that, for large irregular lesion areas, U-Net performs better than ATTU-Net. It is worth noting that ATTU-Net differs from U-Net only in adding the attention mechanism in the skip connections. Therefore, we suspect that adding the attention mechanism only in the skip connections may affect the overall anti-interference ability of the model and reduce its robustness. When the contrast of the lesion is low, the MsRed model does not accurately segment the lesion. Overall, our proposed model achieves competitive results both in terms of metrics and in terms of segmentation results.
Segmentation results of different models on the PH2 dataset.
Table 3 summarizes the comparison results between our model and current state-of-the-art models on the PH2 dataset. The data indicate that our model achieves competitive results in all performance metrics, especially in Dice and Jac, with the best results of 94.18% and 89.27%, respectively. AS-Net achieved the highest result on the Recall metric, which is 0.61% higher than our model. However, our model’s results are higher than AS-Net’s by 1.13%, 1.67%, and 1.04% on the Dice, Jac, and Acc metrics, respectively. AMCC-Net achieved the best result on the Acc metric, exceeding our model by 0.76%. However, our model performs better than AMCC-Net by 0.18% and 0.27% on the Dice and Jac metrics, respectively. Our model performs well on all metrics and there are no instances where a metric is too low. This suggests that our model has the potential for practical use.
Fig. 6 gives a visual comparison of our model with other models, where the red circle highlights the differences between each model's segmentation result and the label. Fig. 6 shows that, compared to the other models, our method has superior segmentation performance, with segmentation results closest to the label. Our model outputs a coarse segmentation map in each decoder layer. Through continuous optimization of the coarse segmentation map, the model can better understand the semantics and structure of the image, and improve the accuracy and robustness of segmentation. For noisy skin lesion images with hair interference, such as the images in rows (b) and (c) of Fig. 6, the segmentation performance of ResU-Net is greatly affected: too much noise misguides ResU-Net and leads to over-segmentation, while our model can still segment the complete lesion area, indicating that our model has strong anti-interference ability. When the boundary of the lesion is vague and the contrast is low, SwinU-Net cannot accurately identify the area of the lesion. For example, the area marked in red in row (a) of Fig. 6 is a non-lesion area, but SwinU-Net classifies it as a lesion area, whereas our model accurately segments the boundary of the lesion. In summary, these results indicate that ACEANet is more robust than the other methods and can accurately segment various types of skin lesions.
Ablation studies
To evaluate the various modules in our model, we performed step-by-step ablation experiments on the ISIC 2018 dataset and designed relevant models for testing.
The results of our ablation experiments are shown in Table 4, where “–” indicates that the corresponding module has not been added to the current model and “
Ablation studies of different models on the ISIC2018 dataset
The purpose of dermoscopic image segmentation is to accurately distinguish skin lesions from normal tissue in dermoscopic images to help doctors locate and diagnose various skin diseases more accurately. In this paper, we propose an Ambiguous Context Enhanced Attention Network for skin lesion segmentation named ACEANet. In the skip connection, we have designed a novel ACEA module. The ACEA module guides the model to enhance the ambiguous lesion boundary information by using the coarse segmentation map generated by the decoder and the semantically rich features produced by the encoder. The five-layer encoder gradually increases the number of channels of the feature map, allowing the channels to contain rich semantic information. At the same time, continuous down-sampling results in a loss of spatial information in the feature map. Therefore, at the end of the encoding stage, we embedded a DFG module. This module divides the channels into different groups and then extracts features with receptive fields of different sizes, preserving both detail information and global semantic information. In the skip connections between the encoder and the corresponding layers of the decoder, we have designed a CGCA module. This module processes the feature information generated by the encoder using convolutions with different dilation rates and combines it with the corresponding feature map in the decoder to achieve feature fusion. The CGCA module enables the decoder to recover the information of the feature map more accurately. Experimental results have demonstrated the superior segmentation performance of our model compared to other state-of-the-art models on the publicly available ISIC2018 and PH2 datasets. Visual comparisons have also shown that our model can accurately segment lesions in dermatoscopic images. The proposed model can help doctors automatically and accurately segment dermoscopic images, thus supporting faster and more accurate diagnosis. It has broad application prospects in CAD systems.
