Abstract
In recent years, the incidence of skin diseases has increased significantly, and malignant tumors arising from skin diseases pose a serious threat to people’s health. Automatic segmentation methods are needed in clinical practice to help experts with lesion measurement and auxiliary diagnosis. Deep learning and contextual information extraction methods have been applied to many image segmentation tasks. However, their performance is limited because their large number of parameters is insufficiently trained, and they sometimes fail to capture long-term dependencies. In addition, because skin disease images contain many interfering factors, and lesions have complex boundaries and uncertain sizes and shapes, skin disease image segmentation remains a challenging problem. To solve these problems, we propose a long-distance contextual attention network (LCA-Net). By connecting the non-local module and the channel attention module (CAM) in parallel, the network captures long-term dependence along both the spatial and channel dimensions, which enhances its ability to extract features of skin diseases. Our method achieves an average Jaccard index of 0.771 on the ISIC2017 dataset, a 0.6% improvement over the ISIC2017 Challenge champion model. The average Jaccard index of 5-fold cross-validation on the ISIC2018 dataset is 0.8256. We also compared our method with several advanced image segmentation methods, and the experimental results show that it achieves competitive performance.
Introduction
Medical imaging plays a vital role in medical diagnosis. Clinical medicine uses MRI, CT and other modalities to image human lesions, which makes early detection of malignant tumors, cancerous sites and other lesions possible, so that patients can be diagnosed and treated in the early stage of cancer. Among the many types of cancer, skin cancer is one of the most prevalent in the world. According to the American Cancer Society’s 2019 annual report, there were 104,350 new skin cancer patients and 11,650 skin cancer deaths in the United States in 2019. Melanoma is the deadliest form of skin cancer, accounting for 92.46% and 62.06% of new cases and deaths of skin cancer, respectively [1]. However, because pigmented lesions occur on the surface of the skin, melanoma can be detected early through visual inspection by experts. If skin lesions such as melanoma are found at an early stage and treated in time, the cure rate is higher [2].
In recent years, the dermatoscope, as the main diagnostic tool for skin lesions in clinical medicine, has provided clear, visualized images of skin diseases. Dermatologists can judge the type of lesion more effectively with a dermatoscope. However, manual reading of the images is subjective, and this subjectivity inevitably leads to some misdiagnoses [3]. To help doctors diagnose skin diseases more accurately, computer-aided diagnosis systems that automatically segment skin disease images have become an urgent clinical need.
Deep learning learns features at multiple levels of abstraction from a large number of samples, and has demonstrated excellent performance in image processing, especially in computer vision tasks on natural images, where it has made breakthroughs [4, 5]. These characteristics make it possible to apply deep learning methods to medical image segmentation. In the field of skin lesion segmentation, researchers continue to develop new models, and skin disease image segmentation has made substantial progress. At present, the more mature skin lesion segmentation methods are mainly based on the encoder-decoder structure, such as the widely used U-Net [6] and U-Net++ [7]. In addition, atrous spatial pyramid pooling (ASPP) [8] and attention mechanisms have been introduced to improve prediction accuracy by obtaining multi-scale feature maps. However, because the boundaries of skin lesions are complex and lesions vary in shape and size, previous models did not pay enough attention to lesion boundaries, and their segmentation accuracy was not high enough. To solve these problems, we introduce the LCA module and propose a long-distance contextual attention network based on the traditional U-Net. We use a non-local module, which calculates the similarity between any two locations, to capture long-distance dependence, and use CAM [9] to integrate channel features. In this paper, we make the following contributions:
(1) We preprocessed the skin disease datasets to solve the problem of skin disease image color imbalance.
(2) We proposed LCANet to integrate the long-distance dependencies of images and performed a more accurate segmentation of the boundary of the skin lesion.
(3) Experiments were conducted on ISIC2017 and ISIC2018 datasets, and the experimental results were compared with the existing skin disease segmentation methods.
(4) We reproduced several current advanced network models, conducted experiments with them on ISIC2017, and compared and analyzed the experimental results.
Related work
Network
In this section, we will introduce our proposed skin disease image segmentation model. Our backbone network uses an encoder-decoder structure, adds X-blocks [23] with depth deconvolution to the encoder and decoder to extract high-dimensional features, and the LCA module is proposed to extract image features from space and channel dimensions. Figure 1 shows the specific process of our segmentation method.

Overview of the long-distance contextual attention network.
When using convolutional networks for segmentation tasks, we can regard each channel map as a feature detector. Emphasizing the dependency between channels can improve the feature representation of specific semantics. On the other hand, the boundary of the lesion area in a skin disease image is very complicated, and the convolution operation has only a local receptive field. As a result, key information becomes unclear when processing complex boundaries, and the network lacks a basis for judging "where" features need to be emphasized and "where" they need to be suppressed. Therefore, it is necessary to extract a wide range of location context information and encode it into the feature map. However, current context information extraction methods have limited performance on the overly complex boundaries found in medical image segmentation, because their large number of parameters is insufficiently trained, and they cannot capture long-distance dependencies. To address these problems, we designed the LCA module, which connects a channel attention module and a non-local module in parallel to integrate and deeply mine the channel information and long-distance context information of the image, and to fully train the parameters of the context information. In addition, because U-Net is well suited to extracting medical image features, we choose U-Net as our backbone network. The structure of the LCA module is as follows:
As shown in Fig. 2, given a local feature map, we first feed it into a non-local operation and a channel attention module to extract its spatial features and channel features, respectively. We then combine the results from the two modules and input them into the decoder.

Long-distance contextual attention module.
In order to integrate information along the channel dimension to extract contextual features, we use the channel attention module to generate channel attention feature maps. We first perform a maximum pooling operation and an average pooling operation on the feature map F to aggregate channel information and generate two new mappings $F_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F_{max} \in \mathbb{R}^{1 \times H \times W}$ to represent the average feature and the maximum feature. Then we input these two new mappings into the multilayer perceptron to generate the channel attention map M:
According to the characteristics of the skin disease image, we first feed the feature map $F \in \mathbb{R}^{H \times W \times C}$ into a 3 × 3 convolutional layer to form a new feature map $F_0 \in \mathbb{R}^{H \times W \times C_0}$ to filter out some unimportant features, where
Where $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed and $j$ is the index that enumerates all possible positions. $X$ represents the input feature map. $g(X_j)$ computes a representation of the input signal at the position $j$. $C(X)$ represents the normalization process, and the function $f(X_i, X_j)$ represents the similarity relationship mapping between $X_i$ and $X_j$. We use the embedded Gaussian formula to express the similarity relationship:
At this point, $f(X_i, X_j)\, g(X_j)$ can be regarded as the softmax calculation along the $j$ dimension, and we can get:
$\theta(X)$ and $\varphi(X)$ are embedding layers realized by two 1 × 1 convolutions, and N is the number of all x in the feature map. Finally, we get the relationship between all positions in the original feature map and the captured dense context information $f(X_i, X_j)$. At the same time, we input $F_0$ into a 1 × 1 convolutional layer to generate a new feature map $Y \in \mathbb{R}^{H \times W \times C}$ to represent the input signal and reshape it into $\mathbb{R}^{N \times C}$. Then we multiply $f(X_i, X_j)$ with Y, and perform the element-wise sum operation with $F_0$:
It follows that the feature map Z is the sum of the relationship feature obtained by the non-local module and the original feature, which effectively aggregates long-distance features. Finally, we input Z into a convolutional layer to generate a feature map with the same shape as F, and sum it with the output of the channel attention module, which integrates channel information for adaptive feature refinement and helps avoid overfitting.
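The non-local branch described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not our Keras implementation: it assumes a single feature map, treats each 1 × 1 convolution as a per-position matrix product, and uses hypothetical names (`non_local_block`, `w_theta`, `w_phi`, `w_g`), a hypothetical embedding width `ce`, and random weights in place of trained ones.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def non_local_block(x, w_theta, w_phi, w_g):
    """Embedded-Gaussian non-local operation with a residual sum.

    x       : (H, W, C) input feature map
    w_theta : (C, Ce) weights standing in for the 1x1 convolution theta
    w_phi   : (C, Ce) weights standing in for the 1x1 convolution phi
    w_g     : (C, C)  weights standing in for the 1x1 convolution g
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c)               # N x C, with N = H * W positions
    theta = flat @ w_theta                   # N x Ce
    phi = flat @ w_phi                       # N x Ce
    g = flat @ w_g                           # N x C
    attn = softmax(theta @ phi.T, axis=1)    # N x N, softmax along j
    z = attn @ g + flat                      # aggregate over j, then add the input
    return z.reshape(h, w, c), attn


# Toy input: a 16 x 16 map with 8 channels (sizes chosen only for illustration).
rng = np.random.default_rng(0)
c, ce = 8, 4
x = rng.normal(size=(16, 16, c))
z, attn = non_local_block(
    x,
    rng.normal(size=(c, ce)),
    rng.normal(size=(c, ce)),
    rng.normal(size=(c, c)),
)
print(z.shape, attn.shape)
```

Each row of `attn` holds the normalized similarity between one position and every other position, which is how a single layer relates pixels that are far apart in the image.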
Previous work has shown that a down-sampling factor of 16 preserves most of the information required to parse the original image correctly at the pixel level [17]. Although the 16×16 thumbnail is small, it is enough to distinguish the target entity in the image and extract the features of the feature map. An increase in resolution incurs additional computing costs: when the resolution of the feature map increases by 2 times in each dimension, the memory consumption of the feature map increases by 4 times. Considering the computational cost of the experiment, we chose to use the LCA module in the last layer of the encoder.
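The cost argument can be checked with simple arithmetic. The sketch below counts elements rather than bytes and assumes a hypothetical channel count of 512. It also counts the N × N similarity matrix of a non-local operation, whose size grows with the square of the number of positions; that second count is our own illustration and is not a measurement from the experiments.

```python
def feature_map_elements(h, w, c):
    """Number of values stored in an H x W x C feature map."""
    return h * w * c


def affinity_elements(h, w):
    """Number of values in the N x N similarity matrix, with N = H * W."""
    n = h * w
    return n * n


channels = 512  # hypothetical channel count, for illustration only

# Doubling the resolution in each dimension: 16 x 16 -> 32 x 32.
ratio_fm = feature_map_elements(32, 32, channels) / feature_map_elements(16, 16, channels)
ratio_aff = affinity_elements(32, 32) / affinity_elements(16, 16)

print(f"feature map grows {ratio_fm:.0f}x, similarity matrix grows {ratio_aff:.0f}x")
```

Under these assumptions the feature map grows 4 times while the similarity matrix grows 16 times, which is consistent with placing the module at the lowest-resolution layer.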
In order to evaluate the performance of our proposed method in skin lesion segmentation, we trained and validated it on the open-source datasets ISIC2017 [24] and ISIC2018 [25], and compared the results with existing advanced network models. The experimental results show that our method achieves good results on both datasets. In this part, we describe the following in detail: (1) the datasets and experimental evaluation metrics; (2) the details of the experimental implementation; (3) the preprocessing operations; (4) ablation experiments on ISIC2017 that evaluate the influence of the attention module on the results; (5) experiments with existing models on ISIC2017 and ISIC2018, whose results we compare with ours in order to judge the effect of the model more comprehensively.
Datasets and evaluation metrics
Datasets
We conducted experiments on ISIC2017 and ISIC2018. ISIC2017 contains a total of 2750 dermoscopy images of moles, seborrheic keratosis and malignant melanoma: 2000 images in the training set, 150 in the validation set, and 600 in the test set. The available ISIC2018 data comprise 2594 training images ranging in size from 540×576 pixels to 4499×6748 pixels. To verify the performance and robustness of our model more rigorously, we performed 5-fold cross-validation on ISIC2018. The lesion area in each image has been annotated at the pixel level by a professional dermatologist. The original images in the datasets are not uniform in size, so we use bilinear interpolation to resize them to a uniform size before training.
Evaluation metrics
In order to measure the segmentation performance of the model, we use the Dice coefficient (Dice) and Jaccard index (JA) as the main evaluation indicators, which are the same as those of the ISBI Challenge. In addition, we also calculated accuracy (Acc), precision (Prec), specificity (SP), sensitivity (SE) and negative predictive value (NPV), which are commonly used metrics in medical segmentation. Their formulas are as follows:
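As an illustration, the metrics listed above can be computed from the pixel-wise confusion counts using their standard definitions. The sketch below is a minimal NumPy version under that assumption; the function name `segmentation_metrics` and the toy masks are hypothetical, and the code does not guard against empty masks, which would cause a division by zero.

```python
import numpy as np


def segmentation_metrics(pred, target):
    """Standard binary segmentation metrics from two masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)

    tp = np.sum(pred & target)      # lesion pixels predicted as lesion
    tn = np.sum(~pred & ~target)    # background pixels predicted as background
    fp = np.sum(pred & ~target)     # background pixels predicted as lesion
    fn = np.sum(~pred & target)     # lesion pixels predicted as background

    return {
        "Dice": 2 * tp / (2 * tp + fp + fn),
        "JA": tp / (tp + fp + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Prec": tp / (tp + fp),
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "NPV": tn / (tn + fn),
    }


# Toy 2 x 4 masks: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
pred = [[1, 1, 0, 0], [1, 0, 0, 0]]
target = [[1, 1, 1, 0], [0, 0, 0, 0]]
scores = segmentation_metrics(pred, target)
print(scores)
```

On these toy masks the Jaccard index is 0.5 and the Dice coefficient is 2/3, which illustrates that Dice is never lower than the Jaccard index for the same prediction.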
Our experiments are implemented with the Keras deep learning framework, and the model is optimized with the Adam optimizer. All experiments are performed on an NVIDIA Tesla V100 GPU with 16 GB of memory. We set the initial learning rate to 1e-4 and adjust it according to the validation loss during training: when the performance index plateaus on the validation set, we reduce the learning rate to 0.1 times the initial learning rate, and if the loss does not decrease after 15 epochs, we stop training. Before inputting the training images into the network, we augment them by random rotation and flipping. We use the sum of the Dice loss and the binary cross-entropy loss as the training loss function, and set the batch size to 4. Finally, we save the model with the lowest loss on the validation set.
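The training loss can be sketched as follows. This is a NumPy illustration of the sum of a soft Dice loss and the binary cross-entropy loss, not the Keras code used in the experiments; the smoothing constant `smooth` and the clipping value `eps` are assumptions that the text does not specify.

```python
import numpy as np


def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice loss; `smooth` is an assumed constant that avoids division by zero."""
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return float(1.0 - dice)


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to keep the logs finite."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))


def total_loss(y_true, y_pred):
    """Training loss: Dice loss plus binary cross-entropy loss."""
    return dice_loss(y_true, y_pred) + bce_loss(y_true, y_pred)


# A perfect prediction gives a loss close to zero; an inverted one is heavily penalized.
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
print(total_loss(mask, mask), total_loss(mask, 1.0 - mask))
```

The Dice term responds to region overlap and is less sensitive to the imbalance between lesion and background pixels, while the cross-entropy term supplies a per-pixel gradient.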
Pre-processing of the experiment
Due to the color constancy of the visual system, the human eye perceives a constant object color under different lighting and imaging environments. Imaging devices do not have this property, so different lighting environments cause the color of the captured image to deviate from the true color, as shown in Figure 3.

Images imaged under different conditions in the ISIC2017 dataset and the results after preprocessing.
In order to eliminate the impact of different lighting and shooting environments on the skin disease images, we use the Gray World Algorithm (GWA) [26] to color-correct the images. GWA is based on the gray world hypothesis, which assumes that, for an image with a large number of color variations, the averages of the three RGB color components approach the same gray value Gray. Based on this assumption, we take the average values of the three channels of R, G and B as
Then calculate the gain coefficient of the three channels:
Finally, adjust the color components of the original image, and adjust the R, G, and B values of each pixel C to R′, G′, B′ respectively:
We applied the gray world operation to the skin disease images in the datasets, and then input the processed images into our model. We also use the pre-processed images when conducting comparative experiments with other advanced models.
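The three preprocessing steps (channel means, gain coefficients, per-pixel adjustment) can be sketched with NumPy as follows. The sketch follows the common form of the gray world algorithm, in which Gray is the mean of the three channel means and each gain is Gray divided by the corresponding channel mean; the function name `gray_world`, the rounding, and the clipping to [0, 255] are our assumptions, and an image with an all-zero channel would cause a division by zero.

```python
import numpy as np


def gray_world(image):
    """Color-balance an (H, W, 3) RGB image under the gray world assumption."""
    img = np.asarray(image, dtype=np.float64)

    # Step 1: average value of each of the R, G and B channels.
    channel_means = img.reshape(-1, 3).mean(axis=0)

    # Step 2: gain of each channel relative to the common gray value.
    gray = channel_means.mean()
    gains = gray / channel_means

    # Step 3: rescale every pixel, then round and clip back to 8-bit range.
    balanced = np.rint(img * gains)
    return np.clip(balanced, 0, 255).astype(np.uint8)


# Toy image with a strong red cast: every pixel is (200, 100, 50).
cast = np.zeros((4, 4, 3), dtype=np.uint8)
cast[..., 0], cast[..., 1], cast[..., 2] = 200, 100, 50
balanced = gray_world(cast)
print(balanced[0, 0])
```

After balancing, the three channel means coincide, so a uniform color cast introduced by the lighting is removed while the relative contrast within each channel is preserved.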
In order to verify the performance of the module, we conducted ablation experiments on the ISIC2017 dataset. Table 1 shows that using the attention modules improves the segmentation performance of the model. Adding CAM to the model increases the Jaccard index by 0.28% over the baseline, and adding the non-local attention module increases the Jaccard index by 0.82%. When we connect the CAM and non-local attention modules in parallel and add them to the model as the LCA module, the Jaccard index increases by 1.04% over the baseline. The results show that applying the attention mechanism in the skin disease segmentation network can effectively enhance the feature representation of medical images.
Ablation Study for Attention Modules (Implemented on ISIC2017)
Figure 4 is a visual heat map of skin lesions constructed from the module ablation experiments. It shows that the attention module captures contextual features more completely and more effectively. When the boundary is blurred and there are many interference factors, our model can judge the boundary of the lesion more accurately by learning non-local, long-distance context information.

Visualization of ablation experiment prediction probability map.
In order to evaluate our model more comprehensively, in this section we first compare our method with the results of other segmentation methods on ISIC2017. To further verify the generalization of our model in skin disease image segmentation, we also conducted experiments on the ISIC2018 dataset and compared the results with previous work. We then reproduced some advanced networks (U-Net, DANet, etc.) and fed the preprocessed images of the ISIC2017 dataset into these networks. Finally, we compared the results of these contrast experiments with the segmentation results of our model to evaluate its segmentation performance.
Comparison with previous works
We compared our results with existing related research on the ISIC2017 dataset. As shown in Table 2, our method achieves better results than the existing models in Jaccard index, accuracy and Dice coefficient.
Comparison with other Advanced Models on ISIC2017 (These results come directly from their paper)
The available part of the ISIC2018 dataset includes 2594 training images. We performed 5-fold cross-validation on the ISIC2018 dataset to measure the performance of the model. We divided the skin disease images into five folds, one of which was used as the test set while the other four were used as the training set. After completing one experiment, we selected another fold as the test set and repeated the process. Table 3 shows the results of the 5-fold cross-validation and their mean. To compare the performance of our model with related work, Table 4 sets our results against existing segmentation work on ISIC2018.
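The fold construction can be sketched as follows. This is an illustrative NumPy version, not the exact split used in the experiments: the shuffling, the seed, and the function name `five_fold_indices` are assumptions, and image indices stand in for the image files.

```python
import numpy as np


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.default_rng(seed)          # assumed seed, for reproducibility
    order = rng.permutation(n_samples)         # shuffle image indices once
    folds = np.array_split(order, n_folds)     # five nearly equal parts
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        yield train_idx, test_idx


# 2594 images, as in the available part of ISIC2018.
splits = list(five_fold_indices(2594))
for k, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {k}: {len(train_idx)} training images, {len(test_idx)} test images")
```

Because 2594 is not divisible by five, four folds contain 519 images and one contains 518; the reported mean is then taken over the five test results.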
Results of 5-fold cross-validation based on ISIC2018
Comparison with other Advanced Models on ISIC2018 (These results come directly from their paper)
In order to evaluate our approach more comprehensively, we implemented some existing advanced models in Keras and conducted experiments on the ISIC2017 dataset. These models include CENet [35], DANet [36], Deeplab-v3+ [37], DoubleU-Net [14], HRNet [8], NestedU-Net [7], PSPNet [38], R2UNet [13], ResUNet [39], SCSEUNet [40], SegNet [41], and SEUNet [40]. We train these models with the same training strategy as in our experiments: we use the Adam optimizer, set the initial learning rate to 1e-4, and use the same hyperparameters as for our model. For the loss function, we follow the original settings of these models; if the original paper does not specify the loss function, we use the cross-entropy loss by default. The results of the experiment are shown in Table 3.
Comparison with other Advanced Models on ISIC2017 (These results come from our reimplementation)
It can be seen from Table 4 that networks without attention mechanisms, such as Deeplab-v3+, PSPNet and SegNet, perform poorly on the ISIC2017 dataset, while models with attention mechanisms, such as CENet, DANet and LCANet, achieve good segmentation results. Compared with SegNet, our model improves the Jaccard index, Dice coefficient, and accuracy by 6.51%, 5.87%, and 1.83%, respectively. This is because the attention mechanism can better capture contextual feature information. Compared with other types of images, the lesion area of a skin disease image has a relatively fixed structure, carries little semantic information, and has more concentrated feature information. Meanwhile, the boundaries in skin disease images are complex, and the network needs to combine contextual information to highlight features in key areas. The attention mechanism helps the network discard complex interference information and learn the main features. Therefore, applying attention enables the network to obtain better results in the segmentation of skin diseases. In addition, we help the network obtain non-local long-distance context information by introducing a non-local attention module, and add depthwise separable convolution to the backbone network. Compared with DANet, our model improves the Jaccard index, Dice coefficient, and accuracy by 0.16%, 0.03%, and 0.11%, respectively.
Figure 5 shows the segmentation results of our model and some advanced models on ISIC2017. These results show that models such as SegNet, Deeplab-v3+ and HRNet, which are more advanced models in the field of semantic segmentation, tend to pursue a higher global average in order to solve multi-class segmentation problems. Because these networks attend to the global characteristics of the image, interference factors in the skin disease image, such as stains, are sometimes recognized as the lesion area. As a result, the segmentation results of these networks on skin disease images are relatively rough.

LCANet and some advanced model segmentation results
Many models applied to medical images are improvements on U-Net, such as DANet, CENet, R2UNet and DoubleU-Net. Medical images have fuzzy boundaries and complex gradients, so accurate segmentation requires more high-resolution information. The U-shaped network combines low-resolution information (providing a basis for object category recognition) and high-resolution information (providing a basis for precise segmentation and positioning), and is more suitable for medical image segmentation than other backbone networks. Our model is also based on a U-shaped network, which makes it better suited to accurate segmentation of the boundaries in skin disease images.
Although CENet and R2UNet are also improved versions of U-Net, they do not introduce an attention mechanism, which causes the model to be distracted by secondary feature information, so their results in segmentation tasks with strong interference are not ideal. Our method combines the channel attention module and the non-local attention module to help the network focus on key features, integrate channel and spatial feature information, and incorporate long-distance context information, so that the model achieves better segmentation results. The comparison with the above advanced models shows that our model performs better on skin disease images with complex boundaries and many interference factors.
In this paper, we propose a long-distance contextual attention network (LCANet) for skin disease image segmentation. The network captures spatial long-distance dependence through the attention module and integrates it with channel feature information. We conducted a series of experiments on the ISIC2017 and ISIC2018 datasets to test the performance of our model, and compared the results with existing skin disease image segmentation methods. We also implemented some existing advanced models in Keras and evaluated them on the same data. The experimental results show that our model performs better on skin disease image segmentation tasks than other existing models. It can effectively eliminate irrelevant information, grasp the main features of the segmentation target, and more accurately identify the boundary of the lesion in the skin disease image.
Our research still has limitations. Reducing the complexity of the model while improving its segmentation accuracy is the main direction we will pursue in future research. We will also apply the proposed model to more medical segmentation datasets.
Acknowledgments
This research is partially supported by Xinjiang Autonomous Region key research and development project: 2021B03001-4, Xinjiang Uygur Autonomous Region (CN) Postgraduate Research and Innovation Project (XJ2020G072), Science and Technology Department of Xinjiang Uyghur Autonomous Region Fund Project (2020E0234). We would also like to thank our tutor for the careful guidance and all the participants for their insightful comments.
Disclosures
The authors declare no conflicts of interest.
