Abstract
BACKGROUND:
Melanoma is a tumor arising from melanocytes that is highly malignant, prone to local recurrence and distant metastasis, and associated with poor prognosis. It is also difficult for inexperienced dermatologists to detect because it resembles other skin lesions in appearance, such as color, shape, and contour.
OBJECTIVE:
To develop and test a new computer-aided diagnosis scheme to detect melanoma skin cancer.
METHODS:
In this new scheme, unsupervised clustering based on deep metric learning is first conducted to group images with high similarity, and the corresponding model weights are used as the teacher model for the next stage. Second, drawing on knowledge distillation, attention transfer is adopted so that the classification model learns the similarity features and the category information simultaneously, which improves diagnostic accuracy over the common classification method.
RESULTS:
In the validation set, 8 categories were included and 2443 samples were evaluated. The highest accuracy of the new scheme is 0.7253, about 4.6 percentage points higher than the baseline (0.6794). Specifically, the F1-scores of the three malignant lesions BCC (basal cell carcinoma), SCC (squamous cell carcinoma), and MEL (melanoma) increase from 0.65 to 0.73, 0.28 to 0.37, and 0.54 to 0.58, respectively. On the two test sets, HAM with 3844 samples and BCN with 6375 samples, the highest accuracies are 0.68 and 0.53, respectively, which are higher than the baseline (0.649 and 0.516). Additionally, the F1-scores of BCC, SCC, and MEL are 0.49, 0.2, and 0.45 on the HAM dataset and 0.6, 0.14, and 0.55 on the BCN dataset, respectively, which are also higher than the baseline F1-scores.
CONCLUSIONS:
This study demonstrates that the similarity clustering method can extract the relevant feature information needed to group similar images together. Moreover, based on attention transfer, the proposed classification framework can improve the total accuracy and F1-score of skin lesion diagnosis.
Introduction
Skin is the largest organ in the body, making it possible for malignant neoplasms to grow in most anatomical sites. There are two main types of skin cancer: melanoma and non-melanoma. The most common non-melanoma tumors are basal cell carcinoma and squamous cell carcinoma; their degree of infiltration is shown in Fig. 1. Although melanoma accounts for only about 1% of all skin cancers diagnosed, it causes most skin cancer deaths. More specifically, according to the American Cancer Society, the number of melanoma deaths is expected to increase by 6.5 percent in 2022 [1]. In addition, according to research published by the National Cancer Institute of China in March, there were 3,800 deaths from cutaneous malignant melanoma in 2016, including 2,100 men and 1,700 women [2]. The estimated five-year survival rate for melanoma is over 99 percent if it is detected at an early stage. However, the survival rate falls to 68 percent when the cancer reaches the lymph nodes and to 30 percent when it metastasizes to distant organs [1]. Early detection is therefore vital for patients with skin cancers, especially for those with melanoma.

Three types of skin cancer. Melanoma is an extremely invasive cancer that can penetrate the basal layer.
The routine clinical examination of skin lesions relies primarily on visual diagnosis [3], which can be affected by the similarity between skin lesions. Figure 2 shows skin lesion images collected with a dermatoscope, among which there is remarkable visual similarity. More precisely, the same disease can present with different morphologies (Fig. 2(a)), and different diseases can present with similar morphologies (Fig. 2(b)); both phenomena are particularly prominent in skin lesions and make automatic classification of skin cancers more difficult. In the clinical scenario, there are three challenges in using Artificial Intelligence (AI) to diagnose skin cancers. Firstly, skin cancer images are diverse because of differences in diagnostic equipment and between individual patients. Secondly, annotating skin cancer images is laborious and must be validated by experts. Finally, as shown in Fig. 2(b), the high similarity between melanoma and melanocytic nevi in patterns such as color, shape, and size makes it difficult for common classification methods to distinguish them.

Panel (a) shows, from the top row to the bottom row, blue nevus, mixed nevus, intradermal nevus, congenital nevus, and dysplastic pigmented nevus. In (b), the images with green backgrounds in rows 1 and 3 are benign melanocytic nevi, and the images with red backgrounds in rows 2 and 4 are malignant melanomas. Melanocytic nevus and melanoma show a high degree of similarity in color, texture, and contour.
Therefore, the objective of this paper is to explore a novel classification method for skin cancers from the perspective of similarity. We used deep metric learning to perform similarity clustering and then adopted the attention transfer strategy to transfer similarity knowledge for the diagnosis of skin cancers. Deep metric learning aims to learn a representational metric space in which data points of the same category are close to each other and those of different categories are far apart. The attention transfer strategy is a form of knowledge distillation. To make better use of the similarity information in skin cancer images, we integrated these two methods into an automatic skin cancer recognition framework that includes data pre-processing, training, and inference.
The rest of this paper is outlined as follows. Section 2 presents a literature review of skin cancer classification and medical applications with deep metric learning. Technical methodologies of similarity clustering and classification with descriptions of the utilized datasets are provided in Section 3. The experimental results are presented in Section 4. Section 5 presents the discussion of this study and this paper ends with conclusions in Section 6.
Skin cancer classification
Recently, with the development of Deep Learning (DL), many skin lesion image classification methods have emerged. A major advance in DL-based skin cancer diagnosis is the work of Esteva et al. [3] published in Nature. They used InceptionV3 to fit 129,450 clinical and dermatoscopic images covering 2032 different categories of skin disease. In a three-class disease partition, the CNN achieved 72.1±0.9% overall accuracy, while two dermatologists attained 65.56% and 66.0% accuracy on a subset of the validation set. In the nine-class disease partition, the CNN achieved 55.4±1.7% overall accuracy, whereas the same two dermatologists attained 53.3% and 55.0% accuracy.
However, DL-based skin cancer diagnosis has two deficiencies. Firstly, from the data perspective, the robustness of deep learning models is inadequate because of the semantic gap between skin cancer images caused by inter-class similarity and intra-class dissimilarity. Moreover, a long-tailed data distribution means that the DL model learns insufficient representative information for the tail classes, which leads to barely satisfactory results. Secondly, although ensembles of DL models can enhance feature extraction from skin cancer images, their massive number of parameters can cause over-fitting and make deployment inconvenient. As a result, an appropriate method is needed to balance the model against the data while maintaining skin cancer classification accuracy.
The International Skin Imaging Collaboration (ISIC) provides high-quality dermatoscopic images to boost research on automated diagnosis. Maria et al. [22, 23] proposed skin lesion segmentation algorithms based on the active contour model. References [24–26] describe DL-based methods for skin lesion segmentation. Such segmentation is an important step for subsequent diagnosis.
Medical application of deep metric learning
Distance measurement is a core technique of traditional metric learning and is widely used in content-based image retrieval (CBIR) systems. It is not only an important step in retrieval but also a mathematical criterion for similarity measurement. In 2006, Rahman et al. achieved skin disease image retrieval on 358 skin disease images through color space transformation, lesion region segmentation, and texture feature extraction, and constructed a similarity matching function that combined the Bhattacharyya distance and the Euclidean distance [4]. The development of deep learning has advanced metric learning into deep metric learning (DML). Notable applications of deep metric learning include face verification [5] and person re-identification [6]. In 2019, Yang et al. [7] first developed a retrieval system for pathological image analysis using deep metric learning; on the public Kimia Path24 dataset, Recall@1 reached 97.89%. During the COVID-19 pandemic, Zhong et al. [8] used deep metric learning to obtain superior performance in both retrieval and diagnosis tasks, with an average area under the curve (AUC) of 91.3% for the combined CXR-EHR prediction. According to a study [9] published by Cornell Tech and Facebook in 2020, after 13 years of exploration and application, deep metric learning has become one of the most attractive research fields in machine learning. Therefore, applying deep metric learning to skin cancer images could help to narrow the semantic gap, which in turn could improve diagnostic accuracy.
From the related works described above, we observe that previous studies neither consider the similarity information nor apply deep metric learning to skin cancer images. In our study, similarity clustering was conducted with deep metric learning to obtain the similarity information. Moreover, we combined the similarity information and the category information through the attention transfer strategy for the final skin cancer classification.
Materials and methods
Datasets
The International Skin Imaging Collaboration (ISIC) is an international effort to improve melanoma diagnosis, and the ISIC Archive contains the largest publicly available collection of quality-controlled dermatoscopic images of skin lesions. To evaluate and test the proposed automatic skin cancer recognition framework, experiments were performed with the ISIC-2019 Dataset, which contains BCN20000 (BCN) [10], HAM10000 (HAM) [11], and MSK [12, 13]. The ISIC-2019 Dataset has 8 categories, including 3 malignant ones: Basal cell carcinoma (BCC), Melanoma (MEL), and Squamous cell carcinoma (SCC), and 5 benign ones: Actinic keratosis (AK), Benign keratosis (BKL), Dermatofibroma (DF), Nevus (NV), and Vascular (VASC). Table 1 lists the number of images in each category, where unk denotes data that are not part of the three main datasets. Table 2 presents the distribution over the five main anatomical predilection sites: head/neck (HN), anterior torso (A), posterior torso (P), lower extremity (Lo), and upper extremity (Up). In addition, the numbers of lesions on the palms/soles, oral/genital region, and lateral torso are 398, 59, and 54, respectively, and 2631 samples are not labeled with an anatomical site. A visualization is shown in Fig. 3.
Original data distribution of the ISIC-2019
Original data distribution of the ISIC-2019 on five main anatomical sites: head/neck (HN), anterior torso (A), posterior torso (P), lower extremity (Lo), upper extremity (Up)
Original data resolution of ISIC-2019

Two histograms are presented in (a). The left histogram shows the number of the eight skin lesions on the five main anatomical sites and the right one shows the number of skin lesions of four datasets. (b) presents the examples of eight skin lesions.
Pre-processing
Skin lesions show fine-grained appearance variations owing to different imaging devices and physical conditions. The common labeling process concerns only the category information and ignores inter-class similarity and intra-class dissimilarity. Moreover, randomly splitting the data would not guarantee the validity of the evaluation metrics because the data distribution is unbalanced across categories and appearances. Therefore, following the CLEAR Derm guidelines published in JAMA Dermatology [14], we propose a four-step approach to process the skin lesion images: (1) Resize, (2) Similarity Cluster, (3) Black Edge Removal, (4) Data Partitioning.
Resizing was the first step. For efficient numerical computation and to avoid redundant information, we uniformly resized the images to 224×224. Similarity clustering was an unsupervised learning process that extracted features from highly similar skin lesion images and then gathered those images into a specific cluster for the following operations.
Black edge removal was a traditional image processing procedure comprising Otsu threshold segmentation, morphological binary operations, and contour extraction. Based on the clustering results for BCN from the previous step, the removal operation was concentrated on the clusters containing images with black edges, and 3506 such images were processed. Data partitioning was conducted with the clustering results to ensure that all 8 categories of skin lesion images were present in both the training and validation sets.
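The black edge removal step can be illustrated with a minimal NumPy sketch. It assumes a single-channel 8-bit image and, for brevity, replaces the morphological operations and contour extraction with a simple bounding-box crop of the pixels above the Otsu threshold; the function names are hypothetical and this is not the exact pipeline used in this work.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the grey level (0-255) that maximises the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # probability of the dark class
    mu = np.cumsum(prob * np.arange(256))     # cumulative mean grey level
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = np.where(denom > 0, (mu[-1] * omega - mu) ** 2 / denom, 0.0)
    return int(np.argmax(sigma_b))

def crop_black_border(gray: np.ndarray) -> np.ndarray:
    """Crop to the bounding box of pixels brighter than the Otsu threshold.

    Assumption: a bounding-box crop stands in for the morphological
    operations and contour extraction described in the text.
    """
    mask = gray > otsu_threshold(gray)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:                        # nothing above threshold: keep as-is
        return gray
    return gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

A color image would first need to be converted to grayscale to compute the mask, after which the same row and column ranges can be applied to all channels.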
Similarity clustering
Common deep learning classification methods map an image through successive layers into a probability distribution over clinical classes of skin disease, which limits their ability to handle images with high similarity. Therefore, we adopted an unsupervised clustering method with deep metric learning to take advantage of the similarity feature information among the images.
Deep Cluster [15] is a representative prototype learning work that introduces a “prototype” as the centroid of a cluster; the feature representations are fed into a clustering algorithm to produce cluster assignments. K-means is used as the clustering algorithm to cluster the feature vectors $z = f_\theta(x)$ into K distinct groups, where $f_\theta$ is the network and $x$ is the input image. The groups are used as pseudo labels, and the cross-entropy loss function is used in the training stage. In this work, the cross-entropy loss function was replaced by a deep metric learning loss function, which captures different similarity characteristics to improve the consistency of the cluster assignments.
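The pseudo-labeling step can be sketched as follows. This is a plain K-means on precomputed feature vectors, written with NumPy for illustration; the function name, the iteration count, and the random initialization are assumptions, and the network that produces the features is outside the sketch.

```python
import numpy as np

def kmeans_pseudo_labels(z, k, n_iter=20, seed=0):
    """Cluster feature vectors z (n_samples, dim) into k groups.

    Returns (labels, centroids); the labels play the role of pseudo labels.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct samples (an illustrative choice).
    centroids = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every sample to every centroid.
        dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = z[labels == c]
            if len(members) > 0:              # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return labels, centroids
```

In a Deep Cluster style loop, the labels returned here would be used as training targets for the next epoch, after which the features are recomputed and clustered again.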
Two classic loss functions are used in deep metric learning: contrastive loss [16] and triplet loss [17]. The goal of contrastive loss is to make the distance between positive pairs smaller and the distance between negative pairs larger than a threshold λ, as shown in Equation (1):
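A common similarity-based form of this loss, given here as an assumption in the notation of the multi-similarity loss work [18] rather than as the exact expression used in this study, is:

$$
\mathcal{L}_{\text{contrast}} = (1 - \mathcal{I}_{ij})\,[S_{ij} - \lambda]_{+} - \mathcal{I}_{ij}\,S_{ij}
$$

where $S_{ij}$ is the cosine similarity between the feature vectors of samples $i$ and $j$, $\mathcal{I}_{ij} = 1$ for a positive pair and $0$ otherwise, and $[\cdot]_{+} = \max(\cdot, 0)$ is the hinge function.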
This characteristic is known as self-similarity; it is computed only from pairs of feature vectors. A negative pair with a large cosine similarity value is called a hard negative pair. Instead of considering pairs, the triplet loss shown in Eq. (2) uses triplets, in which an anchor is used to construct a positive pair and a negative pair so that the relationship with other pairs sharing the same anchor is considered.
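A common similarity-based form of the triplet loss, given as an assumption in the notation of [18] rather than as the exact expression used in this study, is:

$$
\mathcal{L}_{\text{triplet}} = [S_{an} - S_{ap} + \lambda]_{+}
$$

where $S_{ap}$ and $S_{an}$ are the similarities of the anchor to the positive and the negative sample, respectively, and $[\cdot]_{+} = \max(\cdot, 0)$.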
Consequently, this loss function pays more attention to positive relative similarity and enforces the similarity of a negative pair to be smaller than that of a positive pair by a given margin λ.
However, Multi-Similarity (MS) loss [18] could be more appropriate for skin cancer images because it accounts for both inter-class similarity and intra-class dissimilarity. The formulation of the multi-similarity loss is as follows:
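In the form commonly given for the multi-similarity loss of [18], where the symbols $m$, $\alpha$, $\beta$, $\mathcal{P}_i$, and $\mathcal{N}_i$ follow that work and are assumptions with respect to this text:

$$
\mathcal{L}_{MS} = \frac{1}{m}\sum_{i=1}^{m}\left\{\frac{1}{\alpha}\log\Big[1+\sum_{k\in\mathcal{P}_i} e^{-\alpha\,(S_{ik}-\lambda)}\Big] + \frac{1}{\beta}\log\Big[1+\sum_{k\in\mathcal{N}_i} e^{\beta\,(S_{ik}-\lambda)}\Big]\right\}
$$

Here $m$ is the number of anchors in a batch, $\mathcal{P}_i$ and $\mathcal{N}_i$ are the sets of selected positive and negative samples for anchor $i$, $S_{ik}$ is the similarity between samples $i$ and $k$, and $\alpha$, $\beta$, and $\lambda$ are hyper-parameters.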
The samples $x_i$ and $x_j$ are labeled $y_i$ and $y_j$, respectively; they form a positive pair if $y_i$ is equal to $y_j$, and a negative pair otherwise. The hard mining process is based on the similarity metric and is defined by Equations (5) and (6), where ɛ is a given margin. For an anchor $x_i$, a positive pair $\{x_i, x_j\}$ is selected if $S_{ij}$ satisfies Equation (5).
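The pair mining and the resulting loss can be sketched in NumPy as below. The mining conditions follow the usual statement of the multi-similarity loss [18]: a positive pair is kept if its similarity is below the largest negative similarity plus ɛ, and a negative pair is kept if its similarity is above the smallest positive similarity minus ɛ. The function name and the default hyper-parameter values are illustrative assumptions, not values reported in this study.

```python
import numpy as np

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=0.5, eps=0.1):
    """Multi-similarity loss with hard pair mining on a batch of embeddings.

    emb: (n, dim) feature vectors; labels: length-n class or pseudo labels.
    alpha, beta, lam, eps are assumed defaults chosen for illustration only.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                          # cosine similarity matrix S
    labels = np.asarray(labels)
    total = 0.0
    for i in range(len(labels)):
        same = labels == labels[i]
        pos_mask = same.copy()
        pos_mask[i] = False                    # exclude the anchor itself
        neg_mask = ~same
        if not pos_mask.any() or not neg_mask.any():
            continue                           # anchor has no usable pairs
        pos, neg = sim[i, pos_mask], sim[i, neg_mask]
        hard_pos = pos[pos < neg.max() + eps]  # positive-pair mining condition
        hard_neg = neg[neg > pos.min() - eps]  # negative-pair mining condition
        pos_term = np.log1p(np.exp(-alpha * (hard_pos - lam)).sum()) / alpha
        neg_term = np.log1p(np.exp(beta * (hard_neg - lam)).sum()) / beta
        total += pos_term + neg_term
    return total / len(labels)
```

When the classes are already well separated, no pair survives mining and the loss is zero, so the gradient concentrates on the confusable pairs.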
In 2014, Hinton et al. proposed the concept of knowledge distillation for model compression, which allowed small models with poor generalization to learn the “knowledge” of large teacher models with good generalization. In the distillation process of the classification model, the probability distribution of the SoftMax function becomes softer as the temperature coefficient increases, i.e., it provides more information as to which classes the teacher found more similar to the predicted class, and this information is called the “knowledge” of the model. This process can be expressed as follows:
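The softening effect of the temperature can be sketched in a few lines of Python. The formula is the standard temperature-scaled softmax, $q_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$; the function name and the example logits are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A higher temperature spreads probability mass over the non-maximal classes.
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
soft = softmax_with_temperature([2.0, 1.0, 0.0], temperature=4.0)
```

Comparing `sharp` and `soft` shows that the largest probability shrinks and the smallest grows as the temperature increases, which is the extra class-similarity information passed to the student.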
The framework is illustrated in Fig. 4. In the training stage, a batch of images is fed into the metric teacher model (MTM) and the classification student model (CSM) simultaneously. In the MTM (CSM), images are converted to the intermediate-layer features illustrated in A (B). Following the attention transfer theory, the output features in C are used together with the cross-entropy loss function for gradient calculation and updating of the CSM parameters. The MTM is frozen during training. In the inference stage, an unknown image is mapped to a probability distribution over clinical classes of skin lesion, and the final diagnosis corresponds to the class with the maximum probability.
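A minimal NumPy sketch of an activation-based attention transfer term is given below. It assumes the widely used formulation in which each feature map is reduced to a spatial attention map by averaging squared activations over channels and L2-normalizing the result; the exact layers, weighting, and loss used in this work are not specified here, and the function names are hypothetical.

```python
import numpy as np

def attention_map(feat):
    """Reduce features (batch, channels, height, width) to an L2-normalised
    spatial attention map of shape (batch, height * width)."""
    amap = (feat ** 2).mean(axis=1).reshape(feat.shape[0], -1)
    norm = np.linalg.norm(amap, axis=1, keepdims=True)
    return amap / np.maximum(norm, 1e-12)

def attention_transfer_loss(student_feat, teacher_feat):
    """Mean squared difference between student and teacher attention maps.

    The two feature maps may differ in channel count but must share the
    same spatial size. The teacher features are treated as constants.
    """
    diff = attention_map(student_feat) - attention_map(teacher_feat)
    return float((diff ** 2).mean())
```

Under this assumption, the student objective would be the cross-entropy loss plus a weighted sum of such terms over the selected layers, with the weight being a further hyper-parameter.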

The skin lesion diagnosis framework consists of (a) and (b). In (a), a four-step approach processes the original images. The red box contains the images with black edges, and the corresponding results are presented in the green box. In the data partitioning step, the boxes with three different colors contain the images for training, validation, and inference, respectively. After these sequential operations, the processed images and the clustering model parameters are ready for fitting the classification model based on the attention transfer method, as shown in (b).
The following table shows the concepts of the TP, TN, FP and FN.
To evaluate classification performance, the F1-score of each of the 8 categories was calculated. This metric is the harmonic mean of precision and recall. Accuracy was also used to measure the overall classification performance.
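These metrics can be computed from the TP, FP, and FN counts as in the following sketch, which treats each category one-vs-rest. The function names and the toy labels in the usage note are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 (harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def per_class_f1_and_accuracy(y_true, y_pred, classes):
    """Return ({class: F1}, overall accuracy) for a multi-class prediction."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        scores[c] = precision_recall_f1(tp, fp, fn)[2]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return scores, accuracy
```

For example, with true labels `["MEL", "MEL", "NV", "NV", "BCC"]` and predictions `["MEL", "NV", "NV", "NV", "BCC"]`, MEL has precision 1 and recall 0.5, so its F1-score is 2/3, while the overall accuracy is 0.8.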
This automatic skin cancer diagnosis framework was built on PyTorch [20] with ResNet [21]. The common ResNet classification method was taken as the baseline against which our classification framework was compared. All experiments were conducted on a workstation with an NVIDIA GeForce RTX 3090 GPU.
Concepts of the TP, TN, FP and FN
Our method consists of Similarity Clustering and Attention Transfer for the automatic diagnosis of skin cancer images. To fit the parameters well, we trained both stages for 100 epochs, and training could be completed in an hour. The corresponding results are presented as follows.
The results of similarity clustering
Figure 5 shows the clustering results on the HAM, BCN, and unk datasets; MSK is excluded because it is relatively small and consistent in form. The images exhibit various similar and dissimilar appearances, such as skin color, lesion size, hair, and markers. In the similarity clustering, similar images were grouped into clusters. We set the number of clusters to 16 according to the number of images. We then picked 8 clusters for testing and used the remaining ones for training and validation. Artifacts are present in the skin images, such as hair (Cluster 3), rulers (Cluster 7), and markings (Cluster 8). In addition, different lighting conditions (Clusters 1 and 2) are present in the dataset.
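The cluster-level partitioning can be sketched as follows. Whole clusters are held out for testing, and the remaining images are split per category so that every category appears in both the training and validation sets. The function name, the tuple layout, and the validation fraction are assumptions made for illustration; the actual split ratios are not stated in this text.

```python
import random
from collections import defaultdict

def split_by_cluster(samples, test_clusters, val_fraction=0.2, seed=0):
    """Split samples given as (image_id, category, cluster_id) tuples.

    Images in test_clusters form the test set; the rest are divided into
    training and validation sets separately for each category.
    val_fraction is a hypothetical value, not taken from the study.
    """
    test = [s for s in samples if s[2] in test_clusters]
    rest_by_category = defaultdict(list)
    for s in samples:
        if s[2] not in test_clusters:
            rest_by_category[s[1]].append(s)
    rng = random.Random(seed)
    train, val = [], []
    for _, items in sorted(rest_by_category.items()):
        rng.shuffle(items)
        # Keep at least one validation image per category when possible.
        n_val = max(1, int(len(items) * val_fraction)) if len(items) > 1 else 0
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val, test
```

Holding out entire clusters means that the test images differ in appearance from the training images, which makes the test accuracy a stricter measure than a random split would give.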

The results of similarity clustering.
In our Attention Transfer experiments, we used only the HAM and BCN datasets for recognition. The dataset split is listed as follows.
Dataset split in train, validation and test
The validation results are listed in Tables 6 and 7. Note that we merged the HAM and BCN samples in both the training and validation stages for better feature extraction and model selection. The test results for HAM are listed in Tables 8 and 9, and the test results for BCN in Tables 10 and 11. The total accuracy was calculated over the 8 categories. Layers marked with √ are those whose parameters were trained during Attention Transfer. Layers 1 and 2 usually extract low-level features, whereas layers 3 and 4 extract high-level features. We evaluated 9 combinations of layers to compare performance with the baseline.
The total accuracy of eight categories of validation set
The F1 score of eight categories of validation set
The total accuracy of test set of HAM
The F1 score of test set of HAM
The total accuracy of test set of BCN
The F1 score of test set of BCN
For the validation set, the total accuracy results are listed in Table 6, where √ means that the layer was used in attention transfer. The baseline accuracy is 0.6794 (marked in blue), about 4.6 percentage points lower than the 0.7253 obtained by our method (marked in red). These results indicate that the proposed method improves accuracy. However, transferring more layers does not necessarily yield better results, which we attribute to parameter redundancy. Table 7 lists the F1-scores for the 8 categories. The largest improvements, marked in red, indicate that our method could be beneficial for the diagnosis of the three malignant skin cancers.
For the test sets, the best total accuracy on the HAM dataset is 0.68, higher than the baseline (0.649), and the best total accuracy on the BCN dataset is 0.53, higher than the baseline (0.516). The F1-scores of BCC, SCC, and MEL on the HAM dataset are 0.49, 0.2, and 0.45, respectively; of these, the results for BCC and MEL are better than the baseline (0.45 and 0.42). Likewise, the F1-scores of BCC, SCC, and MEL on the BCN dataset are 0.6, 0.14, and 0.55, which are higher than the baseline (0.58, 0.13, and 0.54).
References [27, 28] describe why a linear interpolation of points on the precision-recall curve (PR Curve) provides an overly optimistic measure of classifier performance. Precision-Recall is a useful measure of prediction when the classes are very imbalanced.
In our study, we used the test results for HAM and BCN and calculated precision and recall for the three malignant lesions (BCC, SCC, MEL) and one benign lesion (NV). The 9 layer combinations were compared with the baseline, and the corresponding 9 panels of PR curves (a-i) are presented in Figs. 6 and 7. In these figures, the solid lines are the PR curves of our method and the dotted lines are those of the baseline. For NV (black), the difference between the two lines is not obvious, which we attribute to the feature redundancy of NV. In contrast, for BCC (red) and MEL (blue), the solid lines are higher than the dotted lines on both the HAM and BCN test sets, which indicates that the similarity features are helpful for recognizing these lesions. However, as shown in Fig. 6, the situation for SCC (green) is the opposite: the solid lines are lower than the dotted lines, especially in Fig. 6(h) and Fig. 6(i). A possible reason is that the number of SCC samples is too small for good generalization, so the final performance was lower than the baseline.
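The precision and recall values behind such curves can be obtained by sweeping a threshold over the predicted scores of one class, as in the sketch below. The summary value is computed as a step-wise sum rather than by linear interpolation between points, in line with the caution about interpolation noted above. The function names are illustrative, tied scores are not grouped, and at least one positive sample is assumed.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) points for one class treated one-vs-rest.

    y_true holds 1 for the class of interest and 0 otherwise; scores are the
    predicted probabilities for that class. Assumes at least one positive.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = fp = 0
    points = []
    for i in order:                            # lower the threshold one sample at a time
        if y_true[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

def average_precision(points):
    """Step-wise area under the PR curve: sum of (R_n - R_{n-1}) * P_n."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

With labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the recall steps occur at precision 1 and 2/3, giving an average precision of 5/6.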

The precision-recall curve of testing sets of HAM.

The precision-recall curve of testing sets of BCN.
As shown in Fig. 2, there is remarkable visual similarity between benign nevus and malignant melanoma. In this work, we designed the framework and conducted the experiments from the perspective of similarity. In the similarity clustering process, based on Equation (3), we used deep metric learning for feature extraction, taking inter-class similarity and intra-class dissimilarity into account. Such features are helpful for diagnosis. Previous studies used only the category information, whereas our study used both the category information and the similarity information in training. However, transferring more layers does not necessarily yield better results, which we attribute to parameter redundancy.
The limitations of our study are as follows: 1) no numerical relationship is established between the features of samples from different categories; 2) the datasets are unbalanced, and more malignant samples could be collected to balance the categories. These are also challenges for our future research.
In conclusion, this study adopted a two-step approach for skin cancer image recognition. Firstly, unsupervised clustering was used so that the model learned the similarity information and subsequently served as a teacher model. Secondly, the student classification model was trained with attention transfer. During training, the student model learned the similarity information of the feature layers and the category information at the same time. As a result, the student classification model improved the recognition rate of the lesions.
Acknowledgments
This work was supported by the Application and Basic Research project of Sichuan Province [grant No.2019YJ0055]; the Enterprise Commissioned Technology Development Project of Sichuan University [grant No.18H0832]; and the Achievement Conversion and Guidance Project of Chengdu Science and Technology Bureau [grant No.2017-CY02-00027-GX]. We also thank Highong Intellimage Medical Technology (Tianjin) Co., Ltd for clinical consulting in this research.
