Abstract
BACKGROUND:
Acne is a type of skin lesion that is highly prevalent in adolescents, and it poses computational challenges for automatic diagnosis. Computer vision algorithms are used to detect acne and determine its subtypes. Most existing acne detection algorithms are based on natural facial images, which carry noise factors such as uneven illumination.
OBJECTIVE:
In order to tackle this issue, this study collected a dataset ACNEDer of dermoscopic acne images with annotations. Deep learning methods have demonstrated powerful capabilities in automatic acne diagnosis, and they usually release the training epoch with the best performance as the delivered model.
METHODS:
This study proposes a novel self-ensemble and stacking-based framework AcneTyper for diagnosing the acne subtypes. Instead of delivering the best epoch, AcneTyper consolidates the prediction results of all training epochs as the latent features and stacks the best subset of these latent features for distinguishing different acne subtypes.
RESULTS:
The proposed AcneTyper framework achieves a promising detection performance of acne subtypes and even outperforms a clinical dermatologist with two years of experience by 6.8% in accuracy.
CONCLUSION:
The proposed method determines different subtypes of acne, outperforms an inexperienced dermatologist, and may contribute to reducing the probability of misdiagnosis.
Introduction
Acne is a highly prevalent inflammatory skin lesion involving symptoms like sebaceous sties, and has prevalence peaks during adolescence [1, 2]. Besides adolescents, a study showed that the prevalence of clinical facial acne in women may persist into their middle ages [3]. Therefore, a massive number of acne patients need precise treatments, but the already overloaded schedules of dermatologists could not fulfill all these needs. Various automatic acne diagnosis technologies have been developed for this purpose, including the object detection [4, 5], acne measuring [6, 7], and classification [8, 9] algorithms.
Most of the existing acne diagnosis methods are based on naturally captured facial images, including those taken by mobile phones. These images may carry various noise factors, e.g., uneven illumination. Dermoscopy is a non-invasive imaging technology for the in vivo evaluation of pigmented skin lesions with optical magnification, liquid immersion and low angle-of-incidence lighting [10]. Skin lesion images captured by dermoscopy have a much higher visibility of the subsurface structures and textures of the lesions compared with conventionally captured images.
Many computational methods have been developed recently for the automatic detection, counting and classification of the acne lesions, and most of them utilized the hand-crafted features [5, 6, 8]. Chin et al. measured the wrinkle contours of acne by the Laws mask filter, the Gabor filter and the Kirsch filter [11]. Darmawan et al. described the acne backgrounds for the acne subtype detection [12]. Kittigul et al. investigated the acne detection problem using the Speeded Up Robust Features and the k-nearest neighbor (KNN) classifier [13]. Wirdayanti et al. extracted the texture features by the Gray Level Co-occurrence Matrices (GLCM) algorithm and detected the facial skin diseases by the KNN classifier [14]. Maroni et al. built an ensemble of random forest models using the engineered features of color, texture, spatial, shape and unsupervised descriptors from the segmented lesions [6]. The acne was extracted by the Adaptive Thresholding and Laplacian of Gaussian filters. Navarro et al. improved the bag of feature algorithm and integrated the Gaussian filtering and k-means clustering for detecting the common skin diseases [15]. Xuan et al. transformed the RGB images into the other color spaces according to different acne lesion subtypes, and a multi-level thresholding strategy was used to extract acne [5]. Park et al. utilized the Haar features to detect the facial regions and the YCbCr and HSV color spaces to detect the skin regions. They proposed the CIELab color model to detect the pigmentation of acne [16].
Deep learning has also been widely utilized for describing acne and other skin lesions [17, 18, 19]. The convolutional neural network (CNN) is one of the most popular deep learning networks for medical image processing tasks [20, 21, 22, 23]. Chin et al. developed a facial pore aided detection system using the LeNet-5 model [23]. Saleh et al. proposed an automated facial skin disease detection method by a pre-trained deep convolutional neural network [24]. Karunanayake et al. developed a smartphone-based expert system using a hybrid approach of CNN and natural language processing (NLP) techniques for the detection of acne density, skin sensitivity and acne subtypes [21]. Rashataprucksa et al. used Faster R-CNN and R-FCN to detect acne [4]. Wu et al. proposed a unified framework for the grading and counting of acne [7]. Shen et al. presented a new automatic diagnosis method using the image features extracted by CNN [20]. Junayed et al. used a deep residual neural network to classify five classes of acne [22]. Isa et al. transferred the YOLOv4 model to recognize acne [25], and Phan et al. developed an LED therapeutic device that uses a deep learning model built on the modified ResNet50 and YOLOv2 for automatic acne diagnosis [26].
Some model optimization challenges remain to be resolved. Firstly, the models from the first few epochs tend to be underfitted, while those from the last few epochs may overfit the training data. Secondly, the trained model with the best prediction performance on the validation dataset is usually delivered as the final model, and this operation does not guarantee the best performance on the independent testing dataset. Conventional machine learning algorithms like feature selection and classification may further improve the deep learning models [27]. Hence, we hypothesized that integrating the outputs from all training epochs may improve the classification models.
This study proposes a self-ensemble stacking framework for the dermoscopy-based acne diagnosis. Firstly, we collected and manually annotated a comprehensive dataset of dermoscopy images for four acne subtypes, i.e., tiny comedo, papule, cyst and pustule. Then we ensembled the prediction results of the models trained in all epochs, weighted by the stacking strategy. These stacked prediction results of all trained models were regarded as the engineered features and were further screened by feature selection algorithms. Compared with the predictions directly generated by the CNN model, our ensembled framework showed better performance by integrating the information output from all training epochs.
There are four main contributions of this study. Firstly, we utilize the self-ensemble strategy to consolidate the prediction capabilities of the models trained in all epochs. Secondly, these consolidated models are weighted to differentiate their contributions to the final prediction model. Thirdly, the feature selection algorithms further improve the prediction performance by removing the features from the redundant models. Fourthly, we release to the public our collected and manually annotated dataset ACNEDer of dermoscopic images of four acne subtypes.
Materials and methods
Datasets
A well-annotated dataset ACNEDer was released to provide the dermoscopic images of four acne subtypes, including papule, tiny comedo, pustule and cyst, as shown in Table 1. The original sizes of the images are mostly 1028
The numbers of images for the four acne subtypes in the dataset ACNEDer. The four acne subtypes were summarized in the columns “papule”, “tiny comedo”, “pustule” and “cyst”. The last column “all” gave the total numbers of images for the three sub-datasets training/validation/testing
The visualization of sample images. The example images of (a) acne and (b) skin cancers.
To further evaluate our method AcneTyper, a skin cancer dermoscopic image dataset from the Kaggle competition is used [29]. This dataset contains 3297 images, including 2637 training images and 660 testing images, in two categories, benign and malignant. The training set consists of 1440 benign skin cancer images and 1197 malignant ones. The testing set has 360 benign skin cancer images and 300 malignant ones. Because the dataset does not provide a validation set, we randomly select 80% of the training samples to train the model and use the remaining 20% as the validation set. The example images in this dataset are shown in Fig. 1b.
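The random 80%/20% partition of the 2637 training images can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_train_validation`, the fixed seed and the rounding of the training-set size are assumptions, and the source does not state whether the split was stratified by class.

```python
import random


def split_train_validation(sample_ids, train_fraction=0.8, seed=0):
    """Randomly split sample ids into a training part and a validation part."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    rng.shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return ids[:n_train], ids[n_train:]


# 2637 is the number of training images reported for the skin cancer dataset.
train_ids, val_ids = split_train_validation(range(2637))
print(len(train_ids), len(val_ids))
```

A stratified split would keep the benign/malignant ratio identical in both parts, which may be preferable when the classes are imbalanced.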
This study considers the acne subtype prediction as a four-class classification problem. The four acne subtypes are papule, tiny comedo, pustule and cyst. The classification problem is formulated as four binary classification problems, similar to Wu et al. [7]. For each acne subtype, the samples in this subtype are positives and the other samples are negatives. The correct prediction rates of the positive and negative samples are sensitivity (Sn) and specificity (Sp) [30]. They are defined as:
$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}$$
where TP and FN are the numbers of correctly and incorrectly predicted positive samples, and TN and FP are the numbers of correctly and incorrectly predicted negative samples, respectively. The overall accuracy is defined as:
$$Acc = \frac{TP + TN}{TP + FN + TN + FP}$$
The metric Precision is defined as:
$$Pr = \frac{TP}{TP + FP}$$
The metric Youden Index (YI) is also popularly used and is defined as:
$$YI = Sn + Sp - 1$$
A larger value of YI suggests a better classification performance of a given model. The overall classification performance metrics are averaged over the four acne subtypes.
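The one-vs-rest evaluation can be illustrated with a short sketch. It assumes the standard confusion-matrix definitions Sn = TP/(TP+FN), Sp = TN/(TN+FP), Acc = (TP+TN)/(TP+FN+TN+FP), Pr = TP/(TP+FP) and YI = Sn + Sp - 1, followed by an unweighted average over the classes. The function name `one_vs_rest_metrics` and the toy labels are hypothetical.

```python
def one_vs_rest_metrics(y_true, y_pred, classes):
    """Average Sn, Sp, Acc, Pr and YI over one-vs-rest binary problems."""
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        # Guard against empty denominators (a convention chosen for this sketch).
        sn = tp / (tp + fn) if tp + fn else 0.0
        sp = tn / (tn + fp) if tn + fp else 0.0
        pr = tp / (tp + fp) if tp + fp else 0.0
        acc = (tp + tn) / (tp + fn + tn + fp)
        per_class.append({"Sn": sn, "Sp": sp, "Acc": acc, "Pr": pr, "YI": sn + sp - 1})
    return {k: sum(m[k] for m in per_class) / len(per_class) for k in per_class[0]}


# Toy example with two of the four subtypes.
metrics = one_vs_rest_metrics(
    ["papule", "papule", "cyst", "cyst"],
    ["papule", "cyst", "cyst", "cyst"],
    classes=["papule", "cyst"],
)
print(metrics)
```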
The backbone of AcneTyper is similar to the ResNet18 model pre-trained on the ImageNet dataset. Label smoothing is used to process the labels in this study, where the parameter
Training procedure
This study hypothesized that all training epochs carried useful and complementary information to the final prediction performances. A deep learning algorithm usually splits the dataset into three subsets training/validation/testing, as shown in Fig. 2a. A model is trained using the training dataset, and evaluated on the validation dataset. Most of the deep learning-based studies assumed that the training epoch with the best performance on the validation dataset would be delivered. This assumption does not necessarily hold for many datasets. A stacking procedure consolidates the prediction results of the trained models in all epochs, as shown in Fig. 2b.
Comparison between different training procedures for the deep learning models. (a) Traditional training pipeline; (b) Stacking training pipeline; (c) Our proposed pipeline.
We proposed an ensemble framework to consolidate the prediction results of the models trained in all epochs and to select the subset of prediction results with the best prediction performances on the validation dataset, as shown in Fig. 2c.
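The consolidation step can be pictured as concatenating, for every image, the class outputs produced by the model saved at each epoch. The sketch below is an illustration under assumptions, not the authors' implementation: the nested-list layout `epoch_outputs[epoch][sample]` and the name `stack_epoch_outputs` are hypothetical.

```python
def stack_epoch_outputs(epoch_outputs):
    """Concatenate per-epoch class outputs into one feature vector per sample.

    epoch_outputs[e][i] is the list of class scores that the model saved
    after epoch e assigns to sample i.
    """
    n_samples = len(epoch_outputs[0])
    features = []
    for i in range(n_samples):
        row = []
        for outputs in epoch_outputs:  # epoch order is preserved
            row.extend(outputs[i])
        features.append(row)
    return features


# Two epochs, two samples, two classes (toy numbers).
stacked = stack_epoch_outputs([
    [[0.6, 0.4], [0.3, 0.7]],  # epoch 1
    [[0.8, 0.2], [0.1, 0.9]],  # epoch 2
])
print(stacked)
```

With E epochs and four acne subtypes, each image would be described by 4E such features under this layout.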
This study proposed a network architecture sharing the pre-trained ResNet18 CNN architecture fine-tuned using the training dataset of this study, as shown in Fig. 3. A CNN is a network architecture based on the shared weights of layered filters that generates feature maps, and it has been widely used in many computer-vision tasks [31, 32, 33]. A CNN with a large number of layers may suffer from the vanishing gradient problem, and a residual neural network may relieve this challenge [34].
The network architecture of AcneTyper. The pre-trained ResNet18 model was fine-tuned on the training dataset, and the internal network architecture was shared by the AcneTyper. Then the prediction results of the AcneTyper model trained in all epochs were screened by feature selection algorithms and the final optimized AcneTyper model was generated.
Fine-tuning a pre-trained model may transfer the knowledge learned in the previous domain to a similar task [35]. This study fine-tuned the pre-trained ResNet18 model on the ACNEDer dataset released in this study and normalized the results by the Batch Normalization (BatchNorm) module through re-centering and re-scaling [36]. The step of label smoothing was carried out on the dataset [37].
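Label smoothing replaces a hard one-hot target with a slightly softened distribution. The sketch below uses one common formulation, (1 - ε) · one-hot + ε / K for K classes. The value ε = 0.1 is purely illustrative and is an assumption, as is the function name `smooth_labels`; the smoothing parameter used in this study is not reproduced here.

```python
def smooth_labels(class_index, num_classes, epsilon=0.1):
    """Return a smoothed target: (1 - epsilon) * one_hot + epsilon / num_classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == class_index else uniform
        for k in range(num_classes)
    ]


# Four acne subtypes; the true class is index 1 (illustrative).
target = smooth_labels(class_index=1, num_classes=4)
print(target)
```

The smoothed target still sums to one, so it can be used with a cross-entropy loss in the same way as a one-hot label.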
The detailed experimental procedure of AcneTyper was introduced here. Firstly, AcneTyper generated the initial set of features
The details of the AcneTyper experimental procedure. The CNN models trained in all epochs were utilized to generate the engineered features. Then these engineered features were selected by feature selection algorithms based on their performances on the validation dataset.
We anticipated that not all consolidated features positively contributed to the prediction performances, and used feature selection algorithms to remove unrelated features. A support vector machine (SVM) classifier was trained on the training dataset, and the weights of all features in the trained SVM model were used to rank the features. An iterative feature elimination strategy was then applied, and each candidate feature subset was evaluated on the validation dataset. The feature subset with the best performance on the validation dataset was used to train an SVM model, which was tested on the testing dataset. Grid search and 5-fold cross validation were used to find the best parameter setting.
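The weight-based elimination can be sketched with scikit-learn and NumPy. This is a minimal illustration under several assumptions: the data are synthetic stand-ins for stacked epoch outputs, the SVM is re-fitted and the features re-ranked after every removal, a feature's importance is taken as the sum of its squared weights over the pairwise hyperplanes, and `C` is left at 1.0 instead of being tuned by grid search. The name `select_stacked_features` is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC


def select_stacked_features(X_train, y_train, X_val, y_val):
    """Backward elimination guided by linear-SVM weights.

    Returns the column subset with the best validation accuracy.
    """
    remaining = list(range(X_train.shape[1]))
    best_cols, best_acc = list(remaining), -1.0
    while remaining:
        svm = SVC(kernel="linear", C=1.0).fit(X_train[:, remaining], y_train)
        acc = svm.score(X_val[:, remaining], y_val)
        if acc > best_acc:
            best_cols, best_acc = list(remaining), acc
        if len(remaining) == 1:
            break
        # Importance of a feature: squared weights summed over all hyperplanes.
        importance = (svm.coef_ ** 2).sum(axis=0)
        remaining.pop(int(np.argmin(importance)))
    return best_cols, best_acc


# Synthetic stand-in for stacked outputs of 5 epochs x 4 classes.
rng = np.random.default_rng(0)
n_epochs, n_classes, n_samples = 5, 4, 200
y = rng.integers(0, n_classes, size=n_samples)
blocks = [
    np.eye(n_classes)[y] * strength + rng.normal(0.0, 1.0, (n_samples, n_classes))
    for strength in np.linspace(0.5, 2.0, n_epochs)
]
X = np.hstack(blocks)
cols, val_acc = select_stacked_features(X[:150], y[:150], X[150:], y[150:])
print(len(cols), val_acc)
```

The selected columns would then be used to train the final SVM, which is evaluated once on the held-out testing dataset.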
Ablation experiments
The ablation experiment evaluates the contribution of each module of AcneTyper, as shown in Table 2. First of all, all modules positively contribute to the improvement of the prediction model. The baseline model used in this study is the 18-layer convolutional neural network (CNN) model ResNet18, and Table 2 shows that the output layer of ResNet18 achieves 0.6395 in the overall accuracy. The pre-trained ResNet18 is fine-tuned on our curated dataset ACNEDer and achieves an improvement of 0.0306 in Acc. The operation of batch normalization further improved the model to Acc
How the performance is affected by different modules. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)
The stacking operation delivers an improvement of 0.0340 in Acc, suggesting that the prediction results of the deep learning models trained in different epochs provide complementary information to the prediction model. The feature selection algorithm facilitates an additional improvement of 0.0136 in Acc, which suggests the necessity of removing engineered features that do not contribute to the prediction model.
In this study, a classifier is used to predict the final result from the features extracted by the convolutional neural network (CNN). To verify how different classifiers influence the final performance, we compared five classifiers, i.e., Multi-Layer Perceptron (MLP) [38], XGBoost [34], Linear Regression (LR) [39], Random Forest (RF) [40], and Support Vector Machine (SVM) [41]. The grid search method was used to tune the parameters of each classifier [42]. As shown in Table 3, SVM achieved the best performance among the evaluated classifiers. The following sections used SVM in the proposed framework AcneTyper.
Comparison of the performance of different classifiers. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)
Ensemble algorithms are a group of widely used machine learning methods [43, 44], and they usually deliver better prediction performances by integrating multiple weak base models. This section compared the proposed self-ensemble-based method AcneTyper with a CNN-ensemble method. The pre-trained ResNet18 is used as the base model. We randomly selected 80% of the training samples to fine-tune ResNet18, and 30 base models were trained through random runs. The voting strategy was used to construct the CNN-ensemble model, as shown in Fig. 5. Our method outperformed the CNN-ensemble model by 0.087 in Acc. Another merit of our method was that AcneTyper needed to be trained only once, while the CNN-ensemble method needed to be trained 30 times to generate the base models. Hence, the self-ensemble strategy in our method delivered both a better prediction performance and a faster training step.
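The voting step of the CNN-ensemble baseline can be sketched as follows, assuming hard (majority) voting over the predicted labels of the base models. The source does not state whether hard or soft voting was used, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label predictions by majority vote.

    predictions_per_model is a list with one label list per base model;
    all label lists have the same length.
    """
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # Ties go to the label that was encountered first.
        voted.append(votes.most_common(1)[0][0])
    return voted


# Three toy base models voting on two images.
ensemble_labels = majority_vote([
    ["papule", "cyst"],
    ["papule", "pustule"],
    ["cyst", "cyst"],
])
print(ensemble_labels)
```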
Contribution of the features engineered by the early epochs
The features engineered by the early epochs were evaluated for their contributions to the final prediction model, as shown in Table 4. The parameters of the early epochs may not converge to represent the training samples. We generated multiple concatenations of different epochs. The features without the early 10 epochs were denoted as Remove10. Redundant features from Remove10 were excluded by the feature selection algorithms, and this feature set was denoted as Remove10-FS. The features concatenated by AcneTyper before the feature selection step were denoted as the feature set AcneTyper-NoFS. The list of features used in this study was denoted as AcneTyper.
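Building a feature set such as Remove10 amounts to dropping the leading columns of the stacked feature matrix. The sketch below assumes that each epoch contributes a contiguous block of `outputs_per_epoch` columns in epoch order; this layout and the name `drop_early_epochs` are hypothetical.

```python
def drop_early_epochs(features, n_drop_epochs, outputs_per_epoch):
    """Remove the columns contributed by the first n_drop_epochs epochs."""
    start = n_drop_epochs * outputs_per_epoch
    return [row[start:] for row in features]


# Toy matrix: 3 samples, 12 epochs x 4 class outputs = 48 columns.
toy_features = [[float(c) for c in range(48)] for _ in range(3)]
remove10 = drop_early_epochs(toy_features, n_drop_epochs=10, outputs_per_epoch=4)
print(len(remove10[0]))
```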
How the early epochs influence the performance. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)
Comparison of AcneTyper and the CNN ensemble method. The horizontal axis gave the performance metrics, and the vertical axis was the metric values.
Table 4 shows that the features engineered by the early 10 epochs represented an important contribution to the acne type prediction model. Feature selection generated a minor change to the model excluding the features engineered by the early 10 epochs. The two feature sets Remove10 (Acc
This suggested that the features engineered by the early 10 epochs represented a useful contribution to the acne type prediction task. Some redundant features still needed to be excluded, and the feature selection step further improved the AcneTyper-NoFS model by 0.0136 in Acc.
Comparison of AcneTyper with the traditional visual methods. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)
AcneTyper outperforms the hand-crafted features in detecting acne, as shown in Table 5. The conventional machine learning studies for the acne detection problem consist of two steps, i.e., feature extraction and classification. We compared AcneTyper with four popular feature extraction algorithms, i.e., GLCM [45], HOG [46], LBP [46] and SIFT [47]. The extracted features were used to build classification models by five classifiers, i.e., decision tree (DT), random forest (RF), naïve Bayes (NB), k-nearest neighbor (KNN) and support vector machine (SVM). Table 5 shows that AcneTyper outperforms these four feature extraction algorithms in all five classification performance metrics. AcneTyper achieves its largest improvements in the prediction sensitivities (Sn), suggesting that the hand-crafted features have difficulty separating different acne subtypes.
Deep learning algorithms are widely used for their automatic abstractions of inherent patterns within the data, and this study evaluated how the existing deep learning models performed on the acne subtype detection problem, as shown in Table 6. Different batch sizes 16/32/64 and epoch numbers 100/200/300 were evaluated for the deep learning models. Overall, these deep learning-extracted features achieved much better acne subtype detection performances compared with the data in Table 5. The AcneTyper model proposed in this study outperformed the other deep learning models in all five performance metrics. Except for AcneTyper, EfficientNet-B2 can achieve the best Acc
Comparison with the features extracted by popular neural network architectures. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)
Comparison with the two dermatologists. Der_1 and Der_2 represent the two dermatologists, respectively. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)
The ultimate goal of many computational disease classification studies is to help the clinical practice. This study compared the proposed AcneTyper with the two professional dermatologists involved in this project, as shown in Table 7. The collected dataset ACNEDer was independently annotated by the two dermatologists, who then discussed and re-annotated the images on whose acne types they had disagreed. Only the images with commonly agreed acne types were kept in the dataset ACNEDer, which was used as the benchmark dataset.
We compared AcneTyper with the initial annotations of the two dermatologists on this benchmark dataset. The performance metrics Sn, Sp, Acc, Pr and YI were calculated for the initial annotations of the two dermatologists. AcneTyper was trained on the training dataset and evaluated on the validation dataset, and the final result was calculated on the testing dataset using the model that performed best on the validation dataset. The dermatologists Der_1 and Der_2 have 4 and 2 years of clinical dermatological practice, respectively. Table 7 shows that Der_1 outperforms Der_2 by 0.1088 in Acc and 0.1021 in Sn. Thus, a less experienced dermatologist may produce more false negatives in the acne diagnosis. Our AcneTyper model achieves 0.7959 in Acc, which is between the classification accuracies of Der_1 and Der_2.
The experimental data therefore show that AcneTyper can outperform a less experienced dermatologist in diagnosing the acne subtypes, and it is anticipated to provide useful automatic annotations for the initial screenings in the clinical practice.
Evaluate the performance of AcneTyper on a benchmark dataset
Comparing AcneTyper with the other state-of-the-art methods on skin cancer dataset
The proposed framework was evaluated for its generalizability on an additional benchmark dataset, as shown in Table 8. This is a binary classification task, and we compared AcneTyper with four state-of-the-art (SOTA) methods, i.e., DIMLP-ensemble [48], CNN [49], TWDBDL [50] and BARF (cross) [29]. Table 8 shows that BARF (cross) achieved the second best performance, and our method AcneTyper outperformed this method by 8.98% in Accuracy, 9.11% in Recall, 9.07% in Precision and 8.86% in F1-score. The proposed AcneTyper framework outperformed the SOTA methods on the benchmark dataset.
Visualization of AcneTyper feature space using t-SNE on the dataset ACNEDer. The blue, orange, yellow and grey dots represented the samples of the four classes, i.e., tiny comedo, papule, pustule and cyst.
Figure 6 visualizes the AcneTyper feature space by t-SNE, a widely used method for visualizing high-dimensional data such as classification features [51]. The t-SNE algorithm maps the AcneTyper features of each acne image to a two-dimensional point, so that images that are close in the feature space remain close in the two-dimensional map with high probability. Figure 6 illustrates that the four classes of acne images were well separated into four clusters. The AcneTyper classification model utilized a much more complicated ensembling strategy than this two-dimensional visualization and delivered a satisfying classification accuracy.
Artificial intelligence (AI) methods have been successfully utilized in disease diagnosis tasks based on various biomedical data types, including biochemical characteristics, image, and electrocardiogram [52, 53, 54]. Sejdinović et al. developed an artificial neural network for the classification of prediabetes and type 2 diabetes (T2D) [55]. Alić et al. proposed a two-layer feedforward artificial neural network for the classification of metabolic syndrome [56]. Veljović et al. investigated the efficiency of artificial neural networks and docking methods to predict antimicrobial activity of new compounds [57]. Badnjevic et al. developed an expert system for an efficient diagnosis of asthma, COPD or a normal lung sample [58]. Catic et al. successfully applied neural networks to classify Patau, Edwards, Down, Turner and Klinefelter Syndromes [59].
This study proposed an acne subtype detection algorithm AcneTyper via self-ensemble and stacking. Self-ensemble is a new algorithmic technology to improve the neural network models. This study demonstrated that this technology could also help the computational diagnosis of acne types. We stacked the prediction results of the models trained in all epochs of the CNN-based models and further screened these engineered features by feature selection algorithms. The ablation experimental data showed that all integrated modules of AcneTyper had positive contributions to the final prediction performance. The best prediction accuracy of the four acne subtypes reaches 0.7959 in Acc by AcneTyper.
AcneTyper outperformed the hand-crafted and deep learning-extracted features on the acne subtype detection problems, and even performed better than a dermatologist with two years of experience in clinical practice.
A well-annotated dataset of four acne subtypes is also released to the public, hoping that the whole research community may get involved in pushing the acne subtype prediction forward.
Our study has some limitations that need to be taken into account in future studies. The size of the dataset is limited. We plan to recruit a large cohort of participants in a future project and collect more acne images covering more skin lesion types. A diversified list of image-capturing devices will also be evaluated. For example, if acne captured by conventional mobile phones could also be accurately detected and classified, the acne diagnosis method would be much more useful to users while protecting their privacy.
Conclusion
Most acne classification studies were based on natural facial images. This study released a dermoscopy-based acne image dataset together with our well-tuned model, which may provide complementary information to facilitate the clinical decisions of dermatologists.
Our experimental data showed that the integration of multiple base models during the training epochs could achieve better prediction performance than the individual base models. Feature selection algorithms could remove redundant base models and further improve the prediction performance of the integrated model. In the future, we will collect acne images of more acne types, and update the dataset as a benchmark for both computer scientists and dermoscopic researchers.
Footnotes
Acknowledgments
The authors appreciate the insightful comments from the two anonymous reviewers that helped improve the experimental design and result discussions of the work.
Conflict of interest
The authors declare that they have no conflict of interest.
Funding
This work was supported by the Senior and Junior Technological Innovation Team (20210509055RQ), the National Natural Science Foundation of China (62072212 and U19A2061), the Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), the Science and Technology Project of Education Department of Jilin Province (JJKH20200328KJ), and the Fundamental Research Funds for the Central Universities, JLU.
