AcneTyper: An automatic diagnosis method of dermoscopic acne image via self-ensemble and stacking

Abstract

BACKGROUND:

Acne is a type of skin lesion that is common in adolescents, and its automatic diagnosis poses computational challenges. Computer vision algorithms are used to detect and distinguish different subtypes of acne. Most existing acne detection algorithms are based on natural facial images, which carry noise factors such as uneven illumination.

OBJECTIVE:

In order to tackle this issue, this study collected a dataset ACNEDer of dermoscopic acne images with annotations. Deep learning methods have demonstrated powerful capabilities in automatic acne diagnosis, and they usually release the training epoch with the best performance as the delivered model.

METHODS:

This study proposes a novel self-ensemble and stacking-based framework AcneTyper for diagnosing the acne subtypes. Instead of delivering the best epoch, AcneTyper consolidates the prediction results of all training epochs as the latent features and stacks the best subset of these latent features for distinguishing different acne subtypes.

RESULTS:

The proposed AcneTyper framework achieves a promising detection performance for acne subtypes and even outperforms a clinical dermatologist with two years of experience by 6.8% in accuracy.

CONCLUSION:

The proposed method determines different subtypes of acne, outperforms an inexperienced dermatologist, and may help reduce the probability of misdiagnosis.

Keywords

Acne subtype, deep learning, self-ensemble, stacking, classification

1. Introduction

Acne is a highly prevalent inflammatory skin lesion involving symptoms like sebaceous sties, and has prevalence peaks during adolescence [1, 2]. Besides adolescents, a study showed that the prevalence of clinical facial acne in women may persist into their middle ages [3]. Therefore, a massive number of acne patients need precise treatments, but the already overloaded schedules of dermatologists could not fulfill all these needs. Various automatic acne diagnosis technologies have been developed for this purpose, including the object detection [4, 5], acne measuring [6, 7], and classification [8, 9] algorithms.

Most of the existing acne diagnosis methods are based on naturally captured facial images, including those taken by mobile phones. These images may carry various noise factors, e.g., uneven illumination. Dermoscopy is a non-invasive imaging technology for the in vivo evaluation of pigmented skin lesions with optical magnification, liquid immersion and low angle-of-incidence lighting [10]. Skin lesion images captured by dermoscopy offer a much higher visibility of the subsurface structures and textures of the lesions than conventionally captured images.

Many computational methods have been developed recently for the automatic detection, counting and classification of the acne lesions, and most of them utilized the hand-crafted features [5, 6, 8]. Chin et al. measured the wrinkle contours of acne by the Laws mask filter, the Gabor filter and the Kirsch filter [11]. Darmawan et al. described the acne backgrounds for the acne subtype detection [12]. Kittigul et al. investigated the acne detection problem using the Speeded Up Robust Features and the k-nearest neighbor (KNN) classifier [13]. Wirdayanti et al. extracted the texture features by the Gray Level Co-occurrence Matrices (GLCM) algorithm and detected the facial skin diseases by the KNN classifier [14]. Maroni et al. built an ensemble of random forest models using the engineered features of color, texture, spatial, shape and unsupervised descriptors from the segmented lesions [6]. The acne was extracted by the Adaptive Thresholding and Laplacian of Gaussian filters. Navarro et al. improved the bag of feature algorithm and integrated the Gaussian filtering and k-means clustering for detecting the common skin diseases [15]. Xuan et al. transformed the RGB images into the other color spaces according to different acne lesion subtypes, and a multi-level thresholding strategy was used to extract acne [5]. Park et al. utilized the Haar features to detect the facial regions and the YCbCr and HSV color spaces to detect the skin regions. They proposed the CIELab color model to detect the pigmentation of acne [16].

Deep learning has also been widely utilized for describing acne and other skin lesions [17, 18, 19]. The convolutional neural network (CNN) is one of the most popular deep learning networks for medical image processing tasks [20, 21, 22, 23]. Chin et al. developed a facial pore aided detection system using the LeNet-5 model [23]. Saleh et al. proposed an automated facial skin disease detection method by a pre-trained deep convolutional neural network [24]. Karunanayake et al. developed a smartphone-based expert system using a hybrid approach of CNN and natural language processing (NLP) techniques for the detection of acne density, skin sensitivity and acne subtypes [21]. Rashataprucksa et al. used Faster R-CNN and R-FCN to detect acne [4]. Wu et al. proposed a unified framework for the grading and counting of acne [7]. Shen et al. presented a new automatic diagnosis method using the image features extracted by CNN [20]. Junayed et al. used a deep residual neural network to classify five classes of acne [22]. Isa et al. transferred the YOLOv4 model to recognize acne [25], and Phan et al. developed an LED therapeutic device built on a deep learning model that combines the modified ResNet50 and YOLOv2 for automatic acne diagnosis [26].

Some model optimization challenges remain to be resolved. Firstly, the models from the first few epochs tend to be underfitted, while the models from the last few epochs may be overfitted. Secondly, the trained model with the best prediction performance on the validation dataset is usually delivered as the final model, and this operation does not guarantee the best performance on the independent testing dataset. Conventional machine learning algorithms like feature selection and classification may further improve the deep learning models [27]. Hence, we hypothesized that integrating the outputs from all training epochs may improve the classification models.

This study proposes a self-ensemble stacking framework for the dermoscopy-based acne diagnosis. Firstly, we collected and manually annotated a comprehensive dataset of dermoscopy images for four acne subtypes, i.e., tiny comedo, papule, cyst and pustule. Then we ensembled the prediction results of the models trained in all epochs, weighted by the stacking strategy. These stacked prediction results of all trained models were regarded as the engineered features and were further screened by feature selection algorithms. Compared with the predictions directly generated by the CNN model, our ensembled framework showed better performance by integrating the information output from all training epochs.

There are four main contributions of this study. Firstly, we utilize the self-ensemble strategy to consolidate the prediction capabilities of the models trained in all epochs. Secondly, these consolidated models are weighted to differentiate their contributions to the final prediction model. Thirdly, the feature selection algorithms further improve the prediction performance by removing the features from the redundant models. Fourthly, we release to the public our collected and manually annotated dataset ACNEDer of dermoscopic images of four acne subtypes.

2. Materials and methods

2.1 Datasets

A well-annotated dataset ACNEDer was released to provide the dermoscopic images of four acne subtypes, i.e., papule, tiny comedo, pustule and cyst, as shown in Table 1. The original sizes of the images are mostly 1028 $\times$ 1024, while a few of them have varied sizes between 1028 $\times$ 1024 and 312 $\times$ 231. These images were collected in clinical practice by hand-held portable dermoscopy devices of various brands. This study resized the images to 224 $\times$ 224 to train a model with a consensus input size, similar to Akata et al. [28]. The ethics approval was registered as 2019-11-01 on November 8, 2019 in the Department of Epidemiology and Biostatistics, School of Public Health, Jilin University. This is a retrospective investigation. The data were collected between 1-1-2020 and 30-11-2020 at the Beijing Dr. of Acne Medical Research Institute. Two dermatologists were involved in this project and independently annotated the acne subtypes. If they gave different annotations, they discussed their opinions and confirmed a consensus final annotation. An image was discarded if they could not reach a common opinion. The dataset is summarized in Table 1. It consists of 1462 images, including 876 training images, 292 validation images and 294 testing images. The example images are shown in Fig. 1a.

Table 1
The numbers of images for the four acne subtypes in the dataset ACNEDer. The four acne subtypes are summarized in the columns “papule”, “tiny comedo”, “pustule” and “cyst”. The last column “all” gives the total numbers of images for the three sub-datasets training/validation/testing

	Papule	Tiny_comedo	Pustule	Cyst	All
Training	261	251	237	127	876
Validation	87	84	79	42	292
Testing	87	84	80	43	294

Figure 1.

The visualization of sample images. The example images of (a) acne and (b) skin cancers.

To further evaluate our method AcneTyper, a skin cancer dermoscopic image dataset from a Kaggle competition was used [29]. This dataset contains 3297 images in two categories, benign and malignant, split into 2637 training images and 660 testing images. The training set consists of 1440 benign images and 1197 malignant ones, and the testing set has 360 benign images and 300 malignant ones. Because the dataset does not provide a validation set, we randomly selected 80% of the training samples to train the model and used the remaining 20% as the validation set. The example images in this dataset are shown in Fig. 1b.
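The random 80%/20% split of the 2637 training images can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_train_validation`, the fixed seed and the use of integer sample ids are hypothetical, and the text does not say whether the split was stratified by class (this sketch is not).

```python
import random


def split_train_validation(sample_ids, train_fraction=0.8, seed=0):
    """Randomly split sample ids into training and validation subsets."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed (assumption) keeps the split reproducible
    rng.shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return ids[:n_train], ids[n_train:]


# 2637 training images are reported for the skin cancer dataset.
train_ids, val_ids = split_train_validation(range(2637))
print(len(train_ids), len(val_ids))
```

Because 2637 is not divisible by five, the two subsets are only approximately 80% and 20% of the original training set.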

2.2 Metrics

This study considers the acne subtype prediction as a four-class classification problem. The four acne subtypes are papule, tiny comedo, pustule and cyst. The classification problem is formulated as four binary classification problems, similar to Wu et al. [7]. For each acne subtype, the samples in this subtype are positives and the other samples are negatives. The correct prediction rates of the positive and negative samples are sensitivity (Sn) and specificity (Sp) [30]. They are defined as:

$\displaystyle Sn=\frac{TP}{TP+FN}$ (1)

$\displaystyle Sp=\frac{TN}{TN+FP}$ (2)

where TP and FN are the numbers of correctly and incorrectly predicted positive samples, and TN and FP are the numbers of correctly and incorrectly predicted negative samples, respectively. The overall accuracy is defined as:

$\displaystyle\textit{Acc}=\frac{TP+TN}{TP+FN+TN+FP}$ (3)

The metric Precision is defined as:

$\displaystyle Pr=\frac{TP}{TP+FP}$ (4)

The metric Youden Index (YI) is also popularly used and is defined as:

$\displaystyle\text{YI}=Sn+Sp-1$ (5)

A larger value of YI suggests a better classification performance of a given model. The overall classification performance metrics are averaged over the four acne subtypes.
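Equations (1) to (5) and the averaging over subtypes can be illustrated with a short sketch. It treats each subtype in turn as the positive class, evaluates the five metrics, and takes the unweighted mean over the classes. The function name, the label strings and the toy predictions are hypothetical, and returning 0 for an undefined ratio is an assumption rather than something stated in the text.

```python
def one_vs_rest_metrics(y_true, y_pred, classes):
    """Macro-averaged Sn, Sp, Acc, Pr and YI following Eqs. (1)-(5)."""
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        sn = tp / (tp + fn) if tp + fn else 0.0  # Eq. (1)
        sp = tn / (tn + fp) if tn + fp else 0.0  # Eq. (2)
        acc = (tp + tn) / (tp + fn + tn + fp)    # Eq. (3)
        pr = tp / (tp + fp) if tp + fp else 0.0  # Eq. (4)
        per_class.append(
            {"Sn": sn, "Sp": sp, "Acc": acc, "Pr": pr, "YI": sn + sp - 1}  # Eq. (5)
        )
    # Unweighted mean over the subtypes.
    return {k: sum(m[k] for m in per_class) / len(per_class) for k in per_class[0]}


classes = ["papule", "tiny_comedo", "pustule", "cyst"]
y_true = ["papule", "papule", "tiny_comedo", "tiny_comedo",
          "pustule", "pustule", "cyst", "cyst"]
y_pred = ["papule", "tiny_comedo", "tiny_comedo", "tiny_comedo",
          "pustule", "pustule", "cyst", "pustule"]
metrics = one_vs_rest_metrics(y_true, y_pred, classes)
print({k: round(v, 4) for k, v in metrics.items()})
```

Since YI is linear in Sn and Sp, the averaged YI equals the averaged Sn plus the averaged Sp minus one.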

2.3 Implementation details

The backbone of AcneTyper is similar to the ResNet18 model pre-trained on the ImageNet dataset. Label smoothing is used to process the labels in this study, where the parameter $\alpha$ is set to 0.05. The input image is resized to 224 $\times$ 224 $\times$ 3 and normalized to the value range [0, 1]. Stochastic Gradient Descent (SGD) with a batch size of 64 is used to optimize our model for 300 epochs. The momentum and weight decay parameters are set to 0.9 and 1e-4, respectively. The initial learning rate is set to 0.1 and is decayed by a factor of 0.1 at 50%, 75% and 90% of all epochs. The training data are augmented by random crop, random horizontal flip, random vertical flip, and random rotation. The probabilities of the horizontal and vertical flips are both 0.5, and the rotation angle is chosen at random within 30 degrees. 60% of the images are randomly selected as the training dataset, and the remaining images are split equally (20% and 20%) into the validation and testing datasets. This study was run on an Nvidia P100 GPU server with 16GB GPU VRAM. The dataset with annotations and the source code of this study are freely available at http://www.healthinformaticslab.org/supp/.
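The learning-rate schedule described above (initial rate 0.1, decay at 50%, 75% and 90% of the 300 epochs) can be written as a small step function. The sketch below is illustrative only: the function name is hypothetical, and it assumes a 0-based epoch index and a multiplicative decay factor of 0.1 at each milestone.

```python
def learning_rate(epoch, total_epochs=300, base_lr=0.1, gamma=0.1,
                  milestones=(0.50, 0.75, 0.90)):
    """Step-decay learning rate for a 0-based epoch index (assumption)."""
    # Count how many milestones have already been reached.
    n_decays = sum(1 for m in milestones if epoch >= round(m * total_epochs))
    return base_lr * gamma ** n_decays


for epoch in (0, 149, 150, 225, 270, 299):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is 0.1 for epochs 0 to 149, 0.01 for epochs 150 to 224, 0.001 for epochs 225 to 269, and 0.0001 afterwards (up to floating-point rounding).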

2.4 Training procedure

This study hypothesized that all training epochs carried useful and complementary information to the final prediction performances. A deep learning algorithm usually splits the dataset into three subsets training/validation/testing, as shown in Fig. 2a. A model is trained using the training dataset, and evaluated on the validation dataset. Most of the deep learning-based studies assumed that the training epoch with the best performance on the validation dataset would be delivered. This assumption does not necessarily hold for many datasets. A stacking procedure consolidates the prediction results of the trained models in all epochs, as shown in Fig. 2b.

Figure 2.

Comparison between different training procedures for the deep learning models. (a) Traditional training pipeline; (b) Stacking training pipeline; (c) Our proposed pipeline.

We proposed an ensemble framework to consolidate the prediction results of the models trained in all epochs and to select the subset of prediction results with the best prediction performances on the validation dataset, as shown in Fig. 2c.

2.5 Network architecture of AcneTyper

This study proposed a network architecture sharing the pre-trained ResNet18 CNN architecture, fine-tuned using the training dataset of this study, as shown in Fig. 3. A CNN is a network architecture based on the shared weights of layered filters that generate feature maps, and has been widely used in many computer-vision tasks [31, 32, 33]. A CNN with a large number of layers may suffer from the vanishing gradient problem, and a residual neural network may relieve this challenge [34].

Figure 3.

The network architecture of AcneTyper. The pre-trained ResNet18 model was fine-tuned on the training dataset, and the internal network architecture was shared by the AcneTyper. Then the prediction results of the AcneTyper model trained in all epochs were screened by feature selection algorithms and the final optimized AcneTyper model was generated.

Fine-tuning a pre-trained model may transfer the knowledge learned in the previous domain to a similar task [35]. This study fine-tuned the pre-trained ResNet18 model on the ACNEDer dataset released in this study and normalized the results by the Batch Normalization (BatchNorm) module through re-centering and re-scaling [36]. The step of label smoothing was carried out on the dataset [37].
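To make the label smoothing step concrete, the sketch below builds the smoothed target vector for one sample with four classes and $\alpha$ = 0.05, the value given in the implementation details. It uses the common uniform-prior formulation, in which the true class receives $1-\alpha+\alpha/K$ and every other class receives $\alpha/K$. The text does not state which variant was used, so this choice and the function name are assumptions.

```python
def smooth_labels(class_index, num_classes=4, alpha=0.05):
    """Return a smoothed target distribution for one sample."""
    off_value = alpha / num_classes          # mass given to every class
    target = [off_value] * num_classes
    target[class_index] = 1.0 - alpha + off_value  # true class keeps most of the mass
    return target


target = smooth_labels(1)
print(target)
```

With four classes the true class gets 0.9625 and each of the other three gets 0.0125, so the targets still sum to one.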

2.6 Detailed experimental procedure

The detailed experimental procedure of AcneTyper is introduced here. Firstly, AcneTyper generated the initial set of features $F(x)$ by consolidating the prediction results of the models trained in all epochs, as illustrated in Fig. 4. Let $F(x)=\text{Concat}(f_{1}(x), \ldots, f_{i}(x), \ldots, f_{n}(x))$, where $x$ was an input sample, $f_{i}(x)$ was the prediction result of the model trained in the $i^{\text{th}}$ epoch, $n$ was the number of training epochs, and Concat() was the concatenation of the input features.
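The concatenation $F(x)$ can be sketched in a few lines. The example assumes that each $f_{i}(x)$ is a four-element vector of class scores saved for every epoch, which the text does not state explicitly; the function name and the toy numbers are hypothetical. Under that assumption, 300 epochs and four classes would give 300 $\times$ 4 = 1200 features per image.

```python
def concat_epoch_predictions(epoch_predictions):
    """Build F(x) = Concat(f_1(x), ..., f_n(x)) for every sample.

    epoch_predictions[i][s] is the score vector f_i(x_s) produced by the
    model saved after epoch i for sample s.
    """
    n_samples = len(epoch_predictions[0])
    features = []
    for s in range(n_samples):
        row = []
        for preds in epoch_predictions:
            row.extend(preds[s])  # append the scores of this epoch
        features.append(row)
    return features


# Toy example: 3 epochs, 2 samples, 4 classes.
epoch_predictions = [
    [[0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25]],
    [[0.60, 0.20, 0.10, 0.10], [0.10, 0.50, 0.30, 0.10]],
    [[0.80, 0.10, 0.05, 0.05], [0.05, 0.70, 0.20, 0.05]],
]
F = concat_epoch_predictions(epoch_predictions)
print(len(F), len(F[0]))
```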

Figure 4.

The details of the AcneTyper experimental procedure. The CNN models trained in all epochs were utilized to generate the engineered features. Then these engineered features were selected by feature selection algorithms based on their performances on the validation dataset.

We anticipated that not all consolidated features positively contributed to the prediction performances, and used feature selection algorithms to remove unrelated features. A support vector machine (SVM) classifier was trained on the training dataset, and the weights of all features in the trained SVM model were used to rank the features. The iterative feature elimination strategy was conducted and each feature subset was evaluated on the validation dataset. The feature subset with the best performance on the validation dataset was used to train an SVM model, which was then tested on the testing dataset. Grid search and 5-fold cross validation were used to find the best parameter setting.
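The control flow of the iterative elimination can be sketched as below: fit a model, score the current feature subset on the validation data, drop the lowest-ranked feature, and keep the subset with the best validation accuracy. To stay dependency-free, this sketch swaps in a nearest-centroid classifier and ranks features by the spread of the class centroids. Both are stand-ins chosen for illustration and are not the SVM and SVM weights used in the study; grid search and cross validation are omitted, and all names and toy data are hypothetical.

```python
def fit_centroids(X, y):
    """Stand-in classifier (NOT the SVM of the study): one centroid per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}


def predict(centroids, row):
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
    return min(sorted(centroids), key=dist)


def accuracy(centroids, X, y):
    return sum(predict(centroids, r) == t for r, t in zip(X, y)) / len(y)


def feature_weights(centroids):
    """Stand-in for SVM weights: spread of the class centroids per feature."""
    return [max(col) - min(col) for col in zip(*centroids.values())]


def select(X, cols):
    return [[row[j] for j in cols] for row in X]


def backward_elimination(X_train, y_train, X_val, y_val):
    """Drop the lowest-weight feature each round; keep the best validation subset."""
    kept = list(range(len(X_train[0])))
    best_cols, best_acc = None, -1.0
    while kept:
        model = fit_centroids(select(X_train, kept), y_train)
        acc = accuracy(model, select(X_val, kept), y_val)
        if acc >= best_acc:  # ties favour the smaller subset
            best_cols, best_acc = list(kept), acc
        weights = feature_weights(model)
        kept.pop(weights.index(min(weights)))
    return best_cols, best_acc


# Toy data: feature 0 is informative, features 1 and 2 are noise.
X_train = [[0.0, 5.0, 0.1], [0.2, 1.0, 0.0], [1.0, 4.0, 0.1], [0.8, 2.2, 0.0]]
y_train = ["a", "a", "b", "b"]
X_val = [[0.45, 6.0, 0.0], [0.7, 3.0, 0.1]]
y_val = ["a", "b"]
best_cols, best_acc = backward_elimination(X_train, y_train, X_val, y_val)
print(best_cols, best_acc)
```

In this toy run the noisy feature 1 causes a validation error until it is removed, so the selected subset contains only feature 0.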

3. Results and discussion

3.1 Ablation experiments

The ablation experiment evaluates the contribution of each module of AcneTyper, as shown in Table 2. Overall, all modules positively contribute to the improvement of the prediction model. The baseline model used in this study is the 18-layer convolutional neural network (CNN) model ResNet18, and Table 2 shows that the output layer of ResNet18 achieves 0.6395 in the overall accuracy. The pre-trained ResNet18 is fine-tuned on our curated dataset ACNEDer and achieves an improvement of 0.0306 in Acc. Batch normalization further improves the model to Acc $=$ 0.7109. Label smoothing improves the model by a further 0.0374 in Acc.

Table 2
How the performance is affected by different modules. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)

Module	Module combination
ResNet18	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$
Fine tune		$\surd$	$\surd$	$\surd$	$\surd$	$\surd$
Batch norm			$\surd$	$\surd$	$\surd$	$\surd$
Label smooth				$\surd$	$\surd$	$\surd$
Stacking					$\surd$	$\surd$
FS						$\surd$
Metric	Values
Sn	0.6254	0.6534	0.6914	0.7241	0.7412	0.7681
Sp	0.8785	0.8878	0.9019	0.9147	0.9258	0.9300
Acc	0.6395	0.6701	0.7109	0.7483	0.7823	0.7959
Pr	0.6319	0.6590	0.6940	0.7333	0.7641	0.7923
YI	0.5039	0.5412	0.5933	0.6388	0.6670	0.6981

The stacking operation delivers an improvement of 0.0340 in Acc, suggesting that the prediction results of the deep learning models trained in different epochs provide complementary information to the prediction model. The feature selection algorithm yields an additional improvement of 0.0136 in Acc, which suggests the necessity of removing engineered features that do not contribute to the prediction model.
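The per-module gains quoted in this section follow directly from the Acc row of Table 2. The short sketch below recomputes them; only the accuracy values come from Table 2, while the configuration labels are shorthand introduced here.

```python
# Acc values copied from Table 2, in the order the modules are added.
acc_by_config = [
    ("ResNet18", 0.6395),
    ("+ Fine tune", 0.6701),
    ("+ Batch norm", 0.7109),
    ("+ Label smooth", 0.7483),
    ("+ Stacking", 0.7823),
    ("+ FS", 0.7959),
]

# Gain of each module relative to the previous configuration.
gains = {
    name: round(acc - prev_acc, 4)
    for (_, prev_acc), (name, acc) in zip(acc_by_config, acc_by_config[1:])
}
total_gain = round(acc_by_config[-1][1] - acc_by_config[0][1], 4)

for name, gain in gains.items():
    print(f"{name}: {gain:+.4f}")
print(f"total: {total_gain:+.4f}")
```

The gains are 0.0306, 0.0408, 0.0374, 0.0340 and 0.0136, which add up to a total improvement of 0.1564 in Acc over the ResNet18 baseline.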

3.2 Evaluation of different classifiers

In this study, a classifier predicts the final result from the features extracted by the convolutional neural network (CNN). To verify how different classifiers influence the final performance, we compared the performance of five classifiers, i.e., Multi-Layer Perceptron (MLP) [38], XGBoost [34], Linear Regression (LR) [39], Random Forest (RF) [40], and Support Vector Machine (SVM) [41]. The grid search method was used to tune the parameters of each classifier [42]. As shown in Table 3, SVM achieved the best performance among the evaluated classifiers. The following sections used SVM in the proposed framework AcneTyper.

Table 3
Comparison of the performance of different classifiers. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)

	MLP	XGBoost	LR	RF	SVM
Pr	0.6908	0.7366	0.6756	0.7904	0.7923
Sn	0.6742	0.7208	0.6708	0.7615	0.7681
Sp	0.9013	0.9190	0.8988	0.9286	0.9300
YI	0.5755	0.6398	0.5697	0.6901	0.6981
Acc	0.7109	0.7619	0.7007	0.7925	0.7959

3.3 Comparison with the CNN-ensemble method

Ensemble algorithms are a group of widely used machine learning methods [43, 44], and they usually deliver better prediction performances by integrating multiple weak base models. This section compared the proposed self-ensemble-based method AcneTyper with a CNN-ensemble method. The pre-trained ResNet18 was used as the base model. We randomly selected 80% of the training samples to fine-tune ResNet18, and 30 base models were trained through random runs. The voting strategy was used to construct the CNN-ensemble model, as shown in Fig. 5. Our method outperformed the CNN-ensemble model by 0.087 in Acc. Another merit of our method was that AcneTyper needed to be trained only once, while the CNN-ensemble method needed to be trained 30 times to generate the base models. Hence, the self-ensemble strategy in our method delivered both a better prediction performance and a faster training step.
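The voting step of the CNN-ensemble baseline can be illustrated with a plurality vote over the hard labels predicted by the base models. The text does not specify whether hard or soft voting was used or how ties were broken, so hard voting and the alphabetical tie-break below are assumptions, as are the function name and the toy predictions.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine hard label predictions of several base models by plurality voting."""
    n_samples = len(predictions_per_model[0])
    voted = []
    for s in range(n_samples):
        votes = Counter(model_preds[s] for model_preds in predictions_per_model)
        top = max(votes.values())
        # Tie-break (assumption): pick the alphabetically smallest label.
        voted.append(min(label for label, n in votes.items() if n == top))
    return voted


# Toy example: 3 base models, 3 samples.
predictions_per_model = [
    ["papule", "cyst", "pustule"],
    ["papule", "pustule", "pustule"],
    ["cyst", "cyst", "papule"],
]
ensemble_pred = majority_vote(predictions_per_model)
print(ensemble_pred)
```

With 30 base models the same function would receive 30 prediction lists, one per independently trained network.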

3.4 Contribution of the features engineered by the early epochs

The features engineered by the early epochs were evaluated for their contributions to the final prediction model, as shown in Table 4. The parameters of the early epochs may not converge to represent the training samples. We generated multiple concatenations of different epochs. The features without the early 10 epochs were denoted as Remove10. Redundant features from Remove10 were excluded by the feature selection algorithms, and this feature set was denoted as Remove10-FS. The features concatenated by AcneTyper before the feature selection step were denoted as the feature set AcneTyper-NoFS. The list of features used in this study was denoted as AcneTyper.

Table 4
How the early epochs influence the performance. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)

	Sn	Sp	Acc	Pr	YI
Remove10	0.7386	0.9225	0.7721	0.7519	0.6611
Remove10-FS	0.7383	0.9225	0.7721	0.7528	0.6608
AcneTyper-NoFS	0.7412	0.9258	0.7823	0.7641	0.6670
AcneTyper	0.7681	0.9300	0.7959	0.7923	0.6981

Figure 5.

Comparison of AcneTyper and the CNN ensemble method. The horizontal axis gave the performance metrics, and the vertical axis was the metric values.

Table 4 shows that the features engineered by the early 10 epochs made an important contribution to the acne type prediction model. Feature selection produced only a minor change to the model excluding the features engineered by the early 10 epochs. The two feature sets Remove10 (Acc $=$ 0.7721) and Remove10-FS (Acc $=$ 0.7721) performed worse than the models using the features from the early epochs. The complete version of AcneTyper achieved the best prediction accuracy of 0.7959.

This suggested that the features engineered by the early 10 epochs represented a useful contribution to the acne type prediction task. Some redundant features still needed to be excluded, and the feature selection step further improved the AcneTyper-NoFS model by 0.0136 in Acc.

3.5 Comparison with the hand-crafted features

Table 5
Comparison of AcneTyper with the traditional visual methods. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)

FE	Classifier	Sn	Sp	Acc	Pr	YI
GLCM	DT	0.3881	0.7955	0.4014	0.3890	0.1836
	RF	0.3877	0.7996	0.4184	0.3987	0.1873
	NB	0.3244	0.7654	0.2755	0.2660	0.0898
	KNN	0.3103	0.7709	0.3265	0.2963	0.0812
	SVM	0.2792	0.7619	0.3197	0.2281	0.0411
HOG	DT	0.3817	0.7983	0.4048	0.3817	0.1800
	RF	0.4352	0.8192	0.4830	0.5303	0.2543
	NB	0.3403	0.7812	0.3503	0.3257	0.1215
	KNN	0.3012	0.7691	0.3401	0.4264	0.0703
	SVM	0.4276	0.8180	0.4796	0.5236	0.2456
LBP	DT	0.3898	0.7976	0.4048	0.3894	0.1874
	RF	0.3523	0.7877	0.3912	0.4385	0.1399
	NB	0.3799	0.7934	0.4014	0.4023	0.1733
	KNN	0.3828	0.7960	0.4014	0.3845	0.1788
	SVM	0.2972	0.7686	0.3367	0.2445	0.0658
SIFT	DT	0.2598	0.7489	0.2517	0.2561	0.0087
	RF	0.2645	0.7527	0.2891	0.3119	0.0172
	NB	0.2169	0.7419	0.1667	0.1276	$-$ 0.0411
	KNN	0.2684	0.7601	0.2891	0.2694	0.0285
	SVM	0.2229	0.7400	0.2551	0.2088	$-$ 0.0371
AcneTyper		0.7681	0.9300	0.7959	0.7923	0.6981

AcneTyper outperforms the hand-crafted features in detecting acne, as shown in Table 5. The conventional machine learning studies for the acne detection problem consist of two steps, i.e., feature extraction and classification. We compared AcneTyper with four popular feature extraction algorithms, i.e., GLCM [45], HOG [46], LBP [46] and SIFT [47]. The extracted features were used to build classification models by five classifiers, i.e., decision tree (DT), random forest (RF), naïve Bayes (NB), K-nearest neighbor (KNN) and support vector machine (SVM). Table 5 shows that AcneTyper outperforms these four feature extraction algorithms in all five classification performance metrics. AcneTyper achieves its largest improvements in the prediction sensitivity (Sn), suggesting that the hand-crafted features have difficulty separating different acne subtypes.

3.6 Comparison with deep learning-extracted features

Deep learning algorithms are widely used for their automatic abstractions of inherent patterns within the data, and this study evaluated how the existing deep learning models performed on the acne subtype detection problem, as shown in Table 6. Different batch sizes 16/32/64 and epoch numbers 100/200/300 were evaluated for the deep learning models. Overall, these deep learning-extracted features achieved much better acne subtype detection performances compared with the data in Table 5. The AcneTyper model proposed in this study outperformed the other deep learning models in all five performance metrics. Apart from AcneTyper, EfficientNet-B2 achieved the best Acc $=$ 0.7517, still 0.0442 lower than that of AcneTyper (Acc $=$ 0.7959). The simplest network AlexNet outperformed some more complex networks. This might be because the more complex networks generated more redundant features than the simple network AlexNet. The data suggest that deep learning is powerful in extracting image features and that the deep learning-based models may be further improved by the other algorithmic operations in AcneTyper.

Table 6
Comparison with the features extracted by popular neural network architectures. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)

	Sn	Sp	Acc	Pr	YI
AlexNet	0.6639	0.8926	0.6837	0.6757	0.5565
VGG16	0.7010	0.9141	0.7483	0.7187	0.6150
VGG19	0.6969	0.9081	0.7347	0.7405	0.6050
ResNet18	0.6254	0.8785	0.6395	0.6319	0.5039
ResNet34	0.6585	0.8908	0.6769	0.6616	0.5494
ResNet50	0.6736	0.8949	0.6905	0.6811	0.5685
ResNet101	0.6887	0.9005	0.7075	0.6933	0.5892
ResNet152	0.6168	0.8816	0.6565	0.6421	0.4984
DenseNet121	0.6410	0.8850	0.6599	0.6419	0.5260
DenseNet169	0.6992	0.9017	0.7089	0.6943	0.6010
InceptionV3	0.6094	0.8753	0.6293	0.6253	0.4847
EfficientNet-B0	0.6786	0.9041	0.7245	0.7393	0.5827
EfficientNet-B1	0.6994	0.9082	0.7313	0.7176	0.6075
EfficientNet-B2	0.7142	0.9153	0.7517	0.7348	0.6295
EfficientNet-B3	0.7146	0.9119	0.7415	0.7263	0.6265
EfficientNet-B4	0.6918	0.9057	0.7211	0.6941	0.5975
EfficientNet-B5	0.6913	0.9048	0.7211	0.7072	0.5961
EfficientNet-B6	0.7106	0.9099	0.7313	0.7161	0.6204
EfficientNet-B7	0.6605	0.8900	0.6735	0.6474	0.5504
AcneTyper(ours)	0.7681	0.9300	0.7959	0.7923	0.6981

Table 7

Comparison with the two dermatologists. Der_1 and Der_2 represent the two dermatologists, respectively. The performance metrics are Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Precision (Pr), and Youden Index (YI)

	Sn	Sp	Acc	Pr	YI
Der_1	0.8470	0.9478	0.8367	0.8107	0.7947
Der_2	0.7449	0.9088	0.7279	0.6946	0.6538
AcneTyper	0.7681	0.9300	0.7959	0.7923	0.6981

3.7 Comparison with the two dermatologists

The ultimate goal of many computational disease classification studies is to help the clinical practice. This study compared the proposed AcneTyper with the two professional dermatologists involved in this project, as shown in Table 7. The collected dataset ACNEDer was independently annotated by two dermatologists. The two dermatologists then discussed and re-annotated the images with acne types disagreed by the two dermatologists. Only the images with the commonly agreed acne types were kept in the dataset ACNEDer, which was used as the benchmark dataset.

We compared AcneTyper with the initial annotations of the two dermatologists on this benchmark dataset. The performance metrics Sn, Sp, Acc, Pr and YI were calculated for the initial annotations of the two dermatologists. AcneTyper was also trained on the training dataset and evaluated on the validation dataset. The final result was calculated by applying the model that performed best on the validation dataset to the testing dataset. The dermatologists Der_1 and Der_2 have 4 and 2 years of clinical dermatological practice, respectively. Table 7 shows that Der_1 outperforms Der_2 by 0.1088 in Acc and 0.1021 in Sn. So, an inexperienced dermatologist may produce many false negatives in the acne diagnosis. Our AcneTyper model achieves 0.7959 in Acc, which is between the classification accuracies of Der_1 and Der_2.

So the experimental data shows that AcneTyper outperforms some dermatologists in diagnosing the acne subtypes, and is anticipated to provide useful automatic annotations for the initial screenings in the clinical practice.

3.8 Evaluate the performance of AcneTyper on a benchmark dataset

Table 8
Comparing AcneTyper with the other state-of-the-art methods on skin cancer dataset

	Accuracy	Recall	Precision	F1-score
Bologna and Fossati (2020) [48]	84.9	N/A	N/A	N/A
Lee and Renee (2020) [49]	82.9	N/A	N/A	N/A
TWDBDL (2021) [50]	88.95	N/A	N/A	89
BARF (cross) (2021) [29]	89.24	89.3	89.11	89.18
AcneTyper	98.22	98.41	98.18	98.04

The proposed framework was evaluated for its generalizability on an additional benchmark dataset, as shown in Table 8. This is a binary classification task, and we compared AcneTyper with four state-of-the-art (SOTA) methods, i.e., DIMLP-ensemble [48], CNN [49], TWDBDL [50] and BARF (cross) [29]. Table 8 shows that BARF (cross) achieved the second best performance, and our method AcneTyper outperformed this method by 8.98% in Accuracy, 9.11% in Recall, 9.07% in Precision and 8.86% in F1-score. The proposed AcneTyper framework outperformed the SOTA methods on the benchmark dataset.

3.9 Visualization of classification result by t-SNE

Figure 6.

Visualization of AcneTyper feature space using t-SNE on the dataset ACNEDer. The blue, orange, yellow and grey dots represent the samples of the four classes, i.e., tiny comedo, papule, pustule and cyst.

Figure 6 visualizes the AcneTyper feature space by t-SNE, a widely used method to visualize classification results [51]. The t-SNE algorithm mapped the AcneTyper features of each acne image to a two-dimensional point, so that images with similar features were placed close to each other with high probability. Figure 6 illustrates that the four classes of acne images were well separated into four clusters. The AcneTyper classification model utilized a much more complicated ensembling strategy and delivered a more satisfying classification accuracy than the visualization in Fig. 6 suggests.

3.10 Discussion

Artificial intelligence (AI) methods have been successfully utilized in disease diagnosis tasks based on various biomedical data types, including biochemical characteristics, image, and electrocardiogram [52, 53, 54]. Sejdinović et al. developed an artificial neural network for the classification of prediabetes and type 2 diabetes (T2D) [55]. Alić et al. proposed a two-layer feedforward artificial neural network for the classification of metabolic syndrome [56]. Veljović et al. investigated the efficiency of artificial neural networks and docking methods to predict antimicrobial activity of new compounds [57]. Badnjevic et al. developed an expert system for an efficient diagnosis of asthma, COPD or a normal lung sample [58]. Catic et al. successfully applied neural networks to classify Patau, Edwards, Down, Turner and Klinefelter Syndromes [59].

This study proposed an acne subtype detection algorithm AcneTyper via self-ensemble and stacking. Self-ensemble is a new algorithmic technology to improve the neural network models. This study demonstrated that this technology could also help the computational diagnosis of acne types. We stacked the prediction results of the models trained in all epochs of the CNN-based models and further screened these engineered features by feature selection algorithms. The ablation experimental data showed that all integrated modules of AcneTyper had positive contributions to the final prediction performance. The best prediction accuracy of the four acne subtypes reaches 0.7959 in Acc by AcneTyper.

AcneTyper outperformed both the hand-crafted and the deep learning-extracted features on the acne subtype detection problem, and even performed better than a dermatologist with two years of clinical experience.

A well-annotated dataset of four acne subtypes is also released to the public, in the hope that the research community will help push acne subtype prediction forward.

Our study has some limitations that need to be addressed in future studies. The size of the dataset is limited. We plan to recruit a large cohort of participants in a future project and to collect more acne images covering more skin lesion types. A more diverse set of image-capturing devices will also be evaluated. For example, if acne captured by conventional mobile phones could also be accurately detected and classified, the acne diagnosis method would be much more useful to users while protecting their privacy.

4. Conclusion

Most acne classification studies were based on natural facial images. This study released a dermoscopy-based acne image dataset together with our well-tuned model, which may provide complementary information to facilitate the clinical decisions of dermatologists.

Our experimental data showed that integrating the base models obtained across the training epochs could achieve better prediction performance than the individual base models. Feature selection algorithms could remove redundant base models and further improve the prediction performance of the integrated model. In the future, we will collect images of more acne types and update the dataset as a benchmark for both computer scientists and dermoscopy researchers.

Footnotes

Acknowledgments

The authors appreciate the insightful comments from the two anonymous reviewers, which helped improve the experimental design and the discussion of the results.

Conflict of interest

The authors declare that they have no conflict of interest.

Funding

This work was supported by the Senior and Junior Technological Innovation Team (20210509055RQ), the National Natural Science Foundation of China (62072212 and U19A2061), the Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), the Science and Technology Project of Education Department of Jilin Province (JJKH20200328KJ), and the Fundamental Research Funds for the Central Universities, JLU.

References

1. Tan, Chavda, Leclerc, Dreno. Projective Personification Approach to the Experience of People With Acne and Acne Scarring-Expressing the Unspoken. JAMA Dermatol. 2022.

2. Lozynska, Glabska. Association between nutritional behaviours and acne-related quality of life in a population of Polish male adolescents. Nutrients. 2022; 14(13).

3. Falla, Rodan, Fields, Ong, Skobowiat. Safety and efficacy of a novel three-step anti-acne regimen formulated specifically for women. Int J Womens Dermatol. 2020; 6(5): 419-23.

4. Rashataprucksa, Chuangchaichatchavarn, Triukose, Nitinawarat, Pongprutthipan, Piromsopa. Acne Detection with Deep Neural Networks. 2020 2nd International Conference on Image Processing and Machine Vision; 2020. pp. 53-6.

5. Xuan NPN, Thi, Minh, Bao DTN. A Multilevel Thresholding Approach for Acne Detection in Medical Treatment. 2021 3rd International Conference on Image Processing and Machine Vision (IPMV); 2021. pp. 17-23.

6. Maroni, Ermidoro, Previdi, Bigini. Automated detection, extraction and counting of acne lesions for automatic evaluation and tracking of acne severity. 2017 IEEE Symposium Series on Computational Intelligence (SSCI); 2017: IEEE; 2017. pp. 1-6.

7. Wen, Liang, Lai Y-K, She, Cheng M-M, et al., editors. Joint acne image grading and counting via label distribution learning. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019.

8. Yadav, Alfayeed, Khamparia, Pandey, Thanh DNH, Pande. HSV model-based segmentation driven facial acne detection using deep learning. Expert Systems. 2021; e12760.

9. Khan, Malik, Kamel, Dass, Affandi. Automated system for acne vulgaris grading using self-organizing map. Journal of Medical Imaging and Health Informatics. 2017; 7(8): 1705-13.

10. Chen, Pradhan, Shahabi, Rizeei, Hou, et al. Novel hybrid integration approach of bagging-based Fisher’s linear discriminant function for groundwater potential analysis. Natural Resources Research. 2019; 28(4): 1239-58.

11. Chin C-L, Chen H-F, Lin B-J, Chi M-C, Chen W-E, Yang Z-Y. Facial wrinkle detection with texture feature. 2017 IEEE 8th International Conference on Awareness Science and Technology (iCAST); 2017: IEEE; 2017. pp. 343-7.

12. Darmawan, Rositasari, Muhimmah. The Identification System of Acne Type on Indonesian People’s Face Image. IOP Conference Series: Materials Science and Engineering; 2020: IOP Publishing; 2020; 012028.

13. Kittigul, Uyyanonvara. Acne Detection Using Speeded up Robust Features and Quantification Using K-Nearest Neighbors Algorithm. Proceedings of the 6th International Conference on Bioinformatics and Biomedical Science; 2017. pp. 168-71.

14. Mahmudi, Ahsan, Kasim, Nur, Basalamah, Septiarini. Face Skin Disease Detection with Textural Feature Extraction. 2020 6th International Conference on Science in Information Technology (ICSITech); 2020: IEEE; 2020. pp. 133-7.

15. Navarro MCR, Barfeh DPY. Skin Disease Detection using Improved Bag of Features Algorithm. 2019 5th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS); 2019: IEEE; 2019. pp. 1-5.

16. Park K-H, Kim Y-H. Skin condition analysis of facial image using smart device: Based on acne, pigmentation, flush and blemish. Journal of Advanced Information Technology and Convergence. 2018; 8(2): 47-58.

17. Liu, Fan, Duan, Wang, Ren, et al. AcneGrader: An ensemble pruning of the deep learning base models to grade acne. Skin Research and Technology. 2022.

18. Lin, Jiang, Chen, Guan, You, et al. KIEGLFN: A Unified Acne Grading Framework on Face Images. Computer Methods and Programs in Biomedicine. 2022; 106911.

19. Wen, Jun, Liu, Kuang, et al. Acne detection and severity evaluation with interpretable convolutional neural network models. Technology and Health Care. 2022(Preprint): 1-11.

20. Shen, Zhang, Yan, Zhou. An automatic diagnosis method of facial acne vulgaris based on convolutional neural network. Scientific Reports. 2018; 8(1): 5839.

21. Karunanayake, Dananjaya WGM, Peiris, Gunatileka, Lokuliyana, Kuruppu. CURETO: Skin Diseases Detection Using Image Processing And CNN. 2020 14th International Conference on Innovations in Information Technology (IIT); 2020: IEEE; 2020. pp. 1-6.

22. Junayed, Jeny, Atik, Neehal, Karim, Azam, et al., editors. AcneNet – A Deep CNN Based Classification Approach for Acne Classes. 2019 12th International Conference on Information & Communication Technology and System (ICTS); 2019 18–18 July 2019.

23. Chin C-L, Yang Z-Y, R-C, Yang C-S. A Facial Pore Aided Detection System Using CNN Deep Learning Algorithm. 2018 9th International Conference on Awareness Science and Technology (iCAST); 2018: IEEE; 2018. pp. 90-4.

24. El Saleh, Bakhshi, Amine N-A. Deep convolutional neural network for face skin diseases identification. 2019 Fifth International Conference on Advances in Biomedical Engineering (ICABME); 2019: IEEE; 2019. pp. 1-4.

25. Isa NAM, Mangshor NNA. Acne Type Recognition for Mobile-Based Application Using YOLO. Journal of Physics: Conference Series. 2021; 1962(1): 012041.

26. Phan, Huynh, Nguyen, Park, et al. A smart LED therapy device with an automatic facial acne vulgaris diagnosis based on deep learning and internet of things application. Computers in Biology and Medicine. 2021; 136: 104610.

27. Chen, Pang, Zhao, Miao, Zhang, et al. Feature selection may improve deep neural networks for the bioinformatics problems. Bioinformatics. 2020; 36(5): 1542-52.

28. Akata, Reed, Walter, Lee, Schiele, editors. Evaluation of output embeddings for fine-grained image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015.

29. Abdar, Fahami, Chakrabarti, Khosravi, Pławiak, Acharya, et al. BARF: A new direct and cross-based binary residual feature fusion with uncertainty-aware module for medical image classification. Information Sciences. 2021; 577: 353-78.

30. Zhong, Sun, Peng, Xie, Yang, Tang. XGBFEMF: An XGBoost-based framework for essential protein prediction. IEEE Transactions on Nanobioscience. 2018; 17(3): 243-50.

31. Mortazi, Bagci. Automatically designing CNN architectures for medical image segmentation. International Workshop on Machine Learning in Medical Imaging; 2018: Springer; 2018. pp. 98-106.

32. Qiu, Wen, Xie, Wen F-Q, Sheng G-Q, Tang X-G. Efficient medical image enhancement based on CNN-FBB model. IET Image Processing. 2019; 13(10): 1736-44.

33. Radenović, Tolias, Chum. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018; 41(7): 1655-68.

34. Qiu, Zhou, Khandelwal, Yang. Performance evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to predict blast-induced ground vibration. Engineering with Computers. 2021; 1-18.

35. Shaha, Pawar, editors. Transfer learning for image classification. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA); IEEE; 2018.

36. Gao S-H, Han, Cheng M-M, Peng, editors. Representative batch normalization with feature calibration. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021.

37. Müller, Kornblith, Hinton. When does label smoothing help? Advances in Neural Information Processing Systems. 2019; 32.

38. Zhang, Xiong, Wang, Deng, Song, et al. T4SEfinder: A bioinformatics tool for genome-scale prediction of bacterial type IV secreted effectors using pre-trained protein language model. Brief Bioinform. 2022; 23(1).

39. Schmidt, Finan. Linear regression and the normality assumption. Journal of Clinical Epidemiology. 2018; 98: 146-51.

40. Speiser, Miller, Tooze. A comparison of random forest variable selection methods for classification prediction modeling. Expert Systems with Applications. 2019; 134: 93-101.

41. Jahed Armaghani, Asteris, Askarian, Hasanipanah, Tarinejad, Huynh. Examining hybrid and single SVM models with different kernels to predict rock brittleness. Sustainability. 2020; 12(6): 2229.

42. Liu, Gao, Fang, Cao, Wang, et al. Identifying complex gene-gene interactions: A mixed kernel omnibus testing approach. Brief Bioinform. 2021; 22(6).

43. Mallipeddi, Suganthan. Ensemble strategies for population-based optimization algorithms – A survey. Swarm and Evolutionary Computation. 2019; 44: 695-711.

44. Wang, Liu, Duan, et al. A novel model for malaria prediction based on ensemble algorithms. PloS One. 2019; 14(12): e0226910.

45. Zulfira, Suyanto, Septiarini. Segmentation technique and dynamic ensemble selection to enhance glaucoma severity detection. Comput Biol Med. 2021; 139: 104951.

46. Turker, Emre, Aydin. Automated classification of nasal polyps in endoscopy video-frames using handcrafted and CNN features. Comput Biol Med. 2022; 147: 105725.

47. Yang, Pan, Qin. Scene-graph-driven semantic feature matching for monocular digestive endoscopy. Comput Biol Med. 2022; 146: 105616.

48. Bologna, Fossati. A two-step rule-extraction technique for a CNN. Electronics. 2020; 9(6): 990.

49. Lee, Chin RKY. The effectiveness of data augmentation for melanoma skin cancer prediction using convolutional neural networks. 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET); 2020: IEEE; 2020. pp. 1-6.

50. Abdar, Samami, Mahmoodabad, Doan, Mazoure, Hashemifesharaki, et al. Uncertainty quantification in skin cancer classification using three-way decision-based Bayesian deep learning. Computers in Biology and Medicine. 2021; 135: 104418.

51. Arora, Kothari. An analysis of the t-SNE algorithm for data visualization. Conference On Learning Theory; 2018: PMLR; 2018. pp. 1455-62.

52. Shawki, Azmy, Salama, Shawki. Mathematical and deep learning analysis based on tissue dielectric properties at low frequencies predict outcome in human breast cancer. Technology and Health Care. 2022; 30(3): 633-45.

53. Liu, Liang, Peng, Sun, et al. Segmentation of acetowhite region in uterine cervical image based on deep learning. Technology and Health Care. 2022; 30(2): 469-82.

54. Petryshak, Kachko, Maksymenko, Dobosevych. Robust deep learning pipeline for PVC beats localization. Technology and Health Care. 2021; 29(S1): 475-86.

55. Sejdinović, Gurbeta, Badnjević, Malenica, Dujić, Čaušević, et al. Classification of prediabetes and type 2 diabetes using artificial neural network. CMBEBIH 2017: Springer; 2017. pp. 685-9.

56. Alić, Gurbeta, Badnjević, Badnjević-Čengić, Malenica, Dujić, et al. Classification of metabolic syndrome patients using implemented expert system. CMBEBIH 2017: Springer; 2017. pp. 601-7.

57. Veljović, Špirtović-Halilović, Muratović, Osmanović, Badnjević, Gurbeta, et al. Artificial neural network and docking study in design and synthesis of xanthenes as antimicrobial agents. CMBEBIH 2017: Springer; 2017. pp. 617-26.

58. Badnjevic, Gurbeta, Custovic. An expert diagnostic system to automatically identify asthma and chronic obstructive pulmonary disease in clinical settings. Scientific Reports. 2018; 8(1): 1-9.

59. Catic, Gurbeta, Kurtovic-Kozaric, Mehmedbasic, Badnjevic. Application of Neural Networks for classification of Patau, Edwards, Down, Turner and Klinefelter Syndrome based on first trimester maternal serum screening data, ultrasonographic findings and patient demographics. BMC Medical Genomics. 2018; 11(1): 1-12.

