Dual concatenated transfer learning with attention fusion: An ensemble-enhanced approach for skin lesion classification

Abstract

Objective

Classification of skin lesions plays a crucial role in the early detection and diagnosis of various dermatological conditions. Existing deep learning models suffer from class imbalance, weak feature extraction, and poor generalization to unseen data. This study aims to develop a robust hybrid deep learning model for multi-class skin lesion classification.

Methods

We propose a hybrid architecture combining three DenseNet models (DN121, DN169, DN201) and three ResNet models (RN50, RN101, RN152) with attention mechanisms (channel attention, squeeze-and-excitation, soft attention). The attention-augmented backbones are concatenated within each architectural family (dual concatenation), and the two concatenated branches are then ensembled to enhance performance. The model is trained and evaluated on the HAM10000 dataset, with advanced augmentation strategies applied to address class imbalance and improve generalization on unseen data.

Results

The model achieves an accuracy of 91.43% and a specificity of 92.04%, outperforming existing baseline methods. Attention mechanisms significantly improve feature extraction, dual concatenation provides better feature fusion, and ensemble integration enhances overall model robustness.

Conclusion

Our attention mechanism-based hybrid architecture is a robust and reliable solution for machine-based skin lesion classification. Its strong performance indicates its potential to help dermatologists with timely, precise diagnosis, serving as a foundation for other innovations in medical image analysis.

Keywords

skin lesions classification, hybrid architecture, attention mechanisms, augmentation, HAM10000 dataset, ensemble learning, transfer learning

Introduction

Skin lesions denote irregular alterations in the skin’s appearance, while skin ailments encompass a wide array of issues affecting the skin’s well-being, structure, and operation.¹ Skin lesions are frequently the earliest visible indicators of various dermatological disorders. The World Health Organization once said that skin diseases will be the most common disease in human history in the 21st century, with the highest morbidity rate and the highest disability rate.² About 30% to 70% of people of different races and ages in the world suffer from skin diseases.³ Melanoma, the most lethal form of skin cancer, while comparatively rare, accounts for the majority of skin cancer-related fatalities. There are an estimated 76,380 new cases of melanoma and an estimated 6,750 deaths each year in the United States.⁴

Early detection of skin lesions is essential to prevent progression to serious illness. Unfortunately, many patients do not know what their skin condition is because of the cost and time required for a formal clinical consultation. Artificial intelligence (AI) has emerged as a tool with the potential to increase the effectiveness of self-screening for skin cancer.⁵ AI, specifically machine learning (ML) and deep learning (DL), can detect disease conditions through data analysis, making diagnosis easier and raising awareness about the condition.⁶ Despite the considerable volume of research on ML and DL applications for skin lesion detection, several persistent limitations remain. Transfer learning approaches trained on ImageNet data can extract generalizable low-level features; however, they often lack the robustness required for deployment in resource-constrained environments. Furthermore, many such models tend to overfit toward majority classes, leaving minority class distributions inadequately represented due to inherent class imbalance. Finally, hyperparameters must be selected carefully for optimal performance.⁷

To address these challenges and establish a robust foundation, we first investigate the following primary research questions.

RQ1. How can class distributions be balanced to obtain the most suitable dataset for training the models?

RQ2. How can data augmentation be employed to achieve better model performance on unseen data, reducing bias and overfitting?

RQ3. Does combining multiple models yield better generalization than using a single model?

After conducting our investigation and carefully examining the previously stated questions, we produced the following Novel Contributions (NC).

• NC1 (Addressing Class Imbalance): Four augmentation strategies were applied and compared, namely Prior Augmentation, Posterior Augmentation, No Augmentation, and Only Train Set Augmentation, to mitigate the class imbalance present in the HAM10000 dataset. Only Train Set Augmentation (TA) was selected as the optimal approach, as it ensures an unbiased evaluation while enhancing the representation of minority classes.

• NC2 (Integration of Attention Mechanisms): Three attention modules, specifically Channel Attention (CA), Squeeze-and-Excitation Attention (SEA), and Soft Attention (SA), were incorporated to improve discriminative feature extraction and reduce background noise in lesion images.

• NC3 (Hybrid Multi-Branch Ensemble Framework): A novel hybrid architecture was designed by combining DenseNet variants (DN121, DN169, DN201) and ResNet variants (RN50, RN101, RN152). A dual concatenation strategy was introduced within each architectural family, followed by a final weighted ensemble across families to exploit their complementary strengths.

• NC4 (Improved Generalization and Robustness): Multi-level ensembling with attention mechanisms was demonstrated to improve stability and robustness over single-architecture approaches, achieving balanced performance across all lesion classes, including minority categories.

• NC5 (Comprehensive Experimental Evaluation): Relying solely on accuracy for imbalanced datasets can be misleading. To address this, a rigorous evaluation was conducted using accuracy alongside Precision, Recall, F1-score, Specificity, and ROC-AUC curves. This ensures that the model’s validity is not skewed by the majority class and provides a holistic view of its performance in real-world diagnostic scenarios.

Literature review

Skin lesion classification has advanced remarkably in recent years with the introduction of deep learning techniques and ensemble learning architectures. Several studies have used well-known datasets such as HAM10000 with the aim of improving classification performance while addressing problems such as class imbalance, feature extraction, and model generalization.

Setiawan and Soewito⁸ introduced CRCDKD, a novel approach that incorporates a mean-teacher architecture, categorical relation-preserving contrastive learning, and a decoupled mean teacher knowledge distillation module using DenseNet backbones. With semi-supervised training and distillation, the method achieved a promising 89.41% accuracy, but the absence of comprehensive testing of weighting schemes and the need for dynamic tuning leave room to improve model generalization and robustness. Roy et al.⁹ proposed a DenseNet-121 backbone fortified with a Symmetry-aware Feature Attention (SaFA) module that uses wavelet-guided gradient-based fusion to take advantage of lesion boundary and symmetry information. The method achieved a competitive accuracy of 90.75%, though the absence of cross-dataset validation and external testing limits the assessment of the model’s generalizability beyond the HAM10000 dataset. Sofana Reka et al.¹⁰ explored quantum machine learning on HAM10000 with quantum CNNs and quantum support vector classifiers. Despite the use of classical pre-trained CNN models such as MobileNet and ResNet50, their approach attained a modest accuracy of 82.86%, hindered by poor generalization between visually similar classes and the inherently high computational demand of quantum approaches. Liu et al.¹¹ employed an ensemble approach with MobileNetV2, ResNet18, VGG11, and a new fine-tuning stacking model, SkinNet. Despite employing several architectures with weighted meta-learners, this approach achieved 86.78% accuracy, with class imbalance a major limitation. Saha et al.¹² employed YOLOv8 models for real-time skin lesion classification on HAM10000, achieving a maximum accuracy of 86.2% with YOLOv8x-cls. Despite employing data augmentation for over 30 epochs, the study’s limited consideration of newer architectures, such as Vision Transformers or Graph Convolutional Networks, suggests room for improvement.
Gururaj et al.¹³ proposed DeepSkin, which employed transfer learning using DenseNet169 and ResNet50 with pre-processing methods of oversampling, undersampling, and hair removal via autoencoder-decoder segmentation. This method achieved an accuracy of 91.2% but relied on conventional architectures with no significant custom developments or augmentation strategies. Azeem et al.¹⁴ proposed SkinLesNet, a novel four-layer CNN with a bespoke design for smartphone image data and geometric augmentation, trained with the Adam optimizer and ReLU activations. The network reported 90% accuracy on HAM10000 but may be prone to overfitting despite dropout layers due to its simple design. Wicaksana et al.¹⁵ employed YOLOv11 alongside VGG19 and ResNet50, with stratified splitting and data augmentation, on HAM10000. YOLOv11 achieved 84.74% accuracy but was outperformed by VGG19 and ResNet50 when classifying lesion classes that are visually similar, highlighting challenges with multi-class complexity. Liu et al.¹⁶ employed an ensemble learning approach with ResNet50, MobileNetV3, GhostNet, and PP-LCNet and an improved Grey Wolf Optimizer for adaptive weighting. The model achieved 88.8% accuracy but encountered high computational complexity caused by repeated model training. Bhowmick et al.¹⁷ proposed Dual Concatenated DenseNet with Attention Fusion (DCDAF), which concatenated multiple DenseNet blocks with attention mechanisms to improve generalization and achieved 90.24% accuracy. However, sole reliance on DenseNet variants can limit feature diversity, leaving scope to include architectures such as ResNet to provide more representative information. Sönmez et al.¹⁸ employed classic CNNs with transfer learning frameworks such as VGG, ResNet, and MobileNet on balanced subsets of the MNIST and HAM10000 datasets. With an overall accuracy of 80.79%, the authors noted that this may be insufficient for clinical application, citing generalization problems with real samples.

Studies¹⁹⁻²⁷ incorporated augmentation techniques on the ISIC2017-2020, HAM10000, and PH2 datasets with transfer learning, but did not explore ensemble methods.

Furthermore, recent studies by Das et al. have introduced robust ensemble frameworks to address class imbalance in skin lesion classification. In one study,²⁸ they designed a homogeneous EnsembleSVM model that utilizes SMOTE-TOMEK and target-specific augmentation (including rotation, flipping, and zooming) to balance the dataset. This approach employs a two-stage ensemble of Support Vector Machines, achieving a test accuracy of 98.2% on the HAM10000 dataset. In a subsequent work,²⁹ Das et al. proposed a hybrid ensemble learning model that combines homogeneous Adaptive Resonance Theory Mapping (ARTMAP) models with a heterogeneous Fuzzy Min–Max (FMM) classifier. By utilizing a rule-based NEFCLASS model for final classification and balancing the data via SMOTE, this method demonstrated a classification accuracy of 98.4%, proving the efficiency of improved ensemble learning strategies in medical image processing.

Dataset description

In this study, we used the publicly available HAM10000 (Human Against Machine with 10000 training images) dataset, which is hosted on the Harvard Dataverse repository³⁰ (Table 1). The dataset is widely used in skin lesion analysis and provides a multi-class set of dermatoscopic images covering a wide variety of pigmented skin lesions.

Table 1.

Overview of the HAM10000 dataset.

No. of images	Format	No. of classes	Source
10,015	JPG	7	Harvard Dataverse

The HAM10000 dataset comprises 10,015 dermatoscopic images in JPG format, collected from diverse sources to ensure variability in acquisition settings, subject demographics, and lesion properties. The dataset is subdivided into seven diagnostic classes.

• Melanoma (MEL): malignant skin cancer lesions with high rates of metastatic spread.

• Melanocytic nevus (NV): benign pigmented lesions which comprise the largest portion of the dataset.

• Basal cell carcinoma (BCC): invasive malignant lesions.

• Actinic keratosis (AK): early malignant lesions that have the potential to evolve into squamous cell carcinoma.

• Benign keratosis (BKL): includes seborrheic keratoses, solar lentigines, and lichen-planus–like keratoses.

• Dermatofibroma (DF): benign fibrous lesions.

• Vascular lesions (VASC): angiomas, angiokeratomas, pyogenic granulomas, and hemorrhages.

MEL, BCC, and AK are cancerous lesions, whereas NV, BKL, and DF are non-cancerous lesions; however, VASC lesions can be either cancerous or non-cancerous. The dataset also exhibits severe class imbalance: NV has 6,705 images, while DF (115 images) and VASC (142 images) are underrepresented (Table 2). Sample images of each class are shown in Figure 1.

Figure 1.

Sample images of different classes: (a) Actinic keratosis, (b) Basal Cell Carcinoma, (c) Benign keratosis, (d) Dermatofibroma, (e) Melanoma, (f) Nevus, (g) Vascular lesions.

Table 2.

Detailed distribution of the HAM10000 dataset.

Class	Number of images
Melanoma (MEL)	1,113
Melanocytic nevus (NV)	6,705
Basal cell carcinoma (BCC)	514
Actinic keratosis (AK)	327
Benign keratosis (BKL)	1,099
Dermatofibroma (DF)	115
Vascular lesions (VASC)	142
Total	10,015

Research methodology

Data preprocessing and augmentation of training sets

This study adopted a structured pipeline to preprocess and augment the dataset for training and evaluation. The raw dataset first underwent initial preprocessing, which cleaned and standardized it. To avoid data leakage and obtain an unbiased evaluation, we identified and separated all duplicate images in the dataset. From the remaining unique images, 10% were reserved exclusively as an Independent Test Set (IND_TEST). The remaining unique data were then stratified into training (70%), validation (15%), and testing (15%) subsets. Finally, the previously isolated duplicate images were merged solely into the training set to enrich the learning process without compromising the independence of the validation and test sets.
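The splitting procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the pipeline used in the study: duplicates are assumed to be images that share a lesion identifier, and the record layout `(image_id, lesion_id, label)` and the function names `stratified_split` and `build_splits` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(items, fractions, seed=0):
    """Split (image_id, label) pairs into len(fractions) parts, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in items:
        by_class[item[1]].append(item)
    parts = [[] for _ in fractions]
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        start = 0
        for i, frac in enumerate(fractions):
            # The last part takes the remainder so that no sample is dropped.
            if i == len(fractions) - 1:
                end = len(group)
            else:
                end = start + round(frac * len(group))
            parts[i].extend(group[start:end])
            start = end
    return parts


def build_splits(records, seed=0):
    """records: list of (image_id, lesion_id, label) tuples (assumed layout)."""
    lesion_counts = Counter(lesion for _, lesion, _ in records)
    # Assumption: an image is a duplicate if its lesion appears more than once.
    unique = [(img, lab) for img, les, lab in records if lesion_counts[les] == 1]
    duplicates = [(img, lab) for img, les, lab in records if lesion_counts[les] > 1]
    # Step 1: reserve 10% of the unique images as the independent test set.
    ind_test, rest = stratified_split(unique, [0.10, 0.90], seed)
    # Step 2: stratify the remaining unique images 70/15/15.
    train, val, test = stratified_split(rest, [0.70, 0.15, 0.15], seed)
    # Step 3: duplicates are added to the training set only.
    train = train + duplicates
    return {"train": train, "val": val, "test": test, "ind_test": ind_test}
```

Because duplicates never enter the validation, test, or independent test sets, no evaluation image shares a lesion with a training image under this assumption.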

To improve the robustness of the model and its ability to generalize, we employed various data augmentation techniques. Augmentation synthetically increases the dataset size by applying transformations, helping reduce overfitting and improve performance on new data. We tried four different augmentation techniques to find the optimal method: Prior Augmentation (PA), Posterior Augmentation (AP), No Augmentation (NA), and Only Train Set Augmentation (TA). They differ in when and how augmentation is performed, as shown in the flowcharts in Figure 2.

• Prior Augmentation (PA): This method applies augmentation to the entire dataset before it is split into test, validation, and training sets. This distributes augmented variants across all subsets, thereby enhancing initial data diversity. However, it can leak augmented variants of training images into the test and validation sets, leading to unrealistically optimistic performance results. The PA flowchart (Figure 2(a)) places the augmentation step before splitting, with potential downsampling or selection steps in between.

• Posterior Augmentation (AP): Here, augmentation is performed after splitting the dataset into training, validation, and test sets. Each subset is augmented independently, maintaining segregation but potentially leading to different transformations across sets. This method is useful when specific enhancements are needed for each set, though it risks compromising the independence of evaluation sets. As shown in Figure 2(b), augmentation is carried out after splitting, followed by potential test selection or downsampling.

• No Augmentation (NA): This is the baseline, with no augmentation at all. Preprocessing and dataset splitting operate on raw samples only, without synthetic expansion. Although NA is straightforward and preserves the original data distribution, it carries the risk of overfitting, especially for smaller datasets. The NA flowchart (Figure 2(c)) skips augmentation altogether and goes straight to splitting and set allocation, with optional downsampling for balance.

• Only Train Set Augmentation (TA): Augmentation is applied only to the training set after splitting, without affecting the validation and test sets. This approach maximizes diversity within the training data while leaving the evaluation sets free from augmented biases. TA is highly effective at preserving the validity of performance measures by preventing data leakage. As shown in Figure 2(d), augmentation is applied only to the training branch, while the validation and test sets are left unaltered; downsampling is used where applicable.

Figure 2.

Four augmentation techniques: (a) prior augmentation, (b) posterior augmentation, (c) No augmentation, (d) only train set augmentation.

We tested these four methods to identify the optimal augmentation strategy, balancing fairness with data enhancement. After comparing the methods, we adopted the Only Train Set Augmentation (TA) method. This choice keeps the independent test set completely unseen, so that the model’s generalization ability can be evaluated safely and fairly.
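As an illustration of the TA strategy, the sketch below oversamples minority classes in the training set only, leaving every other split untouched. It is a simplified stand-in under stated assumptions: the `augment` function is a hypothetical placeholder (a random horizontal flip on a nested-list image) for the rotation, shifting, zooming, and flipping transforms, and the `target_per_class` value is an assumption, since the per-class targets are not reported here.

```python
import random
from collections import Counter


def augment(image, rng):
    """Hypothetical placeholder transform: random horizontal flip of a 2-D list."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def augment_train_only(train, target_per_class, seed=0):
    """Add augmented copies to the training set until each class reaches the target.

    train: list of (image, label) pairs. Validation and test sets are never
    passed to this function, so they remain free of augmented samples.
    """
    rng = random.Random(seed)
    by_class = {}
    for image, label in train:
        by_class.setdefault(label, []).append(image)
    augmented = list(train)  # original samples are kept as they are
    for label, images in by_class.items():
        missing = max(0, target_per_class - len(images))
        for _ in range(missing):
            source = rng.choice(images)
            augmented.append((augment(source, rng), label))
    return augmented


# Example: a majority class with 5 images and a minority class with 2.
toy_train = [([[1, 2]], "nv")] * 5 + [([[3, 4]], "df")] * 2
balanced = augment_train_only(toy_train, target_per_class=5)
class_counts = Counter(label for _, label in balanced)
```

In this toy example both classes end up with five training samples, while the original seven samples are preserved unchanged at the start of the list.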

Architecture of proposed model

The proposed hybrid architecture (Figure 3) is a multi-level, multi-branch ensemble framework designed to enhance accuracy and robustness in skin lesion classification. The key aspect of this architecture is that it overcomes the limitations of single-model classification by combining the strengths of DenseNet and ResNet, leveraging complementary feature extraction and additional benefits from multiple attention mechanisms.

Figure 3.

Architecture of our proposed model.

The architecture consists of two parallel, independent primary branches: a DenseNet branch and a ResNet branch. Both are macro-architectures of three base models. The DenseNet branch uses DenseNet-121, DenseNet-169, and DenseNet-201, and the ResNet branch uses ResNet-50, ResNet-101, and ResNet-152. These are chosen for their well-documented image classification performance and for their different architectural components—DenseNet’s dense connectivity and ResNet’s residual connections—that provide an enriched feature set.

To enhance the discriminative capability of each base model, we introduce custom convolutional and attention blocks. As shown in Figure 4, these blocks consist of sequential Conv2D layers with varying kernel sizes followed by batch normalization, allowing the model to capture features at different spatial scales. Specifically, the architecture comprises eight blocks.

• Blocks 1–4: Conv2D layers with 128 filters of sizes 7 × 7, 5 × 5, 3 × 3, and 1 × 1, each followed by batch normalization.

• Blocks 5–8: Conv2D layers with 256 filters of sizes 7 × 7, 5 × 5, 3 × 3, and 1 × 1, each followed by batch normalization.

Figure 4.

A detailed view of the custom convolutional and attention block architecture.

By stacking Blocks 1–4 with lower filter depth and Blocks 5–8 with higher filter depth, the framework ensures that both low-level and high-level features are extracted effectively. These enriched feature representations are subsequently refined using three attention mechanisms—Channel Attention (CA), Squeeze-and-Excitation Attention (SEA), and Soft Attention (SA)—before being passed to the DenseNet and ResNet branches.

In these blocks, we merge three individual attention mechanisms:

• Channel Attention (CA): This component aims to highlight important features by computing channel attention weights based on the mean and standard deviation of the input feature maps.³¹ These attention weights are then applied to the input through element-wise multiplication, effectively emphasizing the significant channels. Let $x \in R^{C \times H \times W}$ denote the input feature maps, where C, H, and W represent the number of channels, height, and width, respectively. The channel-wise attention weights $w_{c} \in R^{C}$ are computed as:

w_{c} = σ (W_{2} δ (W_{1} x))

(1)

where W₁ and W₂ are learnable weight matrices, δ(⋅) is the ReLU activation function, and σ(⋅) is the sigmoid activation function.³² The enhanced feature maps y_c are then obtained as:

y_{c} = w_{c} ⊙ x

(2)

where ⊙ denotes element-wise multiplication.³³

• Squeeze-and-Excitation Attention (SEA): This component aims to capture channel dependencies by first squeezing the spatial dimensions of the input feature maps through global average pooling and then applying an excitation operation to learn channel attention weights.³⁴ Mathematically, the channel attention weights $s \in R^{C}$ are computed as:

\begin{align} z & = GlobalAveragePooling (x) \end{align}

(3)

\begin{align} s & = σ (W_{2} \, ReLU (W_{1} z)) \end{align}

(4)

where $W_{1} \in R^{C / r \times C}$ and $W_{2} \in R^{C \times C / r}$ are learnable weight matrices, and r is a reduction ratio. The enhanced feature maps y_s are then obtained as³⁵:

y_{s} = s ⊙ x

(5)

• Soft Attention (SA): This component assigns attention weights to each element of the input, effectively allowing the model to selectively focus on specific regions or features.³⁶ Let $e = {[e_{1}, e_{2}, \dots, e_{T}]}^{⊤} \in R^{T}$ be a vector of scalar scores associated with the T elements of the input sequence. The attention weights $α = {[α_{1}, α_{2}, \dots, α_{T}]}^{⊤} \in R^{T}$ are computed using a softmax function³⁷:

α_{i} = \frac{\exp (e_{i})}{\sum_{j = 1}^{T} \exp (e_{j})}

(6)
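The attention computations above can be made concrete with a small NumPy sketch. It covers the squeeze-and-excitation gating of equations (3) to (5) and the softmax weights of equation (6). The excitation step uses the standard squeeze-and-excitation form, a sigmoid applied after a ReLU bottleneck; the function names and the random weights in the example are hypothetical and are not taken from the trained model.

```python
import numpy as np


def sigmoid(a):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))


def se_attention(x, w1, w2):
    """Squeeze-and-excitation on feature maps x of shape (C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is the
    reduction ratio.
    """
    z = x.mean(axis=(1, 2))                     # squeeze: global average pooling
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # excitation: ReLU, then sigmoid
    return s[:, None, None] * x                 # rescale each channel by its weight


def soft_attention_weights(e):
    """Softmax over a vector of scores e, as in equation (6)."""
    e = np.asarray(e, dtype=float)
    shifted = np.exp(e - e.max())               # subtract the max for stability
    return shifted / shifted.sum()


# Example with C = 4 channels, a 3 x 3 spatial grid, and reduction ratio r = 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 3))
w1 = rng.normal(size=(2, 4))
w2 = rng.normal(size=(4, 2))
y = se_attention(x, w1, w2)
alpha = soft_attention_weights([1.0, 2.0, 3.0])
```

Because each channel weight lies between 0 and 1, the output `y` has the same shape as `x` with every channel scaled down by its learned importance, and the entries of `alpha` are positive and sum to one.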

After attention-integrated feature extraction, the three DenseNet base model outputs are concatenated and flattened into a common feature vector. Similarly, the same procedure is applied to the three ResNet base models. The dual concatenation strategy is a prominent feature of our approach, as it leverages a broad range of features from the shallow and deep layers of different architectures.

The final step of the framework is an ensemble learning approach. Concatenated DenseNet and ResNet feature vectors are fed into their respective fully connected layers for initial classification. The final prediction is a weighted average of the two branch outputs. This multi-branch ensembling technique, which combines features at one level and predictions at the other, further improves the overall strength, generalization capability, and final classification accuracy of the model by compensating for the weaknesses of any single branch.

The weighted ensemble coefficients were determined through a systematic trial-and-error process. First, the individual prediction accuracy of each branch was evaluated independently. Based on these evaluations, higher weights were assigned to branches yielding superior predictive accuracy, while lower weights were assigned to those with comparatively weaker performance. Subsequently, several weight combinations were explored iteratively, and the configuration that produced the highest overall classification performance was selected as the optimal weighting scheme for the final ensemble.
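The weighted ensemble and the search over weight combinations can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used in the study: the two weights are assumed to sum to one, the candidate grid is hypothetical, and the toy probabilities stand in for branch outputs on a held-out set.

```python
import numpy as np


def weighted_ensemble(p_dense, p_res, w_dense):
    """Weighted average of two (N, K) arrays of class probabilities."""
    return w_dense * p_dense + (1.0 - w_dense) * p_res


def search_weight(p_dense, p_res, labels, candidates):
    """Return the candidate DenseNet weight with the highest accuracy."""
    best_w, best_acc = None, -1.0
    for w in candidates:
        preds = weighted_ensemble(p_dense, p_res, w).argmax(axis=1)
        acc = float(np.mean(preds == labels))
        if acc > best_acc:  # ties keep the first candidate found
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy example: each branch alone misclassifies some samples,
# but a weighted combination classifies all three correctly.
labels = np.array([0, 1, 0])
p_dense = np.array([[0.90, 0.10], [0.60, 0.40], [0.70, 0.30]])
p_res = np.array([[0.40, 0.60], [0.10, 0.90], [0.45, 0.55]])
best_w, best_acc = search_weight(p_dense, p_res, labels, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Selecting the weight on validation data rather than on the test set keeps the reported test accuracy unbiased; which split was used for this search is not stated in the text.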

Justification of our proposed architecture

The development of our proposed hybrid framework is driven by the need to address several persistent problems in skin lesion classification, including class imbalance, overfitting, and limitations of transfer learning, as highlighted in the existing literature. The expanded framework, which integrates both DenseNet and ResNet variants, provides a robust and comprehensive solution. The key justifications for our design choices are as follows.

• Leveraging Complementary Architectures: Our framework revolves around the strategic use of DenseNet and ResNet in combination. DenseNet architectures possess a dense network of connections, which allows feature reuse and attenuates the vanishing gradient problem. They are especially good at extracting a strong set of low-level and high-level features. ResNet architectures, due to residual connections, are ideally suitable for training very deep models without sacrificing performance. By utilizing parallel DenseNet and ResNet branches, our model can benefit from two highly different and efficient feature extraction paradigms. The two-branch approach produces a more diverse variety of features, which is necessary to differentiate visually subtle and challenging skin lesions.

• Enhanced Feature Selection via Concatenated Attention Mechanisms: A major drawback of most single CNN-based models is their inability to distinguish between clinically relevant lesion characteristics and non-relevant background information. To overcome this limitation, we develop our framework using a hierarchical integration approach. At the first level, we concatenate the feature maps of three different attention modules (Channel Attention, Squeeze-and-Excitation Attention and Soft Attention) to increase the feature selection capacity of each base model. This initial fusion makes each individual model highly selective, ensuring it focuses on critical features.

• Improved Generalization through Multi-level Ensembling: Generalization to unseen data is a major challenge, especially with imbalanced datasets. In this work, we tackle this issue through a robust, multi-level ensembling approach. At the second level of fusion, we concatenate the outputs of the three attention-augmented DenseNet models (DN121, DN169, and DN201) to create a single, powerful DenseNet branch. A similar concatenation is performed with the three attention-augmented ResNet models (RN50, RN101, and RN152). This systematic integration allows us to fuse more information than any single model could provide. The final prediction is generated using a weighted-average ensemble of these two concatenated branches. This final integration draws on two distinct architectural families and makes the resulting model more resistant to noise and less prone to overfitting, thus improving its ability to generalize to new data.

To summarize, this architecture does not rely on a single model; it explicitly leverages multiple attention-integrated CNNs in conjunction with an ensembling approach for robust classification. Our overall aim is to present a multi-pronged framework that provides a more accurate, robust, and generalized solution for skin lesion classification.

Performance evaluation measures

To assess the reliability and effectiveness of the proposed framework, a number of well-known performance measures were used. These measures are derived from the elements of the confusion matrix.³⁸ The confusion matrix is composed of the following.

• True Positives (TP): Positive instances predicted correctly.

• True Negatives (TN): Negative instances predicted correctly.

• False Positives (FP): Negative instances incorrectly predicted as positive.

• False Negatives (FN): Positive instances incorrectly predicted as negative.³⁹

The performance metrics are defined as follows:

1. Accuracy:

\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}

(7)

2. Precision (Positive Predictive Value):

\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}

(8)

3. Recall (Sensitivity/True Positive Rate):

\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}

(9)

4. Specificity (True Negative Rate):

\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}

(10)

5. F1-Score:

\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(11)

6. Receiver Operating Characteristic – Area Under Curve (ROC–AUC):

The ROC–AUC is computed by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) across varying thresholds.⁴⁰

\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}

(12)

The ROC–AUC quantifies the model’s discriminative ability; a value closer to 1 indicates better separability between classes.⁴¹
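For a multi-class problem such as HAM10000, the TP, TN, FP, and FN counts are typically obtained per class in a one-vs-rest fashion. The sketch below computes the threshold-free metrics defined above from a confusion matrix. The one-vs-rest treatment and the function names are assumptions made for illustration; the text does not state how per-class values were aggregated.

```python
def per_class_metrics(cm):
    """Compute one-vs-rest metrics from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Returns one dictionary of metrics per class.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp    # other classes predicted as k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": f1,
        })
    return results


def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


# Example with two classes: rows are true labels, columns are predictions.
toy_cm = [[8, 2], [1, 9]]
toy_metrics = per_class_metrics(toy_cm)
toy_accuracy = overall_accuracy(toy_cm)
```

Reporting recall and specificity per class exposes weak minority-class performance that a single accuracy figure can hide on an imbalanced dataset.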

Experimental setup

The model was developed entirely in a Kaggle Notebook on an NVIDIA Tesla P100 GPU running at 1327 MHz. After the images were prepared at an input size of 224 × 224 × 3, the dataset was divided into three sets: training set (70%), validation set (15%), and test set (15%).

During the training process, we used a batch size of 16 and the Adam optimizer with a learning rate of 0.001 and an epsilon of 0.1. The loss function used was categorical cross-entropy, along with early stopping to prevent overfitting. To further help with generalization, we augmented the training set with rotation, shifting, zooming, and flipping. All input images were resized to 224 × 224 × 3 pixels and normalized before entering the models.
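The training settings and the early-stopping rule can be summarized in a short, framework-independent sketch. The `patience` and `min_delta` values are hypothetical, since they are not reported in the text, and the `EarlyStopping` class below is an illustration of the general technique rather than the callback used in the experiments.

```python
# Hyperparameters reported in the text; patience and min_delta are assumptions.
TRAIN_CONFIG = {
    "batch_size": 16,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "epsilon": 0.1,
    "loss": "categorical_crossentropy",
    "input_shape": (224, 224, 3),
    "patience": 10,     # hypothetical value
    "min_delta": 0.0,   # hypothetical value
}


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # improvement: remember it and reset the counter
            self.wait = 0
        else:
            self.wait += 1         # no improvement this epoch
        return self.wait >= self.patience
```

A training loop would call `step` once per epoch with the current validation loss and break out of the loop as soon as it returns `True`.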

The experimental settings of the various models, summarized in Table 3, include batch size, attention variant, total number of parameters, and training epochs. In the concatenated models, we integrated three attention modules, namely Channel Attention (CA), Soft Attention (SA), and Squeeze-and-Excitation Attention (SEA), to improve feature extraction.

Table 3.

Experimental configuration of models with batch size, attention variant, parameters, and training epochs.

Model	Batch size	Attention variant	Total parameters	Epochs
DN 121	16	–	12,625,735	250
DN 169	16	–	17,582,663	250
DN 201	16	–	23,286,855	219
Concatenated DN 121	16	CA+SA+SEA	33,376,329	215
Concatenated DN 169	8	CA+SA+SEA	36,249,929	250
Concatenated DN 201	16	CA+SA+SEA	42,004,297	250
Dual Concatenated DenseNet	8	CA+SA+SEA	42,004,297	244
RN 50	16	–	29,201,031	250
RN 101	16	–	48,271,495	229
RN 152	16	–	63,984,263	164
Concatenated RN 50	16	CA+SA+SEA	50,001,801	250
Concatenated RN 101	16	CA+SA+SEA	69,072,265	250
Concatenated RN 152	16	CA+SA+SEA	84,785,033	250
Dual Concatenated ResNet	8	CA+SA+SEA	85,090,889	265
Proposed Model	8	CA+SA+SEA	–	–

Table 4 provides a comparative analysis of the proposed ensemble’s computational efficiency against lightweight and standard deep learning architectures. While lightweight models like MobileNetV3 offer the lowest resource footprint, they often sacrifice diagnostic sensitivity—a critical factor in dermatological screening. Notably, our proposed ensemble, with 85.10 million parameters and an inference time of 69 ms per image on a Tesla P100, demonstrates superior efficiency compared to standard models like VGG16, Xception, and InceptionV3. Despite its multi-backbone nature, the optimized attention fusion and dual-branch architecture result in a lower GFLOP count (20.60) and a more compact disk footprint (∼968 MB) than many single-stream standard networks. This data highlights that our model achieves high-precision classification without the prohibitive computational costs typically associated with large-scale ensembles, making it a viable candidate for cloud-integrated clinical diagnostic systems.

Table 4.

Comprehensive Analysis of Computational Complexity, Memory Footprint, and Inference Speed: Proposed Ensemble vs. Standard Architectures.

Model architecture	Parameters (M)	Disk space (MB)	GFLOPs (approx.)	Inference time (ms/img)	Platform
Lightweight Models
MobileNetV2	23.50	∼140	5.30	26	Tesla P100
MobileNetV3-Small	12.50	∼103	4.06	21	Tesla P100
EfficientNet-B0	45.30	∼210	9.39	33	Tesla P100
Standard Models
VGG16	138.35	∼1528	35.30	88	Tesla P100
Xception	92.85	∼1088	24.47	75	Tesla P100
InceptionV3	83.85	∼992	22.69	71	Tesla P100
Proposed Ensemble	85.10	∼968	20.60	69	Tesla P100

Note. Values for standard models are based on the TensorFlow implementation for 224 × 224 (or 299 × 299 for Inception/Xception) input sizes.

This experimental design ensures a balanced evaluation across different model variants and highlights the comparative performance of the proposed ensemble against individual architectures.

To ensure the reliability and reproducibility of the reported metrics, we performed 5 independent training runs using different random seed initializations. The results presented in “Experimental Result Analysis” Section represent the mean values across these iterations. This approach accounts for the stochastic nature of weight initialization and data shuffling, confirming that the ensemble’s performance is consistent.

Experimental Result Analysis

Selection of augmentation strategy

Data augmentation is a key element in alleviating the extreme imbalance of classes and improving generalization performance in skin lesion classification. In this study, we explored four augmentation methods: Prior Augmentation (PA), Posterior Augmentation (AP), No Augmentation (NA), and Only Train Set Augmentation (TA), illustrated in Figure 2. To quantitatively justify the selection of the optimal strategy, we conducted a comparative analysis across six backbone models (DN201, DN169, DN121, RN50, RN152, RN101) on both the internal and independent test sets. The results are summarized in Table 5.

Table 5.

Quantitative comparison of augmentation strategies across different backbones on Internal and Independent Test sets (%).

Model	Internal test set accuracy (%)				Independent test set accuracy (%)
Model	NA	TA	PA	AP	NA	TA	PA	AP
DN 201	87.20	90.02	98.12	82.23	90.40	92.03	90.70	91.18
DN 169	86.55	90.13	97.85	80.24	91.42	92.02	90.70	92.63
DN 121	88.18	89.48	84.83	81.39	90.70	90.34	88.65	90.82
RN 50	85.90	85.79	96.74	79.36	89.37	87.79	91.30	89.85
RN 152	86.98	87.42	96.96	80.73	89.97	90.70	92.03	90.70
RN 101	85.90	87.74	96.93	80.48	90.34	91.30	90.46	91.18

Table 5 quantitatively supports the choice of the Only Train Set Augmentation (TA) strategy. Prior Augmentation (PA) produces unusually high internal test accuracy for most models (e.g., 98.12% for DN201 and 97.85% for DN169), yet this performance does not carry over proportionally to the independent test set.

On the other hand, the No Augmentation (NA) approach generally yields lower internal test accuracy (e.g., 87.20% for DN201), suggesting that the models generalize less well without the variety that augmentation provides. Posterior Augmentation (AP) yields inconsistent results and markedly lower internal test accuracy (e.g., 80.24% for DN169).

TA is the most balanced and robust approach. It attains high internal test accuracy (e.g., 90.02% on DN201) without the artificial inflation typical of PA, and it performs well on the independent test set (e.g., 92.03% on DN201). Because TA augments only the training data, it enhances intra-class variability and reduces overfitting without affecting test set integrity.

Thus, TA was implemented as the best augmentation approach for all further experiments in this study.

Comparison with different models

In this subsection, we compare the performance of our proposed hybrid architecture with baselines, including individual DenseNet and ResNet variants, concatenated models, and dual concatenated ensembles. The evaluation is conducted on both the test set and an independent test set, using key performance metrics such as accuracy, precision, recall, F1-score, and specificity. The results are summarized in Tables 6 and 7.

Table 6.

Performance comparison on the Internal Test set (%).

Model	Accuracy	Precision	Recall	F1-score	Specificity
DN 121	89.48	89.70	89.48	89.16	90.84
DN 169	90.13	90.19	90.13	89.80	91.18
DN 201	90.02	90.03	90.02	89.84	90.87
Concatenated DN 121	88.50	88.72	88.50	88.46	93.88
Concatenated DN 169	88.50	88.19	88.50	88.07	88.45
Concatenated DN 201	88.18	87.82	88.18	87.81	89.76
Dual Concatenated DenseNet	90.24	90.10	90.24	89.97	90.53
RN 50	85.79	85.98	85.79	84.72	84.24
RN 101	87.74	87.46	87.74	87.35	89.35
RN 152	87.42	87.39	87.42	86.85	85.96
Concatenated RN 50	85.25	84.73	85.25	84.67	86.90
Concatenated RN 101	86.88	87.17	86.88	86.64	90.35
Concatenated RN 152	86.44	86.61	86.44	86.43	91.14
Dual Concatenated ResNet	87.53	87.39	87.53	87.34	89.43
Proposed Model	91.43	91.36	91.43	91.26	92.04

Table 7.

Performance comparison on the Independent Test set (%).

Model	Accuracy	Precision	Recall	F1-score	Specificity
DN 121	90.34	90.22	90.34	89.85	85.56
DN 169	92.02	91.72	92.03	91.61	86.61
DN 201	92.03	91.96	92.03	91.67	84.24
Concatenated DN 121	90.82	91.55	90.82	91.01	91.85
Concatenated DN 169	91.55	91.17	91.55	91.24	84.71
Concatenated DN 201	91.55	91.19	91.55	91.27	91.55
Dual Concatenated DenseNet	90.82	90.37	90.82	90.21	83.64
RN 50	87.80	87.68	87.80	87.35	86.02
RN 101	91.30	91.23	91.30	90.98	84.67
RN 152	90.70	90.07	90.70	90.01	77.97
Concatenated RN 50	88.77	86.66	88.77	87.51	77.85
Concatenated RN 101	90.34	89.76	90.34	89.86	82.68
Concatenated RN 152	91.18	91.17	91.18	91.10	88.02
Dual Concatenated ResNet	90.10	89.96	90.10	89.91	87.02
Proposed Model	90.82	90.36	90.82	90.21	83.64

On the Internal Test set, the proposed model achieves the highest accuracy (91.43%), precision, recall, and F1-score among all individual and concatenated counterparts; only Concatenated DN 121 reports a higher specificity (93.88% versus 92.04%). On the Independent Test set, some individual models such as DN 169 and DN 201 are slightly better on accuracy (92.02% and 92.03%, respectively), while the proposed model reaches 90.82%.
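In Tables 6 and 7 the Recall column coincides with Accuracy for nearly every model. That is the pattern produced by support-weighted averaging of per-class recall, because the weighted sum reduces to correct predictions divided by total samples. The sketch below illustrates this identity. It assumes support-weighted averaging (the averaging scheme is not stated in this section) and uses a made-up three-class confusion matrix, not data from the study.

```python
def weighted_recall_and_accuracy(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total

    weighted_recall = 0.0
    for i, row in enumerate(confusion):
        support = sum(row)                      # samples whose true class is i
        recall_i = row[i] / support if support else 0.0
        weighted_recall += (support / total) * recall_i
    return accuracy, weighted_recall


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [
    [50, 5, 5],
    [4, 20, 6],
    [1, 2, 7],
]
toy_accuracy, toy_weighted_recall = weighted_recall_and_accuracy(toy_confusion)
```

Under this assumption the two quantities are algebraically identical, so a gap between them in a results table would point to a different averaging scheme such as macro averaging.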

Performance of Dual Concatenated DenseNet

The accuracy and loss curves (Figure 5) indicate stable convergence after approximately 150 epochs, with validation accuracy closely tracking training accuracy, suggesting minimal overfitting. The confusion matrix (Figure 6) reveals high true positive rates for dominant classes like ‘nv’ (melanocytic nevi) but some misclassifications in minority classes such as ‘df’ (dermatofibroma). The ROC curves (Figure 7) demonstrate excellent discriminative ability, with AUC values ranging from 0.945 to 0.987 across classes.

Figure 5.

Model accuracy and loss curves for Dual Concatenated DenseNet.

Figure 6.

Confusion matrix for Dual Concatenated DenseNet.

Figure 7.

ROC curves for Dual Concatenated DenseNet across lesion classes.

This model shows improved feature extraction compared to individual DenseNet variants, benefiting from the dense connectivity and attention focus on relevant lesion characteristics.

Performance of Dual Concatenated ResNet

The training curves display consistent improvements in accuracy and a decrease in loss, with validation metrics levelling off around the 200th epoch (Figure 8). The confusion matrix (Figure 9) demonstrates adequate classification for classes like ‘bkl’ (benign keratosis-like) and ‘mel’ (melanoma), though some overlap is observed between ‘akiec’ (actinic keratoses) and ‘bcc’ (basal cell carcinoma). The ROC–AUC values are robust (Figure 10), averaging above 0.95, indicating dependable binary classifications within a multi-class context.

Figure 8.

Model accuracy and loss curves for Dual Concatenated ResNet.

Figure 9.

Confusion matrix for Dual Concatenated ResNet.

Figure 10.

ROC curves for Dual Concatenated ResNet across lesion classes.

This architecture utilizes residual connections to address the issue of vanishing gradients, leading to improved performance compared to standalone ResNet models, especially in deeper networks.

Performance of proposed model

The proposed hybrid ensemble, integrating Dual Concatenated DenseNet and ResNet with attention mechanisms, achieves the highest performance on the Internal Test set with the following metrics: 91.43% accuracy, 91.36% precision, 91.43% recall, 91.26% F1-score, and 92.04% specificity.

The model’s training demonstrates rapid convergence, with low validation loss indicating strong generalization. The confusion matrix (Figure 11) shows excellent diagonal dominance, with ‘nv’ correctly classified in 585 instances and minimal errors in rare classes like ‘vasc’ (vascular lesions). ROC curves from component models exhibit near-perfect AUCs (e.g., 0.987 for ‘bcc’, 0.982 for ‘nv’), underscoring its diagnostic reliability.

Figure 11.

Confusion matrix for the Proposed Model.

Overall, the ensemble strategy enhances robustness, outperforming component models by fusing complementary features from DenseNet’s dense blocks and ResNet’s residuals.

Class-wise accuracy analysis

To provide a more detailed assessment of the proposed model while accounting for the class imbalance inherent in the HAM10000 dataset, we conducted a class-wise performance analysis based on the confusion matrix (Figure 11). By comparing the number of correctly and incorrectly classified samples for each lesion category, the corresponding class-wise accuracy values were computed. The detailed results are presented as follows.

• Actinic keratosis (akiec): 20 correctly classified, 11 incorrectly classified (Accuracy: 64.52%).

• Basal cell carcinoma (bcc): 43 correctly classified, 6 incorrectly classified (Accuracy: 87.76%).

• Benign keratosis (bkl): 93 correctly classified, 11 incorrectly classified (Accuracy: 89.42%).

• Dermatofibroma (df): 8 correctly classified, 3 incorrectly classified (Accuracy: 72.73%).

• Melanoma (mel): 81 correctly classified, 27 incorrectly classified (Accuracy: 75.00%).

• Melanocytic nevus (nv): 585 correctly classified, 20 incorrectly classified (Accuracy: 96.69%).

• Vascular lesions (vasc): 13 correctly classified, 1 incorrectly classified (Accuracy: 92.86%).

This class-wise evaluation indicates that the proposed model achieves high true positive rates for ‘nv’, the dominant class, and for ‘vasc’, ‘bkl’, and ‘bcc’, while accuracy is lower for the more challenging classes ‘akiec’ (64.52%), ‘df’ (72.73%), and ‘mel’ (75.00%). These results suggest that the applied training set augmentation and attention-based feature fusion strategies are effective for most classes, although the rarest and most visually ambiguous categories remain difficult.
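The class-wise figures above can be recomputed directly from the correct and incorrect counts. The sketch below does this. The melanoma entry uses 81 correct samples, the count that is consistent with the stated 75% accuracy and with the overall accuracy of 91.43%. The variable and function names are illustrative only.

```python
# (correct, incorrect) counts per lesion class, as read from the list above.
class_counts = {
    "akiec": (20, 11),
    "bcc": (43, 6),
    "bkl": (93, 11),
    "df": (8, 3),
    "mel": (81, 27),   # 81 correct is the value consistent with the stated 75%
    "nv": (585, 20),
    "vasc": (13, 1),
}


def class_accuracy(correct, incorrect):
    """Percentage of samples of one class that were classified correctly."""
    return 100.0 * correct / (correct + incorrect)


per_class = {
    name: round(class_accuracy(correct, incorrect), 2)
    for name, (correct, incorrect) in class_counts.items()
}

total_correct = sum(correct for correct, _ in class_counts.values())
total_samples = sum(correct + incorrect for correct, incorrect in class_counts.values())
overall_accuracy = round(100.0 * total_correct / total_samples, 2)
```

Summing the counts gives 843 correct out of 922 samples, which reproduces the overall internal test accuracy reported for the proposed model.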

Ablation study

To systematically evaluate the contribution of individual components within our proposed framework, we conducted an ablation study. This analysis isolates the performance of the base architectures and the intermediate concatenated modules to quantify the gains achieved through our attention mechanisms and dual concatenation strategy.

Performance of the base models

First, we evaluated the standalone performance of the six base backbone models (DenseNet-121, DenseNet-169, DenseNet-201, ResNet-50, ResNet-101, and ResNet-152) without any custom attention blocks or ensemble fusion. The results on the internal Test Set and the Independent Test Set are presented in Tables 8 and 9, respectively.

Table 8.

Performance of base models on the internal test set (%).

Model	Accuracy	Precision	Recall	F1-score	Specificity
DN 121	89.48	89.70	89.48	89.16	90.84
DN 169	90.13	90.19	90.13	89.80	91.18
DN 201	90.02	90.03	90.02	89.84	90.87
RN 50	85.79	85.98	85.79	84.72	84.24
RN 101	87.74	87.46	87.74	87.35	89.35
RN 152	87.42	87.39	87.42	86.85	85.96

Table 9.

Performance of base models on the independent test set (%).

Model	Accuracy	Precision	Recall	F1-score	Specificity
DN 121	90.34	90.22	90.34	89.85	85.56
DN 169	92.02	91.72	92.03	91.61	86.61
DN 201	92.03	91.96	92.03	91.67	84.24
RN 50	87.80	87.68	87.80	87.35	86.02
RN 101	91.30	91.23	91.30	90.98	84.67
RN 152	90.70	90.07	90.70	90.01	77.97

Performances without dual concatenation

Next, we examined the performance of the intermediate variants where the base models are equipped with attention mechanisms (CA, SA, SEA) and concatenated within their respective families (DenseNet or ResNet), but without the final dual concatenation step. These results, shown in Tables 10 and 11, demonstrate the impact of the attention-boosted feature fusion before the final ensemble.

Table 10.

Performance of concatenated variants on the internal test set (%).

Model	Accuracy	Precision	Recall	F1-score	Specificity
Concatenated DN 121	88.50	88.72	88.50	88.46	93.88
Concatenated DN 169	88.50	88.19	88.50	88.07	88.45
Concatenated DN 201	88.18	87.82	88.18	87.81	89.76
Concatenated RN 50	85.25	84.73	85.25	84.67	86.90
Concatenated RN 101	86.88	87.17	86.88	86.64	90.35
Concatenated RN 152	86.44	86.61	86.44	86.43	91.14

Table 11.

Performance of concatenated variants on the independent test set (%).

Model	Accuracy	Precision	Recall	F1-score	Specificity
Concatenated DN 121	90.82	91.55	90.82	91.01	91.85
Concatenated DN 169	91.55	91.17	91.55	91.24	84.71
Concatenated DN 201	91.55	91.19	91.55	91.27	91.55
Concatenated RN 50	88.77	86.66	88.77	87.51	77.85
Concatenated RN 101	90.34	89.76	90.34	89.86	82.68
Concatenated RN 152	91.18	91.17	91.18	91.10	88.02

Comparison of ensemble techniques

To further validate our choice of ensemble strategy, we compared our proposed weighted-averaging ensemble against other standard ensemble techniques, including Majority Voting, BMA Averaging, and Softmax Averaging. The evaluation was conducted on the internal test set to isolate the impact of the fusion method. As shown in Table 12, our weighted-averaging approach consistently outperforms the traditional methods.

Table 12.

Performance comparison with other ensemble techniques on the Internal Test Set (%).

Ensemble technique	Accuracy	Precision	Recall	F1-score	Specificity
Majority Voting	89.52	89.41	89.52	89.38	90.15
BMA Averaging	89.85	89.76	89.85	89.72	90.40
Softmax Averaging	90.21	90.15	90.21	90.08	90.85
Proposed Model (Weighted Averaging)	91.43	91.36	91.43	91.26	92.04
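
The fusion rules compared in Table 12 differ only in how per-model class probabilities are combined. The following is a minimal sketch of majority voting, unweighted softmax averaging, and weighted averaging on toy probabilities. The probabilities and weights are hypothetical, and the way the proposed model derives its weights is not described in this section, so this should not be read as the authors' implementation.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def majority_vote(prob_lists):
    """Each model votes for its top class; the most frequent vote wins."""
    votes = [argmax(probs) for probs in prob_lists]
    return Counter(votes).most_common(1)[0][0]


def weighted_average(prob_lists, weights):
    """Fuse class probabilities with normalized per-model weights."""
    total_weight = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[k] for w, probs in zip(weights, prob_lists)) / total_weight
        for k in range(n_classes)
    ]


def softmax_average(prob_lists):
    """Unweighted mean of the softmax outputs (equal weight per model)."""
    return weighted_average(prob_lists, [1.0] * len(prob_lists))


# Hypothetical softmax outputs of three models for one image (three classes).
toy_probs = [
    [0.50, 0.45, 0.05],  # model A: narrowly prefers class 0
    [0.48, 0.47, 0.05],  # model B: narrowly prefers class 0
    [0.05, 0.90, 0.05],  # model C: confidently prefers class 1
]
toy_weights = [0.2, 0.2, 0.6]  # hypothetical weights, e.g. from validation scores

vote_label = majority_vote(toy_probs)
mean_label = argmax(softmax_average(toy_probs))
weighted_probs = weighted_average(toy_probs, toy_weights)
weighted_label = argmax(weighted_probs)
```

In this toy case majority voting follows the two weakly confident models, while both averaging rules let the confident model decide; weighted averaging additionally lets a model's assigned weight scale its influence.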

Answers to the research questions

Answer to RQ1

To address the severe class imbalance in the HAM10000 dataset, we investigated four augmentation strategies: Prior Augmentation (PA), Posterior Augmentation (AP), No Augmentation (NA), and Only Train Set Augmentation (TA). Our findings showed that augmenting only the training set (TA) gave the best balance. This approach strengthened the minority classes while leaving the validation and test sets untouched, which avoided data leakage and allowed a fair evaluation of how well the model generalizes. As a result, we adopted TA as our primary approach to addressing class imbalance in this study.

Answer to RQ2

The experimental results indicate that the targeted augmentation of the training set—which includes techniques such as rotation, shifting, zooming, and flipping—significantly enhanced the robustness of the model. This augmentation increased intra-class diversity, enabling the model to identify discriminative patterns, even within underrepresented classes. The performance observed on the test set, which yielded an accuracy of 91.43% and a specificity of 92.04%, demonstrates that the augmented training effectively mitigated overfitting by enabling the model to generalize well to previously unseen images. Additionally, the consistent stability of accuracy and loss curves throughout the epochs further substantiates the efficacy of the augmentation strategy in minimizing bias and variance.

Answer to RQ3

Our experiments confirmed that multi-level ensembling with attention mechanisms strongly enhanced generalization relative to single architectures. Although performance for individual DenseNet (DN121, DN169, DN201) and ResNet (RN50, RN101, RN152) models was strong in isolation, their performance varied across datasets. The hybrid ensemble, which merged Dual Concatenated DenseNet and Dual Concatenated ResNet with integrated attention mechanisms, performed better on the test set than the individual models. The ensemble’s ability to merge the complementary strengths of DenseNet’s dense connections and ResNet’s residual learning improved feature diversity and reduced sensitivity to overfitting. This was clearly demonstrated by the improved balance among performance metrics and near-perfect ROC–AUC scores across classes, showcasing the ensemble’s stability for real-world diagnostic use.

Theoretical justification of novelty claims

To substantiate the contributions listed above, we provide the following theoretical intuitions:

Justification for NC1

Addressing class imbalance via augmentation requires a theoretically sound strategy. No Augmentation fails because the model becomes biased toward the majority class. Prior Augmentation, where the entire dataset is augmented before splitting, inevitably causes data leakage: variations of the same image may appear in both the training and testing sets, leading to inflated, unrealistic accuracy. Posterior Augmentation on the test set is invalid because test data must remain pristine to represent unseen real-world samples. Of the four strategies, only Train Set Augmentation (TA) lets the model learn from a balanced distribution while being evaluated on an independent, unaltered distribution. The reported performance therefore reflects generalization rather than memorization.
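
The leakage argument can be made concrete with a small simulation. The sketch below uses integer IDs in place of images and a stand-in `augment` function that only records which source image each variant came from; the dataset size, number of copies, and split fraction are arbitrary illustrative choices, not the settings used in this study.

```python
import random


def augment(image_ids, copies=3):
    """Stand-in for rotation/shift/zoom/flip: each variant keeps its source id."""
    return [(image_id, k) for image_id in image_ids for k in range(copies)]


def split(items, test_fraction, rng):
    """Shuffle and split into (train, test)."""
    items = list(items)
    rng.shuffle(items)
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]


def leaked_sources(train, test):
    """Source images that contribute samples to both train and test."""
    return {src for src, _ in train} & {src for src, _ in test}


rng = random.Random(0)
image_ids = range(100)

# Prior Augmentation (PA): augment everything, then split.
pa_train, pa_test = split(augment(image_ids), 0.2, rng)

# Only Train Set Augmentation (TA): split first, then augment the training part.
ta_train_ids, ta_test_ids = split(image_ids, 0.2, rng)
ta_train = augment(ta_train_ids)
ta_test = [(image_id, 0) for image_id in ta_test_ids]

pa_leaks = leaked_sources(pa_train, pa_test)
ta_leaks = leaked_sources(ta_train, ta_test)
```

With PA, variants of the same source image end up on both sides of the split, so `pa_leaks` is non-empty; with TA the two sets share no source image by construction.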

Justification for NC2

Standard Convolutional Neural Networks (CNNs) weight all spatial locations and feature channels uniformly, which often lets noise interfere. We integrate three distinct attention mechanisms to mimic human visual focus. Channel Attention reweights the feature-map channels, emphasizing those that carry the most diagnostic information. Squeeze-and-Excitation (SEA) “squeezes” each feature map into a single channel descriptor through global pooling and then “excites” the channels by rescaling them with learned gating weights, amplifying informative channels and suppressing less relevant ones. Soft Attention applies a differentiable, normalized mask over spatial positions to emphasize informative feature regions. The integration of these mechanisms encourages the model to focus on lesion morphology rather than skin artifacts.
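
As an illustration of the squeeze-and-excitation idea, the sketch below implements the standard formulation (global average pooling, a two-layer bottleneck with ReLU and sigmoid, then channel-wise rescaling) in plain Python. The feature map and the weights are hypothetical toy values, and this is not the authors' layer definition; in practice the weights are learned during training.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one scalar descriptor per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def dense(vector, weights, bias):
    """Fully connected layer: weights has one row per output unit."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def excite(descriptor, w1, b1, w2, b2):
    """Bottleneck (ReLU) followed by a sigmoid gate per channel."""
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    return [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]


def se_block(feature_map, w1, b1, w2, b2):
    """Rescale every channel by its gate; returns (scaled map, gates)."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    scaled = [
        [[gate * x for x in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]
    return scaled, gates


# Toy feature map: 2 channels of size 2 x 2 (hypothetical values).
toy_map = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
# Hypothetical weights: 2 channels -> 1 hidden unit -> 2 channels.
w1, b1 = [[0.5, 0.25]], [0.0]
w2, b2 = [[-1.0], [1.0]], [0.0, 0.0]

descriptor = squeeze(toy_map)
scaled_map, gates = se_block(toy_map, w1, b1, w2, b2)
```

Here the gate for the first channel falls below 0.5 and the gate for the second rises above it, so the second channel is emphasized relative to the first while the spatial layout of each channel is unchanged.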

Justification for NC3 & NC4

A single model architecture imposes a specific inductive bias; for instance, DenseNet emphasizes feature reuse, while ResNet emphasizes residual learning. Relying on a single model creates a risk of overfitting to that specific architectural bias. By hybridizing different architectures (DenseNet and ResNet) and different depths (e.g., DN121, RN50), we create a diverse feature space. Theoretically, this ensemble approach ensures that if one model overfits or fails to capture a pattern, the complementary strengths of the other models compensate. This structural diversity is what leads to the improved generalization and robustness claimed in NC4.

Justification for NC5

In imbalanced datasets like HAM10000, a model can achieve high accuracy simply by predicting the majority class (the “accuracy paradox”). Therefore, accuracy alone is a theoretically insufficient metric. By validating our model against Precision, Recall, F1-Score, Specificity, and ROC-AUC, we theoretically confirm that the model is learning to distinguish distinct lesion types rather than exploiting class distributions.
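
The accuracy paradox can be quantified with a trivial baseline that always predicts the most frequent class. The sketch below uses the commonly cited per-class image counts of the full HAM10000 dataset; these counts are not taken from this section and are included as an assumption for illustration.

```python
# Commonly cited HAM10000 class sizes (assumption; not stated in this section).
ham10000_counts = {
    "akiec": 327,
    "bcc": 514,
    "bkl": 1099,
    "df": 115,
    "mel": 1113,
    "nv": 6705,
    "vasc": 142,
}


def majority_baseline(class_counts):
    """Metrics of a classifier that always predicts the largest class."""
    majority = max(class_counts, key=class_counts.get)
    total = sum(class_counts.values())
    accuracy = class_counts[majority] / total
    # Recall is 1 for the majority class and 0 for every other class.
    recalls = {name: (1.0 if name == majority else 0.0) for name in class_counts}
    macro_recall = sum(recalls.values()) / len(recalls)
    return majority, accuracy, recalls, macro_recall


majority, accuracy, recalls, macro_recall = majority_baseline(ham10000_counts)
accuracy_pct = round(100.0 * accuracy, 2)
macro_recall_pct = round(100.0 * macro_recall, 2)
```

Such a baseline reaches roughly two-thirds accuracy while never detecting a single melanoma, which is why per-class and averaged recall, precision, and specificity are needed alongside accuracy.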

Discussion and extended comparison

In order to further prove the effectiveness of our proposed framework, we performed a comparative performance analysis with various existing models reported in the literature. The comparative performance metrics such as accuracy, precision, recall, F1-score and specificity are summarized in Table 13.

Table 13.

Performance comparison with existing models from the literature (%).

Articles	Accuracy	Precision	Recall	F1-score	Specificity
Setiawan and Soewito⁸	89.41	83.27	84.85	83.39	97.50
Roy et al.⁹	90.75	90.83	90.75	91.17	–
Sofana Reka et al.¹⁰	82.86	86.33	82.86	86.44	–
Liu et al.¹¹	86.78	86.33	86.78	86.44	–
Saha et al.¹²	86.20	82.10	72.80	76.90	–
Gururaj et al.¹³	91.20	–	–	91.70	–
Azeem et al.¹⁴	90.00	89.00	87.00	85.00	–
Wicaksana et al.¹⁵	84.74	–	–	–	–
Liu et al.¹⁶	88.80	83.70	89.70	86.20	–
Bhowmick et al.¹⁷	90.24	90.09	90.24	89.97	90.53
Sönmez et al.¹⁸	80.79	–	–	–	–
Ours	91.43	91.36	91.43	91.26	92.04

Our model achieves the highest accuracy (91.43%), precision (91.36%), and recall (91.43%) among the compared works. Its F1-score (91.26%) is second only to that of Gururaj et al.¹³ (91.70%), and its specificity (92.04%) is good, although Setiawan and Soewito⁸ report a higher value (97.50%). Although some of the previous works achieved competitive results in individual metrics, none of them was uniformly more effective than our framework across all the measures. This shows the strength of our hybrid attention-integrated ensemble method in the classification of skin lesions.

Results with confidence interval

The confidence intervals for the classification metrics were computed using the normal approximation interval method based on the predictions obtained from the independent test set. For a given performance metric p and total number of test samples n, the standard error (SE) is calculated as:

SE = \sqrt{\frac{p(1 - p)}{n}}

(13)

The corresponding 95% confidence interval is then estimated as:

p \pm 1.96 \times SE

(14)

The normal approximation is statistically appropriate in this study because the conditions np > 5 and n(1 − p) > 5 are satisfied for all reported metrics, which fulfills the requirements of the Central Limit Theorem given the large-scale nature of the HAM10000 dataset.
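
Equations (13) and (14), together with the validity conditions above, translate into a few lines of code. The sketch below is a minimal illustration; the function names are hypothetical, and the example values of p and n are chosen for readability and are not taken from the study.

```python
import math


def normal_approx_ci(p, n, z=1.96):
    """Standard error (Eq. 13) and confidence interval (Eq. 14) for a proportion."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must lie in [0, 1] and n must be positive")
    se = math.sqrt(p * (1.0 - p) / n)
    half_width = z * se
    return se, (p - half_width, p + half_width)


def normal_approx_valid(p, n, threshold=5):
    """Check the rule-of-thumb conditions np > 5 and n(1 - p) > 5."""
    return n * p > threshold and n * (1.0 - p) > threshold


# Illustrative values only: a metric of 0.90 measured on 1000 test samples.
example_se, (example_low, example_high) = normal_approx_ci(p=0.90, n=1000)
example_half_width = 1.96 * example_se
example_valid = normal_approx_valid(p=0.90, n=1000)
```

For these illustrative values the half-width is about 1.86 percentage points; because the half-width shrinks with the square root of n, quadrupling the test set size halves the interval.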

The reported results were obtained from a single finalized training run using the fixed train-validation-test split described in the experimental setup. The confidence intervals therefore quantify the statistical uncertainty associated with the finite test sample size rather than variability across multiple independent training runs.

The comparative performance of the final level of the proposed architecture is summarized in Table 14, where all reported confidence intervals correspond to a 95% confidence level and are estimated using a normal approximation based on the independent test set. Slightly wider intervals for specificity reflect the effect of class imbalance in negative samples.

Table 14.

Performance metrics with 95% confidence intervals.

Algorithm	Accuracy	Precision	Recall	F1-score	Specificity
Proposed Model	91.43 ± 1.52	91.36 ± 1.55	91.43 ± 1.52	91.26 ± 1.57	92.04 ± 1.68

Comparison with state-of-the-art methods

To further validate the effectiveness of our proposed framework, we compared its performance against state-of-the-art lightweight convolutional neural networks. We selected widely recognized architectures including MobileNet variants (MobileNetV1, V2, V3-Large, V3-Small) and EfficientNet variants (EfficientNet-B0, B7, EfficientNetV2-B0, V2-Large), which are benchmarks for efficiency in medical image analysis. The comparative results on the internal test set are presented in Table 15.

Table 15.

Comparison with state-of-the-art MobileNet and EfficientNet variants on the Internal Test Set (%).

Model	Accuracy	Precision	Recall	F1-score	Specificity
MobileNetV1	87.31	87.01	87.31	86.79	87.51
MobileNetV2	88.50	88.42	88.50	88.30	90.97
MobileNetV3-Large	87.42	87.11	87.42	86.89	87.70
MobileNetV3-Small	86.22	86.08	86.22	85.78	88.89
EfficientNet-B0	83.01	83.09	83.01	81.78	81.71
EfficientNet-B7	89.10	89.32	89.10	87.02	87.77
EfficientNetV2-B0	84.27	84.25	84.27	84.09	83.07
EfficientNetV2-Large	88.58	88.06	88.58	88.01	88.03
Ours	91.43	91.36	91.43	91.26	92.04

Threats to validity

Although the proposed framework performs well, several threats to validity need to be noted. First, this study uses only the HAM10000 dataset. Although it is accepted as a standard dataset in the field, the model’s robustness has been assessed on this single ISIC-derived dataset only.

Second, due to the high computational complexity of the proposed multi-branch ensemble model, which uses six intensive models, the train, validation, and test splits were fixed rather than using k-fold cross-validation. While bias was reduced through the use of a separate test set and stratified samples, the lack of cross-validation means the performance measures may differ slightly depending on the data split. Future studies should aim to include multiple dataset validations and proper cross-validation techniques.

Third, this study is limited by the absence of cross-dataset evaluation, which prevents a comprehensive analysis of dataset bias and domain shift. Since our model was trained and evaluated exclusively on the HAM10000 dataset, its performance on data with different statistical distributions, such as images acquired using various dermatoscopic equipment, various lighting conditions, or patient demographics, is not established. We acknowledge that domain shift is a critical challenge in medical image analysis, and relying on a single dataset may introduce inherent biases. While conducting a full cross-dataset evaluation is beyond the scope of the current work, validating the model on external, heterogeneous datasets (e.g., ISIC archives or diverse clinical repositories) is a priority for our future research to ensure true clinical generalizability.

Finally, although our ensemble method performed very well, a customized ensemble learning method for this domain that determines the optimal weight of each model’s predictions²⁸,⁴²–⁴⁶ could substantially improve our architecture.

Feasibility in real clinical or mobile health settings

Our proposed framework demonstrates significant potential for deployment in real-world clinical and mobile health environments. First, the model is trained on the HAM10000 dataset, which consists of real clinical dermatoscopic images. It is important to note that the high performance metrics recorded by the model, such as accuracy and specificity, demonstrate that predictions made by the model are highly reliable. However, it is imperative to clarify the intended scope of this tool. We declare that the output is a probabilistic prediction, not a definitive medical diagnosis. If the model predicts a skin lesion as positive for a specific disease, it serves as an early warning system, suggesting that the patient should consult a dermatologist immediately for a formal diagnosis. We explicitly state: our goal is not to substitute doctors, but to assist them by prioritizing high-risk cases.

Regarding mobile health settings, deploying this architecture as a mobile application is highly feasible. While the training phase is computationally intensive, the deployment phase utilizes the pre-trained weights. A mobile application would not need to retrain the model; it would simply perform inference using these saved weights. Consequently, the computational burden on the mobile device is minimal, making the application lightweight and compatible with a wide range of mobile hardware. Developing this mobile interface remains a key objective for our future work.

Conclusion and future work

This paper presents a novel hybrid deep learning architecture for robust skin lesion classification. The architecture comprises three DenseNet models, three ResNet models, and three attention mechanisms (channel attention, squeeze-and-excitation attention, and soft attention) to create a more robust model with improved feature extraction. The resulting ensemble model achieved 91.43% accuracy and 92.04% specificity on the HAM10000 dataset, surpassing existing baseline methods. Our research underscores the efficacy of leveraging diverse architectures and attention mechanisms to address challenges such as class imbalance and inadequate generalization, which are prevalent in medical imaging datasets.

Despite the strong performance of the proposed model, a notable limitation is its computational complexity, which may hinder its implementation in resource-constrained environments. Future work will aim to optimize the model architecture to minimize computational demands while preserving performance levels. Furthermore, we intend to validate the model’s effectiveness across additional, more varied skin lesion datasets to confirm its generalizability. Further investigations will also consider advanced methodologies such as knowledge distillation to develop a more lightweight and efficient version of the model tailored for mobile or edge device applications. We will also work to mitigate the threats to validity identified in this study.

Footnotes

Acknowledgment

We are grateful that the Computer Science and Engineering Department of the Rajshahi University of Engineering and Technology, Bangladesh, assisted in the study process.

ORCID iDs

Probal Bhowmick

Anwar Hossain Efat

Ethical consideration

This study utilizes the HAM10000 dataset,³⁰ which was obtained from the Harvard Dataverse repository. We strictly adhered to the required attribution policies by properly citing the original authors. The dataset is made available under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). In accordance with this license, we have provided the necessary attribution, which permits the use of this dataset for research purposes. As this study involves the analysis of secondary, de-identified data available in the public domain, no further ethical clearance or direct human subject approval was required.

Author contribution

Probal Bhowmick: Conceptualization, Validation, Formal analysis, Software, Investigation, Writing – original draft. Julia Rahman: Formal analysis, Supervision, Writing – review & editing. Anwar Hossain Efat: Conceptualization, Data curation, Methodology, Supervision, Writing – review & editing. Tasfi Fairoz Nidhi: Investigation, Validation. Dipanjan Karmaker Amit: Investigation, Validation.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Efat

Hasan

Uddin

, et al. Inverse gini indexed averaging: A multi-leveled ensemble approach for skin lesion classification using attention-integrated customized resnet variants. Digital Health 2025; 11: 20552076241312936. https://doi.org/10.1177/20552076241312936

Ding

, et al. Hi-mvit: A lightweight model for explainable skin disease classification based on modified mobilevit. Digital Health 2023; 9: 20552076231207197. https://doi.org/10.1177/20552076231207197

Hay

Johns

Williams

, et al. The global burden of skin disease in 2010: an analysis of the prevalence and impact of skin conditions. Journal of investigative dermatology 2014; 134(6): 1527–1534. https://doi.org/10.1038/jid.2013.446

Siegel

Miller

Jemal

. Cancer statistics, 2019. CA: a cancer journal for clinicians 2019; 69(1): 7–34. https://doi.org/10.3322/caac.21551

Perlmutter

Milkovich

Fremont

, et al. Beyond the surface: Assessing gpt-4’s accuracy in detecting melanoma and suspicious skin lesions from dermoscopic images. Plastic Surgery, 2025.22925503251315489.

Efat

Ferdous

Nayem

, et al.

From data to diagnosis: A journey with machine learning, hyperparameter tuning, and ensemble learning for disease prognostication

Proceedings of Trends in Electronics and Health Informatics: TEHI 2023, 2023, p. 407.

Datta

Hasan

Mitu

, et al. Hyperparameter-tuned machine learning models for complex medical datasets classification. 2023 International Conference on Electrical, Computer and Communication Engineering (ECCE). IEEE, pp. 1–6.

Setiawan

Soewito

. Crcdkd: A novel architecture for medical skin cancer classification on the imbalanced ham10000 dataset. Commun Math Biol Neurosci 2025; 2025, Article–ID.

Roy

Sarkar

Ghosal

, et al. A wavelet guided attention module for skin cancer classification with gradient-based feature fusion. 2024 IEEE International Symposium on Biomedical Imaging (ISBI). IEEE, pp. 1–4.

10.

Reka

Karthikeyan

Shakil

, et al. Exploring quantum machine learning for enhanced skin lesion classification: A comparative study of implementation methods. IEEE Access 2024.

11.

Liu

Tan

, et al. Enhancing skin lesion diagnosis with ensemble learning. 2024 4th International Conference on Computer Science and Blockchain (CCSB). IEEE, pp. 94–98.

12.

Saha

Ahamed

Imran

, et al. Yolov8-based deep learning approach for real-time skin lesion classification using the ham10000 dataset. 2024 IEEE International Conference on E-health Networking, Application & Services (HealthCom). IEEE, pp. 1–4.

13. Gururaj, Manju, Nagarjun, et al. DeepSkin: A deep learning approach for skin cancer classification. IEEE Access 2023; 11: 50205–50214. https://doi.org/10.1109/access.2023.3274848

14. Azeem, Kiani, Mansouri, et al. SkinLesNet: Classification of skin lesions and detection of melanoma cancer using a novel multi-layer deep convolutional neural network. Cancers 2023; 16(1): 108. https://doi.org/10.3390/cancers16010108

15. Wicaksana, Pramunendar, Saraswati, et al. Skin lesion classification using YOLOv11 on the HAM10000 dataset. Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi 2025; 10(1): 45–52.

16. Liu, Zhang. An adaptive weight search method based on the grey wolf optimizer algorithm for skin lesion ensemble classification. International Journal of Imaging Systems and Technology 2024; 34(2): e23049. https://doi.org/10.1002/ima.23049

17. Bhowmick, Efat, Hasan, et al. Dual concatenated DenseNet with attention fusion: A framework for skin lesion classification incorporating multiple augmentation techniques and transfer learning. 2024 27th International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 1087–1092.

18. Sönmez, Çakar, Cerezci, et al. Deep learning-based classification of dermoscopic images for skin lesions. Sakarya University Journal of Computer and Information Sciences 2023; 6(2): 114–122. https://doi.org/10.35377/saucis.1314638

19. Jahan, Efat, Hasan, et al. An explainable deep learning framework for multi-class skin lesion classification while resolving class imbalance. 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, pp. 473–478.

20. Roy, Efat, Hasan, et al. Multi-scale feature fusion framework based on attention integrated customized DenseNet201 architecture for multi-class skin lesion detection. 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, pp. 496–501.

21. Hasib, Faruk, Hasan, et al. Improved skin lesion detection with double layer concatenated DenseNet using transfer learning and attention modules. 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, pp. 1–6.

22. Mia, Efat, Hasan, et al. Exploring augmentation strategies for balanced skin lesion classification: An explainable lightly tuned DenseNet 169 architecture. 2024 International Conference on Innovations in Science, Engineering and Technology (ICISET). IEEE, pp. 1–6.

23. Nidhi, Efat, Hasan, et al. Triple attention MobileNetV3: Harnessing integrated attention and transfer learning for next-generation skin lesion detection. 2024 IEEE International Conference on Computing, Applications and Systems (COMPAS). IEEE, pp. 1–6.

24. Abir MAK, Efat, Hasan, et al. Attention enhanced Inception-v3: A multi-scale feature fusion network for skin lesion detection with explainable artificial intelligence. 2024 International Conference on Innovations in Science, Engineering and Technology (ICISET). IEEE, pp. 1–6.

25. Ahmmed, Faruk, Srizon, et al. Shallow tuned DenseNet: A lightweight convolutional neural network approach for enhanced skin lesion recognition. 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, pp. 1–6.

26. Amin, Efat, Rahman, et al. Enhanced skin lesion detection using concatenated DenseNet and multi-attention mechanisms. 2024 International Conference on Innovations in Science, Engineering and Technology (ICISET). IEEE, pp. 1–6.

27. Shafin, Efat, Hasan, et al. Skin lesion classification through sequential triple attention DenseNet: Diverse utilization of the combination of attention modules. 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 1–6.

28. Das, Mohanty. Design of stacked ensemble classifier for skin cancer detection. Multimedia Tools and Applications 2025; 1–20. https://doi.org/10.1007/s11042-025-20630-7

29. Mohanty, Das. Skin cancer detection from dermatoscopic images using hybrid fuzzy ensemble learning model. International Journal of Fuzzy Systems 2024; 26(1): 260–273. https://doi.org/10.1007/s40815-023-01593-z

30. Tschandl, Rosendahl, Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 2018; 5(1): 1–9. https://doi.org/10.1038/sdata.2018.161

31. Sikder, Efat, Hasan, et al. A triple-level ensemble-based brain tumor classification using Dense-ResNet in association with three attention mechanisms. 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 1–6.

32. Woo, Park, Lee, et al. CBAM: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.

33. Haque, Efat, Hasan, et al. Revolutionizing pest detection for sustainable agriculture: A transfer learning fusion network with attention-triplet and multi-layer ensemble. 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 1–6.

34. Joy, Efat, Hasan, et al. Attention trinity net and DenseNet fusion: Revolutionizing American Sign Language recognition for inclusive communication. 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 1–6.

35. Hu, Shen, Sun. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.

36. Montashir, Efat, Mahedy Hasan, et al. Tri focus net: A CNN-based model with integrated attention modules for pest and insect detection in agriculture. International Conference on Trends in Electronics and Health Informatics. Springer, pp. 225–240.

37. Bahdanau, Cho, Bengio. Neural machine translation by jointly learning to align and translate. 2014.

38. Efat, Hasan, Zibran. Greeknet: Handwritten Greek alphabet recognition using explainable parallel CNN with attention mechanisms. 2025 IEEE 4th International Conference on Computing and Machine Intelligence (ICMI). IEEE, pp. 1–9.

39. Efat. Pinpointing key success factors in Bangladesh’s public university entrance exams: A feature-optimized SVM architecture with XAI. 2024 27th International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 429–434.

40. Hossain Efat, Faysal Ferdous, Islam Nayem, et al. From data to diagnosis: A journey with machine learning, hyperparameter tuning, and ensemble learning for disease prognostication. International Conference on Trends in Electronics and Health Informatics. Springer, pp. 407–420.

41. Efat, Hasant, Jannat, et al. Inquisition of the support vector machine classifier in association with hyper-parameter tuning: A disease prognostication model. 2022 4th International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE). IEEE, pp. 131–134.

42. Efat, Zibran, Eishita. Skin lesion classification breakthrough: Leveraging independent-serial-parallel-stacking ensemble architecture with reciprocal cross-entropy averaging. IEEE Access 2026.

43. Efat. Tri-attention boosted scalable EfficientNet for skin lesion classification via triple-stage gain ratioed averaging. Franklin Open 2026; 15: 100546. https://doi.org/10.1016/j.fraope.2026.100546

44. Bhuiyan, Efat, Hasan, et al. Hierarchical attention stacked ensemble with Matthews-correlation-coefficient weighted averaging: A novel framework for skin lesion classification. Digital Health 2026; 12: 20552076261433750. https://doi.org/10.1177/20552076261433750

45. Efat, Hasan, Uddin, et al. A multi-level ensemble approach for skin lesion classification using customized transfer learning with triple attention. PLoS ONE 2024; 19(10): e0309430. https://doi.org/10.1371/journal.pone.0309430

46. Efat. Chi 2 weighted ensemble: A multi-layer ensemble approach for skin lesion classification using a novel framework-optimized RegNet synergy with attention-triplet. PLoS ONE 2025; 20(5): e0321803. https://doi.org/10.1371/journal.pone.0321803