Abstract
The advent of deep learning techniques has ignited interest in medical image processing. This paper proposes a deep learning approach, built on a radiomics feature extraction model, for the effective detection of Kaposi sarcoma, a vascular skin lesion that is the most prevalent cancer in AIDS patients. The work investigates the role and impact of medical image fusion on deep feature learning based on ensemble learning in the medical domain. In the proposed model, the pre-trained ResNet50 (residual network) and VGG16 (Visual Geometry Group) networks are fine-tuned and combined through an ensemble learning approach. The pre-trained CNNs were fine-tuned incrementally to determine the settings that improve classification efficiency. Our findings show that layer-by-layer fine-tuning can improve the performance of middle and deep layers. The work serves the purpose of masking and classifying skin lesion images, primarily sarcoma, using an ensemble approach. Based on the positive classification findings, the proposed framework could assist radiologists in classifying Kaposi sarcoma as well as other related skin lesion diseases.
Introduction
Cancer is a diverse illness with several subgroups. In cancer research, early diagnosis and prognosis of a cancer type have become a necessity because they can aid patient clinical treatment. The use of machine learning in the detection and prognosis of various cancer types has become increasingly common in recent years. The most common malignancy in AIDS patients is Kaposi sarcoma (KS), which is caused by the Kaposi sarcoma-associated herpesvirus (KSHV), also known as HHV-8, and is recognized by its characteristic red skin lesions. Ethel Cesarman et al. [1] presented a survey on Kaposi sarcoma and KSHV, which is required for the development of KS. HHV-8 is found in less than 10% of the population of the United States. KS is a low-grade malignant vascular lesion that most often manifests as skin lesions. The most visible indications of KS are painless, flat spots that appear red or purple on fair skin and bluish or brownish on dark skin. Unlike bruises, the lesions do not change color when pressed. They are not itchy, nor do they drain. The most common way of diagnosing Kaposi sarcoma is a routine physical exam or any identification of skin tumors. However, because HIV-positive people are more likely to develop KS, healthcare experts who are trained to detect KS and other HIV-related disorders should evaluate them regularly. The medical diagnosis is very extensive and involves biopsy tests, such as punch biopsy or excisional biopsy. Crombe et al. [2] studied soft-tissue sarcomas, their diagnosis, and their classification using artificial intelligence and machine learning. Numerous studies have been carried out to provide clinical solutions for skin lesion-based diseases.
Many research teams have investigated the use of ML and deep learning techniques in biology and bioinformatics to categorize skin cancer patients into high- or low-risk groups. Christoph Wallner et al. [3] suggested a reliable CNN model for soft tissue sarcoma based on chest X-ray images that assisted in detection alongside radiologist findings. Jang S et al. [4] suggested a deep-learning-based detection algorithm for lung cancer based on chest radiographs. Such methods have therefore been used to model the growth and therapy of cancer. Michal Strzelecki et al. [5] developed a series of algorithms for skin lesion detection using whole-body images. Three detection algorithms and their fusion were examined, including two conventional methods: the local brightness distribution method and the correlation method.
Walaa Gouda [6] put forth a methodology for detecting skin cancer from skin lesion images by incorporating deep learning models such as ResNet50, Inception-ResNet, and Inception V3. Machine learning tools must be able to identify key features in complex data sets. Artificial neural networks (ANNs), support vector machines (SVMs), and decision trees (DTs) are a few of the techniques frequently used to construct prediction models in cancer research. On that note, a dedicated sarcoma detection model is proposed here for improved prediction and accuracy.
Literature survey
Tanzila Saba et al. [7] conducted a study to examine, revisit, categorize, and respond to current advances in cancer detection using machine learning techniques for breast, brain, and skin cancer. Konstantina Kourou et al. [8] analyzed the importance of classifying cancer patients into high- and low-risk categories to explore the use of machine learning (ML) approaches. Hermessi et al. [9] studied the importance and effect of medical image fusion on deep feature learning focused on transfer learning in the medical domain. Soft-tissue sarcomas (STS) are rare tumors that account for 1% of all adult malignancies. There are over 100 different histologic subtypes, with the trunk, extremities, and retroperitoneum being the most prevalent sites. Gamboa et al. [10] summarize the recent literature on histotype-specific treatment of extremity/truncal and retroperitoneal STS in terms of surgery and chemotherapy. Based on perfusion-weighted magnetic resonance imaging (PWI), Mahrooz et al. [12] suggested a digitally assisted approach to separate uterine sarcoma from leiomyomas. Vascular lesions are among the most prominent congenital and neonatal defects.
Syed et al. [11] aimed to provide a thorough understanding of the classification, pathogenesis, clinical presentation, and treatment of essential vascular lesions. Khan et al. [13] proposed a method for identifying and classifying lesions based on probabilistic distributions and best feature selection. Probabilistic distributions such as the normal and uniform distributions are used to segment lesions in dermoscopic images. Malignant melanoma is one of the deadliest types of skin cancer, accounting for a substantial number of deaths worldwide. Muhammad Attique Khan et al. [14] suggested a fully automatic computer-aided diagnosis (CAD) method based on deep learning. Mutlu Mete et al. [15] implemented a density-based, boundary-driven framework for identifying skin lesions in dermoscopy images. Similar works using deep learning approaches are shown in Table 1.
Comparison of existing Skin Lesion analysis frameworks (or) methodologies
The proposed methodology aims to effectively predict the malignant and benign skin lesion samples fed into the model, which indeed aids clinicians or healthcare practitioners in detecting Kaposi sarcoma, a skin lesion-based type of cancer. The proposed model fetches input from a custom dataset formed from the two primary datasets, namely the lesion dataset and the HAM10000 dataset, and makes use of the pre-built Resnet50 model and Visual Geometry Group (VGG16). The ensemble learning methodology is used to fine-tune the pre-defined model.
The proposed method compares the predictions and classification precision of the traditional models (ResNet50 and VGG16) and also the fine-tuned ensemble model.
A single conventional machine learning model might not yield the expected accuracy. Every model has its limitations, so the outputs of multiple models can be merged to increase overall accuracy. Figure 1 portrays the architecture of the proposed ensemble approach. The input images are first fed to the base learners (refined VGG16 and refined ResNet50), often referred to as weak learners, which generate masked outputs based on the training data. Meta-learning models learn from the outputs of other machine learning models. In our scenario, the meta-model takes in the masked outputs of the two base learners to obtain segmented images that are classified as benign or malignant. The proposed ensemble model could make better predictions and attain better performance than a single model. It is also more robust and reduces the spread of prediction outcomes.

Workflow of the Dermoscopic analysis of Vascular skin lesion sarcoma.
The workflow of the proposed work is shown in Fig. 1. The input images obtained from the custom dataset (Skin Lesion and HAM10000) are pre-processed and fed into the base learning models, viz., VGG16 and ResNet50, which are fine-tuned and yield the classified output that is then fed as input to the meta-learning model, where an ensemble-based CNN algorithm is used to segment the masked images and classify the images as benign or malignant.
Our model fetches input from a custom dataset formed from the two primary datasets, namely the lesion dataset and the HAM10000 dataset. The HAM10000 data used to support the findings of this study can be found for further detailed analysis using the following DOI: [https://doi.org/10.7910/DVN/DBW86T]. The lesion dataset was obtained from the ISIC Challenge Archive, which holds 807 lesion images and 807 corresponding super-pixel masks as training data. The lesion data to support the findings can be found at [https://doi.org/10.34970/2020-ds01]. The training ground truth dataset has 807 dermoscopic feature files. The test data holds 335 lesion images and 335 corresponding superpixel masks. The small size and lack of variety of the available dataset of dermatoscopic images make it difficult to train the neural network for the efficient detection of lesions of the skin that are pigmented.
As a result, the HAM10000 dataset (“Human Against Machine with 10,000 training images”) was created and released to aid the segmentation of pigmented vascular lesions. The final dataset comprises 10015 dermatoscopic images, which can be used as a training collection for data analysis and exploration. Fig. 2 shows sample images from the dataset.

Sample Raw images of vascular skin lesions.
Data preparation is the process of cleaning and transforming raw data before processing and interpretation. One of the key objectives of data preparation is to ensure that the raw data being processed is reliable and consistent so that the findings of analytics implementations are accurate.
Data preparation strategies include data normalization, removal of null values, and data sampling. The input images were resized to small squares, and the per-channel pixel mean measured on the training dataset was subtracted from each image. Several image scaling alternatives were tested with the model.
Data cleaning is performed by removing the null values in the HAM10000 dataset. Removing records with missing values yields a model that is both stable and reliable. Since the custom dataset is assembled from the original datasets, resampling is required to balance it.
The minority classes are up-sampled, and the majority classes are down-sampled. The pre-processing phase also augments the dataset, which is achieved using the ImageDataGenerator class from the Keras library.
A benefit of this class is that it leaves the data on disk untouched. The images are rotated randomly by an angle between 0 and 180 degrees, and random zoom and horizontal and vertical shifts are applied, as shown in Fig. 3.

Sample pre-processed and augmented images.
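The class balancing step described above (up-sampling minority classes, down-sampling majority classes) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `rebalance`, the target count, and the toy file names are assumptions made for the example and are not taken from the paper.

```python
import random
from collections import defaultdict

def rebalance(samples, target_per_class, seed=0):
    """Resample (item, label) pairs so every class has target_per_class items.

    Majority classes are down-sampled without replacement; minority classes
    are up-sampled by drawing extra copies with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    balanced = []
    for label, items in sorted(by_class.items()):
        if len(items) >= target_per_class:
            chosen = rng.sample(items, target_per_class)       # down-sample
        else:
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra                              # up-sample
        balanced.extend((item, label) for item in chosen)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 6 benign file names versus 2 malignant ones.
toy = [(f"b{i}.jpg", "benign") for i in range(6)] + \
      [(f"m{i}.jpg", "malignant") for i in range(2)]
balanced = rebalance(toy, target_per_class=4)
```

In practice the target count would be chosen from the class histogram of the actual dataset; the value 4 here is only for the toy example.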
Segmentation phase
Sklansky [29] proposed methods for segmentation and feature extraction. The pre-processed input image is converted to grayscale and then passed to an adaptive threshold function, which first builds a histogram of pixel intensity values. Using the adaptive local threshold procedure, an optimal intensity threshold T is calculated and used to divide the image's pixels into two classes. The adaptive threshold procedure consistently seeks the highest variance and sets the threshold to the value that separates the majority of intensity values. The grayscale image is thereby converted to a binary image. Using a depth-first scanning mechanism, only the largest contour, which is the region to be predicted, is retained.
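The two steps of this phase, choosing a threshold T from the intensity histogram and keeping only the largest foreground region through a depth-first scan, can be illustrated with a small sketch. The paper does not name the exact thresholding rule, so Otsu's between-class variance criterion is used here as an assumed stand-in; the 4×5 toy image and the function names are likewise hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold T that maximizes between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]                 # pixels with intensity <= t
        sum0 += t * hist[t]
        w1 = total - w0               # pixels with intensity > t
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def largest_component(mask):
    """Keep only the largest 4-connected foreground region of a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            stack, region = [(r, c)], []
            seen[r][c] = True
            while stack:              # iterative depth-first scan
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(region) > len(best):
                best = region
    out = [[0] * cols for _ in range(rows)]
    for y, x in best:
        out[y][x] = 1
    return out

# Hypothetical grayscale patch: a bright lesion plus one isolated bright pixel.
gray = [
    [10,  12, 200,  11,  10],
    [11, 210, 220, 205,  12],
    [10, 215, 225,  13,  10],
    [12,  11,  10,  10, 230],
]
T = otsu_threshold([p for row in gray for p in row])
binary = [[1 if p > T else 0 for p in row] for row in gray]
lesion = largest_component(binary)
```

The isolated bright pixel in the bottom-right corner is discarded because it does not belong to the largest connected region.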
Contrast enhancement
In this phase, following A. Singh, S. Yadav, and N. Singh [24], a mixture of local and global contrast stretching is used to enhance the minor ROI image. This phase aims to improve the consistency, contrast, and brightness of the lesion region. After contrast enhancement, the segmented minor ROI for a Kaposi sarcoma skin lesion and the segmented minor ROI for other skin lesions are obtained, which makes it easier for the neural network model to derive the necessary features from the image. One of the image enhancement processes used in this step is denoising, in which high-frequency, low-count pixels are removed from the ROI-segmented image. The brightness must also be kept consistent across training and test images. This processing is applied by an adapter that takes an image and converts it to a fixed size with the highest contrast, constant brightness, and minimal noise before feeding it to the model.
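As a rough illustration of the global part of contrast stretching, the sketch below linearly maps the intensity range of an ROI onto the full 0 to 255 range. It covers only the global stretch, not the local one or the denoising step, and the function name and sample values are assumptions for the example.

```python
def stretch_contrast(pixels, out_min=0, out_max=255):
    """Linearly rescale intensities so min -> out_min and max -> out_max."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                       # flat region: nothing to stretch
        return [out_min] * len(pixels)
    scale = (out_max - out_min) / (hi - lo)
    return [round(out_min + (p - lo) * scale) for p in pixels]

# Hypothetical low-contrast ROI intensities.
roi = [60, 80, 100, 120]
stretched = stretch_contrast(roi)      # [0, 85, 170, 255]
```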
Refined architecture of VGG16
The VGG16 architecture is a classic convolutional neural network that achieved top results in the ILSVRC contest held in 2014. S. Liu and W. Deng [27] suggested a VGG16 architecture for small training sample sizes. Even though better pre-trained models have since been developed, making the traditional VGG16 model deeper can yield better feature selection and improved training accuracy. The workflow depicted in Fig. 4(a) is designed from the ground up using a sequential model. The input to the first convolution layer is a fixed-size 224×224 RGB image. The input passes through a stack of convolution layers with a 3×3 receptive field, a size that efficiently captures the notions of left, right, up, down, and center.
A linear transformation of the input channels is implemented using 1×1 convolution filters. For 3×3 convolution layers, the stride is set to 1 pixel and the spatial padding to 1 pixel, so that the spatial resolution is retained after the convolution operation.
Spatial pooling is performed by five max-pooling layers, each of which follows some of the convolution layers. Max pooling is performed over a 2×2 window with a stride of 2. The stack of convolutional layers, whose depth varies between configurations, is followed by three fully connected (FC) layers. The fully connected layers are built in the same way in all networks and are followed by a softmax layer. All hidden layers use the rectified linear unit (ReLU) non-linearity. The pre-trained VGG16 model is imported using the Keras library. On top of the base model, a global spatial average-pooling layer is added, followed by a dropout layer to reduce overfitting and then a logistic layer. Only the top layers, which are initialized randomly, are trained. Stochastic gradient descent is used as the optimizer and categorical cross-entropy as the loss function, as shown in Fig. 4(b). In the model's prediction, the inner regions of the obtained contour are set to white, and a suitable predicted mask is created.

Traditional VGG16.

Proposed model of VGG16.
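The spatial dimensions implied by the VGG16 description can be checked with the standard output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The short sketch below applies it to a 224×224 input; the helper name is an assumption for illustration.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

size = 224
# A 3x3 convolution with stride 1 and padding 1 keeps the resolution.
size = conv_output_size(size, kernel=3, stride=1, padding=1)
sizes = [size]
# Each of the five 2x2 max-pooling layers with stride 2 halves it.
for _ in range(5):
    size = conv_output_size(size, kernel=2, stride=2)
    sizes.append(size)
# sizes == [224, 112, 56, 28, 14, 7]
```

The final 7×7 feature map is what the added global average-pooling layer reduces to one value per channel.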
The ResNet model included the concept of skip connections, which addressed the problem of vanishing gradient by allowing the gradient to flow along an alternative shortcut path. K. He, X. Zhang, S. Ren, and J. Sun [28] proposed a residual network using deep layers for recognizing images. ResNets have a variety of persuasive advantages, including the elimination of the vanishing-gradient problem, improved feature propagation, feature reuse, and a significant reduction in the number of parameters.
The ResNet50 architecture in Fig. 5(a) has four main stages; the input image should have a height and width that are multiples of 32 and a channel depth of 3. The architecture begins with a convolution using 7×7 kernels followed by a 3×3 max-pooling layer. The first stage of the network has three residual blocks with three convolution layers each. The three convolution layers in each block use 64, 64, and 256 filters, respectively.
The convolution operation in the residual block uses a stride of 2, so the spatial size of the input is halved while the channel width is doubled. A bottleneck design is used: for each residual function, three convolution layers with kernel sizes 1×1, 3×3, and 1×1 are stacked on top of each other.
The 1×1 convolution layers are responsible for reducing and then restoring the dimensions, leaving the 3×3 layer as a bottleneck with smaller input and output dimensions. A dense layer with a softmax activation function is appended to the pre-trained model, which is trained with the Adam optimizer, as shown in Fig. 5(b).

Traditional ResNet50.

Proposed model of ResNet50.
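The benefit of the 1×1, 3×3, 1×1 bottleneck can be made concrete by counting convolution weights. The sketch below compares a bottleneck block with a plain block of two 3×3 convolutions, assuming 256 input channels reduced to 64 inside the bottleneck; biases and batch-normalization parameters are ignored, and the helper names are illustrative.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out

def bottleneck_weights(channels, reduced):
    """1x1 reduce -> 3x3 -> 1x1 restore."""
    return (conv_weights(1, channels, reduced)
            + conv_weights(3, reduced, reduced)
            + conv_weights(1, reduced, channels))

def plain_block_weights(channels):
    """Two 3x3 convolutions operating at full channel width."""
    return 2 * conv_weights(3, channels, channels)

bottleneck = bottleneck_weights(256, 64)   # 16384 + 36864 + 16384 = 69632
plain = plain_block_weights(256)           # 2 * 589824 = 1179648
```

Under these assumptions the bottleneck block needs roughly 17 times fewer convolution weights than the plain block at the same channel width.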
S. Ali, S.S. Tirumala, and A. Sarfarzadeh [30] proposed advanced techniques that combine different learning algorithms to achieve greater predictive performance than any of the individual learning algorithms could. Ensemble learning is mainly used to increase a model's accuracy. Using a meta-learning algorithm, it learns how to combine the predictions of one or more base machine learning algorithms. F. Anifowose, J. Labadin, and A. Abdulraheem [25] suggested contemporary approaches for artificial neural networks.
The ensemble approach used in the proposed method is stacked generalization, also referred to as stacking, an ensemble approach aimed at improving model performance. A. Zhou, K. Ren, X. Li, and W. Zhang [32] applied the stacking method of the ensemble approach. The architecture of a stacking model consists of two or more base models, also known as level-0 models, and a meta-model that combines the predictions of the base models, also known as the level-1 model. The meta-model is trained on predictions made by the base models on out-of-sample data. That is, data not used for training is fed into the base models, and their predictions are collected. These predictions, paired with the expected outputs, make up the training dataset used to fit the meta-model. The inputs to the base models can also be included in the meta-model's training data, which gives the meta-model more context on how best to combine the base predictions. In the proposed method, the pre-trained VGG16 and ResNet50 are employed as base models. The meta-model is defined as a baseline that allows an efficient analysis of the base models' predictions; hence, an ensemble CNN is used. Figure 6(a) shows the flowchart for the construction of the proposed ensemble model. Algorithm 3.4.1 defines the flow of the ensemble architecture, in which a stacking approach is followed.

Construction of the proposed ensemble model.
Input: training image set {Ti}, validation set {Vi}
Base models: VGG16, RESNET50
Init array A1 for the base-model predictions
Init array O for the actual values

Refine VGG16:
    Model = Base VGG16
    Model.stack(Spatial_Avg_Pool)
    Model.stack(Dropout)
    Model.stack(Logistic_conv)
    Model.stack(SGD_Entropy)
    return R_VGG16

Refine RESNET50:
    Model = Base RESNET50
    Model.stack(flatten)
    Model.stack(dense)
    Model.stack(Adam, cross-entropy)
    return R_RESNET50

K = K_Fold_Cross_Validation()
for iteration from 1 to K do:
    Split (Ti -> Ti-train, Ti-val)
    Train (refined base models on Ti-train)
    for every value vi in Ti-val:
        Obtain predictions P1, P2 using R_VGG16 and R_RESNET50
        Append P1, P2 to A1
        Append actual value to O

Train ensemble model with A1 as input array and O as target array

Init array A2 for the base-model predictions on Vi
Init array O1 for the actual values in Vi
for every value vi in Vi:
    Obtain predictions P1, P2 using R_VGG16 and R_RESNET50
    Append P1, P2 to A2
    Append actual value to O1

Generate predictions using A2 and the ensemble model
Evaluate predictions against O1 using:
    Jaccard_Index()
    Precision()
    Recall()
    Accuracy()
    F1_Score()
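The out-of-fold stacking loop of the algorithm above can be sketched in a few lines of plain Python. This is only an illustration of the control flow under stated assumptions: the two CNN base learners are replaced by hypothetical one-feature threshold classifiers, the ensemble CNN is replaced by a small logistic-regression meta-learner, and the data are synthetic two-feature points rather than images.

```python
import math
import random

def fit_threshold(rows, labels, feature):
    """Hypothetical stand-in for a base learner: threshold on one feature."""
    pos = [r[feature] for r, y in zip(rows, labels) if y == 1]
    neg = [r[feature] for r, y in zip(rows, labels) if y == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda r: 1.0 if r[feature] > thr else 0.0

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Stand-in meta-learner: logistic regression via batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))
            for j, xj in enumerate(xi):
                gw[j] += (p - yi) * xj
            gb += p - yi
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return lambda xi: int(sum(wj * xj for wj, xj in zip(w, xi)) + b > 0)

def stack(train_rows, train_labels, k=5):
    """Build A1 (out-of-fold base predictions) and O, then fit the meta-model."""
    n = len(train_rows)
    folds = [list(range(i, n, k)) for i in range(k)]
    A1, O = [None] * n, [None] * n
    for held_out in folds:
        held = set(held_out)
        fit_rows = [train_rows[i] for i in range(n) if i not in held]
        fit_labels = [train_labels[i] for i in range(n) if i not in held]
        base = [fit_threshold(fit_rows, fit_labels, f) for f in (0, 1)]
        for i in held_out:
            A1[i] = [m(train_rows[i]) for m in base]    # P1, P2
            O[i] = train_labels[i]
    meta = fit_logistic(A1, O)
    # Refit the base learners on all training data for use at prediction time.
    base = [fit_threshold(train_rows, train_labels, f) for f in (0, 1)]
    return (lambda r: meta([m(r) for m in base])), A1, O

# Synthetic data: class 1 clusters near (2, 2), class 0 near (0, 0).
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
rows = [(rng.gauss(2 * y, 0.5), rng.gauss(2 * y, 0.5)) for y in labels]
predict, A1, O = stack(rows, labels)
```

The key design point carried over from the algorithm is that the meta-learner only ever sees base-model predictions made on data those base models were not trained on.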
The algorithm is designed to create an ensemble model using modified versions of VGG16 and ResNet50 as base models. The following description defines each function and its process.
Model training
The images in the dataset are split into 70% training data and 30% testing and validation data. The images are read one by one and run through the previous two stages, and the resulting output is saved in a pre-processed folder. The models are developed using Keras, a high-level API that runs on top of TensorFlow. N. Tajbakhsh et al. [26] suggested the fine-tuning approach. The VGG16 model is trained with all layers frozen except for the fully connected layer and the dense layer; that is, the first 249 layers are frozen and the remaining layers are left unfrozen, so that the top two layers are fine-tuned. ResNet50 is trained normally, and the training results are stored. The predictions obtained from the VGG16 and ResNet50 models are evaluated separately, segmented, and stacked together for ensemble learning. The ensemble predictions are thus obtained as vectors that hold the unaltered original images and the predicted output images with high classification accuracy, as demonstrated in Fig. 6(b).

Original and segmented images.
The model is tested by matching the test labels and window coordinates against the ground-truth annotations of the test images in the detection dataset to assess the accuracy metrics. The images to be evaluated are converted into vectors of “expected values”, and these actual values are compared with the vector of predicted values. This results in increased training accuracy. Several evaluation metrics are taken into consideration: intersection over union, Dice score, precision, recall, accuracy, F1 score, sensitivity, and specificity, as described in Table 2.
Model evaluation metrics and their textual elucidation
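The metrics listed above can all be derived from the four confusion-matrix counts using their standard definitions. The sketch below is a minimal illustration; the counts passed in are hypothetical and are not results from this study.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Standard metrics from true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # identical to sensitivity
    return {
        "iou": tp / (tp + fp + fn),         # Jaccard index
        "dice": 2 * tp / (2 * tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for illustration only.
m = segmentation_metrics(tp=90, fp=5, fn=10, tn=95)
```

For binary masks the Dice score and the F1 score are the same quantity, which the example reproduces numerically.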
The model is trained on a custom dataset formed from the two primary datasets, namely the lesion dataset and the HAM10000 dataset. Initially, a preliminary analysis is made on the dataset, and data pre-processing is performed. The different parameters that conform to the sample images include the nature of the skin lesion, the type of cell, the localization factor values, and the age and sex of the person to whom the lesion corresponds. The following Table 3 displays the additional parameters that are associated with the images and their interpretation.
Graphical analysis of associated parameters of skin lesion images
The models were evaluated on a custom dataset generated by integrating the benchmark HAM10000 dataset with the ISIC dataset. The results are highly encouraging for the ensemble model, and its training time is also comparatively short. The ensemble model achieved an accuracy of 97.86% and an IoU of 95.02%, which is better than the traditional models. A Dice score of 96.35% was obtained, with a recall of 97.67% and sensitivity and specificity values of 96.17% and 96.97%, respectively. The dataset was also evaluated on traditional models. The SVM yielded an IoU score of 85.23%, a Dice score of 84.98%, a precision of 83.34%, and a loss of 24.56%. The convolutional neural network (CNN) yielded a Dice score of 87.45%, a precision of 87.32%, an accuracy of 88.98%, a loss of 25.43%, and sensitivity and specificity of 87.43% and 88.29%, respectively. The base learner models VGG16 and ResNet50 were also evaluated. The initially trained VGG16 base learner produced an accuracy of 85.01%, a loss of 20.03%, a sensitivity of 85.23%, a specificity of 86.78%, a Dice score of 81.10%, a Jaccard index of 87.20%, and a recall of 84.20%. The ResNet50 base learner achieved an accuracy of 89.90%, an IoU of 89.05%, a Dice score of 90.12%, a recall of 88.50%, a loss of 18.20%, and a sensitivity and specificity of 87.86% and 88.54%, respectively. Table 4 shows a comparison between the models based on different performance parameters.
Figure 11 depicts a graphical comparison of the different models in this paper based on the key parameters: accuracy (ACC), precision (PRE), sensitivity (SE), specificity (SP), and intersection over union (IoU), or Jaccard index. The proposed ensemble model performs well across these parameters, with an accuracy of 97.86%, a precision of 97.31%, a sensitivity of 96.17%, a specificity of 96.97%, and a Jaccard index (IoU) of 95.02%.
Comparison between models based on performance metrics

Visual representation of Model performance based on parameters.
The proposed ensemble model takes in the outputs of two base learner models, namely VGG16 and ResNet50. The meta-learner model, or ensemble CNN, is trained with multiple hyperparameter values, such as the number of epochs, learning rate, batch size, and training factor, and the resulting accuracy and loss values are recorded. The maximum accuracy of 97.86% was obtained with the Adam optimizer trained for 10 epochs with a learning rate of 0.00001, a training factor of 0.35, and a batch size of 32, producing a loss of 1.12%. The model was initially trained with a learning rate of 0.001 and a batch size of 32 per total sample, which yielded an accuracy of 85.88% and a loss of 3.42%. Minimum losses of 1.24% and 1.21% were obtained with the Adam optimizer at learning rates of 0.0001 and 0.00001, respectively. The training factor was initially held at 0.3 and then increased by 0.05, which helped to achieve an accuracy of 97.86%. The model was consistently trained with the Adam optimizer to achieve a steady increase in the training curve. The batch size was varied between 32 and 64 depending on the learning rate, with the model being trained for 25, 15, and 10 epochs in turn to achieve better results, as tabulated in Table 5.
Table 6 shows the performance of the proposed ensemble model through accuracy and loss comparison graphs for different numbers of training epochs. A maximum accuracy of 97.86% and a minimum loss of 1.12% were obtained when the model was trained for 10 epochs. An early stopping criterion was also established to ensure that the model neither overfit nor underfit. The accuracy curve rises steeply from epoch to epoch, which shows the model learning effective parameters to produce sensible outputs.
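An early stopping criterion of the kind mentioned above is commonly implemented by monitoring the validation loss and halting once it has not improved for a fixed number of epochs. The sketch below illustrates that logic; the patience value and the loss sequence are assumptions for the example, since the paper does not report its stopping settings.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stopped_at) for a sequence of validation losses.

    Training stops at the first epoch where the loss has failed to improve
    by more than min_delta for `patience` consecutive epochs.
    """
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation-loss curve.
losses = [0.90, 0.55, 0.31, 0.20, 0.22, 0.21, 0.25, 0.19]
best_epoch, stopped_at = early_stopping_epoch(losses, patience=3)
```

With these values the best loss occurs at epoch 3 and training halts at epoch 6, so the later improvement at epoch 7 is never reached; a larger patience trades extra training time for the chance to find it.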
Table 7 depicts the various metric values obtained as a result of training the ensemble model with hyperparameter tuning. The ensemble CNN with a batch size of 32 trained with a learning rate of 0.00001 and a total of 2000 test samples yields a precision of 97.31%, a recall value of 97.67%, and an F1 score of 97.51%.
Ensemble model performance with associated hyperparameters
Accuracy and loss comparison of ensemble model
Performance metrics achieved with the ensemble and comparison models
As an illustration, consider a count of 100 samples, of which the model predicts around 98 correctly as the positive class (TP: true positives) and 95 correctly as the negative class (TN: true negatives), while 2 benign samples are wrongly predicted as the positive class (FP: false positives).
Figure 12 represents the dice analysis of the proposed work for sarcoma analysis and classification using multiple models, with the number of samples along the X-axis and the observed dice value along the Y-axis.

Dice coefficient analysis.
The proposed model achieved a Dice score of 96.35% on the custom dataset. This shows that the Dice coefficient of the proposed model is higher than that of the traditional models. Similar results were observed for the IoU, or Jaccard index, as shown in Fig. 13.

Jaccard index analysis.
The model is also trained using the traditional datasets HAM10000 (“Human against Machine with 10,000 training images”) and ISIC Challenge Archive 2019: Skin Lesion Dataset. Several works have been carried out using the existing dataset, and a custom dataset was thereby crafted. Table 8 shows the results obtained by employing the models on the traditional dataset.
Table 9 describes the comparison performance of the model tabulating the results of various optimizers based on testing accuracy (T_Acc), Validation loss (V_Loss), Validation Accuracy (V_Acc), MSE and RMSE scores using the optimizers (RMSProp, Adam, SGD, Adamax). A constant learning rate of 0.001 is maintained for the model evaluation. The performance is classified in terms of the dataset used (ISIC Benign, ISIC Malignant, HAM 10000). The custom dataset showed better performance when evaluated, with a maximum validation accuracy of 97%.
The effectiveness of the proposed work is supported by the comparison in Fig. 14 of the accuracies obtained from the models. A maximum accuracy of 97.86% was achieved by the proposed model when trained with a learning rate of 0.00001, a training factor of 0.35, and a batch size of 32, which yielded a minimal loss of around 1.1%. The malignant accuracy (ML_ACC) was around 97.87%, where the true samples were accurately predicted with the factual class. The benign accuracy (BN_ACC) was approximately 97.32%, where the false class was correctly predicted by the proposed ensemble model. An average accuracy of 96.45% was obtained and consistently maintained throughout training, with neither overfitting nor underfitting. The ROC factor for the proposed ensemble model was around 97.512, as shown in Table 10. The model parameters, namely maximum accuracy (MAX_ACC), malignant accuracy (ML_ACC), benign accuracy (BN_ACC), and average accuracy (AVG_ACC), for the comparison and proposed models (CNN, VGG16, ResNet50, and ensemble CNN) are tabulated in Table 10.

Accuracy plot and analysis.
Traditional models against the conventional dataset
Model performance using primary evaluation metrics
Measure and accuracy parameter validation for comparison models
The evaluation metrics reveal that the model indeed performs well when compared with the other models under study.
The proposed ensemble model is also analyzed statistically using a series of hypothesis tests. First, the chi-square test is performed, as there are two categorical variables (one-hot encoded lesion type and nature of lesion). This test determines whether there is a significant relationship between the target variables.
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \tag{1}$$
Eq. (1) denotes the formula for the chi-square test, where $O_i$ denotes the observed values and $E_i$ denotes the expected values. The p-value can be calculated using the following algorithm:
for ob, ex in series of [(observed, expected)]:
    chi_sq = sum((ob - ex)^2 / ex)
chi_sq_stat = chi_sq[0] + chi_sq[1]
p_val = 1 - chisqdist(x = chi_sq_stat)
Based on Algorithm 1, chi_sq is the chi-square value from which the chi-square statistic chi_sq_stat is calculated. The p-value p_val is then evaluated at a significance level of 0.05. If the p-value is less than 0.05, we reject the null hypothesis that the two models under examination are comparable and conclude that they are statistically distinct.
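The chi-square computation can be sketched with the Python standard library alone. The example assumes one degree of freedom, for which the upper-tail probability of the chi-square distribution equals erfc(sqrt(x / 2)); the observed and expected counts are hypothetical and the function names are illustrative.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_1dof(stat):
    """Upper-tail probability of the chi-square distribution, 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts for two categories.
observed = [60, 40]
expected = [50, 50]
stat = chi_square_statistic(observed, expected)   # 100/50 + 100/50 = 4.0
p_val = chi_square_p_value_1dof(stat)
reject_null = p_val < 0.05
```

For tables with more than one degree of freedom a general chi-square survival function from a statistics library would be needed instead of the erfc shortcut.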
McNemar's test provides a non-parametric examination of the distribution of paired nominal data. A lower p-value is preferred because it indicates a lower likelihood that the two models are comparable. Table 11 reveals that the meta-learner (ensemble model) is statistically different from the given base learners.
Statistical analysis based on the chi-square test for the ensemble model
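For reference, McNemar's test compares two classifiers using only the cases on which they disagree. The sketch below uses the uncorrected statistic (b - c)^2 / (b + c) with a chi-square distribution of one degree of freedom; the disagreement counts are hypothetical, and whether the study applied a continuity correction is not stated.

```python
import math

def mcnemar(b, c):
    """McNemar's test without continuity correction.

    b: cases model A classified correctly and model B incorrectly.
    c: cases model B classified correctly and model A incorrectly.
    """
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))   # chi-square, 1 degree of freedom
    return stat, p_value

# Hypothetical disagreement counts between the ensemble and a base learner.
stat, p_value = mcnemar(b=25, c=5)
```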
Accuracy, precision, and recall are examples of aggregate measures used to test and develop machine learning (ML) models. These metrics cover the performance of the model on the complete dataset. However, they do not go into the specifics of the errors, which would help to fix the training and test sets; they only help to adjust the overall model performance through further modification of the algorithms. Error analysis aids in the development of responsible ML models by revealing whether the model behaves incorrectly more frequently for particular protected variables or classes. Table 12 displays the confusion matrix for a fold of 1000 images per dataset.
Confusion matrix –True / False values

Convolutional layer outputs of ensemble CNN.
From the confusion matrix, it can be observed that the ensemble model performed well, with high true positive and true negative counts and few false negatives and false positives. Further error analysis based on the data is tabulated in Table 13. The kappa coefficient gauges the agreement between the classification and the ground-truth values. A kappa value of 1 represents perfect agreement, whereas a value of 0 indicates agreement no better than chance. Misclassified classes therefore have low kappa values in the table. The error of commission rate is the percentage of values that were predicted to belong to a class but do not; it serves as an indicator of false positives. The confidence score indicates the correctness of classification, expressed as a percentage.
Error analysis on the sub-sample data
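The kappa coefficient and the error of commission can both be computed from a binary confusion matrix using their standard definitions. The sketch below is illustrative only; the counts are hypothetical and do not come from Table 12 or Table 13.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected)

def commission_error(tp, fp):
    """Share of samples assigned to a class that do not belong to it."""
    return fp / (tp + fp)

# Hypothetical counts for illustration.
kappa = cohen_kappa(tp=45, fp=5, fn=5, tn=45)     # 0.8
commission = commission_error(tp=45, fp=5)        # 0.1
```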
From the analysis and comparisons above, it is evident that the proposed ensemble architecture performed strongly in the analysis of benign and malignant sarcoma lesions.
Our proposed work presents an efficient methodology that uses an ensemble approach to detect Kaposi sarcoma, a vascular lesion-based skin disease prominently found in AIDS patients. The proposed architecture combines the outputs of the traditional VGG16 and ResNet50 architectures; for large and variable image sizes, it demands less hardware and less processing time. Moreover, a comparison is made between the traditional architectures and the ensemble model, and the ensemble model outperforms the traditional classifiers. Our model could be improved by fine-tuning more layers to achieve better precision and accuracy. The results achieved could be combined with other models in the future for better prediction and accuracy.
