Abstract
Diabetic retinopathy (DR) is one of the leading causes of blindness. However, because the class distribution of the data is often imbalanced, automated early DR detection with deep learning techniques remains challenging. In this paper, we propose an adaptive weighted ensemble learning method for DR detection based on optical coherence tomography (OCT) images. Specifically, we develop an ensemble learning model based on three advanced deep learning models for higher performance. To better utilize the cues implied in these base models, a novel decision fusion scheme is proposed based on the Bayesian theory in terms of the key evaluation indicators. It dynamically adjusts the weighting distribution of the base models to alleviate the negative effects potentially caused by unbalanced class sizes. Extensive experiments are performed on two public datasets to verify the effectiveness of the proposed method. We obtain a quadratic weighted kappa of 0.8487 and an accuracy of 0.9343 on the DRAC2022 dataset, and a quadratic weighted kappa of 0.9007 and an accuracy of 0.8956 on the APTOS2019 dataset. The results demonstrate that our method enhances the overall performance of DR detection on OCT images.
Introduction
Diabetes affects 200 million people worldwide, including 20 million in the United States alone. Diabetic retinopathy (DR) disease, a specific microvascular complication of diabetes, is the leading cause of blindness in the US working-age population [1, 2]. The prevalence of DR increases with the duration of diabetes and almost all diabetic patients have some retinopathy after 20 years [3]. Studies have shown that the use of optical coherence tomography (OCT) is helpful in measuring the thickness of DR areas and in detecting and quantifying the extent of edema. However, OCT false positives are often subclinical diabetic macular edema (DME) cases that cannot be detected clinically. Therefore, the risk of disease progression is still increased [4–9].
Deep learning methods have shown great success in almost all machine learning related areas. In particular, convolutional neural networks have recently been widely used to diagnose DR [10–13] by analyzing fundus images [14, 15]. For instance, a deep learning model is often combined with a texture feature extraction model, or a deep learning based diagnosis model is trained, to provide the feature representation of diabetic retinal images from which the diagnostic decisions are made [16–18].
Although a range of research efforts have been devoted to improving the overall performance of diabetic retinal diagnostic models, these methods tend to predict the classes with a large number of samples, and they rarely work satisfactorily for the classes with a relatively small proportion of samples. For DR detection, the number of samples of normal eyes in the dataset far exceeds the number of samples with diabetic retinopathy. This makes it easier for the deep learning models to learn the features of normal examples during training, while the feature representation for diabetic retinopathy is relatively insufficient. The classification model therefore tends to predict the normal class due to the dominance of normal samples, whereas the classes corresponding to the fundus images with lesions cannot be trained well. Thus, the imbalanced data may degrade the classification performance. Besides, the commonly used classifiers and stable models are sensitive to noise and outliers, which may lead to poor classification results. Moreover, due to the weak generalization ability of the classifier, the classification result may be sensitive to disturbances and changes in the input data, which affects the robustness of the model. Ensemble learning can effectively deal with the aforementioned problems. By combining the prediction results of multiple base classifiers, the overall recognition and generalization ability can be enhanced while reducing the risk of overfitting. More importantly, the Quadratic Weighted Kappa (QWK) score has been shown to be a very important indicator for all stages of DR disease. The major causes of poor QWK scores are an imbalanced distribution of classes or poor image quality [19].
Inspired by this, in the present work, we propose an adaptive weighted ensemble model to facilitate DR diagnosis based on OCT images. Specifically, we build an ensemble learning model consisting of three base models (i.e. Swin Transformer, FocalNet and ResNet) for better performance. To better utilize the cues provided by each base model, we propose a novel adaptive weighted fusion strategy based on the Bayesian theory. With this strategy, especially in the case of unbalanced samples, a model with a high QWK is assigned a relatively high weight for the minority class. By giving higher weights to the learners of the minority classes, more attention is paid to the predictive ability on those classes. Thus, the strategy effectively tackles the problem of an imbalanced distribution of classes.
Our major contributions can therefore be summarized as follows: i) We propose an adaptive weighted ensemble model for the task of DR detection, to reduce the risk of overfitting and promote the performance of the model on unseen data. ii) We put forward a novel decision fusion method in terms of Bayesian theory, which can effectively alleviate the problem of unbalanced sample size. iii) Experimental results on public datasets demonstrate the superiority of our ensemble learning model and the proposed fusion strategy.
Related work
Deep learning plays a vital role in various computer vision tasks, and it has been widely used in the field of medical imaging for diagnosis. In Adriman’s study [20], deep learning and texture feature extraction methods were combined to classify DR. The feature extraction process in this method is performed in the convolutional layer and the pooling layer, which is not sensitive to the size of the input data. Gayathri et al. [21] proposed using a multi-channel convolutional neural network together with a conventional machine learning classifier for the task of DR classification. Gangwar et al. [22] developed an Inception-ResNet-v2 model based on transfer learning for automatic DR detection. Kassani et al. [23] presented a modified Xception architecture for the diagnosis of DR disease. These methods do not explicitly account for the problem of unbalanced data distribution. The success of ensemble learning in recognition tasks inspired Oh et al. [24] to employ ensemble learning and active sample selection to solve the imbalanced biomedical data classification problem and improve the prediction and generalization ability of the model. Active sample selection means building a classifier by starting from a small balanced subset of the training data and training the classifier iteratively by adding informative examples to the current training set. It addresses the imbalanced data problem by iteratively sampling useful training examples from the entire training data. The process is an iterative procedure that includes a training phase and an example selection phase. In the training phase, the model is trained using only the examples in the current training set. In the example selection phase, informative examples from the rest of the data are added to the training set according to a predefined measure of their usefulness with respect to the current model.
However, this active sample selection method is easily affected by the sample selection strategy, which can lead to information bias or the neglect of important samples.
In this paper, we propose an adaptive weighted ensemble model based on three base models for better performance in the task of DR diagnosis. For this purpose, a novel decision fusion strategy is further introduced to dynamically adjust the weight distribution of these base models to alleviate the negative effects potentially caused by the problem of unbalanced sample size. Specifically, the weight of each base model in the decision fusion is calculated based on the Bayesian theory in terms of the key evaluation indicators after each iteration. Thus, the base models are able to cooperate with and reinforce each other to improve the robustness and generalization ability of the overall model, and the introduction of adaptive weights improves the classification performance on the classes with a small sample size.
Proposed method
The overall architecture of our proposed adaptive weighted ensemble model is shown in Fig. 1. It is primarily composed of three parts: data preprocessing, model training, and decision fusion. Data preprocessing and enhancement are conducted in the first part. The outputs of this part are fed into the model training module to train and iterate through three base models (i.e. Swin Transformer, FocalNet and ResNet50). To achieve better performance, we employ these base models to extract the discriminative features for all three categories of the data. The Swin Transformer improves its ability to process global information by introducing shifted windows. The FocalNet uses hierarchical context aggregation to support multi-scale image classification, making the model more robust to images of different scales and improving classification performance. The ResNet50 uses residual connections to mitigate the degradation and vanishing-gradient problems that arise when training deep neural networks. To better utilize the cues contained in each base model, we record the loss, accuracy, and quadratic weighted Kappa for each base model during each verification round. In terms of these indicators, we calculate and dynamically update the weights of each model based on our proposed adaptive weighted fusion strategy. The results of these three models are concatenated and fused in the decision fusion part, so as to enhance the classification performance despite the imbalanced distribution of classes.

The architecture of the proposed adaptive weighted ensemble learning model.
For the raw data, we randomly rotate the image and flip it horizontally and vertically by using image processing functions (e.g. RandomRotation, RandomHorizontalFlip and RandomVerticalFlip). This augmentation provides mirrored and symmetric variations, increases the diversity of the data, and helps the model better understand images from different perspectives and orientations. We also employ AutoAugment for data augmentation. Specifically, we add random rotations at arbitrary angles between –15° and +15°, and we assign a 60% horizontal flip probability and a 40% vertical flip probability to the data of each epoch. Both operations aim to increase the randomness of the data and thus improve the generalization ability and robustness of the model. Data enhancement and normalization are further applied to facilitate the subsequent feature extraction. Figure 2 shows three representative examples of image preprocessing.
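As an illustration of the flip probabilities stated above, the following is a minimal NumPy sketch, not the pipeline actually used in this work. The function names `random_flip` and `normalize` are hypothetical, the rotation and AutoAugment steps are omitted, and the per-channel mean and standard deviation are left as parameters because the text does not state them.

```python
import numpy as np


def random_flip(image, rng, p_horizontal=0.6, p_vertical=0.4):
    """Flip an (H, W) or (H, W, C) image array with the given probabilities."""
    out = image
    if rng.random() < p_horizontal:
        out = out[:, ::-1]  # mirror left-right
    if rng.random() < p_vertical:
        out = out[::-1, :]  # mirror top-bottom
    return np.ascontiguousarray(out)


def normalize(image, mean, std):
    """Scale pixel values to [0, 1] and standardize them (assumed 8-bit input)."""
    scaled = image.astype(np.float64) / 255.0
    return (scaled - mean) / std


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = normalize(random_flip(image, rng), mean=0.5, std=0.5)
```

In a PyTorch pipeline, the transforms named in the text would typically play the role of `random_flip`, with the rotation range and flip probabilities passed as their arguments.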

Representative examples of image preprocessing. From left to right columns, the images are (a) raw data, (b) random rotated and flipped images, and (c) enhanced images, respectively. Non-DR denotes non-diabetic retinopathy, NPDR means non-proliferative diabetic retinopathy, and PDR represents proliferative diabetic retinopathy.
For better performance, three base models (Swin Transformer, FocalNet and ResNet50) are assembled in the ensemble learning model to accomplish the task of DR detection. Swin Transformer provides a better global view by continuously dividing the image into smaller patches and enabling the patches to interact with each other. Compared with traditional convolutional neural networks, Swin Transformers are more accurate when dealing with large-scale objects, that is, they are good at extracting global semantic information. By introducing shifted windows, the Swin Transformer pays more attention to the relationship between different areas in the lesion image when learning data features. In DR classification, this means that the model can more effectively capture the distribution of normal blood vessels and diseased areas spread across the fundus image, allowing the network to understand the contextual information of the entire retina more comprehensively. Config.1 illustrates the detailed weight configuration of Swin Transformer. FocalNet uses hierarchical context aggregation to support multi-scale image classification, which makes the model more robust to images of different scales and improves classification performance. Config.2 gives the configuration of FocalNet. The residual block in ResNet50 has a clear information flow, making the decision-making process easy to explain and understand. The detailed architecture of ResNet50 is given in Table 1. By leveraging each model’s strengths, the overall performance can be enhanced significantly in comparison with a single model. The weights pre-trained on ImageNet are loaded into the Swin Transformer and FocalNet, and the ResNet50 implementation provided in torchvision is employed. We rebuild the final connection layers to match our classification task.
Structure of ResNet
During the training stage, SoftTargetCrossEntropy is used as the loss function so that the model learns and fits the data better. Meanwhile, Apex’s automatic mixed precision (AMP) is introduced to speed up the model training without loss of accuracy. Mixup is further employed to increase the diversity of the data.
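To make the interaction between Mixup and a soft-target cross-entropy concrete, a minimal NumPy sketch is given below. It is not the training code of this work: the function names are hypothetical, the Beta parameter `alpha=0.2` is an assumed value that the text does not report, and the loss is written in the usual form, i.e. the batch mean of the negative sum of target times log-softmax.

```python
import numpy as np


def mixup(images, one_hot_labels, rng, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y


def soft_target_cross_entropy(logits, soft_targets):
    """Cross-entropy against probability-vector targets, averaged over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float((-soft_targets * log_probs).sum(axis=1).mean())


rng = np.random.default_rng(0)
images = rng.normal(size=(4, 3, 8, 8))   # toy batch
labels = np.eye(3)[[0, 1, 2, 0]]         # three classes, one-hot
mixed_images, mixed_labels = mixup(images, labels, rng)
loss = soft_target_cross_entropy(rng.normal(size=(4, 3)), mixed_labels)
```

Because Mixup produces label vectors that are no longer one-hot, a loss that accepts soft targets is needed, which is consistent with the choice of loss described above.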
We calculate the loss, accuracy, and quadratic weighted Kappa for each sub-model during each verification round. The custom AverageMeter class is used to record and manage these indicators. In terms of these indicators, we calculate and dynamically update the weights of each model based on our proposed fusion strategy. Details of the proposed adaptive weighted decision fusion strategy are explained and discussed in the following subsection.
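The bookkeeping described above can be sketched as follows. This follows the running-average pattern that is common in PyTorch training scripts and is an assumption about what the custom AverageMeter class looks like, since its implementation is not given in the text.

```python
class AverageMeter:
    """Track the latest value and the running average of a metric."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0    # most recent value
        self.sum = 0.0    # weighted sum of all values seen so far
        self.count = 0    # number of samples seen so far
        self.avg = 0.0    # running average

    def update(self, val, n=1):
        """Record a batch-level value `val` computed over `n` samples."""
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


# One meter per indicator and per base model, updated once per validation batch.
loss_meter = AverageMeter()
loss_meter.update(0.5, n=16)
loss_meter.update(1.0, n=16)
```

Weighting each update by the batch size keeps the average correct when the last batch of an epoch is smaller than the others.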
To better accomplish the decision fusion task, we develop a novel dynamic weight updating method based on the Bayesian theory. The key evaluation criteria collected after each verification round are the QWK, accuracy, and loss values. The loss value of each model is used as the prior probability of the fusion strategy, denoted as $\omega_{\mathrm{loss},i}$. We normalize it backward to obtain
To prevent
Further, the normalized QWK score of the overall model under the current loss ratio is denoted as $K_i$. According to the total probability formula, the proportion distribution of the QWK of each model can be obtained:
Let the accuracy of each model be denoted as $A_i$. It is used as the initial weight for each model and is multiplied with $P_i$ element by element to obtain the updated weight array. This array is further added to $P_i$ to get the weight, which prevents an individual model from overfitting.
Compared to directly using $P_i$ as the weight of each model, the updated weight $W_i$ is adjusted by using the validation accuracy of each model as a factor to tune each weight individually. This personalized weight adjustment makes the model better adapt to the requirements of specific tasks and promotes the flexibility and generalization ability of the model.
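The weight update described in this subsection can be sketched in a few lines of Python. Several details are assumptions made only for illustration, because the corresponding formulas are not reproduced here: the backward normalization of the loss is taken to be an inverse-loss normalization, the QWK proportion $P_i$ is taken to be the loss prior times $K_i$ divided by the sum of that product over all models, the final weight is taken to be $W_i = A_i P_i + P_i$, and the weights are rescaled to sum to one. The function name `adaptive_weights` is hypothetical, and positive QWK values are assumed.

```python
def adaptive_weights(losses, qwks, accuracies, eps=1e-8):
    """Return one fusion weight per base model (assumed formulation, see lead-in)."""
    # Prior from the loss: a lower loss gives a larger prior (assumed inverse normalization).
    inverse = [1.0 / (loss + eps) for loss in losses]
    prior = [value / sum(inverse) for value in inverse]

    # Total-probability style normalization of prior times QWK (assumed form of P_i).
    joint = [p * k for p, k in zip(prior, qwks)]
    posterior = [j / sum(joint) for j in joint]

    # W_i = A_i * P_i + P_i, following the prose description of the update.
    raw = [a * p + p for a, p in zip(accuracies, posterior)]

    # Rescale so that the weights sum to one (assumption).
    return [r / sum(raw) for r in raw]


# Hypothetical validation statistics for three base models.
weights = adaptive_weights(
    losses=[0.40, 0.50, 0.80],
    qwks=[0.85, 0.80, 0.70],
    accuracies=[0.93, 0.90, 0.85],
)
```

Under these assumptions, a model with a lower loss, a higher QWK and a higher accuracy receives a larger share of the final decision, and three models with identical statistics receive equal weights.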
Dataset
In the present work, we conduct experiments on two retinal image datasets to verify the effectiveness of our method. The Diabetic Retinopathy Analysis Challenge 2022 (DRAC2022) [25] dataset is from Shanghai Sixth People’s Hospital and SVision Imaging, Ltd, containing a total of 611 images. It is the first version related to MICCAI 2022. It consists of 329 Non-DR, 212 NPDR, 70 PDR, and 386 unlabeled testing data for DR classification task. The NPDR usually manifests as microaneurysm, vein beading and retinal microvascular abnormalities. The damage to the blood vessel walls can lead to blood vessel leakage, causing retinal tissue edema. The Non-DR shows no abnormality. The vitreous body inside the eyeball should be transparent, and the blood vessels should show a regular and clear branching pattern. The PDR is characterized by neurovascular complications, such as vitreous hemorrhage, intravitreal neovascularization, and retinal traction detachment, which can cause blindness. The major difference between PDR and NPDR is that PDR is featured with new and abnormal blood vessels, while the main manifestation of NPDR is microvascular disease. More importantly, with this dataset, different algorithms can test their performance and make fair comparisons with other algorithms. We believe this dataset is an important milestone for automatic image quality assessment, lesion segmentation, and DR grading.
Another dataset used in this work was provided by the DR Detection Challenge (APTOS 2019) [26] sponsored by the Asia-Pacific Tele-Ophthalmology Society in 2019. This dataset contains a series of fundus images for classifying and grading DR. Each image has been marked by a clinical physician and divided into five grades according to severity: "0" means no lesion (No_DR), "1" means mild non-proliferative diabetic retinopathy (Mild), "2" means moderate non-proliferative diabetic retinopathy (Moderate), "3" indicates severe non-proliferative diabetic retinopathy (Severe), and "4" indicates proliferative diabetic retinopathy (Proliferative). The goal of this dataset is to classify fundus images through machine learning and deep learning algorithms to help physicians in the early diagnosis and treatment of DR disease. The image sizes of the experimental datasets of both DRAC2022 and APTOS 2019 are all 224×224 pixels.
Experiment setup
During the training stage, we update all weights with a batch size of 16 in each iteration. The number of training epochs, the learning rate and num_workers are 80, $10^{-4}$ and 4, respectively. The ImageNet-1K weights officially provided for FocalNet and Swin Transformer are used as the initial weights of the models. The training data is divided into two subsets, the training set and the validation set. As suggested by previous works [27, 28], 80% of the data are used for training and 20% for validation.
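For reference, the settings above can be collected in a small configuration object, together with one possible way to produce the 80/20 split. This is a sketch only: the names `TrainConfig` and `split_indices` are hypothetical, and a seeded random (non-stratified) split is an assumption, since the text does not say how the split was drawn.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 16
    epochs: int = 80
    learning_rate: float = 1e-4
    num_workers: int = 4
    train_fraction: float = 0.8


def split_indices(num_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into training and validation parts."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(round(num_samples * train_fraction))
    return indices[:cut], indices[cut:]


config = TrainConfig()
# 611 is the number of labeled DRAC2022 images reported in the Dataset section.
train_idx, val_idx = split_indices(611, config.train_fraction)
```

With heavily imbalanced classes, a stratified split that preserves the class proportions in both subsets may be preferable to the plain random split shown here.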
Regarding the experiments, we first carry out an ablation experiment to compare the performance between the proposed ensemble learning model and single deep learning models. To further verify the effectiveness of our proposed adaptive weighted fusion strategy, we perform comparative experiments with our method and the commonly used decision fusion approaches. Additionally, we also compare our experimental results with the previous related research work for the task of diagnosis of DR disease.
We employ four widely used metrics to assess the model performance: accuracy (ACC), precision, Quadratic Weighted Kappa score (Wk), and AUC value. The formulas of these metrics are as follows. Accuracy is defined as the ratio of correctly predicted labels to the total number of labels.
Precision is the proportion of positive predictions that are correct. In other words, it denotes the ratio of actually positive samples to all positive predictions. High precision is crucial for reducing misdiagnosis in DR classification, as it reduces the risk of incorrectly labeling healthy samples as diseased.
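For completeness, the two definitions above can be written in their standard binary form, where $TP$, $TN$, $FP$ and $FN$ denote the numbers of true positives, true negatives, false positives and false negatives. How these quantities are aggregated over the classes in the multi-class setting is not stated in the text, so the expressions below should be read per class.

```latex
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
```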
In the definition of Wk, $w_{i,j}$ is the weight matrix calculated based on the difference between the actual label $i$ and the predicted label $j$. $O_{i,j}$ represents the number of samples of the $i$-th category that are classified as the $j$-th category. $E$ is calculated as the outer product between the actual histogram vector of outcomes and the predicted histogram vector. Wk is a ratio between –1 and 1 that measures the consistency between model predictions and actual labels. The closer Wk is to 1, the better the model’s predictions are.
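The quantities defined above combine in the commonly used form of the quadratic weighted kappa, $W_k = 1 - \sum_{i,j} w_{i,j} O_{i,j} / \sum_{i,j} w_{i,j} E_{i,j}$ with $w_{i,j} = (i-j)^2/(N-1)^2$ for $N$ classes. The NumPy sketch below implements this standard form; it is given for illustration and assumes integer class labels in the range 0 to $N-1$. The function name is hypothetical.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa from integer labels in [0, num_classes)."""
    # O[i, j]: number of samples with actual label i and predicted label j.
    observed = np.zeros((num_classes, num_classes))
    for actual, predicted in zip(y_true, y_pred):
        observed[actual, predicted] += 1

    # w[i, j]: squared label distance, scaled to [0, 1].
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2

    # E: outer product of the two label histograms, scaled to the same total as O.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


kappa_perfect = quadratic_weighted_kappa([0, 1, 2, 1, 0], [0, 1, 2, 1, 0], 3)
kappa_reversed = quadratic_weighted_kappa([0, 0, 2, 2], [2, 2, 0, 0], 3)
```

The two toy cases illustrate the range of the score: identical predictions give 1, and predictions that are systematically at the opposite end of the scale give –1.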
AUC (area under the ROC curve) represents the degree of separability. It is an important evaluation metric for checking the classification model’s performance, as it indicates how well the model is capable of distinguishing between classes. The higher the AUC value, the better the model is at distinguishing between patients with DR and those without the disease.
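For the binary case, the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal pure-Python sketch of this pairwise formulation follows; the function name is hypothetical, and the way the per-class values are averaged in the multi-class experiments is not specified in the text.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs that are ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


auc = binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs are ordered correctly
```

This quadratic-time version is meant only to make the definition explicit; rank-based implementations are preferable for large validation sets.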
Comparison with single models
On the DRAC2022 dataset, we submit the results of the testing data with our proposed ensemble model to the official site for verification. A Wk value of 0.8487 and an AUC value of 0.9002 are obtained. Meanwhile, the accuracy and precision of the proposed method are 0.9343 and 0.9214, respectively. We compare the performance of the proposed ensemble learning model with that of the corresponding single models. Additionally, two other state-of-the-art object detection models are also developed for comparison. The performance of the different models is shown in Table 2. Among the four indicators, our method achieves the best performance in terms of accuracy, Wk and precision, and a comparably good AUC value. We also list several representative predictive instances of DR classification on the DRAC2022 dataset (see Table 4). These results demonstrate that the ensemble learning method has advantages over the single deep learning methods.
Performance of different models on the DRAC2022 dataset
Performance of different models on the APTOS 2019 dataset
Prediction results of different models on DRAC2022 dataset
To further evaluate the proposed method, we also test it on the APTOS 2019 dataset. Table 3 shows the experimental results of different models. Our method achieves an accuracy of 0.8956, a QWK of 0.9007, an AUC value of 0.8942, and a precision of 0.8458. Representative instances of DR classification on the APTOS 2019 dataset are shown in Table 5. The performance difference between our method and the other methods further illustrates the effect of our proposed ensemble learning approach, indicating that ensemble learning is helpful in enhancing the performance of DR classification.
Prediction results of different models on the APTOS2019 dataset
We have verified the effect of ensemble learning on the performance of the classification model. In this section, we compare our method with other decision fusion strategies, including simple average, weighted average, and Bayesian average. In the simple average, the outputs of the base models are summed and then divided by the number of base models to produce the final output, that is, each model is allocated an equal weight. The weighted average allows each model to be assigned its own weight. The Bayesian average takes uncertainty into account: it uses Bayesian statistical analysis to combine information from different sources, resulting in a more accurate decision. These strategies are separately applied to both the DRAC2022 and APTOS 2019 datasets. The comparison results are recorded in Tables 6 and 7. Examples of the corresponding DR classification results are illustrated in Tables 8 and 9, respectively. It can be observed that the proposed adaptive weighted fusion strategy outperforms the conventional fusion schemes.
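The difference between the simple and the weighted average can be made concrete with a short NumPy sketch that fuses the class probabilities of several base models. The function name `fuse_predictions` and the toy probabilities are hypothetical and serve only to illustrate the two baseline schemes; they are not results from this work.

```python
import numpy as np


def fuse_predictions(probabilities, weights=None):
    """Fuse class probabilities of shape (models, samples, classes).

    With weights=None every model gets the same weight (simple average);
    otherwise the given per-model weights are normalized and applied.
    """
    probs = np.asarray(probabilities, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, probs, axes=1)  # result shape: (samples, classes)
    return fused, fused.argmax(axis=1)


# Toy outputs of three base models for a single sample and three classes.
probabilities = [
    [[0.6, 0.3, 0.1]],
    [[0.2, 0.5, 0.3]],
    [[0.1, 0.6, 0.3]],
]
_, simple_label = fuse_predictions(probabilities)
_, weighted_label = fuse_predictions(probabilities, weights=[0.8, 0.1, 0.1])
```

In this toy case the simple average selects class 1, whereas a weighting that trusts the first model selects class 0, which shows how strongly the choice of weights can change the fused decision.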
Performance of different decision fusion methods on the DRAC2022 dataset
Performance of different decision fusion methods on the APTOS 2019 dataset
Prediction results of different decision fusion on the DRAC2022 dataset
Prediction results of different decision fusion on the APTOS2019 dataset
Most of the previous research efforts have used deep learning techniques to detect the presence or absence of DR disease. Nevertheless, it is still a challenging problem to detect DR accurately. We therefore compare our method with related works evaluated on the same dataset. Gangwar et al. [22] applied transfer learning to a pre-trained Inception-ResNet-v2 and added customized CNN layer blocks on top of it to build a hybrid model. The accuracy obtained on APTOS 2019 is 82.18%. Kassani et al. [23] developed a modified Xception architecture as the feature extractor, combined with a multi-layer perceptron for DR severity classification, achieving a classification accuracy of 83.09% on the APTOS 2019 dataset. However, these works ignore the fact that a single model is sensitive to noise and unstable, which might lead to weak generalization ability. Thus, in our work, we employ ensemble learning based on multiple models to reduce the variance and the generalization error. Most importantly, we propose an adaptive weighted decision fusion strategy to comprehensively exploit the cues implicit in different classification models and improve the prediction. As a result, a better classification accuracy of 89.56% is obtained with our proposed approach. More specific details can be found in Table 10. Since no relevant articles were reported for the DRAC2022 dataset, we did not perform a comparative analysis on this dataset.
Performance of different decision fusion methods on the APTOS 2019 dataset
Data distribution
The data distribution is not always balanced in the real world, in which case it is difficult to achieve the best diagnostic results for the minority classes. The class distributions of the two datasets used in the experiments are illustrated in Fig. 3. The number of samples with lesions and the number of normal samples are extremely unbalanced. Without artificially increasing the data size, the weight of each model is adjusted dynamically during the training and validation stages according to the feedback of the key indicators. By setting more reasonable weights, the performance of the overall model is promoted, suggesting that the proposed fusion scheme helps to alleviate the problem of an imbalanced distribution of classes. Our proposed adaptive weighted ensemble model is able to better utilize the cues contained in each base model, identify intrinsic features in DR OCT images, and focus on them especially for the classes with a small sample size, to facilitate the diagnosis of the lesion.

Data distributions of the DRAC and APTOS datasets.
The comparison of different models indicates that the proposed method is effective and feasible for the task of DR classification, and that ensemble learning is superior to a single deep learning model. The intuition is that each single model has its own strengths and captures different aspects and patterns in the data, so different models rarely make the same mistakes on the testing dataset. Therefore, by integrating the predictions of multiple models, the ensemble model makes the classification results less sensitive to the details of training, which can reduce bias and variance and promote the overall performance and robustness. Notably, when the dataset is unbalanced and the fusion method or weight distribution is unreasonable, the ensemble may perform worse than a single model. When the simple average is used as the fusion method, equal weights are allocated to each base model. However, the emphasis of the three adopted models differs. For example, the Swin Transformer provides a strong representation of both the local and the global features that separate different categories, and its role is undermined with equal weights. In this case, the weight distribution is unreasonable. To promote the overall recognition and generalization ability of the classification model, we adopt an adaptive weight allocation strategy that dynamically adjusts the weights according to the performance of each base model. Models with better performance are assigned higher weights, thereby increasing their influence in the ensemble model (cf. Table 2 vs. Table 6 and Table 3 vs. Table 7).
Moreover, during the training stage, we collect the key indicators, in terms of which the weights of each model are dynamically determined based on the Bayesian theory. Compared with other decision fusion strategies, the proposed fusion strategy can better identify the classification-oriented information contained in the models because of the more suitable weight distribution. Better performance, reliability, and generalization capabilities can thus be achieved with the proposed approach. Overall, the proposed model outperforms the compared methods for the DR classification task.
Additionally, the proposed ensemble model makes the best of the three base models’ strengths to exploit the comprehensive and complementary information of the entire retina for higher classification performance. This could also be the key for other underlying models to enhance the overall recognition and generalization ability. The present work provides a general ensemble learning framework and decision fusion scheme for medical image classification tasks. Moreover, models with unique features in other scenarios can be introduced into the ensemble learning framework to improve its performance. The proposed method has the potential to be extended beyond diabetic retinopathy classification to other pattern recognition fields where different models need to jointly understand category-related features.
Conclusions
In this article, we propose an adaptive weighted ensemble learning network for DR detection. It is capable of automatically and dynamically assigning the weight distribution of each base model based on the Bayesian theory in terms of the key assessment criteria. Through the proposed decision fusion strategy, the network makes the greatest use of the clues provided by the multiple models to achieve better performance even in the case of an unbalanced distribution of classes. The experimental results on two datasets suggest that the proposed method outperforms the existing approaches for DR detection. We believe that the performance could be further enhanced with both improved image quality and more advanced deep ensemble learning techniques.
Acknowledgements
This research was funded by the National Natural Science Foundation of China (Grant Nos. 61902282, 62071330 and 62201385).
Data availability statement
Copies of the DRAC2022 dataset and the APTOS 2019 dataset can be obtained free of charge from https://drac22.grand-challenge.org and https://www.kaggle.com/datasets/mariaherrerot/aptos2019, respectively.
