Abstract
Background
Diabetic Retinopathy (DR) remains a leading cause of blindness among diabetic patients worldwide, necessitating early and accurate diagnostic interventions. While traditional screening methods rely heavily on manual ophthalmologic evaluations, recent advancements in machine learning (ML) and deep learning (DL) have opened new avenues for automated, scalable, and interpretable diagnostic tools. However, challenges persist in developing models that are not only high-performing but also transparent enough to gain clinical trust.
Objective
This study introduces a novel, standardized, and interpretable ML framework designed specifically to enhance diagnostic efficiency and accuracy for DR risk prediction. By prioritizing model interpretability alongside predictive performance, our approach aims to bridge the gap between cutting-edge AI technology and clinical applicability.
Methods
We evaluated eleven ML algorithms, optimizing hyperparameters via grid search and five-fold cross-validation to identify top-performing models. A key innovation lies in our dynamic weighted voting ensemble (Voting_soft), which integrates multiple classifiers based on model confidence, thereby leveraging the strengths of diverse algorithms. Model performance was rigorously assessed using accuracy, sensitivity, and area under the curve (AUC) metrics, with ROC and PR curves comparing performance across varying training dataset proportions. Crucially, we employed SHAP (SHapley Additive exPlanations) for interpretability analysis, providing clinicians with actionable insights into feature contributions.
Results
Through LightGBM-based correlation analysis and AUC curve determination, fourteen clinical features were identified as optimal predictors. Notably, the CatBoost model achieved superior performance on a 20% test set, while the Extreme Random Tree model demonstrated robustness on a 30% test set. Our dynamic weighted voting ensemble (Voting_soft) outperformed individual models in terms of AUC across both datasets. SHAP analysis revealed that age, triglycerides, sex, and HDL-C were key predictors of DR prevalence, offering clinically meaningful explanations for model decisions.
Conclusions
This study presents a groundbreaking ML-based DR risk prediction system that excels in both accuracy and interpretability. The integration of SHAP analysis not only enhances model transparency but also empowers clinicians with a deeper understanding of diagnostic decision-making, ultimately improving the precision and efficiency of DR screening. Our dynamic voting ensemble approach sets a new benchmark for interpretable, multi-model integration in medical diagnostics.
Introduction
Diabetic Retinopathy (DR) is one of the most common microvascular complications of diabetes and is the leading cause of new cases of blindness in diabetic patients. 1 Globally, more than 132 million people suffer from DR and it is the leading cause of visual impairment and blindness in the working-age population. 2 In 2020, more than 103 million people with diabetes were affected by diabetic retinopathy globally, and this number is projected to increase to 160 million by 2045. 3 Compared to all other major causes of blindness, diabetic retinopathy is the only disease whose age-standardized prevalence did not decline between 1990 and 2020. Therefore, early detection, accurate diagnosis and effective treatment of DR are important for improving patient prognosis and reducing the burden on society.
With the development of medical technology, particularly the emergence of machine learning (ML) 4 and deep learning (DL) 5 techniques, new approaches and methodologies have been introduced for the diagnosis and treatment of DR. Wang Qianwen 6 constructed a DR risk prediction model based on the XGBoost algorithm and further enhanced its predictive performance using a Stacking ensemble model, achieving an AUC value of 0.924. This study not only validated the potential of ensemble learning algorithms in DR risk prediction but also demonstrated effective strategies for improving prediction accuracy through model fusion. OY JL 7 explored genetic variables associated with DR using Mendelian randomization and developed prediction models incorporating logistic regression and machine learning algorithms, thereby enhancing the accuracy of DR diagnosis. Sunyoung Kim et al. 8 utilized electronic medical record data from three universities in South Korea to develop a machine learning algorithm based on extreme gradient boosting for predicting DR risk in patients with type 2 diabetes. The results indicated that the XGBoost model exhibited high accuracy and specificity in both the original and validation datasets. Amanda Luong et al. 9 proposed a novel retinopathy risk score (RRS) model and validated its effectiveness in identifying low-risk DR patients through comparison with machine learning models (MLMs). Zhao YL et al. 10 integrated computer vision and clinical structured data to construct a multimodal model for predicting DR referrals. The findings demonstrated that the XGBoost-based multimodal model performed exceptionally well in both internal validation and external testing sets, significantly improving the accuracy of DR referral prediction. Li BB et al. 11 developed a DR risk prediction model using a self-evolving machine learning approach that combined fundus images and clinical data, and calculated DR risk through a visualization system. Pang YH et al. 12 identified potential biomarkers for early DR, diabetic macular edema (DME), and anti-vascular endothelial growth factor (anti-VEGF) treatment responses using multi-omics data (including lipidomics and metabolomics) and machine learning algorithms. Lukashevich Marina M 13 classified fundus images of diabetic retinopathy using ensemble learning methods and provided references for developing personalized treatment plans by analyzing the performance of different algorithms.
Despite remarkable advancements in the application of machine learning for diabetic retinopathy, the poor interpretability of machine learning models poses a significant challenge. Clinicians often struggle to comprehend the decision-making processes of these models, which has consequently limited their clinical acceptance and adoption. 14
The goal of this paper is to develop a generic, standardized, and effective interpretable machine learning system for predicting the risk of DR. The study makes the following contributions. The data were preprocessed with techniques such as data cleaning and normalization. LightGBM and Random Forest were each used as feature selection methods, and the number of DR predictors was determined by combining feature selection with three models. Widely applicable machine learning algorithms were used, including Random Forest, Decision Tree, Extreme Gradient Boosting, K-Nearest Neighbors, Extreme Random Tree, Support Vector Machine, and Multilayer Perceptron. Fused machine learning algorithms were explored, using VotingClassifier to integrate multiple classification models by voting. Hyperparameters were optimized by grid search, and experiments were conducted with five-fold cross-validation. Model performance was evaluated with accuracy, precision, recall, F1 score, and area under the curve (AUC), using receiver operating characteristic (ROC) curves as well as PR curves. The SHAP technique was applied to interpret the contribution of each predictor to DR prediction. The rest of the paper is organized as follows: section 2 presents the concepts and methods, section 3 presents the results and analysis, and section 4 gives the conclusion and outlook.
Concepts and methods
Machine learning algorithms
Random Forest (RF), 15 proposed by Breiman, is an ensemble learning algorithm based on decision trees. It reduces the risk of overfitting by building multiple decision trees and aggregating their predictions; its core lies in the dual randomness of bootstrap sampling and random feature selection. Decision Tree (DT) 16 is a basic classification model that builds tree-structured rules through recursive feature partitioning; classic variants include ID3, C4.5, and CART. Extreme Gradient Boosting (XGBoost), 17 developed by T.Q. Chen's team, is an optimized implementation of gradient boosted decision trees that improves training efficiency through a second-order Taylor expansion of the loss and regularization terms, and it excels at modeling structured data. K-Nearest Neighbors (KNN) 18 is based on the idea of local approximation: it selects the K nearest neighbors under a distance metric and decides by voting, which suits small to medium-sized, low-dimensional data. Extreme Random Tree (ET), 19 a variant of Random Forest, adopts a random threshold splitting strategy, trading a small increase in bias for reduced variance. LightGBM, 20 proposed by Microsoft, optimizes large-scale data processing through a histogram algorithm and a leaf-wise growth strategy, striking a balance between speed and accuracy. CatBoost (CAT), 21 developed by Yandex, provides an ordered boosting mechanism for categorical features and a built-in missing value processing module. AdaBoost (ADA) 22 is a classical boosting algorithm that dynamically adjusts sample weights across a series of weak classifiers, ultimately forming a strong predictor. GBM 23 is a gradient boosting framework that improves model performance by iteratively optimizing the loss function, laying the foundation for later algorithms such as XGBoost. Support Vector Machine (SVM), 24 based on statistical learning theory, searches for the maximum-margin hyperplane with the help of the kernel trick, which gives it a natural advantage in high-dimensional data classification. Multilayer Perceptron (MLP), 25 a feed-forward neural network, approximates complex functions through multilayer nonlinear mappings and shows strong expressive power in fields such as image and speech processing.
Algorithm evaluation metrics
This paper adopts a comprehensive set of evaluation metrics. In addition to the traditional accuracy, recall is introduced to measure how completely the model recognizes positive cases, i.e., the ratio of correctly identified positive cases to the total number of actual positive cases. The F1 score, the harmonic mean of precision and recall, summarizes the model's classification performance with particular emphasis on the minority class. The model's ability to exclude negative cases is assessed by specificity, i.e., the proportion of correctly identified negative cases among all actual negative cases. The positive and negative predictive values (PPV and NPV) quantify the reliability of samples predicted as positive and negative, respectively. In addition, the ROC curve depicts the trade-off between sensitivity and specificity of the classifier under different thresholds; the larger the area under the ROC curve (AUC), the better the model distinguishes positive from negative samples. The Precision-Recall (PR) curve focuses on the identification of positive cases; the larger the area under the PR curve (AP), the better the model's classification performance under class imbalance. Together, these metrics form a multi-dimensional evaluation framework in which the confusion matrix serves as the basis, supporting the computation and interpretation of each metric through four basic statistics: TP, FN, FP, and TN. 26
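The relationship between the four confusion-matrix counts and the metrics above can be sketched as follows. This is a minimal illustration, not the study's evaluation code; the class name `ConfusionMatrix`, the helper `safe_div`, and the example counts are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # DR cases predicted as DR
    fn: int  # DR cases predicted as non-DR
    fp: int  # non-DR cases predicted as DR
    tn: int  # non-DR cases predicted as non-DR


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(cm):
    """Derive the metrics named in the text from TP, FN, FP, and TN."""
    precision = safe_div(cm.tp, cm.tp + cm.fp)  # same quantity as PPV
    recall = safe_div(cm.tp, cm.tp + cm.fn)     # same quantity as sensitivity
    return {
        "accuracy": safe_div(cm.tp + cm.tn, cm.tp + cm.fn + cm.fp + cm.tn),
        "recall": recall,
        "specificity": safe_div(cm.tn, cm.tn + cm.fp),
        "ppv": precision,
        "npv": safe_div(cm.tn, cm.tn + cm.fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up counts for illustration only; they are not taken from the study.
example = classification_metrics(ConfusionMatrix(tp=40, fn=10, fp=5, tn=45))
```

Returning 0.0 for an empty denominator is one possible convention; raising an error or returning NaN would be equally defensible.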
Introduction to SHAP
The SHapley Additive exPlanations (SHAP) 27 method is rooted in the Shapley value from cooperative game theory, which the economist Lloyd Shapley proposed in 1953 to address the fair distribution of payoffs among players. This study treats the model's prediction process as a cooperative game in which each feature acts as a player and the model's predictive outcome represents the collective gain. By calculating the Shapley value of each feature, SHAP quantifies the marginal contribution of that feature to the model's prediction, thereby elucidating the decision-making process. Features with positive SHAP values push the prediction toward DR, while those with negative SHAP values push it toward non-DR.
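To make the cooperative-game view concrete, the sketch below computes exact Shapley values by enumerating all coalitions, weighting each marginal contribution by |S|!(n-|S|-1)!/n!. It is only practical for a handful of players and is not how SHAP libraries work internally for tree models. The payoff function `toy_value` and its numbers are hypothetical and are not derived from the study's data.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    players: list of hashable player names (here, feature names).
    value:   function mapping a frozenset of players to the coalition payoff.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size that does not contain p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical payoff: AGE adds 0.2, TG adds 0.1, together they add 0.1 more."""
    score = 0.0
    if "AGE" in coalition:
        score += 0.2
    if "TG" in coalition:
        score += 0.1
    if "AGE" in coalition and "TG" in coalition:
        score += 0.1
    return score


phi = shapley_values(["AGE", "TG", "SEX"], toy_value)
```

In this toy game the interaction bonus is split equally between AGE and TG, SEX receives zero because it never changes the payoff, and the values sum to the payoff of the full coalition, which is the additivity property SHAP relies on.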
Results and analysis
Data sources
The original data are physical examination records from a hospital. Preprocessing comprised two main steps. (1) Missing values were imputed with the mean for continuous variables and with the mode for dichotomous variables; descriptive analyses of these attributes are shown in Table 1. (2) Variables were mapped: SEX was coded as 0 for males and 1 for females; history of high blood pressure was coded as 1 if present and 0 if absent; BMI was coded as 0 for values less than or equal to 23.9, 1 for 24.0–27.9, and 2 for values greater than or equal to 28.0. The final dataset contained 1006 samples with 15 attributes.
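The two preprocessing steps can be sketched in plain Python as below. The function names, the use of `None` to mark missing entries, and the raw labels `"male"` and `"female"` are assumptions made for illustration; the BMI cut-offs assume values recorded to one decimal place, so that "less than 24.0" matches "less than or equal to 23.9".

```python
from statistics import mean, mode


def fill_missing(values, strategy):
    """Replace None entries with the mean or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


def encode_sex(label):
    """0 = male, 1 = female, following the mapping described in the text."""
    return {"male": 0, "female": 1}[label]


def encode_bmi(bmi):
    """0: <= 23.9, 1: 24.0-27.9, 2: >= 28.0 (one-decimal BMI assumed)."""
    if bmi < 24.0:
        return 0
    if bmi < 28.0:
        return 1
    return 2


# Illustrative values only, not records from the study.
tg_filled = fill_missing([1.0, None, 3.0], "mean")        # continuous variable
hypertension_filled = fill_missing([0, 1, None, 1], "mode")  # dichotomous variable
bmi_codes = [encode_bmi(b) for b in (22.0, 25.5, 28.0)]
```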
Descriptive analysis.
Figure 1 illustrates two steps of the data preprocessing. Figure 1A shows a heat map of the correlation coefficient matrix between features. Figure 1B shows diabetic retinopathy prediction performance for models built with different numbers of selected features: the AUC is plotted against the number of features, and the error bars indicate the 95% confidence intervals. Here Random Forest was used as the feature selection method and was combined with three models (logistic regression, XGBoost, and LightGBM) for screening. As the number of features increases, the AUC gradually increases and then levels off as the remaining features are included. Therefore, 14 clinical features, such as AGE and TG, were selected as inputs to the final model.

Plot of data correlation analysis and feature selection curves.
Optimal parameter output
First, the dataset was divided into training and test sets, with 70% of the data used for training and the remaining 30% for testing. The algorithms described in subsection 2.1 were applied to the binary classification task. For each algorithm, multiple hyperparameter options were specified, the best combination was searched by grid search with 5-fold cross-validation, and the model was then trained with the optimal hyperparameters. The results are shown in Table 2.
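The search procedure can be sketched independently of any particular library: enumerate every hyperparameter combination, score each one by its mean validation score over five folds, and keep the best. In the sketch below, `fit_and_score` is a hypothetical callback that would train a model and return a validation score; `toy_scorer` and the grid values are stand-ins that train nothing. Contiguous, unshuffled folds are used for simplicity, whereas a real pipeline would probably shuffle and stratify.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def grid_search_cv(param_grid, fit_and_score, n_samples, k=5):
    """Return (best_params, best_mean_score) over the full parameter grid.

    fit_and_score(params, train_idx, val_idx) is assumed to train a model on
    the training indices and return its validation score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = mean(
            fit_and_score(params, tr, va) for tr, va in k_fold_indices(n_samples, k)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_scorer(params, train_idx, val_idx):
    """Stand-in scorer that peaks at max_depth=4, learning_rate=0.1."""
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search_cv(grid, toy_scorer, n_samples=100, k=5)
```

With scikit-learn, the same idea is available through `GridSearchCV` with `cv=5`, which is likely closer to what a practical implementation would use.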
The optimal hyperparameter combinations.
As can be seen from Table 3, the CAT model shows the best performance with a 20% test set share, with an accuracy and recall of 81.19% and 80.00%, respectively. Its F1 score (0.7912) and NPV (0.8364) rank among the highest, which indicates a strong ability to discriminate between positive and negative samples. The model showed well-balanced performance in the classification task and was able to effectively identify the risk of diabetic retinopathy (DR). The LGBM model had a high specificity of 83.93% (NPV = 0.8174); it was outstanding in controlling false positives and is suitable for scenarios sensitive to the cost of misdiagnosis. The XGB and ET models performed similarly in terms of accuracy and sensitivity, but the F1 score of ET (0.7692) is slightly lower than that of XGB (0.7760), suggesting that XGB may adapt better to class-imbalanced data. The recall/sensitivity of DT and KNN (≤71.11%) is markedly low, which may lead to more missed DR cases and a higher clinical risk. Although the accuracy of the multilayer perceptron (MLP) and support vector machine (SVM) was acceptable, their specificity (≤79.46%) was insufficient, which may increase the misdiagnosis rate in non-DR patients.
Model performance results with different test set occupancies.
When the test set share was increased to 30%, the ET model performed best, achieving the best balance with 82.45% accuracy and 82.09% sensitivity/specificity, and its NPV (0.8528) indicated an excellent ability to correctly exclude non-DR patients. The ET model maintained high classification accuracy and stability on the larger test set, indicating that it adapts well to changes in data distribution. The GBM model has a high specificity of 85.12% (NPV = 0.8266) and performs best in controlling false positives, making it suitable for misdiagnosis-sensitive clinical scenarios. The SVM model has the highest recall/sensitivity (82.84%) and therefore the strongest ability to detect DR cases (high-risk patients), but its specificity is low (79.76%), so the risks of missed diagnosis and misdiagnosis need to be weighed against each other. The RF model also performed well. DT still had the lowest recall/sensitivity (66.42%), with a significant risk of clinical underdiagnosis. The specificity (≤75.60%) and PPV (≤76.00%) of the MLP and ADA models were low, which may increase the misdiagnosis rate in non-DR patients.
The generalization performance of the models can be observed by comparing the results on the 20% and 30% test sets in Table 3. The ET model performs robustly on both test sets, with accuracy and sensitivity fluctuating by less than 2%, indicating that its predictive ability is little affected by how the samples are partitioned. The CAT model performs well on the 20% test set, but its sensitivity decreases by 2.39% on the 30% test set, which may indicate that the larger test set exposes a risk of overfitting. The XGB and LGBM models fluctuate less in performance, showing their robustness on larger samples. For the SVM model, recall increases by 2.84% while specificity decreases by 2.7% on the 30% test set, reflecting that its ability to discriminate between positive and negative samples is sensitive to changes in data volume.
VotingClassifier is an ensemble learning approach that combines several different machine learning models and decides the classification result by voting, thereby improving the predictive performance and robustness of the overall model. VotingClassifier provides two main voting strategies, hard voting and soft voting. In hard voting, each sub-model predicts a class label for each sample, and the final result is the class that receives the most votes across the sub-models. In this paper, we implement a soft voting mechanism (Voting_soft) using dynamically weighted ensemble learning based on model confidence. The ensemble weights are assigned dynamically by quantifying the uncertainty (entropy) of the predicted probabilities of each base model. First, the Shannon entropy of each base model's predicted probabilities is calculated on the validation set; a lower entropy indicates that the predicted probabilities are more concentrated and the confidence is higher. Next, the inverse of the entropy is used as the initial weight, so that low-confidence models do not dominate the ensemble result. Finally, the weights are normalized so that they sum to one. These weights are used to build the soft-voting classifier, so that the predictions of high-confidence models play a greater role in the ensemble decision. Compared with static weighting or simple majority voting, this approach captures differences in model reliability on the specific task and uses them in model fusion. The base models here are RandomForest, XGBoost, CatBoost, and ExtraTrees from subsection 3.3.2.
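The weighting scheme described above can be sketched as follows. This is one plausible reading rather than the authors' implementation: it assumes the entropy is averaged over validation samples, uses natural logarithms, and adds a small `eps` to avoid division by zero. The model names and probabilities in the example are invented.

```python
from math import log


def mean_entropy(prob_rows):
    """Average Shannon entropy (in nats) of per-sample class probabilities."""
    total = 0.0
    for row in prob_rows:
        total -= sum(p * log(p) for p in row if p > 0.0)
    return total / len(prob_rows)


def entropy_weights(val_probs_by_model, eps=1e-12):
    """Inverse-entropy weights, normalized to sum to 1 (eps is an assumption)."""
    inverse = {
        name: 1.0 / (mean_entropy(rows) + eps)
        for name, rows in val_probs_by_model.items()
    }
    norm = sum(inverse.values())
    return {name: w / norm for name, w in inverse.items()}


def soft_vote(probs_by_model, weights):
    """Weighted average of class probabilities; one fused row per sample."""
    names = list(probs_by_model)
    n_samples = len(probs_by_model[names[0]])
    n_classes = len(probs_by_model[names[0]][0])
    return [
        [sum(weights[m] * probs_by_model[m][i][c] for m in names)
         for c in range(n_classes)]
        for i in range(n_samples)
    ]


# Invented validation probabilities: columns are [P(non-DR), P(DR)].
val_probs = {
    "confident_model": [[0.9, 0.1], [0.2, 0.8]],
    "uncertain_model": [[0.5, 0.5], [0.5, 0.5]],
}
weights = entropy_weights(val_probs)
fused = soft_vote(val_probs, weights)
```

The resulting weights could then be passed to scikit-learn's `VotingClassifier` through its `weights` argument with `voting="soft"`. Note that low entropy measures how concentrated the probabilities are, not whether they are correct, so a confidently wrong model would also receive a large weight.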
Figures 2A and 2B show the ROC and PR curves for the 20% test set share, and Figures 2C and 2D show them for the 30% test set share; the 95% confidence intervals of the corresponding values are also given. The AUC of the fused machine learning model built with the soft voting mechanism proposed in this paper is the highest under both test set shares, at 0.8911 and 0.9040, respectively. For the area under the PR curve (AP), Voting_soft (0.8558) is only slightly lower than Random Forest (0.8580) under the 20% test set share, and Voting_soft (0.8580) is only slightly lower than XGBoost (0.8659) under the 30% test set share, which supports the advantage of the Voting_soft model in diabetic retinopathy risk prediction.
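The text does not state how the 95% confidence intervals were obtained. One common option, shown here purely as an assumption, is a percentile bootstrap over the test samples. The sketch computes the AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one; the labels and scores are invented.

```python
import random


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample lacks one class; draw again
            continue
        stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented labels (1 = DR) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9]
auc = roc_auc(y_true, y_score)
ci_low, ci_high = bootstrap_auc_ci(y_true, y_score)
```

The pairwise AUC computation is quadratic in the number of samples, which is acceptable for a test set of a few hundred patients but would be replaced by a rank-based formula for larger data.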

Graph of ROC curve and PR curve.
To analyze the contribution of each risk factor to the prediction of diabetic retinopathy (DR), the study combined the optimal XGBoost model with the SHAP interpretation framework. Figure 3 presents the results of the model interpretability analysis based on SHAP values. The framework reveals the association between clinical indicators and the development of DR by quantifying feature attributions: the sign of a SHAP value corresponds directly to a risk-increasing or risk-decreasing effect, the vertical distribution of the sample points reflects the weight of each feature's influence, and the color gradient represents the magnitude of the indicator value. The global feature importance ranking in Figure 3A indicates that AGE, TG, SEX, and HDLC are the core decision factors for DR prediction. The SHAP summary plot in Figure 3B further supports these findings by visualizing the marginal contribution of each risk factor to the model prediction as a scatter distribution. In particular, TG shows a marked positive driving effect: samples with high TG levels (red dots) are concentrated in the high-risk region, while samples with low TG values (blue dots) correspond mostly to a risk-reducing effect.

SHAP analysis diagram.
Given how SHAP values are computed for the XGBoost model, the study breaks the analysis down by true label: Figure 3C and Figure 3D show box plots for DR-negative (class 0) and DR-positive (class 1) samples, respectively, allowing the predictions for the two classes to be interpreted separately. The force plot in Figure 3E visualizes how the model's baseline value and the contributions of a sample's features add up; note that the f(x) value in the plot is the log-odds before the sigmoid transformation rather than a direct probability output. Figure 3F shows the impact of each feature on the predictive performance of the model, reflecting the magnitude of the SHAP value of each feature. The blue line in the figure shows the trend of the AUC as features are added, and the yellow bars highlight the features with the greatest impact on predictive performance. This visualization shows that the AUC of the XGB model stabilizes above 90% once the seventh most important factor is included, indicating that AGE, TG, SEX, HDLC, PR, BUN, and VLDLC are important risk factors for DR, especially the first two. This progressive feature utility analysis not only supports the strong association between metabolic indicators and DR but also provides a quantifiable basis for prioritizing risk factors in clinical intervention strategies.
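The distinction between f(x) and a probability in the force plot can be illustrated with a few lines of Python. The baseline value and the per-feature contributions below are hypothetical numbers chosen for the example, not values read from Figure 3E.

```python
from math import exp


def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def force_plot_output(base_value, shap_values):
    """Return (f(x), probability), where f(x) = base value + sum of SHAP values."""
    fx = base_value + sum(shap_values.values())
    return fx, sigmoid(fx)


# Hypothetical log-odds contributions for a single patient.
contributions = {"AGE": 0.6, "TG": 0.3, "SEX": -0.1}
fx, prob = force_plot_output(base_value=-0.2, shap_values=contributions)
```

Here f(x) is 0.6 in log-odds units, which corresponds to a predicted DR probability of roughly 0.65; reading 0.6 directly as a probability would understate the risk.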
This study compared multiple machine learning algorithms using patient data and identified a set of predictive factors for diabetic retinopathy (DR). Model performance was evaluated using accuracy, precision, recall, F1-score, ROC curves, and PR curves. Many prior studies have used machine learning to predict DR, and Table 4 compares this research with the existing literature. Hong DD 28 recruited 1739 type 2 diabetes patients and constructed DR prediction models using five algorithms (e.g., logistic regression, decision tree). Results showed increased DR risk at PR values of 57 and 72 beats/min, and elevated risk with rising BUN levels (≥5 mmol/L), consistent with our SHAP analysis. Zhao et al. 29 conducted a retrospective study on 7943 Chinese patients, using RF-based feature selection to identify predictors. Among RF, XGBoost, LR, SVM, and KNN models, XGBoost performed best (accuracy: 80.30%, AUC: 0.889). Tsao et al. 30 proposed an SVM-based model (accuracy: 79.0%, AUC: 0.839) that outperformed LR, ANN, and DT. Ogunyemi et al. 31 analyzed 24 predictors in 513 U.S. patients, using LASSO for feature selection; the RUSBoost model achieved 73.5% accuracy. Oh et al. 32 applied ridge regression, LASSO, and elastic net to predict DR in 490 subjects. Hosseini et al. 33 included 3734 Iranian patients with 11 predictors, using LR-based diagnosis without feature selection. In our patient dataset, the LightGBM model outperformed others. SHAP analysis identified TG, HDLC, and BUN as primary predictors, consistent with previous findings. 28 33–35 Additionally, AGE and SEX were identified as key factors, with DR risk increasing significantly with age, in line with the literature. 6 36–38
Comparison of the proposed study with the existing studies in literature.
This study delves into the application of interpretable machine learning algorithms in the diagnostic prediction of DR and significant research results have been achieved. By comparing the performance of different machine learning algorithms, we found that models such as LightGBM, XGBoost and Random Forest perform well in DR risk prediction and significantly outperform traditional diagnostic methods. In particular, the LightGBM model shows the best performance when dealing with complex medical datasets, with high AUC values, highlighting its advantages in improving prediction accuracy. In terms of model interpretability, the SHAP method provides an effective tool to explain and understand the decision-making process of machine learning models. By analyzing the SHAP values, we found that characteristics such as age, TG, SEX, and HDLC played a key role in predicting the prevalence of DR. This finding not only enhances the clinical application value of the model, but also helps physicians to better understand the model decisions and improve clinical trust and acceptance.
Although this paper has achieved certain results in DR diagnosis, the research is based on a single data source. Future research will integrate multimodal models of computer vision and clinical structured data to significantly improve the accuracy and comprehensiveness of DR risk prediction. Moreover, although the model performs well on a specific dataset, its ability to generalize still needs to be improved. Future research will attempt to adopt techniques such as transfer learning, semi-supervised learning, and deep learning to improve the applicability and accuracy of models in different populations and environments. At the same time, strengthening the research on model interpretability is also an important direction for future development, which will help support clinical decision-making and the development of personalized medicine.
Footnotes
Acknowledgments
Data availability statement: The data are available from the corresponding author upon request.
Ethical statement
None to report.
Author contributions
Yifeng Dou and Jiantao Liu conceived the study. Yifeng Dou participated in the study design, data analysis and statistics. Yifeng Dou and Jiantao Liu helped draft the manuscript. All authors read and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the In-hospital project of Tianjin Medical University Baodi Hospital (grant number BDYYQN01).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
