A federal learning-driven artificial intelligence framework for fundus image myopia diagnosis

Abstract

Objective

Myopia has emerged as a critical global public health challenge. This study aims to develop a privacy-preserving federated learning (FL) framework for the triple classification of fundus images (normal, myopia, and pathological myopia), designed to generalize across institutions while addressing data heterogeneity and class imbalance.

Methods

We propose a novel FL framework integrating a genetic algorithm-inspired dynamic aggregator (FedProx_GA), a distance-aware attention module (OptiFocus), and a class-frequency dynamic loss. It was trained and evaluated on 1,279 fundus images from three heterogeneous medical centers. Performance was compared against standard FL baselines using area under the curve (AUC), accuracy, sensitivity, and specificity.

Results

Our framework achieved an AUC of 0.9889, approaching the performance of centralized (non-federated) training, in which all data are stored and processed in one place, while significantly outperforming conventional FL methods. It demonstrated robust cross-center generalization, with high sensitivity (0.9346) and specificity (0.9673), effectively managing data heterogeneity and class imbalance without breaching data privacy.

Conclusion

This work presents an effective, privacy-preserving FL solution for collaborative ophthalmic artificial intelligence, showing strong potential for multi-institutional clinical deployment. Future work should focus on prospective validation with larger, diverse cohorts. The implementation code is publicly available at: https://github.com/AngelaK-code/FL_Myopia-Diagnosis.

Keywords

Federated learning, myopia classification, multicenter collaboration, privacy preservation

Introduction

Myopia^1,2 is one of the most common refractive errors worldwide, and severe myopia and pathological myopia may lead to irreversible visual function damage, such as retinal splitting, retinal detachment, and macular.³ Therefore, myopia classification and diagnosis based on fundus images are crucial for early intervention and treatment. In the early diagnosis of ocular diseases, fundus photography and optical coherence tomography (OCT)^4,5 have proven to be effective imaging techniques. Compared with the more expensive OCT, fundus photography is a cost-effective and non-invasive method of screening for ocular diseases. As a result, fundus photographs have become the primary method used by ophthalmologists to detect diseases such as diabetic retinopathy (DR), age-related macular degeneration (AMD), myopia, hypertension, glaucoma, and cataracts.⁶ However, manual reading of fundus photographs is time-consuming and labor-intensive, and the number of ophthalmologists is far from adequate for manual diagnosis. An automated diagnostic algorithm for screening ophthalmic diseases is urgently needed to reduce the pressure on ophthalmologists and improve the accuracy of fundus diagnosis.^7–10

In recent years, deep learning techniques have made significant progress in automated diagnostic tasks for ophthalmic diseases, especially in the detection of DR,¹¹ glaucoma,¹² and AMD,¹³ and have demonstrated diagnostic performance comparable to that of professional ophthalmologists.^14–20 Guo et al.²¹ proposed a method for the automated diagnosis of ophthalmic diseases using fundus images of the complete structure of cataracts.²² After features are extracted from the image using wavelet transform and sketch-based methods, multiple Fisher classifiers classify all samples according to the severity of the disease. Mookiah et al.²³ proposed an automated dry AMD detection system that used features extracted from grayscale fundus images, including entropy, higher order spectral (HOS) bispectral features, fractional dimension, and Gabor wavelet features. The features are then selected using a series of statistical tests and classified using a set of classifiers such as k-nearest neighbors, probabilistic neural networks, decision trees, and support vector machines (SVMs). Mookiah et al.²⁴ used HOS and discrete wavelet transform features as image descriptors and an SVM as the classifier to diagnose glaucoma, obtaining higher accuracy than before. Singh et al.²⁵ used genetic feature selection to select useful wavelet features extracted from segmented optic disc images, from which blood vessels had been removed, to automatically detect glaucoma in fundus images.

However, the widespread clinical deployment of such deep learning models is hindered by their reliance on a centralized training paradigm,²⁶ which necessitates aggregating data from various medical centers on a single server. This paradigm faces significant challenges in healthcare, particularly under increasingly stringent data privacy protection regulations²⁷ and practical limitations on data sharing,²⁸ making collaborative multicenter modeling difficult.^29,30 To address these barriers while enabling collaborative intelligence, federated learning (FL) has emerged as a promising decentralized alternative. Recent studies have achieved notable results in ophthalmic disease diagnosis using this paradigm. For example, a domain-adaptive federated framework has been proposed to realize privacy-preserving multi-disease ocular recognition,³¹ and a comprehensive federated system specifically optimized for DR grading and lesion segmentation effectively enhanced cross-dataset generalization.³² Nevertheless, existing studies still face several problems in the myopia classification task. First, the use of medical data is constrained by strict privacy-protecting regulations³³ (e.g. General Data Protection Regulation (GDPR)), and there is a compliance risk in directly aggregating data from multiple healthcare organizations for training. This is exactly the core issue addressed by FL, as demonstrated in cross-institutional medical image analysis where privacy is safeguarded without raw data sharing.³⁴ Second, fundus photography equipment differs between medical centers, resulting in inconsistent image resolution, contrast, and color distribution,³⁵ while annotation standards may also be inconsistent across institutions, affecting the generalization ability of the model. Recent work has proposed interpretable FL methods to explicitly identify such inter-institutional biases in biomedical images.³⁶ Third, pathological myopia samples are relatively scarce,³⁷ which can introduce class imbalance and bias during federated training, degrading the model’s recognition performance for minority categories.

To address the above issues, this paper proposes an FL-based myopia triple classification framework, which aims to improve the cross-center generalization ability of the model while safeguarding data privacy. Specifically, we adopt a dynamic aggregation strategy based on federated averaging (FedAvg) and incorporate a category imbalance handling mechanism to enhance the model’s performance on categories with few samples. In addition, we optimize the model architecture to adapt to the heterogeneous data distribution³⁸ in a multicenter environment and improve the robustness of the model. The main contributions of this paper are as follows: (1) we design and implement a privacy-preserving multicenter myopia classification framework that enables different medical centers to jointly train deep learning models without sharing raw data; (2) we propose a dynamically adapted federated aggregation strategy for multicenter data heterogeneity in order to improve the model’s fit to different data sources; (3) within the FL setting, we explore federated knowledge distillation and resampling with an adaptive loss function to alleviate the training bias caused by insufficient pathological myopia samples and improve the model’s ability to recognize minority categories; and (4) we conduct experiments on datasets from multiple real medical centers to validate the effectiveness of the proposed method in data privacy protection, cross-center generalization ability, and classification performance. The research in this paper provides a new solution for intelligent diagnosis of ophthalmic diseases in privacy-preserving environments and an important technical reference for the application of FL in medical image analysis.

Method

Currently, deep learning algorithms have demonstrated high accuracy in diagnosing diseases such as DR, glaucoma, and AMD from fundus photographs. However, such models typically rely on a centralized training paradigm that requires multicenter data aggregation to a single server, which is in fundamental conflict with healthcare data privacy regulations (e.g. European Union GDPR). In addition, there is significant heterogeneity in the data from different healthcare organizations: inconsistent image resolution and contrast due to differences in equipment brands, inconsistent labeling standards (e.g. ambiguous quantitative definition of pathological myopia), and unbalanced distribution of categories (e.g. scarcity of samples for pathological myopia). These factors create “data silos,” limiting the development of large-scale datasets and model generalization.

In this paper, we propose a deep learning framework for the triple classification of fundus images (myopia, pathological myopia, and normal) across multiple medical centers. By adopting a distributed training approach, our framework prioritizes data privacy and avoids the risks of centralization. The framework combines the FedProx FL algorithm, a genetic algorithm (GA), and the OptiFocus attention mechanism, and achieves efficient collaborative learning on cross-institutional data by dynamically adjusting the loss function based on category frequency. Experimental results show that the framework exhibits higher classification accuracy and stronger generalization ability than traditional methods on data from multiple centers.

In this section, the design of an FL framework for the triple classification task in multiple medical centers is described in detail, focusing on the optimization of the model architecture, the dynamic weight aggregation strategy, and the category imbalance handling scheme.

Data preprocessing

All fundus images were stratified and randomly divided into training and test sets at a ratio of $8:2$. To address the significant heterogeneity in resolution, contrast, and noise levels arising from the different imaging equipment used across medical centers, we applied a standardized preprocessing and augmentation pipeline to the training set to improve model generalization and cross-center adaptability. First, all images were resized to a uniform resolution of $256 \times 256$ pixels and normalized using the mean and standard deviation of the ImageNet dataset. During training, the following augmentation techniques were applied sequentially in a randomized manner: (1) ColorJitter (brightness = 0.05, contrast = 0.05, saturation = 0.05, hue = 0.05) to simulate color variations under different lighting conditions; (2) random horizontal flipping to enhance invariance to lateral orientation; and (3) random rotation within a range of $\pm 20^{\circ}$ to improve robustness to varying capture angles. This pipeline ensures that the model learns features that are invariant to common clinical imaging variations while maintaining the biological integrity of the pathological signs (Figure 1).
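As an illustration of the stratified 8:2 split described above, the following is a minimal sketch using only the Python standard library. The function name `stratified_split`, the seed, and the toy label list are assumptions made for this example and are not taken from the released code.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class is split at about the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random assignment within each class
        n_test = round(len(indices) * test_ratio)  # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label list, for illustration only
labels = ["normal"] * 50 + ["myopia"] * 30 + ["pathological"] * 20
train_idx, test_idx = stratified_split(labels)
```

Splitting per class rather than over the pooled images keeps the class proportions of the training and test sets close to each other, which matters when one class is rare.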

Figure 1.

Overview of the proposed federated learning framework for myopia classification.

Model architecture

Embedded OptiFocus attention module

To enhance the model’s attention to different regions in fundus images, we embed a new visual attention mechanism called OptiFocus into the convolutional neural network (CNN) backbone. Fundus images contain subtle features such as myopic arcuate spots and macular lesions, yet traditional attention mechanisms often overlook distance variations within regions. The OptiFocus module addresses this by introducing a distance-aware mechanism that dynamically adjusts region weights based on relative spatial distances, thereby allocating different attention levels to myopia-related features. Specifically, it assigns higher weights to proximate regions (e.g. macular lesions) by computing inter-region distances, enhancing sensitivity to fine details. Its structure consists of two main parts:

Channel attention mechanism: Channel weights are first generated through global average pooling and maximum pooling operations. In this way, the network is able to emphasize those feature channels that are more important to the task in the overall image and suppress irrelevant or redundant channel information.

Spatial attention mechanism: On the feature map compressed by channel attention, OptiFocus utilizes a convolution operation to generate a spatial weight matrix and adjusts the weight distribution of the spatial attention map according to the distance information of the image regions. In this way, the network is able to focus on key regions in the image, such as the optic disk and macula and other important parts, thus improving the sensitivity to the lesion region.

The OptiFocus module is inserted after each convolutional layer (Conv block) of the model. The design of this structure effectively improves the CNN’s ability to capture detailed information, especially on the diverse and subtle features of ophthalmic images, to achieve finer region selection and significantly enhance the robustness and classification accuracy of the model.

To validate the effectiveness and interpretability of the OptiFocus module, we employed gradient-weighted class activation mapping to generate visual explanations for the model’s predictions. As shown in Figure 2, the generated heatmaps clearly demonstrate that our model, guided by the OptiFocus module, successfully learns to focus its attention on clinically relevant regions, such as the optic disc and macular area, which are critical for diagnosing pathological myopia. This provides intuitive and compelling evidence that our model makes predictions based on meaningful pathological features rather than irrelevant image artifacts, significantly enhancing the transparency and trustworthiness of our artificial intelligence (AI) system.

Figure 2.

Interpretability heatmap of OptiFocus module on three categories of fundus images: (a) pathological myopia, (b) myopia, and (c) normal.

Dynamic weight aggregation strategy based on GA (FedProx_GA)

To address the problem of data heterogeneity, this study proposes an improved FL algorithm. The algorithm introduces a dynamic weight aggregation strategy based on FedProx and optimizes the client contributions with a GA, so as to improve the performance of the model in heterogeneous data environments. The FedProx algorithm mitigates the model bias caused by data that are not independently and identically distributed (Non-IID) by introducing a proximal term, and its local objective function is defined as:

$$L_{k} = \mathbb{E}_{D_{k}} [L (w_{k}; D_{k})] + μ \cdot \| w_{k} - w^{t} \|^{2} \quad (1)$$

where $L_{k}$ is the local loss function of client $k$; $L (w_{k}; D_{k})$ represents the loss of client $k$ on its local dataset $D_{k}$, which measures the fit of the local model to the sample data; $w_{k}$ denotes the model parameters of client $k$; and $w^{t}$ denotes the parameters of the global model. $μ$ is the proximal term coefficient, which controls the degree of deviation of the local model from the global model. By adjusting the value of $μ$, the model drift problem caused by inconsistent data distribution can be effectively mitigated.
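To make the role of the proximal term in equation (1) concrete, here is a minimal sketch that evaluates the objective for flattened parameter vectors. The function name and the numeric values are hypothetical; only the form of the objective and the value $μ = 0.1$ come from the text.

```python
def fedprox_objective(task_loss, local_weights, global_weights, mu=0.1):
    """Equation (1): task loss plus mu times the squared L2 distance to the global weights."""
    proximal = sum((w_k - w_t) ** 2 for w_k, w_t in zip(local_weights, global_weights))
    return task_loss + mu * proximal

# Hypothetical flattened parameter vectors and task loss
w_global = [0.5, -1.0, 2.0]
w_local = [0.7, -1.0, 1.6]
loss = fedprox_objective(task_loss=0.30, local_weights=w_local,
                         global_weights=w_global, mu=0.1)
# The squared distance is 0.04 + 0 + 0.16 = 0.20, so the penalty adds 0.02 to the task loss.
```

A larger $μ$ penalizes local drift more strongly, while $μ = 0$ removes the penalty and leaves only the task loss.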

The introduction of the FedProx algorithm enables the training of models on multiple heterogeneous clients and the effective integration of the knowledge from each client through a global aggregation strategy, thus realizing the accurate classification of cross-device and cross-distribution medical images. In traditional FL frameworks, the FedAvg algorithm and the FedProx algorithm use the sample size $n_k$ of the clients to assign the aggregation weights $w_k = n_k / N$ , where $N$ is the sum of the sample sizes of all the clients. However, this data-volume-based weight assignment method ignores the actual performance differences of clients during model training, which may result in some poorly performing clients contributing too much to the aggregation results during global model update. To address this problem, inspired by the CVPR 2023 paper [1], this paper proposes a dynamic weight adjustment strategy based on GA, which aims to optimize the aggregation process by evaluating the actual performance of clients. After each round of federation training, the performance of the model is evaluated locally for each client. Specifically, the loss value $L_k$ or accuracy rate $A_{k}$ of its model is calculated for each client $k$ . The loss value reflects the fitting degree of the model, while the accuracy rate embodies the classification performance of the model, which provides the basis for the subsequent weight adjustment.

After client evaluation, a dynamic selection strategy is adopted in order to optimize client weight allocation. Selection in GAs refers to deciding which individuals are retained and used for reproduction based on fitness. In FL, the selection process determines which clients’ models will be used for aggregation. The contribution of each client in the global model training is measured by considering the client’s weight $w_k$ as a chromosome in the GA and defining the fitness function as the loss value or accuracy of each client during the training process. Specifically, the size of the weight is inversely proportional to the loss value of the client ($\mathit{client\_loss}$) and directly proportional to the accuracy ($\mathit{client\_acc}$). If a client performs poorly for multiple consecutive rounds, its weight is reduced. In this way, clients with lower loss or higher accuracy will get higher fitness and thus play a greater role in the aggregation process.
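One simple way to realize weights that fall with loss and rise with accuracy is sketched below. The scoring rule `acc / (loss + eps)`, the multiplicative penalty `decay ** poor_rounds`, and all numeric values are assumptions for illustration; the text does not give the exact formula used in FedProx_GA.

```python
def performance_weights(client_loss, client_acc, poor_rounds=None, decay=0.5, eps=1e-8):
    """Normalized weights: higher accuracy and lower loss give a larger share.

    poor_rounds[i] counts consecutive poorly performing rounds for client i
    (hypothetical bookkeeping); each such round multiplies the score by `decay`.
    """
    poor_rounds = poor_rounds or [0] * len(client_loss)
    scores = [
        (acc / (loss + eps)) * decay ** n
        for loss, acc, n in zip(client_loss, client_acc, poor_rounds)
    ]
    total = sum(scores)
    return [s / total for s in scores]

# Three hypothetical clients with equal accuracy but different losses
weights = performance_weights(client_loss=[0.2, 0.4, 0.8], client_acc=[0.9, 0.9, 0.9])

# Two otherwise identical clients; the second has had two poor rounds in a row
penalised = performance_weights([0.2, 0.2], [0.9, 0.9], poor_rounds=[0, 2])
```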

After each round of training, the models of each client are weighted and aggregated according to the dynamically adjusted weights. The process can be represented as:

$$θ_{global} = \sum_{i = 1}^{N} w_{i} θ_{i} \quad (2)$$

where $θ_{global}$ denotes the parameters of the global model, $θ_{i}$ denotes the model parameters of the $i$-th client, $w_{i}$ is the weight of client $i$, and $N$ is the number of clients involved in training.
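The weighted aggregation in equation (2) can be sketched as follows for flattened parameter vectors. For comparison, the example uses the sample-size weights $w_k = n_k / N$ of the FedAvg baseline, computed from the training set sizes in Table 1; the parameter vectors themselves are hypothetical.

```python
def aggregate(client_params, weights):
    """Equation (2): theta_global = sum_i w_i * theta_i, element by element."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("aggregation weights should sum to 1")
    n_params = len(client_params[0])
    return [
        sum(w * params[j] for w, params in zip(weights, client_params))
        for j in range(n_params)
    ]

# Sample-size weights from the training set sizes of Clients A, B, and C (Table 1)
sizes = [286, 424, 313]
fedavg_weights = [n / sum(sizes) for n in sizes]

# Hypothetical two-parameter client models
theta = aggregate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], fedavg_weights)
```

Replacing `fedavg_weights` with performance-based weights changes only the coefficients, not the aggregation rule itself.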

All training follows a synchronous client sampling protocol: in each communication round, the server broadcasts the current global model to all participating clients, waits for their local training completion, and then aggregates the received updates. A central validation set was constructed and held on the server to guide the federated optimization process. This set was created by uniformly sampling a small, stratified portion of data from each client’s local dataset prior to the initiation of federated training. This central validation set serves two critical purposes: (1) to evaluate the performance of candidate global models during the GA-based weight optimization in FedProx_GA, providing the fitness metric for different aggregation weight vectors; and (2) to monitor global model convergence and facilitate fair comparison across different aggregation strategies. The use of this centrally held, representative dataset allows for a consistent and unbiased evaluation of the global model’s generalization ability across the heterogeneous client distribution.

The strategy not only takes into account the differences in the amount of client data, but also dynamically adjusts the weights by means of the fitness function and genetic operation, thus effectively improving the performance of the global model and enhancing its robustness in heterogeneous data environments. The core workflow is summarized in Supplemental Algorithm 1.
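Supplemental Algorithm 1 is not reproduced in this section, so the following is only a generic GA sketch of how aggregation weight vectors could be searched against a fitness function. The population size, truncation selection, arithmetic crossover, Gaussian mutation, and the toy fitness function are all assumptions. In the setting described above, the fitness would instead be the score of the aggregated global model on the central validation set.

```python
import random

def normalise(vec):
    """Clamp to positive values and rescale so the weights sum to 1."""
    vec = [max(v, 1e-6) for v in vec]
    total = sum(vec)
    return [v / total for v in vec]

def ga_search(fitness, n_clients, pop_size=20, generations=30, sigma=0.1, seed=0):
    """Search for aggregation weights that maximize `fitness`."""
    rng = random.Random(seed)
    population = [normalise([rng.random() for _ in range(n_clients)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]                 # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]       # crossover: arithmetic mean
            child = [x + rng.gauss(0, sigma) for x in child]  # mutation: Gaussian noise
            children.append(normalise(child))
        population = parents + children
    return max(population, key=fitness)

# Toy stand-in fitness that peaks at a hypothetical target weight vector
target = [0.5, 0.3, 0.2]

def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

best = ga_search(toy_fitness, n_clients=3)
```

Because the fitter half of the population is carried over unchanged, the best fitness found never decreases from one generation to the next.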

Dynamic loss function adjustment methods for dealing with category imbalances

Class imbalance in medical image classification arises from varying data distributions across centers, compromising global model training and reducing accuracy for minority classes, which undermines diagnostic reliability. Traditional FL often applies basic oversampling or undersampling, but these can harm generalization or efficiency. To better align with each client’s data distribution and boost global model performance, we introduce a dynamic loss function adjusted by class frequency.

The method adjusts the loss function during training based on the sample frequency of each category in the dataset, thereby automatically mitigating the negative impact of category imbalance on model learning. Specifically, by dynamically assigning different weights to each category, the model is made to pay more attention to categories that account for a relatively small number of samples, thus improving its ability to recognize minority categories. The sample frequency for each category is first calculated. For each client $k$, let $D_{k}$ denote the dataset of this client, which contains samples of different categories, and let $D_{k}^{i}$ denote the subset of samples belonging to category $i$, where $i$ is the category index. The category frequency can be calculated by the following formula:

$$f_{i} = \frac{| D_{k}^{i} |}{| D_{k} |} \quad (3)$$

Next, the weight for each category is calculated from the category frequencies. The weights are designed to increase the importance of categories with low sample frequencies during training. The specific weight $w_{i}$ is calculated by the formula:

$$w_{i} = \frac{1}{f_{i} + ε} \quad (4)$$

where $ε$ is a small constant used to avoid division by zero errors when the frequency is zero. The weight $w_{i}$ is a positive value and decreases as the frequency of the category $f_{i}$ increases, so that less frequent categories receive more attention. During training, the loss function is weighted to fit the weights of each category. For each client $k$ , a Weighted Cross-Entropy Loss function is designed:

$$L_{k} = - \sum_{i = 1}^{C} w_{i} \cdot y_{i} \cdot \log ({\hat{y}}_{i}) \quad (5)$$

where $C$ denotes the number of categories, $y_{i}$ is the true label of category $i$, ${\hat{y}}_{i}$ is the predicted probability of the model for category $i$, and $w_{i}$ is the weight of category $i$. In this way, the loss contribution of each category is dynamically adjusted according to its sample frequency, which enables the model to learn more effectively from categories with few samples.
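Equations (3) to (5) can be traced with a short sketch. The class counts are the Client C training counts from Table 1; the function names, the value of `eps`, and the predicted probabilities are assumptions for illustration.

```python
import math
from collections import Counter

def class_weights(labels, num_classes, eps=1e-6):
    """Equations (3) and (4): f_i = |D_k^i| / |D_k| and w_i = 1 / (f_i + eps)."""
    counts = Counter(labels)
    total = len(labels)
    return [1.0 / (counts.get(i, 0) / total + eps) for i in range(num_classes)]

def weighted_cross_entropy(probs, true_class, weights):
    """Equation (5) for one sample with a one-hot label: only the true-class term remains."""
    return -weights[true_class] * math.log(probs[true_class])

# Client C training set (Table 1): 129 normal (0), 19 myopia (1), 165 pathological myopia (2)
labels = [0] * 129 + [1] * 19 + [2] * 165
w = class_weights(labels, num_classes=3)

# Same predicted confidence (0.6) for a minority-class and a majority-class sample
loss_minor = weighted_cross_entropy([0.2, 0.6, 0.2], true_class=1, weights=w)
loss_major = weighted_cross_entropy([0.2, 0.2, 0.6], true_class=2, weights=w)
```

With these counts, the myopia class receives a weight of roughly 16.5 against roughly 1.9 for pathological myopia, so an equally confident prediction on a myopia sample contributes a much larger loss. A class that is absent on a client gets a weight close to $1/ε$, but it never contributes to that client's loss because no sample carries that label.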

Result

Study design

This study is designed to simulate a realistic FL scenario across distinct, independent medical institutions. Our framework comprises three federated clients (Clients A, B, and C), each representing a separate, real-world data source with its own patient population, imaging protocols, and clinical practices, not a synthetic partition of a single dataset. This setup captures authentic institutional variation that directly impacts model performance and communication dynamics. This study is based on publicly available datasets from three heterogeneous medical centers and follows relevant usage protocols, comprising a total of 1,279 fundus images. The specific data distribution is shown in Table 1.

Client A: Normal eyes (39%) and pathological myopia (61%). Client A is from the dataset Joint Shantou International Eye Center (JSIEC), which contains 39 categories of fundus images, with data from the JSIEC in Shantou, Guangdong Province, China. The dataset was collected from PACS JSIEC between September 2009 and December 2018. The images were captured using a ZEISS FF450 Plus infrared fundus camera (2009–2013) and a Topcon TRC-50DX mydriatic retinal camera (2013–2018) with a 35°–50° field of view setting.

Client B: Normal eyes (40%) and myopia (60%). The Client B dataset was created by Sharmin et al. and described in their paper (https://doi.org/10.1016/j.dib.2024.110979). The data were collected over a period of eight months from Anawara Hamida Eye Hospital and B.N.S.B. Zahurul Haque Eye Hospital. Compared to Client A, Client B focuses primarily on fundus images of myopic eyes.

Client C: Normal eye (41.2%), myopia (6.1%), and pathological myopia (52.7%). Client C is from the dataset iChallenge-PM, a medical dataset provided in a competition jointly organized by Baidu Brain and Sun Yat-Sen University Zhongshan Ophthalmic Center. This dataset is dedicated to the classification of myopia-related ophthalmic diseases, particularly pathological myopia, and includes a diverse range of clinical samples.

This multicenter setup introduces a pronounced class imbalance both within and across clients, which is a central challenge for federated model training. The distribution is highly heterogeneous: Client A is dominated by pathological myopia (61%), Client B contains no pathological myopia samples, and Client C has a very small proportion of myopia cases (6.1%) relative to pathological myopia (52.7%). This inter-client Non-IID data distribution can destabilize training by causing client drift, where local updates from clients with skewed distributions pull the global model in conflicting directions. It particularly risks under-representing minority classes (e.g. myopia in Client C) in the global feature space, potentially biasing the model toward the dominant classes in the aggregate data. Our framework’s dynamic aggregation and loss components are designed in part to mitigate these instability and bias issues arising from such imbalance.

Table 1.

Characteristics and distribution of the multicenter fundus image datasets.

Feature	Training A	Testing A	Training B	Testing B	Training C	Testing C	Total
Number of fundus images	286	71	424	106	313	79	1279
Average age	51.7	52.0	NA	NA	37.5	37.5	NA
Normal	111 (38.8%)	28 (39.4%)	170 (40%)	42 (39.6%)	129 (41.2%)	32 (40.5%)	512 (40.1%)
Myopia	0 (0%)	0 (0%)	254 (60%)	64 (60.4%)	19 (6.1%)	5 (6.4%)	342 (26.7%)
Pathological myopia	175 (61.2%)	43 (60.6%)	0 (0%)	0 (0%)	165 (52.7%)	42 (53.1%)	425 (33.2%)
Male	143 (50.0%)	35 (49.3%)	NA	NA	151 (48.2%)	38 (48.1%)	NA
Female	143 (50.0%)	36 (50.7%)	NA	NA	162 (51.8%)	41 (51.9%)	NA

Note. NA indicates that the data are not available or were not provided.

Our proposed algorithm has been rigorously validated on the three major datasets. The inclusion criteria for all fundus images are: (1) images are collected using single or dual fields of view; (2) the image has no obvious quality issues, such as severe stains, artifacts, defocusing, blurring, improper exposure, etc., which can affect the clarity of the observed target area. By fully considering the differences in data sources, disease types, and imaging characteristics, we ensure that the proposed method exhibits strong adaptability to cross-domain data and demonstrates excellent generalization capability.

This study was conducted and reported in accordance with the Standards for Reporting Diagnostic Accuracy Studies (STARD) guidelines. A completed STARD checklist is provided as Supplemental Table S1.

Evaluation of classification models

The model was trained on a high-performance computing node equipped with an NVIDIA RTX A6000 GPU (48 GB VRAM). All experiments were implemented using Python 3.12.2 and the PyTorch 2.8.0 deep learning framework. The training configuration is as follows: the number of communication rounds is 40, the number of local training epochs is 50, the batch size is 64, the optimizer is stochastic gradient descent with an initial learning rate of 0.001, and the proximal term coefficient $μ = 0.1$ is used in the FedProx baseline. On the test set, the model achieves an accuracy of 0.9339, an area under the curve (AUC) of 0.9889 (95% CI: 0.9869–0.9906), a sensitivity of 0.9346, and a specificity of 0.9673. The receiver operating characteristic (ROC) curve and confusion matrix are shown in Figure 3. According to the confusion matrix, the most significant misclassification occurs within the disease spectrum: 376 myopia cases were misclassified as pathological myopia, and 55 pathological myopia cases were misclassified as myopia. This bidirectional confusion, accounting for 431 out of 531 total errors (81.2%), highlights the model’s challenge in delineating the continuous phenotypic transition between high myopia and early pathological myopia, where shared features such as optic disc tilt and tessellated fundus create ambiguity. Notably, the model demonstrates high fidelity in recognizing normal fundus anatomy, with 3508 correct normal predictions and minimal confusion between normal and myopia classes (Figure 4). The evaluation indicators for each category are shown in Table 2.

Figure 3.

Receiver operating characteristic (ROC) curves and confusion matrices for aggregated models tested on all data.

Figure 4.

Convergence curves of different federated learning algorithms on the test set.

Table 2.

Detailed per-class performance metrics of the proposed model.

Category	Precision	Accuracy	F1 score	Sensitivity (95% CI)	Specificity (95% CI)
Pathological myopia	0.8670	0.9730	0.9183	0.9730 (0.9675–0.9776)	0.9072 (0.8998–0.9141)
Myopia	0.9743	0.7954	0.8641	0.7954 (0.7794–0.8105)	0.9655 (0.9613–0.9693)
Normal	0.9441	0.9430	0.9500	0.9571 (0.9469–0.9665)	0.9912 (0.9885–0.9932)
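For reference, per-class precision, sensitivity, specificity, and F1 score of the kind reported in Table 2 are typically derived from a multi-class confusion matrix in a one-vs-rest manner. The sketch below shows that calculation on a hypothetical confusion matrix. These are not the paper's counts, and the function name is an assumption.

```python
def one_vs_rest_metrics(cm):
    """cm[i][j] = number of samples with true class i that were predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class differs
        tn = total - tp - fn - fp
        precision = tp / (tp + fp)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
        metrics.append({"precision": precision, "sensitivity": sensitivity,
                        "specificity": specificity, "f1": f1})
    return metrics

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
cm = [[90, 5, 5],
      [4, 80, 16],
      [1, 3, 96]]
m = one_vs_rest_metrics(cm)
```

For the first class in this toy matrix, sensitivity is 90/100 and specificity is 195/200. The sketch assumes every class is predicted at least once; otherwise the precision denominator would be zero.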

The independent performance of each client’s test set is shown in Figure 5. For Client A, the model demonstrates comparable performance in distinguishing pathological myopia and normal eyes, with AUC values both exceeding 0.98.

Figure 5.

Receiver operating characteristic (ROC) curves and confusion matrices for each of the three clients tested separately (left: Client A; center: Client B; right: Client C).

For Client B, the AUC for myopia cases is slightly lower than that for normal eyes, indicating that the model is capable of drawing relatively clear decision boundaries. Client C includes three categories, among which the AUC for myopia (0.92) is lower than that for pathological myopia and normal eyes (both around 0.99). This is mainly attributed to the scarcity of myopic samples and confusion between categories. Specifically, there are only 19 cases of myopia, compared to 165 cases of pathological myopia. With myopic cases representing only 6.1% of the dataset, it is difficult to cover the full range of diopter levels and associated complications. Furthermore, there is a partial overlap in fundus imaging features between myopia and pathological myopia, such as optic disc tilt and tessellated fundus. As a result, the few myopic samples are likely to be overwhelmed by pathological features, leading the model to classify ambiguous cases as pathological myopia. Across all clients, AUCs for normal eyes are consistently close to 0.99, indicating the model’s ability to effectively distinguish between healthy and diseased eyes.

Further analysis reveals that cross-domain feature differences are one of the key factors contributing to model performance variability and class confusion. To assess the algorithmic fairness and potential performance disparities of our FL model, we conducted a comparative analysis across demographic subgroups. Given the geographic provenance of our datasets, we used client membership as a proxy for demographic groups: East Asian (Clients A and C, primarily Chinese) and South Asian (Client B, Bangladeshi) populations. In general, East Asian populations tend to have smaller optic discs (average 1.2 mm $^{2}$ ), thinner choroids, higher rates of high myopia in adolescents, and faster progression to pathological myopia. In contrast, South Asian populations typically have larger optic discs (average 1.8 mm $^{2}$ ), more pigment deposition, and a predominance of mild-to-moderate myopia, with relatively low rates of pathological myopia.

The larger optic disc areas commonly observed in South Asian populations may cause the model to misjudge the degree of optic disc tilt. As a result, what is considered a normal optic disc in Client B may be interpreted as abnormal according to the criteria learned from Clients A and C. Despite these differences, the proposed model maintains high AUC values across datasets from diverse demographic groups, and the performance gaps are relatively small. By employing data augmentation, standardization of fundus features, class balancing, and weighted loss functions, the model identifies optic disc characteristics and pathological changes across populations. The experimental results show that the model is robust in heterogeneous data scenarios: despite differences in data distribution among participating nodes, it adapts to different data characteristics and maintains high performance. This demonstrates the feasibility of the FL approach to (1) accurately classify normal eyes, myopia, and pathological myopia, and thus guide downstream interventions; and (2) support cross-institutional collaboration in ophthalmology AI, crossing the data barriers between medical centers and sharing model knowledge to improve generalization and accuracy.

Comparative experiments

To validate the effectiveness of the proposed method, we compare it with centralized training. In the centralized setting, all data are stored and processed in a unified location, and the model achieves a maximum AUC of 0.9934 on the test set. This approach can fully exploit the entire dataset and is generally regarded as a reference for the upper bound of model performance. However, centralized training carries the risk of data privacy breaches. Moreover, in large-scale distributed data scenarios, data collection and storage can be costly, and data transmission may suffer from latency and security concerns. In contrast, the FL framework proposed in this study achieves classification performance comparable to that of centralized training, while ensuring data privacy protection.

The proposed algorithm is compared with other FL algorithms in Figure 6 and Table 3. Although FedAvg and FedProx, as standard FL aggregation algorithms, maintain good performance, they are inferior to the complete scheme designed in this paper in terms of task-specific optimization.

Figure 6.

Comparative performance evaluation of different federated learning algorithms.

Table 3.

Results of comparative experiments.

	Accuracy	AUC–ROC	Sensitivity	Specificity
Ours	0.9339	0.9889	0.9346	0.9673
FedAvg	0.9183	0.9762	0.9002	0.9501
FedProx	0.9261	0.9834	0.9201	0.9601
FedNova	0.9370	0.9815	0.9181	0.9555
FedOpt	0.9177	0.9793	0.9026	0.9492

AUC: area under the curve; ROC: receiver operating characteristic.
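
For a three-class task, sensitivity and specificity such as those in Table 3 are typically obtained per class in a one-vs-rest manner and then averaged. The sketch below assumes macro averaging over classes; the text does not state which averaging scheme was used, and the confusion matrix is hypothetical rather than taken from the experiments.

```python
def macro_sensitivity_specificity(confusion):
    """Macro-averaged sensitivity and specificity for a multi-class task.

    confusion[i][j] = number of class-i images predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    sens, spec = [], []
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c predicted as something else
        fp = sum(confusion[r][c] for r in range(k)) - tp   # other classes predicted as c
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return sum(sens) / k, sum(spec) / k


# Hypothetical confusion matrix (rows: normal, myopia, pathological myopia).
confusion = [
    [45, 3, 2],
    [4, 14, 2],
    [1, 2, 47],
]
macro_sens, macro_spec = macro_sensitivity_specificity(confusion)
print(round(macro_sens, 4), round(macro_spec, 4))
```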

The FedAvg algorithm is susceptible to variations in client data distributions during the model update process, resulting in slower convergence and a noticeable decline in performance when dealing with Non-IID data. It struggles to effectively learn features from minority classes. Although it maintains relatively high overall accuracy and AUC values, its performance in terms of sensitivity and specificity is comparatively weaker.

In contrast, the FedProx algorithm introduces an additional regularization term, which improves training stability to a certain extent. In our experiments, FedProx achieved an accuracy of 0.9261 and an AUC–ROC of 0.9834, an improvement over FedAvg. However, it still falls short across multiple evaluation metrics when compared to the model proposed in this study.

It is worth noting that the choice of the proximal term coefficient $μ$ significantly impacts model performance. In our experiments, the model performed best, with the highest accuracy, when $μ$ was set to 0.1. This indicates that a smaller $μ$ value balances the tradeoff between the global and local models, promoting knowledge transfer while avoiding excessive constraints. Specifically, the model is able to leverage the unique characteristics of local data without sacrificing the generalization ability of the global model. When $μ$ exceeds 0.3, the global model’s accuracy drops significantly, by as much as 7.2%. This suggests that overly strong proximal constraints hinder the knowledge transfer process: they tie local updates too tightly to the current global model, limiting what each client can learn from its own data and weakening the global model’s ability to absorb shared knowledge across domains.
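
To make the role of $μ$ concrete, the sketch below writes out the standard FedProx local objective, in which each client minimizes its task loss plus $(μ/2)\lVert w - w_{global}\rVert^{2}$. It is an illustration on plain Python lists with hypothetical weights and loss values, not the training code used in this study.

```python
def fedprox_local_loss(task_loss, local_w, global_w, mu):
    """Local objective: task loss + (mu / 2) * ||w - w_global||^2."""
    sq_dist = sum((w - g) ** 2 for w, g in zip(local_w, global_w))
    return task_loss + 0.5 * mu * sq_dist


def proximal_gradient(local_w, global_w, mu):
    """Gradient of the proximal term alone: mu * (w - w_global)."""
    return [mu * (w - g) for w, g in zip(local_w, global_w)]


# Hypothetical local and global weights; the task loss value is illustrative.
local_w, global_w, task_loss = [1.0, 2.0], [0.0, 0.0], 0.5
loss_small_mu = fedprox_local_loss(task_loss, local_w, global_w, mu=0.1)
loss_large_mu = fedprox_local_loss(task_loss, local_w, global_w, mu=0.3)
pull = proximal_gradient(local_w, global_w, mu=0.3)
print(loss_small_mu, loss_large_mu, pull)
```

The gradient of the penalty grows linearly with $μ$, so the same drift away from the global model is pulled back three times harder at $μ = 0.3$ than at $μ = 0.1$.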

To further validate the superiority of our approach, we compared it with two more recent FL algorithms: FedNova and FedOpt. FedNova, which normalizes local updates to mitigate client drift, achieved the highest accuracy (0.9370) among all compared methods. However, its performance on other critical clinical metrics was inconsistent, with lower AUC–ROC (0.9815), sensitivity (0.9181), and specificity (0.9555) values compared to our method. This suggests that while FedNova improves aggregation efficiency, it may not optimally handle the complex feature relationships essential for fine-grained medical image classification. FedOpt, which uses adaptive optimizers on the server side, performed similarly to FedAvg, with an accuracy of 0.9177 and an AUC–ROC of 0.9793, but lagged behind our method in sensitivity (0.9026) and specificity (0.9492). This indicates that adaptive server optimization alone is insufficient to address the challenges posed by heterogeneous ophthalmology data. In comparison, our proposed method demonstrates more balanced and robust performance across the evaluation metrics, achieving the highest AUC–ROC (0.9889), sensitivity (0.9346), and specificity (0.9673). These metrics are particularly crucial for clinical diagnostic applications, where identifying true positives and avoiding false positives are both important.

To evaluate communication efficiency, we define one communication round as the server broadcasting the global model to all selected clients and subsequently receiving their updated local models. The primary communication cost is thus proportional to the size of the model parameters and the number of clients per round. Our aggregation strategy aims to achieve higher accuracy with fewer rounds, thereby improving overall efficiency. The convergence behavior of our proposed framework, compared to FedAvg and FedProx, is depicted in Figure 4. Our method demonstrates a steeper initial ascent and reaches a higher performance plateau within the 40 communication rounds, indicating faster and more stable convergence. For instance, our model achieved 90% of its final accuracy by round 15, whereas FedAvg required 30 rounds. This highlights the effectiveness of our dynamic aggregation strategy in steering the global model toward a superior optimum more efficiently.
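
The two quantities used in this paragraph, the round at which a model reaches a given fraction of its final accuracy and the total communication volume, can be computed as in the following sketch. The accuracy curves and the model size are hypothetical placeholders, and counting one download plus one upload per client per round is an assumption consistent with the round definition above.

```python
def rounds_to_fraction(accuracy_per_round, fraction=0.9):
    """First round (1-indexed) whose accuracy reaches `fraction` of the final-round accuracy."""
    target = fraction * accuracy_per_round[-1]
    for round_idx, acc in enumerate(accuracy_per_round, start=1):
        if acc >= target:
            return round_idx
    return None  # only reachable for an empty or negative-valued curve


def communication_cost(rounds, clients_per_round, model_bytes):
    """Total bytes moved: each client downloads and uploads the model once per round."""
    return rounds * clients_per_round * 2 * model_bytes


# Hypothetical accuracy curves (not the curves shown in the paper's figures).
fast_curve = [0.50, 0.70, 0.85, 0.90, 0.92]
slow_curve = [0.40, 0.50, 0.60, 0.80, 0.90]
fast_round = rounds_to_fraction(fast_curve)
slow_round = rounds_to_fraction(slow_curve)
saved_bytes = communication_cost(slow_round - fast_round, clients_per_round=3,
                                 model_bytes=10_000_000)
print(fast_round, slow_round, saved_bytes)
```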

To isolate the individual contribution of each core component in our proposed framework, we conducted a series of ablation studies. The performance of different configurations was compared on the same test set by systematically removing or replacing specific modules. The results (see Supplemental Table S2) demonstrate that all components contributed positively to the final performance. The OptiFocus module directly enhances feature representation by applying a distance-aware spatial attention mechanism, which forces the model to weigh local lesion regions (e.g. the macula) more heavily than surrounding tissue. This leads to more discriminative features for subtle pathologies, directly improving per-class specificity. In contrast, the GA-inspired dynamic aggregation operates at the system level to manage client heterogeneity. By treating client weights as evolvable parameters and using loss or accuracy as a fitness function, it iteratively finds a weighting scheme that maximizes global model performance. This leads to more stable convergence and a better aggregated model. Together, OptiFocus improves what the model sees (feature discriminability), while the GA strategy improves how client knowledge is combined (robust aggregation), resulting in the superior final performance.
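
The idea of treating client weights as evolvable parameters can be sketched as a small genetic search over aggregation weights. The code below is a toy illustration under stated assumptions, not the FedProx_GA implementation: client models are short parameter vectors, the fitness function is a hypothetical stand-in for validation performance (negative squared distance to a made-up target vector), and the selection, crossover, and mutation operators are generic choices.

```python
import random


def aggregate(client_params, weights):
    """Weighted average of client parameter vectors."""
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(dim)]


def normalize(weights):
    """Clip to a small positive floor and rescale so the weights sum to 1."""
    clipped = [max(w, 1e-6) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]


def evolve_weights(client_params, fitness, generations=30, pop_size=20, sigma=0.1, seed=0):
    """Search for aggregation weights that maximize fitness(aggregated model)."""
    rng = random.Random(seed)
    n = len(client_params)

    def score(w):
        return fitness(aggregate(client_params, w))

    # Start from uniform weights plus random candidates on the simplex.
    population = [[1.0 / n] * n]
    population += [normalize([rng.random() for _ in range(n)]) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]          # selection; parents survive (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]            # crossover
            child = [x + rng.gauss(0.0, sigma) for x in child]     # mutation
            children.append(normalize(child))
        population = parents + children
    return max(population, key=score)


# Toy setup: three clients, two parameters each; the target is hypothetical.
client_params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [0.9, 0.4]


def toy_fitness(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best_weights = evolve_weights(client_params, toy_fitness)
uniform_weights = [1.0 / 3] * 3
print(best_weights)
```

Because the uniform weighting is part of the initial population and the best candidates always survive, the returned weights are never worse than uniform averaging under the chosen fitness function. In a real deployment the fitness would be a validation loss or accuracy, as described above, and each evaluation would cost one pass over the validation data.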

Discussion

The FL framework proposed in this study addresses multicenter heterogeneity and class imbalance while protecting data privacy, and provides new ideas for the intelligent diagnosis of ophthalmic diseases. First, by optimizing client contributions through a dynamic weight aggregation strategy (FedProx_GA), the model maintains high generalization ability despite cross-center differences in data distribution. This strategy evaluates each client’s contribution based on its local performance, down-weighting updates from clients with potentially noisier or more divergent distributions and thereby steering the global model toward a more robust consensus. For instance, the specificity in Client A, which is dominated by pathological myopia, reaches 0.972, and the overall AUC exceeds that of the traditional FedAvg algorithm by 1.27 percentage points (0.9889 vs. 0.9762). Second, the OptiFocus attention module strengthens the ability to capture subtle features such as macular lesions through a distance-aware mechanism, resulting in a specificity of 0.9673 for pathological myopia, which outperforms the traditional Squeeze-and-Excitation (SE) module. In addition, the dynamic loss function adjusts the weights according to the category frequency, which improves the recall of the minority category (pathological myopia) by 12% and mitigates the training bias caused by sample scarcity. Experiments show that the classification performance of this framework is close to the upper limit of centralized training under privacy-preserving conditions (AUC difference < 0.005), which supports its feasibility in clinical multicenter collaboration.
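
A loss that adjusts weights according to category frequency can be illustrated with inverse-frequency class weights applied to a cross-entropy term. This is a static sketch of the general technique; the dynamic loss in the proposed framework may update its weights differently. The counts of 19 myopia and 165 pathological myopia images follow the Client C figures reported earlier, while the count for normal eyes is an assumed value used only for illustration.

```python
import math


def class_frequency_weights(counts):
    """Inverse-frequency weights, scaled so the sample-weighted mean weight is 1."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}


def weighted_nll(prob_true_class, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[label] * math.log(prob_true_class)


# 19 and 165 follow the Client C counts in the text; 127 for normal is an assumption.
counts = {"normal": 127, "myopia": 19, "pathological_myopia": 165}
weights = class_frequency_weights(counts)

# The same prediction error costs more on a rare class than on a common one.
loss_rare = weighted_nll(0.6, "myopia", weights)
loss_common = weighted_nll(0.6, "pathological_myopia", weights)
print(weights, loss_rare, loss_common)
```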

However, this study still has some limitations. First, the Non-IID nature of data across institutions introduces potential bias in the global model. Each client’s dataset exhibits distinct statistical characteristics due to variations in patient demographics, disease prevalence, and imaging protocols. For instance, Client A is dominated by pathological myopia cases from specific populations, while Client B contains different disease distributions. This statistical heterogeneity may lead to biased global model parameters that favor dominant data distributions, potentially compromising performance on underrepresented populations or rare disease subtypes. Second, federated training incurs inevitable communication latency. While effective, our aggregation approach requires synchronous communication, which can slow convergence if any client experiences delays or connectivity issues. This limitation becomes more pronounced as the number of participating institutions increases in real-world deployment scenarios. Third, the challenge of model personalization in ophthalmology remains inadequately addressed. A single global model may not optimally serve all institutions, particularly when local patient demographics, equipment specifications, or clinical practices differ significantly. The current framework does not fully address the need for site-specific adaptations while maintaining the benefits of collaborative learning. Fourth, the relatively small aggregate dataset size (1,279 images), while common in initial FL feasibility studies, remains a constraint. This not only limits the statistical power to validate generalizability but also increases the risk of the global model overfitting to idiosyncrasies of the participating client distributions rather than learning broadly applicable features. More medical centers need to be included in the future to validate the model’s generalizability. Fifth, differences in annotation standards between centers may also affect the model’s performance; semi-supervised FL could be incorporated in the future to reduce the dependence on annotation consistency. Finally, although the GA improves aggregation efficiency, it increases computational complexity, and an asynchronous federation architecture or a lightweight optimization strategy should be explored to adapt the framework to edge devices.

Moreover, although the dynamic loss function partially alleviates class imbalance, potential biases introduced by unbalanced distributions require deeper investigation. For example, the model may still exhibit reduced sensitivity to minority classes (e.g. myopia in Client C) due to their limited representation in the global feature space. Future work should explore minority-class augmentation techniques, such as federated generative adversarial networks or differentially private synthetic data generation, to create more balanced training sets without compromising privacy. In addition, integrating multimodal data could further enhance diagnostic accuracy and robustness. Combining fundus images with OCT, patient history, or genetic information could provide complementary insights and make it easier to distinguish between diseases whose anatomical differences are minimal or exist on a continuum. Developing multimodal FL frameworks that efficiently and privately integrate diverse data types represents a critical direction for future research. Other promising directions include (1) extending the framework to multi-disease joint diagnosis tasks, such as glaucoma and AMD; and (2) designing lightweight federation frameworks to reduce computational overhead and facilitate clinical deployment.

Conclusion

In this study, we developed and validated a novel FL framework enhanced by a dynamic aggregation strategy and the OptiFocus attention module, for the classification of normal, myopic, and pathological myopic fundus images across multicenter datasets. Our approach effectively mitigates the challenges posed by data heterogeneity and class imbalance while maintaining strict data privacy. The results demonstrate that the proposed method achieves performance comparable to centralized training and outperforms several existing FL baselines, highlighting its potential as a robust and privacy-preserving solution for collaborative ophthalmic AI.

This framework demonstrates clear potential for integration into real-world multicenter ophthalmology workflows. For instance, it could be deployed as a cloud-assisted or edge-computing diagnostic support system, enabling geographically dispersed clinics or screening camps to collaboratively improve a shared model without transferring sensitive patient data. Such a system would assist in large-scale screening efforts, prioritization of referrals, and consistency in diagnosis across different healthcare providers.

This work provides a solid technical foundation and a valuable reference for secure multi-institutional collaboration in medical AI. However, its translation into routine clinical practice necessitates further validation through larger-scale, prospective multicenter studies and continuous refinement of the framework. Future extensions will focus on scaling the framework with larger and more diverse international datasets to improve generalizability, and on exploring the integration of multimodal data, such as combining fundus images with OCT scans and patient metadata, to build a more comprehensive and clinically informative diagnostic assistant. We believe that this research represents a meaningful step forward in bridging the gap between advanced FL techniques and real-world clinical applications in ophthalmology.

Supplemental Material

Supplemental material, sj-docx-1-dhj-10.1177_20552076261426324 for A federal learning-driven artificial intelligence framework for fundus image myopia diagnosis by Xiaolong Yin, Chunhong Yu, Weiwei Xiong and Yujun Liao in DIGITAL HEALTH

Footnotes

Acknowledgements

We thank all patients who participated in this study. The authors thank the Second Affiliated Hospital of Nanchang University for providing the instrumentation and technical support.

ORCID iD

Yujun Liao

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Contributorships

Xiaolong Yin: Conceptualization, methodology, investigation, and writing original draft. Chunhong Yu: Data curation, formal analysis, and writing—review and editing. Weiwei Xiong: Software, validation, and visualization. Yujun Liao: Supervision, project administration, and funding acquisition.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval

This study uses publicly available datasets that have been previously collected and anonymized by their respective providers. As no new human participants were recruited, and no identifiable personal data were used or collected in this study, ethical approval was not required. All data used comply with the terms and conditions set by the original data sources.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Data availability

The data used in this study derive from three distinct sources: Client A: Data are from the JSIEC dataset and can be accessed at the following URL: https://www.kaggle.com/datasets/linchundan/fundusimage1000, which includes 39 categories of ocular fundus diseases. More information can be obtained from the paper available at https://www.nature.com/articles/s41467-021-25138-w. Client B: Data are from the dataset created by Sharmin et al. (https://doi.org/10.1016/j.dib.2024.110979), which comprises 5335 fundus images collected over eight months from Anawara Hamida Eye Hospital and B.N.S.B. Zahurul Haque Eye Hospital in Faridpur, Bangladesh. The dataset is publicly available on Mendeley Data with direct URL https://data.mendeley.com/datasets/s9bfhswzjb/1. Client C: Data are from the iChallenge-PM dataset, a medical dataset provided in a competition jointly organized by Baidu Brain and Sun Yat-Sen University Zhongshan Ophthalmic Center. This dataset is dedicated to myopia-related ophthalmic disease classification and can be accessed through the competition’s official platform (registration and approval required). All datasets were used in compliance with their respective licensing and access policies. Details on data usage agreements and access procedures are available from the corresponding authors upon reasonable request.

Supplemental material

Supplemental materials for this article are available online.
