Attention-based multi-scale edge-aware segmentation and convolutional transformer framework for automated glaucoma detection from fundus images

Abstract

Background

Glaucoma is a leading cause of irreversible vision loss and is characterized by subtle structural changes in the optic disc and optic cup. However, existing automated detection systems often suffer from weak boundary delineation, dataset variability, and unstable feature learning, which limit their generalizability and clinical reliability.

Objective

This study aims to develop a unified and anatomically guided framework for accurate and reliable automated glaucoma detection from fundus images.

Methods

The proposed pipeline begins with contrast-enhanced preprocessing to improve image quality, followed by an Attention-guided Multi-scale Edge-aware Segmentation Network (AME-SegNet) for precise segmentation of the optic disc and optic cup. Both deep convolutional features and clinically relevant geometric features are extracted and optimized using Bitterling Colony Optimization (BCO) to select the most discriminative attributes. A Convolutional Transformer (CT) is then employed to integrate local convolutional representations with global attention mechanisms for robust classification. Additionally, the Honey Badger Algorithm (HBA) is used for automatic parameter tuning to ensure stable convergence.

Results

Experimental evaluation demonstrates high segmentation performance with Dice scores of 97.36% for the optic disc and 96.72% for the optic cup on the Drishti-GS1 dataset. The classification model achieves accuracies of 98.63% on RIM-ONE and 98.96% on ORIGA-Light datasets, indicating strong generalization capability.

Conclusions

The proposed framework exhibits robust performance, high accuracy, and strong generalization across multiple datasets. These results highlight its effectiveness and clinical potential for reliable automated glaucoma screening and early diagnosis.

Keywords

glaucoma detection, fundus image analysis, optic cup, optic disc, convolutional transformer

Introduction

Glaucoma is a vision-threatening disease whose progression is closely linked to the geometry of the optic disc and optic cup; accurate extraction of these structures from fundus images is therefore a mainstay of glaucoma diagnosis. Manual delineation, however, tends to be subjective and inconsistent, which has motivated research into automated segmentation. Convolutional neural networks have increased segmentation accuracy, but differences in imaging devices and patient populations often limit their generalizability. Feature alignment and multi-color-space representation strategies have been proposed to address these problems and stabilize learning across datasets, while edge-focused optimization further improves anatomical boundary accuracy.¹ Glaucoma diagnosis also depends on the structural integrity of the neuro-retinal rim, which in turn depends on how accurately the optic disc and optic cup are identified. Traditional cup-to-disc analysis may not always reflect true disease severity, because blood vessels, low contrast, and anatomical variations can obscure these regions. Accurate disc and cup segmentation allows the rim tissue to be reconstructed, from which better indicators of glaucomatous damage can be derived. Adversarial learning strategies have been employed to improve segmentation quality and thus stabilize glaucoma screening based on rim measurements.²

Structural changes at the optic nerve head are usually the first signs of glaucoma, so detailed modelling of the optic disc and optic cup is essential for effective automated diagnosis. Clinically measurable biomarker ratios change as the cup enlarges and the rim thins, and these changes should be analyzed together with geometric and texture-based information. Combining classical image segmentation techniques with deep neural networks therefore allows robust feature extraction even when boundaries are poorly defined.³ Accurate localization of the disc is a prerequisite for efficient screening, since glaucomatous damage alters the morphology of the optic disc. However, inconsistent illumination and low contrast often hinder boundary detection in retinal images. To increase robustness, graph-based clustering and region-based segmentation algorithms have been integrated with convolutional learning. The resulting hybrid approach can recognize meaningful disc regions even in noisy or incomplete images, which strengthens automated glaucoma detection.⁴

The progression of glaucoma is reflected in the relationship between the optic disc and the optic cup; hence, their coordinated analysis is indispensable for early diagnosis. The observed relationship between disc and cup can vary with image quality and disease severity, which makes models based purely on local measurements potentially fragile. Combining global fundus context with disc-specific features provides deeper insight into the disease. In addition, embedding prior anatomical knowledge into the learning framework yields screening systems that are more consistent and generalizable.⁵ Glaucoma leads to structural damage and loss of functional vision, while traditional visual-field tests are hard to scale. Retinal imaging offers a non-invasive alternative because it captures anatomical changes related to disease severity. Measurements of the optic disc and cup provide structural descriptors that can be correlated with visual function. Multimodal deep learning frameworks go further by fusing clinical data with image-derived features, and hence predict current and future visual impairment more accurately.⁶

The early stages of glaucoma typically produce very subtle alterations in the optic nerve head that may not yet be evident on visual field testing. To detect the disease before permanent damage occurs, precise segmentation of the retinal structures is essential. Quantitative measures derived from these segmentations reduce subjectivity and improve reliability across observers. These anatomical features complement deep learning classifiers and thus enable large-scale, credible glaucoma screening from fundus images.⁷ Automatic recognition of glaucomatous pathology presupposes reliable extraction of the optic disc and optic cup boundaries from retinal images. However, these regions are often obscured by poor illumination, noise, and low contrast. To overcome these obstacles, current learning-based systems first enhance image quality and then extract clinically important features, even under difficult visual conditions. Such advances lower the rate of false diagnoses and increase sensitivity in early glaucoma identification.⁸

Optic disc analysis is needed not only for glaucoma but also for other optic nerve diseases that alter the appearance of the disc. When swelling or distortion occurs, conventional segmentation often fails to find the actual borders. More advanced boundary modeling techniques have therefore been introduced to track distorted disc contours. Coupling these models with machine-learning classifiers allows pathological changes to be identified accurately and guides early clinical intervention.⁹ Glaucoma involves progressive damage to the optic nerve head, and fundus imaging provides a window through which these changes can be examined directly. Traditional clinical tests for glaucoma detection can be costly and require specialized facilities, whereas deep learning offers a scalable automated alternative. Neural networks learn discriminative patterns from diverse retinal datasets and detect glaucomatous features with increasing reliability. This improves access to glaucoma detection, especially in settings where resources are limited.¹⁰

The main goal of this research is to develop an anatomically guided, fully automated glaucoma diagnosis framework that can segment the optic disc and optic cup from fundus images, extract clinically meaningful features, and accurately classify glaucomatous and healthy eyes under varying imaging conditions using biologically inspired optimization and deep learning.

Early detection of glaucoma is very important because the disease eventually causes irreversible vision loss through progressive damage to the optic nerve head. However, subtle structural changes within the optic disc and the optic cup are not always easy to detect, even for experienced clinicians, and their assessment depends strongly on image quality, anatomical variability, and observer subjectivity. Existing automated systems often treat segmentation, feature extraction, and classification as separate processes, with little interaction between anatomical structure learning and disease prediction from the learned maps. Most current state-of-the-art methods depend on fixed feature sets and manually tuned hyperparameters. These limitations significantly reduce the adaptability and reliability of the methods across different datasets and clinical environments. This provides a strong motivation for a self-optimizing, anatomically driven, unified approach that learns glaucoma-specific representations with stability, efficiency, and clinical relevance.

While many deep learning models for glaucoma detection have been proposed, most existing strategies treat segmentation, feature extraction, and classification as poorly connected or independent tasks, resulting in poor coordination between the learning of anatomical structures and disease prediction. Many models focus either on pixel-wise accuracy or on classification performance without ensuring that the two stages inform each other in a clinically meaningful way. In addition, current systems often depend on fixed feature sets or manually tuned hyperparameters, which limits their adaptability across datasets and imaging conditions. Such limitations motivate a framework that can simultaneously optimize anatomical segmentation, feature relevance, and classifier behavior in a data-driven and biologically inspired fashion.

This study investigates automated glaucoma screening from color fundus images using deep learning and biologically inspired optimization. The system is designed to:

Carry out pixel-level segmentation of the optic disc and optic cup for anatomical evaluation.

Calculate clinically relevant biomarkers such as cup-to-disc ratios and shape descriptors.

Support binary classification of glaucomatous versus normal eyes with dependable accuracy.

Run efficiently enough to facilitate real-time or large-scale screening applications.

The study is confined to fundus-image-based glaucoma detection and excludes multimodal data such as OCT or visual field measurements. Nevertheless, the architecture can be extended to incorporate such modalities later. The system is meant for screening and decision support, not to replace clinical diagnosis.
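The cup-to-disc ratios named among the biomarkers above can be derived directly from binary segmentation masks. The following is a minimal sketch, not the authors' implementation: the function names (`vertical_extent`, `cup_to_disc_ratios`) and the choice of reporting both a vertical and an area-based ratio are assumptions made for illustration.

```python
import numpy as np


def vertical_extent(mask: np.ndarray) -> int:
    """Number of image rows spanned by the foreground of a 2-D binary mask."""
    rows = np.flatnonzero(mask.any(axis=1))
    return 0 if rows.size == 0 else int(rows[-1] - rows[0] + 1)


def cup_to_disc_ratios(disc_mask, cup_mask) -> dict:
    """Vertical and area cup-to-disc ratios from binary disc and cup masks.

    Hypothetical helper: assumes both masks share the same shape and that
    the cup lies inside the disc.
    """
    disc = np.asarray(disc_mask, dtype=bool)
    cup = np.asarray(cup_mask, dtype=bool)
    disc_height = vertical_extent(disc)
    disc_area = int(disc.sum())
    if disc_height == 0 or disc_area == 0:
        raise ValueError("disc mask is empty")
    return {
        "vertical_cdr": vertical_extent(cup) / disc_height,
        "area_cdr": int(cup.sum()) / disc_area,
    }


# Toy masks: a 6x6 disc containing a 3x3 cup.
disc = np.zeros((10, 10), dtype=bool)
disc[2:8, 2:8] = True
cup = np.zeros((10, 10), dtype=bool)
cup[4:7, 4:7] = True
ratios = cup_to_disc_ratios(disc, cup)
print(ratios)
```

On the toy masks the cup spans 3 of the disc's 6 rows and covers 9 of its 36 pixels, so the vertical ratio is 0.5 and the area ratio is 0.25.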

The main contributions of this work are as follows:

An enhanced retinal preprocessing pipeline is designed to improve the visibility of the optic disc and optic cup by compensating for variations in illumination, contrast, and noise in fundus images.

An Attention-guided Multi-scale Edge-aware Segmentation Network (AME-SegNet) is developed to accurately identify and delineate the optic disc and optic cup boundaries while maintaining anatomical relevance.

Deep visual features are combined with clinically relevant structural descriptors extracted from the segmented optic disc and cup regions to better reflect glaucomatous damage.

BCO is applied to select the most discriminative features and reduce redundancy for better generalization.

A Convolutional Transformer-based classifier is built to jointly learn local retinal texture and long-range anatomical relationships for robust glaucoma detection.

HBA is used to tune the learning behavior of the classifier for stable convergence and accurate diagnosis on different datasets.

Cross-dataset generalization is assessed through validation on public datasets.
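The feature-selection step in the list above treats each candidate feature subset as something to be scored. This section does not state the objective that BCO optimizes, so the sketch below only illustrates a common wrapper-style fitness used with metaheuristic selectors: a weighted sum of classification error and the fraction of features kept. The nearest-centroid classifier, the weight `alpha`, and all function names are assumptions for illustration, and the search procedure itself (BCO) is not reproduced here.

```python
import numpy as np


def nearest_centroid_error(X: np.ndarray, y: np.ndarray) -> float:
    """Training error of a nearest-centroid classifier (an optimistic proxy)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance of every sample to every class centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predicted = classes[d2.argmin(axis=1)]
    return float((predicted != y).mean())


def subset_fitness(mask, X, y, alpha: float = 0.9) -> float:
    """Lower is better: alpha * error + (1 - alpha) * fraction of features kept."""
    mask = np.asarray(mask, dtype=bool)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if not mask.any():
        return 1.0  # an empty subset receives the worst possible score
    error = nearest_centroid_error(X[:, mask], y)
    return alpha * error + (1.0 - alpha) * mask.sum() / mask.size


# Toy data: feature 0 separates the classes, feature 1 is mostly noise.
X = np.array([[0.0, 0.0], [0.1, 1.0], [0.2, 0.0],
              [1.0, 1.0], [1.1, 0.0], [0.9, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
for candidate in ([1, 0], [1, 1], [0, 1]):
    print(candidate, round(subset_fitness(candidate, X, y), 3))
```

A population-based optimizer would evaluate many such binary masks and keep the ones with the lowest score. In practice the error term would be estimated on held-out data rather than on the training set.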

The rest of the paper is organized as follows: Section 2 deals with previous computerized methods for diagnosing glaucoma and retinal image analysis; Section 3 describes the complete design of the proposed segmentation and classification framework; Section 4 reports the experimental evaluations and quantitative results on a number of benchmark datasets; and finally, Section 5 wraps up the work at hand and proposes possible extensions for future work.

Related works

An end-to-end retinal segmentation model was developed to outline the optic disc and optic cup simultaneously in support of glaucoma screening. A strong feature extractor was embedded into a U-shaped network so that both fine-scale texture and global retinal structure could be captured, and spatial refinement was applied to sharpen the anatomical boundaries. A class-aware learning objective ensured that small regions such as the optic cup were not swamped by background pixels during training. Stable optimization allowed the network to converge effectively, yielding consistent and anatomically meaningful segmentation on public datasets.¹¹ A deep learning framework was introduced to enhance retinal image analysis for early glaucoma detection and severity assessment. Image normalization and contrast enhancement improved the visibility of disease-related structures, while data augmentation made the model more robust to real-world variations. A snapshot ensemble strategy allowed the classifier to build diverse feature representations without a large increase in computational cost. Parameter optimization improved classification reliability, and an independent segmentation module that characterizes disease subtypes once glaucoma is detected provided clinically richer diagnostic output.¹² To segment the optic disc and cup jointly, an attention-aware encoder-decoder model was built. Multi-scale attention related anatomical regions across long distances, and token aggregation reinforced the separation of disc, cup, and background. Clinical constraints were embedded in the learning objective to keep the model anatomically consistent. Self-supervised pretraining reduced dependence on labeled data and improved robustness.¹³
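The class-aware objective described for the first work in the paragraph above is listed in Table 1 as a multi-label Dice-Focal loss. The sketch below shows one plausible form of such a loss on probability maps. The equal weighting of the two terms (`focal_weight=1.0`), the focusing parameter `gamma=2.0`, and the function names are assumptions, and the cited work may combine the terms differently.

```python
import numpy as np


def soft_dice_loss(prob: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """1 - soft Dice overlap; insensitive to how many background pixels exist."""
    intersection = (prob * target).sum()
    return float(1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps))


def binary_focal_loss(prob, target, gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Cross-entropy down-weighted for pixels that are already well classified."""
    p = np.clip(prob, eps, 1.0 - eps)
    p_true = np.where(target == 1, p, 1.0 - p)  # probability of the true label
    return float((-((1.0 - p_true) ** gamma) * np.log(p_true)).mean())


def dice_focal_loss(probs, targets, focal_weight: float = 1.0) -> float:
    """Mean over labels (e.g. disc, cup) of Dice loss plus weighted focal loss.

    probs, targets: arrays of shape (n_labels, H, W); targets are 0/1 maps.
    """
    per_label = [
        soft_dice_loss(p, t) + focal_weight * binary_focal_loss(p, t)
        for p, t in zip(probs, targets)
    ]
    return float(np.mean(per_label))


# Toy targets: label 0 is a 4x4 disc, label 1 is a 2x2 cup inside it.
targets = np.zeros((2, 8, 8))
targets[0, 2:6, 2:6] = 1.0
targets[1, 3:5, 3:5] = 1.0

perfect = dice_focal_loss(targets.copy(), targets)
missed_cup = targets.copy()
missed_cup[1] = 0.0  # disc predicted correctly, cup missed entirely
print(perfect, dice_focal_loss(missed_cup, targets))
```

Because the Dice term is normalized by region size, missing the four-pixel cup is penalized heavily even though those pixels are a small fraction of the image.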

A lightweight, efficient deep learning pipeline for glaucoma detection was designed around synthetic data generation. Synthetic retinal images were created to address data imbalance and better represent pathological patterns. Noise suppression and image enhancement improved visual clarity and therefore the reliability of optic cup extraction. A compact convolutional model was then trained on the enriched dataset to discriminate between healthy and glaucomatous eyes, making precise predictions at low computational cost.¹⁴ An ensemble approach to glaucoma screening was proposed that combines several neural networks with complementary strengths. One model emphasizes hierarchical visual patterns, a second focuses on spatial detail, and a third captures long-range relationships within the retina. The predictions of the three models are aggregated through a majority vote to counteract individual model bias and improve system stability. This cooperative strategy enabled better decisions in clinical environments where images tend to vary widely.¹⁵
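The majority-vote aggregation described above for the three-model ensemble can be sketched in a few lines. This is an illustrative, unweighted vote over hard labels. The function name and the toy predictions are hypothetical, and the cited work's exact tie handling is not specified here.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(per_model_labels: Sequence[Sequence[int]]) -> List[int]:
    """Fuse hard labels from several models by unweighted majority vote.

    per_model_labels[m][i] is the label model m assigns to sample i.
    With an odd number of binary classifiers no ties can occur.
    """
    if not per_model_labels:
        raise ValueError("at least one model is required")
    n_samples = len(per_model_labels[0])
    if any(len(labels) != n_samples for labels in per_model_labels):
        raise ValueError("all models must label the same number of samples")
    fused = []
    for votes in zip(*per_model_labels):  # votes for one sample across models
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Toy predictions from three hypothetical models (1 = glaucoma, 0 = normal).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
print(majority_vote([model_a, model_b, model_c]))  # [1, 1, 1, 0]
```

Each model is outvoted on exactly one of the first three samples, which is the sense in which voting counteracts individual model bias.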

Glaucoma screening has been made easier to implement with a cloud-based, no-code learning framework that automates model building without requiring programming. The system uses a balanced training strategy so that it can distinguish clinically significant glaucoma from normal retinal images while reducing prediction bias. A structured data partitioning scheme controlled optimization and guarded against overfitting. After training, the model could be run both online and offline so that it could serve resource-limited clinics. Reliability was established by evaluating performance on an independent image source, demonstrating adaptability to different clinical environments.¹⁶ Another learning architecture integrates optic disc boundary extraction, optic cup region identification, and glaucoma detection, so that anatomical and diagnostic cues can reinforce each other. Efficient feature encoding reduces the computational load while preserving image detail. Graph-based decoding improves the modeling of spatial relationships between retinal structures. An attention mechanism guides the network toward clinically important regions so that they dominate feature learning. Structural indicators from segmentation augment the visual features, improving classification consistency.¹⁷

A dual-path network was designed to recognize multiple ocular diseases by combining detailed texture analysis with global structure understanding. Convolutional layers captured localized pathological cues, while transformer attention modeled long-range spatial dependencies. These complementary representations were fused adaptively, which prevented either from dominating the decision process. Balanced cross-validation stabilized performance across disease categories and permitted reliable detection of multiple retinal disorders within a single screening system.¹⁸ An ensemble-based retinal assessment framework was established as a low-cost alternative to traditional neurological assessments. Combining several deep architectures reduced prediction variance and thus improved robustness. Image-quality filtering criteria ensured that only informative retinal regions contributed to learning. The system could estimate demographic traits and, at the same time, detect various neuro-ophthalmic diseases. Annotations were used to verify that the retinal areas guiding the predictions were clinically meaningful.¹⁹

In an interpretable glaucoma diagnosis pipeline, deep learning outputs were linked to a clinically accepted anatomical measure. The optic disc was first localized and then segmented together with the optic cup for structural analysis. From these masks, a diagnostic ratio was computed to determine disease status. Transformer-based segmentation produced smooth and accurate anatomical contours. This architecture created a transparent diagnostic path between artificial intelligence and medical reasoning.²⁰ A transfer-learning-based classifier was created to distinguish between multiple retinal conditions within a single predictive model. Pretrained visual knowledge was adapted to ophthalmic images through fine-tuning, with regularization and augmentation to reduce sensitivity to orientation and illumination changes. A multi-class probability output made it possible to detect several diseases. Optimized training strategies ensured reliable convergence and stable performance.²¹

To address issues such as blurred edges and vessel interference during optic disc extraction, a dense attention-based segmentation network was proposed. Multi-scale contextual learning was employed to preserve both fine and coarse anatomical features. Attention mechanisms emphasized pixel-level relationships across encoder and decoder stages.²² Enlarged receptive fields provided structural continuity, which allowed reliable delineation of optic discs across datasets. An automated optic disc localization system was also introduced to make downstream glaucoma analysis more robust. Initial brightness-based preprocessing narrowed the search region and reduced unnecessary computation. A deep classification network then refined the pixel-level disc boundaries, aided by multi-scale contextual cues. Advanced convolutional operations helped separate the disc from the background accurately, and consistent localization across various datasets confirmed the reliability of the method.²³

An ensemble pipeline for glaucoma detection was proposed that begins by correcting poor-quality retinal images. A generative enhancement stage converted degraded scans into clearer representations. Anatomical segmentation then extracted the disc and cup regions, from which clinical measurements were derived and used to classify disease status. The integrated approach built robustness against noise, class imbalance, and image degradation.²⁴ An improved attention-guided U-Net variant was built for better segmentation of the optic disc and cup. Preprocessing steps reduced vessel interference and normalized image appearance. Multi-scale dilated features preserved structural context across resolutions. Dual attention gates highlighted clinically relevant regions while suppressing irrelevant background. Optimized training produced consistent segmentation even across heterogeneous datasets.²⁵

A two-step detection framework was developed in which optic disc localization preceded glaucoma classification. Because annotated disc boundaries were not available, a fully supervised disc detector could not be trained directly, so a semi-automated labeling strategy was devised to train the detector. The extracted disc regions were then classified using deep residual networks. This region-wise approach improved diagnostic accuracy, and deeper architectures were shown to have greater discriminative power.²⁶ A retinal analysis pipeline was designed to sharpen the focus on nerve fiber layer abnormalities through specialized image preprocessing. Channel-specific enhancement revealed disease-associated patterns. The refined images were then assessed using a deep convolutional classifier adapted through transfer learning. Combining image processing with deep learning in this way improved the separation between disease classes, and the system discriminated well between glaucomatous and healthy eyes.²⁷

A multi-step framework for glaucoma diagnosis was developed that combines deep learning with clinical insight. The optic disc area, the anatomically most relevant region for the disease, was isolated first. Blood vessel removal helped suppress structural noise. Dual-branch feature extraction obtained complementary information from the retina, and these features were fused to arrive at a stable disease decision.²⁸ A joint segmentation model was trained to balance the anatomical correctness of the optic disc and optic cup against pixel-wise accuracy. Multi-resolution feature fusion incorporated both fine detail and large-scale context, and progressive refinement improved boundary accuracy. To avoid distorted or broken contours, geometric constraints were integrated through contour-based reconstruction techniques. Consequently, the final contours were forced to be closed and anatomically valid, which made the results more amenable to clinical measurement and analysis.²⁹

A further technique applies semi-supervised learning to glaucoma severity grading so that both labeled and unlabeled retinal images can be used. A student-teacher architecture enforced prediction consistency across different views of the data. Relationship-based constraints also improved learning stability. The use of a much larger quantity of unlabeled images improved generalization with less annotation effort. External evaluation confirmed that the model withstands dataset variation.³⁰ A dual-modality fusion framework was introduced to couple retinal surface information with cross-sectional structural detail. Separate networks extract complementary features from fundus and OCT images. Merging these features resulted in a more complete profile of the disease. Because it jointly models optic disc appearance and nerve fiber integrity, the system diagnoses glaucoma better than single-modality approaches.³¹

Related work also indicates that disease classification benefits from combining segmentation outputs with advanced deep learning systems. In one study, pre-segmented medical images were fed into an Enhanced Swin Transformer-based classification framework. The framework was enhanced with a multi-layer perceptron guided by a Residual Pyramid Network, which extracted strong multi-scale feature representations and improved classification accuracy by preserving both local and global contextual information.³² Another method extracted texture-based descriptors using the Gray Level Co-occurrence Matrix and discarded visually irrelevant information. In that method, the images were segmented with a U-Net architecture, while higher-level feature representations were learned by a three-dimensional convolutional neural network. The final classification decision combined the predictions of several independently trained model architectures to improve robustness and reliability.³³ In another setting, an evolutionary deep learning framework was devised for the diagnosis of Autism Spectrum Disorder. The clinically important areas identified by U-Net-based segmentation were analyzed with this technique. The model hyperparameters were optimized with a Chaotic Butterfly Optimization strategy to improve convergence and generalization, while high-level features were extracted using an Inception-v3 network.³⁴,³⁵ Finally, the temporal dependencies of the extracted features were modeled with a long short-term memory network for reliable disorder classification.³⁶
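The Gray Level Co-occurrence Matrix descriptors mentioned in the paragraph above can be illustrated with a small NumPy sketch. It assumes an image already quantized to integer gray levels and a single non-negative pixel offset. The function names and the three descriptors chosen (contrast, homogeneity, angular second moment) are illustrative and are not taken from the cited work.

```python
import numpy as np


def glcm(image, levels: int, dx: int = 1, dy: int = 0) -> np.ndarray:
    """Symmetric, normalized co-occurrence matrix for the offset (dy, dx)."""
    img = np.asarray(image, dtype=np.intp)
    if img.ndim != 2 or dx < 0 or dy < 0:
        raise ValueError("expects a 2-D image and a non-negative offset")
    if img.min() < 0 or img.max() >= levels:
        raise ValueError("gray levels must lie in [0, levels)")
    h, w = img.shape
    reference = img[: h - dy, : w - dx].ravel()
    neighbour = img[dy:, dx:].ravel()
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (reference, neighbour), 1.0)  # accumulate repeated pairs
    counts = counts + counts.T  # count each pair in both directions
    total = counts.sum()
    if total == 0:
        raise ValueError("image is too small for this offset")
    return counts / total


def texture_features(p: np.ndarray) -> dict:
    """Three classical descriptors computed from a normalized GLCM."""
    i, j = np.indices(p.shape)
    diff2 = (i - j) ** 2
    return {
        "contrast": float((diff2 * p).sum()),
        "homogeneity": float((p / (1.0 + diff2)).sum()),
        "angular_second_moment": float((p ** 2).sum()),
    }


# Toy 2-level image, horizontal neighbour offset.
toy = np.array([[0, 0, 1],
                [0, 1, 1]])
features = texture_features(glcm(toy, levels=2))
print(features)
```

For the toy image every entry of the normalized matrix equals 0.25, so the contrast is 0.5, the homogeneity 0.75, and the angular second moment 0.25.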

Recent studies have explored deep learning approaches for automated glaucoma detection using fundus images. A dual-network deep learning framework was developed to simulate human grading behavior, improving diagnostic transparency and achieving strong classification performance across multi-population datasets, demonstrating robustness against data diversity challenges.³⁷ Another segmentation-based framework employed independent convolutional neural networks to delineate optic disc and optic cup regions for accurate cup-to-disc ratio estimation, achieving high segmentation accuracy on benchmark retinal datasets.³⁸ Transformer-based architectures have also been introduced to replace conventional convolution operations, showing improved generalization ability and enhanced feature interpretability through attention mechanisms for glaucoma classification.³⁹ To address dataset variability, regression-based deep learning models were proposed using large multi-source fundus datasets, achieving strong generalizability and consistent screening performance across different populations and imaging conditions.⁴⁰ Additionally, convolutional neural network-based classification approaches utilizing data augmentation and channel-based feature extraction demonstrated high detection accuracy and early-stage glaucoma diagnosis capability across multiple retinal image datasets.⁴¹

Limitations and issues of existing models

Although much progress has been made in glaucoma detection and optic cup segmentation through deep learning, some important drawbacks remain. First, most approaches use a two-stage or uncoupled pipeline for segmentation and classification, which leads to weak interaction between learning the anatomical structure and predicting the disease. As a consequence, segmentation errors around an ambiguous cup-disc boundary may propagate to classification and diminish diagnostic reliability. Furthermore, most models are highly sensitive to illumination conditions, imaging devices, and patient populations, so performance declines markedly when they are evaluated on data other than those used for training. Many of these methods rely on hand-crafted features or on fixed clinical ratios such as the cup-to-disc ratio (CDR) alone, which do not fully capture the complex morphological and textural changes related to glaucoma. In addition, many deep learning frameworks use all extracted features without filtering them for relevance, which adds redundancy and noise and contributes to overfitting and computational burden. Finally, most current systems depend on manually chosen hyperparameters, which can make them unstable and difficult to adapt between datasets. These limitations call for a unified, anatomically guided, self-optimizing framework that enhances segmentation accuracy, feature relevance, and classification robustness, which motivates the proposed AME-SegNet architecture. A comparison of recent approaches to optic disc and cup segmentation and glaucoma detection is given in Table 1, together with their methodologies, merits, and limitations.

Table 1.

Summary of conventional deep learning–based glaucoma detection and segmentation methods.

Authors	Methodology	Advantages	Limitations
Huang et al. ¹	Employs adaptive global style alignment, integrates features from multiple color spaces, and utilizes edge-aware loss for the combined segmentation of OC and OD.	Improves cross-domain generalization and boundary precision	Depends on a selected reference target domain for style alignment
Chowdhury et al. ²	GAN with attention-based Teacher and EM-based Student to segment optic disc and cup and generate the rim for SqueezeNet-based glaucoma detection	Reduces mode collapse and improves rim-based glaucoma discrimination	Requires separate disc and cup segmentation models before rim generation
Sangeetha J. et al. ³	Hybrid feature extraction (OD & OC segmentation + GLCM + CDR, BVR, NRR) fused with ResNet-152 for glaucoma classification	Integrates both structural and texture features, improving discrimination of glaucoma	High computational complexity due to very deep ResNet-152 architecture
Sanghavi et al. ⁴	Uses SLIC superpixels and normalized graph cut to segment the optic disc, followed by a 13-layer CNN for glaucoma classification	Can process both full retinal images and pre-segmented optic discs	Requires a separate preprocessing step to decide whether segmentation is needed
Kui et al. ⁵	Uses dual-path global and optic-disc features with binocular fusion	Integrates medical prior knowledge into network design	Requires optic disc localization stage
Huang et al. ⁶	Multi-modal and longitudinal deep learning (MLEDL) combining ROI fundus images, OD/OC representation, and clinical data to predict visual fields	Predicts present and future visual field loss from simple fundus photographs	Moderate-severity visual field grading is less accurate
Sharma et al. ⁷	Hybrid multi-model AI-GS using segmentation (disc, cup, fovea), RNFLD, DH detection, and classification fused via FFCN	Integrates multiple glaucomatous features for robust early detection	Performance decreases in real-world screening compared to controlled datasets
Abbasi et al. ⁸	Adaptive gamma enhancement with quantile-based histogram equalization and DCNN classification	Improves fundus image contrast while preserving structural similarity	Performance depends on correct blurriness grading for enhancement
Naing et al. ⁹	FGVF-based OD boundary segmentation with hybrid localization (HLM) and linear SVM classification	Accurately segments edematous and non-edematous optic discs in mixed datasets	Computationally intensive and sensitive to initialization
Chaurasia et al. ¹⁰	Deep-learning (VGG19-BN) based classification of fundus images for glaucoma screening	Uses large multi-dataset fundus data for generalised glaucoma detection	Model performance drops slightly on unseen external data
Wang et al. ¹¹	Uses EfficientNet-B4 U-Net and multi-label Dice-Focal loss for joint OD–OC segmentation	Produces highly accurate and smooth OD/OC boundaries	Performance degrades on low-contrast fundus images
Geetha et al. ¹²	Uses Snapshot-Ensemble EfficientNet-B4 with AQO and V-Net for glaucoma detection and stage segmentation	Very high diagnostic accuracy with ensemble learning	High computational complexity due to deep ensemble and segmentation pipeline
Zhou et al. ¹³	Attention-aware CNN with multi-scale and aggregation attention	Captures global and local optic disc–cup features	Two-stage training increases model complexity
Govindharaj et al. ¹⁴	GAN-based data augmentation with Enhanced Level Set segmentation and MobileNetV2 classification	Handles class imbalance and improves glaucoma detection accuracy	Uses complex multi-stage processing pipeline
Guntreddi et al. ¹⁵	Majority-vote ensemble of ResNet50, VGG16 & Swin Transformer	Combines CNN and Transformer strengths	Uses simple (unweighted) voting
Milad et al. ¹⁶	Code-free deep learning model trained on fundus images using Google Vertex AI	Enables clinicians without coding to build accurate glaucoma screening models	Requires cloud-based data upload for training
Lenka et al. ¹⁷	Modified U-Net with MobileNetV2 encoder, Attention Module, and Res-GCN decoder for joint OD, OC segmentation and glaucoma classification.	Combines structural (CDR) and image-based features in a unified multi-task framework.	Model complexity is higher due to integration of GCN and attention mechanisms.
Rieck et al. ¹⁸	Hybrid EfficientNet-B4 + Swin Transformer V2 with weighted feature fusion for multi-class eye disease classification	Combines local and global feature learning for robust broad-disease detection	Performance may vary when applied to datasets different from the training dataset
Hu et al. ¹⁹	Ensemble of CNNs and Vision Transformers trained on UK Biobank fundus images	Enables fast, non-invasive diagnosis of neurodegenerative diseases	Performance constrained by limited disease-specific sample size
Alasmari et al. ²⁰	Uses YOLOv11 for OD detection and MaskFormer (Swin-Base) for OD–OC segmentation with vCDR-based classification	Provides explainable glaucoma detection using a clinical biomarker (vCDR)	Uses vCDR alone for diagnosis without incorporating other image-level features
Alsohemi et al. ²¹	EfficientNetB3 with transfer learning, data augmentation, and cosine learning-rate scheduling	High classification performance with efficient computation	Lacks domain-specific pretraining for retinal images
Ma et al. ²²	Uses U-Net backbone enhanced with Dense Attention for OD segmentation	Improves boundary precision and multi-scale feature learning	Performance may vary with image quality and contrast
Sreedevi et al. ²³	Uses dilated convolutions with Spatial Pyramid Pooling and VGG-16 for semantic optic disc segmentation	Provides high segmentation accuracy for optic disc localization	Requires deep learning training with pre-processed fundus images
Lenka et al. ²⁴	Cycle-GAN with Autoencoder for enhancement, U-Net for OD/OC segmentation, and SVM using CDR for glaucoma classification	Reduces artifacts and improves detection on imbalanced datasets	Needs high-quality reference images for training
Kumar et al. ²⁵	Attention-based U-Net with DDSC blocks, ResNet-18 encoder and vessel-inpainting pre-processing	Improves multi-scale feature learning and optic cup segmentation accuracy	Sensitive to shadowing and over-exposed regions causing segmentation artefacts
Sheraz et al. ²⁶	Two-stage YOLO-v4 for OD localization and ResNet-101 for glaucoma classification	Fully automatic OD-based glaucoma detection	Requires custom ground-truth generation
Tampa et al. ²⁷	Blue-channel fundus preprocessing + Modified VGGNet-19 CNN	Uses transfer learning for fast and accurate classification	Depends on preprocessing quality for feature visibility
Yi et al. ²⁸	Multi-step MSGC-CNN combining ROI extraction, vessel removal, RA-ResNet feature extraction and feature fusion	Uses pathological knowledge and vessel-removed optic disc features for better glaucoma diagnosis	Multi-step architecture increases system complexity
Chen et al. ²⁹	Combines HRNet-based pixel segmentation with Fourier-based contour reconstruction	Preserves closed and anatomically valid OD and OC boundaries	Requires polar transformation and contour parameter regression
Wang et al. ³⁰	SRC-MT semi-supervised DenseNet121 for glaucoma grading	Uses large unlabeled OOD fundus data to improve learning	Sensitive to distribution shift between EyeW and EyeQ datasets
Islam et al. ³¹	Dual-branch CNN with mid-level fusion of fundus (ResNet-18) and OCT (custom CNN) features	Uses complementary fundus and OCT information to improve glaucoma detection	Requires paired fundus and OCT images for each patient

Although deep learning has advanced rapidly in medical image analysis, glaucoma screening remains challenging because it depends on subtle changes in the optic nerve head that generic feature learning does not capture easily. In addition, some automated systems fail to distinguish pathological from normal disc and cup boundaries, which reduces the reliability of parameters such as the cup-to-disc ratio. As a result, model predictions become inconsistent when applied to data collected with different devices, from different populations, or in different clinical settings.

Moreover, existing frameworks lack intelligent feature refinement and adaptive learning control, so classification relies on noisy and redundant representations, which increases computation and weakens generalization. Without a unified strategy that retains only meaningful anatomical and textural cues while also tuning the learning mechanism itself, automated glaucoma detection systems will not reach the robustness needed for clinical deployment. A coherent, self-optimizing framework that tightly couples anatomical segmentation, feature selection, and classification is therefore needed for accurate, consistent, and scalable glaucoma diagnosis.

Overall framework of the proposed system

The proposed system is a unified and fully automated framework for glaucoma screening that processes retinal fundus images sequentially through several tightly coupled computational stages, as depicted in Figure 1. In the first stage, color fundus images from three benchmark datasets enter a preprocessing module, where illumination imbalance, contrast degradation, and noise artifacts are corrected through green-channel extraction, contrast-limited adaptive histogram equalization, edge-preserving bilateral smoothing, and min-max intensity normalization. The processed images are then passed to the AME-SegNet segmentation network, which segments the optic cup and optic disc at the pixel level while leveraging multi-scale contextual information. The segmented anatomical regions are used to compute deep feature vectors together with structural and spatial attributes relevant to glaucoma assessment. Because the extracted attributes differ in discriminative power, Bitterling Colony Optimization is employed to identify and retain the most informative feature subset, which improves compactness and generalization. The optimized feature vectors then serve as input to a convolutional transformer classifier, which combines local convolutional responses with global self-attention modeling to distinguish glaucomatous from normal eyes. The Honey Badger Algorithm is incorporated to automatically tune the learning and architectural parameters of the classification model, supporting reliable convergence. This coordinated workflow, from data acquisition and enhancement through segmentation and feature refinement to final decision making, is designed to provide robust and accurate automated glaucoma detection.

Figure 1.

Block diagram of the proposed AME-SegNet based glaucoma diagnosis framework.

Datasets description

To validate the glaucoma diagnosis framework under different imaging conditions, three widely used public retinal image databases were employed: RIM-ONE, Drishti-GS1, and ORIGA-Light. All three consist of color fundus photographs centered on the optic nerve head, so they are suitable for both optic disc and cup segmentation and glaucoma diagnosis.

The Drishti-GS1 dataset⁴² contains 101 high-quality retinal fundus images that were acquired under clinical conditions and graded by a professional ophthalmologist, as shown in Figure 2. The images are divided into 50 training and 51 testing samples and include both healthy and glaucomatous eyes. Besides class labels, pixel-level ground truth for the optic disc and cup is available, which supports segmentation accuracy measurement. The spatial resolution is sufficient to capture fine structural details around the optic nerve head, which is critical for assessing cup-to-disc variations.

Figure 2.

Drishti-GS1 sample image.

The RIM-ONE dataset⁴³ consists of 169 retinal fundus photographs of normal and glaucoma-affected eyes as shown in Figure 3. This dataset also contains manual annotations of the optic disc and optic cup masks that ensure a reliable evaluation of segmentation-based algorithms. Images are equally distributed among the two diagnostic classes and are acquired at a consistent imaging quality and resolution, making it a good benchmark for both region delineation and disease classification methods.

Figure 3.

RIM-ONE dataset sample image.

The ORIGA-Light database,⁴⁴ the largest and most diverse of the three sets, contains 650 fundus images, each labeled with its clinical glaucoma status, as shown in Figure 4. Along with the diagnostic labels, this database provides expert-generated ground truth for the segmentation of the optic disc and optic cup, which supports in-depth structural analysis. Compared with the other two databases, ORIGA-Light shows greater variation in optic nerve head appearance, image contrast, and illumination, making it especially useful for training and validating deep learning models in real-world screening scenarios.

Figure 4.

ORIGA-Light dataset sample image.

All datasets used in this analysis are publicly available online without access restrictions. The retinal fundus image datasets used for glaucoma segmentation and classification can be obtained from Kaggle at https://www.kaggle.com/datasets/ayush02102001/glaucoma-classification-datasets. This repository organizes public glaucoma fundus image datasets and contains annotated images suitable for optic disc and optic cup segmentation as well as glaucoma classification. All datasets were used strictly for research purposes, as permitted by their original licensing and usage guidelines. No private or proprietary clinical data were used in this research.

Optic disc and optic cup segmentation in this work is fully automated and performed by the proposed AME-SegNet framework, while the ground-truth reference masks used for training and evaluation come from expert-annotated public datasets. More specifically, the Drishti-GS1 database provides pixel-wise delineations of the optic disc and optic cup created by trained ophthalmologists following standard clinical protocols. These expert annotations serve as the gold standard for supervising the learning process and for objectively evaluating segmentation accuracy. During training, each fundus image is paired with the corresponding expert-provided disc and cup masks, and AME-SegNet learns to map the preprocessed retinal image to these anatomical labels through supervised learning. At inference, the model performs automatic pixel-level segmentation of the optic disc and optic cup on previously unseen images without manual intervention, which ensures consistency, reproducibility, and scalability for large-scale screening.

The segmentation procedure consists of the following stages: (i) preprocessing to enhance disc and cup visibility, (ii) deep segmentation with AME-SegNet, which yields probability maps for the cup and disc, and (iii) softmax classification followed by thresholding to generate binary masks. The predicted masks are used to measure clinically relevant indicators such as the cup-to-disc ratio and to extract anatomical features for glaucoma classification. This sequential pipeline keeps segmentation clinically grounded through expert labels and fully automated through deep learning, and thus provides reliable and unbiased anatomical measures for glaucoma detection.

Image preprocessing

Before segmentation and classification, the retinal fundus images pass through a preprocessing pipeline designed to enhance diagnostically important structures while suppressing irrelevant variations. The operations are applied in a fixed order, with each stage operating on the output of the previous one, so that the learning model receives stable and informative inputs.

Green channel extraction

A color fundus image is composed of red, green, and blue components, each of which responds differently to retinal tissue. Among these, the green channel offers the highest contrast for retinal vessels, optic disc margins, and the optic cup, owing to the preferential absorption of green light by hemoglobin relative to red or blue light; anatomical boundaries therefore appear clearly with reduced background saturation. Let $I_{R G B} (x, y)$ be the original image; the green channel $I_{G} (x, y)$ is obtained by selecting the second channel, as given in Eq. (1)⁴⁵:

I_{G} (x, y) = I_{R G B}^{(G)} (x, y)

(1)

which serves as the base image for the subsequent enhancement steps.

Contrast enhancement using CLAHE

The extracted green-channel image then undergoes Contrast Limited Adaptive Histogram Equalization (CLAHE) to counter non-uniform lighting and local intensity compression. Unlike standard histogram equalization, which adjusts the global intensity distribution, CLAHE operates on small contextual regions and limits noise amplification by clipping the histogram at a predetermined threshold. For a local window W, the cumulative distribution function $C_{W} (i)$ is calculated after clipping and redistribution, and each pixel intensity i is then transformed as in Eq. (2).⁴⁶

I_{C L A H E} (x, y) = I_{\min} + (I_{\max} - I_{\min}) \times C_{W} (i)

(2)

where $I_{\min}$ and $I_{\max}$ define the allowable intensity range. This adaptive procedure brings out faint structures and sharpens the disc and cup boundaries, which supports better segmentation.
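The clipping and mapping steps behind Eq. (2) can be illustrated on a single tile. The following is a minimal sketch, assuming NumPy, an 8-bit tile, and a clip limit expressed as a multiple of the mean bin count; the function name, the clip-limit value, and the tile size are illustrative assumptions and are not specified in this work. Full CLAHE additionally interpolates between neighboring tiles, which is omitted here.

```python
import numpy as np

def clipped_equalize_tile(tile, clip_limit=4.0, n_bins=256, i_min=0.0, i_max=255.0):
    """Clip-limited histogram equalization of one tile (the core step of CLAHE).

    tile: 2-D uint8 array. clip_limit is a multiple of the mean bin count
    (an assumed convention; the value used in the paper is not given).
    """
    hist = np.bincount(tile.ravel(), minlength=n_bins).astype(np.float64)
    # Clip the histogram and spread the clipped excess evenly over all bins.
    limit = clip_limit * hist.mean()
    excess = np.maximum(hist - limit, 0.0).sum()
    hist = np.minimum(hist, limit) + excess / n_bins
    # Cumulative distribution C_W(i), normalized to [0, 1].
    cdf = np.cumsum(hist) / hist.sum()
    # Eq. (2): map each intensity i to I_min + (I_max - I_min) * C_W(i).
    lut = i_min + (i_max - i_min) * cdf
    return lut[tile]
```

Applied to a low-contrast tile, the lookup table stretches the occupied intensity range, while the clip limit bounds how steep the mapping can become and therefore how much noise is amplified.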

Noise reduction using bilateral filtering

After contrast adjustment, minor noise artifacts and texture variations may still be present. The bilateral filter reduces these by smoothing spatially while preserving edges. The filtered pixel value at location p is computed using Eq. (3).⁴⁷

I_{B F} (p) = \frac{1}{W_{p}} \sum_{q \in Ω} I_{C L A H E} (q) \exp (- \frac{∥ p - q ∥^{2}}{2 σ_{s}^{2}}) \exp (- \frac{∣ I_{C L A H E} (p) - I_{C L A H E} (q) ∣^{2}}{2 σ_{r}^{2}})

(3)

where $Ω$ defines the neighborhood around p, $σ_{s}$ controls spatial smoothing, $σ_{r}$ controls intensity similarity, and $W_{p}$ is a normalization factor. This dual weighting scheme gives pixels with similar intensities a larger influence, which helps to maintain the boundaries of the optic disc and cup while suppressing background noise.
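Eq. (3) can be written almost literally in NumPy. The sketch below is a naive implementation intended only to make the two weights explicit; the window radius and the values of $σ_{s}$ and $σ_{r}$ are assumptions, since they are not reported here, and the image is assumed to be a 2-D floating-point array.

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Naive bilateral filter implementing Eq. (3) on a 2-D float image."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="reflect")
    num = np.zeros_like(img)
    den = np.zeros_like(img)  # accumulates the normalizer W_p
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbor q = p + (dy, dx) for every pixel p at once.
            shifted = padded[radius + dy: radius + dy + h,
                             radius + dx: radius + dx + w]
            w_spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            w_range = np.exp(-((img - shifted) ** 2) / (2.0 * sigma_r ** 2))
            weight = w_spatial * w_range
            num += weight * shifted
            den += weight
    return num / den
```

Across a sharp intensity step the range weight becomes negligible, so pixels on opposite sides of a boundary barely influence each other, which is the edge-preserving behavior described above.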

Min–max normalization

Finally, the resulting image is scaled to a standard numerical range so that input values are comparably proportioned during network training. Normalization is performed according to Eq. (4):

I_{n o r m} (x, y) = \frac{I_{B F} (x, y) - I_{m i n}}{I_{m a x} - I_{m i n}}

(4)

where $I_{m i n}$ and $I_{m a x}$ are the minimum and maximum intensity values in the image, $I_{B F} (x, y)$ is the filtered intensity at pixel location (x, y), and $I_{n o r m} (x, y)$ is the normalized intensity after preprocessing. This maps pixel values into the interval [0, 1], which prevents very bright areas from dominating numerically and gives smooth gradient updates during training. Matching the dynamic range of the inputs in this way improves convergence behavior and makes the deep learning model more stable.
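The first and last preprocessing steps, Eq. (1) and Eq. (4), reduce to a few lines. This is a minimal sketch assuming NumPy and an H x W x 3 array with the green channel at index 1; the small `eps` term is an added safeguard against a constant image and is not part of Eq. (4).

```python
import numpy as np

def green_channel(rgb):
    """Eq. (1): keep the second (green) channel of an H x W x 3 array."""
    return np.asarray(rgb)[..., 1]

def min_max_normalize(img, eps=1e-8):
    """Eq. (4): rescale intensities to [0, 1]; eps guards against a constant image."""
    img = np.asarray(img, dtype=np.float64)
    i_min, i_max = img.min(), img.max()
    return (img - i_min) / (i_max - i_min + eps)
```

In the pipeline described above, `min_max_normalize` would be applied to the bilateral-filtered image rather than directly to the green channel.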

AME-SegNet architecture

The AME-SegNet model is designed as a retinal structure segmentation network capable of distinguishing the fine differences between the optic disc and optic cup. Unlike conventional encoder-decoder networks that treat all features uniformly, the proposed architecture applies a scale-adaptive, boundary-aware strategy to enhance regional discrimination. The overall design of the AME-SegNet pipeline, comprising the encoder, decoder, and embedded enhancement modules, is shown in Figure 5.⁴⁸

Figure 5.

Encoder–decoder backbone of the proposed AME-SegNet architecture.

Base SegNet backbone

The AME-SegNet backbone is a symmetric encoder-decoder structure that gradually reduces spatial resolution to extract abstract representations, which are later reconstructed for pixel-level prediction. We denote the input retinal image as $I \in R^{H \times W \times C}$. In the encoder, a sequence of convolutional layers generates feature maps according to Eq. (5):

F_{l} = σ (W_{l} * F_{l - 1} + b_{l})

(5)

where $W_{l}$ and $b_{l}$ are the convolutional weights and bias at layer l, * denotes convolution, and $σ (⋅)$ is the nonlinear activation function. Each encoder module is followed by a max-pooling operation, which reduces the spatial resolution as per Eq. (6):

P_{l} (x, y) = \max_{(i, j) \in Ω} F_{l} (x + i, y + j)

(6)

where $Ω$ defines the pooling window. The spatial locations of the maxima are stored as the pooling indices $Π_{l}$. These indices record the positions of prominent activations, which are important for accurate reconstruction of the anatomical boundaries.

In the decoder, up-sampling does not rely on interpolation; instead, the stored pooling indices are reused. The unpooling operation reconstructs the feature maps using Eq. (7):

U_{l} (Π_{l}, P_{l}) \to {\hat{F}}_{l}

(7)

where ${\hat{F}}_{l}$ denotes the reconstructed feature map at the lth layer. The pooled values are returned to their original encoder positions by means of $Π_{l}$. This index-driven expansion avoids spatial ambiguity and keeps retinal boundaries aligned during reconstruction. Because of this exact spatial recovery and its efficient memory usage, SegNet is well suited to biomedical segmentation applications, where edge accuracy and anatomical fidelity matter.
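The pairing of Eq. (6) and Eq. (7) can be sketched for a single-channel map as follows. The code assumes NumPy, non-overlapping k x k windows, and spatial dimensions divisible by k; these are simplifying assumptions for illustration, not details taken from AME-SegNet.

```python
import numpy as np

def max_pool_with_indices(f, k=2):
    """Non-overlapping k x k max pooling (Eq. 6) that also returns the argmax indices."""
    h, w = f.shape
    assert h % k == 0 and w % k == 0, "sketch assumes dimensions divisible by k"
    # Rearrange so the last axis enumerates the k*k entries of each pooling window.
    windows = (f.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    idx = windows.argmax(axis=-1)                      # pooling indices Pi_l
    pooled = np.take_along_axis(windows, idx[..., None], axis=-1)[..., 0]
    return pooled, idx

def max_unpool(pooled, idx, k=2):
    """Index-driven unpooling (Eq. 7): place each pooled value at its recorded position."""
    hp, wp = pooled.shape
    windows = np.zeros((hp, wp, k * k), dtype=pooled.dtype)
    np.put_along_axis(windows, idx[..., None], pooled[..., None], axis=-1)
    return (windows.reshape(hp, wp, k, k)
                   .transpose(0, 2, 1, 3)
                   .reshape(hp * k, wp * k))
```

The unpooled map is sparse: every value sits exactly where the encoder found its maximum, and all other positions are zero until the following decoder convolutions fill them in.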

Multi-Resolution feature aggregation module (MRFAM)

The optic disc and optic cup differ considerably in size and texture, so a single-scale convolution cannot represent both structures adequately. Accordingly, the Multi-Resolution Feature Aggregation Module (MRFAM) processes the incoming features through parallel convolutional streams that operate at different receptive field sizes, as shown in Figure 6. MRFAM applies several parallel convolutions to the input feature map F with kernel sizes $k_{1}, k_{2}, k_{3}$, producing the responses defined in Eq. (8).

M_{i} = W_{k_{i}} * F, \quad i \in \{1, 2, 3\}

(8)

Figure 6.

Multi-Resolution feature aggregation module (MRFAM).

While small kernels are used to capture fine local detail useful in the identification of the optic cup boundary, larger kernels provide more spatial context needed to detect the optic disc. The concurrent multi-scale responses are concatenated using Eq. (9):

M_{a g g} = Concat (M_{1}, M_{2}, M_{3})

(9)

The concatenated features are then compressed by an additional convolution, as described in Eq. (10):

F_{m} = W_{f} * M_{a g g}

(10)

This multi-resolution fusion lets the network attend to small and large structures at once, so that neither the small cup region nor the large disc region is lost. The architecture of the multi-scale processing block is illustrated in Figure 6.
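Eqs. (8) to (10) can be sketched for a single-channel input as three parallel convolutions, a channel-wise concatenation, and a 1 x 1 fusion. The code assumes NumPy; the kernel sizes 1, 3, and 5 used in the usage note, the zero padding, and the single-channel input are illustrative assumptions, because $k_{1}, k_{2}, k_{3}$ and the channel counts are not specified here.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Zero-padded 'same' 2-D cross-correlation of a single-channel map, odd kernel size."""
    k = kernel.shape[0]
    padded = np.pad(x, k // 2, mode="constant")
    out = np.zeros_like(x, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy: dy + x.shape[0], dx: dx + x.shape[1]]
    return out

def mrfam(x, kernels, fuse_weights):
    """Eqs. (8)-(10): parallel convolutions, concatenation, and 1 x 1 fusion."""
    branches = [conv2d_same(x, k) for k in kernels]   # M_1, M_2, M_3 (Eq. 8)
    m_agg = np.stack(branches, axis=-1)               # Concat -> H x W x 3 (Eq. 9)
    return m_agg @ fuse_weights                       # 1 x 1 conv as per-pixel linear map (Eq. 10)
```

For example, `mrfam(x, [np.full((k, k), 1.0 / (k * k)) for k in (1, 3, 5)], np.array([0.5, 0.25, 0.25]))` blends an identity branch with two progressively wider averaging branches. In a trained network the kernels and fusion weights would be learned.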

Edge-Aware context enhancement block (EACEB)

The boundary between the optic disc and optic cup is usually very faint owing to smooth intensity transitions and overlapping tissue appearance, which makes accurate separation difficult. The Edge-Aware Context Enhancement Block (EACEB), as depicted in Figure 7, is embedded in the encoder path to strengthen these ambiguous areas.

Figure 7.

Structure of the edge-aware context enhancement block (EACEB).

EACEB uses both spatial and channel-wise attention to amplify the pixels that correspond to anatomical borders. Given the feature map F, it first extracts global and local statistics using average and max pooling, as in Eq. (11):

F_{a v g} = AvgPool (F), F_{m a x} = MaxPool (F)

(11)

These descriptors are combined and passed through a gating function to produce an attention map according to Eq. (12):

A = σ (W_{a} [F_{a v g}, F_{m a x}])

(12)

where $σ$ is the sigmoid function and $W_{a}$ denotes learned weights. The refined feature map is then obtained using Eq. (13):

F_{e} = F ⊙ A

(13)

where $⊙$ indicates element-wise multiplication. This operation strengthens the responses of pixels located on object edges while suppressing background responses, allowing the network to focus on the transition zones between the cup and disc. The internal mechanism of EACEB is shown in Figure 7.
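One possible reading of Eqs. (11) to (13) is sketched below. It assumes NumPy, pools along the channel axis to obtain two H x W descriptors, and treats $W_{a}$ as a two-element weight vector (a 1 x 1 convolution over the stacked descriptors). These are assumptions: the pooling axis and the shape of $W_{a}$ are not stated here, and only the spatial form of the attention is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edge_aware_attention(f, w_a):
    """Eqs. (11)-(13) for an H x W x C feature map f, pooling over the channel axis."""
    f_avg = f.mean(axis=-1)                     # Eq. (11), average descriptor
    f_max = f.max(axis=-1)                      # Eq. (11), max descriptor
    desc = np.stack([f_avg, f_max], axis=-1)    # [F_avg, F_max] -> H x W x 2
    a = sigmoid(desc @ w_a)                     # Eq. (12), attention map in (0, 1)
    return f * a[..., None], a                  # Eq. (13), element-wise re-weighting
```

Because the attention values lie strictly between 0 and 1, the block can only attenuate responses; the relative emphasis on boundary pixels comes from attenuating them less than the background.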

Final segmentation output

After the decoding process and subsequent refinement, AME-SegNet outputs two probability maps corresponding to the optic disc and optic cup. A softmax function assigns each pixel a class likelihood according to Eq. (14):

P_{c} (x, y) = \frac{e^{z_{c} (x, y)}}{\sum_{k} e^{z_{k} (x, y)}}

(14)

where $z_{c}$ is the network output for class c. Binary masks for the segmented disc and cup regions are produced by thresholding the class probability of each pixel. From these masks, the cup-to-disc ratio (CDR), a key clinical indicator of glaucoma, is computed using Eq. (15):

CDR = \frac{A_{c u p}}{A_{d i s c}}

(15)

where $A_{c u p}$ and $A_{d i s c}$ denote the pixel areas of the optic cup and the optic disc, respectively. An increased CDR indicates structural damage consistent with glaucomatous progression; hence, the segmented output directly supports the diagnostic decision.
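The path from logits to CDR in Eqs. (14) and (15) can be sketched as follows. The code assumes NumPy, a three-class layout ordered as background, disc rim, and cup, and a threshold of 0.5; the class layout and threshold are assumptions made for illustration and may differ from the actual AME-SegNet output head.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Eq. (14), computed stably by subtracting the per-pixel maximum logit."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masks_and_cdr(logits, threshold=0.5):
    """logits: H x W x 3 with assumed class order (background, disc rim, cup)."""
    p = softmax(logits)
    cup = p[..., 2] > threshold
    # The cup lies inside the disc, so the disc mask covers rim and cup together.
    disc = (p[..., 1] + p[..., 2]) > threshold
    a_cup, a_disc = int(cup.sum()), int(disc.sum())
    cdr = a_cup / a_disc if a_disc > 0 else float("nan")   # Eq. (15)
    return disc, cup, cdr
```

Returning NaN when no disc pixels are found keeps a failed segmentation from silently producing a CDR of zero.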

Feature extraction from segmented regions

Once AME-SegNet has segmented the optic disc and optic cup, meaningful numerical descriptors can be extracted from these regions to support glaucoma classification. The framework focuses on the anatomical regions rather than on the raw image, and represents the structural and pathological changes associated with the disease through features computed within those regions. The binary masks for the optic disc and optic cup obtained from the segmentation step are denoted $D (x, y)$ and $C (x, y)$, respectively. Two kinds of features are computed: deep representation features and region-based structural features. To extract deep features, the masked disc and cup images are passed through a convolutional neural network that acts as a feature encoder. Let $I (x, y)$ be the fundus image after preprocessing; the masked inputs are defined in Eq. (16):

I_{D} (x, y) = I (x, y) \cdot D (x, y), I_{C} (x, y) = I (x, y) \cdot C (x, y)

(16)

where $\cdot$ indicates pixel-wise multiplication. The masked images are then forwarded through the convolutional neural network to generate high-level embeddings, as in Eq. (17):

F_{D} = ϕ (I_{D}), F_{C} = ϕ (I_{C})

(17)

where $ϕ (\cdot)$ is the convolutional transformation. These feature vectors capture textural patterns, edge distributions, and intensity variations within the disc and cup, and thereby encode subtle retinal cues that are very hard to quantify manually.

In addition to deep features, geometric and morphological characteristics are extracted to describe the physical shape and size of the segmented regions. The areas of the optic disc and optic cup are computed using Eq. (18):

A_{D} = \sum_{x, y} D (x, y), A_{C} = \sum_{x, y} C (x, y)

(18)

Each area is given by the pixel coverage of the corresponding structure. The perimeter of each region is calculated by identifying its boundary pixels, which allows compactness and contour irregularity to be analyzed. The vertical and horizontal diameters are measured as the largest extents of the optic disc and optic cup along each axis, and structural elongation is given by the ratio of the two measurements. The extracted features are finally concatenated into one feature vector, as in Eq. (19):

F = [F_{D}, F_{C}, A_{D}, A_{C}, CDR, shape descriptors]

(19)

where $F$ is the extracted feature vector, which combines anatomical and appearance-based information. The feature set therefore gives a holistic description of optic nerve head status by combining deep CNN features with explicit region and shape measurements, and it forms the input to the subsequent optimization and classification stages.
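The geometric part of Eqs. (18) and (19) can be sketched directly on binary masks. The code assumes NumPy, measures diameters as bounding extents in pixels, and takes the deep embeddings $F_{D}$ and $F_{C}$ as given arrays; the function names, the order of the geometric entries, and the omission of the perimeter are choices made for this illustration only.

```python
import numpy as np

def region_descriptors(mask):
    """Area, vertical/horizontal diameters, and elongation of one binary mask."""
    mask = np.asarray(mask, dtype=bool)
    area = int(mask.sum())                                # Eq. (18)
    if area == 0:
        return {"area": 0, "v_diam": 0, "h_diam": 0, "elongation": float("nan")}
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    v_diam = int(rows[-1] - rows[0] + 1)                  # extent between extreme rows
    h_diam = int(cols[-1] - cols[0] + 1)                  # extent between extreme columns
    return {"area": area, "v_diam": v_diam, "h_diam": h_diam,
            "elongation": v_diam / h_diam}

def build_feature_vector(f_d, f_c, disc, cup):
    """Eq. (19): concatenate deep embeddings with areas, CDR, and shape descriptors."""
    d, c = region_descriptors(disc), region_descriptors(cup)
    cdr = c["area"] / d["area"] if d["area"] else float("nan")
    geom = [d["area"], c["area"], cdr,
            d["v_diam"], d["h_diam"], d["elongation"],
            c["v_diam"], c["h_diam"], c["elongation"]]
    return np.concatenate([np.ravel(f_d), np.ravel(f_c),
                           np.asarray(geom, dtype=np.float64)])
```

Since the deep and geometric entries live on very different numerical scales, a standardization step before feature selection would be a natural addition, although none is described in this section.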

Feature optimization using bitterling colony optimization (BCO)

A considerable number of deep and region-based features are computed from the segmented optic disc and optic cup to discriminate between glaucomatous and non-glaucomatous eyes. Some of these features may be redundant, noisy, or weakly correlated with disease state, which degrades classification accuracy and increases processing cost. Intelligent feature selection is therefore needed to obtain a compact subset that keeps the maximum diagnostic information while eliminating irrelevant components, as given in Algorithm 1. Bitterling Colony Optimization (BCO), a population-based metaheuristic inspired by the collective foraging behavior of bitterling fish, is adopted for this purpose.⁴⁹

Let the complete extracted feature vector be written as in Eq. (20):

F = [f_{1}, f_{2}, \dots, f_{n}]

(20)

where n is the total number of features. Each candidate solution in BCO is a binary selection mask taking the form in Eq. (21)

S = [s_{1}, s_{2}, \dots, s_{n}], \quad s_{i} \in \{0, 1\}

(21)

where $s_{i} = 1$ signifies that feature $f_{i}$ is selected, while $s_{i} = 0$ means it is excluded. A colony of such solutions is initialized randomly to form the population in Eq. (22):

P = \{S_{1}, S_{2}, \dots, S_{M}\}

(22)

where M is the number of bitterling agents.

Each agent evaluates its chosen feature set by training a classifier and measuring its predictive capability. The fitness of a solution $S_{k}$ is computed using Eq. (23):

Fitness (S_{k}) = α \cdot A c c (S_{k}) - β \cdot \frac{∣ S_{k} ∣}{n}

(23)

where $A c c (S_{k})$ is the classification accuracy achieved using the selected features, $∣ S_{k} ∣$ is the number of selected features, and α and β are weighting factors that balance accuracy and subset compactness. This formulation steers BCO toward feature combinations that maximize recognition performance while minimizing dimensionality.

In every iteration, bitterling agents update their positions by mimicking natural exploration and exploitation behavior. Agents that perform well influence the movement of the rest, which encourages convergence toward promising feature combinations, while random perturbation allows the search to escape local optima. The update rule for each agent is expressed in Eq. (24):

S_{k}^{t + 1} = S_{k}^{t} + r_{1} (S_{b e s t}^{t} - S_{k}^{t}) + r_{2} \cdot Δ

(24)

Here $S_{b e s t}^{t}$ denotes the best solution found up to iteration t, $r_{1}$ and $r_{2}$ are random coefficients, and the term Δ represents exploratory variation. Since Eq. (24) produces real-valued positions, the updated masks are mapped back to binary values, as in step b of the loop in Algorithm 1. BCO progressively improves the quality of the feature subsets from iteration to iteration until convergence is reached. The final result is the optimal feature mask given in Eq. (25):

S^{*} = \arg \max_{S_{k} \in P} Fitness (S_{k})

(25)

which is then used to filter the original feature vector. By retaining only the most discriminative elements of the original feature space, BCO improves classification reliability, lowers the risk of overfitting, and improves computational efficiency.

Algorithm 1:

Bitterling colony optimization for feature selection

Input: Extracted feature matrix $F$, class labels $Y$, population size $M$, maximum iterations $T$
Output: Optimal feature subset $S^{*}$

1. Initialize a population of $M$ bitterling agents with random binary masks $S_{1}, S_{2}, \dots, S_{M}$
2. For each agent $S_{k}$, evaluate classification accuracy using selected features
3. Compute fitness using the defined fitness function
4. Identify the best-performing agent $S_{b e s t}$
5. For $t = 1$ to $T$:
a. Update each agent's feature mask using the BCO position update rule
b. Enforce binary constraints on updated masks
c. Recalculate fitness values
d. Update $S_{b e s t}$ if a better solution is found
6. Return $S^{*} = S_{b e s t}$
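Algorithm 1 together with Eqs. (23) and (24) can be sketched as follows. Several parts are assumptions made only to obtain a runnable illustration: the nearest-centroid classifier stands in for the unspecified classifier behind $Acc (S_{k})$, the values of α and β are arbitrary, Δ is drawn from a standard normal distribution, and the binary constraint is enforced with a sigmoid transfer function followed by random sampling. None of these choices is stated in this work.

```python
import numpy as np

def nearest_centroid_accuracy(x_tr, y_tr, x_te, y_te):
    """Stand-in classifier for Acc(S_k); the classifier used in the paper is not named."""
    classes = np.unique(y_tr)
    centroids = np.stack([x_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((x_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return float((classes[dists.argmin(axis=1)] == y_te).mean())

def fitness(mask, data, alpha=0.9, beta=0.1):
    """Eq. (23): alpha * Acc(S_k) - beta * |S_k| / n."""
    if not mask.any():
        return -np.inf  # an empty subset cannot be evaluated
    x_tr, y_tr, x_te, y_te = data
    acc = nearest_centroid_accuracy(x_tr[:, mask], y_tr, x_te[:, mask], y_te)
    return alpha * acc - beta * mask.sum() / mask.size

def bco_select(data, n_agents=10, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    n = data[0].shape[1]
    pop = rng.random((n_agents, n)) < 0.5                   # step 1: random binary masks
    fit = np.array([fitness(m, data) for m in pop])         # steps 2-3
    best, best_fit = pop[fit.argmax()].copy(), fit.max()    # step 4
    for _ in range(n_iter):                                 # step 5
        for k in range(n_agents):
            r1, r2 = rng.random(), rng.random()
            delta = rng.standard_normal(n)                  # exploratory variation (assumed form)
            pos = pop[k] + r1 * (best.astype(float) - pop[k]) + r2 * delta  # Eq. (24)
            # Assumed binarization: sigmoid transfer centered at 0.5, then sampling.
            prob = 1.0 / (1.0 + np.exp(-4.0 * (pos - 0.5)))
            pop[k] = rng.random(n) < prob
            fit[k] = fitness(pop[k], data)
            if fit[k] > best_fit:
                best, best_fit = pop[k].copy(), fit[k]
    return best, best_fit                                   # step 6: S* = S_best
```

Here `data` is a tuple `(x_train, y_train, x_val, y_val)`. Evaluating fitness on a held-out split rather than on the training data is a design choice of this sketch; it keeps the selection from rewarding subsets that merely fit the training samples.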

Glaucoma classification using convolutional transformer

After optic disc and optic cup segmentation and discriminative feature selection, the refined feature vectors are forwarded to the Convolutional Transformer (CT) for the final glaucoma classification. This stage aims to learn both the local retinal patterns and the long-range spatial interactions needed to distinguish glaucomatous from healthy eyes. The Convolutional Transformer combines convolutional embedding with transformer-based sequence modeling, as shown in Figure 8. With this hybrid architecture, the network preserves fine-scale spatial structure while also capturing global contextual relationships across the segmented retinal regions.⁵⁰

Figure 8.

Architecture of the convolutional transformer.

Convolutional token embedding

Let the feature map from the previous stage be written as in Eq. (26):

F \in R^{H \times W \times C}

(26)

where H and W are the spatial dimensions, and C is the number of channels. To convert the map into a sequence suitable for transformer processing, a convolutional embedding layer is applied according to Eq. (27)

E = Conv (F; W_{e})

(27)

where $W_{e}$ denotes the convolutional kernel parameters. This creates condensed tokens that maintain spatial locality and reduce redundancy. The output is reshaped into a sequence according to Eq. (28):

Z = \{z_{1}, z_{2}, \dots, z_{N}\}, \quad z_{i} \in R^{d}

(28)

where N denotes the total number of tokens and d denotes the embedding dimension.

This convolutional embedding ensures that structural information such as cup deformation, disc expansion, and rim thinning is retained in each token.
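A minimal NumPy sketch of Eqs. (26)-(28) is given below as an illustration. The kernel size, stride, absence of padding, and all tensor sizes are assumptions, since they are not specified in this section; `conv_token_embedding` is a hypothetical helper name.

```python
import numpy as np


def conv_token_embedding(F, W_e, stride):
    """Strided 2-D convolution (no padding), then flattening to tokens.

    F:   feature map of shape (H, W, C), as in Eq. (26)
    W_e: kernel of shape (k, k, C, d), as in Eq. (27)
    Returns Z of shape (N, d): one row per token z_i, as in Eq. (28).
    """
    H, W, _ = F.shape
    k, _, _, d = W_e.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    E = np.zeros((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = F[r:r + k, c:c + k, :]
            # Contract the (k, k, C) patch against the kernel -> d values
            E[i, j] = np.tensordot(patch, W_e, axes=([0, 1, 2], [0, 1, 2]))
    return E.reshape(out_h * out_w, d)


rng = np.random.default_rng(0)
F = rng.standard_normal((16, 16, 8))            # H = W = 16, C = 8 (assumed)
W_e = rng.standard_normal((4, 4, 8, 32)) * 0.1  # k = 4, d = 32 (assumed)
Z = conv_token_embedding(F, W_e, stride=4)
print(Z.shape)  # (16, 32): N = 16 tokens of dimension d = 32
```

Because each token is computed from a contiguous spatial patch, neighbouring tokens correspond to neighbouring retinal regions, which is the locality property the embedding is meant to retain.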

Convolutional transformer block

Each token sequence is processed through stacked Convolutional Transformer blocks, each of which consists of a convolutional projection layer, multi-head attention, normalization, and feed-forward learning (as shown in Figure 8).

Query–Key–Value Projection

The embedded tokens are first projected via convolutional kernels to yield query, key, and value tensors according to Eq. (29):

Q = {Conv}_{Q} (Z), K = {Conv}_{K} (Z), V = {Conv}_{V} (Z)

(29)

Using convolutional mappings instead of linear projections preserves neighborhood coherence, so local retinal structure continues to contribute to the global attention computed over the tokens.

Multi-Head Self-Attention

Self-attention determines how strongly each token relates to every other token according to Eq. (30)

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d}}) V

(30)

Multiple attention heads allow the model to learn diverse relational patterns, such as how the optic cup interacts with other regions of the disc. To form a more refined token representation, the outputs from all heads are concatenated and linearly projected as in Eq. (31)

Z^{'} = Concat ({head}_{1}, \dots, {head}_{h}) W_{o}

(31)
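Eqs. (29)-(31) can be illustrated with the following NumPy sketch. It rests on several assumptions: the convolutional projection is simplified to a same-padded 1-D convolution along the token axis, the scaling factor d in Eq. (30) is taken as the per-head dimension, and the weights are random placeholders rather than trained parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def conv_projection(Z, kernel):
    """Same-padded 1-D convolution along the token axis (simplified Eq. 29).

    Z: (N, d) tokens; kernel: (k, d, d) with odd k.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(Z, ((pad, pad), (0, 0)))  # zero padding
    n_tokens = Z.shape[0]
    return sum(padded[t:t + n_tokens] @ kernel[t] for t in range(k))


def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (30). Returns output and weights."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return weights @ V, weights


def multi_head_attention(Z, W_q, W_k, W_v, W_o, num_heads):
    """Attend per head on channel slices, concatenate, project (Eq. 31)."""
    d_head = Z.shape[1] // num_heads
    Q, K, V = (conv_projection(Z, W) for W in (W_q, W_k, W_v))
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        head, _ = attention(Q[:, cols], K[:, cols], V[:, cols])
        heads.append(head)
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
N, d, n_heads = 16, 32, 4  # illustrative sizes only
Z = rng.standard_normal((N, d))
W_q, W_k, W_v = (rng.standard_normal((3, d, d)) * 0.1 for _ in range(3))
W_o = rng.standard_normal((d, d)) * 0.1
Z_prime = multi_head_attention(Z, W_q, W_k, W_v, W_o, n_heads)
print(Z_prime.shape)  # (16, 32)
```

Each row of the attention weight matrix sums to one, so every output token is a convex combination of the value vectors of all tokens.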

Residual Learning and Normalization

Training is stabilized by adding residual connections and normalization, as given in Eq. (32):

Y = Norm (Z + Z^{'})

(32)

This preserves the original information while incorporating the attention-based refinement.

Feed-Forward Network

Each token is subsequently processed by a multilayer perceptron (MLP) in accordance with Eq. (33)

O = Norm (Y + MLP (Y))

(33)

The MLP increases representational power, allowing the model to learn the non-linear dependencies among patterns that are relevant to glaucoma. Convolutional networks are very effective at capturing local shape, edge, and texture patterns; however, their receptive fields grow slowly with depth, so they struggle to model long-distance dependencies such as the spatial relationship between the optic cup and the disc boundary. This limitation is overcome by the transformer component, which allows global interaction between all retinal tokens. Meanwhile, the convolutional projections ensure that spatial coherence is not lost during attention computation. As a result, the model can detect both the microscopic deformations and the macroscopic structural changes that are important indicators of glaucoma.
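The residual and feed-forward steps of Eqs. (32) and (33) can be sketched as follows. Several details are assumptions: `Norm` is taken to be layer normalization without learnable scale and shift, the MLP has one hidden layer with ReLU activation, and the inputs and weights are random placeholders.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mlp(Y, W1, b1, W2, b2):
    """Two-layer perceptron applied to every token independently."""
    hidden = np.maximum(Y @ W1 + b1, 0.0)  # ReLU (assumed activation)
    return hidden @ W2 + b2


def block_output(Z, Z_prime, W1, b1, W2, b2):
    Y = layer_norm(Z + Z_prime)                   # Eq. (32)
    return layer_norm(Y + mlp(Y, W1, b1, W2, b2))  # Eq. (33)


rng = np.random.default_rng(0)
N, d, d_hidden = 16, 32, 64                  # illustrative sizes only
Z = rng.standard_normal((N, d))              # block input tokens
Z_prime = rng.standard_normal((N, d)) * 0.5  # stand-in for the attention output
W1 = rng.standard_normal((d, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_hidden, d)) * 0.1
b2 = np.zeros(d)
O = block_output(Z, Z_prime, W1, b1, W2, b2)
print(O.shape)  # (16, 32)
```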

Final classification

The output token sequence is averaged using global pooling to obtain a compact representation, as in Eq. (34):

g = \frac{1}{N} \sum_{i = 1}^{N} O_{i}

(34)

This vector is fed to a fully connected layer with softmax activation as mentioned in Eq. (35)

\hat{y} = softmax (W_{c} g + b_{c})

(35)

Here, $\hat{y}$ contains the predicted probabilities that the input belongs to the glaucoma or the normal class. The predicted label is given in Eq. (36)

y = \arg max (\hat{y})

(36)
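As a worked illustration of Eqs. (34)-(36), the sketch below pools a tiny token sequence, applies the softmax layer, and takes the arg max. The token values, the weights, and the mapping of class indices to labels are hypothetical.

```python
import numpy as np


def classify(O, W_c, b_c):
    g = O.mean(axis=0)                  # Eq. (34): global average pooling
    logits = W_c @ g + b_c
    exp = np.exp(logits - logits.max())  # shift for numerical stability
    y_hat = exp / exp.sum()             # Eq. (35): class probabilities
    return y_hat, int(np.argmax(y_hat))  # Eq. (36): predicted label


# Two tokens of dimension 2; class 0 / class 1 ordering is an assumption.
O = np.array([[1.0, 0.0],
              [3.0, 2.0]])
W_c = np.eye(2)
b_c = np.zeros(2)
y_hat, label = classify(O, W_c, b_c)
print(np.round(y_hat, 4), label)  # [0.7311 0.2689] 0
```

Here the pooled vector is g = (2, 1), so the two class probabilities are e/(1+e) and 1/(1+e).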

Hyperparameter optimization using honey badger algorithm (HBA)

The effectiveness of a deep learning classifier depends strongly on suitable hyperparameter selection, in particular the learning rate, the batch size, and the number of neurons in the hidden layers. A poor choice may lead to slow training, unstable convergence, or suboptimal classification accuracy. To avoid manual tuning and to obtain well-behaved training, the Honey Badger Algorithm (HBA) is used as an optimization strategy that automatically selects the most suitable hyperparameter configuration.⁵¹ The hyperparameter vector is defined as in Eq. (37)

Θ = [η, B, N]

(37)

where η denotes the learning rate, B denotes the batch size, and N is the number of neurons in the fully connected layer of the Convolutional Transformer. Each candidate solution in HBA corresponds to a honey badger agent positioned in the hyperparameter search space. A population of M agents is randomly initialized according to Eq. (38)

H = \{Θ_{1}, Θ_{2}, \dots, Θ_{M}\}

(38)

The quality of each candidate is evaluated by training the classifier with the chosen hyperparameters and computing a fitness score based on validation performance. The fitness function is given in Eq. (39)

Fitness(Θ_{i}) = Acc(Θ_{i}) - λ \cdot Loss(Θ_{i})

(39)

where $Acc(Θ_{i})$ is the validation accuracy, $Loss(Θ_{i})$ is the validation loss, and λ determines the influence of loss penalization. This formulation favors high-accuracy solutions while also rewarding stable learning behavior.

HBA models two major foraging strategies observed in honey badgers: digging and honey search. In the digging phase, the agents carry out local exploitation around promising areas, while in the honey phase they perform global exploration to avoid premature convergence. Each agent's position is updated according to Eq. (40)

Θ_{i}^{t + 1} = Θ_{i}^{t} + α \cdot R \cdot (Θ_{best}^{t} - Θ_{i}^{t}) + β \cdot ϵ

(40)

where $Θ_{best}^{t}$ is the best solution at iteration t, R is a random vector, ϵ introduces stochastic exploration, and α and β control the strength of exploitation and exploration, respectively. This flexibility in movement allows Algorithm 2 to approach good hyperparameters quickly while retaining a random chance of escaping local optima.

HBA iteratively refines the hyperparameter set so that a suitable learning rate speeds up training convergence, the tuned neuron count provides a network capacity that improves classification accuracy, and a suitable batch size increases stability by balancing gradient smoothness against learning efficiency.

Algorithm 2:

Honey badger optimization for hyperparameter tuning

Input: Training dataset, validation dataset, search ranges for η, B, and N, population size M, maximum iterations T
Output: Optimal hyperparameter vector $Θ^{*}$
Randomly initialize M honey badger agents with different hyperparameter sets $Θ_{i}$
For each $Θ_{i}$, train the convolutional transformer and compute validation accuracy and loss
Calculate fitness for each agent using the fitness function
Identify the best-performing agent $Θ_{best}$
For $t = 1$ to $T$:
a. Update each agent's hyperparameters using the HBA position update rule
b. Enforce parameter bounds
c. Re-evaluate fitness for all agents
d. Update $Θ_{best}$ if a better solution appears
Return $Θ^{*} = Θ_{best}$
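The loop of Algorithm 2, with the fitness of Eq. (39) and the update of Eq. (40), can be sketched as below. Training the Convolutional Transformer is replaced by a hypothetical surrogate (`surrogate_validation`) so that the example runs in isolation; the search ranges in `BOUNDS`, the values of α, β, and λ, and the Gaussian form of ϵ are assumptions. The population size and iteration count follow Table 2.

```python
import random

# Hypothetical search ranges for (eta, B, N); not taken from the paper.
BOUNDS = [(1e-5, 1e-2), (8.0, 64.0), (64.0, 512.0)]


def surrogate_validation(theta):
    """Stand-in for training the classifier: returns (accuracy, loss)."""
    unit = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(theta, BOUNDS)]
    target = (0.3, 0.5, 0.7)  # hypothetical optimum in normalised space
    loss = sum((u - t) ** 2 for u, t in zip(unit, target))
    return max(0.0, 1.0 - loss), loss


def fitness(theta, lam):
    acc, loss = surrogate_validation(theta)
    return acc - lam * loss  # Eq. (39)


def clip(theta):
    """Enforce parameter bounds."""
    return [min(max(v, lo), hi) for v, (lo, hi) in zip(theta, BOUNDS)]


def hba_search(num_agents=20, num_iters=40, alpha=1.0, beta=0.05,
               lam=0.5, seed=0):
    rng = random.Random(seed)
    agents = [[rng.uniform(lo, hi) for lo, hi in BOUNDS]
              for _ in range(num_agents)]
    scores = [fitness(a, lam) for a in agents]
    best_i = max(range(num_agents), key=scores.__getitem__)
    best, best_score = list(agents[best_i]), scores[best_i]

    for _ in range(num_iters):
        for i, theta in enumerate(agents):
            moved = []
            for v, b, (lo, hi) in zip(theta, best, BOUNDS):
                r = rng.random()                       # component of R
                eps = rng.gauss(0.0, 1.0) * (hi - lo)  # noise scaled to range
                moved.append(v + alpha * r * (b - v) + beta * eps)  # Eq. (40)
            agents[i] = clip(moved)
            score = fitness(agents[i], lam)
            if score > best_score:
                best, best_score = list(agents[i]), score
    return best, best_score


best_theta, best_fit = hba_search()
print(best_theta, round(best_fit, 4))
```

In practice, the batch size and neuron count would be rounded to integers before training, and each fitness evaluation would require a full training run, which dominates the cost of the search.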

Result and discussion

The evaluation methodology for the proposed AME-SegNet framework is designed to test both quantitative performance and clinical reliability rigorously, and to establish the novelty of the architecture through comparison with other state-of-the-art methods. For optic disc and optic cup segmentation, the Dice coefficient, Intersection-over-Union (IoU), pixel accuracy, and cup-to-disc ratio error (δCDR) are considered. Dice and IoU measure the spatial overlap between predicted and ground-truth masks, whereas δCDR evaluates how segmentation errors affect the most clinically important glaucoma biomarker. Segmentation assessment therefore covers both pixel-level agreement and diagnostic relevance. For glaucoma classification, the reported metrics are accuracy, precision, recall (sensitivity), F1 score, specificity, and area under the ROC curve (AUC). Together, these indices assess global correctness, the ability to detect the disease, the suppression of false alarms, and threshold-independent discriminatory performance. Statistical robustness of all major performance measures is supported by five-fold cross-validation and 95% confidence intervals. In addition, cross-dataset validation on Drishti-GS1, RIM-ONE, and ORIGA-Light is performed to confirm that the improvements do not depend on a single dataset.
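The segmentation metrics named above can be computed from binary masks as in the following sketch. The definition of δCDR used here, the absolute difference between predicted and ground-truth vertical cup-to-disc ratios, is an assumption, and the 8 × 8 masks are toy data.

```python
import numpy as np


def dice(pred, target):
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum())


def iou(pred, target):
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union


def pixel_accuracy(pred, target):
    return (pred == target).mean()


def vertical_cdr(cup, disc):
    """Vertical cup-to-disc ratio from binary masks (rows = vertical axis)."""
    cup_rows = np.flatnonzero(cup.any(axis=1))
    disc_rows = np.flatnonzero(disc.any(axis=1))
    cup_height = cup_rows[-1] - cup_rows[0] + 1
    disc_height = disc_rows[-1] - disc_rows[0] + 1
    return cup_height / disc_height


# Toy ground-truth and predicted masks
disc_true = np.zeros((8, 8), dtype=bool)
disc_true[1:7, 1:7] = True   # 36 pixels
disc_pred = np.zeros((8, 8), dtype=bool)
disc_pred[1:7, 2:7] = True   # 30 pixels, all inside the true disc
cup_true = np.zeros((8, 8), dtype=bool)
cup_true[3:5, 3:5] = True    # 2 rows tall
cup_pred = np.zeros((8, 8), dtype=bool)
cup_pred[2:5, 3:5] = True    # 3 rows tall

disc_dice = dice(disc_pred, disc_true)          # 60 / 66
disc_iou = iou(disc_pred, disc_true)            # 30 / 36
disc_acc = pixel_accuracy(disc_pred, disc_true)  # 58 / 64
delta_cdr = abs(vertical_cdr(cup_pred, disc_pred)
                - vertical_cdr(cup_true, disc_true))  # |1/2 - 1/3|
print(round(disc_dice, 4), round(disc_iou, 4), disc_acc, round(delta_cdr, 4))
```

The example shows why δCDR complements the overlap metrics: a one-row error in the cup boundary barely changes pixel accuracy but shifts the cup-to-disc ratio from 0.33 to 0.50.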

The proposed design is benchmarked against existing segmentation and classification frameworks. In earlier paradigms, segmentation, feature extraction, and classification are handled separately or are only weakly coupled. The proposed framework is instead built as a fully integrated and anatomically driven pipeline in which each stage builds on and reinforces the others. Existing U-Net, M-Net, SegNet, and various attention-based models mainly optimize pixel-wise accuracy, which can compromise the true optic cup-disc boundary and thereby destabilize cup-to-disc ratio estimation. In contrast, through context enhancement and multi-resolution feature aggregation, the proposed AME-SegNet explicitly models both fine cup boundaries and large disc structures, leading to higher Dice and IoU values and lower δCDR values.

Existing classification methods achieve strong performance but mostly depend either on raw image features or on single clinical indicators such as CDR, which do not capture the complex structural and textural patterns of glaucoma. The proposed system adopts a biologically inspired feature optimization strategy, Bitterling Colony Optimization, to automatically choose the most discriminative combination of deep disc-cup features and anatomical descriptors. Furthermore, the Convolutional Transformer lets convolutional locality cooperate with transformer-based global attention, supporting local modeling of nerve fiber changes and global modeling of disc-cup relations. Hyperparameter tuning through the Honey Badger Algorithm further distinguishes the framework by making the learning process self-adaptive and stable across different datasets. The quantitative results show consistent improvement over EfficientNet, Vision Transformer, ConvNeXt, and Swin Transformer across datasets and ablation studies, supporting both the efficacy and the originality of the proposed AME-SegNet framework.

Experimental configuration

All experiments were carried out on a dedicated high-performance workstation to ensure stable, reproducible, and efficient training and evaluation of the proposed deep-learning framework. The hardware platform comprised an Intel Core i9 central processing unit, 64 GB of DDR5 system memory, and an NVIDIA RTX 4090 graphics processing unit with 24 GB of GDDR6X memory, which provided enough computational power to train large-scale convolutional and transformer-based models. Solid-state storage (2 TB NVMe SSD) provided fast data access and model checkpointing during training and testing. The software environment was built on the Ubuntu 22.04 LTS (64-bit) operating system. All algorithms were implemented in Python 3.10, and the deep learning models were developed and trained using TensorFlow 2.13 and PyTorch 2.0, allowing a flexible implementation of both convolutional and transformer architectures. CUDA 12.1 and cuDNN 8.9 were used for GPU acceleration of training and inference. Image preprocessing and visualization were performed using OpenCV 4.8 and scikit-image 0.21, while NumPy 1.25 and Pandas 2.0 were used for numerical computation and data management. Model evaluation and plotting were carried out using scikit-learn 1.3 and Matplotlib 3.8.

All experiments were conducted on a standalone local workstation without utilizing any external cloud services or distributed computing infrastructure. Hence, no network devices or network-based processing were involved in any step of training or testing, ensuring that all reported results came from a controlled and reproducible computing environment. All network architectures, including the AME-SegNet, Convolutional Transformer, and the optimizers, were trained using the hyperparameter configuration presented in Table 2, which includes learning rates, batch sizes, optimizer settings, and training schedules used to secure stable convergence and reproducible performance.

Table 2.

Hyperparameter configuration used for training and optimization.

Hyperparameter	Value
Learning Rate (η)	0.0001
Batch Size (B)	16
Optimizer for Classification	Adam
Weight Decay	1 × 10⁻⁴
Dropout Rate	0.5
Number of Convolutional Transformer Heads	8
Embedding Dimension (d)	256
Bitterling Colony Population Size	30
Bitterling Colony Iterations	50
Honey Badger Population Size	20
Honey Badger Iterations	40

Quantitative problem formulation and experimental validation

Let $I \in R^{H \times W \times 3}$ denote a retinal fundus image of height H and width W. The objective of this study is to perform optic disc and optic cup segmentation and glaucoma classification within a unified optimization framework. The segmentation task is formulated as a pixel-wise mapping function $f_{s} (\cdot)$ that predicts the optic disc and optic cup masks from the input image. Segmentation performance is optimized by minimizing a composite loss function that combines Dice loss, which maximizes region overlap, and Focal loss, which addresses class imbalance and hard boundary regions.
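A composite Dice and Focal loss of this kind can be sketched in NumPy as follows. The unweighted sum of the two terms is an assumption, as the relative weighting is not stated here; the Focal term uses the standard binary form with class-balancing weight α and focusing parameter γ.

```python
import numpy as np


def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|)."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)


def focal_loss(prob, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t), averaged."""
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float((-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)).mean())


def composite_loss(prob, target, alpha=0.25, gamma=2.0):
    # Assumption: the two terms are simply added with equal weight.
    return dice_loss(prob, target) + focal_loss(prob, target, alpha, gamma)


target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0   # toy ground-truth mask
perfect = target.copy()  # ideal prediction
inverted = 1.0 - target  # worst-case prediction
loss_perfect = composite_loss(perfect, target)
loss_inverted = composite_loss(inverted, target)
print(loss_perfect < loss_inverted)  # True
```

With γ = 0 and α = 0.5 the Focal term reduces to half the ordinary binary cross-entropy, which makes the role of γ clear: larger values down-weight pixels that are already classified confidently and concentrate the loss on hard boundary pixels.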

For glaucoma classification, discriminative deep features extracted from the segmentation network are refined through biologically inspired feature selection and attention-based learning. The classification task is defined as a supervised optimization problem in which the categorical cross-entropy loss between the predicted and ground-truth class labels is minimized. To jointly optimize the segmentation and classification objectives, a unified loss function is adopted that balances the contribution of both tasks during training. This keeps the proposed framework mathematically well-defined and objective-driven.

To support the quantitative assessment, the proposed approach is validated statistically. Dice, IoU, accuracy, and cup-to-disc ratio error are used to assess segmentation performance, while accuracy, precision, recall, F1-score, specificity, and the area under the ROC curve are used for classification. For reliability, all classification results are reported as means with 95% confidence intervals. Five-fold cross-validation and cross-dataset evaluation are performed to reduce sampling bias and to validate robustness across data distributions. Finally, comparative evaluation against state-of-the-art methods and ablation studies are reported to quantify the performance gain and the contribution of each component of the proposed framework.

Inclusion and exclusion criteria and sample selection method

The study follows strict inclusion and exclusion criteria so that only clinically meaningful samples are used for model training, testing, and evaluation.

The inclusion criteria were: (i) color fundus photographs centered on the optic nerve head; (ii) images with corresponding expert-annotated ground-truth masks for the optic disc and optic cup (for segmentation); and (iii) images clinically labelled as normal or glaucomatous (for classification). Only images captured at sufficiently high resolution and with sufficient clarity of the optic disc and cup were included, since accurate anatomical segmentation and reliable feature extraction from these regions depend on good visualization.

Exclusion criteria included: (i) images without optic disc or optic cup annotations, (ii) fundus images having severe blur, considerable occlusion, or artifacts such that the optic nerve head cannot be well visualized, and (iii) images that contain other retinal diseases but have no glaucoma-related classification labeling, to avoid confusion during classification.

The data in this study were obtained retrospectively from standard public datasets, namely RIM-ONE, Drishti-GS1, and ORIGA-Light. These datasets were collected and validated in a clinical environment using standard labeling protocols, with experienced ophthalmologists as expert annotators. No additional patients were recruited and no samples were manually selected in this study. All images were taken directly from the official dataset releases and processed according to the respective splits of the five-fold cross-validation and cross-dataset evaluation protocols, supporting unbiased and reproducible experimentation. This systematic inclusion, exclusion, and sampling strategy supports the clinical validity, statistical reliability, and ethical compliance of the experimental analysis.

Segmentation results - Drishti-GS1 dataset

Quantitative assessment shows that the proposed method outperforms all compared models in optic cup segmentation, attaining a Dice value of 96.72% and an IoU of 92.13%. These are well above the scores of U-Net (89.45% and 80.92%) and M-Net (91.83% and 84.71%), and also exceed those of EE-UNet (94.26% Dice and 89.31% IoU). As reported in Table 3, the proposed model also has the lowest δCDR, 2.41%, against 6.12% for U-Net, 4.85% for M-Net, 6.98% for SegNet, and 3.41% for EE-UNet. Consistent with this, its accuracy of 97.81% exceeds that of the other techniques, which range between 94.31% and 97.08%, indicating more reliable optic cup boundary localization and supporting the claim that the optimized structure improves both overlap accuracy and clinical reliability.

Table 3.

Comparative performance of different methods for optic cup segmentation.

Method	Dice (%)	IoU (%)	δCDR (%)	Accuracy (%)
U-Net	89.45	80.92	6.12	94.85
M-Net	91.83	84.71	4.85	95.96
SegNet	88.12	78.77	6.98	94.31
EE-UNet	94.26	89.31	3.41	97.08
Proposed	96.72	92.13	2.41	97.81

Visual inspection shows that the model isolates the optic disc and the optic cup well across retinal samples, despite image-to-image differences in vessel density and illumination. This consistency is most visible near the center of the optic nerve head, where the boundaries are well preserved and sharply defined, as shown in Figure 9, indicating that the learned features are strong enough to maintain anatomical correctness. The segmented regions appear well balanced: the relation between the cup and the surrounding disc is preserved without notable shrinkage or expansion, which is important when judging the integrity of the neuroretinal rim. The overlay views also show that the detected structures are aligned with the original fundus images, suggesting that the system adapts to different optic disc shapes and contrast levels while producing stable and repeatable results for diagnostic evaluation.

Figure 9.

Optic cup and optic disc segmentation with overlay and dice score evaluation.

According to the numerical results, the proposed framework delineates the optic disc most accurately among all evaluated models, with a Dice score of 97.36% and an IoU of 94.89%. These are well above the scores of U-Net (92.41% and 86.34%) and M-Net (94.86% and 90.21%), and also exceed those of EE-UNet (96.18% Dice and 92.76% IoU). As reported in Table 4, the proposed solution again registers the lowest δCDR, 1.84%, compared with 4.12% for U-Net, 3.18% for M-Net, 4.89% for SegNet, and 2.41% for EE-UNet. Its accuracy of 98.11% also exceeds that of the other techniques, which range between 95.87% and 97.92%, indicating more consistent and reliable optic disc localization and better boundary precision.

Table 4.

Comparative performance of different methods for optic disc segmentation.

Method	Dice (%)	IoU (%)	δCDR (%)	Accuracy (%)
U-Net	92.41	86.34	4.12	96.32
M-Net	94.86	90.21	3.18	97.41
SegNet	91.23	84.02	4.89	95.87
EE-UNet	96.18	92.76	2.41	97.92
Proposed	97.36	94.89	1.84	98.11

Visual comparisons indicate that the contour-based detection method accurately delineates the optic disc region across different retinal images, tracing boundaries, whether smooth or jagged, that closely match the true anatomical margins. This reliability is reflected in Figure 10, where the overlaid outlines cluster closely around the optic nerve head. The small distance between the plotted contours shows that localization remains stable despite changes in illumination, vessel patterns, and disc appearance, which are important properties for any method intended to be structurally reliable. Further, the near-concentric alignment of the detected rings indicates that the system preserves the geometric consistency of the optic disc, which suggests it can produce the measurements required for glaucoma evaluation and other ophthalmic applications accurately.

Figure 10.

Multi-Contour optic disc boundary detection on retinal Fundus images.

The training curves suggest that RAdam combined with LookAhead produces the best learning behavior, reaching an accuracy of approximately 0.99 within about 30,000 batches and remaining close to that level until around 120,000 batches. Standalone RAdam stabilizes slightly lower, around 0.985, Adam around 0.97, and SGD at approximately 0.95; the same comparative trend is visible in the zoomed section of the mid-training period in Figure 11. The loss curves agree: the hybrid optimizer brings the loss down to nearly 0.01 before 20,000 batches and to values very close to 0.001 around 40,000 batches, whereas RAdam stabilizes around 0.002, Adam approaches 0.005, and SGD hovers between 0.01 and 0.02 even at higher batch counts. These differences indicate that the combined optimization strategy converges faster and reaches lower minima than the other three methods, resulting in greater stability and better predictive performance.

Figure 11.

Accuracy and loss trends of different optimizers across training batches.
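For readers unfamiliar with LookAhead, the mechanism can be sketched on a one-dimensional problem. This is a generic illustration, not the training code used in this study: the inner optimizer is plain gradient descent standing in for RAdam, and the learning rate, synchronization period k, and interpolation factor α are hypothetical values.

```python
def lookahead_minimize(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=200):
    """LookAhead wrapper around a simple inner optimizer.

    Fast weights take k inner steps; the slow weights then move a
    fraction alpha toward them, and the fast weights are reset.
    """
    slow = float(w0)
    fast = float(w0)
    for step in range(1, steps + 1):
        fast -= lr * grad_fn(fast)  # inner step (gradient descent stand-in)
        if step % k == 0:
            slow += alpha * (fast - slow)  # slow-weight interpolation
            fast = slow                    # synchronise fast weights
    return slow


# Minimise f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w_star = lookahead_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_star, 4))  # 3.0
```

The slow weights average out the fluctuations of the fast weights, which is one plausible reason for the smoother convergence reported for the hybrid optimizer.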

The comparative learning curves show that the loss formulation with Dice and Focal terms at α=0.25 and γ=2 yields the best segmentation agreement, with a Dice score close to 0.99 at about 7000 batches that stays above 0.985 through to 120,000 batches. The Focal-only configuration plateaus slightly lower, just below 0.985, while the pure Dice loss converges to around 0.98; this ordering is clearly visible in the zoomed view of the early-training window in Figure 12. Other hybrid settings such as α=0.5, γ=1 and α=0.75, γ=0 plateau near 0.97 to 0.975, while the α=0.5, γ=0.5 configuration remains close to 0.965, indicating reduced overlap accuracy. These gaps suggest that an appropriate weighting of the class-balancing and focusing parameters supports boundary learning while suppressing misclassifications more effectively, leading to more consistent segmentation over prolonged training.

Figure 12.

Dice score performance for different dice–focal loss configurations.

Classification results - RIM-ONE dataset

The evaluation results indicate that the developed model achieves well-balanced and reliable classification, with an accuracy of 98.63%, a precision of 98.41%, and a recall of 98.72%, the highest of the reported metrics, showing that the system is particularly effective at detecting true positive cases. This overall reliability is illustrated in Figure 13, where all performance metrics cluster close to the upper limit. An F1 score of 98.56% indicates a good trade-off between precision and recall, while a specificity of 98.34% shows that false positive detections are kept low. These closely grouped, high values indicate that the algorithm maintains consistent predictive strength across both positive and negative samples, making it well suited to robust medical image analysis.

Figure 13.

Performance evaluation of the proposed AME-SegNet model.

Statistical analysis shows that the proposed method is highly reliable on the RIM-ONE dataset: it achieved a mean accuracy of 98.63% with a 95% confidence interval of 98.41% to 98.85%, indicating that the model remains stable across samples. As shown in Table 5, all evaluation indicators exhibit similarly tight bounds. The mean precision was 98.41% (98.18%-98.64%), and the mean recall was 98.72% (98.50%-98.94%), meaning that the system is effective both at preventing false alarms and at avoiding missed glaucoma cases. Moreover, the F1-score of 98.56% (98.33%-98.79%) and the specificity of 98.34% (98.10%-98.58%) further show that the classifier maintains well-balanced diagnostic behavior across the dataset.

Table 5.

Confidence interval–based performance metrics.

Performance metric	Mean (%)	95% confidence interval lower bound (%)	95% confidence interval upper bound (%)
Accuracy	98.63	98.41	98.85
Recall	98.72	98.50	98.94
F1-Score	98.56	98.33	98.79
Precision	98.41	98.18	98.64
Specificity	98.34	98.10	98.58

The learning behavior shows that the network progressed smoothly and in a controlled manner: training and validation accuracy improved steadily from the mid-80% range to a peak of 98.63%, while the corresponding loss curves fell steadily to a minimum of 0.0971, indicating a small residual prediction error. This convergence trend is illustrated in Figure 14, where both curves follow nearly parallel paths without major deviation. The closeness of the training and validation trajectories indicates that the model generalizes well rather than memorizing the training data, with no signs of overfitting or instability even in the late epochs. These findings support the view that the combined optimization strategy and network structure enabled consistent learning, leading to high classification reliability and a small residual error.

Figure 14.

Training and validation accuracy and loss.

In the computational comparison, the proposed model is faster and more efficient than the other deep learning architectures: it achieved the lowest inference time of 7.1 ms, the shortest training time of 2150 s, and the smallest size at 3.2 million parameters. Table 6 shows this advantage over heavier networks such as EfficientNet-B0, which is listed with an inference time of 18.6 ms and 138.0 million parameters, and Vision Transformer, which is listed with 14.2 ms and 25.6 million parameters. Even the lighter Swin Transformer, with an inference time of 9.8 ms and 5.3 million parameters, is slower and larger than the presented model. These differences indicate that the framework balances speed, memory footprint, and training cost, making it better suited to real-time clinical deployment, as evaluated on the RIM-ONE dataset.

Table 6.

Computational efficiency comparison of different models.

Model	Inference time (ms)	Training time (s)	Parameters (M)
EfficientNet-B0	18.6	3820	138.0
Vision Transformer	14.2	3450	25.6
ConvNeXt	16.4	3910	8.0
Swin Transformer	9.8	2780	5.3
Proposed	7.1	2150	3.2

The classification results indicate that the proposed system discriminates well between healthy and glaucomatous cases: 98.60% of normal samples are correctly labeled as normal, whereas only 1.40% are wrongly assigned to the diseased category; conversely, 98.66% of glaucoma images are correctly recognized, with just 1.34% misclassified. Figure 15 shows this balance, with both diagonal values dominating the matrix. The near symmetry of the two correct classification rates indicates that the model is not biased toward either class, which matters in medical screening because both missed diagnoses and false alarms carry risks. These proportions therefore support the claim that the algorithm yields fair and robust decisions for both normal and glaucoma cases, with nearly equal and very high accuracy.

Figure 15.

Confusion matrix for normal and glaucoma classification.
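The relationship between a binary confusion matrix and the reported classification metrics can be made explicit with a short sketch. The counts below are hypothetical round numbers chosen for easy checking; they are not the counts behind Figure 15.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics with glaucoma as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)       # sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 100 glaucoma images and 100 normal images.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall is read from the glaucoma row of the matrix and specificity from the normal row, which is why a matrix with two similar diagonal values implies balanced sensitivity and specificity.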

The performance comparison shows that the proposed framework achieves the best diagnostic performance among all evaluated architectures, with 98.63% accuracy, 98.41% precision, 98.72% recall, 98.56% F1-score, and 98.34% specificity. These values are higher than those of the Swin Transformer, which are 97.94%, 97.68%, 98.11%, 97.89%, and 97.52%, respectively; as Table 7 shows, the proposed model ranks first on every measure. EfficientNet-B0 and Vision Transformer sit at the lower end, with accuracies of 95.74% and 96.82%, respectively, while ConvNeXt reaches 97.36%, so the proposed model holds a small but consistent lead over all of these models. These differences indicate that the proposed approach improves overall correctness while also improving the balance between detecting actual glaucoma cases and avoiding false detections on the RIM-ONE dataset.

Table 7.

Classification performance comparison of different models.

Model	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1-Score (%)
EfficientNet-B0	95.74	96.02	95.18	95.31	95.66
Vision Transformer	96.82	97.01	96.39	96.44	96.72
ConvNeXt	97.36	97.58	96.94	97.12	97.35
Swin Transformer	97.94	98.11	97.52	97.68	97.89
Proposed	98.63	98.72	98.34	98.41	98.56

The Receiver Operating Characteristic (ROC) curves show that the model discriminates strongly between glaucomatous and normal cases across decision thresholds. The ROC curve for the normal category has an area under the curve (AUC) of approximately 98.02%, with the glaucoma category close behind at approximately 97.96%. This is shown in Figure 16, where both curves lie well above the diagonal reference line, indicating strong class separation. At false positive rates below 0.10, true positive rates already exceed about 0.90 for both classes, which means that the system identifies most diseased and healthy samples with few false alarms. These values confirm that the classifier preserves good sensitivity and specificity simultaneously, which makes it useful for screening and diagnostic assistance.

Figure 16.

ROC curves for normal and glaucoma classification.
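The threshold-independent nature of the AUC can be illustrated with its rank-based interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The labels and scores below are toy values, and `roc_auc` is a hypothetical helper rather than the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```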

The five-fold validation results indicate that the model performs stably and reproducibly on the RIM-ONE dataset. Accuracy ranged from 98.42% on Fold-1 to 98.81% on Fold-5, and precision varied only marginally, from 98.19% to 98.56%, across the data splits; Table 8 reports a mean accuracy of 98.63%. Recall was similarly steady, ranging from 98.55% on Fold-1 to 98.88% on Fold-3, while the F1-score stayed within the narrow band from 98.37% to 98.72%, showing that sensitivity and balance were preserved in all folds. Specificity ranged from 98.11% to 98.52%, with a mean of 98.34%, corroborating that the classifier is consistent and generalizes well across partitionings of the dataset.

Table 8.

Five-Fold cross-validation performance of the proposed model.

Fold	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1-Score (%)
Fold-1	98.42	98.55	98.11	98.19	98.37
Fold-2	98.61	98.70	98.29	98.43	98.56
Fold-3	98.73	98.88	98.51	98.56	98.72
Fold-4	98.58	98.64	98.27	98.32	98.48
Fold-5	98.81	98.82	98.52	98.55	98.68
Mean	98.63	98.72	98.34	98.41	98.56
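
The mean row of Table 8 can be reproduced directly from the per-fold values. The short sketch below does this with the Python standard library and also reports the sample standard deviation across folds, which is not given in the table and is added here only as an illustrative measure of spread.

```python
from statistics import mean, stdev

# Per-fold values (%) copied from Table 8 (RIM-ONE).
folds = {
    "accuracy":    [98.42, 98.61, 98.73, 98.58, 98.81],
    "recall":      [98.55, 98.70, 98.88, 98.64, 98.82],
    "specificity": [98.11, 98.29, 98.51, 98.27, 98.52],
    "precision":   [98.19, 98.43, 98.56, 98.32, 98.55],
    "f1":          [98.37, 98.56, 98.72, 98.48, 98.68],
}

# For each metric: (mean, sample standard deviation), rounded to two decimals.
summary = {name: (round(mean(v), 2), round(stdev(v), 2))
           for name, v in folds.items()}

for name, (m, s) in summary.items():
    print(f"{name:12s} mean = {m:.2f}  sd = {s:.2f}")
```

The computed means agree with the reported mean row (98.63, 98.72, 98.34, 98.41, and 98.56), and the fold-to-fold standard deviation of accuracy is below 0.2 percentage points.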

The ablation study shows that each added component yields a measurable gain in diagnostic quality on the RIM-ONE dataset. The baseline SegNet reaches 92.14% accuracy; replacing it with AME-SegNet produces the largest increase, to 96.71%; adding Bitterling optimization raises accuracy to 97.54%; and adding the Conv-Transformer raises it further to 98.06%. As summarized in Table 9, the full proposed configuration, which also includes Honey Badger tuning, achieves the highest accuracy of 98.63%. The same pattern appears in precision, recall, and F1-score, which improve steadily from about 91–92% for the baseline to 98.41% precision, 98.72% recall, and 98.56% F1-score for the final model. Specificity increases from 91.32% to 98.34%, indicating that every architectural or optimization refinement strengthens the system's ability to discriminate healthy from glaucomatous cases.

Table 9.

Ablation study of model variants.

Model variant	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1-Score (%)
SegNet (Baseline)	92.14	92.47	91.32	91.86	92.16
AME-SegNet	96.71	96.93	96.18	96.44	96.68
AME-SegNet + Bitterling	97.54	97.79	97.02	97.31	97.55
AME-SegNet + Bitterling + Conv-Transformer	98.06	98.29	97.78	97.85	98.07
AME-SegNet + Bitterling + Conv-Transformer + Honey Badger (Proposed)	98.63	98.72	98.34	98.41	98.56
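
To make the contribution of each stage in Table 9 explicit, the following sketch computes the accuracy gain of every variant over the preceding one. The shortened variant labels are introduced here for readability; the accuracy values are those of the table.

```python
# (variant, accuracy in %) in the order of Table 9 (RIM-ONE).
variants = [
    ("SegNet (Baseline)", 92.14),
    ("AME-SegNet", 96.71),
    ("+ Bitterling", 97.54),
    ("+ Conv-Transformer", 98.06),
    ("+ Honey Badger (Proposed)", 98.63),
]

# Gain of each variant relative to the one before it, in percentage points.
gains = [(name, round(acc - prev_acc, 2))
         for (_, prev_acc), (name, acc) in zip(variants, variants[1:])]

total_gain = round(variants[-1][1] - variants[0][1], 2)

for name, gain in gains:
    print(f"{name:28s} +{gain:.2f}")
print(f"{'Total':28s} +{total_gain:.2f}")
```

The segmentation upgrade accounts for 4.57 of the 6.49 percentage points gained overall, while the three later stages add 0.83, 0.52, and 0.57 points, respectively.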

Classification result – ORIGA-Light dataset

The model shows consistent diagnostic quality on the ORIGA-Light dataset, with an overall accuracy of 98.96%, a precision of 98.74%, and a recall of 99.12%, indicating that true glaucoma cases are recognized very reliably. This balanced performance is shown in Figure 17, where all measures cluster near the top of the scale. The F1-score of 98.93% confirms that the balance between sensitivity and precision is well maintained, and the specificity of 98.69% indicates that normal samples are also classified correctly with very few false alarms. Together, these values show that the system provides dependable screening results for both diseased and healthy images without noticeable bias.

Figure 17.

Performance metrics of the proposed AME-SegNet model.

The confidence interval analysis shows that the proposed method achieves reproducible performance on the ORIGA-Light dataset, with a mean accuracy of 98.96% and a narrow 95% confidence interval of 98.74% to 99.18%. As Table 10 shows, all metrics are similarly tightly bounded. Precision averages 98.74% (98.51% to 98.97%), and recall is higher still at 99.12% (98.90% to 99.34%), indicating that true glaucoma cases are captured very reliably. The F1-score of 98.93% (98.70% to 99.16%) and the specificity of 98.69% (98.45% to 98.93%) show that the classifier maintains balanced and stable diagnostic behavior across the ORIGA-Light dataset.

Table 10.

Confidence interval–based performance metrics.

Performance metric	Mean (%)	95% confidence interval lower bound (%)	95% confidence interval upper bound (%)
Accuracy	98.96	98.74	99.18
Recall	99.12	98.90	99.34
Precision	98.74	98.51	98.97
Specificity	98.69	98.45	98.93
F1-Score	98.93	98.70	99.16
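
The text does not state how the intervals in Table 10 were obtained. One common choice, assumed here purely for illustration, is a Student t interval over the five fold-wise scores. The sketch below applies it to the fold accuracies listed in Table 13; the critical value 2.776 is the standard two-sided 95% value for 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Fold-wise accuracies (%) copied from Table 13 (ORIGA-Light).
fold_accuracy = [98.74, 98.89, 99.08, 98.92, 99.17]

# Two-sided 95% critical value of Student's t with n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def t_interval(values, t_crit):
    """Mean plus/minus t_crit times the standard error of the mean."""
    centre = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width


lower, upper = t_interval(fold_accuracy, T_CRIT_DF4)
print(f"95% CI for accuracy: {lower:.2f} to {upper:.2f}")
```

Under this assumption the interval is roughly 98.75% to 99.17%, which is close to, though not identical with, the reported 98.74% to 99.18%. Other procedures, such as bootstrapping over test images, would give slightly different bounds.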

The learning curves indicate progressive and consistent adaptation of the network on ORIGA-Light. Training and validation accuracy rise from low initial values to the final level of 98.96%, while the corresponding loss curves decrease steadily to a minimum of 0.0648, indicating effective minimization of prediction error. This synchronized behavior can be seen in Figure 18, where the training and validation curves follow closely aligned paths. The small gap between the two, together with the absence of instability or divergence in later epochs, suggests that the model generalizes rather than memorizing specific samples. These observations indicate that the training strategy yields a stable learning process and a reliable classifier with small residual error.

Figure 18.

Training and validation accuracy and loss.

The efficiency comparison shows that the proposed network has a clear advantage in both speed and resource consumption on the ORIGA-Light dataset, with an inference time of 7.3 ms, a training time of 2720 s, and only 3.2 million parameters. Table 11 summarizes this advantage against heavier models: EfficientNet-B0 is listed with 18.9 ms inference time and 138.0 million parameters, and the Vision Transformer with 14.5 ms and 25.6 million parameters. Among the comparison models, the Swin Transformer is the smallest and fastest, with 5.3 million parameters and 10.1 ms inference time, yet it is still slower and larger than the proposed method. These differences indicate that, among the models compared, the proposed architecture offers the best trade-off between learning capacity and computational cost, which favors its practical use in glaucoma screening.

Table 11.

Computational efficiency comparison of different models.

Model	Inference time (ms)	Training time (s)	Parameters (M)
EfficientNet-B0	18.9	4620	138.0
Vision Transformer	14.5	4180	25.6
ConvNeXt	16.7	4750	8.0
Swin Transformer	10.1	3340	5.3
Proposed	7.3	2720	3.2
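
The size of the efficiency gap in Table 11 is easier to judge as ratios relative to the proposed model. The sketch below computes, for each comparison model, how many times slower its inference is and how many times more parameters it has, using only the figures reported in the table.

```python
# (inference time in ms, parameters in millions) copied from Table 11.
models = {
    "EfficientNet-B0": (18.9, 138.0),
    "Vision Transformer": (14.5, 25.6),
    "ConvNeXt": (16.7, 8.0),
    "Swin Transformer": (10.1, 5.3),
}
proposed_ms, proposed_params = 7.3, 3.2

# Ratio > 1 means the comparison model is slower or larger than the proposed one.
ratios = {name: (ms / proposed_ms, params / proposed_params)
          for name, (ms, params) in models.items()}

for name, (time_ratio, param_ratio) in ratios.items():
    print(f"{name:20s} {time_ratio:5.2f}x inference time, "
          f"{param_ratio:6.2f}x parameters")
```

Even the closest competitor, the Swin Transformer, needs about 1.4 times the inference time and about 1.7 times the parameters of the proposed model.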

The confusion matrix shows strong discrimination on the ORIGA-Light dataset: 99.46% of normal images are identified as normal and 0.54% are erroneously labeled glaucomatous, while 98.46% of glaucoma samples are correctly identified and only 1.54% are misclassified as normal. These balanced outcomes are illustrated in Figure 19, in which the dominant values lie along the main diagonal. The very small off-diagonal percentages show that both false alarms and missed detections are kept at low levels, which is critical for clinical decision-making. These proportions indicate that the system is highly reliable for both classes when screening glaucomatous and healthy eyes.

Figure 19.

Confusion matrix for normal and glaucoma classification.
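
Because the confusion matrix is reported as row-normalized percentages, overall accuracy depends on how many images each class contains. The following sketch shows the relationship; the class counts passed to `overall_accuracy` are hypothetical and serve only to illustrate the weighting.

```python
# Per-class correct rates (%) read from the confusion matrix described above.
normal_as_normal = 99.46        # normal images classified as normal
glaucoma_as_glaucoma = 98.46    # glaucoma images classified as glaucoma


def overall_accuracy(n_normal, n_glaucoma):
    """Accuracy (%) implied by the per-class rates for given class counts."""
    correct = n_normal * normal_as_normal + n_glaucoma * glaucoma_as_glaucoma
    return correct / (n_normal + n_glaucoma)


# Balanced accuracy weights both classes equally, whatever their sizes.
balanced_accuracy = (normal_as_normal + glaucoma_as_glaucoma) / 2

print(f"balanced accuracy:        {balanced_accuracy:.2f}")
print(f"accuracy, 100 vs 100:     {overall_accuracy(100, 100):.2f}")
print(f"accuracy, 300 vs 100:     {overall_accuracy(300, 100):.2f}")
```

With equal class weights the two rates average to 98.96%, the same value as the reported accuracy; with an imbalanced split the overall accuracy shifts toward the rate of the larger class.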

The comparative study shows that the proposed model achieves the highest diagnostic performance on the ORIGA-Light dataset, with 98.96% accuracy, 98.74% precision, 99.12% recall, 98.93% F1-score, and 98.69% specificity. All of these exceed the values of the next-best model, the Swin Transformer, which reaches 98.42% accuracy and 98.16% specificity; Table 12 confirms that the proposed model leads on every metric. Among the remaining baselines, EfficientNet-B0 attains 96.21% accuracy, the Vision Transformer 97.42%, and ConvNeXt 97.96%. These margins show that the proposed method not only improves overall accuracy but also balances the detection of true glaucoma cases against the reduction of false detections, making it the most reliable of the architectures studied.

Table 12.

Classification performance comparison of different models.

Model	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1-Score (%)
EfficientNet-B0	96.21	96.54	95.73	95.88	96.20
Vision Transformer	97.42	97.85	97.11	97.09	97.46
ConvNeXt	97.96	98.21	97.68	97.74	97.97
Swin Transformer	98.42	98.64	98.16	98.21	98.41
Proposed	98.96	99.12	98.69	98.74	98.93

The ROC analysis indicates that the classifier maintains high separability on the ORIGA-Light dataset, with an AUC of approximately 98.34% for the normal class and 98.41% for the glaucoma class, so healthy and diseased examples are distinguished with nearly equal and very high reliability. This separation can be seen in Figure 20, where both curves lie well above the diagonal reference line. By a false positive rate of about 0.10, the true positive rate already exceeds approximately 0.95 for both classes, indicating that most samples are correctly identified with very few false alarms. These values confirm that the system maintains excellent sensitivity and specificity over a range of thresholds, which makes it well suited to real-world screening applications.

Figure 20.

ROC curves for normal and glaucoma classification.

The five-fold cross-validation results show that the model performs consistently on all folds of the ORIGA-Light dataset. As summarized in Table 13, accuracy ranges from 98.74% (Fold-1) to 99.17% (Fold-5), with a mean of 98.96%, and precision varies only slightly, between 98.52% and 98.94%. Recall remains stable at high values, from 98.91% to 99.29%, and the F1-score stays tightly grouped between 98.71% and 99.14%, indicating that sensitivity and balance are preserved regardless of how the dataset is split. Specificity lies within a narrow window from 98.43% to 98.92%, with a mean of 98.69%, which indicates that the classifier generalizes reliably and avoids class bias across validation folds.

Table 13.

Five-Fold cross-validation performance of the proposed model.

Fold	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1-Score (%)
Fold-1	98.74	98.91	98.43	98.52	98.71
Fold-2	98.89	99.08	98.61	98.67	98.87
Fold-3	99.08	99.29	98.83	98.86	99.07
Fold-4	98.92	99.10	98.66	98.71	98.90
Fold-5	99.17	99.23	98.92	98.94	99.14
Mean	98.96	99.12	98.69	98.74	98.93

The ablation analysis on the ORIGA-Light dataset shows a clear and continuous increase in classification performance with each architectural and optimization component. Accuracy rises from 95.82% for the SegNet baseline to 97.21% for AME-SegNet and 98.02% for the Bitterling-augmented version; adding the Conv-Transformer lifts it further to 98.57%. As Table 14 shows, the complete proposed configuration attains the maximum accuracy of 98.96%. Similar trends appear in precision, recall, and F1-score, which increase from around 95–96% for the baseline to 98.74% precision, 99.12% recall, and 98.93% F1-score for the final model. Specificity likewise improves from 95.18% to 98.69%, meaning that each refinement makes the model better at correctly separating glaucomatous from normal cases.

Table 14.

Ablation study of model variants.

Model variant	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1-Score (%)
SegNet (Baseline)	95.82	96.11	95.18	95.34	95.72
AME-SegNet	97.21	97.64	96.74	96.89	97.26
AME-SegNet + Bitterling	98.02	98.31	97.59	97.78	98.04
AME-SegNet + Bitterling + Conv-Transformer	98.57	98.79	98.21	98.36	98.57
AME-SegNet + Bitterling + Conv-Transformer + Honey Badger (Proposed)	98.96	99.12	98.69	98.74	98.93

Comparative analysis

The quantitative comparison in Table 15 shows that the proposed AME-SegNet framework achieves the most balanced segmentation of both the optic disc and the optic cup on Drishti-GS1, with 97.36% Dice and 94.89% IoU for the optic disc and 96.72% Dice and 92.13% IoU for the optic cup. These values are higher than those of the closest competing methods: Huang et al.¹ report 97.12% disc Dice and 91.99% cup Dice; Wang et al.¹¹ report 96.24% disc Dice, 88.49% disc IoU, 92.28% cup Dice, and 91.52% cup IoU; Kumar et al.²⁵ report 95.95% disc Dice, 92.22% disc IoU, 88.70% cup Dice, and 79.72% cup IoU; and Chen et al.²⁹ report 96.65% disc Dice and 91.78% cup Dice. The proposed method therefore improves region overlap for both structures, with the largest margin on the optic cup, and suggests greater consistency along the anatomical boundaries, which provides a more reliable geometric basis for glaucoma evaluation.

Table 15.

Comparative performance of optic disc and optic cup segmentation on Drishti-GS1.

Model	Optic disc Dice (%)	Optic disc IoU (%)	Optic cup Dice (%)	Optic cup IoU (%)
Huang et al. ¹	97.12	—	91.99	–
Wang et al. ¹¹	96.24	88.49	92.28	91.52
Kumar et al. ²⁵	95.95	92.22	88.70	79.72
Chen et al. ²⁹	96.65	—	91.78	—
Proposed (Drishti)	97.36	94.89	96.72	92.13
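
For readers comparing the Dice and IoU columns of Table 15, the sketch below shows how both overlap measures are computed for a single binary mask and how they are related through IoU = Dice / (2 - Dice). The tiny flattened masks are made-up examples and the function names are illustrative.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two flat binary masks (sequences of 0/1)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


def dice_to_iou(dice):
    """Per-mask identity linking the two measures."""
    return dice / (2 - dice)


# Toy masks (hypothetical): 1 = pixel belongs to the structure.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.2f}, IoU = {iou:.2f}")

# Applying the identity to the reported optic disc Dice of 97.36%.
print(f"IoU implied by Dice 0.9736: {dice_to_iou(0.9736):.4f}")
```

The identity holds exactly for one mask. When Dice and IoU are each averaged over many images, as in a dataset-level table, the averaged values need not satisfy it exactly, so small differences between a reported IoU and the value implied by the mean Dice are expected.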

The comparative results in Table 16 show that the proposed framework achieves the strongest overall diagnostic performance among the methods considered on the RIM-ONE dataset. It reaches 98.63% accuracy, 98.41% precision, 98.72% recall, 98.56% F1-score, and 98.34% specificity. By comparison, Sanghavi et al.⁴ report 96.33% accuracy; Aimmanee et al.⁹ report 98.14% accuracy, 86.56% precision, 88.19% recall, and 86.48% F1-score; and Sreema et al.²² report 97.88% accuracy, 97.45% precision, 97.87% recall, and 98.51% specificity. The proposed method is therefore higher on every reported metric except specificity, where Sreema et al.²² are marginally ahead (98.51% versus 98.34%), and it provides a balanced and reliable separation between glaucomatous and normal cases.

Table 16.

Comparative classification performance on the RIM-ONE dataset.

Model	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1 score (%)
Sanghavi et al. ⁴	96.33	—	—	—	—
Aimmanee et al. ⁹	98.14	88.19	—	86.56	86.48
Sreema et al. ²²	97.88	97.87	98.51	97.45	—
Proposed (RIM-ONE)	98.63	98.72	98.34	98.41	98.56

The comparison in Table 17 shows that the proposed framework achieves the most comprehensive and balanced set of diagnostic outcomes on the ORIGA-Light dataset, with 98.96% accuracy, 98.74% precision, 99.12% recall, 98.93% F1-score, and 98.69% specificity. These values are slightly but consistently higher than those of Govindharaj et al.,¹⁴ who report 98.90% accuracy, 98.40% precision, 96.40% recall, 97.20% F1-score, and 97.80% specificity, and those of Sreema et al.,²² who report 98.34% accuracy, 96.89% precision, 98.78% recall, and 98.45% specificity. The margin is much larger against Sheraz et al.,²⁶ whose accuracy and precision are 88.5% and 88.88%, respectively. This indicates that the proposed method offers improved sensitivity and overall reliability in differentiating glaucomatous from normal cases under varying imaging conditions.

Table 17.

Comparative classification performance on the ORIGA-Light dataset.

Model	Accuracy (%)	Recall (%)	Specificity (%)	Precision (%)	F1 score (%)
Govindharaj et al. ¹⁴	98.9	96.4	97.8	98.4	97.2
Sreema et al. ²²	98.34	98.78	98.45	96.89	—
Sheraz et al. ²⁶	88.5	94.91	—	88.88	—
Proposed (ORIGA-Light)	98.96	99.12	98.69	98.74	98.93

Table 18 presents a comparative analysis of recent glaucoma detection approaches alongside the proposed framework. For fair evaluation, all methods were assessed on the same RIM-ONE dataset under the same experimental configuration to ensure consistent training and testing conditions. Earlier approaches relied primarily on standalone CNN or transformer-based architectures, whereas the proposed method integrates edge-aware segmentation, optimized feature selection, and convolutional transformer-based classification. As observed, the proposed framework achieves superior and more balanced performance in terms of accuracy, recall, and precision. These results demonstrate the effectiveness and robustness of the hybrid multi-stage design for reliable automated glaucoma screening.

Table 18.

Comparative analysis of the proposed method with existing state-of-the-art techniques.

Reference	Model	Algorithm	Accuracy (%)	Recall (%)	Precision (%)
Lin et al. (2022)³⁷	GlaucomaNet	Dual CNN	96.84	96.12	95.73
Veena et al. (2022)³⁸	CNN Segmentation + CDR	CNN	95.92	95.40	94.88
Fan et al. (2023)³⁹	Vision Transformer	DeiT	97.18	96.85	96.41
Hemelings et al. (2023)⁴⁰	Regression Model	DL Regression	96.73	96.02	95.60
Shoukat et al. (2023)⁴¹	ResNet-50	CNN	97.54	98.10	96.32
Proposed	AME-SegNet + CT		98.63	98.72	98.41

Conclusion

This study introduced a unified and anatomically guided framework for automated glaucoma detection that integrates advanced preprocessing, AME-SegNet-based optic disc and cup segmentation, optimized feature selection, and a hybrid convolutional transformer classifier. In contrast to many existing approaches that emphasize either segmentation or classification independently, the proposed method jointly optimizes boundary precision, feature discrimination, and classification robustness within a single pipeline. To ensure objective evaluation, all comparative analyses were conducted under identical datasets, preprocessing strategies, training configurations, and evaluation metrics. This standardized protocol minimizes experimental bias and enables fair performance assessment against recent CNN- and transformer-based methods.

Existing techniques commonly encounter challenges such as imprecise optic cup boundary delineation, sensitivity to dataset variability, overfitting due to unstable feature learning, and limited generalization across imaging conditions. By incorporating multi-scale edge-aware segmentation, metaheuristic-based feature optimization, and automated hyperparameter tuning, the proposed framework addresses these limitations while maintaining computational efficiency suitable for real-time deployment. Extensive validation on the Drishti-GS1, RIM-ONE, and ORIGA-Light datasets demonstrates consistent segmentation accuracy, balanced classification performance, and stable cross-validation behavior. The improved cup-to-disc ratio estimation further confirms enhanced anatomical reliability.

The proposed framework achieved classification accuracies of 98.63% on RIM-ONE and 98.96% on ORIGA-Light, with balanced precision, recall, and specificity values exceeding 98%, ROC-AUC values of approximately 0.98, and stable five-fold cross-validation performance. For segmentation on Drishti-GS1, it attained Dice scores of 97.36% (optic disc) and 96.72% (optic cup), together with improved boundary delineation and reduced cup-to-disc ratio error, while remaining computationally efficient with only 3.2 million parameters and approximately 7 ms inference time. Overall, the proposed AME-SegNet framework advances current glaucoma screening methodologies by delivering a more generalizable, stable, and clinically applicable automated diagnostic solution.

Limitations of the proposed work

Although the proposed AME-SegNet framework exhibits high accuracy, robustness, and computational efficiency across several benchmark datasets, certain constraints must be acknowledged. First, the current study relies solely on color fundus images and does not incorporate complementary information from other modalities such as optical coherence tomography (OCT) or visual field testing. While fundus images provide important information about optic disc and cup morphology, combining them with depth-resolved retinal measurements may further increase diagnostic certainty, especially in the early stages of glaucoma.

Second, the proposed model is designed for binary classification (glaucoma versus normal) and does not explicitly estimate disease severity or progression stage. Grading glaucoma as mild, moderate, or severe is important for treatment planning and monitoring, and the current framework does not yet address it. Third, although the model shows strong generalization across publicly available datasets, all evaluations are based on retrospectively collected and highly curated datasets with high-quality annotations. Images acquired in real-world screening environments may suffer from motion blur, poor focus, or occlusions, which can degrade segmentation quality and, in turn, classification.

Finally, the use of biologically inspired optimization algorithms such as Bitterling Colony Optimization and the Honey Badger Algorithm enhances accuracy and stability but imposes extra computational overhead during training, which may extend optimization time compared with fixed-parameter learning strategies. These limitations point to avenues for future research, including multimodal data fusion, glaucoma stage classification, real-world clinical validation, and further optimization of computational efficiency to improve the clinical applicability of the proposed system.

Future scope of the research

Beyond diagnostic accuracy, the proposed AME-SegNet framework is designed as a scalable and extensible platform for future glaucoma research and practical deployment. Its modular architecture, consisting of preprocessing, AME-SegNet segmentation, feature optimization, and convolutional transformer classification, allows each component to be upgraded or extended independently without redesigning the others. This makes it possible to integrate further biomarkers into the assessment, such as neuro-retinal rim thickness, peripapillary atrophy, or vessel density, and to incorporate multimodal inputs, such as OCT and visual field data, for a more comprehensive glaucoma assessment. From a research perspective, the framework provides a basis for multi-stage glaucoma grading and disease progression modeling, moving toward prediction of severity and longitudinal change rather than simple binary classification. The anatomically consistent segmentation achieved by AME-SegNet may also support new clinically meaningful descriptors for explainable AI and clinical decision support systems.

For real-time implementation, the proposed model is computationally efficient, requiring only 3.2 million parameters and approximately 7 ms of inference time per image, which makes it suitable for deployment on edge devices, clinical workstations, and mobile screening units. Convolutional tokenization in the transformer helps retain high accuracy while keeping memory and processing demands low. In addition, automated hyperparameter tuning with the Honey Badger Algorithm allows the model to be adapted to new datasets or imaging devices with little manual intervention. These characteristics make the proposed AME-SegNet framework a promising candidate for large-scale, low-cost glaucoma screening in hospitals, community clinics, and remote health settings, while providing an adaptable and robust basis for future glaucoma research and intelligent ophthalmic diagnostics.

Footnotes

Acknowledgements

The corresponding author, C. Moorthy, would like to express sincere thanks to the co-authors, D. Arulanantham, A. Suresh Babu, and S. Murugesan, for their valuable contributions, collaboration, and support throughout the course of this research work.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Huang, Zhang, et al. SAMCF: adaptive global style alignment and multi-color spaces fusion for joint optic cup and disc segmentation. Comput Biol Med 2024; 178: 108639.

2. Chowdhury, Lodh, Agarwal, et al. Rim learning framework based on TS-GAN: a new paradigm of automated glaucoma screening from fundus images. Comput Biol Med 2025; 187: 109752.

3. Sangeetha, Rekha, Priyanka. A residual network integrated with multimodal fundus features for automatic glaucoma classification. Comput Electr Eng 2025; 122: 109880.

4. Sanghavi, Kurhekar. An efficient framework for optic disk segmentation and classification of glaucoma on fundus images. Biomed Signal Process Control 2024; 89: 105770.

5. Kui, Hai, Zou, et al. PK-Net: A prior knowledge-driven dual-path network for enhanced glaucoma screening. Knowl Based Syst 2025; 329: 114374.

6. Huang, Kong, Yan, et al. Interpretable longitudinal glaucoma visual field estimation deep learning system from fundus images and clinical narratives. npj Digital Medicine 2025; 8: 89.

7. Sharma, Takahashi, Ninomiya, et al. A hybrid multi model artificial intelligence approach for glaucoma screening using fundus images. npj Digital Medicine 2025; 8: 30.

8. Abbasi, Amin, Alabrah, et al. Diabetic retinopathy detection using adaptive deep convolutional neural networks on fundus images. Sci Rep 2025; 15: 24647.

9. Naing, Aimmanee. Automated optic disk segmentation for optic disk edema classification using factorized gradient vector flow. Sci Rep 2024; 14: 71.

10. Chaurasia, Liu G-S, Greatbatch, et al. A generalised computer vision model for improved glaucoma screening using fundus images. Eye 2025; 39: 109–117.

11. Wang, Cheng. Towards an extended EfficientNet-based U-Net framework for joint optic disc and cup segmentation in the fundus image. Biomed Signal Process Control 2023; 85: 104906.

12. Geetha, Carmel Sobia, Santhi, et al. DEEP GD: deep learning based snapshot ensemble CNN with EfficientNet for glaucoma detection. Biomed Signal Process Control 2025; 100: 106989.

13. Zhou, Zheng, Zhou, et al. Self-supervised pre-training for joint optic disc and cup segmentation via attention-aware network. BMC Ophthalmol 2024; 24: 98.

14. Govindharaj, Santhakumar, Pugazharasi, et al. Enhancing glaucoma diagnosis: generative adversarial networks in synthesized imagery and classification with pretrained MobileNetV2. MethodsX 2025; 14: 103116.

15. Guntreddi, Sivakumar. Deep learning based glaucoma detection using majority voting ensemble of ResNet50, VGG16, and Swin transformer. Results in Engineering 2025; 28: 107229.

16. Milad, Antaki, Mikhail, et al. Code-free deep learning glaucoma detection on color fundus images. Ophthalmology Science 2025; 5: 100721.

17. Lenka, Mayaluri, Panda. Glaucoma detection from retinal fundus images using graph convolution based multi-task model. e-Prime - Advances in Electrical Engineering, Electronics and Energy 2025; 11: 100931.

18. Rieck, Mai, Eisentraut, et al. A novel transformer–CNN hybrid deep learning architecture for robust broad-coverage diagnosis of eye diseases on color fundus images. IEEE Access 2025; 13: 156285–156300.

19. Gagnon, et al. Fundusnet: a deep-learning approach for fast diagnosis of neurodegenerative and eye diseases using fundus images. Bioengineering 2025; 12: 57.

20. Alasmari, Amoudi, Alghamdi. Explainable transformer-based framework for glaucoma detection from fundus images using multi-backbone segmentation and vCDR-based classification. Diagnostics 2025; 15: 2301.

21. Alsohemi, Dardouri. Fundus image-based eye disease detection using EfficientNetB3 architecture. Journal of Imaging 2025; 11: 79.

22. Ma S, Jayachandran A and Sudarson Rama Perumal T. Multi-dimensional dense attention network for pixel-wise segmentation of optic disc in colour fundus images. Technol Health Care 2024; 32: 3829–3846.

23. Sreedevi, Suresh, Mubarakali, et al. OD-DeepNet: semantic classification by deep learning for optic disc localization. Trait Signal 2023; 40: 2827.

24. Lenka, Mayaluri, Panda. Retinal fundus image enhancement using an ensemble framework for accurate glaucoma detection. Neural Computing and Applications 2025; 37: 20499–20517.

25. Kumar. Enhanced segmentation of optic disc and cup using attention-based U-net with dense dilated series convolutions. Neural Computing and Applications 2025; 37: 6831–6847.

26. Sheraz, Shehryar, Khan. Two stage-network: automatic localization of optic disc (OD) and classification of glaucoma in fundus images using deep learning techniques. Multimed Tools Appl 2025; 84: 12949–12977.

27. Tampa, Mekongo, Tiedeu. Deep learning-based algorithm for automated detection of glaucoma on eye fundus images. Multimed Tools Appl 2025; 84: 22809–22826.

28. Zhou. Multi-step framework for glaucoma diagnosis in retinal fundus images using deep learning. Med Biol Eng Comput 2025; 63: 1–13.

29. Chen, Zou, Chen, et al. Optic disc and cup segmentation based on information aggregation network with contour reconstruction. Biomed Signal Process Control 2025; 104: 107179.

30. Wang, Zhang, et al. A deep semi-supervised learning approach to the detection of glaucoma on out-of-distribution retinal fundus image datasets. BMC Ophthalmol 2025; 25: 26.

31. Islam, Deo, Barua, et al. Novel deep learning model for glaucoma detection using fusion of fundus and optical coherence tomography images. Sensors 2025; 25: 4337.

32. Visu, Sathiya, Ajitha, et al. Enhanced swintransformer based tuberculosis classification with segmentation using chest X-ray. J Xray Sci Technol 2025; 33: 167–186.

33. Rajendran, Rajagopal, Thanarajan, et al. Automated segmentation of brain tumor MRI images using deep learning. IEEE Access 2023; 11: 64758–64768.

34. Manoharan, Sivagnanam. A novel human action recognition model by grad-CAM visualization with multi-level feature extraction using global average pooling with sequence modeling by bidirectional gated recurrent units. International Journal of Computational Intelligence Systems 2025; 18: 18.

35. Jayamohan, Yuvaraj. A novel human actions recognition and classification using semantic segmentation with deep learning techniques. Neural Computing and Applications 2025; 37: 7321–7337.

36. Thanarajan, Alotaibi, Rajendran, et al. Eye-tracking based autism spectrum disorder diagnosis using chaotic butterfly optimization with deep learning model. Comput Mater Contin 2023; 76: 1995–2013.

37. Lin, Hou, Liu, et al. Automated diagnosing primary open-angle glaucoma from fundus image by simulating human’s grading with deep learning. Sci Rep 2022; 12: 14080.

38. Veena, Muruganandham, Kumaran. A novel optic disc and optic cup segmentation technique to diagnose glaucoma using deep learning convolutional neural network over retinal fundus images. Journal of King Saud University-Computer and Information Sciences 2022; 34: 6187–6198.

39. Fan, Alipour, Bowd, et al. Detecting glaucoma from fundus photographs using deep learning without convolutions: transformer for improved generalization. Ophthalmology Science 2023; 3: 100233.

40. Hemelings, Elen, Schuster, et al. A generalizable deep learning regression model for automated glaucoma screening from fundus images. npj Digital Medicine 2023; 6: 12.

41. Shoukat, Akbar, Hassan, et al. Automatic diagnosis of glaucoma from retinal images using deep learning approach. Diagnostics 2023; 13: 1738.

42. Sivaswamy, Krishnadas, Joshi, et al. Drishti-GS: retinal image dataset for optic nerve head (ONH) segmentation. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI), Beijing, China: IEEE, 2014, pp.53–56.

43. Fumero, Alayón, Sanchez, et al. RIM-ONE: an open retinal image database for optic nerve evaluation. In: 2011 24th international symposium on computer-based medical systems (CBMS). Bristol, UK: IEEE, 2011, pp.1–6.

44. Zhang, Yin, Liu, et al. ORIGA-light: an online retinal fundus image database for glaucoma analysis and research. In: 2010 Annual international conference of the IEEE engineering in medicine and biology. Buenos Aires, Argentina: IEEE, 2010, pp.3065–3068.

45. Sharif NAM, Azhar ASM, Harun, et al. Green channel and top hat-based image enhancement for diabetic retinopathy screening. J Phys Conf Ser 2021; 1997: 012002.

46. Kumari PLS. A qualitative approach for enhancing fundus images with novel CLAHE methods. Engineering, Technology & Applied Science Research 2025; 15: 20102–20107.

47. Yamuna, Selvakumar, Suresh. Towards accurate diabetic retinal disease detection using advanced deep metric learning. Biomed Signal Process Control 2026; 113: 109127.

48. Liang, Sheng. MDF-Net: an attention-guided multi-scale dual-fusion network for retinal vessel segmentation. Measurement (Mahwah N J) 2025; 257: 118695.

49. Lai, Ding, Yin, et al. Bitterling colony optimization: a bio-inspired algorithm for global search. Cluster Comput 2025; 28: 42.

50. Balaji, Gobalakrishnan. Biomedical image-based keratoconus classification using convolutional transformers and grey goose optimization. Biomed Signal Process Control 2026; 111: 108357.

51. Zhong, Cao, et al. Symbiotic mechanism-based honey badger algorithm for continuous optimization. Cluster Comput 2025; 28: 133. doi:10.1007/s10586-024-04765-0