Abstract
Glaucoma is an eye condition caused by elevated intraocular pressure that, in its advanced stage, leads to complete loss of vision. Timely screening and treatment can therefore prevent patients from losing their sight. However, glaucoma screening is a complicated process and trained personnel are in short supply, so diagnosis is frequently delayed, which increases the proportion of people worldwide who lose their eyesight. To overcome the limitations of current manual approaches, there is a critical need for a reliable automated framework for early detection of Optic Disc (OD) and Optic Cup (OC) lesions. Classification is made more difficult by the high degree of overlap between the colour of the lesion and that of the eye. In this paper, we propose an automatic glaucoma detection method consisting of two major stages: segmentation and classification. The first stage uses a stacked attention based U-Net architecture to identify the optic disc in a retinal fundus image and extract it. The second stage uses MobileNet-V2 to classify images as glaucomatous or non-glaucomatous. Experimental results show that the proposed method outperforms other methods, with an accuracy, sensitivity and specificity of 98.9%, 95.2% and 97.5% respectively.
Introduction
The loss of retinal ganglion cells characterises glaucoma, an optic neuropathy that can have devastating consequences if left untreated. Optic nerve head (ONH) anatomical changes, specifically lamina cribrosa sheet thinning and posterior bowing, are clinically observable manifestations of this condition. A clinical decision can only be made after carefully observing and evaluating several clinical features relevant to glaucomatous optic neuropathy. Current methods for diagnosing and monitoring glaucoma, such as a comprehensive eye exam, involve extensive testing and produce a large amount of data that can be difficult to interpret. In addition, early glaucoma patients and healthy subjects share many ocular characteristics. As a result, efforts are being made to create supplementary methods, such as artificial intelligence systems, that can help distinguish true pathology from normal variability and true progression from inter-test variability.
There is currently no cure for glaucoma, but early diagnosis and treatment can help a great deal. Automated methods for early detection of glaucoma are therefore crucial [1]. The vitreous, macula, retina, and blood vessels of the optic nerve can all be evaluated in a retinal fundus image. Ophthalmologists capture retinal images with a fundus camera and use them to diagnose eye diseases such as glaucoma, which is a leading cause of blindness. The optic nerve head is responsible for sending signals from the retina to the brain [2]. Images of the retina and fundus are shown in Fig. 1.

Retinal fundus image.
One of the most prominent structural image cues for glaucoma detection is the Cup-to-Disc Ratio (CDR) [3]. Most symptoms occur near the OD. Automated glaucoma detection methods that use segmented discs are sensitive to the accuracy of the segmentation, and even a slight error in outlining OD can have an impact on the diagnosis [4]. On the other hand, localization provides information regarding the precise location of OD within the overall image as well as the context of its surroundings. Automatic methods for the detection of glaucoma that are based on this method of region of interest extraction are more resistant to errors in localization.
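To make the CDR cue concrete, the following minimal sketch computes a vertical cup-to-disc ratio from two binary masks. The function names (`vertical_extent`, `vertical_cdr`) and the toy masks are hypothetical and are not part of the proposed method. The ratio of vertical cup height to vertical disc height is one common definition and is assumed here.

```python
def vertical_extent(mask):
    """Return the number of rows spanned by the foreground (1) pixels of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    return 0 if not rows else rows[-1] - rows[0] + 1


def vertical_cdr(disc_mask, cup_mask):
    """Vertical cup-to-disc ratio: cup height divided by disc height."""
    disc_height = vertical_extent(disc_mask)
    if disc_height == 0:
        raise ValueError("disc mask is empty")
    return vertical_extent(cup_mask) / disc_height


# Toy 6x6 masks (illustrative only): the disc spans rows 1-4, the cup spans rows 2-3.
disc = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cup = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cdr = vertical_cdr(disc, cup)
print(cdr)  # 0.5
```

In this toy case, a cup outline that is wrong by a single row would change the ratio from 0.5 to 0.75, which illustrates why CDR-based detection is sensitive to small segmentation errors.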
The disease pattern in retinal fundus images is subtle and complicated, making automated classification difficult. In natural scene images, objects have readily apparent colour, shape and texture, so segmentation is a relatively straightforward task. In contrast, the most telling signs of illness in medical images are often obscured and only visible to specialists with years of experience in the field. However, it has been demonstrated that deep learning can learn a discriminative representation of data that captures subtle differences [5], so that the representation is both compact and useful. Computer-assisted diagnosis (CAD) tools built on such representations can greatly improve the quality and efficiency of mass screening programmes. Possible benefits of these automated systems include reduced human error, timely service delivery in underserved areas, and freedom from the bias and chronic stress that affect medical professionals.
A CNN automatically learns to classify images from their many features, without first having to separate the fuzzy optic cups. Most of these methods, however, lack sufficient training data, which leads to overfitting. Transfer learning is used when there is not enough data. This work transfers a model from the task of classifying natural images to the task of detecting glaucoma in fundus images. Compared with natural images, fundus images contain many uniform areas that give no useful information for detecting glaucoma. These redundant regions could mislead a CNN into focusing on unimportant information. To overcome this problem, we propose an automatic glaucoma detection method with two stages. The first stage performs optic disc and optic cup segmentation, segmenting and extracting the optic disc from a retinal fundus image using a stacked attention based U-Net architecture. The second stage uses MobileNetV2 to classify the extracted disc as healthy or glaucomatous. The proposed method is evaluated experimentally and compared with existing methods.
The remainder of the paper is organized as follows. Section II discusses related work. Section III describes the proposed method in detail. Section IV discusses the experimental design, evaluation measures, and results. Finally, a conclusion is provided.
Recently, computer-aided systems based on image processing and machine learning have gained popularity for intelligent glaucoma detection [6, 7]. For a given dataset of retinal images, supervised machine learning algorithms are used to classify images as normal or glaucomatous, while unsupervised algorithms are primarily used to segment the disc and cup in the enhanced retinal image. In recent years, researchers have proposed retinal image glaucoma detection methods using multi-layer perceptron, Random Forest, and Radial Basis Function classifiers [8, 9]. Diagnostic methods for detecting glaucoma fall into two primary categories: heuristic methods and deep learning methods. Many heuristic methods extract features using various image processing techniques. Screening techniques have been developed that measure the thickness of the retinal nerve fibre layer (RNFL) [10]. In [11], texture features and higher order spectral features were extracted to facilitate the diagnosis of glaucoma. In [12], glaucoma was diagnosed using energy features derived from wavelets. Both [11] and [12] used a support vector machine (SVM) and a naive Bayesian classifier to classify the hand-crafted features. However, these heuristic approaches take into account only a small subset of the features present in fundus images, which limits their classification accuracy.
Deep learning-based methods are another class of glaucoma detection techniques. In particular, glaucoma detection using deep learning based on automatic segmentation of the optic cup and disc has been reported in [13, 14]. However, these methods do not learn the task comprehensively: they focus solely on the optic cup and disc and on how these may or may not be connected to glaucoma. In addition, [15] suggested a multi-stream CNN that merged the final segmentation result with the full images. A deep learning approach to glaucoma detection based on RNFL defects was proposed by [33]. However, [16] was the first to propose a comprehensive CNN approach to glaucoma detection. Chen’s work was built upon in [17], where an improved CNN structure was proposed for glaucoma classification using a combination of global and local features. Both [16] and [17] pre-processed the original fundus images by removing redundant regions in order to regularise the input images. However, these works did not achieve high sensitivity and specificity because training data were scarce and the networks themselves were overly simple.
In [18], the optic cup in retinal fundus images was segmented using the U-Net segmentation method. Better results in detecting glaucoma can be attained through segmentation of the optic cup and optic disc. In this case, the optic disc image served as the basis for the ROI, which was then cropped and segmented using the U-Net algorithm [18]. To better detect glaucoma, [17] proposed an attention-based convolutional neural network (AG-CNN) model, which was tested on the large-scale attention-based glaucoma (LAG) database. The accuracy and reliability of glaucoma identification may suffer if the large amount of redundancy in fundus images is not removed, and the AG-CNN model takes this into account. The model merges subnets for attention prediction, localization of pathological regions, and classification. It has a 96.2% sensitivity and a 0.983 AUC for identifying glaucoma. In a few instances, only a portion of the ROI was highlighted, making it impossible to pinpoint the precise location of the problem [19].
A convolutional neural network model for retinal image classification was proposed by [20]. Rather than using the images themselves as input, this system feeds features extracted from the images to a CNN model, which then classifies the images as normal or abnormal. A hybrid graph convolutional network (HGCN) for categorising retinal images was presented in [21]. In this network, features from a convolutional neural network (CNN) are integrated into a modularity-based graph learning design, yielding a graph convolutional network. An ensemble learning based model for retinal image classification was presented in [22]. A deep learning-based model for retinal image classification using U-Net and the Inception V3 network was proposed in [23]. A transfer learning-based model with saliency maps was proposed in [24]. A glaucoma detection model using U-Net and EfficientNet was presented in [25]. There is no need to separate the fuzzy optic cups manually, because a CNN can learn to classify images from their many features. Various U-Net variants have been proposed for medical image segmentation, namely U-Net [26], 3D U-Net [27], Attention U-Net [28], CE-Net [29], U-Net++ [30] and Trans U-Net [31–38]. However, most of these approaches suffer from insufficient training data, which in turn causes overfitting [39–46]. When there is not enough data, we turn to transfer learning. This work shifts the focus from categorising natural images to detecting glaucoma in fundus images. Compared with natural images, fundus images contain many uniform areas that provide no useful information for detecting glaucoma. The black background of fundus images and the periphery of the eyeball are two such examples. These redundant regions may mislead a CNN into prioritising irrelevant data. To resolve this issue, we propose stacked attention based U-Net segmentation and automatic glaucoma classification using MobileNet-v2.
Proposed model
In medical imaging, target structures often vary in size, shape, and texture from one class to the next. The local receptive field of conventional CNNs used for segmentation yields only local feature representations. To overcome this problem, we propose an automatic glaucoma detection model with two stages (Fig. 1). The first stage performs optic disc segmentation: it locates and extracts the optic disc from an image of the retinal fundus using a stacked attention-based U-Net architecture. In the second stage, the extracted disc is classified as healthy or glaucomatous using MobileNetV2.
Stacked attention based U-Net
U-Net can be broken down into two sections. The first is the contracting path, a conventional convolutional neural network (CNN), presented in Fig. 2(a, b). Fig. 2(c) shows the internal process of the proposed U-Net model, where p and s represent strides and padding respectively. Each contracting path block is made up of two 3 x 3 convolutions, a ReLU activation unit, and a max-pooling layer. This pattern is repeated several times. The second section of U-Net, known as the expansive path, is where the novelty of the architecture lies. In this section, each stage uses a 2 x 2 up-convolution to upsample the feature map. The up-sampled feature map is then concatenated with the cropped feature map from the corresponding layer of the contracting path. Next come two consecutive 3 x 3 convolutions, each followed by ReLU activation. Finally, a last convolution reduces the feature map to a single channel to produce the segmented image. Pixel features near the image borders provide the least context and are therefore cropped away.
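The need for cropping can be seen by tracing feature-map sizes through the network. The sketch below assumes unpadded 3 x 3 convolutions, 2 x 2 max-pooling, four resolution levels and a 572-pixel input, as in the original U-Net. The proposed model may use different padding and strides (see p and s in Fig. 2(c)), and the function name `unet_output_size` is hypothetical.

```python
def unet_output_size(input_size, depth=4, convs_per_block=2, kernel=3):
    """Trace the spatial size through a U-Net built from unpadded convolutions.

    Returns (sizes on the contracting path just before each pooling step,
    final output size).
    """
    shrink = convs_per_block * (kernel - 1)  # each unpadded 3x3 conv removes 2 pixels
    size = input_size
    skip_sizes = []
    # Contracting path: two convolutions, then 2x2 max-pooling with stride 2.
    for _ in range(depth):
        size -= shrink
        skip_sizes.append(size)
        if size % 2:
            raise ValueError("feature map size must be even before pooling")
        size //= 2
    size -= shrink  # bottleneck convolutions
    # Expansive path: 2x2 up-convolution doubles the size, then two convolutions.
    for _ in range(depth):
        size = size * 2 - shrink
    return skip_sizes, size


skips, out = unet_output_size(572)
print(skips, out)  # [568, 280, 136, 64] 388
```

Under these assumptions the skip feature maps (for example 568 pixels wide) are larger than the up-sampled maps they are joined with (392 pixels at the last stage), which is why the contracting-path maps are cropped before concatenation.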

Proposed Model.

Internal Architecture.

Internal process of the proposed U-Net model.
The energy function for the network is given by:
where $b_k$ denotes the activation function in channel $k$. The expression for the loss function can be written as follows:
where $y_i$ denotes the model output and $x_i$ the corresponding true label.
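For reference, if the network follows the original U-Net formulation, the energy function can be taken as a pixel-wise softmax over the final feature map combined with a cross-entropy loss. The binary cross-entropy form, the pixel index $i$, the channel count $K$ and the pixel count $N$ used below are assumptions for illustration and are not confirmed as the exact expressions of the proposed model.

```latex
% Pixel-wise softmax over K channels; b_k(i) is the activation in channel k at pixel i.
p_k(i) = \frac{\exp\big(b_k(i)\big)}{\sum_{k'=1}^{K} \exp\big(b_{k'}(i)\big)}

% Binary cross-entropy between the model output y_i and the true label x_i over N pixels.
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ x_i \log y_i + (1 - x_i) \log (1 - y_i) \Big]
```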
Unless the relevant context is encoded, there is a chance that features representing adjacent pixels with the same label will differ from one another in their local feature representations. The result could be inconsistency within the class, which would have a negative impact on recognition accuracy. In order to solve this problem, we investigate the cognitive processes that lay the groundwork for associating different features with one another. In order to incorporate features learned at varying levels of complexity, a stack of spatial and channel self-attention modules was developed.
In this context, features that are manifested across multiple scales are referred to as $P_n$, where $n$ stands for the level in the architecture. Features are available at varying resolutions for each level $n$, which results in expanded feature maps.
Let the input feature map to the attention module be denoted by $P \in \mathbb{R}^{c \times l \times b}$, where $c$, $l$ and $b$ represent the channel, width and height dimensions respectively. The input feature map is passed through a convolutional block, resulting in a map
In the next step, the input is fed into the other layer resulting in
Channel attention modules have three branches like spatial attention modules.
The input feature is multiplied by its transposed version, and the final channel attention map is generated as given in the equation below:
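As a hedged sketch of how such attention maps are commonly computed, the dual-attention formulation widely used in segmentation networks is given below. The reshaped input $A \in \mathbb{R}^{c \times N}$ with $N = l \times b$, the convolved copies $Q$, $K$ and $V$ in the spatial branch, and the learnable scalars $\alpha$ and $\beta$ are assumptions introduced for illustration and may differ from the exact equations of the proposed model.

```latex
% Spatial attention: i and j index the N spatial positions (columns of Q, K, V and A).
S_{ji} = \frac{\exp(Q_i \cdot K_j)}{\sum_{i=1}^{N} \exp(Q_i \cdot K_j)}, \qquad
E^{\mathrm{sp}}_{j} = \alpha \sum_{i=1}^{N} S_{ji} V_i + A_{:,j}

% Channel attention: i and j index the c channels (rows of A), so the similarity
% matrix is A A^{\top}, normalised with softmax.
X_{ji} = \frac{\exp(A_{i,:} \cdot A_{j,:})}{\sum_{i=1}^{c} \exp(A_{i,:} \cdot A_{j,:})}, \qquad
E^{\mathrm{ch}}_{j} = \beta \sum_{i=1}^{c} X_{ji} A_{i,:} + A_{j,:}
```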
In this stacked attention model, we propose incorporating sequential refinement modules that progressively improve attentive features. The theory is that by making small adjustments one after another, noise can be suppressed while the significance of various local regions is enhanced. In the first step of the process, which is the generation of self-attention features, F is used by the spatial and channel attention modules. In addition, we incorporate an encoder-decoder network that takes the input features P and outputs a representation in the latent space. To carry the class information into the next set of stacked attention modules, the representations of the encoder-decoders are made as close as possible, which can be formulated as follows:
Where
Loss for reconstructed feature map is given by,
Transfer learning reuses a model trained on one task to learn a new task. A model pre-trained on a large dataset is adapted to complete tasks on other datasets. Transfer learning is popular because it can classify small datasets accurately, whereas a deep learning model trained from scratch on a small dataset sees too little data variation to achieve high accuracy. The MobileNetV2 network, pre-trained on the ImageNet dataset, is the base of this methodology. On top of the convolutional layers of MobileNetV2, we implement a new set of layers known as the head model. The base model outputs a feature map of size 7x7x1280, which is fed into the first layer of the head model, a global pooling layer. To reduce the dimensionality of the data significantly, the global pooling layer applies an average pooling operation with a 7x7 kernel, producing a 1x1x1280 output, that is, a one-dimensional feature vector [47–53]. Two fully connected layers follow the global pooling layer, with 128 and 64 nodes respectively, both activated by the ReLU function. Because we use one-hot encoding and the dataset has two classes, glaucoma and non-glaucoma, the output layer has two nodes and is activated by softmax. Within MobileNetV2, the 1x1 projection convolution is the only layer not followed by an activation function (ReLU6): because its output is low dimensional, it is followed by batch normalisation only. Figure 3 shows sample images from the dataset used in this work.
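The head described above can be checked with simple arithmetic. The sketch below is a plain-Python illustration, not the implementation used in this work: it averages a feature map over its spatial dimensions and counts the trainable parameters of the 1280-128-64-2 head, assuming each fully connected layer has a bias term. The helper names `global_average_pool` and `dense_params` are hypothetical, and the toy map uses 3 channels instead of 1280 for brevity.

```python
def global_average_pool(feature_map):
    """Average an H x W x C feature map (nested lists) over H and W, giving C values."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value
    return [t / (height * width) for t in totals]


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (bias term assumed)."""
    return n_in * n_out + n_out


# Head described in the text: 1280 pooled features -> 128 -> 64 -> 2 (softmax).
layer_sizes = [1280, 128, 64, 2]
head_params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(head_params, sum(head_params))  # [163968, 8256, 130] 172354

# Toy check of the pooling step on a 7 x 7 x 3 map whose channel c holds the constant c.
toy = [[[0.0, 1.0, 2.0] for _ in range(7)] for _ in range(7)]
pooled = global_average_pool(toy)
print(pooled)  # [0.0, 1.0, 2.0]
```

Under these assumptions the head adds roughly 172 thousand parameters, which is small compared with the pre-trained base and is consistent with the motivation for transfer learning on a small dataset.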

Sample dataset images [24].
Several experiments were performed to evaluate glaucoma identification, and their results are discussed in detail here. We used a publicly available database (ORIGA) to test the robustness of the proposed method for detecting and classifying glaucoma cases; the method was implemented in Matlab. The ORIGA database contains 650 samples, 168 of which show glaucoma-affected eyes, while the remaining 482 show healthy eyes. The ORIGA dataset is notoriously difficult to use for glaucoma classification due to the prevalence of artefacts within its samples. These include, but are not limited to, large variations in the size, colour, position, and texture of the OD and OC. In addition, the images contain numerous types of distortion, including noise, blurring, and colour and intensity variations. Fig. 3 displays some examples from the dataset. The data were split into training and testing sets in the ratio of 80% to 20%.
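How the 80%/20% split was drawn is not specified. As one possible reading, the sketch below performs a stratified split that keeps the class ratio in both subsets. The stratification, the fixed seed and the function name `stratified_split` are assumptions for illustration. The class counts follow from the 650 ORIGA images, of which 168 are glaucomatous and 650 - 168 = 482 are healthy.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) keeping the class ratio in both subsets."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


# ORIGA class counts: 168 glaucoma (label 1) and 482 healthy (label 0) images.
labels = [1] * 168 + [0] * 482
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 520 130
```

With these counts the test set holds 34 glaucomatous and 96 healthy images, so both subsets keep roughly the original 26% glaucoma prevalence.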
Multiple evaluation metrics, including Intersection over Union (IoU), accuracy, precision, recall, and mean average precision (mAP), are used to evaluate the localization and classification results of the method. Accuracy is determined by Equation (13):
The formula for determining the mAP score is given in Equation (14) as follows: where AP represents the average precision across all classes, q denotes a test sample, and Q is the total number of samples in the test.
The other metrics are given as follows:
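The metrics named above have standard confusion-matrix definitions, which the following sketch spells out. The function names and the example counts are hypothetical and are not results from this work; mAP is taken here as the plain mean of the per-item average-precision values, which is an assumption about Equation (14).

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics for a binary classifier.

    Assumes every denominator is non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called sensitivity
        "specificity": tn / (tn + fp),
    }


def mean_average_precision(ap_values):
    """mAP as the mean of Q average-precision values, one per item q."""
    return sum(ap_values) / len(ap_values)


# Hypothetical counts for a 130-image test set (illustrative only).
m = classification_metrics(tp=30, tn=90, fp=6, fn=4)
for name, value in m.items():
    print(f"{name}: {value:.4f}")

map_score = mean_average_precision([1.0, 0.5])
print(map_score)  # 0.75
```

Sensitivity and recall are the same quantity, so a reported sensitivity can be read directly as the recall on the glaucoma class.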
For an efficient computer-aided approach to identifying and classifying glaucoma-affected regions, timely and accurate identification of optic disc lesions is essential. We therefore designed an experiment to evaluate the localization capability of the stacked attention-based U-Net on the ORIGA database. The results of this experiment are depicted in Fig. 4. They show that the proposed stacked attention-based U-Net is able to correctly localise OD and OC lesions of a wide range of sizes and positions. In addition, the present work is able to deal with a wide variety of sample distortions, such as blurring, colour variations, and variations in brightness.

Segmented image samples.
The proposed method is able to precisely identify lesions with fewer signs because of its localization capabilities. Since mAP and IoU are the most widely used evaluation measures among researchers, we rely on them to quantitatively assess the efficacy of our approach in localization and segmentation. We find that the proposed method yields a mean IoU of 0.979 and an mAP of 0.98 on average. The visual and numerical results demonstrate the effectiveness of our framework in identifying and classifying glaucoma hotspots. Figures 5 and 6 show the accuracy and loss curves of the proposed work.
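For clarity, IoU between a predicted and a ground-truth segmentation mask can be computed as in the minimal sketch below. The function name `iou`, the convention of returning 1.0 when both masks are empty, and the toy masks are assumptions for illustration.

```python
def iou(pred_mask, true_mask):
    """Intersection over Union of two binary masks given as nested lists of 0/1."""
    intersection = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (assumed convention).
    return intersection / union if union else 1.0


# Toy masks: the prediction misses one column of the true region.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
true = [[0, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 0, 0]]
score = iou(pred, true)
print(score)  # 0.6666666666666666
```

A mean IoU is then the average of this score over all test images.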

Accuracy.

Loss.
We conducted experiments to compare the glaucoma recognition results of the proposed framework with those obtained using alternative segmentation methods, namely U-Net [26], 3D U-Net [27], Attention U-Net [28], CE-Net [29], U-Net++ [30], and Trans U-Net [31–36]. The results of this analysis are shown in Table 1. We used the mean average precision (mAP) and IoU evaluation metrics to make the comparisons. It is clear from Table 1 that the proposed stacked attention-based U-Net achieves the highest mAP value of 0.981 and the highest IoU of 0.979. The conventional U-Net method achieves the lowest mAP value (0.923) and the lowest IoU (0.929). In addition, Trans U-Net yields consistent results. The computational advantage provided by the proposed stacked attention mechanism makes the present work clearly superior to the alternatives in detecting and classifying glaucoma lesions. Further, the framework is able to precisely localise the region of interest and achieve the highest mAP value relative to its competitors because of the reliable feature detection provided by the proposed method.
Comparison with other segmentation methods
After segmentation of the optic disc, the pre-trained MobileNetV2 is used for classification. The accuracy and loss curves are given in Figs. 5 and 6 respectively. In addition, as shown in Table 2, we compared the proposed segmentation with various traditional methods. The analysis shows that the stacked attention-based U-Net segmentation method produces superior results with a pre-trained model.
Proposed segmentation with various methods
Further, we conducted a comparative analysis against the most recent methods evaluated on the same dataset to verify the glaucoma identification and classification performance of this approach. To be objective, we compared the outcomes of the present method with those of the techniques detailed in [20–25]. Table 3 displays the quantitative comparisons made using the evaluation metrics.
Comparison with the State-of-art
Figures 7, 10 and 11 show that the proposed glaucoma detection model outperforms state-of-the-art methods. We also compared the model with and without the attention mechanism, as presented in Table 4 and Fig. 12. In contrast to the proposed method, the other methods use extremely complex and deep networks to compute features, which leads to model over-fitting and the loss of spatial features. The proposed approach instead employs a stacked attention-based U-Net as the base segmentation and localization network, which is capable of localizing the optic disc and optic cup. It follows that the proposed architecture offers a practical and efficient method for optic disc and optic cup recognition, which can aid physicians in making a prompt diagnosis of glaucoma-affected areas.

Accuracy.

Precision.

Recall.

Sensitivity.

Specificity.
Stacked attention model analysis

Comparison with attention mechanism.
In this research work, a deep CNN is proposed as a new strategy for glaucoma diagnosis and prediction. The glaucoma dataset was used to train the U-Net and DCNN models for glaucoma image analysis. Less than a quarter of the data was held out, and the remainder was used to train the proposed segmentation and classification models.
The segmented data were used to extract features, and the U-Net was then linked with a pre-trained transfer learning DCNN model. A deep convolutional neural network was used to classify the images in order to diagnose glaucoma, that is, to determine from retinal fundus photographs whether or not the patient had glaucoma. Identifying the areas of a fundus sample that have been affected by glaucoma requires trained human experts who can recognise subtle visual differences and classify the images into the appropriate categories. However, due to the complexity of glaucomatous regions and the difficulty of gaining access to domain experts, a fully automated system is required. In this paper, we proposed an automatic glaucoma detection method using a stacked attention-based U-Net and a pre-trained MobileNetV2. The stacked attention-based U-Net is used for optic cup and optic disc segmentation, and the segmented image is then fed to the pre-trained MobileNetV2 model for classification. With an accuracy of 98.9%, the proposed approach outperforms other methods. In the future, generative adversarial networks may be used to overcome the limited size of the datasets that are currently available, and the proposed method may also be applied to other publicly accessible datasets.
