Abstract
The identification of landmark features such as the optic disc is of high prognostic significance in diagnosing various ophthalmic diseases. A retinal fundus photograph provides a non-invasive observation of the optic disc. The wide variability present in fundus images poses difficulties in its detection and further analysis. The reported work is part of a fundus image screening effort for the diagnosis of Retinopathy of Prematurity (ROP), a sight-threatening disorder seen in preterm infants. The diagnostic procedure for this disease estimates blood vessel tortuosity in a pre-defined area around the optic disc. Hence, accurate optic disc localization is very important for the disease diagnosis. In this paper, we present an optic disc localization technique using a deep neural network based framework. The proposed system relies on the underlying architecture of YOLOv3, a fully convolutional neural network pipeline for object detection and localization. The new approach is tested on 10 different data sets and achieves an overall accuracy of 99.25%, outperforming other deep learning-based OD detection methods. The test results demonstrate the robustness of the proposed technique, which may hence be deployed to assist medical experts in disease diagnosis.
Introduction
Optic disc (OD) detection and localization are important steps in the initial screening of many eye diseases, namely diabetic retinopathy, glaucoma, and retinopathy of prematurity [2, 38]. The OD is often visualized as a bright reddish circular or elliptic area in retinal images. Irregular OD shape, diffuse boundaries of OD regions, and inconsistent imaging conditions make OD detection very challenging [40]. Hence, supervision by a skilled retinal expert is needed for OD segmentation. Solutions using computer-aided automatic detection are extremely valuable in mass ophthalmic screening and medical care, especially in developing countries with a scarcity of qualified experts.
This motivates us to propose an automatic OD localization system with the help of a deep neural network architecture. The proposed approach substitutes the conventional image processing approaches with a data-driven convolutional neural network (CNN) OD detection technique.
Related works
Existing methods utilize either the intensity of the OD region or the point of origin of the major blood vessels for OD localization. The former approach is based on the assumption that the pixels in the OD region have higher intensity than the remaining portions of the retina [12, 43]. Its main drawback is that, due to pathology or uneven illumination, the OD may not be properly detected in some images. The latter approach is based on the assumption that the OD region is the point of origin of the major blood vessels in the eye [15, 45]. However, when the blood vessels are occluded by lesions, the origin point can no longer be exploited reliably and the method becomes less effective. The limitations of the existing rule-based techniques call for a system built on a conceptual and in-depth understanding of OD features. In the reported work, we put forward a deep learning (DL) based OD localization system that does not rely on any prior knowledge.
In recent years, the high computing power provided by GPUs and the availability of enormous data sets have propelled research on DL-based systems. Moreover, these systems have achieved outstanding performance in various machine learning tasks, including those in the medical field. The literature contains only a few works that use a DL approach for OD localization [10, 11]. Maninis et al. [28] employed a CNN system based on the VGG-16 network along with transfer learning for OD detection. In [37], a glaucoma diagnostic system is developed in which OD localization is performed using multiple pre-trained deep networks. Guo et al. [18] proposed a CNN-based network that performs pixel patch classification for OD segmentation. In [6], a fully convolutional DenseNet with a symmetric U-shaped framework predicts the boundary of the OD with good accuracy.
In this paper we utilize the architecture of the YOLOv3 (You Only Look Once) [36] network, which employs features learned using a fully convolutional network (FCN), to detect the OD in a fundus image. In [34], a YOLOv3 network is used on CT images to identify cholelithiasis and classify gallstones. Unver et al. [42] proposed an effective framework for skin lesion segmentation in dermoscopic images by combining the YOLO architecture and the GrabCut algorithm. The deep learning based computer-aided system proposed in [7] used a YOLO network for simultaneous detection and classification of breast masses in mammograms.
In the reported work, a robust computer-aided automatic detection system is proposed for a specific medical application in preterm infant fundus images. The proposed pipeline employs transfer learning throughout the detection process to cope with the limited data set. We present the results of experiments on 10 different data sets of retinal images using a complete end-to-end CNN architecture. The high detection accuracy obtained on the different image data sets shows the adaptability of the network to various image parameters, namely resolution, illumination conditions, and other noise present in the images. The results show the high potential of this technique in practical clinical applications.
Materials and methods
Data set
In Table 1 we show the details of the various data sets used in our study. All details, including the resolution, the number of images used, and the nature of each data set, are summarized in the table. Apart from these publicly available data sets, the performance of the proposed technique is assessed on a new data set of 125 retinal fundus images that we obtained from Karnataka Internet Assisted Diagnosis of Retinopathy of Prematurity (KIDROP) [1], the largest organizational tele-medicine network in the world, which aims to eradicate ROP-based infant blindness. The images are captured by a RetCam3™ camera at a resolution of 1600 × 1200 pixels. The data set contains 30 diseased images and the remaining images are marked as healthy. Our research work concentrates mainly on the KIDROP data set, with the aim of diagnosing ROP in the initial stage of disease progression.
Data sets used in the proposed work
In this work we make use of the YOLOv3 architecture for the detection of the OD in fundus images. To the best of our knowledge, this is the first work that utilizes the YOLOv3 architecture for OD detection. Figure 1 gives a schematic view of this pipeline for the detection of the OD in a given retinal image.

The overview of the proposed approach. (a) Input fundus image, (b) YOLOv3 network block, (c) output tensor, (d) tensor structure, (e) output image with the bounding box shown in the blue box. In (d) only one bounding box (actual implementation holds three such boxes shown as dots) is shown along with its attributes for the ease of representation. The blue box shown in (a) is assumed to be the grid responsible for output prediction as it contains the OD center.
The YOLOv3 network architecture views the object detection task as a regression problem in which an object region in an image is directly converted to bounding box dimensions and class probabilities. The network also outputs a confidence score which indicates the objectness of the predicted bounding box.
The architecture uses a fully convolutional network (FCN), as shown in Fig. 2, without any form of max pooling layers. The architecture has 75 convolutional layers and also includes up-sampling layers with skip connections. The convolutional layers extract features from the entire fundus image using filters of variable kernel size. The network employs the Leaky ReLU (Rectified Linear Unit) activation function, which prevents saturation for both positive and negative inputs, in all layers.
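As a minimal illustration of the activation mentioned above, the following sketch implements a scalar Leaky ReLU. The negative slope of 0.1 is an assumption (the text does not state the value used), and the function name is chosen here for illustration only.

```python
def leaky_relu(x, negative_slope=0.1):
    """Scalar Leaky ReLU: identity for x >= 0, a small linear slope otherwise.

    The default slope of 0.1 is an assumed value, not one reported in the paper.
    """
    return x if x >= 0.0 else negative_slope * x


if __name__ == "__main__":
    for value in (-2.0, 0.0, 3.5):
        print(value, "->", leaky_relu(value))
```

Because the negative branch keeps a non-zero slope, the gradient does not vanish for negative inputs, which is the non-saturation property referred to in the text.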

YOLOv3 architecture used in the proposed study.
The feature maps are down-sampled by changing the stride of the network layers. The network employs a learning framework based on the feature pyramid network (FPN) architecture [26] to capture the low- and high-level information present in the target fundus image. The network also includes residual blocks, a ResNet-like structure [19], which generate better features by adding earlier features to a learned residual through shortcut connections. The network also employs batch normalization, which reduces the variance between layers and helps in rapid training of the model.
The network splits an input fundus image into S × S non-overlapping grid cells, as shown in Fig. 1a. The grid cell which contains the center of the OD is responsible for the detection of the OD in the image. Each grid cell outputs three bounding boxes, each with a confidence score which gives the certainty that the OD is present in that cell. The final output is the bounding box with the maximum score.
The network output is a tensor (see Fig. 1c) with dimension S × S × (B × (5 + C)), where S × S, B and C represent the grid configuration, the number of bounding boxes predicted per grid cell, and the number of classes under consideration for a given problem, respectively. Each box is described by five attributes, namely t0, x, y, w and h, as shown in Fig. 1d. Here t0 is a confidence measure which indicates the presence of the OD in the predicted box and is obtained using logistic regression. The confidence metric is zero if there is no OD present in a particular grid cell. Bounding boxes with low OD probability are removed and those having a high Intersection over Union (IoU) with the ground truth are retained. The excess boxes present in the image are removed using non-maximum suppression, leaving behind the one which has the maximum overlap with the ground truth. The coordinates (x, y) represent the center of the bounding box, whose width and height are specified as w and h, respectively. Here C represents the object class, which is the OD present in the fundus image. The tensor parameters are thus set as S = 13, B = 3 and C = 1; accordingly, the output tensor dimension is 13 × 13 × 18. In general, the detection kernel has a dimension of 1 × 1 × (B × (5 + C)), yielding a kernel size of 1 × 1 × 18.
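The filtering steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it applies the usual inference-time form of non-maximum suppression, which ranks boxes by confidence because no ground truth is available at test time. The function names and both threshold values (0.5 and 0.45) are assumptions introduced here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2.0, box_a[1] - box_a[3] / 2.0
    ax2, ay2 = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx1, by1 = box_b[0] - box_b[2] / 2.0, box_b[1] - box_b[3] / 2.0
    bx2, by2 = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0.0 else 0.0


def non_max_suppression(detections, conf_threshold=0.5, iou_threshold=0.45):
    """Keep the most confident boxes and drop overlapping, weaker ones.

    detections: list of (confidence, box) pairs.
    Both thresholds are hypothetical values chosen for illustration.
    """
    # Step 1: discard boxes with low confidence.
    candidates = [d for d in detections if d[0] >= conf_threshold]
    # Step 2: visit boxes from most to least confident.
    candidates.sort(key=lambda d: d[0], reverse=True)
    kept = []
    for confidence, box in candidates:
        # Step 3: keep a box only if it does not overlap a box already kept.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((confidence, box))
    return kept


if __name__ == "__main__":
    boxes = [
        (0.90, (100.0, 100.0, 50.0, 50.0)),
        (0.80, (102.0, 101.0, 50.0, 50.0)),  # near-duplicate of the first box
        (0.30, (300.0, 300.0, 40.0, 40.0)),  # below the confidence threshold
    ]
    print(non_max_suppression(boxes))
```

Since a fundus image contains a single OD, taking the first element of the returned list corresponds to selecting the bounding box with the maximum score.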
Detection is performed by drawing a bounding box over the OD region in the retinal fundus image. Input images are resized to a resolution of 416 × 416 pixels before they are applied to the network. The image is then down-sampled up to the 81st layer to obtain a feature map of size 13 × 13. The network detects the optic disc at three different scales. The detection is made using a 1 × 1 detection kernel, giving a feature map of 13 × 13 × 18 after the 82nd layer, which is considered the output layer of the first-scale detection. The feature map is then convolved with filters of variable size and up-sampled to obtain a feature map of size 26 × 26. The second-scale detection is obtained from the 94th layer with a dimension of 26 × 26 × 18. Similarly, the third-scale detection is derived after up-sampling from the 106th layer, with a feature map size of 52 × 52 × 18.
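The three output shapes quoted above follow from simple arithmetic, sketched below. The strides of 32, 16 and 8 are an assumption inferred from the reported grid sizes (416/32 = 13, 416/16 = 26, 416/8 = 52); the function names are illustrative only.

```python
def yolo_output_shapes(input_size=416, strides=(32, 16, 8),
                       boxes_per_cell=3, num_classes=1):
    """Return the (rows, cols, depth) shape of each detection scale.

    The strides are assumed values inferred from the grid sizes in the text.
    """
    depth = boxes_per_cell * (5 + num_classes)  # B x (5 + C)
    return [(input_size // s, input_size // s, depth) for s in strides]


def total_predicted_boxes(shapes, boxes_per_cell=3):
    """Total number of bounding boxes predicted over all scales."""
    return sum(rows * cols * boxes_per_cell for rows, cols, _ in shapes)


if __name__ == "__main__":
    shapes = yolo_output_shapes()
    print(shapes)
    print(total_predicted_boxes(shapes))
```

With B = 3 and C = 1 this gives a depth of 18 at every scale, and summing 3 boxes per cell over the 13 × 13, 26 × 26 and 52 × 52 grids gives the total number of candidate boxes per image.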
Bounding box prediction in detection tasks can be accomplished in different ways. One approach is the direct prediction of the box parameters. However, this procedure may lead to unstable gradients while training the network. The second approach is to use bounding box priors, called anchor boxes, and predict offsets to these predefined boxes. In the YOLOv3 architecture, each cell in the grid has three anchors, which results in the prediction of three bounding boxes. Since we use three-scale detection, a total of nine anchor boxes are required for the entire network. The bounding box responsible for detecting the OD is the one with the highest IoU between its anchor and the ground truth box.
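For reference, the offset parameterization given in the YOLOv3 paper [36] can be written as below. The symbols are taken from that paper and are not used elsewhere in this text: (t_x, t_y, t_w, t_h) are the raw network outputs, (c_x, c_y) is the top-left corner of the grid cell, (p_w, p_h) are the anchor width and height, and σ is the logistic sigmoid. We assume this standard form applies unchanged here.

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x, \\
b_y &= \sigma(t_y) + c_y, \\
b_w &= p_w \, e^{t_w}, \\
b_h &= p_h \, e^{t_h}.
\end{aligned}
```

The sigmoid confines the predicted center to the responsible grid cell, while the exponential scales the anchor dimensions, which is what keeps the offsets better behaved during training than direct box regression.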
Good candidate anchor boxes are computed from the training images using K-means clustering, and corrections are made to these prior boxes to match the dimensions of the ground truth bounding box. The distance between a box and a cluster is evaluated with the IoU metric, using the width and height of the bounding boxes as features. Since the network predicts output at three scales with three bounding boxes per grid cell, a total of nine anchors are needed for prediction. Hence we set nine clusters to obtain the required number of anchor boxes.
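The clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the distance is taken as 1 − IoU between (width, height) pairs aligned at a common center, which is the convention commonly used for YOLO anchors, and the initialization, iteration limit and function names are choices made here.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (width, height) pairs placed at a common center."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchors using 1 - IoU as distance.

    boxes: list of (width, height) tuples from the annotated training images.
    Returns the k centroids sorted by area.
    """
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumed initialization: random boxes
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest (1 - IoU) distance is the same as largest IoU.
            nearest = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[nearest].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[i])  # leave an empty cluster unchanged
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


if __name__ == "__main__":
    sample = [(10, 10), (11, 11), (12, 10), (100, 100), (105, 98), (98, 102)]
    print(kmeans_anchors(sample, k=2))
```

In the setting described above, `boxes` would hold the annotated OD box sizes and `k` would be 9; the three smallest anchors are then typically assigned to the finest grid and the three largest to the coarsest.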
While training the network we use stochastic gradient-based optimization with the Adam optimizer [24] to minimize the cost function and update the weights during back-propagation. The learning rate α of the optimizer is fixed after experimenting with different values, and the chosen value is a compromise between learning too fast and learning too slowly. The class prediction is done with a binary cross-entropy (BCE) loss using a logistic (sigmoid) activation. The loss function for the model is the sum of the regression loss for the bounding box and the cross-entropy loss for the classification.
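The classification term can be illustrated with a small sketch of the binary cross-entropy loss applied to a sigmoid output. This shows only the standard definition for a single prediction; the function names and the clipping constant are assumptions made here, and the full detection loss used in this work contains further terms.

```python
import math


def sigmoid(z):
    """Logistic activation mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(target, logit, eps=1e-12):
    """BCE for one prediction: target is 0 or 1, logit is the raw network score.

    eps is an assumed clipping constant that avoids log(0).
    """
    p = min(max(sigmoid(logit), eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


if __name__ == "__main__":
    print(binary_cross_entropy(1.0, 0.0))   # uncertain prediction
    print(binary_cross_entropy(1.0, 6.0))   # confident and correct
    print(binary_cross_entropy(1.0, -6.0))  # confident and wrong
```

The loss is small when the predicted probability agrees with the label and grows rapidly for confident but wrong predictions.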
In this work, we have trained and validated our proposed approach on 10 different data sets. Transfer learning is well established as an effective way to train a deep neural network [44]. We initialized our system with convolutional weights pre-trained on the COCO data set [27]. Subsequently, we re-trained the model using the training images from our data set. The MESSIDOR data set, which contains 1200 retinal fundus images, is used for training in this study. From the whole data set we randomly selected 840 images (70% of the total) for training the model. The performance evaluation is done on the remaining 360 images (30% of the total). Under strict supervision by a medical expert, we manually annotated the optic disc by placing bounding boxes on the OD area.
The training process is done in two stages. In the first stage training is performed using the pre-trained weights of COCO data set. The first stage training is done by freezing all the layers except the last three layers of the network. The learning rate is set as 0.001 and training is continued till the loss reaches a low value. In the second stage the modified weights obtained from the first stage are further used to train the entire network. The whole model is trained with a learning rate of 0.0001, which is less than that used in the first stage.
The proposed approach is implemented using the Keras deep learning library with Google’s TensorFlow backend. This work was carried out on an Intel Xeon E5-1620 processor with a clock speed of 3.60 GHz, 32 GB of RAM, and an NVIDIA Quadro M5000 GPU. In addition, we used Python 3.6.1 as the programming language on the Ubuntu 16.04 operating system.
Evaluation
The performance of the network is evaluated using the logic illustrated in Fig. 3. According to this logic, if the confidence score of a detected bounding box is less than a set threshold, the corresponding predicted OD is treated as undetected. Only the predicted ODs with a confidence score greater than the set threshold are passed to the second level of detection. The second-level OD detection checks the IoU measure by comparing the predicted box with the ground truth. To compare the performance of our system with similar studies, we compute the sensitivity (Se), or recall, and the accuracy (Acc) in terms of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) as:
Se = TP / (TP + FN)
Acc = (TP + TN) / (TP + TN + FP + FN)

Evaluation approach adopted for the proposed system.
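The two-level check and the two performance measures can be sketched as follows. This is an illustrative reading of the evaluation logic, not the exact procedure of Fig. 3: the threshold values (0.5 for both confidence and IoU), the outcome labels and the function names are all hypothetical, and the measures use the standard definitions of sensitivity and accuracy.

```python
def evaluate_prediction(confidence, iou_with_truth,
                        conf_threshold=0.5, iou_threshold=0.5):
    """Classify one prediction with a two-level check.

    Both thresholds and the returned labels are hypothetical.
    """
    # Level 1: reject predictions whose confidence is below the threshold.
    if confidence < conf_threshold:
        return "undetected"
    # Level 2: require sufficient overlap with the ground truth box.
    if iou_with_truth < iou_threshold:
        return "mislocalized"
    return "detected"


def sensitivity(tp, fn):
    """Se = TP / (TP + FN)."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    print(evaluate_prediction(0.92, 0.81))
    print(evaluate_prediction(0.30, 0.81))
    print(sensitivity(tp=90, fn=10))
    print(accuracy(tp=90, tn=5, fp=3, fn=2))
```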
During back-propagation while training the network, the weights are updated using the Adam optimization technique with the parameters α, β1, β2, and ε set to 0.0001, 0.9, 0.999, and 10⁻⁸, respectively. We used transfer learning to accelerate the training of the network and hence obtain faster convergence. The weights obtained using the COCO data set on the YOLOv3 architecture are transferred and used as the initial weights of the network. The weights are then modified by training the network with the images from our data set, as retinal fundus images are entirely different from the images in the COCO data set. We performed early stopping, a regularization technique used to avoid over-fitting while training the network. The validation loss is monitored regularly and training is stopped if there is no improvement for 10 consecutive epochs. The method also employs batch normalization in almost every layer, without any biases, together with the Leaky ReLU activation. Batch normalization stabilizes the learning process and reduces the number of training epochs needed to train the network. The residual skip connections present in the network mitigate the vanishing gradient problem that may occur while training the network. The use of up-sampling and concatenation preserves the fine-grained features of objects present in the image and helps in their detection.
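The early stopping rule described above can be sketched in a few lines. This is a framework-independent illustration with a hypothetical function name; it assumes that "improvement" means a strictly lower validation loss than the best value seen so far.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop, or None.

    val_losses: validation loss recorded at the end of each epoch.
    patience: number of consecutive epochs without improvement that is tolerated.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Assumed criterion: any strictly lower loss counts as improvement.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


if __name__ == "__main__":
    history = [1.0, 0.8, 0.7] + [0.75] * 10
    print(early_stopping_epoch(history, patience=10))
```

In this example the best loss occurs at epoch 3, so training stops once ten further epochs pass without a lower value.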
In Fig. 4 we show the predicted bounding box and the OD detected in images obtained from the various data sets used in our study. In Table 2 we summarize the test results. To show the effectiveness of our pipeline, in Table 3 we include the results reported in existing works, which were obtained using different public data sets. Only the results reported in [8] and [5] can be directly compared. The overall accuracy of our system is 99.17% for the nine publicly available data sets and 100% for the tested private data set.

Bounding box detected in representative images taken from various data sets. First row images labelled (a)-(d) are input images. The corresponding output images are labelled as (e)-(h). The last row images labelled (i)-(l) are the sample cases where no bounding box is predicted in the images.
Results obtained with various data sets used in our study
Performance comparison with existing approaches
The detection of the OD plays a vital role in the diagnosis of many eye diseases, especially in glaucoma prediction. In the diagnosis of glaucoma, the optic cup is also analysed, in addition to the OD, for disease prediction. In the presented work, OD detection is used as a pre-diagnostic step for ROP prediction in newborn babies. Accurate detection and removal of the OD is essential for this diagnosis. In infant retinal images, OD detection is very challenging because of the wide variability present in the images and because the retina is at a premature stage of development. Hence most studies rely on manual intervention for this procedure [31, 39]. In the proposed work, the infant images obtained from KIDROP are tested and an accuracy of 100% is achieved. We also evaluated the performance on nine publicly available data sets, on which an overall accuracy of 99.17% was obtained. To our knowledge, this work analyses the largest number of data sets for OD detection. The comparison with state-of-the-art methods shows that OD detection using the YOLOv3 architecture attains the highest performance. We did not perform any form of data augmentation to increase the size of the training data, and no form of image enhancement was used to emphasise the OD in the image.
In this work, we have developed a fully convolutional, YOLOv3-based system which detects the location of the OD in fundus images. The high generalization ability of the network is attributable to the fact that detection is made at three different scales. This multi-scale approach supports the detection of ODs of variable size in the retinal images. For a given input image, the network predicts 10647 bounding boxes over the three scales. For each scale, every cell in the grid predicts three bounding boxes using three anchors. The dimensions (width, height) of the nine anchors required for the three-scale prediction, obtained using K-means clustering, are (121,134), (135,141), (140,152), (155,164), (193,207), (200,230), (216,221), (229,238) and (243,260).
Eye morphology studies [21] show that, for the human eye, the mean OD diameter is 1.76 ± 0.31 mm horizontally (0.91-2.61 mm) and 1.92 ± 0.29 mm vertically (0.96-2.91 mm). Since the inter-human variation in OD dimension is not very large, we expect that the proposed system using the YOLOv3 architecture with multi-scale detection can predict the OD location in any data set with high accuracy. Our system outperforms the existing state-of-the-art methods in terms of accuracy in detecting the OD.
The clinical capabilities and applicability of this technique are under trial on a large image data set obtained from KIDROP. The detection of the OD in infant retinal images often needs a skilled expert for identification and is hence time consuming. Moreover, the images are obtained directly from infant eye screening programs, so the effect of noise in the images is very high and conventional image processing techniques do not work satisfactorily [31, 39]. In this work we did not perform any form of image enhancement to reduce the effect of noise, which emphasizes the effectiveness of this framework in OD detection. Future development of the present work could include a user-friendly graphical user interface (GUI), which would give clinicians a better means to diagnose ROP and also reduce their manual labour.
We propose a highly effective computer-aided OD detection system for retinal fundus images, which achieves very high accuracy (99.25% over the 10 data sets). The main contributions of this work are three-fold: firstly, a novel pipeline for OD detection based on the existing YOLOv3 architecture; secondly, the use of transfer learning to initialize the model with weight parameters learned on a large-scale data set; and thirdly, a new data set for studies concentrated on ROP diagnosis in preterm retinal fundus images.
Footnotes
We will make it publicly available very soon.
