Abstract
Purpose:
Diabetic retinopathy (DR) is a microvascular complication of diabetes mellitus (DM). Standard of care for patients with DM is an annual eye examination or retinal imaging to assess for DR, the latter of which may be completed through telemedicine approaches. One significant issue is poor-quality images that prevent adequate screening and are thus ungradable. We used artificial intelligence to enable point-of-care (at time of imaging) identification of ungradable images in a DR screening program.
Methods:
Nonmydriatic retinal images were gathered from patients with DM imaged during a primary care or endocrinology visit from September 1, 2017, to June 1, 2021. The Topcon TRC-NW400 retinal camera (Topcon Corp., Tokyo, Japan) was used. Images were interpreted by 5 ophthalmologists for gradeability, presence and stage of DR, and presence of non-DR pathologies. A convolutional neural network with Inception V3 network architecture was trained to assess image gradeability. Images were divided into training and test sets, and 10-fold cross-validation was performed.
Results:
A total of 1,377 images from 537 patients (56.1% female, median age 58) were analyzed. Ophthalmologists classified 25.9% of images as ungradable. Of gradable images, 18.6% had DR of varying degrees and 26.5% had non-DR pathology. Ten-fold cross-validation produced an average area under the receiver operating characteristic curve (AUC) of 0.922 (standard deviation: 0.027, range: 0.882 to 0.961). The final model exhibited similar test set performance with an AUC of 0.924.
Conclusions:
This model accurately assesses gradeability of nonmydriatic retinal images. It could be used for increasing the efficiency of DR screening programs by enabling point-of-care identification of poor-quality images.
Introduction
Diabetic retinopathy (DR) is a microvascular complication of diabetes mellitus (DM). DR develops over time, progressing from early stages of nonproliferative DR to more advanced, vision-threatening stages that include proliferative DR (PDR) and diabetic macular edema (DME). 1 Epidemiologic studies have observed that ∼1 in 3 people with DM have DR, with 6.96% and 6.81% having progressed to the point of having PDR or DME, respectively. 2 Despite the widespread nature of DR, many patients with DM are unaware of their risk of DR. 3 Blindness from DR is highly preventable, with studies showing that up to 98% of DR-related blindness is preventable through a combination of glycemic control, photocoagulation therapy, vitrectomy surgery, and anti-vascular endothelial growth factor intraocular injections. 4–6
The most recent DR screening guidelines from the International Council of Ophthalmology and American Diabetes Association, released in 2018, state that screening should include a visual acuity examination and a retinal examination. 1 Retinal examination should be completed through one of (a) direct or indirect ophthalmoscopy or slit lamp biomicroscopic examination, or (b) retinal fundus photography (including any of the following: wide-field to 30°, mono or stereo, dilated or undilated). 1 The guidelines also specify that retinal photography may be completed through telemedicine approaches. The characteristic lesions of DR that may be detected include microaneurysms, intraretinal hemorrhages, venous beading, intraretinal microvascular abnormalities, hard exudates, and retinal neovascularization. 1 DM patients should undergo repeat screening every 1–2 years if no DR or DME is observed, and they should undergo more frequent screening if signs of DR or DME are detected. 1
Only 64.8% of DM patients in the United States receive annual DR screening. 7 Disparities in screening rates have also been observed in relation to educational status, income, race, immigration status, health insurance status, and rural community residence. 8–14
One significant barrier to consistent DR screening is access to eye care professionals, particularly in resource-limited settings. To address this issue, many DR screening programs have transitioned to telehealth approaches using photography-based screening, which is a cost-effective alternative to in-person examination by an ophthalmologist. 15,16 In particular, nonmydriatic (undilated) retinal imaging has been observed to have relatively similar sensitivity (78–98%) and specificity (86–90%) for detecting DR compared with the sensitivity (84–92%) and specificity (92–98%) for a dilated retinal examination by a trained ophthalmologist. 17 However, one major pitfall of telemedicine-based DR screening is poor-quality images, which can cause multiple issues. Poor-quality images may lead to lower sensitivity and specificity for detecting DR than has been observed in studies, ultimately risking under- or overdetection of DR. In addition, poor-quality images lead to wasted time and resources due to the need for repeat office visits and additional imaging to satisfy screening guidelines.
Furthermore, poor-quality images may delay care in a typical telehealth-based screening workflow. A poor-quality image will likely not be identified (particularly if its quality is borderline) until it is ultimately evaluated by an ophthalmologist, long after the patient has left the primary care imaging appointment. Reliable, automated identification of poor-quality images at the time of imaging would reduce this risk of inaccurate diagnosis and inefficient patient screening. In addition, the real-time feedback may improve the skill of the photographer.
In recent years, there have also been numerous efforts to use artificial intelligence (AI) to aid in a variety of imaging-oriented tasks within clinical ophthalmology. 18 –28 Many of the most successful approaches have utilized deep learning. 29 Deep learning uses representation-learning methods with multiple layers of abstraction to ultimately learn highly complex detection or classification tasks. 30 The first layer of a deep learning method may learn simple tasks such as detecting the presence or absence of edges, while subsequent layers built upon increasing levels of abstraction will learn higher level tasks such as recognition of specific objects or shapes. Deep learning is notable in that these layers are learned through a learning procedure, rather than through direct human intervention. 30 Deep learning is particularly well-suited to learning highly complex structures in high-dimensional data, such as those required for image classification tasks. 30
In this study, we used deep learning to create a method that could enable point-of-care identification of poor-quality images in a telemedicine DR screening program.
Methods
The study was approved by the Institutional Review Board (IRB) of Northwestern University, which exempted the study from written consent due to its observational and retrospective nature. Nonmydriatic retinal images were gathered from DM patients who underwent imaging during a primary care or endocrinology visit and whose images were sent to Northwestern Medicine for analysis between September 1, 2017, and June 1, 2021. The Topcon TRC-NW400 retinal camera (Topcon Corp., Tokyo, Japan) was used. Images were interpreted by five Northwestern ophthalmologists for gradeability (defined as whether the image was of sufficient quality to allow for reliable DR screening), presence and stage of DR, and the presence of non-DR pathologies. Graders could further specify why an image was ungradable, such as media opacity or an insufficient view of the macula. Each image was interpreted by a single ophthalmologist, who made each of these assessments based on expert opinion.
Following collection of the interpreted images, 20% (275 images) of the full set of images were randomly chosen to be set aside as an independent test set for assessing the performance of the final model. The remaining 80% (1,102) of images were randomly sampled into 10 separate sets/folds to be used for 10-fold cross-validation. Each stage of sampling was done in a stratified manner to ensure the test set and each fold had a similar distribution of gradable and ungradable images.
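As a rough illustration of the sampling described above, the following Python sketch performs a stratified 80/20 split and then assigns the training images to 10 stratified folds. It is not the authors' code: the function names, the random seed, and the round-robin fold assignment are assumptions made for this example. Only the class counts (357 ungradable and 1,020 gradable images) are taken from the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, k, seed=0):
    """Deal each class's shuffled indices round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[(offset + j) % k].append(i)
        offset += len(idx)
    return folds


# Hypothetical label vector: 1 = ungradable, 0 = gradable (counts from the study).
labels = [1] * 357 + [0] * 1020
train_idx, test_idx = stratified_split(labels, test_fraction=0.20)
folds = stratified_folds(train_idx, labels, k=10)

print("train:", len(train_idx), "test:", len(test_idx))
print("fold sizes:", [len(f) for f in folds])
print("ungradable per fold:", [sum(labels[i] for i in f) for f in folds])
```

With these counts, the sketch reproduces the 275-image test set and 1,102-image training set reported above, and every fold keeps an ungradable fraction close to the overall 25.9%.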
For inputting the images into the model, images were resized to 300 × 300 pixels, and all pixel values were rescaled to the [0, 1] range. Keras, a widely used deep learning library, was utilized with the TensorFlow backend for training and testing the model. The convolutional layers of the network were built using the Inception V3 network architecture. Network weights were preinitialized using a network trained on ImageNet, a set of 14 million images used for computer vision research. This preinitialization decreases training time due to the network already being trained to identify low-level features of common objects found in a variety of images, such as various shapes or textures.
We built additional layers on top of the convolutional layers of the network. The first layer was a global average pooling layer. Pooling layers serve to represent the features of a convolutional layer but with lower dimensionality, thus decreasing computational demands. Two fully connected layers were then added. The first consisted of 1,024 nodes and used the rectified linear unit activation function. The second consisted of two nodes representing the probability of an image belonging to each class (ungradable or gradable). This layer used the softmax activation function. A dropout layer with a probability of 0.5 was added between the two fully connected layers to prevent network overfitting. In training the model, the convolutional layers were frozen while all other layers were adjustable.
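The flow of data through these added layers can be sketched in plain Python. This is a toy illustration only, not the Keras implementation used in the study: the feature map is tiny, the hidden layer has 8 nodes rather than 1,024, the weights are random rather than trained, and all function names are hypothetical.

```python
import math
import random


def global_average_pool(feature_map):
    """Average an H x W x C feature map over height and width -> length-C vector."""
    h, w = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[i][j][c] for i in range(h) for j in range(w)) / (h * w)
        for c in range(channels)
    ]


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def head_forward(feature_map, w1, b1, w2, b2, training=False, rng=None):
    """Pooling -> dense + ReLU -> dropout (training only) -> dense + softmax."""
    x = global_average_pool(feature_map)
    x = relu(dense(x, w1, b1))
    if training:
        x = dropout(x, 0.5, rng)
    return softmax(dense(x, w2, b2))


rng = random.Random(0)
channels, hidden = 4, 8  # toy sizes; the study used 1,024 hidden nodes
feature_map = [[[rng.random() for _ in range(channels)] for _ in range(2)] for _ in range(2)]
w1 = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(hidden)]
b1 = [0.0] * hidden
w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(2)]
b2 = [0.0, 0.0]

probs = head_forward(feature_map, w1, b1, w2, b2)
print("class probabilities:", probs)
```

The pooling step collapses each channel of the feature map to a single number, which is why it lowers dimensionality, and the softmax output always yields two probabilities that sum to 1.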
The 10 folds (subsets of the training set) were utilized for 10-fold cross-validation. This procedure consists of 10 iterations of training the model and assessing model performance on a validation set. At each iteration, one of the folds is held out as the validation set and the remaining 9 folds are used to train the model. Then model performance is assessed on the held-out fold. This method ultimately allows for improved estimation of model performance, due to obtaining 10 estimates rather than only 1 estimate such as would be obtained in a single training and validation assessment. Each component of model training occurred for 50 epochs. Model training was performed using the following hyperparameters: optimizer: minibatch adaptive moment estimation (ADAM), batch size: 32, learning rate: 0.001, β1: 0.9, β2: 0.999, loss function: binary cross-entropy, and validation metric: accuracy.
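The cross-validation loop itself is independent of the model being trained. The sketch below shows its structure under simplifying assumptions: a one-feature threshold classifier and its accuracy stand in for the CNN and its AUC, the data are synthetic, and the helper names are hypothetical. The summary uses the sample standard deviation, which may or may not match the convention used for the SD reported in this study.

```python
from statistics import mean, stdev


def cross_validate(folds, train_fn, score_fn):
    """Hold out each fold in turn, train on the rest, score on the held-out fold."""
    scores = []
    for k, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != k for x in fold]
        model = train_fn(training)
        scores.append(score_fn(model, held_out))
    return scores


def train_threshold(examples):
    """Toy 'model': the midpoint between the two class means of the feature."""
    pos = [f for f, y in examples if y == 1]
    neg = [f for f, y in examples if y == 0]
    return (mean(pos) + mean(neg)) / 2


def accuracy(threshold, examples):
    correct = sum(1 for f, y in examples if (f > threshold) == (y == 1))
    return correct / len(examples)


# Synthetic (feature, label) pairs; each fold receives 5 examples of each class.
examples = [(i % 2 + ((i * 37) % 11 - 5) / 10.0, i % 2) for i in range(100)]
folds = [[] for _ in range(10)]
for i, example in enumerate(examples):
    folds[(i // 2) % 10].append(example)

scores = cross_validate(folds, train_threshold, accuracy)
print(
    f"mean={mean(scores):.3f} sd={stdev(scores):.3f} "
    f"range={min(scores):.3f} to {max(scores):.3f}"
)
```

Reporting the mean, spread, and range of the 10 per-fold scores mirrors how cross-validation performance is summarized in the Results.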
Following 10-fold cross-validation to obtain a more accurate estimate of model performance, the final model was produced by using the whole training set. This model was then evaluated on the previously unseen test set.
Model performance was assessed by calculating the area under the receiver operating characteristic (ROC) curve (AUC). ROC curves represent classification performance at all classification probability thresholds, and a higher area under the ROC curve signifies superior classification performance. For predicting whether an image was gradable versus ungradable, each convolutional neural network (CNN) calculated a probability between 0 and 1 of an image being ungradable, and the image was classified as ungradable if this probability was greater than a chosen threshold. Model performance using a probability threshold of 0.5 was evaluated. The Youden Index (sensitivity + specificity − 1) was also assessed for determining a probability threshold that could optimize the trade-off between sensitivity and specificity for the model, and model performance was evaluated using a probability threshold (0.775) that maximized the Youden Index.
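These metrics can be computed directly from predicted probabilities and reference labels. The following sketch is a minimal illustration on made-up data, not the evaluation code used in the study. It computes the AUC through its rank interpretation (the probability that a randomly chosen ungradable image receives a higher score than a randomly chosen gradable one), and it searches the observed scores for the threshold that maximizes the Youden Index. Function names and the example scores are assumptions.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score > threshold means 'ungradable'."""
    tp = fn = tn = fp = 0
    for s, y in zip(scores, labels):
        predicted = s > threshold
        if y == 1:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    return tp / (tp + fn), tn / (tn + fp)


def best_youden_threshold(scores, labels):
    """Return (threshold, J, sensitivity, specificity) maximizing J = sens + spec - 1."""
    best = None
    for t in sorted(set(scores)):
        se, sp = sens_spec(scores, labels, t)
        j = se + sp - 1
        if best is None or j > best[1]:
            best = (t, j, se, sp)
    return best


# Made-up probabilities of being ungradable (label 1 = ungradable).
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.45, 0.70, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

print("AUC:", round(auc(scores, labels), 3))
print("threshold 0.5:", sens_spec(scores, labels, 0.5))
print("Youden-optimal:", best_youden_threshold(scores, labels))
```

In this toy example, moving the threshold from 0.5 to the Youden-optimal value raises specificity without lowering sensitivity; on real data the two usually trade off against each other.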
Results
In this study, 1,377 images from 537 patients were analyzed. The median age of these patients was 58 years (range 25 to 101), and 56.1% of patients were female. Overall, 25.9% (357/1,377) of the images were deemed ungradable (Table 1; Fig. 1). The proportion of images deemed ungradable varied among the ophthalmologists (p < 0.0001), ranging from 13.3% to 31.8%. Of gradable images, 18.6% (190/1,020) showed signs of DR at various stages (81.1% mild, 11.6% moderate, 3.68% severe, and 3.68% proliferative). Of gradable images, 26.5% (270/1,020) had signs suggestive of non-DR pathology.

Fig. 1. Sample nonmydriatic fundus images that were deemed ungradable by ophthalmologist graders. Each of these images was correctly identified by the final model as being ungradable.
Table 1. Image Assessment by Ophthalmologist Graders
AUC, area under receiver operating characteristic curve.
The average AUC from 10-fold cross-validation was 0.922 (standard deviation [SD]: 0.027; range 0.882 to 0.961) (Fig. 2A). After training the final model on the full training set, the final model exhibited an AUC of 0.924 on the test set (Fig. 2B). The performance of the model was fairly consistent (AUC range 0.904 to 0.955) across the test images that had been graded by the different ophthalmologists (Table 1). Among gradable images, test set accuracy was consistent (range: 0.93 to 1.00) across images regardless of whether or not DR or non-DR pathology was present. The probability threshold for classifying an image as ungradable could be altered to fit a specific purpose or application (e.g., prioritizing sensitivity at the cost of some specificity to avoid false negatives that would result in patients having to schedule a return visit to be reimaged).

Model performance was evaluated with two probability thresholds: 0.5 and the threshold that maximized the Youden Index. At a probability threshold of 0.5, the final model exhibited a sensitivity of 0.951 and a specificity of 0.549 on the test set. The Youden Index is a composite metric for the trade-off between sensitivity and specificity; for the final model, it reached its maximum of 0.727 at a probability (of being ungradable) threshold of 0.775. When using this cutoff, the model had a sensitivity of 0.859 and a specificity of 0.867 on the test set.
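As a quick arithmetic check, the Youden Index can be recomputed from the rounded sensitivity and specificity values reported above. The short sketch below does this for both thresholds; the function name is hypothetical.

```python
def youden_index(sensitivity, specificity):
    """Youden Index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


# (sensitivity, specificity) on the test set, as reported for each threshold.
reported = {0.5: (0.951, 0.549), 0.775: (0.859, 0.867)}

for threshold, (sens, spec) in reported.items():
    print(f"threshold {threshold}: J = {youden_index(sens, spec):.3f}")
```

This gives J = 0.500 at a threshold of 0.5 and J = 0.726 at a threshold of 0.775. The small difference from the reported maximum of 0.727 is what would be expected if the reported value was computed from unrounded sensitivity and specificity, although the source does not state this.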
Discussion
We trained a CNN to accurately and rapidly assess nonmydriatic retinal fundus images from a DR telemedicine screening program for image gradeability. We used 10-fold cross-validation to obtain a better estimate of test error for our method, and saw that the AUC for each of the 10 models was high (mean AUC: 0.922, SD: 0.027) with relatively small variability. When testing the final model trained on the full training set, we saw that the AUC was similar (0.924) to what was observed during cross-validation.
These findings support the idea that the model has not overfit the data and is generalizable to other Topcon TRC-NW400-acquired DR screening images. An additional relevant observation is that our model exhibited similar levels of test set performance across all five ophthalmologists, despite there being variability in the proportion of images that each ophthalmologist had designated as ungradable (Table 1).
There are some limitations to this study and model. One limitation is the relatively low specificity (0.549) achieved with a probability threshold of 0.5. One possible reason for the lower-than-expected test set specificity could be the variation across the 10 folds and the final test set with respect to the proportion of images with DR or non-DR pathology. We also observed a nearly statistically significant relationship between the validation specificity of each fold and the proportion of both DR pathology and non-DR pathology, but this relationship was not seen for validation sensitivity. If we were to repeat this study, we might achieve superior performance by conducting stratified sampling with respect to DR and non-DR pathology, in addition to gradeability. Having a larger data set or oversampling the minority class (ungradable images) could also have improved model performance.
Despite this low specificity, the model could easily be adapted to serve different purposes by altering the probability threshold for classifying an image as ungradable. For example, in the setting of a DR screening program, it may be preferable to choose a cutoff that favors sensitivity over specificity. In this setting, a false positive of an image being labeled ungradable simply results in the patient being unnecessarily reimaged at their original appointment. In contrast, a false-negative result is the same issue this model seeks to address in the first place, with the image not being identified as ungradable until it is later evaluated by a trained ophthalmologist.
In this scenario, the patient must then schedule and attend an entirely new clinic appointment for reimaging. A more even balance of sensitivity and specificity may be achieved with a probability threshold of 0.775, which maximizes the Youden Index and resulted in a sensitivity of 0.859 and a specificity of 0.867. These are just two examples of possible probability thresholds, but the threshold for classifying an image as ungradable could be tailored further depending on the desired application.
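One way to tailor the threshold to a sensitivity-first application is to fix a target sensitivity and take the highest threshold that still meets it, which keeps specificity as high as that target allows. The sketch below illustrates this idea on made-up data. The function name, the example scores, and the target values are assumptions, and this selection rule was not evaluated in this study.

```python
def threshold_for_sensitivity(scores, labels, target):
    """Largest candidate threshold whose sensitivity is at least `target`.

    An image is called ungradable when its score is greater than the threshold.
    Returns None if no candidate meets the target.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    candidates = [0.0] + sorted(set(scores))
    best = None
    for t in candidates:
        sensitivity = sum(1 for p in positives if p > t) / len(positives)
        if sensitivity >= target:
            best = t  # sensitivity never increases with t, so keep the latest hit
    return best


def specificity_at(scores, labels, threshold):
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(1 for n in negatives if n <= threshold) / len(negatives)


# Made-up probabilities of being ungradable (label 1 = ungradable).
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.45, 0.70, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

for target in (0.95, 0.80):
    t = threshold_for_sensitivity(scores, labels, target)
    print(f"target sensitivity {target}: threshold {t}, "
          f"specificity {specificity_at(scores, labels, t):.2f}")
```

In a real deployment, the threshold would need to be chosen on validation data rather than on the test set, so that the reported test performance remains an unbiased estimate.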
Another limitation is that the model was only trained and tested on Topcon TRC-NW400-acquired images, and consequently, our analysis does not speak to how the model would perform on images from a different device. In addition, some limitations arise from inherent limitations of AI and deep learning. The accuracy of any AI model is highly dependent on the quality of the training data. With our approach of 10-fold cross-validation and comparison of such performance with the final test set performance, we feel confident that the data input into this model is of high quality.
Another limitation to our model is that “gradeability” is a subjective evaluation, even when assessed by trained ophthalmologists. While our data set contains a variety of reasons for an image being ungradable, it is reasonable to think that there may be other causes of an image being ungradable not represented in our data set. The generalizability of any AI model is limited by the spectrum of data it is trained on, and as such, it is possible the model may fall short in successfully identifying all possible instances of ungradable images.
Another limitation to AI algorithms, and CNNs in particular, is the black box problem. This is a term to describe the fact that there is little transparency in these algorithms for determining which components of the images are used to assess their gradeability. While this consideration also applies to our model and analysis, this is not a unique problem for our model or application.
We are not the first group to develop a CNN-based method for identifying low-quality retinal fundus images. Others have developed models using a variety of image sources, cameras, gradeability/image quality definitions, disease settings, and network architectures, but none has been developed with the same combination that we have utilized. 31–37 There have also been CNN-based models built using images from wide-field imaging technology (primarily Optos). 38 Developing models for wide-field images often requires consideration and handling of higher class imbalance levels due to Optos having a significantly lower (around 3–7%) ungradable rate than nonmydriatic fundus photograph devices. 39–41 Studies have observed superior CNN performance on wide-field images when artificially creating class-balanced data sets. 38
Ultimately, each of these models is best-suited to analyzing images that closely mirror the combination of these factors that were used for developing the initial model. In the context of these other models, our model has particular strengths in terms of the inclusion of a wide range of ages (25–101 years old) in our data set, the presence of non-DR ocular pathology in a sizeable portion (26.5% of gradable images) of images, and the observation that the model exhibited similar test set performance across all five ophthalmologists, despite there being statistically significant variability in the proportion of images that each ophthalmologist designated as ungradable.
Conclusions
In this study, we trained a CNN to assess the gradeability of nonmydriatic retinal fundus images from a telemedicine DR screening program. We have demonstrated that a CNN can assess the gradeability of such images with a high degree of accuracy. This and similar models can enable more efficient identification of low-quality/ungradable images at the point of care in DR screening programs, alert photographers to immediately reimage patients with ungradable images, and in this way serve as real-time feedback and potentially improve the skill of the photographer. Ultimately, this could reduce the time and resources needed for successful telemedicine DR screening.
Footnotes
Authors' Contributions
J.M.B.: Conceptualization, methodology, software, validation, formal analysis, investigation, visualization, data curation, writing—original draft, and writing—review and editing. P.J.B.: Conceptualization, formal analysis, investigation, writing—review and editing, funding acquisition, and supervision. R.G.M.: Conceptualization, project administration, and writing—review and editing.
Disclosure Statement
The authors have no conflicts of interest to disclose.
Funding Information
This research was supported by an unrestricted departmental grant from Research to Prevent Blindness.
