Abstract
The human eye is affected by various diseases, including choroidal neovascularization (CNV), diabetic macular edema (DME) and age-related macular degeneration (AMD). This work aims to design an artificial intelligence (AI) based clinical decision support system for eye disease detection and classification that helps ophthalmologists detect and classify CNV, DME and drusen more effectively from Optical Coherence Tomography (OCT) images depicting different tissues. The system is built from deep learning convolutional neural network (CNN) models combined with long short-term memory (LSTM) networks. Nine different image captioning systems are compared, and the best model for assisting ophthalmologists is selected by performance analysis. Using OCT images enhanced by a generative adversarial network (GAN), the image captioning model designed with DenseNet201 and LSTM achieves an overall accuracy of 0.969, a positive predictive value of 0.972 and a true-positive rate of 0.969. The corresponding values for the Xception with LSTM image captioning model are 0.969, 0.969 and 0.938, respectively. These two models therefore yield the best performance and have the potential to assist ophthalmologists in making optimal diagnostic decisions.
Introduction
In ophthalmology, Optical Coherence Tomography (OCT) plays a vital role in the detection and classification of eye diseases for further assessment and treatment. OCT is an imaging technique that uses coherent light to capture high-resolution images of biological tissues, and ophthalmologists rely on it heavily to image the retina. The retina functions much like the film in a camera. OCT images can be used to diagnose many retina-related eye diseases. OCT is an emerging biomedical imaging technology that offers non-invasive, real-time, high-resolution imaging of highly scattering tissues [1]. It is widely used by ophthalmologists for diagnostic imaging of the structure of the anterior eye and the retina. The motivation of this research is to automatically detect and classify three different eye diseases present in OCT images of the human eye by using an image caption generator built from pre-trained CNN models and LSTM.
Diabetic retinopathy damages the small blood vessels of the retina. The leakage of fluid into the retina may lead to swelling of the surrounding tissue, including the macula. The macula is located in the centre of the light-sensitive tissue called the retina. Diabetic Macular Edema (DME) is a common eye disease that causes irreversible vision loss in diabetic patients. AMD is further classified into Early AMD and Late AMD. The earliest detectable changes associated with AMD occur at the interface between the macular retina and the underlying layer of connective tissue known as the choroid. These changes are known as DRUSEN. Early AMD detection is based on the identification of DRUSEN. DRUSEN is an eye problem caused by aging and macular degeneration. It destroys sharp central vision. The presence of DRUSEN is the symptom of Early AMD. The neovascular tissue associated with exudative AMD is referred to as Choroidal Neovascularization (CNV). CNV is an eye problem caused by the creation of new blood vessels in the choroid layer of the eye, which leads to sudden deterioration of central vision. CNV is the symptom of Late AMD.
Literature survey
Both the natural language processing (NLP) and computer vision (CV) communities are involved in the research and development of image captioning systems. A neural and probabilistic framework that combines a CNN with a special form of recurrent neural network (RNN) is implemented in [2] to produce end-to-end image captioning. The authors tested their model on three benchmark datasets and obtained improved performance on standard evaluation metrics. A Long-Term Recurrent Merge Network (LRMN) model is proposed in [3], in which the language model merges the image features at each step. This model not only improved the accuracy of image captioning but also provided a better description of the image. Experimental results demonstrated a promising enhancement in image captioning with this model.
Previous studies have also combined region-based attention and scene-specific contexts to improve performance over state-of-the-art models. A methodology that uses image captioning techniques to generate a simple sequence explaining the abnormal contents in fundus images was implemented in [4]. The experimental results showed that the accuracy of diagnosis was more than 90%. A CNN-based language model that captures the long-range dependencies in the history of words, which are important for image captioning, was introduced in [5]. A deep learning based approach using CNN and RNN for image caption generation was presented in [6]. In another study, Yang Fan et al. (2018) experimented with the Long-term Recurrent Merge Network (LRMN) model to merge the image features at each step via a language model [3].
Experimental results showed that this LRMN model gave a promising improvement in the caption generation process for the image. An experimental approach for retrieving a sequence of natural sentences for an image stream is proposed in [7]. The authors also demonstrated that their approach outperformed other state-of-the-art image captioning methods for text sequence generation, using both quantitative measures and user studies via Amazon Mechanical Turk.
An image captioning framework was proposed in [8] with a self-retrieval module as training guidance, which encourages the generation of discriminative captions. The authors demonstrated the effectiveness of the proposed retrieval-guided method on two different datasets and showed its superior captioning performance. A model that draws on computer vision and machine translation for image captioning was designed in [9]. This model generates natural sentences that describe the images. The experimental results showed that this model is frequently accurate in generating descriptions. A novel attention framework called attentive linear transformation (ALT), which generates image captions automatically, was implemented experimentally in [10]. The authors conducted extensive experiments on two different datasets to demonstrate the superiority of their model over other existing state-of-the-art models.
A coarse-to-fine multi-stage prediction framework for image captioning was demonstrated in [11]. It is composed of multiple decoders, each of which operates on the output of the previous stage, producing increasingly refined image descriptions. The authors also evaluated their approach and reported state-of-the-art performance. An approach based on a convolutional neural network (CNN) that performed on par with LSTM (or RNN) based approaches on image captioning was tested in [12]. From the experimental results, the authors concluded that the CNN-based approach was superior to the RNN-based approach.
A model that used a generator to produce textual descriptions of the visual content of a video and a discriminator to control the accuracy of generation was implemented in [13]. A brief survey of technical aspects and methods for generating descriptions of images was presented in [14]. The authors concluded their work by discussing open challenges and future directions for solving them.
An image captioning model for cross-domain learning and prediction was designed in [15]. The authors trained the model on images of one domain and used the same trained model to predict captions for images belonging to another domain. They used CNN-LSTM to develop their model. The statistical structure of LSTM-generated language was compared in [16] to that of written natural language and to that of text produced by Markov models of various orders.
The authors found that both LSTM-generated and Markov-generated texts can exhibit features similar to real ones in their word-frequency statistics and entropy measures. The research findings of the present work can be used in a clinical decision support system to assist the ophthalmologist in taking the final decision when predicting the eye disease. These findings can also reduce the burden on eye specialists. Nine different pre-trained CNN models and 7000 labelled high-resolution OCT images are used for this purpose. The image caption generators designed using DenseNet201 and Xception are selected as the best models because they are more accurate than the other models.
These two best models are also tested for their image captioning skill by using super-resolution OCT/fundus images and noisy images obtained by adding Gaussian noise. The base paper used natural images for caption generation.
Dataset
Table 1 shows the distribution of images used for training, validation and testing, taken from the Kaggle dataset.
Distribution of images in the dataset
Figure 1 shows images of the different eye diseases classified by our caption generator. The features in these images differ from one disease to another. Finally, a textual description is displayed on each image to detect and classify the eye disease class.

OCT images from left to right Normal, DME, Early AMD and Late AMD.
The proposed methodology shown in Fig. 2 consists of major functional blocks, namely the training model, the validation model and the testing model. The image captioning model generates captions for OCT image classification. Deep CNN models are used for feature extraction. With their larger number of convolutional layers, CNN models are able to extract low-level, mid-level and high-level features, in contrast to conventional machine learning models. Therefore, machine learning models are not considered for feature extraction.

The proposed methodology for image caption generator.
The CNN-LSTM model is trained on partial captions to predict the next word in the sequence. The training model is designed with three components, namely the image feature extractor, the sequence processor and the decoder. The features extracted by the pre-trained CNN model act as one input to the training model. The feature is a one-dimensional vector of 4096 32-bit floating-point numbers if VGG16 is used as the pre-trained CNN model. The size of this feature vector differs between pre-trained CNN models.
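The idea of training on partial captions can be illustrated with a minimal sketch. It assumes a caption already encoded as integer token ids; the ids, the helper name `make_training_pairs`, the short `max_length` and the choice of left-padding with zeros are all illustrative assumptions, not details taken from this work.

```python
def make_training_pairs(token_ids, max_length, pad_id=0):
    """Expand one encoded caption into (padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        # Keep at most max_length tokens of the caption seen so far.
        prefix = token_ids[:i][-max_length:]
        # Left-pad so every prefix has the same fixed length (assumed convention).
        padded = [pad_id] * (max_length - len(prefix)) + prefix
        pairs.append((padded, token_ids[i]))
    return pairs


# Hypothetical encoding: 1 = start token, 2 = end token, other ids = caption words.
caption = [1, 5, 7, 9, 2]
pairs = make_training_pairs(caption, max_length=6)
for padded, target in pairs:
    print(padded, "->", target)
```

Each pair supplies one training example: the padded prefix goes to the sequence processor together with the image features, and the target is the word the model should predict next.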
A dense layer processes these features to generate a 256-element representation for each image. The sequence processor is implemented with a word embedding layer followed by a Long Short-Term Memory (LSTM) recurrent neural network layer. An input sequence with a pre-defined length of 34 words is given to the embedding layer, which uses a mask to ignore the padded values. An LSTM layer with 256 memory units follows the embedding layer in the sequence processor. Both the feature extractor model and the LSTM model produce a 256-element vector. Both models use regularization in the form of 50% dropout. This reduces overfitting to the training dataset, as this model configuration learns fast. The decoder model then merges the two vectors from the two input models, namely the feature extractor model and the sequence processor model, using an addition operation.
The merged vector is fed to a dense layer with 256 neurons, followed by a final output dense layer. The output dense layer makes a softmax prediction over the entire output vocabulary to predict the next word in the sequence. The whole model is saved to a file whenever its skill on the validation dataset has improved at the end of an epoch.
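The decoder's merge-and-predict step can be sketched in plain Python. This is a toy illustration under stated assumptions: the vectors have 3 elements instead of 256, the weights are made up, and the intermediate 256-neuron layer and dropout are omitted; `merge_by_addition`, `dense` and `softmax` are hypothetical helper names.

```python
import math


def merge_by_addition(image_vec, text_vec):
    """Element-wise sum of the image vector and the sequence vector."""
    if len(image_vec) != len(text_vec):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(image_vec, text_vec)]


def dense(vec, weights, bias):
    """Fully connected layer without activation: one score per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy inputs standing in for the two 256-element vectors.
image_vec = [0.2, -0.1, 0.5]
text_vec = [0.1, 0.9, -0.5]
merged = merge_by_addition(image_vec, text_vec)

# Made-up output layer for a vocabulary of four words.
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
bias = [0.0, 0.0, 0.0, 0.0]
probs = softmax(dense(merged, weights, bias))
next_word_id = max(range(len(probs)), key=probs.__getitem__)
print(next_word_id, [round(p, 3) for p in probs])
```

The index with the highest probability is taken as the predicted next word.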
Evaluation model
Enhanced accuracy is obtained when super-resolution versions of the test OCT images are used for generating captions. In super-resolution OCT images, most of the inner details that the trained captioning model can perceive and use are clearly visible, so the accuracy of the image captioning model improves. Once the model has been fitted on the features and text descriptions of the OCT images, i.e., the input-output pairs, it is ready to be evaluated for its skill on the holdout test data or validation data. The model is evaluated by generating descriptions for all the OCT/fundus images in the validation dataset. These predictions are evaluated with a standard metric, namely BLEU. The actual and predicted descriptions are collected and evaluated collectively using the corpus BLEU score, which summarizes how close the generated text is to the true text descriptions.
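To make the corpus BLEU score concrete, the following is a minimal sketch of the standard computation (clipped n-gram precision up to 4-grams, uniform weights, brevity penalty). It assumes one reference description per image and no smoothing, which may differ from the evaluation code used in this work; the example captions are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and uniform weights."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Invented captions for illustration only.
ref = "new blood vessels in the choroid layer".split()
hyp = "new blood vessels in the retina layer".split()
perfect = corpus_bleu([ref], [ref])
partial = corpus_bleu([ref], [hyp])
print(perfect, round(partial, 3))
```

A score of 1 means the generated text reproduces the reference exactly; the score falls as fewer n-grams are shared.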
Image caption generation model
Generating captions for new OCT images requires the trained model, the tokenizer created while encoding the text (which holds the tokens describing all the training images) and the maximum sequence length defined for the text descriptions. Next, the OCT image to be captioned is applied to the trained model so that its features can be extracted. This is done by re-defining the model using LSTM and then adding any one of the pre-trained models to it. The pre-trained models extract the features, and the extracted features are used as inputs to the trained model. After the start and end sequence tokens are removed from the generated sequence, the description containing the symptoms shown in the image, along with its class label, is produced for the new OCT image. This description will assist the ophthalmologist in classifying the type of eye disease. Captions are generated for test images belonging to the four different classes. The generated captions display the symptoms of the eye disease and the corresponding labels. Figs. 4–7 show the observations made when the image caption generator module is tested.
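The word-by-word generation described above can be sketched as a greedy decoding loop. The token names `startseq` and `endseq`, the function `generate_caption` and the canned predictor standing in for the trained model are assumptions made for illustration.

```python
def generate_caption(predict_next, max_length, start="startseq", end="endseq"):
    """Greedy decoding: repeatedly append the most likely next word."""
    words = [start]
    for _ in range(max_length):
        word = predict_next(words)
        if word is None or word == end:
            break
        words.append(word)
    # Drop the start token; the end token was never appended.
    return " ".join(words[1:])


# Hypothetical stand-in for the trained CNN-LSTM model: it replays a fixed caption.
canned = ["drusen", "deposits", "early", "amd", "endseq"]


def fake_predict_next(words):
    position = len(words) - 1
    return canned[position] if position < len(canned) else None


caption = generate_caption(fake_predict_next, max_length=34)
print(caption)
```

In a real pipeline, `predict_next` would encode the words with the tokenizer, pad them to the maximum length, and query the trained model together with the image features.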

Different steps involved in image caption generation.

Captioning generated for the NORMAL class.

Captioning generated for the CNV class.

Captioning generated for the DRUSEN class.

Captioning generated for the DME class.
The performance of the image caption generators built on nine different pre-trained CNN models is analyzed using the training loss and the validation loss, as shown in Table 2.
Comparative analysis of different pre-trained CNN models using loss
Many performance metrics were computed for performance analysis and comparison (as shown in Table 3). For super-resolution images and for images with and without additive Gaussian noise, this analysis is done only for the best trained models.
Overall performance statistics analysis for eye diseases image captioning models
Kappa, Overall Accuracy, Positive Predictive Value (PPV_Macro and PPV_Micro) and True Positive Rate (TPR_Macro and TPR_Micro) are the metrics considered. This performance analysis provides concrete evidence for the best captioning model that can be recommended to the ophthalmologist for assessing the eye disease. Gaussian noise is one of the most common sources of noise in imaging systems. Gaussian noise is generated for different values of mean or standard deviation and superimposed on the OCT image to produce a noisy OCT image. This noisy OCT image is used to identify the vulnerability of the model to noise.
Super-resolution versions of the test OCT images are created to improve the perceptual capability of the image captioning model and thereby its accuracy and other performance metrics when it is used as a classifier. A Generative Adversarial Network (GAN) is used to create the super-resolution OCT images. The different features present in a super-resolution OCT image can be visualized clearly, and hence the model can perform prediction and classification accurately. Hence, the accuracy of the model is enhanced with super-resolution images created for the testing images. Gaussian noise and speckle noise are the two sources of noise in an OCT imaging system. Gaussian noise is additive. Speckle noise is multiplicative. The speckle noise is working with the speckle noise creation and removal for image analysis.
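The difference between additive Gaussian noise and multiplicative speckle noise can be shown with a small sketch. It assumes a grayscale image stored as nested lists of intensities in [0, 1]; the function names, the Gaussian model for the speckle term, the clipping to [0, 1] and the value sigma = 0.05 are illustrative assumptions.

```python
import random


def clip(value):
    """Keep an intensity inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))


def add_gaussian_noise(image, sigma, mean=0.0, seed=None):
    """Additive noise: each pixel receives an independent Gaussian offset."""
    rng = random.Random(seed)
    return [[clip(p + rng.gauss(mean, sigma)) for p in row] for row in image]


def add_speckle_noise(image, sigma, seed=None):
    """Multiplicative noise: each pixel is scaled by (1 + Gaussian sample)."""
    rng = random.Random(seed)
    return [[clip(p * (1.0 + rng.gauss(0.0, sigma))) for p in row] for row in image]


# Tiny 2x2 stand-in for an OCT image.
image = [[0.0, 0.5], [0.5, 1.0]]
noisy = add_gaussian_noise(image, sigma=0.05, seed=0)
speckled = add_speckle_noise(image, sigma=0.05, seed=0)
print(noisy)
print(speckled)
```

Because speckle noise scales with the signal, a pixel of intensity zero stays at zero, whereas additive noise perturbs dark and bright pixels alike.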
The kappa statistic is also used to evaluate the image captioning performance of the classifier. Kappa is calculated from the observed accuracy and the expected accuracy. In essence, the kappa statistic measures how closely the class labels assigned by the caption generator designed with a pre-trained CNN model match the labels assigned as ground truth by the experts, while controlling for the agreement expected by chance, as measured by the expected accuracy. The strength of agreement of the model can be rated from the value of kappa as very good, good, moderate, fair or poor, as tabulated in Table 4.
Kappa value and the corresponding strength of agreement
The caption generators that use DenseNet201 and Xception as feature extractors have the maximum kappa value, 0.95833, with super-resolution images, which corresponds to a very good strength of agreement.
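The kappa computation from observed and expected accuracy can be sketched as follows. The confusion matrix is a hypothetical example (four classes, eight images each, one misclassification) chosen for illustration; it is not the confusion matrix of this study, and `cohen_kappa` is an assumed helper name.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: truth, columns: prediction)."""
    total = sum(sum(row) for row in matrix)
    # Observed accuracy: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Expected accuracy: agreement that would occur by chance given the marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix, not data from this study.
matrix = [
    [8, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 7, 1],
    [0, 0, 0, 8],
]
kappa = cohen_kappa(matrix)
print(round(kappa, 5))
```

Here the observed accuracy is 31/32 and the expected accuracy is 0.25, so kappa is (0.96875 - 0.25) / 0.75. The sketch does not guard against an expected accuracy of exactly 1, where kappa is undefined.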
The metrics selected for this analysis are based on the exactness (precision) and completeness (recall) of the model. The balance between these two is measured by the F1-score. The metrics not considered are less important when the trained model is used for prediction and disease classification using OCT images. Therefore, BLEU is not considered in the analysis. The overall accuracy of the model is computed by dividing the number of correct predictions across all classes by the total number of images across all classes used in the caption generation process. The overall accuracy is equal to 0.9375 for the caption generators designed with DenseNet201 and Xception as feature extractors. It increases to 0.96875 for these models when they are tested using super-resolution images. This is further evidence for suggesting these image caption generators to the ophthalmologist to support clinical decisions in classifying the different types of eye diseases.
Performance analysis using PPV_Macro
A macro-average computes the performance metric independently for each eye disease class and then takes the average over all the classes, so all classes are treated equally. A micro-average aggregates the contributions of all classes to compute the average metric. In multi-class classification, the micro-average is preferable if there is class imbalance. It is observed that the caption generators designed with DenseNet201 and Xception for feature extraction have the maximum value of PPV_Macro, equal to 0.94. This value increases to 0.97222 when super-resolution images are used for testing the model. Thus, this metric supports recommending these caption generators to the ophthalmologist for assessing the different eye disease classes.
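The difference between macro- and micro-averaging can be sketched for the positive predictive value. The confusion matrix below is a hypothetical four-class example, not the study's data, and `macro_micro_ppv` is an assumed helper name.

```python
def macro_micro_ppv(matrix):
    """Return (macro, micro) precision from a confusion matrix (rows: truth, columns: prediction)."""
    n = len(matrix)
    true_pos = [matrix[i][i] for i in range(n)]
    # Column sums: how many images were predicted as each class.
    predicted = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    per_class = [tp / p if p else 0.0 for tp, p in zip(true_pos, predicted)]
    macro = sum(per_class) / n                 # average of per-class precisions
    micro = sum(true_pos) / sum(predicted)     # pooled counts over all classes
    return macro, micro


# Hypothetical confusion matrix, not data from this study.
matrix = [
    [8, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 7, 1],
    [0, 0, 0, 8],
]
ppv_macro, ppv_micro = macro_micro_ppv(matrix)
print(round(ppv_macro, 5), round(ppv_micro, 5))
```

In single-label multi-class classification, the micro-averaged precision equals the overall accuracy, because every image contributes exactly one prediction to the pooled counts.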
Performance analysis using TPR_Macro and TPR_Micro
The True Positive Rate is defined as the number of images correctly identified as positive (true positives) divided by the sum of the true positives and the false negatives, taken over the images in all the classes. The True Positive Rate is equal to 0.96875 for the image captioning system designed with DenseNet201 and equal to 0.9375 when the model is designed with Xception as the feature extractor. This further supports recommending these caption generators for making correct clinical decisions.
Individual performance statistics analysis for each eye disease class
Image caption generators with minimum validation loss can be compared with each other to select the best caption generator based on the benchmark results. For each of the four classes, six different metrics are computed: accuracy, F1-score, F2-score, precision or positive predictive value (PPV), specificity and sensitivity.
Our experiments have demonstrated that the two caption generators designed using the pre-trained CNN models DenseNet201 and Xception perform well in predicting the symptoms for all four classes. Tables 5–8 show the individual performance statistics for each eye disease class.
Individual performance statistics analysis of the pre-trained CNN models for normal class
Individual performance statistics analysis of the pre-trained CNN models for late AMD class
Individual performance statistics analysis of the pre-trained CNN model for DME class
Individual performance statistics analysis of the pre-trained CNN models for early AMD class
All the caption generator models except the one designed with InceptionV3 as feature extractor predict the Normal class well, with a prediction accuracy equal to or greater than 0.9375. The four caption generation systems designed with ResNet50, DenseNet121, DenseNet169 and Xception as feature extractors have a caption generation accuracy equal to 1.0. Accuracy increases further when super-resolution images are used for generating captions. It is found experimentally that the LSTM caption generation systems designed with Xception and DenseNet201 for feature extraction generate captions for all the classes almost correctly.
Performance analysis using F1-score
The F1-score is defined as the harmonic mean of precision and recall. Therefore, the F1-score takes both false positives and false negatives into account. If the number of images used in the training process is unevenly distributed across classes, the F1-score is more useful than accuracy. Four models have an F1-score equal to 1 when predicting the captions for the Normal class. The two models that use DenseNet201 and Xception for feature extraction have an F1-score very close to one when predicting captions for all four classes. The F1-score is a significant metric for comparing model performance when the number of training images is not the same in each class.
Performance analysis using F2-score
The F2-score is defined as a weighted harmonic mean of precision and recall that gives recall more weight than precision. Therefore, the F2-score takes both false positives and false negatives into account. If the number of images used in the training process is unevenly distributed across classes, the F2-score is more useful than accuracy. The F2-score is very effective in classification when the cost of a false negative is much higher than the cost of a false positive. The four image caption generation models with an F2-score equal to 1 when generating captions for the Normal class are designed with the pre-trained CNN models ResNet50, DenseNet121, DenseNet169 and Xception, which extract the features used during the training of the LSTM. The image caption generation systems based on DenseNet201 and Xception are found to have an F2-score close to one.
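Both the F1-score and the F2-score are instances of the general F-beta score, sketched below. The precision and recall values are illustrative numbers, not results from this study, and `f_beta` is an assumed helper name.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: a model that is precise but misses half the positives.
precision, recall = 0.8, 0.5
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(round(f1, 4), round(f2, 4))
```

With recall lower than precision, the F2-score falls below the F1-score, which reflects the heavier penalty the F2-score places on false negatives.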
Performance analysis using precision or positive predictive value (PPV)
Precision is calculated by dividing the number of correct positive predictions by the total number of positive predictions. It is also called positive predictive value (PPV). The best precision value is 1.0, whereas the worst value is 0.0. Four caption generator models have a precision equal to 1 when predicting the image caption description for the Normal class. The two neural caption generator models designed with DenseNet201 and Xception as feature extractors have a precision very close to one when predicting the captions for all four classes.
Performance analysis using specificity
Specificity (SP) is calculated by dividing the number of correct negative predictions by the total number of negatives. It is also called true negative rate (TNR). The best value for specificity is 1.0, whereas the worst value is 0.0. All caption models have a specificity equal to or greater than 0.8 when predicting the image description for all the classes. The four caption generator models based on ResNet50, DenseNet121, DenseNet169 and Xception have a specificity equal to one when predicting the image captions for the Normal class. The two caption generation systems implemented with DenseNet201 and Xception for feature extraction have a specificity either equal to 1 or close to 1.
Performance analysis using sensitivity
Sensitivity (SN) is calculated by dividing the number of correct positive predictions by the total number of positives. It is also called recall (REC) or True Positive Rate (TPR). The best value for sensitivity is 1.0, whereas the worst value is 0.0. Seven caption generation models have a sensitivity equal to 1 when predicting captions for the Normal and DRUSEN classes. The two caption generator models that use DenseNet201 and Xception as feature extractors have a sensitivity either equal to one or very close to one when predicting the textual description for all four classes. These experimental results favour recommending these two models to eye specialists for predicting the different eye disease classes.
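The per-class definitions above can be sketched by treating one class as positive and all other classes as negative. The confusion matrix is a hypothetical four-class example rather than the study's data, and `per_class_metrics` is an assumed helper name.

```python
def per_class_metrics(matrix, k):
    """One-vs-rest accuracy, precision, sensitivity and specificity for class k."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                               # class k predicted as k
    fn = sum(matrix[k]) - tp                        # class k predicted as something else
    fp = sum(matrix[r][k] for r in range(n)) - tp   # other classes predicted as k
    tn = total - tp - fn - fp                       # everything else
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical confusion matrix (rows: truth, columns: prediction), not study data.
matrix = [
    [8, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 7, 1],
    [0, 0, 0, 8],
]
class_2 = per_class_metrics(matrix, 2)
class_3 = per_class_metrics(matrix, 3)
print(class_2)
print(class_3)
```

In this example the single error lowers the sensitivity of the class that was missed and the precision and specificity of the class that was wrongly predicted, while both classes keep the same one-vs-rest accuracy.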
Conclusions and future works
The performance of the best model deteriorates after additive Gaussian noise with σ = 0.002 is superimposed. The performance of the model improves with super-resolution images synthesized by generative adversarial networks. Based on the performance metrics, the image caption generators that use DenseNet201 and Xception provide the best performance in predicting the captions for the eye diseases correctly. Therefore, they can be used in the design of a clinical decision support system to assist the ophthalmologist. This work can be extended by fusing hand-crafted features with pre-trained CNN deep features and using the fused features to design a new training model for caption generation.
Acknowledgment
This research was partially supported by SSN College of Engineering. We thank our colleagues from SSN College of Engineering who provided insight and expertise that greatly assisted the research, although they may not agree with all of the interpretations/conclusions of this paper.
