Abstract
Automatic captioning of images has been explored extensively in the past 10 to 15 years. It is one of the elementary problems in Computer Vision and Natural Language Processing and has a vast array of real-world applications. In this survey, we study the different approaches used for image caption generation in chronological order, starting from basic template-based caption generation and ending with neural networks combined with external world knowledge. We review existing models in detail, highlighting the methodologies involved and how they have improved over time. We give an overview of the standard image datasets and of the evaluation measures developed to assess the quality of generated captions. Apart from the basic benchmarks, we also note speed and accuracy improvements across the different approaches. Finally, we investigate further possibilities in automatic image caption generation.
Introduction
The last two decades have seen great improvements and enthusiasm in the fields of Computer Vision and Natural Language Processing. The main goals of these fields are comprehending images and videos and studying and generating automatic text descriptions. These problems have their roots in AI and ML, which are themselves at an early stage considering their vast potential. Moreover, these fields have largely been investigated separately, which makes it important to study their combined scope and to investigate the possibilities further.
The image captioning problem has long been viewed as challenging because it requires identifying the different segments of an image correctly, identifying the connections between them and finally weaving them together in a syntactically and semantically correct manner to describe the most salient aspects of the image. To accomplish this task, we need technology that can fully understand the image and can apply external world knowledge to generate the description that is most true to the image. In other words, the task is equivalent to mimicking the human ability to compress the most salient features of an image into the most descriptive, yet true-to-the-image, description. This seems a herculean task compared with typical computer vision systems that simply identify what is present in the image.
However, with most of the world's population connecting online, the need for an image-caption generation system has grown to levels that demand attention. Large amounts of multi-modal data arrive online every second, in the form of millions of tagged photographs on Facebook and in cloud storage. The image classification and story-making feature of Google Photos is a good example. Also, with so much video content, the need for automatic subtitles has risen. Automatic image captioning will help organize the millions of unstructured and unclassified images on the Internet and, on humanitarian grounds, will help visually impaired people perceive images. The need for image-caption generation therefore cannot go unnoticed, and many research papers and conferences on the subject are appearing throughout the world.
At the heart of this technology is the power of neural networks, which makes it seem like magic. While the Convolutional Neural Network (CNN) layers perform the primary task of identifying the salient features of an image, the Recurrent Neural Network (RNN) makes it possible to construct (almost) meaningful captions from the identified visual features. We will see how the technology advanced from basic tree parsing to neural networks that use hundreds of millions of features, which make the results good enough to be compared with human-generated captions. We briefly unpack these technologies one by one in this survey.
The aim of this survey is to present, analyze and compare the research of the last decade in this ground-breaking field within a few pages. We chronologically identify the technological developments, the approaches involved, their drawbacks and how well they perform on various metrics, and we investigate the future scope of automatic image caption generation.
Evolution of image captioning techniques
The paper is structured chronologically: we start from the basic techniques, discuss their methodology and shortcomings, and then move to a newer approach that solved the problems of the previous one. Figure 1 shows the timeline we have developed to keep track of the advancements in this field and to better understand the need for, and basis of, newer models.
Timeline indicating the progress in automated image captioning.
The years 2005 to 2010 saw the birth of the major approaches that combined computer vision with mapping detected objects to words and weaving them into a meaningful or stylish description. Related work includes that of Li et al. [5], Farhadi et al. [9] and Li et al. [6]. This period also saw approaches that added knowledge from the corresponding domain which plain computer vision cannot see (Yang et al. [16]), as well as approaches that reused existing captions of locally similar (Ordonez et al. [17]) and globally similar (Farhadi et al. [7]) images to avoid having to structure self-made sentences.
In 2010 and 2011, some work still focused on primitive techniques for text generation. For example, Kulkarni et al. [15] used template-based description generation, and Farhadi et al. [7] grouped the computer vision detections into triplets and then used them to generate descriptions based on templates. Li et al. [14] generated descriptions by merging the detected objects using proper semantic relationships.
The year 2012 was a highlight in the era of automated image captioning techniques: ImageNet classification was performed with a deep convolutional neural network (CNN) that had 60 million parameters, about half a million neurons and five convolutional layers.
From 2013 onwards, techniques using recurrent networks started gaining momentum. Mao et al. [28], Vinyals et al. [48], Kiros et al. [25], Fang et al. [30] and Chen et al. [47] used a recurrent neural network, while Kiros et al. [23] used a feedforward one. Kiros et al. [25] also proposed a "multimodal embedding space" built from a vision model and an LSTM that encodes text.
The years 2014 and 2015 saw the evolution of rich feature hierarchies for accurate, high-quality object detection and of networks going deeper with convolutions. Fast progress in object detection [18, 49, 31, 32, 33] led to captioning models that labeled multiple regions of an image.
The years 2015 and 2016 saw the evolution of "very deep convolutional networks" for large-scale image recognition, multimodal R-CNN and various other techniques [24, 48, 34, 35, 36, 37, 38, 39, 40]. We discuss in detail 8 prominent categories of approaches belonging to different time periods, starting with one of the very first techniques and ending with a very recent one.
The template based technique is discussed using the work of Margaret Mitchell et al. [21] which was published in 2012.
Problems faced till now
Template-based caption generation was used earlier by Kulkarni et al. [15]. The system of Yang et al. [16] substitutes probable prepositions, verbs and interjections after parsing the UIUC Pascal-VOC dataset (Farhadi et al. [9]), choosing head nouns and their dependents using a maximum likelihood calculated by taking the ratio of their individual logs. However, template-based techniques can only generate predictable, consistent sentences, not novel captions. Ordonez et al. [17] matched the query image against a much larger set of existing captioned photographs, followed by local reordering. Although natural, these captions are not true to the image: they mainly describe similar images and may miss unique features of the query image.
Overview
Midge uses syntactic knowledge in the form of the probability distribution of the words that should appear after a given sequence of words. The generator uses constraints to filter the noisy output of the vision system and generates syntactic trees that describe what computer vision detected in the image.
Tree generated using Midge approach during the tree growth process.
For training, 700,000 images with their descriptions were taken from the Flickr dataset used in Ordonez et al. [17]. The descriptions were normalized before parsing, and parsing was done with the Berkeley parser (Petrov [12]). Once a head noun was selected, probabilities were calculated for determiners (the, a, an) and pre-nominal modifiers/adjectives in order to formulate a description.
Head nouns were identified from the detections of the vision system, and physical objects were distinguished using WordNet (Miller [1]). At most 3 objects were kept in a single-sentence description. Caption generation was treated as the problem of growing a syntactically and semantically informed tree from the detected object nouns. Tree growth was achieved through lexicalized syntactic derivations anchored on the head nouns detected above.
A three-step generation process was followed (Reiter and Dale [3]). Content determination grouped and then ordered the object nouns, generated their local subtrees and filtered irregular detections. Micro-planning generated full syntactic trees around the detected noun objects. In the surface realisation step, modifiers were selected and classified as post-nominal or pre-nominal and the final outputs were chosen. The system grew multiple trees and then chose the best one as the result. Contexts (nouns) for adjectives were weighted using Point-wise Mutual Information (PMI), and for any adjective only the best 1000 nouns were selected.
In micro-planning, fully grown trees are generated by taking the intersection of the subtrees created in content determination. Because the nouns are ordered, subtrees surrounding a noun in position 1 are merged directly with subtrees surrounding a noun in position 2. In the surface realisation stage, the system chooses the single most probable tree from all generated trees and removes the mark-up to produce a final string. Different strings may be generated depending on the specifications from the user. The final string is then the one with the most words.
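The PMI weighting of adjective contexts can be illustrated with a minimal sketch. The co-occurrence pairs, the function names and the use of base-2 logarithms are assumptions made for illustration; they are not taken from [21].

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Return {(adjective, noun): PMI} from a list of (adjective, noun) co-occurrences."""
    pair_counts = Counter(pairs)
    adj_counts = Counter(adj for adj, _ in pairs)
    noun_counts = Counter(noun for _, noun in pairs)
    total = len(pairs)
    table = {}
    for (adj, noun), count in pair_counts.items():
        p_joint = count / total
        p_adj = adj_counts[adj] / total
        p_noun = noun_counts[noun] / total
        # PMI compares the joint probability with what independence would predict
        table[(adj, noun)] = math.log2(p_joint / (p_adj * p_noun))
    return table


def top_nouns(table, adjective, k=1000):
    """Keep only the k nouns with the highest PMI for the given adjective."""
    scored = [(noun, score) for (adj, noun), score in table.items() if adj == adjective]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [noun for noun, _ in scored[:k]]


# Hypothetical toy corpus of (adjective, noun) co-occurrences
pairs = [("blue", "sky"), ("blue", "sky"), ("blue", "car"),
         ("furry", "dog"), ("red", "car"), ("furry", "cat")]
table = pmi_table(pairs)
print(top_nouns(table, "blue", k=1))
```

With a real corpus, `k=1000` would correspond to the cut-off described above; the toy example uses `k=1` only to keep the output short.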
Evaluation metrics
Human judgments on a 5-point Likert scale, collected using Amazon's Mechanical Turk (Amazon, 2011), were used as evaluation metrics. The system was also evaluated against the Kulkarni et al. system, the Yang et al. system and human-generated descriptions on the same images. The criteria rated were grammar, main aspects, correctness, order and human likeness. Results were analysed using the non-parametric Wilcoxon Signed-Rank test, which compares the median values of different systems.
Dataset
Training dataset: 700,000 images with their associated descriptions from the Flickr dataset in Ordonez et al. [17]. Testing dataset: 840 PASCAL images.
Results
Midge performed better than all earlier automatic approaches on the criteria of correctness and order. It also performed better than Yang et al. on how close its sentences were to human-generated ones.
Figure 3 shows captions generated by this method.
Captions generated as in [21] L to R: The bus by the road with a clear blue sky; People with a bottle at the table; A person in black with a black dog by potted plants.
This technique is discussed using the work of Kiros, Salakhutdinov, and Zemel – Unifying visual semantic embeddings with multimodal neural language models [25] which was published in 2014.
Problems faced till now
Descriptions generated by earlier strategies were machine-like and failed to match the fluency of captions written by humans. The BLEU and ROUGE [22] evaluation measures were unreliable and did not match human judgments.
Overview
The encoder (LSTM) ranks captions and images and learns sensible scoring functions, and the decoder (SC-NLM) optimises these scoring functions as a way of generating and scoring new descriptions from the learnt representations.
Model/Methodology
Sentences were encoded using Long Short-Term Memory (LSTM) recurrent neural networks [2]. Image features from a deep CNN were projected into the embedding space of the LSTM hidden states. Joint image-sentence embeddings were learnt by minimizing a pairwise ranking loss, so that the model learnt to rank images and their descriptions. The decoder was a structure-content neural language model (SC-NLM), which disentangled the structure of a sentence from its content and was conditioned on the distributed representations produced by the encoder. Sampling from the SC-NLM generated sensible image captions, so the decoder could generate new captions from scratch.
The problem of ranking images and captions was used as a proxy for generation. Optimising this task should improve generation as well, because any generation system needs a scoring function to assess how well a caption and an image match.
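A pairwise ranking loss of the kind mentioned above can be sketched as follows. The margin-based hinge form, the cosine similarity score, the margin value of 0.2 and the toy vectors are assumptions chosen for illustration; the exact formulation used in [25] may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(image_vecs, caption_vecs, margin=0.2):
    """image_vecs[i] and caption_vecs[i] form a matching pair; all other pairs are contrastive."""
    n = len(image_vecs)
    scores = [[cosine(img, cap) for cap in caption_vecs] for img in image_vecs]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # image i should score its own caption higher than caption j
            loss += max(0.0, margin - scores[i][i] + scores[i][j])
            # caption i should score its own image higher than image j
            loss += max(0.0, margin - scores[i][i] + scores[j][i])
    return loss


# Hypothetical 2-D embeddings: perfectly aligned pairs give zero loss
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]
print(pairwise_ranking_loss(images, aligned_captions))
print(pairwise_ranking_loss(images, swapped_captions))
```

The loss is zero only when every matching pair outscores every mismatched pair by at least the margin, which is what makes it usable as a training signal for ranking.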
Evaluation metrics
Median rank (Med r) and Recall@K (R@K).
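Both retrieval metrics (Med r and R@K) can be computed from the rank at which the ground-truth item is retrieved for each query. The sketch below assumes 1-based ranks and uses made-up rank values; the function name is hypothetical.

```python
from statistics import median


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """ranks[i] is the 1-based position of the ground-truth item for query i."""
    # R@K: fraction of queries whose ground truth appears in the top K results
    recall = {k: sum(1 for r in ranks if r <= k) / len(ranks) for k in ks}
    # Med r: median rank of the ground truth over all queries (lower is better)
    return recall, median(ranks)


ranks = [1, 3, 7, 12, 2]  # hypothetical ranks for five queries
recall, med_r = retrieval_metrics(ranks)
print(recall, med_r)
```

Higher R@K and lower Med r both indicate better retrieval.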
Dataset
Flickr30K and Flickr8K.
Results
The methods described in this paper generated descriptions of higher quality than the then state-of-the-art methods, which were based on composition-based strategies. The authors also worked on attention-based models that could learn to align parts of captions to images and use these alignments to determine where to attend next, thus dynamically modifying the conditioning vectors of the decoder. Figure 4 shows captions generated by this method.
Captions generated as in [25] L to R: A parked car while driving down the road; A little boy with a bunch of friends on the street; There is a cat sitting on a shelf.
The approach is discussed using Mind's Eye: A Recurrent Visual Representation for Image Caption Generation [47].
Problems faced till now
Many previous papers experimented with projecting image features and their associated descriptions into a common space [6, 7, 13], which is useful for image search or for ranking image captions. Various approaches were used to learn these projections: Kernel Canonical Correlation Analysis (KCCA) [22], recursive neural networks [29] and deep neural networks [24]. While these techniques projected both visual features and the associated semantics into a joint embedding, they could not perform the inverse projection. That is, they could not generate new sentences or visual depictions from those joint embeddings.
Overview
This paper explored the bi-directional mapping between images and their sentence-based descriptions using a recurrent neural network. A new recurrent visual memory was introduced that automatically learned to remember long-term visual concepts to help in both sentence generation and visual feature reconstruction.
Model/Methodology
To accomplish the bidirectional mapping, a set of latent variables
1. Part of the model needed for generating sentences from visual features and vice versa. 2. Sentences to Visual Features. 3. Visual Features to sentence.
Language model: the system handled vocabularies of 3,000 to 20,000 words using a word classing approach [11].
Learning: the Back-Propagation Through Time (BPTT) algorithm was used to update the weights online. The sigmoid function was used as the activation function for all units.
The recurrent hidden state
Evaluation metrics
Perplexity (PPL), BLEU, METEOR (METR), human subjects, Recall@1, 5, 10.
Dataset
PASCAL 1K, Flickr 8K and 30K, MS COCO.
Results
Figure 6 shows captions generated by this method.
Captions generated as in [47] L to R: A train is stopped at a train station; A group of people standing on a snow covered slope; A group of people that are standing in front of a building.
Captions generated as in [48] L to R: A red motorcycle parked on the side of the road; A group of young people playing a game of frisbee; Two dogs play in the grass.
This technique is discussed using the work of Vinyals, Toshev, Bengio, and Erhan, Show and tell: A neural image caption generator [48], which was published in 2014.
Problems faced till now
Text generation in previous works was rigid and excessively handcrafted. It couldn’t create descriptions of previously unobserved arrangements of objects, even if separate objects were detected in the training set.
Overview
An end-to-end system was proposed that combined recent sub-networks for object detection and caption generation. The network was trained with stochastic gradient descent and described the content of an image in well-formed English sentences.
Model/Methodology
It was based on a neural, probabilistic architecture that produces a caption given an image as input, applying the principle of translation to generate the description (similar to how text is translated between two languages).
The model first uses the following formula to maximize the probability of the correct description:
Here,
A constant length hidden state ht expresses the number of words we consider upto
This non linear function
The loss was given by the sum of the negative log likelihood of the correct word at each time step, as given below, and was minimized with respect to all parameters of the LSTM network, word embeddings
This paper used beam search with a beam of size 20. The authors also experimented with greedy search, by setting the beam size to 1, and found that it degraded the results by an average of 2 BLEU points (the other technique explored was sampling).
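The difference between beam search and greedy search (a beam of size 1) can be seen in a minimal sketch. The toy next-word model, its probabilities, the `<end>` token and all function names are hypothetical and stand in for the trained network of [48]; the handling of finished hypotheses is also simplified.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, end_token="<end>"):
    """Return the highest-scoring token sequence found with the given beam size."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # keep only the beam_size best partial captions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


# Hypothetical toy model: the locally best first word leads to a worse caption
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.55, "cat": 0.45},
    ("the",): {"dog": 0.9, "cat": 0.1},
}


def toy_next_log_probs(tokens):
    probs = TOY_MODEL.get(tuple(tokens), {"<end>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_next_log_probs, beam_size=1, max_len=5))  # greedy search
print(beam_search(toy_next_log_probs, beam_size=2, max_len=5))
```

In this toy example, greedy search commits to "a" because it is the most probable first word, while a beam of size 2 keeps "the" alive and ends with a caption of higher overall probability.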
Captions generated as in [28] L to R: A square with burning street lamps and a street in the foreground; Tourists are sitting at a long table with a white table cloth and are eating; A blue sky in the background.
Evaluation metrics
BLEU-4, METEOR, CIDEr. Ranking metric: Recall@k (@1 and @10).
Dataset
Pascal VOC 2008. Flickr8k, Flickr30k, MSCOCO, SBU.
Results
NIC performed better than various other approaches, e.g. Tri5Sem, Im2Text, BabyTalk and the previous state of the art (SOTA), and was quite close to the ground truth. Figure 7 shows captions generated by this method.
Overview of the fully convolutional localization network for dense captioning. The localization layer proposes regions and smoothly extracts a batch of corresponding activations using bilinear interpolation.
This technique is discussed using the work of Mao et al. – Explain images with multimodal recurrent neural networks [28] which was published in 2014.
Problems faced till now
Earlier works extracted features for sentences and images and mapped them into a common semantic embedding space. These strategies addressed tasks such as retrieving sentences for a given image or retrieving images for a given sentence, but only when these already existed in the database. They lacked the flexibility to caption new images containing previously unseen objects and scenes.
Overview
The model contains two sub-networks: a deep Recurrent Neural Network (RNN) for sentences and a deep Convolutional Neural Network (CNN) [20] for images. The two sub-networks interact with each other in a multimodal layer, and the complete model is known as the m-RNN model. It outputs the probability distribution of the next word given the previous words and the image, and image descriptions are generated by sampling from this distribution.
Model/Methodology
There are 6 layers in each time frame: the input word layer, two word embedding layers, the recurrent layer, the multimodal layer where the connection is made, and the softmax layer.
where PPL is the Perplexity of the sentence and
Sentence generation started from the start sign ##START##: the model calculated the probability distribution of the next word given the previous words and the image. The next word could then be picked by sampling from this distribution. In practice, the word with the maximum probability was picked instead, since this performed slightly better than sampling. The picked word was then fed back into the model, and the process continued until the model output the end sign ##END##.
For image retrieval, the top-ranked images were returned, where images were ranked by the perplexity of the query sentence given each image. Sentence retrieval used a normalized probability for each sentence.
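The generation loop can be sketched as follows. The toy next-word table, the `max_len` cut-off and the function names are hypothetical and only stand in for the trained m-RNN; the two start and end signs are the ones named in the text.

```python
import random

START, END = "##START##", "##END##"


def generate_caption(next_word_probs, image, max_len=20, sample=False, rng=None):
    """Generate words one at a time until the end sign is produced."""
    words = [START]
    rng = rng or random.Random(0)
    while len(words) < max_len:
        probs = next_word_probs(words, image)
        if sample:
            # draw the next word from the predicted distribution
            candidates = list(probs)
            word = rng.choices(candidates, weights=[probs[w] for w in candidates])[0]
        else:
            # pick the word with the maximum probability
            word = max(probs, key=probs.get)
        if word == END:
            break
        words.append(word)  # the picked word becomes input for the next step
    return words[1:]


def toy_model(words, image):
    """Hypothetical stand-in for the network: conditions only on the last word."""
    table = {
        START: {"a": 0.7, "two": 0.3},
        "a": {"dog": 0.8, "cat": 0.2},
        "two": {"dogs": 1.0},
        "dog": {"runs": 0.6, END: 0.4},
        "dogs": {"run": 1.0},
    }
    return table.get(words[-1], {END: 1.0})


print(generate_caption(toy_model, image=None))
print(generate_caption(toy_model, image=None, sample=True))
```

The `image` argument is unused by the toy model; in a real system it would carry the CNN features that condition every prediction.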
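Perplexity-based image retrieval can be illustrated with a minimal sketch. It assumes the standard definition of perplexity (2 raised to the average negative base-2 log-probability per word); the per-word probabilities and image identifiers are made up, and in a real system they would come from the model conditioned on each candidate image.

```python
import math


def perplexity(word_probs):
    """word_probs: model probability of each word of the sentence, in order."""
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** avg_neg_log2


def rank_images(query_word_probs_by_image):
    """Sort image ids from lowest to highest perplexity of the query sentence."""
    return sorted(query_word_probs_by_image,
                  key=lambda img: perplexity(query_word_probs_by_image[img]))


# Hypothetical per-word probabilities of one query sentence under two images
probs_by_image = {"img_a": [0.1, 0.2], "img_b": [0.5, 0.5]}
print(rank_images(probs_by_image))
```

A lower perplexity means the model finds the query sentence more likely given that image, so such images are ranked first.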
Evaluation metrics
Sentence perplexity and BLEU scores (B-1, B-2, B-3), R@K (K
Dataset
Flickr 8K; Flickr 30K; IAPR TC-12.
Results
This was the first work to incorporate an RNN into a deep multimodal architecture.
Figure 8 shows captions generated by this method.
Object Detection (R-CNN)
Localization Layer
Caption Generation Model (RNN)
This technique is discussed using the work of Johnson, Karpathy and Li, DenseCap: Fully Convolutional Localization Networks for Dense Captioning [46], which was published in 2015.
Problems faced till now
Predictions from earlier region-based CNN-RNN models did not include context outside each region. They were also inefficient, as each region had to be forwarded independently. The localization layer was proposed to address these difficulties.
The sequence of images shows Dense image captioning task using a model that generates rich and dense captions.
Overview
The paper brings together work on object detection, image captioning and the processing of particular regions of an image.
Model/Methodology
The Fully Convolutional Localization Network for dense captioning was based on CNN-RNN models for image captioning but also included a differentiable localization layer that could be inserted into the neural network to enable localized predictions for the region proposals.
CNN consisted of 13 layers of 3
Positive proposals were those matched to ground-truth regions, and their confidence scores were pushed up during training, while the confidence scores of negative proposals were pushed down.
The recognition network processed the features of each region from the localization layer. The features of each region were flattened into a vector and passed through fully connected layers. The position and confidence score of each proposed region were then refined.
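One common way to decide whether a proposal is matched is to threshold its intersection over union (IoU) with the ground-truth boxes. The sketch below assumes that criterion; the thresholds 0.7 and 0.3, the box format and the function names are illustrative assumptions rather than details confirmed by the text above.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_proposals(proposals, ground_truth, pos_thresh=0.7, neg_thresh=0.3):
    """Label each proposal by its best overlap with any ground-truth box."""
    labels = []
    for box in proposals:
        best = max(iou(box, gt) for gt in ground_truth)
        if best >= pos_thresh:
            labels.append("positive")  # confidence is pushed up in training
        elif best < neg_thresh:
            labels.append("negative")  # confidence is pushed down in training
        else:
            labels.append("ignored")   # ambiguous overlap, assumed to be skipped
    return labels


ground_truth = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (50, 50, 60, 60), (0, 0, 10, 6)]
print(label_proposals(proposals, ground_truth))
```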
Evaluation metrics
METEOR, mean Average Precision (AP).
Dataset
MSCOCO, YFCC100M, Visual Genome (VG) Dataset.
Results
The FCLN model performed better than the Region RNN in both ranking and localization under all metrics: the median rank improved from 7 to 5, and localization recall from 0.5IoU to 0.153.
Figure 10 shows how captions are generated by this method.
Semantic Alignment Models (R-CNN and B-RNN)
Description Generation Model (M-RNN)
This technique is discussed using the work of Karpathy and Li – Deep Visual-Semantic Alignments for Generating Image Descriptions [41] which was published in 2015.
Problems faced till now
The focus of most of the works so far has been on condensing elaborate visual depictions in an image to just one single sentence. However, this requirement is nothing but an unnecessary restriction.
Overview
This approach consists of two separate models: an alignment model that infers the latent alignment between contiguous groups of words in a sentence and the image regions they correspond to, and a second model that is trained on the inferred correspondences.
Flowchart of proposed description generation model.
Captions generated as in [41] L to R: A man in black shirt is playing guitar; Two young girls are playing with lego toy; Construction worker in orange safety vest is working on road.
Model/Methodology
To detect objects in an input image, the alignment model used a Region Convolutional Neural Network (RCNN). The CNN was pre-trained on the ImageNet dataset and fine-tuned on the 200 classes of the ImageNet Challenge. In addition to the whole image, the top 19 detected locations were used. The objects were identified based on the pixels inside each bounding box. A Bidirectional Recurrent Neural Network (BRNN) was used to compute word representations in the sentence. An Image Sentence Score,
The ultimate goal was to associate snippets of text, rather than single words, with each bounding box. Therefore, a Markov Random Field (MRF) with latent alignment variables was used to produce image regions annotated with segments of text (e.g. "wooden table" for table, "messy pile of documents" for documents).
The M-RNN model, trained on the region-level annotations from the previous model, took as input the image I and a sequence of input vectors. It then computed a sequence of hidden states and, from them, a sequence of outputs using a recurrence relation, thereby generating dense descriptions of images.
Evaluation metrics
BLEU-1, 2, 3, 4, METEOR, CIDEr. Ranking metrics: Recall@1, 5, 10 and Med r.
Dataset
Flickr8k, Flickr30k, MSCOCO.
Results
This model used very few hard-coded assumptions to formulate captions of individual image regions using the conventional dataset of images and sentences. Figure 12 shows captions generated by this method.
Object Detection (CNN)
Description Generation Model (RNN)
External Knowledge
This technique is discussed using the work of Wu et al. [42], Image Captioning and Visual Question Answering Based on Attributes and External Knowledge, which was published in 2016.
Problems faced till now
The previous papers did not take external knowledge into account when generating captions. The importance of an intermediate attribute prediction layer was also neglected by almost all previous work.
Overview
An intermediate attribute prediction layer is introduced into the predominant CNN-LSTM framework, which was neglected by almost all previous work.
Image Caption Generated: A man with bat readies to swing at the pitch while the umpire looks on. External Knowledge: A pitch is a place used to play various sports such as cricket. The umpire is present to review the match.
Model/Methodology
Attributes predicted by the CNN-based attribute prediction model were used to generate the captions for the image. In image captioning, the gaps in the caption templates were filled with the attributes predicted by the model. The caption generation model was trained by maximizing the probability of the correct description of the image. The semantic attribute prediction value
Evaluation metrics
BLEU, METEOR and CIDEr.
Dataset
Flickr8k, Flickr30k and Microsoft COCO.
Results
Att-Region CNN
Examples where the attribute-Region-CNN 
PASCAL 1K [43]
The images in this dataset are a subset of the images collected for the PASCAL VOC Challenge. It has 20 categories, and for each category 50 images are chosen at random, along with descriptions generated through Amazon's Mechanical Turk.
Flickr8K & 30K [10]
The Flickr 8K and 30K datasets contain 8,000 and 31,783 images respectively, gathered from Flickr. The majority of these images show human beings taking part in various activities. Every image has 5 sentences describing it. The datasets are split into training, validation and test sets following standard splits.
Image captioning techniques and their scores on IAPRTC12 dataset
Image captioning techniques and their scores on PASCAL dataset
Image captioning techniques and their scores on Flickr8k dataset
Image captioning techniques and their scores on Flickr30k dataset
Image captioning techniques and their scores on MSCOCO dataset
The Microsoft COCO dataset contains 82,000 training images and 40,000 validation images, each accompanied by 5 descriptive sentences. The images are sourced from Flickr by searching for common object categories, and they generally contain a variety of objects together with important contextual information.
IAPR TC-12 [4]
This dataset contains 20,000 still natural pictures taken at various locations around the world. The pictures cover categories such as sports, actions, cities, shapes, animals, people and many other aspects of modern life. Every image has captions in 3 languages: English, German and Spanish. The 20,000 images are of high resolution, and strict selection rules were followed when choosing images for this dataset.
VISUAL GENOME (VG) [52]
It is a dataset built by researchers mainly from Stanford and Yahoo. It is a knowledge base that represents a sustained effort to relate image concepts to their natural language descriptions in a structured manner. It is currently the largest dataset of image-based questions and answers, with approximately 1.7 million question-answer pairs, and every image is supplemented with an average of 17 question-answer pairs.
Evaluation and ranking metrics
We have prepared tables, one per dataset and in chronological order, showing the approaches used for description generation and their corresponding scores on a variety of metrics such as BLEU-1, 2, 3, 4, METEOR and CIDEr [45]. Compiled in this way, the results let us see clearly how image captioning techniques have evolved over the years and how much the evaluated scores have improved.
BLEU (bilingual evaluation understudy) [43] and METEOR (Metric for Evaluation of Translation with Explicit Ordering) [44] are metrics generally used to evaluate machine translation output. R@K is the recall rate of the ground-truth sentence or image among the top K retrieved results. Some cells in the tables are left empty because the corresponding scores were not calculated. Tables 1, 2, 3, 4, 5, 6 and 7 compare the scores of different techniques on the various datasets in chronological order.
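To make the BLEU metric concrete, the following is a simplified sentence-level sketch: clipped (modified) n-gram precision combined by a geometric mean and multiplied by a brevity penalty. It omits smoothing and corpus-level aggregation, so its values will not match the scores reported by the surveyed papers; the function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each n-gram counts at most as often as in any reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # brevity penalty uses the reference length closest to the candidate length
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat is on the mat".split()
print(modified_precision("the the the the".split(), [reference], 1))
print(bleu(reference, [reference]))
```

Clipping is what prevents a degenerate caption such as "the the the the" from receiving a perfect unigram precision.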
Image captioning techniques and their recall scores on Flickr8k dataset
Image captioning techniques and their recall scores on Flickr30k dataset
The possibility of developing intelligent computer programs that can correctly interpret and caption photos has intrigued machine learning experts for decades. However, significant progress in this field has been made only in the last few years. We have come a long way from template-based techniques to deep learning with attention models, but many challenges remain to be overcome.
One of the challenges is the prudent use of an attention mechanism that describes the individual components of an image, rather than just the image as a whole, in order to create a holistic description of the complete picture. The challenge here is to incorporate more knowledge than just what the model is trained on. This includes understanding the context of the image and incorporating world knowledge while generating captions, just as humans do. Only in the last year have a few researchers started working on this issue, and significant improvements have not yet surfaced. Better performance can be expected from choosing a superior image encoder, fine-tuning it and setting up ensemble models.
The performance of a system can be judged better with better evaluation and ranking metrics. Most of the approaches discussed above use BLEU scores to compare their results with the ground truth, which makes this metric a de facto benchmark. It has some obvious advantages, but a number of shortcomings have also been noticed. BLEU cannot deal with languages that lack word boundaries, and it is biased towards shorter translations. Other metrics that combine automation with human effort, such as HyTER, could be used, but they are still only an approximation.
Future scope
As we have just seen, the field of image captioning has been researched for decades. However, there is still immense scope left to explore. Although most recent studies have been fairly successful in describing images correctly, human-level accuracy and descriptiveness still seem far off.
This all boils down to one thing: knowledge. When thinking of a caption, humans use the entire knowledge base they have been acquiring for years. The emotions, the additional world knowledge and the power of expression that humans possess are therefore enough for any human to beat a machine at this task, which is so simple for humans.
The need of the future is therefore an excellent knowledge base and the hardware power to train a model to use that entire knowledge feasibly, so that the machine can develop a full multi-dimensional context and answer any open-ended question about the image, beyond the attributes simply detected by a computer vision system.
This is why the major search engine corporations, such as Google and Microsoft (Bing), are best placed to turn their enormous databases into knowledge bases and realize the future of this technology. Microsoft's "CaptionBot" is an excellent example of this initiative: it uses emotion recognition, computer vision and, most importantly, the power of Bing to give impressive results.
Conclusion
We classified and discussed 8 major approaches used for image captioning in the order in which they were developed. We discussed how and why each approach evolved to solve the shortcomings of the previous one. We then explained each approach in detail with the help of a particular study and, lastly, compared the results of the experiments conducted so far using popular metrics such as BLEU, METEOR and CIDEr. We were able to observe clearly how much the scores have improved.
Acknowledgments
A very sincere thanks to Shubham Thakkar, Saumya Gupta and Shubham Singh who helped us throughout the formulation of this survey paper. This would not have been possible without their constant support.
