Abstract
Vlogs, recordings, news and sports coverage are huge sources of multimodal information that is not limited to text but extends to audio, images and video. Applications such as summary generation, image/video captioning, multimodal sentiment analysis and cross-modal retrieval require computer vision along with natural language processing techniques to extract relevant information. Information from different modalities must be leveraged in order to extract quality content; hence, reducing the gap between modalities is of utmost importance. Image-to-text conversion is an emerging field that employs encoder-decoder architectures: deep CNNs extract image features, and sequence-to-sequence models generate the text description. This paper contributes to the growing body of research in multimodal information retrieval. To generate textual descriptions of images, we performed five experiments on the benchmark Flickr8k dataset. These experiments use different architectures, including a simple sequence-to-sequence model, an attention mechanism and a transformer-based architecture. The results are evaluated using the BLEU score and show that the best descriptions are obtained with the transformer architecture. We also compare our results with the pretrained visual model vit-gpt2, which incorporates a vision transformer.
Introduction
Describing images or videos requires an understanding of semantics and context, which is challenging for machines. Most of the visual data generated is asynchronous, i.e. it is not accompanied by a description. Since a great deal of information is conveyed through images, it is important to develop systems that convert visual information into meaningful descriptions in words.
An image consists of both visual and non-visual objects. The content of the image is used to generate the visual objects, and clues from the image are used to generate the non-visual objects. Currently, understanding images requires machine learning and deep learning techniques. A high-performance system must be able to recognize visual as well as non-visual objects in the image.
Earlier systems used textual information to generate summaries. The latest in this regard is the work in [1], wherein relevance scores of sentences are extracted from multiple documents using a regression and topic model framework. Today, the large amount of data in the form of video and images creates a need for systems that extract meaningful information from these modes. Such information is useful in applications like summarization, information retrieval, highlight generation, medical imaging, robotic interaction and helping blind people navigate paths.
To extract features from video, it is essential to extract the frames. Frames can be extracted using various methods such as uniform sampling, image histograms and the Scale Invariant Feature Transform [2]. Numerous methods have been applied to extract features from images or video frames. In some methods the image is divided into sub-regions to identify important areas, while in others a bounding box captures the features of the image [3]. The information from these frames needs to be converted to text in order to obtain relevant textual information.
The process faces challenges such as noisy data, limited features, asynchronous data and poor image quality. Numerous datasets have been developed to ease the task of describing images or videos. Popular datasets for captioning include the Multi-modal Summarization Dataset v1.0 [4], Flickr 8k [5], the UIUC Pascal Dataset [6], Microsoft COCO (MS-COCO) [7], the Microsoft Research Video Description Corpus (MSVD) [8] and the MPII cooking dataset [9]. BLEU [10], METEOR [11], ROUGE [12], SPICE [13], CIDEr [14] and WMD [15] are the most commonly used evaluation metrics.
In this work, we carried out experiments to generate descriptions of images. The experiments were run on an HP DL380 Gen 10 PowerEdgeServer P.C. with an Intel Xeon Gold 6338 R CPU and an NVIDIA Quadro RTX5000 GPU with 3072 CUDA cores, 16 GB of server DDR4 memory, a memory bandwidth of 448.06 Gb/S and a 256-bit memory interface. We trained the models using TensorFlow with Python 3.9 and CUDA version 11.6. We performed five different experiments on the Flickr8k dataset in order to generate the textual descriptions.
Our contributions are as follows. Extensive experiments have been performed on the benchmark Flickr8k dataset for generating image captions. VGG16, LSTM, pretrained embeddings and transformers have been combined under different experimental settings to generate image descriptions. Based on a comparative analysis, we have investigated how the attention mechanism helps in generating fine-grained captions by focusing on different regions of the image. Results have also been obtained using a pretrained vision transformer, and the BLEU score has been used to compare the results of the different approaches.
The rest of this paper is organized as follows. Section 2 reviews earlier work and applications. Section 3 discusses the encoder-decoder approach to the task. Experimental details and hyperparameters are given in Section 4. The results of the most advanced methods are compared on the benchmark dataset in Sections 5 and 6. Section 7 concludes the paper.
Multimedia data has grown rapidly in recent years. The availability of large datasets and GPUs has made image processing easier, leading to advances in text generation from images. Pre-trained architectures such as ResNet-50 and Inception help in generating features from images. These pre-trained models reduce training time and also improve accuracy. Much work has been done in recent years in the field of image captioning. Earlier, two primary approaches to image captioning existed: retrieval-based and template-based. In the retrieval-based approach, captions were generated by exploring images similar to the given image, and an existing database was used to extract sentences for the given image [16]. The disadvantage of this approach is that the generated sentence has the same style or pattern as the captions in the image-caption database, so the dependency on the available database was a concern. In the template-based approach, semantic phrases were placed into given language constructs [17]. Although the generated sentences are syntactically correct, identifying the objects, attributes and their relationships is difficult on large-scale data. Various improvements were made in this regard. The authors of [18] addressed the challenge of weakly supervised image captioning, where captions for training images are available but not aligned with specific image regions. They introduced a method to align words in the captions with the corresponding image regions, thus improving the quality of the generated captions.
Over time, generation-based approaches became popular. These use a learned model to generate captions. Generation-based models are flexible, widely used and capable of generating novel captions for unfamiliar images. With the higher processing power of GPUs and TPUs, the majority of methods now rely on neural networks and utilize labeled data. Using RNNs, a joint model is trained and the likelihood of the descriptions is maximized. Captioning tasks often employ encoder-decoder frameworks, where a deep CNN is used for image encoding and an LSTM is used for decoding [19]. Spatial and semantic features have been concatenated in order to generate the description. The combination of model-based graph convolution networks and LSTM has also been used. The authors of [20] proposed an attention-based approach to image captioning, introducing visual attention mechanisms that align image regions with the corresponding words in the generated captions.
In [21], a hierarchical attention framework was employed for generating captions, and a policy gradient algorithm was used with a Generative Adversarial Network (GAN) for optimization. The authors of [22] used a probabilistic topic model, specifically multi-modal entity LDA, to estimate the probabilistic relationships between text and images. Multimodal learning approaches leverage convolutional networks to extract image features and recurrent neural networks to model the distribution of words based on the image features and contextual words. Deep belief networks and Boltzmann machines are popular probabilistic graphical models used in deep representations [23]. Graphical models and classical models are also used to identify text in images. Gated Recurrent Units (GRU) along with attention have been used in [24] for decoding the images, thus focusing on the important regions of the images. Additionally, [25] introduces an architecture for predicting image saliency.
RNN models in these tasks are now being replaced with the Transformer model owing to parallel training and exceptional performance. [26] put forward the CaPtiontransformeR (CPTR), a complete Transformer network designed to substitute the commonly employed CNN in the encoder section of the encoder-decoder framework. Meanwhile, [27] presented the detector-free ViTCAP model, featuring a fully Transformer architecture that incorporates grid representations without relying on regional operations. A comprehensive research survey on the multimodal summarization task, outlining its scope and challenges, is documented in [28]. Image description tasks have been utilized in various applications such as:
Encoder-decoder approach to text description of images
We have used an encoder-decoder approach to generate descriptions of images, in which deep neural networks are coupled with transfer learning. The task is to train a model that maximizes $P(C \mid I)$, where $I$ is an input image and $C$ is the caption, such that each word $C_1, C_2, \ldots, C_n$ comes from the vocabulary. Dense image features and embeddings of the previously generated words are used as input to the LSTM or RNN that produces a sequence of words. The training objective is given mathematically by:
Here, $\Theta^{*}$ represents the model parameters, $C$ represents the description of the image and $I$ signifies the input image.
The above equation is optimized over the entire training data and is given by:
Thus, during training, $(I, C)$ pairs are input and the sum of log probabilities is optimized over the entire training data. The right side of the equation consists of a sequence of words; hence, an RNN-based model is employed to generate the sequences. The variable number of words is captured by the memory state $h_t$. The memory state is updated at every time step on the basis of the previous $t-1$ states and the inputs, which in our case are the image and the predicted word. A non-linear function is used to update the state:
Here $x_t$ represents the words and the image features, which are extracted by deep convolutional networks. We use LSTMs for the function $f$ because of their ability to deal with problems such as vanishing and exploding gradients. The output of the LSTM, when passed through the softmax layer, gives a probability distribution over all the words. The predicted word is the word with the highest probability.
After considering all the parameters from Deep Convolution Network, LSTM and embedding, the overall loss (to be minimized) is given by the following equation:
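For reference, the training objective, its per-word factorization, the state update and the loss described in this section can be written compactly as below. This is a sketch under the standard encoder-decoder captioning formulation; the index ranges, the sum over training pairs and the explicit conditioning on $\Theta$ are assumptions rather than the authors' exact notation.

```latex
\begin{align}
\Theta^{*} &= \arg\max_{\Theta} \sum_{(I,\,C)} \log P(C \mid I;\, \Theta) \\
\log P(C \mid I;\, \Theta) &= \sum_{t=1}^{n} \log P(C_t \mid I, C_1, \ldots, C_{t-1};\, \Theta) \\
h_{t+1} &= f(h_t, x_t) \\
L(I, C) &= -\sum_{t=1}^{n} \log P(C_t \mid I, C_1, \ldots, C_{t-1};\, \Theta)
\end{align}
```

Minimizing the loss in the last line over all training pairs is equivalent to the maximization in the first line.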
Experiments
We have used Flickr 8k for our experiments. The Flickr 8k dataset consists of 8092 images, each with 5 captions. These are real images from day-to-day life and depict human beings and animals.
Text description of images using VGG16 and LSTM
LSTM (long short-term memory), a type of recurrent neural network (RNN), is well suited to sequence prediction tasks and is commonly employed for next-word prediction. During input processing, an LSTM retains relevant information while discarding irrelevant information. We trained the model for 50 epochs. The flow diagram of the description generator using VGG16 and LSTM is given in Fig. 1.

Flow diagram of description generator using VGG16 and LSTM.
The architecture consists of various layers as described below:
The pre-processing consists of two main tasks:
Image feature extraction using transfer learning: We have used the principles of transfer learning and extracted the bottleneck features from the image. The images are first preprocessed and scaled to the size expected by the deep neural network, VGG 16. VGG 16 is then loaded and its last softmax layer is removed. A total of 134,260,544 VGG16 parameters are loaded and set to non-trainable. The image feature map produced by VGG 16 is a dense vector with 4096 features for each image.
Vocabulary building and preprocessing of captions: This step preprocesses the textual data. A dictionary of unique words is created and the words are converted to numeric indices. Each word is then fed to the model as a vector that incorporates its meaning, so that the image captioner has knowledge of the captions when generating them.
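The vocabulary-building step can be illustrated with a minimal sketch. The marker tokens `startseq`, `endseq` and `<pad>`, the cleaning rule (lowercasing, alphabetic words only) and all function names are assumptions made for illustration, not the authors' exact preprocessing.

```python
from collections import Counter

# Hypothetical marker tokens; the paper does not state which ones were used.
START, END, PAD = "startseq", "endseq", "<pad>"


def clean_caption(caption):
    """Lowercase, keep alphabetic words only, and wrap with start/end markers."""
    words = [w for w in caption.lower().split() if w.isalpha()]
    return [START] + words + [END]


def build_vocabulary(captions, min_count=1):
    """Map every sufficiently frequent word to an integer id (0 is padding)."""
    counts = Counter(w for c in captions for w in clean_caption(c))
    word_to_id = {PAD: 0}
    for word in sorted(w for w, n in counts.items() if n >= min_count):
        word_to_id[word] = len(word_to_id)
    return word_to_id


def encode(caption, word_to_id):
    """Convert a caption to a list of integer ids, skipping unknown words."""
    return [word_to_id[w] for w in clean_caption(caption) if w in word_to_id]


captions = ["Two white dogs are playing in the snow",
            "A dog with its mouth opened"]
word_to_id = build_vocabulary(captions)
ids = encode("two dogs in the snow", word_to_id)
print(len(word_to_id), ids)
```

Reserving id 0 for padding keeps it free for the fixed-length partial captions that a sequence model expects.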
A dataset generator is used to produce the input-output pairs. At every time step, the input consists of the dense feature vector of the image and the caption words seen so far, to which one word from the corresponding caption is added. The output is the next word to be predicted.
The same vector with 4096 features for a given image is provided at each timestep till the end of the caption.
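The pair generation described above can be sketched as follows. The left-padding with id 0, the `max_length` parameter and the function name are assumptions; the three-element feature list merely stands in for the 4096-dimensional VGG16 vector.

```python
def make_training_pairs(image_features, caption_ids, max_length):
    """Return (image_features, padded_partial_caption, next_word_id) triples."""
    pairs = []
    for t in range(1, len(caption_ids)):
        partial = caption_ids[:t][-max_length:]
        # Left-pad with id 0 so every partial caption has the same length.
        padded = [0] * (max_length - len(partial)) + partial
        pairs.append((image_features, padded, caption_ids[t]))
    return pairs


features = [0.1, 0.0, 0.7]      # stand-in for the 4096-dimensional image vector
caption_ids = [1, 5, 9, 4, 2]   # hypothetical ids: start, three words, end
pairs = make_training_pairs(features, caption_ids, max_length=6)
for _, partial, target in pairs:
    print(partial, "->", target)
```

Each caption of length n therefore contributes n - 1 training examples, all sharing the same image vector.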
The architecture of our model is given in Fig. 2. The topmost layer of VGG 16 is removed and a dropout layer is added. The layers of VGG 16 are frozen. The words from the captions are passed to an embedding layer, which is followed by a dropout layer to avoid overfitting. The output of this layer goes to an LSTM layer with 256 states. The outputs from VGG 16 and the LSTM layer are then concatenated and sent to a dense layer, followed by a softmax layer that gives the probability distribution over all words of the vocabulary. The word with the highest probability is chosen as the next word of the image description.

Layered architecture of description generator using VGG16-LSTM architecture.
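As a sanity check on the 134,260,544 frozen parameters quoted for VGG16 without its softmax layer, the count can be reproduced by hand. The layer layout below is the standard VGG16 configuration (thirteen 3x3 convolutions, two 4096-unit dense layers and a 1000-way classifier on 224x224 inputs); treating it as the configuration used in the experiments is an assumption.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a square convolution layer."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return (inputs + 1) * outputs


# (input channels, output channels) of the thirteen convolution layers.
conv_layout = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

conv_total = sum(conv_params(i, o) for i, o in conv_layout)
fc1 = dense_params(7 * 7 * 512, 4096)   # flattened 7x7x512 feature map
fc2 = dense_params(4096, 4096)          # layer that yields the 4096 features
softmax_layer = dense_params(4096, 1000)

without_softmax = conv_total + fc1 + fc2
full_model = without_softmax + softmax_layer
print(without_softmax, full_model)
```

The total without the classifier matches the figure quoted in the pre-processing step, which is consistent with the 4096-dimensional feature vector being taken from the second dense layer.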
In order to improve our results, we modified our model to incorporate pretrained embeddings. Merging the image features and text encodings at a later stage in the architecture brings several benefits: it can yield higher-quality captions while using smaller layers than the traditional architecture that uses a CNN as encoder and an RNN as decoder. For our second experiment we used pretrained GloVe embeddings. After extracting the embedding values for the words in the vocabulary, an embedding matrix is created in which each word is mapped to a 200-dimensional vector. The mapping takes place in a dedicated embedding layer placed after the input layer. These embeddings capture the semantic meaning of every word and help in understanding the context. Word vectors map words to a vector space in which similar words lie close together and dissimilar words lie further apart. Unlike Word2Vec, GloVe (Global Vectors for Word Representation) considers global word co-occurrence along with the local context of words, which makes it advantageous; semantic relationships between words can be derived from the co-occurrence matrix. We developed a merge model that integrates the image vector with the partial caption and outputs a probability distribution over the entire vocabulary. We adopted a greedy search approach for selecting the next predicted word: the word with the highest probability is taken as the next word.
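The greedy search step can be sketched as follows. Here `next_word_probs` is a hypothetical callable standing in for the trained merge model, and `toy_model`, the token ids and the tiny vocabulary are invented purely so that the example runs on its own.

```python
def greedy_decode(next_word_probs, image_features, id_to_word,
                  start_id, end_id, max_length):
    """Repeatedly append the most probable next word until the end token."""
    caption = [start_id]
    for _ in range(max_length):
        probs = next_word_probs(image_features, caption)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best == end_id:
            break
        caption.append(best)
    return " ".join(id_to_word[i] for i in caption[1:])


id_to_word = {0: "<pad>", 1: "startseq", 2: "endseq",
              3: "a", 4: "dog", 5: "runs"}


def toy_model(image_features, caption):
    """Stand-in for the merge model: always favours one scripted next word."""
    script = {1: 3, 3: 4, 4: 5, 5: 2}
    target = script[caption[-1]]
    return [0.9 if i == target else 0.02 for i in range(len(id_to_word))]


result = greedy_decode(toy_model, None, id_to_word,
                       start_id=1, end_id=2, max_length=10)
print(result)
```

Greedy search keeps a single hypothesis per step, which makes it fast but means an early poor choice cannot be revised later.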
Textual description of images using VGG16, LSTM and attention mechanism in images
Attention focuses at high resolution on some parts of the input and at lower resolution on the others. In the earlier methods, the LSTM looks at the entire image at once, whereas descriptions are usually generated while focusing on certain regions of the image. With attention, the decoder outputs the description while focusing on specific regions of the image. All sub-regions and the context are considered as input, and the weighted arithmetic mean of these regions is produced as output.
The weights and probabilities are determined on the basis of the context C, which represents the output of the neural network up to the current time step. Previously, we took the dense representation of the image from the fully connected layer, whereas here attention looks at different spatial regions of the image. The output of the convolution layers, from VGG 16 in our case, is a tensor representation of the input image. A score is given to each pixel of the encoded image: the higher the weight of a pixel, the more relevant it is for the output.
The inputs are the regions y from the convolutional neural network, which in our case are features from VGG 16, and the context C from the RNN. These inputs are multiplied by weights that constitute the learnable parameters of the attention unit and are updated during training. An activation function is applied to obtain smaller values, so that smoother, fine-grained regions within each sub-region are obtained. The similarity can be determined by the dot product between the regions y and the context C: the higher the dot product, the higher the similarity. Hence, the output gives the relevant regions in the image.
The resulting scores m are passed through a softmax function, which outputs them as probabilities s.
Finally, we take the inner product of this probability vector and the sub-regions y to get the final output z, which represents the relevant regions of the entire image.
The probabilities correspond to the relevance of sub regions y, given the context C.
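The attention unit described above (scores m, probabilities s, output z) can be sketched in a few lines. For brevity the sketch scores each region by a tanh-squashed dot product with the context and omits the learnable weight matrices; that simplification, the function names and the toy vectors are assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(regions, context):
    """Return (s, z): relevance probabilities and the weighted mean region."""
    # m_i: squashed dot-product similarity between region y_i and context C.
    m = [math.tanh(sum(a * b for a, b in zip(y, context))) for y in regions]
    s = softmax(m)
    dim = len(regions[0])
    z = [sum(s_i * y[d] for s_i, y in zip(s, regions)) for d in range(dim)]
    return s, z


regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # three toy sub-regions y
context = [2.0, 0.0]                              # toy context C
s, z = soft_attention(regions, context)
print(s, z)
```

The region most aligned with the context receives the largest probability, so z is pulled towards it.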
The description of the attention unit is given in Fig. 3.

Components of attention unit.
The flow diagram of our experimental setting is given in Fig. 4. We have applied attention over the images. The outputs of the attention unit are the relevant regions of the image. These outputs serve as input to the LSTM along with the word embeddings, and together they generate the description of the image.

Flow diagram of description generator using attention.
The Transformer is a model architecture that departs from the conventional use of recurrence and instead relies solely on an attention mechanism to establish global dependencies between input and output. This architecture enables a higher degree of parallelization, leading to substantial improvements in translation quality and new state-of-the-art results. The transformer network adopts an encoder-decoder architecture, similar to that of an RNN [34]. The key distinction is that transformers can process the input sentence in parallel: unlike RNNs, there is no sequential processing over time steps, and all the words in the sentence can be passed through the transformer network simultaneously.
Transformers learn context and meaning in sequential data. Just as tokens are converted to embeddings that are learned in order to capture context, images can be transformed into a sequence of patches. These patches are encoded with positional encodings, and each patch is represented as an n-dimensional embedding. Thus, an image becomes analogous to a sequence of tokens. Transformers learn the context through the attention mechanism. A drawback of LSTMs is that meaning is lost for sequences of length above 100; in addition, LSTM computation cannot be parallelized.
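Turning an image into a sequence of patches can be illustrated with a small sketch. It assumes a single-channel image stored as a list of rows and non-overlapping square patches read in row-major order; the function name and the 4x4 toy image are hypothetical.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened row-major patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel values are simply 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
print(patches)
```

Each flattened patch would then be projected to an n-dimensional embedding and combined with its positional encoding before entering the encoder.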
Here, the relationship between the sequences is learned through multi-head attention. As the model trains, an attention weight is assigned to each of the patches, signifying the importance of that patch.
The tokens of the words are the queries, which are used to query the patches. Keys are assigned weights and are associated with values V. Attention is represented mathematically as:
The tensors here are Q and K. The closer the vectors q and k are, the higher the dot product. The factor $d_k$ will be used for scaling. The attention will result in a value closer to 0 if the query and key vectors are not aligned. Different representations of patches lead to repeated computation for attention which are then combined to form the final score for attention. These multi attention heads are calculated through different learned $W_q$, $W_k$ and $W_v$ matrices.
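A single attention head can be sketched in plain Python. The sketch assumes the standard scaled dot-product form $\mathrm{softmax}(QK^{T}/\sqrt{d_k})\,V$, which divides by the square root of the key dimension; the function names and toy vectors are illustrative only, and the learned projection matrices are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for one attention head."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[d] for w_j, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


queries = [[1.0, 0.0]]                  # one word-token query
keys = [[1.0, 0.0], [0.0, 1.0]]         # two patch keys
values = [[10.0, 0.0], [0.0, 10.0]]     # values attached to the keys
outputs, weights = scaled_dot_product_attention(queries, keys, values)
print(weights, outputs)
```

Multi-head attention runs several such heads on differently projected copies of Q, K and V and concatenates their outputs.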
The components of the encoder and decoder are described below:
Encoder: The encoder takes an image. Self-attention attends to the image patches, which are augmented with positional encodings. A context layer (the output of the attention layer) is passed to the feed-forward neural network. The output is a tensor whose dimensions are batch_size, d_model and the sequence length. The parameters used in this experiment are summarized in Table 1.
Fine tuned parameters for different experimental settings
Decoder: The inputs are the target captions, whose context is learned with the help of masked self-attention. The target caption provides the queries, while the encoder outputs provide the keys and values; in this way the relationship between caption and image is learnt. The attention weights are updated by backpropagation during training. The context vector output by the attention layer is passed through a softmax function to find the predicted word. The flow diagram of our architecture is shown in Fig. 5.

Flow diagram of description generator using transformer.
We have trained our model for 50 epochs. The fine tuned parameters that we have used are mentioned in Table 1. Positional embedding has been created using sin and cos functions.
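A positional embedding built from sine and cosine functions can be sketched as follows. The base of 10000 and the even/odd interleaving follow the original Transformer formulation and are assumptions here, since the text only states that sin and cos functions were used.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal table: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=6)
print(pe[0])
```

Each row is added to the embedding at the matching position, so the model can tell positions apart without recurrence.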
We have used the vit-gpt2-image-captioning model [35] to generate descriptions of images. This model is built on the vision transformer. Three pretrained components from the transformers library are used: the Vision Encoder Decoder model, the GPT-2 tokenizer and the ViT image processor. By using these pretrained components we avoid the overhead of training a model from scratch. We have also evaluated the performance of our models by comparing them with this pretrained model. The Vision Encoder Decoder model is used to encode the image. The GPT-2 tokenizer has already been trained to handle token-related tasks such as tokenization, adding new tokens and handling mask tokens. The ViT image processor rescales and resizes the image.
Results
The results of our experiments are mentioned in this section. The captions predicted for different architectures are given along with the actual caption and the corresponding images from the test set.
a.
The test image is given in Fig. 6.

Test image for description generation using LSTM-CNN model.
Actual Caption: boy wearing red shirt and jeans is doing flip on his bike
Predicted Caption: biker in green suit is jumping off the ground in the wood
b.
The test image is given in Fig. 7.

Test image for description generation using LSTM-CNN and pretrained embeddings model.
Actual Caption: A dog with its mouth opened
Predicted Caption: a dog is running through a field
c.
The test image is given in Fig. 8.

Test image for description generation using attention in images.
Actual Caption: Two white dogs are playing in the snow
Predicted Caption: two white dogs are outside in the snow
d.
The test image is given in Fig. 9.

Test image for description generation using transformer model.
Actual Caption: two girls are walking by tree in front of brick building
Predicted Caption: people in winter coats stand in close to each other in the woods
e.
The test image is given in Fig. 10.

Test image for description generation using pretrained gpt2 transformer model.
Actual Caption: Four people and a dog play in the snow
Predicted Caption: People are in the snow with a dog
We have used the BLEU score [10] to evaluate our generated descriptions. BLEU is designed to approximate human evaluation. It is quick and easy to compute, even when multiple ground-truth statements are available. Its drawback, however, is that it considers only exact matches and does not fully take word order into account. The BLEU score is calculated over the whole predicted corpus.
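The BLEU score is given by:
$$\mathrm{BLEU}(N) = \text{Brevity Penalty} \times \text{Geometric Average Precision}(N)$$
The brevity penalty penalizes sentences that are too short. It is given mathematically as:
$$\text{Brevity Penalty} = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases}$$
where $c$ is the predicted length and $r$ is the target length. The geometric average precision is
$$\text{Geometric Average Precision}(N) = \exp\left(\sum_{i=1}^{N} w_i \log p_i\right)$$
where the i-gram precision is
$$p_i = \frac{\text{number of correctly predicted } i\text{-grams}}{\text{total number of predicted } i\text{-grams}}$$
and $w_i$ is the weight assigned to each precision score; uniform weights $w_i = 1/N$ can be used. N can take varying values.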
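The BLEU definition above can be sketched for a single sentence as follows. The sketch assumes uniform weights, clipped n-gram counts (as in the standard BLEU definition) and the closest reference length for the brevity penalty; the reported scores were computed over the whole corpus, so this is an illustration rather than the evaluation script. The example pair is taken from the attention experiment.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-N with uniform weights and clipped counts."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or matched == 0:
            return 0.0
        precisions.append(matched / total)
    c = len(candidate)
    # Reference length closest to the candidate length (ties: the shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "two white dogs are playing in the snow".split()
candidate = "two white dogs are outside in the snow".split()
bleu1 = bleu(candidate, [reference], max_n=1)
bleu2 = bleu(candidate, [reference], max_n=2)
print(bleu1, bleu2)
```

Seven of the eight predicted unigrams and five of the seven predicted bigrams occur in the reference, and the two captions have equal length, so the brevity penalty is 1.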
BLEU-1 is calculated from the unigram precision score. BLEU-2 considers the geometric average of the unigram and bigram precisions, and so on. The BLEU score has been calculated for the different experimental settings, and the results are shown in Table 2. The highest value of BLEU-1, which uses unigrams, is obtained when the model is trained with transformers. This value even surpasses the result of the pretrained vision transformer. The pretrained vision transformer has not been fine-tuned on our dataset, so the transformer model with multi-head attention that has been trained from scratch gives better results.
Results from different experimental settings
All our models have been trained for 50 epochs.
Thus, we can say that multi-head attention is by far the best way to generate descriptions of images. LSTM with attention over the images also gives good results: the BLEU-1, BLEU-2, BLEU-3 and BLEU-4 values for the experiment with attention are 0.6136, 0.5041, 0.41375 and 0.3342 respectively. Attention is therefore of paramount importance, as it focuses on different regions and helps in understanding the context. In experiment 2, in which GloVe embeddings were incorporated, the performance and the quality of the descriptions improved significantly, and all four BLEU values improved. BLEU values between 0.5 and 0.6 can be considered good, and BLEU-1 is greater than 0.5 in all our experiments. Performance can also be assessed by analyzing the quality of the captions. Section 5 provides the descriptions of the images under the different settings. The results show that the transformer-based models give the best descriptions of our images. With attention, the intricate details of the images are considered. For experiment 1, the actual caption "boy wearing red shirt and jeans is doing flip on his bike" was predicted as "biker in green suit is jumping off the ground in the wood". Here the predicted colour is incorrect, so fine details are ignored. With pretrained embeddings the description is acceptable: the actual caption "A dog with its mouth opened" was predicted as "a dog is running through a field". Thus, minute details of the image are missing in our earlier experiments, although the descriptions are acceptable. On applying attention, the quality of the descriptions improved significantly: the actual and predicted captions are "Two white dogs are playing in the snow" and "two white dogs are outside in the snow" respectively, which are quite similar.
On applying transformers, the granularity increased further, as shown by the predicted caption "two people in winter coats stand in close to each other in the woods" for the image in Fig. 9. The description given by the pretrained ViT transformer is also apt, as shown in Section 5. The results in Section 5 are shown for a random image from the test set; the BLEU score, however, has been calculated over all images in the test corpus.
This paper has covered the area of multimodal processing. Different solutions have been explored for bridging the semantic gap between image and text, with the encoder-decoder framework used to model the problem. We investigated different approaches to image-text matching and performed extensive experiments on the Flickr8k dataset to train models with VGG16, LSTM and transformers. Multi-head attention proved by far the best way to generate descriptions of images, and LSTM with attention also performed well.
In future, more models need to be developed that capture the relationships between text and other modes, particularly images. Different architectures such as ResNet and Inception can be used to extract image features. For text generation, the most probable word has been taken as the next predicted word at each time step; several other candidates can be kept, or sampling can be used, in order to improve the results. Some of the layers of VGG16 can be trained on our dataset along with the embeddings. Larger datasets such as Flickr30k can be used to improve accuracy, and hyperparameters can be tuned further. The work can be extended to video captioning by describing the frames of a video. Furthermore, there is a need to develop deep reinforcement learning techniques for translation across modalities. Visual question answering systems and visual reasoning systems can be developed. Currently, separate and weakly connected frameworks exist for converting images to sentences and vice versa; more unified frameworks need to be developed. Additionally, in generative approaches to summarization, larger semantic units need to be constructed to improve the quality of the summaries.
Funding statement
This research was funded by Taif University, Taif, Saudi Arabia (TU-DSPP-2024-17).
Acknowledgments
The author extends their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-17).
