Multilevel Attention Networks and Policy Reinforcement Learning for Image Caption Generation

Abstract

The analysis of large-scale multimodal data has become very popular recently. Image captioning, whose goal is to automatically describe the content of an image in natural language, is an essential and challenging task in artificial intelligence. Most existing image captioning methods combine a Convolutional Neural Network with a Recurrent Neural Network. These methods either attend to a global representation at the image level or focus only on specific concepts, such as regions and objects. To make the most of the characteristics of a given image, in this study, we present a novel model named Multilevel Attention Networks and Policy Reinforcement Learning for image caption generation. Specifically, our model is composed of a multilevel attention network module and a policy reinforcement learning module. In the multilevel attention network, the object-attention network aims to capture global and local details about objects, whereas the region-attention network obtains global and local features about regions. After that, a policy reinforcement learning algorithm is adopted to overcome the exposure bias problem in the training phase and solve the loss-evaluation mismatching problem at the caption generation stage. With the attention network and policy algorithm, our model can automatically generate accurate and natural sentences for a given image. We carry out extensive experiments on the MSCOCO and Flickr30k data sets, demonstrating that our model is superior to other competitive methods.

Introduction

Generating image descriptions is a more complex task than traditional computer vision tasks, for example, image classification,^1,2 image retrieval,^3,4 and object detection and recognition.^5,6 Interest is growing in solving the more “end-to-end” task of generating detailed descriptions of images in natural language. By its nature, image captioning has been regarded as the bond between image content understanding and natural language expression. Image captioning models are required to recognize objects, attributes, activities, and relationships, and then translate them into semantically accurate and fluent sentences. Image captioning is therefore useful in many applications, such as image understanding, visual question answering, and multimedia retrieval. It can transmit information and bring convenience to people, which gives it significant research value in real life.

Benefiting from the exemplary developments of deep neural networks in sequence generation, the encoder–decoder framework has become mainstream in image captioning and the methods based on this framework have achieved gratifying results.^7–11 Unlike traditional template-based and retrieval-based approaches, the encoder–decoder models usually employ a Convolutional Neural Network (CNN) to encode the input image into a feature vector and a Recurrent Neural Network (RNN) to decode the feature vector into a complete natural language sentence. In principle, the encoder and decoder can be any foundational model.¹² Furthermore, to improve the encoder–decoder framework, the attention mechanism has been introduced in image captioning and has become a significant part of it. The descriptions of the image can be more realistic and articulate with the attention mechanism.^10,13,14

Nevertheless, current attention models often treat the image as a set of regions and only focus on relevant regions by probabilities.^11,15,16 In other words, only a few local regions are considered relevant when predicting the next word. However, the remaining isolated regions may also carry details that are valuable for the generation process, such as color, quantity, category, and type. Besides, many models merely utilize a global feature representation or treat objects and regions individually to generate sentences at the image level, which makes the model miss some objects. In addition, as Ranzato et al.¹⁷ mention, the RNN-based model has a particular degree of inconsistency between training and testing, which can cause an accumulation of errors called exposure bias.

To address the aforementioned challenges, we present a Multilevel Attention Networks and Policy Reinforcement Learning (MANPRL) model. Specifically, we design a multilevel attention network to take full advantage of the global and local features and to consider objects and regions jointly instead of separately. Our multilevel attention network contains a region attention network and an object attention network. In our region attention network, we use the CNN to extract global region features and divide them into independent regions as local region features. Then, we combine them with the hidden state of the previously generated word and process the output with a visual attention mechanism. At the same time, we utilize Faster R-CNN,⁶ a pretrained object detection model, to capture object features, and then perform global average pooling on all object features. The pooled result is regarded as our global object feature and each object feature is regarded as a local object feature. Similarly, we combine them with the hidden state of the previously generated word and feed the output to the object attention model. After that, we use a concatenation operation to combine the two visual features. Finally, to solve the exposure bias problem, we employ a policy gradient (PG) optimization that uses the concatenated visual feature to generate words. In summary, the essential contributions of our work are as follows:

Unlike current attention-based image captioning methods considering objects and regions in images individually, we take them into account simultaneously, which can capture elements that are easily overlooked in an image. Therefore, our model can extract relevant information as much as possible.

We utilize a global-local attention mechanism separately for objects and regions to balance the role of global and local information in sentence generation. Our model preserves sufficient context information through this attention mechanism, which improves the performance of sentences generated finally.

We evaluate our model on the MSCOCO and Flickr30k data sets. The experimental results reveal that our approach outperforms the compared baselines.

The remainder of this study is organized as follows. Related works are summarized in Related Work section. In Approach section, we introduce an outline of the proposed model MANPRL. Next, we elaborate the details of MANPRL in Experiments section. We also conduct extensive ablation studies by comparing with different models. Finally, Conclusion section shows the conclusion and future work.

Related Work

Generating descriptions for images has a long research history. In this section, we describe a variety of existing image captioning methods and divide them into three main types: template-based, retrieval-based, and novel caption generation methods.¹⁸ The template-based methods¹⁹ first generate a predesigned sentence template with blank slots, then fill them with information such as objects, attributes, actions, and relationships of objects in images.^20–22 Template-based methods can generate proper image descriptions with fixed lengths. In retrieval-based approaches, as the name suggests, descriptions are retrieved from all standard descriptions. One way is to find visually similar images and transfer their descriptions to the query image.^11,23 Another way is to retrieve descriptions from a multimodal embedding space of all candidate descriptions.^24,25 The retrieval-based methods are limited by the size of the preconstructed sentence storage. In recent years, with the development of deep neural networks in image captioning, abundant novel methods have been introduced.^26–28 Kiros et al.²⁶ learn a joint text-image embedding space for generating descriptions. Mao et al.²⁷ and Vinyals et al.¹¹ regard the image captioning task as a seq2seq task. They use a CNN to capture visual features and an RNN to decode the features into sentences. Lu et al.¹⁶ introduce an adaptive encoder–decoder framework with a new “visual sentinel” mechanism. Karpathy and Fei-Fei⁸ exploit an alignment model to generate image captions and their locations at the same time. Yang et al.²⁹ merge retrieval-based and generation-based methods with a dual generator GAN. The aforementioned methods take in all the information of the picture when generating words, which makes them inefficient and more time-consuming to train.

Recently, the attention mechanism has become an important part of the image captioning task. Xu et al.¹⁰ put forward two image captioning methods with different attention mechanisms, which can learn a latent alignment between words and image regions. You et al.¹³ integrate semantic concepts into hidden states with different methods. Wu et al.³⁰ inject high-level concepts into a CNN-RNN framework as semantic attention to improve image captioning performance. Wu et al.³¹ use a review module to capture the global concepts into fact vectors with an attention mechanism. Liu et al.³² introduce a novel approach to bridge the gap between the vision and language domains by adding textual concepts to enrich the image features. Yao et al.³³ construct variants of architectures by feeding high-level attributes from images to complement the image representation for image captioning. Zhao et al.³⁴ propose a style memory module designed to explicitly memorize the learned style knowledge and a sentence decomposition algorithm that separates the style-related part from a stylized sentence. Li et al.³⁵ encode an image group into a context-aware feature by combining self-attention and group image visual features. However, these common attention methods simply apply a single-level attention mechanism to image information, whereas our approach considers multiple levels of image information.

Reinforcement learning (RL) is another approach to boost the performance of image captioning, which aims to generate sequential actions defined by a policy through maximizing the accumulative forthcoming rewards.³⁶ Liu et al.³⁷ propose a new PG and new metric called SPIDEr. SPIDEr is a linear combination of semantic propositional image caption evaluation (SPICE)³⁶ and consensus-based image description evaluation (CIDEr).³⁸ Rennie et al.³⁹ propose a self-critical sequence training (SCST) model based on the reinforce algorithm, which chooses its test-time output as a “baseline” to normalize the rewards and weaken the variance generated during the training phase. Zhang et al.⁴⁰ employ an actor-critic model for image captioning. Bahdanau et al.⁴¹ apply a token-level critic network to generate reward and an actor network to give policy. Ren et al.⁴² present a “policy network” to predict actions for generating the next word and a “value network” to predict rewards by evaluating all possible sentences. Liu et al.⁴³ utilize a word-level policy network and a sentence-level network collaboratively to generate captions with multilevel reward function.

Approach

In this section, we explain our proposed model in detail. First, we present the overall framework briefly. Then, we describe how images are encoded into two types of features, that is, region features and object features. Next, we introduce the region attention network and the object attention network, respectively. Finally, we describe our policy-based RL method for generating captions.

Overview

The framework of our model MANPRL is summarized in Figure 1. The model contains a modality embedding process, a multilevel attention network, and a policy reinforcement learning network. In the modality embedding process, we employ a pretrained deep CNN and Faster R-CNN to obtain region features and object features, respectively. Then, our multilevel attention network is built to learn more effective attended visual features. It contains a region attention network and an object attention network, which are used to extract the attended region feature and the attended object feature, respectively, both by considering the global and local features. Finally, a policy reinforcement learning network is introduced to generate the corresponding caption.

FIG. 1.

The framework of MANPRL. CNN, Convolutional Neural Network; LSTM, long short-term memory; MANPRL, Multilevel Attention Networks and Policy Reinforcement Learning.

Modality embedding

In our model, we extract two types of features, that is, region visual feature and object visual feature, on which the global and local operations are conducted individually. Global features reveal a general view of the input image, whereas local features take the detail of small imaging factors into account.

Region feature

Region visual features allow models to focus on parts of the image. Similar to Xu et al.,¹⁰ we use the conv5–4 layer of the VGG-19 network to extract features. Each image has 14 × 14 regions, and every region is represented by 512 visual feature channels; that is, the region visual features can be represented as $F_{R} \in ℛ^{512 \times 14 \times 14}$ .

Object feature

Benefiting from Faster R-CNN in object detection,⁴⁴ a pretrained Faster R-CNN⁶ is used to extract object feature. We select top-10 ranked objects based on the classification confidence scores. Thus, the object visual features of an image can be expressed as $F_{O} = [f_{o}^{1}, f_{o}^{2}, \dots, f_{o}^{10}] \in ℛ^{4096 \times 10}$ .
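The selection of the top-ranked objects, together with the global average pooling over object features described in the Introduction, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' code: the function names `select_top_objects` and `global_average` are hypothetical, and plain Python lists stand in for the 4096-dimensional detector features.

```python
from typing import List, Sequence


def select_top_objects(
    features: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int = 10,
) -> List[List[float]]:
    """Return the feature vectors of the k detections with the highest scores."""
    if len(features) != len(scores):
        raise ValueError("features and scores must have the same length")
    # Sort detection indices by descending confidence and keep the first k.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [list(features[i]) for i in order]


def global_average(features: Sequence[Sequence[float]]) -> List[float]:
    """Average the selected object features element-wise (global object feature)."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]
```

With k = 10 and 4096-dimensional features, the first function yields the ten local object features and the second their pooled global counterpart.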

Multilevel attention network

Region visual features and object visual features are obtained by the modality embedding mentioned earlier. The next step is to make the most of them. Inspired by the strong results of the attention mechanism in computer vision studies,^8,45,46 we present a multilevel attention network, including a region attention network and an object attention network. Figure 2 describes the process of the region attention network, and the object attention network is depicted in Figure 3.

FIG. 2.

Region Attention Network.

FIG. 3.

Object Attention Network.

Region attention network

First, given the region feature $L f_{R} = [f_{R_{1}}, f_{R_{2}}, \dots, f_{R_{14 \times 14}}] \in ℛ^{512 \times 14 \times 14}$ at time t, global average pooling is used to handle all local region features $L f_{R}$ to get global feature $G f_{R_{0}} = f_{R_{0}}$ , which will capture the overall context information and ignore local details. Naturally, we use all local region features $L f_{R_{i}}$ to complete the missing information. We multiply global region feature and each local region feature with h^(t−1), where h^(t−1) is the hidden state output before time t. Thereby, the previously generated words information and the current regions will be all taken into consideration. The process can be formulated as $\begin{matrix} β_{R_{i}}^{(t)} = w^{T} φ (W_{h_{R}} h_{R}^{(t - 1)} + W_{R} f_{R_{i}} + b) \end{matrix}$ (1)

$\begin{matrix} α_{R_{i}}^{(t)} = \frac{β_{R_{i}}^{(t)}}{\sum_{j = 0}^{n} β_{R_{j}}^{(t)}} \end{matrix} .$ (2)

In Eq. (1), $W_{h_{R}} \in ℛ^{1 \times 512}$ , $W_{R} \in ℛ^{1 \times 512}$ , and $b \in ℛ^{1}$ are learnable parameters in region attention network.

After $β_{R_{i}}^{(t)}$ has been computed, we use a softmax function to normalize it to obtain the attention weight $α_{R_{i}}^{(t)}$ , which is positive and sums to 1 over all i. The attended global and local region features are obtained by multiplying the attention weights with the global and local region features, respectively. Summing all of them, we get the attended region feature $Ψ^{(t)} (I_{R})$ , as in the following formula: $\begin{matrix} Ψ^{(t)} (I_{R}) = α_{R_{0}}^{(t)} G f_{R_{0}} + \sum_{i = 1}^{n} α_{R_{i}}^{(t)} L f_{R_{i}} \end{matrix} .$ (3)
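The global-local attention of Eqs. (1)–(3) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. It assumes that φ is the hyperbolic tangent (the text states this only for the object attention network), that the normalization is the softmax described in the text, and that the projection matrices map into a small shared hidden space. The names `matvec` and `global_local_attention` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def global_local_attention(
    local_feats: List[Vector],
    h_prev: Vector,
    W_h: Matrix,
    W_f: Matrix,
    b: Vector,
    w: Vector,
) -> Tuple[Vector, Vector]:
    """Return the attended feature and the attention weights (index 0 is global)."""
    n, dim = len(local_feats), len(local_feats[0])
    # Global feature: average pooling over all local features.
    global_feat = [sum(f[d] for f in local_feats) / n for d in range(dim)]
    feats = [global_feat] + [list(f) for f in local_feats]

    # Eq. (1): score each feature against the previous hidden state.
    h_proj = matvec(W_h, h_prev)
    scores = []
    for f in feats:
        f_proj = matvec(W_f, f)
        hidden = [math.tanh(hp + fp + bb) for hp, fp, bb in zip(h_proj, f_proj, b)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))

    # Softmax normalization (assumed form of the normalization in Eq. (2)).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Eq. (3): weighted sum of the global and local features.
    attended = [sum(a * f[d] for a, f in zip(alphas, feats)) for d in range(dim)]
    return attended, alphas
```

The same routine applies to the object attention network by passing the object features instead of the region features.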

Object attention network

Considering the importance of objects in image description generation, we choose to use objects to enhance the accuracy of the description. As described in Figure 3, the final attended object features are calculated by weighted summing the n processed object features. Similar to the region attention process, the attended object feature can be obtained as the following formula: $\begin{matrix} Ψ^{(t)} (I_{O}) = α_{O_{0}}^{(t)} G f_{O_{0}} + \sum_{i = 1}^{n} α_{O_{i}}^{(t)} L f_{O_{i}} \end{matrix} .$ (4)

Similar to the region attention network, the value $α_{O_{i}}^{(t)}$ represents the attention weight of every object feature at time t and $\sum_{i = 0}^{n} α_{O_{i}}^{(t)} = 1$ . The calculation of the object visual feature $Ψ^{(t)} (I_{O})$ includes two parts, in which $G f_{O_{0}} = f_{O_{0}}$ stands for the global region features extracted by VGG-16 directly and $L f_{O_{i}} = f_{O_{i}} \; (i > 0)$ is the local region feature of the n objects extracted from the image. The significance of each object is measured by the attention weight $α_{O_{i}}^{(t)}$ through the following formula, which also reveals the relation between the previous information and the objects at time t. $\begin{matrix} β_{O_{i}}^{(t)} = w^{T} φ (W_{h_{O}} h^{(t - 1)} + W_{O} f_{O_{i}} + b) \end{matrix}$ (5)

$\begin{matrix} α_{O_{i}}^{(t)} = \frac{β_{O_{i}}^{(t)}}{\sum_{j = 0}^{n} β_{O_{j}}^{(t)}} \end{matrix} .$ (6)

In Eq. (5), $h^{(t - 1)}$ stands for the previous information, which will be introduced in the caption generation section, and $f_{O_{i}} \in \{G f_{O_{0}}, L f_{O_{1}}, \dots, L f_{O_{n}}\}$ is an element of the set of object features. $W_{h_{O}} \in ℛ^{1 \times 512}$ , $W_{O} \in ℛ^{1 \times 512}$ , and $b \in ℛ^{1}$ are parameters learned by the object attention model. $φ$ is the activation function, which is an element-wise hyperbolic tangent function in the object attention network.

After computing $β_{O_{i}}^{(t)}$ , we also use the softmax function to normalize it to obtain $α_{O_{i}}^{(t)}$ . In this way, the object attention network can automatically focus on salient objects in the process of generating words and take their context information into consideration at the same time.

Reinforcement learning

In the image captioning task, the evaluation function is generally recall-oriented understudy for gisting evaluation (ROUGE),⁴⁷ bilingual evaluation understudy (BLEU),⁴⁸ or CIDEr.³⁸ It is difficult for traditional maximum likelihood estimation methods to directly optimize such nondifferentiable evaluation functions, which leads to a mismatch between training and testing. RL can solve this problem because it does not require the reward or loss to be differentiable, so any evaluation function can be used for optimization.

We treat sequence generation as an RL problem. The long short-term memory (LSTM) network is regarded as the agent that generates captions, the action is to generate the next word, the state is the hidden unit of the LSTM, and the policy $p_{θ}$ is parameterized by θ. The LSTM stops generating words when the end-of-sequence token is generated. Our model's reward r is the CIDEr score, calculated by comparing the generated sentence with the corresponding human-annotated sentences. Therefore, the goal of our model is to minimize $J (θ)$ , the negative expected reward, as the following formula shows: $\begin{matrix} J (θ) = - ℰ_{w^{s} \sim p_{θ}} [r (w^{s})] \end{matrix},$ (7)

where $w^{s} = (w_{1}^{s}, \dots, w_{T}^{s})$ is the sentence sampled from the model, $w_{t}^{s}$ is the word sampled at time step t, and the reward $r (w^{s})$ is computed by comparing the sentence with the human-annotated sentences. However, the expectation in $J (θ)$ cannot be computed directly because of the high-dimensional space of possible text generation actions. One solution is Monte-Carlo sampling, which samples action sequences according to the probability $p_{θ}$ in the training phase. $J (θ)$ is then estimated as $\begin{matrix} J (θ) \approx - r (w^{s}), w^{s} \sim p_{θ} \end{matrix} .$ (8)

Then, we utilize the policy RL algorithm to compute the gradient $\nabla_{θ} J (θ)$ , which can be computed as follows: $\begin{matrix} \nabla_{θ} J (θ) = - ℰ_{w^{s} \sim p_{θ}} [r (w^{s}) \nabla_{θ} \log p_{θ} (w^{s})] \end{matrix} .$ (9)

In practice, we use Monte-Carlo sampling to draw $w^{s} = (w_{1}^{s}, \dots, w_{T}^{s})$ from $p_{θ}$ for each training example to approximate the expected gradient in the mini-batch. $\begin{matrix} \nabla_{θ} J (θ) \approx - r (w^{s}) \nabla_{θ} \log p_{θ} (w^{s}) \end{matrix} .$ (10)

Monte-Carlo sampling is highly random, so the final samples can differ greatly, which results in a high variance in the reward, especially in a large search space such as our caption generation task. Adding a baseline is one solution to this problem, which constrains the reward to a restricted range. Then Eq. (10) becomes the following: $\begin{matrix} \nabla_{θ} J (θ) \approx - (r (w^{s}) - b) \nabla_{θ} \log p_{θ} (w^{s}) \end{matrix} .$ (11)

Any function can be used for the baseline b, as long as it does not depend on the action $w^{s}$. This conclusion follows from a short derivation: $\begin{matrix} ℰ_{w^{s} \sim p_{θ}} [b \nabla_{θ} \log p_{θ} (w^{s})] & = b \sum_{w^{s}} p_{θ} (w^{s}) \nabla_{θ} \log p_{θ} (w^{s}) \\ & = b \sum_{w^{s}} \nabla_{θ} p_{θ} (w^{s}) \\ & = b \nabla_{θ} \sum_{w^{s}} p_{θ} (w^{s}) \\ & = b \nabla_{θ} 1 = 0 \end{matrix} .$ (12)
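The effect of a baseline can be checked numerically on a toy problem. The sketch below is illustrative only: it replaces the caption policy with a hypothetical softmax policy over three actions, enumerates all actions exactly instead of sampling, and computes the gradient of the expected reward with respect to the logits (the gradient of $J (θ)$ has the opposite sign). The function names are assumptions.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs: List[float], action: int) -> List[float]:
    """Gradient of log p(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def expected_gradient_and_variance(
    logits: List[float], rewards: List[float], baseline: float
) -> Tuple[List[float], float]:
    """Exact mean and total variance of the estimator (r - b) * grad log p."""
    probs = softmax(logits)
    mean = [0.0] * len(probs)
    second_moment = 0.0
    for action, p_a in enumerate(probs):
        g = [(rewards[action] - baseline) * x for x in grad_log_prob(probs, action)]
        mean = [m + p_a * gi for m, gi in zip(mean, g)]
        second_moment += p_a * sum(gi * gi for gi in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


if __name__ == "__main__":
    logits, rewards = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]
    print(expected_gradient_and_variance(logits, rewards, baseline=0.0))
    print(expected_gradient_and_variance(logits, rewards, baseline=2.0))
```

In this toy setting the expected gradient is identical with and without the baseline, while the variance is smaller when the baseline equals the mean reward.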

Equation (12) shows that the baseline b does not change the expected gradient, while a well-chosen baseline can reduce the gradient variance. We choose as the baseline the reward obtained by the current model under the inference algorithm used at test time. This baseline forces the model to improve its performance under the inference algorithm used at test time and encourages training/test time consistency similar to the maximum likelihood-based approaches. Applying the chain rule, the gradient can be expressed as $\begin{matrix} \nabla_{θ} L (θ) = \sum_{t = 1}^{T} \frac{\partial L (θ)}{\partial s_{t}} \frac{\partial s_{t}}{\partial θ} \end{matrix},$ (13)

where s_t denotes the input features at time step t. In the text generation model, the gradient of $L (θ)$ with respect to s_t can be derived as $\begin{matrix} \frac{\partial L (θ)}{\partial s_{t}} \approx (r (w^{s}) - b) (p_{θ} (w_{t} ∣ h_{t}) - 1_{w_{t}^{s}}) \end{matrix},$ (14)

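where $1_{w_{t}^{s}}$ is the one-hot vector representation of the sampled word $w_{t}^{s}$. Similar to Rennie et al.,³⁹ we choose the reward of the sentence generated by the current model in the testing phase as the baseline, so the gradient becomes the following: $\begin{matrix} \frac{\partial L (θ)}{\partial s_{t}} \approx (r (w^{s}) - r (\hat{w})) (p_{θ} (w_{t} ∣ h_{t}) - 1_{w_{t}^{s}}) \end{matrix},$ (15)

where $\hat{w}$ denotes the sentence generated by the current model under test-time inference.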
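The per-step gradient of Eq. (14), with the test-time reward used as the baseline b, can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes that s_t denotes the pre-softmax scores at step t, as in self-critical sequence training, which the text does not state explicitly. The function and argument names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_critical_logit_grads(
    step_logits: List[List[float]],
    sampled_words: List[int],
    sampled_reward: float,
    greedy_reward: float,
) -> List[List[float]]:
    """Per-step gradient (r(w^s) - b) * (p_theta(. | h_t) - one_hot(w_t^s)).

    The baseline b is the reward of the caption decoded at test time
    (greedy_reward); both rewards are sentence-level scores such as CIDEr.
    """
    advantage = sampled_reward - greedy_reward
    grads = []
    for logits, word in zip(step_logits, sampled_words):
        probs = softmax(logits)
        grads.append(
            [advantage * (p - (1.0 if i == word else 0.0)) for i, p in enumerate(probs)]
        )
    return grads
```

When the sampled caption scores higher than the test-time caption, the gradient on the sampled word's score is negative, so a gradient descent step raises that word's probability; when it scores lower, the probability is reduced.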

Image description generation

We treat description generation as a variable-length sequence generation problem, in which LSTM is widely adopted. Before feeding visual features to the LSTM, we need to integrate the two kinds of features. Equation (16) calculates the final visual vector for our multilevel attention network, where Concat denotes the concatenation of the region visual features $Ψ^{(t)} (I_{R})$ and the object visual features $Ψ^{(t)} (I_{O})$ . $\begin{matrix} Ψ^{(t)} (I) = \mathrm{Concat} (Ψ^{(t)} (I_{R}), Ψ^{(t)} (I_{O})) \end{matrix}$ (16)

Given the image visual features I and the previously predicted words $\{s_{0}, s_{1}, \dots, s_{t - 1}\}$ , the probability of the next word is defined by $P (s_{t} ∣ I, s_{0}, s_{1}, \dots, s_{t - 1})$ , and the following formula shows the LSTM unit updating process at time step t: $\begin{matrix} x^{t} & = w_{x} s_{t}, \quad I^{t} = Ψ^{(t)} (I) \\ i^{t} & = σ (W_{i} x^{t} + U_{i} I^{t} + Z_{i} h^{(t - 1)} + b_{i}) \\ f^{t} & = σ (W_{f} x^{t} + U_{f} I^{t} + Z_{f} h^{(t - 1)} + b_{f}) \\ o^{t} & = σ (W_{o} x^{t} + U_{o} I^{t} + Z_{o} h^{(t - 1)} + b_{o}) \\ c^{t} & = f^{t} ⨂ c^{(t - 1)} + i^{t} ⨂ ϕ (W_{c} x^{t} + U_{c} I^{t} + Z_{c} h^{(t - 1)}) \\ h^{t} & = o^{t} ⨂ c^{t} \\ P (s_{t} ∣ I, s_{0}, s_{1}, \dots, s_{t - 1}) & = \mathrm{Softmax} (w_{p} h^{t}) \end{matrix},$ (17)

where i^t, f^t, c^t, o^t, and h^t are the input gate, forget gate, memory cell, output gate, and hidden state of the LSTM, respectively, and x^t is the embedded input word. W_•, U_•, Z_•, and b_• are learnable weights and biases. The element-wise product of two vectors is denoted by $⨂$ , and σ represents the logistic sigmoid function.
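One update step of Eqs. (16) and (17) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: ϕ is assumed to be the hyperbolic tangent, the hidden state follows Eq. (17) as written (h = o ⨂ c), the final softmax over the vocabulary is omitted, and the parameter dictionary layout and the names `gate_preactivation` and `lstm_step` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_preactivation(
    p: Dict[str, list], gate: str, x: Vector, vis: Vector, h_prev: Vector
) -> Vector:
    """Compute W x + U I + Z h (+ b, if a bias is provided for this gate)."""
    W, U, Z = p["W_" + gate], p["U_" + gate], p["Z_" + gate]
    bias = p.get("b_" + gate, [0.0] * len(W))
    return [
        a + b + c + d
        for a, b, c, d in zip(matvec(W, x), matvec(U, vis), matvec(Z, h_prev), bias)
    ]


def lstm_step(
    p: Dict[str, list],
    x: Vector,
    region_vec: Vector,
    object_vec: Vector,
    h_prev: Vector,
    c_prev: Vector,
) -> Tuple[Vector, Vector]:
    """Return the new hidden state and memory cell for one time step."""
    # Eq. (16): concatenate the attended region and object features.
    vis = list(region_vec) + list(object_vec)
    # Eq. (17): gates, candidate memory, memory cell, and hidden state.
    i = [sigmoid(v) for v in gate_preactivation(p, "i", x, vis, h_prev)]
    f = [sigmoid(v) for v in gate_preactivation(p, "f", x, vis, h_prev)]
    o = [sigmoid(v) for v in gate_preactivation(p, "o", x, vis, h_prev)]
    g = [math.tanh(v) for v in gate_preactivation(p, "c", x, vis, h_prev)]
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * cv for ov, cv in zip(o, c)]
    return h, c
```

In a PyTorch implementation the same step would normally be expressed with tensor operations; the list-based version is used here only to keep the sketch dependency-free.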

Experiments

In this section, we conduct experiments to evaluate the performance of MANPRL. First, we go through the preparation of the experiment, including data sets, baselines, and experiment setup. Then, evaluations are carried out on the MSCOCO and Flickr30k data sets to validate our proposed model. Finally, we analyze the effectiveness of the model with quantitative and qualitative evaluation results.

Experiment setup

Data set and evaluation metrics

Our experiments are conducted on the famous MSCOCO-2014²² and Flickr30k⁴⁹ data sets. MSCOCO data set is the largest data set for image captioning, which contains 82,783, 40,504, and 40,775 images in the train, validate, and test set, respectively. Each image has five corresponding sentences annotated in English by humans. We utilize the Karpathy split method,⁸ which takes 113,287 images for training, 5000 images for validation and another 5000 images for testing. Flickr30k data set contains 31,000 images, which takes 1000 images for validation, 1000 images for testing, and the rest for training.

Generally, BLEU,⁴⁸ metric for evaluation of translation with explicit ordering (METEOR),⁵⁰ ROUGE,⁴⁷ and CIDEr³⁸ are standard evaluation protocols. BLEU is the most common and popular metric, which is based only on n-gram precision with a sentence-brevity penalty. The value of n is 1, 2, 3, or 4, and BLEU-n reports the performance on n-grams of the corresponding length. METEOR aims to address inherent deficiencies in the BLEU metric and employs WordNet to calculate the harmonic mean of unigram precision and recall between sentences. ROUGE is usually the evaluation standard for the automatic summarization task. There are three evaluation criteria for ROUGE, namely ROUGE-N, ROUGE-L, and ROUGE-S. CIDEr is specially designed for image captioning tasks; it treats each sentence as a document and represents it as a Term Frequency-Inverse Document Frequency (TF-IDF) vector. By calculating the TF-IDF weight of each n-gram and the cosine similarity between the reference caption and the caption generated by the model, the consistency of the description can be measured.
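To make the description of BLEU concrete, the following sketch computes clipped n-gram precision and the brevity penalty for a single candidate against a single reference. It is a simplified illustration, not the official evaluation toolkit: the reported scores are corpus-level and use multiple references, and the function names here are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision: each n-gram counts at most as often as in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)
```

A candidate that is a correct but shorter prefix of the reference has perfect n-gram precision and is penalized only through the brevity penalty, which is why BLEU is described as precision-based.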

Features and parameter setting

For any given image, we first resize it to 224 × 224. Then, a ResNet101-based Faster R-CNN is adopted to obtain object features and VGG-19 is utilized to obtain region features from the resized image. We keep only words that appear more than five times in the training data set, which gives a vocabulary size of 9485. Each word is embedded in a 1000-dimensional word embedding space.
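The vocabulary filtering step can be sketched as follows. This is a minimal illustration with assumptions: tokenization is lowercasing plus whitespace splitting, and the special tokens are hypothetical (the text does not say whether such tokens are part of the reported vocabulary size).

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_vocabulary(
    captions: Iterable[str],
    min_count: int = 5,
    specials: Sequence[str] = ("<pad>", "<start>", "<end>", "<unk>"),
) -> Dict[str, int]:
    """Map each kept word to an index; keep words seen more than min_count times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    kept = sorted(word for word, count in counts.items() if count > min_count)
    return {token: index for index, token in enumerate(list(specials) + kept)}
```

Words below the threshold would typically be mapped to the unknown token when captions are encoded.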

Training details

We employ PyTorch to implement our models on two Tesla V100 GPUs. The Adam optimizer is employed to optimize our proposed model for 200 epochs. The learning rate is initialized to $5 \times 10^{-4}$ and is multiplied by 0.8 every 8 epochs. The batch size is set to 64, and dropout with a probability of 0.5 is applied after every linear transformation to reduce overfitting. The model snapshot that performs best on the validation set is chosen to evaluate our method on the testing set.
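The learning rate schedule can be written out explicitly. The sketch below assumes a step-wise decay in which the rate is multiplied by 0.8 after every 8 completed epochs, which is one reading of the description above; the function name is hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 5e-4, decay: float = 0.8, step: int = 8) -> float:
    """Learning rate used during the given zero-indexed epoch under step decay."""
    return base_lr * decay ** (epoch // step)
```

Under this reading, epochs 0 to 7 use 5e-4, epochs 8 to 15 use 4e-4, and so on. In PyTorch the same behavior is usually obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.8)`.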

Baselines

We compare our model with attention-based methods, RL-based methods, and some other methods under all metrics on the Karpathy test split to validate our method. We divide the compared models into three categories.

Attention-based models

Show-Attend-Tell¹⁰ introduces two attention mechanisms to learn salient image regions separately. SCA-CNN⁵¹ uses a CNN-based encoder with spatial and channel-wise attention. Adaptive-Att¹⁶ employs an RNN-based decoder with visual sentinel adaptive attention. Up-Down⁵² introduces two LSTM layers: a top-down attention LSTM and a language LSTM.

RL-based models

SCST³⁹ employs a self-critical sequence training model for image captioning, which uses sentences generated at the test time as a benchmark to normalize rewards. PG-SPIDEr³⁷ optimizes a new metric with a linear mixture of SPICE and CIDEr by PG. Embedding-Reward⁴² employs a “policy network” and a “value network” to generate captions together. Actor-Critic⁵³ adopts a self-critical n-step training to forecast words based on the actor-critic algorithm. StackCap⁵⁴ proposes a coarse-to-fine forecast framework, which has multiple LSTM-based decoders and complicated RL-based reward.

Other models

Variational autoencoder (VAE)⁵⁵ uses a variational autoencoder model based on an encoder–decoder architecture to learn images and associated labels or captions. Neural image caption (NIC)¹¹ proposes an image description generation architecture based on a deep recurrent network. Attributes-CNN³⁰ presents a method of integrating high-level concepts into the encoder–decoder architecture. CNN$_{ℒ}$+RNN⁵⁶ introduces a statistical language model into the CNN based on the encoder–decoder architecture. GroupCap⁵⁷ proposes a group-based image description architecture, which jointly learns the structural association and diversity among group images.

Image captioning results

Quantitative analysis

We verify the performance of our model quantitatively based on two public data sets. Tables 1 and 2 are the experiment results on the MSCOCO and Flickr30k data sets. From these two tables, the following statements and analysis can be drawn.

Table 1.

Results on the MSCOCO data set

Methods	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE	CIDEr
VAE⁵⁵	72.0	52.0	37.0	28.0	24.0	—	—
NIC¹¹	—	—	—	32.1	25.7	—	—
Attributes-CNN³⁰	74.0	56.0	42.0	31.0	26.0	—	—
CNN_L+RNN⁵⁶	72.3	55.3	41.3	30.6	26.0	—	94.0
GroupCap⁵⁷	74.4	58.1	44.3	33.8	26.2	—	—
Show-Attend-Tell¹⁰	70.7	49.2	34.4	24.3	23.9	—	—
SCA-CNN⁵¹	71.9	54.8	41.1	31.1	25.0	53.1	95.2
CNN+Att⁵⁸	71.1	53.8	39.4	28.7	24.4	52.2	91.2
Adaptive-Att¹⁶	74.2	58.0	43.9	33.2	26.6	54.9	108.5
Up-Down⁵²	77.2	—	—	36.2	27.0	56.4	113.5
SCST³⁹	77.4	60.9	46.0	34.1	26.7	55.7	114.0
PG-SPIDEr³⁷	74.3	57.8	43.3	32.2	25.1	54.4	100
Embedding-Reward⁴²	71.3	53.9	40.3	30.4	25.1	52.5	93.7
Actor-Critic⁵³	77.9	61.5	46.7	34.9	26.9	56.2	115.2
StackCap⁵⁴	78.4	62.5	47.9	36.1	27.4	56.9	120.4
Up-Down_RL⁵²	79.8	63.4	48.4	36.3	27.7	56.9	120.1
Ours	80.1	64.1	48.7	38.5	28.7	58.7	120.8

BLEU, bilingual evaluation understudy; CIDEr, consensus-based image description evaluation; CNN, Convolutional Neural Network; METEOR, metric for evaluation of translation with explicit ordering; NIC, neural image caption; PG, policy gradient; RNN, Recurrent Neural Network; ROUGE, recall-oriented understudy for gisting evaluation; SCST, self-critical sequence training; SPIDEr, a linear combination of SPICE and CIDEr; VAE, variational autoencoder.

Table 2.

Results on the Flickr30k data set

Methods	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
Soft attention¹⁰	66.9	43.4	28.8	19.1	18.5	—	—
Hard attention¹⁰	66.7	43.9	29.6	19.9	18.5	—	—
VAE⁵⁵	72.0	53.0	38.0	25.0	—	—	—
NIC¹¹	63.0	41.0	27.0	—	—	—	—
Attributes-CNN³⁰	73.0	55.0	40.0	28.0	—	—	—
SCA-CNN⁵¹	68.2	49.6	35.9	25.8	22.4	50.9	66.5
CNN_L+RNN⁵⁶	73.8	56.3	41.9	30.7	21.6	—	61.8
Adaptive-Att¹⁶	67.7	49.4	35.4	25.1	20.4	—	53.1
Ours	74.9	57.1	42.5	31.3	23.2	52.3	67.3

First, the results show that our model surpasses the compared models on all seven evaluation measures. Compared with traditional models, we consider the corresponding regions and objects in the sentence and their roles in sentence generation from local to global. In other words, the image features our model learns are richer.

Second, we also find that more sophisticated attention models perform better than simple attention models. Adaptive-Att and Up-Down obtain higher scores than Show-Attend-Tell, SCA-CNN, and CNN+Att. A possible reason is that models considering more image details perform better in the sentence generation process. This suggests that the more complex models better distinguish the importance of different information in the picture when generating sentences.

Third, most RL-based models have more satisfactory performance than others, which demonstrates that the RL algorithm solves the loss function mismatch problem by optimizing evaluation metrics directly. Other models are trained by maximizing the likelihood of each ground-truth word given the previous ground-truth words and the image using back-propagation. This creates a mismatch between training and testing time since at test-time, the model uses the previously generated words from the model distribution to predict the next word.

Qualitative analysis

To facilitate comparison between our model and the other models, we generate descriptions for some randomly selected images. Figure 4 shows these descriptions and the corresponding images. The first sentences, in green, are the ground truth. The soft-attention model and the SCST model generate the blue sentences (C1) and the purple sentences (C2), respectively. The last sentences, in red, are generated by our model and are more accurate and natural. From these captions, we can observe that our model can detect objects that the other models miss. In the first example, after generating “A man,” the other models continue with “in a suit,” whereas our model generates “in a glasses and tie.” Intuitively, the sentences generated by our model are closer to the ground-truth sentences in all examples.

FIG. 4.

Visualization of the generated sentences. The blue ones (C1) are generated by the soft-attention model and the purple ones (C2) by the SCST model. All samples are randomly selected. SCST, self-critical sequence training.

To make the multilevel attention network more intuitive, Figure 5 visualizes the attention weights that it learns. Although our attention mechanism largely covers the corresponding regions and objects, the coverage is not completely consistent. In the top example in Figure 5, for instance, the attended area does not completely cover the boy. A possible reason is that our multilevel attention network uses only a simple concatenation operation, without considering the mutual cooperation of the attention networks.

FIG. 5.

The visualization of multilevel attention. (a) A boy riding a horse-drawn carriage in a field. (b) A large elephant standing next to a young boy.

Ablation study

To verify the effectiveness of the different modules in our proposed model, we conduct ablation experiments in which one of three parts is removed at a time. The three parts are the object-attention network, the region-attention network, and the policy RL algorithm. The results on the MSCOCO data set are presented in Table 3, and the results on the Flickr30k data set are presented in Table 4. In Tables 3 and 4, “without OAN” denotes a model retrained without the object-attention network, “without RAN” a model retrained without the region-attention network, and “without RL” a model retrained without the policy RL algorithm.

Table 3.

Performance of different model variants on the MSCOCO data set

Methods	BLEU-4	METEOR	ROUGE-L	CIDEr
Without OAN	37.7	28.3	58.2	117.8
Without RAN	37.4	28.1	57.9	117.2
Without RL	36.1	27.1	56.3	115.1
Full model	38.5	28.7	58.7	120.8

OAN, object attention network; RAN, region attention network; RL, reinforcement learning.

Table 4.

Performance of different model variants on the Flickr30k data set

Methods	BLEU-4	METEOR	ROUGE-L	CIDEr
Without OAN	28.3	21.9	49.8	49.6
Without RAN	28.1	21.6	49.5	49.3
Without RL	27.5	20.2	48.9	48.7
Full model	31.3	23.2	52.3	53.1

First, all three components of the model play an essential role in the image captioning task, since removing any one of them lowers every metric on both data sets. Second, the policy RL algorithm has the greatest impact on the final results. Third, the object-attention network contributes less to the performance improvement than the region-attention network.
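These three observations can be read off the tables by computing how much each metric drops when a component is removed. The following is a minimal sketch under stated assumptions: the values are copied from Tables 3 and 4, while the dictionary layout and the name `ablation_drops` are hypothetical.

```python
# Columns: BLEU-4, METEOR, ROUGE-L, CIDEr (values copied from Tables 3 and 4).
TABLE_3_MSCOCO = {
    "Without OAN": [37.7, 28.3, 58.2, 117.8],
    "Without RAN": [37.4, 28.1, 57.9, 117.2],
    "Without RL": [36.1, 27.1, 56.3, 115.1],
    "Full model": [38.5, 28.7, 58.7, 120.8],
}

TABLE_4_FLICKR30K = {
    "Without OAN": [28.3, 21.9, 49.8, 49.6],
    "Without RAN": [28.1, 21.6, 49.5, 49.3],
    "Without RL": [27.5, 20.2, 48.9, 48.7],
    "Full model": [31.3, 23.2, 52.3, 53.1],
}


def ablation_drops(table):
    """Return, per ablated variant, the full-model score minus its score."""
    full = table["Full model"]
    return {
        name: [round(f - s, 1) for f, s in zip(full, scores)]
        for name, scores in table.items()
        if name != "Full model"
    }


if __name__ == "__main__":
    for label, table in [("MSCOCO", TABLE_3_MSCOCO), ("Flickr30k", TABLE_4_FLICKR30K)]:
        print(label)
        for name, drops in ablation_drops(table).items():
            print(f"  {name}: {drops}")
```

On both data sets and for every metric, the drop is largest without RL, followed by the drop without RAN and then the drop without OAN.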

Conclusion

In this study, we present a novel model named MANPRL for image caption generation. Unlike other methods, our model considers as much image detail as possible from different views. We employ a multilevel attention network to explore the importance of regions and objects, from local to global, in the image feature encoding stage. After that, a policy RL algorithm is adopted to overcome the exposure bias problem in the training phase and to solve the loss-evaluation mismatch problem at the sentence generation stage. We conduct experiments on the MSCOCO and Flickr30k data sets. The experimental results show that our model achieves better results on various metrics than the compared models, reflecting its effectiveness. We also perform an ablation study on both data sets, and the results show that all three components are beneficial for image captioning.

From the attention visualization in the Qualitative analysis section, we can see that the attention is not very accurate. In future work, we will explore using RL to guide the attention mechanism to learn the connection between words and the corresponding image details.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. U1636211, 61672081, 61370126, 62002068), the 2020 Tencent Wechat Rhino-Bird Focused Research Program, and the Fund of the State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2021ZX-18).

References

1. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017; 60:84–90.

2. He, Zhang, Ren, Sun. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society, pp. 770–778.

3. Kang, Xiang, Liao, et al. Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimed. 2015; 17:370–381.

4. Liu, Guo, Wu, Cai. Fusion of deep learning and compressed domain features for content-based image retrieval. IEEE Trans Image Process. 2017; 26:5706–5717.

5. Girshick RB. Fast R-CNN. In: IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, Dec. 7–13, 2015. IEEE Computer Society, pp. 1440–1448.

6. Ren, He, Girshick, Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017; 39:1137–1149.

7. Bengio, Vinyals, Jaitly, Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (Eds). Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, Dec. 7–12, 2015, Montreal, QC, Canada, pp. 1171–1179.

8. Karpathy, Li. Deep visual-semantic alignments for generating image descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015. IEEE Computer Society, pp. 3128–3137.

9. Kulkarni, Premraj, Ordonez, et al. Babytalk: Understanding and generating simple image descriptions. IEEE Trans Pattern Anal Mach Intell. 2013; 35:2891–2903.

10. Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, et al. Show, attend and tell: Neural image caption generation with visual attention. In: Bach FR, Blei DM (Eds). Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, July 6–11, 2015. vol. 37 of JMLR Workshop and Conference Proceedings, 2015, pp. 2048–2057.

11. Vinyals, Toshev, Bengio, Erhan. Show and tell: A neural image caption generator. In: IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, June 7–12, 2015. IEEE Computer Society, pp. 3156–3164.

12. Wang, Chan. CNN+CNN: Convolutional decoders for image captioning. arXiv preprint arXiv:1805.09019. 2018.

13. You, Jin, Wang, Fang, Luo. Image captioning with semantic attention. In: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society, pp. 4651–4659.

14. Jin, Fu, Cui, et al. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv preprint arXiv:1506.06272. 2015.

15. Yang, Yuan, Wu, Cohen, Salakhutdinov. Review networks for caption generation. In: Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R (Eds). Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, Dec. 5–10, 2016, Barcelona, Spain, pp. 2361–2369.

16. Lu, Xiong, Parikh, Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 21–26, 2017. IEEE Computer Society, pp. 3242–3250.

17. Ranzato, Chopra, Auli, Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. 2015.

18. Hossain, Sohel, Shiratuddin, Laga. A comprehensive survey of deep learning for image captioning. ACM Comput Surv (CSUR). 2019; 51:1–36.

19. Farhadi, Hejrati SMM, Sadeghi, Young, Rashtchian, Hockenmaier, et al. Every picture tells a story: Generating sentences from images. In: Daniilidis K, Maragos P, Paragios N (Eds). Computer Vision-ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, Sept. 5–11, 2010. Springer, pp. 15–29.

20. Elliott, Keller. Image description using visual dependency representations. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Oct. 18–21, 2013, Grand Hyatt Seattle, Seattle, Washington, USA. ACL, pp. 1292–1302.

21. Mitchell, Dodge, Goyal, Yamaguchi, Stratos, Han, et al. Midge: Generating image descriptions from computer vision detections. In: Daelemans W, Lapata M, Màrquez L (Eds). 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, April 23–27, 2012. The Association for Computer Linguistics, pp. 747–756.

22. Lin, Maire, Belongie, Hays, Perona, Ramanan, et al. Microsoft COCO: Common objects in context. In: Fleet DJ, Pajdla T, Schiele B, Tuytelaars T (Eds). Computer Vision-ECCV 2014, 13th European Conference, Zurich, Switzerland, Sept. 6–12, 2014. Springer, pp. 740–755.

23. Gong, Wang, Hodosh, Hockenmaier, Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In: Fleet DJ, Pajdla T, Schiele B, Tuytelaars T (Eds). Computer Vision-ECCV 2014, 13th European Conference, Zurich, Switzerland, Sept. 6–12, 2014. Springer, pp. 529–545.

24. Karpathy, Joulin, Li. Deep fragment embeddings for bidirectional image sentence mapping. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (Eds). Annual Conference on Neural Information Processing Systems, Dec. 8–13, 2014, Montreal, QC, Canada, pp. 1889–1897.

25. Faghri, Fleet, Kiros, Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612. 2017.

26. Kiros, Salakhutdinov, Zemel. Multimodal neural language models. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China, June 21–26, 2014, pp. 595–603.

27. Mao, Xu, Yang, et al. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090. 2014.

28. Kiros, Salakhutdinov, Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539. 2014.

29. Yang, Liu, Shen, et al. An ensemble of generation- and retrieval-based image captioning with dual generator generative adversarial network. IEEE Trans Image Process. 2020; 29:9627–9640.

30. Wu, Shen, Liu, Dick, van den Hengel. What value do explicit high level concepts have in vision to language problems? In: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society, pp. 203–212.

31. , Cohen. Encode, review, and decode: Reviewer module for caption generation. arXiv preprint arXiv:1605.07912. 2016.

32. Liu, Wu, Ge, Zhang, Fan, Zou. Bridging the gap between vision and language domains for improved image captioning. In: Chen CW, Cucchiara R, Hua X, Qi G, Ricci E, Zhang Z, et al. (Eds). The 28th ACM International Conference on Multimedia, Virtual Event, Seattle, WA, USA, Oct. 12–16, 2020, pp. 4153–4161.

33. Yao, Pan, Li, Qiu, Mei. Boosting image captioning with attributes. In: IEEE International Conference on Computer Vision, Venice, Italy, Oct. 22–29, 2017, pp. 4904–4912.

34. Zhao, Wu, Zhang. MemCap: Memorizing style knowledge for image captioning. Proc AAAI Conf Artif Intell. 2020; 34:12984–12992.

35. Li, Tran, Mai, Lin, Yuille. Context-aware group captioning via self-attention and contrastive features. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 13–19, 2020. Computer Vision Foundation/IEEE, pp. 3437–3447.

36. Anderson, Fernando, Johnson, Gould. SPICE: Semantic propositional image caption evaluation. In: Leibe B, Matas J, Sebe N, Welling M (Eds). 14th European Conference, Amsterdam, The Netherlands, Oct. 11–14, 2016. Springer, pp. 382–398.

37. Liu, Zhu, Ye, Guadarrama, Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In: IEEE International Conference on Computer Vision, Venice, Italy, Oct. 22–29, 2017. IEEE Computer Society, pp. 873–881.

38. Vedantam, Zitnick, Parikh. CIDEr: Consensus-based image description evaluation. In: IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, June 7–12, 2015. IEEE Computer Society, pp. 4566–4575.

39. Rennie, Marcheret, Mroueh, et al. Self-critical sequence training for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 7008–7024.

40. Zhang, Sung, Liu, et al. Actor-critic sequence training for image captioning. arXiv preprint arXiv:1706.09601. 2017.

41. Bahdanau, Brakel, Xu, et al. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086. 2016.

42. Ren, Wang, Zhang, Lv, Li. Deep reinforcement learning-based image captioning with embedding reward. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 21–26, 2017. IEEE Computer Society, pp. 1151–1159.

43. Liu, Xu, Zhang, Nie, Su, Zhang. Multi-level policy and reward reinforcement learning for image captioning. In: Lang J (Ed.). Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, July 13–19, 2018, Stockholm, Sweden, pp. 821–827.

44. Lu, Li, Zhang, et al. Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering. arXiv preprint arXiv:1711.06794. 2017.

45. Fukui, Park, Yang, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847. 2016.

46. Lu, Yang, Batra, Parikh. Hierarchical question-image co-attention for visual question answering. In: Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R (Eds). Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, Dec. 5–10, 2016, Barcelona, Spain, pp. 289–297.

47. Lin CY. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, Barcelona, Spain. Association for Computational Linguistics; 2004, pp. 74–81.

48. Papineni, Roukos, Ward, Zhu. Bleu: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6–12, 2002, Philadelphia, PA, USA. Association for Computational Linguistics, pp. 311–318.

49. Plummer, Wang, Cervantes, Caicedo, Hockenmaier, Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: 2015 IEEE International Conference on Computer Vision, Santiago, Chile, Dec. 7–13, 2015. IEEE Computer Society, pp. 2641–2649.

50. Banerjee, Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Goldstein J, Lavie A, Lin C, Voss CR (Eds). Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005. Association for Computational Linguistics, pp. 65–72.

51. Chen, Zhang, Xiao, Nie, Shao, Liu W, et al. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In: IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 21–26, 2017. IEEE Computer Society, pp. 6298–6306.

52. Anderson, He, Buehler, Teney, Johnson, Gould, et al. Bottom-up and top-down attention for image captioning and visual question answering. In: IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, June 18–22, 2018. Computer Vision Foundation/IEEE Computer Society, pp. 6077–6086.

53. Gao, Wang, Ma, Gao. Self-critical N-step training for image captioning. In: IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, June 16–20, 2019. Computer Vision Foundation/IEEE, pp. 6300–6308.

54. Gu, Cai, Wang, Chen. Stack-captioning: Coarse-to-fine learning for image captioning. arXiv preprint arXiv:1709.03376. 2017.

55. Pu, Gan, Henao, Yuan, Li, Stevens, et al. Variational autoencoder for deep learning of images, labels and captions. In: Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R (Eds). Annual Conference on Neural Information Processing Systems, Dec. 5–10, 2016, Barcelona, Spain, pp. 2352–2360.

56. Gu, Wang, Cai, Chen. An empirical study of language CNN for image captioning. In: IEEE International Conference on Computer Vision, Venice, Italy, Oct. 22–29, 2017. IEEE Computer Society, pp. 1231–1240.

57. Chen, Ji, Sun, Wu, Su. GroupCap: Group-based image captioning with structured relevance and diversity constraints. In: IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, June 18–22, 2018. Computer Vision Foundation/IEEE Computer Society, pp. 1345–1353.

58. Aneja, Deshpande, Schwing. Convolutional image captioning. In: IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, June 18–22, 2018. Computer Vision Foundation/IEEE Computer Society, pp. 5561–5570.