Generating research of image caption based on improved NIC algorithm

Abstract

The image caption generation algorithm allows computer to understand the picture and generate sentences that comply with grammar rules and picture features. Under the Encoder-Decoder framework, the CNN (Convolutional Neural Networks) model is widely used as an encoder to extract image features and the RNN (Recurrent Neural Networks) model as a decoder to generate the description sentence to solve the problem of image caption generation. The most famous algorithm is the NIC, which used Inception-v3 as the encoder, and the LSTM (Long Short-term Memory) as the decoder. However, there are too many parameters in LSTM, and the quality of generated sentences is not high. In the field of visual features, deepening the network structure can improve the feature extraction ability, but the network will degenerate. Therefore, the NIC algorithm is improved. The Inception-ResNet-v2 network is used as the encoder, and the LSTMP network is introduced as the decoder. Taking BLUE-4, ROUGE, METEOR, and CIDEr as evaluation indicators, MSCOCO and Flickr30k are used as datasets to make comparative test between the NIC and the improved NIC. Experimental results show that the improved NIC algorithm outperforms the NIC algorithm in all four evaluation indicators.

Keywords

Image caption Inception-ResNet-v2 ResNet LSTMP

1. Perface

Image processing techniques use computers to analyze images to achieve the desired results, and can be applied to the image retrieval, target detection, semantic segmentation, image classification, image enhancement and restoration. Natural language processing (NLP) is an important direction of the Artificial Intelligence. It is a field of interaction between computer science, mathematics, and human languages. It is the study and realization of various theories and methods for effective communication between humans and computers with natural language, including semantic analysis, information extraction, text mining, machine translation, information retrieval, dialogue systems.

Given a picture, the technology of learning and accurately extracting the image content with a computer and generating the proper text description is called image caption generation technology, which includes image processing technology and natural language processing (NLP) technology. Among them, the accurate extraction of image content includes image type characteristics, color characteristics, position characteristics, and so on. To convert the image content into corresponding text, it is necessary to realize the mapping relationship between the image content and the text features. For humans, people can easily learn the content and meaning of images based on their own thinking and analysis. But it is difficult for computers. Computers not only need to identify the objects in the images, but also need to be able to understand the various characteristics, attributes of the objects and the relationships between multiple objects, and then need to generate information that humans can understand and meets grammar rules. Early image caption generation algorithms, such as those based on templates and language parsing, have great limitations. Nowadays, image description algorithms generally draw on the research ideas of machine translation algorithms [1, 2], and use Encoder-Decoder structure. Firstly, the CNN model will be used to encode the image into a fixed-length feature vector. Then, the RNN model generates the description language. This method could generate high quality and flexible sentences. The most essential task of the image caption algorithm is to correctly describe the content of the image, and it is helpful for the visually impaired people [3]. It can also be used for image retrieval, intelligent robot interaction, and so on.

$I$ is an image and $S$ is the image caption sentence. Establishing a model to implement $I\to S$ mapping, and maximizing the probability $p(S|I)$ to generate a description sentence that are closest to the image. The most famous algorithm in the field of image description generation is the NIC algorithm [4]. The NIC algorithm uses the Tensor Flow framework, drawing on the research ideas of machine translation algorithms, based on Encoder-Decoder architecture. The deep convolutional network Inception-v3 is used as an encoder for image feature extraction [5], and the LSTM network is used as a decoder for generating a textual description of the image encoding content.

In the field of computer vision, by deepening the depth of the network, better image feature extraction results can be achieved. But there is a problem of gradient disappearance or gradient explosion, resulting in the network can not converge. This problem can be solved by the normalization method. However, at this time, the network begins to degenerate, that is, increasing the depth of the network leads to greater errors. This problem can be solved by residual network (ResNet). Compared to Inception-v3 network, Inception-ResNet-v2 network structure is more complex, but it has better image feature extraction capabilities, and can avoid the degradation caused by the deepening of the network. So we use Inception-ResNet-v2 as an encoder. LSTMP is introduced as a decoder for generating textual descriptions of image encoding content to solve the problems that there are many parameters in LSTM and the quality of generation caption sentence is not high.

2. Related work

The image caption generation algorithm mainly includes four methods: based on templates, language-based analysis, based on common embedding ideas and neural network.

Template-based and language-based analysis methods are the early image-digest generation algorithm. Frahadi et al. uses a template-based triple model to generate image caption [6]; Kulkarni et al. proposed a three-dimensional graphical model [7]. The model uses a detector and combines conditions to determine the objects, attributes, etc. in the image and then uses the template to generate text. The method based on the template is relatively simple, but because the sentence template is invariant, the generation sentence structure is single and fluency is poor. The algorithm model based on semantic analysis [8, 9, 10] can also describe the image content to a certain extent, but the generated description statements are still relatively rigid.

Based on the idea of co-embedding [11, 12, 13], images and text are mapped together into the same vector space. For the image queries, the description of the closest image is retrieved in the vector space. This method requires a large number of manual annotation statements. The training set must satisfy a variety of conditions and the workload is very large.

The neural network methods have a good effect on image summary generation. Mao et al. proposed m-RNN neural network model to generate image description [14]. The m-RNN uses convolutional neural network to extract image features, and uses recurrent neural network RNN to predict the next word. Xu et al. introduced the adaptive attention mechanism into the neural network, segmented the image into equal-sized image blocks, and used the features of each extracted image block as the input to the LSTM [15]. It can learn the local feature of image. Kiros et al. used a powerful deep convolutional neural network and a long and short-term memory network to jointly construct a multi-mode annotation method [16]. Donahue et al. proposed an end-to-end LRCN model for image caption generation and video description tasks [17]. Jia et al. used the g-LSTM model to generate image caption that added semantic information to each LSTM block [18].

3. Models

3.1 NIC algorithm model

3.1.1 CNN model

CNN uses a series of operations such as convolution, pooling, and activation functions. The convolution operation fully utilizes the information of the neighboring areas in the picture, reduces the size of the parameter matrix through sparse connections and weight sharing, and greatly improves the convergence speed. Convolution can extract various features in the image. Pooling can sample the features obtained after convolution. The activation operation non-linearly transforms features after convolution and pooled. The commonly used activation function are RELU, PRELU [19]. With the development of deep learning, models like AlexNet [20], VGGNet [21], GoogLeNet [22] and others have emerged.

3.1.2 LSTM model-based NIC algorithm

RNN adopts BPTT algorithm to update parameters. However, as the time step increases, the gradient of subsequent nodes gradually decreases in the process of back propagation, and it is difficult to update the previous nodes effectively. Therefore, the prediction accuracy is poor.

The Long Short-Term Memory is a variant of the RNN model and it solves the problem of the decrease in the perceived ability of distant nodes due to gradient dispersion in traditional RNNs. In each time step of LSTM, there are multiple gates, including Input Gate, Forget Gate, and Output Gate. The ratio of the passing information can be controlled by sigmoid function. Based on the original short-term memory unit, the LSTM model adds a memory unit to maintain long-term memory. With the deepening of research, there are many improvements in LSTM, such as GRU [23] and peephole-LSTM [24]. Ignoring bias, the internal definition and update methods are shown as follows:

$\displaystyle\left\{\begin{array}[]{ll}f_{t}&=\sigma(W_{fx}x_{t}+W_{fm}m_{t-1}% )\\ f_{t}&=\sigma(W_{fx}x_{t}+W_{fm}m_{t-1})\\ o_{t}&=\sigma(W_{ox}x_{t}+W_{om}m_{t-1})\\ c_{t}&=f_{t}\odot c_{t-1}+i_{t}\odot\tanh(W_{cx}x_{t}+W_{cm}m_{t-1})\\ m_{t}&=o_{t}\odot c_{t}\end{array}\right.$ (1)

The process of image caption generation using the LSTM network is an Encoder-Decoder process. Encoder maps an image to a vector representation. Decoder is the process of transforming a feature into a sentence statement based on image features. The specific process is shown in Fig. 1.

Figure 1.

NIC algorithm model.

3.2 Improved algorithm model

Encoding the image into feature vectors, then mapping the annotations of the image and the image feature vectors into the same dimension to generate an image caption. Using Eq. (2) to transform the image caption problem into a model of the maximum probability optimization problem:

$\displaystyle\theta^{*}=\arg\underset{\theta}{\max}\sum_{(I,S)}\log p(S|I,\theta)$ (2)

where, $\theta$ is the parameter of the model, $I$ represents the image, $S=(S_{0},S_{1},\ldots,S_{N})$ represents the description sentence of the image, and the length of $S$ is not fixed. This problem can be further converted to solve the joint probability based on the $(S_{0},S_{1},\ldots,S_{N})$ chain rule:

$\displaystyle\log P(S|I,\theta)=\frac{1}{N}\sum_{t=0}^{N}\log p(S_{t}|I,\theta% ,S_{0},\ldots,S_{t-1})$ (3)

During training, ( $S, I$ ) constitutes a training sample pair, and the gradient descent method is used on the training set to optimize the sum of the log probabilities in Eq. (3). $h_{t}$ denotes the LSTMP implicit state layer, which is determined by the input data $x_{t}$ at the time $t$ and the implicit state layer $h_{t-1}$ at the time $t-1$ , as shown in Eq. (4).

$\displaystyle h_{t}=\textit{LSTMP}(h_{t-1},x_{t})$ (4)

where, $x_{t}$ represents the input data in which the image and the word are combined in a vector form at the same time.

Figure 2.

The improved algorithm model.

The improved algorithm model is shown in Fig. 2. The training phase and the prediction phase are different. In the training phase, the image features and the vectorized words need to be input into the LSMTP network at the same time and the input of the word comes from the actual sentence. In the prediction phase, the image features need to be input into the network and the words information are generated by the algorithm itself, and then input the generation words into the algorithm again.

In the training phase, $(S_{0},S_{1},\ldots,S_{N})$ is a description sentence sequence corresponding to the image data $I$ , $S_{0}$ represents the beginning character of the sentence, and $S_{N}$ represents the ending character of the sentence. $x_{-1}=\textit{CNN}(I)$ is the image feature data which is inputted to the LSMTP at time $-$ 1. Here CNN stands the Inception-ResNet-v2 network. All LSMTP networks share weights, the formula at moment $t$ is:

$\displaystyle x_{t}=W_{e}S_{t}\quad t\in\{0,1,\ldots,N-1\}$ (5) $\displaystyle p_{t+1}=\textit{LSTMP}(x_{t})\quad t\in\{0,1,\ldots,N-1\}$ (6)

where, $W_{e}$ is a random two-dimensional matrix that vectorizes words. One-hot encoding is used to represent the words in the description sentence as vectors, and then input LSMTP network together with the extracted image features. Image data is only input to the LSMTP network at time $-$ 1. Practice has proven that putting the image information to the network multiple times at different time nodes will cause the network to generate more noise and overfitting problem, ultimately reducing the accuracy of the prediction results. And to minimize the negative log-likelihood function of the objective function, to adjust the parameters of the network model, the loss function is Eq. (7)

$\displaystyle L(I,S)=-\sum_{t=1}^{N}\log p_{t}(S_{t})$ (7)

In the test phase, a caption statement is generated by using the BeamSearch method. The beam size is 20, and Top $K$ candidate statements are kept at each time. Generating Top $K$ candidate sentences at time $t+1$ base on the top $K$ candidate sentences at time $t$ , and repeating this process until the end of the search. Selecting the candidate sentence with the highest probability as the image caption sentence.

3.2.1 ResNet

Resident network (ResNet) is proposed by the He et al. [25]. When transmitting information on the general convolutional layer or on the fully connected layer, information is lost to some extent. The residual network is a better solution for this problem. Information is transferred from the input to the output directly. The network only needs to learn the difference between input and output. The residual network changes the learning objectives, reduces the difficulty of learning, and avoids the problem of the disappearance of gradients and the decrease of accuracy rate as the number of network layers increase. Finally it improves the training accuracy of the network.

$x$ is the input of the network and $H(x)$ is the desired actual mapping. If optimizing $H(x)$ is difficult, you can fit another mapping relationship (residual function) $F(x)=H(x)-x$ through a stacked multilayer nonlinear network. If $F(x)$ is easier to optimize than $H(x)$ , then the actual mapping can be expressed as $H(x)=F(x)+x$ . In this way, ResNet is no longer to obtain a complete output, but only to obtain the difference between output and input. The difference is defined as the Residual Unit, as shown in Fig. 3.

Figure 3.

Residual unit.

Figure 4.

Inception-ResNet-v2 structure.

3.2.2 Inception-ResNet-v2

In the area of feature extraction, Google has proven that Inception-ResNet-v2 is a more effective deep learning network than Inception-v3 [26], and it is also a evolution result based on the Inception-v3 network. The depth of Inception-v3 is 42, but Inception-ResNet-v2 has a deeper structure than Inception-v3. The most important change is the introduction of residuals in each Inception module. Compared to Inception-v3, the Inception module of Inception-ResNet-v2 contains fewer parallel towers, and each Inception uses a 1 $\times$ 1 convolution with no activation function for dimensional expansion to implement the input match. The overall architecture model of Inception-ResNet-v2 network given by Google is shown in Fig. 4.

Incorporating the residual connection in all Inception modules can effectively solve the problem of network degradation, speed up the training speed of the network, and improve the extraction accuracy of image features. Before adding the residual module to the active layer, scaling the residual module with the residual scaling factor of [0.1, 0.3] can effectively stabilize the training process.

The input data is multiple batch packets which containing 32 RGB pictures of 299 $\times$ 299. We use the 8 $\times$ 8 mean pooling layer, use dropout to prevent overfitting, and use the BN algorithm only in the underlying convolution.

3.3 LSTMP

A standard LSTM network has many memory blocks. Each memory block has a memory cell. The number of memory cells is $n_{c}$ , the number of input cells is $n_{i}$ , and the number of output cells is $n_{o}$ , ignoring the offset, then the parameters number of the LSTM network is:

$\displaystyle N=n_{c}\times n_{c}\times 4+n_{i}\times n_{c}\times 4+n_{c}% \times n_{o}+n_{c}\times 3$ (8)

When the number of LSTM network input units is moderate, the network size and computational complexity are determined by $n_{c}\times n_{c}\times 4+n_{c}\times n_{o}$ . If a certain training task requires many $n_{c}$ and $n_{o}$ , the network will be very computationally intensive.

The model is based on the LSTM structure and has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. This operation changes the number of model parameters [27, 28]. Let the number of recurrent units be $n_{r}$ , then the total LSTMP parameters is:

$\displaystyle N=n_{c}\times n_{r}\times 4+n_{i}\times n_{c}\times 4+n_{r}% \times n_{o}+n_{c}\times n_{r}+n_{c}\times 3$ (9)

The size and computational complexity of the LSTMP model are determined by $n_{r}\times n_{c}\times 4$ $+n_{r}$ $\times n_{o}$ , and the number of parameters is reduced by controlling the value of $n_{r}/n_{c}$ . Letting $n_{r}<n_{c}$ , we can increase the number of memory cells ( $n_{c}$ ) and control the number of parameters in recurrent connections and the output layer. LSTMP speeds up network convergence by reducing the amount of parameters, and can effectively increase the accuracy of generated description statements from the results. The internal update formula for the LSTMP network is:

$\displaystyle\left\{\begin{array}[]{ll}i_{t}&=\sigma(W_{ix}x_{t}+W_{ir}r_{t-1}% )\\ f_{t}&=\sigma(W_{fx}x_{t}+W_{fr}r_{t-1})\\ c_{t}&=f_{t}\odot c_{t-1}+i_{t}\odot\tanh(W_{cx}x_{t}+W_{cr}r_{t-1})\\ o_{t}&=\sigma(W_{ox}x_{t}+W_{or}r_{t-1})\\ m_{t}&=o_{t}\odot c_{t}\\ r_{t}&=W_{rm}m_{t}\\ p_{t+1}&=\text{Softmax}(W_{ym}r_{t})\end{array}\right.$ (10)

where $\sigma$ , is the sigmoid function, $W$ defines the weight matrix, $i$ is the input gate, $f$ is the forget gate, $o$ is the output gate, and $c$ is the memory cell.

Table 1

Datasets

Dataset name	Train	Valid	Test
MSCOCO	82783	5000	5000
Flickr30k	28000	1000	1000

Figure 5.

LSTMP network structure.

The LSTMP network is an improvement over the LSTM network. The LSMTP structure is shown in Fig. 5.

4. Experiments and results

4.1 Datasets

We use MSCOCO and Flickr30k datasets. The MSCOCO data has 82,783 training set and 40,504 validation set. There is no test set. So we select 5000 images as validation set, and another 5000 images as test set from the validation set. Flickr30k has 28000 training data, 1000 validation sets and 1000 test sets respectively. The final size of the dataset is shown in Table 1. Each image has 5 personally labeled statements, as shown in Fig. 6.

Figure 6.

Example pictures and description of dataset.

4.2 Dataset preprocessing

This article uses the MSCOCO datasets as an example to briefly explain the pretreatment of datasets. LSTMP network parameters, Inpcetion-ResNet-v2 network parameters and word vectorization parameters together constitute the total parameters of the improved model. There are 29,415 words in the MSCOCO dataset. But have great differences in the frequency of the description of the pictures. Therefore, all the words appearing less than 5 times are filtered. Finally, there are 11,519 words after filtering.

After training, without modifying the visual content, slight modification of any image in the training set may cause output that differs greatly from the original training image output. In order to enhance the model’s ability to deal with this phenomenon, it is necessary to modify the image data randomly during the training process, including aspects such as hue, size, position, and gray level. The input data of the Inception-ResNet-v2 is an RGB image of 299 $\times$ 299. In order to ensure that the input image size is 299 $\times$ 299, all images are scaled to the size of 346 $\times$ 346, and then we randomly cut out 299 $\times$ 299 size image on it. The Distort_image function defines a number of random image content modification operations that can achieve random image flips, modify image content brightness, saturation, hue, and so on.

Plenty of images are inputted into the network for training. If a single sample is used as the basic unit to correct the entire dataset gradually with reverse gradient method, the training speed will be slow or even have a problem of gradient divergence. Therefore, firstly, we transform input data into a batch package with batch_with_dynamic_pad function, then enter the batch package into network for training. In the prediction, the test data is not large, so the test data is inputted into the network with the placeholder method, and the images’ annotations need not to be inputted.

4.3 Evaluation methods

This article uses four evaluation methods: BLUE-4, ROUGE, METEOR, and CIDEr.

4.3.1 BLUE-4 evaluation method

BLUE is an accuracy-based similarity measurement method to analyze the number of matching n-grams between the reference statement and the generated statement. The ratio of the number of matches to the number of n-grams is used as the matching similarity. However, the word matching statistical method used by BLUE is independent of position, and it is difficult to measure the semantic information related to order.

The specific formula is:

$\displaystyle\textit{BLUE}=\textit{BP}\bullet\exp\left(\frac{1}{N}\sum_{1}^{N}% \log p_{n}\right)$ (11)

where, $p_{n}=\frac{\sum\textit{count}_{\textit{clip}}(n-\textit{gram})}{\sum\textit{% count}_{c}(n-\textit{gram})}$ . $c$ represents a generated sentence, $\textit{count}_{c}(n-\textit{gram})$ represents the number of n-grams in the generated sentence, and $\textit{count}_{\textit{clip}}(n-\textit{gram})$ represents the number of n-gram matches between the generated sentence and the reference sentence.

$\displaystyle\textit{BP}=\left\{\begin{array}[]{ll}1,&s_{c}>s_{r}\\ \exp\left(1-\frac{s_{r}}{s_{c}}\right),&s_{c}\leqslant s_{r}\end{array}\right.,$

where, $s_{c}$ represents the length of the generated sentence, $s_{r}$ represents the length of the reference sentence. In the n-gram, $n$ represents the length of the phrase, the desirable values are 1, 2, 3, 4 and $n=$ 4 is selected in this paper.

4.3.2 ROUGE evaluation method

ROUGE uses the recall rate as an indicator. The basic idea is using the n-gram contribution statistics between the generated sentence and the reference sentence as the judgment. The formula is as follows:

$\displaystyle\textit{ROUGE}=(1+\theta^{2})SP/(S+\theta^{2}P)$ (12)

where, $S=l/m$ , $P=l/n$ , among them, $\theta$ represents the weight parameter, $S$ is the coverage rate, $P$ is the matching precision between the generated statement and the reference statement, $m$ is the length of the generated statement, $n$ is the length of the reference statement, and $l$ is the maximum length of the common string sequence in subsequences between the generated statement and the reference statement.

4.3.3 METEOR evaluation method

METEOR finds the matching maximum values of words between the generated statement and the reference statement according to the exact match, root match, and synonym match. When the maximum value exists the same value, the dynamic programming algorithm is used to regard the matching with the least number of crossover times as “alignment”, and continue to add it to the alignment set. Then using the ratio of the number of the elements in the set to the total number of words in the reference sentence as the recall rate, and using the ratio of the number of the elements in the set to the total number of words in the generated sentence as the accuracy rate. Finally using the harmonic mean value calculates the final value. $P_{m}$ is the accuracy rate, and $R_{m}$ is the recall rate. The specific formula is as follows:

$\displaystyle\textit{METEOR}=(1-\textit{Pe})\frac{P_{m}R_{m}}{\alpha P_{m}+(1-% \alpha)R_{m}}$ (13)

where, $\alpha$ , $\gamma$ , $\theta$ are the default parameters of METEOR, $\textit{Pe}=\gamma\left(\frac{ch}{m}\right)^{\theta}$ represents the penalty factor, $m$ is the number of alignments in the alignment set, and $c h$ is the total number of words in all substrings that are consecutively matched in the generated statement and the reference statement.

4.3.4 CIDEr evaluation method

CIDEr is proposed for image description problems. The eigenvectors of the sentences are constructed by counting the word frequency information of all the image description samples, and then the eigenvectors are measured to represent the degree of similarity between the two description sentences. The specific formula is:

$\displaystyle\textit{CIDEr}(c_{i},s_{i})=\frac{1}{mN}\sum_{n=1}^{N}\sum_{j}% \frac{g(c_{i})\bullet g(s_{ij})}{\|g(c_{i})\|\|g(s_{ij})\|}$ (14)

where, the $c_{i}$ donates the evaluated statement, the $s_{i}$ denotes the corresponding reference statement set and the $s_{i}$ contains $N$ statements $s_{ij}$ , $g(c_{i})$ and $g(s_{ij})$ denote the feature vectors of the evaluated statement and the reference statement respectively.

The larger the value of these four evaluation methods is, the better the quality of the generated cation sentence will be.

4.4 Experimental results

4.4.1 Generate statements and analysis

Some examples of the image generation description sentences on MSCOCO and the Flickr30k datasets are shown in Fig. 7. For each image, the first sentence is generated by the NIC algorithm, and the second sentence is generated by the improved algorithm. The content of the first and fourth pictures is relatively simple, and the generation sentences almost have no difference in the meaning of the picture. About the second picture, the word of “with” in the statement generated by the NIC algorithm doesn’t clearly express the action between the dog and the teddy bear. But the statement generated by the improved algorithm in this paper vividly describes the dog “holding” the teddy bear and the phrase of “laying on the bed” is a more detailed description about the dog’s movement. In the third picture, the second sentence contains a phrase of “big-eyed” that is a description about the size of boy’s eyes. In addition, “put a toothbrush at his mouth” is obviously more consistent than the “holding a toothbrush in his hand” about the picture’s meaning. In the fifth picture, the second statement contains a color word of “blonde” and feature phrase of “his mouth wide open”. These are detailed descriptions of the boy. About the sixth picture, the word of “red” in the sentence generated by the improved algorithm is a description about the character’s clothes and “others laughing” is a more accurate description of the surrounding scene. By observing and comparing Fig. 7, we can find that the sentences generated by the NIC algorithm and the improved algorithm can meet the basic meaning of the image. But the semantic of the description sentence generated by the improved algorithm is more abundant. The improved algorithm can describe more content about the image, and describe the details, including actions, colors, features, scenes and so on. These information is more accurately. So the improved algorithm is more effective.

Table 2
Evaluation results on MSCOCO dataset

Algorithm	BLUE-4	ROUGE	METEOR	CIDER
NIC	27.6	52.3	23.5	92.6
Improved NIC	32.3	55.4	25.4	98.2

Table 3

Evaluation results on Flickr30k dataset

Algorithm	BLUE-4	ROUGE	METEOR	CIDER
NIC	22.3	40.3	18.5	88.7
Improved NIC	26.4	43.5	20.7	93.6

Figure 7.

Generating option examples.

4.4.2 Statistical results and analysis

The evaluation results of the NIC algorithm and the improved NIC algorithm on MSCOCO and Flickr30k datasets are shown in Tables 2 and 3 respectively.

By comparison, it can be found that with the same dataset, the values obtained by the improved NIC algorithm in four evaluation indicators are higher than the NIC algorithm. On the MSCOCO dataset, the improved algorithm is 4.7, 3.1, 1.9, and 5.6 higher than the NIC algorithm respectively. On the Flickr30k dataset, the improved algorithm is 4.1, 3.2, 2.2, and 4.9 higher than the NIC algorithm respectively. Comparing Tables 2 and 3, we can find that both the improved algorithm and the NIC algorithm have higher evaluation results on the MSCOCO dataset than the Flickr30k dataset. Because MSCOCO training set is far more than Flickr30k’s and the more datasets are, the stronger learning ability and the better the generalization ability will have. Also the accurate of the prediction results will be. These results also show that the improved algorithm is correct and more effective.

5. Conclusion

The image caption generation algorithm combines image feature extraction and natural language processing technique. The image and its annotations are used as the input of the algorithm to generate sentences that comply with grammar rules and image features. This study is very challenging due to the large scale of the network and the size of the dataset. The Encoder-Decoder framework is a very good deep learning framework. The LSTM as a generator is widely used in machine translation and image description. This article uses Inception-ResNet-v2 as an encoder and LSTMP as a decoder, uses MSCOCO and Flickr30k as datasets respectively, and uses BLUE-4, ROUGE, METEOR, and CIDEr as evaluation indicators, and compares it with NIC algorithm. The results show that the improved algorithm has a good image description generation effect. In the next step, we will continue to try to improve the encoder and generator networks and look forward to improving the effect of generating image caption statements.

References

Cho

Merrienboer

B.V.

Gulcehre

et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, Computer Science (2014).

Sutskever

Vinyals

and Le

Q.V.

, Sequence to sequence learning with neural networks, 4 (2014), 3104–3112.

Vinyals

Toshev

Bengio

et al., Show and tell: A neural image caption generator, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 3156–3164.

Vinyals

Toshev

Bengio

et al., Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4) (2016), 652–663.

Szegedy

Vanhoucke

Ioffe

et al., Rethinking the inception architecture for computer vision, Computer Science (2015), 2818–2826.

Farhadi

Hejrat

Sadeghi

M.A.

et al., Every picture tells a story: Generating sentences from images, in: European Conference on Computer Vision, Springer-Verlag, 2010, pp. 15–29.

Kulkarni

Premraj

Ordonez

et al., BabyTalk: Understanding and generating simple image descriptions, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2011, pp. 1601–1608.

Mitchel

Han

Dodge

et al., Midge: Generating image descriptions from computer vision detections, in: Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2012, pp. 747–756.

Aker

and Gaizauskas

, Generating image descriptions using dependency relational patterns, in: Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 1250–1258.

10.

Kuznetsova

Ordonez

Berg

A.C.

et al., Collective generation of natural image descriptions, in: Meeting of the Association for Computational Linguistics: Long Papers, Association for Computational Linguistics, 2012, pp. 359–368.

11.

Hodosh

Young

and Hockenmaier

, Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research 47(1) (2013), 853–899.

12.

Gong

Wang

Hodosh

et al., Improving image-sentence embeddings using large weakly annotated photo collections, Computer Vision – ECCV 2014, Springer International Publishing, 2014, pp. 529–545.

13.

Ordonez

Kulkarni

and Berg

T.L.

, Im2Text: Describing images using 1 million captioned photographs, in: International Conference on Neural Information Processing Systems, Curran Associates Inc., 2011, pp. 1143–1151.

14.

Mao

Yang

et al., Deep captioning with multimodal recurrent neural networks (m-RNN), Eprint Arxiv (2014).

15.

Kiros

Show et al., attend and tell: Neural image caption generation with visual attention, Computer Science (2015), 2048–2057.

16.

Kiros

Salakhutdinov

and Zemel

R.S.

, Unifying visual-semantic embeddings with multimodal neural language models, Computer Science (2014).

17.

Donahue

Anne Hendricks

Guadarrama

et al., Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.

18.

Jia

Gavves

Fernando

et al., Guiding the long-short term memory model for image caption generation, in: IEEE International Conference on Computer Vision, IEEE Computer Society, 2015, pp. 2407–2415.

19.

Zhang

Ren

et al., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, (2015), 1026–1034.

20.

Krizhevsky

Sutskever

and Hinton

G.E.

, ImageNet classification with deep convolutional neural networks, in: International Conference on Neural Information Processing Systems, Curran Associates Inc., 2012, pp. 1097–1105.

21.

Simonyan

and Zisserman

, Very deep convolutional networks for large-scale image recognition, Computer Science (2014).

22.

Szegedy

Liu

Jia

et al., Going deeper with convolutions, CVPR (2015).

23.

Cho

Merrienboer

B.V.

Gulcehre

et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, Computer Science (2014).

24.

Gers

F.A.

and Schmidhuber

, Recurrent nets that time and count, in: IEEE-Inns-Enns International Joint Conference on Neural Networks, IEEE Computer Society, 2000, pp. 3189.

25.

Zhang

et al., Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2016, pp. 770–778.

26.

Szegedy

Ioffe

and Vanhoucke

et al., Inception-v4, Inception-ResNet and the impact of residual connections on learning, 2016.

27.

Sak

Senior

and Beaufays

, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, Computer Science (2014), 338–342.

28.

Sak

Senior

and Beaufays

, Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition, arXiv preprint arXiv:14021128, (2014).

Generating research of image caption based on improved NIC algorithm

Abstract

Keywords

1. Perface

2. Related work

3. Models

3.1 NIC algorithm model

3.1.1 CNN model

3.1.2 LSTM model-based NIC algorithm

3.3 LSTMP

4.1 Datasets

4.3 Evaluation methods

4.3.1 BLUE-4 evaluation method

4.4.1 Generate statements and analysis

Table 2 Evaluation results on MSCOCO dataset

5. Conclusion

References

Table 2
Evaluation results on MSCOCO dataset