Abstract
The current state of the art for image annotation and image retrieval tasks is obtained through deep neural network multimodal pipelines, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding (FNE) in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale discrete representation of images, which results in richer characterisations. Extensive testing is performed on three different datasets comparing the performance of the studied variants and the impact of the FNE on a levelled playground, i.e., under equality of data used, source CNN models and hyper-parameter tuning. The results obtained indicate that the Full-Network embedding is consistently superior to the one-layer embedding. Furthermore, its impact on performance is superior to the improvement stemming from the other variants studied. These results motivate the integration of the Full-Network embedding on any multimodal embedding generation scheme.
Keywords
Introduction
One of the main challenges of the semantic web is vagueness, the difficulty of representing imprecise concepts. An increasing trend in the community is to use vector representations of vague concepts. Vector representations allow for the evaluation of concepts’ similarity simply by computing a vector distance. No less important is the possibility of obtaining these vector representations automatically. The use of automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web [7].
Deep learning methods are representation learning techniques which can be used to generate such vectors. The models obtained from these methods are composed of multiple processing layers that learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection and many other domains [23]. The use of deep learning vector embeddings to represent words has had a substantial impact in many natural language processing tasks [6]. Similarly, deep learning image embeddings have shown great generalisation capabilities, even between distant domains [13]. In this regard, we argue that the semantic web can significantly benefit from the use of deep learning-based embeddings.
In this paper we focus on multimodal pipelines, which tackle two problems in parallel. First, the problem of obtaining a semantically meaningful embedding of an image representing a scene. Second, the problem of obtaining a visually meaningful embedding of a sentence describing a scene. This is done through the construction of a joint embedding, representing both modalities: an image of a scene, and a caption describing it.
The joint embedding constructed can be used to correlate images with sentences easily. As an example, imagine an e-commerce website where sellers upload images of the product to sell. Some sellers may add a very accurate textual description too, while others’ descriptions may be incomplete, inaccurate or missing altogether. On the other side, buyers search for the desired product by writing a free-text description of it. The use of a multimodal embedding as proposed can help to link the textual information provided by the buyer with the product that best matches it, regardless of whether an accurate description for that individual product was provided or not. In general, this approach can be used to automatically include representations of uncaptioned images in a semantic web.
The proposed methodology can also have an impact on semantic web technologies in the disambiguation of vague semantics. Take for instance the concept sports car. Certainly, the speed limit of the car is a key factor to define the concept, but there are many high-end cars with very high speed limits which we would not consider a sports car. The main difference between them and a sports car is that they do not look like a sports car. In this case, the comparison with a sports car visual embedding can be a key element in the definition of the concept.
Information retrieval is a natural way to assess the quality of joint embedding methods [14]. Image annotation (also known as caption retrieval) is the task of automatically associating an input image with a describing text. The complementary task of associating an input text with a fitting image is known as image retrieval or image search.
State-of-the-art image annotation methods are currently based on deep neural network representations, where an image embedding (e.g., obtained from a convolutional neural network or CNN) and a text embedding (e.g., obtained from a recurrent neural network or RNN) are combined into a unique multimodal embedding space. While several techniques for merging both spaces have been proposed in the past [10,11,15,17,20–22,26,29,34,37], little effort has been made in finding the most appropriate image embeddings to be used in that process. In fact, most approaches use a straight-forward one-layer CNN embedding [8,32], and the only method proposed to increase the quality of the image embedding relies on obtaining more data to allow for fine-tuning the CNN in the final stage of training [11].
The main goal of this paper is to explore the impact of using a Full-Network embedding (FNE) [13] to generate the image embedding required by multimodal pipelines, replacing the standard one-layer embedding. We do so by integrating the FNE into the multimodal embedding pipeline defined in Unifying visual-semantic embeddings with multimodal neural language models (UVS) [21]. This pipeline is based on the use of a Gated Recurrent Unit (GRU) neural network [5] for text encoding and a single-layer CNN embedding for image encoding. Unlike one-layer embeddings, the FNE represents features of varying levels of abstraction by integrating information from different layers of the CNN. This particularity results in a richer visual embedding space, which may be more reliably mapped to a shared visual-textual representation. Furthermore, we hypothesise that the FNE discretisation (to 3 values with contextual implications) makes for a more natural mapping to a linguistic representation of concepts than using a regular real-valued embedding.
The generic pipeline defined by Kiros et al. [21] had been outperformed in image annotation and image search tasks by methods specifically targeting either one of those tasks [9,22]. However, more recent work by Vendrov et al. [37] and Faghri et al. [11], based on the same generic pipeline, has outperformed previous methods in both tasks, which shows the potential of the approach. This paper extends our previous work [39] by integrating and thoroughly evaluating the improvements proposed by Vendrov et al. [37] and Faghri et al. [11]. Additionally, some hindrances found in the method of Faghri et al. [11] are studied, and a methodology for solving them is proposed which also increases performance.
We report the consequential improvements in our implementation, which increase the performance of the original method [21] as well. Finally, we exhaustively test the main variations on a levelled playground, obtaining insights on the real impact on performance of each of them. Indeed, properly assessing the sources of empirical gains is a key aspect in research that should be further encouraged [25]. Evaluation is done using three publicly available datasets: Flickr8K [30], Flickr30K [42] and MSCOCO [24].
To sum up, the contributions of this paper are:
Integration of the FNE into the generic pipeline defined by Kiros et al. [21], the Order Embedding by Vendrov et al. [37] and the Order++ and VSE++ Embeddings by Faghri et al. [11].
Comparative study of the impact on performance of the main variants introduced by [37] and [11] under equality of the rest of hyper-parameters.
Exhaustive study of optimal hyper-parameter configuration for the previous methods.
Novel curriculum learning process to further increase Order++ and VSE++ [11] training stability and performance.
The rest of the paper is structured as follows. In Section 2, the main approaches existing in the literature for the image/caption retrieval problem are reviewed. This review introduces the basic methodology by Kiros et al. [21] and the other approaches studied in this paper. Beyond these, other proposals are considered, grouped according to their similarity to [21] and the possibility of integrating them with the FNE. Afterwards, in Section 3, the FNE and multimodal embedding methods studied here are described in further detail. The last subsection contains the methodology we propose to solve the issues found in the method of Faghri et al. [11]. Then, Section 4 presents all the information relative to the experiments conducted. This includes a description of the public datasets used, together with important notes on the choices made here and in the related works. This is followed by an extensive subsection explaining the details of the implementation which help to improve the results from our previous work [39]. Section 5 contains a discussion of the results obtained. Then, in Section 6, we focus specifically on the experimental difficulties we found when using the methodology of Faghri et al. [11]. Finally, Section 7 gathers the most important findings of this work.
Related work
This paper builds upon the methodology described by Kiros et al. [21], which is in turn based on previous works in the area of Neural Machine Translation [35]. In their work, Kiros et al. [21] define a vectorized representation of an input text by using GRU RNNs. In this setting, each word in the text is encoded into a vector embedding, and these vectors are then fed one by one into the GRUs. Once the last word vector has been processed, the activations of the GRUs at the last time step convey the representation of the whole input text in the multimodal embedding space. In parallel, images are processed through a CNN pre-trained on ImageNet [31], extracting the activations of the last fully connected layer to be used as a representation of the images. To match the dimensionality of both representations (the output of the GRUs and the last fully-connected layer of the CNN), an affine transformation is applied on the image representation.
Following the same pipeline [21], Vendrov et al. [37] proposed an asymmetric order-embedding space. Its main hypothesis is that captions convey more general abstractions than the images, such as the hypernym/hyponym relation. This relation is imposed in the embedding using the order error similarity defined in Eq. (3). Another improvement on the same pipeline was proposed by Faghri et al. [11]. This method, instead of taking into account all the contrastive examples, focuses only on the hardest of them. This improvement has also been applied to order embeddings successfully [11]. The present work studies the application of the FNE to these methods and variants.
Also, using two different neural networks for image and text, and the ranking loss as methodology keystone, we find the Embedding Network (EN) presented in [41] and the Word2VisualVec (W2VV) model [9]. The first approach (EN) introduces a novel neighbourhood constraint in the form of additional loss penalties, i.e., the captions describing the same image should be placed together and far from other captions, and analogously for images. The second approach (W2VV), while restricted to the specific problem of image annotation, also obtains competitive results. This approach uses as a multimodal embedding space the same visual space where images are represented, involving a deeper text processing. These two methods are very similar to the ones presented in this work, and are thus good candidates to benefit from the same improvements (e.g., FNE).
A substantially different group of methods is based on Canonical Correlation Analysis (CCA). A first successful approach in this direction is the use of Fisher Vectors (FVs) [22]. FVs are computed with respect to the parameters of a Gaussian Mixture Model (GMM) and a Hybrid Gaussian–Laplacian Mixture Model (HGLMM). For both images and text, FVs are built using deep neural network features: a CNN for image features, and word2vec [27] for text features. A more recent approach based on the same CCA methodology [10] introduces a novel bidirectional neural network architecture. This architecture is based on two channels which share weights: one channel maps images to sentences while the other goes in the opposite direction. Losses are applied in each projection and in a middle layer. The loss in the middle layer seeks to ensure the correlation between both representations at this point. Instead of using the CCA, a more efficient Euclidean loss is used. Since both methods rely on a CNN representation of the image, the introduction of the FNE in these pipelines should be straightforward.
Attention-based models are another family of competitive solutions for tackling multimodal tasks. Dual Attention Networks (DANs) [29] currently hold the best results on the Flickr30K dataset. On a general pipeline similar to [21], DANs introduce two additional small neural networks as attention mechanisms for images and captions. This allows DANs to estimate the similarity between images and sentences by focusing on their shared semantics. In a similar fashion, the selective multimodal Long Short-Term Memory network (sm-LSTM) [15] includes a multimodal context-modulated attention scheme at each time-step. This mechanism can selectively attend to a pair of instances of image and sentence, by predicting pairwise instance-aware saliency maps for image and sentence. All attention-based methods rely on CNN representations of the images, as the previously described methods did. However, they differ in that the representations are obtained from the last convolutional layer. At this level, information on the position of the features is available, allowing for the use of attention mechanisms. In contrast, the FNE obtains a compact representation of the whole image at the cost of losing the spatial information. Applying the FNE methodology to those techniques would require significantly modifying the FNE schema, and is one of our main lines of future work.
Methods
The multimodal embedding pipeline of Kiros et al. [21] represents images and textual captions within the same space. The pipeline is composed of two main elements, one which generates image embeddings and another one which generates text embeddings. In this work we replace the original image embedding generator by the FNE, resulting in the architecture shown in Fig. 1. In Section 3.1 the main characteristics and methods of the FNE are described. Section 3.2 explains the generic multimodal embedding pipeline by Kiros et al. [21] along with the main modifications proposed, including the integration of the FNE. Sections 3.3 and 3.4 then explain the variations introduced by Vendrov et al. [37] and Faghri et al. [11] respectively. Finally, Section 3.5 explains the methodology developed to overcome the hindrances found in maximum loss methods.

Fig. 1. Overview of the proposed multimodal embedding generation pipeline with the integrated Full-Network embedding. Elements colored in orange (dark grey) are components modified during the neural network training phase. During testing, only one of the inputs is provided.
The FNE [13] generates a vector representation of an input image by processing it through a pre-trained CNN, extracting the neural activations of all convolutional and fully-connected layers. After the initial feature extraction process, the FNE performs a dimensionality reduction step for convolutional activations, by applying a spatial average pooling on each convolutional filter. After the spatial pooling, every feature (from both convolutional and fully-connected layers) is standardized through the z-values, which are computed over the whole image train set. This standardisation process puts the value of each feature in the context of the dataset. At this point, the meaning of a single feature value in an image is the degree with which the feature value is atypically high (if positive) or atypically low (if negative) for that image in the context of the dataset. Zero marks the typical behaviour.
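The first two steps described above (spatial average pooling of convolutional activations and per-feature standardisation over the training set) can be sketched in NumPy as follows. This is only an illustrative sketch: the function names, the array layout `(n_images, n_filters, height, width)`, the small `eps` constant and the random stand-in activations are assumptions, not part of the FNE implementation of [13].

```python
import numpy as np

def spatial_average_pool(conv_activations):
    """Average each convolutional filter over its spatial positions.

    conv_activations: array of shape (n_images, n_filters, height, width)
    (layout assumed for this sketch). Returns shape (n_images, n_filters).
    """
    return conv_activations.mean(axis=(2, 3))

def fit_standardizer(train_features, eps=1e-8):
    """Per-feature mean and standard deviation computed on the training set."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0) + eps  # eps avoids division by zero (assumption)
    return mean, std

def standardize(features, mean, std):
    """z-values: positive means atypically high, negative atypically low."""
    return (features - mean) / std

# Random stand-ins for the activations of one convolutional and one
# fully-connected layer of a pre-trained CNN (hypothetical sizes).
rng = np.random.default_rng(0)
conv = rng.random((6, 4, 7, 7))
fc = rng.random((6, 3))

features = np.concatenate([spatial_average_pool(conv), fc], axis=1)
mean, std = fit_standardizer(features)
z = standardize(features, mean, std)
```

Statistics are fitted once on the training images and then reused for validation and test images, so that every value is read in the context of the same dataset.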
The last step of the FNE is a feature discretization process. The previously standardized embedding is usually of large dimensionality (e.g., 12,416 features for VGG16 [33]), which entails problems related to the curse of dimensionality. A common approach to address this issue would be to apply a dimensionality reduction method (e.g., PCA) [1,28]. Instead, the FNE reduces expressiveness through the discretization of features, while keeping the dimensionality. Specifically, the FNE discretization maps each feature value to one of three discrete values, which mark the feature as atypically low, typical, or atypically high for that image in the context of the dataset.
Multimodal embedding
In our approach, we integrate the FNE with the multimodal embedding pipeline of Kiros et al. [21]. To do so, we use the FNE image representation instead of the output of the last layer of a CNN, as the original model does. The encoder architecture processing the text is used as in the original pipeline, using a GRU recurrent neural network to encode the sentences. Each word in the sentence is first encoded as a one-hot vector using a dictionary containing all the words in the train and validation sets. Next, it is encoded through a trainable linear embedding into a word embedding of lower dimensionality. Finally, the embeddings are fed to a GRU and the final state of the GRU’s hidden units is normalised to obtain the sentence embedding. To combine both embeddings, Kiros et al. [21] use an affine transformation on the image representation (in our case, the FNE) analogous to a fully connected neural network layer with identity activation function. We simplified it by removing the bias term, resulting in a linear transformation as in [37]. This simplification is also motivated by the good results of W2VV [9], where the transformation is completely removed. The output of the linear transformation is normalised to obtain the embedding. This linear transformation is trained simultaneously with the GRUs and the word embedding. The elements of the multimodal pipeline that are tuned during the training phase of the model are shown in orange (dark grey) in Fig. 1 (notice the image embedding is not fitted to the data).
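As a rough illustration of the image branch described above, the following NumPy sketch applies a bias-free linear map followed by L2 normalisation. The function names, the random stand-in features and the dimensionalities are assumptions made for the example; in the actual pipeline the matrix is learnt jointly with the GRU and the word embedding.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps is an assumed numerical guard)."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def project_images(image_features, weights):
    """Linear transformation without bias term, then normalisation."""
    return l2_normalize(image_features @ weights)

rng = np.random.default_rng(0)
image_features = rng.standard_normal((4, 20))   # stand-in for image embeddings
weights = rng.standard_normal((20, 8)) * 0.1    # stand-in for the trainable matrix

emb = project_images(image_features, weights)
```

Sentence embeddings produced by the text encoder would be normalised in the same way, so that both modalities lie on the unit hypersphere of the shared space.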
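In simple terms, the pipeline training procedure consists of the optimisation of the pairwise ranking loss between the correct image-caption pair and a random pair. Assuming that a correct pair of elements should be closer in the multimodal space than a random pair, the loss of Eq. (1) sums, over all contrasting examples, a hinge penalty that is incurred whenever a contrasting pair is not less similar than the correct pair by at least a margin. Contrasting examples are taken in both directions: contrasting captions for each image, and contrasting images for each caption.
The similarity metric proposed in [21] is the cosine similarity.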
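The sum-of-hinges ranking loss with cosine similarity can be sketched as follows. This is a minimal NumPy illustration of the standard formulation, not the authors' Theano code: the function names are hypothetical, the margin value of 0.2 is an assumed placeholder, and the batch is assumed to hold one correct caption per image so that the diagonal of the similarity matrix contains the correct pairs.

```python
import numpy as np

def cosine_similarity_matrix(images, captions):
    """sim[a, b] = cosine similarity between image a and caption b."""
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    captions = captions / np.linalg.norm(captions, axis=1, keepdims=True)
    return images @ captions.T

def sum_of_hinges_loss(sim, margin=0.2):
    """Sum the hinge penalty over every contrasting example in the batch."""
    n = sim.shape[0]
    correct = np.diag(sim)  # similarity of each correct image-caption pair
    # Image a against contrasting caption b.
    cost_captions = np.maximum(0.0, margin - correct[:, None] + sim)
    # Caption b against contrasting image a.
    cost_images = np.maximum(0.0, margin - correct[None, :] + sim)
    off_diagonal = ~np.eye(n, dtype=bool)  # correct pairs are not contrasts
    return cost_captions[off_diagonal].sum() + cost_images[off_diagonal].sum()

# Perfectly separated toy batch: every contrasting pair is orthogonal.
aligned = np.eye(3)
loss_aligned = sum_of_hinges_loss(cosine_similarity_matrix(aligned, aligned))

# Degenerate toy batch: all embeddings identical, every contrast violates the margin.
same = np.ones((3, 4))
loss_same = sum_of_hinges_loss(cosine_similarity_matrix(same, same))
```

Because the whole batch is scored with a single matrix product, using every alternative pair in the batch as a contrasting example adds little cost over scoring the correct pairs alone.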
Multimodal order embedding
Using the same general schema, Vendrov et al. [37] proposed an asymmetric order embedding space. Their main hypothesis is that captions are abstractions of the images, including information such as the hypernym/hyponym relation. In the resulting shared embedding space, an image corresponds to a caption if all components of the image embedding have higher values than the corresponding components of the caption embedding. The degree to which this order is violated defines the order error similarity of Eq. (3).
Notice that since image and caption embeddings are normalised to have unit L2-norm, both lie on a hypersphere centred at the coordinate origin. Thus, a perfect order-embedding will not be achieved unless they are the same vector, which is extremely unlikely to happen.
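A possible NumPy sketch of an order-violation similarity is given below. It assumes the penalty form used by Vendrov et al. [37], namely the negative squared norm of the positive part of (caption − image); the function name and toy vectors are hypothetical.

```python
import numpy as np

def order_similarity_matrix(images, captions):
    """sim[a, b] = -||max(0, caption_b - image_a)||^2.

    The value is 0 when every component of the image embedding is at least
    as large as the corresponding caption component, and negative otherwise.
    """
    violation = np.maximum(0.0, captions[None, :, :] - images[:, None, :])
    return -(violation ** 2).sum(axis=2)

images = np.array([[0.8, 0.6]])
captions = np.array([[0.6, 0.6],    # respects the order: no penalty
                     [1.0, 0.0]])   # first component exceeds the image's
order_sim = order_similarity_matrix(images, captions)
```

The resulting matrix can be plugged into a ranking loss in place of the cosine similarity matrix, since higher values still mean a better match.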
Maximum error loss
A recent contribution to the field [11] proposes to compute the loss focusing only on the worst contrasting example (i.e., the closest mistake) instead of taking into account all the examples. To achieve it, Eq. (1) is modified substituting the sum over all contrasting examples for the maximum contrasting example, as shown in Eq. (4).
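The change from a sum to a maximum over contrasting examples can be illustrated with the following NumPy sketch. The function name, the margin of 0.2 and the toy similarity matrix are assumptions for the example; the diagonal of `sim` is assumed to hold the correct pairs of the batch.

```python
import numpy as np

def max_of_hinges_loss(sim, margin=0.2):
    """Keep only the hardest contrasting example for each image and caption."""
    correct = np.diag(sim)
    cost_captions = np.maximum(0.0, margin - correct[:, None] + sim)
    cost_images = np.maximum(0.0, margin - correct[None, :] + sim)
    np.fill_diagonal(cost_captions, 0.0)  # a correct pair is not a contrast
    np.fill_diagonal(cost_images, 0.0)
    hardest_caption = cost_captions.max(axis=1)  # per image (row-wise)
    hardest_image = cost_images.max(axis=0)      # per caption (column-wise)
    return hardest_caption.sum() + hardest_image.sum()

# Toy similarities: only caption 1 is close enough to image 0 to violate the margin.
sim = np.array([[0.9, 0.8, 0.1],
                [0.2, 0.9, 0.3],
                [0.1, 0.4, 0.9]])
loss_max = max_of_hinges_loss(sim)
```

Since only one contrasting example per query contributes a gradient, the signal is sparser than with the sum, which is consistent with the training difficulties discussed for this loss.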
Curriculum learning
Faghri et al. [11] reported problems in training when using their proposed Maximum of Hinge Loss (MH). They indicate that a rough form of curriculum learning [2] could be applied, but do not develop or experiment with it further, as in their preliminary experiments it obtained worse performance than the proposed method. Our experiments replicated their training problems, as well as an unstable behaviour with respect to hyper-parameter selection. As a result, on several occasions, the model is unable to start learning within a reasonable number of epochs.
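To fix that, we define a curriculum learning approach that combines the benefits of the sum loss with those of the maximum loss.
We propose to first train the model using the sum of errors loss, and then to use the resulting model as the starting point for a second training using the maximum error loss.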
We performed preliminary experiments using this methodology to apply a learning rate reduction, which resulted in small performance gains for some algorithms. We kept these results out of the paper as we do not consider them to be conclusive enough, and to avoid shadowing more relevant contributions.
Experiments
In this section, we evaluate the impact of using the FNE in a multimodal pipeline for both image annotation and image retrieval tasks. We extend our previous work [39] introducing the FNE in different multimodal pipelines. To properly measure the relevance of the FNE, we compare the results obtained with those of the original multimodal pipelines (i.e., without the FNE). Given the discrepancies in the experimental setup of the different contributions, we define baselines by keeping as much of the original setup as possible while leveling the playground (i.e., using the same training and test sets, the same text preprocessing, the same source CNN, the same data augmentation, etc.).
We identify the different combinations of embedding and multimodal pipeline with a notation in the form of EMB-PIPE. EMB denotes the embedding, being either FNE (for the full network embedding) or FC7 (for the baselines using the last CNN layer). PIPE denotes the multimodal pipeline, following the method names listed in Section 4.2 (e.g., FNE-SH or FC7-MOE).
Datasets
In our experiments we use three different and publicly available datasets: Flickr8K [30], Flickr30K [42] and MSCOCO [24].
Experimental setup
We investigate the impact of the FNE on the methods proposed in [11,21,37], and on the curriculum learning methodology proposed in Section 3.5. The methods are named following the convention of [11]. Notice all losses are actually based on a Hinge Loss:
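Sum of Hinge Loss (SH): the original pipeline of Kiros et al. [21].
Maximum of Hinge Loss (MH): the VSE++ variant of Faghri et al. [11].
Sum of Order Embedding Loss (SOE): the order embedding of Vendrov et al. [37].
Maximum of Order Embedding Loss (MOE): the Order++ variant of Faghri et al. [11].
Pre-trained Hinge Loss: the curriculum learning methodology of Section 3.5 applied to the hinge loss.
Pre-trained Order Embedding Loss: the curriculum learning methodology of Section 3.5 applied to the order embedding loss.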
The details of the hyper-parameters used in the experiments for each method can be found in Table 1.
Implementation details
The devil is in the details. To facilitate the reproducibility and interpretability of our work, in this section we provide all the details regarding our implementation. The Theano [36] based implementation we used is available at [38].
Table 1. Hyper-parameter configuration for the experiments
For MSCOCO Word embedding dimensionality is 2,000 and Learning rate is 0.00025.
First training-second training parameters.
During a training epoch, all images are presented with one caption chosen randomly from the five captions available. This approach differs from the usual practice of presenting all five captions per image each epoch [21,39]. If all five image-caption pairs are included in the dataset, more than one correct image-caption pair for the same image may be included in the same random batch. Since the method uses all image-caption combinations in the batch as contrastive examples, a correct pair could be wrongly used as an incorrect pair during the loss computation, leading to noise during the training. By using only one correct caption, we remove this possibility. On the other hand, it is now possible (although highly unlikely, depending on the number of training epochs) that a correct caption is never used during training. In fact, assuming the caption is drawn uniformly at random, the probability that a given caption is never used is (4/5)^n after n epochs, which is in the order of 10^-20 for 200 epochs.
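The probability mentioned above can be checked with a few lines of Python. The calculation assumes that, at every epoch, one of the five captions of an image is drawn uniformly at random and independently of previous epochs; the variable names are illustrative.

```python
captions_per_image = 5
epochs = 200  # maximum number of epochs used in the baseline experiments

# A given caption is skipped in one epoch with probability 4/5,
# so it is skipped in every epoch with probability (4/5) ** epochs.
p_skipped_once = 1 - 1 / captions_per_image
p_never_used = p_skipped_once ** epochs

print(f"{p_never_used:.1e}")
```

With 400 epochs the probability is the square of this value, so longer trainings make the event even less likely.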
The models are trained until a maximum number of epochs is reached, and the best performing model on the validation set is chosen. Notice that the result of this process is very similar to what could be obtained through an early stopping policy. In the case of baseline experiments, the maximum number of epochs is set to 200 for all our executions. In MH experiments on Flickr8K and Flickr30K, we raise the maximum number of epochs to 400, as we observed that results kept improving after 200 epochs.
In all our experiments (for both the FC7 and the FNE variants) the batch size is 128 image-caption pairs. Within the same batch, every possible alternative image-caption pair is used as a contrasting example (i.e., we sum over 127 contrasting examples or we choose the worst example out of 127, depending on the loss used). In the GRUs we use gradient clipping with a threshold of 2. We use ADAM [18] as the optimisation algorithm.
Caption processing
The caption sentences are word-tokenized using the Natural Language Toolkit (NLTK) for Python [3]. We did not remove punctuation marks as in [11,39], and in contrast to [37]. Also, unlike some previous works [21,39] we do not remove long sentences from the training split. We did not observe a significant impact on performance with this reduction of the text pre-processing. These observations are aligned with conclusions from [4], where simple tokenization works equally or better than more complex text preprocessing systems in general domain datasets. We hypothesise that the short nature of the texts combined with the availability of multiple text instances for each image helps the system to overcome sparsity issues.
The choice of the word embedding size and the number of GRUs has been analyzed to obtain a range of suitable parameters to test in the validation set. Previous contributions [11,21,37] set the word embedding dimensionality to 300. In our preliminary experiments, we tested word embedding dimensionalities of 300, 600, 1,024, 1,536, 2,048 and 3,072, finding that a higher dimensionality helps to obtain better results. We also found that very different dimensionalities between the word embedding and the multimodal embedding (i.e., 300–2,048) slow down the convergence speed during training. It seems reasonable that it may also affect the final performance. This could explain why higher word embedding dimensionalities help to obtain better results in this methodology. For word embeddings, a dimensionality between 1,024 and 1,536 performs competitively on all methods.
Similarly, we explored different multimodal embedding dimensionalities (i.e., number of GRU units) of 300, 400, 500, 800, 1,024, 1,536, 2,048, 2,560, 3,072, 5,000 and 10,000 finding that dimensionalities between 1,024 and 2,048 give good results for all methods considered. In all experiments we tested at least 3 different dimensionalities presenting the results of the best performing one on the validation set. Previous methods usually adopt 1,024 as the dimensionality of the multimodal embedding space [11,37], while others consider a much smaller dimensionality of 300 [21].
Image processing
For generating the image embedding we use the classical VGG16 CNN architecture [33] pretrained for ImageNet [31] as source model. This architecture is composed of 13 convolutional layers combined with pooling layers, followed by two fully connected layers and a final softmax output layer. The activations of the last fully connected layer before the softmax (FC7) provide the one-layer image embedding used in our baselines.
To obtain a better representation of the image, the full network embedding resizes the image to 256 × 256 pixels and extracts 5 crops of 224 × 224 pixels (one from each corner and the center). Mirroring these 5 crops horizontally we obtain a total of 10 crops which are processed through the CNN independently. The activations collected from each of these 10 crops are averaged to obtain a single representation of the image before further processing. For the baseline we use the same process before L2-normalization. Although a similar process is common for data augmentation, notice that we are not actually doing data augmentation since the number of training samples does not increase.
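The ten-crop procedure described above can be sketched as follows. The sketch assumes the image has already been resized to 256 × 256 and is stored as a `(height, width, channels)` NumPy array; `extract_features` is a hypothetical callable standing in for a forward pass through the CNN.

```python
import numpy as np

def ten_crops(image, crop=224):
    """Four corner crops, the centre crop, and their horizontal mirrors."""
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = [image[top:top + crop, left:left + crop] for top, left in offsets]
    crops += [c[:, ::-1] for c in crops]  # mirror along the width axis
    return np.stack(crops)

def averaged_activations(image, extract_features):
    """Average the activations of the ten crops into one representation."""
    return extract_features(ten_crops(image)).mean(axis=0)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))  # stand-in for a resized RGB image
crops = ten_crops(image)

# Stand-in feature extractor (assumption): mean colour of each crop.
features = averaged_activations(image, lambda batch: batch.mean(axis=(1, 2)))
```

Each image still yields a single training sample, which is why this step does not act as data augmentation.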
Evaluation metrics
To evaluate the image annotation and image retrieval tasks we use the following metrics:
To obtain a comparable performance metric per model, we use the sum of the recalls on both tasks. This has been done before in [39] and in [11], the latter using only R@1 and R@10. We only use the score obtained on the validation set to select the best performing model for early stopping and hyper-parameter selection.
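Recall@K and median rank, as reported in the result tables, can be computed from a query-by-candidate similarity matrix along the following lines. This is a simplified sketch: it assumes exactly one correct candidate per query located on the diagonal (the datasets used here have five captions per image, which requires extra bookkeeping), and the function names are illustrative.

```python
import numpy as np

def ranks_of_correct(sim):
    """1-based rank of the correct candidate for each query.

    sim[a, b] is the similarity between query a and candidate b, and the
    correct candidate for query a is assumed to be candidate a.
    """
    order = np.argsort(-sim, axis=1)  # candidates by decreasing similarity
    targets = np.arange(sim.shape[0])[:, None]
    return np.argmax(order == targets, axis=1) + 1

def recall_at_k(ranks, k):
    """Percentage of queries whose correct candidate is in the top k."""
    return 100.0 * np.mean(ranks <= k)

def median_rank(ranks):
    return float(np.median(ranks))

sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.5, 0.3],
                [0.1, 0.7, 0.2]])
ranks = ranks_of_correct(sim)
```

Applying the same functions to the transposed matrix gives the metrics for the complementary task, and adding up the recalls of both tasks yields the single model-selection score described above.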
Table 2. Results obtained for the Flickr8K dataset. R@K is Recall@K (high is good). Med r is Median rank (low is good). Best results for each FC7–FNE comparison are underlined. Best results for SotA and our experiments are shown in bold.
Results from [39].
Trained for 400 epochs.
Table 3. Results obtained for the Flickr30K dataset. R@K is Recall@K (high is good). Med r is Median rank (low is good). Best results for each FC7–FNE comparison are underlined. Best results for SotA and our experiments are shown in bold.
Single model.
CNN fine-tuned.
Results from [39].
Trained for 400 epochs.
Table 4. Results obtained for the MSCOCO dataset. R@K is Recall@K (high is good). Med r is Median rank (low is good). Best results for each FC7–FNE comparison are underlined. Best results for SotA and our experiments are shown in bold.
Single model.
Results provided on [19].
Extra training data from validation set.
CNN fine-tuned.
Results from [39].
Table 2 shows the results of the proposed full network embedding on the Flickr8K dataset, for both image annotation and image retrieval tasks. The top part of the table includes the current state-of-the-art (SotA) results as published. The second part summarises the results published by the original contributions this work is based on. Following parts contain the results produced by us for each of the models defined in Section 4.2. Each of these blocks comprises two pairs of results, the first pair corresponds to the results while using a configuration of hyper-parameters as close as possible to the original (i.e., baseline or -bl), while the second pair corresponds to the results while using the best configuration we found for the FNE. Within each pair, the first experiment uses the FC7 embedding and the second uses the FNE, keeping all hyper-parameters unchanged. Best results for each pair are underlined. Tables 3 and 4 are analogous for the Flickr30K and MSCOCO datasets. Additional results of the UVS model [21] were made publicly available later on by the original authors [19]. We include these for the MSCOCO dataset, which was not evaluated in the original paper.
First, let us consider the effect of all modifications in the pipeline (detailed in Sections 3 and 4.3) compared to our previous work [39]. In the first block of experiments, we can compare the results from [39] (FC7-SH-bl and FNE-SH-bl) with the ones obtained in this work for the same model (FC7-SH and FNE-SH). Notice that in FC7-SH-bl and FNE-SH-bl hyper-parameters were already optimized for FNE. We can see a substantial improvement in results obtained using both the FC7 and the FNE image embeddings. With an average increase in recall of 4.75% on MSCOCO, these results validate the improvements made in the pipeline globally and the exhaustive hyper-parameter fine-tuning.
Results obtained in this work for the original pipeline from Kiros et al. (FC7-SH) are now very close to the ones obtained by other studied methods (FC7-MH, FC7-SOE and FC7-MOE) dimming the benefits of the proposed variants. In Table 4, we can easily compare the results claimed in the original papers [11,21,37] with the ones obtained under equal conditions (notice that not all methods were tested on Flickr datasets in original works). The most explicit differences are in recall@1 for both image annotation and image retrieval. For instance, VSE++ [11] obtains 21.2% and 21.0% increments over UVS [21], while the increments of our analogous versions (FC7-MH and FC7-SH) are now of 0.6% and 0.5% respectively. We hypothesise that most of the previously reported increment was due to different dataset sizes, CNN architectures and hyper-parameter fine tuning; factors that we set equal for all methods.
These results highlight the difficulty of performing a consistent comparison between different multimodal approaches, since different authors make different choices in the settings of their experiments (and sometimes fail to detail them thoroughly). Notably, important differences arise depending on the data used for training and testing, especially when experimenting with the MSCOCO dataset, as we have seen in Section 4.1. Similarly, data augmentation techniques, a standard approach in most SotA methods, can give a boost to performance. In our experiments, we did our best to avoid such differences, or to specify them entirely when they were unavoidable. In this context, the results we provide are as comparable as possible. It is essential to keep all these considerations in mind when comparing the results we report with the ones from other publications.
Comparing the results of the family of methods based on [21] with the state of the art, we see that their relative performance increases with dataset size (larger datasets lead to more competitive performances of these methods). Since the methods tested are more data-driven (i.e., fewer assumptions are made a priori), it is to be expected that they can benefit more from the increase of available data. These results are congruent with the ones in [11] where the experiments using more data obtain state-of-the-art results.
Now, let us focus on the differences between a model and the same model using the FNE image embedding. This is the most significant contribution of this paper, as it incorporates the FNE into several multimodal embedding pipelines. The tables of results show that every method on every dataset obtains better results with the FNE than with the FC7 embedding. Moreover, even with the original hyper-parameter configuration (sub-optimal for the FNE), the FNE obtains better results on all tests. The only exception is FNE-MOE-bl, where training problems occur with the original configuration (we analyze this issue in Section 6). Even in this case, results using an appropriate hyper-parameter selection are superior to those of the baseline (FC7-MOE-bl). Considering all the experiments on the MSCOCO dataset (including baselines), the average increase in recall using the FNE is 3.7%.
Considering the methods tested in our consistent experimental setup, we see that FNE-MH tends to obtain the best results on image annotation while FNE-POE is usually superior in image retrieval tasks. With these results, we cannot consider one method preferable to the other except on the smallest dataset, Flickr8K, where FNE-MH is superior. In any case, the performance differences between the best versions of each method remain lower than the impact of the FNE. For instance, in the experiments on MSCOCO, the recall gap between the best and the worst method (for each task separately) is, on average, 2.1%.
Finally, we observe that the proposed curriculum learning methodology increases the already good performance of the original FC7-MOE [11] and of the FNE-MOE by 1.7% on average on MSCOCO. On the other hand, on methods based on the cosine similarity
Experiments on MOE training behaviour
When training models using the maximum order embedding (MOE and MOE-bl), we observed instability issues. For some configurations of hyper-parameters, the model does not start learning, even after extending the number of epochs significantly. To gain some insight into this behaviour, we trained the same model five times with different random initialisations. The configurations tested are shown in Table 5. The combinations of learning rate, margin and absolute value are taken from the original works [11,37].
Hyper-parameter configuration and results for the experiments on MOE training behaviour. Success indicates the number of times that experiment succeeded in starting training (i.e., score > 10) over total repetitions
The rest of the hyper-parameters are kept the same for all experiments. The dimensionality of the word embedding is 300, and the multimodal embedding has 1,024 dimensions. The maximum number of epochs is 200. We run all the tests on Flickr8K to minimise computational cost, although we observed this behaviour in Flickr30K and MSCOCO too.
To evaluate these experiments, we count the number of times the algorithm succeeded in starting training. We consider it does not train if validation and test scores are below 10 (regular scores are higher than 200). The results obtained are shown in Table 5.
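The success criterion above can be sketched as follows. This is an illustrative reading of the rule, not the authors' script: a run counts as failed when both its validation and test scores are below the threshold of 10, and as successful otherwise. The function name and the scores in the example are invented for illustration and are not results from Table 5.

```python
def count_successes(runs, threshold=10.0):
    """Count the runs that managed to start training.

    runs is a list of (validation_score, test_score) pairs, one per
    random initialisation. A run is treated as failed when both scores
    fall below the threshold.
    """
    return sum(
        1 for val, test in runs
        if not (val < threshold and test < threshold)
    )


# Invented scores for five repetitions of one configuration.
example_runs = [
    (3.1, 2.8),      # did not start training
    (251.7, 249.3),  # trained normally
    (0.6, 0.9),      # did not start training
    (248.9, 246.0),  # trained normally
    (255.1, 252.4),  # trained normally
]
successes = count_successes(example_runs)
summary = f"{successes}/{len(example_runs)}"
```

Because regular scores are above 200 and failed runs stay below 10, the exact threshold value is not critical for separating the two cases.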
Quite surprisingly, the results do not point to a single variable as the cause of the problem. For the FC7 embedding, the model did not train when the absolute value was used, independently of the learning rate and margin. The configuration that worked well with FC7 does not train with the FNE. On the other hand, the original configuration from [37] (but using max loss) successfully trained on the FNE embedding, although this behaviour is not entirely robust since it failed once.
These experiments show that the instability of the training does not come from the choice of embedding, but from the hyper-parameter selection and the parameter initialisation. While these experiments help to shed light on the problem, further work is required to completely understand the cause.
The proposed curriculum learning methodology (see Section 3.5) effectively solved this problem in all our experiments, as it initialises the network using the more robust sum loss. None of the experiments we did using the proposed curriculum learning methodology for different hyper-parameters configurations failed to start training.
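To make the distinction between the sum loss and the max loss concrete, the following is a minimal sketch of a hinge-based ranking loss with both reductions, together with a schedule that starts with the sum loss and later switches to the max loss. It is a simplified illustration under stated assumptions, not the implementation used in this work: the score matrix is generic (a cosine similarity or a negated order-violation penalty would both fit), and the names `contrastive_loss`, `loss_for_epoch`, the `warmup_epochs` value and the margin are hypothetical choices.

```python
def contrastive_loss(scores, margin, use_max):
    """Hinge-based ranking loss over a square score matrix (n >= 2).

    scores[i][j] is the similarity of image i and caption j; matching
    pairs lie on the diagonal. With use_max=False the violations of all
    negatives are summed (sum loss); with use_max=True only the hardest
    negative per query is kept (max loss).
    """
    n = len(scores)
    reduce = max if use_max else sum
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Negative captions for image i (row) and negative images for caption i (column).
        cap_viol = [max(0.0, margin + scores[i][j] - pos) for j in range(n) if j != i]
        img_viol = [max(0.0, margin + scores[j][i] - pos) for j in range(n) if j != i]
        total += reduce(cap_viol) + reduce(img_viol)
    return total


def loss_for_epoch(scores, epoch, warmup_epochs=5, margin=0.2):
    """Curriculum schedule: sum loss during warm-up, max loss afterwards.

    warmup_epochs and margin are hypothetical values for illustration.
    """
    return contrastive_loss(scores, margin, use_max=epoch >= warmup_epochs)


# Toy batch (invented values): image 0 has two violating negative captions.
toy_scores = [
    [0.5, 0.6, 0.4],
    [0.1, 0.9, 0.2],
    [0.3, 0.2, 0.8],
]
warmup_loss = loss_for_epoch(toy_scores, epoch=0)  # sum loss
later_loss = loss_for_epoch(toy_scores, epoch=5)   # max loss
```

In the toy batch the sum loss accumulates both violations of image 0, while the max loss keeps only the largest one. Under the sum loss every violating negative contributes to the gradient, which is one plausible reason why it behaves more robustly at initialisation.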
For the multimodal pipeline of Kiros et al. [21] and the other methods based on it [11,37], using the FNE results in consistently higher performance than using a one-layer image embedding. These results suggest that the visual representation provided by the FNE is superior to the current standard for the construction of most multimodal embeddings. In fact, the impact the FNE has on performance is significantly larger than the improvement resulting from combining the main contributions of [37] and [11]. These results confirm our initial hypothesis that the richer and discrete representation obtained with the FNE is better suited to the construction of multimodal embeddings than the widely used single-layer real-valued embeddings.
The results of our comparative study of the different variants from [11,21,37] highlight the need to properly assess the sources of empirical gains. We consider this a key aspect of research that should be further encouraged. We hope that our experimental study can help other researchers with design decisions, from the text pre-processing to the choice of loss, including ranges of optimal dimensionalities and other hyper-parameters.
Another issue we tackled was the instability of MOE models. Depending on the random initialization of the weights, the same model may start learning or not. Our experiments showed that the combination of hyper-parameters also plays a role in these difficulties. However, further study is required to get a real insight into the mechanisms causing this problem. In any case, the proposed curriculum learning method of pre-training using a sum of losses effectively alleviates this problem while increasing performance.
When compared to the current state of the art, the results obtained from the studied variants using FNE are below the results reported through other methods. This difference is often the result of using a more substantial amount of training data. Indeed, results given in [11] indicate that models based on the pipeline of [21] can obtain state-of-the-art results when using enough data.
Finally, let us remark that the FNE is directly compatible with most multimodal pipelines based on CNN embeddings. The consistent improvement in the results observed here for the variants proposed by [11,21,37] suggests that other methods can also boost their performance by incorporating the FNE. These results also encourage us to consider, in future work, the modifications required to introduce attention mechanisms (e.g., DAN) into our methodology.
Acknowledgements
This work is partially supported by the Joint Study Agreement no. W156463 under the IBM/BSC Deep Learning Center agreement, by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR-1051), and by the Core Research for Evolutional Science and Technology (CREST) program of Japan Science and Technology Agency (JST).
