Abstract
Current State-of-the-Art image captioning systems that can read scene text and integrate it into the generated descriptions demand high processing power and memory, which limits the sustainability and usability of the models (they require expensive and very specialized hardware). The present work introduces two alternative versions (L-M4C and L-CNMT) of top architectures on the TextCaps challenge, adapted to achieve near-State-of-the-Art performance while using less memory than the original architectures. This is mainly achieved by using distilled or smaller pre-trained models in the text-embedding and OCR-embedding modules. On the one hand, a distilled version of BERT was used to reduce the size of the text-embedding module (the distilled model keeps roughly 59% of the parameters of the original). On the other hand, the pre-trained FastText vectors used by the OCR context processor of both architectures were replaced by Global Vectors (GloVe), which can reduce the memory used by the OCR-embedding module by up to 94%. Two of the three models presented in this work surpassed the baseline (M4C-Captioner) of the challenge on the evaluation and test sets, and our best lighter architecture reached a CIDEr score of 88.24 on the test set, which is 7.25 points above the baseline model.
Introduction
The intersection of computer vision and natural language processing has given rise to several research areas and interesting new problems that involve both textual and visual information. The mixture of different modalities has led researchers to a novel area: multimodal machine learning. Several problems can be categorized as multimodal, such as Visual Question Answering, Visual Commonsense Reasoning, Visual Dialogue, Phrase Grounding or Image Captioning [14]. More recently, a new kind of multimodal problem has emerged: problems that require the model to read text and use it to improve its outputs, such as Text Visual Question Answering (TextVQA) [19] or Image Captioning with Reading Comprehension (TextCaps) [18]. Image Captioning with Reading Comprehension is the main problem tackled in this work; it is also known as text-based image captioning, OCR-based image captioning or OCR-augmented image captioning.
A traditional image captioning system should be able to recognize important objects, their attributes, and the relationships among the actors and objects in the scene, and describe them in natural language. An image captioning system with reading comprehension should, in addition, be able to read the text in the scene and integrate it into the output captions, which greatly increases the difficulty of the problem. Unfortunately, traditional image captioning systems fail to recognize, integrate or paraphrase the text in the scene [18], which can lead to the loss of important information such as street names, product brands, prices, written instructions, and more [27]. Furthermore, as described in [18], an image captioning system with reading comprehension should be able to determine the relationships among different Optical Character Recognition (OCR) tokens, as well as the relationships between OCR tokens and the visual context. The system should also be able to switch between words from the vocabulary of the model and OCR tokens extracted from the image. Moreover, the model may need to paraphrase or infer meaning from OCR tokens it has never seen before.
To address the aforementioned problems, novel and sophisticated architectures, multimodal encoding frameworks, and training strategies are being used [22–27]. However, these approaches are expensive both in processing power and memory consumption. As mentioned above, the main goal of this paper is to introduce alternatives designed to consume less memory than the State-of-the-Art (SOTA) architectures. It is important to highlight that all methods available in the literature are focused on achieving SOTA performance on the TextCaps challenge [18].
Image captioning with reading comprehension is shaping up to be a high-impact research area, given its wide range of applications, such as biomedicine, commerce, military, education, digital libraries, web searching, and robotics [1, 7]. Computationally lighter yet accurate alternative solutions are therefore needed. In this work, the first two architectures focused on saving memory are presented: Lighter-M4C (L-M4C) and Lighter-CNMT (L-CNMT). Both are modified versions of SOTA approaches on the TextCaps challenge: the former is based on the Multimodal Multi-Copy Mesh (M4C) architecture [8, 18] and the latter on the Confidence-Aware Non-repetitive Multimodal Transformer (CNMT) architecture [24]. Both architectures use lighter alternatives for the OCR-embedding and text-embedding modules: the pre-trained FastText vectors for OCR-context encoding were replaced by pre-trained GloVe vectors, while the original BERT (Bidirectional Encoder Representations from Transformers) [6] embedding module was replaced by a distilled version of the pre-trained BERT model. Three different configurations are presented, and our best configurations showed a reduction of up to 94% of the memory used by the OCR context processor (both in training and inference steps) while achieving near-SOTA scores.
The main contributions of this paper are listed below: the introduction of two architectures that aim to reduce memory usage while achieving near-SOTA scores on the TextCaps challenge; an alternative OCR-context encoding module that can save up to 94% of memory while training and predicting; and an alternative text-encoding module that keeps roughly 59% of the parameters of the original approach. As all experiments are built on top of the MMF framework, our architectures are fully modular and all the experiments are easy to reproduce. The code, scripts, pre-trained models and detailed instructions are made publicly available in the GitHub repository of this paper.
Related work
To the best of our knowledge, seven TextCaps-focused papers have been published, in which the authors mainly propose solutions to the image captioning with reading comprehension problem. All seven published works reached scores that surpassed the baseline proposed by the organizers of the TextCaps challenge, which was based on the M4C-Captioner model. The approaches vary among the papers: some authors tried to increase the performance of the model by pre-training their architectures on different yet related tasks [26], other authors implemented a graph-based approach to better explore content diversity [25], while the remaining authors changed or improved some modules of the M4C-Captioner [18], such as the encoding module, the text generation head or the pointer network [22–24, 27]. This section briefly introduces and describes each method available in the literature, including its scores on the validation and test sets of the TextCaps challenge.
Multimodal multi-copy Mesh Captioner (M4C-Captioner)
The M4C-Captioner was the first architecture focused on the TextCaps dataset, and was presented by the Facebook AI Research team as the baseline for the TextCaps challenge. The M4C-Captioner is based on the M4C model, which was once SOTA on the TextVQA challenge [8]. This model fuses different modalities by embedding them into a common semantic space, which is then processed by a multimodal transformer (MMT) [18]. Caption generation is performed by a pointer network, which allows the model to generate multi-word answers while mixing vocabulary and OCR tokens. The M4C-Captioner was taken as the base for one of our proposals (L-M4C), and will be detailed further in Section 3.
Multimodal attention Captioner with OCR Spatial Relationship (MMA-SR)
In [22], the authors proposed the MMA-SR architecture as a novel approach for OCR-based image captioning. This architecture first extracts feature representations from the image by using a pre-trained Faster R-CNN network, and extracts OCR information with external OCR systems (in addition to the features already in the dataset). Then, the two modalities are fed into a multimodal attention network (the text generation is LSTM-based instead of the MMT). In the last step, the caption is generated by using a dynamic pointer network, which selects candidates from the vocabulary of the model or from the extracted OCR tokens.
Text-Aware Pre-Training for Text-VQA and Text-Caption (TAP)
The TAP method, first proposed in [26], reported some of the highest scores on the TextCaps challenge (see Table 1). The main contribution of the TAP method is a novel pre-training strategy to help the model learn better-aligned representations among the text words, the visual objects, and the scene text (OCR). This pre-training strategy consists of three tasks: language modeling, image-text matching, and relative position prediction. During pre-training, the model learns useful representations for the three modalities, processed by a multimodal transformer (the fusion module). Once pre-trained, the fusion module is fine-tuned to perform specific tasks, such as Text-VQA or OCR-based image captioning.
Simple is not Easy (SBD)
In the Simple is not Easy paper [27], the authors argue that sophisticated multimodal encoding frameworks are not strictly necessary to obtain SOTA scores on the TextCaps challenge. They proposed a “simple” attention mechanism to filter the features before feeding them to the fusion encoder. The proposed architecture splits the OCR tokens into visual and linguistic branches, which are passed into vanilla attention blocks and then to a fusion encoder (to combine the features of the different modalities). The captions are then generated with a transformer module that receives the combined multimodal representation of the inputs.
Confidence-aware Non-repetitive Multimodal Transformers for TextCaps (CNMT)
The CNMT [24] architecture consists of three main components: a reading module, a reasoning module and a generation module. The reading module improves on previous systems: the authors report using better OCR systems together with the recognition confidence of each token. The reasoning module (a multimodal transformer) fuses the OCR token features with object features, which are then passed to the generation module to predict captions iteratively. A pointer network is used to select tokens from the vocabulary of the model or from the extracted OCR tokens. To avoid repetition, CNMT uses a repetition mask. Like the M4C-Captioner, the CNMT architecture was taken as the base for one of our proposals (L-CNMT), and will be detailed further in Section 3.
Anchor-Captioner (AnC)
To the best of our knowledge, the Anchor-Captioner [25] is the only graph-based approach in the literature, and it is mainly designed to explore content diversity while captioning images. The architecture consists of four main components: a feature extractor (both for textual and visual information), a fusion module (self-attention layers), an anchor proposal module (AnPM), and an anchor captioning module (AnCM). Once the input features are fused in the self-attention layers, the AnPM constructs anchor-centered graphs to group the relevant tokens. Then, the AnCM uses a visual captioner to output a “global visual-specific caption”, which is used to generate several text-specific captions.
Long Short-Term Memory plus Relation-aware pointer network (LSTM-R)
The LSTM-R architecture [23] aims to improve the “reasoning” of the model about the scene text by replacing the rich visual and semantic representations of the OCR tokens. This enhancement is mainly achieved by exploiting geometrical relationships among the OCR tokens. The authors consider the height, width, distance, and orientation relations in order to construct the geometrical relationships. Furthermore, the authors used an LSTM network instead of a multimodal transformer to perform feature processing, and the novel relation-aware pointer network receives the geometrical relationship features as an additional input.
Human-generated captions
A clear and detailed explanation of the creation of the TextCaps dataset is given in [18]. To estimate the performance of humans on this task, the authors collected an additional 6th caption when collecting the annotations for the training and test sets of the TextCaps dataset. These additional human captions were evaluated using all the metrics of the challenge, and the results can be observed in Table 1. Estimated human performance is not available for the validation set.
Summary of the State-of-the-Art
Overall, we have described seven methods for image captioning with reading comprehension. In Table 1, we present a summary of the reported results for each architecture (only the best experimental scores are shown) over the test set of TextCaps, as well as the estimated human performance over the test set.
Performance of image captioning with RC methods available in literature, compared with the estimated “Human” performance. The method with an * indicates the baseline of the challenge, while bold numbers indicate the best scores. The results are ordered in a descending manner by CIDEr score. Metrics in columns: BLEU-4 (B-4), METEOR (M), ROUGE L (R), SPICE (S), CIDEr (C)
As mentioned before, our main contributions consist of two architectures that are focused on reducing memory usage during the training and inference steps. In this section, our proposed modifications for both architectures are detailed.
Base architectures
As can be seen in Table 1, TAP outperformed all the other methods in three of the five metrics, while the LSTM-R method outperformed TAP and the remaining methods in the other two. Unfortunately, the code for the TAP, LSTM-R and MMA-SR methods is not available at the time of writing. Furthermore, the SBD, AnC and CNMT methods are variations of the baseline architecture (M4C-Captioner). As the goal of this work is to present architectures that perform at an equal or higher level than the models on the leaderboard of the challenge, while being lighter and easy to reproduce, two base architectures were selected: the M4C-Captioner and the CNMT. The former was chosen because it is the most widely used and best documented (as it is the baseline of the challenge), and the latter because it is the best architecture with publicly available code and additional OCR features (which are essential to reproduce SOTA results).
Figure 1 compares our two base architectures: the original M4C-Captioner is illustrated at the left, while the original CNMT is illustrated at the right. Detailed information about each base architecture is included in the following sections.

Illustrations of the two base architectures of this work. At the left, the baseline architecture (M4C-Captioner) for the TextCaps challenge, at the right, the CNMT architecture.
The M4C architecture was first proposed to solve the TextVQA problem [8], and was then slightly modified to solve the problem of text-based image captioning [18]; the modified version was called M4C-Captioner. The M4C-Captioner was the first approach specifically designed to use scene text and integrate it into the generated captions, establishing the baseline for the TextCaps challenge. The left illustration in Fig. 1 corresponds to the M4C-Captioner.
The M4C-Captioner first encodes the visual information (yellow blocks in Fig. 1) by using a pre-trained Faster R-CNN model [16]; we will also refer to this process as the feature extraction step. The textual features, in turn, are encoded by using a 3-layered pre-trained BERT model [6], which was pre-trained to solve a masked language modeling (MLM) task; the text-embedding module appears in Fig. 1 as the “Previous prediction embedding” block. More important still is the OCR-embedding module (colored in red in Fig. 1). The M4C architecture encodes a rich OCR representation, which includes visual, spatial, and semantic information about the OCR tokens: the visual features are extracted with the Faster R-CNN model, the character information is extracted with a Pyramidal Histogram of Characters network (PHOCNet) [20], and the semantic information is encoded with pre-trained FastText [12] vectors. The spatial information (location features) is based on the relative bounding box coordinates of the OCR token.
The gray blocks in the middle of the illustration represent a 4-layered multimodal transformer, which will process the previously embedded vectors (all the modalities are embedded into a common semantic space). Once processing is done, the outputs of the MMT are passed to a text generation head, which will be responsible for generating the caption. The generation head consists of two modules: a pointer network and a fully-connected (FC) layer; the pointer network will decide if an OCR token must be copied and introduced to the sequence, while the FC layer chooses tokens from the vocabulary of the model.
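The interplay between the two parts of the generation head can be illustrated with a minimal sketch. It assumes that both the FC layer and the pointer network reduce to one score per candidate and that the next token is the arg-max over the joint candidate set; the plain dot-product scoring, the function name `decode_step` and the toy values are illustrative assumptions, not the scoring functions used in [8, 18].

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def decode_step(decoder_state, vocab_weights, vocab, ocr_features, ocr_tokens):
    """Pick the next caption token from the fixed vocabulary or the OCR tokens."""
    # FC layer: one score per vocabulary word (bias omitted for brevity).
    vocab_scores = [dot(w, decoder_state) for w in vocab_weights]
    # Pointer network (simplified): one score per OCR token in the image.
    ocr_scores = [dot(f, decoder_state) for f in ocr_features]
    # Concatenate both lists and take the arg-max over all candidates.
    scores = vocab_scores + ocr_scores
    best = max(range(len(scores)), key=scores.__getitem__)
    if best < len(vocab):
        return vocab[best], "vocab"
    return ocr_tokens[best - len(vocab)], "ocr"


# Toy example with 2-d states: three vocabulary words and two OCR tokens.
vocab = ["a", "sign", "says"]
vocab_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ocr_tokens = ["stop", "main"]
ocr_features = [[2.0, 2.0], [0.0, 0.1]]

token, source = decode_step([1.0, 0.5], vocab_weights, vocab, ocr_features, ocr_tokens)
# Here the OCR token "stop" has the highest score, so it is copied into the caption.
```

Because the two score lists are concatenated, the size of the candidate set changes from image to image with the number of OCR tokens, which is what lets the model copy words it has never seen in its vocabulary.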
The blocks in green color in Fig. 1 correspond to those modules that were modified in order to obtain a lighter version of the M4C-Captioner.
CNMT
As mentioned before, the CNMT architecture is based on the M4C-Captioner. The main differences between these two architectures lie in two modules: the reading module and the generation module. The reading module on both architectures corresponds to the red OCR-related blocks in Fig. 1, while the generation module is located at the top of both illustrations, just after the output of the MMT. Further information about these modules is given below.
Reading module: The reading module of the M4C-Captioner is based on the Rosetta [4] large scale system for text detection and recognition. Also, all the OCR-related features in the M4C architecture are extracted from the Rosetta outputs. On the other hand, the authors of the CNMT architecture decided to enhance the reading module by using additional methods for text detection and recognition. First, CNMT uses two models for text detection, CRAFT [3] and ABCNet [10]. The text regions extracted by each model are combined and fed into the recognition model (a four-stage STR framework, as described in [3]). The OCR tokens extracted with the new OCR system are combined with the original Rosetta tokens at a dataset level (all tokens are included in the annotation file). The authors of the CNMT architecture reported an increment of 5.9 CIDEr points over the evaluation set just by improving the OCR features (Rosetta + CRAFT + ABCNet) through an ablation study where the original M4C-Captioner was kept intact, except for the additional OCR features.
Furthermore, the recognition confidence x_conf of each token from the text recognition system is recorded. The value of x_conf lies between 0 and 1, where 1 indicates complete confidence in the recognized text. The x_conf of each recognized token is added to the input features at a dataset level. The confidence embedding increased the CIDEr score by 2 points on the validation set (measured by another ablation study).
Generation module: In the CNMT architecture, the generation module is augmented by adding a repetition mask at the end of the architecture. The authors argue that repetition in captions has a negative effect on their fluency [24]. The repetition mask avoids word repetition by minimizing the scores of elements that have appeared in previous steps. Further details about this mask can be found in [24].
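As a rough illustration of the idea, the sketch below masks candidates that were already generated by setting their scores to negative infinity before the next token is chosen. The function name, the `exempt` set for words that may legitimately repeat, and the toy scores are assumptions made for this example; the exact formulation is the one given in [24].

```python
NEG_INF = float("-inf")


def apply_repetition_mask(scores, candidates, generated, exempt=frozenset()):
    """Return a copy of `scores` in which already generated candidates cannot win."""
    # Tokens in `exempt` (an assumption of this sketch) are allowed to repeat.
    seen = set(generated) - set(exempt)
    return [NEG_INF if tok in seen else s for tok, s in zip(candidates, scores)]


candidates = ["a", "coca", "cola", "bottle"]
scores = [0.2, 0.9, 0.7, 0.4]

masked = apply_repetition_mask(scores, candidates, generated=["a", "coca"], exempt={"a"})
# "coca" is masked out and "a" is exempt, so the best remaining candidate is "cola".
next_token = candidates[max(range(len(masked)), key=masked.__getitem__)]
```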
Lighter architectures
Both of our base architectures were briefly described above. In this section, detailed information about the modified blocks is included, and the motivation behind each modification is described. As mentioned before, the main difference between the original architectures and our lighter alternatives lies in the OCR-embedding module, specifically in the embedding of the OCR semantic features (recognized text tokens), and in the text-embedding block. Both modules are green-colored in Fig. 1.
As the goal of this work is to reduce the memory usage of the architectures described above, the main modification targets the most memory-intensive component of the M4C and CNMT designs, which is the processor for the OCR semantic features (read text tokens). Originally, both approaches use pre-trained 300-dimensional (300-d) FastText vectors to obtain the embeddings for the OCR text tokens, while our main contribution is to use pre-trained, dictionary-based GloVe vectors instead, which are also 300-d. The text-embedding module is another heavy component of both designs; our proposal is to replace the original pre-trained BERT model with a distilled and lighter version of it, called DistilBERT [17].
As the semantic processor module and the text-embedding module are the same in M4C and CNMT (both use 300-d pre-trained FastText vectors and a pre-trained BERT model), the proposed modifications can be applied to both designs without changing other components. Figures 2 and 3 give a graphical comparison of the original versus the lighter versions of the architectures. The left diagram in both figures corresponds to the original encoder, that is, the encoder of M4C (left of Fig. 2) and the encoder of CNMT (left of Fig. 3). At the right, the modified versions of the encoders are presented: the L-M4C encoder (right of Fig. 2) and the L-CNMT encoder (right of Fig. 3). The input of both encoders is formatted the same way as in Fig. 1.

Comparison of encoder blocks. To the left, the encoder of the M4C-Captioner; to the right, the encoder of the Lighter M4C architecture. The modified blocks are green-colored.

Comparison of encoder blocks. To the left, the encoder of the CNMT; to the right, the encoder of the Lighter CNMT architecture. The modified blocks are green-colored.
FastText. The FastText model [12] is a word embedding method that extends the word2vec [11] model. Instead of learning vectors from words directly, FastText represents each word as a set of character n-grams. The sub-word representation helps capture the meaning of shorter words and allows the embeddings to better process suffixes and prefixes. The FastText model can be interpreted as a bag-of-words model with a sliding window over a word, without an internal structure; that is, as long as the characters are within the window, the order of the n-grams does not affect the representation. The M4C and CNMT architectures first extract a set of N OCR tokens that are present in the image by using an external OCR system, and then obtain a 300-dimensional FastText vector for each of these tokens. The default pre-trained model of FastText vectors is approximately 8.5 gigabytes in size.
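The sliding-window view of a word can be sketched in a few lines. The boundary markers `<` and `>` and the default n-gram range of 3 to 6 characters are assumptions chosen for illustration, and `char_ngrams` is a hypothetical helper, not part of the FastText library.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("stop", n_min=3, n_max=3))
# ['<st', 'sto', 'top', 'op>']
```

Since a word vector is composed from the vectors of its n-grams, an OCR token that never appeared in the training corpus still receives a representation, at the cost of storing vectors for a very large number of n-grams.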
Global Vectors. The Global Vectors (GloVe) model [15] was published as an alternative to the word2vec model, since its authors argued that the online scanning approach of word2vec was suboptimal, as it does not exploit the global statistical information regarding word co-occurrences. GloVe is built on global matrix factorization and a local context window. The former is a process where large term-frequency matrices are reduced using matrix factorization; these matrices usually represent the occurrence or absence of words in a document. The local context window, on the other hand, can be a Continuous Bag of Words (CBOW) or a skip-gram: the former aims to predict the current word based on the input context, while the latter aims to predict the context, given a word. The GloVe model optimizes the embeddings directly, in such a way that the dot product of two word vectors approximates the log of the number of times the two words occur near each other. The GloVe model learns from ratios of the co-occurrence probabilities of full words (unlike FastText, which uses sub-word information). The L-M4C and L-CNMT architectures extract a set of N OCR tokens in the same way the original architectures do but, instead of using FastText vectors, our proposals obtain a 300-dimensional GloVe vector for each of these tokens. The default pre-trained GloVe model is approximately 0.49 gigabytes in size, which can reduce the memory used by the encoder by up to 94% if the full models are loaded in memory. A graphical comparison of the M4C-Captioner encoder and the Lighter-M4C encoder is available in Fig. 2, where the OCR embedding module is located at the center of the encoder, just below the red-colored blocks. Our GloVe-based OCR context processor is dictionary-based, which means it is trained over a fixed vocabulary, which, in our case, has a total of 75,501 tokens.
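A dictionary-based lookup of this kind can be sketched as follows. The parser assumes the plain-text GloVe layout (one token per line followed by its vector components); the lowercasing step, the zero-vector fallback for out-of-vocabulary tokens and both function names are assumptions of this sketch and not necessarily what our implementation does.

```python
def load_glove_text(lines):
    """Parse GloVe's plain-text format: a token followed by its vector components."""
    table = {}
    for line in lines:
        token, *values = line.rstrip().split(" ")
        table[token] = [float(v) for v in values]
    return table


def embed_ocr_tokens(tokens, table, dim):
    """Look up each OCR token; unknown tokens get a zero vector (an assumption)."""
    zero = [0.0] * dim
    return [table.get(tok.lower(), zero) for tok in tokens]


# Toy 3-d table standing in for the 300-d pre-trained vectors.
toy_lines = ["stop 0.1 0.2 0.3", "exit 0.4 0.5 0.6"]
table = load_glove_text(toy_lines)

vectors = embed_ocr_tokens(["STOP", "xq7z"], table, dim=3)
# vectors[0] is the entry for "stop"; vectors[1] is the zero vector.
```

The trade-off with respect to FastText is visible here: the table is small because it holds one vector per known word, but a misrecognized or rare OCR token has no sub-word information to fall back on.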
BERT and DistilBERT
The text-encoding module is another heavy component of the encoder of both architectures. The text-encoding modules of the M4C and the CNMT architectures are pre-trained models from the Hugging Face library (with id: bert-base-uncased), a 12-layered model with a total of 110 million parameters. Our proposal is to replace the original BERT model with a distilled version of it; the DistilBERT [17] pre-trained model used here (with id: distilbert-base-cased) also belongs to the Hugging Face library, and is a 6-layered architecture with a total of 65 million parameters. The original BERT model is 0.45 gigabytes in size, while the distilled version is just 0.25 gigabytes in size. Since the text-embedding modules of M4C and CNMT are built on only some of the layers of the BERT model, the effect of this modification on the final memory usage is likely to be less significant than that of the alternative OCR embedding module. A graphical comparison of the CNMT encoder and the Lighter-CNMT encoder is available in Fig. 3, where the text-embedding module is located at the right of the encoder.
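The relative savings implied by the sizes quoted in this section can be checked with a few lines of arithmetic. The helper name `relative_reduction` is hypothetical, and the inputs are the rounded figures given in the text, so the results are approximate.

```python
def relative_reduction(original, reduced):
    """Fraction by which `reduced` is smaller than `original`."""
    return 1.0 - reduced / original


# Sizes and parameter counts as quoted in this section.
ocr_vectors = relative_reduction(8.5, 0.49)    # FastText -> GloVe, gigabytes
bert_size = relative_reduction(0.45, 0.25)     # BERT -> DistilBERT, gigabytes
bert_params = relative_reduction(110e6, 65e6)  # BERT -> DistilBERT, parameters

print(f"{ocr_vectors:.1%} {bert_size:.1%} {bert_params:.1%}")
# 94.2% 44.4% 40.9%
```

With these figures, the OCR vectors account for a saving of about 8 gigabytes, against roughly 0.2 gigabytes for the text-embedding model, which is why the GloVe replacement dominates the overall memory reduction.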
Experiments
All the aforementioned models were trained on the TextCaps dataset [18], and the evaluation was performed on the validation and test sets of the same dataset. Further information about the dataset, the evaluation server, and the metrics used for comparison is given below. This section also includes detailed information about the implementation of the architectures, the experimental setup, and a comparison of our results against the SOTA methods.
TextCaps dataset
The TextCaps dataset was presented in 2020 by Oleksii Sidorov et al. from the Facebook AI Research team. The dataset consists of 145,329 captions for 28,408 images, and follows the image splits of TextVQA: 21,953 for training, 3,166 for validation and 3,289 for testing. The training and validation sets include the images and their annotations, while the testing set just includes the images (the annotations are not publicly available). A very complete report about the dataset is available in the paper TextCaps: a Dataset for Image Captioning with Reading Comprehension [18]. The dataset is publicly available and can be easily downloaded.
The dataset consists of natural images extracted from the Open Images dataset, all of which use the RGB color model. Each image may contain a wide variety of objects (8.4 objects per image on average), as well as written text (OCR). Furthermore, every sample contains annotations that include 5 different captions (with an average length of 12.4 words per caption) and the OCR tokens (with their respective bounding boxes) detected by the Rosetta system [4].
Evaluation and metrics
In order to perform a fair evaluation of the challenge results, the authors of the TextCaps challenge enabled an evaluation server on the EvalAI platform. On this evaluation server, each participating team can submit up to five prediction files (formatted as required by the challenge), which are then evaluated against a non-public annotation file for the test set of TextCaps. During the evaluation, five different metrics are used by the evaluation server: BLEU-4 [13], METEOR [5], ROUGE-L [9], SPICE [2], and CIDEr [21]. The majority of the papers that use the TextCaps dataset present their results on these five metrics.
Implementation details
In order to test the effectiveness of each modification, several experiments were performed. The impact of each modified component was measured by comparing the performance of the original architecture versus the modified architecture (one modified component at a time). In Table 2, the name of each architecture as well as its configurations can be found.
Summary of all proposed designs. The architecture with a † is the baseline model of the challenge, included for reference purposes
The baseline model of the TextCaps challenge is named M4C-BF, where the “B” stands for BERT and the “F” for FastText. BF indicates that the model uses the original BERT model and FastText vectors, which corresponds to the original design. The CNMT-BF model is also included, as the original design uses the same text-embedding model and OCR context processor as M4C-BF.
To evaluate the impact of using a distilled version of BERT instead of the original model, the M4C-DF architecture is presented, which varies from M4C-BF only in the text-embedding module. Also, to measure the impact of replacing the FastText vectors, the L-M4C-BG architecture is introduced, which only varies from M4C-BF in the OCR context processor as the L-M4C-BG architecture uses GloVe instead of FastText vectors.
All names that start with “L” correspond to the proposed architectures, which aim to be lighter in memory usage. The CNMT-DF and the M4C-DF architectures are also introduced in this paper, but we decided not to present them as lighter versions, since the impact of the text-embedding replacement is practically unnoticeable when performing experiments.
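The naming scheme can be made explicit with a small decoder. The letters “B” and “F” are defined above; reading “D” as DistilBERT and “G” as GloVe is inferred from the descriptions of M4C-DF and L-M4C-BG, and the `Variant` class and `parse_variant` function are hypothetical helpers written only for this illustration.

```python
from dataclasses import dataclass

TEXT_EMBEDDERS = {"B": "BERT", "D": "DistilBERT"}
OCR_VECTORS = {"F": "FastText", "G": "GloVe"}


@dataclass(frozen=True)
class Variant:
    name: str
    base: str
    text_embedder: str
    ocr_vectors: str
    lighter: bool


def parse_variant(name):
    """Decode names such as 'L-CNMT-DG' into their components."""
    parts = name.split("-")
    # A leading "L" marks the variants presented as lighter in memory usage.
    lighter = parts[0] == "L"
    if lighter:
        parts = parts[1:]
    base, suffix = parts
    return Variant(name, base, TEXT_EMBEDDERS[suffix[0]], OCR_VECTORS[suffix[1]], lighter)


for name in ["M4C-BF", "M4C-DF", "L-M4C-BG", "CNMT-BF", "CNMT-DF", "L-CNMT-DG"]:
    print(parse_variant(name))
```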
All experiments were performed using the Multimodal Framework (MMF), which was recently made publicly available by the Facebook Artificial Intelligence Research (FAIR) team. MMF is based on Python and PyTorch. Full instructions for installation, experiment reproducibility and usage are available on the official website of the framework.
The experimental setup consisted of an Ubuntu-based Azure Standard_NC6 instance, which is a 6-core machine with 56 gigabytes of memory, accelerated by an Nvidia Tesla K80 graphics card (with 24 gigabytes of memory).
As our code and the data are publicly available, our experiments are completely reproducible by using the configurations, scripts and instructions provided in the GitHub repository of this paper.
Results
Originally, we thought that using a distilled version of the text-embedding module would decrease the performance of the model. However, as can be seen in Table 3, replacing the original text-embedding model with a distilled version of BERT increased the performance of the model in four of the five metrics when predicting for the evaluation set of TextCaps: BLEU-4 (+0.1), ROUGE-L (+0.2), SPICE (+0.6), and CIDEr (+0.3). Given this behavior, we decided that both architectures based on CNMT would directly use the distilled version of BERT.
Results of replacing the original BERT model with a distilled version of it, in M4C-based architectures. The best scores are in bold numbers
Furthermore, the impact of using GloVe instead of FastText vectors was measured similarly. The L-M4C-BG and M4C-BF architectures both used the original BERT model, but L-M4C-BG was trained with a GloVe-based OCR context processor. As can be seen in Table 4, our approach slightly outperformed the original M4C-BF model in four of the five metrics: METEOR (+0.2), ROUGE-L (+0.2), SPICE (+0.5), and CIDEr (+0.3), while being up to 94% lighter in memory (when compared to the original OCR context processor), both in training and inference steps.
Results of replacing the FastText vectors with GloVe in M4C-based architectures. The best scores are in bold numbers
As the replacement of the OCR context processor is the main contribution of this work, two CNMT-based models were trained to measure the impact of this modification: one with FastText vectors and another with GloVe, both using DistilBERT in the text-embedding module. The results can be found in Table 5. In this case the outcome was different: the GloVe-based OCR context processor reduced the performance of the model. L-CNMT-DG outperformed CNMT-DF in only one of the five metrics, SPICE (+0.3), whereas the FastText-based approach was slightly better in BLEU-4 (+0.1), METEOR (+0.2), and ROUGE-L (+0.1), and surpassed the GloVe-based approach by 1.4 CIDEr points. However, as L-CNMT-DG is up to 94% lighter in memory and the reduction in metrics is very small, we consider the proposed modifications a success.
Results of replacing the FastText vectors with GloVe in CNMT-based architectures. The best scores are in bold numbers
As mentioned before, the performance of all approaches that tackle the TextCaps problem was measured with five metrics. Following the original TextCaps paper, the CIDEr score is the main focus of the results, and thus, Table 6 is ordered from higher to lower CIDEr scores.
Performance comparison of the models proposed in this paper and the current SOTA methods. The architecture with a † is the baseline model of the challenge, while models with a ★ indicate one of our proposals. The best scores are in bold numbers
It is clear that the TAP method outperforms the remaining approaches, surpassing the LSTM-R method by 2.4 CIDEr points. However, the main focus of this work is not to achieve SOTA scores but to find memory-lighter alternatives, and in that respect the L-CNMT-DG architecture is well positioned on the leaderboard: it reaches the 5th position, surpassing methods such as MMA-SR and AnC, while using up to 8 gigabytes less memory in the OCR context processor than the model in 3rd position (a reduction of 94% with respect to the original module). Furthermore, both L-CNMT-DG and L-M4C-BG surpassed the baseline of the challenge, even though both are GloVe-based approaches. The L-M4C-DG model, in contrast, ranks last and is outperformed by the baseline of the challenge, which indicates that using DistilBERT together with a GloVe-based OCR context processor reduces performance in M4C. We attribute this behavior to the combination of the two modules rather than to either one alone: the results in Table 4 show that the GloVe-based approach outperformed the FastText-based design, and Table 3 shows that replacing the original BERT with DistilBERT increased performance, yet neither gain is reproduced when both modules are used together. Hence, further research is needed to understand why DistilBERT and GloVe vectors improved the M4C model when used separately but worsened it when used together.
Additionally, our best model (L-CNMT-DG) outperformed the baseline model (M4C-BF) on all metrics, achieving a CIDEr score 7.25 points above the baseline, even surpassing previous SOTA models such as MMA-SR and AnC.
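The memory figures quoted in this section (up to 8 gigabytes saved, a reduction of up to 94%) can be cross-checked with simple arithmetic. The sketch below assumes that both upper bounds hold at the same time and derives the module sizes they imply; the helper `implied_sizes` is hypothetical, and the resulting sizes are implied values rather than measurements.

```python
from typing import Tuple


def implied_sizes(saved_gb: float, reduction: float) -> Tuple[float, float]:
    """Return (original, replacement) sizes implied by an absolute and a relative saving."""
    if not 0.0 < reduction < 1.0:
        raise ValueError("reduction must be a fraction strictly between 0 and 1")
    original = saved_gb / reduction  # saved = reduction * original
    return original, original - saved_gb


# Assumption: the 8 GB saving and the 94% reduction refer to the same measurement.
original_gb, glove_gb = implied_sizes(saved_gb=8.0, reduction=0.94)
print(f"implied original OCR context processor: {original_gb:.2f} GB")
print(f"implied GloVe-based replacement:        {glove_gb:.2f} GB")
```

Under that assumption, the original OCR context processor would occupy about 8.5 gigabytes and the GloVe-based replacement about 0.5 gigabytes.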
This paper introduced five different architectures for OCR-augmented image captioning, which are alternative versions of SOTA designs such as M4C-Captioner and CNMT. The main contribution of this work lies in the usage of a pre-trained dictionary-based GloVe approach instead of pre-trained FastText vectors in the OCR context processor of both architectures, which proved able to reduce memory usage during training and inference.
In order to clarify the impact of each modification, an ablation-like comparison was presented, where the fully original model is compared with the modified version. The results showed that the usage of DistilBERT improved the performance of the M4C model, while the usage of a GloVe-based OCR context processor also improved the overall performance of the M4C model. However, the GloVe-based version of the CNMT architecture was outperformed by the original design, but the reduction in memory usage is a strong advantage of the L-CNMT-DG model over its original version.
Of the five trained models, three can be categorized as memory-lighter alternatives, since they can reduce memory usage by up to 94% when compared to the original OCR context processor. Furthermore, our best architecture surpassed the MMA-SR and the AnC architectures, ranking 5th on the TextCaps test set metrics. Also, two of our memory-lighter proposals surpassed the original baseline of the TextCaps challenge, achieving higher scores while being cheaper in memory usage.
In this paper, we showed that efficiency-focused architectures can achieve near-SOTA scores. However, further research is needed to improve the results of these memory-lighter architectures, for example by searching for optimal hyper-parameters or by reducing the number of trainable parameters in the models. Also, the pre-training strategies used by the TAP team and the augmentation of features at a dataset level have been shown to increase model performance without directly increasing model capacity; combined with our memory-lighter approaches, they could strongly reduce the hardware needs of these designs. We leave these experiments as future work.
Declarations
Acknowledgments
We thank Microsoft Corporation, who kindly provided an Azure sponsorship with enough credits to perform all experiments.
Availability of data and material
The original TextCaps dataset is publicly available on the challenge website: https://textvqa.org/textcaps. The annotations with additional OCR information (Rosetta + CRAFT + ABCNet) can be downloaded from the CNMT official repository: https://github.com/wzk1015/CNMT
Code availability
The software requirements, Python code, scripts, and detailed instructions to reproduce our experiments are available in the following GitHub repository: https://github.com/gallardorafael/multilingual-mmf
