Abstract
Visually rich documents, such as forms, invoices, receipts, and ID cards, are ubiquitous in daily business and life. These documents convey information through diverse cues, including text, layout, font size, and text position. Combining these elements in information extraction can improve extraction performance. However, previous works have not effectively utilized the cooperation between these rich information sources. Text detection and recognition have been performed without semantic supervision (e.g., entity name annotation), and text information extraction has been performed using only serialized plain text, ignoring rich visual information. This paper presents a method for extracting information from such documents, which integrates textual, non-spatial, and spatial visual features. The method consists of two main steps and uses three deep neural networks. The first step, Text Reading, employs two CNN models (Lightweight DB and C-PREN) for OCR tasks, based on the state-of-the-art models DB and PREN, with two improvements. These improvements include reducing noise by removing the SE block of DB and integrating both context and position features in PREN. The second step, Text Information Extraction, uses a graph convolutional network, RGCN, for named entity recognition. Experiments on one self-collected and two public datasets have demonstrated that our method improves the performance of the original models and outperforms other state-of-the-art methods.
Introduction
Document understanding, especially for Visually Rich Documents (VRDs), has attracted more and more researchers due to its important role in practical applications [15]. Documents such as invoices, receipts, business cards, and identification (ID) cards are very common in day-to-day business workflows. They have massively increased in both number and complexity. Therefore, the automatic extraction of information from these document images is crucial to reduce manual effort and enhance the digitization process.
VRDs typically present information through textual and visual cues, e.g., text contents, layout, tabular structure, font size, and text positions [19]. Regarding the complexity of this structure, the visual information can be analyzed through spatial and non-spatial based features. The first features relate to geometry information like layouts or coordinates, while the second concerns text appearances, such as size or colors. Consequently, automatically extracting information from such documents is not a trivial task due to the difficulty not only in recognizing textual contents but also in determining relevant entity names (ID, code, seller, address, date, etc.).
Recently, numerous studies have focused on this problem, usually in three phases: Document Normalization, Text Reading, and Text Information Extraction (TIE) [45]. The first phase is necessary to normalize and segment the right region of interest (ROI) from the background. This improves the performance of the next steps, especially for images captured in different situations (orientations, scaling, etc.) and containing many other objects (even other VRDs) [27]. The second phase detects text regions, extracts them from the document image, and recognizes the corresponding text. The results of this phase are affected by many factors, such as the diversity of scene text, the complexity of the background, and interference factors [48]. The third phase extracts semantic information from the recognized text, such as entities and relations, which is considered a specialized area in natural language processing (NLP). This problem can involve several NLP sub-tasks, for example, named entity recognition (NER), relation extraction (RE), event extraction (EE), salient facts extraction, or rule-based methods [1]. In this study, we are interested in the last two phases: Text Reading and Text Information Extraction.
To tackle this problem, previous works frequently focus on the combination of non-spatial visual and textual features. A common solution is to apply Optical Character Recognition (OCR) at Text Reading to extract text contents, and rule-based methods and/or NLP at Text Information Extraction to identify entity names. With the development of deep-learning technologies, many neural networks, such as CNNs and RNNs, have been applied to these steps, for example the work presented in [2, 39], which provides efficient neural networks for OCR, or [6, 45] for Named Entity Recognition (NER).
Although achieving promising results, these works lack the cooperation between text reading and information extraction. The former recognizes text without semantic supervision (e.g., entity name annotation), and the latter extracts information from only serialized plain text, not using rich visual information [45]. They merely transcribe the input [6], which overlooks important semantic information about spatial representation. Therefore, they are inadequate to capture all semantic facts of VRDs. For meaningful results, they should rely on the ability to incorporate all possible information, including textual, non-spatial, and spatial visual features. In this context, graph-based methods [9] are suitable solutions to deal with this complexity.
A graph is composed of nodes and connecting edges, in which the nodes represent individual objects, while the edges represent relationships among those objects. Like many other data in the real world, VRDs can be modeled by this structure [19]. For example, we can use a graph of text segments to describe these documents, where each node (a segment instance) is characterized by textual (text contents), non-spatial (font and color), and spatial (text position) visual information; the edges encode the distances between these nodes.
Graph neural networks (GNN) are an emerging framework in the deep learning community, which are designed to tackle this type of data. They have proven their efficiency in learning graphs for numerous areas, such as natural language processing, computer vision, drug discovery, and social networks [9]. This framework propagates information along edges to update nodes or edges using non-linear transformation and aggregation functions.
In this paper, we propose a method to read key information and perform NER on VRD images, which uses both their textual and visual features. We combined CNN-based models and a residual gate Graph Convolutional Network (RGCN) to cover spatial and non-spatial visual features. The CNNs were used to extract text regions, contents, and their positions, while the RGCN performs NER on the extracted key texts.
In summary, our contributions are as follows: (i) we proposed a method, named VRD-GCN, to extract key information from common structured document images, such as receipts, business cards, or ID cards; (ii) we investigated a combination of Convolutional Neural Networks and Graph Convolutional Networks to process all information and relations shown in VRDs, including textual, non-spatial, and spatial visual features; (iii) we implemented the proposed methods (lightweight DB and custom PREN for Text Reading, and RGCN for Named Entity Recognition), evaluated them on three datasets (VinText [26], MC-OCR [41], and a self-collected dataset), and highlighted the experimental results.
This paper is structured as follows. Section 2 reviews related works. Section 3 describes our proposed method and related materials. We show and discuss our results in Section 4. Section 5 presents our conclusion and further work.
Related works
This section presents a review of recent literature on text reading and text information extraction from document images.
Text reading using OCR methods has become a very active research topic recently. According to Long et al. [20], the evolution of text detection undergoes three stages, including learning-based methods with multistep pipelines, methods inspired by object detection, and methods based on sub-text components [20]. The third category, which can give the highest accuracy, is based on an important idea: instead of detecting the whole text instance, we can detect sub-text components and assemble them later. For instance, Baek et al. [3] proposed a method supporting text detection, named Character Region Awareness for Text Detection (CRAFT), which relies on exploring each character region and the affinity between them. These character-level methods can be found in the work of Liao [18]. Component-level is another method that focuses on the local region of text instance [20]. For instance, Zhang et al. [46] proposed a Graph Convolutional Network to detect such text, which is based on the layout and geometry attributes of text.
For text recognition, (scene) text images can be curved, deformed, or extremely long, so previous works usually performed several processing steps before mapping them to the corresponding strings, in order to obtain more stable results. According to [8], the following steps are common to many SOTA text recognition methods. (i) Image preprocessing aims to enhance the image quality, including background removal, text image super-resolution, and rectification. It improves feature representation and recognition in the following steps. (ii) Feature representation maps the text image (after preprocessing) to a representation reflecting the character attributes while suppressing irrelevant features such as font, color, size, and background. Convolutional neural networks and their variants, such as VGGNet, Recursive CNNs, and ResNet, are the most common methods for this step. (iii) Sequence modeling is an important step that combines the visual and contextual information of text images. It considers the character sequence as a whole, which makes the prediction more stable. Bidirectional long short-term memory (BiLSTM) networks are widely applied for sequence modeling. (iv) Prediction, which includes two main techniques, connectionist temporal classification (CTC) and the attention mechanism (ATTN), maps the extracted features to the target string.
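To make the CTC prediction step more concrete, the following minimal sketch shows greedy (best-path) CTC decoding: consecutive repeated frame labels are collapsed, then blank labels are dropped. The function name, the vocabulary, and the frame sequence are hypothetical and chosen only for illustration; they are not taken from any of the cited systems.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Best-path CTC decoding: collapse repeated labels, then remove blanks."""
    # groupby merges runs of identical consecutive labels into one entry.
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical per-frame argmax labels (0 is the blank symbol).
vocab = {1: "a", 2: "b", 3: "c"}
frames = [0, 1, 1, 0, 1, 2, 2, 0, 3, 3]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

The blank between the two runs of label 1 is what allows a doubled character to survive the collapsing step.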
Text information extraction aims to extract useful information from recognized text, such as entities and relations. This problem can involve some NLP sub-tasks, for example, named entity recognition (NER), Relation extraction (RE), Event extraction (EE), and salient facts extraction [1]. They can be categorized into two main approaches: Rule-based methods (RBM) and learning-based methods (LBM) [1, 45].
RBM relies on (i) rules that are represented by a set of specific properties, and (ii) a corresponding interpreter that applies these rules to find suitable patterns [43]. Regular expressions are the most widely applied technique for defining rules; each expression is a sequence of characters describing the expected pattern. For example, Profitlich et al. [32] recently proposed a case study that applied the NegEx regular expression algorithm to detect and extract negations of concepts in German medical texts such as discharge summaries and clinical notes. Similarly, Giannakis et al. [13] presented a method to extract information from Linked Data, which applied w-regular expressions. The authors proposed a novel way to query the potentially infinite graph of Linked Data. This technique can also be found in [7, 42].
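As an illustration of the rule-based approach, the sketch below applies two regular expressions to a recognized receipt text. The patterns, the sample text, and the function name are hypothetical; the field names simply reuse entity labels mentioned later in this paper, and real receipts would need far more robust rules.

```python
import re

# Hypothetical rules: a dd/mm/yyyy (or dd-mm-yyyy) date and an amount after "total".
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b")
TOTAL_RE = re.compile(r"\btotal\b\D*([\d.,]+)", re.IGNORECASE)


def extract_fields(text):
    """Apply each rule to the text and keep the first match per field."""
    fields = {}
    date_match = DATE_RE.search(text)
    if date_match:
        fields["TIMESTAMP"] = date_match.group(0)
    total_match = TOTAL_RE.search(text)
    if total_match:
        fields["TOTAL_COST"] = total_match.group(1)
    return fields


sample = "ABC Mart\n12/03/2021 10:45\nTotal: 125.000"
fields = extract_fields(sample)
print(fields)
```

Such rules are cheap to write when no labeled data is available, but each new layout or wording typically needs its own pattern.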
Regarding LBM, two main techniques have been applied in the literature: supervised and unsupervised techniques. The need for labeled data is the main challenge for the first approach, whereas the second does not need such data but requires more intensive data processing [1]. As with text reading, deep learning has become popular for TIE and shows better results in studies with large datasets. Numerous applications in areas such as named entity recognition, relation extraction, semantic analysis, searching, and classification models in natural language processing (NLP) have proven its effectiveness [35]. For instance, Nguyen et al. [25] applied transfer learning from transformers for Japanese NER in business documents. The authors successfully built on BERT [10] and ALBERT [16] to fine-tune transformers with a small amount of data. Zhang et al. [45] also presented an end-to-end method for IE, which contains a multimodal neural network (BiLSTM) for TIE. The input is a combination of textual and visual features such as the color, font, and layout of documents. Similar methods can be found in [5, 30].
Between RBM and LBM, the latter seems to perform better in text information extraction. However, the choice still depends on the user needs and the task at hand because IE is a community-based process [1]. For example, for studies in which a dataset is unavailable or insufficient, RBM seems a suitable solution.
Previous studies have shown the competency of the proposed methods on their tasks. Nevertheless, most of them considered text reading and text information extraction as two different phases that are executed independently. Although several works tried to share and exploit common features, such as visual and textual features, these features were still not applied efficiently. Besides, information extraction from VRDs, which usually includes numerous models and interactive steps, requires a suitable architecture for implementation in practice.
Material and methods
Overall methods
In this study, we are interested in extracting information from common Vietnamese VRDs, such as invoices, ID cards, or receipts. These VRDs describe their content using both textual and visual cues. Therefore, we propose a method, named VRD-GCN, that combines CNNs and a GCN to recognize key texts and their semantics. The CNNs were mainly used to detect, recognize, and extract optical text, and to supply additional information, such as text coordinates, width, or height, while the GCN performs the NER task.
The overall method is presented in Fig. 1. First, we use two networks, lightweight DB and custom PREN, to extract raw texts and their position. Second, these outputs are engineered to produce two sets of features: node and edge features, that incorporate textual, spatial, and non-spatial visual information. Lastly, we use these features to train the residual gate graph convolutional network (RGCN) to identify important keys. The next sections will detail these networks and features.

VRD-GCN: a CNN and GCN based method for information extraction of VRDs.
To incorporate all important information located on VRDs, we used two CNN-based neural networks to extract textual, non-spatial, and spatial visual features. First, we applied transfer learning with DB network [18] on two Vietnamese optical text datasets: VinText-2021 [26] and our self-collected dataset.
The detailed architecture of DB is presented in Fig. 2. It consists of three main components: the backbone, neck, and head. The first component is used for feature extraction. To balance accuracy and computation time, instead of using ResNet as in the original network, we replaced the backbone of this network with a lightweight one, EfficientNet B3 [38]. Moreover, we removed the Squeeze-and-Excitation (SE) block to reduce the noise caused by its multiplication operation. After feature extraction, the second component processes these features to find potential areas that can include text. Feature pyramid processing is the main operation of this component, in which extracted features are processed by several up-sampling and 2D convolution blocks.

DB network [18].
In the last component, the network predicts labels for the processed features. To do that, the network first produces two maps, which predict (i) the probability of each pixel being in a text area, and (ii) the threshold for the boundary of each area. Finally, the approximate binary map is calculated to predict the text areas. The output of this step contains both non-spatial (feature map of text areas) and spatial visual (coordinates of bounding boxes) features.
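The approximate binary map can be illustrated with the differentiable binarization formula of the original DB network [18], B = 1 / (1 + exp(-k (P - T))), where P is the probability map, T is the threshold map, and k is an amplifying factor (set to 50 in the DB paper). The sketch below is only a NumPy illustration of that formula on a hypothetical 2 x 2 map; it is not the implementation used in our experiments.

```python
import numpy as np


def approximate_binary_map(prob_map, thresh_map, k=50.0):
    """Differentiable binarization: a steep sigmoid of (P - T)."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))


# Hypothetical probability and threshold maps.
prob = np.array([[0.9, 0.2], [0.5, 0.7]])
thresh = np.array([[0.3, 0.3], [0.5, 0.6]])
binary = approximate_binary_map(prob, thresh)
print(np.round(binary, 3))
```

Pixels whose probability clearly exceeds the local threshold are pushed toward 1, and the others toward 0, while the function stays differentiable for training.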
For the textual features, we proposed a deep neural network, named C-PREN, to recognize the optical texts extracted in the previous step. The proposed network is based on PREN [44], into which we incorporated the idea implemented in the ABINet network [12]. ABINet applies the principle of decoupling recognition models into a vision model (VM) and a language model (LM) to analyze and combine visual and linguistic features. These sub-models are used as independent functional units and learned separately. The idea is not new and was applied previously [28]. However, ABINet proposed to integrate both context and position information in the LM, which improves the performance of the overall model. Owing to the availability of such information, this method fits VRDs well.
The detailed architecture of C-PREN is presented in Fig. 3. It takes two inputs, instead of only text images as in the original network. They correspond to two types of information: text images and positions (coordinates of bounding boxes) for context and spatial information, respectively. First, text images are passed through the new backbone, EfficientNet B3 [38]. Then, C-PREN predicts the visual result using the PREN method, as illustrated in the upper branch in Fig. 3. We also down-sample the features extracted by the PREN backbone to 1/32, 1/16, and 1/8 of their size and send them to a Feature Pyramid Network. The obtained results are features of size 1/8 of the extracted features, which are then passed through a self-attention block [40].
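For readers unfamiliar with the self-attention block [40], the following NumPy sketch shows single-head scaled dot-product self-attention over a sequence of feature vectors. The shapes, the random weights, and the function name are assumptions made for illustration; the actual block in C-PREN may use multiple heads and additional layers.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over rows of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))  # 5 positions, 8 channels (hypothetical)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
attended, attn_weights = self_attention(features, w_q, w_k, w_v)
print(attended.shape, attn_weights.shape)
```

Each output row is a weighted mixture of all positions, which is how the block lets every location of the feature map attend to the others.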

Custom PREN (C-PREN) and Feature fusion.
We combined these results and the positional information, then corrected them with a specific block, named Multi-Domain Character Distance Perception (MDCDP) [47]. This block allows for establishing a visual and semantic-related position encoding. The MDCDP was proposed in the CDistNet network [47], which is an improved version of the transformer encoder [40]. It supports focusing more attention on the location information, as presented in Fig. 4. Finally, we fed the two outputs of MDCDP and the original PREN into a fusion layer for customization and generating the text outputs, as illustrated in the last layers of Fig. 3. Embedded text and coordinates of the bounding box will be used to initiate an input graph. Subsequently, this graph will be passed to multilayer RGCN to construct a new graph of which node features are used to predict its label through a fully connected layer.

MDCDP block in CDistNet [47].
To extract the entity names of recognized texts, we proposed a residual gate Graph Convolutional Network (RGCN) that integrates and processes the spatial information and text contents from the previous steps. The input of this network is a graph whose nodes and edges are derived from the features and information extracted in the previous steps, as presented in Fig. 5. Figure 6 presents the way to extract node and edge features. From the predicted text and position, each node, which corresponds to a bounding box, is initialized using its coordinates, length, and the corresponding embedded text. We applied an LSTM layer with 512 hidden units and tanh as its activation function to pack these features together. A node is considered to be connected to another if the distance between the two center points along the y-axis is 3 times less than its width. Edge features are computed as the distances between the two center points of the bounding boxes along the coordinate axes.
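The edge construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Box` class and `build_edges` function are hypothetical names, the connection rule is read here as "the vertical center distance is smaller than one third of the source box width" (the sentence could also be read as 3 times the width, so the factor is exposed as a parameter), and node features (LSTM-packed text embeddings) are omitted.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x: float  # left
    y: float  # top
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def build_edges(boxes, width_factor=1 / 3):
    """Return {(i, j): (dx, dy)} for connected box pairs.

    Assumption: box i connects to box j when |dy| < width_factor * width_i.
    """
    edges = {}
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i == j:
                continue
            dx = b.center[0] - a.center[0]
            dy = b.center[1] - a.center[1]
            if abs(dy) < width_factor * a.w:
                edges[(i, j)] = (dx, dy)  # edge features: axis-wise distances
    return edges


# Hypothetical receipt fragments.
boxes = [
    Box("SELLER", x=10, y=10, w=120, h=20),
    Box("ABC Mart", x=150, y=12, w=90, h=20),
    Box("TOTAL", x=10, y=200, w=60, h=20),
]
edges = build_edges(boxes)
print(edges)
```

In this toy example the two boxes on the first line are linked in both directions, while the box far below them stays unconnected.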

Overall structure for NER.

Node and edge feature engineering.
We then passed these features (nodes and edges) to n + 1 layers of RGCN networks. The number of layers depends on the complexity of structured documents: the greater the complexity, the higher the number of layers. For low-complexity documents like ID or business cards, 3 – 5 layers are reasonable, while for more complex documents, like receipts and invoices, 6 – 8 layers are a good choice. Of course, for low-complexity documents, it is still possible to use more layers, but it can lead to critical issues, such as vanishing gradients or reduced performance.
RGCN is a deep neural network and therefore shares common issues, including vanishing and exploding gradients. To overcome these issues, the output of each layer in RGCN was normalized and added to the next input, as illustrated in Fig. 7. The graph normalization block is based on basic normalization. Each GCN inside the residual gate GCN performed an iterative operation to construct a new graph, as presented in Fig. 8 and Equation 1.

Structure of the residual gate GCN.

Structure of the residual gate GCN.
In Equation 1, $e_{i,j}$ denotes the features of the edge between nodes $i$ and $j$; $n_i$ and $n_j$ are the features of nodes $i$ and $j$; and $w_i$ and $w_j$ are the adjacency (weight) matrices of nodes $i$ and $j$.
To avoid exploding gradients, the influence level of each adjacent node was passed through a sigmoid function. Basically, the label of a node can be recognized from the information of its adjacent nodes. However, only a few of them play the role of keywords, while the others help to define the relationship of that node to the whole document, such as its relative position in the document (top, bottom, left, or right). Therefore, the influence level of each adjacent node must differ depending on the contextual meaning carried by its features. By training on real data, the influence level of each adjacent node was adjusted through gradient descent to match the considered structured document.
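The combination of sigmoid gating, normalization, and a residual connection can be sketched in NumPy as below. This is a generic gated residual graph convolution written under our own assumptions (weight names, feature-wise normalization, ReLU activation); it is meant to illustrate the mechanism and does not reproduce Equation 1.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_residual_layer(nodes, edges, adjacency, w_self, w_neigh, w_gate):
    """nodes: (N, D), edges: (N, N, D), adjacency: (N, N) with 0/1 entries."""
    # Gate in (0, 1) per neighbor and channel, zeroed where no edge exists.
    gates = sigmoid(edges @ w_gate) * adjacency[..., None]
    # Gated sum of transformed neighbor features for every node i.
    messages = (gates * (nodes @ w_neigh)[None, :, :]).sum(axis=1)
    update = nodes @ w_self + messages
    # Simple feature-wise normalization (an assumption, not the paper's block).
    mean = update.mean(axis=-1, keepdims=True)
    std = update.std(axis=-1, keepdims=True)
    update = (update - mean) / (std + 1e-5)
    # Residual connection: the layer input is added back to the activation.
    return nodes + np.maximum(update, 0.0)


rng = np.random.default_rng(0)
n, d = 3, 4
nodes = rng.normal(size=(n, d))
edges = rng.normal(size=(n, n, d))
adjacency = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
w_self, w_neigh, w_gate = (rng.normal(size=(d, d)) for _ in range(3))
out = gated_residual_layer(nodes, edges, adjacency, w_self, w_neigh, w_gate)
print(out.shape)
```

Because the gate is bounded by the sigmoid, no neighbor can contribute more than its transformed features, and the residual path keeps the input signal available to deeper layers.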
To evaluate the performance of the proposed method, we performed experiments on three datasets, including the VinText dataset [26], a new synthetic dataset derived from our previous work [27, 28], and the MC-OCR dataset [41]. We evaluated the lightweight DB network on both the VinText and MC-OCR datasets, while C-PREN was assessed on all three datasets. For the NER task, we used only the MC-OCR dataset. The following sections will provide a detailed description of our experimental methodology.
Dataset
VinText [26] is a dataset for Vietnamese text detection and recognition. The dataset contains 2,000 fully annotated images with 56,084 text instances. Each instance is described by a quadrilateral bounding box and its ground-truth character string. The authors randomly split the dataset into three subsets for training (1,200 images), validation (300 images), and testing (500 images). The images were either downloaded from the Internet or captured in everyday life in Vietnam. Therefore, the dataset is very diverse and consists of busy and chaotic scenes with many shop signs, billboards, and propaganda panels.
The synthetic dataset was generated from a 40,000-word Vietnamese dictionary, of which 23,000 entries are two-word compounds and 17,000 are single words. We applied several image processing techniques to enrich the dataset, including changing the angle, blur, contrast, brightness, and zoom of text images. Finally, we obtained 9 million images containing text. The dataset was randomly split into three subsets for training (80%), validation (10%), and testing (10%).
The Mobile Captured Receipts OCR (MC-OCR) dataset [41] was used in the RIVF2021 MC-OCR Competition, as illustrated in Fig. 9. It contains 2,436 Vietnamese receipt images. Each receipt has been manually annotated with two groups of ground truth: (i) the quality and (ii) important keywords. The quality of receipt images was determined by the lines that annotators could easily extract, while four keywords were identified: SELLER, ADDRESS, TIMESTAMP, and TOTAL_COST. To improve the training performance, we added a new keyword, named OTHERS, that describes the other recognized texts. The dataset was split into three subsets for training (80%), validation (10%), and testing (10%).
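An 80/10/10 split of the kind described for the synthetic and MC-OCR datasets can be sketched as follows. The function name, the fixed seed, and the rounding rule are assumptions for illustration only; the exact subset sizes produced by this sketch are a by-product of that rounding and are not figures reported in this paper.

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle once, then cut into train / validation / test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # the remainder becomes the test set
    )


# Image indices stand in for the 2,436 receipt images.
train_set, val_set, test_set = split_dataset(range(2436))
print(len(train_set), len(val_set), len(test_set))
```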

Vietnamese receipts.
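To evaluate the performance of the updated networks (the lightweight DB and C-PREN), as well as the overall method VRD-GCN, we conducted three scenarios of experiments: (i) text detection, comparing the lightweight DB with the original DB; (ii) text recognition, comparing C-PREN with PREN; and (iii) NER with the overall VRD-GCN method.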
The F1 score, precision, recall, and accuracy were used to evaluate the experimental results, and the character error rate (CER) was calculated as presented in Equation 3. The loss functions were the binary loss, loss-attention, and online hard example mining (OHEM) [36] for the lightweight DB, C-PREN, and RGCN, respectively. The Adam optimizer was employed for optimization with $\beta_1 = 0.9$ for all models; $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ for the lightweight DB; $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$ for C-PREN; and $\beta_2 = 0.999$ and $\epsilon = 10^{-7}$ for RGCN. To minimize the cost function, we applied mini-batches of size 8, 64, and 24 for the lightweight DB, C-PREN, and RGCN, respectively. The maximum number of training epochs was set to 1,200, 1,500, and 1,000 for the lightweight DB, C-PREN, and RGCN, respectively.
Besides, the data were shuffled before selecting the batches, so that all models see the samples in random order and the results are more objective. The initial learning rate was $10^{-3}$, and a self-adjusting learning rate technique was applied for all optimizers.
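For reference, the character error rate is commonly defined as the Levenshtein (edit) distance between the predicted and reference strings divided by the reference length. The sketch below assumes this standard definition; the function names and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference string must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))
```

A CER of 0 means a perfect transcription, and values can exceed 1 when the prediction is much longer than the reference.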
For the first and second scenarios, we conducted four experiments, whose results are provided in Table 1 (DB and lightweight DB comparison) and Table 2 (PREN and C-PREN comparison). From the obtained results, we found that the lightweight DB outperforms the original network by 3.61% in terms of the F1 score. This can be explained by the improvements that we have implemented in the proposed networks. The new backbone EfficientNet B3 applied in the lightweight DB yields better features, while removing the SE block helps the network avoid the effect of noise. We also compared the lightweight DB with OCR methods/tools that use the original DB. The obtained results show that our updated network outperforms these frameworks in text detection, as illustrated in Fig. 10. The figure shows an example of text detection using Paddle OCR and the lightweight DB. Our model can detect unclear or blurred texts, while Paddle OCR ignores them.

The F1 score and loss progress curves of RGCN model on MC-OCR dataset.
The lightweight and original DB comparison
C-PREN and PREN comparison
C-PREN also outperforms PREN applied to the same dataset by 3.6% in terms of accuracy, as shown in Table 2. Owing to the new backbone (EfficientNet B3) and especially the application of the Transformer technique [40], the new network is capable of combining visual and linguistic features. Therefore, the correction of predicted texts achieves better performance.
For the third scenario, the results of NER are presented in Table 3. The table indicates that VRD-GCN achieves a high precision, recall, F1 score, and accuracy of

Text detection using Paddle OCR (the upper image) and the lightweight DB (the lower image).
NER results on MC-OCR dataset
VRD-GCN also achieves a low CER of 0.218.
Table 4 compares VRD-GCN to the baseline method that used the original DB, PREN, and RGCN, as well as recent studies applied to the same dataset, MC-OCR. The improvements in the lightweight DB and C-PREN allow us to achieve better OCR results, compared to the baseline. This leads to a lower CER (0.218 compared to 0.275) and higher accuracy of NER (92.93% compared to 91.08%). Moreover, the proposed method also outperformed recent works in terms of either accuracy or CER. For example, Pham et al., who used CRAFT, TransformerOCR, and GNN-based methods, obtained a similar accuracy as ours, but with a higher CER. Overall, we achieved the lowest CER, while the accuracy of NER is superior among the reported values.
Performance comparison
We also conducted experiments to evaluate the computation time of the proposed models on both GPU (Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz 1.80 GHz with Tesla P100 GPU) and CPU (Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz 1.80 GHz) configurations. The obtained results are presented in Table 5. VRD-GCN is faster than the baseline method by 29.07 ms on GPU (30.43 ms compared to 59.5 ms) and by 14.3 s on CPU (11.4 s compared to 25.7 s). On GPU, text detection takes the longest time (30 ms), while on CPU recognition takes 9 s. Furthermore, VRD-GCN is smaller than the baseline (141.5 MB compared to 417 MB), which allows us to deploy the obtained model on devices with limited storage.
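The timings and model sizes quoted above can also be expressed as relative figures. The short calculation below uses only the numbers given in the text; the derived ratios are not reported in Table 5 and are shown purely as a worked illustration.

```python
# Values quoted in the text (VRD-GCN versus the baseline).
gpu_ms = {"vrd_gcn": 30.43, "baseline": 59.5}
cpu_s = {"vrd_gcn": 11.4, "baseline": 25.7}
size_mb = {"vrd_gcn": 141.5, "baseline": 417.0}

gpu_speedup = gpu_ms["baseline"] / gpu_ms["vrd_gcn"]
cpu_speedup = cpu_s["baseline"] / cpu_s["vrd_gcn"]
size_reduction = 1.0 - size_mb["vrd_gcn"] / size_mb["baseline"]

print(f"GPU speedup: {gpu_speedup:.2f}x")
print(f"CPU speedup: {cpu_speedup:.2f}x")
print(f"Model size reduction: {size_reduction:.1%}")
```

In other words, the proposed pipeline runs roughly twice as fast as the baseline on both configurations while occupying about one third of its storage.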
Computation time and model size comparison
Despite achieving an overall high performance, VRD-GCN exhibits limitations in predicting TIMESTAMP and TOTAL_COST, resulting in lower F1 scores. These limitations can be attributed to the small size of the dataset (MC-OCR) and the variability in annotations for these classes. Consequently, the model may not have received sufficient training data to learn the relevant patterns, which leads to its lower performance compared to other classes.
In this paper, we have presented a method to extract information from VRDs, which is capable of using the textual, non-spatial, and spatial visual features usually available in these documents. The proposed method contains two main steps with three deep neural networks. The first step, Text Reading, relies on two CNN-based models to detect and recognize texts and to supply additional information, such as text coordinates, width, or height. For text detection, we updated the DB network [18] with two improvements: the new backbone EfficientNet B3 and the removal of the SE block. For text recognition, we proposed the C-PREN network, which is based on the work of [44]. In this new network, we implemented a new block, MDCDP, that combines visual and spatial features (text images and position information). The outputs of these models were then fused to build a graph describing the studied VRDs, which contains node and edge features. In the second step, Text Information Extraction, we applied the RGCN network to process the output graph of the previous step and obtain important named entities. This network is based on the Graph Convolutional Network. Consequently, the two steps can share and integrate the common features between them.
We conducted empirical experiments on three datasets for information extraction from Vietnamese receipts, including a self-collected and two public datasets. The obtained results indicate that our method achieved high accuracy and outperformed the current SOTA method. They show the promise of the proposed method in the information extraction of VRDs.
In future work, we plan to address the limitations of our study by incorporating additional datasets and improving our model architecture to better learn the complex patterns (i.e., TIMESTAMP and TOTAL_COST). Additionally, we will focus on expanding the scope of our research to include other VRDs, such as medical documents, and on improving the performance of our model, particularly in the areas of text detection and recognition.
