Scene text spotting based on end-to-end

Abstract

Traditional OCR pipelines ignore the inherent connection between the text detection task and the text recognition task. To address this problem, this paper proposes a novel end-to-end text spotting framework. The framework consists of three parts: a shared convolutional feature network, a text detector and a text recognizer. By sharing the convolutional feature network, the text detection network and the text recognition network can be jointly optimized. On the one hand, this reduces the computational burden; on the other hand, it effectively exploits the inherent connection between text detection and text recognition. The model adds a TCM (Text Context Module) to Mask R-CNN, which alleviates the negative sample problem in text detection. This paper also proposes a text recognition model based on SAM-BiLSTM (spatial attention mechanism with BiLSTM), which extracts the semantic information between characters more effectively. The model surpasses state-of-the-art methods on text detection and text spotting benchmarks, including ICDAR 2015 and Total-Text.

Keywords

Scene text spotting; End-to-end; Joint optimization; TCM; SAM-BiLSTM

1 Introduction

Text spotting in natural scenes has high research value. First, text is an important carrier of human civilization. Second, the rich high-level semantic information carried by text helps to understand the world better. In addition, text spotting in natural scenes has a wide range of applications, such as industrial automation, instant translation, robot navigation, image search and assisted reading for the blind. Thus, text spotting in natural scenes is one of the current research hotspots.

However, text in natural scenes has complicated characteristics. First, text is diverse: languages, fonts, sizes and shapes vary widely, as shown in Fig. 1a. Second, text backgrounds are complex, containing leaves, bricks, windows, fences, etc., as shown in Fig. 1b. Last, image quality is often low because of low resolution, distortion and blurring, as shown in Fig. 1c.

Fig. 1

Features in natural scenes. a. Text diversity; b. Complex text background; c. Low image quality.

As summarized by Long et al. [25], traditional OCR (optical character recognition) methods generally decompose text spotting in natural scenes into two independent subtasks: text detection and text recognition. This approach has achieved good performance. However, it ignores the inherent connection between text detection and text recognition. First, errors accumulate: errors made in the text detection stage are passed on to the text recognition stage. In addition, the text detection task and the text recognition task cannot be optimized at the same time.

To address the issues of current OCR methods, end-to-end OCR processing has been proposed by Li et al. [1], Liu et al. [2], He et al. [3], Sun et al. [4] and Lyu et al. [5]. These methods share a feature extraction network between the text detection task and the text recognition task to achieve joint optimization of the two tasks.

The end-to-end method was first proposed by Li et al. [1]; it achieves good performance on horizontal text datasets but cannot deal well with the text diversity of natural scenes. For multi-oriented text, Liu et al. [2], He et al. [3] and Sun et al. [4] proposed similar end-to-end recognition models. They first extract the features of the text region through the feature extraction network, then rectify the text region features, and finally feed the rectified text feature maps into the text recognizer. The method of Lyu et al. [5] improves text detection performance but loses the potential sequence information between characters.

Liu et al. [2] and Qin et al. [10] both jointly optimize the text detector and the text recognizer, which makes good use of the internal connection between text detection and text recognition and thereby improves the overall performance. However, the method of Liu et al. [2] is not suitable for curved text, and the method of Qin et al. [10] cannot solve the negative sample problem of text detection well.

To better solve the above problems, this paper proposes a novel end-to-end text spotting framework. The model jointly optimizes the text detector and the text recognizer, effectively utilizes the internal relationship between text detection and text recognition, and improves the overall performance. The text detector uses an improved Mask R-CNN framework with an added TCM, which alleviates the problem of negative samples. The text recognizer is a text recognition model based on SAM-BiLSTM, which extracts semantic information between characters more effectively. In addition, the model can handle text of arbitrary shape.

The main contributions of this paper are summarized as follows:

This paper proposes a novel end-to-end text spotting framework. By sharing convolutional features, the text detection network and the text recognition network can be jointly optimized. On the one hand, this reduces the computational burden; on the other hand, it effectively exploits the inherent connection between text detection and text recognition.

The model adds the TCM module to Mask R-CNN, which alleviates the negative sample problem in text detection.

This paper proposes a text recognition model based on SAM-BiLSTM (spatial attention mechanism with BiLSTM), which extracts the semantic information between characters more effectively.

The method surpasses state-of-the-art methods on text detection and text spotting benchmarks, including ICDAR 2015 and Total-Text.

This paper proposes an end-to-end text spotting model and adopts a joint optimization strategy to improve the performance of text spotting.

2 Related methods

2.1 Text detection

The complexity of natural scenes increases the difficulty of text detection. Tian et al. [26] exploit RNNs (recurrent neural networks) to improve the detection of horizontal text, but the method is not suitable for non-horizontal text. Shi et al. [38] first decompose each word into small oriented text segments that are easier to detect, and then connect neighboring segments into words through links, which helps to detect oriented words and text lines of widely varying lengths. Zhou et al. [14] use an FCN (fully convolutional network) to generate multi-scale fused feature maps to improve the performance of text detection.

2.2 Text recognition

The mainstream methods for natural scene text recognition are CTC-based (connectionist temporal classification) methods and attention-based methods. Graves et al. [27] were the first to use the CTC-based method and achieved good performance. Shi et al. [19], Su et al. [28], Liu et al. [29], Gao et al. [30] and Yin et al. [31] adopted improved CTC methods, further verifying the performance of the CTC approach. Bahdanau et al. [32] first proposed the attention mechanism for machine translation and achieved good performance. The attention mechanism is now widely used for text recognition in natural scenes. Liu et al. [33] proposed an attention mechanism based on encoding and decoding, which adapts better to text recognition. For irregular text, Shi et al. [34] combined the attention mechanism with a spatial transformer network to improve recognition performance in natural scenes.

2.3 Text spotting

The traditional OCR method ignores the inherent connection between text detection and text recognition, whereas end-to-end text spotting can make full use of this connection to improve overall performance. Liao et al. [21] use a text detector based on SSD (single shot multibox detector) [35] and a text recognizer based on CRNN (convolutional recurrent neural network) [36]. Li et al. [1] use a text detector based on RPN (region proposal network) [37] and a text recognizer based on an attention LSTM (long short-term memory network). Liu et al. [2] and Qin et al. [10] both adopt joint optimization strategies to improve the overall performance of text spotting.

Compared with these methods, the proposed method has the following advantages. It adopts a joint optimization strategy, which not only reduces the computational burden but also makes full use of the inherent connection between text detection and text recognition. The TCM module alleviates the negative sample problem of text detection. The text recognition model based on SAM-BiLSTM extracts the semantic information between characters more effectively.

3 Model design

3.1 Model

This paper proposes an end-to-end text spotting model, as shown in Fig. 2. It includes three parts: shared convolutional feature extraction, a text detection branch and a text recognition branch. First, the image is fed into the feature extraction network for feature learning, and the resulting feature map is passed to the text detector and the text recognizer. Inspired by Xie et al. [6], the text detection branch adds a TCM module to Mask R-CNN, which better handles the problem of negative samples. The text recognizer is a text recognition model based on SAM-BiLSTM (spatial attention mechanism with BiLSTM), which extracts semantic information between characters more effectively. Joint optimization is used to further improve the overall performance of text spotting in natural scenes.

Fig. 2

End-to-end OCR model overall framework.

3.2 Feature extraction

The feature extraction network uses the ResNet-50 structure of He et al. [7], as shown in Fig. 3. Text in natural scenes varies greatly in size; to adapt to text of various sizes, the network needs to maintain a large receptive field and rich features. Dilated convolution is used to maintain a large receptive field. Inspired by the FPN (feature pyramid network) proposed by Lin et al. [8], this paper cascades low-resolution and high-resolution feature maps to extract richer text features.

Fig. 3

Shared convolution network.
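The two ideas in Section 3.2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the upsampling factor of 2 and nearest-neighbour interpolation are assumptions, and "cascading" is read here as channel-wise concatenation.

```python
import numpy as np

def effective_kernel(kernel_size, dilation):
    """Side length covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def merge_levels(high_res, low_res):
    """Concatenate a (C1, 2H, 2W) map with a (C2, H, W) map upsampled by 2.

    Assumption: nearest-neighbour upsampling followed by channel concatenation.
    """
    upsampled = low_res.repeat(2, axis=1).repeat(2, axis=2)
    return np.concatenate([high_res, upsampled], axis=0)

high = np.zeros((4, 8, 8))        # high-resolution, shallow features
low = np.ones((6, 4, 4))          # low-resolution, deep features
merged = merge_levels(high, low)  # channels from both levels, at high resolution
```

A 3×3 kernel with dilation 2 covers a 5×5 window without adding parameters, which is why dilation enlarges the receptive field cheaply.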

3.3 Text detection branch

The detection branch is based on Mask R-CNN. Negative samples are one of the difficulties in text detection: natural scenes contain many regularly patterned objects, such as fences and walls, that are easily mistaken for text. Introducing contextual information therefore helps the network to extract more salient features and to classify regions of interest accurately. To address the negative sample problem, the TCM module proposed by Xie et al. [6] is added to Mask R-CNN, and the scoring mechanism is also improved.

3.3.1 TCM

The TCM structure is shown in Fig. 4, where F is the output feature map of the FPN. First, the feature map F passes through three convolutional layers to produce a text segmentation feature map F′. Second, F and F′ are multiplied element by element to produce the feature map F″. Finally, F″ and F are added element by element to produce the feature map F‴, which fuses the detection features and segmentation features to obtain richer features.

Fig. 4

TCM structure.

The output feature F of the shared convolutional layer is fed into the TCM module, and the text salient feature F′ is calculated as follows: $F' = e^{\mathrm{Softmax}(\mathrm{TCM}(F))}$ (1)

The feature map generated by TCM has two channels, text and non-text, and is activated by the Softmax function to obtain the text salient feature F′.

Then F′ is broadcast and multiplied element-wise with the feature F: $F'' = \mathrm{Broadcast}(F') \odot F$ (2)

The symbol ⊙ stands for element-wise multiplication.

Because the features around the text help to detect it, adding the original features to the saliency-weighted features element by element improves detection. The calculation is as follows:

$F''' = F'' \oplus \mathrm{Conv}_{3 \times 3}(F)$ (3)

The symbol ⊕ represents element-wise addition, and Conv_3×3 represents a 3×3 convolution.

Finally, the output feature F‴ is fed into Mask R-CNN.
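The TCM fusion steps can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: `tcm_fuse` and its arguments are hypothetical names, the text channel is assumed to be index 1 of the two-channel segmentation output, the 3×3 convolution is passed in as a callable (identity by default), and the final addition follows the description that the weighted map F″ is added to the original features.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def tcm_fuse(F, seg_logits, conv3x3=lambda x: x):
    """Fuse shared features with a text saliency map.

    F:          (C, H, W) shared feature map.
    seg_logits: (2, H, W) non-text/text logits (assumed channel order).
    conv3x3:    stand-in for the 3x3 convolution applied to F.
    """
    text_prob = softmax(seg_logits, axis=0)[1]   # (H, W) text probability
    saliency = np.exp(text_prob)                 # Eq. (1)
    weighted = saliency[None, :, :] * F          # Eq. (2), broadcast over channels
    return weighted + conv3x3(F)                 # Eq. (3), element-wise addition
```

Because the saliency values lie between 1 and e, text-like positions are amplified while no position is suppressed to zero.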

3.3.2 Re-Score mechanism

Mask R-CNN uses the classification score of a detected text box as its final score and filters out low-scoring text predictions with a threshold. However, if the proportion of text in a box is relatively low, the box receives a low classification score even though it scores highly on the semantic segmentation map. Conversely, some negative samples have high detection scores but low scores on the semantic segmentation map. Therefore, fusing the semantic segmentation and instance segmentation features and recomputing a more reasonable score reduces false detections.

The calculation is as follows: $s_{i} = \frac{e^{(s_{i1}^{cs} + s_{i1}^{is})}}{e^{(s_{i1}^{cs} + s_{i1}^{is})} + e^{(s_{i0}^{cs} + s_{i0}^{is})}}$ (4)

Here s_i is the final score, cs denotes the classification score and is denotes the instance score; the final score is obtained by the Softmax function. The instance score is obtained by projecting the instance segmentation result onto the semantic segmentation map and calculating the average pixel value of the projection area. The calculation formula is as follows: $s_{i1}^{is} = \frac{\sum_{j} p_{i}^{j}}{N}$ (5)
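A minimal sketch of the re-scoring step follows, under stated assumptions: the function names are hypothetical, the instance score is taken as the mean text probability inside the instance mask, and the non-text instance score is assumed to be one minus the text instance score (the text does not define it).

```python
import math
import numpy as np

def instance_score(text_prob_map, instance_mask):
    """Mean semantic text probability over the pixels of one instance."""
    pixels = text_prob_map[instance_mask.astype(bool)]
    return float(pixels.mean()) if pixels.size else 0.0

def rescore(cls_text, cls_bg, inst_text, inst_bg):
    """Two-way softmax over summed classification and instance scores."""
    text = math.exp(cls_text + inst_text)
    background = math.exp(cls_bg + inst_bg)
    return text / (text + background)

prob_map = np.array([[0.9, 0.8],
                     [0.1, 0.2]])
mask = np.array([[1, 1],
                 [0, 0]])
s_is = instance_score(prob_map, mask)
# A box with a weak classification score but strong segmentation support.
final = rescore(cls_text=0.4, cls_bg=0.6, inst_text=s_is, inst_bg=1.0 - s_is)
```

In this example the classification score alone favours background, but the fused score exceeds 0.5 because the segmentation map strongly supports text.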

3.4 Text recognition branch

3.4.1 CBAM

The CBAM (convolutional block attention module) proposed by Woo et al. [9] has achieved good performance in object recognition tasks, so it is added to the text recognition branch. As shown in Fig. 5, CBAM applies attention along two dimensions: the channel dimension and the spatial dimension. The channel attention mechanism better captures global features, and the spatial attention mechanism better captures local features.

Fig. 5

CBAM structure diagram.

The module adopts a sequential combination attention mechanism: the output feature F of the shared convolution layer is first fed into the channel attention module, and then the output result F_cam is fed into the spatial attention module to obtain a modified attention feature map F_rf. This can better coordinate global and local attention.

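The calculation is as follows: $F_{cam} = M_{c}(F) \odot F$ (6) $F_{rf} = M_{s}(F_{cam}) \odot F_{cam}$ (7)

Here M_c and M_s denote the channel attention map and the spatial attention map, respectively.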
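The sequential channel-then-spatial arrangement can be sketched in NumPy. This is a simplified illustration, not the CBAM reference implementation: the weight names are hypothetical, and the spatial attention replaces CBAM's 7×7 convolution with a scalar mix of the channel-wise average and maximum maps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Shared two-layer MLP on average- and max-pooled channel descriptors."""
    avg = F.mean(axis=(1, 2))                        # (C,)
    mx = F.max(axis=(1, 2))                          # (C,)
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)     # ReLU hidden layer
    return sigmoid(mlp(avg) + mlp(mx))               # (C,)

def spatial_attention(F, w_avg=1.0, w_max=1.0):
    """Simplified spatial map from channel-wise average and maximum."""
    return sigmoid(w_avg * F.mean(axis=0) + w_max * F.max(axis=0))  # (H, W)

def cbam(F, W1, W2):
    """Channel attention first, then spatial attention on the result."""
    F_cam = channel_attention(F, W1, W2)[:, None, None] * F
    F_rf = spatial_attention(F_cam)[None, :, :] * F_cam
    return F_rf
```

Applying spatial attention to the channel-refined map, not to the raw input, is what makes the combination sequential.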

3.4.2 ROI masking

Qin et al. [10] proposed ROI masking, which better extracts the instance features of irregular text. Following their approach, this paper feeds the output feature map of CBAM into ROI masking. ROI masking filters out the background around the text to avoid attending to non-text areas. It consists of two steps: first, the features are cropped with the predicted rectangular bounding box; second, the cropped features are multiplied by the corresponding instance mask.
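The two steps of ROI masking can be sketched as below. The function name and the box convention (integer pixel coordinates, exclusive end) are assumptions made for illustration.

```python
import numpy as np

def roi_masking(features, box, instance_mask):
    """Crop features to a box, then zero out pixels outside the instance.

    features:      (C, H, W) feature map.
    box:           (x0, y0, x1, y1) integer coordinates, end exclusive.
    instance_mask: (H, W) binary mask of the text instance.
    """
    x0, y0, x1, y1 = box
    crop = features[:, y0:y1, x0:x1]            # step 1: rectangular crop
    mask_crop = instance_mask[y0:y1, x0:x1]
    return crop * mask_crop[None, :, :]         # step 2: apply instance mask
```

For curved text the rectangular crop contains much background, so the mask multiplication is what keeps the recognizer from attending to it.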

3.4.3 Attention decoder

Zhou et al. [11] proposed a decoder based on the attention mechanism. This paper improves it by adding spatial attention, as shown in Fig. 6.

Fig. 6

SAM-BiLSTM module structure diagram.

Suppose T iterations are needed, and the predicted character sequence is y = (y₁, …, y_T). There are three inputs at step t: the input feature F, the hidden state s_{t-1} of the previous iteration and the character category y_{t-1} predicted by the previous iteration.

First, the vector s_{t-1} is expanded into a feature map S_{t-1} of shape (V, H_p, W_p), where V is the size of the RNN hidden layer and is set to 256: $S_{t - 1} = {expand}_{\dim} (s_{t - 1}, H_{p}, W_{p})$ (8)

Secondly, calculate the weight α_t of attention: $e_{t} = W_{t} \times \tanh (W_{s} \times S_{t - 1} + W_{f} \times F + b)$ (9) $α_{t} (i, j) = \frac{exp (e_{t} (i, j))}{\sum_{i^{'} = 1}^{H_{p}} \sum_{j^{'} = 1}^{W_{p}} exp (e_{t} (i^{'}, j^{'}))}$ (10)

The shapes of e_t and α_t are (H_p, W_p); W_t, W_s, W_f and b are trainable weights and bias.

Thirdly, calculate the weighted feature g_t: $g_{t} = \sum_{i = 1}^{H_{p}} \sum_{j = 1}^{W_{p}} α_{t} (i, j) \times F (i, j)$ (11)

The character category y_{t-1} predicted by the previous iteration is embedded and concatenated with g_t to form the input r_t of the RNN: $f (y_{t - 1}) = W_{y} \times onehot (y_{t - 1}, N_{c}) + b_{y}$ (12) $r_{t} = concat (g_{t}, f (y_{t - 1}))$ (13)

W_y and b_y are the weights and bias of the embedding, and N_c is the number of character classes of the sequence decoder. This paper sets it to 79, covering upper- and lower-case English letters, Arabic numerals and several special characters.

Then r_t and s_{t-1} are fed into the RNN (BiLSTM): $(x_{t}, s_{t}) = rnn (s_{t - 1}, r_{t})$ (14)

Finally, the prediction result of the t-th iteration is as follows: $p (y_{t}) = softmax (W_{o} \times x_{t} + b_{o})$ (15) $y_{t} \sim p (y_{t})$ (16)
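One decoding step up to the RNN input, Eqs. (8) to (13), can be sketched in NumPy. This is a sketch under assumptions: the products in Eq. (9) are read as per-position linear projections, the attention dimension and the `params` dictionary keys are hypothetical, and the RNN update and output layer of Eqs. (14) to (16) are omitted.

```python
import numpy as np

def attention_step(F, s_prev, y_prev, params, num_classes=79):
    """Compute attention weights, glimpse and RNN input for one step.

    F:      (C, Hp, Wp) input feature map.
    s_prev: (V,) previous hidden state.
    y_prev: int index of the previously predicted character.
    params: dict with W_s (A, V), W_f (A, C), b (A,), W_t (A,),
            W_y (E, num_classes), b_y (E,).
    """
    C, Hp, Wp = F.shape
    # Eq. (8): copy the hidden state to every spatial position.
    S = np.broadcast_to(s_prev[:, None, None], (s_prev.shape[0], Hp, Wp))
    # Eq. (9): additive attention energies, shape (Hp, Wp).
    hidden = (np.einsum('av,vhw->ahw', params['W_s'], S)
              + np.einsum('ac,chw->ahw', params['W_f'], F)
              + params['b'][:, None, None])
    e = np.einsum('a,ahw->hw', params['W_t'], np.tanh(hidden))
    # Eq. (10): softmax over all spatial positions.
    a = np.exp(e - e.max())
    alpha = a / a.sum()
    # Eq. (11): attention-weighted sum of features, shape (C,).
    g = np.einsum('hw,chw->c', alpha, F)
    # Eq. (12): embed the previous character via a one-hot vector.
    onehot = np.zeros(num_classes)
    onehot[y_prev] = 1.0
    f_y = params['W_y'] @ onehot + params['b_y']
    # Eq. (13): concatenate glimpse and embedding.
    r = np.concatenate([g, f_y])
    return alpha, g, r
```

The attention weights form a 2D map over the feature grid, which is what lets the decoder follow characters along a curved path instead of a fixed left-to-right axis.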

3.5 Joint optimization and loss function

The end-to-end text spotting framework is a multi-objective task. Therefore, the multi-objective loss function is defined as follows: $L = L_{detect} + λ_{1} L_{recog}$ (17) $L_{detect} = λ_{2} L_{cls} + λ_{3} L_{box} + λ_{4} L_{mask} + λ_{5} L_{gts}$ (18)

L_cls, L_box and L_mask are loss functions in Mask R-CNN. L_gts is used to optimize global text semantics and is defined as: $L_{gts} = \frac{1}{N} \sum_{i} - \log (\frac{e^{p_{i}}}{Σ_{j} e^{p_{j}}})$ (19)

L_gts is a Softmax loss function, and p is the predicted output of the network. The recognition loss is defined as: $L_{recog} = - \frac{1}{T} \sum_{t = 1}^{T} \log (p (y_{t}))$ (20)

T is the length of the sequence label. λ₁, λ₂, λ₃, λ₄, and λ₅ are set to 2.0, 1.0, 1.0, 1.0, and 1.0, respectively.
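The weighted combination of Eqs. (17), (18) and (20) amounts to the following arithmetic. The function names are hypothetical, and the individual loss values are assumed to be already computed scalars.

```python
import math

def recognition_loss(char_probs):
    """Eq. (20): mean negative log-probability of the ground-truth characters."""
    return -sum(math.log(p) for p in char_probs) / len(char_probs)

def total_loss(l_cls, l_box, l_mask, l_gts, l_recog,
               lambdas=(2.0, 1.0, 1.0, 1.0, 1.0)):
    """Eqs. (17) and (18) with the weights lambda_1 to lambda_5 from the text."""
    lam1, lam2, lam3, lam4, lam5 = lambdas
    l_detect = lam2 * l_cls + lam3 * l_box + lam4 * l_mask + lam5 * l_gts
    return l_detect + lam1 * l_recog
```

With these weights the recognition term counts twice as much as each detection term, so a recognition loss of 0.5 contributes 1.0 to the total.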

4 Experiment design and analysis

4.1 Data sets

The datasets used in the experiments are as follows:

SynthText is a synthetic dataset containing 800,000 artificially synthesized images with a large number of multi-oriented text instances.

ICDAR2017 MLT is a natural scene text dataset containing 7,200 training images, 1,800 validation images and 9,000 test images. It contains text in multiple orientations and languages.

ICDAR2015 is the natural scene text dataset released for the ICDAR 2015 competition. Its images were captured incidentally with Google Glass, and it contains 1,000 training images and 500 test images.

Total-Text is a comprehensive scene text dataset containing horizontal, multi-oriented and curved text. The training set contains 1255 images, and the test set contains 300 images.

4.2 Experimental details

The training data mainly come from SynthText and part of the ICDAR2017 MLT training set, with 200 k and 7 k images, respectively. Testing is mainly performed on the ICDAR2015 and Total-Text datasets. A model pre-trained on ImageNet is used for initialization.

Unlike previous approaches that train text detection and text recognition independently or alternately, the end-to-end text spotting model trains both tasks jointly and simultaneously.

The entire training process includes two phases: pre-training on SynthText, followed by fine-tuning on real data. Different training datasets are used for different tasks.

The experiment uses the SGD (stochastic gradient descent) optimizer with a weight decay of 0.0001 and a momentum of 0.9. In the pre-training phase, the model is trained for 300 k iterations with an initial learning rate of 0.01, and the learning rate is decreased at 150 k and 300 k iterations. In the fine-tuning phase, the initial learning rate is set to 0.001 and reduced to one-tenth at 150 k iterations; fine-tuning stops at 200 k iterations.
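The fine-tuning schedule can be written as a simple step function. This is a sketch: `step_lr` is a hypothetical helper, and it assumes the reduction takes effect exactly at the milestone iteration.

```python
def step_lr(iteration, base_lr, milestones, gamma=0.1):
    """Multiply the base learning rate by gamma once per milestone reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops

# Fine-tuning phase: start at 0.001, reduce to one-tenth at 150k iterations.
lr_start = step_lr(0, 0.001, [150_000])
lr_late = step_lr(180_000, 0.001, [150_000])
```

The same helper covers the pre-training phase by passing its initial rate and milestones, with the decay factor supplied as `gamma`.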

4.3 Experimental results and analysis

The model was tested and evaluated on the ICDAR2015 dataset and the Total-Text dataset.

4.3.1 Straight text

This experiment verifies the performance of the proposed method on straight text spotting using the ICDAR2015 dataset. The images in this dataset were captured by a wearable camera without deliberately focusing on the text area. Therefore, the text size, text orientation and font size vary greatly. In addition, text instances are annotated at word level, with quadrilateral bounding boxes and transcriptions provided.

In the training phase, the model is first pre-trained on the SynthText dataset and then fine-tuned on the ICDAR2015 dataset. For the evaluation of detection, a prediction is counted as positive if its IoU with the closest ground-truth box is greater than 0.5. For end-to-end evaluation, the predicted text content must also be identical to the label to be considered positive. Unreadable words are collectively labeled as “do not care.” The evaluation metrics used in the experiment are precision, recall and F-measure.
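The matching criterion and the F-measure can be sketched as follows. The sketch simplifies boxes to axis-aligned rectangles, whereas ICDAR2015 annotations are quadrilaterals; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0

# Two 2x2 boxes overlapping by half: IoU is 1/3, below the 0.5 threshold.
is_match = iou((0, 0, 2, 2), (1, 0, 3, 2)) > 0.5
```

As a consistency check, the detection precision 91.96 and recall 87.84 reported for this paper on ICDAR2015 give an F-measure that rounds to the reported 89.85.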

“P”, “R” and “F” mean Precision, Recall and F-measure in detection task respectively, “E2E” means end-to-end, “S”,“W” and “G” mean recognition with strong, weak and generic lexicon respectively.

The experimental results are shown in Table 1. In the text detection task, the precision of the proposed method is 0.11 percentage points higher than that of Liu et al. [2]. In the end-to-end text spotting task (with strong lexicon), the result is 1.74 percentage points higher than that of Qin et al. [10]. In general, the model achieves good performance on the ICDAR2015 dataset.

Table 1
Results on the ICDAR2015 dataset

Method	Detection			Method	E2E
	P	R	F		S	W	G
SegLink [38]	74.74	76.50	75.63	Stradvision [17]	43.70	–	–
SSTD [13]	80.23	73.86	76.91	E2E-MLT [20]	–	–	55.10
EAST [14]	84.27	78.33	80.72	TextProposals [18]	56.00	52.40	49.70
TextSnake [15]	84.90	80.40	82.60	HUST_MCLAB [19]	67.86	–	–
RRD MS [16]	88.00	80.00	83.80	TextNet [4]	78.66	74.90	60.45
TextNet [4]	89.42	85.41	87.37	Mask TextSpotter [5]	79.30	73.00	60.46
Mask TextSpotter [5]	91.60	81.00	86.00	He et al. [3]	82.00	77.00	63.00
FOTS MS [2]	91.85	87.92	89.84	FOTS MS [2]	83.55	79.11	65.33
Towards [10]	91.67	87.96	89.78	Towards [10]	85.51	81.91	69.94
Wei et al. [39]	91.90	87.80	89.80	Wei et al. [39]	83.4	75.1	63.3
This paper	91.96	87.84	89.85	This paper	87.25	83.83	70.46

4.3.2 Curve text

The biggest advantage of the proposed method is its performance on irregularly shaped text, which is evaluated on the Total-Text dataset. The model is first pre-trained on the SynthText dataset and then fine-tuned on the Total-Text training set. Total-Text contains a large amount of curved text annotated at word level; the training set contains 1255 images and the test set contains 300 images. The evaluation protocol for detection follows the protocol proposed in [12], and the end-to-end evaluation follows the ICDAR2015 end-to-end evaluation protocol.

The experimental results are shown in Table 2. The proposed method improves on previous methods in both text detection (precision and F-measure) and end-to-end text spotting. In particular, end-to-end text spotting performance is 3.16 percentage points higher than that of Qin et al. [10].

Table 2

Results on the Total-Text dataset

Method	Detection			E2E
	P	R	F	None
SegLink [38]	40.00	33.00	36.00	–
SSTD [13]	62.10	45.50	52.20	36.30
EAST [14]	68.21	59.45	63.53	54.00
TextSnake [15]	69.00	55.00	61.30	52.90
RRD MS [16]	82.70	74.50	78.40	–
TextNet [4]	85.20	73.00	78.60	–
Mask TextSpotter [5]	81.20	79.90	80.60	–
FOTS MS [2]	84.70	78.00	81.30	–
Towards [10]	87.80	85.00	86.40	70.70
This paper	88.71	84.89	86.75	73.86

Figure 7 shows the visualization results on the ICDAR2015 dataset and the Total-Text dataset. As the first two rows show, the model handles both curved and non-curved text well. In the third row, the complex text background and blurred image quality lead to false detections and missed detections.

Fig. 7

Visualization results. The first column shows results on the Total-Text dataset, and the second column shows results on the ICDAR2015 dataset.

4.4 Ablation experiment

Ablation experiments are conducted on the ICDAR2015 dataset to verify the components of the proposed model, using the average precision (AP) score as the metric.

The experimental results are shown in Table 3.

Table 3
Results on the ICDAR 2015 dataset

Method	TCM	SAM-BiLSTM	IC2015
			AP_Det	AP_E2E
Baselines			82.30	54.10
Baselines + TCM	√		84.50	54.60
Baselines + SAM-BiLSTM		√	83.20	56.90
Baselines + TCM + SAM-BiLSTM	√	√	85.40	57.30

Baselines: an end-to-end model without the TCM module and the SAM-BiLSTM module. Baselines + TCM: when the TCM module is introduced, AP_Det increases by 2.2%, which indicates that TCM benefits text detection. Meanwhile, AP_E2E also improves slightly, which indicates that better text detection can, to a certain extent, promote text recognition performance. Baselines + SAM-BiLSTM: when the SAM-BiLSTM module is added, AP_E2E increases by 2.8%, which shows that the SAM-BiLSTM module effectively improves the performance of text recognition. Baselines + TCM + SAM-BiLSTM: when both modules are added, AP_Det and AP_E2E are higher than in the other configurations, which verifies that the TCM and SAM-BiLSTM modules both benefit end-to-end text spotting.

In Table 2, “P”, “R” and “F” mean Precision, Recall and F-measure in the detection task respectively, “E2E” means end-to-end, and “None” means recognition without any lexicon.

Table 3 reports results on the ICDAR 2015 test set under different model configurations. AP values are reported. “TCM” and “SAM-BiLSTM” indicate whether the TCM module and the SAM-BiLSTM module are used.

5 Conclusion

For text of arbitrary shape in natural scenes, this paper proposes an end-to-end text spotting model that jointly optimizes text detection and text recognition so that the two tasks promote each other. On the one hand, the method alleviates the negative sample problem to a certain extent; on the other hand, it better extracts the semantic information of text in natural scenes. The proposed method improves on previous methods on both the ICDAR2015 and Total-Text benchmarks.

The inherent characteristics of text in natural scenes remain the main difficulties of text spotting and the focus of future research: the diversity of text (different languages, fonts, font sizes, shapes), the complexity of the text background (leaves, bricks, windows, fences), and low image quality (low resolution, distortion, and blur). How to deal with these problems is the key to text spotting in natural scenes.

Acknowledgments

The authors are grateful for the support from the Key Research Program of Shandong Province (2018GGX101011).

References

1. Li, Wang and Shen C.H., Towards end-to-end text spotting with convolutional recurrent neural networks, In Proceedings of the IEEE International Conference on Computer Vision (2017), 5238–5246.

2. Liu X.B., Liang, Yan, Chen D.G., Qiao and Yan J.J., FOTS: Fast oriented text spotting with a unified network, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 5676–5685.

3. He, Tian, Huang W.L., Shen C.H., Qiao and Sun C.M., An end-to-end textspotter with explicit alignment and attention, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 5020–5029.

4. Sun Y.P., Zhang C.Q., Huang Z.M., Liu J.M., Han J.Y. and Ding E.R., TextNet: Irregular text reading from images with an end-to-end trainable network, arXiv preprint arXiv:1812.09900 (2018).

5. Lyu P.Y., Liao M.H., Yao, Wu W.H. and Bai, Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes, In Proceedings of the European Conference on Computer Vision (ECCV) (2018), 67–83.

6. Xie E.Z., Zang Y.H., Shao, Yu, Yao and Li G.Y., Scene Text Detection with Supervised Pyramid Context Network, arXiv preprint arXiv:1811.08605 (2018).

7. He, Zhang, Ren and Sun, Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 770–778.

8. Lin T.-Y., Dollár, Girshick, He, Hariharan and Belongie, Feature pyramid networks for object detection, arXiv preprint arXiv:1612.03144 (2016).

9. Woo, Park, Lee J.-Y. and Kweon I.S., CBAM: Convolutional Block Attention Module, arXiv preprint arXiv:1807.06521 (2018).

10. Qin S.Y., Bissacco, Raptis, Fujii and Xiao, Towards Unconstrained End-to-End Text Spotting, arXiv preprint arXiv:1908.09231 (2019).

11. Zhou, Shi, Tian, Qi Z.Y., Li B.C., Hao H.W. and Xu, Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016), 207–212.

12. Wolf and Jolion J.-M., Object count/area graphs for the evaluation of object detection and segmentation algorithms, International Journal of Document Analysis and Recognition (IJDAR) 8(4) (2006), 280–296.

13. He, Huang W.L., He, Zhu, Qiao and Li X.L., Single shot text detector with regional attention, In Proceedings of the IEEE International Conference on Computer Vision (2017), 3047–3055.

14. Zhou X.Y., Yao, Wen, Wang Y.Z., Zhou S.C., He W.R. and Liang J.J., EAST: An efficient and accurate scene text detector, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), 5551–5560.

15. Long S.B., Ruan J.Q., Zhang W.J., He, Wu W.H. and Yao, TextSnake: A flexible representation for detecting text of arbitrary shapes, In Proceedings of the European Conference on Computer Vision (ECCV) (2018), 20–36.

16. Liao M.H., Zhu, Shi B.G., Xia G.-S. and Bai, Rotation-sensitive regression for oriented scene text detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 5909–5918.

17. Karatzas, Gomez-Bigorda, Nicolaou, Ghosh, Bagdanov, Iwamura, Matas, Neumann, Chandrasekhar V.R., Lu S.J., et al., ICDAR 2015 competition on robust reading, In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2015), 1156–1160.

18. Gómez and Karatzas, TextProposals: A text-specific selective search algorithm for word spotting in the wild, Pattern Recognition 70 (2017), 60–74.

19. Shi B.G., Bai and Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(11) (2017), 2298–2304.

20. Busta, Patel and Matas, E2E-MLT: An unconstrained end-to-end method for multi-language scene text, arXiv preprint arXiv:1801.09919 (2018).

21. Liao M.H., Shi B.G., Bai, Wang X.G. and Liu W.Y., Textboxes: A fast text detector with a single deep neural network, In Thirty-First AAAI Conference on Artificial Intelligence (2017).

22. Xue C.H., Lu S.J. and Zhang, MSR: Multi-scale shape regression for scene text detection, arXiv preprint arXiv:1901.02596 (2019).

23. Xu Y.C., Wang Y.K., Zhou, Wang Y.G., Yang Z.B. and Bai, Textfield: Learning a deep direction field for irregular scene text detection, IEEE Transactions on Image Processing (2019).

24. Dai Y.C., Huang, Gao Y.T., Xu Y.Z., Chen, Guo and Qiu W.D., Fused text segmentation networks for multi-oriented scene text detection, In 2018 24th International Conference on Pattern Recognition (ICPR), IEEE (2018), 3604–3609.

25. Long S.B., He and Yao, Scene text detection and recognition: The deep learning era, arXiv preprint arXiv:1811.04256 (2019).

26. Tian, Huang W.L., He and Qiao, Detecting text in natural image with connectionist text proposal network, arXiv preprint arXiv:1609.03605 (2016).

27. Graves, Liwicki, Bunke, Schmidhuber and Fernández, Unconstrained on-line handwriting recognition with recurrent neural networks, In Advances in Neural Information Processing Systems (2008), 577–584.

28. Su B.L. and Lu S.J., Accurate scene text recognition based on recurrent neural network, In Asian Conference on Computer Vision, Springer (2014), 35–48.

29. Liu, Chen C.F., Wong K.-Y.K., Su Z.Z. and Han J.Y., Star-net: A spatial attention residue network for scene text recognition, In BMVC 2 (2016), page 7.

30. Gao Y.Z., Chen Y.Y., Wang J.Q. and Lu H.Q., Reading scene text with attention convolutional sequence modeling, arXiv preprint arXiv:1709.04303 (2017).

31. Yin, Wu Y.-C., Zhang X.-Y. and Liu C.-L., Scene text recognition with sliding convolutional character models, arXiv preprint arXiv:1709.01727 (2017).

32. Bahdanau, Cho and Bengio, Neural machine translation by jointly learning to align and translate, ICLR (2015).

33. Liu Z.C., Li Y.X., Ren F.B., Yu and Goh W.L., Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network, AAAI (2018).

34. Shi B.G., Yang M.K., Wang X.G., Lyu, Bai and Yao, ASTER: An attentional scene text recognizer with flexible rectification, IEEE Transactions on Pattern Analysis and Machine Intelligence 31(11) (2018), 855–868.

35. Liu, Anguelov, Erhan, Szegedy, Reed, Fu C.Y. and Berg A.C., SSD: Single shot multibox detector, In European Conference on Computer Vision, Springer (2016), 21–37.

36. Shi, Bai and Belongie, Detecting oriented text in natural images by linking segments, arXiv preprint arXiv:1703.06520 (2017).

37. Ren, He, Girshick and Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, In Advances in Neural Information Processing Systems (2015), 91–99.

38. Shi B.G., Bai and Belongie, Detecting oriented text in natural images by linking segments, arXiv preprint arXiv:1703.06520 (2017).

39. Wei G.C., Rong W.S., Liang Y.Q., Xiao X.G. and Liu, Toward arbitrary-shaped text spotting based on end-to-end, IEEE Access 8 (2020), 159906–159914.

Table 3
Results on the ICDAR 2015 dataset

| Method | TCM | SAM-BiLSTM | IC2015 AP_Det | IC2015 AP_E2E |
|---|---|---|---|---|
| Baselines | | | 82.30 | 54.10 |
| Baselines + TCM | √ | | 84.50 | 54.60 |
| Baselines + SAM-BiLSTM | | √ | 83.20 | 56.90 |
| Baselines + TCM + SAM-BiLSTM | √ | √ | 85.40 | 57.30 |