Enhancing the recognition accuracy of complex text on online production labels by improving the convolutional recurrent neural network

Abstract

During production, the label on a product carries its basic information. Because product labels contain complex text, highly accurate recognition of online production labels has long been a challenging problem. To address this issue, this paper proposes a more effective method for complex text recognition based on an improved convolutional recurrent neural network. First, the SE-DenseNet network is introduced for feature extraction, aiming to increase the model’s depth and feature extraction capacity. Then, a Bi-GRU network learns and models the hidden states and spatial features extracted by SE-DenseNet and predicts preliminary sequence results, while reducing the number of model parameters and improving computational performance. Finally, a CTC layer transcribes each per-frame prediction output by Bi-GRU into a label sequence, completing complex text recognition. Experiments on the public SVT, IIIT-5K, and ICDAR2013 datasets and on a self-built dataset show that the proposed model is competitive on both the public and the self-built data. In particular, the model achieves the highest recognition accuracy, 93.2%, on the ICDAR2013 dataset, demonstrating its potential to support complex text recognition for online production labels.

Keywords

Online production labels; complex text recognition; SE-DenseNet; Bi-GRU

1 Introduction

Product labeling plays a crucial role in modern production: labels mark products during processing and serve as a record for information and quality traceability [1]. With the advent of information technology, enterprises can now improve their production efficiency and quality through digitalization and intelligence. To facilitate the identification of label information, more and more companies are using 2D codes, bar codes, RFID codes [2], and even electronic tags as information carriers [3]. However, in the redrying of tobacco, a cash crop, the growing range of tobacco raw materials handled by redrying units has made cross-region production and modular processing commonplace. To date, complex text labels containing Chinese characters, numbers, English letters, etc., as shown in Fig. 1, are still used as information carriers on tobacco boxes during transfer and storage. Online recognition with high accuracy has long been difficult because the imaging environment is affected by uneven illumination, complex backgrounds, and other issues. To raise the level of threshing and redrying production, it is crucial to design a system that can rapidly and reliably identify the information on these complex labels.

Fig. 1

The labels on the tobacco boxes.

Currently, with the rapid development of deep learning, text recognition algorithms can be broadly classified into two types: character-level text recognition methods [4] and sequence label-based text recognition methods [5]. Character-level methods, exemplified by Jaderberg et al. [6], locate and cut out character images before feeding them into a deep convolutional neural network for feature training. The output is then compared with a fixed dictionary to obtain the final recognition result. To ensure recognition accuracy, character-level recognition requires the arduous learning of the features of every character, without omission. Sequence recognition algorithms, on the other hand, learn semantic features directly from the image and are divided into two types: temporal feature classification-based algorithms [7] and encoder-decoder-based algorithms [8]. Inspired by Vision Transformer [9, 10], Du et al. [11] completely abandoned the text sequence modeling stage and proposed single visual text recognition (SVTR), which is based on a single visual model and aims to enhance text recognition by strengthening that model. The network extracts features of text images at different scales in stages to form multi-granularity feature descriptions, so that the recognition task can be completed by a single visual model. Based on the encoder-decoder autoregressive method, Sheng et al. [12] and Zheng et al. [13] cast recognition as iterative decoding, which takes contextual information into account and improves accuracy. However, inference is slow because the text is transcribed character by character. Chandio et al. [14] proposed a segmentation-free method based on a deep convolutional recurrent neural network for cursive text recognition: instead of pre-segmenting the text into single characters, it takes the entire word image as input and converts it into a sequence of relevant features.
Existing research mainly addresses the recognition of a single type of character, whereas this paper focuses on the labels of redried tobacco, which contain Chinese characters, letters, and numbers. At present, there are few studies on complex text recognition that covers all three.

To increase recognition speed and accuracy for complex text, a new algorithm based on a convolutional recurrent neural network is proposed. The improved model incorporates SE-DenseNet in the feature extraction phase to increase model depth and improve feature extraction capability. A Bi-directional Gated Recurrent Unit (Bi-GRU), which has fewer parameters, processes the feature sequence and produces per-frame predictions, which the CTC layer then converts into a label sequence.

2 Proposed method

2.1 Network architecture

The proposed structural design for the recognition model based on the convolutional recurrent neural network is shown in Fig. 2. The model comprises three modules: the convolutional layers (SE-DenseNet) [15], the recurrent layers (Bi-GRU) [16], and the transcription layers (CTC) [17]. The specific network architecture is shown in Table 1.

Fig. 2

Schematic diagram of the text recognition model of tobacco marking.

Table 1

Network architecture table

Hierarchical network architecture	Type	Configuration
Convolution layer architecture	Input	112×112 gray-scale image
	Convolution layer	k:3×3, 32, s:1, p:1
	Maximum pooling layer	Window:3×3, s:2×2
	SE-DenseBlock-1	$\begin{bmatrix} 1 \times 1\ \text{Conv} \\ 3 \times 3\ \text{Conv} \\ \text{SE Block} \end{bmatrix} \times 4$
	Transition-1	1×1 Conv, 2×2 Avgpooling
	SE-DenseBlock-2	$\begin{bmatrix} 1 \times 1\ \text{Conv} \\ 3 \times 3\ \text{Conv} \\ \text{SE Block} \end{bmatrix} \times 4$
	Transition-2	1×1 Conv, 2×2 Avgpooling
	SE-DenseBlock-3	$\begin{bmatrix} 1 \times 1\ \text{Conv} \\ 3 \times 3\ \text{Conv} \\ \text{SE Block} \end{bmatrix} \times 4$
	Transition-3	1×1 Conv, 2×2 Avgpooling
	SE-DenseBlock-4	$\begin{bmatrix} 1 \times 1\ \text{Conv} \\ 3 \times 3\ \text{Conv} \\ \text{SE Block} \end{bmatrix} \times 4$
	Convolution layer	k:2×2, 512, s:1, p:0
	Map-to-Sequence	–
Recurrent layer architecture	Bi-GRU	#hidden units:256
	Bi-GRU	#hidden units:256
Transcription layer architecture	CTC	–
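As a rough check of Table 1, the side length of the feature map after each layer can be traced with the usual output-size formula for convolution and pooling, out = floor((in + 2p − k) / s) + 1. The sketch below is only illustrative. Table 1 does not state the padding of the max pooling layer, so `pool_padding=1` is an assumption, as is the premise that the SE-DenseBlocks preserve the spatial size; the helper names are hypothetical.

```python
from typing import List, Tuple


def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_feature_map(size: int = 112, pool_padding: int = 1) -> List[Tuple[str, int]]:
    """Trace the side length of the square feature map through Table 1.

    pool_padding is an assumption; Table 1 does not specify it.
    """
    trace = [("input", size)]
    size = out_size(size, kernel=3, stride=1, padding=1)  # 3x3 conv, s:1, p:1
    trace.append(("conv 3x3", size))
    size = out_size(size, kernel=3, stride=2, padding=pool_padding)  # 3x3 max pool, s:2
    trace.append(("max pool", size))
    for i in range(1, 4):
        # Assumed: SE-DenseBlocks keep the spatial size; each transition halves it.
        size = out_size(size, kernel=2, stride=2, padding=0)  # 2x2 average pooling
        trace.append((f"transition-{i}", size))
    size = out_size(size, kernel=2, stride=1, padding=0)  # final 2x2 conv, s:1, p:0
    trace.append(("conv 2x2", size))
    return trace


for name, side in trace_feature_map():
    print(f"{name:>13}: {side} x {side}")
```

Under these assumptions the 112×112 input shrinks to 56, 28, 14, and 7 pixels per side before the final 2×2 convolution, whose output is then passed to the Map-to-Sequence step.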

2.2 SE-DenseNet feature extraction network

Convolutional neural networks (CNNs) extract deep features adaptively and effectively. In target detection models, a pre-trained image classification model with its fully connected layer removed can serve as the backbone network for feature extraction. However, as the depth of the model increases, the gradients reaching the front layers progressively diminish during backpropagation, so these layers become hard to update; this is the vanishing gradient problem. The issue becomes more pronounced as the depth and the number of parameters grow. To address this challenge, Huang et al. [18] proposed DenseNet, in which the input of each convolutional layer is the concatenation of the outputs of all preceding layers. This not only increases feature utilization at each layer but also reduces the number of parameters in the model. It also enhances gradient propagation and improves the overall training efficiency of deep network models.

As illustrated in Fig. 3, DenseNet uses a densely connected block, the Dense-Block, for feature extraction. For the l-th convolutional layer of a Dense-Block, the output can be expressed as:

Fig. 3

Dense-Block module.

$x_{l} = H_{l} ([x_{l - 1}, x_{l - 2}, \cdot \cdot \cdot, x_{1}, x_{0}])$ (1)

Where $H_{l}(\cdot)$ is a nonlinear transformation function and $[x_{l-1}, x_{l-2}, \cdots, x_{1}, x_{0}]$ is the concatenation of the outputs of layers 0 to l − 1. By reusing features, the Dense-Block architecture gives each layer access to global information, lowers computational demands, diversifies the input of each layer, and strengthens feature extraction. This feature reuse strategy, however, neither assesses the importance of the individual feature maps nor increases the weight of the important ones. The SE-Block module of SENet [19] demonstrated its capacity for this kind of feature weighting in the ILSVRC 2017 classification competition.
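Eq. (1) can be illustrated by counting channels. In the sketch below every layer $H_{l}$ is assumed to emit a fixed number of new feature maps (the growth rate of DenseNet [18]). The growth rate of 32 used in the example is an illustrative assumption, not a setting reported in this paper; the initial width of 32 matches the first convolution layer in Table 1.

```python
from typing import List


def dense_block_input_channels(initial_channels: int, growth_rate: int, num_layers: int) -> List[int]:
    """Number of input channels seen by each layer of a Dense-Block.

    Layer l receives the concatenation [x_{l-1}, ..., x_1, x_0], so its
    input width is the initial width plus growth_rate per preceding layer.
    """
    widths = []
    channels = initial_channels      # width of x_0
    for _ in range(num_layers):
        widths.append(channels)      # input width of the current layer
        channels += growth_rate      # the layer appends growth_rate new maps
    return widths


def dense_block_output_channels(initial_channels: int, growth_rate: int, num_layers: int) -> int:
    """Width of the concatenated output after the last layer."""
    return initial_channels + growth_rate * num_layers


# Four layers per block as in Table 1; growth rate 32 is a hypothetical value.
print(dense_block_input_channels(32, 32, 4))   # [32, 64, 96, 128]
print(dense_block_output_channels(32, 32, 4))  # 160
```

The linear growth of the input width is what makes the 1×1 convolutions in the blocks and transitions useful: they reduce the channel count before the more expensive operations.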

The schematic diagram of the SE-Block module architecture is shown in Fig. 4. The SE-Block takes the C × W × H feature map U produced by the preceding convolution operations and recalibrates the feature weights through three operations: Squeeze, Excitation, and Reweigh.

Fig. 4

SE-Block module.

By applying global average pooling to the C × W × H feature map U, the Squeeze module compresses each channel over its spatial dimensions, yielding a 1 × 1 × C descriptor with a global receptive field, which can be written as:

$z_{c} = F_{sq} (u_{c}) = \frac{1}{W \times H} \sum_{i = 1}^{W} \sum_{j = 1}^{H} u_{c} (i, j)$ (2)

To obtain the weight of each feature channel, the Excitation module passes the result of the Squeeze operation through a two-layer fully connected neural network:

$s_{c} = F_{ex}(z_{c}, W) = \sigma(g(z_{c}, W)) = \sigma(W_{2} \delta(W_{1} z_{c}))$ (3)

Where δ and σ denote the ReLU and sigmoid activation functions, respectively.

The Reweigh module recalibrates the features by multiplying each original feature channel by the corresponding channel weight produced by the Excitation module:

$y = F_{scale}(u_{c}, s_{c}) = u_{c} \times s_{c}$ (4)
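The three operations of Eqs. (2)-(4) can be sketched in plain Python on nested lists. This is a minimal illustration, not the implementation used in the experiments: the weight matrices `w1` and `w2` are hypothetical placeholders for the two fully connected layers, and biases are omitted because Eq. (3) does not include them.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def squeeze(u: FeatureMap) -> List[float]:
    """Eq. (2): global average pooling, one scalar z_c per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def excitation(z: List[float], w1: List[List[float]], w2: List[List[float]]) -> List[float]:
    """Eq. (3): s = sigmoid(W2 * relu(W1 * z))."""
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


def reweigh(u: FeatureMap, s: List[float]) -> FeatureMap:
    """Eq. (4): scale every channel u_c by its weight s_c."""
    return [[[v * sc for v in row] for row in ch] for ch, sc in zip(u, s)]


def se_block(u: FeatureMap, w1: List[List[float]], w2: List[List[float]]) -> FeatureMap:
    """Squeeze, Excitation, and Reweigh applied in sequence."""
    return reweigh(u, excitation(squeeze(u), w1, w2))


# Two 2x2 channels; w1 reduces 2 channels to 1 hidden unit, w2 expands back to 2.
u = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[1.0, 0.0]]
w2 = [[0.0], [0.0]]  # zero weights give s_c = sigmoid(0) = 0.5 for every channel
print(se_block(u, w1, w2))
```

The output has the same shape as the input; only the per-channel scale changes, which is why the block can be attached to an existing convolutional layer without altering the rest of the network.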

DenseNet is used as the backbone network in this paper to develop a more effective feature extraction network, with the SENet attention mechanism introduced after each 3×3 convolutional layer of the DenseNet network. In this way the weights of the output channels of each layer of the Dense-Block module are predicted, which achieves adaptive calibration of the weights during feature extraction and improves the network’s feature extraction capability. The weighting also captures the interdependencies between the convolutional feature channels. The diagram of the SE-DenseNet dense block architecture is shown in Fig. 5.

Fig. 5

Schematic diagram of SE-DenseNet dense block architecture.

2.3 Bi-GRU recurrent layer

The convolutional network turns the input image into a feature sequence $\{X_{1}, X_{2}, \cdots, X_{t}\}$, which is fed into the recurrent layer. A long short-term memory (LSTM) network [20] can process such sequence data and predict the label distribution of each frame, but its complex architecture and large number of parameters make the model harder to train. To reduce the number of parameters to be trained without reducing recognition accuracy, this paper adopts a Gated Recurrent Unit (GRU), whose structure is similar to that of LSTM [21]. GRU strikes a good balance between computational cost and accuracy, improving training efficiency while preserving recognition accuracy. The GRU network architecture is shown in Fig. 6.

Fig. 6

Schematic of GRU network architecture.

The GRU network mainly comprises the current time step input $X_{t}$, the previous hidden state $H_{t-1}$, the candidate hidden state $\tilde{H}_{t}$, the reset gate $R_{t}$, and the update gate $Z_{t}$. First, the reset and update gates are computed from the current input and the previous hidden state. Then the reset gate controls how much of the previous hidden state enters the candidate hidden state, which discards irrelevant temporal information. Finally, the update gate combines the previous hidden state and the candidate hidden state into the new hidden state. The corresponding formulas are given below:

$R_{t} = \sigma(X_{t} W_{xr} + H_{t-1} W_{hr} + b_{r})$ (5)

$Z_{t} = \sigma(X_{t} W_{xz} + H_{t-1} W_{hz} + b_{z})$ (6)

$\tilde{H}_{t} = \tanh(X_{t} W_{xh} + (H_{t-1} \odot R_{t}) W_{hh} + b_{h})$ (7)

$H_{t} = Z_{t} \odot H_{t-1} + (1 - Z_{t}) \odot \tilde{H}_{t}$ (8)

Where $W_{xr}$, $W_{hr}$, $W_{xz}$, $W_{hz}$, $W_{xh}$, and $W_{hh}$ are weight parameters, $b_{r}$, $b_{z}$, and $b_{h}$ are biases, ⊙ denotes element-wise multiplication, and σ is the sigmoid function, which maps each element to the range (0, 1), so that every element of the reset gate and the update gate lies between 0 and 1. The output range of the tanh function is (−1, 1).
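A single GRU time step following Eqs. (5)-(8) can be written out in plain Python. This is a didactic sketch with row vectors stored as lists; the parameter dictionary `p` and its key names are hypothetical and simply mirror the symbols in the equations.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def _affine(x: Vector, w_x: Matrix, h: Vector, w_h: Matrix, b: Vector) -> Vector:
    """Compute x @ w_x + h @ w_h + b for row vectors x and h."""
    out = list(b)
    for i, xi in enumerate(x):
        for j, wij in enumerate(w_x[i]):
            out[j] += xi * wij
    for i, hi in enumerate(h):
        for j, wij in enumerate(w_h[i]):
            out[j] += hi * wij
    return out


def _sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t: Vector, h_prev: Vector, p: Dict[str, list]) -> Vector:
    """One GRU time step; returns the new hidden state H_t."""
    # Eq. (5): reset gate
    r_t = [_sigmoid(v) for v in _affine(x_t, p["W_xr"], h_prev, p["W_hr"], p["b_r"])]
    # Eq. (6): update gate
    z_t = [_sigmoid(v) for v in _affine(x_t, p["W_xz"], h_prev, p["W_hz"], p["b_z"])]
    # Eq. (7): candidate hidden state, built from the reset-gated previous state
    gated = [h * r for h, r in zip(h_prev, r_t)]
    h_cand = [math.tanh(v) for v in _affine(x_t, p["W_xh"], gated, p["W_hh"], p["b_h"])]
    # Eq. (8): blend previous state and candidate with the update gate
    return [z * h + (1.0 - z) * c for z, h, c in zip(z_t, h_prev, h_cand)]


# Toy parameters (input size 1, hidden size 1); all-zero weights give gates of 0.5.
zero = {k: [[0.0]] for k in ("W_xr", "W_hr", "W_xz", "W_hz", "W_xh", "W_hh")}
zero.update({"b_r": [0.0], "b_z": [0.0], "b_h": [0.0]})
print(gru_step([1.0], [2.0], zero))  # [1.0]
```

With a large positive update-gate bias the unit copies its previous state almost unchanged, which is the mechanism that lets a GRU carry information across many frames.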

In general, the GRU network is unidirectional, and its output relies only on past context. In text images, however, each frame of the feature sequence is closely related to the frames both before and after it, so the current output should depend on the subsequent hidden state as well as the previous one. A unidirectional recurrent network cannot capture the semantic relationships in both directions, which lowers recognition accuracy. To address this issue, we adopt the bi-directional GRU (Bi-GRU) network depicted in Fig. 7. Bi-GRU combines two GRU networks running in opposite directions into a hidden layer that transmits information bi-directionally, thereby enhancing the model’s ability to extract semantic information and improving recognition accuracy. At time t, the hidden state $H_{t}$ of the Bi-GRU unit is computed as follows:

Fig. 7

Schematic of Bi-GRU network architecture.

$H_{t} = [\overrightarrow{h}_{t}, \overleftarrow{h}_{t}]$ (9)

Where $\overrightarrow{h}_{t}$ and $\overleftarrow{h}_{t}$ are the forward and backward hidden states of the GRU at time t.
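The data flow behind Eq. (9) can be sketched independently of the cell type: one pass runs over the sequence from left to right, a second pass runs from right to left, and the two hidden states belonging to the same time step are concatenated. The `running_sum` cell below is a toy stand-in for a GRU cell, chosen only so that the result is easy to follow by hand; all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (x_t, h_prev) -> h_t


def run_direction(xs: List[Vector], step: Step, h0: Vector) -> List[Vector]:
    """Unroll a recurrent cell over xs and return the hidden state at every step."""
    states, h = [], h0
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(xs: List[Vector], fwd: Step, bwd: Step, h0: Vector) -> List[Vector]:
    """Eq. (9): concatenate forward and backward hidden states per time step."""
    forward = run_direction(xs, fwd, h0)
    backward = run_direction(xs[::-1], bwd, h0)[::-1]  # re-align to original time order
    return [f + b for f, b in zip(forward, backward)]  # list concatenation


def running_sum(x_t: Vector, h_prev: Vector) -> Vector:
    """Toy stand-in for a GRU cell, used only to make the data flow visible."""
    return [h + x for h, x in zip(h_prev, x_t)]


frames = [[1.0], [2.0], [3.0]]
print(bidirectional(frames, running_sum, running_sum, [0.0]))
# [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

Each output vector is twice as wide as a single hidden state, so a Bi-GRU layer with 256 hidden units per direction (Table 1) passes 512 values per frame to the next layer.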

2.4 CTC transcription layer

As a crucial module in text recognition, the transcription layer converts the per-frame predictions output by Bi-GRU into a label sequence; it is therefore also referred to as the sequence recognition layer. Traditional time series classification methods are difficult to apply here, because the input sequence $X = [x_{1}, x_{2}, \cdots, x_{t}]$ and the corresponding output label sequence $Y = [y_{1}, y_{2}, \cdots, y_{u}]$ differ in length and are not aligned, which leads to redundancy. For instance, the text label of the input image in Fig. 2 is ‘Y23’, but after passing through the Bi-GRU network it may yield output sequences such as ‘YY23’, ‘Y233’, or ‘Y223’, so the final recognition result cannot be read off directly because of the repeated labels. To tackle this redundancy, this paper adopts Connectionist Temporal Classification (CTC) in the convolutional recurrent neural network model. CTC introduces a blank character ‘_’ that allows the Bi-GRU network to mark breaks between characters. A many-to-one mapping then merges identical consecutive characters and removes the blank character, for example __YY_2_3 -> Y23, _Y_2_33 -> Y23, and _Y_22_3 -> Y23. In other words, multiple paths lead to the same final label sequence. Figure 8 illustrates three legal paths for a sequence of 10 time steps.

Fig. 8

CTC legal paths.
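The many-to-one mapping described above, which first merges repeated characters and then removes blanks, takes only a few lines of Python. The function name `ctc_collapse` is hypothetical; `_` is the blank symbol used in the text.

```python
from itertools import groupby


def ctc_collapse(path: str, blank: str = "_") -> str:
    """Merge runs of identical characters, then drop the blank symbol."""
    merged = (char for char, _ in groupby(path))
    return "".join(char for char in merged if char != blank)


for path in ("__YY_2_3", "_Y_2_33", "_Y_22_3"):
    print(path, "->", ctc_collapse(path))  # each path maps to Y23
```

The order of the two steps matters: because blanks are removed only after merging, a blank between two identical characters keeps them apart, so a path such as `Y_2_2_3` decodes to `Y223` rather than `Y23`.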

During model training, the conditional probability of each path can be stated as:

$p(l | x) = \prod_{t=1}^{T} y_{l_{t}}^{t}$ (10)

Where x denotes the input sequence, l denotes a legal path, and $y_{l_{t}}^{t}$ denotes the output probability of label $l_{t}$ at time t. The probability of the output text label is then obtained by summing the conditional probabilities of all legal paths:

$p(y | x) = \sum_{B(l) = y} p(l | x)$ (11)

Where y denotes the output text label, and B denotes the many-to-one mapping that merges repeated characters and removes blanks. Taking the negative log-likelihood of the above equation gives the CTC loss:

$L(\{y_{i}\}_{i=1}^{n}) = \sum_{i=1}^{n} -\ln p(y_{i} | x)$ (12)
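For very short sequences, Eqs. (10)-(12) can be evaluated by brute force, which makes the definitions concrete. The sketch below enumerates every possible path, keeps those that collapse to the target label, and sums their probabilities. It is purely illustrative: the per-frame probabilities are made-up numbers, and practical CTC implementations use a dynamic programming (forward-backward) recursion instead of enumeration.

```python
import math
from itertools import groupby, product
from typing import Dict, List


def collapse(path, blank: str = "_") -> str:
    """Mapping B: merge repeated characters, then remove blanks."""
    return "".join(c for c, _ in groupby(path) if c != blank)


def ctc_label_probability(frame_probs: List[Dict[str, float]], label: str, blank: str = "_") -> float:
    """Eq. (11) by enumeration: sum Eq. (10) over every path that collapses to label."""
    alphabet = sorted(frame_probs[0])
    total = 0.0
    for path in product(alphabet, repeat=len(frame_probs)):
        if collapse(path, blank) == label:
            # Eq. (10): product of the per-frame probabilities along the path
            total += math.prod(frame[c] for frame, c in zip(frame_probs, path))
    return total


def ctc_loss(frame_probs: List[Dict[str, float]], label: str, blank: str = "_") -> float:
    """Eq. (12) for a single sample: negative log-likelihood of the label."""
    return -math.log(ctc_label_probability(frame_probs, label, blank))


# Two frames over the alphabet {blank, 'a'} with uniform (made-up) probabilities.
probs = [{"_": 0.5, "a": 0.5}, {"_": 0.5, "a": 0.5}]
print(ctc_label_probability(probs, "a"))  # paths aa, a_, _a: 3 * 0.25 = 0.75
print(ctc_loss(probs, "a"))
```

The enumeration grows exponentially with the number of frames, which is why it is only suitable for checking small examples by hand.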

3 Experimental results and analysis

3.1 Laboratory equipment

Image acquisition is a crucial prerequisite for visual positioning and module recognition. The proper functioning of the image acquisition system for tobacco boxes labeling is essential for the overall workflow of the system. The labeling positions on both sides of the tobacco boxes are not fixed and vary among different outsourcing companies. Therefore, this article proposes a dual-zone binocular structure design to address the issue of inconsistent labeling positions. By selecting appropriate cameras, the system enables image acquisition for labeling at any position. The specific structure is illustrated in Fig. 9.

Fig. 9

Overall structure of the tobacco box label recognition system.

3.1.1 Selection of industrial camera

When selecting a camera, the camera type should be determined from the precision required for the detection target, the field of view, the target state (static or dynamic), and other conditions, so as to obtain the best cost-performance ratio. In this paper, the characters on the tobacco box labels are printed in 16-point KaiTi, whose features are larger than the 0.16mm × 0.16mm feature size of handwritten fonts. The camera is therefore required to resolve a feature size of 0.16mm. During sampling and imaging, the photodetector converts analog signals into digital signals, which must satisfy the Nyquist sampling theorem [22], i.e. at least two samples per smallest feature. The required sampling accuracy is therefore 0.16mm / 2 = 0.08mm, and the number of pixels needed along each side follows from the resolution formula:

$Pixel = \frac{Z_{i}}{U}$ (13)

Where $Z_{i}$ represents the length or width of the tobacco box label, and U represents the required sampling accuracy.

Considering that the actual size of the tobacco box label may vary, resulting in different required sampling resolutions, this paper takes the label of Shandong Tobacco, which requires a larger field of view, as an example. Figure 10 shows the label of Shandong Tobacco.

Fig. 10

Shandong tobacco boxes label diagram.

The tobacco box label measures 180mm × 120mm. Because labels are pasted manually, their position may fluctuate within a certain range. To allow for this variation during image acquisition, a redundant field of view of 300mm × 200mm is set. According to formula (13), the required horizontal and vertical resolutions are 300 / 0.08 = 3,750 and 200 / 0.08 = 2,500 pixels, i.e. a pixel count of around 9.4 million.
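The resolution requirement can be reproduced in a few lines. The sketch applies the Nyquist criterion to the 0.16mm feature size and then formula (13) to the 300mm × 200mm field of view; the function names are hypothetical.

```python
import math


def sampling_accuracy(feature_size_mm: float) -> float:
    """Nyquist criterion: at least two samples per smallest feature."""
    return feature_size_mm / 2.0


def required_pixels(length_mm: float, accuracy_mm: float) -> int:
    """Formula (13): Pixel = Z_i / U, rounded up to a whole pixel."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(length_mm / accuracy_mm, 6))


u = sampling_accuracy(0.16)          # 0.08 mm per pixel
width_px = required_pixels(300, u)   # horizontal field of view
height_px = required_pixels(200, u)  # vertical field of view
print(width_px, height_px, width_px * height_px)  # 3750 2500 9375000
```

A sensor with 4024×3036 pixels, the resolution listed for the camera in Table 2, exceeds 3750×2500 in both directions and therefore meets this requirement.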

Based on the selection analysis above, we have chosen a MER2-1220-9GM CMOS industrial camera from Daheng as the image acquisition device for tobacco box labels. The camera parameters are shown in Table 2.

Table 2

MER2-1220-9GM camera parameter table

MER2-1220-9GM camera	parameter name	parameter
	product model	MER2-1220-9GM
	sensor type	CMOS
	resolution	4024×3036
	pixel dimension	1.83
	frame rate	9.63
	spectrum	black and white
	target size	1/1.7''
	signal-to-noise ratio	40.06dB
	lens interface type	C
	data interface	GigE
	mechanical dimension	29mm×29mm×29mm

3.1.2 Selection of industrial lens

When selecting a lens, parameters that determine optical performance, such as focal length, field of view, and aperture ratio, should be considered. Additionally, to be compatible with the MER2-1220-9GM CMOS industrial camera, the lens should have a C mount, a resolution of at least 12 million pixels, and a target size of at least 1/1.7″ to ensure the accuracy and clarity of image acquisition.

Based on the way labels are pasted on tobacco boxes and the design of the “hyperbolic binocular” visual image acquisition structure, the camera is positioned 200mm from the tobacco box label. The required focal length of the lens can therefore be calculated with formula (14) from the size of the field of view, the working distance, and the size of the camera’s photosensitive sensor:

$f = \frac{WD \cdot H_{CCD}}{H_{CCD} + H_{FOV}} = \frac{200 \times 7.6}{7.6 + 300} \approx 4.94 (mm)$ (14)

Where $H_{FOV}$ represents the size of the field of view, WD represents the working distance, and $H_{CCD}$ represents the size of the camera’s photosensitive sensor. The conversion table of industrial camera target surface size is shown in Table 3.

Table 3

Camera target surface size unit conversion table

chip type	aspect ratio	Width (mm)	Height (mm)	diagonal (mm)
1/6″	4:3	2.400	1.800	3.000
1/4″	4:3	3.600	2.700	4.500
1/3″	4:3	4.800	3.600	6.000
1/2.5″	4:3	5.760	4.290	7.182
1/2″	4:3	6.400	4.800	8.000
1/1.8″	4:3	7.176	5.319	8.933
1/1.7″	4:3	7.600	5.700	9.500
1/1.6″	4:3	8.080	6.010	10.070
2/3″	4:3	8.800	6.600	11.000
1″	4:3	12.800	9.600	16.000
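The value of 4.94mm quoted for formula (14) is reproduced when the formula is read as the thin-lens relation f = WD · H_CCD / (H_CCD + H_FOV), with WD = 200mm, H_FOV = 300mm, and H_CCD = 7.6mm (the width of a 1/1.7″ sensor in Table 3). That reading, and the choice of the sensor width rather than the height, are assumptions inferred from the numerical result; the function name is hypothetical.

```python
def focal_length_mm(working_distance_mm: float, sensor_mm: float, fov_mm: float) -> float:
    """Thin-lens estimate f = WD * H_CCD / (H_CCD + H_FOV).

    Follows from 1/f = 1/WD + 1/v with magnification v / WD = H_CCD / H_FOV.
    """
    return working_distance_mm * sensor_mm / (sensor_mm + fov_mm)


f = focal_length_mm(working_distance_mm=200.0, sensor_mm=7.6, fov_mm=300.0)
print(f"{f:.2f} mm")  # 4.94 mm
```

Because H_FOV is much larger than H_CCD here, the simpler approximation WD · H_CCD / H_FOV gives a similar value of about 5.07mm.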

Based on the analysis of lens compatibility and the calculation of field of view and focal length for the industrial camera selection, we have chosen the Hikvision MVL-KF0618M-12MPE industrial lens, whose main parameters are shown in Table 4.

Table 4

MVL-KF0618M-12MPE industrial lens basic parameters

industrial lens	parameter name	parameter
	product model	MVL-KF0618M-12MPE
	pixel	12 million
	interface type	C
	target size	1.1''
	focal length	6mm
	aperture coefficient	F1.8∼F16
	focus control	manual
	field of view (D×H×V)	118.2° × 104.9° × 86.2°
	lens interface type	C
	data interface	GigE

3.2 Model operating environment and parameter setting

To expedite the training and verification process, the experiments were conducted on a 64-bit Windows system. The hardware and environment configurations are shown in Table 5. The model was trained using Ada as the optimizer, with a learning rate of 0.001 and a batch size of 16 images, for 300 epochs.

Table 5
Deep learning environment configuration table

platform	CPU	GPU	memory	Operating environment configuration
pc	Intel i7-8700	Nvidia Pascal 1060 6G	16G	CUDA9.0 cudnn7.1 PyTorch1.8 Python3.8

3.3 Datasets

To test and compare the effectiveness of the proposed text recognition model for tobacco marking, three public datasets are used in this paper, SVT [23], IIIT-5K [24], and ICDAR2013 [25], together with a real threshing and redrying dataset of tobacco markings. SVT contains 350 images, 100 of which are used for training and 250 for testing. Some of the images in this dataset are heavily corrupted by blur, noise, and low resolution. IIIT-5K is obtained from Google Image Search and contains 5000 cropped word images from scene text images and digital images, of which 2000 are used for training and the remaining 3000 for testing. ICDAR2013 is cropped from 288 real scene images. Following previous work, the version with 857 images is chosen for testing, which excludes images containing non-alphanumeric characters and text instances shorter than 3 characters. The redried tobacco identification dataset consists of label crops obtained by applying text detection to pictures collected by the image acquisition system at Fujian Wuyi Tobacco Co., Ltd., a tobacco leaf redrying unit in Shaowu City, Fujian Province. It contains 15,000 pictures in total, 13,000 for training and 2,000 for testing; some examples are shown in Fig. 11.

Fig. 11

Redried tobacco identification dataset.

3.4 Model comparison results

To verify the effectiveness of the text recognition model based on the convolutional recurrent neural network proposed in this paper, this section compares it with text recognition algorithms that have emerged in recent years on the SVT, IIIT-5K, and ICDAR2013 datasets. The state-of-the-art (SOTA) algorithms considered include CRNN [7], AON [26], RARE [27], R2AM [28], Rosetta [29], and MORAN [30]. Model performance is evaluated with the accuracy index, defined as:

$ACC = \frac{M}{N}$ (15)

Where M denotes the number of text images in the data set that are correctly recognized, and N denotes the total number of text images in the data set.
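Eq. (15) amounts to counting images whose recognized text is correct. The sketch below assumes that "correct" means an exact match of the whole label string, which is the common convention for word-level accuracy but is not spelled out in the text; the function name and the sample labels are hypothetical.

```python
from typing import Sequence


def recognition_accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Eq. (15): ACC = M / N with exact-match counting (an assumption)."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("empty data set")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))  # M
    return correct / len(ground_truth)                                # M / N


preds = ["Y23", "A1", "B7", "C9"]
truth = ["Y23", "A1", "B8", "C9"]
print(recognition_accuracy(preds, truth))  # 0.75
```

Under this convention a single wrong character makes the whole image count as an error, so the metric is stricter than a per-character accuracy.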

Table 6 presents a performance comparison of the algorithm models on the SVT, IIIT-5K, and ICDAR2013 datasets, and Fig. 12 visualizes the results. R2AM, MORAN, and the improved CRNN-based model achieved the highest accuracy on the SVT, IIIT-5K, and ICDAR2013 datasets, with 96.3%, 91.2%, and 93.2%, respectively. These findings highlight the competitiveness of the improved CRNN-based model. Moreover, to test the practical application of the improved model, it was compared with the SOTA methods on the tobacco identification dataset. The test results are shown in Table 7, and the corresponding visual comparison is displayed in Fig. 13. The AON model showed the lowest accuracy on this dataset, while the MORAN model achieved the highest test accuracy, 92.3%, through the use of text rectification and pixel-level weakly supervised learning. Models that incorporate an attention mechanism or attention-based serial decoding, such as RARE and R2AM, are limited in prediction speed, with prediction times exceeding 20ms.

Fig. 12

The performance comparison of different algorithm models on SVT, IIIT-5K, and ICDAR2013 datasets.

Fig. 13

Performance comparison diagram of different algorithm models on tobacco boxes label data set.

Table 6

Performance results of different algorithm models on SVT, IIIT-5K, and ICDAR2013 datasets

Method	SVT	IIIT-5K	IC13
CRNN	80.8	81.2	86.7
AON	82.8	87.0	–
RARE	81.9	81.9	88.6
R2AM	96.3	78.4	90.0
Rosetta	81.5	80.8	–
MORAN	88.3	91.2	92.4
OUR	87.3	88.6	93.2

Table 7

Performance results of different algorithm models on tobacco boxes label data set

Method	Accuracy / (%)	Time/(ms)
CRNN	85.7	4.52
AON	83.4	9.8
RARE	84.6	22.5
R2AM	87.3	21.2
Rosetta	84.5	6.3
MORAN	92.3	13.2
OUR	91.8	7.8

The accuracy of the models is also compared alongside detection time, as illustrated in Fig. 14. Although MORAN achieves an accuracy 0.5 percentage points higher than that of the improved model, its detection time is 5.4ms longer. Furthermore, during the initial stages of training, the improved model showed better convergence and stability than the other models.
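The trade-off between accuracy and speed can be checked directly against Table 7. The short sketch below copies the table values into a dictionary and computes the gaps between two methods; the helper name `compare` is hypothetical.

```python
from typing import Dict, Tuple

# (accuracy in %, time in ms), copied from Table 7
TABLE_7: Dict[str, Tuple[float, float]] = {
    "CRNN": (85.7, 4.52),
    "AON": (83.4, 9.8),
    "RARE": (84.6, 22.5),
    "R2AM": (87.3, 21.2),
    "Rosetta": (84.5, 6.3),
    "MORAN": (92.3, 13.2),
    "OUR": (91.8, 7.8),
}


def compare(a: str, b: str, table: Dict[str, Tuple[float, float]] = TABLE_7) -> Tuple[float, float]:
    """Return (accuracy gap in percentage points, time gap in ms) of a relative to b."""
    acc_a, time_a = table[a]
    acc_b, time_b = table[b]
    return round(acc_a - acc_b, 2), round(time_a - time_b, 2)


print(compare("MORAN", "OUR"))  # (0.5, 5.4)
print(compare("OUR", "CRNN"))   # (6.1, 3.28)
```

Relative to the original CRNN, the improved model gains 6.1 percentage points of accuracy for 3.28ms of additional prediction time on this dataset.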

Fig. 14

Detection accuracy diagram.

3.5 Ablation experimental results

3.5.1 The effectiveness of SE-DenseNet feature extraction network

To investigate the impact of the SE-DenseNet feature extraction network on the recognition of tobacco box marking characters, three comparison schemes were designed. The first uses VGG, the feature extraction network of the original CRNN. The second uses DenseNet, a densely connected convolutional network. The third uses SE-DenseNet, which adds a feature-weighted attention mechanism on top of DenseNet. The three schemes were then tested and compared on the tobacco boxes label dataset.

Table 8 presents the performance of the different feature extraction networks. Comparing schemes 1 and 2 shows that the recognition accuracy of the DenseNet network is 2.6 percentage points higher than that of the VGG network. This indicates that simply stacking convolutional layers for feature extraction can only enhance performance to a certain degree, whereas DenseNet can use its dense connections to extract more meaningful features and improve the recognition accuracy of the model. Furthermore, comparing schemes 2 and 3 shows that the SE-DenseNet network, which adds a feature-weighted attention mechanism, performs better still, with a further gain of 3.0 percentage points in accuracy. This observation underlines the role of a feature weighting mechanism in improving model performance.

Table 8
The influence of the feature extraction network on performance

Method	Accuracy / (%)
VGG	85.7
DenseNet	88.3
SE-DenseNet	91.3

Figure 15 displays the loss curves of the three schemes. The loss converges more easily when the feature-weighted attention mechanism is added to the feature extraction network than when the feature extraction network is merely replaced. This observation further confirms the enhanced performance of the model.

Fig. 15

Loss diagram.

3.5.2 The effectiveness of the Bi-GRU recurrent network

This paper replaces the original Bi-LSTM network in the convolutional recurrent neural network with a Bi-GRU network. To investigate its effect on tobacco box label recognition, we designed two control experiments: 1) using the original Bi-LSTM recurrent network; 2) replacing the recurrent network with Bi-GRU. Both experiments were evaluated on the tobacco boxes label dataset and compared.

According to Table 9, replacing Bi-LSTM with Bi-GRU increased the recognition accuracy of the network by 0.6 percentage points. This suggests that Bi-GRU, despite its simpler architecture, is at least as effective as Bi-LSTM for complex text recognition.

Table 9
The influence of the recurrent layer network on performance

Method	Accuracy / (%)
Bi-LSTM	85.7
Bi-GRU	86.3

4 Conclusion

Targeting the speed and accuracy of recognizing complex text on online product labels, this study took tobacco strip marking as a case study and addressed the difficulty of recognizing tobacco box labels by improving the convolutional recurrent neural network. First, SE-DenseNet was used as the feature extraction network. Bi-GRU then learned the feature sequence and output the predicted label distributions. Finally, the CTC layer converted the label distributions obtained from the recurrent layer into the final label sequence, completing the model for recognizing the text of tobacco strip marks in threshing and redrying production.

The real-time performance and efficiency of the improved model were verified by comparing it with recent text recognition algorithms on three public datasets and a self-built dataset. On the self-built dataset, the improved model reaches a recognition accuracy of 91.8% while maintaining a fast prediction speed. Ablation experiments on the self-built dataset under identical experimental conditions show that the improved feature extraction network and the improved recurrent layer raise the recognition accuracy by 5.6 and 0.6 percentage points, respectively, which verifies the effectiveness of the improved modules. These findings provide valuable support for the recognition of complex text in tobacco strip marking.

Acknowledgments

This project was supported by the Natural Science Foundation of Fujian Province (grant no. 2023J01342), the Program for Innovative Research Team in Science and Technology in Fujian Province University (grant no. 2020No.12), the Fujian Provincial Key Project of Science and Technology Innovation (2022G02007), the High-level Talents Foundation of Fuzhou Polytechnic, and the National Natural Science Foundation (52275413).
