Abstract
Reading digits from natural images is a challenging computer vision task central to a variety of emerging applications. However, the growing scale and complexity of datasets and applications inevitably introduce label noise. Because the label noise in scene digit recognition datasets is sequence-like, most existing methods cannot handle it. We propose a novel sequence class-label noise filter called Confident Sequence Learning. It consists of two critical parts: a sequence-like confidence segmentation algorithm and the Confident Learning method. The sequence-like confidence segmentation algorithms slice the sequence-like labels and the sequence-like predicted probabilities and reorganize them in the form of either an independent stochastic process or a white noise process. The Confident Learning method then estimates the joint distribution between observed labels and latent labels using the segmented labels and probabilities. Experiments on the TRDG and SVHN datasets show that Confident Sequence Learning finds label errors with high accuracy and significantly improves the performance of the VGG-Attn and TPS-ResNet-Attn models in the presence of synthetic sequence class-label noise.
Introduction
Reading text (including digits) from natural images, referred to as scene text (digit) recognition, has been a vital computer vision task in many industrial applications [22]. As in other computer vision tasks, deep neural networks perform very well on scene text recognition [25]. Since supervised learning requires labeled training data, the accuracy of scene text (digit) recognition models depends on the quality of the data used to train them. However, annotating text images is strenuous because each image contains multiple characters, so label noise in scene text recognition datasets is often inevitable. Previous results showed that label noise in scene text recognition datasets can adversely impact the performance of scene text recognition models [43]. Therefore, training a robust scene text (digit) recognition model in the presence of label noise is an increasingly valued task.
Methods for dealing with label noise [10, 37] have been widely studied in recent years and applied to monotonic classification [9], distance metric learning [42], and customer churn analysis [27]. However, training scene text recognition (STR) models requires examples with a series of labels instead of a single label. Most previous methods cannot deal with noise in a series of labels, which we call sequence class-label noise. To deal with sequence class-label noise in scene text recognition, inspired by the noisy channel model [7], we formulate the stochastic process of sequence class-label noise. We also propose two sequence-like confidence segmentation algorithms: the independent stochastic process segmentation algorithm and the white noise segmentation algorithm. We then integrate these algorithms with the Confident Learning method [32] to find corrupted labels, and we call the resulting method confident sequence learning. As far as we can tell from the existing literature, confident sequence learning is the first class-label noise filtering technique applied to image-based sequence recognition. The main contributions of our work are as follows:
(1) Inspired by the noisy channel model, we formulate the stochastic process of sequence class-label noise, filling the gap between scene text recognition and label noise. Compared with the noisy channel model, the stochastic process of sequence class-label noise considers the contextual information within a sequence of characters and models the noising process without relying on a word lexicon.
(2) We propose two sequence-like confidence segmentation algorithms: the independent stochastic process segmentation algorithm and the white noise segmentation algorithm.
(3) We develop two confident sequence learning algorithms by integrating the sequence-like confidence segmentation algorithms with the Confident Learning method. These algorithms filter sequence class-label noise with linear time complexity.
(4) We conduct extensive experiments on the TRDG and SVHN datasets with different sequence class-label noise levels. The experiments show that confident sequence learning finds label errors with high accuracy and significantly improves the performance of the VGG-Attn and TPS-ResNet-Attn models. As far as we can tell from the existing literature, confident sequence learning is the state of the art for scene digit recognition in the presence of sequence class-label noise.
The rest of this paper is organized as follows. In Section 2, we discuss related work about scene text recognition and label noise. In Section 3, we review the scene digit recognition and sequence class-label noise. In Section 4, we formulate the stochastic process of sequence label noise and propose confident sequence learning. The experimental results are shown in Section 5. In Section 6, we compare the experimental results on the TRDG dataset with the experimental results on the SVHN dataset and discuss the difference between independent sequence learning (SCL: IN) and white noise sequence learning (SCL: WN). The conclusion is in Section 7.
Related work
This section reviews previous work on topics closely related to our paper. We first briefly review scene text recognition and then survey studies of label noise.
Scene text recognition
Reading text (including digits) from natural images is an important computer vision task. Optical Character Recognition (OCR) techniques have been applied successfully to clean documents. However, most traditional OCR methods are far less effective on scene text recognition (STR) tasks because of the varied appearance of text in the real world and the imperfect conditions in which scenes are captured. Unlike image classification tasks, STR tasks require the system to predict a series of labels instead of a single label, which makes them more complex than image classification tasks. Early STR models [6, 38] first detected individual characters and then recognized the detected characters with deep convolutional neural network models trained on labeled character images. However, such models required training a capable character detector to accurately detect and crop each character from the original word image. Shi et al. [34] proposed CRNN, a neural network architecture that recognizes character sequences without cropping each character from the original word image. Although CRNN can handle regular character sequences of arbitrary length, it performs poorly on text with an irregular shape. Rectification-based STR models [29, 35] attempt to transform irregular text patches into regular ones and then recognize them using regular text recognizers. Furthermore, attention-based STR models [11, 26] achieved state-of-the-art performance on irregular text recognition benchmarks by applying attention mechanisms to extract the context information of features at the decoding stage. Nonetheless, training STR models requires large amounts of labeled data. The training datasets used by irregular text recognizers may contain label errors because annotating irregular text images is quite challenging. Moreover, the quality of the training data has a significant impact on the performance of STR models [15].
Label noise
Learning with noise has been a hot topic in the machine learning field. Anything that obscures the relationship between an instance’s features and its label can be seen as noise [2]. According to the previous literature [15], two types of noise are distinguished: feature noise and label noise. Feature noise affects the observed values of the features, e.g., adversarial examples [17, 40]. The sequence noise studied in Natural Language Processing (NLP) tasks [1, 31] is also feature noise. Feature noise can affect the output of deep neural network models at test time, while label noise is more harmful at training time. Label noise alters the observed labels assigned to instances, e.g., a label flip attack [39]. Three types of methods for dealing with label noise are distinguished [15]: label noise-robust models [16], label noise-tolerant algorithms [18], and label noise filter approaches [10]. Confident learning is a label noise filter approach that finds corrupted labels by estimating the joint distribution between observed labels and latent labels. Elkan [12] and Frank [14] pioneered counting methods to calculate false positive and false negative rates for binary classification, but these counting approaches perform poorly on unbalanced datasets. Elkan & Noto [13] introduced thresholding to learn a classifier from an unbalanced training set, but this method is limited to binary classification. For multi-class classification, Lipton et al. [28] and Chen et al. [10] estimate label noise using approaches based on confusion matrices and cross-validation, but these methods fail to count errors properly under class imbalance or when a model is more confident for certain classes than for others. Recently, Northcutt et al. [32, 33] proposed the Confident Learning (CL) method to directly estimate the joint distribution and prune label errors. The CL method outperforms seven recent state-of-the-art methods for learning with noisy labels.
However, traditional confident learning methods can only filter class noise; they cannot deal with sequence class-label noise.
Preliminary
Scene digit recognition
The scene digit recognition task requires the system to predict a series of character labels using the images captured by scene digit detection systems. So, the feature space of scene digit recognition is
Sequence class-label noise
In the scene digit recognition task, sequence label noise naturally occurs when human experts are involved. This phenomenon is termed as unknown noising process
Sequence class-label noise filter
When training data is polluted by label noise, an obvious and tempting solution is to cleanse the training data themselves. A classic data cleansing method is the noise filter. Noise filter methods first detect noisy examples in the training dataset and then remove them. Precisely, for a training dataset T in STR tasks, a sequence class-label noise filter removes the noisy examples in T and yields a clean dataset T* ⊂ T. STR models are expected to perform better on the test dataset when trained on the clean dataset T*.
The proposed method
In this section, we propose a novel sequence class-label noise filter referred to as confident sequence learning. Compared with the Confident Learning method, confident sequence learning introduces a sequence-like confidence segmentation algorithm to segment the sequence-like noisy labels and the sequence-like predicted probabilities. The sequence-like confidence segmentation algorithms are based on the stochastic process of sequence class-label noise described in Section 4.1.
The stochastic process of sequence class-label noise
To model sequence class-label noise, we formulate the stochastic process of sequence class-label noise based on the noisy channel model [7], the class-conditional classification noise process (CNP) [2], and stochastic processes [8]. Unlike the noisy channel model applied to spelling correction, the stochastic process of sequence class-label noise does not rely on a word lexicon.
Where

The stochastic process of sequence class label noise.
In the independent stochastic process of sequence class label noise, each random vector is an independent class conditional classification noise process which means we can independently estimate the joint distribution between
In the white noise of sequence class label noise, all random vectors have the same joint distribution which brings huge convenience to estimate the joint distribution between
Although the sequence class-label noise models help us understand the relationship between the distorted labels and the latent labels, they cannot be used directly to find label errors without being integrated with a noise filter method. Therefore, we developed two sequence-like confidence segmentation algorithms: the independent stochastic process segmentation algorithm and the white noise segmentation algorithm. We combine these algorithms with the CL method [32, 33] to obtain two noise filter techniques, which we call confident sequence learning. Confident sequence learning has three necessary steps, as shown in Fig. 2. First, we segment the sequence-like noisy labels and sequence-like predicted probabilities using the sequence-like confidence segmentation algorithms. Then, we use the CL method (Cleanlab) to estimate the joint distribution between

The procedure of confident sequence learning.
Sequence-like confidence segmentation, first listing:
l_m ← max(L)
S ← new empty list []
...
append new empty tuple ([], []) to S
...
l ← L[j]
...
append ...
append ...

Sequence-like confidence segmentation, second listing:
...
l ← L[j]
...
append ...
append ...
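A minimal Python sketch of the two segmentation strategies follows. It is based on the prose description of the independent and white noise processes, not on the authors' released code. The function names and the nested-list input layout (`labels[j][i]` is the noisy label of character i in sequence j, and `probs[j][i]` is its predicted probability vector) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Labels = List[int]          # one noisy class label per character
Probs = List[List[float]]   # one predicted probability vector per character
Segment = Tuple[Labels, Probs]


def segment_independent(
    labels: Sequence[Sequence[int]],
    probs: Sequence[Sequence[Sequence[float]]],
) -> List[Segment]:
    """Group characters by position: one (labels, probs) pair per position.

    Assumed reading of the independent stochastic process: each character
    position has its own joint distribution, so positions are kept apart.
    """
    max_len = max((len(seq) for seq in labels), default=0)
    segments: List[Segment] = [([], []) for _ in range(max_len)]
    for seq_labels, seq_probs in zip(labels, probs):
        for pos, (label, prob) in enumerate(zip(seq_labels, seq_probs)):
            segments[pos][0].append(label)
            segments[pos][1].append(list(prob))
    return segments


def segment_white_noise(
    labels: Sequence[Sequence[int]],
    probs: Sequence[Sequence[Sequence[float]]],
) -> Segment:
    """Pool every character of every sequence into one (labels, probs) pair.

    Assumed reading of the white noise process: all positions share one
    joint distribution, so position information is discarded.
    """
    pooled: Segment = ([], [])
    for seq_labels, seq_probs in zip(labels, probs):
        for label, prob in zip(seq_labels, seq_probs):
            pooled[0].append(label)
            pooled[1].append(list(prob))
    return pooled


# Toy input: two sequences of lengths 2 and 1 over two classes.
toy_labels = [[0, 1], [1]]
toy_probs = [[[0.9, 0.1], [0.2, 0.8]], [[0.4, 0.6]]]
by_position = segment_independent(toy_labels, toy_probs)
pooled = segment_white_noise(toy_labels, toy_probs)
```

Both functions visit each character once, which is consistent with the linear time complexity claimed for the method. Each returned segment has the flat shape (labels, probability matrix) that a class-label noise filter such as Confident Learning expects.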
The threshold t_j is the expected (average) self-confidence for each class.
The threshold t_j fixes the class sensitivity problem with
Finally, for each off-diagonal entry in
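The thresholding step can be illustrated with a small pure-Python sketch. It follows the general Confident Learning recipe of Northcutt et al. [32] (per-class average self-confidence as the threshold, then counting examples into a confident joint), but it is a hand-rolled simplification and not the Cleanlab implementation. The function names are hypothetical, and flagging every off-diagonal example is a simpler rule than pruning by estimated noise rates.

```python
from typing import List, Sequence


def class_thresholds(noisy_labels: Sequence[int],
                     probs: Sequence[Sequence[float]],
                     num_classes: int) -> List[float]:
    """t_j: mean predicted probability of class j over examples labeled j."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for label, p in zip(noisy_labels, probs):
        sums[label] += p[label]
        counts[label] += 1
    # Assumption of this sketch: a class with no examples can never be
    # confidently predicted, so its threshold is set to infinity.
    return [s / c if c else float("inf") for s, c in zip(sums, counts)]


def confident_joint(noisy_labels: Sequence[int],
                    probs: Sequence[Sequence[float]],
                    num_classes: int) -> List[List[int]]:
    """Count examples by (noisy label, confidently predicted label)."""
    t = class_thresholds(noisy_labels, probs, num_classes)
    joint = [[0] * num_classes for _ in range(num_classes)]
    for label, p in zip(noisy_labels, probs):
        candidates = [j for j in range(num_classes) if p[j] >= t[j]]
        if not candidates:
            continue  # no class clears its threshold; example is not counted
        guess = max(candidates, key=lambda j: p[j])
        joint[label][guess] += 1
    return joint


def flag_label_errors(noisy_labels: Sequence[int],
                      probs: Sequence[Sequence[float]],
                      num_classes: int) -> List[int]:
    """Indices of examples that fall in an off-diagonal cell of the joint."""
    t = class_thresholds(noisy_labels, probs, num_classes)
    flagged = []
    for idx, (label, p) in enumerate(zip(noisy_labels, probs)):
        candidates = [j for j in range(num_classes) if p[j] >= t[j]]
        if candidates and max(candidates, key=lambda j: p[j]) != label:
            flagged.append(idx)
    return flagged


# Toy segment with two classes; example 2 is labeled 0 but looks like class 1.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
             [0.15, 0.85], [0.1, 0.9], [0.3, 0.7]]
thresholds = class_thresholds(toy_labels, toy_probs, 2)
joint = confident_joint(toy_labels, toy_probs, 2)
errors = flag_label_errors(toy_labels, toy_probs, 2)
```

In confident sequence learning, the inputs to such a routine would be the segmented labels and probabilities produced by the segmentation step, either one segment per character position or a single pooled segment.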
In this section, we introduce the experimental setup and analyze the experimental results. We conducted all experiments on an NVIDIA 2080Ti GPU. The operating system is Ubuntu 16.04 LTS, and the RAM size is 8 GB. Our code is available at our online code repository.
Experiment setup
Performance measure
Following previous work [32], we use two types of metrics to measure the confident sequence learning method’s performance. The first type of metrics is used to evaluate the confident sequence learning method’s performance in finding misidentified characters in a sequence. The second type of metrics is used to evaluate the scene digit recognition models’ performance that trained with a dataset filtered by the confident sequence learning method.
Summary of related works regarding sequence confident learning
Confusion matrix for finding Sequence class-label errors with maximum length 2
Accuracy is the most commonly used performance measure in multi-class problems; it is defined as the ratio of correct predictions to the total number of predictions.
However, accuracy alone is an improper measure when classes are imbalanced, because rare and common samples influence it very unequally [4]. So, we further consider precision and recall as metrics to measure the performance of finding sequence class-label errors.
F1-Score is another performance measure that is commonly used for imbalanced classification problems. It considers both precision and recall.
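As a hedged illustration of these four measures, the sketch below computes them from confusion counts using their standard definitions. It assumes that a "positive" is a character label flagged as an error; the function names and the example counts are hypothetical and are not taken from the paper's tables.

```python
from typing import Tuple


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Ratio of correct decisions to all decisions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 90 true errors found, 10 clean labels wrongly
# flagged, 30 true errors missed, 870 clean labels correctly left alone.
acc = accuracy(tp=90, tn=870, fp=10, fn=30)
prec, rec, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these counts the accuracy is high even though a quarter of the true errors are missed, which is why precision, recall, and F1 are reported alongside it.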
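We measure the performance of scene digit recognition models with SR Accuracy, the ratio M/N, where M represents the number of correctly recognized images and N represents the total number of images. A scene digit image is correctly recognized only if no character is misidentified. We further consider the Levenshtein distance (edit distance) to measure the performance of scene digit recognition models. The edit distance (ED) is a string metric for measuring the difference between two sequences. Informally, the edit distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other. For two strings a and b, it is defined as lev_{a,b}(|a|, |b|), where
\[
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\
\min\begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1\\
\operatorname{lev}_{a,b}(i,j-1) + 1\\
\operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
\]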
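The two recognition metrics can be sketched in a few lines of Python. The edit distance below is the standard dynamic-programming formulation of the Levenshtein distance; the function names are illustrative and are not taken from the authors' code.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping only two DP rows."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def sequence_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of images whose whole digit string is predicted exactly."""
    if not targets:
        return 0.0
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


# One substituted digit costs 1 edit but makes the whole sequence wrong.
distance = edit_distance("2019", "2016")
exact_rate = sequence_accuracy(["2019", "305"], ["2019", "365"])
```

The example shows why both metrics are useful: a single wrong digit counts as a full miss for sequence-level accuracy, while the edit distance records that the prediction was only one character away.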
We generated sequence class-label noise in the TRDG train dataset across varying noise rates and sparsities. All models were evaluated on the unaltered TRDG test dataset. We used a pre-trained VGG-Attn model with a test accuracy of 0.8815 to output the predicted probabilities for the noisy TRDG train dataset. The confident sequence learning methods estimated the joint distribution of noisy digits and true digits, as shown in Fig. 4.

Examples of the TRDG dataset and the SVHN dataset.

We estimate the joint distribution of noisy digits and true digits for the TRDG dataset with a 0.4 noise rate and a 0.6 sparsity using a white noise sequence learning method. Probabilities are scaled up by 100.
Table 3 shows the accuracy, F1, precision, and recall measures for finding label errors in the TRDG dataset. We could not find any other noise filter technique in the existing literature that can be used on the scene digit recognition task, so we only compared the white noise sequence learning method (SCL: WN) with the independent sequence learning method (SCL: IN). As shown in Table 3, the accuracy, precision, recall, and F1 of the two confident sequence learning methods are all higher than 0.89 across varying noise rates and sparsities, which shows that confident sequence learning methods can find label errors with a high success rate on the TRDG dataset. Table 3 also shows that the SCL: WN method performs better than the SCL: IN method when finding label errors on the TRDG dataset at all noise levels except a 40% noise rate with 0.2 sparsity.
Accuracy, F1, precision, and recall measures for finding label errors in the TRDG dataset
After finding label errors in the TRDG train dataset, we counted the number of character label errors in each sequence label and removed the examples that have any character label error. We used the filtered TRDG train dataset to train the VGG-Attn network and evaluated the trained network on the TRDG test dataset. We did not train the TPS-ResNet-Attn network on the TRDG dataset because the sample size of the TRDG dataset is too small, which would cause the TPS-ResNet-Attn network to overfit. Table 4 shows the evaluation results. Baseline refers to training on the noisy TRDG train dataset without noise filtering. As shown in Table 4, the SCL: WN method and the SCL: IN method significantly improve the test performance of the VGG-Attn model. Table 4 also indicates that the SCL: WN method performs better when the sparsity of noise is low and the SCL: IN method performs better when the sparsity of noise is high.
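The filtering step described above (count flagged characters per sequence, then drop every sequence with at least one) could look like the following sketch. The data layout is an assumption: each flagged character is given as a hypothetical (sequence index, character position) pair, and each example as a (file name, label string) pair.

```python
from typing import Iterable, List, Sequence, Tuple

Example = Tuple[str, str]      # (image file name, digit-string label)
FlaggedChar = Tuple[int, int]  # (sequence index, character position)


def count_errors_per_sequence(num_sequences: int,
                              flagged: Iterable[FlaggedChar]) -> List[int]:
    """Number of flagged character labels in each sequence."""
    counts = [0] * num_sequences
    for seq_idx, _position in flagged:
        counts[seq_idx] += 1
    return counts


def filter_dataset(examples: Sequence[Example],
                   flagged: Iterable[FlaggedChar]) -> List[Example]:
    """Keep only the examples whose label has no flagged character."""
    counts = count_errors_per_sequence(len(examples), flagged)
    return [ex for ex, n_errors in zip(examples, counts) if n_errors == 0]


# Hypothetical data: two characters of the second label were flagged.
toy_examples = [("img_0.png", "4821"), ("img_1.png", "907"), ("img_2.png", "13")]
toy_flagged = [(1, 0), (1, 2)]
error_counts = count_errors_per_sequence(len(toy_examples), toy_flagged)
clean_examples = filter_dataset(toy_examples, toy_flagged)
```

Removing a whole sequence for a single flagged character is strict, but it matches the sequence-level notion of correctness used for evaluation, where one wrong character makes the whole prediction wrong.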
Evaluation results for using the filtered TRDG train dataset to train the VGG-Attn network
The SVHN dataset is a more challenging scene digit recognition benchmark than the TRDG dataset. Because the SVHN train dataset is very large, training scene digit recognition (SDR) models needs only 1-3 epochs on it, which reduces the effect of sequence class-label noise. We therefore increased the noise rate and sparsity when generating sequence class-label noise in the SVHN train dataset. We used a pre-trained TPS-ResNet-Attn network with a test accuracy of 0.8832 to output the predicted probabilities for the noisy SVHN train dataset. Table 5 shows the accuracy, F1, precision, and recall measures for finding label errors in the SVHN train dataset.
Accuracy, F1, precision, and recall measures for finding label errors in the SVHN dataset
As shown in Table 5, the accuracy, precision, recall, and F1 of the two confident sequence learning methods are all higher than 0.96 across varying noise rates and sparsities. This result shows that confident sequence learning methods can find label errors with a high success rate on the SVHN dataset. Unlike the TRDG dataset experiments, Table 5 shows no significant difference between the two types of confident sequence learning for finding label errors in the SVHN train dataset.
To evaluate the effectiveness of confident sequence learning methods, we used the filtered SVHN train dataset to train the TPS-ResNet-Attn network and evaluated the trained network on the SVHN test dataset. We did not train the VGG-Attn network on the SVHN train dataset because the VGG-Attn network is ill-suited to the SVHN dataset. The SVHN train dataset images are corrupted by natural phenomena that are difficult to compensate for by hand, and the VGG-Attn network cannot recognize images with severe blur, distortion, and illumination effects on top of wide style and font variations. Table 6 shows the evaluation results for using the filtered SVHN train dataset to train the TPS-ResNet-Attn network. As shown in Table 6, the accuracy increased by 1.5% and 2%, respectively, from the lowest noise rate (40%) to the highest noise rate (80%), which shows that confident sequence learning methods are effective in improving the performance of the TPS-ResNet-Attn network under sequence class-label noise. Table 6 also shows that SCL: WN is more effective than SCL: IN at enhancing the robustness of the TPS-ResNet-Attn network to sequence class-label noise.
Evaluation results for using the filtered SVHN train dataset to train the TPS-ResNet-Attn network
This section first compares the experimental results on the TRDG dataset with those on the SVHN dataset. It then discusses the difference between the SCL: WN method and the SCL: IN method.
When comparing the experimental results on the TRDG dataset with the experimental results on the SVHN dataset, one interesting finding is that the confident sequence learning methods’ performance is positively related to the robustness of the scene digit recognition models. As shown in Table 4, the confident sequence learning methods significantly improve the VGG-Attn model’s performance on the noisy TRDG dataset (the accuracy improved by an average of 19.92%). Meanwhile, the VGG-Attn model’s accuracy sharply decreases when the SVHN training dataset’s noise rate increases, which indicates that the VGG-Attn model is not robust to the sequence class-label noise. Similar results are presented in Table 6.
In Section 5, we also find that the SCL: WN method and the SCL: IN method perform differently across varying noise rates and sparsities. From a theoretical perspective, the independent and identically distributed condition places a stronger restriction on the probability distribution of the random vectors in noisy sequence labels. This indicates that the SCL: WN method is more suitable for situations in which the sparsity of sequence class-label noise is high. The experimental results also show that, when the sparsity of sequence class-label noise is high, training scene digit recognition models on the dataset filtered by the SCL: WN method improves the test accuracy by 1% more than using the SCL: IN method.
Conclusion
In this paper, we reviewed the scene digit recognition task and the sequence class-label noise problem. We formulated the stochastic process of sequence class-label noise. Furthermore, we proposed two sequence-like confidence segmentation algorithms: the independent stochastic process confidence segmentation algorithm and the white noise stochastic process confidence segmentation algorithm. By integrating these two algorithms with confident learning, we proposed confident sequence learning, the first class-label noise filtering technique applied to image-based sequence recognition. The experiments on the TRDG and SVHN datasets show that confident sequence learning methods can find label errors with an accuracy of about 0.9 or higher. In addition, our experiments show that confident sequence learning methods improve the SR Accuracy of the VGG-Attn model by an average of 19.92% and the SR Accuracy of the TPS-ResNet-Attn model by an average of 1.04%. We think the performance of confident sequence learning methods is highly related to the robustness of the scene digit recognition models. We also found that the SCL: WN method is more effective than the SCL: IN method when the sparsity of noise is high.
The confident sequence learning methods have two limitations: (1) they can only be applied to attention-based scene text recognition models, and (2) they cannot filter sequence label noise caused by insertions and deletions. In future work, we will improve the proposed sequence label noise model to deal with noise caused by deletions and insertions. Moreover, we plan to evaluate the confident sequence learning method on scene text recognition datasets (e.g., MJSynth[
