Abstract
Handwritten numeral recognition is a challenging problem in the character recognition field due to the large variation in the writing styles of different persons and high similarity in the contour of different numerals. To address this problem, an effective multi-task learning network (MTLN) for handwritten numeral recognition is presented in this paper. Based on the observation that the writing style could play an effective complementary role to the learned feature extracted from numerals, the proposed MTLN simultaneously performs the handwritten numeral learning module and the writing style learning module. Consequently, the determination of scratchy/non-scratchy in the writing style learning module can effectively assist the handwritten numeral learning module to obtain a more robust and distinguishable feature so as to improve the recognition performance. Extensive experiments on multiple existing handwritten numeral datasets have demonstrated that the proposed MTLN can effectively improve the recognition accuracy, and outperform multiple state-of-the-art methods.
Introduction
Character recognition is a vital branch of pattern recognition and has been widely applied in various fields, such as information processing, traffic systems, and machine translation. Since the characters to be identified differ across applications, various intelligent recognition systems have been developed, including vehicle license plate recognition [1, 2], English character recognition [3, 4], numeral character recognition [5, 6], and Chinese character recognition [7, 8], to name a few.
With regard to numeral character recognition, handwritten numeral recognition is of great practical value, e.g., in large-scale data statistics, tax administration, financial affairs, and letter sorting. A straightforward solution is the traditional character classification framework called optical character recognition (OCR), which mainly consists of four stages: image acquisition, image preprocessing, feature extraction, and classification. Although OCR has achieved good performance in printed character recognition, it fails to yield high accuracy in handwritten numeral recognition.
This is because, compared with printed numerals, handwritten numerals usually have irregular shapes and arbitrary strokes, since different persons have different handwriting styles. Many factors, e.g., the thickness of the strokes, the size of the font, and the tilt of the font, directly influence handwritten numeral recognition. Refer to Fig. 1 for some handwritten numeral examples. In addition, the application fields of handwritten numeral recognition usually require very high accuracy, ideally with no error in the recognition result. Therefore, how to develop an effective handwritten numeral recognition method is a challenging yet significant task in the computer vision field.
To study handwritten numeral recognition, LeCun et al. [9] developed the MNIST database, which serves as the benchmark for evaluating the performance of various handwritten numeral recognition methods. Based on the MNIST database and limited computational resources, the LeNet-5 convolutional network was also presented to obtain relatively accurate numeral classifiers [9]. Simard et al. [10] used a plain multilayer perceptron (MLP) with only one hidden layer for handwritten numeral recognition. Kasun et al. [11] applied the extreme learning machine (ELM) to handwritten numeral recognition.

Fig. 1. Some samples of handwritten numerals from various datasets.
To further improve the accuracy, some hybrid handwritten numeral recognition algorithms have been presented by combining different strategies. For example, Luo et al. [12] proposed a method combining the ELM and sparse representation based classification (SRC), named ELM-SRC, which outperforms the ELM. Niu et al. [13] jointly used a convolutional neural network (CNN) and a support vector machine (SVM), called CNN-SVM. Guo et al. [14] proposed a hybrid learning model with CNN and ELM.
With the rapid growth of data and the development of graphics processing units (GPUs) and cloud computing in recent years, deeper and wider CNNs have been widely applied in various areas of image classification [15], e.g., face recognition [16] and pedestrian re-identification [17–20]. Regarding handwritten numeral recognition, Ciresan et al. [21] exploited a committee of 35 convolutional deep neural networks (DNNs), similar to the 6-layer architecture in [22], to further decrease the error rate. Tabik et al. [23] provided some data-preprocessing techniques, such as centering, translation, and rotation, to augment the training data so as to effectively improve the recognition accuracy.
In this paper, an effective multi-task learning network for handwritten numeral recognition is presented. Based on the observation that the scratchy/non-scratchy writing style can play an effective complementary role to the feature learned from the numerals themselves, the proposed method designs a multi-task learning network (MTLN) that simultaneously trains a handwritten numeral learning module and a writing style learning module. Such a network is able to effectively share the relevant and irrelevant information between these two tasks, thereby enlarging the inter-class difference and reducing the intra-class difference. Extensive experiments show that the proposed MTLN method can effectively recognize handwritten numerals.
The rest of this paper is organized as follows. Section 2 introduces the proposed handwritten numeral recognition method in detail. Experimental results and comparisons are given in Section 3. Finally, Section 4 outlines the conclusions.
Network architecture
Figure 2 shows the architecture of the proposed MTLN approach for handwritten numeral recognition. As shown in Fig. 2, the input images have a fixed size of 28 × 28, and the MTLN contains eight layers: 4 convolutional layers (i.e., C1, C2, C4, and C5), 2 max-pooling layers (i.e., M3 and M6), and 2 fully-connected layers (i.e., F7 and F8).

Fig. 2. The architecture of the proposed MTLN for handwritten numeral recognition.
In the convolutional layers, the convolution operation is calculated as

$$y^{(l)} = w^{(l)} \otimes x^{(l-1)} + b^{(l)} \tag{1}$$

where l indexes the l-th layer, and x^(l-1) and y^(l) denote the input and the output feature map of the l-th layer, respectively. w^(l) is the weight vector of the convolution kernel, which is the same for all neurons (i.e., weight sharing). ⊗ and b^(l) denote the convolution operation and the corresponding trainable bias, respectively.
After the convolution operation, a nonlinear activation is applied to obtain the nonlinear mapping. Here, the rectified linear unit (ReLU) [24] function is used:

$$f(x) = \max(0, x) \tag{2}$$
Since the ReLU activation function can speed up the network training process [24], it is employed in all convolutional layers (i.e., C1, C2, C4, and C5) and in the first fully-connected layer F7 of the proposed MTLN. Moreover, to avoid the vanishing gradient problem, Batch Normalization (BN) [25] is further utilized in the first two convolutional layers C1 and C2. The steps of the BN algorithm are listed in Table 1.
Table 1. Algorithm steps of the BN operation
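The standard training-mode BN steps described in [25] can be illustrated with a minimal NumPy sketch. This is only an illustration, not the authors' Caffe implementation: the function name, the two-dimensional input layout, and the value of eps are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BN over a mini-batch x of shape (batch, features).

    eps is an assumed small constant for numerical stability.
    """
    mu = x.mean(axis=0)                    # step 1: mini-batch mean
    var = x.var(axis=0)                    # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift

# Toy mini-batch of 60 samples (the batch size used for training) with 4 features.
rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(60, 4))
out = batch_norm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With gamma = 1 and beta = 0, each feature of the output has approximately zero mean and unit standard deviation over the mini-batch.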
To reduce the dimension of the feature maps obtained from the convolutional layers, and thus the number of model parameters, sub-sampling operations are applied after the convolutional layers, which can be expressed as

$$y^{(l)} = \mathrm{down}\left(x^{(l-1)}\right) \tag{3}$$

where down(·) denotes a down-sampling function. In the proposed MTLN method, the sub-sampling function is max pooling. Specifically, two max-pooling layers (i.e., M3 and M6) are employed after the convolutional layers C2 and C5, respectively.
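As a small illustration of the activation and sub-sampling steps, the following NumPy sketch applies ReLU followed by max pooling with a 2 × 2 window to a toy feature map. The stride of 2 (non-overlapping windows) and the function names are assumptions made for this example.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: element-wise max(0, x)."""
    return np.maximum(x, 0.0)

def max_pool_2x2(fmap):
    """2x2 max pooling with an assumed stride of 2 on an (H, W) map, H and W even."""
    h, w = fmap.shape
    # Split the map into non-overlapping 2x2 blocks and take the maximum of each.
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[ 1., -2.,  3.,  0.],
                 [-1.,  5., -4.,  2.],
                 [ 0., -3., -1., -6.],
                 [ 7.,  1., -2., -5.]])
pooled = max_pool_2x2(relu(fmap))
print(pooled)  # [[5. 3.] [7. 0.]]
```

Each output value is the largest activation in its 2 × 2 block, so the 4 × 4 map is reduced to 2 × 2.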
In the proposed MTLN, the number of filters in the first two convolutional layers (i.e., C1 and C2) is set to 32, and the number of filters in the last two convolutional layers (i.e., C4 and C5) is set to 64. In addition, a small filter of size 3 × 3 is employed in all convolutional layers, since a stack of two 3 × 3 convolutional layers with no spatial pooling in between has an effective receptive field of 5 × 5 while using fewer parameters and adding an extra nonlinearity [26]. For the two max-pooling layers, the 2 × 2 max pooling operation is applied. Further parameter configurations of the proposed MTLN are listed in Table 2.
Table 2. The parameter details of the proposed MTLN method
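The motivation for stacking 3 × 3 filters can be checked with simple arithmetic. The sketch below compares two stacked 3 × 3 layers with a single 5 × 5 layer, assuming stride-1 convolutions and 32 input and 32 output channels in every layer for a like-for-like comparison; the helper names are hypothetical.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one kernel x kernel convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out

def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n_layers stride-1 conv layers sharing one kernel size."""
    return 1 + n_layers * (kernel - 1)

channels = 32  # assumed channel count for both input and output
two_3x3 = 2 * conv_params(3, channels, channels)
one_5x5 = conv_params(5, channels, channels)
rf = stacked_receptive_field(3, 2)
print(rf, two_3x3, one_5x5)  # 5 18496 25632
```

Under these assumptions, the two-layer stack covers the same 5 × 5 region with roughly 28% fewer parameters than a single 5 × 5 layer.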
Multi-task learning is considered an effective solution for simultaneously solving several different problems by learning related tasks in parallel while sharing the information obtained in the different tasks [27]. It can improve the generalization ability by exploiting the information contained in the training signals of related tasks as an inductive bias [28]. Consequently, multi-task learning can effectively improve the performance relative to learning each task independently.
Based on this motivation, the proposed method designs a special learning network to simultaneously perform two learning tasks: handwritten numeral learning and writing style learning. Such a network can enlarge the inter-class difference and reduce the intra-class difference by sharing the relevant and irrelevant information between these two tasks, so as to improve the recognition accuracy.
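For the handwritten numeral learning task, the loss is computed as the cross-entropy between the label and the Softmax output:

$$L_1 = -\sum_{i=1}^{10} t_i \log y_i \tag{4}$$

where (t1, t2, t3, …, t10) = (1, 0, 0, …, 0) denotes the numeral 0, (t1, t2, t3, …, t10) = (0, 1, 0, …, 0) denotes the numeral 1, (t1, t2, t3, …, t10) = (0, 0, 1, …, 0) denotes the numeral 2, and so on, while (y1, y2, y3, …, y10) is the posterior probability output of the Softmax layer.
Table 3. Training and testing images from the MNIST database
Similarly, for the writing style learning task, the loss is

$$L_2 = -\sum_{i=1}^{2} t_i \log y_i \tag{5}$$

where (t1, t2) = (1, 0) means the scratchy style, and (t1, t2) = (0, 1) represents the non-scratchy style.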
The proposed MTLN approach jointly performs the above-mentioned two tasks. Consequently, the overall loss of the proposed MTLN to be minimized is

$$L = \sum_{i=1}^{2} \lambda_i L_i \tag{6}$$

where λ_i is the weight of the i-th task, while L1 and L2 are computed by Equations (4) and (5), respectively. By simply treating the two tasks as equally important, λ1 = λ2 = 0.5 is set in our work.
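The weighted combination of the two task losses can be sketched in NumPy as follows. The sketch assumes that each task loss is a cross-entropy over Softmax outputs with one-hot targets; the function names and the toy inputs are hypothetical and do not reproduce the authors' Caffe layers.

```python
import numpy as np

def softmax(z):
    """Row-wise Softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot):
    """Mean cross-entropy between Softmax outputs and one-hot targets (assumed loss)."""
    return float(-(one_hot * np.log(probs + 1e-12)).sum(axis=1).mean())

def mtln_loss(numeral_logits, numeral_t, style_logits, style_t, lam1=0.5, lam2=0.5):
    """Weighted sum of the numeral loss L1 and the writing style loss L2."""
    l1 = cross_entropy(softmax(numeral_logits), numeral_t)  # 10-way numeral task
    l2 = cross_entropy(softmax(style_logits), style_t)      # 2-way style task
    return lam1 * l1 + lam2 * l2

# Toy sample: numeral "3" written in the scratchy style, with uninformative logits.
numeral_t = np.eye(10)[[3]]
style_t = np.eye(2)[[0]]
loss = mtln_loss(np.zeros((1, 10)), numeral_t, np.zeros((1, 2)), style_t)
print(loss)
```

With all-zero logits both Softmax outputs are uniform, so the toy loss equals 0.5 × (log 10 + log 2), which is a convenient sanity check.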
Some implementation details of the proposed MTLN method are as follows. Firstly, the experiments are implemented on a workstation with a Xeon E5-2650 2.60 GHz processor and an Nvidia GeForce GTX TITAN X, based on the deep learning framework Caffe [29]. Secondly, the proposed MTLN is trained using stochastic gradient descent (SGD) [30] with a batch size of 60 samples, a weight decay of 0.0005, and a momentum of 0.9. The learning rate is initialized as lr_base = 0.01 and updated by

$$\mathrm{lr}_{\mathrm{iter}} = \mathrm{lr}_{\mathrm{base}} \times (1 + \mathrm{gamma} \times \mathrm{iter})^{-\mathrm{power}} \tag{7}$$

where iter and lr_iter denote the iteration number and the learning rate at the current iteration, respectively, and gamma and power are set to 0.0001 and 0.75, respectively. During the training process, the weights are initialized from a normal distribution N(0, 0.01) and the biases are initialized as 0. In addition, the corresponding mean value is subtracted from every training image, and all the training data are randomly shuffled before being fed to the proposed MTLN.
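The parameters gamma and power match the "inv" learning-rate policy of Caffe, which decays the rate as base_lr × (1 + gamma × iter)^(−power). Assuming that this policy is the one used here, the schedule can be sketched as follows; the function name is hypothetical.

```python
def inv_learning_rate(iteration, base_lr=0.01, gamma=0.0001, power=0.75):
    """Learning rate at a given iteration under the assumed Caffe "inv" policy."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)

for it in (0, 10000, 100000):
    print(it, inv_learning_rate(it))
```

The rate starts at 0.01 and decreases smoothly; after 10,000 iterations it has fallen to about 0.0059.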
Datasets and evaluation protocol
In this section, the performance of the proposed MTLN method and multiple state-of-the-art methods is investigated on the benchmark MNIST dataset [9]. This dataset consists of 60,000 training images and 10,000 testing images covering ten classes, "0" to "9", and all the images have a fixed size of 28 × 28. Some handwritten numeral samples from the MNIST dataset can be found in Fig. 1. It can be clearly observed that the handwritten numerals vary largely in shape due to different writing styles. Some are neat and easy to recognize, while others are scratchy and hard to distinguish. Table 3 shows the corresponding training and testing images for handwritten numeral learning. For the writing style learning, we further divide the training images into two parts: scratchy images and non-scratchy images. Moreover, the test error rate is used as the performance index to evaluate the various handwritten numeral recognition methods; it is calculated as the ratio of the number of falsely recognized images to the total number of test images.
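The test error rate defined above can be computed as in the following minimal sketch. The function name and the toy label lists are hypothetical and serve only to illustrate the ratio.

```python
def error_rate_percent(predicted, actual):
    """Percentage of test images whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)

actual = list(range(10))                     # toy ground-truth labels 0..9
predicted = [0, 1, 2, 3, 4, 5, 6, 7, 8, 4]   # the last sample is misrecognized
print(error_rate_percent(predicted, actual))  # 10.0
```

On the 10,000 MNIST test images, each misrecognized image therefore contributes 0.01 percentage points to the test error rate.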
Performance comparison
Table 4 shows the performance of the proposed MTLN method and multiple state-of-the-art methods, including Hinton et al. [31], Salakhutdinov et al. [32], LeNet-A [21], LeNet-B [21], LeNet-C [21], Network3 [33], Guo et al. [14], Labusch et al. [34], and Lauer et al. [35]. To analyze how much the writing style learning contributes, the performance obtained with the handwritten numeral learning (HNL) alone is also investigated and shown in Table 4. Note that the HNL network is exactly the upper part in Fig. 2.
Table 4. Performance comparison of different methods
One can see from Table 4 that the proposed MTLN method achieves the lowest test error rate (i.e., 0.40%) and consistently outperforms the state-of-the-art methods under comparison. In addition, it can be further observed that the proposed HNL achieves relatively good performance on its own, and that the proposed MTLN, which jointly explores HNL and writing style learning, achieves better performance than the HNL. This study indicates that the writing style learning plays an effective complementary role to the HNL so as to further decrease the test error rate.
To demonstrate the generalization ability, the proposed MTLN method and some state-of-the-art methods are tested on two completely unseen datasets, USPS and Binary Alpha-digits [36]. The methods compared in this experiment are LeNet-A [21], LeNet-B [21], LeNet-C [21], and Network3 [33]. The numbers of testing handwritten numeral images used in this cross-dataset evaluation are shown in Table 5. Note that 1) the Binary Alpha-digits dataset consists of not only the 10 handwritten numerals but also 26 handwritten capital letters; in our experiments, only the handwritten numerals are selected as testing images. 2) The sizes of the handwritten numeral images in the USPS and Binary Alpha-digits datasets are 16 × 16 and 20 × 16, respectively; in our experiments, all these testing images are resized to 28 × 28 to fit the proposed network. Some examples of handwritten numerals from these two datasets are shown in Fig. 1. As before, the test error rate is used as the performance index.
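The resizing step can be illustrated with a short NumPy sketch. The interpolation method is not specified here, so nearest-neighbor sampling is used purely as an assumption, and the function name and the all-zero placeholder images are hypothetical.

```python
import numpy as np

def resize_nearest(img, out_h=28, out_w=28):
    """Resize a 2-D image to (out_h, out_w) by nearest-neighbor sampling (assumed method)."""
    in_h, in_w = img.shape
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return img[rows[:, None], cols[None, :]]

usps_like = np.zeros((16, 16))    # placeholder with the USPS image size
alpha_like = np.zeros((20, 16))   # placeholder with the Binary Alpha-digits image size
print(resize_nearest(usps_like).shape, resize_nearest(alpha_like).shape)
```

Both placeholder images come out as 28 × 28 arrays, matching the fixed input size of the network.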
Table 5. Testing images for cross-dataset evaluation
The results of the cross-dataset evaluation in terms of test error rate are summarized in Table 6. It can be clearly observed that the test error rates of all the methods increase in the cross-dataset evaluation, compared with those in Table 4. The reason behind this observation is that the testing images from the USPS and Binary Alpha-digits datasets are completely unseen by these methods during training. Furthermore, one can see that the proposed MTLN again obtains the lowest test error rate and consistently outperforms the state-of-the-art methods under comparison. This investigation indicates that the proposed MTLN has good generalization ability.
Table 6. Performance comparison in cross-dataset evaluation
In this paper, a novel multi-task learning network (MTLN) is proposed for handwritten numeral recognition. The success of the proposed MTLN stems from its ability to simultaneously perform two tasks: handwritten numeral learning and writing style learning. Since the writing style learning is, to a certain extent, complementary to the handwritten numeral learning, the proposed MTLN can exploit their correlation to obtain a more robust and distinguishable classifier with a higher handwritten numeral recognition rate. Experiments on multiple handwritten numeral datasets show that the proposed MTLN can effectively improve the accuracy of handwritten numeral recognition and has good generalization ability. Moreover, it has also been demonstrated that the proposed MTLN method is consistently superior to multiple state-of-the-art handwritten numeral recognition methods.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under the Grants 61871434, 61602191 and 61802136, in part by the Natural Science Foundation of Fujian Province under the Grants 2016J01308 and 2017J05103, in part by the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University under the Grants ZQN-YX403 and ZQN-PY418, in part by the High-Level Talent Project Foundation of Huaqiao University under the Grants 14BS201, 14BS204, and 16BS108, and in part by the Subsidized Project for Postgraduates Innovative Fund in Scientific Research of Huaqiao University.
