Abstract
Recent improvements in deep learning techniques show that deep models can extract more meaningful data directly from raw signals than conventional parametrization techniques, making it possible to avoid specific feature extraction in the area of pattern recognition, especially for Computer Vision or Speech tasks. In this work, we directly use raw text line images by feeding them to Convolutional Neural Networks and deep Multilayer Perceptrons for feature extraction in a Handwriting Recognition system. The proposed recognition system, based on Hidden Markov Models that are hybridized with Neural Networks, has been tested with the IAM Database, achieving a considerable improvement.
Introduction
The field of Handwriting Recognition (HWR) has been a topic of intensive research for a long time (see some surveys in [13,14,28,43,64]). However, recognizing unconstrained handwritten text remains a challenging task. HWR has two main modalities: the online case, where the trajectories of strokes are recorded while the user is writing, and the offline modality, where only the text image is available (e.g. scanned document). The offline case is more challenging due to the lack of temporal relations between strokes.
Connectionist methods and, especially, deep neural networks [3,4,52] are able to extract meaningful features from raw values (in offline HWR, the scanned text image) as in [9,11,17,25].
Along these lines, this work proposes the use of deep neural networks (more specifically, deep Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs)) to extract meaningful features for unconstrained offline HWR. CNNs have already been used in our research group for several related applications [40,41]. In previous works, we also developed an HWR engine based on Hidden Markov Models (HMMs) hybridized with Artificial Neural Networks (ANNs), which gave the best results in a fair comparison at that time [19]. Thus, the idea of using raw pixels from the text line images emerged naturally. The HWR engine that uses deep MLPs and CNNs to extract meaningful features improves on this baseline, as the experiments will show.
The remainder of this paper is organized as follows: Section 2 provides a short introduction to the state of the art. Our new proposals are presented in detail in Section 3. The experimental setup and results are described and analyzed in Sections 4 and 5, and we present our conclusions in Section 6.
State of the art
Offline HWR involves several stages from the image acquisition of the documents (e.g. scanning an ancient book) to the final result which can be in the form of a text transcription, a graph of words (in order to model recognition ambiguities) or even an index for keyword spotting applications.
In this work, we will center our attention on the transcription of text line images, hence skipping some steps such as image cleaning/enhancing, text detection or text line segmentation.
A HWR system receives a text line image which is generally converted to a sequence
The sequence with the maximum probability, given the input X, is searched in every possible sequence of words of a given vocabulary Ω.
From this point of view, the recognition of handwritten text line images shares many characteristics with Large Vocabulary Continuous Speech Recognition (LVCSR): a joint segmentation and classification task is required in both cases for decoding, since we cannot split the image or the audio into words or even graphemes/phonemes in order to classify them afterwards. To overcome this cyclic dependency (known as Sayre’s paradox [50]), HMMs have been used for decades for these and for many other sequence labeling problems [47]. For HMMs, the previous Formula (1) is decomposed, by using Bayes’ theorem, as the product of the optical model
The optical modeling
Emission probabilities
Scaled emission probabilities can be estimated with discriminative models (e.g., Neural Networks (NNs)) that approximate the posterior probabilities of each state
Several ANN types have been used for this purpose: MLPs in [54], CNNs in [9], Recurrent Neural Networks (RNNs) in [37], and combinations of them. Additionally, we can find other related models: Radial Basis Functions in [56], Support Vector Machines (SVMs) in [57], or time-delay networks in [15,29,51].
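The idea of scaled emission probabilities can be illustrated with a minimal sketch. It assumes the usual hybrid HMM/ANN convention in which the network posteriors are divided by the state priors; the function name, the toy values and the way the priors are obtained are hypothetical and are not taken from the system described here.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-12):
    """Convert state posteriors P(q|x) into scaled log-likelihoods.

    By Bayes' theorem p(x|q) = P(q|x) p(x) / P(q). The term p(x) is the
    same for every state, so it can be dropped during decoding.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return np.log(posteriors + eps) - np.log(priors + eps)

# Hypothetical values: 2 frames, 3 HMM states.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
# Assumption: priors estimated as relative state frequencies in an alignment.
priors = np.array([0.5, 0.3, 0.2])
scores = scaled_log_likelihoods(posteriors, priors)
```

Working in the log domain keeps the values in a range suitable for Viterbi decoding, where the scores of consecutive frames are added.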
On the other hand, tandem systems can make use of ANNs similar to hybrid HMM/ANN but posteriors are fed to GMMs. Other tandem approaches are possible and, for example, in [26], an MLP bottleneck is used to extract features for the tandem. This approach was later improved by Deep Belief Networks (DBN) as seen in [49].
Other connectionist approaches get rid of the decomposition of
LSTMs have been extended to the multidimensional case [25], and 2D-LSTMs have led to impressive results in HWR tasks [65] at the expense of a high computational cost. In this regard, [46] proposes the use of CNNs combined with LSTMs as a cost-efficient alternative to Multidimensional Recurrent Layers.
Proposed approaches
Our baseline system [19,40] is based on Hidden Markov Models that are hybridized with Neural Networks where emission probabilities are estimated by an MLP. The input is a sequence of feature vectors extracted following the approach presented in [60]. An illustration of the baseline system is depicted in Fig. 1(a).

Fig. 1. Handwriting recognition systems.
In the proposed approaches, instead of relying on a previous feature extraction process, we rely on the raw image input, by using a deep neural network to directly extract meaningful features (see Fig. 1(b)) or by using CNNs (see Fig. 1(c)).
The primary goal of this work is to extract meaningful features for HWR using deep learning techniques. When using the baseline system, the input of the MLP is a set of feature frames that are centered at the current frame (see Fig. 1(a)). However, in the first proposed approach, the sliding window receives a patch of raw pixels that are directly fed to the NN, as illustrated in Fig. 1(b). The choice of a square window has given good results in preliminary experiments [39,40], leading to a window of
When dealing with raw images, there are several issues to keep in mind to improve the performance and generalization. Several standard regularization methods such as weight decay or max weight penalty have been employed. Regularization techniques such as dropout have helped to improve results. We have also used a layer-wise pretraining with Stacked Denoising Autoencoders (SDAE) [63] in order to train deeper nets.
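As an illustration of the dropout regularization mentioned above, the following sketch shows the common "inverted dropout" variant. This is an assumption made for the example: the text does not state which variant the toolkit implements, and the function name and toy input are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the units during training.

    The surviving units are rescaled by 1 / (1 - rate) so that the
    expected activation is unchanged and no rescaling is needed at test time.
    """
    if not training or rate == 0.0:
        return activations
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep

rng = np.random.default_rng(0)
hidden = np.ones((4, 5))                       # hypothetical hidden-layer output
dropped = dropout(hidden, rate=0.2, rng=rng)   # 0.2 is the drop rate reported in the experiments
unchanged = dropout(hidden, rate=0.2, rng=rng, training=False)
```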
CNNs with 2D convolutions
In the classical HMM/ANN architecture, the use of a sliding window (where the same NN is applied to classify each frame) can be seen as a 1D convolution on the X axis. Now, we would like to explore the use of 2D convolutions combined with pooling layers and higher level convolutions that will hopefully be able to extract more useful features.
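The view of the sliding window as a 1D convolution along the X axis can be sketched as follows. The image size, the window width and the single linear layer standing in for the NN are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def sliding_windows(line_image, width, step=1):
    """Yield (height x width) patches moving left to right over a text line image."""
    height, total_width = line_image.shape
    for x in range(0, total_width - width + 1, step):
        yield line_image[:, x:x + width]

def apply_per_window(line_image, width, frame_fn, step=1):
    """Apply the same function to every window.

    Sharing the parameters of `frame_fn` across all horizontal positions is
    what makes this equivalent to a 1D convolution along X.
    """
    return np.stack([frame_fn(w) for w in sliding_windows(line_image, width, step)])

rng = np.random.default_rng(0)
image = rng.random((8, 20))        # hypothetical 8 x 20 text line image
weights = rng.random((8 * 5, 3))   # hypothetical linear "network": 40 inputs, 3 outputs
outputs = apply_per_window(image, 5, lambda w: w.reshape(-1) @ weights)
```

Each row of `outputs` corresponds to one window position, so the result is a sequence of frames that an HMM can consume.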
Figure 1(c) illustrates the CNN for feature extraction and conditional probability computation in our setup. In the proposed settings, several parameters must be chosen for the CNN, such as the number of convolutions, pooling layers, activation functions, number, and size of the convolutional kernels as well as the classifier, which is usually an MLP.
In practice, obtaining an architecture with a good cost-efficiency trade-off is both important and challenging. Thus, the computational restrictions that condition the search for an appropriate but efficient architecture must not be forgotten. With all of those limitations in mind, we have explored three alternatives: 1) using well known CNN architectures, 2) using a specific network for the mentioned task, and 3) using a model inspired by a well established feature extraction technique.
Using well known architectures
Our first attempt at using CNNs for feature extraction imitates some of the previous architectures that have achieved good results in similar tasks. This is the case of the convolutional net LeNet [33], which obtained good results on the MNIST database [34]. In addition, the increase in computational resources (especially advances in GPU computing and distributed systems) has allowed the use of deeper and more complex models. In recent years, these advances, combined with appropriate parameter tuning, have led to remarkable improvements in performance, especially in computer vision tasks. This is the case of nets like AlexNet [32], GoogLeNet [59], and Very Deep Convolutional Networks [55], which have reported excellent results in other tasks such as the ImageNet Large Scale Visual Recognition Challenge [48].
Adhoc networks dealing with HWR
Most of the architectures in the literature are designed for tasks like MNIST, which consists of
When tuning an NN model we would have to explore, in an ideal case, every possible parameter and hyper-parameter in order to obtain the most successful configuration. The use of CNNs and deeper nets makes this tuning process harder since more parameters are added, most of which are related to the new topology and layer configurations. Thus, in order to guide our exploration, we should pay attention to the kernel sizes needed to extract useful features, the number of kernels needed to cover the variability of the text, and the deeper layers of the model needed to properly represent the characteristics of the problem.
In the feature extraction process proposed in [60], the frames are computed using
When analyzing the kernels trained in some preliminary experiments, we could conclude that there is a tendency to extract redundant information from 16 kernels in the first convolution. Some of the learned kernels detect edges in several orientations, others estimate the ink text zones, and some of them model the background. It turns out that all of these features can be extracted with no more than 5 to 10 kernels. Due to the above-mentioned computational constraints, we will avoid a large number of kernels, at least in the first convolutions.
Cell feature extraction by kernels
Our baseline HWR system used the parametrization described in [60]. In this work, we will design CNNs that are powerful enough to mimic this feature extraction process. However, it is important to note that the convolution kernels are not limited to extracting these features, since they will learn on their own.
The original feature extraction divides the input into cell regions. For each region, three values are extracted: one value with the proportion of gray level and two values for the vertical and horizontal derivatives. A linear regression model is fitted to find the optimal derivative directions.
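The three values per cell can be illustrated with a simplified sketch. It assumes plain least-squares lines fitted to the column and row means of a cell; the actual procedure in [60] may use additional weighting or smoothing that is not reproduced here, and the function name and toy cell are hypothetical.

```python
import numpy as np

def cell_features(cell):
    """Return (gray level, horizontal derivative, vertical derivative) for one cell."""
    cell = np.asarray(cell, dtype=float)
    rows, cols = cell.shape
    gray = cell.mean()
    # Slope of a least-squares line fitted to the column means (horizontal
    # derivative) and to the row means (vertical derivative).
    h_slope = np.polyfit(np.arange(cols), cell.mean(axis=0), 1)[0]
    v_slope = np.polyfit(np.arange(rows), cell.mean(axis=1), 1)[0]
    return gray, h_slope, v_slope

# Hypothetical 4 x 5 cell whose gray level grows linearly from left to right.
cell = np.tile(np.arange(5.0), (4, 1))
gray, h_slope, v_slope = cell_features(cell)
```

For this toy cell the horizontal slope is positive and the vertical slope is numerically zero, which is the kind of information a pair of derivative kernels would have to reproduce.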
The minimal requirements for a CNN to model these features are that one convolution could compute the vertical derivatives from the differences between the upper and lower cell values. Similarly, another convolution can compute the horizontal derivatives, whereas a third convolution would be enough to estimate the smoothed gray level. This leads to a CNN with only one convolution layer of three maps of
CNNs with 1D convolutions
We have also explored a CNN that convolves the text line images in only one direction. The convolutional kernels would have the height of the image, and they would advance from left to right. Therefore, each kernel extracts only one feature per column. We explored two different approaches:
Applying the vertical kernels directly to the raw image (Fig. 2).
Applying the vertical kernels after a 2D convolved map (Fig. 3). In this case, the first set of 2D convolutions is obtained followed by the application of 1D kernels to these previous maps.
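A minimal sketch of the first approach, where full-height kernels slide only from left to right, is given below. The window size, the kernel width and the number of kernels are hypothetical, and the operation is written as a cross-correlation, as is customary in CNN implementations.

```python
import numpy as np

def vertical_conv(window, kernels):
    """Slide full-height kernels left to right over a window.

    window:  (height, width)
    kernels: (n_kernels, height, k_width)
    returns: (n_kernels, width - k_width + 1), one value per kernel and column position
    """
    height, width = window.shape
    n_kernels, k_height, k_width = kernels.shape
    if k_height != height:
        raise ValueError("vertical kernels must span the full window height")
    out = np.empty((n_kernels, width - k_width + 1))
    for x in range(out.shape[1]):
        patch = window[:, x:x + k_width]
        # Sum of element-wise products between every kernel and the patch.
        out[:, x] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out

window = np.ones((6, 10))       # hypothetical 6 x 10 input window
kernels = np.ones((4, 6, 3))    # hypothetical: 4 kernels of height 6 and width 3
features = vertical_conv(window, kernels)
```

For the second approach, the same function could be applied to each map produced by a preceding 2D convolution instead of the raw window.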

Fig. 2. Vertical model (I). The kernels run over the input window and only in the horizontal direction.

Fig. 3. Vertical models (II). Vertical kernels run over the maps generated by the first 2D convolutions.
In the first case, the vertical kernels are applied to the input window, and we have a set of
Evaluation corpus: IAM database
The IAM offline dataset [36] is composed of forms containing handwritten English sentences that are extracted from the LOB corpus [30]. Version 3.0 of the IAM Dataset was used.
The recognition engine is based on a hybridized HMM with an MLP to model graphemes, which was presented in [19,40,41,68]. Each grapheme is modeled with a 7-state left-to-right HMM topology with loops and without skips. The connectionist model used to estimate the emission probabilities of the HMM states was an MLP with 2 hidden layers of 512 and 256 units, respectively, using the softmax activation at the output layer. The HMM/ANN system is trained by means of an Expectation-Maximization procedure with a forced Viterbi alignment by using the April-ANN toolkit [67]. This toolkit has also been used to perform the experiments based on deep MLPs and CNNs described in the following sections.
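The 7-state left-to-right topology with loops and without skips can be sketched as a transition matrix. The loop probability of 0.5 is a hypothetical initial value used only for illustration, and the extra exit column is a representation choice of this example, not something stated in the text.

```python
import numpy as np

def left_to_right_transitions(n_states=7, loop_prob=0.5):
    """Transition matrix with self-loops and next-state transitions only (no skips).

    The matrix has one extra column holding the exit probability of the last state.
    """
    A = np.zeros((n_states, n_states + 1))
    for s in range(n_states):
        A[s, s] = loop_prob            # stay in the same state (loop)
        A[s, s + 1] = 1.0 - loop_prob  # advance to the next state, or exit from the last one
    return A

A = left_to_right_transitions()
```

Every row has exactly two non-zero entries, so a grapheme model must consume at least seven frames before it can be left.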
The images received by the recognition engine were preprocessed following the skew and slant correction presented in [19] and following the height normalization proposed in [41]. The emissions computed for each HMM state were calculated taking into account the current frame and the surrounding ones, for a total of 11 frames. Each frame was extracted from a window of
For the LM, a 4-gram model with Witten-Bell smoothing, trained with the SRILM toolkit [58], was used. The text corpora used to train the n-gram LM were: the LOB corpus [30] (excluding those sentences that contain lines from the test set or the validation set of the IAM task), the Brown corpus [20], and the Wellington corpus [2]. The lexicon of the LM had approximately 103K different words. This LM is the same as the one in [68], whose larger vocabulary differs from the previous work of [19]. The word insertion penalty and grammar scale factor parameters were optimized on the validation set by means of the Minimum Error Rate Training procedure [38].
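The role of the grammar scale factor and the word insertion penalty can be illustrated with a short sketch. It assumes the usual log-linear combination of optical and LM scores; the exact formulation used by the decoder is not given in the text, and all names and numeric values below are hypothetical.

```python
def combined_score(optical_logprob, lm_logprob, n_words, gsf, wip):
    """Log-linear decoding score for one hypothesis.

    gsf weights the language model against the optical model, and wip adds
    a constant per hypothesized word (negative values favor fewer words).
    """
    return optical_logprob + gsf * lm_logprob + wip * n_words

# Hypothetical hypotheses: (optical log-prob, LM log-prob, number of words).
hypotheses = {"A": (-120.0, -20.0, 4), "B": (-118.0, -30.0, 5)}

def best(gsf, wip):
    """Return the label of the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda h: combined_score(*hypotheses[h], gsf, wip))

optical_only = best(gsf=0.0, wip=0.0)
with_lm = best(gsf=1.0, wip=0.0)
```

In this toy case the ranking of the two hypotheses changes once the LM is taken into account, which is why both parameters need to be tuned on held-out data.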
Using raw input and deep MLPs
Our first goal is to compare the baseline system with a new one, avoiding an explicit handcrafted feature extraction. Table 1 shows the configuration used for the deep MLP-based systems with a receptive field (
Deep MLPs fed directly with a raw image input from a window size of
(1764 pixels)
Table 2 summarizes the explored CNN topologies by enumerating, for each one, the sequence of kernels, pooling layers, flatten procedures and fully connected layers applied in each case along with their parameters and the size (number of maps and their dimensions) of the corresponding outputs. First, a topology based on LeNet (LeNet-5) was tested. For the second alternative, after several trials, we could highlight one special configuration, called Adhoc CNN, which led us to the best results. We also decided to apply max pooling layers to not only speed up the computations but also to make our model more robust to translations. We tried increasing max-pooling layers of
CNN topologies for the recognition system
The models with the minimal configuration able to imitate the cell feature extraction were tagged as Cell/Kernel 1 and 2 Conv. As can be observed, the size of the kernels increased up to
Overall performance of the proposed systems on the development set (configurations with the § mark make use of SDAE, the configuration with the † mark is the Approach 1 of Table 4, while the configuration with the ‡ mark is the Approach 2 of the same table)
Our baseline HWR system, based on hybrid HMMs with ANNs using handcrafted features, was presented in [19]. With some slight modifications, our best results were reported in [40], obtaining a Word Error Rate (WER) of 15.6% and 19.0% for the validation and test sets, respectively. Table 3 shows the overall performance of the proposed systems, with a confidence interval of 95% [62]. First, it can be observed that all the deep models with more than two layers using raw inputs improved on the baseline. With only two hidden layers in the raw setup, however, the results were worse than the baseline unless dropout was added, in which case they were similar. Dropout significantly helped the deep models, which reached their best performance with three hidden layers and a drop rate of 0.2, obtaining a WER of 13.7% on the development set (this configuration is called Approach 1 in Table 4). We tried drop rates larger than 0.2, but the performance did not improve. Although some results with deep models were better than others, there was no statistically significant difference among them. For Character Error Rate (CER), the deep models showed a statistically significant improvement over the baseline system.
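For reference, WER and CER are conventionally computed from the edit distance between the reference and the recognized sequences. The sketch below follows that standard definition; the function names and the example sentences are hypothetical, and the scoring tool actually used in the experiments is not specified in the text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(references, hypotheses, unit="word"):
    """Error rate in percent over a set of lines: total edits / total reference units."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total

wer = error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
cer = error_rate(["abc"], ["abd"], unit="char")
```

The errors are accumulated over all lines before dividing, so long lines weigh more than short ones in the final rate.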
The HMM/CNN systems performed better than the baseline system. When compared with the deep MLPs using raw inputs, the results were similar when dropout was used. When exploring the different nets, good performance from the Adhoc CNN net or even LeNet-5 could be expected. Even though the performance in these cases is quite good, the best result achieved so far has been with a simple net (Cell-kernel 1, corresponding to Approach 2 in Table 4), using one convolution with a stride of three in each direction and only six kernels. We presume that the simplicity of the model eased the training, and that with six kernels the model covers most of the variability of the handwriting (as illustrated in Fig. 4(a)). In this particular case, the net extracts 1014 features from the convolution process, which are conveniently combined with two fully connected layers of
Finally, vertical models showed a more modest performance, which did not improve the traditional 2D convolution models, but they were still better than the baseline. As before, CER was significantly better with the HMM/CNN than with the baseline system.
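The figure of 1014 features reported for the single-convolution net can be checked with a little arithmetic. The sketch below assumes a square 42 x 42 input window (42 * 42 = 1764 pixels, the window size quoted for the deep MLPs) and an unpadded convolution; both are assumptions of this example, and the actual kernel size is the one listed in Table 2.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output length of an unpadded convolution along one axis."""
    return (input_size - kernel_size) // stride + 1

n_kernels, stride, n_features = 6, 3, 1014   # figures quoted in the text
window = 42                                  # assumption: square window of 1764 pixels

# Side of each square feature map: 1014 / 6 = 169 = 13 * 13.
map_side = round((n_features / n_kernels) ** 0.5)

# Kernel sizes that are consistent with those figures under the assumptions above.
consistent = [k for k in range(1, window + 1)
              if n_kernels * conv_output_size(window, k, stride) ** 2 == n_features]
```

Under these assumptions each of the six kernels yields a 13 x 13 map, and only a small range of kernel sizes is compatible with the reported feature count.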

Fig. 4. Generated maps from several convolution nets.
Regarding the execution time, both the baseline experiments and the proposed architectures have been trained and evaluated by means of the same April-ANN toolkit [67]. This toolkit performs all the computation on CPU, but it makes heavy use of optimized linear algebra libraries (in particular, the Intel MKL library [66]). Experiments have been conducted on computers with different specifications, so a fair comparison of the execution time is limited to those that have been performed on the same type of machine. Most experiments have been performed on an Intel Core i7-3770 CPU at 3.40 GHz with 32 GB of RAM: the “Raw input + Deep MLPs” 2048-512 0 and 2048-512 0.2 versions of Table 3 required a decoding time of 5.37 and 5.07 seconds/sentence on average, respectively. On the same machine, LeNet-5 only required 3.34 seconds/sentence, while Adhoc, Vertical 1 and Vertical 2 required 4.72, 4.17 and 5.19 seconds/sentence, respectively. Although other experiments (Baseline, Cell-kernels,
Table 4 shows the best results of our contributions together with the results of other works reported in the literature using the same database. The table is divided among isolated word recognition, line recognition and paragraph recognition. As mentioned above, we have used lines for training and evaluation. Although not all the results can be directly compared, it can be observed that the use of CNN models to extract features from raw input for HWR consistently improves recognition rates. This has been shown in our HWR system (Approach 2 is a statistically significant improvement over the baseline) and, regarding the results summarized in Table 4, we can observe that some of them have also relied on CNNs [8,46].
We have presented several improvements to our HWR engine by removing handcrafted feature extraction from the text images and using deep learning techniques directly on the raw input. Deep MLPs and CNNs have been analyzed for the current HWR task. The results presented for the IAM Database validate this approach, consistently with other authors’ works.
Although several CNN topologies have been explored, one of the configurations that led to good results comprises a single convolution layer without pooling, achieving a WER of 17.2%. If we compare this result with the baseline (the HMM/ANN system with handcrafted features, which has a WER of 19.0%), a considerable step forward in recognition performance has been achieved.
Although the use of CNNs is not novel in HWR, the value of the experiments reported here lies in the fact that they provide a fair comparison between handcrafted and machine-learned features by virtue of using a baseline whose only difference from the proposed approaches is the feature extraction stage, hence isolating the effect of this single stage from the whole HWR pipeline. Besides that, several different CNN topologies have been compared. We can also mention that the topologies proposed here only require a modest computational cost compared with the alternatives that can be found elsewhere.
There are many lines of future work to be pursued. First, we propose to apply other normalization techniques to speed up the training and improve results. We also need to carry out a more exhaustive error analysis to determine which steps of the whole transcription pipeline we should focus on to obtain further improvements. Many novel CNN architectures have recently appeared in the general field of Computer Vision and, although many of them were not originally intended for HWR, some of them can be adapted to this particular subfield. Finally, we plan to try CTC decoding and to adopt training procedures for HMM/ANN along the lines of [69].
Acknowledgements
Work partially supported by the Spanish MINECO and FEDER funds under project TIN2017-85854-C4-2-R.
