Abstract
Handwritten text recognition remains an active area of research, and a critical analysis of existing techniques is increasingly necessary. This article presents a bibliographic study of existing recognition systems, with the aim of motivating researchers to examine these techniques and to develop more advanced ones. It offers a detailed comparative study of Arabic handwritten character recognition techniques based on holistic, analytical and segmentation-free approaches. First, we show the differences between recognition approaches: deep learning versus machine learning. Second, we describe the Arabic handwriting recognition process, which comprises pre-processing, feature extraction and segmentation. Then, we illustrate the main techniques used in the field of handwriting recognition and provide a synthesis of these methods.
Keywords
Introduction
Information processing techniques are currently undergoing very active development in connection with data processing. They offer increasingly important potential in the field of human-computer interaction. In addition, machine simulation of human reading has been the subject of intensive research in recent years.
Handwriting recognition belongs to the wider domain of pattern recognition. It seeks to develop systems that come as close as possible to the human capacity for reading. The difficulty of recognition is partly related to the writing style: the more legible and regular the writing, the easier the recognition. The performance of a recognition system varies greatly with the clarity of the images provided (readability, good image resolution, well-spaced lines, etc.). Three writing styles can be distinguished, in ascending order of difficulty: printed text, handwriting in capital or block (stick) letters, and cursive handwriting.
Handwriting remains an important means of communication, even though print plays a significant part in various applications thanks to its legibility, safety and speed of communication. Handwriting recognition has several applications: OCR systems, digitization of historical and cultural heritage, industrial applications such as the automatic cashing of bank checks and the processing of administrative forms, educational applications, etc.
Handwritten character recognition is more complex than printed character recognition because of variations in the shapes and sizes of handwritten characters. Moreover, in contrast to printed text in most languages, the characters in cursive handwritten words are connected. Handwriting is a free form of expression specific to each person, even if certain production rules have been taught. The difficulties encountered by a handwriting recognition system are caused firstly by the wide variety of possible shapes of the same character and secondly by the concatenation of characters inside a word.
An automatic handwriting recognition system can be either online or offline, depending on how the data are presented to the system. Online recognition is carried out while the words are being written: the user writes on a digital device (such as a digital tablet) with a special stylus [1]. Offline recognition concerns documents that have already been written: a scanned handwritten or printed text is fed to the system as a digital image. It is the automatic transcription of text by computer, and its objective is to determine which letters or words are present in a digital image of handwritten text. The online problem is usually easier than the offline problem since more information is available.
Much work has been done on the recognition of Arabic characters, both isolated and in cursive script. Arabic script has many features that make the recognition of handwritten Arabic text an inherently difficult problem [2]. Several Arabic characters are distinguished only by the placement and number of dots and strokes. Many other features of Arabic script are detailed in [2]. A typical recognition system consists of a number of stages: pre-processing, segmentation, feature extraction, training, classification and post-processing, as presented in Fig. 1.
Fig. 1. Diagram of recognition system.
Fig. 2. A deep architecture topology.
The feature extraction phase is complex: it requires domain expertise (HOG, SIFT, etc.) and varies with the task to be performed. Because of this, and because of the problems encountered during the segmentation phase [3], many researchers have adopted an approach that avoids the manual choice of features and proposes a model based on deep networks capable of processing a variety of data (sequential or spatial) to recognize characters, words and ultimately sentences [4, 5]. This is due to the greater representational power of deep architectures compared to traditional or shallow architectures. Each layer provides a representation at a higher level than the previous one (Fig. 2). The hope is therefore to learn an abstract representation of the data in which the task to be solved becomes easier.
In the following, Section 2 discusses three recognition approaches (analytic, holistic and segmentation-free) and the main distinctions between them. Section 3 presents the main steps of the recognition process. The most important techniques used in the field of handwriting recognition are presented in Section 4. We compare some existing works in Section 5 and give a synthesis in Section 6. We finish in Section 7 with a conclusion.
Depending on the nature and size of the unit considered, there are three approaches to offline handwriting recognition: the global (holistic) approach, the local (analytical) approach and the segmentation-free approach.
Holistic approach
Holistic approaches are known for their simplicity and their similarity to human reading. The holistic approach models words as indivisible entities [6, 7, 8, 9]. In this strategy, global features are extracted from the entire word, such as loops, ascenders, descenders, upper/lower profiles, valleys, length, terminal dots, and many others, which avoids the segmentation process and its problems. As the lexicon grows, the complexity of the algorithms increases linearly because of the need for a larger search space and a more complex pattern representation. Thus, for methods based on matching to a prototype, every added word requires the creation of a new prototype and a model that must be retrained. Most of these methods are applied to legal amount recognition in bank cheque processing, where the vocabulary is small [10, 11].
This approach is often applied to reduce the list of candidate words when recognizing large vocabularies, and the paradigm is used more for printed script recognition. Its drawback, however, is that it is only applicable to small vocabularies. Beyond a few tens of classes, the discriminating capacity of the globally extracted primitives decreases and the possible confusion between words of the vocabulary increases, which degrades performance.
Analytic approach
The analytical approach overcomes the limitations of the holistic approach but requires local interpretation based on segmentation. Much handwriting recognition research has focused on analytical approaches. In this paradigm, words are broken down into a collection of simpler subunits such as characters. It is a bottom-up approach that starts by identifying characters and works towards building a meaningful text [12]. Recognition under this approach is based on segmentation and on the identification of the extracted segments. Multiple segmentation strategies have been proposed to address the problem, falling into two principal solutions: explicit segmentation and implicit segmentation. Thanks to the segmentation stage, the analytical approach can generalize to the recognition of an unlimited vocabulary. However, no segmentation method reliably extracts the characters from a given word image.
Segmentation-free approach
In the segmentation-free approach, recognition is performed without explicit segmentation of the image. This method thus sidesteps the challenge of finding character boundaries and reduces the problem of under-segmentation. Nowadays, this class of approaches is the most widely used, attracts a lot of research attention, and records the best performance on standard benchmarks. These approaches are mainly based on hidden Markov models (HMMs) and deep neural networks. HMMs are well suited to transforming a sequence of feature vectors into a sequence of characters. Each character is modeled with several states, and an emission model defines the probability of a state generating a feature vector. Two main HMM-based strategies have been used. The first uses a sliding window to extract a sequence of feature vectors. In the second, the system directly models the two-dimensional image input. Many methods based on these strategies have been proposed for Arabic handwriting recognition [13, 14]. Recently, notably efficient techniques have been based on deep neural networks, particularly convolutional neural networks, which extract more complex representations [15, 16].
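The sliding-window strategy mentioned above can be illustrated with a minimal sketch. It assumes a binarized word image stored as a list of rows of 0/1 values (1 for ink); the function name, the window width and step, and the two features computed per frame (ink density and normalized vertical centre of gravity) are illustrative choices, not the features used in the cited works.

```python
def sliding_window_features(image, width=3, step=1):
    """Slide a window of `width` columns over a binary image (rows of 0/1).

    Returns one (ink_density, vertical_centre) pair per frame, left to right.
    Both features are hypothetical examples of frame-level descriptors.
    """
    height = len(image)
    n_cols = len(image[0])
    frames = []
    for start in range(0, n_cols - width + 1, step):
        ink = 0
        weighted_rows = 0
        for r in range(height):
            row_ink = sum(image[r][start:start + width])
            ink += row_ink
            weighted_rows += r * row_ink
        density = ink / (height * width)
        # Centre of gravity of the ink, scaled to [0, 1]; 0.5 for an empty frame.
        if ink and height > 1:
            centre = (weighted_rows / ink) / (height - 1)
        else:
            centre = 0.5
        frames.append((density, centre))
    return frames


word = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
sequence = sliding_window_features(word, width=2, step=1)
```

The resulting sequence of frame vectors is the kind of input an HMM or a recurrent network would consume; real systems typically use richer descriptors and overlapping windows tuned to the script.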
Comparison and discussion
Each of these approaches has its advantages and disadvantages. The holistic approach is well suited to recognition with a limited vocabulary, even for degraded words. However, it generally suffers from a lack of sufficiently discriminating information about words, which increases the risk of confusion when the lexicon becomes large. To overcome the segmentation problems caused by the connected nature of Arabic script, an approach specific to Arabic word image recognition was proposed in [17]. The concept behind this approach comes from the presence of six letters of the Arabic alphabet that cannot be connected to their successors, which induces a natural segmentation of Arabic words into pseudo-words. The pseudo-global or pseudo-analytical approach is therefore based on building word models by combining pseudo-word models. The advantage of such an approach is that pseudo-words have an easy-to-isolate structure (natural separation into connected components, extraction of outlines, etc.) and can occupy different positions in the word without changing their form.
Building a lexicon that covers all the pseudo-words of the Arabic language is almost impossible, since pseudo-words have no linguistic meaning. Consequently, these systems are not feasible for the recognition of large or open vocabularies, and hence of free text. Systems based on the pseudo-global or pseudo-analytical approach remain useful only for the recognition of limited vocabularies, as is the case for the global approach. To recognize an unlimited number of words by modeling only the 28 letters of the Arabic alphabet in their different forms, the analytical approach seems the most appropriate. Its major advantage is the ability, in theory, to recognize any word, since the basic modeling unit is the character or sub-character (grapheme) and the number of characters is naturally finite. However, its biggest weakness lies in the segmentation process, which is not always trivial and requires a lot of computation time, and the form of the segments is inherently highly variable. Segmentation points can be located implicitly or explicitly. In implicit segmentation, every point along the trace of the word is a candidate segmentation point, and the final segmentation points are defined as recognition progresses; this is segmentation guided by recognition. In explicit segmentation, a segmentation algorithm explicitly detects segmentation points during a preliminary phase that precedes the recognition phase. It is important to note that analytical approaches are normally usable in a recognition process with a large reference lexicon. However, they seem to be very sensitive to noise: in most works, an unidentified or missing letter automatically causes a rejection.
The difficulties encountered are often greater for Arabic writing because of the diversity of the forms of Arabic characters, the short link that exists between successive characters, the lengthening of horizontal ligatures and the presence of vertical ligatures. Given these difficulties, a relatively large number of research works have addressed the segmentation of Arabic writing.
As a final solution to segmentation problems, the authors of [18, 19, 20] proposed open-vocabulary Arabic word image recognition systems based on the segmentation-free approach. The majority of works adopting this approach are based on hidden Markov models or recurrent neural networks. The proposed methods often include a feature extraction phase that uses the sliding window technique in the case of hand-crafted features (the word image is split into a sequence of frames, and a feature vector is extracted from each frame) and convolutional networks in the case of learned features. Systems based on these techniques have the advantage of performing learning and recognition without segmenting words into characters.
Recognition process
Nowadays, the accelerating progress and availability of low-cost computer hardware, as well as the growing difficulty of the problems tackled, have encouraged the use of computationally expensive techniques. Recent research therefore focuses on open-vocabulary recognition that deals with continuous sequences in order to recognize words and text lines extracted from handwritten documents [21]. Handwriting recognition consists of several steps, which are described in this section. One generic model of the handwritten text recognition process is presented in Fig. 3.
Fig. 3. Offline handwriting recognition process.
The scanned image may need to be enhanced by preprocessing steps before being input to the recognition system. Furthermore, Arabic handwritten text varies widely owing to writers' differing styles and cultures. Data preprocessing is thus a crucial component of automatic recognition and a vital step in producing images better suited to feature extraction. Preprocessing techniques are used either to correct faults in the scanned image and improve its quality, or to correct and normalize the handwriting style in order to decrease intra-class variability.
Feature extraction
In handwriting recognition, the purpose of the feature extraction step is to capture characteristics whose values are similar for characters belonging to the same class and distinct for characters in different classes. Feature extraction techniques differ from one model to another depending on the complexity of the script studied and on image quality. The selection of the feature extraction method therefore remains the most important step in the recognition process. These methods can be classified into two broad categories: handcrafted feature methods and learned feature methods.
a) Handcrafted features
Handcrafted features have long been used in various computer vision applications, including object detection and image classification. For handcrafted features, an algorithm is designed manually to extract them, incorporating prior information about the specificities of the data. Some handcrafted features are very simple, while some more recent feature sets are very complex and generally high-dimensional. Handcrafted features include structural, statistical and global transformation features [22]. Structural features compute the geometrical and topological characteristics of the text. Statistical features analyze the spatial distribution of pixels, such as the direction of the contour of the character. Global transformations represent the signal as a linear combination of a series of simpler, well-defined functions. Many global transformation techniques are used in text recognition, for example the Fourier Transform, the Discrete Cosine Transform, Wavelets and the Hough Transform. Handcrafted features have commonly been used with traditional machine learning approaches such as Support Vector Machines.
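As an illustration of a statistical handcrafted feature, the sketch below computes zoning densities: the character image is divided into a grid and the proportion of ink pixels in each zone becomes one feature. Zoning is used here only as a familiar example of describing the spatial distribution of pixels; the function name and grid size are assumptions, and the source does not state that the surveyed systems use this exact descriptor.

```python
def zoning_features(image, zones_y=2, zones_x=2):
    """Return the ink density of each zone of a binary image (rows of 0/1).

    Zones are listed row by row, from top-left to bottom-right.
    """
    height, width = len(image), len(image[0])
    features = []
    for zy in range(zones_y):
        top = zy * height // zones_y
        bottom = (zy + 1) * height // zones_y
        for zx in range(zones_x):
            left = zx * width // zones_x
            right = (zx + 1) * width // zones_x
            ink = sum(sum(row[left:right]) for row in image[top:bottom])
            area = (bottom - top) * (right - left)
            features.append(ink / area if area else 0.0)
    return features


character = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
vector = zoning_features(character)
```

A feature vector of this kind could then be passed to a traditional classifier such as an SVM or KNN.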
b) Learned features
Learned features are learned automatically from the image pixels using machine learning, which has become the most widely used solution since the advent of deep learning [13]. These features are generic and independent of any specific classification task. The idea behind this approach is to discover multiple levels of representation so that higher-level features can represent the semantics of the data, which in turn can provide greater robustness to intra-class variability. Learned features try to overcome the issues of handcrafted features. Since they can be trained on any data set, they generalize well to change. Moreover, they generally cope well with unknown examples, owing to the higher generalization capabilities of neural networks. Finally, they do not require expert knowledge of the data and should not require much human labor. Deep neural networks learn high-level features in their hidden layers; indeed, one of their greatest advantages is that they reduce the need for feature engineering [23].
Segmentation
The segmentation stage decomposes the image of a text into entities (words, graphemes or characters) in order to reduce the complexity of the subsequent processing modules. In this phase, the different logical parts of an image are extracted. Separation is based on measurements of white space (between lines and between characters). Starting from an acquired image, text blocks are first separated from graphics blocks; lines are then extracted from each text block, and words and characters (or parts of characters) are extracted from these lines. The multiplicity of fonts and changes in justification make it difficult to fix stable separation thresholds, which leads either to the detection of non-existent white separators or to real word separators being missed.
Segmentation is a critical phase of the single word recognition process [1]. Indeed, the separation of lines, words, pseudo-words, characters and graphemes is difficult and expensive. Moreover, scripts are varied, lines are sometimes tangled, and characters are generally connected to each other (as in Arabic, where writing is semi-cursive).
According to the literature, the most difficult problem is the segmentation of cursive writing, for which the handwriting recognition community has acknowledged Sayre's paradox [24]: “A letter cannot be segmented before having been recognized and cannot be recognized before having been segmented”. To solve this problem, several segmentation algorithms have been proposed [25, 26]. The proposed solutions are based on two different segmentation strategies.
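White-space-based line extraction can be sketched with a horizontal projection profile: rows containing no ink are treated as separators between text lines. This is a minimal sketch assuming a clean, binarized and deskewed image stored as rows of 0/1 values; the function name and the `min_gap` threshold are hypothetical, and the threshold is precisely the kind of parameter that the text describes as hard to stabilize.

```python
def segment_lines(image, min_gap=1):
    """Split a binary image (rows of 0/1) into text lines.

    Returns (top, bottom) row ranges with `bottom` exclusive. A line ends
    once at least `min_gap` consecutive rows without ink are seen.
    """
    profile = [sum(row) for row in image]  # ink count per row
    lines = []
    start = None
    gap = 0
    for y, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = y
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                lines.append((start, y - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # The image ended before the gap threshold was reached.
        lines.append((start, len(profile) - gap))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
line_ranges = segment_lines(page)
```

The same idea applied to a vertical projection within one line gives candidate word separators, although tangled lines and connected characters defeat such a simple rule.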
a) Explicit segmentation
Explicit segmentation (INSEG: input segmentation) performs an a priori segmentation and constructs a graph of all possible cuts of the word. It relies on feature points in the word [27]. This type of segmentation is based directly on a morphological analysis of the text or word, or on the detection of characteristic points such as intersections, inflection points and loops inside the text or word, in order to locate prospective segmentation points [28]. Several approaches propose a direct segmentation of the text or word into primitive graphemes, followed by a step that combines these characters or graphemes [28, 29]. The advantage of this segmentation is that the information is explicitly located. Its major flaw is that the segment boundaries are chosen independently of the criteria of the models.
b) Implicit segmentation
Implicit segmentation (OUTSEG: output segmentation) does not segment the input of the classifier beforehand; instead, classes of letters or graphemes compete at the output of the classifier. Contrary to explicit segmentation, there is no pre-segmentation of the word. The segmentation is carried out during recognition and is guided by it. The system searches the image for the components or groups of graphemes that correspond to its classes of letters. It relies on a recognition engine to validate and classify the segmentation hypotheses (a search over paths of possible segmentation points). In this case, segmentation and recognition are performed jointly, hence the name sometimes used, “integrated segmentation-recognition”. This segmentation can be based on a sliding window [30, 31]. Its advantage is that the information is located by the letter models and validated by those same models. There is no segmentation error, and Sayre's dilemma is ultimately bypassed because the letters are known.
Training
The training phase consists in finding the models best suited to the inputs of the problem. The result is a training database that constitutes the reference base of the system. The training step characterizes the shape classes in order to distinguish homogeneous families of shapes. This is a key step in the recognition system. In the literature, several studies have examined the issue of automatic training. There are three types of training: supervised training, unsupervised training and reinforcement training.
a) Supervised training
In supervised training, a representative sample of all the shapes to be recognized is provided to the training module. Each shape is labeled by an operator called the teacher; the label tells the training module the class in which the teacher wants the shape to be placed. The parameters describing this partition are stored in a training table, to which the decision module then refers in order to classify the shapes presented to it [32].
b) Unsupervised training
Contrary to supervised training, in unsupervised training there is no labeled data. The classes are built automatically, without intervention by a labeling operator, from reference samples and grouping rules. This mode requires a large number of samples and precise, non-contradictory construction rules, and it does not always provide a classification that corresponds to the user's reality. In handwriting recognition, methods based on supervised training are the most frequently used, particularly for isolated handwritten characters, because the classes are known and limited in number [2].
c) Reinforcement training
Reinforcement training covers all the methods that allow an agent to choose autonomously which action to take. Immersed in a given environment, the agent learns by receiving rewards or penalties based on its actions. Through experience, it seeks the optimal decision-making strategy, the one that maximizes the rewards accumulated over time.
Classification
In the complete process of a pattern recognition system, classification plays an important role: it decides to which class a form belongs. At this step, the description of the character to be recognized, taken from the test database, is compared with the descriptions of the characters in the reference base. The main idea of classification is to assign an unknown example (a form) to a predefined class on the basis of the parameters describing the form [32]. The choice of classifier is very important, as the classifier is the decisive element of a pattern recognition system.
We present in the next section a description of the classifiers commonly used in handwriting recognition, showing their strengths and weaknesses. We review statistical, structural and connectionist methods, support vector machines and Hidden Markov Models. The two predominant methods are neural networks and Markov modeling; from both theoretical and industrial points of view, they have amply demonstrated their effectiveness. There are also various pattern recognition methods based on fuzzy logic, KNN (K-Nearest Neighbors), SVM (Support Vector Machines), etc.
Techniques used in offline handwriting recognition systems
All pattern recognition methods have been applied to script recognition. Today, systems for constrained and restricted vocabularies give strong results, with recognition rates approaching 100% for printed or single-writer documents of good quality. Since the same performance has not yet been reached for unconstrained handwritten documents written by any writer, a lot of research is being done on this kind of writing. However, recognition systems for Arabic handwriting have achieved weaker results than systems for Latin script [7, 2]. Researchers try to find flexible and intelligent recognition techniques. The techniques in use can be classified into four main classes: statistical methods, connectionist methods, structural methods and stochastic methods. There are also hybrid methods that use two or more of these techniques together.
Statistical methods
This approach is based mainly on mathematical foundations (probability and statistics). Its aim is to describe forms with a probabilistic model that is simple to use and to group the forms into classes. Two types of methods are generally distinguished: (1) non-parametric methods, which seek to define the class boundaries in the representation space so that an unknown point can be assigned by a series of simple tests; (2) parametric (Bayesian) methods, in which a model of the distribution of each class is given (typically Gaussian) and the class to which the point has the greatest probability of belonging is sought.
Parametric and non-parametric statistical classifiers were used by Abandah et al. in [33] for optical character recognition in Arabic handwritten scripts. They used Quadratic Discriminant Analysis (QDA) and extracted features from the raw main body, the main body’s skeleton, and the main body’s boundary.
Bayesian decision theory is the central theory of statistical methods; it selects the hypothesis with the highest probability, that is, the class to which the form most probably belongs. Bayes classification involves one probability distribution per class: if the probability distribution Pi(x) is known for each class i, together with the relative frequency Qi of each class, then the discriminator between classes can be built directly. This gives rise to the Bayes classifier, which minimizes the global classification error rate. Another statistical method is KNN (k nearest neighbors). The k nearest neighbors algorithm classifies an unknown pattern by comparing it to stored forms, called prototypes, in a reference set. It returns the K forms closest to the form to be recognized according to a similarity criterion. A decision strategy then assigns confidence values to each competing class and attributes the most likely class (in the sense of the selected metric) to the unknown form. This method has the advantage of being easy to implement and provides good results. Its main disadvantage is its low classification speed, due to the large number of distances to be computed.
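The KNN decision rule can be sketched in a few lines. The example assumes that each prototype is a numeric feature vector paired with a class label, that similarity is measured with the Euclidean distance, and that the decision strategy is a simple majority vote; the function name, the toy prototypes and the labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(sample, prototypes, k=3):
    """Assign `sample` to the majority class among its k nearest prototypes.

    `prototypes` is a list of (feature_vector, label) pairs. Every distance
    is computed, which is why classification is slow on large reference sets.
    """
    nearest = sorted(prototypes, key=lambda p: math.dist(sample, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prototypes = [
    ((0.0, 0.0), "alif"),
    ((0.0, 1.0), "alif"),
    ((5.0, 5.0), "ba"),
    ((6.0, 5.0), "ba"),
    ((5.0, 6.0), "ba"),
]
label = knn_classify((0.2, 0.3), prototypes, k=3)
```

With k = 1 the rule reduces to assigning the class of the single nearest prototype.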
In a statistical model, knowledge is represented by a mathematical model whose parameters must be estimated. Such models are a limitation, since they always remain an approximation of the shape of the classes.
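To make the parametric case concrete, the following sketch estimates a Gaussian model per class and applies the Bayes decision rule. It assumes, for simplicity, that features are independent within a class (a diagonal covariance, i.e. a naive Bayes model), which is a stronger assumption than the general Bayes classifier described in the text; all names and the toy data are illustrative.

```python
import math
from collections import defaultdict


def fit_gaussian_bayes(samples):
    """Estimate, per class, the prior, feature means and feature variances.

    `samples` is a list of (feature_vector, label) pairs.
    """
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps variances strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model


def bayes_classify(model, features):
    """Return the class with the highest log prior plus Gaussian log-likelihood."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, var in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


training = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
    ((4.0, 4.2), "b"), ((3.8, 4.0), "b"),
]
model = fit_gaussian_bayes(training)
decision = bayes_classify(model, (1.1, 0.9))
```

Here the class prior plays the role of the relative class frequency and the Gaussian density plays the role of the per-class distribution; the quality of the decision depends on how well the Gaussian approximates the true shape of each class.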
Structural methods
In general, structural methods describe complex shapes in terms of basic shapes, called features, which are extracted directly from the data presented at the input of the system. The main difference between these methods and statistical methods is that the features are elementary shapes rather than measurements. Another difference is that they introduce the notion of order into the description of a shape. The most common methods compute the edit distance between two strings using dynamic programming.
Structural methods seek to find simple elements, or primitives, and to describe the relations between them. Primitives are topological elements such as a loop, an arc, etc. A relation can be the relative position of one primitive with respect to another. These methods are divided according to the structure used.
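The edit distance computed by dynamic programming can be sketched as follows. The version shown is the Levenshtein distance with unit costs for insertion, deletion and substitution, applied to any two sequences (strings or lists of primitive names); the unit costs and the primitive names in the usage example are assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, row by row.

    `previous[j]` holds the distance between the first i-1 items of `a`
    and the first j items of `b`.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


unknown = ["loop", "stroke"]
prototype = ["loop", "dot", "stroke"]
distance = edit_distance(unknown, prototype)
```

An unknown shape described as an ordered string of primitives could then be assigned to the prototype string with the smallest distance.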
Stochastic methods
Unlike the methods described above, the stochastic approach uses a model for recognition that takes into account the high variability of the form. The distance commonly used in “dynamic comparison” techniques is replaced by probabilities computed more finely through learning. The form is considered as a continuous signal observable over time at different locations, which constitute the observation states. The model describes these states using state transition probabilities and per-state observation probabilities. The comparison consists in searching the state graph for the highest-probability path corresponding to the sequence of elements observed in the input string. These methods are robust and reliable owing to the existence of efficient training algorithms. Although training is slow, recognition is by contrast very fast, because the models usually include few states and the computation is relatively immediate. The most widespread methods in this approach are those using Hidden Markov Models (HMMs).
Markov models are widely used in pattern recognition because of their capacity to integrate context and absorb noise. In these models, shapes are described by a sequence of primitives observed in the states of the model. The probability that the model emits the shape is calculated by maximizing, over all state paths, the observation probability of the segments weighted by the transition probabilities between states. In some handwriting recognition approaches of this kind, word images are converted into sequences of image segments by a segmentation procedure. These segments are then passed to a module responsible for estimating the probability that each segment appears when the Markov chain is in a given state.
In addition, semi-cursive Arabic script, in both printed and handwritten form, lends itself naturally to stochastic modeling at all levels of recognition. These models can account for the noise and inherent variability of handwriting and avoid the problem of explicit segmentation of words. As a result, they are widely used in automatic handwriting recognition [26, 34].
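The maximization over all state paths is usually carried out with the Viterbi algorithm, sketched below for a discrete HMM. The two-state left-to-right model, the observation symbols and all probabilities in the usage example are invented for illustration and do not come from any cited system; log probabilities are used to avoid numerical underflow.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log_probability, path) of the most probable state sequence."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = (log probability, path) of the best path ending in state s.
    best = {s: (log(start_p[s]) + log(emit_p[s][observations[0]]), [s])
            for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            score, path = max(
                (best[prev][0] + log(trans_p[prev][s]), best[prev][1])
                for prev in states
            )
            step[s] = (score + log(emit_p[s][obs]), path + [s])
        best = step
    return max(best.values())


states = ("s1", "s2")
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.0, "s2": 1.0}}
emit_p = {"s1": {"up": 0.9, "down": 0.1}, "s2": {"up": 0.2, "down": 0.8}}
log_prob, path = viterbi(["up", "up", "down"], states, start_p, trans_p, emit_p)
```

In a recognizer, one such score would be computed per character or word model and the model with the highest score retained.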
Connectionist methods
The connectionist model overcomes the problem of statistical methods in representing recognition in the form of a network of elementary units connected by weighted arcs. It is in these connections reside the recognition, and it can take more varied forms than mathematical model. Nodes of this graph are simple automata called formal neurons. The neurons are endowed with an internal state, the activation, by which they influence the other neural of the network. This activity is propagated in the graph along weighted arcs called synaptic links.
In OCR, the primitives extracted on image of character (or the selected entity) constitute the network inputs. The activated output of the network corresponds to the recognized character. The choice of network architecture is compromise between computational complexity and recognition rate. Moreover, the strength of neural networks resides in their ability to generate decision region of any shape required by classification algorithm, at the price of the integration of layers of additional cells in the network.
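The propagation of activity from the extracted primitives to the recognized character can be sketched with formal neurons arranged in layers. The sketch assumes a sigmoid activation and hand-set weights; the function names, the weights and the two class labels are hypothetical, and no training is shown.

```python
import math


def neuron_activation(inputs, weights, bias=0.0):
    """Formal neuron: weighted sum of the inputs passed through a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def forward(inputs, layers):
    """Propagate activations through `layers`.

    Each layer is a list of (weights, bias) pairs, one pair per neuron.
    """
    activations = inputs
    for layer in layers:
        activations = [neuron_activation(activations, w, b) for w, b in layer]
    return activations


def recognize(features, layers, labels):
    """Return the label whose output neuron is the most activated."""
    outputs = forward(features, layers)
    return labels[outputs.index(max(outputs))]


# One output layer with two neurons; the weights are arbitrary example values.
layers = [[([4.0, -4.0], 0.0), ([-4.0, 4.0], 0.0)]]
result = recognize([1.0, 0.0], layers, ["alif", "ba"])
```

Adding hidden layers to `layers` is what allows more complex decision regions, at the cost of more computation.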
The artificial neural networks (ANN) are means strongly connected of distribut-ed elementary processors functioning in parallel and used for the training and classification. Currently, the most used types in systems of handwriting recognition are multilayer perceptions (MLP) [28, 35, 36, 37], direct propagation and associative memories or Kohonenmaps (SOM) ‘Self Organizing Map’ [38] which are recurring ANN. They can automatically detect prototypes of characters in a training set of examples.
Convolutional neural networks have also been used for multi-script recognition by extracting and learning features from raw input [39]. This eliminates the need to manually define discriminative features for particular scripts. These networks have been successful in various vision problems such as digit and character recognition. The problem with neural network approaches is that the objective function is non-convex, so their learning algorithms may get stuck in local minima during gradient descent. To overcome these limitations, many studies have proposed replacing these approaches with deep networks [40, 41]. The new concept behind them is the use of hidden variables as observed variables to train each layer of the deep structure independently and greedily.
Support vector machine (SVM) method
SVMs are two-class classifiers with a high generalization ability. They form a family of learning algorithms that discriminate between shapes.
The main idea is that two classes can be linearly separated in a high-dimensional space (Fig. 4). When the points are separable, there are often infinitely many separating hyperplanes. SVMs are discriminative models that attempt to minimize the learning error while maximizing the margin between classes, that is to say the region free of examples around the decision boundary. To do this, the learning algorithm carefully selects a number of “support vectors” among the examples of the learning base, which define the optimal decision boundary. Compared with other techniques such as HMM and ANN, few Arabic handwriting recognition systems are based on SVM.
SVM approaches; (a) one-versus-all method, (b) one-versus-one method.
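The two multi-class schemes named in the caption can be illustrated with already trained linear decision functions f(x) = w·x + b, whose margin has width 2/||w||. The sketch below is a simplified illustration under that assumption: the three classes "A", "B", "C", the hyperplane parameters and the test point are hypothetical, and no training is performed.

```python
import math

def decision(w, b, x):
    """Signed score of a linear decision function f(x) = w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the margin around the hyperplane, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def one_versus_all(models, x):
    """models: {label: (w, b)}; the class with the largest score wins."""
    return max(models, key=lambda c: decision(*models[c], x))

def one_versus_one(pair_models, x):
    """pair_models: {(pos, neg): (w, b)}; each pairwise classifier casts one vote."""
    votes = {}
    for (pos, neg), (w, b) in pair_models.items():
        winner = pos if decision(w, b, x) > 0 else neg
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical hyperplanes for three classes in a 2-D feature space.
ova_models = {
    "A": ((2.0, 0.0), -2.0),
    "B": ((0.0, 2.0), -2.0),
    "C": ((-2.0, -2.0), -4.0),
}
ovo_models = {
    ("A", "B"): ((2.0, -2.0), 0.0),
    ("A", "C"): ((4.0, 2.0), 2.0),
    ("B", "C"): ((2.0, 4.0), 2.0),
}
x = (1.5, 0.5)
print(one_versus_all(ova_models, x), one_versus_one(ovo_models, x))  # A A

# Number of binary classifiers needed, e.g. for a hypothetical 28-class problem.
n_classes = 28
n_ova = n_classes                          # one classifier per class
n_ovo = n_classes * (n_classes - 1) // 2   # one classifier per pair of classes
```

The one-versus-one scheme needs many more binary classifiers (378 against 28 in this example), but each of them is trained on the examples of only two classes.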
In order to improve the recognition rate of handwritten characters, several techniques have been proposed. In this section, we present the state of the art of the different systems proposed in the literature for handwriting recognition.
Abandah et al. presented in [33] an optical character recognition solution for Arabic handwritten scripts based on Quadratic Discriminant Analysis. In this work, they implemented feature extraction routines for 95 features. They start by detecting the two parts of the Arabic letters and extracting features from these parts. Then they remove the secondary parts and extract additional features from the raw main body, the main body’s skeleton, and the main body’s boundary. These routines were applied to the 104 collections of letter forms, and the resulting vectors of 95 features were used to measure the recognition accuracy. The principal component analysis (PCA) method is used to select the best subset of features from the large number of extracted features. Parametric and non-parametric statistical classifiers were used. They found that a subset of 25 features is needed to reach 84% recognition accuracy with a linear discriminant classifier. They also found that using more features does not significantly enhance accuracy.
In [34], an Arabic handwriting recognition system based on hidden Markov models was introduced. It relies on a new technique for dividing the image into non-uniform horizontal segments to extract the features, and on a new technique for handling the skew of characters by fusing multiple HMMs. The proposed system builds character HMM models and learns word HMM models using embedded training. Besides the vertical sliding window, two slanted sliding windows are used to extract the features. Three different HMMs are used: one for the vertical sliding window and two for the slanted windows. A fusion scheme combines the three HMMs, so that three individual classifiers are combined at the decision level. Each classifier observes the image from a given orientation, and combining those classifiers significantly increases the recognition rate.
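Decision-level fusion of several classifiers can take many forms, and the exact scheme used in [34] is not detailed here. As a hedged illustration only, the sketch below assumes a simple weighted sum of the log-likelihoods that each of three classifiers assigns to every candidate word; the function name, the candidate words and the scores are hypothetical.

```python
def fuse_decisions(classifier_scores, weights=None):
    """Combine per-classifier log-likelihoods with a weighted sum (assumed rule)."""
    if weights is None:
        weights = [1.0] * len(classifier_scores)
    # Only candidates scored by every classifier can be fused.
    candidates = set.intersection(*(set(scores) for scores in classifier_scores))
    fused = {
        word: sum(wt * scores[word] for wt, scores in zip(weights, classifier_scores))
        for word in candidates
    }
    best = max(fused, key=fused.get)
    return best, fused

# Hypothetical log-likelihoods from one vertical and two slanted window classifiers.
vertical = {"kitab": -12.0, "katib": -11.5}
slanted_left = {"kitab": -10.0, "katib": -13.0}
slanted_right = {"kitab": -11.0, "katib": -12.0}

best_word, fused_scores = fuse_decisions([vertical, slanted_left, slanted_right])
print(best_word)  # kitab
```

In this toy case the vertical classifier alone would choose "katib", while the fused decision is "kitab", which shows how classifiers observing the image from different orientations can correct each other.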
In [42], the authors proposed a discriminative learning approach for multi-script recognition at the connected component level using a convolutional neural network. The convolutional neural network combines the feature extraction and script recognition processes in one step, and the discriminative features for script recognition are extracted and learned as convolutional kernels from raw input. This eliminates the need to manually define discriminative features for particular scripts. Results show above 95% script recognition accuracy at the connected component level on datasets of Greek-Latin and Arabic-Latin multi-script documents and Antiqua-Fraktur documents. The proposed method can be easily adapted to different scripts.
In [43], the authors presented a combined system for text localization and transcription in page images. It includes flexible learning-based methods for layout analysis and handwriting recognition, which were developed in the context of the Swiss research project HisDoc. A comprehensive experimental evaluation is provided for the medieval Parzival database, demonstrating a promising word recognition accuracy of 93% with a closed vocabulary. In order to harmonize the evaluation of the two document analysis tasks, the authors introduce a novel evaluation measure for text line extraction that takes substitution, deletion, and insertion errors into account.
In [44], the authors proposed a state-of-the-art system for recognizing real-world handwritten images that exhibit a high degree of noise and a high out-of-vocabulary rate. They describe methods for image denoising, line removal, deskewing, deslanting, and text line segmentation. They demonstrate how to use an HMM-based recognition system to obtain competitive results, and how to further improve it using LSTM neural networks in the tandem approach. The final system outperforms other approaches on a new dataset for English and French handwriting. The presented framework scales well across other standard datasets.
Siddhu et al. [45] proposed a recognition method based on the combination of statistical and structural approaches for feature extraction. The main body of a character is modeled by the statistical method using modified direction features and Support Vector Machines. The structural method uses dot descriptors to recognize the character. Amrouch et al. [46] considered two feature extraction strategies combined with an HMM as recognizer, namely CNN-features-HMM and Handcrafted-features-HMM. The first strategy operates directly on the images and extracts relevant characteristics. Unlike the second strategy, it does not require much emphasis on the feature extraction and pre-processing stages. Khémiri et al. in [47] described the main highlights of the Dynamic Bayesian network (DBN) architecture. Features are extracted based on the word baseline, which is estimated mainly to cope with the problems of inclination and distortion. They applied a deep learning architecture: a CNN that convolves learned features with the input data and uses 2D convolutional layers, which makes it well suited to processing 2D word images. Jayech et al. in [48] proposed a system based on a synchronous multi-stream HMM (MSHMM), which has the advantage of efficiently modeling the interaction between multiple features. These features combine statistical and structural ones, which are extracted over the columns and rows using a sliding window approach. Two word models are implemented, based on the holistic and analytical approaches, without any explicit segmentation. Rabi et al. presented in [49] a system for offline recognition of cursive Arabic handwritten text based on Hidden Markov Models (HMMs). The system is analytical without explicit segmentation and uses embedded training to build and enhance the character models.
The features, extracted after baseline estimation, are statistical and geometric, so that they integrate both the peculiarities of the text and the pixel distribution characteristics of the word image. These features are modeled using hidden Markov models and trained by embedded training.
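Several of the systems above extract statistical features over the columns of the word image with a sliding window. The sketch below illustrates the general idea on a tiny binary image, assuming two simple features per window (ink density and the vertical centre of gravity of the ink); the function name, the chosen features and the image are hypothetical and do not reproduce the feature sets of the cited works.

```python
def sliding_window_features(image, width, step):
    """Return one (ink density, vertical centre of gravity) pair per window.

    image: list of rows of 0/1 pixels, where 1 marks ink.
    """
    n_rows, n_cols = len(image), len(image[0])
    features = []
    for left in range(0, n_cols - width + 1, step):
        # Row index of every ink pixel that falls inside the current window.
        ink_rows = [r for r in range(n_rows)
                    for c in range(left, left + width) if image[r][c]]
        density = len(ink_rows) / (n_rows * width)
        # An empty window falls back to the middle row (arbitrary convention).
        centre = sum(ink_rows) / len(ink_rows) if ink_rows else (n_rows - 1) / 2
        features.append((density, centre))
    return features

# Hypothetical 3 x 6 binary word image.
image = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
]
window_features = sliding_window_features(image, width=2, step=2)
```

The resulting sequence of feature vectors, one per window position, is what a sequence model such as an HMM would then consume as its observation sequence.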
Zayene et al. [14] presented an approach for Arabic video text recognition based on recurrent neural networks. The proposed system is a segmentation-free method that relies specifically on a multi-dimensional long short-term memory (MDLSTM) network coupled with a connectionist temporal classification (CTC) layer. The proposed system avoids the text line segmentation and feature extraction steps. The suggested method has been trained and evaluated using the AcTiV-R database. Salam et al. in [50] proposed a system for offline recognition of isolated Arabic handwritten characters. Although only half of the dataset was used to train the Support Vector Machine (SVM) and the other half was used for testing, the system achieved high performance with less training data. Mezghani et al. presented in [2] a system for offline recognition of cursive Arabic handwritten text based on Hidden Markov Models (HMMs). The authors studied the shape modeling of different handwritten Arabic characters using HMMs to make their training and recognition more efficient. The number of HMMs is reduced substantially while still capturing the variations between the character shape models. Mohamed et al. [51] presented a system for handwritten text recognition. It is based on the edge histogram descriptor (EHD) and the histogram of oriented gradients (HOG) for feature extraction, and on a support vector machine (SVM) as a classifier. HOG and EHD give an optimal feature of the Arabic handwritten text by extracting the directional properties of the text. The experimental evaluation is carried out on Arabic handwritten images from the IESK-ArDB database. Table 2 displays a summary of the listed systems.
AL-Saffar et al. [52] reviewed the Deep Learning algorithms proposed for Arabic handwriting recognition. They state that the first successful DL-based systems proposed for Arabic characters depend on Convolutional Neural Networks (CNN) and Deep Belief Networks. At first, Arabic handwriting recognition models focused on combining a deep neural network with a traditional classifier. Recent research works investigate combining two deep learning methods to improve recognition results. Shi et al. [53] were the first to propose the combination of a deep CNN and an RNN with a CTC decoder for image-based sequence recognition. Afterwards, many approaches for handwritten text recognition were inspired by this deep architecture. Elleuch et al. [40] applied Convolutional Deep Belief Networks (CDBN) to textual image data containing Arabic handwritten script (AHS) and evaluated them on the IFN/ENIT and HACDB databases, which are characterized by the low/high-dimension property. In addition to the benefits provided by deep networks, the system is protected against over-fitting.
For text line recognition, Ahmad et al. [54] proposed an MDLSTM-based Arabic character recognition system. Connectionist Temporal Classification (CTC) is used as a final layer to align the predicted labels according to the most probable path. The KHATT dataset was used for the experiments. Rawls et al. [16] published a CNN-LSTM model where the CNN is used for feature extraction and bidirectional LSTMs for sequence modeling. In this work, the authors compared the different types of features provided to the model. The CNN model is shown to outperform both existing handcrafted features and a simpler neural model consisting entirely of fully connected layers. The results are presented on both English and Arabic handwritten data, as well as on English machine-printed data.
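The way a CTC output layer turns frame-level predictions into a label sequence along the most probable path can be illustrated with greedy (best-path) decoding: the most likely symbol is taken at each frame, consecutive repetitions are merged and blanks are removed. The sketch below is a minimal illustration, assuming that index 0 of the output is the blank symbol; the three-symbol alphabet and the frame probabilities are hypothetical.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for k in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if k != previous and k != blank:
            decoded.append(alphabet[k])
        previous = k
    return "".join(decoded)

# Hypothetical network outputs: one probability row per frame over (blank, a, b).
alphabet = ["-", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged)
    [0.90, 0.05, 0.05],  # blank separates the two occurrences of "a"
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b (repeat, merged)
    [0.80, 0.10, 0.10],  # blank
]
text = ctc_greedy_decode(frames, alphabet)
print(text)  # aab
```

Best-path decoding is only an approximation of the most probable label sequence; beam search, possibly with a language model, is a common alternative.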
Jemni et al. [55] proposed an Arabic handwriting recognition system based on the combination of multiple BLSTM-CTC systems. The authors presented a comparative study of different combination levels of BLSTM-CTC recognition systems trained on different feature sets. The experiments were conducted on the Arabic KHATT dataset. Noubigh et al. [39] proposed a hybrid CNN-BLSTM model for Arabic handwritten text line recognition using the KHATT database. The CNN is used for feature extraction. Then, a bidirectional long short-term memory (BLSTM) network followed by a connectionist temporal classification (CTC) layer is used for sequence labeling. They also proposed a new approach based on convolutional recurrent neural networks (CRNNs), which adds a feature reuse network component to a CRNN. The model is trained and tested on the KHATT and AHTID/MW Arabic text recognition datasets. Table 4 displays a summary of the literature reviewed on Arabic handwriting recognition.
Shi et al. [56] adopted a convolutional recurrent neural network as an encoder and an attentional sequence-to-sequence model as a decoder to predict a character sequence directly. Luo et al. [57] combined a multi-object rectification network with an attention-based sequence recognition network to address scene text recognition, especially for irregular texts. In short, the attention mechanism, which takes full account of the relevance between words, has become an important component of text line recognition networks.
Transfer learning has also become a popular and promising area in deep learning [58]. Transfer learning with a pre-trained model consists of two main parts: the construction of the pre-trained model and the fine-tuning phase. Some studies developed different cross-domain transfer learning approaches that employ rich labels from the text domain in order to mitigate the problem of insufficient image training data.
Summary of state of the art Arabic handwriting recognition systems
In this section, we summarize the state of the art presented in the previous section by identifying the main features of the techniques used. Table 1 presents a comparison between existing systems. From this study, we conclude, firstly, that the major problems in this field come down to the cursiveness of the writing and the sensitivity of certain topological characteristics of Arabic, to the variability of inter-character connections or of the horizontal and vertical ligatures, and to the presence of overlaps. Secondly, Arabic script and cursive Latin script have many points in common, so it is possible to transfer to Arabic systems the techniques already proven for Latin. Preprocessing specific to Arabic writing is required (diacritics detection, baseline detection), but the segmentation into graphemes, the feature extraction and the recognition step can be the same as those used for recognition of the Latin script. The major advantage of the analytical approach is its ability, in theory, to recognize any word, since the basic unit of modeling is the character or sub-character (grapheme) and the number of characters is naturally finite. However, its biggest weakness lies in the segmentation process, which is not always trivial and requires a lot of computation time, and in the great variability inherent in the form of the segments. The global approach is perfectly suitable for recognition with a limited vocabulary, even for degraded words. It generally suffers from a lack of sufficiently discriminating information about words, which can increase the risk of confusion when the lexicon becomes large.
Some of the current approaches propose to take advantage of both methods, reducing the complexity of the holistic approach by applying it to smaller entities (letters). The analytical approach seeks the sequence of letters contained in the image to be recognized. Some models combine these two levels into one and can thus avoid the prior segmentation of the image. Syntactic methods are based on the search for the assembly laws by which basic elements form a constructed set that represents the shape. Each character is represented by a sentence in a language whose vocabulary is made up of primitives. The characters of the same family are represented by a grammar. The recognition of an unknown shape is a syntactic analysis of the sentence that describes it. It consists in determining whether the sentence describing the character can be generated by the grammar. The weakness of this method is the lack of efficient algorithms for direct grammatical inference, that is, for building a grammar from a finite set of suitably chosen sentences.
We have also noticed that in the case of a large or open vocabulary, for example in the context of text recognition, the systems are often based on an analytical approach. In this context, several difficulties have been mentioned, the most important of which is the segmentation of the text into words. This problem is more delicate for Arabic writing for several reasons: the existence of inter-word and inter-pseudo-word spaces, the cursiveness of Arabic writing and the presence of overlaps between characters. Another major difficulty concerns the implementation of post-processing, due to the lack of vocabulary dictionaries and linguistic tools that can easily be integrated into recognition systems. All these difficulties have prompted several researchers to use a segmentation-free recognition approach.
In addition, this study of the techniques used in off-line handwriting recognition shows that, in recent years, research has mainly been directed towards techniques that have proven their performance in this field, such as HMM, SVM and neural networks. The Markov approaches are well adapted to modeling sequential data. HMMs have several advantages in handwriting recognition. Indeed, they take into account the variability of shapes and the noise that disturbs writing, especially handwriting. They also make it possible to handle sequences of variable length. This quality is especially critical in handwriting recognition, where the length of letters and words can vary greatly depending on the writing styles and habits of writers. The proposed methods are shifting towards hybrid architectures, which basically use HMMs supported by structural and statistical techniques.
A major problem in handwriting recognition is the huge variability and distortion of patterns. Elastic models based on local observations and dynamic programming, such as HMMs, are not efficient at absorbing this variability. SVMs are discriminative models that attempt to minimize the learning error while maximizing the margin between classes. The main idea is that two classes can be linearly separated in a high-dimensional space. When the points are separable, there are often infinitely many separating hyperplanes. Compared with other techniques such as HMM and ANN, few Arabic handwriting recognition systems are based on SVM. Most classifiers face a major problem, namely the variability of the size of the feature vector. In the literature, three approaches are commonly used to manage the problems of dimensionality: Genetic Algorithms, Dynamic Programming and Graph Matching.
Recently, deep learning architectures [59] such as the Convolutional Neural Network (CNN), the Deep Belief Network (DBN) and the Convolutional DBN have been used for unsupervised feature learning. The Convolutional Neural Network, developed by LeCun et al. [60], is a specialized type of neural network that learns good features at each layer of the visual hierarchy via backpropagation. Ranzato et al. [61] achieved improvements in performance when they applied unsupervised pre-training to a CNN. These networks have been successful in various vision problems such as digit and character recognition [62].
The Deep Belief Network is a multi-layer generative model [63] that learns higher-level feature representations from unlabeled data using unsupervised learning algorithms, such as Restricted Boltzmann Machines (RBMs), auto-encoders and sparse coding. These algorithms have only succeeded in learning low-level features such as “edge” or “stroke” detectors with simple invariances. The CDBN, which is composed of convolutional restricted Boltzmann machines (CRBMs), has been applied in several fields such as visual recognition tasks [64], automatic speech recognition (ASR) and EEG signal classification. Lee et al. in [64] demonstrated that the CDBN had good performance. The principal idea is to scale up the algorithm to deal with high-dimensional data.
Conclusion
In this work, we presented the field of automatic handwriting recognition. The distinctions between deep learning and holistic/analytical approaches were presented in the first stage of this paper. In the second stage, the main steps followed to build a recognition system were described. We also mentioned some work done in the field of Arabic handwriting recognition and compared the results.
It is very difficult to judge the success of recognition methods, especially in terms of recognition rates, because they rely on different databases, constraints and sample spaces. Several improvements can be made to Arabic script recognition systems. These improvements can be based mainly on a combination of classifiers and on preprocessing specific to the Arabic script.
We can conclude that the extraction of relevant features for Arabic handwriting is still an interesting problem. Researchers apply several techniques to tackle the complex problems of handwritten Arabic script. Some works try to combine a CNN with SVM and HMM instead of handcrafted features. The results are improved but still insufficient. Despite these efforts, most of the reported results are neither significant nor optimal, since the systems were trained and tested on little data or on private datasets.
Research in handwritten Arabic character recognition is still at an early stage compared with Latin and other scripts. Arabic handwriting needs more focus in the field of handwriting recognition. Many recent applications aim to solve the problem of text line recognition in an open or varying large vocabulary using Deep Learning, which is an emerging approach within the machine learning research community. It is a hot topic in both academia and industry. Therefore, this research area is still open for further enhancement, and extensive research needs to be conducted.
