Densely connected layer to improve VGGnet-based CRNN for Arabic handwriting text line recognition

Abstract

In recent years, deep neural networks (DNNs) have achieved great success in sequence modeling, and several deep models have been used to enhance Handwriting Text Recognition (HTR). Among these models, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks, achieve state-of-the-art recognition accuracy. Recognition methods for Arabic text lines have been widely applied in many specific tasks. However, some challenges remain, such as the lack of large, publicly available Arabic text recognition datasets and the characteristics of the Arabic script. To address these challenges, we propose an end-to-end recognition method based on convolutional recurrent neural networks (CRNNs), which adds a feature-reuse network component to a CRNN. The model is trained and tested on two Arabic text recognition datasets, KHATT and AHTID/MW. The experimental results demonstrate that the proposed method achieves better performance than other methods in the literature.

Keywords

Deep learning, handwriting, Arabic text recognition, open vocabulary, CNN, BLSTM, CTC beam search

1. Introduction

An important distinction in offline handwriting recognition systems is between recognizing isolated digits, characters or words and recognizing whole lines of text. Good recognition results have been obtained for isolated digits and characters, but text line recognition is substantially harder [1, 2]. For text recognition tasks, the naive approach is to recognize individual characters and map them onto the complete text. This can be done by pre-segmenting words into characters and classifying each segment. However, segmentation is difficult for cursive or unconstrained text unless the words have already been recognized [3].

Table 1
Arabic characters shapes in different positions

No.	Letter	Isolated	Beginning	Middle	Final	No.	Letter	Isolated	Beginning	Middle	Final
1	Alif	ء - ا - أ	ا - أ	ـئـ	ـا - ئ	15	Dhad	ض	ضـ	ـضـ	ـض
2	Ba	ب	بـ	ـبـ	ـب	16	Tta	ط	ط	ـطـ	ـط
3	Ta	ة - ت	تـ	ـتـ	ـت - ـة	17	Dha	ظ	ظ	ـظـ	ـظ
4	Tha	ث	ثـ	ـثـ	ـث	18	Ain	ع	عـ	ـعـ	ـع
5	Jeem	ج	جـ	ـجـ	ـج	19	Ghain	غ	غـ	ـغـ	ـغ
6	Ha	ح	حـ	ـحـ	ـح	20	Fai	ف	فـ	ـفـ	ـف
7	Kha	خ	خـ	ـخـ	ـخ	21	Qaf	ق	قـ	ـقـ	ـق
8	Dal	د	د	د	ـد	22	Kaf	كـ	كـ	ـكـ	ـكـ
9	Thal	ذ	ذ	ذ	ـذ	23	Lam	ل	لـ	ـلـ	ـل
10	Ra	ر	ر	ر	ـر	24	Meem	م	مـ	ـمـ	ـم
11	Za	ز	ز	ز	ـز	25	Noun	ن	نـ	ـنـ	ـن
12	Seen	س	سـ	ـسـ	ـس	26	Haa	ه	هـ	ـهـ	ـه
13	Sheen	ش	شـ	ـشـ	ـش	27	Waw	و	و	و	ـو
14	Sad	ص	صـ	ـصـ	ـص	28	Ya	ي	يـ	ـيـ	ـي

In the last few decades, Arabic text recognition has attracted great interest and has become one of the challenging research areas in document image processing and computer vision [4, 5]. However, most Arabic handwriting recognition systems are limited to modeling isolated characters, digits or words with a limited vocabulary. Very few studies address the recognition of unconstrained Arabic text with an open or large vocabulary [6, 7]. Most of the recent approaches for Arabic handwritten text/word recognition have used HMM-based techniques or shallow Artificial Neural Networks (ANNs) [4, 8]. Nowadays, the accelerating progress and availability of low-cost computer hardware, together with the growing difficulty of the problems tackled, have encouraged the use of computationally expensive techniques. Therefore, recent research focuses on open vocabulary recognition that deals with continuous sequences in order to recognize words and text lines extracted from handwritten documents [1, 2, 6, 8].

Deep neural networks were ignored for a long period due to the lack of efficient training methods. The only deep architectures found in the literature were convolutional neural networks, which contain a limited number of free parameters thanks to the locality and weight sharing of their architecture [9]. With more data and better hardware, it recently became possible to build deep neural networks simply and efficiently [9]. Deep neural networks have brought significant reductions of error rates in many areas, including speech recognition and computer vision. During the past decade, DNNs with several architectures have shown impressive results in extracting patterns from different data types. For instance, Convolutional Neural Networks (CNNs) constitute the state of the art in extracting patterns from images [6, 10], while Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units constitute the state of the art in extracting patterns from text data [11].

In this paper, we propose a recognition system for Arabic handwritten text lines. We demonstrate the benefit of using deep Convolutional Neural Networks (CNNs) for feature extraction. Our motivation is then to study the potential benefits of adding a densely connected layer to the VGGNet architecture [12]. We present results for Arabic handwriting sequences in an open vocabulary context. The rest of this paper is organized as follows. Section 2 gives a general overview of the properties of the Arabic script and their effect on the recognition process. Section 3 presents some recent works based on deep learning techniques. In Section 4, we present an overview of the proposed recognition model and describe and discuss the improvements in the feature extraction step and the recognition step respectively. Several combinations were tested, and the results are reported in Section 5. In Section 6, we discuss the recorded results, and Section 7 concludes the paper.

2. Characteristics of Arabic script and recognition challenges

The writing style of the Arabic script differs greatly from that of other scripts such as English and Chinese. The script is purely cursive in both printed and handwritten forms, is written from right to left and has 28 letters. This makes Arabic text recognition a sophisticated process and presents some unique challenges, as shown in Fig. 1. A particularity of Arabic text is that the shape of a character may change considerably according to its position in the word and its adjacent characters [13]. Each character has from two to four different forms, which increases the number of classes to be recognized from 28 to 84, as shown in Table 1. An Arabic letter can be isolated, connected from the right, connected from the left or connected from both sides, which defines the four forms: isolated, ending, beginning and middle respectively [10, 13]. Therefore, one word may consist of one or many ligatures, which contribute significantly to word formation, and characters in a word may overlap vertically. Figure 1 shows examples of some of these problems in Arabic handwritten text. This connectivity makes segmentation, and therefore recognition, difficult. Dots are a crucial component of Arabic characters: sixteen of them have from one to three dots. The dots can be above the character's primary part, as in the letter Thaal (ذ), below it, as in the letter Baa (ب), or in the middle, as in the letter Jeem (ج). Similar characters share the same primary shape and are differentiated by the number and position of their dots, such as ش, which consists of the body of the letter س and three dots above it. In addition, characters in an Arabic word can carry diacritics, which are read as vowels and written as strokes placed either above or below the characters. Some of the dots and strokes may be missing in handwriting, as presented in Fig. 1, which affects the accuracy of the recognition system.

Figure 1.

Some characteristics of handwritten Arabic script.

2.1 Related works

The literature on Arabic handwriting recognition contains a wide range of approaches. In this section, we present a survey of recent existing systems.

Some recent works used traditional classifiers such as HMMs and SVMs with handcrafted features. Siddhu et al. [14] proposed a recognition method that combines statistical and structural approaches for feature extraction. The main body of a character is modeled by the statistical method using modified direction features and Support Vector Machines, while the structural method uses dot descriptors to recognize the character. Hassan et al. [15] proposed a method for recognizing handwritten Arabic words without segmentation into sub-letters, based on scale-invariant feature transform (SIFT) features and support vector machines (SVMs). Salam et al. [16] proposed a system for offline isolated Arabic handwritten character recognition. Although only half of the dataset was used to train the Support Vector Machine (SVM) and the other half for testing, the system achieved high performance. Jayech et al. [17] proposed a system based on a synchronous multi-stream HMM (MSHMM), which has the advantage of efficiently modeling the interaction between multiple features. These features combine statistical and structural ones, extracted over the columns and rows using a sliding window approach. Two word models are implemented, based on the holistic and analytical approaches, without any explicit segmentation. Mezghani et al. [18] presented a system for offline recognition of cursive Arabic handwritten text based on Hidden Markov Models (HMMs). The authors studied the shape modeling of different handwritten Arabic characters using HMMs to make the training and recognition of characters more efficient. The number of HMMs is reduced substantially while still capturing the variations between the character shape models, which led to robust and efficient recognition with only 61 models.

Other Arabic handwriting recognition models focus on combining deep neural networks with traditional classifiers. Elleuch et al. [19] investigated the combination of a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM), and also studied the applicability of dropout in the proposed model. The model was evaluated on the HACDB and IFN/ENIT databases. Amrouch et al. [20] compared two feature extraction strategies combined with an HMM as the recognizer. The first, named CNN-features-HMM, uses a CNN, and the second, named Handcrafted-features-HMM, uses handcrafted features. The first strategy operates directly on the images and extracts relevant characteristics, so it does not need as much emphasis on the feature extraction and pre-processing stages as the second strategy.

Table 2
Summary of handwriting recognition models using deep learning

Reference	Model	Dataset	Accuracy
Siddhu MK et al. [14] (2019)	Support Vector Machine (SVM)	IFN/ENIT	96.71%
Hassan AKA et al. [15] (2019)	Support Vector Machine (SVM)	AHDB	99.08%
Salam M et al. [16] (2019)	Support Vector Machine (SVM)	560 handwriting character images	99.64%
Jayech K et al. [17] (2016)	MSHMM	IFN/ENIT	91.10% on set a
Mezghani et al. [18] (2016)	Hidden Markov Models (HMMs)	IFN/ENIT	88.91%
Elleuch et al. [19] (2016)	SVM $+$ CNN	HACDB (chars) and IFN/ENIT (words)	94.17% (HACDB), 92.95% (IFN/ENIT)
Amrouch et al. [20] (2017)	HMM $+$ CNN	IFN/ENIT (words)	88.95%
Khémiri A et al. [21] (2019)	Dynamic Bayesian network (DBN) and Hidden Markov model (HMM)	IFN/ENIT	95.20%
El-Sawy et al. [22] (2017)	CNN	MADBase (digits)	94.90%
Ashiquzzaman et al. [23] (2017)	CNN	CMATERDB (digits)	97.40%
Alaasam et al. [24] (2017)	CNN	HAHPT (Words)	85.00%
Younis [25] (2018)	CNN	AHCD (Chars)	94.70%
Elleuch et al. [27] (2019)	CDBN	HACDB (chars) and IFN/ENIT (words)	98.86% (HACDB), 91.55% (IFN/ENIT)
Ahmad et al. [28] (2017)	MDLSTM	KHATT (Text line)	75.70%
Jemni et al. [29]	BLSTM	KHATT (Text line)	84.81%
Noubigh et al. [30] (2019)	CNN $+$ BLSTM	KHATT (Text line)	79.90%
Noubigh et al. [31] (2020)	Very deep CNN $+$ BLSTM	KHATT (Text line) and AHTID/MW (Text line)	87.39% (KHATT), 83.82% (AHTID/MW)
Jemni et al. [32] (2019)	CNN $+$ MDLSTM	KHATT (Text line) and AHTID/MW (Text line)	79.17% (KHATT), 81.87% (AHTID/MW)

Deep neural networks were then used for both feature extraction and classification in alphanumeric recognition. El-Sawy et al. [22] presented a Convolutional Neural Network (CNN) based on the LeNet-5 architecture and evaluated it on a large Arabic digit database (MADBase). Ashiquzzaman et al. [23] proposed a model based on a CNN that uses the Rectified Linear Unit (ReLU) activation function with dropout as a regularization layer. Alaasam et al. [24] designed a CNN to recognize historical Arabic handwritten text, and Younis [25] used a CNN for handwritten Arabic character recognition. Khémiri et al. [26] presented a comparative study demonstrating that Bayesian networks and convolutional neural networks (CNNs) are very efficient compared with other methods.

Recently proposed recognition models investigate the combination of several deep neural networks. Elleuch et al. [27] applied a Convolutional Deep Belief Network (CDBN). For text line recognition, Ahmad et al. [28] proposed an MDLSTM-based Arabic character recognition system in which Connectionist Temporal Classification (CTC) is used as a final layer to align the predicted labels according to the most probable path. The KHATT dataset was used for the experiments. Jemni et al. [29] proposed an Arabic handwriting recognition system based on the combination of multiple BLSTM-CTC systems. The authors presented a comparative study of different combination levels of BLSTM-CTC recognition systems trained on different feature sets, with experiments conducted on the Arabic KHATT dataset. Noubigh et al. [30] proposed a hybrid CNN-BLSTM model for Arabic handwritten text line recognition using the KHATT database. The CNN is used for feature extraction, and a bidirectional long short-term memory (BLSTM) network followed by a connectionist temporal classification (CTC) layer is used for sequence labeling. Noubigh et al. [31] combined a very deep CNN model with a BLSTM for open vocabulary Arabic text recognition, using the CTC beam search decoder with the BLSTM for sequence modeling. Jemni et al. [32] proposed a model based on MDLSTM and CNN. The main contribution of that work is a novel out-of-vocabulary (OOV) detection and recovery method that considerably improves system performance. Experiments were performed using the KHATT and AHTID/MW databases. Table 2 summarizes the literature reviewed on Arabic handwriting recognition.

Figure 2.

Overview of the recognition system.

From the works reviewed in Table 2, we conclude that SVM and HMM classifiers were not advantageous for Arabic handwritten text recognition, since their recognition rates were very low. The extraction of relevant features for Arabic handwriting is an interesting problem, and researchers have applied several techniques to tackle the complexity of handwritten Arabic script. Some works combine a CNN with an SVM or HMM instead of using handcrafted features, while others focus on improving CNN architectures. Results have improved but are still insufficient. Despite these efforts, most of the reported results are not conclusive, since the systems were trained and tested on small or private datasets. The proposed models have obtained good results on small or limited datasets, but very large databases remain a challenge. Indeed, although the IFN/ENIT database is limited in terms of writing variations and vocabulary, it is still used by a large number of recent systems. Research in handwritten Arabic character recognition is still at an early stage compared with Latin and other scripts, and it needs more attention. Until now, most Arabic handwriting recognition systems have been limited to recognizing digits, characters or words in a limited vocabulary. This area of research therefore remains open for further enhancement, and extensive research needs to be conducted. Moreover, many recent applications aim to solve the problem of text line recognition with an open or very large vocabulary.

3. System overview

In this section, we present an overview of the proposed recognition model. It is a hybrid approach that combines CNN architectures with a BLSTM followed by a CTC decoder layer, and it consists of three main steps, as presented in Fig. 2. The first step is the preprocessing of the input image in order to reduce noise and eliminate any source of variability introduced during image scanning. The second step is feature extraction, which obtains relevant descriptors of the images based on two CNN architectures. Finally, the recognition step consists of sequence modeling using the BLSTM network, trained with CTC. We therefore focus on describing these last two steps of the proposed recognition model.

Table 3
CNN layers configuration

No.	Type	Configuration	No.	Type	Configuration
1	Input	64 $\times$ 64 $\times$ 1 pixels	20	ReLu
2	Conv1	#maps: 64, k: 3 $\times$ 3	21	Batch normalization
3	ReLu		22	Max pooling	Window: 2 $\times$ 1, s: 2
4	Batch normalization		23	Conv7	#maps: 512, k: 3 $\times$ 3
5	Conv2	#maps: 64, k: 3 $\times$ 3	24	ReLu
6	ReLu		25	Batch normalization
7	Batch normalization		26	Conv8	#maps: 512, k: 3 $\times$ 3
8	Max pooling	Window: 2 $\times$ 2, s: 2	27	ReLu
9	Conv3	#maps: 128, k: 3 $\times$ 3	28	Batch normalization
10	ReLu		29	Max pooling	Window: 2 $\times$ 1, s: 2
11	Batch normalization		30	Conv9	#maps: 512, k: 3 $\times$ 3
12	Conv4	#maps: 128, k: 3 $\times$ 3	31	ReLu
13	ReLu		32	Batch normalization
14	Batch normalization		33	Conv10	#maps: 512, k: 3 $\times$ 3
15	Max pooling	Window: 2 $\times$ 2, s: 2	34	ReLu
16	Conv5	#maps: 256, k: 3 $\times$ 3	35	Batch normalization
17	ReLu		36	Max pooling	Window: 2 $\times$ 1, s: 2
18	Batch normalization		37	Output	2 $\times$ 16 $\times$ 512
19	Conv6	#maps: 256, k: 3 $\times$ 3	38	Fully connected	16 $\times$ 1024

Figure 3.

Feed-forward neural network with batch normalization.

3.1 Feature extraction step

As CNNs have provided high performance in many fields, especially in image classification and object detection, various architectures have been proposed to deal with different requirements. The most important models include winners of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competition [33]. ImageNet is an image dataset with about 15 million images in over 22000 different categories. The principal CNN architectures are LeNet, AlexNet, VGGNet, GoogLeNet, ResNet and DenseNet [33]. In this section, we describe the feature extraction step of the proposed recognition model, which is based mainly on two CNN architectures. The first is inspired by the VGGNet architecture. The second inserts a densely connected layer into the VGGNet architecture to investigate the advantage of feature reuse.

3.1.1 VGGNet-based feature extraction

We now introduce the architecture used for feature extraction. It follows the standard VGGNet model with some differences. It contains 10 convolutional layers, each with a filter size of 3 $\times$ 3 and stride 1, as shown in Fig. 4. The numbers of filters in the convolutional layers are set to 64, 64, 128, 128, 256, 256, 512, 512, 512 and 512. The fully connected layers of VGGNet are not considered, as we do not use the network for classification. The architecture details are given in Table 3. The Rectified Linear Unit (ReLU) activation function is used, defined as in Eq. (1).

$\displaystyle f(x)=\begin{cases}x&x\geqslant 0\\ 0&x<0\end{cases}$ (1)

Every two convolutional layers are followed by a max-pooling layer, defined by Eq. (2).

$\displaystyle P_{i}=\text{Max}(X_{i})$ (2)

Max-pooling uses a 2 $\times$ 2 window with stride 2 in the first two pooling layers and a 1 $\times$ 2 window with stride 2 in the other three max-pooling layers. The max-pooling layers improve tolerance to feature variation, summarize image regions and output a downsized version of the previous layer.

Additionally, we use a batch normalization layer before the ReLU layer to speed up convergence and reduce over-fitting. Batch Normalization (BN) is a normalization technique applied between the layers of a neural network instead of on the raw data, and it is computed over mini-batches instead of the full data set. It serves to speed up training and to stabilize learning. It has also been observed that BN reduces the effects of exploding and vanishing gradients, because the activations become roughly normally distributed.
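As a rough check of the pooling arithmetic, the following sketch traces how five max-pooling layers shrink a 64 $\times$ 64 window. It is only an illustration under stated assumptions: the convolutions are assumed to preserve the spatial size, and the 2 $\times$ 1 windows listed in Table 3 are assumed to stride only along the height, which is what reproduces the 2 $\times$ 16 spatial output given there. The function names are hypothetical.

```python
def relu(x):
    """Eq. (1): pass non-negative values through, clamp negatives to zero."""
    return x if x >= 0 else 0.0


def max_pool2d(fmap, win_h, win_w, stride_h, stride_w):
    """Eq. (2): keep the maximum of each pooling region of a 2-D map."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - win_h + 1, stride_h):
        pooled.append([
            max(fmap[r + i][c + j] for i in range(win_h) for j in range(win_w))
            for c in range(0, cols - win_w + 1, stride_w)
        ])
    return pooled


# (win_h, win_w, stride_h, stride_w) for the five pooling layers.
# Assumption: the 2 x 1 windows move by 2 along the height and by 1 along the width.
POOLING_LAYERS = [
    (2, 2, 2, 2),
    (2, 2, 2, 2),
    (2, 1, 2, 1),
    (2, 1, 2, 1),
    (2, 1, 2, 1),
]


def trace_pooled_shape(height, width):
    """Apply only the pooling layers to a blank map and return its final shape."""
    fmap = [[0.0] * width for _ in range(height)]
    for layer in POOLING_LAYERS:
        fmap = max_pool2d(fmap, *layer)
    return len(fmap), len(fmap[0])


if __name__ == "__main__":
    print(trace_pooled_shape(64, 64))
```

Under these assumptions, `trace_pooled_shape(64, 64)` returns `(2, 16)`: the height is halved five times while the width is halved only twice.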

We can define the normalization formula of Batch Normalization as:

$\displaystyle Z^{N}=\frac{Z-m_{z}}{s_{z}}$ (3)

where $m_{z}$ is the mean of the neurons' output and $s_{z}$ denotes its standard deviation. Figure 3 illustrates the position of the Batch Normalization layer (red rectangle): $x_{i}$ are the inputs, $z_{i}$ the outputs after applying the linear transformation of the neuron, $a_{i}$ the outputs of the activation functions, and $y$ the output of the network.

Figure 4.

VGGNet-based feature extraction architecture.

Batch Normalization (BN) is applied to the output of the neurons just before the activation function. A neuron without BN is usually computed as in Eqs (4) and (5):

$\displaystyle Z=g(w,x)+b$ (4)

$\displaystyle a=f(Z)$ (5)

where $g()$ is the linear transformation of the neuron, $w$ and $b$ are the weights and the bias of the neuron respectively, and $f()$ is the activation function. With Batch Normalization, the computation becomes Eqs (6)–(8):

$\displaystyle Z=g(w,x)$ (6)

$\displaystyle Z^{N}=\left(\frac{Z-m_{z}}{s_{z}}\right)\cdot\gamma+\beta$ (7)

$\displaystyle a=f(Z^{N})$ (8)

where $Z^{N}$ is the output of BN, and $\gamma$ and $\beta$ are the learnable parameters of BN. Note that the bias of the neuron ($b$) is removed: because the mean $m_{z}$ is subtracted, any constant added to the values of $Z$, such as $b$, cancels out. The parameters $\beta$ and $\gamma$ shift the mean and scale the standard deviation, respectively. Thus, the output of Batch Normalization over a layer has a distribution with mean $\beta$ and standard deviation $\gamma$. These values are learned over the epochs together with the other learnable parameters, such as the weights of the neurons, with the aim of decreasing the loss of the model.
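The batch-normalized neuron of Eqs (6) to (8) can be sketched in a few lines of plain Python for a single neuron over one mini-batch. This is a minimal illustration, not the implementation used in the experiments: the small constant `eps`, added for numerical stability, and the function names are assumptions that do not appear in the equations above.

```python
import math


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize pre-activations z over a mini-batch, then scale and shift (Eq. 7)."""
    m_z = sum(z) / len(z)
    var_z = sum((v - m_z) ** 2 for v in z) / len(z)
    s_z = math.sqrt(var_z + eps)  # eps is an assumed stabilizer, absent from Eq. (7)
    return [gamma * (v - m_z) / s_z + beta for v in z]


def relu(x):
    """Activation function f of Eq. (8)."""
    return x if x >= 0 else 0.0


if __name__ == "__main__":
    z = [1.0, 2.0, 3.0, 4.0]            # Z = g(w, x) for four samples (Eq. 6)
    z_with_bias = [v + 5.0 for v in z]  # the same values with a bias b = 5 added
    z_n = batch_norm(z, gamma=2.0, beta=0.5)
    a = [relu(v) for v in z_n]          # Eq. (8)
    print(z_n)
    print(a)
    # Subtracting the batch mean cancels the constant bias, so this matches z_n:
    print(batch_norm(z_with_bias, gamma=2.0, beta=0.5))
```

The normalized batch has mean $\beta$ and, up to the effect of `eps`, standard deviation $\gamma$, and adding a constant bias to every element leaves the result unchanged.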

An important issue in the design of VGGNet-based feature extraction is the choice of the input text line image size. Our input images are normalized to 64 $\times$ 1024 pixels, which increases the complexity of the network. We therefore use a sliding window technique to overcome this problem. The steps of the feature extraction model are illustrated in Fig. 4, where Conv(i) indicates a convolutional layer, BN the batch normalization layer, ReLU the activation function layer and ‘Maxp’ the max-pooling layer. The input image is scanned by a horizontal sliding window from right to left in order to obtain a sequence of pixel matrices of size 64 $\times$ 64. For each input image, we obtain 16 matrices, which are fed to the CNN as a batch of images. The CNN output is of dimension 2 $\times$ 16 $\times$ 512, where 512 is the number of filter maps in the last convolutional layer, and the two other dimensions depend on the amount of pooling in the CNN. Finally, it is reshaped to 16 $\times$ 1024 to be fed to the classifier.
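The right-to-left scan described above can be sketched as follows. This is a minimal illustration that assumes non-overlapping windows, which is consistent with obtaining 16 windows of width 64 from a 1024-pixel-wide line; the function name and the list-of-rows image representation are hypothetical.

```python
def sliding_windows_rtl(image, window_width=64):
    """Cut a text line image (list of pixel rows) into windows, rightmost first."""
    width = len(image[0])
    if width % window_width != 0:
        raise ValueError("image width must be a multiple of the window width")
    windows = []
    # Start at the right edge, since Arabic is written from right to left.
    for right in range(width, 0, -window_width):
        left = right - window_width
        windows.append([row[left:right] for row in image])
    return windows


if __name__ == "__main__":
    # Dummy 64 x 1024 image whose pixel value is its column index.
    line_image = [list(range(1024)) for _ in range(64)]
    batch = sliding_windows_rtl(line_image)
    # Number of windows, then height and width of each window.
    print(len(batch), len(batch[0]), len(batch[0][0]))
```

The first window in the returned list covers the rightmost 64 columns of the line, so the sequence order follows the writing direction.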

Figure 5.

Densely-VGGNet architecture for feature extraction.

3.1.2 Densely layers to improve VGGNet

Difficulties emerge with VGGNet-style networks as they go deeper: as the path between the input layer and the output layer grows longer, information can vanish before reaching the other side. In this section, we describe the improvement made to the VGGNet architecture described previously. Inspired by the principle of other architectures such as DenseNet and ResNet, we add a densely connected layer to the VGGNet by concatenating feature maps. The objective is to reduce the vanishing-gradient problem by decreasing the depth of the convolutional layers, to improve feature propagation and to encourage feature reuse. As presented in Fig. 5, a layer called Concat is added, and some network parameters are adapted compared with the original network (Fig. 4).

This model has only 8 convolutional layers, and all max-pooling layers use a window of size 2 $\times$ 2. The numbers of filters in the convolutional layers are set to 32, 64, 128, 256, 256, 512, 512 and 1024. The CNN output is of dimension 4 $\times$ 4 $\times$ 1024, where 1024 is the number of filter maps in the last convolutional layer, and the two other dimensions depend on the amount of pooling in the CNN. As illustrated in Fig. 5, the first concatenation layer takes its inputs from the pixels of the input image and the first ReLU layer. We then concatenate the output of the ReLU layer with the output of the previous max-pooling layer. As a direct consequence of concatenation, the feature maps learned by previous layers can be accessed by subsequent layers. This encourages feature reuse throughout the network and results in more condensed features.

The concatenation layer is followed by a batch normalization layer to normalize and redistribute the data. Low-level features and high-level semantic features are combined through the concatenation of feature maps. Consequently, this feature reuse network can greatly improve the performance of text line recognition, because more detailed information from the original image is retained. In addition, the model can be more robust to the shape and distortion of different characters in the image.
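The concatenation itself is a simple operation along the channel axis. The sketch below illustrates it on plain nested lists. It is only a shape-level illustration, with hypothetical names and dummy values, of how the first Concat layer would stack the single-channel input with the 32 maps of the first convolution, assuming that this convolution preserves the spatial size.

```python
def concat_channels(*feature_maps):
    """Stack feature maps along the channel axis.

    Each feature map is a list of channels, and each channel is a 2-D list
    (height x width). All inputs must share the same spatial size.
    """
    height = len(feature_maps[0][0])
    width = len(feature_maps[0][0][0])
    for fmap in feature_maps:
        for channel in fmap:
            if len(channel) != height or any(len(row) != width for row in channel):
                raise ValueError("all channels must have the same spatial size")
    return [channel for fmap in feature_maps for channel in fmap]


def dummy_maps(channels, height, width, value=0.0):
    """Hypothetical helper that builds constant feature maps for the demo."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


if __name__ == "__main__":
    image = dummy_maps(1, 64, 64)              # raw input pixels, one channel
    relu1 = dummy_maps(32, 64, 64, value=1.0)  # output of the first ReLU, 32 maps
    merged = concat_channels(image, relu1)
    print(len(merged))                          # number of channels after Concat
```

Because the original pixels remain part of the stacked tensor, later layers can still access them directly, which is the feature reuse effect described above.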

3.2 Sequence modeling step based on BLSTM-CTC

After the features are extracted from the image pixels, they are then used independently to train a BLSTM-CTC network. The network transforms a sequence of CNN features of length T into a text line of length L. In this section, we first present the BLSTM network and then discuss the CTC technique used.

Figure 6.

LSTM cell illustration.

3.2.1 BLSTM

The LSTM [34] cell is illustrated in Fig. 6, and its mathematical form is as follows. The forget gate $f_{t}$, the candidate cell memory ${\overline{C}}_{t}$, the input gate $I_{t}$ and the output gate $O_{t}$ are computed as:

$\displaystyle f_{t}=\sigma(X_{t}*U_{f}+H_{t-1}*W_{f})$ (9)

$\displaystyle{\overline{C}}_{t}=\tanh(X_{t}*U_{c}+H_{t-1}*W_{c})$ (10)

$\displaystyle I_{t}=\sigma(X_{t}*U_{i}+H_{t-1}*W_{i})$ (11)

$\displaystyle O_{t}=\sigma(X_{t}*U_{o}+H_{t-1}*W_{o})$ (12)

The outputs $C_{t}$ and $H_{t}$ are calculated as:

$\displaystyle C_{t}=f_{t}*C_{t-1}+I_{t}*{\overline{C}}_{t}$ (13)

$\displaystyle H_{t}=O_{t}*\tanh(C_{t})$ (14)

where $X_{t}$ is the input vector, $H_{t}$ and $H_{t-1}$ are the current and previous cell outputs respectively, and $C_{t}$ and $C_{t-1}$ are the current and previous cell memories respectively. $U$ and $W$ denote the input and recurrent weight matrices of each gate, and $\sigma$ is the logistic sigmoid function.
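To make Eqs (9) to (14) concrete, the following sketch implements one LSTM step for a cell with a single input and a single hidden unit, so that every product in the equations reduces to ordinary multiplication. It is a simplified illustration: the scalar setting, the dictionary layout of the weights, the weight values and the function names are assumptions, and, like the equations above, it has no bias terms.

```python
import math


def sigmoid(x):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, U, W):
    """One step of a scalar LSTM cell following Eqs (9)-(14).

    U and W map gate names ('f', 'c', 'i', 'o') to input and recurrent weights.
    """
    f_t = sigmoid(x_t * U["f"] + h_prev * W["f"])      # forget gate, Eq. (9)
    c_bar = math.tanh(x_t * U["c"] + h_prev * W["c"])  # candidate memory, Eq. (10)
    i_t = sigmoid(x_t * U["i"] + h_prev * W["i"])      # input gate, Eq. (11)
    o_t = sigmoid(x_t * U["o"] + h_prev * W["o"])      # output gate, Eq. (12)
    c_t = f_t * c_prev + i_t * c_bar                   # new cell memory, Eq. (13)
    h_t = o_t * math.tanh(c_t)                         # new cell output, Eq. (14)
    return h_t, c_t


if __name__ == "__main__":
    # Arbitrary illustrative weights, not values from the paper.
    U = {"f": 0.5, "c": 1.0, "i": 0.5, "o": 0.5}
    W = {"f": 0.1, "c": 0.2, "i": 0.1, "o": 0.1}
    h, c = 0.0, 0.0
    for x in [1.0, -0.5, 0.25]:
        h, c = lstm_step(x, h, c, U, W)
        print(round(h, 4), round(c, 4))
```

The cell memory is carried from one step to the next through Eq. (13), which is what lets the network retain information over long sequences.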

Bidirectional LSTM (BLSTM) [35] is a direct extension of LSTM in which the input sequence is scanned in both the forward and backward directions and the results are merged into a combined representation. This architecture has two layers of hidden nodes, both connected to the input and the output. The two hidden layers differ in that the first has recurrent connections from past time steps, while in the second the direction of the recurrent connections is flipped, passing activations backwards along the sequence. The BLSTM processes an N-dimensional input signal to produce an output signal that takes long-term dependencies into account.

To do this, the BLSTM is composed of two recurrent neural networks with Long Short-Term Memory cells. One network processes the data chronologically, while the other processes the data in reverse chronological order. Therefore, at time t a decision can be made by concatenating the outputs of the two networks, using both past and future context, as presented in Fig. 7.
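The bidirectional scheme can be illustrated independently of the LSTM cell. In the sketch below, a toy running-sum recurrence stands in for each LSTM direction (an assumption made purely for readability), so that the concatenated output at each time step visibly depends on both past and future inputs. All names are hypothetical.

```python
def run_recurrence(sequence, step, initial_state=0.0):
    """Apply a recurrent step function over a sequence and collect the states."""
    state, outputs = initial_state, []
    for x in sequence:
        state = step(x, state)
        outputs.append(state)
    return outputs


def bidirectional(sequence, step_forward, step_backward):
    """Pair, per time step, the forward and backward recurrent outputs."""
    forward = run_recurrence(sequence, step_forward)
    # Process the reversed sequence, then flip back to align with time steps.
    backward = run_recurrence(sequence[::-1], step_backward)[::-1]
    return list(zip(forward, backward))


def running_sum(x, state):
    """Toy stand-in for an LSTM step: accumulates everything seen so far."""
    return state + x


if __name__ == "__main__":
    print(bidirectional([1.0, 2.0, 3.0], running_sum, running_sum))
```

For the input `[1.0, 2.0, 3.0]`, the result is `[(1.0, 6.0), (3.0, 5.0), (6.0, 3.0)]`: the first element of each pair summarizes the past up to that step, and the second summarizes the future from that step onward.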

Figure 7.

BLSTM architecture.

The proposed system uses one bidirectional LSTM layer with 512 neurons in each direction. The BLSTM output is a matrix of size $B\times T\times(C+1)$, where B is the batch size, T the time-step length and C the number of classes, to which a pseudo-character called blank is added. This matrix is then fed into the CTC decoder algorithm. A dropout layer with a dropout ratio of 0.75 is applied to the feature sequence before it is fed to the BLSTM network, and dropout with a ratio of 0.5 is applied inside the BLSTM. Figure 8 shows the BLSTM layer. The input feature sequence, obtained from the CNN feature extraction, is of size (B, 16, 1024), so 16 is the number of time steps for our network. The output is therefore of size (B, 16, 102), where 102 is the number of classes, which are the Arabic characters and symbols plus the pseudo-character.

Figure 8.

BLSTM modeling.

3.2.2 The Connectionist Temporal Classification (CTC)

The CTC model was introduced by Alex Graves in [36]. CTC can be applied to supervised sequence classification tasks such as, in our case, handwriting recognition. Its appeal is that it does not require any pre-segmentation of the input or any post-processing of the output labels to produce the final predicted label sequence. In handwriting recognition, the goal is to build a classifier that converts an image into a sequence of labels. Several challenges prevent the use of simpler supervised learning algorithms. In particular, the input and output sequences vary in length, and no accurate alignment between them is available. To address this issue, the Connectionist Temporal Classification (CTC) objective function infers this alignment automatically: for a given input, it gives a distribution over all possible outputs. In a learning task using CTC, the model always ends with a softmax layer, each element of which represents the probability of emitting a given label at a specific time step. After being trained with the CTC loss function (CTC-trained), the network needs a CTC decoder to process its output during inference.

CTC-trained: To deal with the fact that the output sequence is shorter than the input sequence, CTC adds a blank symbol as an additional label to the label set and allows labels to be repeated. We define the CTC loss function as introduced in [25].

Assume that $\emptyset$ is the blank label and $X=(x_{1},x_{2},\ldots,x_{T})$ is an input sequence of length $T$, and denote by ${y}^{t}_{k}$ the activation of output unit $k$ at time $t$. Then ${y}^{t}_{k}$ is interpreted as the probability of observing label $k$ at time $t$, which defines a distribution over the set $L^{\prime T}$ of length-$T$ sequences over the alphabet $L^{\prime}=L\cup\{\emptyset\}$. A CTC path $\pi$, introduced in [36] as a sequence of labels (including $\emptyset$), can be expressed as $\pi=({\pi}_{1},{\pi}_{2},\ldots,{\pi}_{T})$. The probability of a CTC path $\pi$ can be calculated as follows:

$\displaystyle P(\pi|x)=\prod^{T}_{t=1}{y^{t}_{\pi_{t}}},\quad\forall\pi\in L^{\prime T}$ (15)

Finally, a mapping function denoted $\beta$ is used, as described in [36]. This function first merges repeated labels and then removes all blanks from a path (for example, $\beta(a\emptyset ab\emptyset)=\beta(\emptyset aa\emptyset\emptyset abb)=aab$).
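The mapping $\beta$ is straightforward to express in code. The sketch below uses the character `-` as a stand-in for the blank label $\emptyset$ (an arbitrary choice for this illustration, as is the function name) and reproduces the example above.

```python
BLANK = "-"  # assumed stand-in for the blank label


def ctc_collapse(path, blank=BLANK):
    """Merge consecutive repeated labels, then drop blanks."""
    labels = []
    previous = None
    for symbol in path:
        # A symbol is kept only if it starts a new run and is not the blank.
        if symbol != previous and symbol != blank:
            labels.append(symbol)
        previous = symbol
    return "".join(labels)


if __name__ == "__main__":
    print(ctc_collapse("a-ab-"), ctc_collapse("-aa--abb"))
```

Both paths collapse to `aab`. The blank between the two occurrences of `a` is what allows a doubled letter to survive the merging of repeats.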

We can evaluate the probability of a given labelling $l\in L^{\leqslant T}$ as the sum of the probabilities of all the paths corresponding to it:

$\displaystyle P(l|x)=\sum_{\pi\in{\beta}^{-1}(l)}P(\pi|x)$ (16)

However, directly summing the probabilities of all the paths in ${\beta}^{-1}(l)$ is computationally intractable, since their number grows exponentially with $T$. To calculate $P(l|x)$ efficiently, the CTC forward-backward algorithm was introduced in [25]. The network can then be trained with the CTC objective function (CTC loss function):

$\displaystyle\textit{CTC}(X)=-\log P(l|x)$ (17)
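As an illustration of Eqs. (16) and (17), the following sketch computes $P(l|x)$ in two ways: by enumerating every path, as Eq. (16) states literally, and with the forward recursion of the forward-backward algorithm. It is a plain-Python sketch on a small made-up output matrix, not the TensorFlow implementation used in this work; the function names, the matrix values and the choice of index 0 for the blank are all assumptions.

```python
import math
from itertools import groupby, product


def collapse(path, blank):
    """CTC mapping: merge repeated labels, then remove blanks."""
    merged = [k for k, _ in groupby(path)]
    return tuple(k for k in merged if k != blank)


def labelling_prob_bruteforce(y, labelling, blank=0):
    """Eq. (16) taken literally: sum P(pi|x) over all paths mapping to the labelling."""
    T, K = len(y), len(y[0])
    total = 0.0
    for path in product(range(K), repeat=T):
        if collapse(path, blank) == tuple(labelling):
            total += math.prod(y[t][k] for t, k in enumerate(path))
    return total


def labelling_prob_forward(y, labelling, blank=0):
    """Same quantity via the forward recursion, without enumerating paths."""
    ext = [blank]  # labelling with blanks inserted around every label
    for k in labelling:
        ext += [k, blank]
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = y[0][ext[0]]
    if S > 1:
        alpha[1] = y[0][ext[1]]
    for t in range(1, len(y)):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * y[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


# Made-up softmax outputs: 3 time-steps, labels (blank, a, b).
y = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.1, 0.8]]
p = labelling_prob_forward(y, [1, 2])
print(p, labelling_prob_bruteforce(y, [1, 2]))
print("CTC loss:", -math.log(p))
```

The brute-force version visits $K^{T}$ paths, whereas the forward recursion needs a number of operations proportional to $T$ times the labelling length, which is what makes training with Eq. (17) practical.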

CTC-decoder: Decoding a CTC network means finding the most probable output sequence for a given input. The first and simplest approximation is best path decoding, presented in [36]. It selects the most probable label at each time-step, on the assumption that the most probable path corresponds to the most probable labelling. Although it can already provide useful transcriptions, this approach is not sufficient for many sequence tasks. Another decoding algorithm, called beam search, is described in the paper of Hwang and Sung [37]. Multiple candidates for the final labelling, called beams, are calculated iteratively. At each time-step, each beam labelling is extended by all possible characters; the original beam is also copied to the next time-step. The beam width (W) defines the number of best beams to keep, and it determines both the complexity and the accuracy of the algorithm. If W is large enough, the probability of finding the correct answer approaches one, but the algorithm becomes too costly. If W is too small, the probability that beam search finds the correct answer is too low. There is therefore a trade-off between the size of W and the accuracy.

The CTC beam search decoding searches for the most probable sequence among all sequences of length $\leqslant T$ built from $K$ labels ($\emptyset$ does not appear in the output sequence). The number of such sequences grows exponentially with $T$, but the number of sequences examined by CTC beam search decoding is no larger than $K\cdot W\cdot T$.

The CTC beam search has proved effective for end-to-end sequence recognition, and it accelerates the decoding process. To the best of our knowledge, no existing work on Arabic text recognition uses CTC beam search decoding, and it is used for the first time in this work. To illustrate the difference between the two algorithms, we refer to the work of Scheidl et al. [38], which explains their principles.

We take as an example an RNN output matrix containing 2 time-steps ($t_0$ and $t_1$) and 3 labels (a, b and –, the latter representing the CTC blank), as shown in Fig. 9. Best path decoding takes the most probable label per time-step, which gives the path “$--$” and therefore the recognized text “” with probability $0.6\cdot 0.6=0.36$. However, the correct answer is “$a$”. The beam search algorithm handles such situations correctly by summing the probabilities of all paths yielding the same labelling. For the labelling “$a$”, it sums over the paths “$-a$”, “$a-$” and “$aa$” (see the right part of Fig. 9), with probability $0.6\cdot 0.4+0.4\cdot 0.6+0.4\cdot 0.4=0.64$. The only path which gives “” still has probability 0.36; therefore “$a$” is the result returned by beam search.
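The two-time-step example can be reproduced with a short sketch. It implements best path decoding and a generic CTC beam search in which, for each beam, the probability of paths ending in a blank and of paths ending in a label are tracked separately. This is a minimal illustration under stated assumptions, not the decoder used in our experiments: the function names are hypothetical, and the matrix assumes that label b has probability 0 at both time-steps, which is consistent with the values 0.6 and 0.4 quoted above.

```python
from collections import defaultdict


def best_path_decode(mat, labels, blank):
    """Take the most probable label per time-step, then merge repeats and drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in mat]
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(labels[k])
        prev = k
    return "".join(out)


def beam_search_decode(mat, labels, blank, beam_width=30):
    """Return the most probable labelling and its summed path probability."""
    # labelling -> (prob. of paths ending in blank, prob. of paths ending in a label)
    beams = {(): (1.0, 0.0)}
    for row in mat:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            # Copy the beam: append a blank ...
            nxt[prefix][0] += (p_b + p_nb) * row[blank]
            # ... or repeat its last label, which collapses to the same labelling.
            if prefix:
                nxt[prefix][1] += p_nb * row[prefix[-1]]
            # Extend the beam by every non-blank label.
            for k, p in enumerate(row):
                if k == blank:
                    continue
                if prefix and prefix[-1] == k:
                    # A doubled label needs a blank between the two occurrences.
                    nxt[prefix + (k,)][1] += p_b * p
                else:
                    nxt[prefix + (k,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {pre: tuple(pr) for pre, pr in ranked[:beam_width]}
    best, probs = max(beams.items(), key=lambda kv: sum(kv[1]))
    return "".join(labels[k] for k in best), sum(probs)


labels = "ab-"            # index 2 is the CTC blank
mat = [[0.4, 0.0, 0.6],   # t0
       [0.4, 0.0, 0.6]]   # t1
print(repr(best_path_decode(mat, labels, blank=2)))
print(beam_search_decode(mat, labels, blank=2))
```

On this matrix, best path decoding returns the empty text, while beam search returns “a” with a probability of about 0.64, in agreement with the hand calculation.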

Figure 9.

Result paths of CTC best path (in the left) and CTC beam search (in the right).

Table 4

Statistics of the two databases used: KHATT and AHTID/MW

Dataset	Train			Test			Validation
	Lines	Words	Characters	Lines	Words	Characters	Lines	Words	Characters
KHATT	9,475	129,826	605,537	2,007	26,449	122,757	1,901	26,142	121,433
AHTID/MW	2,699	25,024	106,246	901	7,186	31,994	901	7,222	32,025

4. Experiments and results

4.1 Used databases

4.1.1 KHATT database

The offline handwritten Arabic text database KHATT is a challenging unconstrained Arabic text database [39]. It contains 4000 grayscale paragraph images and their ground truth. The database was written by 1,000 distinct male and female writers representing diverse countries, age groups and education levels. The images are grouped into two categories. The first comprises 2000 images containing similar text, each covering all Arabic characters and shapes. The second comprises the remaining 2000 images, which contain free texts written by the writers on any topic of their choice in an unrestricted style. In this work, all the line images, from both the similar-text and the free-text categories, are used in the experiments. The database is divided into training, validation and test subsets. Figure 10 shows examples of images extracted from the KHATT database.

Figure 10.

Examples of images extracted from KHATT database.

4.1.2 AHTID/MW database

The Arabic Handwritten Text Images Database AHTID/MW was built at the University of Sfax, Tunisia, in joint collaboration with the Institute for Communications Technology (IfN), Braunschweig, Germany [40]. AHTID/MW contains 3710 text lines and 22,896 words written by 53 native writers of Arabic. These images are divided into five balanced sets. The first four sets are available to the scientific community. The database is freely available to researchers worldwide. Figure 11 shows examples of text line images extracted from the database.

Figure 11.

Samples of text line image from the AHTID/MW database.

5. Experimental setup

The experiments are carried out on the two databases of handwritten text lines described previously: KHATT and AHTID/MW. We defined 101 classes covering the Arabic characters, numbers and symbols extracted from the ground truth of the two databases. Table 4 presents some useful statistics about the databases.

For all experiments, we used the TensorFlow framework [41] to implement our network. In the training phase, we trained the RNN with CTC using the RMSprop (Root Mean Square Propagation) optimizer [42] with a learning rate of $10^{-4}$, and training is stopped after 10 evaluations with no improvement in the character error rate. The CTC loss function is used for training, and CTC beam search is used for decoding. For the CTC beam search decoder, the beam width W was fixed experimentally to 30. A dropout layer with a rate of 0.3 is inserted between the CNN and the BLSTM. The character error rate (CER) and the word error rate (WER) are commonly used performance measures for sequential data, and for handwriting recognition in particular. We use these two measures to characterize the performance of our recognition systems.
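The stopping rule described above can be sketched in a few lines of framework-independent Python. The class `EarlyStopping`, the dictionary `TRAINING_CONFIG` and the sequence of CER values are hypothetical and only restate the settings given in the text; they are not taken from our TensorFlow code.

```python
# Hyper-parameters as stated in the text, gathered in one place (hypothetical container).
TRAINING_CONFIG = {
    "optimizer": "RMSprop",
    "learning_rate": 1e-4,
    "dropout_rate": 0.3,   # between the CNN and the BLSTM
    "beam_width": 30,      # CTC beam search decoder
    "patience": 10,        # evaluations without CER improvement
}


class EarlyStopping:
    """Signal a stop when the CER has not improved for `patience` evaluations."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, cer):
        """Record one evaluation; return True when training should stop."""
        if cer < self.best:
            self.best = cer
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up CER values, one per evaluation, with a short patience for the demo.
stopper = EarlyStopping(patience=3)
for step, cer in enumerate([0.40, 0.31, 0.27, 0.28, 0.27, 0.29]):
    if stopper.update(cer):
        print(f"stop at evaluation {step}, best CER = {stopper.best}")
        break
```

In this sketch an evaluation that merely equals the best CER counts as no improvement; whether ties should reset the counter is a design choice that the text does not specify.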

The character error rate is calculated by the following formula:

$\displaystyle\textit{CER}=\frac{S+D+I}{N}$ (18)

where $S$ , $D$ and $I$ denote the number of substituted characters, deleted and inserted characters respectively, and $N$ is the number of characters in the ground-truth.

Similarly, the word error rate is obtained by Eq. (19):

$\displaystyle\textit{WER}=\frac{S+D+I}{N}$ (19)

where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words respectively, and $N$ is the number of ground-truth words.
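Both measures rest on the minimum number of edit operations between the recognized text and the ground truth, so $S+D+I$ is the Levenshtein distance. A minimal sketch follows; the function names are hypothetical, the example strings are made up, and counting spaces as characters in the CER and splitting words on whitespace are assumptions that the text does not fix.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Eq. (18): character-level edit distance over the ground-truth length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Eq. (19): the same ratio computed on whitespace-separated words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


reference = "the cat sat"
hypothesis = "the cat sit"
print(cer(reference, hypothesis))   # 1 error over 11 characters
print(wer(reference, hypothesis))   # 1 error over 3 words
```

A single wrong character costs far more in WER than in CER, which is why the two measures are reported together. Both functions assume a non-empty ground truth.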

Table 5

System evaluation with different LSTM layers

Number of LSTM layers	CER %
1	17.53
2	15.83
4	15.20
6	15.5

Table 6

System evaluation with different hidden layer neurons

Number of hidden layer neurons	CER %
128	21.56
256	17.83
512	14.12

Table 7

Results recorded with dense-VGGNet using KHATT and AHTID/MW databases

Dataset/results	WER %
	Dense-VGGNet	VGGNet
KHATT	11.53	12.61
AHTID/MW	15.18	16.18

Table 8

Comparison study

Authors and Ref	Model	Data	Results
			WER %
BenZeghiba [43]	Handcrafted features MDLSTM $+$ LM	KHATT	24.1
Ahmad et al. [28]	Handcrafted features MDLSTM	KHATT	–
Jemni et al. [29]	Handcrafted features BLSTM $+$ LM	KHATT	15.19
Jemni et al. [32]	Learned features MDLSTM $+$ LM	KHATT	20.83
		AHTID/MW	18.13
Noubigh et al. [30]	CNN $+$ BLSTM	KHATT	20.1
Noubigh et al. [31]	Learned features VGGNet $+$ BLSTM	KHATT	12.61
		AHTID/MW	16.18
Proposed system	Learned features densely-VGGNet $+$ BLSTM	KHATT	11.53
		AHTID/MW	15.18

5.1 Experimental results

A series of experiments is conducted to select appropriate parameters. The performance of the proposed method is heavily affected by three key factors: the feature extractor architecture, the BLSTM configuration (number of layers and number of hidden neurons), and the decoder used. In previous work [31], we showed the effectiveness of the CTC beam search decoder combined with BLSTM. Therefore, the CTC beam search decoder is used for all experiments in this work. We start by studying the effects of the BLSTM parameters. Once all parameters are set, we compare the results of the two developed architectures, VGGNet and densely-VGGNet. Lastly, we evaluate the performance of the final CRNN architecture and compare it with other state-of-the-art methods on KHATT and AHTID/MW.

5.1.1 BLSTM parameters

The number of LSTM layers is varied while the other parameters are fixed. As Table 5 shows, the CER decreases clearly when the number of LSTM layers changes from 1 to 2. Four layers give only a slightly lower CER than two (15.20% versus 15.83%), and when the number changes from 4 to 6, the model performance degrades. This suggests that a deeper LSTM stack may harm test results because of overfitting. Therefore, we use two LSTM layers in our model, as a compromise between performance and speed.

In addition, the model performance is influenced by the number of hidden-layer neurons in the LSTM network. Table 6 reports the CER for each number of hidden-layer neurons. When the number increases from 128 to 256, the CER decreases by 3.73 percentage points (from 21.56% to 17.83%). Increasing it further to 512 lowers the CER to 14.12%, which is still a considerable gain. Therefore, we set the number of hidden-layer neurons to 512.

Figure 12.

Comparison between CER evolution using VGGNet and dense-VGGNet.

5.1.2 Densely-VGGNet improvement

To expose the advantages of the improved dense-VGGNet model, we compare its experimental results with those recorded by the original model [31]. Both the KHATT and AHTID/MW handwritten Arabic text line databases are used for these experiments.

From the results summarized in Table 7, after adding feature reuse, the word error rate is reduced by about 1 percentage point (from 12.61% to 11.53% on KHATT and from 16.18% to 15.18% on AHTID/MW). This is mainly due to improved recognition accuracy in some difficult situations in Arabic handwriting, such as characters with various sizes, fonts and shapes. As mentioned previously, the feature reuse network combines semantic information of different levels and fully considers the relationship between context information. The graph in Fig. 12 presents the evolution of the CER and WER. It illustrates the substantial improvement obtained with the dense-VGGNet model. The recognition accuracy tends to 89% (a WER of about 11.5%) after 100 epochs.

6. Discussion and comparison

This work introduces a new recognition system using character modeling that improves on our previously proposed recognition model [31]. Our proposal defines a deep learning approach for Arabic handwriting text recognition that combines a very deep CNN architecture with BLSTM. The system relies on a cascade of CNN, BLSTM and CTC layers. To validate the effectiveness of the suggested framework, we have presented experimental results on two Arabic handwritten text line databases, namely KHATT and AHTID/MW. We also present a study that explores the use of different deep CNN architectures to improve the Arabic handwriting recognition system.

It should be noted that increasing the depth of neural networks also increases the number of parameters in the models, and therefore their capacity. Reusing the feature maps by adding a dense layer was thus proposed to improve the system performance, and it gives better results with reduced depth.

Our densely-VGGNet architecture achieves a very promising recognition rate of 89% and 85% when applied to the KHATT and AHTID/MW databases, respectively.

Moreover, an even larger decrease in the word error rate, of about 9 percentage points, was obtained with the dense layer compared with our basic deep CNN model [30]. Consequently, our results indicate that the feature-reuse-based VGGNet models the contextual information well. For the BLSTM network, we have shown that increasing the number of units per layer yields larger improvements than increasing the number of hidden layers, which indicates that increasing network depth can lead to overfitting. Additionally, we have demonstrated the capacity of the BLSTM classifier combined with the CTC beam search decoder to handle the specificities of Arabic text without the need for segmentation.

On the other hand, CTC beam search is an interesting part of the proposed model: it clearly improves recognition accuracy. Along with these improved results, the over-fitting problem is addressed by the use of batch normalization and dropout.

Finally, we compared the performance of our model with models proposed for Arabic handwriting text line recognition on the KHATT and AHTID/MW databases. In Table 8, we present the properties of each model and its results. For each system, we give the database used, the model used and the results.

We can conclude that the proposed system achieves higher accuracy as an open-vocabulary model compared with other methods based on limited vocabulary. We think this is mainly due to two advantages of the proposed CRNN: first, the feature reuse network provides better feature representations than simply stacked CNNs; second, the CTC beam search algorithm is better suited to decoding the input sequence.

The obtained results are considerable in comparison with studies using other classification methods. This contribution addresses an interesting challenge in the computer vision and pattern recognition field, and it should motivate further exploitation of deep learning strategies.

7. Conclusion

In this paper, we proposed an offline open-vocabulary Arabic handwriting text recognition system. This is quite a hard problem, since it is impossible to cover all Arabic words within a predefined lexicon. To this end, we made improvements to the different stages of the handwriting recognition system. These improvements were mainly based on preliminary studies. We focused on two main components: finding a suitable feature representation and implementing a robust recognition engine. The proposed approach is based on two popular architectures: Convolutional Neural Networks and Long Short-Term Memory recurrent neural networks. We validated our approach on two public Arabic text line databases: KHATT and AHTID/MW.

We described the details of the proposed recognition model. We provided several experiments and performed a detailed analysis of the parameters of the proposed model. The results demonstrate a clear advantage of the CRNN based on densely-VGGNet and BLSTM.

The results of the proposed method in terms of word recognition are still insufficient, due to the lack of training images and some weaknesses in the decoding process. To overcome this problem, we plan to improve the proposed system in future research using transfer learning techniques [21].

References

1. Ahmed R., Dashtipour, Gogate, Raza, Zhang, Huang, Hawalah, Adeel and Hussain, Offline arabic handwriting recognition using deep machine learning: A review of recent advances, Advances in Brain Inspired Cognitive Systems 2020, pp. 457–468.

2. Nanehkaran Y.A., Zhang, Salimi, Chen, Tian and Al-Nabhan, Analysis and comparison of machine learning classifiers and deep neural networks techniques for recognition of Farsi handwritten digits, The Journal of Supercomputing 77 (2021), 3193–3222.

3. Sharma and Jayagopi D.B., Towards efficient unconstrained handwriting recognition using dilated temporal convolution network, Expert Systems with Applications 164 (2021), 114004.

4. Ali A.A.A. and Suresha, Survey on segmentation and recognition of handwritten arabic script, SN Comput Sci 1 (2020), 192.

5. Mezghani, Slimane and Kherallah, Writing type, script and language identification in heterogeneous documents, International Journal of Intelligent Systems Technologies and Applications (IJISTA) 16 (2017), 225–245.

6. Altwaijry and Al-Turaiki, Arabic handwriting recognition system using convolutional neural network, Neural Computing and Applications 33 (2021), 2249–2261.

7. Ahmed, Gogate, Tahir, Dashtipour, Al-Tamimi, Hawalah, El-Affendi M.A. and Hussain, Novel deep convolutional neural network-based contextual recognition of arabic handwritten scripts, Entropy (Basel) 23(3) (2021), 340.

8. Ahmad, Mahmoud S.A. and Fink G.A., Open-vocabulary recognition of machine-printed Arabic text using hidden Markov models, Pattern Recognit 51 (2016), 97–111.

9. Sengupta, Basak, Saikia, Paul, Tsalavoutis, Atiah, Ravi and Peters, A review of deep learning with special emphasis on architectures, applications and recent trends, Knowledge-Based Syst 194(4) (2020), 105596.

10. Krizhevsky, Sutskever and Hinton G.E., Imagenet classification with deep convolutional neural networks, Communications of the ACM 60 (2017), 84–90.

11. Chung, On deep multiscale recurrent neural networks, Ph.D. Dissertation, Montreal University, 2018.

12. Tong, Gao, Chen, Wang and Yang, MA-CRNN: A multi-scale attention CRNN for Chinese text line recognition in natural scenes, International Journal on Document Analysis and Recognition (IJDAR) 23 (2020), 103–114.

13. Parvez M.T. and Mahmoud S.A., Offline Arabic handwritten text recognition: A survey, ACM Computing Surveys 45(2) (2013), 1–35.

14. Siddhu M.K., Parvez M.T. and Yaakob S.N., Combining statistical and structural approaches for arabic handwriting recognition, International Conference on Computer and Information Sciences (ICCIS) 2019, pp. 1–6.

15. Abdul Hassan A.K., Mahdi B.S. and Mohammed A.A., Arabic handwriting word recognition based on scale invariant feature transform and support vector machine, Iraqi Journal of Science 60(2) (2019), 381–338.

16. Kadhm M.S. and Karim, Offline isolated Arabic handwriting character recognition system based on SVM, International Arab Journal of Information Technology (IAJIT) 16(3) (2019), pp. 467–472.

17. Jayech, Mahjoub M.A. and Ben Amara N.E., Synchronous multi-stream hidden Markov model for offline Arabic handwriting recognition without explicit segmentation, Neuro-Computing 214 (2016), 958–971.

18. Mezghani, Kallel, Kanoun and Kherallah, Contribution on character modelling for handwritten Arabic text recognition, International Afro-European Conference for Industrial Advancement: AECIA 2016, pp. 370–379.

19. Elleuch, Maalej and Kherallah, A new design based-SVM of the CNN classifier architecture with dropout for offline Arabic handwritten recognition, Procedia Computer Science 80 (2016), 1712–1723.

20. Amrouch and Rabi, Deep neural networks features for arabic handwriting recognition, International Conference on Advanced Information Technology, Services and Systems AIT2S 2017, pp. 138–149.

21. Noubigh, Mezghani and Kherallah, Transfer learning to improve Arabic handwriting text Recognition, 21st International Arab Conference on Information Technology (ACIT) 2020, pp. 1–6.

22. El-Sawy, EL-Bakry and Loey, CNN for handwritten arabic digits recognition based on LeNet-5, Proceedings of the International Conference on Advanced Intelligent Systems and Informatics AISI 2016, pp. 566–575.

23. Ashiquzzaman and Tushar A.K., Handwritten Arabic numeral recognition using deep learning neural networks, IEEE International Conference on Imaging, Vision & Pattern Recognition (icIVPR) 2017, pp. 1–4.

24. Alaasam, Kurar, Kassis and El-Sana, Experiment study on utilizing convolutional neural networks to recognize historical Arabic handwritten text, 1st International Workshop on Arabic Script Analysis and Recognition (ASAR) 2017, pp. 124–128.

25. Younis K.S., Arabic handwritten characters recognition based on deep convolutional neural networks, Jordanian Journal of Computers and Information Technology (JJCIT) 3(3) (2018), 186.

26. Khémiri, Echi A.K. and Elloumi, Bayesian versus convolutional networks for arabic handwriting recognition, Arabian Journal for Science and Engineering 44 (2019), 9301–9319.

27. Elleuch and Kherallah, Boosting of deep convolutional architectures for Arabic handwriting recognition, International Journal of Multimedia Data Engineering and Management (IJMDEM) 10(4) (2019), 26–45.

28. Ahmad, Naz, Afzal M.Z., Rashid S.F., Liwicki and Dengel, The impact of visual similarities of Arabic-like scripts regarding learning in an OCR system, 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) 2017, pp. 15–19.

29. Jemni S.K., Kessentini, Kanoun and Ogier J.M., Offline Arabic handwriting recognition using blstms combination, 13th IAPR International Workshop on Document Analysis Systems (DAS) 2018, pp. 31–36.

30. Noubigh, Mezghani and Kherallah, Contribution on Arabic handwriting recognition using deep neural network, International Conference on Hybrid Intelligent Systems (HIS) 2019, pp. 123–133.

31. Noubigh, Mezghani and Kherallah, Open vocabulary recognition of offline Arabic handwriting text based on deep learning, International Conference on Intelligent Systems Design and Applications (ISDA) 2020, pp. 92–106.

32. Khamekhem, Kessentini and Kanoun, Out of vocabulary word detection and recovery in Arabic handwritten text recognition, Pattern Recognition 93 (2019), 507–520.

33. Aziz, Deep learning: An overview of convolutional neural network (CNN), Ph.D. Dissertation, Faculty of Information Technology and Communication Sciences M.Sc, 2020.

34. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9 (1997), 1735–1780.

35. Schuster and Paliwal K.K., Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997), 2673–2681.

36. Graves, Fernández, Gomez and Schmidhuber, Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks, ICML ’06: Proceedings of the 23rd International Conference on Machine Learning 2006, pp. 369–376.

37. Lin and Wang, A hardware-oriented and memory-efficient method for CTC decoding, IEEE Access 7 (2019), 120681–120694.

38. Scheidl, Fiel and Sablatnig, Word beam search: A connectionist temporal classification decoding algorithm, 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) 2018, pp. 253–258.

39. Mahmoud S.A., Ahmad, Al-Khatib W.G., Alshayeb, Parvez M.T., Märgner and Fink G.A., KHATT: An open Arabic offline handwritten text database, Pattern Recognit 47 (2014), 1096–1112.

40. Mezghani, Kanoun, Khemakhem and Abed H.E., A database for arabic handwritten text image recognition and writer identification, International Conference on Frontiers in Handwriting Recognition 2012, pp. 399–402.

41. Abadi, Barham, Chen, Davis, Dean, Devin, Ghemawat, Irving, Isard, Kudlur, Levenberg, Monga, Moore, Murray D.G., Steiner, Tucker, Vasudevan, Warden, Wicke and Zheng, TensorFlow: A system for large-scale machine learning, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), USENIX Association 2016, pp. 265–283.

42. Tieleman and Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning 4 (2012), 26–31.

43. BenZeghiba M.F., A comparative study on optical modeling units for off-line Arabic text recognition, 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017, pp. 1025–1030.
