Abstract
Classical handwriting recognition systems depend on manual feature extraction that requires substantial prior domain knowledge, which makes it difficult to train an optical character recognition system. Deep learning approaches are at the centre of handwriting recognition research and have yielded breakthrough results in recent years. However, the rapid growth in the amount of handwritten data, combined with the availability of enormous processing power, calls for higher recognition accuracy and warrants further investigation. Convolutional Neural Networks (CNNs) are extremely good at perceiving the structure of handwritten characters in ways that allow for the automatic extraction of distinct features, making CNNs a strong choice for solving handwriting recognition problems. In this research work, a novel CNN is built whose network structure is modified with Orthogonal Learning Chaotic Grey Wolf Optimization (CNN-OLCGWO). This modification is adopted for evolutionarily optimizing the hyper-parameters. The proposed optimizer predicts the optimal values from the fitness computation and shows better efficiency when compared to various conventional approaches. The ultimate target of this work is to provide a suitable path towards digitalization by offering superior accuracy and better computation. MATLAB 2018b is used as the simulation environment to measure metrics such as accuracy, recall, precision, and F-measure. The proposed CNN-OLCGWO offers a superior trade-off compared with prevailing approaches.
Keywords
Introduction
Handwriting recognition plays an important role in information processing in today’s digital age. On paper, there is a lot of information, and processing digital files is less expensive than processing conventional paper files. A handwriting recognition system’s goal is to convert handwritten characters into formats that are machine readable. Vehicle licence plate recognition, postal letter sorting, Cheque Truncation System (CTS) scanning and historical document preservation in archaeology departments, old document automation in libraries and banks, and other applications are among the most common. All of these fields deal with massive databases, necessitating high recognition precision, reduced computing complexity, and coherent recognition system performance.
Deep neural architectures are thought to be more beneficial than shallow neural architectures. Databases such as CEDAR, MNIST, and CENPARMI are available online, which promotes research advances in the pattern recognition field [1]. Among these datasets, MNIST is the benchmark dataset for various pattern recognition tasks [2]. Various classifiers, such as neural networks (NNs) and Restricted Boltzmann Machines (RBMs), have been validated on the MNIST dataset [3]. In recent times, handwritten digits have been recognized using deep learning (DL) classifiers, which has opened a newer research area because of its many applications.
Considering the challenges associated with the handwritten digit recognition domain, the authors in [4] modelled various algorithms and schemes. As handwritten digits come in various styles and orientations, investigators face enormous challenges in the automated recognition of handwritten digits. Ciresan et al. proposed a CNN for handwritten character classification [5]. Arora et al. considered two architectures, a CNN and a Feed-forward Neural Network (FFNN), for feature extraction, training, and classification on the MNIST dataset of handwritten images [6].
The results from these network models reveal that the CNN attains better accuracy than the FFNN in handwritten digit recognition: the classification accuracy of the FFNN is 90%, while that of the CNN is 95%. Ghosn et al. performed a comparative analysis of CNN, Deep Neural Networks (DNN), and Deep Belief Networks (DBN) on the MNIST dataset [7]. In that work, the classification accuracy of the CNN is 98% with specific error rates. Similarly, Anil et al. explain the training of a CNN with the backpropagation algorithm and gradient-based learning for Malayalam character recognition [8]. The authors report an accuracy of 75% for these models.
The proposed CNN-OLCGWO adopts orthogonal initialization to ensure a uniform distribution of particles over the search space, which avoids local minima and premature convergence. However, the existing CNN models are computationally expensive and waste resources when adopted for less complex research problems. CGWO is chosen by the filter maps of the convolutional layer randomly.
The experimental outcomes offer proof for CNN-OLCGWO adoption with SVM-CGWO to handle complex problems. Moreover, this work attempts to reduce the human effort as the performance of CNN-OLCGWO for handwritten digit recognition works beyond social skills.
This work differs from existing work in that it shows the effectiveness of a CNN combined with the OLCGWO model. The pixel features are extracted from the input image using a Support Vector Machine (SVM), where both linear and non-linear analysis is performed to form a Bag of Features (BoF). Most of the existing works have not marked any milestone in prediction accuracy using the CNN model. Therefore, this work models a CNN framework with the integration of OLCGWO over the network layers.
The novel contributions of this paper can be outlined as follows. First, the proposed CNN-OLCGWO model takes advantage of both CNN and Grey Wolf Optimization algorithms, which leads to accurate classification of handwritten digits. Second, the proposed CNN-OLCGWO model improves the recognition rate when compared to the prevailing CNN model and paves the way for a newer digitalization process. Third, the new hybrid optimizer operates as the underlying algorithm for the proposed model and reduces the classification error during the training phase. Fourth, the hybrid approach attempts to reduce execution time by avoiding the feature spaces used for training and attains an optimal model to classify handwritten digits.
The present work is structured as follows. Section 2 gives a brief overview of current research on Optical Character Recognition (OCR) and deep learning classifiers, followed by the research gaps identified. Section 3 presents the proposed methodology, which discusses GWO, the orthogonal learning strategy, and the CNN classifier. Sections 4 and 5 describe the experimental setup with numerical results and discussions. Section 6 is the conclusion with future research directions.
Literature review
Handwritten digit recognition algorithms are trained using the available datasets and are evaluated on whether they classify digits or characters accurately. The classification process learns a model from the provided input and maps it to labels from pre-defined classes. This section discusses in detail the available OCR datasets and the classification problem.
Pandey et al. describe the CEDAR dataset, developed by Buffalo University in 2002, which is an extensive database of handwritten characters [9]. Here, the images are at 300 dpi. Kozielski et al. describe the MNIST dataset, which is the most-cited dataset for recognition of handwritten digits [10]. It is a subset of the NIST dataset, composed of 60,000 training and 10,000 testing images. The digits are size-normalized to 20×20 grayscale images, which reduces the time needed for formatting and pre-processing. Kai et al. describe an Urdu language dataset, which supports both writer identification and character recognition [11]. It comprises 62,000 words and 56,248 characters written in calligraphy style at 300 dpi by one hundred writers.
The evaluation is made with 20 and 50 text line images for testing and training, respectively. The author reported the error rate as 0.004–0.006% and describes the popular Arabic database, which was constructed in 2002 by the Technical University of Braunschweig. It comprises 26,459 handwritten images from which 212,211 characters are extracted, written by 411 different writers. Various researchers use it extensively for efficient recognition of Arabic characters. The author also discusses the CENPARMI dataset, which was constructed in 2006. It is composed of 18,000 samples, partitioned into 11,000 for training, 5,000 for testing, and 2,000 for verification. This dataset is constructed from 102,352 digits extracted from high school registration forms. It is the second form of the most extensive dataset. It comprises 432,357 images, which include words, dates, isolated letters, numerical strings, documents, and special symbols.
Jain et al. performed handwriting recognition with an ANN for Marathi characters [12]. The experimentation is validated with 500 characters taken from various people, and the author attained an accuracy of 92%. Broumandnia et al. provide an OCR system for predicting Sanskrit characters using a Support Vector Machine (SVM) [13]. The author considers multiple datasets for various languages and reports an accuracy of 98%. Elleuch et al. propose a reliable approach for character recognition and segmentation for the Latin language [14]. The segmentation is done by analyzing the structural properties, where joined and overlapped characters are segmented using graph distance theory. Here, the k-NN classifier is used for printed input characters and handwritten digit recognition, whereas the SVM classifier is used for validating the segmentation outputs. The accuracy attained by the author for Latin script is 97%.
In [15], a new system is modelled for handwriting recognition of isolated characters with ensemble classifiers. Binarization is used for character image processing; then, the Histogram of Oriented Gradients is applied for extracting the essential features. Finally, Neural Networks (NN), k-NN, and SVM are used for classification. The results from these classifiers are merged to form an ensemble classifier, and the class labels are obtained with the majority voting technique. The accuracy attained with this process is 88% after ensembling.
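The voting step described for the ensemble in [15] can be illustrated with a minimal sketch. The function name `majority_vote`, the example labels, and the tie-breaking rule are assumptions made here for illustration; the source does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers.

    Ties are broken in favour of the label that appears first in
    `predictions` (an arbitrary choice for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of the NN, k-NN and SVM classifiers for one image.
votes = ["7", "1", "7"]
print(majority_vote(votes))
```

With three base classifiers, a label wins as soon as two of them agree; when all three disagree, this sketch simply falls back to the first classifier's output.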
In [16], a template matching technique is applied for character classification using pre-defined templates. Here, distance similarity metrics such as city block distance, Euclidean distance, and normalized cross-correlation are used. The most common method used for character recognition is deformable template matching, in which deformed images are evaluated against the images of a known database, so that classification is done with deformed shapes. The authors also explain structural pattern recognition, which attempts to categorize objects based on pattern primitives and pattern structures. The Chain Code Histogram is used for classifying image, curve, and character boundaries. However, this method is found to be more complex when applied to handwritten character recognition.
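The similarity measures named for template matching can be sketched as follows. This is an illustration only: images are assumed to be flattened into equal-length lists of pixel values, the third measure is assumed to be normalized cross-correlation, and the helper `classify` and the tiny 2×2 templates are hypothetical.

```python
import math


def city_block(a, b):
    """City block (L1) distance between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean (L2) distance between two flattened images."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalized_cross_correlation(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1]; 0.0 if an image is flat."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def classify(image, templates, distance=euclidean):
    """Return the label of the template closest to `image`."""
    return min(templates, key=lambda label: distance(image, templates[label]))


# Hypothetical 2x2 templates, flattened row by row.
templates = {"bar": [0, 1, 0, 1], "dash": [1, 1, 0, 0]}
print(classify([0, 1, 0, 0.9], templates))
```

The two distances are minimized, whereas cross-correlation is a similarity and would be maximized if used as the matching criterion.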
Structural pattern recognition is sub-divided into grammar-based and graph-based methods. In the former, the similarity among the structural primitives is identified with grammar concepts. The latter measures the similarity between graphs. For OCR, trees and strings are used for representing the models based on grammar, and the strings produced with such a model are easily classified for character recognition. The tree structure shows the hierarchical relation among the structural primitives. Similarly, with the graph model, the relationships among the connected objects are used for representing alphabets, digits, and characters.
Investigators use various graph similarity measures, including the SimRank algorithm, the similarity flooding algorithm [17], vertex similarity methods, and graph similarity methods [18]. These methods are observed to work efficiently for recognizing characters and digits. However, the aforesaid analyses fail to address the research questions listed below. For handwritten digit recognition, what recent feature extraction and classification procedures are used? What other types of datasets are available for research purposes? What is the name of the new science domain that has replaced OCR?
The proposed model answers the research questions mentioned above. This work attempts to bridge the gap left by the existing flaws with a newer research idea for enhancing the prediction accuracy. Handwriting recognition for different languages is compared in Table 1.
Comparison of Handwritten recognition for different languages
This section elaborates the methodology of this research for handwritten digit recognition using MNIST corpus. The stages of research are: Data acquisition and pre-processing, feature extraction, and classification with CNN-OLCGWO. The block diagram of the proposed CNN-OLCGWO is shown in Fig. 1.

Sequential Flow of proposed model.
The Modified National Institute of Standards and Technology (MNIST) database was constructed by LeCun et al. [19]. This dataset is extensively used for various pattern recognition and machine learning-based applications. The database is drawn from two diverse sources, NIST's Special Database 1 and Special Database 3. The former samples are collected from high-school students, while the latter samples are gathered from Census Bureau employees [20]. The samples are chosen for testing and training such that the same writers are not involved in both sets. The training set holds samples from approximately 250 writers, while the remaining writers are used for testing. The original images need to undergo pre-processing.
Initially, the original black-and-white images are size-normalized to fit a 20×20-pixel box while preserving the aspect ratio [21]. The anti-aliasing applied during normalization introduces grey levels, so the resulting images are grayscale. Finally, blank padding is added so that each image fits a 28×28-pixel box, with the centre of mass of the digit placed at the centre of the box. A sample of this kind is given in Fig. 2a.
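The padding step can be illustrated with a short sketch that places a size-normalized digit in a 28×28 canvas so that its centre of mass lands near the canvas centre. The function name `centre_in_canvas`, the list-of-lists image representation, the integer rounding of the offset, and the clamping that keeps the whole box inside the canvas are all assumptions of this sketch, not details taken from the source.

```python
def centre_in_canvas(img, size=28):
    """Pad a small grayscale image (list of rows) into a size x size canvas.

    The offset is chosen so that the intensity-weighted centre of mass of
    `img` is as close as possible to the canvas centre.
    """
    h, w = len(img), len(img[0])
    total = sum(sum(row) for row in img)
    if total == 0:
        # Blank image: fall back to the geometric centre.
        cy, cx = (h - 1) / 2, (w - 1) / 2
    else:
        cy = sum(r * v for r, row in enumerate(img) for v in row) / total
        cx = sum(c * v for row in img for c, v in enumerate(row)) / total

    centre = (size - 1) / 2
    # Clamp so the whole box stays inside the canvas (simplification).
    top = max(0, min(size - h, round(centre - cy)))
    left = max(0, min(size - w, round(centre - cx)))

    canvas = [[0] * size for _ in range(size)]
    for r in range(h):
        for c in range(w):
            canvas[top + r][left + c] = img[r][c]
    return canvas


# Hypothetical 20x20 digit: a vertical stroke drawn left of the box centre.
digit = [[0] * 20 for _ in range(20)]
for r in range(2, 18):
    digit[r][6] = 255
padded = centre_in_canvas(digit)
print(len(padded), len(padded[0]))
```

Because the stroke sits left of the box centre, the sketch shifts the box to the right inside the canvas, which is the effect the centre-of-mass alignment is meant to achieve.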

Dataset samples from MNIST.
The samples from the training set are given in Fig. 2b. Some of these digits are easily confused. Depending on the writing posture, the interpretation of individual images is extremely intricate; for example, 4 and 9 are frequently confused. In some cases, the digits are very hard to recognize. It is also shown below. On this dataset, various approaches report error rates that range from 0.20% to 0.90%, depending on the technique adopted.

Dataset samples from MNIST: Training set.
Generally, SVM handles both binary classification and regression problems, and its performance has been validated successfully in various applications. SVM aims to find the optimal hyperplane that categorizes the data points by separating the points of two different classes [22]. In the linearly separable case, SVM finds the separating hyperplane with the largest margin. It is expressed as in Equations (1) & (2):
The hyperplane intends to separate the data iff it satisfies the given condition in Equation (3):
Here, ′x′ is the input pattern, ′w′ is the weight vector, and ′b′ is the bias. The weight and bias are chosen to give the maximal margin of 1/||w||, subject to all training patterns lying on or outside the margin. It is expressed as in Equation (4):
Here, y_i ∈ {−1, 1} are the training pattern labels. The data needs to be separated with a maximal-margin hyperplane, which is equivalent to minimizing ||w||^2 = w^T w. It is expressed as in Equation (5):
Here, (x_i, y_i) are the training samples in the training data and ′n′ is the number of training instances.
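The equations referenced in this passage are not reproduced in the text. As a hedged reconstruction, the standard hard-margin linear SVM formulation, written with the symbols defined above (w, b, x_i, y_i, n), is sketched below. It is assumed rather than confirmed that the original equations take this form, and the original numbering may differ.

```latex
\begin{align}
  f(x) &= w^{T}x + b, \\
  y_i\,\bigl(w^{T}x_i + b\bigr) &\ge 1, \qquad i = 1,\dots,n, \\
  \min_{w,\,b}\ \tfrac{1}{2}\,w^{T}w
  \quad &\text{subject to}\quad
  y_i\,\bigl(w^{T}x_i + b\bigr) \ge 1,\ \ i = 1,\dots,n .
\end{align}
```

Under this formulation the closest training patterns satisfy the constraint with equality, so each lies at distance 1/||w|| from the hyperplane, which matches the margin stated in the text.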
The Chaotic Grey Wolf Optimization (CGWO) is used for feature learning to avoid falling into a local optimum. The conventional GWO, which mimics the ranking and hunting behaviour of grey wolves, is adopted in this research to learn the optimal features [23].
There are four different types of grey wolves: alpha (α, the leader), beta (β, which helps in decision making), delta (δ, submissive towards α and β), and omega (ω, the obedient wolves). Alongside this social hierarchy, the hunting group shows other social characteristics. The hunting stage includes three steps: 1) tracking, chasing, and approaching the prey; 2) pursuing and encircling the prey until it stops moving; 3) attacking the prey. The CGWO model is mathematically structured based on the wolves' social hierarchy. The levels of GWO are given in Fig. 3.

Hierarchy of GWO algorithm.
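To make the hierarchy concrete, the following is a minimal sketch of the conventional GWO position update, in which every wolf moves towards the average of the positions suggested by the α, β, and δ wolves. The chaotic component is illustrated with a logistic map that replaces the uniform random numbers; the source does not state which chaotic map it uses, so that choice, the function names, and the sphere test function are assumptions of this sketch.

```python
import random


def logistic_map(x, r=4.0):
    """One step of the logistic map, used here as the chaotic sequence."""
    return r * x * (1.0 - x)


def gwo_step(wolves, fitness, a, chaos):
    """One GWO position update (minimisation); needs at least three wolves.

    `a` is the control parameter that usually decreases from 2 to 0, and
    `chaos` is the current state of the chaotic sequence in (0, 1).
    """
    ranked = sorted(wolves, key=fitness)
    alpha, beta, delta = ranked[0], ranked[1], ranked[2]
    new_wolves = []
    for x in wolves:
        new_x = []
        for d in range(len(x)):
            estimate = 0.0
            for leader in (alpha, beta, delta):
                chaos = logistic_map(chaos)
                r1 = chaos
                chaos = logistic_map(chaos)
                r2 = chaos
                A = 2 * a * r1 - a               # exploration/exploitation weight
                C = 2 * r2                       # random emphasis on the leader
                D = abs(C * leader[d] - x[d])    # distance to the leader
                estimate += leader[d] - A * D
            new_x.append(estimate / 3)           # average of the three guides
        new_wolves.append(new_x)
    return new_wolves, chaos


def sphere(x):
    """Toy fitness function with its minimum at the origin."""
    return sum(v * v for v in x)


random.seed(0)
wolves = [[random.uniform(-5, 5) for _ in range(2)] for _ in range(6)]
chaos = 0.7
iterations = 30
for t in range(iterations):
    a = 2.0 * (1 - t / iterations)   # decreases linearly from 2 towards 0
    wolves, chaos = gwo_step(wolves, sphere, a, chaos)
print(min(sphere(w) for w in wolves))
```

In a hyper-parameter search, each wolf position would encode one candidate configuration and the fitness would be a validation error; the toy function above only stands in for that.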
Orthogonal Learning based CGWO is an ‘intelligent movement mechanism’ for generating temporary positions r = (r_1, r_2, …, r_n)^T and h = (h_1, h_2, …, h_n)^T for all particles. It is expressed as in Equations (6) & (7):
Here, ′r′ and ′h′ are the social learning and self-cognitive components. The OLCGWO is applied over ′r′ and ′h′ to obtain the next position x′, and the particle velocity is obtained by evaluating the difference between the current position ′x′ and the new position x′.
The moving mechanism of OLCGWO efficiently merges the information of ′r′ and ′h′ to generate the successive particle position. During the searching process, the particles use the moving mechanism for generating the newer velocity and position. The moving mechanism is used for discovering the particles’ personal best and population best solutions. The orthogonal learning strategy adopts the moving mechanism by combining
Here, some promising information like
When the learning exemplar has guided the particle for the maximal number of stagnant generations, a new learning exemplar is reconstructed. Here, k_i is the stagnation counter of particle i. When the particle’s personal best solution is not updated, k_i is incremented by 1. When k_i > K, a new learning exemplar is constructed. This method avoids oscillation and enhances searching efficiency.
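The stagnation rule can be written as a small bookkeeping function. The name `update_stagnation` is hypothetical, and resetting the counter to zero after an improvement or after a rebuild is an assumption of this sketch; the source only states the increment and the k_i > K threshold.

```python
def update_stagnation(k_i, improved, K):
    """Update the stagnation counter of one particle.

    Returns (new_k_i, rebuild), where `rebuild` is True when a new
    learning exemplar should be constructed.
    """
    if improved:
        return 0, False        # assumed: an improvement resets the counter
    k_i += 1                   # personal best not updated this generation
    if k_i > K:
        return 0, True         # assumed: counter restarts after a rebuild
    return k_i, False


# Hypothetical run with K = 2 and no improvement for three generations.
k, K = 0, 2
for generation in range(3):
    k, rebuild = update_stagnation(k, improved=False, K=K)
    print(generation, k, rebuild)
```

A larger K lets a particle follow the same exemplar for longer before it is replaced, trading exploration for stability.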
This work considers dimensional learning which is inspired by OLCGWO to preserve the essential information from the particles. Here,
A typical CNN comprises three components: convolutional, pooling, and output layers. The pooling layer is optional; it reduces the input dimensions, which makes scaling to higher-resolution images easier. Input data is loaded into an input layer, which describes the width and height of the image. The hidden layers are considered the backbone of the CNN architecture. They perform feature extraction with convolution, pooling, and activation functions, and the features of the handwritten digits are distinguished in this stage. The convolutional layer is applied directly to the input image and extracts pixel features from the handwritten digit. The framework of the proposed CNN-OLCGWO algorithm is depicted in Fig. 4.

Proposed CNN-OLCGWO framework.
An n × n input convolved with an m × m filter gives an (n − m + 1) × (n − m + 1) output. The significant parameters of a CNN are the receptive field, stride, dilation, and padding. The receptive field is the small input region that affects a specific unit of the network. It is an essential parameter of the CNN architecture that helps in parameter setting, and it is influenced by the pooling, striding, depth, and kernel size of the CNN. It is the term used when evaluating sub-regions of the network. The stride is the step size by which the filter moves each time.
A stride value of ‘1’ specifies that the filter slides in a pixel-by-pixel manner. When the stride is larger, neighbouring receptive fields overlap less. Next, padding is introduced to enhance the accuracy of the CNN. It is used to control the shrinking of the output of the convolutional layer, whose output (feature map) is otherwise smaller than the input image. Without padding, the feature map retains more information about the middle pixels, and considerable information is lost at the corners. The ultimate target of optimization is to attain optimal values for the learning parameters. The OLCGWO is such an optimization algorithm, which reports better performance during learning parameter identification.
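The effect of kernel size, stride, padding, and dilation on the feature-map size can be checked with the usual output-size formula. The helper below is a sketch with a hypothetical name; it generalizes the n − m + 1 rule to the other three parameters.

```python
def conv_output_size(n, m, stride=1, padding=0, dilation=1):
    """Side length of the feature map for an n x n input and m x m kernel."""
    effective_kernel = dilation * (m - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


print(conv_output_size(28, 5))              # 28x28 image, 5x5 kernel
print(conv_output_size(28, 3))              # 28x28 image, 3x3 kernel
print(conv_output_size(28, 5, padding=2))   # padding that preserves the size
print(conv_output_size(28, 5, stride=2))    # larger stride, less overlap
```

For the 28 × 28 MNIST images, a 5 × 5 kernel with stride 1 and no padding gives a 24 × 24 feature map, while a padding of 2 keeps the output at 28 × 28.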
In this section, a CNN is developed and analyzed with the optimization parameters of OLCGWO to optimize recognition accuracy. The experimentation is done in the MATLAB 2018b environment on Windows 10 with an Intel Core i7 processor (2.50 GHz CPU). Here, MNIST (Special Database 1 and Special Database 3) is used for testing and training the model. This standardized database includes 60,000 training and 10,000 testing normalized digit images, where pre-processing plays an essential role in the recognition process. The dataset images are normalized to a fixed image size of 28 × 28. The convolution layer is used for feature extraction with 5 × 5 and 3 × 3 kernel sizes. Each patch holds the structural information of the input image. The convolution operation is performed by sliding the 5 × 5 filter over the input image.
In order to measure the performance of the proposed technique, the CNN architecture considers the MNIST database, the Urdu database, and the Chars74K benchmark dataset, wherein 80% of the data is designated for training and 20% is assigned for testing. The proposed method, CNN-OLCGWO, is compared with the existing CNN + LSTM, Pre-trained CNN + MLP, Pre-trained CNN + LSTM, Pre-trained CNN + SVM, and Dense trajectories with HoG approaches using the metrics Precision, Recall, F-Measure, and Accuracy. The following equations are applied to evaluate the performance parameters.
Here, True Positive (TP) is the number of digits correctly detected; False Positive (FP) is the number of non-digits wrongly classified as digits; False Negative (FN) is the number of digits that are wrongly left unidentified; True Negative (TN) is the number of correctly identified non-digits.
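The evaluation equations are not reproduced in the text. The sketch below uses the standard definitions of the four metrics in terms of TP, FP, FN, and TN, which are assumed to be the ones intended; the function name and the example counts are hypothetical, and no guard against zero denominators is included.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard precision, recall, F-measure and accuracy from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts, not results from the paper.
for name, value in classification_metrics(tp=90, fp=10, fn=30, tn=70).items():
    print(f"{name}: {value:.4f}")
```

The F-measure is the harmonic mean of precision and recall, so it always lies between the two values.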
The objective of this work is to investigate the parameters of the CNN for delivering better handwritten digit recognition using the SP-1 and SP-3 dataset. The overall performance of the CNN architecture is observed with the optimization of OL-SVM to deliver better recognition accuracy.
The accuracy of the proposed CNN + OLCGWO is compared with other CNN architectures. The best feature results are attained from the chosen features using the SVM classifier with the parameters σ = 1 and C = 100. The drawback of the other approaches is higher testing complexity and computational cost. The proposed CNN + OLCGWO gives 98% recognition accuracy on the MNIST dataset. The proposed CNN model uses a nominal number of layers to avoid over-fitting and to attain the optimal value by trial and error. Similarly, dropout can also be included to avoid over-fitting problems.
The training accuracy and loss of CNN + OLCGWO are simulated for ten epochs, along with the validation accuracy and loss over the same ten epochs. Finally, the accuracy, precision, recall, and F-measure of the proposed CNN + OLCGWO are computed as 98%, 98.76%, 99.7%, and 99.27%, respectively. A comparison of various CNN architectures with the proposed CNN + OLCGWO on the MNIST, Urdu, and Chars74K benchmark datasets is also presented. Here, the models use both pixel-based and geometric-based features for computation.
The accuracy comparison of different models is depicted in Fig. 5. It is evident that the proposed CNN-OLCGWO is more reliable, with 98% accuracy, when compared to CNN + LSTM, Pre-trained CNN-MLP, CNN + LSTM, CNN + SVM, and Dense trajectories with HoG. This is owing to the hybrid approach implemented in the proposed model. The parameters of the proposed classifier are tuned properly with the aid of the Grey Wolf Optimization algorithm. Further, it extracts the optimal features using SVM, and these features are processed and delivered to the proposed model. The existing models, however, fail to attain the optimal parameters, which results in lower accuracy than the proposed model.

Accuracy comparison of CNN-OLCGWO with existing models.
The validation accuracy and training accuracy are shown in Figs. 6 and 7 together with the loss measurements. The recognition rate of CNN-OLCGWO is 98% with the discriminative pixel features extracted by the SVM classifier (linear and non-linear) model. The proposed CNN-OLCGWO classifier model selects the pixel features, which are placed in the BoF. The recognition rate for digits 0–9 is approximately 98%, with an error rate that ranges from 0 to 2%.

Validation accuracy and loss measurements.

Training accuracy and loss measurements.
Similarly, the execution time ranges from 4 to 8 seconds, as in Fig. 8. From the figure, it is evident that the proposed model achieves minimum execution time while varying the number of digits. The main reason behind this reduction is that the proposed model employs an appropriate data pre-processing and acquisition strategy during the evaluation phase. Further, the optimal normalization of the proposed model categorizes the data points by separating the points of two different classes. The proposed hybrid model offers a proper balance between the accuracy and the convergence rate of the obtained features. This balance makes the proposed structure robust enough to be extended to a new language.

Recognition rate, error rate, and execution time of CNN-OLCGWO.
Figure 9 depicts the derived results for the MNIST dataset, where the proposed algorithm attains 98%, 98.76%, 99.7%, and 99.27% for Accuracy, Precision, Recall, and F-measure, respectively, as compared with other traditional approaches. This is because of the integration of the hybrid optimization algorithm with the CNN in the proposed model. The integration provides an appropriate balance between exploitation and exploration during the searching process of feature extraction. Besides, the Orthogonal Learning-based CGWO is an intelligent movement mechanism for generating a temporary position in the proposed model, which leads to fewer classification errors in the training phase. In contrast, the existing models misclassify data samples owing to an inappropriate balancing technique.

Performance metrics comparison.
The comparison of running time for different models is given in Table 2. It is evident from the table that the proposed CNN-OLCGWO model requires less running time than existing state-of-the-art models. The major reason for this is the limited number of parameters involved in the training phase of the proposed model. As an outcome, it requires less running time for feature extraction and supports a larger number of digits. In contrast, the existing CNN models use a larger number of parameters for training. They do not estimate suitable learning rates for every parameter, which results in higher running time and a slower convergence rate.
Comparison of Running Time for various models
According to Table 3, the proposed CNN-OLCGWO model identifies digits with an accuracy of 98%, whereas GoogLeNet and AlexNet obtain 88% and 77%, respectively. These significant enhancements are due to the OLCGWO algorithm in the proposed model, which reports better performance during learning parameter identification. In general, processing non-linear data is more problematic with handcrafted features. The proposed model can manage linear as well as non-linear data with the aid of the SVM classifier, which helps to extract more features. In contrast, the unsatisfactory decision-making of the existing models leads to poorer digit recognition and a higher error rate.
Handwritten digit recognition comparison of various CNN architectures
Hence, it can be concluded from the aforesaid statistics that the proposed method enhances the accuracy of digit classification and shows a better trade-off in the recognition rate of handwritten digits. It brings more significant advantages in tackling problematic issues and providing optimal decision-making. This novel hybrid combination in the proposed model will endorse the design of the pattern recognition technology.
The aim of this research work is to improve the performance of handwritten digit recognition. The CNN variants are analyzed here in order to avoid problems such as over-fitting and computational complexity. The MNIST database (SP-1 and SP-3) is used in this systematic assessment for training and testing. The proposed CNN-OLCGWO is introduced to optimize the hyper-parameters. Increasing the number of CNN layers leads to an over-fitting problem, making it impossible to obtain the optimum solution; with a nominal number of CNN layers, this work removes these disadvantages. The proposed innovation is based on combining effective feature extraction (pixel values) with hyper-parameter analysis to approach near-human performance. The non-linear and linear SVM models are used to extract the relevant features, which are then stored in a Bag of Features (BoF).
When compared to other traditional techniques, the proposed optimizer predicts the optimum values from the fitness calculation and is more efficient. The experimentation is carried out on the MATLAB platform, which is used to evaluate the performance of the proposed CNN-OLCGWO. In particular, the obtained results of the proposed CNN-OLCGWO are 98%, 98.76%, 99.7%, and 99.27% for Accuracy, Precision, Recall, and F-measure, respectively. Various CNN architectures will be developed in the future to provide a domain-specific recognition system. Similarly, optimizers such as Adam are being investigated for the optimization of CNN learning parameters such as learning rate, number of layers, and kernel size.
