A textual question answering and handwritten answer evaluation system for hindi language

Abstract

Textual Question Answering targets answering questions defined in natural language. Question Answering Systems offer an automated approach to procuring answers to queries expressed in natural language. The need for Multilingual Question Answering without performing machine translation is ever existing. Besides that, automating tasks with the help of technology to assist humans, has been the main aim of research in recent years. This paper presents an automated answer evaluation system for reading comprehension-based questions in the Hindi language without requiring translation in any other language. The system accepts text, question, and handwritten answer of a student in the form of an image for answer evaluation. This is accomplished by developing a textual question-answering system for reading comprehension. It is an extractive approach that utilizes RoBERTa transformer model and fine-tunes it for Hindi question-answering. The answer to the question is extracted as a span from the provided text. Further, a handwritten text recognizer model is developed employing a Convolutional Recurrent Neural Network with Connectionist Temporal Classification module along with two layers of Bidirectional LSTM. Experimentation is performed using existing as well as self-created datasets to show the effectiveness of the proposed approach. An accuracy of 98.69% is obtained on the self-created Hindi-QA dataset and the proposed system outperformed the other existing methods. The paper also discusses potential research directions in the field.

Keywords

Multilingual question answering natural language processing machine reading comprehension optical character recognition

1. Introduction

Question Answering (QA) is a computer science discipline within the fields of Information Retrieval (IR) and Natural Language Processing (NLP). It is a task of answering questions in natural language. These answers can either be generated or extracted from a given context. Generative QA involves the generation of answers, which requires searching for documents that may contain the answer and then generating the answer. However, the focus of the system presented in this paper is on Machine Reading Comprehension (MRC) based QA which is an extractive technique. It involves the extraction of the answer from a given context.

At present, the development of education systems and education-related technologies, to help the students, teachers, and the entire administration to function smoothly and efficiently without spending a lot of time to achieve tasks, has been growing tremendously. Artificial Intelligence and Machine Learning have become popular fields for solving many problems based on human intelligence to reduce time spent by a human doing tasks that today machines and algorithms can do equally well. A system for automated checking of answer sheets of students can be of great help to the teachers. Handwritten Text Recognition (HTR) model can be employed to recognize and convert handwritten text into digital format. This paper presents a system for the extraction of handwritten text from an image, that contains the answer written by the student and verifies the answer for correctness, and provides a score for the written answer. The main focus of the work presented in this paper is to handle the automated valuation of Reading Comprehension based questions. To cater to the task, first, a Question Answering System for Machine Reading Comprehension is presented followed by the HTR model. Together, these models create an automated answer valuation system.

To increase the reachability and usefulness of the model it is important that the Question Answering System can understand and extract answers from multiple languages. The main aim of this research is to increase the usefulness of already existing question-answering models by creating one single model capable of handling different languages. The presented Question Answering system can handle multiple languages like English, Hindi, Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese and returns an appropriate answer for the question in that language. Although the QA system is a Multilingual Question Answering model, the target of the work presented in this paper is the valuation of Reading comprehension questions, particularly in the Hindi language.

The question-answering system is provided with a paragraph and a question for extraction of answers from the given passage. The provided paragraph and question are assumed to be the same as those given to the students. The answer written by the student (in the form of an image) is provided as input to the HTR model to obtain the digital version of the answer. The digital answer is then compared with the answer provided by the QA system. Based on the correctness of the answer, the marks are allocated.

The main contributions of the work presented in this paper are summarized below:

•
A system for question-answering based on textual content is presented. As input, the given QA system takes a text and a question. The answer is a span taken from the provided text. To test the model’s performance on real-world data, a Hindi-QA dataset is created. Performance comparison on existing datasets is performed to confirm the effectiveness of the model.
•
An HTR model for extracting handwritten text from images is developed. This module recognizes the answer written in the image is used to obtain the digital version of the word.
•
The marks allocation module is presented that compares the extracted text (obtained from the HTR model) with the predicted answer (obtained from the QA system).
•
Experimentation on existing datasets and on self-created dataset demonstrating the efficacy and applicability of the proposed system in real-world scenarios.

The paper is organized as follows. Section 2 presents the literature review and discusses the required background model. The proposed approach is discussed in Section 3 followed by discussion on the datasets used in Section 4. Experimentation and results are presented in Section 5. Section 6 discusses the possible future direction and emerging research topics in the field that have the potential to further advance the research area. Finally, Section 7 concludes the paper.
2. Literature review

In this section, we review the literature on textual question-answering and HTR models for the Hindi language. Hindi is the fourth most spoken language in the world. It is a very popular Indic script and is used by about 400 million people in India and other countries [1].

2.1 Textual question-answering

Textual question-answering aims to find the answer to a question and is a popular natural language processing task. Most of the work in textual-QA is in the English language due to the availability of datasets and there have been few efforts to address other languages and multilingual question-answering.

A restricted domain multilingual QA system is presented in [2]. It utilizes Universal Networking Language (UNL) [3] to convert the contents of a document (Hindi or English) into an intermediate language. Initially, the answer type is determined by analyzing the question. The answer template is converted to UNL expression. The UNL expression of the question is matched with the documents to determine the answer. The UNL expression of the answer is converted back to Hindi or English. An accuracy of 60% is reported.

Gupta et al. [4] have proposed a Multi-domain Multi-lingual Question-Answering (MMQA) system for English and Hindi languages. It focuses on factoid and short descriptive questions. A dataset is created comprising 250 articles each for English and Hindi languages. For answering questions in Hindi they are first translated into English using Google Translate and then the answer is found. The system mainly has the following modules: question processing, question classification based on CNN and RNN, question formulation by stop word removal, passage retrieval, and extraction of answers from the identified passage.

Patrick Lewis [5] presents MLQA, a multi-way aligned extractive QA evaluation approach. It caters to 7 languages, namely English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. It consists of over 12K QA instances in English and 5K in each other language, with each QA instance being parallel between 4 languages on average. MLQA is built using a novel alignment context strategy on Wikipedia articles and serves as a cross-lingual extension to existing extractive QA datasets. The Hindi-English cross-language question-answering system utilizing newspaper articles is presented in [6]. The question is accepted in English, it is analyzed and the keywords are translated into Hindi language using a bilingual dictionary. The answer is searched in newspaper articles based on the Hindi keywords. The answer is translated back to English and presented as the result. Another cross-language system is presented in [7]. However, all the above discussed systems require language translation to an intermediate language or other natural languages.

A machine learning approach for Hindi Language QA is presented in [8]. The system is divided into three phases: Accessing natural language query; where the input query is read, preprocessed, and get tokenized. It is followed by the feature extraction phase; where specific feature vectors are identified from the results of the previous phase and finally the Classification phase; where the Naïve Bayes classifier has been used, along with the knowledge base already stored in the system.

However, due to the advancement in Deep Learning and its success researchers have applied deep learning-based techniques for QA in the English language. Also recently, the Bidirectional Encoder Representations from Transformers (BERT) has become a benchmark in NLP tasks because it reads text simultaneously in both directions rather than sequentially (from left to right). Prior training for BERT focuses on the goals of language modeling and next-sentence prediction. It is thereafter trained to carry out particular NLP tasks. Various variants of BERT like m-BERT, ALBERT, RoBERTa, etc. are trained on a vast corpus of language-specific data.

A BERT Based Multilingual Machine Comprehension in English and Hindi is presented in [9]. They utilized multilingugal BERT (m-BERT) with fine-tuning. BERT, ALBERT, and SPANBERT are utilized in [10] for question answering of Tweets in English.

Recently, a Hindi and Tamil dataset was qurated by Google [11], that lead to focus on Hindi language question answering. Experimentations on various settings with XLM-RoBERTa model is presented in [12]. The first model is a cross-lingual version of XLM-RoBERTa that has only been fine-tuned on the SQUAD [13] dataset. The second model is based on the pretrained RoBERTa model, but it uses a customised Indic tokenizer instead of a classification head. The hyperparameters are then optimized and fine tuned on the ChAII dataset. The third model is built on XLM-RoBERTa but has been fine-tuned and trained on the Indic dataset. The paired RoBERTa models outperformed the XLM-RoBERTa models because the training data was more focused according to the task, as opposed to the XLM-RoBERTa models, which contained a lot of data that did not include in Hindi pairs.

Multilingual Contrastive Training for Question-Answering in Low-resource Languages (MuCoT) is proposed in [14]. It utilizes m-BERT fine-tuned on SQUAD and ChAII dataset with results on Hindi question-answering. Namasivayam et al. [15] performed experimentation by fine-tuning XLM-RoBERTa on Hindi data from SQUAD, MLQA datasets. Further, they also fine-tuned MuRIL [16] and found it to give the best performance.

We investigate the possibility of utilising a transformer model for our purpose because transformer models have demonstrated high performance in question-answering. Next, we discuss the most basic transformer model-BERT, which serve as the foundation for further discussion in the paper.

BERT transformer model

Transformers are pre-trained on large volume of text data, which allows them to learn rich representations of language. These pre-trained models can be fine-tuned on specific tasks with relatively small amounts of task-specific data, making them highly effective for a wide range of NLP tasks without the need for extensive data and computational resources. Moreover, the fine-tuning may be accomplished with a small number of epochs.

BERT [17] is a language model that was trained on a huge and diverse corpus of text from the internet. BERT training is divided into two stages: pre-training and fine-tuning. BERT is pre-trained on a huge amount of text data to gain an understanding of the language. This pretraining stage aids BERT in the development of contextual word embeddings. After pretraining, BERT can be fine-tuned on specific downstream tasks, such as natural language understanding, Question-answering, language generation, etc. A task-specific dataset is required for fine-tuning.

Further, BERT employs two training techniques named Masked LM and Next Sentence Prediction (NSP). First, 15% of the words in the input word sequence are altered and replaced with a [MASK] token. The model then attempts to predict the masked words using the context provided by the non-masked words in the sequence. After which, for Next Sentence Prediction (NSP), the model learns to predict whether the second sentence in a pair of sentences will come after the first one in the original document. A [CLS] token is added at the start of the first sentence and a [SEP] token is added at the end of each sentence. A positional embedding is added to each token to indicate its position in the sequence and a sentence embedding is added for the two sentences to capture their sequence.

In this paper, we experiment with transformer-based models since they can be fine-tuned with fewer resources on smaller datasets after pretraining to maximize their performance on particular tasks and achieve good accuracy.

2.2 Handwritten text recognition for Hindi langauge

Table 1
Summary of Significant methods in HTR

Reference	Method	Dataset used	Dataset details	Accuracy
[18]	Quadratic classifier with 64	Numbers and Characters	11270 samples of vowels	Characters: 80.36%
	dimension feature	in Hindi	and consonants
[19]	Artificial neural network	Characters in Hindi	200	80%
[20]	CNN-RNN hybrid architecture	IIIT-HW-Dev	95.38 1word samples	88.73%

There is considerable work in recognition of the printed Hindi language text [21]. However, very less work exists in Hindi handwritten text recognition. The reason is the lack of availability of large datasets. Some researchers targeted the recognition of only Hindi numbers [22, 23, 24]. It is important to recognize characters, words, and sentences for complete digitization. A comprehensive review of text recognition in various languages can be found in [25]. We discuss the literature on Hindi character recognition. Table 1 shows a summary of significant work in HTR in the Hindi language.

A Zernike moment feature-based approach for Hindi character recognition using an artificial neural network for classification is proposed in [19]. Perwej et al. [26] used Characteristic Loci features and a backpropagation-based neural network for the recognition of handwritten characters. A quadratic classifier for the recognition of characters and numbers in the Hindi language was proposed in [18]. For feature extraction, a bounding box is created for the input character followed by contour extraction to create chain code. A 196 dimension size feature is obtained which is sampled down to obtain the final 64 dimension feature by application of the Gaussian filter. Chaudhuri et al. [26] proposed a soft computing-based Hindi text recognition model using HP Labs India Indic Handwriting dataset comprising 29,970 characters and numbers. Various pre-processing techniques such as binarization, noise removal, skew detection and correction, character segmentation, and thinning are applied, followed by feature extraction using discrete Hough transformation. The feature-based classification into various Hindi characters is performed utilizing soft computing techniques like rough fuzzy multilayer perceptron (RFMLP), fuzzy support vector machine (FSVM), fuzzy rough support vector machine (FRSVM), and fuzzy Markov random fields (FMRF). A CNN-RNN hybrid architecture is presented in [B], that employs a spatial transformer network (STN) for input transformation to remove geometric distortion. STN is followed by a set of residual convolutional blocks and stacked bi-directional LSTM layers. Finally, the CTC layer is used for transcribing the labels. It comprises 18 convolutional layers. In this paper, we also consider the IIIT-HW-Dev dataset since it is a word-level dataset, and others are character-level datasets.

3. The proposed approach

This section discusses the proposed Question Answering System for Machine Reading Comprehension dataset using Text Recognition for the Hindi Language. In particular, an automated system for evaluating the answer written by hand is presented. The overview of the proposed system is shown in Fig. 1. A question and text (or context) are provided as input to the QA system that extracts a span of words as output corresponding to the answer to the input question. This answer is compared with the handwritten answer, which is first converted to digital form using a Handwritten Text Recognition (HTR) model. The comparison helps in determining if the written answer is correct or not and also assists in score computation. To accomplish the task of extraction of text from the image, we explore the option to use a deep learning neural network and finally give a score to the answer. Further, the QA system uses a Natural language processing (NLP) technique capable of handling multiple languages. Next, we discuss the modules of the proposed approach: Textual Question Answering (textual-QA), Handwritten Text Recognition, and Answer Evaluation.

3.1 Textual question answering

Textual-QA involves extraction of the answer span from the given context. To accomplish this a Multilingual QA model is built using transformer-based RoBERTa model.

As discussed in Section 2, transformer models have brought about a significant advancement in natural language processing and various other machine learning tasks due to their unique architecture and capabilities. As tranformer models can be fine-tuned with less amount of task-specific data and require a very few epochs for fine-tuning, we have utilized transformer model and fine-tuned it for the QA task.

Figure 1.

System overview.

Various transformer models have been utilized for NLP tasks like BERT, m-BERT,ALBERT, RoBERTa, etc. as discussed in Section 2. Since, our focus is span extraction from the given context, we have employed RoBERTa (A Robustly Optimized BERT Pretraining Approach), a transformer-based language model performing textual-QA. It utilizes self-attention and processes input to generate contextual representations of words in a sentence.

Next, we discuss the RoBERTa model and present the rationale for selection of RoBERTa model over alternative models. After which the proposed textual-QA model that utilizes RoBERTa model is discussed.

3.1.1 RoBERTa model

RoBERTa [27] is an improved version of BERT with significant improvements to the pre-training phase. It is trained for longer sequences using a dataset ten times larger than that used to train the BERT model, with an increased number of iterations. In the pre-training of RoBERTa, the Masked LM part is done using a dynamic masking approach instead of static masking as done in BERT, and instead of NSP module, a sentence order prediction (SOP) task is used.

Dynamic masking is a more adaptive approach in which the masked tokens are changed during training. At the start of training, a few tokens are masked. This masking, however, changes with each epoch. This introduces more variation into the training data, making the model more robust to different input sequence lengths. SOP is performed by packing full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When one document is completed, the sampling of sentences from the next document is continued, and a separator token is inserted to demarcate the documents. The NSP loss is also removed.

Rationale for selection of RoBERTa model

As detailed above, RoBERTa model is an improvement of BERT and other variants of BERT. It is trained on a significantly larger and more diverse corpus of text data compared to BERT. Futher, it omits the Next Sentence Prediction (NSP) task, which was part of BERT’s pre-training objective. Removing NSP training allows RoBERTa to focus more on learning the core contextual language representations, resulting in improved performance. It uses a dynamic masking strategy where it randomly masks out and replaces tokens during pre-training. In contrast, BERT uses static masking with fixed probabilities. This dynamic masking approach makes RoBERTa more robust.

3.1.2 Textual question answering model

Figure 2.

Textual question-answering approach.

The proposed textual-QA model is presented in Fig. 2. A dataset is provided as input to the transformer model $(T)$ . The RoBERTa trasformer is used in the proposed model. The dataset comprises context $(C)$ , question $(Q)$ , and answer $(A)$ pairs for training or fine-tuning the pre-trained transformer model $T$ . The input dataset $D$ having $n$ context-question-answer pairs can be represented as:

$\displaystyle D=<C,Q,A>$

where $C=\{c_{1},c_{2},\ldots,c_{n}\}$ , $Q=\{q_{1},q_{2},\ldots,q_{n}\}$ , and $A=\{a_{1},a_{2},\ldots,a_{n}\}$ . The pre-trained transformer model $T$ is fine-tuned on the dataset to generate the fine-tuned QA model $(\textit{Model}_{ft})$ that can be used to perform multilingual question-answering. This can be represented as below:

$\displaystyle\textit{Model}_{ft}=T(<C,Q,A>)$

$\textit{Model}_{ft}$ when presented with context $c_{\textit{test}}$ and question $q_{\textit{test}}$ , generates the answer $a_{\textit{pred}}$ (called as predicted answer) from the context, represented as:

$\displaystyle a_{\textit{pred}}=\textit{Model}_{ft}(c_{\textit{test}},q_{% \textit{test}})$

where, $a_{\textit{pred}}$ is a subsequence from the sequence (context) $c_{\textit{test}}$ .

The fine-tuning is done using text from multiple languages (discussed in Section 4) making the $\textit{Model}_{ft}$ a multi-lingual QA model, capable of performing textual-QA in multiple languages.

3.2 Handwritten text recognition

Figure 3.

Extraction of handwritten words.

Extraction of handwritten words involves building a Handwritten Text Recognition (HTR) model. The overview of the method is presented in Fig. 3. The HTR dataset $(HD)$ having m images of handwritten text $(I)$ and its corresponding digital words $(W)$ is used for training the model capable of recognizing the handwritten text. Each input image $I_{i}\in I,$ where $I=\{I_{1},I_{2},\ldots,I_{m}\}$ is first pre-processed by grayscaling and resizing. The model is then trained to obtain the corresponding words $W=\{W_{1},W_{2},\ldots,W_{m}\}$ . The trained model $\textit{Model}_{\textit{htr}}$ when presented with the test image $I_{\textit{test}}$ predicts the handwritten word $a_{\textit{extract}}$ .

The HTR model is a Convolutional Recurrent Neural Network (CRNN) based model. CRNN involves both Convolutional Neural Network (CNN) as well as Recurrent Neural Networks (RNN). CNN applies convolution operations by sliding a filter across an image to extract features from the image and RNN allows considering data from the past for making predictions. The features from CNN are sent to RNN to examine these aspects sequentially while taking into account prior knowledge to identify certain key relationships between these features that affect the result of word recognition. To further consider the context, a layer in opposite direction can be added to take into account context from the past as well as the future.

Figure 4.

Convolutional recurrent neural network based handwritten text recognition model.

The architecture of the proposed HTR model is presented in Fig. 4. A stacked architecture is employed for the design of the model. The image features are extracted using five layer CNN network. It is followed by a two-layer stacked Bi-directional Long Short-Term Memory (Bi-LSTM) along with Connectionist Temporal Classification (CTC) loss [28]. The text is then decoded using CTC decode to obtain the word, character by character. Next, we discuss each module.

Convolution neural network for feature extraction

In CRNN, CNN is regarded as the feature extractor, for extracting features from the images of hand written text. Typically, it is divided into two parts. The first part is application of the convolution operation to extract features and the second part is application of pooling operation to adjust the output of convolutional layer.

A five-layer convolutional neural network is designed for image feature extraction as shown in Fig. 4. Each convolutional block has a 2D convolutional layer that applies Rectified linear unit (ReLu) activation function, and then max pooling.

In the convolutional layer, a filter is applied to the input data to produce feature maps. The filter slides over the input, and at each position, the dot product between the filter and the local region of the input is computed. After which a non-linear activation function ReLU is applied. This can be represented as follows,

$\displaystyle\hat{y}^{(i)}=\textit{ReLU}(w^{T}x^{(i)}+b)$ (1)

where $x$ is the input, $\hat{y}$ is the output of the convolution layer. $w$ refers to weight, $b$ is bias, ReLU is a non-linear activation function, and $i$ is the input index for N samples. The convolution operation on an image with width $w_{i}$ , height $h_{i}$ , and channel $c_{i}$ changes the channel component as follows,

$\displaystyle c_{u}=\frac{(c_{i}-F+2P)}{S}+1$ (2)

where, $c_{u}$ is the output channel, $F$ is the filter size, $P$ is padding, $S$ is stride.

Further, the maximum pooling layer pools the convolved results to extract important features of the image. The height and width components are changed using the pooling operation. These can be represented as follows,

$\displaystyle z_{u}=\frac{(z_{i}-F+2P)}{S}+1$ (3)

where, $z_{u}$ is the output width or height, $z_{i}$ is the input width or height, $F$ is the filter size, $P$ is padding, $S$ is stride.

Recurrent neural network using Bi-LSTM and text recognition

The output of CNN is then passed to RNN module, which comprises of two-layer Bi-LSTM for time-sequence and sequence dependency computations. It generates a probability matrix for characters at each time step. Rather than using LSTM, Bi-LSTM is utilized. LSTM performs transition of the previous hidden state $h_{t-1}$ to the current hidden state $h_{t}$ at time step $t$ with the input $k_{t}$ as follows:

$\displaystyle h_{t}=\textit{LSTM}(k_{t},h_{t-1})$

Mathematically, the LSTM is defined as,

$\displaystyle f_{t}=\sigma(W_{f}k_{t}+U_{f}h_{t-1}+b_{f})$ $\displaystyle i_{t}=\sigma(W_{i}k_{t}+U_{i}h_{t-1}+b_{i})$ $\displaystyle o_{t}=\sigma(W_{o}k_{t}+U_{o}h_{t-1}+b_{0})$ $\displaystyle\hat{c_{t}}=\textit{tanh}(W_{c}k_{t}+U_{c}h_{t-1}+b_{c})$ $\displaystyle c_{t}=f_{t}\bigodot c_{t-1}+i_{t}\bigodot\hat{c_{t}}$ $\displaystyle h_{t}=o_{t}\bigodot\textit{tanh}(c_{t})$

where $i_{t}$ is the input gate, $f_{t}$ is forget gate, $o_{t}$ is output gate, and $c_{t}$ is a memory cell. The sigmoid function $\sigma$ , and tanh are non-linear activation functions applied element-wise. $W_{f},W_{i},W_{o},W_{c},U_{f},U_{i},U_{o},U_{c},$ represent the weight matrices and $\bigodot$ represents an element-wise product operator.

As a result, the Bi-LSTM is implemented with two separate LSTM layers for computing $h_{f}$ and $h_{b}$ , which are forward and backward hidden sequences respectively. Thereby, in comparison to LSTM, Bi-LSTM generates better representation and produces better representations because of the additional training capabilities.

Connectionist temporal classification (CTC) loss and decode

CTC (Connectionist Temporal Classification) [28] is a loss function and decoding algorithm. The CTC loss function is used to train a neural network to learn a mapping between variable-length input sequences and variable-length output sequences. The main difficulty with these types of tasks is that the input and output do not have a one-to-one correlation, making conventional loss functions like mean square error inadequate. The CTC loss introduces a special “blank” label and models the conditional probability of a target sequence given an input sequence. The loss function computes the probability of all possible alignments between the input and output sequences. It aims to maximize the likelihood of the correct alignment.

The Bi-LSTM is used along with the CTC loss. It computes the loss as -log(ground truth text) intending to minimize the negative maximum likelihood path by considering the ground truth text and the output from the Bi-LSTM. Finally, the given labels are used to find the possible paths. The loss for $(a,b)$ is given by,

$\displaystyle P(b|a)=\sum_{A\in A_{a,b}}\prod_{t=1}^{T}P_{t}(A_{t}|a)$ (4)

$P(a|b)$ is the CTC conditional probability. Marginalization is done over the set of valid alignments and probabilities are computed for all valid alignments step by step. The result from the prediction is then decoded using CTC Decode.

Decoding is done using the word beam search algorithm to recognize the handwritten text [29]. The algorithm proceeds by finding the beams (candidate characters in our case) at each step. The model provides as output the probabilities of each character at each time step. We consider only the best-scoring character for further processing. A beam width of 50 is used for controlling the number of surviving beams. This process is repeated until the entire output of the model is processed, that is, till the word is recognized. A dictionary of Hindi words is used that restricts the possible words that can be created.

3.3 Answer evaluation

This module is the final step towards building an automatic answer correction system. After obtaining the answer for the given question from the textual-QA model ( $a_{\textit{pred}}$ ) and the answer written in the image ( $a_{\textit{extract}}$ ), they are compared using F-score metrics to determine if the handwritten answer is correct. The F-score is used to determine the correctness of the answer and allot a score out of 1. If $a_{\textit{pred}}$ and $a_{\textit{extract}}$ are exactly the same, the F-score of 1 is obtained, if they are completely different the score of 0 is obtained. Otherwise, a score between 0 to 1 indicates a partial match of word characters. The computation of score and correctness is done as follows,

if (F-score $==$ 1) {Score $=$ 1, Value $=$ correct} elseif (F-score $==$ 0) {Score $=$ 0, Value $=$ incorrect} else {score $=$ F-score, Value $=$ partially correct}

4. Datasets

As discussed in Section 3 the system is divided into: textual-QA, HTR and marks allocation by comparing the correct answer extracted from the textual-QA model and the one written by the student (extracted from HTR model). XQuAD [30] and IIIT-HW-Dev dataset [20] existing datasets are used for training and evaluating the performance on textual-QA and HTR tasks respectively. A hindi question-answering dataset is created for textual-QA task. These datasets are detailed next.

XQuAD [30]

The experimentation on textual-QA is performed using the XQuAD dataset. It is a Cross-lingual Question Answering benchmark dataset. The dataset comprises tuples of (question, answer, and context paragraphs). There are 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 [13]. Apart from English, the professional translations of the entire dataset are also available in ten different languages, namely, Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. All of the files in this dataset are in JSON format, which is similar to the SQuAD dataset format. Table 2 shows the average number of tokens per paragraph, question, and answer for each language.

Table 2
XQuAD tokens

	en	es	de	el	ru	tr	ar	vi	th	zh	hi
Paragraph	142.4	160.7	139.5	149.6	133.9	126.5	128.2	191.2	158.7	147.6	232.4
Question	11.5	13.4	11.0	11.7	10.0	9.8	10.7	14.8	11.5	10.5	18.7
Answer	3.1	3.6	3.0	3.3	3.1	3.1	3.1	4.5	4.1	3.5	5.6

We have merged all these to form a single multilingual dataset of around 14000 question-answer pairs. The dataset is an MCR dataset, in which, the answer to any question is the span from the given context. Apart from the tuple of (question, answer, and context paragraphs), the dataset also contains a start index. The start index specifies the starting character index in the context. However, the end index is not available in the dataset. Thereby, we have applied pre-processed to compute the end index and appended it to the dataset. The end index was computed by considering the answer text, context, and start index.

IIIT-HW-Dev dataset [20]

Figure 5.

Sample handwritten words in Devanagari script in the IIIT-HW-Dev dataset [20].

IIIT-HW-Dev dataset is used to train the HTR model. It is a Hindi dataset that contains 9,540 handwritten words in the Devanagari script. It is annotated using UTF-8. For each word, there are about 9 to 10 samples in the dataset. A total of 95,381 word samples are available, these are collected from 12 different individuals with different educational backgrounds and ages. Figure 5 shows sample images from the dataset.

Hindi-QA dataset

Figure 6.

Sample of Hindi-QA dataset.

The XQUAD dataset is a general domain dataset which is originally created in English and then translated to 11 different languages. We have created a small dataset called as Hindi-QA. Rather than translating the English text, it is created in Hindi language by native Hindi speakers. The context is in form of short stories and Question-Answer are designed from the context. This is generally the format of asking questions based on MCR in schools. As the overall system is designed for evaluation of handwritten answers of students, Hindi-QA helps to demonstrate the performance on real data. Moreover, no such dataset exists that is specifically designed to handle the MCR task in Hindi language for answer evauation of students.

The dataset comprises of context, question and answer. There are 100 question-answer pairs in the same format as XQUAD. The dataset is in Hindi language and consists of mainly single-word correct answers. Figure 6 shows a context and two questions along with their answers from the dataset.

5. Experimentation and results

In this section, the experimentation and the results obtained for the two main modules: textual-QA, HTR are presented. Further, the application of these modules to the overall task of answer evaluation is presented, emphasizing the practical implications on real data in the Hindi language. For model and results evaluation purposes, a simple web app was created using Flask, which is a simple web framework. The performance metrices used to evaluate the model performance are presented first, followed by results and its discussion.

5.1 Performance evaluation metrics

Exact Match (EM) Score and F1 Score metrics are used to evaluate the performance of the textual-QA task. The exact match score provides the measure of how similar the two texts are. However, it doesn’t allow variations and checks for exact matches of words. Whereas, the F1-score is based on the computation of precision and recall. These can be expressed as follows,

$\displaystyle\textit{F1-score}=2*\frac{\textit{Precision}*\textit{Recall}}{% \textit{Precision}+\textit{Recall}}$ (5)

where,

$\textit{Precition}=\textit{True Positives}/(\textit{True Positives}+\textit{% False Positives})$ and $\textit{Recall}=\textit{True Positives}/(\textit{True Positives}+\textit{False% Negatives)}$

F1 score captures the precision and recall such that the words chosen as being part of the answer are part of the answer, and EM score is the number of answers that are exactly correct (with the same start and end index).

The HTR model is evaluated based on Accuracy that is the measure of fraction of text classified by the model which are correct.

5.2 Textual-QA

The performance of textual-QA is assessed on existing dataset-XQUAD and on the created dataset Hindi-QA.

5.2.1 Results on XQUAD

The task of textual-QA is accomplished by, first, training a pre-trained transformer model on a combined multilingual dataset, XQUAD (refer to Section 4). Experiments are conducted utilising a variety of pre-trained transformer models, including BERT, m-BERT, DistilBERT, ALBERT, RoBERTa-base, and RoBERTa-large, to demonstrate the effectiveness of the proposed textual-QA technique. These models were fine-tuned for 5 epochs with a batch size of 32 and a learning rate of $3*10^{-5}$ . The experiments were performed using different values of epochs and learning rates and best results were obtained using 5 epochs and a learning rate of $3*10^{-5}$ for all the variants.

Table 3
Time required to fine-tune the models

Models	Training time (minutes)
BERT	50.23
m-BERT	117.943
DistilBERT	57.939
ALBERT	20.83
RoBERTa	62

Table 4

Results of question answering on hindi and English language from XQUAD dataset

Used pre-trained model	Language: English		Language: Hindi
	EM score	F1-score	EM score	F1-score
BERT	66.72	68.33	34.56	42.26
m-BERT	67.88	70.65	45.98	46.21
DistilBERT	65.47	66.32	47.33	49.63
ALBERT	66.51	69.26	46.67	47.69
RoBERTa-base	73.90	84.70	72.0	81.0
RoBERTa-large	87.92	94.61	84.54	91.96

The time required for fine-tuning each of the models using XQUAD is presented in Table 3. The trained model $\textit{Model}_{ft}$ was tested for both Hindi and English qustion-answers of the XQUAD test dataset. Table 4 presents the F1 and EM scores for the XQUAD dataset. The results are indicative that the proposed model that uses RoBERTa performs better than the other transformer models. Specifically, RoBERTa-large is the best performing model.

Effectiveness of using combined XQUAD dataset

To further show the effectiveness of training the model on a combined set of all languages (as discussed in Section 4) rather than just on a single language, experiments were carried out by fine-tuning the pre-trained models only on a specific language from the XQUAD dataset. Table 5 shows the model performance when fine-tuning is done using only English and only Hindi language question-answer pairs from the dataset.

Table 5

Model performance when trained only using hindi/English QA pairs from XQUAD dataset

Used pre-trained model	Language: English		Language: Hindi
	EM score	F1-score	EM score	F1-score
BERT	45.80	48.98	0.22	0.0
m-BERT	55.98	64.86	15.72	10.86
DistilBERT	51.66	60.82	33.58	19.76
ALBERT	34.83	51.74	6.4	3.98
RoBERTa-base	64.56	73.65	55.63	60.66
RoBERTa-large	74.08	86.62	73.48	77.82

Figure 7.

Comparision of performance on English question-answer pairs from XQUAD dataset in two settings. S-trained and tested only using English QA pairs, C-trained on combined dataset of 11 languages and tested on English QA.

Figure 8.

Comparision of performance on Hindi QA from XQUAD dataset in two settings. S-trained and tested only using Hindi QA pairs of XQUAD, C-trained on combined dataset of 11 languages and tested on Hindi QA from XQUAD.

Comparative analysis is presented in Figs 7 and 8. It is observed that the performance of all the models improves considerably when trained on a combined dataset of all languages from XQUAD. This is also evident from the comparison of Tables 4 and 5. Thereby, rather than training only using specific language, trained model on a combined dataset offers more comprehensive language representation and is further used as the final model for next steps.

5.2.2 Results on Hindi-QA

In this section, the result obtained on Hindi-QA is discussed. The dataset is divided into training and test set of 80% and 20% respectively. It is clear from the discussion in Section 5.2.1 that the RoBERTa-large model outperforms the other models. As a result, RoBERTa-large model is used for further experimentation on our created Hindi-QA dataset. Additionally, it was shown that the models trained on a combined dataset perform better. Thereby, we employ the model that was produced in the previous section and further fine-tune it using the training set of Hindi-QA rather than only fine-tuning the RoBERTa-large model on the Hindi-QA dataset. This led to further improvement in the model performance. The F1-score obtained for the final model on the test set of Hindi-QA is 98.69.

Figure 9.

Sample result of textual-QA by entering a new paragraph and a question to obtain answer for a real data.

A sample answer for context and question are shown in Fig. 9. The system requires entering the context/paragraph as input along with the question in the Hindi Language. It then predicts the answer to the given question by extracting a span from the given context.

5.2.3 Results on ChAII – Hindi and tamil question answering existing dataset

To further, validate the performance of the proposed model, we performed experimentation on a recently released dataset called Challenge in AI for India (ChAII) – Hindi and Tamil Question Answering [11]. The dataset includes context, question and answer in Hindi and Tamil language. The training set has 740 Hindi and 364 Tamil examples. The test set comprises of 816 examples.

Evaluation metric used

There is very less work on the dataset, and researchers have employed Jaccard similarity as the evaluation metric. Thereby, we have also evaluated the results using the same metric for ease of comparision. Jaccard similarity coefficient is defined as,

$\displaystyle J(A,B)=\frac{|A\cap B|}{|A\cup B|}$ (6)

where, A and B are sets of predicted aswer and ground truth answer. A Jaccard score of 0 means that none of the characters are the same between the predicted answer and the correct answer, and a score of 1 means that the two are identical.

Results

Our model is obtained by employing the fine-tuned RoBERTa-large model (obtained in Section 5.2.1 using the combined XQUAD dataset) and further fine-tuning it using the training set of ChAII dataset. The model is fine-tuned for 4 epochs with learning rate of $3*10^{-5}$ . The performance comparison with existing methods of textual-QA on the ChAII dataset are presented in Table 6.

Table 6

Results of textual-QA on ChAII dataset

Model	Fine-tuned on	Jacard-score	F-score
MuCoT [14]	SQUAD and ChAII	0.53	–
XLM-RoBERTa-S [12]	SQUAD	0.65	–
XLM-RoBERTa-Chaii [15]	Chaii	0.632	0.688
XLM-RoBERTa-sqCh [12]	SQUAD and ChAII	0.749	–
BERT-FT	11 languages from XQUAD and ChAII	0.75	–
MuRIL [15]	Hindi instances from XQUAD and MLQA $+$ Chaii	0.803	0.826
Our model	11 languages from XQUAD and ChAII	0.84	0.88

The comparision is done with the following methods:

•

MuCoT is a Multilingual Contrastive Training for Question-Answering in Low-resource Languages [14]. In MuCoT, a pre-trained multilingual Bidirectional Encoder Representations from Transformers (mBERT) model is first pre-trained using SQuAD, a large-scale English question-answering dataset. This resulting English mBERT-QA model is then fine-tuned using the ChAII dataset and evaluated.

•

XLM-RoBERTa-S [12] uses XLM-RoBERTa-base model fine tuned on SQUAD dataset [13] which is a English QA dataset.

•

XLM-RoBERTa-sqCh [12] uses the model fine-tuned on SQUAD and is further trained on both the Hindi and Tamil QA pairs from ChAII dataset. The training was done for train for 2 epochs with a learning rate of $2\times 10^{-5}$ .

•

BERT-FT uses model fine-tuned on combined dataset of 11 languages of XQUAD dataset and further trained on ChAII dataset. We have trained this model for 4 epochs with learning rate of $3*10^{-5}$ .

•

MuRIL is fine tuned for 2 epochs with learning rate of $3*10^{-5}$ with only Hindi instances from MLQA, XQUAD and Chaii datasets.

As observed in Table 6, our approach performs better than the other methods. This is because the model was fine-tuned on combined set of 11 languages as opposed to just one language. Therefore, as discussed in Section 5.2, the results here are also indicative of the same fact that fine-tuning using combined XQUAD dataset is more effective and provides better representation of language. This is also evidant by comparing the results with the best performing model [15], which is trained only using Hindi QA pairs from the datasets used.

5.3 Handwritten text recognition on IIIT-HW-Dev dataset

Table 7
Accuracy of HTR model on IIIT-HW-Dev

Reference	Accuracy
[20]	88.73%
Our	90.37%

The HTR model was developed by training the CRNN network (as discussed in Section 3.2) on the IIIT-HW-Dev dataset (refer to Section 4). 80 percent dataset was used for training and the remaining 20 percent for testing. Initially, pre-processing of handwritten input images is done by applying gray scaling and resizing techniques. The input image is resized to 128 $\times$ 32 before being processed by the CRNN network. After passing through the 5 layers of CNN, 256 feature maps of 32 $\times$ 1 size are obtained, which are passed to 2-layer Bi-LSTM. CTC loss function is used while training. Finally, CTC decode is used for text recognition. A batch size of 25 was used for training and training converged in 27 epochs. The time required for training the HTR model was 102 min. We have also included 200 characters of [19] in the training set, to improve the performance of the model. An accuracy of 90.37% was obtained on the IIIT-HW-Dev test set as shown in Table 7.

Although, there is a considerable amount of work in devanagri script character recognition, there is very less work in word-level recognition in Hindi language. Based on the presented result, our model outperforms the existing method on this dataset.

Figure 10.

Sample result of QA in Hindi language on real data.

5.4 Overall answer evaluation on real data

Further, the task of answer evaluation is performed by comparing the answer extracted from the image with the answer predicted by the QA model. When comparing answers, lower and upper-case matches are taken into account to see if the expected and actual results exactly match.

For model and results evaluation, a web app was created using Flask. Context (Paragraph), Question, and Images (for recognition) are given as input in the web app to obtain the answer evaluation. Sample evaluation on real data is presented in Figs 10 and 11.

Figure 11.

Sample results of QA in Hindi language on real data.

Figure 10a shows the inputs required and Fig. 10b shows the obtained result along with the entered inputs. The entered context, question and images are displayed. The predicted answer ( $a_{\textit{pred}}$ ) is the answer extracted by the QA model ( $\textit{Model}_{ft}$ ) from the given context. Predicted text ( $a_{\textit{extract}}$ ) is the recognized handwritten text from the $\textit{Model}_{\textit{htr}}$ model. These answers $a_{\textit{pred}}$ and $a_{\textit{extract}}$ are compared to obtain the F-score of 1 as shown in Fig. 10b. As both are the same, full marks will be awarded for this question to the student.

Another result for two questions is shown in Fig. 11. For the first question $a_{\textit{pred}}$ is “pravas karna” (in Hindi) whereas $a_{\textit{extract}}$ (the answer written by the student) is “pravas” (in Hindi), thereby the F-score of 0.66 is computed. The student will therefore get 66% marks for the question. For the second question shown in Fig. 11, $a_{\textit{pred}}$ and $a_{\textit{extract}}$ are the same thereby generating an F-score of 1.

6. Future research directions

In this section, we discuss the possible future direction and emerging research topics in the field that have the potential to further advance this research area.

6.1 Future scope

Future Scope of the work presented in this paper is discussed below:

•
Currently, the text recognition model can extract only one-word text. Thereby, limiting the answer evaluation to a single-word answer. As a future scope, the model for recognizing a line or paragraph as an answer can be built to further enhance the answer evaluation system.
•
The Hindi-QA dataset can be further extended to include more context and question-answer pairs, to further improve the model accuracy.
•
Although, the focus is on Hindi language, the system can be developed in various languages. The XQUAD dataset used for fine-tuning the textual-QA models is in 11 languages. We have further finetuned it using Hindi question-answer pairs. Further fine-tuning in other languages can lead to an efficient multi-lingual system.
•
The work presented in this paper is on Machine reading comprehension type of questions. Other type of questions like multiple choice questions, open answer questions can also be addressed.
•
Additional module of automated QA pair generation from the given text can help in additional practice and undestanding of students.

6.2 Application areas

The application areas of the proposed system are presented next.

•
The proposed system can be very useful in education sector. It can be used in schools by training it on the textbook content, providing assistance to teachers and students.
•
It can help in designing Interactive textbooks for students offering explanations and additional information.
•
A chatbot for any educational institute can be built for answering common queries of students.

6.3 Research direction

The potential research directions in textual-QA are as follows:

•
Open-domain Question Answering The work presented in this paper focuses on context provided by the user and then extracting answer from the available context. However, if context is not available and needs to searched for on the internet or from offline documents, it is open-domain question answering. This is an active area of research in textual-QA. Various researchers have addressed open-domain QA [31, 32]. A survey for various open domain question answering techniques is presented in [33].
•
Cross-Language Question Answering Cross-Language QA System refers to a system that accepts textual query in some language (say Language A) and produces the answer in the same language (Language A). However, the context or documents available (from which the answer needs to be extracted or inferred) are in another language (say Language B).

BiQue a German-nglish bilingual textual QA-ystem is presented in [34]. It is an open-domain bilingual QA-System for English target document collections and German source language questions. In [35], discriminative adversarial neural networks are used to address cross-language task of question-uestion similarity reranking in community question answering. The question can be either in English or in Arabic, and the questions it is compared to are always in English. A Cross-Lingual Product Question Answering system called xPQA is presented in [36]. It is a cross language question answering system that supports 12 Languages.
•
Generative Question Answering The focus of study in this work is on extractive question-answering, where the answer is a span from the context. Another emerging area of research in textual-QA is generative QA, which involves generation of answer for the given question based on a large corpus of text. Chat-GPT is one of the best examples of generative question answering. The challanges and opportunities of Chat-GPT in higher education is discussed in [37].

7. Conclusion

In this paper, an answer evaluation system for the Hindi language is presented. The paper tries to connect the research outcomes to the real-world application of an automated answer evaluation system, highlighting its potential to assist students and educators. A question-answering model is built along a handwritten text recognition model for the automated evaluation of answers written by the student. For question answering on the Hindi dataset, the model built using the Roberta-large pre-trained model, trained on the combined 11 language XQUAD dataset gave the highest performance in comparison to other models. It is further fine-tuned on the Hindi-QA dataset to further improve the performance of the model. Based on the results of the experiments, it can be concluded that fine-tuning the RoBERTa model with question-answering datasets in multiple languages and then fine-tuning it specifically with the Hindi-QA dataset outperforms other methods that fine-tune only using Hindi pairs, only English pairs, or English and Hindi pairs. The text recognition model is built using Convolutional Recurrent Neural Network, which extracts the handwritten Hindi text from the image. The outputs from these two models are then compared to obtain the evaluation of the answer using the F1-score.

Declaration of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Dataset availability

The datasets generated during the current study are available from the corresponding author on request.

References

Simons

Fennig

. Ethnologue: Languages of the World, Dallas, Texas: SIL International. Online version: http://www.ethnologue.com. 2017.

Shukla

Mukherjee

Raina

. Towards a language independent encoding of documents. NLUCS 2004. 2004; 116.

Uchida

Zhu

. The universal networking language beyond machine translation. In: International Symposium on Language in Cyberspace, Seoul; 2001. pp. 26–27.

Gupta

Kumari

Ekbal

Bhattacharyya

. MMQA: A multi-domain multi-lingual question-answering framework for English and Hindi. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018); 2018.

Lewis

Oğuz

Rinott

Riedel

Schwenk

. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475. 2019.

Sekine

Grishman

. Hindi-English cross-lingual question-answering system. ACM Transactions on Asian Language Information Processing (TALIP). 2003; 2(3): 181–192.

Larkey

Connell

Abduljaleel

. Hindi CLIR in thirty days. ACM Transactions on Asian Language Information Processing (TALIP). 2003; 2(2): 130–142.

Nanda

Dua

Singla

. A hindi question answering system using machine learning approach. In: 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT). IEEE; 2016. pp. 311–314.

Gupta

Khade

. Bert based multilingual machine comprehension in english and hindi. arXiv preprint arXiv:2006.01432. 2020.

10.

Butt

Ashraf

Siddiqui

MHF

Sidorov

Gelbukh

. Transformer-based extractive social media question answering on TweetQA. Computación y Sistemas. 2021; 25(1): 23–32.

11.

Google. Challenge in AI for India (ChAII) – Hindi and Tamil question answering. 2022. Available from: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering.

12.

Thirumala

Ferracane

. Extractive Question Answering on Queries in Hindi and Tamil. arXiv preprint arXiv:2210.06356. 2022.

13.

Rajpurkar

Zhang

Lopyrev

Liang

. Squad: 100,000

+

questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. 2016.

14.

Kumar

Gehlot

Mullappilly

Nandakumar

. Mucot: Multilingual contrastive training for question-answering in low-resource languages. arXiv preprint arXiv:2204.05814. 2022.

15.

Namasivayam

Rajan

. Answer prediction for questions from tamil and hindi passages. Procedia Computer Science. 2023; 218: 1985–1993.

16.

Khanuja

Bansal

Mehtani

Khosla

Dey

Gopalan

, et al. Muril: Multilingual representations for indian languages. arXiv preprint arXiv:2103.10730. 2021.

17.

Devlin

Chang

Lee

Toutanova

. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.

18.

Sharma

Pal

Kimura

Pal

. Recognition of off-line handwritten devnagari characters using quadratic classifier. In: Computer Vision, Graphics and Image Processing: 5th Indian Conference, ICVGIP 2006, Madurai, India, December 13–16, 2006. Proceedings. Springer; 2006. pp. 805–816.

19.

Kumar

Singh

. A study of zernike moments and its use in devnagari handwritten character recognition. In: Intl. Conf. on Cognition and Recognition; 2005. pp. 514–520.

20.

Dutta

Krishnan

Mathew

Jawahar

. Offline handwriting recognition on Devanagari using a new benchmark dataset. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE; 2018. pp. 25–30.

21.

Bansal

Sinha

. A complete OCR for printed Hindi text in Devanagari script. In: Proceedings of Sixth International Conference on Document Analysis and Recognition. IEEE Computer Society; 2001. pp. 0800–0800.

22.

Hanmandlu

Murthy

. Fuzzy model based recognition of handwritten numerals. Pattern Recognition. 2007; 40(6): 1840–1854.

23.

Bajaj

Dey

Chaudhury

. Devnagari numeral recognition by combining decision of multiple connectionist classifiers. Sadhana. 2002; 27: 59–72.

24.

Bhattacharya

Chaudhuri

. Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2008; 31(3): 444–457.

25.

Memon

Sami

Khan

Uddin

. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access. 2020; 8: 142642–142668.

26.

Chaudhuri

Mandaviya

Badelia

Ghosh

Chaudhuri

Mandaviya

, et al. Optical character recognition systems for Hindi language. Optical Character Recognition Systems for Different Languages with Soft Computing. 2017; 193–216.

27.

Liu

Ott

Goyal

Joshi

Chen

, et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. 2019.

28.

Graves

. Connectionist temporal classification. Supervised sequence labelling with recurrent neural networks. 2012; 61–93.

29.

Scheidl

Fiel

Sablatnig

. Word beam search: A connectionist temporal classification decoding algorithm. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE; 2018. pp. 253–258.

30.

Artetxe

Ruder

Yogatama

. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856. 2019.

31.

Nie

. Dynamic graph reasoning for conversational open-domain question answering. ACM Transactions on Information Systems (TOIS). 2022; 40(4): 1–24.

32.

Pal

Umapathi

Sankarasubbu

. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning. PMLR; 2022. pp. 248–260.

33.

Zhang

Chen

Cao

Chen

Cohn

, et al. A survey for efficient open domain question answering. arXiv preprint arXiv:2211.07886. 2022.

34.

Neumann

Sacaleanu

. A cross-language question/answering-system for german and english. In: Workshop of the Cross-Language Evaluation Forum for European Languages. Springer; 2003. pp. 559–571.

35.

Joty

Nakov

Màrquez

Jaradat

. Cross-language Learning with Adversarial Neural Networks. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada: Association for Computational Linguistics; 2017. pp. 226–237. Available from: https://aclanthology.org/K17-1024.

36.

Shen

Asai

Byrne

de Gispert

. xPQA: Cross-Lingual Product Question Answering across 12 Languages. arXiv preprint arXiv:2305.09249. 2023.

37.

Fuchs

. Exploring the opportunities and challenges of NLP models in higher education: is Chat GPT a blessing or a curse? In: Frontiers in Education. Frontiers; Vol. 8, 2023. p. 1166682.

A textual question answering and handwritten answer evaluation system for hindi language

Abstract

Keywords

1. Introduction

2.1 Textual question-answering

BERT transformer model

2.2 Handwritten text recognition for Hindi langauge

Table 1 Summary of Significant methods in HTR

3.1 Textual question answering

Rationale for selection of RoBERTa model

3.1.2 Textual question answering model

Convolution neural network for feature extraction

Recurrent neural network using Bi-LSTM and text recognition

Connectionist temporal classification (CTC) loss and decode

4. Datasets

XQuAD [30]

Table 2 XQuAD tokens

IIIT-HW-Dev dataset [20]

Hindi-QA dataset

5.1 Performance evaluation metrics

5.2.1 Results on XQUAD

Table 3 Time required to fine-tune the models

Effectiveness of using combined XQUAD dataset

Evaluation metric used

Results

Table 7 Accuracy of HTR model on IIIT-HW-Dev

6.1 Future scope

Declaration of interest

Dataset availability

References

Table 1
Summary of Significant methods in HTR

Table 2
XQuAD tokens

Table 3
Time required to fine-tune the models

Table 7
Accuracy of HTR model on IIIT-HW-Dev