Abstract
We investigated how Natural Language Processing (NLP) algorithms could automatically grade answers to open-ended inference questions in web-based eBooks. This is a component of research on making reading more motivating for children and increasing their comprehension. We obtained and graded a set of answers to open-ended questions embedded in a novel written in English. Computer science students used a subset of the graded answers to develop algorithms designed to grade new answers to the questions. The algorithms utilized the story text, existing graded answers for a given question, and publicly accessible databases in grading new responses. A computer science professor used another subset of the graded answers to evaluate the students’ NLP algorithms and to select the best algorithm. The results showed that the best algorithm correctly graded approximately 85% of the real-world answers as correct, partly correct, or wrong. The best NLP algorithm was then trained with questions and graded answers from a series of new text narratives in another language, Slovenian. The resulting NLP model was successfully used in fourth-grade language arts classes to provide feedback on student answers to open-ended questions in eBooks.
Problem and Potential Solution
Children’s literacy skills have been inadequate in the western world for some time (OECD, 2010, 2015, 2019; PISA, 2018). Meanwhile, there have been two related trends: (a) children’s independent reading, the most important factor in literacy skill (Mol & Bus, 2011), has declined, while (b) the popularity of interactive digital entertainment such as computer games, video games, and social media has increased. Therefore, it seems opportune to merge eBooks with digital interaction to boost children’s independent reading and literacy skills. To advance this research program, the current article’s lead author is developing web-based interactive eBooks and investigating their efficacy. The lead author and his research group have developed and investigated a number of forms of interaction in web-based eBooks. The current article focuses on open-ended questions.
Open-ended questions are a powerful way for readers to interact with texts. They may effectively assess comprehension. However, open-ended questions are much more effective when coupled with real-time feedback. Effective readers regularly monitor comprehension (Zargar et al., 2020). Open-ended questions, with real-time feedback, potentially provide a fun way to externalize comprehension monitoring and scaffold weaker readers. However, until now, real-time feedback to students about their answers has not been possible.
Natural Language Processing (NLP) is the ability of computers and cloud-based apps to communicate with humans in their own natural languages, such as English or Spanish. Siri is an example of an app using NLP. NLP, a big part of the cloud-based technological economic boom fueled partly by hand-held devices (Friedman, 2016, p. 486), has untapped potential for improving education. The current research seeks to provide better feedback on answers to open-ended questions in eBooks.
Our plan to develop NLP algorithms to improve feedback on open-ended questions involved:
(a) obtaining a corpus of typed answers from readers; (b) providing a small set of exemplar answers to the questions, along with a scheme for grading each answer by comparing it to the exemplars; (c) grading these answers for correctness on an incremental scale of 0.0, 0.5, and 1.0; (d) developing NLP algorithms that automatically grade these data for level of correctness; and (e) testing the efficacy of these NLP algorithms on a held-back portion of the data.
Literature Review
Literacy Skills
Children’s reading comprehension skills are declining in economically developed countries (OECD, 2010, 2015, 2019; PISA, 2018). “… 20% of students in OECD countries … do not attain … baseline level of proficiency in reading. This proportion has remained stable since 2009” (OECD, 2010, 2015, 2019; PISA, 2018). More than half of adolescents in Western industrialized countries do not read recreationally. Independent, recreational reading is the largest factor in developing reading skills (Mol & Bus, 2011). While recreational reading has declined, computer games and social media have surged in popularity. Therefore, it seems logical to integrate digital interaction into eBooks to appeal to today’s generation of children. Seven studies in five countries highlight some of the advantages of these interactive eBooks (Alsofyani, 2019; Drobisz et al., 2019; Jordan et al., 2018; Nielen et al., 2018; Smith, Besalti, et al., 2019; Smith et al., 2013; Smith, Petek, et al., 2019).
Inserted Questions/Adjunct Questions
Rigorous educational research supports the use of inserted questions and adjunct questions in text. Although research on inserted questions and adjunct questions had its heyday in the 1980s (Dunning, 1988; Guthrie, 1977; McConaughy, 1980, 1985; Morrow, 1985), there were also studies in the 2000s (Callender & McDaniel, 2007; Callender et al., 2013; Dornisch & Sperling, 2008; Medina et al., 2017). It is worth reviving this program of research, since NLP provides innovative ways to provide feedback on answers to such questions.
There are three sequential arrangements for these questions: (a) pre-questions, before the text, (b) post-questions, after the text, and (c) inserted questions, within the text (Hiller, 1974). The current study focuses on inserted questions because they (a) add interaction during reading, potentially increasing engagement and (b) potentially make readers feel responsible for comprehending the text through implicit comprehension checking.
Inserted questions can emphasize story structure and the internal states of characters, as well as external aspects of stories (Dunning, 1988), thus improving comprehension (Dunning, 1988; Guthrie, 1977; McConaughy, 1980, 1985; Morrow, 1985). Inserted questions encourage rereading of text, an important habit of advanced readers (Lewman, 1999).
Interactive eBooks can use inserted open-ended questions as a powerful yet simple way to stimulate reading comprehension. An updated version of research surrounding inserted open-ended questions could leverage the possibilities of cloud technology, that is, provide easily usable tools to put “inserted” open-ended questions into eBooks. This could also potentially take some of the workload off of busy teachers by automating feedback to readers.
NLP Techniques for Reading Comprehension
Since the beginning of artificial intelligence, researchers have wanted to develop a machine that can understand conversation. Alan Turing introduced a test with the question: “Can machines think?” (Turing, 1950). The test proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would have a conversation with two partners, limited to a text-only channel such as a computer keyboard and screen, so that the result would not depend on the machine’s ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine is said to have passed the test. Humans, in everyday language, often rely on devices such as context, common knowledge, and hidden meanings to convey a message faster. These are difficult for machines to follow and understand. Therefore, no technology has yet been developed that enables efficient reading comprehension (Devlin et al., 2018; Wolf et al., 2019).
Although current algorithms achieve state-of-the-art results on common reasoning benchmarks such as SuperGLUE (Wang et al., 2019), they rely on large data sets. Although it seems that Artificial Intelligence (AI) has achieved the next level in understanding, we need to regularly remind ourselves that the current state of technology has limitations (Manning, 2015). For example, the need of current algorithms for large data sets puts these solutions out of reach for many people. Nevertheless, there are many new possibilities for natural language understanding by machines (Wang et al., 2019).
This study focuses on automatic grading of students’ responses, to indicate whether they have correctly comprehended a text. In NLP, the challenges of reading comprehension and question answering deal with the ability to read text and make an internal representation of it or answer questions about it. It requires both understanding of natural language and knowledge about the world. This research started in the early 1960s, with systems that used manually encoded rules. Later, more advanced systems were developed using statistical methods and machine learning approaches which currently achieve state-of-the-art results.
The first systems to answer human questions were closed domain systems using manually defined rules to answer questions, such as BASEBALL (Green et al., 1961) and LUNAR (Woods & Kaplan, 1977). The first answered questions about the U.S. baseball league over a period of 1 year. LUNAR answered questions about the geology of rocks returned by the Apollo moon missions. Both question answering systems were very effective in their chosen domains. LUNAR was demonstrated at a lunar science convention in 1971, where it answered 90% of the questions in its domain posed by people untrained on the system. However, due to the many manual rules, these closed domain systems were expensive to build and maintain and worked only for the limited domains.
Later advancements introduced machine learning algorithms that could be trained on labeled data of correct question and answer pairs. The result of training a model is a statistical program that can answer a question or predict the correctness of it. With the advent of big data and access to enormous text databases, the use of deep learning techniques became popular. To evaluate these new techniques, data sets were prepared, such as SQA (Iyyer et al., 2017), SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), NarrativeQA (Kočiský et al., 2018), or CoQA (Reddy et al., 2019). The data sets consist of questions and accompanying answers. All the answers are labeled as correct or incorrect to enable the model to learn to differentiate between positive and negative examples. Recently, Zhang et al. (2019) presented an algorithm to grade semi-open-ended short answer questions, similar to the approach used by one of the student groups described in this article.
Our approach uses machine learning techniques to model questions and answers. We prepared a new data set in order to evaluate the reading comprehension of students reading eBooks in the IMapBook system. Furthermore, we experimented using traditional approaches and new neural network architectures and integrated the automatic solution into the IMapBook system.
Research Questions
Research Question 1: Which NLP algorithms provide more accurate feedback to answers to open-ended inference questions in narrative text in web-based eBooks, while using a minimum of initial data (sparse data sets)?
Research Question 2: What kinds of NLP algorithms for providing feedback to answers to open-ended inference questions can be more consistent with a development and adoption model that allows laypersons, without technical skills, to add such open-ended questions with minimal labor to web-based eBooks, allowing for rapid materials development?
Research Question 3: How well can the algorithms developed be used on open-ended questions in another language, for instance, a language from a totally different family of languages?
Methods
Materials for Automatic Grading
Data Set Preparation
To develop NLP algorithms for providing more accurate feedback on answers to open-ended questions in web-based eBooks, we (a) created a series of open-ended questions, based on inferencing, to embed in online narrative text, (b) obtained a suitable, but minimal, sample of authentic answers to these questions, and (c) graded the answers, to provide a corpus for computer science students to develop NLP algorithms. As the narrative text, we selected Weightless, a young-adult science fiction novel about two 12-year-old castaways on a space vessel. The novel had several things recommending it. Because it was used in previous studies, it was very familiar to the researchers. It fit conveniently into school curriculum and standards, both for language arts and science. Thus, comprehending it required different types of inferences, including those based on both human social interactions and simple physics, appealing to both male and female readers.
The researchers selected a well-known taxonomy of inferences (Graesser et al., 1994) as a framework for generating inference questions from the text. Graesser’s taxonomy of inferences comprises 12 types of inferences. A subset of these 12 types is listed and defined in Table 1. We wanted rich responses to these inferencing questions, in order to provide the computer science students with data to develop algorithms. By rich responses, we mean answers composed of one or two sentences with multiple clauses. Questions answerable by “yes,” “no,” or other one-word answers would be useless for generating NLP algorithms. However, excessively wordy answers might make data analysis unwieldy. Furthermore, because some balance of correct and incorrect answers would provide the computer science students with a variety of data inputs to develop algorithms, we sought to develop inference questions of medium difficulty level. All inferences require some level of life knowledge, to be added to two or more pieces of information gleaned from the text. However, because computer algorithms typically have little or no life knowledge, we sought to create open-ended inference questions not requiring excessive background life knowledge.
Table 1. Definitions of the Graesser Inference Types That Were Specifically Used in Data in the Current Study.
Two of the researchers reread Weightless and generated 52 candidate open-ended inference questions, related to character mood and causal antecedents (either based in human behavior or physical situations involving simple science). Table 1 provides definitions of the inference types that were used in the study. The researchers were qualified to work with inference questions and answers, because (a) one of them, the lead author of this article, has conducted multiple studies involving inferencing and reading (e.g., Smith et al., 2010; Smith et al., 2013), and (b) both researchers in this phase have experience with K-12 teaching, and university level teaching, so are experienced educators.
The two researchers also provided exemplar answer(s) to each question, and then evaluated each inference question along these dimensions:
(a) classification in Graesser’s taxonomy of inferences, (b) perceived difficulty level on a scale of one to five, with five being the greatest, and (c) perceived amount of life knowledge needed, also on a scale of one to five.
Based on these evaluations, we selected 12 inference questions of moderate difficulty level (1–3) that required a moderate amount of life knowledge (1–2).
Here are two examples of open-ended questions we used:
Example 1:
Background knowledge needed: 1
Difficulty level: 2
Example 2:
Difficulty level: 1
The researchers inserted the 12 open-ended questions into the IMapBook web-based eBook version of the science fiction novel Weightless, between the appropriate text pages.
Participants
Based on an IRB-approved study protocol, the data were obtained from master’s-level students, who participated for extra credit in their courses. Participants were 87 master’s students in Library Sciences and Instructional Technology. Interactive eBooks were a part of the course curriculum. However, students could optionally earn extra credit by participating in this research or through other means. The average age and gender of participants were not known.
Users/readers were presented with the question and a text box to type their answers, which were then saved in a database. The readers were thanked for their answers, but not given feedback on the correctness of their answers. The data were then downloaded from the IMapBook system as Excel files.
Two of the researchers met and reviewed approximately 5% of the answers, comparing the answers to the exemplars. In some cases, the reviewers modified the exemplars based on answers which suggested shortcomings in the exemplars. This preliminary review process was halted once no new revisions to exemplars occurred, as per snowball sampling methodology (Goodman, 1961).
Two of the researchers read all the answers and independently graded them with scores of 0.0, 0.5, or 1.0 by comparing them semantically with the exemplar answers. Answers that had none of the required elements of the exemplar answers were deemed incorrect and given a score of 0.0. Answers with all of the required elements from the exemplar were given a score of 1.0, while answers with some, but not all, elements of the exemplar were scored as 0.5. After the two researchers had graded all the answers, they resolved all discrepantly graded answers by discussing them.
Data Set Overview
The final data set consisted of a total of 963 question–answer pairs: 850 pairs for training the algorithms and 113 pairs for testing them. The testing pairs were selected using stratified random sampling (i.e., preserving the ratios of labels 1.0, 0.5, and 0.0 in both the training set and the testing set), so that each question was fairly represented in the test set. As the two researchers independently selected labels for the answers, we checked their inter-rater agreement. We found that out of the 850 answers, they agreed on 756 answers (88.9%). Their Cohen’s kappa inter-rater agreement score was 0.73, which is sufficient for further analysis (McHugh, 2012). After the tagging, they jointly selected a final rating for each discrepant grade.
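The two agreement figures reported above (raw agreement and Cohen’s kappa) can be computed as in the following minimal sketch. The function names and the ten toy grades are illustrative assumptions, not the study’s data or code; for reference, the reported raw agreement is 756/850 ≈ 0.889.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items to which both raters gave the same grade."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label
    # if each assigns labels independently at their own observed rates.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(rater_a) | set(rater_b)
    ) / (n * n)
    if p_expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical grades from two raters (NOT taken from the study).
grades_a = [1.0, 1.0, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0, 1.0, 1.0]
grades_b = [1.0, 1.0, 0.5, 0.5, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]

print(percent_agreement(grades_a, grades_b))
print(cohens_kappa(grades_a, grades_b))
```

Kappa is lower than raw agreement whenever one label dominates, because two raters who mostly assign that label agree often by chance alone.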
In the data set, each question was supplemented with a text excerpt (the minimal set of sentences from the story containing sufficient information for a reader to infer the correct answer). In Table 2, we provide information on data distribution across the training and testing data sets.
Table 2. Distribution of Answers for Each Question in the Data Sets.
Note. The full data set was divided into training and testing parts. Each answer in the data set can be considered as wrong (0), partly correct (0.5), or correct (1) with respect to a question.
The division of the data into training and testing parts is the most common approach when using machine learning algorithms. The training part of the data is used to infer knowledge needed to create the model. The testing part is used as unseen data for testing the algorithm (i.e., a simulation of live students reading a book). The algorithm then tries to guess the correct label for a given answer in the testing data (i.e., 0, 0.5, or 1). Finally, we perform an analysis of how successful each algorithm is.
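A stratified division of this kind could be sketched as follows. This is an illustration under stated assumptions, not the procedure the authors ran: the function name, the tuple layout `(question_id, answer_text, label)`, and the choice to stratify on the question and label together are all ours.

```python
import random
from collections import defaultdict


def stratified_split(pairs, test_fraction, seed=0):
    """Split (question_id, answer_text, label) tuples into train and test parts.

    Each (question, label) combination is sampled separately, so label ratios
    and question coverage are approximately preserved in both parts.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pair in pairs:
        strata[(pair[0], pair[2])].append(pair)

    train, test = [], []
    for key in sorted(strata):  # sorted for a reproducible order
        group = list(strata[key])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy data: 2 questions with 10 correct, 5 partly correct, 5 wrong each.
toy_pairs = [
    (question, f"answer {i}", label)
    for question in ("q1", "q2")
    for label, count in ((1.0, 10), (0.5, 5), (0.0, 5))
    for i in range(count)
]
toy_train, toy_test = stratified_split(toy_pairs, test_fraction=0.2)
print(len(toy_train), len(toy_test))
```

With very small strata, rounding can leave a label out of the test part entirely, so the split should be checked after sampling.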
Algorithms Development
The data were presented to master’s students in a course on NLP at the University of Ljubljana, Faculty of Computer and Information Science. A total of 13 students in the course worked on the IMapBook domain data set. They formed teams of up to three students, resulting in nine student groups. The students created different models to automatically classify whether an answer to a given question was wrong, partly correct, or correct (as per the previous coding of the answers, i.e., answers with all of the required elements from the exemplar were scored 1.0, while answers with some, but not all, elements of the exemplar were scored as 0.5, etc.).
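Each group implemented three different models, following the real-world scenarios that will be further integrated into the IMapBook project: (a) a single example model, which grades an answer against the exemplar answer, (b) an all examples model, which learns from the graded answers in the training data, and (c) an all examples+ model, which may additionally use external text resources.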
The students were shown the training data set only, which they used to develop their algorithms, with one training data set for each type of model described earlier. They were also provided with an HTTP server implementation, so the lecturer could evaluate the performance of their algorithms on the hidden test data set. The students did not have access to the testing data set, so they could not optimize their algorithms on it. Therefore, the presented results reflect what might be expected in a real-world scenario.
The students’ final grades were based on the quality of their implementation (i.e., their proportion of correctly marked answers) and the public presentation of their work.
Figure 1 shows the architecture of the HTTP client program the lecturer developed for evaluation of students’ models. The implementation is publicly available online (see https://bitbucket.org/szitnik/onj-2018-2019-evaluator). The server API that the students needed to develop had to accept, as parameters, an NLP model type, a question, and a reader’s answer to grade (message 1 in Figure 1). The client iterated through each question–answer pair in the test data set and sent a message to a student’s model implementation. The client sent the same message for all three types of models. The student’s HTTP server needed to answer with a predicted score (i.e., 0, 0.5, or 1; message 2 in Figure 1). After the client tested all the question–answer pairs, it showed the success rate of a student (message 3 in Figure 1) as F-score performance measures (described in the following section).

Figure 1. High-Level Architecture of Automated Evaluation of Students’ Models Using Testing Data Set. The client iterates through the data set, sending messages to the server for all three types of models (1), the server replies with the scores (2), and the client outputs the performance at the end (3).
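The message flow in Figure 1 can be illustrated with a small sketch. It does not reproduce the published evaluator: the JSON field names (`model`, `question`, `answer`, `score`), the model identifiers, and the `send` callback are hypothetical, and the sketch reports the share of exact matches rather than the F-scores used in the study.

```python
import json

# Hypothetical identifiers for the three model types.
MODEL_TYPES = ("single_example", "all_examples", "all_examples_plus")


def evaluate(test_pairs, send):
    """Query a grading server for every test pair and every model type.

    test_pairs: list of (question, answer, gold_score) tuples.
    send: callable that takes a JSON request body (str) and returns a JSON
          reply (str); in a real run it would wrap an HTTP POST to the
          student's server.
    Returns the share of exactly matching scores per model type.
    """
    results = {}
    for model in MODEL_TYPES:
        hits = 0
        for question, answer, gold in test_pairs:
            # Message 1: model type, question, and the answer to grade.
            request = json.dumps(
                {"model": model, "question": question, "answer": answer}
            )
            # Message 2: the server replies with a predicted score.
            reply = json.loads(send(request))
            hits += float(reply["score"]) == gold
        # Message 3: performance summary for this model type.
        results[model] = hits / len(test_pairs)
    return results


def always_correct_server(request_body):
    """Stand-in for a student's server: grades every answer as correct."""
    json.loads(request_body)  # parse the body, as a real server would
    return json.dumps({"score": 1.0})


demo_pairs = [("q", "a1", 1.0), ("q", "a2", 0.0), ("q", "a3", 1.0), ("q", "a4", 0.5)]
print(evaluate(demo_pairs, always_correct_server))
```

Passing the transport in as a function keeps the loop testable without a running server.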
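We compared the algorithms using the F-score, which is the harmonic mean of precision ($P$) and recall ($R$). For a given class, with $TP$ true positives, $FP$ false positives, and $FN$ false negatives:
\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot P \cdot R}{P + R} \]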
Thus, an algorithm that classifies all examples as correct may achieve high recall but low precision, because it makes many mistakes. The F-score, as a harmonic mean, summarizes precision and recall more intuitively than the arithmetic mean. Suppose our algorithm achieves a precision of 1.0 and a recall of 0.2. Intuitively, the performance of the algorithm is very low, because the system identifies only 20% of the correct examples, which means it is almost useless. The arithmetic mean of those two values would be 0.6, whereas the harmonic mean of 0.33 seems a more reasonable choice.
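The arithmetic in this example can be checked with a few lines of Python. The function name `f_score` is ours.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.0, 0.2
arithmetic_mean = (precision + recall) / 2
harmonic_mean = f_score(precision, recall)
print(round(arithmetic_mean, 2), round(harmonic_mean, 2))  # prints: 0.6 0.33
```

The harmonic mean is pulled toward the smaller of the two values, which is why a system with very low recall cannot earn a high F-score through perfect precision.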
We provide the results using micro-averaged, macro-averaged, and weighted F-scores. A macro-average computes the F-score independently for each class (wrong, partly correct, and correct) and then takes the average (hence treating all classes equally). A micro-average combines the contributions of all classes to compute the average metric. In a multiclass classification setup (i.e., having more than two classes, wrong, partly correct, and correct, as opposed to wrong and correct), the micro-average is preferable if there is a class imbalance (i.e., significantly more examples of one class than of the other classes). As we can observe in Table 2, we have many more positive examples (correct answers) compared with the other two types. Finally, we sort the algorithms based on the weighted F-score, which is a variation of the macro-averaged F-score that takes class imbalance into account. To get the weighted F-score, we calculate the F-score for each class, and then find the average weighted by the number of instances of each class.
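The three averaging schemes can be written out as in the sketch below, using only the definitions given above. The function names and the toy gold and predicted labels are illustrative assumptions, not data from the study.

```python
from collections import Counter

LABELS = (0.0, 0.5, 1.0)  # wrong, partly correct, correct


def class_counts(gold, pred, label):
    """True positives, false positives, false negatives for one class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    """F-score from counts; algebraically equal to 2PR / (P + R)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f_scores(gold, pred, labels=LABELS):
    counts = {label: class_counts(gold, pred, label) for label in labels}
    support = Counter(gold)  # number of gold instances per class
    # Macro: plain average of per-class F-scores.
    macro = sum(f1(*counts[label]) for label in labels) / len(labels)
    # Weighted: per-class F-scores averaged by class size.
    weighted = sum(f1(*counts[label]) * support[label] for label in labels) / len(gold)
    # Micro: pool the counts of all classes, then compute one F-score.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Hypothetical imbalanced example: most gold answers are correct (1.0).
gold = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.0, 0.0]
pred = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 0.0, 1.0]
print(f_scores(gold, pred))
```

When every answer receives exactly one label, as here, the micro-averaged F-score equals plain accuracy, since each misclassification counts once as a false positive and once as a false negative.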
Table 3 shows the results for all nine groups that participated in the NLP course. The results are ordered by the weighted F-score for the all examples+ model, and the best scores per model type are marked in bold. Figure 2 shows the same results. In general, the results of the all examples model are better than those of the single example model, and the results of the all examples+ model are better than those of the all examples model, although these two models result in similar performance. Additional resources generally did not help in identifying whether an answer was correct or not, because the questions are very domain-specific, and general knowledge databases therefore do not cover the knowledge specifics of science-fiction books.
Table 3. Evaluation Results of the System of All the Student Groups for All Three Types of Models.
Note. Lines are ordered based on the Fweighted score of the all examples+ model. Results of groups that perform best for a specific model based on Fweighted score are marked in bold. The colors are used to ease the visual comparison of the results between the models.

Figure 2. Graphical Representation of the Final Evaluation Results.
The student groups used different approaches in their algorithms. All the groups used initial preprocessing of the text, including removal of “stopwords” (words that carry little meaning, e.g., “and” or “the”) and lemmatization or stemming. Stemming strips a word down to its root, which need not be a valid word. Lemmatization maps a word to its base (dictionary) form. For example, the stem for the word “having” is “hav”, while the lemma is “have.” Therefore, the lemma retains the word’s semantic meaning.
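The preprocessing steps described above can be illustrated with a deliberately simplified sketch. The stopword list, the suffix list, and the lemma table are tiny hand-made stand-ins (assumptions for illustration only); the student groups would have used full NLP resources rather than these toys.

```python
import re

# Tiny illustrative resources; real systems use full stopword lists,
# a proper stemming algorithm, and a dictionary-backed lemmatizer.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"having": "have", "has": "have", "had": "have", "was": "be"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def stem(token):
    """Naive suffix stripping; the result need not be a real word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Look up the dictionary form; unknown words are returned unchanged."""
    return LEMMAS.get(token, token)


tokens = remove_stopwords(tokenize("She is having the time of her life"))
print(tokens)
print(stem("having"), lemmatize("having"))  # prints: hav have
```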
Implementations of single example models consisted of different string-matching techniques. Students represented the exemplar answer as a set of words and did the same for the answer to evaluate. Then they determined a threshold for how many words must co-occur in both answers for the answer under evaluation to be graded as correct (Groups 4, 7, and 8). Group 3 used the same approach, except they used stems instead of lemmas. Some groups represented words as vectors and then compared the cosine similarity between groups of vectors of an exemplar answer and the answer to evaluate (Groups 1, 2, 5, and 6). Group 9 chose a completely different approach, representing sentences as graphs, formed from triples of nouns and verbs representing subject–predicate–object (Khashabi et al., 2018). After the graphs were formed, they used graph similarity measures to decide whether a given answer was correct or not.
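The two families of single example techniques, word overlap with a threshold and cosine similarity between vectors, can be sketched as follows. The thresholds (0.3 and 0.6), the function names, and the example tokens are hypothetical. The cosine function works on simple word-count vectors as a stand-in for the word vectors the groups used.

```python
import math
from collections import Counter


def overlap_score(exemplar_tokens, answer_tokens, partial=0.3, full=0.6):
    """Grade an answer by the share of exemplar words it contains.

    The thresholds are hypothetical; in practice they would be tuned
    on the training data.
    """
    exemplar = set(exemplar_tokens)
    if not exemplar:
        return 0.0
    share = len(exemplar & set(answer_tokens)) / len(exemplar)
    if share >= full:
        return 1.0
    if share >= partial:
        return 0.5
    return 0.0


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two word-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Made-up, already preprocessed tokens (not from the study's data).
exemplar = ["ship", "lost", "gravity", "engine", "failed"]
print(overlap_score(exemplar, ["engine", "failed", "ship", "lost", "gravity"]))
print(overlap_score(exemplar, ["engine", "failed"]))
print(overlap_score(exemplar, ["she", "was", "hungry"]))
print(cosine_similarity(exemplar, ["engine", "failed"]))
```

Exact word matching cannot credit paraphrases, which is one reason to compare vectors that encode word similarity instead of raw counts.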
For the all examples model, groups used ideas from single example models and improved them by using machine learning algorithms for classification. Also, they defined additional sets of features used by these models. The groups used standard machine learning algorithms, such as random forests, SVMs, naive Bayes, or decision trees. For a given training data set, these algorithms build a model that can then predict classes for new unseen answers.
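As one concrete instance of a standard classifier, the sketch below implements a small multinomial naive Bayes grader with add-one smoothing in plain Python. It is an illustration only: the class name, the bag-of-words features, and the toy training answers are our assumptions, and the groups’ actual feature sets and library implementations are not described in enough detail to reproduce.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesGrader:
    """Multinomial naive Bayes over word tokens with add-one smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_log_prob = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior of the class.
            log_prob = math.log(self.class_counts[label] / total)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                if token in self.vocabulary:  # ignore words never seen in training
                    log_prob += math.log(
                        (self.word_counts[label][token] + 1) / denominator
                    )
            if log_prob > best_log_prob:
                best_label, best_log_prob = label, log_prob
        return best_label


# Made-up graded answers for one question (not from the study's data).
train_tokens = [
    ["engine", "failed", "ship"],
    ["engine", "failed"],
    ["ship", "broke"],
    ["she", "was", "tired"],
    ["they", "were", "hungry"],
]
train_labels = [1.0, 1.0, 0.5, 0.0, 0.0]

grader = NaiveBayesGrader().fit(train_tokens, train_labels)
print(grader.predict(["engine", "failed"]))
print(grader.predict(["she", "was", "hungry"]))
```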
For the all examples+ model, groups could use any additional text resource they could find to enrich their data. Although the domain of the selected book seemed somewhat narrow, the questions required some general knowledge to answer correctly. Therefore, some student groups used common databases known in the field of NLP, such as WordNet (Miller, 1995), GloVe vectors (Pennington et al., 2014), or ConceptNet (Liu & Singh, 2004). WordNet is a large database that consists of groups of words, called synsets, interconnected by semantic relationships that define synonyms, hypernyms, or antonyms. GloVe vectors are vector representations for words that encode aggregated global word-word co-occurrence from a large corpus of data. Finally, ConceptNet is a freely available semantic network, designed to help computers understand the meanings of words that people use. For example, for the word “dog,” we could find relations that this word is of type “a pet” or “mammal,” can “run” or “bark,” and has “four legs” and “two ears.” Group 6 introduced additional changes in their model to create three different versions of the model, by using WordNet, semantic triples (subject–predicate–object), and important words. Although they introduced many changes, their all examples+ model did not improve compared with their all examples model.
The data set we used for training is not large, whereas deep learning techniques typically take advantage of larger amounts of data (Devlin et al., 2018; Wolf et al., 2019). However, Group 2 still decided to use convolutional neural networks (Hu et al., 2014) in combination with GloVe vectors, methods typically associated with larger data sets. They did significantly increase the performance of their all examples+ model compared with the all examples model and achieved results similar to the best performing group.
The best performing group overall was Group 1, consisting of Miha Nahtigal, a computer science master’s student at the University of Ljubljana working alone. We adapted his source code and released it publicly to be used by anyone. The implemented methods have already been used within the IMapBook studies in Slovenia (see Section 4.1). Group 1’s solution is based on representing text using TF-IDF and Word2Vec vectors (Mikolov et al., 2015). TF-IDF is a standard vector-based weighting scheme that gives more weight to words that are frequent in a document but rare across the corpus. The Word2Vec vectors are trained using neural networks and, like GloVe vectors, encode the similarities among words.
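To make the TF-IDF weighting concrete, here is a minimal sketch. It is not Group 1’s released code: the function name is ours, the three toy documents are made up, and the IDF variant used (the natural log of N/df, plus 1) is one common choice among several.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {token: weight} dict per tokenized document.

    TF is the token's relative frequency within the document; IDF is
    log(N / df) + 1, where df is the number of documents containing it.
    """
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    idf = {token: math.log(n_docs / df) + 1.0 for token, df in doc_freq.items()}
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append(
            {token: (count / len(doc)) * idf[token] for token, count in term_freq.items()}
        )
    return vectors


# Made-up tokenized answers (not from the study's data).
docs = [
    ["the", "ship", "drifted"],
    ["the", "engine", "failed"],
    ["the", "ship", "failed"],
]
vectors = tf_idf_vectors(docs)
# "drifted" occurs in one document, "ship" in two, "the" in all three,
# so their weights in the first document decrease in that order.
print(vectors[0])
```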
Results
Table 4 shows the performance of the algorithms for each question separately. The results of the best model (All examples+ model, micro F-score) correlate with the annotator agreement scores (column with heading “Ann. Agree.”). Also, the best results were achieved for the question “Why is every adult trying to shake her hand?” However, we need to be careful with this question. The question was evidently very easy to answer, and the majority of the answers were correct. Therefore, there were few negative answers in the data set, so a simple majority classifier would achieve similar results. When comparing the results of each model, we found that all examples+ models were better for the questions that needed to be answered with longer text, such as “Why does Shiranna’s father get sucked into a black hole in her nightmare?” or “How do you think Shiranna’s confidence has changed? What events caused this change?”
Table 4. Performance of the Algorithms of Group 1.
Note. Per each question separately for all three models. The annotator agreement is calculated only as a share of the same scores given by both annotators. The results were retrieved using standard machine learning evaluations (10-fold cross-validation on the training data).
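The caution about the handshake question can be made concrete with a majority-class baseline, sketched below. The label counts are hypothetical and chosen only to show the effect of imbalance; they are not the counts from Table 2 or Table 4.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Always predict the most frequent training label.

    Returns that label and its accuracy on the test labels.
    """
    majority_label = Counter(train_labels).most_common(1)[0][0]
    accuracy = sum(label == majority_label for label in test_labels) / len(test_labels)
    return majority_label, accuracy


# Hypothetical, heavily imbalanced question: almost every answer is correct.
toy_train_labels = [1.0] * 18 + [0.5] + [0.0]
toy_test_labels = [1.0] * 9 + [0.0]
print(majority_baseline(toy_train_labels, toy_test_labels))  # prints: (1.0, 0.9)
```

A model’s score on such a question is informative only to the extent that it exceeds this baseline.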
Do the models significantly improve if we do away with partly correct, and only classify whether answers are correct or not correct (binary)? In Table 5, we report the results where models were trained against the full setup (correct, incorrect, and partly correct), versus the binary setup (correct and not correct). Both setups were evaluated against the test set containing labels 0 and 1. We can observe that the performance of both setups for single example and all examples models is similar. However, with the all examples+ model, the binary setup significantly outperforms the full setup. The all examples+ models use additional resources which help in differentiating between the correct and incorrect answers but may include more noise in prediction when including partly correct answers. Also, if we compare the binary setup to the best student model (Group 1) evaluated against the full testing data set (see first line in Table 3), the performance is slightly increased. However, the difference is not large enough to remove partly correct answers, because they contribute additional feedback to the IMapBook readers.
Table 5. Comparison between training the algorithm of Group 1 to predict classes 0 and 1 only (second line) instead of classes 0, 0.5, and 1 (first line) for all three models.
Note. The first algorithm was trained against the full data set and the second against the data set containing only labels 0 and 1. Both were evaluated against the test part of the data set containing only labels 0 and 1. Items in the All examples+ columns are bolded to highlight that the algorithm performed better with the binary setup for All examples+.
Empirical Study in Slovenian Schools
We also had an opportunity to address Research Question 3 (transferring the NLP algorithms to another language) by testing the NLP algorithms in 19 fourth-grade classes in eight schools, involving 341 students. This testing was part of an experimental study in Slovenian schools focusing partly on open-ended questions with NLP feedback. The study investigated the relative effects (on reading motivation and text comprehension) of (a) individual interaction: two forms of open-ended questions, (i) questions with NLP feedback and (ii) questions framed as game-like conversations with characters in the story, versus (b) social interaction: small-group SMS-style discussions of questions about the stories, versus (c) no interaction: control condition with text only. See Figure 3.

In-Classroom Study Design. The figure shows the division of IMapBook reading tasks among participants—elementary school students.
The study was based on the fourth-grade language arts curriculum in Slovenia, using stories from the fourth- and fifth-grade school readers, which the students had not yet read. It also used digital versions of some typical classroom activities—online discussions similar to the small-group discussions Slovenian teachers often use in classrooms.
Under the supervision of their teachers and the preservice teachers conducting the research, fourth graders went to the computer laboratory in their school. The students logged into the IMapBook web-based eReader software with anonymous usernames supplied to them. The students then read and interacted with two stories in digital form in one of three conditions:
Condition 1: Two forms of questions involving single-person interaction with the eBook: (a) open-ended questions with NLP feedback—we transferred the NLP algorithms developed by the computer science students on the English data set, to the Slovenian language. To obtain the necessary training data, partway through the study fourth-grade student answers were downloaded and then graded by two preservice teachers. These graded responses, along with the relevant text excerpts from which the inferences were made, were used by one of the current investigators to create NLP models on a server. The IMapBook eReader was adapted to query these models and then pass appropriate real-time feedback on to the fourth graders.
Each story in the study contained two embedded open-ended questions with NLP feedback. Students typed in their answer and received polite feedback on the correctness of their answers (determined by the NLP algorithm). If the NLP algorithm returned a score of 1.0, IMapBook told the student, “You are correct!” If the NLP algorithm returned a score of 0.0 or 0.5, IMapBook replied noncommittally “Thank you for your answer.” We did not inform the students that they had responded incorrectly, because we were not sure of the accuracy of the NLP algorithms.
In either case, the students could move on to the next page to read further. Other than being in the Slovenian language, the open-ended questions with NLP feedback used in the Slovenian study were quite similar to the questions in English used to develop the NLP algorithms. They were medium difficulty level inference questions, mostly related to causal antecedent inferences.
(b) In this condition, students also interacted, once per story, with game-like conversations with a character from the story. These conversations with characters also involved an inference question about the story, similar to the open-ended questions scored by NLP. However, in these conversations with characters, students clicked on buttons with words to form a short response. The feedback used a rigid algorithm that matched their answers with exemplar correct and incorrect answers, providing feedback accordingly.
Condition 2: Social interaction via SMS-style texting in groups of four to six students: Partway through each story, a pop-up message notified students that a discussion awaited them. They could click on a button and navigate to a discussion, where they were presented with a question relating to a theme of the story. Students posted messages related to the question, and also to other extraneous subjects. The researcher, preservice students and the teacher monitored and moderated these student discussions from a separate workstation.
Condition 3: Control condition with text only, no interaction.
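The feedback rule for the open-ended questions with NLP feedback in Condition 1 can be sketched as follows. The function name is hypothetical, and the sketch assumes the NLP model returns exactly one of the three scores named in the text; the two messages are the ones quoted above.

```python
def feedback_for_score(score):
    """Map an NLP score to the message shown to the student."""
    if score == 1.0:
        return "You are correct!"
    if score in (0.0, 0.5):
        # Deliberately noncommittal: students were not told they were wrong.
        return "Thank you for your answer."
    raise ValueError(f"Unexpected score: {score!r}")


for score in (1.0, 0.5, 0.0):
    print(score, "->", feedback_for_score(score))
```

Collapsing the scores 0.0 and 0.5 into one neutral message means a misclassified answer never produces a false statement that the student was wrong.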
Regardless of condition, students took hard-copy tests before and after the intervention: (a) motivation-to-read pre- and posttests, where the posttest was a selection of questions from the pretest rewritten to apply to what the students had experienced in the experimental condition, and (b) a multiple-choice comprehension posttest on the two stories they had just read.
The testing of the NLP algorithms is considered preliminary, because the open-ended questions with NLP feedback were not isolated as a variable. However, the situation was ecologically valid, as the study (a) was set in fourth-grade computer labs and classrooms in Slovenian schools, (b) used stories from the fourth- and fifth-grade readers used in these schools, and (c) was based on the exact curriculum for those fourth-grade language arts classes. Furthermore, the study was conducted by Slovenian preservice teachers, who were in the classrooms during the study, and observed first-hand how the students reacted to the open-ended questions.
NLP Models in the Study
The NLP models developed on English data were repurposed for open-ended questions in Slovenian stories (a different language), using small data sets for training. A first subset of the answers to the Slovenian open-ended questions, from the first part of the study, was graded by two Slovenian preservice teachers to produce a data set to train NLP models. The NLP models were then inserted into the web-based eBooks to provide feedback to the fourth graders during the remainder of the study. The responses from the NLP algorithms were then graded by the two preservice teachers, to measure the accuracy of the NLP algorithms.
Results
NLP Data
Table 6 summarizes results on research question one, especially regarding sparse data sets, and on research question three, about transfer of the NLP algorithms to another language. For comparison, Table 6 includes results from the algorithms developed with the English data, using the young adult novel Weightless, alongside results from three Slovenian stories for fourth graders: “Veveriček,” “Butalci,” and “Sonce.” Twelve questions were used for Weightless, in English, and two questions were used for each of the Slovenian stories. Best model accuracy was well over 50% for three of the four stories: 89% for Weightless in English, and 69% and 82% for the Slovenian stories “Veveriček” and “Butalci,” respectively. However, for “Sonce,” also in Slovenian, best model accuracy was only 48%. The difference between accuracy over and under 50% suggests that a minimum data set size is needed to train an NLP model. The relationship between the number of training answers per question and best model accuracy is indicative: Weightless, “Veveriček,” and “Butalci” had 71, 126, and 80 training answers per question and best model accuracies of 89%, 69%, and 82%, respectively, whereas “Sonce” had 40 training answers per question and 48% best model accuracy. For these types of questions (medium-difficulty inference questions), it appears that a training set of roughly 70 to 80 answers per question is the minimum needed to produce a model in the 69% to 89% accuracy range, that is, well over 50% accuracy. Notably, the algorithms developed with English-language data were transferred to Slovenian (a language from a different family, the Slavic languages) with training sets as small as 80 answers per question, resulting in best model accuracy of 69% to 82%.
Summary of Best Model Accuracy and Size of Data Sets Across Four Stories, in Two Languages. Training answers per question and Best model accuracy are highlighted to make it easier to understand these data relative to research question one.
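The figures reported in the text for the four stories can be tabulated with a short sketch to make the relationship between training set size and accuracy easier to inspect. The numbers are those stated above; the variable names are illustrative only.

```python
# (story, language, training answers per question, best model accuracy)
stories = [
    ("Weightless", "English", 71, 0.89),
    ("Veveriček", "Slovenian", 126, 0.69),
    ("Butalci", "Slovenian", 80, 0.82),
    ("Sonce", "Slovenian", 40, 0.48),
]

# Split the stories at the 50% accuracy mark discussed in the text.
above = [s for s in stories if s[3] > 0.5]
below = [s for s in stories if s[3] <= 0.5]

smallest_above = min(n for _, _, n, _ in above)
largest_below = max(n for _, _, n, _ in below)

for name, language, n, accuracy in sorted(stories, key=lambda s: s[2]):
    print(f"{name:<12}{language:<11}{n:>4} answers/question  {accuracy:.0%}")

print(f"Smallest training set above 50% accuracy: {smallest_above}")
print(f"Largest training set at or below 50% accuracy: {largest_below}")
```

With only four stories, the sketch can bracket where accuracy crosses 50% in these data but cannot locate a precise threshold.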
Experimental Study Data
There were significant results on the posttest for motivation to read (focusing on the condition the students had just experienced). Students in both (a) the control condition/text only (M = 0.586, SD = 0.345) and (b) the open-ended questions and game-like conversations with characters condition (M = 0.614, SD = 0.332) were significantly more motivated to read the eBooks than (c) students in the discussion condition (M = 0.448, SD = 0.448), F(2, 237) = 3.59, p = .029.
On the local comprehension test related to the stories just read, there were no significant differences. The scores of students by group were (a) control condition/text only (M = 0.74, SD = 0.13), (b) open-ended questions and game-like conversations with characters condition (M = 0.73, SD = 0.16), and (c) discussion condition (M = 0.71, SD = 0.18).
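As a consistency check on the reported test statistic, the p value can be recomputed from F(2, 237) = 3.59. The sketch below relies on the closed-form upper-tail probability of the F distribution, which holds only when the numerator degrees of freedom equal 2: P(F > x) = (1 + 2x/d2)^(-d2/2). The function name is hypothetical.

```python
def f_upper_tail_df1_2(f_value, df2):
    """Upper-tail probability of an F(2, df2) distribution.

    Valid only for numerator degrees of freedom equal to 2.
    """
    return (1.0 + 2.0 * f_value / df2) ** (-df2 / 2.0)


p_value = f_upper_tail_df1_2(3.59, 237)
print(f"p = {p_value:.3f}")  # p = 0.029
```

The result agrees with the reported p = .029 to three decimal places.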
Results: Research Question 2
Research question two addressed how NLP algorithms for providing feedback to answers to open-ended inference questions can be used in a low-labor model for developing interactive web-based eBooks to improve children’s motivation to read and literacy skills. The adaptation of the NLP algorithms developed in one language and at one educational level (English and master’s students) to another language and educational level (Slovenian and fourth graders) provides some indication of their generality. Furthermore, the amount of work involved in setting up NLP models for a new educational situation suggests how well these NLP techniques work with a low-labor model of instructional technology development.
First, we must differentiate between up-front labor versus just-in-time eBook development labor for using the NLP for a particular classroom. The effort to develop the NLP algorithms themselves is up-front labor. This includes preparing the initial data set, students developing NLP algorithms for grading open-ended questions, the computer science professor testing and selecting algorithms and integration of the NLP algorithm into the IMapBook server and eReader.
Only labor associated with just-in-time adaptation of these NLP algorithms to a new educational situation measures how these NLP algorithms fit a low-labor eBook educational development model. This just-in-time labor includes downloading of answers to six open-ended questions (5 minutes), grading of responses (1.5 hours), creation of four NLP models on a server (1 hour), using the eBook authoring system to set up the NLP transitions in two short eBooks (1 hour) and testing the NLP transitions prior to using them in the field (1 hour), for a total of approximately 4.5 hours.
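The just-in-time labor total can be verified by summing the individual tasks. The durations below are the ones listed in the text, converted to minutes; the dictionary keys are shortened descriptions.

```python
# Just-in-time labor items from the text, in minutes.
tasks_minutes = {
    "download answers to six open-ended questions": 5,
    "grade responses": 90,
    "create four NLP models on a server": 60,
    "set up NLP transitions in two short eBooks": 60,
    "test NLP transitions before field use": 60,
}

total_minutes = sum(tasks_minutes.values())
hours, minutes = divmod(total_minutes, 60)
print(f"Total: {total_minutes} minutes ({hours} h {minutes} min)")
```

The sum is 4 hours 35 minutes, which the text rounds to about 4.5 hours.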
Discussion
Research question one investigates what NLP algorithms make feedback to answers to open-ended inference questions in eBooks more accurate and flexible, using sparse data sets. Computer science students from the University of Ljubljana designed language-agnostic algorithms, which were transferable to Slovenian data. Two basic model types, the single-example model and the all examples model, could be directly trained against Slovenian data. In contrast, the all examples+ model needed additional resources (such as semantic databases) not available for Slovenian, so these models achieved similar performance to the all examples models.
Group 1’s solution was deployed in IMapBook software for a study in elementary schools in Slovenia. The solution used some traditional machine learning approaches, including pretrained word embeddings, such as Word2Vec models. The method does not include language-specific rules and is, therefore, easily transferred to other languages, with a small number of initial examples for training a model. Initially, the model was designed using English. During pilot studies in Slovenian schools, we gathered enough Slovene data to automatically train the models for the Slovene language. These models were then used in a full study to demonstrate a practical solution for automatically grading students’ answers to open-ended reading comprehension questions, across languages. Further studies will focus on fine-tuning existing pretrained language models (such as BERT; Devlin et al., 2018) to adapt them for reading comprehension tasks.
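To illustrate why an embedding-based approach carries no language-specific rules, the sketch below scores a new answer by comparing its averaged word vectors with those of already graded answers. This is not Group 1's implementation: the three-dimensional vectors are invented toy values standing in for pretrained Word2Vec embeddings, and the nearest-graded-answer rule is an assumption chosen for brevity.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# A real system would load pretrained vectors for the target language.
TOY_VECTORS = {
    "she": [0.1, 0.0, 0.1],
    "was": [0.0, 0.1, 0.0],
    "afraid": [0.9, 0.1, 0.0],
    "scared": [0.8, 0.2, 0.0],
    "happy": [0.0, 0.9, 0.1],
    "glad": [0.1, 0.8, 0.1],
}


def embed(text):
    """Average the vectors of all known words in the text."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def predict_score(answer, graded_answers):
    """Return the score of the graded answer most similar to the new answer."""
    target = embed(answer)
    _, best_score = max(
        graded_answers, key=lambda pair: cosine(target, embed(pair[0]))
    )
    return best_score


# Hypothetical graded answers for one question.
graded_answers = [("she was afraid", 1.0), ("she was happy", 0.0)]
print(predict_score("she was scared", graded_answers))
```

Swapping in embeddings trained on another language changes only the vector table, not the scoring logic, which is the property the paragraph above attributes to the deployed method.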
This NLP research is an important first step, because the quality of interaction is important to the development of educational media. When instructional designers develop computer-based educational modules, the potential outcomes can range from low to high on Bloom’s taxonomy, from simple recall of facts, to comprehension, to outcomes as sophisticated as synthesizing complex solutions. One problem with developing assessments for computer-based educational modules is that assessments are often at a lower level than the learning outcomes, because limits on the designers’ technical skills make it difficult to assess at higher levels. NLP has the potential to provide higher order assessments. Multiple-choice assessment might be appropriate for lower level knowledge assessments. However, open-ended questions where students type the answers may be more appropriate for higher level comprehension assessments. Inferencing is fundamental to comprehension. The open-ended questions used in the current research were such inference questions. This research provides a path toward providing automated feedback to such inference questions.
Research question two revolved around fitting the NLP algorithms into a low-labor model for developing educational web-based eBooks. The preliminary study in Slovenia illustrated some important points. It was the first time that we adapted the NLP algorithms to a classroom situation. Also, we developed the NLP algorithms in English and then adapted them to another language, Slovenian. Nevertheless, just-in-time adaptation of the NLP algorithms to the new situation required a modest amount of time, approximately 4.5 hours. This suggests that the NLP algorithms fit extremely well into a low-labor model for developing educational web-based eBooks. This is especially so when one considers that this was the first time the NLP was used in the web-based eBooks. With subsequent attempts, some rough edges could be smoothed out and the time required would probably decrease.
It is also worth discussing some of our results from the experimental study with Slovenian fourth graders, which in one condition used open-ended questions with NLP feedback. In terms of reading motivation, students in both the open-ended question/game condition and the control condition were significantly more motivated to read the eBooks than in the discussion condition. Low motivation in the discussion condition may have been because students used anonymous usernames. Therefore, in the discussion condition, students did not know with whom they were texting. A high percentage of postings related to figuring out who was in their texting group, lessening the focus on the eBook. The anonymity may have also contributed to behavior problems in the discussions. Both of these factors may have diminished the motivational qualities of the eBooks.
On the local comprehension posttest related to the stories just read, there were no significant differences between groups. The open-ended questions with NLP feedback at least did no harm. Thus, our results did not confirm some prior studies which have suggested that inserted questions can improve comprehension (Dunning, 1988; Guthrie, 1977; McConaughy, 1980, 1985; Morrow, 1985). However, other studies on inserted questions have also obtained inconclusive results (Dornisch & Sperling, 2008; Medina et al., 2017). Apparently, obtaining significant results from inserted adjunct questions is no easy task. Perhaps follow-up studies should isolate the open-ended questions from other forms of interaction and perhaps also find more sensitive pre- and posttests. While experimental results were inconclusive, the porting of the NLP algorithms to Slovenian and their use in the fourth-grade classes was certainly a logistical success.
Future Work
Pretrained models are models that have encoded knowledge of a specific language, for example, how sentences are formed, which words are important, and so forth. These models are constructed on large text corpora from many sources. Typically, these models are further adapted to specific tasks, such as question answering. Adapting existing models, called “transfer learning,” did not achieve superior performance until large neural networks were used for language modeling. Currently, these types of models achieve the best performance for most NLP tasks. The transfer learning approach enables successful training of a model without a large corpus of tagged data. However, what counts as a large corpus of tagged data depends on the situation and point of view. For computer scientists gleaning data sets from the internet, or from an open-source academic commons, 10,000 data points might be deemed sparse. However, for teachers in a practical educational situation, 50 or fewer data points might be what is actually practical and available. Therefore, transfer learning with extremely sparse data sets, in practical educational situations, represents a logical direction for our further research.
Limitations
In investigating Research Question 3 (How well can the algorithms developed be used on open-ended questions in another language?), we successfully ported the algorithms from English to Slovenian and tested them in a school setting. Specifically, we tested the relationship of training data set size to the accuracy of the resulting models. We also provided a rough estimate of the time needed to set up these models.
The testing of the NLP models in Slovenian schools was framed in an experimental study on how different types of interactions in eBooks affected motivation to read. One form of interaction was open-ended questions, with feedback supplied by these NLP algorithms. However, the open-ended questions, with NLP feedback, were not isolated as a variable, but rather bundled with another form of interaction (“game-like conversations with characters,” i.e., answering questions by clicking on buttons with words). Although there were some significant differences between some of the conditions (for instance, the condition with open-ended questions with NLP feedback was significantly more motivational than the discussion condition), this difference could not be attributed only to the open-ended questions with NLP feedback, as that form of interaction was not isolated as a variable.
Another limitation was that many things done in the study were done for the first time. Therefore, the time estimates for developing and using NLP models for web-based eBooks probably overestimate the time needed in routine use.
An additional limitation is that typical educators, whether at the university or K12 level, do not have professional connections with computer scientists working on NLP. Therefore, they cannot easily set up NLP feedback in their custom-made materials.
Conclusion
These results provide a proof of concept that reasonable NLP feedback for open-ended questions is close to being within reach for educators. Follow-up studies are needed to refine this process. Furthermore, NLP feedback for open-ended questions needs to be packaged into a user-friendly, off-the-shelf package for educators. This study also illustrates how educators and computer scientists can collaborate in mutually beneficial ways.
Acknowledgments
The authors would like to thank Miha Nahtigal for providing his source code online, as well as the other students who took the NLP course in Fall 2018 at the Faculty of Computer and Information Science at the University of Ljubljana.
The authors would like to thank the faculty at the University of Ljubljana in the College of Education: Tomaž Petek, Karmen Pižorn, Igor Saksida, and Milena Blažić, who worked to make this possible.
Particular thanks to the College of Education students at the University of Ljubljana: Meta Hojs, Lara Hodej, Alja Vintar, Nina Kostrevc, Ana Kogovšek, Neža Krajnik, Timea Staric, Eva Sikošek, Milan Ajduković, and Laura Krajnc.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The visits between Slovenia and the United States of America were funded by the Slovenian Research Agency under the project “Web-based eBooks with activities: internationalization of Natural Language Processing” (BI-US/18-19-043) and the USF Nexus Travel Grant. The research was largely funded by a Fulbright U.S. Scholarship from the United States and Slovenian governments. Great thanks go to the IIE Fulbright organization for making this uniquely rewarding experience possible.
