Examining the Effects of a Real-Time, Knowledge-Aware Tool for Academic Writing Assessment

Abstract

Computer-supported writing assessment (CSWA) has been widely used to reduce instructor workload and provide real-time feedback. The interpretability of CSWA has drawn extensive attention because it can benefit the validity, transparency, and knowledge-aware feedback of academic writing assessments. This study proposes a novel assessment tool, EduNERScore, which enables raters to identify different types of domain knowledge in unstructured educational texts and strengthens the interpretation of assessment through knowledge-aware evidence. An educational experiment was conducted to examine raters’ validity, efficiency, interpretability, and user perception with and without the support of EduNERScore. The findings show that, compared with the control group, the experimental group supported by EduNERScore achieved validity comparable to that of human experts, high efficiency, high interpretability, and good user perception. Finally, based on the empirical results, this research discusses the technological and educational implications for developing and applying interpretable CSWA systems in education.

Keywords

computer-supported writing assessment; natural language processing; named entity recognition; interpretability; higher education

Introduction

Academic writing has been employed as an effective instructional strategy to facilitate students’ information presentation and knowledge application and to guide their learning progress in different subjects (Huisman et al., 2019; Yang, 2016; Zhang, 2021). Computer-supported writing assessment (CSWA), as a scaffolding tool, has been widely used in academic writing contexts to help instructors measure students’ mastery of targeted knowledge (Dunsmuir et al., 2015; Thompson & Braude, 2016). CSWA can provide real-time, automated assessment results, reduce instructors’ time-consuming workloads, and generate individualized feedback that students can use to revise their writing products (Conijn et al., 2020; Klimova, 2011; Sung et al., 2016; Wang et al., 2008).

The research community has paid attention to improving the interpretability of CSWA results without the involvement of human experts (Chapelle et al., 2015; Wei et al., 2021; Wilson et al., 2021a, 2021b). The interpretability of automated assessment has three major advantages. First, interpretability can improve the validity of a CSWA system because it requires explanations of the assessment criteria and of how the scores are generated (Kim et al., 2020; Ploegh et al., 2009). Second, interpretability helps instructors comprehend how the assessment works and thus reduces negative factors arising from assessors’ subjectivity (e.g., writing expertise, assessment experience) (Ade-Ibijola et al., 2012; Weideman, 2019). Third, interpretability enables instructors to confirm the accuracy of CSWA results and the alignment of assessment criteria with instructional requirements (Ade-Ibijola et al., 2012), and helps students improve their writing quality based on feedback (Weideman, 2019). Overall, the interpretability of CSWA can enhance the transparency of the assessment criteria and procedures and the credibility of the results, which offers continuous support for instructors’ guidance and students’ writing (Weideman, 2019).

However, there are two major challenges to enhancing the interpretability of CSWA: the technologies that underpin interpretability, and the targeted assessment criteria and contexts. First, multiple machine learning (ML) algorithms are used to implement CSWA, and one of the major approaches is the deep learning-based automated assessment model (Azmi et al., 2019). Such a model can provide timely scoring results, but it does not offer an interpretation of the assessment process or of how the assessment results are derived (Kumar & Boulanger, 2020). Second, the expected assessment criteria usually differ considerably across educational contexts (Conijn et al., 2020; Itua et al., 2014). For example, linguistic features are the crucial indicators for writing in a language learning context; for academic writing, however, attention needs to be paid to high-level semantic features (Madnani et al., 2017; Villalon & Calvo, 2011). Therefore, to achieve high-quality interpretability, it is necessary to improve CSWA by integrating assessment techniques with the targeted contexts.

This research integrated a knowledge-aware strategy and the Named Entity Recognition (NER) technique to address these two challenges. First, a knowledge-aware strategy is proposed to demonstrate the structured knowledge reflected in the content of student papers and to provide evidence on which raters can base their scores. Second, NER is an advanced artificial intelligence technique that extracts structured knowledge by employing a deep learning (DL) algorithm. Based on these two components, we designed and implemented a semi-automated assessment tool, EduNERScore, to extract knowledge from unstructured texts, visualize the structured knowledge in the content of students’ papers, and evaluate the quality of academic papers based on a scoring model. Raters can use EduNERScore to quickly access a paper’s quantitative metrics on the knowledge dimension and related assessment information about the paper (e.g., length, reference score). The knowledge graph presented in the tool can enhance raters’ understanding of the knowledge structures demonstrated in the paper and improve the quality of raters’ interpretations of the assessment results. Moreover, we conducted an experiment to verify the validity, efficiency, and interpretability of assessment results supported by EduNERScore, and to explore raters’ perceptions of the tool. Based on the empirical results, we discuss the methodological and pedagogical implications for developing and applying interpretable CSWA systems in education.

Literature Review

Types of Computer-Supported Assessment Systems

CSWA, as a typical computer-supported technology, has gained popularity in a variety of educational settings for scoring and assessing the quality of academic writing (Dikli, 2006; Shermis et al., 2013). From the theoretical perspective, based on the zone of proximal development theory, CSWA can provide timely and sufficient feedback to students in order to facilitate their knowledge internalization process (Nunes et al., 2022). From the practical perspective, the real-time feedback of CSWA provides opportunities for students to plan, write, and revise their papers, and allows them to improve their writing quality through trial and error (Weigle, 2013). In addition, CSWA can reduce the bias and time costs caused by factors such as human raters’ insufficient assessment knowledge (Weigle, 2013). This advantage can help instructors or raters save time and focus more on the higher-level assessment of writing quality (such as logic, structure, and vocabulary usage) (Nunes et al., 2022). Because of these advantages, there is increasing interest in designing and implementing effective CSWA systems to meet the requirements of academic writing assessment.

Multiple types of CSWA systems have been developed, including automated essay scoring systems (AESS), automated writing evaluation systems (AWES), and intelligent tutoring systems (ITS) (Conijn et al., 2020). These systems aim to improve the efficiency of assessment, provide automated feedback and correction suggestions, and help students revise their academic papers (Dikli, 2006; Ma et al., 2014; Nunes et al., 2022; Wilson & Czik, 2016). Most of these systems are product-targeted and widely used for standardized test evaluations (e.g., TOEFL, GRE) (Kyle, 2020; Stevenson, 2016). Because of this standardized attribute, most CSWA systems focus on the validity and efficiency of the assessment rather than on the interpretability of the assessment results. In other words, users usually cannot access the scoring criteria and procedures (Azmi et al., 2019; Zupanc & Bosnić, 2015). Moreover, these systems do not focus on collecting evidence from students’ academic writing process to provide meaningful feedback (Conijn et al., 2020). In academic writing practice, however, it is important for instructors and students to understand the detailed standards and process of assessment, such as which assessment techniques are used and which metrics contribute to the final score, in order to improve writing quality (Gerritsen-van Leeuwenkamp et al., 2017). To this end, there is an urgent need to investigate relevant factors that can improve the interpretability of CSWA.

Knowledge-Aware Writing Assessment

Visualization of structural knowledge is an important factor to consider in enhancing the interpretability of CSWA. Existing research has examined knowledge visualization methods that assist instructors in conducting academic assessments. For example, Wang et al. (2011) emphasized the importance of converting tacit assessment information into explicit knowledge. They designed a knowledge-map grading tool to integrate domain knowledge and provide a visualized graph of key concepts. Their results indicated that the visualization of knowledge helped instructors understand the composition of concepts or ideas in student papers and become aware of missing concepts. Villalon and Calvo (2011) used concept maps as an approximate representation of students’ current state of knowledge. Their results showed that the concept map tool could assess students’ conceptual understanding and further encourage students to revise their writing products. In short, automatic extraction of structured knowledge helps instructors improve assessment quality and offers a potential approach to enhancing the interpretability of academic writing assessment.

Moreover, various AI techniques have been employed to visualize structural knowledge in academic writing assessment. For example, Wang et al. (2008) employed three different machine learning techniques to identify the concepts of the domain and thus enable the assessment of creative problem solving in earth science education. Sung et al. (2016) optimized a Latent Semantic Analysis (LSA)-based scoring tool to grade students’ summary writing on the dimensions of concepts and semantic knowledge. This tool built three databases (latent semantic space, expert summaries, expert concept maps) with the intervention of human experts. Zupanc and Bosnić (2017) proposed a semantic-aware assessment system to improve semantic coherence and consistency. This system used third-party tools to extract knowledge from the corresponding subject and then checked the semantic consistency. Prior studies have indicated that extracting specific knowledge from students’ writing content is beneficial for improving the quality and interpretability of academic writing assessments (Ramesh & Sanampudi, 2022). However, existing methods mainly rely on human feature engineering (e.g., manual construction of an expert knowledge base) and can only recognize phrases or keywords contained in texts rather than extract meaningful knowledge entities (e.g., concepts, theories, policies) (Sung et al., 2016; Wang et al., 2011). Therefore, it is necessary to investigate advanced AI techniques that can automatically extract diverse structured knowledge from large-scale texts.

Integrated AI Techniques for CSWA Systems

A wide range of ML-based and DL-based algorithms have been used to extract text features and support academic writing assessment. Combined with human feature engineering, ML-based methods are widely used for educational text analysis. Lee and Segev (2012) utilized the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to mine keywords from specific texts in order to measure text quality. Sung et al. (2016) and Jorge-Botana et al. (2015) employed the LSA algorithm to mine concepts in the text and thus measured the similarity between the source and target texts. ML-based methods usually require extra human feature engineering, which makes them difficult to transfer from one educational context to another. On the other hand, DL-based NLP methods have recently been used to automatically learn the semantic representation of text from large-scale training datasets (Strobl et al., 2019). For example, Azmi et al. (2019) employed DL and LSA techniques to develop an automated Arabic Essay Evaluator that can effectively recognize the language and structure features of an essay. Litman et al. (2022) employed NLP-based algorithms to identify essay revisions and predict the purpose of each revision, which assisted the assessment of writing quality. Compared with ML-based methods, DL-based methods can better learn the semantic representation of the text and reduce the reliance on human feature engineering. In summary, DL-based methods can improve the efficiency of text analysis, while ML-based methods can acquire explicit text features.

To improve the interpretability of CSWA systems, it is necessary to integrate the advantage of efficiency offered by DL-based methods with the interpretability attribute of ML-based text analysis. In this way, the CSWA system can automatically infer learners' knowledge and contribute to the assessment efficiency and interpretable features (Litman et al., 2021; Schumacher, 2020). For example, Zupanc and Bosnić (2017) emphasized that the information extraction (IE) technique can be used to mine diverse knowledge entities from natural language texts to measure the quality of essays. IE techniques do not rely on human feature engineering and can also extract text features with interpretability. NER, a typical IE technique in the current NLP field, enables the extraction of different types of structured knowledge from large-scale unstructured texts. The NER technique can simultaneously obtain interpretable metrics that reflect the text quality and ensure the efficiency of knowledge extraction.

An integration of the knowledge-aware strategy and the NER technique is a potential solution for balancing the efficiency and interpretability of academic writing assessments. The knowledge-aware strategy helps raters measure the quality of papers through explicit, structured knowledge representations; by understanding the ratio, quantity, and types of knowledge reflected in students’ academic writing, raters can evaluate how that knowledge contributes to the quality of a paper. NER, from a technical perspective, can automatically identify different types of knowledge entities in the text content and thereby enables the knowledge-aware strategy. Therefore, this research aims to develop an interpretable writing assessment tool by integrating the knowledge-aware strategy and the NER technique, and to conduct an experiment on academic writing assessment that examines the validity, efficiency, and interpretability of assessment results obtained with the tool.

The Design and Development of EduNERScore

EduNERScore, an assistive CSWA tool, is designed to help instructors or raters understand how different types of knowledge are reflected in students’ academic papers. Integrating the knowledge-aware strategy and the NER technique, EduNERScore provides reliable reference indicators to support instructors’ academic writing assessment process, rather than merely providing final scores. These quantitative indicators and metrics have the potential to improve the efficiency and interpretability of academic writing assessment.

EduNERScore is a web-based text analysis tool deployed on a cloud server. It consists of three distinct modules: an automated knowledge mining engine (KME), a knowledge visualization engine (KVE), and a text quality analysis engine (TQAE) (see Figure 1). KME extracts knowledge from unstructured text, KVE visualizes structured knowledge through social network analysis techniques, and TQAE measures text quality based on quantifiable metrics.

Figure 1.

The architecture of EduNERScore.

The KME module has a built-in discipline entity recognition model trained on a large-scale labeled dataset. The dataset was generated from a collection of Chinese textbooks and Chinese journal papers in Educational Technology-related disciplines. The journal papers were published between 2012 and 2021 in Open Education Research, e-Education Research, Modern Distance Education Research, Journal of Distance Education, and Modern Educational Technology (all in Chinese). The dataset contained a total of 658,594 characters. Five professionals were invited to define the knowledge entity schema, and the BIO annotation schema (Tjong Kim Sang & De Meulder, 2003) was used to label the dataset. This process yielded a high-quality labeled dataset with 16 entity types (see Table 1). The dataset was then split into three subsets: TRAIN (78.13%), DEV (8.94%), and TEST (12.92%). The TRAIN and DEV datasets were used to develop the NER algorithms, and the TEST dataset was used to evaluate the model’s performance. We adopted the classical sequence labeling technique (Liu et al., 2022) and a state-of-the-art neural network architecture, the Transformer, to train an NER model (Ma et al., 2020). We then used the trained NER model to automatically extract structured knowledge and identify the corresponding types. KME is more efficient and accurate than traditional Chinese word segmentation tools (e.g., Jieba) because it is trained on a large-scale dataset and can therefore accurately identify knowledge entities and their corresponding knowledge types in texts from the covered disciplines.
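
To illustrate how BIO labels map back to typed knowledge entities, the following minimal Python sketch groups character-level tags into entities. It is not the authors' implementation: the function name `decode_bio`, the example sentence, and its tags are hypothetical, and the type labels (THE, CON) are borrowed from the entity schema purely for illustration. In the actual tool, the tags would be predicted by the trained NER model.

```python
from typing import List, Optional, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group character-level BIO tags into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration only.
    """
    entities: List[Tuple[str, str]] = []
    current_chars: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_chars, current_type
        if current_chars and current_type is not None:
            entities.append(("".join(current_chars), current_type))
        current_chars, current_type = [], None

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()                      # a new entity begins
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)     # continue the open entity
        else:
            flush()                      # "O", or an I- tag with no matching B-
    flush()
    return entities


# Hypothetical sentence: "Constructivist theory supports learning analytics".
chars = list("建构主义理论支持学习分析")
tags = ["B-THE"] + ["I-THE"] * 5 + ["O", "O"] + ["B-CON"] + ["I-CON"] * 3
print(decode_bio(chars, tags))
```

The sketch drops an `I-` tag that does not continue an open entity of the same type; other recovery strategies are possible.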

Table 1.

The schema of discipline knowledge entity.

Entity Types	Description
ALG	Computer algorithms
BOO	Book, textbook
COF	Conferences related to the education domain
CON	Discipline concept
COU	Country
CRN	Course name, discipline domain, e.g., Math, Educational Psychology
DAT	Date
FRM	The classical architecture, framework, model etc.
JOU	Journal
LOC	Address, state or province
ORG	All organizations, e.g., association, committee, department, faculty
PER	Person name, person reference, e.g., student, teacher etc.
PLO	Policy, authoritative report, guidance document etc.
TER	Discipline terminology
THE	Discipline theory
TOO	Technique, method, tool, platform etc.

The KVE module presents the relations between knowledge entities and knowledge types as a social network graph. We use social network analysis (SNA) techniques to analyze and display knowledge types and attributes. In the knowledge graph, a node represents a specific knowledge entity, and the node size represents the node degree, which reflects the number of knowledge entities. An edge connects a knowledge entity with its corresponding knowledge type, and these connections in turn form knowledge clusters. Users can zoom in on different knowledge clusters to observe the properties of knowledge entities and the types to which they belong. In summary, the KVE module provides an intuitive network representation of knowledge types, which helps instructors visualize the knowledge composition of unstructured texts.
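
The graph structure described above can be sketched with plain Python counters. This is a minimal illustration under stated assumptions, not the tool's code: the function name `build_knowledge_graph`, the dictionary layout, and the sample entities are hypothetical, and node size is taken here to be the weighted degree (the number of times an entity or type occurs).

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_knowledge_graph(entities: List[Tuple[str, str]]) -> Dict[str, dict]:
    """Build an entity-to-type graph from (entity_text, entity_type) pairs.

    Hypothetical sketch: edge weight = how often an entity occurs with a type;
    node size = weighted degree of the node.
    """
    edge_weights = Counter(entities)          # (entity, type) -> frequency
    entity_sizes: Counter = Counter()
    type_sizes: Counter = Counter()
    for (text, etype), weight in edge_weights.items():
        entity_sizes[text] += weight          # size of the entity node
        type_sizes[etype] += weight           # size of the type (cluster) node
    return {
        "edges": dict(edge_weights),
        "entity_nodes": dict(entity_sizes),
        "type_nodes": dict(type_sizes),
    }


# Hypothetical extraction result for one paper.
sample = [
    ("learning analytics", "CON"),
    ("learning analytics", "CON"),
    ("constructivism", "THE"),
    ("Moodle", "TOO"),
]
graph = build_knowledge_graph(sample)
print(graph["edges"])
print(graph["type_nodes"])
```

A graph library could render the same structure; the counters are used here only to keep the sketch self-contained.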

The TQAE module generates content-related feedback by considering the contribution of the knowledge demonstrated in an academic paper. Prior studies have claimed that measures of text quality should consider the contribution of key concepts (Wang et al., 2011; Zupanc & Bosnić, 2017). Sung et al. (2016) proposed an algorithm to measure the quality of text in terms of key concepts. Following the method of Sung et al. (2016), we revised the scoring equations to measure the contribution of diverse knowledge types. Equation (1) represents the ratio of different types of knowledge entities to the total words of the paper. Equation (2) represents the ratio of knowledge entities to the total number of words in all papers. Equation (3) describes the relative contribution (score) compared with a number of papers. The score offered by EduNERScore is not the final score of a paper but a reference indicator for the instructor’s assessment. Moreover, this module also provides a detailed description of other metrics, such as the explanation of entity types and the ratio of each entity in the paper. In summary, EduNERScore aims to provide raters with insight into the knowledge composition of unstructured text rather than to offer a final assessment score.

$$\text{ContributionScore} = \frac{\sum_{\text{types}=1}^{\text{Num. of types}} \text{knowledge entities}}{\text{Num. of papers}} \tag{1}$$

$$\text{Penalty} = \frac{\text{Num. of knowledge entities}}{\text{Num. of characters}} \tag{2}$$

$$\text{ReferenceScore} = \frac{\text{ContributionScore}}{\text{Penalty}} \tag{3}$$
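
As a rough aid to reading Equations (1) to (3), the sketch below transcribes them literally into Python. The function names, argument names, and sample numbers are hypothetical. The equations do not state whether each count refers to a single paper or to the whole set of papers, so the sketch takes every count as an explicit argument and leaves that choice to the caller.

```python
from typing import Mapping


def contribution_score(entity_counts_by_type: Mapping[str, int], num_papers: int) -> float:
    """Equation (1): entity counts summed over all types, divided by the number of papers."""
    if num_papers <= 0:
        raise ValueError("num_papers must be positive")
    return sum(entity_counts_by_type.values()) / num_papers


def penalty(num_entities: int, num_characters: int) -> float:
    """Equation (2): number of knowledge entities divided by number of characters."""
    if num_characters <= 0:
        raise ValueError("num_characters must be positive")
    return num_entities / num_characters


def reference_score(contribution: float, pen: float) -> float:
    """Equation (3): contribution score divided by penalty."""
    if pen == 0:
        raise ValueError("penalty must be non-zero")
    return contribution / pen


# Hypothetical counts, chosen only to exercise the functions.
c = contribution_score({"CON": 30, "THE": 10, "TOO": 20}, num_papers=4)
p = penalty(num_entities=50, num_characters=5000)
print(c, p, reference_score(c, p))
```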

Next, we describe the user interface of EduNERScore. EduNERScore provides a straightforward operation routine. Users only need to specify the filename and input the source text on the client side, and the server side returns the results of the text analysis (see Figure 2). In the knowledge visualization network (see Figure 2(a)), each knowledge node has a corresponding color, and the node size is weighted by degree. The degree of a node represents the occurrence frequency of a knowledge entity. An edge connects a knowledge entity with its corresponding knowledge type, and the edge’s thickness represents the frequency with which the knowledge entity occurs with that knowledge type. The right section of EduNERScore (see Figure 2(b)) shows detailed descriptions, such as the description of knowledge entity types, the explanation of the visualization results, and quantitative descriptions. Users can switch between menu items to view the information they are interested in. Based on the knowledge visualization network and the related feedback provided by EduNERScore, users can easily check the composition and ratio of diverse knowledge types and explain how the results were derived, which can increase the transparency and interpretability of CSWA.

Figure 2.

The interface of EduNERScore.

Methodology

Research Purpose and Questions

The research purpose was to examine the effect of EduNERScore in the academic writing assessment process. Our research questions were:

RQ 1. To what extent did EduNERScore improve the validity of assessment of academic writing papers?

RQ 2. To what extent did EduNERScore improve the efficiency of assessment of academic writing papers?

RQ 3. To what extent did EduNERScore enhance raters’ interpretations of their academic writing assessment results?

RQ 4. What were the raters’ perceptions of using EduNERScore in the academic writing assessment?

The Experiment Context

To examine the effect of EduNERScore, we conducted an experiment to compare assessment with and without tool support in an academic writing assessment context. The dataset of academic papers was collected from an undergraduate-level course named Modern Educational Technology. The instructor conducted collaborative writing activities to help students understand related subject topics and improve their knowledge inquiry and application. A total of 69 students were enrolled in this course. They were divided into 18 groups (3 or 4 students per group), and each group was required to complete four collaborative writing activities (each writing task lasted for 2 weeks). The four academic writing topics were Learning Analytics, Education and Artificial Intelligence, STEM education, and Computer Programming in K-12 education. Groups completed the academic papers collaboratively on a collaborative writing platform named ZheDaYunPan. After the course, excluding 3 incomplete papers, a total of 69 papers were collected from ZheDaYunPan. Based on the instructional goals, the academic writing assessment focused on students’ level of knowledge acquisition and application demonstrated in the academic papers.

The Participants, Experiment Design, and Procedures

A total of 16 raters participated in the experiment (see Table 2). Raters were assigned to an expert group, a control group, or an experimental group. Three university lecturers (average age = 32) formed the expert group. They all have a PhD degree or postdoctoral experience in educational technology, and have rich experience in publishing and reviewing academic papers. Therefore, this study employed their scoring results as a baseline to verify the validity of writing assessment. Moreover, to reduce the effects of confounding variables, the other 13 raters (average age = 25) were randomly divided into the experimental group (N = 6, male = 1 and female = 5; years of teaching experience = 0.80) and the control group (N = 7, male = 3 and female = 4; years of teaching experience = 0.93).

Table 2.

All raters engaged in the experiment.

	Raters	Gender	Major	Profession
Expert group	N = 3	Male (N = 1), Female (N = 2)	Educational technology	University lecturers (N = 3)
Control group	N = 7	Male (N = 3), Female (N = 4)	Educational technology	Middle school teachers (N = 2), Graduate students (N = 5)
Experimental group	N = 6	Male (N = 1), Female (N = 5)	Educational technology	Middle school teachers (N = 2), Graduate students (N = 4)

We designed an experiment to examine the effect of EduNERScore on academic writing assessments. The expert group was involved only in the scoring process, and the average score of the three experts was used as the standardized score of each paper. The intraclass correlation coefficient (ICC) of the three raters’ scores in the expert group was 0.80, which indicated high agreement between raters. The ICCs for the scores of the experimental and control groups were 0.86 and 0.68, respectively. Raters in the control group were asked to conduct the assessment task in a traditional, manual fashion without the support of EduNERScore, while raters in the experimental group used EduNERScore to complete the assessment tasks. The experimental and control groups were involved in all experimental phases, as described below, so that comparisons could be made.
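
For readers unfamiliar with the ICC, the following Python sketch shows one way such an agreement coefficient can be computed from a papers-by-raters score matrix. The text does not state which ICC form was used; this sketch assumes the two-way random-effects, single-rater form, commonly written ICC(2,1). The function name `icc_2_1` and the ratings are illustrative and are not the study's data.

```python
from typing import List


def icc_2_1(scores: List[List[float]]) -> float:
    """ICC(2,1) for a matrix with one row per paper and one column per rater.

    Assumed form: two-way random effects, absolute agreement, single rater.
    """
    n = len(scores)         # number of papers (targets)
    k = len(scores[0])      # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between papers
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                   # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Illustrative ratings: 6 papers scored by 4 raters.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(ratings), 2))  # about 0.29 for this matrix
```

Other ICC forms (for example, average-rater or consistency variants) use the same mean squares with a different final ratio, so the reported values should be compared only against the form the study actually used.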

The experiment consisted of three main phases (see Figure 3): a training phase, an assessment phase, and an interview phase. The training phase aimed to help all raters understand the experimental operations. This phase included four steps. First, one of the researchers explained the experimental procedures to the raters. Second, researchers helped raters master the think-aloud method, in which raters verbalize their thoughts during the assessment (Cotton & Gresty, 2006; Zheng et al., 2021). In this step, all raters went through practice sessions until they became proficient in using the think-aloud method. They were required to use the method to verbalize their thoughts whenever a score was given. Third, the researcher explained the assessment rubric and the recording requirements to the raters (see Appendix A). Raters were required to fill in a score sheet after completing the assessment of each paper. Fourth, raters in the experimental group were trained to use EduNERScore proficiently, while raters in the control group were asked to score the papers in the traditional, manual fashion.

Figure 3.

The experimental workflow.

Second, raters completed the academic writing assessment tasks during the assessment phase. At this stage, raters in the experimental group used the EduNERScore tool to access the quantitative metrics and visualize the knowledge graph of the paper, and then reconfirmed the quality of the metrics by reviewing the paper, which in turn assisted them in scoring and interpreting the assessment results. Moreover, all raters were required to record assessment data in strict accordance with the requirements of the assessment rubric (see Appendix A). The assessment rubric included two dimensions, namely identification and quality. Identification indicated whether the metrics could be easily identified by raters during the assessment, and quality indicated the quality of the corresponding metrics as evaluated by raters. During the assessment process, raters were also required to record the assessment time for each paper and categorize it on a five-point time-use scale (see Appendix A). The self-reported time-use scale was the indicator of assessment efficiency. In addition, they were asked to use the think-aloud method to explain why they gave each score. The think-aloud method required raters to articulate their thoughts and feelings while performing an assessment task, which helped researchers review the raters’ thinking and decision-making processes (Cotton & Gresty, 2006). Finally, to investigate raters’ perceptions of EduNERScore, raters in the experimental group were required to complete a questionnaire at the end of each assessment task. Based on the questionnaire of Wang et al. (2011), we designed a questionnaire (see Appendix B) with eight dimensions: usefulness, usability, attitude, technological complexity, concentration, satisfaction, stability, and usage intention. It contained 33 items across the 8 dimensions, each measured on a five-point Likert scale.

The final phase was a semi-structured interview (see Appendix C), used to collect raters’ perceptions about EduNERScore. Since only the experimental group was provided with the EduNERScore, the questionnaire and interview were only carried out in the experimental group.

Data Collection and Analysis

Data included the think-aloud audio recordings and raters’ score sheets from both the experimental and control groups, as well as the questionnaire and semi-structured interview data from the experimental group. There were 897 think-aloud audio recordings (156,028 Chinese words), 897 score sheets, 23 questionnaires, and 6 sets of interview data (54,199 Chinese words).

This research employed quantitative and qualitative methods to analyze the collected data. First, content analysis was used to analyze the think-aloud audio recordings. The think-aloud data were first transcribed with “讯飞听见/iflyrec” (a Chinese transcription software). The first author then split the transcribed text into units of analysis, where a unit was a sentence spoken by a rater. Two coders coded the units based on a coding framework (see Table 3). The coding framework included 11 metrics and 2 categories that reflected the breadth or depth of the interpretation of assessment results. Exploration (Exp) indicated that the rater mentioned a metric in the think-aloud scripts without further elaboration, and Elaboration (Ela) indicated that the rater elaborated on the corresponding metric with detailed explanations. The coding training included two procedures. First, the researcher introduced the coding rules and the meanings of the codes. The coders determined whether the think-aloud data contained a corresponding code based on the coding rules, and then decided which code corresponded to the text content. Second, the coders independently coded the think-aloud data, discussed the coding results in meetings, and reached a coding consensus. Inconsistent codes were resolved through negotiation until 100% agreement was reached. Then, one-way ANOVA was used to analyze the statistical difference in code frequency between the experimental and control groups.

Table 3.

The coding framework.

	Description	Code: Exploration [Exp]	Code: Elaboration [Ela]
Overall quality	Assessing the quality of the paper at a holistic level	This paper’s quality is quite good.	The paper is clear, readable, and of good overall quality.
Length	Measuring the length of a paper according to the requirement	The paper is short.	The paper has 5000 words.
Knowledge types	Number of types of knowledge in the paper	The text contains diverse knowledge types.	This paper includes 11 entity types among the 16 types.
Format	Paper layout and formatting	The paper layout is confusing.	Text font sizes are inconsistent.
Reference	Number and format of references in the paper	The paper has no references.	The paper has references, and the APA literature format is in accordance with the standard.
Content logic	Logical connections within a paragraph and between paragraphs	Content logic is very messy.	There is no clear argument within the paragraph, and the paragraphs lack logical connections.
Literature review	Review of related works	The paper lacks a comparison of different literature.	The literature compares K-12 education policies between the United States and China.
Time clue	Time clue of reference related to the topic	The reference time is too old.	The paper identifies the main ideas during the last 5 years.
Terminology	Applications of discipline-specific vocabulary	The vocabulary in the paper is very colloquial.	The paper uses disciplinary terminology to articulate theories and ideas, such as “learning analytics” and “constructivist theory”.
Argumentation	Providing explanations and evidence for certain arguments	The paper lacks clear and detailed argumentation.	A case study is used to illustrate how learning analytics influence evaluation, which makes the point convincing.
Other rubrics	Factors that are not included in the above 10 metrics, but can be used to some extent to reflect the quality of a paper	The writing is quite rough.	The essay is written in a coherent and fluent style, without grammatical errors.
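The coding scheme in Table 3 lends itself to a simple tally of codes per group. The following is a minimal sketch, assuming each coded unit is stored as a (group, metric, category) tuple; the sample units and the function name `tally_codes` are hypothetical and are not taken from the study's data or tooling.

```python
from collections import Counter

# Hypothetical coded units: (rater_group, metric, category), where category is
# "Exp" (Exploration) or "Ela" (Elaboration) as defined in Table 3.
coded_units = [
    ("experimental", "Terminology", "Ela"),
    ("experimental", "Length", "Exp"),
    ("experimental", "Knowledge types", "Ela"),
    ("control", "Reference", "Exp"),
    ("control", "Content logic", "Exp"),
]


def tally_codes(units):
    """Count coded units per (group, category) and per (group, metric)."""
    by_category = Counter((group, category) for group, _, category in units)
    by_metric = Counter((group, metric) for group, metric, _ in units)
    return by_category, by_metric


by_category, by_metric = tally_codes(coded_units)
print(by_category[("experimental", "Ela")])
print(by_metric[("control", "Reference")])
```

A `Counter` returns 0 for combinations that never occur, which mirrors how a metric that a group never mentions would show a frequency of zero.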

We used one-way ANOVA to examine the statistical differences between the experimental and control groups in validity, efficiency, and assessment metrics. The comparison of validity examined the accuracy of the assessment methods. Each paper was scored by all raters, and the average score was taken as the final score of the paper. The comparison of efficiency reflected how much EduNERScore shortened the time raters needed to complete the time-consuming assessment tasks, and the comparison of assessment metrics reflected how each specific metric contributed to validity.
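The score averaging and group comparison described above can be sketched in a few lines. This is an illustrative sketch only: the ratings are hypothetical, the function names `final_scores` and `one_way_anova_f` are assumptions introduced here, and the code computes only the F statistic and its degrees of freedom, not the p value.

```python
from statistics import mean


def final_scores(ratings_by_paper):
    """Average all raters' scores for each paper to get its final score."""
    return [mean(scores) for scores in ratings_by_paper]


def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical ratings: each inner list holds one paper's scores from several raters.
control = final_scores([[70, 74], [78, 80], [82, 84]])
experimental = final_scores([[78, 80], [80, 84], [86, 88]])
f_stat, df_b, df_w = one_way_anova_f(control, experimental)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

In practice a statistics package (for example, `scipy.stats.f_oneway`) would also return the p value that accompanies the F statistic.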

Moreover, descriptive statistics were used to summarize the questionnaire responses in order to understand the raters’ perceptions of EduNERScore. Thematic analysis was used to analyze the transcribed interviews in order to identify the advantages and disadvantages of the EduNERScore tool that raters perceived.

Results

To What Extent Did EduNERScore Improve the Validity of Assessment of Academic Writing Papers?

To examine the validity of EduNERScore, we compared the final assessment scores of the control, experimental, and expert groups. Raters in the expert and experimental groups gave similar scores. The average scores of the experimental group (M = 79.66) and the expert group (M = 79.91) were higher than that of the control group (M = 76.16) (see Table 4). Furthermore, the experimental group (SD = 33.70) and the expert group (SD = 33.37) had smaller standard deviations than the control group (SD = 61.76), and the standard deviation of the experimental group was only slightly larger than that of the expert group. In addition, the ANOVA results showed significant differences between the control and expert groups (p = 0.002) and between the control and experimental groups (p = 0.003). In contrast, there was no significant difference between the experimental and expert groups (p = 0.795). In other words, the experimental group, supported by EduNERScore, achieved validity comparable to that of the expert group.

Table 4.

Assessment score comparison across the control group, experimental group and expert group.

	Papers	Mean #Score	SD	F	p (control group)	p (experimental group)	p (expert group)
Expert group	69	79.91	33.37	10.24	0.002**	—	—
Control group	69	76.16	61.76	8.85	—	0.003**	—
Experimental group	69	79.66	33.70	0.07	—	—	0.795

Note. ‘—’ indicates a duplicate value that has been omitted.

Moreover, we analyzed the metrics in raters’ score sheets to investigate how the tool helped raters achieve this validity. The score sheet comprised seven metrics (Theoretical foundation, Domain knowledge, Literature review, Educational policies, Techniques/tools, Research purpose, and Reference). Each metric had two sub-dimensions, Identification and Quality. Identification indicated how easily the metric could be observed in a paper, and Quality indicated the quality of the paper on the corresponding metric. Table 5 presented the comparison of the seven fine-grained metrics. The results for the Identification sub-dimension showed significant differences between the experimental and control groups on all seven metrics. Raters in the experimental group could identify these fine-grained assessment metrics more easily with EduNERScore support, whereas raters in the control group might have found them difficult to identify. Moreover, the experimental and control groups did not differ significantly on the Quality sub-dimension for most metrics, except Theoretical foundation and Educational policies. The average quality score of Theoretical foundation in the control group (M = 2.57) was significantly higher than in the experimental group (M = 2.12). The average quality score of Educational policies in the control group (M = 2.69) was also significantly higher than in the experimental group (M = 2.32). Therefore, EduNERScore, to some extent, helped raters to easily identify the assessment evidence in the paper.

Table 5.

The comparisons of seven metrics in the assessment rubric.

	Theoretical foundation
	Identification					Quality
	Score	M	SD	F	p	Score	M	SD	F	p
Control group	121.14	1.76	0.07			177.57	2.57	0.24
Experimental group	153.17	2.22	0.04	138.74	0.000**	146.50	2.12	0.33	24.48	0.000**
	Domain knowledge
	Identification					Quality
	Score	M	SD	F	p	Score	M	SD	F	p
Control group	135.57	1.97	0.08			218.00	3.16	0.34
Experimental group	182.33	2.64	0.05	259.62	0.000**	222.33	3.22	0.30	0.43	0.513
	Literature review
	Identification					Quality
	Score	M	SD	F	p	Score	M	SD	F	p
Control group	121.14	1.76	0.20			153.14	2.22	0.81
Experimental group	163.17	2.36	0.13	77.22	0.000**	162.00	2.35	0.79	0.71	0.401
	Educational policies
	Identification					Quality
	Score	M	SD	F	p	Score	M	SD	F	p
Control group	135.00	1.96	0.15			185.43	2.69	0.47
Experimental group	158.83	2.30	0.16	26.98	0.000**	160.17	2.32	0.67	8.18	0.005**

	Techniques/tools
	Identification					Quality
	Score	M	SD	F	p	Score	M	SD	F	p
Control group	149.43	2.17	0.20			208.86	3.03	0.76
Experimental group	176.50	2.56	0.09	36.17	0.000**	203.67	2.95	0.51	0.31	0.580
	Research purpose
	Identification					Quality
	Score	M	SD	F	p	Score	M	SD	F	p
Control group	140.57	2.04	0.06			225.14	3.26	0.23
Experimental group	168.83	2.45	0.04	109.91	0.000**	218.67	3.17	0.17	1.52	0.220
	Reference
	Identification					Quality
	Score	M	SD	F	p	Score	M	SD	F	p
Control group	134.29	1.95	0.26			177.14	2.57	1.00
Experimental group	164.83	2.39	0.11	37.03	0.000**	189.50	2.75	1.30	0.96	0.328

To What Extent Did EduNERScore Improve the Efficiency of Assessment of Academic Writing Papers?

To address the second research question, we examined efficiency by comparing the assessment time self-reported by raters in the experimental and control groups. All raters recorded their scoring time on a five-point scale. The minimum value of 1 corresponded to the longest assessment duration (≈60 min), and the maximum value of 5 corresponded to the shortest assessment duration (≈15 min). A larger value indicated higher efficiency. Table 6 showed a significant difference (p < 0.01): the experimental group (M = 3.52) was more efficient than the control group (M = 3.15). The results indicated that the semi-automated assessment mechanism (experimental group) had an advantage in assessment efficiency over manual assessment (control group).
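The five-point efficiency scale can be illustrated with a short sketch. The endpoints (≈60 min as 1 and ≈15 min as 5) come from the description above; mapping the intermediate duration options listed in Appendix A (≈45, ≈35, and ≈25 min) to 2, 3, and 4 in order is an assumption, and the function names and sample durations are hypothetical.

```python
# Duration options from the rubric in Appendix A, mapped to the five-point
# efficiency scale (1 = slowest, 5 = fastest). The intermediate values
# (45 -> 2, 35 -> 3, 25 -> 4) are assumed from the order of the options.
DURATION_TO_EFFICIENCY = {60: 1, 45: 2, 35: 3, 25: 4, 15: 5}


def efficiency_score(reported_minutes):
    """Map a self-reported duration (one of the rubric options) to its scale value."""
    if reported_minutes not in DURATION_TO_EFFICIENCY:
        raise ValueError(f"{reported_minutes} min is not one of the rubric's duration options")
    return DURATION_TO_EFFICIENCY[reported_minutes]


def mean_efficiency(reported_durations):
    """Average efficiency over the durations reported for a set of papers."""
    scores = [efficiency_score(m) for m in reported_durations]
    return sum(scores) / len(scores)


# Hypothetical self-reports for four papers: scale values 4, 3, 5, and 2.
print(mean_efficiency([25, 35, 15, 45]))  # 3.5
```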

Table 6.

Assessment efficiency comparison between the control and experimental groups.

	Papers	Sum #Score	Mean efficiency	SD	F	p
Control group	69	217.00	3.15	0.09
Experimental group	69	243.17	3.52	0.17	38.43	0.000**

Moreover, Figure 4 showed the comparison of efficiency across papers. First, assessment efficiency was negatively related to paper length. That is, efficiency decreased as the word count of the paper increased, regardless of whether the rater was supported by EduNERScore. Second, the experimental group generally performed better than the control group across papers. In other words, the experimental group with EduNERScore support showed a clear advantage in assessment efficiency at the holistic level.

Figure 4.

The comparison of the assessment efficiency. The x-axis is the paper index; the left y-axis is the efficiency scale; the right y-axis is the number of words per paper.

To What Extent Did EduNERScore Enhance Raters’ Interpretations of Their Academic Writing Assessment Results?

To address the third research question, we used the think-aloud method to have raters verbalize their thoughts when giving a final score. Table 7 compared the experimental and control groups regarding how many rubric codes were mentioned to interpret the given scores. The results showed that raters in the experimental group (M = 4.07) mentioned significantly more codes than those in the control group (M = 3.01). Moreover, based on the statistics of the Exploration and Elaboration codes, the experimental group achieved greater breadth and depth of interpretation of assessment results. Specifically, the average number of Exploration codes in the experimental group (M = 2.46) was significantly larger than in the control group (M = 1.75), and the average number of Elaboration codes in the experimental group (M = 1.61) was also significantly larger than in the control group (M = 1.27). The results suggested that the experimental group not only mentioned more dimensions but also offered more detailed explanations of why they gave the assessment score.

Table 7.

Comparisons of the interpretation dimensions of the scoring results.

	Num. code (Total)	Mean code (Total)	SD	F	p
Control group	208.00	3.01	0.16
Experimental group	280.83	4.07	0.26	182.15	0.000**
	Num. code (Exp)	Mean code (Exp)	SD	F	p
Control group	120.57	1.75	0.26
Experimental group	169.67	2.46	0.28	64.72	0.000**
	Num. code (Ela)	Mean code (Ela)	SD	F	p
Control group	87.43	1.27	0.12
Experimental group	111.17	1.61	0.16	29.21	0.000**
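As a reading aid, the figures in Table 7 can be checked for internal consistency. The sketch below assumes that each mean is the corresponding code sum divided by the 69 papers listed in Tables 4 and 6, and that the Exploration and Elaboration counts add up to the total. Both relationships are inferences from the reported numbers rather than statements made in the text, and the helper name `mean_matches` is hypothetical.

```python
PAPERS = 69  # number of papers reported in Tables 4 and 6 (assumed to apply here)

# (sum of codes, reported mean) copied from Table 7.
table7 = {
    ("control", "total"): (208.00, 3.01),
    ("control", "exp"): (120.57, 1.75),
    ("control", "ela"): (87.43, 1.27),
    ("experimental", "total"): (280.83, 4.07),
    ("experimental", "exp"): (169.67, 2.46),
    ("experimental", "ela"): (111.17, 1.61),
}


def mean_matches(total, reported_mean, n=PAPERS, tol=0.01):
    """True when total / n agrees with the reported mean within rounding."""
    return abs(total / n - reported_mean) <= tol


for key, (total, reported) in table7.items():
    print(key, round(total / PAPERS, 2), mean_matches(total, reported))

# Exploration and Elaboration counts should add up to the total per group,
# allowing a small tolerance for rounding in the reported sums.
for group in ("control", "experimental"):
    parts = table7[(group, "exp")][0] + table7[(group, "ela")][0]
    print(group, abs(parts - table7[(group, "total")][0]) <= 0.02)
```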

Table 8 examined the differences between the experimental and control groups on each code. First, Knowledge types showed a significant difference: raters in the control group (M = 0.00) never considered the diversity of knowledge types demonstrated in papers, while raters in the experimental group did (M = 0.05). In addition, 8 of the 11 codes showed significant differences. The frequencies of seven codes (Overall quality, Length, Knowledge types, Format, Time clue, Terminology, and Argumentation) were significantly larger in the experimental group than in the control group, whereas the frequency of Reference was significantly larger in the control group than in the experimental group. There was no significant difference in the codes of Content logic, Literature review, and Other rubrics. This indicated that the experimental and control groups relied on Content logic, Literature review, and Other rubrics to a similar extent when interpreting the assessment results.

Table 8.

The comparisons of the think-aloud coding results.

	Overall quality
	Sum #code	M	SD	F	p
Control group	30.29	0.44	0.02
Experimental group	41.17	0.60	0.04	29.43	0.000**
	Length
	Sum #code	M	SD	F	p
Control group	3.86	0.06	0.01
Experimental group	9.33	0.14	0.02	16.80	0.000**
	Knowledge types
	Sum #code	M	SD	F	p
Control group	0.00	0.00	0.00
Experimental group	3.67	0.05	0.01	28.08	0.000**
	Format
	Sum #code	M	SD	F	p
Control group	21.43	0.31	0.04
Experimental group	41.33	0.60	0.04	65.12	0.000**
	Reference
	Sum #code	M	SD	F	p
Control group	40.71	0.59	0.05
Experimental group	31.67	0.46	0.05	11.88	0.001**
	Content logic
	Sum #code	M	SD	F	p
Control group	39.43	0.57	0.03
Experimental group	36.67	0.53	0.04	1.54	0.217
	Literature review
	Sum #code	M	SD	F	p
Control group	17.86	0.26	0.02
Experimental group	19.83	0.29	0.03	1.15	0.285
	Time clue
	Sum #code	M	SD	F	p
Control group	2.43	0.04	0.00
Experimental group	9.83	0.14	0.01	66.09	0.000**
	Terminology
	Sum #code	M	SD	F	p
Control group	2.71	0.04	0.01
Experimental group	21.67	0.31	0.02	193.07	0.000**
	Argumentation
	Sum #code	M	SD	F	p
Control group	21.29	0.31	0.06
Experimental group	37.00	0.54	0.04	35.93	0.000**
	Other rubrics
	Sum #code	M	SD	F	p
Control group	28.00	0.41	0.02
Experimental group	28.67	0.42	0.02	0.16	0.688

What Were the Raters’ Perceptions of Using EduNERScore in the Academic Writing Assessment?

The analysis of raters’ questionnaire and interview responses further revealed their perceptions of EduNERScore. Figure 5 showed the average ratings on eight dimensions (usability = 4.28; usefulness = 4.43; usage intention = 4.38; stability = 4.07; attitude = 4.45; satisfaction = 4.43; concentration = 4.37; technological complexity = 4.35), which indicated that raters were satisfied with EduNERScore. The concentration score (M = 4.37) indicated that EduNERScore could help raters focus on the writing assessment task; the technological complexity score (M = 4.35) indicated that raters perceived no technical barrier to using EduNERScore; the attitude (M = 4.45) and satisfaction (M = 4.43) scores indicated that raters generally held a positive perception; the remaining metrics (usability, usefulness, and usage intention) indicated that EduNERScore made a substantial contribution to raters’ assessment; and stability, the lowest-scoring of the eight dimensions (M = 4.07), implied that the functions and stability of EduNERScore could be improved in the future. Moreover, the interview results further confirmed that EduNERScore helped raters identify relevant evidence for making an objective assessment. For example, interviewee #4 said: “EduNERScore presented a good experience for me. It was easy to use, and provided quantitative measures and understandable visualizations …”. Interviewee #6 added: “With this visualization tool, I could easily identify more metrics and evaluate the papers objectively…”. And interviewee #3 emphasized: “it helped me become aware of details that were not easy to be noticed…”.

Figure 5.

Results of raters’ perceptions.
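The descriptive summary of the eight perception dimensions can be reproduced from the reported means. The following is a small sketch that uses only the values given in the text; the variable names are illustrative.

```python
# Mean ratings for the eight questionnaire dimensions, as reported in the text.
perception = {
    "usability": 4.28,
    "usefulness": 4.43,
    "usage intention": 4.38,
    "stability": 4.07,
    "attitude": 4.45,
    "satisfaction": 4.43,
    "concentration": 4.37,
    "technological complexity": 4.35,
}

# Unweighted average across dimensions, plus the extreme dimensions.
overall_mean = sum(perception.values()) / len(perception)
lowest = min(perception, key=perception.get)
highest = max(perception, key=perception.get)

print(f"overall mean = {overall_mean:.3f}")
print(f"lowest = {lowest}, highest = {highest}")
```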

In addition, interviewees reported that EduNERScore helped them improve the efficiency and fairness of their assessments. As interviewee #1 said: “I would like to firstly focus on the quantitative indicators and check the knowledge graph…my assessment process was very efficient”, and interviewee #2 said: “For long papers, I often forgot the previous content, and this tool helped me to recall those content...”.

However, interviewees also reported some shortcomings of EduNERScore. First, they found that the reference scores were sometimes inconsistent with raters’ own scores. Some raters judged a paper to be poorly written, but EduNERScore suggested the opposite. As interviewee #3 explained: “For some papers, although the tool identified knowledge types... but when I doublechecked the paper’s content, I found that the content was not usually well written”. Second, they reported that the structured knowledge generated by EduNERScore contained entity-boundary errors (e.g., the wrong start or end position of an entity) and entity-type errors (e.g., the wrong entity type assigned to an entity). Interviewees also mentioned that, although the incorrectly structured knowledge sometimes confused them, it did not negatively influence their subsequent assessment. As interviewee #5 said: “I interpreted some indicators from the original content, so that some of the errors did not bother my assessment”.
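The two kinds of recognition errors that interviewees described can be made concrete with a short sketch. It assumes that each entity is represented by character offsets and a type label and that a predicted entity is compared with the gold entity it overlaps; the `Entity` class, the labels `THEORY` and `POLICY`, and the offsets are hypothetical and do not describe EduNERScore's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int   # index of the first character of the span
    end: int     # index one past the last character of the span
    label: str   # entity type


def classify_prediction(predicted, gold):
    """Compare one predicted entity against the gold entity it overlaps."""
    same_span = (predicted.start, predicted.end) == (gold.start, gold.end)
    same_type = predicted.label == gold.label
    if same_span and same_type:
        return "correct"
    if same_span:
        return "type error"
    if same_type:
        return "boundary error"
    return "boundary and type error"


# Hypothetical gold annotation: "constructivist theory" at characters 10-31, typed THEORY.
gold = Entity(10, 31, "THEORY")
print(classify_prediction(Entity(10, 31, "THEORY"), gold))  # exact match
print(classify_prediction(Entity(25, 31, "THEORY"), gold))  # span starts too late
print(classify_prediction(Entity(10, 31, "POLICY"), gold))  # right span, wrong type
```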

Discussions and Implications

Addressing Research Questions

Interpretability of CSWA provides knowledge-aware feedback, which helps raters deal with complicated academic writing assessment processes. However, existing research has focused on basic text features related to assessment (e.g., length and grammatical errors) rather than interpretable evidence related to the writing topics and content (Thompson & Braude, 2016). This research designed and implemented an academic writing assessment tool named EduNERScore that provides structured knowledge and quantifiable metrics, in order to improve the interpretability of raters’ assessments and, in turn, the quality of assessment. We conducted a multi-method experiment to investigate the validity, efficiency, and interpretability of EduNERScore as well as raters’ perceptions of the tool. First, the experimental group with the support of EduNERScore showed advantages in assessment validity. That is, the experimental group not only outperformed the control group on indicators during the assessment process, but also gave scores similar to those of the expert group. The findings implied that EduNERScore’s semi-automated assessment mechanism enabled raters to make objective assessments based on comprehensive and quantitative indicators (Weinerth et al., 2014), which in turn yielded expert-like validity, as reported for other CSWA systems (Azmi et al., 2019; Mao et al., 2018; Martínez-Huertas et al., 2019).

Second, EduNERScore improved efficiency, consistent with the results of other computer-assisted writing assessment cases (Jackson, 2000; Süzen et al., 2020; Wilson & Czik, 2016). The experimental group spent significantly less time on assessment than the control group. One of the major advantages of EduNERScore is that it extracted structured knowledge aligned with the required rubrics and analyzed the text quality in an automatic and timely manner. However, because of individual raters’ subjectivity when using the semi-automated assessment tool, the time spent on each paper fluctuated across raters. In summary, the efficiency gain of EduNERScore was evident at the holistic level rather than at the local level (namely, each paper).

Third, EduNERScore improved raters’ access to assessment evidence, which positively affected the interpretability of assessment results. The think-aloud results indicated that raters in the experimental group, who received more metric-related feedback from EduNERScore, better interpreted the reasons behind their assessment results than raters in the control group (Wilson & Czik, 2016; Wilson et al., 2021a, 2021b). In particular, the frequencies of Exploration and Elaboration codes in the experimental group were higher than those in the control group; moreover, the experimental group mentioned Knowledge types, whereas the control group never did. Together, these results suggested that raters in the experimental group elaborated the causal interpretation of assessment results because EduNERScore provided visual and quantifiable representations that helped them recall critical ideas (Sung et al., 2016). In contrast, raters in the control group, without EduNERScore, were prone to interpret the scoring results with basic metrics that they could recall (such as Reference).

Finally, the responses from questionnaires and interviews revealed that EduNERScore offered raters a positive assessment experience, which was consistent with prior research (e.g., Wang et al., 2011; Wilson et al., 2021a, 2021b). The results suggested that the tool helped raters deal effectively with complicated writing assessment tasks by extracting domain knowledge and presenting evidence, which improved raters’ engagement and willingness to complete the assessment tasks (Nunes et al., 2022; Wilson & Czik, 2016). Overall, the EduNERScore tool provides instructors with scaffolding for assessing academic writing with reliable validity and good user perceptions, while also balancing the efficiency and interpretability of the assessment.

Technological Implications

Based on the empirical results, this research offers technological implications for developing and optimizing interpretable tools for academic writing assessment. First, the knowledge-aware strategy is a critical mechanism for improving the interpretability of CSWA systems. The strengths of the knowledge-aware strategy comprise two aspects: representing students’ discipline knowledge and providing raters with knowledge-aware evidence. This result is consistent with previous findings that fine-grained knowledge feedback related to the writing topic can facilitate the interpretability of the assessment (Wilson et al., 2021a, 2021b). Furthermore, the NER technique is employed to implement the knowledge-aware strategy, as it can extract knowledge entities and their corresponding entity types simultaneously. However, academic writing is a complicated, creative integration of knowledge by individual students; therefore, the assessment of academic writing cannot depend merely on the knowledge entities and types included in the text. For instance, knowledge entities can reflect what knowledge is contained, but cannot show how the author organizes and links these knowledge entities into creative or critical writing (Al-Moslmi et al., 2020). In this case, knowledge graph technology, which identifies both entities and relationships, can provide comprehensive access to knowledge and the associative relationships within it, thus addressing the limitations of NER technology (Ji et al., 2022). Therefore, this research suggests that future research consider employing the knowledge graph technique in designing interpretable assessment tools.

Moreover, the validity issue, namely the inconsistency of assessment results, to some extent counteracts the advantage of EduNERScore. A critical phenomenon associated with the validity issue is the discrepancy between the EduNERScore score and the human grading score, which has been pointed out in prior research (e.g., Mao et al., 2018). This research identified two possible reasons. First, the knowledge mining engine (KME) extracts some redundant entities that have overlapping meanings (e.g., both “theory” and “constructivism” were identified). When redundant entities are visualized and counted by EduNERScore, the resulting rating scores become less accurate. Second, some entities are only mentioned in the paper without further elaboration; such mentions do not contribute to the quality of the paper, but EduNERScore cannot identify this difference in depth. To overcome this disadvantage, future work should optimize the algorithm to improve its accuracy and capture the contextual semantics of knowledge entities. Another disadvantage is that the information presented in the graphs may increase the rater’s cognitive load (Sung et al., 2016), which could weaken the rater’s motivation to use the tool consistently (Chen & Tsao, 2021). Thus, the findings suggest that future designs should improve algorithmic performance to reduce redundant information and optimize the representations of knowledge or the hierarchy of feedback to make them more accessible and digestible (Kim et al., 2020).

Educational Implications

The interpretable academic writing assessment tool can be used in educational practice to improve efficiency, promote the interpretability of scoring, and serve as instructional scaffolding for cognitive visualization and higher-order semantic analysis. First, offering real-time feedback and evidence can help students revise their writing effectively (Link et al., 2022; Wilson et al., 2021a, 2021b). The findings imply that EduNERScore can directly present fine-grained knowledge related to writing topics. Thus, integrating EduNERScore into discipline-specific contexts enables the instructor to handle assessment-related writing instruction efficiently, especially in large classes (Nunes et al., 2022).

Second, EduNERScore’s ability to extract structured knowledge naturally suggests applications in other text analysis contexts. The real-time presentation of knowledge and corresponding types can promote students’ engagement in knowledge construction. For example, when this type of tool is integrated into online discussion activities, efficient knowledge extraction can prevent students from missing critical discussion information and further enhance students’ focus on the current discussion topics to construct knowledge (Su et al., 2018). It can also enable students to quickly compare their knowledge with that of their peers, which in turn triggers cognitive conflicts and deepens knowledge construction (Li et al., 2021). Therefore, a knowledge-aware tool such as EduNERScore has the potential to be applied in diverse instruction and learning contexts to foster students’ collaborative knowledge construction.

Conclusion, Limitations, and Future Work

The increasing application of CSWA systems in education requires improving the interpretability of academic writing assessments in order to ensure assessment quality (Wilson et al., 2021a, 2021b). This research designs and implements an interpretable CSWA tool (EduNERScore) that models domain knowledge from a quantitative, visual perspective, which helps raters interpret assessment results at the knowledge level in order to improve academic writing assessment quality. The positive findings indicate that EduNERScore can make human assessments more valid, efficient, and interpretable. This research has two limitations, which lead to future research directions. First, our NER model makes errors in knowledge recognition because its accuracy is influenced by various factors (such as language category and dataset size). Future work should investigate advanced AI algorithms to improve the performance of the knowledge mining engine. Second, this research was conducted with a small sample, and the numbers of experts and of participants of different genders were imbalanced, which may limit the generalizability of the findings. Future work should apply and test the tool in a large-scale instructional context to verify the research results and implications. Overall, EduNERScore offers AI-driven, knowledge-aware scaffolding for educational assessment, which enables instructors to effectively address challenges in complicated assessment tasks and conduct diverse instructional practices related to academic writing.

Footnotes

Acknowledgments

We appreciate participants’ engagement in this research.

Authors’ Contribution

Xu Li took responsibility for software development, data collection and analysis, and writing of the manuscript draft. Fan Ouyang took responsibility for research conception, writing and revision of the manuscript, as well as supervision of the research. Jianwen Liu took responsibility for data analysis and software development. Chengkun Wei took responsibility for software development. Wenzhi Chen supervised the research project.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors acknowledge the financial support from the National Natural Science Foundation of China (62177041), the 2021 Key research and development plan of Zhejiang province (2021C03140), Zhejiang Province educational science and planning research project (2022SCG256), and Zhejiang University graduate education research project (20220310).

Data Availability

Data is available upon request from the first author.

ORCID iD

Fan Ouyang

Author Biographies

Xu Li is a doctoral student majoring in artificial intelligence in education at Zhejiang University. His research interests focus on artificial intelligence, natural language processing, and knowledge graphs in education.

Fan Ouyang is a research professor in the College of Education at Zhejiang University. Her research interests are computer-supported collaborative learning, learning analytics and educational data mining, AI in education, and online learning.

Jianwen Liu is a master’s student majoring in computer science and technology at Zhejiang University. His research focuses on machine learning.

Chengkun Wei is currently a PostDoc at the College of Computer Science and Technology, Zhejiang University. He received his Ph.D. degree in Computer Science and Technology from Zhejiang University. His current research interests include confidential computing, differential privacy, federated learning, and machine learning privacy and security.

Wenzhi Chen (Member, IEEE) received the Ph.D. degree from the College of Computer Science and Engineering, Zhejiang University. He is currently a Professor with the College of Computer Science and Technology, Zhejiang University, and the Director of the Information Technology Center, Zhejiang University. He used to be the Vice Dean of the College of Computer Science and Technology. His current research interests include embedded systems and their applications, computer architecture, computer system software, and information security. He is a member of ACM and the ACM Education Council.

Appendix A.

The Assessment Rubric.

Assessment Rubric
Paper No.							Score:
Dimensions		Quality (1-Very bad; 2-bad; 3-neutral; 4-good; 5-very good)
Overall Assessment	Overall quality	☐1	☐2	☐3			☐4	☐5
	Overall style	☐1	☐2	☐3			☐4	☐5
	Argument quality	☐1	☐2	☐3			☐4	☐5
		Identification (1-Very hard; 2-hard; 3-neutral; 4-easy; 5-very easy)					Quality (1-Very bad; 2-bad; 3-neutral; 4-good; 5-very good)
Specific Assessment	Theoretical foundation	☐1	☐2	☐3	☐4	☐5	☐1	☐2	☐3	☐4	☐5
	Domain knowledge	☐1	☐2	☐3	☐4	☐5	☐1	☐2	☐3	☐4	☐5
	Literature review	☐1	☐2	☐3	☐4	☐5	☐1	☐2	☐3	☐4	☐5
	Educational policies	☐1	☐2	☐3	☐4	☐5	☐1	☐2	☐3	☐4	☐5
	Techniques/tools	☐1	☐2	☐3	☐4	☐5	☐1	☐2	☐3	☐4	☐5
	Research purpose	☐1	☐2	☐3	☐4	☐5	☐1	☐2	☐3	☐4	☐5
	Reference	☐1	☐2	☐3	☐4	☐5	☐1	☐2	☐3	☐4	☐5
	Duration					☐≈60min	☐≈45min	☐≈35min	☐≈25min	☐≈15min

References

Ade-Ibijola, A. O., Wakama, & Amadi, J. C. (2012). An expert system for Automated Essay Scoring (AES) in computing using shallow NLP techniques for inferencing. International Journal of Computer Applications, 51(10), 37–45. https://doi.org/10.5120/8080-1480

Al-Moslmi, Gallofré Ocaña, Opdahl, L. A., & Veres (2020). Named entity extraction for knowledge graphs: A literature overview. IEEE Access, 8, 32862–32881. https://doi.org/10.1109/ACCESS.2020.2973928

Azmi, A. M., Al-Jouie, M. F., & Hussain (2019). AAEE – automated evaluation of students’ essays in Arabic language. Information Processing & Management, 56(5), 1736–1752. https://doi.org/10.1016/j.ipm.2019.05.008

Chapelle, C. A., Cotos, & Lee (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385–405. https://doi.org/10.1177/0265532214565386

Chen, C.-M., & Tsao, H.-W. (2021). An instant perspective comparison system to facilitate learners’ discussion effectiveness in an online discussion process. Computers & Education, 164, 104037. https://doi.org/10.1016/j.compedu.2020.104037

Conijn, Martinez-Maldonado, Knight, Buckingham Shum, Van Waes, & van Zaanen (2020). How to provide automated feedback on the writing process? A participatory approach to design writing analytics tools. Computer Assisted Language Learning, 0(0), 1–31. https://doi.org/10.1080/09588221.2020.1839503

Cotton & Gresty (2006). Reflecting on the think-aloud method for evaluating e-learning. British Journal of Educational Technology, 37(1), 45–54. https://doi.org/10.1111/j.1467-8535.2005.00521.x

Dikli (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1), 1. https://bit.ly/3w3YeV0

Dunsmuir, Kyriacou, Batuwitage, Hinson, Ingram, & O’Sullivan (2015). An evaluation of the Writing Assessment Measure (WAM) for children’s narrative writing. Assessing Writing, 23, 1–18. https://doi.org/10.1016/j.asw.2014.08.001

Gerritsen-van Leeuwenkamp, K. J., Joosten-ten Brinke, & Kester (2017). Assessment quality in tertiary education: An integrative literature review. Studies in Educational Evaluation, 55, 94–116. https://doi.org/10.1016/j.stueduc.2017.08.001

Huisman, Saab, van den Broek, & van Driel (2019). The impact of formative peer feedback on higher education students’ academic writing: A meta-analysis. Assessment & Evaluation in Higher Education, 44(6), 863–880. https://doi.org/10.1080/02602938.2018.1545896

Itua, Coffey, Merryweather, Norton, & Foxcroft (2014). Exploring barriers and solutions to academic writing: Perspectives from students, higher education and further education tutors. Journal of Further and Higher Education, 38(3), 305–326. https://doi.org/10.1080/0309877X.2012.726966

Jackson (2000). A semi-automated approach to online assessment. Proceedings of the 5th Annual SIGCSE/SIGCUE ITiCSE Conference on Innovation and Technology in Computer Science Education, 32(3), 164–167. https://doi.org/10.1145/343048.343160

Ji, Pan, Cambria, Marttinen, & Yu, P. S. (2022). A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33(2), 494–514. https://doi.org/10.1109/TNNLS.2021.3070843

Jorge-Botana, Luzón, J. M., Gómez-Veiga, & Martín-Cordero, J. I. (2015). Automated LSA assessment of summaries in distance education: Some variables to be considered. Journal of Educational Computing Research, 52(3), 341–364. https://doi.org/10.1177/0735633115571930

Kim, A. A., Chapman, Kondo, & Wilmes (2020). Examining the assessment literacy required for interpreting score reports: A focus on educators of K–12 English learners. Language Testing, 37(1), 54–75. https://doi.org/10.1177/0265532219859881

17.

Klimova

B. F.

(2011). Assessment methods in the course on academic writing. Procedia - Social and Behavioral Sciences, 15, 2604–2608. https://doi.org/10.1016/j.sbspro.2011.04.154

18.

Kumar

Boulanger

(2020). Explainable automated essay scoring: Deep learning really has pedagogical value. Frontiers in Education, 5, 186. https://doi.org/10.3389/feduc.2020.572367

19.

Kyle

(2020). The relationship between features of source text use and integrated writing quality. Assessing Writing, 45, 100467. https://doi.org/10.1016/j.asw.2020.100467

20.

Lee

J. H.

Segev

(2012). Knowledge maps for e-learning. Computers & Education, 59(2), 353–364. https://doi.org/10.1016/j.compedu.2012.01.017

21.

Zhang

(2021). The effects of a group awareness tool on knowledge construction in computer-supported collaborative learning. British Journal of Educational Technology, 52(3), 1178–1196. https://doi.org/10.1111/bjet.13066

22.

Link

Mehrzad

Rahimi

(2022). Impact of automated writing evaluation on teacher feedback, student revision, and writing improvement. Computer Assisted Language Learning, 35(4), 605–634. https://doi.org/10.1080/09588221.2020.1743323

23.

Litman

Afrin

Kashefi

Olshefski

Godley

Hwa

(2022). An automated writing evaluation system for supporting self-monitored revising. In Rodrigo

M.M.

Matsuda

Cristea

A.I.

Dimitrova

(Eds), Artificial intelligence in education. AIED 2022. Lecture notes in computer science (13355). : Springer. https://doi.org/10.1007/978-3-031-11644-5_52

24.

Litman

Zhang

Correnti

Matsumura

L.C.

Wang

(2021). A fairness evaluation of automated methods for scoring text evidence usage in writing. In Roll

McNamara

Sosnovsky

Luckin

Dimitrova

(Eds), Artificial intelligence in education (pp. 255–267). : Springer. https://doi.org/10.1007/978-3-030-78292-4_21

25.

Liu

Guo

Wang

(2022). Chinese named entity recognition: The state of the art. Neurocomputing, 473, 37–53. https://doi.org/10.1016/j.neucom.2021.10.101

26.

Peng

Zhang

Wei

Huang

(2020). Simplify the usage of lexicon in Chinese NER. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5951–5960. https://doi.org/10.48550/arXiv.1908.05969

27.

Adesope

O. O.

Nesbit

J. C.

Liu

(2014). Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4), 901–918. https://doi.org/10.1037/a0037123

28.

Madnani

Loukina

Cahill

(2017). A large scale quantitative exploration of modeling strategies for content scoring. Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 457–467. https://doi.org/10.18653/v1/W17-5052

29.

Mao

Liu

O. L.

Roohr

Belur

Mulholland

Lee

H.-S.

(2018). Validation of automated scoring for a formative assessment that employs scientific argumentation. Educational Assessment, 23(2), 121–138. https://doi.org/10.1080/10627197.2018.1427570

30.

Martínez-Huertas

J. Á.

Jastrzebska

Olmos

León

J. A.

(2019). Automated summary evaluation with inbuilt rubric method: An alternative to constructed responses and multiple-choice tests assessments. Assessment & Evaluation in Higher Education, 44(7), 1029–1041. https://doi.org/10.1080/02602938.2019.1570079

31.

Nunes

Cordeiro

Limpo

Castro

S. L.

(2022). Effectiveness of automated writing evaluation systems in school settings: A systematic review of studies from 2000 to 2020. Journal of Computer Assisted Learning, 38(2), 599–620. https://doi.org/10.1111/jcal.12635

32.

Ploegh

Tillema

H. H.

Segers

M. S. R.

(2009). In search of quality criteria in peer assessment practices. Studies in Educational Evaluation, 35(2), 102–109. https://doi.org/10.1016/j.stueduc.2009.05.001

33.

Ramesh

Sanampudi

S. K.

(2022). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3), 2495–2527. https://doi.org/10.1007/s10462-021-10068-2

34.

Schumacher

(2020). Linking assessment and learning analytics to support learning processes in higher education. In Spector

M. J.

Lockee

B. B.

Childress

M. D.

(Eds), Learning, design, and technology: An international compendium of theory, research, practice, and policy (pp. 1–40). Springer. https://doi.org/10.1007/978-3-319-17727-4_166-1

35.

Shermis

M. D.

Burstein

Bursky

S. A.

(2013). Introduction to automated essay evaluation. In Shermis

M. D.

Burstein

(Eds), Handbook of automated essay evaluation: Current applications and new directions (pp. 1–15).

36.

Stevenson

(2016). A critical interpretative synthesis: The integration of automated writing evaluation into classroom writing instruction. Computers and Composition, 42, 1–16. https://doi.org/10.1016/j.compcom.2016.05.001

37.

Strobl

Ailhaud

Benetos

Devitt

Kruse

Proske

(2019). Digital support for academic writing: A review of technologies and pedagogies. Computers & Education, 131, 33–48. https://doi.org/10.1016/j.compedu.2018.12.005

38.

Rosé

C. P.

(2018). Exploring college English language learners’ self and social regulation of learning during wiki-supported collaborative reading activities. International Journal of Computer-Supported Collaborative Learning, 13(1), 35–60. https://doi.org/10.1007/s11412-018-9269-y

39.

Sung

Y.-T.

Liao

C.-N.

Chang

T.-H.

Chen

C.-L.

Chang

K.-E.

(2016). The effect of online summary assessment and feedback system on the summary writing on 6th graders: The LSA-based technique. Computers & Education, 95, 1–18. https://doi.org/10.1016/j.compedu.2015.12.003

40.

Süzen

Gorban

A. N.

Levesley

Mirkes

E. M.

(2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169, 726–743. https://doi.org/10.1016/j.procs.2020.02.171

41.

Thompson

M. M.

Braude

E. J.

(2016). Evaluation of knowla: An online assessment and learning tool. Journal of Educational Computing Research, 54(4), 483–512. https://doi.org/10.1177/0735633115621923

42.

Tjong Kim Sang

E. F.

De Meulder

(2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, 4, 142–147. https://doi.org/10.3115/1119176.1119195

43.

Villalon

Calvo

R. A.

(2011). Concept maps as cognitive visualizations of writing assignments. Educational Technology & Society, 14(3), 16–27. https://bit.ly/37yeLa3

44.

Wang

H.-C.

Chang

C.-Y.

T.-Y.

(2008). Assessing creative problem-solving with automated text grading. Computers & Education, 51(4), 1450–1466. https://doi.org/10.1016/j.compedu.2008.01.006

45.

Wang

T.-I.

C.-Y.

Hsieh

T.-C.

(2011). Accumulating and visualising tacit knowledge of teachers on educational assessments. Computers & Education, 57(4), 2212–2223. https://doi.org/10.1016/j.compedu.2011.06.018

46.

Wei

Saab

Admiraal

(2021). Assessment of cognitive, behavioral, and affective learning outcomes in massive open online courses: A systematic literature review. Computers & Education, 163, 104097. https://doi.org/10.1016/j.compedu.2020.104097

47.

Weideman

(2019). Degrees of adequacy: The disclosure of levels of validity in language assessment. Koers, 84(1), 1–15. https://doi.org/10.19108/koers.84.1.2451

48.

Weigle

S. C.

(2013). English as a second language writing and automated essay evaluation. In Shermis

M. D.

Burstein

(Eds), Handbook of automated essay evaluation: Current applications and new directions (pp. 36–54).

49.

Weinerth

Koenig

Brunner

Martin

(2014). Concept maps: A useful and usable tool for computer-based knowledge assessment? A literature review with a focus on usability. Computers & Education, 78, 201–209. https://doi.org/10.1016/j.compedu.2014.06.002

50.

Wilson

Ahrendt

Fudge

E. A.

Raiche

Beard

MacArthur

(2021a). Elementary teachers’ perceptions of automated feedback and automated scoring: Transforming the teaching and learning of writing using automated writing evaluation. Computers & Education, 168, 104208. https://doi.org/10.1016/j.compedu.2021.104208

51.

Wilson

Czik

(2016). Automated essay evaluation software in English language arts classrooms: Effects on teacher feedback, student motivation, and writing quality. Computers & Education, 100, 94–109. https://doi.org/10.1016/j.compedu.2016.05.004

52.

Wilson

Huang

Palermo

Beard

MacArthur

C. A.

(2021b). Automated feedback and automated scoring in the elementary grades: Usage, attitudes, and associations with writing outcomes in a districtwide implementation of MI write. International Journal of Artificial Intelligence in Education, 31(2), 234–276. https://doi.org/10.1007/s40593-020-00236-w

53.

Yang

Y.-F.

(2016). Transforming and constructing academic knowledge through online peer feedback in summary writing. Computer Assisted Language Learning, 29(4), 683–702. https://doi.org/10.1080/09588221.2015.1016440

54.

Zhang

(2021). Review of automated writing evaluation systems. Journal of China Computer-Assisted Language Learning, 1(1), 170–176. https://doi.org/10.1515/jccall-2021-2007

55.

Zheng

Huang

Lajoie

S. P.

Chen

Hmelo-Silver

C. E.

(2021). Self-regulation and emotion matter: A case study of instructor interactions with a learning analytics dashboard. Computers & Education, 161, 104061. https://doi.org/10.1016/j.compedu.2020.104061

56.

Zupanc

Bosnić

(2015). Advances in the field of automated essay evaluation. Informatica (Slovenia), 39(4), 383–395. https://bit.ly/3w2vfRq

57.

Zupanc

Bosnić

(2017). Automated essay evaluation with semantic analysis. Knowledge-Based Systems, 120(C), 118–132. https://doi.org/10.1016/j.knosys.2017.01.006