Abstract
Computer-supported writing assessment (CSWA) has been widely used to reduce instructor workload and provide real-time feedback. The interpretability of CSWA has drawn extensive attention because it can benefit the validity, transparency, and knowledge-aware feedback of academic writing assessments. This study proposes a novel assessment tool, EduNERScore, which enables raters to identify different types of domain knowledge from unstructured educational texts and strengthen the interpretation of assessment through knowledge-aware evidence. This research uses an educational experiment to examine raters’ validity, efficiency, interpretability and user perception with and without the support of EduNERScore. The findings show that, compared with the control group, the experimental group supported by EduNERScore achieves expert-level validity, high efficiency, interpretability and good user perception. Finally, based on the empirical results, this research discusses the technological and educational implications for developing and applying the interpretable CSWA system in education.
Keywords
Introduction
Academic writing has been employed as an effective instructional strategy to facilitate students’ information presentation and knowledge application and to guide their learning progress in different subjects (Huisman et al., 2019; Yang, 2016; Zhang, 2021). Computer-supported writing assessment (CSWA), as a scaffolding tool, has been widely used in academic writing contexts to help instructors measure students' mastery of targeted knowledge (Dunsmuir et al., 2015; Thompson & Braude, 2016). CSWA can provide real-time, automated assessment results, reduce instructors’ time-consuming workloads, and develop individualized feedback for students to revise writing products (Conijn et al., 2020; Klimova, 2011; Sung et al., 2016; Wang et al., 2008).
The research community has paid attention to improving the interpretability of results from CSWA without the involvement of human experts (Chapelle et al., 2015; Wei et al., 2021; Wilson et al., 2021a, 2021b). Interpretability of automated assessment has three major advantages. First, interpretability can improve the CSWA system’s validity since it requires explanations of the assessment criteria and of how the scores are generated (Kim et al., 2020; Ploegh et al., 2009). Second, interpretability helps instructors comprehend how the assessment works and thus reduces negative factors generated by assessors’ subjectivity (e.g., writing expertise, assessment experience) (Ade-Ibijola et al., 2012; Weideman, 2019). Third, interpretability enables instructors to confirm the accuracy of CSWA results and the alignment of assessment criteria with instructional requirements (Ade-Ibijola et al., 2012), and helps students improve their writing quality based on feedback (Weideman, 2019). Overall, interpretability of CSWA can enhance the transparency of the assessment criteria and procedures and the credibility of the results, which offers continuous support for instructors’ guidance and students’ writing (Weideman, 2019).
However, there are two major challenges to enhancing the interpretability of CSWA, namely, technologies related to interpretability, and considerations of the targeted assessment criteria and contexts. First, multiple machine learning (ML) algorithms are used to implement CSWA, and one of the major approaches is the deep learning-based automated assessment model (Azmi et al., 2019). The automated assessment model can provide timely scoring results, but does not offer an interpretation of the assessment process or the derivation of assessment results (Kumar & Boulanger, 2020). Second, the expected assessment criteria are usually quite different in various educational contexts (Conijn et al., 2020; Itua et al., 2014). For example, the linguistic feature is the crucial indicator for writing in a language learning context; however, for academic writing, attention needs to be paid to the high-level semantic features (Madnani et al., 2017; Villalon & Calvo, 2011). Therefore, to achieve a high quality of interpretability, it is necessary to improve CSWA quality by integrating assessment techniques with targeted contexts.
This research integrated a knowledge-aware strategy and the Named Entity Recognition (NER) technique to address these two challenges. First, a knowledge-aware strategy is proposed to demonstrate the structured knowledge reflected by content in student papers and to provide evidence for the raters to make assessment scores. Second, NER is an advanced artificial intelligence technique to extract structured knowledge by employing a deep learning (DL) algorithm. Based on those two types of techniques, we designed and implemented a semi-automated assessment tool, EduNERScore, to extract knowledge from unstructured texts, visualize the structured knowledge of content from students’ papers, and evaluate the quality of academic papers based on a scoring model. Raters can use EduNERScore to quickly access a paper’s quantitative metrics at the knowledge dimension, and related assessment information of the paper (e.g., length, reference score, etc.). The knowledge graph presented in the tool can enhance raters’ understanding of the knowledge structures demonstrated in the paper as well as improve raters’ interpretation quality of the assessment results. Moreover, we conducted an experiment to verify the validity, efficiency, and interpretability of assessment results supported by EduNERScore, and explore raters’ perceptions of the EduNERScore tool. Based on the empirical research results, we discussed the methodological and pedagogical implications for developing and applying the interpretable CSWA systems in education.
Literature Review
Types of Computer-Supported Assessment Systems
CSWA, as a typical computer-supported technology, has gained popularity in a variety of education settings to score and assess the quality of academic writing (Dikli, 2006; Shermis et al., 2013). From the theoretical perspective, based on the zone of proximal development theory, CSWA can provide timely and sufficient feedback to students in order to facilitate their knowledge internalization process (Nunes et al., 2022). From the practical perspective, the CSWA’s real-time feedback feature provides opportunities for students to plan, write, and revise their papers, and allows them to improve their writing quality through trial and error (Weigle, 2013). In addition, CSWA can also avoid the bias and time costs caused by factors such as human raters’ insufficient assessment knowledge (Weigle, 2013). This advantage can help instructors or raters save time to focus more on the higher-level assessment of writing quality (such as logic, structure and vocabulary usage) (Nunes et al., 2022). Because of these advantages, there is an increasing interest in designing and implementing effective CSWA systems to meet the requirements of academic writing assessment.
Multiple types of CSWA systems have been developed, including automated essay scoring systems (AESS), automated writing evaluation systems (AWES), and intelligent tutoring systems (ITS) (Conijn et al., 2020). These systems aim to improve the efficiency of assessment, provide automated feedback and correction suggestions, and help students revise their academic papers (Dikli, 2006; Ma et al., 2014; Nunes et al., 2022; Wilson & Czik, 2016). Most of these systems are product-targeted and widely used for standardized test evaluations (e.g., TOEFL, GRE) (Kyle, 2020; Stevenson, 2016). Due to this standardized attribute, most CSWA systems focus on the validity and efficiency of the assessment rather than the interpretability of the assessment results. In other words, users usually cannot get access to the scoring criteria and procedures (Azmi et al., 2019; Zupanc & Bosnić, 2015). Moreover, these systems do not focus on collecting evidence from students’ academic writing process to provide meaningful feedback (Conijn et al., 2020). However, in academic writing practice, it is important for instructors and students to understand the detailed standards and process of assessment in order to improve writing quality (Gerritsen-van Leeuwenkamp et al., 2017), such as what assessment techniques are used and which metrics contribute to the final score. To this end, there is an urgent need to investigate relevant factors that can improve the interpretability of CSWA.
Knowledge-Aware Writing Assessment
Visualization of structural knowledge can be an important factor in enhancing the interpretability of CSWA. Existing research has examined the method of knowledge visualization to assist instructors in conducting academic assessments. For example, Wang et al. (2011) emphasized the importance of converting tacit assessment information into explicit knowledge. They designed a knowledge-map grading tool to integrate the domain knowledge and provide a visualized graph of key concepts. Research results indicated that the visualization of knowledge aided instructors in understanding the composition of concepts or ideas written in student papers and in becoming aware of missing concepts. Villalon and Calvo (2011) used concept maps as an approximate representation of students’ current state of knowledge. Their results showed that the concept map tool could assess students’ conceptual understandings and further encourage students to revise their writing products. In short, automatic extraction of structured knowledge helps instructors improve assessment quality and offers a potential approach to enhance the interpretability of academic writing assessment.
Moreover, various AI techniques have been employed to visualize the structural knowledge in academic writing assessment. For example, Wang et al. (2008) employed three different machine learning techniques to identify the concepts of the domain and thus enable the assessment of creative problem-solving in earth science education. Sung et al. (2016) optimized the Latent Semantic Analysis (LSA)-based scoring tool to grade students’ summary writing from the concepts and semantic knowledge dimension. This tool built three databases (latent semantic space, expert summaries, expert concept maps) with human experts’ intervention. Zupanc and Bosnić (2017) proposed a semantic-aware assessment system to improve semantic coherence and consistency. This system used third-party tools to extract the knowledge from the corresponding subject, and then checked the semantic consistency. Prior studies have indicated that extracting specific knowledge from students’ writing content is beneficial for improving the quality and interpretability of academic writing assessments (Ramesh & Sanampudi, 2022). However, existing methods mainly rely on human feature engineering (e.g., manual construction of an expert knowledge base) and can only recognize phrases or keywords contained in texts rather than extracting meaningful knowledge entities (e.g., concepts, theories, policies) (Sung et al., 2016; Wang et al., 2011). Therefore, it is necessary to investigate advanced AI techniques that can automatically extract diverse structured knowledge from large-scale texts.
Integrated AI Techniques for CSWA Systems
Extensive ML-based and DL-based algorithms have been used to extract text features and provide assistance for academic writing assessment. Combined with human feature engineering, ML-based methods are widely used for educational text analysis. Lee and Segev (2012) utilized the Term Frequency-Inverse Document Frequency (TF/IDF) algorithm to mine the keywords from specific texts in order to measure the text quality. Sung et al. (2016) and Jorge-Botana et al. (2015) employed the LSA algorithm to mine concepts in the text and thus measured the similarity between the source and target texts. ML-based methods usually require extra human feature engineering, which is difficult to transfer from one educational context to another. On the other hand, DL-based NLP methods have recently been used to automatically learn the semantic representation of text from large-scale training datasets (Strobl et al., 2019). For example, Azmi et al. (2019) employed DL and LSA techniques to develop an automated Arabic Essay Evaluator that can effectively recognize the features of language and structure of an essay. Litman et al. (2022) employed NLP-based algorithms to identify the revisions of essays and predict the purpose of revision, which assisted the assessment of the writing quality. Compared to ML-based methods, DL-based methods can better learn the semantic representation of the text and reduce the reliance on human feature engineering. In summary, DL-based methods can improve the efficiency of text analysis, while ML-based methods can acquire explicit text features.
To improve the interpretability of CSWA systems, it is necessary to integrate the advantage of efficiency offered by DL-based methods with the interpretability attribute of ML-based text analysis. In this way, the CSWA system can automatically infer learners' knowledge and contribute to the assessment efficiency and interpretable features (Litman et al., 2021; Schumacher, 2020). For example, Zupanc and Bosnić (2017) emphasized that the information extraction (IE) technique can be used to mine diverse knowledge entities from natural language texts to measure the quality of essays. IE techniques do not rely on human feature engineering and can also extract text features with interpretability. NER, a typical IE technique in the current NLP field, enables the extraction of different types of structured knowledge from large-scale unstructured texts. The NER technique can simultaneously obtain interpretable metrics that reflect the text quality and ensure the efficiency of knowledge extraction.
An integration of the knowledge-aware strategy and NER technique is a potential solution to balance the efficiency and interpretability of academic writing assessments. The knowledge-aware strategy helps raters measure the quality of papers through explicit, structured knowledge representations; in this way, raters can evaluate how knowledge entities contribute to the quality of a paper by examining the ratio, quantity, and types of knowledge reflected in students’ academic writing. NER, from a technical perspective, can automatically identify different types of knowledge entities from the text content to enable the knowledge-aware strategy. Therefore, this research aims to develop an interpretable writing assessment tool by integrating the knowledge-aware strategy and NER technique, and to conduct an experiment on academic writing assessment to examine the validity, efficiency, and interpretability of assessment results obtained with the tool.
The Design and Development of EduNERScore
EduNERScore, an assistant CSWA tool, is designed to help instructors or raters understand how different types of knowledge are reflected in students’ academic papers. Integrating the knowledge-aware strategy and the NER technique, EduNERScore provides reliable reference indicators to support instructors’ academic writing assessment process, rather than merely providing the final scores. The quantitative indicators and metrics provided by EduNERScore can help raters understand how different types of knowledge are reflected in students’ academic papers, which has the potential to improve the efficiency and interpretability of academic writing assessment.
EduNERScore is a web-based text analysis tool deployed on a cloud server, which consists of three distinct modules: an automated knowledge mining engine (KME), a knowledge visualization engine (KVE), and a text quality analysis engine (TQAE) (see Figure 1). KME is designed to extract knowledge from unstructured text, KVE visualizes structured knowledge through social network analysis techniques, and TQAE measures the text quality based on quantifiable metrics.
Figure 1. The architecture of EduNERScore.
The schema of discipline knowledge entity.
The KVE module presents the relation of knowledge entities and knowledge types in the format of a social network graph. We first utilize social network analysis (SNA) techniques to analyze and demonstrate knowledge types and attributes. In a knowledge graph, a node represents a specific knowledge entity, and the node size represents the node degree, which reflects the number of knowledge entities. An edge connects a knowledge entity and the corresponding knowledge type, which in turn forms knowledge clusters. Users can scale different knowledge clusters to observe the properties of knowledge entities and the types to which they belong. In summary, the KVE module presents an intuitive representation of knowledge types in a network format, which helps instructors visualize the knowledge composition of unstructured texts.
The TQAE module generates content-related feedback by considering the contribution of knowledge demonstrated in an academic paper. Prior studies have claimed that the quality of text should consider the contribution of key concepts (Wang et al., 2011; Zupanc & Bosnić, 2017). Sung et al. (2016) proposed an algorithm to measure the quality of text in terms of key concepts. Following Sung et al. (2016)’s method, we revised the scoring equations to measure the contribution of diverse knowledge types. Equation (1) represents the ratio of different types of knowledge entities to the total words of the paper. Equation (2) represents the ratio of knowledge entities to the total number of words in all papers. Equation (3) describes the relative contribution (score) compared with a number of papers. This score offered by EduNERScore is not the final score of a paper, but is used as a reference indicator for instructor assessment. Moreover, this module also provides a detailed description of other metrics, such as the explanation of entity types, and the ratio of each entity in the paper. In summary, EduNERScore aims to provide raters with insight into the composition of knowledge of unstructured text rather than to offer a final assessment score.
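Since Equations (1) to (3) are only described verbally here, the following Python sketch gives one plausible reading of those descriptions; it is not the authors’ exact formulation. It assumes that each paper is summarized by its word count and by the list of knowledge types assigned to its extracted entities. The field names (`entity_types`, `n_words`), the function names, and the sample values are hypothetical.

```python
from collections import Counter

def type_ratios(paper):
    """Reading of Eq. (1): entities of each knowledge type per word of the paper."""
    counts = Counter(paper["entity_types"])
    return {k_type: n / paper["n_words"] for k_type, n in counts.items()}

def corpus_ratio(papers):
    """Reading of Eq. (2): all knowledge entities over all words in all papers."""
    total_entities = sum(len(p["entity_types"]) for p in papers)
    total_words = sum(p["n_words"] for p in papers)
    return total_entities / total_words

def reference_score(paper, papers):
    """Reading of Eq. (3): a paper's entity density relative to the corpus density.

    A value above 1 means the paper is denser in knowledge entities than the
    corpus average. This normalization is an assumption, not the paper's formula.
    """
    density = len(paper["entity_types"]) / paper["n_words"]
    return density / corpus_ratio(papers)

# Hypothetical toy corpus of two papers.
papers = [
    {"entity_types": ["concept", "concept", "theory"], "n_words": 100},
    {"entity_types": ["policy"], "n_words": 100},
]
first_paper_ratios = type_ratios(papers[0])
first_paper_score = reference_score(papers[0], papers)
```

As in the tool itself, such a score would serve only as a reference indicator for the rater, not as a final grade.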
Next, we demonstrate the details of the user interface of EduNERScore. EduNERScore provides a straightforward operation routine. Users only need to specify the filename and input the source text on the user side, and the server side returns the results of the text analysis (see Figure 2). In the knowledge visualized network (see Figure 2(a)), a knowledge node has a corresponding color and the node size is weighted by degree. The degree of a node represents the occurrence frequency of a knowledge entity. An edge connects a knowledge entity and the corresponding knowledge type; the edge’s thickness represents the frequency of a knowledge entity with a corresponding knowledge type. The right section of EduNERScore (see Figure 2(b)) shows detailed descriptions, such as the description of knowledge entity types, the explanation related to visualization results, and the quantitative description. Users can choose and switch the menu list to view information they are interested in. Based on the knowledge visualized network and related information feedback provided by EduNERScore, users can easily check the composition and ratio of diverse knowledge types, and explain the derivation of the results, which can increase the transparency and interpretability of CSWA.
Figure 2. The interface of EduNERScore.
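The node and edge weighting described for the knowledge network can be illustrated with a minimal Python sketch. It assumes that the NER step yields one (entity, knowledge type) pair per mention. The sample mentions and the function name `build_knowledge_graph` are hypothetical, and the sketch only computes the weights a graph renderer would need rather than reproducing the tool’s implementation.

```python
from collections import Counter

# Hypothetical NER output: one (entity text, knowledge type) pair per mention.
mentions = [
    ("learning analytics", "concept"),
    ("learning analytics", "concept"),
    ("constructivism", "theory"),
    ("learning analytics", "concept"),
    ("constructivism", "theory"),
    ("STEM education", "concept"),
]

def build_knowledge_graph(mentions):
    """Return the weights needed to draw an entity-to-type network."""
    # Edge thickness: how often an entity occurs with a given knowledge type.
    edge_weight = Counter(mentions)
    # Entity node size: occurrence frequency of the entity (its weighted degree).
    entity_size = Counter(entity for entity, _ in mentions)
    # Type node size: number of mentions attached to the type, i.e. its cluster.
    type_size = Counter(k_type for _, k_type in mentions)
    return edge_weight, entity_size, type_size

edge_weight, entity_size, type_size = build_knowledge_graph(mentions)
```

The three counters could then be passed to any graph-drawing library as edge widths and node sizes.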
Methodology
Research Purpose and Questions
The research purpose was to examine the effect of EduNERScore in the academic writing assessment process. Our research questions were:
RQ 1. To what extent did EduNERScore improve the validity of assessment of academic writing papers?
RQ 2. To what extent did EduNERScore improve the efficiency of assessment of academic writing papers?
RQ 3. To what extent did EduNERScore enhance raters’ interpretations of their academic writing assessment results?
RQ 4. What were the raters’ perceptions of using EduNERScore in the academic writing assessment?
The Experiment Context
To examine the effect of EduNERScore, we conducted an experiment to compare academic writing assessment with and without the support of the tool. The dataset of academic papers was collected from an undergraduate-level course named Modern Educational Technology. The instructor conducted a collaborative writing activity to help students understand related subject topics and improve their knowledge inquiry and application. Sixty-nine students enrolled in this course. They were divided into 18 groups (3 or 4 students per group), and each group was required to complete four collaborative writing activities (each writing task lasted for 2 weeks). The four academic writing topics were Learning Analytics, Education and Artificial Intelligence, STEM education, and Computer Programming in K-12 education. Groups collaboratively completed the academic papers on a collaborative writing platform named ZheDaYunPan. After the course, excluding 3 incomplete papers, a total of 69 papers were collected from ZheDaYunPan. Based on the instructional goals, the academic writing assessment should focus on students’ level of knowledge acquisition and application demonstrated in the academic papers.
The Participants, Experiment Design, and Procedures
All raters engaged in the experiment.
We designed an experiment to examine the effect of EduNERScore on academic writing assessments. The expert group was only involved in the scoring process, and the average score of the three experts was used as the standardized score of the papers. The intraclass correlation coefficient (ICC) of three raters’ scores in the expert group was
The experiment consisted of three main phases (see Figure 3): a training phase, an assessment phase, and an interview phase. The training phase aimed to help all raters understand the experimental operations. This phase included four steps. First, one of the researchers explained the experiment procedures to raters. Second, researchers helped raters master the think-aloud method, which could verbalize raters’ thoughts during the assessment (Cotton & Gresty, 2006; Zheng et al., 2021). In this session, all raters went through practice sessions until they became proficient in using the think-aloud method. They were required to adopt the think-aloud method to verbalize their thoughts when a score was given. Third, the researcher explained the assessment rubric and the recording requirements to raters (see Appendix A). Raters were required to fill in a score sheet after completing the assessment of each paper. Fourth, raters in the experimental group were trained to use EduNERScore skillfully, while raters in the control group were asked to score the papers in the traditional, manual fashion.
Figure 3. The experimental workflow.
Second, raters completed the academic writing assessment tasks during the assessment phase. At this stage, raters in the experimental group used the EduNERScore tool to access the quantitative metrics and visualize the knowledge graph of the paper, and then reconfirmed the quality of the metrics by reviewing the paper, which in turn assisted them in scoring and interpreting the assessment results. Moreover, all raters were required to record assessment data in strict accordance with the requirements in the assessment rubrics (see Appendix A). The assessment rubric included two dimensions, namely identification and quality. Identification indicated whether the metrics could be easily identified by raters during the assessment, and quality indicated the quality of the corresponding metrics evaluated by raters. During the assessment process, they were also required to record each paper’s assessment time and categorize the time into a five-point time-use scale (see Appendix A). The self-reported time-use scale was the indicator of the assessment efficiency. In addition, they were asked to use the think-aloud method to explain why they gave the score. The think-aloud method required raters to articulate their thoughts and feelings when performing an assessment task, which helped researchers review the raters’ thinking process or decision-making (Cotton & Gresty, 2006). Finally, to investigate raters’ perception of EduNERScore, raters in the experimental group were required to complete a questionnaire at the end of each assessment task. Based on Wang et al. (2011)’s questionnaire, we designed a questionnaire (see Appendix B), including eight dimensions: usefulness, usability, attitude, technological complexity, concentration, satisfaction, stability, and usage intention. There were 33 items under 8 dimensions, measured by a five-point Likert scale.
The final phase was a semi-structured interview (see Appendix C), used to collect raters’ perceptions about EduNERScore. Since only the experimental group was provided with the EduNERScore, the questionnaire and interview were only carried out in the experimental group.
Data Collection and Analysis
Data included the think-aloud audio recordings and raters’ score sheets from both the experimental and control groups, as well as the questionnaire and semi-structured interview data from the experimental group. There were 897 think-aloud audio recordings (156,028 Chinese words), 897 score sheets, 23 questionnaires, and 6 sets of interview data (54,199 Chinese words).
The coding framework.
We adopted the one-way ANOVA method to examine the statistical differences between the experimental and control groups on validity, efficiency, and assessment metrics. The comparison of validity was used to examine the accuracy of the assessment methods. Each paper was scored by all raters, and the average score was taken as the final score of a paper. Specifically, the comparison of efficiency reflected the extent to which EduNERScore helped raters complete the time-consuming assessment tasks more quickly, and the comparison of assessment metrics reflected how each specific metric contributed to the validity.
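For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed as in the following Python sketch, which uses only the standard library. The score lists are invented for illustration and are not data from this study; the function name `one_way_anova_f` is likewise hypothetical. Obtaining a p-value additionally requires the F distribution, which a statistics package would normally provide.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Invented paper scores for two groups of raters (illustration only).
control_scores = [70, 72, 68, 75, 71]
experimental_scores = [78, 80, 77, 82, 79]
f_stat, df_between, df_within = one_way_anova_f(control_scores, experimental_scores)
```

With two groups, as in the experimental versus control comparison, the resulting F equals the square of the independent-samples t statistic.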
Moreover, descriptive statistics were employed to investigate the questionnaire responses in order to understand the raters’ perceptions of EduNERScore. Thematic analysis was used to analyze the transcribed interview content in order to identify raters’ perceptions about the advantages and disadvantages of the EduNERScore tool.
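The descriptive analysis of the questionnaire can be sketched in Python as follows. The item identifiers, the item-to-dimension mapping, and the responses are hypothetical stand-ins (the actual 33 items are listed in Appendix B); the sketch averages each rater’s items within a dimension and then summarizes across raters.

```python
from statistics import mean, stdev

# Hypothetical mapping of questionnaire items to two of the eight dimensions.
DIMENSIONS = {
    "usefulness": ["u1", "u2"],
    "stability": ["s1", "s2"],
}

def dimension_stats(responses, dimensions):
    """Return {dimension: (mean, standard deviation)} across questionnaires."""
    stats = {}
    for name, items in dimensions.items():
        # One score per questionnaire: the mean of its items in this dimension.
        per_rater = [mean([r[item] for item in items]) for r in responses]
        spread = stdev(per_rater) if len(per_rater) > 1 else 0.0
        stats[name] = (mean(per_rater), spread)
    return stats

# Invented five-point Likert responses from two questionnaires.
responses = [
    {"u1": 5, "u2": 4, "s1": 4, "s2": 4},
    {"u1": 4, "u2": 4, "s1": 3, "s2": 5},
]
stats = dimension_stats(responses, DIMENSIONS)
```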
Results
To What Extent Did EduNERScore Improve the Validity of Assessment of Academic Writing Papers?
Assessment score comparison across the control group, experimental group and expert group.
Note. ‘—’ means the deleted duplicated value.
The comparisons of seven metrics in the assessment rubric.
To What Extent Did EduNERScore Improve the Efficiency of Assessment of Academic Writing Papers?
Assessment efficiency comparison between the control and experimental groups.
Moreover, Figure 4 showed the comparison of efficiency. First, the assessment efficiency presented a negative relation with the length of the paper. That is, the assessment efficiency decreased as the word count of the paper increased, regardless of whether the assessment was supported by EduNERScore. Second, although the average efficiency of assessment gradually declined as the number of words per paper increased, the experimental group generally performed better than the control group. In other words, the experimental group with EduNERScore support showed a clear advantage in assessment efficiency at the holistic level.
Figure 4. The comparison of the assessment efficiency. The x axis is the index of papers; the left y axis is the efficiency scale; the right y axis is the number of words per paper.
To What Extent Did EduNERScore Enhance Raters’ Interpretations of Their Academic Writing Assessment Results?
Comparisons of the interpretation dimensions of the scoring results.
The comparisons of the think-aloud coding results.
What Were the Raters’ Perception of Using EduNERScore in the Academic Writing Assessment?
The analysis results from raters’ questionnaire and interview responses further revealed their perception of EduNERScore. Figure 5 showed that the average ratings of the eight dimensions (usability = 4.28; usefulness = 4.43; usage intention = 4.38; stability = 4.07; attitude = 4.45; satisfaction = 4.43; concentration = 4.37; technological complexity = 4.35) indicated that raters were satisfied with the effect of EduNERScore. The concentration (M = 4.37) indicated that EduNERScore could help raters focus on the writing assessment task; the technological complexity (M = 4.35) indicated that raters believed there was no technical barrier to using EduNERScore; the attitude (M = 4.45) and satisfaction (M = 4.43) indicated that all raters held a positive perception; the remaining metrics (e.g., usability; usefulness; usage intention) indicated that EduNERScore made a substantial contribution to raters’ assessment; the lowest score, for stability (M = 4.07), implied that the functions and stability of EduNERScore could be improved in the future. Moreover, the interview results further confirmed that EduNERScore helped raters identify relevant evidence for making an objective assessment. For example, interviewee #4 said: “EduNERScore presented a good experience for me. It was easy to use, and provided quantitative measures and understandable visualizations …”. Interviewee #6 added: “With this visualization tool, I could easily identify more metrics and evaluate the papers objectively…”. And interviewee #3 emphasized: “it helped me become aware of details that were not easy to be noticed…”.
Figure 5. Results of raters’ perceptions.
In addition, interviewees argued that EduNERScore helped them improve the efficiency and fairness of their assessments. As interviewee #1 said: “I would like to firstly focus on the quantitative indicators and check the knowledge graph…my assessment process was very efficient”, and interviewee #2 said: “For long papers, I often forgot the previous content, and this tool helped me to recall those content...”.
However, interviewees also reported some shortcomings of EduNERScore. First, they found that the reference scores were sometimes inconsistent with raters’ scores. Some raters thought the papers were poorly written, but EduNERScore offered the opposite results. As interviewee #3 explained: “For some papers, although the tool identified knowledge types... but when I doublechecked the paper’s content, I found that the content was not usually well written”. Second, they reported that the structured knowledge generated by EduNERScore had some entity-boundary errors (e.g., it did not locate the right starting or stopping position of an entity) or entity-type errors (e.g., it did not assign the right entity type to an entity). Interviewees also mentioned that, although the incorrectly structured knowledge sometimes confused them, it did not negatively influence their subsequent assessment. As interviewee #5 said: “I interpreted some indicators from the original content, so that some of the errors did not bother my assessment”.
Discussions and Implications
Addressing research questions
Interpretability of CSWA provides knowledge-aware feedback, which helps raters deal with complicated academic writing assessment processes. However, existing research has focused on basic text features related to assessment (e.g., length, grammatical errors, etc.) rather than interpretable evidence related to the writing topics and content (Thompson & Braude, 2016). This research designed and implemented an academic writing assessment tool named EduNERScore to make available structured knowledge and quantifiable metrics, in order to improve raters’ interpretability of assessment and further improve the quality of assessment. We employed the experimental research with multi-methods to investigate the validity, efficiency, interpretability of EduNERScore as well as raters’ perceptions about the tool. First, the experimental group with the support of EduNERScore presented advantages in improving the assessment validity. That is, the experimental group not only outperformed the control group on indicators during the assessment process, but also achieved similar scores with the scores assessed by the expert group. The findings implied that EduNERScore employing the semi-automated assessment mechanism enabled raters to make objective assessments based on these comprehensive and quantitative indicators (Weinerth et al., 2014), which in turn offered human experts-like validity like other CSWA systems (Azmi et al., 2019; Mao et al., 2018; Martínez-Huertas et al., 2019).
Second, EduNERScore exhibited remarkable efficiency, consistent with the results of other computer-assisted writing assessment cases (Jackson, 2000; Süzen et al., 2020; Wilson & Czik, 2016). The experimental group spent significantly less time on assessment than the control group. One of the major advantages of EduNERScore is that it extracted structured knowledge aligned with the required rubrics and analyzed the text quality automatically and in a timely manner. However, because individual raters used the semi-automated assessment tool subjectively, the time spent on each paper fluctuated across raters. In summary, the efficiency of EduNERScore was reflected not at the local level (namely, each paper) but at the holistic level.
Third, EduNERScore improved raters’ access to assessment evidence, which positively affected the interpretability of assessment results. The think-aloud results indicated that raters in the experimental group, who received more metric-related feedback from EduNERScore, interpreted the causality of assessment results better than raters in the control group (Wilson & Czik, 2016; Wilson et al., 2021a, 2021b). In particular, Exploration and Elaboration were used more frequently in the experimental group than in the control group; moreover, the experimental group mentioned Knowledge types, whereas the control group never considered them. Taken together, these results show that raters in the experimental group elaborated the causal interpretation of assessment results, since EduNERScore provided visual and quantifiable representations that helped them recall critical ideas (Sung et al., 2016). In contrast, raters in the control group, who worked without EduNERScore, tended to interpret the scoring results with basic metrics (such as Reference) that they could recall.
Finally, the responses from questionnaires and interviews revealed that EduNERScore offered raters a positive assessment experience, which was consistent with prior research (e.g., Wang et al., 2011; Wilson et al., 2021a, 2021b). The results suggested that the tool helped raters deal effectively with complicated writing assessment tasks by extracting domain knowledge and presenting evidence, which improved raters’ engagement and willingness to complete the assessment tasks (Nunes et al., 2022; Wilson & Czik, 2016). Overall, the EduNERScore tool provides instructors with scaffolding for assessing academic writing with sound validity and good user perceptions, while balancing the efficiency and interpretability of the assessment.
Technological Implications
Based on the empirical results, this research offers technological implications for developing and optimizing interpretable tools for academic writing assessment. First, the knowledge-aware strategy is a critical mechanism for improving the interpretability of CSWA systems. Its strengths comprise two aspects: representing students’ discipline knowledge and providing raters with knowledge-aware evidence. This result is consistent with previous findings that fine-grained knowledge feedback related to the writing topic can facilitate the interpretability of the assessment (Wilson et al., 2021a, 2021b). Furthermore, EduNERScore implements the knowledge-aware strategy with the NER technique, which extracts knowledge entities and their corresponding entity types simultaneously. However, academic writing is a complicated, creative integration of knowledge by individual students; therefore, its assessment cannot depend merely on the knowledge entities and types included in the text. For instance, knowledge entities can reflect what knowledge is contained, but cannot show how the author organizes and links these entities into creative or critical writing (Al-Moslmi et al., 2020). In this case, knowledge graph technology, which identifies both entities and relationships, can provide comprehensive access to knowledge and the associative relationships among entities, thus addressing the limitations of NER technology (Ji et al., 2022). Therefore, we suggest that future research consider employing the knowledge graph technique in designing interpretable assessment tools.
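To illustrate the difference between entity-level and relation-level evidence, the following sketch contrasts what an NER-style output and a knowledge-graph-style output might contain for the same passage. All entities, type labels, and relation names are invented for illustration, and the helper functions are hypothetical; nothing here reflects the actual output of EduNERScore or of any specific knowledge graph system.

```python
from collections import defaultdict

# NER-style output: each item says WHAT knowledge appears and its type.
entities = [
    ("constructivism", "Theory"),
    ("online discussion", "Technique"),
    ("knowledge construction", "DomainKnowledge"),
    ("MOOC", "Technique"),
]

# Knowledge-graph-style output: (head, relation, tail) triples also say
# HOW the author connects those pieces of knowledge.
triples = [
    ("online discussion", "grounded_in", "constructivism"),
    ("online discussion", "supports", "knowledge construction"),
]


def build_adjacency(triples):
    """Group outgoing relations by head entity."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


def unlinked_entities(entities, triples):
    """Entities that appear in the text but take part in no relation."""
    linked = {h for h, _, _ in triples} | {t for _, _, t in triples}
    return [name for name, _ in entities if name not in linked]


graph = build_adjacency(triples)
isolated = unlinked_entities(entities, triples)
print(graph)
print(isolated)
```

In this toy example, the entity list alone cannot distinguish "MOOC", which is named but never connected to anything, from "constructivism", which anchors the argument; the triples make that difference visible.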
Moreover, the validity issue, namely the inconsistency of assessment results, to some extent counteracts the advantages of EduNERScore. A critical phenomenon associated with this issue is the discrepancy between the EduNERScore reference score and the human grading score, which has also been pointed out in prior research (e.g., Mao et al., 2018). We identify two possible reasons. First, the KME extracts some redundant entities that have similar meanings (e.g., both “theory” and “constructivism” were identified). When redundant entities are visualized and counted by EduNERScore, the resulting scores become less accurate. Second, some entities are only mentioned in a paper without further elaboration; such entities do not contribute to the quality of the paper, but EduNERScore cannot identify this difference in depth. To overcome this disadvantage, future work should optimize the algorithm to improve its accuracy and capture the contextual semantics of knowledge entities. Another disadvantage is that the information presented in the graphs may increase the rater’s cognitive load (Sung et al., 2016), which could weaken the rater’s motivation to use the tool consistently (Chen & Tsao, 2021). Thus, the findings suggest that future designs should improve algorithmic performance to reduce redundant information and optimize the representations of knowledge or the hierarchy of feedback to make them more accessible and digestible (Kim et al., 2020).
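The two reasons above can be sketched in code. The example below is a simplified illustration under strong assumptions and is not the KME algorithm: it assumes a hand-written mapping from a specific entity to its broader term (`BROADER_TERM`, hypothetical), and it uses the raw mention count as a crude stand-in for elaboration, which the contextual-semantic analysis suggested for future work would have to replace.

```python
from collections import Counter

# Hypothetical mapping from a specific entity to a broader, near-duplicate one.
BROADER_TERM = {"constructivism": "theory"}


def drop_redundant(mentions):
    """Remove a generic entity when a more specific counterpart is present."""
    present = set(mentions)
    generic = {BROADER_TERM[m] for m in present if m in BROADER_TERM}
    return [m for m in mentions if m not in generic]


def mention_only(mentions, min_count=2):
    """Entities mentioned fewer than min_count times (crude proxy for depth)."""
    counts = Counter(mentions)
    return sorted(e for e, c in counts.items() if c < min_count)


# Toy list of entity mentions extracted from one paper (invented data).
mentions = ["theory", "constructivism", "constructivism", "scaffolding"]

filtered = drop_redundant(mentions)
shallow = mention_only(filtered)
print(filtered)
print(shallow)
```

Under these assumptions, "theory" is dropped because "constructivism" already covers it, and "scaffolding" is flagged as mentioned only once. A score computed from the filtered list would no longer count the same idea twice, although a single mention can still be substantive, so the flag is a prompt for the rater rather than a judgment.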
Educational Implications
The interpretable academic writing assessment tool can be used in educational practice to improve efficiency, promote the interpretability of scoring, and serve as instructional scaffolding for cognitive visualization and higher-order semantic analysis. First, offering real-time feedback and evidence can help students revise their writing effectively (Link et al., 2022; Wilson et al., 2021a, 2021b). The findings imply that EduNERScore can straightforwardly present fine-grained knowledge related to writing topics. Thus, integrating EduNERScore into discipline-specific contexts enables the instructor to handle writing assessment-related instruction efficiently, especially in large classes (Nunes et al., 2022).
Second, EduNERScore’s ability to extract structured knowledge naturally suggests applications in text analysis contexts. The real-time presentation of knowledge and corresponding types can promote students’ engagement in knowledge construction. For example, when this type of tool is integrated into online discussion activities, efficient knowledge extraction can prevent students from missing critical discussion information and help them focus on the current discussion topics to construct knowledge (Su et al., 2018). It can also enable students to quickly compare their knowledge with that of their peers, which in turn triggers cognitive conflicts and deepens knowledge construction (Li et al., 2021). Therefore, a knowledge-aware tool such as EduNERScore has the potential to be applied in diverse instruction and learning contexts to foster students’ collaborative knowledge construction.
Conclusion, Limitations, and Future Works
The increasing application of CSWA systems in education requires improving the interpretability of academic writing assessments in order to assure assessment quality (Wilson et al., 2021a, 2021b). This research designs and implements an interpretable CSWA tool (EduNERScore) that models domain knowledge from a quantitative, visual perspective, which helps raters interpret assessment results at the knowledge level and thereby improves the quality of academic writing assessment. The positive findings indicate that EduNERScore can make human assessments more valid, efficient, and interpretable. This research has two limitations, which lead to future research directions. First, our NER model makes errors in knowledge recognition because its accuracy is influenced by various factors (such as language category and dataset size). Future work should investigate advanced AI algorithms to improve the performance of the knowledge mining engine. Second, this research was conducted with a small sample, and the numbers of experts and of participants of different genders were imbalanced, which may limit the generalizability of the findings. Future work should apply and test the tool in a large-scale instructional context to verify the research results and implications. Overall, EduNERScore offers AI-driven, knowledge-aware scaffolding for educational assessment, which enables instructors to address challenges in complicated assessment tasks effectively and to conduct diverse instructional practices related to academic writing.
Footnotes
Acknowledgments
We appreciate participants’ engagement in this research.
Authors’ Contribution
Xu Li took responsibility for the software development, data collection and analysis, and writing of the manuscript draft. Fan Ouyang took responsibility for research conception, writing and revision of the manuscript, as well as supervision of the research. Jianwen Liu took responsibility for data analysis and software development. Chengkun Wei took responsibility for software development. Wenzhi Chen supervised the research project.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors acknowledge the financial support from the National Natural Science Foundation of China (62177041), the 2021 Key research and development plan of Zhejiang province (2021C03140), Zhejiang Province educational science and planning research project (2022SCG256), and Zhejiang University graduate education research project (20220310).
Data Availability
Data is available upon request from the first author.
The Assessment Rubric.
| Assessment Rubric | |
| --- | --- |
| Paper No. | Score: |

| Dimensions | | Quality (1 = very bad; 2 = bad; 3 = neutral; 4 = good; 5 = very good) |
| --- | --- | --- |
| Overall Assessment | Overall quality | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Overall style | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Argument quality | ☐1 ☐2 ☐3 ☐4 ☐5 |

| | | Identification | Quality |
| --- | --- | --- | --- |
| Specific Assessment | Theoretical foundation | ☐1 ☐2 ☐3 ☐4 ☐5 | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Domain knowledge | ☐1 ☐2 ☐3 ☐4 ☐5 | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Literature review | ☐1 ☐2 ☐3 ☐4 ☐5 | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Educational policies | ☐1 ☐2 ☐3 ☐4 ☐5 | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Techniques/tools | ☐1 ☐2 ☐3 ☐4 ☐5 | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Research purpose | ☐1 ☐2 ☐3 ☐4 ☐5 | ☐1 ☐2 ☐3 ☐4 ☐5 |
| | Reference | ☐1 ☐2 ☐3 ☐4 ☐5 | ☐1 ☐2 ☐3 ☐4 ☐5 |
| Duration | ☐ ≈60min ☐ ≈45min ☐ ≈35min ☐ ≈25min ☐ ≈15min | | |
