Abstract
Automated writing scoring can provide not only holistic scores but also instant, corrective feedback on L2 learners’ writing quality. Its use has been increasing throughout China and internationally. Given these advantages, the past several years have witnessed the emergence and growth of writing evaluation products in China. To the best of our knowledge, no previous studies have examined the validity of China’s automated essay scoring systems. Drawing on the four major categories of argument in the validity framework proposed by Kane (scoring, generalization, extrapolation, and implication), this article evaluates the performance of one of China’s automated essay scoring systems, iWrite, against human scores. The results show that iWrite fails to be a valid tool to assess L2 writings and predict human scores. Therefore, iWrite currently should be restricted to nonconsequential uses and cannot be employed as an alternative to or a substitute for human raters.
Introduction
As one of the learning outcomes in educational settings, writing proficiency is regarded as an essential component of communication, indicating whether a student can convey a message or express an idea with clarity and ease. The assessment of writing ability has evolved from simply asking students to answer multiple-choice items to constructed-response tasks in which students are required to read one or more prompts and then reflect on the major viewpoints to develop an essay that is examined for language use, writing skills, and critical thinking. To assess writing ability, human raters generally need to be recruited and trained to assign scores. However, manual grading is subject to fatigue, distraction, inconsistency of scoring across time, and so on (Braun, 1988; Hughes & Keeling, 1984; Lunz, Wright, & Linacre, 1990). Because this way of scoring is time-consuming and labor-intensive, there is a pressing need for quick feedback, including instant scoring and integrated analysis of writing quality. Automated essay scoring (AES), which refers to the provision of writing scores by computer programs, emerged in response. Some AES systems go beyond holistic scores and also provide feedback on sentence structure, organization, language use, and so on. These systems are now generally referred to as automated writing evaluation (AWE) systems.
Constantly developing computer technology has brought the goal of instant scoring within reach. One early manifestation is the computer-aided writing scoring system Project Essay Grader™, first introduced in the 1960s by Ellis Batten Page and his colleagues (Page, 1966). Since then, such systems have been used increasingly around the world and have been praised both for saving labor and time (Dikli, 2006) and for reducing the bias an individual rater may have in assessing writing (Weigle, 2010). AES has therefore come to be regarded as a significant supportive computational tool in educational environments (Zupanc & Bosnic, 2017).
The widespread use of AES technologies has also changed the way scores are reported. For instance, e-rater®, an essay scoring and evaluation system developed by Educational Testing Service (ETS), has been used for high-stakes tests such as the Analytical Writing section of the GRE (Graduate Record Examination; Issue and Argument writing tasks). Each essay receives one score from at least one trained human rater and another from e-rater. If the human and e-rater scores closely agree, the average of the two scores is the final score. If they disagree, a second human score is obtained, and the final score is the average of the two human scores. Likewise, automated scoring is employed to complement human scoring in the TOEFL Writing section (Independent and Integrated writing tasks), with human raters judging the quality of content and meaning and e-rater® scoring linguistic features. In addition, low-stakes tests such as placement tests directly adopt automated scores to report writing assessment; the major AES systems used for this purpose are ACT’s COMPASS, College Board’s ACCUPLACER, the Project Essay Grader™ (Page, 2003), and e-rater as used in the ETS Criterion® Online Writing Evaluation service.
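The score-resolution rule described for the GRE can be sketched in a few lines. This is only an illustration of the logic stated above, not ETS's implementation: the function name `resolve_final_score`, the callback `get_second_human_score`, and the `tolerance` value are hypothetical, since the text says only that the scores must "closely agree" and gives no threshold.

```python
from typing import Callable


def resolve_final_score(
    human_score: float,
    machine_score: float,
    get_second_human_score: Callable[[], float],
    tolerance: float = 0.5,  # ASSUMPTION: placeholder; the text only says "closely agree"
) -> float:
    """Combine a human score and a machine score using the rule described in the text."""
    if abs(human_score - machine_score) <= tolerance:
        # Scores closely agree: report the average of the human and machine scores.
        return (human_score + machine_score) / 2
    # Scores disagree: obtain a second human score and average the two human scores.
    second_human = get_second_human_score()
    return (human_score + second_human) / 2
```

The second human rating is passed as a callback so that it is requested only when the first two scores disagree, mirroring the order of steps in the description.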
Meanwhile, the increasing use of such systems in L2 settings has prompted great interest in evaluating the merits and demerits of AES systems, particularly the extent to which machine scoring agrees with human raters. Evaluating automated scoring against human rating has long been a focus in language assessment, since agreement between human raters and computer-generated scores serves as a first step in judging the reliability of AES systems.
To date, most of the literature reports high agreement between scores generated by AES systems and human expert raters (Attali, 2004; Burstein, 2003; Burstein & Chodorow, 1999; Elliot, 2003; Page, 2003; Rudner, Garcia, & Welch, 2006; Shermis, Burstein, Higgins, & Zechner, 2010). For example, a high degree of accuracy has been demonstrated for such systems as Intelligent Essay Assessor (Landauer, Foltz, & Laham, 2003; Pearson, 2007), IntelliMetric (Elliot, 2001; Dikli, 2006), and Criterion® (Koizumi, Asano, & Agawa, 2016; Ramineni, Trapani, Williamson, Davey, & Bridgeman, 2012; Shermis & Burstein, 2003). Despite the increasingly wide application of AES systems, there is also controversy over and opposition to the use of automated scoring, because it is argued that computer software cannot validly rate a student’s writing as humans do (Anson, 2006; Cheville, 2004; Herrington & Moran, 2001; McCurry, 2010).
It is worth noting that existing studies on AES target such programs as the Criterion® Online Writing Evaluation service by ETS (the e-rater® engine), IntelliMetric® and MY Access! by Vantage Learning, Writing Roadmap by McGraw Hill Education, WriteToLearn by Pearson Education, and Summary Street (a central part of the Articulate Learners Project at the University of Colorado), most of which are designed and developed by institutions and companies in the United States. In addition, to the best of our knowledge, these applications are inaccessible in China. In the absence of such AES tools, the past several years have witnessed the launch of China’s own AWE systems, among which iWrite is one of the two most extensively employed (the other being pigai) in low-stakes writing assessment (in classrooms only). Because English as a foreign language (EFL) is a compulsory course in China’s secondary and higher education, China has an extremely large population of EFL learners, and the number of learners using AES as a means of assessing L2 writing quality is increasing substantially. However, a critical review of the literature shows that the validity of China’s flourishing automated scoring programs remains unexamined. The absence of such studies has motivated this study. A study of this kind is valuable because it sheds light on China’s computer-assisted scoring technology and contributes useful insights into the application of AES technology in China, which in turn will help improve the validity of China’s future products in this regard. This article is a preliminary endeavor to bridge this research gap. Results from this study also have implications for both instructors and AES vendors. This article aims to address the following research questions:
Research Question 1: What is the relationship between human and iWrite scores?
Research Question 2: Are the scores generated by iWrite representative of responses across all possible writing tasks in comparison to human scores? That is, when test takers respond to different prompts of similar design, are their scores similar across responses?
Research Question 3: Do the automated scores of the same participant demonstrate consistency across similar writing prompts?
About iWrite
iWrite (the current version is V2.5) is a commercial online program developed by an online education technology company affiliated with Foreign Language Teaching and Research Press, the largest foreign languages publisher and university press in China. As displayed on the writing feedback webpage, iWrite not only provides instant holistic scores but also reports on linguistic features under four domains: language (fluency, accuracy, and complexity), content (relevance and coherence), discourse structure (organization and discourse markers), and mechanics (spelling and punctuation). The final scores and feedback can be reported to the students based on the automated assessment or on the instructor’s revised evaluation. In addition, the program presents instructors with a repertoire of writing assignments in different discourse modes, such as argumentation, exposition, and a variety of letter writing. Instructors can also create their own writing tasks, which can likewise be scored automatically. Students complete the assigned writing task online, in or out of class, within a word limit and under either timed or untimed conditions, and they can submit their drafts once or several times, depending on the instructor’s requirement. While writing, students have access to a writing assistant, an online dictionary that helps them with skills such as word choice and collocation. Copy and paste buttons can be turned on or off within the browser to prevent possible plagiarism. Moreover, iWrite can assess writings in batch mode: the instructor can upload more than one writing sample into the system, and the automated scores can subsequently be exported to spreadsheets for further analysis.
Methods
Participants
All 332 participants in this study are non-English-major undergraduates at a 4-year Chinese university. They are all native Chinese speakers in their first year of university, preparing for the National College English Test Band 4 (CET 4) for the first time.
Validation Framework
Kane (2006) proposed four major categories of argument: scoring, generalization, extrapolation, and implication or decision. Specifically, these inferences refer, respectively, to the association between automated scores and human scores, the representativeness of the observed scores across parallel forms of the same test, the actual performance of the test takers beyond the testing context, and construct understanding or decision-making based on the target score. Because the extrapolation inference involves external variables (such as within-test relationships, self-evaluated writing ability, and writing portfolios) that are imperfect and problematic, it complicates the interpretation of validity evidence for automated scores in comparison to human scoring (Ramineni & Williamson, 2013; Williamson, Xi, & Breyer, 2012). Drawing on Kane’s validation framework, this study therefore adopts only three of the four inferences (scoring, generalization, and implication) to evaluate China’s AWE system iWrite.
The Scoring Rubric
Hamp-Lyons (1991) suggested that “focused holistic scoring” (p. 244) is commonly used as a procedure for assessing writing proficiency. Although it has been extensively employed in major writing assessments such as the TOEFL Internet Based Test (iBT) and GRE (Williamson, 1993; Williamson & Huot, 1993), holistic scoring is not without problems: it cannot represent an examinee’s performance across different aspects of writing. To overcome this limitation, East (2009) claimed that the reliability of holistic scoring can be improved by using an analytic rubric, in which each linguistic trait is marked individually and the marks are then summed to form an overall score. Although this kind of scoring is time-consuming, it has the advantage of presenting more detail about the different facets of writing quality.
To ensure accurate and fair scoring, the scoring criteria for human raters adopted in this study are based on the “ESL Composition Profile” (Jacobs, Zinkgraf, Wormuth, Hartfiel, & Hughey, 1981, p. 2) with slight revision. The original rubric includes five differentially weighted subscales of writing quality (content, language use, vocabulary, organization, and mechanics) and four levels of writing performance (excellent to very good, good to average, fair to poor, and very poor), with no zero level for no rewardable response. This rubric is regarded as “one of the best known and widely used analytic scales in ESL” (Weigle, 2002, p. 115). What differentiates the scoring rubric in this study from the original “ESL Composition Profile” is that each of the five writing subscales is equally weighted and is rated on a 4-point scale (0, 1, 2, and 3), where 0 indicates no rewardable response and 3 indicates an excellent response. The subscales are equally weighted because differential weighting is problematic (East, 2009): it implies that some subscales in the rubric are more (or less) important, relevant, or involved than others (Weigle, 2002), and raters may pay more attention to some subscales than to others, or even neglect some. Hamp-Lyons (1991) also suggested that if one trait in a scoring rubric receives more weighting than others, then global scoring is being used rather than an analytic rubric. Therefore, to avoid subjectivity and undue emphasis on specific linguistic traits in assessing writing quality, the scoring rubric for human raters in this study weights all subscales equally.
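As a minimal sketch of the equally weighted rubric, the snippet below sums five subscale ratings of 0 to 3 into a single total. It assumes the five ratings are simply added, which is consistent with the analytic procedure described above and with a 15-point maximum, but the exact aggregation step is not spelled out in the text; the names `SUBSCALES` and `total_score` are illustrative only.

```python
# Subscale names follow the rubric described in the text.
SUBSCALES = ("content", "language_use", "vocabulary", "organization", "mechanics")
MAX_PER_SUBSCALE = 3  # each subscale is rated 0, 1, 2, or 3


def total_score(ratings: dict) -> int:
    """Sum five equally weighted subscale ratings (ASSUMPTION: plain sum, range 0-15)."""
    missing = [name for name in SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscale ratings: {missing}")
    for name in SUBSCALES:
        if not 0 <= ratings[name] <= MAX_PER_SUBSCALE:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_SUBSCALE}")
    # Equal weighting: every subscale contributes at most 3 of the 15 points.
    return sum(ratings[name] for name in SUBSCALES)
```

For example, ratings of 2, 2, 3, 2, and 1 on the five subscales give a total of 10 out of 15.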
Rating Procedures
Before the formal rating, a pilot rating was conducted in which 15 writing samples were randomly selected from the bank of all available writings written by the participants. Results of the pilot rating show that the two raters had a perfect match in rating the sample writings (100% exact agreement, kappa = 1). Three weeks later, 20 randomly selected writings, 10 of which (50%) came from the original samples and which were presented in a different order from the pilot rating, were rescored by the same two raters to examine the degree of intrarater reliability. Results show that 9 of the 10 original writings received the same scores (90% exact agreement, kappa = .756), which suggests substantial intrarater agreement. Because iWrite reports scores to one decimal place, this study adopted a rounded automated score for each writing to better analyze its correspondence with human scores.
Both the pilot rating and the formal rating in this study reported scores on a 15-point scale, which accords with the scoring scale of the Writing section in CET 4, a national English proficiency test administered by the Ministry of Education in China. CET 4 was formerly a mandatory test for Chinese undergraduates but has since become optional. In addition, a 15-point scale is also the predefined rating scale in our regular writing practice and final exams.
Based on the criteria in the scoring rubric described above, adapted from the “ESL Composition Profile,” each writing in the data set of this study was scored independently by two expert raters who have been teaching EFL for over 10 years and are experienced in analyzing and grading writings. The raters were given a short instruction manual and were trained before rating. They had been informed of the requirements of this study and therefore had a thorough understanding of the scoring rules.
Table 1. Summary of Points Representation in College English Test Band 4 for Writing Section.
However, the range within each score level could leave a small margin for subjectivity (East, 2009) because no objective criteria can be clearly defined as to whether a writing should get 7, 8, or 9 points in terms of, for example, the exact number of spelling errors, the degree of topic-relatedness, and coherence. To avoid subjectivity, the final reported human score for each writing takes one of five fixed scores, one per score level: 2 points, 5 points, 8 points, 11 points, and 14 points. These fixed scores have the advantage of categorizing all the writing samples into five levels of writing quality and of simplifying the measurement of interrater agreement. If two human scores for a single writing fall into different levels, a consensus score is applied: a third expert rater is invited, and the three raters discuss the discrepancy until they agree on a final score.
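The fixed-score procedure can be illustrated with a short sketch. It assumes that the five score levels are the 3-point-wide bands 1 to 3, 4 to 6, 7 to 9, 10 to 12, and 13 to 15; these cut points are inferred from the fixed scores (2, 5, 8, 11, 14) and the "7, 8, or 9" example, not taken from the band table itself. The function names are hypothetical, and the consensus discussion among three raters is represented only by a `consensus` argument supplied by the caller.

```python
from typing import Optional

FIXED_SCORES = (2, 5, 8, 11, 14)  # the five fixed scores named in the text


def score_level(score: int) -> int:
    """Return the level index (0-4) of a 1-15 score (ASSUMPTION: 3-point-wide bands)."""
    if not 1 <= score <= 15:
        raise ValueError("score must be between 1 and 15")
    return (score - 1) // 3


def fixed_score(score: int) -> int:
    """Map a score to the fixed score of its level, e.g. 7, 8, or 9 -> 8."""
    return FIXED_SCORES[score_level(score)]


def final_human_score(rater1: int, rater2: int, consensus: Optional[int] = None) -> int:
    """Report the shared fixed score, or the consensus score when the levels differ."""
    if score_level(rater1) == score_level(rater2):
        return fixed_score(rater1)
    if consensus is None:
        # The text resolves this case through discussion with a third expert rater.
        raise ValueError("raters disagree on level; a consensus score is required")
    return fixed_score(consensus)
```

Raising an error when no consensus score is given keeps the sketch from silently inventing a resolution rule that the study does not describe.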
Criteria for Agreement Between Automated Scores and Human Scores
Percentages of agreement (exact and exact-plus-adjacent) between human and automated scores have generally been regarded as a gold standard and a long-standing way to evaluate the reliability of AES, that is, whether automated scores can match those assigned by human raters, since reliability quantifies the consistency and inconsistency of an examinee’s performance on the assigned task (Feldt & Brennan, 1989). However, percentages of agreement are influenced by “scale dependence and sensitivity to base distributions” (Williamson et al., 2012, p. 7). In other words, the values of human–machine agreement vary when different scoring scales are used (say, a 5-point or a 10-point scale), and human raters tend to favor some score points over others in the process of scoring. To correct for chance agreement, this study uses weighted kappa and Pearson correlation as the first acceptance criterion for evaluating the agreement between the two human raters and between automated and human scores.
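The agreement measures named above can be computed with the standard library alone, as sketched below. Two points are assumptions rather than details from the study: the kappa weights are quadratic (the text says only "weighted kappa"), and "adjacent" is taken to mean within `step` score points, with `step` left to the caller because the text does not define adjacency for the 15-point scale. All function names are illustrative.

```python
from collections import Counter
from math import sqrt


def exact_agreement(a, b):
    """Proportion of essays given identical scores by the two sources."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def exact_plus_adjacent_agreement(a, b, step=1):
    """Proportion of essays whose two scores differ by at most `step` points."""
    return sum(abs(x - y) <= step for x, y in zip(a, b)) / len(a)


def pearson_r(a, b):
    """Pearson product-moment correlation between two score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def quadratic_weighted_kappa(a, b, categories):
    """Cohen's kappa with quadratic disagreement weights (ASSUMPTION: quadratic)."""
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter((index[x], index[y]) for x, y in zip(a, b))
    row = Counter(index[x] for x in a)
    col = Counter(index[y] for y in b)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * row[i] * col[j] / (n * n)
    # Undefined (division by zero) when both sources use a single category.
    return 1.0 - weighted_observed / weighted_expected
```

For human scores restricted to the five fixed points, `categories` would be `[2, 5, 8, 11, 14]`; for rounded scores on the full scale it would be `list(range(1, 16))`.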
Another acceptance criterion for evaluating human–AES score agreement is degradation. That is, “the automated–human scoring agreement cannot be more than .10 lower, in either weighted kappa or correlation, than the human–human agreement” (Williamson et al., 2012, p. 7). Williamson et al. (2012) also maintained that the standardized mean score difference between automated and human scores must not exceed a predefined threshold of .15, so that the distribution of automated scores does not drift away from that of human scores. This serves as a third acceptance criterion.
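The two remaining criteria reduce to simple arithmetic, sketched below. Degradation follows the quoted definition directly. For the standardized mean difference, the sketch assumes the difference in means divided by the root mean of the two variances; the text does not give the formula, so this denominator is an assumption, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def degradation(human_human: float, automated_human: float) -> float:
    """How much lower automated-human agreement is than human-human agreement."""
    return human_human - automated_human


def standardized_mean_difference(automated, human) -> float:
    """Mean difference in pooled-SD units (ASSUMPTION: root mean of the two variances)."""
    pooled_sd = sqrt((stdev(automated) ** 2 + stdev(human) ** 2) / 2)
    return (mean(automated) - mean(human)) / pooled_sd


def meets_criteria(degradation_kappa, degradation_r, smd,
                   max_degradation=0.10, max_smd=0.15) -> bool:
    """Check the .10 degradation and .15 standardized-difference thresholds."""
    return (degradation_kappa <= max_degradation
            and degradation_r <= max_degradation
            and abs(smd) <= max_smd)
```

As a usage example, a human–human correlation of .901 and a human–automated correlation of .838 give `degradation(0.901, 0.838)`, which is approximately .063 and therefore within the .10 limit.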
Writing Tasks and Prompts
To date, an overwhelming majority of the writing prompts supplied by examiners and existing AWE systems are specific and narrow (topic-specific or test-constrained), such as “Do you agree or disagree with the following statement …,” or “For this part, you will write a passage entitled … ”. In addition, most of the literature has studied the agreement between automated and human scores using specific or constrained writing prompts, while findings on human–machine agreement for open and broad writing tasks are scarce. An open writing task features a variety of stimuli within a rather broad theme on which each student can comment. Such tasks matter because the assessment of writing quality aims not simply to ask students to write on a specific topic but also to examine their ability to write in a broader and less constrained way, and different discourse modes involve different reasoning demands such as spatial reasoning, causal reasoning, and intentional reasoning (Robinson, 2011; Yang, Lu, & Weigle, 2015). For instance, given the same open theme, one student may explicitly express a point of view, while another may approach it from a narrative perspective. Whatever the discourse mode may be, students can write to the best of their ability. To evaluate the cross-task performance of iWrite and address Research Questions 1 and 2, this study uses two writing tasks: one is prompt-specific (Task 1), and the other is an open writing prompt (Task 2) that requires students to freely construct responses with their own thoughts and ideas.
Task 1
For Task 1, one group of 217 participants were assigned to write a composition with at least 120 words using two expository prompts of similar design (with 100 participants for Prompt 1 and 117 participants for Prompt 2) selected from iWrite library of prompts. The word limit is also based on the requirement of CET 4 Writing section in which the minimum is 120 words.
Task 2
For Task 2, another group of 115 participants were asked to write to an open theme (Prompt 5), which is adapted from the National Assessment of Educational Progress (NAEP) report on “Online Assessment in Mathematics and Writing: Reports from the NAEP Technology-Based Assessment Project” (Sandene et al., 2005): A novel written in the 1950s describes a world where people are not allowed to read books. A small group of people who want to save books memorize them so that the books won’t be forgotten. For example, an old man who has memorized the novel The Call of the Wild helps a young boy memorize it by reciting the story to him. In this way, the book is saved for the future. If you were told that you could save just one book for future generations, which book would you choose? Write an essay in which you discuss which book you would choose to save for future generations and what it is about the book that makes it important to save. Be sure to discuss in detail why the book is important to you and why it would be important to future generations.
Task 3
To examine the consistency of scores generated by iWrite for the same participant across similar types of writing design, 2 weeks later, a group of the same 100 participants (who were previously assigned to write on Prompt 1) were asked to write an argumentative composition (Prompt 3) and another 2 weeks later, a narrative (Prompt 4). The results relating to Prompt 3 and Prompt 4 will address Research Question 3.
Selected prompts from TOEFL11 corpus
To further examine the performance of iWrite on high-stakes writings compared with human expert raters, this study also uses sample essays written by Chinese L2 learners collected from the ETS Corpus of Non-Native Written English (TOEFL11 corpus; Blanchard, Tetreault, Higgins, Cahill, & Chodorow, 2014), which includes 12,100 essays written by TOEFL test takers in 11 non-English native languages (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). TOEFL11 corpus comprises 1,100 essays for each language evenly sampled from the independent writing tasks of 8 argumentative prompts, along with human scoring levels for each sample essay. As the leading English-language test in the world, TOEFL test is universally accepted and recognized for its reliability in terms of human expert judgment of writing quality.
Based on the ETS Research Report (ETS RR-13-24; Blanchard et al., 2014), all essays in the TOEFL11 corpus were first manually rated by ETS-trained human raters on a 5-point scale and later collapsed into a 3-point scale: Group low (scores between 1.0 and 2.0), Group medium (scores between 2.5 and 3.5), and Group high (scores between 4.0 and 5.0). In this study, we randomly selected two of the eight prompts (labeled as Prompt 6 and Prompt 7 in this study) and used iWrite to score the sample essays on a 5-point scoring scale. To conform to the grouping in the TOEFL11 corpus, the automated scores on Prompts 6 and 7 were also collapsed into three groups (Group low, medium, and high).
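The collapsing step can be sketched as a simple mapping. The three groups and their ranges come from the description above; what the text does not specify is how an automated score that falls between the listed ranges (for example 2.3 or 3.7) is grouped. The cut points 2.25 and 3.75 used here are therefore an assumption (each score goes to the nearest half-point band), and `collapse_score` is a hypothetical name.

```python
def collapse_score(score: float) -> str:
    """Collapse a 5-point score into the low / medium / high groups.

    ASSUMPTION: scores between the listed ranges are assigned to the nearest
    half-point band, hence the cut points 2.25 and 3.75.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError("score must be between 1.0 and 5.0")
    if score < 2.25:
        return "low"      # covers 1.0 to 2.0
    if score < 3.75:
        return "medium"   # covers 2.5 to 3.5
    return "high"         # covers 4.0 to 5.0
```

Once both human and automated scores are expressed as these three labels, they can be compared with an agreement statistic such as kappa.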
Table 2. Summary of Prompts.
Results
Table 3. Descriptive Statistics of All Scores.
Note. M = mean; SD = standard deviation.
Table 4. Interrater Agreement of Human Raters.
Table 5. Agreement of Human Scores With iWrite Scores (Rounded).
Table 6. Agreement of Human Scores With iWrite Scores (Unrounded).
Table 7. Degradation and Standardized Mean Score Difference Between Human and iWrite Scores.
As shown in the tables, both exact agreement and exact-plus-adjacent agreement in Table 5 are extremely low, at 9% and 34%, respectively. Although the two prompts are of similar design (expository writing), Prompt 2 shows consistently higher values on both percentages. Correlation analysis suggests that there is no linear correlation between iWrite and human scores, which is a clear sign of poor automated scoring quality. Weighted kappa for iWrite–human agreement over the total samples is negative, which implies that agreement between the two ways of scoring is below chance level. Table 6 also shows no correlation between automated and human scoring.
As can be seen in Table 7, the difference between iWrite–human scoring agreement and Rater 1–Rater 2 agreement—degradation—fails to satisfy the threshold in terms of both weighted kappa and correlation, which are .686 and .737, respectively. As for the standardized mean score difference, it apparently exceeds the allowable threshold .15. The results imply that under the prompt-specific model, the comparison of iWrite–human scores has unacceptable degradation and standardized mean score difference.
Table 8. Number of Essays for Each Score Level.
Table 9. Total Distribution of Each Score Level for iWrite and TOEFL11 Corpus.
Table 10. Consistency of Scores Between iWrite and TOEFL11 Corpus.
As can be seen in Table 10, according to the interpretation of kappa by Landis and Koch (1977), there is slight agreement between iWrite and human experts on the essays for Prompt 7 (kappa between 0 and .20) and fair agreement for Prompt 6 (kappa between .21 and .40); for the total essays, kappa = 0, which is indicative of chance agreement. Taken together, the three values, all below .40, suggest poor agreement (Fleiss, Levin, & Paik, 2003) between iWrite and ETS scoring.
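For reference, the Landis and Koch (1977) bands used in this interpretation can be written as a small lookup. The band labels follow their published scale; the function name `landis_koch_label` is illustrative.

```python
def landis_koch_label(kappa: float) -> str:
    """Return the Landis and Koch (1977) strength-of-agreement label for a kappa value."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

Note that a kappa of exactly 0 sits at the bottom edge of the "slight" band and corresponds to agreement no better than chance.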
Generalization of iWrite Scores Across Different Writing Tasks
For Prompt 5 (Task 2), analysis shows that exact agreement, Pearson correlation, and weighted kappa between the two raters are 92%, .901 (p < .001), and .791, respectively (M = 10.84/10.71 and SD = 1.899/1.82 for Rater 1/Rater 2), demonstrating excellent interrater agreement.
Table 11. Relation of Human and iWrite Scores for p5.
As can be seen in Table 11, weighted kappa is .136, indicating slight agreement between human and iWrite scores. There is a significant positive correlation between human and iWrite scores (r = .838). Two of the three values exceed their allowable thresholds (degradation in weighted kappa exceeds .10, and the standardized mean difference exceeds .15), while degradation in Pearson correlation (.063) is below the predefined threshold of .10.
An independent-samples t test was conducted to compare the mean automated scores between Task 1 (Prompt 2) and Task 2 (Prompt 5). By comparing the means of two unrelated groups, we investigate whether the writing quality of the first-year undergraduates shows a statistically significant difference by discourse mode, that is, whether iWrite is consistent in scoring different writing tasks. Results show that participants for Prompt 5 had significantly higher mean scores (11.29 ± 0.105) than those for Prompt 2 (9.90 ± 0.084), t(230) = −10.353, p < .001 (sig. value = .038).
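The test statistic for such a comparison can be sketched as follows. The sketch assumes the pooled-variance (Student) form of the independent-samples t test, which matches the reported degrees of freedom (117 + 115 − 2 = 230); whether the authors' software used exactly this variant is not stated. Only the t value and degrees of freedom are computed; obtaining a p value requires the t distribution, for example via `scipy.stats.ttest_ind`.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_test(group_a, group_b):
    """Student's t for two independent groups (ASSUMPTION: equal variances pooled)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_value = (mean(group_a) - mean(group_b)) / standard_error
    return t_value, df
```

With the Prompt 2 scores passed first and the Prompt 5 scores second, a negative t value indicates that the Prompt 5 mean is the higher of the two.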
Consistency of iWrite Scores for the Same Participants
To further explore the consistency of scores generated by iWrite for the same participants across different time points for similar types of writing prompts, a one-way repeated measures analysis of variance was conducted to compare the mean scores at the three time points of measurement. Results reveal that the sphericity assumption is violated; with a Greenhouse–Geisser correction, the analysis indicates significant differences in the mean scores between time points, F(1.845, 180.837) = 124.028, p < .0005, with a relatively large effect size (partial η2 = .559). Post hoc comparisons with Bonferroni correction reveal that mean scores generated by iWrite rise significantly from Prompt 1 to Prompt 3 and decrease slightly from Prompt 3 to Prompt 4 (8.23 ± 0.78 points vs. 9.95 ± 0.78 points, and 9.95 ± 0.78 vs. 9.90 ± 1.07 points, respectively), both of which are statistically significant (p < .001). Therefore, it is concluded that mean scores assigned by iWrite for the same participants display a statistically significant increase after 2 weeks and a slight reduction after another 2 weeks.
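The reported effect size can be checked from the F statistic and its degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below also recovers the Greenhouse–Geisser epsilon implied by the corrected degrees of freedom for three time points. The function names are illustrative, and no raw data are needed.

```python
def partial_eta_squared(f_value: float, df_effect: float, df_error: float) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def greenhouse_geisser_epsilon(corrected_df_effect: float, n_conditions: int) -> float:
    """Epsilon implied by a corrected effect df, given the uncorrected df of k - 1."""
    return corrected_df_effect / (n_conditions - 1)


# Values reported in the text: F(1.845, 180.837) = 124.028, three time points.
effect_size = partial_eta_squared(124.028, 1.845, 180.837)
epsilon = greenhouse_geisser_epsilon(1.845, 3)
print(round(effect_size, 3), round(epsilon, 4))
```

The computed effect size rounds to .559, in line with the reported value, and the implied epsilon is about .92.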
Discussion
To evaluate how well iWrite assesses writing quality, we addressed three research questions. For the first research question, “What is the relationship between human and iWrite scores,” the findings revealed that for low-stakes writing tests (regular classroom practices), Pearson correlation and kappa showed no agreement between iWrite and human scoring. That is, there was no significant correlation between iWrite and human scores: for similar constructs, iWrite cannot score as reliably as humans do. From a practical perspective, humans generally assigned higher scores than iWrite did. One possible explanation is that human raters based their scores on only five fixed score points, whereas iWrite assigned a full range of scores on a 15-point scale. For high-stakes tests such as TOEFL Independent writing, iWrite likewise failed to correlate significantly with human raters or to demonstrate desirable performance. Though there were some correlations at certain score levels, iWrite cannot be regarded as a valid tool to assess high-stakes writings. All in all, the results suggest that at present iWrite cannot score as accurately and reliably as human raters do.
For Research Question 2, “generalization of iWrite scores across writing tasks,” analysis revealed that for the specific and narrow task, iWrite displayed no correlation with human scores for both low-stakes and high-stakes writings. The same held when iWrite scored responses to the open and broad prompt. Thus, iWrite scores are not consistent across different writing tasks, in this study across a specific, narrow prompt and an open, broader one. However, iWrite demonstrates better performance in Task 2 than in Task 1. This inconsistency of automated scores across task types provides unfavorable evidence for the generalization inference, indicating that currently iWrite’s responses to writing tasks are not representative of all possible writing constructs and that examinees’ performances as evaluated by AES are affected by the type of writing task. Human raters, by contrast, show consistency of scores: no significant difference in scoring was observed between the two raters across tasks. This indicates a substantial level of interrater agreement as well.
Taken together, task-related differences had no impact on the scoring capacity of iWrite, since iWrite lacks the capacity for accurate and reliable assessment of writing quality; thus, no generalization of automated scores across tasks can be demonstrated. The results suggest that varying scoring mechanisms are applied in iWrite and support the claim that each discourse mode calls for its own solution (Elliot, 2003). In addition, evaluation based on iWrite scores may misrepresent an examinee’s writing quality, and iWrite scores cannot be used to predict human scores when applied as an alternative form of writing test. Meanwhile, until the validity and accuracy of iWrite are further improved, it is advisable to employ the automated scores only as a supportive tool that assists but does not replace instructors, and no decision about an examinee’s actual writing performance should be based on them.
For Research Question 3, “consistency of automated scores for the same participants,” iWrite did not assign scores to the same examinees consistently at different time points, as there was a significant difference in mean scores over time. Specifically, after 2 weeks, mean scores for the same participants were significantly higher than at the first measurement, and the increase was maintained after another 2 weeks at the third measurement, with a slight decline in the mean scores. Because the participants received no training in writing skills within this short time span (1 month in this study), their L2 writing performance was expected to remain at the same, or at least a similar, level; no significant improvement in scores should be observed. The results indicate that under repeated measurement, iWrite cannot consistently score writings from the same examinee, and we therefore come to the same conclusion: iWrite cannot score as reliably as humans do.
Conclusion
This study presents the most thoroughly researched evaluation of the reliability and validity of scores generated by China’s AWE system iWrite. The findings suggest that the performance of iWrite is far from satisfactory and that iWrite may not be a valid tool for evaluating L2 writing proficiency in either low-stakes or high-stakes writings. As a developing AES technology, iWrite should currently be restricted to supplementing writing assessment until it can be recommended as an alternative to or substitute for human raters. Computer-assisted technology is advancing fast and becoming increasingly prevalent throughout China, which in turn has given impetus to an improved understanding of automated writing assessment. The search for better automated scoring performance will be an ongoing and long-term endeavor for vendors in China.
In the meantime, this study also calls for further improvement in the human–computer interactive activities embedded in China’s AWE systems, because a lack of human interaction has always been a drawback of AWE systems (Hamp-Lyons, 2001). For instance, such systems should offer a series of discourse templates so that students can get a basic idea and visual representation of different discourse modes; more global features of writing, such as style and organizational coherence, should contribute to automated writing assessment; and vague wording in machine feedback should be clearly articulated so that students receive well-defined responses. In addition, it is expected that the progress students make over time through systematic writing training and successive writing practice will be tracked in AWE systems. This would be of great value and benefit to L2 learners. As Vojak, Kline, Cope, McCarthey, and Kalantzis (2011) claimed, in this computer-aided learning environment, writing assessment software will make a difference if students have access to social contextuality, diversity, and multimodality.
Given the scope of this study and the unavailability of the scoring mechanisms used in iWrite, some issues have not been taken up in this study. First, the data set contains only sample writings produced by Chinese L2 learners; future studies would benefit from using iWrite to score writings from native speakers of English. Second, this study examines the relationship between automated and human scoring through holistic scores, without exploring differences between human and automated scores in specific scoring dimensions such as language and content. Therefore, the relationships between individual linguistic features judged by iWrite and indicators of writing quality are worthy of future research. Finally, fairness across subgroups, sensitivity to writing length, and the impact of age are not covered here and would contribute to a more comprehensive evaluation of the automated scores awarded by iWrite.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
