Abstract
A short training program for evaluating responses to an essay writing task consisted of scoring 20 training essays with immediate feedback about the correct score. The same scoring session also served as a certification test for trainees. Participants with little or no previous rating experience completed this session and 14 trainees who passed an accuracy threshold proceeded to score other essays. Performance of the newly-trained raters was compared to that of 16 expert raters with extensive experience in scoring responses to the writing task. Results showed that the scores from the newly-trained group of raters exhibited similar measurement properties (mean and variability of scores, reliability and various validity coefficients, and underlying factor structure) to those from the experienced group of raters. Implications for the place of initial training and screening of raters on rater performance are discussed.
Introduction
Performance assessments necessarily involve subjective judgments. This fact accords a central place to the rater and the rating process. This is especially true for essay writing assessments, which involve multifaceted skills and evaluation criteria. This inherent subjectivity can be at odds with the concept of psychometric reliability (Moss, 1994). In the context of large-scale, high-stakes writing assessments in particular, a primary goal is to ensure that raters think similarly enough about what characteristics of student responses determine their quality to achieve reasonable consistency of scores across ratings.
The attainment of this goal is a continuous challenge, as numerous studies of rater behavior have shown substantial differences in the way raters interpret scoring criteria (see, e.g., Bachman, Lynch, & Mason, 1995; Eckes, 2008; Engelhard, 1994; Engelhard & Myford, 2003; Lumley & McNamara, 1995; Weigle, 1998). As a result, the reliability of large-scale essay writing assessments is low in comparison to multiple-choice tests that require the same time (typically 30–45 minutes). Breland, Camp, Jones, Morris, and Rock (1987) conducted an extensive reliability study of essay ratings. Examinees wrote six essays in three modes of writing, and each essay was scored by three experienced raters. The alternate-form reliability estimate for a test composed of two essays (in a single mode) was .59 when each essay was scored by one rater and .70 when each essay was scored by two raters. Breland, Bridgeman, and Fowles (1999) summarized the findings of reliability studies conducted before the Breland, Camp, Jones, Morris, and Rock (1987) study, and found that the mean alternate-form reliability estimate for two double-rated essay examinations was .71. They also reported on unpublished results from the writing assessment of the Graduate Management Admissions Test (GMAT), which then consisted of two tasks. The mean reliability estimate for a two-essay, double-rated assessment was .71.
Rater training
Because rater variability is such a serious source of construct-irrelevant variance, rater training is commonly employed to limit such variation (Barrett, 2001; Elder, Knoch, Barkhuizen, & von Randow, 2005; Lumley & McNamara, 1995; Weigle, 1998, 1999). The first step in this process is to specify qualifications and characteristics of raters and recruit potential raters who meet these qualifications (Baldwin, Fowles, & Livingston, 2005). These typically include experience in observing the kind of performance being assessed and an academic or professional background.
Training of potential raters is typically interactive, allowing those trained to ask questions and get feedback on their scoring, but it may be conducted in a variety of forms: face-to-face, online, webinars, or conference calls (Baldwin, Fowles, & Livingston, 2005; McClellan, 2010; Wolfe, Matthews, & Vickers, 2010). During training, potential raters review the writing prompt, scoring rubric, prompt-specific scoring notes, and benchmark responses, and discuss these materials with an expert rater. They then practice scoring training responses that have been assigned “consensus” scores and annotations by a committee of expert raters, and the raters then receive feedback about their scores.
At the end of initial training, potential raters are typically required to pass a certification test in order to qualify for operational scoring. For example, with a six-point rating scale in the context of essay scoring, the rater may be required to assign the correct rating (exact agreement) on at least 25 of 50 responses, with no more than two ratings that differ from the correct rating by more than a point, that is, that fall outside adjacent agreement (Baldwin et al., 2005). For some large-scale essay writing assessments, raters are trained on one or more paradigmatic prompts, but are then expected to operationally score essays written to other, new prompts.
Continued training of operational raters is supported through two main processes (McClellan, 2010). First, raters are typically required to pass a “calibration” test (similar to, but shorter than, the certification test) before every scoring shift. Second, scoring leaders (experienced raters) monitor the quality of raters’ scores in different ways and discuss problems with the raters. Monitoring can be based on backscoring (scoring after the fact) some responses, seeding special “validity” responses (for which a consensus score exists, also called monitor responses) among regular operational responses, and examining discrepancies among rater scores when operational responses are scored by more than one rater.
Effect of training and experience
Although operational procedures for monitoring and supporting raters over extended rating periods are well documented (Baldwin, Fowles, & Livingston, 2005; McClellan, 2010), relatively little is known about the effectiveness of initial training and certification procedures, as well as the effect of later experience on rater quality. Although several qualitative studies found differences in the thought processes of experienced and inexperienced raters (Barkaoui, 2010; Cumming, 1990; Weigle, 1994), quantitative research found surprisingly little difference in the quality of ratings from experienced and inexperienced raters (Lim, 2011; Shohamy, Gordon, & Kraemer, 1992; Weigle, 1998).
For example, Shohamy et al. (1992) compared ratings of 10 teachers and 10 non-teachers. Five raters from each group underwent training for applying a rating scale and the other five were not trained. Inter-rater consistency coefficients within each of the four groups were all high. However, it is hard to draw conclusions from this study because the authors do not report on the consistency of each individual rater’s ratings with those of all other raters, or all teacher (experienced) raters.
Weigle (1998) compared performance of eight teaching assistants with no experience in rating with eight experienced raters. All participants completed a pre-training scoring session, followed by training and operational scoring for a composition test, and a final post-training scoring session. Results from the pre-training session indicated that as a group, inexperienced raters showed more variability in the severity of their ratings and were less consistent in their ratings than experienced raters. However, in the post-training session the two groups showed similar variability in the severity of their ratings and similar consistency in their ratings (and the entire group of raters showed improvement in both aspects of rating quality).
A different approach was taken by Lim (2011), who investigated the performance of 11 raters over months of operational rating. Some of these raters were novices at the beginning of the examined periods. Results indicated that in some cases individual novice raters were discrepant in their initial severity or less consistent in their ratings, but these effects were eliminated in subsequent months. Overall, novice and experienced raters did not differ in severity or consistency.
A related line of longitudinal research, on the effectiveness of providing individualized feedback to raters about their overall severity, consistency, and biases over an extended period of time (often based on the output of a many-faceted Rasch analysis), also found mixed results (Elder et al., 2005; Knoch, 2011; Lunt, Morton & Wigglesworth, 1994; O’Sullivan & Rignall, 2007).
Current study
The purpose of this study was to look more closely at the effect of initial training on the performance of inexperienced raters by comparing their performance after training to that of experienced raters. The training program, for a college-level writing task, was web-based and minimalistic. It consisted of making all task materials available to the trainees and testing them on a set of 20 training responses with immediate feedback on the correctness of their scores. In other words, this training program was a condensed and abridged version of typical rater training, with review, practice, and certification occurring at the same time, on a small number of responses and with no opportunity for discussion. Trainees who showed a minimal level of performance on these responses scored other responses that were the focus of analyses.
The effectiveness of this training procedure was evaluated by comparing the results of the newly-trained group to the results of an expert group of raters with extensive experience on this writing task. Results of the newly-trained and experienced raters were compared with respect to the severity of scores (the degree to which scoring standards are the same across raters), reliability of scores (the consistency of scores across raters), convergent validity (strength of relation between scores from the two groups), and predictive validity of scores (strength of relation of scores with other measures of student writing and a measure of reading comprehension and verbal reasoning skills).
Method
Writing task
Materials for this study were based on the analytical writing measure of a large-scale, college-level standardized assessment (which also includes a measure of reading comprehension and verbal reasoning ability). This assessment comprises two essay writing tasks. In the issue task, the test taker is asked to discuss and express his or her perspective on a topic of general interest (see Appendix A). In the argument task, a brief passage is presented in which the author makes a case for some course of action or interpretation of events by presenting claims backed by reasons and evidence. The test-taker’s task is to discuss the logical soundness of the author’s case by critically examining the line of reasoning and the use of evidence. Test takers have 45 minutes to complete the issue task and 30 minutes to complete the argument task. The average lengths of the essays that were analyzed in this study were 458 and 339 words for the issue and argument tasks, respectively. The two tasks are scored using holistic scoring rubrics with six points that emphasize ideas, development and organization, word choice, sentence fluency, and conventions (see Appendix A).
Materials
Operational essays written by 200 examinees who wrote in response to a single pair of argument and issue prompts were used in this study. Expert ratings were collected in a previous study (Attali, Lewis, & Steier, 2013). In this previous study, a total of 16 experienced raters evaluated each of the 200 essays of the examinees (both issue and argument). All these raters had successfully passed training for both the issue and argument tasks and had been scoring essays written to these tasks for at least nine months, accumulating thousands of scored essays. For this current study, 100 issue essays were used for the regular scoring sessions (the 200 essays were originally divided into two approximately equivalent groups of essays, and the 100 essays selected for this study constituted one of these groups) and 20 other essays were used for the training session. The 20 training essays constituted a stratified (by score) random sample from the other group of 100 essays. Since all essays were previously scored by the 16 experienced raters, a “true” score was calculated by averaging the 16 scores and scaling these true scores to have a standard deviation equal to the average rater SD across the 16 expert raters.
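The true-score construction described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say whether population or sample standard deviations were used, nor that rescaling was done about the mean of the averaged scores; both choices, the function name, and the small made-up ratings matrix are hypothetical.

```python
from statistics import mean, pstdev

def true_scores(ratings):
    """ratings[r][e] is the score rater r gave essay e (hypothetical layout)."""
    n_essays = len(ratings[0])
    # Average the raters' scores for each essay.
    essay_means = [mean(r[e] for r in ratings) for e in range(n_essays)]
    # Target spread: the average of the per-rater standard deviations.
    target_sd = mean(pstdev(r) for r in ratings)
    centre = mean(essay_means)
    current_sd = pstdev(essay_means)
    # Rescale about the mean (an assumption) so the SD equals target_sd.
    return [centre + (m - centre) * target_sd / current_sd for m in essay_means]

# Made-up data: 3 raters x 4 essays (the study used 16 raters).
ratings = [[2, 3, 4, 5], [3, 3, 4, 6], [2, 4, 5, 5]]
scaled = true_scores(ratings)
```

Averaging across raters shrinks the spread of scores, so the rescaling step restores a spread comparable to that of a single typical rater.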
For the regular scoring sessions, five batches of 20 essays each were prepared. The batches had similar distributions in terms of essay length. The order of essays in the batches was randomized but fixed across participants. The order of the batches was randomized across participants.
Both training and regular scoring were completed through a web application where participants were presented with an essay to evaluate on the left side of the screen and all the information they needed on the right side of the screen through several “tabs”: the writing prompt, the scoring guide that describes each score point (1–6), scoring notes that further interpret the scoring guide for the specific task, benchmark responses that exemplify each score point (presented as an “accordion” of stacked items, each of which can be individually expanded), and a library of the training essays that were available during the regular scoring sessions. Every mouse click on the different elements of the training materials (tabs or accordion items) was recorded.
Participants
Participants for this study were recruited from Amazon.com’s Mechanical Turk (MTurk) crowdsourcing marketplace, which allows researchers to post experiments to be completed by Amazon.com users in return for monetary compensation. This platform has seen a growing interest among researchers as a way of recruiting subjects for social-science experiments (Buhrmester, Kwang, & Gosling, 2011). There were no qualifications required to participate in the study, and 48 MTurk workers completed the training session. This session was completed in 32 minutes on average. Participants were paid $6. All participants were US residents, their first language was English, their ages varied from 21 to 56 years (M = 32, SD = 7), 42% were women, and most had at least some postsecondary education (6% were high school graduates, 17% had some college education, 10% had an associate degree, 58% had a bachelor’s degree, and 8% had a graduate degree). In comparison, raters for this writing assessment are required to have at least a bachelor’s degree. Participants were also asked about their experience in grading student work; 46% reported none or very little grading experience, 42% some experience, and 12% extensive experience. Those that indicated they had some or extensive experience were asked to describe their experience. These descriptions revealed that none of the participants had any substantial experience in grading. All of them served as either teacher assistants in elementary and high school or as teaching assistants in college. They reported grading dozens to hundreds of student responses in a variety of topics (including mathematics, chemistry, and psychology) with no formal training, certification, or feedback procedures, and in most cases this experience had occurred at least three years before.
Out of the 48 training participants, 18 passed the minimum performance threshold (explained below) and were invited to participate in five additional scoring sessions. These participants had a mean age of 30 years, 15 of them were male, 13 of them indicated some or extensive grading experience, and 14 of them had a bachelor’s or graduate degree. Out of these 18 participants, 14 completed all five sessions on the same or the following day, and analyses focused on these participants. These 14 participants had a mean age of 30 years, 11 of them were male, 10 of them indicated some or extensive grading experience, and 11 of them had a bachelor’s or graduate degree. Regular (post-training) sessions by these 14 participants were completed in 28 minutes on average, and participants were paid another $6 per session. Overall, these 14 participants completed six sessions (including training) in an average of three hours and were paid $36, with an average hourly rate of $12/hour.
Procedures
Participants in the training session were first asked to read carefully all task materials and then were presented with the training essays one at a time. Analysis of the log files indicated that participants spent very little time (less than five minutes) reviewing task materials before starting to read and score training essays. Further analyses of the initial time spent reading task materials and the number of actions (mouse clicks) navigating the training materials (every click was recorded in the system) indicated no relation between these indicators and success during training. After assigning a score to each training essay, participants received immediate feedback about the correctness of the score they assigned. In addition, they received overall feedback about their performance by way of points for each response: three points for an exact match between the assigned score and the true score and one point for a one-point discrepancy. In order to pass the training successfully, they had to receive at least 38 points. The minimum level of performance for achieving this threshold was 10 exact agreements (50%) and eight adjacent agreements, which roughly corresponds to the minimum level of performance expected in certification tests (Baldwin, Fowles, & Livingston, 2005). Participants were only told they had to accumulate a minimum number of points to proceed to regular scoring, but were not told what the threshold was.
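The point rule and pass threshold described above can be illustrated with a short sketch. It assumes that true scores are expressed as whole score points on the 1–6 scale when compared with assigned scores; the source does not state how the scaled true scores were mapped to score points. The function name and the example trainee are hypothetical.

```python
def training_points(assigned, true):
    """Three points for an exact match, one point for a one-point discrepancy."""
    points = 0
    for a, t in zip(assigned, true):
        diff = abs(a - t)
        if diff == 0:
            points += 3
        elif diff == 1:
            points += 1
    return points

PASS_THRESHOLD = 38  # minimum number of points stated in the text

# Hypothetical trainee on 20 essays: 10 exact, 8 adjacent, 2 off by two points.
true = [4] * 20
assigned = [4] * 10 + [5] * 8 + [2] * 2
score = training_points(assigned, true)
passed = score >= PASS_THRESHOLD
```

With 10 exact and eight adjacent agreements the trainee earns 10 × 3 + 8 × 1 = 38 points, which is exactly the minimum passing performance given in the text; the maximum possible on 20 essays is 60.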
Following the training session, participants were told if they successfully passed training and if so were invited for five additional scoring sessions during which they did not receive feedback. All scoring was completed within two days (no more than 30 hours passed between training and the last scoring session).
Results
Number of points during training
The number of points accumulated in training ranged from 19 to 48 with a median of 35, and 18 participants received 38 points or higher. In comparison, the number of points that the 16 experienced raters earned on the same 20 essays ranged from 40 to 56 with a median of 48 (note, however, that the true scores that were the basis for awarding points were determined by the 16 experienced rater scores). In other words, the number of points that the 18 selected participants earned roughly corresponds to the bottom half of the point distribution for the experienced raters. Also note that the median number of points accumulated by the 18 participants who passed certification was 42, which was the same as the median number of points for the 14 participants who completed the study.
The median number of points across the self-reported grading experience was 34 for no experience (N = 9), 35 for very little experience (N = 13), 37 for some experience (N = 20), and 41 for extensive experience (N = 6). Although this pattern corresponds to intuition, a Kruskal–Wallis test failed to reveal a significant effect of experience on number of points, χ2(3) = 5.4, p = .14.
Measurement properties of scores during regular scoring sessions
Table 1 presents a summary of some measurement properties for newly-trained and experienced raters. For each of these measures, a Mann–Whitney U two-sample test was performed to evaluate possible differences between the two groups of raters. With respect to the average of scores, no difference was found between the groups (U = 77.5 for the newly-trained group, p = .16). However, note that the variability of average scores across raters is smaller for the newly-trained raters (SD = .23) than for the experienced raters (SD = .30), suggesting a smaller rater effect on scores for the newly-trained raters (see also generalizability analyses below). This difference is even more noteworthy since the standard deviation of rater scores (second row in Table 1) is significantly larger for newly-trained raters (U = 33, p < .01). In other words, newly-trained raters are less variable in their average scores at the same time that their individual scores are more variable. Insufficient variability of scores is a frequent problem in constructed-response (CR) scoring: raters often do not use the entire scoring scale. In this case, the overall distribution of scores in the experienced group was 1%, 9%, 30%, 42%, 17%, and 2% for scores of 1–6, respectively, while the overall distribution of scores in the newly-trained group was 2%, 9%, 23%, 38%, 25%, and 3% for scores of 1–6, respectively.
Table 1. Psychometric properties of essay scores.
Note: Number of essays is 100. Significance flagged in the table: p < .01, two-tailed, Mann–Whitney U two-sample test.
The bottom part of Table 1 summarizes correlations of scores for newly-trained and experienced raters with several validity measures. The first of these is the true score of the essays that were evaluated. This measure is somewhat biased in favor of the experienced raters, since each experienced rater’s own scores contributed to it (as one of the 16 ratings that were averaged). With that caveat, the median correlation with true scores was higher in the experienced group (.88) than in the newly-trained group (.77), U = 178, p < .01.
The second validity measure compared was the true score on the other (argument) essay that examinees wrote. As a different writing sample completed at the same time as the essay that is the focus of these analyses, the scores on this other writing sample represent an optimal concurrent validity measure. For this measure, no differences were found between the groups (U = 137, p = .31).
The scores on the measure of reading comprehension and verbal reasoning ability were used as a third validity measure, representing a related (but different) verbal ability. For this measure too, no differences were found between the groups (U = 135, p = .36).
Finally, whereas the first three measures were convergent validity measures (with an expectation of high correlations), association with essay length can be considered as a discriminant validity measure. On the one hand, quality and quantity of writing are naturally associated (Powers, 2005). However, it is easy for a less qualified or motivated rater to fall back on essay length as the primary criterion for “evaluation.” It is surprising to note that newly-trained raters showed lower correlations with essay length than experienced raters, U = 161, p < .01.
To summarize, newly-trained and experienced raters were quite similar in their score distributions and correlations with validity measures. Significant differences were found for the variability of scores, with newly-trained raters more variable in their individual scores and at the same time less variable in their average scores; for correlations with true scores, with lower median correlations for newly-trained raters; and for correlations with essay length, with lower median correlations for newly-trained raters.
It is interesting to note that although the selection of newly-trained raters, which was based on the number of points in training, was stringent (fewer than 40% were selected to continue), a strong association between the number of points in training and the correlation with true scores was found (r = .73, p < .01). That is, in spite of the considerable restriction of range for the number of points, success in training was still a viable predictor of success in regular scoring. For example, the median correlation with true scores for the seven raters with the lowest number of points was .74, and the median correlation with true scores for the seven raters with the highest number of points was .82.
Generalizability analysis
To investigate the reliability of scores for the newly-trained and experienced groups, generalizability analyses with raters as a single facet were conducted for each group. Table 2 shows that the relative size of the rater effect is smaller for the newly-trained group (5% versus 10% of total variance), whereas residual variance is relatively larger for the newly-trained group (34% versus 27%). These two differences nearly cancel each other as evidenced by the similarity of the relative size of essay variance (62% versus 64%) and Phi (dependability) coefficients of the two groups.
Table 2. Variance components and Phi coefficients.
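To make the trade-off between rater and residual variance concrete, the sketch below applies the standard dependability formula for a one-facet crossed design (essays by raters), in which both rater and residual variance count as error for absolute decisions. The inputs are the rounded variance percentages quoted in the text, so the outputs are approximate and need not match the values in Table 2; the function name is hypothetical.

```python
def phi_coefficient(var_essay, var_rater, var_residual, n_raters=1):
    """Dependability (Phi) for a one-facet crossed essays-by-raters design."""
    # Absolute error variance shrinks as scores are averaged over more raters.
    absolute_error = (var_rater + var_residual) / n_raters
    return var_essay / (var_essay + absolute_error)

# Rounded variance percentages from the text: (essay, rater, residual).
newly_trained = (62, 5, 34)
experienced = (64, 10, 27)

for n in (1, 2):
    print(n,
          round(phi_coefficient(*newly_trained, n_raters=n), 2),
          round(phi_coefficient(*experienced, n_raters=n), 2))
```

Because Phi depends only on the sum of rater and residual variance, a smaller rater effect offset by a larger residual leaves the coefficient nearly unchanged, which is the pattern described for the two groups.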
Confirmatory factor analysis
To investigate further the measurement properties of scores for the newly-trained and experienced groups, a confirmatory factor analysis was conducted on the raters’ scores. In these analyses, essays were observations (with a sample size of 100) and individual raters were indicators (with sample sizes of 16 and 14 in the experienced and newly-trained groups, respectively). Two models were compared: a single-factor model where all raters are hypothesized to measure the same underlying construct, and a two-factor method model where newly-trained and experienced raters are hypothesized to measure separate (although correlated) underlying constructs. Analyses were performed with the R lavaan package (Rosseel, 2012) using maximum likelihood estimation. The comparative fit index (CFI) and root mean square error of approximation (RMSEA) were used for overall model fit. Common rules of thumb were used in appraising the measures (Hoyle & Panter, 1995): .90 or more for CFI and .05 or less for RMSEA.
For the one-factor model, CFI was .904, RMSEA was .091, and χ2(405) = 741.40. For the two-factor model, CFI was .909, RMSEA was .089, and χ2(404) = 724.33. The one-factor and two-factor solutions showed reasonable and very similar fit. Although the two-factor solution showed slightly better fit (with significant χ2 differences), these differences in fit are very small and the correlation between the two latent factors, .98, is extremely high, indicating that the two groups of raters measure the same construct.
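The fit statistics reported above can be checked with a few lines of arithmetic. The sketch below assumes a plain (unscaled) chi-square difference test between the nested models, with 405 − 404 = 1 degree of freedom, and one common form of RMSEA that uses N rather than N − 1 in the denominator; with N = 100 essays this form reproduces the reported RMSEA values to three decimals. The function names are hypothetical.

```python
import math

def chi2_difference_df1(chi2_restricted, chi2_full):
    """Chi-square difference test for nested models differing by 1 df."""
    delta = chi2_restricted - chi2_full
    # For 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(delta / 2))
    return delta, p_value

def rmsea(chi2, df, n):
    """RMSEA using N in the denominator (an assumption; some software uses N - 1)."""
    return math.sqrt(max(chi2 - df, 0) / (df * n))

# Values reported in the text for the one- and two-factor models.
delta, p = chi2_difference_df1(741.40, 724.33)
rmsea_one = rmsea(741.40, 405, 100)
rmsea_two = rmsea(724.33, 404, 100)
```

The difference of about 17 chi-square units on one degree of freedom is statistically significant, yet the two RMSEA values differ only in the third decimal, consistent with the conclusion that the gain in fit is small.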
Discussion
This study found small differences in performance between a newly-trained group of raters and a highly experienced group of raters. These results suggest that rater performance is less influenced by actual experience in rating responses (the key difference between the groups) and more by learning that occurs during initial training and abilities that are acquired prior to training.
Moreover, training for this study, although modeled on operational practices, was particularly short (32 minutes on average), was mostly spent on scoring student responses, did not include discussions of the student task, scoring rubric, and benchmark responses, and was not facilitated by an expert rater. Therefore, it is unlikely that participants learned much if anything about the scoring guide – the definition and interpretation of the different aspects of good writing that are expected in this particular task. Trainees had to rely on their existing competencies as readers and writers in order to evaluate the essays. In particular, they must have applied their own idiosyncratic interpretations of what constitutes good writing.
What could they have learned in such a short time that enabled them to perform so well? One possibility is that nothing of value was learned during training, and that successful trainees simply applied previous knowledge and abilities during both training and regular scoring. Although this possibility cannot be ruled out in this study, it is not supported by previous research (e.g., Barrett, 2001). For example, Carlton, Diederich, and French (1961) asked 53 readers representing several professional fields to evaluate essays on a nine-point scale without any training. They found that none of the 300 essays received fewer than five of the nine possible ratings and 94% received at least seven different ratings.
In comparison, in the present study the maximum range of scores for an essay was three adjacent ratings (23 of the essays received the same score from all 14 novice raters, 57 received two adjacent scores, and 20 received three adjacent scores). This suggests that, through feedback about true scores during training, participants were able to align their relative evaluations of the writing samples with the 1–6 scale the raters were asked to apply. In other words, they used the feedback from the training essays to develop boundary conditions for the different score levels or to create exemplars of the score levels.
Success in rating is not solely a matter of aligning overall scoring standards (or overall leniency) to a desired level. To be successful, raters must be sensitive to the relative merit of individual responses. As indicated by the range of points accumulated during training, there were large differences in the ability of trainees to discriminate between high and low quality essays, as defined by the professional raters. Such differences were interpreted by Carlton et al. (1961) as reflecting “schools of thought” about the general merit of essays. Whether reflecting differences of interpretation or general competencies as readers and writers, these results underscore the importance of screening potential raters on the basis of training performance, as is done in standard practice.
Carlton et al.’s (1961) study was instrumental in the development of modern writing assessment. Its results implied that considerable training is required in order to develop among the raters a shared interpretation of what constitutes good writing and similar scoring standards. The results of this study suggest that evaluating responses with immediate feedback can quickly teach readers what scoring standards they are expected to apply, and at the same time can be used to identify readers who cannot apply the expected interpretation of merit.
In this study, fewer than 40% of trainees passed the certification test. This low rate was surely owing to the particular way in which candidates were recruited, by soliciting participants from a general crowdsourcing marketplace without any required qualifications or background credentials. Although this recruitment method was appropriate for the research purpose of contrasting naive raters with experienced ones, it is not necessarily appropriate for an operational setting. Although the research on this subject is scarce (Schoonen, Vergeer, & Eiting, 1997), it is likely that general qualifications (including relevant education and teaching experience) can have a beneficial effect on success in rating. Therefore, screening of candidates might have raised the success rates of the training program.
Furthermore, it is also possible that a more extensive training program could help more candidates learn and better apply the scoring rubrics that are expected from raters. Again, the short training program applied in this study served the research purpose but could be extended in an operational setting. For example, the short program could be used as a pre-screening tool for a more extensive program.
An interesting aspect of the training procedures applied in this study is that they de-emphasized studying task materials and example responses, and instead focused on testing with feedback. This distinction is reminiscent of the testing effect, which is the finding in memory and learning research that students who are tested on material they had initially studied learn more effectively than students who re-study the material (Glover, 1989; Roediger & Karpicke, 2006; Karpicke & Blunt, 2011). Future research should look more closely at the relative importance of study versus testing in the context of rater training.
Future research should also examine to what extent these results generalize across different prompts. In this study, raters were trained and then applied their training on a single prompt. In practice, however, raters of large-scale assessments are trained on a small number of prompts, but are expected to apply their training on other prompts.
Finally, streamlined training through testing raises other issues regarding the design of such a training environment, such as the number of training essays to administer and whether more elaborative feedback should be provided. In this study, only the correct score was provided to trainees as feedback. More extensive annotations about the response and explanations for the correct score might be more helpful for raters during the training process.
Appendix A
Acknowledgements
The author would like to thank Doug Baldwin for his thoughtful reviews and critical comments made on early versions of the manuscript. Any views expressed in this publication are the views of the author.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
