Abstract
Two factors were investigated that are thought to contribute to consistency in rater scoring judgments: rater training and experience in scoring. Also considered were the relative effects of scoring rubrics and exemplars on rater performance. Experienced teachers of English (N = 20) scored recorded responses from the TOEFL iBT speaking test prior to training and in three sessions following training (100 responses for each session). Scores were analyzed using multifaceted Rasch measurement and traditional measures of rater reliability and agreement, and the frequency with which exemplar responses were viewed was measured. Prior to training, rater severity and internal consistency were already of a standard typical for operational language performance tests, but training resulted in increased inter-rater correlation and agreement as well as improved agreement with established reference scores. Additional experience gained after training appeared to have little further effect on raters’ scoring consistency, although the level of agreement with reference scores continued to increase. The most accurate raters generally reviewed exemplar responses more often and took longer to make scoring decisions compared to the least accurate raters. These results raise questions regarding the relative contribution of scoring aids such as exemplars and scoring rubrics to desirable scoring patterns.
The quality of rater scoring decisions has important consequences for test reliability and validity, and the nature of rater-associated variability in scores has been a persistent source of concern for language performance tests (Norris, Brown, Hudson, & Yoshioka, 1998). The apparent simplicity of a score belies the fact that language performance tests often measure complex phenomena and raters must make judgments on the basis of summary scoring rubrics that only loosely fit the performances to be judged (Cumming, Kantor, & Powers, 2001; Lumley, 2005). Yet, despite the challenges raters face, a variety of studies have observed that raters are capable of producing consistent scores (e.g., Eckes, 2011; McNamara, 1996). Achievement of such a feat is undoubtedly influenced by the training and experiences a rater obtains, and it is widely accepted that rater training is necessary to ensure the reliability and validity of scores produced in language performance tests (Fulcher, 2003). Nonetheless, empirical studies of the effects of training and experience have sometimes reported conflicting results, and in any case, the specific contributions of various elements of rater training remain poorly understood. A better understanding of how training and experience combine to enable reliable scoring has obvious implications for optimizing training practices and may contribute to a more thorough understanding of how scoring decisions are made. Accordingly, the goal of this study was to clarify the contributions to rater performance of training, experience, and scoring aids such as rubrics and exemplar responses.
Empirical studies generally support the contention that training and experience contribute to rater expertise. A variety of studies of rater performance have noted higher inter-rater reliability and agreement following training (Fahim & Bijani, 2011; McIntyre, 1993; Shohamy, Gordon, & Kraemer, 1992; Weigle, 1994). Novice raters in particular, or raters who are excessively severe or lenient, seem to benefit from training with scoring patterns changing to become more like their fellow raters. Within-rater scoring consistency has also been observed to increase (Weigle, 1998), and Weigle suggested that improvement in raters’ internal consistency may in fact be the primary result of rater training. Nonetheless, considerable variation in rater severity persists after training (Eckes, 2011; Lumley & McNamara, 1995) and variability in scoring criteria and decision-making processes may persist as well (Meiron, 1998; Orr, 2002; Papajohn, 2002). Retraining of established raters may also have only modest effects on scoring performance. Both Knoch (2011) and O’Sullivan and Rignall (2007) were unable to show any effect of additional training of established raters in terms of inter-rater variability, rater bias, or, in Knoch’s study, within-rater consistency. Reduction of scoring bias by providing raters with feedback has also been an area of interest, but attempts to reduce the bias among operational raters have produced mixed results; although an early study suggested that bias could be reduced (Wigglesworth, 1993), more recent studies have observed relatively little effect (Elder, Barkhuizen, Knoch, & von Randow, 2007; Knoch, Read, & von Randow, 2007).
Another factor that may contribute to scoring expertise is an individual’s experience in scoring a particular test, or their experiences of interacting with language learners more generally. Although some novice raters may show perfectly acceptable scoring patterns even before training (Weigle, 1998) or immediately after (Lim, 2011), novice raters who are overly severe or lenient or who show poor internal consistency seem to become more like their more experienced peers after a few scoring sessions (Lim, 2011), or following additional professional or personal experience with the test taker population (Bonk & Ockey, 2003). In addition, as raters gain more experience they may also become more accurate in their scores; both Furneaux and Rignall (2007) and Shaw (2002) have reported steady increases in rater agreement with established reference scores over time as raters became more familiar with the scoring system. On the other hand, a certain degree of difference in rater severity seems to persist regardless of experience (e.g., Kim, 2011; Lim, 2011; Lumley, 2005) and more experience does not necessarily correlate with better scoring performance in terms of within-rater consistency (Myford, Marr, & Linacre, 1996). Overall, it would seem that additional experience, like training, may benefit raters with relatively poor scoring performance but effects on established raters are more limited.
As can be seen, studies of the effects of training and experience have produced somewhat inconsistent results. Part of this variability is undoubtedly related to differences in rater background or abilities, but the specific training activities used may have their influence as well. Detailed descriptions of rater training protocols are not always available, but typical rater-training procedures have been described in the literature for several smaller tests (e.g., Elder et al., 2007; Lim, 2011), and for some of the larger tests brief descriptions of actual or parallel training protocols are available (e.g., Shaw, 2002; Xi & Mollaun, 2009). Rater training commonly begins with an introduction to the test as well as the procedures and criteria to be used when scoring. This is often followed by practice scoring of a set of examinee responses, with practice scores compared to previously established reference scores. In face-to-face training sessions practice scoring may be followed by discussion of the results, and may include specific instruction in how to deal with difficult cases. Where online training is used, individual written feedback may be provided which notifies raters of their specific scoring patterns and areas for improvement.
From a reliability standpoint, the goal of rater training is to reduce the differences in scores from different raters. From a validity standpoint, the goal is to lead raters to an understanding and application of the scoring criteria that accurately reflects the language abilities the test is intended to measure. How the various elements of rater training actually contribute to these goals is not fully clear. Review and discussion of the scoring rubric is assumed to help raters focus on the elements of performance targeted by the assessment (Fulcher, 2003), and while such training may guide and standardize raters’ perceptions (Weigle, 1994), variability in scoring criteria may persist following training and raters may even continue to use inappropriate criteria (Kim, 2011; Meiron, 1998; Papajohn, 2002). Raters have also been observed to give the same score for opposite reasons (Orr, 2002). Beyond the scoring rubric, it seems possible that exemplar responses or practice scoring may play an important role in mapping rater perceptions to the rating scale, at least when performance descriptors are relatively vague, as they are in many scoring rubrics. This possibility is supported by recent theories in psychology and behavioral economics which have asserted that the basic psychological process of magnitude judgment is one of comparison, where the item being judged is compared with other similar items located in the immediate environment or in memory (Laming, 1997; Stewart, Chater, & Brown, 2006).
Finally, current research in the field of rater training provides relatively little insight into exactly how training leads to rater expertise, and few studies have focused on the effects of both rater training and experience in scoring. How these factors combine to facilitate consistent and accurate scoring is of considerable practical and theoretical interest. In practical terms, understanding how various experiences contribute to consistency in scoring has obvious relevance for optimizing rater training and scoring procedures. In theoretical terms, additional empirical evidence may strengthen our understanding of the nature of the decision-making process, and how different procedures or scoring aids influence this process.
The overall goal of the study was to examine the effect of experience and training on rater scoring patterns in a speaking test context. Specifically, the study examined the following research questions.
1. In what ways do rater severity and internal consistency change with training and experience?
2. In what ways does rater accuracy, defined as agreement with previously established reference scores, change with training and experience?
3. How does raters’ use of exemplar responses change with training and experience?
Method
Participants
The participants were 20 experienced teachers of English who had not previously worked as scorers for the TOEFL iBT Speaking Test. All raters had two years or more of experience in teaching English (average 7.3 years). Fifteen of the raters had experience in teaching at the college level, while the other five had taught high school students, adult classes, or a mixture of both. All raters reported having taught learners at intermediate proficiency level and above, similar to the TOEFL candidature. Recruitment was also limited to individuals with English as their first and/or dominant language.
Six people reported having occasionally scored local speaking tests, one person had worked one day as an interlocutor/scorer in an oral test administration as part of a research study, and one participant had previously scored the Cambridge ESOL Cambridge English: First (FCE) and Cambridge English: Advanced (CAE) tests approximately one day a month, but had scored only four to five times in the previous year. So, in terms of the research questions, the participants were considered to be inexperienced in scoring speaking within the specific research context, although they had English teaching or other professional experience which may have contributed to expertise in scoring.
Materials
TOEFL iBT materials were used in the study because of the availability of a large collection of recorded responses along with established scoring materials. However, the focus of the study was a general investigation of rater expertise and this investigation does not address a specific speaking test. The study was neither designed nor intended to make specific claims regarding the reliability or validity of TOEFL iBT Speaking Test scores.
Test taker responses
Raters scored examinee responses taken from the TOEFL iBT Public Use Dataset (Educational Testing Service, 2008). For the current study, responses to items 1 and 2 (independent items) taken from the same form were used, for a total of 480 responses from 240 test takers. Item 1 directed candidates to make a recommendation, while item 2 required candidates to choose between two options and explain their choice; in both cases no additional stimulus materials were provided beyond the prompt and responses were based on candidates’ personal preferences and experiences.
Scoring rubric
Examinee responses were scored holistically using the criteria listed on the iBT speaking scoring rubric. Although only a single holistic score was awarded for each response, the rubric provides performance descriptors in three domains: delivery (pronunciation, intonation, fluency), language use (grammar and vocabulary), and topic development (detail and coherence of content; Educational Testing Service, 2004). In the operational test each item is scored using a four-point scale (1–4); this was adapted to a six-point scale (1–6) in the current study. This measure was taken to reduce the probability of chance agreement between raters and to require that finer distinctions be made between responses; it was thought that this would make the scoring task more challenging and leave room for improvement following training and experience. The operational scale used to score the iBT was adapted by adding half points to make a seven-point scale (1, 1.5, 2, 2.5, 3, 3.5, and 4), and then combining the lowest two scoring categories (owing to very few responses in the lower part of the scale) and re-labeling the result to give a scale of 1–6. Scores of 1 and 1.5 on the adapted iBT scale became 1, and scores of 2, 2.5, 3, 3.5, and 4 became 2, 3, 4, 5, and 6, respectively, in the new scale. The descriptors in the scoring rubric were unchanged, with scores of 3 and 5 in the new scale corresponding to the half point scores in the original scale. Accordingly, a score of ‘3’ in the new scale was awarded to responses that were intermediate in quality between 2 and 3 in the TOEFL iBT scale, while a score of ‘5’ was awarded to responses that were between iBT scores of 3 and 4.
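The scale adaptation described above amounts to a fixed lookup from the half-point version of the operational scale to the six-point study scale. The following Python sketch is offered only as an illustration of that mapping; the names `IBT_TO_STUDY` and `to_study_scale` are hypothetical and do not come from the study materials.

```python
# Lookup from the adapted iBT scale (1-4 in half-point steps) to the
# six-point study scale, following the mapping stated in the text.
IBT_TO_STUDY = {1.0: 1, 1.5: 1, 2.0: 2, 2.5: 3, 3.0: 4, 3.5: 5, 4.0: 6}


def to_study_scale(ibt_score):
    """Convert one adapted iBT score to the 1-6 study scale."""
    try:
        return IBT_TO_STUDY[float(ibt_score)]
    except KeyError:
        # Anything other than the seven listed values is rejected.
        raise ValueError(f"not a valid adapted iBT score: {ibt_score!r}") from None


if __name__ == "__main__":
    for score in (1, 1.5, 2, 2.5, 3, 3.5, 4):
        print(score, "->", to_study_scale(score))
```

The two half-point scores that survive the merge (2.5 and 3.5) land on the new categories 3 and 5, which is why those categories have no descriptor of their own in the unchanged rubric.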
Reference scores
The TOEFL iBT Public Use Dataset includes item-level scores awarded using the operational scale of 1–4. Additional scores implementing the 1–6 scale used in the current study were obtained from 11 operational TOEFL Speaking Test scoring leaders, with the goal of producing scores that would incorporate the revised rating scale and be relatively precise and valid measures of the speaking construct operationalized by the TOEFL iBT Speaking Test. All 480 test taker responses were scored by all 11 raters in a fully crossed design. The resulting raw scores were then processed using FACETS multi-facet Rasch measurement software (Version 3.62.0; Linacre, 2007a) to produce average scores which were adjusted for differences in rater severity (Linacre, 2007b). These scores then served as a reference for comparison to participants’ scores, with unrounded scores used to calculate correlations and rounded scores used to calculate agreement indices (Cohen’s kappa). The distribution of rounded reference scores is given in Table 1.
Table 1. Distribution of rounded reference scores (N = 400).
Data collection instruments
Data were collected using interactive Adobe Acrobat documents. Two background surveys were used to collect data on participants’ professional experience and language abilities. Orientation and rater training sessions were also conducted through interactive PDF documents, as described below. For each scoring session, a separate scoring instrument was provided for each item, along with a follow-up survey administered at the end of each session to collect rater self-perceptions of their scoring performance. The scoring instrument included a drop-down list for each response to record a score and a comment box for noting any problems while scoring. In addition, JavaScript was incorporated to collect data regarding the way the rater interacted with the form, including the following: (a) the number of times each response was played; (b) the number of times the score was modified; (c) the number of times exemplar responses were checked; and (d) timestamp data for various actions, such as when scoring was completed for each response.
Procedures
Following an orientation session, participants completed one scoring session, then a rater training session, and finally three more scoring sessions; the design is shown in Figure 1.

Figure 1. Sampling design for the study.
Orientation
At the start of the study each rater completed an individual orientation with the researcher either in a face-to-face setting (18 raters) or online via Skype (2 raters); thereafter, communication was generally conducted via email and phone as needed. The orientation started with a discussion of the tasks, schedule, time commitment, and compensation for the study, followed by the signing of a consent form. Participants then logged into their own online account and completed a short practice scoring session, which consisted of listening to 12 exemplars (two for each point of the scale) and scoring ten responses, all for a single item. The purpose of the orientation was to make sure raters were familiar with the procedures used in the study and to confirm that they had the hardware and computer skills necessary to access and use the study materials successfully.
Scoring sessions
For the scoring sessions raters worked from their own locations, downloading three Acrobat files for the session: one for item 1, another for item 2, and a short wrap-up survey. When opening a scoring session raters first reviewed the scoring rubric, except for the first scoring session when no rubric was provided. Raters were then required to listen to six exemplar responses, one for each point of the scale, in order of the lowest to the highest scores. Once the exemplars had been reviewed, the first examinee response to be scored was unlocked. Responses were re-locked once a score was confirmed, eliminating the possibility of any subsequent changes; this measure was necessary for a related study of sequence effect. Each scoring session included 100 responses in total to be scored, with an additional 20 responses used for verbal reports of rater cognition, which provided data for a related study. In addition, eight responses per prompt (16 total) were repeated across all sessions to allow for additional analyses of rater consistency over time. Each rater received the same set of responses for a given scoring session, with the presentation order of both items (A or B) and responses within each item randomized separately for each rater. The time required for each item (including recalls) was approximately 2.5 hours, with the whole scoring session taking approximately 5 hours. Raters were instructed to take breaks as needed, although they were encouraged to keep breaks short while working through an item. Rating sessions were unsupervised, although raters were asked to work in a location where they would not be disturbed, and certain aspects of rater behavior were recorded by the data collection instrument, such as the length and frequency of breaks.
Rater training
The training session began with raters reading through the scoring rubric, followed by review of a series of exemplar responses demonstrating performance at different levels, accompanied by written commentaries describing why the response merited a certain score. A total of ten exemplars were presented, representing responses from five different prompts. The exemplars and scoring rationales were taken from published materials for actual TOEFL iBT forms and from materials obtained directly from ETS. Scores for the exemplars were converted from the original TOEFL scale to the 1–6 scale used in the study.
All of the exemplars were examples of responses to TOEFL independent speaking tasks, but none addressed the two items being scored in the study, since no such exemplars were available. However, written commentaries focusing on the domain of topic development for the two items were obtained from ETS and presented to raters prior to the scoring calibration exercise. After reading the commentaries, raters listened to 12 exemplars for item 1 (two for each score level), followed by practice scoring of ten calibration responses. Upon selecting a score, the reference score for the response was provided as feedback. The same process was then repeated for item 2. The training session typically required 1.5 to 2 hours, and is summarized in Figure 2.

Figure 2. Summary of the rater training.
Analyses
Severity and consistency of scores
A total of 400 scores, 100 scores per scoring session, were collected for each of 20 raters to make a total number of 8000 scores. Multifaceted Rasch analysis was the primary tool used to quantitatively analyze the severity and consistency of scores produced by raters. The FACETS software package (Version 3.62; Linacre, 2007a) was used to examine rater severity and rater consistency, the latter as measured by model infit statistics.
Two different approaches were taken with the analysis. First, separate analyses were conducted for each scoring session, providing independent measures of rater performance for each session. Second, a combined analysis was conducted in which scores from each rater were separately entered by date, treating each rater-by-date combination as a separate individual. This procedure makes it possible to obtain separate measurements of rater performance across time, while including all measurements in a common analysis. However, this approach has the potential to violate the assumption of local independence required for Rasch analysis given that repeated measures are treated as independent observations. The practical effect of this issue can be examined using common responses repeated across sessions (John Linacre, personal communication, June 25, 2011), with the rationale being that if the logit ability measures of the repeated responses are the same for both analytical approaches then local dependence in the rater facet has no material effect on the results. Common person linking plots (Bond & Fox, 2007, p. 80) were used to plot ability measures for the repeated test takers, where person measures generated independently from each scoring session were plotted against measures obtained from the combined analysis. For all four scoring sessions, scatterplots showed that values were within a 95% confidence interval for an identity line (where y = x), indicating that the test taker ability measures derived from the two analytical approaches were equivalent within the bounds of error. Moreover, the rater severity measures themselves were essentially identical regardless of the analysis approach used; correlating the rater severity measures from the two analyses yielded a Pearson’s r value of .99. Therefore, the measures produced by the latter combined analysis are reported in the results section.
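The comparison against the identity line can be approximated numerically. The sketch below assumes, as a simplification not stated in the text, that two measures of the same test taker count as equivalent when they differ by no more than 1.96 joint standard errors; the function names and the input layout are hypothetical, and the actual control lines in the study were drawn following Bond and Fox (2007).

```python
from math import sqrt


def within_identity_bounds(measure_a, se_a, measure_b, se_b, z=1.96):
    """Return True if two logit measures differ by at most z joint standard errors.

    Assumption: the joint standard error is sqrt(se_a^2 + se_b^2), used here
    as an approximation to 95% control lines around the identity line y = x.
    """
    joint_se = sqrt(se_a ** 2 + se_b ** 2)
    return abs(measure_a - measure_b) <= z * joint_se


def share_within_bounds(pairs, z=1.96):
    """Proportion of (measure_a, se_a, measure_b, se_b) tuples inside the bounds."""
    pairs = list(pairs)
    hits = sum(within_identity_bounds(*p, z=z) for p in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Made-up measures for two repeated responses (session-level vs combined).
    toy = [(0.50, 0.30, 0.70, 0.30), (0.00, 0.10, 1.00, 0.10)]
    print(share_within_bounds(toy))
```

A proportion of 1.0 for every session would correspond to the pattern reported here, in which all repeated responses fell inside the bounds.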
In addition to the results produced by multifaceted Rasch analysis, inter-rater consistency was also examined in terms of inter-rater correlations and agreement. Within each of the four scoring sessions Pearson product–moment correlations were calculated for all possible rater pairings. Rater agreement was also examined within each session by computing the percentage of exact agreements between pairs of raters, as well as linearly weighted Fleiss’ kappa, which is an adaptation of Cohen’s kappa for use with more than two raters (Fleiss, Levin, & Paik, 2003). Calculations of Fleiss’ kappa were made using AgreeStat (Version 2011.1, Advanced Analytics, 2011).
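As an illustration of the pairwise indices described above, the following sketch computes a Pearson correlation and an exact-agreement proportion for every rater pairing and then averages them. It uses only the Python standard library; the function names and the dictionary layout are assumptions made for the example, and Fleiss’ kappa (computed with AgreeStat in the study) is not reproduced.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def exact_agreement(x, y):
    """Proportion of responses given exactly the same score by both raters."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def pairwise_summary(scores):
    """scores: dict of rater id -> scores for the same responses, same order.

    Returns (number of pairs, mean pairwise r, mean pairwise exact agreement).
    """
    pairs = list(combinations(sorted(scores), 2))
    rs = [pearson_r(scores[a], scores[b]) for a, b in pairs]
    agreements = [exact_agreement(scores[a], scores[b]) for a, b in pairs]
    return len(pairs), sum(rs) / len(rs), sum(agreements) / len(agreements)


if __name__ == "__main__":
    # Hypothetical scores from three raters on four responses.
    toy = {"r1": [1, 2, 3, 4], "r2": [1, 2, 3, 5], "r3": [2, 3, 4, 5]}
    print(pairwise_summary(toy))
```

With 20 raters there are 190 possible pairings, so each session-level mean in this design summarizes 190 correlations.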
Accuracy of scores
The accuracy of scoring was operationalized as the degree of correlation or agreement between a rater’s scores and the reference scores produced by the ETS scoring leaders. For each scoring session, a Pearson’s r value was calculated for the comparison of each rater’s scores against the corresponding reference scores. Agreement with the reference scores was examined by similarly calculating a linearly weighted Cohen’s kappa for each rater, within each session.
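A linearly weighted Cohen’s kappa between one rater and the rounded reference scores can be sketched as follows. This is a textbook formulation, with agreement weights of 1 − |i − j|/(k − 1) for categories i and j on a k-point scale, and is not the software used in the study; the function name and the default category labels are assumptions for illustration.

```python
def linear_weighted_kappa(rater, reference, categories=(1, 2, 3, 4, 5, 6)):
    """Linearly weighted Cohen's kappa for two equal-length score lists."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater)

    # Joint proportions: rows are the rater's scores, columns the reference.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater, reference):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    p_obs = p_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = 1.0 - abs(i - j) / (k - 1)  # 1 on the diagonal, 0 at the extremes
            p_obs += weight * observed[i][j]
            p_exp += weight * row[i] * col[j]

    # Undefined (division by zero) if every score falls in a single category.
    return (p_obs - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Hypothetical rater scores against hypothetical rounded reference scores.
    print(linear_weighted_kappa([1, 2, 3, 4, 5, 6], [2, 2, 3, 4, 5, 6]))
```

Linear weights give partial credit to near misses, so a rater who is often one category away from the reference score is penalized less than one who is several categories away.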
Scoring behavior
Scoring behavior was examined in terms of the number of times the exemplars were reviewed during the scoring session. Exemplar use was analyzed using a one-way repeated measures ANOVA with scoring session as the repeated factor with four levels, one for each scoring session.
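For readers unfamiliar with the repeated measures design, the sketch below computes the F ratio for a one-way repeated measures ANOVA from a raters-by-sessions table, together with an effect size taken as SS_effect / (SS_effect + SS_error). It is a minimal illustration under stated assumptions: the function names are hypothetical, the toy data are invented, and no sphericity correction (such as Greenhouse–Geisser) is applied.

```python
def rm_anova(data):
    """One-way repeated measures ANOVA.

    data: one row per rater, one column per scoring session.
    Returns the F ratio, its degrees of freedom, and
    SS_sessions / (SS_sessions + SS_error) as an effect size.
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    session_means = [sum(row[j] for row in data) / n for j in range(k)]
    rater_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_sessions = n * sum((m - grand) ** 2 for m in session_means)
    ss_raters = k * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_total - ss_sessions - ss_raters

    df_sessions, df_error = k - 1, (n - 1) * (k - 1)
    f_value = (ss_sessions / df_sessions) / (ss_error / df_error)
    eta_sq = ss_sessions / (ss_sessions + ss_error)
    return {"F": f_value, "df": (df_sessions, df_error), "eta_sq": eta_sq}


def eta_sq_from_f(f_value, df_effect, df_error):
    """Recover the same effect size from a reported F ratio and its df."""
    return f_value * df_effect / (f_value * df_effect + df_error)


if __name__ == "__main__":
    # Invented exemplar-view counts for three raters in two sessions.
    print(rm_anova([[1, 3], [2, 3], [3, 6]]))
```

With 20 raters and four sessions the degrees of freedom are (3, 57). The η2 values reported in the Results for the rater accuracy analyses are numerically consistent with this ratio; for example, `eta_sq_from_f(12.287, 3, 57)` gives approximately .393.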
Results
Effect on rater severity and consistency
Throughout the four scoring sessions, rater severity consistently spread across a band of approximately plus/minus one logit from the mean, suggesting that training and experience had little influence on the variation in rater severity at the overall group level (Table 2). This range of severity values is not unusual for language performance tests (Eckes, 2011; McNamara, 1996), but it nonetheless indicates that some raters tended to grade more severely while others graded more leniently; these differences were statistically significant (fixed effects chi-square: χ2(79) = 1663.8, p < .01). Changes in the severity of raters over time were examined using 16 responses that were repeated across all scoring sessions. No statistically significant difference between sessions was detected using a one-way repeated measures ANOVA, F(1.539,0.143) = 1.419, p = .229, η2 = .075, ω2 = .016; degrees of freedom corrected because of unequal variances using Greenhouse–Geisser estimates of sphericity, ε = .625 (Field, 2005). However, a moderate effect of .075 for eta squared was observed (Field, 2005), suggesting the significance test may have been under-powered. Additionally, the individual who was farthest away from average severity (rater 108 with a value of −1.8 logits) changed following training to become more like the rest of the group (−0.52 logits), much like Lim’s (2011) findings. A finding of particular interest was that in the first scoring session, 90% of the raters had severity measures that were within one logit of the mean. This indicates that novice raters were able to achieve levels of inter-rater consistency typical for operational language tests before they had completed the rater training, or had even seen the scoring rubric.
Table 2. Rater severity measures (logits) and fit indices (infit mean square) from multifaceted Rasch analysis.
Note: Standard error for all severity measures is 0.13.
Values show number of raters falling within these ranges, N = 20.
The severity data reported in Table 2 only indicate that raters, on aggregate, did not tend to become more or less severe over time and maintained a consistent range of variability in severity. On the other hand, when rater variability was examined in terms of pairwise inter-rater correlations there appeared to be an increase in inter-rater consistency following training (Table 3). Mean Pearson r values rose from .596 to .673 following training, and then were somewhat lower for the remaining two scoring sessions. A one-way repeated measures ANOVA (where each rater pairing was considered to be a context repeated across sessions) was conducted using correlations transformed using a Fisher z transformation (Hatch & Lazaraton, 1991), and a statistically significant difference was observed across sessions (F(3, 567) = 77.855, p < .001, η2 = .292, ω2 = .143). Pairwise comparisons of means indicated that all means were significantly different from each other at the p < .05 level (incorporating a Bonferroni adjustment for multiple contrasts) with the exception of session 3 versus session 4. The magnitude of the differences was modest, but the difference between sessions 1 and 2 amounted to an increase of 13% and constituted a relatively large effect (r = .71; Table 4). In this instance the effect size for the contrast, r, is calculated as the square root of the F value for the contrast divided by the sum of that F value and its associated degrees of freedom, and is interpreted in the same way as Pearson r (Field, 2005, p. 453).
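The two calculations mentioned in this paragraph, the Fisher z transformation and the contrast effect size r = √(F / (F + df)), can be written out in a few lines. The sketch below is illustrative only; the function names are hypothetical, and the degrees of freedom are taken to be the error degrees of freedom of the contrast.

```python
from math import atanh, sqrt, tanh


def fisher_z(r):
    """Fisher z transformation of a correlation coefficient."""
    return atanh(r)


def inverse_fisher_z(z):
    """Back-transform a Fisher z value to the correlation metric."""
    return tanh(z)


def contrast_effect_size(f_value, df_error):
    """Effect size r for a single-df contrast: sqrt(F / (F + df_error))."""
    return sqrt(f_value / (f_value + df_error))


def relative_change(before, after):
    """Proportional change from one mean to another."""
    return (after - before) / before


if __name__ == "__main__":
    print(round(relative_change(0.596, 0.673) * 100))      # percent increase in mean r
    print(round(contrast_effect_size(21.181, 19), 3))      # contrast reported for rater accuracy
```

Applied to the session 1 versus session 2 contrast reported for rater accuracy, F(1,19) = 21.181, the effect size function returns approximately .726, which matches the value given there; the change in mean pairwise correlation from .596 to .673 works out to roughly 13%.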
Table 3. Mean pairwise inter-rater correlations within scoring sessions.
Note: The N value is the number of pairwise Pearson product–moment correlations used to calculate summary statistics.
Table 4. Effect sizes for changes across sessions in inter-rater correlations.
Similarly, inter-rater agreement also increased somewhat over time (Table 5). The percentage of exact agreements (where a pair of raters gave exactly the same score to the same response) was 33.9% before training and 34.6% following training, and rose slightly in later scoring sessions, reaching a value of 38.5% in session 4. Fleiss’ kappa values also seemed to increase somewhat over time, with the exception of session 3; the general level of agreement across sessions would be considered moderate in magnitude (Landis & Koch, 1977). Unlike either severity or rater reliability, rater agreement values generally continued to increase gradually as raters gained experience in scoring, with the highest values seen in the final session.
Table 5. Agreement indices within scoring sessions.
Within-rater consistency was investigated by calculating the degree to which each rater’s scoring patterns fit the predictions of the Rasch model. Infit mean square values varied between raters and across scoring sessions, but were generally within acceptable bounds (Table 2). A total of 78 observations (97.5%) fell inside the range of 0.5 to 1.5 proposed as acceptable by Linacre (2007a), while 66 observations (82.5%) fell within the more restrictive bounds of 0.75 to 1.3 suggested by Bond and Fox (2007). Similar to the severity data, infit values observed in the first scoring session were already generally within acceptable bounds, before raters underwent training or had access to the scoring rubric. Training and experience appeared to have little effect on infit measures, much like the pattern seen for rater severity.
Effect on accuracy of scores
Although the analyses reported so far provide information regarding raters’ scoring patterns, they do not necessarily address the issue of whether the scores accurately reflect the scale used in the study. To examine this issue, scores produced by each rater in each session were compared to reference scores for the same responses using Pearson product-moment correlations (Table 6). Correlations between raters’ scores and reference scores ranged from .62 to .90, with most values falling in the range of .70 to .90. A one-way repeated measures ANOVA performed on the correlations following a Fisher z transformation detected a statistically significant difference across sessions, F(3,57) = 12.287, p < .001, η2 = .393, ω2 = .121. Pairwise contrasts indicated that values increased significantly from session 1 to session 2 (F(1,19) = 21.181, p < .001, r = .726) but no difference was seen between sessions 2 and 3, or sessions 3 and 4. Once again, rater scoring performance was reasonably good across all sessions but nonetheless improved following training. In addition, there was a general tendency for the raters with the lowest correlations to reference scores in session 1 to show the largest gain following training. Of the six raters with correlations below .70 in session 1, all but one individual showed an increase of .10 or greater after training. It should be noted that aside from training effects, regression to the mean may explain this finding. However, little evidence of regression to the mean was seen at the other end of the performance distribution, where the four raters with scoring accuracy of .8 or greater in session 1 remained at this level for the remaining sessions, except for one rater who showed a transitory drop to .78 immediately following training.
Table 6. Rater accuracy in terms of Pearson correlation and agreement (Cohen’s kappa) with reference scores.
Agreement with the reference scores also increased following training, with average values for Cohen’s kappa (linearly weighted) increasing from .47 to .52 (Table 6). This difference in Cohen’s kappa was statistically significant (one-way repeated measures ANOVA F(3,57) = 13.506, p < .001, η2 = .415, ω2 = .140), with agreement increasing significantly with each session (Table 7). Once again, individual raters showing the lowest levels of agreement with the reference scores in session 1 improved substantially after training. The four raters who had kappa values less than .40 before training improved by .18 on average, with one rater more than doubling the level of agreement, from .22 to .51. Conversely, one rater with a kappa value of .42 in session 1 actually worsened after training, dropping to .26, but individuals with the poorest agreement with the reference scores at the beginning showed the most improvement following training.
Effect sizes for changes across sessions in agreement (Cohen’s kappa) with reference scores.
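For readers less familiar with the agreement statistic, the following Python sketch computes a linearly weighted Cohen’s kappa between one rater’s scores and the reference scores. It is a minimal illustration, not the software used in the study: the function name is hypothetical and the scores in the usage example are invented.

```python
from collections import Counter


def linear_weighted_kappa(rater, reference, categories):
    """Linearly weighted Cohen's kappa for two equal-length score lists.

    Disagreement is weighted by the distance between score categories,
    so a one-point discrepancy is penalized less than a three-point one.
    """
    if len(rater) != len(reference) or not rater:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(rater)
    index = {c: i for i, c in enumerate(categories)}

    # Mean observed distance between the paired scores.
    observed = sum(abs(index[a] - index[b])
                   for a, b in zip(rater, reference)) / n

    # Mean distance expected by chance, from the two marginal distributions.
    rater_counts = Counter(rater)
    ref_counts = Counter(reference)
    expected = sum(abs(index[a] - index[b]) * rater_counts[a] * ref_counts[b]
                   for a in categories for b in categories) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when all scores fall in one category")
    return 1 - observed / expected


# Invented scores on a six-point scale (not data from the study).
scale = [1, 2, 3, 4, 5, 6]
print(linear_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], scale))  # 1.0
print(linear_weighted_kappa([3, 4, 2, 5, 3, 6], [3, 4, 3, 5, 2, 6], scale))
```

A value of 1 indicates perfect agreement with the reference scores, and a value of 0 indicates agreement no better than would be expected from the marginal score distributions alone.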
Use of exemplar responses
Given that raters showed relatively good scoring performance in session 1 with only the exemplars for reference, the frequency of exemplar use and its relationship to scores were examined further. Thirteen raters tended to view the exemplars the minimum of 12 times per scoring session. Within this group, four individuals reviewed the exemplars more than 12 times during the first session but reverted to the minimum in subsequent sessions. This pattern might be expected given that the exemplars were the only scoring aid provided in the first session. Among the seven raters who regularly checked the exemplars more than the minimum, there was considerable between- and within-rater variation that had no obvious relationship to either training or experience.
The potential for a relationship between exemplar use and scoring performance was examined by selecting the three raters who showed the highest scoring accuracy and the three with the lowest accuracy and comparing their use of the exemplars. For each scoring session, raters were ranked on the basis of their agreement with reference score values (kappa) and again on their correlation with the reference scores. These ranks were then summed across all sessions, and the three individuals with the lowest cumulative ranks (i.e., highest accuracy: raters 101, 102, and 123) and the three with the highest cumulative ranks (i.e., lowest accuracy: raters 108, 109, and 113) were used for the analysis.
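The selection procedure described in this paragraph can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually coded for the study: the handling of tied values is not described in the text, so ties are broken here by rater identifier, and the rater labels and values in the usage example are invented.

```python
def rank_descending(values):
    """Map rater id -> rank, with rank 1 assigned to the highest value.

    Ties are broken by rater id (an assumption; the text does not say).
    """
    ordered = sorted(values, key=lambda rid: (-values[rid], rid))
    return {rid: position + 1 for position, rid in enumerate(ordered)}


def cumulative_ranks(sessions):
    """Sum each rater's kappa rank and correlation rank over all sessions."""
    totals = {}
    for session in sessions:
        for measure in ("kappa", "correlation"):
            for rid, rank in rank_descending(session[measure]).items():
                totals[rid] = totals.get(rid, 0) + rank
    return totals


def select_extremes(totals, k=3):
    """Return the k raters with the lowest and the k with the highest totals."""
    ordered = sorted(totals, key=lambda rid: (totals[rid], rid))
    return ordered[:k], ordered[-k:]


# Invented values for four hypothetical raters in a single session.
sessions = [{
    "kappa": {"A": 0.60, "B": 0.40, "C": 0.50, "D": 0.30},
    "correlation": {"A": 0.90, "B": 0.70, "C": 0.80, "D": 0.65},
}]
totals = cumulative_ranks(sessions)
most_accurate, least_accurate = select_extremes(totals, k=1)
print(most_accurate, least_accurate)  # ['A'] ['D']
```

Summing ranks rather than raw values gives the two accuracy measures equal weight even though kappa and correlation fall on different numerical ranges.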
Figure 3 shows exemplar use for more- and less-accurate raters. A clear difference is visible between the more-accurate raters, who viewed the exemplars from 14 to 47 times per session, and the other group where exemplar use was close to the minimum required to unlock the responses to be scored (12 views). It was also apparent from timestamp data recorded by the scoring instrument that the more-accurate raters were checking the exemplars during the scoring session, rather than repeatedly playing the exemplars before scoring. Although these data do not prove a causal link between exemplar use and scoring patterns, it is possible that periodic review of one or more exemplar responses helped to calibrate the perceptions of the more-accurate raters.

Use of exemplars while scoring for more-proficient raters (white bars) and less-proficient raters (gray bars).
Discussion
The training procedure used in the study modified rater scoring performance in a variety of different ways. Inter-rater reliability and agreement both seemed to show modest improvement following training: an increase of .077 was seen for average inter-rater correlation and .067 for Fleiss’ kappa values, amounting to increases of 13% and 19%, respectively. These findings suggest that variability in rater judgments of speaking ability decreased somewhat following training, similar to the findings of previous studies done primarily in writing assessment contexts (Fahim & Bijani, 2011; McIntyre, 1993; Shohamy, Gordon, & Kraemer, 1992; Weigle, 1994). On the other hand, training seemed to have little effect on variability in rater severity, with the range of severity values, as measured using MFRM, consistently extending across a range of about 2 logits. This result is also in keeping with the findings of writing assessment studies where it has been demonstrated repeatedly that moderate differences in severity are durable over time (e.g., Lim, 2011; Lumley & McNamara, 1995; Weigle, 1998). Overall, the results support the conventional wisdom that training may make raters more consistent in their scoring, but does not necessarily eliminate differences in severity between raters (Eckes, 2011; Fulcher, 2003; McNamara, 1996).
Following training, little additional change in rater consistency or severity was seen, suggesting that additional scoring experience had little impact on this aspect of scoring performance. Most raters showed acceptable performance after training, so the observation of little added benefit of additional experience is consistent with previous findings that continuing experience has minimal effect on established raters (Knoch, 2011; O’Sullivan & Rignall, 2007). In addition, training appeared to have little effect on raters’ use of exemplars. Most raters reviewed the exemplars the minimum number of times required by the scoring instrument, with slightly higher frequencies observed for a few individuals in session 1 when the exemplars were the only scoring aid available. This finding is not surprising given that there was no requirement to periodically review the exemplars, and rather than being influenced by training or experience, the frequency of exemplar use seemed to be more a matter of personal style. A requirement to periodically refer to the exemplars might be worth considering, however, given the observation that more-accurate raters referred to the exemplars more often than less-accurate raters. In particular, it seems common sense to encourage raters to review the relevant exemplars when deciding difficult cases.
In contrast to the findings for consistency and severity, raters appeared to be able to better approximate the intended scoring scale as they gained experience, as indicated by improved accuracy over time. Scoring accuracy improved immediately after training, with the average correlation to reference scores increasing by .058 (8%) and average agreement (Cohen’s kappa) increasing by .05 (11%) after training. Average values continued to increase, with the highest values seen in the final scoring session. These results are similar to those reported by Furneaux and Rignall (2007) and Shaw (2002). Such findings also raise the possibility that while scoring consistency or agreement with other raters is certainly a feature of proficient raters, the ability to accurately target the intended scale of measurement might distinguish more expert raters from their peers.
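The percentages quoted for the post-training gains appear to be simple changes relative to the session 1 averages. As a check, and assuming that interpretation, the session 1 values reported elsewhere in the article (a mean correlation of .73 and a mean kappa of .47, rising to .52) give:

```latex
\[
\frac{\Delta r}{r_{\text{session 1}}} = \frac{.058}{.73} \approx .079 \approx 8\%,
\qquad
\frac{\Delta \kappa}{\kappa_{\text{session 1}}} = \frac{.52 - .47}{.47} = \frac{.05}{.47} \approx .106 \approx 11\%.
\]
```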
The finding that experience in scoring had relatively little additional impact on the consistency and severity of scoring might be the result of the relatively brief period in which data were collected (about two weeks), and additional changes in scoring patterns or rater behavior might have been observed if the study had included more scoring sessions or been extended over a longer period of time. However, it should be noted that raters scored 100 responses in each session and by the end of the study had scored over 500 responses (including scoring done in the training session), presumably enough to establish a fair degree of familiarity with the testing context. Another possible explanation is that raters were not required to score a practice or calibration set before starting the post-training sessions, and such feedback might have promoted continuing improvement in scoring performance. Along similar lines, there were no rewards or consequences for raters, which could have influenced their motivation and performance.
It is also noteworthy that, prior to training, participants had already achieved a level of scoring performance typical of raters in operational speaking tests, at least in terms of the features measured by multifaceted Rasch measurement. Moreover, average correlation with reference scores was .73, a respectable result given that (a) the raters had not yet seen the scoring criteria upon which the reference scores were based and (b) the original TOEFL iBT Speaking Test scale was expanded from four points to six points with the express purpose of making it more difficult to achieve consistent and accurate scoring. Although good scoring performance among untrained raters is not unheard of (e.g., Lim, 2011; Shohamy, Gordon, & Kraemer, 1992), the findings are rather counterintuitive given the importance usually placed on rater training for ensuring reliable scores. It remains unclear how novice raters were able to achieve this level of performance. One possibility is that the previous teaching and other experiences of the raters provided them with enough knowledge of the examinee population to be able to score in a consistent way. This explanation fits well with the approach of operationalizing rater expertise in terms of teaching experience, as has been done in some studies of language tests (e.g., Cumming, 1990; Delaruelle, 1997). A second possibility is that the testing context itself contributed to rater performance in that raters scored responses that were quite short and addressed a single type of speaking task. This situation might have made it easier for raters to quickly get a sense of the range of examinee performance and achieve consistency in distinguishing different degrees of ability.
A third possibility for the relatively good initial performance is that the exemplar responses available during the first scoring session assisted raters in consistently applying the scale. The set of exemplars was the only resource available for mapping the characteristics of test-taker responses to the appropriate score, and so the accuracy of rater judgments must have been based largely on this scoring aid. This explanation is consistent with the observation that more-accurate raters reviewed the exemplars more often than raters with relatively poor performance, as well as recent psychological theories regarding the nature of magnitude judgments. These theories contend that evaluations of magnitude are fundamentally comparisons of the thing being judged with other similar things, and predict that referring to a standard for comparison will improve the accuracy and precision of scoring decisions (Laming, 2003).
Conclusion
A number of factors were examined that may contribute to consistent and accurate scoring in a speaking assessment. The effects of training observed in this study are generally consistent with earlier studies; however, this study provides new insights into the combined contributions to scoring performance of a variety of factors. Of particular interest were findings related to the impact of exemplar responses and scoring rubrics. Relatively consistent and accurate scoring was seen at the beginning of the study when only exemplars were available for reference, but score accuracy continued to increase once the scoring rubric was available. One possible explanation for this finding is that exemplars and rubrics played different roles in the current study, with exemplars helping raters to align their perceptions to the rating scale and rubrics helping raters to direct their attention to the relevant features of performance. This idea remains speculative, however, and the specific contributions of exemplars and rubrics to desired scoring behaviors remain to be clarified in future studies.
A few limitations of the current findings should also be mentioned. First, these findings were obtained using a specific set of materials in a specific context, and the usefulness of particular scoring aids for guiding raters’ judgments undoubtedly varies with the quality and detail of the scoring materials as well as other aspects of the testing situation. Second, in the current study all participants underwent rater training, so the effect of experience in the absence of training was not investigated. Inclusion of a “no training” group might have provided more direct insight into the influence of scoring experience and the stability of rater judgments across time. Despite these limitations, the findings of the current study raise questions regarding the relative function and importance of different scoring aids. Future studies to answer these questions may lead to both improved scoring practices and a better understanding of how judgments of language proficiency are made.
Acknowledgements
This work was supported by a TOEFL Small Dissertation Grant from Educational Testing Service, a dissertation grant from the International Research Foundation for English Language Education (TIRF), and a dissertation completion fellowship from the Bilinski Educational Foundation. I thank Xiaoming Xi of Educational Testing Service (ETS) for her help in obtaining access to the TOEFL iBT Public Use Dataset and Pam Mollaun of ETS for her help in recruiting TOEFL scoring leaders to provide reference scores. I also thank the ETS scoring leaders who provided additional scores for responses from the TOEFL iBT Public Use Dataset. Although they must remain anonymous, I extend my gratitude to those individuals who participated as raters in the study.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
