Abstract
This study applied the many-facet Rasch model to assess learners’ translation ability in an English as a foreign language context. Few attempts have been made in extant research to detect and calibrate rater severity in the domain of translation testing. To fill this research gap, the study documented the process of validating a test of Chinese-to-English sentence translation and modeled raters’ scoring propensity defined by harshness or leniency, expert/novice effects on severity, and concomitant effects on item difficulty. Two hundred twenty-five third-year Taiwanese senior high school students and six educators from tertiary and secondary educational institutions served as participants. The students’ mean age was 17.80 years (SD = 1.20, range 17–19). The exam consisted of 10 translation items adapted from two entrance exam tests. The results showed that this subjectively scored performance assessment exhibited robust unidimensionality, thus reliably measuring translation ability free from unmodeled disturbances. Furthermore, discrepancies in ratings between novice and expert raters were identified and modeled by the many-facet Rasch model. The implications for applying the many-facet Rasch model in translation tests at the tertiary level are discussed.
Introduction
Language learning and teaching ideally involve the interconnectedness of linguistic competences in four broad areas: listening, speaking, reading, and writing. Common to all of these linguistic skill areas is the cognitive process of translation which combines learners’ language knowledge and language use. According to Liao (2006), translation is an integrative cognitive process which facilitates comprehension, retention, and production of a foreign language. In fact, translation is a commonly employed approach in pedagogical and assessment practices in Taiwan (Lai, 2011; Liou, 2009; Pan, 2001; Zhou, 1996) as well as other English as a foreign language (EFL) contexts such as Japan and Korea (Benson, 2000; Li, 2006).
In spite of the longstanding use of translation tasks as an indirect assessment of a foreign language (L2) ability, formal investigation of the quality of such assessments is lacking (Arjona-Tseng, 1994; Campbell & Hale, 2003; Lai, 2011; Li, 2006), resulting in the continual use of translation tests with unknown psychometric characteristics. Maintaining objectivity in assessment of translation ability based on sentential constructed response items coincides with a variety of seemingly intractable problems when tackled with the commonly used test theories that were designed to model dichotomous-based multiple-choice question responses.
The primary concern in subjectively scored constructed responses is that the use of human raters introduces measurement variation. Research has demonstrated that raters still vary in scoring propensity despite thorough training (Lunz, Wright, & Linacre, 1990) and that inter-rater reliability issues create difficulties in maintaining measurement invariance (Braun, 1988; Cohen, 1960). These studies (Braun, 1988; Cohen, 1960) examined essay writing in first language (L1) settings, the domain where concerns over rater reliability first surfaced. In L2 learning, however, particularly in Asian contexts where translation remains valued for its insight into the cognitive processing aspects of linguistic ability, no research has examined the impact of rater variability on item scores and estimates of EFL learner ability, although small-scale research has investigated the comparative reliabilities of scoring procedures (Arjona-Tseng, 1994; Lai, 2011). Similar to the concerns pertaining to standardized tests, comparability of translation task difficulty should be established. Test designers and teachers who use these tests need to be informed of the variability in translation task difficulty and recognize that no two translation tasks may be considered equal without rigorous empirical support. Additionally, rater-induced variance must be estimated and modeled such that the property of specific objectivity in establishing estimates of learner trait-ability can be achieved.
Rasch measurement models are frequently deployed in assessment contexts as they enable objective and statistically invariant measurements of L2 trait-ability across the various aspects of the assessment process, such as raters and tasks. For this reason, a many-facet Rasch model (MFRM) was used for the assessment of sentential translation task items. The MFRM can simultaneously estimate rater severity, task difficulty, and test-taker ability and has been widely applied in performance-based assessment domains such as writing and speaking skill performance (Du & Brown, 2000; Eckes, 2005; Lynch & McNamara, 1998), medical rehabilitation assessment (Hermansson, Fisher, Bernspång, & Eliasson, 2004), and sport performance (Looney, 2004). It is hoped that the current research can serve as a prototype for educational assessments or national standardized assessments of translation ability.
The role of translation in pedagogy
The conventional view of translation in the L2 classroom treats the grammar translation method (GTM), with its focus on grammar drill exercises, and translation as equivalent practices. Since GTM has long been criticized for its inhibitory influence on communicative competence, translation has also been summarily criticized by many language educators. In fact, translation and GTM are essentially different, and both seminal and contemporary studies have argued that there is a role for translation in the foreign language classroom (Bransford, Brown, & Cocking, 2000; Catford, 1965; Coelho, 2006; Cook, 2010; Lado, 1957; Li, 2006). Translation refers to a broader linguistic process that involves a series of linguistic decoding and meaning exchange between first language and second language, as in the case of bilingualism (Catford, 1965; Cook, 2010; Liao, 2006), whereas GTM explicitly emphasizes discrete grammar rules, grammatical accuracy, and writing ability development. Therefore, it is inaccurate to simply conflate translation with GTM, and, by logical extension, the disadvantages of GTM do not apply to translation.
Ideally, the ultimate goal of L2 instruction for earnest learners is to attain bilingual ability (Sridhar & Sridhar, 1986). Essentially, it is for this reason that translation is still widely used in many contexts and classroom situations (Benson, 2000). In other words, translation is a skill that is valued as part of L2 pedagogy (Cook, 2007; House, 1977; Sridhar & Sridhar, 1986), and therefore it is a skill which must be evaluated through testing regimens (Lai, 2011; Li, 2006). Translation develops learners’ critical and independent thinking by the analysis of the two languages, which in turn clarifies language knowledge (Cook, 2010).
In essence, the first language is a tool to provide learners social and cognitive guidance in L2 learning tasks with the ultimate goal of extending inter-linguistic competence into real life situations gradually (Amirian & Abbasi, 2014; Antón & DiCamilla, 1999). Therefore, translation represents a suitable means to provide interaction and mediation between L1 and L2 in different contexts within a communicative language framework. For this reason, strengths in translation ability tend to coincide with linguistic proficiency, and translation tasks remain a staple component of L2 achievement tests, particularly in EFL contexts (Lai, 2011; Li, 2006).
Translation for L2 assessment
Translation is not only a learning strategy or a teaching technique but also a mode of testing. Over 20 years ago, Buck (1992) demonstrated that translation can be one kind of test item that is both reliable and valid. In translation items used in Taiwan, test-takers are first presented with native language sentences and then required to translate them into the target language, typically English (Liou, 2009; Pan, 2001; Zhou, 1996). Although an ideal translation should maintain fidelity to the original source language passage, even the best translator is unlikely to consistently produce output in the second language that is identical in both semantics and nuance with the first language (Cook & Bassetti, 2011). Since there may be gaps between the lexical and syntactic nuances of the respective languages, a good translator must identify the essential meaning of one language and express the concepts via another language (Brislin, 1970; Malakoff & Hakuta, 1991).
In Taiwan, translation continues to be a staple format of test items on advanced subject tests (ASTs) as well as general scholastic ability tests, two types of tests used nationwide for college entrance purposes. Translation tasks may be classified as a sub-type of constructed response items known as performance assessments (Brown & Hudson, 1998). Moreover, as performance assessments are intended to be scored by well-trained raters (Norris, Brown, Hudson, & Bonk, 2002), the question of subjectivity in scoring becomes moot. Raters must have sufficient experience to internalize the scoring rubrics in order to yield efficient and reliable human scoring. Yet, in the case of writing assessment, another form of performance evaluation, even well-trained raters exhibit notable variance due to scoring procedures and features of the writing samples such as content, organization, sentence structure, or genre (Cooper, 1984; Schoonen, 2005; Weigle, 2007). It is also known that rating experience, knowledge of the assessment process, and familiarity with the rating criteria exert considerable influence on the variability of inter-rater and intra-rater reliability (Kuiken & Vedder, 2014; Schaefer, 2008).
Specific to the field of translation testing, similar concerns are likewise raised regarding the issues of rater reliability and effects of scoring procedures. Lai (2011) focused on the comparability of scores derived from the error deduction method and those derived from various benchmarked scales. Although her results found comparability in total raw scores using classical measures of reliability, analyses of item difficulty or impact of rater factors were not included. Thus, it is clear that, similar to writing assessment, investigation of rater reliability in scoring translation test items is a critical issue.
In a similar vein, Campbell and Hale (2003) ardently assert the need for more rigor in translation assessment methodologies. Campbell and Hale decry the paucity of research into translation assessment, suggesting that the intuitive nature of translation judgment makes objectivity and reliability a problem too intractable to tackle. Their position paper calls for greater research into the scoring procedures of translation tests:
“There are no standardised [sic] interpreting aptitude tests. In spite of the advances made in language testing, little of that knowledge has been adopted by interpreter educators in the design of their testing.” (Campbell & Hale, 2003, p. 212)
The MFRM is advantageous primarily for its invariance feature. In the two-facet model commonly applied to objectively scored assessment (i.e., testee and item), invariance denotes independence of testee parameters from item parameters and vice versa. This invariance principle can be extended to the rater facets in the case of subjectively scored assessments. This enables the impact of raters’ severity to be examined and corrected even if items are rated by different individuals, meaning that the effects of rater severity and other related facets are independent of testee trait-ability and item difficulty, a feature which facilitates the expansion of subjective assessment from small-scale to large-scale endeavors. The invariance feature of the MFRM leads to a second advantage critical to establishing the reliability and validity of performance assessments, namely, monitoring of rater-by-item bias via quantitative information on rater performance, which can then be used as feedback for rater training (Stahl & Lunz, 1996). The MFRM can therefore map the ability of testees, the relative difficulty of items, and the severity of raters to form a model that estimates a test-taker’s true score from a rater with a given leniency on a given item (McNamara, 1996).
The present study aims to validate a test of Chinese-to-English sentence translation, that is, to establish its reliability and validity via the MFRM. Specifically, this research applies the MFRM to performance assessments of translation in order to establish reliability and validity for both the translation items and the raters. The results of the study can facilitate the standardization of translation-type linguistic performance assessments.
Method
Participants
Invitations were sent to 861 third-grade senior high school students at 11 schools in northern Taiwan, and 225 of them agreed to partake in the present study. The participants’ mean age was 17.8 years (SD = 1.20, range 17–19), and 52% of the sample was female. Every participant was paid 300 NT dollars if he or she completed the translation test. Likewise, invitations were also sent to 67 educators in both tertiary and secondary educational institutions. In total, 13 educators agreed to serve as raters in the study. Nevertheless, among the 13 volunteers, three educators were novice teachers. Therefore, to fulfill the requirements of the study, an equal number of expert teachers was also recruited. In sum, the panel of raters consisted of three experts who were university lecturers trained in scoring procedures and also possessed actual rating experience, whereas the three novices were graduate students undergoing pre-service teacher training to become junior and senior high teachers. The experts on average had 12 years of translation grading experience, whereas the novices did not have any prior training or experience in rating translation test items. Each of the six raters was paid 5000 NT dollars for grading the assigned translation exam papers.
Measures
The characteristics of revised version of translation items.
Note: S: subject, Vt: transitive verb, Vi: intransitive verb, O: object, SC: subject complement, OC: object complement, Pres.: present, PP: present perfective, PC: present continuous, L: word frequency level; CEEC: college entrance examination center.
Two counter-balanced test forms with different ordering of items were created and administered to the participants. The 10-question tests were administered by the participants’ teachers and completed in 50 minutes. The test form is provided in Table 8.
Data collection
Data analysis
The MFRM (Linacre, 1989) was adopted to model the rating scores, and four facets—rater severity, rater experience, item difficulty, and test-taker trait-ability—were concurrently calibrated in the model. The four-facet model was defined as follows:
\[
\ln\left(\frac{P_{nijmk}}{P_{nijm(k-1)}}\right) = B_n - D_i - C_j - E_m - F_k
\]
where
$P_{nijmk}$ = probability of student n being rated k on translation item i by rater j in experience group m (novice/expert rater),
$P_{nijm(k-1)}$ = probability of student n being rated k − 1 on translation item i by rater j in experience group m (novice/expert rater),
$B_n$ = trait-ability of student n,
$D_i$ = difficulty of translation item i,
$C_j$ = severity of rater j,
$E_m$ = severity of group membership m (novice/expert),
$F_k$ = difficulty of rating step k relative to step k − 1.
The FACETS program (Linacre, 2017) was used to calibrate test items under the partial credit model, with the item facet oriented positively so that more difficult items receive higher logit scores. Testees with less ability receive lower logit scores, as do comparatively lenient raters.
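As an illustration of how the four-facet model turns logit parameters into rating probabilities, the following is a minimal Python sketch, not the FACETS implementation. It assumes a common set of step difficulties shared across items, and all function names, variable names, and parameter values are hypothetical choices made for this example.

```python
import math

def category_probabilities(ability, item_difficulty, rater_severity,
                           group_severity, step_difficulties):
    """Return P(rating = k) for k = 0..K under a four-facet Rasch model.

    step_difficulties[k - 1] is the difficulty of moving from
    category k - 1 to category k (all values in logits).
    """
    # Net logit shared by every step: ability minus the opposing facets.
    base = ability - item_difficulty - rater_severity - group_severity
    # Cumulative sums of (base - step) give each category's log-numerator.
    log_numerators = [0.0]
    for step in step_difficulties:
        log_numerators.append(log_numerators[-1] + base - step)
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_numerators)
    weights = [math.exp(value - max_log) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]

def expected_score(probabilities):
    """Model-expected rating: sum of category value times probability."""
    return sum(k * p for k, p in enumerate(probabilities))

# Hypothetical values: one testee, one item, a lenient and a severe rater.
steps = [-1.0, 0.0, 1.0]
lenient = category_probabilities(0.5, 0.0, -0.3, 0.0, steps)
severe = category_probabilities(0.5, 0.0, 0.3, 0.0, steps)
print(round(expected_score(lenient), 2), round(expected_score(severe), 2))
```

In this sketch the same testee receives a higher expected raw rating from the lenient rater than from the severe one, which is the effect the rater facet is meant to absorb.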
FACETS returns two fit-statistic indexes, Infit and Outfit (Wright & Masters, 1982), to assess item fit. Outfit denotes the unweighted mean of the squared standardized residuals and shows the influence of outlying observations on the estimates, whereas Infit denotes the information-weighted mean-square fit, in which each squared residual is weighted by the model variance (the squared model standard deviation) of the observation, and indicates the effects of unmodeled disturbances. According to Rasch conventions, Infit and Outfit statistics equal to 1.00 indicate perfect model fit. In the present study, values greater than 1.50 or lower than 0.50 were defined as misfits (Linacre, 2017). Using the model estimates for testee-ability, item difficulty, and rater severity, group-based differences in ratings were compared by rater experience.
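The two fit indices can be sketched in a few lines of Python. This is an illustrative computation under the usual Rasch definitions, not FACETS output: the function names and the small data set are hypothetical, and the model-expected scores and model variances are assumed to have been obtained from a fitted model beforehand.

```python
def fit_mean_squares(observed, expected, variances):
    """Return (infit, outfit) mean-squares for one element of a facet.

    observed  : ratings actually awarded
    expected  : model-expected ratings for the same observations
    variances : model variances of those observations
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals,
    # so a single outlying rating can move it substantially.
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(observed)
    # Infit: squared residuals summed and divided by summed model variance,
    # i.e. an information-weighted mean-square.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

def is_misfit(mean_square, lower=0.50, upper=1.50):
    """Apply the 0.50 to 1.50 acceptance range used in the present study."""
    return mean_square < lower or mean_square > upper

# Hypothetical ratings for one rater across four responses.
infit, outfit = fit_mean_squares(
    observed=[2, 3, 1, 4],
    expected=[2.4, 2.6, 1.5, 3.2],
    variances=[0.8, 0.7, 0.6, 0.9],
)
print(round(infit, 2), round(outfit, 2), is_misfit(infit), is_misfit(outfit))
```

Values above 1.00 signal more variation than the model predicts (noise), and values below 1.00 signal less variation than predicted (over-consistency).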
Results
FACETS model summary
Obvious variations in average rater raw scores between groups were not observed, and overall raters exhibited acceptable consistency. In addition, the item-person map indicated good pairing of test item difficulty with testee-ability as shown in the FACETS Wright map in Figure 1. The Wright map in Figure 1 displays the distribution of the four facets in this study from left to right: testee trait-ability (Students), rater experience (Experiences), rater severity (Judges), and item difficulty (Task). The logit scale is shown on the far left-hand column where average ability/difficulty is set to zero. Negative values indicate low ability for testees, leniency for raters, and facility for items, whereas positive values indicate high ability, severity, and difficulty. The far right-hand column displays the corresponding scale of raw scores.
Wright map for students, experiences, judges, tasks, and rating scale.
Review of this map provides a broad overview of test characteristics. Testee abilities were shown to be approximately normally distributed, with the bulk of testees situated within the +1 to −1 logit range, while a slight tail of six testees fell near extremely low ability (< −2 logits). Conversely, two individuals showed extremely high ability being located near +2 logits. Approximately half of the testees possessed trait-ability sufficient to complete all items on the test. Test items were also reasonably balanced between difficult and easy items and fell within the range of −0.50 to +0.50 logits. Finally, both groups of raters were approximately equal and located at average severity, indicating that the CEEC scoring protocol for this exam functioned well in terms of providing uniform guidance to raters. Although one rater (Expert 3) was comparatively more severe than the others, all raters remained at near average severity.
Measurement reports of facets
Next, logit score estimates and indices of model fit were consulted for each of the facets under investigation to verify good fit among the specific model parameters. Person estimates were reviewed first, followed by rater estimates, rater group estimates, item estimates, and then rater group-item interaction estimates.
Sample of testee measurement report (person facet).
Note: Estimate in logit scale.
Judges measurement report (rater facet).
Note: Estimate in logit scale.
As shown in Table 3, Novice 2 exhibited the greatest amount of inconsistency in scoring (Infit = 1.13, Outfit = 1.09), which means that Novice 2’s scores of testee responses display 9% more random inconsistency than the model predicts when considering the sum of squared distances between the observed scores and the model-predicted scores. When this discrepancy is weighted by the variance in Novice 2’s scoring pattern, that is, the Infit measure, the amount of noise is adjusted to 13%. Conversely, Novice 3 showed the greatest tendency towards “response set,” the habit of scoring within a narrow range, or at an overly predictable value. This is seen in the Infit and Outfit measures of 0.88, which show that Novice 3’s scoring pattern contained 12% less random variation than the MFRM predicted.
The most crucial values in Table 3, however, are the z scores of Expert 2’s Infit and Outfit measures; these values show that Expert 2, like Novice 3, has a statistically significant response set, whereas the Outfit measures of the other raters can be construed as randomly occurring in the present analysis. In practical terms, for administering performance assessments such as translation testing, a rater such as Expert 2 would benefit from more panel discussion with the other raters concerning the types of English responses that may be construed as “errors.”
Furthermore, the separation index of 1.22 and reliability of 0.60 indicated that these raters demonstrated good inter-rater reliability. The FACETS program computes reliability as a measure of variance in the sample; hence, the low values among the sample of raters mean that the raters are relatively homogeneous in scoring, a desired feature of the rater facet and an indicator of convergent validity (Wright, 1996). Although the separation index of 1.22 can be considered low, its being greater than 1.00 suggests that the six raters approximate two groups of heterogeneous judges. Likewise, the fixed-effects χ² test shows that these raters were statistically different in ratings, that is, the property of independence in the rater parameter can be established (χ² = 14.50, df = 5, p < .001), which provides divergent validity. The results of the separation index and the chi-square value concurred and pointed in the same direction, suggesting that two distinct groups (expert and novice groups) exist in the data.
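The link between the reported separation index and reliability can be checked with a short calculation. The sketch below assumes the conventional Rasch relationships, reliability = G² / (1 + G²) and strata = (4G + 1) / 3, where G is the separation index. These formulas are not stated in the text above and are used here as an assumption about how the reported values relate to one another; the function names are hypothetical.

```python
def separation_reliability(separation):
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)

def strata(separation):
    """Number of statistically distinct levels: (4G + 1) / 3."""
    return (4 * separation + 1) / 3

rater_separation = 1.22  # value reported for the rater facet
print(round(separation_reliability(rater_separation), 2))  # 0.6
print(round(strata(rater_separation), 2))                  # 1.96
```

Under these assumed formulas, a separation of 1.22 reproduces the reported reliability of 0.60 and implies roughly two distinguishable levels of rater severity.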
Finally, to ensure construct validity of the test, a principal component analysis of residuals on the rater facet was conducted (Eckes, 2015; Wolfe & McVay, 2012). The results showed that loadings on the first principal component of the residuals were less than .40 (absolute value), suggesting random effects underlying the raters’ score residuals, that is, an absence of secondary, construct-irrelevant dimensions. In sum, this result indicated that this subjectively scored, partial credit rated performance assessment exhibits robust unidimensionality, thus reliably measuring translation ability free from unmodeled disturbances.
Raters’ experience measurement report (group facet).
Note: Estimate in logit scale.
Task measurement report (item facet).
Note: Estimate in logit scale.
Bias/interaction analysis
Expert/novice interaction with item difficulty.
Note: Bold indicates significant interaction between item difficulty and rater experience.
The practical significance of these differential estimates is that experts and novices have different points of view on the respective test items. According to Table 5, the difficulty level of item 8 is high (0.27), and novice raters showed more severe scoring than expert raters (novice severity, 0.34 > expert severity, 0.20). In contrast, the difficulty level of item 2 is low (−0.04), and novice raters were more lenient than expert raters (novice severity, −0.14 < expert severity, 0.06).
Significant rater-item interactions.
In the case of item 2, as already shown in Table 6, all expert-novice pairs show novice raters exhibiting significantly more lenient scoring. In contrast, item 1 shows Expert 1 exhibited a leniency similar to novice raters, while item 10 shows Novice 2 and Expert 2 being more lenient than Novice 1 and Expert 1. Mixed between- and within-groups differences are not disconcerting per se: It is natural that individuals would have differing opinions, but these differences highlight the necessity for panel discussions on the scoring criteria during the scoring procedures, particularly if performance assessments are to be used in high-stakes examinations.
Discussion
The primary contribution of the MFRM to the development of subjectively graded performance assessments such as translation is that variance in rater severity can be controlled via detection, correction, and application of unbiased ratings to item evaluation. As noted in the literature on translation assessment, minimal effort has been devoted to detecting and cataloguing rater severity in currently prevailing testing practices. As a result, the reliability and concomitant validity of exams such as the English portions of AST and GST college entrance exams remain unknown. Raymond, Webb, and Houston (1991) noted that controlling rater severity via a pre-rating training program is frequently ineffective when used as the sole option. The present study corroborates this observation with the finding that individual raters indeed differ significantly if not provided continual opportunities for discussion of rating criteria.
In addition, the influence of expert/novice group membership on ratings for subjectively scored performance items stands as a perpetual issue to both researchers and test administrators. Extensive studies of expertise in professional studies suggest that experts and novices comprise distinct ability groups (Alexander & Judy, 1988; Chi, Glaser, & Rees, 1982; Govaerts, Schuwirth, Van der Vleuten, & Muijtjens, 2011; Scribner, 1985). In the case of performance assessments, variance in rater abilities converts to variance in severity, which was also corroborated in the present study. Expert/novice group membership was found to exert effects on rater severity with respect to items 2 and 8. With respect to item 2, novice raters were more lenient (logit = −0.14), awarding higher ratings than expert raters (logit = 0.06). The opposite pattern was observed with respect to item 8 where novice raters were more severe (logit = 0.34) than expert raters (logit = 0.20).
In the present study, even though group interactions were observed, fortuitously, all items exhibited acceptable characteristics because of balanced panel membership, that is, the proclivity for novices to rate item 8 as low was balanced by the proclivity for leniency by experts, and vice versa for item 2. In addition, within-group inter-rater effects mitigated extreme contrasts between groups. In real high-stakes testing situations, were item characteristics to be severely influenced by group or individual rater effects, interventions such as panel discussions and training would be required. Without this option, the affected items would then have to be dropped from the exam form.
In addition to severity, scoring consistency is a critical characteristic of expert raters. In the classical test theory paradigm, consistency in rater performance was only examined by inter-rater reliability indices. In the present study, however, both inter- and intra-rating consistency and the effect of group membership on consistency were examined using the MFRM-supplied fit indices. Rater Infit or Outfit values outside the accepted range would indicate that abnormal ratings were awarded, that is, either a low-quality item received a high rating or vice versa. Overall, rater fit indices were satisfactory for all raters as individuals, and also in the group facet of experience, indicating that there were no intra-rater anomalies that require immediate intervention.
It may be recalled that, based on fit indices, Expert 2 exhibited mild “over-consistency” (Infit = 0.81, Outfit = 0.83). The MFRM is advantageous over classical reliabilities in that an individual rater’s behavior can be examined and evaluated. For example, the “over-consistency” in this professor’s ratings could be attributed to several causes, such as philosophical disagreement, insufficient training, or simply response set. In this particular case, the magnitude of over-consistency does not hinder quality of ratings from a statistical standpoint. In real high-stakes testing contexts, however, test developers may opt to convene additional training or panel discussion on scoring criteria, or in the case of extreme inconsistency, the rater may simply be dropped to avoid biasing test scores.
Similar to rater severity, the item quality was evaluated by fit statistics and overall difficulty expressed in logit scores. Logit scores for item difficulty reflect overall ratings of the quality of translation performance elicited, but the MFRM renders them independent of the unique characteristics of the rater panel. In other words, rater severity effects and other rating facets have been controlled. Meanwhile, item fit indices reflect the degree of unexpected ratings on the respective items. Therefore, item misfit would mean that either the item was poorly constructed or the item was biased by raters’ group membership, in which case the item should be revised or omitted from the test.
Simultaneous examination of all facets in the test also allowed for the location of potential bias effects. That is, it was observed that Experts and Novices tend toward disparate ratings of easy test items: Novices were more severe than Experts, which accords with their lack of experience. With respect to difficult items, if test-takers do not possess good command of the target language, high-scoring translations are less likely, which follows directly from the probability-based estimation of the Rasch model. Review of the actual test responses showed that higher ability testees still did not write well on difficult items, making obvious errors which are easy for both groups of raters to identify and tabulate. In addition, lower ability testees very often could not write a complete response, again rendering the raters’ task relatively facile; thus, Experts and Novices easily achieve grading agreement.
In contrast, greater variance in ratings becomes possible with respect to easier items because of wider variance in responses caused by more testees providing complete sentential translations. This situation necessitates greater amounts of rater knowledge and expertise to judge each response. In a follow-up interview of raters reported in Su (2015), raters revealed that the longer amount of time required for scoring resulted in less adherence to scoring criteria. As time went by, attention span decreased and raters began to depend on their “automaticity.” Novices and Experts differ in their ability to internalize the scoring criteria; hence, Experts who score automatically have a higher probability of retaining fidelity to the criteria, whereas the opposite tendency is observed for Novices, resulting in differential severity. Response set, or automatic scoring, was indicated by model overfit in Novice 3 and Expert 2. Use of the MFRM was advantageous in this respect because it can caution future test administrators to be wary of rater effects on easy items (Zhu, Ennis, & Chen, 1998). As Zhu et al. suggest, “The many-faceted Rasch model demonstrates a psychometrically appropriate technique for applying expert judgment in test development” (p. 21).
To synthesize the foregoing empirical discussions, we recap the research goals as follows. First, a simultaneous analysis of the four facets by the MFRM can uncover subtle rater characteristics by allowing for the location of potential bias effects induced by raters. Second, since a critical parameter in validation of performance tests is rater effect, the influential factors that may impact the estimation of rater effects also need to be scrutinized and modeled (Engelhard Jr., 2012). In the present study, we selected “experience of raters” as the fourth facet and modeled it along with the other three facets including “participants,” “items,” and “raters.” To elaborate, a testee rated on a translation item by two lenient or novice raters will procure an ability estimate adjusted to reflect the raters’ leniency or novice status; similarly, another testee, hypothetically of the same ability, rated on the same item by two severe raters, will obtain an ability estimate approximately equal to that of the one rated by two lenient or novice raters. This psychometric property of the MFRM makes it possible to close the gap between lenient/novice and severe/expert ratings for testees of the same ability, thus ensuring validity of the whole translation test (Eckes, 2015; Linacre, 1993). Finally, construct validity is obtained via principal component analysis of residuals on the rater facet. Thus, the manifold statistical testing inherent in the MFRM serves to enhance the validity of the whole test.
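The severity adjustment described above can be made concrete with a small numerical sketch in Python. It is a simplified illustration rather than the estimation routine used by FACETS: rater severities, item difficulty, and step difficulties are treated as already known, the experience-group term is folded into rater severity, and the ability estimate is found by bisection on the total score. All names and values are hypothetical.

```python
import math

def expected_rating(ability, item_difficulty, rater_severity, steps):
    """Model-expected rating for one testee, item, and rater (logit inputs)."""
    base = ability - item_difficulty - rater_severity
    logs = [0.0]
    for step in steps:
        logs.append(logs[-1] + base - step)
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)

def estimate_ability(total_score, item_difficulty, rater_severities, steps,
                     low=-10.0, high=10.0, tolerance=1e-9):
    """Find the ability whose expected total rating equals the observed total.

    Works only for non-extreme totals (above 0 and below the maximum).
    """
    while high - low > tolerance:
        mid = (low + high) / 2
        expected_total = sum(expected_rating(mid, item_difficulty, s, steps)
                             for s in rater_severities)
        if expected_total < total_score:
            low = mid   # expected total too small: ability must be higher
        else:
            high = mid
    return (low + high) / 2

steps = [-1.0, 0.0, 1.0]
true_ability = 0.4
lenient_panel = [-0.5, -0.5]   # two lenient raters
severe_panel = [0.5, 0.5]      # two severe raters

# Same underlying ability yields different raw totals under each panel.
total_lenient = sum(expected_rating(true_ability, 0.0, s, steps) for s in lenient_panel)
total_severe = sum(expected_rating(true_ability, 0.0, s, steps) for s in severe_panel)

ability_a = estimate_ability(total_lenient, 0.0, lenient_panel, steps)
ability_b = estimate_ability(total_severe, 0.0, severe_panel, steps)
print(round(total_lenient, 2), round(total_severe, 2))
print(round(ability_a, 3), round(ability_b, 3))
```

The raw totals differ between the two panels, yet once each panel's severity is taken into account the two ability estimates coincide, which is the sense in which the model separates testee ability from rater severity.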
Conclusion
In sum, the study sheds new light on the way in which high-stakes translation test items can be validated and analyzed. The practical significance of the study highlights the deployment of MFRM in counterbalancing potential raters’ bias in rating translation test items of different content features. Score validity of test-takers’ translation ability can be more fairly established by modeling raters’ scoring propensity operationalized as severe or lenient.
The MFRM as utilized in the present study is not without limitations. With regard to the rater characteristics, only rater experience was examined. Therefore, it remains unknown to what extent other characteristics such as raters’ gender or age would influence rating behaviors. Moreover, rating is in essence a complicated cognitive process, and there are likely to be many other psychological and situational factors which cannot be ascertained from the statistical data alone. Direct consultation of raters via interview protocols is still recommended in order to clarify individual scoring rationales and identify the processes by which scoring criteria become distorted by raters.
