Abstract
Summative assessment of interpretation is widely conducted in interpreting courses/programs to inform high-stakes decision making, such as selection, certification, and the conferral of academic degrees. Yet there has been very limited empirical research to investigate the score dependability of summative interpretation assessment. The present study therefore sets out to explore the optimal measurement design(s) for a locally created summative assessment of English/Chinese consecutive interpretation, based on multiple fully crossed generalizability studies. Major findings include the following: (a) overall the raters behaved more consistently when using the information completeness (InfoCom) scale than when using the fluency of delivery (FluDel) or target language quality (TLQual) scales; (b) the raters displayed greater variability in evaluating the Chinese-to-English interpretation than the English-to-Chinese interpretation; (c) although adding tasks worked more effectively in raising score dependability than using additional raters for the InfoCom ratings in the English-to-Chinese interpretation, the pattern was reversed for the other observations; and (d) two potentially optimal designs were identified for the English-to-Chinese direction, and one design for the other direction. These results are discussed, highlighting the complex nature of relationships among the assessment criterion, the interpreting directionality, the raters’ dominant language and score dependability, together with the need to ensure score dependability for summative interpretation assessment.
Summative assessment of student interpreters’ performance at the end of a training course/program is widely conducted in tertiary-level educational institutions (see Liu, Chang, & Wu, 2008; Sawyer, 2004; Tsagari & van Deemter, 2013). Scores obtained from summative assessment inform such decisions as determining the preparedness of students for an advanced level of study (Sawyer, 2004), granting academic credits and conferring a degree (Lee, 2008; Liu et al., 2008), as well as ascertaining the effectiveness of curriculum design (Hale & Ozolins, 2014). These score-based decisions could produce washback effects for students, teachers, curriculum developers, and other relevant stakeholders. Despite the high-stakes nature of summative assessment in interpreter training and education, little empirical research has been conducted so far to ensure the dependability of assessment outcomes.
To improve score dependability, language testers could use a large number of assessment tasks to elicit more data for better inference making, and/or employ more raters to average out inter-rater variability. However, it may become increasingly impractical to use more tasks and/or raters beyond a certain limit to pursue high levels of score dependability. Usually, an optimal measurement design, in terms of equilibrium between the number of tasks and the number of raters, is empirically explored to achieve a desirable level of score dependability, using generalizability (G) theory (e.g., Gebril, 2009; Lee, 2006; Xi, 2007). Despite the rich literature in second/foreign language testing, there seems to be little research that explores the optimal design(s) for interpretation assessment in the context of interpreter education, as can be seen in the literature review below. The research reported here, therefore, represents one of the first studies to explore the optimal design(s) to achieve sufficient score dependability for a locally created summative assessment of consecutive interpreting (CI). In particular, what distinguishes the present study from previous studies is that (a) it attempts to identify the practical measurement design(s) for an English/Chinese CI test (in which students interpret separate, standalone English and Chinese speeches), whereas previous literature primarily concerns the assessment of a monolingual skill (or skills) (e.g., speaking); and (b) it uses averaged or pooled variance component estimates, derived from multiple fully crossed G studies, in subsequent decision (D) studies to increase estimation stability and accuracy, whereas previous studies are usually based on a one-off G study, with an incomplete dataset.
Literature review
Interpreter training and education
Over the past decade, interpreter training and education has gained currency, with new programs being established and more students being enrolled, particularly in emerging markets (e.g., mainland China), although this development can be attributed to a variety of reasons. In countries of immigration such as the USA and Australia, much of the demand for interpreting services is driven by immigrants and language-minority populations, who need to access medical, legal, and other public services (Hale et al., 2012; Han & Slatyer, 2016). In other countries, notably China, the need for language/interpreting services is largely fueled by recent economic growth, surging trade and investment, as well as cultural and people-to-people exchanges with other countries (Guo, 2010). The growing market demand has led to an increased level of attention to interpreter training and education in China. At the undergraduate level, interpreting is a popular or even compulsory course for foreign-language majors. The revision of the national Test for English Majors-Band 8 (TEM-8) in 2003 to include a new component of English/Chinese interpreting is a testimony to the greater emphasis placed on interpreting teaching. Targeting the undergraduate English majors in Chinese universities, the TEM-8 interpreting test primarily assesses basic interpreting skills and provides feedback to interpreting teaching (for a review of TEM-8, see Jin & Fan, 2011), whereas other tests such as the China Accreditation Tests for Translators and Interpreters (CATTI) tend to play a role in regulating access to professional practice. At the postgraduate level, the State Council Academic Degrees Committee launched the Master of Translators and Interpreters (MTI) program in 2007. As of 2014, a total of 209 universities have been able to enroll students in the MTI program.
Interpretation assessment in the educational context
Both formative assessment and summative assessment take place in interpreter training and education. Formative assessment is generally of low stakes, and is usually represented by student-enacted self-assessment (Han & Riazi, 2018; Iaroslavschi, 2011) and/or peer assessment (Fowler, 2007; Han, 2018b); whereas summative assessment may be high-stakes, and features rater-mediated evaluation of students’ performance at the end of an instructional period (e.g., Lee, 2008; Liu et al., 2008; Sawyer, 2004).
The purposes of summative assessment, however, may differ across different interpreting programs. For example, in some cases summative assessment at the end of the first year of study is essentially eliminatory, conducted to select students for an advanced level of interpreter training in the second year of study (Sawyer, 2004); in other cases, student interpreters must pass summative/exit examinations before graduation (Liu et al., 2008); in still other cases, end-of-semester examinations can serve as a professional certification test (Lee, 2008); finally, to earn academic/course credit (Feng, 2005) or obtain proficiency certification (e.g., TEM-8) (Wang, Wang, & Zhou, 2011), undergraduate students also need to succeed in summative interpretation assessment.
Apart from the different purposes above, four specific aspects of summative assessment merit special attention, as they relate to score dependability: (a) interpreting directionality, (b) the number of assessment tasks/samples of speeches, (c) the number of raters involved in formal rating, and (d) assessment criteria. Regarding directionality, summative assessment for both under- and postgraduate interpreting programs/courses is mostly bi-directional, that is, interpreting between at least two languages (Lee, 2008; Liu et al., 2008; Sawyer, 2004). This is primarily because professional practice usually involves interpreting into and from interpreters’ mother tongue, as reported in a number of surveys (e.g., Han, 2016b). Note that in some institutions and organizations such as the European Union and the United Nations, interpreters are required to interpret into their mother tongue.
In terms of the number of interpreting tasks, Liu et al. (2008, pp. 6–7 and 24–25) reported in a cross-national review of 11 postgraduate-level interpreting programs in Taiwan, mainland China, Britain and the USA that summative/exit testing consisted of only one task/speech for each direction, although speech length varied. Overall, the duration of source-language (SL) speech samples ranges from 45 seconds to 12 minutes. Samples of SL speech may also be divided into multiple segments (usually two to four), so that students interpret segment by segment. For undergraduate-level assessment, SL speeches are generally shorter and divided into more segments. For instance, the TEM-8 English/Chinese interpreting section has one task/speech of about two to three minutes for each direction, with each speech being divided into five shorter segments (Wang et al., 2011, p. 48).
Regarding the number of raters, more raters are generally used in summative assessment for postgraduate than undergraduate programs and courses. In Liu et al.’s (2008, pp. 15–19) review, seven out of 11 postgraduate-level interpreting programs customarily use at least three raters, with some even requiring as many as five to six raters. In contrast, Feng (2005) observes that only one rater (usually the course teacher) is involved in formal rating for most undergraduate interpreting courses in mainland China.
Lastly, three general dimensions of interpretation have emerged as primary concerns in quality assessment: information completeness (InfoCom), fluency of delivery (FluDel), and target language quality (TLQual), although the detailed description of each criterion may differ (see Han, 2018a; Lee, 2008; Lee, 2015; Liu, 2013). Descriptor-based rating scales have also been developed to operationalize the three criteria (Han, 2015; Lee, 2008).
Research on interpretation assessment in the context of interpreter education
Despite the important role of summative assessment in the interpreting programs and courses, academic research is scarce in the area of developing a psychometrically sound summative test to assess interpretation. A thorough search of the two highly indexed language-testing journals, Language Testing (1984–2016) and Language Assessment Quarterly (2004–16), yielded only three publications directly related to interpretation assessment: (a) Stansfield and Hewitt’s (2005) examination of predictive validity of a screening test for court interpreters, (b) Han’s (2016a) exploration of score dependability for a simultaneous interpreting test, and (c) Zhao and Gu’s (2016) review of CATTI.
Although there is relatively more literature on interpretation testing and assessment in translation and interpreting (T&I) journals, the majority of these publications are theoretical and/or review articles, discussing such general issues as rationales for assessing interpretation, test design and delivery, and comparison of different tests (for details, see Han, 2018a). Moreover, a large portion of empirical articles in the T&I literature pertains to professional certification testing (e.g., Liu, 2013; Turner, Lai, & Huang, 2010). Of the very limited empirical articles that address educational summative assessment, most have focused on rater behavior and rating scales (Lee, 2008; Wu, 2013). It is thus fair to say that there have been very few studies conducted to optimize measurement designs for locally created summative interpretation assessment with a view to achieving desirable score dependability.
Generalizability theory applied in second/foreign language testing
In second/foreign language testing, G theory is often used to guide the development and optimization of measurement procedures (e.g., Lee & Kantor, 2007). G theory decomposes observed score variances into the true variation in the object of measurement and other variations attributable to different measurement facets of particular interest (Brennan, 2001). G theory generally consists of a G study, in which variance components (VCs) for each of the main and interaction effects are estimated for a single observation, and follow-up D studies, in which the VC estimates are used to find a certain measurement design to obtain a desirable level of score reliability. Indices of generalizability (the G coefficient or ρ²) and dependability (the Phi coefficient or Φ) can be calculated for norm- and criterion-referenced score explanation, the latter of which (i.e., the Phi coefficient) is particularly relevant to summative interpretation assessment, as students’ performance is explained against a set of predetermined standards and the primary concern is the absolute value of the scores.
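For readers less familiar with these indices, the block below sketches their standard textbook form for a two-facet fully crossed design with students (s), tasks (t), and raters (r), which is the design adopted later in this study. The notation (σ² with subscripts, n_t and n_r for the numbers of tasks and raters in a D study) is introduced here for illustration and is not taken from the original text.

```latex
\begin{align}
% Total variance of a single observed rating
\sigma^2(X_{str}) &= \sigma^2_s + \sigma^2_t + \sigma^2_r
  + \sigma^2_{st} + \sigma^2_{sr} + \sigma^2_{tr} + \sigma^2_{str,e} \\
% Relative error variance (norm-referenced decisions)
\sigma^2(\delta) &= \frac{\sigma^2_{st}}{n_t} + \frac{\sigma^2_{sr}}{n_r}
  + \frac{\sigma^2_{str,e}}{n_t n_r} \\
% Absolute error variance (criterion-referenced decisions)
\sigma^2(\Delta) &= \frac{\sigma^2_t}{n_t} + \frac{\sigma^2_r}{n_r}
  + \frac{\sigma^2_{st}}{n_t} + \frac{\sigma^2_{sr}}{n_r}
  + \frac{\sigma^2_{tr}}{n_t n_r} + \frac{\sigma^2_{str,e}}{n_t n_r} \\
% Generalizability and dependability coefficients
\rho^2 &= \frac{\sigma^2_s}{\sigma^2_s + \sigma^2(\delta)}, \qquad
\Phi = \frac{\sigma^2_s}{\sigma^2_s + \sigma^2(\Delta)}
\end{align}
```

Because the absolute error variance also contains the task and rater main effects and their interaction, Φ can never exceed ρ² for the same design, which is why it is the more conservative index for criterion-referenced decisions.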
In particular, so that a wide range of potential D study designs can be explored and the ρ² and Φ indices flexibly computed, it is preferable to operationalize fully crossed designs rather than nested designs in the G study phase (Shavelson & Webb, 1991). In nested designs where two raters are usually paired in order to assess a subset of language performances, the effect of the nested variable cannot be differentiated from its interaction with the facet within which it is nested, thus providing less specific information than fully crossed designs where all raters need to assess all performances. Moreover, to obtain more stable VC estimates and, ultimately, to produce more accurate coefficients of ρ² and/or Φ for D study designs, language testing researchers could use averaged or pooled VCs based on multiple G studies (Chiu & Wolfe, 2002; Xi, 2007), which is especially preferable to such scenarios as sparse-rated data (as a result of assigning test takers’ performances to a fraction of available raters) in performance-based language tests (Lin, 2017).
On a more substantive note, based on Huang’s (2009) meta-analysis of the generalizability of educational and psychological performance assessments, on average a relatively large person-by-task interaction was found for L2 learning (15.06%) and L1 writing (27.46%). In Brown’s (2011) study, it was reported that person-by-task interactions explained from 0.45% to 49.06% of the variance in L2 performance tests. In In’nami and Koizumi’s (2016) synthesis of G theory investigations into L2 writing and speaking performance, it was found that a larger proportion of the score variances was attributable to task and task-related interaction effects than the rater and rater-related interaction effects. In addition, previous G theory analyses have shown that increasing the number of tasks or raters contributes to higher score generalizability/dependability in general, but to different degrees. For instance, employing more raters tended to be less effective in raising G coefficients than using more tasks in writing assessments (Gebril, 2009; Lee & Kantor, 2007) and in speaking assessments (Lee, 2006; Xi, 2007).
One study that is particularly relevant to the current research is a G theory analysis of score dependability for an assessment of English-to-Chinese simultaneous interpreting (Han, 2016a). It was found that the InfoCom ratings were more dependable than the ratings of FluDel and TLQual, and that the addition of tasks was more effective in raising dependability for InfoCom than the use of extra raters, but the effect was reversed for FluDel and TLQual. A design of four tasks and two raters barely resulted in a Phi value larger than 0.75 for a composite score based on InfoCom, FluDel, and TLQual. However, the study described in Han (2016a) is based on a single, one-off G study to compute VC estimates, and only investigates a unidirectional interpretation (i.e., from English-to-Chinese) to inform professional certification testing. In addition, it focuses on simultaneous interpreting, which differs considerably, in terms of working mode, from CI, a more popular form of oral translation in interpreter training.
In summary, despite the abundant literature in second/foreign language testing research, it seems that very few empirical studies have been conducted in interpretation assessment to investigate the potential impacts of the number of tasks and/or raters on score dependability, and to explore a useful measurement design (or designs) for locally created summative assessment of interpretation. It may therefore be risky to make relatively high-stakes decisions based on the current assessment practice.
Specific background to the study
The present study pertains to a third-year undergraduate compulsory course of advanced Consecutive Interpreting taught in a foreign-languages college of a university located in Southwest China. To earn academic credit, students need to pass a summative assessment of English/Chinese CI at the end of a semester. Students who fail the assessment have to retake it next semester. More importantly, the university stipulates that students must pass all compulsory courses before they can obtain their academic degrees. In this sense, the assessment is high-stakes.
Specifically, the summative assessment of English/Chinese CI consisted of two tasks, one for each direction. Each task featured two minutes of SL speech on a general topic. Students’ CI performance was audio-recorded for later evaluation. Most often, the lecturer served as the only rater, using a scale of 100 points that gave priority to such general criteria as fidelity and delivery; however, no detailed descriptors were provided to make the criteria explicit and transparent. As a result, the rater may have tended to score interpretation based on her or his impression. To pass the exam, a student needs to score above 60 points. According to Feng (2005), summative interpretation assessment of this kind is, arguably, a common practice in many Chinese universities.
A student-satisfaction survey conducted in the college in 2016, however, revealed students’ discontent with the assessment practice. In particular, the students felt as follows: (a) they were not given enough opportunities to demonstrate their skills; (b) assessment criteria were not transparent; (c) the scores provided by only one rater/teacher may not be reliable; and finally (d) they expressed concerns over the use of such scores for decision making. This forms the background for the present study.
Research questions
In light of the literature review above and against the specific background, the author of this article was tasked with improving the summative assessment. The study reported here represents one of the ongoing efforts to explore the optimal measurement design(s) to achieve sufficient score dependability for this locally created assessment. The study attempts to answer three research questions:
1. To what extent could the score variances for InfoCom, FluDel, and TLQual be explained by task/rater main effects and task-/rater-related interaction effects in the summative assessment of English/Chinese CI?
2. How would changing the number of tasks and/or raters affect the dependability of InfoCom, FluDel, and TLQual ratings for each interpreting direction?
3. What would be the optimal measurement design(s) for the summative CI assessment?
Method
The current study uses G theory to investigate the above research questions, with a focus on exploring the optimal measurement design(s) of the summative CI assessment. It is worth noting that “optimal design(s)” refers to a cost-effective configuration of the assessment, in terms of using a minimal number of tasks and raters to achieve score dependability of at least 0.75 (i.e., Φ ⩾ 0.75). Although some researchers recommend using 0.80 as the minimally accepted Φ value for high-stakes decision making (Cardinet, Johnson, & Pini, 2010), a Phi coefficient of 0.75 was regarded as appropriate, given the exploratory nature of the present study and the current educational context.
Participants: Students and raters
A total of 38 third-year undergraduate students, majoring in English/Chinese translation in a Bachelor of Arts program, participated in the CI course including the testing. With an average age of 21 years, 32 of them were female and the rest were male. They all had Mandarin Chinese as their L1 and English as their L2.
Six raters were also recruited to assess students’ CI performance on the assessment tasks. Specifically, two raters (i.e., Raters 05 and 06) were university lecturers of English/Chinese interpreting. The other four raters (i.e., Raters 01, 02, 03, and 04) were postgraduate MTI students who were working as teaching assistants to the CI course. All raters had Mandarin Chinese as their L1, and English as their L2. The recruitment of the raters with such a language combination was not a deliberate decision, but dictated by circumstances.
Assessment tasks
In light of relevant literature (Han & Riazi, 2017, pp. 252–243; Dawrant & Setton, 2016, pp. 414–421; Liu et al., 2008, pp. 8–10), a set of guidelines was prepared to develop assessment tasks. The guidelines were as follows: (a) SL texts must be authentic speech concerning a variety of general topics and themes; (b) SL texts need to be recorded by native language speakers, and each SL recording represents an assessment task; (c) to make it suitable for the undergraduate students, the length of SL texts should be about 300 words, divided into three or four segments; and (d) the duration of SL speech should be two and a half to three minutes, with an overall delivery speed of 100 to 120 words per minute. In addition, a short description should be provided for each CI task, briefing on the topics to be discussed in the SL speech. Usually, these descriptions were emailed to students two days before each assessment. Based on the guidelines, a total of 18 tasks were carefully developed, with nine for each interpreting direction.
Rating scales
Three descriptor-based rating scales were used, focusing on three dimensions: (a) InfoCom (i.e., to what extent SL information is successfully translated), (b) FluDel (i.e., to what extent disfluencies, such as (un)filled pauses, long silence, fillers, and/or excessive repairs are present in target-language interpretation), and (c) TLQual (i.e., to what extent target-language expressions are natural to a native English or Chinese speaker). Adapted from Han (2015), the eight-point scales were revised, based on students’ feedback, so that the descriptors could capture better the characteristics of their performance (see the Appendix). The eight-point scales can also be reduced to four-band scales by collapsing two neighboring points into a single score band. A previous Rasch analysis suggested that, in general, the scales functioned properly (e.g., monotonic increase of step thresholds in line with the eight-point scale) (Han, 2015).
Rater training
A rater training session, of about three hours, was held to prepare the raters for formal rating. Before the training, the raters read the SL texts in order to gain familiarity. In the training, each rater received a copy of the rating scales, and was asked to become familiar with the scale structure and descriptors. The author provided detailed explanations of such terms as deviation, omission, and disfluencies. The raters were also given opportunities to air their opinions, so that potential misunderstandings could be highlighted and discussed. Then, a batch of 15 randomly selected recordings was distributed to the raters for pilot rating. After each round of rating, all raters compared their scores with each other, and justified why a specific score was given. By doing so, it was hoped that each rater would become aware of potential differences, gain a better understanding of rating scales and descriptors, and adjust his or her rating behavior accordingly.
Procedures
Overall, three rounds of assessment were conducted. On each occasion, the same group of students performed English/Chinese CI for six tasks that were randomly sampled from the 18 tasks, with three tasks for each direction. Between consecutive tasks, the students had a one-minute break. The CI performance was audio-recorded, giving a total of 228 recordings on each occasion (i.e., 38 students × 6 tasks). The recordings were then randomly distributed to each rater. In the two-day formal rating, the raters gathered together in a room, but worked independently and at their own pace. Nonetheless, they usually assessed a batch of 20–25 recordings before a 15-minute break. Apart from the recordings, they were also given the SL texts, so that they could compare the interpretations with the original content.
Unlike many large-scale performance tests where nested rating designs are more practical, a fully crossed design was operationalized in the current study to estimate the greatest number of distinct sources of variability possible in the G study phase, generating a total of 4,104 data points on each occasion (i.e., 38 students × 3 tasks × 2 directions × 6 raters × 3 criteria).
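The size of the fully crossed dataset follows directly from the figures reported above. The short check below simply multiplies them out; the variable names are illustrative and not part of the original study.

```python
# Counts taken from the Participants and Procedures sections.
n_students = 38
n_tasks_per_direction = 3
n_directions = 2
n_raters = 6
n_criteria = 3

# Every student performs every task, so recordings = students x tasks.
recordings_per_occasion = n_students * n_tasks_per_direction * n_directions

# Every rater scores every recording on every criterion (fully crossed).
ratings_per_occasion = recordings_per_occasion * n_raters * n_criteria

print(recordings_per_occasion)
print(ratings_per_occasion)
```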
Data analysis
Regarding the G theory design, given that students’ CI ability was measured, the students (denoted as s) constituted the objects of measurement. CI tasks (t) and raters (r) were treated as the random facets, because for either tasks or raters the size of the sample was much smaller than that of the universe, and the sample was also considered to be exchangeable with any other sample of the same size drawn from the universe (Shavelson & Webb, 1991).
Overall, to address the research questions, a univariate G theory analysis was conducted. The choice of the univariate analysis was largely determined by the fact that the decisions regarding passing/failing the summative assessment are based on individual criterion scores. Specifically, in the G study phase, two steps were taken to compute VC estimates. First, a G study with an s × t × r design was carried out on InfoCom, FluDel, and TLQual, respectively. The variation contributed by the students, tasks, raters, and their interaction effects to the total amount of variation that was observed in the performance ratings was then estimated (i.e., VC estimates) for the hypothetical design of one task and one rater (treated as the baseline). Second, the same G study design was repeated for each of the three CI assessments, resulting in a total of three sets of VC estimates. These estimates were then averaged to produce a new set of pooled VC estimates.
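The two steps described above can be sketched in code. The example below is a minimal illustration, assuming ANOVA-type estimation from expected mean squares for a random-effects s × t × r design with one rating per cell; the study itself used EduG, whose internal routines are not reproduced here. The function names `estimate_vcs` and `pool_vcs`, the truncation of negative estimates at zero, and the simulated ratings are all assumptions made for illustration.

```python
import numpy as np


def estimate_vcs(x):
    """ANOVA-type variance components for a fully crossed s x t x r design.

    x has shape (n_students, n_tasks, n_raters) with one rating per cell.
    """
    ns, nt, nr = x.shape
    g = x.mean()
    m_s, m_t, m_r = x.mean(axis=(1, 2)), x.mean(axis=(0, 2)), x.mean(axis=(0, 1))
    m_st, m_sr, m_tr = x.mean(axis=2), x.mean(axis=1), x.mean(axis=0)

    # Mean squares for main effects, two-way interactions, and the residual.
    ms_s = nt * nr * np.sum((m_s - g) ** 2) / (ns - 1)
    ms_t = ns * nr * np.sum((m_t - g) ** 2) / (nt - 1)
    ms_r = ns * nt * np.sum((m_r - g) ** 2) / (nr - 1)
    ms_st = nr * np.sum((m_st - m_s[:, None] - m_t[None, :] + g) ** 2) / ((ns - 1) * (nt - 1))
    ms_sr = nt * np.sum((m_sr - m_s[:, None] - m_r[None, :] + g) ** 2) / ((ns - 1) * (nr - 1))
    ms_tr = ns * np.sum((m_tr - m_t[:, None] - m_r[None, :] + g) ** 2) / ((nt - 1) * (nr - 1))
    resid = (x - m_st[:, :, None] - m_sr[:, None, :] - m_tr[None, :, :]
             + m_s[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - g)
    ms_e = np.sum(resid ** 2) / ((ns - 1) * (nt - 1) * (nr - 1))

    # Solve the expected-mean-square equations for each component.
    vc = {
        "s": (ms_s - ms_st - ms_sr + ms_e) / (nt * nr),
        "t": (ms_t - ms_st - ms_tr + ms_e) / (ns * nr),
        "r": (ms_r - ms_sr - ms_tr + ms_e) / (ns * nt),
        "st": (ms_st - ms_e) / nr,
        "sr": (ms_sr - ms_e) / nt,
        "tr": (ms_tr - ms_e) / ns,
        "str,e": ms_e,
    }
    # Assumed convention: negative estimates are truncated at zero.
    return {k: max(float(v), 0.0) for k, v in vc.items()}


def pool_vcs(vc_sets):
    """Average each variance component across several G studies."""
    return {k: float(np.mean([vc[k] for vc in vc_sets])) for k in vc_sets[0]}


# Simulated ratings (NOT the study's data): 3 occasions of 38 x 3 x 6 scores.
rng = np.random.default_rng(0)
occasions = [rng.normal(5.0, 1.0, size=(38, 3, 6)) for _ in range(3)]
pooled = pool_vcs([estimate_vcs(x) for x in occasions])
total = sum(pooled.values())
for name, value in pooled.items():
    print(f"{name:6s} {value:.3f} ({100 * value / total:.1f}%)")
```

In practice the function would be called once per criterion and per direction on each occasion, and the three resulting sets would be passed to `pool_vcs`, mirroring the averaging step described in the text.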
In subsequent D studies characterized by an s × T × R design, the pooled VC estimates were used in statistical computation to explore an optimal measurement design(s) that achieved the minimally accepted Φ value of 0.75 for each criterion, each direction, and the assessment as a whole, based on different combinations of tasks and raters. The statistical program of EduG 6.1e was used for the data analysis (Cardinet et al., 2010).
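To make the D study step concrete, the sketch below computes Φ for an s × T × R design from a set of variance components and tabulates it over one to seven tasks and raters. The variance components are hypothetical values chosen for illustration only (they are not the pooled estimates reported in Table 1), and the function name `phi` is likewise an assumption.

```python
def phi(vc, n_t, n_r):
    """Dependability (Phi) for an s x T x R D study with n_t tasks and n_r raters."""
    abs_error = (
        (vc["t"] + vc["st"]) / n_t
        + (vc["r"] + vc["sr"]) / n_r
        + (vc["tr"] + vc["str,e"]) / (n_t * n_r)
    )
    return vc["s"] / (vc["s"] + abs_error)


# Hypothetical variance components for one criterion (NOT the values in Table 1).
vc = {"s": 0.40, "t": 0.10, "r": 0.08, "st": 0.08, "sr": 0.06, "tr": 0.03, "str,e": 0.25}

# Tabulate Phi for every combination of one to seven tasks and raters.
grid = {(n_t, n_r): phi(vc, n_t, n_r) for n_t in range(1, 8) for n_r in range(1, 8)}
meets_threshold = sorted(design for design, value in grid.items() if value >= 0.75)

print(f"Phi(1 task, 1 rater) = {grid[(1, 1)]:.3f}")
print(f"Designs with Phi >= 0.75: {meets_threshold}")
```

Task-related components are divided by the number of tasks and rater-related components by the number of raters, so the relative size of these components determines which facet is more worth expanding.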
Results
Pooled VC estimates based on the three G studies
To address research question 1, Table 1 summarizes the pooled VC estimates from the three univariate G studies for the baseline design of one task and one rater. The pooled VC estimates were then used for statistical computation in subsequent D studies. As can be seen in Table 1, for both directions a larger percentage of the score variance was accounted for by the task main effect variance for InfoCom than for FluDel and TLQual. This indicates that, on average, there was a greater variation of task difficulty based on InfoCom ratings than on FluDel or TLQual ratings. Furthermore, the task main effect variances for the three criteria explained a larger proportion of the score variance for the English-to-Chinese direction than the opposite direction. The result means that regarding task difficulty greater heterogeneity was observed for the English-to-Chinese interpretation than for the Chinese-to-English interpretation.
Table 1. Pooled VC estimates based on the design of one task and one rater.
When it comes to the rater facet, a larger share of the score variance was attributable to the rater main effect variances for FluDel and TLQual than for InfoCom, which holds true for both directions; and the rater main effect variances for all the three criteria accounted for a bigger percentage of the score variance for the Chinese-to-English interpretation than for the English-to-Chinese interpretation.
In terms of the interaction effects, particularly the person-by-task effect, Table 1 shows that only a moderate person-by-task interaction effect was found for both interpreting directions (i.e., ranging from 4.1% to 6.5%), except the relatively large interaction effect (13.9%) based on InfoCom for the English-to-Chinese interpretation. In addition, for both interpreting directions, the rater (r) and rater-related interaction effects (sr, tr) explained more of the score variances than the task (t) and task-related interaction effects (st, tr) across the three rating dimensions, except for the InfoCom ratings in English-to-Chinese interpretation, where the task and task-related interaction effects (i.e., 16.7% + 13.9% + 2.4% = 33.1%) contributed more variability to the score variance than the rater and rater-related effects (i.e., 6.0% + 4.1% + 2.4% = 12.5%).
Effects of additional tasks and/or raters on score dependability
In order to address research question 2, a series of D studies was conducted, based on the pooled VC estimates in Table 1. Specifically, any possible combinations of one to seven tasks and one to seven raters were explored in order to gain a better understanding of the potential effects of using different numbers of tasks and/or raters on the magnitude of score dependability (Φ), for each assessment criterion and for each interpreting direction. The use of seven raters and seven tasks for exploring optimal designs represents a decision that is deemed financially viable in the specific context described in the study. In theory, more raters and tasks could be used for the exploratory analysis, but such designs are too expensive to be practical. Figure 1 presents, side by side, the incremental change of Φ for each criterion in response to different combinations of the number of tasks and/or raters for the English-to-Chinese and Chinese-to-English interpretations, respectively.

Figure 1. Changes of score dependability as a function of the number of tasks and/or raters.
As seen in Figure 1, an apparent trend is that increasing the number of tasks and/or raters would contribute to larger values of the Phi coefficient across the criteria and for both directions, although the marginal effect tapered off with more tasks and/or raters. Another observable pattern is that, comparatively, score dependability for each criterion tended to be greater in the English-to-Chinese direction than the other direction, for most of the measurement designs. For example, regarding either the baseline design of one task and one rater (i.e., nt = 1 and nr = 1) or the potentially maximal design of seven tasks and seven raters (i.e., nt = 7 and nr = 7), the Phi coefficients for each criterion, as shown in the figure, were smaller in the Chinese-to-English direction than the opposite direction. One additional pattern is that, among all the measurement designs, a greater number of designs achieved the desirable level of score dependability (Φ ⩾ 0.75, indicated by the dashed lines in the figure) for the English-to-Chinese than the Chinese-to-English interpretation. Taking the score dependability for TLQual as an example, a total of 17 designs were above the minimally accepted value of 0.75 in the English-to-Chinese direction, whereas only seven designs were identified for the other direction.
Specifically, regarding the efficacy of tasks vis-à-vis raters in raising score dependability, it appears that, on average, using an additional task would contribute to a larger Phi coefficient across the criteria for the English-to-Chinese direction than the opposite direction. The reason is that, while holding the number of raters constant, the curves for the English-to-Chinese direction were steeper and more sharply curved than those for the Chinese-to-English direction, suggesting a greater increase in score dependability. In contrast, using one more rater would be more effective across the criteria (in raising Φ) for the Chinese-to-English direction than the other direction. As seen in Figure 1, while holding the number of tasks constant, the vertical distance between one rater and seven raters was greater for the Chinese-to-English direction than the opposite direction, signaling a greater increase of the Phi coefficients.
More specifically, Table 2 presents the average contribution (i.e., the average marginal effect) of one more task vis-à-vis one more rater towards raising the Phi coefficient for each criterion and for each direction. Regarding the English-to-Chinese interpretation, for FluDel and TLQual the average contribution made by an additional rater to increase the value of Φ was slightly larger than that made by one more task (i.e., FluDel: 0.045 > 0.028; TLQual: 0.044 > 0.036), but this pattern was reversed for InfoCom (i.e., 0.025 < 0.053). Regarding the Chinese-to-English direction, a greater average contribution was made by using one more rater than by adding one more task for each of the criteria (i.e., InfoCom: 0.041 > 0.029; FluDel: 0.058 > 0.019; TLQual: 0.063 > 0.017).
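One plausible way to operationalize such an average marginal effect is sketched below. This section does not spell out the exact computation, so the definition used here is an assumption: the gain in Φ from moving from n to n + 1 conditions of one facet, averaged over all steps and over all levels of the other facet on a 7 × 7 grid. The variance components are again hypothetical, and `average_gain` is a helper name made up for this example.

```python
from statistics import mean

# Hypothetical variance components for a p x t x r design (NOT the study's estimates).
VC = {"p": 0.40, "t": 0.10, "r": 0.08, "pt": 0.05, "pr": 0.12, "tr": 0.03, "ptr_e": 0.22}


def phi(vc, n_t, n_r):
    """Index of dependability for n_t tasks and n_r raters (random facets)."""
    error = (
        (vc["t"] + vc["pt"]) / n_t
        + (vc["r"] + vc["pr"]) / n_r
        + (vc["tr"] + vc["ptr_e"]) / (n_t * n_r)
    )
    return vc["p"] / (vc["p"] + error)


def average_gain(vc, facet, max_n=7):
    """Mean increase in Phi from adding one condition of `facet` (assumed definition)."""
    if facet not in ("task", "rater"):
        raise ValueError("facet must be 'task' or 'rater'")
    gains = []
    for fixed in range(1, max_n + 1):   # level of the facet that is held constant
        for n in range(1, max_n):       # step from n to n + 1 on the varied facet
            if facet == "task":
                gains.append(phi(vc, n + 1, fixed) - phi(vc, n, fixed))
            else:
                gains.append(phi(vc, fixed, n + 1) - phi(vc, fixed, n))
    return mean(gains)


task_gain = average_gain(VC, "task")
rater_gain = average_gain(VC, "rater")
print(f"average gain per extra task:  {task_gain:.3f}")
print(f"average gain per extra rater: {rater_gain:.3f}")
```

With these illustrative components the rater-related variance exceeds the task-related variance, so an extra rater yields the larger average gain. Swapping the relative sizes of the two sets of components would reverse the ordering, which mirrors the contrast reported for InfoCom in the English-to-Chinese direction.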
Table 2. Average contribution made by an additional task or rater to increase the Phi coefficient.
Identifying an optimal measurement design(s)
To answer research question 3, a host of potentially viable measurement designs, together with their corresponding Phi coefficients for each criterion, are shown in Table 3. Overall the table shows that, given a certain design, the Phi coefficient for each criterion was more likely to be lower than 0.75 (indicated by the notation ×) for the Chinese-to-English direction than for the opposite direction. Furthermore, although some designs, for example, the design of three tasks and six raters (nt = 3 and nr = 6), were able to generate acceptable values of Φ for some criteria, they might not be equally effective for the others.
Table 3. Potentially viable measurement designs.
Note: The notation × indicates that the Phi coefficient was lower than 0.75; an asterisk * signifies the potentially optimal design(s).
More importantly, a basic requirement for potentially useful designs is that any given design should be able to produce a Phi coefficient equal to or greater than 0.75 for all three criteria simultaneously. Six designs for the English-to-Chinese direction and two designs for the Chinese-to-English direction met the requirement, as indicated by asterisks (*) in the table. Another requirement then pertains to the cost-effectiveness of measurement designs. At face value, two designs could be shortlisted for the English-to-Chinese direction: (a) the design of four tasks and six raters (nt = 4 and nr = 6), and (b) the design of five tasks and five raters (nt = 5 and nr = 5). For the Chinese-to-English direction, the design of five tasks and six raters (nt = 5 and nr = 6) could be the best candidate.
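The two-step screening described above (all three criteria at or above 0.75, then cost-effectiveness) can be sketched as follows. The variance components for each criterion are hypothetical placeholders rather than the study's estimates, and ranking designs by the number of ratings per examinee (nt × nr) is only one assumed proxy for cost.

```python
from itertools import product

THRESHOLD = 0.75  # minimum acceptable Phi, as used in the study

# Hypothetical variance components per criterion (NOT the study's estimates).
VCS = {
    "InfoCom": {"p": 0.45, "t": 0.17, "r": 0.03, "pt": 0.05, "pr": 0.05, "tr": 0.03, "ptr_e": 0.22},
    "FluDel": {"p": 0.40, "t": 0.08, "r": 0.10, "pt": 0.05, "pr": 0.12, "tr": 0.03, "ptr_e": 0.22},
    "TLQual": {"p": 0.40, "t": 0.09, "r": 0.09, "pt": 0.05, "pr": 0.12, "tr": 0.03, "ptr_e": 0.22},
}


def phi(vc, n_t, n_r):
    """Index of dependability for n_t tasks and n_r raters (random facets)."""
    error = (
        (vc["t"] + vc["pt"]) / n_t
        + (vc["r"] + vc["pr"]) / n_r
        + (vc["tr"] + vc["ptr_e"]) / (n_t * n_r)
    )
    return vc["p"] / (vc["p"] + error)


def viable_designs(vcs, max_n=7, threshold=THRESHOLD):
    """Return designs whose Phi meets the threshold on every criterion, cheapest first."""
    designs = []
    for n_t, n_r in product(range(1, max_n + 1), repeat=2):
        phis = {name: phi(vc, n_t, n_r) for name, vc in vcs.items()}
        if all(value >= threshold for value in phis.values()):
            designs.append((n_t, n_r, phis))
    # Assumed cost proxy: ratings per examinee (n_t * n_r), ties broken by fewer raters.
    return sorted(designs, key=lambda d: (d[0] * d[1], d[1]))


shortlist = viable_designs(VCS)
for n_t, n_r, phis in shortlist[:3]:
    summary = ", ".join(f"{name}={value:.3f}" for name, value in phis.items())
    print(f"n_t={n_t}, n_r={n_r}: {summary}")
```

Requiring the threshold on all criteria at once is what shrinks the candidate set: a design that is dependable for one scale can still fail on another, so the shortlist is driven by the least dependable criterion.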
Discussion
Main and interaction effects
The study found that task main effects accounted for a much larger proportion of variance in the InfoCom ratings than in the FluDel or TLQual ratings, for both English-to-Chinese (i.e., 16.7% > 8.1% or 9.0%) and Chinese-to-English consecutive interpretation (i.e., 11.8% > 2.3% or 0.7%). These findings indicate that on average there was a greater variation of task difficulty based on InfoCom than on FluDel or TLQual ratings. The pattern differs somewhat from Han’s (2016a) finding: for English-to-Chinese simultaneous interpretation, the InfoCom (i.e., 8.0%) and FluDel ratings (i.e., 8.9%) explained a roughly equal percentage of task main effect variance, though much larger than that of TLQual (i.e., 1.3%). This somewhat conflicting result could be explained by the topical diversity of the SL speeches used in the assessments. In the current study, the SL speeches covered a variety of topics and themes, whereas in Han (2016a) all tasks were characterized by a common topic. Given that InfoCom is the most task-specific feature, the relatively larger variation was expected. It is also worth mentioning that in the current study CI tasks are independent speeches of more or less the same length, while in other CI assessments segments of a long speech on the same topic are treated as assessment tasks (e.g., TEM-8; see Wang et al., 2011). As such, given that topical diversity could be a significant factor contributing to score variances, the larger variation for InfoCom found in this study cannot be extrapolated to CI assessments in which segments of a single speech are used as assessment tasks.
The results concerning the rater main effect suggest that in general the raters behaved more consistently when using the InfoCom scale than when using the FluDel or TLQual scale, which concurs with the previous finding reported by Han (2016a). One possible reason for this pattern is that, when evaluating interpretation, the raters relied on SL texts to monitor and identify any deviations and/or omissions occurring in target-language renditions. As such, the SL texts served as a consistent reference or anchor that helped reduce rater idiosyncrasy, leading to relatively more reliable InfoCom ratings. In contrast, when using the FluDel and TLQual scales, no such apparent anchors were available, and the raters probably had to rely on their mental representation of quality descriptors as a frame of reference, which may be susceptible to change over time. Furthermore, a comparison of VC percentages for the two interpreting directions suggests that overall the raters exhibited greater variability when assessing Chinese-to-English than English-to-Chinese interpretation. A possible explanation for this result relates to the mismatch between the raters’ native language and the target language in which renditions were produced. Specifically, given that all the raters in the study had Mandarin Chinese as their native language and English as a second language, they may be more competent and confident in evaluating interpretations produced in their mother tongue. As a result, comparatively more accurate judgements were attained when the raters were evaluating the English-to-Chinese renditions than the Chinese-to-English renditions.
The moderate person-by-task interaction effects (i.e., most of the VC percentages ranging from 4.1% to 6.5%) indicate that overall the quality of the students’ CI performances, based on the InfoCom, FluDel and TLQual ratings, did not vary substantially across different tasks. This pattern, however, is not fully consistent with Huang’s (2009) meta-analytic findings, in which a relatively large person-by-task interaction was identified (i.e., on average, 15.6% for L2 learning, and 27.46% for L1 writing). Nevertheless, the current pattern is in line with the previous finding reported in Han (2016a), in which the person-by-task interaction effect was also moderate for English-to-Chinese simultaneous interpretation (i.e., 9.5% for InfoCom, 8.4% for FluDel, and 6.9% for TLQual). The lack of variance could be an indication of consistent scoring, but it may also be explained by the homogeneity of the student group.
Score dependability as a function of the number of tasks and/or raters
In the study, more measurement designs produced a Phi value equal to or larger than 0.75 in the English-to-Chinese direction than in the Chinese-to-English direction. This finding suggests that, comparatively, the minimum level of score dependability was more easily achieved for the English-to-Chinese interpretation.
More importantly, increasing the number of raters seemed to have a relatively larger impact on score dependability than adding more tasks for both directions. In regard to score dependability of InfoCom ratings for the English-to-Chinese direction, however, using more tasks was found to be more effective than employing more raters. Although the latter part of the results is largely consistent with the findings of previous research in performance-based assessments in foreign/second language testing (Gebril, 2009; Lee, 2006; Xi, 2007), the former part is not.
Another way to look at the results is that, in the English-to-Chinese consecutive interpretation, sampling more tasks did a better job of raising score dependability for InfoCom, but recruiting more raters produced larger impacts for FluDel and TLQual. This pattern is consistent with the findings in Han (2016a), in which English-to-Chinese simultaneous interpretation was assessed by a group of nine raters whose native language was Mandarin Chinese. In particular, for the English-to-Chinese direction, it was found that for the InfoCom ratings the task main effect and task-related interaction effects were larger than the rater main effect and rater-related interaction effects, not only in the present study (i.e., 33.1% > 12.5%) but also in Han (2016a) (i.e., 18.9% > 13.5%), whereas this pattern was reversed for FluDel and TLQual. This phenomenon could be explained by the fact, as noted in the section on main and interaction effects, that when assessing English-to-Chinese renditions, raters used English texts as an SL reference, which probably led to less rater variability. But why was this exact pattern not observed for the Chinese-to-English direction, given that the raters also used Chinese texts as an original reference? This is probably because the raters, who had Mandarin Chinese as their L1 and English as their L2, were less able to evaluate English renditions confidently and accurately than to assess Chinese renditions, resulting in diverse evaluations despite their access to SL texts. In other words, although raters’ language familiarity and the availability of SL texts both helped lessen rating inconsistency, the former seemed to be more effective than the latter. The findings therefore highlight the potentially complex interactions among assessment criterion, interpreting direction, and raters’ dominant language that are involved in rater-mediated assessment of interpreting.
Optimal measurement designs
The study found that for the English-to-Chinese direction there were two potentially cost-effective designs: nt = 4 and nr = 6, as well as nt = 5 and nr = 5; for the Chinese-to-English direction there was only one: nt = 5 and nr = 6. These findings lend credence to the practice of recruiting as many as five to six raters for summative assessment in some of the postgraduate interpreting programs (Liu et al., 2008). However, in the previous literature there was usually only one task (of different lengths) for each interpreting direction, which could be a psychometric concern, although task/speech length may be another important variable affecting score dependability.
Moreover, the findings invalidate the practice of summative assessment previously conducted for the English/Chinese CI course and potentially for many others reported in Feng (2005), in which only one (short) task and one rater were involved for each direction. More (or perhaps longer) tasks and raters are definitely needed so that assessment outcomes could become dependable and be used for important decision making. In the current study, for the English-to-Chinese direction, the choice between the two designs is reduced to one question: which is more cost-effective for the assessment as a whole, using one more task or one more rater? Given that the two designs were able to produce a Phi coefficient larger than 0.75, perhaps the respective cost involved in the designs has to be examined in the specific context of the assessment.
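The choice between the two English-to-Chinese designs can be framed with a simple cost comparison. The sketch below is purely illustrative: the cost model (a fixed cost per task developed, a fixed cost per rater recruited and trained, and a cost per individual rating) and every figure in it are hypothetical assumptions, not data from this study.

```python
def design_cost(n_t, n_r, n_examinees, cost_per_task, cost_per_rater, cost_per_rating):
    """Total cost of a fully crossed design under an assumed linear cost model."""
    development = n_t * cost_per_task                     # preparing each task
    recruitment = n_r * cost_per_rater                    # recruiting and training each rater
    rating = n_examinees * n_t * n_r * cost_per_rating    # every rater scores every task
    return development + recruitment + rating


def cheaper_design(designs, **costs):
    """Return the (n_t, n_r) pair with the lowest total cost."""
    return min(designs, key=lambda d: design_cost(d[0], d[1], **costs))


candidates = [(4, 6), (5, 5)]  # the two shortlisted designs for English-to-Chinese

# Scenario A (hypothetical): raters are expensive to recruit and train.
choice_a = cheaper_design(
    candidates, n_examinees=30, cost_per_task=200, cost_per_rater=500, cost_per_rating=2
)

# Scenario B (hypothetical): raters are cheap relative to task development.
choice_b = cheaper_design(
    candidates, n_examinees=30, cost_per_task=200, cost_per_rater=100, cost_per_rating=2
)

print("Scenario A:", choice_a)
print("Scenario B:", choice_b)
```

Under this assumed model, the five-task, five-rater design is cheaper only when the fixed cost of one extra rater exceeds the cost of one extra task plus one extra rating per examinee, which is why the decision depends on the local context of the assessment.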
Conclusion
This study explored the optimal measurement design(s) for a locally created summative assessment of English/Chinese CI, based on multiple fully crossed G studies. It was found that (a) overall the raters behaved more consistently using the InfoCom scale than they did using the FluDel or TLQual scale; (b) raters displayed greater variability in assessing the Chinese-to-English interpretation than the English-to-Chinese interpretation; (c) adding tasks was more effective in boosting the Phi coefficient than was using additional raters for the InfoCom ratings in the English-to-Chinese interpretation, but the pattern was reversed in the rest of the observations; and (d) two potentially optimal designs were identified for the English-to-Chinese direction, and one design for the other direction. Although these findings might be most useful in the local context, the current study provides relevant procedures that other local interpreting courses/programs may find useful in carrying out their own G-theory analysis to inform their local measurement designs or to investigate the dependability and the distinctness of analytic scores.
In addition, this study demonstrated that validation research needs to be conducted for locally created summative assessments of interpretation to examine whether assessment outcomes have been properly used for decision making. This recommendation challenges current practice in most settings. A second implication is that SL texts should be provided to raters and that raters should study them before formal rating procedures begin in operational rating. A third implication is that, if possible, interpretations need to be evaluated by a rater whose dominant language is the language into which the renditions are delivered. By doing so, rating precision and consistency could be improved. The last implication is that given the potentially complex interactions among assessment criterion, interpreting direction, and raters’ dominant language, using one more rater or one more task could produce differential effects on the dependability of analytic scores. As a result, a common optimal measurement design should not be expected for both interpreting directions.
The findings and implications should be considered in view of four limitations of the research. First, some of the raters in the study were postgraduate students working as teaching assistants for the interpreting course. They may not have fully acquired the ability to assess interpretation in both directions, thus contributing more variability to assessment outcomes. In future research, two groups of raters (i.e., experienced versus novice raters) could be recruited to investigate whether there exists a substantial difference in rating quality. Second, the raters in the study had Mandarin Chinese as their L1 and English as their L2, but had to assess interpretation in both directions. A mismatch between raters’ dominant language and the target language into which an interpretation is delivered may result in greater score variability than when the two languages match. Therefore, raters’ language familiarity could be hypothesized as an additional source of variability, which was not examined in the current study. It would be of interest to employ English/Chinese interpreters whose native or dominant language is English as raters to assess interpretation in both directions, and to see whether a similar pattern persists. It would be equally intriguing if future research incorporated raters’ language familiarity as a fixed facet in a G study to examine the VC estimates associated with it for interpretation assessment. Third, in the study the length of all tasks was designed to be about the same, although the current practice of interpretation assessment is characterized by different task lengths. The lack of variation in task length rules out the possibility of investigating its effects on rater judgement and, ultimately, score dependability. Therefore, future research could employ tasks of different lengths and model task length as a measurement facet in G theory studies.
Lastly, the use of the purely quantitative method (i.e., G theory analysis) in the study was unable to generate qualitative insight into the statistical patterns identified. Although many explanations provided for the statistical results seemed plausible, they were essentially hypothetical and lacking substantive backup. Qualitative data (via interviews of, or verbal protocols from, the raters) should be collected in future research to triangulate with findings of statistical analysis.
Appendix
Descriptor-based rating scales for assessing consecutive interpretation.
| Score range | Information completeness (InfoCom) | Fluency of delivery (FluDel) | Target language quality (TLQual) |
|---|---|---|---|
| 7–8 | A substantial amount of original messages delivered (i.e., > 80%), with a few number of deviations, inaccuracies, and minor/major omissions. | Delivery on the whole fluent, containing a few disfluencies such as (un)filled pauses, long silence, fillers and/or excessive repairs. | Target language idiomatic and on the whole correct, with only a few instances of unnatural expressions and grammatical errors. |
| 5–6 | Majority of original messages delivered (i.e., 60–70%), with a small number of deviations, inaccuracies, and minor/major omissions. | Delivery on the whole generally fluent, containing a small number of disfluencies. | Target language generally idiomatic and on the whole mostly correct, with a small amount of instances of unnatural expressions and grammatical errors. |
| 3–4 | About half of original messages delivered (i.e., 40–50%), with many instances of deviations, inaccuracies, and minor/major omissions. | Delivery rather fluent. Acceptable, but with regular disfluencies. | Target language to a certain degree both idiomatic and correct. Acceptable, but contains many instances of unnatural expressions and grammatical errors. |
| 1–2 | A small portion of original messages delivered (i.e., < 30%), with frequent occurrences of deviations, inaccuracies, and minor/major omissions, to such a degree that listeners may doubt the integrity of renditions. | Delivery lacks fluency. It is frequently hampered by disfluencies, to such a degree that they may impede comprehension. | Target language stilted, lacking in idiomaticity, and containing frequent grammatical errors, to such a degree that it may impede comprehension. |
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by China National Social Sciences Foundation (grant number: 18CYY010).
