Abstract
For Chinese as a second language (L2 Chinese), there has been little research into distinguishing features (Fulcher, 1996; Iwashita et al., 2008) used in scoring L2 Chinese speaking performance. The study reported here investigates the relationship between the distinguishing features of L2 Chinese spoken performances and the scores awarded by raters to the performances using holistic scoring. Seven distinguishing features – representing four major categories of Pronunciation, Fluency, Vocabulary and Grammar in the L2 Chinese speaking construct – were employed. An L2 Chinese speaking test was developed to assess the overall communicative ability in L2 spoken Chinese within an academic context. Speech samples of 66 candidates on the speaking test (i.e. 9 minutes’ speech length for each candidate) were analyzed in terms of the seven distinguishing features, with correlations and standard multiple regression being employed. Results showed that, first, each of the seven distinguishing features was significantly correlated to the scores, producing large or medium effect sizes; second, 79% and 77% of the variance in the scores could be explained by the distinguishing features (incorporating word tokens and word types respectively) in two regression analyses. The current study has established a link between distinguishing features and scores, contributing empirical evidence of candidate performance to the validation of assessing speaking proficiency in the L2 Chinese context.
In second language (L2) testing and assessment, features and the degree of mastery of the features are crucial for defining rating scales in terms of ‘scales’ and ‘levels’ to score candidate performance for score interpretation and use (Bachman & Palmer, 2010; Fulcher & Davidson, 2007). Such a dual role of features in scoring L2 speaking performance has been integrated as distinguishing features representing speaking construct that can characterize candidate-spoken performances at each of various proficiency levels in L2 speaking assessment (Fulcher, 2003; Iwashita, Brown, McNamara, & O’Hagan, 2008). Accordingly, the concept of distinguishing features has been widely used in the construction of rating scales to score L2 speaking performance in large-scale international language tests – for example, IELTS and TOEFL iBT (see Taylor & Falvey, 2007 for IELTS; also see Chapelle, Enright, & Jamieson, 2008 for TOEFL iBT).
The concept of distinguishing features has gone through a number of stages of development and changes in the history of L2 speaking assessment. In early studies, the concept of the ‘well-educated native speaker’ was adopted as the fundamental feature to distinguish speaking proficiency at different levels in the Foreign Service Institute (FSI) scale (Wilds, 1975, p. 36). Such a distinguishing feature was also employed for scales developed by the InterAgency Language Roundtable (ILR), the American Council for the Teaching of Foreign Languages (ACTFL) and the Australian Second Language Proficiency Ratings (ASLPR) (see Ingram, 1985; Liskin-Gasparro, 1984; Lowe, 1985). However, the distinguishing feature, known as the ‘native speaker’ or ‘educated native speaker’ characteristic, has been challenged or criticized by many language testing researchers (see Bachman & Savignon, 1986; Davies, 1991, 2003; Lantolf & Frawley, 1985) for its inexplicit definition. Rating scales using the ‘native speaker’ element ‘are described in only vague general terms and abound in qualifiers, so that only gross distinctions can be made with any confidence’ (Matthews, 1990, p. 119). Possibly responding to these critical comments on the ‘native speaker’ element, the distinguishing feature was then slightly adjusted to ‘can do’ statements, which identified ‘the sorts of tasks learners can perform at various levels and describe very broad parameters of phonological, grammatical, discoursal, and sociolinguistic development’ (Ingram & Wylie, 1993, p. 222).
Before the 1990s, the mainstream approach to determining or validating distinguishing features was the expert intuition and experience model (see Fulcher, 2003, 2010). Since then, researchers have focused more on cognitive studies of raters’ (or teachers’) perceptions (e.g. Brown, 2000; North, 2000; Pollitt & Murray, 1996; Upshur & Turner, 1995), discourse analysis of candidates’ performances (e.g. Douglas, 1994; Fulcher, 1996; Lazaraton, 2002) and mixed methods studies of both (e.g. Brown, 2006a, 2006b; Brown, Iwashita, & McNamara, 2005) when investigating detailed distinguishing features. Accordingly, the focus of research into distinguishing features has shifted from a communicative and functional angle (‘native speaker’ or ‘can do’) to a linguistic quality emphasis (‘features of language use’) (Brown & Taylor, 2006; Davies, 2008) with such attempts to explore distinguishing features reflected in the operational speaking rating scales of international language tests, e.g. IELTS and TOEFL iBT (IELTS website, retrieved on 7 August 2010; Educational Testing Service, 2005).
In the present study, the objective is to validate the distinguishing features widely used in scoring candidates’ spoken performances in the context of L2 Chinese. Few studies have been conducted to examine the relationship between the distinguishing features of L2 Chinese spoken performances and the scores awarded to the performances (see Wang, 2002; Zhu, 2009). The current study therefore aims to contribute to two aspects of this concern: on the one hand, investigating the relationship between each individual distinguishing feature and the scores; on the other hand, examining the contribution of distinguishing features to scores.
Literature review
Since the late 1980s, there has been a tendency to analyze the actual performance or discourse of candidates in speaking tests (van Lier, 1989). As a result, an increasing number of discourse studies have been conducted on speaking tests. However, most studies have investigated the behavior of candidates and interviewers in oral interviews (e.g. Ross & Berwick, 1992; Young, 1995a; Young & Milanovic, 1992), differences between interview behavior and conversation (e.g. Lazaraton, 1996; Young & He, 1998) and effects of test formats (e.g. Shohamy, 1994; O’Loughlin, 1995, 2001). Few studies have attempted to establish a link between the discourse (qualitative descriptions) and quantitative scores, with the exception of the work of Douglas and Selinker (1992, 1993) and Douglas (1994), where the relationship between test scores and actual performance was found to be weak. Later, Lazaraton (1998, as cited in Lazaraton, 2002, pp. 161–168) examined the relationship between candidates’ spoken performances and their assigned scores. Discourse analysis of transcriptions of 20 IELTS spoken performances at different bands (i.e. IELTS Bands 3 to 7) was undertaken to identify distinguishing features among the different proficiency levels. The results indicated inconsistencies in the features analyzed among the different levels, which led to researchers using operationally defined terms in rating scales and relevant refinements. In this regard, Young (1995b, p. 13) – in exploring rating scales from the perspective of second language acquisition – made the point that, if L2 proficiency is ‘architectural (modular)’ and ‘context dependent’, the relationship between scores assigned and the actual performances will be weak.
One exception has been the work of Fulcher (1996). In his study, 21 ELTS oral interviews with a range of Bands 4 to 9 were transcribed and coded into eight explanatory categories developed using Grounded Theory methodology (Strauss & Corbin, 1994). A discriminant analysis was subsequently conducted, with results revealing that the eight explanatory categories discriminated well among candidates. Candidates were assigned band scores predicted by those significantly discriminating explanatory categories. There was considerable agreement between the predicted band scores and actual band scores (21 out of 22 candidates were awarded the same band score). In Fulcher (1996), a wide range of fluency features clearly distinguished candidates at different band scores and provided an empirical relationship between candidates’ spoken performances and their quantitative scores.
In recent years, mixed methods combining studies of raters’ orientation and candidates’ performance have been applied to validate distinguishing features used in scoring L2 English speaking performance. Brown et al. (2005) first conducted a rater cognition study based on verbal reports from 10 expert judges’ evaluations of candidates’ performances without any guidance, so as to obtain features and categories identified as important by judges. Then, using these categories and features, the same candidates’ performances were analyzed. Features of global accuracy (grammatical accuracy), mean length of run (grammatical complexity), word tokens (vocabulary), word types (vocabulary), target-like syllables (pronunciation), number of unfilled pauses (fluency), total pause time (fluency) and speech rate (fluency) were found to show significant differences, with effect sizes at or above, or close to, the marginal level (also see Iwashita et al., 2008).
Such mixed methods were also applied by Brown (2006a, 2006b) to validate distinguishing features of the revised IELTS rating scales. Brown (2006a) used stimulated verbal reports as well as questionnaires to survey six expert examiners on their IELTS rating experiences and accuracy, in order to explore their interpretation and application of the scales. Brown (2006b) also conducted a discourse analysis of 20 interview samples at Bands 5 to 8 from operational IELTS administrations in a range of countries, in order to verify three categories in the speaking band descriptors, namely, Fluency and Coherence, Lexical Resources, and Grammatical Range and Accuracy. On the whole, most features increased in the expected direction over levels (e.g. the proportion of error-free utterances at higher bands was larger than that at lower bands), and all features contributed in some way to the assessment instead of one measure dominating the assessment. However, most of the features did not differ significantly across the four assigned score bands.
In the context of L2 Chinese, two relevant studies are noteworthy. Wang (2002) measured the spoken performances of 39 candidates at elementary and intermediate levels in terms of three features of pronunciation (the percentage of correct syllables to total syllables), grammar (mean length of error-free sentences) and fluency (ratio of speech rate to unfilled pauses). The three features correlated with teachers’ overall evaluations of the candidates’ performance in their courses, ranging from around 0.30 to 0.60. Zhu (2009) analyzed the spoken discourse of 27 male Korean candidates at elementary, intermediate and advanced levels, using features of vocabulary, grammar, fluency and coherence and found that candidates at the three proficiency levels showed differences in these features, with the most significant differences emerging on vocabulary and fluency.
In sum, previous L2 English studies have established a link between spoken performances (distinguishing features) and proficiency levels (scores). Empirical evidence has been presented to justify the use of distinguishing features in assessing L2 English speaking proficiency (e.g. Fulcher, 1996; Iwashita et al., 2008). With respect to L2 Chinese, Wang (2002) and Zhu (2009) have been among the pioneers analyzing L2 Chinese spoken performances. Wang (2002) focused on how the three measures of pronunciation, grammar and fluency might be related to the teachers’ overall evaluations of the candidates’ performance in their courses – not to the scores awarded to the test performance. Zhu (2009) examined whether differences in vocabulary, grammar, fluency and coherence existed among the elementary, intermediate and advanced levels. These two studies illustrate two points: first, a relationship exists between distinguishing features and speaking proficiency (Wang, 2002); second, distinguishing features differ significantly among candidate performances at different proficiency levels (Zhu, 2009). However, the relationship between the distinguishing features and the scores has still not been addressed. The current study therefore attempts to fill this gap.
Identification of distinguishing features
The identification of the distinguishing features of L2 English speaking performance was based on a review of the literature (Fulcher, 1996), cognitive studies by experts (Brown et al., 2005; Iwashita et al., 2008), and analyses of relevant documents (Brown, 2006a, 2006b). In scoring L2 Chinese speaking performance, little relevant literature and few cognitive studies have been documented (see Wang, 2002; Zhu, 2009). However, considerable progress in testing L2 Chinese has been made through the implementation of a series of L2 Chinese teaching and testing syllabi in mainland China since the 1990s (Zhang, 2009; also see Li, 2006). Based on these documents, L2 Chinese learners have been classified into three proficiency levels for teaching and testing purposes, that is, elementary, intermediate and advanced levels. The current study was designed to examine the spoken performances of advanced L2 Chinese learners, as they produced discourse long enough for analysis – although we accept that L2 Chinese learners at both elementary and intermediate levels should also be involved in future studies.
L2 Chinese teaching and testing syllabi with focuses on speaking proficiency were therefore analyzed to identify the distinguishing features widely used for teaching and testing L2 Chinese speaking, involving four sets of the most influential documents in mainland China:
Chinese Proficiency Scales and Syllabus of Graded Grammar (China National Office for Teaching Chinese as a Foreign Language, 1996, CPS hereafter) and Syllabus of Graded Words and Characters for Chinese Proficiency (Test Center in National Chinese Proficiency Test Committee Office, 2001, SGWC hereafter);
Chinese Language Proficiency Scales for Speakers of Other Languages (The Office of Chinese Language Council International, 2007, CLPS hereafter) and International Curriculum for Chinese Language Education (The Office of Chinese Language Council International, 2008, ICCLE hereafter);
Five-Band Holistic Scoring Standards for L2 Chinese Speaking in Test Syllabus for HSK-Advanced Level (Hanban/Confucius Institute Headquarters, 2009, HSS hereafter);
Spoken Chinese Proficiency Grading Standards and Testing Guidelines (Ministry of Education & State Language Commission, the People’s Republic of China, 2011, SCP hereafter) and The Graded Chinese Syllables, Characters and Words for the Application of Teaching Chinese to the Speakers of Other Languages (Ministry of Education & State Language Commission, the People’s Republic of China, 2010, GCSCW hereafter).
As shown in Table 1, seven distinguishing features under four categories of Pronunciation, Fluency, Vocabulary and Grammar were found to be widely employed in teaching and testing L2 Chinese speaking. The seven distinguishing features were target-like syllables, speech rate, pause time, word tokens, word types, grammatical accuracy and grammatical complexity, which comprised the distinguishing features analyzed in the current study.
Table 1. Distinguishing features in the four sets of documents
Note. √ = categories and/or distinguishing features used in the relevant documents.
Research questions
In examining the relationship between the distinguishing features and the scores in scoring L2 Chinese speaking performance, two research questions are addressed accordingly:
In what way does each of the distinguishing features relate to the scores?
How do the distinguishing features contribute to the scores?
Method
Measurement of distinguishing features
Pronunciation
The standard pronunciation of modern spoken Chinese, also known as Putonghua, is based on that of native speakers of the Beijing dialect (Norman, 1988; Sun, 2006; also see Kuo, 2007). This has also been adopted as the criterion for testing L2 Chinese pronunciation in mainland China, for example, in HSS and SCP. In examining the pronunciation of L2 Chinese learners, the syllable has been widely used as a basic unit of analysis, comprising an initial, a final and a tone (Mao & Ye, 2002). For example, the Chinese pronunciation of ‘I’ is ‘wǒ’. In the syllable ‘wǒ’, w is the initial, o is the final and ˇ is the tone. The intelligibility of all three parts (i.e. initial, final and tone) of a syllable (a target-like syllable) has further been applied to score the pronunciation of L2 Chinese learners (Mao & Ye, 2002; Wang, 2002; also see Ministry of Education & State Language Commission, the People’s Republic of China, 2011). The current study therefore measured target-like syllables per 10 syllables (see Iwashita et al., 2008; Wang, 2002) in the category of Pronunciation. In counting target-like syllables, tone sandhi (tonal alternations) and rhotacization (see Sun, 2006) in the spoken performances of L2 Chinese learners were also considered.
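The per-10-syllables measure described above can be illustrated with a minimal sketch. It assumes that a coder has already judged each part of every syllable (with tone sandhi and rhotacization taken into account); the `Syllable` class and its field names are hypothetical and are not taken from the study's coding scheme.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    # Hypothetical coder judgments: True if that part was intelligible.
    initial_ok: bool
    final_ok: bool
    tone_ok: bool


def is_target_like(s: Syllable) -> bool:
    # A syllable counts as target-like only if all three parts are intelligible.
    return s.initial_ok and s.final_ok and s.tone_ok


def target_like_per_10(syllables: list) -> float:
    # Number of target-like syllables, normalized to a base of 10 syllables.
    if not syllables:
        raise ValueError("at least one syllable is required")
    hits = sum(1 for s in syllables if is_target_like(s))
    return 10 * hits / len(syllables)


# Toy sample: 18 fully intelligible syllables and 2 with a tone problem.
sample = [Syllable(True, True, True)] * 18 + [Syllable(True, True, False)] * 2
print(target_like_per_10(sample))  # 9.0
```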
Fluency
In assessing the Fluency of L2 Chinese learners, a number of features have been employed (see Guo, 2007; Zhai, 2011), which are consistent overall with those for L2 English learners (Brown, 2006b; Fulcher, 1996; Iwashita et al., 2008). Given the feasibility of the study, two widely accepted features of speech rate and pause time were included as distinguishing features under the category of Fluency in the current study. Specifically, speech rate was calculated by counting the number of syllables (excluding features of repair), divided by total speech in seconds (excluding pauses of three or more seconds); pause time was calculated as the total time of unfilled pauses of 1 or more seconds, divided by total speech duration in seconds (both excluding pauses of three or more seconds) (Guo, 2007; Iwashita et al., 2008; Mehnert, 1998; Zhai, 2011).
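As a rough illustration of the two fluency measures, the sketch below assumes that each sample comes with a total duration, a syllable count, a count of repair syllables and a list of unfilled pause durations in seconds. Treating "excluding pauses of three or more seconds" as subtracting those pauses from the duration is an interpretation made for this sketch, and all function and parameter names are hypothetical.

```python
def effective_duration(total_seconds: float, pauses: list) -> float:
    # Assumption: pauses of 3 s or more are removed from the speech duration.
    return total_seconds - sum(p for p in pauses if p >= 3.0)


def speech_rate(n_syllables: int, n_repair_syllables: int,
                total_seconds: float, pauses: list) -> float:
    # Syllables per second, with repair syllables excluded from the count.
    return (n_syllables - n_repair_syllables) / effective_duration(total_seconds, pauses)


def pause_time(total_seconds: float, pauses: list) -> float:
    # Proportion of the effective duration taken up by unfilled pauses of
    # at least 1 s; pauses of 3 s or more are left out of both terms.
    counted = sum(p for p in pauses if 1.0 <= p < 3.0)
    return counted / effective_duration(total_seconds, pauses)


# Toy sample: 100 s of speech, 200 syllables (8 in repairs), four pauses.
pauses = [0.5, 1.5, 2.0, 4.0]
print(speech_rate(200, 8, 100.0, pauses))  # 2.0 syllables per second
print(pause_time(100.0, pauses))
```

In this toy case the 4-second pause is dropped from the duration (100 s becomes 96 s), and only the 1.5 s and 2.0 s pauses count towards pause time.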
Vocabulary
Word tokens and word types have been widely accepted as effective measures for assessing the Vocabulary of L2 spoken English (Iwashita et al., 2008) and L2 spoken Chinese (Zhu, 2009). However, counting word tokens and word types in Chinese is more challenging than in English, since written Chinese, unlike English, has no spaces between words (see Sun, 2006). For segmenting Chinese words, systematic segmentation specifications have been established and applied (see Guo, 2011; Liu, Tan & Shen, 1994; Yu, Duan, Zhu & Sun, 2002) – for example, Contemporary Chinese Language Word Segmentation Specification for Information Processing (GB13715 National Standards of the People’s Republic of China; see Liu, Tan & Shen, 1994). Based on these specifications, Chinese words in the current study were first segmented. Subsequently, word tokens and word types were counted in terms of segmented words. Consider the example of segmented Chinese words below; there are 5 word tokens, and 4 word types.
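Once a transcript has been segmented, counting tokens and types is straightforward. The sketch below assumes that the input is a list of already segmented words with punctuation removed; the five-word sequence is a made-up illustration and is not the example from the original article.

```python
def count_tokens_and_types(segmented_words: list) -> tuple:
    # Tokens: every running word; types: distinct word forms.
    tokens = len(segmented_words)
    types = len(set(segmented_words))
    return tokens, types


# Hypothetical segmented utterance: "我 / 爱 / 我 / 的 / 家" ("I love my home").
words = ["我", "爱", "我", "的", "家"]
print(count_tokens_and_types(words))  # (5, 4): "我" occurs twice
```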
Grammar
In the category of Grammar, two distinguishing features, grammatical accuracy and grammatical complexity, were adopted (Wang, 2002; Zhu, 2009). The T-unit concept (Hunt, 1970), however, does not work in Chinese. As Shi (2002), Wang (2002) and Zhu (2009) note, in L2 Chinese the ‘sentence’ needs to be taken as the basic unit for analyzing grammatical accuracy and grammatical complexity. Operationally, the percentage of error-free sentences and the mean length (number of syllables) of sentences were used to measure grammatical accuracy and grammatical complexity, respectively (see Zhu, 2009). An ‘error’ is defined as a form that native speakers would in all likelihood not use in the same context (Lennon, 1991; also see Ellis & Barkhuizen, 2005). Such a definition has been widely employed for identifying and classifying grammatical errors of L2 Chinese learners (Lu, 1994; Wang, 2011; Zhao, 2002; Zhu & Zhou, 2007).
Taking Chinese sentences as the units of analysis for speech samples still poses the challenge of identifying sentence boundaries (S. Guo, personal communication; also see Liu & Hakkani-Tür, 2011; Zong & Ren, 2003). In breaking up the L2 Chinese spoken performances into sentences, two complementary criteria, (a) semantic and syntactic completeness within context and (b) sentence intonation and pause duration (see the review of definitions of a Chinese sentence by Tang, 2010), were used in conducting manual annotations after consultations with L2 Chinese corpus linguists. The current study thus applied these two complementary criteria for sentence segmentation. The segmented sentences were subsequently analyzed in terms of the percentage of error-free sentences and the mean sentence length.
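The two grammar measures can be sketched as follows, assuming that sentence segmentation and error annotation have already been done by hand. Each sentence is represented here as a hypothetical pair of (number of syllables, whether it contains an error); this data layout is an assumption made for illustration only.

```python
def grammatical_accuracy(sentences: list) -> float:
    # Percentage of sentences with no annotated error.
    error_free = sum(1 for _, has_error in sentences if not has_error)
    return 100 * error_free / len(sentences)


def grammatical_complexity(sentences: list) -> float:
    # Mean sentence length in syllables.
    return sum(n_syllables for n_syllables, _ in sentences) / len(sentences)


# Toy sample of four annotated sentences: (syllable count, has_error).
annotated = [(12, False), (8, True), (10, False), (10, False)]
print(grammatical_accuracy(annotated))    # 75.0
print(grammatical_complexity(annotated))  # 10.0
```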
Instrument
Speaking test
To elicit spoken performances of advanced L2 Chinese learners, a speaking test was developed for the current study. The purpose of the test was to assess the overall communicative ability in L2 spoken Chinese for advanced learners within an academic context (college life). As specified in CLPS, the communicative ability in L2 spoken Chinese of advanced learners was defined as:
Able to understand formal or informal conversation[1] or speech[2] on a wide variety of occasions including discussions about one’s work or study, able to comprehend the main points, grasp the basic details, and find out the speakers’ aims and intentions.[3] Able to make oneself understood and communicate effectively with others on concrete or abstract topics[4] and able to give a description[5] or argumentation[6] on a topic that one is interested in, expressing oneself clearly and coherently with appropriate details. (The Office of Chinese Language Council International, 2007, p. 2; six key points added with superscripts from [1] to [6])
As shown in Table 2, the speaking test consisted of three speaking tasks (Task I, Task II and Task III), each including both integrated and independent tasks, to reflect the six key points from [1] to [6] as described in the CLPS. The task design and task materials were both reviewed and revised by language testing experts and experienced L2 Chinese speaking teachers (see Appendix 1 for an English-translated version).
Table 2. The speaking test
Scoring method
Given that the objective of the current study is to investigate the relationship between the distinguishing features and the scores, the scoring rubric focused only on degrees of communicative effectiveness and task fulfillment, without addressing any specific distinguishing features (see Kim, 2009), in order to provide a broader score interpretation of communicative ability in L2 spoken Chinese among advanced learners. Regarding the number of levels employed for the scoring rubric, scoring consistency and practicality were taken into account, as suggested by Bachman and Palmer (2010). Five levels have been used to score the speaking performance of advanced L2 Chinese learners in the Chinese Proficiency Test since the 1990s (Liu, 1997; also see HSS), a state of affairs which has shaped the scoring experience of the teachers and raters involved in the current study. Therefore, a five-level scoring rubric was developed for the scoring (see Appendix 2). Based on the five-level scoring rubric, raters participating in the current study were invited to use holistic scoring by assigning single whole levels to candidates’ spoken performances on all three tasks.
Participants
Candidates
The candidates for the speaking test comprised 66 students from L2 Chinese speaking courses for advanced level learners at a comprehensive university in Shanghai, China. The 66 candidates comprised 26 males and 40 females, with South Korean and Japanese candidates accounting for 50% and 24%, respectively. The remaining candidates (26%) came from Brazil, Cambodia, Cameroon, Canada, Colombia, France, Indonesia, Iran, Myanmar, New Zealand, Poland, Singapore, Turkmenistan, the United Kingdom and the United States. The country distribution of the candidates was broadly representative of the population of international students in China, with the two largest proportions from South Korea and Japan (see Cheng, 2009). All 66 candidates responded to the three tasks, with approximately 9 minutes of speech being obtained for each candidate. The spoken performances of the 66 candidates were recorded, with the quality of the recorded files subsequently examined by the researchers. All recorded files were audible for further analysis, constituting the 66 speech samples for the current study.
Raters
Two raters participated in the scoring of the speech data. The two raters had taught and/or tested L2 Chinese at tertiary institutions for at least three years, with extensive experience in testing L2 Chinese speaking, including designing speaking tasks, constructing rating scales, developing test specifications and scoring speaking performance. The two raters employed holistic scoring, without focusing on analytic aspects of the L2 Chinese speaking construct. The scoring had three stages: familiarization, scoring, and adjudication. At the first stage, the two raters listened to all spoken performances of the 66 candidates to become familiar with each candidate’s performance. At the second stage, the two raters scored each candidate independently as follows: a three-level range among the five levels was first identified while listening to Task I; the three-level range was narrowed to two adjacent levels while listening to Task II; a final single level was determined while listening to Task III (see Alderson, 1991). At the final stage, the scores of the two raters were compared and analyzed, with a high rater agreement proportion of 87.88% being obtained. The two raters disagreed on the scores of eight candidates. They subsequently discussed the spoken performances of these eight candidates and agreed upon final scores. These scores were used in the current study to reflect the candidates’ overall communicative ability in L2 spoken Chinese.
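The reported agreement proportion follows directly from the counts given above: 8 disagreements among 66 candidates leave 58 exact matches. A minimal sketch, using invented score lists that only reproduce those counts:

```python
def exact_agreement(rater1: list, rater2: list) -> float:
    # Percentage of candidates given the same level by both raters.
    if len(rater1) != len(rater2):
        raise ValueError("both raters must score the same candidates")
    matches = sum(1 for a, b in zip(rater1, rater2) if a == b)
    return 100 * matches / len(rater1)


# Invented scores: 66 candidates, the raters differ on the last 8.
r1 = [3] * 66
r2 = [3] * 58 + [4] * 8
print(round(exact_agreement(r1, r2), 2))  # 87.88
```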
Coding
To recap, seven distinguishing features in four categories representing the L2 Chinese speaking construct were identified: Pronunciation (target-like syllables), Fluency (speech rate and pause time), Vocabulary (word tokens and word types) and Grammar (grammatical accuracy and grammatical complexity). In coding these seven distinguishing features, the speech samples were first transcribed into orthographic Chinese characters (see Li & Zu, 2007), followed by an additional review to ensure transcription accuracy. Two coders (Coder 1 and Coder 2) participated in the coding work based on the transcription. Both coders had postgraduate degrees in applied Chinese linguistics, proper training in Chinese corpus linguistics and lengthy experience in constructing and working with L2 Chinese corpora.
The coding work comprised three stages. At the first stage, the two coders independently carried out a trial coding of a small portion of the speech samples (i.e. six out of the 66 samples) as a basis for discussion. Coding schemes were then developed based on the discussion between the two coders, and an L2 Chinese corpus linguist was consulted wherever difficulties arose. Software was also developed to assist in counting measures of speech rate and pause time. At the second stage, based on the agreed coding schemes, Coder 1 coded the remaining 60 speech samples, after which Coder 2 reviewed the coding results of Coder 1. For each of the 60 speech samples, a disagreement proportion of 10% was set as the upper bound for determining whether a third coding by the corpus linguist was necessary. For target-like syllables and error-free sentences (for grammatical accuracy), one sample each exceeded the upper bound (i.e. 10%), and a third coding was undertaken. Regarding word segmentation, no samples needed a third coding, owing to the established specifications (see Guo, 2011; Liu, Tan & Shen, 1994; Yu, Duan, Zhu & Sun, 2002). Regarding sentence segmentation, three samples were coded a third time. At the third stage, the two coders conducted a final calculation of the coding results. For each of the seven distinguishing features, the coding results of candidates’ spoken performances on the three speaking tasks were then averaged, yielding the quantitative data of the current study.
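The decision rule for a third coding and the final averaging step can be sketched as below. The sketch assumes that disagreement is counted per coded unit (for example, per syllable or per sentence) within a sample and that only proportions strictly above 10% trigger a third coding; neither detail is spelled out in the text, and the function names are hypothetical.

```python
def needs_third_coding(n_units: int, n_disagreements: int,
                       upper_bound: float = 0.10) -> bool:
    # Assumption: a sample goes to the corpus linguist only when the share
    # of units on which the two coders disagree exceeds the upper bound.
    return n_disagreements / n_units > upper_bound


def average_over_tasks(task_values: list) -> float:
    # One value per speaking task for a given feature, averaged per candidate.
    return sum(task_values) / len(task_values)


print(needs_third_coding(200, 21))           # True: 10.5% disagreement
print(needs_third_coding(200, 20))           # False: exactly 10%
print(average_over_tasks([2.0, 3.0, 4.0]))   # 3.0
```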
Statistical analysis
All statistical analyses in the current study were performed using PASW Statistics 18 (SPSS Inc., Chicago, IL, USA). Descriptive statistics of the scores and the seven distinguishing features are presented in Table 3, indicating that the data of the current study were normally distributed overall, although the distribution of speech rate warranted some caution. The decision not to transform the speech rate data was made for two reasons: on the one hand, parametric statistics are robust to violations of the assumption of normal distribution; on the other hand, transformed speech rate data are difficult to interpret (Tabachnick & Fidell, 2007; also see Iwashita et al., 2008; Larson-Hall, 2010).
Table 3. Descriptive statistics
Note. Multiple modes exist. The smallest value is shown.
To address the two research questions, correlations and standard multiple regression were employed. In the regression analysis, correlations between the seven distinguishing features were first examined, as multicollinearity appears when strong correlations exist between predictors (Field, 2009). A high correlation of 0.92 between word tokens and word types was found, although no other correlations between predictors were larger than 0.60 (see Table 4). Therefore, the scores were regressed on word tokens and on word types separately, each together with the five other distinguishing features, in two regressions (Regression I and II hereafter). As such, two regression analyses were carried out with six predictors each for a sample size of 66, meeting the requirement of at least 10 cases for each predictor (see Larson-Hall, 2010). Assumptions including normality, linearity, homoscedasticity and independence of residuals, outliers and multicollinearity were evaluated subsequent to the regression analysis (Tabachnick & Fidell, 2007).
Table 4. Correlation matrix
Note. ** Correlation is significant at the 0.01 level (2-tailed); * Correlation is significant at the 0.05 level (2-tailed).
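The screening step described above, in which pairwise correlations between predictors are checked before the predictor sets are chosen, can be sketched in plain Python. The cut-off of 0.80 is an assumption made for this illustration (the text only reports 0.92 as high and all other values as no larger than 0.60), and the data are invented.

```python
from math import sqrt


def pearson_r(x: list, y: list) -> float:
    # Pearson product-moment correlation between two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(predictors: dict, threshold: float = 0.80) -> list:
    # Return predictor pairs whose |r| reaches the (assumed) threshold.
    names = list(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson_r(predictors[a], predictors[b])) >= threshold:
                flagged.append((a, b))
    return flagged


# Invented data for three of the predictors.
data = {
    "word_tokens": [10, 20, 30, 40, 50],
    "word_types": [6, 11, 14, 21, 24],
    "speech_rate": [3, 1, 4, 1, 5],
}
print(collinear_pairs(data))  # [('word_tokens', 'word_types')]
print(66 / 6 >= 10)           # True: 11 cases per predictor in each regression
```

A flagged pair is then kept apart, with each member entered into its own regression alongside the remaining predictors.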
Results
Correlations
A correlation matrix is reported in Table 4 (Pearson’s r correlation coefficients). Results showed that each of the seven distinguishing features was significantly correlated with the scores, with r values ranging from 0.30 to 0.83 (p values < 0.05). Specifically, speech rate, word tokens and word types showed large effect sizes (|r| ≥ 0.50) in their correlations with the scores, whereas target-like syllables, pause time, grammatical accuracy and grammatical complexity showed medium effect sizes (0.30 ≤ |r| ≤ 0.49) (Cohen, 1988), suggesting that there was a strong or moderate relationship between each of the seven distinguishing features and the scores.
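The effect-size bands used above can be written as a small helper. The large and medium cut-offs are those stated in the text; the small band (|r| ≥ 0.10) follows Cohen's (1988) convention, and the label "negligible" for anything below it is a naming choice made for this sketch.

```python
def effect_size_label(r: float) -> str:
    # Classify a correlation coefficient by its absolute size.
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"


# The two end points of the reported range of r values.
print(effect_size_label(0.83))  # large
print(effect_size_label(0.30))  # medium
```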
Standard multiple regression
Two standard multiple regressions were carried out in which six distinguishing features (incorporating word tokens and word types, respectively) were regressed against the scores. As shown in Table 5 and Table 6, total R2 = 0.79 in Regression I and R2 = 0.77 in Regression II, indicating that 79% and 77% of the variance in the scores could be explained by the six distinguishing features incorporating word tokens and word types respectively (F6,59 = 37.78, p < 0.001; F6,59 = 32.91, p < 0.001). Regarding the individual predictors, target-like syllables, grammatical accuracy and word tokens/word types were both significant (in bold) in the two regressions, with pause time being only significant in Regression II. 3 Partial correlations are also presented in column 5 of both Table 5 and Table 6, reflecting the unique relationship between each distinguishing feature and the scores in the two regressions excluding overlap with other distinguishing features (Field, 2009; Kang, Rubin, and Pickering, 2010).
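The link between R² and the overall F test can be checked with the standard formula F = (R² / k) / ((1 − R²) / (n − k − 1)), where k is the number of predictors and n the number of cases. The sketch below applies it to the rounded R² values reported above, so its results only approximate the reported F statistics.

```python
def f_statistic(r_squared: float, n_cases: int, n_predictors: int) -> tuple:
    # Overall F test of a multiple regression, computed from R squared.
    df_model = n_predictors
    df_resid = n_cases - n_predictors - 1
    f = (r_squared / df_model) / ((1 - r_squared) / df_resid)
    return f, df_model, df_resid


for label, r2 in (("Regression I", 0.79), ("Regression II", 0.77)):
    f, df1, df2 = f_statistic(r2, n_cases=66, n_predictors=6)
    print(f"{label}: F({df1}, {df2}) = {f:.2f}")
```

With 66 cases and 6 predictors the degrees of freedom are 6 and 59, matching those reported. The F values from the rounded R² figures come out at roughly 37.0 and 32.9, close to the reported 37.78 and 32.91; the remaining gap is presumably due to rounding of R².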
Regression I (incorporating word tokens)
Regression II (incorporating word types)
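The relation between the reported R² values and the F statistics can be checked with the standard formula F = (R²/k) / ((1 − R²)/(n − k − 1)), where k is the number of predictors and n the number of cases. The following sketch is illustrative only and uses a hypothetical function name. Since it starts from R² values rounded to two decimals, it can only approximate the reported F values.

```python
def f_from_r2(r2, n_cases, n_predictors):
    """Overall F statistic and degrees of freedom implied by R-squared."""
    df_model = n_predictors
    df_resid = n_cases - n_predictors - 1
    f_value = (r2 / df_model) / ((1 - r2) / df_resid)
    return f_value, df_model, df_resid


# Values from the text: 66 candidates, six predictors, R^2 = 0.79 and 0.77.
f_one, df1_one, df2_one = f_from_r2(0.79, 66, 6)
f_two, df1_two, df2_two = f_from_r2(0.77, 66, 6)
```

The degrees of freedom come out as 6 and 59, matching those reported. The F values computed from the rounded R² values are approximately 36.99 and 32.92, close to the reported 37.78 and 32.91, with the remaining gap attributable to rounding of R².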
Evaluation of assumptions
The assumptions of the two regressions were evaluated from four perspectives. First, the histograms of the residuals for Regression I and Regression II were both approximately normally distributed (see Appendix 3), satisfying the assumption of normality of residuals. Second, in scatterplots of standardized residuals against standardized predicted values, the dots were randomly and evenly scattered for both Regression I and Regression II (see Appendix 3), indicating that the assumptions of linearity and homoscedasticity were met (Field, 2009). The same scatterplots showed no standardized residuals above +3.0 or below −3.0, so no outliers were identified (Larson-Hall, 2010). Third, tolerance values were all well above 0.10 and VIF (Variance Inflation Factor) values were all below 10 (see the last two columns of Table 5 and Table 6), so there was no cause for concern regarding multicollinearity in the two regressions (Cohen, Cohen, West, & Aiken, 2003). Fourth, the assumption of independence of residuals is met if the Durbin-Watson statistic is close to 2 (between 1 and 3) (Cohen et al., 2003; Field, 2009), which was the case for both Regression I and Regression II (see the Durbin-Watson statistics in Table 5 and Table 6).
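The third and fourth checks can be sketched numerically. The example below is a minimal illustration using NumPy and synthetic data. It does not reproduce the study's data or software, and all function and variable names are hypothetical. Tolerance for a predictor is computed as 1 − R² from regressing that predictor on the remaining predictors, VIF is its reciprocal, and the Durbin-Watson statistic is the sum of squared successive residual differences divided by the sum of squared residuals.

```python
import numpy as np


def ols_residuals(X, y):
    """Residuals from an ordinary least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


def tolerance_and_vif(X):
    """Return (tolerance, VIF) for each column of the predictor matrix X."""
    results = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        resid = ols_residuals(others, X[:, j])
        centered = X[:, j] - X[:, j].mean()
        # Residual SS / total SS equals 1 - R_j^2, i.e. the tolerance.
        tolerance = np.sum(resid**2) / np.sum(centered**2)
        results.append((tolerance, 1.0 / tolerance))
    return results


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest independent residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals**2)


# Synthetic data only (hypothetical, NOT the study's data): 66 cases, 6 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(66, 6))
y = X @ np.array([0.5, 0.1, -0.2, 0.6, 0.4, 0.1]) + rng.normal(size=66)

collinearity = tolerance_and_vif(X)
dw = durbin_watson(ols_residuals(X, y))

# Thresholds as stated in the text: tolerance > 0.10, VIF < 10, 1 < DW < 3.
no_collinearity_concern = all(tol > 0.10 and vif < 10 for tol, vif in collinearity)
residuals_look_independent = 1 < dw < 3
```

With real data, `X` would hold the six predictors of one regression and `y` the holistic scores; the thresholds are those cited in the paragraph above.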
Discussion
In the present study, speech samples of 66 candidates were analyzed in terms of target-like syllables, speech rate, pause time, word tokens, word types, grammatical accuracy and grammatical complexity. Correlations were first carried out to examine the relationships between the seven distinguishing features and the scores, with large or medium effect sizes in correlations being found. Two regression analyses were subsequently applied to investigate the contribution of the distinguishing features to the scores, with 79% and 77% of the variance in the scores being accounted for in Regression I and Regression II respectively. Results will now be further discussed with regard to the two research questions.
In what way does each of the distinguishing features relate to the scores?
Large or medium correlations emerged between the seven distinguishing features and the scores, which reflect the overall communicative ability in L2 spoken Chinese. The wide coverage of the seven distinguishing features under four major categories in the L2 speaking construct (i.e. Pronunciation, Fluency, Vocabulary and Grammar) therefore provides empirical evidence to validate the use of these distinguishing features in scoring L2 Chinese speaking performance. A link has thus been established between test scores and construct-relevant features of candidate performance, with scores having rich construct-relevant scale meaning (Fulcher, 1996, 2003).
Even though there were strong or moderate relationships between the distinguishing features and the scores, the effect sizes of the correlations varied, for example, from word tokens/word types (0.83) to grammatical complexity (0.30). Further, apart from word tokens/word types, the effect sizes of the other distinguishing features were smaller than or close to 0.50, the threshold for a large effect. These findings concur with those of Wang (2002), where the correlations between three features of pronunciation, grammar and fluency and teachers’ overall evaluations were around 0.30 to 0.60. This might be attributed, as Iwashita et al. (2008) suggest, to the indistinction present at adjacent levels – a situation which occurred in the current study. Consider word tokens, for example. The issue of indistinction is shown in Figure 1. The histograms represent the candidates at different score levels, with the y-axis and the x-axis indicating the number of candidates and the values of word tokens respectively. Although the values of word tokens demonstrated an overall increase from Level 1 to Level 5, level differences were not distinct between adjacent score levels, i.e. Level 1 and Level 2, Level 2 and Level 3, Level 3 and Level 4, and Level 4 and Level 5, as the highlighted areas in Figure 1 illustrate.

Measures of word tokens at the five score levels
Iwashita et al. (2008) proposed one possible explanation for the indistinction between adjacent levels. They suggested that level distinctions might be more clear-cut if fewer levels were used – for example, using three levels rather than five. However, findings of other studies regarding the use of fewer levels have not been encouraging. Brown (2006b) investigated candidates’ spoken performances at four levels; most of the measures, however, did not differ significantly at the four levels. Zhu (2009) analyzed spoken performances of candidates using fewer levels, that is, only three levels; however, significant differences were still not found between any two adjacent levels for certain measures of features, such as the mean sentence length.
The scoring data of certified raters also support the existence of indistinction between adjacent levels. In Carey, Mannell and Dunn (2011), 99 IELTS examiners were invited to score three speaking candidates, using the pronunciation subscale with four bands of 2, 4, 6 and 8 (four levels). The distribution of the band scores (N = 297) revealed that 35% and 58% of raters awarded adjacent scores (i.e. Bands 4 and 6), with only 3% and 4% of raters representing the other levels (i.e. Bands 2 and 8). Most of the raters (93%) thus assigned a band score at one of the two adjacent levels, Bands 4 and 6. Therefore, it might be the case that indistinction is inherent between adjacent levels regardless of the number of levels used.
How do the distinguishing features contribute to the scores?
It was reported that 79% and 77% of the variance in the scores could be attributed to the combined contribution of the six distinguishing features incorporating word tokens and word types respectively in the two regressions. In Regression I, three significant predictors (target-like syllables, word tokens and grammatical accuracy) were identified as making a unique contribution according to their partial correlations with the scores, that is, 0.33, 0.59 and 0.42. In Regression II, four significant predictors (target-like syllables, pause time, word types and grammatical accuracy) were found to make a unique contribution, with partial correlations of 0.35, −0.30, 0.52 and 0.26 (see the fifth columns of Table 5 and Table 6 respectively).
Three predictors – target-like syllables, word tokens/word types and grammatical accuracy, significant in both Regression I and Regression II – also represent the three major categories of Pronunciation, Vocabulary and Grammar in assessing L2 Chinese speaking proficiency. Such findings are consistent with teaching and testing practices prescribed by L2 Chinese official documents. For example, as for L2 Chinese teaching, the Office of Chinese Language Council International (2008) defines L2 Chinese linguistic knowledge as having six categories in ICCLE, namely, Pronunciation, Vocabulary, Grammar, Function, Theme and Discourse, among which Pronunciation, Vocabulary and Grammar are considered to be the three major categories. Regarding L2 Chinese testing, both the Hanban/Confucius Institute Headquarters (2009) and Ministry of Education and State Language Commission of the People’s Republic of China (2011) include Pronunciation, Vocabulary and Grammar as the three major components in assessing L2 Chinese speaking proficiency. The predictor of pause time, however, was only significant in Regression II, when it was entered together with word types. One potential explanation might be that word tokens are more sensitive to pause time than word types are, as word tokens measure more directly the number of words that candidates spoke – the more words candidates spoke, the less pause time candidates had. If so, word tokens may have absorbed more of the variance that pause time shares with the scores in Regression I. This is illustrated in Table 4, where the correlation between pause time and word tokens (−0.60) was slightly stronger than that between pause time and word types (−0.53).
It is also worth comparing the correlations (Table 4) with the partial correlations in Regression I and Regression II (Table 5 and Table 6). Decreases were identified among all of the relevant distinguishing features. Speech rate, for example, had a correlation of 0.52 with the scores (Table 4), but its partial correlation decreased substantially to 0.12 in Regression I (Table 5) and to 0.19 in Regression II (Table 6). One explanation for this is the possible overlap between the distinguishing features in different categories: when features share variance with one another, the partial correlations, which reflect the unique contribution of each distinguishing feature, decrease. The overlap between categories or between features has also been corroborated by studies into rater perception regarding the distinctiveness of different categories (see Brown, 2006a; Xi, 2007). In Brown (2006a), both questionnaire data and verbal report data of six experienced IELTS examiners revealed a degree of overlap between the four analytic scales of Fluency and Coherence, Grammatical Range and Accuracy, Lexical Resources and Pronunciation. In Xi (2007), overlap between the three dimensions of Delivery, Language Use and Topic Development was also identified in raters’ questionnaire responses: 12 out of 14 raters perceived an overlap both between Delivery and Language Use and between Language Use and Topic Development, and 9 out of 14 raters indicated an overlap between Delivery and Topic Development. Overlap between distinguishing features of different categories might therefore be inherent in the scoring of L2 speaking performance.
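Why overlap between features lowers a partial correlation can be shown with a small simulation. The sketch below is illustrative only and uses synthetic data and hypothetical names, not the study's data. It uses the residual-based definition of a partial correlation: the correlation between a predictor and the outcome after the other predictors have been regressed out of both.

```python
import numpy as np


def residualize(target, controls):
    """Residuals of target after regressing it on the controls plus an intercept."""
    design = np.column_stack([np.ones(len(target)), controls])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def partial_correlation(x, y, controls):
    """Correlation between x and y with the control variables partialled out."""
    return np.corrcoef(residualize(x, controls), residualize(y, controls))[0, 1]


# Synthetic illustration (hypothetical, NOT the study's data): the feature and
# the score are related only through a second feature they both overlap with.
rng = np.random.default_rng(0)
n = 1000
overlapping_feature = rng.normal(size=n)
feature = overlapping_feature + rng.normal(size=n)
score = overlapping_feature + rng.normal(size=n)

zero_order = np.corrcoef(feature, score)[0, 1]
partial = partial_correlation(feature, score, overlapping_feature.reshape(-1, 1))
```

In this construction the zero-order correlation is moderate, while the partial correlation is close to zero once the overlapping feature is controlled for. The drop for speech rate between Table 4 and Tables 5 and 6 follows the same pattern.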
Implications
To recap, the current study has established a link between the distinguishing features of candidate performance and the scores representing L2 Chinese speaking proficiency. There are two implications for scoring L2 Chinese speaking performance. First and foremost, distinguishing features of candidate performance provide empirical evidence for constructing data-based rating scales (Fulcher, 1996). Previous work has found that rating scales for scoring L2 Chinese speaking performance have been constructed mainly on the basis of the interpretation and perception of experts (see Liu, 1997). The salient features, attended to by experts, teachers and raters to distinguish candidates, are consequently included in the construction of rating scales, with few studies of candidate performance being conducted to validate the features used. To this end, ‘thick’ description and further analysis of distinguishing features (see Fulcher, 1996, 2003) can be drawn on to develop data-based rating scales to score L2 Chinese speaking performances. Furthermore, the distinguishing features reported in the current study can also provide empirical evidence for developing automated scoring of L2 Chinese speaking performance in future. 4
In addition, the indistinction at adjacent levels and overlap between features of different categories might exist inherently when complex spoken performances of candidates are quantified to an exact score. These two observations (i.e. the indistinction and overlap) 5 have also been confirmed by both cognitive studies of rater perception and discourse analysis of candidate performance (e.g. Brown, 2006a, 2006b; Brown et al., 2005; Iwashita et al., 2008). However, in which ways the indistinction and overlap may invalidate the scoring of L2 Chinese speaking performance is a different question from that being pursued in the current study. A re-examination of the current scoring practices regarding raters’ assigning an exact score to a candidate’s performance may need to be considered (see Jin, Mak, & Zhou, 2012).
Limitations
Limitations of the current study rest in part on the fact that the three speaking tasks are all monologues; dialogues or interviews have not been included. Further, the selection of only quantifiable features and advanced L2 Chinese learners for analysis poses another limitation, resulting in, for example, a lack of measures for showing content quality in speaking performance of elementary and intermediate L2 Chinese learners. Last but not least, other variables (e.g. type–token ratio, 6 lexical density and tone error rate) that potentially affect rater decisions about the scores and the difference between speaking tasks in terms of the distinguishing features warrant attention. In future studies, it is proposed that interviews, dialogues and group discussions should be used for eliciting speaking performance of L2 Chinese learners at different proficiency levels, with more comprehensive distinguishing features being analyzed both quantitatively and qualitatively. Moreover, investigations into the rater decision-making process and into task difference in terms of distinguishing features may subsequently be conducted in order to see how these compare with the results here so as to contribute to the external validity of distinguishing features reported by the current study.
Conclusion
Since the Second World War, distinguishing features in scoring speaking performance have been the subject of extensive investigation, with the concept developing accordingly from ‘native speakers’, through ‘can do statements’, to ‘features of language use’ (Brown & Taylor, 2006; Davies, 2008; Fulcher, 2008). In recent years, the focus has been on analyzing candidate performance to validate distinguishing features used in L2 English speaking tests, for example, Brown (2006b) and Iwashita et al. (2008). However, few studies have attempted to analyze candidate performance with a view towards considering how distinguishing features work in scoring L2 Chinese speaking performance. The current study was undertaken for this purpose.
In the current study, the speech samples of the 66 candidates were analyzed using seven distinguishing features representing the L2 Chinese speaking construct. Correlations and standard multiple regression were conducted to address the two research questions respectively. Results demonstrated that, first, each of the seven distinguishing features was significantly correlated with the scores, with large or medium effect sizes; second, 79% and 77% of the variance in the scores could, in concert, be explained by the distinguishing features involved in the two regression analyses respectively. Findings from the current study thereby provide empirical evidence to validate the assessment of L2 Chinese speaking proficiency in both teaching and testing contexts, as demonstrated by four sets of official teaching and testing documents, namely CPS/SGWC (1996/2001), CLPS/ICCLE (2007/2008), HSS (2009) and SCP/GCSCW (2011/2010).
The results of the current study concur closely with other studies of distinguishing features in scoring L2 English speaking performance (e.g. Brown, 2006a, 2006b; Brown et al., 2005; Iwashita et al., 2008). Overall, the distinguishing features of candidate performance are closely linked to the scores, although indistinction at adjacent levels and overlap between features of different categories may exist inherently (see Jin, Mak, & Zhou, 2012). The study also has implications for the construction of data-based rating scales to score L2 Chinese speaking performance. The description and analysis of the speech data reported in the current study could be further explored to develop data-based rating scales (see Fulcher, 1996, 2003), which is an underexplored area in the context of assessing L2 Chinese speaking.
